Summary:
Inverse probability weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudo-population in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse probability weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at a nearly $n^{-1/3}$ rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse probability weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.
Keywords: Adaptive estimation, Causal inference, Efficient influence function, Semiparametric efficiency
1. Introduction
Inverse probability weighted (IPW) estimators have been widely used in a diversity of fields, as inverse probability weighting allows for the adjustment of selection biases by the assignment of weights (i.e., based on propensity scores) to observational units such that a pseudo-population mimicking the target population is generated. The construction of IPW estimators is relatively straightforward, as the only nuisance parameter that must be estimated is the propensity score. Owing in part to the ease with which IPW estimators may be constructed, their application has been frequent in causal inference (e.g., Robins et al., 2000), missing data (e.g., Robins et al., 1994), and survival analysis (e.g., Tsiatis, 2007).
While inverse probability weighting may easily be implemented and is appropriate for use in a variety of problem settings, the resultant estimators face several disadvantages. Unfortunately, IPW estimators require a correctly specified estimate of the propensity score to produce consistent estimates of the target parameter and can be inefficient in certain settings (e.g., randomized controlled trials). What is more, IPW estimators suffer from the curse of dimensionality, as their rate of convergence depends entirely on the convergence rate of the postulated model for the propensity score. This latter requirement has proven a significant obstacle to investigators wishing to use data adaptive techniques in the estimation of propensity scores. To overcome these significant shortcomings, van der Laan (2014) proposed the targeted IPW estimator, which facilitates the use of data adaptive techniques in estimating the relevant weight functions. While the targeted estimator is asymptotically linear, it has been shown to suffer from issues of irregularity (van der Laan, 2014). Alternatively, doubly robust estimation procedures, which are based on constructing models for both the propensity score and the outcome mechanism (Bang and Robins, 2005), were proposed. Doubly robust estimators are consistent for the target parameter when either one of the two nuisance parameters is consistently estimated; moreover, such estimators are efficient when both nuisance parameter estimators are correctly specified (Rotnitzky et al., 1998; van der Laan and Robins, 2003). While doubly robust estimators allow two opportunities for consistent estimation, their performance depends critically on the choice of estimators of these nuisance parameters.
When finite-dimensional models are used to estimate the two nuisance parameters, doubly robust estimators may perform poorly, due to the possibility of model misspecification in either of the nuisance parameter estimators (Kang and Schafer, 2007; Cao et al., 2009; Vermeulen and Vansteelandt, 2015, 2016). Although doubly robust procedures facilitate the use of data adaptive techniques for modeling nuisance parameters, the resultant estimator can be irregular with large bias and a slow rate of convergence when either of the nuisance parameters is inconsistently estimated. To ease such issues, van der Laan (2014) proposed a targeted doubly robust estimator that does not suffer from the irregularity issue; the properties of this estimation procedure were investigated in detail by Benkeser et al. (2017).
Though many data adaptive regression techniques have been shown to provide consistent estimates in flexible models, establishing the rate of convergence for such approaches is often a significant obstacle. Among such approaches, the highly adaptive lasso stands out for its ability to flexibly estimate arbitrary functional forms with a fast rate of convergence under relatively mild conditions. The highly adaptive lasso (HAL) is a nonparametric regression function that minimizes a loss-specific empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by a constant (van der Laan, 2017; van der Laan and Bibaut, 2017). Letting the space of the functional parameter be a subset of the set of càdlàg (right-hand continuous with left-hand limits) functions with sectional variation norm bounded by a finite constant, van der Laan (2017) showed that the HAL estimator converges to the true function at a rate faster than $n^{-1/4}$, regardless of dimensionality d (i.e., for any fixed d). Bibaut and van der Laan (2019) subsequently improved this convergence rate to $n^{-1/3}\log(n)^{d/2}$. Unlike most existing data adaptive techniques that require local smoothness assumptions on the true functional form, the finite sectional variation norm assumption imposed by HAL constitutes a (less restrictive) global smoothness assumption, making it a powerful approach for use in a variety of settings; in Section S1 of the Supporting Information we briefly compare the sectional variation norm to another commonly used approach.
Hirano et al. (2003) proposed an efficient IPW estimator in which the propensity score is estimated in a sieve approach by the logit series estimators (Geman and Hwang, 1982). Their approach has two shortcomings: (1) it requires that both the propensity score and outcome models be continuously differentiable, with the level of smoothness of the propensity score increasing (by a factor of 7) with the covariate dimension (Hirano et al., 2003, Assumption 4(i)); and (2) the rate of convergence of the estimated propensity score depends on the covariate dimension. We overcome these issues by proposing the first dimension-free and smoothness-free efficient nonparametric IPW estimator.
We show that IPW estimators can be asymptotically (nonparametric) efficient when the propensity score is estimated using a HAL estimator tuned in a particular manner. Specifically, we show that undersmoothing of the HAL estimator allows for the resultant IPW estimator of the target parameter to be asymptotically linear and a solution to an appropriate efficient influence function (EIF) equation. In the typical construction of HAL estimators, cross-validation is used to determine the sectional variation norm of the underlying functional parameter. By contrast, undersmoothing of the HAL estimator selects a sectional variation norm greater than the choice made by the (global) cross-validation selector. A significant challenge arises in finding a suitable choice of sectional variation norm — one that results in sufficient undersmoothing while simultaneously avoiding overfitting. We provide theoretical conditions under which the desired degree of undersmoothing may be achieved, and we supplement our theoretical investigations by providing practical guidance for appropriately choosing the required tuning parameters. Our proposed approach obviates many of the challenges associated with current methods of choice, namely
in contrast with standard IPW estimators, our estimators do not suffer from an asymptotic curse of dimensionality, allowing asymptotic efficiency;
in contrast with targeted IPW estimators, our estimators do not suffer from potential issues of irregularity;
in contrast with doubly robust estimators, our IPW estimators rely on only a single nuisance parameter and may be formulated without the EIF; and
in contrast with the method of Hirano et al. (2003), our estimators do not require local smoothness assumptions on the propensity score or the outcome models.
2. Preliminaries
2.1. Problem formulation, notation, and target parameter
Consider data generated by typical cohort sampling: let $O = (W, A, Y)$ be the data on a given observational unit, where P0, the distribution of O, lies in the nonparametric model $\mathcal{M}$. The random variable $W \in \mathcal{W}$ constitutes baseline covariates measured prior to treatment A ∈ {0, 1}, and Y is an outcome of interest. Suppose we observe a sample of n independent and identically distributed units O1, …, On, whose empirical distribution we denote Pn. We let Pf = ∫ f(o)dP(o) for a given function f(o) and distribution P, denoting by $E_P$ expectations with respect to P. Let $G: \mathcal{M} \to \mathcal{G}$ be a functional nuisance parameter where $\mathcal{G} = \{G(P) : P \in \mathcal{M}\}$. We use G := G(P) ≡ G(P)(A | W) to denote the treatment mechanism under an arbitrary distribution $P \in \mathcal{M}$. We refer to the treatment mechanism under the true data-generating distribution as G0, that is, G0 := G(P0). Letting $Y^a$ be the potential outcome that would have been observed under the intervention that sets A to level a, we define the full (unobservable) data unit X as $X = (W, Y^0, Y^1) \sim P_X \in \mathcal{M}^F$. A common parameter of interest is the mean counterfactual outcome under treatment, i.e., $\Psi^F(P_X) = E_{P_X}(Y^1)$, where $\Psi^F: \mathcal{M}^F \to \mathbb{R}$ and $\mathcal{M}^F$ is the nonparametric model for the full data X. While we present our results in the context of this target parameter, we stress that our developments extend to any arbitrary treatment level $a$ without loss of generality. Define the corresponding full data canonical gradient $D^F(X, \Psi^F) = Y^1 - \Psi^F(P_X)$ and allow Π to be a projection operator in the Hilbert space $L^2_0(P)$ with inner product $\langle h_1, h_2 \rangle = E_P(h_1 h_2)$. To identify the causal effect of interest, we assume consistency (i.e., $Y = Y^A$) and no unmeasured confounding (i.e., $A \perp Y^a \mid W$) (Hernán and Robins, 2020). We also make the strong positivity assumption that, for all $w \in \mathcal{W}$, min(G0, 1 − G0) > δ where δ is a positive constant. Consistency links the potential outcomes to those observed, no unmeasured confounding is a particular case of the randomization (i.e., coarsening at random) assumption, and positivity ensures that there is sufficient treatment assignment variation for the treatment effect to be assessed.
2.2. Inverse probability weighted mapping
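We define an IPW mapping of $D^F(X, \Psi^F)$ so as to estimate the target parameter $\Psi^F(P_X)$ using the observed data:
$$U_G(O; \Psi) = \frac{A\,Y}{G(A \mid W)} - \Psi(P),$$
where $\Psi: \mathcal{M} \to \mathbb{R}$. Here, $\Psi(P) = E_P\{E_P(Y \mid A = 1, W)\}$. Under the standard identification assumptions noted in Section 2.1, $\Psi^F(P_X) = \Psi(P_0)$. Under coarsening at random, the tangent space of G may be defined as $T_{CAR} = \{\eta(A, W) : E_P\{\eta(A, W) \mid W\} = 0\}$. The canonical gradient of Ψ at a distribution $P \in \mathcal{M}$ is
$$D^{\star}(P) = U_G(\Psi) - D_{CAR}(P),$$
where $D_{CAR}(P) = \Pi\{U_G(\Psi) \mid T_{CAR}\}$ (Robins et al., 1994; van der Laan and Robins, 2003). Following van der Laan and Robins (2003), we have that $D_{CAR}(P) = E_P\{U_G(O; \Psi) \mid A, W\} - E_P\{U_G(O; \Psi) \mid W\}$, which may equivalently be expressed
$$D_{CAR}(P) = \frac{A - G(1 \mid W)}{G(1 \mid W)}\, Q(1, W),$$
where $Q(A, W) = E_P(Y \mid A, W)$ is the conditional mean outcome.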
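As a numerical check on the algebra of this section, the following Python sketch evaluates the IPW mapping, its projection term, and their difference for single observations. It assumes the forms U_G = AY/G(1 | W) − Ψ and D_CAR = Q(1, W){A − G(1 | W)}/G(1 | W), under which the canonical gradient reduces to the familiar expression A{Y − Q(1, W)}/G(1 | W) + Q(1, W) − Ψ. The function names and input values are hypothetical and serve only as an illustration.

```python
def ipw_mapping(a, y, g, psi):
    """Assumed IPW mapping U_G = A*Y / G(1|W) - Psi for one observation."""
    return a * y / g - psi


def d_car(a, g, q1):
    """Assumed projection term D_CAR = Q(1,W) * (A - G(1|W)) / G(1|W)."""
    return q1 * (a - g) / g


def canonical_gradient(a, y, g, q1, psi):
    """D* = U_G - D_CAR for one observation."""
    return ipw_mapping(a, y, g, psi) - d_car(a, g, q1)


def closed_form(a, y, g, q1, psi):
    """Equivalent expression A*(Y - Q(1,W))/G(1|W) + Q(1,W) - Psi."""
    return a * (y - q1) / g + q1 - psi


# One treated and one untreated unit with made-up values.
treated = canonical_gradient(a=1, y=2.0, g=0.4, q1=1.5, psi=1.0)
control = canonical_gradient(a=0, y=0.7, g=0.4, q1=1.5, psi=1.0)
```

For the untreated unit the outcome drops out entirely, which is why the gradient depends on Y only through treated observations.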
2.3. The highly adaptive lasso estimator
The highly adaptive lasso is a nonparametric regression function with the capability to estimate infinite-dimensional functional parameters at a fast rate (roughly $n^{-1/3}$) (van der Laan, 2017; van der Laan and Bibaut, 2017). Benkeser and van der Laan (2016) first demonstrated the utility of the HAL estimator in extensive simulation experiments. The zeroth-order HAL estimator constructs a linear combination of indicator basis functions to minimize the expected value of a loss function while constraining the L1-norm of its coefficients to be bounded by a finite constant corresponding to the sectional variation norm.
Let $\mathbb{D}[0, \tau]$ be the Banach space of d-variate real-valued càdlàg functions on a cube $[0, \tau] \subset \mathbb{R}^d$, where τ is the upper bound of all supports and is assumed to be finite. For each function $f \in \mathbb{D}[0, \tau]$, define the supremum norm as $\|f\|_\infty = \sup_{w \in [0, \tau]} |f(w)|$. The d-dimensional cube [0, τ] can be represented as a union of lower-dimensional cubes (i.e., l-dimensional with l ⩽ d) and the origin. That is, [0, τ] = {∪s(0, τs]} ∪ {0} where ∪s is over all subsets s of {1, 2, …, d}. For a given subset s ⊂ {0, 1, …, d} and for each function $f \in \mathbb{D}[0, \tau]$, we define the sth section of f as $f_s(u) = f(u_1 I(1 \in s), \ldots, u_d I(d \in s))$. This is the function that varies along the variables in $u_s$ according to f while setting other variables to zero. Then, the sectional variation norm of a given f may be defined as
$$\|f\|_{\zeta} = |f(0)| + \sum_{s \subset \{0, 1, \ldots, d\}} \int_{0_s}^{\tau_s} |df_s(u)|,$$
where the sum is over all subsets of the coordinates {0, 1, …, d}. The term $\int_{0_s}^{\tau_s} |df_s(u)|$ is the s-specific variation norm.
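Under the assumption that our nuisance functional parameter has finite sectional variation norm, logit G may be represented (Gill et al., 1995):
$$\text{logit}\, G(w) = \text{logit}\, G(0) + \sum_{s \subset \{0, 1, \ldots, d\}} \int_{0_s}^{w_s} d\, \text{logit}\, G_s(u) = \text{logit}\, G(0) + \sum_{s \subset \{0, 1, \ldots, d\}} \int_{0_s}^{\tau_s} I(u \leq w_s)\, d\, \text{logit}\, G_s(u). \tag{1}$$
The representation in equation 1 may be approximated using a discrete measure that places mass on each observed $W_{s,i}$, denoted by $\beta_{s,i}$. Letting $\phi_{s,i}(w_s) = I(w_s \geq w_{s,i})$, where $w_{s,i}$ are support points of logit $G_s$, we have
$$\text{logit}\, G_\beta = \beta_0 + \sum_{s \subset \{0, 1, \ldots, d\}} \sum_{i=1}^{n} \beta_{s,i}\, \phi_{s,i},$$
where $|\beta_0| + \sum_{s} \sum_{i=1}^{n} |\beta_{s,i}|$ is an approximation of the sectional variation norm of logit G. HAL first expands the covariate dimension by embedding observations in a space of up to $n(2^d - 1)$ indicator basis functions (i.e., the HAL basis features), where $(2^d - 1)$ corresponds to all subsets of the set {1, 2, …, d}, sans the empty set. Let Φ denote the constructed $n(2^d - 1) \times n$ design matrix. Then, we can write logit $G_\beta = \beta_0 + \Phi^\top \beta$, where β is an $n(2^d - 1) \times 1$ vector of parameters. The loss-based HAL estimator $\beta_{n,\lambda}$ is defined as
$$\beta_{n,\lambda} = \arg\min_{\beta :\, |\beta_0| + \sum_{s} \sum_{i=1}^{n} |\beta_{s,i}| < \lambda} P_n L(\text{logit}\, G_\beta),$$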
where L(·) is an appropriate loss function and . Denote by the HAL estimate of G0. When the functional nuisance parameter is a conditional probability (e.g., the propensity score for a binary treatment), log-likelihood loss may be used. Different choices of the tuning parameter λ result in unique HAL estimators; our goal is to select a HAL estimator that allows the construction of an asymptotically linear IPW estimator of Ψ(P0). We let λn denote this data adaptively selected tuning parameter. Section S1 of the Supporting Information discusses further the sectional variation norm and representation in terms of indicator basis functions.
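To make the basis expansion concrete, the following Python sketch builds zeroth-order indicator features for a toy data set: one feature for every nonempty subset s of coordinates and every observed knot point, taking the value 1 when the covariates in s all meet or exceed the knot. This is a didactic illustration and not the algorithm used by any particular HAL software; the function names are hypothetical, and the design is stored with one row per observation (the transpose of the orientation used in the text).

```python
from itertools import combinations


def hal_basis(W):
    """Build zeroth-order HAL indicator features.

    Returns (knots, design): knots is a list of (subset, knot_point) pairs,
    and design[r][c] equals 1 when every coordinate of W[r] in the c-th
    subset is >= the corresponding knot value, else 0.
    """
    n, d = len(W), len(W[0])
    subsets = [s for k in range(1, d + 1) for s in combinations(range(d), k)]
    knots = [(s, tuple(W[i][k] for k in s)) for s in subsets for i in range(n)]
    design = [
        [int(all(w[k] >= u for k, u in zip(s, knot))) for s, knot in knots]
        for w in W
    ]
    return knots, design


def l1_norm(beta0, beta):
    """L1-norm of the coefficients, which HAL bounds by lambda."""
    return abs(beta0) + sum(abs(b) for b in beta)


# Three observations of d = 2 covariates give 3 * (2**2 - 1) = 9 features.
W = [(0.2, 1.0), (0.5, 0.3), (0.9, 0.7)]
knots, design = hal_basis(W)
```

Even in this tiny example the feature count grows as n(2^d − 1), which is why the L1 constraint is essential for obtaining a sparse fit.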
3. Methodology
We estimate the full data parameter $\Psi^F(P_X)$ using an IPW estimator Ψ(Pn, Gn), which is a solution to the score equation $P_n U_{G_n}(\Psi) = 0$. That is,
$$\Psi(P_n, G_n) = \frac{1}{n} \sum_{i=1}^{n} \frac{A_i Y_i}{G_n(A_i \mid W_i)}. \tag{2}$$
Alternatively, a stabilized IPW estimator may be defined as the solution to $n^{-1} \sum_{i=1}^{n} A_i\{Y_i - \Psi(P)\}/G_n(A_i \mid W_i) = 0$. The consistency and convergence rate of these estimators rely on the consistency and convergence rate of the estimator Gn of G0. While finite-dimensional (i.e., parametric) models are often utilized to construct the propensity score estimator Gn, it has been widely conceded that such models are not sufficiently flexible to provide a consistent estimator of the nuisance parameter G0. Consequently, corresponding confidence intervals for $\Psi^F(P_X)$ will have coverage tending to zero asymptotically. While flexible, data adaptive regression techniques may be used to improve the consistency of Gn for G0, establishing asymptotic linearity of the resultant IPW estimator Ψ(Pn, Gn) can prove challenging. Specifically,
$$\Psi(P_n, G_n) - \Psi(P_0, G_0) = (P_n - P_0) U_{G_0}(\Psi) + P_0\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\} + (P_n - P_0)\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\}. \tag{3}$$
Assuming $U_G(\Psi)$ is càdlàg with a universal bound on the sectional variation norm, it can be shown that $(P_n - P_0)\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\} = o_p(n^{-1/2})$, relying only on standard empirical process theory and the assumption of consistency. Consequently, the asymptotic linearity of our IPW estimator relies on the asymptotic linearity of $P_0\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\}$. Since data adaptive regression techniques have a rate of convergence slower than $n^{-1/2}$, the bias of $P_0\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\}$ will dominate the right-hand side of equation 3.
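The two estimators discussed in this section can be sketched in a few lines of Python. The sketch assumes the unstabilized estimator is the sample mean of A·Y/G_n(1 | W) and the stabilized estimator solves the weighted score equation stated above; the function names and the toy data are hypothetical.

```python
def ipw_estimate(A, Y, G):
    """Unstabilized IPW estimate: mean of A_i * Y_i / G_n(1 | W_i)."""
    return sum(a * y / g for a, y, g in zip(A, Y, G)) / len(A)


def stabilized_ipw_estimate(A, Y, G):
    """Stabilized IPW estimate: the root of sum_i A_i * (Y_i - psi) / G_i = 0."""
    numerator = sum(a * y / g for a, y, g in zip(A, Y, G))
    denominator = sum(a / g for a, g in zip(A, G))
    return numerator / denominator


def stabilized_score(A, Y, G, psi):
    """Empirical mean of the stabilized estimating function at psi."""
    return sum(a * (y - psi) / g for a, y, g in zip(A, Y, G)) / len(A)


# Toy data: treatment indicators, outcomes, and estimated propensity scores.
A = [1, 0, 1, 1]
Y = [2.0, 5.0, 4.0, 1.0]
G = [0.5, 0.5, 0.8, 0.25]
psi_ipw = ipw_estimate(A, Y, G)                    # (4 + 5 + 4) / 4
psi_stabilized = stabilized_ipw_estimate(A, Y, G)  # 13 / 7.25
```

The stabilized version normalizes by the sum of the weights rather than by n, so the two estimates coincide only when the weights of the treated units sum exactly to n.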
To show that asymptotic linearity of Ψ(Pn, Gn) can be established when G is estimated using a properly tuned HAL estimator, we introduce Lemma 1, which is an adaptation of Theorem 1 of van der Laan et al. (2019).
Lemma 1:
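Let $G_{n,\lambda_n}$ be a HAL estimator of G with L1-norm bound λn chosen such that
$$\min_{(s,j) \in \mathcal{J}_n} \left| P_n \frac{d}{d\, \text{logit}\, G_{n,\lambda_n}} L(\text{logit}\, G_{n,\lambda_n})(\phi_{s,j}) \right| = o_p(n^{-1/2}), \tag{4}$$
where L(·) is log-likelihood loss and $\mathcal{J}_n$ is a set of indices for the basis functions such that $\beta_{n,s,j} \neq 0$. Let D(f, Gn) = f ·(A−Gn). Here, f is càdlàg with finite sectional variation norm, and $\tilde{f}$ is a projection of f onto the linear span of the basis functions $\phi_{s,j}$ in L2(P), where $\phi_{s,j}$ satisfies condition (4). Assuming $\|f - \tilde{f}\|_{2,P_0} = O_p(n^{-1/4})$, it follows that $P_n D(\tilde{f}, G_n) = o_p(n^{-1/2})$ and $P_n D(f, G_n) = o_p(n^{-1/2})$, where $\|f\|^2_{2,P_0} = \int f^2(o)\, dP_0(o)$.
In condition (4), $\frac{d}{d\, \text{logit}\, G} L(\text{logit}\, G)(\phi_{s,j})$ is $\frac{d}{d\epsilon} L(\text{logit}\, G^{\epsilon}_{s,j}) \big|_{\epsilon = 0}$, denoting the directional derivative of the loss along the path logit $G^{\epsilon}_{s,j} = \text{logit}\, G + \epsilon\, \phi_{s,j}$. Under log-likelihood loss,
$$\frac{d}{d\, \text{logit}\, G_{n,\lambda_n}} L(\text{logit}\, G_{n,\lambda_n})(\phi_{s,j}) = \phi_{s,j}(W)\{A - G_{n,\lambda_n}(1 \mid W)\},$$
which is the corresponding score function. Generally, maximum likelihood estimators exactly solve the score equations; however, the L1-norm restriction results in only approximate solutions. Condition (4) implies that the L1-norm must be increased (i.e., weakening the restriction and undersmoothing the fit) until one of the score equations is solved to a precision of $o_p(n^{-1/2})$. Lemma 1 provides the theoretical guarantee for our main results presented in Theorem 1. The use of different loss functions leads to different estimators with different properties; however, our proposal applies to any loss function that generates scores of the form $\phi_{s,j}(W)\{A - G_n(1 \mid W)\}$, including the squared error loss.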
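The contrast between exact and approximate solutions of the score equations can be illustrated with a single indicator basis function. In the Python sketch below, an unrestricted maximum likelihood fit of a model with an intercept and one indicator (which reduces to stratum-specific treatment proportions) solves the empirical score equation exactly, whereas a fit shrunken toward the marginal treatment proportion does not. The shrunken fit is only a hypothetical stand-in for an L1-constrained HAL fit, and all names and data are illustrative.

```python
def empirical_score(phi, A, G):
    """Empirical mean of phi(W) * (A - G_n(1 | W))."""
    return sum(p * (a - g) for p, a, g in zip(phi, A, G)) / len(A)


W = [0.1, 0.3, 0.6, 0.7, 0.8, 0.9]
A = [0, 1, 1, 1, 0, 1]
phi = [int(w >= 0.5) for w in W]  # one indicator basis function with knot 0.5


def stratum_mean(flag):
    """Treatment proportion among units whose indicator equals flag."""
    values = [a for a, p in zip(A, phi) if p == flag]
    return sum(values) / len(values)


# Unrestricted MLE for intercept + indicator: stratum-specific proportions.
g_mle = [stratum_mean(p) for p in phi]

# Stand-in for a heavily restricted fit: shrink halfway to the marginal mean.
marginal = sum(A) / len(A)
g_shrunk = [0.5 * g + 0.5 * marginal for g in g_mle]

score_mle = empirical_score(phi, A, g_mle)
score_shrunk = empirical_score(phi, A, g_shrunk)
```

Relaxing the restriction moves the fit from `g_shrunk` toward `g_mle`, driving the score toward zero; this is the sense in which undersmoothing solves additional score equations.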
In general, undersmoothing is used for the following two reasons: (1) to produce an asymptotically efficient and unbiased estimator (plug-in or non-plug-in) of a pathwise differentiable parameter where the nuisance parameters are estimated using a data-adaptive approach; and (2) to produce an asymptotically unbiased and normally distributed estimator of a nonpathwise differentiable parameter (Wasserman, 2006). The former is our motivation in this paper and, as shown in Theorem 1, under certain assumptions, undersmoothing of the HAL estimator of the nuisance parameter G results in IPW estimators that are asymptotically linear and efficient in the nonparametric model. In the following, in a slight abuse of notation, we use f to denote Q0(1, W)/G0, which is a particular member of $\mathbb{D}[0, \tau]$ (i.e., under assumption 1). This requires further assumptions.
Assumption 1:
Let Q0(1, W) and G0(W) be càdlàg with finite sectional variation norm.
Assumption 2:
Let $\tilde{f}$ be the projection of f = Q0(1, W)/G0 onto a linear span of basis functions $\phi_{s,j}$ in L2(P), for $\phi_{s,j}$ satisfying condition (4). Then, $\|f - \tilde{f}\|_{2,P_0} = O_p(n^{-1/4})$.
Since the set of càdlàg functions with finite sectional variation norm contains a rich variety of functional forms, assumption 1 is mild in that it would be expected to hold in nearly any practical application. We now consider assumption 2. This assumption states that the degree of undersmoothing needs to be such that the generated basis functions in the HAL fit of Gn are sufficient to approximate f within an $n^{-1/4}$ neighborhood of f (i.e., $\|f - \tilde{f}\|_{2,P_0} = O_p(n^{-1/4})$). This assumption is not particularly strong. First, assuming that f is càdlàg with finite sectional variation norm (assumption 1), there always exists λ such that $\|f - \tilde{f}\|_{2,P_0} = O_p(n^{-1/4})$ is satisfied. Second, the required convergence rate is even slower than the established rate obtained by the HAL estimator (i.e., $O_p(n^{-1/3})$). Let f = Q0(1, W)/G0 and suppose that $df_s / d\, \text{logit}\, G_s < \infty$ for all sections s ⊂ {0, 1, …, d}. It follows that
When logit G0 has similar complexity to f, measured by the support set for the knot points of the basis functions, assumption 2 may hold without undersmoothing. On the other hand, when G0 is a simple function (e.g., in randomized controlled trials), undersmoothing is more likely to be needed so that the undersmoothed d logit Gn has rich enough support to approximate f. In general, as f becomes more complex relative to G0, more undersmoothing should be required. We examine this phenomenon in Section S3 of the Supporting Information. In our simulation study, we find that even in the extreme case that G0(W) = 0.5 and Q0 is a function of W, undersmoothing still improves the efficiency of HAL-based IPW estimators.
Theorem 1:
Suppose the support of W is uniformly bounded, i.e., $\mathcal{W} \subseteq [0, \tau]^d$ for some finite constant τ. Let $G_{n,\lambda_n}$ be a HAL estimator of G0 with L1-norm bound equal to λn. Under assumption 1, when λn is chosen such that condition (4) and assumption 2 are satisfied, the estimator $\Psi(P_n, G_{n,\lambda_n})$ will be asymptotically efficient with influence function
$$U_{G_0}(\Psi) - D_{CAR}(P_0).$$
Intuitively, Theorem 1 states that when the HAL estimator is properly undersmoothed, the resultant estimate will include a rich enough set of basis functions to approximate any arbitrary càdlàg function with finite sectional variation norm (as per Lemma 1). With respect to the asymptotic linearity result, assumptions 1 and 2, along with condition (4), imply that the chosen set of basis functions must be sufficient to solve the EIF equation, that is, $P_n D_{CAR}(G_{n,\lambda_n}, Q_0) = o_p(n^{-1/2})$. A proof of this result is given in Section S2 of the Supporting Information. Conveniently, when the form of the EIF is unknown, inference is attainable via the standard bootstrap (Cai and van der Laan, 2019). The undersmoothing condition (4) provides a rate condition and cannot be used to determine the level of undersmoothing in finite sample settings. In Section 4.2, we provide practical criteria for undersmoothing that are verifiable using the observed data.
Remark 1:
Undersmoothing does not impact the rate of convergence of highly adaptive estimators as long as the sectional variation norm of the undersmoothed fit remains finite. The latter is plausible in our setting for two reasons: (1) we have assumed that the propensity score has finite sectional variation norm; and (2) the undersmoothing criterion is based on approximating a function with finite sectional variation norm (i.e., Q0(1, W)/G0(W)), and thus, even undersmoothing is unlikely to lead to this norm diverging as n increases. This is confirmed in our simulation studies presented in Figures 1 and 2 where the level of undersmoothing stabilizes as n → ∞.
Remark 2:
van der Laan et al. (2019) showed that when the conditional outcome model (i.e., Q0) is estimated using an undersmoothed HAL, the resulting plug-in estimator will be efficient and asymptotically linear. Qiu et al. (2020) also explore undersmoothing methods to achieve efficiency in plug-in estimators using machine learning tools. Our results establish asymptotic efficiency of non-plug-in (i.e., IPW) estimators when the propensity score (i.e., G0) is estimated using an undersmoothed HAL. Even though an important principle in the efficiency proofs of these two types of HAL-based estimators is the same (i.e., the HAL fit must solve a large class of score equations), there are also principal differences between the two. First, using knowledge on G0 in obtaining a HAL-based estimate of G0 may lead to an asymptotically inefficient IPW estimator (van der Laan and Robins, 2003). For example, if it is known that A only depends on W1, a component of W, estimating the propensity score using an undersmoothed HAL that only includes W1 (and the corresponding derived features) may result in a highly inefficient estimator. This is because the HAL fit may not solve enough score equations to satisfy $P_n D_{CAR}(G_n, Q_0) = o_p(n^{-1/2})$. In contrast, for a plug-in estimator any knowledge on Q0 should be used in the formulation of HAL to enhance its efficiency. Second, in finite samples, IPW estimators are in general more sensitive than plug-in estimators to overfitting of the HAL-based estimator beyond the optimal undersmoothing level. This is because in IPW estimators the propensity score appears in the denominator and thus, too much undersmoothing can make the resulting estimator unstable by pushing Gn toward zero. Hence, in finite samples, a carefully designed undersmoothing criterion is needed for IPW estimators.
Finally, because the dependence of plug-in and IPW estimators on the HAL estimator is very different (e.g., the IPW estimator is highly non-linear in G0), the proof of asymptotic linearity of these estimators requires different techniques.
4. Estimation
4.1. Cross-fitted inverse probability weighted estimation
Cross-fitting has previously been studied as a step to de-bias parameter estimates when the relevant nuisance functions are estimated using data-adaptive techniques (Klaassen, 1987; Zheng and van der Laan, 2011; Chernozhukov et al., 2017).
To employ V-fold cross-fitting, split the data, uniformly at random, into V mutually exclusive and exhaustive sets of size approximately $n V^{-1}$. Denote by $P^0_{n,v}$ the empirical distribution of a training sample and by $P^1_{n,v}$ the empirical distribution of a validation sample. For a given λ, exclude a single (validation) fold of data and fit the HAL estimator using data from the remaining (V − 1) folds; use this model to estimate the propensity scores for observational units in the holdout (validation) fold. Repeat this process V times, such that holdout estimates of the propensity score are available for all observational units. The cross-fitted IPW estimator is the solution to $V^{-1} \sum_{v=1}^{V} P^1_{n,v} U_{G_{n,\lambda,v}}(\Psi) = 0$, where $G_{n,\lambda,v}(A \mid W)$ is the estimate of G0(A | W) obtained from the training sample for the vth sample split for a given λ.
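The cross-fitting steps described above can be sketched in Python as follows. The propensity score learner is passed in as a function so that any estimator may be substituted; the marginal-proportion learner used here is a deliberately simple, hypothetical stand-in for a HAL fit, and the final estimate pools the held-out weights across folds (which coincides with averaging fold-specific means when folds have equal size). All names are illustrative.

```python
import random


def make_folds(n, V, seed=0):
    """Randomly partition indices 0..n-1 into V folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[v::V] for v in range(V)]


def cross_fitted_ipw(W, A, Y, fit, V=5, seed=0):
    """Cross-fitted IPW estimate using held-out propensity score predictions.

    fit(W_train, A_train) must return a function mapping w to an estimate
    of the probability of treatment given w.
    """
    n = len(A)
    G = [None] * n
    for fold in make_folds(n, V, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        predict = fit([W[i] for i in train], [A[i] for i in train])
        for i in fold:
            G[i] = predict(W[i])  # prediction never uses unit i's own data
    estimate = sum(a * y / g for a, y, g in zip(A, Y, G)) / n
    return estimate, G


def fit_marginal(W_train, A_train):
    """Hypothetical stand-in learner: the marginal treatment proportion."""
    p = sum(A_train) / len(A_train)
    return lambda w: p


W = [0.1 * i for i in range(10)]
A = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
Y = [1.0 + w for w in W]
psi_cf, G_cf = cross_fitted_ipw(W, A, Y, fit_marginal, V=5)
```

Passing the learner as an argument keeps the sample-splitting logic separate from the choice of propensity score estimator.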
Theorem S1, found in the Supporting Information, shows that the cross-fitted IPW estimator is asymptotically linear. Although cross-fitting relaxes the Donsker class condition needed to show $(P_n - P_0)\{U_{G_n}(\Psi) - U_{G_0}(\Psi)\} = o_p(n^{-1/2})$, our proposed approach does not obviate the need for the Donsker class condition. The condition is required to show that $P_n D(f, G_n) = o_p(n^{-1/2})$ for f = Q0(1, W)/G0, which requires assuming that Q0/G0 has finite sectional variation norm. Thus, cross-fitting provides only finite-sample improvements in our setting.
4.2. Undersmoothing in practice
Undersmoothing is crucial for both asymptotic linearity and efficiency of our proposed estimators. Our theoretical results show that targeted undersmoothing of the HAL estimator of G can result in an IPW estimator that is a solution to the EIF equation. In practice, an L1-norm bound for an estimate of G may be obtained such that
$$\lambda_n = \arg\min_{\lambda} \left| V^{-1} \sum_{v=1}^{V} P^1_{n,v} D_{CAR}(G_{n,\lambda,v}, Q_{n,v}) \right|, \tag{5}$$
where $P^1_{n,v}$ is the empirical distribution of the vth validation sample and $Q_{n,v}$ is a cross-validated HAL estimate of Q0(1, W) with the L1-norm bound based on the global cross-validation selector. The criterion (5) is motivated by the goal of achieving efficiency asymptotically, requiring that our estimator be a solution to the EIF equation, that is, $P_n D_{CAR}(G_{n,\lambda_n}, Q_0) = o_p(n^{-1/2})$.
For a general censored data problem and inverse probability of censoring weighted HAL estimator, in certain complex settings, the derivation of the EIF can become mathematically involved and tedious. This arises, for example, in longitudinal settings with many decision points. For such settings, alternative criteria that do not require knowledge of the form of the EIF may prove useful. To this end, we propose the criterion:
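$$\lambda_n = \arg\min_{\lambda} V^{-1} \sum_{v=1}^{V} \left[ \sum_{(s,j) \in \mathcal{J}_n} \frac{1}{\|\beta_{n,\lambda,v}\|_{L_1}} \left| P^1_{n,v} \tilde{S}_{s,j}(\phi, G_{n,\lambda,v}) \right| \right], \tag{6}$$
in which $\|\beta_{n,\lambda}\|_{L_1} = |\beta_{n,\lambda,0}| + \sum_{s} \sum_{j} |\beta_{n,\lambda,s,j}|$ is the L1-norm of the coefficients $\beta_{n,\lambda,s,j}$ in the HAL estimator $G_{n,\lambda}$ for a given λ, and $\tilde{S}_{s,j}(\phi, G_{n,\lambda,v}) = \phi_{s,j}(W)\{A - G_{n,\lambda,v}(1 \mid W)\}\{G_{n,\lambda,v}(1 \mid W)\}^{-1}$. This score-based criterion leverages a general characteristic of canonical gradients: propensity score terms always appear in the denominator. Increasing λ results in decreasing the empirical mean of the score equation $S_{s,j}(\phi, G_{n,\lambda,v})$ (where $S_{s,j}(\phi, G_{n,\lambda,v}) = \phi_{s,j}(W)\{A - G_{n,\lambda,v}(1 \mid W)\}$) and increasing the variance of the weight function $\{G_{n,\lambda,v}(1 \mid W)\}^{-1}$. The latter occurs because, as we undersmooth the propensity score fit, the values of $G_{n,\lambda,v}$ may start approaching the boundaries of the unit interval, leading to large and unstable weights. Another key component of our score criterion is the L1-norm $\|\beta_{n,\lambda}\|_{L_1}$. Under assumption 1, as λ increases, the L1-norm increases, but its rate of increase diminishes as λ diverges. Hence, at a certain point in the grid of λ, decreases in the empirical mean of $\|\beta_{n,\lambda}\|^{-1}_{L_1} S_{s,j}(\phi, G_{n,\lambda,v})$ no longer offset the growth of $\{G_{n,\lambda,v}(1 \mid W)\}^{-1}$, and the objective in criterion 6 starts increasing.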
For both proposed undersmoothing criteria, the series of propensity score models based on HAL is constructed as follows. First, an initial fit is obtained via global cross-validation, to choose a starting value λcv. Next, undersmoothed HAL fits are constructed by weakening the restriction placed on the L1-norm, that is, λ ⩾ λcv. Then, the value of λ is increased until the target criterion is satisfied, selecting a particular HAL fit in the sequence.
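The search over the sequence of HAL fits can be sketched as a simple grid search. In the Python sketch below, λ denotes the L1-norm bound as in the text (so larger values mean less smoothing), the criterion is any function returning the cross-validated objective at a given λ, and the selected value minimizes its absolute value among candidates at least as large as the cross-validation choice. The helper names and the toy criterion values are hypothetical, and `mean_d_car` assumes the form D_CAR = Q(1, W){A − G(1 | W)}/G(1 | W).

```python
def mean_d_car(A, G, Q1):
    """Empirical mean of the assumed D_CAR = Q(1,W) * (A - G(1|W)) / G(1|W)."""
    return sum(q * (a - g) / g for a, g, q in zip(A, G, Q1)) / len(A)


def select_lambda(lambdas, lambda_cv, criterion):
    """Pick the L1-norm bound >= lambda_cv minimizing |criterion(lambda)|."""
    candidates = sorted(lam for lam in lambdas if lam >= lambda_cv)
    if not candidates:
        raise ValueError("no candidate bound is at least lambda_cv")
    return min(candidates, key=lambda lam: abs(criterion(lam)))


# Toy criterion values over a grid of bounds (made up for illustration):
# the objective shrinks toward zero and then grows again as weights destabilize.
toy_criterion = {1.0: 0.30, 2.0: 0.20, 4.0: 0.05, 8.0: -0.01, 16.0: 0.12}
chosen = select_lambda(toy_criterion, lambda_cv=2.0, criterion=toy_criterion.get)
```

In a real analysis, `criterion` would refit (or look up) the cross-fitted propensity scores for each λ and evaluate the chosen objective on the validation folds.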
Remark 3:
As noted by a referee, there is no guarantee that the λn selected using our finite sample criteria would satisfy the conditions of Theorem 1. However, the criteria correspond to the most efficient IPW estimator for the given data.
4.3. Stability under near-violations of positivity
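In practice, despite the strong positivity assumption, the estimated propensity score may fall close to the boundaries of the unit interval in finite samples. Such cases may result in large or unstable estimates of the inverse probability weights required for estimator construction. To mitigate this issue, we propose truncation of propensity score estimates. Importantly, because of the positivity assumption and uniform consistency of the highly adaptive lasso (van der Laan and Bibaut, 2017), min(Gn, 1 − Gn) > δ with probability tending to 1 as n → ∞. This implies that no truncation is needed asymptotically; hence, asymptotic inference is not affected. For a given positive constant κ, truncation sets all propensity score estimates lower than κ or greater than 1 − κ to κ and 1 − κ, respectively. The selectors of equations 5 and 6 may be straightforwardly extended to achieve optimal κ-truncation:
$$(\lambda_n, \kappa_n) = \arg\min_{\lambda, \kappa} \left| V^{-1} \sum_{v=1}^{V} P^1_{n,v} D_{CAR}(G_{n,\lambda,v,\kappa}, Q_{n,v}) \right|, \tag{7}$$
$$(\lambda_n, \kappa_n) = \arg\min_{\lambda, \kappa} V^{-1} \sum_{v=1}^{V} \left[ \sum_{(s,j) \in \mathcal{J}_n} \frac{1}{\|\beta_{n,\lambda,v}\|_{L_1}} \left| P^1_{n,v} \tilde{S}_{s,j}(\phi, G_{n,\lambda,v,\kappa}) \right| \right], \tag{8}$$
where $G_{n,\lambda,v,\kappa}$ is the truncated propensity score estimate for a given λ and κ, and $\tilde{S}_{s,j}(\phi, G_{n,\lambda,v,\kappa}) = \phi_{s,j}(W)\{A - G_{n,\lambda,v,\kappa}(1 \mid W)\}\{G_{n,\lambda,v,\kappa}(1 \mid W)\}^{-1}$.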
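The truncation rule described in this section amounts to clipping the estimated propensity scores to the interval [κ, 1 − κ]. A minimal Python sketch follows; the function name and example values are hypothetical.

```python
def truncate(G, kappa):
    """Clip propensity score estimates to the interval [kappa, 1 - kappa]."""
    if not 0.0 < kappa < 0.5:
        raise ValueError("kappa must lie in (0, 0.5)")
    return [min(max(g, kappa), 1.0 - kappa) for g in G]


# Estimates near the boundary would otherwise produce very large weights.
G_hat = [0.01, 0.40, 0.995]
G_trunc = truncate(G_hat, kappa=0.05)
max_weight = max(1.0 / g for g in G_trunc)  # bounded above by 1 / kappa
```

After truncation no inverse weight can exceed 1/κ, which is what stabilizes the estimator in finite samples.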
5. Numerical Studies
The practical performance of our proposed IPW estimators was assessed in simulation studies. We present two of these studies in the sequel, with four additional scenarios discussed in Section S3 of the Supporting Information. In the present two scenarios, we assess the performance of our IPW estimators against alternatives based on correctly specified parametric models for the propensity score and a cross-validated HAL estimator, illustrating that estimators based on undersmoothing of HAL can be made both unbiased and efficient.
In both of the following scenarios, W1 ~ Uniform(−2, 2), W2 ~ Normal(μ = 0, σ = 0.5), ϵ ~ Normal(μ = 0, σ = 0.1), and expit(x) = {1 + exp(−x)}^{−1}. In each setting, we sample n ∈ {1000, 2000, 3000, 5000} independent and identically distributed observations, applying each estimator to the resultant data. This was repeated 200 times. In both scenarios, the true propensity score G0 is bounded away from zero (i.e., 0.15 < G0); thus, the positivity assumption holds. In both scenarios, the true treatment effect is zero. Note that while these scenarios include two covariates and up to several thousand observations, they are far from low-dimensional with regard to the complexity of the highly adaptive lasso; in fact, the basis function expansion is expected to generate between 3000 (for n = 1000) and 15,000 indicator bases (for n = 5000) to represent the relevant main and interaction terms across the several thousand observational units.
In the first scenario, A | W ~ Bernoulli{expit(0.75W1 + 0.5W2)} and Y | A, W = 0.5W1 − 2/3W2+ϵ. As both models are linear, parametric IPW estimators are expected to be unbiased. In the second scenario, and Y | A, . Due to nonlinearity of the propensity score model, the parametric IPW estimator (with main effect terms) is expected to exhibit bias while our undersmoothed IPW estimators ought to be unbiased and efficient.
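For readers who wish to reproduce the flavor of the first scenario, the following Python sketch draws data from the stated distributions. It reads the outcome model as Y = 0.5·W1 − (2/3)·W2 + ϵ, which is an interpretation of the typeset formula, and it evaluates an oracle IPW estimate that plugs in the true propensity score. The function names and the seed are hypothetical.

```python
import math
import random


def expit(x):
    """Inverse logit: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_scenario_1(n, seed=0):
    """Draw n units (w1, w2, a, y, g0) from the first simulation scenario."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w1 = rng.uniform(-2.0, 2.0)
        w2 = rng.gauss(0.0, 0.5)
        g0 = expit(0.75 * w1 + 0.5 * w2)       # true propensity score
        a = 1 if rng.random() < g0 else 0
        # Assumed reading of the outcome model: coefficient 2/3 on W2.
        y = 0.5 * w1 - (2.0 / 3.0) * w2 + rng.gauss(0.0, 0.1)
        data.append((w1, w2, a, y, g0))
    return data


data = simulate_scenario_1(2000, seed=1)
# Oracle IPW estimate of the treated-arm counterfactual mean (zero under this reading).
oracle_ipw = sum(a * y / g0 for _, _, a, y, g0 in data) / len(data)
```

Because the outcome does not depend on treatment and both covariates are centered at zero, the counterfactual mean under this reading is zero, so `oracle_ipw` should be close to zero up to sampling error.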
We consider undersmoothing criteria including the minimizer of DCAR (equation 5) and the score-based method (equation 6). Throughout, we use the hal9001 R package (Coyle et al., 2020; Hejazi et al., 2020), considering basis functions for up to all two-way interactions in estimating the propensity score and outcome models. For comparison, we construct propensity score estimates using a cross-validated HAL and a (parametric) logistic regression model with main effect terms for W1 and W2. All models were fit using 15-fold cross-validation. All numerical experiments were performed using the R language and environment for statistical computing (R Core Team, 2022).
Figures 1 and 2 display the results for scenarios 1 and 2, respectively. The IPW estimators using undersmoothing of the highly adaptive lasso outperform those based on cross-validated HAL in terms of both bias and efficiency, producing results similar to those of the IPW estimators based on correctly specified parametric models of the propensity score.
The first row of each figure presents the bias and the cross-validated mean of DCAR (both scaled by n^{1/2}) of the corresponding estimators, where the latter is the objective function in equation 5 and is expected to be nearly zero for estimators that solve the efficient influence function equation. While the scaled bias and the cross-validated mean of DCAR of the cross-validation-based selector (triangle) diverge, the undersmoothed highly adaptive lasso and the correctly specified parametric models perform similarly. In terms of coverage, the DCAR-based criterion achieves the nominal coverage rate of 95%, even for smaller sample sizes, while the cross-validation-based estimator (triangle) yields a poor coverage rate of ≈50%. The score-based undersmoothing selectors perform reasonably well, producing IPW estimators with coverage rates ≈90% for n = 1000 and ≈95% at larger sample sizes (n ⩾ 5000). In scenario 2, where the parametric model of the propensity score is misspecified, the parametric IPW estimator performs poorly, resulting in IPW estimators with coverage rates tending to zero asymptotically. In the same scenario, the score-based selector performs as well as the DCAR-based selector, producing estimators with coverage rates ≈95% for all sample sizes considered. For both scenarios, we additionally report the selected tuning parameter λ based on both the global cross-validation and undersmoothing selectors. Our results illustrate that, as sample size increases, the selected value of the tuning parameter stabilizes. Importantly, this observation implies that the undersmoothing procedure does not lead to violations of the Donsker class assumption. Figures S6 and S7 in Section S4 of the Supporting Information show how the proposed estimator changes as a function of the tuning parameter λ. The U-shaped plots indicate that our proposed criteria perform well in terms of both scaled bias and the coverage of 95% Wald-style confidence intervals.
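The two performance measures discussed above, bias scaled by n^{1/2} and the coverage of 95% Wald-style confidence intervals, can be computed from Monte Carlo replicates as in the following sketch. The function name `summarize_replicates` and the toy inputs are hypothetical; they are not results from the simulation study.

```python
import math


def summarize_replicates(estimates, std_errors, truth, n, z=1.96):
    """Scaled bias and Wald-interval coverage across Monte Carlo replicates.

    estimates, std_errors: one point estimate and standard error per replicate.
    truth: the true parameter value (zero in both simulation scenarios).
    n: the sample size used in each replicate.
    """
    reps = len(estimates)
    bias = sum(estimates) / reps - truth
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return {
        "scaled_bias": math.sqrt(n) * bias,  # n^{1/2} times the Monte Carlo bias
        "coverage": covered / reps,          # share of intervals containing the truth
    }


# Toy inputs for illustration only.
summary = summarize_replicates(
    estimates=[0.02, -0.01, 0.03, 0.10],
    std_errors=[0.02, 0.02, 0.02, 0.02],
    truth=0.0,
    n=1000,
)
```

Scaling the bias by n^{1/2} matters because an estimator whose bias shrinks more slowly than n^{−1/2} will show a scaled bias that grows with n, which in turn drives Wald-interval coverage below its nominal level.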
Figures S8 and S9 in Section S4 of the Supporting Information show that, when cross-fitting is not used, our λ selector criteria tend to undersmooth too much, resulting in inferior performance compared with the corresponding estimators with cross-fitting (Figures 1 and 2).
We provide additional simulation studies in Section S3 of the Supporting Information, in which we examine the relative performance of our proposed estimators under differing outcome and propensity score models. In Section S3, we also compare our proposed estimators with various augmented IPW estimators (Robins et al., 1994; van der Laan and Robins, 2003; Tsiatis, 2007). Our results suggest that, when at least one of the nuisance parameters is misspecified, the bias of the doubly robust estimators tends to zero at a rate slower than n^{−1/2}, while the bias of the proposed IPW estimators tends to zero faster. Moreover, in relatively limited sample sizes, the DCAR-based and the score-based IPW estimators can outperform the doubly robust estimators even when both nuisance parameters are consistently estimated. Notably, the score-based estimator does not require any knowledge of the form of the efficient influence function.
6. Empirical Illustration
6.1. Overview and problem setup
We now apply our proposed estimation strategy to assess the effect of smoking cessation on weight gain, using a subset of data from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). As per Hernán and Robins (2020), the NHEFS was jointly initiated by the National Center for Health Statistics and the National Institute on Aging, in collaboration with several other agencies of the United States Public Health Service. The study was designed to investigate the impact of a variety of clinical, nutritional, and behavioral factors on health outcomes including morbidity and mortality. The subset of the NHEFS data we consider comprises n = 1566 cigarette smokers, all between the ages of 25 and 74; the data are publicly available. Each individual must have been present for a baseline visit and a follow-up visit roughly 10 years later. Individual weight gain was measured as the difference between baseline body weight and body weight at the follow-up visit; individuals were classified as belonging to the treatment group if they reported having quit smoking prior to the follow-up visit and to the control group otherwise. Hernán and Robins (2020) caution that this subset of the NHEFS data could suffer from selection bias. As correcting for such a bias is tangential to the illustration of our analytic approach, we forgo standard corrections and note this as a caveat of our demonstration. In practice, we advocate the use of our strategy in tandem with censoring or selection bias corrections, e.g., imputation or inverse probability of censoring re-weighting (e.g., Carpenter et al., 2006; Seaman et al., 2012; Hernán and Robins, 2020).
6.2. Estimation strategy
We consider estimating the average treatment effect of smoking cessation on weight gain in this subset of the NHEFS cohort (n = 1566). A fairly rich set of baseline covariates (sex, race, age, highest degree of formal education, intensity of smoking, years of smoking, exercise habits, indicators of an active lifestyle, and weight at study onset) was considered as potential baseline confounders of the relationship between smoking cessation and weight gain. Constructing IPW estimators for the average treatment effect (ATE) requires estimation of the propensity score, that is, the conditional probability of smoking cessation given the potential baseline confounders. An IPW estimator for the ATE of smoking cessation may be constructed based on distinct estimators of the respective treatment-specific counterfactual means. We compare estimates of the ATE based on both parametric and nonparametric strategies for estimating the propensity score, including
logistic regression with main terms for all baseline covariates;
logistic regression with main terms for all baseline covariates and with quadratic terms for age, smoking intensity, years of smoking, and baseline weight; and
the highly adaptive lasso with basis functions for all terms up to and including 4-way interactions between the baseline covariates, fit with 5-fold cross-validation.
The series of highly adaptive lasso propensity score estimators was constructed by progressively weakening the restriction on the L1-norm, starting from the fit chosen by the global cross-validation selector.
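Given propensity score estimates from any of the strategies listed above, an IPW estimate of the ATE can be formed as the difference of the two weighted treatment-specific means. The sketch below is a minimal illustration under stated assumptions: the function name `ipw_ate` and its `truncation` argument are hypothetical, and the standard error is a simple plug-in based on the empirical variance of the weighted contributions, which is not necessarily the variance estimator used for the results reported in this paper.

```python
import math


def ipw_ate(y, a, g, truncation=0.0, z=1.96):
    """IPW estimate of the ATE with a Wald-style confidence interval.

    y: outcomes; a: binary treatment indicators (0/1);
    g: estimated propensity scores P(A = 1 | W).
    truncation: bound estimated scores to [truncation, 1 - truncation].
    Returns (estimate, lower limit, upper limit).
    """
    n = len(y)
    g = [min(max(gi, truncation), 1.0 - truncation) for gi in g]
    # Per-unit contribution: A*Y/g - (1 - A)*Y/(1 - g); its mean estimates
    # E[Y(1)] - E[Y(0)] when the propensity score is correct.
    contrib = [
        ai * yi / gi - (1 - ai) * yi / (1.0 - gi)
        for yi, ai, gi in zip(y, a, g)
    ]
    estimate = sum(contrib) / n
    variance = sum((c - estimate) ** 2 for c in contrib) / (n - 1)
    std_error = math.sqrt(variance / n)  # assumption: simple plug-in standard error
    return estimate, estimate - z * std_error, estimate + z * std_error


# Toy inputs for illustration only.
est, lower, upper = ipw_ate(
    y=[1.0, 2.0, 3.0, 4.0], a=[1, 0, 1, 0], g=[0.5, 0.5, 0.5, 0.5]
)
```

Truncation guards against extreme weights when estimated propensity scores approach zero or one, at the cost of some bias; setting `truncation=0.0` leaves the estimated scores unchanged.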
6.3. Results
We apply each of the IPW estimators to recover the ATE of smoking cessation on weight gain, controlling for possible confounding by the baseline covariates previously enumerated. Table 1 summarizes the results. The unadjusted estimate is merely the difference of the mean observed outcomes between treated and untreated individuals. Generally, estimates of the ATE were similar across the two classes of IPW estimators. When the propensity score was estimated via a main terms logistic regression model, the estimate was 3.32 (CI: [2.15, 4.49]); likewise, when a logistic regression model with several quadratic terms was used, the estimate was 3.42 (CI: [2.24, 4.61]). By contrast, our cross-fitted (5-fold) nonparametric IPW estimators based on the highly adaptive lasso produced estimates of 3.20 (CI: [2.00, 4.41]) and 3.25 (CI: [2.03, 4.47]), for the cross-validation-based and DCAR-based variants, respectively. Since the form of the canonical gradient is readily known for the ATE, in this case, the DCAR-based estimator provides the most reliable estimate. We note that the DCAR-based estimate of the ATE is lower in magnitude than those recovered by parametric methods, suggesting that the impact of smoking cessation on weight gain may perhaps be lower than suggested by parametric propensity score estimation techniques, which are more prone to model misspecification bias than our proposed approach.
Table 1:
Estimator | Lower 95% CL | Estimate | Upper 95% CL |
---|---|---|---|
HAL (undersmoothed, truncated) | 2.24 | 3.48 | 4.72 |
HAL (undersmoothed, minimal truncation) | 2.03 | 3.25 | 4.47 |
HAL (global cross-validation) | 2.00 | 3.20 | 4.41 |
GLM (main terms only) | 2.15 | 3.32 | 4.49 |
GLM (w/ quadratic terms) | 2.24 | 3.42 | 4.61 |
Unadjusted (intercept model) | 1.49 | 2.54 | 3.59 |
Figure S10 shows how the treatment effect estimates change as a function of the L1-norm. Table S2 presents the standardized differences as a measure of covariate balance (Greifer, 2021), and Figure S11 depicts the overlap between the HAL-based and parametric model-based propensity score estimates. It has been suggested that imbalance should be considered potentially important if the absolute standardized difference is greater than 0.2 (Austin, 2009). Although HAL results in absolute standardized differences less than 0.2, the values are slightly higher than those of the corresponding logistic regression fits. This is because parametric models exactly solve the score equations corresponding to the specified model, so these models attain better balance in finite samples for the covariates included in the model. In contrast, HAL approximately (up to O_p(n^{−2/3})) solves score equations corresponding to relatively high-dimensional derived features. A major drawback of parametric modeling is the potential failure to balance complex functions of covariates (e.g., second-order terms) that are not specified in the model, resulting in misspecification (see Section S3.4 of the Supporting Information). Moreover, since our proposed estimators achieve efficiency by solving the efficient influence function equation, the type of balance achieved may not be amenable to measurement by the marginal covariate balance techniques in popular use. In our example, the resultant ATE estimates are similar in scale (and all very different from the unadjusted estimate), further suggesting that important confounders have been properly adjusted for in all cases.
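The covariate balance diagnostic discussed above can be illustrated with a weighted standardized mean difference. The sketch below follows one common convention, in which group means are weighted by the inverse probability weights while the denominator pools the unweighted group variances; this is an assumption for illustration and may differ from the exact computation behind Table S2. The function names are hypothetical.

```python
import math


def weighted_mean(x, w):
    """Weighted mean of x with non-negative weights w."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def sample_variance(x):
    """Unweighted sample variance with the n - 1 denominator."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)


def standardized_difference(x, a, weights=None):
    """Standardized difference of covariate x between A = 1 and A = 0.

    weights: optional inverse probability weights, e.g. A/g + (1 - A)/(1 - g).
    The denominator pools the unweighted group variances so that the scale
    stays fixed when comparing balance before and after weighting.
    """
    if weights is None:
        weights = [1.0] * len(x)
    x1 = [xi for xi, ai in zip(x, a) if ai == 1]
    x0 = [xi for xi, ai in zip(x, a) if ai == 0]
    w1 = [wi for wi, ai in zip(weights, a) if ai == 1]
    w0 = [wi for wi, ai in zip(weights, a) if ai == 0]
    pooled_sd = math.sqrt((sample_variance(x1) + sample_variance(x0)) / 2.0)
    return (weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd


# Toy inputs for illustration only.
unweighted = standardized_difference([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
weighted = standardized_difference(
    [0, 1, 0, 1], [1, 1, 0, 0], weights=[1.0, 3.0, 1.0, 1.0]
)
```

An absolute value above 0.2 would flag potentially important imbalance under the threshold cited above.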
7. Discussion
We have proposed a class of nonparametric IPW estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso regression function. A particularly interesting application of the proposed approach is in settings in which closed-form representations of the efficient influence function of the target parameter of interest are intractable, such as the bivariate survival probability in bivariate (or, in general, d-variate) right-censored data (van der Laan, 1996) or the mean or truncated mean survival time in interval-censored data (e.g., Chapter 8 of van der Laan and Rose, 2018). In these two example problems, the proposed method can be leveraged to achieve efficient estimation based on the readily available IPW estimators, by undersmoothing the highly adaptive lasso fits of the probability of censoring. The development of asymptotic linearity of the resultant IPW estimators and the corresponding undersmoothing criteria merits further investigation. Notably, there are several key differences between developing undersmoothed plug-in and non-plug-in estimators, including the techniques used to prove their theoretical properties and the assumptions required for undersmoothing of nuisance function estimators, which we discuss in much greater detail in Section S5 of the Supporting Information. One may obviate the need for truncation in finite samples by enforcing the positivity restriction min(Gn, 1 − Gn) > δ in the HAL fit. This is an interesting approach from both methodological and practical perspectives that could motivate future research.
Acknowledgments
We thank David Benkeser for helpful discussions and practical insights. This work was partially supported by the National Institute on Drug Abuse, the National Institute on Alcohol Abuse and Alcoholism, the National Institute of Neurological Disorders and Stroke, the National Institute of Allergy and Infectious Diseases (award no. R01-AI074345), and the National Science Foundation (award no. DMS-2102840).
Footnotes
Supporting Information
Web Appendices, Tables, and Figures referenced in Sections 1, 2.3, 3, 4.1, 5, 6.3, and 7 are available with this paper at the Biometrics website on Wiley Online Library: (S1) Sectional variation norm; (S2) Proofs of Lemma 1 and Theorem 1; (S3) Additional simulation studies; (S4) Additional tables and figures; (S5) Undersmoothing plug-in and non-plug-in estimators. R code implementing our proposed approach, including code for both the simulation studies and the real data analysis, is available with this paper and at https://github.com/nhejazi/pub_ipwhal_biometrics.
Publisher's Disclaimer: This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record.
Contributor Information
Ashkan Ertefaie, Department of Biostatistics and Computational Biology, University of Rochester.
Nima S. Hejazi, Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine.
Mark J. van der Laan, Division of Biostatistics, School of Public Health, University of California, Berkeley; Department of Statistics, University of California, Berkeley.
Data Availability Statement
The National Health and Nutrition Examination Survey (NHEFS) data that support the findings in this paper are openly available at https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.
References
- Austin PC (2009). Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Communications in Statistics – Simulation and Computation 38, 1228–1234.
- Bang H and Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
- Benkeser D, Carone M, van der Laan MJ, and Gilbert PB (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
- Benkeser D and van der Laan MJ (2016). The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696. IEEE.
- Bibaut AF and van der Laan MJ (2019). Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244.
- Cai W and van der Laan M (2019). Nonparametric bootstrap inference for the targeted highly adaptive lasso estimator. arXiv preprint arXiv:1905.10299.
- Cao W, Tsiatis AA, and Davidian M (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
- Carpenter JR, Kenward MG, and Vansteelandt S (2006). A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 169, 571–584.
- Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, and Newey W (2017). Double/debiased/Neyman machine learning of treatment effects. American Economic Review 107, 261–65.
- Coyle JR, Hejazi NS, and van der Laan MJ (2020). hal9001: The scalable highly adaptive lasso. R package version 0.2.7.
- Geman S and Hwang C-R (1982). Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, pages 401–414.
- Gill RD, van der Laan MJ, and Wellner JA (1995). Inefficient estimators of the bivariate survival function for three models. In Annales de l’IHP Probabilités et Statistiques, volume 31, pages 545–597.
- Greifer N (2021). cobalt: Covariate balance tables and plots. R package version 4.3.1.
- Hejazi NS, Coyle JR, and van der Laan MJ (2020). hal9001: Scalable highly adaptive lasso regression in R. Journal of Open Source Software.
- Hernán MA and Robins JM (2020). Causal Inference: What If. CRC; Boca Raton, FL.
- Hirano K, Imbens GW, and Ridder G (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
- Kang JD and Schafer JL (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 523–539.
- Klaassen CA (1987). Consistent estimation of the influence function of locally asymptotically linear estimators. Annals of Statistics, pages 1548–1562.
- Qiu H, Luedtke A, and Carone M (2020). Universal sieve-based strategies for efficient estimation using machine learning tools. arXiv preprint arXiv:2003.01856.
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Robins JM, Hernán MÁ, and Brumback B (2000). Marginal structural models and causal inference in epidemiology.
- Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
- Rotnitzky A, Robins JM, and Scharfstein DO (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association 93, 1321–1339.
- Seaman SR, White IR, Copas AJ, and Li L (2012). Combining multiple imputation and inverse-probability weighting. Biometrics 68, 129–137.
- Tsiatis A (2007). Semiparametric Theory and Missing Data. Springer.
- van der Laan MJ (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. Annals of Statistics 24, 596–627.
- van der Laan MJ (2014). Targeted estimation of nuisance parameters to obtain valid statistical inference. International Journal of Biostatistics 10, 29–57.
- van der Laan MJ (2017). A generally efficient targeted minimum loss-based estimator based on the highly adaptive lasso. International Journal of Biostatistics 13.
- van der Laan MJ, Benkeser D, and Cai W (2019). Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso. arXiv preprint arXiv:1908.05607.
- van der Laan MJ and Bibaut AF (2017). Uniform consistency of the highly adaptive lasso estimator of infinite-dimensional parameters. arXiv preprint arXiv:1709.06256.
- van der Laan MJ and Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.
- van der Laan MJ and Rose S (2018). Targeted Learning in Data Science. Springer.
- Vermeulen K and Vansteelandt S (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association 110, 1024–1036.
- Vermeulen K and Vansteelandt S (2016). Data-adaptive bias-reduced doubly robust estimation. International Journal of Biostatistics 12, 253–282.
- Wasserman L (2006). All of Nonparametric Statistics. Springer Science & Business Media.
- Zheng W and van der Laan MJ (2011). Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pages 459–474. Springer.