Abstract
The assumption of positivity or experimental treatment assignment requires that observed treatment levels vary within confounder strata. This article discusses the positivity assumption in the context of assessing model and parameter-specific identifiability of causal effects. Positivity violations occur when certain subgroups in a sample rarely or never receive some treatments of interest. The resulting sparsity in the data may increase bias with or without an increase in variance and can threaten valid inference. The parametric bootstrap is presented as a tool to assess the severity of such threats and its utility as a diagnostic is explored using simulated and real data. Several approaches for improving the identifiability of parameters in the presence of positivity violations are reviewed. Potential responses to data sparsity include restriction of the covariate adjustment set, use of an alternative projection function to define the target parameter within a marginal structural working model, restriction of the sample, and modification of the target intervention. All of these approaches can be understood as trading off proximity to the initial target of inference for identifiability; we advocate approaching this tradeoff systematically.
Keywords: experimental treatment assignment, positivity, marginal structural model, inverse probability weight, double robust, causal inference, counterfactual, parametric bootstrap, realistic treatment rule, trimming, stabilised weights, truncation
1 Introduction
Incomplete control of confounding is a well-recognised source of bias in causal effect estimation: measured covariates must be sufficient to control for confounding in order for causal effects to be identified based on observational data. The identifiability of causal effects further requires sufficient variability in treatment or exposure assignment within strata of confounders. The dangers of causal effect estimation in the absence of adequate data support have long been understood.1 More recent causal inference literature refers to the need for adequate exposure variability within confounder strata as the assumption of positivity or experimental treatment assignment.2–4 While perhaps less well-recognised than confounding bias, violations and near violations of the positivity assumption can increase both the variance and bias of causal effect estimates, and if undiagnosed can threaten the validity of causal inferences.
Positivity violations can arise for two reasons. First, it may be theoretically impossible for individuals with certain covariate values to receive a given exposure of interest. For example, certain patient characteristics may constitute an absolute contraindication to receipt of a particular treatment. The threat to causal inference posed by such structural or theoretical violations of positivity does not improve with increasing sample size. Second, violations or near violations of positivity can arise in finite samples due to chance. This is a particular problem in small samples, but also occurs frequently in moderate to large samples when the treatment is continuous or can take multiple levels, or when the covariate adjustment set is large and/or contains continuous or multi-level covariates. Regardless of the cause, causal effects may be poorly or non-identified when certain subgroups in a finite sample do not receive some of the treatment levels of interest. In this article, we will use the term ‘sparsity’ to refer to positivity violations and near-violations arising from either of these causes, recognising that other types of sparsity can also threaten valid inference.
In this article, we discuss the positivity assumption within a general framework for assessing the identifiability of causal effects. The causal model and target causal parameter are defined using a non-parametric structural equation model (NPSEM) and the positivity assumption is introduced as a key assumption needed for parameter identifiability. The counterfactual or potential outcome framework is then used to review estimation of the target parameter, assessment of the extent to which data sparsity threatens valid inference for this parameter, and practical approaches for responding to such threats. For clarity, we focus on a simple data structure in which treatment is assigned at a single time point. Concluding remarks generalise to more complex longitudinal data structures.
Data sparsity can increase both the bias and variance of a causal effect estimator; the extent to which each is affected will depend on the estimator. An estimator-specific diagnostic tool is thus required to quantify the extent to which positivity violations threaten the validity of inference for a given causal effect parameter (for a given model, data-generating distribution and finite sample). Wang et al.5 proposed such a diagnostic based on the parametric bootstrap. Application of a candidate estimator to bootstrapped data sampled from the estimated data generating distribution provides information about the estimator’s behaviour under a data generating distribution that is based on the observed data. The true parameter value in the bootstrap data is known and can be used to assess estimator bias. A large bias estimate can alert the analyst to the presence of a parameter that is poorly identified, an important warning in settings where data sparsity may not be reflected in the variance of the causal effect estimate.
Once bias due to violations in positivity has been diagnosed, the question remains how best to proceed with estimation. We review several approaches. Identifiability can be improved by extrapolating based on subgroups in which sufficient treatment variability does exist; however, such an approach requires additional parametric model assumptions. Alternative approaches for responding to sparsity include the following: restriction of the sample to those subjects for whom the positivity assumption is not violated (known as trimming); re-definition of the causal effect of interest as the effect of only those treatments that do not result in positivity violations (estimation of the effects of ‘realistic’ or ‘intention to treat’ dynamic regimes); restriction of the covariate adjustment set to exclude those covariates responsible for positivity violations; and, when the target parameter is defined using a marginal structural working model, use of a projection function that focuses estimation on areas of the data with greater support.
As we discuss, all of these approaches change the parameter being estimated by trading proximity to the original target of inference for improved identifiability. We advocate incorporation of this trade-off into the effect estimator itself. This requires defining a family of parameters, the members of which vary in their proximity to the initial target and in their identifiability. An estimator can then be defined that selects among the members of this family according to some pre-specified criteria.
1.1 Outline
The article is structured as follows. Section 2 introduces an NPSEM for a simple point treatment data structure, defines the target causal parameter using a marginal structural working model and discusses conditions for parameter identifiability with an emphasis on the positivity assumption. Section 3 reviews three classes of causal effect estimators and discusses the behaviour of these estimators in the presence of positivity violations. Section 4 reviews approaches for assessing threats to inference arising from positivity violations, with a focus on the parametric bootstrap. Section 5 investigates the performance of the parametric bootstrap as a diagnostic tool using simulated and real data. Section 6 reviews methods for responding to positivity violations once they have been diagnosed, and integrates these methods into a general approach to sparsity that is based on defining a family of parameters. Section 7 provides some concluding remarks and advocates a systematic approach to possible violations in positivity.
2 Framework for causal effect estimation
We proceed from the basic premise that model assumptions should honestly reflect investigator knowledge. The NPSEM framework provides a systematic approach for translating background knowledge into a causal model and corresponding statistical model, defining a target causal parameter, and assessing the identifiability of that parameter.6 We illustrate this approach using a simple point treatment data structure. We minimise notation by focusing on discrete-valued random variables.
2.1 Model
Let W denote a set of baseline covariates on a subject, let A denote a treatment or exposure variable and let Y denote an outcome. Specify the following structural equation model (with random input U ~ PU):
(1) $W = f_W(U_W), \quad A = f_A(W, U_A), \quad Y = f_Y(W, A, U_Y)$
where U = (UW, UA, UY) denotes the set of background factors that deterministically assign values to (W, A, Y) according to functions (fW, fA, fY). Each of the equations in this model is assumed to represent a mechanism that is autonomous, in the sense that changing or intervening on the equation will not affect the remaining equations, and that is functional, in the sense that the equation reflects assumptions about how the observed data were in fact generated by Nature. In addition, each of the equations is non-parametric: its specification does not require assumptions regarding the true functional form of the underlying causal relationships. However, if aspects of the functional form of any of these equations are known based on background knowledge, such knowledge can be incorporated into the model. The background factors U are assumed to be jointly independent in this particular model; in other words, the model is assumed to be Markov. The NPSEM framework can, however, also be applied to non-Markov models.6
Let the observed data consist of n i.i.d. observations O1,…, On of O = (W, A, Y) ~ P0. Causal model (1) places no restrictions on the allowed distributions for P0, and thus implies a non-parametric statistical model.
2.2 Target causal parameter
A causal effect can be defined in terms of the joint distribution of the observed data under an intervention on one or more of the structural equations. For example, consider the post-intervention distribution of Y under an intervention on the structural model to set A = a. Such an intervention corresponds to replacing A = fA(W, UA) with A = a in the structural model (1). The counterfactual outcome that a given subject with background factors u would have had if he or she were to have received treatment level a is denoted Ya(u).7,8 This counterfactual can be derived as the solution to the structural equation fY in the modified equation system with input U = u.
Let FX denote the distribution of X = (W, (Ya : a ∈ 𝒜)), where 𝒜 denotes the possible values that the treatment variable can take (e.g. {0, 1} for a binary treatment). FX describes the joint distribution of the baseline covariates and counterfactual outcomes under a range of interventions on treatment variable A. A causal effect can be defined as some parameter of FX. For example, a common target parameter for binary A is the average treatment effect
(2) $E_{F_X}(Y_1 - Y_0)$
or the difference in expected counterfactual outcome if every subject in the population had received versus had not received treatment.
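As a minimal sketch of model (1) and of the intervention A = a, the following Python snippet draws the background factors U, evaluates the structural equations to obtain O = (W, A, Y), and re-evaluates fY with A set to 0 and to 1 for the same u to obtain the counterfactuals. The particular functions and coefficients are illustrative assumptions, not taken from this article; under them E(Y1 − Y0) = 0.3 by construction.

```python
import random

# Hypothetical structural equations; coefficients are illustrative only.
def f_W(u_w):
    return 1 if u_w < 0.4 else 0                      # P(W = 1) = 0.4

def f_A(w, u_a):
    return 1 if u_a < 0.2 + 0.5 * w else 0            # P(A = 1 | W) = 0.2 + 0.5 W

def f_Y(w, a, u_y):
    return 1 if u_y < 0.1 + 0.3 * a + 0.4 * w else 0  # P(Y = 1 | W, A)

def draw(rng):
    """Draw U, then return the observed O = (W, A, Y) and counterfactuals (Y_0, Y_1)."""
    u_w, u_a, u_y = rng.random(), rng.random(), rng.random()  # independent U
    w = f_W(u_w)
    a = f_A(w, u_a)
    y = f_Y(w, a, u_y)
    # Intervention: replace A = f_A(W, U_A) by A = a, keeping the same u.
    y0, y1 = f_Y(w, 0, u_y), f_Y(w, 1, u_y)
    return (w, a, y), (y0, y1)

rng = random.Random(1)
draws = [draw(rng) for _ in range(50_000)]
# Monte Carlo approximation of the average treatment effect (2).
ate = sum(y1 - y0 for _, (y0, y1) in draws) / len(draws)
print(round(ate, 2))
```

The observed outcome of each simulated subject coincides with the counterfactual under the treatment actually received, which is the consistency property implied by the structural model.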
Alternatively, an investigator may be interested in estimating the average treatment effect separately within certain strata of the population and/or for non-binary treatments. Specification of a marginal structural model (a model on the conditional expectation of the counterfactual outcome given effect modifiers of interest) provides one option for defining the target causal parameter in such cases.4,9,10 Marginal structural models take the following form:
(3) $E_{F_X}(Y_a \mid V) = m(a, V \mid \beta)$
where V ⊂ W denotes the strata in which one wishes to estimate a conditional causal effect. For example, one might specify the following model:
$m(a, V \mid \beta) = \beta_1 + \beta_2 a + \beta_3 V + \beta_4 aV.$
For a binary treatment A ∈ {0, 1}, such a model implies an average treatment effect within stratum V = v equal to β2 + β4v.
The true functional form of EFX(Ya | V) will generally not be known. One option is to assume that the parametric model m(a, V | β) is correctly specified, or in other words that EFX(Ya | V) = m(a, V | β) for some value β. Such an approach, however, can place additional restrictions on the allowable distributions of the observed data and thus change the statistical model. In order to respect the premise that the statistical model should faithfully reflect the limits of investigator knowledge and not be altered in order to facilitate definition of the target parameter, we advocate an alternative approach in which the target causal parameter is defined using a marginal structural working model. Under this approach the target parameter β is defined as the projection of the true causal curve EFX(Ya | V) onto the specified model m(a, V | β) according to some projection function h(a, V):11
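(4) $\beta(F_X, m, h) \equiv \arg\min_{\beta} E_{F_X}\left[\sum_{a \in \mathcal{A}} \left(Y_a - m(a, V \mid \beta)\right)^2 h(a, V)\right]$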
When h(a, V) = 1, the target parameter β corresponds to an unweighted projection of the entire causal curve onto the model m(a, V | β); alternative choices of h correspond to placing greater emphasis on specific parts of the curve (i.e. on certain (a, V) values).
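To make the role of h concrete, the sketch below projects a hypothetical causal curve onto a linear working model under two choices of h. It assumes V is empty, an ordinal treatment with three levels, and a working model m(a | β) = β0 + β1a; the curve values and the weights are invented for illustration.

```python
def project_linear(curve, h):
    """Weighted least-squares projection of a causal curve a -> E(Y_a)
    onto the working model m(a | beta) = beta0 + beta1 * a.

    curve: dict mapping treatment level a to E(Y_a)
    h:     dict mapping treatment level a to the projection weight h(a)
    """
    total = sum(h[a] for a in curve)
    a_bar = sum(h[a] * a for a in curve) / total
    y_bar = sum(h[a] * curve[a] for a in curve) / total
    sxy = sum(h[a] * (a - a_bar) * (curve[a] - y_bar) for a in curve)
    sxx = sum(h[a] * (a - a_bar) ** 2 for a in curve)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * a_bar
    return beta0, beta1

true_curve = {0: 1.0, 1: 2.0, 2: 4.0}   # hypothetical, non-linear in a
beta_unweighted = project_linear(true_curve, {0: 1.0, 1: 1.0, 2: 1.0})
beta_weighted = project_linear(true_curve, {0: 0.6, 1: 0.3, 2: 0.1})
```

With these assumed values the slope is 1.5 for h = 1 and about 1.33 when h places most of its weight on a = 0. The same causal curve therefore yields different target parameters under different projection functions whenever the working model does not hold exactly.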
Use of a marginal structural working model such as (4) is attractive because it allows the target causal parameter to be defined within the original statistical model. However, this approach by no means absolves the investigator from careful consideration of marginal structural model specification. A poorly specified model m(a, V | β) may result in a target parameter that provides a poor summary of the features of the true causal relationship that are of interest.
In the following sections we discuss the parameter β(FX, m, 1) as the target of inference, corresponding to a focus on estimation of the treatment-specific mean for all levels a ∈ 𝒜 within strata of V as projected onto model m, with projection h(a, V) = 1 chosen to reflect a focus on the entire causal curve. To simplify notation we use β to refer to this target parameter unless otherwise noted.
2.3 Identifiability
We assess whether the target parameter β of the counterfactual data distribution FX is identified as a parameter of the observed data distribution P0 under causal model (1). Because model (1) is Markov, we have that
(5) $P_{F_X}(Y_a = y) = \sum_{w} P_0(Y = y \mid W = w, A = a)\, P_0(W = w)$
identifying the target parameter β according to projection (4).6 This identifiability result is often referred to as the G-computation formula.2,3,12
The weaker randomisation assumption, or the assumption that A and Ya are conditionally independent given W, is also sufficient for identifiability result (5) to hold.
Randomisation assumption
(6) $A \perp Y_a \mid W \quad \text{for all } a \in \mathcal{A}$
Whether or not a given structural model implies that assumption (6) holds can be assessed directly from the corresponding causal graph through the back door criterion.6
2.3.1 The need for experimentation in treatment assignment
The G-computation formula (5) is only a valid formula if the conditional distributions in the formula are well-defined. Let g0(a | W) ≡ P0(A = a | W), a ∈ 𝒜 denote the conditional distribution of treatment variable A given covariates under the observed data distribution P0. If one or more treatment levels of interest do not occur within some covariate strata, the conditional probability P0(Y = y | A = a, W = w) will not be well-defined for some value(s) (a, w) and identifiability result (5) will break down.
A simple example provides intuition into the threat to parameter identifiability posed by sparsity of this nature. Consider an example in which W = I(woman), A is a binary treatment, and no women are treated (g0(1 | W = 1) = 0). In this data generating distribution, there is no information regarding outcomes among treated women. Thus, as long as there are women in the target population (i.e. P0(W = 1) > 0), the average treatment effect EFX(Y1−Y0) will not be identified without additional parametric assumptions.
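The example can be reproduced with a plug-in version of the G-computation formula for a discrete covariate. The sketch below is illustrative only: the data are invented, and the function returns None whenever a stratum of W contains no subject with the treatment level of interest, which is exactly the case in which the conditional mean of Y is undefined.

```python
from collections import defaultdict

def gcomp_mean(data, a):
    """Plug-in version of sum_w E(Y | A = a, W = w) P(W = w) for discrete W.

    data: list of (w, a, y) tuples. Returns None when some stratum of W
    contains no subject with A = a, so that E(Y | A = a, W = w) is undefined.
    """
    n_w = defaultdict(int)        # number of subjects with W = w
    cell = defaultdict(list)      # outcomes of subjects with W = w and A = a
    for w, a_i, y in data:
        n_w[w] += 1
        if a_i == a:
            cell[w].append(y)
    total = 0.0
    for w, count in n_w.items():
        if not cell[w]:
            return None           # positivity fails in stratum w
        total += (sum(cell[w]) / len(cell[w])) * (count / len(data))
    return total

# Hypothetical data: W = 1 for women; no woman is treated.
data = [(0, 0, 0)] * 30 + [(0, 1, 1)] * 10 + [(1, 0, 2)] * 40
print(gcomp_mean(data, 0))  # defined: every stratum contains untreated subjects
print(gcomp_mean(data, 1))  # undefined: no treated women
```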
This simple example illustrates that a given causal parameter under a given model may be identified for some joint distributions of the observed data but not for others. An additional assumption is thus needed to ensure identifiability. We begin by presenting the strong version of this assumption, needed for the identification of PFX((Ya = y, W = w) : a, y, w) in a non-parametric model.
Strong positivity assumption
(7) $\inf_{a \in \mathcal{A}} g_0(a \mid W) > 0 \quad \text{a.e.}$
The strong positivity assumption, or assumption of experimental treatment assignment (ETA), states that each possible treatment level occurs with some positive probability within each stratum of W.
Parametric model assumptions may allow the positivity assumption to be weakened. In the example described above, an assumption that the treatment effect is the same among treated men and women would result in identification of the average treatment effect (2) based on extrapolation from the estimated treatment effect among men (assuming that other identifiability assumptions were met). Parametric model assumptions of this nature are particularly dangerous, however, because they extrapolate to regions of the joint distribution of (A, W) that are not supported by the data. Such assumptions should be approached with caution and adopted only when they have a solid foundation in background knowledge.
In addition to being model specific, the form of the positivity assumption needed for identifiability is parameter specific. Many target causal parameters require much weaker versions of positivity than (7). To take one simple example, if the target parameter is E(Y1), the identifiability result only requires that g0(1 | W) > 0 hold; it does not matter if there are some strata of the population in which no one was treated. Similarly, the identifiability of β(FX, m, h), defined using a marginal structural working model, relies on a weaker positivity assumption.
Positivity assumption for β(FX, m, h)
(8) $\sup_{a \in \mathcal{A}} \dfrac{h(a, V)}{g_0(a \mid W)} < \infty \quad \text{a.e.}$
Choice of projection function h(a, V) used to define the target parameter therefore has implications for how strong an assumption of positivity is needed for identifiability. In Section 6, we consider specification of alternative target parameters that allow for weaker positivity assumptions than (7), including parameters indexed by alternative choices of h(a, V). For now we focus on the target parameter β indexed by the choice h(a, V)=1 and note that (7) and (8) are equivalent for this parameter.
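The two conditions can be checked directly when g0 is available as a table. The sketch below is illustrative: it assumes V is empty so that h depends on a only, uses an invented g0 in which nobody with w = 1 is treated, and adopts the convention (an assumption of this sketch) that the ratio is 0 wherever h(a) = 0.

```python
import math

def inf_g(g):
    """Smallest treatment probability over all (a, w); (7) needs this > 0."""
    return min(g.values())

def sup_h_over_g(h, g):
    """Largest value of h(a) / g0(a | w); (8) needs this to be finite.
    Convention assumed here: the ratio is 0 whenever h(a) = 0."""
    ratios = []
    for (a, w), p in g.items():
        if h[a] == 0:
            ratios.append(0.0)
        elif p == 0:
            ratios.append(math.inf)
        else:
            ratios.append(h[a] / p)
    return max(ratios)

# Hypothetical g0(a | w), keyed by (a, w): nobody with w = 1 is treated.
g0 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 1.0, (1, 1): 0.0}

strong = inf_g(g0)                                # 0.0, so (7) fails
whole_curve = sup_h_over_g({0: 1, 1: 1}, g0)      # infinite, so (8) fails for h = 1
untreated_only = sup_h_over_g({0: 1, 1: 0}, g0)   # finite once h ignores a = 1
```

Setting h(1) = 0 restricts attention to the untreated arm, mirroring the observation above that a parameter involving a single treatment level needs positivity only for that level.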
3 Estimator-specific behaviour in the face of positivity violations
Let Ψ(P0) denote the target parameter value, a function of the observed data distribution. Under the assumptions of randomisation (6) and positivity (8), Ψ(P0) equals the target causal parameter β(FX, m, h). Estimators of this parameter are denoted Ψ̂(Pn), where Pn is the empirical distribution of a sample of n i.i.d. observations from P0. We use Q0W(w) ≡ P0(W = w), Q0Y(y | A, W) ≡ P0(Y = y | A, W) and Ǭ0(A, W) ≡ E0(Y | A, W). Recall that g0(a | W) ≡ P0(A = a | W). We review three classes of estimators Ψ̂(Pn) of β that employ estimators of distinct parts of the observed data likelihood. Maximum likelihood-based substitution estimators (also referred to as ‘G-computation’ estimators) employ estimators of Q0 ≡ (Q0W, Ǭ0). Inverse probability weighted estimators employ estimators of g0. Double robust (DR) estimators employ estimators of both g0 and Q0. A summary of these estimators is provided in Table 1. Their behaviour in the face of positivity violations is illustrated in Section 5 and previous work.11–16
Table 1.
Estimator | Needed for implementation | Needed for consistency | Response to sparsity |
G-computation estimator | Estimator Qn of Q0 | Qn is a consistent estimator of Q0 | Extrapolates based on Qn; sparsity can amplify bias due to model misspecification |
IPTW estimator | Estimator gn of g0 | gn is a consistent estimator of g0; g0 satisfies positivity | Does not extrapolate based on Qn; sensitive to positivity violations and near violations |
DR estimators | Estimator gn of g0 and Qn of Q0 | gn is consistent or Qn is consistent; gn converges to a distribution that satisfies positivity | Can extrapolate based on Qn; without positivity, relies on consistency of Qn |
We focus our discussion on bias in the point estimate of the target parameter β. While estimates of the variance of β can also be biased when data are sparse, methods exist to improve variance estimation. The non-parametric bootstrap provides one straightforward approach to variance estimation in settings where the central limit theorem may not apply as a result of sparsity; alternative approaches to correct for biased variance estimates are also possible.17 These methods will not, however, protect against misleading inference if the point estimate itself is biased.
3.1 G-computation estimator
The G-computation estimator Ψ̂Gcomp(Pn) takes as input the empirical data distribution Pn and provides as output a parameter estimate β̂Gcomp. Ψ̂Gcomp(Pn) is a substitution estimator based on identifiability result (5). It is implemented based on an estimator of Q0 ≡ (Q0W, Ǭ0) and its consistency relies on the consistency of this estimator.2,3 Q0W can generally be estimated based on the empirical distribution of W. However, even when positivity is not violated, the dimension of (A, W) is frequently too large for Ǭ0 to be estimated simply by evaluating the mean of Y within strata of (A, W). Due to the curse of dimensionality, estimation of Ǭ0 under a non-parametric or semi-parametric statistical model therefore frequently requires data-adaptive approaches, such as cross-validated loss-based learning.18–20
Given an estimator Ǭn of Ǭ0, the G-computation estimator can be implemented by generating a predicted counterfactual outcome for each subject under each possible treatment: Ŷa,i = Ǭn(a, Wi) for a ∈ 𝒜, i = 1, …, n. The estimate β̂Gcomp is then obtained by regressing Ŷa on a and V according to the model m(a, V | β), with weights based on the projection function h(a, V).
When all treatment levels of interest are not represented within all covariate strata (i.e. assumption (7) is violated), some of the conditional probabilities in the non-parametric G-computation formula (5) will not be defined. A given estimator Ǭn may allow the G-computation estimator to extrapolate based on covariate strata in which sufficient experimentation in treatment level does exist. Importantly, however, this extrapolation depends heavily on the model for Ǭ0 and the resulting effect estimates will be biased if the model used to estimate Q0 is misspecified.
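The dependence on the outcome model can be seen in a small sketch. It assumes binary A and W, invented data in which no subject with w = 1 is treated, and an additive main-terms model for Ǭ0; none of these choices comes from this article. Because three coefficients are fitted to three observed cells, least squares reproduces the cell means exactly and the value in the empty cell is pure extrapolation.

```python
def cell_means(data):
    """Mean outcome within each observed (a, w) cell; data holds (w, a, y)."""
    sums, counts = {}, {}
    for w, a, y in data:
        sums[(a, w)] = sums.get((a, w), 0.0) + y
        counts[(a, w)] = counts.get((a, w), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def fit_additive(data):
    """Fit Q(a, w) = b0 + b1*a + b2*w for binary a, w when the cell
    (a = 1, w = 1) is empty. With three observed cells and three
    coefficients, least squares reproduces the three cell means exactly."""
    m = cell_means(data)
    b0 = m[(0, 0)]
    b1 = m[(1, 0)] - b0
    b2 = m[(0, 1)] - b0
    return lambda a, w: b0 + b1 * a + b2 * w

def gcomp_ate(data, q):
    """Average of Q(1, W_i) - Q(0, W_i) over the empirical distribution of W."""
    return sum(q(1, w) - q(0, w) for w, _, _ in data) / len(data)

# Hypothetical data generated from Y = a + 2w + 2aw (effect 1 when w = 0,
# 3 when w = 1); nobody with w = 1 is treated, so that cell is empty.
data = [(0, 0, 0)] * 30 + [(0, 1, 1)] * 10 + [(1, 0, 2)] * 40
q_n = fit_additive(data)
estimate = gcomp_ate(data, q_n)
```

Here the additive model predicts 3 for the empty cell, whereas the assumed data-generating mechanism gives 5, so the G-computation estimate equals 1 although the average effect implied by that mechanism is 2.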
3.2 Inverse probability of treatment weighted estimator
The inverse probability of treatment weighted (IPTW) estimator Ψ̂IPTW(Pn) takes as input the empirical data distribution Pn and provides as output a parameter estimate β̂IPTW based on an estimator gn of g0(A | W).10,21 The estimator is defined as the solution in β to the following estimating equation:
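(9) $0 = \sum_{i=1}^{n} \dfrac{h(A_i, V_i)}{g_n(A_i \mid W_i)} \dfrac{d}{d\beta} m(A_i, V_i \mid \beta) \left(Y_i - m(A_i, V_i \mid \beta)\right)$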
where h(A, V) is the projection function used to define the target causal parameter β(FX, m, h) according to (4). The IPTW estimator of β can be implemented as the solution to a weighted regression of the outcome Y on treatment A and effect modifiers V according to model m(A, V | β), with weights equal to h(A, V)/gn(A | W). Consistency of Ψ̂IPTW(Pn) requires that g0 satisfies positivity and that gn is a consistent estimator of g0. As with Ǭ0, g0 can be estimated using loss-based learning and cross validation. Depending on choice of projection function, implementation may further require estimation of h(A, V); however, the consistency of the IPTW estimator does not depend on consistent estimation of h(A, V).
The IPTW estimator is particularly sensitive to bias due to data sparsity. Bias can arise due to structural positivity violations (positivity does not hold for g0) or may occur because by chance certain covariate and treatment combinations are not represented in a given finite sample (gn(a | W = w) may have values of zero or close to zero for some (a,w) even when positivity holds for g0 and gn is consistent).5,13–16 In the latter case, as fewer individuals within a given covariate stratum receive a given treatment, the weights of those rare individuals who do receive the treatment become more extreme. The disproportionate reliance of the causal effect estimate on the experience of a few unusual individuals can result in substantial finite sample bias.
While values of gn(a | W) remain positive for all a ∈ 𝒜, elevated weights inflate the variance of the effect estimate and can serve as a warning that the data may poorly support the target parameter. However, as the number of individuals within a covariate stratum who receive a given treatment level shifts from few (each of whom receives a large weight and thus increases the variance) to none, estimator variance can decrease while bias increases rapidly. In other words, when gn(a | W = w) = 0 for some (a, w), the weight for a subject with A = a and W = w is infinite; however, as no such individuals exist in the dataset, the corresponding threat to valid inference will not be reflected in either the weights or in estimator variance.
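This behaviour can be illustrated with a small sketch. It assumes binary A and W, a saturated empirical estimate of g0, h = 1 and the working model m(a | β) = β0 + β1a, for which the weighted regression reduces to a difference of weighted means. Both data sets are invented and follow Y = a + 2w + 2aw, so that the average effect is 2 when half of the sample has w = 1.

```python
def empirical_g(data):
    """Saturated estimate g_n(a | w): proportion with A = a among W = w."""
    n_w, n_aw = {}, {}
    for w, a, _ in data:
        n_w[w] = n_w.get(w, 0) + 1
        n_aw[(a, w)] = n_aw.get((a, w), 0) + 1
    return {(a, w): n_aw[(a, w)] / n_w[w] for (a, w) in n_aw}

def iptw_ate(data):
    """IPTW estimate of beta1 in m(a | beta) = beta0 + beta1 * a with h = 1:
    a difference of weighted outcome means with weights 1 / g_n(A_i | W_i).
    Returns the estimate and the largest weight."""
    g = empirical_g(data)
    weights = [1.0 / g[(a, w)] for w, a, _ in data]
    means = {}
    for arm in (0, 1):
        num = sum(wt * y for wt, (w, a, y) in zip(weights, data) if a == arm)
        den = sum(wt for wt, (w, a, y) in zip(weights, data) if a == arm)
        means[arm] = num / den
    return means[1] - means[0], max(weights)

# Hypothetical data sets of (w, a, y) tuples.
supported = ([(0, 0, 0)] * 30 + [(0, 1, 1)] * 10
             + [(1, 0, 2)] * 10 + [(1, 1, 5)] * 30)
no_treated_women = [(0, 0, 0)] * 30 + [(0, 1, 1)] * 10 + [(1, 0, 2)] * 40

est_supported, max_w_supported = iptw_ate(supported)
est_sparse, max_w_sparse = iptw_ate(no_treated_women)
```

Under these assumptions the first data set returns an estimate of 2 and the second an estimate of 0, yet the largest weight is 4 in both, so nothing in the weights signals the problem in the second data set.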
3.2.1 Weight truncation
Weights are commonly truncated or bounded in order to improve the performance of the IPTW estimator in the face of data sparsity.5,15,16,22,23 Weights are truncated at either a fixed or relative level (for example, at the 1st and 99th percentiles), thereby reducing the variance arising from large weights and limiting the impact of a few possibly non-representative individuals on the effect estimate. This advantage comes at a cost, however, in the form of increased bias due to misspecification of the treatment model gn, a bias that does not decrease with increasing sample size.
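Both forms of truncation take only a few lines. The sketch below is one possible implementation, not a prescribed one: the function names, the default percentiles and the example weights are assumptions, and percentiles are computed with the standard library's inclusive method so that the bounds stay within the range of the observed weights.

```python
import statistics

def truncate_fixed(weights, upper):
    """Cap every weight at a fixed level."""
    return [min(w, upper) for w in weights]

def truncate_percentile(weights, lower_pct=1, upper_pct=99):
    """Bound weights at the given sample percentiles."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(weights, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(w, lo), hi) for w in weights]

weights = [1.2] * 196 + [50.0, 80.0, 100.0, 120.0]  # hypothetical weights
fixed = truncate_fixed(weights, 20.0)
relative = truncate_percentile(weights)
```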
3.2.2 Stabilised Weights
Use of projection function h(a, V) = 1 implies the use of unstabilised weights. In contrast, stabilised weights, corresponding to a choice h(a, V) = g0(a | V) (where g0(a | V) ≡ P0(A = a | V)) are generally recommended for the implementation of the IPTW estimator. The choice h(a, V) = g0(a | V) results in a weaker form of positivity assumption (8), by allowing the IPTW estimator to extrapolate to sparse areas of the joint distribution of (A, V) using the model m(a, V | β). For example, if A is an ordinal variable with multiple levels, V = {}, and the target parameter is defined using the model m(a, V | β) = β0 + β1a, the IPTW estimator with stabilised weights will extrapolate to levels of A that are sparsely represented in the data by assuming a linear relationship between Ya and a for a ∈ 𝒜. However, when the target parameter β is defined using a marginal structural working model according to (4) (an approach that acknowledges that the model m(A, V | β) may be misspecified), the use of stabilised versus unstabilised weights corresponds to a shift in the target parameter via choice of an alternative projection function.11
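The two weighting schemes differ only in the numerator. The sketch below computes both from empirical proportions, assuming V is empty so that the stabilised numerator is the marginal proportion receiving each treatment level; the data are invented.

```python
def iptw_weights(data, stabilised):
    """Weights h(A_i) / g_n(A_i | W_i) for data of (w, a, y) tuples, with
    h = 1 (unstabilised) or h(a) = g_n(a), the marginal proportion with
    A = a (stabilised, V empty)."""
    n = len(data)
    n_w, n_a, n_aw = {}, {}, {}
    for w, a, _ in data:
        n_w[w] = n_w.get(w, 0) + 1
        n_a[a] = n_a.get(a, 0) + 1
        n_aw[(a, w)] = n_aw.get((a, w), 0) + 1
    out = []
    for w, a, _ in data:
        g_aw = n_aw[(a, w)] / n_w[w]          # g_n(a | w)
        h = n_a[a] / n if stabilised else 1.0
        out.append(h / g_aw)
    return out

# Hypothetical data with every (a, w) cell represented.
data = ([(0, 0, 0)] * 30 + [(0, 1, 1)] * 10
        + [(1, 0, 2)] * 10 + [(1, 1, 5)] * 30)
unstabilised = iptw_weights(data, stabilised=False)
stabilised = iptw_weights(data, stabilised=True)
mean_stabilised = sum(stabilised) / len(stabilised)
```

With every cell represented and saturated estimates, the stabilised weights average to one, and in this example their maximum is 2 rather than 4.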
3.3 Double robust estimators
Double robust estimators of β include the augmented inverse probability weighted estimator (A-IPTW) and the targeted maximum likelihood estimator (TMLE). For the target parameter β(FX, m, h), TMLE corresponds to the extended DR parametric regression estimator of Scharfstein et al.4,24–28 Implementation of the DR estimators requires estimators of both Q0 and g0; as with the IPTW and G-computation estimators, a non-parametric loss-based approach can be employed for both. An implementation of the TMLE estimator of the average treatment effect E(Y1 − Y0) is available in the R package tmleLite; an implementation of the A-IPTW estimator for a point treatment marginal structural model is available in the R package cvDSA (both available at http://www.stat.berkeley.edu/~laan/Software/index.html). Prior literature provides further details regarding implementation and theoretical properties.4,11,13,24,26–28
DR estimators remain consistent if either: 1. gn is a consistent estimator of g0 and g0 satisfies positivity; or, 2. Qn is a consistent estimator of Q0 and gn converges to a distribution g* that satisfies positivity. Thus, when positivity holds, these estimators are truly DR, in the sense that consistent estimation of either g0 or Q0 results in a consistent estimator. When positivity fails, however, the consistency of the DR estimators relies entirely on consistent estimation of Q0. In the setting of positivity violations, DR estimators are thus faced with the same vulnerabilities as the G-computation estimator.
In addition to illustrating how positivity violations increase the vulnerability of DR estimators to bias resulting from inconsistent estimation of Q0, these asymptotic results have practical implications for the implementation of the DR estimators. Specifically, they suggest that the use of an estimator gn that yields predicted values in [0 + γ, 1 − γ] (where γ is some small number) can improve finite sample performance. One way to achieve such bounds is by truncating the predicted probabilities generated by gn, similar to the process of weight truncation described for the IPTW estimator.
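A sketch of an A-IPTW estimator of a single treatment-specific mean E(Ya), with gn bounded away from zero and one, illustrates both points. The data, the bound γ = 0.01, the supplied values of gn (set equal to the empirical proportions) and the two candidate outcome models are assumptions made for illustration; this is not the implementation provided by the packages cited above.

```python
def bound(p, gamma=0.01):
    """Force a predicted probability into [gamma, 1 - gamma]."""
    return min(max(p, gamma), 1.0 - gamma)

def aiptw_mean(data, a, g_n, q_n, gamma=0.01):
    """A-IPTW estimate of E(Y_a): the mean over subjects of
    I(A_i = a) / g_n(a | W_i) * (Y_i - Q_n(a, W_i)) + Q_n(a, W_i)."""
    total = 0.0
    for w, a_i, y in data:
        g = bound(g_n[(a, w)], gamma)
        correction = (y - q_n(a, w)) / g if a_i == a else 0.0
        total += correction + q_n(a, w)
    return total / len(data)

# Hypothetical data of (w, a, y) tuples from Y = a + 2w + 2aw.
supported = ([(0, 0, 0)] * 30 + [(0, 1, 1)] * 10
             + [(1, 0, 2)] * 10 + [(1, 1, 5)] * 30)
sparse = [(0, 0, 0)] * 30 + [(0, 1, 1)] * 10 + [(1, 0, 2)] * 40
# g_n(a | w) keyed by (a, w), equal to the empirical proportions.
g_supported = {(0, 0): 0.75, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.75}
g_sparse = {(0, 0): 0.75, (1, 0): 0.25, (0, 1): 1.0, (1, 1): 0.0}

def q_true(a, w):
    return a + 2 * w + 2 * a * w      # correctly specified outcome model

def q_additive(a, w):
    return a + 2 * w                  # omits the interaction

print(aiptw_mean(supported, 1, g_supported, q_additive))
print(aiptw_mean(sparse, 1, g_sparse, q_additive))
print(aiptw_mean(sparse, 1, g_sparse, q_true))
```

In this example E(Y1) = 3. With full support the weighted residual term corrects the misspecified additive model and the estimate is 3. Without any treated subjects in the stratum w = 1 there is nothing to weight, the additive model yields 2, and only the correctly specified outcome model recovers 3.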
4 Diagnosing bias due to positivity violations
Positivity violations can result in substantial bias, with or without a corresponding increase in variance, regardless of the causal effect estimator used. Practical methods are thus needed to diagnose and quantify estimator-specific positivity bias for a given model, parameter and sample. Cole and Hernan16 suggest a range of informal diagnostic approaches when the IPTW estimator is applied. Basic descriptive analyses of treatment variability within covariate strata can be helpful; however, this approach quickly becomes unwieldy when the covariate set is moderately large and includes continuous or multi-level variables. Examination of the distribution of the estimated weights can also provide useful information as near violations of the positivity assumption will be reflected in large weights. As noted by these authors and discussed above, however, well-behaved weights are not sufficient in themselves to ensure the absence of positivity violations.
An alternative formulation is to examine the distribution of the estimated propensity score values given by gn(a | W) for a ∈ 𝒜. Values of gn(a | W) close to 0 for any a constitute a warning regarding the presence of positivity violations. We note that examination of the propensity score distribution is a general approach not restricted to the IPTW estimator. However, while useful in diagnosing the presence of positivity violations, examination of the estimated propensity scores does not provide any quantitative estimate of the degree to which such violations are resulting in estimator bias and may pose a threat to inference. The parametric bootstrap can be used to provide an optimistic bias estimate specifically targeted at bias caused by positivity violations and near-violations.5
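A simple summary of the estimated propensity scores might look as follows. The threshold of 0.05 and the example values are arbitrary assumptions; the sketch reports the smallest estimated probability and the share of subjects for whom some treatment level has an estimated probability below the threshold.

```python
def propensity_summary(g_values, threshold=0.05):
    """Summarise estimated values g_n(a | W_i) over subjects i and levels a.

    g_values: list with one dict per subject, mapping each treatment level a
              to the estimated probability g_n(a | W_i).
    """
    flat = [p for per_subject in g_values for p in per_subject.values()]
    flagged = sum(1 for per_subject in g_values
                  if min(per_subject.values()) < threshold)
    return {"min": min(flat), "share_flagged": flagged / len(g_values)}

# Hypothetical estimated propensity scores for four subjects, binary treatment.
g_values = [{0: 0.60, 1: 0.40}, {0: 0.97, 1: 0.03},
            {0: 0.50, 1: 0.50}, {0: 0.999, 1: 0.001}]
summary = propensity_summary(g_values)
```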
4.1 The parametric bootstrap as a diagnostic tool
We focus on the bias of estimators that target a parameter of the observed data distribution; this target observed data parameter is equal under the randomisation assumption (6) to the target causal parameter. (Divergence between the target observed data parameter and target causal parameter when (6) fails is a distinct issue not addressed by the proposed diagnostic.) The bias in an estimator is the difference between the true value of the target parameter of the observed data distribution and the expectation of the estimator applied to a finite sample from that distribution:
$\mathrm{Bias}(\hat{\Psi}, P_0, n) = E_{P_0}\hat{\Psi}(P_n) - \Psi(P_0),$
where we recall that Ψ(P0) is the true value of the target observed data parameter, Ψ̂ is an estimator of that parameter (which may be a function of gn, Qn or both) and Pn is the empirical distribution of a sample of n i.i.d. observations from the true observed data distribution P0.
Bias in an estimator can arise due to a range of causes. First, the estimators gn and/or Qn may be inconsistent. Second, g0 may not satisfy the positivity assumption. Third, consistent estimators gn and/or Qn may still have substantial finite sample bias. This latter type of finite sample bias arises in particular due to the curse of dimensionality in a non-parametric or semi-parametric model when gn and/or Qn are data-adaptive estimators, although it can also be substantial for parametric estimators. Fourth, estimated values of gn may be equal or close to zero or one, despite use of a consistent estimator gn and a distribution g0 that satisfies positivity. The relative contribution of each of these sources of bias will depend on the model, the true data generating distribution, the estimator, and the finite sample.
The parametric bootstrap provides a tool that allows the analyst to explore the extent to which bias due to any of these causes is affecting a given parameter estimate. The parametric bootstrap-based bias estimate is defined as follows:
$$\mathrm{Bias}_{PB}(\hat{\Psi}, \hat{P}_0, n) = E_{\hat{P}_0}\,\hat{\Psi}(P_n^{\#}) - \Psi(\hat{P}_0) \qquad (10)$$
where P̂0 is an estimate of P0 and $P_n^{\#}$ is the empirical distribution of a bootstrap sample obtained by sampling from P̂0. In other words, the parametric bootstrap is used to sample from an estimate of the true data generating distribution, resulting in multiple simulated data sets. The true data generating distribution and target parameter value in the bootstrapped data are known. The candidate estimator is then applied to each bootstrapped data set and the mean of the resulting estimates across data sets is compared with the known ‘truth’ (i.e. the true parameter value for the bootstrap data generating distribution).
We focus on a particular algorithm for parametric bootstrap-based bias estimation, which specifically targets the component of estimator-specific finite sample bias due to violations and near violations of the positivity assumption. The goal is not to provide an accurate estimate of bias, but rather to provide a diagnostic tool that can serve as a ‘red flag’ warning that positivity bias may pose a threat to inference. The distinguishing characteristic of the diagnostic algorithm is its use of an estimated data generating distribution P̂0 that both approximates the true P0 as closely as possible and is compatible with the estimators Ǭn and/or gn used in Ψ̂(Pn). In other words, P̂0 is chosen such that the estimator Ψ̂ applied to bootstrap samples from P̂0 is guaranteed to be consistent unless g0 fails to satisfy the positivity assumption or gn is truncated. As a result, the parametric bootstrap provides an optimistic estimate of finite sample bias, in which bias due to model misspecification other than truncation is eliminated.
We refer informally to the resulting bias estimate as ETA.Bias because in many settings it will be predominantly composed of bias from the following sources: 1. violation of the positivity assumption by g0; 2. truncation, if any, of gn in response to positivity violations; and, 3. finite sample bias arising from values of gn close to zero or one (sometimes referred to as practical violations of the positivity assumption). The term ETA.Bias is imprecise because the bias estimated by the proposed algorithm will also capture some of the bias in Ψ̂(Pn) due to finite sample bias of the estimators gn and Ǭn (a form of sparsity only partially related to positivity). Due to the curse of dimensionality, the contribution of this latter source of bias may be substantial when gn and/or Qn are data-adaptive estimators in a non-parametric or semi-parametric model. However, the proposed diagnostic algorithm will only capture a portion of this bias because, unlike P0, P̂0 is guaranteed to have a functional form that can be well-approximated by the data-adaptive algorithms employed by gn and Qn.
The diagnostic algorithm for ETA.Bias is implemented as follows:
- Step 1. Estimate P0: Estimation of P0 requires estimation of Q0W, g0 and Q0Y (i.e. estimation of P0(W = w), P0(A = a | W = w) and P0(Y = y | A = a, W = w) for all (w, a, y)). We define QP̂0W = QPnW (or in other words, use an estimate based on the empirical distribution of the data), gP̂0 = gn and ǬP̂0 = Ǭn. Note that the estimators QPnW, gn and Ǭn were all needed for implementation of the IPTW, G-computation, and DR estimators; the same estimators QPnW, gn and Qn can be used here. Additional steps may be required to estimate the entire conditional distribution of Y given (A, W) (beyond the estimate of its mean given by Ǭn). The true target parameter for the known distribution P̂0 is only a function of Qn = (QPnW, Ǭn), and Ψ(P̂0) is the same as the G-computation estimator (using Qn) applied to the observed data: Ψ(P̂0) = Ψ̂Gcomp(Pn).
- Step 2. Generate $P_n^{\#}$ by sampling from P̂0: In the second step, we assume that P̂0 is the true data generating distribution. Bootstrap samples $P_n^{\#}$, each with n i.i.d. observations, are generated by sampling from P̂0. For example, W can be sampled from the empirical distribution, a binary A can be generated as a Bernoulli with probability gn(1 | W), and a continuous Y can be generated by adding an N(0, 1) error to Ǭn(A, W) (alternative approaches are also possible).
- Step 3. Estimate $E_{\hat{P}_0}\hat{\Psi}(P_n^{\#})$: Finally, the estimator Ψ̂ is applied to each bootstrap sample. Depending on the estimator being evaluated, this step involves first applying the estimators gn, Qn or both to each bootstrap sample. If Qn and/or gn are data-adaptive estimators, the corresponding data-adaptive algorithm should be re-run in each bootstrap sample; otherwise, the coefficients of the corresponding models should be refit. ETA.Bias is calculated by comparing the mean of the estimator Ψ̂ across bootstrap samples (an approximation of $E_{\hat{P}_0}\hat{\Psi}(P_n^{\#})$) with the true value of the target parameter under the bootstrap data generating distribution (Ψ(P̂0)).
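The three steps can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the check.ETA routine: W is a single discrete covariate, the IPTW estimator re-fits gn in each bootstrap sample as the observed treatment proportion within each level of W, and Y is generated with N(0, 1) errors as in the example given in Step 2. The function names (`iptw_ate`, `eta_bias`) and the values of `g_hat` and `q_hat` are hypothetical and stand in for whatever fitted gn and Ǭn the analyst has obtained.

```python
import random
from collections import Counter

def iptw_ate(sample, bounds=(0.0, 1.0)):
    """IPTW estimate of E(Y1 - Y0); g_n is re-fit as stratum proportions of W."""
    n_w = Counter(w for w, _, _ in sample)
    n_treated = Counter(w for w, a, _ in sample if a == 1)
    lo, hi = bounds
    total = 0.0
    for w, a, y in sample:
        g1 = min(max(n_treated[w] / n_w[w], lo), hi)  # bounded g_n(1 | w)
        total += y / g1 if a == 1 else -y / (1.0 - g1)
    return total / len(sample)

def eta_bias(w_obs, g1, qbar, estimator, n_boot=200, seed=0):
    """Parametric bootstrap bias estimate for `estimator` under P-hat_0."""
    rng = random.Random(seed)
    n = len(w_obs)
    # Step 1: the truth under P-hat_0 is the G-computation value based on
    # qbar and the empirical distribution of W.
    truth = sum(qbar(1, w) - qbar(0, w) for w in w_obs) / n
    estimates = []
    for _ in range(n_boot):
        # Step 2: draw W from the empirical, A ~ Bernoulli(g1(W)),
        # Y = qbar(A, W) + N(0, 1).
        sample = []
        for _ in range(n):
            w = rng.choice(w_obs)
            a = 1 if rng.random() < g1(w) else 0
            sample.append((w, a, qbar(a, w) + rng.gauss(0.0, 1.0)))
        # Step 3: re-apply the full estimator (including re-fitting g_n).
        estimates.append(estimator(sample))
    return sum(estimates) / n_boot - truth

# Hypothetical fitted distribution: treatment is rare when W = 3.
w_obs = [0, 1, 2, 3] * 125
g_hat = {0: 0.5, 1: 0.3, 2: 0.1, 3: 0.01}
q_hat = lambda a, w: 1.0 + a + w

bias_sparse = eta_bias(w_obs, lambda w: g_hat[w], q_hat, iptw_ate)
bias_ok = eta_bias(w_obs, lambda w: 0.5, q_hat, iptw_ate)
print(f"ETA.Bias with sparse stratum: {bias_sparse:.3f}")
print(f"ETA.Bias with g = 0.5 everywhere: {bias_ok:.3f}")
```

In this toy setting the bias estimate is driven by bootstrap samples in which no subject with W = 3 is treated, so the treated contribution of that stratum is lost. With parametric or data-adaptive gn and Ǭn, the same structure applies with the corresponding model fits substituted for the stratum proportions.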
The parametric bootstrap-based diagnostic applied to the IPTW estimator is available as an R function check.ETA in the cvDSA package.5 The routine takes the original data as input and performs bootstrap simulations under user-specified information such as functional forms for m(a, V | β), gn and Qn. Application of the bootstrap to the IPTW estimator offers one particularly sensitive assessment of positivity bias because, unlike the G-computation and DR estimators, the IPTW estimator cannot extrapolate based on Ǭn. However, this approach can be applied to any causal effect estimator, including estimators introduced in Section 6 that trade off proximity to the target parameter for identifiability. In assessing the threat posed by positivity violations, the bootstrap should ideally be applied to both the IPTW estimator and the estimator of choice.
4.1.1 Remarks on interpretation of the bias estimate
We caution against using the parametric bootstrap for any form of bias correction. The true bias of the estimator is EP0 Ψ̂(Pn) − Ψ(P0), while the parametric bootstrap estimates $E_{\hat{P}_0}\hat{\Psi}(P_n^{\#}) - \Psi(\hat{P}_0)$, where $P_n^{\#}$ denotes a bootstrap sample drawn from P̂0. The performance of the diagnostic therefore depends on the extent to which P̂0 approximates the true data generating distribution. This suggests the importance of using flexible data-adaptive algorithms to estimate P0. Regardless of estimation approach, however, when the target parameter Ψ(P0) is poorly identified due to positivity violations, Ψ(P̂0) may be a poor estimate of Ψ(P0). In such cases one would not expect the parametric bootstrap to provide a good estimate of the true bias. Further, the ETA.Bias implementation of the parametric bootstrap provides a deliberately optimistic bias estimate by excluding bias due to model misspecification for the estimators gn and Ǭn.
Rather, the parametric bootstrap is proposed as a diagnostic tool. Even when the data generating distribution is not estimated consistently, the bias estimate provided by the parametric bootstrap remains interpretable in the world where the estimated data generating mechanism represents the truth. If the estimated bias is large, an analyst who disregards the implied caution is relying on an unsubstantiated hope that first, he or she has inconsistently estimated the data generating distribution but still done a reasonable job estimating the causal effect of interest; and second, the true data generating distribution is less affected by positivity (and other finite sample) bias than is the analyst’s best estimate of it.
The threshold level of ETA.Bias that is considered problematic will vary depending on the scientific question and the point and variance estimates of the causal effect. With that caveat, we suggest the following two general situations in which ETA.Bias can be considered a ‘red flag’ warning: 1. when ETA.Bias is of the same magnitude as (or larger than) the estimated standard error of the estimator; and, 2. when the interpretation of a bias-corrected confidence interval would differ meaningfully from initial conclusions.
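The first of these two situations can be checked mechanically. The sketch below is only an illustration of that comparison: the function name `red_flag` and its `ratio` argument are hypothetical, and the inputs are the estimated standard errors and bias estimates reported for the unbounded TMLE in Table 5. The second situation depends on the scientific interpretation of the interval and is not captured here.

```python
def red_flag(eta_bias, std_err, ratio=1.0):
    """Return True when |ETA.Bias| is at least `ratio` times the estimated SE."""
    return abs(eta_bias) >= ratio * std_err

# (mutation, ETA.Bias, estimated SE) for the unbounded TMLE in Table 5
rows = [("p82AFST", -0.01, 0.13), ("p82MLC", -0.37, 0.14)]
flags = {label: red_flag(bias, se) for label, bias, se in rows}
print(flags)
```

Lowering `ratio` below 1 would also flag bias estimates that are only roughly of the same magnitude as the standard error; the appropriate value remains a judgement call tied to the scientific question.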
5 Application of the parametric bootstrap
5.1 Application to simulated data
5.1.1 Methods
Data were simulated using a data generating distribution published by Freedman and Berk.29 Two baseline covariates, W = (W1, W2), were generated bivariate normal, N(µ, Σ), with µ1 = 0.5, µ2 = 1 and covariance matrix Σ as specified in that publication. Ǭ0(A, W) ≡ E0(Y | A, W) was given by Ǭ0(A, W) = 1 + A + W1 + 2W2 and Y was generated as Ǭ0(A, W) + N(0, 1). The treatment mechanism g0(1 | W) ≡ P0(A = 1 | W) was given by g0(1 | W) = Φ(0.5 + 0.25W1 + 0.75W2), where Φ is the cumulative distribution function (CDF) of the standard normal distribution. With this treatment mechanism g0 ∈ [0.001, 1], resulting in practical violation of the positivity assumption. The target parameter was E(Y1 − Y0) (corresponding to marginal structural model m(a | β) = β0 + β1a). The true value of the target parameter was Ψ(P0) = 1.
The bias, variance and mean squared error of the G-computation, IPTW, A-IPTW and TMLE estimators were estimated by applying each estimator to 250 samples of size 1000 drawn from this data generating distribution. Each of the four estimators was implemented with each of the following three approaches: 1. use of a correctly specified model to estimate both Ǭ0 and g0 (a specification referred to as ‘Qcgc’); 2. use of a correctly specified model to estimate Ǭ0 and a misspecified model to estimate g0 (obtained by omitting W2 from gn, a specification referred to as ‘Qcgm’); and, 3. use of a correctly specified model to estimate g0 and a misspecified model to estimate Ǭ0 (obtained by omitting W2 from Ǭn, a specification referred to as ‘Qmgc’). The DR and IPTW estimators were further implemented using the following sets of bounds for the values of gn: [0, 1] (or no bounding), [0.025, 0.975],[0.05, 0.95] and [0.1, 0.9]. For the IPTW estimator, the latter three bounds correspond to truncation of the unstabilised weights at [1.03, 40], [1.05, 20], and [1.11, 10].
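The correspondence between bounds on gn and truncation of the unstabilised weights follows from the weights being 1/gn. A short check of the arithmetic, with `weight_range` as a hypothetical helper name:

```python
def weight_range(lower, upper):
    """Range of unstabilised weights 1/g when g is bounded to [lower, upper]."""
    return 1.0 / upper, 1.0 / lower

for lo, hi in [(0.025, 0.975), (0.05, 0.95), (0.1, 0.9)]:
    w_min, w_max = weight_range(lo, hi)
    print(f"g in [{lo}, {hi}] -> weights in [{w_min:.2f}, {w_max:.0f}]")
```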
The parametric bootstrap was then applied using the ETA.Bias algorithm to 10 of the 250 samples. For each sample and for each model specification (Qcgc,Qmgc and Qcgm), Qn and gn were used to draw 1000 parametric bootstrap samples. Specifically, W was drawn from the empirical distribution for that sample; A was generated given the bootstrapped values of W as a series of Bernoulli trials with probability gn(1 | W), and Y was generated given the bootstrapped values of A,W by adding a N(0, 1) error to Ǭn(A,W). Each candidate estimator was then applied to each bootstrap sample. In this step, the parametric models gn and Ǭn were held fixed and their coefficients refit. ETA.Bias was calculated for each of the 10 samples as the difference between the mean of the bootstrapped estimator and the initial G-computation estimate Ψ(P̂0) = Ψ̂Gcomp(Pn) in that sample. Additional simulations are discussed in a technical report and code is available at http://www.stat.berkeley.edu/laan/Software/index.html.30
5.1.2 Results
Table 2 demonstrates the effect of positivity violations and near-violations on estimator behaviour across 250 samples. The G-computation estimator remained minimally biased when the estimator Ǭn was consistent; use of inconsistent Ǭn resulted in bias. Given consistent estimators Ǭn and gn, the IPTW estimator was more biased than the other three estimators, as expected given the practical positivity violations present in the simulation. The finite sample performance of the A-IPTW and TMLE estimators was also affected by the presence of practical positivity violations. The DR estimators achieved the lowest mean squared error (MSE) when 1. Ǭn was consistent and 2. gn was inconsistent but satisfied positivity (as a result either of truncation or of omission of W2, a major source of positivity bias). Interestingly, in this simulation TMLE still did quite well when Ǭn was inconsistent and the model used for gn was correctly specified but its values bounded at [0.025, 0.975].
Table 2.
Bound on gn | Qcgc Bias | Qcgc Var | Qcgc MSE | Qcgm Bias | Qcgm Var | Qcgm MSE | Qmgc Bias | Qmgc Var | Qmgc MSE |
---|---|---|---|---|---|---|---|---|---|
G-COMP | |||||||||
None | 0.007 | 0.009 | 0.009 | 0.007 | 0.009 | 0.009 | 1.145 | 0.025 | 1.336 |
[0.025,0.975] | 0.007 | 0.009 | 0.009 | 0.007 | 0.009 | 0.009 | 1.145 | 0.025 | 1.336 |
[0.05,0.95] | 0.007 | 0.009 | 0.009 | 0.007 | 0.009 | 0.009 | 1.145 | 0.025 | 1.336 |
[0.1,0.9] | 0.007 | 0.009 | 0.009 | 0.007 | 0.009 | 0.009 | 1.145 | 0.025 | 1.336 |
IPTW | |||||||||
None | 0.544 | 0.693 | 0.989 | 1.547 | 0.267 | 2.660 | 0.544 | 0.693 | 0.989 |
[0.025,0.975] | 1.080 | 0.090 | 1.257 | 1.807 | 0.077 | 3.340 | 1.080 | 0.090 | 1.257 |
[0.05,0.95] | 1.437 | 0.059 | 2.123 | 2.062 | 0.054 | 4.306 | 1.437 | 0.059 | 2.123 |
[0.1,0.9] | 1.935 | 0.043 | 3.787 | 2.456 | 0.043 | 6.076 | 1.935 | 0.043 | 3.787 |
A-IPTW | |||||||||
None | 0.080 | 0.966 | 0.972 | −0.003 | 0.032 | 0.032 | −0.096 | 16.978 | 16.987 |
[0.025,0.975] | 0.012 | 0.017 | 0.017 | 0.006 | 0.017 | 0.017 | 0.430 | 0.035 | 0.219 |
[0.05,0.95] | 0.011 | 0.014 | 0.014 | 0.009 | 0.014 | 0.014 | 0.556 | 0.025 | 0.334 |
[0.1,0.9] | 0.009 | 0.011 | 0.011 | 0.008 | 0.011 | 0.011 | 0.706 | 0.020 | 0.519 |
TMLE | |||||||||
None | 0.251 | 0.478 | 0.540 | 0.026 | 0.059 | 0.060 | −0.675 | 0.367 | 0.824 |
[0.025,0.975] | 0.016 | 0.028 | 0.028 | 0.005 | 0.021 | 0.021 | −0.004 | 0.049 | 0.049 |
[0.05,0.95] | 0.013 | 0.019 | 0.020 | 0.010 | 0.016 | 0.017 | 0.163 | 0.027 | 0.054 |
[0.1,0.9] | 0.010 | 0.014 | 0.014 | 0.009 | 0.013 | 0.013 | 0.384 | 0.018 | 0.166 |
Choice of bound imposed on gn affected both the bias and variance of the IPTW, A-IPTW and TMLE estimators. As expected, truncation of the IPTW weights improved the variance of the estimator but increased bias. Without additional diagnostic information, an analyst who observed the dramatic decline in the variance of the IPTW estimator that occurred with weight truncation might have concluded that truncation improved estimator performance; however, in this simulation weight truncation increased mean square error (MSE). In contrast, and as predicted by theory, use of bounded values of gn decreased MSE of the DR estimators in spite of the inconsistency introduced to gn.
Table 3 shows the mean of ETA.Bias across 10 of the 250 samples; the variance of ETA.Bias across the samples was small (results available in a technical report).30 Based on the results shown in Table 2, a red flag was needed for the IPTW estimator with and without bounded gn and for the TMLE estimator without bounded gn. (The A-IPTW estimator without bounded gn exhibited a small to moderate amount of bias; however, the variance would likely have alerted an analyst to the presence of sparsity.) The parametric bootstrap correctly identified the presence of substantial finite sample bias in the IPTW estimator for all truncation levels and in the TMLE estimator with unbounded gn. ETA.Bias was minimal for the remaining estimators.
Table 3.
Bound on gn | None | [0.025,0.975] | [0.05,0.95] | [0.1,0.9] |
---|---|---|---|---|
G-computation estimator | ||||
Finite sample bias: Qcgc | 7.01e−03 | 7.01e−03 | 7.01e−03 | 7.01e−03 |
Mean(ETA.Bias): Qcgc | −8.51e−04 | −8.51e−04 | −8.51e−04 | −8.51e−04 |
Mean(ETA.Bias): Qcgm | 2.39e−04 | 2.39e−04 | 2.39e−04 | 2.39e−04 |
Mean(ETA.Bias): Qmgc | 5.12e−04 | 5.12e−04 | 5.12e−04 | 5.12e−04 |
IPTW estimator | ||||
Finite sample bias: Qcgc | 5.44e−01 | 1.08e+00 | 1.44e+00 | 1.93e+00 |
Mean(ETA.Bias): Qcgc | 4.22e−01 | 1.04e+00 | 1.40e+00 | 1.90e+00 |
Mean(ETA.Bias): Qcgm | 1.34e−01 | 4.83e−01 | 7.84e−01 | 1.23e+00 |
Mean(ETA.Bias): Qmgc | 2.98e−01 | 7.39e−01 | 9.95e−01 | 1.35e+00 |
A–IPTW estimator | ||||
Finite sample bias: Qcgc | 7.99e−02 | 1.25e−02 | 1.07e−02 | 8.78e−03 |
Mean(ETA.Bias): Qcgc | 1.86e−03 | 2.80e−03 | 5.89e−05 | 1.65e−03 |
Mean(ETA.Bias): Qcgm | −3.68e−04 | −6.36e−04 | 2.56e−05 | 5.72e−04 |
Mean(ETA.Bias): Qmgc | −3.59e−04 | 1.21e−04 | −1.18e−04 | −1.09e−03 |
TMLE estimator | ||||
Finite sample bias: Qcgc | 2.51e−01 | 1.60e−02 | 1.31e−02 | 9.98e−03 |
Mean(ETA.Bias): Qcgc | 1.74e−01 | 4.28e−03 | 2.65e−04 | 1.84e−03 |
Mean(ETA.Bias): Qcgm | 2.70e−02 | −3.07e−04 | 2.15e−04 | 7.74e−04 |
Mean(ETA.Bias): Qmgc | 1.11e−01 | 9.82e−04 | −2.17e−04 | −1.47e−03 |
For correctly specified Ǭn and gn (gn unbounded), the mean of ETA.Bias across the 10 samples was 78% and 69% of the true finite sample bias of the IPTW and TMLE estimators, respectively. The fact that the true bias was underestimated in both cases illustrates a limitation of the parametric bootstrap: its performance, even as an intentionally optimistic bias estimate, suffers when the target estimator is not asymptotically normally distributed.31 Bounding gn improved the ability of the bootstrap to accurately diagnose bias by improving estimator behaviour (in addition to adding a new source of bias due to truncation of gn). This finding suggests that practical application of the bootstrap to a given estimator should at minimum generate ETA.Bias estimates for a single low level of truncation of gn in addition to any unbounded estimate. When gn was bounded, the mean of ETA.Bias for the IPTW estimator across the 10 samples was 96–98% of the true finite sample bias; the finite sample bias for the TMLE estimator with bounded gn was accurately estimated to be minimal. Misspecification of gn or Ǭn by excluding a key covariate led to an estimated data generating distribution with less sparsity than the true P0, and as a result the parametric bootstrap underestimated bias to a greater extent for these model specifications.
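The percentages quoted above can be reproduced directly from the Qcgc rows of Table 3. The snippet below is only an arithmetic check; the variable and function names are illustrative.

```python
# Qcgc rows of Table 3: (finite sample bias, mean ETA.Bias) by bound on gn
iptw = {
    "None": (5.44e-01, 4.22e-01),
    "[0.025,0.975]": (1.08e+00, 1.04e+00),
    "[0.05,0.95]": (1.44e+00, 1.40e+00),
    "[0.1,0.9]": (1.93e+00, 1.90e+00),
}
tmle_unbounded = (2.51e-01, 1.74e-01)

def pct_captured(true_bias, eta_bias):
    """Mean ETA.Bias as a percentage of the finite sample bias."""
    return 100.0 * eta_bias / true_bias

iptw_pct = {bound: round(pct_captured(*v)) for bound, v in iptw.items()}
tmle_pct = round(pct_captured(*tmle_unbounded))
print(iptw_pct, tmle_pct)
```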
While use of an unbounded gn resulted in an underestimate of the true degree of finite sample bias for the IPTW and TMLE estimators, in this simulation the parametric bootstrap would still have functioned well as a diagnostic in each of the 10 samples considered. Table 4 reports the output that would have been available to an analyst applying the parametric bootstrap to the unbounded IPTW and TMLE estimators for each of the 10 samples. In all samples ETA.Bias was of roughly the same magnitude or larger than the estimated standard error of the estimator, and in most was of significant magnitude relative to the point estimate of the causal effect.
Table 4.
Sample | IPTW: β̂IPTW | IPTW: SE | IPTW: ETA.Bias | TMLE: β̂TMLE | TMLE: SE | TMLE: ETA.Bias |
---|---|---|---|---|---|---|
1 | 0.207 | 0.203 | 0.473 | 0.827 | 0.197 | 0.172 | ||
2 | 1.722 | 0.197 | 0.425 | 0.734 | 0.114 | 0.153 | ||
3 | 1.957 | 0.184 | 0.306 | 1.379 | 0.105 | 0.087 | ||
4 | 1.926 | 0.206 | 0.510 | 0.237 | 0.089 | 0.252 | ||
5 | 2.201 | 0.192 | 0.565 | 2.548 | 0.182 | 0.245 | ||
6 | 0.035 | 0.236 | 0.520 | 0.533 | 0.228 | 0.234 | ||
7 | 1.799 | 0.180 | 0.346 | 1.781 | 0.184 | 0.150 | ||
8 | 0.471 | 0.215 | 0.420 | 1.066 | 0.114 | 0.188 | ||
9 | 2.749 | 0.184 | 0.391 | 1.974 | 0.114 | 0.161 | ||
10 | 0.095 | 0.228 | 0.263 | 0.628 | 0.173 | 0.099 |
The simulation demonstrates how the parametric bootstrap can be used to investigate the trade-offs between bias due to weight truncation/bounding of gn and positivity bias. The parametric bootstrap accurately diagnosed both an increase in the bias of the IPTW estimator with increasing truncation and a reduction in the bias of the TMLE estimator with truncation. When viewed in light of the standard error estimates under different levels of truncation, the diagnostic would have accurately suggested that truncation of gn for the TMLE estimator was beneficial, while truncation of the weights for the IPTW estimator was of questionable benefit. (The parametric bootstrap can also be used to provide a more refined approach to choosing an optimal truncation constant based on estimated MSE.23)
These results further illustrate the benefit of applying the parametric bootstrap to the IPTW estimator in addition to the analyst’s estimator of choice. Diagnosis of substantial bias in the IPTW estimator due to positivity violations would have alerted an analyst that the G-computation estimator was relying heavily on extrapolation, and that the DR estimators were sensitive to bias arising from misspecification of the model used to estimate Ǭ0.
5.2 Data example: HIV resistance mutations
5.2.1 Data and question
We analysed an observational cohort of HIV-infected patients in order to estimate the effect of mutations in the HIV protease enzyme on viral response to the antiretroviral drug lopinavir. The question, data, and analysis have been described previously.32 Here, a simplified version of prior analyses was performed and the parametric bootstrap was applied to investigate the potential impact of positivity violations on results.
Briefly, baseline covariates, mutation profiles prior to treatment change, and viral response to therapy were collected for 401 treatment change episodes (TCEs) in which protease inhibitor-experienced subjects initiated a new antiretroviral regimen containing the drug lopinavir. We focused on two target mutations in the protease enzyme: p82AFST and p82MLC (present in 25% and 1% of TCEs, respectively). The data for each target mutation consisted of O = (W, A, Y), where A was a binary indicator that the target mutation was present prior to treatment change, W was a set of 35 baseline characteristics including summaries of past treatment history, mutations in the reverse transcriptase enzyme, and a genotypic susceptibility score for the background regimen (based on the Stanford scoring system; http://hivdb.stanford.edu/). The outcome Y was the change in log10(viral load) following initiation of the new antiretroviral regimen. The target observed data parameter was EW(E(Y | A = 1, W) − E(Y | A = 0, W)), equal under (6) to the average treatment effect E(Y1 − Y0).
5.2.2 Methods
Effect estimates were obtained for each mutation using the IPTW estimator and TMLE with a logistic fluctuation.33 Ǭ0 and g0 were estimated with stepwise forward selection of main terms based on the AIC criterion, using the step function in the stats v2.11.1 package in R. Estimators were implemented using both unbounded values for gn(A | W) and values truncated at [0.025, 0.975]. Following standard practice in much of the literature, standard errors were estimated using the influence curve, corresponding to the standard output for the glm and tmle functions in R, treating the values of gn as fixed. The parametric bootstrap was used to estimate the ETA.Bias of each estimator using 1000 samples and the ETA.Bias algorithm, with the step function rerun in each parametric bootstrap sample.
5.2.3 Results
Results for both mutations are presented in Table 5. p82AFST is known to be a major mutation for lopinavir resistance.34 The current results support this finding; the IPTW and TMLE point estimates were similar and both suggested a significantly more positive change in viral load (corresponding to a less effective drug response) among subjects with the mutation as compared to those without it. The parametric bootstrap-based bias estimate was minimal, raising no red flag that these findings might be attributable to positivity bias.
Table 5.
Mutation / bound on gn | TMLE: β̂TMLE | TMLE: SE | TMLE: ETA.Bias | IPTW: β̂IPTW | IPTW: SE | IPTW: ETA.Bias |
---|---|---|---|---|---|---|
p82AFST | ||||||||
[0, 1] | 0.65 | 0.13 | −0.01 | 0.66 | 0.15 | −0.01 | ||
[0.025, 0.975] | 0.62 | 0.13 | 0.00 | 0.66 | 0.15 | −0.01 | ||
p82MLC | ||||||||
[0, 1] | 2.85 | 0.14 | −0.37 | 1.29 | 0.14 | 0.09 | ||
[0.025, 0.975] | 0.86 | 0.10 | −0.01 | 0.80 | 0.23 | 0.08 |
The role of mutation p82MLC is less clear based on existing knowledge; depending on the scoring system used it is either not considered a lopinavir resistance mutation, or given an intermediate lopinavir resistance score (http://hivdb.stanford.edu/).34 Initial inspection of the point estimates and standard errors in the current analysis would have suggested that p82MLC had a large and highly significant effect on lopinavir resistance. Application of the parametric bootstrap-based diagnostic, however, would have suggested that these results should be interpreted with caution. In particular, the bias estimate for the unbounded TMLE was larger than the estimated standard error, while the bias estimate for the unbounded IPTW estimator was of roughly the same magnitude. While neither bias estimate was of sufficient magnitude relative to the point estimate to change inference, their size relative to the corresponding standard errors would have suggested that further investigation was warranted.
In response, the non-parametric bootstrap (based on 1000 bootstrap samples) was applied to provide an alternative estimate of the standard error. Using this alternative approach, the standard errors for the unbounded TMLE and IPTW estimators of the effect of p82MLC were estimated to be 2.77 and 1.17, respectively. Non-parametric bootstrap-based standard error estimates for the bounded TMLE and IPTW estimators were lower (0.84 and 1.12, respectively), but still substantially higher than the initial naive standard error estimates. These revised standard error estimates dramatically changed interpretation of results, suggesting that the current analysis was unable to provide essentially any information on the presence, magnitude, or direction of the p82MLC effect. (Non-parametric bootstrap-based standard error estimates for p82AFST were also somewhat larger than initial estimates, but did not change inference.)
In this example, ETA.Bias is expected to include some non-positivity bias due to the curse of dimensionality. However, the resulting bias estimate should still be interpreted as highly optimistic (i.e. as an underestimate of the true finite sample bias). The parametric bootstrap sampled from estimates of g0 and Ǭ0 that had been fit using the step() algorithm. This ensured that the estimators gn and Ǭn (which applied the same stepwise algorithm) would do a good job approximating gP̂0 and ǬP̂0 in each bootstrap sample. Clearly, no such guarantee exists for the true P0. This simple example further illustrates the utility of the non-parametric bootstrap for standard error estimation in the setting of sparse data and positivity violations. In this particular example, the improved variance estimate provided by the non-parametric bootstrap was sufficient to prevent positivity violations from leading to incorrect inference. As demonstrated in the simulations, however, in other settings improved variance estimates may still fail to alert the analyst to threats posed by positivity violations.
6 Practical approaches to causal inference in the presence of positivity violations
6.1 Approach no. 1: Change the projection function h(A, V)
Throughout this article we have focused on the target causal parameter β(FX, m, h) defined according to (4) as the projection of EFX (Ya | V) on the working marginal structural model m(a, V | β). Choice of function h(a, V) both defines the target parameter by specifying which values of (A, V) should be given greater weight when estimating β and, by assumption (8), defines the positivity assumption needed for β to be identifiable.
We have focused on parameters indexed by h(a, V) = 1, a choice that gives equal weight to estimating the counterfactual outcome for all values (a, v).11 Alternative choices of h(a, V) can significantly weaken the needed positivity assumption. For example, if the target of inference only involves counterfactual outcomes among some restricted range [c, d] of the possible values in 𝒜, defining h(a, V) = I(a ∈ [c, d]) weakens the positivity assumption by requiring sufficient variability only in the assignment of treatment levels within the target range. In some settings, the causal parameter defined by such a projection over a limited range of 𝒜 might be of substantial a priori interest. For example, one may wish to focus estimation of a drug dose response curve only on the range of doses considered reasonable for routine clinical use, rather than on the full range of doses theoretically possible or observed in a given data set.
An alternative approach, commonly employed in the context of IPTW estimation and introduced in Section 3.2, is to choose h(a, V) = g(a | V), where g(a | V) ≡ P(A = a | V) is the conditional probability of treatment given the covariates included in the marginal structural model. In the setting of IPTW estimation this choice corresponds to the use of stabilizing weights, a common approach to reducing the variance of the IPTW estimator in the face of sparsity.21 When the target causal parameter is defined using a marginal structural working model, use of h(a, V) = g(a | V) corresponds to a definition of the target parameter that gives greater weight to those regions of the joint distribution of (A, V) that are well-supported, and that relies on smoothing or extrapolation to a greater degree in regions that are not.11
Use of a marginal structural working model makes clear that the utility of choosing h(a, V) = g(a | V) as a method to approach data sparsity is not limited to the IPTW estimator. Recall that the G-computation estimator can be implemented by regressing predicted values for Ŷa on (a, V) according to model m(a, V | β) with weights provided by h(a, V). When the projection function is chosen to be g(a | V), this corresponds to a weighted regression in which weights are proportional to the degree of support in the data.
Even when one is ideally interested in the entire causal curve (implying a target parameter defined by choice h(a, V) = 1), specification of alternative choices for h offers a means of improving identifiability, at a cost of redefining the target parameter. For example, one can define a family of target parameters indexed by hδ(a, V) = I(a ∈ [c(δ), d(δ)]), where an increase in δ corresponds to progressive restriction on the range of treatment levels targeted by estimation. Fluctuation of δ thus corresponds to trading a focus on more limited areas of the causal curve for improved parameter identifiability. Selection of the final target from among this family can be based on an estimate of bias provided by the parametric bootstrap. For example, the bootstrap can be used to select the parameter with the smallest δ (i.e. the least restricted target) for which ETA.Bias falls below some pre-specified allowable threshold.
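A selection rule of this kind might be sketched as follows, assuming that ETA.Bias has already been estimated by the parametric bootstrap for each candidate δ. The function name `select_delta`, the grid of δ values and the bias values are all hypothetical and purely illustrative.

```python
def select_delta(bias_by_delta, threshold):
    """Smallest delta whose |ETA.Bias| is at or below threshold, else None."""
    acceptable = [d for d, b in bias_by_delta.items() if abs(b) <= threshold]
    return min(acceptable) if acceptable else None

# Hypothetical ETA.Bias estimates for increasingly restricted targets.
hypothetical_bias = {0.0: 0.42, 0.1: 0.20, 0.2: 0.04, 0.3: 0.01}
chosen = select_delta(hypothetical_bias, threshold=0.05)
print(chosen)
```

Returning `None` when no candidate meets the threshold makes explicit that the family of targets considered may not contain an adequately identified parameter.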
6.2 Approach no. 2: Restrict the adjustment set
Exclusion of problematic W (i.e. those covariates resulting in positivity violations or near violations) from the adjustment set provides a means to trade confounding bias for a reduction in positivity violations.35 In some cases, exclusion of covariates from the adjustment set may come at little or no cost to bias in the estimate of the target parameter. In particular, a subset of W that excludes covariates responsible for positivity violations may still be sufficient to control for confounding. In other words, a subset W′ ⊂ W may exist for which both identifying assumptions (6) and (7) hold (i.e. Ya ∐ A | W′ and g0(a | W′) > 0, a ∈ 𝒜), while positivity fails for the full set of covariates. In practice, this approach can be implemented by first determining candidate subsets of W under which the positivity assumption holds, and then using causal graphs to assess whether any of these candidates is sufficient to control for confounding. Even when no such candidate set can be identified, background knowledge (or sensitivity analysis) may suggest that problematic W represent a minimal source of confounding bias (Moore et al. provide an example).15 Often, however, those covariates that are most problematic from a positivity perspective are also strong confounders.
As suggested with respect to choice of projection function h(a, V) in the previous section, the causal effect estimator can be fine-tuned to select the degree of restriction on the adjustment set W according to some pre-specified rule for eliminating covariates from the adjustment set, and the parametric bootstrap used to select the minimal degree of restriction that maintains ETA.Bias below an acceptable threshold.35 In the case of substantial positivity violations, such an approach can result in small covariate adjustment sets. While such limited covariate adjustment accurately reflects a target parameter that is poorly supported by the available data, the resulting estimate can be difficult to interpret and will no longer carry a causal interpretation.
6.3 Approach no. 3: Restrict the sample
An alternative approach, sometimes referred to as ‘trimming’, is to discard classes of subjects for whom there exists no or limited variability in observed treatment assignment. A causal effect is then estimated in the remaining subsample. This approach is popular in the econometrics and social science literature; Crump et al. provide a recent review.36–39
When the subset of covariates responsible for positivity violations is low or one dimensional, such an approach can be implemented simply by discarding subjects with covariate values not represented in all treatment groups. For example, say that one aims to estimate the average effect of a binary treatment, and in order to control for confounding needs to adjust for W, a covariate with possible levels {1, 2, 3, 4}. However, inspection of the data reveals that no one in the sample with W = 4 received treatment (i.e. gn(1 | W = 4) = 0). The sample can be trimmed by excluding those subjects for whom W = 4 prior to applying a given causal effect estimator for the average treatment effect. As a result, the target parameter is shifted from E(Y1 − Y0) to E(Y1 − Y0 | W < 4), and the positivity assumption (7) now holds (as W = 4 occurs with zero probability).
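The low-dimensional trimming step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function name `trim_unsupported_strata`, the `(w, a, y)` record layout and the toy data are all hypothetical.

```python
from collections import defaultdict


def trim_unsupported_strata(records, treatment_levels=(0, 1)):
    """Drop subjects whose covariate stratum lacks some treatment level.

    records: list of (w, a, y) tuples, where w is a discrete covariate
    value, a the observed treatment and y the outcome (hypothetical layout).
    Returns (kept_records, dropped_strata).
    """
    # Collect the set of treatment levels observed within each stratum of W.
    seen = defaultdict(set)
    for w, a, _ in records:
        seen[w].add(a)

    # A stratum is dropped if any required treatment level never occurs in it.
    required = set(treatment_levels)
    dropped = {w for w, levels in seen.items() if not required <= levels}
    kept = [r for r in records if r[0] not in dropped]
    return kept, dropped


# Toy data (hypothetical): nobody with W = 4 is treated.
records = [
    (1, 0, 2.0), (1, 1, 3.0),
    (2, 0, 2.5), (2, 1, 3.5),
    (3, 0, 3.0), (3, 1, 4.5),
    (4, 0, 5.0), (4, 0, 5.5),
]
kept, dropped = trim_unsupported_strata(records)
```

In this toy sample only the stratum W = 4 is discarded, so any estimator applied to `kept` targets E(Y1 − Y0 | W < 4) rather than E(Y1 − Y0).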
Often W is too high dimensional to make this straightforward implementation feasible; in such a case matching on the propensity score provides a means to trim the sample. (While there is an extensive literature on propensity score-based effect estimators, a survey of these estimators is beyond the scope of the current review.) Several potential problems arise with the use of trimming methods to address positivity violations. First, discarding subjects responsible for positivity violations shrinks sample size, and thus runs the risk of increasing the variance of the effect estimate. Further, sample size and the extent to which positivity violations arise by chance are closely related. Depending on how trimming is implemented, new practical positivity violations can be introduced as sample size shrinks. Second, restriction of the sample may result in a causal effect for a population of limited interest. In other words, as can occur with alternative approaches to improve identifiability by shifting the target of inference, the parameter actually estimated may be far from the initial target. Further, when the criterion used to restrict the sample involves a summary of high dimensional covariates, such as is provided by the propensity score, it can be difficult to interpret the parameter estimated. Finally, when treatment is longitudinal, the covariates responsible for positivity violations may themselves be affected by past treatment.15 Trimming to remove positivity violations in this setting amounts to conditioning on post-treatment covariates and can thus introduce new bias.
Crump et al.36 propose an approach to trimming that falls within the general strategy of redefining the target parameter in order to explicitly capture the tradeoff between parameter identifiability and proximity to the initial target. In addition to focusing on the treatment effect in an a priori specified target population, they define an alternative target parameter corresponding to the average treatment effect in that subsample of the population for which the most precise estimate can be achieved. Crump et al. further suggest the potential for extending this approach to achieve an optimal (according to some user-specified criteria) trade-off between the representativeness of the subsample in which the effect is estimated and the variance of the estimate.
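When W is high dimensional, one simple way to operationalise trimming is to retain only subjects whose estimated propensity score lies in an interval [α, 1 − α]. The sketch below assumes a binary treatment and is offered as an illustration rather than as the specific procedure of any work cited here; the function name `trim_by_propensity`, the `"g1"` key and the cutoff of 0.1 are hypothetical choices.

```python
def trim_by_propensity(subjects, alpha=0.1):
    """Keep subjects whose estimated g(1 | W) lies in [alpha, 1 - alpha].

    subjects: list of dicts, each with key "g1" holding an estimated
    propensity score for treatment (hypothetical layout).
    alpha: user-supplied cutoff; 0.1 is an illustrative assumption.
    Returns (kept_subjects, fraction_dropped).
    """
    if not 0.0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 0.5)")
    if not subjects:
        return [], 0.0

    kept = [s for s in subjects if alpha <= s["g1"] <= 1.0 - alpha]
    fraction_dropped = (len(subjects) - len(kept)) / len(subjects)
    return kept, fraction_dropped


# Toy propensity scores (hypothetical values).
subjects = [{"g1": p} for p in (0.02, 0.15, 0.50, 0.85, 0.97)]
kept_subjects, fraction_dropped = trim_by_propensity(subjects, alpha=0.1)
```

Reporting `fraction_dropped` alongside the estimate makes explicit how far the trimmed population has moved from the initial target population.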
6.4 Approach no. 4: Change the intervention of interest
A final alternative for improving the identifiability of a causal parameter in the presence of positivity violations is to redefine the intervention of interest. Realistic rules rely on an estimate of the propensity score g(a | W) to define interventions that explicitly avoid positivity violations. This ensures that the causal parameter estimated is sufficiently supported by existing data.
Realistic interventions avoid positivity violations by first identifying subjects for whom a given treatment assignment is not realistic (i.e. subjects whose propensity score for a given treatment is small or zero) and then assigning an alternative treatment with better data support to those individuals. Such an approach is made possible by focusing on the causal effects of dynamic treatment regimes.40,41 The causal parameters described thus far are summaries of the counterfactual outcome distribution under a fixed treatment applied uniformly across the target population. In contrast, a dynamic regime assigns treatment in response to patient covariate values. This characteristic makes it possible to define interventions under which a subject is only assigned treatments that are possible (or ‘realistic’) given a subject’s covariate values.
To continue the previous example in which no subjects with W = 4 were treated, a realistic treatment rule might take the form ‘treat only those subjects with W less than 4.’ More formally, let d(W) refer to a treatment rule that deterministically assigns a treatment a ∈ 𝒜 based on a subject’s covariates W and consider the rule d(W) = I(W < 4). Let Yd denote the counterfactual outcome under the treatment rule d(W), which corresponds to treating a subject if and only if his or her covariate W is below 4. In this example E(Y0) is identified as ∑w E(Y | W = w, A = 0)P(W = w); however, since E(Y | W = w, A = 1) is undefined for W = 4, E(Y1) is not identified (unless we are willing to extrapolate based on W < 4). In contrast, E(Yd) is identified by the non-parametric G-computation formula: ∑w E(Y | W = w, A = d(w))P(W = w). Thus the treatment effect E(Yd − Y0) is identified, but E(Y1 − Y0) is not. The redefined causal parameter can be interpreted as the difference in expected counterfactual outcome if only those subjects with W < 4 were treated as compared to the outcome if no one were treated.
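The identification argument in this example can be checked numerically with a small non-parametric G-computation sketch. The code below is illustrative only: the helper names, the `(w, a, y)` record layout and the toy data are assumptions, and empirical cell means stand in for E(Y | W = w, A = a).

```python
from collections import defaultdict


def g_computation(records, rule):
    """Plug-in estimate of sum_w E(Y | W = w, A = rule(w)) P(W = w).

    records: list of (w, a, y) tuples (hypothetical layout).
    rule: function mapping a covariate value w to a treatment level.
    Returns None when a required (w, a) cell is empty, i.e. when the
    parameter is not identified without extrapolation.
    """
    sums, counts, w_counts = defaultdict(float), defaultdict(int), defaultdict(int)
    for w, a, y in records:
        sums[(w, a)] += y
        counts[(w, a)] += 1
        w_counts[w] += 1

    n = len(records)
    total = 0.0
    for w, n_w in w_counts.items():
        cell = (w, rule(w))
        if counts[cell] == 0:
            return None  # no data support for this treatment in stratum w
        total += (sums[cell] / counts[cell]) * (n_w / n)
    return total


# Toy data (hypothetical): nobody with W = 4 is treated.
records = [
    (1, 0, 2.0), (1, 1, 3.0),
    (2, 0, 2.5), (2, 1, 3.5),
    (3, 0, 3.0), (3, 1, 4.5),
    (4, 0, 5.0), (4, 0, 5.5),
]
e_y0 = g_computation(records, lambda w: 0)           # static rule: treat no one
e_y1 = g_computation(records, lambda w: 1)           # static rule: treat everyone
e_yd = g_computation(records, lambda w: int(w < 4))  # realistic rule d(W) = I(W < 4)
```

Here `e_y1` is `None` because the cell (W = 4, A = 1) is empty, whereas `e_y0` and `e_yd` are both computable, mirroring the conclusion that E(Yd − Y0) is identified while E(Y1 − Y0) is not.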
More generally, realistic rules indexed by a given static treatment a assign a only to those individuals for whom the probability of receiving a is greater than some user-specified probability α (such as α = 0.05). Let d(a, W) denote the rule indexed by static treatment a. If A is binary, then d(1, W) = 1 if g(1 | W) > α, otherwise d(1, W) = 0. Similarly, d(0, W) = 0 if g(0 | W) > α; otherwise d(0, W) = 1. Realistic causal parameters are defined as some parameter of the distribution of Yd(a,W) (possibly conditional on some subset of baseline covariates V ⊂ W). Estimation of the causal effects of dynamic rules d(W) allows the positivity assumption to be relaxed to g(d(W) | W) > 0 almost everywhere (i.e. only those treatments that would be assigned based on rule d to patients with covariates W need to occur with positive probability within strata of W). Realistic rules d(a, W) are designed to satisfy this assumption by definition.
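For binary A, the realistic rule d(a, W) defined above reduces to a single comparison against α. The following sketch takes the (estimated) propensity score g(1 | W) as given; the function name `realistic_rule` and the default α = 0.05 are illustrative assumptions.

```python
def realistic_rule(a, g1, alpha=0.05):
    """Return d(a, W) for a binary treatment.

    a: the static treatment (0 or 1) that indexes the rule.
    g1: g(1 | W), the probability of treatment given covariates W,
        so that g(0 | W) = 1 - g1.
    alpha: user-specified threshold (0.05 is an illustrative default).
    """
    if a not in (0, 1):
        raise ValueError("a must be 0 or 1")
    g_a = g1 if a == 1 else 1.0 - g1
    # Assign a only if it is realistic; otherwise assign the alternative.
    return a if g_a > alpha else 1 - a


assignments = [realistic_rule(1, g1) for g1 in (0.50, 0.01, 0.99)]
```

Whenever α < 0.5, the alternative treatment has probability at least 1 − α > α, so the treatment actually assigned by the rule always occurs with positive probability.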
When a given treatment level a is unrealistic (i.e. when g(a | W) < α), realistic rules assign an alternative from among viable (well-supported) choices. Choice of an alternative is straightforward when treatment is binary. When treatment has more than two levels, however, a rule for selecting the alternative treatment level is needed. One option is to assign a treatment level that is as close as possible to the original assignment while still remaining realistic. For example, if high doses of drugs occur with low probability in a certain subset of the population, a realistic rule might assign the maximum dose that occurs with probability > α in that subset. An alternative class of dynamic regimes, referred to as ‘intent-to-treat’ rules, instead assign a subject to his or her observed treatment value if an initial assignment is deemed unrealistic. Bembom and van der Laan14 and Moore et al.15 provide illustrations of both of these types of realistic rules using simulated and real data.
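The ‘closest realistic level’ option for a multilevel treatment can be sketched as follows. The function name, the dictionary of per-subject treatment probabilities and the tie-breaking rule (prefer the lower level when two realistic levels are equally close) are assumptions made for illustration; they are not taken from the cited work.

```python
def closest_realistic_level(a, g, alpha=0.05):
    """Return the realistic treatment level nearest to the intended level a.

    a: intended (static) numeric treatment level, e.g. a dose.
    g: dict mapping each treatment level to g(level | W) for one subject.
    alpha: user-specified threshold defining 'realistic'.
    Ties are broken in favour of the lower level (an arbitrary assumption).
    """
    realistic = [level for level, prob in g.items() if prob > alpha]
    if not realistic:
        raise ValueError("no treatment level exceeds the threshold alpha")
    return min(realistic, key=lambda level: (abs(level - a), level))


# Hypothetical subject for whom the highest dose (3) is rarely given.
g_subject = {0: 0.50, 1: 0.30, 2: 0.18, 3: 0.02}
assigned_high = closest_realistic_level(3, g_subject)  # dose 3 is unrealistic here
assigned_mid = closest_realistic_level(1, g_subject)   # dose 1 is realistic
```

For this subject the rule indexed by dose 3 assigns dose 2, the maximum dose occurring with probability greater than α, while the rule indexed by dose 1 leaves the assignment unchanged.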
The causal effects of realistic rules clearly differ from their static counterparts. The extent to which the new target parameter diverges from the initial parameter of interest depends on both the extent to which positivity violations occur in the finite sample (i.e. the extent of support available in the data for the initial target parameter) and on the user-supplied threshold α. The parametric bootstrap approach presented in Section 4 can be employed to data-adaptively select α based on the level of ETA.Bias deemed acceptable.14
6.5 Selection among a family of parameters
Each of the methods described for estimating causal effects in the presence of data sparsity corresponds to a particular strategy for altering the target parameter in exchange for improved identifiability. In each case, we have outlined how this tradeoff could be made systematically, based on some user-specified criterion such as a maximum acceptable level of ETA.Bias as estimated by the parametric bootstrap. We now summarise this general approach in terms of a formal method for estimation in the face of positivity violations.
Define a family of parameters. The family should include the initial target of inference together with a set of related parameters, indexed by γ in index set I, where γ represents the extent to which a given family member trades improved identifiability for decreased proximity to the initial target. In the examples given in the previous sections, γ could be used to index a set of projection functions h(a, V) based on an increasingly restrictive range of the possible values 𝒜, the degree to which the adjustment covariate set or sample is restricted, or the choice of a threshold used to define a realistic rule.
Apply the parametric bootstrap to generate an estimate ETA.Bias for each γ ∈ I. In particular, this involves estimating the data generating distribution, simulating new data from this estimate, and then applying an estimator of each target parameter indexed by γ.
Select the target parameter from the set that falls below a pre-specified threshold for acceptable ETA.Bias. In particular, select the parameter from within this set that is indexed by the value γ that corresponds to the greatest proximity to the initial target.
This approach allows an estimator to be defined in terms of an algorithm that estimates the parameter within a candidate family that is as close to the initial target of inference as possible while remaining within some user-supplied limit on the extent of tolerable bias due to positivity violations.
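The final selection step of this procedure can be sketched as below, taking the parametric bootstrap estimates of ETA.Bias as given. The sketch assumes that γ is numeric and that smaller γ means greater proximity to the initial target; this ordering, the function name `select_parameter` and the bias values are hypothetical.

```python
def select_parameter(eta_bias_by_gamma, threshold):
    """Choose the family member closest to the initial target whose
    estimated ETA.Bias is within the acceptable threshold.

    eta_bias_by_gamma: dict mapping gamma to the ETA.Bias estimate obtained
        from the parametric bootstrap (supplied by the analyst).
    threshold: maximum acceptable absolute ETA.Bias.
    Assumes smaller gamma means closer to the initial target.
    Returns the selected gamma, or None if no member is acceptable.
    """
    admissible = [
        gamma for gamma, bias in eta_bias_by_gamma.items()
        if abs(bias) < threshold
    ]
    return min(admissible) if admissible else None


# Hypothetical bootstrap output: gamma = 0 is the initial target parameter.
eta_bias = {0: 0.30, 1: 0.12, 2: 0.04, 3: 0.01}
selected = select_parameter(eta_bias, threshold=0.05)
```

With these illustrative values the initial target (γ = 0) and its nearest neighbour (γ = 1) are rejected, and γ = 2 is selected as the closest member with acceptable estimated bias.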
7 Conclusions
The identifiability of causal effects relies on sufficient variation in treatment assignment within covariate strata. The strong version of positivity requires that each possible treatment occur with positive probability in each covariate stratum; depending on the model and target parameter, this assumption can be relaxed to some extent. In addition to assessing identifiability based on measurement of and control for sufficient confounders, data analyses should directly assess threats to identifiability posed by positivity violations.
The parametric bootstrap is a practical tool for assessing such threats, and provides a quantitative estimator-specific estimate of bias arising largely from positivity violations. This article has focused on the positivity assumption for the causal effect of a treatment assigned at a single time point. Extension to a longitudinal setting in which the goal is to estimate the effect of multiple treatments assigned sequentially over time introduces considerable additional complexity. First, practical violations of the positivity assumption can arise more readily in this setting. Under the longitudinal version of the positivity assumption the conditional probability of each possible treatment history should remain positive regardless of covariate history. However, this probability is the product of time point-specific treatment probabilities given the past. When the product is taken over multiple time points it is easy for treatment histories with very small conditional probabilities to arise. Second, longitudinal data make it harder to diagnose the bias arising due to positivity violations. Implementation of the parametric bootstrap in longitudinal settings requires Monte Carlo simulation both to implement the G-computation estimator and to generate each bootstrap sample. In particular, this requires estimating and sampling from the time-point specific conditional distributions of all covariates and treatment given the past. Additional research on assessing the impact of positivity bias on longitudinal causal parameters is needed, including investigation of the parametric bootstrap in this setting.
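A short calculation illustrates how quickly the probability of a treatment history shrinks as a product over time points. The probabilities below are hypothetical, and `history_probability` is simply an illustrative name for the product.

```python
import math


def history_probability(per_timepoint_probs):
    """Conditional probability of a full treatment history, computed as the
    product of the time point-specific treatment probabilities given the past."""
    return math.prod(per_timepoint_probs)


# Hypothetical: the treatment of interest has probability 0.5 at each of
# 10 time points, which raises no concern at any single time point.
p_history = history_probability([0.5] * 10)

# The corresponding (unstabilised) inverse probability weight.
weight = 1.0 / p_history
```

Even though no single time point is problematic here, the history has probability 0.5^10 ≈ 0.001 and the resulting weight is 1024, which illustrates how practical positivity violations can emerge from the product alone.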
When positivity violations occur for structural reasons rather than due to chance, a causal parameter that avoids these positivity violations will often be of substantial interest. For example, when certain treatment levels are contraindicated for certain types of individuals, the average treatment effect in the population may be of less interest than the effect of treatment among that subset of the population without contraindications, or alternatively, the effect of an intervention that assigns treatment only to those subjects without contraindications. Similarly, the effect of a multilevel treatment may be of greatest interest for only a subset of treatment levels.
In other cases researchers may be happy to settle for a better estimate of a less interesting parameter. Sample restriction, estimation of realistic parameters, and change in projection function h(a, V) all change the causal effect being estimated; in contrast, restriction of the covariate adjustment set often results in estimation of a non-causal parameter. However, all of these approaches can be understood as means to shift from a poorly identified initial target towards a parameter that is less ambitious but more fully supported by the available data. The new estimand is not determined a priori by the question of interest, but rather is driven by the observed data distribution in the finite sample at hand. There is thus an explicit trade-off between identifiability and proximity to the initial target of inference. Ideally, this trade-off will be made in a systematic way rather than on an ad hoc basis at the discretion of the investigator. Definition of an estimator that selects among a family of parameters according to some pre-specified criteria is a means to formalise this trade-off. An estimate of bias based on the parametric bootstrap can be used to implement the tradeoff in practice.
In summary, we offer the following advice for applied analyses: First, define the causal effect of interest based on careful consideration of structural positivity violations. Second, consider estimator behaviour in the context of positivity violations when selecting an estimator. Third, apply the parametric bootstrap to provide a quantitative measure of estimator bias under data simulated to approximate the true data generating distribution. Finally, when positivity violations are a concern, choose an estimator that selects systematically among a family of parameters based on the trade-off between data support and proximity to the initial target of inference.
References
- 1. Cochran WG. Analysis of covariance: its nature and uses. Biometrics. 1957;13:261–281.
- 2. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.
- 3. Robins JM. Addendum to: A new approach to causal inference in mortality studies with a sustained exposure period–application to control of the healthy worker survivor effect. Math. Model. 1986;7(9–12):1393–1512. (MR 87m:92078. Comput. Math. Appl. 1987 14(9–12): 923–945.
- 4. Robins JM. Proceedings of the American Statistical Association: Section on Bayesian Statistical Science. Alexandria, VA: 1999. Robust estimation in sequentially ignorable missing data and causal inference models; pp. 6–10.
- 5. Wang Y, Petersen M, Bangsberg D, van der Laan MJ. Technical Report 211, Division of Biostatistics. Berkeley: University of California; 2006. Diagnosing bias in the inverse probability of treatment weighted estimator resulting from violation of experimental treatment assignment.
- 6. Pearl J. Causality: models, reasoning, and inference. New York: Cambridge University Press; 2000.
- 7. Neyman J. On the application of probability theory to agricultural experiments. Essay on principles: section 9. Stat Sci. 1923;5:465–480.
- 8. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
- 9. Robins JM. Proceedings of the American Statistical Association. Section on Bayesian Statistical Science 1997. Alexandria, VA: 1998. Marginal structural models; pp. 1–10.
- 10. Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran E, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) New York: Springer; 1999. pp. 95–133.
- 11. Neugebauer R, van der Laan MJ. Non-parametric causal effects based on marginal structural models. J Stat Plan Infer. 2007;137(2):419–434.
- 12. Robins JM. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J Chronic Dis. 1987;40(2):139s–161s. doi: 10.1016/s0021-9681(87)80018-8.
- 13. Neugebauer R, van der Laan MJ. Why prefer DR estimates. J Stat Plan Infer. 2005;129(1–2):405–426.
- 14. Bembom O, van der Laan MJ. A practical illustration of the importance of realistic individualized treatment rules in causal inference. EJS. 2007;1:574–596. doi: 10.1214/07-EJS105.
- 15. Moore KL, Neugebauer RS, van der Laan MJ, Tager IB. Technical Report 255, Division of Biostatistics. Berkeley: University of California; 2009. Causal inference in epidemiological studies with strong confounding.
- 16. Cole SR, Hernan MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–664. doi: 10.1093/aje/kwn164.
- 17. Rosenblum MM, van der Laan MJ. Confidence intervals for the population mean tailored to small sample sizes, with applications to survey sampling. Int J Biostat. 2001;1:4. doi: 10.2202/1557-4679.1118.
- 18. van der Laan MJ, Dudoit S. Technical Report 130, Division of Biostatistics. Berkeley: University of California; 2003. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples.
- 19. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Genet Mol Biol. 2007;6 doi: 10.2202/1544-6115.1309.
- 20. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. London: Springer; 2009.
- 21. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560. doi: 10.1097/00001648-200009000-00011.
- 22. Kish L. Weighting for unequal pi. J Official Stat. 1992;8:183–200.
- 23. Bembom O, van der Laan MJ. Technical Report 230, Division of Biostatistics. Berkeley: University of California; 2008. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators.
- 24. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, Inference for semiparametric models: some questions and an answer. Stat Sinica. 2001;11(4):920–936.
- 25. Robins JM. Commentary on using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard by Dawson and Lavori. Stat Med. 2002;21:1663–1680. doi: 10.1002/sim.1111.
- 26. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder) J Am Stat Assoc. 1999;94(1096–1120):1121–1146.
- 27. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):11.
- 28. Rosenblum M, van der Laan M. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat. 2010;6(2):19. doi: 10.2202/1557-4679.1238.
- 29. Freedman DA, Berk RA. Weighting regressions by propensity scores. Eval Rev. 2008;32(4):392–409. doi: 10.1177/0193841X08317586.
- 30. Petersen ML, Porter K, Gruber S, Wang Y, van der Laan M. Technical report, Division of Biostatistics. Berkeley: University of California; 2010. Diagnosing and responding to violations in the positivity assumption.
- 31. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
- 32. Bembom O, Petersen ML, Rhee S-Y, et al. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Stat Med. 2009;28:152–172. doi: 10.1002/sim.3414.
- 33. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1) doi: 10.2202/1557-4679.1182. Article 18.
- 34. Johnson VA, Brun-Vezinet F, Clotet B, et al. Update of the drug resistance mutations in HIV-1: December 2009. Top HIV Med. 2009;17(5):138–145.
- 35. Bembom O, Fessel JW, Shafer RW, van der Laan MJ. Technical Report 231, Division of Biostatistics. Berkeley: University of California; 2008. Data-adaptive selection of the adjustment set in variable importance estimation.
- 36. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical Report 330, National Bureau of Economic Research. 2006.
- 37. LaLonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev. 1986;76:604–620.
- 38. Heckman J, Ichimura H, Todd R. Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud. 1997;64:605–654.
- 39. Dehejia R, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94:1053–1062.
- 40. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):3. doi: 10.2202/1557-4679.1022.
- 41. Robins JM, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat Med. 2008;27:4678–4721. doi: 10.1002/sim.3301.