Diagnosing and responding to violations in the positivity assumption

Maya L Petersen; Kristin E Porter; Susan Gruber; Yue Wang; Mark J van der Laan

doi:10.1177/0962280210386207

Author manuscript; available in PMC: 2014 Jul 23.

Published in final edited form as: Stat Methods Med Res. 2010 Oct 28;21(1):31–54. doi: 10.1177/0962280210386207

Maya L Petersen ¹, Kristin E Porter ¹, Susan Gruber ¹, Yue Wang ², Mark J van der Laan ¹

PMCID: PMC4107929 NIHMSID: NIHMS603976 PMID: 21030422

Abstract

The assumption of positivity or experimental treatment assignment requires that observed treatment levels vary within confounder strata. This article discusses the positivity assumption in the context of assessing model and parameter-specific identifiability of causal effects. Positivity violations occur when certain subgroups in a sample rarely or never receive some treatments of interest. The resulting sparsity in the data may increase bias with or without an increase in variance and can threaten valid inference. The parametric bootstrap is presented as a tool to assess the severity of such threats and its utility as a diagnostic is explored using simulated and real data. Several approaches for improving the identifiability of parameters in the presence of positivity violations are reviewed. Potential responses to data sparsity include restriction of the covariate adjustment set, use of an alternative projection function to define the target parameter within a marginal structural working model, restriction of the sample, and modification of the target intervention. All of these approaches can be understood as trading off proximity to the initial target of inference for identifiability; we advocate approaching this tradeoff systematically.

Keywords: experimental treatment assignment, positivity, marginal structural model, inverse probability weight, double robust, causal inference, counterfactual, parametric bootstrap, realistic treatment rule, trimming, stabilised weights, truncation

1 Introduction

Incomplete control of confounding is a well-recognised source of bias in causal effect estimation: measured covariates must be sufficient to control for confounding in order for causal effects to be identified based on observational data. The identifiability of causal effects further requires sufficient variability in treatment or exposure assignment within strata of confounders. The dangers of causal effect estimation in the absence of adequate data support have long been understood.¹ More recent causal inference literature refers to the need for adequate exposure variability within confounder strata as the assumption of positivity or experimental treatment assignment.^2–4 While perhaps less well-recognised than confounding bias, violations and near violations of the positivity assumption can increase both the variance and bias of causal effect estimates, and if undiagnosed can threaten the validity of causal inferences.

Positivity violations can arise for two reasons. First, it may be theoretically impossible for individuals with certain covariate values to receive a given exposure of interest. For example, certain patient characteristics may constitute an absolute contraindication to receipt of a particular treatment. The threat to causal inference posed by such structural or theoretical violations of positivity does not improve with increasing sample size. Second, violations or near violations of positivity can arise in finite samples due to chance. This is a particular problem in small samples, but also occurs frequently in moderate to large samples when the treatment is continuous or can take multiple levels, or when the covariate adjustment set is large and/or contains continuous or multi-level covariates. Regardless of the cause, causal effects may be poorly or non-identified when certain subgroups in a finite sample do not receive some of the treatment levels of interest. In this article, we will use the term ‘sparsity’ to refer to positivity violations and near-violations arising from either of these causes, recognising that other types of sparsity can also threaten valid inference.

In this article, we discuss the positivity assumption within a general framework for assessing the identifiability of causal effects. The causal model and target causal parameter are defined using a non-parametric structural equation model (NPSEM) and the positivity assumption is introduced as a key assumption needed for parameter identifiability. The counterfactual or potential outcome framework is then used to review estimation of the target parameter, assessment of the extent to which data sparsity threatens valid inference for this parameter, and practical approaches for responding to such threats. For clarity, we focus on a simple data structure in which treatment is assigned at a single time point. Concluding remarks generalise to more complex longitudinal data structures.

Data sparsity can increase both the bias and variance of a causal effect estimator; the extent to which each is affected depends on the estimator. An estimator-specific diagnostic tool is thus required to quantify the extent to which positivity violations threaten the validity of inference for a given causal effect parameter (for a given model, data-generating distribution and finite sample). Wang et al.⁵ proposed such a diagnostic based on the parametric bootstrap. Application of a candidate estimator to bootstrapped data sampled from the estimated data generating distribution provides information about the estimator’s behaviour under a data generating distribution that is based on the observed data. The true parameter value in the bootstrap data is known and can be used to assess estimator bias. A large bias estimate can alert the analyst to the presence of a parameter that is poorly identified, an important warning in settings where data sparsity may not be reflected in the variance of the causal effect estimate.

Once bias due to violations in positivity has been diagnosed, the question remains how best to proceed with estimation. We review several approaches. Identifiability can be improved by extrapolating based on subgroups in which sufficient treatment variability does exist; however, such an approach requires additional parametric model assumptions. Alternative approaches for responding to sparsity include the following: restriction of the sample to those subjects for whom the positivity assumption is not violated (known as trimming); re-definition of the causal effect of interest as the effect of only those treatments that do not result in positivity violations (estimation of the effects of ‘realistic’ or ‘intention to treat’ dynamic regimes); restriction of the covariate adjustment set to exclude those covariates responsible for positivity violations; and, when the target parameter is defined using a marginal structural working model, use of a projection function that focuses estimation on areas of the data with greater support.

As we discuss, all of these approaches change the parameter being estimated by trading proximity to the original target of inference for improved identifiability. We advocate incorporation of this trade-off into the effect estimator itself. This requires defining a family of parameters, the members of which vary in their proximity to the initial target and in their identifiability. An estimator can then be defined that selects among the members of this family according to some pre-specified criteria.

1.1 Outline

The article is structured as follows. Section 2 introduces an NPSEM for a simple point treatment data structure, defines the target causal parameter using a marginal structural working model and discusses conditions for parameter identifiability with an emphasis on the positivity assumption. Section 3 reviews three classes of causal effect estimators and discusses the behaviour of these estimators in the presence of positivity violations. Section 4 reviews approaches for assessing threats to inference arising from positivity violations, with a focus on the parametric bootstrap. Section 5 investigates the performance of the parametric bootstrap as a diagnostic tool using simulated and real data. Section 6 reviews methods for responding to positivity violations once they have been diagnosed, and integrates these methods into a general approach to sparsity that is based on defining a family of parameters. Section 7 provides some concluding remarks and advocates a systematic approach to possible violations in positivity.

2 Framework for causal effect estimation

We proceed from the basic premise that model assumptions should honestly reflect investigator knowledge. The NPSEM framework provides a systematic approach for translating background knowledge into a causal model and corresponding statistical model, defining a target causal parameter, and assessing the identifiability of that parameter.⁶ We illustrate this approach using a simple point treatment data structure. We minimise notation by focusing on discrete-valued random variables.

2.1 Model

Let W denote a set of baseline covariates on a subject, let A denote a treatment or exposure variable and let Y denote an outcome. Specify the following structural equation model (with random input U ~ P_U):

W = f_{W} (U_{W}),
A = f_{A} (W, U_{A}),
Y = f_{Y} (W, A, U_{Y}),

(1)

where U = (U_W, U_A, U_Y) denotes the set of background factors that deterministically assign values to (W, A, Y) according to functions (f_W, f_A, f_Y). Each of the equations in this model is assumed to represent a mechanism that is autonomous, in the sense that changing or intervening on the equation will not affect the remaining equations, and that is functional, in the sense that the equation reflects assumptions about how the observed data were in fact generated by Nature. In addition, each of the equations is non-parametric: its specification does not require assumptions regarding the true functional form of the underlying causal relationships. However, if aspects of the functional form of any of these equations are known based on background knowledge, such knowledge can be incorporated into the model. The background factors U are assumed to be jointly independent in this particular model; in other words, the model is assumed to be Markov. The NPSEM framework can, however, also be applied to non-Markov models.⁶

Let the observed data consist of n i.i.d. observations O₁,…, O_n of O = (W, A, Y) ~ P₀. Causal model (1) places no restrictions on the allowed distributions for P₀, and thus implies a non-parametric statistical model.

2.2 Target causal parameter

A causal effect can be defined in terms of the joint distribution of the observed data under an intervention on one or more of the structural equations. For example, consider the post-intervention distribution of Y under an intervention on the structural model to set A = a. Such an intervention corresponds to replacing A = f_A(W, U_A) with A = a in the structural model (1). The counterfactual outcome that a given subject with background factors u would have had if he or she were to have received treatment level a is denoted Y_a(u).^7,8 This counterfactual can be derived as the solution to the structural equation f_Y in the modified equation system with input U = u.

Let F_X denote the distribution of X = (W, (Y_a : a ∈ 𝒜)), where 𝒜 denotes the possible values that the treatment variable can take (e.g. {0, 1} for a binary treatment). F_X describes the joint distribution of the baseline covariates and counterfactual outcomes under a range of interventions on treatment variable A. A causal effect can be defined as some parameter of F_X. For example, a common target parameter for binary A is the average treatment effect

E_{F_{X}} (Y_{1} - Y_{0}),

(2)

or the difference in expected counterfactual outcome if every subject in the population had received versus had not received treatment.
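As a concrete illustration of model (1) and parameter (2), the following Python sketch simulates a toy structural equation model and approximates the average treatment effect by intervening on the treatment equation. The specific functions f_W, f_A and f_Y, the numerical constants and the helper name `simulate` are hypothetical choices made for this example; they are not taken from the article.

```python
import numpy as np

def simulate(n, set_a=None, seed=0):
    """Draw n observations from a toy version of structural model (1).

    If set_a is given, the equation A = f_A(W, U_A) is replaced by A = set_a,
    so the returned outcome is the counterfactual Y_a.
    """
    rng = np.random.default_rng(seed)
    # Jointly independent background factors U = (U_W, U_A, U_Y).
    u_w, u_a, u_y = rng.uniform(size=(3, n))
    w = (u_w < 0.5).astype(int)                      # W = f_W(U_W)
    if set_a is None:
        a = (u_a < 0.2 + 0.6 * w).astype(int)        # A = f_A(W, U_A)
    else:
        a = np.full(n, set_a)                        # intervention: A = a
    y = (u_y < 0.1 + 0.3 * a + 0.4 * w).astype(int)  # Y = f_Y(W, A, U_Y)
    return w, a, y

# The same seed gives the same background factors u under both interventions.
_, _, y1 = simulate(100_000, set_a=1)
_, _, y0 = simulate(100_000, set_a=0)
ate = (y1 - y0).mean()  # Monte Carlo approximation of E(Y_1 - Y_0)
```

In this toy model the true average treatment effect is 0.3 by construction, so `ate` should be close to that value.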

Alternatively, an investigator may be interested in estimating the average treatment effect separately within certain strata of the population and/or for non-binary treatments. Specification of a marginal structural model (a model on the conditional expectation of the counterfactual outcome given effect modifiers of interest) provides one option for defining the target causal parameter in such cases.^4,9,10 Marginal structural models take the following form:

E_{F_{X}} (Y_{a} | V) = m (a, V | β),

(3)

where V ⊂ W denotes the strata in which one wishes to estimate a conditional causal effect. For example, one might specify the following model:

m (a, V | β) = β_{1} + β_{2} a + β_{3} V + β_{4} a V .

For a binary treatment A ∈ {0, 1}, such a model implies an average treatment effect within stratum V = v equal to β₂ + β₄v.
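The stratum-specific effect follows by evaluating the example model at a = 1 and a = 0 and taking the difference (a short worked step, assuming the model is correctly specified):

```latex
\begin{aligned}
E_{F_X}(Y_1 - Y_0 \mid V = v)
  &= m(1, v \mid \beta) - m(0, v \mid \beta) \\
  &= (\beta_1 + \beta_2 + \beta_3 v + \beta_4 v) - (\beta_1 + \beta_3 v) \\
  &= \beta_2 + \beta_4 v .
\end{aligned}
```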

The true functional form of E_{F_X}(Y_a | V) will generally not be known. One option is to assume that the parametric model m(a, V | β) is correctly specified, or in other words that E_{F_X}(Y_a | V) = m(a, V | β) for some value β. Such an approach, however, can place additional restrictions on the allowable distributions of the observed data and thus change the statistical model. In order to respect the premise that the statistical model should faithfully reflect the limits of investigator knowledge and not be altered in order to facilitate definition of the target parameter, we advocate an alternative approach in which the target causal parameter is defined using a marginal structural working model. Under this approach the target parameter β is defined as the projection of the true causal curve E_{F_X}(Y_a | V) onto the specified model m(a, V | β) according to some projection function h(a, V):¹¹

β (F_{X}, m, h) = \underset{β}{arg min} E_{F_{X}} [\sum_{a \in 𝒜} {(Y_{a} - m (a, V | β))}^{2} h (a, V)] .

(4)

When h(a, V) = 1, the target parameter β corresponds to an unweighted projection of the entire causal curve onto the model m(a, V | β); alternative choices of h correspond to placing greater emphasis on specific parts of the curve (i.e. on certain (a, V) values).
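The role of h in projection (4) can be sketched numerically. The example below assumes V is empty, a three-level treatment and a linear working model m(a | β) = β₀ + β₁a; the causal curve values and the function name `project_onto_linear_model` are hypothetical. Because the variance of Y_a around its mean does not depend on β, minimising (4) is equivalent to a weighted least squares fit of the causal curve itself, with weights h.

```python
import numpy as np

def project_onto_linear_model(a_levels, causal_curve, h):
    """Weighted least squares projection of E(Y_a) onto beta0 + beta1 * a."""
    a = np.asarray(a_levels, dtype=float)
    target = np.asarray(causal_curve, dtype=float)
    sqrt_h = np.sqrt(np.asarray(h, dtype=float))
    design = np.column_stack([np.ones_like(a), a])
    # Scaling rows by sqrt(h) turns ordinary least squares into weighted least squares.
    beta, *_ = np.linalg.lstsq(design * sqrt_h[:, None], target * sqrt_h, rcond=None)
    return beta

# A non-linear "true" curve E(Y_a) = a^2, so the linear model is misspecified.
beta_unweighted = project_onto_linear_model([0, 1, 2], [0.0, 1.0, 4.0], h=[1, 1, 1])
beta_reweighted = project_onto_linear_model([0, 1, 2], [0.0, 1.0, 4.0], h=[1, 1, 0])
```

With h = 1 the projection summarises the whole curve; setting h to zero at a = 2 yields a different parameter that describes only the part of the curve at a = 0 and a = 1.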

Use of a marginal structural working model such as (4) is attractive because it allows the target causal parameter to be defined within the original statistical model. However, this approach by no means absolves the investigator from careful consideration of marginal structural model specification. A poorly specified model m(a, V | β) may result in a target parameter that provides a poor summary of the features of the true causal relationship that are of interest.

In the following sections we discuss the parameter β(F_X, m, 1) as the target of inference, corresponding to a focus on estimation of the treatment-specific mean for all levels a ∈ 𝒜 within strata of V as projected onto model m, with projection h(a, V) = 1 chosen to reflect a focus on the entire causal curve. To simplify notation we use β to refer to this target parameter unless otherwise noted.

2.3 Identifiability

We assess whether the target parameter β of the counterfactual data distribution F_X is identified as a parameter of the observed data distribution P₀ under causal model (1). Because model (1) is Markov, we have that

P_{F_{X}} (Y_{a} = y) = \sum_{w} P_{0} (Y = y | W = w, A = a) P_{0} (W = w),

(5)

identifying the target parameter β according to projection (4).⁶ This identifiability result is often referred to as the G-computation formula.^2,3,12

The weaker randomisation assumption, or the assumption that A and Y_a are conditionally independent given W, is also sufficient for identifiability result (5) to hold.

Randomisation assumption

A ∐ Y_{a} | W for all a \in 𝒜 .

(6)

Whether or not a given structural model implies that assumption (6) holds can be assessed directly from the corresponding causal graph through the back door criterion.⁶

2.3.1 The need for experimentation in treatment assignment

The G-computation formula (5) is valid only if the conditional distributions in the formula are well-defined. Let g₀(a | W) ≡ P₀(A = a | W), a ∈ 𝒜 denote the conditional distribution of treatment variable A given covariates under the observed data distribution P₀. If one or more treatment levels of interest do not occur within some covariate strata, the conditional probability P₀(Y = y | A = a, W = w) will not be well-defined for some value(s) (a, w) and identifiability result (5) will break down.

A simple example provides intuition into the threat to parameter identifiability posed by sparsity of this nature. Consider an example in which W = I(woman), A is a binary treatment, and no women are treated (g₀(1 | W = 1) = 0). In this data generating distribution, there is no information regarding outcomes among treated women. Thus, as long as there are women in the target population (i.e. P₀(W = 1) > 0), the average treatment effect E_{F_X}(Y₁−Y₀) will not be identified without additional parametric assumptions.
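The breakdown of formula (5) in this example can be made concrete with a non-parametric plug-in computation on a toy dataset. The data values and the function name `gcomp_mean` are hypothetical and serve only to illustrate the point.

```python
from collections import defaultdict

def gcomp_mean(records, a):
    """Plug-in version of formula (5) for E(Y_a) with discrete W.

    records is a list of (w, a, y) tuples. Returns None when some stratum
    W = w contains no subject with A = a, because the stratum-specific mean
    of Y is then undefined.
    """
    n = len(records)
    w_counts = defaultdict(int)
    cell_sum = defaultdict(float)
    cell_n = defaultdict(int)
    for w, a_obs, y in records:
        w_counts[w] += 1
        if a_obs == a:
            cell_sum[w] += y
            cell_n[w] += 1
    total = 0.0
    for w, count in w_counts.items():
        if cell_n[w] == 0:
            return None  # no data on Y given A = a, W = w
        total += (cell_sum[w] / cell_n[w]) * (count / n)
    return total

# W = 1 denotes women; no woman in this toy sample is treated.
records = [(0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 0)]
mean_untreated = gcomp_mean(records, a=0)  # defined
mean_treated = gcomp_mean(records, a=1)    # None: undefined among women
```

The untreated mean can still be computed, whereas the treated mean, and therefore the average treatment effect, cannot be obtained without further assumptions.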

This simple example illustrates that a given causal parameter under a given model may be identified for some joint distributions of the observed data but not for others. An additional assumption is thus needed to ensure identifiability. We begin by presenting the strong version of this assumption, needed for the identification of P_{F_X}((Y_a = y, W = w) : a, y, w) in a non-parametric model.

Strong positivity assumption

inf_{a \in 𝒜} g_{0} (a | W) > 0, a.e.

(7)

The strong positivity assumption, or assumption of experimental treatment assignment (ETA), states that each possible treatment level occurs with some positive probability within each stratum of W.

Parametric model assumptions may allow the positivity assumption to be weakened. In the example described above, an assumption that the treatment effect is the same among treated men and women would result in identification of the average treatment effect (2) based on extrapolation from the estimated treatment effect among men (assuming that other identifiability assumptions were met). Parametric model assumptions of this nature are particularly dangerous, however, because they extrapolate to regions of the joint distribution of (A, W) that are not supported by the data. Such assumptions should be approached with caution and adopted only when they have a solid foundation in background knowledge.

In addition to being model specific, the form of the positivity assumption needed for identifiability is parameter specific. Many target causal parameters require much weaker versions of positivity than (7). To take one simple example, if the target parameter is E(Y₁), the identifiability result only requires that g₀(1 | W) > 0 hold; it does not matter if there are some strata of the population in which no one was treated. Similarly, the identifiability of β(F_X, m, h), defined using a marginal structural working model, relies on a weaker positivity assumption.

Positivity assumption for β(F_X, m, h)

sup_{a \in 𝒜} \frac{h (a, V)}{g_{0} (a | W)} < \infty, a.e.

(8)

Choice of projection function h(a, V) used to define the target parameter therefore has implications for how strong an assumption of positivity is needed for identifiability. In Section 6, we consider specification of alternative target parameters that allow for weaker positivity assumptions than (7), including parameters indexed by alternative choices of h(a, V). For now we focus on the target parameter β indexed by the choice h(a, V)=1 and note that (7) and (8) are equivalent for this parameter.

3 Estimator-specific behaviour in the face of positivity violations

Let Ψ(P₀) denote the target parameter value, a function of the observed data distribution. Under the assumptions of randomisation (6) and positivity (8), Ψ(P₀) equals the target causal parameter β(F_X, m, h). Estimators of this parameter are denoted Ψ̂(P_n), where P_n is the empirical distribution of a sample of n i.i.d. observations from P₀. We use Q_0W(w) ≡ P₀(W = w), Q_0Y(y | A, W) ≡ P₀(Y = y | A, W) and Ǭ₀(A, W) ≡ E₀(Y | A, W). Recall that g₀(a | W) ≡ P₀(A = a | W). We review three classes of estimators Ψ̂(P_n) of β that employ estimators of distinct parts of the observed data likelihood. Maximum likelihood-based substitution estimators (also referred to as ‘G-computation’ estimators) employ estimators of Q₀ ≡ (Q_0W, Ǭ₀). Inverse probability weighted estimators employ estimators of g₀. Double robust (DR) estimators employ estimators of both g₀ and Q₀. A summary of these estimators is provided in Table 1. Their behaviour in the face of positivity violations is illustrated in Section 5 and previous work.^11–16

Table 1. Overview of three classes of causal effect estimator

G-computation estimator
Needed for implementation	Estimator Q_n of Q₀
Needed for consistency	Q_n is a consistent estimator of Q₀
Response to sparsity	Extrapolates based on Q_n; sparsity can amplify bias due to model misspecification

IPTW estimator
Needed for implementation	Estimator g_n of g₀
Needed for consistency	g_n is a consistent estimator of g₀; g₀ satisfies positivity
Response to sparsity	Does not extrapolate based on Q_n; sensitive to positivity violations and near violations

DR estimators
Needed for implementation	Estimator g_n of g₀ and Q_n of Q₀
Needed for consistency	g_n is consistent or Q_n is consistent; g_n converges to a distribution that satisfies positivity
Response to sparsity	Can extrapolate based on Q_n; without positivity, relies on consistency of Q_n

We focus our discussion on bias in the point estimate of the target parameter β. While estimates of the variance of β can also be biased when data are sparse, methods exist to improve variance estimation. The non-parametric bootstrap provides one straightforward approach to variance estimation in settings where the central limit theorem may not apply as a result of sparsity; alternative approaches to correct for biased variance estimates are also possible.¹⁷ These methods will not, however, protect against misleading inference if the point estimate itself is biased.

3.1 G-computation estimator

The G-computation estimator Ψ̂_Gcomp(P_n) takes as input the empirical data distribution P_n and provides as output a parameter estimate β̂_Gcomp. Ψ̂_Gcomp(P_n) is a substitution estimator based on identifiability result (5). It is implemented based on an estimator of Q₀ ≡ (Q_0W, Ǭ₀) and its consistency relies on the consistency of this estimator.^2,3 Q_0W can generally be estimated based on the empirical distribution of W. However, even when positivity is not violated, the dimension of (A, W) is frequently too large for Ǭ₀ to be estimated simply by evaluating the mean of Y within strata of (A, W). Due to the curse of dimensionality, estimation of Ǭ₀ under a non-parametric or semi-parametric statistical model therefore frequently requires data-adaptive approaches, such as cross-validated loss-based learning.^18–20

Given an estimator Ǭ_n of Ǭ₀, the G-computation estimator can be implemented by generating a predicted counterfactual outcome for each subject under each possible treatment: Ŷ_{a,i} = Ǭ_n(a, W_i) for a ∈ 𝒜, i = 1, …, n. The estimate β̂_Gcomp is then obtained by regressing Ŷ_a on a and V according to the model m(a, V | β), with weights based on the projection function h(a, V).
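The two steps just described can be sketched as follows. This is a minimal illustration, assuming a binary treatment, V empty, h = 1, the working model m(a | β) = β₀ + β₁a and a simple parametric outcome regression in place of a data-adaptive estimator; the simulated data and the names `outcome_features` and `gcomp_estimate` are hypothetical.

```python
import numpy as np

def outcome_features(a, w):
    """Design matrix of a hypothetical parametric outcome regression."""
    a = np.asarray(a, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.column_stack([np.ones_like(w), a, w, a * w])

def gcomp_estimate(w, a, y, a_levels=(0, 1)):
    # Step 1: estimate E(Y | A, W), here by ordinary least squares.
    coef, *_ = np.linalg.lstsq(outcome_features(a, w), y, rcond=None)
    # Step 2: predict the counterfactual outcome of every subject under every level a.
    stacked_a = np.repeat(np.asarray(a_levels, dtype=float), len(w))
    stacked_w = np.tile(np.asarray(w, dtype=float), len(a_levels))
    y_hat = outcome_features(stacked_a, stacked_w) @ coef
    # Step 3: regress the predictions on a under m(a | beta) = beta0 + beta1 * a.
    design = np.column_stack([np.ones_like(stacked_a), stacked_a])
    beta, *_ = np.linalg.lstsq(design, y_hat, rcond=None)
    return beta

# Toy data in which W confounds the effect of A on Y; the true beta is (1.5, 2.0).
rng = np.random.default_rng(1)
n = 5_000
w = rng.binomial(1, 0.5, size=n)
a = rng.binomial(1, 0.3 + 0.4 * w)
y = 1.0 + 2.0 * a + w + rng.normal(size=n)
beta_gcomp = gcomp_estimate(w, a, y)
```

Here the outcome regression happens to be correctly specified for the simulated data; with a misspecified regression the same code would run but the estimate could be biased.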

When not all treatment levels of interest are represented within every covariate stratum (i.e. assumption (7) is violated), some of the conditional probabilities in the non-parametric G-computation formula (5) will not be defined. A given estimator Ǭ_n may allow the G-computation estimator to extrapolate based on covariate strata in which sufficient experimentation in treatment level does exist. Importantly, however, this extrapolation depends heavily on the model for Ǭ₀ and the resulting effect estimates will be biased if the model used to estimate Q₀ is misspecified.

3.2 Inverse probability of treatment weighted estimator

The inverse probability of treatment weighted (IPTW) estimator Ψ̂_IPTW(P_n) takes as input the empirical data distribution P_n and provides as output a parameter estimate β̂_IPTW based on an estimator g_n of g₀(A | W).^10,21 The estimator is defined as the solution in β to the following estimating equation:

0 = \sum_{i = 1}^{n} \frac{h (A_{i}, V_{i})}{g_{n} (A_{i} | W_{i})} \frac{d}{d β} (m (A_{i}, V_{i} | β)) (Y_{i} - m (A_{i}, V_{i} | β)),

(9)

where h(A, V) is the projection function used to define the target causal parameter β(F_X, m, h) according to (4). The IPTW estimator of β can be implemented as the solution to a weighted regression of the outcome Y on treatment A and effect modifiers V according to model m(A, V | β), with weights equal to h(A, V)/g_n(A | W). Consistency of Ψ̂_IPTW(P_n) requires that g₀ satisfies positivity and that g_n is a consistent estimator of g₀. As with Ǭ₀, g₀ can be estimated using loss-based learning and cross validation. Depending on choice of projection function, implementation may further require estimation of h(A, V); however, the consistency of the IPTW estimator does not depend on consistent estimation of h(A, V).
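For a linear working model, solving estimating equation (9) amounts to weighted least squares. The sketch below assumes a binary treatment, a discrete W, V empty and m(a | β) = β₀ + β₁a, and estimates g₀ by treatment proportions within strata of W; the simulated data and the names `stratum_propensity` and `iptw_estimate` are hypothetical.

```python
import numpy as np

def stratum_propensity(w, a):
    """g_n(A_i | W_i) from treatment proportions within strata of discrete W."""
    g = np.empty(len(w), dtype=float)
    for level in np.unique(w):
        rows = w == level
        p_treated = a[rows].mean()
        g[rows] = np.where(a[rows] == 1, p_treated, 1.0 - p_treated)
    return g

def iptw_estimate(a, y, g, h=1.0):
    """Weighted least squares with weights h / g_n for m(a | beta) = beta0 + beta1 * a."""
    sqrt_wt = np.sqrt(h / g)
    design = np.column_stack([np.ones(len(a)), a])
    beta, *_ = np.linalg.lstsq(design * sqrt_wt[:, None], y * sqrt_wt, rcond=None)
    return beta

# Toy data in which W confounds the effect of A on Y; the true beta1 is 2.0.
rng = np.random.default_rng(1)
n = 5_000
w = rng.binomial(1, 0.5, size=n)
a = rng.binomial(1, 0.3 + 0.4 * w)
y = 1.0 + 2.0 * a + w + rng.normal(size=n)

beta_iptw = iptw_estimate(a, y, stratum_propensity(w, a))
beta_naive = iptw_estimate(a, y, np.ones(n))  # unweighted, hence confounded
```

In this toy setting positivity holds comfortably, so the weighted fit should recover a slope near 2 while the unweighted fit remains confounded by W.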

The IPTW estimator is particularly sensitive to bias due to data sparsity. Bias can arise due to structural positivity violations (positivity does not hold for g₀) or may occur because by chance certain covariate and treatment combinations are not represented in a given finite sample (g_n(a | W = w) may have values of zero or close to zero for some (a,w) even when positivity holds for g₀ and g_n is consistent).^5,13–16 In the latter case, as fewer individuals within a given covariate stratum receive a given treatment, the weights of those rare individuals who do receive the treatment become more extreme. The disproportionate reliance of the causal effect estimate on the experience of a few unusual individuals can result in substantial finite sample bias.

While values of g_n(a | W) remain positive for all a ∈ 𝒜, elevated weights inflate the variance of the effect estimate and can serve as a warning that the data may poorly support the target parameter. However, as the number of individuals within a covariate stratum who receive a given treatment level shifts from few (each of whom receives a large weight and thus increases the variance) to none, estimator variance can decrease while bias increases rapidly. In other words, when g_n(a | W = w) = 0 for some (a, w), the weight for a subject with A = a and W = w is infinite; however, as no such individuals exist in the dataset, the corresponding threat to valid inference will not be reflected in either the weights or in estimator variance.

3.2.1 Weight truncation

Weights are commonly truncated or bounded in order to improve the performance of the IPTW estimator in the face of data sparsity.^5,15,16,22,23 Weights are truncated at either a fixed or relative level (for example, at the 1st and 99th percentiles), thereby reducing the variance arising from large weights and limiting the impact of a few possibly non-representative individuals on the effect estimate. This advantage comes at a cost, however, in the form of increased bias due to misspecification of the treatment model g_n, a bias that does not decrease with increasing sample size.
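Truncation at relative levels can be sketched in a few lines. The function name `truncate_weights`, the default percentiles and the simulated propensity values are assumptions made for illustration.

```python
import numpy as np

def truncate_weights(weights, lower_pct=1.0, upper_pct=99.0):
    """Bound weights at the given lower and upper percentiles."""
    lower, upper = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, lower, upper)

# Hypothetical estimated treatment probabilities, some of them close to zero.
rng = np.random.default_rng(0)
g_n = rng.uniform(0.001, 1.0, size=10_000)
raw_weights = 1.0 / g_n
truncated_weights = truncate_weights(raw_weights)
```

The truncated weights have a smaller range and variance than the raw weights, which is the source of both the variance reduction and the bias described above.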

3.2.2 Stabilised weights

Use of projection function h(a, V) = 1 implies the use of unstabilised weights. In contrast, stabilised weights, corresponding to a choice h(a, V) = g₀(a | V) (where g₀(a | V) ≡ P₀(A = a | V)), are generally recommended for the implementation of the IPTW estimator. The choice h(a, V) = g₀(a | V) results in a weaker positivity assumption (8) by allowing the IPTW estimator to extrapolate to sparse areas of the joint distribution of (A, V) using the model m(a, V | β). For example, if A is an ordinal variable with multiple levels, V = {}, and the target parameter is defined using the model m(a, V | β) = β₀ + β₁a, the IPTW estimator with stabilised weights will extrapolate to levels of A that are sparsely represented in the data by assuming a linear relationship between Y_a and a for a ∈ 𝒜. However, when the target parameter β is defined using a marginal structural working model according to (4) (an approach that acknowledges that the model m(A, V | β) may be misspecified), the use of stabilised versus unstabilised weights corresponds to a shift in the target parameter via choice of an alternative projection function.¹¹
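The difference between the two weighting schemes can be seen in a small numerical sketch, assuming a binary treatment, a binary W and V empty, so that h reduces to the marginal probability of the observed treatment. The simulated data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
w = rng.binomial(1, 0.5, size=n)
a = rng.binomial(1, np.where(w == 1, 0.9, 0.1))  # treatment strongly driven by W

# Denominator g_n(A_i | W_i): treatment proportions within strata of W.
p_treated_given_w = np.array([a[w == level].mean() for level in (0, 1)])[w]
g_n = np.where(a == 1, p_treated_given_w, 1.0 - p_treated_given_w)

# Numerator h_n(A_i): marginal proportion of the observed treatment (V empty).
p_treated = a.mean()
h_n = np.where(a == 1, p_treated, 1.0 - p_treated)

unstabilised = 1.0 / g_n
stabilised = h_n / g_n
```

With these empirical estimates the stabilised weights average one and the unstabilised weights average the number of treatment levels, and each stabilised weight is smaller than its unstabilised counterpart.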

3.3 Double robust estimators

Double robust estimators of β include the augmented inverse probability weighted estimator (A-IPTW) and the targeted maximum likelihood estimator (TMLE). For the target parameter β(F_X, m, h), TMLE corresponds to the extended DR parametric regression estimator of Scharfstein et al.^4,24–28 Implementation of the DR estimators requires estimators of both Q₀ and g₀; as with the IPTW and G-computation estimators, a non-parametric loss-based approach can be employed for both. An implementation of the TMLE estimator of the average treatment effect E(Y₁ − Y₀) is available in the R package tmleLite; an implementation of the A-IPTW estimator for a point treatment marginal structural model is available in the R package cvDSA (both available at http://www.stat.berkeley.edu/~laan/Software/index.html). Prior literature provides further details regarding implementation and theoretical properties.^4,11,13,24,26–28

DR estimators remain consistent if either: 1. g_n is a consistent estimator of g₀ and g₀ satisfies positivity; or, 2. Q_n is a consistent estimator of Q₀ and g_n converges to a distribution g^* that satisfies positivity. Thus, when positivity holds, these estimators are truly DR, in the sense that consistent estimation of either g₀ or Q₀ results in a consistent estimator. When positivity fails, however, the consistency of the DR estimators relies entirely on consistent estimation of Q₀. In the setting of positivity violations, DR estimators are thus faced with the same vulnerabilities as the G-computation estimator.

In addition to illustrating how positivity violations increase the vulnerability of DR estimators to bias resulting from inconsistent estimation of Q₀, these asymptotic results have practical implications for the implementation of the DR estimators. Specifically, they suggest that the use of an estimator g_n that yields predicted values in [0 + γ, 1 − γ] (where γ is some small number) can improve finite sample performance. One way to achieve such bounds is by truncating the predicted probabilities generated by g_n, similar to the process of weight truncation described for the IPTW estimator.

4 Diagnosing bias due to positivity violations

Positivity violations can result in substantial bias, with or without a corresponding increase in variance, regardless of the causal effect estimator used. Practical methods are thus needed to diagnose and quantify estimator-specific positivity bias for a given model, parameter and sample. Cole and Hernán¹⁶ suggest a range of informal diagnostic approaches when the IPTW estimator is applied. Basic descriptive analyses of treatment variability within covariate strata can be helpful; however, this approach quickly becomes unwieldy when the covariate set is moderately large and includes continuous or multi-level variables. Examination of the distribution of the estimated weights can also provide useful information as near violations of the positivity assumption will be reflected in large weights. As noted by these authors and discussed above, however, well-behaved weights are not sufficient in themselves to ensure the absence of positivity violations.

An alternative formulation is to examine the distribution of the estimated propensity score values given by g_n(a | W) for a ∈ 𝒜. Values of g_n(a | W) close to 0 for any a constitute a warning regarding the presence of positivity violations. We note that examination of the propensity score distribution is a general approach not restricted to the IPTW estimator. However, while useful in diagnosing the presence of positivity violations, examination of the estimated propensity scores does not provide any quantitative estimate of the degree to which such violations are resulting in estimator bias and may pose a threat to inference. The parametric bootstrap can be used to provide an optimistic bias estimate specifically targeted at bias caused by positivity violations and near-violations.⁵
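An informal check of the estimated propensity scores might look like the following sketch for a binary treatment. The function name `propensity_summary` and the 0.05 threshold are arbitrary assumptions chosen for illustration; as noted above, such a summary flags possible violations but does not quantify the resulting bias.

```python
def propensity_summary(g_treated, threshold=0.05):
    """Summarise estimated values of g_n(1 | W) for a binary treatment.

    Values near zero flag strata where treatment is rare; values near one
    flag strata where control is rare, since g_n(0 | W) = 1 - g_n(1 | W).
    """
    n = len(g_treated)
    near_zero = sum(p < threshold for p in g_treated)
    near_one = sum(p > 1.0 - threshold for p in g_treated)
    return {
        "min": min(g_treated),
        "max": max(g_treated),
        "share_near_zero": near_zero / n,
        "share_near_one": near_one / n,
    }

# Hypothetical estimated scores for four subjects.
summary = propensity_summary([0.01, 0.50, 0.97, 0.60])
```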

4.1 The parametric bootstrap as a diagnostic tool

We focus on the bias of estimators that target a parameter of the observed data distribution; this target observed data parameter is equal under the randomisation assumption (6) to the target causal parameter. (Divergence between the target observed data parameter and target causal parameter when (6) fails is a distinct issue not addressed by the proposed diagnostic.) The bias in an estimator is the difference between the true value of the target parameter of the observed data distribution and the expectation of the estimator applied to a finite sample from that distribution:

$\mathrm{Bias}(\hat{\Psi}, P_0, n) = E_{P_0} \hat{\Psi}(P_n) - \Psi(P_0),$

where we recall that Ψ(P₀) is the true value of the target observed data parameter, Ψ̂ is an estimator of that parameter (which may be a function of g_n, Q_n or both) and P_n is the empirical distribution of a sample of n i.i.d. observations from the true observed data distribution P₀.

Bias in an estimator can arise due to a range of causes. First, the estimators g_n and/or Q_n may be inconsistent. Second, g₀ may not satisfy the positivity assumption. Third, consistent estimators g_n and/or Q_n may still have substantial finite sample bias. This latter type of finite sample bias arises in particular due to the curse of dimensionality in a non-parametric or semi-parametric model when g_n and/or Q_n are data-adaptive estimators, although it can also be substantial for parametric estimators. Fourth, estimated values of g_n may be equal or close to zero or one, despite use of a consistent estimator g_n and a distribution g₀ that satisfies positivity. The relative contribution of each of these sources of bias will depend on the model, the true data generating distribution, the estimator, and the finite sample.

The parametric bootstrap provides a tool that allows the analyst to explore the extent to which bias due to any of these causes is affecting a given parameter estimate. The parametric bootstrap-based bias estimate is defined as follows:

$\widehat{\mathrm{Bias}}_{PB}(\hat{\Psi}, \hat{P}_0, n) = E_{\hat{P}_0} \hat{\Psi}(P_n^{\#}) - \Psi(\hat{P}_0), \qquad (10)$

where P̂₀ is an estimate of P₀ and $P_n^{\#}$ is the empirical distribution of a bootstrap sample obtained by sampling from P̂₀. In other words, the parametric bootstrap is used to sample from an estimate of the true data generating distribution, resulting in multiple simulated data sets. The true data generating distribution and target parameter value in the bootstrapped data are known. The candidate estimator is then applied to each bootstrapped data set and the mean of the resulting estimates across data sets is compared with the known ‘truth’ (i.e. the true parameter value for the bootstrap data generating distribution).

We focus on a particular algorithm for parametric bootstrap-based bias estimation, which specifically targets the component of estimator-specific finite sample bias due to violations and near violations of the positivity assumption. The goal is not to provide an accurate estimate of bias, but rather to provide a diagnostic tool that can serve as a ‘red flag’ warning that positivity bias may pose a threat to inference. The distinguishing characteristic of the diagnostic algorithm is its use of an estimated data generating distribution P̂₀ that both approximates the true P₀ as closely as possible and is compatible with the estimators Ǭ_n and/or g_n used in Ψ̂(P_n). In other words, P̂₀ is chosen such that the estimator Ψ̂ applied to bootstrap samples from P̂₀ is guaranteed to be consistent unless g₀ fails to satisfy the positivity assumption or g_n is truncated. As a result, the parametric bootstrap provides an optimistic estimate of finite sample bias, in which bias due to model misspecification other than truncation is eliminated.

We refer informally to the resulting bias estimate as ETA.Bias because in many settings it will be predominantly composed of bias from the following sources: 1. violation of the positivity assumption by g₀; 2. truncation, if any, of g_n in response to positivity violations; and, 3. finite sample bias arising from values of g_n close to zero or one (sometimes referred to as practical violations of the positivity assumption). The term ETA.Bias is imprecise because the bias estimated by the proposed algorithm will also capture some of the bias in Ψ̂(P_n) due to finite sample bias of the estimators g_n and Ǭ_n (a form of sparsity only partially related to positivity). Due to the curse of dimensionality, the contribution of this latter source of bias may be substantial when g_n and/or Q_n are data-adaptive estimators in a non-parametric or semi-parametric model. However, the proposed diagnostic algorithm will only capture a portion of this bias because, unlike P₀, P̂₀ is guaranteed to have a functional form that can be well-approximated by the data-adaptive algorithms employed by g_n and Q_n.

The diagnostic algorithm for ETA.Bias is implemented as follows:

Step 1. Estimate P₀: Estimation of P₀ requires estimation of Q_{0W}, g₀ and Q_{0Y} (i.e. estimation of P₀(W = w), P₀(A = a | W = w) and P₀(Y = y | A = a, W = w) for all (w, a, y)). We define Q_{P̂₀W} = Q_{P_nW} (in other words, we use an estimate based on the empirical distribution of the data), g_{P̂₀} = g_n and Ǭ_{P̂₀} = Ǭ_n. Note that the estimators Q_{P_nW}, g_n and Ǭ_n were all needed for implementation of the IPTW, G-computation and DR estimators; the same estimators Q_{P_nW}, g_n and Ǭ_n can be used here. Additional steps may be required to estimate the entire conditional distribution of Y given (A, W) (beyond the estimate of its mean given by Ǭ_n). The true target parameter for the known distribution P̂₀ is a function only of Q_n = (Q_{P_nW}, Ǭ_n), and Ψ(P̂₀) is the same as the G-computation estimator (using Q_n) applied to the observed data:
$\Psi(\hat{P}_0) = \hat{\Psi}_{Gcomp}(P_n).$
Step 2. Generate $P_n^{\#}$ by sampling from P̂₀: In the second step, we assume that P̂₀ is the true data generating distribution. Bootstrap samples $P_n^{\#}$, each with n i.i.d. observations, are generated by sampling from P̂₀. For example, W can be sampled from the empirical distribution, a binary A can be generated as a Bernoulli with probability g_n(1 | W), and a continuous Y can be generated by adding an N(0, 1) error to Ǭ_n(A, W) (alternative approaches are also possible).
Step 3. Estimate $E_{\hat{P}_0} \hat{\Psi}(P_n^{\#})$: Finally, the estimator Ψ̂ is applied to each bootstrap sample. Depending on the estimator being evaluated, this step involves first applying the estimators g_n, Q_n or both to each bootstrap sample. If Q_n and/or g_n are data-adaptive estimators, the corresponding data-adaptive algorithm should be re-run in each bootstrap sample; otherwise, the coefficients of the corresponding models should be refit. ETA.Bias is calculated by comparing the mean of the estimator Ψ̂ across bootstrap samples (for the IPTW estimator, $E_{\hat{P}_0} \hat{\Psi}_{IPTW}(P_n^{\#})$) with the true value of the target parameter under the bootstrap data generating distribution (Ψ(P̂₀)).
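The three steps can be sketched in Python under strong simplifying assumptions. This is not the check.ETA routine: it assumes a single binary covariate W, a binary treatment A, saturated (stratum-specific) estimators g_n and Ǭ_n, a simple unstabilised IPTW estimator of E(Y₁ − Y₀), and N(0, 1) outcome errors. All function names and the example data are hypothetical.

```python
import random


def fit(data):
    """Saturated fits from rows (w, a, y): P_n(W = w), g_n(1 | w), Qbar_n(a, w)."""
    n = len(data)
    p_w, g, q = {}, {}, {}
    for w in (0, 1):
        rows_w = [r for r in data if r[0] == w]
        p_w[w] = len(rows_w) / n
        g[w] = sum(r[1] for r in rows_w) / len(rows_w) if rows_w else 0.5
        for a in (0, 1):
            ys = [r[2] for r in rows_w if r[1] == a]
            q[(a, w)] = sum(ys) / len(ys) if ys else 0.0  # empty cell: placeholder
    return p_w, g, q


def gcomp(p_w, q):
    """G-computation estimate of E(Y_1 - Y_0); this is Psi(P0_hat) in Step 1."""
    return sum(p_w[w] * (q[(1, w)] - q[(0, w)]) for w in (0, 1))


def iptw(data, gamma=0.0):
    """Unstabilised IPTW estimate, g_n refit on `data` and bounded to [gamma, 1 - gamma]."""
    _, g, _ = fit(data)
    total = 0.0
    for w, a, y in data:
        g1 = min(max(g[w], gamma), 1.0 - gamma)
        total += y / g1 if a == 1 else -y / (1.0 - g1)
    return total / len(data)


def eta_bias(data, estimator, n_boot=300, seed=0):
    rng = random.Random(seed)
    p_w, g, q = fit(data)                 # Step 1: estimate P_0
    truth = gcomp(p_w, q)                 # known truth under P0_hat
    estimates = []
    for _ in range(n_boot):               # Step 2: sample from P0_hat
        sample = []
        for _ in range(len(data)):
            w = rng.choice(data)[0]       # W from the empirical distribution
            a = 1 if rng.random() < g[w] else 0
            y = q[(a, w)] + rng.gauss(0.0, 1.0)
            sample.append((w, a, y))
        estimates.append(estimator(sample))   # Step 3: re-apply the estimator
    return sum(estimates) / n_boot - truth


# Hypothetical data with Y = 1 + A + 2W and g(1 | W = 1) = 0.05 (near violation).
data = ([(0, 1, 2.0)] * 50 + [(0, 0, 1.0)] * 50 +
        [(1, 1, 4.0)] * 5 + [(1, 0, 3.0)] * 95)
eta_unbounded = eta_bias(data, iptw)
eta_bounded = eta_bias(data, lambda d: iptw(d, gamma=0.1))
print(eta_unbounded, eta_bounded)
```

In this toy setting the saturated g_n makes the unbounded IPTW estimate coincide with the G-computation estimate unless a treatment-covariate cell is empty, so most of the estimated bias appears once g_n is bounded. With parametric or data-adaptive models, the fitting step inside `iptw` would be replaced by refitting those models in each bootstrap sample.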

The parametric bootstrap-based diagnostic applied to the IPTW estimator is available as an R function check.ETA in the cvDSA package.⁵ The routine takes the original data as input and performs bootstrap simulations under user-specified information such as functional forms for m(a, V | β), g_n and Q_n. Application of the bootstrap to the IPTW estimator offers one particularly sensitive assessment of positivity bias because, unlike the G-computation and DR estimators, the IPTW estimator cannot extrapolate based on Ǭ_n. However, this approach can be applied to any causal effect estimator, including estimators introduced in Section 6 that trade off proximity to the target parameter for identifiability. In assessing the threat posed by positivity violations, the bootstrap should ideally be applied to both the IPTW estimator and the estimator of choice.

4.1.1 Remarks on interpretation of the bias estimate

We caution against using the parametric bootstrap for any form of bias correction. The true bias of the estimator is $E_{P_0} \hat{\Psi}(P_n) - \Psi(P_0)$, while the parametric bootstrap estimates $E_{\hat{P}_0} \hat{\Psi}(P_n^{\#}) - \Psi(\hat{P}_0)$. The performance of the diagnostic therefore depends on the extent to which P̂₀ approximates the true data generating distribution. This suggests the importance of using flexible data-adaptive algorithms to estimate P₀. Regardless of estimation approach, however, when the target parameter Ψ(P₀) is poorly identified due to positivity violations, Ψ(P̂₀) may be a poor estimate of Ψ(P₀). In such cases one would not expect the parametric bootstrap to provide a good estimate of the true bias. Further, the ETA.Bias implementation of the parametric bootstrap provides a deliberately optimistic bias estimate by excluding bias due to model misspecification for the estimators g_n and Ǭ_n.

Rather, the parametric bootstrap is proposed as a diagnostic tool. Even when the data generating distribution is not estimated consistently, the bias estimate provided by the parametric bootstrap remains interpretable in the world where the estimated data generating mechanism represents the truth. If the estimated bias is large, an analyst who disregards the implied caution is relying on an unsubstantiated hope that first, he or she has inconsistently estimated the data generating distribution but still done a reasonable job estimating the causal effect of interest; and second, the true data generating distribution is less affected by positivity (and other finite sample) bias than is the analyst’s best estimate of it.

The threshold level of ETA.Bias that is considered problematic will vary depending on the scientific question and the point and variance estimates of the causal effect. With that caveat, we suggest the following two general situations in which ETA.Bias can be considered a ‘red flag’ warning: 1. when ETA.Bias is of the same magnitude as (or larger than) the estimated standard error of the estimator; and, 2. when the interpretation of a bias-corrected confidence interval would differ meaningfully from initial conclusions.
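The two warning situations can be written as a small check, sketched here in Python. The function `red_flag` is hypothetical, and its operationalisation is an assumption: criterion 1 is coded as |ETA.Bias| ≥ SE, and criterion 2 as a change in whether a Wald-type 95% interval excludes a null value of zero once the point estimate is shifted by the bias estimate. The text leaves both thresholds to the analyst's judgement. The example inputs are the unbounded TMLE rows reported later in Table 5.

```python
def red_flag(estimate, se, eta_bias, null_value=0.0, z=1.96):
    """Return the two informal warning criteria as booleans."""
    # Criterion 1: bias estimate at least as large as the standard error.
    bias_vs_se = abs(eta_bias) >= se

    # Criterion 2: does shifting the interval by the bias estimate change
    # whether the null value is excluded?
    def excludes_null(center):
        return not (center - z * se <= null_value <= center + z * se)

    conclusion_changes = excludes_null(estimate) != excludes_null(estimate - eta_bias)
    return {"bias_vs_se": bias_vs_se, "conclusion_changes": conclusion_changes}


# Unbounded TMLE results from Table 5 (estimate, SE, ETA.Bias).
flags_mlc = red_flag(2.85, 0.14, -0.37)
flags_afst = red_flag(0.65, 0.13, -0.01)
print(flags_mlc, flags_afst)
```

The shifted interval is used here only to trigger a warning; consistent with the caution above, it is not meant as a bias-corrected estimate for reporting.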

5 Application of the parametric bootstrap

5.1 Application to simulated data

5.1.1 Methods

Data were simulated using a data generating distribution published by Freedman and Berk.²⁹ Two baseline covariates, W = (W₁, W₂), were generated bivariate normal, N(µ, Σ), with µ₁ = 0.5, µ₂ = 1 and $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$. The outcome regression Ǭ₀(A, W) ≡ E₀(Y | A, W) was given by Ǭ₀(A, W) = 1 + A + W₁ + 2W₂, and Y was generated as Ǭ₀(A, W) + N(0, 1). The treatment mechanism g₀(1 | W) ≡ P₀(A = 1 | W) was given by g₀(1 | W) = Φ(0.5 + 0.25W₁ + 0.75W₂), where Φ is the cumulative distribution function (CDF) of the standard normal distribution. With this treatment mechanism g₀ ∈ [0.001, 1], resulting in practical violation of the positivity assumption. The target parameter was E(Y₁ − Y₀) (corresponding to the marginal structural model m(a | β) = β₀ + β₁a). The true value of the target parameter was Ψ(P₀) = 1.
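A draw from this data generating distribution can be sketched in Python as follows. The function names and the seed are illustrative assumptions; the bivariate normal covariates are built from two independent standard normals using the Cholesky factor of Σ, and Φ is computed from the error function.

```python
import math
import random


def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate(n, seed=0):
    """Return n rows (w1, w2, a, y, g1) from the distribution described above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Cholesky factor of [[2, 1], [1, 1]] is [[sqrt(2), 0], [1/sqrt(2), 1/sqrt(2)]].
        w1 = 0.5 + math.sqrt(2.0) * z1
        w2 = 1.0 + (z1 + z2) / math.sqrt(2.0)
        g1 = phi(0.5 + 0.25 * w1 + 0.75 * w2)      # g_0(1 | W)
        a = 1 if rng.random() < g1 else 0
        y = 1.0 + a + w1 + 2.0 * w2 + rng.gauss(0.0, 1.0)
        rows.append((w1, w2, a, y, g1))
    return rows


sample = simulate(1000)
print("share treated:", sum(r[2] for r in sample) / len(sample))
print("largest g_0(1 | W):", max(r[4] for r in sample))
```

Because most subjects have a high probability of treatment, values of g₀(1 | W) close to one are common, which is the source of the practical positivity violation for the untreated condition.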

The bias, variance and mean squared error of the G-computation, IPTW, A-IPTW and TMLE estimators were estimated by applying each estimator to 250 samples of size 1000 drawn from this data generating distribution. Each of the four estimators was implemented with each of the following three approaches: 1. use of a correctly specified model to estimate both Ǭ₀ and g₀ (a specification referred to as ‘Qcgc’); 2. use of a correctly specified model to estimate Ǭ₀ and a misspecified model to estimate g₀ (obtained by omitting W₂ from g_n, a specification referred to as ‘Qcgm’); and, 3. use of a correctly specified model to estimate g₀ and a misspecified model to estimate Ǭ₀ (obtained by omitting W₂ from Ǭ_n, a specification referred to as ‘Qmgc’). The DR and IPTW estimators were further implemented using the following sets of bounds for the values of g_n: [0, 1] (or no bounding), [0.025, 0.975],[0.05, 0.95] and [0.1, 0.9]. For the IPTW estimator, the latter three bounds correspond to truncation of the unstabilised weights at [1.03, 40], [1.05, 20], and [1.11, 10].
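The correspondence between bounds on g_n and truncation of the unstabilised weights follows from the weight being 1 / g_n(A | W). A short Python check of the arithmetic (the function name is hypothetical):

```python
def weight_bounds(lower, upper):
    """Range of unstabilised weights 1 / g_n(A | W) implied by lower <= g_n <= upper."""
    return 1.0 / upper, 1.0 / lower


g_bounds = [(0.025, 0.975), (0.05, 0.95), (0.1, 0.9)]
implied = [weight_bounds(lower, upper) for lower, upper in g_bounds]
for (lower, upper), (w_min, w_max) in zip(g_bounds, implied):
    print(f"g_n in [{lower}, {upper}] -> weights in [{w_min:.2f}, {w_max:.0f}]")
```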

The parametric bootstrap was then applied using the ETA.Bias algorithm to 10 of the 250 samples. For each sample and for each model specification (Qcgc,Qmgc and Qcgm), Q_n and g_n were used to draw 1000 parametric bootstrap samples. Specifically, W was drawn from the empirical distribution for that sample; A was generated given the bootstrapped values of W as a series of Bernoulli trials with probability g_n(1 | W), and Y was generated given the bootstrapped values of A,W by adding a N(0, 1) error to Ǭ_n(A,W). Each candidate estimator was then applied to each bootstrap sample. In this step, the parametric models g_n and Ǭ_n were held fixed and their coefficients refit. ETA.Bias was calculated for each of the 10 samples as the difference between the mean of the bootstrapped estimator and the initial G-computation estimate Ψ(P̂₀) = Ψ̂_Gcomp(P_n) in that sample. Additional simulations are discussed in a technical report and code is available at http://www.stat.berkeley.edu/laan/Software/index.html.³⁰

5.1.2 Results

Table 2 demonstrates the effect of positivity violations and near-violations on estimator behaviour across 250 samples. The G-computation estimator remained minimally biased when the estimator Ǭ_n was consistent; use of inconsistent Ǭ_n resulted in bias. Given consistent estimators Ǭ_n and g_n, the IPTW estimator was more biased than the other three estimators, as expected given the practical positivity violations present in the simulation. The finite sample performance of the A-IPTW and TMLE estimators was also affected by the presence of practical positivity violations. The DR estimators achieved the lowest mean squared error (MSE) when 1. Ǭ_n was consistent and 2. g_n was inconsistent but satisfied positivity (as a result either of truncation or of omission of W₂, a major source of positivity bias). Interestingly, in this simulation TMLE still did quite well when Ǭ_n was inconsistent and the model used for g_n was correctly specified but its values bounded at [0.025, 0.975].

Table 2.

Performance of estimators in 250 simulated data sets of size 1000, by estimator and bound on g_n

	Qcgc			Qcgm			Qmgc

	Bias	Var	MSE	Bias	Var	MSE	Bias	Var	MSE
G-COMP
None	0.007	0.009	0.009	0.007	0.009	0.009	1.145	0.025	1.336
[0.025,0.975]	0.007	0.009	0.009	0.007	0.009	0.009	1.145	0.025	1.336
[0.05,0.95]	0.007	0.009	0.009	0.007	0.009	0.009	1.145	0.025	1.336
[0.1,0.9]	0.007	0.009	0.009	0.007	0.009	0.009	1.145	0.025	1.336
IPTW
None	0.544	0.693	0.989	1.547	0.267	2.660	0.544	0.693	0.989
[0.025,0.975]	1.080	0.090	1.257	1.807	0.077	3.340	1.080	0.090	1.257
[0.05,0.95]	1.437	0.059	2.123	2.062	0.054	4.306	1.437	0.059	2.123
[0.1,0.9]	1.935	0.043	3.787	2.456	0.043	6.076	1.935	0.043	3.787
A-IPTW
None	0.080	0.966	0.972	−0.003	0.032	0.032	−0.096	16.978	16.987
[0.025,0.975]	0.012	0.017	0.017	0.006	0.017	0.017	0.430	0.035	0.219
[0.05,0.95]	0.011	0.014	0.014	0.009	0.014	0.014	0.556	0.025	0.334
[0.1,0.9]	0.009	0.011	0.011	0.008	0.011	0.011	0.706	0.020	0.519
TMLE
None	0.251	0.478	0.540	0.026	0.059	0.060	−0.675	0.367	0.824
[0.025,0.975]	0.016	0.028	0.028	0.005	0.021	0.021	−0.004	0.049	0.049
[0.05,0.95]	0.013	0.019	0.020	0.010	0.016	0.017	0.163	0.027	0.054
[0.1,0.9]	0.010	0.014	0.014	0.009	0.013	0.013	0.384	0.018	0.166


Choice of bound imposed on g_n affected both the bias and variance of the IPTW, A-IPTW and TMLE estimators. As expected, truncation of the IPTW weights improved the variance of the estimator but increased bias. Without additional diagnostic information, an analyst who observed the dramatic decline in the variance of the IPTW estimator that occurred with weight truncation might have concluded that truncation improved estimator performance; however, in this simulation weight truncation increased mean square error (MSE). In contrast, and as predicted by theory, use of bounded values of g_n decreased MSE of the DR estimators in spite of the inconsistency introduced to g_n.

Table 3 shows the mean of ETA.Bias across 10 of the 250 samples; the variance of ETA.Bias across the samples was small (results available in a technical report).³⁰ Based on the results shown in Table 2, a red flag was needed for the IPTW estimator with and without bounded g_n and for the TMLE estimator without bounded g_n. (The A-IPTW estimator without bounded g_n exhibited a small to moderate amount of bias; however, the variance would likely have alerted an analyst to the presence of sparsity.) The parametric bootstrap correctly identified the presence of substantial finite sample bias in the IPTW estimator for all truncation levels and in the TMLE estimator with unbounded g_n. ETA.Bias was minimal for the remaining estimators.

Table 3.

Finite sample bias (Q_n and g_n correctly specified) and mean of ETA.Bias across 10 simulated datasets of size 1000

Bound on g_n	None	[0.025,0.975]	[0.05,0.95]	[0.1,0.9]
G-computation estimator
Finite sample bias: Qcgc	7.01e−03	7.01e−03	7.01e−03	7.01e−03
Mean(ETA.Bias): Qcgc	−8.51e−04	−8.51e−04	−8.51e−04	−8.51e−04
Mean(ETA.Bias): Qcgm	2.39e−04	2.39e−04	2.39e−04	2.39e−04
Mean(ETA.Bias): Qmgc	5.12e−04	5.12e−04	5.12e−04	5.12e−04
IPTW estimator
Finite sample bias: Qcgc	5.44e−01	1.08e+00	1.44e+00	1.93e+00
Mean(ETA.Bias): Qcgc	4.22e−01	1.04e+00	1.40e+00	1.90e+00
Mean(ETA.Bias): Qcgm	1.34e−01	4.83e−01	7.84e−01	1.23e+00
Mean(ETA.Bias): Qmgc	2.98e−01	7.39e−01	9.95e−01	1.35e+00
A–IPTW estimator
Finite sample bias: Qcgc	7.99e−02	1.25e−02	1.07e−02	8.78e−03
Mean(ETA.Bias): Qcgc	1.86e−03	2.80e−03	5.89e−05	1.65e−03
Mean(ETA.Bias): Qcgm	−3.68e−04	−6.36e−04	2.56e−05	5.72e−04
Mean(ETA.Bias): Qmgc	−3.59e−04	1.21e−04	−1.18e−04	−1.09e−03
TMLE estimator
Finite sample bias: Qcgc	2.51e−01	1.60e−02	1.31e−02	9.98e−03
Mean(ETA.Bias): Qcgc	1.74e−01	4.28e−03	2.65e−04	1.84e−03
Mean(ETA.Bias): Qcgm	2.70e−02	−3.07e−04	2.15e−04	7.74e−04
Mean(ETA.Bias): Qmgc	1.11e−01	9.82e−04	−2.17e−04	−1.47e−03


For correctly specified Ǭ_n and g_n (g_n unbounded), the mean of ETA.Bias across the 10 samples was 78% and 69% of the true finite sample bias of the IPTW and TMLE estimators, respectively. The fact that the true bias was underestimated in both cases illustrates a limitation of the parametric bootstrap: its performance, even as an intentionally optimistic bias estimate, suffers when the target estimator is not asymptotically normally distributed.³¹ Bounding g_n improved the ability of the bootstrap to accurately diagnose bias by improving estimator behaviour (in addition to adding a new source of bias due to truncation of g_n). This finding suggests that practical application of the bootstrap to a given estimator should at minimum generate ETA.Bias estimates for a single low level of truncation of g_n in addition to any unbounded estimate. When g_n was bounded, the mean of ETA.Bias for the IPTW estimator across the 10 samples was 96–98% of the true finite sample bias; the finite sample bias for the TMLE estimator with bounded g_n was accurately estimated to be minimal. Misspecification of g_n or Ǭ_n by excluding a key covariate led to an estimated data generating distribution with less sparsity than the true P₀, and as a result the parametric bootstrap underestimated bias to a greater extent for these model specifications.
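The percentages quoted above can be recovered from the Qcgc rows of Table 3. The short Python check below uses only values printed in that table; the variable names are illustrative.

```python
# Finite sample bias and mean ETA.Bias (Qcgc), unbounded g_n, from Table 3.
finite_sample_bias = {"IPTW": 0.544, "TMLE": 0.251}
mean_eta_bias = {"IPTW": 0.422, "TMLE": 0.174}
ratios = {k: mean_eta_bias[k] / finite_sample_bias[k] for k in finite_sample_bias}

# IPTW estimator with g_n bounded at [0.025, 0.975], [0.05, 0.95], [0.1, 0.9].
iptw_bias_bounded = [1.08, 1.44, 1.93]
iptw_eta_bounded = [1.04, 1.40, 1.90]
ratios_bounded = [e / b for e, b in zip(iptw_eta_bounded, iptw_bias_bounded)]

for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.0f}% of finite sample bias")
print([f"{100 * r:.0f}%" for r in ratios_bounded])
```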

While use of an unbounded g_n resulted in an underestimate of the true degree of finite sample bias for the IPTW and TMLE estimators, in this simulation the parametric bootstrap would still have functioned well as a diagnostic in each of the 10 samples considered. Table 4 reports the output that would have been available to an analyst applying the parametric bootstrap to the unbounded IPTW and TMLE estimators for each of the 10 samples. In all samples ETA.Bias was of roughly the same magnitude or larger than the estimated standard error of the estimator, and in most was of significant magnitude relative to the point estimate of the causal effect.

Table 4.

Estimated causal treatment effect β̂, standard error ($\widehat{SE}$) and ETA.Bias in 10 simulated datasets of size 1000; g_n and Q_n correctly specified, g_n unbounded

	IPTW estimator			TMLE estimator
Sample	β̂_IPTW	$\widehat{SE}$	ETA.Bias	β̂_TMLE	$\widehat{SE}$	ETA.Bias
1	0.207	0.203	0.473	0.827	0.197	0.172
2	1.722	0.197	0.425	0.734	0.114	0.153
3	1.957	0.184	0.306	1.379	0.105	0.087
4	1.926	0.206	0.510	0.237	0.089	0.252
5	2.201	0.192	0.565	2.548	0.182	0.245
6	0.035	0.236	0.520	0.533	0.228	0.234
7	1.799	0.180	0.346	1.781	0.184	0.150
8	0.471	0.215	0.420	1.066	0.114	0.188
9	2.749	0.184	0.391	1.974	0.114	0.161
10	0.095	0.228	0.263	0.628	0.173	0.099

The simulation demonstrates how the parametric bootstrap can be used to investigate the trade-offs between bias due to weight truncation/bounding of g_n and positivity bias. The parametric bootstrap accurately diagnosed both an increase in the bias of the IPTW estimator with increasing truncation and a reduction in the bias of the TMLE estimator with truncation. When viewed in light of the standard error estimates under different levels of truncation, the diagnostic would have accurately suggested that truncation of g_n for the TMLE estimator was beneficial, while truncation of the weights for the IPTW estimator was of questionable benefit. (The parametric bootstrap can also be used to provide a more refined approach to choosing an optimal truncation constant based on estimated MSE.²³)

These results further illustrate the benefit of applying the parametric bootstrap to the IPTW estimator in addition to the analyst’s estimator of choice. Diagnosis of substantial bias in the IPTW estimator due to positivity violations would have alerted an analyst that the G-computation estimator was relying heavily on extrapolation, and that the DR estimators were sensitive to bias arising from misspecification of the model used to estimate Ǭ₀.

5.2 Data example: HIV resistance mutations

5.2.1 Data and question

We analysed an observational cohort of HIV-infected patients in order to estimate the effect of mutations in the HIV protease enzyme on viral response to the antiretroviral drug lopinavir. The question, data, and analysis have been described previously.³² Here, a simplified version of prior analyses was performed and the parametric bootstrap was applied to investigate the potential impact of positivity violations on results.

Briefly, baseline covariates, mutation profiles prior to treatment change, and viral response to therapy were collected for 401 treatment change episodes (TCEs) in which protease inhibitor-experienced subjects initiated a new antiretroviral regimen containing the drug lopinavir. We focused on two target mutations in the protease enzyme: p82AFST and p82MLC (present in 25% and 1% of TCEs, respectively). The data for each target mutation consisted of O = (W, A, Y), where A was a binary indicator that the target mutation was present prior to treatment change, W was a set of 35 baseline characteristics including summaries of past treatment history, mutations in the reverse transcriptase enzyme, and a genotypic susceptibility score for the background regimen (based on the Stanford scoring system; http://hivdb.stanford.edu/). The outcome Y was the change in log₁₀(viral load) following initiation of the new antiretroviral regimen. The target observed data parameter was E_W(E(Y | A = 1, W) − E(Y | A = 0, W)), equal under (6) to the average treatment effect E(Y₁ − Y₀).

5.2.2 Methods

Effect estimates were obtained for each mutation using the IPTW estimator and TMLE with a logistic fluctuation.³³ Ǭ₀ and g₀ were estimated with stepwise forward selection of main terms based on the AIC criterion, using the step function in the stats v2.11.1 package in R. Estimators were implemented using both unbounded values for g_n(A | W) and values truncated at [0.025, 0.975]. Following standard practice in much of the literature, standard errors were estimated using the influence curve, corresponding to the standard output for the glm and tmle functions in R, treating the values of g_n as fixed. The parametric bootstrap was used to estimate the ETA.Bias of each estimator using 1000 samples and the ETA.Bias algorithm, with the step function rerun in each parametric bootstrap sample.

5.2.3 Results

Results for both mutations are presented in Table 5. p82AFST is known to be a major mutation for lopinavir resistance.³⁴ The current results support this finding; the IPTW and TMLE point estimates were similar and both suggested a significantly more positive change in viral load (corresponding to a less effective drug response) among subjects with the mutation as compared to those without it. The parametric bootstrap-based bias estimate was minimal, raising no red flag that these findings might be attributable to positivity bias.

Table 5.

Point estimate, standard error and parametric bootstrap-based bias estimates for the effect of two HIV resistance mutations on viral response, by estimator and bound on g_n

	TMLE estimator			IPTW estimator
Bound on g_n	β̂_TMLE	$\widehat{SE}$	ETA.Bias	β̂_IPTW	$\widehat{SE}$	ETA.Bias
p82AFST
[0, 1]	0.65	0.13	−0.01	0.66	0.15	−0.01
[0.025, 0.975]	0.62	0.13	0.00	0.66	0.15	−0.01
p82MLC
[0, 1]	2.85	0.14	−0.37	1.29	0.14	0.09
[0.025, 0.975]	0.86	0.10	−0.01	0.80	0.23	0.08

The role of mutation p82MLC is less clear based on existing knowledge; depending on the scoring system used it is either not considered a lopinavir resistance mutation, or given an intermediate lopinavir resistance score (http://hivdb.stanford.edu/).³⁴ Initial inspection of the point estimates and standard errors in the current analysis would have suggested that p82MLC had a large and highly significant effect on lopinavir resistance. Application of the parametric bootstrap-based diagnostic, however, would have suggested that these results should be interpreted with caution. In particular, the bias estimate for the unbounded TMLE was larger than the estimated standard error, while the bias estimate for the unbounded IPTW estimator was of roughly the same magnitude. While neither bias estimate was of sufficient magnitude relative to the point estimate to change inference, their size relative to the corresponding standard errors would have suggested that further investigation was warranted.

In response, the non-parametric bootstrap (based on 1000 bootstrap samples) was applied to provide an alternative estimate of the standard error. Using this alternative approach, the standard errors for the unbounded TMLE and IPTW estimators of the effect of p82MLC were estimated to be 2.77 and 1.17, respectively. Non-parametric bootstrap-based standard error estimates for the bounded TMLE and IPTW estimators were lower (0.84 and 1.12, respectively), but still substantially higher than the initial naive standard error estimates. These revised standard error estimates dramatically changed interpretation of results, suggesting that the current analysis was unable to provide essentially any information on the presence, magnitude, or direction of the p82MLC effect. (Non-parametric bootstrap-based standard error estimates for p82AFST were also somewhat larger than initial estimates, but did not change inference.)
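The non-parametric bootstrap standard error can be sketched generically in Python as below. The function `bootstrap_se` is hypothetical, and the sample mean is used only as a stand-in estimator on made-up data; in the analysis described here the estimator would be the full TMLE or IPTW procedure, including model selection, re-run on each resample.

```python
import random
from statistics import stdev


def bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """Standard deviation of the estimator across resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return stdev(estimates)


# Hypothetical data: 200 observations taking the values 0, 1, ..., 9.
data = [float(i % 10) for i in range(200)]
se_boot = bootstrap_se(data, lambda d: sum(d) / len(d))
print(se_boot)
```

For the sample mean the result can be compared with the usual analytic standard error, which is about 0.20 for these data; for estimators affected by sparsity no such simple benchmark exists, which is why the resampling estimate is informative.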

In this example, ETA.Bias is expected to include some non-positivity bias due to the curse of dimensionality. However, the resulting bias estimate should still be interpreted as highly optimistic (i.e. as an underestimate of the true finite sample bias). The parametric bootstrap sampled from estimates of g₀ and Ǭ₀ that had been fit using the step() algorithm. This ensured that the estimators g_n and Ǭ_n (which applied the same stepwise algorithm) would do a good job approximating g_P̂₀ and Ǭ_P̂₀ in each bootstrap sample. Clearly, no such guarantee exists for the true P₀. This simple example further illustrates the utility of the non-parametric bootstrap for standard error estimation in the setting of sparse data and positivity violations. In this particular example, the improved variance estimate provided by the non-parametric bootstrap was sufficient to prevent positivity violations from leading to incorrect inference. As demonstrated in the simulations, however, in other settings improved variance estimates may still fail to alert the analyst to threats posed by positivity violations.

6 Practical approaches to causal inference in the presence of positivity violations

6.1 Approach no. 1: Change the projection function h(A, V)

Throughout this article we have focused on the target causal parameter β(F_X, m, h) defined according to (4) as the projection of E_{F_X} (Y_a | V) on the working marginal structural model m(a, V | β). Choice of function h(a, V) both defines the target parameter by specifying which values of (A, V) should be given greater weight when estimating β and, by assumption (8), defines the positivity assumption needed for β to be identifiable.

We have focused on parameters indexed by h(a, V) = 1, a choice that gives equal weight to estimating the counterfactual outcome for all values (a, v).¹¹ Alternative choices of h(a, V) can significantly weaken the needed positivity assumption. For example, if the target of inference only involves counterfactual outcomes among some restricted range [c, d] of possible values 𝒜, defining h(a, V) = I(a ∈ [c, d]) weakens the positivity assumption by requiring sufficient variability only in the assignment of treatment levels within the target range. In some settings, the causal parameter defined by such a projection over a limited range of 𝒜 might be of substantial a priori interest. For example, one may wish to focus estimation of a drug dose response curve only on the range of doses considered reasonable for routine clinical use, rather than on the full range of doses theoretically possible or observed in a given data set.

An alternative approach, commonly employed in the context of IPTW estimation and introduced in Section 3.2, is to choose h(a, V) = g(a | V), where g(a | V) ≡ P(A = a | V) is the conditional probability of treatment given the covariates included in the marginal structural model. In the setting of IPTW estimation this choice corresponds to the use of stabilizing weights, a common approach to reducing the variance of the IPTW estimator in the face of sparsity.²¹ When the target causal parameter is defined using a marginal structural working model, use of h(a, V) = g(a | V) corresponds to a definition of the target parameter that gives greater weight to those regions of the joint distribution of (A, V) that are well-supported, and that relies on smoothing or extrapolation to a greater degree in regions that are not.¹¹
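A minimal Python sketch of the resulting stabilised weights, h(A, V) / g_n(A | W) with h(a, V) = g(a | V), is given below. It assumes a discrete V so that g(a | V) can be estimated by empirical proportions; the function name and the six example records are hypothetical.

```python
from collections import Counter


def stabilised_weights(a, v, g_aw):
    """Weights g_n(A | V) / g_n(A | W), with g_n(A | V) taken as the empirical
    proportion of each treatment level within levels of V."""
    n_v = Counter(v)
    n_av = Counter(zip(a, v))
    return [(n_av[(ai, vi)] / n_v[vi]) / gi for ai, vi, gi in zip(a, v, g_aw)]


# Hypothetical treatments, a binary V, and estimated g_n(A_i | W_i).
a = [1, 1, 0, 0, 1, 0]
v = [0, 0, 0, 1, 1, 1]
g_aw = [0.8, 0.5, 0.2, 0.9, 0.05, 0.6]

stab = stabilised_weights(a, v, g_aw)
unstab = [1.0 / g for g in g_aw]
print(max(unstab), max(stab))
```

In this toy example the subject with g_n(A | W) = 0.05 receives an unstabilised weight of 20 but a much smaller stabilised weight, illustrating how the choice of h down-weights poorly supported regions.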

Use of a marginal structural working model makes clear that the utility of choosing h(a, V) = g(a | V) as a method to approach data sparsity is not limited to the IPTW estimator. Recall that the G-computation estimator can be implemented by regressing predicted values for Ŷ_a on (a, V) according to model m(a, V | β) with weights provided by h(a, V). When the projection function is chosen to be g(a | V), this corresponds to a weighted regression in which weights are proportional to the degree of support in the data.

Even when one is ideally interested in the entire causal curve (implying a target parameter defined by choice h(a, V) = 1), specification of alternative choices for h offers a means of improving identifiability, at a cost of redefining the target parameter. For example, one can define a family of target parameters indexed by h_δ(a, V) = I(a ∈ [c(δ), d(δ)]), where an increase in δ corresponds to progressive restriction on the range of treatment levels targeted by estimation. Fluctuation of δ thus corresponds to trading a focus on more limited areas of the causal curve for improved parameter identifiability. Selection of the final target from among this family can be based on an estimate of bias provided by the parametric bootstrap. For example, the bootstrap can be used to select the parameter with the smallest δ for which ETA.Bias falls below some pre-specified allowable threshold.
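The selection rule can be sketched in Python as follows. Everything specific here is an assumption for illustration: the form c(δ) = c₀ + δ and d(δ) = d₀ − δ, the ETA.Bias values attached to each δ (which in practice would come from running the parametric bootstrap for each candidate target parameter), and the threshold of 0.05.

```python
def h_delta(a, delta, c0=0.0, d0=1.0):
    """Indicator I(a in [c(delta), d(delta)]), assuming c = c0 + delta and d = d0 - delta."""
    return 1.0 if c0 + delta <= a <= d0 - delta else 0.0


def select_delta(eta_bias_by_delta, threshold):
    """Smallest delta whose |ETA.Bias| is below the threshold, or None if there is none."""
    for delta in sorted(eta_bias_by_delta):
        if abs(eta_bias_by_delta[delta]) < threshold:
            return delta
    return None


# Hypothetical ETA.Bias estimates for increasingly restricted targets.
eta_bias_by_delta = {0.0: 0.42, 0.1: 0.20, 0.2: 0.04, 0.3: 0.01}
chosen = select_delta(eta_bias_by_delta, threshold=0.05)
print(chosen, h_delta(0.15, chosen), h_delta(0.5, chosen))
```

Choosing the smallest qualifying δ keeps the redefined target as close as possible to the original one while still meeting the pre-specified bias criterion.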

6.2 Approach no. 2: Restrict the adjustment set

Exclusion of problematic W (i.e. those covariates resulting in positivity violations or near violations) from the adjustment set provides a means to trade confounding bias for a reduction in positivity violations.³⁵ In some cases, exclusion of covariates from the adjustment set may come at little or no cost to bias in the estimate of the target parameter. In particular, a subset of W that excludes covariates responsible for positivity violations may still be sufficient to control for confounding. In other words, a subset W′ ⊂ W may exist for which both identifying assumptions (6) and (7) hold (i.e. Y_a ∐ A | W′ and g₀(a | W′) > 0, a ∈ 𝒜), while positivity fails for the full set of covariates. In practice, this approach can be implemented by first determining candidate subsets of W under which the positivity assumption holds, and then using causal graphs to assess whether any of these candidates is sufficient to control for confounding. Even when no such candidate set can be identified, background knowledge (or sensitivity analysis) may suggest that problematic W represent a minimal source of confounding bias (Moore et al. provide an example).¹⁵ Often, however, those covariates that are most problematic from a positivity perspective are also strong confounders.

As suggested with respect to choice of projection function h(a, V) in the previous section, the causal effect estimator can be fine-tuned to select the degree of restriction on the adjustment set W according to some pre-specified rule for eliminating covariates from the adjustment set, and the parametric bootstrap used to select the minimal degree of restriction that maintains ETA.Bias below an acceptable threshold.³⁵ In the case of substantial positivity violations, such an approach can result in small covariate adjustment sets. While such limited covariate adjustment accurately reflects a target parameter that is poorly supported by the available data, the resulting estimate can be difficult to interpret and will no longer carry a causal interpretation.

6.3 Approach no. 3: Restrict the sample

An alternative approach, sometimes referred to as ‘trimming’, is to discard classes of subjects for whom there exists no or limited variability in observed treatment assignment. A causal effect is then estimated in the remaining subsample. This approach is popular in the econometrics and social science literature; Crump et al. provide a recent review.³⁶–³⁹

When the subset of covariates responsible for positivity violations is low or one dimensional, such an approach can be implemented simply by discarding subjects with covariate values not represented in all treatment groups. For example, say that one aims to estimate the average effect of a binary treatment, and in order to control for confounding needs to adjust for W, a covariate with possible levels {1, 2, 3, 4}. However, inspection of the data reveals that no one in the sample with W = 4 received treatment (i.e. g_n(1 | W = 4) = 0). The sample can be trimmed by excluding those subjects for whom W = 4 prior to applying a given causal effect estimator for the average treatment effect. As a result, the target parameter is shifted from E(Y₁ − Y₀) to E(Y₁ − Y₀ | W < 4), and the positivity assumption (7) now holds (as W = 4 occurs with zero probability).
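A minimal sketch of this form of trimming is given below, assuming a single discrete covariate W and a binary treatment. The toy records and function names are hypothetical, and the stratified estimator is only one of many estimators that could be applied to the trimmed sample.

```python
def trim_to_supported_strata(records, levels=(0, 1)):
    """Keep only strata of W in which every treatment level was observed.

    records is a list of (w, a, y) tuples.
    """
    seen = {}
    for w, a, _ in records:
        seen.setdefault(w, set()).add(a)
    supported = {w for w, arms in seen.items() if all(a in arms for a in levels)}
    kept = [r for r in records if r[0] in supported]
    return kept, supported


def stratified_effect(records):
    """Stratum-specific mean differences, averaged over the W distribution
    of the records supplied (here, the trimmed sample)."""
    n = len(records)
    effect = 0.0
    for w in {r[0] for r in records}:
        stratum = [r for r in records if r[0] == w]
        y1 = [y for _, a, y in stratum if a == 1]
        y0 = [y for _, a, y in stratum if a == 0]
        effect += (sum(y1) / len(y1) - sum(y0) / len(y0)) * len(stratum) / n
    return effect


# Toy sample (assumed): nobody with W = 4 is treated, so g_n(1 | W = 4) = 0.
sample = [
    (1, 1, 3.0), (1, 0, 1.0),
    (2, 1, 4.0), (2, 0, 2.0),
    (3, 1, 5.0), (3, 0, 3.0),
    (4, 0, 6.0), (4, 0, 8.0),
]
trimmed, supported = trim_to_supported_strata(sample)
effect_trimmed = stratified_effect(trimmed)
```

The estimate returned here targets E(Y₁ − Y₀ | W < 4) rather than E(Y₁ − Y₀), because the weights P(W = w) are computed within the trimmed sample.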

Often W is too high dimensional to make this straightforward implementation feasible; in such a case matching on the propensity score provides a means to trim the sample. (While there is an extensive literature on propensity score-based effect estimators, a review of these estimators is beyond the scope of the current review.) Several potential problems arise with the use of trimming methods to address positivity violations. First, discarding subjects responsible for positivity violations shrinks sample size, and thus runs the risk of increasing the variance of the effect estimate. Further, sample size and the extent to which positivity violations arise by chance are closely related. Depending on how trimming is implemented, new practical positivity violations can be introduced as sample size shrinks. Second, restriction of the sample may result in a causal effect for a population of limited interest. In other words, as can occur with alternative approaches to improve identifiability by shifting the target of inference, the parameter actually estimated may be far from the initial target. Further, when the criterion used to restrict the sample involves a summary of high dimensional covariates, such as is provided by the propensity score, it can be difficult to interpret the parameter estimated. Finally, when treatment is longitudinal, the covariates responsible for positivity violations may themselves be affected by past treatment.¹⁵ Trimming to remove positivity violations in this setting amounts to conditioning on post-treatment covariates and can thus introduce new bias.

Crump et al.³⁶ propose an approach to trimming that falls within the general strategy of redefining the target parameter in order to explicitly capture the tradeoff between parameter identifiability and proximity to the initial target. In addition to focusing on the treatment effect in an a priori specified target population, they define an alternative target parameter corresponding to the average treatment effect in that subsample of the population for which the most precise estimate can be achieved. Crump et al. further suggest the potential for extending this approach to achieve an optimal (according to some user-specified criteria) trade-off between the representativeness of the subsample in which the effect is estimated and the variance of the estimate.

6.4 Approach no. 4: Change the intervention of interest

A final alternative for improving the identifiability of a causal parameter in the presence of positivity violations is to redefine the intervention of interest. Realistic rules rely on an estimate of the propensity score g(a | W) to define interventions that explicitly avoid positivity violations. This ensures that the causal parameter estimated is sufficiently supported by existing data.

Realistic interventions avoid positivity violations by first identifying subjects for whom a given treatment assignment is not realistic (i.e. subjects whose propensity score for a given treatment is small or zero) and then assigning an alternative treatment with better data support to those individuals. Such an approach is made possible by focusing on the causal effects of dynamic treatment regimes.⁴⁰,⁴¹ The causal parameters described thus far are summaries of the counterfactual outcome distribution under a fixed treatment applied uniformly across the target population. In contrast, a dynamic regime assigns treatment in response to patient covariate values. This characteristic makes it possible to define interventions under which a subject is only assigned treatments that are possible (or ‘realistic’) given a subject’s covariate values.

To continue the previous example in which no subjects with W = 4 were treated, a realistic treatment rule might take the form ‘treat only those subjects with W less than 4.’ More formally, let d(W) refer to a treatment rule that deterministically assigns a treatment a ∈ 𝒜 based on a subject’s covariates W and consider the rule d(W) = I(W < 4). Let Y_d denote the counterfactual outcome under the treatment rule d(W), which corresponds to treating a subject if and only if his or her covariate W is below 4. In this example E(Y₀) is identified as ∑_w E(Y | W = w, A = 0)P(W = w); however, since E(Y | W = w, A = 1) is undefined for W = 4, E(Y₁) is not identified (unless we are willing to extrapolate based on W < 4). In contrast, E(Y_d) is identified by the non-parametric G-computation formula: ∑_w E(Y | W = w, A = d(w))P(W = w). Thus the treatment effect E(Y_d − Y₀) is identified, but E(Y₁ − Y₀) is not. The redefined causal parameter can be interpreted as the difference in expected counterfactual outcome if only those subjects with W < 4 were treated as compared to the outcome if no one were treated.
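The identification argument in this example can be checked numerically with a small sketch. The data below are hypothetical, as is the helper `gcomp_mean`, which simply evaluates the non-parametric G-computation formula with empirical means and proportions and reports when a required cell is empty.

```python
def gcomp_mean(records, rule):
    """Evaluate sum_w E(Y | W = w, A = rule(w)) P(W = w) from (w, a, y) tuples.

    Returns None when some required (w, a) cell is empty, i.e. when the
    formula cannot be evaluated without extrapolation.
    """
    n = len(records)
    total = 0.0
    for w in sorted({r[0] for r in records}):
        stratum = [r for r in records if r[0] == w]
        ys = [y for _, a, y in stratum if a == rule(w)]
        if not ys:
            return None
        total += (sum(ys) / len(ys)) * len(stratum) / n
    return total


# Toy sample (assumed): no subject with W = 4 is treated.
sample = [
    (1, 1, 3.0), (1, 0, 1.0),
    (2, 1, 4.0), (2, 0, 2.0),
    (3, 1, 5.0), (3, 0, 3.0),
    (4, 0, 6.0), (4, 0, 8.0),
]

mean_y1 = gcomp_mean(sample, lambda w: 1)           # static rule: treat everyone
mean_y0 = gcomp_mean(sample, lambda w: 0)           # static rule: treat no one
mean_yd = gcomp_mean(sample, lambda w: int(w < 4))  # realistic rule d(W) = I(W < 4)
```

In this toy sample `mean_y1` is `None` because the cell (W = 4, A = 1) is empty, while `mean_yd` and `mean_y0` are both defined, so their difference estimates E(Y_d − Y₀).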

More generally, realistic rules indexed by a given static treatment a assign a only to those individuals for whom the probability of receiving a is greater than some user-specified probability α (such as α = 0.05). Let d(a, W) denote the rule indexed by static treatment a. If A is binary, then d(1, W) = 1 if g(1 | W) > α, otherwise d(1, W) = 0. Similarly, d(0, W) = 0 if g(0 | W) > α; otherwise d(0, W) = 1. Realistic causal parameters are defined as some parameter of the distribution of Y_d(a,W) (possibly conditional on some subset of baseline covariates V ⊂ W). Estimation of the causal effects of dynamic rules d(W) allows the positivity assumption to be relaxed to g(d(W) | W) > 0 a.e. (i.e. only those treatments that would be assigned based on rule d to patients with covariates W need to occur with positive probability within strata of W). Realistic rules d(a, W) are designed to satisfy this assumption by definition.
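For binary A, the rule d(a, W) can be written in a few lines. The sketch below assumes that an estimate of g(1 | W) is already available for the subject and that α < 0.5, so that at least one of the two treatment levels is always realistic; the function name is hypothetical.

```python
def realistic_rule(a, g1_given_w, alpha=0.05):
    """Return d(a, W) for binary treatment, given an estimate of g(1 | W).

    Assigns a when g(a | W) > alpha and the other treatment level otherwise.
    Assumes alpha < 0.5 so that the alternative is itself realistic.
    """
    g_a = g1_given_w if a == 1 else 1.0 - g1_given_w
    return a if g_a > alpha else 1 - a


# A subject who is almost never treated: d(1, W) falls back to 0.
rarely_treated = realistic_rule(1, g1_given_w=0.01)
# A subject who is almost always treated: d(0, W) falls back to 1.
rarely_untreated = realistic_rule(0, g1_given_w=0.99)
# A subject with good support for both treatment levels keeps the static assignment.
well_supported = (realistic_rule(1, 0.5), realistic_rule(0, 0.5))
```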

When a given treatment level a is unrealistic (i.e. when g(a | W) < α), realistic rules assign an alternative from among viable (well-supported) choices. Choice of an alternative is straightforward when treatment is binary. When treatment has more than two levels, however, a rule for selecting the alternative treatment level is needed. One option is to assign a treatment level that is as close as possible to the original assignment while still remaining realistic. For example, if high doses of drugs occur with low probability in a certain subset of the population, a realistic rule might assign the maximum dose that occurs with probability > α in that subset. An alternative class of dynamic regimes, referred to as ‘intent-to-treat’ rules, instead assign a subject to his or her observed treatment value if an initial assignment is deemed unrealistic. Bembom and van der Laan¹⁴ and Moore et al.¹⁵ provide illustrations of both of these types of realistic rules using simulated and real data.
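One way to express the ‘closest realistic level’ option for a multilevel treatment is sketched below. The dose levels, the estimated probabilities and the tie-breaking choice (prefer the lower level when two realistic levels are equally close) are assumptions made for illustration only.

```python
def closest_realistic_level(target, g_given_w, alpha=0.05):
    """Pick the realistic treatment level closest to the target level.

    g_given_w maps each treatment level to an estimate of g(level | W).
    Ties are broken in favour of the lower level (an arbitrary choice).
    Returns None if no level has probability greater than alpha.
    """
    realistic = [lvl for lvl, p in g_given_w.items() if p > alpha]
    if not realistic:
        return None
    return min(realistic, key=lambda lvl: (abs(lvl - target), lvl))


# Hypothetical subgroup in which the highest dose is rarely given.
g_subgroup = {0: 0.50, 10: 0.30, 20: 0.18, 30: 0.02}
assigned_high = closest_realistic_level(30, g_subgroup)  # falls back to dose 20
assigned_mid = closest_realistic_level(10, g_subgroup)   # dose 10 is realistic
```

When the target is the highest dose, this choice coincides with the rule in the text that assigns the maximum dose occurring with probability > α in the subgroup.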

The causal effects of realistic rules clearly differ from their static counterparts. The extent to which the new target parameter diverges from the initial parameter of interest depends on both the extent to which positivity violations occur in the finite sample (i.e. the extent of support available in the data for the initial target parameter) and on the user-supplied threshold α. The parametric bootstrap approach presented in Section 4 can be employed to data-adaptively select α based on the level of ETA.Bias deemed acceptable.¹⁴

6.5 Selection among a family of parameters

Each of the methods described for estimating causal effects in the presence of data sparsity corresponds to a particular strategy for altering the target parameter in exchange for improved identifiability. In each case, we have outlined how this tradeoff could be made systematically, based on some user-specified criterion such as a maximum acceptable level of ETA.Bias as estimated by the parametric bootstrap. We now summarise this general approach in terms of a formal method for estimation in the face of positivity violations.

Define a family of parameters. The family should include the initial target of inference together with a set of related parameters, indexed by γ in index set I, where γ represents the extent to which a given family member trades improved identifiability for decreased proximity to the initial target. In the examples given in the previous sections, γ could be used to index a set of projection functions h(a, V) based on an increasingly restrictive range of the possible values 𝒜, the degree to which the adjustment covariate set or sample is restricted, or the choice of a threshold used to define a realistic rule.
Apply the parametric bootstrap to generate an estimate ETA.Bias for each γ ∈ I. In particular, this involves estimating the data generating distribution, simulating new data from this estimate, and then applying an estimator of each target parameter indexed by γ.
Select the target parameter from the set that falls below a pre-specified threshold for acceptable ETA.Bias. In particular, select the parameter from within this set that is indexed by the value γ that corresponds to the greatest proximity to the initial target.
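The selection step of this procedure can be sketched as follows. The sketch assumes that, for each γ, the parametric bootstrap has already produced a set of estimates together with the value of the target parameter under the estimated data generating distribution, and that ETA.Bias is computed as the difference between the mean of the bootstrap estimates and that value. The numbers and function names are hypothetical.

```python
from statistics import mean


def eta_bias(bootstrap_estimates, truth_under_fitted_model):
    """Mean of the bootstrap estimates minus the parameter value implied by
    the estimated data generating distribution (assumed definition)."""
    return mean(bootstrap_estimates) - truth_under_fitted_model


def select_gamma(gammas_by_proximity, bias_by_gamma, threshold):
    """Return the first gamma, ordered from closest to farthest from the
    initial target, whose absolute estimated bias is below the threshold."""
    for gamma in gammas_by_proximity:
        if abs(bias_by_gamma[gamma]) < threshold:
            return gamma
    return None  # no family member meets the bias criterion


# Hypothetical bootstrap output: gamma = 0.0 is the initial target.
bootstrap_output = {
    0.00: ([1.9, 2.1, 2.3], 1.50),
    0.05: ([1.4, 1.5, 1.6], 1.45),
    0.10: ([1.2, 1.2, 1.2], 1.20),
}
bias_by_gamma = {g: eta_bias(est, truth) for g, (est, truth) in bootstrap_output.items()}
selected = select_gamma(sorted(bootstrap_output), bias_by_gamma, threshold=0.1)
```

Here the initial target (γ = 0) is rejected because its estimated bias exceeds the threshold, and the next closest member of the family is selected.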

This approach allows an estimator to be defined in terms of an algorithm that estimates the parameter within a candidate family that is as close to the initial target of inference as possible while remaining within some user-supplied limit on the extent of tolerable bias due to positivity violations.

7 Conclusions

The identifiability of causal effects relies on sufficient variation in treatment assignment within covariate strata. The strong version of positivity requires that each possible treatment occur with positive probability in each covariate stratum; depending on the model and target parameter, this assumption can be relaxed to some extent. In addition to assessing identifiability based on measurement of and control for sufficient confounders, data analyses should directly assess threats to identifiability posed by positivity violations.

The parametric bootstrap is a practical tool for assessing such threats, and provides a quantitative estimator-specific estimate of bias arising largely from positivity violations. This article has focused on the positivity assumption for the causal effect of a treatment assigned at a single time point. Extension to a longitudinal setting in which the goal is to estimate the effect of multiple treatments assigned sequentially over time introduces considerable additional complexity. First, practical violations of the positivity assumption can arise more readily in this setting. Under the longitudinal version of the positivity assumption the conditional probability of each possible treatment history should remain positive regardless of covariate history. However, this probability is the product of time point-specific treatment probabilities given the past. When the product is taken over multiple time points it is easy for treatment histories with very small conditional probabilities to arise. Second, longitudinal data make it harder to diagnose the bias arising due to positivity violations. Implementation of the parametric bootstrap in longitudinal settings requires Monte Carlo simulation both to implement the G-computation estimator and to generate each bootstrap sample. In particular, this requires estimating and sampling from the time-point specific conditional distributions of all covariates and treatment given the past. Additional research on assessing the impact of positivity bias on longitudinal causal parameters is needed, including investigation of the parametric bootstrap in this setting.
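A short calculation illustrates how quickly the probability of a treatment history can shrink when it is a product over time points. The per-time-point probabilities below are hypothetical and the helper name is an assumption.

```python
import math


def treatment_history_probability(per_time_point_probs):
    """Product of the time point-specific treatment probabilities given the past."""
    return math.prod(per_time_point_probs)


# Even a well-supported probability of 0.5 at each of 10 time points
# yields a treatment history with probability below 0.001.
p_history = treatment_history_probability([0.5] * 10)
```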

When positivity violations occur for structural reasons rather than due to chance, a causal parameter that avoids these positivity violations will often be of substantial interest. For example, when certain treatment levels are contraindicated for certain types of individuals, the average treatment effect in the population may be of less interest than the effect of treatment among that subset of the population without contraindications, or alternatively, the effect of an intervention that assigns treatment only to those subjects without contraindications. Similarly, the effect of a multilevel treatment may be of greatest interest for only a subset of treatment levels.

In other cases researchers may be happy to settle for a better estimate of a less interesting parameter. Sample restriction, estimation of realistic parameters, and change in projection function h(a, V) all change the causal effect being estimated; in contrast, restriction of the covariate adjustment set often results in estimation of a non-causal parameter. However, all of these approaches can be understood as means to shift from a poorly identified initial target towards a parameter that is less ambitious but more fully supported by the available data. The new estimand is not determined a priori by the question of interest, but rather is driven by the observed data distribution in the finite sample at hand. There is thus an explicit trade-off between identifiability and proximity to the initial target of inference. Ideally, this trade-off will be made in a systematic way rather than on an ad hoc basis at the discretion of the investigator. Definition of an estimator that selects among a family of parameters according to some pre-specified criteria is a means to formalise this trade-off. An estimate of bias based on the parametric bootstrap can be used to implement the tradeoff in practice.

In summary, we offer the following advice for applied analyses: First, define the causal effect of interest based on careful consideration of structural positivity violations. Second, consider estimator behaviour in the context of positivity violations when selecting an estimator. Third, apply the parametric bootstrap to provide a quantitative measure of estimator bias under data simulated to approximate the true data generating distribution. Finally, when positivity violations are a concern, choose an estimator that selects systematically among a family of parameters based on the trade-off between data support and proximity to the initial target of inference.

References

1. Cochran WG. Analysis of covariance: its nature and uses. Biometrics. 1957;13:261–281.
2. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.
3. Robins JM. Addendum to: A new approach to causal inference in mortality studies with a sustained exposure period–application to control of the healthy worker survivor effect. Math. Model. 1986;7(9–12):1393–1512. (MR 87m:92078. Comput. Math. Appl. 1987 14(9–12): 923–945.
4. Robins JM. Proceedings of the American Statistical Association: Section on Bayesian Statistical Science. Alexandria, VA: 1999. Robust estimation in sequentially ignorable missing data and causal inference models; pp. 6–10.
5. Wang Y, Petersen M, Bangsberg D, van der Laan MJ. Technical Report 211, Division of Biostatistics. Berkeley: University of California; 2006. Diagnosing bias in the inverse probability of treatment weighted estimator resulting from violation of experimental treatment assignment.
6. Pearl J. Causality: models, reasoning, and inference. New York: Cambridge University Press; 2000.
7. Neyman J. On the application of probability theory to agricultural experiments. Essay on principles: section 9. Stat Sci. 1923;5:465–480.
8. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
9. Robins JM. Proceedings of the American Statistical Association. Section on Bayesian Statistical Science 1997. Alexandria, VA: 1998. Marginal structural models; pp. 1–10.
10. Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran E, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) New York: Springer; 1999. pp. 95–133.
11. Neugebauer R, van der Laan MJ. Non-parametric causal effects based on marginal structural models. J Stat Plan Infer. 2007;137(2):419–434.
12. Robins JM. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J Chronic Dis. 1987;40(2):139s–161s. doi: 10.1016/s0021-9681(87)80018-8.
13. Neugebauer R, van der Laan MJ. Why prefer DR estimates. J Stat Plan Infer. 2005;129(1–2):405–426.
14. Bembom O, van der Laan MJ. A practical illustration of the importance of realistic individualized treatment rules in causal inference. EJS. 2007;1:574–596. doi: 10.1214/07-EJS105.
15. Moore KL, Neugebauer RS, van der Laan MJ, Tager IB. Technical Report 255, Division of Biostatistics. Berkeley: University of California; 2009. Causal inference in epidemiological studies with strong confounding.
16. Cole SR, Hernan MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–664. doi: 10.1093/aje/kwn164.
17. Rosenblum MM, van der Laan MJ. Confidence intervals for the population mean tailored to small sample sizes, with applications to survey sampling. Int J Biostat. 2001;1:4. doi: 10.2202/1557-4679.1118.
18. van der Laan MJ, Dudoit S. Technical Report 130, Division of Biostatistics. Berkeley: University of California; 2003. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples.
19. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Genet Mol Biol. 2007;6 doi: 10.2202/1544-6115.1309.
20. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. London: Springer; 2009.
21. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560. doi: 10.1097/00001648-200009000-00011.
22. Kish L. Weighting for unequal pi. J Official Stat. 1992;8:183–200.
23. Bembom O, van der Laan MJ. Technical Report 230, Division of Biostatistics. Berkeley: University of California; 2008. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators.
24. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, Inference for semiparametric models: some questions and an answer. Stat Sinica. 2001;11(4):920–936.
25. Robins JM. Commentary on using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard by Dawson and Lavori. Stat Med. 2002;21:1663–1680. doi: 10.1002/sim.1111.
26. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder) J Am Stat Assoc. 1999;94(1096–1120):1121–1146.
27. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):11.
28. Rosenblum M, van der Laan M. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat. 2010;6(2):19. doi: 10.2202/1557-4679.1238.
29. Freedman DA, Berk RA. Weighting regressions by propensity scores. Eval Rev. 2008;32(4):392–409. doi: 10.1177/0193841X08317586.
30. Petersen ML, Porter K, Gruber S, Wang Y, van der Laan M. Technical report, Division of Biostatistics. Berkeley: University of California; 2010. Diagnosing and responding to violations in the positivity assumption.
31. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
32. Bembom O, Petersen ML, Rhee S-Y, et al. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Stat Med. 2009;28:152–172. doi: 10.1002/sim.3414.
33. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1) doi: 10.2202/1557-4679.1182. Article 18.
34. Johnson VA, Brun-Vezinet F, Clotet B, et al. Update of the drug resistance mutations in HIV-1: December 2009. Top HIV Med. 2009;17(5):138–145.
35. Bembom O, Fessel JW, Shafer RW, van der Laan MJ. Technical Report 231, Division of Biostatistics. Berkeley: University of California; 2008. Data-adaptive selection of the adjustment set in variable importance estimation.
36. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical Report 330, National Bureau of Economic Research. 2006.
37. LaLonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev. 1986;76:604–620.
38. Heckman J, Ichimura H, Todd R. Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud. 1997;64:605–654.
39. Dehejia R, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94:1053–1062.
40. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):3. doi: 10.2202/1557-4679.1022.
41. Robins JM, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat Med. 2008;27:4678–4721. doi: 10.1002/sim.3301.

[R1] 1.Cochran WG. Analysis of covariance: its nature and uses. Biometrics. 1957;13:261–281. [Google Scholar]

[R2] 2.Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512. [Google Scholar]

[R3] 3.Robins JM. Addendum to: A new approach to causal inference in mortality studies with a sustained exposure period–application to control of the healthy worker survivor effect. Math. Model. 1986;7(9–12):1393–1512. (MR 87m:92078. Comput. Math. Appl. 1987 14(9–12): 923–945. [Google Scholar]

[R4] 4.Robins JM. Proceedings of the American Statistical Association: Section on Bayesian Statistical Science. Alexandria, VA: 1999. Robust estimation in sequentially ignorable missing data and causal inference models; pp. 6–10. [Google Scholar]

[R5] 5.Wang Y, Petersen M, Bangsberg D, van der Laan MJ. Technical Report 211, Division of Biostatistics. Berkeley: University of California; 2006. Diagnosing bias in the inverse probability of treatment weighted estimator resulting from violation of experimental treatment assignment. [Google Scholar]

[R6] 6.Pearl J. Causality: models, reasoning, and inference. New York: Cambridge University Press; 2000. [Google Scholar]

[R7] 7.Neyman J. On the application of probability theory to agricultural experiments. Essay on principles: section 9. Stat Sci. 1923;5:465–480. [Google Scholar]

[R8] 8.Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701. [Google Scholar]

[R9] 9.Robins JM. Proceedings of the American Statistical Association. Section on Bayesian Statistical Science 1997. Alexandria, VA: 1998. Marginal structural models; pp. 1–10. [Google Scholar]

[R10] 10.Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran E, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) New York: Springer; 1999. pp. 95–133. [Google Scholar]

[R11] 11.Neugebauer R, van der Laan MJ. Non-parametric causal effects based on marginal structural models. J Stat Plan Infer. 2007;137(2):419–434. [Google Scholar]

[R12] 12.Robins JM. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J Chronic Dis. 1987;40(2):139s–161s. doi: 10.1016/s0021-9681(87)80018-8. [DOI] [PubMed] [Google Scholar]

[R13] 13.Neugebauer R, van der Laan MJ. Why prefer DR estimates. J Stat Plan Infer. 2005;129(1–2):405–426. [Google Scholar]

[R14] 14.Bembom O, van der Laan MJ. A practical illustration of the importance of realistic individualized treatment rules in causal inference. EJS. 2007;1:574–596. doi: 10.1214/07-EJS105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Moore KL, Neugebauer RS, van der Laan MJ, Tager IB. Technical Report 255, Division of Biostatistics. Berkeley: University of California; 2009. Causal inference in epidemiological studies with strong confounding. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Cole SR and Hernan MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–664. doi: 10.1093/aje/kwn164. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Rosenblum MM, van der Laan MJ. Confidence intervals for the population mean tailored to small sample sizes, with applications to survey sampling. Int J Biostat. 2001;1:4. doi: 10.2202/1557-4679.1118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.van der Laan MJ, Dudoit S. Technical Report 130, Division of Biostatistics. Berkeley: University of California; 2003. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. [Google Scholar]

[R19] 19.van der Laan MJ, Polley EC, Hubbard AE. Super learner. Genet Mol Biol. 2007;6 doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]

[R20] 20.Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. London: Springer; 2009. [Google Scholar]

[R21] 21.Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [DOI] [PubMed] [Google Scholar]

[R22] 22.Kish L. Weighting for unequal pi. J Official Stat. 1992;8:183–200. [Google Scholar]

[R23] 23.Bembom O, van der Laan MJ. Technical Report 230, Division of Biostatstics. Berkeley: University of California; 2008. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. [Google Scholar]

[R24] 24.Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, Inference for semiparametric models: some questions and an answer. Stat Sinica. 2001;11(4):920–936. [Google Scholar]

[R25] 25.Robins JM. Commentary on using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard by Dawson and Lavori. Stat Med. 2002;21:1663–1680. doi: 10.1002/sim.1111. [DOI] [PubMed] [Google Scholar]

[R26] 26.Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder) J Am Stat Assoc. 1999;94(1096–1120):1121–1146. [Google Scholar]

[R27] 27.van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):11. [Google Scholar]

[R28] 28.Rosenblum M, van der Laan M. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat. 2010;6(2):19. doi: 10.2202/1557-4679.1238. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Freedman DA, Berk RA. Weighting regressions by propensity scores. Eval Rev. 2008;32(4):392–409. doi: 10.1177/0193841X08317586. [DOI] [PubMed] [Google Scholar]

[R30] 30.Petersen ML, Porter K, Gruber S, Wang Y, van der Laan M. Technical report, division of Biostatstics. Berkeley: Universtiy of California; 2010. Diagnosing and responding to violations in the positivity assumption. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.van der Vaart AW, Wellner JA. Weak Convergence and Emprical Processes. New York: Springer-Verlag; 1996. [Google Scholar]

[R32] 32. Bembom O, Petersen ML, Rhee S-Y, et al. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Stat Med. 2009;28:152–172. doi: 10.1002/sim.3414.

[R33] 33. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat. 2010;6(1):18. doi: 10.2202/1557-4679.1182.

[R34] 34. Johnson VA, Brun-Vezinet F, Clotet B, et al. Update of the drug resistance mutations in HIV-1: December 2009. Top HIV Med. 2009;17(5):138–145.

[R35] 35. Bembom O, Fessel JW, Shafer RW, van der Laan MJ. Data-adaptive selection of the adjustment set in variable importance estimation. Technical Report 231, Division of Biostatistics. Berkeley: University of California; 2008.

[R36] 36. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical Report 330, National Bureau of Economic Research; 2006.

[R37] 37. LaLonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev. 1986;76:604–620.

[R38] 38. Heckman J, Ichimura H, Todd P. Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud. 1997;64:605–654.

[R39] 39. Dehejia R, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94:1053–1062.

[R40] 40. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):3. doi: 10.2202/1557-4679.1022.

[R41] 41. Robins JM, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat Med. 2008;27:4678–4721. doi: 10.1002/sim.3301.
