Summary
Standardized means, commonly used in observational studies in epidemiology to adjust for potential confounders, are equal to inverse probability weighted means with weights equal to the reciprocals of the empirical propensity scores. More refined standardization corresponds to empirical propensity scores computed under more flexible models. Unnecessary standardization induces efficiency loss. However, according to the theory of inverse probability weighted estimation, propensity scores estimated under more flexible models yield more precise inverse probability weighted means. This apparent contradiction is clarified by explicitly stating the assumptions under which the improvement in precision is attained.
Some key words: Causal inference, Propensity score, Standardized mean
1. Introduction
Often, epidemiological studies aim to evaluate the causal effect of a discrete exposure on an outcome. In observational studies systematic bias due to confounding is a serious concern. For this reason, investigators routinely collect and adjust for a large number of confounding factors in data analyses. A common analytic strategy is to categorize the confounders and then to compare the exposure group-specific standardized means. These are exposure group-specific weighted means of the outcome across levels of the categorized confounders, with weights equal to the empirical probabilities of the categorized confounders in the entire sample. It is well known that overcategorization, i.e. unnecessary categorization, may induce efficiency losses. This issue is essentially the same as the well-understood increase in variance induced by adding to a linear regression model covariates that have no partial correlation with the outcome (Cochran, 1968). It has been studied in a number of nonlinear regression settings, e.g. Mantel & Haenszel (1959), Breslow (1982), Gail (1988), Robinson & Jewell (1991), Neuhäuser & Becher (1997) and De Stavola & Cox (2008), and has been empirically analyzed for standardized means in Brookhart et al. (2006).
The issue, however, appears to contradict well-known facts in the theory of inverse probability weighted estimation. Specifically, a standardized mean is equal to a so-called inverse probability of treatment weighted mean. More precisely, it is equal to a group-specific mean of the outcome weighted by the inverse of the empirical propensity score. An empirical propensity score is the maximum likelihood estimate of the true propensity score, i.e. of the probability of being in the exposure group given the confounders, under a saturated model for the probability of exposure given the categorized confounder. The apparent contradiction is that more refined categorization corresponds to more flexible models for the propensity score, and according to the theory of inverse probability estimation, the use of more flexible propensity score models induces an improvement in the precision of inverse probability weighted means, not the decrease in precision that regression theory indicates.
The purpose of this note is to clarify this apparent contradiction by showing that efficiency losses induced by unnecessarily refined categorizations do not contradict, and are in fact a consequence of, the theory of inverse probability estimation.
2. The apparent contradiction
Consider a cohort study in which a discrete exposure variable A, an outcome Y and a vector of pre-exposure covariates X are measured for each of n subjects drawn at random from a study population. Although the typical goal of such a study is the evaluation of the exposure effect on the outcome, i.e. a comparison across exposure levels, the issues in this note are best understood by considering inference about the outcome mean at one specific exposure level. Thus, we will assume that A is binary and that the goal is to estimate the outcome mean at exposure level A = 1. Consider a categorization of X into J strata and let L denote the polytomous variable that records the stratum to which a subject with covariates X belongs. The standardized mean at exposure level A = 1 with categorized variable L is
(1) $\hat{\mu} = \sum_{l=1}^{J} E_n(Y \mid A = 1, L = l)\, E_n\{I_l(L)\}$,

where throughout, for any U and V, $E_n(U \mid V = v) = \sum_{i=1}^{n} U_i I_v(V_i) / \sum_{i=1}^{n} I_v(V_i)$, $E_n(U) = n^{-1}\sum_{i=1}^{n} U_i$ and $I_v(V)$ denotes the indicator that V = v.
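To fix ideas, the following minimal numerical sketch computes μ̂ directly from its definition in (1); the data-generating law, sample size and stratum coding are hypothetical choices made only for illustration.

```python
# Minimal sketch of the standardized mean in (1); the data-generating law,
# sample size and stratum coding below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
L = rng.integers(0, 3, size=n)          # categorized confounder with J = 3 strata
A = rng.binomial(1, 0.2 + 0.2 * L)      # binary exposure whose prevalence varies with L
Y = rng.normal(1.0 + A + 0.5 * L)       # outcome

# E_n(Y | A = 1, L = l): outcome mean among exposed subjects in stratum l
# E_n{I_l(L)}: empirical probability of stratum l in the whole sample
mu_hat = sum(Y[(A == 1) & (L == l)].mean() * np.mean(L == l) for l in np.unique(L))
print(mu_hat)
```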
For standardized means to be informative about the causal effects, certain assumptions need to hold. The issue is best articulated within the potential outcomes framework. Let Ya be the subject's potential outcome if, perhaps contrary to fact, he is exposed to A = a. Contrasts comparing E(Y1) and E(Y0) quantify the causal effect of exposure. The standardized mean μ̂ is consistent for μ ≡ E(Y1) under the following assumptions.
Assumption 1. Consistency: Y = YA.
Assumption 2. Positivity: pr(A = 1 | L) > 0.
Assumption 3. No unmeasured confounders: Y1 and A are conditionally independent given L.

Consistency of μ̂ follows because, under Assumptions 1–3,

(2) $E(Y_1) = E\{E(Y \mid A = 1, L)\}$.
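For completeness, here is a short sketch of the standard argument behind (2); it uses only iterated expectations together with Assumptions 1–3.

```latex
% Sketch of the derivation of (2) under Assumptions 1--3
\begin{align*}
E(Y_1) &= E\{E(Y_1 \mid L)\}        && \text{(iterated expectations)}\\
       &= E\{E(Y_1 \mid A = 1, L)\} && \text{(Assumption 3: $Y_1$ and $A$ independent given $L$)}\\
       &= E\{E(Y \mid A = 1, L)\},  && \text{(Assumption 1: $Y = Y_A$)}
\end{align*}
% Assumption 2 (positivity) guarantees that the inner conditional
% expectation given A = 1 and L is well defined.
```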
The apparent contradiction discussed in this note refers to the asymptotic behaviour of μ̂ under two categorizations, one more refined than the other. The essence of the matter is best understood by considering the extreme case contrasting the asymptotic behaviour of the adjusted average μ̂ with that of the crude unadjusted average

$\tilde{\mu} = E_n(Y \mid A = 1)$.
Our discussion focusses on this comparison. The well-known risk of bias induced by underadjustment, i.e. by failure to adjust for an important confounder, is vividly unmasked in this extreme case: μ̃ does not generally converge in probability to E(Y1). Formally, μ̃ converges to E(Y1 | A = 1), which is not generally equal to E(Y1) because Y1 and A may share the common determinant L. Consistency of μ̃ requires that, in addition to Assumptions 1–3, at least one of the following two independencies holds.
Assumption 4. The variables Y and L are conditionally independent given A = 1.
Assumption 5. The variables A and L are independent.
In the Appendix we show that μ̂ solves the inverse probability weighted estimating equation
(3) $\sum_{i=1}^{n} \dfrac{A_i}{E_n(A \mid L = L_i)}\,(Y_i - \mu) = 0$,
whereas μ̃ solves the inverse probability weighted estimating equation
(4) $\sum_{i=1}^{n} \dfrac{A_i}{E_n(A)}\,(Y_i - \mu) = 0$,
whence the apparent contradiction emerges. Specifically, both En(A | L) and En(A) can be regarded as efficient estimators of the propensity score π(l) ≡ E(A | L = l), the former under a saturated model on L and the latter under the smaller model that assumes independence of A and L. According to the theory of inverse probability estimation, inclusion of the covariate L in an efficiently estimated model for the propensity score should not be detrimental to the efficiency with which E(Y1) is estimated, even if the covariate is not needed for bias correction. This appears to contradict the fact that, under Assumptions 1–3, μ̃ is more efficient than μ̂ when Assumption 4 holds and Assumption 5 fails.
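A minimal numerical sketch of this equivalence, again with a hypothetical data-generating law: the standardized mean (1) coincides with the solution of the weighted equation (3), and the crude mean with the solution of (4).

```python
# Numerical sketch: the standardized mean equals the inverse probability
# weighted mean with weights 1/E_n(A | L), and the crude mean equals the
# weighted mean with weights 1/E_n(A). The data below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
L = rng.integers(0, 4, size=n)
A = rng.binomial(1, 0.15 + 0.15 * L)
Y = rng.normal(2.0 + A + L)

# standardized mean, display (1)
mu_hat = sum(Y[(A == 1) & (L == l)].mean() * np.mean(L == l) for l in np.unique(L))

# empirical propensity scores E_n(A | L = l): the saturated-model fit
pi_hat = np.array([A[L == l].mean() for l in np.unique(L)])[L]

# solutions of the estimating equations (3) and (4)
mu_hat_ipw = np.sum(A * Y / pi_hat) / np.sum(A / pi_hat)
mu_tilde = np.sum(A * Y / A.mean()) / np.sum(A / A.mean())  # = Y[A == 1].mean()

print(np.isclose(mu_hat, mu_hat_ipw))          # True: mu_hat solves (3)
print(np.isclose(mu_tilde, Y[A == 1].mean()))  # True: mu_tilde solves (4)
```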
3. Explaining the apparent contradiction
The apparent contradiction arises because of the vagueness of the statement about the efficiency gains induced by including L in the propensity score estimator: the statement does not explicitly mention the assumptions required for its validity. To explain the contradiction, let 𝒜 denote the model defined by Assumptions 1–3, let ℬ denote the model defined by Assumptions 1–4 and let 𝒞 denote the model defined by Assumptions 1–3 and 5.
Both μ̂ and μ̃ are consistent for E(Y1) under model ℬ or 𝒞 but only μ̂ is consistent for E(Y1) under model 𝒜.
The estimator μ̂ is asymptotically efficient under model 𝒜 and under model 𝒞 but μ̃ is asymptotically efficient under model ℬ. These efficiency results are best understood by examining the likelihood
(5) $\mathcal{L}_n = \mathcal{L}_{1,n} \times \mathcal{L}_{2,n}$,

where $\mathcal{L}_{1,n} = \prod_{i=1}^{n} f_{Y \mid A,L}(Y_i \mid A_i, L_i)\, f_L(L_i)$ and $\mathcal{L}_{2,n} = \prod_{i=1}^{n} f_{A \mid L}(A_i \mid L_i)$.
Model 𝒜 imposes restrictions on the law of (Y1, L, A) but not on the distribution fA,Y,L of the observed data (Y, L, A) (Gill et al., 1997) and hence is a nonparametric model for the observables. Because the estimator μ̂ is the plug-in estimator of μ = E{E(Y | A = 1, L)}, it is the maximum likelihood estimator of μ under the nonparametric model 𝒜.
Model 𝒞 restricts the law fA|L entering the second term on the right-hand side of (5), since Assumption 5 postulates that fA|L = fA. Because, by (2), μ depends only on the components of the law entering the ℒ1,n-part of the likelihood (5), the maximum likelihood estimators of μ under models 𝒜 and 𝒞 must agree. Thus, μ̂ is the maximum likelihood estimator of μ under model 𝒞 and consequently asymptotically efficient, i.e. avar(μ̂) is equal to the semiparametric variance bound for μ under the model; here and hereafter, avar(·) denotes the variance of the limiting distribution.
Model ℬ imposes the restriction fY | A=1,L = fY | A=1 and hence it restricts the law fY | A,L in ℒ1,n. The estimator μ̂ is not the maximum likelihood estimator under model ℬ because it does not exploit this restriction. In fact, under model ℬ, μ̃ is asymptotically efficient. Furthermore, μ̃ is asymptotically strictly more efficient than μ̂ unless Assumption 5 also holds. Proof of these results can be found in the online Supplementary Material. We are now ready to explain the contradiction.
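The following Monte Carlo sketch illustrates this efficiency comparison under a hypothetical data-generating law compatible with model ℬ: the distribution of Y given A = 1 is free of L, so Assumption 4 holds, while A depends on L, so Assumption 5 fails.

```python
# Monte Carlo sketch under a hypothetical law satisfying model B
# (Assumptions 1-4 hold; Assumption 5 fails because A depends on L):
# the crude mean mu_tilde has smaller sampling variance than the
# standardized mean mu_hat, although both are consistent for E(Y_1).
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 500, 2000
hat_draws, tilde_draws = [], []

for _ in range(n_sim):
    L = rng.integers(0, 3, size=n)
    A = rng.binomial(1, 0.2 + 0.3 * L)              # Assumption 5 fails
    Y = np.where(A == 1,
                 rng.normal(1.0, 1.0, size=n),      # Y | A = 1 free of L: Assumption 4
                 rng.normal(L, 1.0, size=n))        # Y | A = 0 may depend on L
    hat_draws.append(sum(Y[(A == 1) & (L == l)].mean() * np.mean(L == l)
                         for l in np.unique(L)))
    tilde_draws.append(Y[A == 1].mean())

print(np.mean(hat_draws), np.var(hat_draws))      # approx. 1, larger variance
print(np.mean(tilde_draws), np.var(tilde_draws))  # approx. 1, smaller variance
```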
Given an arbitrary function d(l) and any π(l), let μ̂d(π) denote the solution to

(6) $\sum_{i=1}^{n} d(L_i)\,\dfrac{A_i}{\pi(L_i)}\,(Y_i - \mu) = 0$.
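Because (6) is linear in μ, its solution can be written explicitly; a sketch of the closed form, which follows directly from the reconstruction of (6) above:

```latex
% Closed form of the solution of the estimating equation (6)
\hat{\mu}_d(\pi)
  = \left\{ \sum_{i=1}^{n} d(L_i)\,\frac{A_i}{\pi(L_i)} \right\}^{-1}
    \sum_{i=1}^{n} d(L_i)\,\frac{A_i}{\pi(L_i)}\, Y_i .
```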
The following Lemma, a corollary of the theory laid out in Robins et al. (1994), states the precise result of the theory of inverse probability weighted estimation that the gain in efficiency of μ̃ over μ̂ appears to contradict.
Lemma 1. Given one of the models 𝒜, ℬ or 𝒞 for the observables, let π̂(l) and π̃(l) be the maximum likelihood estimators of fA|L(1 | l) under two nested models for fA|L that are correctly specified under the assumptions of the given model. Then √n{μ̂d(π̂) − μ} and √n{μ̂d(π̃) − μ} converge to mean-zero normal distributions. If π̂(l) is the estimator of fA|L(1 | l) under the larger model, then

avar{μ̂d(π̂)} ≤ avar{μ̂d(π̃)}.
Observe that because μ̂ solves (3) and μ̃ solves (4) we can write μ̂ = μ̂d1(π̂) and μ̃ = μ̂d1(π̃) with d1(l) = 1, π̂(l) = En(A | L = l) and π̃(l) = En(A). The improved efficiency of μ̃ over μ̂ under model ℬ, i.e. the fact that avar(μ̃) is generally strictly smaller than avar(μ̂), does not contradict Lemma 1 because π̃(l) does not meet its premise. Specifically, Lemma 1 requires that π̃(l) be computed under a model for fA|L that is correctly specified under the given model, in the case of our concern, model ℬ. However, π̃(l) = En(A) is the fitted value under a model for fA|L that assumes that A and L are independent, an assumption not made by model ℬ.
The efficiency gains conferred by μ̃ over μ̂ under model ℬ can be deduced from the general theory of efficient inverse probability estimation in semiparametric models for missing data (Robins et al., 1994, Proposition 8.1). In the Supplementary Material we apply this theory to show that: (a) μ̃ is asymptotically equivalent to μ̂d2(π̂) with d2(l) = E(A | L = l); and (b) μ̂d2(π̂), and therefore μ̃, is semiparametric efficient under ℬ.
In conclusion, the fallacy arises because the claim about efficiency gains presupposes an explicit model for the law of (A, L, Y) and requires that both propensity score models be correctly specified under that model. However, En(A) is the efficient propensity score estimator only under a model assuming independence of A and L, an assumption not implied by model ℬ, so the efficiency claim does not apply.
4. Concluding remarks
Our analysis extends to inference in marginal structural mean models for the effect of a possibly polytomous exposure A given a possibly strict subset Z of the confounders L. These models assume that E(Ya | Z) = m(a, Z; β), where m(·) is known and β is unknown. Estimators of β are obtained by solving (6) with A/π(l) replaced by an estimator of 1/fA|L(A | L), μ replaced by m(A, Z; β) and with d(l) of the dimension of β. When Assumption 5 holds, using 1/f̃A(A), where f̃A(a) = En{Ia(A)} and Ia(A) is the indicator that A = a, yields consistent and asymptotically normal estimators of β that are generally more efficient than those obtained using 1/f̂A|L(A | L), where f̂A|L(a | l) = En{Ia(A) | L = l}. Once again, this raises an apparent contradiction with the theory of inverse probability weighted estimation, which can be explained as in § 3.
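Written out under the substitutions just described, a sketch of the resulting estimating equation for β in the note's notation, taking for concreteness f̂A|L to be the saturated-model fit of § 2:

```latex
% Estimating equation for beta in the marginal structural mean model
% E(Y_a | Z) = m(a, Z; beta); d(.) is vector-valued of the dimension of beta
\sum_{i=1}^{n} d(L_i)\,
  \frac{Y_i - m(A_i, Z_i; \beta)}{\hat{f}_{A \mid L}(A_i \mid L_i)} = 0 .
```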
Acknowledgments
Andrea Rotnitzky was funded by a grant from the National Institutes of Health, U.S.A. The authors wish to thank two referees and the associate editors for helpful comments.
Appendix
For any given law f(l, a, y), define the new law f*(l, a, y) = f(l) I1(a) f(y | a, l). Then E{E(Y | A = 1, L)} = E*(Y), where E(·) and E*(·) denote expectations under f and f* respectively. But f*(l, a, y)/f(l, a, y) = I1(a)/f(a | l), so E*(Y) = E{I1(A)Y/f(A | L)}, thus proving that E{E(Y | A = 1, L)} = E{AY/f(1 | L)} for any f when A is binary. That μ̂ solving (3) also admits the representation (1) follows by applying this result with f the empirical law.
Supplementary material
References
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149–56.

Breslow N. Design and analysis of case-control studies. Annu Rev Public Health. 1982;3:29–54.

Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics. 1968;24:295–313.

De Stavola BL, Cox DR. On the consequences of overstratification. Biometrika. 2008;95:992–6.

Gail M. The effect of pooling across strata in perfectly balanced studies. Biometrics. 1988;44:151–62.

Gill R, van der Laan M, Robins JM. Coarsening at random: characterizations, conjectures and counterexamples. In: Lin D, Fleming T, editors. Proc 1st Seattle Symp Biostatist. New York: Springer; 1997. pp. 255–94.

Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719–48.

Neuhäuser M, Becher H. Improved odds ratio estimation by post-hoc stratification of case-control data. Statist Med. 1997;16:993–1004.

Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.

Robinson L, Jewell NP. Some surprising results about covariate adjustment in logistic regression models. Int Statist Rev. 1991;59:227–40.