Abstract
We describe an estimator of the parameter indexing a model for the conditional odds ratio between a binary exposure and a binary outcome given a high-dimensional vector of confounders, when the exposure and a subset of the confounders are missing, not necessarily simultaneously, in a subsample. We argue that a recently proposed estimator restricted to complete cases confers more protection against model misspecification than existing ones, in the sense that the set of data laws under which it is consistent strictly contains each set of data laws under which each of the previous estimators is consistent.
Keywords: Inverse probability weighted, Logistic regression, Missing at random, Model misspecification
1. Introduction
A common aim of epidemiologic observational studies is to evaluate the causal effect of a dichotomous point exposure A on the risk of a disease outcome encoded in a binary variable Y. In an effort to account for confounding bias in such nonexperimental studies, investigators routinely collect, and adjust for in the data analysis, a large number of confounding factors encoded in a vector L. A common target of inference is a finite-dimensional parameter ψ indexing a model OR(L; ψ) for the conditional log-odds ratio function:
OR(L) = log[{pr(Y = 1 | A = 1, L) pr(Y = 0 | A = 0, L)}/{pr(Y = 1 | A = 0, L) pr(Y = 0 | A = 1, L)}] = OR(L; ψ), (1)
where OR(· ; ·) is a known function and the true value of ψ is unknown. Under the assumption of no unmeasured confounders, OR(L) quantifies, on the log-odds ratio scale, the causal effect of A on Y. The function OR(·) is attractive as it is invariant to alterations in the marginal distributions of (Y, L) or of (A, L). As a consequence, ψ can be consistently estimated from data (Yi, Ai, Li) (i = 1, . . . , n) collected under a variety of commonly employed epidemiological designs, in particular, simple random sample designs and case-control designs unmatched or matched on some components of L.
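This invariance is easy to check numerically. The sketch below (with an arbitrary joint law of (Y, A) at a fixed value of L; all numbers are illustrative, not from the paper) tilts the marginal law of A, and then of Y, and verifies that the conditional log-odds ratio is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_or(p):
    # p[y, a]: joint probabilities of (Y, A) at a fixed confounder value L
    return np.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))

p = rng.dirichlet(np.ones(4)).reshape(2, 2)  # arbitrary joint law of (Y, A) given L

# Tilt the marginal law of A: multiply each column by g(a) and renormalise.
g = np.array([0.3, 2.5])
q = p * g[np.newaxis, :]
q /= q.sum()

# Tilt the marginal law of Y similarly.
h = np.array([5.0, 0.1])
r = p * h[:, np.newaxis]
r /= r.sum()

assert np.isclose(log_or(p), log_or(q))
assert np.isclose(log_or(p), log_or(r))
```

The cancellation is exact: any factor depending on A alone, or on Y alone, appears in both the numerator and the denominator of the odds ratio.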
In this note, we consider the estimation of ψ when Y and a subset Lobs of the confounders L are always observed but A and the remaining components Lmis of L are not fully observed in all study participants: perhaps only A is missing in some units, only a random subset of Lmis is missing in others, and both are missing in a third subsample. We examine inference under the assumptions that:
pr(R = 1 | A, Y, L) = pr(R = 1 | Lobs, Y) (2)
and pr(R = 1 | Lobs, Y) > 0, where R = 1 if A and Lmis are jointly observed, and R = 0 otherwise. When A and Lmis can only be either both observed or both missing simultaneously, (2) is a particular instance of the assumption that the data are missing at random (Little & Rubin, 2002) which is routinely made in the literature on missing covariate data (Robins et al., 1994; Zhao et al., 1996; Lipsitz et al., 1998; Little & Rubin, 2002; Parzen et al., 2002). For concreteness, we assume that (Ri, Yi, Ai, Li) (i = 1, . . . , n) are independent and identically distributed. Our discussion equally applies when the data are collected from a case-control design unmatched or matched on some of the confounders L. We argue that an estimator of ψ recently proposed by Tchetgen Tchetgen et al. (2010), computed just from the complete-case units, i.e., those for which A and Lmis are observed, is more robust than currently available estimators in the sense that the set of data laws under which it is consistent strictly contains each set of data laws under which each of the previous estimators is consistent.
2. Dimension reducing estimation strategies
The right-hand side of (1) and the baseline log-odds function
oddsY(L) = log{pr(Y = 1 | A = 0, L)/pr(Y = 0 | A = 0, L)} (3)
determine the conditional probability function pr(Y = 1 | A, L). In the absence of missing data, to ensure maximal robustness against model misspecification, it is desirable to estimate ψ under a model that assumes just (1) rather than under a model that parameterizes the regression function pr(Y = 1 | A, L) and consequently parameterizes the right-hand sides of both (1) and (3). Tchetgen Tchetgen et al. (2010) showed that due to the curse of dimensionality this is not feasible when, as is often the case in applications seeking to render the assumption of no unmeasured confounders plausible, L is a vector with several components. The situation is aggravated when (A, Lmis) are missing by happenstance, as then the conditional missingness probability function pr(R = 1 | Lobs, Y) is also unknown. The following dimension reduction strategies appear to be available. One strategy is to assume (a) a parametric model oddsA(L; α) for
oddsA(L) = log{pr(A = 1 | Y = 0, L)/pr(A = 0 | Y = 0, L)}. (4)
Model (1) coupled with model oddsA(L; α) for (4) renders the parametric retrospective logistic regression model for A on Y and L,
pr(A = 1 | Y, L; ψ, α) = [1 + exp{−OR(L; ψ)Y − oddsA(L; α)}]⁻¹. (5)
The problem then reduces to estimation of a parametric logistic regression function with incomplete observations of the outcomes A and/or a subset Lmis of the covariates, when the probability of observing both A and Lmis depends only on the observed model covariates Y and Lobs. Under such conditions, it is well known that the standard logistic regression estimator of ψ, say ψ̃retro, restricted to the complete-case units, i.e., the units with no missing data, is consistent and asymptotically normal. In fact, when Lobs = L so that Lmis is empty, ψ̃retro is semiparametric efficient under the model that assumes (5) and condition (2) on the missing data mechanism, so under just these assumptions, no information is lost in large samples by discarding incomplete units (Robins et al., 1994).
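As an illustration, the sketch below simulates the simplest version of this setting, with Lobs = L so that only A is missing; the data-generating values (ψ = 1 and the various intercepts and slopes) are arbitrary choices, not taken from the paper. It fits the retrospective logistic regression of A on (Y, L) by Newton-Raphson using the complete cases only; the coefficient on Y estimates ψ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, psi_true = 20000, 1.0

# Simulated full data: L is always observed, A is missing when R = 0.
L = rng.normal(size=n)
Y = rng.binomial(1, 1 / (1 + np.exp(-0.3 * L)))
A = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + psi_true * Y + 0.4 * L))))
# Missingness depends only on the observed (Y, L), as in assumption (2).
R = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.5 * Y - 0.5 * L))))

def fit_logistic(X, y, n_iter=25):
    # Plain Newton-Raphson for logistic regression.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X,
                                X.T @ (y - p))
    return beta

cc = R == 1  # complete-case units
X = np.column_stack([np.ones(cc.sum()), Y[cc], L[cc]])
psi_retro = fit_logistic(X, A[cc])[1]  # coefficient on Y estimates psi
```

With n = 20000 the complete-case fit recovers ψ up to Monte Carlo error, even though roughly a third of the units are discarded.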
An alternative strategy would be to assume (b) a parametric model oddsY (L; η) for oddsY (L), which, in conjunction with model (1), renders the parametric logistic regression model for Y on A and L,
pr(Y = 1 | A, L; ψ, η) = [1 + exp{−OR(L; ψ)A − oddsY(L; η)}]⁻¹. (6)
The problem then reduces to estimation under a logistic regression model with a subset of the covariates incompletely observed and probability of complete observations that depends at most on the outcome and on the always observed covariates. However, when L is high-dimensional, this dimension-reducing strategy is insufficient to avoid the curse of dimensionality and further model restrictions are needed. One option is to postulate, in addition to (b), a parametric model f (A, Lmis | Lobs; υ) for f (A, Lmis | Lobs) and to conduct maximum likelihood estimation of the model parameters (ψ, η, υ). This strategy, however, leads to even more restrictions than the sole restriction (a) required by the retrospective logistic regression estimator ψ̃retro, because it leads not only to a parametric model for f (A | Y, L) but also for the joint law f (Y, A, Lmis | Lobs).
Alternatively, one might consider assuming (b) and (c) a parametric model pr(R = 1 | Lobs, Y ; λ) for pr(R = 1 | Lobs, Y), thus arriving at the problem of estimating the logistic regression model (6) under a parametric model for the probability of complete observations. In such a setting, the weighted logistic regression estimator of ψ, say ψ̃ipw, fitted in model (6) to the complete-case units with weights equal to π̂⁻¹ = pr(R = 1 | Lobs, Y; λ̂)⁻¹, where λ̂ is the maximum likelihood estimator of λ, is consistent and asymptotically normal (Robins et al., 1994). This estimator, known as the inverse probability weighted estimator, was used, for example, in Moore et al. (2009), in a setting in which only A was missing.
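A sketch of this two-step procedure under the same simplification as before (only A missing, Lobs = L; all parameter values arbitrary): logistic regression of R on (Y, L) over all units gives λ̂ and π̂, and a weighted logistic regression of Y on (A, L) over the complete cases gives ψ̃ipw:

```python
import numpy as np

rng = np.random.default_rng(1)
n, psi_true = 20000, 1.0

# Simulated data in which the prospective model (6) holds exactly.
L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-0.2 * L)))
Y = rng.binomial(1, 1 / (1 + np.exp(-(-0.3 + psi_true * A + 0.5 * L))))
R = rng.binomial(1, 1 / (1 + np.exp(-(0.8 + 0.4 * Y - 0.3 * L))))

def fit_logistic(X, y, w=None, n_iter=25):
    # Newton-Raphson for (optionally weighted) logistic regression.
    w = np.ones(len(y)) if w is None else w
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (w * p * (1 - p))[:, None]).T @ X,
                                X.T @ (w * (y - p)))
    return beta

# Step 1: maximum likelihood fit of the missingness model (c) on all units.
Xr = np.column_stack([np.ones(n), Y, L])
pi_hat = 1 / (1 + np.exp(-Xr @ fit_logistic(Xr, R)))

# Step 2: inverse probability weighted fit of (6) on the complete cases.
cc = R == 1
Xy = np.column_stack([np.ones(cc.sum()), A[cc], L[cc]])
psi_ipw = fit_logistic(Xy, Y[cc], w=1 / pi_hat[cc])[1]  # coefficient on A
```

Weighting the complete cases by π̂⁻¹ undoes the outcome- and covariate-dependent selection induced by conditioning on R = 1.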
It is possible to construct a so-called augmented inverse probability weighted estimator of ψ, say ψ̃aipw, which is more robust than ψ̃ipw. Like ψ̃ipw, the estimator ψ̃aipw is consistent if (b) and (c) hold, but unlike ψ̃ipw, it is also consistent even if (c) fails, provided (a), (b) and (d) a parametric model f (Lmis | Lobs, Y ; τ) for f (Lmis | Lobs, Y) hold. To compute ψ̃aipw one first computes, using only the complete-case units, the maximum likelihood estimators τ̂ and (α̂retro, ψ̃retro) of τ and (α, ψ) under models (d) and (5), respectively. The estimator ψ̃aipw of ψ is the subvector of (ψ̃ᵀ, η̃ᵀ)ᵀ solving

En[R π̂⁻¹ S(ψ, η) − (R − π̂) π̂⁻¹ Ê{S(ψ, η) | Y, Lobs}] = 0,

where S(ψ, η) = (1, A, Lᵀ)ᵀ{Y − pr(Y = 1 | A, L; ψ, η)}, Ê(· | Y, Lobs) is the expectation with respect to f̂(A, Lmis | Lobs, Y) = f (A | L, Y ; α̂retro, ψ̃retro) f (Lmis | Lobs, Y ; τ̂) and here and throughout En(·) denotes the empirical mean operator. Tchetgen Tchetgen (2009) developed a simple algorithm for implementing this estimator when L is fully observed.
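The construction can be sketched for the simplest case in which only A is missing (Lmis empty), so that the augmentation term averages S over the two values of A under the fitted retrospective law. The estimating equation used below is the generic augmented inverse probability weighted form of Robins et al. (1994), solved numerically; all data-generating values are arbitrary:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n, psi_true = 20000, 1.0

# Joint law f(Y, A | L) proportional to exp(psi*A*Y + eta_y(L)*Y + eta_a(L)*A):
# both the prospective and the retrospective conditionals are then logistic.
L = rng.normal(size=n)
eta_y, eta_a = -0.3 + 0.5 * L, 0.2 - 0.4 * L
cells = np.stack([np.zeros(n), eta_a, eta_y, psi_true + eta_y + eta_a], axis=1)
probs = np.exp(cells)
probs /= probs.sum(axis=1, keepdims=True)
draw = (probs.cumsum(axis=1) < rng.uniform(size=(n, 1))).sum(axis=1)
Y, A = draw // 2, draw % 2  # cell order: (0,0), (0,1), (1,0), (1,1)
R = rng.binomial(1, 1 / (1 + np.exp(-(0.6 + 0.3 * Y - 0.2 * L))))

expit = lambda x: 1 / (1 + np.exp(-x))

def fit_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X,
                                X.T @ (y - p))
    return beta

cc = R == 1

# Retrospective fit on complete cases gives the fitted law of A given (Y, L).
b = fit_logistic(np.column_stack([np.ones(cc.sum()), Y[cc], L[cc]]), A[cc])
p_a1 = expit(b[0] + b[1] * Y + b[2] * L)  # fitted pr(A = 1 | Y, L)

# Missingness model fit on all units.
lam = fit_logistic(np.column_stack([np.ones(n), Y, L]), R)
pi_hat = expit(lam[0] + lam[1] * Y + lam[2] * L)

def score(theta, a):
    # S(psi, eta) = (1, a, L)' * {Y - expit(psi*a + eta0 + eta1*L)}
    psi, e0, e1 = theta
    resid = Y - expit(psi * a + e0 + e1 * L)
    return np.stack([resid, a * resid, L * resid])

def estimating_eq(theta):
    s_obs = score(theta, np.where(cc, A, 0))  # A value irrelevant where R = 0
    s_aug = (1 - p_a1) * score(theta, 0) + p_a1 * score(theta, 1)
    U = (R / pi_hat) * s_obs - ((R - pi_hat) / pi_hat) * s_aug
    return U.mean(axis=1)

sol = root(estimating_eq, np.zeros(3))
psi_aipw = sol.x[0]
```

The first term is the inverse probability weighted score; the augmentation term recovers information from the incomplete units and delivers the extra layer of protection described above.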
In summary, of the dimension reducing procedures examined so far, the two which are consistent under the weakest modelling assumptions are:

(i) the estimator ψ̃retro, which requires that model (a) be correct;

(ii) the estimator ψ̃aipw, which requires that either models (b) and (c), or models (a), (b) and (d), be correct.
3. The proposed estimator
We now argue that an estimator of ψ recently derived in Tchetgen Tchetgen et al. (2010) computed just from the complete-case units confers more robustness against model misspecification than the estimators ψ̃retro and ψ̃aipw. To compute this estimator, one first computes the standard logistic regression estimators α̂ of α and η̂ of η in models (a) and (b) using only data from complete-case units. One then discards the separate estimators of ψ obtained from the fit of each model and instead computes the desired estimator, say ψ̂, by solving the estimating equations En{U(ψ; η̂, α̂)} = 0, where U is the doubly robust estimating function derived in Tchetgen Tchetgen et al. (2010).
Solving the estimating equation may appear computationally challenging, but this is not the case as Tchetgen Tchetgen & Rotnitzky (2011) provide an iterative algorithm that implements ψ̂ with standard logistic regression software. The algorithm is reproduced in the online Supplementary Material for completeness. These authors showed that in the absence of missing data, ψ̂ is consistent and asymptotically normal for ψ provided either oddsA(L; α) is a correct model for oddsA(L) or oddsY(L; η) is a correct model for oddsY(L), but not necessarily both assertions hold. This result immediately implies that with missing data, ψ̂ is an estimator of the parameter ψ indexing a model OR(L; ψ) for the complete-case log-OR function:

ORcc(L) = log[{pr(Y = 1 | A = 1, L, R = 1) pr(Y = 0 | A = 0, L, R = 1)}/{pr(Y = 1 | A = 0, L, R = 1) pr(Y = 0 | A = 1, L, R = 1)}],

which is consistent and asymptotically normal for ψ provided either oddsA(L; α) is a correct model for oddsAcc(L) or oddsY(L; η) is a correct model for oddsYcc(L), but not necessarily both, where

oddsAcc(L) = log{pr(A = 1 | Y = 0, L, R = 1)/pr(A = 0 | Y = 0, L, R = 1)},
oddsYcc(L) = log{pr(Y = 1 | A = 0, L, R = 1)/pr(Y = 0 | A = 0, L, R = 1)}

are the complete-case baseline log-odds functions.
Under (2), ORcc(L) = OR(L) because the full-data law f (A, Y, L) differs from the complete-case data law f (A, Y, L | R = 1) only in that f (Y, Lobs) is distinct from f (Y, Lobs | R = 1) and, as noted earlier, OR(L) remains unchanged under departures from the marginal law of (Y, L). Thus, a model OR(L; ψ) for ORcc(L) is also a model for OR(L). Furthermore, because oddsAcc(L) depends solely on pr(A = 1 | Y = 0, L, R = 1) and this is equal to pr(A = 1 | Y = 0, L) under (2), then oddsAcc(L) = oddsA(L) and thus model oddsA(L; α) is also a model for oddsA(L).
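Both equalities can be confirmed by direct enumeration. The sketch below builds an arbitrary joint law f(Y, A, L) with a single binary confounder (so Lobs = L), reweights each (Y, A) cell by a missingness probability that is free of A, and checks that neither the baseline log-odds of A nor the conditional log-odds ratio changes under conditioning on R = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)  # f[y, a, l]: joint law of (Y, A, L)
pi = rng.uniform(0.2, 0.9, size=(2, 2))         # pr(R = 1 | Y = y, L = l): free of A

def logit(p):
    return np.log(p / (1 - p))

for l in (0, 1):
    # Full-data baseline odds of A: logit pr(A = 1 | Y = 0, L = l).
    odds_a = logit(f[0, 1, l] / f[0, :, l].sum())
    # Complete-case law: each (y, a) cell reweighted by pi[y, l].
    fcc = f[:, :, l] * pi[:, l][:, None]
    odds_a_cc = logit(fcc[0, 1] / fcc[0, :].sum())
    assert np.isclose(odds_a, odds_a_cc)  # oddsAcc(L) = oddsA(L)
    # The conditional log odds ratio is likewise unchanged: ORcc(L) = OR(L).
    lor = np.log(f[1, 1, l] * f[0, 0, l] / (f[1, 0, l] * f[0, 1, l]))
    lor_cc = np.log(fcc[1, 1] * fcc[0, 0] / (fcc[1, 0] * fcc[0, 1]))
    assert np.isclose(lor, lor_cc)
```

The factor pi[y, l] enters the numerator and denominator of each quantity symmetrically in A, so it cancels exactly, mirroring the argument in the text.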
We thus arrive at the conclusion that under (2), ψ̂ is consistent for ψ if one of Conditions 1 and 2 below holds, but not necessarily both.
Condition 1. Model (a) is correctly specified.
Condition 2. A parametric model for oddsYcc(L) is correctly specified.
The identity

oddsYcc(L) = oddsY(L) + log{pr(R = 1 | Lobs, Y = 1)/pr(R = 1 | Lobs, Y = 0)} (7)

implies that a parametric model for the full-data baseline log-odds oddsY(L) and another for pr(R = 1 | L, Y) = pr(R = 1 | Lobs, Y) determine a parametric model for oddsYcc(L). Consequently Condition 2 is met, in particular, if Condition 3 is met and a parametric model for oddsYcc(L) is constructed by combining (b) and (c) via the identity (7).

Condition 3. Models (b) and (c) are correctly specified.
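The identity relating oddsYcc to oddsY and the missingness probabilities can likewise be checked by enumeration. In the sketch below (arbitrary positive cell probabilities, a single binary confounder with Lobs = L), the complete-case baseline log-odds of Y equals the full-data value plus the log of the ratio of missingness probabilities at Y = 1 and Y = 0:

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)  # f[y, a, l]: joint law of (Y, A, L)
pi = rng.uniform(0.2, 0.9, size=(2, 2))         # pr(R = 1 | Y = y, L = l): free of A

for l in (0, 1):
    # Full-data baseline log-odds of Y: log pr(Y=1 | A=0, L=l)/pr(Y=0 | A=0, L=l).
    odds_y = np.log(f[1, 0, l] / f[0, 0, l])
    # Complete-case conditional law of Y given (A = 0, L = l, R = 1).
    fcc = f[:, :, l] * pi[:, l][:, None]
    p_y_cc = fcc[:, 0] / fcc[:, 0].sum()
    odds_y_cc = np.log(p_y_cc[1] / p_y_cc[0])
    # Identity: oddsYcc(l) = oddsY(l) + log{pi(1, l)/pi(0, l)}.
    assert np.isclose(odds_y_cc, odds_y + np.log(pi[1, l] / pi[0, l]))
```

This is exactly the construction behind Condition 3: a model for oddsY and a model for the missingness probabilities jointly induce a model for oddsYcc.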
In summary, (iii) consistency of ψ̂ requires that either model (a) or models (b) and (c) be correct.
Contrasting (iii) with (i) and (ii), we observe that the conditions for consistency of ψ̂ are met if and only if either the conditions for consistency of ψ̃retro or those for consistency of ψ̃aipw are met, but not necessarily both. We thus conclude that ψ̂ confers more protection against model misspecification than either of the estimators ψ̃retro and ψ̃aipw. This in turn implies that ψ̂ confers more protection against model misspecification than the prospective parametric likelihood-based estimator and the estimator ψ̃ipw discussed in § 2, as consistency of the former requires even more stringent conditions than those for consistency of ψ̃retro and consistency of the latter requires even more stringent conditions than those for consistency of ψ̃aipw.
4. Simulation study
We compared ψ̂ with ψ̃retro and ψ̃aipw in a simulation study in which A is sometimes missing but Lobs = L is always observed. We simulated 500 random samples of (Y, RA, R, L) of size 2000 according to L1 ∼ N(0, 0.9²), f (Y, A | L) ∝ exp(−0.5Y + 0.5Y L1 + 0.4Y L2 + 0.7Y L1 L2 + AY + 0.8A − 0.4AL1 + 0.5AL2 + 0.6AL1 L2) and R | (Y, A, L) ∼ Ber[{1 + exp(0.3L1 + 0.3L2Y + 0.4Y L1 L2)}⁻¹]. Table 1 reports the results for the estimators ψ̂, ψ̃retro and ψ̃aipw in model OR(L; ψ) = ψ computed under all eight combinations of correct or incorrect models for oddsA(L), oddsY(L) and pr(R = 1 | L, Y), labelled (a), (b) and (c), respectively. Correct models used oddsA(L; α) = α0 + α1L1 + α2L2 + α3L1L2, oddsY(L; η) = η0 + η1L1 + η2L2 + η3L1L2 and pr(R = 1 | L, Y ; λ) = {1 + exp(−λ0 − λ1L1 − λ2L2 − λ3Y L2 − λ4Y L1 L2)}⁻¹. Incorrect models omitted the terms in L1L2. According to theory, ψ̂ is consistent in lines 1–5, ψ̃retro is consistent in lines 1, 2, 3 and 5, and ψ̃aipw is consistent in lines 1, 2 and 4. The Monte Carlo bias of each estimator reported in the corresponding lines is small, but the biases of ψ̃retro in line 4 and of ψ̃aipw in lines 3 and 5 are substantial, thus illustrating the robustness advantage of ψ̂. According to theory, none of the three estimators is consistent in lines 6–8, a fact reflected in the substantial Monte Carlo bias observed for all three estimators in these lines, with the exception of the biases of ψ̃aipw and ψ̂ in line 6. This exception is not supported by theory and is thus likely particular to our data generating process.
Table 1.
Simulation results
| Line | Models correct | | ψ̃retro | ψ̃aipw | ψ̂ |
|---|---|---|---|---|---|
| 1 | (a), (b), (c) | Bias | −0.001 | 0.0003 | 0.0007 |
| | | Variance | 0.0285 | 0.029 | 0.029 |
| 2 | (a), (b) | Bias | −0.001 | 0.00005 | 0.003 |
| | | Variance | 0.028 | 0.031 | 0.029 |
| 3 | (a), (c) | Bias | −0.001 | 0.242 | 0.00041 |
| | | Variance | 0.028 | 0.029 | 0.029 |
| 4 | (b), (c) | Bias | 0.184 | −0.043 | −0.010 |
| | | Variance | 0.024 | 0.028 | 0.025 |
| 5 | (a) | Bias | −0.001 | −0.240 | −0.005 |
| | | Variance | 0.028 | 0.028 | 0.027 |
| 6 | (b) | Bias | 0.184 | −0.028 | 0.006 |
| | | Variance | 0.024 | 0.033 | 0.029 |
| 7 | (c) | Bias | 0.184 | 0.210 | 0.190 |
| | | Variance | 0.024 | 0.031 | 0.028 |
| 8 | – | Bias | 0.184 | 0.215 | 0.193 |
| | | Variance | 0.024 | 0.026 | 0.024 |

(a), model for oddsA(L); (b), model for oddsY(L); (c), model for pr(R = 1 | Y, Lobs).
As indicated in § 2, according to theory, when oddsA(L) is correctly modelled, as in lines 1, 2, 3 and 5, ψ̃retro is indeed asymptotically efficient, in spite of being based only on complete-case units. A comparison of the Monte Carlo variances of ψ̃retro and ψ̃aipw in lines 1 and 2, corresponding to cases in which both are consistent, confirms this result. A comparison of the Monte Carlo variances of ψ̃retro and ψ̂ in lines 1, 2, 3 and 5, the cases in which both are consistent, further indicates that the new complete-case estimator ψ̂ does not incur any sizeable efficiency loss. Furthermore, a comparison of the variances of ψ̂ and ψ̃aipw in line 4, a situation in which both are consistent, additionally illustrates the fact that even though ψ̃aipw uses data from incomplete-case units, it can be less efficient than ψ̂ when the model for oddsA(L) is incorrectly specified.
5. Extensions to nonbinary outcomes and/or treatments
For the situations in which A and/or Y are not binary, a number of recent articles have focused attention on the estimation of the parameter ψ indexing an odds ratio model OR(Y, A, L; ψ) for OR(Y, A, L) = log[{ f (Y | A, L) f (y0 | a0, L)}/{ f (y0 | A, L) f (Y | a0, L)}] where (y0, a0) is a user-specified point in the sample space (Chen, 2007; Osius, 2009; Tchetgen Tchetgen et al., 2010; Tchetgen Tchetgen, 2010). As in the binary case, OR(Y, A, L) is an attractive target of inference as it is invariant to alterations in the marginal laws of (Y, L) or of (A, L). Tchetgen Tchetgen et al. (2010) describe estimators of ψ that are consistent and asymptotically normal provided one of the following two models is correctly specified, but not necessarily both.
Model 1. A model oddsA(A, L; α) for oddsA(A, L) = log{ f (A|y0, L)/ f (a0| y0, L)}.
Model 2. A model oddsY (Y, L; η) for oddsY (Y, L) = log{ f (Y |a0, L)/ f (y0|a0, L)}.
Just as in the binary case, under the missing data conditions of this note, the estimators of Tchetgen Tchetgen et al. (2010) restricted to the complete-case units are consistent and asymptotically normal for ψ so long as oddsA(A, L; α) is a correctly specified model for oddsA(A, L) or oddsY(Y, L; η) is now a correctly specified model for oddsYcc(Y, L) = log{ f (Y | a0, L, R = 1)/ f (y0 | a0, L, R = 1)}, but not necessarily both.
Supplementary material
Supplementary material available at Biometrika online includes an iterative algorithm.
Acknowledgments
The authors were funded by grants from the U.S. National Institutes of Health. The authors wish to thank the editor and the reviewers for helpful comments. Andrea Rotnitzky is also affiliated with the Harvard School of Public Health.
References
- Chen YH. A semi-parametric odds ratio model for measuring association. Biometrics. 2007;63:413–21. doi: 10.1111/j.1541-0420.2006.00701.x.
- Lipsitz S, Parzen M, Ewell M. Inference using conditional logistic regression with missing covariates. Biometrics. 1998;54:295–303.
- Little R, Rubin D. Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley; 2002.
- Moore CG, Lipsitz SR, Addy CL, Hussey JR, Fitzmaurice G, Natarajan S. Logistic regression with incomplete covariate data in complex survey sampling: application of re-weighted estimating equations. Epidemiology. 2009;20:382–90. doi: 10.1097/EDE.0b013e318196cd65.
- Osius G. Asymptotic inference for semiparametric association models. Ann Statist. 2009;37:459–89.
- Parzen M, Lipsitz S, Ibrahim J, Lipshultz S. A weighted estimating equation for linear regression with missing covariate data. Statist Med. 2002;21:2421–36. doi: 10.1002/sim.1195.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
- Tchetgen Tchetgen E. A simple implementation of doubly robust estimation in logistic regression with covariates missing at random. Epidemiology. 2009;20:391–4. doi: 10.1097/EDE.0b013e3181a0acc7.
- Tchetgen Tchetgen E, Robins J, Rotnitzky A. On doubly robust estimation in a semi-parametric odds ratio model. Biometrika. 2010;97:171–80. doi: 10.1093/biomet/asp062.
- Tchetgen Tchetgen E, Rotnitzky A. Double-robust adjustment for confounding in cohort and case-control studies. Statist Med. 2011;30:335–47. doi: 10.1002/sim.4103.
- Tchetgen Tchetgen E. On the interpretation, robustness, and power of varieties of case-only tests of gene-environment interaction. Am J Epidemiol. 2010;172:1335–8. doi: 10.1093/aje/kwq359.
- Zhao LP, Lipsitz S, Lew D. Regression analysis with missing covariate data using estimating equations. Biometrics. 1996;52:1165–82.
