Abstract
We study identification of parametric and semiparametric models with missing covariate data. When covariate data are missing not at random, identification is not guaranteed even under fairly restrictive parametric assumptions, a fact that we illustrate with several examples. We propose a general approach to establish identification of parametric and semiparametric models when a covariate is missing not at random. Without auxiliary information about the missingness process, identification of parametric models is strongly dependent on model specification. However, in the presence of a fully observed shadow variable, which is correlated with the missing covariate but otherwise independent of its missingness, identification is more broadly achievable, including in fairly large semiparametric models. With a shadow variable, special consideration is given to generalized linear models with an unrestricted missingness process. In this setting, the outcome model is identified for familiar generalized linear models, and we provide counterexamples when identification fails. For estimation, we describe an inverse probability weighted estimator that incorporates the shadow variable to estimate the missingness process, and we evaluate its performance via simulations.
Keywords: Identification, Missing covariate data, Missing not at random, Shadow variable
1. INTRODUCTION
Missing data are commonly encountered in many socioeconomic and biomedical studies. Methods to account for missing outcome data in regression analysis figure prominently in the literature. However, missing covariate data are also a long-standing problem in applied research. In the early history of missing data analysis, Glasser (1964); Afifi & Elashoff (1966) and Haitovsky (1968) studied the missing covariate problem in regression analysis; Edgett (1956); Anderson (1957) and Buck (1960) studied the problem in the context of multivariate analysis. The work of Rubin (1976) formalized the concept of the missing data mechanism as a separate process from the full data law of primary scientific interest. The missing data mechanism is said to be missing at random if it is independent of the missing values after conditioning on the observed data, and missing not at random otherwise. For analysis of data missing at random, there currently exists a variety of methods, such as likelihood-based approaches (Dempster et al., 1977; Ibrahim, 1990; Horton & Laird, 2001), imputation and multiple imputation (Rubin & Schenker, 1986; Vach & Schumacher, 1993; Rubin, 2004), and semiparametric methods (Robins et al., 1994; Zhao et al., 1996).
In many empirical studies, however, covariate data will often be missing not at random, i.e., the missingness is related to the missing covariate values even after conditioning on the observed data. Most of the aforementioned methods have previously been adapted to deal with covariate data missing not at random. Reviews of research on this topic include Ibrahim et al. (1999); Little & Zhang (2011) and Ibrahim et al. (2005). Validity of existing estimation methods relies on first establishing identification. Identification means that the parameter of interest, for example, the regression coefficient of the outcome on the missing covariate, is uniquely determined by the observed data. Without identification, statistical inference is generally of limited interest and may often be misleading. Under missingness at random, the joint distribution of all variables of interest is identified without parametric assumptions (Little & Rubin, 2002); however, as pointed out by Baker & Laird (1988), under missingness not at random, identification is not always guaranteed. Fay (1986) and Ma et al. (2003) used graphical models to represent missingness mechanisms, and they studied identification for longitudinal categorical variables that are missing not at random. In the context of missing outcome data, Tang et al. (2003); Wang et al. (2014); Miao et al. (2016) and Zhao & Shao (2015) studied identification of several parametric and semiparametric models, and they presented counterexamples when identification fails. Kott (2014); Zhao & Shao (2015) and D'Haultfoeuille (2010) noted that a fully observed shadow variable can sometimes be used to improve identification under missingness not at random. Such a variable is associated with the potentially unobserved variable conditional on the observed data, but independent of the missingness process conditional on both the observed data and the missing values (Kott, 2014).
For missing outcome data, we have previously demonstrated identification of a class of location-scale models with a shadow variable (Miao et al., 2015). Identification is equally crucial and challenging for covariate data missing not at random, yet the literature on this topic is somewhat sparse. In this paper, we illustrate the difficulty of identification with nonignorable missing covariate data in section 2. We establish a general framework for studying identification with missing covariate data in section 3, and we illustrate its use with several parametric models. In section 4, we use a shadow variable for the missing covariate to improve identification in semiparametric models, and we establish identification of a large class of generalized linear models. In section 5, we describe inverse probability weighted estimation, which incorporates the shadow variable to estimate the nonignorable missingness process, and we evaluate its performance via simulations in section 6.
2. POTENTIAL DIFFICULTY FOR IDENTIFICATION
Throughout, we let Y denote the fully observed outcome variable, X a covariate with missing values, and R the missing indicator of X, with R = 1 if X is observed and R = 0 otherwise. The observed data include (Y, R) for all samples, and X only for those with R = 1. The goal of missing data analysis is to make inference about the full data law pr(y, x) and the missingness process pr(r = 1 | y, x), based on the observed data distribution, which is captured by pr(y) and pr(x, r = 1 | y). Recovery of the full data law and the missingness process from the observed data distribution is the fundamental identification challenge in missing data problems. It can be formalized as follows.
Definition 1. For a model pr(y, x, r; θ) indexed by θ, which may have a finite dimensional component, as well as nonparametric components, the parameter θ is said to be identified from the observed data, if there exists a one-to-one mapping between θ and {pr(y; θ), pr(x, r = 1 | y; θ)}.
When data are missing at random, i.e., R⫫X | Y, it is well known that the joint distribution pr(y, x, r) is nonparametrically identified, because pr(x, r = 1 | y) and pr(y) can be identified from the observed data without imposing parametric assumptions, and pr(x, r = 0 | y) = pr(x | y, r = 1)pr(r = 0 | y) is uniquely determined by pr(x, r = 1 | y). When data are missing not at random, however, pr(x, r = 0 | y) ≠ pr(x | y, r = 1)pr(r = 0 | y). In fact, in contrast with data missing at random, one cannot ignore the missing data mechanism to make inference (Little & Rubin, 2002; Ibrahim et al., 1999). As shown in the following example, even when fairly restrictive parametric models are correctly specified both for pr(y, x) and pr(r = 1 | y, x), identification is not guaranteed, and selection bias of the regression model due to missing data cannot necessarily be eliminated.
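The missing at random identity pr(x, r = 0 | y) = pr(x | y, r = 1)pr(r = 0 | y) can be checked on a small binary example; the following sketch uses illustrative probability values of our own choosing, not values from the text:

```python
# Binary sanity check of the missing at random identity
# pr(x, r = 0 | y) = pr(x | y, r = 1) pr(r = 0 | y); all probabilities illustrative.
p_y = {0: 0.5, 1: 0.5}                      # pr(y)
p_x1_y = {0: 0.3, 1: 0.6}                   # pr(x = 1 | y)
p_r1_y = {0: 0.8, 1: 0.6}                   # pr(r = 1 | y): no dependence on x

def joint(y, x, r):
    px = p_x1_y[y] if x == 1 else 1 - p_x1_y[y]
    pr = p_r1_y[y] if r == 1 else 1 - p_r1_y[y]
    return p_y[y] * px * pr

for y in (0, 1):
    p_r1 = sum(joint(y, v, 1) for v in (0, 1)) / p_y[y]       # pr(r = 1 | y)
    for x in (0, 1):
        p_x_r1 = joint(y, x, 1) / p_y[y]                      # pr(x, r = 1 | y)
        reconstructed = (p_x_r1 / p_r1) * (1 - p_r1)          # pr(x | y, r = 1) pr(r = 0 | y)
        assert abs(reconstructed - joint(y, x, 0) / p_y[y]) < 1e-12
print("pr(x, r = 0 | y) recovered from the observed data")
```

The check succeeds precisely because pr(r = 1 | x, y) does not depend on x; replacing p_r1_y with a table indexed by (x, y) breaks the reconstruction.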
Example 1. We consider a normal model encoded in pr(x) ~ N(γ, λ) and pr(y|x) ~ N(β0 + β1x, ϕ), and a logistic model for the missingness process: logit pr(r = 1 | x, y) = α0 + α1x + α2y. Letting λ = 1.25, ϕ = 0.8, and β1 = 0.4, the two sets of parameters (γ, β0, α0, α1, α2) = (0, 0, −2, 2, −1) and (2, −0.8, 2, −2, 1) result in the same observed data distribution: pr(y) ~ N(0, 1) and the same conditional law pr(x, r = 1 | y). Therefore, (γ, β0, α0, α1, α2) are not identified from the observed data.
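The equality of observed-data laws in example 1 is easy to check numerically; the sketch below evaluates pr(x, r = 1 | y) = pr(x | y)pr(r = 1 | x, y) on a small grid under both settings. Note the pairing used here: the covariate mean γ = 0 goes with missingness coefficients (α0, α1, α2) = (−2, 2, −1), and γ = 2 with (2, −2, 1).

```python
import math

def norm_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def obs_density(x, y, gamma, beta0, alpha):
    # pr(x, r = 1 | y) = pr(x | y) pr(r = 1 | x, y); with lambda = 1.25, phi = 0.8,
    # beta1 = 0.4, we get var(Y) = 1 and X | Y = y ~ N(gamma + 0.5 (y - beta0 - 0.4 gamma), 1).
    cond_mean = gamma + 0.5 * (y - beta0 - 0.4 * gamma)
    return norm_pdf(x, cond_mean, 1.0) * expit(alpha[0] + alpha[1] * x + alpha[2] * y)

# gamma = 0 pairs with alpha = (-2, 2, -1); gamma = 2 pairs with alpha = (2, -2, 1).
for x in (-1.0, 0.0, 0.7, 2.3):
    for y in (-1.5, 0.0, 1.1):
        d1 = obs_density(x, y, 0.0, 0.0, (-2.0, 2.0, -1.0))
        d2 = obs_density(x, y, 2.0, -0.8, (2.0, -2.0, 1.0))
        assert abs(d1 - d2) < 1e-12
print("pr(x, r = 1 | y) agrees under both settings")
```

The agreement is exact (not just numerical): the normal density ratio between the two settings is exp(2 − 2x + y), which is cancelled exactly by the ratio of the two logistic missingness probabilities.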
The normal model and the logistic model are commonly used in missing data analysis. However, example 1 shows potential non-identification of such models when the missingness process depends on the potentially unobserved covariate. Without identification, there may exist different full data laws with the same observed data distribution; thus, even asymptotically one cannot determine which is the truth based on the observed data. In this case, estimation of the parameters is of limited interest in practice. Despite its importance, identification with missing covariate data has not been extensively studied in the literature. In subsequent sections, we propose a general framework to establish identification with missing covariate data.
3. A GENERAL FRAMEWORK FOR IDENTIFICATION
We consider a model pr(y, x, r; θ) indexed by θ. In order to identify θ from the observed data distribution, which is captured by pr(y; θ) and pr(x, r = 1 | y; θ), we must rule out values of θ that result in the same observed data distribution. Suppose θ1 and θ2 are two candidate values of θ that produce the same pr(y) and pr(x, r = 1 | y); then
pr(y; θ1) = pr(y; θ2), pr(x | y, r = 1; θ1) pr(r = 1 | y; θ1) = pr(x | y, r = 1; θ2) pr(r = 1 | y; θ2),
which characterizes all pairs of values of θ that must be ruled out for identification. We thus have the following working definition of identification.
Definition 2. The parameter θ is identified if and only if for any two distinct values θ1 and θ2 of θ, either
(a) pr(y; θ1) ≠ pr(y; θ2); or
(b) the following ratios are not equal: pr(x | y, r = 1; θ1)/pr(x | y, r = 1; θ2) ≠ pr(r = 1 | y; θ2)/pr(r = 1 | y; θ1).
Definition 2 establishes that different values of θ must produce distinct distributions of the observed data. Because pr(y) is uniquely determined by the observed data, (a) of definition 2 can be checked directly from the observed data; (b) of definition 2 involves the missingness process, and it is the key tool for establishing identification. We provide several examples to illustrate the use of (b).
Example 2. We verify identification under the missing at random mechanism R⫫X | Y by checking (b) of definition 2. Following the approach of Fay (1986), such a missingness mechanism can also be encoded in the directed acyclic graph model of figure 1 (i), where the arrow between X and R is absent. It is plausible in a retrospective study, such as a case-control study, in which X is ascertained only after Y is determined, so that Y may in fact directly influence whether or not X is missing. For any two candidate models pr(y, x, r; θ1) and pr(y, x, r; θ2) such that pr(y; θ1) = pr(y; θ2), the ratio of the missingness models pr(r = 1 | y; θ1)/pr(r = 1 | y; θ2) is a function of y only. However, for pr(x | y; θ1) ≠ pr(x | y; θ2), the ratio pr(x | y; θ1)/pr(x | y; θ2) must vary with x (see the Appendix), and therefore cannot equal the ratio of the missingness models. Therefore, from (b) of definition 2, θ is identified.
Figure 1: Directed acyclic graph models for different missingness mechanisms.
Example 3. Bartlett et al. (2014) considered estimation under the missingness mechanism encoded in the graph model of figure 1 (ii), which is missing not at random. The graph depicts a prospective study in which Y is ascertained only after X is observed; therefore, it is reasonable to assume that Y cannot determine whether X is missing, provided that participants are not able to anticipate their outcomes at baseline. Under such a missingness mechanism, for pr(y | x; θ1) ≠ pr(y | x; θ2), the ratio of any two candidate models for pr(y, x), {pr(y | x; θ1)pr1(x)}/{pr(y | x; θ2)pr2(x)}, must vary with y, which cannot equal the ratio of any two different missingness models, a function of x only. Therefore, θ indexing the outcome model is identified, although the covariate distribution pr(x) may not be.
When the missingness process depends on either the missing covariate X (example 3) or the fully observed outcome Y (example 2), identification is well established (Little & Rubin, 2002); thus, we have simply confirmed identification by verifying (b) of definition 2. We further provide several examples to illustrate the case where missingness depends on both X and Y. In empirical studies, the covariate and outcome of interest are often binary. We note that identification is not guaranteed for binary variables, as illustrated in the following example.
Example 4. Consider logistic models for binary X and Y:
logit pr(y = 1 | x) = β0 + β1x, logit pr(r = 1 | x, y) = α0 + α1x + α2y,
encoded in figure 1 (iii), with β = (β0, β1), α = (α0, α1, α2), and pr(x = 1) unspecified. The parameters are not identified. One can verify that pr(y = 1) and pr(r = 1, x | y) are identical under the following two settings: α = (−0.4, −0.4, 0.2), β = (−0.359, 0.6), pr(x = 1) = 0.597; and α′ = (0.468, −1.64, 0.338), β′ = (−0.361, 0.488), pr′(x = 1) = 0.737.
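The two settings of example 4 can be compared numerically; the sketch below computes the observed-data law, namely pr(y, x, r = 1) and pr(y, r = 0), under each setting. The reported parameters carry only three decimals, so agreement is up to rounding:

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def observed_law(px1, beta, alpha):
    """Observed-data law implied by pr(x = 1) = px1,
    logit pr(y = 1 | x) = beta0 + beta1 x, logit pr(r = 1 | x, y) = a0 + a1 x + a2 y:
    returns pr(y, x, r = 1) over the four (y, x) cells and pr(y, r = 0) over y."""
    joint_r1, marg_r0 = {}, {0: 0.0, 1: 0.0}
    for x in (0, 1):
        for y in (0, 1):
            p_x = px1 if x == 1 else 1 - px1
            p_y1 = expit(beta[0] + beta[1] * x)
            p_y = p_y1 if y == 1 else 1 - p_y1
            p_r1 = expit(alpha[0] + alpha[1] * x + alpha[2] * y)
            joint_r1[(y, x)] = p_x * p_y * p_r1
            marg_r0[y] += p_x * p_y * (1 - p_r1)
    return joint_r1, marg_r0

law1 = observed_law(0.597, (-0.359, 0.6), (-0.4, -0.4, 0.2))
law2 = observed_law(0.737, (-0.361, 0.488), (0.468, -1.64, 0.338))
gap = max(abs(law1[0][k] - law2[0][k]) for k in law1[0])
gap = max(gap, *(abs(law1[1][y] - law2[1][y]) for y in (0, 1)))
print(gap)  # small: the two observed-data laws agree up to rounding
```

Only pr(y, x, r = 1) and pr(y, r = 0) are observable: X is never seen when r = 0, which is exactly why the two distinct full-data laws cannot be told apart.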
In the binary example, one can also follow the "parameter counting" approach to check identification (Baker & Laird, 1988). In example 4, the model contains six unknown parameters, (α, β) and pr(x = 1), but the observed data distribution has only five degrees of freedom: pr(y, x, r = 1) for x, y = 0, 1 and pr(y = 1, r = 0), which provide five estimating equations for the unknown parameters. With more parameters than estimating equations, the solution for {α, β, pr(x = 1)} is not unique, i.e., the parameters are not identified. For a continuous covariate or a semiparametric model, however, the number of unknown parameters and the degrees of freedom of the observed data are difficult to characterize, and the "parameter counting" approach often does not apply. In contrast, (b) of definition 2 remains convenient to apply for specific parametric models.
Example 5. Continuation of example 1. The missingness mechanism can be encoded in the graph of figure 1 (iii). The model for the joint distribution is indexed by θ = (γ, λ, α0, α1, α2, β0, β1, ϕ). Considering the respective models indexed by θ and θ′, we have
| (1) |
which is a linear combination of y2, y, x2, xy and x; and we have
| (2) |
For (α0′, α1′, α2′) = −(α0, α1, α2), (2) is a linear combination of x and y. Thus, (1) and (2) can be equal for certain values of the parameters, such as those given in example 1, and we can further verify that pr(y) is the same for those two settings. Therefore, θ cannot be identified. In particular, (γ, α0, α1, α2, β0) cannot be identified, but (λ, ϕ, β1) can be identified by noting that the coefficients of y2, xy and x2 must be zero when (1) equals (2).
Examples 1 and 5 show potential lack of identification for the normal model when the covariate is missing not at random. In this case, the slope of the outcome model, i.e., β1 is identified but the intercept β0 is not. In contrast to the normal model, the following example establishes identification of a certain exponential regression model.
Example 6. Consider a normal model for the covariate, X ~ N(μ, σ2); an exponential regression model for the outcome variable, with density pr(y | x) = η(x) exp{−yη(x)}, where η(x) = exp(β0 + β1x) and β1 ≠ 0; and a logistic model for the missingness of X: logit pr(r = 1 | y, x) = α0 + α1x + α2y. Then all parameters are identified.
In a breast cancer study, Lipsitz et al. (1999) applied Weibull regression to model the time to treatment failure, without formally establishing identification of the model. The Weibull regression model is more general than the exponential regression model; we show in the Appendix that identification holds for it as well.
4. IDENTIFICATION WITH A SHADOW VARIABLE
In examples 1–6, identification or lack thereof is completely determined by the specific parametric model being considered, and therefore, it is unclear whether a general identification framework is available without such restrictions on the models. However, when a shadow variable for the missing covariate is fully observed, identification is often possible even in fairly large semiparametric models. A shadow variable is associated with the potentially missing variable conditional on the observed data, but independent of the missingness process conditional on both the fully observed data and the potentially missing variable (Kott, 2014). The following assumption formalizes these conditions for a shadow variable.
Assumption 1. A fully observed variable Z is a shadow variable for X if (i) Z is not independent of X given Y; and (ii) Z ⫫ R | (Y, X).
Assumption 1 formalizes the idea that the shadow variable only affects the missingness through its association with the missing covariate and the fully observed outcome. Figure 2 is a directed acyclic graph encoding assumption 1.
Figure 2: A directed acyclic graph model for the shadow variable.
The shadow variable for a missing covariate may be available in many empirical studies where a fully observed proxy or a mismeasured version of the missing covariate is available. For example, in a study of the mental health of children in Connecticut (Zahner et al., 1992; Horton & Laird, 2001), researchers were interested in the association between children's mental health status and utilization of mental health services. The measure of psychopathology used in the study was based on the teacher's assessment, which had 43% missing values; however, a separate parental report was complete. The parental report is a proxy for the teacher's assessment, but it is unlikely to be related to the teacher's response rate conditional on other covariates and the teacher's assessment of the student; in this case the parental assessment constitutes a valid shadow variable. Such a variable introduces additional restrictions on the missingness process, and thus provides a better opportunity for identification under missingness not at random. For example, the non-identification of the binary case (example 4) is completely resolved with a binary shadow variable.
Example 7. Suppose X and the shadow variable Z are binary. Because Z⫫R | (X, Y), we have pr(z | x, y) = pr(z | x, y, r = 1) for all (y, x, z). For each y, one can solve the linear equation pr(z | y) = Σx pr(z | x, y, r = 1)pr(x | y) for pr(x | y). Suppose Z is not independent of X given Y = y for all y; then the solution is unique, i.e., pr(x | y) is identified. One can further solve pr(r = 1, x | y) = pr(r = 1 | x, y)pr(x | y) to identify the missingness process pr(r = 1 | x, y). Thus, one can identify the joint distribution pr(y, z, x, r) = pr(y)pr(x | y)pr(z | x, y)pr(r | x, y). See the Appendix for additional details.
It has been previously noted by Ma et al. (2003) that in the binary case of example 7, the joint distribution pr(y, z, x, r) can be identified explicitly as a function of the observed data distribution when a binary shadow variable is available. For more complicated models, such as a semiparametric model with a continuous covariate, identification is not as straightforward as in example 7. For such cases, the following proposition is convenient for checking identification of the outcome model, even if the missingness process is unrestricted.
Proposition 1. Considering the models pr(y | x, z; θ) and pr(x|z; ξ), if for θ1 ≠ θ2, the ratio pr(y, x|z; θ1, ξ1)/pr(y, x|z; θ2, ξ2) varies with z for all ξ1, ξ2, then the parameter θ indexing the outcome model is identified.
The proposition follows from the fact that under the shadow variable assumption, the ratio of any two different missingness models is not a function of z, and from definition 2 (b), θ must be identified if the ratio pr(y, x|z; θ1, ξ1)/pr(y, x|z; θ2, ξ2) varies with z for distinct values θ1 and θ2. We further consider identification of the generalized linear models, which are commonly used in practice. We suppose X and Z are continuous variables, and we assume the following models:
| (3) |
| (4) |
with dispersion parameters ϕ, λ > 0, and known functions A1, A2, B1, B2, η1(z; γ) = η1(γ0 + γ1z) and η2(z, x; β) = η2(β0 + β1z + β2x). We assume that these functions are infinitely differentiable, and that for all γ the range of η1(z; γ) contains an open set, i.e., the exponential family pr(x | z, r) is of full rank (Shao, 2003, page 96). Note that we assume Z⫫R | (Y, X), but the missingness process is otherwise unspecified. We have the following identification results for such models.
Theorem 1. Assuming the generalized linear models (3)–(4), and the shadow variable assumption 1, we have
(a) if η2 is a linear function, then β1/ϕ is identified;
(b) if η2 is a linear function, and B2″, the second-order derivative of B2, is not a linear function, then (β1, β2, ϕ) are identified;
(c) if η2 is a nonlinear function, then (β1, β2) are identified.
Theorem 1 is proved by verifying the condition of proposition 1; we relegate the details to the Appendix. It establishes identification of the coefficients of Z and X in the outcome model pr(y | z, x), except when η2 is a linear function and B2 is a cubic or quadratic function. From the theorem, (β1, β2) of the logistic model
logit pr(y = 1 | z, x) = β0 + β1z + β2x
must be identified. However, when η2 is a linear function and B2 is a cubic or quadratic function, even if Z is indeed correlated with X, Z may be independent of X after conditioning on Y, i.e., the shadow variable assumption may not hold. We have the following counterexample when η2 is a linear function and B2 is a quadratic function, both of which hold for a normal outcome model.
Example 8. Consider the normal model pr(y|z, x) = N(β1z + β2x, ϕ) and pr(x|z) = N(γ1z, λ) indexed by θ = (β1, β2, ϕ, γ1, λ). For the two sets of values θ1 = (1, 1, 1, 1, 1) and θ2 = (1.5, 0.5, 1.5, 1, 2), one can verify
which does not vary with z. Consider the following two models for the missingness process:
then one can verify that the two data generating mechanisms, encoded in pr(y, x | z, θi) and pri(r = 1 | y, x) for i = 1, 2, have the same observed data distribution. Thus, θ is not identified from the observed data. But β1/ϕ = 1 is identified, a fact that is consistent with theorem 1 (a).
The example shows potential lack of identification of normal models. It should be noted, however, that non-identification occurs only for certain exceptional values of the parameters.
Theorem 2. For the normal model pr(y|z, x) = N(β0 + β1z + β2x, ϕ) and pr(x|z) = N(γ0 + γ1z, λ), all parameters are identified if β1β2/ϕ − γ1/λ ≠ 0.
The condition β1β2/ϕ − γ1/λ ≠ 0 ensures that X is not independent of Z conditional on Y. From theorem 2, normal models are generally identified except on a specific subset of the parameter space. In practice, one can test whether the quantity β1β2/ϕ − γ1/λ is zero to check whether the model is identified.
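For the jointly Gaussian case, one can check that β1β2/ϕ − γ1/λ is exactly the (Z, X) entry of the precision matrix of (Z, X, Y), which vanishes precisely when X⫫Z | Y. A minimal sketch, taking Z ~ N(0, 1) for concreteness (the entry does not depend on the marginal law of Z):

```python
import numpy as np

def precision_zx(beta1, beta2, phi, gamma1, lam, var_z=1.0):
    # Build the joint covariance of (Z, X, Y) implied by Z ~ N(0, var_z),
    # X | Z ~ N(gamma0 + gamma1 Z, lam), Y | Z, X ~ N(beta0 + beta1 Z + beta2 X, phi),
    # then invert it and return the (Z, X) entry of the precision matrix.
    var_x = gamma1 ** 2 * var_z + lam
    cov_zx = gamma1 * var_z
    cov_zy = beta1 * var_z + beta2 * cov_zx
    cov_xy = beta1 * cov_zx + beta2 * var_x
    var_y = phi + beta1 ** 2 * var_z + beta2 ** 2 * var_x + 2 * beta1 * beta2 * cov_zx
    sigma = np.array([[var_z, cov_zx, cov_zy],
                      [cov_zx, var_x, cov_xy],
                      [cov_zy, cov_xy, var_y]])
    return np.linalg.inv(sigma)[0, 1]

# Identified case: beta1*beta2/phi - gamma1/lam = 1 - 0.5 = 0.5 != 0.
print(precision_zx(1.0, 1.0, 1.0, 1.0, 2.0))
# Boundary case (both parameter sets of example 8 lie on it): 0.5 - 0.5 = 0.
print(precision_zx(1.5, 0.5, 1.5, 1.0, 2.0))
```

The (β1, β2, ϕ, γ1, λ) values here are chosen for illustration; the second call reproduces the boundary β1β2/ϕ = γ1/λ on which theorem 2's condition fails.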
The following submodels have better identification results as they have fewer parameters than models (3) and (4).
| (5) |
| (6) |
Theorem 3. For model (5), (β0, β2, ϕ) are identified; and for model (6), (β0, β1, ϕ) are identified.
5. ESTIMATION
Inverse probability weighted estimation (Horvitz & Thompson, 1952; Robins et al., 1994; Scharfstein et al., 1999) is one of the most influential missing data methods. The approach employs a propensity score model π(x, y; α) = pr(r = 1 | x, y; α). For example, one may specify a logistic model logit π(x, y; α) = α0 + α1x + α2y, with unknown parameters (α0, α1, α2). When α1 ≠ 0, the model accommodates a nonignorable missingness process. With fully observed data, α can be consistently estimated by standard maximum likelihood. Alternatively, one may solve estimating equations of the form Σi=1n {ri/π(xi, yi; α) − 1} G(xi, yi) = 0, with G(x, y) a user-specified vector function of dimension equal to that of α, and E[∂{r/π(x, y; α)}/∂α × G(x, y)] nonsingular for all α. For the logistic propensity score model, one may naturally choose G(x, y) = (1, x, y) for estimation. But when X or Y has missing values, neither approach is feasible. However, when a shadow variable Z is fully observed, one can replace G(x, y) with G(z, y) without compromising consistency of the estimator, and solve the modified estimating equation
| (7) |
for the estimator α̂ of α. The inverse probability weighted estimator of (β, ϕ) solves
| (8) |
with the score function S(x, y; β, ϕ) = ∂ log{pr(y | x, z; β, ϕ)}/∂(β, ϕ), and α̂ obtained from (7). We have the following theorem on the consistency of the inverse probability weighted estimators.
Theorem 4. If the propensity score model π(x, y; α) is correctly specified, then as the sample size increases, α̂ obtained from (7) converges in probability to the true value of α; if in addition the outcome model pr(y | z, x; β, ϕ) is correctly specified, then the estimator (β̂, ϕ̂) obtained from (8) converges in probability to the true value of (β, ϕ).
6. SIMULATION
We study the finite sample performance of the proposed inverse probability weighted estimator via simulations. We generate the shadow variable Z from N(0, 1), X ~ N(0.5 + 0.5z, 1), and Y ~ N(β0 + β1z + β2x, 1) with β0 = 0.5, β1 = 1.5 and β2 = −0.5. We generate R from logit pr(r = 1 | x, y) = α0 + α1y + α2x with α0 = 0.5, α1 = −1 and α2 = 1, and delete the values of X with R = 0. Under this setting, the missing data proportion is about 39%. We simulate 1000 independent data sets of sample sizes 500 and 1500, and we analyze each data set with both inverse probability weighting and linear regression based on complete cases. Results are summarized in the boxplots of figure 3. With the shadow variable incorporated, the estimator for the propensity score model has small bias. The inverse probability weighted estimator for the outcome model has small bias under moderate sample sizes, and the bias decreases as the sample size increases; in contrast, the estimator obtained from linear regression based on complete cases has large bias that is not eliminated as the sample size increases. Thus, the proposed inverse probability weighted estimator appears to perform reasonably well, at least in simulated settings.
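One simulated replicate of this design, analyzed with the inverse probability weighted estimator of equations (7) and (8), can be sketched as follows; the root-finder, its starting values, and the sample size are our own choices for the sketch, not prescribed by the text:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(0.0, 1.0, n)
x = rng.normal(0.5 + 0.5 * z, 1.0)
y = rng.normal(0.5 + 1.5 * z - 0.5 * x, 1.0)

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

r = rng.binomial(1, expit(0.5 - 1.0 * y + 1.0 * x))   # logit pr(r=1|x,y) = 0.5 - y + x
x_filled = np.where(r == 1, x, 0.0)                   # placeholder; unused when r = 0

def ee_alpha(a):
    # Equation (7): sum_i {r_i / pi(x_i, y_i; a) - 1} G(z_i, y_i) = 0 with
    # G(z, y) = (1, z, y); for r_i = 0 the summand reduces to -G(z_i, y_i).
    pi = expit(a[0] + a[1] * y + a[2] * x_filled)
    w = np.where(r == 1, 1.0 / pi - 1.0, -1.0)
    return np.stack([np.ones(n), z, y]) @ w / n

sol = root(ee_alpha, x0=[0.4, -0.9, 0.9])             # started near the truth for stability
alpha_hat = sol.x

# Equation (8) for the normal outcome model reduces to weighted least squares
# over the complete cases, with weights 1 / pi_hat.
pi_hat = expit(alpha_hat[0] + alpha_hat[1] * y + alpha_hat[2] * x_filled)
m = r == 1
D = np.column_stack([np.ones(m.sum()), z[m], x[m]])
w = 1.0 / pi_hat[m]
beta_hat = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * y[m]))
print("alpha_hat:", alpha_hat, "beta_hat:", beta_hat)
```

With the truth (α0, α1, α2) = (0.5, −1, 1) and (β0, β1, β2) = (0.5, 1.5, −0.5), both estimates should land close to the true values at this sample size, whereas an unweighted complete-case regression of Y on (Z, X) would not.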
Figure 3: Boxplots for the estimators.
Note: data are analyzed with inverse probability weighting (IPW) and linear regression based on complete cases (CC). In each boxplot, white boxes are for sample size 500, and gray ones for 1500. The horizontal line marks the true value of the parameter.
APPENDIX A. ADDITIONAL DETAILS FOR EXAMPLES 2, 3, 6, 7
Details for example 2
Suppose pr(x | y; θ1)/pr(x | y; θ2) = h(y) for some function h(y); then for all y we have
1 = ∫x pr(x | y; θ1)dx = h(y) ∫x pr(x | y; θ2)dx = h(y),
so that pr(x | y; θ1) = pr(x | y; θ2), which is a contradiction with pr(x | y; θ1) ≠ pr(x | y; θ2). Therefore, pr(x | y; θ1)/pr(x | y; θ2) must vary with x.
Details for example 3
We only need to prove that pr(y | x; θ1)/pr(y | x; θ2) varies with y. Suppose otherwise that pr(y | x; θ1)/pr(y | x; θ2) = h(x) for some function h(x); then for all x we have
1 = ∫y pr(y | x; θ1)dy = h(x) ∫y pr(y | x; θ2)dy = h(x),
so that pr(y | x; θ1) = pr(y | x; θ2), which is a contradiction with pr(y | x; θ1) ≠ pr(y | x; θ2). Therefore, pr(y | x; θ1)/pr(y | x; θ2), and thus {pr(y | x; θ1)pr1(x)}/{pr(y | x; θ2)pr2(x)}, must vary with y.
Details for example 6
We use a proof by contradiction to show identification of the parameters. Suppose that there were two sets of parameters resulting in the same distribution pr(y, x, r = 1):
| (A.1) |
with φ the probability density function of a standard normal variable. Taking logarithms on both sides and rearranging the terms, we have
| (A.2) |
For arbitrary y, the left hand side of (A.2) is a linear combination of x and x2. But for (β0, β1) ≠ (β0′, β1′), the right hand side of (A.2) must include an exponential term of x, and it cannot equal the left hand side of (A.2), which contradicts (A.2). Thus, we must have β0 = β0′ and β1 = β1′, and (A.1) reduces to
By the same argument as Miao et al. (2016) for identification of normal densities, the identity holds only for μ = μ′, σ = σ′, and (α0, α1, α2) = (α0′, α1′, α2′). Therefore, all parameters are identified.
The Weibull regression is a generalization of the exponential regression model. We first prove identification of σ2, and then identification of other parameters follows from identification of the exponential regression model. For the Weibull regression, we follow the proof for the exponential regression and then obtain a parallel version of (A.2):
| (A.3) |
For arbitrary x, the left hand side of (A.3) is a linear combination of y and log(y). But for , the right hand side of (A.3) must include a power term of y, and it cannot equal the left hand side of (A.3). Thus, we must have . Letting , then , which is an exponential regression model. Applying the identification result of the exponential regression model, we obtain identification of the remaining parameters.
Details for example 7
When X and Z are binary, for each y, we solve the equation pr(z = 1 | y) = Σx=0,1 pr(z = 1 | x, y, r = 1)pr(x | y) for pr(x = 1 | y). Noting that pr(x = 1 | y) + pr(x = 0 | y) = 1, we have
pr(x = 1 | y) = {pr(z = 1 | y) − pr(z = 1 | x = 0, y, r = 1)} / {pr(z = 1 | x = 1, y, r = 1) − pr(z = 1 | x = 0, y, r = 1)}.
Under the assumption that Z is not independent of X given Y = y for all y, pr(z = 1 | x = 1, y) ≠ pr(z = 1 | x = 0, y); thus, pr(z = 1 | x = 1, y, r = 1) ≠ pr(z = 1 | x = 0, y, r = 1) by the shadow variable assumption Z⫫R | (X, Y). Therefore, the solution for pr(x = 1 | y) is unique.
APPENDIX B. LEMMAS AND PROOFS
Lemma 1. Suppose pr(x|z) and pr′(x|z) are two probability density functions that follow model (3), then the ratio pr′(x|z)/pr(x|z) must vary with z.
Proof. The proof proceeds by contradiction. Suppose the ratio pr′(x|z)/pr(x|z) does not vary with z, so that pr′(x|z) = pr(x|z)h(x) for some h(x) ≠ 1; then
∫x pr(x|z)h(x)dx = ∫x pr′(x|z)dx = 1
for all z, and thus ∫x pr(x|z){h(x) − 1}dx = 0 for all z, i.e.,
E{h(X) − 1 | z} = 0
| (A.4) |
for all z. For the exponential family, X is complete for pr(x | z) under the full rank condition (Shao, 2003, Proposition 2.1, page 110), i.e., E{f(X) | z} = 0 for all z implies f(X) = 0. Thus, from (A.4), we must have h(x) = 1, which contradicts pr(x | z) ≠ pr′(x | z). As a result, pr′(x|z)/pr(x|z) must vary with z. □
Lemma 2. For a non-zero and non-periodic function g, if βg(α + βz) = β′g(α′ + β′z) for all z, then we must have
β = β′; or
β = −β′ ≠ 0, and g(α + βz) = −g(α′ − βz) for all z.
Proof. For β = 0, β′g(α′ + β′z) = βg(α + βz) = 0 for all z. Because g is a nonzero function, we must have β′ = 0;
For β ≠ 0, we must have β′ ≠ 0. Suppose |β′/β| < 1: because βg(α + βz) = β′g(α′ + β′z) for all z, letting t = βz, we have g(α + t) = β′/β · g{α′ + β′/β · t}. By iteration, we have g(α + t) = 0 for all t, which is impossible for a nonzero function g. So we have |β′/β| ≥ 1, and similarly, |β′/β| ≤ 1. As a result, we have |β| = |β′| > 0.
If β = β′, we must have g(α + βz) = g(α′ + βz). If β = −β′, we have g(α + βz) = −g(α′ − βz). □
Lemma 3. For a non-zero function g, if β2g(α+βz) = β′2g(α′ +β′z) for all z, then we must have
β = β′; or
β = −β′ ≠ 0, and g(α + βz) = g(α′ − βz) for all z.
Proof. If β = 0, β′2g(α′ + β′z) = β2g(α + βz) = 0 for all z. Because g is a nonzero function, we must have β′ = 0;
For β ≠ 0, we must have β′ ≠ 0. Suppose |β′/β| < 1: letting t = βz, because β2g(α + βz) = β′2g(α′ + β′z) for any z, we have g(α + t) = (β′/β)2 · g{α′ + β′/β · t}. By iteration, we have g(α + t) = 0 for all t, which is impossible for a nonzero function g. So we have |β′/β| ≥ 1, and similarly, |β′/β| ≤ 1. As a result, we have |β| = |β′| > 0.
If β = β′ ≠ 0, we have g(α + βz) = g(α′ + βz) for all z. If β = −β′ ≠ 0, we have g(α + βz) = g(α′ − βz) for all z. □
Lemma 4. For a non-zero function g, and ϕ, ϕ′ > 0, if β/ϕ · g(α + βz) = β′/ϕ′ · g(α′ + β′z) for all z, then we must have
β = β′ = 0; or
β = β′ ≠ 0, and 1/ϕ · g(α + βz) = 1/ϕ′ · g(α′ + βz) for all z; or
β = −β′ ≠ 0, ϕ = ϕ′, and g(α + βz) = −g(α′ − βz) for all z; or
β/ϕ = β′/ϕ′, and g is a constant.
Proof. For β = 0, β′/ϕ′g(α′ + β′z) = β/ϕg(α + βz) = 0 for any z. Because g is a nonzero function, we must have β′ = 0;
For β ≠ 0, we must have β′ ≠ 0. Suppose |β′/β| < 1: because β/ϕ · g(α + βz) = β′/ϕ′ · g(α′ + β′z) for any z, letting t = βz, we have g(α + t) = β′/β · ϕ/ϕ′ · g{α′ + β′/β · t}. By iteration, g(α + t) must be a constant, and β/ϕ = β′/ϕ′. Similarly, if |β′/β| > 1, g(α + t) must be a constant, and β/ϕ = β′/ϕ′. Otherwise, |β| = |β′| > 0.
If β = β′ ≠ 0, we have 1/ϕ · g(α + βz) = 1/ϕ′ · g(α′ + βz). If β = −β′, we have g(α + βz) = −ϕ/ϕ′ · g(α′ − βz) for all z, and thus g(α′ − βz) = −ϕ′/ϕ · g(α + βz) for all z. Let t1 and t2 denote two points such that g(t1), g(t2) ≠ 0, and let z1, z2 denote two values such that α′ − βz1 = α + βz2 = t1 and α′ − βz2 = α + βz1 = t2; then g(t1)/g(t2) = g(α′ − βz1)/g(α + βz1) = −ϕ′/ϕ, and g(t1)/g(t2) = g(α + βz2)/g(α′ − βz2) = −ϕ/ϕ′. As a result, we must have ϕ = ϕ′, and thus g(α + βz) = −g(α′ − βz) for all z. □
Lemma 5. For a non-constant function g, if g(α + βz) = g(α′ + β′z) for all z, then we must have
β = β′; or
β = −β′, and g(α + βz) = g(α′ − βz) for all z.
Proof. For β = 0, g(α′ + β′z) = g(α) for any z. Because g is not a constant, we must have β′ = 0.
For β ≠ 0, we must have β′ ≠ 0. Suppose |β′/β| < 1: because g(α + βz) = g(α′ + β′z) for any z, letting t = βz, we have g(α + t) = g{α′ + β′/β · t}. By iteration, g(α + t) must be a constant, which is a contradiction. So |β′/β| < 1 is impossible, and similarly, |β′/β| > 1 is impossible. As a result, |β| = |β′|.
If β = β′ ≠ 0, we have g(α + βz) = g(α′ + βz). If β = −β′, we have g(α + βz) = g(α′ − βz) for all z. □
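The second case of Lemma 5 is not vacuous: it occurs whenever g has a point of symmetry. A small numerical illustration (added here for concreteness, with g = cos and arbitrary α, β chosen only for the example):

```python
import numpy as np

# Lemma 5, second case: g(a + b z) = g(a' + b' z) for all z can hold with
# b' = -b when g has a symmetry point. Illustration with g = cos (even),
# a' = -a, b' = -b: cos(a + b z) = cos(-(a + b z)) = cos(a' + b' z).
g = np.cos
a, b = 0.7, 1.3
ap, bp = -a, -b

z = np.linspace(-5.0, 5.0, 1001)
# The two parameterizations give identical functions of z ...
assert np.allclose(g(a + b * z), g(ap + bp * z))
# ... and the stated conclusion g(a + b z) = g(a' - b z) holds as well.
assert np.allclose(g(a + b * z), g(ap - b * z))
```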
Proof of Theorem 1
Suppose θ and θ′ are two different sets of parameters such that pr(y, z; θ) = pr(y, z; θ′). Letting L(y, x, z) = log{pr(y, x | z; θ)/pr(y, x | z; θ′)}, we have L(y, x, z) = log{pr(x | y, z; θ)/pr(x | y, z; θ′)}. Because X is not independent of Z given Y, pr(x | y, z; θ) and pr(x | y, z; θ′) must vary with z; otherwise, L(y, x, z) is only a function of x and y, and θ cannot be identified. We further show that L(y, x, z) varies with z when a particular component of θ and θ′ are unequal. Under models (3) and (4), we have
We take the derivative of L(y, x, z) with respect to y and z, and obtain
If ∂²L/(∂y∂z) is not equal to zero, then L(y, x, z) varies with z.
(a) If η2 is a linear function, i.e., its derivative is a nonzero constant, then ∂²L/(∂y∂z) cannot equal zero for β1/ϕ ≠ β′1/ϕ′. Thus, β1/ϕ must be identified.
(b) If η2 is a linear function and the second-order derivative of B2 is a nonlinear function, we take the derivative of L(y, x, z) with respect to x and z. Noting that β1/ϕ = β′1/ϕ′ from case (a), we have
From Lemma 3, we only need to consider the following cases:
(b1) If , letting z = −(β0 + β2x)/β1, we have
Because it is not a linear function, i.e., its derivative is not a constant, ∂³L/(∂x²∂z) cannot equal zero for all x. Therefore, L(y, x, z) must vary with z.
(b2) If , we apply Lemma 5 to ∂³L/(∂x²∂z). When η2 is a linear function, we have proved that β1/ϕ = β′1/ϕ′; noting that ϕ, ϕ′ > 0, β1 and β′1 must have the same sign. For fixed x, from Lemma 5, ∂³L/(∂x²∂z) cannot equal zero for β1 ≠ β′1 or ϕ ≠ ϕ′.
(b3) If β2 = 0, we have Y⫫X | Z, and thus pr(y | z, x) = pr(y | z) can be identified from the observed data.
Therefore, we have proved that when η2 is a linear function and the second-order derivative of B2 is a nonlinear function, ∂³L/(∂x²∂z) cannot equal zero, and thus L(y, x, z) must vary with z, i.e., (β1, β2, ϕ) are identified.
(c) When η2 is a nonlinear function, we prove that L(y, x, z) varies with z. From Lemma 4, we only need to consider the following three cases.
(c1) If β1 = β′1 = 0, i.e., Y⫫Z | X, then from the shadow variable assumption Z⫫R | (Y, X), we have Z⫫R | X, and thus
From Lemma 1, pr(x | z; θ)/pr(x | z; θ′), and thus L(y, x, z), varies with z when the corresponding parameters differ. Otherwise, we note that
i.e.,
thus, by noting completeness of the exponential families under the full rank condition (Shao, 2003, Proposition 2.1, page 110), we have pr(y | z) = pr′(y | z).
(c2) For , we have
For , letting z = −(β0 + β2x)/β1, we have
which cannot equal 0 for all x for a non-constant function. Therefore, L(y, x, z) varies with z.
(c3) For , ϕ = ϕ′ and the function is antisymmetric. Letting z = −(β0 + β2x)/β1, we have
which cannot equal 0 for all x if .
For β1 = −β′1 ≠ 0 and ϕ = ϕ′, we let . If ∂g(x, z)/∂z ≠ 0, we have L2(z, w) = β1/ϕ · ∂g(x, z)/∂z ≠ 0; otherwise, if ∂g(x, z)/∂z = 0, i.e., g(x, z) = g(x) is only a function of x, we let , and then we have g(x) = 0. Therefore, g(x, z) = 0 for all z and for all x. Noting that ϕ = ϕ′, the two different sets (β0, β1, β2, ϕ) and (β′0, β′1, β′2, ϕ′) must index the same distribution pr(y | z, x), i.e., pr(y | z, x) is identified. Thus (β0, β1, β2, ϕ) can be identified under a given one-to-one mapping between the parameters and the distribution pr(y | z, x).
Proof of Theorem 2
Assume the normal models: Y | X, Z ~ N(β0 + β1z + β2x, ϕ) and X | Z ~ N(γ0 + γ1z, λ), then we have the following conditional distribution
with
Because X⫫Z | Y if and only if β1β2/ϕ = γ1/λ, the shadow variable assumption is satisfied when β1β2/ϕ ≠ γ1/λ. Under such a condition, because pr(x | y, z) follows a normal model, Miao et al. (2015) proved that for any two candidate models pr(x | y, z) and pr′(x | y, z), the ratio pr(x | y, z)/pr′(x | y, z) must vary with z. Thus, pr(x, y, z)/pr′(x, y, z) must vary with z, and therefore, all parameters (β0, β1, β2, ϕ, λ, α0, α1, α2) are identified.
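The independence condition can be verified numerically. The following check (an illustration added here, not part of the proof) builds the joint covariance of (Y, X, Z) from the two normal models, with Z ~ N(0, σ²z) assumed for the example, and confirms that the partial covariance cov(X, Z | Y) vanishes exactly when β1β2/ϕ = γ1/λ:

```python
import numpy as np

# Check that under Y | X,Z ~ N(b0 + b1*z + b2*x, phi) and
# X | Z ~ N(g0 + g1*z, lam), with Z ~ N(0, sz2), we get
# cov(X, Z | Y) = 0 exactly when b1*b2/phi = g1/lam.
def cov_xz_given_y(b1, b2, phi, g1, lam, sz2=1.0):
    # Joint covariance of (Y, X, Z) from the structural equations:
    # X = g0 + g1*Z + e, e ~ N(0, lam); Y = b0 + b1*Z + b2*X + u, u ~ N(0, phi).
    vx = g1**2 * sz2 + lam
    cxz = g1 * sz2
    cyz = (b1 + b2 * g1) * sz2
    cyx = b1 * cxz + b2 * vx
    vy = b1**2 * sz2 + b2**2 * vx + 2 * b1 * b2 * cxz + phi
    # Partial covariance: cov(X,Z) - cov(X,Y)*cov(Z,Y)/var(Y).
    return cxz - cyx * cyz / vy

# b1*b2/phi != g1/lam: dependence of X on Z given Y remains.
assert abs(cov_xz_given_y(b1=1.0, b2=1.0, phi=1.0, g1=0.5, lam=1.0)) > 1e-8
# Boundary case b1*b2/phi == g1/lam: X independent of Z given Y.
assert abs(cov_xz_given_y(b1=0.5, b2=1.0, phi=1.0, g1=0.5, lam=1.0)) < 1e-12
```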
Proof of Theorem 3
If β1 = β′1 = 0, i.e., Y⫫Z | X, then from the shadow variable assumption Z⫫R | (Y, X), we have Z⫫R | X, and thus
From Lemma 1, pr(x | z; θ)/pr(x | z; θ′), and thus L(y, x, z), varies with z when the corresponding parameters differ. Otherwise, we note that
i.e.,
thus, by noting completeness of the exponential families under the full rank condition (Shao, 2003, Proposition 2.1, page 110), we have pr(y | x) = pr′(y | x), i.e., (β0, β2, ϕ) are identified. If β2 = 0, we have Y⫫X | Z, and thus pr(y | z, x) = pr(y | z) can be identified from the observed data, i.e., (β1, β2, ϕ) are identified.
Proof of Theorem 4
We only need to prove that (7) and (8) are unbiased estimating equations when both pr(r = 1 | x, y; α) and pr(y | x, z; β) are correctly specified. Under the shadow variable assumption Z⫫R | (X, Y), for the true value α0 of α, we have
When pr(r = 1 | x, y; α) is correctly specified, E{r/π(x, y; α0) − 1 | x, y} = 0, and thus E[{r/π(x, y; α0) − 1} G(x, y)] = 0, i.e., (7) is an unbiased estimating equation for α, and thus the estimator α̂ obtained from (7) converges to α0 in probability. Furthermore, under the true values (α0, β0, ϕ0), we have
which equals zero under correct specification of both pr(r = 1 | x, y; α) and pr(y | x, z; β). Thus, (8) is an unbiased estimating equation for (β, ϕ), and thus (β̂, ϕ̂) obtained from (8) converge to their respective true values in probability.
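A minimal simulation sketch of the two-step procedure: first solve (7) for α, then solve (8), which in the linear-normal special case below reduces to inverse-probability-weighted least squares. All concrete choices here are illustrative assumptions, not specifications from the text: a logistic missingness model in (x, y), a linear-normal outcome model, and an estimating function G taken to depend only on the always-observed (y, z) so that the equation is computable when x is missing.

```python
import numpy as np
from scipy.optimize import root

# Illustrative data-generating process (all parameter values hypothetical).
rng = np.random.default_rng(0)
n = 200_000
z = rng.normal(size=n)
x = 0.5 + 0.8 * z + rng.normal(size=n)            # X | Z
beta_true = np.array([1.0, 0.5, 1.0])             # (b0, b1, b2)
y = beta_true @ np.vstack([np.ones(n), z, x]) + rng.normal(size=n)

alpha_true = np.array([1.0, -0.5, 0.3])           # missingness depends on (x, y)
pi_true = 1.0 / (1.0 + np.exp(-(alpha_true @ np.vstack([np.ones(n), x, y]))))
r = rng.binomial(1, pi_true)                      # r = 1: x observed

G = np.vstack([np.ones(n), y, z])                 # G(y, z): computable for all i

def eq7(alpha):
    # Sample version of E[{r/pi(x, y; alpha) - 1} G(y, z)] = 0.
    # r/pi needs x only when r = 1; for r = 0 the term is -G(y, z).
    lin = alpha[0] + alpha[1] * np.where(r == 1, x, 0.0) + alpha[2] * y
    pi = 1.0 / (1.0 + np.exp(-lin))
    return G @ (r / pi - 1.0) / n

sol = root(eq7, x0=np.zeros(3))
alpha_hat = sol.x

# Step 2: IPW least squares for (b0, b1, b2) over complete cases, each
# weighted by the inverse of its estimated selection probability. The missing
# x values enter only through w = r/pi_hat, which is zero when r = 0.
pi_hat = 1.0 / (1.0 + np.exp(-(alpha_hat @ np.vstack([np.ones(n), x, y]))))
w = r / pi_hat
D = np.vstack([np.ones(n), z, x]).T
beta_hat = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * y))
```

With a large sample, both α̂ and β̂ should land near their generating values; the dispersion parameter ϕ can be estimated analogously from the weighted residual sum of squares.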
REFERENCES
- Afifi A & Elashoff R (1966). Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association 61, 595–604.
- Anderson TW (1957). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52, 200–203.
- Baker SG & Laird NM (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical Association 83, 62–69.
- Bartlett JW, Carpenter JR, Tilling K & Vansteelandt S (2014). Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 15, 719–730.
- Buck SF (1960). A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society: Series B (Methodological), 302–306.
- Chen K (2001). Parametric models for response-biased sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 775–789.
- Dempster AP, Laird NM & Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 1–38.
- D’Haultfoeuille X (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics 154, 1–15.
- Edgett GL (1956). Multiple regression with missing observations among the independent variables. Journal of the American Statistical Association 51, 122–131.
- Fay RE (1986). Causal models for patterns of nonresponse. Journal of the American Statistical Association 81, 354–365.
- Glasser M (1964). Linear regression analysis with missing observations among the independent variables. Journal of the American Statistical Association 59, 834–844.
- Haitovsky Y (1968). Missing data in regression analysis. Journal of the Royal Statistical Society: Series B (Methodological), 67–82.
- Horton NJ & Laird NM (2001). Maximum likelihood analysis of logistic regression models with incomplete covariate data and auxiliary information. Biometrics 57, 34–42.
- Horvitz DG & Thompson DJ (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.
- Ibrahim JG (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association 85, 765–769.
- Ibrahim JG, Chen M-H, Lipsitz SR & Herring AH (2005). Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association 100, 332–346.
- Ibrahim JG, Lipsitz SR & Chen M-H (1999). Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 173–190.
- Kott PS (2014). Calibration weighting when model and calibration variables can differ. In Contributions to Sampling Statistics, Mecatti F, Conti LP & Ranalli GM, eds. Cham: Springer, pp. 1–18.
- Lipsitz SR, Ibrahim JG, Chen M-H & Peterson H (1999). Non-ignorable missing covariates in generalized linear models. Statistics in Medicine 18, 2435–2448.
- Little RJ (1992). Regression with missing X’s: A review. Journal of the American Statistical Association 87, 1227–1237.
- Little RJ & Rubin DB (2002). Statistical Analysis with Missing Data. New York: Wiley.
- Little RJ & Zhang N (2011). Subsample ignorable likelihood for regression analysis with missing data. Journal of the Royal Statistical Society: Series C (Applied Statistics) 60, 591–605.
- Ma WQ, Geng Z & Hu YH (2003). Identification of graphical models for nonignorable nonresponse of binary outcomes in longitudinal studies. Journal of Multivariate Analysis 87, 24–45.
- Miao W, Ding P & Geng Z (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association. To appear.
- Miao W, Tchetgen Tchetgen E & Geng Z (2015). Identification and doubly robust estimation of data missing not at random with a shadow variable. arXiv:1509.02556.
- Miao W & Tchetgen Tchetgen EJ (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika.
- Robins JM, Rotnitzky A & Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
- Rubin DB (1976). Inference and missing data (with discussion). Biometrika 63, 581–592.
- Rubin DB (2004). Multiple Imputation for Nonresponse in Surveys, vol. 81. John Wiley & Sons.
- Rubin DB & Schenker N (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 81, 366–374.
- Scharfstein DO, Rotnitzky A & Robins JM (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94, 1096–1120.
- Shao J (2003). Mathematical Statistics. New York: Springer, 2nd ed.
- Tang G, Little RJ & Raghunathan TE (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika 90, 747–764.
- Vach W & Schumacher M (1993). Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika 80, 353–362.
- Wang S, Shao J & Kim JK (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica 24, 1097–1116.
- Zahner GE, Pawelkiewicz W, DeFrancesco JJ & Adnopoz J (1992). Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment. Journal of the American Academy of Child & Adolescent Psychiatry 31, 951–960.
- Zhao J & Shao J (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association 110, 1577–1590.
- Zhao L & Lipsitz S (1992). Designs and analysis of two-stage studies. Statistics in Medicine 11, 769–782.
- Zhao LP, Lipsitz S & Lew D (1996). Regression analysis with missing covariate data using estimating equations. Biometrics, 1165–1182.
