Abstract
This paper shows that, when variables with missing values are linearly related to observed variables, the normal-distribution-based pseudo MLEs are still consistent. The population distribution may be unknown while the missing data process can follow an arbitrary missing at random mechanism. Enough details are provided for the bivariate case so that readers having taken a course in statistics/probability can fully understand the development. Sufficient conditions for the consistency of the MLEs in higher dimensions are also stated, while the details are omitted.
Keywords: Consistency, maximum likelihood, model misspecification, missing data
1. Introduction
Incomplete or missing data exist in almost all areas of empirical research. They are especially common when data are collected longitudinally and/or by surveys. There can be various reasons for missing data to occur. The process by which data become incomplete was called the missing data mechanism by Rubin (1976). Missing completely at random (MCAR) is a process in which missingness of data is independent of both the observed and the missing values; missing at random (MAR) is a process in which missingness is independent of the missing values given the observed data. When missingness depends on the missing values themselves given the observed data, the process is not missing at random (NMAR). Missing data with an NMAR mechanism are also referred to as non-ignorable non-responses because maximum likelihood estimates (MLE), by ignoring the missing data mechanism, are generally inconsistent. This paper studies the consistency of the normal-distribution-based MLE with MAR data.
In contrast to many ad hoc procedures for missing data analysis, MLEs have the desired property of being consistent even when the specific MAR mechanism is ignored. When modeling real data, however, specifying the correct distribution form to obtain the true MLE is always challenging if not impossible. The normal distribution is often chosen for convenience, not because practical data tend to come from normally distributed populations. Geary (1947, p. 241) observed that “Normality is a myth; there never was, and never will be a normal distribution.” Such an observation was further supported by Micceri (1989), who examined 440 data sets obtained from journal articles, research projects as well as tests, and found that all were significantly nonnormally distributed. Thus, the normal-distribution-based MLEs for real data typically are pseudo MLEs, whose properties have been obtained by White (1982) and Gourieroux, Monfort and Trognon (1984) in the context of complete data. With missing data, however, according to Laird (1988) and Rotnitzky and Wypij (1994), pseudo MLEs will be inconsistent unless the missing data mechanism is MCAR. The need for a correct likelihood function with an MAR mechanism was also noted by Liang and Zeger (1986) and Little (1993). If a pseudo MLE is not consistent when data are MAR, then only the MCAR mechanism can be ignored when modeling practical multivariate data with unknown population distributions. Thus, in addition to being an important mathematical property, consistency of the normal-distribution-based pseudo MLE with MAR data also has wide implications for many areas of applied statistics where the normal distribution is routinely used to model missing data.
Let x = (x1, x2, …, xq)′ be a vector representing a q-variate population, xo contain the observed values and xm contain the missing values in a realization of x. Let r = (r1, r2, …, rq)′, where rj = 1 if xj is observed and 0 otherwise. The missing data in xm are MAR if
$$P(r_j = 0 \mid x, \gamma_j) = g_j(x_o, \gamma_j) \tag{1}$$
and the rj's are conditionally/locally independent given x, where γj is a vector of parameters. In practice, three popular forms of gj(xo, γj) are the interval selection model (see e.g., Schafer, 1997, p. 25)
$$g_j(x_o, \gamma_j) = 1$$
when xo falls into certain hyper-rectangles, and 0 otherwise; the probit selection model
$$g_j(x_o, \gamma_j) = \Phi(\gamma_{j0} + \gamma_{j1}' x_o),$$
where Φ(·) is the cumulative distribution function of N(0, 1) and γj1 contains the regression coefficients; and the logistic selection model
$$g_j(x_o, \gamma_j) = \frac{\exp(\gamma_{j0} + \gamma_{j1}' x_o)}{1 + \exp(\gamma_{j0} + \gamma_{j1}' x_o)}.$$
The interval selection model is widely used in economics (Amemiya, 1973; Heckman, 1979) while the probit and logistic selection models are commonly used in many other disciplines (Allison, 2001; Little & Rubin, 2002; Molenberghs & Kenward, 2007; Daniels & Hogan, 2008). Under the interval selection model, Yuan (2009) showed that the normal-distribution-based pseudo MLEs are consistent and asymptotically normally distributed even when the underlying distribution is unknown. The purpose of this note is to extend the result of Yuan (2009) by showing that the normal-distribution-based MLEs are still consistent when the underlying population is unknown and when the gj(xo, γj) in (1) is any function of the observed data, including both the probit and logistic selection models.
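For concreteness, the three selection models named above can be sketched as plain Python functions. This is only an illustration: the linear index γj0 + γj1′xo, the function names, and the description of a hyper-rectangle by lower and upper bounds are assumptions made here, not notation taken from the cited references.

```python
import math


def std_normal_cdf(t):
    """Cumulative distribution function of N(0, 1), written with math.erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def linear_index(x_obs, gamma0, gamma1):
    """Assumed selection index gamma0 + gamma1' x_obs."""
    return gamma0 + sum(g * x for g, x in zip(gamma1, x_obs))


def interval_selection(x_obs, lower, upper):
    """1.0 if every observed value lies inside its interval, else 0.0."""
    inside = all(lo <= x <= hi for x, lo, hi in zip(x_obs, lower, upper))
    return 1.0 if inside else 0.0


def probit_selection(x_obs, gamma0, gamma1):
    """Probit form: Phi applied to the linear index."""
    return std_normal_cdf(linear_index(x_obs, gamma0, gamma1))


def logistic_selection(x_obs, gamma0, gamma1):
    """Logistic form, evaluated in a way that avoids overflow in exp."""
    eta = linear_index(x_obs, gamma0, gamma1)
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)
```

All three functions depend on the observed values only, which is the feature that makes the corresponding mechanisms MAR.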
While all the results in Yuan (2009) can be generalized to an MAR mechanism described by probit and logistic selection models, for brevity we only give the details for the bivariate case in section 2. With missing data and a misspecified likelihood function, little literature exists that facilitates thorough understanding of issues related to consistency. We choose this simple model with enough details so that readers having taken a course in statistics/probability can fully understand the development. We will also state the results for consistency for a general q in section 3, but the details will be omitted. We conclude the paper by pointing out that not all pseudo MLEs are consistent.
2. Consistency in the bivariate case
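Let x = (x1, x2)′ with
$$E(x) = \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \mathrm{Cov}(x) = \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, \tag{2}$$
where $\sigma_{11} = \sigma_1^2$, $\sigma_{22} = \sigma_2^2$ and σ12 = σ21 = σ1σ2ρ. A sample from x with missing values in x2 can be represented by
$$\begin{array}{llllll} x_{11}, & \ldots, & x_{n1}, & x_{(n+1)1}, & \ldots, & x_{N1}, \\ x_{12}, & \ldots, & x_{n2}, & & & \end{array} \tag{3}$$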
where x(n+1)2, …, xN2 are missing. The interest is to infer (2) based on (3) using the possibly wrong assumption x ~ N2(μ,Σ). Notice that the number of cases with xi2 missing is not controllable so that the n in (3) is a random number.
With two variables, there are a total of 4 possible observed patterns: (xi1, xi2), (xi1, ), ( , xi2) and ( , ). The sample in (3) only contains two of the four. We chose this sample because the MLEs enjoy analytical solutions and the proof of their consistency is simple enough to be understood by a broad audience. We will discuss the consistency of the MLEs with more missing data patterns and more variables in section 3.
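Let
$$\bar{x}_1 = \frac{1}{N}\sum_{i=1}^{N} x_{i1}, \quad s_{11} = \frac{1}{N}\sum_{i=1}^{N} (x_{i1} - \bar{x}_1)^2, \quad \bar{x}_j^* = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \quad s_{jk}^* = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j^*)(x_{ik} - \bar{x}_k^*), \quad j, k = 1, 2.$$
Then it follows from Anderson (1957) that the MLEs of (μ1, σ11, μ2, σ22, σ12) based on the normal distribution assumption by ignoring the missing data mechanism are
$$\hat{\mu}_1 = \bar{x}_1, \qquad \hat{\sigma}_{11} = s_{11}, \tag{4a}$$
$$\hat{\mu}_2 = \bar{x}_2^* + \frac{s_{21}^*}{s_{11}^*}(\bar{x}_1 - \bar{x}_1^*), \tag{4b}$$
$$\hat{\sigma}_{22} = s_{22}^* + \left(\frac{s_{21}^*}{s_{11}^*}\right)^2 (s_{11} - s_{11}^*), \tag{4c}$$
$$\hat{\sigma}_{12} = s_{11}\,\frac{s_{21}^*}{s_{11}^*}. \tag{4d}$$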
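The closed-form estimators can be sketched in a few lines of Python. The sketch assumes the familiar factored-likelihood form for this monotone pattern: x2 is regressed on x1 in the complete cases and the result is adjusted using all N values of x1. The function name and the use of `None` to mark a missing x2 are choices made here for illustration.

```python
def monotone_normal_mle(x1, x2):
    """Normal-theory MLEs when x1 is complete and x2 has missing values.

    x1: list of N numbers; x2: list of N entries, None marking a missing x2.
    Returns a dict with mu1, sigma11, mu2, sigma22, sigma12.
    """
    N = len(x1)
    complete = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(complete)
    if n < 2:
        raise ValueError("need at least two complete cases")

    # Moments of x1 based on all N cases (divisor N, as for an MLE).
    xbar1 = sum(x1) / N
    s11 = sum((a - xbar1) ** 2 for a in x1) / N

    # Moments based on the n complete cases only.
    xbar1_c = sum(a for a, _ in complete) / n
    xbar2_c = sum(b for _, b in complete) / n
    s11_c = sum((a - xbar1_c) ** 2 for a, _ in complete) / n
    s21_c = sum((a - xbar1_c) * (b - xbar2_c) for a, b in complete) / n
    s22_c = sum((b - xbar2_c) ** 2 for _, b in complete) / n

    beta = s21_c / s11_c  # slope of x2 on x1 among the complete cases
    return {
        "mu1": xbar1,
        "sigma11": s11,
        "mu2": xbar2_c + beta * (xbar1 - xbar1_c),
        "sigma22": s22_c + beta ** 2 * (s11 - s11_c),
        "sigma12": beta * s11,
    }
```

When no x2 is missing, the complete-case moments coincide with the full-sample moments and the function returns the ordinary sample mean vector and covariance matrix with divisor N.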
Through the work of Rubin (1976) and others, it is widely known that the MLEs in (4) are consistent when all the missing data in (3) are MCAR or MAR and x ~ N2(μ,Σ). In the following, we will study the consistency of $\hat{\mu}_2$, $\hat{\sigma}_{22}$ and $\hat{\sigma}_{12}$ using (4) when x does not follow the bivariate normal distribution and the missing xi2's in (3) are MAR. For such a purpose, we assume that the population for the data in (3) is
$$x_1 = \mu_1 + \sigma_1 z_1, \qquad x_2 = \mu_2 + \sigma_2\{\rho z_1 + (1 - \rho^2)^{1/2} z_2\}, \tag{5}$$
where z1 and z2 are independent with E(z1) = E(z2) = 0 and Var(z1) = Var(z2) = 1. Clearly, (5) follows a bivariate normal distribution only when z1 ~ N(0, 1) and z2 ~ N(0, 1). However, the population mean vector and variance-covariance matrix of x = (x1, x2)′ remain the same regardless of the distributions of z1 and z2.
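The invariance of the first two moments can be checked numerically. The sketch below assumes the population takes the form x1 = μ1 + σ1 z1 and x2 = μ2 + σ2{ρ z1 + (1 − ρ²)^{1/2} z2}; the two nonnormal choices for z1 and z2 (a shifted exponential and a uniform), the parameter values, and the seed are arbitrary illustrations.

```python
import math
import random


def draw_standardized(kind, rng):
    """Draw one value with mean 0 and variance 1 from the named distribution."""
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    if kind == "exponential":  # Exp(1) shifted to mean 0; skewed
        return rng.expovariate(1.0) - 1.0
    if kind == "uniform":      # U(-sqrt(3), sqrt(3)); lighter tails than normal
        return rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))
    raise ValueError(kind)


def draw_x(mu1, mu2, sigma1, sigma2, rho, kind1, kind2, rng):
    """One draw of (x1, x2) as linear functions of independent z1 and z2."""
    z1 = draw_standardized(kind1, rng)
    z2 = draw_standardized(kind2, rng)
    x1 = mu1 + sigma1 * z1
    x2 = mu2 + sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return x1, x2


def sample_moments(pairs):
    """Return (mean1, mean2, var1, var2, cov12) using divisor N."""
    N = len(pairs)
    m1 = sum(a for a, _ in pairs) / N
    m2 = sum(b for _, b in pairs) / N
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / N
    v2 = sum((b - m2) ** 2 for _, b in pairs) / N
    c12 = sum((a - m1) * (b - m2) for a, b in pairs) / N
    return m1, m2, v1, v2, c12


rng = random.Random(2024)
pairs = [draw_x(1.0, 2.0, 1.0, 2.0, 0.6, "exponential", "uniform", rng)
         for _ in range(100000)]
m1, m2, v1, v2, c12 = sample_moments(pairs)
print(m1, m2, v1, v2, c12)  # targets: 1, 2, 1, 4, and 1 * 2 * 0.6 = 1.2
```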
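Corresponding to (3) there exist independent random variables zi1 and zi2 such that
$$x_{i1} = \mu_1 + \sigma_1 z_{i1}, \qquad x_{i2} = \mu_2 + \sigma_2\{\rho z_{i1} + (1 - \rho^2)^{1/2} z_{i2}\}, \qquad i = 1, 2, \ldots, N.$$
Let
$$\bar{z}_j^* = \frac{1}{n}\sum_{i=1}^{n} z_{ij}, \qquad s_{zjk}^* = \frac{1}{n}\sum_{i=1}^{n} (z_{ij} - \bar{z}_j^*)(z_{ik} - \bar{z}_k^*), \qquad j, k = 1, 2.$$
Then
$$\bar{x}_1^* = \mu_1 + \sigma_1 \bar{z}_1^*, \qquad s_{11}^* = \sigma_{11} s_{z11}^*, \qquad s_{21}^* = \sigma_{12} s_{z11}^* + \sigma_1\sigma_2 (1 - \rho^2)^{1/2} s_{z21}^*, \tag{6a}$$
$$\bar{x}_2^* = \mu_2 + \sigma_2\{\rho \bar{z}_1^* + (1 - \rho^2)^{1/2} \bar{z}_2^*\}, \qquad s_{22}^* = \sigma_{22}\{\rho^2 s_{z11}^* + 2\rho(1 - \rho^2)^{1/2} s_{z21}^* + (1 - \rho^2) s_{z22}^*\}. \tag{6b}$$
The equations in (6) allow us to obtain the probability limits of $\bar{x}_j^*$ and sjk* through those of $\bar{z}_j^*$ and szjk*, which further lead to consistency of $\hat{\mu}_2$, $\hat{\sigma}_{22}$ and $\hat{\sigma}_{12}$ in (4).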
We also need to connect the observations in (3) to the MAR mechanism. Let ri = 1 if the xi2 in (3) is observed and ri = 0 if the xi2 in (3) is missing. Because xi1 and zi1 are uniquely determined by each other, we can rewrite the MAR mechanism in equation (1) as
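$$P(r_i = 0 \mid x_{i1}, x_{i2}) = g(x_{i1}, \gamma) = h(z_{i1}), \tag{7}$$
where the parameter vector γ is omitted from h(·). Let the probability density functions (pdf) of z1 and z2 be f1(t) and f2(t), respectively. Then n, the number of complete cases in (3), follows the binomial distribution B(N, po), where, with I being an indicator function,
$$p_o = E\{I(r_i = 1)\} = P(r_i = 1) = \int \{1 - h(t)\} f_1(t)\, dt.$$
Let ti1 be the realized value of zi1 and f be a generic notation for the probability distribution/density function of the involved random variables. It follows from (7) and
$$f(t_{i1} \mid r_i = 1) = \frac{f(r_i = 1 \mid t_{i1})\, f_1(t_{i1})}{P(r_i = 1)}$$
that
$$f(t_{i1} \mid r_i = 1) = \frac{\{1 - h(t_{i1})\}\, f_1(t_{i1})}{p_o}.$$
Thus, the zi1 corresponding to the observed xi2 are independent, identically distributed (iid), and each follows the distribution with pdf
$$f_1^*(t) = \frac{\{1 - h(t)\}\, f_1(t)}{p_o}.$$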
Notice that, due to the MAR mechanism, the missingness in (3) has nothing to do with zi2. Each zi2 corresponding to either the observed xi2 or missing xi2 still has the same distribution as z2 ~ f2(t).
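Let the mean and variance of u ~ f1*(t) be μz1* and σz11*. Let u1, u2, …, uN be iid with pdf f1*(t); ω1, ω2, …, ωN be iid with P(ωi = 1) = po and P(ωi = 0) = 1 – po; and the ωi's be independent of ui's and zi2's. Then n and $\sum_{i=1}^{N} \omega_i$ have the same distribution; $\sum_{i=1}^{n} z_{i1}$ and $\sum_{i=1}^{N} \omega_i u_i$ have the same distribution; $\sum_{i=1}^{n} z_{i2}$ and $\sum_{i=1}^{N} \omega_i z_{i2}$ have the same distribution; $\sum_{i=1}^{n} z_{i1}^2$ and $\sum_{i=1}^{N} \omega_i u_i^2$ have the same distribution; $\sum_{i=1}^{n} z_{i1} z_{i2}$ and $\sum_{i=1}^{N} \omega_i u_i z_{i2}$ have the same distribution; $\sum_{i=1}^{n} z_{i2}^2$ and $\sum_{i=1}^{N} \omega_i z_{i2}^2$ have the same distribution. A brief proof for the above relationships using characteristic functions is provided in Appendix A of Yuan and Lu (2008).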
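The quantities po, μz1* and σz11* can be evaluated numerically for a given mechanism. The sketch below assumes that h(t) denotes the probability that xi2 is missing given zi1 = t, so that the density of zi1 among the complete cases is proportional to {1 − h(t)} f1(t). The trapezoidal rule, the integration range, and the two example mechanisms are illustrative choices.

```python
import math


def selected_moments(h, f1, lower=-12.0, upper=12.0, steps=24000):
    """Return (p_o, mean, variance) of the density {1 - h(t)} f1(t) / p_o.

    Integrals are approximated by the trapezoidal rule on [lower, upper].
    """
    width = (upper - lower) / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        t = lower + k * width
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        mass = weight * (1.0 - h(t)) * f1(t) * width
        m0 += mass
        m1 += t * mass
        m2 += t * t * mass
    mean = m1 / m0
    return m0, mean, m2 / m0 - mean ** 2


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


# MCAR: a constant h leaves the distribution of z_i1 unchanged.
print(selected_moments(lambda t: 0.3, normal_pdf))
# Logistic MAR: larger z_i1 is more likely to lose x_i2, so the mean shifts down.
print(selected_moments(lambda t: logistic(2.0 * t), normal_pdf))
```

Under the MCAR mechanism the selected mean and variance stay at 0 and 1, whereas under the logistic mechanism they do not, which is why the complete-case moments alone are not consistent for the population moments.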
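We are now ready to show the result of consistency. Applying the law of large numbers to the average of ωi yields
$$\frac{n}{N} = \frac{1}{N}\sum_{i=1}^{N} \omega_i \xrightarrow{a.s.} p_o,$$
where the equal sign follows from equivalence in distribution and $\xrightarrow{a.s.}$ is the notation for convergence with probability one. Continuously applying equivalence in distribution and the law of large numbers to the averages of ωiui and ωizi2, respectively, we have
$$\bar{z}_1^* = \frac{N^{-1}\sum_{i=1}^{N} \omega_i u_i}{N^{-1}\sum_{i=1}^{N} \omega_i} \xrightarrow{a.s.} \frac{p_o\, \mu_{z1}^*}{p_o} = \mu_{z1}^* \tag{8}$$
and
$$\bar{z}_2^* = \frac{N^{-1}\sum_{i=1}^{N} \omega_i z_{i2}}{N^{-1}\sum_{i=1}^{N} \omega_i} \xrightarrow{a.s.} \frac{p_o\, E(z_2)}{p_o} = 0. \tag{9}$$
Similarly,
$$\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2 \xrightarrow{a.s.} \sigma_{z11}^* + \mu_{z1}^{*2}, \qquad \frac{1}{n}\sum_{i=1}^{n} z_{i1} z_{i2} \xrightarrow{a.s.} 0, \qquad \frac{1}{n}\sum_{i=1}^{n} z_{i2}^2 \xrightarrow{a.s.} 1.$$
Thus,
$$s_{z11}^* \xrightarrow{a.s.} \sigma_{z11}^*, \tag{10}$$
$$s_{z21}^* \xrightarrow{a.s.} 0, \tag{11}$$
$$s_{z22}^* \xrightarrow{a.s.} 1. \tag{12}$$
It is obvious that
$$\bar{x}_1 \xrightarrow{a.s.} \mu_1, \qquad s_{11} \xrightarrow{a.s.} \sigma_{11}. \tag{13}$$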
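Regarding μ1, μ2, σ1, σ2 and ρ as constants, $\bar{x}_1^*$, $\bar{x}_2^*$, s11*, s21* and s22* in (6) are just linear combinations of $\bar{z}_1^*$, $\bar{z}_2^*$, sz11*, sz21* and sz22*, whose probability limits have already been obtained. Combining (6a), (8), (10) and (11) yields
$$\bar{x}_1^* \xrightarrow{a.s.} \mu_1 + \sigma_1 \mu_{z1}^*, \qquad s_{11}^* \xrightarrow{a.s.} \sigma_{11}\sigma_{z11}^*, \qquad s_{21}^* \xrightarrow{a.s.} \sigma_{12}\sigma_{z11}^*. \tag{14}$$
Combining (6b) and (8) to (12) yields
$$\bar{x}_2^* \xrightarrow{a.s.} \mu_2 + \sigma_2 \rho\, \mu_{z1}^*, \qquad s_{22}^* \xrightarrow{a.s.} \sigma_{22}(\rho^2 \sigma_{z11}^* + 1 - \rho^2). \tag{15}$$
Thus,
$$\frac{s_{21}^*}{s_{11}^*} \xrightarrow{a.s.} \frac{\sigma_{12}}{\sigma_{11}}. \tag{16}$$
It follows from (4b) and (13) to (16) that
$$\hat{\mu}_2 \xrightarrow{a.s.} \mu_2 + \sigma_2 \rho\, \mu_{z1}^* + \frac{\sigma_{12}}{\sigma_{11}}(\mu_1 - \mu_1 - \sigma_1 \mu_{z1}^*) = \mu_2.$$
So $\hat{\mu}_2$ is consistent. It follows from (4c) and (13) to (16) that
$$\hat{\sigma}_{22} \xrightarrow{a.s.} \sigma_{22}(\rho^2 \sigma_{z11}^* + 1 - \rho^2) + \left(\frac{\sigma_{12}}{\sigma_{11}}\right)^2 (\sigma_{11} - \sigma_{11}\sigma_{z11}^*) = \sigma_{22}.$$
So $\hat{\sigma}_{22}$ is also consistent. It follows from (4d), (13) and (16) that
$$\hat{\sigma}_{12} \xrightarrow{a.s.} \sigma_{11}\,\frac{\sigma_{12}}{\sigma_{11}} = \sigma_{12}.$$
So $\hat{\sigma}_{12}$ is, again, consistent.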
Notice that the g(·,·) or h(·) in (7) can be any function of the observed data. Thus, the normal-distribution-based pseudo MLEs are consistent for any MAR process.
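The result can be illustrated by a small Monte Carlo sketch. It assumes a linear population of the form x1 = μ1 + σ1 z1, x2 = μ2 + σ2{ρ z1 + (1 − ρ²)^{1/2} z2} with a skewed z1 and a uniform z2, and a logistic mechanism in which the probability of losing x2 increases with the observed x1. The parameter values, the coefficient 1.5, the sample size, and the seed are arbitrary choices for illustration; a single simulated sample demonstrates the behavior but proves nothing.

```python
import math
import random


def simulate(N, seed=1):
    """Generate a nonnormal bivariate sample with x2 MAR given x1."""
    rng = random.Random(seed)
    mu1, mu2, sigma1, sigma2, rho = 1.0, 2.0, 1.0, 2.0, 0.6
    x1, x2 = [], []
    for _ in range(N):
        z1 = rng.expovariate(1.0) - 1.0                    # mean 0, variance 1, skewed
        z2 = rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))  # mean 0, variance 1
        a = mu1 + sigma1 * z1
        b = mu2 + sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        # Logistic MAR: the chance that x2 is missing depends on x1 only.
        p_missing = 1.0 / (1.0 + math.exp(-1.5 * (a - mu1)))
        x1.append(a)
        x2.append(None if rng.random() < p_missing else b)
    return x1, x2


def pseudo_mle(x1, x2):
    """Return (mu2_hat, sigma22_hat, sigma12_hat, complete-case mean of x2)."""
    N = len(x1)
    complete = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(complete)
    xbar1 = sum(x1) / N
    s11 = sum((a - xbar1) ** 2 for a in x1) / N
    xbar1_c = sum(a for a, _ in complete) / n
    xbar2_c = sum(b for _, b in complete) / n
    s11_c = sum((a - xbar1_c) ** 2 for a, _ in complete) / n
    s21_c = sum((a - xbar1_c) * (b - xbar2_c) for a, b in complete) / n
    s22_c = sum((b - xbar2_c) ** 2 for _, b in complete) / n
    beta = s21_c / s11_c
    mu2_hat = xbar2_c + beta * (xbar1 - xbar1_c)
    sigma22_hat = s22_c + beta ** 2 * (s11 - s11_c)
    sigma12_hat = beta * s11
    return mu2_hat, sigma22_hat, sigma12_hat, xbar2_c


x1, x2 = simulate(100000)
mu2_hat, sigma22_hat, sigma12_hat, cc_mean = pseudo_mle(x1, x2)
print("pseudo MLEs:", mu2_hat, sigma22_hat, sigma12_hat)  # targets 2, 4, 1.2
print("complete-case mean of x2:", cc_mean)               # biased downward
```

In this setup the pseudo MLEs settle near the population values, whereas the complete-case mean of x2 does not, because the complete cases over-represent small values of x1.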
3. Consistency in general
Parallel to (5), let
$$x = \mu + Az, \tag{17}$$
where μ = (μ1, μ2, …, μq)′,
$$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{q1} & a_{q2} & \cdots & a_{qq} \end{pmatrix}$$
is a lower triangular matrix and satisfies Σ = AA′, and z = (z1, z2, …, zq)′ with z1, z2, …, zq being independent and standardized random variables. Then E(x) = μ, Cov(x) = Σ and the distribution of x is determined by those of the zj's. When z ~ Nq(0, I), x ~ Nq(μ,Σ). We do not know the distribution form of x in general even when the distributions of zj's are known. Notice that each element xj of x in either (5) or (17) is a linear combination of independent random components in z, which is a necessary condition for the consistency of the pseudo MLEs $\hat{\mu}$ and $\hat{\Sigma}$.
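A data model of this kind can be generated as sketched below, assuming x = μ + Az with A taken to be the Cholesky factor of Σ, which is one lower triangular matrix satisfying Σ = AA′. The example Σ, the mean vector, the shifted exponential components of z, and the seed are illustrative choices.

```python
import math
import random


def cholesky_lower(sigma):
    """Return lower triangular A with A A' = sigma (sigma positive definite)."""
    q = len(sigma)
    a = [[0.0] * q for _ in range(q)]
    for j in range(q):
        for k in range(j + 1):
            partial = sum(a[j][m] * a[k][m] for m in range(k))
            if j == k:
                a[j][j] = math.sqrt(sigma[j][j] - partial)
            else:
                a[j][k] = (sigma[j][k] - partial) / a[k][k]
    return a


def draw_x(mu, a, rng):
    """One draw of x = mu + A z with independent standardized z_j."""
    q = len(mu)
    z = [rng.expovariate(1.0) - 1.0 for _ in range(q)]  # mean 0, variance 1, skewed
    # x_j uses only z_1, ..., z_j because A is lower triangular.
    return [mu[j] + sum(a[j][k] * z[k] for k in range(j + 1)) for j in range(q)]


sigma = [[4.0, 2.0, 1.0], [2.0, 3.0, 0.5], [1.0, 0.5, 2.0]]
mu = [1.0, 2.0, 3.0]
A = cholesky_lower(sigma)
rng = random.Random(11)
sample = [draw_x(mu, A, rng) for _ in range(50000)]
means = [sum(row[j] for row in sample) / len(sample) for j in range(3)]
print(A)
print(means)
```

Because A is lower triangular with a nonzero diagonal, the first L components of x and the first L components of z determine each other.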
Let x1, x2, …, xN be a random sample drawn from x with xi = μ + Azi, where xi = (xi1, xi2, …, xiq)′ and zi = (zi1, zi2, …, ziq)′. Consider a case xi in which d variables xij1, xij2, …, xijd are missing and the event is related in probability to the c values of zil1, zil2, …, zilc. Let J = min(j1, j2, …, jd) and L = max(l1, l2, …, lc). When L < J and (xi1, xi2, …, xiL) are observed, all the information related to missing values is observed. Let ri = (ri1, ri2, …, riq)′ be the vector with rij = 1 if xij is observed and zero otherwise, xim = (xij1, xij2, …, xijd)′, and xio be the vector of observed variables. Then
$$P(r_{ij} = 0 \mid x_i, \gamma_j) = g_j(x_{io}, \gamma_j). \tag{18}$$
Thus, the probability of missing only depends on the observed values and the missing data mechanism is MAR (Rubin, 1976). Notice that xiL = (xi1, xi2,…, xiL)′ is a subset of xio, and thus, the probability function gj(xio, γj) in (18) can also be written as gj(xiL, γj). Also notice that, due to a random process, the number d and the subscripts j1, j2, …, jd may change from case to case while l1, l2, …, lc are held constant across the sample.
Unlike the problems considered in the previous section, the MLEs do not possess analytical forms when the observed data patterns are not monotonic. So we cannot directly show that the MLEs are consistent as was done in the previous section. We cannot use the established theory of maximum likelihood as in Rubin (1976) either, because the MLEs are obtained based on an incorrect likelihood function. By showing that the normal estimating equation is unbiased at the true population values, Yuan (2009) proved that the normal-distribution-based MLEs are consistent when the missingness of xim is due to (zil1, zil2, …, zilc) falling into certain hyper-rectangles. When (17) and (18) hold and (xi1, xi2, …, xiL) are observed, using essentially the same approach as in Yuan (2009) we can show that the normal-distribution-based MLEs are consistent regardless of the distribution shape of x and the form of gj(xio, γj) in (18). The MLEs are also asymptotically normally distributed and the covariance matrix of the MLEs can be consistently estimated by a sandwich-type covariance matrix.
4. Conclusions
It has been argued that, in any statistical modeling, the distribution specification is at best only an approximation to the real world (Box, 1979). Thus, all MLEs in practice are pseudo MLEs. In the context of missing data, it is nice to know that pseudo MLEs can remain consistent when (17) and (18) hold. We need to note that, although data model (17) includes an infinite number of nonnormal distributions, it does not include nonnormal distributions created by nonlinear functions of independent random variables z1, z2, …, zq. Yuan (2009) describes an example in which the MLEs are not consistent when x = (x1, x2)′ with x1 = z1 and x2 being a nonlinear function of z1 and z2, where z1 and z2 are independent, and x2 is missing when z1 falls into a certain interval. It can be shown that the MLEs also fail to be consistent for this data model when the MAR mechanism obeys either a logit or a probit selection process. We also would like to note that pseudo MLEs based on an incorrect distributional specification other than Nq(μ,Σ) may not be consistent when missing values are MAR. Actually, even without missing data, Gourieroux et al. (1984) showed that pseudo MLEs are consistent only when the assumed distribution belongs to a quadratic exponential family.
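The role of linearity can be illustrated with a hypothetical pair of data models; this is not the example in Yuan (2009), whose exact form is not reproduced here. In the sketch below, x1 = z1 in both models, x2 is either the linear function 1 + z1 + z2 or the assumed nonlinear function z1² + z2, and x2 is missing whenever z1 > 0. Both models have E(x2) = 1 when z1 and z2 are standard normal. The function names, sample size, and seed are illustrative.

```python
import random


def mu2_pseudo_mle(x1, x2):
    """Normal-theory estimate of E(x2) when x1 is complete and x2 has gaps."""
    N = len(x1)
    complete = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(complete)
    xbar1 = sum(x1) / N
    xbar1_c = sum(a for a, _ in complete) / n
    xbar2_c = sum(b for _, b in complete) / n
    s11_c = sum((a - xbar1_c) ** 2 for a, _ in complete) / n
    s21_c = sum((a - xbar1_c) * (b - xbar2_c) for a, b in complete) / n
    return xbar2_c + (s21_c / s11_c) * (xbar1 - xbar1_c)


def simulate(N, nonlinear, seed=7):
    """x2 is missing whenever z1 falls into the interval (0, infinity)."""
    rng = random.Random(seed)
    x1, x2 = [], []
    for _ in range(N):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        b = z1 ** 2 + z2 if nonlinear else 1.0 + z1 + z2  # E(x2) = 1 in both models
        x1.append(z1)
        x2.append(None if z1 > 0.0 else b)
    return x1, x2


mu2_linear = mu2_pseudo_mle(*simulate(100000, nonlinear=False))
mu2_nonlinear = mu2_pseudo_mle(*simulate(100000, nonlinear=True))
print(mu2_linear)     # close to the true value 1
print(mu2_nonlinear)  # far from 1: the linear extrapolation fails
```

The estimate extrapolates the complete-case regression of x2 on x1 to the cases with missing x2, which is valid when that regression is linear over the whole range of x1 and fails in the nonlinear model.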
For the purpose of allowing missingness to depend on all the linear combinations of the previously observed variables, we specified A as a lower triangular matrix in (17) so that (z1, z2, …, zL) and (x1, x2, …, xL) are determined by each other. In practice, a participant may join the study after missing a few times and then be missing again. The missingness at the later stage may depend on all the previously observed variables. We can match such a case with (17) by specifying that the rows of A that correspond to the observed variables form the upper-left part of a lower triangular matrix; the consistency result then still holds.
As with a general MAR missing data mechanism, the MAR condition in (18) cannot be tested. Without extra information beyond the observed sample, it is impossible to distinguish between MAR and NMAR mechanisms (Molenberghs et al., 2008). Similarly, the data model (17) cannot be tested either because the distribution of z is arbitrary.
Acknowledgment
We would like to thank the editor, an associate editor, and a referee for comments that led to a significant improvement of the paper.
This research was supported by Grants DA00017 and DA01070 from the National Institute on Drug Abuse and a grant from the National Natural Science Foundation of China (30870784).
References
- Allison PD. Missing data. Sage; Thousand Oaks, CA: 2001.
- Amemiya T. Regression analysis when the dependent variable is truncated normal. Econometrica. 1973;41:997–1016.
- Anderson TW. Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association. 1957;52:200–203.
- Box GEP. Robustness in the strategy of scientific model building. In: Launer RL, Wilkinson GN, editors. Robustness in statistics. Academic Press; New York: 1979. pp. 201–236.
- Daniels MJ, Hogan JW. Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman & Hall; Boca Raton, Florida: 2008.
- Geary RC. Testing for normality. Biometrika. 1947;34:209–242.
- Gourieroux C, Monfort A, Trognon A. Pseudo maximum likelihood methods: Theory. Econometrica. 1984;52:681–700.
- Heckman JJ. Sample selection bias as a specification error. Econometrica. 1979;47:153–161.
- Laird NM. Missing data in longitudinal studies. Statistics in Medicine. 1988;7:305–315. doi: 10.1002/sim.4780070131.
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Little RJA. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association. 1993;88:125–134.
- Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Wiley; New York: 2002.
- Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105:156–166.
- Molenberghs G, Beunckens C, Sotto C, Kenward MG. Every missing not at random model has got a missing at random counterpart with equal fit. Journal of the Royal Statistical Society B. 2008;70:371–388.
- Molenberghs G, Kenward MG. Missing data in clinical studies. Wiley; Chichester, England: 2007.
- Rotnitzky A, Wypij D. A note on the bias of estimators with missing data. Biometrics. 1994;50:1163–1170.
- Rubin DB. Inference and missing data (with discussions). Biometrika. 1976;63:581–592.
- Schafer JL. Analysis of incomplete multivariate data. Chapman & Hall; London: 1997.
- White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
- Yuan K-H. Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis. Journal of Multivariate Analysis. 2009;100:1900–1918.
- Yuan K-H, Lu L. SEM with missing data and unknown population distributions using two-stage ML: Theory and its application. Multivariate Behavioral Research. 2008;62:621–652. doi: 10.1080/00273170802490699.