Consistency of Normal Distribution Based Pseudo Maximum Likelihood Estimates When Data Are Missing at Random

Ke-Hai Yuan; Peter M Bentler

doi:10.1198/tast.2010.09203

Author manuscript; available in PMC: 2010 Dec 28.

Published in final edited form as: Am Stat. 2010 Aug 1;64(3):263–267. doi: 10.1198/tast.2010.09203

PMCID: PMC3010738 NIHMSID: NIHMS258556 PMID: 21197422

Abstract

This paper shows that, when variables with missing values are linearly related to observed variables, the normal-distribution-based pseudo MLEs are still consistent. The population distribution may be unknown while the missing data process can follow an arbitrary missing at random mechanism. Enough details are provided for the bivariate case so that readers having taken a course in statistics/probability can fully understand the development. Sufficient conditions for the consistency of the MLEs in higher dimensions are also stated, while the details are omitted.

Keywords: Consistency, maximum likelihood, model misspecification, missing data

1. Introduction

Incomplete or missing data exist in almost all areas of empirical research. They are especially common when data are collected longitudinally and/or by surveys. There can be various reasons for missing data to occur. The process by which data become incomplete was called the missing data mechanism by Rubin (1976). Missing completely at random (MCAR) is a process in which missingness of data is independent of both the observed and the missing values; missing at random (MAR) is a process in which missingness is independent of the missing values given the observed data. When missingness depends on the missing values themselves given the observed data, the process is not missing at random (NMAR). Missing data with an NMAR mechanism are also referred to as non-ignorable non-responses because maximum likelihood estimates (MLE), by ignoring the missing data mechanism, are generally inconsistent. This paper studies the consistency of the normal-distribution-based MLE with MAR data.

In contrast to many ad hoc procedures for missing data analysis, MLEs have the desired property of being consistent even when the specific MAR mechanism is ignored. When modeling real data, however, specifying the correct distribution form to obtain the true MLE is always challenging if not impossible. The normal distribution is often chosen for convenience, not because practical data tend to come from normally distributed populations. Geary (1947, p. 241) observed that “Normality is a myth; there never was, and never will be a normal distribution.” Such an observation was further supported by Micceri (1989), who examined 440 data sets obtained from journal articles, research projects as well as tests, and found that all were significantly nonnormally distributed. Thus, the normal-distribution-based MLEs for real data typically are pseudo MLEs, whose properties have been obtained by White (1982) and Gourieroux, Monfort and Trognon (1984) in the context of complete data. With missing data, however, according to Laird (1988) and Rotnitzky and Wypij (1994), pseudo MLEs will be inconsistent unless the missing data mechanism is MCAR. The need for a correct likelihood function with an MAR mechanism was also noted by Liang and Zeger (1986) and Little (1993). If a pseudo MLE is not consistent when data are MAR, then only the MCAR mechanism can be ignored when modeling practical multivariate data with unknown population distributions. Thus, in addition to being an important mathematical property, consistency of the normal-distribution-based pseudo MLE with MAR data also has wide implications for many areas of applied statistics where the normal distribution is routinely used to model missing data.

Let x = (x₁, x₂, …, x_q)′ be a vector representing a q-variate population, x_o contain the observed values and x_m contain the missing values in a realization of x. Let r = (r₁, r₂, …, r_q)′, where r_j = 1 if x_j is observed and 0 otherwise. The missing data in x_m are MAR if

P (r_{j} = 0 ∣ x_{o}, x_{m}) = P (r_{j} = 0 ∣ x_{o}) = g_{j} (x_{o}, γ_{j}), j = 1, 2, \dots, q

(1)

and the r_j's are conditionally/locally independent given x, where γ_j is a vector of parameters. In practice, three popular forms of g_j(x_o, γ_j) are the interval selection model (see e.g., Schafer, 1997, p. 25)

g_{j} (x_{o}, γ_{j}) = 1

when x_o falls into certain hyper-rectangles; the probit selection model

g_{j} (x_{o}, γ_{j}) = Φ (γ_{j 0} + γ_{j 1}^{'} x_{o}),

where Φ(·) is the cumulative distribution function of N(0, 1) and γ_j1 contains the regression coefficients; and the logistic selection model

g_{j} (x_{o}, γ_{j}) = \frac{exp (γ_{j 0} + γ_{j 1}^{'} x_{o})}{1 + exp (γ_{j 0} + γ_{j 1}^{'} x_{o})} .

The interval selection model is widely used in economics (Amemiya, 1973; Heckman, 1979) while the probit and logistic selection models are commonly used in many other disciplines (Allison 2001; Little & Rubin, 2002; Molenberghs & Kenward, 2007; Daniels & Hogan, 2008). Under the interval selection model, Yuan (2009) showed that the normal-distribution-based pseudo MLEs are consistent and asymptotically normally distributed even when the underlying distribution is unknown. The purpose of this note is to extend the result of Yuan (2009) by showing that the normal-distribution-based MLEs are still consistent when the underlying population is unknown and when the g_j(x_o, γ_j) in (1) is any function of the observed data, including both the probit and logistic selection models.
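The three selection models above can be written down directly as functions of the observed values. The following is only a minimal sketch: the function names, the argument layout, and the convention that the interval model returns 0 outside the hyper-rectangle are illustrative assumptions, not part of the original development.

```python
import math

def interval_selection(x_o, lower, upper):
    """Return 1.0 if every coordinate of x_o lies in its [lower, upper] interval.

    Assumption: the missingness probability is 0 outside the hyper-rectangle.
    """
    inside = all(lo <= v <= hi for v, lo, hi in zip(x_o, lower, upper))
    return 1.0 if inside else 0.0

def _linear_predictor(x_o, gamma0, gamma1):
    # gamma0 + gamma1' x_o
    return gamma0 + sum(g * v for g, v in zip(gamma1, x_o))

def probit_selection(x_o, gamma0, gamma1):
    """Phi(gamma0 + gamma1' x_o), with Phi the N(0, 1) distribution function."""
    eta = _linear_predictor(x_o, gamma0, gamma1)
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def logistic_selection(x_o, gamma0, gamma1):
    """exp(eta) / (1 + exp(eta)), evaluated in a numerically stable way."""
    eta = _linear_predictor(x_o, gamma0, gamma1)
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)
```

In each case the returned value is the probability that x_j is missing, and it depends on the data only through x_o, which is what makes the mechanism MAR.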

While all the results in Yuan (2009) can be generalized to an MAR mechanism described by probit and logistic selection models, for brevity we only give the details for the bivariate case in section 2. With missing data and a misspecified likelihood function, little literature exists that facilitates thorough understanding of issues related to consistency. We choose this simple model with enough details so that readers having taken a course in statistics/probability can fully understand the development. We will also state the results for consistency for a general q in section 3, but the details will be omitted. We conclude the paper by pointing out that not all pseudo MLEs are consistent.

2. Consistency in the bivariate case

Let x = (x₁, x₂)′ with

μ = E (x) = (\begin{matrix} μ_{1} \\ μ_{2} \end{matrix}) and Σ = Cov (x) = (\begin{matrix} σ_{11} & σ_{12} \\ σ_{21} & σ_{22} \end{matrix}),

(2)

where $σ_{11} = σ_{1}^{2}, σ_{22} = σ_{2}^{2}$ and σ₁₂ = σ₂₁ = σ₁σ₂ρ. A sample from x with missing values in x₂ can be represented by

\begin{matrix} x_{11}, \dots, x_{n 1}, x_{(n + 1) 1}, \dots, x_{N 1} \\ x_{12}, \dots, x_{n 2}, \end{matrix}

(3)

where x_(n+1)2, …, x_N2 are missing. The interest is to infer (2) based on (3) using the possibly wrong assumption x ~ N₂(μ,Σ). Notice that the number of cases with x_i2 missing is not controllable so that the n in (3) is a random number.

With two variables, there are a total of 4 possible observed patterns: (x_i1, x_i2), (x_i1, ), ( , x_i2) and ( , ). The sample in (3) only contains two of the four. We chose this sample because the MLEs enjoy analytical solutions and the proof of their consistency is simple enough to be understood by a broad audience. We will discuss the consistency of the MLEs with more missing data patterns and more variables in section 3.

Let

\begin{matrix} {\overset{‒}{x}}_{1} = \frac{1}{N} \sum_{i = 1}^{N} x_{i 1}, s_{11} = \frac{1}{N} \sum_{i = 1}^{N} {(x_{i 1} - {\overset{‒}{x}}_{1})}^{2}, \\ {\overset{‒}{x}}_{1 *} = \frac{1}{n} \sum_{i = 1}^{n} x_{i 1}, s_{11 *} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i 1} - {\overset{‒}{x}}_{1 *})}^{2}, \\ {\overset{‒}{x}}_{2 *} = \frac{1}{n} \sum_{i = 1}^{n} x_{i 2}, s_{22 *} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i 2} - {\overset{‒}{x}}_{2 *})}^{2}, \\ s_{21 *} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i 2} - {\overset{‒}{x}}_{2 *}) (x_{i 1} - {\overset{‒}{x}}_{1 *}), {\hat{β}}_{21} = \frac{s_{21 *}}{s_{11 *}} = \frac{Σ_{i = 1}^{n} (x_{i 2} - {\overset{‒}{x}}_{2 *}) (x_{i 1} - {\overset{‒}{x}}_{1 *})}{Σ_{i = 1}^{n} {(x_{i 1} - {\overset{‒}{x}}_{1 *})}^{2}} . \end{matrix}

Then it follows from Anderson (1957) that the MLEs of (μ₁, σ₁₁, μ₂, σ₂₂, σ₁₂) based on the normal distribution assumption by ignoring the missing data mechanism are

{\hat{μ}}_{1} = {\overset{‒}{x}}_{1}, {\hat{σ}}_{11} = s_{11},

(4a)

{\hat{μ}}_{2} = {\overset{‒}{x}}_{2 *} + {\hat{β}}_{21} ({\overset{‒}{x}}_{1} - {\overset{‒}{x}}_{1 *}),

(4b)

{\hat{σ}}_{22} = s_{22 *} + {\hat{β}}_{21}^{2} ({\hat{σ}}_{11} - s_{11 *}),

(4c)

{\hat{σ}}_{12} = {\hat{β}}_{21} {\hat{σ}}_{11} .

(4d)
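The closed-form estimates in (4a) to (4d) are easy to compute. The sketch below is one possible transcription, assuming the data are supplied as two equal-length lists with `None` marking a missing x_i2; the function name and the returned dictionary keys are hypothetical.

```python
def bivariate_pseudo_mle(x1, x2):
    """Normal-distribution-based estimates (4a)-(4d) for the sample in (3).

    x1: list of N observed values of the first variable.
    x2: list of N entries; None where the second variable is missing.
    """
    N = len(x1)
    complete = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(complete)

    # Statistics based on all N cases (first variable only).
    xbar1 = sum(x1) / N
    s11 = sum((a - xbar1) ** 2 for a in x1) / N

    # Statistics based on the n complete cases (starred quantities).
    xbar1s = sum(a for a, _ in complete) / n
    xbar2s = sum(b for _, b in complete) / n
    s11s = sum((a - xbar1s) ** 2 for a, _ in complete) / n
    s22s = sum((b - xbar2s) ** 2 for _, b in complete) / n
    s21s = sum((b - xbar2s) * (a - xbar1s) for a, b in complete) / n
    beta21 = s21s / s11s

    return {
        "mu1": xbar1,                                   # (4a)
        "s11": s11,                                     # (4a)
        "mu2": xbar2s + beta21 * (xbar1 - xbar1s),      # (4b)
        "s22": s22s + beta21 ** 2 * (s11 - s11s),       # (4c)
        "s12": beta21 * s11,                            # (4d)
    }
```

For instance, with `x1 = [1, 2, 3, 4]` and `x2 = [2, 4, 6, None]` the complete cases satisfy x₂ = 2x₁ exactly, so the regression adjustment recovers the mean and variance that x₂ would have had if the fourth value had been observed on the same line.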

Through the work of Rubin (1976) and others, it is widely known that ${\hat{μ}}_{2}$ , ${\hat{σ}}_{22}$ , ${\hat{σ}}_{12}$ are consistent when all the missing data in (3) are MCAR or MAR and x ~ N₂(μ,Σ). In the following, we will study the consistency of ${\hat{μ}}_{2}$ , ${\hat{σ}}_{22}$ and ${\hat{σ}}_{12}$ using (4) when x does not follow the bivariate normal distribution and the missing x_i2's in (3) are MAR. For such a purpose, we assume that the population for the data in (3) is

x_{1} = μ_{1} + σ_{1} z_{1}, x_{2} = μ_{2} + σ_{2} [ρ z_{1} + {(1 - ρ^{2})}^{1 ∕ 2} z_{2}],

(5)

where z₁ and z₂ are independent with E(z₁) = E(z₂) = 0 and Var(z₁) = Var(z₂) = 1. Clearly, (5) follows a bivariate normal distribution only when z₁ ~ N(0, 1) and z₂ ~ N(0, 1). However, the population mean vector and variance-covariance matrix of x = (x₁, x₂)′ remain the same regardless of the distributions of z₁ and z₂.

Corresponding to (3) there exist independent random variables z_i1 and z_i2 such that

x_{i 1} = μ_{1} + σ_{1} z_{i 1}, x_{i 2} = μ_{2} + σ_{2} [ρ z_{i 1} + {(1 - ρ^{2})}^{1 ∕ 2} z_{i 2}] .

Let

\begin{matrix} {\overset{‒}{z}}_{1 *} = \frac{1}{n} \sum_{i = 1}^{n} z_{i 1}, {\overset{‒}{z}}_{2 *} = \frac{1}{n} \sum_{i = 1}^{n} z_{i 2}, s_{z 11 *} = \frac{1}{n} \sum_{i = 1}^{n} {(z_{i 1} - {\overset{‒}{z}}_{1 *})}^{2}, \\ s_{z 21 *} = \frac{1}{n} \sum_{i = 1}^{n} (z_{i 2} - {\overset{‒}{z}}_{2 *}) (z_{i 1} - {\overset{‒}{z}}_{1 *}), s_{z 22 *} = \frac{1}{n} \sum_{i = 1}^{n} {(z_{i 2} - {\overset{‒}{z}}_{2 *})}^{2} . \end{matrix}

Then

{\overset{‒}{x}}_{1 *} = μ_{1} + σ_{1} {\overset{‒}{z}}_{1 *}, s_{11 *} = σ_{11} s_{z 11 *}, s_{21 *} = σ_{2} σ_{1} [ρ s_{z 11 *} + {(1 - ρ^{2})}^{1 ∕ 2} s_{z 21 *}],

(6a)

{\overset{‒}{x}}_{2 *} = μ_{2} + σ_{2} [ρ {\overset{‒}{z}}_{1 *} + {(1 - ρ^{2})}^{1 ∕ 2} {\overset{‒}{z}}_{2 *}], s_{22 *} = σ_{22} [ρ^{2} s_{z 11 *} + 2 ρ {(1 - ρ^{2})}^{1 ∕ 2} s_{z 21 *} + (1 - ρ^{2}) s_{z 22 *}] .

(6b)

The equations in (6) allow us to obtain the probability limits of ${\overset{‒}{x}}_{j *}$ and s_jk* through those of ${\overset{‒}{z}}_{j *}$ and s_zjk*, which further lead to consistency of ${\hat{μ}}_{2}$ , ${\hat{σ}}_{12}$ and ${\hat{σ}}_{22}$ in (4).

We also need to connect the observations in (3) to the MAR mechanism. Let r_i = 1 if the x_i2 in (3) is observed and r_i = 0 if the x_i2 in (3) is missing. Because x_i1 and z_i1 are uniquely determined by each other, we can rewrite the MAR mechanism in equation (1) as

P (r_{i} = 0 ∣ x_{i 1}, x_{i 2}) = P (r_{i} = 0 ∣ x_{i 1}) = P (r_{i} = 0 ∣ z_{i 1}) = g (x_{i 1}, γ) = h (z_{i 1}),

(7)

where the parameter vector γ is omitted from h(·). Let the probability density functions (pdf) of z₁ and z₂ be f₁(t) and f₂(t), respectively. Then n, the number of complete cases in (3), follows the binomial distribution B(N, p_o), where, with I being an indicator function,

\begin{matrix} p_{o} = P (r_{i} = 1) = E (I_{\{r_{i} = 1\}}) = E (1 - I_{\{r_{i} = 0\}}) \\ = 1 - E [E (I_{\{r_{i} = 0\}} ∣ x_{i 1})] = 1 - E [P (r_{i} = 0 ∣ x_{i 1})] \\ = 1 - E [h (z_{i 1})] . \end{matrix}

Let t_i1 be the realized value of z_i1 and f be a generic notation for the probability distribution/density function of the involved random variables. It follows from (7) and

f (r_{i} = 1, t_{i 1}) = p_{o} f (t_{i 1} ∣ r_{i} = 1) = P (r_{i} = 1 ∣ t_{i 1}) f_{1} (t_{i 1})

that

f (t_{i 1} ∣ r_{i} = 1) = \frac{1}{p_{o}} [1 - h (t_{i 1})] f_{1} (t_{i 1}) .

Thus, the z_i1 corresponding to the observed x_i2 are independent, identically distributed (iid), and each follows the distribution with pdf

f_{1 *} (t) = \frac{1}{p_{o}} [1 - h (t)] f_{1} (t) .

Notice that, due to the MAR mechanism, the missingness in (3) has nothing to do with z_i2. Each z_i2 corresponding to either the observed x_i2 or missing x_i2 still has the same distribution as z₂ ~ f₂(t).
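As a small worked illustration of p_o and f_1*, suppose (purely as an assumption for this example) that z₁ takes the values −1 and 1 with probability 1/2 each, so that E(z₁) = 0 and Var(z₁) = 1, with f₁ and f_1* read as probability mass functions. Write h₋ = h(−1) and h₊ = h(1); the numerical values 0.2 and 0.6 below are hypothetical.

```latex
\begin{aligned}
p_{o} &= 1 - E[h(z_{1})] = 1 - \tfrac{1}{2}(h_{-} + h_{+}), \\
f_{1*}(-1) &= \frac{1 - h_{-}}{2 p_{o}}, \qquad
f_{1*}(1) = \frac{1 - h_{+}}{2 p_{o}}, \\
\mu_{z1*} &= (-1)\, f_{1*}(-1) + (1)\, f_{1*}(1) = \frac{h_{-} - h_{+}}{2 p_{o}}, \\
\sigma_{z11*} &= E(u^{2}) - \mu_{z1*}^{2} = 1 - \mu_{z1*}^{2}. \\[4pt]
\text{With } h_{-} = 0.2,\ h_{+} = 0.6: \quad
p_{o} &= 0.6, \quad f_{1*}(-1) = \tfrac{2}{3}, \quad f_{1*}(1) = \tfrac{1}{3}, \quad
\mu_{z1*} = -\tfrac{1}{3}, \quad \sigma_{z11*} = \tfrac{8}{9}.
\end{aligned}
```

Even though z₁ itself is standardized, the z_i1 attached to complete cases have a nonzero mean and a variance different from 1 whenever h is not constant.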

Let the mean and variance of u ~ f_1* (t) be μ_z1* and σ_z11*. Let u₁, u₂, …, u_N be iid with pdf f_1*(t); ω₁, ω₂, …, ω_N be iid with P(ω_i = 1) = p_o and P(ω_i = 0) = 1 – p_o; and the ω_i's be independent of u_i's and z_i2's. Then n and $Σ_{i = 1}^{N} ω_{i}$ have the same distribution; $Σ_{i = 1}^{n} z_{i 1}$ and $Σ_{i = 1}^{N} ω_{i} u_{i}$ have the same distribution; $Σ_{i = 1}^{n} z_{i 1}^{2}$ and $Σ_{i = 1}^{N} ω_{i} u_{i}^{2}$ have the same distribution; $Σ_{i = 1}^{n} z_{i 1} z_{i 2}$ and $Σ_{i = 1}^{N} ω_{i} u_{i} z_{i 2}$ have the same distribution; $Σ_{i = 1}^{n} z_{i 2}$ and $Σ_{i = 1}^{N} ω_{i} z_{i 2}$ have the same distribution; $Σ_{i = 1}^{n} z_{i 2}^{2}$ and $Σ_{i = 1}^{N} ω_{i} z_{i 2}^{2}$ have the same distribution. A brief proof for the above relationships using characteristic functions is provided in Appendix A of Yuan and Lu (2008).

We are now ready to show the result of consistency. Applying the law of large numbers to the average of ω_i yields

\frac{n}{N} = \frac{1}{N} \sum_{i = 1}^{N} ω_{i} \overset{w p 1}{\to} p_{o},

where the equal sign follows from equivalence in distribution and $\overset{wp 1}{\to}$ is the notation for convergence with probability one. Repeatedly applying equivalence in distribution and the law of large numbers to the averages of ω_iu_i and ω_iz_i2, respectively, we have

{\overset{‒}{z}}_{1 *} = \frac{Σ_{i = 1}^{N} ω_{i} u_{i} ∕ N}{n ∕ N} \overset{w p 1}{\to} \frac{p_{o} E (u)}{p_{o}} = μ_{z 1 *}

(8)

and

{\overset{‒}{z}}_{2 *} = \frac{Σ_{i = 1}^{N} ω_{i} z_{i 2} ∕ N}{n ∕ N} \overset{w p 1}{\to} \frac{p_{o} E (z_{2})}{p_{o}} = 0 .

(9)

Similarly,

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} z_{i 1}^{2} = \frac{Σ_{i = 1}^{N} ω_{i} u_{i}^{2} ∕ N}{n ∕ N} \overset{w p 1}{\to} E (u^{2}); \\ \frac{1}{n} \sum_{i = 1}^{n} z_{i 1} z_{i 2} = \frac{Σ_{i = 1}^{N} ω_{i} u_{i} z_{i 2} ∕ N}{n ∕ N} \overset{w p 1}{\to} E (u) E (z_{2}) = 0; \\ \frac{1}{n} \sum_{i = 1}^{n} z_{i 2}^{2} = \frac{Σ_{i = 1}^{N} ω_{i} z_{i 2}^{2} ∕ N}{n ∕ N} \overset{w p 1}{\to} E (z_{2}^{2}) = 1 . \end{matrix}

Thus,

s_{z 11 *} = \frac{1}{n} \sum_{i = 1}^{n} z_{i 1}^{2} - {\overset{‒}{z}}_{1 *}^{2} \overset{w p 1}{\to} E (u^{2}) - μ_{z 1 *}^{2} = σ_{z 11 *};

(10)

s_{z 21 *} = \frac{1}{n} \sum_{i = 1}^{n} z_{i 1} z_{i 2} - {\overset{‒}{z}}_{1 *} {\overset{‒}{z}}_{2 *} \overset{w p 1}{\to} 0;

(11)

s_{z 22 *} = \frac{1}{n} \sum_{i = 1}^{n} z_{i 2}^{2} - {\overset{‒}{z}}_{2 *}^{2} \overset{w p 1}{\to} 1 .

(12)

It is obvious that

{\hat{μ}}_{1} = {\overset{‒}{x}}_{1} \overset{w p 1}{\to} μ_{1} and {\hat{σ}}_{11} = s_{11} \overset{w p 1}{\to} σ_{11} .

(13)

Regarding μ₁, μ₂, σ₁, σ₂ and ρ as constants, ${\overset{‒}{x}}_{1 *}$ , ${\overset{‒}{x}}_{2 *}$ , s_11*, s_21* and s_22* in (6) are just linear combinations of ${\overset{‒}{z}}_{1 *}$ , ${\overset{‒}{z}}_{2 *}$ , s_z11*, s_z21* and s_z22*, whose probability limits have already been obtained. Combining (6a), (8), (10) and (11) yields

{\overset{‒}{x}}_{1 *} \overset{w p 1}{\to} μ_{1 *} = μ_{1} + σ_{1} μ_{z 1 *}, s_{11 *} \overset{w p 1}{\to} σ_{11 *} = σ_{11} σ_{z 11 *}, and s_{21 *} \overset{w p 1}{\to} σ_{2} σ_{1} σ_{z 11 *} ρ .

(14)

Combining (6b) and (8) to (12) yields

{\overset{‒}{x}}_{2 *} \overset{w p 1}{\to} μ_{2} + σ_{2} ρ μ_{z 1 *} and s_{22 *} \overset{w p 1}{\to} σ_{22} [ρ^{2} σ_{z 11 *} + (1 - ρ^{2})] .

(15)

Thus,

{\hat{β}}_{21} = \frac{s_{21 *}}{s_{11 *}} \overset{wp 1}{\to} \frac{σ_{2}}{σ_{1}} ρ .

(16)

It follows from (4b) and (13) to (16) that

{\hat{μ}}_{2} \overset{wp 1}{\to} μ_{2} + σ_{2} ρ μ_{z 1 *} + \frac{σ_{2}}{σ_{1}} ρ [μ_{1} - (μ_{1} + σ_{1} μ_{z 1 *})] = μ_{2} .

So ${\hat{μ}}_{2}$ is consistent. It follows from (4c) and (13) to (16) that

{\hat{σ}}_{22} \overset{wp 1}{\to} σ_{22} [ρ^{2} σ_{z 11 *} + (1 - ρ^{2})] + \frac{σ_{22} ρ^{2}}{σ_{11}} (σ_{11} - σ_{11} σ_{z 11 *}) = σ_{22} .

So ${\hat{σ}}_{22}$ is also consistent. It follows from (4d), (13) and (16) that

{\hat{σ}}_{12} \overset{wp 1}{\to} \frac{σ_{2}}{σ_{1}} ρ σ_{11} = σ_{12} .

So ${\hat{σ}}_{12}$ is, again, consistent.

Notice that the g(·,·) or h(·) in (7) can be any function of the observed data. Thus, the normal-distribution-based pseudo MLEs are consistent for any MAR process.
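The result can be checked informally by simulation. The sketch below is an illustration under assumed settings, not part of the proof: z₁ is a centered exponential variable, z₂ is uniform on (−√3, √3), the missingness of x₂ follows a logistic model in x₁, and all parameter values, function names and the seed are arbitrary choices.

```python
import math
import random

def bivariate_pseudo_mle(x1, x2):
    """Estimates (4b)-(4d) plus the complete-case mean of x2 (None = missing)."""
    N = len(x1)
    pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(pairs)
    xbar1 = sum(x1) / N
    s11 = sum((a - xbar1) ** 2 for a in x1) / N
    xbar1s = sum(a for a, _ in pairs) / n
    xbar2s = sum(b for _, b in pairs) / n
    s11s = sum((a - xbar1s) ** 2 for a, _ in pairs) / n
    s22s = sum((b - xbar2s) ** 2 for _, b in pairs) / n
    s21s = sum((b - xbar2s) * (a - xbar1s) for a, b in pairs) / n
    beta21 = s21s / s11s
    return {
        "mu2": xbar2s + beta21 * (xbar1 - xbar1s),
        "s22": s22s + beta21 ** 2 * (s11 - s11s),
        "s12": beta21 * s11,
        "cc_mean2": xbar2s,  # naive mean of the observed x2 values
    }

def simulate(N, seed=12345):
    """Draw N cases from a nonnormal linear model with logistic MAR missingness."""
    rng = random.Random(seed)
    mu1, mu2, sig1, sig2, rho = 1.0, 2.0, 1.5, 2.0, 0.6  # assumed true values
    root3 = math.sqrt(3.0)
    x1, x2 = [], []
    for _ in range(N):
        z1 = rng.expovariate(1.0) - 1.0   # mean 0, variance 1, skewed
        z2 = rng.uniform(-root3, root3)   # mean 0, variance 1
        a = mu1 + sig1 * z1
        b = mu2 + sig2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        # Probability that x2 is missing depends only on the observed x1.
        p_missing = 1.0 / (1.0 + math.exp(-(-1.0 + 1.0 * a)))
        x1.append(a)
        x2.append(None if rng.random() < p_missing else b)
    return x1, x2

if __name__ == "__main__":
    est = bivariate_pseudo_mle(*simulate(200000))
    print("true (mu2, s22, s12) = (2.0, 4.0, 1.8)")
    print("pseudo MLE:", est["mu2"], est["s22"], est["s12"])
    print("complete-case mean of x2:", est["cc_mean2"])
```

Under these settings the pseudo MLEs should land close to the assumed true values μ₂ = 2, σ₂₂ = 4 and σ₁₂ = 1.8, while the complete-case mean of x₂ is pulled below μ₂ because cases with large x₁ are more likely to lose x₂.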

3. Consistency in general

Parallel to (5), let

x = {(x_{1}, x_{2}, \dots, x_{q})}^{'} = μ + Az,

(17)

where μ = (μ₁, μ₂, …, μ_q)′,

A = (\begin{matrix} a_{11} & 0 & 0 & \dots & 0 \\ a_{21} & a_{22} & 0 & \dots & 0 \\ \dots \\ a_{q 1} & a_{q 2} & a_{q 3} & \dots & a_{qq} \end{matrix})

and satisfies Σ = AA′, and z = (z₁, z₂, …, z_q)′ with z₁, z₂, …, z_q being independent and standardized random variables. Then E(x) = μ, Cov(x) = Σ and the distribution of x is determined by those of the z_j's. When z ~ N_q(0, I), x ~ N_q(μ,Σ). We do not know the distribution form of x in general even when the distributions of z_j's are known. Notice that each element x_j of x in either (5) or (17) is a linear combination of independent random components in z, which is a necessary condition for the consistency of the pseudo MLE $\hat{μ}$ and $\hat{Σ}$ .
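One concrete way to obtain a lower triangular A with Σ = AA′ is the Cholesky factorization. The sketch below is an assumption about how A might be constructed in practice (the development above only requires that A be lower triangular and satisfy Σ = AA′); the function name is hypothetical.

```python
import math

def cholesky_lower(sigma):
    """Return a lower triangular A (list of rows) with A A' = sigma.

    sigma must be a symmetric positive definite matrix given as a list of rows.
    """
    q = len(sigma)
    a = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i + 1):
            partial = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                d = sigma[i][i] - partial
                if d <= 0.0:
                    raise ValueError("sigma must be positive definite")
                a[i][j] = math.sqrt(d)
            else:
                a[i][j] = (sigma[i][j] - partial) / a[j][j]
    return a
```

For q = 2 this choice gives a₁₁ = σ₁, a₂₁ = σ₂ρ and a₂₂ = σ₂(1 − ρ²)^{1/2}, which matches the coefficients of z₁ and z₂ in the bivariate model of section 2.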

Let x₁, x₂, …, x_N be a random sample drawn from x with x_i = μ + Az_i, where x_i = (x_i1, x_i2, …, x_iq)′ and z_i = (z_i1, z_i2, …, z_iq)′. Consider a case x_i in which d variables x_ij₁, x_ij₂, …, x_{ij_d} are missing and the event is related in probability to the c values of z_il₁, z_il₂, …, z_{il_c}. Let J = min(j₁, j₂, …, j_d) and L = max(l₁, l₂, …, l_c). When L < J and (x_i1, x_i2, …, x_iL) are observed, all the information related to missing values is observed. Let r_i = (r_i1, r_i2, …, r_iq)′ be the vector with r_ij = 1 if x_ij is observed and zero otherwise, x_im = (x_ij₁, x_ij₂, …, x_{ij_d})′, and x_io be the vector of observed variables. Then

\begin{matrix} P (r_{ij} = 0 ∣ x_{io}; x_{im}) = P (r_{ij} = 0 ∣ {x_{i 1}, x_{i 2}, \dots, x_{iL}, \dots}; x_{im}) \\ = P (r_{ij} = 0 ∣ {z_{i 1}, z_{i 2}, \dots, z_{iL}, \dots}; x_{im}) \\ = P (r_{ij} = 0 ∣ {z_{i 1}, z_{i 2}, \dots, z_{iL}, \dots}) \\ = P (r_{ij} = 0 ∣ x_{io}) \\ = g_{j} (x_{io}, γ_{j}) = h_{j} (z_{iL}) . \end{matrix}

(18)

Thus, the probability of missing only depends on the observed values and the missing data mechanism is MAR (Rubin, 1976). Notice that x_iL = (x_i1, x_i2,…, x_iL)′ is a subset of x_io, and thus, the probability function g_j(x_io, γ_j) in (18) can also be written as g_j(x_iL, γ_j). Also notice that, due to a random process, the number d and the subscripts j₁, j₂, …, j_d may change from case to case while l₁, l₂, …, l_c are held constant across the sample.

Unlike the problems considered in the previous section, the MLEs do not possess analytical forms when the observed data patterns are not monotonic. So we cannot directly show that the MLEs are consistent as was done in the previous section. We cannot use the established theory of maximum likelihood as in Rubin (1976) either, because the MLEs are obtained based on an incorrect likelihood function. By showing that the normal estimating equation is unbiased at the true population values, Yuan (2009) proved that the normal-distribution-based MLEs are consistent when the missingness of x_im is due to (z_il₁, z_il₂, …, z_{il_c}) falling into certain hyper-rectangles. When (17) and (18) hold and (x_i1, x_i2, …, x_iL) are observed, using essentially the same approach as in Yuan (2009) we can show that the normal-distribution-based MLEs are consistent regardless of the distribution shape of x and the form of g_j(x_io, γ_j) in (18). The MLEs are also asymptotically normally distributed and the covariance matrix of the MLEs can be consistently estimated by a sandwich-type covariance matrix.

4. Conclusions

It has been argued that, in any statistical modeling, the distribution specification is at best only an approximation to the real world (Box, 1979). Thus, all MLEs in practice are pseudo MLEs. In the context of missing data, it is reassuring that pseudo MLEs can remain consistent when (17) and (18) hold. Note, however, that data model (17), although it includes an infinite number of nonnormal distributions, does not include nonnormal distributions created by nonlinear functions of independent random variables z₁, z₂, …, z_q. Yuan (2009) describes an example in which the MLEs are not consistent when x = (x₁, x₂)′ with x₁ = z₁ and $x_{2} = a z_{1}^{2} + b z_{2}$ , where z₁ and z₂ are independent, and x₂ is missing when z₁ falls into a certain interval. It can be shown that the MLEs also fail to be consistent for this data model when the MAR mechanism obeys either a logistic or a probit selection process. We also would like to note that pseudo MLEs based on an incorrect distributional specification other than N_q(μ,Σ) may not be consistent when missing values are MAR. Actually, even without missing data, Gourieroux et al. (1984) showed that pseudo MLEs are consistent only when the assumed distribution belongs to a quadratic exponential family.
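The nonlinear example attributed to Yuan (2009) above can also be explored by simulation. The sketch below is only an illustration under assumed settings: z₁ and z₂ are standard normal, a = b = 1, and x₂ is missing whenever z₁ > 0, which is one hypothetical choice of interval; the function names and the seed are likewise assumptions.

```python
import random

def mu2_pseudo_mle(x1, x2):
    """Normal-distribution-based estimate of mu2; x2[i] is None where missing."""
    N = len(x1)
    pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
    n = len(pairs)
    xbar1 = sum(x1) / N
    xbar1s = sum(a for a, _ in pairs) / n
    xbar2s = sum(b for _, b in pairs) / n
    s11s = sum((a - xbar1s) ** 2 for a, _ in pairs) / n
    s21s = sum((b - xbar2s) * (a - xbar1s) for a, b in pairs) / n
    # Regression-adjusted mean, as in the bivariate closed form.
    return xbar2s + (s21s / s11s) * (xbar1 - xbar1s)

def simulate_nonlinear(N, a=1.0, b=1.0, seed=2009):
    """x1 = z1, x2 = a*z1**2 + b*z2, with x2 missing when z1 > 0 (assumed interval)."""
    rng = random.Random(seed)
    x1, x2 = [], []
    for _ in range(N):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1.append(z1)
        x2.append(None if z1 > 0.0 else a * z1 * z1 + b * z2)
    return x1, x2

if __name__ == "__main__":
    x1, x2 = simulate_nonlinear(200000)
    # The population mean of x2 is a * E(z1**2) + b * E(z2) = a = 1.0.
    print("true mu2 = 1.0, pseudo MLE =", mu2_pseudo_mle(x1, x2))
```

Because x₂ depends on z₁ through z₁², the linear regression adjustment in the estimate of μ₂ extrapolates in the wrong direction, and the estimate stays far from the true mean of x₂ however large N becomes.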

For the purpose of allowing missingness to depend on all the linear combinations of the previously observed variables, we specified A as a lower triangular matrix in (17) so that (z₁, z₂, …, z_L) and (x₁, x₂, …, x_L) are determined by each other. In practice, a participant may join the study after missing a few times and then be missing again. The missingness at the later stage may depend on all the previously observed variables. We can match such a case with (17) by specifying that the rows of A that correspond to the observed variables form the upper-left part of a lower triangular matrix; the consistency result then still holds.

As with a general MAR missing data mechanism, the MAR condition in (18) cannot be tested. Without extra information beyond the observed sample, it is impossible to distinguish between MAR and NMAR mechanisms (Molenberghs et al., 2008). Similarly, the data model (17) cannot be tested either because the distribution of z is arbitrary.

Acknowledgment

We would like to thank the editor, an associate editor, and a referee for comments that led to a significant improvement of the paper.

This research was supported by Grants DA00017 and DA01070 from the National Institute on Drug Abuse and a grant from the National Natural Science Foundation of China (30870784).

References

Allison PD. Missing data. Sage; Thousand Oaks, CA: 2001.
Amemiya T. Regression analysis when the dependent variable is truncated normal. Econometrica. 1973;41:997–1016.
Anderson TW. Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association. 1957;52:200–203.
Box GEP. Robustness in the strategy of scientific model building. In: Launer RL, Wilkinson GN, editors. Robustness in statistics. Academic Press; New York: 1979. pp. 201–236.
Daniels MJ, Hogan JW. Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman & Hall; Boca Raton, Florida: 2008.
Geary RC. Testing for normality. Biometrika. 1947;34:209–242.
Gourieroux C, Monfort A, Trognon A. Pseudo maximum likelihood methods: Theory. Econometrica. 1984;52:681–700.
Heckman JJ. Sample selection bias as a specification error. Econometrica. 1979;47:153–161.
Laird NM. Missing data in longitudinal studies. Statistics in Medicine. 1988;7:305–315. doi: 10.1002/sim.4780070131.
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
Little RJA. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association. 1993;88:125–134.
Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Wiley; New York: 2002.
Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105:156–166.
Molenberghs G, Beunckens C, Sotto C, Kenward MG. Every missing not at random model has got a missing at random counterpart with equal fit. Journal of the Royal Statistical Society B. 2008;70:371–388.
Molenberghs G, Kenward MG. Missing data in clinical studies. Wiley; Chichester, England: 2007.
Rotnitzky A, Wypij D. A note on the bias of estimators with missing data. Biometrics. 1994;50:1163–1170.
Rubin DB. Inference and missing data (with discussions). Biometrika. 1976;63:581–592.
Schafer JL. Analysis of incomplete multivariate data. Chapman & Hall; London: 1997.
White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
Yuan K-H. Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis. Journal of Multivariate Analysis. 2009;100:1900–1918.
Yuan K-H, Lu L. SEM with missing data and unknown population distributions using two-stage ML: Theory and its application. Multivariate Behavioral Research. 2008;62:621–652. doi: 10.1080/00273170802490699.
