Abstract
We consider novel methods for the computation of model selection criteria in missing-data problems based on the output of the EM algorithm. The methodology is very general and can be applied to numerous situations involving incomplete data within an EM framework, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Toward this goal, we develop a class of information criteria for missing-data problems, called ICH,Q, which yields the Akaike information criterion and the Bayesian information criterion as special cases. The computation of ICH,Q requires an analytic approximation to a complicated function, called the H-function, along with output from the EM algorithm used in obtaining maximum likelihood estimates. The approximation to the H-function leads to a large class of information criteria, called ICH̃(k),Q. Theoretical properties of ICH̃(k),Q, including consistency, are investigated in detail. To eliminate the analytic approximation to the H-function, a computationally simpler approximation to ICH,Q, called ICQ, is proposed, the computation of which depends solely on the Q-function of the EM algorithm. Advantages and disadvantages of ICH̃(k),Q and ICQ are discussed and examined in detail in the context of missing-data problems. Extensive simulations are given to demonstrate the methodology and examine the small-sample and large-sample performance of ICH̃(k),Q and ICQ in missing-data problems. An AIDS data set also is presented to illustrate the proposed methodology.
Keywords: EM algorithm, H-function, Kullback–Leibler divergence, Missing data, Q-function
1. INTRODUCTION
Missing data have long been a problem in various settings, including surveys, clinical trials, and longitudinal studies. Responses and/or covariates may be missing, and methods for handling the missing data often depend on the mechanism that generated the missing values. Unless the data are missing completely at random (MCAR), a complete-case analysis can be both inefficient and biased; therefore, distributional and modeling assumptions often are made in missing-data problems, and the resulting estimates and tests may be sensitive to these assumptions. For this reason, sensitivity analyses are commonly done to check the robustness of the parameters of interest and their standard errors under different modeling schemes (see, e.g., Rubin 1977; Little 1993, 1994, 1995; Copas and Li 1997; van Steen, Molenberghs, and Thijs 2001; Verbeke, Molenberghs, Thijs, Lesaffre, and Kenward 2001; Jansen, Molenberghs, Aerts, Thijs, and van Steen 2003; Troxel, Ma, and Heitjan 2004). Although these analyses demonstrate the effect of assumptions on estimates and tests, they do not indicate which modeling strategy is best, nor do they specifically address model selection for a given class of models.
Model selection criteria typically depend on the likelihood function based on the observed data, and any sensible model selection criterion must depend on this quantity in some way. In missing-data problems, however, the observed data likelihood involves intractable multiple integration, so it is very challenging to approximate it accurately or to maximize it directly and then compute, for example, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or other model selection criteria. The EM algorithm maximizes the Q-function (formally defined in Sec. 2.1) at each iteration, avoiding direct maximization of the observed data likelihood, which typically is a more difficult function to maximize. A natural and important question is whether we can use the key components of the EM algorithm, such as the Q-function, to develop an easily computable model selection criterion.
In this article we consider a class of information-based model selection criteria, called ICH,Q, for missing-data problems. The class of model selection criteria includes AIC and BIC as special cases, as well as other model selection criteria that have been proposed in the literature, mainly for settings not involving missing data. The essential novel feature of the proposed model selection criteria is that they depend only on output from the EM algorithm for their computation. Our development is based on the fact that the observed data log-likelihood in missing-data problems can be written as a difference between two functions, the Q-function of the EM algorithm and another quantity called the H-function. The Q-function and the H-function are formally defined in Section 2.1. The Q-function can be computed solely from the EM output, but the H-function cannot; however, we show that after the H-function is analytically approximated, it then can be computed as part of the EM output, resulting in model selection criteria, ICH̃(k),Q, that depend solely on the EM output. We give a theoretical justification for ICH̃(k),Q and derive its asymptotic properties. We also consider another class of model selection criteria, ICQ, which use only the Q-function in their construction and thus omit the H-function entirely. We show that compared with ICH̃(k),Q, ICQ is an inferior approximation to ICH,Q, but it may be adequate when the fraction of missing information is small.
The rest of the article is organized as follows. In Section 2 we introduce ICH,Q, ICH̃(k),Q, and ICQ. We present three theorems characterizing consistency and asymptotic properties of ICH̃(k),Q as general model selection criteria. In Section 3 we present two extensive simulation studies, one involving missing-at-random (MAR) covariates in linear models and one involving MAR covariates in generalized linear models (GLMs). These simulations compare the finite-sample performance of ICH̃(k),Q and ICQ and examine how these criteria can be used to determine the best-fitting model from a candidate set of proposed models. In Section 3.3 we analyze a data set from a study of the relationship between acquired immune deficiency syndrome (AIDS) and the use of condoms that includes not missing-at-random (NMAR) (i.e., nonignorable) covariates as well as responses. We conclude with a discussion in Section 4.
2. EM–BASED MODEL SELECTION CRITERIA
2.1 EM Algorithm
For simplicity, we consider only an independent-type incomplete-data (ITID) model throughout the article, even though most of the development here is valid for a large class of statistical models involving missing data. Denote the observed data by Dobs = (z1,obs, …, zn,obs), the missing data by Dmis = (z1,mis, …, zn,mis), and the complete data by Dcom = (z1,com, …, zn,com), in which zi,com = (zi,mis, zi,obs) for i = 1, …, n. The ITID model assumes that zi,com and zj,com are independent for i ≠ j. Moreover, the dimensions of zi,mis and zi,obs may vary across i; for instance, in GLMs with missing covariates, some observations may have missing covariates and others may not. This kind of model structure is very general and subsumes most commonly used models, such as GLMs with missing responses and/or covariates and random-effects models (Zhu, Lee, Wei, and Zhou 2001; Ibrahim, Chen, and Lipsitz 1999, 2001).
Suppose that we want to compare a general model for the complete data, g(Dcom; θ), with the true model for the complete data, f(Dcom). The model for the complete data is the product of a model for the observed data, g(Dobs; θ), and a model for the missing data given the observed data, g(Dmis|Dobs; θ). Correspondingly, f(Dcom) = f(Dobs)f(Dmis|Dobs), where f(Dmis|Dobs) and f(Dobs) are the true models for the missing data given the observed data and for the observed data, respectively. Specifically, for the ITID model, we have
f(Dobs) = ∏ᵢ₌₁ⁿ f(zi,obs) and g(Dobs; θ) = ∏ᵢ₌₁ⁿ g(zi,obs; θ),  (1)

f(Dcom) = ∏ᵢ₌₁ⁿ f(zi,com) and g(Dcom; θ) = ∏ᵢ₌₁ⁿ g(zi,com; θ),  (2)
where f(zi,obs) and g(zi,obs; θ) denote the true and postulated models for zi,obs, and f(zi,com) and g(zi,com; θ) denote the true and postulated models for zi,com.
The EM algorithm (Dempster, Laird, and Rubin 1977) has been a popular technique for obtaining maximum likelihood (ML) estimates in missing-data problems (Little and Rubin 2002; Meng and van Dyk 1997; Ibrahim 1990; Ibrahim and Lipsitz 1996). The EM algorithm consists of two key steps as follows. At the sth step of the EM algorithm, given θ(s), the E-step involves evaluating the Q-function given by
Q(θ|θ(s)) = E[log g(Dcom; θ) | Dobs; θ(s)],  (3)
where E[·|Dobs; θ(s)] denotes the conditional expectation with respect to g(Dmis|Dobs; θ(s)). Recall that the Q-function can be written as
Q(θ|θ(s)) = log g(Dobs; θ) + H(θ|θ(s)),  (4)
where
H(θ|θ(s)) = E[log g(Dmis|Dobs; θ) | Dobs; θ(s)]  (5)
is called the H-function. The M-step is to maximize Q(θ|θ(s)) to compute θ(s+1). At EM convergence, we can obtain three byproducts: θ̂, Q(θ̂|θ̂), and samples drawn from g(Dmis|Dobs; θ̂). We use these three quantities in constructing our proposed model selection criteria in the subsequent sections.
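As a concrete illustration of these byproducts (our own sketch, not from the article), the code below runs EM in a small linear model with one normally distributed covariate that is MAR, in the spirit of the simulation setting of Section 3.1; all numerical values are illustrative choices. Because the conditional distribution of the missing covariate given the response is normal here, Q(θ|θ), H(θ|θ), and the observed data log-likelihood all have closed forms, so the decomposition log g(Dobs; θ) = Q(θ|θ) − H(θ|θ) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate: y | x ~ N(b0 + b1 x, s2), x ~ N(mu, t2), with x MAR given y
n = 200
x = rng.normal(.8, np.sqrt(.8), n)
y = .8 + .8 * x + rng.normal(0., np.sqrt(.8), n)
miss = rng.random(n) < 1. / (1. + np.exp(4.0 - y))   # missingness depends on y only
obs = ~miss

def e_step(th):
    """E-step: conditional mean/variance of each x_i given the data at th."""
    b0, b1, s2, m, t2 = th
    v = 1. / (1. / t2 + b1 ** 2 / s2)                # normal posterior variance
    mean = v * (m / t2 + b1 * (y - b0) / s2)
    return np.where(obs, x, mean), np.where(obs, 0., v)

def m_step(Ex, Vx):
    """M-step: maximize Q using the expected sufficient statistics."""
    Ex2 = Ex ** 2 + Vx
    m = Ex.mean()
    t2 = (Ex2 - 2. * m * Ex + m ** 2).mean()
    b1 = (y @ Ex - n * y.mean() * Ex.mean()) / (Ex2.sum() - n * Ex.mean() ** 2)
    b0 = y.mean() - b1 * Ex.mean()
    s2 = np.mean((y - b0 - b1 * Ex) ** 2 + b1 ** 2 * Vx)
    return np.array([b0, b1, s2, m, t2])

th = np.array([0., 0., 1., 0., 1.])
for _ in range(1000):
    th = m_step(*e_step(th))
b0, b1, s2, m, t2 = th
Ex, Vx = e_step(th)

# Q(th|th): expected complete-data log-likelihood at convergence
r2 = (y - b0 - b1 * Ex) ** 2 + b1 ** 2 * Vx
Q = np.sum(-.5 * np.log(2 * np.pi * s2) - r2 / (2 * s2)
           - .5 * np.log(2 * np.pi * t2) - ((Ex - m) ** 2 + Vx) / (2 * t2))

# H(th|th) = E[log g(D_mis | D_obs; th) | D_obs; th]: closed form for a normal posterior
H = np.sum(-.5 * np.log(2 * np.pi * Vx[miss]) - .5)

# observed-data log-likelihood: joint density for complete cases, marginal y otherwise
ll = (np.sum(-.5 * np.log(2 * np.pi * s2) - (y[obs] - b0 - b1 * x[obs]) ** 2 / (2 * s2)
             - .5 * np.log(2 * np.pi * t2) - (x[obs] - m) ** 2 / (2 * t2))
      + np.sum(-.5 * np.log(2 * np.pi * (s2 + b1 ** 2 * t2))
               - (y[miss] - b0 - b1 * m) ** 2 / (2 * (s2 + b1 ** 2 * t2))))
print(Q - H - ll)   # ~ 0: log g(D_obs; th) = Q(th|th) - H(th|th)
```

Note that the decomposition holds for any θ, not only at the ML estimate, as long as the E-step expectation is taken at the same θ.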
2.2 Development of ICH,Q
Our main interest is to develop a class of model selection criteria for missing-data problems based on the observed data likelihood g(Dobs; θ). However, some missing-data problems have very complicated observed data likelihood functions, for which g(Dobs; θ) has no closed form, so that its direct evaluation is not computationally feasible or computationally accurate. Because
log g(Dobs; θ) = Q(θ|θ) − H(θ|θ),  (6)
this suggests that we may compute g(Dobs; θ) from the EM output—namely, from the Q-function Q(θ̂|θ̂) and the H-function H(θ̂|θ̂) at EM convergence. Thus we consider the class of model selection criteria given by
ICH,Q = −2{Q(θ̂|θ̂) − H(θ̂|θ̂)} + ĉn(θ̂),  (7)
where ĉn(θ̂) is a penalty term that is a function of the data and the fitted model. Different forms of the model penalty ĉn(θ̂) lead to different criteria; for instance, when ĉn(θ̂) = 2d in (7), where d denotes the dimension of θ, we obtain the AIC of Akaike (1973), given by AIC = −2 log g(Dobs; θ̂) + 2d. When ĉn(θ̂) = d log(n), (7) reduces to the BIC of Schwarz (1978). We note that the penalty term ĉn(θ̂) is neither Q-function–based nor specific to missing-data problems; rather, it is a general penalty term chosen by the user, mimicking the penalty terms for general model selection information criteria as discussed in the literature (McQuarrie and Tsai 1998; Konishi and Kitagawa 2008).
There is a subtle computational problem with (7) in that although the Q-function is a direct byproduct of the EM output, the H-function is not a direct byproduct of the EM output. Specifically, the density g(Dmis|Dobs; θ) in the H-function does not have a closed form for many missing-data problems and typically is quite complicated, and thus the integrand of the H-function itself does not have a closed form. Thus g(Dmis|Dobs; θ) first needs an analytic approximation to allow computation of the H-function through the EM output. Once g(Dmis|Dobs; θ) is analytically approximated (i.e., the integrand of the H-function is analytically approximated), the H-function can be computed by Monte Carlo integration using samples from g(Dmis|Dobs; θ̂) at EM convergence. Samples from this density are obtained by carrying out Markov chain Monte Carlo (MCMC) methods and are direct byproducts of the Monte Carlo EM algorithm (MCEM), as discussed by Ibrahim, Lipsitz, and Chen (1999). Using these samples, we then can obtain an EM-based estimator of the approximation to the H-function, which we discuss in detail in the next section. We note that when ĉn(θ̂) = 2d, an EM-based approximation to the AIC is obtained by replacing the H-function by its estimator.
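The Monte Carlo step can be isolated in a toy computation (our own sketch): given draws from a posterior g(Dmis|Dobs; θ̂) that is exactly normal here, the H-function term is estimated by averaging the closed-form log density over the draws, and the result is plugged into (7). The quantities m, v, Q_hat, d, and n below are hypothetical placeholders, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy posterior for a single missing value: g(z_mis | z_obs; th_hat) = N(m, v)
m, v = 1.3, 0.6                      # illustrative numbers
S0 = 200_000
z = rng.normal(m, np.sqrt(v), S0)    # draws as produced by MCEM at convergence

# Monte Carlo estimate of E[log g(z_mis | z_obs; th_hat)] vs. its closed form
log_g = -.5 * np.log(2 * np.pi * v) - (z - m) ** 2 / (2 * v)
H_mc = log_g.mean()
H_exact = -.5 * np.log(2 * np.pi * v) - .5

# plug into (7): IC_{H,Q} = -2{Q(th|th) - H(th|th)} + c_n(th)
Q_hat, d, n = -250.0, 5, 100         # hypothetical EM output and model dimension
aic_like = -2 * (Q_hat - H_mc) + 2 * d           # c_n = 2d  -> AIC-type penalty
bic_like = -2 * (Q_hat - H_mc) + np.log(n) * d   # c_n = d log n -> BIC-type penalty
print(H_mc, H_exact)
```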
2.3 Approximation of g(Dmis|Dobs; θ̂) in ICH,Q
We propose a simple but useful method for approximating the H-function. In general, given the MCMC samples from g(Dmis|Dobs; θ̂) (Ibrahim, Lipsitz, and Chen 1999), we can get a Monte Carlo approximation of the integral ∫w(Dmis)g(Dmis|Dobs; θ̂) dDmis only if w(Dmis) has an analytic closed form. Although w(Dmis) = log g(Dmis|Dobs; θ̂) for H(θ̂|θ̂), g(Dmis|Dobs; θ̂) does not have a closed form for most missing-data problems.
We propose using a truncated Hermite expansion as an approximation of each g(zi,mis|zi,obs; θ̂), leading to
g̃(zi,mis; μ̂i, Σ̂i, ψi, k) = Pi(ti; ψi, k)²φ(zi,mis; μ̂i, Σ̂i)/∫Pi(t; ψi, k)²φ(t; 0, I) dt,  (8)

where ti = Σ̂i−1/2(zi,mis − μ̂i), and φ(zi,mis; μ̂i, Σ̂i) is a multivariate normal density with mean μ̂i and covariance matrix Σ̂i. In addition, μ̂i = μi(θ̂) and Σ̂i = Σi(θ̂) are the conditional mean and covariance matrix of zi,mis given zi,obs at θ̂. Here Pi(t; ψi, k) is a multivariate polynomial of order k and ψi are the coefficients of Pi(t; ψi, k). If g(zi,mis|zi,obs; θ̂) belongs to a smooth class of functions, then g̃(zi,mis; μ̂i, Σ̂i, ψi, k) approximates g(zi,mis|zi,obs; θ̂) well for even small k, say k = 1 and 2 (Gallant and Nychka 1987); for instance, if zi,mis is univariate and k = 2, then, with ti = (zi,mis − μ̂i)/σ̂i,

g̃(zi,mis; μ̂i, σ̂i², ψi, 2) = (ψi0 + ψi1ti + ψi2ti²)²φ(zi,mis; μ̂i, σ̂i²)/(ψi0² + ψi1² + 3ψi2² + 2ψi0ψi2).

If k = 0, then Pi(t; ψi, 0) is constant and g̃(zi,mis; μ̂i, Σ̂i, ψi, k) = φ(zi,mis; μ̂i, Σ̂i). It has been shown both numerically and theoretically that the truncated Hermite expansion can provide an accurate approximation to g(zi,mis|zi,obs; θ̂) as k → ∞ (Fenton and Gallant 1996). Moreover, in the truncated Hermite expansion, the multivariate normal density can be replaced by another density, such as a multivariate t, Poisson, or gamma density (Cameron and Johansson 1997; Kim 2007).
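To make the form of the expansion concrete, the sketch below (our own illustration, with arbitrary coefficients ψ) builds a univariate squared-polynomial-times-normal density with k = 2 on a grid, normalizes it numerically, and checks the normalizing constant against the closed-form moment E[(ψ0 + ψ1T + ψ2T²)²] = ψ0² + ψ1² + 3ψ2² + 2ψ0ψ2 for T ~ N(0, 1):

```python
import numpy as np

# illustrative coefficients for the order-2 polynomial (k = 0 would give P constant,
# in which case the density reduces to the plain normal)
psi0, psi1, psi2 = 1.0, 0.3, -0.2

t = np.linspace(-12., 12., 240_001)
dt = t[1] - t[0]
P = psi0 + psi1 * t + psi2 * t ** 2
phi = np.exp(-t ** 2 / 2.) / np.sqrt(2. * np.pi)    # standard normal density

dens = P ** 2 * phi
dens /= dens.sum() * dt      # numerical normalization: density now integrates to 1

# closed-form normalizer using normal moments E[T^2] = 1, E[T^4] = 3
Z = psi0 ** 2 + psi1 ** 2 + 3. * psi2 ** 2 + 2. * psi0 * psi2
print(Z, (P ** 2 * phi).sum() * dt)   # the two normalizers agree
```

Shifting by μ̂i and scaling by σ̂i then gives the approximation on the original scale of zi,mis.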
We can use g̃(zi,mis; μ̂i, Σ̂i, ψi, k) to produce a Monte Carlo estimate of H(θ̂|θ̂). The detailed steps are summarized as follows. In step 1 we draw a set of random samples, {z(s)i,mis: s = 1, …, S0}, from g(zi,mis|zi,obs; θ̂) using MCMC sampling, where S0 is a prefixed number. In step 2 we use the sample mean and covariance matrix of {z(s)i,mis: s = 1, …, S0} to approximate μ̂i and Σ̂i. In step 3, because {z(s)i,mis} are observations from g(zi,mis|zi,obs; θ̂), we can obtain estimators (e.g., ML estimators) of ψi, denoted by ψ̂i(k), for given k and i = 1, …, n. Because S0 can be arbitrarily large, we can assume that μ̂i and Σ̂i are exact and that ψ̂i(k) is the minimizer of the Kullback–Leibler divergence between g̃(zi,mis; μ̂i, Σ̂i, ψi, k) and g(zi,mis|zi,obs; θ̂), that is,

ψ̂i(k) = argminψi ∫ log{g(zi,mis|zi,obs; θ̂)/g̃(zi,mis; μ̂i, Σ̂i, ψi, k)} g(zi,mis|zi,obs; θ̂) dzi,mis.
In step 4 we calculate
H̃(k|θ̂) = Σᵢ₌₁ⁿ S0−1 Σₛ log g̃(z(s)i,mis; μ̂i, Σ̂i, ψ̂i(k), k) + o(1),  (9)

where the inner sum runs over the draws s = 1, …, S0 obtained in step 1 and the o(1) term converges to 0 as S0 → ∞. In general, the computational burden in steps 1, 2, and 4 is minimal, whereas computing ψ̂i(k) for each i can be computationally cumbersome when k is relatively large. If we set k at 0, then we can avoid the maximization in step 3.
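The four steps can be sketched for k = 0, where step 3 drops out. For the purpose of checking the arithmetic, the sketch below (our own illustration) uses exact normal draws in place of MCMC output, so the estimate can be compared with a closed-form limit; the means and standard deviations are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: stand-in for MCMC draws of z_{i,mis} for three incomplete cases
S0 = 20_000
mu_true = np.array([0.5, -1.0, 2.0])
sd_true = np.array([1.0,  0.7, 1.5])
draws = rng.normal(mu_true, sd_true, size=(S0, 3))

H_tilde = 0.0
for i in range(draws.shape[1]):
    z = draws[:, i]
    m_i, v_i = z.mean(), z.var()   # Step 2: moment estimates of mu_i, Sigma_i
    # Step 3 is skipped for k = 0; Step 4: average the log of the fitted normal
    H_tilde += np.mean(-.5 * np.log(2 * np.pi * v_i) - (z - m_i) ** 2 / (2 * v_i))

# for exactly normal posteriors, H-tilde(0) approaches sum_i E[log phi(z; mu_i, sd_i^2)]
H_limit = np.sum(-.5 * np.log(2 * np.pi * sd_true ** 2) - .5)
print(H_tilde, H_limit)
```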
Based on H̃(k|θ̂), we can obtain an approximation of ICH,Q as
ICH̃(k),Q = −2{Q(θ̂|θ̂) − H̃(k|θ̂)} + ĉn(θ̂).  (10)
Moreover, because H̃(k|θ̂) ≤ H(θ̂|θ̂) according to Jensen’s inequality, ICH̃(k),Q ≤ ICH,Q. Although H̃(k|θ̂) converges to H(θ̂|θ̂) as k → ∞, choosing a large k is computationally inefficient. Moreover, we observe that H̃(k|θ̂) based on a small k, say 0 or 1, also can produce reasonable results, as shown in Section 3. Thus this Hermite approximation for g(zi,mis|zi,obs; θ̂) is quite attractive, because model choice is quite robust with respect to the choice of k.
2.4 General Theoretical Development for ICH̃(k),Q
Here we present a formal theoretical development for ICH̃(k),Q, which was defined in the previous section. We define
g̃(k)(Dobs; θ1, θ2) = exp{E[log g(Dcom; θ1)|Dobs; θ2] − E[log g̃(Dmis|Dobs; k, θ1)|Dobs; θ2]}  (11)

as an approximation to g(Dobs; θ1), where E[·|Dobs; θ2] denotes the conditional expectation taken with respect to g(Dmis|Dobs; θ2) and g̃(Dmis|Dobs; k, θ1) denotes the truncated Hermite approximation to g(Dmis|Dobs; θ1) of Section 2.3. As k → ∞, it can be shown that under some conditions, g̃(Dmis|Dobs; k, θ1) converges to g(Dmis|Dobs; θ1), and thus g̃(k)(Dobs; θ1, θ2) converges to g(Dobs; θ1). To develop a general class of model selection criteria, we consider the Kullback–Leibler divergence between g̃(k)(Dobs; θ1, θ2) and f(Dobs), defined by
K(θ1, θ2) = −∫ log τ(Dobs; θ1, θ2) f(Dobs) dDobs = ∫ [log f(Dobs) − log g̃(k)(Dobs; θ1, θ2)] f(Dobs) dDobs,  (12)
where τ(Dobs; θ1, θ2) = g̃(k)(Dobs; θ1, θ2)/f(Dobs). The quantity K(θ, θ) is an overall measure of the goodness of fit of g̃(k)(Dobs; θ, θ) relative to f(Dobs). Because the first term in (12) is independent of any fitted model and can be ignored, our goal of selecting a model can be accomplished using the second term of (12).
If g(Dobs; θ) is specified correctly, then θ̂ is asymptotically efficient, and the likelihood ratio statistic is a most sensitive criterion for detecting deviations of the model parameters from their true values. Even when g(Dobs; θ) is misspecified, however, White (1994) established consistency and asymptotic normality of θ̂ under some conditions. Thus it is desirable to evaluate K(θ̂, θ★). A simple estimator of K(θ̂, θ★) is obtained by substituting the empirical distribution function F̂obs for the distribution of Dobs, denoted by Fobs. Thus, except for a constant, K(θ̂, θ★) can be approximated by −K̃(k)(θ̂, θ★), where K̃(k)(θ1, θ2) = log g̃(k)(Dobs; θ1, θ2).
We obtain the following theorems, whose detailed proofs are given in Appendix A. The following conditions are needed to facilitate development of our methods, although they may not be the weakest possible conditions. Even though g(Dobs; θ) may be misspecified, the ML estimator, θ̂, converges to the θn* that minimizes −n−1 Σᵢ₌₁ⁿ E[ℓ(zi,obs; θ)], where ℓ(zi,obs; θ) = log g(zi,obs; θ) (see, e.g., White 1994). For simplicity, we further assume that θn* = θ* for all n and E{∂θℓ(zi,obs; θ*)} = 0 for all i. The conditions are as follows:
(C1) θ* is unique and an interior point of Θ, where Θ is a compact set in Rp.
(C2) θ̂ → θ* in probability as n → ∞.
(C3) For all i, ℓ(zi,obs; θ) is three times continuously differentiable on θ, and |∂jℓ(zi,obs; θ)|2 and |∂j∂j′∂lℓ(zi,obs; θ)| are dominated by Bi(zi,obs) for all j, j′, l = 1, …, d, where ∂j = ∂/∂θj. The same smoothness condition also holds for h(k)(zi,obs; θ) = E[log g̃(zi,mis|zi,obs; k, θ)|zi,obs; θ].
(C4) For each ε > 0, there exists a finite constant C such that

n−1 Σᵢ₌₁ⁿ E[Bi(zi,obs)1{Bi(zi,obs) > C}] ≤ ε

for all n, where 1{Bi(zi,obs) > C} is the indicator function of Bi(zi,obs) > C.
(C5)
and
where A(θ*) is positive definite.
Condition (C1) defines the uniqueness of the “true” parameter value. Condition (C2) is the consistency of θ̂. Condition (C3) is a smoothness condition on ℓ(zi,obs; θ) and h(k)(zi,obs; θ). Condition (C4) is a standard Lindeberg condition, and (C5) can easily be verified using the law of large numbers.
Theorem 1
For ITID models, if conditions (C1), (C2), and (C3) hold true, then
n−1K̃(k)(θ̂, θ★) − n−1E[K̃(k)(θ̂, θ★)] → 0  (13)
in probability, where E[K̃(k)(·, ·)] denotes the expectation with respect to the observed data, E[K̃(k)(θ̂, θ★)] denotes E[K̃(k)(θ, θ★)] evaluated at θ = θ̂, and θ* is the pseudo true value of θ based on g(Dobs; θ).
Theorem 1 indicates that n−1K̃(k)(θ̂, θ★) is a consistent estimator of n−1E[K̃(k)(θ*, θ★)]. Now consider the situation in which we want to compare values of K̃(k)(θ̂, θ★) under different models for g(Dcom; θ). Although n−1K̃(k)(θ̂, θ★) is a consistent estimator of n−1E[K̃(k)(θ̂, θ★)], it is an overestimate of n−1E[K̃(k)(θ̂, θ★)], because the same data are used to estimate θ and to approximate Fobs. Following Akaike (1973) and Konishi and Kitagawa (2008), we calculate the bias of n−1K̃(k)(θ̂, θ★) in estimating n−1E[K̃(k)(θ̂, θ★)] as
b(θ★) = EDobs{K̃(k)(θ̂, θ★) − E[K̃(k)(θ̂, θ★)]},  (14)
where EDobs denotes the expectation taken with respect to the observed data. Although it may be difficult to calculate the explicit form of b(θ★), we can derive an asymptotic bias expression, denoted b1(θ★).
Theorem 2
For ITID models, if conditions (C1)–(C5) are true, then the asymptotic bias of K̃(k)(θ̂, θ★) in estimating E[K̃(k)(θ̂, θ★)] is given by
b1(θ★) = 2 tr{A(θ*)−1B(θ*|θ★)},  (15)
where A(θ) and B(θ|θ★) are defined in condition (C5).
Theorem 2 provides a theoretical basis for using −2K̃(k)(θ̂, θ★) + b(θ★) as a model selection criterion, and this quantity is precisely a bias-corrected estimate of −2EDobs[K̃(k)(θ̂, θ★)]. In particular, if θ★ = θ* and g(Dobs; θ) is specified correctly, then A(θ*) − B(θ*|θ*) converges to a zero matrix and b(θ*) ≈ 2d as k → ∞. But because θ★ is unknown, we replace θ* and θ★ by θ̂. In particular, under the correct specification of g(Dobs; θ), b(θ̂) should be close to 2d for large k. This leads to an approximation to the AIC as AICH̃(k),Q = −2K̃(k)(θ̂, θ̂) + 2d.
We now establish sufficient conditions to ensure consistency of ICH̃(k),Q. Following Nishii (1988), we consider two parametric models for the complete data, with densities given by
gt(Dcom; θ(t)) = ∏ᵢ₌₁ⁿ gt(zi,com; θ(t)), θ(t) ∈ Θt ⊂ Rdt,  (16)

for t = 1, 2; we denote these two models by ℳ1 and ℳ2. For each ℳt, the ML estimator θ̂(t) converges in probability to the pseudo true value, denoted by θ*(t). To select a better model, we first calculate
dICH̃(k),Q21 = ICH̃(k),Q(ℳ1) − ICH̃(k),Q(ℳ2).  (17)

We choose ℳ2 if dICH̃(k),Q21 > 0 and ℳ1 otherwise. Define

δ21,k = E[Q(θ*(2)|θ*(2)) − H̃(k|θ*(2))] − E[Q(θ*(1)|θ*(1)) − H̃(k|θ*(1))],

where the Q-function and H̃-function in each term are computed under the corresponding model, and δc21 = ĉn(θ̂(2)) − ĉn(θ̂(1)). Moreover, without loss of generality, we assume that d2 > d1 and ĉn(θ̂(2)) > ĉn(θ̂(1)); for instance, if ĉn(θ̂(2)) = d2 log(n), then δc21 = (d2 − d1) log(n).
Theorem 3
Suppose that ℳ1 and ℳ2 are ITID models and satisfy conditions (C1)–(C5). We then have the following results:
a. If lim infn n−1δ21,k > 0 and δc21 = op(n), then dICH̃(k),Q21 > 0 in probability.

b. Assume that n−1/2δ21,k = O(1), that n−1/2{H̃(k|θ̂(t)) − E[H̃(k|θ̂(t))]} = Op(1), and that n−1/2{Q(θ̂(t)|θ̂(t)) − E[Q(θ*(t)|θ̂(t))]} = Op(1) for t = 1, 2. Then dICH̃(k),Q21 ≤ 0 in probability as n−1/2δc21 → ∞.

c. Assume that Q(θ*(2)|θ̂(2)) − Q(θ*(1)|θ̂(1)) = Op(1) and H̃(k|θ̂(2)) − H̃(k|θ̂(1)) = Op(1). Then dICH̃(k),Q21 ≤ 0 in probability as δc21 → ∞.
Theorem 3 has some important implications. Theorem 3a indicates that ICH̃(k),Q chooses ℳ2 when lim infn n−1δ21,k > 0 and δc21 = op(n). Generally, the most commonly used penalties ĉn(θ̂), such as 2d, d log(n), and d log log(n) (d > 0), all satisfy the condition δc21 = op(n) (Nishii 1988). The condition lim infn n−1δ21,k > 0 ensures that ICH̃(k),Q chooses a model with large E[Q(θ*|θ*) − H̃(k|θ*)]. If ℳ1 and ℳ2 have the same average n−1E[Q(θ*|θ*) − H̃(k|θ*)] (i.e., lim infn n−1δ21,k = 0), then Theorems 3b and 3c indicate that ICH̃(k),Q picks out the “simpler” ℳ1 when δc21 increases to ∞ at a certain rate [e.g., log(n)]. But ĉn(θ̂) = 2d does not satisfy this condition; thus, because ICH̃(k),Q with ĉn(θ̂) = 2d is the EM-based estimate of the AIC, it tends to overfit the data in this scenario.
2.5 Using ICH̃(k),Q in the Presence of Nonignorable Missing Data
Although our model selection criteria ICH̃(k),Q are quite general and can be used with MAR or NMAR covariate and/or response data, here we offer some caution and advice on using these criteria with NMAR data. First, it is often argued that in missing-data problems, there is little information in the data regarding the form of the missing-data mechanism, and the parametric assumption of the missing-data mechanism itself is not “testable” from the data. Thus nonignorable modeling should be viewed as a sensitivity analysis involving a more complicated model. In this sense, it is dangerous to use any model selection criterion to directly compare MAR and NMAR models. Formally, we give the following guidelines on using ICH̃(k),Q:
1. ICH̃(k),Q should be used to choose among a family of MAR models and/or to choose among a family of NMAR models. It should not be used to choose among an aggregate set of MAR and NMAR models, nor should it be used to judge the fit of MAR models versus NMAR models.

2. Once the best MAR model and the best NMAR model are found using step 1, further sensitivity analyses can be done on those two models to examine changes in estimates of the main regression coefficients of interest in the sampling model. These sensitivity analyses can be carried out by examining estimates of the regression coefficients of the sampling model under several different parametric forms of the missing-data mechanism.
2.6 ICQ
Because the analytic approximation to the integrand of the H-function and its computation may be cumbersome for large k, it also might be desirable to obtain a model selection criterion that does not involve the H-function and whose components depend only on quantities obtained directly from the EM output. Toward this goal, we can obtain such a criterion by dropping H(θ̂|θ̂) from (7), leading to the criterion
ICQ = −2Q(θ̂|θ̂) + ĉn(θ̂).  (18)
Thus ICQ can be viewed as a crude approximation to ICH,Q in which H(θ̂|θ̂) is omitted. When ĉn(θ̂) = 2d in (18), this leads to the criterion
AICQ = −2Q(θ̂|θ̂) + 2d.  (19)
There are clear advantages and disadvantages to using ICQ instead of ICH̃(k),Q. One advantage of using ICQ is that it is computationally easier than ICH̃(k),Q, not requiring an approximation to the integrand of the H-function. But one clear disadvantage of ICQ is that as a result of omitting the H-function, a model selection criterion based on the Q-function alone can overstate the amount of information in the missing data compared with the observed data log-likelihood function. Omitting the H-function can lead to a criterion with poor model selection properties in some cases, especially when the missing-data fraction is high. In general, we recommend using ICH̃(k),Q over ICQ.
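Computationally, ICQ is a one-line function of the EM output. A minimal sketch (our own, with hypothetical Q-function values and model dimensions) is:

```python
import numpy as np

def ic_q(q_hat, d, n, penalty="bic"):
    """IC_Q = -2 Q(th|th) + c_n(th), with c_n = 2d (AIC-type) or d log n (BIC-type)."""
    c_n = 2 * d if penalty == "aic" else d * np.log(n)
    return -2 * q_hat + c_n

# hypothetical Q-function values at EM convergence for two candidate models
ic1 = ic_q(-241.3, d=5, n=100)   # smaller model
ic2 = ic_q(-239.8, d=6, n=100)   # larger model with a slightly better fit
print(ic1, ic2)                  # choose the model with the smaller IC_Q
```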
3. SIMULATION STUDIES
In this section we report on several simulation studies used to investigate the finite-sample performance of ICH̃(k),Q and ICQ in linear models and GLMs with MAR covariates. More specifically, we demonstrate how ICH̃(k),Q and ICQ can be used as model selection criteria for choosing the best-fitting model. In the simulation for the linear model with MAR and normally distributed covariates (Sec. 3.1), g(zi,mis|zi,obs; θ̂) has a closed form, and thus ICH,Q has an analytic closed form, so neither the Hermite approximation nor MCMC sampling is needed in this case. Therefore, we can assess the performance of the approximation in this setting by comparing {ICH̃(k),Q: k = 0, 1} with ICH,Q, which is analytically equivalent to the AIC in this case when ĉn(θ̂) = 2d. We also compare ICQ with ICH,Q.
But for the GLM with MAR covariates, neither ICH,Q nor g(zi,mis|zi,obs; θ̂) has a closed form, and thus both the Hermite approximation and MCMC sampling are needed to compute ICH,Q. In this setting, we do not attempt to compute the AIC or BIC directly using Laplace approximations or numerical integration techniques, because these methods are quite cumbersome to implement and, more important, the accuracy of the resulting approximations is very difficult to assess. Thus for GLMs, we compute only {ICH̃(k),Q: k = 0, 1} and ICQ through the MCEM algorithm under several values of ĉn(θ̂).
3.1 Missing-at-Random Covariates in Linear Models
We generated simulated data sets from a linear regression model with one MAR covariate. This simulation study had three goals: (i) to demonstrate how ICH̃(k),Q for different k can be used as a tool for selecting a model from a candidate set of proposed models and to evaluate and compare these criteria with ICH,Q, (ii) to compare ICQ with ICH,Q, and (iii) to compare the performance of ICQ with ICH̃(k),Q. To save space, we focus on ĉn(θ̂) = 2d throughout, although additional simulation results are available for other values of ĉn(θ̂), including ĉn(θ̂) = d log(n).
Consider the true model yi|xi ~ N(β0 + β1xi, σ2), where xi ~ N(μ, τ2) for i = 1, …, n. We generated the data set {(xi, yi): i = 1, …, n} as follows. First, we generated n independent random variables xi from a N(μ, τ2) distribution and then generated independent responses yi from a N(β0 + β1xi, σ2) distribution. We then generated n independent standard normally distributed variables zi that are independent of yi and xi. The true parameter values were taken to be β0 = .8, β1 = .8, σ2 = .8, μ = .8, τ2 = .8, and n = 100, 300, 500.
Furthermore, we assume that the response yi and the additional covariate zi are completely observed for i = 1, …, n, but the covariate xi can be missing for some cases. We note that because zi is fully observed for all cases, we need not specify a covariate distribution for zi in the modeling strategy, but a covariate distribution for xi must be specified, because xi is missing for some cases. The missing-data mechanism for the xi ’s is defined as follows. We let ri = 1 if xi is missing and ri = 0 if xi is observed. Then the following logistic regression model is considered for the missing-data mechanism:
logit{pr(ri = 1|yi)} = φ0 + φ1yi,  (20)
implying MAR covariates. To investigate the effect of the missingness fraction on the performance of the model selection criteria, we consider the following sets of true parameter values for φ0 and φ1: (I) φ0 = −4.0 and φ1 = 1.0, giving an average missingness fraction for xi of roughly 11%, and (II) φ0 = −3.5 and φ1 = 1.5, giving an average missingness fraction of roughly 29%.
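The stated missingness fractions can be checked by brute force. The sketch below (our own check) assumes, consistent with those fractions, a logistic mechanism whose logit is linear in the fully observed yi:

```python
import numpy as np

rng = np.random.default_rng(3)

# generate (x, y) under the true model of Section 3.1
n = 1_000_000
x = rng.normal(.8, np.sqrt(.8), n)
y = .8 + .8 * x + rng.normal(0., np.sqrt(.8), n)

def miss_frac(phi0, phi1):
    # pr(r_i = 1 | y_i) under the logistic mechanism; MAR since x_i does not appear
    return np.mean(1. / (1. + np.exp(-(phi0 + phi1 * y))))

print(miss_frac(-4.0, 1.0), miss_frac(-3.5, 1.5))   # about .11 and .29
```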
We considered five candidate models:
Model M1 (true model): yi|xi ~ N(β0 + β1xi, σ2), xi ~ N(μ, τ2)
Model M2: yi|xi ~ N(β0, σ2), xi ~ N(μ, τ2)
Model M3: yi|xi, zi ~ N(β0 + β1xi + β2zi, σ2), xi ~ N(μ, τ2)
Model M4: yi|xi, zi ~ N(β0 + β1xi + β2xi zi, σ2), xi ~ N(μ, τ2)
Model M5: yi|xi, zi ~ N(β0 + β1xi + β2zi + β3xi zi, σ2), xi ~ N(μ, τ2).
We generated R = 500 simulated data sets from M1 and then calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d, as well as AIC ≡ ICH,Q with ĉn(θ̂) = 2d (Table 1).
Table 1. Number of times, out of R = 500 simulated data sets, that the true model M1 achieved each rank. In each panel, rows give the rank of M1 under the criterion named above the panel, and columns give the rank of M1 under AICQ for each sample size n = 100, 300, 500.

ICH,Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 331 | 25 | 4 | 0 | 329 | 17 | 1 | 0 | 325 | 14 | 1 | 0 |
| 2 | 1 | 48 | 8 | 0 | 5 | 57 | 12 | 0 | 4 | 63 | 10 | 0 |
| 3 | 0 | 1 | 49 | 3 | 1 | 3 | 53 | 2 | 0 | 3 | 50 | 3 |
| 4 | 0 | 0 | 1 | 29 | 0 | 0 | 0 | 20 | 0 | 0 | 2 | 25 |

ICH,Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 301 | 51 | 13 | 0 | 325 | 37 | 4 | 0 | 327 | 32 | 6 | 0 |
| 2 | 5 | 31 | 21 | 4 | 6 | 35 | 17 | 5 | 3 | 35 | 15 | 3 |
| 3 | 0 | 9 | 35 | 8 | 0 | 2 | 36 | 9 | 0 | 7 | 36 | 8 |
| 4 | 0 | 0 | 1 | 21 | 0 | 0 | 2 | 22 | 0 | 1 | 1 | 26 |

ICH̃(0),Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 329 | 23 | 5 | 0 | 328 | 16 | 3 | 0 | 322 | 14 | 0 | 0 |
| 2 | 3 | 50 | 8 | 1 | 6 | 57 | 12 | 0 | 7 | 59 | 10 | 1 |
| 3 | 0 | 1 | 48 | 1 | 1 | 4 | 51 | 2 | 0 | 7 | 50 | 3 |
| 4 | 0 | 0 | 1 | 30 | 0 | 0 | 0 | 20 | 0 | 0 | 3 | 24 |

ICH̃(0),Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 302 | 49 | 10 | 0 | 323 | 35 | 5 | 0 | 320 | 31 | 10 | 0 |
| 2 | 4 | 34 | 25 | 5 | 8 | 36 | 15 | 5 | 10 | 34 | 16 | 1 |
| 3 | 0 | 8 | 34 | 9 | 0 | 3 | 37 | 9 | 0 | 8 | 31 | 10 |
| 4 | 0 | 0 | 1 | 19 | 0 | 0 | 2 | 22 | 0 | 2 | 1 | 26 |

ICH̃(1),Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 325 | 25 | 5 | 0 | 323 | 19 | 2 | 0 | 310 | 21 | 1 | 0 |
| 2 | 7 | 46 | 9 | 1 | 9 | 54 | 13 | 0 | 18 | 49 | 11 | 1 |
| 3 | 0 | 3 | 47 | 2 | 3 | 3 | 50 | 3 | 1 | 10 | 49 | 4 |
| 4 | 0 | 0 | 1 | 29 | 0 | 1 | 1 | 19 | 0 | 0 | 2 | 23 |

ICH̃(1),Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 298 | 49 | 11 | 0 | 314 | 33 | 6 | 0 | 301 | 27 | 6 | 1 |
| 2 | 8 | 34 | 22 | 3 | 15 | 37 | 16 | 5 | 23 | 37 | 17 | 3 |
| 3 | 0 | 7 | 36 | 8 | 2 | 4 | 35 | 10 | 6 | 9 | 33 | 6 |
| 4 | 0 | 1 | 1 | 22 | 0 | 0 | 2 | 21 | 0 | 2 | 2 | 27 |

NOTE: Two cases of missingness fractions for xi were included, with three sample sizes, n = 100, 300, and 500, for each case. The columns represent the results from AICQ.
Table 1 shows the number of times out of R = 500 simulations that each rank was achieved for M1, the true model, for all model selection criteria. The columns in Table 1 correspond to the rankings of AICQ [AICQ ≡ ICQ when ĉn(θ̂) = 2d] under the different settings, and the rows of Table 1 correspond to the proposed criteria for different choices of ĉn(θ̂) and k. With n = 100 and case (I), M1 was ranked number one 332 = 331 + 1 times by AICQ, 360 = 331 + 25 + 4 times by AIC [ICH,Q with ĉn(θ̂) = 2d], 357 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 355 times by ICH̃(1),Q with ĉn(θ̂) = 2d. With n = 100 and case (II), M1 was ranked number one 306 times by AICQ, 364 times by ICH,Q with ĉn(θ̂) = 2d, 361 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 358 times by ICH̃(1),Q with ĉn(θ̂) = 2d. These results imply that AICQ performs reasonably well in all scenarios, but ICH,Q outperforms AICQ, particularly for large missingness fractions. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d perform as well as ICH,Q even for large missingness fractions, which is an attractive result demonstrating the suitability of the approximation. Moreover, increasing k does not seem to improve the performance of ICH̃(k),Q, demonstrating its high degree of robustness. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d outperform AICQ, particularly for large missingness fractions. Finally, we note that AIC yields very similar results to {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d.
3.2 Missing-at-Random Covariates in Generalized Linear Models
In this section we consider a logistic regression model with one continuous covariate. Our primary aim is to evaluate {ICH̃(k),Q: k = 0, 1} and ICQ and compare them with each other. In this simulation study, covariates x1, …, xn are iid and generated from a N(.5, 1.0) distribution, and responses y1, …, yn are generated independently from a Bernoulli distribution with success probability pi = exp(β0 + β1xi)/{1 + exp(β0 + β1xi)}. We also assume that y1, …, yn are completely observed, whereas x1, …, xn are MAR for some cases.
The missing data for the xi were generated according to the missing-data mechanism in (20), and the zi ’s were generated exactly as described in Section 3.1. The true parameter values were taken to be β0 = β1 = .8 and n = 100, 300, and 500. To investigate the effect of the missingness fraction on our model selection criteria, we again considered two sets of true values for φ0 and φ1: (I) φ0 = −1.2 and φ1 = −.8, giving a missingness fraction of about 15%, and (II) φ0 = −.5 and φ1 = −.8, giving a missingness fraction of about 26%.
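The data-generation scheme above can be sketched in a few lines. This is an illustrative sketch, not the authors' code; in particular, the exact form of missing-data mechanism (20) is given earlier in the article, and here we assume one plausible form, logit Pr(ri = 1) = φ0 + φ1yi, which depends only on the fully observed response (hence MAR) and is consistent with the missingness fractions of about 15% for case (I) and 26% for case (II).

```python
import numpy as np

def simulate_mar_logistic(n, beta0=0.8, beta1=0.8, phi0=-1.2, phi1=-0.8, seed=0):
    """Generate one simulated data set: Bernoulli y_i with
    logit(p_i) = beta0 + beta1 * x_i, x_i ~ N(.5, 1), and x_i set missing
    via an assumed logistic MAR mechanism logit Pr(r_i = 1) = phi0 + phi1 * y_i."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.5, 1.0, n)                      # covariate ~ N(.5, 1.0)
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # success probability
    y = rng.binomial(1, p)                           # response, fully observed
    p_mis = 1.0 / (1.0 + np.exp(-(phi0 + phi1 * y))) # assumed mechanism (20)
    r = rng.binomial(1, p_mis)                       # r_i = 1 => x_i is missing
    x_obs = np.where(r == 1, np.nan, x)
    return y, x_obs, r
```

With the default case (I) values (φ0 = −1.2, φ1 = −.8), the realized missingness fraction `r.mean()` is near .15 for large n; setting `phi0=-0.5` gives case (II), near .26.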
As in Section 3.1, we considered five candidate models:
Model M1 (true model): logit(pi) = β0 + β1xi, xi ~ N(μ, τ2)
Model M2: logit(pi) = β0, xi ~ N(μ, τ2)
Model M3: logit(pi) = β0 + β1xi + β2zi, xi ~ N(μ, τ2)
Model M4: logit(pi) = β0 + β1xi + β2zi xi, xi ~ N(μ, τ2)
Model M5: logit(pi) = β0 + β1xi + β2zi xi + β3zi, xi ~ N(μ, τ2).
We simulated 500 data sets and then calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d for each simulated data set. Table 2 shows the number of times out of R = 500 simulations that each rank was achieved for M1, the true model, by all model selection criteria. Again, the columns in Table 2 correspond to the rankings of AICQ, and the rows correspond to several settings of the proposed criteria. The results are very similar to those reported in Section 3.1. For instance, with n = 100 and case (I), M1 was ranked number one 302 times by AICQ, 319 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 317 times by ICH̃(1),Q with ĉn(θ̂) = 2d. These results imply that AICQ performs reasonably well in all scenarios, and that increasing the missing-data fraction does not strongly affect the ability of AICQ to select the true model M1. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d perform reasonably well even for large missingness fractions. Moreover, increasing k does not seem to improve the performance of ICH̃(k),Q. Again, the {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d outperform AICQ, particularly for large missingness fractions.
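The rank tabulations in Tables 1 and 2 amount to cross-classifying, over the R replications, the rank each criterion assigns to a given model. A minimal sketch (the function name and array layout are ours, purely for illustration; smaller criterion values are better):

```python
import numpy as np

def joint_rank_counts(crit_a, crit_b, model=0):
    """crit_a, crit_b: (R, 5) arrays of criterion values for 5 candidate models
    over R simulated data sets (smaller = better). Returns a 5x5 table whose
    (i, j) entry counts the simulations in which `model` is ranked i+1 by
    crit_a and j+1 by crit_b."""
    # double argsort turns values into 0-based ranks within each row
    rank_a = crit_a.argsort(axis=1).argsort(axis=1)[:, model]
    rank_b = crit_b.argsort(axis=1).argsort(axis=1)[:, model]
    table = np.zeros((5, 5), dtype=int)
    for ra, rb in zip(rank_a, rank_b):
        table[ra, rb] += 1
    return table
```

Summing a row of the table recovers the total number of times the model achieved that rank under the row criterion (e.g., 319 for rank 1 under ICH̃(0),Q in case (I), n = 100).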
Table 2. Number of times out of R = 500 simulations that each rank was achieved for M1, the true model. Rows give the rank assigned by the proposed criterion with ĉn(θ̂) = 2d; within each block, the comma-separated entries give the counts for AICQ ranks 1–5.

ICH̃(0),Q with ĉn(θ̂) = 2d, case (I):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 294, 22, 2, 1, 0 | 317, 29, 2, 1, 0 | 316, 32, 6, 0, 0
2 | 8, 70, 17, 1, 0 | 15, 53, 13, 5, 0 | 8, 50, 19, 0, 0
3 | 0, 2, 61, 4, 0 | 0, 1, 51, 11, 0 | 0, 0, 44, 9, 0
4 | 0, 0, 2, 13, 2 | 0, 0, 2, 0, 0 | 0, 0, 1, 15, 0
5 | 0, 0, 0, 1, 0 | 0, 0, 0, 1, 0 | 0, 0, 0, 0, 0

ICH̃(0),Q with ĉn(θ̂) = 2d, case (II):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 267, 34, 12, 1, 0 | 298, 46, 6, 2, 0 | 305, 47, 12, 4, 0
2 | 12, 62, 29, 5, 0 | 8, 37, 23, 3, 0 | 6, 29, 36, 3, 0
3 | 0, 2, 47, 8, 0 | 0, 5, 49, 8, 0 | 0, 6, 33, 6, 0
4 | 0, 0, 2, 15, 2 | 0, 0, 2, 13, 0 | 0, 2, 1, 10, 0
5 | 0, 0, 1, 0, 1 | 0, 0, 0, 0, 0 | 0, 0, 0, 0, 0

ICH̃(1),Q with ĉn(θ̂) = 2d, case (I):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 290, 24, 2, 1, 0 | 307, 26, 3, 0, 0 | 296, 32, 5, 1, 0
2 | 11, 68, 13, 1, 0 | 22, 49, 12, 5, 0 | 26, 44, 17, 1, 0
3 | 1, 2, 63, 2, 0 | 3, 8, 47, 11, 0 | 1, 4, 44, 11, 0
4 | 0, 0, 4, 14, 2 | 0, 0, 6, 0, 0 | 1, 2, 4, 11, 0
5 | 0, 0, 0, 2, 0 | 0, 0, 0, 12, 0 | 0, 0, 0, 0, 0

ICH̃(1),Q with ĉn(θ̂) = 2d, case (II):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 265, 35, 13, 1, 0 | 280, 37, 9, 3, 0 | 269, 45, 14, 3, 0
2 | 10, 60, 21, 4, 1 | 24, 41, 14, 3, 0 | 36, 27, 27, 6, 0
3 | 4, 3, 52, 11, 0 | 2, 10, 50, 10, 0 | 5, 10, 38, 6, 0
4 | 0, 0, 4, 11, 1 | 0, 0, 7, 10, 0 | 1, 2, 3, 8, 0
5 | 0, 0, 1, 12, 1 | 0, 0, 0, 0, 0 | 0, 0, 0, 0, 0

NOTE: Two cases of missingness fractions for the xi were included. For each case, 500 data sets were simulated at each of the sample sizes n = 100, 300, and 500. The columns represent the results from AICQ.
3.3 AIDS Data
We considered a data set from a study of the relationship between AIDS and the use of condoms (Morisky et al. 1998; Lee and Tang 2006). This complex data set requires sophisticated structural equation modeling in the presence of NMAR covariate and response data. An intriguing question is whether there is any model selection criterion for selecting the best-fitting model from a candidate set of structural equation models whose observed-data likelihood functions involve high-dimensional integrals. Directly computing AIC and BIC (e.g., using Laplace methods or high-dimensional numerical integration) is computationally prohibitive in this scenario; moreover, the accuracy of such approximations is difficult to assess in this high-dimensional setting. Thus this example strongly motivates the need for EM-based criteria, such as ICH̃(k),Q and ICQ.
For simplicity, we used only the data obtained from female sex workers in Philippine cities (Lee and Tang 2006). These data concern knowledge of AIDS and attitudes toward AIDS, beliefs, self-efficacy of condom use, and other variables. Nine variables in the original data set (items 33, 32, 31, 43, 72, 74, 27h, 27e, and 27i on the questionnaire) were taken as manifest variables in yi = (yi1, …, yi9)T; a continuous item xi1 (item 37) and an ordered categorical item xi2 (item 21, treated as continuous) were taken as covariates. The definitions of these nine items are given in Appendix B. In this data set, the variables yi1, yi2, yi3, yi7, yi8, and yi9 were measured on a 5-point scale and thus were treated as continuous; the variables yi4, yi5, and yi6 were continuous. There are n = 1,116 observations in this data set, and the manifest variables and covariates are missing at least once for 361 of them (32%). The missingness patterns for the manifest variables are shown in table 4 of Lee and Tang (2006). The covariate xi2 is completely observed.
Following Lee and Tang (2006), the manifest variables (yi1, yi2, yi3) are related to a latent variable, ηi, that can be interpreted as the “threat of AIDS,” whereas the manifest variables (yi4, yi5, yi6) and (yi7, yi8, yi9) are related to the latent variables ξi1 and ξi2, which can be interpreted as “aggressiveness of the sex worker” and “worry of contracting AIDS.” Specifically, to identify the relationship between the manifest variables yi and the latent variables ωi = (ηi, ξi1, ξi2)T, we consider the following measurement equation:
yi = μ + Λωi + εi, where μ = (μ1, …, μ9)T is a vector of intercepts, (ξi1, ξi2) ~ N(0, Φ) is independent of the measurement error vector εi ~ N(0, Ψ), Ψ = diag(ψ1, …, ψ9), and Φ = (φij) is a 2 × 2 covariance matrix. We also assume the following structure for Λ:
ΛT = (row 1: 1.0*, λ21, λ31, 0*, 0*, 0*, 0*, 0*, 0*; row 2: 0*, 0*, 0*, 1.0*, λ52, λ62, 0*, 0*, 0*; row 3: 0*, 0*, 0*, 0*, 0*, 0*, 1.0*, λ83, λ93), where 0* and 1.0* are regarded as fixed values that identify the scale of the latent factors. We let ryij = 1 if yij is missing and ryij = 0 if yij is observed, and rxi1 = 1 if xi1 is missing and rxi1 = 0 if xi1 is observed. Based on the missingness patterns, we assume that the missing-data mechanisms of both the manifest variables and the covariates are NMAR. In particular, we consider the following missing-data mechanisms for yij and xi1:
and
where τ is a vector of logistic regression coefficients, yio is the vector of observed components of yi, and ϕ = (ϕ0, ϕ1, …, ϕ9)T. Because xi1 may be missing, we need to specify its distribution. For simplicity, we assume that xi1 ~ N(0, ψx).
To study the relationship between η and (x1, x2, ξ1, ξ2), we consider four nonlinear structural equation models:

Model M0: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + δi,
Model M1: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi1² + δi,
Model M2: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi2² + δi, and
Model M3: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi1² + γ5ξi2² + δi,

where δi ~ N(0, ψδ). Clearly, all four models include the linear effects of “aggressiveness,” ξi1, and “worry,” ξi2, as well as their interaction. Models M1 and M2 add the quadratic terms of “aggressiveness” and “worry,” respectively. Because M3 includes all possible terms in ξi1 and ξi2, it may be considered the “full model.”
We calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d and d log(n) for all four models (Table 3). The calculation of {ICH̃(k),Q: k = 0, 1} and ICQ was straightforward, because it required only quantities from the output of the EM algorithm used to obtain the parameter estimates. Model M0 was selected as best by all of the model selection criteria. The ML estimates of the parameters were obtained through the MCECM algorithm; the parameter estimates for model M0 are presented in Table 4. The factor loading estimates are positive and quite large, implying a strong positive association between the latent variables and their corresponding indicators, and the estimated nonlinear structural equation is η̂i = −.0579xi1 + .0821xi2 − .2711ξi1 + .2505ξi2 + .1897ξi1ξi2. Note that comparatively large (positive) values of ηi and xi2 (or of xi1 and ξi1) and of ξi2 indicate that an individual feels a high (or low) threat from AIDS and is more worried about contracting AIDS. The foregoing equation has the following interpretation:
Table 3. Values of ICQ, ICH̃(0),Q, and ICH̃(1),Q for the four structural equation models, under the penalties ĉn = 2d and ĉn = d log(n).

Model | ICQ, ĉn = 2d | ICQ, ĉn = d log(n) | ICH̃(0),Q, ĉn = 2d | ICH̃(0),Q, ĉn = d log(n) | ICH̃(1),Q, ĉn = 2d | ICH̃(1),Q, ĉn = d log(n)
---|---|---|---|---|---|---
M0 | 34,676.19 | 34,896.96 | 32,941.28 | 30,985.59 | 35,423.52 | 33,467.84
M1 | 34,680.18 | 34,905.97 | 32,961.77 | 31,017.56 | 35,709.52 | 33,765.32
M2 | 34,689.32 | 34,915.11 | 32,964.85 | 31,014.59 | 35,626.51 | 33,676.26
M3 | 34,708.79 | 34,939.60 | 32,988.38 | 31,037.17 | 35,567.39 | 33,616.17
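Each entry of this form requires only EM output. Recalling the definition ICQ(θ̂) = −2Q(θ̂|θ̂) + ĉn(θ̂), the computation from the converged Q-function is a one-liner; a minimal sketch (the function name is ours):

```python
import math

def ic_q(q_at_mle, d, n, penalty="2d"):
    """IC_Q = -2 Q(theta_hat | theta_hat) + c_n(theta_hat), where q_at_mle is
    the EM Q-function evaluated at the ML estimate, d is the number of model
    parameters, and the penalty is c_n = 2d (penalty="2d", AIC-type) or
    c_n = d * log(n) (penalty="dlogn", BIC-type)."""
    c_n = 2 * d if penalty == "2d" else d * math.log(n)
    return -2.0 * q_at_mle + c_n
```

For the AIDS data, n = 1,116; for each candidate model, the criterion is computed from that model's EM run, and the model with the smallest value (here M0) is selected.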
Table 4. ML estimates of the parameters in model M0, with standard deviations (SD).

Parameter | ML estimate | SD | Parameter | ML estimate | SD | Parameter | ML estimate | SD
---|---|---|---|---|---|---|---|---
μ1 | 3.6362 | .0292 | ψ1 | .9405 | .0765 | λ21 | .4493 | .1124 |
μ2 | 2.5977 | .0432 | ψ2 | 2.2057 | .0931 | λ31 | .7736 | .1558 |
μ3 | 3.9725 | .0321 | ψ3 | .9525 | .0464 | λ52 | 1.6294 | .1679 |
μ4 | .0015 | .0052 | ψ4 | .8665 | .0383 | λ62 | 1.1107 | .0859 |
μ5 | .0031 | .0323 | ψ5 | .6246 | .1358 | λ83 | .4220 | .1407 |
μ6 | .0020 | .0092 | ψ6 | .8251 | .0452 | λ93 | .7358 | .1149 |
μ7 | 4.3696 | .0038 | ψ7 | .7179 | .0783 | b1 | −.0579 | .0310 |
μ8 | 3.1411 | .0431 | ψ8 | 2.0665 | .0900 | b2 | .0821 | .0290 |
μ9 | 3.7998 | .0344 | ψ9 | 1.4165 | .0865 | γ1 | −.2711 | .0679 |
φ11 | .1410 | .0210 | ψδ | .4059 | .0912 | γ2 | .2505 | .1060 |
φ12 | −.0422 | .0090 | ψx | 1.4774 | .6778 | γ3 | .1897 | .1363 |
φ22 | .3819 | .0418 |
b̂1 = −.0579 indicates that the longer sex workers are in their jobs, the less threat they feel from AIDS, and b̂2 = .0821 implies that the more they think that they know about AIDS, the more threat they feel from AIDS.
γ̂1 = −.2711 shows that the more aggressive the sex workers are, the less threat they feel from AIDS, and γ̂2 = .2505 shows that sex workers who are more worried about contracting AIDS feel more of a threat from AIDS.
γ̂3 = .1897 indicates that ξi1 and ξi2 have a positive interaction effect on “threat of AIDS.”
The foregoing analysis shows that introducing an interaction term into the nonlinear structural equation to describe the relationship between ηi and (ξi1, ξi2) is clearly warranted, and that the effects can differ across cases. The estimated correlation between “aggressiveness,” ξi1, and “worry,” ξi2, is −.1819, indicating that they are negatively correlated.
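As a quick arithmetic check, the reported correlation of −.1819 follows directly from the estimated entries of Φ in Table 4:

```python
import math

# Estimated covariance entries of Phi from Table 4
phi11, phi12, phi22 = 0.1410, -0.0422, 0.3819

# Correlation between the latent variables xi_1 ("aggressiveness") and xi_2 ("worry")
corr = phi12 / math.sqrt(phi11 * phi22)
print(round(corr, 4))  # -0.1819
```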
4. DISCUSSION
We have proposed a general class of model selection criteria, ICH̃(k),Q, for missing-data problems. ICH̃(k),Q can be computed directly from the EM output. The theory of ICH̃(k),Q is quite general and applies to the various types of missing-data models for which the EM algorithm is applicable. Moreover, ICH̃(k),Q can be applied directly to many other problems in which the ECM and ECME algorithms are applicable (Meng and Rubin 1993; Liu and Rubin 1994). We have given theoretical underpinnings for these criteria and have shown that they are consistent. We note, however, that although consistency is a desirable and interesting property, it does not shed light on how to penalize the observed-data likelihood for model parsimony in finite samples. Further research is needed to determine the best choice of penalty in missing-data problems. We have also demonstrated that the Hermite approximation to the integrand of the H-function, log(g(Dmis|Dobs; θ̂)), is quite robust for model choice over several choices of k, an attractive feature of the proposed approximation. Choices of k = 0, 1 worked as well as k = 10 and larger. This is comforting, because it shows that model choice is not sensitive to the degree of the Hermite approximation to g(Dmis|Dobs; θ̂).
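To give a concrete sense of a truncated Hermite expansion of a density, the following sketch implements a degree-k seminonparametric density in the spirit of Gallant and Nychka (1987). It illustrates the expansion family only, not the authors' exact approximation to g(Dmis|Dobs; θ̂); the function name and coefficient parameterization are ours.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial import hermite_e  # probabilists' Hermite polynomials He_j

def snp_density(x, coef):
    """Degree-k Hermite (SNP) density f(x) = P_k(x)^2 phi(x) / E_phi[P_k(Z)^2],
    where P_k(x) = sum_j coef[j] * He_j(x) and phi is the standard normal pdf.
    With coef = [1] (k = 0), f is exactly the standard normal density."""
    phi = np.exp(-0.5 * np.asarray(x) ** 2) / sqrt(2 * pi)
    poly = hermite_e.hermeval(x, coef)
    # orthogonality of He_j under phi: E_phi[He_i(Z) He_j(Z)] = j! * delta_ij
    norm = sum(a * a * factorial(j) for j, a in enumerate(coef))
    return poly ** 2 * phi / norm
```

The k = 0 case collapsing to the normal mirrors the robustness finding above: low-order truncations already capture the shape needed for model comparison.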
The penalty term ĉn(θ̂) can have a profound effect on the finite-sample performance of ICH̃(k),Q and ICQ. Compared with ĉn(θ̂) = 2d, the penalty d log(n) for ICH̃(k),Q and ICQ leads to a significant improvement in correctly identifying the true model (not presented). In light of Theorem 3, this is not surprising, because the 2d penalty tends to pick larger models. For instance, because the true model in Section 3.1 has one covariate, the d log(n) penalty is expected to outperform the 2d penalty (not presented). Furthermore, combining different degrees of approximation in the truncated Hermite expansion with different penalty terms can lead to nonlinear behavior in ICH̃(k),Q and ICQ.
The MCEM algorithm converged in a reasonable number of steps for the GLM simulation and the AIDS data set, and the Gibbs sampling followed the same steps as described by Ibrahim, Lipsitz, and Chen (1999). In the Gibbs steps of the MCEM algorithm, the Metropolis–Hastings algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953; Hastings 1970) was used to simulate observations from the complex, nonstandard conditional distributions. For the GLM and AIDS data examples, EM convergence was obtained in fewer than 50 iterations using an increasing Gibbs sample size of 2,000 within EM. Gibbs sample sizes of 5,000 and 10,000 also were used to check sensitivity to the choice of the Gibbs sample size, and the estimates were extremely robust to these choices; for example, the estimates based on Gibbs sample sizes of 2,000 and 10,000 matched to the third decimal place. In addition, values of the Gibbs sample size that changed with each EM iteration were considered. For example, at the beginning of EM, we started with 50 Gibbs samples and gradually increased the number of Gibbs samples as the EM iterations increased. The results obtained were quite similar to those obtained using a constant value of 2,000 Gibbs iterations throughout all of the EM iterations. The convergence criterion used for the EM algorithm was that the distance between the kth iteration and the (k + 1)st iteration for all of the parameters was less than 5 × 10−4. The reason for choosing such a tolerance level is the Gibbs sample size used in each iteration. We also tried a tolerance level of 10−4 when the Gibbs sample size was 10,000, and EM convergence was obtained in a similar number of iterations. We further note that if the tolerance level were chosen too small, then it would be impossible to achieve convergence due to the Monte Carlo error induced by the Gibbs sampler. Finally, we note that slightly more computing time was required for the AIDS data set than the GLM simulation.
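The MCEM scheme just described (a Monte Carlo sample size that grows across EM iterations, with convergence declared when every parameter moves less than the tolerance) can be sketched generically. Here `e_step_draws` and `m_step` are placeholders for the model-specific Gibbs/Metropolis–Hastings E-step and the M-step; this is a schematic loop under those assumptions, not the authors' code.

```python
import numpy as np

def mcem(e_step_draws, m_step, theta0, m0=50, m_max=2000, tol=5e-4, max_iter=200):
    """Generic MCEM loop. e_step_draws(theta, m) returns m Monte Carlo draws of
    the missing data given the current parameters; m_step(draws) returns the
    updated parameter vector. The Monte Carlo sample size grows from m0 toward
    m_max; iteration stops when max |theta_new - theta| < tol."""
    theta = np.asarray(theta0, dtype=float)
    m = m0
    for it in range(max_iter):
        draws = e_step_draws(theta, m)          # Monte Carlo E-step
        theta_new = np.asarray(m_step(draws))   # M-step
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new, it + 1
        theta = theta_new
        m = min(int(m * 1.5) + 1, m_max)        # grow the Gibbs sample size
    return theta, max_iter
```

As noted above, the tolerance cannot be made arbitrarily small: the Monte Carlo error of the E-step draws puts a floor under the iteration-to-iteration parameter changes, so the tolerance must be matched to the Gibbs sample size.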
Acknowledgments
The authors wish to deeply thank the editor, the associate editor, and three referees for extremely helpful comments and suggestions that have substantially improved the article. Dr. Ibrahim’s research was supported in part by National Institutes of Health grants GM 70335 and CA 74015. Dr. Zhu’s research was supported in part by National Science Foundation grant SES-06-43663 and BCS-0826844 and NIH grant 1-UL1-RR025747-01. Dr. Tang’s research was supported in part by NSFC (10561008) and NCET (NCET-07-0737).
APPENDIX A: PROOFS OF THEOREMS 1, 2, AND 3
Proof of Theorem 1
We need only show that sup(θ1,θ2)∈Θ×Θ n−1|K̃(k)(θ1, θ2) − E[K̃(k)(θ1, θ2)]| → 0 in probability and that E[K̃(k)(θ1, θ2)] is continuous in θ1 and θ2 uniformly over Θ × Θ. Conditions (C3) and (C4) are sufficient for assumption W–LIP of Andrews (1992), which ensures the continuity of E[K̃(k)(θ1, θ2)] and the stochastic equicontinuity (SE) of K̃(k)(θ1, θ2). Furthermore, conditions (C3) and (C4) ensure pointwise convergence; that is, n−1{K̃(k)(θ1, θ2) − E[K̃(k)(θ1, θ2)]} converges to 0 in probability for each θ1 and θ2. Combining SE with pointwise convergence yields Theorem 1.
Proof of Theorem 2
We prove Theorem 2 in three steps. First, we show that

(A.1)

Conditions (C1)–(C5) are sufficient for establishing (A.1) (Zhu and Zhang 2006). The second step is to obtain the stochastic expansions for K̃(k)(θ̂, θ*) and E[K̃(k)(θ̂, θ*)] as follows:

(A.2)

where Δθ̂ = θ̂ − θ*. Taking expectations yields

(A.3)

Following the same arguments as Konishi and Kitagawa (2008), we obtain

(A.4)
Proof of Theorem 3
Based on Theorem 1 and δc21 = op(n), we have
which yields Theorem 3a.
Theorem 3b can be proved by noting that n−1/2dICH̃ (k),Q21 can be written as the sum of
Note that for t = 1, 2, Q(θ̂(t)|θ̂(t)) can be written as
Because θ̂ (t) − θ*(t) = Op(n−1/2), Q(θ̂(t)|θ̂ (t)) = Q(θ*(t)|θ̂(t)) + Op(1). Thus dICH̃(k),Q21 can be written as
Theorem 3c can be proved by noting that Q(θ*(1)|θ̂(1)) − Q(θ*(2)|θ̂(2)) = Op(1) and δc21 → ∞.
APPENDIX B: SELECTED ITEMS IN THE AIDS DATA
The item numbers in the questionnaire are given in parentheses.
y1 (item 33): How worried are you about getting AIDS? not at all worried 1/2/3/4/5 extremely worried.
y2 (item 32): What are the chances that you yourself might get AIDS?
none 1/2/3/4/5 very great.
y3 (item 31): How much of a threat do you think AIDS is to the health of people?
no threat at all 1/2/3/4/5 very great.
y4 (item 43): How many times did you have vaginal sex in the last 7 days?
y5 (item 72): How many “hand jobs” did you give in the last 7 days?
y6 (item 74): How many “blow jobs” did you give in the last 7 days?
Items y7–y9 ask: How great is the risk of getting AIDS from the following activities?
y7 (item 27h): Sexual intercourse with someone you don’t know very well without using a condom.
y8 (item 27e): Sexual intercourse with someone who has the AIDS virus, using a condom.
y9 (item 27i): Sexual intercourse with someone who injects drugs.
The scale for y7, y8, and y9 is: no risk 1/2/3/4/5 great risk.
x1 (item 37): How long (in months) have you been working at a job where people pay to have sex with you?
x2 (item 21): How much do you think you know about the disease called AIDS?
nothing 1/2/3/4/5 a great deal.
Contributor Information
Joseph G. Ibrahim, Joseph G. Ibrahim is Alumni Distinguished Professor (E-mail: ibrahim@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill.
Hongtu Zhu, Hongtu Zhu is Associate Professor (E-mail: hzhu@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill.
Niansheng Tang, Niansheng Tang is Professor, Department of Statistics, Yunnan University, Kunming (E-mail: nstang@ynu.edu.cn).
References
- Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov BN, Csáki F, editors. Second International Symposium on Information Theory. Budapest: Akadémiai Kiadó; 1973. pp. 267–281.
- Andrews DWK. Generic Uniform Convergence. Econometric Theory. 1992;8:241–257.
- Cameron AC, Johansson P. Count Data Regression Using Series Expansions, With Applications. Journal of Applied Econometrics. 1997;12:203–223.
- Chen MH, Ibrahim JG, Shao QM. Propriety of the Posterior Distribution and Existence of the Maximum Likelihood Estimator for Regression Models With Covariates Missing at Random. Journal of the American Statistical Association. 2004;99:421–438.
- Copas JB, Li HG. Inference for Non-Random Samples (with discussion). Journal of the Royal Statistical Society, Ser. B. 1997;59:55–96.
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1977;39:1–38.
- Diggle PJ, Kenward MG. Informative Drop-Out in Longitudinal Data Analysis. Applied Statistics. 1994;43:49–93.
- Fenton VM, Gallant AR. Qualitative and Asymptotic Performance of SNP Density Estimators. Journal of Econometrics. 1996;74:77–118.
- Gallant AR, Nychka DW. Semi-Nonparametric Maximum Likelihood Estimation. Econometrica. 1987;55:363–390.
- Hastings WK. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970;57:97–109.
- Huang L, Chen MH, Ibrahim JG. Bayesian Analysis for Generalized Linear Models With Nonignorable Missing Covariates. Biometrics. 2005;61:729–737.
- Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association. 1990;85:765–769.
- Ibrahim JG, Lipsitz SR. Parameter Estimation From Incomplete Data in Binomial Regression When the Missing-Data Mechanism Is Nonignorable. Biometrics. 1996;52:1071–1078.
- Ibrahim JG, Chen M-H, Lipsitz SR. Monte Carlo EM for Missing Covariates in Parametric Regression Models. Biometrics. 1999;55:591–596.
- Ibrahim JG, Chen M-H, Lipsitz SR. Missing Responses in Generalised Linear Mixed Models When the Missing Data Mechanism Is Nonignorable. Biometrika. 2001;88:551–564.
- Ibrahim JG, Lipsitz SR, Chen MH. Missing Covariates in Generalized Linear Models When the Missing-Data Mechanism Is Nonignorable. Journal of the Royal Statistical Society, Ser. B. 1999;61:173–190.
- Jansen I, Molenberghs G, Aerts M, Thijs H, van Steen K. A Local Influence Approach to Binary Data From a Psychiatric Study. Biometrics. 2003;59:410–419.
- Kim JI. Uniform Convergence Rate of the Seminonparametric Density Estimator and Testing for Similarity of Two Unknown Densities. Econometrics Journal. 2007;10:1–34.
- Konishi S, Kitagawa G. Information Criteria and Statistical Modeling. New York: Springer; 2008.
- Lee SY, Tang NS. Analysis of Nonlinear Structural Equation Models With Nonignorable Missing Covariates and Ordered Categorical Data. Statistica Sinica. 2006;16:1117–1141.
- Little RJA. Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association. 1993;88:125–134.
- Little RJA. A Class of Pattern-Mixture Models for Normal Incomplete Data. Biometrika. 1994;81:471–483.
- Little RJA. Modeling the Drop-Out Mechanism in Repeated-Measures Studies. Journal of the American Statistical Association. 1995;90:1112–1121.
- Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: Wiley; 2002.
- Liu CH, Rubin DB. The ECME Algorithm: A Simple Extension of EM and ECM With Fast Monotone Convergence. Biometrika. 1994;81:633–648.
- McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. River Edge, NJ: World Scientific; 1998.
- Meng XL, Rubin DB. Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika. 1993;80:267–278.
- Meng XL, van Dyk D. The EM Algorithm: An Old Folk Song Sung to a Fast New Tune. Journal of the Royal Statistical Society, Ser. B. 1997;59:511–540.
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics. 1953;21:1087–1092.
- Morisky DE, Tiglao TV, Sneed CD, Tempongko SB, Baltazar JC, Detels R, Stein JA. The Effects of Establishment Practices, Knowledge, and Attitudes on Condom Use Among Filipina Sex Workers. AIDS Care. 1998;10:213–220.
- Nishii R. Maximum Likelihood Principle and Model Selection When the True Model Is Unspecified. Journal of Multivariate Analysis. 1988;27:392–403.
- Rubin DB. Formalizing Subjective Notions About the Effect of Non-respondents in Sample Surveys. Journal of the American Statistical Association. 1977;72:538–543.
- Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
- Troxel AB, Ma G, Heitjan DF. An Index of Local Sensitivity to Nonignorability. Statistica Sinica. 2004;14:1221–1237.
- van Steen K, Molenberghs G, Thijs H. A Local Influence Approach to Sensitivity Analysis of Incomplete Longitudinal Ordinal Data. Statistical Modelling: An International Journal. 2001;1:125–142.
- Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG. Sensitivity Analysis for Non-Random Dropout: A Local Influence Approach. Biometrics. 2001;57:43–50.
- White H. Estimation, Inference, and Specification Analysis. New York: Cambridge University Press; 1994.
- Zhu HT, Zhang HP. Asymptotics for Estimation and Testing Procedures Under Loss of Identifiability. Journal of Multivariate Analysis. 2006;97:19–45.
- Zhu HT, Lee SY, Wei BC, Zhou J. Case-Deletion Measures for Models With Incomplete Data. Biometrika. 2001;88:727–737.