Abstract
We consider novel methods for the computation of model selection criteria in missing-data problems based on the output of the EM algorithm. The methodology is very general and can be applied to numerous situations involving incomplete data within an EM framework, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Toward this goal, we develop a class of information criteria for missing-data problems, called ICH,Q, which yields the Akaike information criterion and the Bayesian information criterion as special cases. The computation of ICH,Q requires an analytic approximation to a complicated function, called the H-function, along with output from the EM algorithm used in obtaining maximum likelihood estimates. The approximation to the H-function leads to a large class of information criteria, called ICH̃(k),Q. Theoretical properties of ICH̃(k),Q, including consistency, are investigated in detail. To eliminate the analytic approximation to the H-function, a computationally simpler approximation to ICH,Q, called ICQ, is proposed, the computation of which depends solely on the Q-function of the EM algorithm. Advantages and disadvantages of ICH̃(k),Q and ICQ are discussed and examined in detail in the context of missing-data problems. Extensive simulations are given to demonstrate the methodology and examine the small-sample and large-sample performance of ICH̃(k),Q and ICQ in missing-data problems. An AIDS data set also is presented to illustrate the proposed methodology.
Keywords: EM algorithm, H-function, Kullback–Leibler divergence, Missing data, Q-function
1. INTRODUCTION
Missing data have long been a problem in various settings, including surveys, clinical trials, and longitudinal studies. Responses and/or covariates may be missing, and methods for handling the missing data often depend on the mechanism that generated the missing values. Unless the data are missing completely at random (MCAR), a complete-case analysis can be both inefficient and biased; therefore, distributional and modeling assumptions often are made in missing-data problems, and the resulting estimates and tests may be sensitive to these assumptions. For this reason, sensitivity analyses are commonly done to check the robustness of the parameters of interest and their standard errors under different modeling schemes (see, e.g., Rubin 1977; Little 1993, 1994, 1995; Copas and Li 1997; van Steen, Molenberghs, and Thijs 2001; Verbeke, Molenberghs, Thijs, Lesaffre, and Kenward 2001; Jansen, Molenberghs, Aerts, Thijs, and van Steen 2003; Troxel, Ma, and Heitjan 2004). Although these analyses demonstrate the effect of assumptions on estimates and tests, they do not indicate which modeling strategy is best, nor do they specifically address model selection for a given class of models.
Model selection criteria typically depend on the likelihood function based on the observed data, and any sensible model selection criterion must depend on this quantity in some way. In missing-data problems, however, the observed data likelihood involves intractable multiple integration, so it is very challenging to approximate it accurately or to maximize it directly and then compute, for example, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or other model selection criteria. The EM algorithm maximizes the Q-function (formally defined in Sec. 2.1) at each iteration, avoiding direct maximization of the observed data likelihood, which typically is a more difficult function to maximize. A natural and important question is whether we can use the key components of the EM algorithm, such as the Q-function, to develop an easily computable model selection criterion.
In this article we consider a class of information-based model selection criteria, called ICH,Q, for missing-data problems. The class of model selection criteria includes AIC and BIC as special cases, as well as other model selection criteria that have been proposed in the literature, mainly for settings not involving missing data. The essential novel feature of the proposed model selection criteria is that they depend only on output from the EM algorithm for their computation. Our development is based on the fact that the observed data log-likelihood in missing-data problems can be written as a difference between two functions, the Q-function of the EM algorithm and another quantity called the H-function. The Q-function and the H-function are formally defined in Section 2.1. The Q-function can be computed solely from the EM output, but the H-function cannot; however, we show that after the H-function is analytically approximated, it then can be computed as part of the EM output, resulting in model selection criteria, ICH̃(k),Q, that depend solely on the EM output. We give a theoretical justification for ICH̃(k),Q and derive its asymptotic properties. We also consider another class of model selection criteria, ICQ, which use only the Q-function in their construction and thus omit the H-function entirely. We show that compared with ICH̃(k),Q, ICQ is an inferior approximation to ICH,Q, but it may be adequate when the fraction of missing information is small.
The rest of the article is organized as follows. In Section 2 we introduce ICH,Q, ICH̃(k),Q, and ICQ. We present three theorems characterizing consistency and asymptotic properties of ICH̃(k),Q as general model selection criteria. In Section 3 we present two extensive simulation studies, one involving missing-at-random (MAR) covariates in linear models and one involving MAR covariates in generalized linear models (GLMs). These simulations compare the finite-sample performance of ICH̃(k),Q and ICQ and examine how these criteria can be used to determine the best-fitting model from a candidate set of proposed models. In Section 3.3 we analyze a data set from a study of the relationship between acquired immune deficiency syndrome (AIDS) and the use of condoms that includes not missing-at-random (NMAR) (i.e., nonignorable) covariates as well as responses. We conclude with a discussion in Section 4.
2. EM–BASED MODEL SELECTION CRITERIA
2.1 EM Algorithm
For simplicity, we consider only an independent-type incomplete-data (ITID) model throughout the article, even though most of the development here is valid for a large class of statistical models involving missing data. Denote the observed data by Dobs = (z1,obs, …, zn,obs), the missing data by Dmis = (z1,mis, …, zn,mis), and the complete data by Dcom = (z1,com, …, zn,com), in which zi,com = (zi,mis, zi,obs) for i = 1, …, n. The ITID model assumes that zi,com and zj,com are independent for i ≠ j. Moreover, the dimensions of zi,mis and zi,obs may vary across i; for instance, in GLMs with missing covariates, some observations may have missing covariates and others may not. This kind of model structure is very general and subsumes most commonly used models, such as GLMs with missing responses and/or covariates and random-effects models (Zhu, Lee, Wei, and Zhou 2001; Ibrahim, Chen, and Lipsitz 1999, 2001).
Suppose that we want to compare a general model for the complete data, g(Dcom; θ), with the true model for the complete data, f(Dcom). The model for the complete data is the product of a model for the observed data, g(Dobs; θ), and a model for the missing data given the observed data, g(Dmis|Dobs; θ). Correspondingly, f(Dcom) = f(Dobs)f(Dmis|Dobs), where f(Dmis|Dobs) and f(Dobs) are the true models for the missing data given the observed data and for the observed data, respectively. Specifically, for the ITID model, we have
f(Dobs) = ∏ᵢ₌₁ⁿ f(zi,obs) and g(Dobs; θ) = ∏ᵢ₌₁ⁿ g(zi,obs; θ),  (1)

f(Dcom) = ∏ᵢ₌₁ⁿ f(zi,com) and g(Dcom; θ) = ∏ᵢ₌₁ⁿ g(zi,com; θ),  (2)
where f(zi,obs) and g(zi,obs; θ) denote the true and postulated models for zi,obs, and f(zi,com) and g(zi,com; θ) denote the true and postulated models for zi,com.
The EM algorithm (Dempster, Laird, and Rubin 1977) has been a popular technique for obtaining maximum likelihood (ML) estimates in missing-data problems (Little and Rubin 2002; Meng and van Dyk 1997; Ibrahim 1990; Ibrahim and Lipsitz 1996). The EM algorithm consists of two key steps as follows. At the sth step of the EM algorithm, given θ(s), the E-step involves evaluating the Q-function given by
Q(θ|θ(s)) = E[log g(Dcom; θ) | Dobs; θ(s)],  (3)
where E[·|Dobs; θ(s)] denotes the conditional expectation with respect to g(Dmis|Dobs; θ(s)). Recall that the Q-function can be written as
Q(θ|θ(s)) = log g(Dobs; θ) + H(θ|θ(s)),  (4)
where
H(θ|θ(s)) = E[log g(Dmis|Dobs; θ) | Dobs; θ(s)]  (5)
is called the H-function. The M-step is to maximize Q(θ|θ(s)) to compute θ(s+1). At EM convergence, we can obtain three byproducts: θ̂, Q(θ̂|θ̂), and samples drawn from g(Dmis|Dobs; θ̂). We use these three quantities in constructing our proposed model selection criteria in the subsequent sections.
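As a concrete illustration of these byproducts (our own sketch, not from the article), the code below runs EM in a small linear model with one normally distributed covariate that is MAR, in the spirit of the simulation setting of Section 3.1; all numerical values are illustrative choices. Because the conditional distribution of the missing covariate given the response is normal here, Q(θ|θ), H(θ|θ), and the observed data log-likelihood all have closed forms, so the decomposition log g(Dobs; θ) = Q(θ|θ) − H(θ|θ) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate: y | x ~ N(b0 + b1 x, s2), x ~ N(mu, t2), with x MAR given y
n = 200
x = rng.normal(.8, np.sqrt(.8), n)
y = .8 + .8 * x + rng.normal(0., np.sqrt(.8), n)
miss = rng.random(n) < 1. / (1. + np.exp(4.0 - y))   # missingness depends on y only
obs = ~miss

def e_step(th):
    """E-step: conditional mean/variance of each x_i given the data at th."""
    b0, b1, s2, m, t2 = th
    v = 1. / (1. / t2 + b1 ** 2 / s2)                # normal posterior variance
    mean = v * (m / t2 + b1 * (y - b0) / s2)
    return np.where(obs, x, mean), np.where(obs, 0., v)

def m_step(Ex, Vx):
    """M-step: maximize Q using the expected sufficient statistics."""
    Ex2 = Ex ** 2 + Vx
    m = Ex.mean()
    t2 = (Ex2 - 2. * m * Ex + m ** 2).mean()
    b1 = (y @ Ex - n * y.mean() * Ex.mean()) / (Ex2.sum() - n * Ex.mean() ** 2)
    b0 = y.mean() - b1 * Ex.mean()
    s2 = np.mean((y - b0 - b1 * Ex) ** 2 + b1 ** 2 * Vx)
    return np.array([b0, b1, s2, m, t2])

th = np.array([0., 0., 1., 0., 1.])
for _ in range(1000):
    th = m_step(*e_step(th))
b0, b1, s2, m, t2 = th
Ex, Vx = e_step(th)

# Q(th|th): expected complete-data log-likelihood at convergence
r2 = (y - b0 - b1 * Ex) ** 2 + b1 ** 2 * Vx
Q = np.sum(-.5 * np.log(2 * np.pi * s2) - r2 / (2 * s2)
           - .5 * np.log(2 * np.pi * t2) - ((Ex - m) ** 2 + Vx) / (2 * t2))

# H(th|th) = E[log g(D_mis | D_obs; th) | D_obs; th]: closed form for a normal posterior
H = np.sum(-.5 * np.log(2 * np.pi * Vx[miss]) - .5)

# observed-data log-likelihood: joint density for complete cases, marginal y otherwise
ll = (np.sum(-.5 * np.log(2 * np.pi * s2) - (y[obs] - b0 - b1 * x[obs]) ** 2 / (2 * s2)
             - .5 * np.log(2 * np.pi * t2) - (x[obs] - m) ** 2 / (2 * t2))
      + np.sum(-.5 * np.log(2 * np.pi * (s2 + b1 ** 2 * t2))
               - (y[miss] - b0 - b1 * m) ** 2 / (2 * (s2 + b1 ** 2 * t2))))
print(Q - H - ll)   # ~ 0: log g(D_obs; th) = Q(th|th) - H(th|th)
```

Note that the decomposition holds for any θ, not only at the ML estimate, as long as the E-step expectation is taken at the same θ.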
2.2 Development of ICH,Q
Our main interest is to develop a class of model selection criteria for missing-data problems based on the observed data likelihood g(Dobs; θ). However, some missing-data problems have very complicated observed data likelihood functions, for which g(Dobs; θ) has no closed form, so that its direct evaluation is not computationally feasible or computationally accurate. Because
log g(Dobs; θ) = Q(θ|θ) − H(θ|θ),  (6)
this suggests that we may compute g(Dobs; θ) from the EM output—namely, from the Q-function Q(θ̂|θ̂) and the H-function H(θ̂|θ̂) at EM convergence. Thus we consider the class of model selection criteria given by
ICH,Q = −2{Q(θ̂|θ̂) − H(θ̂|θ̂)} + ĉn(θ̂),  (7)
where ĉn(θ̂) is a penalty term that is a function of the data and the fitted model. Different forms of the model penalty ĉn(θ̂) lead to different criteria; for instance, when ĉn(θ̂) = 2d in (7), where d denotes the dimension of θ, we obtain the AIC of Akaike (1973), given by AIC = −2 log g(Dobs; θ̂) + 2d. When ĉn(θ̂) = d log(n), (7) reduces to the BIC of Schwarz (1978). We note that the penalty term ĉn(θ̂) is neither Q-function–based nor specific to missing-data problems; rather, it is a general penalty term chosen by the user, mimicking the penalty terms for general model selection information criteria as discussed in the literature (McQuarrie and Tsai 1998; Konishi and Kitagawa 2008).
There is a subtle computational problem with (7) in that although the Q-function is a direct byproduct of the EM output, the H-function is not a direct byproduct of the EM output. Specifically, the density g(Dmis|Dobs; θ) in the H-function does not have a closed form for many missing-data problems and typically is quite complicated, and thus the integrand of the H-function itself does not have a closed form. Thus g(Dmis|Dobs; θ) first needs an analytic approximation to allow computation of the H-function through the EM output. Once g(Dmis|Dobs; θ) is analytically approximated (i.e., the integrand of the H-function is analytically approximated), the H-function can be computed by Monte Carlo integration using samples from g(Dmis|Dobs; θ̂) at EM convergence. Samples from this density are obtained by carrying out Markov chain Monte Carlo (MCMC) methods and are direct byproducts of the Monte Carlo EM algorithm (MCEM), as discussed by Ibrahim, Lipsitz, and Chen (1999). Using these samples, we then can obtain an EM-based estimator of the approximation to the H-function, which we discuss in detail in the next section. We note that when ĉn(θ̂) = 2d, an EM-based approximation to the AIC is obtained by replacing the H-function by its estimator.
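The Monte Carlo step can be isolated in a toy computation (our own sketch): given draws from a posterior g(Dmis|Dobs; θ̂) that is exactly normal here, the H-function term is estimated by averaging the closed-form log density over the draws, and the result is plugged into (7). The quantities m, v, Q_hat, d, and n below are hypothetical placeholders, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy posterior for a single missing value: g(z_mis | z_obs; th_hat) = N(m, v)
m, v = 1.3, 0.6                      # illustrative numbers
S0 = 200_000
z = rng.normal(m, np.sqrt(v), S0)    # draws as produced by MCEM at convergence

# Monte Carlo estimate of E[log g(z_mis | z_obs; th_hat)] vs. its closed form
log_g = -.5 * np.log(2 * np.pi * v) - (z - m) ** 2 / (2 * v)
H_mc = log_g.mean()
H_exact = -.5 * np.log(2 * np.pi * v) - .5

# plug into (7): IC_{H,Q} = -2{Q(th|th) - H(th|th)} + c_n(th)
Q_hat, d, n = -250.0, 5, 100         # hypothetical EM output and model dimension
aic_like = -2 * (Q_hat - H_mc) + 2 * d           # c_n = 2d  -> AIC-type penalty
bic_like = -2 * (Q_hat - H_mc) + np.log(n) * d   # c_n = d log n -> BIC-type penalty
print(H_mc, H_exact)
```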
2.3 Approximation of g(Dmis|Dobs; θ̂) in ICH,Q
We propose a simple but useful method for approximating the H-function. In general, given the MCMC samples from g(Dmis|Dobs; θ̂) (Ibrahim, Lipsitz, and Chen 1999), we can get a Monte Carlo approximation of the integral ∫w(Dmis)g(Dmis|Dobs; θ̂) dDmis only if w(Dmis) has an analytic closed form. Although w(Dmis) = log g(Dmis|Dobs; θ̂) for H(θ̂|θ̂), g(Dmis|Dobs; θ̂) does not have a closed form for most missing-data problems.
We propose using a truncated Hermite expansion as an approximation of each g(zi,mis|zi,obs; θ̂), leading to
g̃(zi,mis; μ̂i, Σ̂i, ψi, k) = Pi(ti; ψi, k)²φ(zi,mis; μ̂i, Σ̂i)/∫Pi(t; ψi, k)²φ(t; 0, I) dt,  (8)

where ti = Σ̂i−1/2(zi,mis − μ̂i), and φ(zi,mis; μ̂i, Σ̂i) is a multivariate normal density with mean μ̂i and covariance matrix Σ̂i. In addition, μ̂i = μi(θ̂) and Σ̂i = Σi(θ̂) are the conditional mean and covariance matrix of zi,mis given zi,obs at θ̂. Here Pi(t; ψi, k) is a multivariate polynomial of order k and ψi are the coefficients of Pi(t; ψi, k). If g(zi,mis|zi,obs; θ̂) belongs to a smooth class of functions, then g̃(zi,mis; μ̂i, Σ̂i, ψi, k) approximates g(zi,mis|zi,obs; θ̂) well for even small k, say k = 1 and 2 (Gallant and Nychka 1987); for instance, if zi,mis is univariate and k = 2, then, with ti = (zi,mis − μ̂i)/σ̂i,

g̃(zi,mis; μ̂i, σ̂i², ψi, 2) = (ψi0 + ψi1ti + ψi2ti²)²φ(zi,mis; μ̂i, σ̂i²)/(ψi0² + ψi1² + 3ψi2² + 2ψi0ψi2).

If k = 0, then Pi(t; ψi, 0) is constant and g̃(zi,mis; μ̂i, Σ̂i, ψi, k) = φ(zi,mis; μ̂i, Σ̂i). It has been shown both numerically and theoretically that the truncated Hermite expansion can provide an accurate approximation to g(zi,mis|zi,obs; θ̂) as k → ∞ (Fenton and Gallant 1996). Moreover, in the truncated Hermite expansion, the multivariate normal density can be replaced by another density, such as a multivariate t, Poisson, or gamma density (Cameron and Johansson 1997; Kim 2007).
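To make the form of the expansion concrete, the sketch below (our own illustration, with arbitrary coefficients ψ) builds a univariate squared-polynomial-times-normal density with k = 2 on a grid, normalizes it numerically, and checks the normalizing constant against the closed-form moment E[(ψ0 + ψ1T + ψ2T²)²] = ψ0² + ψ1² + 3ψ2² + 2ψ0ψ2 for T ~ N(0, 1):

```python
import numpy as np

# illustrative coefficients for the order-2 polynomial (k = 0 would give P constant,
# in which case the density reduces to the plain normal)
psi0, psi1, psi2 = 1.0, 0.3, -0.2

t = np.linspace(-12., 12., 240_001)
dt = t[1] - t[0]
P = psi0 + psi1 * t + psi2 * t ** 2
phi = np.exp(-t ** 2 / 2.) / np.sqrt(2. * np.pi)    # standard normal density

dens = P ** 2 * phi
dens /= dens.sum() * dt      # numerical normalization: density now integrates to 1

# closed-form normalizer using normal moments E[T^2] = 1, E[T^4] = 3
Z = psi0 ** 2 + psi1 ** 2 + 3. * psi2 ** 2 + 2. * psi0 * psi2
print(Z, (P ** 2 * phi).sum() * dt)   # the two normalizers agree
```

Shifting by μ̂i and scaling by σ̂i then gives the approximation on the original scale of zi,mis.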
We can use g̃(zi,mis; μ̂i, Σ̂i, ψi, k) to produce a Monte Carlo estimate of H(θ̂|θ̂). The detailed steps are summarized as follows. In step 1 we draw a set of random samples, {z(s)i,mis: s = 1, …, S0}, from g(zi,mis|zi,obs; θ̂) using MCMC sampling, where S0 is a prefixed number. In step 2 we use the sample mean and covariance matrix of {z(s)i,mis: s = 1, …, S0} to approximate μ̂i and Σ̂i. In step 3, because {z(s)i,mis} are observations from g(zi,mis|zi,obs; θ̂), we can obtain estimators (e.g., ML estimators) of ψi, denoted by ψ̂i(k), for given k and i = 1, …, n. Because S0 can be arbitrarily large, we can assume that μ̂i and Σ̂i are exact and that ψ̂i(k) is the minimizer of the Kullback–Leibler divergence between g̃(zi,mis; μ̂i, Σ̂i, ψi, k) and g(zi,mis|zi,obs; θ̂), that is,

ψ̂i(k) = argminψi ∫ log{g(zi,mis|zi,obs; θ̂)/g̃(zi,mis; μ̂i, Σ̂i, ψi, k)} g(zi,mis|zi,obs; θ̂) dzi,mis.
In step 4 we calculate
H̃(k|θ̂) = Σᵢ₌₁ⁿ S0−1 Σₛ log g̃(z(s)i,mis; μ̂i, Σ̂i, ψ̂i(k), k) + o(1),  (9)

where the inner sum runs over the draws s = 1, …, S0 obtained in step 1 and the o(1) term converges to 0 as S0 → ∞. In general, the computational burden in steps 1, 2, and 4 is minimal, whereas computing ψ̂i(k) for each i can be computationally cumbersome when k is relatively large. If we set k at 0, then we can avoid the maximization in step 3.
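The four steps can be sketched for k = 0, where step 3 drops out. For the purpose of checking the arithmetic, the sketch below (our own illustration) uses exact normal draws in place of MCMC output, so the estimate can be compared with a closed-form limit; the means and standard deviations are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: stand-in for MCMC draws of z_{i,mis} for three incomplete cases
S0 = 20_000
mu_true = np.array([0.5, -1.0, 2.0])
sd_true = np.array([1.0,  0.7, 1.5])
draws = rng.normal(mu_true, sd_true, size=(S0, 3))

H_tilde = 0.0
for i in range(draws.shape[1]):
    z = draws[:, i]
    m_i, v_i = z.mean(), z.var()   # Step 2: moment estimates of mu_i, Sigma_i
    # Step 3 is skipped for k = 0; Step 4: average the log of the fitted normal
    H_tilde += np.mean(-.5 * np.log(2 * np.pi * v_i) - (z - m_i) ** 2 / (2 * v_i))

# for exactly normal posteriors, H-tilde(0) approaches sum_i E[log phi(z; mu_i, sd_i^2)]
H_limit = np.sum(-.5 * np.log(2 * np.pi * sd_true ** 2) - .5)
print(H_tilde, H_limit)
```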
Based on H̃(k|θ̂), we can obtain an approximation of ICH,Q as
ICH̃(k),Q = −2{Q(θ̂|θ̂) − H̃(k|θ̂)} + ĉn(θ̂).  (10)
Moreover, because H̃(k|θ̂) ≤ H(θ̂|θ̂) according to Jensen’s inequality, ICH̃(k),Q ≤ ICH,Q. Although H̃(k|θ̂) converges to H(θ̂|θ̂) as k → ∞, choosing a large k is computationally inefficient. Moreover, we observe that H̃(k|θ̂) based on a small k, say 0 or 1, also can produce reasonable results, as shown in Section 3. Thus this Hermite approximation for g(zi,mis|zi,obs; θ̂) is quite attractive, because model choice is quite robust with respect to the choice of k.
2.4 General Theoretical Development for ICH̃(k),Q
Here we present a formal theoretical development for ICH̃(k),Q, which was defined in the previous section. We define
g̃(k)(Dobs; θ1, θ2) = exp{E[log g(Dcom; θ1)|Dobs; θ2] − E[log g̃(Dmis|Dobs; k, θ1)|Dobs; θ2]}  (11)

as an approximation to g(Dobs; θ1), where E[·|Dobs; θ2] denotes the conditional expectation taken with respect to g(Dmis|Dobs; θ2) and g̃(Dmis|Dobs; k, θ1) denotes the truncated Hermite approximation to g(Dmis|Dobs; θ1) of Section 2.3. As k → ∞, it can be shown that under some conditions, g̃(Dmis|Dobs; k, θ1) converges to g(Dmis|Dobs; θ1), and thus g̃(k)(Dobs; θ1, θ2) converges to g(Dobs; θ1). To develop a general class of model selection criteria, we consider the Kullback–Leibler divergence between g̃(k)(Dobs; θ1, θ2) and f(Dobs), defined by
K(θ1, θ2) = −∫ log τ(Dobs; θ1, θ2) f(Dobs) dDobs = ∫ [log f(Dobs) − log g̃(k)(Dobs; θ1, θ2)] f(Dobs) dDobs,  (12)
where τ(Dobs; θ1, θ2) = g̃(k)(Dobs; θ1, θ2)/f(Dobs). The quantity K(θ, θ) is an overall measure of the goodness of fit of g̃(k)(Dobs; θ, θ) relative to f(Dobs). Because the first term in (12) is independent of any fitted model and can be ignored, our goal of selecting a model can be accomplished using the second term of (12).
If g(Dobs; θ) is specified correctly, then θ̂ is asymptotically efficient, and the likelihood ratio statistic is a most sensitive criterion for detecting deviations of the model parameters from their true values. Even when g(Dobs; θ) is misspecified, however, White (1994) established consistency and asymptotic normality of θ̂ under some conditions. Thus it is desirable to evaluate K(θ̂, θ★). A simple estimator of K(θ̂, θ★) is obtained by substituting the empirical distribution function F̂obs for the distribution of Dobs, denoted by Fobs. Thus, except for a constant, K(θ̂, θ★) can be approximated by −K̃(k)(θ̂, θ★), where K̃(k)(θ1, θ2) = log g̃(k)(Dobs; θ1, θ2).
We obtain the following theorems, whose detailed proofs are given in Appendix A. The following conditions are needed to facilitate development of our methods, although they may not be the weakest possible conditions. Even though g(Dobs; θ) may be misspecified, the ML estimator, θ̂, converges to the θn* that minimizes −n−1 Σᵢ₌₁ⁿ E[ℓ(zi,obs; θ)], where ℓ(zi,obs; θ) = log g(zi,obs; θ) (see, e.g., White 1994). For simplicity, we further assume that θn* = θ* for all n and E{∂θℓ(zi,obs; θ*)} = 0 for all i. The conditions are as follows:
(C1) θ* is unique and an interior point of Θ, where Θ is a compact set in Rp.
(C2) θ̂ → θ* in probability as n → ∞.
(C3) For all i, ℓ(zi,obs; θ) is three times continuously differentiable on θ, and |∂jℓ(zi,obs; θ)|2 and |∂j∂j′∂lℓ(zi,obs; θ)| are dominated by Bi(zi,obs) for all j, j′, l = 1, …, d, where ∂j = ∂/∂θj. The same smoothness condition also holds for h(k)(zi,obs; θ) = E[log g̃(zi,mis|zi,obs; k, θ)|zi,obs; θ].
(C4) For each ε > 0, there exists a finite constant C such that

n−1 Σᵢ₌₁ⁿ E[Bi(zi,obs)1{Bi(zi,obs) > C}] ≤ ε

for all n, where 1{Bi(zi,obs) > C} is the indicator function of Bi(zi,obs) > C.
(C5)
and
where A(θ*) is positive definite.
Condition (C1) defines the uniqueness of the “true” parameter value. Condition (C2) is the consistency of θ̂. Condition (C3) is a smoothness condition on ℓ(zi,obs; θ) and h(k)(zi,obs; θ). Condition (C4) is a standard Lindeberg condition, and (C5) can easily be verified using the law of large numbers.
Theorem 1
For ITID models, if conditions (C1), (C2), and (C3) hold true, then
n−1K̃(k)(θ̂, θ★) − n−1E[K̃(k)(θ̂, θ★)] → 0  (13)
in probability, where E[K̃(k)(·, ·)] denotes the expectation with respect to the observed data, E[K̃(k)(θ̂, θ★)] denotes E[K̃(k)(θ, θ★)] evaluated at θ = θ̂, and θ* is the pseudo true value of θ based on g(Dobs; θ).
Theorem 1 indicates that n−1K̃(k)(θ̂, θ★) is a consistent estimator of n−1E[K̃(k)(θ*, θ★)]. Now consider the situation in which we want to compare values of K̃(k)(θ̂, θ★) under different models for g(Dcom; θ). Although n−1K̃(k)(θ̂, θ★) is a consistent estimator of n−1E[K̃(k)(θ̂, θ★)], it is an overestimate of n−1E[K̃(k)(θ̂, θ★)], because the same data are used to estimate θ and to approximate Fobs. Following Akaike (1973) and Konishi and Kitagawa (2008), we calculate the bias of n−1K̃(k)(θ̂, θ★) in estimating n−1E[K̃(k)(θ̂, θ★)] as
b(θ★) = EDobs{K̃(k)(θ̂, θ★) − E[K̃(k)(θ̂, θ★)]},  (14)
where EDobs denotes the expectation taken with respect to the observed data. Although it may be difficult to calculate the explicit form of b(θ★), we can derive an asymptotic bias expression, denoted b1(θ★).
Theorem 2
For ITID models, if conditions (C1)–(C5) are true, then the asymptotic bias of K̃(k)(θ̂, θ★) in estimating E[K̃(k)(θ̂, θ★)] is given by
b1(θ★) = 2 tr{A(θ*)−1B(θ*|θ★)},  (15)
where A(θ) and B(θ|θ★) are defined in condition (C5).
Theorem 2 provides a theoretical basis for using −2K̃(k)(θ̂, θ★) + b(θ★) as a model selection criterion, and this quantity is precisely a bias-corrected estimate of −2EDobs[K̃(k)(θ̂, θ★)]. In particular, if θ★ = θ* and g(Dobs; θ) is specified correctly, then A(θ*) − B(θ*|θ*) converges to a zero matrix and b(θ*) ≈ 2d as k → ∞. But because θ★ is unknown, we replace θ* and θ★ by θ̂. In particular, under the correct specification of g(Dobs; θ), b(θ̂) should be close to 2d for large k. This leads to an approximation to the AIC as AICH̃(k),Q = −2K̃(k)(θ̂, θ̂) + 2d.
We now establish sufficient conditions to ensure consistency of ICH̃(k),Q. Following Nishii (1988), we consider two parametric models for the complete data, with densities given by
gt(Dcom; θ(t)) = ∏ᵢ₌₁ⁿ gt(zi,com; θ(t)), θ(t) ∈ Θt ⊂ Rdt,  (16)

for t = 1, 2; we denote these two models by ℳ1 and ℳ2. For each ℳt, the ML estimator θ̂(t) converges in probability to the pseudo true value, denoted by θ*(t). To select a better model, we first calculate
dICH̃(k),Q21 = ICH̃(k),Q(ℳ1) − ICH̃(k),Q(ℳ2).  (17)

We choose ℳ2 if dICH̃(k),Q21 > 0 and ℳ1 otherwise. Define

δ21,k = E[Q(θ*(2)|θ*(2)) − H̃(k|θ*(2))] − E[Q(θ*(1)|θ*(1)) − H̃(k|θ*(1))],

where the Q-function and H̃-function in each term are computed under the corresponding model, and δc21 = ĉn(θ̂(2)) − ĉn(θ̂(1)). Moreover, without loss of generality, we assume that d2 > d1 and ĉn(θ̂(2)) > ĉn(θ̂(1)); for instance, if ĉn(θ̂(2)) = d2 log(n), then δc21 = (d2 − d1) log(n).
Theorem 3
Suppose that ℳ1 and ℳ2 are ITID models and satisfy conditions (C1)–(C5). We then have the following results:
a. If lim infn n−1δ21,k > 0 and δc21 = op(n), then dICH̃(k),Q21 > 0 in probability.

b. Assume that n−1/2δ21,k = O(1), that n−1/2{H̃(k|θ̂(t)) − E[H̃(k|θ̂(t))]} = Op(1), and that n−1/2{Q(θ̂(t)|θ̂(t)) − E[Q(θ*(t)|θ̂(t))]} = Op(1) for t = 1, 2. Then dICH̃(k),Q21 ≤ 0 in probability as n−1/2δc21 → ∞.

c. Assume that Q(θ*(2)|θ̂(2)) − Q(θ*(1)|θ̂(1)) = Op(1) and H̃(k|θ̂(2)) − H̃(k|θ̂(1)) = Op(1). Then dICH̃(k),Q21 ≤ 0 in probability as δc21 → ∞.
Theorem 3 has some important implications. Theorem 3a indicates that ICH̃(k),Q chooses ℳ2 when lim infn n−1δ21,k > 0 and δc21 = op(n). Generally, the most commonly used penalties ĉn(θ̂), such as 2d, d log(n), and d log log(n) (d > 0), all satisfy the condition δc21 = op(n) (Nishii 1988). The condition lim infn n−1δ21,k > 0 ensures that ICH̃(k),Q chooses a model with large E[Q(θ*|θ*) − H̃(k|θ*)]. If ℳ1 and ℳ2 have the same average n−1E[Q(θ*|θ*) − H̃(k|θ*)] (i.e., lim infn n−1δ21,k = 0), then Theorems 3b and 3c indicate that ICH̃(k),Q picks out the “simpler” ℳ1 when δc21 increases to ∞ at a certain rate [e.g., log(n)]. But ĉn(θ̂) = 2d does not satisfy this condition; thus, because ICH̃(k),Q with ĉn(θ̂) = 2d is the EM-based estimate of the AIC, it tends to overfit the data in this scenario.
2.5 Using ICH̃(k),Q in the Presence of Nonignorable Missing Data
Although our model selection criteria ICH̃(k),Q are quite general and can be used with MAR or NMAR covariate and/or response data, here we offer some caution and advice on using these criteria with NMAR data. First, it is often argued that in missing-data problems, there is little information in the data regarding the form of the missing-data mechanism, and the parametric assumption of the missing-data mechanism itself is not “testable” from the data. Thus nonignorable modeling should be viewed as a sensitivity analysis involving a more complicated model. In this sense, it is dangerous to use any model selection criterion to directly compare MAR and NMAR models. Formally, we give the following guidelines on using ICH̃(k),Q:
1. ICH̃(k),Q should be used to choose among a family of MAR models and/or to choose among a family of NMAR models. It should not be used to choose among an aggregate set of MAR and NMAR models, nor should it be used to judge the fit of MAR models versus NMAR models.

2. Once the best MAR model and the best NMAR model are found using step 1, further sensitivity analyses can be done on those two models to examine changes in estimates of the main regression coefficients of interest in the sampling model. These sensitivity analyses can be carried out by examining estimates of the regression coefficients of the sampling model under several different parametric forms of the missing-data mechanism.
2.6 ICQ
Because the analytic approximation to the integrand of the H-function and its computation may be cumbersome for large k, it also might be desirable to obtain a model selection criterion that does not involve the H-function and whose components depend only on quantities obtained directly from the EM output. Toward this goal, we can obtain such a criterion by dropping H(θ̂|θ̂) from (7), leading to the criterion
ICQ = −2Q(θ̂|θ̂) + ĉn(θ̂).  (18)
Thus ICQ can be viewed as a crude approximation to ICH,Q in which H(θ̂|θ̂) is omitted. When ĉn(θ̂) = 2d in (18), this leads to the criterion
AICQ = −2Q(θ̂|θ̂) + 2d.  (19)
There are clear advantages and disadvantages to using ICQ instead of ICH̃(k),Q. One advantage of using ICQ is that it is computationally easier than ICH̃(k),Q, not requiring an approximation to the integrand of the H-function. But one clear disadvantage of ICQ is that as a result of omitting the H-function, a model selection criterion based on the Q-function alone can overstate the amount of information in the missing data compared with the observed data log-likelihood function. Omitting the H-function can lead to a criterion with poor model selection properties in some cases, especially when the missing-data fraction is high. In general, we recommend using ICH̃(k),Q over ICQ.
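Computationally, ICQ is a one-line function of the EM output. A minimal sketch (our own, with hypothetical Q-function values and model dimensions) is:

```python
import numpy as np

def ic_q(q_hat, d, n, penalty="bic"):
    """IC_Q = -2 Q(th|th) + c_n(th), with c_n = 2d (AIC-type) or d log n (BIC-type)."""
    c_n = 2 * d if penalty == "aic" else d * np.log(n)
    return -2 * q_hat + c_n

# hypothetical Q-function values at EM convergence for two candidate models
ic1 = ic_q(-241.3, d=5, n=100)   # smaller model
ic2 = ic_q(-239.8, d=6, n=100)   # larger model with a slightly better fit
print(ic1, ic2)                  # choose the model with the smaller IC_Q
```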
3. SIMULATION STUDIES
In this section we report on several simulation studies used to investigate the finite-sample performance of ICH̃(k),Q and ICQ in linear models and GLMs with MAR covariates. More specifically, we demonstrate how ICH̃(k),Q and ICQ can be used as model selection criteria for choosing the best-fitting model. In the simulation for the linear model with MAR and normally distributed covariates (Sec. 3.1), g(zi,mis|zi,obs; θ̂) has a closed form, and thus ICH,Q has an analytic closed form, so neither the Hermite approximation nor MCMC sampling is needed in this case. Therefore, we can assess the performance of the approximation in this setting by comparing {ICH̃(k),Q: k = 0, 1} with ICH,Q, which is analytically equivalent to the AIC in this case when ĉn(θ̂) = 2d. We also compare ICQ with ICH,Q.
But for the GLM with MAR covariates, neither ICH,Q nor g(zi,mis|zi,obs; θ̂) has a closed form, and thus both the Hermite approximation and MCMC sampling are needed to compute ICH,Q. In this setting, we do not attempt to compute the AIC or BIC directly using Laplace approximations or numerical integration techniques, because these methods are quite cumbersome to implement and, more important, the accuracy of the resulting approximations is very difficult to assess. Thus for GLMs, we compute only {ICH̃(k),Q: k = 0, 1} and ICQ through the MCEM algorithm under several values of ĉn(θ̂).
3.1 Missing-at-Random Covariates in Linear Models
We generated simulated data sets from a linear regression model with one MAR covariate. This simulation study had three goals: (i) to demonstrate how ICH̃(k),Q for different k can be used as a tool for selecting a model from a candidate set of proposed models and to evaluate and compare these criteria with ICH,Q, (ii) to compare ICQ with ICH,Q, and (iii) to compare the performance of ICQ with ICH̃(k),Q. To save space, we focus on ĉn(θ̂) = 2d throughout, although additional simulation results are available for other values of ĉn(θ̂), including ĉn(θ̂) = d log(n).
Consider the true model yi|xi ~ N(β0 + β1xi, σ2), where xi ~ N(μ, τ2) for i = 1, …, n. We generated the data set {(xi, yi): i = 1, …, n} as follows. First, we generated n independent random variables xi from a N(μ, τ2) distribution and then generated independent responses yi from a N(β0 + β1xi, σ2) distribution. We then generated n independent standard normally distributed variables zi that are independent of yi and xi. The true parameter values were taken to be β0 = .8, β1 = .8, σ2 = .8, μ = .8, τ2 = .8, and n = 100, 300, 500.
Furthermore, we assume that the response yi and the additional covariate zi are completely observed for i = 1, …, n, but the covariate xi can be missing for some cases. We note that because zi is fully observed for all cases, we need not specify a covariate distribution for zi in the modeling strategy, but a covariate distribution for xi must be specified, because xi is missing for some cases. The missing-data mechanism for the xi ’s is defined as follows. We let ri = 1 if xi is missing and ri = 0 if xi is observed. Then the following logistic regression model is considered for the missing-data mechanism:
logit{pr(ri = 1|yi)} = φ0 + φ1yi,  (20)
implying MAR covariates. To investigate the effect of the missingness fraction on the performance of the model selection criteria, we consider the following sets of true parameter values for φ0 and φ1: (I) φ0 = −4.0 and φ1 = 1.0, giving an average missingness fraction for xi of roughly 11%, and (II) φ0 = −3.5 and φ1 = 1.5, giving an average missingness fraction of roughly 29%.
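The stated missingness fractions can be checked by brute force. The sketch below (our own check) assumes, consistent with those fractions, a logistic mechanism whose logit is linear in the fully observed yi:

```python
import numpy as np

rng = np.random.default_rng(3)

# generate (x, y) under the true model of Section 3.1
n = 1_000_000
x = rng.normal(.8, np.sqrt(.8), n)
y = .8 + .8 * x + rng.normal(0., np.sqrt(.8), n)

def miss_frac(phi0, phi1):
    # pr(r_i = 1 | y_i) under the logistic mechanism; MAR since x_i does not appear
    return np.mean(1. / (1. + np.exp(-(phi0 + phi1 * y))))

print(miss_frac(-4.0, 1.0), miss_frac(-3.5, 1.5))   # about .11 and .29
```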
We considered five candidate models:
Model M1 (true model): yi|xi ~ N(β0 + β1xi, σ2), xi ~ N(μ, τ2)
Model M2: yi|xi ~ N(β0, σ2), xi ~ N(μ, τ2)
Model M3: yi|xi, zi ~ N(β0 + β1xi + β2zi, σ2), xi ~ N(μ, τ2)
Model M4: yi|xi, zi ~ N(β0 + β1xi + β2xi zi, σ2), xi ~ N(μ, τ2)
Model M5: yi|xi, zi ~ N(β0 + β1xi + β2zi + β3xi zi, σ2), xi ~ N(μ, τ2).
We generated R = 500 simulated data sets from M1 and then calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d, as well as AIC ≡ ICH,Q with ĉn(θ̂) = 2d (Table 1).
Table 1. Number of times, out of R = 500 simulated data sets, that the true model M1 achieved each rank. In each panel, rows give the rank of M1 under the criterion named above the panel, and columns give the rank of M1 under AICQ for each sample size n = 100, 300, 500.

ICH,Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 331 | 25 | 4 | 0 | 329 | 17 | 1 | 0 | 325 | 14 | 1 | 0 |
| 2 | 1 | 48 | 8 | 0 | 5 | 57 | 12 | 0 | 4 | 63 | 10 | 0 |
| 3 | 0 | 1 | 49 | 3 | 1 | 3 | 53 | 2 | 0 | 3 | 50 | 3 |
| 4 | 0 | 0 | 1 | 29 | 0 | 0 | 0 | 20 | 0 | 0 | 2 | 25 |

ICH,Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 301 | 51 | 13 | 0 | 325 | 37 | 4 | 0 | 327 | 32 | 6 | 0 |
| 2 | 5 | 31 | 21 | 4 | 6 | 35 | 17 | 5 | 3 | 35 | 15 | 3 |
| 3 | 0 | 9 | 35 | 8 | 0 | 2 | 36 | 9 | 0 | 7 | 36 | 8 |
| 4 | 0 | 0 | 1 | 21 | 0 | 0 | 2 | 22 | 0 | 1 | 1 | 26 |

ICH̃(0),Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 329 | 23 | 5 | 0 | 328 | 16 | 3 | 0 | 322 | 14 | 0 | 0 |
| 2 | 3 | 50 | 8 | 1 | 6 | 57 | 12 | 0 | 7 | 59 | 10 | 1 |
| 3 | 0 | 1 | 48 | 1 | 1 | 4 | 51 | 2 | 0 | 7 | 50 | 3 |
| 4 | 0 | 0 | 1 | 30 | 0 | 0 | 0 | 20 | 0 | 0 | 3 | 24 |

ICH̃(0),Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 302 | 49 | 10 | 0 | 323 | 35 | 5 | 0 | 320 | 31 | 10 | 0 |
| 2 | 4 | 34 | 25 | 5 | 8 | 36 | 15 | 5 | 10 | 34 | 16 | 1 |
| 3 | 0 | 8 | 34 | 9 | 0 | 3 | 37 | 9 | 0 | 8 | 31 | 10 |
| 4 | 0 | 0 | 1 | 19 | 0 | 0 | 2 | 22 | 0 | 2 | 1 | 26 |

ICH̃(1),Q with ĉn(θ̂) = 2d, case (I):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 325 | 25 | 5 | 0 | 323 | 19 | 2 | 0 | 310 | 21 | 1 | 0 |
| 2 | 7 | 46 | 9 | 1 | 9 | 54 | 13 | 0 | 18 | 49 | 11 | 1 |
| 3 | 0 | 3 | 47 | 2 | 3 | 3 | 50 | 3 | 1 | 10 | 49 | 4 |
| 4 | 0 | 0 | 1 | 29 | 0 | 1 | 1 | 19 | 0 | 0 | 2 | 23 |

ICH̃(1),Q with ĉn(θ̂) = 2d, case (II):

| Rank | n = 100: 1 | 2 | 3 | 4 | n = 300: 1 | 2 | 3 | 4 | n = 500: 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 298 | 49 | 11 | 0 | 314 | 33 | 6 | 0 | 301 | 27 | 6 | 1 |
| 2 | 8 | 34 | 22 | 3 | 15 | 37 | 16 | 5 | 23 | 37 | 17 | 3 |
| 3 | 0 | 7 | 36 | 8 | 2 | 4 | 35 | 10 | 6 | 9 | 33 | 6 |
| 4 | 0 | 1 | 1 | 22 | 0 | 0 | 2 | 21 | 0 | 2 | 2 | 27 |

NOTE: Two cases of missingness fractions for xi were included, with three sample sizes, n = 100, 300, and 500, for each case. The columns represent the results from AICQ.
Table 1 shows the number of times out of R = 500 simulations that each rank was achieved for M1, the true model, for all model selection criteria. The columns in Table 1 correspond to the rankings of AICQ [AICQ ≡ ICQ when ĉn(θ̂) = 2d] under the different settings, and the rows of Table 1 correspond to the proposed criteria for different choices of ĉn(θ̂) and k. With n = 100 and case (I), M1 was ranked number one 332 = 331 + 1 times by AICQ, 360 = 331 + 25 + 4 times by AIC [ICH,Q with ĉn(θ̂) = 2d], 357 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 355 times by ICH̃(1),Q with ĉn(θ̂) = 2d. With n = 100 and case (II), M1 was ranked number one 306 times by AICQ, 364 times by ICH,Q with ĉn(θ̂) = 2d, 361 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 358 times by ICH̃(1),Q with ĉn(θ̂) = 2d. These results imply that AICQ performs reasonably well in all scenarios, but ICH,Q outperforms AICQ, particularly for large missingness fractions. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d perform as well as ICH,Q even for large missingness fractions, which is an attractive result demonstrating the suitability of the approximation. Moreover, increasing k does not seem to improve the performance of ICH̃(k),Q, demonstrating its high degree of robustness. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d outperform AICQ, particularly for large missingness fractions. Finally, we note that AIC yields very similar results to {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d.
3.2 Missing-at-Random Covariates in Generalized Linear Models
In this section we consider a logistic regression model with one continuous covariate. Our primary aim is to evaluate {ICH̃(k),Q: k = 0, 1} and ICQ and compare them with each other. In this simulation study, covariates x1, …, xn are iid and generated from a N(.5, 1.0) distribution, and responses y1, …, yn are generated independently from a Bernoulli distribution with success probability pi = exp(β0 + β1xi)/{1 + exp(β0 + β1xi)}. We also assume that y1, …, yn are completely observed, whereas x1, …, xn are MAR for some cases.
The missing data for the xi were generated according to the missing-data mechanism in (20), and the zi ’s were generated exactly as described in Section 3.1. The true parameter values were taken to be β0 = β1 = .8 and n = 100, 300, and 500. To investigate the effect of the missingness fraction on our model selection criteria, we again considered two sets of true values for φ0 and φ1: (I) φ0 = −1.2 and φ1 = −.8, giving a missingness fraction of about 15%, and (II) φ0 = −.5 and φ1 = −.8, giving a missingness fraction of about 26%.
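The data-generation scheme above can be sketched in a few lines. This is an illustrative sketch, not the authors' code; in particular, the exact form of missing-data mechanism (20) is given earlier in the article, and here we assume one plausible form, logit Pr(ri = 1) = φ0 + φ1yi, which depends only on the fully observed response (hence MAR) and is consistent with the missingness fractions of about 15% for case (I) and 26% for case (II).

```python
import numpy as np

def simulate_mar_logistic(n, beta0=0.8, beta1=0.8, phi0=-1.2, phi1=-0.8, seed=0):
    """Generate one simulated data set: Bernoulli y_i with
    logit(p_i) = beta0 + beta1 * x_i, x_i ~ N(.5, 1), and x_i set missing
    via an assumed logistic MAR mechanism logit Pr(r_i = 1) = phi0 + phi1 * y_i."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.5, 1.0, n)                      # covariate ~ N(.5, 1.0)
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # success probability
    y = rng.binomial(1, p)                           # response, fully observed
    p_mis = 1.0 / (1.0 + np.exp(-(phi0 + phi1 * y))) # assumed mechanism (20)
    r = rng.binomial(1, p_mis)                       # r_i = 1 => x_i is missing
    x_obs = np.where(r == 1, np.nan, x)
    return y, x_obs, r
```

With the default case (I) values (φ0 = −1.2, φ1 = −.8), the realized missingness fraction `r.mean()` is near .15 for large n; setting `phi0=-0.5` gives case (II), near .26.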
As in Section 3.1, we considered five candidate models:
Model M1 (true model): logit(pi) = β0 + β1xi, xi ~ N(μ, τ2)
Model M2: logit(pi) = β0, xi ~ N(μ, τ2)
Model M3: logit(pi) = β0 + β1xi + β2zi, xi ~ N(μ, τ2)
Model M4: logit(pi) = β0 + β1xi + β2zi xi, xi ~ N(μ, τ2)
Model M5: logit(pi) = β0 + β1xi + β2zi xi + β3zi, xi ~ N(μ, τ2).
We simulated 500 data sets and then calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d for each simulated data set. Table 2 shows the number of times out of R = 500 simulations that each rank was achieved for M1, the true model, by all model selection criteria. Again, the columns in Table 2 correspond to the rankings of AICQ, and the rows correspond to several settings of the proposed criteria. The results are very similar to those reported in Section 3.1. For instance, with n = 100 and case (I), M1 was ranked number one 302 times by AICQ, 319 times by ICH̃(0),Q with ĉn(θ̂) = 2d, and 317 times by ICH̃(1),Q with ĉn(θ̂) = 2d. These results imply that AICQ performs reasonably well in all scenarios, and that increasing the missing-data fraction does not strongly affect the ability of AICQ to select the true model M1. The {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d perform reasonably well even for large missingness fractions. Moreover, increasing k does not seem to improve the performance of ICH̃(k),Q. Again, the {ICH̃(k),Q: k = 0, 1} with ĉn(θ̂) = 2d outperform AICQ, particularly for large missingness fractions.
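The rank tabulations in Tables 1 and 2 amount to cross-classifying, over the R replications, the rank each criterion assigns to a given model. A minimal sketch (the function name and array layout are ours, purely for illustration; smaller criterion values are better):

```python
import numpy as np

def joint_rank_counts(crit_a, crit_b, model=0):
    """crit_a, crit_b: (R, 5) arrays of criterion values for 5 candidate models
    over R simulated data sets (smaller = better). Returns a 5x5 table whose
    (i, j) entry counts the simulations in which `model` is ranked i+1 by
    crit_a and j+1 by crit_b."""
    # double argsort turns values into 0-based ranks within each row
    rank_a = crit_a.argsort(axis=1).argsort(axis=1)[:, model]
    rank_b = crit_b.argsort(axis=1).argsort(axis=1)[:, model]
    table = np.zeros((5, 5), dtype=int)
    for ra, rb in zip(rank_a, rank_b):
        table[ra, rb] += 1
    return table
```

Summing a row of the table recovers the total number of times the model achieved that rank under the row criterion (e.g., 319 for rank 1 under ICH̃(0),Q in case (I), n = 100).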
Table 2. Number of times out of R = 500 simulations that each rank was achieved for M1, the true model. Rows give the rank assigned by the proposed criterion with ĉn(θ̂) = 2d; within each block, the comma-separated entries give the counts for AICQ ranks 1–5.

ICH̃(0),Q with ĉn(θ̂) = 2d, case (I):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 294, 22, 2, 1, 0 | 317, 29, 2, 1, 0 | 316, 32, 6, 0, 0
2 | 8, 70, 17, 1, 0 | 15, 53, 13, 5, 0 | 8, 50, 19, 0, 0
3 | 0, 2, 61, 4, 0 | 0, 1, 51, 11, 0 | 0, 0, 44, 9, 0
4 | 0, 0, 2, 13, 2 | 0, 0, 2, 0, 0 | 0, 0, 1, 15, 0
5 | 0, 0, 0, 1, 0 | 0, 0, 0, 1, 0 | 0, 0, 0, 0, 0

ICH̃(0),Q with ĉn(θ̂) = 2d, case (II):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 267, 34, 12, 1, 0 | 298, 46, 6, 2, 0 | 305, 47, 12, 4, 0
2 | 12, 62, 29, 5, 0 | 8, 37, 23, 3, 0 | 6, 29, 36, 3, 0
3 | 0, 2, 47, 8, 0 | 0, 5, 49, 8, 0 | 0, 6, 33, 6, 0
4 | 0, 0, 2, 15, 2 | 0, 0, 2, 13, 0 | 0, 2, 1, 10, 0
5 | 0, 0, 1, 0, 1 | 0, 0, 0, 0, 0 | 0, 0, 0, 0, 0

ICH̃(1),Q with ĉn(θ̂) = 2d, case (I):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 290, 24, 2, 1, 0 | 307, 26, 3, 0, 0 | 296, 32, 5, 1, 0
2 | 11, 68, 13, 1, 0 | 22, 49, 12, 5, 0 | 26, 44, 17, 1, 0
3 | 1, 2, 63, 2, 0 | 3, 8, 47, 11, 0 | 1, 4, 44, 11, 0
4 | 0, 0, 4, 14, 2 | 0, 0, 6, 0, 0 | 1, 2, 4, 11, 0
5 | 0, 0, 0, 2, 0 | 0, 0, 0, 12, 0 | 0, 0, 0, 0, 0

ICH̃(1),Q with ĉn(θ̂) = 2d, case (II):

Rk | n = 100: 1, 2, 3, 4, 5 | n = 300: 1, 2, 3, 4, 5 | n = 500: 1, 2, 3, 4, 5
---|---|---|---
1 | 265, 35, 13, 1, 0 | 280, 37, 9, 3, 0 | 269, 45, 14, 3, 0
2 | 10, 60, 21, 4, 1 | 24, 41, 14, 3, 0 | 36, 27, 27, 6, 0
3 | 4, 3, 52, 11, 0 | 2, 10, 50, 10, 0 | 5, 10, 38, 6, 0
4 | 0, 0, 4, 11, 1 | 0, 0, 7, 10, 0 | 1, 2, 3, 8, 0
5 | 0, 0, 1, 12, 1 | 0, 0, 0, 0, 0 | 0, 0, 0, 0, 0

NOTE: Two cases of missingness fractions for the xi were included. For each case, 500 data sets were simulated at each of the sample sizes n = 100, 300, and 500. The columns represent the results from AICQ.
3.3 AIDS Data
We considered a data set from a study of the relationship between AIDS and the use of condoms (Morisky et al. 1998; Lee and Tang 2006). This complex data set requires sophisticated structural equation modeling in the presence of NMAR covariate and response data. An intriguing question is whether there is any model selection criterion for selecting the best-fitting model from a candidate set of structural equation models whose observed-data likelihood functions involve high-dimensional integrals. Directly computing AIC and BIC (e.g., using Laplace methods or high-dimensional numerical integration) is computationally prohibitive in this scenario; moreover, the accuracy of such approximations is difficult to assess in this high-dimensional setting. Thus this example strongly motivates the need for EM-based criteria, such as ICH̃(k),Q and ICQ.
For simplicity, we used only the data obtained from female sex workers in Philippine cities (Lee and Tang 2006). These data concern knowledge of AIDS and attitudes toward AIDS, beliefs, self-efficacy of condom use, and other variables. Nine variables in the original data set (items 33, 32, 31, 43, 72, 74, 27h, 27e, and 27i on the questionnaire) were taken as manifest variables in yi = (yi1, …, yi9)T; a continuous item xi1 (item 37) and an ordered categorical item xi2 (item 21, treated as continuous) were taken as covariates. The definitions of these nine items are given in Appendix B. In this data set, the variables yi1, yi2, yi3, yi7, yi8, and yi9 were measured on a 5-point scale and thus were treated as continuous; the variables yi4, yi5, and yi6 were continuous. There are n = 1,116 observations in this data set, and the manifest variables and covariates are missing at least once for 361 of them (32%). The missingness patterns for the manifest variables are shown in table 4 of Lee and Tang (2006). The covariate xi2 is completely observed.
Following Lee and Tang (2006), the manifest variables (yi1, yi2, yi3) are related to a latent variable, ηi, that can be interpreted as the “threat of AIDS,” whereas the manifest variables (yi4, yi5, yi6) and (yi7, yi8, yi9) are related to the latent variables ξi1 and ξi2, which can be interpreted as “aggressiveness of the sex worker” and “worry of contracting AIDS.” Specifically, to identify the relationship between the manifest variables yi and the latent variables ωi = (ηi, ξi1, ξi2)T, we consider the following measurement equation:
yi = μ + Λωi + εi, where μ = (μ1, …, μ9)T is a vector of intercepts, (ξi1, ξi2) ~ N(0, Φ) is independent of the measurement error vector εi ~ N(0, Ψ), Ψ = diag(ψ1, …, ψ9), and Φ = (φij) is a 2 × 2 covariance matrix. We also assume the following structure for Λ:
ΛT = (row 1: 1.0*, λ21, λ31, 0*, 0*, 0*, 0*, 0*, 0*; row 2: 0*, 0*, 0*, 1.0*, λ52, λ62, 0*, 0*, 0*; row 3: 0*, 0*, 0*, 0*, 0*, 0*, 1.0*, λ83, λ93), where 0* and 1.0* are regarded as fixed values that identify the scale of the latent factors. We let ryij = 1 if yij is missing and ryij = 0 if yij is observed, and rxi1 = 1 if xi1 is missing and rxi1 = 0 if xi1 is observed. Based on the missingness patterns, we assume that the missing-data mechanisms of both the manifest variables and the covariates are NMAR. In particular, we consider the following missing-data mechanisms for yij and xi1:
and
where τ is a vector of logistic regression coefficients, yio is the vector of observed components of yi, and ϕ = (ϕ0, ϕ1, …, ϕ9)T. Because xi1 may be missing, we need to specify its distribution. For simplicity, we assume that xi1 ~ N(0, ψx).
To study the relationship between η and (x1, x2, ξ1, ξ2), we consider four nonlinear structural equation models:

Model M0: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + δi,
Model M1: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi1² + δi,
Model M2: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi2² + δi, and
Model M3: ηi = b1xi1 + b2xi2 + γ1ξi1 + γ2ξi2 + γ3ξi1ξi2 + γ4ξi1² + γ5ξi2² + δi,

where δi ~ N(0, ψδ). Clearly, all four models include the linear effects of “aggressiveness,” ξi1, and “worry,” ξi2, as well as their interaction. Models M1 and M2 add the quadratic terms of “aggressiveness” and “worry,” respectively. Because M3 includes all possible terms in ξi1 and ξi2, it may be considered the “full model.”
We calculated {ICH̃(k),Q: k = 0, 1} and ICQ with ĉn(θ̂) = 2d and d log(n) for all four models (Table 3). The calculation of {ICH̃(k),Q: k = 0, 1} and ICQ was straightforward, because it required only quantities from the output of the EM algorithm used to obtain the parameter estimates. Model M0 was selected as best by all of the model selection criteria. The ML estimates of the parameters were obtained through the MCECM algorithm; the parameter estimates for model M0 are presented in Table 4. The factor loading estimates are positive and quite large, implying a strong positive association between the latent variables and their corresponding indicators, and the estimated nonlinear structural equation is η̂i = −.0579xi1 + .0821xi2 − .2711ξi1 + .2505ξi2 + .1897ξi1ξi2. Note that comparatively large (positive) values of ηi and xi2 (or of xi1 and ξi1) and of ξi2 indicate that an individual feels a high (or low) threat from AIDS and is more worried about contracting AIDS. The foregoing equation has the following interpretation:
Table 3. Values of ICQ, ICH̃(0),Q, and ICH̃(1),Q for the four structural equation models, under the penalties ĉn = 2d and ĉn = d log(n).

Model | ICQ, ĉn = 2d | ICQ, ĉn = d log(n) | ICH̃(0),Q, ĉn = 2d | ICH̃(0),Q, ĉn = d log(n) | ICH̃(1),Q, ĉn = 2d | ICH̃(1),Q, ĉn = d log(n)
---|---|---|---|---|---|---
M0 | 34,676.19 | 34,896.96 | 32,941.28 | 30,985.59 | 35,423.52 | 33,467.84
M1 | 34,680.18 | 34,905.97 | 32,961.77 | 31,017.56 | 35,709.52 | 33,765.32
M2 | 34,689.32 | 34,915.11 | 32,964.85 | 31,014.59 | 35,626.51 | 33,676.26
M3 | 34,708.79 | 34,939.60 | 32,988.38 | 31,037.17 | 35,567.39 | 33,616.17
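Each entry of this form requires only EM output. Recalling the definition ICQ(θ̂) = −2Q(θ̂|θ̂) + ĉn(θ̂), the computation from the converged Q-function is a one-liner; a minimal sketch (the function name is ours):

```python
import math

def ic_q(q_at_mle, d, n, penalty="2d"):
    """IC_Q = -2 Q(theta_hat | theta_hat) + c_n(theta_hat), where q_at_mle is
    the EM Q-function evaluated at the ML estimate, d is the number of model
    parameters, and the penalty is c_n = 2d (penalty="2d", AIC-type) or
    c_n = d * log(n) (penalty="dlogn", BIC-type)."""
    c_n = 2 * d if penalty == "2d" else d * math.log(n)
    return -2.0 * q_at_mle + c_n
```

For the AIDS data, n = 1,116; for each candidate model, the criterion is computed from that model's EM run, and the model with the smallest value (here M0) is selected.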
Table 4. ML estimates of the parameters in model M0, with standard deviations (SD).

Parameter | ML estimate | SD | Parameter | ML estimate | SD | Parameter | ML estimate | SD
---|---|---|---|---|---|---|---|---
μ1 | 3.6362 | .0292 | ψ1 | .9405 | .0765 | λ21 | .4493 | .1124 |
μ2 | 2.5977 | .0432 | ψ2 | 2.2057 | .0931 | λ31 | .7736 | .1558 |
μ3 | 3.9725 | .0321 | ψ3 | .9525 | .0464 | λ52 | 1.6294 | .1679 |
μ4 | .0015 | .0052 | ψ4 | .8665 | .0383 | λ62 | 1.1107 | .0859 |
μ5 | .0031 | .0323 | ψ5 | .6246 | .1358 | λ83 | .4220 | .1407 |
μ6 | .0020 | .0092 | ψ6 | .8251 | .0452 | λ93 | .7358 | .1149 |
μ7 | 4.3696 | .0038 | ψ7 | .7179 | .0783 | b1 | −.0579 | .0310 |
μ8 | 3.1411 | .0431 | ψ8 | 2.0665 | .0900 | b2 | .0821 | .0290 |
μ9 | 3.7998 | .0344 | ψ9 | 1.4165 | .0865 | γ1 | −.2711 | .0679 |
φ11 | .1410 | .0210 | ψδ | .4059 | .0912 | γ2 | .2505 | .1060 |
φ12 | −.0422 | .0090 | ψx | 1.4774 | .6778 | γ3 | .1897 | .1363 |
φ22 | .3819 | .0418 |
b̂1 = −.0579 indicates that the longer sex workers are in their jobs, the less threat they feel from AIDS, and b̂2 = .0821 implies that the more they think that they know about AIDS, the more threat they feel from AIDS.
γ̂1 = −.2711 shows that the more aggressive the sex workers are, the less threat they feel from AIDS, and γ̂2 = .2505 shows that sex workers who are more worried about contracting AIDS feel more of a threat from AIDS.
γ̂3 = .1897 indicates that ξi1 and ξi2 have a positive interaction effect on “threat of AIDS.”
The foregoing analysis shows that introducing an interaction term into the nonlinear structural equation to describe the relationship between ηi and (ξi1, ξi2) is clearly warranted, and that the effects can differ across cases. The estimated correlation between “aggressiveness,” ξi1, and “worry,” ξi2, is −.1819, indicating that they are negatively correlated.
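As a quick arithmetic check, the reported correlation of −.1819 follows directly from the estimated entries of Φ in Table 4:

```python
import math

# Estimated covariance entries of Phi from Table 4
phi11, phi12, phi22 = 0.1410, -0.0422, 0.3819

# Correlation between the latent variables xi_1 ("aggressiveness") and xi_2 ("worry")
corr = phi12 / math.sqrt(phi11 * phi22)
print(round(corr, 4))  # -0.1819
```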
4. DISCUSSION
We have proposed a general class of model selection criteria, ICH̃(k),Q, for missing-data problems. ICH̃(k),Q can be computed directly from the EM output. The theory of ICH̃(k),Q is quite general and applies to the various types of missing-data models for which the EM algorithm is applicable. Moreover, ICH̃(k),Q can be applied directly to many other problems in which the ECM and ECME algorithms are applicable (Meng and Rubin 1993; Liu and Rubin 1994). We have given theoretical underpinnings for these criteria and have shown that they are consistent. We note, however, that although consistency is a desirable and interesting property, it does not shed light on how to penalize the observed-data likelihood for model parsimony in finite samples. Further research is needed to determine the best choice of penalty in missing-data problems. We have also demonstrated that the Hermite approximation to the integrand of the H-function, log(g(Dmis|Dobs; θ̂)), is quite robust for model choice over several choices of k, an attractive feature of the proposed approximation. Choices of k = 0, 1 worked as well as k = 10 and larger. This is comforting, because it shows that model choice is not sensitive to the degree of the Hermite approximation to g(Dmis|Dobs; θ̂).
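To give a concrete sense of a truncated Hermite expansion of a density, the following sketch implements a degree-k seminonparametric density in the spirit of Gallant and Nychka (1987). It illustrates the expansion family only, not the authors' exact approximation to g(Dmis|Dobs; θ̂); the function name and coefficient parameterization are ours.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial import hermite_e  # probabilists' Hermite polynomials He_j

def snp_density(x, coef):
    """Degree-k Hermite (SNP) density f(x) = P_k(x)^2 phi(x) / E_phi[P_k(Z)^2],
    where P_k(x) = sum_j coef[j] * He_j(x) and phi is the standard normal pdf.
    With coef = [1] (k = 0), f is exactly the standard normal density."""
    phi = np.exp(-0.5 * np.asarray(x) ** 2) / sqrt(2 * pi)
    poly = hermite_e.hermeval(x, coef)
    # orthogonality of He_j under phi: E_phi[He_i(Z) He_j(Z)] = j! * delta_ij
    norm = sum(a * a * factorial(j) for j, a in enumerate(coef))
    return poly ** 2 * phi / norm
```

The k = 0 case collapsing to the normal mirrors the robustness finding above: low-order truncations already capture the shape needed for model comparison.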
The penalty term ĉn(θ̂) can have a profound effect on the finite-sample performance of ICH̃(k),Q and ICQ. Compared with ĉn(θ̂) = 2d, the penalty d log(n) for ICH̃(k),Q and ICQ leads to a significant improvement in correctly identifying the true model (not presented). In light of Theorem 3, this is not surprising, because the 2d penalty tends to pick larger models. For instance, because the true model in Section 3.1 has one covariate, the d log(n) penalty is expected to outperform the 2d penalty (not presented). Furthermore, combining different degrees of approximation in the truncated Hermite expansion with different penalty terms can lead to nonlinear behavior in ICH̃(k),Q and ICQ.
The MCEM algorithm converged in a reasonable number of steps for the GLM simulation and the AIDS data set, and the Gibbs sampling followed the same steps as described by Ibrahim, Lipsitz, and Chen (1999). In the Gibbs steps of the MCEM algorithm, the Metropolis–Hastings algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953; Hastings 1970) was used to simulate observations from the complex, nonstandard conditional distributions. For the GLM and AIDS data examples, EM convergence was obtained in fewer than 50 iterations using an increasing Gibbs sample size of 2,000 within EM. Gibbs sample sizes of 5,000 and 10,000 also were used to check sensitivity to the choice of the Gibbs sample size, and the estimates were extremely robust to these choices; for example, the estimates based on Gibbs sample sizes of 2,000 and 10,000 matched to the third decimal place. In addition, values of the Gibbs sample size that changed with each EM iteration were considered. For example, at the beginning of EM, we started with 50 Gibbs samples and gradually increased the number of Gibbs samples as the EM iterations increased. The results obtained were quite similar to those obtained using a constant value of 2,000 Gibbs iterations throughout all of the EM iterations. The convergence criterion used for the EM algorithm was that the distance between the kth iteration and the (k + 1)st iteration for all of the parameters was less than 5 × 10−4. The reason for choosing such a tolerance level is the Gibbs sample size used in each iteration. We also tried a tolerance level of 10−4 when the Gibbs sample size was 10,000, and EM convergence was obtained in a similar number of iterations. We further note that if the tolerance level were chosen too small, then it would be impossible to achieve convergence due to the Monte Carlo error induced by the Gibbs sampler. Finally, we note that slightly more computing time was required for the AIDS data set than the GLM simulation.
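The MCEM scheme just described (a Monte Carlo sample size that grows across EM iterations, with convergence declared when every parameter moves less than the tolerance) can be sketched generically. Here `e_step_draws` and `m_step` are placeholders for the model-specific Gibbs/Metropolis–Hastings E-step and the M-step; this is a schematic loop under those assumptions, not the authors' code.

```python
import numpy as np

def mcem(e_step_draws, m_step, theta0, m0=50, m_max=2000, tol=5e-4, max_iter=200):
    """Generic MCEM loop. e_step_draws(theta, m) returns m Monte Carlo draws of
    the missing data given the current parameters; m_step(draws) returns the
    updated parameter vector. The Monte Carlo sample size grows from m0 toward
    m_max; iteration stops when max |theta_new - theta| < tol."""
    theta = np.asarray(theta0, dtype=float)
    m = m0
    for it in range(max_iter):
        draws = e_step_draws(theta, m)          # Monte Carlo E-step
        theta_new = np.asarray(m_step(draws))   # M-step
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new, it + 1
        theta = theta_new
        m = min(int(m * 1.5) + 1, m_max)        # grow the Gibbs sample size
    return theta, max_iter
```

As noted above, the tolerance cannot be made arbitrarily small: the Monte Carlo error of the E-step draws puts a floor under the iteration-to-iteration parameter changes, so the tolerance must be matched to the Gibbs sample size.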
Acknowledgments
The authors wish to deeply thank the editor, the associate editor, and three referees for extremely helpful comments and suggestions that have substantially improved the article. Dr. Ibrahim’s research was supported in part by National Institutes of Health grants GM 70335 and CA 74015. Dr. Zhu’s research was supported in part by National Science Foundation grant SES-06-43663 and BCS-0826844 and NIH grant 1-UL1-RR025747-01. Dr. Tang’s research was supported in part by NSFC (10561008) and NCET (NCET-07-0737).
APPENDIX A: PROOFS OF THEOREMS 1, 2, AND 3
Proof of Theorem 1
We need only show that sup(θ1,θ2)∈Θ×Θ n−1|K̃(k)(θ1, θ2) − E[K̃(k)(θ1, θ2)]| → 0 in probability and that E[K̃(k)(θ1, θ2)] is continuous in θ1 and θ2 uniformly over Θ × Θ. Conditions (C3) and (C4) are sufficient for assumption W–LIP of Andrews (1992), which ensures the continuity of E[K̃(k)(θ1, θ2)] and the stochastic equicontinuity (SE) of K̃(k)(θ1, θ2). Furthermore, conditions (C3) and (C4) ensure pointwise convergence; that is, n−1{K̃(k)(θ1, θ2) − E[K̃(k)(θ1, θ2)]} converges to 0 in probability for each θ1 and θ2. Combining SE with pointwise convergence yields Theorem 1.
Proof of Theorem 2
We prove Theorem 2 in three steps. First, we show that

(A.1)

Conditions (C1)–(C5) are sufficient for establishing (A.1) (Zhu and Zhang 2006). The second step is to obtain the stochastic expansions for K̃(k)(θ̂, θ*) and E[K̃(k)(θ̂, θ*)] as follows:

(A.2)

where Δθ̂ = θ̂ − θ*. Taking expectations yields

(A.3)

Following the same arguments as Konishi and Kitagawa (2008), we obtain

(A.4)
Proof of Theorem 3
Based on Theorem 1 and δc21 = op(n), we have
which yields Theorem 3a.
Theorem 3b can be proved by noting that n−1/2dICH̃ (k),Q21 can be written as the sum of
Note that for t = 1, 2, Q(θ̂(t)|θ̂(t)) can be written as
Because θ̂ (t) − θ*(t) = Op(n−1/2), Q(θ̂(t)|θ̂ (t)) = Q(θ*(t)|θ̂(t)) + Op(1). Thus dICH̃(k),Q21 can be written as
Theorem 3c can be proved by noting that Q(θ*(1)|θ̂(1)) − Q(θ*(2)|θ̂(2)) = Op(1) and δc21 → ∞.
APPENDIX B: SELECTED ITEMS IN THE AIDS DATA
The item numbers in the questionnaire are given in parentheses.
y1 (item 33): How worried are you about getting AIDS? not at all worried 1/2/3/4/5 extremely worried.
y2 (item 32): What are the chances that you yourself might get AIDS?
none 1/2/3/4/5 very great.
y3 (item 31): How much of a threat do you think AIDS is to the health of people?
no threat at all 1/2/3/4/5 very great.
y4 (item 43): How many times did you have vaginal sex in the last 7 days?
y5 (item 72): How many “hand jobs” did you give in the last 7 days?
y6 (item 74): How many “blow jobs” did you give in the last 7 days?
Items y7–y9 ask: How great is the risk of getting AIDS from the following activities?
y7 (item 27h): Sexual intercourse with someone you don’t know very well without using a condom.
y8 (item 27e): Sexual intercourse with someone who has the AIDS virus, using a condom.
y9 (item 27i): Sexual intercourse with someone who injects drugs.
The scale for y7, y8, and y9 is: no risk 1/2/3/4/5 great risk.
x1 (item 37): How long (in months) have you been working at a job where people pay to have sex with you?
x2 (item 21): How much do you think you know about the disease called AIDS?
nothing 1/2/3/4/5 a great deal.
Contributor Information
Joseph G. Ibrahim, Joseph G. Ibrahim is Alumni Distinguished Professor (E-mail: ibrahim@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill.
Hongtu Zhu, Hongtu Zhu is Associate Professor (E-mail: hzhu@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill.
Niansheng Tang, Niansheng Tang is Professor, Department of Statistics, Yunnan University, Kunming (E-mail: nstang@ynu.edu.cn).
References
- Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov BN, Csáki F, editors. Second International Symposium on Information Theory. Budapest: Akadémiai Kiadó; 1973. pp. 267–281.
- Andrews DWK. Generic Uniform Convergence. Econometric Theory. 1992;8:241–257.
- Cameron AC, Johansson P. Count Data Regression Using Series Expansions, With Applications. Journal of Applied Econometrics. 1997;12:203–223.
- Chen MH, Ibrahim JG, Shao QM. Propriety of the Posterior Distribution and Existence of the Maximum Likelihood Estimator for Regression Models With Covariates Missing at Random. Journal of the American Statistical Association. 2004;99:421–438.
- Copas JB, Li HG. Inference for Non-Random Samples (with discussion). Journal of the Royal Statistical Society, Ser. B. 1997;59:55–96.
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B. 1977;39:1–38.
- Diggle PJ, Kenward MG. Informative Drop-Out in Longitudinal Data Analysis. Applied Statistics. 1994;43:49–93.
- Fenton VM, Gallant AR. Qualitative and Asymptotic Performance of SNP Density Estimators. Journal of Econometrics. 1996;74:77–118.
- Gallant AR, Nychka DW. Semi-Nonparametric Maximum Likelihood Estimation. Econometrica. 1987;55:363–390.
- Hastings WK. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970;57:97–109.
- Huang L, Chen MH, Ibrahim JG. Bayesian Analysis for Generalized Linear Models With Nonignorable Missing Covariates. Biometrics. 2005;61:729–737.
- Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association. 1990;85:765–769.
- Ibrahim JG, Lipsitz SR. Parameter Estimation From Incomplete Data in Binomial Regression When the Missing-Data Mechanism Is Nonignorable. Biometrics. 1996;52:1071–1078.
- Ibrahim JG, Chen M-H, Lipsitz SR. Monte Carlo EM for Missing Covariates in Parametric Regression Models. Biometrics. 1999;55:591–596.
- Ibrahim JG, Chen M-H, Lipsitz SR. Missing Responses in Generalised Linear Mixed Models When the Missing Data Mechanism Is Nonignorable. Biometrika. 2001;88:551–564.
- Ibrahim JG, Lipsitz SR, Chen MH. Missing Covariates in Generalized Linear Models When the Missing-Data Mechanism Is Nonignorable. Journal of the Royal Statistical Society, Ser. B. 1999;61:173–190.
- Jansen I, Molenberghs G, Aerts M, Thijs H, van Steen K. A Local Influence Approach to Binary Data From a Psychiatric Study. Biometrics. 2003;59:410–419.
- Kim JI. Uniform Convergence Rate of the Seminonparametric Density Estimator and Testing for Similarity of Two Unknown Densities. Econometrics Journal. 2007;10:1–34.
- Konishi S, Kitagawa G. Information Criteria and Statistical Modeling. New York: Springer; 2008.
- Lee SY, Tang NS. Analysis of Nonlinear Structural Equation Models With Nonignorable Missing Covariates and Ordered Categorical Data. Statistica Sinica. 2006;16:1117–1141.
- Little RJA. Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association. 1993;88:125–134.
- Little RJA. A Class of Pattern-Mixture Models for Normal Incomplete Data. Biometrika. 1994;81:471–483.
- Little RJA. Modeling the Drop-Out Mechanism in Repeated-Measures Studies. Journal of the American Statistical Association. 1995;90:1112–1121.
- Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: Wiley; 2002.
- Liu CH, Rubin DB. The ECME Algorithm: A Simple Extension of EM and ECM With Fast Monotone Convergence. Biometrika. 1994;81:633–648.
- McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. River Edge, NJ: World Scientific; 1998.
- Meng XL, Rubin DB. Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika. 1993;80:267–278.
- Meng XL, van Dyk D. The EM Algorithm: An Old Folk Song Sung to a Fast New Tune. Journal of the Royal Statistical Society, Ser. B. 1997;59:511–540.
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics. 1953;21:1087–1092.
- Morisky DE, Tiglao TV, Sneed CD, Tempongko SB, Baltazar JC, Detels R, Stein JA. The Effects of Establishment Practices, Knowledge, and Attitudes on Condom Use Among Filipina Sex Workers. AIDS Care. 1998;10:213–220.
- Nishii R. Maximum Likelihood Principle and Model Selection When the True Model Is Unspecified. Journal of Multivariate Analysis. 1988;27:392–403.
- Rubin DB. Formalizing Subjective Notions About the Effect of Non-respondents in Sample Surveys. Journal of the American Statistical Association. 1977;72:538–543.
- Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
- Troxel AB, Ma G, Heitjan DF. An Index of Local Sensitivity to Nonignorability. Statistica Sinica. 2004;14:1221–1237.
- van Steen K, Molenberghs G, Thijs H. A Local Influence Approach to Sensitivity Analysis of Incomplete Longitudinal Ordinal Data. Statistical Modelling: An International Journal. 2001;1:125–142.
- Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG. Sensitivity Analysis for Non-Random Dropout: A Local Influence Approach. Biometrics. 2001;57:43–50.
- White H. Estimation, Inference, and Specification Analysis. New York: Cambridge University Press; 1994.
- Zhu HT, Zhang HP. Asymptotics for Estimation and Testing Procedures Under Loss of Identifiability. Journal of Multivariate Analysis. 2006;97:19–45.
- Zhu HT, Lee SY, Wei BC, Zhou J. Case-Deletion Measures for Models With Incomplete Data. Biometrika. 2001;88:727–737.