Abstract
Growth mixture models (GMMs) with nonignorable missing data have drawn increasing attention in research communities but have not been fully studied. The goal of this article is to propose and to evaluate a Bayesian method to estimate the GMMs with latent class dependent missing data. An extended GMM is first presented in which class probabilities depend on some observed explanatory variables and data missingness depends on both the explanatory variables and a latent class variable. A full Bayesian method is then proposed to estimate the model. Through the data augmentation method, conditional posterior distributions for all model parameters and missing data are obtained. A Gibbs sampling procedure is then used to generate Markov chains of model parameters for statistical inference. The application of the model and the method is first demonstrated through the analysis of mathematical ability growth data from the National Longitudinal Survey of Youth 1997 (Bureau of Labor Statistics, U.S. Department of Labor, 1997). A simulation study considering 3 main factors (the sample size, the class probability, and the missing data mechanism) is then conducted and the results show that the proposed Bayesian estimation approach performs very well under the studied conditions. Finally, some implications of this study, including the misspecified missingness mechanism, the sample size, the sensitivity of the model, the number of latent classes, the model comparison, and the future directions of the approach, are discussed.
Longitudinal data analysis (LDA) has become widely used in medical, social, psychological, and educational research to investigate both intraindividual changes over time and interindividual differences in changes (e.g., Demidenko, 2004; Fitzmaurice, Laird, & Ware, 2004; Hedeker & Gibbons, 2006; Singer & Willett, 2003). LDA involves data collection on the same participants through multiple wave surveys or questionnaires (e.g., Baltes & Nesselroade, 1979), so heterogeneous data are very common in practical research in these fields (e.g., McLachlan & Peel, 2000). In other words, the data collected often come from more than one distribution with different population parameters. Furthermore, during longitudinal data collection, missing data are almost inevitable because of dropout, fatigue, and other factors (e.g., Little & Rubin, 2002; Schafer, 1997).
Growth mixture models (GMMs) have been developed to provide a flexible approach to analyzing longitudinal data with mixture distributions (e.g., Bartholomew & Knott, 1999) and have received much attention in the literature. GMMs are combinations of finite mixture models (e.g., Bartholomew & Knott, 1999; Luke, 2004; McLachlan & Peel, 2000; Yung, 1997) and latent growth curve models (LGCs; e.g., Preacher, Wichman, MacCallum, & Briggs, 2008; Singer & Willett, 2003; Willett & Sayer, 1994). They can also be viewed as special cases of latent variable mixture models (Lubke & Neale, 2006) that allow patterns in the repeated measures to reflect a finite number of trajectory types, each of which corresponds to an unobserved or latent class in the population (e.g., Elliott, Gallo, Have, Bogner, & Katz, 2005; Muthén & Shedden, 1999). For a comprehensive introduction to finite mixture model theory and recent advances, see McLachlan and Peel (2000).
An important issue in the analysis of GMMs is the presence of missing data (e.g., Little & Rubin, 2002; Schafer, 1997). Little and Rubin (2002) distinguished three different missing data mechanisms: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). MCAR is a process in which data missingness is independent of both observed and unobserved outcomes. For MAR, data missingness may depend on observed outcomes but not on unobserved outcomes. If missingness depends on unobserved outcomes or some unobserved latent variables in the fitted model, then the missingness mechanism is MNAR.
For example, in a pretest-posttest study, some students may drop out of the study after taking the pretest, and thus there are missing data due to their withdrawal. For these students, the pretest scores are observed outcomes and the posttest scores are unobserved potential outcomes. If the dropout is due to a family's move, then the missingness is independent of both pretest and posttest scores, so it can be viewed as MCAR. If the dropout is due to a low pretest score, then the missingness depends on the pretest score but not on the posttest score, so it is MAR. If the dropout is due to poor performance on the posttest, then the dropout depends on the unobserved posttest score, so it is MNAR. If there are several latent classes in the study and the dropout is due to the latent class membership, then the dropout is also MNAR.
The MCAR and MAR mechanisms are often referred to as ignorable missingness mechanisms because either the parameters that govern the missing process are distinct from the parameters that govern the model outcomes or the missingness depends on some observed variables, and therefore the likelihood-based estimates are generally consistent if the missing data mechanism is ignored (Little & Rubin, 2002).
The MNAR mechanism, on the contrary, is a nonignorable missingness mechanism (Little & Rubin, 2002). When the assumption of ignorable missingness mechanisms is untenable, it becomes necessary to model missingness mechanisms that contain information about the parameters of the complete data population.
Focusing on the nonignorable missingness mechanism, a number of methods and models are available for dealing with missing data. When data come from a single population, there are two possible types of nonignorable missingness: outcome dependent (OD) missingness and latent variable dependent (LVD) missingness. OD missingness occurs when data missingness depends on the unobserved outcomes. For example, Diggle and Kenward (1994) proposed a selection model for continuous longitudinal data subject to nonignorable dropout where missingness on the current occasion depends on the historical observations and the current outcome that would be observed if the participant had not dropped out. LVD missingness occurs when data missingness depends on some latent variables within the population, such as latent factors, latent slopes, or other latent random effects. For example, Wu and Carroll (1988) and Wu and Bailey (1989) modeled the informative right censoring process where the missingness depends on the latent rate of change. OD and LVD missingness may occur simultaneously when missingness depends on both unobserved outcomes and some latent variables. For example, Lee and Tang (2006) and Song and Lee (2007) proposed a Bayesian method for structural equation models (SEMs; e.g., Bollen, 1989; Lee, 2007) with nonignorable missingness in which the missingness may depend on the potential outcomes and the related latent variables.
When data come from mixture models, the nonignorable missingness could be OD and/or LVD missingness within mixture components, as well as latent class dependent (LCD) missingness in which data missingness depends on the latent random class membership. Studies that have contributed greatly to combining finite mixture models and different types of nonignorable missingness include Cai and Song (2010) and Cai, Song, and Hser (2010). Cai and Song (2010) extended Lee and Tang's (2006) single SEM with nonignorable missingness to mixture SEMs with nonignorable missingness. Cai et al. (2010) further extended the mixture SEMs to allow for nonignorable missingness in both responses and covariates.
The LCD missingness is an important issue in both theoretical and practical research. For example, Roy (2003) proposed a pattern mixture method to study nonignorable dropout where the dropout time is related to the latent class membership. Frangakis and Rubin (1999) studied nonignorable nonresponse in a broken randomized pretest-posttest experiment by introducing a partially observed class variable, compliance, and obtained normal approximations of estimators under a series of assumptions. Using the compliance variable, Barnard, Frangakis, Hill, and Rubin (2003) studied a real data case by adopting a partial pattern mixture model to deal with missingness through Bayesian methods. Note that the LCD missingness is nonignorable because the class membership in mixture models is a latent variable, so LCD missingness can be viewed as a special case of LVD missingness in mixture models.
Attrition in GMMs is very common in real data, and therefore it is very important to evaluate missing data methods for GMMs. However, in the framework of GMMs, there is little work discussing how to deal with nonignorable missingness, and even less on how to model LCD missingness. In an unpublished webnote, Muthén and Brown (2001) extended the GMMs introduced by Muthén and Shedden (1999) to deal with missing data. As a reaction to Barnard et al.'s (2003) article, Muthén, Jo, and Brown (2003) switched from pretest-posttest models to GMMs and discussed possible approaches to bringing together GMMs and missing data with latent variables.
In addition, most previous studies rely on maximum likelihood methods for parameter estimation and carry out inference through conventional likelihood procedures. Bayesian methods provide great advantages in the analysis of complex models with complicated data structures (e.g., Ansari, Jedidi, & Jagpal, 2000; Dunson, 2000; Scheines, Hoijtink, & Boomsma, 1999), and the application of Bayesian methods in psychological research has recently become popular through the work of Lee and colleagues (e.g., Lee, 2007; Lee & Shi, 2000; Lee & Tang, 2006; Song & Lee, 2007; Zhu & Lee, 2001).
The goal of this article is to propose and evaluate a Bayesian approach to estimating GMMs with nonignorable missingness with a focus on LCD missingness in GMMs. Specifically, the model evaluated in this study allows (a) observed covariates to predict the class probability and (b) the latent class membership and observed covariates to predict missingness on each occasion.
This model implies that on each occasion, conditional on the class membership, the missingness given the observed covariates is independent of the potential outcomes. The missingness thus represents a form of latent ignorability (LI; Frangakis & Rubin, 1999), which states that, within each latent class, potential outcomes and the associated potential response indicators are independent. LI is widely used in the analysis of broken randomized experiments for the intent-to-treat (ITT) effect and the complier average causal effect (CACE; e.g., Barnard et al., 2003; Coronary Drug Project Research Group, 1980; Taylor & Zhou, 2009).
The rest of the article consists of five sections. The first describes an extended GMM where class probabilities and nonignorable missingness are modeled. The second presents the estimation of such a GMM through a full Bayesian estimation method utilizing data augmentation and Gibbs sampling algorithms. The third illustrates the application of the model and method through the analysis of mathematical ability growth data from the National Longitudinal Survey of Youth 1997 (NLSY97; Bureau of Labor Statistics, U.S. Department of Labor, 1997). The fourth presents a simulation study to evaluate the performance of the model and the Bayesian estimation method. The last section discusses the implications and future directions of this study. In addition, the Appendices present some technical details.
EXTENDED GMMS WITH LCD MISSING DATA
In this section, we present the proposed extended GMM with LCD missing data. Although we focus on LCD missingness in this article, the model is very flexible and can easily be modified to cover a variety of missing mechanisms. The path diagram of the model is illustrated in Figure 1. In the diagram, each small square represents an observed variable, each circle represents a latent variable, a circle inside a square represents an outcome variable with possible missing values, and the triangle represents a constant. The details of the proposed model are given as follows.
FIGURE 1.
Path diagram of a growth mixture model with latent class dependent missing data. mt indicates the missingness status of the corresponding yt: mt = 1 implies yt is missing and mt = 0 implies yt is observed. The x's are covariates. p(mt) depends on the x's and the class membership c, and c is predicted by the x's. The growth mixture model takes the kth component with probability πk.
Latent Growth Curve Models (LGCs)
In Figure 1, the path diagram inside each component, the big square, illustrates an LGC model. Suppose that in a longitudinal study there are N subjects and T measurement occasions or time points. For individual i (i = 1, 2, …, N), let yi be a T × 1 random vector yi = (yi1, yi2, …, yiT)′ where yit stands for the outcome or observation on occasion t (t = 1, 2, …, T), and let ηi be a q × 1 random vector containing q continuous latent variables. An LGC of the outcome yi related to the latent ηi can be expressed as
yi = Ληi + ei,    (1)
where Λ is a T × q matrix consisting of factor loadings and ei is a T × 1 vector of residuals or measurement errors that are assumed to follow a multivariate normal distribution ei ~ MNT(0, Θ).1 If we assume residual variances are invariant over time, then the covariance matrix Θ = IT ϕ, where ϕ is a scalar and IT is a T × T identity matrix. The matrix Λ and the vector ηi determine the growth trajectory of the model. For instance, when q = 2, ηi = (li, si)′, and Λ is a T × 2 matrix with the first column full of 1s and the second column being (0, 1, …, T − 1)′, the corresponding model represents a linear growth model in which li is the latent random level (or intercept) and si is the latent random slope for individual i. Furthermore, when q = 3, ηi = (li, si, qi)′, and Λ is a T × 3 matrix with the first column full of 1s, the second column being (0, 1, …, T − 1)′, and the third column being (0, 1, 4, …, (T − 1)²)′, the corresponding model represents a quadratic growth curve model in which li is the latent random level (or intercept), si is the latent random slope, and qi is the latent random quadratic coefficient for individual i.
We further assume
ηi = β + ξi,    (2)
where ξi is a q × 1 vector following a multivariate normal distribution ξi ~ MNq(0, Ψ). β is called the fixed effect and ξi is called the random effect (e.g., Fitzmaurice et al., 2004; Hedges, 1994; Luke, 2004; Singer & Willett, 2003).
By combining Equation (1) and Equation (2), under the normality assumptions of both ei and ξi and the independence assumption between ei and ξi, we have

yi ~ MNT(μ, Σ),

where μ = Λβ and Σ = ΛΨΛ′ + Θ.
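To make the implied moments concrete, the following minimal NumPy sketch builds Λ for a linear growth model with T = 5 occasions and computes μ = Λβ and Σ = ΛΨΛ′ + ϕIT for one class; the parameter values here are hypothetical illustrations, not estimates from this article.

```python
import numpy as np

T = 5
# Linear growth: a column of 1s (level) and 0..T-1 (slope).
Lambda = np.column_stack([np.ones(T), np.arange(T)])  # T x 2 loading matrix

beta = np.array([18.0, 1.2])       # hypothetical fixed effects (level, slope)
Psi = np.array([[6.0, -0.8],
                [-0.8, 0.2]])      # hypothetical random-effect covariance
phi = 2.0                          # hypothetical residual variance

mu = Lambda @ beta                                 # implied occasion means
Sigma = Lambda @ Psi @ Lambda.T + phi * np.eye(T)  # implied covariance matrix
print(mu)
print(Sigma)
```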
Growth Mixture Models (GMMs)
In Figure 1, a GMM is illustrated by the LGC components and a latent categorical variable c, which stands for the latent class membership. GMMs assume that yi follows a mixture of K distributions with each component distribution being a trajectory class (but see Lubke & Neale, 2008, for a discussion). The mixing proportions are also called class probabilities or weights. The density function of yi is
p(yi) = ∑_{k=1}^{K} πk pk(yi),    (3)
where pk(yi) (k = 1, …, K) are the component LGC densities, and the πk are class probabilities satisfying 0 ≤ πk ≤ 1 and ∑_{k=1}^{K} πk = 1 (McLachlan & Peel, 2000).
If each mixture component pk(yi) is further assumed to be a multivariate normal distribution MNT(μk, Σk), where μk = Λkβk and Σk = ΛkΨkΛk′ + Θk, then Equation (3) can be further expressed as a parametric finite normal GMM (e.g., Jordan & Xu, 1995),
p(yi) = ∑_{k=1}^{K} πk MNT(yi | μk, Σk).    (4)
For different trajectory classes, βk, Λk, Ψk, and Θk may be different. The class-specific parameters reflect different fixed effects and different random effects in GMMs. For example, the overall sample can be a mixture of one subsample with a low initial level and little growth and another subsample with a high initial level and large growth.
Note that the class membership is unknown in mixture models. But this variable is very important to interpret mixture models. For individual i, the class membership can be expressed by a single categorical variable ci with ci = k (k = 1, …, K) when yi comes from the kth mixture component or class. But in later work, it is convenient to work with a K-dimensional component label vector zi = (zi1, zi2, …, ziK)′ in place of ci, where zik, the kth element of zi, is defined to be one or zero, according to whether or not yi comes from the kth class. When ci = k, we have zik = 1 and zij = 0 (∀j ≠ k). The vector zi is distributed according to a multinomial distribution consisting of one draw from K categories with a probability πk in the kth category,
zi ~ Multinomial(1; π1, π2, …, πK).    (5)
The density function for zi is p(zi) = ∏_{k=1}^{K} πk^{zik}.
Extended GMMs
Now we consider extended GMMs in which class probabilities depend on observed covariates. Notice that the GMM in Equation (4) assumes that the class probability πk is a constant for each class, although the post hoc posterior probability can vary for each individual.2 It is interesting to see how πk is related to some external covariates in the mixture data analysis. For example, in addition to determining the class membership of each individual, it would be useful to see how the class membership is related to individuals' background variables such as gender and income. Note that if we include the individual variant covariates into class probability, the model is not a finite mixture anymore because the class probability is not a constant.
Let πik (i = 1, 2, …, N; k = 1, 2, …, K) be the probability that individual i falls into the kth class, and let

δik = ∑_{j=1}^{k} πij

be the cumulative class probability for individual i falling into the first k classes. Note that δiK ≡ 1, meaning that the total class probability summed over all K classes for individual i is 1. With the definitions of πik and δik, it is easy to see that when k = 1, πi1 = δi1; when k = 2, 3, …, K − 1, πik = δik − δi,k−1; and when k = K, πiK = 1 − δi,K−1. In this way, we order the class probabilities πik from k = 1 to k = K.
Now we build a categorical regression model (e.g., Agresti, 2002; Long, 1997) of δik on covariates by using a probit link function3 (e.g., McCullagh & Nelder, 1989). Let xi = (xi1, xi2, …, xir)′ be an r × 1 vector of observed covariates that may be related to the class membership; then the probit regression4 is built as
δik = Φ(x̃i′φk) = Φ(φk0 + xi′φk1),    (6)

where the scalar φk0 is an intercept, φk1 is an r × 1 vector of coefficients for the covariates xi, both x̃i = (1, xi′)′ and φk = (φk0, φk1′)′ are (1 + r) × 1 vectors, and Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution. Then the class probabilities are
πi1 = δi1,  πik = δik − δi,k−1 (k = 2, …, K − 1),  πiK = 1 − δi,K−1.    (7)
For convenience, we express Equation (7) as a function πik = π(φk, φk−1, xi) in the remainder of this article. As a special case, if the model has two classes, then Equation (7) simplifies to πi1 = Φ(φ10 + xi′φ11) and πi2 = 1 − Φ(φ10 + xi′φ11).
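As a sketch of how Equations (6) and (7) turn covariates into class probabilities, the following Python function computes πi1, …, πiK for one individual. The coefficient values in the example are hypothetical, and for K > 2 the φk are assumed ordered so that the cumulative probabilities δik are nondecreasing.

```python
import numpy as np
from scipy.stats import norm

def class_probs(x, phis):
    """pi_ik from cumulative probits: delta_ik = Phi(phi_k0 + x' phi_k1).

    x    : (r,) covariates for one individual
    phis : list of K-1 vectors, each (1+r,) = (intercept, coefficients)
    """
    xt = np.concatenate(([1.0], x))                     # prepend the constant
    delta = np.array([norm.cdf(xt @ p) for p in phis])  # delta_i1..delta_i,K-1
    delta = np.append(delta, 1.0)                       # delta_iK = 1
    return np.diff(np.concatenate(([0.0], delta)))      # successive gaps = pi_ik

# Two-class case: pi_i1 = Phi(phi_10 + x * phi_11), pi_i2 = 1 - pi_i1.
print(class_probs(np.array([1.0]), [np.array([-0.25, -0.24])]))
```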
Extended GMMs With LCD Missing Data
In this subsection, we model the missingness in extended GMMs. We focus on the LCD missingness. Specifically, the missing data rate on each occasion depends on both the latent class membership zi and some observed covariates xi. To make the model more general, we also assume that (a) the missing pattern is intermittent, namely, participants may return for later assessments after missing earlier assessments, and (b) the missing data rates are independent across different occasions.
Let mi = (mi1, mi2, …, miT)′ indicate the missingness status of yi. If yit is missing, then mit = 1. Otherwise, mit = 0. Let τit = p(mit = 1) be the probability that yit is missing. Then mit follows a Bernoulli distribution,
mit ~ Bernoulli(τit).    (8)
With the class membership indicating variable zi, the missing probability τit can be expressed as a probit link function of zi and xi,

τit = Φ(γzt′zi + γxt′xi) = Φ(wi′γt),    (9)

where wi = (zi′, xi′)′ and γt = (γzt′, γxt′)′, in which γzt is a K × 1 vector γzt = (γzt1, γzt2, ⋯, γztK)′ and γxt is an r × 1 vector γxt = (γxt1, γxt2, ⋯, γxtr)′. From the distribution Equation (8) and Equation (9), we have the density function of mit as a function of the class membership zi and observed covariates xi,
p(mit | zi, xi) = τit^{mit}(1 − τit)^{1−mit},    (10)
where τit = Φ(γztk + xi′γxt) for zik = 1 or ci = k.
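A short sketch of Equations (8)-(10): given an individual's class label and covariates, compute τit and draw the missingness indicator. The γ values below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def draw_missing(z, x, gamma_z, gamma_x):
    """m_it ~ Bernoulli(tau_it), tau_it = Phi(gamma_zt' z_i + gamma_xt' x_i)."""
    tau = norm.cdf(z @ gamma_z + x @ gamma_x)
    return rng.binomial(1, tau), tau

z = np.array([0.0, 1.0])     # one-hot class label: individual in class 2
x = np.array([1.0])          # a single covariate
m, tau = draw_missing(z, x, gamma_z=np.array([-1.8, -1.2]),
                      gamma_x=np.array([0.1]))
print(m, tau)                # tau = Phi(-1.1), about 0.14
```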
For convenience, in the remainder of this article the parameters β, Ψ, and ϕ are referred to as the growth curve parameters, and the parameters φ and γ are referred to as the probit parameters.
BAYESIAN ESTIMATION OF THE PROPOSED MODEL
In this section, we present a full Bayesian estimation approach for the proposed extended GMMs with LCD missing data. To obtain parameter estimates through Bayesian inference, we need the distribution of the parameters conditional on the data. Bayes's theorem states that the posterior distribution of the parameters equals the product of the likelihood function of the sample data and the prior distribution of the parameters, divided by the marginal distribution of the data; because the marginal distribution is a constant that involves no parameters, the posterior is proportional to the likelihood times the prior.
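In symbols, with θ collecting all model parameters and (yobs, m) denoting the observed data and the missingness indicators,

p(θ | yobs, m) = p(yobs, m | θ)p(θ)/p(yobs, m) ∝ p(yobs, m | θ)p(θ).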
Data Augmentation and Likelihood Function
For multidimensional models with missing data, we utilize the data augmentation method (Tanner & Wong, 1987) to obtain the likelihood function. Data augmentation refers to methods for constructing iterative optimization or sampling algorithms by introducing unobserved data or latent variables (van Dyk & Meng, 2001), and the idea of adding auxiliary variables is a useful conceptual and computational tool for many problems (Gelman, Carlin, Stern, & Rubin, 2003).
Let yi = (yi^obs′, yi^mis′)′, where yi^obs and yi^mis denote the observed and missing data for individual i, respectively. The direct observed-data likelihood function of yi^obs and mi for the ith individual is

Li^obs(θ) = ∑_{zi} ∫∫ p(yi^obs, yi^mis, mi, zi, ηi | θ) dηi dyi^mis,

which is very difficult to evaluate because of the high-dimensional integral over an unspecified mixture structure. So the data augmentation method is used by adding the auxiliary variables, the missing data yi^mis, the class membership vector zi = (zi1, zi2, ⋯, ziK)′, and the latent random effects ηi, to the model. With the help of the auxiliary variables, the joint likelihood function of yi, mi, zi, and ηi for the ith individual can be expressed as

Li(θ) = p(yi | ηi, zi, θ) p(ηi | zi, θ) p(mi | zi, xi, θ) p(zi | xi, θ).
By combining Equations (1), (2), and (10), the likelihood function for the whole sample is

L(θ, z, η, y^mis) ≜ ∏_{i=1}^{N} ∏_{k=1}^{K} vik^{zik},    (11)

where vik = πik MNT(yi | Λkηi, ITϕk) MNq(ηi | βk, Ψk) ∏_{t=1}^{T} τit^{mit}(1 − τit)^{1−mit} and ≜ means “is defined as.”
Prior and Posterior Distributions
To use Bayesian methods, we need to specify priors for the model parameters. Lee and Song (2003) found that Bayesian estimation is not sensitive to the priors, especially for large sample sizes. In this study, we adopted conjugate priors because they are commonly used in the literature of Bayesian analysis (e.g., Lee, 1981; Roeder & Wasserman, 1997; Zhu & Lee, 2001). The model parameters in this study include the growth curve parameters βk, Ψk, ϕk (k = 1, 2, ⋯, K) and the probit parameters φk (k = 1, 2, ⋯, K − 1) and γt (t = 1, 2, ⋯, T), so βk and Ψk can use a multivariate normal-inverse Wishart distribution prior, ϕk can use an inverse Gamma distribution prior, φk can use a multivariate normal distribution prior, and γt can use a multivariate normal distribution prior. In a simpler manner, we can also directly specify the prior precision of βk instead of setting it proportional to Ψk. Appendix A lists the details of these prior distributions.
With the likelihood function and the priors, the joint posterior distribution of the unknown parameters is readily available. However, the marginal posterior distributions (Gelman et al., 2003) of the parameters are very hard to obtain explicitly because of the requirement of high-dimensional integration. Instead, we first obtain the conditional distributions for the parameters and then utilize the Gibbs sampling method (Casella & George, 1992; Geman & Geman, 1984) to generate Markov chains for the parameters and conduct Bayesian inference.
The full conditional posterior distributions for the mixture model parameters are provided by Equation (12)–Equation (18) in Appendix B. In addition, the conditional posterior distributions for the augmented variable zi, the latent variable ηi, and the missing data (i = 1, 2, …, N) are also provided by Equation (19)–Equation (21), respectively, in Appendix B.
Gibbs Sampling and Statistical Inference
With the conditional posterior distributions obtained earlier, we can generate Markov chains for the unknown model parameters by implementing a Gibbs sampling algorithm (Casella & George, 1992; Geman & Geman, 1984). Gibbs sampling is a Markov chain Monte Carlo algorithm for obtaining a sequence of samples from a joint probability distribution. Starting with a set of initial guesses for all the unknown variables, it generates instances from the conditional distribution of each variable in turn, conditional on the current values of the other variables (Geman & Geman, 1984). The sequence of samples forms a Markov chain that can be shown to be ergodic (Geman & Geman, 1984), so after convergence each generated value is effectively a draw from the joint distribution of all parameters. It can also be shown that the sequence for each variable is itself a Markov chain that converges to the marginal distribution of that variable (Robert & Casella, 2004). Gibbs sampling is especially useful when the joint distribution is complex or unknown but the conditional distribution of each variable is available.
Specifically in our model, the unknown variables include the model parameters ϕ, Ψ, β, φ, γ, the augmented variables z, η, and missing values ymis. The following algorithm can be used.
Start with a set of initial values for model parameters ϕ(0), Ψ(0), β(0), φ(0), γ(0), z(0), η(0), and ymis(0).
- Suppose that at the sth iteration the current values are ϕ(s), Ψ(s), β(s), φ(s), γ(s), z(s), η(s), and ymis(s). To generate ϕ(s+1), Ψ(s+1), β(s+1), φ(s+1), γ(s+1), z(s+1), η(s+1), and ymis(s+1), the following procedure is implemented:
- Generate ϕ(s+1) from the inverse Gamma distribution in Equation (12).
- Generate Ψ(s+1) from the inverse Wishart distribution in Equation (13).
- Generate β(s+1) from the multivariate normal distribution in Equation (14).
- Generate φ(s+1) from the distributions in Equations (15)–(17).
- Generate γ(s+1) from the distribution in Equation (18).
- Generate z(s+1) from the multinomial distribution in Equation (19).
- Generate η(s+1) from the multivariate normal distribution in Equation (20).
- Generate ymis(s+1) from the normal distribution in Equation (21).
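The loop can be sketched schematically in Python; here the `draw` table stands in for the model-specific conditional samplers of Equations (12)-(21), so this shows only the control flow, not an implementation of the samplers themselves.

```python
import numpy as np

def gibbs(data, draw, init, n_iter):
    """Schematic Gibbs sampler: one full scan over the conditional samplers
    per iteration. draw[name] must return a draw of that block from its
    conditional posterior given the current state (model-specific)."""
    state = dict(init)
    order = ['phi', 'Psi', 'beta', 'varphi', 'gamma', 'z', 'eta', 'y_mis']
    chains = []
    for _ in range(n_iter):
        for name in order:                    # Equations (12)-(21) in turn
            state[name] = draw[name](state, data)
        chains.append({k: np.copy(v) for k, v in state.items()})
    return chains
```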
After convergence, statistical inference can be conducted based on the generated Markov chains. Let θ = (θ1, θ2, …, θp)′ denote a vector of all the unknown variables in the model. The converged Markov chains can be recorded as θ(s), s = 1, 2, …, S, and each parameter estimate θ̂j (j = 1, 2, …, p) can be calculated as the mean of the chain, θ̂j = ∑_{s=1}^{S} θj(s)/S, with standard error (SE) obtained as the standard deviation (SD) of θj, SE(θ̂j) = [∑_{s=1}^{S} (θj(s) − θ̂j)²/(S − 1)]^{1/2}. To get the credible (confidence) intervals, the percentiles of the Markov chains can be used.
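As an illustration, a minimal NumPy sketch of these summaries for one recorded chain:

```python
import numpy as np

def summarize(chain, level=0.95):
    """Point estimate, SE, and percentile credible interval from S draws."""
    est = chain.mean()                      # posterior mean
    se = chain.std(ddof=1)                  # SD of the draws as the SE
    alpha = 100 * (1 - level) / 2
    lo, hi = np.percentile(chain, [alpha, 100 - alpha])
    return est, se, (lo, hi)

print(summarize(np.random.default_rng(1).normal(1.2, 0.05, size=70000)))
```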
REAL DATA ANALYSIS
In this section, we illustrate the application of the Bayesian GMM model with missing data through the analysis of mathematical ability growth data from the NLSY97 survey (Bureau of Labor Statistics, U.S. Department of Labor, 1997). Specifically, data used in the current analysis were collected yearly from 1997 to 2001 on N = 1,510 adolescents. Starting in 1997 when they were 12 years old and in the 7th grade, each adolescent was administered the Peabody Individual Achievement Test (PIAT) Mathematics Assessment to measure their mathematical ability. The same adolescents were then measured annually till 2001 when they were 16 years old and in the 11th grade.
Table 1 shows the summary statistics for the data. Overall, the means of mathematical ability increased over time with a roughly linear trend. The missing data rates range from 4.57% to 9.47%, and the raw data show that the missing pattern is intermittent. About half of the sample is male (763/1,510 = 50.5%). To investigate the possible number of latent classes, we draw a histogram with a smoothed density estimate for the mathematical ability data at each wave. The histograms, shown in Figure 2, clearly display two modes in mathematical ability for the current sample of adolescents. Therefore, a Bayesian linear GMM with two latent classes is fitted to the data in the current analysis.
TABLE 1.
Summary Statistics for PIAT Math Data Set
| Grade 7 | Grade 8 | Grade 9 | Grade 10 | Grade 11 | |
|---|---|---|---|---|---|
| M | 18.147 | 20.041 | 21.178 | 22.465 | 23.110 |
| SD | 6.219 | 6.526 | 6.601 | 6.435 | 6.643 |
| Missing data (count) | 83 | 69 | 120 | 115 | 143 |
| Missing data (percentage) | 5.497 | 4.570 | 7.947 | 7.616 | 9.470 |
Note. Male: N = 763; Female: N = 747.
FIGURE 2.
Histograms of PIAT math scores for five grades.
For the sake of comparison, we fit two models to the data. The first one is the Bayesian GMM we proposed and assumes that the missing data are nonignorable, and the second one assumes that the missing data are ignorable. For the first model, we evaluate whether missingness is related to class membership and the covariate sex. Because the purpose of the current analysis is to demonstrate the application of the proposed Bayesian GMM, we adopt the priors discussed earlier with hyperparameters chosen to carry little prior information for our model parameters (Congdon, 2003; Gill, 2002; Zhang, Hamagami, Wang, Grimm, & Nesselroade, 2007). Specifically, for φ1, we set μφ1 = 02 and Σφ1 = 10⁶I2. For ϕk (k = 1, 2), we set v0k = s0k = 0.002. For βk, it is assumed that βk0 = 02 and Σk0 = 10⁶I2. For Ψk, we define mk0 = 2 and Vk0 = I2. Finally, for γt, we let γt0 = 03 and Dt0 = 10⁶I3. The starting values are then set at φ1 = 02, ϕk = 1, βk = (1, 1)′, Ψk = I2, and γt = 03. In both prior and starting value specifications, 0d and Id denote a d-dimensional zero vector and a d-dimensional identity matrix, respectively. For the second model, the missingness is assumed to be ignorable and therefore there is no estimate for the missingness parameters. For the other model parameters, the same priors and starting values as those in the first model are used.
In generating Markov chains through Gibbs sampling, we use a burn-in period of 10,000 iterations.5 For testing convergence, we examine the history plot and Geweke's z statistic (Geweke, 1992)6 for each unknown model parameter. To make sure all the parameters are estimated accurately, the next 70,000 iterations7 are then saved for data analysis.
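For reference, a rough Python sketch of the Geweke diagnostic, which compares the means of an early and a late segment of a chain; Geweke's published statistic uses spectral-density estimates of the segment variances, so the plain sample variances used here are only a simplifying approximation.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """z statistic comparing the first 10% and the last 50% of a chain."""
    a = chain[: int(first * len(chain))]
    b = chain[int((1 - last) * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))
```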
The results for our real data analysis are given in Tables 2 and 3. First, based on the history plots (two selected history plots are presented in Figure 3), it seems that each Markov chain converges to its stationary distribution. Second, the Geweke test statistics for all model parameters are smaller than 1.96 in absolute value, which also indicates the convergence of the Markov chains (Geweke, 1992). Third, the ratio of the Monte Carlo error to the standard deviation for each parameter is smaller than 0.05, which indicates that the parameter estimates are accurate (Spiegelhalter, Thomas, Best, & Lunn, 2003). Overall, we can conclude that the results from our real data analysis can be used for further inference. For example, the distance between the two populations with different covariance matrices (Anderson & Bahadur, 1962) can be calculated, and it is 2.7.
TABLE 2.
Real Data Analysis Under an Assumption of Latent Class Dependent Missingness
| Parameter | M | SD | MCse | MCse/SD | CI.La | CI.U | Geweke t |
|---|---|---|---|---|---|---|---|
| Growth Curve Parameters | ||||||||
| Class 1b | ||||||||
| β1[1] | 25.140 | 0.176 | 0.003 | 0.017 | 24.790 | 25.480 | 0.564 | |
| β1[2] | 1.130 | 0.039 | 0.001 | 0.026 | 1.055 | 1.206 | −0.307 | |
| ψ1[11] | 5.501 | 0.687 | 0.010 | 0.015 | 4.265 | 6.944 | 0.126 | |
| ψ1[22] | 0.187 | 0.031 | 0.000 | 0.000 | 0.131 | 0.253 | −0.120 | |
| ψ1[12] | −0.843 | 0.136 | 0.002 | 0.015 | −1.127 | −0.596 | −0.092 | |
| ϕ 1 | 1.904 | 0.107 | 0.002 | 0.019 | 1.701 | 2.121 | −0.457 | |
| Class 2c | ||||||||
| β2[1] | 15.920 | 0.164 | 0.002 | 0.012 | 15.600 | 16.250 | 1.923 | |
| β2[2] | 1.253 | 0.042 | 0.001 | 0.024 | 1.169 | 1.335 | −0.919 | |
| ψ2[11] | 15.850 | 1.141 | 0.019 | 0.017 | 13.720 | 18.170 | 1.566 | |
| ψ2[22] | 0.402 | 0.084 | 0.003 | 0.036 | 0.250 | 0.574 | −0.497 | |
| ψ2[12] | 0.805 | 0.224 | 0.006 | 0.027 | 0.349 | 1.229 | −0.476 | |
| ϕ 2 | 13.310 | 0.364 | 0.006 | 0.016 | 12.610 | 14.040 | −0.750 | |
| Probit Parameters | ||||||||
| Class | | | | | | | | |
| φ10d | −0.249 | 0.115 | 0.005 | 0.043 | −0.471 | −0.024 | −0.591 | |
| φ11 | −0.238 | 0.074 | 0.003 | 0.041 | −0.387 | −0.094 | 0.463 | |
| Grade 7 | ||||||||
| γ*z11e | −1.470 | 0.181 | 0.007 | 0.039 | −1.835 | −1.116 | −0.895 | |
| γ*z12 | 0.116 | 0.134 | 0.003 | 0.022 | −0.135 | 0.397 | 0.344 | |
| γ x1 | −0.150 | 0.107 | 0.004 | 0.037 | −0.360 | 0.065 | 1.067 | |
| Grade 8 | ||||||||
| γ*z21 | −2.199 | 0.229 | 0.011 | 0.048 | −2.662 | −1.771 | −1.407 | |
| γ*z22 | 0.442 | 0.190 | 0.008 | 0.042 | 0.093 | 0.853 | 1.299 | |
| γ x2 | 0.101 | 0.107 | 0.004 | 0.037 | −0.113 | 0.309 | 1.146 | |
| Grade 9 | ||||||||
| γ*z31 | −1.346 | 0.171 | 0.007 | 0.041 | −1.680 | −1.013 | −0.835 | |
| γ*z32 | 0.199 | 0.131 | 0.004 | 0.031 | −0.050 | 0.466 | 1.034 | |
| γ x3 | −0.147 | 0.096 | 0.004 | 0.042 | −0.333 | 0.038 | 0.655 | |
| Grade 10 | ||||||||
| γ*z41 | −1.662 | 0.174 | 0.007 | 0.040 | −2.016 | −1.333 | 0.452 | |
| γ*z42 | 0.192 | 0.131 | 0.004 | 0.031 | −0.062 | 0.456 | −0.038 | |
| γ x4 | 0.054 | 0.096 | 0.004 | 0.042 | −0.134 | 0.244 | −0.577 | |
| Grade 11 | ||||||||
| γ*z51 | −1.507 | 0.170 | 0.008 | 0.047 | −1.848 | −1.178 | −0.854 | |
| γ*z52 | 0.417 | 0.133 | 0.005 | 0.038 | 0.166 | 0.685 | 1.389 | |
| γ x5 | −0.089 | 0.092 | 0.004 | 0.043 | −0.273 | 0.088 | 0.134 | |
aThe significance of parameter estimates can be judged based on the confidence intervals. If zero is included in the interval, then the parameter estimate is not significantly different from zero.

bThe growth curve parameters for Class 1. Specifically, β[1]: initial level; β[2]: slope; ψ[11]: variance of initial level; ψ[22]: variance of slope; ψ[12]: covariance of initial level and slope; ϕ: variance of error.

cThe growth curve parameters for Class 2.

dThe probit parameters of class proportion as in Equation (6).

eThe probit parameters of missing data rate. Note that although the γ*zt1 and γ*zt2 here are different from the γzt1 and γzt2 in Equation (9), they are equivalent after the reparameterization γ*zt1 = γzt1 and γ*zt2 = γzt2 − γzt1.
TABLE 3.
Real Data Analysis Under an Assumption of Ignorable Missingness
| Parameter | M | SD | MCse | MCse/SD | CI.L | CI.U | Geweke t |
|---|---|---|---|---|---|---|---|
| Growth Curve Parameters | ||||||||
| Class 1 | ||||||||
| β1[1] | 25.140 | 0.176 | 0.004 | 0.023 | 24.790 | 25.480 | −0.239 | |
| β1[2] | 1.131 | 0.039 | 0.001 | 0.026 | 1.055 | 1.208 | 0.267 | |
| ψ1[11] | 5.471 | 0.692 | 0.011 | 0.016 | 4.220 | 6.935 | −0.684 | |
| ψ1[22] | 0.187 | 0.031 | 0.000 | 0.000 | 0.130 | 0.252 | −0.756 | |
| ψ1[12] | −0.839 | 0.138 | 0.002 | 0.014 | −1.127 | −0.586 | 0.927 | |
| ϕ 1 | 1.905 | 0.106 | 0.002 | 0.019 | 1.706 | 2.121 | 0.206 | |
| Class 2 | ||||||||
| β2[1] | 15.890 | 0.164 | 0.002 | 0.012 | 15.570 | 16.210 | 0.760 | |
| β2[2] | 1.253 | 0.042 | 0.001 | 0.024 | 1.171 | 1.336 | −0.578 | |
| ψ2[11] | 15.640 | 1.116 | 0.018 | 0.016 | 13.550 | 17.920 | −0.445 | |
| ψ2[22] | 0.398 | 0.081 | 0.003 | 0.037 | 0.248 | 0.564 | −0.841 | |
| ψ2[12] | 0.815 | 0.220 | 0.006 | 0.027 | 0.369 | 1.230 | 0.971 | |
| ϕ 2 | 13.320 | 0.363 | 0.006 | 0.017 | 12.620 | 14.050 | 0.906 | |
| Probit Parameters | | | | | | | | |
| Class | | | | | | | | |
| φ10 | −0.240 | 0.111 | 0.005 | 0.045 | −0.460 | −0.033 | 0.262 | |
| φ11 | −0.238 | 0.072 | 0.003 | 0.042 | −0.375 | −0.098 | −0.249 | |
Note. With the same notations as those in Table 2.
FIGURE 3.

Selected history plots. History plots for all parameters can be found on our web page. (a) Parameter β2[1]. (b) Parameter .
A quick comparison of results from both analyses shows that the estimates for growth curve parameters are very close. For both models, the differences between Class 1 and Class 2 include (a) Class 1 has a higher initial level and lower slope, (b) Class 2 has larger variations for initial level and slope, (c) the residual variance is much larger for the second class, and (d) for Class 1, the initial level and slope are negatively correlated but for Class 2 they are positively correlated.
A closer look at the results from the analysis with LCD missingness in Table 2 further reveals that none of the γxt's, the coefficients for the covariate sex, is significant at the α level of 0.05, which implies that the missingness may not be related to sex. However, the coefficients for class membership for Grades 8 and 11 are positive and significant, which indicates that adolescents in Class 2 are more likely than those in Class 1 to have missing data in these two grades.
A SIMULATION STUDY
In this section, a simulation study is presented to evaluate the performance of the proposed Bayesian GMMs with missing data. To simplify the presentation, we focus on a linear GMM with two latent trajectory classes resembling our real data analysis. Five occasions of data are generated, and missing data are created on each occasion according to different predesigned missing data rates. It is also assumed there is only one covariate in the simulation study.
Simulation Design
In the simulation, we consider three main factors: the sample size, the class probability, and the missing data mechanism. First, sample sizes of 1,500, 1,000, and 500 are considered. Second, both equal and unequal class probabilities are considered. For the equal class probabilities, each class contains around 50% of participants. For the unequal class probabilities, around 30% of participants are in the first class and the other 70% are in the second class. Third, both nonignorable and ignorable missing mechanisms are considered. In the simulation, we apply our model to both MCAR and MNAR data. For MCAR data, a uniform missing data rate, around 16%, is set across all five occasions for both classes. For MNAR data, the missing data rates for the first class are set around (2%, 4%, 6%, 8%, 10%) across Occasions 1 to 5, respectively, and for the second class around (4%, 8%, 12%, 16%, 20%). The different missing data rates are realized by setting different values of the corresponding probit parameters γ*zt1, γ*zt2,8 and γxt. The covariate x follows a normal distribution with mean 1 and standard deviation 1. In total, we evaluate the performance of the model in 3 × 2 × 2 = 12 different cells.
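To illustrate how probit parameter values map to target rates: if the covariate coefficient γxt were zero, the class-k missing rate on occasion t would be exactly Φ(γztk), so targets translate directly through the probit quantile function. A sketch under that simplifying assumption follows (with a nonzero γxt, as in the actual design, the intercepts would need adjusting):

```python
import numpy as np
from scipy.stats import norm

rates_c1 = np.array([0.02, 0.04, 0.06, 0.08, 0.10])  # class 1, occasions 1-5
rates_c2 = np.array([0.04, 0.08, 0.12, 0.16, 0.20])  # class 2

gamma_z1 = norm.ppf(rates_c1)   # e.g., Phi^{-1}(0.02) is about -2.05
gamma_z2 = norm.ppf(rates_c2)
print(gamma_z1, gamma_z2)
```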
Simulation Implementation
In the simulation, the following procedure is operated automatically.
Set the counter R = 0.
Generate complete GMM data according to predefined model parameters.
Create missing data according to missing data mechanisms and missing data rates.
Generate Markov chains for model parameters through the Gibbs sampling procedure.
Test the convergence of generated Markov chains using the Geweke statistics (Geweke, 1992).
If the Markov chains pass the convergence test, then set R = R + 1 and calculate and save the parameter estimates. Otherwise, discard the current replication of the simulation.
Repeat the aforementioned process till R = 100 to obtain 100 replications of valid simulation.
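The driver can be sketched as follows; all five callables are placeholders for the model-specific routines (steps 2-6 above), so this shows only the accept-or-discard control flow.

```python
def run_simulation(generate_data, impose_missing, fit_gibbs, converged,
                   estimate, n_target=100):
    """Repeat steps 2-6 until n_target converged replications are collected;
    replications failing the convergence check are discarded."""
    results = []
    while len(results) < n_target:
        data = impose_missing(generate_data())   # steps 2-3
        chains = fit_gibbs(data)                 # step 4
        if converged(chains):                    # step 5, e.g., Geweke tests
            results.append(estimate(chains))     # step 6
    return results
```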
Researchers have found that there exists a label-switching problem in mixture models (e.g., Fruhwirth-Schnatter, 2001; Tueller, Drotar, & Lubke, 2011). In our analysis, we imposed some constraints on the priors to avoid the problem; for example, the intercept of the first class is constrained to be larger than that of the second class.
Because the simulation design is based on the real data analysis, the same set of uninformative priors and starting values as in the previous section (see Real Data Analysis section) is used for all simulation conditions. In generating Markov chains through the Gibbs sampling method, the burn-in period is set to the first 10,000 iterations and Markov chains with a length of 40,000 iterations are saved for data analysis.
In this study, the Gibbs sampling algorithm is implemented in the open-source software OpenBUGS (Thomas, O'Hara, Ligges, & Sturtz, 2006). OpenBUGS is flexible in fitting both simple and complex statistical models with a language similar to the R programming language. Lunn, Spiegelhalter, Thomas, and Best (2009), Zhang et al. (2007), and Zhang, McArdle, Wang, and Hamagami (2008) offer an overview of the use of OpenBUGS. For an in-depth account, see Congdon (2003) and Ntzoufras (2009). Sample OpenBUGS code for our current models is available on our website.
Results
For the purpose of presentation, let θj represent the jth parameter as well as its true value in the simulation. Let θ̂j(i) denote the estimate of θj in the ith simulation replication, and let ŝj(i) denote the estimated standard error of θ̂j(i). Let lj(i) and uj(i) denote the lower and upper limits of the 95% highest posterior density credible interval (HPD; Box & Tiao, 1973), respectively. For each of the 12 conditions in the simulation design, we calculate five statistics defined here based on 100 sets of converged simulation replications.
First, the average estimate (Est.avgj) across the 100 converged simulation replications of each parameter is obtained as Est.avgj = ∑_{i=1}^{100} θ̂j(i)/100. Second, the relative bias (Bias.relj) of each parameter is calculated using Bias.relj = (Est.avgj − θj)/θj when θj ≠ 0 and Bias.relj = Est.avgj − θj when θj = 0. Third, the empirical standard deviation (SD.empj) of each parameter is obtained as SD.empj = [∑_{i=1}^{100} (θ̂j(i) − Est.avgj)²/99]^{1/2}, and fourth, the average standard deviation (SD.avgj) of the same parameter is calculated by SD.avgj = ∑_{i=1}^{100} ŝj(i)/100. Fifth, the coverage probability of the 0.95 HPD credible interval (HPD.cvrj) of each parameter is obtained using HPD.cvrj = #{i: θj ∈ [lj(i), uj(i)]}/100.
For the sake of saving space and facilitating comparison, instead of presenting full results for each condition, we further calculate three summary statistics across all model parameters for each simulation condition. The detailed results for each condition can be found on our web page. First, we define the average absolute relative bias (|Bias.rel|) across all model parameters as |Bias.rel| = ∑_{j=1}^{p} |Bias.relj|/p. Second, we obtain the average absolute difference between the empirical SDs and the average Bayesian SDs (|SD.diff|) across all model parameters by using |SD.diff| = ∑_{j=1}^{p} |SD.empj − SD.avgj|/p. Third, we calculate the average coverage probability (HPD.cvr) across all model parameters by using HPD.cvr = ∑_{j=1}^{p} HPD.cvrj/p. In these equations, p is the total number of parameters in a model. These three statistics for all 12 simulation conditions are given in Table 4.
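A sketch of these computations in NumPy, assuming the per-replication estimates, SEs, and HPD limits for one condition are stacked into (100, p) arrays:

```python
import numpy as np

def condition_summary(est, se, lo, hi, theta):
    """est, se, lo, hi: (100, p) arrays over replications; theta: (p,) truths."""
    est_avg = est.mean(axis=0)                          # Est.avg_j
    bias = est_avg - theta
    bias_rel = np.divide(bias, theta, out=bias.copy(),
                         where=theta != 0)              # Bias.rel_j
    sd_emp = est.std(axis=0, ddof=1)                    # SD.emp_j
    sd_avg = se.mean(axis=0)                            # SD.avg_j
    cvr = ((lo <= theta) & (theta <= hi)).mean(axis=0)  # HPD.cvr_j
    return (np.abs(bias_rel).mean(),                    # |Bias.rel|
            np.abs(sd_emp - sd_avg).mean(),             # |SD.diff|
            cvr.mean())                                 # HPD.cvr
```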
TABLE 4.
Summary and Comparison of Simulation Results. The Results Are Based on the Converged Replications.
| | | Equal Classes | | Unequal Classes | |
| Sample Size | | MNAR | MCAR | MNAR | MCAR |
|---|---|---|---|---|---|
| 1,500 | |Bias.rel|a | 0.022 | 0.009 | 0.026 | 0.011 |
| |SD.diff|b | 0.011 | 0.008 | 0.011 | 0.008 | |
| HPD.cvrc | 0.956 | 0.941 | 0.949 | 0.954 | |
| 1,000 | |Bias.rel| | 0.023 | 0.012 | 0.032 | 0.016 |
| |SD.diff| | 0.012 | 0.010 | 0.012 | 0.015 | |
| HPD.cvr | 0.950 | 0.951 | 0.952 | 0.948 | |
| 500d | |Bias.rel| | 0.030 | 0.012 | 0.068 | 0.016 |
| |SD.diff| | 0.015 | 0.020 | 0.042 | 0.021 | |
| HPD.cvr | 0.952 | 0.945 | 0.954 | 0.952 | |
aThe average absolute relative bias across all model parameters, defined by |Bias.rel| = ∑_{j=1}^{p} |Bias.relj|/p.

bThe average absolute difference between the empirical SDs and the average Bayesian SDs across all model parameters, defined by |SD.diff| = ∑_{j=1}^{p} |SD.empj − SD.avgj|/p.

cThe average coverage probability across all model parameters, defined by HPD.cvr = ∑_{j=1}^{p} HPD.cvrj/p.

dWith a sample size of 500, the convergence rate under unequal classes and MNAR missingness is 100/147 ≈ 68%. MNAR = missing not at random; MCAR = missing completely at random.
Based on the results in Table 4, we can conclude the following. First, the proposed Bayesian method can recover model parameters very well because (a) the relative biases are all small (e.g., the maximum bias is about 6.8%, which occurs when the sample size is 500, the class probability is unequal, and the missingness is MNAR) and (b) the average coverage probabilities are all close to the nominal value 95%. The correct coverage probabilities also indicate that we can use the estimated confidence intervals to conduct statistical inference. Second, with the increase of the sample size, (a) the relative biases get smaller, which shows that estimates get closer to their true values, and (b) the average Bayesian SDs get closer to the empirical SDs, which shows that standard errors become more accurate. Third, the small difference between the empirical SD and the average Bayesian SD in all conditions not only demonstrates that the Bayesian method used in the study can estimate the standard errors very well but also indicates that throwing away the nonconverged cases in our simulation does not influence the simulation results. Fourth, this model works equally well for both the MNAR missingness and the MCAR missingness. In both cases, the parameter estimate biases are small, the differences between empirical SDs and average Bayesian SDs are tiny, and the coverage probabilities are close to the nominal level 95%.
DISCUSSION
This article presents a Bayesian method to estimate an extended GMM with LCD missingness. This model is a further extension of the finite mixture model proposed by Muthén and Shedden (1999). Instead of using the maximum likelihood estimation method, we employ a full Bayesian method. The simulation study shows that the Bayesian approach performs well, especially when the sample size is large. In the following paragraphs, we discuss six specific aspects of our study in more detail.
Misspecified Missingness Mechanism
It might be expected that misspecification of the missingness mechanism may cause substantial misclassification of participants. For the purpose of illustration, we conducted a small additional simulation. In this additional simulation, the sample size is 1,500; the class proportion is (30%, 70%); β[1] = 0, ψ11 = ψ22 = 0.5, ψ12 = ψ21 = 0, and ϕ = 1 for both classes; and the slope β[2] = 0 for Class 1 and β[2] = 1.3 for Class 2. The distance between the two populations (Anderson & Bahadur, 1962) is 1.73. LCD missing data are then generated with different missing data proportions for the different classes. The generated data are analyzed using two models. The first one uses the proposed method with the missingness mechanism modeled. For the second one, the settings are kept the same except that the missingness mechanism is ignored. To keep this article to a reasonable length, the simulation results are uploaded to our website. Table 5 gives the number of misclassified participants under the ignorable and nonignorable missingness assumptions. The results clearly show that modeling nonignorable missingness as ignorable missingness can cause severe misclassification. This topic will be further investigated in our future work.
TABLE 5.
Classification Under Ignorable and Nonignorable Missingness Mechanism Assumptions
| | | Missingness Mechanism Assumption | | | |
| | | Nonignorablea | | Ignorableb | |
| True Modelc | N | Class 1 | Class 2 | Class 1 | Class 2 |
|---|---|---|---|---|---|
| Class 1 | 506 | 437 | 69 | 502 | 4 |
| Class 2 | 994 | 81 | 913 | 991 | 3 |
| Total | 1,500 | 518 | 982 | 1,493 | 7 |
aModeling GMM and latent class dependent missingness.

bModeling GMM only, ignoring the missingness mechanism.

cGMM with latent class dependent missing data.
Sample Size
Generally speaking, it is difficult to provide a rule of thumb for the sample size required to distinguish between latent classes in GMMs because the requirement depends on class separation, model complexity, and other properties of the model (Lubke & Neale, 2006, 2008). The problem becomes even more complex when the missingness mechanism is considered in addition to the GMM, because the outcome variables must provide enough information to estimate the probit regression model parameters well. With respect to the factors examined in this study, the model and estimation method can perform very well with a sample size of 1,500, 1,000, or 500 when missing data rates are small (the lowest around 2% in our simulation). Our experience shows that if the lowest missing data rate is relatively large (e.g., around 5%), the proposed model and estimation method can still provide useful information with a sample size as small as 200.
Sensitivity of the Model
The model discussed in this study can be viewed as an example of the selection models (e.g., Heckman, 1976; Heckman & Robb, 1986; Little & Rubin, 2002), but with a more complex form. The missing mechanism is modeled explicitly by including the latent class membership as a covariate. However, our model suffers from the same sensitivity problem as any other selection model. If the missingness does not depend on the class membership but on some other latent or unobserved variables, our model becomes misspecified and thus may not yield valid parameter estimates. Fortunately, as we have shown, the Bayesian method can be very flexible in modeling the missing mechanism because the conditional posteriors can be obtained relatively easily through the data augmentation algorithm. Therefore, once the missing mechanism is understood, it can be modeled following the procedure outlined in this study.
Number of Latent Classes
The model and method proposed in this study are based on GMMs with a fixed number of components. For mixture models with an unknown number of components, Richardson and Green (1997) used reversible jump Markov chain Monte Carlo (Green, 1995) in a full Bayesian analysis. McLachlan (1987) proposed bootstrap methods (e.g., Efron & Tibshirani, 1993) to deal with problems involved in the likelihood ratios. Lee and Song (2003) employed the Bayes factor (e.g., Berger, 1985; Kass & Raftery, 1995) and path sampling (Gelman & Meng, 1998) in Bayesian procedures of model selection for mixtures of SEMs. These techniques can be applied to our model to determine the number of latent classes.
Model Comparison
There are several criteria for model comparison. The deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Linde, 2002) is a recently developed model comparison criterion designed for complex hierarchical models. DIC can be viewed as a Bayesian version or generalization of the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC, or Schwarz criterion; Schwarz, 1978). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. However, there is currently no exact definition of DIC for GMMs with missing data. The problem arises mainly in at least two areas: the mixture structure and the posterior class membership. First, for mixture models or random effects models, the log-likelihood function for p(y|θ) can be an observed-data log-likelihood function, a complete-data log-likelihood function, or a conditional log-likelihood function (see Celeux, Forbes, Robert, & Titterington, 2006). Second, when calculating the deviance for the final estimated parameters, it is not clear which posterior estimate of the class membership should be plugged into each individual's likelihood function; it could be a posterior mode or a posterior mean. For GMMs with missing data, designing an effective model comparison criterion is an interesting topic for future work.
Future Directions
The models proposed in our article can be further developed in various ways. First, the missingness can be predicted by both latent random effects and the latent class membership. The outcome variables, other covariates that could explain the missingness, and any combination of these variables can also be included in the model. Although such models can take much more complex forms, the same Bayesian estimation procedure proposed in this study can be implemented. Second, in this study, a hybrid Gibbs sampling procedure is used: when a conditional posterior does not have an explicit form, as for the probit parameters φ and γ, the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) is used to generate random numbers from it. Furthermore, a Markov chain is produced for each missing datum, which is not very efficient when there are many missing data. Thus, future research should develop more efficient ways to deal with missing data. Third, as mentioned earlier, this model may be sensitive to the missing mechanism and model specification. Therefore, a study can be conducted to evaluate how the model responds to model misspecification.
ACKNOWLEDGMENTS
Dr. Gitta H. Lubke was supported by DA 018673 by NIDA. Dr. Zhiyong Zhang was partially supported by the Faculty Research Program at the University of Notre Dame. We thank Dr. Joseph L. Rodgers and the other three anonymous reviewers for their helpful comments and suggestions.
APPENDIX A
Prior Distributions
For ϕk (k = 1, 2, …, K), an inverse Gamma distribution is used,

ϕk ~ IG(v0k, s0k),

where v0k and s0k are known hyperparameters. The inverse Gamma distribution has a density function

p(ϕk) = [s0k^{v0k}/Γ(v0k)] ϕk^{−(v0k+1)} exp(−s0k/ϕk).
For βk (k = 1, 2, …, K), the multivariate normal prior is used,

βk ~ MNq(βk0, Σk0),

where the hyperparameter βk0 is a q-dimensional vector and Σk0 is a q × q matrix.
For Ψk (k = 1, 2, …, K), the inverse Wishart distribution prior is used,

Ψk ~ IWq(Vk0, mk0),

where the hyperparameter mk0 is a scalar and Vk0 is a q × q matrix. The inverse Wishart distribution has a density function

p(Ψk) ∝ |Ψk|^{−(mk0+q+1)/2} exp[−tr(Vk0Ψk^{−1})/2].
For φk (k = 1, 2, …, K − 1), we use an (r + 1)-dimensional multivariate normal distribution,

φk ~ MNr+1(μφk, Σφk),

where μφk, an (r + 1)-dimensional vector, and Σφk, an (r + 1) × (r + 1) matrix, are predetermined hyperparameters.
The prior for γt (t = 1, 2, …, T) is chosen to be a multivariate normal distribution,

γt ~ MNK+r(γt0, Dt0),

where γt0, a (K + r)-dimensional vector, and Dt0, a (K + r) × (K + r) matrix, are predetermined hyperparameters.
APPENDIX B
Posterior Distributions
Let nk = ∑_{i=1}^{N} zik be the number of individuals who are in the kth class, and denote the set (η1, η2, …, ηN) as η.
Conditional posterior distribution for ϕk, k = 1, 2, …, K
The conditional posterior distribution for ϕk is an inverse Gamma distribution,

(ϕk | y, η, z) ~ IG(ak, bk),    (12)

where ak = v0k + nkT/2 and bk = s0k + (1/2)∑_{i: zik=1} (yi − Λkηi)′(yi − Λkηi).
Conditional posterior distribution for Ψk, k = 1, 2, …, K
The conditional posterior distribution for Ψk is an inverse Wishart distribution,

(Ψk | η, z, βk) ~ IWq(Ṽk, m̃k),    (13)

where m̃k = mk0 + nk and Ṽk = Vk0 + ∑_{i: zik=1} (ηi − βk)(ηi − βk)′.
Conditional posterior distribution for βk, k = 1, 2, …, K
The conditional posterior distribution for βk is a multivariate normal distribution,

(βk | η, z, Ψk) ~ MNq(β̃k, Σ̃k),    (14)

where Σ̃k = (Σk0^{−1} + nkΨk^{−1})^{−1} and β̃k = Σ̃k(Σk0^{−1}βk0 + Ψk^{−1}∑_{i: zik=1} ηi).
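As one concrete instance of these conjugate draws, a NumPy sketch of sampling βk from Equation (14); eta_k stacks the ηi of individuals currently assigned to class k, and the forms follow the standard normal-normal update reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_beta_k(eta_k, Psi_k, beta_k0, Sigma_k0):
    """One draw from the full conditional of beta_k: prior MN(beta_k0,
    Sigma_k0) combined with eta_i ~ MN(beta_k, Psi_k) for i in class k."""
    n_k = eta_k.shape[0]
    S0_inv = np.linalg.inv(Sigma_k0)
    P_inv = np.linalg.inv(Psi_k)
    cov = np.linalg.inv(S0_inv + n_k * P_inv)          # posterior covariance
    mean = cov @ (S0_inv @ beta_k0 + P_inv @ eta_k.sum(axis=0))
    return rng.multivariate_normal(mean, cov)
```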
Conditional posterior distribution for φk, k = 1, 2, …, (K − 1)
When k = 1, the conditional posterior distribution for φ1 is

p(φ1 | z, x, φ2) ∝ p(φ1) ∏_{i=1}^{N} πi1^{zi1} πi2^{zi2}.    (15)

When 2 ≤ k ≤ K − 2, the conditional posterior distribution of φk is

p(φk | z, x, φk−1, φk+1) ∝ p(φk) ∏_{i=1}^{N} πik^{zik} πi,k+1^{zi,k+1}.    (16)

Finally, when k = K − 1, the conditional posterior distribution of φK−1 is

p(φK−1 | z, x, φK−2) ∝ p(φK−1) ∏_{i=1}^{N} πi,K−1^{zi,K−1} πiK^{ziK}.    (17)

The πik in Equations (15), (16), and (17) is defined through Equations (6) and (7).
Conditional posterior distribution for γt, t = 1, 2, …, T
The conditional posterior distribution for γt is

p(γt | z, m, x) ∝ p(γt) ∏_{i=1}^{N} τit^{mit}(1 − τit)^{1−mit},    (18)

where τit = Φ(wi′γt) with wi = (zi′, xi′)′ as defined by Equation (9).
Conditional posterior distribution for zi, i = 1, 2, …, N
The conditional posterior distribution for zi is a multinomial distribution,

(zi | yi, mi, ηi) ~ Multinomial(1; pi1, pi2, …, piK),    (19)

where pik = vik/∑_{j=1}^{K} vij with vik defined in Equation (11).
Conditional posterior distribution for ηi, i = 1, 2, …, N
The conditional posterior distribution for ηi is a multivariate normal distribution; given zik = 1,

(ηi | yi, zik = 1) ~ MNq(η̃i, Ω̃i),    (20)

where Ω̃i = (Ψk^{−1} + Λk′Λk/ϕk)^{−1} and η̃i = Ω̃i(Ψk^{−1}βk + Λk′yi/ϕk).
Conditional posterior distribution for yi^mis, i = 1, 2, …, N

The conditional posterior distribution for the missing data yi^mis is a normal distribution; given zik = 1,

(yi^mis | ηi, zik = 1) ~ MN(Λk^mis ηi, ϕkI),    (21)

where Λk^mis collects the rows of Λk corresponding to the missing occasions, and its dimension and location depend on the corresponding mi value.
Footnotes
Throughout the article, MNn denotes an n-dimensional multivariate normal distribution.
Here we have two probabilities that need to be distinguished. The class probability πk is a class-specific population parameter in the model, whereas the posthoc posterior probability is an individual variable that is computed for each individual once model parameters have been estimated.
Note that this is only one way to specify a regression model for categorical variables.
With 10,000 burn-ins, the Markov chains for all parameters converged.
This method tests the convergence of Markov chain by comparing the means of two subsets of the chain.
With 70,000 iterations, the ratio of MCse/sd is less than 0.05 for all parameters, which indicates that the estimates are accurate. An example of inaccurate estimates obtained with 2,000 burn-ins and 5,000 iterations can be found on our website: (http://nd.psychstat.org/research/luzhanglubke2010) for comparison.
To be consistent with the real data analysis, γzt1 and γzt2 are reparameterized as γ*zt1 and γ*zt2 with γ*zt1 = γzt1 and γ*zt2 = γzt2 − γzt1.
REFERENCES
- Agresti A. Categorical data analysis. 2nd ed. Wiley; Hoboken, NJ: 2002.
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- Anderson TW, Bahadur RR. Classification into two multivariate normal distributions with different covariance matrices. The Annals of Mathematical Statistics. 1962;33:420–431.
- Ansari A, Jedidi K, Jagpal S. A hierarchical Bayesian methodology for treating heterogeneity in structural equation models. Marketing Science. 2000;19:328–347.
- Baltes PB, Nesselroade JR. History and rationale of longitudinal research. In: Nesselroade JR, Baltes PB, editors. Longitudinal research in the study of behavior and development. Academic Press; New York, NY: 1979. pp. 1–39.
- Barnard J, Frangakis CE, Hill JL, Rubin DB. Principal stratification approach to broken randomized experiments: A case study of school choice vouchers in New York City. Journal of the American Statistical Association. 2003;98.
- Bartholomew DJ, Knott M. Latent variable models and factor analysis: Kendall's library of statistics. Vol. 7. Edward Arnold; New York, NY: 1999.
- Berger JO. Statistical decision theory and Bayesian analysis. 2nd ed. Springer-Verlag; New York, NY: 1985.
- Bollen KA. Structural equations with latent variables. Wiley; New York, NY: 1989.
- Box GEP, Tiao GC. Bayesian inference in statistical analysis. John Wiley & Sons; Hoboken, NJ: 1973.
- Bureau of Labor Statistics, U.S. Department of Labor. National longitudinal survey of youth 1997 cohort, 1997–2003 (Rounds 1–7) [Computer file, 2005]. National Opinion Research Center, the University of Chicago; Center for Human Resource Research, The Ohio State University; Columbus, OH: 1997. Retrieved from http://www.bls.gov/nls/nlsy97.htm
- Cai JH, Song XY. A Bayesian analysis of mixtures in structural equation models with nonignorable missing data. British Journal of Mathematical and Statistical Psychology. 2010;63:491–508. doi: 10.1348/000711009X475187
- Cai JH, Song XY, Hser YI. A Bayesian analysis of mixture structural equation models with non-ignorable missing responses and covariates. Statistics in Medicine. 2010;29:1861–1874. doi: 10.1002/sim.3915
- Casella G, George EI. Explaining the Gibbs sampler. The American Statistician. 1992;46(3):167–174.
- Celeux G, Forbes F, Robert C, Titterington D. Deviance information criteria for missing data models. Bayesian Analysis. 2006;4:651–674.
- Congdon P. Applied Bayesian modelling. Wiley; New York, NY: 2003.
- Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the coronary drug project. New England Journal of Medicine. 1980;303:1038–1041. doi: 10.1056/NEJM198010303031804
- Demidenko E. Mixed models: Theory and applications. Wiley; New York, NY: 2004.
- Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics). 1994;43:49–93.
- Dunson DB. Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society, Series B. 2000;62:355–366.
- Efron B, Tibshirani R. An introduction to the bootstrap. CRC Press; New York, NY: 1993.
- Elliott MR, Gallo JJ, Have TRT, Bogner HR, Katz IR. Using a Bayesian latent growth curve model to identify trajectories of positive affect and negative events following myocardial infarction. Biostatistics. 2005;6:119–143. doi: 10.1093/biostatistics/kxh022
- Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. Wiley; Hoboken, NJ: 2004.
- Frangakis CE, Rubin DB. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika. 1999;86:365–379.
- Frühwirth-Schnatter S. MCMC estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association. 2001;96:194–209.
- Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. 2nd ed. Chapman & Hall/CRC; Boca Raton, FL: 2003.
- Gelman A, Meng X-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science. 1998;13:163–185.
- Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;6:721–741. doi: 10.1109/tpami.1984.4767596
- Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics. Vol. 4. Clarendon Press; Oxford, UK: 1992. pp. 169–193.
- Gill J. Bayesian methods: A social and behavioral sciences approach. CRC Press; Boca Raton, FL: 2002.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109.
- Heckman J. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement. 1976;5:475–492.
- Heckman J, Robb R. Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes. In: Wainer H, editor. Drawing inferences from self-selected samples. Springer; New York, NY: 1986. pp. 63–107.
- Hedeker D, Gibbons RD. Longitudinal data analysis. Wiley; Hoboken, NJ: 2006.
- Hedges LV. Fixed effects models. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. Russell Sage Foundation; New York, NY: 1994. pp. 285–299.
- Jordan MI, Xu L. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks. 1995;8:1409–1431.
- Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;90:773–795.
- Lee SY. A Bayesian approach to confirmatory factor analysis. Psychometrika. 1981;46:153–160.
- Lee SY. Structural equation modeling: A Bayesian approach. John Wiley & Sons; Chichester, UK: 2007.
- Lee SY, Shi JQ. Joint Bayesian analysis of factor scores and structural parameters in the factor analysis model. Annals of the Institute of Statistical Mathematics. 2000;52:722–736.
- Lee SY, Song XY. Bayesian model selection for mixtures of structural equation models with an unknown number of components. British Journal of Mathematical and Statistical Psychology. 2003;56:145–165. doi: 10.1348/000711003321645403
- Lee SY, Tang NS. Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika. 2006;71:541–564.
- Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Wiley-Interscience; New York, NY: 2002.
- Long JS. Regression models for categorical and limited dependent variables. Sage; Thousand Oaks, CA: 1997.
- Lubke GH, Neale MC. Distinguishing between latent classes and continuous factors: Resolution by maximum likelihood? Multivariate Behavioral Research. 2006;41:499–532. doi: 10.1207/s15327906mbr4104_4
- Lubke GH, Neale MC. Distinguishing between latent classes and continuous factors with categorical outcomes: Class invariance of parameters of factor mixture models. Multivariate Behavioral Research. 2008;43:592–620. doi: 10.1080/00273170802490673
- Luke DA. Multilevel modeling (Quantitative applications in the social sciences). Sage Publications; Thousand Oaks, CA: 2004.
- Lunn D, Spiegelhalter D, Thomas A, Best N. The BUGS project: Evolution, critique and future directions (with discussion). Statistics in Medicine. 2009;28:3049–3082. doi: 10.1002/sim.3680
- McCullagh P, Nelder J. Generalized linear models. 2nd ed. Chapman & Hall/CRC; Boca Raton, FL: 1989.
- McLachlan G, Peel D. Finite mixture models. John Wiley & Sons; New York, NY: 2000.
- McLachlan GJ. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics. 1987;36:318–324.
- Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E. Equations of state calculations by fast computing machines. Journal of Chemical Physics. 1953;21:1087–1092.
- Muthén B, Brown CH. Non-ignorable missing data in a general latent variable modeling framework. 2001. Unpublished draft.
- Muthén B, Jo B, Brown CH. Principal stratification approach to broken randomized experiments: A case study of school choice vouchers in New York City (with comment). Journal of the American Statistical Association. 2003;98:311–314.
- Muthén B, Shedden K. Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics. 1999;55:463–469. doi: 10.1111/j.0006-341x.1999.00463.x
- Ntzoufras I. Bayesian modeling using WinBUGS. John Wiley & Sons; Hoboken, NJ: 2009.
- Preacher KJ, Wichman AL, MacCallum RC, Briggs NE. Latent growth curve modeling. Sage; Thousand Oaks, CA: 2008.
- Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B. 1997;59:731–792.
- Robert CP, Casella G. Monte Carlo statistical methods. Springer; New York, NY: 2004.
- Roeder K, Wasserman L. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association. 1997;92:894–902.
- Roy J. Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Biometrics. 2003;59:829–836. doi: 10.1111/j.0006-341x.2003.00097.x
- Schafer JL. Analysis of incomplete multivariate data. Chapman & Hall/CRC; Boca Raton, FL: 1997.
- Scheines R, Hoijtink H, Boomsma A. Bayesian estimation and testing of structural equation models. Psychometrika. 1999;64:37–52.
- Schwarz GE. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Singer JD, Willett JB. Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press; New York, NY: 2003.
- Song XY, Lee SY. Bayesian analysis of latent variable models with nonignorable missing outcomes from exponential family. Statistics in Medicine. 2007;26:681–693. doi: 10.1002/sim.2530
- Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B (Statistical Methodology). 2002;64:583–639.
- Spiegelhalter DJ, Thomas A, Best N, Lunn D. WinBUGS manual. Version 1.4. MRC Biostatistics Unit, Institute of Public Health; Cambridge, UK: 2003. Retrieved from http://www.mrc-bsu.cam.ac.uk/bugs
- Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association. 1987;82:528–540.
- Taylor L, Zhou XH. Relaxing latent ignorability in the ITT analysis of randomized studies with missing data and noncompliance. UW Biostatistics Working Paper Series; Seattle, WA: February 2009.
- Thomas A, O'Hara B, Ligges U, Sturtz S. Making BUGS open. R News. 2006;6:12–17.
- Tueller S, Drotar S, Lubke G. Addressing the problem of switched class labels in latent variable mixture model simulation studies. Structural Equation Modeling: A Multidisciplinary Journal. 2011;18:110–131.
- van Dyk DA, Meng X-L. The art of data augmentation. Journal of Computational and Graphical Statistics. 2001;10:1–50.
- Willett J, Sayer A. Using covariance structure analysis to detect correlates and predictors of individual change over time. Psychological Bulletin. 1994;116:363–381.
- Wu MC, Bailey KR. Estimation and comparison of changes in the presence of informative right censoring: Conditional linear model. Biometrics. 1989;45:939–955.
- Wu MC, Carroll RJ. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics. 1988;44:175–188.
- Yung YF. Finite mixtures in confirmatory factor-analysis models. Psychometrika. 1997;62:297–330.
- Zhang Z, Hamagami F, Wang L, Grimm KJ, Nesselroade JR. Bayesian analysis of longitudinal data using growth curve models. International Journal of Behavioral Development. 2007;31:374–383.
- Zhang Z, McArdle JJ, Wang L, Hamagami F. A SAS interface for Bayesian analysis with WinBUGS. Structural Equation Modeling. 2008;15:705–728.
- Zhu HT, Lee SY. A Bayesian analysis of finite mixtures in the LISREL model. Psychometrika. 2001;66:133–152.


