Educational and Psychological Measurement
2022 Apr 29;83(2):351–374. doi: 10.1177/00131644221093619

Evaluating the Quality of Classification in Mixture Model Simulations

Yoona Jang and Sehee Hong
PMCID: PMC9972124  PMID: 36866069

Abstract

The purpose of this study was to evaluate the degree of classification quality in the basic latent class model when covariates either are or are not included in the model. To accomplish this task, Monte Carlo simulations were conducted in which the results of models with and without a covariate were compared. Based on these simulations, it was determined that models without a covariate better predicted the number of classes. These findings in general supported the use of the popular three-step approach, with its quality of classification determined to be more than 70% under various conditions of covariate effect, sample size, and quality of indicators. In light of these findings, the practical utility of evaluating classification quality is discussed relative to issues that applied researchers need to carefully consider when applying latent class models.

Keywords: latent class analysis, Monte Carlo simulation, sample size, quality of classification, effects of covariates

Introduction

Mixture modeling (McLachlan & Peel, 2004) is widely used as an analytic tool within the behavioral, educational, and social sciences (e.g., DiStefano & Kamphaus, 2006; Eid et al., 2003; Klonsky & Olino, 2008; Pinquart & Schindler, 2007; Van Gaalen & Dykstra, 2006; Yang et al., 2005). An important difference between mixture modeling and most conventional analysis tools is that mixture modeling tries to explain and describe potential unobserved population heterogeneity. In contrast, conventional analysis tools (such as simple regression) assume that the studied population is instead comprised of a homogeneous distribution. Mixture modeling can also be considered a more person-oriented approach that emphasizes the patterns of individual characteristics and not a variable-oriented approach that explores the relations between variables (Bergman et al., 2003; Bergman & Magnusson, 1997).

A basic form of mixture modeling, latent class analysis (LCA) is used for cross-sectional categorical indicator variables based on a latent variable that represents a set of latent classes (Collins & Lanza, 2013; Goodman, 1974a, 1974b; Lazarsfeld & Henry, 1968). LCA has an advantage in that researchers can classify individuals and identify their latent classes according to their overall response patterns to the items rather than just by the sum or the mean score of their responses.

Although many studies have only used measured indicators when conducting LCA (J. J. Li & Lee, 2010; Monga et al., 2007; Sullivan et al., 2002), there has been an increase in the number of studies that have looked to include covariates to improve classification performance and goodness-of-fit indices (Carlson et al., 2005; Sysko et al., 2011). Although the use of covariates can improve model fit indices, covariates can potentially change the structure and interpretation of latent classes. Therefore, applied researchers need to decide whether to use a model with or without covariates to most effectively address their research questions, which means clear guidelines for the use of covariates in mixture modeling are required.

To establish these guidelines, several simulation studies have been conducted, but they have led to different conclusions. Some researchers have recommended the inclusion of covariates in the modeling process, in what is referred to as the one-step approach (Lubke & Muthén, 2007; Wurpts & Geiser, 2014). However, some have argued that covariates should be only partially included (Clark & Muthén, 2009), while others have asserted that models containing covariates should be compared with an exclusion model before choosing one or the other (L. Li & Hser, 2011). It has also been suggested that covariates should be excluded completely from the class classification process (Nylund-Gibson & Masyn, 2016).

Models without covariates, which employ a three-step approach (Asparouhov & Muthén, 2014; Kim et al., 2016; Vermunt, 2010), have recently become more widely used. The three-step approach classifies the latent classes first and then estimates the relationship between the latent classes and the covariates. Unlike the one-step approach, which analyzes the effect of covariates and latent classes simultaneously, this approach excludes the effects of covariates when identifying the classes.

Although the three-step approach has been shown to perform well in many simulation studies, the classification quality of this method has rarely been investigated or compared with other approaches. Many previous simulation studies have focused on assessing the estimated coefficients with coverage and the mean square error (Clark & Muthén, 2009) or evaluating models using model fit indices. Some studies that have attempted to evaluate the classification quality have only examined the bias in the probability or proportions of the classes. Thus, previous research on this topic has not taken full advantage of simulation-based analysis in which researchers have access to information on the true classes within the generated data. With this information, the classification quality can be evaluated by quantifying the proportion of subjects correctly classified into their true latent classes rather than merely comparing the estimated class proportions with the true proportions from the data.

The recent study by Cassiday et al. (2021) did attempt to quantify the proportion of subjects that were correctly classified, but it employed a growth mixture model (GMM), which deals with longitudinal data. Examining the classification quality of LCA models, which deal with cross-sectional data, would therefore be a helpful complement. In addition, previous studies focusing on LCA with covariates have concentrated primarily on only two latent classes and have avoided determining the actual number of these classes. In real data, it is rare for only two latent classes to be identified, and this has been pointed out as a limitation of previous studies (Lubke & Muthén, 2007).

To fill this research gap, this study aimed to evaluate the classification performance of a basic LCA model with and without the inclusion of covariates using Monte Carlo simulations. To overcome the limitations of previous simulation studies, this study focused on three latent classes, while the model fit indices widely used in previous LCA studies and classification quality were measured to determine the performance of the tested models.

Theoretical Background

Basic LCA Model

LCA seeks patterns in the responses to multiple variables. These response patterns can be divided into several latent classes, and the attributes of these classes are determined by the characteristics of the response patterns (Collins & Lanza, 2013). Figure 1 presents a simplified version of an LCA model, with the squares representing indicator variables and the circle representing a latent class.

Figure 1. Basic Latent Class Analysis Model.

The probability that a specific response pattern y will appear is given by Equation (1). Suppose there are J latent class indicators, where item j has R_j possible responses r_j and C is the number of latent classes:

P(Y = y) = \sum_{c=1}^{C} \gamma_c \prod_{j=1}^{J} \prod_{r_j=1}^{R_j} \rho_{j, r_j \mid c}^{\,I(y_j = r_j)}. (1)

where \gamma_c represents the class proportions and \rho_{j, r_j \mid c} is the probability that a member of class c gives response r_j to the jth item. I(y_j = r_j) is an indicator function that equals 1 when the response to item j is r_j and 0 otherwise. The probability of a particular response pattern appearing for a specific latent class c is as follows:

P(Y = y \mid L = c) = \prod_{j=1}^{J} \prod_{r_j=1}^{R_j} \rho_{j, r_j \mid c}^{\,I(y_j = r_j)}. (2)

In Equation (2), because the classes cannot be observed directly, they are estimated using a likelihood function.

LCA determines the probability that a pattern belongs to the c th latent class. This is called the posterior probability and it is used to assign the latent class. It can be expressed as in Equation (3):

P(L=c|Y=y). (3)

Using Bayes' theorem, P(A \mid B) = P(B \mid A)\,P(A) / P(B), the posterior probability is as follows:

P(L = c \mid Y = y) = \frac{\gamma_c \prod_{j=1}^{J} \prod_{r_j=1}^{R_j} \rho_{j, r_j \mid c}^{\,I(y_j = r_j)}}{\sum_{s=1}^{C} \gamma_s \prod_{j=1}^{J} \prod_{r_j=1}^{R_j} \rho_{j, r_j \mid s}^{\,I(y_j = r_j)}}. (4)

The response pattern is then allocated to the latent class with the highest posterior probability. The class-specific indicator probabilities are estimated in logit form. Generally, when the latent class indicators have a binary response, the class-specific indicator probability is represented by a single threshold value on an inverse logit scale:

\rho_{j,1 \mid c} = \frac{1}{1 + \exp(\tau_{jc})}. (5)

The class proportions are parameterized as intercepts on the inverse multinomial logit scale, with \alpha_{0C} = 0 for identification:

\gamma_c = \frac{\exp(\alpha_{0c})}{\sum_{s=1}^{C} \exp(\alpha_{0s})}. (6)
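Equations (1) through (6) can be checked numerically. The following is a minimal sketch in Python with NumPy; all parameter values are illustrative assumptions, not estimates from this study. It computes class proportions from intercepts, item probabilities from thresholds, and the posterior class probabilities used for modal assignment.

```python
import numpy as np

def response_prob(tau):
    # Eq. (5): class-specific probability of endorsing a binary item
    return 1.0 / (1.0 + np.exp(tau))

def class_proportions(alpha):
    # Eq. (6): class proportions from multinomial-logit intercepts
    e = np.exp(alpha)
    return e / e.sum()

def posterior(y, gamma, rho):
    # rho[c, j] = P(item j endorsed | class c); y is a 0/1 response vector.
    # Within-class likelihood of the pattern: prod_j rho^y * (1 - rho)^(1 - y)
    lik = np.prod(rho ** y * (1.0 - rho) ** (1.0 - y), axis=1)
    joint = gamma * lik            # numerator of Eq. (4)
    return joint / joint.sum()     # Eq. (4)

# Hypothetical setup: three equal classes (all intercepts 0, cf. Eq. 6) and
# ten binary items whose endorsement probabilities separate the classes.
gamma = class_proportions(np.zeros(3))
rho = np.array([[0.9] * 10,
                [0.1] * 5 + [0.9] * 5,
                [0.1] * 10])
y = np.ones(10)                    # a subject who endorses every item
post = posterior(y, gamma, rho)
assigned = int(np.argmax(post))    # modal class assignment
```

A subject endorsing all ten items is assigned to the first class, whose members endorse each item with probability .9; note that a threshold of −2.2 in Equation (5) corresponds to this .9 endorsement probability.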

LCA With Covariates

Covariates can affect indicators either indirectly through the latent class variable or directly (Nylund-Gibson & Masyn, 2016). Figures 2 and 3 illustrate the two cases, respectively, with xi representing the covariate.

Figure 2. LCA Model With the Indirect Effect of a Covariate.

Note. LCA = latent class analysis.

Figure 3. LCA Model With the Direct Effect of a Covariate.

Note. LCA = latent class analysis.

For an indirect effect model, the class proportion with covariate x_i is given in Equation (7), where \alpha_{0C} = \alpha_{1C} = 0 for identification:

\gamma_c = \frac{\exp(\alpha_{0c} + \alpha_{1c} x_i)}{\sum_{s=1}^{C} \exp(\alpha_{0s} + \alpha_{1s} x_i)}. (7)

Equation (8) gives the class-specific item probability when a covariate has a direct effect that does not differ between classes:

\rho_{j,1 \mid c} = \frac{1}{1 + \exp(\tau_{jc} - \beta_j x_i)}. (8)

Based on Equations (7) and (8), the coefficients can be calculated and used in the data-generating process.
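As a sketch of how such coefficients enter the data-generating process, the functions below evaluate Equations (7) and (8); the coefficient values are hypothetical, not the study's actual conditions.

```python
import math

def class_proportions_with_covariate(a0, a1, x):
    # Eq. (7): class proportions under an indirect covariate effect
    logits = [a0c + a1c * x for a0c, a1c in zip(a0, a1)]
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]

def item_prob_with_direct_effect(tau, beta, x):
    # Eq. (8): class-specific item probability under a direct covariate effect
    return 1.0 / (1.0 + math.exp(tau - beta * x))

# Hypothetical coefficients; the last class is the reference
# (alpha_{0C} = alpha_{1C} = 0 for identification)
a0 = [0.5, -0.2, 0.0]
a1 = [0.41, 0.41, 0.0]   # logit coefficient 0.41, i.e., an odds ratio of 1.5
props_lo = class_proportions_with_covariate(a0, a1, 0.0)
props_hi = class_proportions_with_covariate(a0, a1, 1.0)
```

With a positive covariate coefficient for the first two classes, a higher covariate value shifts the expected class proportions away from the reference class, which is how the covariate induces heterogeneity in class membership during generation.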

Previous Simulation Studies

Several studies have been conducted on the effect of covariates in various models (Table 1). Lubke and Muthén (2007) and Wurpts and Geiser (2014) have suggested that analyses should include covariates, whereas Clark and Muthén (2009) asserted that the inclusion of covariates should be determined according to criteria such as the entropy, coverage, proportions of subjects, convergence rates, and coefficient bias.

Table 1.

Summary of Previous Simulation Studies.

Lubke & Muthén (2007). Generation model: FMA; 2 classes (0.5:0.5); 1 continuous covariate; 8 continuous indicators. Analysis model: same as the generation model (same number of classes). Criteria: coverage, proportion of subjects, entropy, convergence rates. Recommendation: inclusion.

Clark & Muthén (2009). Generation model: LCA; 2 classes (0.5:0.5); 1 continuous covariate; 10 binary indicators. Analysis model: five different regression approaches (same number of classes). Criteria: mean square error, coverage. Recommendation: partial inclusion.

L. Li & Hser (2011). Generation model: GMM; 2 classes (0.5:0.5); 1 binary and 1 continuous covariate; 7 continuous indicators. Analysis model: with or without a covariate; 1, 2, 3, 4, or 5 classes. Criteria: best number of classes by AIC, BIC, ABIC, LMR, ALMR, and BLRT. Recommendation: need to compare with and without the covariate.

Wurpts & Geiser (2014). Generation model: LCA; 2 classes (0.67:0.33) or 3 classes (0.4:0.4:0.2); 1 continuous covariate; 4 to 12 binary indicators. Analysis model: same as the generation model (same number of classes). Criteria: class proportion bias, mean conditional response probability bias, covariate effect bias. Recommendation: inclusion.

Nylund-Gibson & Masyn (2016). Generation model: LCA; 2 classes (balanced 0.5:0.5 or unbalanced 0.8:0.2); 1 continuous covariate; 5 binary indicators. Analysis model: three different models; 2, 3, or 4 classes. Criteria: best number of classes by BIC and BLRT. Recommendation: exclusion.

Note. FMA = factor mixture model analysis; LCA = latent class analysis; GMM = growth mixture model; AIC = Akaike information criterion; BIC = Bayesian information criterion; ABIC = adjusted BIC; LMR = Lo-Mendell-Rubin likelihood ratio test; ALMR = adjusted LMR likelihood ratio test; BLRT = bootstrap likelihood ratio test.

These studies, however, have ignored the two steps in the process of analyzing actual data in applied research: (a) determining the appropriate model for the data based on theory and accepted practice and (b) identifying the number of classes using model fit indices. This is a concern because it is especially important for simulation-based studies to identify the number of classes using model fit indices in their analysis. Unlike applied research dealing with actual data, simulations have access to information on the generated data and are able to compare different analysis models. The results of these analyses can help applied researchers to make more informed decisions regarding their study model.

Although Li and Hser (2011) and Nylund-Gibson and Masyn (2016) sought to determine the number of latent classes in their analyses, the results did not agree. These two studies also had some important limitations. First, they identified only two latent classes in their analyses, although real-life cases with only two identified latent classes are very rare. In addition, they did not measure the classification quality of their models. Even when the proportion of classes derived from the analysis model is the same as that in the generated data, it is still possible for subjects from a particular class in the generated data to be misclassified in the analysis model. As mentioned earlier, LCA is an exploratory analysis process that uses model fit indices to determine the number of classes. Therefore, the present simulation study replicates this exploratory process and then examines the classification quality.

Simulation Study

Design and Manipulated Conditions

Three studies were designed to test three data-generating models that differed according to the effect of the covariates: Study 1 with no covariate effect, Study 2 with an indirect covariate effect, and Study 3 with a direct covariate effect.

Study 1

The purpose of Study 1 was to compare models with and without a covariate when the generation model did not include a covariate effect. When the data were generated, the covariate effect was fixed at 0 because the analysis models that include a covariate cannot be fit unless covariate data are present.

The analysis models for Study 1 are presented in Figure 4. Model 1, which has no covariate, is the correct model for this no-covariate-effect generation model, whereas Model 2 is a misspecified model that includes a covariate and Model 3 is a saturated model that estimates all possible parameters. Although saturated models may fail in the estimation of a complicated model because they estimate all of the possible coefficients, they can produce better estimates for simple models. Thus, Model 3 was also tested because its performance may be better than that of the generation model. It was expected that there would be little difference between Models 1 and 2 in identifying the number of classes and in their classification quality because Model 2 was designed to estimate the effect of the covariate as 0.

Figure 4. Simulation Design for Study 1.

Note. In the generation model, the dashed line indicates that the coefficient is fixed at 0.

Study 2

Study 2 compared the performance of models with and without a covariate when the generated model included a covariate. In the generation model, the covariate had an indirect effect on the indicators and a direct effect on the latent class variable. Study 2 had three conditions for the effect of the covariate on the latent class variable. This effect is related to Equation (7) and is employed as the odds ratio for the class proportion. An odds ratio of 1.5 has a small effect size, 2.5 has a moderate effect size, and 4.0 has a large effect size (Rosenthal, 1996). When converted, the parameters were 0.41, 0.92, and 1.39, respectively.
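The converted parameters are simply the natural logarithms of the odds ratios (the coefficient \alpha_{1c} in Equation (7) is on the logit scale), which can be verified directly:

```python
import math

# The covariate coefficient on the logit scale is the natural log of the
# corresponding odds ratio: small, moderate, and large effect sizes.
coefs = {odds_ratio: round(math.log(odds_ratio), 2)
         for odds_ratio in (1.5, 2.5, 4.0)}
# coefs == {1.5: 0.41, 2.5: 0.92, 4.0: 1.39}
```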

The analysis models for Study 2 are presented in Figure 5. Model 1 was a misspecified model with no covariate, whereas Model 2 was the correct model with an indirect covariate effect and Model 3 was a saturated model. If the performance of the model without a covariate is unsatisfactory, there will be large differences in recovering the number of classes and in the classification quality. In contrast, if the difference between the two models is small, selecting the model without a covariate will improve parsimony.

Figure 5. Simulation Design for Study 2.

Study 3

The goal of Study 3 was to compare models with and without a covariate when the generated model had a covariate with a direct effect on a single indicator. As shown in Equation (8), the covariate for the indicator affects the threshold value via the odds ratio for the class-specific probability. Study 3 employed the same three conditions as Study 2 for the effect of the covariate using the same standard values.

The four analysis models for Study 3 are shown in Figure 6. Both Models 1 and 2 were misspecified, with the former having no covariate and the latter an indirect covariate. Model 3 was a saturated model that estimates all of the possible parameters. Model 4 was the correctly specified model for a covariate with a direct effect. Because both Models 1 and 2 are misspecified, the results for Study 3 are difficult to predict; in applied research, the better-performing of the two would be selected.

Figure 6. Simulation Design for Study 3.

Common Simulation Conditions for All Three Studies

Each of the three studies shared a number of the same simulation settings. There were 10 indicators, each of which was a binary variable with two response categories. There were three latent classes in the generation model present in equal proportions (i.e., 1:1:1). The covariate was a continuous variable with an average of 0 and a variance of 1. The sample sizes were set at 300, 500, and 1,000 with 500 replications.

The quality of an indicator is important because it affects the separation of the latent classes. The higher the indicator quality, the higher the likelihood that the subjects in one class will have the same response patterns. An indicator can be considered high quality when the probability that the members of a group will respond to one outcome of a binary indicator is 0.9 (and 0.1 for the other outcome), while probabilities of 0.8 and 0.7 are considered moderate and low quality, respectively (Collins & Wugalter, 1992). This class-specific probability is illustrated by Equation (8), and each threshold can be obtained by calculation. The calculated threshold parameters for the present study were 2.20, 1.39, and 0.85, respectively. The threshold conditions for each class are presented in Table 2.
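The reported threshold magnitudes follow from inverting the inverse-logit link in Equation (5); a quick check:

```python
import math

def threshold_for(p):
    # Invert Eq. (5): solve 1 / (1 + exp(tau)) = p for tau
    return math.log((1.0 - p) / p)

# High (.9), moderate (.8), and low (.7) indicator quality
magnitudes = [round(abs(threshold_for(p)), 2) for p in (0.9, 0.8, 0.7)]
# magnitudes == [2.2, 1.39, 0.85]
```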

Table 2.

Threshold Values for Each Class.

Quality of indicators
High Moderate Low
Indicator C#1 C#2 C#3 C#1 C#2 C#3 C#1 C#2 C#3
1 −2.2 −2.2 2.2 −1.39 −1.39 1.39 −0.85 −0.85 0.85
2 −2.2 −2.2 2.2 −1.39 −1.39 1.39 −0.85 −0.85 0.85
3 −2.2 −2.2 2.2 −1.39 −1.39 1.39 −0.85 −0.85 0.85
4 −2.2 −2.2 2.2 −1.39 −1.39 1.39 −0.85 −0.85 0.85
5 −2.2 −2.2 2.2 −1.39 −1.39 1.39 −0.85 −0.85 0.85
6 −2.2 2.2 2.2 −1.39 1.39 1.39 −0.85 0.85 0.85
7 −2.2 2.2 2.2 −1.39 1.39 1.39 −0.85 0.85 0.85
8 −2.2 2.2 2.2 −1.39 1.39 1.39 −0.85 0.85 0.85
9 −2.2 2.2 2.2 −1.39 1.39 1.39 −0.85 0.85 0.85
10 −2.2 2.2 2.2 −1.39 1.39 1.39 −0.85 0.85 0.85

Study 1 investigated a total of nine conditions (3 [sample size] × 3 [quality of the indicators]), whereas both Studies 2 and 3 had 27 conditions each (3 [effect of the covariate] × 3 [sample size] × 3 [quality of the indicators]). Thus, a total of 63 conditions were tested, with 500 replications generated for each condition.

In addition to this, the data were analyzed 3 times for each analysis model with two, three, or four classes. Thus, there were 81 analysis conditions for Study 1 (9 [generated conditions] × 3 [analysis model] × 3 [number of classes for each analysis model]), 243 analysis conditions for Study 2 (27 [generated condition] × 3 [analysis model] × 3 [number of classes for each analysis model]), and 324 analysis conditions for Study 3 (27 [generated condition] × 4 [analysis model] × 3 [number of classes for each analysis model]). Overall, a total of 648 analysis conditions were tested with 500 replications. All models were conducted using the one-step approach.

Model Performance Criteria

Model Fit Indices

Because applied research typically attempts to determine the number of latent classes by considering various model fit indices, this study also followed the same process. The selected model fit indices were the Akaike information criterion (AIC; Akaike, 1974), Bayesian information criterion (BIC; Schwarz, 1978), Lo-Mendell-Rubin adjusted likelihood ratio test (LMRLRT; Lo et al., 2001), and bootstrap likelihood ratio test (BLRT; McLachlan & Peel, 2004). In the present study, the AIC and BIC were compared to determine the number of latent classes, while the statistical significance was measured using the p values for the LMRLRT and BLRT.

For the AIC and BIC, a lower value represents a better model fit. In this study, when analyzing the 500 replications for each condition, the AIC and BIC values for the models with two, three, or four classes were compared, and that with the lowest AIC and BIC value was selected as the number of classes for each replication.

The LMRLRT and BLRT are typically employed to compare a model with k latent classes with a model with k− 1 latent classes. In this case, k is selected as the number of classes if the p value is significant, whereas k− 1 is selected if it is insignificant. In this study, this method was used to compare the models with two, three, or four classes and determine the number of classes for each replication.
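The class-enumeration logic for a single replication can be sketched as follows; the fit values below are made-up illustrations, not results from this study.

```python
# Hypothetical fit results for the 2-, 3-, and 4-class solutions of one
# replication; "p_blrt" is the p value of the test of k vs. k-1 classes.
fits = {
    2: {"aic": 3450.1, "bic": 3560.8, "p_blrt": 0.001},
    3: {"aic": 3390.5, "bic": 3535.2, "p_blrt": 0.002},
    4: {"aic": 3392.7, "bic": 3571.0, "p_blrt": 0.210},
}

# Information criteria: choose the class count with the lowest value.
best_by_aic = min(fits, key=lambda k: fits[k]["aic"])
best_by_bic = min(fits, key=lambda k: fits[k]["bic"])

# Likelihood ratio tests: starting from 2 classes, move up to k classes as
# long as the test of k vs. k-1 classes remains significant.
best_by_lrt = 2
for k in (3, 4):
    if fits[k]["p_blrt"] < 0.05:
        best_by_lrt = k
    else:
        break
```

Here all three strategies agree on a three-class solution; in the simulations they can disagree, which is why each index was tallied separately.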

Classification Quality

After determining the number of latent classes for each condition, the quality of the classification was assessed. Classification quality is measured as the proportion of subjects within a particular generated latent class that are correctly identified as a member of that class by the model. Although monitoring the classification quality is an important factor in the overall evaluation of a model, many previous studies have not considered this. Only recently has this attracted more attention, including the study by Cassiday et al. (2021), who focused on the classification quality of a GMM. In the present study, classification quality was quantified by recording which latent class each subject was assigned to when generating the data (i.e., their generated class) and which class they were assigned to using the posterior probability from the model analysis (i.e., their predicted class; Table 3).

Table 3.

Example of the Data Recorded to Determine Classification Quality Using the Generated and Predicted Class Assignment of the Subjects.

Condition 1, Replication 1

ID   Generated class   Predicted class
1    1                 3
2    2                 2
3    3                 1
4    3                 1

To measure classification quality, a 3 × 3 matrix was constructed for each replication based on the generated class and predicted class of each subject (Table 4). Individual subjects assigned to the first class when the data were generated could later be placed in a second class by the model because label switching can occur during the analysis process (Collins & Lanza, 2013).

Table 4.

Example of Successful Classification.

Condition 1, Replication 1

                      Predicted class
Generated class        1     2     3
1                      1    98     1
2                      1     2    97
3                     96     2     2

The classification quality was determined based on whether the subjects in a specific generated class were also assigned to the same class by the analysis model. Simply comparing the class proportions of the generated model with those of the analysis model can be misleading because label switching can occur in mixture modeling (Sperrin et al., 2010).

Table 4 presents an example of successful classification. The subjects that were assigned to Class 1 in the generated data were predicted to belong to Class 2 by the estimation model. This also occurred for Class 2 (members were predicted to be in Class 3) and Class 3 (members were predicted to be in Class 1), indicating that label switching had occurred. However, because the members of each generated class mostly remained together after the analysis, this classification was considered high quality.

In contrast, an example of poor classification is presented in Table 5. Around half of the members of generated Class 1 and half of those of generated Class 2 were predicted to be Class 2 during the analysis, which represents incorrect classification. In this case, the reason for evaluating the classification quality becomes clear. As in many previous simulation studies that only measured the class proportions, the ratio of the latent classes was similar before and after the analysis; thus, the performance would be considered satisfactory. However, this is not the case because the subjects sharing the attributes of generated Class 1 are predicted to be members of either Class 2 or 3. In other words, subjects from different classes are classified as being part of the same class in the analysis, which represents poor classification quality despite good model fit indices.

Table 5.

Example of Unsuccessful Classification.

Condition 1, Replication 2

                      Predicted class
Generated class        1     2     3
1                      1    45    44
2                      1    45    44
3                     96     2     2

For each of the 500 replications, this study calculated the classification accuracy as the proportion of well-classified subjects relative to the total sample size. For example, from Table 4, the following is calculated:

97\% = \frac{98 + 97 + 96}{300} \times 100. (9)

Although there is no absolute threshold for determining good classification performance, model comparisons are possible based on this classification accuracy.
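This accuracy can be computed from the confusion matrix while allowing for label switching by taking the best one-to-one relabeling of the classes; a sketch follows, using the matrices from Tables 4 and 5 (an exhaustive permutation search is feasible here because there are only three classes).

```python
from itertools import permutations

def classification_accuracy(confusion):
    """Proportion of correctly classified subjects, allowing for label
    switching: search all one-to-one relabelings of the classes and keep
    the best match (feasible when the number of classes is small)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(sum(confusion[g][perm[g]] for g in range(k))
               for perm in permutations(range(k)))
    return best / total

# Confusion matrices from Tables 4 and 5
# (rows: generated class, columns: predicted class)
good = [[1, 98, 1], [1, 2, 97], [96, 2, 2]]
poor = [[1, 45, 44], [1, 45, 44], [96, 2, 2]]
acc_good = classification_accuracy(good)   # 0.97, matching Equation (9)
acc_poor = classification_accuracy(poor)
```

For larger numbers of classes, the same matching can be done without enumerating permutations, for example with a linear assignment solver, but the principle is identical.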

Programs

In this study, Mplus 7.2 was used to generate and analyze the data (Muthén & Muthén, 2012). The MplusAutomation package in R (Hallquist & Wiley, 2011) was employed to run Mplus repeatedly. MATLAB (MathWorks, 2012) was used to organize the data, compare the model fit indices, and calculate the proportions of well-classified replications and subjects.

Results

It was found that the overall convergence rate for the 500 replications was more than 96.8% for each condition, that is, more than 96.8% of the replications succeeded in finding a solution. The models fully converged when the analysis included two or three classes; however, there was a low level of nonconvergence when a model included a four-class solution. Specifically, the saturated four-class model solutions did not converge in some cases. When analyzing the two-, three-, and four-class models with one replication, if the four-class model did not converge, the model fit indices of the two- and three-class models were compared to determine the appropriate number of classes.

Study 1

Model Fit Indices

Table 6 shows the number of replications (out of 500) that selected the two-, three-, or four-class model as the optimal solution based on the AIC, BIC, LMRLRT, and BLRT. For each model, the class number selected by the largest number of replications is shaded.

Table 6.

Number of Replications Selecting the Optimal Number of Latent Classes in Study 1.

Analysis models
Generation model Model 1
(correct)
Model 2
(misspecification)
Model 3
(saturated)
Sample size Quality of indicators Model fit index No. of classes No. of classes No. of classes
2 3 4 2 3 4 2 3 4
300 High AIC 0 500 0 0 500 0 0 500 0
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 0 500 0 1 499 0 0 500 0
BLRT 0 500 0 0 500 0 0 500 0
Moderate AIC 0 390 110 0 393 107 0 495 5
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 2 489 9 2 492 6 5 493 2
BLRT 0 485 15 0 486 14 0 500 0
Low AIC 13 354 133 15 343 142 12 317 171
BIC 491 9 0 497 3 0 497 3 0
LMRLRT 345 96 59 364 74 62 359 59 82
BLRT 102 369 29 141 336 23 156 309 35
500 High AIC 0 500 0 0 500 0 0 500 0
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 0 500 0 0 500 0 0 500 0
BLRT 0 500 0 0 500 0 0 500 0
Moderate AIC 0 379 121 0 383 117 0 500 0
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 0 488 12 0 492 8 0 500 0
BLRT 0 488 12 0 492 8 0 500 0
Low AIC 0 347 153 0 329 171 1 318 181
BIC 425 75 0 458 42 0 457 43 0
LMRLRT 214 271 15 245 240 15 299 172 29
BLRT 7 474 19 14 459 27 16 455 29
1,000 High AIC 0 500 0 0 500 0 0 500 0
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 0 500 0 0 500 0 0 500 0
BLRT 0 500 0 0 500 0 0 500 0
Moderate AIC 0 402 98 0 397 103 0 500 0
BIC 0 500 0 0 500 0 0 500 0
LMRLRT 0 493 7 0 494 6 0 500 0
BLRT 0 489 11 0 493 7 0 500 0
Low AIC 0 330 170 0 309 191 0 293 207
BIC 68 432 0 101 399 0 111 389 0
LMRLRT 13 480 7 16 478 6 39 455 6
BLRT 0 479 21 0 477 23 0 475 25

Note. This table shows the number of replications that selected the two-, three-, and four-class models as the optimal solution based on the AIC, BIC, LMRLRT, and BLRT in Study 1. AIC = Akaike information criterion; BIC = Bayesian information criterion; LMRLRT = Lo-Mendell-Rubin adjusted likelihood ratio test; BLRT = bootstrap likelihood ratio test.

Overall, when the sample size was large, the models provided a good estimate of the number of latent classes; however, as the sample size decreased, the number of classes became increasingly affected by the quality of the indicators. When the quality of the indicators was high, the correct three-class solution was selected in every replication for all models. There was little difference between the correct and incorrect models, and they had similar estimation patterns. This was to be expected because the covariate effect in Model 2 was predicted to be 0.

Classification Quality

Based on the 500 replications for each condition using the three latent class models, the number of well-classified replications and the average proportion of well-classified subjects are presented in Table 7.

Table 7.

Classification Quality for 500 Replications in Study 1.

Analysis models
Generation model Model 1
(correct)
Model 2
(misspecification)
Model 3
(saturated)
Sample size Quality of indicators W-C replications a W-C subjects b (%) W-C replications W-C subjects
(%)
W-C replications W-C subjects
(%)
300 High 100.00 98.77 100.00 98.77 100.00 98.71
Moderate 100.00 90.94 100.00 90.88 100.00 90.54
Low 69.00 71.20 65.80 71.01 65.60 69.88
500 High 100.00 98.84 100.00 98.84 100.00 98.83
Moderate 100.00 91.52 100.00 91.49 100.00 91.35
Low 89.20 72.86 86.20 72.79 85.80 71.80
1,000 High 100.00 98.82 100.00 98.82 100.00 98.82
Moderate 100.00 91.89 100.00 91.88 100.00 91.83
Low 98.40 74.41 98.40 74.34 98.00 73.87

Note. This table shows the proportion of well-classified replications and subjects for 500 replications.

a. W-C replications (well-classified replications) are the proportion of replications in which the subjects for each generated class were classified correctly.

b. W-C subjects (well-classified subjects) are the average proportion of well-classified subjects.

When the quality of the indicators was high or moderate, the replications were accurately classified with all three models, with more than 90% of the subjects belonging to the well-classified classes. However, when the quality of the indicators was low, the correct model's classification results were only slightly better than or similar to those of the other models, and the average proportion of well-classified subjects was about 70%.

Study 2

Model Fit Indices

Because Study 2 added three conditions for the covariate effect, it produced three times as many results as Table 6 in Study 1. Because the three-class models were well estimated when the indicator quality was moderate or high, the results were tabulated only for low indicator quality. Table 8 presents the number of replications that selected the two-, three-, or four-class model as the optimal solution based on the AIC, BIC, LMRLRT, and BLRT for sample sizes of 300, 500, and 1,000. For each model, the class number selected by the largest number of replications is shaded.

Table 8.

Number of Replications Choosing the Optimal Number of Latent Classes in Study 2 (for Indicators With Low Quality Only).

Analysis models
Generation model Model 1
(misspecification)
Model 2
(correct)
Model 3
(saturated)
Sample size Effect of the covariate Model fit index No. of classes No. of classes No. of classes
2 3 4 2 3 4 2 3 4
300 0.41 AIC 17 347 136 8 327 165 13 335 152
BIC 492 8 0 495 5 0 497 3 0
LMRLRT 352 99 49 367 73 60 381 48 71
BLRT 107 357 36 124 352 24 165 312 23
0.92 AIC 17 352 131 7 341 152 15 313 172
BIC 492 8 0 490 10 0 497 3 0
LMRLRT 371 91 38 372 96 32 345 53 102
BLRT 113 359 28 84 382 34 163 310 27
1.39 AIC 22 351 127 8 353 139 17 316 167
BIC 492 8 0 490 10 0 499 1 0
LMRLRT 372 89 39 386 100 14 334 43 123
BLRT 123 348 29 94 374 32 155 306 39
500 0.41 AIC 2 350 148 2 325 173 1 320 179
BIC 428 72 0 454 46 0 462 38 0
LMRLRT 224 266 10 238 255 7 291 194 15
BLRT 12 466 22 11 464 25 22 457 21
0.92 AIC 0 365 135 0 340 160 0 341 159
BIC 437 63 0 415 85 0 462 38 0
LMRLRT 239 255 6 206 288 6 283 187 30
BLRT 13 463 24 6 475 19 24 448 28
1.39 AIC 1 357 142 0 335 165 0 317 183
BIC 449 51 0 401 99 0 459 41 0
LMRLRT 267 226 7 191 306 3 276 198 26
BLRT 19 464 17 7 470 23 25 450 25
1,000 0.41 AIC 0 340 160 0 321 179 0 317 183
BIC 75 425 0 99 401 0 129 371 0
LMRLRT 13 479 8 16 478 6 39 451 10
BLRT 0 479 21 0 480 20 0 476 24
0.92 AIC 0 347 153 0 344 156 0 330 170
BIC 75 425 0 48 452 0 105 395 0
LMRLRT 15 480 5 8 485 7 35 456 9
BLRT 0 481 19 0 482 18 0 478 22
1.39 AIC 0 357 143 0 342 158 0 323 177
BIC 90 410 0 39 461 0 94 406 0
LMRLRT 26 464 10 6 488 6 28 460 12
BLRT 0 473 27 0 476 24 0 474 26

Note. This table shows the number of replications that selected the two-, three-, and four-class models as the optimal solution based on the AIC, BIC, LMRLRT, and BLRT in Study 2. AIC = Akaike information criterion; BIC = Bayesian information criterion; LMRLRT = Lo-Mendell-Rubin adjusted likelihood ratio test; BLRT = bootstrap likelihood ratio test.

When the sample size was large (i.e., 1,000), the models accurately estimated the original number of latent classes using all model fit indices. In contrast, when the sample size was 300 or 500, the analysis models were influenced by the quality of the indicators. For indicators with low quality, the strength of the covariate effect had little influence on the results.

The estimation patterns for the number of classes using Models 1 and 2 were similar. Although Model 1 was misspecified, in some cases it was more accurate in determining the original number of latent classes. Thus, Model 1 estimated the number of latent classes about as well as Model 2 even though only Model 2 was correctly specified.
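The information criteria used for class enumeration in Table 8 follow their standard definitions. As a reference, a generic sketch (not tied to the authors' Mplus output) of the AIC (Akaike, 1974) and BIC (Schwarz, 1978) for a model with log-likelihood logL, p free parameters, and n subjects:

```python
import math

def aic(log_likelihood, n_params):
    # Akaike (1974): AIC = -2*logL + 2*p
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_subjects):
    # Schwarz (1978): BIC = -2*logL + p*ln(n); penalizes additional
    # parameters more heavily than the AIC once n exceeds about 7
    return -2.0 * log_likelihood + n_params * math.log(n_subjects)
```

The model with the smallest AIC or BIC is selected, which is why the BIC's stronger penalty makes it more conservative about adding classes at the sample sizes studied here.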

Classification Quality

Table 9 presents the number of replications for which subjects were accurately assigned to their actual generated class during the analysis and the average proportion of well-classified subjects. For each model, 500 replications were analyzed with three latent classes. The classification was affected by the quality of the indicators: when quality was high or moderate, all 500 replications were successfully classified, but this fell to about 70% for low-quality indicators. In addition, when the indicator quality was low, the proportion of correctly classified subjects was about 70% regardless of sample size. The strength of the covariate effect did not influence the classification results.

Table 9.

Classification Quality for 500 Replications in Study 2.

Analysis models
Generation model Model 1
(misspecification)
Model 2
(correct)
Model 3
(saturated)
Sample size  Quality of indicators  Effect of the covariate  W-C replications a  W-C subjects b (%)  W-C replications  W-C subjects (%)  W-C replications  W-C subjects (%)
300 High 0.41 100.00 98.77 100.00 98.76 100.00 98.70
0.92 100.00 98.79 100.00 98.76 100.00 98.71
1.39 100.00 98.80 100.00 98.78 100.00 98.72
Moderate 0.41 100.00 90.96 100.00 90.95 100.00 90.62
0.92 100.00 90.87 100.00 91.17 100.00 90.85
1.39 100.00 90.97 100.00 91.70 100.00 91.45
Low 0.41 69.40 71.34 68.60 71.82 67.20 70.20
0.92 71.20 71.41 73.20 73.52 67.34 71.97
1.39 67.40 71.86 75.60 75.57 68.80 73.58
500 High 0.41 100.00 98.85 100.00 98.85 100.00 98.83
0.92 100.00 98.85 100.00 98.84 100.00 98.83
1.39 100.00 98.86 100.00 98.86 100.00 98.85
Moderate 0.41 100.00 91.60 100.00 91.53 100.00 91.41
0.92 100.00 91.61 100.00 91.77 100.00 91.63
1.39 100.00 91.67 100.00 92.22 100.00 92.08
Low 0.41 88.60 72.79 88.40 73.05 87.20 72.13
0.92 87.00 72.97 89.80 74.81 86.60 73.77
1.39 85.40 72.97 89.40 76.57 86.40 75.68
1,000 High 0.41 100.00 98.83 100.00 98.83 100.00 98.83
0.92 100.00 98.85 100.00 98.85 100.00 98.85
1.39 100.00 98.86 100.00 98.88 100.00 98.88
Moderate 0.41 100.00 91.89 100.00 91.90 100.00 91.84
0.92 100.00 91.94 100.00 92.10 100.00 92.04
1.39 100.00 92.00 100.00 92.52 100.00 92.47
Low 0.41 97.00 74.47 97.20 74.79 97.80 74.35
0.92 97.00 74.72 97.40 76.41 97.60 76.04
1.39 96.00 74.79 98.80 78.05 98.20 77.68

Note. This table shows the percentage of well-classified replications and subjects for 500 replications.

a W-C replications (well-classified replications): the proportion of replications for which the subjects of each generated class were classified correctly.

b W-C subjects (well-classified subjects): the average proportion of well-classified subjects.

Study 3

Model Fit Indices

As in Study 2, because the three-class models were accurately estimated when indicator quality was moderate or high, the results for Study 3 are presented only for low-quality indicators in Table 10. The proportion of replications that selected the two-, three-, and four-class models as the optimal solution is presented for each sample size, with the class number for which the model fit was best shaded.

Table 10.

Number of Replications Choosing the Optimal Number of Latent Classes in Study 3 (for Indicators With Low Quality Only).

Analysis models
Generation model Model 1
(misspecification)
Model 2
(misspecification)
Model 3
(saturated)
Model 4
(correct)
Sample size Effect of the covariate Model fit index No. of classes No. of classes No. of classes No. of classes
2 3 4 2 3 4 2 3 4 2 3 4
300 0.41 AIC 12 348 140 10 257 233 9 321 170 12 340 148
BIC 493 7 0 495 5 0 496 4 0 492 8 0
LMRLRT 361 81 58 356 71 73 370 54 76 362 84 54
BLRT 99 376 25 112 333 55 141 327 32 104 375 21
0.92 AIC 15 343 142 0 136 364 15 316 169 18 338 144
BIC 494 6 0 465 35 0 497 3 0 493 7 0
LMRLRT 358 81 61 286 122 92 363 50 87 351 88 61
BLRT 119 358 23 35 285 180 156 317 27 107 364 29
1.39 AIC 19 353 128 0 133 367 19 308 173 18 342 140
BIC 496 4 0 237 259 4 497 3 0 495 5 0
LMRLRT 354 75 71 219 192 89 366 43 91 355 79 66
BLRT 133 345 22 8 252 240 173 306 21 123 354 23
500 0.41 AIC 0 366 134 0 196 304 0 312 188 0 361 139
BIC 435 65 0 448 52 0 453 47 0 430 70 0
LMRLRT 228 263 9 278 201 21 302 185 13 227 261 12
BLRT 5 477 18 7 387 106 20 458 22 6 475 19
0.92 AIC 1 353 146 0 34 466 1 320 179 0 347 153
BIC 444 56 0 268 228 4 464 36 0 433 67 0
LMRLRT 241 254 5 209 247 44 303 174 23 233 261 6
BLRT 11 467 22 2 113 385 16 459 25 7 472 21
1.39 AIC 1 343 156 0 78 422 2 323 175 0 348 152
BIC 453 47 0 20 438 42 462 38 0 443 57 0
LMRLRT 255 240 5 145 278 77 314 158 28 252 240 8
BLRT 16 462 22 0 118 382 20 453 27 13 465 22
1,000 0.41 AIC 0 332 168 0 66 434 0 326 174 0 335 165
BIC 71 429 0 107 393 0 113 387 0 71 429 0
LMRLRT 14 475 11 56 403 41 46 448 6 15 479 6
BLRT 0 475 25 0 245 255 0 476 24 0 474 26
0.92 AIC 0 341 159 0 8 492 0 320 180 0 331 169
BIC 99 401 0 5 361 134 135 365 0 86 414 0
LMRLRT 23 469 8 103 290 107 47 446 7 20 475 5
BLRT 0 478 22 0 17 483 0 481 19 0 480 20
1.39 AIC 0 327 173 0 49 451 0 323 177 0 331 169
BIC 131 369 0 0 262 238 155 345 0 105 395 0
LMRLRT 32 459 9 52 263 185 65 430 5 20 471 9
BLRT 0 481 19 0 49 451 0 481 19 0 479 21

Note. This table shows the number of replications that selected the two-, three-, and four-class models as the optimal solution based on the AIC, BIC, LMRLRT, and BLRT in Study 3. AIC = Akaike information criterion; BIC = Bayesian information criterion; LMRLRT = Lo-Mendell-Rubin adjusted likelihood ratio test; BLRT = bootstrap likelihood ratio test.

In Studies 1 and 2, there was no difference in the patterns between Models 1 and 2. In Study 3, however, their patterns differed because they were both misspecified. The most obvious difference was that Model 2 tended to overestimate the number of latent classes.

In Studies 1 and 2, for a sample size of 300 or 500, if the quality of the indicators was low, the proportion of replications that correctly estimated the original class number was low. In Study 3, however, Model 2 tended to overestimate the class number, even when the quality of the indicators was moderate. This was more apparent when the effect of the covariate was large, that is, 0.92 and 1.39 rather than 0.41. Therefore, the model fit indices indicated that Model 1 was more suitable.

Classification Quality

Based on the 500 replications for each condition using the three latent class models, the number of well-classified replications and the average proportion of well-classified subjects are presented in Table 11. In terms of classification quality, Models 1 and 3 exhibited levels that were similar to Studies 1 and 2, but Model 2 did not. In the two previous studies, when the quality of the indicators was low, the proportion of well-classified subjects was about 70%, whereas in Study 3, this percentage fell to about 60% under certain conditions. This suggests that a model that estimates the effect of a covariate, such as Model 2, may be misleading if it is misspecified.

Table 11.

The Classification Quality of 500 Replications in Study 3.

Analysis models
Generation model Model 1
(misspecification)
Model 2
(misspecification)
Model 3
(saturated)
Model 4
(correct)
Sample size  Quality of indicators  Effect of the covariate  W-C replications a  W-C subjects b (%)  W-C replications  W-C subjects (%)  W-C replications  W-C subjects (%)  W-C replications  W-C subjects (%)
300 High 0.41 100.00 98.75 100.00 98.75 100.00 98.68 100.00 98.74
0.92 100.00 98.65 100.00 98.65 100.00 98.58 100.00 98.63
1.39 100.00 98.54 100.00 98.52 100.00 98.46 100.00 98.54
Moderate 0.41 100.00 90.80 100.00 90.69 100.00 90.43 100.00 90.80
0.92 100.00 90.43 100.00 90.12 100.00 90.18 100.00 90.52
1.39 100.00 90.02 100.00 89.45 100.00 89.90 100.00 90.24
Low 0.41 70.00 71.21 67.00 70.31 63.13 69.90 68.60 71.28
0.92 68.20 70.83 58.20 65.64 67.13 69.62 67.80 71.12
1.39 64.40 70.46 52.20 60.65 61.60 69.30 64.20 70.85
500 High 0.41 100.00 98.81 100.00 98.81 100.00 98.79 100.00 98.81
0.92 100.00 98.73 100.00 98.73 100.00 98.70 100.00 98.73
1.39 100.00 98.63 100.00 98.62 100.00 98.59 100.00 98.62
Moderate 0.41 100.00 91.42 100.00 91.35 100.00 91.24 100.00 91.42
0.92 100.00 91.11 100.00 90.94 100.00 90.91 100.00 91.10
1.39 100.00 90.71 100.00 90.40 100.00 90.61 100.00 90.76
Low 0.41 89.60 72.63 86.00 71.74 86.40 71.62 89.00 72.68
0.92 85.00 72.37 73.00 65.68 84.20 71.46 87.60 72.58
1.39 84.40 71.81 66.40 61.00 82.80 71.08 85.40 72.19
1,000 High 0.41 100.00 98.80 100.00 98.80 100.00 98.80 100.00 98.80
0.92 100.00 98.72 100.00 98.72 100.00 98.72 100.00 98.73
1.39 100.00 98.63 100.00 98.63 100.00 98.62 100.00 98.62
Moderate 0.41 100.00 91.79 100.00 91.77 100.00 91.74 100.00 91.79
0.92 100.00 91.50 100.00 91.46 100.00 91.45 100.00 91.52
1.39 100.00 91.17 100.00 91.10 100.00 91.10 100.00 91.17
Low 0.41 98.40 74.26 98.20 73.67 98.40 73.70 98.40 74.36
0.92 98.00 73.83 86.00 65.47 97.80 73.42 98.40 74.07
1.39 98.20 73.34 75.80 61.57 98.20 73.15 98.80 73.76

Note. This table shows the percentage of well-classified replications and subjects for 500 replications.

a W-C replications (well-classified replications): the proportion of replications in which the subjects of each generated class were classified correctly.

b W-C subjects (well-classified subjects): the average proportion of well-classified subjects.

In contrast, the classification performance of Model 1 did not differ from that of the correct model. When the correct model did not recover the original class number well, Model 1 performed neither better nor worse. As can be seen in the tables, when the indicators were of low quality, all of the models generated inaccurate estimates. While larger sample sizes increased the classification performance for Models 1 and 3 to more than 70%, that of Model 2 remained at about 60%. Overall, the size of the covariate effect had no clear influence on the estimation results.

Because researchers dealing with real-life data do not have any information about the true model, they are more likely to choose a misspecified model such as Model 1. This misspecified model, which omits the covariate, is simpler to estimate than the correct model. Thus, for researchers pursuing parsimony and interpretability, Model 1 is the better choice. However, it should be remembered that, under poor conditions, the classification quality of Model 1 was also only about 70%.

Conclusion and Discussion

This study investigated the classification accuracy of the basic latent class model in the presence or absence of a covariate. To achieve this, data were generated under various conditions and analyzed using Monte Carlo simulations. Four model fit indices were used to compare the models, and the classification quality was also quantified.

Three individual studies were conducted in this research. First, for a generation model that included a covariate that did not have any effect, estimation models with and without a covariate and a saturated model were compared. In the generation model, the effect of the covariate on the latent classes was set to 0. It was found that the model fit indices and the classification quality of the models with and without a covariate did not differ.

The second study compared the same three analysis models as in Study 1 for a generation model that included a covariate with an indirect effect. The analysis was carried out under a greater variety of conditions, but the results were similar to those from Study 1. It was confirmed that choosing the model without a covariate may be better in terms of parsimony because the two models performed similarly and the model without a covariate is more straightforward.

The third study assumed that both of the two models mainly compared in this study (Models 1 and 2) were misspecified. Accordingly, the set of analysis models, which included the saturated model and the correct model (the model with a direct covariate effect), was larger than in the previous two studies. In Study 3, the model without a covariate performed similarly to the correct model, whereas the misspecified model with a covariate produced many errors in estimating the number of latent classes because of the effect of the covariate. Thus, the use of a model without a covariate for class identification is recommended. These results support the conclusions reported by Nylund-Gibson and Masyn (2016).

The model without a covariate generated the best results in terms of parsimony and performance across the three simulation studies. This result supports the use of the three-step approach, which classifies latent classes while controlling for the influence of covariates (Asparouhov & Muthén, 2014; Kim et al., 2016; Vermunt, 2010).

Indicator quality was the most influential factor overall. In particular, when the quality was low, the models performed poorly. This is in accordance with Wurpts and Geiser (2014), who recommended using indicators of as high a quality as possible. An indicator's quality is considered low when a binary variable has a response probability of .70 for one category and .30 for the other. This suggests that researchers should be careful when determining the number of latent classes in real analyses when the quality of binary indicators is this low, that is, when the estimated threshold is about −0.85 or 0.85.
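The correspondence between the response probabilities and thresholds mentioned here follows from the logit parameterization commonly used for binary indicators (and used by default in Mplus): a threshold τ implies an endorsement probability of 1/(1 + e^τ), so p = .70 corresponds to τ = ln(.30/.70) ≈ −0.85. A quick check of this relationship (helper names are illustrative):

```python
import math

def logit_threshold(p):
    """Logit threshold that yields response probability p for a binary indicator."""
    return math.log((1.0 - p) / p)

def response_probability(tau):
    """Inverse mapping: endorsement probability implied by threshold tau."""
    return 1.0 / (1.0 + math.exp(tau))
```

Here `logit_threshold(0.7)` returns roughly −0.847, matching the ±0.85 threshold values cited for the low-quality condition.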

The present results provide some novel insights into LCA classification. Classification quality can only be determined using simulation analysis, yet many previous studies have focused only on class proportions when analyzing the performance of target models. Even if these proportions appear accurate, the results may be unreliable if the classification quality is poor. The present study confirmed that classification quality needs to be considered when judging the performance of LCA models in order to obtain reliable results.

The proportion of subjects that were correctly classified in the present study averaged more than 70%, although it varied depending on the model specifications. When the sample size, indicator quality, and effect of the covariate were high, the proportion of well-classified replications regularly reached 100%. However, future research is required to develop strategies to improve classification quality under poor conditions.

One of the strengths of this study is that it examined the classification quality of a basic LCA model, an analytic tool widely used in behavioral science. In addition, this study demonstrated how analysis models performed when covariates were included under various conditions. These results can serve as a guideline for researchers employing LCA models to analyze cross-sectional data. Similar to Cassiday et al. (2021), who investigated the classification quality of a growth mixture model (GMM) for longitudinal data, the classification of other mixture models should be studied in the future.

Despite these contributions, this study has a number of limitations. First, the research conditions were relatively narrow. Only a single covariate was included, and only binary variables were used as indicators. Adding multiple covariates and indicators with more than two response categories would bring more clarity to this topic in the future. In addition, this study dealt only with balanced data; with unbalanced data, in which the class proportions are not equal, the analysis may be more complicated if label switching occurs. It is also important to note that parameter estimates were not analyzed or compared because investigating parameter bias was not the objective of this study.

Nevertheless, this study is meaningful in that it presents guidelines for simulation researchers because classification quality can act as a reference point for comparing models. In addition, applied researchers who conduct LCA using models with a covariate can use the results of this study as a reference for the analysis process.

Footnotes

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

References

  1. Akaike H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
  2. Asparouhov T., Muthén B. (2014). Auxiliary variables in mixture modeling: Three-step approaches using Mplus. Structural Equation Modeling, 21(3), 329–341.
  3. Bergman L. R., Magnusson D. (1997). A person-oriented approach in research on developmental psychopathology. Development and Psychopathology, 9(2), 291–319.
  4. Bergman L. R., Magnusson D., El Khouri B. M. (2003). Studying individual development in an interindividual context: A person-oriented approach. Psychology Press.
  5. Carlson R. G., Wang J., Falck R. S., Siegal H. A. (2005). Drug use practices among MDMA/ecstasy users in Ohio: A latent class analysis. Drug and Alcohol Dependence, 79(2), 167–179.
  6. Cassiday K. R., Cho Y., Harring J. R. (2021). A comparison of label switching algorithms in the context of growth mixture models. Educational and Psychological Measurement, 81(4), 668–697.
  7. Clark S. L., Muthén B. (2009). Relating latent class analysis results to variables not included in the analysis. https://www.statmodel.com/download/relatinglca.pdf
  8. Collins L. M., Lanza S. T. (2013). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences (Vol. 718). Wiley.
  9. Collins L. M., Wugalter S. E. (1992). Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27(1), 131–157.
  10. DiStefano C., Kamphaus R. (2006). Investigating subtypes of child development: A comparison of cluster analysis and latent class cluster analysis in typology creation. Educational and Psychological Measurement, 66(5), 778–794.
  11. Eid M., Langeheine R., Diener E. (2003). Comparing typological structures across cultures by multigroup latent class analysis: A primer. Journal of Cross-Cultural Psychology, 34(2), 195–210.
  12. Goodman L. A. (1974a). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I: A modified latent structure approach. American Journal of Sociology, 79(5), 1179–1259.
  13. Goodman L. A. (1974b). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2), 215–231.
  14. Hallquist M., Wiley J. (2011). MplusAutomation: Automating Mplus model estimation and interpretation. R package version 0.6. http://cran.r-project.org/web/packages/MplusAutomation/index.html
  15. Kim M., Vermunt J., Bakk Z., Jaki T., Van Horn M. L. (2016). Modeling predictors of latent classes in regression mixture models. Structural Equation Modeling, 23(4), 601–614.
  16. Klonsky E. D., Olino T. M. (2008). Identifying clinically distinct subgroups of self-injurers among young adults: A latent class analysis. Journal of Consulting and Clinical Psychology, 76(1), 22–27.
  17. Lazarsfeld P., Henry N. (1968). Latent structure analysis. Houghton Mifflin.
  18. Li J. J., Lee S. S. (2010). Latent class analysis of antisocial behavior: Interaction of serotonin transporter genotype and maltreatment. Journal of Abnormal Child Psychology, 38(6), 789–801.
  19. Li L., Hser Y.-I. (2011). On inclusion of covariates for class enumeration of growth mixture models. Multivariate Behavioral Research, 46(2), 266–302.
  20. Lo Y., Mendell N. R., Rubin D. B. (2001). Testing the number of components in a normal mixture. Biometrika, 88(3), 767–778.
  21. Lubke G., Muthén B. O. (2007). Performance of factor mixture models as a function of model size, covariate effects, and class-specific parameters. Structural Equation Modeling, 14(1), 26–47.
  22. MathWorks, Inc. (2012). MATLAB and Statistics Toolbox release.
  23. McLachlan G., Peel D. (2004). Finite mixture models. Wiley.
  24. Monga N., Rehm J., Fischer B., Brissette S., Bruneau J., El-Guebaly N., Noël L., Tyndall M., Wild C., Leri F., Fallu J.-S., Bahl S. (2007). Using latent class analysis (LCA) to analyze patterns of drug use in a population of illegal opioid users. Drug and Alcohol Dependence, 88(1), 1–8.
  25. Muthén L. K., Muthén B. O. (2012). Mplus version 7 user's guide. Muthén & Muthén.
  26. Nylund-Gibson K., Masyn K. E. (2016). Covariates and mixture modeling: Results of a simulation study exploring the impact of misspecified effects on class enumeration. Structural Equation Modeling, 23(6), 782–797.
  27. Pinquart M., Schindler I. (2007). Changes of life satisfaction in the transition to retirement: A latent-class approach. Psychology and Aging, 22(3), 442–455.
  28. Rosenthal J. A. (1996). Qualitative descriptors of strength of association and effect size. Journal of Social Service Research, 21(4), 37–59.
  29. Schwarz G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
  30. Sperrin M., Jaki T., Wit E. (2010). Probabilistic relabelling strategies for the label switching problem in Bayesian mixture models. Statistics and Computing, 20(3), 357–366.
  31. Sullivan P., Smith W., Buchwald D. (2002). Latent class analysis of symptoms associated with chronic fatigue syndrome and fibromyalgia. Psychological Medicine, 32(5), 881–888.
  32. Sysko R., Zakarin E. B., Devlin M. J., Bush J., Walsh B. T. (2011). A latent class analysis of psychiatric symptoms among 125 adolescents in a bariatric surgery program. Pediatric Obesity, 6(3–4), 289–297.
  33. Van Gaalen R. I., Dykstra P. A. (2006). Solidarity and conflict between adult children and parents: A latent class analysis. Journal of Marriage and Family, 68(4), 947–960.
  34. Vermunt J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4), 450–469.
  35. Wurpts I. C., Geiser C. (2014). Is adding more indicators to a latent class analysis beneficial or detrimental? Results of a Monte-Carlo study. Frontiers in Psychology, 5, 1–15.
  36. Yang X., Shaftel J., Glasnapp D., Poggio J. (2005). Qualitative or quantitative differences? Latent class analysis of mathematical ability for special education students. The Journal of Special Education, 38(4), 194–207.

Articles from Educational and Psychological Measurement are provided here courtesy of SAGE Publications
