Published in final edited form as: Struct Equ Modeling. 2017 Jan 13;24(2):159–179. doi: 10.1080/10705511.2016.1257354

An Empirical Assessment of the Sensitivity of Mixture Models to Changes in Measurement

Veronica T Cole 1, Daniel J Bauer 1, Andrea M Hussong 1, Michael L Giordano 1

Abstract

The current study explored the extent to which variations in self-report measures across studies can produce differences in the results obtained from mixture models. Data (N = 854) come from a laboratory analogue study of methods for creating commensurate scores of alcohol- and substance-use-related constructs when items differ systematically across participants for any given measure. Items were manipulated according to four conditions, corresponding to increasing levels of alteration to item stems, response options, or both. In Study 1, results from latent class analyses (LCA) of alcohol consequences were compared across the four conditions, revealing differences in class enumeration and configuration. In Study 2, results from factor mixture models (FMM) of alcohol expectancies were compared across two of the conditions, revealing differences in patterns and magnitude of the factor loadings and thresholds. The results suggest that even subtle differences in measurement can have substantively meaningful effects on mixture model results.


Increasingly popular within psychology and allied fields, finite mixture models offer the opportunity to identify latent subgroups of individuals within a population (McLachlan & Peel, 2000). For instance, one recent study used mixture models to find subtypes of individuals with schizophrenia based on comorbidity, finding three classes characterized by no comorbidity, comorbid anxiety and depression, or comorbid addiction (Tsai & Rosenheck, 2013). Another study (Crow, Swanson, Peterson, & Crosby, 2012) used mixture models to find six latent classes of individuals based on eating disorder symptoms, and additionally found that three of these classes were related to increased mortality risk.

Though mixture models have been applied to many behavioral phenomena, results can differ widely from one study to the next, presenting an inconsistent picture of the underlying latent structure of a given construct. One notable example is in the study of alcohol use disorder (AUD) as defined by the Diagnostic and Statistical Manual (DSM-5; American Psychiatric Association, 2013). A number of studies have attempted to uncover homogeneous classes of individuals on the basis of different patterns of the 11 DSM diagnostic criteria as defined by either DSM-5 or DSM-IV (American Psychiatric Association, 1994). The number of classes found from one application to the next ranges widely, with some studies finding two (Rinker & Neighbors, 2015), three (La Flair et al., 2012, 2013; Beseler et al., 2012; Mancha, Hulbert, & Latimer, 2011; Chung & Martin, 2001), four (Jackson et al., 2014; Wells, Horwood, & Ferguson, 2004), and five (Lynskey et al., 2005). Moreover, while most of these studies find classes on a continuum of severity that increases monotonically between classes (i.e., the classes mainly capture level of AUD liability), a few studies (Jackson et al., 2014; Beseler et al., 2012; Lynskey et al., 2005) find at least one class with a unique configuration of symptoms that falls outside of this continuum.

Of course, there are many potential reasons for these inconsistencies. For instance, results may differ depending on characteristics of the population that is sampled (e.g., college students versus primary care patients; Lubke & Miller, 2014), as well as the well-documented effect of sample size on the power to detect classes (Nylund, Asparouhov, & Muthén, 2007; Lubke & Neale, 2006, 2008), especially when classes are unevenly sized (Lubke & Tueller, 2010). The results of finite mixture models can also be particularly sensitive to misspecification of the model assumptions and structure (Bauer & Curran, 2003, 2004; Van Horn et al., 2012). Here, however, we focus on another possibility: that results may differ across studies due to the use of different measurement instruments. A recent review by Lubke and Miller (2014) cautioned that both theory and prior simulation results point to a general sensitivity of mixture models (as well as taxometric techniques; Meehl, 1995) to the characteristics of the items included in the analysis. Often, these characteristics vary in important ways across studies. With respect to AUD symptoms, for instance, measurement may be performed via structured interviews such as the SCID (Chung & Martin, 2001), the SSAGA (Lynskey et al., 2005), or the CIDI (Wells, Horwood, & Ferguson, 2004), all of which assess DSM criteria through a set of questions designed for particular clinical or research settings. Alternatively, some studies use paper-and-pencil or computerized surveys that elicit subjects’ direct report of each of the 11 criteria (La Flair et al., 2012, 2013; Jackson et al., 2014). Though the instruments used to measure AUD in these studies assess the same 11 criteria, they do so using different questions and modes of response. Thus, it is challenging to determine to what extent differences in class structure between studies are a result of differences in how AUD criteria were measured. Such differences are not isolated to research on AUD.

The question this paper seeks to address is: to what extent might differences in measurement across studies be responsible for these inconsistent results? Two prior lines of research have explored this possibility empirically, in each case by evaluating the sensitivity of longitudinal mixture models to variations in the measurement of over-time trajectories, including both the way outcomes were assessed as well as the number and timing of the assessments. First, using archival criminal offense data for 500 boys between ages 7 and 70, Eggleston, Laub and Sampson (2004; Sampson, Laub, & Eggleston, 2004) investigated the methodological sensitivity of results obtained from a semiparametric growth model (SPGM; Nagin, 1999; Nagin & Tremblay, 2001). Eggleston and colleagues found that the number and nature of classes found by SPGM differed based on the time span over which individuals were studied, as well as the inclusion of controls for incarceration or mortality. In particular, these researchers found that the inclusion of fewer time points led to the discovery of fewer trajectory classes as well as steeper rates of increase and decrease over time. In a second line of research focused on growth mixture models with random effects (GMM; Muthén & Shedden, 1999; Muthén, 2001), Jackson and Sher (2006) demonstrated a similar sensitivity of trajectory classes to the number and timing of repeated assessments. Jackson and Sher (2005) also found that separate GMM analyses conducted with different but related alcohol involvement constructs – AUD diagnosis, alcohol dependence symptoms, quantity-frequency, and heavy drinking – resulted in different conclusions regarding the number of latent classes. Even when holding the number of classes constant, the shapes and prevalence rates of the implied trajectories were highly dissimilar, and individual classifications were largely discordant across models. Finally, in an analysis of heavy episodic drinking (HED), the authors ran separate analyses on binary measures of HED scored according to different cut points (e.g., 5+ drinks in the past 30 days vs. 5+ drinks at least once or twice a week). Although the choice of HED cut point did not change the obtained number of classes, it did alter the relative proportion of individuals thought to belong to each class (Jackson & Sher, 2008).

In a similar spirit, the current analyses aim to further our knowledge of whether and to what extent mixture model results change meaningfully based on subtle differences in how constructs are operationalized and measured. Empirical investigation of this problem in real data has heretofore been challenging for one simple reason: it is rare for studies to systematically vary the measurement of a construct between or within subjects and compare results based on these measurement differences. Here, we report on the results of a unique laboratory analogue study (the REAL-U study) that was designed to mimic measurement differences across studies in a number of self-report inventories in a college sample. Specifically, participants received different versions of the same measures, intended to measure precisely the same constructs but with superficial differences in instructions, wording, and response options (consistent with measurement differences commonly observed across independently conducted studies). Through two empirical examples involving alcohol use consequences and alcohol expectancies, we examine the sensitivity of results from a latent class analysis and a factor mixture model to changes in measurement.

Study 1

Study 1 used an experimental design to manipulate the measurement of alcohol use consequences in a college sample. We investigated the stability of latent class analysis (LCA) results across four different experimental conditions, each corresponding to a different level of alteration of item stems and response options.

Method

Participants

We obtained student contact information from the university Registrar’s office and selected a sampling frame to over-represent African American students (the largest ethnic minority group on this campus) and men (given that 57% of the undergraduate population on this campus were women). A total of 6,000 students received an initial email inviting their participation (and, for many, several follow-up emails), yielding a total of 854 study participants. To be included in the study, subjects had to be between 18 and 23 years of age and have consumed alcohol in the past year. The final sample was 45% male, 58.1% European American, 21.9% African American, 10.4% Asian, 6.1% more than one race, and 3.5% some other race; across all races, 5.4% of participants were Hispanic/Latino. In addition, 28.6% of the participants were first-year students, 20.5% were sophomores, 20% were juniors, 28.9% were seniors, and 2% were non-students, graduate students, or did not specify.

Measures

One of the goals of the REAL-U study was to empirically test the stability of findings in alcohol and drug use research across studies that use different versions of scales to measure the same constructs. Thus, items in the REAL-U study were manipulated in a number of different ways. Although we explain these differences in the context of the specific measure of alcohol use problems used in Study 1, items were manipulated similarly in both Studies 1 and 2.

Lifetime alcohol use consequences were measured using the Rutgers Alcohol Problems Index (RAPI; White & Labouvie, 1989), which has been shown to have very good internal consistency (α = .92), test-retest reliability (reliability = .89–.92), and criterion validity (Miller et al., 2002). In the current study, an 18-item subset of the full 23-item questionnaire was used, based on the findings of Neal, Corbin, and Fromme (2006) that this was the best-functioning subset of items, relatively free of both differential item functioning and local dependence between item pairs. Participants were instructed to indicate how many times they had experienced a given alcohol-related consequence (e.g., going to work or school drunk or waking up in an unfamiliar place after drinking) in their lifetime.

All items are shown in Table 1. Items were manipulated according to one of four versions, corresponding to increasing levels of perturbation in item stems and response categories. In Version 1, items were administered in their original form, using both the original item stems and response scales from the RAPI. In Version 2, half of the items appeared in their original form; half had perturbed item stems, based on items taken from another self-report measure of alcohol use consequences from the Core study (Presley, Meilman, & Lyerla, 1994). All items had the same response categories as the RAPI. Version 3 used the same item stems as Version 2, but used different response categories. By collapsing some categories, however, the response categories in Version 3 could be harmonized with Versions 1 and 2. Finally, Version 4 perturbed the remaining item stems (such that all stems now differed from Version 1). For Version 4 response categories, half the items maintained response categories from Version 3, while the other half used unique response categories which could not be collapsed to be equivalent to those in the other versions (taken from the Semi-Structured Assessment of Alcohol and Other Drugs; Buchholz, Cadoret, & Cloninger, 1994).

Table 1.

Study 1: Summary of all alcohol problems items used.

Version 1 (Battery A) Version 2 (Battery B) Version 3 (Battery A) Version 4 (Battery B)

Response Scale   Versions 1 and 2: None (0), 1–2 times (1), 3–5 times (2), More than 5 Times (3).   Version 3: Never (0), Once (1), Twice (2), 3–5 times (3), 6–9 times (4), 10 or more times (5).   Version 4: Never (0), Once (1), Twice (2), 3–5 times (3), 6–9 times (4), 10 or more times (5) (for non-italicized items); 0–2 times (0), 3–4 times (1), 5–9 times (2), 10 or more times (3) (for italicized items).
Item

1 Got into fights with other people (friends, relatives, strangers) Got into fights with other people (friends, relatives, strangers) Got into fights with other people (friends, relatives, strangers) Gotten into physical fights when drinking

2 Went to work or school high or drunk Gone to class or a job when drunk Gone to class or a job when drunk Gone to class or a job when drunk

3 Caused shame or embarrassment to someone Made others ashamed by your drinking behavior or something you did when drinking Made others ashamed by your drinking behavior or something you did when drinking Made others ashamed by your drinking behavior or something you did when drinking

4 Neglected your responsibilities Neglected your responsibilities Neglected your responsibilities Neglected your obligations, your family, or your work for two or more days in a row because you were drinking

5 Relatives avoided you Family members rejected you because of your drinking Family members rejected you because of your drinking Family members rejected you because of your drinking

6 Felt that you needed more alcohol than you used to in order to get the same effect Felt that you needed more alcohol than you used to in order to get the same effect Felt that you needed more alcohol than you used to in order to get the same effect Needed to drink more and more to get the effect you want

7 Tried to control your drinking (tried to drink only at certain times of the day or in certain places, that is, tried to change your pattern of drinking) Tried to control your drinking (tried to drink only at certain times of the day or in certain places, that is, tried to change your pattern of drinking) Tried to control your drinking (tried to drink only at certain times of the day or in certain places, that is, tried to change your pattern of drinking) Tried to cut down or quit drinking or using alcohol

8 Had withdrawal symptoms, that is, felt sick because you stopped or cut down on drinking Had withdrawal symptoms, that is, felt sick because you stopped or cut down on drinking Had withdrawal symptoms, that is, felt sick because you stopped or cut down on drinking Felt sick, shaky or depressed when you stopped drinking

9 Noticed a change in your personality Acted in a very different way or did things you normally would not do because of your drinking Acted in a very different way or did things you normally would not do because of your drinking Acted in a very different way or did things you normally would not do because of your drinking

10 Felt that you had a problem with alcohol Felt that you had a problem with alcohol Felt that you had a problem with alcohol Thought you might have a drinking problem

11 Wanted to stop drinking but couldn’t Tried unsuccessfully to stop drinking Tried unsuccessfully to stop drinking Tried unsuccessfully to stop drinking

12 Suddenly found yourself in a place that you could not remember getting to Awakened the morning after some drinking the night before and could not remember a part of the evening. Awakened the morning after some drinking the night before and could not remember a part of the evening. Awakened the morning after some drinking the night before and could not remember a part of the evening.

13 Passed out or fainted suddenly Passed out after drinking Passed out after drinking Passed out after drinking

14 Had a fight, argument, or bad feeling with a friend Had a fight, argument, or bad feeling with a friend Had a fight, argument, or bad feeling with a friend Drinking created problems between you and a near relative or close friend

15 Kept drinking when you promised yourself not to Kept drinking when you promised yourself not to Kept drinking when you promised yourself not to Could not stop drinking without difficulty after one or two drinks

16 Felt you were going crazy Your drinking made you feel out of control even when you were sober Your drinking made you feel out of control even when you were sober Your drinking made you feel out of control even when you were sober

17 Felt physically or psychologically dependent on alcohol Felt physically or psychologically dependent on alcohol Felt physically or psychologically dependent on alcohol Thought you were dependent on alcohol

18 Was told by a friend, neighbor or relative to stop or cut down drinking Near relative or close friend worried or complained about your drinking Near relative or close friend worried or complained about your drinking Near relative or close friend worried or complained about your drinking

For the current analyses, all items were recoded in all versions as binary. In Versions 1 and 2, responses of “none” were coded as 0, and all other responses were coded as 1. In Versions 3 and 4 under the 5-point scale, responses of “never” were coded as 0 and all other responses were coded as 1. Note that, for items measured under the 4-point scale in Version 4, this harmonization was imperfect, as the lowest category was “0–2 times.” However, the collapsing of items was intended not only to ensure comparability of solutions across measurement version and model complexity, but also to avoid problems with sparseness given the presence of a number of low-frequency response patterns in the original response scale. Thus, for all versions, a response of 0 generally indicated not having experienced a given alcohol use problem, and a response of 1 indicated having experienced this problem at least once, except in half the Version 4 items for which it indicated three or more times.
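
To make this recoding rule concrete, a minimal sketch of the dichotomization is shown below. The data frame, column names, and response values are hypothetical stand-ins; the original analyses were not necessarily carried out with this code or software.

```python
import pandas as pd

# Hypothetical raw RAPI responses: rows = participants, columns = items,
# coded on the original ordinal scales summarized in Table 1.
raw = pd.DataFrame({"item1": [0, 1, 3, 2], "item2": [0, 0, 5, 1]})

def dichotomize(responses: pd.Series) -> pd.Series:
    """Recode an ordinal item to binary: 0 for the lowest category, 1 otherwise.

    For Versions 1-3 (and the 5-point items in Version 4) the lowest category
    is "never"/"none"; for the 4-point Version 4 items it is "0-2 times," so
    the harmonization is imperfect, as noted in the text.
    """
    return (responses > 0).astype(int)

binary = raw.apply(dichotomize)
print(binary)
```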

Procedure

On each study visit, participants completed two versions of all measures on a computer. Versions of the RAPI were paired systematically such that participants either completed Versions 1 and 3 (denoted Battery A) or Versions 2 and 4 (denoted Battery B) within a given visit. Each battery was designed to be completed in roughly 75 minutes and participants completed a set of additional measures at Visit 2. Participants were compensated $20 for completion of Visit 1 and $25 for completion of Visit 2.

Participants were randomized to one of four conditions determining the combination and order of batteries they completed. As shown in Table 2, subjects completed either Battery A at Visit 1 and Battery B at Visit 2 (AB; N = 196), Battery B at Visit 1 and Battery A at Visit 2 (BA; N = 212), Battery A at both Visits 1 and 2 (AA; N = 213), or Battery B at both Visits 1 and 2 (BB; N = 219). Also shown in Table 2, for each version of the measure, data were split to form two analysis samples, denoted sample x and sample y. For instance, sample 1x included data on Version 1 from the first visit for individuals assigned to condition AA as well as the second visit for individuals assigned to BA. Sample 1y, in turn, included data on Version 1 from the second visit for individuals assigned to AA, as well as the first visit for individuals assigned to AB.

Table 2.

Study 1: Summary of the composition of each analysis sample.

Analysis Sample   AA (N = 213)        BB (N = 219)        AB (N = 196)        BA (N = 212)        N
                  Visit 1   Visit 2   Visit 1   Visit 2   Visit 1   Visit 2   Visit 1   Visit 2
1x                1         -         -         -         -         -         -         1         425
1y                -         1         -         -         1         -         -         -         409
2x                -         -         2         -         -         2         -         -         415
2y                -         -         -         2         -         -         2         -         431
3x                3         -         -         -         -         -         -         3         425
3y                -         3         -         -         3         -         -         -         409
4x                -         -         4         -         -         4         -         -         415
4y                -         -         -         4         -         -         4         -         431

Note: Numbers in the body of the table refer to the measurement version sampled within a given battery.

Partitioning the sample in this way was helpful for a number of reasons. First, the splitting of each measurement version into two equally sized analysis samples afforded the opportunity to replicate LCA results for each version, establishing a baseline level of stability for model results when there is no measurement perturbation. Second, each analysis sample included data on a given measurement version from both Visits 1 and 2, in order to balance order effects; for instance, analysis sample 2x included data from measurement Version 2 taken from group BB at Visit 1 and group AB at Visit 2. Finally, the overlap between analysis samples allowed for the examination of class assignment stability within and between measurement versions. For instance, half of the members of analysis sample 2x came from group BB; thus, they were also in analysis samples 2y, 4x, and 4y. The other half of analysis sample 2x came from group AB; thus, they were also in analysis samples 1y and 3y. This permitted within-group comparisons (i.e., by comparing class assignments between 2x and 2y) and between-group comparisons (i.e., by comparing class assignments between 2x and 1y, 3y, 4x, and 4y).

Analyses

Latent class analysis (LCA; Lazarsfeld & Henry, 1968; Clogg & Goodman, 1984) models were fit to the binary alcohol use consequence items separately for each of the eight analysis samples using Mplus version 7.2 (Muthén & Muthén, 2015). A latent class model consists of classes defined by categorical observed variables, which are assumed conditionally independent given class membership. Let i index subjects (where i = 1, ..., N), q index binary items (where q = 1, ..., Q), k index latent classes (where k = 1, ..., K), and p index covariates (where p = 1, ..., P). Define the vector of item responses for subject i as $\mathbf{y}_i$, with individual elements $y_{iq}$ representing subject i’s response to the qth binary item. Then the latent class analysis model is given by:

$P(\mathbf{y}_i = \mathbf{1}) = \sum_{k=1}^{K} \pi_k \, P(\mathbf{y}_i = \mathbf{1} \mid c_{ik} = 1)$  (1)

where $c_{ik}$ is an indicator variable which takes on a value of 1 if subject i is a member of class k and 0 otherwise, and $\pi_k$ is the prevalence of class k, subject to the constraints that $\pi_k$ ranges from 0 to 1 and $\sum_{k=1}^{K} \pi_k = 1$. The class-specific probability mass function for subject i under class k is

$P(\mathbf{y}_i = \mathbf{1} \mid c_{ik} = 1) = \prod_{q=1}^{Q} P(y_{iq} = 1 \mid c_{ik} = 1)$  (2)

Critically, the above formulation implies conditional independence of the indicators given class membership. This assumption may be relaxed to allow continuous factors to account for local dependence between pairs of items (Reboussin, Ip, & Wolfson, 2008), or for substantively meaningful factors to be defined on the basis of multiple indicators, as in the factor mixture model presented in Study 2. While preliminary analyses determined that local dependence might exist between some item pairs for some of the models under consideration, the offending item pairs were not consistent across models and incorporating local dependence proved computationally intractable. Thus, we proceeded with the typical LCA formulation above.
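
To see Equations 1 and 2 in computational form, the sketch below evaluates the marginal probability of a binary response pattern, and the corresponding posterior class probabilities, under an LCA with invented parameter values. It is meant only as an illustration of the model structure; the actual models were estimated in Mplus.

```python
import numpy as np

# Hypothetical parameters: K = 3 classes, Q = 4 binary items.
pi = np.array([0.6, 0.3, 0.1])                   # class prevalences (sum to 1)
p = np.array([[0.05, 0.10, 0.02, 0.08],          # P(y_q = 1 | class k), one row per class
              [0.40, 0.55, 0.20, 0.35],
              [0.90, 0.85, 0.70, 0.80]])

def pattern_probability(y, pi, p):
    """Marginal probability of response pattern y and posterior class probabilities."""
    # Class-conditional probability of the full pattern (conditional independence, Eq. 2)
    cond = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    # Mixture over classes (Eq. 1)
    return np.sum(pi * cond), pi * cond / np.sum(pi * cond)

y = np.array([1, 1, 0, 0])
marginal, posterior = pattern_probability(y, pi, p)
print(marginal)    # marginal probability of this pattern
print(posterior)   # posterior class probabilities used for modal assignment
```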

Item endorsement probabilities, as well as potential covariate effects, were compared across the eight analyses. Additionally, in order to gauge the agreement between the modal classifications given by the optimal model for each version of the measure, the Adjusted Rand Index (ARI; Hubert & Arabie, 1985) was computed for each pairwise combination of LCA solutions, both within-version/between-subsample (e.g., comparing the solutions for analysis samples 1x and 1y) and between-version/within-subsample (e.g., comparing the solutions for analysis samples 1x and 2x). The ARI measures the concordance between two partitions of the same data, adjusting for chance, and ranges from −1 to 1, with values closer to 1 indicating greater agreement between the two classifications (Steinley, 2004).
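
The ARI itself is straightforward to compute; a minimal illustration using the adjusted_rand_score function from scikit-learn, with hypothetical modal class assignments, is shown below (the original analyses did not necessarily use this software).

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical modal class assignments for the same participants under two LCA
# solutions (e.g., analysis samples 1x and 1y); class labels are arbitrary integers.
assignments_a = [0, 0, 1, 1, 2, 2, 0, 1]
assignments_b = [1, 1, 0, 0, 2, 2, 1, 0]   # same partition, labels switched

# The ARI is invariant to label switching and corrects for chance agreement:
# 1 indicates identical partitions, values near 0 indicate chance-level agreement.
print(adjusted_rand_score(assignments_a, assignments_b))   # 1.0
```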

Results

Class enumeration

Model fit statistics informing class enumeration are presented in Table 3. Initially, class enumeration was informed by consideration of the Akaike Information Criterion (AIC; Akaike, 1998), the Bayesian Information Criterion (BIC; Schwarz, 1978), the Vuong-Lo-Mendell-Rubin likelihood ratio test (LMR; Vuong, 1989; Lo, Mendell, & Rubin, 2001), and the bootstrap likelihood ratio test (BLRT; McLachlan & Peel, 2000). However, ultimately only the BIC and the LMR p-value were considered as criteria because, with very few exceptions, neither the AIC nor the BLRT favored a value of K within the range of models considered (i.e., they continued to support more classes even at 7 classes). When there was disagreement between the BIC and the LMR, the BIC was generally favored, given that previous simulation work supported its accuracy in detecting the correct number of classes (Nylund, Asparouhov, & Muthén, 2007; Tofighi & Enders, 2008).
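
For reference, the information criteria in Table 3 are deterministic functions of the maximized loglikelihood, the number of free parameters, and the sample size. The sketch below states the AIC and BIC formulas and checks one Table 3 entry; it is provided purely as a reading aid.

```python
import numpy as np

def aic(loglik: float, n_params: int) -> float:
    """AIC = -2*LL + 2*p (Akaike)."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik: float, n_params: int, n: int) -> float:
    """BIC = -2*LL + p*ln(N) (Schwarz)."""
    return -2.0 * loglik + n_params * np.log(n)

# One-class model in analysis sample 1x (Table 3): LL = -3243.53, 18 parameters, N = 425.
# Yields roughly 6596, consistent with the tabled 6595.96 (difference due to rounding of LL).
print(bic(-3243.53, 18, 425))
```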

Table 3.

Study 1: Fit indices for LCA models with different numbers of classes under each measurement version.

Version 1 Sample 1x Sample 1y


K Parameters LL BIC LMR LMR p.val LL BIC LMR LMR p.val



1 18 −3243.53 6595.96 NA NA −3212.3 6532.67 NA NA
2 37 −2803.87 5831.57 879.334 0.0002 −2693.38 5608.89 1037.85 <.0001
3 56 −2678.06 5694.91 251.611 0.001 −2592.6 5521.42 201.545 0.2465
4 75 −2621.27 5696.27 113.581 0.0053 −2548.97 5548.23 87.267 0.2487
5 94 −2585.74 5740.16 71.055 0.241 −2507.43 5579.23 83.077 0.4613
6 113 −2561.03 5805.68 49.423 0.4393 −2479.25 5636.95 56.352 0.254
7 132 −2538.57 5875.71 44.921 0.4845 −2456.58 5705.67 45.35 0.7603

Version 2 Sample 2x Sample 2y


K Parameters LL BIC LMR LMR p.val LL BIC LMR LMR p.val



1 18 −3644.5 7397.41 NA NA −3600.69 7310.48 NA NA
2 37 −3022.69 6268.26 1243.6 <.0001 −3014.34 6252.96 1172.69 <.0001
3 56 −2871 6079.31 303.393 0.0578 −2887.87 6115.17 252.954 0.011
4 75 −2808.58 6068.91 124.84 0.0418 −2815.92 6086.46 143.883 <.0001
5 94 −2757.86 6081.92 101.439 0.0359 −2772.45 6114.67 86.949 0.0006
6 113 −2723.27 6127.19 69.179 0.1838 −2741.71 6168.36 61.48 0.8012
7 132 −2693.2 6181.49 60.146 0.3856 −2711.21 6222.53 60.799 0.1744

Version 3 Sample 3x Sample 3y


K Parameters LL BIC LMR LMR p.val LL BIC LMR LMR p.val



1 18 −3021.04 6150.84 NA NA −3079.64 6267.18 NA NA
2 37 −2491.42 5206.41 1059.24 <.0001 −2488.04 5197.86 1183.2 <.0001
3 56 −2369.44 5077.27 243.946 0.0076 −2360.95 5057.57 254.181 0.0021
4 75 −2312.36 5077.92 114.16 0.0042 −2311.36 5072.27 99.177 0.0006
5 94 −2280.97 5129.96 62.779 0.7179 −2280.57 5124.57 61.586 0.0812
6 113 −2246.75 5176.32 68.446 0.1493 −2250.85 5179.01 57.273 0.9061
7 132 −2224.82 5247.27 43.864 0.2427 −2223.8 5238.81 55.908 1

Version 4 Sample 4x Sample 4y


K Parameters LL BIC LMR LMR p.val LL BIC LMR LMR p.val



1 18 −2334.85 4778.08 NA NA −2299.78 4708.5 NA NA
2 37 −1851.77 3926.31 966.164 <.0001 −1882.04 3988 835.487 <.0001
3 56 −1767.44 3871.855 168.653 0.0047 −1779.99 3898.89 204.097 <.0001
4 75 −1724.56 3900.69 85.564 0.0681 −1748.11 3950.12 63.767 0.2147
5 94 −1698.87 3963.71 51.378 0.2086 −1726.81 4022.51 42.592 0.3062
6 113 −1674.3 4028.97 49.139 0.2814 −1705.88 4095.65 43.381 1
7 132 −1654.68 4104.14 39.224 0.3878 −1688.91 4176.7 35.234 0.6435

Note: LL = loglikelihood; BIC = Bayesian Information Criterion; LMR = Lo Mendell Rubin test statistic testing the null hypothesis that a model with K – 1 classes fits as well as a model with K classes; LMR p. val = the p value for the LMR statistic. Entries corresponding to the value of K favored by a given fit index are in bold.

For all models (Table 3), fit indices showed varying levels of agreement between and within measurement versions. The BIC was minimized for K = 3 classes in all measurement versions except for Version 2, in which the BIC favored a 4-class solution in both analysis samples 2x and 2y. In analysis samples 1x and 1y, the 3-class model was only narrowly favored over a 4-class model by the BIC. The LMR test was highly inconsistent across analysis samples for Versions 1 and 2, favoring a 2-class solution in samples 1y and 2x, a 4-class solution in sample 1x, and a 5-class solution in sample 2y. However, in Versions 3 and 4, the LMR was in agreement across samples, favoring a 4-class solution in Version 3 and a 3-class solution in Version 4. Solutions with more than three classes were generally unstable in sample 4x; each solution had at least one extremely small class in which parameters could not be freely estimated.

In summary, the BIC generally favored a 3-class solution in Versions 1, 3, and 4, and a 4-class solution in Version 2. The LMR favored anywhere between 2 and 5 classes, with little consistency between and within measurement versions. Given this mixed support, we considered both the 3- and 4-class solutions across all versions.

Endorsement probabilities and class prevalences

Figures 1 and 2 show model-implied endorsement probabilities for all 3-class and 4-class solutions in each measurement version. Note that, due to the instability of the 4-class solution in sample 4x, that sample is not considered further and only sample 4y is presented for Version 4. In order to present items in a way that facilitates their interpretation, the optimal order of items on the x-axis was determined using a hierarchical clustering algorithm (Fraley & Raftery, 2002). This algorithm groups together items that were highly correlated with one another in a full-sample analysis.

Figure 1.

Figure 1

Study 1: The 3-class model under all measurement versions.

Figure 2.

Figure 2

Study 1: The 4-class model under all measurement versions.

Three-class solutions

Item endorsement patterns for all 3-class solutions are shown in Figure 1. Across all measurement versions, the 3-class solution identified one class (Class 1) comprising the majority of the sample which was characterized by generally low probabilities of endorsing all items. In Versions 1, 2, and 3, the classes largely captured differences in overall level of endorsement, with Class 3 endorsing most items with high probability and Class 2 endorsing roughly half the items (those on the left-hand side of the x-axis) with low probability and roughly half (those on the right-hand side) with intermediate-to-high probability. The items which were endorsed most frequently by this class generally pertained to either loss of control (e.g., Items 6, 12, and 13) or social consequences (e.g., items 1, 14, and 4). Two items, Item 9 (V1: “Noticed a change in your personality”/V2, V3, V4: “Acted in a very different way or did things you normally would not do because of your drinking”) and Item 12 (V1: “Suddenly found yourself in a place that you could not remember getting to”/V2, V3, V4: “Awakened the morning after some drinking the night before and could not remember a part of the evening”), appeared to be less frequently endorsed by this intermediate group in Version 1 than in all other versions. By contrast, items infrequently endorsed by this group relative to Class 3 generally pertained to symptoms of dependence (Items 10, 11, 15, 16, and 17) or family or close relations disapproving of one’s drinking (Items 5 and 18).

In order to quantify the overall extent of the similarity between versions in endorsement patterns, the Euclidean distances between within-class endorsement probabilities (averaged across samples) were calculated for each pair of versions. These values are shown in the top half of Table 4. Differences between versions in Class 1 were generally small, corresponding to the generally low levels of endorsement in all versions. Differences in Classes 2 and 3 were greatest between Version 4 and the other three versions. In Version 4, Class 2 was characterized by a lower probability of endorsing Items 3, 4, and 14, but a higher probability of endorsing Item 12, than in Versions 1 and 2. Additionally, in Version 4, Class 3 was characterized by lower endorsement probabilities on Items 2, 3, 16, and 18 than in the other versions. Of these items, Versions 2 and 4 used the same stems for all but Item 4 (V2: “Neglected your responsibilities”; V4: “Neglected your obligations, your family, or your work for two or more days in a row because you were drinking”) and Item 14 (V2: “Had a fight, argument, or bad feeling with a friend”; V4: “Drinking created problems between you and a near relative or close friend”). This commonality suggested that changes to the stems for these items did not solely account for the differences observed for this version.
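
To illustrate how such a between-version distance is obtained, the sketch below computes the Euclidean distance between two within-class endorsement-probability profiles. The profiles here are randomly generated stand-ins, not the estimates underlying Table 4.

```python
import numpy as np

# Hypothetical within-class endorsement probabilities for the 18 RAPI items under
# two measurement versions (each averaged across the x and y analysis samples).
rng = np.random.default_rng(0)
profile_version_a = rng.uniform(0.1, 0.9, size=18)
profile_version_b = rng.uniform(0.1, 0.9, size=18)

# Euclidean distance between the two profiles, analogous to the entries of Table 4.
print(np.linalg.norm(profile_version_a - profile_version_b))
```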

Table 4.

Study 1: Euclidean distance between profile solutions of each version under each latent class.

3-class solution

Class 1        V1       V2       V3       V4
V1             0
V2             0.0888   0
V3             0.0392   0.0712   0
V4             0.0621   0.1137   0.037    0

Class 2        V1       V2       V3       V4
V1             0
V2             0.4015   0
V3             0.3471   0.2793   0
V4             1.066    0.717    0.316    0

Class 3        V1       V2       V3       V4
V1             0
V2             0.1773   0
V3             0.2826   0.2629   0
V4             1.706    1.762    1.146    0

4-class solution

Class 1        V1       V2       V3       V4
V1             0
V2             0.0472   0
V3             0.0418   0.0222   0
V4             0.062    0.037    0.022    0

Class 2        V1       V2       V3       V4
V1             0
V2             0.3710   0
V3             0.3663   0.0976   0
V4             0.917    0.305    0.2276   0

Class 3        V1       V2       V3       V4
V1             0
V2             0.9062   0
V3             1.0723   0.3693   0
V4             1.502    0.975    0.728    0

Class 4        V1       V2       V3       V4
V1             0
V2             0.2862   0
V3             1.4901   0.7783   0
V4             0.928    0.718    1.4      0

It was of interest to compare Version 2 to Versions 1 and 3, because Version 2 had the same response options as Version 1 but 50% different item stems, and the same item stems as Version 3 but different response options. Though the Euclidean distances did not indicate differentially close relationships between Version 2 and either Version 1 or 3, visual inspection of Figure 1 suggested that the general pattern of item endorsements might be somewhat closer between Versions 2 and 3 than between Versions 1 and 2. This impression was further supported by the fact that there was greater concordance between Versions 2 and 3 in the rankings of item endorsement probabilities relative to one another (Spearman’s ρ = .92 for Class 2, ρ = .88 for Class 3) than between Versions 1 and 2 (ρ = .85 for Class 2, ρ = .74 for Class 3). Versions 2 and 3 appeared to differ mainly in the severity of the items, with Items 9 and 12 being endorsed more frequently in Version 2 than in Version 3. Additionally, the prevalence of the low class relative to the intermediate class differed between these two versions, with Version 3 placing more subjects in the low class and fewer in the intermediate class than Version 2.
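
The rank-order concordance reported above can be computed as in the following sketch; the endorsement-probability vectors are simulated stand-ins rather than the estimated values.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical Class 2 endorsement probabilities for the 18 items under two versions.
rng = np.random.default_rng(1)
class2_version_a = rng.uniform(0.05, 0.95, size=18)
class2_version_b = np.clip(class2_version_a + rng.normal(0, 0.1, size=18), 0, 1)

# Spearman's rho asks whether the two versions rank the items similarly,
# irrespective of differences in the absolute endorsement levels.
rho, p_value = spearmanr(class2_version_a, class2_version_b)
print(rho, p_value)
```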

Four-class solutions

Item endorsement patterns for all 4-class solutions, with the exception of sample 4x, are shown in Figure 2. Here solutions were considerably less stable within versions than in the 3-class case, particularly for the high symptomatology classes with low prevalence rates, posing some challenge to interpretation. However, two things were particularly noteworthy with respect to the prevalence rates of each class. First, as in the 3-class solutions, the 4-class solutions identified one class (Class 1) which was characterized by low endorsement probabilities for all items; however, the prevalence of this class varied widely across versions, with Version 2 placing the smallest portion of the sample into this class and Version 4 placing the largest portion of the sample into this class. Second, also as in the 3-class solutions, all solutions here found a class (Class 4) with generally high levels of endorsement; however, the prevalence of this class varied widely within and between versions and was generally quite small (with a maximum prevalence of 9.23% in sample 1y). In Version 3, this extremely high level of endorsement was uniform across most items, whereas in Versions 1 and 2 there was considerably more variation among item endorsements.

A number of interesting differences between versions emerged with respect to the two intermediate classes. As in the 3-class solutions, there was some support for the conclusion that the general shapes of the classes in Version 2 were more similar to those of Version 3 than to those of Version 1, with greater concordance in rank order between Versions 2 and 3 in item endorsement rates for Class 2 (ρ = .93 between Versions 2 and 3, ρ = .81 between Versions 2 and 1) but not for Class 3 (ρ = .89 between Versions 2 and 3, ρ = .90 between Versions 2 and 1). As in the 3-class solution, Versions 2 and 3 were differentiated largely by item severity, with generally higher endorsement rates for a number of items in the intermediate classes in Version 2 than in Version 3. Unlike in the 3-class solution, here the Euclidean distance between Versions 2 and 3 was smaller than that between Versions 1 and 2 for all classes. Additionally, Version 4’s difference from the other versions in endorsement patterns was not as pronounced as in the 3-class solution. Though there were some differences between Version 4 and Versions 1 and 2, particularly in Class 3, these differences were somewhat challenging to interpret because of within-version differences in endorsement patterns.

Finally, some of the most consistent differences between versions were seen in class prevalence. In particular, the prevalence of the low-endorsement class was lowest in Version 2 and highest in Version 4. Both Versions 2 and 3 placed a majority of the sample in the intermediate classes (i.e., Classes 2 and 3). By contrast, both Versions 1 and 4 placed a majority of the sample in Class 1, with the intermediate classes being somewhat smaller and less stable across analysis samples.

Class assignments

Table 5 shows the adjusted Rand indices (ARI) for modal class assignments in both the 3-class and 4-class solutions. Diagonal elements indicate the stability of class assignments across the two analysis samples within the same measurement version, with the exception of Version 4, in which only sample 4y is considered for the 4-class solution. In the 3-class model, within-version class membership was most stable within Version 4. Version 2 showed similar levels of within-version stability in the 3-class solution, but was less stable in the 4-class solution; thus, in addition to being supported by the BIC as balancing fit and parsimony, the 3-class solution appeared to be particularly reliable in Version 2. As discussed above, the general shape of Version 2’s endorsement profiles corresponded more to that of Version 3, with which it shared common item stems, than to that of Version 1, with which it shared response options. However, the ARI did not give the impression that class membership was especially stable from Version 2 to Version 3 in either the 3- or 4-class solutions.

Table 5.

Study 1: Adjusted Rand indices for all models.

3-class models
        1y        2y        3y        4y
1x      0.4264    0.3634    0.3011    0.432
2x      0.4604    0.5999    0.3436    0.2897
3x      0.3787    0.3106    0.3677    0.2765
4x      0.2126    0.4305    0.2528    0.6177

4-class models
        1y        2y        3y        4y
1x      0.3365    0.2105    0.2729    0.3912
2x      0.3113    0.3888    0.3714    0.226
3x      0.3449    0.1781    0.3536    0.2859
4x      --        --        --        --

Summary

Study 1 examined LCAs of alcohol consequences under four different measurement versions. There were some differences across versions in class enumeration, with the BIC indicating that a 3-class solution fit best in all versions other than Version 2, in which the 4-class solution was favored. Both the 3- and 4-class solutions showed some degree of difference in item endorsement patterns and class prevalence rates across measurement versions. In both the 3- and 4-class solutions, differences from Version 1 in item endorsement patterns within each class generally increased with greater levels of measurement perturbation, with Version 4, corresponding to the highest level of item alteration, showing the greatest difference in the shapes of the class endorsement profiles. There were differences in class prevalence rates across versions, although these differences did not correspond directly to the degree of item perturbation, particularly in the 4-class case. In particular, whereas Versions 2 and 3 placed a large portion of the sample into classes characterized by intermediate levels of item endorsement, the low-endorsement class was considerably larger in Version 4, in both the 3-class and 4-class solutions. This finding is particularly interesting given that, in all versions aside from Version 4, a response of 0 always corresponds to a subject never having experienced a given consequence; in Version 4, a response of 0 may correspond to “never” (for items originally measured using the 5-point scale) or “0–2 times” (for items originally measured using the 4-point scale). Thus, the high prevalence of the low-endorsement class in Version 4 may reflect the higher threshold required to endorse items originally measured using the 4-point response scale. In sum, the results obtained from LCA models differed in a number of important ways across variations in measurement. We now examine the extent to which this sensitivity is also observed for factor mixture models.

Study 2

Study 2 used the REAL-U data described in Study 1 to investigate the stability of factor mixture model (FMM) results across two highly disparate measurement versions. This study focused exclusively on differences across Versions 1 and 4, the two most dissimilar experimental conditions in the study, in the nature of FMM results. Class enumeration has been shown to be highly sensitive to measurement in growth mixture models, a special case of FMM (Jackson & Sher, 2005), and we investigated the possibility of different numbers of classes being chosen in Versions 1 and 4. Unlike in GMM, however, in FMM measurement parameters (factor loadings and thresholds) are freely estimated; thus, it was of primary interest to determine whether and to what extent the measurement properties of items within- and between-classes differed on the basis of alterations to items.

Method

Participants

Participants (N = 854) were the same as those in Study 1.

Measures

Alcohol expectancies were measured using 14 items from two subscales, relating to tension reduction and sociability; these items are shown in Table 6. The items were drawn from a larger pool of 17 items administered in the REAL-U study, but three items were removed due to problematic characteristics in preliminary analyses, including cross-loadings or local dependence in one or both versions. Tension reduction items were taken from the corresponding subscale of the 9-item Alcohol Outcome Expectancies scale, which has good internal consistency (α = .89; Kushner, Sher, & Wood, 1994). Sociability items were taken from the corresponding subscale of the Brief Comprehensive Effects of Alcohol (B-CEOA; Fromme, Stroot, & Kaplan, 1993); these items show fair internal consistency (α = .81). Items were manipulated according to the same measurement versions as in Study 1, with the exception that here the instructions were also altered between measurement versions.

Table 6.

Study 2: Summary of all alcohol expectancies items used.

Version 1 (Battery A) Version 4 (Battery B)

Instructions   Version 1: The following items describe some effects of alcohol. Because alcohol affects people in different ways, we would like to know which of these effects you experience when you drink alcohol. Based on your own drinking experience, how much do you expect each of these effects when drinking alcohol? (If you have never consumed alcohol, indicate how you might expect alcohol to affect you if you had several drinks.)   Version 4: Choose from DISAGREE TO AGREE depending on whether you expect the effect to happen to you IF YOU WERE UNDER THE INFLUENCE OF ALCOHOL. These effects will vary, depending on the amount of alcohol you typically consume. Check one answer after each statement. There are no right or wrong answers. (If you have never consumed alcohol, indicate how you might expect alcohol to affect you if you had several drinks.)

Response Scale   Version 1: Not at all (0), A little bit (1), Somewhat (2), Quite a bit (3), A lot (4).   Version 4: Disagree (1), Slightly Disagree (2), Slightly Agree (3), Agree (4) (for non-italicized items); No chance (0), Very unlikely (1), Unlikely (2), Very likely (3), Certain to happen (4) (for italicized items).

Item Factor
1 Tension Reduction Drinking helps me to relax. I would feel calm

2 Tension Reduction Drinking helps me forget problems at work or school. I would be able to take my mind off my problems.

3 Tension Reduction Drinking helps me feel better about myself. I would be more satisfied with myself.

4 Tension Reduction Drinking helps me forget my worries. I would feel less worried.

5 Tension Reduction Drinking helps me feel better when I’m feeling down. I would feel less depressed.

6 Tension Reduction Drinking helps me relax when I’m tense. I would be less tense.


7 Tension Reduction Drinking helps me to calm down when I’m angry. I would feel less hostile.

8 Tension Reduction Drinking helps me deal with boredom. I would be less likely to have negative moods or feelings.

9 Tension Reduction Drinking helps me express my opinions and ideas better. I would be able to discuss or argue a point more forcefully.

10 Sociable Drinking helps me act sociable. I would be more sociable.

11 Sociable Drinking helps me talk to people. I would talk to people more easily

12 Sociable Drinking helps me to be friendly. I would be friendlier.

13 Sociable Drinking helps me to be talkative. I would be more “chatty”.

14 Sociable Drinking helps me to be outgoing. I would be more likely to be courageous.

15 Sociable Drinking helps me to be humorous. I would be more likely to have my humorous side come out.

16 Sociable Drinking helps me express my feelings. I would more easily open up and express my feelings.

17 Sociable Drinking helps me feel energetic. I would feel better physically.

Data were collapsed to a 3-point ordinal scale in both versions in order to enhance comparability and to eliminate sparse categories that could cause estimation difficulties. The original response options are shown in Table 6. In Version 1, responses were originally measured using a 5-point scale ranging from “not at all” to “a lot.” These options were recoded so that a response of “not at all” or “a little bit” was coded as 1, “somewhat” was coded as 2, and “quite a bit” or “a lot” was coded as 3. Half of the items in Version 4 were measured using a 4-point scale ranging from “disagree” to “agree,” and the other half using a 5-point scale ranging from “no chance” to “certain to happen.” Items on the 4-point scale were recoded so that a response of “disagree” or “slightly disagree” was coded as 1, “slightly agree” was coded as 2, and “agree” was coded as 3. Items on the 5-point scale were recoded so that a response of “no chance” or “very unlikely” was coded as 1, “unlikely” was coded as 2, and “very likely” or “certain to happen” was coded as 3. While these response options are clearly not harmonizable to categories with identical meanings across scales, such situations are not unusual when comparing results across studies in the absence of a gold standard measure, and this experimental condition was meant to mimic such conditions.

Procedure

The experimental procedure and study design were the same as those in Study 1. However, a different sub-sampling strategy was used to generate analysis samples for Versions 1 and 4. In the current analysis, comparing results between Versions 1 and 4 was of primary interest; for this reason, and to maximize sample size, only one large subsample was investigated for each of Versions 1 and 4. Data came from both groups who received a given measurement version at Visit 1, as well as whichever non-redundant group received that measurement version at Visit 2. Thus, data for Version 1 came from groups AB and AA at Visit 1 and group BA at Visit 2, yielding a total N = 635; data for Version 4 came from groups BA and BB at Visit 1 and group AB at Visit 2, yielding a total N = 641. Because groups AB and BA were common to both Versions 1 and 4, the two samples overlapped substantially, with 65.1% of individuals in Version 4 also measured under Version 1, reducing the extent to which differences obtained across the two versions might reflect simple sampling variability (since the majority of the two samples consisted of the same individuals).

Analyses

Factor mixture models (FMM) were fit to ordinal alcohol expectancies items, assuming that a two-factor structure held in all classes. A brief description of this model follows, but see Lubke and Muthén (2005; 2007), Lubke & Neale (2008), or Muthén (2006) for a more complete description of the FMM.

As in the LCA presented in Study 1, we define $y_{iq}$ as subject i’s response to the qth item and $c_{ik}$ as an indicator variable which takes on a value of 1 if subject i is a member of class k and 0 otherwise. Within each class, items are assumed to be affected by a set of R continuous, normally distributed factors $\boldsymbol{\eta}_i$ according to a common factor model. Because the data in this study were three-category ordinal variables, we implemented a cumulative logit specification for the regression of the indicators on the latent factors.

Define $P(y_{iq} \le j)$ as the probability of endorsing any response option up to and including j, where j = 1 or 2 (since the cumulative probability for j = 3 is by definition 1.0). This cumulative probability is calculated by marginalizing across the continuous and categorical latent variables as follows:

$P(y_{iq} \le j) = \sum_{k=1}^{K} \pi_k \int P(y_{iq} \le j \mid c_{ik} = 1, \boldsymbol{\eta}_i) \, f_k(\boldsymbol{\eta}_i) \, d\boldsymbol{\eta}_i$  (3)

where $P(y_{iq} \le j \mid c_{ik} = 1, \boldsymbol{\eta}_i)$ is the probability of endorsing any response option up to and including j on item q given subject i’s values of the continuous and categorical latent variables, $f_k(\boldsymbol{\eta}_i)$ is the class-specific density of the latent factors, and $\pi_k$ is the probability that subject i is a member of class k, subject to the constraints that $\pi_k$ ranges from 0 to 1 and $\sum_{k=1}^{K} \pi_k = 1$.

Within a given class, the distribution of $\boldsymbol{\eta}_i$ is assumed multivariate normal with R × 1 mean vector $\boldsymbol{\mu}_k$ and R × R covariance matrix $\boldsymbol{\Psi}_k$. The class-specific cumulative probability is related to the latent factors as follows:

$\operatorname{logit}\left(P(y_{iq} \le j \mid c_{ik} = 1, \boldsymbol{\eta}_i)\right) = \tau_{kjq} - \boldsymbol{\lambda}_{kq}' \boldsymbol{\eta}_i$  (4)

Class-specific measurement parameters are defined as in a common factor model: $\tau_{kjq}$ is a class-specific threshold parameter for response j on item q, and the R × 1 vector $\boldsymbol{\lambda}_{kq}$ contains class-specific factor loadings which transmit the effect of the latent factors $\boldsymbol{\eta}_i$ onto the cumulative logit for item q.
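
To illustrate how Equations 3 and 4 combine, the sketch below computes marginal category probabilities for a single three-category item within one class, using a one-factor simplification and Gauss-Hermite quadrature to integrate over the normal factor. All parameter values are invented, and the fitted models in this study (which used two factors) were estimated in Mplus.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_probs(eta, tau, lam):
    """Category probabilities for a 3-category item under Eq. 4:
    logit P(y <= j | eta) = tau_j - lam * eta, for j = 1, 2."""
    c1 = expit(tau[0] - lam * eta)       # P(y <= 1 | eta)
    c2 = expit(tau[1] - lam * eta)       # P(y <= 2 | eta)
    return np.array([c1, c2 - c1, 1.0 - c2])

# Hypothetical class-specific parameters for one item (single-factor simplification).
tau_k = np.array([-0.5, 1.2])    # thresholds for j = 1 and j = 2 in class k
lam_k = 1.3                      # factor loading in class k
mu_k, sd_k = 0.0, 1.0            # class-specific factor mean and standard deviation

# Integrate over the normal factor (the within-class part of Eq. 3) by quadrature.
nodes, weights = np.polynomial.hermite_e.hermegauss(25)   # weight function exp(-x^2/2)
eta_values = mu_k + sd_k * nodes
probs = np.array([category_probs(eta, tau_k, lam_k) for eta in eta_values])
marginal = (weights[:, None] * probs).sum(axis=0) / np.sqrt(2 * np.pi)
print(marginal, marginal.sum())   # class-k marginal category probabilities, summing to 1
```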

One of the strengths of FMM is that it allows for the assessment of measurement invariance (Mellenbergh, 1989; Meredith, 1993; Vandenberg and Lance, 2000) across latent classes in the population (Lubke & Muthén, 2005). In particular, one might be interested in whether certain segments of the population display fundamental differences in the organization or measurement of the underlying factors relative to other segments. To evaluate this question, we estimated FMMs assuming three distinct levels of measurement invariance, corresponding to configural invariance, weak metric invariance, and strong metric invariance across classes. The least restrictive of these models, the configural invariance model, assumes only that the pattern of factor loadings is the same across classes. The weak metric invariance model assumes equality of factor loadings across classes, and the strong metric invariance model additionally assumes equality of item thresholds across classes.

Though a review of measurement invariance testing in FMM is outside the scope of the current work (see Muthén, 2006; Lubke & Neale, 2008; and Clark et al., 2013), there are a few issues which distinguish measurement invariance testing in FMM from the evaluation of factor models fit to multiple observed groups. First, it is critical to note that in FMM the composition of classes could change on the basis of the level of measurement invariance assumed. Thus, while it is possible to compare the fit between, for example, models assuming weak versus strong measurement invariance across classes, the individuals within each class may shift between models, complicating their comparison in a more substantive sense. Second, even the number of classes deemed optimal for the data may differ depending on the invariance restrictions imposed on the model. That is, on one hand, the number of classes might be under-estimated by assuming too low a level of invariance, owing to the inclusion of unnecessary model parameters. On the other, the number of classes might be over-estimated by assuming too high a level of invariance, due to the potential for additional latent classes to compensate for model misspecification. Further, given the complexity of FMMs, information criteria such as the BIC may erroneously favor a more constrained model over the correct, non-invariant model (Lubke & Neale, 2008). As such, while we present comparisons of model fit here, we also note that hypothesis tests in FMMs must always be interpreted cautiously.

Results

Values of BIC used in determining K are shown in Table 7. The configural invariance model could not support a solution with more than 2 classes. In both measurement versions, BIC favored the 2-class solution over either a 1- or 3-class solution for the weak and strong invariance models and there were indications of estimation problems with three classes.

Table 7.

Study 2: All values of BIC used in model selection.

BIC

K Version 1 Version 4
Strong metric invariance
1 12650.544 12229.431
2 12648.558 12205.890
3 12667.305 12222.515
Weak metric invariance
1 12650.544 12229.431
2 12604.522 12138.856
3 12677.704 12211.276
Configural invariance
1 12650.544 12229.431
2 12637.701 12147.705

Note: BIC = Bayesian Information Criterion. Note that the 1-class strong metric invariance, weak metric invariance, and configural invariance models are the same.

Given that the weak, strong, and configural invariance models all supported a 2-class solution in both measurement versions, likelihood ratio tests were consulted in determining the optimal level of invariance. Despite having potentially limited substantive interpretability, as discussed above, the likelihood ratio test nevertheless provides a useful comparison between models in terms of their overall balance of fit and parsimony. In both measurement versions, the 2-class strong metric invariance model fit significantly worse than the 2-class weak metric invariance model (Version 1: χ2(26) = 142.18, p < .001; Version 4: χ2(23) = 171.52, p < .001). The weak metric invariance model, in turn, fit significantly worse than the configural invariance model in both versions (Version 1: χ2(12) = 42.36, p < .001; Version 4: χ2(12) = 34.83, p < .001). We considered allowing for partial weak invariance across classes, but partial weak invariance was also rejected relative to configural invariance. The disagreement between the LRT results, which favored the configural invariance model, and the BICs, which favored the weak invariance model, underscores the challenges in making meaningful comparisons between FMMs with different levels of invariance. Thus, despite the fact that the 2-class weak metric invariance model was favored by the BIC relative to the 2-class configural invariance model, we proceeded in interpreting the 2-class configural invariance model in both versions.
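
For reference, the chi-square difference test used in these comparisons has the usual form for nested models with the same number of classes; the sketch below uses hypothetical loglikelihood values chosen to reproduce the Version 1 weak-versus-configural statistic.

```python
from scipy.stats import chi2

def lr_test(loglik_restricted: float, loglik_full: float, df_diff: int):
    """Likelihood ratio test of a constrained model (e.g., weak invariance)
    against a less constrained model (e.g., configural invariance)."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2.sf(stat, df_diff)

# Hypothetical loglikelihoods; 2 * (LL_full - LL_restricted) = 42.36 on 12 df
# matches the reported Version 1 comparison of weak vs. configural invariance.
stat, p = lr_test(loglik_restricted=-6200.00, loglik_full=-6178.82, df_diff=12)
print(round(stat, 2), p)
```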

The 2-class configural invariance solution

In both versions, the 2-class solution divided the sample into two relatively large classes in which the sociability and tension reduction factors were positively correlated. In Version 1, 46.54% of the sample fell into Class 1, in which the factors were correlated at r = .825; 53.46% of the sample fell into Class 2, in which the tension reduction and sociability factors were correlated at r = .412. In Version 4, 59.20% of the sample fell into Class 1, in which the tension reduction and sociability factors were correlated at r = .721; 40.80% of the sample fell into Class 2, in which the factors were correlated at r = .628. For the subset of individuals measured using both Versions 1 and 4, the ARI comparing modal class membership estimates under the two measurement versions was .0014, indicating essentially no agreement beyond chance between the two versions.

Factor loadings

Standardized loadings are shown in Figure 3; note that the background of the plot is shaded for items originally measured using the 5-point scale. We first considered the loadings for Tension Reduction (top panels), then Sociability (bottom panels). In Version 1, loadings for the Tension Reduction factor were generally weaker in Class 2 than Class 1. Whereas Class 1 was characterized by consistently high loadings for all items on the Tension Reduction factor, in Class 2 Items 6, 7, and 8 (V1: “Drinking helps me relax when I’m tense,” “Drinking helps me to calm down when I’m angry,” “Drinking helps me deal with boredom,” respectively) appeared particularly weak. By contrast, in Version 4, loadings for the Tension Reduction factor were relatively close in both Classes 1 and 2. Also different from Version 1 is the fact that in Version 4 the same general pattern of loadings – with items 5 and 8 showing a slightly stronger relationship to the latent factor than the other items – held across both classes.

Figure 3. Study 2: Standardized factor loadings for all items in factor mixture model in Versions 1 and 4.

Note: Items with a gray background in Version 4 were measured using the 5-point scale.

Differences across classes in factor loadings for Sociability also varied between Versions 1 and 4. In Version 1, Items 10, 11, 12, and 14 (V1: “Drinking helps me to act sociable,” “Drinking helps me talk to people,” “Drinking helps me to be friendly,” and “Drinking helps me to be outgoing,” respectively) had similar standardized factor loadings across classes. The other three items, Items 15, 16, and 17 (V1: “Drinking helps me to be humorous,” “Drinking helps me express my feelings,” and “Drinking helps me feel energetic,” respectively), showed somewhat weaker loadings in Class 2 than in Class 1. A different pattern of class differences in loadings emerged in Version 4. As in Version 1, loadings for Items 11 and 12 were close to invariant across classes, while the loading for Item 17 was considerably weaker in Class 2. However, Items 10, 15, and 16 actually showed higher loadings in Class 2 than in Class 1. As shown in Table 6 and via shading in Figure 3, these were the 3 (of 4) items for the Sociability factor which were originally measured using a different response scale (0 = no chance, …, 4 = certain to happen) than the other items (1 = strongly disagree, …, 4 = strongly agree). Thus, in Version 4, differences between the classes in the measurement of Sociability may have reflected a method factor corresponding to differences in response scales across subsets of items.

Thresholds

Figure 4 shows the thresholds for ykq = 1 and ykq = 2 across classes and versions; again, the background of the plot is shaded for items originally measured using the 5-point scale. In Version 4, a few thresholds were fixed at either positive or negative 15 in Class 2: the thresholds for ykq = 2 for Items 5 and 7 were fixed at 15, and those for ykq = 1 for Items 11, 12, and 14 were fixed at −15. This reflects a boundary condition in which, for members of Class 2, the probability of endorsing ykq = 3 on Items 5 and 7 was functionally zero, and the probability of endorsing either ykq = 2 or ykq = 3 on Items 11, 12, and 14 was functionally one.

Figure 4. Study 2: Thresholds for all items for factor mixture models in Versions 1 and 4.

Note: Items with a gray background in Version 4 were measured using the 5-point scale; asterisks indicate parameter fixed at boundary value.

In Version 1, both thresholds were consistently lower for members of Class 2 than Class 1, indicating that members of Class 2 endorsed these items at higher levels. By contrast, in Version 4, differences between classes in thresholds occurred almost exclusively (with the exception of items whose thresholds were fixed at boundary values) in items measured using the 5-point response scale. In particular, thresholds for ykq = 1 were lower in Class 2, and thresholds for ykq = 2 were higher in Class 2, on all items showing a difference in Version 4. As such, in Version 4, collapsing the top two response categories of the 5-point response scale (“very likely” and “certain to happen”) may have decreased the proportion of the sample endorsing ykq = 3. Furthermore, class differences in thresholds for ykq = 1 appeared somewhat more pronounced for items measuring the Sociability factor, whereas class differences for ykq = 2 were larger for items measuring the Tension Reduction factor. These differences, especially for the Tension Reduction items, were most pronounced on items originally measured using the 5-point response scale.
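To see why thresholds estimated at the ±15 boundary imply endorsement probabilities that are functionally zero or one, consider the response function evaluated at the latent-variable mean. The sketch below assumes a logistic link purely for illustration (the precise link depends on the estimation settings used); under that assumption, a threshold of 15 yields an endorsement probability on the order of 10^-7.

```python
import math

def p_above_threshold(threshold, eta=0.0, loading=1.0):
    """Probability of responding above a given threshold under a logistic link,
    evaluated at latent score eta (illustrative parameterization only)."""
    return 1.0 / (1.0 + math.exp(threshold - loading * eta))

print(p_above_threshold(15.0))   # ≈ 3e-07: higher categories effectively never reached
print(p_above_threshold(-15.0))  # ≈ 0.9999997: higher categories effectively always reached
```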

Expected score curves

In order to jointly consider loadings and thresholds, expected score curves are shown in Figures 5 and 6 for Versions 1 and 4, respectively. For ordinal items, expected score curves weight each response option (here j = 1, 2, and 3) by its endorsement probability to obtain the expected value of ykq across a range of values of the underlying latent trait (Hill et al., 2007). In Version 1, most items differed uniformly between classes such that members of Class 2 had lower expected scores than members of Class 1 across all values of the latent variable. However, for a few items (particularly Items 7, 8, 15, 16, and 17), the relationship between subjects’ values of the latent factor and their expected scores was considerably weaker for members of Class 2 than Class 1, corresponding to the lower loadings for these items in Class 2. In Version 4, items measured on the 4-point scale appeared, in Class 2, to show a weaker relationship to the latent variable as well as much higher expected scores across all levels of the latent variable; examination of the expected score curves shows that, for these items, endorsing a higher response category was extremely likely even at low levels of the latent variable. By contrast, items measured on the 5-point scale differed in a number of ways between Classes 1 and 2. For items measuring the Tension Reduction factor, members of Class 2 showed either a weaker relationship between the latent factor and the expected score (Items 2, 3, and 8) or truncation of the range of expected scores, with virtually no probability of responding ykq = 3 (Items 5 and 7). For items measuring the Sociability factor (Items 10, 15, and 16), although loadings and thresholds generally differed between Classes 1 and 2 (with generally higher loadings, lower first thresholds, and higher second thresholds in Class 2), these parameters nevertheless combined to produce relatively similar expected score curves; this is not an uncommon finding when comparing expected score curves between groups, particularly with ordinal items (Raju, van der Linden, & Fleer, 1995; Oshima, Kushubar, Scott, & Raju, 2009).
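For readers who wish to reproduce curves of this kind, the sketch below computes an expected score curve for a single ordinal item under a cumulative (graded-response-type) parameterization with a logistic link; the loading and threshold values are hypothetical, and the link is assumed for illustration only.

```python
import numpy as np

def expected_score(eta, loading, thresholds):
    """Expected ordinal score E[y | eta] for categories 0..len(thresholds),
    using cumulative logistic response probabilities (illustrative values only)."""
    eta = np.asarray(eta, dtype=float)
    # P(y >= j | eta) for each threshold j = 1, ..., J
    p_at_or_above = [1.0 / (1.0 + np.exp(t - loading * eta)) for t in thresholds]
    # For nonnegative integer scores, E[y] = sum_j j * P(y = j) = sum_j P(y >= j)
    return np.sum(p_at_or_above, axis=0)

eta_grid = np.linspace(-3, 3, 7)
print(expected_score(eta_grid, loading=1.2, thresholds=[-1.0, 0.5, 2.0]))
```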

Figure 5. Study 2: Expected score curves in Version 1.

Figure 6. Study 2: Expected score curves in Version 4.

Note: Items with a gray background in Version 4 were measured using the 5-point scale.

Summary

Study 2 examined FMMs of alcohol expectancies under the two most disparate measurement versions in the study (Versions 1 and 4). Models with two factors corresponding to Tension Reduction (Factor 1) and Sociability (Factor 2) were considered, with varying numbers of classes and levels of measurement invariance across classes. In both measurement versions, fit statistics favored a 2-class solution, and likelihood ratio tests supported imposing only configural invariance between classes. Class prevalence rates were relatively similar across measurement versions. The two measurement versions, however, diverged greatly in the item parameter differences observed between the two classes. In Version 1, the two classes differed most on the Tension Reduction factor, with Class 2 showing considerably weaker loadings than Class 1 on a number of items. By contrast, in Version 4 the two classes were most different on the Sociability factor. Differences across classes in item parameters were strongest for items that had originally been measured using the 5-point response scale in Version 4; these items showed weaker loadings and higher thresholds in Class 2. This difference is of particular interest because it is not uncommon for item sets to include items measured using multiple response scales (e.g., when combining items from multiple scales within the same study or pooling data across multiple studies that use different response scales; Hussong, Curran, & Bauer, 2013).

Discussion

The current report examined the effects of differences in item wording and response scales on the nature of results obtained from mixture models. The nature of these effects was explored through two studies, which took advantage of an experimental design in which measurement was empirically manipulated. In Study 1, separate latent class analyses of binary alcohol use problem items were conducted across four measurement versions, which differed in terms of item stems, response categories, or both. In Study 2, factor mixture models of a two-factor alcohol expectancies scale were conducted on subjects from two different measurement versions.

Neither study found particularly strong differences between measurement versions in class enumeration. Though model fit indices offered only equivocal support for either a 3- or 4-class solution in Study 1, this pattern of support did not differ reliably between measurement versions. In Study 2, the same number of classes, as well as the same factor structure of items (i.e., configural invariance across classes), was unequivocally favored by model fit indices in both Versions 1 and 4. This is consistent with the results of Jackson and Sher (2005), who found that only relatively extreme differences in the operationalization of alcohol involvement resulted in a different number of classes being selected in a growth mixture model (GMM).

However, where measurement differences did become relevant was in the overall configuration of the classes observed. In Study 1, while classes with low levels of alcohol problem endorsement and classes with high levels of alcohol problem endorsement were found in all measurement versions, differences in item characteristics primarily changed the shape of intermediate classes. These results suggest that differences across studies in the measurement of alcohol problems might manifest as substantively distinct findings, particularly with respect to item endorsement patterns within intermediate classes and the prevalence of each class.

Likewise, in Study 2, the two measurement versions showed differences in factor loadings and thresholds, which were somewhat stronger for items whose response categories differed across versions. Whereas loadings were most disparate across classes for the Tension Reduction factor in Version 1, differences in loadings were larger for the Sociability factor in Version 4. The class differences we observed in the likelihood of endorsing items at higher scale points, which depended on items’ original response scales, are consistent with the findings of Jackson and Sher (2008) that choosing different cut points for categorical measures changes the configuration of classes in latent class growth analysis. However, it is worth noting that, unlike these authors, we did not see differences in class prevalence on the basis of different response categories.

Taken together, these findings demonstrate that the results of mixture models may change on the basis of decisions made in the measurement of the construct of interest. In this way, the current results contextualize the frequent disagreement between studies in the obtained number and nature of latent classes, suggesting that this disagreement may reflect differences across studies in measurement, as opposed to true differences in the nature of the latent classes themselves. These discrepancies create a serious barrier to a cumulative understanding of a number of constructs in the behavioral sciences, including but not limited to the case of latent classes based on alcohol use disorder criteria that we discussed at the outset. These barriers may be surmounted by more careful consideration of the ways in which constructs are measured when conducting mixture model analyses and interpreting their results. Most concretely, researchers may be well-advised to undertake sensitivity analyses in order to show that a given latent class structure is replicable across minor perturbations of measurement. Such alterations to measurement may be made either during data collection (e.g., administering different response scales or item stems to different subsets of participants) or after the fact (e.g., fitting a mixture model on the same data multiple times, each time collapsing response options differently).
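As a concrete illustration of the post hoc variant of this kind of sensitivity analysis, the sketch below recodes a set of ordinal item responses under two alternative collapsing schemes; the data, column names, and category mappings are hypothetical, and the recoded data sets would then be passed to whatever mixture modeling software is in use.

```python
import pandas as pd

# Hypothetical item responses on a 5-point (0-4) scale; not the study data.
df = pd.DataFrame({"item1": [0, 1, 2, 3, 4, 4],
                   "item2": [1, 1, 2, 4, 3, 2]})

# Two alternative collapsing schemes for the upper categories.
collapse_top_two = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3}   # merge the top two options
collapse_to_three = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}  # coarser three-category coding

recoded_a = df.replace(collapse_top_two)
recoded_b = df.replace(collapse_to_three)

# Each recoded data set would then be exported and the mixture model refit,
# comparing class enumeration and configuration across recoding schemes.
print(recoded_a.nunique(), recoded_b.nunique(), sep="\n")
```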

One critical limitation of the current work is that, though measurement was manipulated experimentally, it is unknown whether and to what extent any of the latent class solutions obtained represents the truth, as data were not simulated to have any particular latent class structure. Despite this limitation, the current findings suggest that variations in measurement must be considered in the interpretation of mixture model results, particularly when two sets of results differ. While one, both, or neither of these sets of results may represent the true latent class structure of the construct under study, such a determination is not possible to make without accounting for the potentially biasing effects of differences in measurement.

Acknowledgments

This work was supported by National Institutes of Health grants F31 DA040334 (Fellow: Veronica T. Cole) and R01 DA034636 (PI: Daniel J. Bauer). The content is solely the responsibility of the authors and does not represent the official views of the National Institute on Drug Abuse or the National Institutes of Health.

Footnotes

1. Importantly, though the diagnostic classification of AUD changed from DSM-IV to DSM-5, the criteria themselves changed only by the omission of one item and the addition of another.

2. Importantly, class membership probabilities may be affected by covariates (Huang & Bandeen-Roche, 2004). Preliminary analyses included gender, African American and Asian race, and Hispanic/Latino ethnicity as covariates predicting class membership. However, neither class enumeration nor the LCA solutions themselves (i.e., item endorsement patterns and class prevalence rates) changed with the inclusion of these covariates. Thus, in the interest of parsimony, we consider only an unconditional model here, in which class membership probabilities πik do not vary over individuals and reduce to overall prevalence rates πk.

3. This lack of consistency within version raises questions about the use of the LMR LRT for class enumeration in general.

4. As in the 3-class solutions, Euclidean distance was computed on probabilities averaged across samples for a given version (e.g., samples 1x and 1y), with the exception of Version 4, for which only sample 4y was used.

5. In order to determine the optimal partial weak invariance model, item-by-item tests of loading noninvariance across classes were conducted, following the IRT-LR-DIF strategy (Thissen, 2001). In both versions, the resulting partial weak invariance model was still rejected relative to the configural invariance model by the LRT.

References

1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Selected papers of Hirotugu Akaike. New York: Springer; 1998. pp. 199–213.
2. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 4th ed., text rev. Washington, DC: Author; 2000.
3. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 5th ed. Washington, DC: Author; 2013.
4. Arminger G, Stein P, Wittenberg J. Mixtures of conditional mean- and covariance-structure models. Psychometrika. 1999;64(4):475–494.
5. Bauer DJ, Curran PJ. Distributional assumptions of growth mixture models: implications for overextraction of latent trajectory classes. Psychological Methods. 2003;8(3):338–363. doi: 10.1037/1082-989X.8.3.338.
6. Bauer DJ, Curran PJ. The integration of continuous and discrete latent variable models: potential problems and promising opportunities. Psychological Methods. 2004;9(1):3–29. doi: 10.1037/1082-989X.9.1.3.
7. Beseler CL, Taylor LA, Kraemer DT, Leeman RF. A latent class analysis of DSM-IV alcohol use disorder criteria and binge drinking in undergraduates. Alcoholism: Clinical and Experimental Research. 2012;36(1):153–161. doi: 10.1111/j.1530-0277.2011.01595.x.
8. Bucholz KK, Cadoret R, Cloninger C, et al. A new, semi-structured psychiatric interview for use in genetic linkage studies: A report on the reliability of the SSAGA. Journal of Studies on Alcohol. 1994;55(2):149–158. doi: 10.15288/jsa.1994.55.149.
9. Chung T, Martin CS. Classification and course of alcohol problems among adolescents in addictions treatment programs. Alcoholism: Clinical and Experimental Research. 2001;25(12):1734–1742.
10. Clark SL, Muthén B, Kaprio J, D’Onofrio BM, Viken R, Rose RJ. Models and strategies for factor mixture analysis: An example concerning the structure underlying psychological disorders. Structural Equation Modeling: A Multidisciplinary Journal. 2013;20(4):681–703. doi: 10.1080/10705511.2013.824786.
11. Clogg CC, Goodman LA. Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association. 1984;79(388):762–771.
12. Crow SJ, Swanson SA, Peterson CB, Crosby RD, Wonderlich SA, Mitchell JE. Latent class analysis of eating disorders: Relationship to mortality. Journal of Abnormal Psychology. 2012;121(1):225–231. doi: 10.1037/a0024455.
13. Eggleston EP, Laub JH, Sampson RJ. Methodological sensitivities to latent class analysis of long-term criminal trajectories. Journal of Quantitative Criminology. 2004;20(1):1–26.
14. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97(458):611–631.
15. Fromme K, Stroot EA, Kaplan D. Comprehensive effects of alcohol: Development and psychometric assessment of a new expectancy questionnaire. Psychological Assessment. 1993;5(1):19–26.
16. Hill CD, Edwards MC, Thissen D, Langer MM, Wirth RJ, Burwinkle TM, Varni JW. Practical issues in the application of item response theory: a demonstration using items from the Pediatric Quality of Life Inventory (PedsQL) 4.0 Generic Core Scales. Medical Care. 2007;45(5):S39–S47. doi: 10.1097/01.mlr.0000259879.05499.eb.
17. Huang GH, Bandeen-Roche K. Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika. 2004;69(1):5–32.
18. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
19. Hussong AM, Curran PJ, Bauer DJ. Integrative data analysis in clinical psychology research. Annual Review of Clinical Psychology. 2013;9:61–89. doi: 10.1146/annurev-clinpsy-050212-185522.
20. Jackson N, Denny S, Sheridan J, Fleming T, Clark T, Teevale T, Ameratunga S. Predictors of drinking patterns in adolescence: a latent class analysis. Drug and Alcohol Dependence. 2014;135:133–139. doi: 10.1016/j.drugalcdep.2013.11.021.
21. Jackson KM, Sher KJ. Similarities and differences of longitudinal phenotypes across alternate indices of alcohol involvement: a methodologic comparison of trajectory approaches. Psychology of Addictive Behaviors. 2005;19(4):339–351. doi: 10.1037/0893-164X.19.4.339.
22. Jackson KM, Sher KJ. Comparison of longitudinal phenotypes based on number and timing of assessments: a systematic comparison of trajectory approaches II. Psychology of Addictive Behaviors. 2006;20(4):373–384. doi: 10.1037/0893-164X.20.4.373.
23. Jackson KM, Sher KJ. Comparison of longitudinal phenotypes based on alternate heavy drinking cut scores: a systematic comparison of trajectory approaches III. Psychology of Addictive Behaviors. 2008;22(2):198–209. doi: 10.1037/0893-164X.22.2.198.
24. Kushner MG, Sher KJ, Wood MD, Wood PK. Anxiety and drinking behavior: Moderating effects of tension-reduction alcohol outcome expectancies. Alcoholism: Clinical and Experimental Research. 1994;18(4):852–860. doi: 10.1111/j.1530-0277.1994.tb00050.x.
25. La Flair LN, Bradshaw CP, Storr CL, Green KM, Alvanzo AA, Crum RM. Intimate partner violence and patterns of alcohol abuse and dependence criteria among women: a latent class analysis. Journal of Studies on Alcohol and Drugs. 2012;73(3):351–360. doi: 10.15288/jsad.2012.73.351.
26. La Flair LN, Reboussin BA, Storr CL, Letourneau E, Green KM, Mojtabai R, … Crum RM. Childhood abuse and neglect and transitions in stages of alcohol involvement among women: a latent transition analysis approach. Drug and Alcohol Dependence. 2013;132(3):491–498. doi: 10.1016/j.drugalcdep.2013.03.012.
27. Lazarsfeld PF, Henry NW, Anderson TW. Latent structure analysis. Boston: Houghton Mifflin; 1968.
28. Lo Y, Mendell NR, Rubin DB. Testing the number of components in a normal mixture. Biometrika. 2001;88(3):767–778.
29. Lubke GH, Miller PJ. Does nature have joints worth carving? A discussion of taxometrics, model-based clustering and latent variable mixture modeling. Psychological Medicine. 2015;45(4):705–715. doi: 10.1017/S003329171400169X.
30. Lubke GH, Muthén B. Investigating population heterogeneity with factor mixture models. Psychological Methods. 2005;10(1):21–39. doi: 10.1037/1082-989X.10.1.21.
31. Lubke G, Muthén BO. Performance of factor mixture models as a function of model size, covariate effects, and class-specific parameters. Structural Equation Modeling. 2007;14(1):26–47.
32. Lubke G, Neale M. Distinguishing between latent classes and continuous factors with categorical outcomes: Class invariance of parameters of factor mixture models. Multivariate Behavioral Research. 2008;43(4):592–620. doi: 10.1080/00273170802490673.
33. Lynskey MT, Nelson EC, Neuman RJ, Bucholz KK, Madden PA, Knopik VS, … Heath AC. Limitations of DSM-IV operationalizations of alcohol abuse and dependence in a sample of Australian twins. Twin Research and Human Genetics. 2005;8(6):574–584. doi: 10.1375/183242705774860178.
34. Mancha BE, Hulbert A, Latimer WW. A latent class analysis of alcohol abuse and dependence symptoms among Puerto Rican youth. Substance Use & Misuse. 2012;47(4):429–441. doi: 10.3109/10826084.2011.643525.
35. McLachlan G, Peel D. Finite mixture models. John Wiley & Sons; 2000.
36. Meehl PE. Bootstraps taxometrics: Solving the classification problem in psychopathology. American Psychologist. 1995;50(4):266. doi: 10.1037//0003-066x.50.4.266.
37. Mellenbergh GJ. Item bias and item response theory. International Journal of Educational Research. 1989;13(2):127–143.
38. Meredith W. Measurement invariance, factor analysis and factorial invariance. Psychometrika. 1993;58(4):525–543.
39. Miller ET, Neal DJ, Roberts LJ, Baer JS, Cressler SO, Metrik J, Marlatt GA. Test-retest reliability of alcohol measures: is there a difference between internet-based assessment and traditional methods? Psychology of Addictive Behaviors. 2002;16(1):56–63.
40. Muthén B. Latent variable mixture modeling. In: New developments and techniques in structural equation modeling. 2001. pp. 1–33.
41. Muthén B. Should substance use disorders be considered as categorical or dimensional? Addiction. 2006;101(s1):6–16. doi: 10.1111/j.1360-0443.2006.01583.x.
42. Muthén LK, Muthén BO. Mplus User’s Guide. 7th ed. Los Angeles, CA: Muthén & Muthén; 2015.
43. Muthén B, Shedden K. Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics. 1999;55(2):463–469. doi: 10.1111/j.0006-341x.1999.00463.x.
44. Nagin DS. Analyzing developmental trajectories: a semiparametric, group-based approach. Psychological Methods. 1999;4(2):139.
45. Nagin DS, Tremblay RE. Analyzing developmental trajectories of distinct but related behaviors: a group-based method. Psychological Methods. 2001;6(1):18. doi: 10.1037/1082-989x.6.1.18.
46. Neal DJ, Corbin WR, Fromme K. Measurement of alcohol-related consequences among high school and college students: application of item response models to the Rutgers Alcohol Problem Index. Psychological Assessment. 2006;18(4):402. doi: 10.1037/1040-3590.18.4.402.
47. Nylund KL, Asparouhov T, Muthén BO. Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling. 2007;14(4):535–569.
48. Oshima T, Kushubar S, Scott J, Raju N. DFIT8 for Windows user’s manual: Differential functioning of items and tests. St. Paul, MN: Assessment Systems Corporation; 2009.
49. Presley CA, Meilman PW, Lyerla R. Development of the Core Alcohol and Drug Survey: Initial findings and future directions. Journal of American College Health. 1994;42(6):248–255. doi: 10.1080/07448481.1994.9936356.
50. Raju NS, van der Linden WJ, Fleer PF. IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement. 1995;19(4):353–368.
51. Reboussin BA, Ip EH, Wolfson M. Locally dependent latent class models with covariates: an application to under-age drinking in the USA. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2008;171(4):877–897. doi: 10.1111/j.1467-985X.2008.00544.x.
52. Rinker DV, Neighbors C. Latent class analysis of DSM-5 alcohol use disorder criteria among heavy-drinking college students. Journal of Substance Abuse Treatment. 2015;57:81–88. doi: 10.1016/j.jsat.2015.05.006.
53. Sampson RJ, Laub JH, Eggleston EP. On the robustness and validity of groups. Journal of Quantitative Criminology. 2004;20(1):37–42.
54. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–464.
55. Steinley D. Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods. 2004;9(3):386. doi: 10.1037/1082-989X.9.3.386.
56. Thissen D. IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [documentation for computer program]. Chapel Hill: L. L. Thurstone Psychometric Laboratory, University of North Carolina; 2001.
57. Tofighi D, Enders CK. Identifying the correct number of classes in growth mixture models. In: Advances in latent variable mixture models. Information Age Publishing; 2008. pp. 317–341.
58. Tsai J, Rosenheck RA. Conduct disorder behaviors, childhood family instability, and childhood abuse as predictors of severity of adult homelessness among American veterans. Social Psychiatry and Psychiatric Epidemiology. 2013;48(3):477–486. doi: 10.1007/s00127-012-0551-4.
59. Van Horn ML, Smith J, Fagan AA, Jaki T, Feaster DJ, Masyn K, … Howe G. Not quite normal: Consequences of violating the assumption of normality in regression mixture models. Structural Equation Modeling: A Multidisciplinary Journal. 2012;19(2):227–249. doi: 10.1080/10705511.2012.659622.
60. Vandenberg RJ, Lance CE. A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods. 2000;3(1):4–70.
61. Vuong QH. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society. 1989:307–333.
62. Wells JE, Horwood LJ, Fergusson DM. Drinking patterns in mid-adolescence and psychosocial outcomes in late adolescence and early adulthood. Addiction. 2004;99(12):1529–1541. doi: 10.1111/j.1360-0443.2004.00918.x.
63. White HR, Labouvie EW. Towards the assessment of adolescent problem drinking. Journal of Studies on Alcohol. 1989;50(1):30–37. doi: 10.15288/jsa.1989.50.30.
