Abstract
Latent growth modeling allows social behavioral researchers to investigate within-person change and between-person differences in within-person change. Typically, conventional latent growth curve models are applied to continuous variables, where the residuals are assumed to be normally distributed, whereas categorical variables (i.e., binary and ordinal variables), which do not hold to normal distribution assumptions, have been rarely used. This article describes the latent growth curve model with categorical variables, and illustrates applications using Mplus software that are applicable to social behavioral research. The illustrations use marital instability data from the Iowa Youth and Family Project. We close with recommendations for the specification and parameterization of growth models that use both logit and probit link functions.
Keywords: Categorical variables, Latent response variable, Latent growth curve model
The repeated measures used in longitudinal research may not be continuous, and such measures are not normally distributed. For example, the occurrence of a life event, such as divorce, job loss, or pregnancy, is a binary response (0 = No or 1 = Yes). Also, ordinal scales often consist of multiple response options with uneven spacing. For example, ‘think about divorce’, a constituent item of the marital instability measure, may have the response options 0 = never in the last year, 1 = yes, within the last year, 2 = yes, within the last 6 months, and 3 = yes, within the last 3 months. These categorical variables (i.e., binary and ordinal) are discrete and often skewed, and thus violate the normal distributional assumptions required for ML estimation in LGCM (McTernan & Blozis, 2015).
One solution to this problem is to transform categorical responses into normally distributed continuous variables before estimating a LGCM. This approach is known as latent response variable (LRV) transformation (Masyn, Petras, & Liu, 2013). This approach requires extending the measurement component of LGCM in a SEM framework to incorporate an additional step of transforming observed categorical response variables into latent continuous variables. However, this approach to estimating a LGCM with categorical response variables (hereafter referred to as “a categorical LGCM”) has only recently begun to gain popularity in social behavioral research. This paper has three main purposes, which we illustrate using a longitudinal sample of married couples. We aim to: (a) explain how categorical response variables (i.e., binary and ordinal responses) are transformed into latent continuous response variables, (b) demonstrate how to specify a categorical LGCM with time-invariant covariates (i.e., predictors and outcomes) and interpret the parameters, and (c) demonstrate the use of a longitudinal dyadic model with categorical variables to investigate the associations of time-variant covariates. The equations, figures, and Mplus programs (version 7.4) are provided.
LATENT RESPONSE VARIABLE TRANSFORMATION
Suppose that the observed binary outcome Y (i.e., 0 and 1) is modeled in the confirmatory factor analysis (CFA) model. Contrary to estimating model parameters in a single confirmatory factor analysis (CFA) model with continuous variables (see panel a of Figure 1), additional latent variables are required in a CFA with categorical response variables (hereafter referred to as “categorical CFA”) (see panel b of Figure 1). These additional latent variables estimate a corresponding model with categorical response variables which exist between observed categorical variables (Y1, Y2, Y3, Y4, and Y5) and the latent factor (η). The variables comprising this corresponding model are known as latent response variables (LRV) and are indicated as Y*s. The Y*s are latent continuous variables, which are transformed from categorical response variables using cut-points (known as “thresholds”; Skrondal & Rabe-Hesketh, 2004). This transformation converts observed categorical variables (Ys) into latent continuous Y*s. It is these latent continuous Y*s that are then used as the indicators to produce the latent factor η in a CFA model. This LRV transformation of observed categorical variables (Y1 – Y5) to latent continuous responses (Y1* – Y5*) is part of a family of linear models known as Generalized Linear Models (GLM; Skrondal & Rabe-Hesketh, 2004). The LRV can be formulated based on one of two distributional assumptions: (a) standard logistic distribution or (b) standard normal distribution. In the following sections, we first introduce the LRV transformation using the standard logistic distribution.
LRV transformation of a Binary Response Variable
Transformation of a binary response variable assuming a standard logistic distribution.
Using the standard logistic distribution, the non-normal binary variable Y is transformed into a continuous latent variable Y* through the following sequence:
$$\text{Prob}(Y_i = 1) \;\rightarrow\; \text{Odds}(Y_i = 1) \;\rightarrow\; \text{log-odds}(Y_i = 1)$$
where Prob (Yi = 1) is the probability of Y being 1 for individual i. Odds (Yi = 1) is the ratio of the probability of Y being 1 to the probability of Y being 0. This can be expressed as Prob (Yi = 1) / Prob (Yi = 0) or Prob (Yi = 1) / [1 − Prob (Yi = 1)]. Log-odds (logits) are the natural logarithm of the odds.
As can be seen in the transformation process above, the transformation starts with the probability of the observed binary Yi = 1 (possible range from 0 to 1) and continues with the odds of Yi = 1 (possible range from 0 to ∞). Then, these odds are converted into log-odds values (possible range from −∞ to ∞) with a standard logistic distribution (mean = 0 and variance = π²/3). In the standard logistic distribution, each respondent’s log-odds (or logit) value, which represents his/her continuous latent variable response Yi*, yields a threshold, τ1. In the LRV transformation, this threshold serves as a cut-point that separates the underlying unobserved (latent) continuous variable, Y*, into observed categories. Panel a of Figure 2 illustrates how the threshold dichotomizes values of the continuous latent response variable Y*, such that
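$$Y_i = \begin{cases} 0 & \text{if } Y_i^{*} \le \tau_1 \\ 1 & \text{if } Y_i^{*} > \tau_1 \end{cases} \tag{1}$$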
If the latent continuous value of Yi* is less than or equal to τ1 (the cut-point or threshold), then the observed binary Yi variable = 0 (that is, individual i’s response is 0). If the latent continuous value of Yi* is greater than τ1, then the observed binary Yi variable = 1 (that is, individual i’s response is 1).
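The chain from probability to odds to log-odds, and the dichotomization rule in Equation 1, can be sketched numerically. The following is a minimal illustration rather than part of the original analysis; the function names are hypothetical and the probabilities are arbitrary example values.

```python
import math


def prob_to_odds(p: float) -> float:
    """Odds of Y = 1 given Prob(Y = 1) = p, for 0 < p < 1."""
    return p / (1.0 - p)


def prob_to_logit(p: float) -> float:
    """Log-odds (logit) of Y = 1: the natural logarithm of the odds."""
    return math.log(prob_to_odds(p))


def dichotomize(y_star: float, tau1: float) -> int:
    """Equation 1: Y = 0 if Y* <= tau1, otherwise Y = 1."""
    return 0 if y_star <= tau1 else 1


# Arbitrary example probabilities (not taken from the data).
for p in (0.10, 0.50, 0.90):
    print(p, round(prob_to_odds(p), 3), round(prob_to_logit(p), 3))
```

A probability of .50 corresponds to odds of 1 and a logit of 0, and probabilities that are symmetric around .50 yield logits of equal size and opposite sign.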
Using the information from the threshold under the standard logistic distribution, the unconditional logistic model of the continuous latent variable Yi* can now be specified as the baseline measurement model of the CFA, such as
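$$\text{logit}\,[\text{Prob}(Y_i = 1)] = \ln\!\left[\frac{\text{Prob}(Y_i = 1)}{\text{Prob}(Y_i = 0)}\right] = -\tau_1 \tag{2}$$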
As can be seen in Equation 1, the threshold indicates the propensity (the level) of a latent continuous response variable Yi* for observed category Y = 0. The threshold (i.e., cut-point) can be converted into both odds and the percentages of the response categories.
Next, the logistic model can be modeled with the latent factor ηi as a predictor of latent Yi* in the CFA model, yielding a conditional logistic model (see panel b of Figure 1; Agresti, 2002; Masyn et al., 2013), such that:
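For the log-odds (logit) of the observed categorical variable Yi = 0:
$$\text{logit}\,[\text{Prob}(Y_i = 0 \mid \eta_i)] = \tau_1 - \lambda \eta_i \tag{3.a}$$
For the log-odds (logit) of the observed categorical variable Yi = 1:
$$\text{logit}\,[\text{Prob}(Y_i = 1 \mid \eta_i)] = -\tau_1 + \lambda \eta_i \tag{3.b}$$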
Note that, as given in Equation 1, the value of the threshold refers to Yi = 0 (not Yi = 1). This leads to an issue that often arises regarding the direction of logistic coefficients (see Equations 3.a and 3.b). We discuss this issue in more detail in the next section.
With the addition of the latent factor, the logit value of a binary response variable Yi now linearly changes as a function of the latent factor (ηi) with the logistic regression coefficient, λ. Similar to the interpretation of a normal regression model, λ can be interpreted as the change in the log-odds (or logit values) of Yi* for a one-unit difference in ηi. Alternatively, the odds-ratio (abbreviated as OR and calculated as exp [λ]) can be used to interpret factor loadings as the percent (%) change in the odds of Y being 1 (that is, Y = 1) for a one-unit increase in the latent variable η using a simple formula: 100 × (Exp (λ) −1). The main advantage of utilizing this odds-ratio is to investigate the effect sizes of factor loadings, which is analogous to standardized coefficients in a normal regression model (Allen & Le, 2008). That is, an odds-ratio reflects the relative contribution of a latent variable to the categorical outcome.
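The conversion from a logistic loading to an odds ratio and a percent change in the odds can be written as two small helper functions. This is a sketch only; the function names are hypothetical and the loading of .50 is an arbitrary illustrative value, not an estimate from the data.

```python
import math


def odds_ratio(lam: float) -> float:
    """Odds ratio for a one-unit increase in eta: exp(lambda)."""
    return math.exp(lam)


def percent_change_in_odds(lam: float) -> float:
    """Percent change in the odds of Y = 1: 100 * (exp(lambda) - 1)."""
    return 100.0 * (math.exp(lam) - 1.0)


lam = 0.50  # hypothetical loading, for illustration only
print(round(odds_ratio(lam), 3), round(percent_change_in_odds(lam), 1))
```

A loading of 0 gives an odds ratio of 1 (no change in the odds), positive loadings give a percent increase, and negative loadings give a percent decrease.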
The association between logistic coefficients and probabilities of Yi being 1.
This section introduces how to convert logistic coefficients into probabilities for the observed response categories of the binary outcome Yi. According to Equations 1 and 2, the threshold value (τ1) indicates the log-odds (i.e., logit) of Yi being 0 (i.e., the lower category), whereas the negative of the threshold value (−τ1) indicates the logit of Yi being 1 (i.e., the higher category). This threshold τ1 can be converted to the probability of a specific response category (Muthén, 2001). For example, the response probability for individual i having a Y value of 1 is calculated using the negative of the threshold value (−τ1) as follows:
$$\text{Prob}(Y_i = 1) = \frac{1}{1 + \exp(\tau_1)} = \frac{\exp(-\tau_1)}{1 + \exp(-\tau_1)} \tag{4.1}$$
This equation suggests that a large positive threshold value reflects a low probability of Yi = 1 and consequently, a high probability of Yi being 0. A large negative threshold value reflects the high probability of Yi = 1 (or a low probability of Yi being 0). To use a numerical example, −3 reflects that the probability of Yi being 1 is .953, whereas 3 reflects that the probability of Y being 1 is .047 (equivalently, 1 – .953). In the conditional logistic model of a CFA (see Equation 3.b), the conditional response probability that the Yi value is 1, is also calculated as follows:
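$$\text{Prob}(Y_i = 1 \mid \eta_i) = \frac{1}{1 + \exp(\tau_1 - \lambda \eta_i)} = \frac{\exp(-\tau_1 + \lambda \eta_i)}{1 + \exp(-\tau_1 + \lambda \eta_i)} \tag{4.2}$$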
In this equation, the threshold τ1 can be converted to represent the conditional response probability of Yi = 1 after adjusting for the effects of the latent factor, ηi (i.e., the conditional probability represents a model where the effect of the latent variable is zero).
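The numerical example above (thresholds of −3 and 3) can be checked with a short function that applies the inverse logit to −τ1 + λη. This is a minimal sketch; the function name and the default arguments are assumptions made for illustration.

```python
import math


def prob_y1(tau1: float, lam: float = 0.0, eta: float = 0.0) -> float:
    """Prob(Y = 1 | eta) under the logit link: 1 / (1 + exp(tau1 - lam * eta)).

    With lam * eta = 0 this reduces to the unconditional probability
    computed from the threshold alone.
    """
    return 1.0 / (1.0 + math.exp(tau1 - lam * eta))


print(round(prob_y1(-3.0), 3))  # large negative threshold: high Prob(Y = 1)
print(round(prob_y1(3.0), 3))   # large positive threshold: low Prob(Y = 1)
```

The two calls return .953 and .047, matching the values given in the text. Supplying a nonzero `lam` and `eta` shifts the probability in the direction of the loading.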
Extending logit transformation to latent growth curves with binary indicator variables.
Under the logit link function, a categorical CFA can be extended to a categorical latent growth curve model (categorical LGCM; random intercept and random slope model) by specifying repeated categorical indicators to the measurement model and estimating the growth factors (e.g., initial level and rate of change) as latent variables in the structural part of the model (Masyn et al., 2013). The full model specification is shown in Figure 3.
The model specification is similar to that of a conventional multilevel framework (Raudenbush & Bryk, 2002). The difference is this categorical LGCM specification uses the continuous latent response variable, Yti*, (for individual i at time t) as the growth indicators whereas a conventional LGCM uses observed indicators, Yti. Given the baseline model of the continuous latent response variable Yi* (see Equation 2), the unconditional linear LGCM with binary repeated indicators can be specified as follows.
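$$\text{logit}\,[\text{Prob}(Y_{ti} = 1 \mid \eta_{0i}, \eta_{1i})] = -\tau_1 + \eta_{0i} + \lambda_t \eta_{1i} \tag{5}$$
$$\eta_{0i} = \alpha_{00} + \zeta_{0i} \tag{6.1}$$
$$\eta_{1i} = \alpha_{10} + \zeta_{1i} \tag{6.2}$$
$$\begin{pmatrix} \zeta_{0i} \\ \zeta_{1i} \end{pmatrix} \sim \text{MVN}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \; \boldsymbol{\Psi} = \begin{pmatrix} \psi_{00} & \\ \psi_{10} & \psi_{11} \end{pmatrix} \right) \tag{7}$$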
where τ1 is the threshold of the latent response variable Yti*. η0i and η1i are the latent growth factors for the initial level (i.e., intercept factor) and the rate of change (i.e., slope), respectively, with factor loadings (λ) usually set to equal t (= 0, 1, 2, …, T; time is centered at the first occasion of measurement). ζ indicates the normally distributed error with a mean of 0 and a variance of ψ. Ψ represents the variance-covariance structure of η0i and η1i. In general, in a categorical LGCM, the thresholds are set to be invariant over time in order to consistently define the association between Yti and Y*ti. This is known as the “longitudinal threshold invariance assumption” (Masyn et al., 2013), showing that the thresholds do not depend on time.
Given the multilevel modeling structure, level 1 represents the individual’s latent responses at each time point, and level 2 characterizes the individual trajectories over time. Therefore, individual differences in the growth factors (η0i and η1i; i.e., random effects) are represented by errors (ζ0i and ζ1i) that vary around the expected means of the intercept (α00) and slope (α10) (i.e., inter-individual differences in intra-individual change). According to Equation 3.b, at the first occasion (when time = 0) in the categorical LGCM, −τ1 + α00 represents the mean of the log-odds (i.e., logit) values of Y = 1, and α10 represents the mean change in the logit for Y = 1 corresponding to a one-unit increase in time (Masyn et al., 2013). Note that either the threshold, τ1, or the mean parameter, α00, of the intercept factor (η0i) should be fixed to 0 for model identification purposes. To follow the conventional LGCM approach, the time-invariant threshold, τ1, is fixed to 0 and the mean (α00) of the intercept factor is estimated (Mehta, Neale, & Flay, 2004).
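To make the two-level structure concrete, the data-generating process implied by a linear categorical LGCM with a logit link can be simulated. The sketch below uses only the Python standard library; the function name and all parameter values are hypothetical choices for illustration and are not estimates from the IYFP data.

```python
import math
import random


def simulate_binary_lgcm(n, loadings, alpha0, alpha1,
                         psi00, psi11, psi10, tau1=0.0, seed=1):
    """Simulate repeated binary outcomes from a linear categorical LGCM."""
    rng = random.Random(seed)
    # Cholesky factor of the 2 x 2 covariance matrix Psi of the growth factors.
    l11 = math.sqrt(psi00)
    l21 = psi10 / l11
    l22 = math.sqrt(psi11 - l21 ** 2)
    data = []
    for _ in range(n):
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        eta0 = alpha0 + l11 * z0               # level 2: random intercept
        eta1 = alpha1 + l21 * z0 + l22 * z1    # level 2: random slope
        row = []
        for lam_t in loadings:
            logit = -tau1 + eta0 + lam_t * eta1   # level 1: logit of Y = 1
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical parameter values, chosen only for illustration.
rows = simulate_binary_lgcm(n=500, loadings=[0, 1, 2, 3, 4],
                            alpha0=0.0, alpha1=-0.5,
                            psi00=1.0, psi11=0.25, psi10=0.2)
print([round(sum(r[t] for r in rows) / len(rows), 3) for t in range(5)])
```

The printed list gives the simulated proportion of Y = 1 at each occasion. With a negative mean slope these proportions tend to decline over time, while individual rows differ because each simulated person draws their own intercept and slope.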
For the purpose of illustrating this model specification, the latent response variable, Yti*, was modeled as a linear function of the growth factors, namely the intercept, η0i, and the slope, η1i. However, nonlinear trajectories, such as curvilinear change forms (e.g., quadratic trajectories), can also be modeled, depending on the sample size and the number of repeated categorical indicators. In the section that follows, we illustrate how to model the parameters of a linear categorical LGCM using the logit (log-odds) transformation in Mplus.
Mplus Model Specification for the Categorical LGCM
In Mplus, the categorical LGCM with five repeated binary outcomes can be estimated by specifying the syntax below:
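DATA: FILE IS example.dat;
VARIABLE: NAMES ARE Y1-Y5;
  USEVARIABLES ARE Y1-Y5;
  CATEGORICAL ARE Y1-Y5;            ! treat Y1-Y5 as binary or ordinal
  MISSING = ALL (-999);             ! missing value code (a VARIABLE option)
ANALYSIS: ESTIMATOR = ML;
  LINK = LOGIT;
MODEL:
  I S | Y1@0 Y2@1 Y3@2 Y4@3 Y5@4;   ! linear growth, loadings fixed to time
  I WITH S;                         ! intercept-slope covariance
  [Y1$1-Y5$1@0];                    ! all thresholds fixed to 0
  [I];                              ! mean of the intercept factor estimated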
where Y1 to Y5 are repeated binary outcomes. Most of this Mplus syntax is similar to that of a LGCM with continuous variables. For example, the DATA command specifies the data file to be utilized for the analysis (i.e., example.dat). The VARIABLE command defines all variables in the data file. The USEVARIABLES option selects the variables to be used in the analysis. The MISSING = ALL (-999) option identifies the missing value code (−999 in the current example) so that Mplus can handle missing cases with full-information maximum likelihood estimation (FIML). However, several additional lines of syntax must be specified to estimate a categorical LGCM. First, the CATEGORICAL option specifies which variables are to be treated as binary or ordinal variables in the model. Second, the LINK option should be included under the ANALYSIS command to use the logit link function (LINK=LOGIT) with ML estimation (or ML with robust standard errors [MLR] in Mplus).
Also, for the model specification of a categorical LGCM, the syntax for the two random effects in the growth model (i.e., random intercept and random slope; shown above as I S | …) should be specified in the MODEL command. This syntax instructs Mplus to estimate the growth parameters (i.e., the mean and variance) of the intercept and slope factors. The covariance, ψ10, between the intercept and slope is estimated by using a WITH statement (I WITH S). For illustrative purposes, a linear growth factor was specified by fixing the factor loadings of the slope (η1i) to equal the time intervals (i.e., 0, 1, 2, 3, and 4). The thresholds of the five repeated indicators are referred to as Y1$1-Y5$1. Both thresholds and means are specified in square brackets. To set all thresholds to 0 and estimate the mean of the intercept factor, the Mplus syntax is [Y1$1-Y5$1@0] and [I].
Model fit evaluation.
In Mplus, ML estimation with logit transformation provides both relative and absolute model fit indices. More specifically, a log-likelihood (LL) value and information criterion (IC) statistics (e.g., Akaike information criterion [AIC] and Bayesian information criterion [BIC]) are provided as relative fit indices. For absolute fit indices, Mplus gives two chi-square tests: (a) the Pearson chi-square test and (b) the likelihood ratio chi-square test (LRT). The null hypothesis for both chi-square tests is that the hypothesized model fits the observed data. Therefore, non-significant p values indicate that the model fits the data well (Rupp, Templin, & Henson, 2010). However, the Pearson chi-square and LRT statistics sometimes provide inconsistent results, particularly when the model includes a large number of categorical variables in combination with a small sample size (Geiser, 2012). For this reason, we recommend using the deviance statistic (= −2 × LL; with a chi-square distribution) with the number of free parameters (FP). In particular, when comparing two competing nested models, a deviance difference test (ΔDeviance; Δ −2LL) can be used. Model M (i.e., the constrained model) is nested within Model M′ (i.e., the unconstrained model) if M is obtained by imposing constraints on the parameters of M′, which is often referred to as the parent model. Calculating a ΔDeviance statistic is similar to calculating a nested chi-square comparison (Δχ2) in that ΔDeviance = (−2 × LL Constrained model) − (−2 × LL Unconstrained model) and Δdf = FP Unconstrained model − FP Constrained model. In general, a significant p-value of ΔDeviance indicates that the unconstrained model (i.e., the model with the smaller deviance value) fits better than the constrained model, whereas a non-significant p-value indicates that the constraints do not significantly worsen model fit, so the more parsimonious constrained model (i.e., the model with the larger deviance value) is preferred. Also, smaller AIC and BIC values can be taken as evidence for the preferred model.
Empirical example of a categorical LGCM with binary response variables: Marital instability
This section uses empirical data to illustrate how model parameters of a categorical LGCM with five repeated binary items can be estimated using the logit transformation. Binary response data were used from the Iowa Youth and Family Project (IYFP) (PI: R. D. Conger). The IYFP is a longitudinal panel study of 451 youths (52% female) and their families from two-parent households in the Midwest. Additional information regarding the study procedures is available from Conger and Conger (2002). For our example, categorical analyses are based on husbands’ and wives’ responses to one marital instability item (“thinking about divorce”) in 1989 (Wave 1), 1990 (Wave 2), 1991 (Wave 3), 1992 (Wave 4), and 1994 (Wave 6). Response options included “never in the last year,” “yes, within the last year,” “yes, within the last 6 months,” and “yes, within the last 3 months” (coded as 0 to 3). The original 4-point scale was recoded with 0 = never and 1 = yes ( > 0; i.e., thoughts of divorce at any point in the previous year). Factor loadings for the slope factor were fixed to 0, 1, 2, 3, and 5 (because the measurement occasions were not equally spaced).
Results.
As in the case of the categorical CFA model, we investigated the model fit indices and model parameters. At Wave 1, around half (64.1%) of the wives reported that they had thought about divorce during the past year. This proportion decreased over time (31.6%, 30.9%, 31.2%, and 28.9% for Wave 2 to Wave 6, respectively). In order to evaluate model fit, we compared two competing nested models: (a) the random intercept and random slope model (i.e., the linear categorical LGCM; the unconstrained model) and (b) the random intercept model¹ (the constrained model; Curran, Obeidat, & Losardo, 2010). The results showed that the random intercept and random slope model (−2LL, FP = 2287.50, 5; AIC / BIC = 2297.71 / 2318.27; unconstrained model) had smaller IC statistics than the random intercept model (−2LL, FP = 2421.72, 2; AIC / BIC = 2425.71 / 2433.93; constrained model). Also, the ΔDeviance test was statistically significant (Δ−2LL, Δdf = 134.22, 3, p < .001), indicating that the linear categorical LGCM fit the data better than the random intercept model. As mentioned in the previous section, several potential non-linear trajectories (e.g., quadratic trajectories) can be compared to the linear categorical LGCM in a similar manner. For illustrative purposes, we used the linear categorical LGCM as the optimal model. Overall, the results indicated that there was inter-individual variability in intra-individual patterns of change over time for “thinking about divorce.”
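The deviance difference test reported above can be reproduced from the two deviance values and the numbers of free parameters. The sketch below computes the chi-square upper-tail probability with standard-library functions only, using closed-form expressions that hold for integer degrees of freedom; the function name `chi2_sf` is hypothetical, and a statistics library would normally be used instead.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer gamma terms.
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


# Values reported in the text for the two nested models.
dev_constrained, fp_constrained = 2421.72, 2      # random intercept model
dev_unconstrained, fp_unconstrained = 2287.50, 5  # random intercept and slope

delta_deviance = dev_constrained - dev_unconstrained
delta_df = fp_unconstrained - fp_constrained
p_value = chi2_sf(delta_deviance, delta_df)
print(round(delta_deviance, 2), delta_df, p_value < 0.001)
```

The difference of 134.22 on 3 degrees of freedom gives a p-value far below .001, in line with the reported result.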
Next, we investigated the growth parameters of the categorical LGCM (see the left column of Table 1). The mean of the intercept was not significantly different from 0 (α00 = .11, p = .32). However, the mean of slope was significant (α10 = −.57, p < .001), indicating an odds-ratio (OR) = .57. That is, there was a 43% (= 100 × (exp[−.57]−1)) decrease in the odds of “thinking about divorce” for a 1-year time change. Moreover, the growth factor variances were statistically significant (ψ00 = 1.49, p < .01; ψ11 = .40, p < .01), suggesting that the trajectories of “thinking about divorce” varied across the sample of wives (i.e., this shows inter-individual variation in individual’s trajectories). In addition, the positive covariance between growth factors was significant (ψ10 =.55, p < .001), suggesting that wives who had a higher propensity of “thinking about divorce” at the first measurement occasion tended to show a slower decrease in the propensity to think about divorce over the 6-year time span captured in the analysis.
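The reported means can be turned into an odds ratio and into model-implied probabilities for a wife whose intercept and slope equal the factor means. This is a sketch under that assumption: because the model includes random effects, these conditional probabilities are not the same as the observed sample proportions at each wave.

```python
import math

# Estimates reported in the text (logit link, threshold fixed to 0).
alpha00, alpha10 = 0.11, -0.57
tau1 = 0.0
loadings = [0, 1, 2, 3, 5]  # slope loadings used in the empirical example

odds_ratio = math.exp(alpha10)
percent_change = 100.0 * (odds_ratio - 1.0)
print(round(odds_ratio, 2), round(percent_change))

# Logit and probability of Y = 1 at each occasion for a wife whose growth
# factors equal the means (i.e., both random effects are zero).
probs = []
for lam_t in loadings:
    logit = -tau1 + alpha00 + lam_t * alpha10
    probs.append(1.0 / (1.0 + math.exp(-logit)))
print([round(p, 3) for p in probs])
```

The first line of output reproduces the odds ratio of .57 and the 43% decrease in the odds per year. The second shows how the implied probability for this average trajectory declines across the measurement occasions.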
TABLE 1.
| Latent Growth Curve Model (LGCM) | | | |
|---|---|---|---|
| Estimator | ML | ML | ML |
| Link-function | Logit link | Probit link | Probit link |
| Indicator | Binary | Binary | Ordinal |
| Parameterization | – | Theta | Theta |
| Residuals | Fixed to π²/3 | Fixed to 1 | Fixed to 1 |
| Means | | | |
| α00 | .11 (.10) | .04 (.06) | .03 (.06) |
| α10 | −.57*** (.07) | −.32*** (.04) | −.27*** (.03) |
| Variance and Covariance | | | |
| ψ00 | 1.49** (.52) | .57** (.18) | .68*** (.14) |
| ψ11 | .40** (.14) | .13** (.05) | .05** (.02) |
| ψ10 | .55*** (.13) | .18*** (.05) | .10** (.10) |
| Thresholds | | | |
| τ1 | = .00 | = .00 | = .00 |
| τ2 | – | – | 1.08 (.05) |
| τ3 | – | – | 1.77 (.07) |
| Fit statistics | | | |
| Deviance (−2LL) | 2287.50 | 2289.24 | 3742.71 |
| FP | 5 | 5 | 7 |
| AIC / BIC | 2297.71 / 2318.27 | 2299.24 / 2319.79 | 3756.71 / 3785.49 |
Note. ML = Maximum Likelihood. −2LL= −2 log-likelihood value. FP = Numbers of free parameters. AIC = Akaike Information Criterion. BIC = Bayesian Information Criterion. Unstandardized coefficients are shown with standard errors in parentheses.
** p < .01.
*** p < .001.
A categorical latent growth curve model with time-invariant covariates
Individual differences in the growth parameters can be modeled as a function of an individual-level, time-invariant covariate, or predictor, W1i (multiple covariates are possible). These differences are quantified by the regression coefficients γ01 and γ11, representing the association of the predictor with the intercept and slope, respectively. The time-invariant predictor W1i can be added to Equations 6.1 and 6.2, as follows:
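$$\eta_{0i} = \alpha_{00} + \gamma_{01} W_{1i} + \zeta_{0i} \tag{8.1}$$
$$\eta_{1i} = \alpha_{10} + \gamma_{11} W_{1i} + \zeta_{1i} \tag{8.2}$$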
The advantage of utilizing a SEM approach is that this approach allows for the prediction of a subsequent outcome or response, D1i, by growth factors (intercept and slope) and other predictors within the same analytical framework. The levels of the outcome can be expressed as a function of the growth parameters and predictor(s) and can be written as:
$$D_i = \beta_0 + \beta_1 \eta_{0i} + \beta_2 \eta_{1i} + \gamma_1 W_{1i} + \varepsilon_i \tag{9}$$
β0 is the intercept for the multiple regression of the outcome, Di. β1 and β2 are the coefficients linking the intercept parameter (η0i) and the slope parameter (η1i), respectively, to the outcome, Di. γ1 is the coefficient linking the time-invariant predictor, W1i, to the time-invariant outcome, Di. εi is the normally distributed residual with a mean fixed to 0 and variance σ2. The model is now identical to a multiple regression estimating the adjusted effect of each predictor on an outcome after controlling for the effects of the other covariates.
Mplus model specification for time-invariant covariates in the categorical LGCM.
In Mplus, covariates (i.e., predictors and outcomes) can be included by adding the syntax below to the existing categorical LGCM syntax:
VARIABLE: NAMES ARE Y1-Y5 X D;
USEVARIABLES ARE Y1-Y5 X D;
⋮
MODEL:
⋮
I S ON X; D ON I S X;
where X and D represent continuous covariates (X = a predictor and D = an outcome). The ON syntax defines the regression relationships between growth factors and covariates. For example, I S ON X specifies that the two growth factors I and S (i.e., initial level and slope), which are the dependent variables, are regressed on the predictor X in a binary LGCM. Likewise, D ON I S X specifies that the outcome D is regressed on the two growth factors and the predictor X.
Results.
To demonstrate the conditional LGCM with binary outcomes, we used two continuous measures from wives’ reports as a time-invariant predictor and outcome, respectively: (a) family-work conflict at Wave 1 (a summed score of two items with a mean [SD] of 5.13 [1.26] and a skewness of −.10) and (b) self-reported global mental health at Wave 6 (a single item with a mean [SD] of 2.14 [.86] and a skewness of .53). Wives’ family-work conflict did not predict the likelihood of their thinking about divorce at the first measurement occasion (γ01 = .04, p = .71), but it positively predicted the slope parameter (γ11 = .11, p < .05). The results indicated that wives who perceived more family-work conflict at Wave 1 were more likely than those who perceived less family-work conflict to exhibit increases in their propensity to think about divorce over time. In terms of the outcome model, the slope of “thinking about divorce” was positively related to wives’ mental health problems at Wave 6 (β2 = .30, p < .05), after adjusting for the effects of the intercept (β1 = .09, p = .06) and early family-work conflict (γ1 = .05, p = .21). These results indicated that a wife with a greater increase in the likelihood of “thinking about divorce” over time tended to report more mental health problems at Wave 6.
Applying Probit Transformation for Categorical LGCM
In categorical SEMs, the probit transformation assumes that the latent response variable Y* is normally distributed (i.e., standard normal distribution)². Therefore, the probit transformation produces standard normal z-scores for the continuous latent response variable Y* in place of the logit values obtained by a logit transformation. Note that the estimated logit value is approximately equal to 1.81 times the probit value (the ratio of the standard deviations of the standard logistic and standard normal distributions, π/√3), which allows for an easy conversion from probit to logit values and vice versa. The probit transformation of a categorical response Y for a categorical CFA (see panel b of Figure 1) is as follows:
$$Y_i^{*} = \lambda \eta_i + \varepsilon_i, \qquad Y_i = 1 \ \text{if} \ Y_i^{*} > \tau_1 \tag{10}$$
where τ1 is a threshold defining the continuous latent response variable Yi* (see Equation 1). λ is the estimated factor loading of an indicator on the latent factor, ηi. εi is the residual of Yi*, with a mean of 0 and variance θ.
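The approximate factor of 1.81 relating logit and probit values can be illustrated numerically. The sketch below rescales a probit coefficient by π/√3 and compares the logistic and normal distribution functions on that common scale; the function names are hypothetical, and the conversion is only an approximation because the two distributions differ in shape.

```python
import math

# Ratio of the standard deviations of the standard logistic and
# standard normal distributions: sqrt(pi^2 / 3) / 1.
SCALE = math.pi / math.sqrt(3.0)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(x: float) -> float:
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-x))


def probit_to_logit(b: float) -> float:
    """Approximate logit-scale value for a probit-scale value."""
    return b * SCALE


print(round(SCALE, 2))
# Probit-link slope mean from Table 1 (-.32), rescaled to the logit metric.
print(round(probit_to_logit(-0.32), 2))
```

Rescaling the probit slope mean of −.32 gives about −.58, close to the logit-link estimate of −.57 in Table 1. Across the range of z, the two distribution functions on this common scale differ by only a few hundredths.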
In a SEM framework, two parameterizations are commonly used for probit models: (a) the theta parameterization and (b) the delta parameterization (Muthén & Asparouhov, 2002). In the theta parameterization, the residual variance θ is specified directly and fixed at 1 (e.g., εi ~ N (0, 1)). This is how probit regression models are parameterized when using ML estimation in other statistical packages (e.g., SAS and SPSS). In the delta parameterization, by contrast, the total variance of Yi* is fixed to 1 (e.g., λ2 × ψ + θ = 1). Therefore, the residual variance θ is not directly specified, but is calculated indirectly using a scale factor, 1 / σ*, where σ* is the standard deviation of Y* (equivalent to √(λ2 × ψ + θ)). Typically, this scale factor is fixed to 1, which produces the residual variance θ as 1 − the explained variance of Y* (equivalent to the standardized λ2).
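The relation between the two parameterizations can be sketched for a single indicator with one latent factor. The function below rescales a theta-parameterization loading (residual variance fixed to 1) so that the total variance of Y* equals 1, as in the delta parameterization. The function name and the input values are hypothetical, and the sketch ignores the additional scale factors that arise with repeated indicators in a growth model.

```python
import math


def theta_to_delta(lam_theta: float, psi: float):
    """Rescale a theta-parameterization loading to the delta metric.

    Theta: residual variance = 1, so Var(Y*) = lam^2 * psi + 1.
    Delta: Var(Y*) = 1, so the loading is divided by SD(Y*) and the
    residual variance becomes 1 - (rescaled lam)^2 * psi.
    """
    sd_ystar = math.sqrt(lam_theta ** 2 * psi + 1.0)
    lam_delta = lam_theta / sd_ystar
    theta_delta = 1.0 - lam_delta ** 2 * psi
    return lam_delta, theta_delta


# Hypothetical values: loading of 1 and factor variance of 1.
lam_delta, theta_delta = theta_to_delta(1.0, 1.0)
print(round(lam_delta, 4), round(theta_delta, 4))
```

With a loading and factor variance of 1, the total variance under theta is 2, so the rescaled loading is about .707 and the implied residual variance is .50. The explained and residual parts then sum to 1, as the delta parameterization requires.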
In latent growth curve modeling using probit transformation, two parameterizations (theta vs. delta) produce slightly different parameter estimates because different identification constraints are imposed (see Grimm & Liu, 2016 for a more detailed discussion of estimation challenges). In this tutorial, we focus on the theta parameterization, because it is a standard probit model specification. Now, the probit value of binary response variable Yi changes linearly as a function of the latent factor (ηi) with the probit regression coefficient, λ. Similar to interpreting a standard regression model, λ can be interpreted as the expected change in the probit values of Yi* for a one-unit difference in ηi, which can be converted to probabilities. Moreover, a standardized coefficient, λ, can make interpreting model parameters between the continuous latent response variable, Y*, and the latent variable, η, simpler (i.e., the correlation [r-matrix]; Toland, 2014). These standardized parameters correspond to the effect size estimates (Kline, 2011).
In addition, the probit transformation can be used with either (a) the ML estimator or (b) the weighted least squares (WLS) estimator (or weighted least squares means and variances, WLSMV). The WLS estimator uses limited information methods, which eases the computational burden (Finney & DiStefano, 2006). Therefore, the WLS estimator is often preferred over ML estimation in SEM. However, some researchers opt for ML estimation because it produces more efficient parameter estimates (Edwards, 2010). For this reason, the illustrative examples for the categorical LGCMs were estimated with ML estimation. With ML estimation, the probit transformation provides the same types of model fit indices as the logit transformation. For model comparisons, the ∆Deviance test (described in the model fit evaluation for the logit model) can also be used with a probit model estimated with ML.
Using the model specification shown in Figure 3, we estimated the univariate latent growth curve model with the same five repeated dichotomized (binary) items (‘thinking about divorce’) of the marital instability measure using the probit transformation. The Mplus commands for the categorical LGCM using the probit transformation are identical to those for the categorical LGCM using the logit transformation, with the exception of the LINK option (i.e., LINK=PROBIT) in the ANALYSIS command. The results (i.e., model fit indices and growth parameters) were similar to those of the categorical LGCM with the logit link function (see the middle column of Table 1). As with the logit link function, the growth parameters under the probit link function can also be converted to the corresponding percentages (%) of response categories (see the formula in Appendix A; Grimm & Liu, 2016).
LRV Transformation of Ordinal Response Variables.
As mentioned in the introduction, LRV transformation can be extended to a model with ordinal variables. If the response variable comprises more than two response categories, then, according to the LRV formulation, there should be more than one threshold. More specifically, the number of thresholds equals one less than the number of response categories. Panel b of Figure 2 shows how an ordinal variable with four categories (ranging from 0 to 3) can be converted into a latent response variable, Yi*. The observed Yi = 0 when Yi* is less than or equal to the threshold τ1. The observed Yi = 1 when Yi* is greater than τ1 but less than or equal to τ2. The observed Yi = 2 when Yi* is greater than τ2 but less than or equal to τ3. The observed Yi = 3 when Yi* exceeds the threshold τ3 (Masyn et al., 2013).
In the ordinal response, thresholds reflect the “distance” between categories. Thus, the greater distance between response thresholds reflects higher likelihood of being in one category rather than the other categories, whereas lesser distance between thresholds indicates a lower likelihood of being in one category rather than the others. As can be seen in panel b of Figure 2, most individuals responded 0; consequently, the greatest increase is noted for τ1 (from the left side of the curve), followed by τ2 (from τ1), and τ3 (from τ2).
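The mapping from a latent response value to an observed ordinal category can be expressed as counting how many thresholds lie strictly below Y*. The following is a minimal sketch; the function name is hypothetical, and the example thresholds are the ordinal-model values from Table 1 (with τ1 fixed to 0).

```python
from bisect import bisect_left


def categorize(y_star: float, thresholds) -> int:
    """Observed category for a latent response value.

    `thresholds` must be sorted in ascending order (tau1, tau2, ...).
    bisect_left returns the number of thresholds strictly below y_star,
    so a value exactly equal to a threshold stays in the lower category.
    """
    return bisect_left(thresholds, y_star)


taus = [0.0, 1.08, 1.77]  # tau1 to tau3 from the ordinal model in Table 1
for y_star in (-0.5, 0.0, 0.5, 1.08, 1.5, 2.0):
    print(y_star, categorize(y_star, taus))
```

Values at or below τ1 map to category 0, values between τ1 and τ2 to category 1, and so on, with values above τ3 falling in the highest category.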
Extending Probit Transformations to a Latent Growth Curve with Ordinal Indicators.
The probit transformation for ordinal response variables, an extension of the binary probit model, yields the cumulative probit model. That is, for an ordinal variable with J categories, the probit model represents the propensity of being in a category higher than j (> j) rather than in category j or a lower category (≤ j). Based on the standard probit model with a binary item (see Equation 10), the ordinal probit regression equation linking the latent response variable Yi* to the latent factor ηi can be expressed as:
(11) |
where j = 0, 1, 2, …, J−1, τ0 = −∞, and τJ = ∞. As with the coefficient in the binary probit model, λ is interpreted as the difference in probit values for responding above category j for every one-unit increase in the latent variable, ηi. Probit values for ordinal variables can also be converted to the probability of a specific category. Given that ordinal variables produce cumulative probabilities, the probability of a specific category j can be calculated as:
(12) |
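For readers who want these relations written out, one standard way to express the cumulative probit model described here is sketched below. The notation follows the definitions given in the text (thresholds τ, loading λ, latent factor ηi, τ0 = −∞, τJ = ∞, and a standard normal residual); the exact layout of the published Equations 11 and 12 may differ, so this should be read as an assumption-based sketch rather than a reproduction.

```latex
\begin{align}
  P(Y_i > j \mid \eta_i) &= \Phi\!\left(-\tau_{j+1} + \lambda\,\eta_i\right), \\
  P(Y_i = j \mid \eta_i) &= \Phi\!\left(\tau_{j+1} - \lambda\,\eta_i\right)
                            - \Phi\!\left(\tau_{j} - \lambda\,\eta_i\right),
  \qquad j = 0, 1, \ldots, J-1 .
\end{align}
```

Under these conventions the lowest category has probability Φ(τ1 − ληi), because Φ(τ0 − ληi) = 0, and the highest category has probability 1 − Φ(τJ−1 − ληi), because Φ(τJ − ληi) = 1.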
In this instance, the continuous LRV, Yi*, of an ordinal variable is used as the indicator of the categorical LGCM. For the same reason that the threshold value is fixed to be equal across time points in the categorical LGCM with repeated binary variables (i.e., a time-invariant τ1), each of the multiple thresholds of an ordinal variable (e.g., τ1, τ2, ⋯, τJ−1) should be fixed to be equal across time points to meet the longitudinal threshold assumption. In Mplus, the CATEGORICAL option (under the VARIABLE command) is used to specify both binary and ordinal variables. Consequently, specifying repeated ordinal variables in the categorical LGCM is identical to specifying the LGCM with binary variables in Mplus. In the next section, we demonstrate the univariate categorical LGCM with the original response scale (i.e., four response categories) of the same item (‘thinking about divorce’) of the marital instability measure.
Results.
Across the five time points, the observed proportion of respondents in Category 0 (= did not think about divorce in the last year) increased from 35.9% at Wave 1 to 68.8% at Wave 5. In contrast, the proportion of respondents in Category 1 (= thought about divorce within the last year) decreased from 32.4% at Wave 1 to 20.5% at Wave 5, and the proportion in Category 2 (= thought about divorce within the last 6 months) decreased from 26.6% to 5.1%. For Category 3 (= thought about divorce within the last 3 months), the proportions were small but stable over time (ranging from 5.1% at Wave 1 to 5.6% at Wave 4). Overall, these response frequencies imply that the likelihood of “thinking about divorce” generally decreased over time. Regarding model evaluation, as previously discussed for the categorical LGCM with the logit transformation, comparison of incremental models showed that the random intercept and random slope model (linear LGCM; −2LL = 3742.71, FP = 7; AIC / BIC = 3756.71 / 3785.49; unconstrained model) fit the data better than the random intercept model (−2LL = 3865.72, FP = 4; AIC / BIC = 3873.72 / 3890.17; constrained model), with a significant ∆Deviance test (Δ−2LL = 123.01, Δdf = 3, p < .001).
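The model comparison reported above can be checked by hand. The following Python sketch recomputes the deviance difference, its degrees of freedom, and the AIC values from the −2LL and free-parameter counts given in the text. The function and variable names are hypothetical, and the p-value uses the ordinary chi-square reference distribution with a closed form that holds only for 3 degrees of freedom.

```python
import math

# Fit statistics reported in the text for the ordinal probit models.
RANDOM_INTERCEPT = {"minus2ll": 3865.72, "free_params": 4}
LINEAR_GROWTH = {"minus2ll": 3742.71, "free_params": 7}


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def aic(minus2ll, free_params):
    """Akaike information criterion from -2 log-likelihood and parameter count."""
    return minus2ll + 2 * free_params


def deviance_test(constrained, unconstrained):
    """Return (delta -2LL, delta df, p) for two nested models.

    The p-value relies on the df = 3 closed form, so delta df must equal 3.
    """
    delta = constrained["minus2ll"] - unconstrained["minus2ll"]
    ddf = unconstrained["free_params"] - constrained["free_params"]
    if ddf != 3:
        raise ValueError("this sketch only handles a 3-df comparison")
    return delta, ddf, chi2_sf_df3(delta)


if __name__ == "__main__":
    delta, ddf, p = deviance_test(RANDOM_INTERCEPT, LINEAR_GROWTH)
    print(f"Delta -2LL = {delta:.2f}, Delta df = {ddf}, p = {p:.3g}")
    print(f"AIC: {aic(**RANDOM_INTERCEPT):.2f} vs {aic(**LINEAR_GROWTH):.2f}")
```

The three additional parameters of the linear model are the slope factor mean, the slope factor variance, and the intercept-slope covariance.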
Next, we examined this model’s growth parameters (see the last column of Table 1). As discussed previously, ordinal variables produce cumulative response probabilities (with multiple thresholds). In the LGCM with repeated ordinal variables, the mean of the initial level growth factor, α00, now corresponds to the expected propensity of responding above Category 0 (= never in the last year) on “thinking about divorce” at the first occasion. Given the definition −τ1 + α00 in the categorical LGCM, the mean of the initial level growth factor, .03, is equal in magnitude and opposite in sign to the first threshold, τ1 = −.03. The three estimated thresholds were −.03, 1.08, and 1.77 for τ1 to τ3, respectively. These thresholds provide information on the expected item-response proportions (i.e., probabilities) of wives’ responses on marital instability at Wave 1, and they can be converted into those expected proportions using Equation 12 and the formula in Appendix A. Our calculations showed that the expected proportions of responses 0, 1, 2, and 3 at the first measurement occasion (i.e., initial levels) were 49.1%, 30.7%, 11.7%, and 8.6%, respectively. Additionally, the variance of the latent intercept, η0i, was .68 and differed significantly from 0 (p < .001), indicating that wives varied in their initial propensity for marital instability.
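The conversion from thresholds to expected proportions at the first occasion can be sketched as follows. This is a hedged illustration, not the authors' computation: it assumes that Yi* at Wave 1 is normal with variance equal to the intercept variance plus a probit residual variance of 1, and that the reported thresholds are expressed relative to the mean of Yi*. The function name `wave1_proportions` is hypothetical.

```python
from statistics import NormalDist

# Estimates reported in the text for wives at Wave 1 (t = 0).
THRESHOLDS = [-0.03, 1.08, 1.77]   # tau1, tau2, tau3
INTERCEPT_VARIANCE = 0.68          # variance of the intercept factor
RESIDUAL_VARIANCE = 1.0            # assumed: probit residual variance fixed at 1


def wave1_proportions(thresholds, intercept_var, residual_var=RESIDUAL_VARIANCE):
    """Expected category proportions at the first occasion.

    Assumption: Y* at t = 0 is normal with variance intercept_var + residual_var,
    and the thresholds are expressed relative to its mean.
    """
    sd = (intercept_var + residual_var) ** 0.5
    phi = NormalDist().cdf
    cumulative = [0.0] + [phi(t / sd) for t in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    for category, p in enumerate(wave1_proportions(THRESHOLDS, INTERCEPT_VARIANCE)):
        print(f"Category {category}: {100 * p:.1f}%")
```

Under these assumptions the computed proportions agree with the percentages reported above to within the rounding of the published estimates.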
The mean of the slope factor was −.27, indicating that the average response propensity decreased by .27 units per year. This decreasing mean trajectory indicates that, on average, wives’ propensity to “think about divorce” decreased over time. The variance of the latent slope was .05 and differed significantly from 0, indicating inter-individual variation across wives in how their propensity changed over time. Finally, the covariance between the latent intercept and slope was .10 (also significantly different from 0), suggesting that wives who had higher response propensities at the first measurement occasion generally experienced a slower rate of decline in their response propensity over time than wives with a low response propensity at Wave 1.
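To gauge the size of this association, the covariance can be rescaled to a correlation using the two reported variances. The arithmetic below is a worked sketch based on the rounded estimates in the text, so a standardized estimate from the software output may differ slightly; the symbol η1i for the slope factor is assumed here by analogy with η0i.

```latex
\[
  r_{\eta_0 \eta_1}
  = \frac{\operatorname{cov}(\eta_{0i}, \eta_{1i})}
         {\sqrt{\operatorname{var}(\eta_{0i})\,\operatorname{var}(\eta_{1i})}}
  = \frac{.10}{\sqrt{.68 \times .05}}
  \approx .54
\]
```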
Categorical parallel process models (Categorical PPM) with categorical LGCMs
To investigate the dyadic association of marital attributes between wives and husbands over time, a categorical LGCM can be extended to a categorical parallel process model (hereafter referred to as “categorical PPM”) by using either the probit transformation or the logit transformation. To demonstrate the categorical PPM with ordinal outcomes, our example categorical PPM was estimated using the probit transformation. Similar to a parallel process model with continuous variables (Wickrama, Lee, O’Neal & Lorenz, 2016), the categorical PPM contains two separate categorical LGCMs as follows:
(13) |
(14) |
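As a reading aid, the two linked growth models can be sketched in generic linear-growth notation. The symbols below are assumptions chosen for illustration (superscripts Y and Z for the two processes, and time scores λt = 0, …, 4 matching the five equally spaced waves of this example); they are not a reproduction of the published Equations 13 and 14.

```latex
\begin{align}
  Y^{*}_{ti} &= \eta^{Y}_{0i} + \lambda_t\,\eta^{Y}_{1i} + \varepsilon^{Y}_{ti}, \\
  Z^{*}_{ti} &= \eta^{Z}_{0i} + \lambda_t\,\eta^{Z}_{1i} + \varepsilon^{Z}_{ti},
  \qquad \lambda_t = 0, 1, 2, 3, 4 .
\end{align}
```

In this sketch the four growth factors (an intercept and a slope for each process) are free to covary, which is what links the two processes.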
As can be seen in Equations 13 and 14, the parallel process model is estimated using two primary growth curve models identified by the latent response variables Yti* and Zti*. The unique feature of a categorical PPM is its ability to specify the variance and covariance structures among the latent growth factors (i.e., the intercept and slope factors of each process), which is shown in panel a of Figure 4. Using the probit transformation with ML estimation, the Mplus commands for the example categorical PPM with ordinal response variables are as follows:
⋮
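VARIABLE: NAMES ARE Y1-Y5 Z1-Z5;
  USEVARIABLES ARE Y1-Y5 Z1-Z5;
  CATEGORICAL ARE Y1-Y5 Z1-Z5;        ! ordinal items for both processes
  MISSING = ALL (999);
ANALYSIS: ESTIMATOR = ML;
  LINK = PROBIT;
MODEL:
  I1 S1 | Y1@0 Y2@1 Y3@2 Y4@3 Y5@4;   ! linear growth model for Y1-Y5
  I2 S2 | Z1@0 Z2@1 Z3@2 Z4@3 Z5@4;   ! linear growth model for Z1-Z5
  [I1 I2];                            ! estimate intercept factor means
  [Y1$1-Y5$1@0]; [Z1$1-Z5$1@0];       ! fix first thresholds at 0
  I1 WITH S1-S2; S1 WITH I2-S2; I2 WITH S2;   ! growth factor covariances
OUTPUT: STANDARDIZED;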
In the model commands, two categorical latent growth curve models (I1 S1|… and I2 S2|…; [I1 I2]; [Y1$1-Y5$1@0]; [Z1$1-Z5$1@0]) are specified to estimate growth parameters within each dyad member group (i.e., wives’ and husbands’ categorical LGCMs). All covariances among growth factors, both within and across dyad members, are estimated by a series of WITH statements. The OUTPUT command contains an added keyword: STANDARDIZED. This option instructs Mplus to include standardized parameter estimates and their standard errors in the output in addition to the default unstandardized values. With ML estimation, estimating growth parameters in the categorical PPM can increase the computational burden as a function of the number of categorical variables. Muthén and Muthén (1998-2012) suggest using the INTEGRATION=MONTECARLO option (500 integration points by default) in the ANALYSIS command. This option reduces the number of integration points, which may save computational time. For illustrative purposes, we estimated the model with the ordinal items of the marital instability measure using both wives’ and husbands’ reports. The results are shown in panel b of Figure 4.
For wives’ growth model, the thresholds at the first assessment were −.04, 1.12, and 1.82 (corresponding item-response proportions = 48.9%, 30.7%, 11.4%, and 9.0% for Categories 0 to 3, respectively). For husbands’ growth model, the thresholds at the first assessment were .02, 1.26, and 2.00 (corresponding item-response proportions = 50.6%, 31.0%, 10.7%, and 7.7% for Categories 0 to 3, respectively). Both wives and husbands decreased in their response propensities for “thinking about divorce” over time. In addition, the variances of all growth factors (i.e., initial level and rate of change for wives and husbands) were statistically significant, indicating the existence of inter-individual differences in these trajectories. Positive covariances between the intercept and slope growth factors were significant within each dyadic group, which suggests that both wives and husbands who had higher response propensities for “thinking about divorce” at the first occasion tended to show slower decreases in their response propensity over time. Additionally, two significant associations across dyadic groups (i.e., associations between wives’ and husbands’ growth factors) were found: (a) between wives’ and husbands’ intercept factors; and (b) between wives’ and husbands’ slope factors. The positive association between the intercept factors indicated that wives with high initial levels of ‘thinking about divorce’ also tended to have husbands with high initial levels of ‘thinking about divorce’. The positive association between the slope factors indicated that the linear trajectory for wives was positively associated with the linear trajectory for husbands, suggesting parallel change in wives’ and husbands’ reports on the “thinking about divorce” item.
A PPM allows for the specification of residual correlations at the same time point across dyad members (between-dyad correlations) along with residual correlations within dyads across time (within-dyad correlations [within-time correlations]; Wickrama et al., 2016), which may affect model fit, estimated parameters, and their standard errors. These residual structures can also be specified in categorical latent growth curve models using the theta parameterization under the WLS (or WLSMV) estimator. In Mplus, the residual correlations are specified with WITH statements (e.g., Y1 WITH Y2 and Y1 WITH Z1 for within- and between-dyad correlations, respectively) together with ANALYSIS: ESTIMATOR=WLS (or WLSMV); PARAMETERIZATION = THETA. We recommend first estimating a PPM without a residual structure (where the residual variances are fixed to 1 across time; Y1-Y5@1; Z1-Z5@1;) and then exploring whether a residual structure is necessary, because specifying residual structures in a PPM increases model complexity, which may result in an improper solution (e.g., convergence problems) or impossible parameter estimates. For researchers who are interested in specifying residual structures in a latent growth curve model with ordinal variables, more detailed information is provided in Grimm and Liu (2016).
CONCLUDING REMARKS
Social behavioral researchers often need to describe and analyze changes in behavioral or psychological attributes with repeated categorical outcomes. In the present article, using the structural equation modeling framework, we have illustrated categorical latent growth curve modeling for repeated categorical responses based on logit and probit transformation strategies. We have presented step-by-step procedures for the categorical LGCM with corresponding Mplus syntax. We also extended univariate modeling by incorporating time-invariant covariates (i.e., predictors and outcomes) and time-varying covariates (i.e., a parallel process model) into the categorical LGCMs. These modeling illustrations should help social behavioral researchers test important hypotheses involving change in categorical outcomes.
Acknowledgments
During the past several years, support for this research has come from multiple sources, including the National Institute of Mental Health (MH00567, MH19734, MH43270, MH48165, MH51361), the National Institute on Drug Abuse (DA05347), the Bureau of Maternal and Child Health (MCJ-109572), the MacArthur Foundation Research Network on Successful Adolescent Development Among Youth in High-Risk Settings, and the Iowa Agriculture and Home Economics Experiment Station (Project 3320).
Appendix A.
Formula to convert probit values of being Yt > category j to the expected proportion in the linear categorical LGCM (i.e., random intercept and random slope model).
Note. F = the standard normal distribution function (i.e., z-table). Item / variable response category j = 0, 1, 2, …, J-1. Time t = 0, 1, 2, …, T.
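A standard marginal form of this conversion, consistent with the note above, is sketched here. It is an assumption-based sketch rather than the published formula: α00 and α10 denote the intercept and slope factor means, ψ00, ψ11, and ψ01 their variances and covariance, and θt the residual variance of Yt* (taken to be 1 in the probit models without a residual structure). Symbols other than α00 and τ do not appear in the text and are introduced only for illustration.

```latex
\[
  P(Y_t > j)
  = F\!\left(
      \frac{-\tau_{j+1} + \alpha_{00} + \alpha_{10}\,t}
           {\sqrt{\psi_{00} + 2t\,\psi_{01} + t^{2}\psi_{11} + \theta_t}}
    \right),
  \qquad j = 0, 1, \ldots, J-1 .
\]
```

As a check at t = 0 with the wives’ estimates reported in the Results (−τ1 + α00 = .03, ψ00 = .68, θ0 = 1), this form gives P(Y0 > 0) = F(.03 / √1.68) ≈ .509, that is, about 49.1% in Category 0, which matches the reported proportion.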
Footnotes
The random intercept model is the baseline model for the growth curve model. In Mplus, the model can be specified by using I | Y1@0 Y2@1 Y3@2 Y4@3 Y5@4; under the MODEL command.
Prob (Yi = 1) → Φ⁻¹ (Prob (Yi = 1)) → Probit Yi (or z-score) = latent response variable Yi*, where Prob (Yi = 1) is the probability that Yi = 1 for individual i. The inverse standard normal function Φ⁻¹ of Prob (Yi = 1) produces probit values, which can be read as standard normal z-scores. As in the logit model, the probit values yield a threshold, τ1, on the continuous latent response variable Yi*.
REFERENCES
- Agresti A (2002). Categorical Data Analysis (2nd ed.). New York: John Wiley & Sons.
- Allen J, & Le H (2008). An additional measure of overall effect size for logistic regression models. Journal of Educational and Behavioral Statistics, 33, 416–441.
- Conger RD, & Conger KJ (2002). Resilience in midwestern families: selected findings from the first decade of a prospective, longitudinal study. Journal of Marriage and the Family, 64, 361–373.
- Curran P, Obeidat K, & Losardo D (2010). Twelve frequently asked questions about growth curve modeling. Journal of Cognition and Development, 11, 121–136.
- Edwards MC (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75, 474–497.
- Finney SJ, & DiStefano C (2006). Nonnormal and categorical data in structural equation models. In Hancock GR & Mueller RO (Eds.), A second course in structural equation modeling (pp. 269–314). Greenwich, CT: Information Age.
- Geiser C (2012). Data Analysis with Mplus. New York, NY: Guilford Press.
- Grimm K, & Liu Y (2016). Residual structure in growth models with ordinal outcomes. Structural Equation Modeling, 23, 466–475.
- Kline R (2011). Principles and practice of structural equation modeling (3rd ed.). New York, NY: Guilford Press.
- Masyn KE, Petras H, & Liu W (2013). Growth curve models with categorical outcomes. In Bruinsma G & Weisburd D (Eds.), Encyclopedia of Criminology and Criminal Justice (pp. 2013–2025). New York: Springer Verlag.
- McTernan M, & Blozis SA (2015). Longitudinal models for ordinal data with many zeros and varying numbers of response categories. Structural Equation Modeling, 22, 216–226.
- Mehta PD, Neale MC, & Flay BR (2004). Squeezing interval change from ordinal panel data: latent growth curves with ordinal outcomes. Psychological Methods, 9, 301–333.
- Muthén BO (2001). Latent variable mixture modeling. In Marcoulides GA & Schumacker RE (Eds.), New developments and techniques in structural equation modeling (pp. 1–34). Mahwah, NJ: Lawrence Erlbaum Associates.
- Muthén B, & Asparouhov T (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Retrieved from http://www.statmodel.com/download/webnotes/CatMGLong.pdf
- Muthén LK, & Muthén BO (1998-2012). Mplus user’s guide (7th ed.). Los Angeles: Muthén & Muthén.
- Raudenbush SW, & Bryk AS (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
- Rupp AA, Templin J, & Henson RA (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
- Skrondal A, & Rabe-Hesketh S (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
- Toland MD (2014). Practical guide to conducting an item response theory analysis. Journal of Early Adolescence, 34, 120–151.
- Wickrama KAS, Lee TK, O’Neal CW, & Lorenz FO (2016). Higher-order growth curves and mixture modeling with Mplus: A practical guide. New York, NY: Routledge.