Abstract
This study investigates the performance of robust ML estimators when fitting and evaluating small sample latent growth models (LGM) with non-normal, missing data. Results showed that the robust ML methods could be used to account for non-normality even when the sample size is very small (e.g., N < 100). Among the robust ML estimators, “MLR” was the optimal choice, as it was robust to both non-normality and missing data while also yielding more accurate standard error estimates and growth parameter coverage. However, “MLMV” produced the most accurate p values for the Chi-square test statistic under the conditions studied. Regarding the goodness of fit indices, as sample size decreased, all three fit indices studied (i.e., CFI, RMSEA, and SRMR) suggested worse fit. When the sample size was very small (e.g., N < 60), the fit indices would imply that a proposed model fit poorly, when this might not actually be the case in the population.
Keywords: latent growth models, small sample, non-normality, missing data
In the developmental sciences, researchers are often interested in measuring change over time in selected outcomes for the same individuals. For example, studies on change or development have been conducted in a diverse array of areas such as delinquency, cognitive functioning, marital quality, procrastination, and crime. While many analysis strategies exist for studying change, latent growth modeling (LGM) within the structural equation modeling (SEM) framework has emerged as one of the most popular longitudinal analysis techniques for assessing changes in psychological constructs across time (Bollen & Curran, 2006).
The original approach to modeling growth trajectories within the SEM framework was proposed by Meredith and Tisak (1984; 1990). The basic idea of a general latent curve model is to fit a two-factor, multiple-indicator CFA model. Specifically, assuming a linear growth trajectory, the LGM can be expressed as:

Yij = λ0i + λ1i tj + εij,
where Yij is the time- and person-specific outcome of person i at time point j, λ0i and λ1i are the random intercept and slope of person i, respectively, tj is the fixed time score at occasion j (e.g., tj = 0, 1, …, t − 1), and εij is the person-specific residual at time j. The latent intercepts and slopes follow a bivariate normal distribution with covariance matrix:

Ψ = [σ²0, σ01; σ01, σ²1],

where σ²0 and σ²1 are the intercept and slope variances and σ01 is their covariance,
and mean vector μ = (μγ0, μγ1)T. The residuals at each time point follow a multivariate normal distribution with means of zero.
The LGM allows researchers to model a growth trajectory within the confirmatory factor analysis (CFA) framework. Specifically, with a linear LGM, two latent factors, intercept and slope, are of primary interest. Modeling intercepts and slopes as random effects allow researchers to examine individual differences in initial status and growth rate. In addition, by estimating the covariance between the random slopes and intercepts, the LGM can address the relationship between initial status and growth rate.
In addition, integrating the study of change into the SEM framework allows LGM to take advantage of the flexibility and generality of the framework relative to other methods used to study change (e.g., hierarchical linear modeling, or HLM; Hedeker & Gibbons, 2006)1. For example, global fit of the tested model can be assessed using a wide variety of indices (Chou, Bentler & Pentz, 1998; Wu, West, & Taylor, 2009). Researchers can also integrate other procedures commonly employed within the SEM framework, such as choosing an estimation technique that matches the metric level of the data or incorporating methods to accurately address missing data.
The most commonly used estimation technique for LGM is maximum likelihood (ML) estimation. ML requires the endogenous variables used in analyses (i.e., measures at each time point) to be multivariate normally distributed in the population (Bollen, 1989). While LGM data are often continuous in nature, the data may exhibit substantial skewness and/or kurtosis due to the nature of the construct or population under study (Micceri, 1989). Non-normality generally does not affect parameter estimates, but it does impact the standard errors of the parameter estimates, the χ2 fit statistic, and fit indices that use the χ2 in their calculations (Finney & DiStefano, 2013). Robust corrections are often employed to adjust standard errors and χ2 values to approximate the values that would have been obtained if the data were normally distributed (Satorra & Bentler, 1994; Savalei, 2014). With the ML estimator, different types of robust corrections have been developed, and these methods are readily available in most SEM software. In this study, we focused on three widely used robust variants of ML: MLM (i.e., robust standard errors computed using the expected information matrix, coupled with Satorra and Bentler's (1994) mean-adjusted chi-square test statistic), MLR (i.e., robust standard errors computed using the sandwich estimator, coupled with Asparouhov and Muthén's (2005) mean-adjusted chi-square test statistic), and MLMV (i.e., robust standard errors computed using the expected information matrix, coupled with Asparouhov and Muthén's (2010) mean- and variance-adjusted chi-square test statistic)2 (see Maydeu-Olivares, 2017 for a review of the technical details of the robust ML methods).
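As a sketch of the general idea behind the mean-adjusted corrections (notation simplified; see Satorra & Bentler, 1994 and Maydeu-Olivares, 2017 for the exact expressions), the ML test statistic is divided by a scaling factor so that its mean matches that of the reference chi-square distribution:

```latex
% Mean-adjusted (Satorra-Bentler-type) test statistic: T_{ML} is the ML
% chi-square, df the model degrees of freedom, \hat{\Gamma} the asymptotic
% covariance matrix of the sample moments, and \hat{U} a weight matrix.
T_{M} \;=\; \frac{T_{ML}}{\hat{c}}, \qquad
\hat{c} \;=\; \frac{\operatorname{tr}(\hat{U}\hat{\Gamma})}{df}
```

The mean- and variance-adjusted variant (MLMV) additionally matches the variance of the statistic, which amounts to a further adjustment of the scaling (and, in some formulations, of the degrees of freedom).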
When fitting an LGM, missing data are another important issue to consider, as attrition is a very common occurrence in longitudinal designs. Within the SEM framework, full information maximum likelihood (FIML) is perhaps the most popular technique for dealing with missing data, as this method is easy to apply and works acceptably for data that are missing at random (MAR) or missing completely at random (MCAR) (Enders, 2010). Robust ML methods that address missing data are also available. For example, in Mplus, the MLR estimator is robust to the presence of both non-normality and missing data (Savalei, 2010). Other robust methods (e.g., MLM or MLMV) require the use of listwise deletion for estimation, as these robust techniques can only be employed with a complete dataset (Maydeu-Olivares, 2017).
While LGM researchers can turn to the SEM literature for guidance on issues such as the performance of robust estimators or optimal methods for accommodating missing data, one area that has not received much attention is the application of LGM with small sample sizes. Methods that advise researchers on an acceptable sample size for study design do exist (e.g., Monte Carlo simulation; Muthén & Muthén, 2002); however, it is safe to say that many researchers, especially applied researchers, do not employ such procedures, especially when constructing a study. Researchers often evaluate sample size acceptability by considering well-cited “rules of thumb,” and sample size recommendations based upon these rules vary. Some recommendations base the number of cases needed on factors such as the number of variables included in a model (e.g., 10 cases per variable; Nunnally & Bernstein, 1967), a ratio of cases per estimated parameter (e.g., 5 or 10 observations per parameter; Bollen, 1989), or a minimum sample size level (e.g., 100 or 200 cases; Boomsma, 1982). The overriding message is that smaller sample sizes are not acceptable for SEM.
However, in developmental studies using longitudinal data, sample sizes below 100 may be common due to factors such as high subject attrition rates, difficulty tracking participants over time, or financial constraints (McNeish, 2017). McNeish and Harring (2017a) documented the prevalence of small sample sizes through a meta-analysis and literature review of longitudinal research. Their analysis showed that 33%–41% of the reviewed studies investigated models with fewer than 100 participants. Further, some phenomena of interest may have low base rates in the general population and, as such, fewer participants are eligible to participate. For example, in clinical studies, researchers interested in rare disorders such as Fragile X syndrome (Hatton et al., 2006), Dissociative Identity Disorder (Ross, 1997), or the comorbidity of certain characteristics may have difficulty recruiting a large number of participants who qualify for, or are willing to participate in, a research study. Although fewer participants may be available, the study of change over time for these issues is still of interest. Such situations may be further exacerbated by the presence of non-normal and/or missing data. These situations arise in empirical research; however, there is not yet clear direction to aid researchers interested in fitting LGMs under such conditions.
McNeish and Harring (2017a) specifically noted that simulation studies on LGMs have thus far not included sample sizes below 100. We suspect that one potential reason is that applied researchers typically assume larger samples (e.g., 100 or 200 cases) are necessary when using the SEM framework. However, LGM is a restricted SEM model; the latent variables are used simply to model the random effects (which is comparable to the HLM approach). Within the HLM framework, researchers have investigated the viability of the technique with small numbers of clusters (where “clusters” are individuals in longitudinal studies)3, and small numbers of individuals within clusters (individuals within clusters are time points in longitudinal studies)4. Therefore, there is no compelling reason to believe that LGM could not be applied (or at least studied) under similar sample size conditions.
McNeish and Harring (2017a) conducted an initial study to explore the performance of LGM under small samples (i.e., N < 100). Though the findings were promising, the study focused only on the behavior of the Chi-square test statistic and the use of small sample corrections under normal data conditions. The authors suggested that robust methods may not be appropriate for even minimally complex models until sample sizes reach 100 cases (McNeish & Harring, 2017a). However, this conjecture has not yet been verified by simulation studies, possibly because previous research involving robust ML has typically focused on sample sizes of 200 or more (e.g., Maydeu-Olivares, 2017).
Given the noted gaps in the literature, the present study aims to address a simple but essential question for applied developmental researchers: What would be the best practice for fitting linear LGMs with a very small sample size (i.e., N < 100)? More specifically, we investigate whether small sample LGMs with non-normal, incomplete data can be analyzed via robust ML estimation methods.
Method
We conducted a simulation study to investigate the performance of robust ML for estimating latent growth models when the sample size is very small (i.e., N < 100). To reflect realistic empirical scenarios, we examined conditions with both non-normal and incomplete data. In this study, we considered linear growth models because of the common use of this growth pattern in practice as compared to nonlinear trajectory modeling. When exploring growth trajectories, due to their simplicity and easy interpretation, linear models are often a starting point with LGM (Grimm, Ram & Estabrook, 2017). In the population model, the fixed effects were set equal to 2.77 for the mean intercept (μ0) and 0.06 for the mean slope (μs). The random effects were set to 0.46 for the intercept variance and 0.05 for the slope variance. The covariance between the intercept and slope was set to −0.02 (i.e., correlation ρ = −.13). The values of the growth parameters were chosen to mimic an empirical example of LGM in the applied literature (Ferrer, Balluerka & Widaman, 2008). Following previous studies (Bauer & Curran, 2003; McNeish & Harring, 2017a), the error variances were chosen such that 50% of the observed variable variance was explained by the growth factors at each occasion. The variables manipulated in the simulation are described below.
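As a concrete sketch of this data-generating model (the function name and the time coding tj = 0, 1, …, t − 1 are our assumptions; the occasion-specific error variances follow from the 50% explained-variance rule), the population LGM could be simulated as:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_lgm(n, t, mu=(2.77, 0.06), var_i=0.46, var_s=0.05, cov_is=-0.02):
    """Draw one sample of size n from the linear LGM described in the text.

    Random intercepts/slopes are bivariate normal; residual variances are
    set so that the growth factors explain 50% of each occasion's variance.
    """
    psi = np.array([[var_i, cov_is], [cov_is, var_s]])
    eta = rng.multivariate_normal(mu, psi, size=n)   # (n, 2): intercepts, slopes
    loadings = np.arange(t)                          # time scores 0, 1, ..., t-1
    # implied growth-factor variance at each occasion
    fac_var = var_i + 2 * loadings * cov_is + loadings**2 * var_s
    err_var = fac_var                                # 50% explained => equal residual variance
    y = eta[:, [0]] + eta[:, [1]] * loadings + rng.normal(0, np.sqrt(err_var), size=(n, t))
    return y

y = simulate_lgm(n=100, t=4)   # 100 x 4 data matrix
```

With these population values, the first-occasion mean and variance approach 2.77 and 0.92 (0.46 factor variance plus 0.46 residual variance) as n grows.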
Numbers of repeated measures.
The number of repeated measures (i.e., timepoints, t) included 4 and 8. The conditions manipulated reflect a longitudinal study with a relatively small (4) or large (8) number of time points.
Sample size.
In the current study, we focused on small sample sizes (i.e., N ≤ 100). The sample sizes ranged from 20 to 100 (i.e., five levels in intervals of 20).
Data Distribution.
We considered normal (skewness = 0.00, kurtosis = 0.00), moderately non-normal (skewness = 1.00, kurtosis = 7.00), and severely non-normal (skewness = 3.00, kurtosis = 21.00) data distributions. The levels of skewness and kurtosis for the non-normal data were similar to those used in previous simulation studies (Curran, West & Finch, 1996).
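Non-normal data with target skewness and kurtosis are typically generated via a power-method transformation of a standard normal variate (Fleishman, 1978); the Vale and Maurelli (1983) procedure extends this to the multivariate case. A univariate sketch (function names are ours, and the kurtosis targets are treated as excess kurtosis, which is our assumption):

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, ex_kurt):
    """Solve Fleishman's (1978) equations for b, c, d such that
    X = a + b*Z + c*Z**2 + d*Z**3 (with a = -c, Z standard normal) has
    mean 0, variance 1, and the target skewness / excess kurtosis."""
    def equations(p):
        b, c, d = p
        return (
            b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,
            2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,
            24*(b*d + c**2*(1 + b**2 + 28*b*d)
                + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - ex_kurt,
        )
    b, c, d = fsolve(equations, (1.0, 0.1, 0.1))
    return -c, b, c, d

def nonnormal_sample(n, skew, ex_kurt, rng):
    """Draw n observations with the requested skewness and excess kurtosis."""
    a, b, c, d = fleishman_coefficients(skew, ex_kurt)
    z = rng.standard_normal(n)
    return a + b*z + c*z**2 + d*z**3
```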
Percentage of missing data.
Three levels of missing data percentage were manipulated: 0% (i.e., complete data), 10%, and 30%. Missing values occurred only in the second half of the occasions (i.e., time points 3 and 4 with four repeated measures; time points 5, 6, 7, and 8 with eight repeated measures). The incomplete data were generated at the later time points to echo scenarios of missingness due to attrition, which is commonly observed in longitudinal studies (Zheng, 2017).
Mechanism of missing data.
Based on a typology of missing data mechanisms developed by Rubin (1976), we included data which were a) missing completely at random (MCAR; the probability that a data value is missing does not depend on the observed or missing values) and b) missing at random (MAR; the probability that a data value is missing may depend on other observed variables, but not on the variable which is missing). Specifically, regarding MCAR, s% (where s = 10 or 30) of the individuals were randomly chosen, and for the selected individuals, data from the second half of the repeated measures were incomplete5. Under MAR, the missingness at the latter time points was determined by the percentile of the sum of the complete variables. That is, if the sum scores were smaller than their sth percentile, missing data were created. This approach to generating missing data is consistent with previous simulation studies (e.g., Shi, Lee, Fairchild & Maydeu-Olivares, 2019).
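A minimal sketch of the two missingness mechanisms as described above (function and variable names are ours; the MAR rule conditions on the sum of the always-observed first-half occasions):

```python
import numpy as np

rng = np.random.default_rng(7)

def impose_missingness(y, pct, mechanism):
    """Blank out the second half of the occasions for pct% of the cases.

    MCAR: cases are selected completely at random.
    MAR:  cases whose sum over the always-observed first half falls below
          the pct-th percentile of those sums become incomplete.
    """
    y = y.astype(float).copy()
    n, t = y.shape
    half = t // 2
    if mechanism == "MCAR":
        drop = rng.choice(n, size=int(round(n * pct / 100)), replace=False)
    elif mechanism == "MAR":
        sums = y[:, :half].sum(axis=1)
        drop = np.flatnonzero(sums < np.percentile(sums, pct))
    else:
        raise ValueError(mechanism)
    y[drop, half:] = np.nan
    return y
```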
For complete data, the number of conditions examined was 30 = 2 (numbers of repeated measures) × 5 (sample size levels) × 3 (data distributions). In the presence of missing data, the number of conditions considered was 120 = 2 (numbers of repeated measures) × 5 (sample size levels) × 3 (data distributions) × 2 (missing data mechanisms) × 2 (missing data percentages). In total, 150 conditions were included in the simulation study. For each simulated condition, 10,000 replications were generated with the simsem package in R (Pornprasertmanit, Miller, & Schoemann, 2012; R Development Core Team, 2015). Procedures described in Vale and Maurelli (1983) were used to generate non-normal data.
For each generated dataset, we fit the linear latent growth models with four choices of ML-based estimators: classical ML, MLM, MLR, and MLMV. These estimators have been implemented in the most popular SEM software packages (e.g., Mplus, lavaan), and have been widely used in empirical studies. Parameter estimates, standard error estimates, parameter coverage, and the goodness of fit indices were obtained and summarized across simulation replications. In addition, we calculated the coverage rate for each growth parameter as the proportion of estimated 95% confidence intervals that contain the population value. Finally, with regard to model fit, we focused on the likelihood ratio chi-square test statistic and three additional commonly used goodness of fit indices: the root mean square error of approximation (RMSEA; Steiger & Lind, 1980), the comparative fit index (CFI; Bentler, 1990), and the standardized root mean square residual (SRMR; Jöreskog & Sörbom, 1988). For the chi-square test statistics, we calculated the empirical rejection rates at the 5% significance level. With regard to the goodness of fit indices (i.e., RMSEA, CFI, and SRMR), we computed average estimates over 10,000 replications across the simulated conditions. All data analyses were conducted using Mplus 8.1 (Muthén & Muthén, 1998–2018). It is noted that under robust ML, different formulas for computing robust RMSEA and CFI have been applied (see Brosseau-Liard & Savalei, 2014; Savalei, 2018; Gao, Shi & Maydeu-Olivares, 2020). Following the definition given in Brosseau-Liard, Savalei, and Li (2012), in this study we focused on the robust RMSEA and CFI using the “population” correction, which is computed and reported by Mplus.
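Two of the outcome summaries described here reduce to simple proportions across replications; a sketch (helper names are ours):

```python
import numpy as np

def coverage_rate(est, se, truth, z=1.96):
    """Proportion of 95% CIs (est +/- z*se) across replications that
    contain the true population value."""
    est, se = np.asarray(est), np.asarray(se)
    return np.mean((est - z * se <= truth) & (truth <= est + z * se))

def rejection_rate(p_values, alpha=0.05):
    """Empirical rejection rate of the chi-square test across replications."""
    return np.mean(np.asarray(p_values) < alpha)
```

For a correctly specified model, the coverage rate should be close to .95 and the rejection rate close to the nominal .05 level.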
Results
Convergence rates were greater than 99.9% across conditions in the current study. Only cases with converged results were included in calculating the outcome variables. To make key trends easier to decipher, we present the results visually. Tables with detailed results are available in the supplementary materials. To identify the main conditions that affect the outcome variables, we also conducted analyses of variance (ANOVAs). All two-way interactions between the simulation factors were also included in the ANOVA models. Specifically, eta squared (η2) values for each factor were computed to identify conditions that contributed sizeable amounts of variability to the outcomes. Using the benchmarks provided by Ferguson (2009), we focused on factors that yielded η2 > .04, which is considered the minimum effect size representing a “practically” significant effect.
Parameter Estimates.
Using the absolute values of the relative biases in parameter estimates6 as the outcome variable, ANOVA results showed that the important sources of the (relative) biases were the growth parameters estimated (η2 = 0.115), the mechanism of missing data (η2 = 0.068), and the interaction between the mechanism of missing data and the choice of estimators (η2 = 0.062). As expected, with complete data, all ML-based estimators yielded the same parameter estimates. In the presence of missing data, using full information from the data, ML and MLR yielded the same parameter estimates. These estimates differ from those obtained using MLM/MLMV, which employ listwise deletion when complete data are not available.
In general, when no missing data are present, or data are missing completely at random, applying any (robust) ML estimator yielded parameter estimates that were fairly close to the population values, regardless of the other simulation conditions (i.e., N, t, percentage of missing data, data distribution). However, this was not true under missing at random conditions. Figure 1 shows the boxplots of average point estimates for all five parameters of interest across all simulation conditions under MAR. Here, it is easily observed that MLM/MLMV may yield noticeably biased parameter estimates, especially when a large (30%) percentage of missingness was present. The findings were not surprising; because listwise deletion was applied under missing at random, the parameter estimates are anticipated to be biased (Enders, 2010).
Figure 1:
Average Parameter Estimates: Missing at Random (MAR)
In addition, as shown in Figure 1 and the supplementary table, the estimates of the fixed effects (i.e., μ0 and μs) were more accurate than the estimates of the random effects (i.e., the intercept and slope variances). The finding is consistent with previous studies showing that full maximum likelihood estimation leads to downwardly biased estimates of the random effects, especially when the sample size is small (McNeish & Stapleton, 2016a, 2016b; McNeish & Matta, 2018).
Standard Error Estimates.
For each parameter, the average standard error estimate was compared to the empirical standard deviation of the parameter estimates (sdθ) over the replications. The relative bias for the standard error estimates was computed as RB = 100 × (average SE − sdθ)/sdθ (Bandalos & Leite, 2013). Following previous studies, relative bias below 10% (in absolute value) was considered acceptable (e.g., Hoogland & Boomsma, 1998). ANOVA results showed that the important factors affecting the RB in standard errors were the choice of estimators (η2 = 0.095) and the growth parameters estimated (η2 = 0.065). Figure 2 provides boxplots of relative biases for all parameters across study conditions by estimator. In the supplementary materials, we provide separate figures for each growth parameter. It is noted that MLM and MLMV yielded the same standard error estimates as each other.
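A sketch of this relative-bias computation (the function name is ours):

```python
import numpy as np

def se_relative_bias(se_estimates, param_estimates):
    """Relative bias (%) of standard errors: compare the average estimated
    SE with the empirical SD of the parameter estimates across replications."""
    emp_sd = np.std(param_estimates, ddof=1)
    return 100 * (np.mean(se_estimates) - emp_sd) / emp_sd
```

A positive value indicates that the standard errors overestimate the true sampling variability; a negative value indicates underestimation.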
Figure 2:
Relative Biases in Standard Error Estimates
As seen in Figure 2, MLR generally produced the most accurate standard error estimates. Among all 150 conditions, on average (across the five parameters), 110 conditions (73.5%) produced standard error estimates with an absolute relative bias of less than 10%. Normal theory-based ML yielded the second-best performance, with 65.7% of the conditions in an acceptable range. The standard errors using MLM/MLMV were generally more accurate than those obtained from normal theory-based ML only under complete non-normal data. With missing values, the percentage of acceptable conditions for MLM/MLMV was only 37.3%. MLM-/MLMV-based standard errors could be severely underestimated, particularly when a large percentage of observations (i.e., 30%) was missing at random. With regard to the different growth parameters, the standard error estimates for the fixed effects (i.e., the mean intercept and the mean slope) were generally more accurate than those for the random effects (i.e., the intercept variance, the slope variance, and the covariance).
Coverage Rates.
Figure 3 provides the coverage of the LGM growth parameters (i.e., the proportion of the estimated 95% confidence intervals that contain the true parameter value) under different estimators. Coverage rates between .92 and .98 were considered acceptable (i.e., 95% ± 3%; Bradley, 1978; McNeish & Harring, 2017b). Results from the ANOVA showed that the choice of estimators (η2 = 0.137) was the most important condition affecting the coverage rates. The effect of the choice of estimators was also found to be moderated by the mechanism of missing data (η2 = 0.077) and the growth parameters being estimated (η2 = 0.087).
Figure 3:
95% Coverage Rates
Similar to what was observed for the standard error estimates, MLR generally produced the most accurate parameter coverage rates, with 63.2% of the conditions yielding acceptable coverage. Normal theory-based ML yielded lower coverage rates (56.1% of the conditions between .92 and .98) than MLR-based estimates, especially when the data were non-normally distributed. Coverage rates for MLM/MLMV could be noticeably worse than those obtained using ML, even when the data were non-normally distributed. Specifically, in the presence of data missing at random, coverage rates for MLM/MLMV tended to be much lower than the acceptable level, particularly if the percentage of missingness was high (30%). Moreover, as shown in the supplementary table and figures, the difference between the choices of ML-based estimators was more pronounced when estimating the fixed effect parameters (compared to the random effect parameters).
Chi-square test Statistics.
In the supplementary tables, we report the means and standard deviations of the Chi-square test statistics across conditions. In this paper, we focus on the empirical rejection rates of the Chi-square test statistics at the 5% significance level. The ANOVA results showed that sizable variability in the empirical rejection rates could be explained by the number of time occasions (η2 = 0.187), the sample size (η2 = 0.132), the data distribution (η2 = 0.112), and the choice of estimators (η2 = 0.101). In addition, three interactions between the simulation conditions were also practically significant (i.e., data distribution × choice of estimators, η2 = 0.081; number of time occasions × sample sizes, η2 = 0.081; data distribution × sample sizes, η2 = 0.066).
The distributions of the empirical rejection rates across study conditions (i.e., sample sizes, number of observed variables, and data distributions) are presented in Figure 4. Specifically, in Figure 4(a), we compared the empirical rejection rates using different ML-based estimators across sample sizes and data distributions. In Figure 4(b), the empirical rejection rates of different ML-based estimators were summarized across sample sizes and number of time occasions7.
Figure 4:
Empirical Rejection Rates (5%) for the Chi-Square Test Statistics. (a) Empirical rejection rates across sample size and data distributions. (b) Empirical rejection rates across sample size and number of time occasions. Note. N = sample size; t = number of measurement occasions (observed variables).
We considered Type I error rates between 2% and 8% acceptable (5% ± 3%; Bradley, 1978; Pavlov, Maydeu-Olivares & Shi, 2020). As shown in the figure, regardless of the choice of estimator, inflated rejection rates were observed as the sample size decreased and the number of observed variables (measurement occasions) increased. For example, with complete, normally distributed data, t = 4 and N = 100, the empirical rejection rates were very close to the nominal level (5%) for all four estimators. However, keeping other conditions constant, as t increased to eight and N decreased to 20, the empirical rejection rates were inflated to 24% (ML), 36% (MLM), 14% (MLMV), and 36% (MLR), respectively. Consistent with previous studies (Curran, West, & Finch, 1996), it was not surprising to find inflated empirical rejection rates when normal theory-based ML was used with non-normal data.
On the other hand, robust ML could yield rejection rates closer to the nominal level when the data were non-normally distributed and/or sample size was small. Among the three robust ML estimators considered, MLMV produced the most accurate p values (empirical rejection rates). Across the 150 conditions, the numbers of conditions with empirical rejection rates less than 10% were 43 for ML (the average rejection rates = 25%), 66 for MLM (the average rejection rates = 18%), 47 for MLR (the average rejection rates = 23%) and 119 for MLMV (the average rejection rates = 8%).
Goodness of Fit Indices.
The behaviors of the selected fit indices when fitting small sample LGMs are summarized in Figures 5–7. Specifically, for each fit index, we plotted the distributions of the average sample estimates across simulation conditions. The following reference values are commonly used for assessing close model fit: 1) RMSEA ≤ .06; 2) CFI ≥ .95; and 3) SRMR ≤ .08 (Hu & Bentler, 1999).
Figure 5:
The Behaviors of the Average Sample RMSEAs. (a) Average sample RMSEAs across sample size and data distributions. (b) Average sample RMSEAs across sample size and number of time occasions. Note. RMSEA = root mean square error of approximation; N = sample size; t = number of measurement occasions (observed variables).
Figure 7:
The Behaviors of the Average Sample SRMRs
Note. PM = percentage of missing data; SRMR = standardized root mean square residual; N = sample size; t = number of measurement occasions (observed variables). (a) Average sample SRMRs across sample size and data distributions. (b) Average sample SRMRs across sample size and number of time occasions.
Using the average sample RMSEA as the outcome, ANOVA results indicated that the most important factors were the sample size (η2 = 0.574) and the data distribution (η2 = 0.063). As the sample size decreased, the average sample RMSEA increased, indicating worse model fit. In general, based on the average sample RMSEA, good fit (i.e., RMSEA ≤ .06) could be concluded (for the model considered in the study) once the sample size reached 60, regardless of the data distribution and the presence of missing data. Under non-normal data, the (normal theory) ML-based RMSEA values tended to be slightly larger than those obtained from robust ML estimators. In terms of the choice of robust ML, the MLMV-based RMSEA generally yielded the smallest average sample estimates.
The ANOVA results showed that for the sample average CFI, the most important predictors were the sample size (η2 = 0.482), the number of time occasions (η2 = 0.071), and the interaction between sample size and the number of time occasions (η2 = 0.087). The average sample CFI increased as N increased and the number of time occasions (t) decreased, suggesting that the model fit better with larger sample sizes. To yield a better fit, a larger sample was required with a larger number of observed variables (i.e., time occasions). With normal data, average CFI values above .95 were obtained for sample sizes of 60 or greater. However, this was not true in the presence of missing data or non-normality. Under non-normal conditions, a larger sample size was required (e.g., N ≥ 80), especially with missing data. In terms of the choice of estimators, when (normal theory) ML was applied to non-normal data, the average sample CFI tended to be slightly smaller than that obtained from robust ML estimators. A similar performance was observed across the three robust ML estimators when data were complete. In the presence of missing data, the average sample CFI using MLM or MLMV with listwise deletion performed poorly in comparison to the estimates using MLR, particularly if the data were missing at random and the percentage of missingness was high (30%).
For sample SRMR, the ANOVA results identified the most important factors as the sample size (η2 = 0.582) and the number of time occasions (η2 = 0.102). The same sample SRMR values were obtained using ML/MLR and MLM/MLMV estimation methods. Regardless of the choice of estimators, the average sample SRMR was generally greater than the conventional cutoff (.08) across most simulation conditions, indicating that the (correctly specified) model did not fit the data well. The average sample SRMR tended to decrease as the sample size increased and the number of time occasions decreased; however, large average sample SRMR values (> .08) were observed under most conditions when N ≤ 100.
It is noted that in the current paper, we focused on the behaviors of the average goodness of fit indices (across replications). In practice, researchers tend to fit and evaluate the LGM using a single sample. For each goodness of fit index, we therefore also computed the empirical rejection rates across replications using the conventional sample cutoff criteria (i.e., RMSEA ≤ .06; CFI ≥ .95; SRMR ≤ .08; Hu & Bentler, 1999). This information is provided in the supplementary tables (i.e., Tables 16–18). Results showed that, using the conventional cutoffs, researchers would tend to reject the correctly specified LGM too often under most of the conditions considered in the current study. For example, when N ≥ 80, the average sample CFI and RMSEA were generally within the cutoffs for good fit. However, the highest empirical rejection rates (under robust ML estimators) were 70% and 67% for CFI and RMSEA, respectively, implying that researchers would reject the correctly specified models more than half of the time. The empirical rejection rates for RMSEA were below 10% only when N = 100, t = 8, and the data were complete. For CFI, the only cases where the empirical rejection rates were below 10% occurred when N = 100 and the data were complete and normally distributed.
Discussion
This study investigated the performance of robust ML estimators when fitting an LGM with non-normal data when the sample size is very small (N ≤ 100) and missing data are present. A simulation study was employed, using continuous data and varying conditions such as sample size, number of time occasions, normality of the data, percentage of missing values, and mechanism of missing data.
Results showed that when data were complete or missing completely at random (MCAR), estimates for the growth parameters were generally unbiased under all methods. However, biased results may be observed when MLM or MLMV (with listwise deletion) are applied under MAR. In addition, with non-normal observations, robust ML estimators yielded smaller bias in standard errors and more accurate parameter coverage compared with normal theory ML. Among the robust estimators, MLR showed optimal performance, as illustrated by smaller biases in standard errors and acceptable parameter coverage for most of the study conditions. When MLM/MLMV were applied under MAR, noticeably biased standard error estimates and poor parameter coverage were observed.
In terms of model fit, with non-normal data and/or a larger number of observed variables, robust ML-based chi-square test statistics yielded Type I error rates closer to the nominal level than values obtained from normal theory ML. Among the robust estimators, MLMV-based chi-square test statistics performed best in controlling Type I error rates across the tested conditions, even when the data were missing at random. For the fit indices, as sample size decreased, CFI decreased, while RMSEA and SRMR tended to increase, suggesting worse model fit.
The findings of the current study expand on conclusions from previous methodological research regarding fitting latent growth models with very small samples (McNeish & Harring, 2017a). We found that robust ML estimation could be used with very small sample sizes (i.e., N ≤ 100), and it generally surpassed normal theory ML by providing more accurate standard error estimates, better parameter coverage, and p values of the chi-square test statistic closer to the nominal (.05) level, even in the presence of non-normal missing data. When choosing among robust ML estimators, our findings for LGMs are consistent with results reported by Maydeu-Olivares (2017), who studied the behavior of various robust ML estimators in the context of factor analysis models. Specifically, we found that MLR generally yielded more accurate standard errors and parameter coverage under non-normal data. On the other hand, by adjusting both the mean and the variance, MLMV tended to yield more accurate p values for the chi-square test statistic.
The results of the current study also reflect the model size effect in evaluating SEM model fit. Specifically, previous research has shown that when SEM models with a larger number of observed variables are fitted to small samples, the chi-square test statistic tends to over-reject the correctly specified model (Moshagen, 2012; Shi, Lee, & Terry, 2018). Further, chi-square-based fit indices (e.g., CFI and RMSEA) tend to yield estimates suggesting worse fit (Shi, Lee, & Maydeu-Olivares, 2019). Given that the sample sizes were extremely small (N ≤ 100), models with eight observed variables (t, the number of measurement occasions) may be considered "large" models in the context of growth modeling.
While we do not advocate for collecting small sample sizes, we understand that sometimes the characteristics under study make this situation unavoidable. In light of the study findings, we offer the following recommendations to empirical researchers fitting and evaluating latent growth models (LGM) with small samples. Robust ML estimation may be used to account for the non-normality of the data, even when the sample size is extremely small (N < 100). When choosing among estimators, MLR is the optimal choice as it yielded the most accurate standard error estimates and growth parameter coverage across most simulated conditions, including cases with non-normal and missing data.
In terms of the chi-square test statistic, MLMV generally produced the most accurate p values (Type I error rates) under the conditions studied. However, as MLMV cannot directly accommodate missing data, the best practical approach for evaluating LGMs with small samples and non-normal missing data requires further investigation. Interestingly, under similar conditions, the Type I error rates of the MLMV-based chi-square statistic were comparable to those obtained from ML with small sample corrections (e.g., the Bartlett-corrected test statistic); these findings replicate those reported in McNeish and Harring (2017a). Future studies should further compare the performance of robust ML and ML with small sample corrections. In addition, to gain a better understanding of estimator performance under non-normal data, it would be interesting to explore incorporating small sample corrections into robust ML-based chi-square test statistics when various levels of non-normality, or even ordinal variables, are present.
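The Type I error rate discussed here is the proportion of replications in which a correctly specified model's chi-square p value falls below the nominal .05 level. A minimal sketch of the computation, using placeholder p values (not the study's results) and Bradley's (1978) liberal robustness criterion:

```python
import numpy as np

rng = np.random.default_rng(7)

# Under a correctly specified model, the p values of a well-calibrated test
# statistic should be roughly uniform on [0, 1]; these placeholders stand in
# for the per-replication p values of a mean-and-variance adjusted statistic.
p_values = rng.uniform(0.0, 1.0, size=1000)

alpha = 0.05
type_i_error = float(np.mean(p_values < alpha))  # empirical rejection rate

# Bradley's (1978) liberal criterion: calibration is acceptable when the
# empirical rate lies within [0.5 * alpha, 1.5 * alpha], i.e., [.025, .075].
acceptable = 0.5 * alpha <= type_i_error <= 1.5 * alpha
```

A poorly calibrated statistic (e.g., unadjusted ML with non-normal data and many occasions) would show empirical rates well outside this interval.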
In terms of the goodness of fit indices, all three fit indices studied (i.e., CFI, RMSEA, and SRMR) exhibited worse fit as sample size decreased, especially when non-normally distributed data with missingness were analyzed. Roughly speaking, for correctly specified LGMs, the average sample RMSEA was within the cutoff for good fit (< .06) when the sample size was above 60, and a sample size of 80 or above was required for the average sample CFI to exceed the .95 cutoff. If the sample size is less than 100, researchers should be cautious about rejecting the LGM solely on the basis of SRMR. For all three fit indices, the empirical rejection rates based on the conventional sample cutoffs were generally inflated, implying that researchers would tend to reject the correctly specified model too often, even when N reached 100.
It is noted that by using Mplus, this study focused on the robust RMSEA and CFI computed with the "population" correction. The robust RMSEA and CFI reported in Mplus (and many other popular SEM software programs) do not estimate their population values. Alternatively, methodologists have recommended applying formulas based on the "sample" robust correction, which consistently estimate the ML-based population values (Brosseau-Liard, Savalei, & Li, 2012; Brosseau-Liard & Savalei, 2014; Savalei, 2018; Gao, Shi, & Maydeu-Olivares, 2020). Future studies should examine the behavior of the robust RMSEA and CFI under the "sample" correction.
Other limitations suggest directions for future research. First, the findings of the current study are based on a correctly specified LGM with linear growth. Additional types of growth models should be investigated in future studies. Conditions with different types and levels of model misspecification should also be examined to test the power of the goodness of fit indices to reject misspecified models, particularly with small sample sizes. In addition, we generated continuous non-normal data using the Vale and Maurelli (1983) method. Previous studies have shown that with the Vale and Maurelli algorithm, the skewness and kurtosis can be downward-biased, especially when the sample size is small (Olvera Astivia & Zumbo, 2015). Future studies should validate the findings using other simulation techniques for non-normal data (e.g., Foldnes & Olsson, 2016). We hope that the results of the current study are informative to applied researchers fitting and evaluating latent growth models with small samples, and possibly with non-normal missing data.
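For a single variable, the Vale and Maurelli (1983) approach reduces to Fleishman's power transformation Y = a + bZ + cZ² + dZ³ of a standard normal Z, with coefficients solved to match target skewness and excess kurtosis; the multivariate extension additionally adjusts the intermediate correlations. A minimal single-variable sketch (target moments chosen for illustration only; the solver starting values are our assumption):

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, exkurt):
    """Solve Fleishman's moment equations for Y = a + b*Z + c*Z**2 + d*Z**3."""
    def equations(params):
        b, c, d = params
        eq1 = b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1.0        # unit variance
        eq2 = 2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew    # target skewness
        eq3 = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
                  + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - exkurt
        return [eq1, eq2, eq3]
    b, c, d = fsolve(equations, x0=[0.9, 0.1, 0.05])
    return -c, b, c, d  # a = -c yields a mean of zero

# Illustrative target: skewness 2, excess kurtosis 7 (a common "non-normal"
# condition in this simulation literature).
a, b, c, d = fleishman_coefficients(2.0, 7.0)
z = np.random.default_rng(1).standard_normal(100_000)
y = a + b*z + c*z**2 + d*z**3  # approximately mean 0, variance 1, right-skewed
```

Because the transformation matches only the first four moments, the realized sample skewness and kurtosis can fall short of the targets in small samples, which is precisely the concern raised by Olvera Astivia and Zumbo (2015).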
Supplementary Material
Figure 6: The Behaviors of the Average Sample CFIs. (a) Average sample CFIs across sample size and data distributions. (b) Average sample CFIs across sample size and number of time occasions. Note. CFI = comparative fit index; N = sample size; t = number of measurement occasions (observed variables).
Acknowledgment:
This work was supported in part by the Advanced Support for Innovative Research Excellence grant [No. 13580-17-44758] funded by the Office of the Vice President for Research at the University of South Carolina. The work by Dexin Shi was also supported in part by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award Number R21DC017252. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Under the basic linear growth model, parameters of interest in the HLM representation have equivalent parameters in LGM representation (Chou, Bentler & Pentz, 1998). Please see McNeish and Matta (2018) for a detailed comparison between the two approaches.
In Mplus, when choosing MLR, by default, the standard errors are computed using the sandwich estimator with observed information in the outer block and cross-products information in the inner block. When using MLM or MLMV, the standard errors are computed using the sandwich estimator with expected information in both outer and inner blocks.
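As a sketch in our own notation (not taken from Mplus documentation), both variants are "sandwich" estimators of the general form

```latex
\widehat{\operatorname{Cov}}(\hat{\theta}) \;=\; \frac{1}{n}\, A^{-1} B A^{-1},
```

where A is the information matrix in the outer ("bread") positions and B is the inner ("meat") matrix. Per the footnote above, MLR takes A as the observed information and B as the cross-products matrix, whereas MLM/MLMV use the expected information in both positions; exact scaling conventions differ across software.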
For example, the number of clusters ranged from 4 to 14 in McNeish & Stapleton (2016a) and ranged from 10 to 50 in McNeish & Harring (2017b). See McNeish & Stapleton (2016b) for a review.
For example, the within-cluster sample size ranged from 5 to 50 in McNeish & Harring (2017b).
Time points 3 and 4 for four repeated measures; time points 5, 6, 7, and 8 for eight repeated measures.
The relative bias (RB) in parameter estimates is computed as RB = (θ̄ − θ_pop) / θ_pop, where θ̄ represents the average sample estimate of the parameter across all replications, and θ_pop indicates the population value.
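A minimal illustration of this computation (with placeholder values, not the study's estimates):

```python
import numpy as np

def relative_bias(estimates, theta_pop):
    """RB = (average estimate across replications - population value) / population value."""
    return (np.mean(estimates) - theta_pop) / theta_pop

# Placeholder per-replication estimates of a parameter whose population value is 1.0.
estimates = np.array([0.97, 1.03, 1.01, 0.99, 1.05])
rb = relative_bias(estimates, theta_pop=1.0)  # approximately 0.01, i.e., 1% bias
```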
Contributor Information
Dexin Shi, University of South Carolina, Columbia, SC, USA.
Christine DiStefano, University of South Carolina, Columbia, SC, USA.
Xiaying Zheng, American Institutes for Research, Washington, DC, USA.
Ren Liu, University of California, Merced, CA, USA.
Zhehan Jiang, Institute of Medical Education & National Center for Health Professions Education Development, Peking University, Beijing, China.
References
- Asparouhov T, & Muthén B (2005). Multivariate statistical modeling with survey data. In Proceedings of the Federal Committee on Statistical Methodology (FCSM) Research Conference.
- Asparouhov T, & Muthén B (2010). Simple second order chi-square correction scaled chi-square statistics (Technical appendix). Los Angeles, CA.
- Bandalos DL, & Leite W. (2013). The role of simulation in structural equation modeling. In Structural equation modeling: A second course (2nd ed., pp. 625–666). Greenwich, CT: Information Age.
- Bauer DJ, & Curran PJ (2003). Distributional assumptions of growth mixture models: implications for overextraction of latent trajectory classes. Psychological Methods, 8(3), 338–363.
- Bentler PM (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246.
- Bollen KA (1989). Structural equations with latent variables. New York: Wiley.
- Bollen KA, & Curran PJ (2006). Latent curve models: A structural equation perspective (Vol. 467). Wiley-Interscience.
- Boomsma A. (1982). The robustness of LISREL against small sample sizes in factor analysis models. In Jöreskog KG & Wold H. (Eds.), Systems under indirect observation: Causality, structure, prediction (Part 1, pp. 149–173). Amsterdam: North-Holland.
- Bradley JV (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144–152.
- Brosseau-Liard PE, Savalei V, & Li L. (2012). An investigation of the sample performance of two nonnormality corrections for RMSEA. Multivariate Behavioral Research, 47(6), 904–930.
- Brosseau-Liard PE, & Savalei V. (2014). Adjusting incremental fit indices for nonnormality. Multivariate Behavioral Research, 49(5), 460–470.
- Chou CP, Bentler PM, & Pentz MA (1998). Comparisons of two statistical approaches to study growth curves: The multilevel model and the latent curve analysis. Structural Equation Modeling: A Multidisciplinary Journal, 5(3), 247–266.
- Curran PJ, West SG, & Finch JF (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16.
- Enders CK (2010). Applied missing data analysis. New York: Guilford Press.
- Ferrer E, Balluerka N, & Widaman KF (2008). Factorial invariance and the specification of second-order latent growth models. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 4(1), 22.
- Ferguson CJ (2009). An effect size primer: a guide for clinicians and researchers. Professional Psychology: Research and Practice, 40, 532–538.
- Finney S, & DiStefano C. (2013). Nonnormal and categorical data in structural equation modeling. In Hancock GR & Mueller RO (Eds.), Structural equation modeling: A second course (2nd ed., pp. 439–492). Charlotte, NC: Information Age.
- Foldnes N, & Olsson UH (2016). A simple simulation technique for nonnormal data with prespecified skewness, kurtosis, and covariance matrix. Multivariate Behavioral Research, 51(2–3), 207–219.
- Gao C, Shi D, & Maydeu-Olivares A. (2020). Estimating the maximum likelihood root mean square error of approximation (RMSEA) with non-normal data: A Monte-Carlo study. Structural Equation Modeling: A Multidisciplinary Journal, 27(2), 192–201.
- Grimm KJ, Ram N, & Estabrook R. (2017). Growth modeling: Structural equation and multilevel modeling approaches. Guilford Publications.
- Hatton DD, Sideris J, Skinner M, Mankowski J, Bailey DB Jr, Roberts J, & Mirrett P. (2006). Autistic behavior in children with fragile X syndrome: prevalence, stability, and the impact of FMRP. American Journal of Medical Genetics Part A, 140(17), 1804–1813.
- Hedeker D, & Gibbons RD (2006). Longitudinal data analysis. New York: Wiley.
- Hoogland JJ, & Boomsma A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26(3), 329–367.
- Hu LT, & Bentler PM (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.
- Jöreskog KG, & Sörbom D. (1988). LISREL 7. A guide to the program and applications (2nd ed.). Chicago, IL: International Education Services.
- Maydeu-Olivares A. (2017). Maximum likelihood estimation of structural equation models for continuous data: Standard errors and goodness of fit. Structural Equation Modeling: A Multidisciplinary Journal, 24(3), 383–394.
- McNeish D, & Stapleton LM (2016a). Modeling clustered data with very few clusters. Multivariate Behavioral Research, 51(4), 495–518.
- McNeish DM, & Stapleton LM (2016b). The effect of small sample size on two-level model estimates: A review and illustration. Educational Psychology Review, 28(2), 295–314.
- McNeish D. (2017). Brief research report: Growth models with small samples and missing data. The Journal of Experimental Education, 1–12.
- McNeish D, & Harring JR (2017a). Correcting model fit criteria for small sample latent growth models with incomplete data. Educational and Psychological Measurement, 77(6), 990–1018.
- McNeish DM, & Harring JR (2017b). Clustered data with small sample sizes: Comparing the performance of model-based and design-based approaches. Communications in Statistics-Simulation and Computation, 46(2), 855–869.
- McNeish D, & Matta T. (2018). Differentiating between mixed-effects and latent-curve approaches to growth modeling. Behavior Research Methods, 50(4), 1398–1414.
- Meredith W, & Tisak J. (1984, June). "Tuckerizing" curves. Paper presented at the meeting of the Psychometric Society, Santa Barbara, CA.
- Meredith W, & Tisak J. (1990). Latent curve analysis. Psychometrika, 55(1), 107–122.
- Micceri T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156–166.
- Moshagen M. (2012). The model size effect in SEM: Inflated goodness-of-fit statistics are due to the size of the covariance matrix. Structural Equation Modeling: A Multidisciplinary Journal, 19(1), 86–98.
- Muthén L, & Muthén B. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9(4), 599–620.
- Muthén LK, & Muthén BO (1998–2018). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
- Nunnally JC, & Bernstein IH (1967). Psychometric theory (Vol. 226). New York: McGraw-Hill.
- Olvera Astivia OL, & Zumbo BD (2015). A cautionary note on the use of the Vale and Maurelli method to generate multivariate, nonnormal data for simulation purposes. Educational and Psychological Measurement, 75(4), 541–567.
- Pavlov G, Shi D, & Maydeu-Olivares A. (2020). Chi-square difference tests for comparing nested models: An evaluation with non-normal data. Structural Equation Modeling: A Multidisciplinary Journal, 27(6), 908–917.
- Pornprasertmanit S, Miller P, & Schoemann AM (2012). R package simsem: SIMulated structural equation modeling. Available from the Comprehensive R Archive Network: http://cran.r-project.org.
- R Development Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria.
- Ross CA (1997). Dissociative identity disorder: Diagnosis, clinical features, and treatment of multiple personality. John Wiley & Sons Inc.
- Rubin DB (1976). Inference and missing data. Biometrika, 63(3), 581–592.
- Satorra A, & Bentler PM (1994). Corrections to test statistics and standard errors in covariance structure analysis. In Von Eye A. & Clogg CC (Eds.), Latent variable analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
- Savalei V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychological Methods, 15, 352–367.
- Savalei V. (2014). Understanding robust corrections in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 21(1), 149–160.
- Savalei V. (2018). On the computation of the RMSEA and CFI from the mean-and-variance corrected test statistic with nonnormal data in SEM. Multivariate Behavioral Research, 53(3), 419–429.
- Steiger J, & Lind JC (1980, May). Statistically based tests for the number of common factors. Paper presented at the Annual Spring Meeting of the Psychometric Society, Iowa City.
- Shi D, Lee T, & Terry RA (2018). Revisiting the model size effect in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25(1), 21–40.
- Shi D, Lee T, & Maydeu-Olivares A. (2019). Understanding the model size effect on SEM fit indices. Educational and Psychological Measurement, 79(2), 310–334.
- Shi D, Lee T, Fairchild AJ, & Maydeu-Olivares A. (2019). Fitting ordinal factor analysis models with missing data: A comparison between pairwise deletion and multiple imputation. Educational and Psychological Measurement.
- Vale CD, & Maurelli VA (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48(3), 465–471.
- West SG, Taylor AB, & Wu W. (2012). Model fit and model selection in structural equation modeling. In Hoyle RH (Ed.), Handbook of structural equation modeling (pp. 209–231). New York, NY: Guilford Press.
- Zheng X. (2017). Latent growth curve analysis with item response data: Model specification, estimation, and panel attrition (Unpublished doctoral dissertation). University of Maryland, College Park, MD.