Educational and Psychological Measurement. 2018;79(2):310–334. doi: 10.1177/0013164418783530

Understanding the Model Size Effect on SEM Fit Indices

Dexin Shi, Taehun Lee, Alberto Maydeu-Olivares

Abstract

This study investigated the effect the number of observed variables (p) has on three structural equation modeling indices: the comparative fit index (CFI), the Tucker–Lewis index (TLI), and the root mean square error of approximation (RMSEA). The behaviors of the population fit indices and their sample estimates were compared under various conditions created by manipulating the number of observed variables, the types of model misspecification, the sample size, and the magnitude of factor loadings. The results showed that the effect of p on the population CFI and TLI depended on the type of specification error, whereas a higher p was associated with lower values of the population RMSEA regardless of the type of model misspecification. In finite samples, all three fit indices tended to yield estimates that suggested a worse fit than their population counterparts, which was more pronounced with a smaller sample size, higher p, and lower factor loading.

Keywords: model size effect, structural equation modeling (SEM), fit indices


In applications of structural equation modeling (SEM), a critical step is to evaluate the goodness of fit of the proposed model to the data. When maximum likelihood is used to estimate a model, the likelihood ratio (LR) test statistic is the most commonly used test for assessing overall goodness of fit (Jöreskog, 1969; Maydeu-Olivares, 2017). Assuming that the proposed model is correctly specified, the test statistic asymptotically follows a central chi-square distribution. The chi-square test therefore allows researchers to evaluate the fit of a model within the null hypothesis significance testing framework. For the chi-square test to be valid, one important assumption is that the sample size (N) is sufficiently large. It is generally believed that fitting a large SEM model (with many observed variables)1 to moderate or small samples results in an upwardly biased chi-square statistic and, thus, an inflated Type I error rate. This upward bias in the LR-based chi-square statistic is known as the model size effect (Herzog, Boomsma, & Reinecke, 2007; Moshagen, 2012; Shi, Lee, & Terry, 2015, 2017; Yuan, Tian, & Yanagihara, 2015), and it has important ramifications for empirical practice.

On one hand, large models are frequently encountered in psychological research. For example, in longitudinal studies with latent variables, the number of observed variables grows as the number of measurement occasions increases. Jackson, Gillaspy, and Purc-Stephenson (2009) reviewed 194 published studies and found that the median number of observed items included in the models was 17, and 25% of models included more than 24 items. In addition, models with more observed variables can be desirable from many perspectives. For example, based on classic psychometric theory, using many variables or items (p) is recommended for achieving higher reliability (Lord & Novick, 1968; McDonald, 1999). Research has also shown that using more items in the measurement model is associated with more proper solutions and more accurate parameter estimates (Marsh, Hau, Balla, & Grayson, 1998). On the other hand, fitting large multiple-item models in SEM creates difficulties for the LR-based chi-square test because, relative to the model size, the sample is effectively small. Paradoxically, the model size effect thus pits well-fitting models against many well-established and desirable goals (e.g., reliable scores). The model size effect on LR-based chi-square statistics has been investigated in many studies (Herzog et al., 2007; Moshagen, 2012; Shi, Lee, et al., 2015, 2017; Yuan et al., 2015).

In practice, the chi-square test is “not always the final word in assessing fit” (West, Taylor, & Wu, 2012, p. 211). A major concern is that the LR chi-square test is a test of exact fit: the null hypothesis states that there is no discrepancy between the hypothesized model and the true data-generating process. In practice, the model under consideration is almost always incorrect to some degree (Box, 1979; MacCallum, 2003). As a result, the chi-square test of exact fit often rejects the null hypothesis, especially in large samples, even when the postulated model is only trivially false. As such, a host of goodness-of-fit measures have been developed to provide additional information about the usefulness of a hypothesized model that, while not exactly correct, explains the observed data reasonably well. Many fit indices are based on, or computed from, the LR chi-square statistic. In this article, we consider three commonly used fit indices: the comparative fit index (CFI), the Tucker–Lewis index (TLI), and the root mean square error of approximation (RMSEA). The formulas for these fit indices are described below.

The CFI (Bentler, 1990) measures the relative improvement in fit going from the baseline model to the postulated model. Following Bentler (1990, p. 240), the population CFI can be expressed as follows:

$$\mathrm{CFI} = 1 - \frac{F_k}{F_0},$$

where F_k and F_0 represent the minimum of some discrepancy function for the postulated model and the baseline model, respectively. The sample CFI is estimated as follows:

$$\widehat{\mathrm{CFI}} = \frac{\max(\chi_0^2 - df_0,\, 0) - \max(\chi_k^2 - df_k,\, 0)}{\max(\chi_0^2 - df_0,\, 0)},$$

where χ²_0 and df_0 denote the chi-square statistic and degrees of freedom for the baseline model, and χ²_k and df_k denote the chi-square statistic and degrees of freedom for the postulated model, respectively. The CFI is a normed fit index in the sense that it ranges between 0 and 1, with higher values indicating a better fit. The most commonly used criterion for a good fit is CFI ≥ .95 (Hu & Bentler, 1999; West et al., 2012).

The TLI (Tucker & Lewis, 1973) measures a relative reduction in misfit per degree of freedom. This index was originally proposed by Tucker and Lewis (1973) in the context of exploratory factor analysis and later generalized to the covariance structure analysis context and labeled as the nonnormed fit index by Bentler and Bonett (1980). This index is nonnormed in that its value can occasionally be negative or exceed 1. Following the expression of Bentler (1990, p. 241), the population TLI can be expressed as follows:

$$\mathrm{TLI} = 1 - \frac{F_k/df_k}{F_0/df_0},$$

where F_0/df_0 and F_k/df_k represent the misfit per degree of freedom for the baseline model and the postulated model, respectively.

The sample estimator of TLI can be given as follows:

$$\widehat{\mathrm{TLI}} = \frac{\chi_0^2/df_0 - \chi_k^2/df_k}{\chi_0^2/df_0 - 1}.$$

In general, TLI ≥ .95 is a commonly used cutoff criterion for the goodness of fit (Hu & Bentler, 1999; West et al., 2012).

The RMSEA (Steiger, 1989, 1990; Steiger & Lind, 1980) measures the discrepancy due to the approximation per degree of freedom as follows:

$$\mathrm{RMSEA} = \sqrt{\frac{F_k}{df_k}},$$

where F_k denotes the minimum of some discrepancy function between the population covariance matrix Σ and the model-implied covariance matrix Σ_0 for the hypothesized model. The sample estimate of RMSEA is defined as follows (Browne & Cudeck, 1993):

$$\widehat{\mathrm{RMSEA}} = \sqrt{\frac{\max(\chi_k^2 - df_k,\, 0)}{df_k\,(N-1)}}.$$

The RMSEA is a badness-of-fit measure, yielding lower values for a better fit. An RMSEA ≤ .06 could be considered acceptable (Hu & Bentler, 1999), whereas a model with an RMSEA ≥ .10 is unworthy of serious consideration (Browne & Cudeck, 1993).
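
To make these formulas concrete, the following minimal R sketch (for illustration only; not part of the original analyses) computes the sample CFI, TLI, and RMSEA directly from the chi-square statistics of the postulated and baseline models; the numerical inputs are hypothetical.

```r
# Sample CFI, TLI, and RMSEA computed from chi-square statistics,
# following the formulas given above.
fit_indices <- function(chisq_k, df_k, chisq_0, df_0, N) {
  d_k <- max(chisq_k - df_k, 0)   # excess misfit of the postulated model
  d_0 <- max(chisq_0 - df_0, 0)   # excess misfit of the baseline model
  cfi   <- (d_0 - d_k) / d_0
  tli   <- (chisq_0 / df_0 - chisq_k / df_k) / (chisq_0 / df_0 - 1)
  rmsea <- sqrt(d_k / (df_k * (N - 1)))
  c(CFI = cfi, TLI = tli, RMSEA = rmsea)
}

# Hypothetical single-factor model with p = 10 indicators and N = 200
fit_indices(chisq_k = 45.2, df_k = 35, chisq_0 = 612.8, df_0 = 45, N = 200)
```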

The above three fit indices are routinely reported by SEM software (e.g., Mplus) and are used as standard tools for evaluating model fit (Hancock & Mueller, 2010; McDonald & Ho, 2002). As discussed in the model size literature (e.g., Herzog et al., 2007; Moshagen, 2012), these fit indices are also likely to be influenced by the model size because they are functions of the LR chi-square statistic, which tends to be upwardly biased in large models. In practice, applied researchers rely heavily on practical fit indices rather than on the formal chi-square test when evaluating specified SEM models. Therefore, for the appropriate application of practical fit indices, it is essential to understand whether these indices tend to increase or decrease as the model size increases.

Previous studies have shed some light on the effect of model size on practical fit indices. Under correctly specified models, researchers have focused on the behaviors of the fit indices in the sample.2 The results showed that increasing the number of indicators (p) led to a decline in the average sample estimates of CFI and TLI, indicating that the model fit worsens (Anderson & Gerbing, 1984; Ding, Velicer, & Harlow, 1995; Kenny & McCoach, 2003). However, it was found that the average RMSEA tended to decrease (i.e., indicating an improved fit) as more indicators were added to the correctly specified models (Kenny & McCoach, 2003).

Under the conditions of misspecified models, most previous studies examined the effect of the number of indicators on the population values of the selected practical fit indices. Kenny and McCoach (2003) investigated the effect of the number of observed variables (p) on the population values of CFI, TLI, and RMSEA. They found that as p increased the population RMSEA tended to decrease, indicating that the model fit improved regardless of the type of model misspecification; for CFI and TLI, their population values tended to decrease (i.e., indicating a worse fit) when the model misspecifications were introduced by fitting a single-factor model to two-factor data or by omitting cross-loadings. However, when the models were misspecified by ignoring the nonzero residual correlations, it was found that the population values of CFI and TLI tended to increase (i.e., indications of a better fit) as p increased. Breivik and Olsson (2001) also found similar patterns of behaviors for the population values of CFI and RMSEA. More recently, Savalei (2012) also found that in general, the population RMSEA tended to eventually decrease as the number of indicators (p) increased regardless of the type of model misspecification.

Although suggestive, the findings of these studies are somewhat limited for the following reasons. First, the number of observed variables (p) manipulated in previous studies was at most 40. For example, the largest p considered by Ding et al. (1995) was 18. For Kenny and McCoach (2003), p ranged from 4 to 25 for correctly specified models and from 4 to 40 for misspecified models. In psychological studies, many questionnaires include a very large number of items (i.e., p). For example, the commonly used Revised NEO Personality Inventory (NEO PI-R) has 240 items (Costa & McCrae, 1992). In education-related studies, many comprehensive tests also include more than 100 items (e.g., ETS Major Field Tests; Ling, 2012). CFA models have also been used to study gene expression microarray data, where hundreds of genes are considered as observed indicators (Xie & Bentler, 2003). In these applications, researchers fitted factor analysis models with extremely large numbers of observed variables, and the practical fit indices (e.g., CFI, TLI, and RMSEA) were interpreted and used to evaluate the models. Therefore, findings based on simulation studies with models of small to moderate size may not generalize to psychological research using models of much larger size.

Second, under model misspecification conditions, most previous studies focused on the behavior of the population fit indices. In practice, researchers would only be able to obtain and interpret the sample estimates of the fit indices, not the population values. As shown earlier, the estimates of CFI, TLI, and RMSEA are functions of the chi-square statistic, whose bias is affected by both the sample size and the model size (Moshagen, 2012; Shi, Lee, et al., 2017). Therefore, it would be unwise to simply assume that the findings about population values of fit indices are directly applicable to their sample estimates. Unfortunately, the effect of model size on the behaviors of the sample fit indices under realistic settings involving model misspecifications has not yet been systematically examined.

To fill these gaps in the literature, this article aims to understand the effects the number of observed variables (p) has on SEM fit indices through a more comprehensive simulation study. In our simulation design, we manipulated p up to 120 observed variables so that the findings can be generalized to more complex modeling situations. Moreover, in misspecified models, the behaviors of both the sample fit indices and their population values were studied and compared. We close by offering some practical guidance on using the fit indices in large SEM models.

Monte Carlo Simulation

We performed a simulation study to investigate the effect of model size on CFI, TLI, and RMSEA in both correctly specified and misspecified models. We examined the following three types of model (mis)specifications:

  1. Correctly specified models: The true data-generating model was a single-factor model, and the same model was fitted to the simulated data.

  2. Misspecified dimensionality: The data-generating model was a two-factor CFA model with an interfactor correlation of .90. A single-factor model was fitted to the simulated data.

  3. Omitted residual correlations: The data-generating model was a single-factor model with three correlated residuals. The true value for the residual correlations was .15. The fitted model was a single-factor model with correlations among residual terms fixed to zero.

It should be noted that the degree of model misspecification examined in the current study is substantively ignorable (see Shi, Maydeu-Olivares, & DiStefano, 2018), which is intended to simulate situations where the specified model is trivially false or the misspecified component would be uninteresting to most researchers. That is, two factors correlated at .90 cannot (or need not) be meaningfully discriminated in practice, and the precise estimation of residual correlations of .15 could be considered meaningless or uninteresting to most researchers. It would be fair to say that only correctly specified or slightly misspecified models should be retained and interpreted, because inferences drawn from poorly fitting models can be misleading (Saris, Satorra, & van der Veld, 2009). Therefore, in the present article, the effect of model size on the behavior of the selected fit indices was studied under conditions involving slight model misspecifications.

For model identification, the variances of the factors were set to 1.0. Other variables manipulated in the simulation are described below.

  • Model size: Model size was indexed by the total number of observed variables (p), which was set to 10, 30, 60, 90, or 120. When the population model had a two-factor structure, the observed variables were split evenly between the two factors (p/2 indicators per factor).

  • Sample size: Sample sizes included 200, 500, and 1,000.

  • Levels of factor loadings: We included items with low (.40) or high (.80) factor loadings (λ), representing either weak or strong factors. The variances of the error terms were set to 1 − λ².

In summary, the number of conditions examined was 90 = 3 (types of model specification) × 5 (model size levels) × 3 (sample size levels) × 2 (factor loading levels). For each simulated condition, 1,000 replications were generated with the simsem package in R (Pornprasertmanit, Miller, & Schoemann, 2012; R Development Core Team, 2015). The observed data were generated from a multivariate normal distribution.

For each condition, we first fitted the single-factor CFA model to the population covariance matrix and computed the population values of CFI, TLI, and RMSEA. A series of single-factor models with a varying number of indicators (p) were then fitted to the simulated data sets, from which the empirical distributions of CFI, TLI, and RMSEA were obtained across the 1,000 replications. All data analyses used maximum likelihood estimation in the lavaan package in R (R Development Core Team, 2015; Rosseel, 2012). All replications converged in all conditions.
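
As a minimal illustration of one cell of this design (a sketch using lavaan only, not the authors' simsem code; the seed and model strings below are ours), the following generates a single replication under the misspecified-dimensionality condition with p = 10, λ = .40, ρ = .90, and N = 200, fits the single-factor model, and extracts the fit indices.

```r
library(lavaan)

p <- 10; lambda <- .40; N <- 200

# Population (data-generating) model: two factors correlated at .90,
# p/2 indicators per factor, fixed loadings, unit factor variances
pop_model <- paste0(
  "f1 =~ ", paste0(lambda, "*x", 1:(p/2), collapse = " + "), "\n",
  "f2 =~ ", paste0(lambda, "*x", (p/2 + 1):p, collapse = " + "), "\n",
  "f1 ~~ 1*f1\nf2 ~~ 1*f2\nf1 ~~ 0.9*f2\n",
  paste0("x", 1:p, " ~~ ", 1 - lambda^2, "*x", 1:p, collapse = "\n")
)

# Fitted (misspecified) model: a single factor for all p indicators
fit_model <- paste0("f =~ ", paste0("x", 1:p, collapse = " + "))

set.seed(123)
dat <- simulateData(pop_model, sample.nobs = N)   # one replication
fit <- cfa(fit_model, data = dat, std.lv = TRUE)  # factor variance fixed to 1
fitMeasures(fit, c("chisq", "df", "cfi", "tli", "rmsea"))
```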

The behaviors of CFI, TLI, and RMSEA across different simulation conditions are summarized in Tables 1 to 3. Specifically, for each fit index, we reported their population values and the average sample estimates and computed the relative differences (biases) between the two quantities. That is, relative bias (RB) was computed as follows:

Table 1.

Effect of p on CFI.

Factor loadings  N  p  |  Correctly specified: POP  EST  RB  |  Misspecified dimensionality: POP  EST  RB  |  Omitted residual correlations: POP  EST  RB
.4 200 10 1.000 0.972 −0.03 0.994 0.965 −0.03 0.901 0.894 −0.01
30 1.000 0.956 −0.04 0.989 0.944 −0.05 0.973 0.935 −0.04
60 1.000 0.872 −0.13 0.985 0.853 −0.13 0.988 0.863 −0.13
90 1.000 0.751 −0.25 0.981 0.728 −0.26 0.993 0.747 −0.25
120 1.000 0.611 −0.39 0.978 0.588 −0.40 0.995 0.609 −0.39
500 10 1.000 0.990 −0.01 0.994 0.984 −0.01 0.901 0.899 0.00
30 1.000 0.989 −0.01 0.989 0.980 −0.01 0.973 0.966 −0.01
60 1.000 0.979 −0.02 0.985 0.963 −0.02 0.988 0.968 −0.02
90 1.000 0.958 −0.04 0.981 0.938 −0.04 0.993 0.952 −0.04
120 1.000 0.928 −0.07 0.978 0.905 −0.07 0.995 0.924 −0.07
1,000 10 1.000 0.995 −0.01 0.994 0.990 0.00 0.901 0.900 0.00
30 1.000 0.995 −0.01 0.989 0.987 0.00 0.973 0.972 0.00
60 1.000 0.994 −0.01 0.985 0.980 −0.01 0.988 0.983 −0.01
90 1.000 0.990 −0.01 0.981 0.970 −0.01 0.993 0.983 −0.01
120 1.000 0.983 −0.02 0.978 0.960 −0.02 0.995 0.977 −0.02
.8 200 10 1.000 0.997 0.00 0.968 0.967 0.00 0.938 0.936 0.00
30 1.000 0.994 −0.01 0.951 0.945 −0.01 0.980 0.975 −0.01
60 1.000 0.980 −0.02 0.941 0.921 −0.02 0.990 0.970 −0.02
90 1.000 0.954 −0.05 0.936 0.890 −0.05 0.994 0.948 −0.05
120 1.000 0.912 −0.09 0.932 0.849 −0.09 0.995 0.908 −0.09
500 10 1.000 0.999 0.00 0.968 0.968 0.00 0.938 0.937 0.00
30 1.000 0.999 0.00 0.951 0.949 0.00 0.980 0.979 0.00
60 1.000 0.997 0.00 0.941 0.938 0.00 0.990 0.987 0.00
90 1.000 0.994 −0.01 0.936 0.929 −0.01 0.994 0.987 −0.01
120 1.000 0.988 −0.01 0.932 0.921 −0.01 0.995 0.984 −0.01
1,000 10 1.000 0.999 0.00 0.968 0.968 0.00 0.938 0.937 0.00
30 1.000 0.999 0.00 0.951 0.950 0.00 0.980 0.980 0.00
60 1.000 0.999 0.00 0.941 0.940 0.00 0.990 0.990 0.00
90 1.000 0.998 0.00 0.936 0.934 0.00 0.994 0.992 0.00
120 1.000 0.997 0.00 0.932 0.930 0.00 0.995 0.993 0.00

Note. CFI = comparative fit index; factor loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias. POP indicates the population CFI; EST indicates the average sample estimates of CFI. For POP and EST, values less than 0.95 are underlined. |RB| larger than 0.10 (10%) are in boldface.

Table 3.

Effect of p on RMSEA.

Factor loadings  N  p  |  Correctly specified: POP  EST  AB  |  Misspecified dimensionality: POP  EST  AB  RB  |  Omitted residual correlations: POP  EST  AB  RB
.4 200 10 0.000 0.016 0.016 0.010 0.017 0.007 0.70 0.049 0.047 −0.002 −0.04
30 0.000 0.017 0.017 0.009 0.019 0.010 1.11 0.015 0.022 0.007 0.47
60 0.000 0.026 0.026 0.008 0.027 0.019 2.38 0.007 0.027 0.020 2.86
90 0.000 0.033 0.033 0.008 0.034 0.026 3.25 0.005 0.033 0.028 5.60
120 0.000 0.040 0.040 0.007 0.041 0.034 4.86 0.004 0.040 0.036 9.00
500 10 0.000 0.009 0.009 0.010 0.012 0.002 0.20 0.049 0.048 −0.001 −0.02
30 0.000 0.007 0.007 0.009 0.011 0.002 0.22 0.015 0.016 0.001 0.07
60 0.000 0.009 0.009 0.008 0.012 0.004 0.50 0.007 0.012 0.005 0.71
90 0.000 0.012 0.012 0.008 0.014 0.006 0.75 0.005 0.013 0.008 1.60
120 0.000 0.014 0.014 0.007 0.016 0.009 1.29 0.004 0.014 0.010 2.50
1,000 10 0.000 0.006 0.006 0.010 0.010 0.000 0.00 0.049 0.048 −0.001 −0.02
30 0.000 0.004 0.004 0.009 0.009 0.000 0.00 0.015 0.015 0.000 0.00
60 0.000 0.004 0.004 0.008 0.009 0.001 0.13 0.007 0.009 0.002 0.29
90 0.000 0.005 0.005 0.008 0.010 0.002 0.25 0.005 0.007 0.002 0.40
120 0.000 0.007 0.007 0.007 0.010 0.003 0.43 0.004 0.008 0.004 1.00
.8 200 10 0.000 0.017 0.017 0.078 0.077 −0.001 −0.01 0.120 0.120 0.000 0.00
30 0.000 0.017 0.017 0.056 0.059 0.003 0.05 0.037 0.041 0.004 0.11
60 0.000 0.026 0.026 0.044 0.051 0.007 0.16 0.018 0.032 0.014 0.78
90 0.000 0.033 0.033 0.037 0.050 0.013 0.35 0.012 0.035 0.023 1.92
120 0.000 0.040 0.040 0.033 0.052 0.019 0.58 0.009 0.041 0.032 3.56
500 10 0.000 0.010 0.010 0.078 0.078 0.000 0.00 0.120 0.120 0.000 0.00
30 0.000 0.007 0.007 0.056 0.056 0.000 0.00 0.037 0.038 0.001 0.03
60 0.000 0.009 0.009 0.044 0.045 0.001 0.02 0.018 0.021 0.003 0.17
90 0.000 0.012 0.012 0.037 0.039 0.002 0.05 0.012 0.017 0.005 0.42
120 0.000 0.014 0.014 0.033 0.036 0.003 0.09 0.009 0.017 0.008 0.89
1,000 10 0.000 0.007 0.007 0.078 0.078 0.000 0.00 0.120 0.120 0.000 0.00
30 0.000 0.004 0.004 0.056 0.056 0.000 0.00 0.037 0.037 0.000 0.00
60 0.000 0.004 0.004 0.044 0.044 0.000 0.00 0.018 0.019 0.001 0.06
90 0.000 0.005 0.005 0.037 0.038 0.001 0.03 0.012 0.013 0.001 0.08
120 0.000 0.007 0.007 0.033 0.034 0.001 0.03 0.009 0.011 0.002 0.22

Note. RMSEA = root mean square error of approximation; factor loadings = (standardized) factor loadings; N = sample size; P = number of observed variables; RB = relative bias; AB = absolute bias. POP indicates the population RMSEA; EST indicates the average sample estimates of RMSEA. For POP and EST, values greater than 0.06 are underlined. |RB| larger than 0.10 (10%) are in boldface.

$$\mathrm{RB} = \frac{\bar{\theta}_{\mathrm{est}} - \theta_{\mathrm{pop}}}{\theta_{\mathrm{pop}}},$$

where θ̄_est represents the average of the sample estimates of a fit index across the 1,000 replications and θ_pop denotes its population value. Following recommendations from previous studies, RBs less than 10% in absolute value were considered acceptable (Muthén, Kaplan, & Hollis, 1987; Muthén & Muthén, 2002; Shi, Song, & Lewis, 2017). Note that for correctly specified models the population RMSEA is zero, so RB is undefined; under such conditions, the absolute bias (AB) was computed instead:

$$\mathrm{AB} = \bar{\theta}_{\mathrm{est}} - \theta_{\mathrm{pop}}.$$

We also recognized that the values of RB can be deceptively high when the population RMSEA values are near zero (e.g., .004). We therefore reported both RB and AB for RMSEA under misspecified conditions.
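
For clarity, the two bias summaries amount to a pair of one-line functions; the R sketch below (with made-up estimates, for illustration only) shows how RB and AB are computed.

```r
# Relative bias (RB) and absolute bias (AB) of the average sample estimate
# of a fit index relative to its population value.
relative_bias <- function(est, pop) (mean(est) - pop) / pop
absolute_bias <- function(est, pop) mean(est) - pop

# Hypothetical sample RMSEA estimates with a population RMSEA of .004
set.seed(1)
est_rmsea <- rnorm(1000, mean = .008, sd = .002)
relative_bias(est_rmsea, pop = .004)  # roughly 1.0, i.e., about 100% relative bias
absolute_bias(est_rmsea, pop = .004)  # roughly .004
```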

Analyses of variance (ANOVAs) were conducted with the RB as the dependent variable and the simulation conditions (and their interactions) as independent variables. An eta-square (η2) value above 5% was used to identify conditions that contributed to sizable amounts of variability in the outcome. For visual presentations of the patterns (see Figures 1-3), we also plotted the average sample estimates of the fit indices against model size levels for different levels of sample size (N = 200, 500, 1,000, population), factor loading (λ = .40, .80), and misspecification (no misspecification, omitted correlated residuals, and misspecified dimensionality). A horizontal line has been drawn in these figures to mark the cutoff values for CFI (.95), TLI (.95), and RMSEA (.06) suggested by Hu and Bentler (1999).
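
The eta-squared screening is a standard ANOVA variance decomposition; the sketch below (assuming a hypothetical data frame sim_results with a column RB and the design factors N, p, loading, and misspec) shows one way to compute it.

```r
# Eta-squared for each term: its sum of squares divided by the total sum of squares
eta_squared <- function(fit) {
  a <- anova(fit)
  setNames(a[["Sum Sq"]] / sum(a[["Sum Sq"]]), rownames(a))
}

# Hypothetical usage: main effects and two-way interactions of the design factors
# fit <- lm(RB ~ (N + p + loading + misspec)^2, data = sim_results)
# round(eta_squared(fit), 2)   # terms with values above .05 flagged as sizable
```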

Figure 1. Effect of p on comparative fit index (CFI).

Figure 2. Effect of p on Tucker–Lewis index (TLI).

Figure 3. Effect of p on root mean square error of approximation (RMSEA).

Comparative Fit Index

Table 1 and Figure 1 demonstrate the behaviors of population values and sample estimates of CFI as a function of model size (p), factor loading (λ), and sample size (N) under the three conditions of model specification. As seen in the figure, for the correctly specified models, the population CFI is a constant of 1.0, independent of p and λ (also see the first column in Table 1). Under the condition of misspecified dimensionality where a single-factor model was fitted to the data sets generated by the two-factor models with  ρ = .90, as p increased from 10 to 120, the population CFI tended to decrease from .994 to .978 when λ = .40, and from .968 to .932 when λ = .80 (the second column in Table 1). Under the condition of omitted residual covariances, the population CFI tended to increase with p from .901 to .995 when λ = .40, and from .938 to .995 when λ = .80 (the third column in Table 1). As a reminder, it should be noted that we intended to simulate modeling situations where the degree of specification error is relatively small. This was evidenced by the population values of CFI exceeding the Hu–Bentler cutoff (CFI ≥ .95) under most simulation conditions, with the lowest value being .901.

The ANOVA results showed that the important sources of variance in the relative bias of the sample CFI were, in descending order of η², the sample size (N, η² = .22), the interaction between the number of observed variables and sample size (N × p, η² = .16), and the model size (p, η² = .14), followed by the magnitude of the factor loadings (λ) and its interactions with sample size (λ × N, η² = .09) and model size (λ × p, η² = .06).

In general, as shown in Table 1, the population CFI can be accurately approximated with a small RB (|RB| < 10%) when the sample size is large (i.e., N ≥ 500) across all conditions. On the other hand, when the sample size was small (e.g., N = 200), a noticeable RB (i.e., |RB| ≥ 10%; see the values in bold in Table 1) could occur when the model size was large (p ≥ 60) and the quality of measurement was mediocre (λ = .40). In addition, the effect of p became more conspicuous when the magnitude of the factor loadings was small (λ = .40). For example, when fitting correctly specified models with N = 200 and λ = .40, the RBs (in absolute value) increased from 3% to 39% as p increased from 10 to 120. When the magnitude of the factor loadings was large (λ = .80), the increase in |RB| was smaller, ranging from 0% (p = 10) to 9% (p = 120).

Because the population CFI tended to be underestimated across all conditions, models with no or only minor specification errors can be rejected if evaluated solely on the basis of the sample CFI. The results in Table 1 suggest that rejecting models whose population CFI indicates a well-fitting model is indeed likely to occur when the model size is very large (e.g., p ≥ 90; see the underlined values in Table 1). For example, under correctly specified models with factor loadings of .40 and N = 200, as p increased from 10 to 120, the average estimate of CFI changed from .972 (a closely fitting model) to .611 (a poorly fitting model). A similar pattern was observed when collapsing two factors with ρ = .90: as p increased from 10 to 120, the average CFI dropped from .965 to .588, leading to different conclusions about model fit. When the misspecification was caused by omitting residual correlations, the average CFI initially increased as more observed variables were added to the fitted model, but as p continued to increase, the average CFI eventually decreased. Taking N = 200 and factor loadings of .80 as an example, when p increased from 10 to 30, the mean CFI increased from .936 to .975; the average CFI then dropped to .948 (p = 90) and reached its lowest value of .908 at p = 120.

Tucker–Lewis Index

As shown in Table 2 and Figure 2, the behaviors of both population values and their sample estimates of TLI are virtually indistinguishable from the patterns of CFI in large models. For correctly specified models, the population TLI is a constant of 1.00, regardless of the model size (the first column in Table 2). Under the condition of misspecified dimensionality, the population TLI tended to decrease as p increased (the second column in Table 2), whereas the population TLI tended to increase when three residual covariances were omitted (the third column in Table 2). Taking conditions with λ = .40 as an example, when one-factor models were fitted to two-factor data with ρ = .90, as p increased from 10 to 120, the population TLI slightly decreased from .995 to .978. When omitting residual correlations, a higher p was associated with a larger population TLI, ranging from .923 (p = 10) to .995 (p = 120).

Table 2.

Effect of p on TLI.

Factor loadings  N  p  |  Correctly specified: POP  EST  RB  |  Misspecified dimensionality: POP  EST  RB  |  Omitted residual correlations: POP  EST  RB
.4 200 10 1.000 0.994 −0.01 0.995 0.985 −0.01 0.923 0.865 −0.06
30 1.000 0.958 −0.04 0.990 0.944 −0.05 0.975 0.931 −0.04
60 1.000 0.867 −0.13 0.985 0.848 −0.14 0.989 0.858 −0.13
90 1.000 0.745 −0.26 0.981 0.722 −0.26 0.993 0.741 −0.25
120 1.000 0.605 −0.40 0.978 0.581 −0.40 0.995 0.603 −0.39
500 10 1.000 0.999 0.00 0.995 0.990 −0.01 0.923 0.871 −0.05
30 1.000 0.992 −0.01 0.990 0.980 −0.01 0.975 0.964 −0.01
60 1.000 0.978 −0.02 0.985 0.962 −0.02 0.989 0.967 −0.02
90 1.000 0.957 −0.04 0.981 0.936 −0.05 0.993 0.950 −0.04
120 1.000 0.927 −0.07 0.978 0.903 −0.08 0.995 0.922 −0.07
1,000 10 1.000 0.999 0.00 0.995 0.991 0.00 0.923 0.872 −0.05
30 1.000 0.998 0.00 0.990 0.986 0.00 0.975 0.969 −0.01
60 1.000 0.995 −0.01 0.985 0.979 −0.01 0.989 0.983 −0.01
90 1.000 0.990 −0.01 0.981 0.970 −0.01 0.993 0.982 −0.01
120 1.000 0.982 −0.02 0.978 0.959 −0.02 0.995 0.977 −0.02
.8 200 10 1.000 0.999 0.00 0.975 0.958 −0.02 0.952 0.918 −0.03
30 1.000 0.994 −0.01 0.954 0.941 −0.01 0.981 0.973 −0.01
60 1.000 0.979 −0.02 0.943 0.918 −0.03 0.991 0.969 −0.02
90 1.000 0.952 −0.05 0.937 0.888 −0.05 0.994 0.946 −0.05
120 1.000 0.911 −0.09 0.934 0.846 −0.09 0.995 0.907 −0.09
500 10 1.000 1.000 0.00 0.975 0.959 −0.02 0.952 0.919 −0.03
30 1.000 0.999 0.00 0.954 0.945 −0.01 0.981 0.977 0.00
60 1.000 0.997 0.00 0.943 0.935 −0.01 0.991 0.987 0.00
90 1.000 0.993 −0.01 0.937 0.927 −0.01 0.994 0.987 −0.01
120 1.000 0.988 −0.01 0.934 0.920 −0.01 0.995 0.983 −0.01
1,000 10 1.000 1.000 0.00 0.975 0.959 −0.02 0.952 0.920 −0.03
30 1.000 1.000 0.00 0.954 0.946 −0.01 0.981 0.978 0.00
60 1.000 0.999 0.00 0.943 0.938 −0.01 0.991 0.989 0.00
90 1.000 0.998 0.00 0.937 0.932 −0.01 0.994 0.992 0.00
120 1.000 0.997 0.00 0.934 0.929 −0.01 0.995 0.992 0.00

Note. TLI = the Tucker–Lewis index; factor loadings = (standardized) factor loadings; N = sample size; p = number of observed variables; RB = relative bias. POP indicates the population TLI; EST indicates the average sample estimates of TLI. For POP and EST, values less than 0.95 are underlined. |RB| larger than 0.10 (10%) are in boldface.

In finite samples, the population values of TLI tended to be underestimated, which was more pronounced with a smaller sample size, larger model size, and lower factor loading. As with CFI, ANOVA results showed that the important sources of the RB variance in estimating population TLI included the sample size (N, η2 = .20), model size (p, η2 = .12), magnitude of the factor loadings (λ, η2 = .08), and the two-way interactions among the above factors (i.e., N×p, η2 = .16; N×λ, η2 = .09; p×λ, η2 = .06). It appeared that a sample of size N≥ 500 may be required to obtain a reasonable estimate of the population TLI (with |RB| < 10%), regardless of the level of factor loading and model size considered in the current study (i.e., p≤ 120). It was also noted that even when the sample size was relatively large (e.g., N = 500), by applying the conventional cutoff, very large models (p≥ 90) with no specification error or minor specification errors could be rejected based on the sample TLI. For example, a correctly specified model with p = 120, λ = .40 and N = 500 would be rejected, if the fixed cutoff score were applied in the strictest sense, by yielding an average sample TLI of .927. When fitting a one-factor model to two-factor data with ρ = .90, for p = 90, λ = .40 and N = 500, the average sample TLI was .936, suggesting that the model with the population TLI of .981 may be rejected in the sample according to the .95 cutoff.

When the sample size was small (e.g., N = 200), the population TLI tended to be substantially underestimated (i.e., RBs were negative), especially if the model size was large (p ≥ 90) and the magnitude of the factor loadings was small (λ = .40). For example, when N = 200 and the misspecification was caused by collapsing two highly correlated factors (ρ = .90) into one, for λ = .40 the absolute values of RB increased from 1% (p = 10) to 40% (p = 120). Under the same conditions except with λ = .80, as p increased from 10 to 120, the absolute RB increased from 2% to 9%.

Root Mean Square Error of Approximation

Table 3 and Figure 3 show the effect of p on RMSEA. For correctly specified models, the population RMSEA is a constant of .00 independent of p and λ (the first column in Table 3). Under both types of specification errors (the second and third columns), the population RMSEA decreased as p increased. This effect of p was more pronounced at a higher λ. For example, under the condition of omitted residual correlations, the population RMSEA decreased from .049 to .004 when p increased from 10 to 120 (λ = .40). With a higher factor loading (λ = .80), the population RMSEA decreased from .120 (p = 10) to .009 (p = 120). It was noted that the population values for RMSEA were below the conventional cutoff value (i.e., RMSEA ≤ .06) across all conditions except for the two cases where the misspecified models had a low number of high-quality indicators (underlined values in Table 3). That is, when p = 10 and λ = .80, the population value of RMSEA was .078 under misspecified dimensionality, and the population RMSEA was .120 under the condition of omitted residual correlations.

ANOVA results showed that the important sources of the RB variance included sample size (N, η2 = .18), the model size (p, η2 = .16), the magnitude of factor loadings (λ, η2 = .08), the interaction between sample size and the model size (N×p, η2 = .13), and the interaction between sample size and magnitude of factor loadings (N×λ, η2 = .06).

Figure 3 shows that the sample estimates of RMSEA tended to be upwardly biased across all conditions. Figure 3 also makes it clear that as p increases, the difference between the population RMSEA and the sample average values becomes larger. For example, under correctly specified models (N = 200, λ = .40), the difference between the sample average RMSEA and the population RMSEA increased from .016 (p = 10) to .040 (p = 120). We also observed that the sample RMSEA could be noticeably different from their population value (with |RB| ≥ 10%, values in bold in Table 3) when p was high (e.g., p≥ 60), even when the sample size was reasonably large (e.g., N = 1,000). For example, when the model was misspecified by omitting the residual covariances (with p = 120, N = 1,000, and λ = .40), the average sample RMSEA = .008, almost twice the value of the corresponding population value (i.e., .004). However, as discussed earlier, when the population RMSEA values were near zero, the values of RB and the |RB| ≥ 10% criterion may not be an appropriate measure of “acceptability” of the average sample estimates. Therefore, in Table 3, we reported the AB for estimating the population RMSEA. The largest AB observed across all simulated conditions was .040 (i.e., correctly specified model, N = 200 and p = 120).

Discussion and Conclusion

This study explored the model size effect on three practical model fit indices. We found that the model size (p) had an important impact on the population values of CFI, TLI, and RMSEA. Specifically, under misspecified models, as p increased, the population RMSEA decreased regardless of the type of model misspecification. The findings regarding the effect of p on the population RMSEA are consistent with the conclusions drawn by Kenny and McCoach (2003) and Savalei (2012). By definition, the RMSEA penalizes model complexity by incorporating the degrees of freedom in its formulation, measuring the discrepancy due to approximation per degree of freedom. Therefore, for models with a close fit, the population RMSEA can decrease as p increases, because a higher p is typically associated with larger degrees of freedom.
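
To illustrate (a rough sketch based on the degrees-of-freedom expression in Note 1): a single-factor model with p indicators and the factor variance fixed to 1 estimates q = 2p parameters (p loadings and p residual variances), so

$$df_k = \frac{p(p+1)}{2} - 2p,$$

which grows roughly quadratically in p (df_k = 35 at p = 10, but df_k = 7,020 at p = 120). Whenever the approximation discrepancy F_k grows more slowly than df_k as indicators are added, RMSEA = √(F_k/df_k) must decrease.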

For both CFI and TLI, it was interesting to note that as p increased, whether the population CFI or TLI tended to slightly increase or decrease depended on the type of specification error. Specifically, as p increased, the population values of the CFI and TLI decreased in the presence of misspecified dimensionality. However, when the models were misspecified by omitting the correlated residuals, both population CFI and TLI increased in larger models. We believe that essentially the same explanation provided in Kenny and McCoach (2003, p. 347) can apply to our observations: The specified single-factor model predicts that all covariances take on the same value. Thus, the degree of misfit can depend on the degree to which the covariances in the true population covariance matrix differ from the mean covariances in the model-implied covariance matrix. As p increased, the variability of the covariances in the true population covariance matrix increased under the condition of misspecified dimensionality. However, under the condition of omitted residual covariances, the variability declined with p because the number of omitted residual covariances was fixed to three regardless of the model size. As such, the effect of p on the population CFI and TLI depended on the type of specification error.

On the other hand, the population RMSEA tends to decrease as p increases, notwithstanding the type of model misspecification. Considering that our study worked with models involving a relatively small degree of specification error, the population RMSEA appeared to behave desirably by producing lower values when p ≥ 30. However, researchers may need to be cautious when interpreting a large RMSEA for small models with high-quality indicators (i.e., p = 10 and λ = .80), because even the population RMSEA was above the conventional cutoff value in this situation. For example, when three minor residual correlations were omitted and the factor loadings were .80, the population RMSEA ranged from .009 (p = 120) to .120 (p = 10). Applying the commonly used cutoff scores, such models could be judged to have either an excellent fit (i.e., RMSEA ≤ .01) or a poor fit (i.e., RMSEA ≥ .10). In fact, researchers have recognized the sensitivity of the population RMSEA to the degrees of freedom and have argued that RMSEA should not even be computed for models with low degrees of freedom (Kenny, Kaniskan, & McCoach, 2015).

Our study also provides a better understanding of the behaviors of CFI, TLI, and RMSEA in samples. In small samples, compared with their population values, the sample RMSEA tended to be upwardly biased, and the sample CFI and TLI were downwardly biased. Therefore, when N was low, on average, the sample estimates for all three fit indices tended to suggest a worse fit than their population values, notwithstanding the type of model (mis)specification. For all three fit indices, the differences between the population values and the average sample estimates increased as p increased; the differences also became more pronounced when the standardized factor loadings were low. The pattern for the sample RMSEA that we observed was partially different from the pattern in Kenny and McCoach (2003), who found that in correctly specified models the sample RMSEA could improve as the number of variables increased. We argue that one likely reason for the reversed effect of p in Kenny and McCoach's (2003) study is that the range of p manipulated in their simulation was relatively narrow (from 4 to 25).

Moreover, when fitting large SEM models (e.g., p ≥ 30) to small samples (e.g., N ≤ 200), disagreement between the sample CFI/TLI and RMSEA is likely to be observed. Specifically, the sample CFI and TLI could be substantially downwardly biased, even when the models were correctly specified. This was especially true when the quality of measurement was poor (i.e., when the standardized factor loadings were low). For example, depending on the number of observed variables, correctly specified models (λ = .40, N = 200) could produce an average CFI ranging from .611 to .972. It appears that a sample size of N ≥ 500 may be required to obtain relatively accurate estimates (with |RB| < 10%) of both CFI and TLI in large models. It was also noted that when fitting very large models (p ≥ 90) with good quality of measurement (λ = .80) to a sample of small to medium size (i.e., less than 1,000), the sample CFI/TLI may reject a model that is known to have a close fit in the population if the fixed cutoff scores are applied in the strictest sense. Under such conditions (e.g., p ≥ 90 and λ = .40), it appears that a sample size of N ≥ 1,000 may be required to safely interpret CFI and TLI.

In small samples, the average sample RMSEA tends to be upwardly biased, and the bias increases as p increases (indicating a larger difference between the population RMSEA and the average sample estimates). Additionally, when the number of observed variables is high (p > 30), the sample RMSEA could be noticeably overestimated (with an RB ≥ 10%), even when the sample size is 1,000. As with the sample CFI and TLI, the sample RMSEA was sensitive to model size. Nevertheless, the average sample estimates for RMSEA were below the conventional cutoff value (i.e., RMSEA ≤ .06) under nearly all conditions examined in our study except for the three conditions where p = 10 and λ = .80 (see the underlined values in Table 3). As noted earlier, the |RB| ≥ 10% criterion may not be informative from a practical viewpoint when the population parameter is zero or near zero.

Methodologists have shown that for a given level of model misspecification, poorer measurement quality is associated with better apparent model fit (i.e., the reliability paradox; see Hancock & Mueller, 2011). This phenomenon has been derived mathematically or demonstrated at the population level (Hancock & Mueller, 2011; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011) and has also been demonstrated for sample estimates in simulation studies (McNeish, An, & Hancock, 2018). Our findings likewise suggest that the reliability paradox may have operated for both the population RMSEA values and their sample estimates across all conditions of sample size (N) and model size (p). For CFI and TLI, however, our findings show that the effect of measurement quality on model fit evaluations can depend on factors such as sample size (N) and model size (p), so that the reliability paradox disappears under certain conditions. Specifically, sample estimates of CFI and TLI, on average, tended to indicate worse fit under the condition of poorer measurement quality (i.e., λ = .40) when a large model was fit to a sample of small to medium size (N = 200, 500).

The findings in the current study are based on the assumption that the observed data are multivariate normally distributed. In many applications, this assumption is likely to be violated (e.g., when ordered categorical data are analyzed), in which case chi-square test statistics with robust corrections are commonly used. Previous studies have shown that robust chi-square test statistics can also be influenced by the number of observed variables in the fitted model (Shi, DiStefano, McDaniel, & Jiang, 2018; Yuan, Yang, & Jiang, 2017). For future studies, it would be interesting to explore the model size effect on practical model fit indices in the presence of nonnormal data (DiStefano, Liu, Jiang, & Shi, 2018; DiStefano, McDaniel, Zhang, Shi, & Jiang, 2018; Maydeu-Olivares, Shi, & Rosseel, 2018). In addition, we included only two specific types and minor levels of model misspecification. Additional types of misspecification (e.g., omitted cross-loadings) and levels of misspecification (e.g., severely misspecified models) should be investigated in future studies, as little is known about the effect of model size on fit indices in such situations.

In summary, our findings support the idea that fit indices depend not only on the degree of model misfit but also on the context of the model, such as the number of observed variables (p). On one hand, given the same level of model misspecification (e.g., fitting a one-factor model to two-factor data), the population values of the fit indices can be heavily affected by the model size. On the other hand, in small samples (N < 500), as p increases, the sample estimates of the fit indices, mainly CFI and TLI, are likely to be biased and to suggest a far worse fit than their population values. In this sense, there are no “golden rules.” In empirical studies, researchers should consider the number of observed variables when using practical fit indices to assess model fit. That said, we can offer a few cautionary remarks for researchers evaluating models with no or only minor specification errors.

  1. Regardless of the sample size, researchers should be cautious in interpreting RMSEA for small models (p≤ 10), especially when the factor loadings are large (e.g., λ = 0.80). Closer attention should also be paid when interpreting CFI/TLI for either small models (p≤ 10) or very large models (p≥ 90) with good quality of measurement (e.g., λ = 0.80).

  2. A sample of N = 200 observations only provides a reasonable estimate for CFI and TLI when p≤ 30. A sample of size N≥ 500 is generally required to safely use sample CFI and TLI in large models (p≥ 60).

We hope that the results from the current study are informative to applied researchers who work with imperfect models of various sizes.

1.

The size of an SEM model has been indicated by different indices, including the number of observed variables (p), the number of parameters to be estimated (q), the degrees of freedom (df = p(p + 1)/2 − q), and the ratio of observed variables to latent factors (p/f). Recent studies have suggested that the number of observed variables (p) is the most important determinant of model size effects (Moshagen, 2012; Shi, Lee, et al., 2015, 2017). Therefore, in the current study, we define large models as SEM models with many observed indicators.

2.

When the models are correctly specified, the population CFI, TLI, and RMSEA are constant because F_k is zero.

Authors’ Note: Alberto Maydeu-Olivares is also affiliated with the University of Barcelona, Barcelona, Catalonia, Spain.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIP) (No. 2017R1C1B2012424). This research was also supported by the National Science Foundation under Grant No. SES-1659936.

References

  1. Anderson J. C., Gerbing D. W. (1984). The effects of sampling error on convergence, improper solutions, and goodness of fit indices for maximum likelihood confirmatory factor analysis. Psychometrika, 49, 155-173.
  2. Bentler P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.
  3. Bentler P. M., Bonett D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
  4. Box G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American Statistical Association, 74(365), 1-4.
  5. Breivik E., Olsson U. H. (2001). Adding variables to improve fit: The effect of model size on fit assessment in LISREL. In Cudeck R., Du Toit S., Sorbom D. (Eds.), Structural equation modeling: Present and future (pp. 169-194). Lincolnwood, IL: Scientific Software International.
  6. Browne M. W., Cudeck R. (1993). Alternative ways of assessing model fit. In Bollen K. A., Long J. S. (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
  7. Costa P., McCrae R. (1992). Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological Assessment, 4(1), 5-13.
  8. Ding L., Velicer W. F., Harlow L. L. (1995). The effects of estimation methods, number of indicators per factor and improper solutions on structural equation modeling fit indices. Structural Equation Modeling: A Multidisciplinary Journal, 2, 119-144.
  9. DiStefano C., Liu J., Jiang N., Shi D. (2018). Examination of the weighted root mean square residual: Evidence for trustworthiness. Structural Equation Modeling: A Multidisciplinary Journal, 25, 453-466.
  10. DiStefano C., McDaniel H., Zhang Y., Shi D., Jiang Z. (2017). Fitting large factor analysis models with ordinal data. Manuscript submitted for publication.
  11. Hancock G. R., Mueller R. O. (2010). The reviewer's guide to quantitative methods in the social sciences. New York, NY: Routledge.
  12. Hancock G. R., Mueller R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71, 306-324.
  13. Heene M., Hilbert S., Draxler C., Ziegler M., Bühner M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: A cautionary note on the usefulness of cutoff values of fit indices. Psychological Methods, 16, 319-336.
  14. Herzog W., Boomsma A., Reinecke S. (2007). The model-size effect on traditional and modified tests of covariance structures. Structural Equation Modeling: A Multidisciplinary Journal, 14, 361-390.
  15. Hu L., Bentler P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1-55.
  16. Jackson D. L., Gillaspy J. A., Purc-Stephenson R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14, 6-23.
  17. Jöreskog K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
  18. Kenny D. A., Kaniskan B., McCoach D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research, 44, 486-507.
  19. Kenny D. A., McCoach D. B. (2003). Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 10, 333-351.
  20. Ling G. (2012). Why the major field test in business does not report subscores: Reliability and construct validity evidence. ETS Research Report Series. Retrieved from https://www.ets.org/Media/Research/pdf/RR-12-11.pdf
  21. Lord F., Novick M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
  22. MacCallum R. C. (2003). 2001 Presidential address: Working with imperfect models. Multivariate Behavioral Research, 38, 113-139.
  23. Marsh H. W., Hau K. T., Balla J. R., Grayson D. (1998). Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral Research, 33, 181-220.
  24. Maydeu-Olivares A. (2017). Maximum likelihood estimation of structural equation models for continuous data: Standard errors and goodness of fit. Structural Equation Modeling: A Multidisciplinary Journal, 24, 383-394.
  25. Maydeu-Olivares A., Shi D., Rosseel Y. (2018). Assessing fit in structural equation models: A Monte-Carlo evaluation of RMSEA versus SRMR confidence intervals and tests of close fit. Structural Equation Modeling: A Multidisciplinary Journal, 25, 389-402.
  26. Muthén B., Kaplan D., Hollis M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431-462.
  27. Muthén L., Muthén B. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9, 599-620.
  28. McDonald R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Lawrence Erlbaum.
  29. McDonald R. P., Ho M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7, 64-82.
  30. McNeish D., An J., Hancock G. R. (2018). The thorny relation between measurement quality and fit index cutoffs in latent variable models. Journal of Personality Assessment, 100, 43-52.
  31. Moshagen M. (2012). The model size effect in SEM: Inflated goodness-of-fit statistics are due to the size of the covariance matrix. Structural Equation Modeling: A Multidisciplinary Journal, 19, 86-98.
  32. Pornprasertmanit S., Miller P., Schoemann A. M. (2012). R package simsem: SIMulated structural equation modeling. Retrieved from http://cran.r-project.org
  33. R Development Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
  34. Rosseel Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36.
  35. Saris W. E., Satorra A., van der Veld W. M. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling: A Multidisciplinary Journal, 16, 561-582.
  36. Savalei V. (2012). The relationship between root mean square error of approximation and model misspecification in confirmatory factor analysis models. Educational and Psychological Measurement, 72, 910-932.
  37. Shi D., DiStefano C., McDaniel H. L., Jiang Z. (2018). Examining chi-square test statistics under conditions of large model size and ordinal data. Structural Equation Modeling: A Multidisciplinary Journal, 25, 924-945.
  38. Shi D., Lee T., Terry R. A. (2015). Abstract: Revisiting the model size effect in structural equation modeling (SEM). Multivariate Behavioral Research, 50, 142.
  39. Shi D., Lee T., Terry R. A. (2018). Revisiting the model size effect in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25, 21-40.
  40. Shi D., Maydeu-Olivares A., DiStefano C. (2018). The relationship between the standardized root mean square residual and model misspecification in factor analysis models. Multivariate Behavioral Research. Advance online publication.
  41. Shi D., Song H., Lewis M. D. (2017). The impact of partial factorial invariance on cross-group comparisons. Assessment. Advance online publication.
  42. Steiger J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: Systat.
  43. Steiger J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173-180.
  44. Steiger J. H., Lind J. C. (1980, May). Statistically based tests for the number of common factors. Paper presented at the Annual Spring Meeting of the Psychometric Society, Iowa City, IA.
  45. Tucker L. R., Lewis C. (1973). The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
  46. West S. G., Taylor A. B., Wu W. (2012). Model fit and model selection in structural equation modeling. In Hoyle R. H. (Ed.), Handbook of structural equation modeling (pp. 209-231). New York, NY: Guilford Press.
  47. Xie J., Bentler P. M. (2003). Covariance structure models for gene expression microarray data. Structural Equation Modeling: A Multidisciplinary Journal, 10, 566-582.
  48. Yuan K. H., Tian Y., Yanagihara H. (2015). Empirical correction to the likelihood ratio statistic for structural equation modeling with many variables. Psychometrika, 80, 379-405.
  49. Yuan K. H., Yang M., Jiang G. (2017). Empirically corrected rescaled statistics for SEM with small N and large p. Multivariate Behavioral Research, 52, 673-698.
