Abstract
Conventional approaches for selecting a reference indicator (RI) can lead to misleading results in testing for measurement invariance (MI). Several newer quantitative methods have become available for more rigorous RI selection. However, it remains unknown how well these methods perform in correctly identifying a truly invariant item as the RI. Study 1 was designed to address this issue under various conditions using simulated data. As a follow-up, Study 2 further investigated the advantages and disadvantages of RI-based approaches to MI testing in comparison with non-RI-based approaches. Together, the two studies provide a thorough examination of how the choice of RI matters in MI tests. In addition, a large sample of real-world data was used to empirically compare the RI selection methods as well as the RI-based and non-RI-based approaches to MI testing. We close with a discussion of all these methods, followed by suggestions and recommendations for applied researchers.
Keywords: reference indicator, factorial invariance, multiple-group CFA, measurement invariance
Factorial invariance tests serve as an important tool for establishing measurement invariance (MI) across groups, particularly when scores from self-report measures are being compared (Horn & McArdle, 1992; Meredith, 1993; Shi, Song, & Lewis, 2017). These tests help examine the degree to which observed differences reflect differences in the underlying, unobserved latent constructs across groups. Important questions can be addressed with this technique: for instance, does a mean difference in a measure of depression between males and females entirely reflect a gender difference in the latent depression trait? Or is the observed difference contaminated by differences in the psychometric properties of the measure across gender groups?
In fact, if a measure behaves differently across groups due to differences in social norms, cultural norms, or response tendencies, any comparison of the observed composites of this measure (such as t-tests or ANOVAs) will likely lead to ambiguous conclusions. Research has shown that departures from measurement equivalence weaken the accuracy of selection based on composite scores (Millsap & Kwok, 2004), and cross-group differences in composite scores could reflect differences in the psychometric properties of the measure in use (Steinmetz, 2011). Without testing for MI, one cannot be certain whether observed differences across groups truly reflect differences in the underlying latent constructs. Establishing MI has been increasingly recognized as a prerequisite for examining mean differences across groups or mean changes over time.
Factorial invariance tests, testing for MI in the framework of structural equation modeling (SEM), are conducted using techniques of multiple-group confirmatory factor analysis (CFA; Byrne et al., 1989; Horn et al., 1983; Jöreskog, 1971; Meredith, 1993; Millsap, 2012; Steenkamp & Baumgartner, 1998; Widaman & Reise, 1997; also, see Vandenberg & Lance, 2000 for a review). The tests typically begin with fitting a baseline model, where the configuration of the factorial structure is set to be identical across groups. To identify this model, a commonly used method is to constrain the factor loading (and intercept) of one particular item to be equal across groups. Such an item is referred to as the reference indicator (RI). All other parameters are then estimated in reference to the metric of this item. If this baseline model is tenable, a series of multiple-group CFA models is then fitted by imposing an increasing number of equality constraints that correspond to different levels of invariance. For example, weak factorial invariance assumes all factor loadings are numerically equivalent across groups, whereas strong factorial invariance assumes all factor loadings and intercepts are equal across groups (e.g., Widaman & Reise, 1997).
In practice, an RI is conventionally chosen either as a random item or as the item with the largest standardized factor loading. Either practice creates a dilemma in testing for factorial invariance. As Rensvold and Cheung (1998, p. 1022) pointed out, “The reason one wishes to estimate the constrained model in the first place is to test for factorial invariance, yet the procedure requires an a priori assumption of invariance with respect to the referents.” Whether the selected RI is truly invariant is critical for detecting invariance or noninvariance of the other items. Research has shown that when an inappropriate item is chosen as the RI, severe Type I or Type II errors are expected in testing factorial invariance; that is, truly invariant items can be erroneously flagged as noninvariant and vice versa (Johnson et al., 2009; Yoon & Millsap, 2007). Recent research has also shown that sizable differences in certain parameters can be missed when a reliable but noninvariant item is mistakenly used as the RI (Raykov et al., 2019). It has become evident that the conventional approach to RI selection can be problematic in testing for measurement invariance.
A possible solution to this issue is either (a) using more rigorous methods to select RIs instead of the conventional approach or (b) bypassing the use of an RI altogether in MI testing. Regarding rigorous RI selection, a few quantitative methods have been proposed, all involving a set of statistical procedures to identify the best possible invariant indicator as the RI. Some originated from item response theory, and some are SEM-based approaches. Unlike the conventional approaches, these quantitative methods make the a priori assumption of invariance with respect to the referent tenable. However, it remains unknown how well these methods perform relative to each other in identifying an invariant RI. Thus, the primary goal of this study was to compare three well-developed methods for RI selection through a comprehensive simulation study (Study 1), aiming to identify the optimal method for this purpose.
Alternatively, several other approaches are available that do not require using one specific item as an RI for MI testing (e.g., Kim & Yoon, 2011; Raykov et al., 2013; Stark et al., 2006; Yoon & Millsap, 2007). Given the availability of these non-RI-based methods, one may wonder what the benefit would be of using the RI-based methods, in which an RI is first identified using the aforementioned quantitative techniques and MI is then tested based on the chosen RI. Do these two approaches both perform well in testing MI, or does one outperform the other? The second goal of this article was to address these questions. Study 2 evaluated the performance of the RI-based approach in comparison with the non-RI-based approach in terms of the outcome of MI testing; that is, how well does each correctly identify invariant and noninvariant parameters across groups?
Methods of RI Selection
Two major categories of statistical approaches have been proposed to aid RI selection. One is all-others-as-anchors (AOAA), and the other is Bayesian SEM (BSEM). The AOAA approach originated from item response theory (IRT) and has been widely used to identify RIs when the invariance status of all items is initially unknown. AOAA begins with fitting a baseline model in which all parameters are constrained to be equal across groups. Each item then alternately serves as the target item: its parameters are freely estimated while the others remain constrained to equality. A likelihood ratio (LR) test is used to compare the fit of the two nested models; the LR statistic is approximately χ2 distributed, with degrees of freedom equal to the difference in the number of free parameters. A significant test indicates the presence of cross-group differences for the target item.
The AOAA approach subsumes two methods with different criteria for RI selection. The first, labeled MaxL in this study, chooses as the RI the item that produces a nonsignificant LR statistic and, among such items, has the largest factor loading (Lopez Rivas et al., 2009; Stark et al., 2006). This method has been recommended for its high power to detect item differences while maintaining the nominal Type I error rate (Meade & Wright, 2012). It can also outperform the BSEM approach in detecting item differences when large differences exist in factor loadings (Shi, Song, Liao, et al., 2017). However, there are methodological concerns. Woods (2009) noted that the magnitude of a factor loading does not ensure item equivalence under the MaxL approach. For instance, when items A and B both produce nonsignificant LR statistics, item A could be chosen as the RI because its factor loading is the largest, even though item B is the one that actually functions the same across groups. In this case, MaxL would fail to choose the correct RI.
The second method, labeled Minχ2 in this study, selects as the RI the item that produces the smallest LR statistic among all items (Woods, 2009). The idea behind this approach is that the magnitude of the LR statistic reflects the degree of difference in item functioning: the smaller the LR statistic, the smaller the item difference. This approach differs from MaxL in that it does not require the smallest LR statistic to be nonsignificant. Woods (2009) showed that Minχ2 performed well under a variety of data conditions in identifying truly invariant items, with power rates of 90% and above.
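The two AOAA selection rules can be sketched as follows. This is an illustrative Python sketch, not code from the study: it assumes the per-item LR statistics and factor loadings have already been obtained from the constrained/relaxed model comparisons, and the item names and values are hypothetical.

```python
# Hypothetical per-item results from the AOAA step. For each target item,
# "lr" is the likelihood-ratio statistic from comparing the fully
# constrained model against the model freeing that item (df = 2 here:
# loading and intercept freed), and "loading" is the item's estimated
# factor loading. All values are illustrative.
items = {
    "item1": {"lr": 1.2, "loading": 0.81},
    "item2": {"lr": 9.7, "loading": 0.85},
    "item3": {"lr": 0.4, "loading": 0.78},
}

CHI2_CRIT_DF2 = 5.991  # chi-square critical value, df = 2, alpha = .05

def select_ri_maxl(items, crit=CHI2_CRIT_DF2):
    """MaxL: among items whose LR test is nonsignificant, choose the one
    with the largest factor loading."""
    candidates = [j for j, v in items.items() if v["lr"] < crit]
    if not candidates:
        return None  # no item passes the invariance screen
    return max(candidates, key=lambda j: items[j]["loading"])

def select_ri_minchi2(items):
    """Minchi2: choose the item with the smallest LR statistic,
    regardless of its significance."""
    return min(items, key=lambda j: items[j]["lr"])

print(select_ri_maxl(items))     # item1: largest loading among nonsignificant items
print(select_ri_minchi2(items))  # item3: smallest LR overall
```

Note that the two rules can disagree, as they do here: MaxL screens out item2 (significant LR) and then prefers item1 for its loading, while Minχ2 picks item3 outright.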
The Bayesian SEM approach is a newer application of Bayesian methods in testing for factorial invariance (Shi, Song, Liao, et al., 2017; Shi, Song, et al., 2018). It introduces a difference parameter D to represent a parameter difference across groups: Dloading indexes factor loading differences and Dintercept indexes intercept differences. A selection index for the jth item can then be defined as the sum of the standardized difference measures of Dloading and Dintercept for this item:

Δj = |Dloading,j / SD(Dloading,j)| + |Dintercept,j / SD(Dintercept,j)|, (1)

where Dloading,j and Dintercept,j are the respective estimates of the differences in factor loadings and intercepts, and SD(Dloading,j) and SD(Dintercept,j) represent the standard deviations of those differences.
The BSEM approach imposes informative priors with zero mean and small variance on Dloading and Dintercept, which is referred to as “approximate identification constraints” (Muthén & Asparouhov, 2012). This ensures that latent factors are properly scaled and, more importantly, makes Dloading and Dintercept estimable. Once Dloading and Dintercept are estimated for item j, one can compute the selection index Δj and evaluate its posterior distribution. The item that produces the smallest posterior mean of Δj is considered to have the largest likelihood of being invariant across groups. This method yielded high power when searching for the RI under various simulation conditions (Shi, Song, Liao, et al., 2017). Power increased with fewer noninvariant items, larger magnitudes of difference, and larger sample sizes; it can be well above .90 when only 20% of items function differently across groups. That research also showed that the choice of small prior variances did not significantly affect the power of RI selection.
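The selection index in Equation 1 can be illustrated with a small Python sketch. The posterior draws below are simulated rather than taken from an actual BSEM fit, so the item names, means, and spreads are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "posterior draws" (e.g., retained MCMC iterations) for the
# cross-group difference parameters Dloading and Dintercept of three
# items. item1 is truly invariant (both differences centered at 0);
# item2 has a loading difference and item3 an intercept difference.
draws = {
    "item1": (rng.normal(0.00, 0.02, 5000), rng.normal(0.00, 0.03, 5000)),
    "item2": (rng.normal(0.15, 0.02, 5000), rng.normal(0.00, 0.03, 5000)),
    "item3": (rng.normal(0.00, 0.02, 5000), rng.normal(0.20, 0.03, 5000)),
}

def selection_index(d_loading, d_intercept):
    """Delta_j of Equation 1: sum of absolute standardized differences
    in the loading and the intercept, computed per posterior draw."""
    return (np.abs(d_loading / d_loading.std()) +
            np.abs(d_intercept / d_intercept.std()))

# Posterior mean of Delta_j for each item; the smallest flags the item
# most likely to be invariant across groups, i.e., the chosen RI.
delta_means = {j: selection_index(dl, di).mean()
               for j, (dl, di) in draws.items()}
ri = min(delta_means, key=delta_means.get)
print(ri)  # item1: both of its difference parameters are centered at zero
```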
Non-RI-Based Approach for MI Testing
We focused on the non-RI-based approach proposed by Raykov et al. (2013), partly following the reviewers’ suggestion. This approach first constrains all parameters to be equal in a baseline model and then freely estimates the parameters of one item at a time in a relaxed model. A chi-square difference test is conducted to evaluate the difference in model fit between the baseline model and each relaxed model. The resulting p values are then ranked in ascending order. A critical value l is computed for each p value using the Benjamini–Hochberg procedure (Benjamini & Hochberg, 1995; Wasserman, 2004):

lj = jα/k, (2)

where j is the rank of each tested parameter, α is the prechosen significance level for the chi-square tests, and k is the total number of tested parameters. Among the p values that fall at or below their corresponding l values, the largest one is chosen as the threshold. Finally, the parameters whose p values fall at or below this threshold are concluded to be noninvariant.
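A minimal Python sketch of this Benjamini–Hochberg step, assuming the per-parameter chi-square p values are already in hand (the parameter names and p values are hypothetical):

```python
def bh_noninvariant(p_values, alpha=0.05):
    """Flag noninvariant parameters via the Benjamini-Hochberg step
    described above (Equation 2).

    p_values maps each tested parameter to the p value of its
    chi-square difference test; returns the set of parameters whose
    p values fall at or below the BH threshold."""
    k = len(p_values)
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])  # ascending p
    threshold = 0.0
    for j, (_, p) in enumerate(ranked, start=1):
        if p <= j * alpha / k:   # compare p_(j) with l_j = j * alpha / k
            threshold = p        # keep the largest p meeting its l_j
    return {name for name, p in p_values.items() if p <= threshold}

# Hypothetical example: four tested parameters and their p values.
flags = bh_noninvariant({"a": .001, "b": .002, "c": .30, "d": .04})
print(sorted(flags))  # ['a', 'b']: only these fall at or below the threshold
```

Here "d" (p = .04) is not flagged even though it would pass an unadjusted .05 cutoff, because its rank-specific critical value l3 = 3(.05)/4 = .0375 is smaller.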
Direction Effect and RI Selection
In previous research on RI selection, a two-group CFA model was typically used as the population model in data simulation. One group served as a reference group, where factor means and variances were set to known values, and the other served as a focal group, where factor means and variances were freely estimated. A uniform direction of parameter differences was often simulated for simplicity. While factor loadings were simulated to be identical across groups for truly invariant items, they were set to be smaller in the focal group than in the reference group for items functioning differently (e.g., Meade & Wright, 2012; Shi, Song, Liao, et al., 2017; Stark et al., 2006; Woods, 2009). For instance, if the factor loadings were set to .8, .8, .8, and .8 for all four items in the reference group, they were set to .8, .6, .6, and .8 in the focal group. As a result, the truly invariant items (Items 1 and 4 in the example) happened to have larger factor loadings than the noninvariant items (Items 2 and 3). RI selection methods favoring high loadings, such as MaxL, would therefore have high power to select truly invariant items. However, such high power could merely be an artifact of simulating data with a uniform direction.
What if the direction of parameter differences is reversed? If the factor loadings are set to .6, .6, .6, and .6 for all four items in the reference group and to .6, .8, .8, and .6 in the focal group, methods like MaxL are likely to choose either Item 2 or Item 3 as the RI. In this case, the power of correctly selecting an invariant item as the RI would be low. It is therefore critical to consider the direction of parameter differences when generating data and evaluating the power of RI selection methods.
Unlike previous studies, we differentiated three directions of parameter differences in our simulation design. Positive direction refers to the case where parameter values are larger in the focal group than in the reference group; negative direction refers to the case where they are smaller in the focal group. The third, mixed direction, refers to the case where some parameters are larger and others smaller in the focal group than in the reference group. If the power of RI selection is significantly influenced by the direction of parameter differences, a direction effect is said to occur. We anticipated such an effect in our simulation study, particularly for MaxL, for the reasons given above.
In what follows, we first present Study 1, where the performance of MaxL, Minχ2, and BSEM in RI selection was comprehensively compared using simulated data. We then present Study 2, which evaluated the benefit of the RI-based approach to MI testing in comparison with the non-RI-based approach. Next, a large set of real-world data is used to empirically demonstrate the three RI selection methods as well as the RI-based and non-RI-based approaches to MI testing. We end with a discussion of the advantages and disadvantages of all these methods, followed by suggestions and recommendations for applied researchers.
Simulation Study 1: RI Selection Using MaxL, Minχ2, and BSEM
We used Mplus 7.0 for data generation and RI selection across all simulation conditions. The results of RI selection were summarized and evaluated using SAS 9.4. No cases of nonconvergence were observed for these analyses.
Data Conditions
The population model was a two-group CFA model with 10 items loading on a single factor. One group served as the reference group and the other as the focal group. The variables manipulated in the data simulation are described below.
Sample Size
Continuous data were generated with n = 100, 200, 500 per group, representing small, medium, and large samples in typical psychological research. Both groups were simulated to have equal sizes in all conditions (e.g., Shi, Song, & Lewis, 2017; Shi, Song, Liao, et al., 2017).
Location of Difference
Item differences were simulated to occur on either factor loadings or intercepts, never on both at the same time (e.g., Shi, Song, Liao, et al., 2017).
Percentage of Noninvariant Items
Consistent with previous simulation research (e.g., French & Finch, 2008; Meade & Wright, 2012), we simulated data with either 20% or 40% of noninvariant items in this investigation. This corresponded to the cases where either two or four items (out of 10 items) function differently across the two groups.
Magnitude of Difference
The magnitude of cross-group differences was set to 0.2 and 0.4 for factor loadings, and 0.3 and 0.6 for intercepts. The former values for the parameter differences were considered to be small, and the latter values were considered to be relatively large (e.g., Kim et al., 2012; Kim & Yoon, 2011; Meade & Lautenschlager, 2004; Shi, Song, & Lewis, 2017).
Direction of Cross-Group Difference
Three directions were manipulated for factor loadings and intercepts, including positive, negative, and mixed directions.
In total, 72 data conditions were generated by fully crossing three sample sizes, two locations of difference, two percentages of noninvariant items, two magnitudes of difference in parameters, and three directions of differences. Each condition had 500 replications.
Data Simulation
The factor mean and variance were set, respectively, to 0 and 1 in the reference group. The raw factor loadings, intercepts, and unique variance were set to .8, 0, and .36, respectively, for all items. In focal groups, factor mean and variance were set to .5 and 1.2, respectively, and uniqueness was set to .36 for all items. All factor loadings and intercepts in focal groups were generated to be equal to those in reference groups, except for the items that were manipulated to be different under certain conditions.
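The population model above can be sketched in Python as follows. This is an illustrative data generator, not the Mplus setup used in the study; the focal-group loadings shown assume a condition with two noninvariant items and a .2 difference in the positive direction.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_group(n, loadings, intercepts, factor_mean, factor_var,
                   unique_var=0.36):
    """Generate continuous responses from a one-factor model:
    x_ij = intercept_j + loading_j * eta_i + e_ij."""
    loadings = np.asarray(loadings, float)
    intercepts = np.asarray(intercepts, float)
    eta = rng.normal(factor_mean, np.sqrt(factor_var), size=n)
    errors = rng.normal(0.0, np.sqrt(unique_var), size=(n, len(loadings)))
    return intercepts + np.outer(eta, loadings) + errors

# Reference group: all loadings .8, intercepts 0, factor ~ N(0, 1).
ref = simulate_group(200, [0.8] * 10, [0.0] * 10, 0.0, 1.0)

# Focal group: factor mean .5, variance 1.2; two loadings raised by .2
# (positive direction), mirroring one cell of the design described above.
focal_loadings = [0.8] * 10
focal_loadings[1] = focal_loadings[2] = 1.0
focal = simulate_group(200, focal_loadings, [0.0] * 10, 0.5, 1.2)
```

Each simulation condition would vary the sample size, the location, percentage, magnitude, and direction of the differences in the same way.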
Data Analysis
Three methods were used to analyze the simulated data, including MaxL, Minχ2, and BSEM. In all analyses, the factor mean and variance were fixed to be 0 and 1, respectively, in the reference groups. All the other parameters were freely estimated except for those required to be constrained by the procedures.
In using the MaxL method, the baseline model constrained all items to be equal across the focal and reference groups. The equality constraints were then relaxed one item at a time, yielding a reduced model. The between-group differences in the target item were examined using the likelihood ratio test. This procedure was repeated for all items in the model. Eventually, the reference indicator was chosen as the item that produced a nonsignificant LR statistic and also had the largest factor loading. When the Minχ2 approach was used, the significance of the LR statistic was not a concern; instead, the LR statistics were rank ordered for all items, and the reference indicator was chosen as the item yielding the smallest LR.1
When BSEM was used, the difference parameter D was computed for each factor loading (Dloading) and each intercept (Dintercept) across groups. After imposing normal priors with zero mean and a small variance of 0.001 on D, Markov chain Monte Carlo (MCMC) simulations were run for a minimum of 50,000 and a maximum of 100,000 iterations. The estimates at every 10th iteration were retained to form posterior distributions for factor loadings and intercepts. The means and standard deviations of these posterior distributions were then computed. A selection index Δj was then computed for each item, summarizing the standardized differences in both the factor loading and the intercept. The item with the smallest value of Δj was selected as the reference indicator.
Results of Study 1
We used power rates to evaluate the performance of each method. The power rate was calculated as the percentage of replications that correctly selected a truly invariant item as the RI under each condition. In addition, ANOVAs were performed on the power rates to test the effects of all six variables.
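As a small illustration of this power-rate computation (the replication results and item names below are hypothetical):

```python
# Items simulated to be truly invariant in a given condition, and the RI
# selected by a method in each of several replications (illustrative;
# the study used 500 replications per condition).
truly_invariant = {"item1", "item4", "item5"}
selected = ["item1", "item4", "item2", "item1", "item5"]

# Power rate: proportion of replications selecting a truly invariant item.
power = sum(ri in truly_invariant for ri in selected) / len(selected)
print(power)  # 0.8: four of the five selections are correct
```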
The power rates under all conditions are summarized in Table 1. An ANOVA was performed on these power rates to test the main effects of each of the six variables. The effect of method was significant (Table 2; F(2, 206) = 25.507, p < .001, ηp2 = .199), with Minχ2 and BSEM performing better than MaxL (ps < .001). Figures 1 and 2 also show that under multiple conditions MaxL produced low power rates, some even lower than the power rate of selecting a random item as the RI. This occurred in 50% of the conditions (12 of 24 in Table 1) when the direction of parameter differences was positive. However, this was not the case for Minχ2 and BSEM: neither method produced lower-than-random power rates.
Table 1.
Power Rates of Selecting a Correct Reference Indicator in Study 1.
Positive | Negative | Mixed | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LO | PE | MA | SS | AR | MaxL | Minχ2 | BSEM | MaxL | Minχ2 | BSEM | MaxL | Minχ2 | BSEM
Factor loading | 20% | .2 | 100 | .80 | .19 | .95 | .95 | 1.00 | .96 | .95 | .65 | .99 | .98 |
200 | .80 | .44 | .99 | .98 | 1.00 | .99 | 1.00 | .88 | 1.00 | 1.00 | |||
500 | .80 | .95 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
.4 | 100 | .80 | .85 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | .99 | 1.00 | 1.00 | ||
200 | .80 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
40% | .2 | 100 | .60 | .01 | .70 | .73 | 1.00 | .79 | .76 | .36 | .95 | .96 | |
200 | .60 | .01 | .78 | .79 | 1.00 | .90 | .90 | .71 | 1.00 | .99 | |||
500 | .60 | .38 | .89 | .84 | 1.00 | .98 | .99 | 1.00 | 1.00 | 1.00 | |||
.4 | 100 | .60 | .06 | .77 | .78 | 1.00 | .99 | .99 | .96 | 1.00 | 1.00 | ||
200 | .60 | .45 | .83 | .79 | .98 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .07 | .90 | .80 | .25 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | |||
Intercept | 20% | .3 | 100 | .80 | .82 | 1.00 | .99 | .97 | .99 | .99 | .98 | 1.00 | 1.00 |
200 | .80 | .97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
.6 | 100 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
200 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | .90 | 1.00 | 1.00 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
40% | .3 | 100 | .60 | .33 | .84 | .84 | .88 | .87 | .86 | .93 | 1.00 | .99 | |
200 | .60 | .49 | .92 | .92 | .95 | .94 | .92 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .62 | .96 | .96 | .77 | .98 | .99 | 1.00 | 1.00 | 1.00 | |||
.6 | 100 | .60 | .72 | .95 | .94 | .93 | .98 | .98 | 1.00 | 1.00 | 1.00 | ||
200 | .60 | .15 | .97 | .97 | .27 | .99 | .99 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .00 | .94 | .99 | .00 | .95 | .99 | 1.00 | 1.00 | 1.00 |
Note. LO = Location of noninvariance; PE = percentage of noninvariance; MA = magnitude of noninvariance; SS = sample size; AR = power rates at random; BSEM = Bayesian structural equation model. These abbreviations represent the same definitions for all other tables.
Table 2.
Effects of the Studied Variables on Power Rates in Study 1.
ANOVA 1 | ANOVA 2 | |||||||
---|---|---|---|---|---|---|---|---|
df | F | p | ηp2 | df | F | p | ηp2 | |
Location (LO) | 1 | 3.297 | .071 | .016 | 1 | 11.736 | .001 | .096 |
Percentage (PE) | 1 | 33.608 | <.001 | .140 | 1 | 119.617 | <.001 | .521 |
Magnitude (MA) | 1 | 0.690 | .407 | .003 | 1 | 2.455 | .120 | .022 |
Direction (DI) | 2 | 19.623 | <.001 | .160 | 2 | 69.842 | <.001 | .559 |
SampleSize (SS) | 2 | 0.583 | .559 | .006 | 2 | 2.074 | .131 | .036 |
Method (ME) | 2 | 25.507 | <.001 | .199 | 2 | 90.782 | <.001 | .623 |
ME × MA | 2 | 0.232 | .794 | .004 | ||||
ME × LO | 2 | 1.198 | .306 | .021 | ||||
ME × PE | 2 | 37.235 | <.001 | .404 | ||||
ME × DI | 4 | 28.154 | <.001 | .506 | ||||
ME × SS | 4 | 0.215 | .930 | .008 | ||||
PE × MA | 1 | 2.794 | .097 | .025 | ||||
PE × LO | 1 | 0.299 | .585 | .003 | ||||
PE × DI | 2 | 36.894 | <.001 | .402 | ||||
PE × SS | 2 | 0.722 | .488 | .013 | ||||
LO × MA | 1 | 10.055 | .002 | .084 | ||||
LO × DI | 2 | 12.984 | <.001 | .191 | ||||
LO × SS | 2 | 5.464 | .005 | .090 | ||||
DI × MA | 2 | 3.946 | .022 | .067 | ||||
DI × SS | 4 | 2.825 | .028 | .093 | ||||
MA × SS | 2 | 36.894 | <.001 | .232 | ||||
ME × MA × PE | 2 | 9.400 | <.001 | .146 | ||||
ME × MA × LO | 2 | 7.056 | .001 | .114 | ||||
ME × MA × DI | 4 | 7.964 | <.001 | .225 | ||||
ME × MA × SS | 4 | 7.642 | <.001 | .218 | ||||
ME × DI × PE | 4 | 9.840 | <.001 | .264 | ||||
ME × DI × LO | 4 | 5.529 | <.001 | .167 | ||||
ME × DI × SS | 8 | 3.779 | .001 | .216 | ||||
ME × SS × PE | 4 | 4.060 | .004 | .098 | ||||
ME × SS × LO | 4 | 3.000 | .022 | .129 | ||||
ME × LO × PE | 2 | 1.638 | .199 | .029 | ||||
LO × PE × DI | 2 | 3.506 | .033 | .060 | ||||
LO × PE × MA | 1 | 0.223 | .638 | .002 | ||||
LO × PE × SS | 2 | 0.721 | .489 | .013 | ||||
LO × MA × DI | 2 | 0.291 | .748 | .005 | ||||
LO × MA × SS | 2 | 1.604 | .206 | .028 | ||||
LO × DI × SS | 4 | 0.640 | .635 | .023 | ||||
PE × MA × DI | 2 | 2.151 | .121 | .038 | ||||
PE × MA × SS | 2 | 4.322 | .016 | .073 | ||||
PE × DI × SS | 4 | 0.973 | .426 | .034 | ||||
MA × DI × SS | 4 | 1.062 | .379 | .037 | ||||
Residuals | 206 | 110 |
Figure 1.
Power rates for Bayesian structural equation model (BSEM), MaxL, and Minχ2 when the percentage of noninvariant factor loadings = 20% versus 40% and the magnitude of differences = 0.2 versus 0.4. Note. The reference line in each individual graph indicates the power rate of randomly selecting an item as the reference indicator (RI).
Figure 2.
Power rates for Bayesian structural equation model (BSEM), MaxL, and Minχ2 when the percentage of noninvariant intercepts = 20% versus 40% and the magnitude of differences = 0.3 versus 0.6. Note. The reference line in each individual graph indicates the power rate of randomly selecting an item as the reference indicator (RI).
The effect of direction was significant (F(2, 206) = 19.623, p < .001, ηp2 = .160), and the average power rate in the positive condition was lower than those in the negative and mixed conditions (ps < .001). The direction effect was thus evident. More specifically, Figures 1 and 2 indicate that (a) the direction effect was greater for MaxL than for Minχ2 and BSEM and (b) factor loadings were more susceptible to direction effects than intercepts, suggesting possible interactions among these variables.
The effect of percentage was significant (F(1, 206) = 33.608, p < .001, ηp2 = .140). Table 1 shows that conditions with 40% noninvariant items produced lower power rates than those with 20% (p < .001). This occurred for factor loadings (see Figure 1) as well as for intercepts (see Figure 2).
Having examined the main effects, we next ran a full ANOVA model including all interactions up to three-way among the six variables. Our focus here was the significance of the interaction effects. Four-way interactions could not be examined due to a limitation of the data: too few scores per cell to provide enough variation. This ANOVA thus included six main effects, 15 two-way interactions, and 20 three-way interactions. The results are presented under ANOVA 2 in Table 2; only the effects of direct importance are reported further.
We first looked at the three-way interactions containing two-way interactions of method×direction. For a significant three-way interaction like this, we examined the interaction of method×direction at each level of the third variable. If this interaction was significant at a certain level of the third variable, we then tested for simple effects. Pairwise comparisons were made thereafter by using Bonferroni correction to adjust for the level of significance.
Significant three-way interactions included (see Table 2): method×direction×percentage (F(4, 110) = 9.840, p < .001, ηp2 = .264), method×direction×sample size (F(8, 110) = 3.779, p = .001, ηp2 = .216), method×direction×magnitude (F(4, 110) = 7.964, p < .001, ηp2 = .225), and method×direction×location (F(4, 110) = 5.529, p < .001, ηp2 = .167). The two-way interaction of method×direction (Table 3) was significant at each level of percentage (20% and 40%), sample size (n = 100, 200, 500), magnitude (small and large), and location (loadings and intercepts). These interaction effects are displayed in Figure 3. As reported in Table 4, the subsequent pairwise comparisons showed that (a) under the positive condition, Minχ2 and BSEM consistently outperformed MaxL at all levels of percentage, sample size, magnitude, and location; (b) under the negative condition, this was true only for percentage = 40% and magnitude = large; and (c) under the mixed condition, the three methods performed similarly.
Table 3.
The Interaction Between Methods and Directions on Power Rates at Each Level of Other Studied Variables.
Positive | Negative | Mixed | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
df | F | p | ηp2 | df | F | p | ηp2 | df | F | p | ηp2 | |
PE | 20% | 2 | 4.900 | .008 | .047 | 2 | <.001 | .999 | <.001 | 2 | 0.350 | .707 | 0.004 |
40% | 2 | 74.79 | <.001 | .430 | 2 | 8.030 | <.001 | .075 | 2 | 1.440 | .241 | 0.014 | |
SS | N = 100 | 2 | 14.090 | <.001 | .130 | 2 | 0.070 | .932 | .001 | 2 | 1.520 | .221 | 0.016 |
N = 200 | 2 | 11.840 | <.001 | .111 | 2 | 0.500 | .608 | .005 | 2 | 0.220 | .803 | 0.002 | |
N = 500 | 2 | 9.940 | <.001 | .095 | 2 | 4.980 | .008 | .050 | 2 | 0.000 | 1.000 | <.001 | |
MA | Small | 2 | 21.530 | <.001 | .179 | 2 | 0.030 | .966 | <.001 | 2 | 1.880 | .155 | 0.019 |
Large | 2 | 15.870 | <.001 | .138 | 2 | 5.830 | .004 | .056 | 2 | 0.000 | 1.000 | <.001 | |
LO | Loadings | 2 | 27.300 | <.001 | .216 | 2 | 0.120 | .883 | .001 | 2 | 1.840 | .162 | 0.018 |
Intercepts | 2 | 12.390 | <.001 | .111 | 2 | 3.650 | .028 | .036 | 2 | 0.010 | .993 | <.001 |
Figure 3.
The interaction effect of methods and directions at each level of other studied variables.
Table 4.
Simple Effect of Direction and Methods on Power Rates at Each Level of Other Studied Variables.
Methods | Positive | Negative | Mixed | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Comparison | Diff | t | p | Adj p | Diff | t | p | Adj p | Diff | t | p | Adj p | ||||
PE | 20% | MaxL | — | Minχ2 | −.153 | −2.730 | .007 | .063 | .002 | 0.030 | .976 | 1.000 | −.041 | −0.730 | .466 | 1.000 |
MaxL | — | BSEM | −.151 | −2.700 | .008 | .069 | .002 | 0.030 | .976 | 1.000 | −.040 | −0.720 | .475 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .976 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .988 | 1.000 | ||
40% | MaxL | — | Minχ2 | −.597 | −10.670 | <.001 | <.001 | −.193 | −3.460 | .001 | .006 | −.083 | −1.470 | .142 | 1.000 | |
MaxL | — | BSEM | −.588 | −10.520 | <.001 | <.001 | −.195 | −3.490 | .001 | .005 | −.082 | −1.460 | .146 | 1.000 | ||
Minχ2 | — | BSEM | .008 | 0.150 | .882 | 1.000 | −.002 | −0.030 | .976 | 1.000 | .001 | 0.010 | .988 | 1.000 | ||
SS | N = 100 | MaxL | — | Minχ2 | −.404 | −4.580 | <.001 | <.001 | .025 | 0.280 | .777 | 1.000 | −.134 | −1.520 | .131 | 1.000 |
MaxL | — | BSEM | −.406 | −4.610 | <.001 | <.001 | .031 | 0.350 | .723 | 1.000 | −.133 | −1.500 | .134 | 1.000 | ||
Minχ2 | — | BSEM | −.003 | −0.030 | .997 | 1.000 | .006 | 0.070 | .944 | 1.000 | .001 | 0.010 | .989 | 1.000 | ||
N = 200 | MaxL | — | Minχ2 | −.374 | −4.240 | <.001 | <.001 | −.076 | −0.870 | .388 | 1.000 | −.051 | −0.580 | .561 | 1.000 | |
MaxL | — | BSEM | −.369 | −4.190 | <.001 | <.001 | −.076 | −0.870 | .388 | 1.000 | −.050 | −0.570 | .571 | 1.000 | ||
Minχ2 | — | BSEM | .005 | 0.060 | .955 | 1.000 | .000 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .989 | 1.000 | ||
N = 500 | MaxL | — | Minχ2 | −.346 | −3.930 | <.001 | .001 | −.236 | −2.680 | .008 | .072 | <.001 | 0.000 | 1.000 | 1.000 | |
MaxL | — | BSEM | −.334 | −3.790 | <.001 | .002 | −.245 | −2.780 | .006 | .054 | <−.001 | 0.000 | 1.000 | 1.000 | ||
Minχ2 | — | BSEM | .013 | 0.140 | .887 | 1.000 | −.009 | −0.100 | .921 | 1.000 | <−.001 | 0.000 | 1.000 | 1.000 | ||
MA | Small | MaxL | — | Minχ2 | −.402 | −5.700 | <.001 | <.001 | .014 | 0.200 | .841 | 1.000 | −.119 | −1.690 | .092 | .831 |
MaxL | — | BSEM | −.400 | −5.670 | <.001 | <.001 | .018 | 0.250 | .804 | 1.000 | −.118 | −1.670 | .097 | .873 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .972 | 1.000 | .003 | 0.050 | .962 | 1.000 | .002 | 0.020 | .981 | 1.000 | ||
Large | MaxL | — | Minχ2 | −.348 | −4.930 | <.001 | <.001 | −.206 | −2.920 | .004 | .035 | −.004 | −0.060 | .953 | 1.000 | |
MaxL | — | BSEM | −.340 | −4.830 | <.001 | <.001 | −.211 | −2.990 | .003 | .028 | −.004 | −0.060 | .953 | 1.000 | ||
Minχ2 | — | BSEM | .008 | 0.110 | .915 | 1.000 | −.005 | −0.070 | .944 | 1.000 | <−.001 | 0.000 | 1.000 | 1.000 | ||
LO | Loadings | MaxL | — | Minχ2 | −.451 | −6.490 | <.001 | <.001 | −.030 | −0.430 | .666 | 1.000 | −.116 | −1.670 | .097 | .874 |
MaxL | — | BSEM | −.438 | −6.310 | <.001 | <.001 | −.030 | −0.430 | .666 | 1.000 | −.115 | −1.650 | .100 | .896 | ||
Minχ2 | — | BSEM | .013 | 0.180 | .857 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .990 | 1.000 | ||
Intercepts | MaxL | — | Minχ2 | −.298 | −4.290 | <.001 | <.001 | −.162 | −2.330 | .021 | .189 | −.008 | −0.110 | .914 | 1.000 | |
MaxL | — | BSEM | −.301 | −4.330 | <.001 | <.001 | −.163 | −2.350 | .020 | .178 | −.007 | −0.100 | .924 | 1.000 | ||
Minχ2 | — | BSEM | −.003 | −0.040 | .971 | 1.000 | −.002 | −0.020 | .981 | 1.000 | .001 | 0.010 | .990 | 1.000 |
Note. Adj p = Bonferroni correction for familywise error rate.
We then examined the three-way interactions containing the two-way interaction of method×magnitude. All of these three-way interactions were significant (see Table 2): method×magnitude×percentage (F = 9.400, p < .001, η² = .146), method×magnitude×sample size (F = 7.642, p < .001, η² = .218), method×magnitude×direction (F = 7.964, p < .001, η² = .225), and method×magnitude×location (F = 7.056, p = .001, η² = .114). Figure 4 (and Table 5) display the two-way interactions of method×magnitude at each level of percentage, sample size, direction, and location. Table 6 shows the results of pairwise comparisons after Bonferroni correction. When the between-group differences in parameters were small, Minχ2 and BSEM outperformed MaxL at percentage = 40%, sample size = 100, direction = positive, and location = loadings, and did not perform differently under the other conditions. When the parameter differences were large, Minχ2 and BSEM outperformed MaxL at percentage = 40%, sample size = 500, direction = positive & negative, and location = intercepts, and they did not perform differently under the other conditions.
Figure 4.
The interaction effect of methods and magnitudes at each level of other studied variables.
Table 5.
The Interaction Between Methods and Magnitudes on Power Rates at Each Level of Other Studied Variables.
Small | Large | ||||||||
---|---|---|---|---|---|---|---|---|---|
df | F | p | η² | df | F | p | η² |
PE | 20% | 2 | 2.330 | .100 | .022 | 2 | 0.050 | .956 | <.001 |
40% | 2 | 9.400 | <.001 | .084 | 2 | 23.67 | <.001 | .188 | |
SS | N = 100 | 2 | 5.980 | .003 | .057 | 2 | 0.990 | .374 | .010 |
N = 200 | 2 | 3.020 | .051 | .030 | 2 | 2.630 | .074 | .026 | |
N = 500 | 2 | 0.820 | .441 | .008 | 2 | 9.060 | <.001 | .084 | |
DR | Positive | 2 | 21.530 | <.001 | .179 | 2 | 15.870 | <.001 | .138 |
Negative | 2 | 0.030 | .966 | <.001 | 2 | 5.830 | .004 | .056 | |
Mixed | 2 | 1.880 | .155 | .019 | 2 | 0.000 | .998 | <.001 |
LO | Loadings | 2 | 8.600 | <.001 | .078 | 2 | 3.750 | .025 | .036 |
Intercepts | 2 | 1.480 | .230 | .014 | 2 | 7.050 | .001 | .065 |
Table 6.
Simple Effect of Magnitudes and Methods on Power Rates at Each Level of Other Studied Variables.
Methods | Simple effect | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Small magnitude | Large magnitude | |||||||||||
Comparison | Diff | t | p | Adj p | Diff | t | p | Adj p | ||||
PE | 20% | MaxL | — | Minχ2 | −.112 | −1.880 | .061 | .367 | −.016 | −0.260 | .794 | 1.000 |
MaxL | — | BSEM | −.111 | −1.850 | .065 | .391 | −.016 | −0.260 | .794 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .978 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
40% | MaxL | — | Minχ2 | −.223 | −3.780 | <.001 | .001 | −.356 | −5.970 | <.001 | <.001 | |
MaxL | — | BSEM | −.222 | −3.730 | <.001 | .002 | −.354 | −5.940 | <.001 | <.001 | ||
Minχ2 | — | BSEM | .003 | 0.060 | .956 | 1.000 | .002 | 0.030 | .978 | 1.000 | ||
SS | N = 100 | MaxL | — | Minχ2 | −.243 | −3.020 | .003 | .017 | −.098 | −1.220 | .225 | 1.000 |
MaxL | — | BSEM | −.240 | −2.970 | .003 | .020 | −.098 | −1.220 | .225 | 1.000 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .967 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
N = 200 | MaxL | — | Minχ2 | −.173 | −2.140 | .034 | .203 | −.162 | −2.000 | .047 | .279 | |
MaxL | — | BSEM | −.171 | −2.120 | .036 | .213 | −.159 | −1.970 | .050 | .300 | ||
Minχ2 | — | BSEM | .002 | 0.020 | .984 | 1.000 | .003 | 0.030 | .975 | 1.000 | ||
N = 500 | MaxL | — | Minχ2 | −.091 | −1.130 | .262 | 1.000 | −.298 | −3.690 | <.001 | .002 | |
MaxL | — | BSEM | −.088 | −1.090 | .275 | 1.000 | −.298 | −3.690 | <.001 | .002 | ||
Minχ2 | — | BSEM | .003 | 0.030 | .975 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
DI | Positive | MaxL | — | Minχ2 | −.402 | −5.700 | <.001 | <.001 | −.348 | −4.930 | <.001 | <.001 |
MaxL | — | BSEM | −.399 | −5.670 | <.001 | <.001 | −.340 | −4.830 | <.001 | <.001 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .972 | 1.000 | .008 | 0.110 | .915 | 1.000 | ||
Negative | MaxL | — | Minχ2 | .014 | 0.200 | .841 | 1.000 | −.206 | −2.920 | .004 | .023 | |
MaxL | — | BSEM | .018 | 0.250 | .804 | 1.000 | −.211 | −2.990 | .003 | .019 | ||
Minχ2 | — | BSEM | .003 | 0.050 | .962 | 1.000 | −.005 | −0.070 | .944 | 1.000 | ||
Mixed | MaxL | — | Minχ2 | −.119 | −1.690 | .092 | .554 | −.004 | −0.060 | .953 | 1.000 |
MaxL | — | BSEM | −.118 | −1.670 | .100 | .582 | −.004 | −0.060 | .953 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.020 | .981 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
LO | Loadings | MaxL | — | Minχ2 | −.238 | −3.610 | <.001 | .002 | −.159 | −2.420 | .017 | .099 |
MaxL | — | BSEM | −.236 | −3.570 | <.001 | .003 | −.153 | −2.320 | .021 | .127 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .967 | 1.000 | .006 | 0.090 | .926 | 1.000 | ||
Intercepts | MaxL | — | Minχ2 | −.099 | −1.510 | .133 | .799 | −.212 | −3.220 | .002 | .009 | |
MaxL | — | BSEM | −.097 | −1.470 | .142 | .852 | −.217 | −3.280 | .001 | .007 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .973 | 1.000 | −.004 | −0.070 | .946 | 1.000 |
Note. Adj p = Bonferroni correction for familywise error rate.
Simulation Study 2: MI Testing With RI-Based and Non-RI-Based Approaches
The aim of Study 2 was to compare the RI-based approach with the non-RI-based approach for testing measurement invariance. We used the data generated in Study 1 for this purpose. In Study 2, we chose Minχ2 as the representative technique for selecting an RI, because Study 1 showed that, in general, Minχ2 behaved well in identifying an appropriate RI. We anticipated that once the best possible invariant RI was selected, the RI-based approach would lead to satisfactory MI outcomes.
In testing for MI using the RI-based approach, an RI was first chosen by the Minχ2 method, and the baseline model was then fitted by setting the RI to be equal across groups while allowing all other parameters to be freely estimated. Each of the other parameters was then constrained, one at a time, to be equal across groups, leading to a reduced model. The fit difference between the baseline model and each reduced model was evaluated with an LR test. If an LR test turned out to be nonsignificant, the parameter constrained in the reduced model was concluded to be invariant across groups. In Study 2, the non-RI-based approach for MI testing utilized the procedure proposed by Raykov et al. (2013). The l values were computed using Equation (2) based on a significance level α = .05. Both methods were elucidated in the Introduction section.
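The LR comparison at the heart of this procedure can be sketched in a few lines. The chi-square values below are taken from Table 12 (the intercept of Item 13 versus the baseline model), the one-parameter constraint gives 1 df, and the function name is ours:

```python
import math

def lr_test_1df(chisq_reduced, chisq_baseline, alpha=0.05):
    """Likelihood-ratio test for a single equality constraint (df = 1).

    For a chi-square variate with 1 df, P(X > x) = erfc(sqrt(x / 2)),
    so no external statistics library is needed.
    """
    delta = chisq_reduced - chisq_baseline
    p = math.erfc(math.sqrt(delta / 2.0))
    return delta, p, p >= alpha  # True -> constrained parameter deemed invariant

# Constraining Item 13's intercept raises the chi-square from the
# baseline 6394.829 (df = 270) to 6441.861 (df = 271); see Table 12.
delta, p, invariant = lr_test_1df(6441.861, 6394.829)
print(round(delta, 3), invariant)  # 47.032 False -> flagged as noninvariant
```

The same comparison with df = 2 (as in Table 9, where loading and intercept are tested jointly) would use the chi-square survival function with 2 degrees of freedom instead.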
Three criteria were used to evaluate the performance of the RI-based and non-RI-based approaches (e.g., Jung & Yoon, 2016). The first was the item power rate, computed as the ratio of the total number of detected noninvariant parameters to the total number of generated noninvariant parameters in each condition across all 500 replications. The second was the item Type I error rate, computed as the ratio of the total number of invariant parameters falsely detected as noninvariant to the total number of generated invariant parameters. The third was the item Type II error rate, computed as the ratio of the total number of noninvariant parameters mistakenly detected as invariant to the total number of generated noninvariant parameters.
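The three rates reduce to set operations over parameter labels; a minimal sketch (the parameter labels below are hypothetical, not from the simulation):

```python
def evaluation_rates(all_parameters, generated_noninvariant, detected_noninvariant):
    """Item power, Type I error, and Type II error rates as set operations."""
    gen = set(generated_noninvariant)   # truly noninvariant parameters
    det = set(detected_noninvariant)    # parameters flagged as noninvariant
    inv = set(all_parameters) - gen     # truly invariant parameters
    power = len(det & gen) / len(gen)   # hits among true noninvariants
    type1 = len(det & inv) / len(inv)   # false alarms among true invariants
    type2 = len(gen - det) / len(gen)   # misses among true noninvariants
    return power, type1, type2

# Hypothetical example: parameters 1-10, of which 1-4 were generated
# noninvariant, and a test flagged 1, 2, and 5.
power, type1, type2 = evaluation_rates(range(1, 11), {1, 2, 3, 4}, {1, 2, 5})
print(power, round(type1, 3), type2)  # 0.5 0.167 0.5
```

Note that power and the Type II error rate are complements (power = 1 − Type II), which is visible in Table 7.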
Results of Study 2
We limited our presentation of the results to the mixed condition, in which some model parameters were generated to be greater, and others smaller, in one group than in the other.2 This condition is likely to be more realistic in empirical research than the other two uniform conditions (i.e., positive and negative). The power rates, Type I error rates, and Type II error rates are summarized in Table 7. ANOVAs were performed on these criterion values to test the effects of method (ME; i.e., the RI-based and non-RI-based methods), sample size (SS), location of difference (LO), percentage of noninvariant parameters (PE), and magnitude of difference (MA). To serve the goal of this study, we report only the differences between the two MI methods on the three criteria.
Table 7.
Item Power Rates, Type I Error Rates, and Type II Error Rates for MI Testing in Study 2.
Power rate | Type I error | Type II error | |||||||
---|---|---|---|---|---|---|---|---|---|
LO | PE | MA | SS | RI-based | Non-RI-based | RI-based | Non-RI-based | RI-based | Non-RI-based |
Factor loading | 20% | .2 | 100 | .935 | .144 | .000 | .000 | .058 | .856 |
200 | .999 | .511 | .000 | .002 | .000 | .489 | |||
500 | 1.000 | .973 | .000 | .002 | .000 | .027 | |||
.4 | 100 | 1.000 | .890 | .000 | .004 | .000 | .110 | ||
200 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .006 | .000 | .000 | |||
40% | .2 | 100 | .248 | .162 | .000 | .002 | .741 | .839 | |
200 | .749 | .534 | .000 | .004 | .250 | .466 | |||
500 | 1.000 | .981 | .000 | .003 | .000 | .019 | |||
.4 | 100 | .828 | .893 | .000 | .011 | .172 | .106 | ||
200 | 1.000 | .998 | .000 | .016 | .000 | .002 | |||
500 | 1.000 | 1.000 | .000 | .053 | .000 | .000 | |||
Intercept | 20% | .2 | 100 | .500 | .620 | .000 | .002 | .500 | .380 |
200 | 1.000 | .968 | .104 | .002 | .000 | .032 | |||
500 | 1.000 | 1.000 | .000 | .002 | .000 | .000 | |||
.4 | 100 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | ||
200 | 1.000 | 1.000 | .104 | .005 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
40% | .2 | 100 | .999 | .698 | .000 | .004 | .000 | .302 | |
200 | 1.000 | .981 | .162 | .003 | .000 | .019 | |||
500 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
.4 | 100 | 1.000 | 1.000 | .000 | .008 | .000 | .000 | ||
200 | 1.000 | 1.000 | .164 | .009 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .022 | .000 | .000 |
The main effect of method was not significant for any of the three criteria, namely power rate (F(1, 46) = 1.492, p = .228), Type I error (F(1, 46) = 2.199, p = .145), and Type II error (F(1, 46) = 1.516, p = .224), when only the main effects were included. (The results from this ANOVA were not tabled to save space.) However, computations based on the results in Table 7 suggested that, in detecting factor loading differences, the average power rate was .879 for the RI-based approach but only .756 (below the desirable level of .80) for the non-RI-based approach. This difference became even more evident when the factor loading differences were generated to be small (i.e., .20): under this condition, the RI-based approach was associated with a higher mean power rate (M = .821) than the non-RI-based approach (M = .550). The former thus appeared more sensitive in detecting small differences in factor loadings.
Method had no significant interaction with percentage or magnitude on any of the three criteria (ps > .05; see Table 8), but it did interact significantly with location on Type I error (F(1, 27) = 12.412, p < .01) and with sample size on Type I error (F(1, 27) = 11.901, p < .001). Table 7 revealed that the RI-based method was subject to (nonsignificantly) less Type I error (M = .000) on factor loadings than the non-RI-based method (M = .008), but significantly greater Type I error (M = .043; still below the standard level of .05) on intercepts than the non-RI-based method (M = .004; F(1, 44) = 7.85, p < .01). In addition, the RI-based method produced (nonsignificantly) less Type I error at sample sizes of 100 and 500, but greater Type I error (M = .065) at a sample size of 200 than the non-RI-based method (M = .004, F(1, 42) = 15.75, p < .01).
Table 8.
Effects of the Studied Variables on Item Power Rates, Type I Error Rates, and Type II Error Rates in Study 2.
Power rate | Type I error rate | Type II error rate | |||||||
---|---|---|---|---|---|---|---|---|---|
df | F | p | df | F | p | df | F | p | |
LO | 1 | 9.974 | <.01 | 1 | 9.119 | <.01 | 1 | 9.990 | <.01 |
PE | 1 | 0.257 | .617 | 1 | 1.916 | .178 | 1 | 0.248 | .623 |
MA | 1 | 24.691 | <.001 | 1 | 0.776 | .386 | 1 | 24.792 | <.001 |
SS | 2 | 14.904 | <.001 | 2 | 9.843 | <.001 | 2 | 14.950 | <.001 |
ME | 1 | 4.328 | .047 | 1 | 5.715 | .024 | 1 | 4.410 | .045 |
LO × PE | 1 | 3.163 | .087 | 1 | 0.253 | .619 | 1 | 3.151 | .087 |
LO × MA | 1 | 5.370 | .028 | 1 | 0.063 | .803 | 1 | 5.366 | .028 |
LO × SS | 2 | 2.518 | .100 | 2 | 10.935 | <.001 | 2 | 2.513 | .100 |
LO × ME | 1 | 2.511 | .125 | 1 | 12.412 | <.01 | 1 | 2.566 | .121 |
PE × MA | 1 | 0.020 | .890 | 1 | 0.396 | .535 | 1 | 0.017 | .898 |
PE × SS | 2 | 0.074 | .929 | 2 | 0.550 | .583 | 2 | 0.071 | .932 |
PE × ME | 1 | 0.654 | .426 | 1 | 0.016 | .901 | 1 | 0.642 | .430 |
MA × SS | 2 | 9.404 | <.001 | 2 | 0.170 | .844 | 2 | 9.422 | <.01 |
MA × ME | 1 | 3.891 | .059 | 1 | 0.776 | .386 | 1 | 3.967 | .057 |
SS × ME | 2 | 1.033 | .370 | 2 | 11.901 | <.001 | 2 | 1.059 | .361 |
Note. ME = Method; that is, the RI-based approach and non-RI-based methods.
A Pedagogical Example
We first applied MaxL, Minχ2, and BSEM for RI selection to data collected from a large-scale project (n = 12,811), Psychological Wellbeing of Children of Rural-to-Urban Migrant Workers in China. The measure chosen for this demonstration was from the Revised Child Anxiety and Depression Scale (RCADS; Chorpita et al., 2000). This self-report scale contains 47 items in total; however, only the 18 items for generalized anxiety were used here for demonstration. Responses were scored on a Likert-type scale of 1 to 4, corresponding to “Never,” “Sometimes,” “Quite Often,” and “Always.” Cronbach’s α was .897 and ω was .910 in this sample.
There were 7,356 male (57.4%) and 5,455 female (42.6%) child respondents in this sample. A two-gender-group CFA was fitted to these data, and MaxL, Minχ2, and BSEM were used to find RIs. MaxL and Minχ2 each produced 18 LR statistics from the comparisons between the baseline model and each reduced model, and the 18 values were rank ordered from smallest to largest. As shown in Table 9, Item 7 was associated with the smallest LR statistic, so Minχ2 chose this item as the RI. Among the items that yielded nonsignificant LR statistics, Item 7 had the largest factor loading in the baseline model; thus MaxL also chose Item 7 as the RI.
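The two selection rules can be sketched directly. The tuples below take a few rows of Table 9 in the form (item, baseline loading, LR statistic, p value); the variable names are ours:

```python
# A subset of Table 9: (item, loading in baseline model, LR statistic, p).
candidates = [
    (2,  .480, 1.972, .373),
    (7,  .700, 0.580, .748),
    (15, .625, 0.946, .623),
    (18, .481, 0.588, .745),
]

# Minchi2 rule: pick the item with the smallest LR statistic.
min_chi2_ri = min(candidates, key=lambda c: c[2])[0]

# MaxL rule: among items whose LR test is nonsignificant,
# pick the one with the largest baseline factor loading.
nonsignificant = [c for c in candidates if c[3] > .05]
max_l_ri = max(nonsignificant, key=lambda c: c[1])[0]

print(min_chi2_ri, max_l_ri)  # 7 7 -> both rules select Item 7
```

With the full 18 rows of Table 9 the outcome is the same, since Item 7 has both the smallest LR statistic (0.580) and the largest loading (.700) among the nonsignificant items.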
Table 9.
Results of Using MaxL and Minχ2 to Identify an RI With the Empirical Data.
Factor loadings | Loadings (R) | Loadings (F) | χ² | df | Δχ² | p |
---|---|---|---|---|---|---|---
Baseline | 6590.815 | 304 | |||||
Item 1 | .300 | .306 | .291 | 6576.866 | 302 | 13.949 | <.001 |
Item 2 | .480 | .471 | .492 | 6588.843 | 302 | 1.972 | .373 |
Item 3 | .571 | .556 | .591 | 6582.942 | 302 | 7.873 | .020 |
Item 4 | .620 | .611 | .633 | 6587.198 | 302 | 3.617 | .164 |
Item 5 | .643 | .630 | .661 | 6586.149 | 302 | 4.666 | .097 |
Item 6 | .538 | .557 | .510 | 6578.007 | 302 | 12.808 | .002 |
Item 7 | .700 | .703 | .694 | 6590.235 | 302 | 0.580 | .748 |
Item 8 | .656 | .651 | .663 | 6586.430 | 302 | 4.385 | .112 |
Item 9 | .665 | .653 | .682 | 6586.734 | 302 | 4.081 | .130 |
Item 10 | .690 | .682 | .703 | 6577.774 | 302 | 13.041 | .002 |
Item 11 | .540 | .555 | .518 | 6568.084 | 302 | 22.731 | <.001 |
Item 12 | .491 | .499 | .478 | 6575.632 | 302 | 15.183 | <.001 |
Item 13 | .536 | .543 | .528 | 6508.893 | 302 | 81.922 | <.001 |
Item 14 | .425 | .435 | .411 | 6582.484 | 302 | 8.331 | .016 |
Item 15 | .625 | .630 | .618 | 6589.869 | 302 | 0.946 | .623 |
Item 16 | .608 | .600 | .621 | 6584.566 | 302 | 6.249 | .044 |
Item 17 | .598 | .605 | .586 | 6589.292 | 302 | 1.523 | .467 |
Item 18 | .481 | .484 | .476 | 6590.227 | 302 | 0.588 | .745 |
Note. “Factor loadings” = the loading estimates from a baseline model in which all loadings were constrained to be equal across groups; “Loadings (R)” and “Loadings (F)” = the loading estimates for the reference (R) and focal (F) groups, respectively.
Then BSEM was used to select an RI by specifying a two-group CFA model with the commands knownclass = c (g = 1 2) under Variable, and type = mixture; estimator = bayes; under Analysis (Muthén & Asparouhov, 2012). The cross-group difference parameters, each representing a summarized difference of an item across groups, were set under Model Constraint. We imposed a zero-mean, small-variance normal prior (N(0, 0.001)) on each difference parameter through the DIFF option under Model Priors. MCMC simulations were run for a minimum of 50,000 and a maximum of 100,000 iterations with thin = 10. The Mplus output contained the necessary information on the posterior distributions of the difference parameters, including their means and standard deviations. Table 10 shows the estimates of the loading and intercept differences along with their standard deviations. The selection index was then calculated for each item using Equation (1). Eventually, Item 7 was chosen as the RI because it produced the smallest index (0.646) of the 18 items.
Table 10.
Results of Using BSEM to Identify an RI With the Empirical Data.
Loading difference (SD) | Intercept difference (SD) | Selection index |
---|---|---|---
Item 1 | 0.011 (0.014) | 0.044 (0.014) | 3.929 |
Item 2 | 0.017 (0.015) | 0.007 (0.015) | 1.600 |
Item 3 | 0.027 (0.016) | 0.023 (0.015) | 3.221 |
Item 4 | 0.019 (0.016) | 0.016 (0.015) | 2.254 |
Item 5 | 0.024 (0.015) | 0.01 (0.015) | 2.267 |
Item 6 | 0.036 (0.016) | 0.027 (0.015) | 4.050 |
Item 7 | 0.005 (0.016) | 0.005 (0.015) | 0.646 |
Item 8 | 0.01 (0.016) | 0.024 (0.015) | 2.225 |
Item 9 | 0.023 (0.016) | 0.012 (0.015) | 2.238 |
Item 10 | 0.017 (0.016) | 0.037 (0.015) | 3.529 |
Item 11 | 0.03 (0.013) | 0.039 (0.013) | 5.308 |
Item 12 | 0.018 (0.015) | 0.041 (0.014) | 4.129 |
Item 13 | 0.013 (0.015) | 0.106 (0.015) | 7.933 |
Item 14 | 0.019 (0.015) | 0.03 (0.015) | 3.267 |
Item 15 | 0.011 (0.014) | 0.002 (0.014) | 0.929 |
Item 16 | 0.016 (0.015) | 0.022 (0.015) | 2.533 |
Item 17 | 0.016 (0.015) | 0.001 (0.015) | 1.133 |
Item 18 | 0.007 (0.016) | 0.007 (0.015) | 0.904 |
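The tabled values are consistent with a selection index that sums each absolute posterior mean scaled by its posterior SD; under that assumption about the form of Equation (1), the following sketch reproduces the indices for Items 7 and 13:

```python
def selection_index(d_loading, sd_loading, d_intercept, sd_intercept):
    # Assumed form of Equation (1): the absolute posterior mean of each
    # cross-group difference, scaled by its posterior SD, then summed.
    return abs(d_loading) / sd_loading + abs(d_intercept) / sd_intercept

# Posterior means and SDs from Table 10.
print(round(selection_index(0.005, 0.016, 0.005, 0.015), 3))  # 0.646 (Item 7)
print(round(selection_index(0.013, 0.015, 0.106, 0.015), 3))  # 7.933 (Item 13)
```

Item 7 minimizes this index across all 18 items, which is why BSEM selected it as the RI.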
Next we applied the two methods of MI testing to this sample. The RI-based method used Item 7, previously identified as the RI, and the non-RI-based method utilized the procedure proposed by Raykov et al. (2013). As shown in Tables 11 and 12, the non-RI-based method detected intercept differences for Items 1, 10, 11, 12, and 13, whereas the RI-based approach detected loading differences for Item 6 and intercept differences for Items 1, 6, 10, 11, 12, 13, and 14. Thus, in this empirical application, the two MI testing methods reached 62.5% agreement on the detected noninvariant parameters.
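One way to arrive at the reported 62.5% is to take the proportion of jointly flagged parameters among all parameters flagged by either method, using the detections listed above:

```python
# Parameters flagged as noninvariant by each approach (kind, item number).
ri_based = {("loading", 6), ("intercept", 1), ("intercept", 6),
            ("intercept", 10), ("intercept", 11), ("intercept", 12),
            ("intercept", 13), ("intercept", 14)}
non_ri = {("intercept", 1), ("intercept", 10), ("intercept", 11),
          ("intercept", 12), ("intercept", 13)}

# Jointly flagged parameters over all flagged parameters.
agreement = len(ri_based & non_ri) / len(ri_based | non_ri)
print(agreement)  # 0.625
```

Here every parameter flagged by the non-RI-based method was also flagged by the RI-based method (5 of 8), so the agreement is 5/8 = .625.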
Table 11.
Results From the Non-RI-Based Approach for MI Testing Using the Empirical Data.
Loadings (R) | Loadings (F) | Intercepts (R) | Intercepts (F) | χ² | df | Δχ² | p | l | New order |
---|---|---|---|---|---|---|---|---|---|---
Baseline | 6590.815 | 304 | ||||||||
Item 1 | .306 | .292 | 6589.966 | 303 | 0.849 | .357 | .009 | 27 | ||
2.689 | 2.636 | 6577.756 | 303 | 13.059 | <.001 | .001 | 4 | |||
Item 2 | .471 | .492 | 6589.206 | 303 | 1.609 | .205 | .007 | 22 | ||
1.994 | 2.003 | 6590.462 | 303 | 0.353 | .552 | .010 | 31 | |||
Item 3 | .556 | .591 | 6586.412 | 303 | 4.403 | .036 | .003 | 10 | ||
2.389 | 2.359 | 6587.295 | 303 | 3.52 | .061 | .005 | 14 | |||
Item 4 | .611 | .634 | 6588.794 | 303 | 2.021 | .155 | .006 | 19 | ||
2.224 | 2.205 | 6589.195 | 303 | 1.62 | .203 | .007 | 21 | |||
Item 5 | .630 | .660 | 6587.087 | 303 | 3.728 | .054 | .004 | 13 | ||
2.342 | 2.356 | 6589.902 | 303 | 0.913 | .339 | .008 | 25 | |||
Item 6 | .558 | .642 | 6582.583 | 303 | 8.232 | .004 | .002 | 7 | ||
2.305 | 2.272 | 6586.316 | 303 | 4.499 | .034 | .003 | 9 | |||
Item 7 | .703 | .694 | 6590.493 | 303 | 0.322 | .570 | .011 | 32 | ||
2.343 | 2.350 | 6590.553 | 303 | 0.262 | .609 | .011 | 33 | |||
Item 8 | .651 | .663 | 6590.295 | 303 | 0.52 | .471 | .010 | 29 | ||
2.256 | 2.226 | 6586.933 | 303 | 3.882 | .049 | .004 | 12 | |||
Item 9 | .653 | .682 | 6587.579 | 303 | 3.236 | .072 | .005 | 15 | ||
2.231 | 2.217 | 6589.949 | 303 | 0.866 | .352 | .009 | 26 | |||
Item 10 | .682 | .703 | 6588.744 | 303 | 2.071 | .15 | .006 | 18 | ||
2.006 | 1.959 | 6579.787 | 303 | 11.028 | .001 | .002 | 5 | |||
Item 11 | .555 | .518 | 6582.538 | 303 | 8.277 | .004 | .002 | 6 | ||
2.359 | 2.313 | 6576.481 | 303 | 14.334 | <.001 | .001 | 2 | |||
Item 12 | .499 | .477 | 6588.741 | 303 | 2.074 | .15 | .006 | 17 | ||
2.454 | 2.506 | 6577.627 | 303 | 13.188 | <.001 | .001 | 3 | |||
Item 13 | .542 | .527 | 6589.826 | 303 | 0.989 | .32 | .008 | 24 | ||
2.456 | 2.592 | 6509.733 | 303 | 81.082 | <.001 | .000 | 1 | |||
Item 14 | .435 | .411 | 6588.697 | 303 | 2.118 | .146 | .005 | 16 | ||
2.717 | 2.756 | 6584.549 | 303 | 6.266 | .012 | .003 | 8 | |||
Item 15 | .630 | .618 | 6589.976 | 303 | 0.839 | .360 | .009 | 28 | ||
2.077 | 2.082 | 6590.704 | 303 | 0.111 | .739 | .012 | 35 | |||
Item 16 | .600 | .620 | 6588.907 | 303 | 1.908 | .167 | .007 | 20 | ||
2.185 | 2.214 | 6586.514 | 303 | 4.301 | .038 | .004 | 11 | |||
Item 17 | .605 | .586 | 6589.307 | 303 | 1.508 | .219 | .008 | 23 | ||
2.362 | 2.364 | 6590.797 | 303 | 0.018 | .893 | .012 | 36 | |||
Item 18 | .484 | .476 | 6590.617 | 303 | 0.198 | .656 | .011 | 34 | ||
2.561 | 2.571 | 6590.421 | 303 | 0.394 | .530 | .010 | 30 |
Note. R = reference group, F = focal group.
Table 12.
Results From the RI-Based Approach for MI Testing Using the Empirical Data.
Factor loadings | Intercepts | χ² | df | Δχ² | p |
---|---|---|---|---|---|---
Baseline | 6394.829 | 270 | ||||
Item 1 | .301 | 6395.332 | 271 | 0.503 | .478 | |
2.668 | 6407.199 | 271 | 12.37 | <.001 | ||
Item 2 | .478 | 6396.599 | 271 | 1.77 | .183 | |
1.995 | 6394.881 | 271 | 0.052 | .820 | ||
Item 3 | .567 | 6398.532 | 271 | 3.703 | .054 | |
2.379 | 6397.845 | 271 | 3.016 | .082 | ||
Item 4 | .617 | 6396.684 | 271 | 1.855 | .173 | |
2.218 | 6396.367 | 271 | 1.538 | .215 | ||
Item 5 | .638 | 6397.807 | 271 | 2.978 | .084 | |
2.344 | 6394.951 | 271 | 0.122 | .727 | ||
Item 6 | .546 | 6399.201 | 271 | 4.372 | .037 | |
2.294 | 6398.880 | 271 | 4.051 | .044 | ||
Item 8 | .655 | 6395.484 | 271 | 0.655 | .418 | |
2.248 | 6397.815 | 271 | 2.986 | .084 | ||
Item 9 | .661 | 6397.345 | 271 | 2.516 | .113 | |
2.227 | 6395.781 | 271 | 0.952 | .329 | ||
Item 10 | .687 | 6396.416 | 271 | 1.587 | .208 | |
1.996 | 6401.274 | 271 | 6.445 | .011 | ||
Item 11 | .548 | 6398.250 | 271 | 3.421 | .064 | |
2.347 | 6404.235 | 271 | 9.406 | .002 | ||
Item 12 | .495 | 6395.918 | 271 | 1.089 | .297 | |
2.467 | 6401.883 | 271 | 7.054 | .008 | ||
Item 13 | .540 | 6395.217 | 271 | 0.388 | .533 | |
2.490 | 6441.861 | 271 | 47.032 | <.001 | ||
Item 14 | .429 | 6396.194 | 271 | 1.365 | .243 | |
2.728 | 6398.621 | 271 | 3.792 | .051 | ||
Item 15 | .629 | 6395.046 | 271 | 0.217 | .641 | |
2.076 | 6394.844 | 271 | 0.015 | .903 | ||
Item 16 | .606 | 6396.237 | 271 | 1.408 | .235 | |
2.190 | 6396.172 | 271 | 1.343 | .247 | ||
Item 17 | .602 | 6395.434 | 271 | 0.605 | .437 | |
2.361 | 6394.875 | 271 | 0.046 | .830 | ||
Item 18 | .483 | 6394.916 | 271 | 0.087 | .768 | |
2.563 | 6394.902 | 271 | 0.073 | .787 |
Note. Item 7 was used as the reference indicator (RI).
Discussion
The conventional approach for RI selection could jeopardize the outcome of factorial invariance tests using the multiple-group CFA approach, and more rigorous approaches are clearly needed in this research context. Regarding RI selection, three statistical procedures, MaxL, Minχ2, and BSEM, have been available; however, their performance in correctly detecting an RI remained unknown. Thus, in this article, Study 1 examined the performances of MaxL, Minχ2, and BSEM using simulated data. As a follow-up, Study 2 investigated the advantages/disadvantages of using the RI-based approach for MI testing in comparison with the non-RI-based approach. Together, the two simulation studies provided a complete, solid examination of how reference indicators matter in measurement invariance tests.
Study 1 revealed that Minχ2 and BSEM performed better than MaxL in selecting the correct item as a reference indicator. This was particularly true under the positive condition, where parameter values for functionally different items were higher in the focal group than in the reference group, regardless of the levels of all other conditions under investigation. Under the negative condition, MaxL performed much better than it did in the positive condition, and showed power equivalent to the other two methods under certain circumstances, such as a small percentage of functionally different items and a small magnitude of cross-group difference in parameters. Under the mixed condition, no significant differences were detected among the three methods; however, MaxL appeared slightly inferior when the sample sizes and the loading differences were small.
The direction effect was evident when using the MaxL approach. This was consistent with the expectation stated earlier in this article; that is, methods favoring high loadings, such as MaxL, tend to perform poorly when the truly invariant items happen to be those with low factor loadings (i.e., the positive condition), but perform decently in most cases when the truly invariant items are also those with high factor loadings (i.e., the negative condition). This may in part explain why MaxL showed high power in correctly selecting an RI in previous research where only negative conditions were simulated (e.g., Meade & Wright, 2012). It appeared that a nonuniform direction of parameter differences (i.e., the mixed condition) would remedy MaxL's drawback of favoring high loadings; in that case, the power rates for detecting truly invariant items were comparable among the three methods.
Another key feature of the MaxL approach lies in its use of the LR statistic to test the significance of item differences between groups. Research has shown that the power of the LR test is highly influenced by sample size; consequently, even a trivial difference in item parameters will lead to a significant LR test when n is large (Ankenmann et al., 1999; Meade, 2010). We found in our simulation analyses that when the percentage of functionally different items was small, increasing the sample size increased the power of detecting truly invariant items. However, power decreased substantially or behaved inconsistently as the sample size increased to, for instance, 500, particularly when the percentage of noninvariant items and the magnitude of item difference were both high. This was true regardless of whether the direction was positive or negative, and whether the difference occurred in factor loadings or intercepts. Thus, its high sensitivity to sample size makes MaxL an implausible approach for applied research.
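The sample-size sensitivity can be illustrated with a toy calculation: for a fixed per-observation discrepancy between groups (the value .01 below is arbitrary, chosen only for illustration), the LR statistic grows roughly linearly with n and eventually exceeds any fixed critical value, so a trivial difference becomes "significant" in a large sample.

```python
CRITICAL = 3.841  # chi-square critical value at alpha = .05 with df = 1

def lr_statistic(n, f_min=0.01):
    # The LR statistic is approximately n times the (fixed) minimum
    # discrepancy per observation, so it scales linearly with n.
    return n * f_min

significant = {n: lr_statistic(n) > CRITICAL for n in (100, 200, 500, 1000)}
print(significant)  # only n = 500 and n = 1000 cross the critical value
```

The same fixed discrepancy that is undetectable at n = 100 is thus flagged at n = 500, mirroring the inconsistent behavior of MaxL at larger sample sizes.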
The Minχ2 and BSEM approaches did not show any significant differences in performance across all conditions. However, when 40% of the items were functionally different, the power rates of these two approaches were noticeably higher in the negative condition than in the positive condition, and this held only for differences occurring in factor loadings. This observation may well be explained by the reliability paradox (see Hancock & Mueller, 2011): when fitting SEM models under a given level of model misspecification, better measurement quality is associated with poorer model fit (Heene et al., 2011; McNeish et al., 2018; Shi, Lee, & Maydeu-Olivares, 2018). Here, model misspecification refers to setting numerically different factor loadings to be the same across groups. Such misspecification may have a heavier negative impact on model fit when the standardized factor loadings are greater in one scenario than in another. In selecting RIs using Minχ2 and BSEM in our simulation analyses, the standardized factor loadings were consistently higher in the positive than in the negative conditions; therefore, the misspecification created by constraining factor loadings to be (nearly) the same could impact model fit more in the positive than in the negative conditions. Consequently, poorer power rates were observed in the positive conditions. Further examination is needed of the relationship between the direction effect and the reliability paradox in fitting multiple-group models.
Study 2 compared the RI-based approach with the non-RI-based approach in detecting invariant and noninvariant parameters. The results were consistent with our anticipation that once an RI was rigorously identified, the RI-based approach would perform well in MI testing. More specifically, we found that the RI-based approach performed better than the non-RI-based approach for detecting (particularly small) loading differences while maintaining a fairly low likelihood of mistakenly identifying truly invariant items to be noninvariant. However, the RI-based approach was associated with higher Type I error rates than its counterpart in detecting intercept differences. It became evident that the two approaches have their own pros and cons when testing for differences and equivalences in model parameters.
A few suggestions can be offered based on our findings on RI selection. First, it is not wise to use MaxL to identify reference indicators. Although this approach can perform as well as the others under certain conditions, it is impractical to preidentify those conditions in empirical data analysis; in addition, MaxL can behave poorly in large samples because of the high sensitivity of the LR test to sample size. Second, Minχ2 and BSEM are both recommended for empirical studies; however, different theoretical backgrounds are required for their implementation: while Minχ2 involves fitting a series of multiple-group CFA models and computing an LR statistic for each individual item, BSEM fits a single model to identify invariant and noninvariant items simultaneously (Shi, Song, Liao, et al., 2017). Last, we recommend that methodological researchers consider the direction of parameter differences as a studied variable in future simulation research on multiple-group CFA models; otherwise, the results could be confounded or misleading.
A limitation of the present study should be noted. The three methods of RI selection included in this article (MaxL, Minχ2, and BSEM) are all CFA-based approaches; thus, in our simulation analyses, the indicator variables were generated to be continuously distributed. In the empirical demonstration, however, all indicator variables were scored on a Likert-type scale and were therefore ordinal in nature. Although it is fairly common in practice to fit CFA models to data measured with Likert-type scales, we cannot be certain how robust the results are with respect to selecting reference indicators. In our empirical analyses, nonetheless, all three methods agreed on the same item as the RI.
Another limitation concerns the RI selection methods and RI-based MI testing. These procedures, as implemented in our study, rest on the implicit assumption that at least one truly invariant item exists among all the scale items. Although this assumption is very likely to hold for well-developed scales and instruments, the RI search methods may end up selecting a noninvariant item as the RI when no truly invariant item actually exists. In our study, no data were generated for conditions in which ALL items differ across groups, so it remains unclear what would happen to RI-based MI testing if a noninvariant RI had to be selected, even one with minimal cross-group differences. Because the non-RI-based approaches do not use any RI in MI tests, they are not subject to this limitation. How the non-RI-based approaches perform when all items function differently across groups is nonetheless worth future investigation.
In summary, we compared three well-developed methods for RI selection, a step that has been considered critical in factorial invariance tests. Study 1 showed that the Minχ2 and BSEM approaches generally performed better than the MaxL approach. It is worth noting that we also examined the effect of the direction of parameter differences on the performance of these methods and showed that such an effect did occur. This suggests that future simulation-based comparisons of multiple-group CFA techniques need to consider directional effects; otherwise, any discovered differences could be confounded or misleading. Study 2 compared one RI-based approach with one non-RI-based approach in terms of their performance in MI testing. In general, the former performed well, with higher power and lower Type I error rates in detecting loading differences but higher Type I error rates in detecting intercept differences.
Woods (2009) rank-ordered the items based on their LR/Δdf values. In our study, we used LR instead of LR/Δdf because Δdf (= 2) was constant across all conditions.
The complete results of Study 2 are available upon request from the first author or the corresponding author.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The involvement of Dr. Zhengkui Liu in this project was supported by the Consulting and Appraising Grant from Chinese Academy of Sciences (Y7CX134003).
ORCID iDs: Hairong Song https://orcid.org/0000-0001-5164-2159
Dexin Shi https://orcid.org/0000-0002-4120-6756
References
- Ankenmann R. R., Witt E. A., Dunbar S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36(4), 277-300. 10.1111/j.1745-3984.1999.tb00558.x
- Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300. 10.1111/j.2517-6161.1995.tb02031.x
- Byrne B. M., Shavelson R. J., Muthén B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456-466. 10.1037/0033-2909.105.3.456
- Chorpita B. F., Yim L., Moffitt C., Umemoto L. A., Francis S. E. (2000). Assessment of symptoms of DSM-IV anxiety and depression in children: A revised child anxiety and depression scale. Behaviour Research and Therapy, 38(8), 835-855. 10.1016/S0005-7967(99)00130-8
- French B. F., Finch W. H. (2008). Multigroup confirmatory factor analysis: Locating the invariant referent sets. Structural Equation Modeling, 15(1), 96-113. 10.1080/10705510701758349
- Hancock G. R., Mueller R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71(2), 306-324. 10.1177/0013164410384856
- Heene M., Hilbert S., Draxler C., Ziegler M., Bühner M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: A cautionary note on the usefulness of cutoff values of fit indices. Psychological Methods, 16(3), 319-336. 10.1037/a0024917
- Horn J. L., McArdle J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3), 117-144. 10.1080/03610739208253916
- Horn J. L., McArdle J. J., Mason R. (1983). When is invariance not invariant: A practical scientist's look at the ethereal concept of factor invariance. Southern Psychologist, 1(4), 179-188.
- Johnson E. C., Meade A. W., DuVernet A. M. (2009). The role of referent indicators in tests of measurement invariance. Structural Equation Modeling, 16(4), 642-657. 10.1080/10705510903206014
- Jöreskog K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 409-426. 10.1007/BF02291366
- Jung E., Yoon M. (2016). Comparisons of three empirical methods for partial factorial invariance: Forward, backward, and factor-ratio tests. Structural Equation Modeling, 23(4), 567-584. 10.1080/10705511.2015.1138092
- Kim E. S., Yoon M. (2011). Testing measurement invariance: A comparison of multiple group categorical CFA and IRT. Structural Equation Modeling, 18(2), 212-228. 10.1080/10705511.2011.557337
- Kim E. S., Yoon M., Lee T. (2012). Testing measurement invariance using MIMIC: Likelihood ratio test with a critical value adjustment. Educational and Psychological Measurement, 72(3), 469-492. 10.1177/0013164411427395
- Lopez Rivas G. E., Stark S., Chernyshenko O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Applied Psychological Measurement, 33(4), 251-265. 10.1177/0146621608321760
- McNeish D., An J., Hancock G. R. (2018). The thorny relation between measurement quality and fit index cutoffs in latent variable models. Journal of Personality Assessment, 100(1), 43-52. 10.1080/00223891.2017.1281286
- Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728-743. 10.1037/a0018966
- Meade A. W., Lautenschlager G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11(1), 60-72. 10.1207/S15328007SEM1101_5
- Meade A. W., Wright N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. Journal of Applied Psychology, 97(5), 1016-1031. 10.1037/a0027934
- Meredith W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543. 10.1007/BF02294825
- Millsap R. E. (2012). Statistical approaches to measurement invariance. Routledge. 10.4324/9780203821961
- Millsap R. E., Kwok O. M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9(1), 93-115. 10.1037/1082-989X.9.1.93
- Muthén B., Asparouhov T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313-335. 10.1037/a0026802
- Raykov T., Marcoulides G. A., Harrison M., Zhang M. (2019). On the dependability of a popular procedure for studying measurement invariance: A cause for concern? Structural Equation Modeling. Advance online publication. 10.1080/10705511.2019.1610409
- Raykov T., Marcoulides G. A., Millsap R. E. (2013). Factorial invariance in multiple populations: A multiple testing procedure. Educational and Psychological Measurement, 73(4), 713-727. 10.1177/0013164412451978
- Rensvold R. B., Cheung G. W. (1998). Testing measurement models for factorial invariance: A systematic approach. Educational and Psychological Measurement, 58(6), 1017-1034. 10.1177/0013164498058006010
- Shi D., Lee T., Maydeu-Olivares A. (2018). Understanding the model size effect on SEM fit indices. Educational and Psychological Measurement, 79(2), 310-334. 10.1177/0013164418783530
- Shi D., Song H., DiStefano C., Maydeu-Olivares A., McDaniel H. L., Jiang Z. (2018). Evaluating factorial invariance: An interval estimation approach using Bayesian structural equation modeling. Multivariate Behavioral Research, 54(2), 224-245. 10.1080/00273171.2018.1514484
- Shi D., Song H., Lewis M. D. (2017). The impact of partial factorial invariance on cross-group comparisons. Assessment, 26(7), 1217-1233. 10.1177/1073191117711020
- Shi D., Song H., Liao X., Terry R., Snyder L. A. (2017). Bayesian SEM for specification search problems in testing factorial invariance. Multivariate Behavioral Research, 52(4), 430-444. 10.1080/00273171.2017.1306432
- Stark S., Chernyshenko O. S., Drasgow F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292-1306. 10.1037/0021-9010.91.6.1292
- Steenkamp J. B. E., Baumgartner H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78-90. 10.1086/209528
- Steinmetz H. (2011). Estimation and comparison of latent means across cultures. In Davidov E., Schmidt P., Billiet J. (Eds.), Cross-cultural analysis: Methods and applications (pp. 85-116). Psychology Press.
- Vandenberg R. J., Lance C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70. 10.1177/109442810031002
- Wasserman L. (2004). All of statistics: A concise course in statistical inference. Springer Science & Business Media.
- Widaman K. F., Reise S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In Bryant K. J., Windle M., West S. G. (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281-324). American Psychological Association. 10.1037/10222-009
- Woods C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42-57. 10.1177/0146621607314044
- Yoon M., Millsap R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14(3), 435-463. 10.1080/10705510701301677