Abstract
Conventional approaches for selecting a reference indicator (RI) can lead to misleading results in testing for measurement invariance (MI). Several newer quantitative methods have become available for more rigorous RI selection. However, it remains unknown how well these methods perform in correctly identifying a truly invariant item as the RI. Study 1 was designed to address this issue under various conditions using simulated data. As a follow-up, Study 2 further investigated the advantages and disadvantages of RI-based approaches to MI testing in comparison with non-RI-based approaches. Together, the two studies provide a thorough examination of how the choice of RI matters in MI tests. In addition, a large sample of real-world data was used to empirically compare the RI selection methods as well as the RI-based and non-RI-based approaches to MI testing. We close with a discussion of all these methods, followed by suggestions and recommendations for applied researchers.
Keywords: reference indicator, factorial invariance, multiple-group CFA, measurement invariance
Factorial invariance tests serve as an important tool for establishing measurement invariance (MI) across groups, particularly when scores from self-report measures are being compared (Horn & McArdle, 1992; Meredith, 1993; Shi, Song, & Lewis, 2017). These tests help examine the degree to which observed differences reflect differences in the underlying, unobserved latent constructs across groups. Important questions can be addressed with this technique: for instance, does a mean difference in a measure of depression between males and females entirely reflect a gender difference in the latent depression trait? Or is the observed difference contaminated by differences in the psychometric properties of the measure across gender groups?
In fact, if a measure behaves differently across groups due to differences in social norms, cultural norms, or response tendencies, any comparison of the observed composites of this measure (such as t-tests or ANOVAs) will likely lead to ambiguous conclusions. Research has shown that departures from measurement equivalence weaken the accuracy of selection based on composite scores (Millsap & Kwok, 2004), and cross-group differences in composite scores could reflect differences in the psychometric properties of the measure in use (Steinmetz, 2011). Without testing for MI, one cannot be certain whether observed differences across groups truly reflect differences in the underlying latent constructs. Establishing MI has been increasingly recognized as a prerequisite for examining mean differences across groups or mean changes over time.
Factorial invariance tests, testing for MI in the framework of structural equation modeling (SEM), are conducted using techniques of multiple-group confirmatory factor analysis (CFA; Byrne et al., 1989; Horn et al., 1983; Jöreskog, 1971; Meredith, 1993; Millsap, 2012; Steenkamp & Baumgartner, 1998; Widaman & Reise, 1997; also, see Vandenberg & Lance, 2000 for a review). The tests typically begin with fitting a baseline model, where the configuration of the factorial structure is set to be identical across groups. To identify this model, a commonly used method is to constrain the factor loading (and intercept) of one particular item to be equal across groups. Such an item is referred to as the reference indicator (RI). All other parameters are then estimated in reference to the metric of this item. If this baseline model is tenable, a series of multiple-group CFA models is then fitted by imposing an increasing number of equality constraints that correspond to different levels of invariance. For example, weak factorial invariance assumes all factor loadings are numerically equivalent across groups, whereas strong factorial invariance assumes all factor loadings and intercepts are equal across groups (e.g., Widaman & Reise, 1997).
In practice, an RI is conventionally chosen either as a random item or as the item with the largest standardized factor loading. Either practice creates a dilemma in testing for factorial invariance. As Rensvold and Cheung (1998, p. 1022) pointed out, “The reason one wishes to estimate the constrained model in the first place is to test for factorial invariance, yet the procedure requires an a priori assumption of invariance with respect to the referents.” Whether the selected RI is truly invariant is critical for detecting invariance or noninvariance of the other items. Research has shown that when an inappropriate item is chosen as the RI, severe Type I or Type II errors are expected in testing factorial invariance; that is, truly invariant items can be erroneously flagged as noninvariant and vice versa (Johnson et al., 2009; Yoon & Millsap, 2007). Recent research has also shown that sizable differences in certain parameters can be missed when a reliable but noninvariant item is mistakenly used as the RI (Raykov et al., 2019). It has become evident that the conventional approach to RI selection can be problematic in testing for measurement invariance.
A possible solution to this issue is either (a) using more rigorous methods to select RIs instead of the conventional approach or (b) bypassing the use of an RI altogether in MI testing. Regarding rigorous RI selection, a few quantitative methods have been proposed, all involving a set of statistical procedures to identify the best possible invariant indicator as the RI. Some originated from item response theory, and some are SEM-based approaches. Unlike the conventional approaches, these quantitative methods make the a priori assumption of invariance with respect to the referent tenable. However, it remains unknown how well these methods perform relative to each other in identifying an invariant RI. Thus, the primary goal of this study was to compare three well-developed methods for RI selection through a comprehensive simulation study (Study 1), aiming to identify the optimal method for this purpose.
Alternatively, several other approaches are available that do not require using one specific item as an RI for MI testing (e.g., Kim & Yoon, 2011; Raykov et al., 2013; Stark et al., 2006; Yoon & Millsap, 2007). Given the availability of these non-RI-based methods, one may wonder what the benefit would be of using the RI-based methods, in which an RI is first identified using the aforementioned quantitative techniques and MI is then tested based on the chosen RI. Do these two approaches both perform well in testing MI, or does one outperform the other? The second goal of this article was to address these questions. Study 2 evaluated the performance of the RI-based approach in comparison with the non-RI-based approach in terms of the outcome of MI testing; that is, how well does each correctly identify invariant and noninvariant parameters across groups?
Methods of RI Selection
Two major categories of statistical approaches have been proposed to aid RI selection. One is all-others-as-anchors (AOAA), and the other is Bayesian SEM (BSEM). The AOAA approach originated from item response theory (IRT) and has been widely used to identify RIs when the invariance status of all items is initially unknown. AOAA begins with fitting a baseline model in which all parameters are constrained to be equal across groups. Each item then alternately serves as the target item: its parameters are freely estimated while the others remain constrained to equality. A likelihood ratio (LR) test is used to compare the fit of the two nested models; the LR statistic is approximately χ2 distributed, with degrees of freedom equal to the difference in the number of free parameters. A significant test indicates the presence of cross-group differences for the target item.
The AOAA approach subsumes two methods with different criteria for RI selection. The first, labeled MaxL in this study, chooses as the RI the item that produces a nonsignificant LR statistic and, among such items, has the largest factor loading (Lopez Rivas et al., 2009; Stark et al., 2006). This method has been recommended for its high power to detect item differences while maintaining the nominal Type I error rate (Meade & Wright, 2012). It can also outperform the BSEM approach in detecting item differences when large differences exist in factor loadings (Shi, Song, Liao, et al., 2017). However, there are methodological concerns. Woods (2009) noted that the magnitude of a factor loading does not ensure item equivalence under the MaxL approach. For instance, when items A and B both produce nonsignificant LR statistics, item A could be chosen as the RI because its factor loading is the largest, even though item B is the one that actually functions the same across groups. In this case, MaxL would fail to choose the correct RI.
The second method, labeled Minχ2 in this study, selects as the RI the item that produces the smallest LR statistic among all items (Woods, 2009). The idea behind this approach is that the magnitude of the LR statistic reflects the degree of difference in item functioning: the smaller the LR statistic, the smaller the item difference. This approach differs from MaxL in that it does not require the smallest LR statistic to be nonsignificant. Woods (2009) showed that Minχ2 performed well under a variety of data conditions in identifying truly invariant items, with power rates of 90% and above.
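The two AOAA selection rules can be sketched as follows. This is an illustrative Python sketch, not code from the study: it assumes the per-item LR statistics and factor loadings have already been obtained from the constrained/relaxed model comparisons, and the item names and values are hypothetical.

```python
# Hypothetical per-item results from the AOAA step. For each target item,
# "lr" is the likelihood-ratio statistic from comparing the fully
# constrained model against the model freeing that item (df = 2 here:
# loading and intercept freed), and "loading" is the item's estimated
# factor loading. All values are illustrative.
items = {
    "item1": {"lr": 1.2, "loading": 0.81},
    "item2": {"lr": 9.7, "loading": 0.85},
    "item3": {"lr": 0.4, "loading": 0.78},
}

CHI2_CRIT_DF2 = 5.991  # chi-square critical value, df = 2, alpha = .05

def select_ri_maxl(items, crit=CHI2_CRIT_DF2):
    """MaxL: among items whose LR test is nonsignificant, choose the one
    with the largest factor loading."""
    candidates = [j for j, v in items.items() if v["lr"] < crit]
    if not candidates:
        return None  # no item passes the invariance screen
    return max(candidates, key=lambda j: items[j]["loading"])

def select_ri_minchi2(items):
    """Minchi2: choose the item with the smallest LR statistic,
    regardless of its significance."""
    return min(items, key=lambda j: items[j]["lr"])

print(select_ri_maxl(items))     # item1: largest loading among nonsignificant items
print(select_ri_minchi2(items))  # item3: smallest LR overall
```

Note that the two rules can disagree, as they do here: MaxL screens out item2 (significant LR) and then prefers item1 for its loading, while Minχ2 picks item3 outright.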
The Bayesian SEM approach is a newer application of Bayesian methods in testing for factorial invariance (Shi, Song, Liao, et al., 2017; Shi, Song, et al., 2018). It introduces a difference parameter D to represent a parameter difference across groups: Dloading indexes factor loading differences and Dintercept indexes intercept differences. A selection index for the jth item can then be defined as the sum of the standardized difference measures of Dloading and Dintercept for this item:

Δj = |Dloading,j / SD(Dloading,j)| + |Dintercept,j / SD(Dintercept,j)|, (1)

where Dloading,j and Dintercept,j are the respective estimates of the differences in factor loadings and intercepts, and SD(Dloading,j) and SD(Dintercept,j) represent the standard deviations of those differences.
The BSEM approach imposes informative priors with zero mean and small variance on Dloading and Dintercept, which is referred to as “approximate identification constraints” (Muthén & Asparouhov, 2012). This ensures that latent factors are properly scaled and, more importantly, makes Dloading and Dintercept estimable. Once Dloading and Dintercept are estimated for item j, one can compute the selection index Δj and evaluate its posterior distribution. The item that produces the smallest posterior mean of Δj is considered to have the largest likelihood of being invariant across groups. This method yielded high power when searching for the RI under various simulation conditions (Shi, Song, Liao, et al., 2017). Power increased with fewer noninvariant items, larger magnitudes of difference, and larger sample sizes; it can be well above .90 when only 20% of items function differently across groups. That research also showed that the choice of small prior variances did not significantly affect the power of RI selection.
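The selection index in Equation 1 can be illustrated with a small Python sketch. The posterior draws below are simulated rather than taken from an actual BSEM fit, so the item names, means, and spreads are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "posterior draws" (e.g., retained MCMC iterations) for the
# cross-group difference parameters Dloading and Dintercept of three
# items. item1 is truly invariant (both differences centered at 0);
# item2 has a loading difference and item3 an intercept difference.
draws = {
    "item1": (rng.normal(0.00, 0.02, 5000), rng.normal(0.00, 0.03, 5000)),
    "item2": (rng.normal(0.15, 0.02, 5000), rng.normal(0.00, 0.03, 5000)),
    "item3": (rng.normal(0.00, 0.02, 5000), rng.normal(0.20, 0.03, 5000)),
}

def selection_index(d_loading, d_intercept):
    """Delta_j of Equation 1: sum of absolute standardized differences
    in the loading and the intercept, computed per posterior draw."""
    return (np.abs(d_loading / d_loading.std()) +
            np.abs(d_intercept / d_intercept.std()))

# Posterior mean of Delta_j for each item; the smallest flags the item
# most likely to be invariant across groups, i.e., the chosen RI.
delta_means = {j: selection_index(dl, di).mean()
               for j, (dl, di) in draws.items()}
ri = min(delta_means, key=delta_means.get)
print(ri)  # item1: both of its difference parameters are centered at zero
```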
Non-RI-Based Approach for MI Testing
We focused on the non-RI-based approach proposed by Raykov et al. (2013), partly following the reviewers’ suggestion. This approach first constrains all parameters to be equal in a baseline model and then freely estimates the parameters of one item at a time in a relaxed model. A chi-square difference test is conducted to evaluate the difference in model fit between the baseline model and each relaxed model. The resulting p values are then ranked in ascending order. A critical value l is computed for each p value using the Benjamini–Hochberg procedure (Benjamini & Hochberg, 1995; Wasserman, 2004):

lj = jα/k, (2)

where j is the rank of each tested parameter, α is the prechosen significance level for the chi-square tests, and k is the total number of tested parameters. Among the p values that fall at or below their corresponding l values, the largest one is chosen as the threshold. Finally, the parameters whose p values fall at or below this threshold are concluded to be noninvariant.
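A minimal Python sketch of this Benjamini–Hochberg step, assuming the per-parameter chi-square p values are already in hand (the parameter names and p values are hypothetical):

```python
def bh_noninvariant(p_values, alpha=0.05):
    """Flag noninvariant parameters via the Benjamini-Hochberg step
    described above (Equation 2).

    p_values maps each tested parameter to the p value of its
    chi-square difference test; returns the set of parameters whose
    p values fall at or below the BH threshold."""
    k = len(p_values)
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])  # ascending p
    threshold = 0.0
    for j, (_, p) in enumerate(ranked, start=1):
        if p <= j * alpha / k:   # compare p_(j) with l_j = j * alpha / k
            threshold = p        # keep the largest p meeting its l_j
    return {name for name, p in p_values.items() if p <= threshold}

# Hypothetical example: four tested parameters and their p values.
flags = bh_noninvariant({"a": .001, "b": .002, "c": .30, "d": .04})
print(sorted(flags))  # ['a', 'b']: only these fall at or below the threshold
```

Here "d" (p = .04) is not flagged even though it would pass an unadjusted .05 cutoff, because its rank-specific critical value l3 = 3(.05)/4 = .0375 is smaller.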
Direction Effect and RI Selection
In previous research on RI selection, a two-group CFA model was typically used as the population model in data simulation. One group served as a reference group, where factor means and variances were set to known values, and the other served as a focal group, where factor means and variances were freely estimated. A uniform direction of parameter differences was often simulated for simplicity. While factor loadings were simulated to be identical across groups for truly invariant items, they were set to be smaller in the focal group than in the reference group for items functioning differently (e.g., Meade & Wright, 2012; Shi, Song, Liao, et al., 2017; Stark et al., 2006; Woods, 2009). For instance, if the factor loadings were set to .8, .8, .8, and .8 for all four items in the reference group, they were set to .8, .6, .6, and .8 in the focal group. As a result, the truly invariant items (Items 1 and 4 in the example) happened to have larger factor loadings than the noninvariant items (Items 2 and 3). RI selection methods favoring high loadings, such as MaxL, would therefore have high power to select truly invariant items. However, such high power could merely be an artifact of simulating data with a uniform direction.
What if the direction of parameter differences is reversed? If the factor loadings are set to .6, .6, .6, and .6 for all four items in the reference group and to .6, .8, .8, and .6 in the focal group, methods like MaxL are likely to choose either Item 2 or Item 3 as the RI. In this case, the power of correctly selecting an invariant item as the RI would be low. It is therefore critical to consider the direction of parameter differences when generating data and evaluating the power of RI selection methods.
Unlike previous studies, we differentiated three directions of parameter differences in our simulation design. Positive direction refers to the case where parameter values are larger in the focal group than in the reference group; negative direction refers to the case where they are smaller in the focal group. The third, mixed direction, refers to the case where some parameters are larger and others smaller in the focal group than in the reference group. If the power of RI selection is significantly influenced by the direction of parameter differences, a direction effect is said to occur. We anticipated such an effect in our simulation study, particularly for MaxL, for the reasons given above.
In what follows, we first present Study 1, where the performance of MaxL, Minχ2, and BSEM in RI selection was comprehensively compared using simulated data. We then present Study 2, which evaluated the benefit of the RI-based approach to MI testing in comparison with the non-RI-based approach. Next, a large set of real-world data is used to empirically demonstrate the three RI selection methods as well as the RI-based and non-RI-based approaches to MI testing. We end with a discussion of the advantages and disadvantages of all these methods, followed by suggestions and recommendations for applied researchers.
Simulation Study 1: RI Selection Using MaxL, Minχ2, and BSEM
We used Mplus 7.0 for data generation and RI selection across all simulation conditions. The results of RI selection were summarized and evaluated using SAS 9.4. No cases of nonconvergence were observed for these analyses.
Data Conditions
The population model was a two-group CFA model with 10 items loading on a single factor. One group served as the reference group and the other as the focal group. The variables manipulated in the data simulation are described below.
Sample Size
Continuous data were generated with n = 100, 200, 500 per group, representing small, medium, and large samples in typical psychological research. Both groups were simulated to have equal sizes in all conditions (e.g., Shi, Song, & Lewis, 2017; Shi, Song, Liao, et al., 2017).
Location of Difference
Item differences were simulated to occur on either factor loadings or intercepts, never on both at the same time (e.g., Shi, Song, Liao, et al., 2017).
Percentage of Noninvariant Items
Consistent with previous simulation research (e.g., French & Finch, 2008; Meade & Wright, 2012), we simulated data with either 20% or 40% of noninvariant items in this investigation. This corresponded to the cases where either two or four items (out of 10 items) function differently across the two groups.
Magnitude of Difference
The magnitude of cross-group differences was set to 0.2 and 0.4 for factor loadings, and 0.3 and 0.6 for intercepts. The former values for the parameter differences were considered to be small, and the latter values were considered to be relatively large (e.g., Kim et al., 2012; Kim & Yoon, 2011; Meade & Lautenschlager, 2004; Shi, Song, & Lewis, 2017).
Direction of Cross-Group Difference
Three directions were manipulated for factor loadings and intercepts, including positive, negative, and mixed directions.
In total, 72 data conditions were generated by fully crossing three sample sizes, two locations of difference, two percentages of noninvariant items, two magnitudes of difference in parameters, and three directions of differences. Each condition had 500 replications.
Data Simulation
The factor mean and variance were set, respectively, to 0 and 1 in the reference group. The raw factor loadings, intercepts, and unique variance were set to .8, 0, and .36, respectively, for all items. In focal groups, factor mean and variance were set to .5 and 1.2, respectively, and uniqueness was set to .36 for all items. All factor loadings and intercepts in focal groups were generated to be equal to those in reference groups, except for the items that were manipulated to be different under certain conditions.
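The population model above can be sketched in Python as follows. This is an illustrative data generator, not the Mplus setup used in the study; the focal-group loadings shown assume a condition with two noninvariant items and a .2 difference in the positive direction.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_group(n, loadings, intercepts, factor_mean, factor_var,
                   unique_var=0.36):
    """Generate continuous responses from a one-factor model:
    x_ij = intercept_j + loading_j * eta_i + e_ij."""
    loadings = np.asarray(loadings, float)
    intercepts = np.asarray(intercepts, float)
    eta = rng.normal(factor_mean, np.sqrt(factor_var), size=n)
    errors = rng.normal(0.0, np.sqrt(unique_var), size=(n, len(loadings)))
    return intercepts + np.outer(eta, loadings) + errors

# Reference group: all loadings .8, intercepts 0, factor ~ N(0, 1).
ref = simulate_group(200, [0.8] * 10, [0.0] * 10, 0.0, 1.0)

# Focal group: factor mean .5, variance 1.2; two loadings raised by .2
# (positive direction), mirroring one cell of the design described above.
focal_loadings = [0.8] * 10
focal_loadings[1] = focal_loadings[2] = 1.0
focal = simulate_group(200, focal_loadings, [0.0] * 10, 0.5, 1.2)
```

Each simulation condition would vary the sample size, the location, percentage, magnitude, and direction of the differences in the same way.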
Data Analysis
Three methods were used to analyze the simulated data, including MaxL, Minχ2, and BSEM. In all analyses, the factor mean and variance were fixed to be 0 and 1, respectively, in the reference groups. All the other parameters were freely estimated except for those required to be constrained by the procedures.
In using the MaxL method, the baseline model constrained all items to be equal across the focal and reference groups. The equality constraints were then relaxed one item at a time, yielding a reduced model. The between-group differences in the target item were examined using the likelihood ratio test. This procedure was repeated for all items in the model. Eventually, the reference indicator was chosen as the item that produced a nonsignificant LR statistic and also had the largest factor loading. When the Minχ2 approach was used, the significance of the LR statistic was not a concern; instead, the LR statistics were rank ordered for all items, and the reference indicator was chosen as the item yielding the smallest LR.1
When BSEM was used, the difference parameter D was computed for each factor loading (Dloading) and each intercept (Dintercept) across groups. After imposing normal priors with zero mean and a small variance of 0.001 on D, Markov chain Monte Carlo (MCMC) simulations were run for a minimum of 50,000 and a maximum of 100,000 iterations. The estimates at every 10th iteration were retained to form posterior distributions for factor loadings and intercepts. The means and standard deviations of these posterior distributions were then computed. A selection index Δj was then computed for each item, summarizing the standardized differences in both the factor loading and the intercept. The item with the smallest value of Δj was selected as the reference indicator.
Results of Study 1
We used power rates to evaluate the performance of each method. The power rate was calculated as the percentage of replications that correctly selected a truly invariant item as the RI under each condition. In addition, ANOVAs were performed on the power rates to test the effects of all six variables.
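As a small illustration of this power-rate computation (the replication results and item names below are hypothetical):

```python
# Items simulated to be truly invariant in a given condition, and the RI
# selected by a method in each of several replications (illustrative;
# the study used 500 replications per condition).
truly_invariant = {"item1", "item4", "item5"}
selected = ["item1", "item4", "item2", "item1", "item5"]

# Power rate: proportion of replications selecting a truly invariant item.
power = sum(ri in truly_invariant for ri in selected) / len(selected)
print(power)  # 0.8: four of the five selections are correct
```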
The power rates under all conditions are summarized in Table 1. An ANOVA was performed on these power rates to test the main effects of each of the six variables. The effect of method was significant (Table 2; F(2, 206) = 25.507, p < .001, ηp2 = .199), with Minχ2 and BSEM performing better than MaxL (ps < .001). Figures 1 and 2 also show that under multiple conditions MaxL produced low power rates, some even lower than the power rate of selecting a random item as the RI. This occurred in 50% of the conditions (12 of 24 in Table 1) when the direction of parameter differences was positive. However, this was not the case for Minχ2 and BSEM: neither method produced lower-than-random power rates.
Table 1.
Power Rates of Selecting a Correct Reference Indicator in Study 1.
Positive | Negative | Mixed | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LO | PE | MA | SS | AR | MaxL | Minχ2 | BSEM | MaxL | Minχ2 | BSEM | MaxL | Minχ2 | BSEM
Factor loading | 20% | .2 | 100 | .80 | .19 | .95 | .95 | 1.00 | .96 | .95 | .65 | .99 | .98 |
200 | .80 | .44 | .99 | .98 | 1.00 | .99 | 1.00 | .88 | 1.00 | 1.00 | |||
500 | .80 | .95 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
.4 | 100 | .80 | .85 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | .99 | 1.00 | 1.00 | ||
200 | .80 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
40% | .2 | 100 | .60 | .01 | .70 | .73 | 1.00 | .79 | .76 | .36 | .95 | .96 | |
200 | .60 | .01 | .78 | .79 | 1.00 | .90 | .90 | .71 | 1.00 | .99 | |||
500 | .60 | .38 | .89 | .84 | 1.00 | .98 | .99 | 1.00 | 1.00 | 1.00 | |||
.4 | 100 | .60 | .06 | .77 | .78 | 1.00 | .99 | .99 | .96 | 1.00 | 1.00 | ||
200 | .60 | .45 | .83 | .79 | .98 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .07 | .90 | .80 | .25 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | |||
Intercept | 20% | .3 | 100 | .80 | .82 | 1.00 | .99 | .97 | .99 | .99 | .98 | 1.00 | 1.00 |
200 | .80 | .97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
.6 | 100 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ||
200 | .80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
500 | .80 | .90 | 1.00 | 1.00 | .99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |||
40% | .3 | 100 | .60 | .33 | .84 | .84 | .88 | .87 | .86 | .93 | 1.00 | .99 | |
200 | .60 | .49 | .92 | .92 | .95 | .94 | .92 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .62 | .96 | .96 | .77 | .98 | .99 | 1.00 | 1.00 | 1.00 | |||
.6 | 100 | .60 | .72 | .95 | .94 | .93 | .98 | .98 | 1.00 | 1.00 | 1.00 | ||
200 | .60 | .15 | .97 | .97 | .27 | .99 | .99 | 1.00 | 1.00 | 1.00 | |||
500 | .60 | .00 | .94 | .99 | .00 | .95 | .99 | 1.00 | 1.00 | 1.00 |
Note. LO = Location of noninvariance; PE = percentage of noninvariance; MA = magnitude of noninvariance; SS = sample size; AR = power rates at random; BSEM = Bayesian structural equation model. These abbreviations represent the same definitions for all other tables.
Table 2.
Effects of the Studied Variables on Power Rates in Study 1.
ANOVA 1 | ANOVA 2 | |||||||
---|---|---|---|---|---|---|---|---|
df | F | p | ηp2 | df | F | p | ηp2 | |
Location (LO) | 1 | 3.297 | .071 | .016 | 1 | 11.736 | .001 | .096 |
Percentage (PE) | 1 | 33.608 | <.001 | .140 | 1 | 119.617 | <.001 | .521 |
Magnitude (MA) | 1 | 0.690 | .407 | .003 | 1 | 2.455 | .120 | .022 |
Direction (DI) | 2 | 19.623 | <.001 | .160 | 2 | 69.842 | <.001 | .559 |
SampleSize (SS) | 2 | 0.583 | .559 | .006 | 2 | 2.074 | .131 | .036 |
Method (ME) | 2 | 25.507 | <.001 | .199 | 2 | 90.782 | <.001 | .623 |
ME × MA | 2 | 0.232 | .794 | .004 | ||||
ME × LO | 2 | 1.198 | .306 | .021 | ||||
ME × PE | 2 | 37.235 | <.001 | .404 | ||||
ME × DI | 4 | 28.154 | <.001 | .506 | ||||
ME × SS | 4 | 0.215 | .930 | .008 | ||||
PE × MA | 1 | 2.794 | .097 | .025 | ||||
PE × LO | 1 | 0.299 | .585 | .003 | ||||
PE × DI | 2 | 36.894 | <.001 | .402 | ||||
PE × SS | 2 | 0.722 | .488 | .013 | ||||
LO × MA | 1 | 10.055 | .002 | .084 | ||||
LO × DI | 2 | 12.984 | <.001 | .191 | ||||
LO × SS | 2 | 5.464 | .005 | .090 | ||||
DI × MA | 2 | 3.946 | .022 | .067 | ||||
DI × SS | 4 | 2.825 | .028 | .093 | ||||
MA × SS | 2 | 36.894 | <.001 | .232 | ||||
ME × MA × PE | 2 | 9.400 | <.001 | .146 | ||||
ME × MA × LO | 2 | 7.056 | .001 | .114 | ||||
ME × MA × DI | 4 | 7.964 | <.001 | .225 | ||||
ME × MA × SS | 4 | 7.642 | <.001 | .218 | ||||
ME × DI × PE | 4 | 9.840 | <.001 | .264 | ||||
ME × DI × LO | 4 | 5.529 | <.001 | .167 | ||||
ME × DI × SS | 8 | 3.779 | .001 | .216 | ||||
ME × SS × PE | 4 | 4.060 | .004 | .098 | ||||
ME × SS × LO | 4 | 3.000 | .022 | .129 | ||||
ME × LO × PE | 2 | 1.638 | .199 | .029 | ||||
LO × PE × DI | 2 | 3.506 | .033 | .060 | ||||
LO × PE × MA | 1 | 0.223 | .638 | .002 | ||||
LO × PE × SS | 2 | 0.721 | .489 | .013 | ||||
LO × MA × DI | 2 | 0.291 | .748 | .005 | ||||
LO × MA × SS | 2 | 1.604 | .206 | .028 | ||||
LO × DI × SS | 4 | 0.640 | .635 | .023 | ||||
PE × MA × DI | 2 | 2.151 | .121 | .038 | ||||
PE × MA × SS | 2 | 4.322 | .016 | .073 | ||||
PE × DI × SS | 4 | 0.973 | .426 | .034 | ||||
MA × DI × SS | 4 | 1.062 | .379 | .037 | ||||
Residuals | 206 | 110 |
Figure 1.
Power rates for Bayesian structural equation model (BSEM), MaxL, and Minχ2 when the percentage of noninvariant factor loadings = 20% versus 40% and the magnitude of differences = 0.2 versus 0.4. Note. The reference line in each individual graph indicates the power rate of randomly selecting an item as the reference indicator (RI).
Figure 2.
Power rates for Bayesian structural equation model (BSEM), MaxL, and Minχ2 when the percentage of noninvariant intercepts = 20% versus 40% and the magnitude of differences = 0.3 versus 0.6. Note. The reference line in each individual graph indicates the power rate of randomly selecting an item as the reference indicator (RI).
The effect of direction was significant (F(2, 206) = 19.623, p < .001, ηp2 = .160), and the average power rate in the positive condition was lower than those in the negative and mixed conditions (ps < .001). The direction effect was thus evident. More specifically, Figures 1 and 2 indicate that (a) the direction effect was greater for MaxL than for Minχ2 and BSEM and (b) factor loadings were more susceptible to direction effects than intercepts, suggesting possible interactions among these variables.
The effect of percentage was significant (F(1, 206) = 33.608, p < .001, ηp2 = .140). Table 1 shows that conditions with 40% noninvariant items produced lower power rates than those with 20% (p < .001). This occurred for factor loadings (see Figure 1) as well as for intercepts (see Figure 2).
Having examined the main effects, we next ran a full ANOVA model including all interactions up to three-way among the six variables. Our focus here was the significance of the interaction effects. Four-way interactions could not be examined due to a limitation of the data: too few scores per cell to provide enough variation. This ANOVA thus included six main effects, 15 two-way interactions, and 20 three-way interactions. The results are presented under ANOVA 2 in Table 2; only the effects of direct importance are reported further.
We first looked at the three-way interactions containing two-way interactions of method×direction. For a significant three-way interaction like this, we examined the interaction of method×direction at each level of the third variable. If this interaction was significant at a certain level of the third variable, we then tested for simple effects. Pairwise comparisons were made thereafter by using Bonferroni correction to adjust for the level of significance.
Significant three-way interactions included (see Table 2): method×direction×percentage (F(4, 110) = 9.840, p < .001, ηp2 = .264), method×direction×sample size (F(8, 110) = 3.779, p = .001, ηp2 = .216), method×direction×magnitude (F(4, 110) = 7.964, p < .001, ηp2 = .225), and method×direction×location (F(4, 110) = 5.529, p < .001, ηp2 = .167). The two-way interaction of method×direction (Table 3) was significant at each level of percentage (20% and 40%), sample size (n = 100, 200, 500), magnitude (small and large), and location (loadings and intercepts). These interaction effects are displayed in Figure 3. As reported in Table 4, the subsequent pairwise comparisons showed that (a) under the positive condition, Minχ2 and BSEM consistently outperformed MaxL at all levels of percentage, sample size, magnitude, and location; (b) under the negative condition, this was true only for percentage = 40% and magnitude = large; and (c) under the mixed condition, the three methods performed similarly.
Table 3.
The Interaction Between Methods and Directions on Power Rates at Each Level of Other Studied Variables.
Positive | Negative | Mixed | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
df | F | p | ηp2 | df | F | p | ηp2 | df | F | p | ηp2 | |
PE | 20% | 2 | 4.900 | .008 | .047 | 2 | <.001 | .999 | <.001 | 2 | 0.350 | .707 | 0.004 |
40% | 2 | 74.79 | <.001 | .430 | 2 | 8.030 | <.001 | .075 | 2 | 1.440 | .241 | 0.014 | |
SS | N = 100 | 2 | 14.090 | <.001 | .130 | 2 | 0.070 | .932 | .001 | 2 | 1.520 | .221 | 0.016 |
N = 200 | 2 | 11.840 | <.001 | .111 | 2 | 0.500 | .608 | .005 | 2 | 0.220 | .803 | 0.002 | |
N = 500 | 2 | 9.940 | <.001 | .095 | 2 | 4.980 | .008 | .050 | 2 | 0.000 | 1.000 | <.001 | |
MA | Small | 2 | 21.530 | <.001 | .179 | 2 | 0.030 | .966 | <.001 | 2 | 1.880 | .155 | 0.019 |
Large | 2 | 15.870 | <.001 | .138 | 2 | 5.830 | .004 | .056 | 2 | 0.000 | 1.000 | <.001 | |
LO | Loadings | 2 | 27.300 | <.001 | .216 | 2 | 0.120 | .883 | .001 | 2 | 1.840 | .162 | 0.018 |
Intercepts | 2 | 12.390 | <.001 | .111 | 2 | 3.650 | .028 | .036 | 2 | 0.010 | .993 | <.001 |
Figure 3.
The interaction effect of methods and directions at each level of other studied variables.
Table 4.
Simple Effect of Direction and Methods on Power Rates at Each Level of Other Studied Variables.
Methods | Positive | Negative | Mixed | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Comparison | Diff | t | p | Adj p | Diff | t | p | Adj p | Diff | t | p | Adj p | ||||
PE | 20% | MaxL | — | Minχ2 | −.153 | −2.730 | .007 | .063 | .002 | 0.030 | .976 | 1.000 | −.041 | −0.730 | .466 | 1.000 |
MaxL | — | BSEM | −.151 | −2.700 | .008 | .069 | .002 | 0.030 | .976 | 1.000 | −.040 | −0.720 | .475 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .976 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .988 | 1.000 | ||
40% | MaxL | — | Minχ2 | −.597 | −10.670 | <.001 | <.001 | −.193 | −3.460 | .001 | .006 | −.083 | −1.470 | .142 | 1.000 | |
MaxL | — | BSEM | −.588 | −10.520 | <.001 | <.001 | −.195 | −3.490 | .001 | .005 | −.082 | −1.460 | .146 | 1.000 | ||
Minχ2 | — | BSEM | .008 | 0.150 | .882 | 1.000 | −.002 | −0.030 | .976 | 1.000 | .001 | 0.010 | .988 | 1.000 | ||
SS | N = 100 | MaxL | — | Minχ2 | −.404 | −4.580 | <.001 | <.001 | .025 | 0.280 | .777 | 1.000 | −.134 | −1.520 | .131 | 1.000 |
MaxL | — | BSEM | −.406 | −4.610 | <.001 | <.001 | .031 | 0.350 | .723 | 1.000 | −.133 | −1.500 | .134 | 1.000 | ||
Minχ2 | — | BSEM | −.003 | −0.030 | .997 | 1.000 | .006 | 0.070 | .944 | 1.000 | .001 | 0.010 | .989 | 1.000 | ||
N = 200 | MaxL | — | Minχ2 | −.374 | −4.240 | <.001 | <.001 | −.076 | −0.870 | .388 | 1.000 | −.051 | −0.580 | .561 | 1.000 | |
MaxL | — | BSEM | −.369 | −4.190 | <.001 | <.001 | −.076 | −0.870 | .388 | 1.000 | −.050 | −0.570 | .571 | 1.000 | ||
Minχ2 | — | BSEM | .005 | 0.060 | .955 | 1.000 | .000 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .989 | 1.000 | ||
N = 500 | MaxL | — | Minχ2 | −.346 | −3.930 | <.001 | .001 | −.236 | −2.680 | .008 | .072 | <.001 | 0.000 | 1.000 | 1.000 | |
MaxL | — | BSEM | −.334 | −3.790 | <.001 | .002 | −.245 | −2.780 | .006 | .054 | <−.001 | 0.000 | 1.000 | 1.000 | ||
Minχ2 | — | BSEM | .013 | 0.140 | .887 | 1.000 | −.009 | −0.100 | .921 | 1.000 | <−.001 | 0.000 | 1.000 | 1.000 | ||
MA | Small | MaxL | — | Minχ2 | −.402 | −5.700 | <.001 | <.001 | .014 | 0.200 | .841 | 1.000 | −.119 | −1.690 | .092 | .831 |
MaxL | — | BSEM | −.400 | −5.670 | <.001 | <.001 | .018 | 0.250 | .804 | 1.000 | −.118 | −1.670 | .097 | .873 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .972 | 1.000 | .003 | 0.050 | .962 | 1.000 | .002 | 0.020 | .981 | 1.000 | ||
Large | MaxL | — | Minχ2 | −.348 | −4.930 | <.001 | <.001 | −.206 | −2.920 | .004 | .035 | −.004 | −0.060 | .953 | 1.000 | |
MaxL | — | BSEM | −.340 | −4.830 | <.001 | <.001 | −.211 | −2.990 | .003 | .028 | −.004 | −0.060 | .953 | 1.000 | ||
Minχ2 | — | BSEM | .008 | 0.110 | .915 | 1.000 | −.005 | −0.070 | .944 | 1.000 | <−.001 | 0.000 | 1.000 | 1.000 | ||
LO | Loadings | MaxL | — | Minχ2 | −.451 | −6.490 | <.001 | <.001 | −.030 | −0.430 | .666 | 1.000 | −.116 | −1.670 | .097 | .874 |
MaxL | — | BSEM | −.438 | −6.310 | <.001 | <.001 | −.030 | −0.430 | .666 | 1.000 | −.115 | −1.650 | .100 | .896 | ||
Minχ2 | — | BSEM | .013 | 0.180 | .857 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | .001 | 0.010 | .990 | 1.000 | ||
Intercepts | MaxL | — | Minχ2 | −.298 | −4.290 | <.001 | <.001 | −.162 | −2.330 | .021 | .189 | −.008 | −0.110 | .914 | 1.000 | |
MaxL | — | BSEM | −.301 | −4.330 | <.001 | <.001 | −.163 | −2.350 | .020 | .178 | −.007 | −0.100 | .924 | 1.000 | ||
Minχ2 | — | BSEM | −.003 | −0.040 | .971 | 1.000 | −.002 | −0.020 | .981 | 1.000 | .001 | 0.010 | .990 | 1.000 |
Note. Adj p = Bonferroni correction for familywise error rate.
We then examined the three-way interactions containing the two-way interaction of method×magnitude. All of these three-way interactions were significant (see Table 2): method×magnitude×percentage (F = 9.400, p < .001, η² = .146), method×magnitude×sample size (F = 7.642, p < .001, η² = .218), method×magnitude×direction (F = 7.964, p < .001, η² = .225), and method×magnitude×location (F = 7.056, p = .001, η² = .114). Figure 4 (and Table 5) display the two-way interactions of method×magnitude at each level of percentage, sample size, direction, and location. Table 6 shows the results of pairwise comparisons after Bonferroni correction. When the between-group differences in parameters were small, Minχ2 and BSEM outperformed MaxL at percentage = 40%, sample size = 100, direction = positive, and location = loadings, and did not perform differently under the other conditions. When the parameter differences were large, Minχ2 and BSEM outperformed MaxL at percentage = 40%, sample size = 500, direction = positive & negative, and location = intercepts, and they did not perform differently under the other conditions.
Figure 4.
The interaction effect of methods and magnitudes at each level of other studied variables.
Table 5.
The Interaction Between Methods and Magnitudes on Power Rates at Each Level of Other Studied Variables.
Small | Large | ||||||||
---|---|---|---|---|---|---|---|---|---|
df | F | p | η² | df | F | p | η² |
PE | 20% | 2 | 2.330 | .100 | .022 | 2 | 0.050 | .956 | <.001 |
40% | 2 | 9.400 | <.001 | .084 | 2 | 23.67 | <.001 | .188 | |
SS | N = 100 | 2 | 5.980 | .003 | .057 | 2 | 0.990 | .374 | .010 |
N = 200 | 2 | 3.020 | .051 | .030 | 2 | 2.630 | .074 | .026 | |
N = 500 | 2 | 0.820 | .441 | .008 | 2 | 9.060 | <.001 | .084 | |
DR | Positive | 2 | 21.530 | <.001 | .179 | 2 | 15.870 | <.001 | .138 |
Negative | 2 | 0.030 | .966 | <.001 | 2 | 5.830 | .004 | .056 | |
Mixed | 2 | 1.880 | .155 | .019 | 2 | 0.000 | .998 | <.001 |
LO | Loadings | 2 | 8.600 | <.001 | .078 | 2 | 3.750 | .025 | .036 |
Intercepts | 2 | 1.480 | .230 | .014 | 2 | 7.050 | .001 | .065 |
Table 6.
Simple Effect of Magnitudes and Methods on Power Rates at Each Level of Other Studied Variables.
Methods | Simple effect | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Small magnitude | Large magnitude | |||||||||||
Comparison | Diff | t | p | Adj p | Diff | t | p | Adj p | ||||
PE | 20% | MaxL | — | Minχ2 | −.112 | −1.880 | .061 | .367 | −.016 | −0.260 | .794 | 1.000 |
MaxL | — | BSEM | −.111 | −1.850 | .065 | .391 | −.016 | −0.260 | .794 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .978 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
40% | MaxL | — | Minχ2 | −.223 | −3.780 | <.001 | .001 | −.356 | −5.970 | <.001 | <.001 | |
MaxL | — | BSEM | −.222 | −3.730 | <.001 | .002 | −.354 | −5.940 | <.001 | <.001 | ||
Minχ2 | — | BSEM | .003 | 0.060 | .956 | 1.000 | .002 | 0.030 | .978 | 1.000 | ||
SS | N = 100 | MaxL | — | Minχ2 | −.243 | −3.020 | .003 | .017 | −.098 | −1.220 | .225 | 1.000 |
MaxL | — | BSEM | −.240 | −2.970 | .003 | .020 | −.098 | −1.220 | .225 | 1.000 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .967 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
N = 200 | MaxL | — | Minχ2 | −.173 | −2.140 | .034 | .203 | −.162 | −2.000 | .047 | .279 | |
MaxL | — | BSEM | −.171 | −2.120 | .036 | .213 | −.159 | −1.970 | .050 | .300 | ||
Minχ2 | — | BSEM | .002 | 0.020 | .984 | 1.000 | .003 | 0.030 | .975 | 1.000 | ||
N = 500 | MaxL | — | Minχ2 | −.091 | −1.130 | .262 | 1.000 | −.298 | −3.690 | <.001 | .002 | |
MaxL | — | BSEM | −.088 | −1.090 | .275 | 1.000 | −.298 | −3.690 | <.001 | .002 | ||
Minχ2 | — | BSEM | .003 | 0.030 | .975 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
DI | Positive | MaxL | — | Minχ2 | −.402 | −5.700 | <.001 | <.001 | −.348 | −4.930 | <.001 | <.001 |
MaxL | — | BSEM | −.399 | −5.670 | <.001 | <.001 | −.340 | −4.830 | <.001 | <.001 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .972 | 1.000 | .008 | 0.110 | .915 | 1.000 | ||
Negative | MaxL | — | Minχ2 | .014 | 0.200 | .841 | 1.000 | −.206 | −2.920 | .004 | .023 | |
MaxL | — | BSEM | .018 | 0.250 | .804 | 1.000 | −.211 | −2.990 | .003 | .019 | ||
Minχ2 | — | BSEM | .003 | 0.050 | .962 | 1.000 | −.005 | −0.070 | .944 | 1.000 | ||
Mixed | MaxL | — | Minχ2 | −.119 | −1.690 | .092 | .554 | −.004 | −0.060 | .953 | 1.000 |
MaxL | — | BSEM | −.118 | −1.670 | .100 | .582 | −.004 | −0.060 | .953 | 1.000 | ||
Minχ2 | — | BSEM | .002 | 0.020 | .981 | 1.000 | <.001 | 0.000 | 1.000 | 1.000 | ||
LO | Loadings | MaxL | — | Minχ2 | −.238 | −3.610 | <.001 | .002 | −.159 | −2.420 | .017 | .099 |
MaxL | — | BSEM | −.236 | −3.570 | <.001 | .003 | −.153 | −2.320 | .021 | .127 | ||
Minχ2 | — | BSEM | .003 | 0.040 | .967 | 1.000 | .006 | 0.090 | .926 | 1.000 | ||
Intercepts | MaxL | — | Minχ2 | −.099 | −1.510 | .133 | .799 | −.212 | −3.220 | .002 | .009 | |
MaxL | — | BSEM | −.097 | −1.470 | .142 | .852 | −.217 | −3.280 | .001 | .007 | ||
Minχ2 | — | BSEM | .002 | 0.030 | .973 | 1.000 | −.004 | −0.070 | .946 | 1.000 |
Note. Adj p = Bonferroni correction for familywise error rate.
Simulation Study 2: MI Testing With RI-Based and Non-RI-Based Approaches
The aim of Study 2 was to compare the RI-based approach with the non-RI-based approach for testing measurement invariance. We used the data generated in Study 1 for this purpose. In Study 2, we chose Minχ2 as the representative technique for selecting an RI, because Study 1 showed that, in general, Minχ2 behaved well in identifying an appropriate RI. We anticipated that once the best possible invariant RI was selected, the RI-based approach would lead to satisfactory MI outcomes.
In testing for MI using the RI-based approach, an RI was first chosen by the Minχ2 method, and the baseline model was then fitted by setting the RI to be equal across groups while allowing all other parameters to be freely estimated. Each of the other parameters was then constrained, one at a time, to be equal across groups, leading to a reduced model. The fit difference between the baseline model and each reduced model was evaluated with an LR test. If an LR test turned out to be nonsignificant, the parameter constrained in the reduced model was concluded to be invariant across groups. In Study 2, the non-RI-based approach for MI testing utilized the procedure proposed by Raykov et al. (2013). The l values were computed using Equation (2) based on a significance level α = .05. Both methods were elucidated in the Introduction section.
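The LR comparison at the heart of this procedure can be sketched in a few lines. The chi-square values below are taken from Table 12 (the intercept of Item 13 versus the baseline model), the one-parameter constraint gives 1 df, and the function name is ours:

```python
import math

def lr_test_1df(chisq_reduced, chisq_baseline, alpha=0.05):
    """Likelihood-ratio test for a single equality constraint (df = 1).

    For a chi-square variate with 1 df, P(X > x) = erfc(sqrt(x / 2)),
    so no external statistics library is needed.
    """
    delta = chisq_reduced - chisq_baseline
    p = math.erfc(math.sqrt(delta / 2.0))
    return delta, p, p >= alpha  # True -> constrained parameter deemed invariant

# Constraining Item 13's intercept raises the chi-square from the
# baseline 6394.829 (df = 270) to 6441.861 (df = 271); see Table 12.
delta, p, invariant = lr_test_1df(6441.861, 6394.829)
print(round(delta, 3), invariant)  # 47.032 False -> flagged as noninvariant
```

The same comparison with df = 2 (as in Table 9, where loading and intercept are tested jointly) would use the chi-square survival function with 2 degrees of freedom instead.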
Three criteria were used to evaluate the performance of the RI-based and non-RI-based approaches (e.g., Jung & Yoon, 2016). The first was the item power rate, computed as the ratio of the total number of detected noninvariant parameters to the total number of generated noninvariant parameters in each condition across all 500 replications. The second was the item Type I error rate, computed as the ratio of the total number of invariant parameters falsely detected as noninvariant to the total number of generated invariant parameters. The third was the item Type II error rate, computed as the ratio of the total number of noninvariant parameters mistakenly detected as invariant to the total number of generated noninvariant parameters.
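The three rates reduce to set operations over parameter labels; a minimal sketch (the parameter labels below are hypothetical, not from the simulation):

```python
def evaluation_rates(all_parameters, generated_noninvariant, detected_noninvariant):
    """Item power, Type I error, and Type II error rates as set operations."""
    gen = set(generated_noninvariant)   # truly noninvariant parameters
    det = set(detected_noninvariant)    # parameters flagged as noninvariant
    inv = set(all_parameters) - gen     # truly invariant parameters
    power = len(det & gen) / len(gen)   # hits among true noninvariants
    type1 = len(det & inv) / len(inv)   # false alarms among true invariants
    type2 = len(gen - det) / len(gen)   # misses among true noninvariants
    return power, type1, type2

# Hypothetical example: parameters 1-10, of which 1-4 were generated
# noninvariant, and a test flagged 1, 2, and 5.
power, type1, type2 = evaluation_rates(range(1, 11), {1, 2, 3, 4}, {1, 2, 5})
print(power, round(type1, 3), type2)  # 0.5 0.167 0.5
```

Note that power and the Type II error rate are complements (power = 1 − Type II), which is visible in Table 7.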
Results of Study 2
We limited our presentation of the results to the mixed condition, in which some model parameters were generated to be greater, and others smaller, in one group than in the other.2 This condition is likely to be more realistic in empirical research than the other two uniform conditions (i.e., positive and negative). The power rates, Type I error rates, and Type II error rates are summarized in Table 7. ANOVAs were performed on these criterion values to test the effects of method (ME; i.e., the RI-based and non-RI-based methods), sample size (SS), location of difference (LO), percentage of noninvariant parameters (PE), and magnitude of difference (MA). To serve the goal of this study, we report only the differences between the two MI methods on the three criteria.
Table 7.
Item Power Rates, Type I Error Rates, and Type II Error Rates for MI Testing in Study 2.
Power rate | Type I error | Type II error | |||||||
---|---|---|---|---|---|---|---|---|---|
LO | PE | MA | SS | RI-based | Non-RI-based | RI-based | Non-RI-based | RI-based | Non-RI-based |
Factor loading | 20% | .2 | 100 | .935 | .144 | .000 | .000 | .058 | .856 |
200 | .999 | .511 | .000 | .002 | .000 | .489 | |||
500 | 1.000 | .973 | .000 | .002 | .000 | .027 | |||
.4 | 100 | 1.000 | .890 | .000 | .004 | .000 | .110 | ||
200 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .006 | .000 | .000 | |||
40% | .2 | 100 | .248 | .162 | .000 | .002 | .741 | .839 | |
200 | .749 | .534 | .000 | .004 | .250 | .466 | |||
500 | 1.000 | .981 | .000 | .003 | .000 | .019 | |||
.4 | 100 | .828 | .893 | .000 | .011 | .172 | .106 | ||
200 | 1.000 | .998 | .000 | .016 | .000 | .002 | |||
500 | 1.000 | 1.000 | .000 | .053 | .000 | .000 | |||
Intercept | 20% | .2 | 100 | .500 | .620 | .000 | .002 | .500 | .380 |
200 | 1.000 | .968 | .104 | .002 | .000 | .032 | |||
500 | 1.000 | 1.000 | .000 | .002 | .000 | .000 | |||
.4 | 100 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | ||
200 | 1.000 | 1.000 | .104 | .005 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
40% | .2 | 100 | .999 | .698 | .000 | .004 | .000 | .302 | |
200 | 1.000 | .981 | .162 | .003 | .000 | .019 | |||
500 | 1.000 | 1.000 | .000 | .005 | .000 | .000 | |||
.4 | 100 | 1.000 | 1.000 | .000 | .008 | .000 | .000 | ||
200 | 1.000 | 1.000 | .164 | .009 | .000 | .000 | |||
500 | 1.000 | 1.000 | .000 | .022 | .000 | .000 |
The main effect of method was not significant for any of the three criteria, namely power rate (F(1, 46) = 1.492, p = .228), Type I error (F(1, 46) = 2.199, p = .145), and Type II error (F(1, 46) = 1.516, p = .224), when only the main effects were included. (The results from this ANOVA were not tabled to save space.) However, computations based on the results in Table 7 suggested that, in detecting factor loading differences, the average power rate was .879 for the RI-based approach but only .756 (below the desirable level of .80) for the non-RI-based approach. This difference became even more evident when the factor loading differences were generated to be small (i.e., .20): under this condition, the RI-based approach was associated with a higher mean power rate (M = .821) than the non-RI-based approach (M = .550). The former thus appeared more sensitive in detecting small differences in factor loadings.
Method had no significant interaction with percentage or magnitude on any of the three criteria (ps > .05; see Table 8), but it did interact significantly with location on Type I error (F(1, 27) = 12.412, p < .01) and with sample size on Type I error (F(1, 27) = 11.901, p < .001). Table 7 revealed that the RI-based method was subject to (nonsignificantly) less Type I error (M = .000) on factor loadings than the non-RI-based method (M = .008), but significantly greater Type I error (M = .043; still below the standard level of .05) on intercepts than the non-RI-based method (M = .004; F(1, 44) = 7.85, p < .01). In addition, the RI-based method produced (nonsignificantly) less Type I error at sample sizes of 100 and 500, but greater Type I error (M = .065) at a sample size of 200 than the non-RI-based method (M = .004, F(1, 42) = 15.75, p < .01).
Table 8.
Effects of the Studied Variables on Item Power Rates, Type I Error Rates, and Type II Error Rates in Study 2.
Power rate | Type I error rate | Type II error rate | |||||||
---|---|---|---|---|---|---|---|---|---|
df | F | p | df | F | p | df | F | p | |
LO | 1 | 9.974 | <.01 | 1 | 9.119 | <.01 | 1 | 9.990 | <.01 |
PE | 1 | 0.257 | .617 | 1 | 1.916 | .178 | 1 | 0.248 | .623 |
MA | 1 | 24.691 | <.001 | 1 | 0.776 | .386 | 1 | 24.792 | <.001 |
SS | 2 | 14.904 | <.001 | 2 | 9.843 | <.001 | 2 | 14.950 | <.001 |
ME | 1 | 4.328 | .047 | 1 | 5.715 | .024 | 1 | 4.410 | .045 |
LO × PE | 1 | 3.163 | .087 | 1 | 0.253 | .619 | 1 | 3.151 | .087 |
LO × MA | 1 | 5.370 | .028 | 1 | 0.063 | .803 | 1 | 5.366 | .028 |
LO × SS | 2 | 2.518 | .100 | 2 | 10.935 | <.001 | 2 | 2.513 | .100 |
LO × ME | 1 | 2.511 | .125 | 1 | 12.412 | <.01 | 1 | 2.566 | .121 |
PE × MA | 1 | 0.020 | .890 | 1 | 0.396 | .535 | 1 | 0.017 | .898 |
PE × SS | 2 | 0.074 | .929 | 2 | 0.550 | .583 | 2 | 0.071 | .932 |
PE × ME | 1 | 0.654 | .426 | 1 | 0.016 | .901 | 1 | 0.642 | .430 |
MA × SS | 2 | 9.404 | <.001 | 2 | 0.170 | .844 | 2 | 9.422 | <.01 |
MA × ME | 1 | 3.891 | .059 | 1 | 0.776 | .386 | 1 | 3.967 | .057 |
SS × ME | 2 | 1.033 | .370 | 2 | 11.901 | <.001 | 2 | 1.059 | .361 |
Note. ME = Method; that is, the RI-based approach and non-RI-based methods.
A Pedagogical Example
We first applied MaxL, Minχ2, and BSEM for RI selection to data collected from a large-scale project (n = 12,811), Psychological Wellbeing of Children of Rural-to-Urban Migrant Workers in China. The measure chosen for this demonstration was from the Revised Child Anxiety and Depression Scale (RCADS; Chorpita et al., 2000). This self-report scale contains 47 items in total; however, only the 18 items for generalized anxiety were used here for demonstration. Responses were scored on a Likert-type scale of 1 to 4, corresponding to “Never,” “Sometimes,” “Quite Often,” and “Always.” Cronbach’s α was .897 and ω was .910 in this sample.
There were 7,356 male (57.4%) and 5,455 female (42.6%) child respondents in this sample. A two-gender-group CFA was fitted to these data, and MaxL, Minχ2, and BSEM were used to find RIs. MaxL and Minχ2 each produced 18 LR statistics from the comparisons between the baseline model and each reduced model, and the 18 values were rank ordered from smallest to largest. As shown in Table 9, Item 7 was associated with the smallest LR statistic, so Minχ2 chose this item as the RI. Among the items that yielded nonsignificant LR statistics, Item 7 had the largest factor loading in the baseline model; thus MaxL also chose Item 7 as the RI.
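The two selection rules can be sketched directly. The tuples below take a few rows of Table 9 in the form (item, baseline loading, LR statistic, p value); the variable names are ours:

```python
# A subset of Table 9: (item, loading in baseline model, LR statistic, p).
candidates = [
    (2,  .480, 1.972, .373),
    (7,  .700, 0.580, .748),
    (15, .625, 0.946, .623),
    (18, .481, 0.588, .745),
]

# Minchi2 rule: pick the item with the smallest LR statistic.
min_chi2_ri = min(candidates, key=lambda c: c[2])[0]

# MaxL rule: among items whose LR test is nonsignificant,
# pick the one with the largest baseline factor loading.
nonsignificant = [c for c in candidates if c[3] > .05]
max_l_ri = max(nonsignificant, key=lambda c: c[1])[0]

print(min_chi2_ri, max_l_ri)  # 7 7 -> both rules select Item 7
```

With the full 18 rows of Table 9 the outcome is the same, since Item 7 has both the smallest LR statistic (0.580) and the largest loading (.700) among the nonsignificant items.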
Table 9.
Results of Using MaxL and Minχ2 to Identify an RI With the Empirical Data.
Factor loadings | Loadings (R) | Loadings (F) | χ² | df | Δχ² | p |
---|---|---|---|---|---|---|---
Baseline | 6590.815 | 304 | |||||
Item 1 | .300 | .306 | .291 | 6576.866 | 302 | 13.949 | <.001 |
Item 2 | .480 | .471 | .492 | 6588.843 | 302 | 1.972 | .373 |
Item 3 | .571 | .556 | .591 | 6582.942 | 302 | 7.873 | .020 |
Item 4 | .620 | .611 | .633 | 6587.198 | 302 | 3.617 | .164 |
Item 5 | .643 | .630 | .661 | 6586.149 | 302 | 4.666 | .097 |
Item 6 | .538 | .557 | .510 | 6578.007 | 302 | 12.808 | .002 |
Item 7 | .700 | .703 | .694 | 6590.235 | 302 | 0.580 | .748 |
Item 8 | .656 | .651 | .663 | 6586.430 | 302 | 4.385 | .112 |
Item 9 | .665 | .653 | .682 | 6586.734 | 302 | 4.081 | .130 |
Item 10 | .690 | .682 | .703 | 6577.774 | 302 | 13.041 | .002 |
Item 11 | .540 | .555 | .518 | 6568.084 | 302 | 22.731 | <.001 |
Item 12 | .491 | .499 | .478 | 6575.632 | 302 | 15.183 | <.001 |
Item 13 | .536 | .543 | .528 | 6508.893 | 302 | 81.922 | <.001 |
Item 14 | .425 | .435 | .411 | 6582.484 | 302 | 8.331 | .016 |
Item 15 | .625 | .630 | .618 | 6589.869 | 302 | 0.946 | .623 |
Item 16 | .608 | .600 | .621 | 6584.566 | 302 | 6.249 | .044 |
Item 17 | .598 | .605 | .586 | 6589.292 | 302 | 1.523 | .467 |
Item 18 | .481 | .484 | .476 | 6590.227 | 302 | 0.588 | .745 |
Note. “Factor loadings” = the loading estimates from a baseline model in which all loadings were constrained to be equal across groups; “Loadings (R)” and “Loadings (F)” = the loading estimates for the reference (R) and focal (F) groups, respectively.
Then BSEM was used to select an RI by specifying a two-group CFA model with the commands knownclass = c (g = 1 2) under Variable, and type = mixture; estimator = bayes; under Analysis (Muthén & Asparouhov, 2012). The cross-group difference parameters, each representing a summarized difference of an item across groups, were set under Model Constraint. We imposed a zero-mean, small-variance normal prior (N(0, 0.001)) on each difference parameter through the DIFF option under Model Priors. MCMC simulations were run for a minimum of 50,000 and a maximum of 100,000 iterations with thin = 10. The Mplus output contained the necessary information on the posterior distributions of the difference parameters, including their means and standard deviations. Table 10 shows the estimates of the loading and intercept differences along with their standard deviations. The selection index was then calculated for each item using Equation (1). Eventually, Item 7 was chosen as the RI because it produced the smallest index (0.646) of the 18 items.
Table 10.
Results of Using BSEM to Identify an RI With the Empirical Data.
Loading difference (SD) | Intercept difference (SD) | Selection index |
---|---|---|---
Item 1 | 0.011 (0.014) | 0.044 (0.014) | 3.929 |
Item 2 | 0.017 (0.015) | 0.007 (0.015) | 1.600 |
Item 3 | 0.027 (0.016) | 0.023 (0.015) | 3.221 |
Item 4 | 0.019 (0.016) | 0.016 (0.015) | 2.254 |
Item 5 | 0.024 (0.015) | 0.01 (0.015) | 2.267 |
Item 6 | 0.036 (0.016) | 0.027 (0.015) | 4.050 |
Item 7 | 0.005 (0.016) | 0.005 (0.015) | 0.646 |
Item 8 | 0.01 (0.016) | 0.024 (0.015) | 2.225 |
Item 9 | 0.023 (0.016) | 0.012 (0.015) | 2.238 |
Item 10 | 0.017 (0.016) | 0.037 (0.015) | 3.529 |
Item 11 | 0.03 (0.013) | 0.039 (0.013) | 5.308 |
Item 12 | 0.018 (0.015) | 0.041 (0.014) | 4.129 |
Item 13 | 0.013 (0.015) | 0.106 (0.015) | 7.933 |
Item 14 | 0.019 (0.015) | 0.03 (0.015) | 3.267 |
Item 15 | 0.011 (0.014) | 0.002 (0.014) | 0.929 |
Item 16 | 0.016 (0.015) | 0.022 (0.015) | 2.533 |
Item 17 | 0.016 (0.015) | 0.001 (0.015) | 1.133 |
Item 18 | 0.007 (0.016) | 0.007 (0.015) | 0.904 |
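The tabled values are consistent with a selection index that sums each absolute posterior mean scaled by its posterior SD; under that assumption about the form of Equation (1), the following sketch reproduces the indices for Items 7 and 13:

```python
def selection_index(d_loading, sd_loading, d_intercept, sd_intercept):
    # Assumed form of Equation (1): the absolute posterior mean of each
    # cross-group difference, scaled by its posterior SD, then summed.
    return abs(d_loading) / sd_loading + abs(d_intercept) / sd_intercept

# Posterior means and SDs from Table 10.
print(round(selection_index(0.005, 0.016, 0.005, 0.015), 3))  # 0.646 (Item 7)
print(round(selection_index(0.013, 0.015, 0.106, 0.015), 3))  # 7.933 (Item 13)
```

Item 7 minimizes this index across all 18 items, which is why BSEM selected it as the RI.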
Next we applied the two methods of MI testing to this sample. The RI-based method used Item 7, previously identified as the RI, and the non-RI-based method utilized the procedure proposed by Raykov et al. (2013). As shown in Tables 11 and 12, the non-RI-based method detected intercept differences for Items 1, 10, 11, 12, and 13, whereas the RI-based approach detected loading differences for Item 6 and intercept differences for Items 1, 6, 10, 11, 12, 13, and 14. Thus, in this empirical application, the two MI testing methods reached 62.5% agreement on the detected noninvariant parameters.
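One way to arrive at the reported 62.5% is to take the proportion of jointly flagged parameters among all parameters flagged by either method, using the detections listed above:

```python
# Parameters flagged as noninvariant by each approach (kind, item number).
ri_based = {("loading", 6), ("intercept", 1), ("intercept", 6),
            ("intercept", 10), ("intercept", 11), ("intercept", 12),
            ("intercept", 13), ("intercept", 14)}
non_ri = {("intercept", 1), ("intercept", 10), ("intercept", 11),
          ("intercept", 12), ("intercept", 13)}

# Jointly flagged parameters over all flagged parameters.
agreement = len(ri_based & non_ri) / len(ri_based | non_ri)
print(agreement)  # 0.625
```

Here every parameter flagged by the non-RI-based method was also flagged by the RI-based method (5 of 8), so the agreement is 5/8 = .625.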
Table 11.
Results From the Non-RI-Based Approach for MI Testing Using the Empirical Data.
Loadings (R) | Loadings (F) | Intercepts (R) | Intercepts (F) | χ² | df | Δχ² | p | l | New order |
---|---|---|---|---|---|---|---|---|---|---
Baseline | 6590.815 | 304 | ||||||||
Item 1 | .306 | .292 | 6589.966 | 303 | 0.849 | .357 | .009 | 27 | ||
2.689 | 2.636 | 6577.756 | 303 | 13.059 | <.001 | .001 | 4 | |||
Item 2 | .471 | .492 | 6589.206 | 303 | 1.609 | .205 | .007 | 22 | ||
1.994 | 2.003 | 6590.462 | 303 | 0.353 | .552 | .010 | 31 | |||
Item 3 | .556 | .591 | 6586.412 | 303 | 4.403 | .036 | .003 | 10 | ||
2.389 | 2.359 | 6587.295 | 303 | 3.52 | .061 | .005 | 14 | |||
Item 4 | .611 | .634 | 6588.794 | 303 | 2.021 | .155 | .006 | 19 | ||
2.224 | 2.205 | 6589.195 | 303 | 1.62 | .203 | .007 | 21 | |||
Item 5 | .630 | .660 | 6587.087 | 303 | 3.728 | .054 | .004 | 13 | ||
2.342 | 2.356 | 6589.902 | 303 | 0.913 | .339 | .008 | 25 | |||
Item 6 | .558 | .642 | 6582.583 | 303 | 8.232 | .004 | .002 | 7 | ||
2.305 | 2.272 | 6586.316 | 303 | 4.499 | .034 | .003 | 9 | |||
Item 7 | .703 | .694 | 6590.493 | 303 | 0.322 | .570 | .011 | 32 | ||
2.343 | 2.350 | 6590.553 | 303 | 0.262 | .609 | .011 | 33 | |||
Item 8 | .651 | .663 | 6590.295 | 303 | 0.52 | .471 | .010 | 29 | ||
2.256 | 2.226 | 6586.933 | 303 | 3.882 | .049 | .004 | 12 | |||
Item 9 | .653 | .682 | 6587.579 | 303 | 3.236 | .072 | .005 | 15 | ||
2.231 | 2.217 | 6589.949 | 303 | 0.866 | .352 | .009 | 26 | |||
Item 10 | .682 | .703 | 6588.744 | 303 | 2.071 | .15 | .006 | 18 | ||
2.006 | 1.959 | 6579.787 | 303 | 11.028 | .001 | .002 | 5 | |||
Item 11 | .555 | .518 | 6582.538 | 303 | 8.277 | .004 | .002 | 6 | ||
2.359 | 2.313 | 6576.481 | 303 | 14.334 | <.001 | .001 | 2 | |||
Item 12 | .499 | .477 | 6588.741 | 303 | 2.074 | .15 | .006 | 17 | ||
2.454 | 2.506 | 6577.627 | 303 | 13.188 | <.001 | .001 | 3 | |||
Item 13 | .542 | .527 | 6589.826 | 303 | 0.989 | .32 | .008 | 24 | ||
2.456 | 2.592 | 6509.733 | 303 | 81.082 | <.001 | .000 | 1 | |||
Item 14 | .435 | .411 | 6588.697 | 303 | 2.118 | .146 | .005 | 16 | ||
2.717 | 2.756 | 6584.549 | 303 | 6.266 | .012 | .003 | 8 | |||
Item 15 | .630 | .618 | 6589.976 | 303 | 0.839 | .360 | .009 | 28 | ||
2.077 | 2.082 | 6590.704 | 303 | 0.111 | .739 | .012 | 35 | |||
Item 16 | .600 | .620 | 6588.907 | 303 | 1.908 | .167 | .007 | 20 | ||
2.185 | 2.214 | 6586.514 | 303 | 4.301 | .038 | .004 | 11 | |||
Item 17 | .605 | .586 | 6589.307 | 303 | 1.508 | .219 | .008 | 23 | ||
2.362 | 2.364 | 6590.797 | 303 | 0.018 | .893 | .012 | 36 | |||
Item 18 | .484 | .476 | 6590.617 | 303 | 0.198 | .656 | .011 | 34 | ||
2.561 | 2.571 | 6590.421 | 303 | 0.394 | .530 | .010 | 30 |
Note. R = reference group, F = focal group.
Table 12.
Results From the RI-Based Approach for MI Testing Using the Empirical Data.
Factor loadings | Intercepts | χ² | df | Δχ² | p |
---|---|---|---|---|---|---
Baseline | 6394.829 | 270 | ||||
Item 1 | .301 | 6395.332 | 271 | 0.503 | .478 | |
2.668 | 6407.199 | 271 | 12.37 | <.001 | ||
Item 2 | .478 | 6396.599 | 271 | 1.77 | .183 | |
1.995 | 6394.881 | 271 | 0.052 | .820 | ||
Item 3 | .567 | 6398.532 | 271 | 3.703 | .054 | |
2.379 | 6397.845 | 271 | 3.016 | .082 | ||
Item 4 | .617 | 6396.684 | 271 | 1.855 | .173 | |
2.218 | 6396.367 | 271 | 1.538 | .215 | ||
Item 5 | .638 | 6397.807 | 271 | 2.978 | .084 | |
2.344 | 6394.951 | 271 | 0.122 | .727 | ||
Item 6 | .546 | 6399.201 | 271 | 4.372 | .037 | |
2.294 | 6398.880 | 271 | 4.051 | .044 | ||
Item 8 | .655 | 6395.484 | 271 | 0.655 | .418 | |
2.248 | 6397.815 | 271 | 2.986 | .084 | ||
Item 9 | .661 | 6397.345 | 271 | 2.516 | .113 | |
2.227 | 6395.781 | 271 | 0.952 | .329 | ||
Item 10 | .687 | 6396.416 | 271 | 1.587 | .208 | |
1.996 | 6401.274 | 271 | 6.445 | .011 | ||
Item 11 | .548 | 6398.250 | 271 | 3.421 | .064 | |
2.347 | 6404.235 | 271 | 9.406 | .002 | ||
Item 12 | .495 | 6395.918 | 271 | 1.089 | .297 | |
2.467 | 6401.883 | 271 | 7.054 | .008 | ||
Item 13 | .540 | 6395.217 | 271 | 0.388 | .533 | |
2.490 | 6441.861 | 271 | 47.032 | <.001 | ||
Item 14 | .429 | 6396.194 | 271 | 1.365 | .243 | |
2.728 | 6398.621 | 271 | 3.792 | .051 | ||
Item 15 | .629 | 6395.046 | 271 | 0.217 | .641 | |
2.076 | 6394.844 | 271 | 0.015 | .903 | ||
Item 16 | .606 | 6396.237 | 271 | 1.408 | .235 | |
2.190 | 6396.172 | 271 | 1.343 | .247 | ||
Item 17 | .602 | 6395.434 | 271 | 0.605 | .437 | |
2.361 | 6394.875 | 271 | 0.046 | .830 | ||
Item 18 | .483 | 6394.916 | 271 | 0.087 | .768 | |
2.563 | 6394.902 | 271 | 0.073 | .787 |
Note. Item 7 was used as the reference indicator (RI).
Discussion
The conventional approach for RI selection could jeopardize the outcome of factorial invariance tests using the multiple-group CFA approach, and more rigorous approaches are clearly needed in this research context. Regarding RI selection, three statistical procedures, MaxL, Minχ2, and BSEM, have been available; however, their performance in correctly detecting an RI remained unknown. Thus, in this article, Study 1 examined the performances of MaxL, Minχ2, and BSEM using simulated data. As a follow-up, Study 2 investigated the advantages/disadvantages of using the RI-based approach for MI testing in comparison with the non-RI-based approach. Together, the two simulation studies provided a complete, solid examination of how reference indicators matter in measurement invariance tests.
Study 1 revealed that Minχ2 and BSEM performed better than MaxL in selecting the correct item as a reference indicator. This was particularly true under the positive condition, where parameter values for functionally different items were higher in the focal group than in the reference group, regardless of the levels of all other conditions under investigation. Under the negative condition, MaxL performed much better than it did in the positive condition, and showed power equivalent to the other two methods under certain circumstances, such as a small percentage of functionally different items and a small magnitude of cross-group difference in parameters. Under the mixed condition, no significant differences were detected among the three methods; however, MaxL appeared slightly inferior when the sample sizes and the loading differences were small.
The direction effect was evident when using the MaxL approach. This was consistent with the expectation stated earlier in this article; that is, methods favoring high loadings, such as MaxL, tend to perform poorly when the truly invariant items happen to be those with low factor loadings (i.e., the positive condition), but perform decently in most cases when the truly invariant items are also those with high factor loadings (i.e., the negative condition). This may in part explain why MaxL showed high power in correctly selecting an RI in previous research where only negative conditions were simulated (e.g., Meade & Wright, 2012). It appeared that a nonuniform direction of parameter differences (i.e., the mixed condition) would remedy MaxL's drawback of favoring high loadings; in that case, the power rates for detecting truly invariant items were comparable among the three methods.
Another key feature of the MaxL approach lies in its use of the LR statistic to test the significance of item differences between groups. Research has shown that the power of the LR test is highly influenced by sample size; consequently, even a trivial difference in item parameters will lead to a significant LR test when n is large (Ankenmann et al., 1999; Meade, 2010). We found in our simulation analyses that when the percentage of functionally different items was small, increasing the sample size increased the power of detecting truly invariant items. However, power decreased substantially or behaved inconsistently as the sample size increased to, for instance, 500, particularly when the percentage of noninvariant items and the magnitude of item difference were both high. This was true regardless of whether the direction was positive or negative, and whether the difference occurred in factor loadings or intercepts. Thus, its high sensitivity to sample size makes MaxL an implausible approach for applied research.
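The sample-size sensitivity can be illustrated with a toy calculation: for a fixed per-observation discrepancy between groups (the value .01 below is arbitrary, chosen only for illustration), the LR statistic grows roughly linearly with n and eventually exceeds any fixed critical value, so a trivial difference becomes "significant" in a large sample.

```python
CRITICAL = 3.841  # chi-square critical value at alpha = .05 with df = 1

def lr_statistic(n, f_min=0.01):
    # The LR statistic is approximately n times the (fixed) minimum
    # discrepancy per observation, so it scales linearly with n.
    return n * f_min

significant = {n: lr_statistic(n) > CRITICAL for n in (100, 200, 500, 1000)}
print(significant)  # only n = 500 and n = 1000 cross the critical value
```

The same fixed discrepancy that is undetectable at n = 100 is thus flagged at n = 500, mirroring the inconsistent behavior of MaxL at larger sample sizes.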
The Minχ2 and BSEM approaches did not show any significant differences in performance across all conditions. However, when 40% of the items were functionally different, the power rates of these two approaches were noticeably higher in the negative condition than in the positive condition, and this held only for differences occurring in factor loadings. This observation may well be explained by the reliability paradox (see Hancock & Mueller, 2011): when fitting SEM models under a given level of model misspecification, better measurement quality is associated with poorer model fit (Heene et al., 2011; McNeish et al., 2018; Shi, Lee, & Maydeu-Olivares, 2018). Here, model misspecification refers to setting numerically different factor loadings to be the same across groups. Such misspecification may have a heavier negative impact on model fit when the standardized factor loadings are greater in one scenario than in another. In selecting RIs using Minχ2 and BSEM in our simulation analyses, the standardized factor loadings were consistently higher in the positive than in the negative conditions; therefore, the misspecification created by constraining factor loadings to be (nearly) the same could impact model fit more in the positive than in the negative conditions. Consequently, poorer power rates were observed in the positive conditions. Further examination is needed of the relationship between the direction effect and the reliability paradox in fitting multiple-group models.
Study 2 compared the RI-based approach with the non-RI-based approach in detecting invariant and noninvariant parameters. The results were consistent with our anticipation that once an RI was rigorously identified, the RI-based approach would perform well in MI testing. More specifically, we found that the RI-based approach performed better than the non-RI-based approach for detecting (particularly small) loading differences while maintaining a fairly low likelihood of mistakenly identifying truly invariant items to be noninvariant. However, the RI-based approach was associated with higher Type I error rates than its counterpart in detecting intercept differences. It became evident that the two approaches have their own pros and cons when testing for differences and equivalences in model parameters.
A few suggestions can be offered based on our findings on RI selection. First, it is not wise to use MaxL to identify reference indicators. Although this approach can perform as well as the others under certain conditions, it is impractical to preidentify those conditions in empirical data analysis; in addition, MaxL can behave poorly in large samples because of the high sensitivity of the LR test to sample size. Second, Minχ2 and BSEM are both recommended for empirical studies; however, different theoretical backgrounds are required for their implementation: while Minχ2 involves fitting a series of multiple-group CFA models and computing an LR statistic for each individual item, BSEM fits a single model to identify invariant and noninvariant items simultaneously (Shi, Song, Liao, et al., 2017). Last, we recommend that methodological researchers consider the direction of parameter differences as a studied variable in future simulation research on multiple-group CFA models; otherwise, the results could be confounded or misleading.
A limitation of the present study should be noted. The three methods of RI selection included in this article (MaxL, Minχ2, and BSEM) are all CFA-based approaches; thus, in our simulation analyses, the indicator variables were generated to be continuously distributed. In the empirical demonstration, however, all indicator variables were scored on a Likert-type scale and were therefore ordinal in nature. Although it is fairly common in practice to fit CFA models to data measured with Likert-type scales, we cannot be certain how robust the results are with respect to selecting reference indicators. In our empirical analyses, nonetheless, all three methods agreed on the same item as the RI.
Another limitation concerns the RI selection methods and RI-based MI testing. These procedures, as implemented in our study, rest on the implicit assumption that at least one truly invariant item exists among all the scale items. Although this assumption is very likely to hold for well-developed scales and instruments, the RI search methods may end up selecting a noninvariant item as the RI when no truly invariant item actually exists. In our study, no data were generated for conditions in which ALL items differ across groups, so it remains unclear what would happen to RI-based MI testing if a noninvariant RI had to be selected, even one with minimal cross-group differences. Because the non-RI-based approaches do not use any RI in MI tests, they are not subject to this limitation. How the non-RI-based approaches perform when all items function differently across groups is nonetheless worth future investigation.
In summary, we compared three well-developed methods for RI selection, a step that has been considered critical in factorial invariance tests. Study 1 showed that the Minχ2 and BSEM approaches generally performed better than the MaxL approach. It is worth noting that we also examined the effect of the direction of parameter differences on the performance of these methods and showed that such an effect did occur. This suggests that future simulation-based comparisons of multiple-group CFA techniques need to consider directional effects; otherwise, any discovered differences could be confounded or misleading. Study 2 compared one RI-based approach with one non-RI-based approach in terms of their performance in MI testing. In general, the former performed well, with higher power and lower Type I error rates in detecting loading differences but higher Type I error rates in detecting intercept differences.
Woods (2009) rank-ordered the items based on their LR/Δdf values. In our study, we used LR instead of LR/Δdf because Δdf (= 2) was constant across all conditions.
The complete results of Study 2 are available upon request from the first author or the corresponding author.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The involvement of Dr. Zhengkui Liu in this project was supported by the Consulting and Appraising Grant from Chinese Academy of Sciences (Y7CX134003).
ORCID iDs: Hairong Song https://orcid.org/0000-0001-5164-2159
Dexin Shi https://orcid.org/0000-0002-4120-6756
References
- Ankenmann R. R., Witt E. A., Dunbar S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36(4), 277-300. 10.1111/j.1745-3984.1999.tb00558.x
- Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300. 10.1111/j.2517-6161.1995.tb02031.x
- Byrne B. M., Shavelson R. J., Muthén B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456-466. 10.1037/0033-2909.105.3.456
- Chorpita B. F., Yim L., Moffitt C., Umemoto L. A., Francis S. E. (2000). Assessment of symptoms of DSM-IV anxiety and depression in children: A revised child anxiety and depression scale. Behaviour Research and Therapy, 38(8), 835-855. 10.1016/S0005-7967(99)00130-8
- French B. F., Finch W. H. (2008). Multigroup confirmatory factor analysis: Locating the invariant referent sets. Structural Equation Modeling, 15(1), 96-113. 10.1080/10705510701758349
- Hancock G. R., Mueller R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71(2), 306-324. 10.1177/0013164410384856
- Heene M., Hilbert S., Draxler C., Ziegler M., Bühner M. (2011). Masking misfit in confirmatory factor analysis by increasing unique variances: A cautionary note on the usefulness of cutoff values of fit indices. Psychological Methods, 16(3), 319-336. 10.1037/a0024917
- Horn J. L., McArdle J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3), 117-144. 10.1080/03610739208253916
- Horn J. L., McArdle J. J., Mason R. (1983). When is invariance not invariant: A practical scientist's look at the ethereal concept of factor invariance. Southern Psychologist, 1(4), 179-188.
- Johnson E. C., Meade A. W., DuVernet A. M. (2009). The role of referent indicators in tests of measurement invariance. Structural Equation Modeling, 16(4), 642-657. 10.1080/10705510903206014
- Jöreskog K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 409-426. 10.1007/BF02291366
- Jung E., Yoon M. (2016). Comparisons of three empirical methods for partial factorial invariance: Forward, backward, and factor-ratio tests. Structural Equation Modeling, 23(4), 567-584. 10.1080/10705511.2015.1138092
- Kim E. S., Yoon M. (2011). Testing measurement invariance: A comparison of multiple group categorical CFA and IRT. Structural Equation Modeling, 18(2), 212-228. 10.1080/10705511.2011.557337
- Kim E. S., Yoon M., Lee T. (2012). Testing measurement invariance using MIMIC: Likelihood ratio test with a critical value adjustment. Educational and Psychological Measurement, 72(3), 469-492. 10.1177/0013164411427395
- Lopez Rivas G. E., Stark S., Chernyshenko O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Applied Psychological Measurement, 33(4), 251-265. 10.1177/0146621608321760
- McNeish D., An J., Hancock G. R. (2018). The thorny relation between measurement quality and fit index cutoffs in latent variable models. Journal of Personality Assessment, 100(1), 43-52. 10.1080/00223891.2017.1281286
- Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728-743. 10.1037/a0018966
- Meade A. W., Lautenschlager G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11(1), 60-72. 10.1207/S15328007SEM1101_5
- Meade A. W., Wright N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. Journal of Applied Psychology, 97(5), 1016-1031. 10.1037/a0027934
- Meredith W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543. 10.1007/BF02294825
- Millsap R. E. (2012). Statistical approaches to measurement invariance. Routledge. 10.4324/9780203821961
- Millsap R. E., Kwok O. M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9(1), 93-115. 10.1037/1082-989X.9.1.93
- Muthén B., Asparouhov T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313-335. 10.1037/a0026802
- Raykov T., Marcoulides G. A., Harrison M., Zhang M. (2019). On the dependability of a popular procedure for studying measurement invariance: A cause for concern? Structural Equation Modeling. Advance online publication. 10.1080/10705511.2019.1610409
- Raykov T., Marcoulides G. A., Millsap R. E. (2013). Factorial invariance in multiple populations: A multiple testing procedure. Educational and Psychological Measurement, 73(4), 713-727. 10.1177/0013164412451978
- Rensvold R. B., Cheung G. W. (1998). Testing measurement models for factorial invariance: A systematic approach. Educational and Psychological Measurement, 58(6), 1017-1034. 10.1177/0013164498058006010
- Shi D., Lee T., Maydeu-Olivares A. (2018). Understanding the model size effect on SEM fit indices. Educational and Psychological Measurement, 79(2), 310-334. 10.1177/0013164418783530
- Shi D., Song H., DiStefano C., Maydeu-Olivares A., McDaniel H. L., Jiang Z. (2018). Evaluating factorial invariance: An interval estimation approach using Bayesian structural equation modeling. Multivariate Behavioral Research, 54(2), 224-245. 10.1080/00273171.2018.1514484
- Shi D., Song H., Lewis M. D. (2017). The impact of partial factorial invariance on cross-group comparisons. Assessment, 26(7), 1217-1233. 10.1177/1073191117711020
- Shi D., Song H., Liao X., Terry R., Snyder L. A. (2017). Bayesian SEM for specification search problems in testing factorial invariance. Multivariate Behavioral Research, 52(4), 430-444. 10.1080/00273171.2017.1306432
- Stark S., Chernyshenko O. S., Drasgow F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292-1306. 10.1037/0021-9010.91.6.1292
- Steenkamp J. B. E., Baumgartner H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78-90. 10.1086/209528
- Steinmetz H. (2011). Estimation and comparison of latent means across cultures. In Davidov E., Schmidt P., Billiet J. (Eds.), Cross-cultural analysis: Methods and applications (pp. 85-116). Psychology Press.
- Vandenberg R. J., Lance C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70. 10.1177/109442810031002
- Wasserman L. (2004). All of statistics: A concise course in statistical inference. Springer Science & Business Media.
- Widaman K. F., Reise S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In Bryant K. J., Windle M., West S. G. (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281-324). American Psychological Association. 10.1037/10222-009
- Woods C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42-57. 10.1177/0146621607314044
- Yoon M., Millsap R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14(3), 435-463. 10.1080/10705510701301677