Abstract
This study examined whether cutoffs in fit indices suggested for traditional formats with maximum likelihood estimators can be utilized to assess model fit and to test measurement invariance when a multiple group confirmatory factor analysis is employed for the Thurstonian item response theory (IRT) model. Regarding the performance of the evaluation criteria, detection rates of measurement non-invariance and Type I error rates were examined. The impact of measurement non-invariance on estimated scores in the Thurstonian IRT model was also examined through accuracy and efficiency in score estimation. The fit indices used for the evaluation of model fit performed well. Among six cutoffs for changes in model fit indices, only ΔCFI > .01 and ΔNCI > .02 detected metric non-invariance, and only when the medium magnitude of non-invariance occurred; none of the cutoffs performed well in detecting scalar non-invariance. Based on the generated sampling distributions of fit index differences, this study suggested ΔCFI > .001 and ΔNCI > .004 for scalar non-invariance and ΔCFI > .007 for metric non-invariance. Considering Type I error rate control and detection rates of measurement non-invariance, ΔCFI was recommended for measurement non-invariance tests for forced-choice format data. Challenges in measurement non-invariance tests in the Thurstonian IRT model are discussed along with directions for future research to enhance the utility of forced-choice formats in test development for cross-cultural and international settings.
Keywords: measurement invariance, Thurstonian IRT model, forced-choice format, fit indices
Due to innovative advancements in psychometric modeling based on item response theory (IRT) that allow for interpersonal comparisons (e.g., Brown & Maydeu-Olivares, 2011b; Stark et al., 2005), forced-choice formats have been increasingly applied in educational and personnel selection settings (e.g., Anguiano-Carrasco et al., 2015; Dueber et al., 2019; Guenole et al., 2018; Organisation for Economic Co-operation and Development [OECD], 2014; Usami et al., 2016). Along with the popular use of forced-choice formats in assessment construction, validation studies comparing scores between single-stimulus (e.g., Likert-type scale items) and forced-choice formats demonstrated that estimated scores from the IRT-based approaches represented target traits well and reduced the response distortion that may exist in single-stimulus formats (e.g., Anguiano-Carrasco et al., 2015; Guenole et al., 2018; Usami et al., 2016). However, only a few studies (e.g., Bartram, 2013a, 2013b) have examined whether the measurement of constructs through forced-choice formats works equivalently across subgroups of respondents, such as those differing in gender, racial/ethnic background, or country of origin. Except for Bartram’s studies using a forced-choice format version of the Occupational Personality Questionnaire (OPQ32i; SHL Group, 2006), no published studies appearing in academic databases (e.g., PsycINFO and ERIC) have investigated the fairness aspect of validity in the use of scores from a forced-choice assessment across heterogeneous groups of respondents. In other words, the equivalence of psychometric properties in measurement (measurement invariance; Millsap, 2011) for forced-choice formats has not been explored in depth.
Although Bartram (2013a, 2013b) examined construct and scalar equivalence of the OPQ32i across different countries, the psychometric approaches employed differed from commonly used approaches such as multiple group confirmatory factor analyses (CFAs) and multiple indicators multiple causes (MIMIC) models. Instead of the approach using model fit indices with which most applied researchers are familiar, Bartram employed correlational analyses for scalar equivalence of the forced-choice format assessment, exploring whether group-level differences in mean values and standard deviations of trait scores were related to scores from other scales measuring group-level effects. In addition, a multilevel modeling approach was used to examine the proportion of between-country variance in forced-choice format score variance.
Regarding a psychometric approach to test measurement invariance in the Thurstonian IRT model, Brown and Maydeu-Olivares (2018) noted that a multiple group CFA can be employed to test the equivalence of psychometric properties in measurement, including loadings and thresholds, across subgroups of respondents. However, with respect to evaluation measures to determine measurement invariance in forced-choice formats, there is a clear lack of research into whether the evaluation criteria established for single-stimulus formats perform well when the Thurstonian IRT model is fit to responses from forced-choice formats. In addition, considering that the first step of a multiple group CFA is the examination of model fit, a question arises in selecting evaluation criteria to assess model fit: the commonly used criteria, such as comparative fit index (CFI) no smaller than .95 and root mean square error of approximation (RMSEA) no larger than .06 from Hu and Bentler (1999), were established with maximum likelihood estimators (MLEs), whereas the Thurstonian IRT model uses limited information methods. Although studies have demonstrated that model fit evaluation criteria established for multivariate normal data with MLE do not work for limited information estimation methods (e.g., Nye & Drasgow, 2011), most empirical studies employing forced-choice formats (e.g., Anguiano-Carrasco et al., 2015; Brown & Maydeu-Olivares, 2011b; Guenole et al., 2018; Lee et al., 2018) have relied on such rules of thumb for the evaluation of model fit.
The purpose of this simulation study is to investigate whether cutoffs in fit indices suggested for traditional formats with estimation methods such as MLE can be utilized to test measurement invariance when a multiple group CFA is employed for the Thurstonian IRT model. Using the multiple group CFA, the evaluation of measurement invariance in this study was based on a holistic approach through fit indices. Regarding the performance of evaluation criteria in fit indices, detection rates of measurement non-invariance and Type I error rates were examined. In addition, the impact of measurement non-invariance on estimated scores in the Thurstonian IRT model was examined through the accuracy and efficiency of score estimation. Based on the findings, this study aimed to provide information about the selection of evaluation criteria when a multiple group CFA is used for the detection of measurement non-invariance.
This study describes the conceptual framework of the Thurstonian IRT model and evaluation of measurement non-invariance in multiple group CFAs. Two simulation studies were conducted to examine six fit indices and suggest new cutoffs. Finally, challenges in measurement non-invariance tests in the Thurstonian IRT model were discussed.
Thurstonian IRT Model
By applying Thurstonian factor models, Brown and Maydeu-Olivares (2011b) developed the Thurstonian IRT model. Equation 1 shows the probability of selecting statement i over statement k within a block of statements (see footnote 1):

$$P\left(i \succ k \mid \eta_a, \eta_b\right) = \Phi\!\left(\frac{-\gamma_{ik} + \lambda_i \eta_a - \lambda_k \eta_b}{\sqrt{\psi_i^2 + \psi_k^2}}\right) \quad (1)$$

The probabilistic function of the binary outcome in a pairwise comparison is a two-dimensional normal ogive IRT model, where $\eta_a$ and $\eta_b$ are the traits that statements i and k (hereafter “items”) measure, $\gamma_{ik}$ is the threshold of the pairwise comparison (pair of items i and k), $\lambda_i$ and $\lambda_k$ are the loadings of the items onto their corresponding traits, and $\psi_i^2$ and $\psi_k^2$ are the error variances of the item utilities. In the Thurstonian IRT model, the choice behavior, selecting item i over k, is based on the framework of Thurstone’s (1927) law of comparative judgment using the concept of item utility.
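For concreteness, the following minimal R sketch evaluates Equation 1; all parameter values shown are illustrative rather than taken from the study.

```r
# Probability of preferring item i over item k (Equation 1), given
# trait standings eta_a and eta_b for the traits measured by the two
# items. gamma_ik is the pair threshold, lambda_* the loadings, and
# psi2_* the error variances of the item utilities.
p_prefer <- function(eta_a, eta_b, gamma_ik, lambda_i, lambda_k,
                     psi2_i, psi2_k) {
  pnorm((-gamma_ik + lambda_i * eta_a - lambda_k * eta_b) /
          sqrt(psi2_i + psi2_k))
}

# Example: a respondent high on trait a and average on trait b
# (all values illustrative)
p_prefer(eta_a = 1, eta_b = 0, gamma_ik = 0.2,
         lambda_i = 0.8, lambda_k = 0.7, psi2_i = 0.36, psi2_k = 0.51)
```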
The Thurstonian IRT model estimates item and trait (person) parameters from forced-choice item responses from a unidimensional or multidimensional scale. As in Equation 1, the relationship between an item and a trait is assumed to be based on a dominance model (linear relationship). According to Thurstone’s law of comparative judgment, a person j prefers item i to k when the utility of item i ($t_{ji}$) is greater than that of item k ($t_{jk}$). In linear factor analysis models, the relationship between an item and a trait can be explained by a linear function of the item mean and a person’s standing on the trait, as in Equations 2 and 3:

$$t_{ji} = \mu_i + \lambda_i \eta_{ja} + \varepsilon_{ji} \quad (2)$$

$$t_{jk} = \mu_k + \lambda_k \eta_{jb} + \varepsilon_{jk} \quad (3)$$

where $\mu_i$ and $\mu_k$ are the item means, $\eta_{ja}$ and $\eta_{jb}$ are person j’s standings on the traits measured by items i and k, and $\varepsilon_{ji}$ and $\varepsilon_{jk}$ are error terms.
The local dependence occurring in an item block composed of more than two items is accounted for by modeling a covariance structure in the Thurstonian IRT model. Suppose an item block is composed of three items i, k, and q, and a respondent ranked the three items 1, 2, and 3, respectively, with 1 representing “the most important” (hence the highest utility) and 3 representing “the least important” (the lowest utility). In this example, the ranks can be coded as {i, k} = 1, {i, q} = 1, and {k, q} = 1 through three pairwise comparisons (see footnote 2). The choices in pairwise comparisons involving the same item, such as the pairs {i, k} and {i, q}, are not independent after controlling for the respondent’s standings on the traits measured by the items. To account for this local dependence, the variance shared between the two pairwise comparisons is incorporated into the Thurstonian IRT model based on mathematical derivation (see Brown & Maydeu-Olivares, 2011b, 2012 for technical details). Finally, the Thurstonian IRT model can fit response data from different types of forced-choice formats, such as picking the most preferred item (PICK), selecting the most/least self-descriptive items (MOLE), and ranking all items in order of importance (RANK). The number of items in an item block can vary (see Brown & Maydeu-Olivares, 2011b and Hontangas et al., 2015 for examples of forced-choice formats and the binary coding).
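As an illustration of the binary coding just described, a short R sketch (the `rank_to_binary` helper is hypothetical, not from the study) converts the rank order of a triad block into its three pairwise outcomes:

```r
# Convert a rank order of a triad block into the three binary pairwise
# outcomes (1 = the first item of the pair is preferred).
# ranks[i] is the rank given to item i, with 1 = most important.
rank_to_binary <- function(ranks) {
  pairs <- combn(seq_along(ranks), 2)  # columns: {i,k}, {i,q}, {k,q}
  apply(pairs, 2, function(p) as.integer(ranks[p[1]] < ranks[p[2]]))
}

rank_to_binary(c(1, 2, 3))  # i ranked 1st, k 2nd, q 3rd -> 1 1 1
```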
The Thurstonian IRT model is estimated using limited information methods such as unweighted least squares or diagonally weighted least squares. In Mplus (Muthén & Muthén, 1998–2017), the corresponding estimator is ULSMV, unweighted least squares estimation with mean- and variance-adjusted test statistics. When item blocks are composed of three or more items, a correction to the degrees of freedom is required due to redundancies among the thresholds and tetrachoric correlations (Maydeu-Olivares, 1999). The redundancy in each block is computed as n(n − 1)(n − 2)/6, where n is the number of items in the block, and the total redundancy across blocks is subtracted from the degrees of freedom (Brown, 2016; Brown & Maydeu-Olivares, 2011b).
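A brief R sketch of this degrees-of-freedom correction, using the per-block redundancy count given above; the reported degrees of freedom in the example call are illustrative:

```r
# Subtract the redundancies among thresholds and tetrachoric
# correlations from the model degrees of freedom for blocks of n > 2.
corrected_df <- function(df_reported, n_blocks, n_items_per_block) {
  redundancy_per_block <- choose(n_items_per_block, 3)  # n(n-1)(n-2)/6
  df_reported - n_blocks * redundancy_per_block
}

# 20 triad blocks -> 20 redundancies subtracted (df value illustrative)
corrected_df(df_reported = 1712, n_blocks = 20, n_items_per_block = 3)
```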
Detection of Measurement Non-Invariance: Multiple Group CFA
Multiple group CFAs have become the most common method to examine measurement invariance, compared with MIMIC models (Meade & Lautenschlager, 2004). The terminology for the different types of measurement invariance is as follows: configural invariance tests the same factor structure, metric invariance tests equality of factor loadings, and scalar invariance tests equality of intercepts. Metric invariance subsumes configural invariance, and scalar invariance subsumes both other types. To conduct a subsequent test (e.g., scalar invariance), configural and then metric invariance should first be established (Vandenberg, 2002); these tests are typically performed using chi-square difference tests. A more detailed description of measurement invariance tests is provided in the online appendix.
However, because chi-square difference tests have been criticized for their sensitivity to sample size (Bentler & Bonett, 1980; Brannick, 1995; Meade & Lautenschlager, 2004), the use of alternative fit indices has been suggested: such indices are less sensitive to sample size and perform better in detecting measurement non-invariance than chi-square difference tests (Chen, 2007; Cheung & Rensvold, 2002; Meade et al., 2008). In addition, the Mplus DIFFTEST procedure does not currently support the degrees-of-freedom adjustment needed for the Thurstonian IRT model, making chi-square difference tests challenging for applied researchers and practitioners to adopt. In contrast, the use of changes in fit indices is relatively accessible because these fit indices simply need to be recomputed using corrected degrees of freedom.
Evaluation Criteria for Measurement Invariance: Changes in Fit Indices
Cheung and Rensvold (2002) and Meade et al. (2008) suggested the use of absolute changes (Δ) in alternative fit indices to test measurement invariance. Fit indices used for measurement invariance tests in the literature are the CFI (Bentler, 1990), gamma hat (Γ̂; Steiger, 1998), the noncentrality index (NCI; McDonald, 1989), and the RMSEA (Steiger & Lind, 1980). The findings from Cheung and Rensvold indicated that ΔCFI ≤ .01, ΔΓ̂ ≤ .001, and ΔNCI ≤ .02 show evidence of measurement invariance, whereas Meade et al. suggested ΔCFI ≤ .002 and condition-specific changes in NCI (e.g., .007 for the condition of five factors with 30 items) as the criteria in measurement invariance tests. In addition, Meade et al. stated that the power of changes in model fit indices to detect measurement non-invariance seems adequate with a sample size of 400 or larger per group.
Chen (2007) also suggested the use of ΔCFI ≤ .010 and ΔRMSEA < .015 as evidence of measurement invariance when sample sizes are larger than 300, the ratio between groups is equal, and the pattern of measurement non-invariance is nonuniform. When sample sizes are smaller than 300 with an unequal ratio between groups and the pattern of measurement non-invariance is uniform, Chen suggested the cutoff criteria ΔCFI < .005 and ΔRMSEA < .010. In international settings, the Teaching and Learning International Survey (TALIS) operated by the OECD (2014) adopted ΔCFI < .02 and ΔRMSEA < .03 as the evaluation criteria for metric invariance and ΔCFI < .01 and ΔRMSEA < .01 for scalar invariance (Rutkowski & Svetina, 2017). In addition, Rutkowski and Svetina stated that the criteria adopted in TALIS were established specifically for cases where the number of groups is large and sample sizes vary widely across groups.
Although using alternative fit indices to determine measurement invariance seems more advantageous than chi-square difference tests due to less sensitivity to sample size and greater accessibility, there is no evidence that the same cutoffs would work well for forced-choice formats. Also, for the evaluation of model fit, which is necessary to determine configural invariance, it should be noted that adopting commonly used cutoffs established with the use of MLE (CFI ≥ .95, RMSEA ≤ .06) may not be appropriate for forced-choice formats because limited information estimators are employed. Thus, Study 1 investigated (a) the performance of the established criteria for the evaluation of model fit and (b) the performance of the existing cutoffs for the changes in fit indices to determine measurement non-invariance when a Thurstonian IRT model was employed for forced-choice format response data. As a follow-up, Study 2 was conducted to recommend better cutoffs to improve the detection of measurement non-invariance.
Study 1
Method
Data generation
Data generation was performed in R (R Core Team, 2018) based on 20 blocks of RANK-format forced-choice items. Blocks were composed of three items, and each item measured one of five personality traits. The response data were generated through three pairwise comparisons per block, such as {item i, item k}, {item i, item q}, and {item k, item q}, based on Equation 1, resulting in 60 pairwise comparisons. Parameter values (factor loadings, thresholds, and error variances) and the correlation coefficients for the five traits used for data generation were taken from Brown and Maydeu-Olivares (2018; Table 1A in the online supplement). Random error was incorporated into each response by comparing each probability computed from Equation 1 with a unique random value drawn from a uniform distribution on [0, 1]; a comparison was coded 1 if the probability exceeded the random value and 0 otherwise (see footnote 2).
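The response-coding step can be sketched in R as follows; the probabilities shown are illustrative, not the generating values from the study:

```r
# Each pairwise comparison is coded 1 when its model-implied
# probability from Equation 1 exceeds a unique uniform random draw,
# and 0 otherwise.
set.seed(123)
gen_response <- function(p) as.integer(p > runif(length(p)))

probs <- c(0.81, 0.35, 0.62)  # illustrative probabilities, one block
gen_response(probs)
```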
Manipulated factors
A total of five factors were manipulated in this study: (a) types of non-invariance, (b) magnitudes of measurement non-invariance, (c) numbers of items manipulated, (d) directions of non-invariance, and (e) characteristics of manipulated items/item pairs. Table 1 denotes the condition names used for the combinations of the manipulated factors.
Table 1.
Measurement Non-Invariance Conditions.
| Type | Magnitude | Manipulated items | Direction | Characteristic | Condition name |
|---|---|---|---|---|---|
| Metric | Small/Medium | 5/10 items | Positive | High-loading items | PHL |
| Metric | Small/Medium | 5/10 items | Negative | High-loading items | NHL |
| Metric | Small/Medium | 5/10 items | Positive | Low-loading items | PLL |
| Metric | Small/Medium | 5/10 items | Negative | Low-loading items | NLL |
| Scalar | Small/Medium | 5/10 items | Positive | High-threshold item pairs | PHT |
| Scalar | Small/Medium | 5/10 items | Negative | High-threshold item pairs | NHT |
| Scalar | Small/Medium | 5/10 items | Positive | Low-threshold item pairs | PLT |
| Scalar | Small/Medium | 5/10 items | Negative | Low-threshold item pairs | NLT |
Among the 60 pairwise comparisons, the five and 10 lowest or highest factor loadings (weakest or strongest loadings in absolute value) were changed by ±0.3 and ±0.6 in the focal group to manipulate small and medium magnitudes of metric non-invariance. For small and medium magnitudes of scalar non-invariance, the five and 10 lowest or highest thresholds were changed by ±0.25 and ±0.5 in the focal group. The magnitudes for metric and scalar non-invariance coincide with those employed in previous studies (e.g., Lee et al., 2017; Oshima et al., 1997) in which differential item functioning (DIF) was investigated within the multidimensional IRT framework under a similar test length setting; Lee et al. also included a condition with 12 items per factor. By adapting effect size measures from Meade (2010) to the ipsative data, the small loading manipulation resulted in average expected score standardized differences equivalent to Cohen’s (1988) d = .3, and the medium manipulation to d = .5. For thresholds, the small and medium manipulations corresponded to around d = .2 and d = .4, respectively. As explained for Equation 1, because each threshold involves two items (an item pair), manipulating the threshold of one item pair also affected the other item pair involving the same item. In this simulation study, it was assumed that scalar non-invariance occurred because one of the two items in a pair was either more difficult or easier to endorse. For example, scalar non-invariance occurred due to an increase or decrease in the threshold contribution of Item 1 in the Item 1–Item 2 pair; the threshold of the pair composed of Items 1 and 3 was then also affected by the change in Item 1. As a result, five threshold manipulations resulted in changes to up to 10 item-pair thresholds. Of the two items in a pair, the item with the relatively stronger loading (better discrimination) was chosen for the manipulation of scalar non-invariance to intensify the effects of measurement non-invariance.
As each item pair involves two items, the number of pairwise comparisons manipulated for measurement non-invariance corresponds to the case where 10 and 20 single-stimulus items out of 60 behave differentially across subgroups. The manipulated proportions of items for non-invariance (17% and 33%) bracket the proportion (25%) considered in Meade et al. (2008). A sample size of 500 per subgroup was used because most simulation and empirical studies on the Thurstonian IRT model have employed sample sizes close to 500 or larger (Maydeu-Olivares & Brown, 2010; Brown & Maydeu-Olivares, 2012, 2013; Guenole et al., 2018), although Maydeu-Olivares and Brown (2010) stated 200 as the minimum sample size. To examine Type I error rates and the detection of measurement non-invariance, changes in fit indices across the three types of measurement invariance models (configural, metric, and scalar invariance) were used as the criterion for the evaluation of measurement non-invariance. Three evaluation cutoffs in absolute changes from Cheung and Rensvold (2002) (ΔCFI > .01, ΔΓ̂ > .001, and ΔNCI > .02; see footnote 3), one evaluation cutoff from Meade et al. (2008) (ΔCFI > .002), and two RMSEA cutoffs (ΔRMSEA ≥ .015 from Chen, 2007, and ΔRMSEA ≥ .010 from Rutkowski & Svetina, 2017) were employed to determine whether the more restrictive models fit significantly worse than the less restrictive models.
In Study 1, a total of 16 manipulated measurement non-invariance conditions for the two measurement non-invariance types were examined against the six evaluation cutoffs. A stepwise multiple group CFA approach was employed to test measurement non-invariance for 100 data sets from each condition. In Study 2, null sampling distributions of fit index differences were generated from 1,000 data sets in which measurement invariance held, for the recommendation of cutoff values. The sampling distributions were employed to determine cutoff values corresponding to critical values for rejecting the null hypothesis of measurement invariance at α = .05. This procedure was based on Chen (2007) and Cheung and Rensvold (2002). Mplus (Muthén & Muthén, 1998–2017) and MplusAutomation (Hallquist & Wiley, 2018) were used for the analyses of the data sets in each condition.
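For readers replicating this workflow, a hedged R sketch of harvesting fit statistics with MplusAutomation is shown below; the directory path and output file names are hypothetical, and the column names follow MplusAutomation's summary output:

```r
# Read fitted invariance models for one condition and compute a
# fit-index change between the configural and metric models.
library(MplusAutomation)

fits <- readModels("output/condition_PHL", what = "summaries")
summ <- do.call(rbind, lapply(fits, `[[`, "summaries"))

# Change in CFI between the configural and metric invariance models
# (file names are hypothetical)
delta_cfi <- summ$CFI[summ$Filename == "configural.out"] -
             summ$CFI[summ$Filename == "metric.out"]
```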
Dependent measures
Type I error rates were examined based on model fit for the measurement invariance conditions (null conditions), for configural invariance under the metric and scalar non-invariance conditions, and for metric invariance under the scalar non-invariance conditions; CFI < .95 and RMSEA > .06 were considered poor fit. If the configural invariance models, or the metric invariance models in conditions where only thresholds were manipulated (PHT, NHT, PLT, and NLT; see Table 1 for condition names), showed poor fit, it was counted as a Type I error. Type I error was also examined based on the changes in fit indices between the configural and metric invariance models under the PHT, NHT, PLT, and NLT conditions because these conditions were manipulated only for scalar non-invariance and as such should not demonstrate metric non-invariance. Changes larger than the cutoff values (for CFI, Γ̂, and NCI) or larger than or equal to the cutoff value (for RMSEA) were counted as Type I errors. The error rates correspond to the proportion of replications in which configural invariance did not hold, or in which the more restricted model was significantly worse than the less restricted model, even though measurement invariance was supposed to hold. For the detection of measurement non-invariance, the proportion of correct non-invariance detection (the proportion of replications correctly detecting measurement non-invariance) was computed. Proportions in the PHL, NHL, PLL, and NLL conditions refer to the correct detection of metric non-invariance, whereas proportions in the PHT, NHT, PLT, and NLT conditions refer to the correct detection of scalar non-invariance. The proportion of replications in which the changes in the alternative fit indices were larger than (CFI, Γ̂, NCI) or larger than or equal to (RMSEA) the cutoff values, out of the replications in which the previous invariance tests were met, was counted as correct decisions. Following the stepwise multiple group CFA procedure in measurement invariance tests (e.g., Vandenberg, 2002), detection of metric non-invariance was based only on cases in which configural invariance was met, and detection of scalar non-invariance required both configural and metric invariance to be met.
Corrected degrees of freedom due to redundancy were used for the computation of the changes in CFI, Γ̂, NCI, and RMSEA. Also, RMSEA values were adjusted for a multiple group CFA based on Steiger (1998). Bias and root mean squared error (RMSE) of estimated trait scores were utilized as dependent measures for the examination of the impact of measurement non-invariance on score estimation. For the investigation of factors affecting Type I error rates, proportions of correct non-invariance detection, bias, and RMSE, a linear model with manipulated factors as independent variables was used, and the effect size of each manipulated factor was reported for the significant factors.
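A minimal R sketch of how such indices can be recomputed from the model chi-square with corrected degrees of freedom; these are common parameterizations of the cited indices (conventions such as N vs. N − 1 vary across sources), not necessarily the exact formulas used in the study:

```r
# chisq/df: target model; chisq_base/df_base: baseline (null) model;
# N: sample size; p: number of observed variables (here, binary
# pairwise outcomes); G: number of groups (Steiger, 1998, adjustment).
fit_indices <- function(chisq, df, chisq_base, df_base, N, p, G = 2) {
  d_target <- max(chisq - df, 0)
  d_base   <- max(chisq_base - df_base, d_target, 0)
  cfi      <- if (d_base > 0) 1 - d_target / d_base else 1
  gamma    <- p / (p + 2 * (chisq - df) / N)       # gamma hat
  nci      <- exp(-0.5 * (chisq - df) / N)         # McDonald's NCI
  rmsea    <- sqrt(G) * sqrt(max((chisq - df) / (df * N), 0))
  c(CFI = cfi, GammaHat = gamma, NCI = nci, RMSEA = rmsea)
}

# All values below are illustrative
fit_indices(chisq = 1850, df = 1690, chisq_base = 9500, df_base = 1770,
            N = 1000, p = 60, G = 2)
```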
Results
Convergence and model fit
As a first step, model fit and nonconvergence rates were examined. CFI ≥ .95 and RMSEA ≤ .06 were used to examine whether the configural and metric invariance models fit the data well; configural invariance was tested for all conditions, and metric invariance was tested only for the scalar non-invariance conditions (PHT, NHT, PLT, NLT). Most models converged well, aside from the NLL conditions (6%–10% nonconvergence). Table 2A in the online supplement and the online appendix provide additional details about nonconvergence rates.
Type I error rate
For configural invariance, Type I error rates equal the rates of models with poor fit across conditions. Because the RMSEA cutoff did not flag any model as fitting poorly, the Type I error rates for configural and metric invariance were reported based on the CFI cutoff value. Type I error rates were below .05 in all conditions (Table 3A in the online supplement).
When changes in fit indices were used, the largest Type I error rates were found with ΔCFI > .002, ranging from .62 to .81. The criterion ΔΓ̂ > .001 showed the second largest Type I error rates, ranging from .22 to .40. Compared with these two cutoffs, Type I error rates for ΔNCI > .02 were relatively small, ranging from .03 to .10. The criteria ΔRMSEA ≥ .015 and ΔRMSEA ≥ .010 showed no Type I errors, and ΔCFI > .01 also showed no Type I errors except in one condition. In general, lower Type I error rates were found when the medium magnitude of scalar non-invariance (thresholds manipulated by 0.5) was applied to greater numbers of items (Table 2).
Table 2.
Type I Error Rates.
| Condition | ΔCFI > .01: S/5i | S/10i | M/5i | M/10i | ΔCFI > .002: S/5i | S/10i | M/5i | M/10i | ΔΓ̂ > .001: S/5i | S/10i | M/5i | M/10i | ΔNCI > .02: S/5i | S/10i | M/5i | M/10i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Null_MI | **.00** | — | — | — | .63 | — | — | — | .13 | — | — | — | .06 | — | — | — |
| Null_SI | **.00** | — | — | — | **.00** | — | — | — | **.00** | — | — | — | **.00** | — | — | — |
| PHT | **.00** | **.00** | **.00** | **.00** | .63 | .64 | .67 | .65 | .26 | .28 | .35 | .23 | .05 | .07 | .06 | .07 |
| NHT | **.00** | **.00** | **.00** | **.00** | .70 | .62 | .78 | .71 | .24 | .35 | .40 | .33 | .07 | .10 | .07 | **.04** |
| PLT | **.00** | **.00** | **.00** | **.01** | .69 | .66 | .67 | .65 | .26 | .26 | .22 | .28 | **.04** | .07 | .05 | **.03** |
| NLT | **.00** | **.00** | **.00** | **.00** | .65 | .64 | .68 | .81 | .33 | .39 | .35 | .36 | .08 | .09 | .10 | .08 |

Note. See Table 1 for the condition names. Type I error rates in the PHT, NHT, PLT, and NLT conditions are the proportions of replications falsely flagged for metric non-invariance. No Type I errors were found with the ΔRMSEA criteria. For the null conditions (Null_MI = metric invariance held; Null_SI = scalar invariance held), a single rate is reported per criterion because the manipulation factors do not apply. Values below .05 are boldfaced. CFI = comparative fit index; Γ̂ = gamma hat; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i/10i = five/10 items manipulated for non-invariance.
Detection of measurement non-invariance
Regarding the detection of measurement non-invariance, ΔCFI > .01 performed best for metric non-invariance based on the Type I error rates and the proportions of correct non-invariance detection (Tables 2 and 3). ΔNCI > .02 also served as a modest criterion for detecting metric non-invariance while controlling Type I error rates. Although ΔCFI > .002 and ΔΓ̂ > .001 showed quite high detection of metric non-invariance, they had poor Type I error rate control. In contrast, ΔRMSEA ≥ .015 and ΔRMSEA ≥ .010 controlled Type I error rates well but failed to detect any measurement non-invariance. In terms of the detection of scalar non-invariance, none of the six cutoffs performed well. Even though a small proportion of cases was detected with ΔCFI > .002, its high Type I error rates preclude its use.
Table 3.
Proportions of Correct Non-Invariance Detection.
| Condition | ΔCFI > .01: S/5i | S/10i | M/5i | M/10i | ΔCFI > .002: S/5i | S/10i | M/5i | M/10i | ΔΓ̂ > .001: S/5i | S/10i | M/5i | M/10i | ΔNCI > .02: S/5i | S/10i | M/5i | M/10i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PHL | .00 | .00 | .01 | .12 | .59 | .50 | .57 | **.81** | .22 | .11 | .30 | .64 | .05 | .01 | .10 | .42 |
| NHL | .00 | .00 | .04 | .63 | .64 | .61 | .74 | **.99** | .18 | .22 | .38 | **.94** | .03 | .03 | .16 | **.81** |
| PLL | .04 | .16 | **.88** | **.99** | .63 | **.83** | **1.00** | **.99** | .27 | .59 | **1.00** | **.99** | .10 | .35 | **.99** | **.99** |
| NLL | .00 | .04 | .57 | **.98** | .57 | .71 | **.94** | **1.00** | .19 | .43 | **.90** | **1.00** | .04 | .28 | .79 | **1.00** |
| PHT | .00 | .00 | .00 | .00 | .02 | .06 | .03 | .65 | .00 | .00 | .00 | .07 | .00 | .00 | .00 | .00 |
| NHT | .00 | .00 | .00 | .00 | .00 | .03 | .05 | .59 | .00 | .00 | .00 | .04 | .00 | .00 | .00 | .00 |
| PLT | .00 | .00 | .00 | .00 | .00 | .03 | .00 | .21 | .00 | .00 | .00 | .00 | .00 | .00 | .00 | .00 |
| NLT | .00 | .00 | .00 | .00 | .00 | .00 | .00 | .28 | .00 | .00 | .00 | .00 | .00 | .00 | .00 | .00 |

Note. See Table 1 for the condition names. Proportions in the PHL, NHL, PLL, and NLL conditions refer to the correct detection of metric non-invariance; proportions in the PHT, NHT, PLT, and NLT conditions refer to the correct detection of scalar non-invariance. No detection occurred with the ΔRMSEA criteria. Values above .8 are boldfaced. CFI = comparative fit index; Γ̂ = gamma hat; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i/10i = five/10 items manipulated for non-invariance.
Overall, ΔCFI > .01 and ΔNCI > .02 were found to perform relatively well only when a larger number of items exhibited a larger magnitude of metric non-invariance. With the established cutoffs, scalar non-invariance appears essentially undetectable. Thus, determining new cutoff values for scalar non-invariance was imperative.
Study 2
Method
To suggest cutoff values to detect measurement non-invariance while controlling for Type I error rates at the nominal level, sampling distributions of fit index differences were generated from 1,000 data sets where measurement invariance held. For the fit index differences, only CFI and NCI were considered due to their performance in Study 1. Based on the procedure from Chen (2007) and Cheung and Rensvold (2002), cutoffs were determined using the concept of critical values in the sampling distribution for rejecting the null hypothesis of measurement invariance with an α = .05. That is, the proposed cutoffs correspond to the 95th percentiles of ΔCFI and ΔNCI distributions.
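A minimal R sketch of this cutoff derivation, with placeholder values standing in for the per-replication fit index differences:

```r
# delta_cfi: per-replication |ΔCFI| values from the 1,000 null data
# sets (illustrative placeholder values shown here)
set.seed(1)
delta_cfi <- abs(rnorm(1000, mean = 0, sd = 0.003))

# The 95th percentile serves as the critical value for rejecting the
# null hypothesis of measurement invariance at alpha = .05
cutoff_cfi <- quantile(delta_cfi, probs = 0.95)
```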
After determining the cutoffs for the detection of metric and scalar non-invariance, it was examined whether the recommended cutoffs control Type I error rates at the nominal level and exhibit greater detection rates of measurement non-invariance than existing cutoffs, especially considering scalar non-invariance.
Results
Recommendation of cutoffs
Based on the sampling distributions, this study proposed cutoffs of ΔCFI > .007 for metric non-invariance and ΔCFI > .001 for scalar non-invariance. Regarding ΔNCI, the same cutoff as in Cheung and Rensvold (2002), ΔNCI > .02, was found for metric non-invariance; for scalar non-invariance, this study recommended ΔNCI > .004. Compared with the existing cutoffs, these results show that detecting scalar non-invariance in forced-choice formats requires smaller cutoffs than detecting metric non-invariance. The findings align with Rutkowski and Svetina (2017), whose cutoffs for scalar non-invariance were also smaller than those for metric non-invariance (e.g., ΔCFI < .02 for metric and ΔCFI < .01 for scalar non-invariance).
Type I error rates and detection of measurement non-invariance
As seen in Table 4, the suggested cutoffs controlled Type I error rates at the nominal level, with ΔCFI controlling Type I error rates better than ΔNCI. In terms of measurement non-invariance detection, the suggested cutoffs detected both metric and scalar non-invariance better than the existing cutoffs (ΔCFI > .01, ΔNCI > .02), especially when medium non-invariance was exhibited for greater numbers of items (Table 5). Also, the new cutoff ΔCFI > .007 demonstrated higher detection rates for metric non-invariance than ΔCFI > .01.
Table 4.
Type I Error Rates for Recommended Cutoffs.
| Condition | ΔCFI: S/5i | S/10i | M/5i | M/10i | ΔNCI: S/5i | S/10i | M/5i | M/10i |
|---|---|---|---|---|---|---|---|---|
| Null_MI | **.00** | — | — | — | .06 | — | — | — |
| Null_SI | **.00** | — | — | — | **.00** | — | — | — |
| PHT | **.01** | **.04** | .05 | .06 | .05 | .07 | .06 | .07 |
| NHT | **.03** | .06 | **.04** | **.01** | .07 | .10 | .07 | **.04** |
| PLT | **.03** | .07 | **.04** | **.03** | **.04** | .07 | .05 | **.03** |
| NLT | **.04** | **.04** | **.04** | .05 | .08 | .09 | .10 | .08 |

Note. See Table 1 for the condition names. The ΔCFI criterion is > .007 for metric invariance (MI) tests and > .001 for scalar invariance (SI) tests; the ΔNCI criterion is > .02 for MI tests and > .004 for SI tests. For the null conditions (Null_MI = metric invariance held; Null_SI = scalar invariance held), a single rate is reported per criterion. Values below .05 are boldfaced. CFI = comparative fit index; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i/10i = five/10 items manipulated for non-invariance.
Table 5.
Proportions of Correct Non-Invariance Detection for Recommended Cutoffs.
| Condition | ΔCFI: S/5i | S/10i | M/5i | M/10i | ΔNCI: S/5i | S/10i | M/5i | M/10i |
|---|---|---|---|---|---|---|---|---|
| PHL | .03 | .00 | .06 | .31 | .05 | .01 | .10 | .42 |
| NHL | .02 | .03 | .15 | .80 | .03 | .03 | .16 | **.81** |
| PLL | .00 | .33 | **.98** | **.99** | .10 | .35 | **.99** | **.99** |
| NLL | .01 | .23 | .72 | **1.00** | .04 | .28 | .79 | **1.00** |
| PHT | .15 | .24 | .20 | **.88** | .02 | .14 | .07 | **.81** |
| NHT | .16 | .16 | .20 | **.92** | .04 | .11 | .05 | **.90** |
| PLT | .09 | .16 | .21 | .63 | .05 | .08 | .08 | .49 |
| NLT | .12 | .09 | .19 | .71 | .06 | .03 | .10 | .56 |

Note. See Table 1 for the condition names. The ΔCFI criterion is > .007 for metric invariance (MI) tests and > .001 for scalar invariance (SI) tests; the ΔNCI criterion is > .02 for MI tests and > .004 for SI tests. Values above .8 are boldfaced. CFI = comparative fit index; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i/10i = five/10 items manipulated for non-invariance.
To examine factors affecting Type I error rates and the detection of measurement non-invariance, t-tests and analyses of variance (ANOVAs) were conducted. Overall, ΔCFI performed significantly better than ΔNCI, the negative direction of measurement non-invariance led to higher Type I error rates, and higher magnitudes of non-invariance led to greater detection rates (see the online appendix).
Bias and RMSE
The impact of measurement non-invariance on estimated scores was examined with absolute bias, bias, and RMSE values (Table 4A in the online supplement). Absolute bias across the measurement non-invariance conditions ranged from 0.34 to 0.46, and bias ranged from −0.025 to 0.043. ANOVAs were conducted to investigate factors affecting these values. Overall, failure to detect metric non-invariance was more detrimental than failure to detect scalar non-invariance, and the negative direction of non-invariance was more detrimental to the accuracy of estimation than the positive direction (see the online appendix).
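For clarity, the score-recovery measures can be sketched in R as follows, with `est` and `true` as hypothetical vectors of estimated and generating trait scores:

```r
est  <- c(0.42, -1.10, 0.88)   # illustrative estimated trait scores
true <- c(0.50, -1.00, 0.75)   # illustrative generating (true) scores

bias     <- mean(est - true)           # signed bias
abs_bias <- mean(abs(est - true))      # absolute bias
rmse     <- sqrt(mean((est - true)^2)) # root mean squared error
```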
Discussion
Regarding the performance of the established criteria for the evaluation of model fit, RMSEA ≤ .06 flagged no models as fitting poorly across conditions, and only small proportions of models were assessed as fitting poorly with CFI ≥ .95, especially under configural invariance. With respect to the performance of the existing cutoffs for changes in fit indices, ΔCFI > .01 and ΔNCI > .02 performed better for the detection of metric non-invariance than the other four cutoffs. However, as ΔCFI > .01 and ΔNCI > .02 performed poorly for the detection of scalar non-invariance, providing more applicable cutoffs was crucial. This study suggested ΔCFI > .001 and ΔNCI > .004 as the cutoffs for scalar non-invariance, and ΔCFI > .007 was also provided for the detection of metric non-invariance. Based on Type I error rate control and measurement non-invariance detection, ΔCFI > .007 for metric non-invariance and ΔCFI > .001 for scalar non-invariance were recommended as the cutoffs for measurement non-invariance tests when the Thurstonian IRT model is fit to forced-choice format data.
Regarding the impact of failure to detect non-invariance, the average amount of bias may not appear consequential; however, for decisions at the individual level (e.g., selection or admission), failure to detect non-invariance can jeopardize test fairness, especially in multicultural settings where items related to certain personality traits are more or less favored than other items due to cultural backgrounds. For example, when ranking statements (see footnote 4) such as “I waste my time” (Item 1), “I get irritated easily” (Item 2), and “I talk to a lot of people at parties” (Item 3), an individual from a culture where implicit stigma is attached to showing negative affect may rank the three statements “maybe like me,” “least like me,” and “most like me,” respectively. As a result, Item 3 becomes easier and Item 2 more difficult to endorse, potentially causing both items to become less discriminating because of the similar response patterns from a majority of respondents with the same cultural background. The negative non-invariance may then lead to a positive bias in estimated trait scores: when the non-invariance is ignored, such an individual receives higher trait scores than an individual from a background where showing irritation carries no cultural connotation.
This study employed a sample size of 500 per group with an equal ratio. Initially, a sample size of 200 per group was also included; however, nonconvergence and poor fit were frequently detected in that condition, with 20% nonconvergence under configural invariance. Related to sample size, Lin and Brown (2017) found that measurement non-invariance did not affect the estimation of scores, but attention should be paid to their sample sizes: 62,639 and 22,610 participants for the quad and triad formats, respectively, of a forced-choice assessment composed of 104 item blocks. Such findings may not be applicable in most research settings, where the numbers of item blocks and respondents are likely to be much smaller. For example, Guenole et al. (2018) had a sample size of 420 with 20 item blocks, and Anguiano-Carrasco et al. (2015) had a sample size of 283 with eight blocks of three items each. In such cases, failure to detect measurement non-invariance due to small sample sizes can threaten test fairness. Thus, the authors recommend that future research investigate cutoffs under various sample size conditions.
In addition, the stepwise procedures in a multiple group CFA present methodological challenges in identifying the items contributing to measurement non-invariance. Because measurement invariance tests employ a holistic approach based on changes in model fit indices, the results provide little information beyond the model modification indices (MODINDICES) offered by Mplus (Muthén & Muthén, 1998–2017). For assessment developers, the analysis output may have little practical usefulness when non-invariance is detected; all they can do based on the MODINDICES output is to free the flagged constraints on thresholds or loadings, compare changes in fit indices, and repeat this process until the values fall below the criterion set for MODINDICES. Compared with various DIF detection methods, including those for forced-choice formats in the generalized graded unfolding model (Roberts et al., 2000), which focus on item-level information, the stepwise multiple group CFA procedures do not seem practically useful, especially for piloting forced-choice items during assessment development. Therefore, in-depth studies are needed to investigate cutoffs for evaluating measurement non-invariance in forced-choice formats along with methods to identify the items contributing to non-invariance.
The cutoff values suggested in this study were based on only the RANK forced-choice format. Considering that the RANK format offers more information than other forced-choice formats, the use of MOLE or PICK formats may lead to greater challenges in the detection of measurement non-invariance as the occurrence of measurement non-invariance may be on the pairwise comparisons related to missing responses. This problem would be exacerbated for the PICK format due to more limited information compared with RANK or MOLE.
The recommended cutoffs for forced-choice format invariance tests provide useful information for reducing potential threats to test fairness. However, it should be noted that the recommendations from this study rest on the assumption of no mean differences in trait scores between the reference and focal groups. That is, the current research addressed conditions in which personality traits between a reference and focal group may not necessarily be expected to differ (e.g., gender) but differential endorsement occurs due to different interpretations of items or different response styles toward items or item blocks. As the literature shows, however, cultural differences may affect five-factor personality scores. For example, extroversion scores were found to be lower in Asian cultures than in European and American cultures, whereas agreeableness scores were higher in most Asian and African cultures than in Western countries (Allik & McCrae, 2004; Hofstede & McCrae, 2004). Therefore, the authors recommend that future research include different trait levels among subgroups of respondents along with various measurement non-invariance conditions, different types of forced-choice formats, and different sample sizes for the generalization of the findings, especially for cross-cultural research.
Supplemental Material
Supplemental material, supplemental_material for Fit Indices for Measurement Invariance Tests in the Thurstonian IRT Model by HyeSun Lee and Weldon Z. Smith in Applied Psychological Measurement
Footnotes
1. Equation 1 is based on a dominance model in which a linear relationship between an item (or a statement) and a trait is assumed. Depending on the assumed relationships between items and traits, the probabilistic function varies.
2. The binary code 1 was assigned when the first item in a pairwise comparison was preferred over the second item; the binary code 0 means that the second item was preferred over the first.
3. Because the cutoff values of NCI provided in Meade et al. (2008) were not applicable to the simulated conditions in this study, the NCI cutoff from Meade et al. was not included as a criterion value.
4. The statements were from Brown and Maydeu-Olivares (2011a).
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: HyeSun Lee https://orcid.org/0000-0002-0826-4655
Supplemental Material: Supplementary material is available for this article online.
References
- Allik J., McCrae R. R. (2004). Toward a geography of personality traits: Patterns of profiles across 36 cultures. Journal of Cross-Cultural Psychology, 35, 13–28. [Google Scholar]
- Anguiano-Carrasco C., MacCann C., Geiger M., Seybert J. M., Roberts R. D. (2015). Development of a forced-choice measure of typical-performance emotional intelligence. Journal of Psychoeducational Assessment, 33, 83–97. [Google Scholar]
- Bartram D. (2013. a). A cross-validation of between country differences in personality using the OPQ32. International Journal of Quantitative Research in Education, 1, 182–209. [Google Scholar]
- Bartram D. (2013. b). Scalar equivalence of OPQ32: Big five profiles of 31 countries. Journal of Cross-Cultural Psychology, 44, 61–83. [Google Scholar]
- Bentler P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. [DOI] [PubMed] [Google Scholar]
- Bentler P. M., Bonett D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606. [Google Scholar]
- Brannick M. T. (1995). Critical comments on applying covariance structure modeling. Journal of Organizational Behavior, 16, 201–213. [Google Scholar]
- Brown A. (2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81, 135–160. [DOI] [PubMed] [Google Scholar]
- Brown A., Maydeu-Olivares A. (2011. a). Forced-choice five factor markers. PsycTESTS. [Google Scholar]
- Brown A., Maydeu-Olivares A. (2011. b). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71, 460–502. [Google Scholar]
- Brown A., Maydeu-Olivares A. (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44, 1135–1147. [DOI] [PubMed] [Google Scholar]
- Brown A., Maydeu-Olivares A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18, 36–52. [DOI] [PubMed] [Google Scholar]
- Brown A., Maydeu-Olivares A. (2018). Modeling forced-choice response format. In Irwing P., Booth T., Hughes D. (Eds.), The Wiley handbook of psychometric testing (pp. 523–569). John Wiley. [Google Scholar]
- Chen F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. [Google Scholar]
- Cheung G. W., Rensvold R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. [Google Scholar]
- Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum. [Google Scholar]
- Dueber D. M., Love A. M. A., Toland M. D., Turner T. A. (2019). Comparison of single-response format and forced-choice format instruments using Thurstonian item response theory. Educational and Psychological Measurement, 79, 108–128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guenole N., Brown A. A., Cooper A. J. (2018). Forced-choice assessment of work-related maladaptive personality traits: Preliminary evidence from an application of Thurstonian item response modeling. Assessment, 25, 513–526. [DOI] [PubMed] [Google Scholar]
- Hallquist M., Wiley J. (2018). MplusAutomation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling, 25, 621–638. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hofstede G., McCrae R. R. (2004). Personality and culture revisited: Linking traits and dimensions of culture. Cross-Cultural Research, 38, 52–88. [Google Scholar]
- Hontangas P. M., de la Torre J., Ponsoda V., Leenen I., Morillo D., Abad F. J. (2015). Comparing traditional and IRT scoring of forced-choice tests. Applied Psychological Measurement, 39, 598–612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hu L. T., Bentler P. M. (1999). Cuttoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. [Google Scholar]
- Lee P., Lee S., Stark S. (2018). Examining validity evidence for multidimensional forced choice measures with different scoring approaches. Personality and Individual Differences, 123, 229–235. [Google Scholar]
- Lee S., Bulut O., Suh Y. (2017). Multidimensional extension of multiple indicators multiple causes models to detect DIF. Educational and Psychological Measurement, 77, 545–569. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Y., Brown A. (2017). Influence of context on item parameters in forced-choice personality assessments. Educational and Psychological Measurement, 77, 389–414. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maydeu-Olivares A. (1999). Thurstonian modeling of ranking data via mean and covariance structure analysis. Psychometrika, 64, 325–340. [Google Scholar]
- Maydeu-Olivares A., Brown A. (2010). Item response modeling of paired comparison and ranking data. Multivariate Behavioral Research, 45, 935–974. [DOI] [PubMed] [Google Scholar]
- McDonald R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97–103. [Google Scholar]
- Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95, 728–743. [DOI] [PubMed] [Google Scholar]
- Meade A. W., Johnson E. C., Braddy P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. [DOI] [PubMed] [Google Scholar]
- Meade A. W., Lautenschlager G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11, 60–72. [Google Scholar]
- Millsap R. E. (2011). Statistical approaches to measurement invariance. Routledge. [Google Scholar]
- Muthén L. K., Muthén B. O. (1998. –2017). Mplus user’s guide (8th ed.). [Google Scholar]
- Nye C. D., Drasgow F. (2011). Assessing goodness of fit: Simple rules of thumb simply do not work. Organizational Research Methods, 14, 548–570. [Google Scholar]
- Organisation for Economic Co-operation and Development. (2014). PISA 2012 technical report. https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf
- Oshima T. C., Raju N. S., Flowers C. P. (1997). Development and demonstration of multidimensional IRT-based internal measures of differential functioning of items and tests. Journal of Educational Measurement, 34, 253–272. [Google Scholar]
- R Core Team. (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing; http://www.R-project.org/ [Google Scholar]
- Roberts J. S., Donoghue J. R., Laughlin J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32. [Google Scholar]
- Rutkowski L., Svetina D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30, 39–51. [Google Scholar]
- SHL Group. (2006). OPQ32 technical manual. https://www.yumpu.com/en/document/view/6485292/opq32-user-manual-shl-solutions-partners
- Stark S., Chernyshenko O. S., Drasgow F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise-preference model. Applied Psychological Measurement, 29, 184–203. [Google Scholar]
- Steiger J. H. (1998). A note on multiple sample extensions of the RMSEA fit index. Structural Equation Modeling, 5, 411–419. [Google Scholar]
- Steiger J. H., Lind J. C. (1980, May). Statistically based tests for the number of common factors [Paper presentation]. Annual Meeting of the Psychometric Society, Iowa City, IA. [Google Scholar]
- Thurstone L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286. [Google Scholar]
- Usami S., Sakamoto A., Naito J., Abe Y. (2016). Developing pairwise preference-based personality test and experimental investigation of its resistance to faking effect by item response model. International Journal of Testing, 16, 288–309. [Google Scholar]
- Vandenberg R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5, 139–158. [Google Scholar]