Abstract
Ipsative tests with multidimensional forced-choice (MFC) items have been widely used to assess career interest, values, and personality while preventing response biases. Recently, there has been a surge of interest in developing item response theory models for MFC items. In practice, a statement in an MFC item may have different utilities for different groups, which is referred to as differential statement functioning (DSF). However, few studies have investigated methods for detecting DSF, owing to the challenges posed by the features of ipsative tests. In this study, three methods were adapted for DSF assessment in MFC items: equal-mean-utility (EMU), all-other-statement (AOS), and constant-statement (CS). Simulation studies were conducted to evaluate parameter recovery and the performance of the proposed methods. Results showed that statement parameters and DSF parameters were well recovered for all three methods when the test did not contain any DSF statement. When the test contained one or more DSF statements, only the CS method yielded accurate estimates. With respect to DSF assessment, both the EMU method using the bootstrap standard error and the AOS method performed appropriately as long as the test did not contain any DSF statement. The CS method performed well whenever one or more DSF-free statements were chosen as anchors, and the more anchor statements there were, the higher the power of DSF detection.
Keywords: differential item functioning, ipsative tests, multidimensional forced-choice, item response theory
Tests with multidimensional forced-choice (MFC) items have been widely used in educational and psychological measurement to assess career interest, values, and personality. Popular ipsative tests include the Jackson Vocational Interest Survey (JVIS), the Edwards Personal Preference Schedule (EPPS), the Allport–Vernon–Lindzey Study of Values, and the Occupational Personality Questionnaire. When responding to MFC items, respondents are often required to choose a statement they prefer from a pair of statements, or to rank a set of statements that are designed to measure different dimensions. For MFC items, the statement is a single stimulus measuring one dimension while the item is a combination of two or more statements. Take career interest as an example: A typical MFC item using paired statements reads, “Which activity do you prefer: attending parties or visiting museums?” The first choice, “attending parties,” is designed to measure the dimension of social interest, whereas the second, “visiting museums,” is designed to measure the dimension of artistic interest. MFC items are usually scored as 1 and 0, with 1 being assigned to the statement preferred by the respondent and 0 to the statement that is not preferred. For example, the respondents will receive a score of 1 if they select “attending parties” and a score of 0 otherwise. Thus, tests using MFC items feature identical number-correct scores for all individuals, which is known as the ipsative nature (from the Latin ipse: he, himself), as opposed to so-called normative tests that use Likert-type items (Cattell, 1944).
MFC items can effectively prevent the response biases (e.g., socially desirable responding) found in typical Likert-type items (Matthews & Oddy, 1997). Recently, the field has seen a surge of interest in applying modern testing theory, that is, item response theory (IRT), to model data collected from ipsative tests. The IRT models developed to account for the ipsative nature fall into two broad frameworks: the dominance framework and the ideal point framework. Models in the dominance framework assume that the higher a respondent's location on a trait and the larger the statement's utility (attractiveness), the larger the probability that the respondent selects the statement. These models include the Thurstonian IRT models (Brown & Maydeu-Olivares, 2011, 2013) and the Rasch ipsative models (RIM; Wang et al., 2016, 2017). Models in the ideal point framework include the Zinnes and Griggs (ZG) model for unidimensional pairwise-comparison items (Stark & Drasgow, 2002), the multi-unidimensional pairwise-preference (MUPP) model for multidimensional pairwise-comparison items (Stark et al., 2005), and the generalized graded unfolding model for multidimensional ranking items (GGUM-RANK; Joo et al., 2018). This framework assumes that the probability of selecting a statement is determined by the distance between the respondent's trait location and the statement's location.
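To make the contrast concrete, the following minimal Python sketch computes a pairwise-choice probability under a simplified dominance rule and a simplified ideal-point rule. The actual Thurstonian, RIM, MUPP, and GGUM-RANK likelihoods are more elaborate; this is only an illustration of the two assumptions, and all numbers are hypothetical.

```python
import numpy as np

def p_dominance(theta_a, theta_b, delta_s, delta_t):
    """Dominance rule: the higher the trait and the larger the statement's
    utility, the more attractive the statement; choose s over t with a
    Luce-type ratio of the two attractiveness values."""
    u_s, u_t = np.exp(theta_a + delta_s), np.exp(theta_b + delta_t)
    return u_s / (u_s + u_t)

def p_ideal_point(theta_a, theta_b, loc_s, loc_t):
    """Ideal-point rule: the closer a statement's location is to the
    respondent's trait location, the more attractive the statement."""
    u_s, u_t = np.exp(-abs(theta_a - loc_s)), np.exp(-abs(theta_b - loc_t))
    return u_s / (u_s + u_t)

# A high trait level always helps under dominance, but under the ideal-point
# rule a statement located far below the trait loses attractiveness.
print(p_dominance(2.0, 0.0, delta_s=0.5, delta_t=0.5))   # ~0.88
print(p_ideal_point(2.0, 0.0, loc_s=0.5, loc_t=0.5))     # ~0.27
```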
The assessment of differential item functioning (DIF) has been a routine practice in the development of tests because it is closely related to test fairness and validity (American Educational Research Association et al., 2014). There are two main kinds of DIF detection approaches: the nonparametric approach based on observable scores and the parametric approach based on latent traits. Although nonparametric methods are cheaper to implement, they depend on the central assumption that the test under study is unidimensional. When this assumption is tenable, the total score is a reasonable matching variable. However, when a test measures multiple dimensions, as is likely with ipsative tests (e.g., JVIS measures 34 dimensions and EPPS measures 15 dimensions), using the total score as a matching variable may be problematic (B. E. Clauser et al., 1996; Mazor et al., 1998). For this reason, the present study uses the parametric method.
The choice of the matching variable is critical in DIF assessment because different groups of participants must be placed on a common metric before their performances on a studied item can be compared for evidence of DIF (Wang, 2008). If the matching variable contains DIF items, the DIF analysis will be biased and its results misleading; only when the matching variable contains exclusively DIF-free items will the subsequent DIF analysis be correct. In the DIF literature, three methods have been used to establish a common metric (Wang, 2008): the equal-mean-difficulty (EMD) method, the all-other-item (AOI) method, and the constant-item (CI) method. In general, the EMD and AOI methods perform appropriately only under some unrealistic conditions (e.g., no DIF items), whereas the CI method is superior to them for establishing a "clean" matching variable. For detailed results, the interested reader is referred to these publications and the references therein (Wang, 2008; Wang & Yeh, 2003).
Test fairness is also an important issue for ipsative tests. In the career interest item above, for example, "attending parties" might have a higher utility for females, whereas "visiting museums" might have a higher utility for males. Compared with the number of studies on DIF assessment for normative tests, few studies have investigated DIF assessment for ipsative tests. To the best of the authors' knowledge, only one study (Chen & Wang, 2014) has adapted logistic regression (Swaminathan & Rogers, 1990) for ipsative tests, with a brief simulation to assess the method. This scarcity reflects the challenges associated with DIF assessment for ipsative tests. In ipsative testing, the term differential statement functioning (DSF) is used because the unit of analysis switches from the item to the statement. This shift complicates the assessment substantially and makes DSF distinct from conventional DIF assessment. This study aims to develop methods for detecting DSF in ipsative tests to ensure test fairness, focusing on ipsative tests with MFC items composed of two paired statements.
The remainder of this article is organized as follows. First, the unique challenges with respect to DSF assessment are discussed. Second, the definition of DSF and the three methods of establishing a common matrix for DSF assessment are described in detail. Third, the authors consider a series of simulation studies to evaluate parameter recovery as well as the performance of the proposed methods in the assessment of DSF. Fourth, some conclusions are drawn, and directions for future study are discussed.
Challenges of DSF Assessment in Ipsative Tests With MFC Items
Typically, conventional DIF assessment methods were proposed in the context of normative Likert-type items where each item is unidimensional and independent from other items. Thus, they may not be applicable in the context of ipsative MFC items where an item itself is multidimensional and contains more than one statement.
Take an MFC item with two paired statements as an example. There are three possible combinations of DSF-free and DSF statements: (a) both statements are DSF-free (denoted as “DSF-free versus DSF-free”); (b) one of the statements is DSF-free, while the other is a DSF statement (“DSF-free versus DSF”); and (c) both statements are DSF (“DSF versus DSF”). Apparently, it is possible that one or both statements in an item are biased toward one particular group, making the item a DIF item. An item is deemed DIF-free exclusively when both statements are DSF-free. Therefore, it is more appropriate to switch the unit of analysis from item to statement.
However, a statement in an ipsative test can be paired with different statements and thus appear in different MFC items. Theoretically, a DSF-free statement should be measurement invariant, irrespective of which statement it is paired with. By contrast, a DSF statement can be more or less attractive to a particular group of participants depending on which statement it is paired with. The "DSF versus DSF" pairing particularly deserves investigation because not only the magnitude but also the direction of bias may differ between the paired DSF statements; that is, one statement may favor one group while the other favors the other group, and the sizes of the biases need not be identical. As a result, problems arise when conventional parametric DIF assessment methods are applied to MFC items. Technically, in DSF assessment, one fits an IRT model developed specifically for MFC items to the data of the reference and focal groups and tests whether the statement parameters differ significantly between the groups. However, it remains unknown whether the statement parameters in an MFC IRT model can be accurately estimated when DSF statements are present, and whether the hypothesis test for a DSF statement is affected by the kind of combination (i.e., "DSF-free versus DSF-free," "DSF-free versus DSF," or "DSF versus DSF") in which the studied statement appears.
Moreover, MFC items are formed by pairing statements that are designed to measure different dimensions. The multidimensionality within an item also poses a challenge for the assessment of DSF. Shealy and Stout (1993) pointed out that “it is theoretically possible that cancellation could occur within an item if the item depends on at least two nuisance dimensions” (as cited in Holland & Wainer, 1993, p. 218). They also raised the possibility of cancellation for DIF effects for those items measuring multiple dimensions. It is interesting to evaluate whether the DSF effects will cancel for MFC items that are multidimensional in nature because statements in an MFC item measure different dimensions.
These issues are unique to DSF assessment for MFC items and are rarely, if ever, considered in conventional DIF assessment for normative Likert-type items. This study aims to address them by adapting existing DIF detection methods to assess DSF in MFC items.
Methods for DSF Assessment in Ipsative Tests With MFC Items
This study uses the RIM (Wang et al., 2017) because it belongs to the Rasch model family and possesses the measurement property of specific objectivity. In the RIM, each statement has a utility for all persons. In reality, a statement may possess different utilities for different groups of persons. When a statement has different utilities for different groups, it is deemed to have DSF. It can be defined as follows:
Under the RIM, the probability that person n in group g prefers statement s over statement t in an item is

$$P\left(y_{n(s>t)}=1\right)=\frac{\exp\left(\theta_{na}+\delta_{gs}\right)}{\exp\left(\theta_{na}+\delta_{gs}\right)+\exp\left(\theta_{nb}+\delta_{gt}\right)}, \tag{1}$$

where $\delta_{gs}$ and $\delta_{gt}$ are the utilities of statements s and t, respectively, for persons in group g; $\theta_{na}$ and $\theta_{nb}$ are the deviations of latent trait a (measured by statement s) and latent trait b (measured by statement t) from their respective means for person n.
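As a concrete illustration, the following minimal sketch evaluates the reconstructed Equation 1; the group-specific utilities are how DSF enters the model, and all numbers are hypothetical.

```python
import numpy as np

def rim_prob(theta_a, theta_b, delta_gs, delta_gt):
    """Equation 1: probability that person n prefers statement s (measuring
    trait a) over statement t (measuring trait b), with group-g utilities."""
    num = np.exp(theta_a + delta_gs)
    return num / (num + np.exp(theta_b + delta_gt))

# A DSF statement: s is more attractive to the reference group (R) than to
# the focal group (F) for respondents with identical trait levels.
theta_a, theta_b = 0.4, 0.1
print(rim_prob(theta_a, theta_b, delta_gs=0.5, delta_gt=-0.2))   # reference group
print(rim_prob(theta_a, theta_b, delta_gs=-0.1, delta_gt=-0.2))  # focal group
```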
In this study, three methods of establishing a common metric over groups for normative tests are adapted to assess DSF for ipsative tests.
The Equal-Mean-Utility (EMU) Method
When applying the EMU method to DSF assessment, the mean statement utility of each dimension is constrained to be equal across all groups. The RIM is fitted to the data of the reference and focal groups separately, and the difference between the statement parameter estimates of the two groups is evaluated. Specifically, in the univariate case, the difference in the parameters of statement s between the two groups is tested as follows:
$$z=\frac{\hat{\delta}_{Fs}-\hat{\delta}_{Rs}}{\sqrt{\widehat{\operatorname{var}}\left(\hat{\delta}_{Fs}\right)+\widehat{\operatorname{var}}\left(\hat{\delta}_{Rs}\right)}}, \tag{2}$$

where $\hat{\delta}_{Fs}$ and $\hat{\delta}_{Rs}$ are the estimates of the utilities of statement s in item i for the focal and reference groups, respectively (the subscript i is omitted for simplicity), and $\widehat{\operatorname{var}}(\hat{\delta}_{Fs})$ and $\widehat{\operatorname{var}}(\hat{\delta}_{Rs})$ are their error variance estimates. DSF detection is conducted by referring z to the standard normal distribution or, equivalently, z² to the chi-square distribution with one degree of freedom (Wald, 1943). If the difference is statistically significant, statement s is deemed to exhibit DSF. The method is applied in the same way to the other statements.
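A minimal sketch of the test in Equation 2, assuming the utility estimates and their standard errors have already been obtained from separate calibrations of the two groups; the numbers in the example are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def wald_dsf(delta_f, delta_r, se_f, se_r, alpha=0.05):
    """Equation 2: Wald z test of the utility difference for one statement.
    Returns the statistic, the two-sided p value, and the DSF flag."""
    z = (delta_f - delta_r) / np.sqrt(se_f ** 2 + se_r ** 2)
    p = 2.0 * norm.sf(abs(z))
    return z, p, p < alpha

# Example: an estimated group difference of -0.45 with SEs of 0.10 per group.
print(wald_dsf(delta_f=-0.15, delta_r=0.30, se_f=0.10, se_r=0.10))
```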
The All-Other-Statement (AOS) Method
When the AOS method is applied to DSF assessment, all the statements in a test except the studied statement are anchored. The null hypothesis of no difference in statement parameters between groups is tested with the following steps (a code sketch follows the list):

1. The RIM is fitted to the data with all the statement parameters constrained to be identical between groups (the compact model), assuming no DSF in the test. The likelihood deviance $G^2_C$ is calculated.

2. The model is fitted to the same data with the parameters of all the statements except the one being studied constrained to be identical between groups (the augmented model), assuming that the studied statement is the only DSF statement. The likelihood deviance $G^2_{A_w}$ ($w = 1, \ldots, W$) is calculated, where W is the total number of statements in the test. This step is repeated for each statement, so a total of W augmented models are formed.

3. $G^2_C - G^2_{A_w}$ is calculated and referred to the chi-square distribution with one degree of freedom for each statement. If the statistic is statistically significant, the null hypothesis (i.e., the compact model) is rejected and the studied statement is identified as exhibiting DSF.
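The following sketch outlines the AOS scan. Here `fit_rim` is a hypothetical wrapper (not part of any actual package) around the calibration software: it takes the list of statements whose parameters are free to differ between groups and returns the model deviance.

```python
from scipy.stats import chi2

def lrt_flag(dev_compact, dev_augmented, df=1, alpha=0.05):
    """Refer G2 = deviance(compact) - deviance(augmented) to chi-square(df)."""
    g2 = dev_compact - dev_augmented
    return g2, g2 > chi2.ppf(1.0 - alpha, df)

def aos_scan(fit_rim, n_statements):
    """One compact model plus one augmented model per statement (W + 1 fits)."""
    dev_c = fit_rim(free=[])  # compact: every statement equal across groups
    flags = {}
    for w in range(n_statements):
        dev_aw = fit_rim(free=[w])  # augmented: only statement w may differ
        flags[w] = lrt_flag(dev_c, dev_aw)
    return flags
```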
The Constant-Statement (CS) Method
When the CS method is used, all the statements in a test except the anchor statement(s) are treated as presumed DSF statements. To use this method, at least one DSF-free statement should be selected as the anchor. The steps for the analysis are as follows (a code sketch follows the list):

1. The RIM is fitted to the data with only the statement parameters of the anchor statement(s) constrained to be identical between groups (the augmented model), assuming the other statements have DSF. The likelihood deviance $G^2_A$ is calculated.

2. The model is fitted to the same data with both the anchor and the studied statements constrained to be identical across groups, whereas the other statements are treated as presumed DSF statements (the compact model). The likelihood deviance $G^2_{C_v}$ ($v = 1, \ldots, V$) is calculated. This step is repeated for each statement except the anchor statement(s), so a total of V compact models are formed, where V is the number of statements that do not serve as anchors.

3. $G^2_{C_v} - G^2_A$ is compared with the chi-square distribution with one degree of freedom for each statement that is not anchored.
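The CS scan reverses the roles of the compact and augmented models; this sketch reuses `lrt_flag` and the hypothetical `fit_rim` wrapper from the AOS sketch above.

```python
def cs_scan(fit_rim, anchors, n_statements):
    """Augmented model: only anchors constrained; each compact model adds
    one studied statement to the constrained set (1 + V fits in total)."""
    studied = [w for w in range(n_statements) if w not in anchors]
    dev_a = fit_rim(free=studied)  # every non-anchor statement may differ
    flags = {}
    for v in studied:
        dev_cv = fit_rim(free=[w for w in studied if w != v])
        flags[v] = lrt_flag(dev_cv, dev_a)  # G2_Cv - G2_A ~ chi-square(1)
    return flags
```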
Method
Two research questions are considered in this study: first, whether the DSF parameters as well as other parameters under the proposed methods can be accurately recovered, and second, assuming the generating parameters can be accurately recovered, how the proposed methods of DSF detection perform.
Data Generation
The RIM was used to generate item responses, with 1,000 examinees generated for each of the reference and focal groups. This design is in line with previous studies of DIF assessment using IRT models (e.g., Wang & Yeh, 2003). A relatively large sample size is needed for the RIM because ipsative tests are often highly dimensional; with more examinees, sampling fluctuations are better controlled, the item parameters are estimated more accurately, and the performance of the methods can be assessed more precisely. In general, the larger the sample size, the more accurate the parameter estimation and the higher the power for detecting DSF. However, a larger sample size also requires a longer computation time for parameter estimation.
Design
The study employed three dimensions, each measured by seven statements. As shown in Figure 1, statements from different dimensions were linked and compared spirally, yielding 21 items in the simulation. To answer the research questions, two independent variables were manipulated. The first was the number of DSF statements, with three levels: zero DSF statements; one DSF statement in one dimension (the first statement in Dimension 1, that is, s1); and three DSF statements (the first statement in each of the three dimensions, that is, s1, s8, and s15). Under the zero DSF statement condition, all items were "DSF-free versus DSF-free" pairings. Under the one DSF statement condition, 19 items were "DSF-free versus DSF-free" pairings and two items were "DSF-free versus DSF" pairings [(s1, s8) and (s21, s1)]. Under the three DSF statements condition, 17 items were "DSF-free versus DSF-free" pairings, two items were "DSF-free versus DSF" pairings [(s15, s2) and (s21, s1)], and two items were "DSF versus DSF" pairings [(s1, s8) and (s8, s15)]. An item with a "DSF-free versus DSF" or "DSF versus DSF" pairing was treated as a DSF item. This design allows the effect of different pairings on DSF assessment to be investigated. The second variable was the assessment method: EMU, AOS, or CS; for CS, both one DSF-free anchor statement (CS-1) and all DSF-free anchor statements (CS-A) were considered, giving four levels. Crossing the two variables yielded 12 conditions in the simulation study.
Figure 1. Linking design in the simulation study.
The studied statement(s) were simulated by adding the DSF magnitude (Δ) to the statement utilities of the reference group. Because the item parameters in the RIM represent the utilities of statements, a negative value of Δ indicates that the δ parameters within DSF items favor the reference group. The difference in the parameters was set to favor the reference group by 0.6 for the DSF statement(s), so the magnitude of DSF was simulated to be moderate to large. This manipulation and setting are similar to those of previous studies of normative tests (e.g., Wang & Yeh, 2003).
The abilities for Dimensions 1 and 2 were generated from a multivariate normal distribution, N(0, Σ). The ability for Dimension 3 was computed as $\theta_{n3} = -(\theta_{n1} + \theta_{n2})$ to ensure the ipsative nature of the data.
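A sketch of the generating scheme described above. The trait correlation of 0.3 and the uniform range of the utilities are assumed placeholders; the exact generating values are given in the online supplement.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Traits for Dimensions 1 and 2; the 0.3 correlation is an assumption.
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
theta12 = rng.multivariate_normal([0.0, 0.0], sigma, size=1000)
theta3 = -theta12.sum(axis=1, keepdims=True)  # ipsative constraint: traits sum to 0
theta = np.hstack([theta12, theta3])

# DSF manipulation: Delta = -0.6 so statement s1 favors the reference group.
delta_ref = rng.uniform(-2.0, 2.0, size=21)   # illustrative utility range
delta_foc = delta_ref.copy()
delta_foc[0] += -0.6                          # focal-group utility of s1 is lowered
```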
Parameter Estimation and Analysis
The ConQuest 4.1 program (Adams et al., 2015) was used for parameter estimation. Except for the EMU method, the default approach to identifying the within-item multidimensional model in ConQuest, namely, setting the mean of each latent trait to zero, was implemented. For the EMU method, the mean statement utility of each dimension was constrained to be zero for model identification. ConQuest implements marginal maximum likelihood estimation with the expectation-maximization algorithm (Bock & Aitkin, 1981). The program offers a quadrature approach with a fixed set of quadrature points or a Monte Carlo method in which the nodes are randomly drawn. This study used the quadrature method with 50 nodes in each dimension.
When using the EMU method, a design matrix implementing this model identification was constructed to estimate the parameters. To save computing time, ConQuest ignores the covariances between the parameter estimates by default. Unfortunately, this quick approach leads to underestimated standard errors (Adams et al., 2015), which contaminates the subsequent DSF assessment. To resolve the problem, the bootstrap method (Efron, 1979) was also used to compute the standard errors. Specifically, a generated dataset was randomly selected in each condition, a bootstrap sample of size 1,000 was drawn from it with replacement, and the sample was estimated using ConQuest. These steps were repeated 500 times, and the bootstrap standard error of each estimate was computed as the standard deviation of the 500 estimates. For comparison, both the EMU method using the quick approach to the standard errors (EMU-Q) and the EMU method using the bootstrap standard errors (EMU-B) were examined in this study.
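A sketch of this bootstrap procedure, assuming a hypothetical `calibrate` function that runs one ConQuest-style estimation on a response matrix and returns the parameter vector.

```python
import numpy as np

def bootstrap_se(responses, calibrate, n_boot=500, seed=11):
    """Draw bootstrap samples of respondents with replacement, re-estimate,
    and take the SD of the estimates across draws as the standard error."""
    rng = np.random.default_rng(seed)
    n = len(responses)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample of size n with replacement
        draws.append(calibrate(responses[idx]))
    return np.asarray(draws).std(axis=0, ddof=1)
```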
In the AOS method, each statement was tested consecutively. Therefore, one compact model and 21 augmented models were formed, resulting in 22 calibrations in each replication for each condition.
To use the CS-1 method, one DSF-free statement must be selected as the anchor statement. Theoretically, any DSF-free statement can serve as an anchor; in this study, the last statement was arbitrarily chosen. Under the CS-A method, all DSF-free statements were used as anchors. For the CS method, there is one augmented model and V compact models, where V is the number of statements that do not serve as anchors, resulting in (1 + V) calibrations in each replication for each condition. For example, when there were three DSF statements, all 18 DSF-free statements were set as anchors under the CS-A method, so there were four (1 + 3) calibrations in each replication.
When evaluating parameter recovery for MFC items, the bias and the root mean square error (RMSE) of the estimates across R replications were computed. When investigating the performance of the proposed methods in DSF assessment, the dependent variables were the Type I error (false positive) rate, computed as the percentage of DSF-free statements mistakenly flagged as having DSF across replications, and the power (true positive) rate, computed as the percentage of DSF statements correctly flagged as having DSF across replications. These rates were then averaged across DSF-free and DSF statements, respectively, to form the average Type I error rate and the average power rate.
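A sketch of these evaluation criteria; `flags` is an R × W Boolean matrix of DSF decisions across replications, and `is_dsf` marks the truly simulated DSF statements.

```python
import numpy as np

def recovery(estimates, true_value):
    """Bias and RMSE of one parameter across R replications."""
    e = np.asarray(estimates, dtype=float)
    return e.mean() - true_value, np.sqrt(np.mean((e - true_value) ** 2))

def detection_rates(flags, is_dsf):
    """Average Type I error rate (over DSF-free statements) and average
    power rate (over DSF statements), each across all replications."""
    flags = np.asarray(flags, dtype=bool)
    is_dsf = np.asarray(is_dsf, dtype=bool)
    type1 = flags[:, ~is_dsf].mean() if (~is_dsf).any() else np.nan
    power = flags[:, is_dsf].mean() if is_dsf.any() else np.nan
    return type1, power
```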
For each condition, 100 replications were conducted, which produced 15,700 separate calibrations. To facilitate the estimation, the generating values were used as the initial values of the parameters. The computation time was in the range of 0.5 to 2.0 hr per replication. The ConQuest codes for the three methods are shown in the Appendix in the online supplement.
Results
Parameter Recovery for DSF Assessment Methods
Due to space constraints, the generating values, bias, and RMSE of parameter recovery for each statement using the three methods when the test contained zero, one, and three DSF statements are shown in Tables A1 to A3, respectively, in the online supplement. Table 1 summarizes the results of parameter recovery across statements.
Table 1.
Summary of Parameter Recovery for Four Methods.
| Parameter | EMU Bias | EMU RMSE | AOS Bias | AOS RMSE | CS-1 Bias | CS-1 RMSE | CS-A Bias | CS-A RMSE |
|---|---|---|---|---|---|---|---|---|
| 0 DSF | ||||||||
| δ | ||||||||
| M | 0.000 | 0.098 | 0.045 | 0.129 | 0.043 | 0.118 | 0.046 | 0.124 |
| Min | −0.010 | 0.086 | −0.017 | 0.097 | 0.001 | 0.102 | 0.006 | 0.111 |
| Max | 0.020 | 0.112 | 0.141 | 0.185 | 0.069 | 0.132 | 0.076 | 0.140 |
| Δ | ||||||||
| M | 0.000 | 0.138 | 0.000 | 0.049 | −0.018 | 0.120 | — | — |
| Min | −0.034 | 0.125 | −0.010 | 0.035 | −0.034 | 0.045 | — | — |
| Max | 0.031 | 0.152 | 0.011 | 0.073 | −0.003 | 0.162 | — | — |
| 1 DSF | ||||||||
| δ | ||||||||
| M | −0.014 | 0.108 | 0.042 | 0.132 | 0.043 | 0.117 | −0.037 | 0.111 |
| Min | −0.070 | 0.095 | −0.025 | 0.114 | 0.011 | 0.101 | −0.086 | 0.103 |
| Max | 0.013 | 0.128 | 0.138 | 0.186 | 0.064 | 0.134 | −0.009 | 0.120 |
| Δ | ||||||||
| M | 0.029 | 0.099 | 0.026 | 0.089 | −0.016 | 0.123 | −0.003 | 0.067 |
| Min | −0.017 | 0.069 | −0.066 | 0.049 | −0.028 | 0.052 | −0.003 | 0.067 |
| Max | 0.096 | 0.132 | 0.297 | 0.303 | 0.002 | 0.166 | −0.003 | 0.067 |
| 3 DSF | ||||||||
| δ | ||||||||
| M | −0.043 | 0.111 | 0.000 | 0.120 | 0.001 | 0.117 | −0.039 | 0.109 |
| Min | −0.069 | 0.101 | −0.061 | 0.092 | −0.066 | 0.105 | −0.083 | 0.101 |
| Max | −0.024 | 0.126 | 0.093 | 0.146 | 0.041 | 0.136 | −0.016 | 0.119 |
| Δ | ||||||||
| M | 0.086 | 0.143 | 0.085 | 0.117 | 0.008 | 0.098 | 0.004 | 0.069 |
| Min | 0.061 | 0.131 | −0.027 | 0.029 | −0.006 | 0.044 | 0.000 | 0.066 |
| Max | 0.103 | 0.165 | 0.592 | 0.593 | 0.026 | 0.128 | 0.009 | 0.072 |
Note. EMU = equal-mean-utility method; AOS = all-other-statement method; CS-1 = constant-statement method with one DSF-free statement as anchor; CS-A = constant-statement method with all DSF-free statements as anchor; RMSE = root mean square error; DSF = differential statement functioning; δ is the statement utility parameter; Δ is the magnitude of DSF; M = mean; Min = minimum value; Max = maximum value; — = not applicable.
As shown in Table 1, when the test did not contain any DSF statement, under the EMU method, the bias values were between −0.010 and 0.020 (M = 0.000) and the RMSE values were between 0.086 and 0.112 (M = 0.098) for the utility parameters. The bias ranged from −0.034 to 0.031 (M = 0.000) and the RMSE values were between 0.125 and 0.152 (M = 0.138) for the DSF amount. In the AOS method, the bias values were from −0.017 to 0.141 (M = 0.045) and the RMSE values ranged between 0.097 and 0.185 (M = 0.129) for the utility parameters. The bias values were between −0.010 and 0.011 (M = 0.000) and the RMSE values varied between 0.035 and 0.073 (M = 0.049) for the DSF amount. In the CS-1 method, the bias values were between 0.001 and 0.069 (M = 0.043) and the RMSE values ranged from 0.102 to 0.132 (M = 0.118) for the utility parameters. The bias values were from −0.034 to −0.003 (M = −0.018) and the RMSE values were between 0.045 and 0.162 (M = 0.120) for the DSF amount. In the CS-A method, the bias values were between 0.006 and 0.076 (M = 0.046) and the RMSE values were between 0.111 and 0.140 (M = 0.124) for the utility parameters. By definition, all DSF-free statements were anchored in the CS-A method. Therefore, the bias and RMSE of the DSF amount were not computed for this method.
These bias values and RMSE values (in logit units) can be regarded as effect size measures. Therefore, the magnitude of biases and RMSE values was very small compared with the range of the generating values (between −1.999 and 1.783 in Table A1), indicating that all the methods yielded accurate estimates using ConQuest when the test did not contain any DSF statement. Even when all statements except one were treated as a presumed DSF statement, the CS-1 method yielded unbiased estimates, as did the CS-A method.
When the test contained one DSF statement, the EMU method yielded bias values between −0.070 and 0.013 (M = −0.014) and RMSE values between 0.095 and 0.128 (M = 0.108) for the utility parameters. The bias values were between −0.017 and 0.096 (M = 0.029), and RMSE values were between 0.069 and 0.132 (M = 0.099) for the DSF amount. In the AOS method, the bias values ranged from −0.025 to 0.138 (M = 0.042) and the RMSE values were between 0.114 and 0.186 (M = 0.132) for the utility parameters. The bias values varied from −0.066 to 0.297 (M = 0.026) and the RMSE values were between 0.049 and 0.303 (M = 0.089) for the DSF amount. For the CS-1 method, the bias values were between 0.011 and 0.064 (M = 0.043) and the RMSE values ranged from 0.101 to 0.134 (M = 0.117) for the utility parameters. The bias values were between −0.028 and 0.002 (M = −0.016) and the RMSE values ranged from 0.052 to 0.166 (M = 0.123) for the DSF amount. In the CS-A method, the bias values were between −0.086 and −0.009 (M = −0.037) and the RMSE values were from 0.103 to 0.120 (M = 0.111) for the utility parameters. The bias and RMSE values for the DSF amount were −0.003 and 0.067, respectively.
Under this condition, the magnitude of the DSF amount was recovered very well by the CS-A method, because it is the true model, followed by the CS-1 method, because the anchor (the last statement) happened to be DSF-free. The EMU method was the least effective of the three: with a single DSF statement in the test, the mean utilities of the two groups were not equal by definition, so the assumption of the EMU method was violated, although not seriously. With regard to the AOS method, the recovery of the DSF amount for Statements s8 and s21, which were paired with the DSF statement (s1 in this condition), was poor. As shown in Table A2, the bias and RMSE values were 0.297 and 0.303, respectively, for the DSF amount of s8, and 0.268 and 0.272, respectively, for that of s21.
When the test contained three DSF statements, the accuracy of the parameter estimates decreased for the EMU and AOS methods. For example, under the EMU method, the bias values for the DSF amount were between 0.061 and 0.103 (M = 0.086) and the RMSE values ranged from 0.131 to 0.165 (M = 0.143). Under the AOS method, the bias values were between −0.027 and 0.592 (M = 0.085) and the RMSE values varied from 0.029 to 0.593 (M = 0.117). As shown in Table A3, the recovery of the DSF amounts of the DSF statements (s1, s8, and s15 in this condition), as well as of the statements paired with them (s2 and s21), was rather poor; for example, the bias and RMSE values for the DSF amount of s8 were 0.592 and 0.593, respectively, under the AOS method.
By contrast, the recovery of the DSF amounts was satisfactory under both the CS-A and CS-1 methods, indicating that the CS method with DSF-free anchor statements yielded accurate estimates. For example, the bias values for the DSF amount were between −0.006 and 0.026 (M = 0.008) and the RMSE values were in the range of 0.044 to 0.128 (M = 0.098) for the CS-1 method. The CS-A method yielded even more accurate estimates, with bias values between 0.000 and 0.009 (M = 0.004) and RMSE values in the range of 0.066 to 0.072 (M = 0.069).
Assessment of DSF for MFC Items
To investigate how the DSF statements influence the detection of DSF in the test, the average Type I error rate and the average power rate of individual statements across 100 replications were computed and are shown in Table 2. Moreover, the Type I error rates and the power rates are summarized in the left and right panels of Figure 2, respectively.
Table 2.
Type I Error and Power Rates of Individual Statements.
| Statement | EMU-Q 0 | EMU-Q 1 | EMU-Q 3 | EMU-B 0 | EMU-B 1 | EMU-B 3 | AOS 0 | AOS 1 | AOS 3 | CS-1 0 | CS-1 1 | CS-1 3 | CS-A 0 | CS-A 1 | CS-A 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.97 | 0.93 | 0.02 | 0.99 | 0.97 | 0.02 | 1.00 | 0.98 | 0.01 | 1.00 | 1.00 | — | 1.00 | 1.00 |
| 2 | 0.30 | 0.11 | 0.15 | 0.04 | 0.03 | 0.03 | 0.02 | 0.07 | 0.99 | 0.02 | 0.02 | 0.03 | — | — | — |
| 3 | 0.24 | 0.12 | 0.12 | 0.03 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.05 | 0.06 | 0.03 | — | — | — |
| 4 | 0.28 | 0.09 | 0.13 | 0.02 | 0.01 | 0.04 | 0.02 | 0.05 | 0.00 | 0.05 | 0.04 | 0.04 | — | — | — |
| 5 | 0.33 | 0.09 | 0.09 | 0.02 | 0.07 | 0.01 | 0.01 | 0.05 | 0.02 | 0.05 | 0.04 | 0.04 | — | — | — |
| 6 | 0.21 | 0.07 | 0.12 | 0.02 | 0.02 | 0.02 | 0.02 | 0.05 | 0.02 | 0.04 | 0.02 | 0.02 | — | — | — |
| 7 | 0.27 | 0.05 | 0.12 | 0.02 | 0.03 | 0.02 | 0.00 | 0.09 | 0.02 | 0.03 | 0.01 | 0.06 | — | — | — |
| 8 | 0.01 | 0.02 | 0.96 | 0.03 | 0.04 | 0.98 | 0.03 | 0.94 | 0.02 | 0.03 | 0.01 | 0.98 | — | — | 1.00 |
| 9 | 0.24 | 0.06 | 0.13 | 0.01 | 0.02 | 0.01 | 0.03 | 0.03 | 0.01 | 0.02 | 0.05 | 0.05 | — | — | — |
| 10 | 0.29 | 0.10 | 0.11 | 0.03 | 0.02 | 0.03 | 0.01 | 0.02 | 0.00 | 0.04 | 0.05 | 0.03 | — | — | — |
| 11 | 0.33 | 0.07 | 0.10 | 0.03 | 0.01 | 0.02 | 0.03 | 0.01 | 0.02 | 0.08 | 0.03 | 0.04 | — | — | — |
| 12 | 0.27 | 0.11 | 0.06 | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 | 0.05 | 0.07 | 0.03 | — | — | — |
| 13 | 0.22 | 0.04 | 0.08 | 0.02 | 0.02 | 0.03 | 0.00 | 0.02 | 0.00 | 0.03 | 0.02 | 0.06 | — | — | — |
| 14 | 0.25 | 0.04 | 0.07 | 0.02 | 0.01 | 0.02 | 0.01 | 0.04 | 0.01 | 0.02 | 0.03 | 0.05 | — | — | — |
| 15 | 0.01 | 0.05 | 0.95 | 0.02 | 0.05 | 0.98 | 0.01 | 0.02 | 0.98 | 0.03 | 0.02 | 0.98 | — | — | 1.00 |
| 16 | 0.23 | 0.08 | 0.10 | 0.02 | 0.04 | 0.01 | 0.01 | 0.03 | 0.01 | 0.05 | 0.08 | 0.03 | — | — | — |
| 17 | 0.29 | 0.09 | 0.12 | 0.03 | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 | 0.04 | 0.05 | 0.03 | — | — | — |
| 18 | 0.29 | 0.10 | 0.10 | 0.02 | 0.02 | 0.02 | 0.02 | 0.01 | 0.02 | 0.06 | 0.04 | 0.06 | — | — | — |
| 19 | 0.19 | 0.08 | 0.08 | 0.02 | 0.04 | 0.01 | 0.02 | 0.03 | 0.03 | 0.02 | 0.03 | 0.03 | — | — | — |
| 20 | 0.23 | 0.06 | 0.09 | 0.02 | 0.01 | 0.04 | 0.00 | 0.03 | 0.01 | 0.03 | 0.02 | 0.03 | — | — | — |
| 21 | 0.23 | 0.07 | 0.10 | 0.03 | 0.02 | 0.02 | 0.01 | 0.96 | 0.98 | — | — | — | — | — | — |
Note. The results for DSF statements are bold. EMU-Q = equal-mean-utility method using quick standard error; EMU-B = equal-mean-utility method using bootstrap standard error; AOS = all-other-statement method; CS-1 = constant-statement method with one DSF-free statement as anchor; CS-A = constant-statement method with all DSF-free statements as anchor; 0, 1, and 3 indicate the number of DSF statements; — = not applicable; DSF = differential statement functioning.
Figure 2. Type I error rates (left) and power rates (right) for DSF assessment.
Note. EMU-Q = equal-mean-utility method using quick standard error; EMU-B = equal-mean-utility method using bootstrap standard error; AOS = all-other-statement method; CS-1 = constant-statement method with one DSF-free statement as anchor; CS-A = constant-statement method with all DSF-free statements as anchor.
Type I error
The 0.05 nominal level was used. The standard error of the Type I error rate was 0.0069 ($\sqrt{0.05 \times 0.95 / 1{,}000}$). Therefore, the 99% confidence interval of the average Type I error rate was roughly between 0.032 and 0.068 ($0.05 \pm 2.576 \times 0.0069$). If the empirical average Type I error rate fell far outside this range, the corresponding power of the DSF assessment was meaningless.
As shown in the left panel of Figure 2, the EMU-Q method yielded inflated Type I error rates even when there was no DSF statement in the test. By contrast, the EMU-B method performed well, yielding Type I error rates near the nominal level: 0.020 (SD = 0.008), 0.027 (SD = 0.016), and 0.021 (SD = 0.010) for zero, one, and three DSF statements, respectively.
To examine the accuracy of the standard error estimates for the utility parameters, the ratio of the standard deviation of the estimates across replications to the mean of the empirical standard errors across replications (denoted "the Ratio" hereafter) was computed. A value close to 1 indicates good standard error estimates, whereas a value larger (smaller) than 1 indicates that the empirical standard errors are underestimated (overestimated). The Ratio was consistently larger than 1 when the quick approach was used; for example, under the condition of three DSF statements, it ranged from 1.035 to 2.817 (M = 1.780, SD = 0.525) for the utility parameters of both groups. Thus, the standard errors from the quick approach in ConQuest were underestimated. By contrast, the bootstrap standard errors were appropriate, with the Ratio very close to 1: under the condition of three DSF statements, it ranged from 0.892 to 1.157 (M = 0.989, SD = 0.065) for the utility parameters of both groups.
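A sketch of this diagnostic; `estimates` and `ses` are R × P arrays of the parameter estimates and their reported standard errors across replications.

```python
import numpy as np

def se_ratio(estimates, ses):
    """Empirical SD of the estimates divided by the mean reported SE;
    values above 1 indicate underestimated standard errors."""
    return np.std(estimates, axis=0, ddof=1) / np.mean(ses, axis=0)
```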
The AOS method yielded relatively conservative Type I error rates, 0.014 (SD = 0.009), when there was no DSF statement. However, it lost control of the Type I error for DSF-free statements in "DSF-free versus DSF" items once the test contained DSF statements. For example, when there was one DSF statement (s1), the average Type I error rate was 0.125 (SD = 0.283). As shown in the ninth column of Table 2, the average Type I error rates for most DSF-free statements were between 0.01 and 0.09, except for s8 (0.94) and s21 (0.96). Note that this condition contained two "DSF-free versus DSF" items [(s1, s8) and (s21, s1)], and s8 and s21 were the DSF-free statements paired with a DSF statement. Similarly, when there were three DSF statements (s1, s8, and s15), the average Type I error rate was 0.121 (SD = 0.314), and the average Type I error rates for most DSF-free statements were between 0.00 and 0.03, except for s2 (0.99) and s21 (0.98). This condition contained two "DSF-free versus DSF" items [(s15, s2) and (s21, s1)], in which s2 and s21 were the DSF-free statements paired with a DSF statement.
The CS-1 method controlled the Type I error rate well at the nominal level in all conditions. For example, the average Type I error rates across replications for the individual statements were 0.038 (SD = 0.017), 0.036 (SD = 0.020), and 0.039 (SD = 0.013) for zero, one, and three DSF statements, respectively.
Power rates
Because the EMU-Q method could not maintain the Type I error rate at the nominal level, its power is meaningless. The EMU-B method yielded high power rates: 0.970 and 0.947 for one and three DSF statements, respectively. The AOS method yielded an average power rate as high as 1.000 when there was one DSF statement. However, as shown in the right panel of Figure 2, its power was substantially deflated when there were three DSF statements: the average power rate was 0.660, and the power rate of s8 was only 0.020. In this condition, there were two "DSF versus DSF" items, (s1, s8) and (s8, s15), with s8 as the DSF statement that was paired with two other DSF statements (s1 and s15) but not with any DSF-free statement.
Both the CS-1 and CS-A methods yielded average power rates of 1.000, except that the CS-1 method had a slightly reduced power (0.987) when there were three DSF statements.
Discussion and Conclusion
The re-emerging interest in applying MFC items to noncognitive testing has led to an increasingly urgent need for a sound theoretical base, in which test fairness is one of the key issues. Various methods of detecting DIF have been well established for normative Likert-type items (e.g., Holland & Wainer, 1993). However, no such method is readily available for assessing bias in MFC items, where an item is itself multidimensional and combines more than one statement. The item format requires shifting the unit of analysis from the item to the statement, which raises questions about parameter estimation and hypothesis testing when DSF statements are present and an MFC IRT model is used. Furthermore, it is of interest to study how items behave under different types of statement combinations. These research questions are complicated and unique to DSF assessment, making DSF distinct from DIF.
This study adapted three methods for the assessment of DSF: the EMU method, the AOS method, and the CS method. The results indicated that both the statement parameters and the DSF parameters were well recovered for all three methods using ConQuest, provided the test did not contain any DSF statement. When the test contained DSF statements, the accuracy of the parameter estimates decreased for the EMU and AOS methods, whereas the CS method with one or more DSF-free statements as anchors yielded accurate estimates.
Although the overall findings are similar to those of DIF assessments for normative tests (e.g., Wang, 2008; Wang & Yeh, 2003), the proposed methods performed differently in handling the challenges of detecting DSF in MFC items. The EMU-Q method, which uses the quick standard errors, produced inflated Type I error rates in all conditions. This result was expected because a previous study showed that the quick approach tends to underestimate the standard errors of the parameters (Adams et al., 2015). The EMU-B method, which uses the bootstrap standard errors, performed well in controlling the Type I error rate and in detecting the DSF statements.
The performance of the EMU-B method is attributable to the average signed area (ASA) measure, which reflects the average degree to which a test favors the reference group (Wang & Su, 2004). When the test does not contain any DSF statement, the ASA is zero, indicating that the test as a whole favors neither group; thus, the Type I error rate was well controlled at the nominal level. When the test contains one DSF statement, which yields two DSF items, the ASA is about 0.057 (0.6 × 2/21). By contrast, when the test contains three DSF statements, which lead to four DSF items, the ASA is 0.114 (0.6 × 4/21). An ASA of this magnitude indicates that the test favored the reference group; accordingly, the DSF amounts tended to be overestimated when there were three DSF statements (see Table 1).
The AOS method performed appropriately only when the test did not contain any DSF statement; its performance deteriorated substantially when the test contained even a single DSF statement. An investigation of the individual statements revealed that the AOS method mistakenly flagged DSF-free statements that were paired with a DSF statement. In addition, the AOS method had little power to detect DSF statements that were paired only with other DSF statements.
The CS method performed well as long as one or more DSF-free statements were chosen as anchors. The results show the importance of proper anchor selection and linking design. For example, under the one DSF statement condition, selecting the DSF-free statement s21 as the anchor and linking it with s1 meant that not only were the parameters of s1 well recovered and s1 correctly identified as a DSF statement, but the parameters of s8, which links with s1, were also well recovered and s8 was correctly identified as DSF-free. The reason is that the anchor statement s21 established a correct metric for the parameter estimation of the other statements. Theoretically, one anchor statement suffices to establish the metric; nonetheless, the greater the number of anchor statements, the higher the power of DSF detection. Although the results reveal the superiority of the CS method over the other methods, they raise concerns about the consequences of a DSF statement being selected as an anchor, because in practice it is unknown prior to assessment whether the selected statement has DSF. Therefore, a procedure for locating DSF-free statements to serve as anchors is needed in future study.
This study also investigated the cancellation of DSF in ipsative tests (Shealy & Stout, 1993). Cancellation was observed when using the AOS method: the average power rate for a DSF statement that was paired only with other DSF statements (i.e., s8) was extremely low. One possible interpretation is that, because DSF cancels out within a "DSF versus DSF" item, the item appears fair, making the statements involved in it appear DSF-free. Hence, when s8 was examined, the three DSF statements (s1, s8, and s15) effectively functioned as DSF-free statements; the test appeared to contain no DSF statement, and the assumption of the AOS method held. The detection of s8 thus amounted to assessing the Type I error of a fabricated DSF-free statement. The cancellation effect was not found for the other methods. For the EMU method, the mean statement utility of each dimension was constrained to be equal, so the effect of DSF was not canceled out; rather, the DSF amount in a dimension was forced to be balanced between groups. For the CS method, because the common metric was correctly established by choosing DSF-free statements to serve as anchors, the magnitude of DSF was accurately estimated.
The work in this study establishes a research framework for assessing measurement invariance in ipsative tests with MFC items. The proposed framework represents a viable parametric approach to DSF assessment, and the methods adapted in this article offer a practically useful way to ensure the test fairness of ipsative tests with MFC items. This work also serves as an impetus for further studies in this area. First, a more comprehensive simulation design is desirable, manipulating the number of response categories in MFC items, the number of dimensions and statements, nonuniform DSF, differences in ability distributions (i.e., impact), the magnitude of DSF, the percentage of DSF statements in the test, and other factors. For example, this study focused on dichotomous MFC items. Polytomous MFC items are also used in ipsative tests because they provide more information about the directions and strengths of respondents' preferences (Qiu & Wang, 2016). It would be interesting to examine the performance of the methods when the DSF statements have more than two categories.
Second, this study represents only the initial steps in developing methods to detect DSF. As noted earlier, the findings indicate the superiority of the CS method over the other methods, but these results rest on a DSF-free statement being correctly selected as the anchor. Therefore, one research direction is locating DSF-free statements to serve as anchors. The following steps, analogous to the scale purification procedure (e.g., B. Clauser et al., 1993), can be adapted for future use (a code sketch follows the list):
1. Set statement 1 as the anchor, assess all other statements for DSF with the CS method, and obtain an estimate of the DSF amount for each studied statement.

2. Set the next statement as the anchor and assess all other statements for DSF as in Step 1.

3. Repeat Step 2 until the last statement has served as the anchor.

4. Compute the sum of the absolute DSF amount estimates over the iterations for each statement, rank the sums, and select the desired number of statements with the smallest sums to serve as anchors.
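A sketch of this purification loop, assuming a hypothetical `run_cs` wrapper that applies the CS method with the given anchor and returns the estimated DSF amount for every other statement (with `np.nan` for the anchor itself).

```python
import numpy as np

def purify_anchors(run_cs, n_statements, n_anchors=4):
    """Rotate the anchor through all statements, accumulate |DSF| estimates,
    and keep the statements with the smallest sums as anchors."""
    totals = np.zeros(n_statements)
    for a in range(n_statements):
        dsf = np.abs(np.asarray(run_cs(anchor=a), dtype=float))
        totals += np.nan_to_num(dsf)  # the anchor's own entry contributes 0
    return sorted(np.argsort(totals)[:n_anchors].tolist())
```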
In addition, the DIF-free-then-DIF strategy (Wang et al., 2012) could be adapted for the assessment of DSF. The detection would consist of two steps: (a) adopt a procedure (e.g., the purification procedure above) to locate a set of DSF-free statements to serve as anchors, and then (b) apply the CS method to test all other statements for evidence of DSF. This can be referred to as the DSF-free-then-DSF strategy; its feasibility deserves further investigation.
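A minimal sketch of the two-step strategy, chaining the purification and CS scans sketched earlier (with the same hypothetical wrappers).

```python
def dsf_free_then_dsf(run_cs, fit_rim, n_statements, n_anchors=4):
    """Step (a): locate presumably DSF-free anchors by purification.
    Step (b): test every remaining statement with the CS method."""
    anchors = set(purify_anchors(run_cs, n_statements, n_anchors))
    return cs_scan(fit_rim, anchors, n_statements)
```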
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a Faculty Research Fund grant from Faculty of Education in the University of Hong Kong and a General Research Fund grant (845013) from the Hong Kong Research Grants Council.
ORCID iD: Xue-Lan Qiu https://orcid.org/0000-0002-5446-9758
Supplemental Material: Supplementary material is available for this article online.
References
- Adams R. J., Wu M. L., Wilson M. R. (2015). ACER ConQuest version 4.0: Generalized item response modelling software [Computer program]. Australian Council for Educational Research.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
- Brown A., Maydeu-Olivares A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502.
- Brown A., Maydeu-Olivares A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1), 36–52.
- Cattell R. B. (1944). Psychological measurement: Normative, ipsative, interactive. Psychological Review, 51(5), 292–303.
- Chen C.-W., Wang W.-C. (2014, April). Detecting differential statement functioning in ipsative tests using the logistic regression method [Paper presentation]. Annual Meeting of the National Council on Measurement in Education, Philadelphia, PA.
- Clauser B., Mazor K., Hambleton R. K. (1993). The effects of purification of the matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4), 269–279.
- Clauser B. E., Nungester R. J., Mazor K., Ripkey D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33(2), 202–214.
- Efron B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
- Holland P. W., Wainer H. (1993). Differential item functioning. Lawrence Erlbaum.
- Joo S.-H., Lee P., Stark S. (2018). Development of information functions and indices for the GGUM-RANK multidimensional forced choice IRT model. Journal of Educational Measurement, 55, 357–372. https://doi.org/10.1111/jedm.12183
- Matthews G., Oddy K. (1997). Ipsative and normative scales in adjectival measurement of personality: Problems of bias and discrepancy. International Journal of Selection and Assessment, 5, 169–182.
- Mazor K. M., Hambleton R. K., Clauser B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22(4), 357–367.
- Qiu X.-L., Wang W.-C. (2016, April). Item response theory models for ipsative tests with polytomous multidimensional forced-choice items [Paper presentation]. Annual Meeting of the National Council on Measurement in Education, Washington, DC.
- Shealy R. T., Stout W. F. (1993). An item response theory model for test bias and differential test functioning. In Holland P. W., Wainer H. (Eds.), Differential item functioning (pp. 197–240). Lawrence Erlbaum.
- Stark S., Chernyshenko O. S., Drasgow F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: An application to the problem of faking in personality assessment. Applied Psychological Measurement, 29(3), 184–201.
- Stark S., Drasgow F. (2002). An EM approach to parameter estimation for the Zinnes and Griggs paired comparison IRT model. Applied Psychological Measurement, 26(2), 208–227.
- Swaminathan H., Rogers H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.
- Wald A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3), 426–482.
- Wang W.-C. (2008). Assessment of differential item functioning. Journal of Applied Measurement, 9(4), 387–408.
- Wang W.-C., Qiu X.-L., Chen C.-W., Ro S. (2016). Item response theory models for multidimensional ranking items. In van der Ark L. A., Bolt D. M., Wang W.-C., Douglas J. A., Wiberg M. (Eds.), Quantitative psychology research: The 80th Annual Meeting of the Psychometric Society (pp. 49–65). Springer.
- Wang W.-C., Qiu X.-L., Chen C.-W., Ro S., Jin K.-Y. (2017). Item response theory models for ipsative tests with multidimensional pairwise-comparison items. Applied Psychological Measurement, 41(8), 600–613.
- Wang W.-C., Shih C.-L., Sun G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 72(4), 687–708.
- Wang W.-C., Su Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2), 113–144.
- Wang W.-C., Yeh Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498.