Abstract
The multidimensional forced choice (MFC) test is frequently used in non-cognitive assessment because of its effectiveness in reducing the response biases commonly associated with conventional Likert scales. Nonetheless, the MFC test traditionally generates ipsative data, a type of measurement that has been criticized for its limited applicability to comparisons between individuals. Multidimensional item response theory (MIRT) models have recently sparked renewed interest among academics and practitioners, largely owing to the development of several models that make it possible to obtain normative information from forced-choice tests. This paper introduces a modeling framework made up of three key components: response format, measurement model, and decision theory. Under this framework, four IRT models are presented as examples. A comprehensive comparison and characterization of the parameter estimation techniques used in MFC-IRT models follows. The paper then examines empirical research on the topic in three domains: parameter invariance testing, computerized adaptive testing (CAT), and validity research. Finally, four directions for future research are recommended: modeling, parameter invariance testing, forced-choice CAT, and validity studies.
Keywords: Multidimensional forced choice test, Ipsative data, TIRT, MUPP, Parameter estimation
1. Introduction
Non-cognitive psychological tests frequently employ Likert rating scales with items measuring traits such as organization (e.g., "I am organized"). There are no correct or incorrect answers to the items; participants rate how well each item describes them on a five-point Likert scale ranging from least like them (1) to most like them (5). In high-stakes assessments such as employment selection, people may purposefully distort their answers, particularly on traits such as responsibility and optimism. This strategic behavior aims to appear more aligned with organizational expectations, even when it does not reflect the respondent's true characteristics. This phenomenon is known as "faking"; when it occurs, an assessment fails to differentiate people on the targeted traits, undermining its objectivity.
To reduce faking, prevention beforehand or detection afterwards can be used [1]. To avoid misclassifying an honest respondent as a faker, post-hoc detection approaches must achieve high recognition accuracy. Preventive approaches, by contrast, deter faking before or during test administration in order to obtain uncontaminated data; examples include warnings, the bogus pipeline, and the forced-choice test. Warnings have little effect on individual faking, and the bogus pipeline deceives respondents, which is unethical [2]. The forced-choice test requires participants to choose among statements of comparable desirability, so they cannot endorse the most favorable option for every statement. Because the statements are equally desirable and none is clearly superior, social desirability is less likely to influence selection, and faking becomes more difficult. Researchers have therefore explored items composed of several statements that are similar in social desirability but represent different dimensions [3].
However, traditional forced-choice scoring yields ipsative data. Dimension scores are interdependent in the forced-choice test: because the total score across dimensions is fixed, a high score on one dimension necessarily comes at the expense of the others. Such data are ipsative. The internal dependence of ipsative scores violates one of classical test theory's basic assumptions, the independence of error variance, which affects statistical analysis and interpretation of forced-choice test scores [4,5], such as reliability, variance, and regression analysis, and increases the probability of Type I errors. It also affects statistical power [6]. The distortion of dimension relationships in ipsative data contaminates the test's construct validity and criterion-related validity [7] and makes the data unsuitable for factor analysis [8]. Finally, comparing individuals or norming ipsative scores can be misleading, because such scores reflect only within-person comparisons; an interest inventory, for example, shows only a participant's own preference ranking. According to Closs [8], direct comparisons across people will overvalue or undervalue individual interests.
The number of dimensions and their interrelationships strongly affect the severity of the ipsative problem. According to previous research [4,9,10], more test dimensions narrow the gap between ipsative and normative scores. When dimensions are positively or negatively correlated, the difference between the ipsative score and the normative score also decreases [11]. As a result, increasing the number of test dimensions is one of the more effective traditional ways to mitigate the ipsative problem, but it is only a compromise.
In conclusion, ipsative data limit the application of the forced-choice test; although increasing the number of dimensions can mitigate the ipsative problem, it does not reflect the individual's psychological decision process. To address the issue of ipsative data, it is necessary to abandon the traditional scoring method and adopt a modern measurement model that reflects individuals' decision processes when answering forced-choice tests [7] and recovers the latent trait scores underlying those decisions from the observed comparison results. The normative properties of individual scores can thus be restored.
The primary goal of this work is to provide a comprehensive and structured overview of MIRT models for forced-choice tests. We introduce forced-choice IRT models and provide a concise summary of their three basic components. We also discuss the parameter estimation techniques used in forced-choice MIRT models. We then review current progress in applied research. Finally, based on the practical implications of the forced-choice paradigm, we suggest potential future research directions.
1.1. Search methods
In the Web of Science Core Collection database, the keyword “forced choice” yielded 135 papers. After examining the abstracts, we excluded studies that did not involve a forced-choice model and only used forced-choice tests as empirical instruments, leaving 45 papers for this literature review. To supplement the literature, we then searched for the keyword “forced choice” in mainstream journals in the field, such as Educational and Psychological Measurement, Applied Psychological Measurement, Multivariate Behavioral Research, and the Journal of Educational and Behavioral Statistics. Finally, we traced the sources of the existing literature and included 97 references in total.
2. IRT model for multidimensional forced choice test
2.1. Three key elements of multidimensional forced choice models
Various MIRT-based scoring models for forced-choice tests have been developed over the last decade. These models relate explicit responses to underlying features in order to obtain latent trait scores with normal characteristics and to compare scores across individuals. These models are comprised of three major components: response format, measurement model, and decision theory. The response format reflects the format of the forced-choice response data, the measurement model reflects the relationship between item response intensity and dimensions, and the decision theory reflects the process by which participants choose between items. The decision theory acts as a link between the explicit response and the favorability of the items, and is then linked to the personal latent trait level by the measurement model, forming an overall forced-choice IRT model.
2.1.1. Response format
The forced-choice test typically consists of a number of item blocks of varying dimensions. The item block is made up of a fixed number of statements with different or identical dimensions and social desirability levels. The statements are explicit indicators of the dimension (i.e., latent trait).
According to Hontangas et al. [12], there are three common forms of forced-choice item blocks: Pick, Rank, and MOLE. The classification is mainly reflected in the instructions. Pick (Table 1) requires individuals to choose the item that best matches them. Rank (Table 2) requires individuals to fully rank the items from most agreeable to least agreeable. MOLE (Table 3) requires individuals to choose the item that fits them MOst and the item that fits them LEast. The three formats are equivalent for a pair of items, while Rank and MOLE are equivalent for a triple.
Table 1.
Pick question types.
| Instruction: Choose the one that best suits you from the following two descriptions | |
|---|---|
| Item block | Most |
| A Lack of finding things | ✓ |
| B Explore unfamiliar territory | |
Table 2.
Rank question types.
| Instruction: Sort the following descriptions | |
|---|---|
| Item block | Rank |
| A Lack of finding things | 3 |
| B Explore unfamiliar territory | 1 |
| C Make decisions based on data analysis | 2 |
Table 3.
MOLE question types.
| Instruction: Choose from the following descriptions the one that fits you most and the one that fits you least | | |
|---|---|---|
| Item block | Most | Least |
| A Lack of finding things | ||
| B Explore unfamiliar territory | ✓ | |
| C Make decisions based on data analysis | ||
| D Do work that focuses on precision | ✓ | |
The number of statements contained in an item block determines its size, with two to four statements being the most common. Block size influences the individual's load in the selection task: the more items there are, the more comparisons the individual must make. A large item block increases the cognitive complexity of the selection task and may disadvantage people with limited education or poor reading skills [13].
For anti-faking efficacy, matching item desirability is the most important factor in forced-choice test construction, followed by explicit factors such as block size and instructions. The degree of matching is usually quantified by the average absolute difference in desirability between items: the greater the difference, the poorer the match. However, judging only by the mean ignores disagreement among evaluators about the desirability of the same item. Pavlov et al. [14] therefore proposed an alternative index, the IIA (Inter-item Agreement) index, which incorporates the BP and AC agreement indexes [15] into desirability matching so that items whose mean desirability does not differ can be matched more precisely. Practitioners can calculate the IIA index and assemble the test automatically using the R [16] package autoFC [17].
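As a toy illustration of the mean-difference matching criterion described above (not of the IIA, BP, or AC indices themselves), the base-R sketch below greedily pairs the items whose mean social-desirability ratings are closest; the ratings matrix and the greedy pairing rule are hypothetical.

```r
set.seed(1)
# Hypothetical desirability ratings: 8 items rated by 5 evaluators
ratings <- matrix(rnorm(8 * 5, mean = 4, sd = 0.6), nrow = 8,
                  dimnames = list(paste0("item", 1:8), paste0("rater", 1:5)))
des_mean <- rowMeans(ratings)                    # mean desirability per item
diff_mat <- abs(outer(des_mean, des_mean, "-")) # pairwise |mean difference|
diag(diff_mat) <- NA

pairs <- list()
remaining <- rownames(ratings)
while (length(remaining) >= 2) {
  sub  <- diff_mat[remaining, remaining, drop = FALSE]
  best <- which(sub == min(sub, na.rm = TRUE), arr.ind = TRUE)[1, ]
  pairs[[length(pairs) + 1]] <- remaining[best]
  remaining <- setdiff(remaining, remaining[best])
}
pairs  # blocks of two items with the smallest mean-desirability mismatch
```

In practice, the autoFC package mentioned above automates this kind of assembly and uses the IIA index rather than mean differences alone.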
When respondents can consistently identify an “ideal” rank within an MFC block, the item desirabilities are not well matched and the block loses its anti-faking effect [18]; such blocks produce more uniform response data under faking. As a result, response data can be used to estimate how fakable each block is, which is the idea behind the Faking Mixture Model [19].
2.1.2. Measurement model
The item is an explicit indicator of the trait, and the relationship between the item and the latent trait must be specified by a measurement model. Dominance models and unfolding models (or ideal-point models) are the two main types. Dominance models assume that the probability of endorsing an item increases monotonically with the individual's trait level; the Rasch model and the Two-Parameter Logistic Model (2PLM) both follow this assumption. Unfolding models assume that the closer an item's location is to the individual's trait level, the more likely a positive response becomes. For example, individuals who are very introverted may disagree with the item "I enjoy chatting quietly with a friend in a café" because they are uncomfortable in public places, whereas individuals who are extremely extroverted may disagree because they prefer more exciting settings [20]. Individuals at intermediate levels are most likely to agree with the item, and the item response function is single-peaked and bell-shaped; that is, the closer the individual's trait level is to the item's location, the higher the probability of a positive answer. The Generalized Graded Unfolding Model (GGUM) is the representative unfolding model [21].
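To make the contrast concrete, the base-R sketch below plots a monotone 2PL curve alongside a single-peaked dichotomous GGUM curve; the parameter values are purely illustrative.

```r
theta <- seq(-4, 4, by = 0.1)

# Dominance model: 2PL, probability of endorsement rises monotonically with theta
p_2pl <- function(theta, a = 1.5, b = 0) plogis(a * (theta - b))

# Unfolding model: dichotomous GGUM, probability peaks near the item location delta
p_ggum <- function(theta, alpha = 1.5, delta = 0, tau = -1) {
  num0 <- 1 + exp(alpha * 3 * (theta - delta))
  num1 <- exp(alpha * ((theta - delta) - tau)) + exp(alpha * (2 * (theta - delta) - tau))
  num1 / (num0 + num1)
}

plot(theta, p_2pl(theta), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = "P(agree)")   # monotone increasing curve
lines(theta, p_ggum(theta), lty = 2)                # bell-shaped curve around delta
```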
There is disagreement in the literature about which type of model better reflects individuals' responses to non-cognitive items [6,22,23]. Simulation and empirical studies, such as those by Chernyshenko et al. [24] and Tay et al. [25], have lent support to the unfolding model. According to these studies, unfolding items are just as effective as dominance items in assessing attitudinal traits. The unfolding model is also more flexible, since it approximates the dominance model when an item's location parameter is extreme. However, studies have shown that this superiority is not universal in practice: the psychometric properties of scales composed entirely of unfolding items can be markedly inferior to those of scales composed entirely of dominance items, including lower reliability and criterion correlations [26]. Furthermore, the unfolding model cannot directly handle the reverse scoring of negatively worded items [27]. In terms of model complexity, the dominance model is generally more parsimonious and has fewer parameters than the unfolding model. Unless there is clear evidence of the more complex model's superiority [28], the more parsimonious model should be considered first. Moreover, unfolding items are more difficult to write, and the exact meaning reflected by such items is harder to define. More information on the dominance and unfolding models can be found in Drasgow et al. [20].
The measurement model is an item-level feature and has nothing to do with the format of forced-choice items. Items from any measurement model can be used when combining items into forced-choice item blocks because they can all measure the same latent trait and the distribution of latent trait is constant for the same population. In practice, researchers must combine the characteristics of the item or the data to choose one of the dominance or unfolding models as the measurement model between the item and the latent trait, and there is currently no situation in which the two models are mixed in the same test.
2.1.3. Decision theory
Instead of evaluating each item independently, forced-choice tests require participants to make comparative judgments on a group of items and then make decisions on how to answer them. The absolute evaluation of the items serves as the foundation for determining an individual's trait level. According to Brown [13], the basis for individuals making comparative judgments on a set of items is their absolute evaluation level of each item being compared. To model forced-choice data, decision theory must explain the relationship between explicit response and absolute evaluation, allowing the individual's latent trait level to be assessed.
Thurstone’s Law of Comparative Judgment
Utility is a latent variable that can be thought of as the psychological value of an item to an individual. Thurstone [29] believed that an individual’s consideration of an item is essentially a consideration of its utility. Let $y_{ik}$ denote the explicit result of comparing items $i$ and $k$: $y_{ik}=1$ indicates that the individual selected item $i$ as the more consistent with themselves, and $y_{ik}=0$ otherwise. The relationship between the utility difference and the explicit response can be written as formula (1):

| $y_{ik}=\begin{cases}1, & t_i \ge t_k\\ 0, & t_i < t_k\end{cases}$ (1) |

where $t_i - t_k$ represents the utility difference between item $i$ and item $k$, and $t_i$ represents the utility value of item $i$.
The utility $t_i$ of item $i$ can be divided into two parts: a systematic part and a random part. The systematic part is a response function of the individual’s level on the latent trait measured by the item, and the random part is a random error $\varepsilon_i$. Thurstone assumed that the errors of different items are independent of each other and normally distributed. Therefore, the relationship between utility and latent trait can be expressed as formula (2):

| $t_i = f_i(\theta_{d_i}) + \varepsilon_i$ (2) |

where $\theta_{d_i}$ is the individual’s level on the latent trait $d_i$ measured by item $i$, and $f_i(\cdot)$ is the item response function.
Luce’s Choice Axiom
Luce [30,31] extended the Bradley-Terry model [32] beyond binary choice situations, using $v_i$ to represent the response intensity associated with item $i$. Denoting the set of alternative items by $B$, the probability of choosing item $i$ from $B$ is proportional to $v_i$, as in formula (3):

| $P(i \mid B) = \dfrac{v_i}{\sum_{k \in B} v_k}$ (3) |
Luce describes the ranking of a group of items as a series of independent steps, each making the best choice from the items that remain: first, the most suitable item is selected from the item set $B$; then the second most suitable item is selected from the remaining set; and so on, until the choice between the last two items is completed, thus producing a ranking of all the alternatives [12]. The probability of the ranking result is the product of the probabilities of each step.
When this decision theory is applied to the forced-choice model, $v_i$ can be derived from the item response function related to the latent trait. The MUPP framework proposed by Stark [33] extends Luce's Choice Axiom and greatly promoted the development of forced-choice models [7]. In the MUPP framework, it is assumed that the individual’s evaluation of each item is independent and that each item is unidimensional. Items in a block can come from the same or different dimensions, so the model is called the Multi-Unidimensional Pairwise Preference (MUPP) model. Assume an item block contains items $s$ and $t$, which measure the latent traits $\theta_{d_s}$ and $\theta_{d_t}$ separately. Let $P_s(1)$ represent the individual’s probability of accepting item $s$ and $P_s(0)$ the probability of rejecting it, with $P_s(0) = 1 - P_s(1)$. Substituting into formula (3) gives $v_s = P_s(1)P_t(0)$ and $v_t = P_s(0)P_t(1)$.
For the Pick item format, the probability of an individual choosing item $i$ from a block $B$ can be expressed as formula (4):

| $P(i \mid B) = \dfrac{P_i(1)\prod_{k \in B,\,k \ne i} P_k(0)}{\sum_{j \in B} P_j(1)\prod_{k \in B,\,k \ne j} P_k(0)}$ (4) |

For a pair $\{s, t\}$ this reduces to $P(s \succ t) = P_s(1)P_t(0)\,/\,[P_s(1)P_t(0) + P_s(0)P_t(1)]$.
Taking the Rank item format as an example, and assuming that an individual ranks the three-item block $\{s, t, u\}$ as $s \succ t \succ u$, the probability of this ranking is given by formula (5):

| $P(s \succ t \succ u) = P(s \mid \{s, t, u\}) \times P(t \mid \{t, u\})$ (5) |

where each factor is computed with formula (4).
Taking MOLE as an example, the rank of the two unselected items in a four-item block $\{s, t, r, u\}$ cannot be determined, so the probabilities of the two possible complete rankings are summed to obtain the probability of the observed selection. Letting $P(\text{most}=s, \text{least}=u)$ denote the probability that the individual chose $s$ as most and $u$ as least like themselves, we obtain formula (6):

| $P(\text{most}=s, \text{least}=u) = P(s \succ t \succ r \succ u) + P(s \succ r \succ t \succ u)$ (6) |
In conclusion, once the response probability of each individual item is determined, the response probability of a block can be determined. In addition, Thurstone’s Law of Comparative Judgment and Luce's Choice Axiom are equivalent for paired comparisons.
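As a concrete illustration of formulas (4) to (6), the base-R sketch below computes Pick, Rank, and MOLE block probabilities, assuming each item's endorsement probability $P_i(1)$ has already been obtained from some unidimensional IRT model; the numeric values are hypothetical.

```r
# Endorsement probabilities P_i(1) for a hypothetical 4-item block
p_accept <- c(A = 0.75, B = 0.40, C = 0.55, D = 0.30)

# Formula (4): probability of picking item i as "most like me" from the block
pick_prob <- function(i, p1) {
  p0 <- 1 - p1
  num <- sapply(names(p1), function(j) p1[j] * prod(p0[setdiff(names(p1), j)]))
  num[i] / sum(num)
}

# Formula (5): probability of a full ranking, modeled as sequential picks (Luce)
rank_prob <- function(order, p1) {
  prob <- 1
  remaining <- order
  for (i in order[-length(order)]) {
    prob <- prob * pick_prob(i, p1[remaining])
    remaining <- setdiff(remaining, i)
  }
  prob
}

# Formula (6): MOLE probability = sum over the full rankings consistent with the
# choice (handles 3- or 4-item blocks; larger blocks would need all permutations
# of the middle items)
mole_prob <- function(most, least, p1) {
  mid <- setdiff(names(p1), c(most, least))
  orders <- if (length(mid) > 1) list(mid, rev(mid)) else list(mid)
  sum(sapply(orders, function(m) rank_prob(c(most, m, least), p1)))
}

pick_prob("A", p_accept)                    # P(choose A as most like me)
rank_prob(c("B", "C", "A", "D"), p_accept)  # P(B > C > A > D)
mole_prob("B", "D", p_accept)               # P(most = B, least = D), as in Table 3
```

For a pair, all three functions collapse to the same two-outcome probability, which mirrors the equivalence noted above.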
Other types of decision theories include Coombs’s Unfolding Preference Model and Andrich’s Forced Endorsement Model. The former is a special case of Thurstone’s Law of Comparative Judgment, and the latter is equivalent to the Bradley-Terry model after simplification.
2.2. IRT models for forced-choice test
Table 4 presents a concise overview of the prevailing forced-choice models [23,33,34,35,36] organized by the three fundamental components of FC models. Because the two decision theories are equivalent for paired comparisons, the Pair format is listed in a separate column. This section elucidates the underlying distinctions among the models by presenting four specific examples, namely TIRT, MUPP-2PL, ZG-MUPP, and MUPP-GGUM.
Table 4.
Model summary.
| Measurement model | Decision theory | Pair | Pick | Rank | MOLE |
|---|---|---|---|---|---|
| Dominance model | Thurstone’s Law of Comparative Judgment | TIRT/RIM/MUPP-2PL | TIRT/BRB-IRT | TIRT/BRB-IRT | TIRT/BRB-IRT |
| | Luce’s Choice Axiom | | – | 2PLM-Rank/ELIRT | – |
| Unfolding model | Thurstone’s Law of Comparative Judgment | ZG-MUPP/MUPP-GGUM | – | – | – |
| | Luce’s Choice Axiom | | – | GGUM-Rank/FCRM | – |
2.2.1. TIRT model
Brown and Maydeu-Olivares [34] proposed TIRT, a MIRT model for dominance response items based on Thurstone's Law of Comparative Judgment.
TIRT assumes that the psychological process underlying selection or ranking is a series of independent pairwise comparison judgments among the items in a block, which produce comparison outcomes. Before modeling the data, the responses must be binary coded to obtain the outcomes of these pairwise comparisons.
Taking a Rank-3 item block as an example, let the items in the block be $A$, $B$, and $C$, each measuring the latent trait of a different dimension ($\theta_a$, $\theta_b$, and $\theta_c$). Assuming that the individual's ranking is $B \succ C \succ A$, the binary coding result is $y_{AB}=0$, $y_{AC}=0$, $y_{BC}=1$, which represents $t_A < t_B$, $t_A < t_C$, and $t_B \ge t_C$. Taking $y_{AB}$ as an example, its response probability is given by formula (7):

| $P(y_{AB}=1 \mid \theta_a, \theta_b) = \Phi\!\left(\dfrac{(\mu_A - \mu_B) + \lambda_A\theta_a - \lambda_B\theta_b}{\sqrt{\psi_A^2 + \psi_B^2}}\right)$ (7) |

where $\mu_A$ is the mean of the latent utility $t_A$, $\lambda_A$ is the factor loading of item $A$ on the latent trait $\theta_a$, and the latent traits and errors are assumed to be normally distributed. The variances of the errors $\varepsilon_A$ and $\varepsilon_B$ are $\psi_A^2$ and $\psi_B^2$, so the variance of their difference is $\psi_A^2 + \psi_B^2$, and $\Phi$ denotes the cumulative standard normal distribution function.
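The two TIRT building blocks described above, binary coding of a ranking and the pairwise normal-ogive probability of formula (7), can be sketched in base R as follows; all parameter values are hypothetical.

```r
# Convert an observed ranking (1 = most preferred) into pairwise binary outcomes
rank_to_pairs <- function(ranking) {
  pairs <- t(combn(names(ranking), 2))
  data.frame(i = pairs[, 1], k = pairs[, 2],
             # y_ik = 1 if item i is ranked above (preferred to) item k
             y = as.integer(ranking[pairs[, 1]] < ranking[pairs[, 2]]))
}
rank_to_pairs(c(A = 3, B = 1, C = 2))   # yields y_AB = 0, y_AC = 0, y_BC = 1

# Formula (7): P(y_ik = 1) for items i and k loading on traits theta_i and theta_k
tirt_pair_prob <- function(theta_i, theta_k, mu_i, mu_k,
                           lambda_i, lambda_k, psi2_i, psi2_k) {
  pnorm(((mu_i - mu_k) + lambda_i * theta_i - lambda_k * theta_k) /
          sqrt(psi2_i + psi2_k))
}
tirt_pair_prob(theta_i = 0.5, theta_k = -0.2, mu_i = 0.1, mu_k = 0.3,
               lambda_i = 0.8, lambda_k = 0.7, psi2_i = 0.36, psi2_k = 0.51)
```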
TIRT's applicability in various settings has been tested in simulation and empirical studies by many scholars [5,7,37,38,39,40,41]. On the one hand, these studies indicate that TIRT overcomes the ipsative problem of conventional scoring to some extent, improves measurement accuracy compared with conventional scoring, and produces results closer to those of a Likert single-stimulus scale [42]; on the other hand, they also indicate that, in order to show better properties than conventional scoring, TIRT places more restrictions on the test design [34].
2.2.2. MUPP-GGUM model
The MUPP-GGUM model, proposed by Stark [33], is a multidimensional model for unfolding response items that is based on Luce's Choice Axiom. As the first forced-choice model used in computerized adaptive testing, it has been extensively used in the development of numerous personality assessments for the purpose of military personnel selection in the United States. It also provides consistent guidance for the development process.
Stark [33] used the binary-scoring version of the GGUM, an unfolding response model, to calculate the response probability of a single item, that is, the $P_s(1)$ and $P_t(0)$ in formula (4). Hontangas et al. [22] developed a MUPP-GGUM model suitable for the Rank and MOLE item formats and used an MCMC joint estimation algorithm to estimate the statement and person parameters. Based on this model, the probability that an individual accepts item $i$, which is substituted into formula (4) to obtain the choice probability, is given by formula (8):

| $P_i(1 \mid \theta_{d_i}) = \dfrac{\exp\{\alpha_i[(\theta_{d_i}-\delta_i)-\tau_i]\}+\exp\{\alpha_i[2(\theta_{d_i}-\delta_i)-\tau_i]\}}{1+\exp\{3\alpha_i(\theta_{d_i}-\delta_i)\}+\exp\{\alpha_i[(\theta_{d_i}-\delta_i)-\tau_i]\}+\exp\{\alpha_i[2(\theta_{d_i}-\delta_i)-\tau_i]\}}$ (8) |

Among them, $\alpha_i$ represents the discrimination parameter of item $i$, $\tau_i$ is the threshold parameter of item $i$, $\delta_i$ is the location parameter of item $i$, and $\theta_{d_i}$ represents the latent trait measured by item $i$.
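A minimal sketch of the MUPP-GGUM choice probability in base R, combining the dichotomous GGUM acceptance probabilities of formula (8) through the pairwise form of formula (4); the parameter values are hypothetical.

```r
# Dichotomous GGUM acceptance probability (formula 8)
ggum_p <- function(theta, alpha, delta, tau) {
  num0 <- 1 + exp(alpha * 3 * (theta - delta))
  num1 <- exp(alpha * ((theta - delta) - tau)) + exp(alpha * (2 * (theta - delta) - tau))
  num1 / (num0 + num1)
}

# Pairwise MUPP combination (formula 4 for a pair)
mupp_choice <- function(Ps1, Pt1) {
  (Ps1 * (1 - Pt1)) / (Ps1 * (1 - Pt1) + (1 - Ps1) * Pt1)
}

Ps <- ggum_p(theta = 0.8,  alpha = 1.2, delta = 0.5, tau = -1.0)  # item s, trait 1
Pt <- ggum_p(theta = -0.3, alpha = 0.9, delta = 1.5, tau = -0.8)  # item t, trait 2
mupp_choice(Ps, Pt)   # probability of preferring item s over item t
```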
For this model, Joo et al. [43] created two information indices: OII (Overall Item Information) and OTI (Overall Test Information). For item selection, Joo et al. also offered a method for plotting conditional OII so that researchers can further compare blocks and select the one that provides the greatest amount of information within the target ability interval. The development of information indices also lays the groundwork for CAT [44].
2.2.3. MUPP-2PL model
According to Morillo et al. [23], items written for a dominance measurement model are preferable to those written for an unfolding measurement model in terms of item-writing difficulty and model parsimony. Therefore, on the basis of the MUPP framework, Morillo et al. replaced the item response functions $P_s(1)$ and $P_t(1)$ in formula (4) with the classical dominance response model, the 2PLM, and called the result the MUPP-2PL model. Under this model, the probability of an individual choosing item $s$ over item $t$ is given by formula (9):

| $P(s \succ t \mid \theta_{d_s}, \theta_{d_t}) = \dfrac{1}{1+\exp\{-[\alpha_s\theta_{d_s} - \alpha_t\theta_{d_t} + (\gamma_s - \gamma_t)]\}}$ (9) |

Among them, $\alpha_i$ represents the discrimination parameter of item $i$, $\gamma_i$ is the intercept parameter of item $i$, and $\theta_{d_i}$ represents the latent trait measured by item $i$.
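Because substituting the 2PL into formula (4) collapses to a single logistic function of the two traits, the MUPP-2PL choice probability of formula (9) takes one line of base R; the values below are hypothetical.

```r
# Formula (9): probability of choosing item s over item t under MUPP-2PL
mupp_2pl <- function(theta_s, theta_t, alpha_s, alpha_t, gamma_s, gamma_t) {
  plogis(alpha_s * theta_s - alpha_t * theta_t + (gamma_s - gamma_t))
}
mupp_2pl(theta_s = 0.6, theta_t = -0.4, alpha_s = 1.3, alpha_t = 1.0,
         gamma_s = 0.2, gamma_t = -0.1)
```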
Morillo et al. [23] found that test length influenced the recovery of item parameters, ability parameters, and trait relationships: the longer the test, the more accurate the estimates. Furthermore, sample size has a significant impact on parameter estimation accuracy, and the method estimates the difficulty (intercept) parameters more accurately than the discrimination parameters. Finally, in an empirical study, Morillo et al. found that MUPP-2PL's estimates of the relationships among some latent traits differed considerably from previous studies, although it was unclear whether the difference was due to the respondent population or a change in the testing situation.
2.2.4. ZG-MUPP model
The ideal point model is used as the measurement model in the ZG-MUPP model [45]. The model extends the MUPP framework's decision theory from Luce's Choice Axiom to Thurstone's Law of Comparative Judgment.
Assume the participant must choose between two items $s$ and $t$ that assess latent traits on different dimensions. The ZG-MUPP model associates with each item a latent feature variable and a statement variable. The latent feature variables follow a bivariate normal distribution, while the statement variables are independent. The individual chooses between items $s$ and $t$ by comparing the distance between the latent feature measured by each item and that item's statement variable: if the distance for item $s$ is smaller than the distance for item $t$, the participant is inclined to choose item $s$. The ZG-MUPP model calculates the probability of choosing item $s$ over item $t$ via formulas (10) to (12):
| (10) |
| (11) |
| (12) |
Among them, each item $i$ has a discrimination parameter $\alpha_i$ and a location parameter $\delta_i$, and $\theta_{d_i}$ represents the latent trait measured by item $i$.
The ZG-MUPP model was created in response to criticism that the MUPP-GGUM model has too many parameters, which makes parameter estimation difficult. Each item in the MUPP-GGUM model carries three types of parameters: discrimination, location, and threshold, which makes the model cumbersome, complicates parameter estimation, and increases the required sample size [21]. Reducing the number of model parameters is therefore desirable. Previous research has shown that, when item and latent trait parameters are estimated directly, the MUPP model's threshold parameters are more difficult to estimate than the other parameters [35]. Moreover, threshold parameters provide little information in MFC testing [45]. Therefore, the ZG-MUPP model removes the threshold parameters and retains two item parameters: discrimination and location. Joo also derived the information function for this model to facilitate its application in CAT and the calculation of standard errors [45].
The unfolding model has long been considered more flexible than dominance models [25,26], but it is also more complex and requires larger samples. By simplifying the unfolding model, the ZG-MUPP model greatly increases its competitiveness.
3. Parameter estimation methods
To obtain the parameters of forced-choice models in complex scenarios with multidimensional data, appropriate parameter estimation algorithms must be used. Based on the estimation process, these approaches can be divided into joint estimation and two-phase estimation strategies. The two main algorithms used in joint estimation are least squares algorithms, which follow the frequentist tradition, and the Markov chain Monte Carlo (MCMC) algorithm, which is based on the Bayesian method.
3.1. Two-phase estimation strategy
MUPP-GGUM uses a two-phase strategy: the item parameters needed to calculate $P_s(1)$ and $P_t(0)$ are pre-calibrated from Likert scale data in steps 2–3, estimated with the GGUM2000 computer program (Roberts, Donoghue, & Laughlin, 2000b); in step 7, the forced-choice response data are used to estimate abilities with MUPP-GGUM. Stark et al. [33,46] obtained Maximum A Posteriori (MAP) estimates of the high-dimensional latent traits using a BFGS (Broyden-Fletcher-Goldfarb-Shanno) method similar to Newton-Raphson iteration. Expected A Posteriori (EAP) or MLE can also be used to estimate latent traits. Because an increase in the number of dimensions leads to an exponential increase in the number of quadrature nodes for numerical integration in EAP, EAP is best suited to one or two dimensions, whereas MAP and MLE scale to larger numbers of dimensions.
Stark implemented the BFGS algorithm in DFPMIN [47], but it can also be implemented in R by specifying the method parameter as L-BFGS-B in the function "optim." In the area of item parameter calibration, GGUM has made many breakthroughs in parameter estimation in recent years [48], supported by the related R packages GGUM [49], mirt [50], and Bmggum [51].
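A minimal sketch of the second phase (trait estimation with fixed, pre-calibrated item parameters) as MAP estimation via optim() in base R. For brevity it uses the MUPP-2PL choice probability rather than the GGUM; the item parameters and responses are hypothetical.

```r
# Hypothetical pre-calibrated pairwise blocks: each row is one pick-2 block
blocks <- data.frame(dim_s = c(1, 2, 1), dim_t = c(2, 3, 3),          # trait per item
                     alpha_s = c(1.2, 0.9, 1.1), alpha_t = c(1.0, 1.3, 0.8),
                     gamma_s = c(0.2, -0.1, 0.3), gamma_t = c(0.0, 0.4, -0.2),
                     y = c(1, 0, 1))                                  # 1 = item s chosen

# Negative log posterior for one respondent's trait vector (standard normal prior)
neg_log_post <- function(theta, blocks) {
  p <- plogis(blocks$alpha_s * theta[blocks$dim_s] -
              blocks$alpha_t * theta[blocks$dim_t] +
              (blocks$gamma_s - blocks$gamma_t))
  -(sum(dbinom(blocks$y, size = 1, prob = p, log = TRUE)) +
      sum(dnorm(theta, 0, 1, log = TRUE)))
}

fit <- optim(par = rep(0, 3), fn = neg_log_post, blocks = blocks, method = "L-BFGS-B")
fit$par   # MAP estimates of the three latent traits
```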
This strategy implicitly makes the strong assumption that item parameters are invariant across test formats, which may not hold. However, the process is very convenient for item bank management and thus facilitates the development of forced-choice adaptive tests.
3.2. Joint estimation strategy
3.2.1. Least squares algorithm
TIRT was developed within the structural equation modeling (SEM) framework. The SEM software Mplus [52] or the lavaan package [53] can estimate item parameters using unweighted least squares or diagonally weighted least squares, and MLE, MAP, and EAP methods can be used to estimate latent traits.
Brown and Maydeu-Olivares [54] provide an Excel macro (http://annabrown.name/software) that exports Mplus syntax after the test design is entered, for the convenience of practitioners. Bürkner's thurstonianIRT package provides functions for data simulation and serves as an interface through which users can select the lavaan package [53] or Mplus as the back end for model fitting, automatically generating the corresponding code for the chosen method [55].
Obviously, the development of TIRT software provides great convenience for practitioners, which is one reason why TIRT is widely used, but there are some reservations. For example, Bürkner et al. [55] found serious convergence failures when fitting TIRT with Mplus and lavaan, particularly for large tests (for example, for a 5-dimension test with 27 items per dimension, the convergence rate was only about 0.3). Furthermore, a large amount of RAM is required (for example, a 30-dimension test with nine item blocks per dimension required 32 GB of RAM); otherwise, the code must be told not to compute the chi-square, standard errors, and other fit indices in order to reduce running time and memory pressure. The most common error is a negative variance, which often necessitates fixing inter-dimensional relationships or factor loadings to facilitate convergence, but the estimation results then depend heavily on these fixed values. Given TIRT's sensitivity in model identification, if TIRT is to be used in a high-dimensional test, the quality of the items must be fully ensured during test development, for instance through unidimensionality testing of the items. The RAM issue must also be considered when selecting an estimation method; otherwise the model may fail to converge, or memory may be insufficient to obtain any estimates, reducing the test developer's confidence in the test quality and the model.
3.2.2. MCMC algorithm
Unlike TIRT, the later models all base their parameter estimation algorithms on MCMC. MCMC is a probabilistic, full-information estimation method that does not require complex mathematical derivation: researchers only need to construct a reasonable posterior distribution, and it can achieve estimation accuracy comparable to frequentist algorithms (maximum likelihood estimation, etc.). The Metropolis-Hastings MCMC algorithm is used by the MUPP-2PL, GGUM-Rank, RIM, and BRB-IRT models to estimate item and ability parameters from forced-choice data.
Ox [56], OpenBUGS 3.2.3 [57], WinBUGS [58], JAGS [59], and other software can implement the MCMC algorithm. WinBUGS and OpenBUGS are relatively slow among these programs, whereas the MCMC method developed by Bürkner et al. [55] for TIRT uses the Stan [60] language, and estimation speed is greatly improved by the more advanced NUTS (No-U-Turn Sampler) and HMC (Hamiltonian Monte Carlo) sampling methods. These models all use the $\hat{R}$ statistic proposed by Gelman & Rubin [61] as the convergence criterion (values below 1.2 indicate that the parameters have converged). Although these models do not have significant convergence issues, they do require practitioners to have a deeper understanding of MCMC concepts and implementation steps, and the main disadvantage of MCMC methods is the long estimation time [62].
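A minimal random-walk Metropolis-Hastings sketch in base R, sampling one respondent's trait vector under the MUPP-2PL choice probability with item parameters held fixed; real joint estimation also samples the item parameters, and all values here are hypothetical.

```r
set.seed(123)
blocks <- data.frame(dim_s = c(1, 2, 1), dim_t = c(2, 3, 3),
                     alpha_s = c(1.2, 0.9, 1.1), alpha_t = c(1.0, 1.3, 0.8),
                     gamma_s = c(0.2, -0.1, 0.3), gamma_t = c(0.0, 0.4, -0.2),
                     y = c(1, 0, 1))
log_post <- function(theta, blocks) {
  p <- plogis(blocks$alpha_s * theta[blocks$dim_s] -
              blocks$alpha_t * theta[blocks$dim_t] +
              (blocks$gamma_s - blocks$gamma_t))
  sum(dbinom(blocks$y, 1, p, log = TRUE)) + sum(dnorm(theta, 0, 1, log = TRUE))
}

n_iter <- 5000
theta  <- rep(0, 3)
draws  <- matrix(NA_real_, n_iter, 3)
for (it in seq_len(n_iter)) {
  proposal <- theta + rnorm(3, 0, 0.3)                       # random-walk proposal
  if (log(runif(1)) < log_post(proposal, blocks) - log_post(theta, blocks)) {
    theta <- proposal                                        # accept
  }
  draws[it, ] <- theta
}
colMeans(draws[-(1:1000), ])   # posterior means after discarding burn-in
```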
See Table 5 for a summary of various model parameter estimation methods.
Table 5.
Summary of model parameter estimation methods.
| Parameter Estimation Methods | Software Implementations | Advantage | Disadvantage |
|---|---|---|---|
| Two phases: 1. pre-calibrate item parameters from Likert scale data; 2. estimate abilities with BFGS | 1. R packages: GGUM/mirt/bmggum; 2. DFPMIN / R package: stats | Pre-calibrated item parameters are convenient for adaptive item bank management | Using Likert-calibrated item parameters to score forced-choice data risks item parameters being inconsistent across test formats |
| Weighted least squares / diagonally weighted least squares | Mplus; R package: thurstonianIRT (Mplus/lavaan methods) | Short estimation time, easy to use | Hard to converge in high-dimensional settings; memory usage is very high; computation of fit indices sometimes has to be dropped |
| MCMC | Ox/WinBUGS/JAGS/OpenBUGS; R package: thurstonianIRT (Stan method) | No convergence problems | Long estimation time, less easy to use |
4. Applied research
The MFC-IRT model is widely used in industrial and organizational psychology. For example, TIRT has been used to develop the Assessment of Work-Related Maladaptive Personality Traits [63], as well as the Occupational Personality Questionnaire (OPQ32r) and the Customer Contact Styles Questionnaire (CCSQ) [34,64]. In 360-degree feedback, it has also been suggested that forced-choice tests scored with TIRT have better construct validity and aggregate validity than traditional Likert rating scales [65]. The Adaptive Employee Personality Test (ADEPT-15) [66] and the Tailored Adaptive Personality Assessment System (TAPAS) [67] both use MUPP-GGUM; these two tests are also ground-breaking attempts at computerized adaptive forced-choice testing (CAT). Meanwhile, testing item parameter invariance is an important part of the test development process, and invariance testing methods for the forced-choice test are gradually being developed and refined. Considerable evidence has also accumulated in the area of validity research, in which practitioners are increasingly interested. This paper therefore summarizes the current state of research in three areas: parameter invariance testing, CAT, and validity research.
4.1. Parameter invariance test
To ensure that all participants understand an item in the same way, test developers must test for measurement consistency (item parameter invariance). In the forced-choice test, item parameter invariance can be examined as cross-block consistency or cross-population consistency. A lack of parameter invariance indicates that the probability of endorsing an item is influenced by factors other than the targeted trait.
The degree to which an item maintains parameter invariance when paired with different items across item blocks is measured by cross-block consistency. Block 1 (which contains items A, B, and C) and Block 2 (which contains items A, D, and E) are two item blocks that share item A. The estimation results for item parameters for item A in the two item blocks should be consistent, indicating cross-block parameter invariance. Lin and Brown [68] used the TIRT to compare the parameter invariance of two sets of Rank-3 and MOLE-4. Because the latter only added one new item to each of the former's item blocks, the proportion of common items between each pair of item blocks was 75%, and only a few items had significant deviations.
Cross-population consistency refers to whether an item's parameters are invariant across people from different backgrounds (for example, different genders or different testing situations). Testing for such variability is also known as Differential Item Functioning (DIF) analysis. If item parameters differ significantly between groups, the individual's background influences the probability of endorsing the item; if a test contains too many such items, its validity is reduced and it becomes unfair. Lee & Smith [69] tested the measurement invariance of TIRT using multiple-group confirmatory factor analysis (CFA) and suggested ΔCFI > 0.007 and ΔCFI > 0.001 as critical values for metric and scalar non-invariance, respectively, but this method cannot pinpoint specific items. DIF is parameter inconsistency at the item level. P. Lee et al. [70] proposed an omnibus Wald test for the discrimination and intercept parameters of TIRT and showed through simulation that detection efficiency was higher under the free-baseline approach: the detection rate approached 1 and the Type I error rate approached 0.05 as sample size and DIF magnitude increased. Qiu & Wang [71] proposed three DIF detection methods for the RIM: EMD (equal-mean-difficulty), AOS (all-other-statement), and CS (constant-statement); the CS method performed better than the other two when the test contained DIF items.
4.2. Computerized adaptive testing
The measurement dimensions of personality assessment tools are typically high-dimensional because of the complexity of human personality; the OPQ32r, for example, assesses 32 personality dimensions. The more dimensions there are, the more items are needed, and an excessively long test causes fatigue and boredom, leading to careless answering. From the standpoint of measurement efficiency, once a small number of items has yielded acceptable precision on some dimensions, subsequent items can focus on the dimensions whose estimates are still uncertain, so that all dimensions reach an acceptable level of reliability as quickly as possible. Evaluation efficiency can thus be improved. One solution to this problem is to create a CAT version of the forced-choice test.
The forced-choice CAT was first used to select US Navy personnel 15 years ago: Houston et al. [72] created the Navy Computer Adaptive Personality Scales, which assess 19 personality traits. Stark et al. [46] proposed a six-step forced-choice adaptive procedure for pick-2 blocks that are either unidimensional or multidimensional, using MUPP-GGUM. The most significant difference from traditional CAT is that the proportion of unidimensional blocks and the dimension combinations used for assembling and storing multidimensional blocks must be determined in advance. The two studies above imply that CAT improves efficiency relative to non-adaptive testing: the forced-choice CAT can reach comparable accuracy with only about half as many questions as a non-adaptive test. In addition, TAPAS, which is based on MUPP-GGUM, is an adaptive personality test used for US military selection [67].
The ideal-point measurement model is currently used in the majority of computerized adaptive forced-choice tests, but dominance items have several practical advantages over ideal-point items. According to Brown and Maydeu-Olivares [27], ideal-point items are more difficult to create during content development, fewer analytic software options support them, and their item parameters are more difficult to estimate [73]. Chen et al. [74] investigated forced-choice CAT with dominance items using the Rasch model, and Lin et al. [75] conducted the first empirical study of a multidimensional forced-choice CAT with dominance items using the TIRT model.
The assembly of blocks and the rules for block selection determine the validity of a multidimensional FC-CAT. Blocks can be assembled in two ways: fixed assembly and dynamic assembly. A benefit of fixed assembly is that block parameters remain constant, but it produces fewer blocks from the same item pool than dynamic assembly. Dynamic assembly combines items in real time while the examinee takes the test; this flexible matching generates more blocks, making item leakage more difficult, but the stability of item parameters across these blocks may be an issue, and dynamically created blocks still require items of similar social desirability to be paired [14]. The genetic-algorithm-based NHBSA (Node Histogram-Based Sampling Algorithm) [76] can help achieve this goal [77].
Block selection rules must balance the information collected for each dimension: during the test, the data from previously answered blocks determine the next block the examinee should answer. For multidimensional CAT there are three Fisher information (FI)-based selection rules, which Mulder & van der Linden [78] categorized as A-optimality (trace), D-optimality (determinant), and E-optimality (eigenvalue). Among them, the A-optimality method has slightly better estimation accuracy than the D-optimality method, and the E-optimality method is the most unstable [78]. Veldkamp & van der Linden [79] proposed posterior expected Kullback-Leibler (KL) information as an alternative to FI, which includes three selection rules: the KL index (KI), posterior expected KL information (KB), and the posterior KL distance between subsets (KLP) [78,79,80,81,82].
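A minimal sketch of D-optimality block selection under MUPP-2PL: each candidate pair contributes a rank-one Fisher information term p(1 - p) g g', where g carries alpha_s at the dimension of item s and -alpha_t at the dimension of item t, and the block maximizing the determinant of the accumulated information is chosen. The pool, current estimates, and prior information are hypothetical.

```r
n_dim     <- 3
theta_hat <- c(0.4, -0.2, 0.1)                  # current trait estimates
pool <- data.frame(dim_s = c(1, 2, 1, 3), dim_t = c(2, 3, 3, 1),
                   alpha_s = c(1.2, 0.9, 1.1, 1.0), alpha_t = c(1.0, 1.3, 0.8, 1.2),
                   gamma_s = c(0.2, -0.1, 0.3, 0.0), gamma_t = c(0.0, 0.4, -0.2, 0.1))

block_info <- function(b, theta) {
  p <- plogis(b$alpha_s * theta[b$dim_s] - b$alpha_t * theta[b$dim_t] +
              (b$gamma_s - b$gamma_t))
  g <- numeric(length(theta))
  g[b$dim_s] <- g[b$dim_s] + b$alpha_s
  g[b$dim_t] <- g[b$dim_t] - b$alpha_t
  p * (1 - p) * outer(g, g)                     # rank-one information contribution
}

info_so_far <- diag(0.5, n_dim)                 # information already collected (plus prior)
crit <- sapply(seq_len(nrow(pool)), function(i)
  det(info_so_far + block_info(pool[i, ], theta_hat)))
which.max(crit)                                 # index of the next block to administer
```

The trace-based (A-optimality) and eigenvalue-based (E-optimality) rules mentioned above replace the determinant with other scalar summaries of the information matrix.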
Chen et al. [74] proposed three subpool selection strategies to improve the efficiency of item selection and control item exposure: the Sequential Strategy, the Multinomial Strategy, and the High-SE Strategy. The Sequential Strategy chooses blocks from each dimension combination in turn based on the amount of information until the termination criterion is reached. The Multinomial Strategy addresses the shortcomings of the Sequential Strategy by randomly selecting a subpool according to a multinomial distribution. The High-SE Strategy first determines the dimensions on which an individual currently has the largest standard errors and then selects blocks from the corresponding dimension combination. In terms of overall performance, the Multinomial Strategy does well.
Furthermore, Chen et al. [74] proposed the Revised Sympson-Hetter Online (RSHO) to control the exposure rate of items. When selecting item blocks, first determine the most appropriate item blocks based on the amount of information and then select items with less exposure. The RSHO regulates the exposure rate of the items while slightly sacrificing measurement accuracy.
According to Seybert & Becker [82], the retest reliability of the forced-choice CAT is lower than that of traditional Likert rating scales [83], but comparable to the retest reliability of duplicate Likert rating scales.
4.3. Validity studies
To determine whether the latent trait scores produced by these IRT models accurately reflect individuals' true characteristics, researchers have focused on five areas. The first is whether IRT scoring recovers latent traits and their relationships better than traditional scoring [12,22,84]. Almost all studies in this direction conclude that estimating trait scores with IRT yields a significant improvement in measurement accuracy over traditional scoring, which gives researchers great confidence to develop more IRT models. Some studies, however, have found that the results of forced-choice models are not always better than those of traditional scoring [40,85,86]. Moreover, the extent to which the scores obtained from these models can be interpreted as normative scores merits further investigation, because it directly determines whether these scores can be used for personnel selection or for correlation analysis with external criteria.
To answer the above question, the second direction investigates the relationship between the latent trait scores obtained from forced-choice IRT models and those obtained from Likert single-stimulus scales [42,64,87,88]. In these studies, the score from a single-stimulus scale is treated as the closest available approximation to the true value of the individual's latent trait. If the scores obtained from the forced-choice model are highly similar in origin, scale, and dimension relationships, the equivalence of the Likert scale and the forced-choice scale is supported.
The third direction investigates the ability of forced-choice tests to resist faking. When the social desirability of the items within a block is matched, the forced-choice test outperforms Likert rating scales in anti-faking ability [89]. Compared with analyzing forced-choice tests with TIRT, analyzing Likert rating scales with the Graded Response Model (GRM) cannot effectively distinguish individuals at high trait levels, because participants tend to present themselves favorably, which lowers the discrimination of items reflecting high trait levels [90].
The fourth direction investigates the application of IRT in non-self-rating situations. Because Likert rating scales are subject to common method bias, different raters' evaluations are influenced by their internal standards of ideal behavior, resulting in low inter-rater consistency and reliability. Hung et al. [91] proposed Forced-Choice Ranking Models (FCRM), which quantify rater leniency and task difficulty as new indicators and are useful in non-self-rated scenarios.
The fifth direction is reliability research. Reliability constrains validity, so it is a crucial indicator for evaluating the applicability of FC models, and many scholars have reported reliability indicators in IRT-FC studies [7,23,27,34,35,37,42,43,54,88,89,92]. According to Lin et al. [92], reliability indicators include theoretical reliability, empirical reliability, simulated true estimation reliability, and retest reliability. The evaluation score is influenced by four types of measurement error [93]: random errors (due to random fluctuations in an individual’s responses), transient errors (due to situational factors affecting a specific measurement occasion), item-specific errors (due to consistent interindividual differences in item interpretation), and scale-specific errors (due to different measurement operationalizations of the same psychological construct). Simulated true estimation reliability involves random, item-specific, and scale-specific errors, while simulated retest reliability reflects only random errors [92]. Theoretical and empirical reliability based on IRT information often overestimate reliability because the local independence assumption between pairwise comparisons within the same block (containing more than two items) does not strictly hold [90].
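As a small illustration of one of these indicators, the base-R sketch below computes empirical reliability in a commonly used way, from estimated trait scores and their standard errors; the inputs are simulated and purely hypothetical.

```r
set.seed(7)
theta_hat <- rnorm(500, 0, 0.9)     # estimated trait scores (e.g., EAP estimates)
se        <- runif(500, 0.3, 0.5)   # their standard errors
# Empirical reliability: estimated true-score variance over total variance
empirical_rel <- var(theta_hat) / (var(theta_hat) + mean(se^2))
empirical_rel
```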
5. Future research
Forced-choice IRT has great potential as a form of assessment that can effectively resist faking and response biases and improve response efficiency, particularly in non-cognitive, high-stakes assessment. Beyond the unresolved problems of previous research, the following future research directions are proposed: new FC models, item parameter invariance research, forced-choice CAT research, and validity research.
5.1. New FC models
Current forced-choice models cover the Pick, MOLE, and Rank item formats. There are also variations of Pick-2, such as the graded blocks of ADEPT-15 [66], which ask participants to choose their preferred item and indicate the degree of preference (see Table 6). Brown and Maydeu-Olivares [94] created a factor analysis model and information function for graded blocks based on Thurstone's Law of Comparative Judgment, and Qiu et al. [95] built the DPM model to handle graded block data. This type of item refines the description of the individual's behavior and provides more information, but it also increases the cognitive load. So far, only the dominance models RIM and TIRT have been extended to a polytomous level; MUPP-GGUM is a dichotomous version of the GGUM, and polytomous forced-choice ideal-point models remain to be developed.
Table 6.
Pick-2 graded blocks.
| Slightly agree | Agree | |
|---|---|---|
| A lack of finding things | ✓ | |
| B explore unfamiliar territory |
A new kind of forced-choice IRT model incorporating response time has also been developed. The longer the decision takes, the less differentiated the individual's preferences for the statements are likely to be, making the decision harder; the corresponding trait levels are therefore likely to be similar. This idea is used in the Thurstonian D Diffusion Model [96], the Linear Ballistic Accumulator Item Response Theory Model [97], and Guo et al.'s [98] Log-Linear Model. Collecting response time data in computerized settings is effortless, so it increases the amount of information and improves the efficiency of the model without imposing additional cognitive load on participants. In principle, each traditional model can be given a similar version that incorporates response time, and more models may be extended in this direction in the future.
Both polytomous models and response-time models obtain more information, which yields more accurate parameter estimates, and all models in Table 4 can be extended in these two directions. However, response-time models do not increase cognitive load but are generally only suitable for computerized testing, where response times are easy to collect, whereas polytomous models increase cognitive load and thus place significant limits on block size.
5.2. Research on parameter invariance based on each model
Following Lin and Brown's study [68] of TIRT, it remains to be studied whether item parameters retain cross-block invariance when the proportion of common items between blocks is lower. In addition, the cross-block consistency of other models needs to be studied.
At present, there is only research on the parameter invariance of TIRT [69,70] and RIM [71]. Future studies should broaden the repertoire of differential item functioning (DIF) test methods for the forced-choice model and improve their sensitivity in detecting DIF from multiple sources.
5.3. Forced-choice CAT
Although forced-choice CAT has accumulated considerable empirical experience, the adaptive procedures for latent trait estimation have been developed using item parameters calibrated in advance with single-stimulus scales, and the item pool consists of single statements rather than preassembled blocks. During item selection, statements are combined into forced-choice blocks on the fly, so the impact of cross-block consistency on latent trait estimation under this CAT process needs further study. In addition, the number of possible dimension combinations and the test length increase sharply in high-dimensional situations, which challenges content balancing and test efficiency; future work can further explore how to exploit the advantages of CAT in high-dimensional settings. Although the subpool partition strategies for item selection and the within-person statement exposure control procedures proposed by Chen et al. [74] do not involve scoring and can be extended to CATs based on models other than the RIM, their specific performance still needs to be examined. Moreover, control methods such as the Multinomial Strategy cannot be applied directly to variable-length tests, so more appropriate item selection strategies remain to be constructed.
5.4. Validity studies
A large amount of research compares forced-choice tests and Likert rating scales to examine whether they measure the same content in the same way. However, differences in test format and the response biases induced by Likert rating scales inevitably introduce some error, and how to control these biases deserves further exploration. In the forced-choice format, the larger the item block, the stronger the resistance to faking, but the cognitive load also increases [89]; future research can explore the balance between anti-faking effectiveness and cognitive load when choosing block size. In addition, most existing validity studies focus on TIRT, and the validity of GGUM-Rank and other new models needs to be examined.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Ethics requirements statement
This article does not contain any studies with human participants performed by any of the authors.
CRediT authorship contribution statement
Lei Nie: Writing – original draft, Resources, Investigation, Funding acquisition. Peiyi Xu: Writing – review & editing, Project administration, Conceptualization. Di Hu: Writing – original draft.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- 1. Luo F., Zhang H. Methods of coping with faking of personality tests. Psychological Exploration. 2007;27(4):78–82.
- 2. Aguinis H., Handelsman M.M. Ethical issues in the use of the bogus pipeline. J. Appl. Soc. Psychol. 1997;27(7):557–573. doi: 10.1111/j.1559-1816.1997.tb00647.x.
- 3. White L.A., Young M.C. Development and validation of the Assessment of Individual Motivation (AIM). Paper presented at the annual meeting of the American Psychological Association; San Francisco: 1998, August.
- 4. Baron H. Strengths and limitations of ipsative measurement. J. Occup. Organ. Psychol. 1996;69(1):49–56. doi: 10.1111/j.2044-8325.1996.tb00599.x.
- 5. Frick S., Brown A., Wetzel E. Investigating the normativity of trait estimates from multidimensional forced-choice data. Multivariate Behav. Res. 2023;58(1):1–29. doi: 10.1080/00273171.2021.1938960.
- 6. Wang S., Luo F., Liu H. The conventional and the IRT-based scoring methods of forced-choice personality tests. Adv. Psychol. Sci. 2014;22(3):549–557. doi: 10.3724/SP.J.1042.2014.00549.
- 7. Brown A., Maydeu-Olivares A. How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychol. Methods. 2013;18(1):36–52. doi: 10.1037/a0030641.
- 8. Closs S.J. On the factoring and interpretation of ipsative data. J. Occup. Organ. Psychol. 1996;69(1):41–47. doi: 10.1111/j.2044-8325.1996.tb00598.x.
- 9. Bartram D. The relationship between ipsatized and normative measures of personality. J. Occup. Organ. Psychol. 1996;69(1):25–39. doi: 10.1111/j.2044-8325.1996.tb00597.x.
- 10. Clemans W.V. An analytical and empirical examination of some properties of ipsative measures. Psychometric Monographs. 1966;14.
- 11. Saville P., Willson E. The reliability and validity of normative and ipsative approaches in the measurement of personality. J. Occup. Psychol. 1991;64(3):219–238. doi: 10.1111/j.2044-8325.1991.tb00556.x.
- 12. Hontangas P.M., de la Torre J., Ponsoda V., Leenen I., Morillo D., Abad F.J. Comparing traditional and IRT scoring of forced-choice tests. Appl. Psychol. Meas. 2015;39(8):598–612. doi: 10.1177/0146621615585851.
- 13. Brown A. Item response models for forced-choice questionnaires: a common framework. Psychometrika. 2016;81(1):135–160. doi: 10.1007/s11336-014-9434-9.
- 14. Pavlov G., Shi D., Maydeu-Olivares A., Fairchild A. Item desirability matching in forced-choice test construction. Pers. Indiv. Differ. 2021;183. doi: 10.1016/j.paid.2021.111114.
- 15. Gwet K.L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. fourth ed. Advanced Analytics, LLC; Gaithersburg, MD: 2014.
- 16. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: 2021. https://www.R-project.org/
- 17. Li M., Sun T., Zhang B. autoFC: an R package for automatic item pairing in forced-choice test construction. Appl. Psychol. Meas. 2021; advance online publication.
- 18. Hughes A.W., Dunlop P.D., Holtrop D., Wee S. Spotting the "Ideal" personality response: effects of item matching in forced choice measures for personnel selection. J. Person. Psychol. 2021;20(1):17–26. doi: 10.1027/1866-5888/a000267.
- 19. Frick S. Modeling faking in the multidimensional forced-choice format: the faking mixture model. Psychometrika. 2022;87:773–794. doi: 10.1007/s11336-021-09818-6.
- 20. Drasgow F., Chernyshenko O.S., Stark S. 75 years after Likert: Thurstone was right. Industrial and Organizational Psychology. 2010;3(4):465–476. doi: 10.1111/j.1754-9434.2010.01273.x.
- 21. Roberts J.S., Donoghue J.R., Laughlin J.E. A general item response theory model for unfolding unidimensional polytomous responses. Appl. Psychol. Meas. 2000;24(1):3–32. doi: 10.1177/01466216000241001.
- 22. Hontangas P.M., Leenen I., de la Torre J., Ponsoda V., Morillo D., Abad F.J. Traditional scores versus IRT estimates on forced-choice tests based on a dominance model. Psicothema. 2016;28(1):76–82. doi: 10.7334/psicothema2015.204.
- 23. Morillo D., Leenen I., Abad F.J., Hontangas P., de la Torre J., Ponsoda V. A dominance variant under the multi-unidimensional pairwise-preference framework: model formulation and Markov chain Monte Carlo estimation. Appl. Psychol. Meas. 2016;40(7):500–516. doi: 10.1177/0146621616662226.
- 24. Chernyshenko O.S., Stark S., Chan K.Y., Drasgow F., Williams B. Fitting item response theory models to two personality inventories: issues and insights. Multivariate Behav. Res. 2001;36(4):523–562. doi: 10.1207/S15327906MBR3604_03.
- 25. Tay L., Ali U.S., Drasgow F., Williams B. Fitting IRT models to dichotomous and polytomous data: assessing the relative model–data fit of ideal point and dominance models. Appl. Psychol. Meas. 2011;35(4):280–295. doi: 10.1177/0146621610390674.
- 26. Huang J., Mead A.D. Effect of personality item writing on psychometric properties of ideal-point and Likert scales. Psychol. Assess. 2014;26(4):1162–1172. doi: 10.1037/a0037273.
- 27. Brown A., Maydeu-Olivares A. Issues that should not be overlooked in the dominance versus ideal point controversy. Industrial and Organizational Psychology. 2010;3(4):489–493. doi: 10.1111/j.1754-9434.2010.01277.x.
- 28. Oswald F.L., Schell K.L. Developing and scaling personality measures: Thurstone was right—but so far, Likert was not wrong. Industrial and Organizational Psychology. 2010;3(4):481–484. doi: 10.1111/j.1754-9434.2010.01275.x.
- 29. Thurstone L.L. A law of comparative judgment. Psychol. Rev. 1927;34(4):273–286. doi: 10.1037/h0070288.
- 30. Luce R.D. On the possible psychophysical laws. Psychol. Rev. 1959;66(2):81–95. doi: 10.1037/h0043178.
- 31. Luce R.D. The choice axiom after twenty years. J. Math. Psychol. 1977;15(3):215–233. doi: 10.1016/0022-2496(77)90032-3.
- 32. Bradley R.A., Terry M.E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika. 1952;39(3/4):324–345. doi: 10.2307/2334029.
- 33. Stark S., Chernyshenko O.S., Drasgow F. An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: the multi-unidimensional pairwise-preference model. Appl. Psychol. Meas. 2005;29(3):184–203. doi: 10.1177/0146621604273988.
- 34. Brown A., Maydeu-Olivares A. Item response modeling of forced-choice questionnaires. Educ. Psychol. Meas. 2011;71(3):460–502. doi: 10.1177/0013164410375112.
- 35. Lee P., Joo S.-H., Stark S., Chernyshenko O.S. GGUM-Rank statement and person parameter estimation with multidimensional forced choice triplets. Appl. Psychol. Meas. 2019;43(3):226–240. doi: 10.1177/0146621618768294.
- 36. Zheng C.J., Liu J., Li Y.L., Xu P.Y., Zhang B., Wei R., Zhang W.Q. A 2PLM-RANK multidimensional forced-choice model and its fast estimation algorithm. Behav. Res. Methods. 2024. doi: 10.3758/s13428-023-02315-x.
- 37. Bürkner P.-C., Schulte N., Holling H. On the statistical and practical limitations of Thurstonian IRT models. Educ. Psychol. Meas. 2019;79(5):827–854. doi: 10.1177/0013164419832063.
- 38. Li H., Xiao Y., Liu H. Influencing factors of Thurstonian IRT model in faking-resisting forced-choice questionnaire. J. Beijing Normal Univ. (Nat. Sci.). 2017;53(5):624–630. doi: 10.16360/j.cnki.jbnuns.2017.05.019.
- 39. Lian X., Bian Q., Zeng S., Che H. The fitting analysis of MAP occupational personality forced choice test based on Thurston IRT model. National Conference on Psychology; Beijing, China: 2014.
- 40. Schulte N., Holling H., Bürkner P.-C. Can high-dimensional questionnaires resolve the ipsativity issue of forced-choice response formats? Educ. Psychol. Meas. 2021;81(2):262–289. doi: 10.1177/0013164420934861.
- 41. Lee P., Joo S., Zhou S., Son M. Investigating the impact of negatively keyed statements on multidimensional forced-choice personality measures: a comparison of partially ipsative and IRT scoring methods. Pers. Indiv. Differ. 2022;191:1–15. doi: 10.1016/j.paid.2022.111555.
- 42. Joubert T., Inceoglu I., Bartram D., Dowdeswell K., Lin Y. A comparison of the psychometric properties of the forced choice and Likert scale versions of a personality instrument. Int. J. Sel. Assess. 2015;23(1):92–97. doi: 10.1111/ijsa.12098.
- 43. Joo S.-H., Lee P., Stark S. Development of information functions and indices for the GGUM-Rank multidimensional forced choice IRT model. J. Educ. Meas. 2018;55(3):357–372. doi: 10.1111/jedm.12183.
- 44. Joo S.-H., Lee P., Stark S. Adaptive testing with the GGUM-Rank multidimensional forced choice model: comparison of pair, triplet, and tetrad scoring. Behav. Res. Methods. 2020;52(2):761–772. doi: 10.3758/s13428-019-01274-6.
- 45. Joo S.-H., Lee P., Stark S. Modeling multidimensional forced choice measures with the Zinnes and Griggs pairwise preference item response theory model. Multivariate Behav. Res. 2021; advance online publication.
- 46. Stark S., Chernyshenko O.S., Drasgow F., White L.A. Adaptive testing with multidimensional pairwise preference items. Organ. Res. Methods. 2012;15(3):463–487. doi: 10.1177/1094428112444611.
- 47. Press W.H., Flannery B.P., Teukolsky S.A., Vetterling W.T. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press; New York: 1986.
- 48. Roberts J.S., Thompson V.M. Marginal maximum a posteriori item parameter estimation for the generalized graded unfolding model. Appl. Psychol. Meas. 2011;35(4):259–279. doi: 10.1177/0146621610392565.
- 49. Tendeiro J.N., Castro-Alvarez S. GGUM: an R package for fitting the generalized graded unfolding model. Appl. Psychol. Meas. 2018;43(2):172–173. doi: 10.1177/0146621618772290.
- 50. Chalmers R.P. mirt: a multidimensional item response theory package for the R environment. J. Stat. Software. 2012;48(6):1–29.
- 51. Tu N., Zhang B., Angrave L., Sun T. Bmggum: an R package for Bayesian estimation of the multidimensional generalized graded unfolding model with covariates. Appl. Psychol. Meas. 2021;45(7–8):553–555. doi: 10.1177/01466216211040488.
- 52. Muthén L., Muthén B. Mplus: The Comprehensive Modelling Program for Applied Researchers: User's Guide. vol. 5. Author; Los Angeles, CA: 2015.
- 53. Rosseel Y. lavaan: an R package for structural equation modeling. J. Stat. Software. 2012;48(2):1–36.
- 54. Brown A., Maydeu-Olivares A. Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behav. Res. Methods. 2012;44(4):1135–1147. doi: 10.3758/s13428-012-0217-x.
- 55. Bürkner P.-C. thurstonianIRT: Thurstonian IRT models in R. J. Open Source Softw. 2018;4(42):1662. doi: 10.21105/joss.01662.
- 56. Doornik J.A. An Object-Oriented Matrix Programming Language Ox 6. Timberlake Consultants Ltd; London, England: 2009.
- 57. Lunn D., Spiegelhalter D., Thomas A., Best N. The BUGS project: evolution, critique and future directions. Stat. Med. 2009;28(25):3049–3067. doi: 10.1002/sim.3680.
- 58. Spiegelhalter D., Thomas A., Best N. WinBUGS Version 1.4 [Computer Program]. MRC Biostatistics Unit, Institute of Public Health; Cambridge, UK: 2003.
- 59. Plummer M. JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. Paper presented at the 3rd International Workshop on Distributed Statistical Computing; Vienna, Austria: 2003.
- 60. Stan Development Team. RStan: the R interface to Stan. 2020. http://mc-stan.org/
- 61. Gelman A., Rubin D. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992;7(4):457–472. doi: 10.1214/ss/1177011136.
- 62. Kim J.-S., Bolt D. Estimating item response theory models using Markov chain Monte Carlo methods. Educ. Meas. 2007;26(4):38–51. doi: 10.1111/j.1745-3992.2007.00107.x.
- 63. Guenole N., Brown A., Cooper A. Forced-choice assessment of work-related maladaptive personality traits: preliminary evidence from an application of Thurstonian item response modeling. Assessment. 2016;25(4):513–526. doi: 10.1177/1073191116641181.
- 64. SHL. OPQ32r Technical Manual. SHL; 2018.
- 65. Brown A., Inceoglu I., Lin Y. Preventing rater biases in 360-degree feedback by forcing choice. Organ. Res. Methods. 2017;20(1):121–148. doi: 10.1177/1094428116668036.
- 66. Aon Hewitt. 2015 Trends in Global Employee Engagement Report. Aon Corp; Lincolnshire, IL: 2015.
- 67. Stark S., Chernyshenko O.S., Drasgow F., Nye C.D., White L.A., Heffner T., Farmer W.L. From ABLE to TAPAS: a new generation of personality tests to support military selection and classification decisions. Mil. Psychol. 2014;26(3):153–164. doi: 10.1037/mil0000044.
- 68. Lin Y., Brown A. Influence of context on item parameters in forced-choice personality assessments. Educ. Psychol. Meas. 2017;77(3):389–414. doi: 10.1177/0013164416646162.
- 69. Lee H., Smith W.Z. Fit indices for measurement invariance tests in the Thurstonian IRT model. Appl. Psychol. Meas. 2020;44(4):282–295. doi: 10.1177/0146621619893785.
- 70. Lee P., Joo S.-H., Stark S. Detecting DIF in multidimensional forced choice measures using the Thurstonian item response theory model. Organ. Res. Methods. 2020;24(4):739–771. doi: 10.1177/1094428120959822.
- 71. Qiu X.-L., Wang W.-C. Assessment of differential statement functioning in ipsative tests with multidimensional forced-choice items. Appl. Psychol. Meas. 2021;45(2):79–94. doi: 10.1177/01466216209657.
- 72. Houston J., Borman W., Farmer W., Bearden R. Development of the Navy Computer Adaptive Personality Scales (NCAPS) (NPRST-TR-06-2). Navy Personnel Research, Studies, and Technology; Millington, TN: 2006.
- 73. Forero C.G., Maydeu-Olivares A. Estimation of IRT graded response models: limited versus full information methods. Psychol. Methods. 2009;14(3):275–299. doi: 10.1037/a0015825.
- 74. Chen C.-W., Wang W.-C., Chiu M.M., Ro S. Item selection and exposure control methods for computerized adaptive testing with multidimensional ranking items. J. Educ. Meas. 2020;57(2):343–369. doi: 10.1111/jedm.12252.
- 75. Lin Y., Brown A., Williams P. Multidimensional forced-choice CAT with dominance items: an empirical comparison with optimal static testing under different desirability matching. Educ. Psychol. Meas. 2023;83(2):322–350. doi: 10.1177/00131644221077637.
- 76. Tsutsui S. Node histogram vs. edge histogram: a comparison of probabilistic model-building genetic algorithms in permutation domains. IEEE International Conference on Evolutionary Computation. 2006:1939–1946. doi: 10.1109/CEC.2006.1688544.
- 77. Kreitchmann R.S., Abad F.J., Sorrel M.A. A genetic algorithm for optimal assembly of pairwise forced-choice questionnaires. Behav. Res. Methods. 2022;54:1476–1492. doi: 10.3758/s13428-021-01677-4.
- 78. Mulder J., van der Linden W.J. Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika. 2009;74:273–296. doi: 10.1007/s11336-008-9097-5.
- 79. Veldkamp B.P., van der Linden W.J. Multidimensional adaptive testing with constraints on test content. Psychometrika. 2002;67:575–588. doi: 10.1007/BF02295132.
- 80. Chang H.H., Ying Z. A global information approach to computerized adaptive testing. Appl. Psychol. Meas. 1996;20:213–229. doi: 10.1177/014662169602000303.
- 81. Wang Q., Zheng Y., Liu K., Cai Y., Peng S., Tu D. Item selection methods in multidimensional computerized adaptive testing for forced-choice items using Thurstonian IRT model. Behav. Res. Methods. 2023.
- 82. Wang C., Chang H.H. Item selection in multidimensional computerized adaptive testing: gaining information from different angles. Psychometrika. 2011;76:363–384. doi: 10.1007/s11336-011-9215-7.
- 83. Seybert J., Becker D. Examination of the test-retest reliability of a forced-choice personality measure. ETS Research Report Series. 2019;2019(1):1–17. doi: 10.1002/ets2.12273.
- 84. Oswald F.L., Shaw A., Farmer W.L. Comparing simple scoring with IRT scoring of personality measures: the Navy Computer Adaptive Personality Scales. Appl. Psychol. Meas. 2015;39(2):144–154. doi: 10.1177/0146621614559517.
- 85. Wang W.-C., Qiu X.-L., Chen C.-W., Ro S., Jin K.-Y. Item response theory models for ipsative tests with multidimensional pairwise comparison items. Appl. Psychol. Meas. 2017;41(8):600–613. doi: 10.1177/0146621617703183.
- 86. Walton K.E., Cherkasova L., Roberts R.D. On the validity of forced choice scores derived from the Thurstonian item response theory model. Assessment. 2020;27(4):706–718. doi: 10.1177/1073191119843585.
- 87. Watrin L., Geiger M., Spengler M., Wilhelm O. Forced-choice versus Likert responses on an occupational Big Five questionnaire. J. Indiv. Differ. 2019;40(3):134–148. doi: 10.1027/1614-0001/a000285.
- 88. Zhang B., Sun T., Drasgow F., Chernyshenko O.S., Nye C.D., Stark S., White L.A. Though forced, still valid: psychometric equivalence of forced-choice and single-statement measures. Organ. Res. Methods. 2020;23(3):569–590. doi: 10.1177/1094428119836486.
- 89. Wetzel E., Frick S., Brown A. Does multidimensional forced-choice prevent faking? Comparing the susceptibility of the multidimensional forced-choice format and the rating scale format to faking. Psychol. Assess. 2020;33(2):156–170. doi: 10.1037/pas0000971.
- 90. Dueber D.M., Love A.M.A., Toland M.D., Turner T.A. Comparison of single-response format and forced-choice format instruments using Thurstonian item response theory. Educ. Psychol. Meas. 2019;79(1):108–128. doi: 10.1177/0013164417752782.
- 91. Hung S.-P., Huang H.-Y. Forced-choice ranking models for raters' ranking data. J. Educ. Behav. Stat. 2022;47(5):603–634. doi: 10.3102/10769986221104207.
- 92. Lin Y. Reliability estimates for IRT-based forced-choice assessment scores. Organ. Res. Methods. 2022;25(3):575–590. doi: 10.1177/1094428121999086.
- 93. Gnambs T. Facets of measurement error for scores of the big five: three reliability generalizations. Pers. Indiv. Differ. 2015;84:84–89. doi: 10.1016/j.paid.2014.08.019.
- 94. Brown A., Maydeu-Olivares A. Ordinal factor analysis of graded-preference questionnaire data. Struct. Equ. Model.: A Multidiscip. J. 2018;25(4):516–529. doi: 10.1080/10705511.2017.1392247.
- 95. Qiu X.-L., de la Torre J. A dual process item response theory model for polytomous multidimensional forced-choice items. Br. J. Math. Stat. Psychol. 2023. doi: 10.1111/bmsp.12303.
- 96. Bunji K., Okada K. Joint modeling of the two-alternative multidimensional forced-choice personality measurement and its response time by a Thurstonian D-diffusion item response model. Behav. Res. Methods. 2020;52(3):1091–1107. doi: 10.3758/s13428-019-01302-5.
- 97. Bunji K., Okada K. Linear ballistic accumulator item response theory model for multidimensional multiple-alternative forced-choice measurement of personality. Multivariate Behav. Res. 2021;57(4):658–678. doi: 10.1080/00273171.2021.1896351.
- 98. Guo Z., Wang D., Cai Y., Tu D. An item response theory model for incorporating response times in forced-choice measures. Educ. Psychol. Meas. 2023.
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
