A forced-choice questionnaire (FCQ) is a set of blocks of two or more statements. Statements within a block may measure the same trait (unidimensional) or different traits (multidimensional), and an FCQ can mix unidimensional and multidimensional blocks (Stark et al., 2005) or comprise only one type. Test takers are asked to rank the statements within a block based on how much they agree with the statements or how closely the statements describe or match their attitudes, beliefs, or behaviors (Brown & Maydeu-Olivares, 2011). The forced-choice item type has been proposed as an alternative to the traditional Likert scale item type for noncognitive questionnaires (e.g., to measure personality, preferences, and attitudes) because it addresses response distortions associated with Likert-type items, for example, socially desirable responding (e.g., Birkeland et al., 2006). Traditional scoring of FCQs takes the sum of observed ranking scores on each trait or dimension (Hontangas et al., 2015), which leads to ipsative scores for multidimensional FCQs. Ipsative scores have a constant total score across all subscores of measured traits for every test taker and thus are problematic for interindividual comparisons. Ipsative scores also cause problems with the estimation of reliability and average scale intercorrelations, factor analysis, and the interpretation of scores (Hontangas et al., 2015). There are traditional methods to address the issues of ipsative scores that lead to quasi-ipsative scores (Salgado et al., 2015). Alternatively, some researchers have proposed item response theory (IRT) models to capture the data-generating mechanism.
Brown (2016) provided an integrated review of IRT models for FCQs. A popular model reviewed was Stark et al.’s (2005) multi-unidimensional pairwise preference (MUPP) model for FCQs with ideal point statements and two-statement blocks. Stark et al. (2005) used the dichotomous version of the generalized graded unfolding model (GGUM; Roberts et al., 2000) as the response function for each ideal point statement. For an ideal point statement, the relationship between the extent of agreement and the latent trait scores is a bell curve, where optimal agreement is located somewhere between extremes on the latent trait score continuum, making it possible to disagree with the statement from above or below. For example, the statement, “I like work-life balance,” might be considered an ideal point statement because it is possible to disagree with the statement from above (because I like to work a lot) or below (because I do not like to work very much).
To handle blocks with more than two statements, de la Torre et al. (2012) proposed three extensions of the MUPP-GGUM model based on response formats, referred to as PICK, MOLE, and RANK. For the PICK format, test takers are asked to select one statement from among all statements in a block; for the MOLE format, test takers are asked to select the statement they agree with most and the statement they agree with least; for the RANK format, test takers are asked to rank all statements.
Hontangas et al. (2015) compared the normative scores based on the PICK, MOLE, and RANK GGUM models with the traditional raw scores of multidimensional FCQs to check the normative property of ipsative scores. However, in all the studies mentioned above for the IRT models with GGUM for FCQs, the item parameters in the GGUM for each statement are assumed to be known, for example, from fitting a GGUM to response data from a Likert scale questionnaire of the same statements administered previously. Latent trait scores were estimated based on the known (i.e., assumed) item parameters. The use of assumed fixed item parameters from previous Likert scale questionnaires raises serious questions about the validity of these models (Morillo et al., 2019; Fu et al., 2024). Later, Lee et al. (2019) developed a Bayesian estimation of both item and person parameters of the Rank-GGUM models directly from FCQ data using the Markov chain Monte Carlo (MCMC) method (Gilks et al., 1996).
Morillo et al. (2016) coupled the MUPP model with the two-parameter logistic IRT (2PL) response function (Birnbaum, 1968) for blocks with two dominance statements. Unlike an ideal point statement, a dominance statement (e.g., “I enjoy being by myself”) has a monotonic relationship between the degree of agreement and the underlying trait, as modeled in the 2PL response function: the higher one is on the trait, the more likely one is to agree with the statement. Morillo et al. (2016) provided an MCMC estimation of item parameters and trait scores for the MUPP-2PL model. The MUPP-2PL model is a special case of the commonly known compensatory two-parameter logistic multidimensional IRT model (von Davier, 2008). It also has an equivalent item response function (after transformation) to the Thurstonian IRT (TIRT) model (Brown & Maydeu-Olivares, 2011) for blocks with two statements. In addition, Wang et al. (2017) applied the MUPP with the Rasch model (i.e., the multidimensional Rasch model) to FCQ blocks with two dominance statements and emphasized the unique properties associated with the Rasch model.
There have been more extensions and developments in the IRT models for FCQs in recent years; see Zhang et al. (2023, Table 1) for a list. The IRT models for FCQs are complicated, and researchers have primarily used the MCMC method to estimate the models. Maximum likelihood estimation is more mathematically involved and has been developed only for the MUPP-2PL (under the traditional multidimensional IRT models). The current study extends the MUPP-2PL model to accommodate the RANK response format for blocks with more than two dominance statements. We refer to this series of models for different block sizes as Rank-2PL models, and the MUPP-2PL model is just one of them. Focusing on blocks with three statements (triplets), we develop a maximum marginal likelihood estimation with an expectation-maximization algorithm (MML-EM; Bock & Aitkin, 1981; Fu, 2019) to estimate item parameters and their standard errors. We conduct a simulation study to check parameter recovery and demonstrate the use of the model on real FCQ data. Finally, we summarize and discuss the findings and suggest future research.
Table 1.
Possible Ranking Patterns for a Block with Three Statements, A, B, and C
| Coded score | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Ranking pattern | ABC | ACB | BAC | BCA | CAB | CBA |
Model Formulation
The Rank model assumes that ranking statements in a block is an iterative process (Luce, 2005/1959), from initially selecting the most agreeable statement among all the statements down to selecting the more agreeable statement from the last remaining two. This process is modeled by a probability function. Below, we derive the item response probability function for a triplet item (i.e., a block with $n = 3$ statements) as an example. Denote $P(s_1 \succ s_2 \succ s_3 \mid \boldsymbol{\theta})$ as the probability of a ranking order conditional on $\boldsymbol{\theta}$, the column vector of the latent trait score(s) that a test measures. Each statement in an item could measure a common or different latent trait: the former is referred to as a unidimensional forced-choice item, and the latter as a multidimensional forced-choice item. The subscripts 1, 2, 3 refer to the statements with the ranking scores 1, 2, and 3, respectively (Ranking 1 is the most agreeable statement). Let $P_{\{1|1,2,3\}}$ be the conditional probability of selecting the Ranking 1 statement from the three statements, and $P_{\{2|2,3\}}$ be the conditional probability of selecting the Ranking 2 statement from the remaining two statements, both given that only one statement is selected from the available three or two statements. Then, based on the Rank model’s assumption,
$$P(s_1 \succ s_2 \succ s_3 \mid \boldsymbol{\theta}) = P_{\{1|1,2,3\}}(\boldsymbol{\theta}) \times P_{\{2|2,3\}}(\boldsymbol{\theta}) \tag{1}$$
Denote the selection of a statement as 1, and 0 otherwise. Let $P_{100}$ be the joint probability of selecting the Ranking 1 statement and not selecting the other two statements. Likewise, define $P_{010}$ and $P_{001}$ as the joint probabilities of selecting only the Ranking 2 and Ranking 3 statements, respectively. Then, the conditional probability $P_{\{1|1,2,3\}}$ is the ratio of the joint probability of selecting the Ranking 1 statement over the sum of the joint probabilities of selecting the Rankings 1, 2, and 3 statements, respectively,
$$P_{\{1|1,2,3\}}(\boldsymbol{\theta}) = \frac{P_{100}}{P_{100} + P_{010} + P_{001}} \tag{2}$$
Similarly, with $P_{10}$ and $P_{01}$ denoting the joint probabilities of selecting only the Ranking 2 or only the Ranking 3 statement from the remaining two,
$$P_{\{2|2,3\}}(\boldsymbol{\theta}) = \frac{P_{10}}{P_{10} + P_{01}} \tag{3}$$
Assuming a test taker makes the selecting (1)/not selecting (0) decision on a statement independently of the other statements,
$$P_{100} = P_1(\theta_1)[1 - P_2(\theta_2)][1 - P_3(\theta_3)] \tag{4}$$

$$P_{010} = [1 - P_1(\theta_1)]P_2(\theta_2)[1 - P_3(\theta_3)] \tag{5}$$

$$P_{001} = [1 - P_1(\theta_1)][1 - P_2(\theta_2)]P_3(\theta_3) \tag{6}$$

$$P_{10} = P_2(\theta_2)[1 - P_3(\theta_3)] \tag{7}$$

$$P_{01} = [1 - P_2(\theta_2)]P_3(\theta_3) \tag{8}$$
where $\theta_1$, $\theta_2$, and $\theta_3$ refer to the latent trait scores measured by the Rankings 1–3 statements, respectively.
For each statement, the probability of selection follows the 2PL model
$$P_t(\theta_t) = \frac{\exp(a_t\theta_t + d_t)}{1 + \exp(a_t\theta_t + d_t)} \tag{9}$$

$$1 - P_t(\theta_t) = \frac{1}{1 + \exp(a_t\theta_t + d_t)} \tag{10}$$
where the subscript $t$ refers to the statement with Ranking $t$, $a_t$ is the discrimination parameter of the statement with Ranking $t$, and $d_t$ is the intercept.
Inserting Equations 2–10 into Equation 1 and simplifying algebraically yields
$$P(s_1 \succ s_2 \succ s_3 \mid \boldsymbol{\theta}) = \frac{\exp(a_1\theta_1 + d_1)}{\sum_{t=1}^{3}\exp(a_t\theta_t + d_t)} \times \frac{\exp(a_2\theta_2 + d_2)}{\exp(a_2\theta_2 + d_2) + \exp(a_3\theta_3 + d_3)} \tag{11}$$
Note that of the three intercept parameters, $d_1$, $d_2$, and $d_3$, only two can be identified in Equation 11. There are different ways to restrict the $d$s: for example, (a) fix one of the $d$s (e.g., $d_1 = 0$); (b) set the sum of the three $d$s to a constant (e.g., 0); or (c) estimate the differences between the $d$s rather than the individual $d$s. All of these constraints lead to the same estimates of the differences $d_1 - d_2$, $d_1 - d_3$, and $d_2 - d_3$ and do not change the estimate of the response probability in Equation 11. If all three statements measure a common trait, then only two of the three $a$s can be identified, and constraints similar to those for the $d$s may be set to identify the model.
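To make Equation 11 concrete, the following R sketch (ours, not from the original estimation code; names are illustrative) computes the six ranking-pattern probabilities of a triplet item for a matrix of latent trait scores:

```r
# A minimal sketch of Equation 11 (illustrative names; not the paper's code).
# par = c(a1, a2, a3, d1, d2, d3); Theta is an N x 3 matrix whose columns hold
# the scores of the three traits measured by statements A, B, and C.
rank2pl_prob <- function(par, Theta) {
  a <- par[1:3]; d <- par[4:6]
  ez <- exp(sweep(sweep(Theta, 2, a, `*`), 2, d, `+`))  # exp(a_t*theta_t + d_t)
  # The six ranking patterns coded 1-6 as in Table 1: ABC, ACB, BAC, BCA, CAB, CBA
  perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                 c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
  sapply(1:6, function(k) {
    p <- perms[k, ]
    ez[, p[1]] / rowSums(ez) * ez[, p[2]] / (ez[, p[2]] + ez[, p[3]])
  })
}

# Example: the six probabilities for five test takers; each row sums to 1
# rank2pl_prob(c(1.2, .8, 1.5, 0, -.3, .4), matrix(rnorm(15), 5, 3))
```

Later sketches in this paper reuse rank2pl_prob() as the Equation 11 response function.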
Similarly, for blocks with two statements ($n = 2$),
$$P(s_1 \succ s_2 \mid \boldsymbol{\theta}) = \frac{\exp(a_1\theta_1 + d_1)}{\exp(a_1\theta_1 + d_1) + \exp(a_2\theta_2 + d_2)} = \frac{1}{1 + \exp[-(a_1\theta_1 - a_2\theta_2 + d_1 - d_2)]} \tag{12}$$
This is the MUPP-2PL model studied in Morillo et al. (2016). In Equation 12, only one of the two $d$s can be identified; if both statements measure one common trait, then only one of the two $a$s can be identified. Constraints similar to those on the $d$s or $a$s for Equation 11 can be imposed to identify Equation 12. Note that Equation 12 is just a special case of the compensatory two-parameter logistic multidimensional IRT model and is also equivalent to the normal ogive response function of the TIRT model for two-statement blocks (Brown & Maydeu-Olivares, 2011; Morillo et al., 2016).
For blocks with four statements ($n = 4$),
$$P(s_1 \succ s_2 \succ s_3 \succ s_4 \mid \boldsymbol{\theta}) = \frac{\exp(a_1\theta_1 + d_1)}{\sum_{t=1}^{4}\exp(a_t\theta_t + d_t)} \times \frac{\exp(a_2\theta_2 + d_2)}{\sum_{t=2}^{4}\exp(a_t\theta_t + d_t)} \times \frac{\exp(a_3\theta_3 + d_3)}{\exp(a_3\theta_3 + d_3) + \exp(a_4\theta_4 + d_4)} \tag{13}$$
In Equation 13, only three of the four $d$s can be identified, and if the four statements measure one common trait, only three of the four $a$s can be identified. Thus, constraints similar to those on the $d$s or $a$s for Equation 11 are needed to identify Equation 13.
For blocks with $n$ statements, there are $n!$ possible ranking patterns. For example, for three statements, A, B, and C, in a block, there are six possible ranking patterns, as shown in Table 1, and they are coded as 1 to 6 in the observed data.
Relationships with Thurstonian IRT (TIRT) Model
The TIRT model is based on Thurstone’s (1927) law of comparative judgment for a pair of statements. The notion of statement utility connects the observed response (e.g., agree/disagree with a statement) with the latent trait that the statement measures. The utility $u_t$ for the statement with ranking score $t$ is defined by a factor model
$$u_t = \mu_t + \lambda_t\theta_t + \varepsilon_t \tag{14}$$
where $\mu_t$ is the intercept, $\lambda_t$ is the factor loading, $\theta_t$ is the latent trait score that the statement measures, and $\varepsilon_t$ is the independent and identically distributed error, which follows a normal distribution with mean 0 and variance $\psi_t^2$. The probability of selecting the Ranking 1 statement among a pair of statements is determined by the difference between the two statement utilities:
$$y_{12}^* = u_1 - u_2 = (\mu_1 - \mu_2) + \lambda_1\theta_1 - \lambda_2\theta_2 + \varepsilon_{12} \tag{15}$$
where $\varepsilon_{12} = \varepsilon_1 - \varepsilon_2 \sim N(0, \psi_1^2 + \psi_2^2)$. The probability of selecting the Ranking 1 statement is the probability of $y_{12}^* \ge 0$ and is modeled by
$$P(y_{12}^* \ge 0 \mid \boldsymbol{\theta}) = \Phi\left(\frac{\mu_1 - \mu_2 + \lambda_1\theta_1 - \lambda_2\theta_2}{\sqrt{\psi_1^2 + \psi_2^2}}\right) \approx \frac{1}{1 + \exp\left(-1.7\,\frac{\mu_1 - \mu_2 + \lambda_1\theta_1 - \lambda_2\theta_2}{\sqrt{\psi_1^2 + \psi_2^2}}\right)} \tag{16}$$
where $\Phi$ is the standard normal cumulative distribution function. The first item response function in Equation 16 is a special case of the compensatory two-parameter normal ogive multidimensional IRT model (McDonald, 1997), and the second is a special case of the compensatory two-parameter logistic multidimensional IRT model. For pairs, the model can be identified by setting the error variance of $y_{12}^*$ to 1 (i.e., $\psi_1^2 + \psi_2^2 = 1$). If an MFC form measures only two traits, then, additionally, the factor loadings $\lambda_1$ and $\lambda_2$ in one pair can be fixed to identify the model (Brown & Maydeu-Olivares, 2011). Equations 16 and 12 (the response function for the MUPP-2PL model) are equivalent by setting
$$a_1 = \frac{1.7\lambda_1}{\sqrt{\psi_1^2 + \psi_2^2}} \tag{17}$$

$$a_2 = \frac{1.7\lambda_2}{\sqrt{\psi_1^2 + \psi_2^2}} \tag{18}$$

$$d_1 - d_2 = \frac{1.7(\mu_1 - \mu_2)}{\sqrt{\psi_1^2 + \psi_2^2}} \tag{19}$$
However, the TIRT model is estimated by a limited information method, unweighted least squares with mean- and variance-corrected Satorra-Bentler goodness-of-fit tests (ULSMV), under structural equation modeling (SEM). In contrast, the MUPP-2PL model is estimated by a full information method, maximum marginal likelihood estimation. Therefore, their parameter estimates may differ due to the different estimation methods (Forero & Maydeu-Olivares, 2009).
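As an illustration of Equations 17–19 as reconstructed above, the following small R helper (ours; the example values are assumed) converts TIRT pair parameters into MUPP-2PL parameters:

```r
# A sketch of Equations 17-19 (illustrative values; not a definitive mapping).
tirt_to_mupp2pl <- function(mu1, mu2, lam1, lam2, psi2_1, psi2_2, D = 1.7) {
  s <- sqrt(psi2_1 + psi2_2)        # SD of the utility difference y*
  list(a1 = D * lam1 / s,           # Equation 17
       a2 = D * lam2 / s,           # Equation 18
       d  = D * (mu1 - mu2) / s)    # Equation 19: d = d1 - d2 in Equation 12
}
tirt_to_mupp2pl(mu1 = .2, mu2 = -.1, lam1 = 1.1, lam2 = .8, psi2_1 = 1, psi2_2 = 1)
```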
For a block with more than two statements (i.e., $n > 2$), the TIRT model recodes a ranking pattern into $\binom{n}{2}$ ranking patterns of pairs. For a triplet, the response pattern of the three statements with Ranking scores 1–3 can be separated into three statement pairs: pair 1 (1, 2), pair 2 (1, 3), and pair 3 (2, 3). Let $A$ be the 2 × 3 matrix of contrasts,
$$A = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}.$$
In the TIRT model, the probability of the ranking pattern (1, 2, 3) is given by (Maydeu-Olivares, 1999)
$$P(s_1 \succ s_2 \succ s_3 \mid \boldsymbol{\theta}) = \int_0^{\infty}\int_0^{\infty} \phi_2(\mathbf{y}^*; \boldsymbol{\mu}_{y^*}, \boldsymbol{\Sigma}_{y^*})\, d\mathbf{y}^* \tag{20}$$
where $\mathbf{y}^* = A\mathbf{u}$ collects the utility differences $y_{12}^*$ and $y_{23}^*$, $\boldsymbol{\mu}_{y^*}$ and $\boldsymbol{\Sigma}_{y^*}$ are the mean vector and covariance matrix of $\mathbf{y}^*$, which follows a bivariate normal distribution conditional on $\boldsymbol{\theta}$, and $\phi_2$ is the bivariate normal density function. This model is set up as the mean and covariance structures under SEM and estimated by a limited information method, ULSMV. The response functions (marginal probabilities of ranking patterns) for the three pairs are
$$P(y_{12}^* \ge 0 \mid \boldsymbol{\theta}) = \Phi\left(\frac{\mu_1 - \mu_2 + \lambda_1\theta_1 - \lambda_2\theta_2}{\sqrt{\psi_1^2 + \psi_2^2}}\right) \tag{21}$$

$$P(y_{13}^* \ge 0 \mid \boldsymbol{\theta}) = \Phi\left(\frac{\mu_1 - \mu_3 + \lambda_1\theta_1 - \lambda_3\theta_3}{\sqrt{\psi_1^2 + \psi_3^2}}\right) \tag{22}$$

$$P(y_{23}^* \ge 0 \mid \boldsymbol{\theta}) = \Phi\left(\frac{\mu_2 - \mu_3 + \lambda_2\theta_2 - \lambda_3\theta_3}{\sqrt{\psi_2^2 + \psi_3^2}}\right) \tag{23}$$
Equations 21–23 can be identified by fixing the error variance of one statement, for example, setting the error variance of the first statement to 1 ($\psi_1^2 = 1$). The TIRT model accounts for the local dependence by estimating the covariances among the three pairs’ $y^*$s (Equation 15): $\mathrm{Cov}(y_{12}^*, y_{13}^*) = \psi_1^2$, $\mathrm{Cov}(y_{12}^*, y_{23}^*) = -\psi_2^2$, and $\mathrm{Cov}(y_{13}^*, y_{23}^*) = \psi_3^2$. However, the local dependence is considered only in item parameter estimation; for estimating latent trait scores, it is ignored. Maydeu-Olivares and Brown (2010) argued that this simplification of scoring has little impact on the accuracy of latent score estimates.
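The rectangle probability in Equation 20 can be evaluated numerically, for example with the mvtnorm package in R. The sketch below (ours; all parameter values are assumed) computes the probability of the ranking pattern (1, 2, 3) for one test taker:

```r
library(mvtnorm)  # for pmvnorm()

mu    <- c(.2, -.1, .4)    # statement intercepts (assumed)
lam   <- c(1.1, .8, 1.3)   # factor loadings (assumed)
psi2  <- c(1, .9, 1.2)     # statement error variances (assumed)
theta <- c(.5, -.3, .1)    # trait scores of the three measured traits (assumed)

A <- rbind(c(1, -1, 0),    # contrast u1 - u2
           c(0,  1, -1))   # contrast u2 - u3

m <- A %*% (mu + lam * theta)    # mean vector of y* given theta
S <- A %*% diag(psi2) %*% t(A)   # covariance matrix of y* given theta

# P(ranking pattern (1, 2, 3) | theta) = P(y*_12 >= 0, y*_23 >= 0)
pmvnorm(lower = c(0, 0), upper = c(Inf, Inf), mean = as.vector(m), sigma = S)
```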
According to Equation 14, the TIRT model assumes the utilities of statements are locally independent conditional on latent scores. This assumption is the same as the one adopted by the Rank models: a test taker makes an independent selecting/not selecting decision on each statement. The item response function of the Rank-2PL model for triplets is given in Equation 11. Based on Equation 11, we can derive the item response function of each pair (like Equations 21–23 for the TIRT model), which has the same form as Equation 12. However, as in the TIRT model, these ranking patterns of pairs are not locally independent, because the probability of a response pattern of a triplet is not equal to the product of the three probabilities of the individual pairs. Comparing Equation 20 to Equation 1, we can see that the TIRT model builds a full joint distribution of a ranking pattern (1, 2, 3), while the Rank-2PL models assume sequential selections to simplify this joint distribution into two independent distributions. Thus, the Rank-2PL models are simpler than the TIRT model: the TIRT model has two more parameters (i.e., the error variances of two statements) than the Rank-2PL model for a triplet. Unlike the TIRT model, which assumes local independence among the three pairs in estimating latent trait scores, the Rank-2PL model estimates trait scores directly from the responses to triplets based on Equation 11. These differences between the TIRT and Rank-2PL models for triplets also apply to blocks larger than three statements.
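A quick numeric check (with assumed values) illustrates the local dependence point above: under the Rank-2PL model, the triplet pattern probability from Equation 11 differs from the product of the three pairwise probabilities from Equation 12.

```r
a <- c(1.2, .8, 1.5); d <- c(0, -.3, .4); theta <- c(.5, -.2, .1)  # assumed
ez <- exp(a * theta + d)
p_triplet <- ez[1] / sum(ez) * ez[2] / (ez[2] + ez[3])  # Equation 11
p12 <- ez[1] / (ez[1] + ez[2])                          # Equation 12 per pair
p13 <- ez[1] / (ez[1] + ez[3])
p23 <- ez[2] / (ez[2] + ez[3])
c(p_triplet, p12 * p13 * p23)  # the two values differ
```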
MML-EM Estimation of Item Parameters and Intertrait Correlations
This paper focuses on the blocks with three statements (triplets) for model estimation. For the blocks with two statements, the MML-EM estimation of the multidimensional 2PL model has been discussed previously (e.g., Cai, 2010; Fu, 2019) and can be estimated by many computer programs for multidimensional IRT models, for example, the R package mirt (Chalmers, 2012). For blocks with more than three statements, the estimation and associated program code are similar to those provided for triplets.
Suppose that in an FCQ with triplets, there are $N$ test takers in total responding to $J$ items (i.e., blocks). Let $y_{ij}$ denote test taker $i$’s coded ranking score on item $j$ ($y_{ij} \in \{1, \ldots, 6\}$; see Table 1), $\mathbf{y}_i = (y_{i1}, \ldots, y_{iJ})'$ be test taker $i$’s coded response vector across the $J$ items, and $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)'$ be the response matrix representing the total observed data. $\boldsymbol{\Theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N)'$ represents a random sample of latent trait score vectors from the test taker population with a standard multivariate normal distribution with a correlation matrix $\boldsymbol{\Sigma}$, where $D$ represents the number of traits measured by the test. Fixing the population means and standard deviations of $\boldsymbol{\theta}$ is a way to identify the model. $\boldsymbol{\xi} = (\boldsymbol{\xi}_1', \ldots, \boldsymbol{\xi}_J')'$ represents the vector of item parameters in the whole FCQ, and $\boldsymbol{\xi}_j$ is item $j$’s parameter vector. $\boldsymbol{\Lambda}$ is a column vector including all estimated item parameters and intertrait correlations. Assume the responses in $\mathbf{y}_i$ are locally independent conditional on $\boldsymbol{\theta}_i$ and $\boldsymbol{\xi}$; then the observed data likelihood is a marginal likelihood
$$L(\boldsymbol{\Lambda}; \mathbf{Y}) = \prod_{i=1}^{N}\int_{R^D}\prod_{j=1}^{J} P_j(y_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\xi}_j)\,\phi(\boldsymbol{\theta}; \boldsymbol{\Sigma})\, d\boldsymbol{\theta} \tag{24}$$
where $R^D$ is the $D$-dimensional area of integration, $P_j$ is the item response function defined in Equation 11, and $\phi$ is the standard multivariate normal density function with correlation matrix $\boldsymbol{\Sigma}$. Because directly maximizing the observed likelihood is difficult due to the integral in Equation 24, a common solution is the EM algorithm (Bock & Aitkin, 1981; Dempster et al., 1977), which works on the complete data loglikelihood instead,
$$\log L(\boldsymbol{\Lambda}; \mathbf{Y}, \boldsymbol{\Theta}) = \sum_{i=1}^{N}\left[\sum_{j=1}^{J}\log P_j(y_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\xi}_j) + \log\phi(\boldsymbol{\theta}_i; \boldsymbol{\Sigma})\right] \tag{25}$$
where $\boldsymbol{\Theta}$ is the $N \times D$ latent trait score matrix for all test takers, $\boldsymbol{\theta}_i$ is test taker $i$’s latent trait score vector of $D$ traits, and $(\mathbf{Y}, \boldsymbol{\Theta})$ is the complete data with observed data $\mathbf{Y}$ and missing data $\boldsymbol{\Theta}$. The EM algorithm involves an iterative process that repeatedly executes an E step and an M step. In the E step, the expectation of the complete data loglikelihood with respect to the posterior distribution of the missing data is estimated and serves as the EM map. In the M step, the EM map is maximized with respect to the item parameters and the distributional parameters of the traits. For conciseness and readability, we omit many technical details on the two steps and on the standard errors of parameter estimates in the subsequent sections; we provide these details in the supplemental online materials for interested readers.
E Step
At EM cycle $c$, the expected complete data loglikelihood is computed with respect to the posterior distribution of traits conditional on the current parameter estimates $\boldsymbol{\xi}^{(c)}$ and $\boldsymbol{\Sigma}^{(c)}$, that is,
$$Q(\boldsymbol{\Lambda} \mid \boldsymbol{\Lambda}^{(c)}) = \sum_{i=1}^{N}\int_{R^D}\left[\sum_{j=1}^{J}\log P_j(y_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\xi}_j) + \log\phi(\boldsymbol{\theta}; \boldsymbol{\Sigma})\right] p(\boldsymbol{\theta} \mid \mathbf{y}_i, \boldsymbol{\Lambda}^{(c)})\, d\boldsymbol{\theta} \tag{26}$$

$$p(\boldsymbol{\theta} \mid \mathbf{y}_i, \boldsymbol{\Lambda}^{(c)}) = \frac{\prod_{j=1}^{J} P_j(y_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\xi}_j^{(c)})\,\phi(\boldsymbol{\theta}; \boldsymbol{\Sigma}^{(c)})}{\int_{R^D}\prod_{j=1}^{J} P_j(y_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\xi}_j^{(c)})\,\phi(\boldsymbol{\theta}; \boldsymbol{\Sigma}^{(c)})\, d\boldsymbol{\theta}} \tag{27}$$
where $p(\boldsymbol{\theta} \mid \mathbf{y}_i, \boldsymbol{\Lambda}^{(c)})$ is the posterior distribution function of the trait vector conditional on $\mathbf{y}_i$, $\boldsymbol{\xi}^{(c)}$, and $\boldsymbol{\Sigma}^{(c)}$. There are several ways to approximate the posterior distribution of $\boldsymbol{\theta}$ to evaluate the $D$-fold integral in Equation 26.
The first method uses a grid of $Q^D$ quadrature points to approximate the prior standard multivariate normal distribution of $\boldsymbol{\theta}$ with the current estimated $\boldsymbol{\Sigma}^{(c)}$, where $Q$ is the number of points at each dimension (trait). The quadrature points could be the Gauss-Hermite quadrature points (Davis & Rabinowitz, 1984), their adaptive variants (Schilling & Bock, 2005), or equidistant points within a range (e.g., −6 to 6). The second method is to draw samples from the prior distribution of $\boldsymbol{\theta}$ to approximate the posterior distribution at each EM cycle. The samples can be drawn randomly from the standard multivariate normal distribution of $\boldsymbol{\theta}$ with the estimated $\boldsymbol{\Sigma}^{(c)}$; this approach is called the Monte Carlo EM estimation method (MCEM) in mirt. Alternatively, the samples can be drawn by the quasi-Monte Carlo integration method (Morokoff & Caflisch, 1995); this approach is called the quasi-Monte Carlo EM estimation method (QMCEM) in mirt. The third method is to draw $M$ random samples from the posterior distribution (Equation 27) such that each sample is an $N \times D$ latent trait score matrix for all test takers. Because the posterior distribution is usually not in closed form, Markov chain Monte Carlo (MCMC) is used to draw the samples (Cai, 2010; Chen & Zhang, 2021). Thus, this method is referred to as the stochastic approach.
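As a sketch of the quasi-Monte Carlo idea (ours; mirt’s internal implementation differs in detail), Sobol points on the unit cube can be mapped to correlated normal draws of $\boldsymbol{\theta}$ and then weighted by each test taker’s likelihood to form the posterior expectation in Equation 26:

```r
library(randtoolbox)  # for sobol(); an assumed choice of QMC generator

D <- 5; G <- 5000
U <- sobol(G, dim = D, scrambling = 1)  # scrambled points avoid u = 0 exactly
Z <- qnorm(U)                           # map to standard normal scores
Sigma <- matrix(.25, D, D); diag(Sigma) <- 1  # current intertrait correlations
Qpts <- Z %*% chol(Sigma)               # G x D draws from N(0, Sigma)

# For test taker i with likelihood L_i(g) = prod_j P_j(y_ij | theta_g), the
# posterior weight of point g is W[i, g] = L_i(g) / sum_g L_i(g), and the
# integral in Equation 26 is approximated by a weighted sum over the G points.
```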
M Step
In the M step, the expected complete data loglikelihood function (Equation 26) is maximized with respect to estimated parameters . We can use the Newton-Raphson method (Atkinson, 1989) or a quasi-Newton method (e.g., the BFGS algorithm; Broyden, 1970) for the maximization problem. Both methods find the model parameters that maximize Equation 26 iteratively. The search stops when a certain convergence criterion is met, for example, (a) the ratio of the difference of the function values between two consecutive iterations over the current function value is smaller than a threshold (e.g., 1e-6; i.e., the relative tolerance criterion), or (b) a preset maximum number of iterations is reached.
When a set of solutions is obtained in the M step, these values are used in the next E step. The EM cycles continue until a certain convergence criterion is reached. Common criteria are (a) the maximum absolute change of item parameter estimates between two consecutive EM cycles is smaller than a threshold (e.g., 1e-4), (b) the relative tolerance of the observed loglikelihood is smaller than a threshold, or (c) the EM cycles reach the preset maximum number of cycles.
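To make the M step concrete, here is a minimal sketch (ours) of one item’s update with BFGS via R’s optim(), reusing rank2pl_prob() from the earlier sketch; identification constraints (e.g., fixing $d_1 = 0$) are omitted for brevity:

```r
# Negative expected complete-data loglikelihood for one triplet item (a sketch).
# y: length-N vector of coded ranking scores (1-6); Qpts: G x 3 matrix of trait
# points for the item's three traits; W: N x G posterior weights from the E step.
mstep_item_nll <- function(par, y, Qpts, W) {
  P  <- rank2pl_prob(par, Qpts)        # G x 6 pattern probabilities
  LP <- log(pmax(P, 1e-300))           # guard against log(0)
  -sum(W * t(LP[, y, drop = FALSE]))   # -sum_i sum_g W[i, g] * log P_g(y_i)
}

# One BFGS update of the item's parameters from starting values par0:
# est <- optim(par0, mstep_item_nll, y = y, Qpts = Qpts, W = W, method = "BFGS")
```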
For the stochastic approach, there are three methods under this category. The first one is also called the Monte Carlo EM algorithm in the literature (e.g., Meng & Schilling, 1996). The other two are the stochastic EM algorithm (Zhang et al., 2020) and the Metropolis-Hastings Robbins-Monro (MHRM) algorithm (Cai, 2010).
Standard Errors of Parameter Estimates
The standard errors (SEs) of the item parameter and intertrait correlation estimates from the MML-EM algorithm can be estimated by Louis’s (1982) observed information function. Louis’s observed information is consistent when the model is correctly specified. When the model is misspecified, the sandwich-type covariance matrix provides the best SE estimates (Yuan et al., 2014). Note that estimating parameter SEs based on the Hessian matrix of the expected complete data loglikelihood underestimates the parameter SEs because it includes information from the missing data (i.e., $\boldsymbol{\Theta}$) that should be excluded. Chalmers (2018) introduced another convenient estimation of the information of the observed likelihood through Oakes’s identity applied to the EM map (Equation 26). For the stochastic approach, Cai (2010) provided an estimation of Louis’s observed information for the MHRM algorithm.
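In mirt, these choices map roughly onto the SE.type argument (a sketch; option availability depends on the estimation method and package version):

```r
# Requesting SE methods in mirt (calls shown schematically, not verbatim output):
# mod <- mirt(data, model, SE = TRUE, SE.type = 'Louis')     # Louis (1982)
# mod <- mirt(data, model, SE = TRUE, SE.type = 'sandwich')  # misspecification-robust
# mod <- mirt(data, model, SE = TRUE, SE.type = 'Oakes')     # Chalmers (2018)
# The MHRM method (method = 'MHRM') computes Louis-type SEs per Cai (2010).
```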
Simulation Study
We conducted a simulation study to check the parameter recovery of the Rank-2PL model with triplets.
Simulation Design
Four simulation factors were manipulated:
Number of traits: 5 and 15. The 5 and 15 traits represent the relatively small and large numbers of traits, respectively, measured by an FCQ, and both exist in real FCQs.
Statement direction within a block: same and mixed. Previous studies (e.g., Brown & Maydeu-Olivares, 2011; Bürkner et al., 2019) showed that the keyed direction of statements within a block had a significant impact on trait score estimates under the TIRT model: mixed-keyed statements led to much better estimation than equally keyed statements. In the current study, in the same condition, the three statements within a block were all positive or all negative, with half of the blocks positive and half negative. In the mixed condition, half of the blocks had one statement keyed in a different direction from the other two.
Estimation of intertrait correlation matrix: estimated and fixed. The intertrait correlation matrix could be freely estimated or fixed based on prior knowledge (e.g., from a previous administration of the Likert scale of the statements). Previous studies (e.g., Bürkner et al., 2019) have shown that estimating the intertrait correlation matrix sometimes causes an unstable or infeasible estimation, especially for a high-dimensional model. Often, by fixing the correlation matrix, estimations converge easily and become more stable.
Estimation methods: QMCEM and MHRM. As described previously, QMCEM uses a relatively small number of samples (5,000 in the current study) with fixed marginal cumulative probabilities at each trait to approximate the trait space at each EM cycle, while MHRM draws samples using MCMC from the posterior distribution of traits for each test taker at each EM cycle. Thus, QMCEM was expected to run faster but be less accurate than MHRM.
The initial design was to cross all four factors; however, it turned out that estimating intertrait correlations for 15 traits was infeasible: the estimations always terminated abruptly because of an ill-conditioned estimate of the correlation matrix (e.g., a nonpositive definite correlation matrix). Thus, this factor was partially crossed with the other three factors, resulting in 12 combinations in total.
Simulated Data Generation
In every condition, each statement in a triplet item measured a different trait, and each trait was measured by 12 statements in a test, resulting in 20 items (60 statements) for the 5-trait condition and 60 items (180 statements) for the 15-trait condition. For the 5-trait condition, the trait allocation to the 20 items was generated by repeating the C(5, 3) = 10 trait combinations twice. The discrimination parameters were sampled from a lognormal distribution with a mean of 1.45 and a standard deviation (SD) of .79 at the normal scale. The intercept parameters were sampled from a normal distribution with a mean of .44 and an SD of 1.54. Then, for each item, the intercept of the first statement was subtracted from the intercepts of all three statements so that the intercept of the first statement became 0. The adjusted intercepts were the generating parameters, and the intercept of the first statement was fixed to 0 during estimation to identify the models. All item parameters were generated once and fixed during the simulation. The distributional parameters of the item parameters came from real Likert tests, which measured 15 interpersonal and intrapersonal skills important in matching applicants with college majors. For the same statement direction condition, the first ten items had positive statements, and the second ten had negative statements, created by multiplying the generated discrimination parameters by −1. For the mixed statement direction condition, one statement was changed to the opposite keyed direction from the other two statements in half of the 20 items of the same condition. The change of keyed direction was balanced in terms of the five traits and positive/negative items so that one statement direction on each trait was changed in five positive and five negative items. The true intertrait correlations were also taken from the real Likert tests; the correlations among Traits 1, 6, 8, 11, and 13 were used for the 5-trait condition (see Table 2). The traits were sampled from the standard multivariate normal distribution with this correlation matrix. The sample size of test takers was 1,000 for all conditions. Then, based on the Rank-2PL response function for triplets (Equation 11), 100 simulated datasets were generated for the 5-trait condition. Each score on every item in a simulated dataset had at least five test takers. (A condensed code sketch of this generation procedure follows Table 2.)
Table 2.
Intertrait Correlation Matrix in Generating Simulated Data
| Trait | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | .54 | |||||||||||||
| 3 | .62 | .62 | ||||||||||||
| 4 | .29 | .33 | .31 | |||||||||||
| 5 | .58 | .65 | .72 | .28 | ||||||||||
| 6 | .41 | .34 | .36 | .16 | .51 | |||||||||
| 7 | .37 | .36 | .45 | .15 | .58 | .74 | ||||||||
| 8 | .52 | .21 | .21 | .01 | .23 | .43 | .22 | |||||||
| 9 | .33 | .07 | .08 | .16 | .06 | .35 | .23 | .51 | ||||||
| 10 | .09 | −.19 | −.25 | −.20 | −.25 | .19 | .00 | .39 | .52 | |||||
| 11 | .41 | .27 | .38 | .29 | .33 | .34 | .48 | .24 | .49 | .13 | ||||
| 12 | .18 | .11 | .22 | .18 | .23 | .40 | .48 | .13 | .32 | .07 | .53 | |||
| 13 | .26 | .13 | .18 | −.06 | .20 | .06 | .18 | .17 | .20 | .13 | .28 | .16 | ||
| 14 | .43 | .25 | .38 | −.10 | .45 | .25 | .37 | .31 | .13 | .02 | .29 | .17 | .54 | |
| 15 | .46 | .19 | .27 | −.12 | .33 | .37 | .33 | .49 | .32 | .22 | .28 | .17 | .53 | .63 |
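As referenced above, here is a condensed R sketch (ours) of one 5-trait dataset’s generation; we read the reported lognormal values as rlnorm()’s meanlog/sdlog parameters (an assumption), omit the keyed-direction manipulation for brevity, and use a placeholder correlation matrix in place of the Table 2 values:

```r
set.seed(1)
library(MASS)  # for mvrnorm()

J <- 20; N <- 1000; D <- 5
trait_map <- t(combn(5, 3))[rep(1:10, 2), ]  # C(5,3) combinations repeated twice
a <- matrix(rlnorm(J * 3, 1.45, .79), J, 3)  # statement discriminations (assumed scale)
d <- matrix(rnorm(J * 3, .44, 1.54), J, 3)   # statement intercepts
d <- d - d[, 1]                              # intercept of statement 1 fixed to 0

Sigma <- matrix(.3, D, D); diag(Sigma) <- 1  # placeholder for the Table 2 values
Theta <- mvrnorm(N, rep(0, D), Sigma)        # latent trait scores

# Responses sampled from the Equation 11 probabilities via rank2pl_prob()
Y <- sapply(1:J, function(j) {
  P <- rank2pl_prob(c(a[j, ], d[j, ]), Theta[, trait_map[j, ], drop = FALSE])
  apply(P, 1, function(p) sample(1:6, 1, prob = p))
})
```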
The datasets in the 15-trait condition were generated similarly. For a cell in the 15-trait condition, the 20 items in the corresponding cell in the 5-trait condition were repeated three times, resulting in 60 items, and each set of 20 items measured a different set of 5 traits so that 15 traits were measured in total. Item parameters were generated once based on the same distributions as in the 5-trait condition. The intertrait correlation matrix came from the real Likert tests (Table 2). Because estimating a model with 15 traits took much longer than estimating one with five traits, only 50 simulated datasets with 1,000 test takers were generated in each cell. Each score in a simulated item had at least one test taker.
Estimation
We mainly used the mirt 1.34 package (Chalmers, 2012) in the 64-bit R 4.1.2 (R Core Team, 2021) to estimate and analyze the Rank-2PL model with triplets. The mirt program provides a function, createItem, for users to define their own models, and then all the estimation and analysis modules in mirt can be used to estimate and further analyze the models. These include, for example, different kinds of EM algorithms, standard error estimations of parameters, and trait score estimations; all kinds of fit statistics for model, item, and person; item/test information; reliability of trait score estimates; and all kinds of informative/diagnostic plots (e.g., item/test characteristic/empirical response curve).
For the QMCEM algorithm, mirt uses only the Halton sequence. As Morokoff and Caflisch (1995) suggested, the Halton sequence is appropriate for dimensions up to 6, and the Sobol sequence performs best for higher numbers of dimensions. We modified the relevant functions in mirt to use the Sobol sequence for estimations with dimensions larger than 6. For the MHRM algorithm, the mirt program provides the MHRM SE of parameter estimates by default. However, mirt estimates the MHRM SEs separately after the parameter estimation is done, and it takes much longer to obtain SE estimates than parameter estimates. We modified the mirt functions to save running time so that parameters and their SE estimations were done simultaneously. For the QMCEM algorithm, the mirt program provides Oakes’s SE of parameter estimates by default. Because it took too long to estimate Oakes’s SE in the 15-trait condition, the SEs of parameters were not estimated in the 15-trait QMCEM condition. The starting values of the discrimination parameters were set to 1 for positive statements and −1 for negative statements. All the starting values of the intercepts and intertrait correlations were set to 0 and .25, respectively. All trait scores were estimated by the maximum likelihood method. Otherwise, the default settings in mirt were used for all estimations and analysis functions. The supplemental online materials provide the R code to simulate and estimate data for the Rank-2PL models for pairs and triplets. All analyses were conducted on a computer with an Intel Core i7 CPU and 16GB RAM.
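The following sketch (ours) outlines how such a custom item type can be set up with createItem(); the wrapper assumes the item’s three traits occupy the first three columns of Theta, whereas a full implementation maps each statement to its own trait column:

```r
library(mirt)

# Wrap the Equation 11 function sketched earlier into mirt's expected signature
P.rank2pl <- function(par, Theta, ncat)
  rank2pl_prob(par, Theta[, 1:3, drop = FALSE])

rank2pl <- createItem('rank2pl',
                      par = c(a1 = 1, a2 = 1, a3 = 1, d1 = 0, d2 = 0, d3 = 0),
                      est = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),  # d1 fixed at 0
                      P = P.rank2pl)

# Schematic call: 6-category coded responses, custom item type, QMCEM or MHRM
# mod <- mirt(data, model = 5, itemtype = 'rank2pl',
#             customItems = list(rank2pl = rank2pl), method = 'QMCEM')
```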
Evaluation Criteria
We calculated the average relative bias and root mean square error (RMSE) for each discrimination, intercept, or intertrait correlation estimate:

$$\text{Relative bias} = \frac{1}{R}\sum_{r=1}^{R}\frac{\hat{\lambda}_r - \lambda}{\lambda} \tag{28}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}_r - \lambda\right)^2} \tag{29}$$
where $\hat{\lambda}_r$ is the estimate of a parameter $\lambda$ from replicate dataset $r$, $\lambda$ is the parameter’s true value, and $R$ (equal to 50 or 100) is the number of replicate datasets. For the parameters whose SEs were estimated, we also calculated the average relative bias and RMSE of each parameter’s SE estimates by comparing them to their empirical standard deviation (SD),

$$\mathrm{SD}(\hat{\lambda}) = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\left(\hat{\lambda}_r - \bar{\lambda}\right)^2}, \quad \bar{\lambda} = \frac{1}{R}\sum_{r=1}^{R}\hat{\lambda}_r \tag{30}$$

$$\text{Relative bias}(SE) = \frac{1}{R}\sum_{r=1}^{R}\frac{SE_r - \mathrm{SD}(\hat{\lambda})}{\mathrm{SD}(\hat{\lambda})} \tag{31}$$

$$\mathrm{RMSE}(SE) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(SE_r - \mathrm{SD}(\hat{\lambda})\right)^2} \tag{32}$$
These average relative bias and RMSE statistics were then averaged within each parameter group (i.e., discrimination, intercept, and intertrait correlation) and compared.
For score estimates of each trait in a simulated dataset, we calculated (a) the squared correlation between estimated and true scores, which we call the true reliability; (b) the empirical reliability (Kim, 2012) for maximum likelihood score estimates, given as

$$\text{empirical reliability}_d = \frac{\hat{\sigma}^2_{\hat{\theta}_d} - \frac{1}{N}\sum_{i=1}^{N} SE^2_{id}}{\hat{\sigma}^2_{\hat{\theta}_d}} \tag{33}$$

where $\hat{\sigma}^2_{\hat{\theta}_d}$ is the sample variance of trait $d$’s score estimates, and $SE^2_{id}$ is the error variance estimate of test taker $i$’s estimated trait $d$ score $\hat{\theta}_{id}$; and (c) the RMSE between estimated and true scores, given as

$$\mathrm{RMSE}_d = \sqrt{\frac{1}{N^*}\sum_{i=1}^{N^*}\left(\hat{\theta}_{id} - \theta_{id}\right)^2} \tag{34}$$
where $N^*$ is the number of test takers in a simulated dataset with estimable trait scores and SEs. These statistics excluded test takers with nonconverged or infinite/inestimable trait estimates/SEs. Then, the means and standard deviations of these three statistics across the 100/50 simulated datasets were compared.
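The following small R helpers (ours) restate Equations 28–34 operationally; est holds the $R$ replicate estimates of one parameter, se the corresponding SE estimates, and theta_hat, se_theta, and theta_true the score estimates, their SEs, and the true scores for one trait:

```r
rel_bias <- function(est, truth) mean((est - truth) / truth)   # Equation 28
rmse     <- function(est, truth) sqrt(mean((est - truth)^2))   # Equation 29

# SE recovery against the empirical SD of the estimates (Equations 30-32)
se_rel_bias <- function(se, est) mean((se - sd(est)) / sd(est))
se_rmse     <- function(se, est) sqrt(mean((se - sd(est))^2))

# Empirical reliability for ML scores (Equation 33) and score RMSE (Equation 34)
emp_rel <- function(theta_hat, se_theta)
  (var(theta_hat) - mean(se_theta^2)) / var(theta_hat)
score_rmse <- function(theta_hat, theta_true)
  sqrt(mean((theta_hat - theta_true)^2))
```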
Results
Table 3 lists the mean estimation times and the average relative biases and RMSEs of the discrimination and intercept parameter estimates and SEs in all the simulation conditions, and Table 4 lists those for the intertrait correlations in the 5-trait condition. We have several observations based on these tables.
Table 3.
Average Relative Biases and RMSEs of Item Parameter Estimates and Standard Errors (SE) in Simulated Data
| No. of traits | Est. method | Keyed direction | Estimate intertrait correlations | Est. time (min) | Disc. rel. bias | Disc. RMSE | Disc. SE inestimable (%) | Disc. SE rel. bias | Disc. SE RMSE | Int. rel. bias | Int. RMSE | Int. SE inestimable (%) | Int. SE rel. bias | Int. SE RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | QMCEM | Same | Y | 34 | .00 | .24 | 0 | .03 | .05 | −.01 | .14 | 0 | .13 | .06 |
| 5 | QMCEM | Same | N | 29 | .00 | .26 | 0 | .08 | .06 | −.01 | .14 | 0 | .12 | .05 |
| 5 | QMCEM | Mixed | Y | 37 | .00 | .22 | 0 | .03 | .06 | −.01 | .15 | 0 | .11 | .07 |
| 5 | QMCEM | Mixed | N | 31 | .00 | .28 | 0 | .07 | .06 | −.01 | .16 | 0 | .10 | .07 |
| 5 | MHRM | Same | Y | 52 | .00 | .14 | 11 | −.09 | .07 | .00 | .14 | 6 | −.08 | .07 |
| 5 | MHRM | Same | N | 51 | .00 | .14 | 9 | .00 | .07 | .00 | .14 | 6 | −.03 | .07 |
| 5 | MHRM | Mixed | Y | 54 | .00 | .14 | 13 | .19 | .10 | .01 | .15 | 8 | .08 | .09 |
| 5 | MHRM | Mixed | N | 54 | .00 | .14 | 9 | .05 | .08 | .01 | .15 | 6 | .01 | .08 |
| 15^a | QMCEM | Same | N | 354 | −.02 | .55 | — | — | — | .02 | .33 | — | — | — |
| 15^a | QMCEM | Mixed | N | 370 | −.01 | .56 | — | — | — | .03 | .36 | — | — | — |
| 15^a | MHRM | Same | N | 1270 | .00 | .17 | 10 | −.11 | .10 | −.01 | .16 | 5 | −.08 | .11 |
| 15^a | MHRM | Mixed | N | 890 | .00 | .19 | 4 | −.09 | .11 | .01 | .17 | 2 | −.04 | .12 |
Note. Disc. = discrimination; Int. = intercept; — = standard errors of item parameters were not estimated in this condition due to extremely long run time.
^a Intertrait correlations were fixed in this condition due to estimation difficulty.
Table 4.
Average Relative Biases and RMSEs of Intertrait Correlation Estimates and Standard Errors in Simulated Data with 5 Traits
| Estimation method | Keyed direction | Estimate rel. bias | Estimate RMSE | SE inestimable (%) | SE rel. bias | SE RMSE |
|---|---|---|---|---|---|---|
| QMCEM | Same | −.87 | .20 | 0 | .18 | .01 |
| QMCEM | Mixed | −.47 | .12 | 0 | .20 | .01 |
| MHRM | Same | −.04 | .04 | 10 | −.26 | .01 |
| MHRM | Mixed | .01 | .03 | 15 | .17 | .02 |
First, for the item parameter estimates, the average relative biases were very low (between 0% and 3% in magnitude) for all conditions. For the average RMSEs of discrimination, MHRM performed better than QMCEM (average .14 vs. .25 for the 5-trait condition, .18 vs. .56 for the 15-trait condition), while statement directions and estimation of intertrait correlations had no significant effect. The average RMSEs for intercepts were similar across all conditions (between .14 and .17) except for the QMCEM 15-trait condition, where the average RMSEs were more than doubled (.33 and .36 for the same and mixed keyed directions, respectively).
Second, for the item parameter SE estimates, the average relative biases for discrimination appeared to be overestimated (19%) in the 5-trait MHRM mixed-direction free-correlation condition, while the average RMSEs were similar across all conditions (between .05 and .11). The average relative biases for the intercept were small in the MHRM fixed-correlation conditions (between 1% and 4% in magnitude), and the average RMSEs were similar across all conditions (between .05 and .12). In MHRM, 9% of discrimination SEs and 5% of intercept SEs, on average, were inestimable.
Third, for the intertrait correlation estimates in the 5-trait condition, MHRM performed significantly better than QMCEM regarding both average relative biases and RMSEs. The mixed-direction condition in QMCEM had better average relative biases (−47% vs. −87%) and RMSEs (.12 vs. .20) than the same-direction condition; however, both were too inaccurate to be acceptable. For the intertrait correlation SE estimates, all the average relative biases were relatively large in magnitude (between 17% and 26%); however, this was not a concern as their empirical standard deviations were very small (between .03 and .07). Their RMSEs were no larger than .02, indicating the SEs were well estimated. Note that for MHRM, 10% and 15% of correlation SEs were inestimable in the same-direction and mixed-direction conditions, respectively.
Finally, the MHRM runs were much slower than the QMCEM runs. On average, for the 5-trait condition, a QMCEM run took about half an hour, while an MHRM run took nearly an hour; for the 15-trait condition, a QMCEM run took about six hours, while an MHRM run took about 15 hours in the mixed-direction condition and 21 hours in the same-direction condition.
Table 5 shows the means and standard deviations of the trait score estimates’ true reliabilities, empirical reliabilities, and RMSEs using estimated item parameters and intertrait correlations. The mixed-direction condition had better trait score estimates than the same-direction condition: about 8%–9% and 5%–7% improvement on the mean true and empirical reliabilities, respectively, while the estimation method and intertrait correlation estimation conditions had no significant impact. Compared to the true reliabilities, the empirical reliabilities overestimated the reliabilities by an average of 9%–13% in the same-direction condition and 6% in the mixed-direction condition. For mean RMSEs, those for the mixed-direction condition were smaller than those for the same-direction condition by .11 to .20. The estimation method and intertrait correlation estimation conditions had no significant effect on mean RMSEs in the 5-trait condition. In contrast, MHRM had smaller mean RMSEs by .32 and .36 than QMCEM in the 15-trait same- and mixed-direction conditions, respectively. Table 6 lists the same statistics of the trait score estimates as Table 5 but is based on true item parameters and intertrait correlations. The corresponding statistics between the two tables were very close, indicating that the model’s estimated traits under various conditions were close to the optimal estimated values based on true item parameters and intertrait correlations. However, there was one exception: the mean RMSEs in the 15-trait QMCEM condition in Table 5 were larger by .31 and .34 than in Table 6. This is consistent with the large average RMSEs on item parameter estimates under this condition, as shown in Table 3. All standard deviations in Tables 5 and 6 were small, between .03 and .09, indicating that the reliability and trait score estimates were quite stable across replicated datasets.
Table 5.
Mean and Standard Deviation (SD) of True Reliabilities, Empirical Reliabilities, and RMSEs of Trait Score Estimates Based on Estimated Item Parameters and Intertrait Correlations
| No. of traits | Est. method | Keyed direction | Estimate intertrait correlations | True reliability mean | True reliability SD | Empirical reliability mean | Empirical reliability SD | RMSE mean | RMSE SD |
|---|---|---|---|---|---|---|---|---|---|
| 5 | QMCEM | Same | Y | .69 | .06 | .76 | .08 | .85 | .06 |
| 5 | QMCEM | Same | N | .70 | .06 | .77 | .06 | .79 | .06 |
| 5 | QMCEM | Mixed | Y | .78 | .05 | .83 | .03 | .67 | .05 |
| 5 | QMCEM | Mixed | N | .78 | .06 | .83 | .05 | .68 | .05 |
| 5 | MHRM | Same | Y | .70 | .06 | .77 | .07 | .78 | .07 |
| 5 | MHRM | Same | N | .70 | .06 | .77 | .06 | .78 | .06 |
| 5 | MHRM | Mixed | Y | .78 | .05 | .83 | .03 | .65 | .05 |
| 5 | MHRM | Mixed | N | .78 | .05 | .83 | .03 | .65 | .05 |
| 15 | QMCEM | Same | N | .67 | .09 | .73 | .08 | 1.15 | .04 |
| 15 | QMCEM | Mixed | N | .75 | .07 | .78 | .08 | .99 | .04 |
| 15 | MHRM | Same | N | .67 | .09 | .76 | .08 | .83 | .05 |
| 15 | MHRM | Mixed | N | .76 | .06 | .81 | .06 | .63 | .04 |
Table 6.
Mean and Standard Deviation (SD) of True Reliabilities, Empirical Reliabilities, and RMSEs of Trait Score Estimates Based on True Item Parameters and Intertrait Correlations
| No. of traits | Est. method | Keyed direction | True reliability mean | True reliability SD | Empirical reliability mean | Empirical reliability SD | RMSE mean | RMSE SD |
|---|---|---|---|---|---|---|---|---|
| 5 | QMCEM | Same | .71 | .05 | .77 | .05 | .76 | .06 |
| 5 | QMCEM | Mixed | .78 | .05 | .83 | .03 | .64 | .04 |
| 5 | MHRM | Same | .71 | .05 | .77 | .05 | .76 | .06 |
| 5 | MHRM | Mixed | .79 | .04 | .83 | .03 | .64 | .05 |
| 15 | QMCEM | Same | .68 | .09 | .76 | .07 | .84 | .04 |
| 15 | QMCEM | Mixed | .77 | .05 | .82 | .04 | .65 | .03 |
| 15 | MHRM | Same | .67 | .09 | .76 | .07 | .86 | .04 |
| 15 | MHRM | Mixed | .77 | .05 | .82 | .04 | .64 | .03 |
In sum, the simulation study produced the following findings:
MHRM had better parameter estimates than QMCEM in general. The improvements were especially significant on the intertrait correlation estimates in the 5-trait condition and the item parameter and trait score estimates in the 15-trait condition.
The mixed-direction condition had moderate improvement in the trait score estimates compared to the same-direction condition regarding reliability and RMSE; however, the improvement was not seen in the item parameter and intertrait correlation estimates.
In general, fixing or estimating intertrait correlations did not impact model estimation in the 5-trait condition. For the 15-trait condition, estimating intertrait correlations was not feasible.
None of the conditions significantly affected the SE estimates of item parameters and intertrait correlations. For MHRM, a small portion of SEs could not be estimated.
The empirical reliability overestimated the squared correlation between true and estimated trait scores by 6%−13% in our simulation study.
The MHRM runs took longer to converge than the QMCEM runs, especially for high dimensions: on average, 53 vs. 33 minutes in the 5-trait condition and 1,080 vs. 362 minutes in the 15-trait condition.
Real Data Application
The Rank-2PL model was applied to a real FCQ form with triplets. The real triplet form contained 60 items (blocks) with 180 statements, five of which were ideal point statements; the rest were dominance statements. Each statement in an item measured a different trait. These statements measured 14 interpersonal and intrapersonal skills essential to higher education and career success, such as perseverance, leadership, creativity, curiosity, responsibility, and self-discipline. The number of statements per trait ranged from 5 to 18. There were 27 items with three negative statements, 29 with three positive statements, one with one positive and two negative statements, and three with two positive and one negative statement. The number of test takers without any missing responses was 552.
For comparison, two Rank-2PL models were fitted to the real data: one used the fixed item parameters from a 2PL calibration of Likert data on these statements from a previous administration, and the other estimated the item parameters directly from the triplet data using the method developed in this paper. For the free calibration, the MHRM algorithm was applied because it yielded better parameter estimates than the QMCEM method in our simulation study. For the fixed-parameter estimation, mirt does not allow the MHRM algorithm; thus, the QMCEM algorithm with the Sobol sequence was used. Because the fixed-parameter estimation involves no calibration per se, the two different methods of approximating the posterior distribution of latent trait scores have little impact on comparing the fixed and free models. For both estimations, the intertrait correlation matrix was fixed at the values from the previous administration of the Likert scales (i.e., Table 2 with Trait 10 removed). For the free model, the intercept of the first statement in an item was fixed to its value from the Likert scale to identify the model. The estimation programs employed in the simulation study were also used here.
We report the comparison of the two models using model selection and fit statistics. Table 7 lists the two models’ loglikelihood, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) values. The free model was preferred from the model comparison perspective.
Table 7.
Comparison of Loglikelihood, AIC, and BIC
| Model | logLik | AIC | AIC dif.^a | BIC | BIC dif.^a |
|---|---|---|---|---|---|
| Fixed | −59134 | 118868 | | 120162 | |
| Free | −53679 | 107958 | 10910 | 109252 | 10910 |
^a Model above − model below.
Table 8 shows the M2* Chi-square statistics (Cai & Hansen, 2013) and the related M2*-based RMSEA statistics (Root Mean Squared Error of Approximation; Maydeu-Olivares & Joe, 2014) for the two models. Although the M2* statistic was significant for both models, RMSEA shows that the free model was a close fit to the data while the fixed model was not even an adequate fit based on Maydeu-Olivares and Joe’s (2014) criteria (<.05 close fit; <.089 adequate fit).
Table 8.
Comparison of M2* and RMSEA
| Model | M2* | df | p | RMSEA | RMSEA_5^a | RMSEA_95^a |
|---|---|---|---|---|---|---|
| Fixed | 8735 | 1530 | .000 | .092 | .090 | .094 |
| Free | 3381 | 1530 | .000 | .047 | .045 | .049 |
^a RMSEA_5 and RMSEA_95 are the lower and upper limits of the 90% confidence interval of RMSEA.
Table 9 lists the number and percentage of misfit items and item pairs for each model based on the $Z_h$ (Drasgow et al., 1985) and LD-X² (Chen & Thissen, 1997) statistics, respectively. For the $Z_h$ statistic, the conventional criterion, $Z_h < -1.64$, was used to flag misfit items at the .05 significance level. The LD-X² was converted to the Cramér’s V index, and a Cramér’s V value larger than .3 was used as the flagging criterion to indicate a strong association between two items. No item/pair was flagged for the free model, while the fixed model had 23% of items and 56% of item pairs identified as misfits.
Table 9.
Comparison of Number (N) and Percentage (%) of Misfit Items
^a $Z_h$ flagging criterion: $Z_h < -1.64$.
^b LD-X² flagging criterion: Cramér’s V > .3.
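For reference, the reported fit statistics can be obtained in mirt roughly as follows (a sketch for a fitted model mod; exact arguments may vary by package version):

```r
# M2*-type limited-information fit with RMSEA (QMC integration helps in high dimensions)
# M2(mod, QMC = TRUE)

# Drasgow et al.'s (1985) Zh fit statistic for each item
# itemfit(mod, fit_stats = 'Zh')

# Chen and Thissen's (1997) LD-X2 statistics for item pairs
# residuals(mod, type = 'LD')
```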
Summary and Discussion
For the first time, we developed an MML-EM estimation of the Rank-2PL model for triplets in FCQs. FCQs used in serious applications are characterized by a high number of dimensions. To address this, we described various options in the E step for evaluating the integral of the loglikelihood over high dimensions: the Monte Carlo EM algorithm (MCEM), the stochastic EM algorithm, and the Metropolis-Hastings Robbins-Monro (MHRM) algorithm. These algorithms make the MML-EM estimation of the Rank-2PL model with high dimensions feasible. We also described different estimation methods for the standard errors of model parameters.
We conducted a simulation study to check the parameter recovery of the Rank-2PL model for triplets. Parameter recovery was satisfactory in most cases. We recommend MHRM over QMCEM as the estimation method, especially for higher numbers of traits, because MHRM achieved more accurate item parameter and intertrait correlation estimates. This is consistent with the nature of the two algorithms. The drawbacks of MHRM include longer estimation times (because of MCMC sampling) and some inestimable (negative) SEs of model parameters. With five traits, we could always estimate the intertrait correlations successfully and reasonably accurately by MHRM, while Bürkner et al. (2019, Table 2) reported success rates between .28 and .38 for their mixed-direction condition with 45 triplet items. This may be due to different model configurations or estimation methods: Bürkner et al. estimated TIRT models in Mplus by the ULSMV method, a limited information factor analysis method under structural equation modeling. However, for FCQs with more traits (like 15 in our study), estimating intertrait correlations seems infeasible; fixing the correlation matrix based on prior knowledge is necessary for successful model estimation.
We found that mixed-direction items moderately improved the trait score estimates compared to same-direction items in terms of trait score reliability and accuracy; however, this improvement was not seen in the item parameter and intertrait correlation estimates. Including mixed-direction items in FCQs was strongly recommended by Brown and Maydeu-Olivares (2011) from the psychometric perspective, while this recommendation is considered unrealistic because it conflicts with the goal of controlling test takers’ impression management that FCQs aim to achieve (Bürkner et al., 2019). Brown and Maydeu-Olivares (2011) gave this recommendation mainly based on their simulation results with two traits. Their results with five traits (Tables 6 and 7 in their paper) showed a pattern similar to ours. Bürkner et al. (2019) reported a large improvement in trait score estimation under five traits for mixed-direction items; however, this might be due to the particular conditions they simulated. Reports have shown that mixed-direction items are unnecessary for FCQs with more traits to achieve satisfactory psychometric properties (Bürkner et al., 2019; Brown & Maydeu-Olivares, 2012). Further, our study suggests that even with five traits, researchers can design a psychometrically sound FCQ without mixed-direction items by including an appropriate number of items of good statistical quality.
The Rank-2PL model was applied to a real triplet form. We compared two models: one freely estimating item parameters and another fixing the item parameters at values from previous Likert scales. The latter is a common practice due to the lack of available estimation methods for IRT models for FCQs. We demonstrated how model comparison can be done with model selection criteria (loglikelihood, AIC, and BIC) and model/item fit statistics. We found that the free model fitted the FCQ data much better than the fixed model. This is understandable, considering the fixed parameters came from a different population and test type. Therefore, caution should be exercised when using the fixed model in practice without fully verifying its psychometric properties (Fu et al., 2024). Ideally, the estimation model should be updated periodically as more FCQ data become available to incorporate new information. This practice becomes feasible only when estimation programs for FCQ IRT models, like the MML-EM estimation for the Rank-2PL model, are available.
Nowadays, data from psychological and educational tests, such as FCQ data and cognitive diagnostic test data, are often characterized by a high number of dimensions. Using MML-EM to estimate high-dimensional data is challenging due to the estimation demands on computer RAM capacity and CPU power. The situation has improved in recent years with more computing power and more efficient estimation algorithms, such as those we introduced in this paper. These algorithms have been implemented in many computer programs for estimating multidimensional IRT models, including the mirt program used here, flexMIRT (Cai, 2023), and MIRT (Haberman, 2013). Successful MML-EM estimation of multidimensional test data with multidimensional IRT models has been demonstrated in numerous papers (e.g., Cai, 2010; Garnier-Villarreal et al., 2021; von Davier, 2008), including the current study. However, most applications have dealt with fewer than ten dimensions. Calibrating test data with more than ten dimensions using a full information method is still challenging. This paper shows that the Rank-2PL model for triplets could handle 15 dimensions with a fixed correlation matrix of latent scores. For data with considerably more dimensions, such as 30, a limited information method, such as unweighted least squares (ULS), may be more appropriate (Brown & Maydeu-Olivares, 2012). This holds if estimating the covariance matrix of parameter estimates is not required; computing that covariance matrix takes considerable time for high-dimensional data, even under the ULS method.
We conclude our paper with suggestions for future research. First, the design of the current simulation study is not comprehensive. Future studies can examine additional conditions on intertrait correlations, item parameters, the number of traits, the number of items per trait, sample size, and keyed direction. Another interesting topic is to compare the Rank-2PL model for triplets to the TIRT model under various simulated conditions. Previously, we discussed the relationships between the Rank-2PL and TIRT models; an empirical study would deepen our understanding of these relationships. Finally, differential item/test functioning (DIF/DTF) analysis is usually recommended for checking the fairness of an FCQ. Therefore, a study of DIF/DTF under the Rank-2PL model would shed light on IRT-based DIF/DTF analysis methods for FCQs. A multigroup Rank-2PL model may be developed for this purpose.
Supplementary Material
References
- Atkinson KE (1989). An introduction to numerical analysis (2nd ed.). John Wiley.
- Birkeland SA, Manson TM, Kisamore JL, Brannick MT, & Smith MA (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of Selection and Assessment, 14(4), 317–335. 10.1111/j.1468-2389.2006.00354.x
- Birnbaum A (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord FM & Novick MR (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
- Bock RD, & Aitkin M (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. 10.1007/BF02293801
- Brown A (2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(1), 135–160. 10.1007/s11336-014-9434-9
- Brown A, & Maydeu-Olivares A (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. 10.1177/0013164410375112
- Brown A, & Maydeu-Olivares A (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44, 1135–1147. 10.3758/s13428-012-0217-x
- Broyden CG (1970). The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1), 76–90. 10.1093/imamat/6.1.76
- Bürkner P-C, Schulte N, & Holling H (2019). On the statistical and practical limitations of Thurstonian IRT models. Educational and Psychological Measurement, 79(5), 827–854. 10.1177/0013164419832063
- Cai L (2010). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57. 10.1007/s11336-009-9136-x
- Cai L (2023). flexMIRT (Version 3.6): Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Vector Psychometric Group. https://vpgcentral.com/software/flexmirt/
- Cai L, & Hansen M (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245–276. 10.1111/j.2044-8317.2012.02050.x
- Chalmers RP (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Chalmers RP (2018). Numerical approximation of the observed information matrix with Oakes' identity. British Journal of Mathematical and Statistical Psychology, 71(3), 415–436. 10.1111/bmsp.12127
- Chen WH, & Thissen D (1997). Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. 10.2307/1165285
- Chen Y, & Zhang S (2021). Estimation methods for item factor analysis: An overview. In Zhao Y & Chen DG (Eds.), Modern statistical methods for health research: Emerging topics in statistics and biostatistics (pp. 329–350). Springer. 10.1007/978-3-030-72437-5_15
- Davis PJ, & Rabinowitz P (1984). Methods of numerical integration. Academic Press. 10.1016/B978-0-12-206360-2.50012-1
- de la Torre J, Ponsoda V, Leenen I, & Hontangas P (2012, April). Examining the viability of recent models for forced-choice data [Paper presentation]. American Educational Research Association Annual Meeting, Vancouver, BC, Canada.
- Drasgow F, Levine MV, & Williams EA (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67–86. 10.1111/j.2044-8317.1985.tb00817.x
- Forero CG, & Maydeu-Olivares A (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14(3), 275–299. 10.1037/a0015825
- Fu J (2019). Maximum marginal likelihood estimation with an expectation-maximization algorithm for multigroup/mixture multidimensional item response theory models (ETS Research Report No. 19–35). Educational Testing Service. 10.1002/ets2.12272
- Fu J, Kyllonen PC, & Tan X (2024). From Likert to forced choice: Statement parameter invariance and context effects in personality assessment. Measurement: Interdisciplinary Research and Perspectives. Advance online publication. 10.1080/15366367.2023.2258482
- Garnier-Villarreal M, Merkle EC, & Magnus BE (2021). Between-item multidimensional IRT: How far can the estimation methods go? Psych, 3(3), 404–421. 10.3390/psych3030029
- Gilks WR, Richardson S, & Spiegelhalter DJ (1996). Introducing Markov chain Monte Carlo. In Gilks WR, Richardson S, & Spiegelhalter DJ (Eds.), Markov chain Monte Carlo in practice (pp. 1–20). Chapman and Hall. 10.1201/b14835
- Haberman SJ (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm (Research Report No. RR-13–32). Educational Testing Service. 10.1002/j.2333-8504.2013.tb02339.x
- Hontangas PM, de la Torre J, Ponsoda V, Leenen I, Morillo D, & Abad FJ (2015). Comparing traditional and IRT scoring of forced-choice tests. Applied Psychological Measurement, 39(8), 598–612. 10.1177/0146621615585851
- Kim S (2012). A note on the reliability coefficients for item response model-based ability estimates. Psychometrika, 77(4), 153–162. 10.1007/s11336-011-9238-0
- Lee P, Joo SH, Stark S, & Chernyshenko OS (2019). GGUM-RANK statement and person parameter estimation with multidimensional forced choice triplets. Applied Psychological Measurement, 43(3), 226–240. 10.1177/0146621618768294
- Louis TA (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B, 44(2), 226–233. http://www.jstor.org/stable/2345828
- Luce RD (2005). Individual choice behavior. Dover. (Original work published 1959)
- Maydeu-Olivares A (1999). Thurstonian modeling of ranking data via mean and covariance structure analysis. Psychometrika, 64(3), 325–340. 10.1007/BF02294299
- Maydeu-Olivares A, & Brown A (2010). Item response modeling of paired comparison and ranking data. Multivariate Behavioral Research, 45(6), 935–974. 10.1080/00273171.2010.531231
- Maydeu-Olivares A, & Joe H (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. 10.1080/00273171.2014.911075
- McDonald RP (1997). Normal-ogive multidimensional model. In van der Linden WJ & Hambleton RK (Eds.), Handbook of modern item response theory (pp. 258–270). Springer.
- Meng X-L, & Schilling S (1996). Fitting full-information item factor models and an empirical investigation of bridge sampling. Journal of the American Statistical Association, 91(435), 1254–1267. 10.1080/01621459.1996.10476995
- Morillo D, Abad FJ, Kreitchmann RS, Leenen I, Hontangas P, & Ponsoda V (2019). The journey from Likert to forced-choice questionnaires: Evidence of the invariance of item parameters. Journal of Work and Organizational Psychology, 35(2), 75–83. 10.5093/jwop2019a11
- Morillo D, Leenen I, Abad FJ, Hontangas P, de la Torre J, & Ponsoda V (2016). A dominance variant under the multi-unidimensional pairwise-preference framework: Model formulation and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 40(7), 500–516. 10.1177/0146621616662226
- Morokoff WJ, & Caflisch RE (1995). Quasi-Monte Carlo integration. Journal of Computational Physics, 122(2), 218–230. 10.1006/jcph.1995.1209
- R Core Team (2021). R: A language and environment for statistical computing (Version 4.1.2). R Foundation for Statistical Computing. https://www.R-project.org/
- Roberts JS, Donoghue JR, & Laughlin JE (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24(1), 3–32. 10.1177/01466216000241001
- Salgado JF, Anderson N, & Tauriz G (2015). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 88(4), 797–834. 10.1111/joop.12098
- Schilling S, & Bock RD (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70(3), 533–555. 10.1007/s11336-003-1141-x
- Stark S, Chernyshenko OS, & Drasgow F (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise preference model. Applied Psychological Measurement, 29(3), 184–203. 10.1177/0146621604273988
- Thurstone LL (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. 10.1037/h0070288
- von Davier M (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61(2), 287–307. 10.1348/000711007X193957
- Wang W-C, Qiu X-L, Chen C-W, Ro S, & Jin K-Y (2017). Item response theory models for ipsative tests with multidimensional pairwise-comparison items. Applied Psychological Measurement, 41(8), 600–613. 10.1177/0146621617703183
- Yuan K-H, Cheng Y, & Patton J (2014). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79(1), 232–254. 10.1007/s11336-013-9334-4
- Zhang B, Tu N, Angrave L, Zhang S, Sun T, Tay L, & Li J (2023). The generalized Thurstonian unfolding model (GTUM): Advancing the modeling of forced-choice data. Organizational Research Methods. Advance online publication. 10.1177/10944281231210481
- Zhang S, Chen Y, & Liu Y (2020). An improved stochastic EM algorithm for large-scale full-information item factor analysis. British Journal of Mathematical and Statistical Psychology, 73(1), 44–71. 10.1111/bmsp.12153