Published in final edited form as: J Educ Behav Stat. 2024 Jun 18;50(3):497–525. doi: 10.3102/10769986241256030

The Rank-2PL IRT Models for Forced-Choice Questionnaires: Maximum Marginal Likelihood Estimation with an EM Algorithm

Jianbin Fu 1, Xuan Tan 1, Patrick C Kyllonen 1

A forced-choice questionnaire (FCQ) is a set of blocks of two or more statements. Statements within a block may measure the same trait (unidimensional) or different traits (multidimensional), and an FCQ can mix unidimensional and multidimensional blocks (Stark et al., 2005) or comprise only one type. Test takers are asked to rank the statements within a block based on how much they agree with the statements or how closely the statements describe or match their attitudes, beliefs, or behaviors (Brown & Maydeu-Olivares, 2011). The forced-choice item type has been proposed as an alternative to the traditional Likert scale item type for noncognitive questionnaires (e.g., to measure personality, preferences, and attitudes) to address response distortions associated with Likert-type items, for example, social desirability responding (e.g., Birkeland et al., 2006). Traditional scoring of FCQs takes the sum of observed ranking scores on each trait or dimension (Hontangas et al., 2015), which leads to ipsative scores for multidimensional FCQs. Ipsative scores have a constant total score across all subscores of measured traits for every test taker and thus are problematic for interindividual comparisons. Ipsative scores also lead to problems with the estimation of reliability and average scale intercorrelations, factor analysis, and the interpretation of scores (Hontangas et al., 2015). There are traditional methods to address the issues of ipsative scores that lead to quasi-ipsative scores (Salgado et al., 2015). Alternatively, some researchers have proposed item response theory (IRT) models to capture the data-generating mechanism.

Brown (2016) provided an integrated review of IRT models for FCQs. A popular model reviewed was Stark et al.’s (2005) multi-unidimensional pairwise preference (MUPP) model for FCQs with ideal point statements and two-statement blocks. Stark et al. (2005) used the dichotomous version of the generalized graded unfolding model (GGUM; Roberts et al., 2000) as the response function for each ideal point statement. For an ideal point statement, the relationship between the extent of agreement and the latent trait scores is a bell curve, where optimal agreement is located somewhere between extremes on the latent trait score continuum, making it possible to disagree with the statement from above or below. For example, the statement, “I like work-life balance,” might be considered an ideal point statement because it is possible to disagree with the statement from above (because I like to work a lot) or below (because I do not like to work very much).

To handle blocks with more than two statements, de la Torre et al. (2012) proposed three extensions of the MUPP-GGUM model based on response formats, referred to as PICK, MOLE, and RANK. For the PICK format, test takers are asked to select one statement from among all statements in a block; for the MOLE format, test takers are asked to select the statement they agree with most and the statement they agree with least; for the RANK format, test takers are asked to rank all statements.

Hontangas et al. (2015) compared the normative scores based on the PICK, MOLE, and RANK GGUM models with the traditional raw scores of multidimensional FCQs to check the normative property of ipsative scores. However, in all the studies mentioned above for the IRT models with GGUM for FCQs, the item parameters in the GGUM for each statement are assumed to be known, for example, from fitting a GGUM to response data from a Likert scale questionnaire of the same statements administered previously. Latent trait scores were estimated based on the known (i.e., assumed) item parameters. The use of assumed fixed item parameters from previous Likert scale questionnaires raises serious questions about the validity of these models (Morillo et al., 2019; Fu et al., 2024). Later, Lee et al. (2019) developed a Bayesian estimation of both item and person parameters of the Rank-GGUM models directly from FCQ data using the Markov chain Monte Carlo (MCMC) method (Gilks et al., 1996).

Morillo et al. (2016) coupled the MUPP model with the two-parameter logistic IRT (2PL) response function (Birnbaum, 1968) for blocks with two dominance statements. Unlike an ideal point statement, a dominance statement (e.g., “I enjoy being by myself”) has a monotonic relationship between the degree of agreement and the underlying trait as modeled in the 2PL response function—the higher one is on the trait, the more likely one is to register agreement with the statement. Morillo et al. (2016) provided an MCMC estimation of item parameters and trait scores for the MUPP-2PL model. The MUPP-2PL model is just a special case of the commonly known compensatory two-parameter logistic multidimensional IRT model (von Davier, 2008). Its item response function is also equivalent (after transformation) to that of the Thurstonian IRT (TIRT) model (Brown & Maydeu-Olivares, 2011) for blocks with two statements. In addition, Wang et al. (2017) applied the MUPP with the Rasch model (i.e., the multidimensional Rasch model) to FCQ blocks with two dominance statements and emphasized the unique properties associated with the Rasch model.

There have been more extensions and developments in the IRT models for FCQs in recent years; see Zhang et al. (2023, Table 1) for a list. The IRT models for FCQs are complicated, and researchers have primarily used the MCMC method to estimate the models. Maximum likelihood estimation is more mathematically involved and has been developed only for the MUPP-2PL (under the traditional multidimensional IRT models). The current study extends the MUPP-2PL model to accommodate the RANK response format for blocks with more than two dominance statements. We refer to this series of models for different block sizes as Rank-2PL models, and the MUPP-2PL model is just one of them. Focusing on blocks with three statements (triplets), we develop a maximum marginal likelihood estimation with an expectation-maximization algorithm (MML-EM; Bock & Aitkin, 1981; Fu, 2019) to estimate item parameters and their standard errors. We conduct a simulation study to check parameter recovery and demonstrate the use of the model on real FCQ data. Finally, we summarize and discuss the findings and suggest future research.

Table 1.

Possible Ranking Patterns for a Block with Three Statements, A, B, and C

Coded score      1    2    3    4    5    6
Ranking pattern  ABC  ACB  BAC  BCA  CAB  CBA

Model Formulation

The Rank model assumes that ranking K statements in a block is an iterative process (Luce, 2005/1959): the test taker first selects the most agreeable statement from all the statements and continues until selecting the more agreeable statement from the last remaining two. This process is modeled by a probability function. Below, we derive the item response probability function for a triplet item (i.e., K = 3) as an example. Denote $P_{(1,2,3)}(\boldsymbol{\theta})$ as the probability of a ranking order conditional on the column vector of the latent trait score(s) $\boldsymbol{\theta}$ that a test measures. Each statement in an item could measure a common or a different latent trait: the former is referred to as a unidimensional forced-choice item, and the latter as a multidimensional forced-choice item. The subscripts 1, 2, 3 refer to the statements with the ranking scores 1, 2, and 3, respectively (Ranking 1 is the most agreeable statement). Let $P_{(1|1,2,3)}(\boldsymbol{\theta})$ be the conditional probability of selecting the Ranking 1 statement from the three statements, and $P_{(2|2,3)}(\boldsymbol{\theta})$ be the conditional probability of selecting the Ranking 2 statement from the remaining two statements, each given that only one statement is selected from the available three or two statements. Then, based on the Rank model’s assumption,

$$P_{(1,2,3)}(\boldsymbol{\theta}) = P_{(1|1,2,3)}(\boldsymbol{\theta})\,P_{(2|2,3)}(\boldsymbol{\theta}). \tag{1}$$

Denote the selection of a statement as 1, and 0 otherwise. Let $P_{(1,2,3)}(1,0,0 \mid \boldsymbol{\theta})$ be the joint probability of selecting the Ranking 1 statement and not selecting the other two statements. Likewise, define $P_{(1,2,3)}(0,1,0 \mid \boldsymbol{\theta})$ and $P_{(1,2,3)}(0,0,1 \mid \boldsymbol{\theta})$ as the joint probabilities of selecting only the Ranking 2 and Ranking 3 statements, respectively. Then, the conditional probability $P_{(1|1,2,3)}(\boldsymbol{\theta})$ is the ratio of the joint probability of selecting the Ranking 1 statement to the sum of the joint probabilities of selecting the Ranking 1, 2, and 3 statements,

$$P_{(1|1,2,3)}(\boldsymbol{\theta}) = \frac{P_{(1,2,3)}(1,0,0 \mid \boldsymbol{\theta})}{P_{(1,2,3)}(1,0,0 \mid \boldsymbol{\theta}) + P_{(1,2,3)}(0,1,0 \mid \boldsymbol{\theta}) + P_{(1,2,3)}(0,0,1 \mid \boldsymbol{\theta})}. \tag{2}$$

Similarly,

$$P_{(2|2,3)}(\boldsymbol{\theta}) = \frac{P_{(2,3)}(1,0 \mid \boldsymbol{\theta})}{P_{(2,3)}(1,0 \mid \boldsymbol{\theta}) + P_{(2,3)}(0,1 \mid \boldsymbol{\theta})}. \tag{3}$$

Assuming a test taker makes the selecting (1)/not selecting (0) decision on each statement independently of the other statements,

$$P_{(1,2,3)}(1,0,0 \mid \boldsymbol{\theta}) = P_{(1)}(1 \mid \theta_1)\,P_{(2)}(0 \mid \theta_2)\,P_{(3)}(0 \mid \theta_3), \tag{4}$$
$$P_{(1,2,3)}(0,1,0 \mid \boldsymbol{\theta}) = P_{(1)}(0 \mid \theta_1)\,P_{(2)}(1 \mid \theta_2)\,P_{(3)}(0 \mid \theta_3), \tag{5}$$
$$P_{(1,2,3)}(0,0,1 \mid \boldsymbol{\theta}) = P_{(1)}(0 \mid \theta_1)\,P_{(2)}(0 \mid \theta_2)\,P_{(3)}(1 \mid \theta_3), \tag{6}$$
$$P_{(2,3)}(1,0 \mid \boldsymbol{\theta}) = P_{(2)}(1 \mid \theta_2)\,P_{(3)}(0 \mid \theta_3), \tag{7}$$
$$P_{(2,3)}(0,1 \mid \boldsymbol{\theta}) = P_{(2)}(0 \mid \theta_2)\,P_{(3)}(1 \mid \theta_3), \tag{8}$$

where θ1, θ2, and θ3 refer to the latent trait scores measured by the Rankings 1–3 statements, respectively.

For each statement, the probability of selection follows the 2PL model

$$P_{(k)}(1 \mid \theta_k) = \frac{\exp(a_k\theta_k + b_k)}{1 + \exp(a_k\theta_k + b_k)}, \quad\text{and} \tag{9}$$
$$P_{(k)}(0 \mid \theta_k) = 1 - P_{(k)}(1 \mid \theta_k), \tag{10}$$

where k=1,2,3 refers to the statement with Ranking k, ak is the discrimination parameter of the statement with Ranking k, and bk is the intercept.

Inserting Equations 2–10 into Equation 1 and simplifying algebraically gives

$$P_{(1,2,3)}(\boldsymbol{\theta}) = \left[1 + \exp\left(a_2\theta_2 + b_2 - a_1\theta_1 - b_1\right) + \exp\left(a_3\theta_3 + b_3 - a_1\theta_1 - b_1\right)\right]^{-1}\left[1 + \exp\left(a_3\theta_3 + b_3 - a_2\theta_2 - b_2\right)\right]^{-1}. \tag{11}$$

Note that of the three parameters, $b_1$, $b_2$, and $b_3$, only two can be identified in Equation 11. There are different ways to restrict the $b$s: for example, (a) fix one of the $b$s; (b) set the sum of the three $b$s to a constant (e.g., 0); or (c) estimate the differences between the $b$s rather than the individual $b$s, $b_{21} = b_2 - b_1$, $b_{31} = b_3 - b_1$, $b_{32} = b_3 - b_2$, and set $b_{31} = b_{21} + b_{32}$. All of these constraints lead to the same estimates of $b_{21}$, $b_{31}$, and $b_{32}$ and do not change the estimate of $P_{(1,2,3)}(\boldsymbol{\theta})$. If all three statements measure a common trait, then only two of the three $a$s can be identified, and constraints similar to those for the $b$s may be set to identify the model.
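As a concrete illustration of Equation 11, the following minimal R sketch (with made-up parameter and trait values; not taken from the paper) computes the probability of one ranking pattern and verifies that the probabilities of the six patterns in Table 1 sum to 1:

```r
# Rank-2PL probability of one ranking pattern for a triplet (Equation 11).
# a, b, theta are length-3 vectors ordered by ranking
# (element 1 belongs to the statement ranked first).
rank2pl_triplet <- function(a, b, theta) {
  u <- a * theta + b                                   # 2PL logits
  p1 <- 1 / (1 + exp(u[2] - u[1]) + exp(u[3] - u[1]))  # pick Ranking 1 from 3
  p2 <- 1 / (1 + exp(u[3] - u[2]))                     # pick Ranking 2 from 2
  p1 * p2
}

# Illustrative values: probabilities over the six patterns sum to 1.
a <- c(1.2, 0.8, 1.5); b <- c(0, 0.4, -0.3); theta <- c(0.5, -0.2, 1.0)
perms <- list(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
probs <- sapply(perms, function(p) rank2pl_triplet(a[p], b[p], theta[p]))
sum(probs)  # 1
```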

Similarly, for blocks with two statements (K=2)

$$P_{(1,2)}(\boldsymbol{\theta}) = \left[1 + \exp\left(a_2\theta_2 + b_2 - a_1\theta_1 - b_1\right)\right]^{-1}. \tag{12}$$

This is the MUPP-2PL model studied in Morillo et al. (2016). In Equation 12, only one of the two $b$s can be identified; if both statements measure one common trait, then only one of the two $a$s can be identified. Constraints on the $b$s or $a$s similar to those for Equation 11 can be imposed to identify Equation 12. Note that Equation 12 is just a special case of the compensatory two-parameter logistic multidimensional IRT model and is also equivalent to the normal ogive response function of the TIRT model for two-statement blocks (Brown & Maydeu-Olivares, 2011; Morillo et al., 2016).

For blocks with four statements (K=4),

$$\begin{aligned} P_{(1,2,3,4)}(\boldsymbol{\theta}) ={}& \left[1 + \exp\left(a_2\theta_2 + b_2 - a_1\theta_1 - b_1\right) + \exp\left(a_3\theta_3 + b_3 - a_1\theta_1 - b_1\right) + \exp\left(a_4\theta_4 + b_4 - a_1\theta_1 - b_1\right)\right]^{-1} \\ &\times \left[1 + \exp\left(a_3\theta_3 + b_3 - a_2\theta_2 - b_2\right) + \exp\left(a_4\theta_4 + b_4 - a_2\theta_2 - b_2\right)\right]^{-1}\left[1 + \exp\left(a_4\theta_4 + b_4 - a_3\theta_3 - b_3\right)\right]^{-1}. \tag{13} \end{aligned}$$

In Equation 13, only three of the four $b$s can be identified, and if the four statements measure one common trait, only three of the four $a$s can be identified. Thus, constraints on the $b$s or $a$s similar to those for Equation 11 are needed to identify Equation 13.

For blocks with K statements, there are K! ranking patterns. For example, for three statements, A, B, and C, in a block, there are six possible ranking patterns, as shown in Table 1, and they are coded as 1 to 6 in the observed data.

Relationships with Thurstonian IRT (TIRT) Model

The TIRT model is based on Thurstone’s (1927) law of comparative judgment for a pair of statements. The notion of statement utility connects the observed response (e.g., agree/disagree with a statement) with the latent trait that the statement measures. The utility $t_k$ for the statement with ranking score k is defined by a factor model

$$t_k = u_k + \lambda_k\theta_k + \varepsilon_k, \tag{14}$$

where $u_k$ is the intercept, $\lambda_k$ is the factor loading, $\theta_k$ is the latent trait score that the statement measures, and $\varepsilon_k$ is an independent and identically distributed error that follows a normal distribution with mean 0 and variance $\psi_k^2$. The probability of selecting the Ranking 1 statement from a pair of statements is determined by the difference between the two statement utilities:

$$y^*_{(1,2)} = t_1 - t_2 = \lambda_1\theta_1 - \lambda_2\theta_2 - \gamma_{21} + \varepsilon_1 - \varepsilon_2, \tag{15}$$

where $\gamma_{21} = -(u_1 - u_2)$. The probability of selecting the Ranking 1 statement is the probability that $y^*_{(1,2)} > 0$ and is modeled by

$$P^*_{(1,2)}(\boldsymbol{\theta}) = \Phi\left(\frac{\lambda_1\theta_1 - \lambda_2\theta_2 - \gamma_{21}}{\sqrt{\psi_1^2 + \psi_2^2}}\right) \approx \frac{\exp\left(1.702\,\dfrac{\lambda_1\theta_1 - \lambda_2\theta_2 - \gamma_{21}}{\sqrt{\psi_1^2 + \psi_2^2}}\right)}{1 + \exp\left(1.702\,\dfrac{\lambda_1\theta_1 - \lambda_2\theta_2 - \gamma_{21}}{\sqrt{\psi_1^2 + \psi_2^2}}\right)}, \tag{16}$$

where $\Phi$ is the standard normal cumulative distribution function. The first item response function in Equation 16 is a special case of the compensatory two-parameter normal ogive multidimensional IRT model (McDonald, 1997), and the second is a special case of the compensatory two-parameter logistic multidimensional IRT model. For pairs, the model can be identified by setting the error variance of $y^*$ to 1, that is, $\psi_1^2 + \psi_2^2 = 1$. If a multidimensional forced-choice (MFC) form measures only two traits, then, additionally, the factor loadings $\lambda_1$ and $\lambda_2$ in one pair can be fixed to identify the model (Brown & Maydeu-Olivares, 2011). Equations 16 and 12 (the response function for the MUPP-2PL model) are equivalent by setting

$$a_1 = \frac{1.702\,\lambda_1}{\sqrt{\psi_1^2 + \psi_2^2}}, \tag{17}$$
$$a_2 = \frac{1.702\,\lambda_2}{\sqrt{\psi_1^2 + \psi_2^2}}, \tag{18}$$
$$b_{21} = b_2 - b_1 = \frac{1.702\,\gamma_{21}}{\sqrt{\psi_1^2 + \psi_2^2}}. \tag{19}$$

However, the TIRT model is estimated by a limited information method, unweighted least squares with mean- and variance-corrected Satorra-Bentler goodness-of-fit tests (ULSMV), under structural equation modeling (SEM). In contrast, the MUPP-2PL model is estimated by a full information method, maximum marginal likelihood estimation. Therefore, their parameter estimates may differ due to the different estimation methods (Forero & Maydeu-Olivares, 2009).

For a block with more than two statements (i.e., K > 2), the TIRT model recodes a ranking pattern into $\binom{K}{2} = K(K-1)/2$ ranking patterns of pairs. For a triplet, the response pattern of the three statements with Ranking scores 1–3 can be separated into three statement pairs: pair 1 (1, 2), pair 2 (1, 3), and pair 3 (2, 3). Let $\mathbf{C}$ be the 2 × 3 matrix of contrasts,

$$\mathbf{C} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}.$$

In the TIRT model, the probability of the ranking pattern (1, 2, 3) is given by (Maydeu-Olivares, 1999)

$$P^*_{(1,2,3)}(\boldsymbol{\theta}) = P^*\left(y^*_{(1,2)} > 0,\; y^*_{(2,3)} > 0 \mid \boldsymbol{\theta}\right) = \int_0^\infty\!\int_0^\infty \phi\left(\mathbf{y}^*;\, \mathbf{C}\boldsymbol{\mu}_t,\, \mathbf{C}\boldsymbol{\Sigma}_t\mathbf{C}'\right)\, d\mathbf{y}^*, \tag{20}$$

where $\mathbf{t} = (t_1, t_2, t_3)'$, $\boldsymbol{\mu}_t$ and $\boldsymbol{\Sigma}_t$ are the mean vector and covariance matrix of $\mathbf{t}$, $\mathbf{y}^* = (y^*_{(1,2)}, y^*_{(2,3)})'$ follows a bivariate normal distribution with mean vector $\mathbf{C}\boldsymbol{\mu}_t$ and covariance matrix $\mathbf{C}\boldsymbol{\Sigma}_t\mathbf{C}'$, and $\phi$ is the bivariate normal density function. This model is set up as the mean and covariance structures under SEM and estimated by a limited information method, ULSMV. The response functions (marginal probabilities of ranking patterns) for the three pairs are

$$P^*_{(1,2)} = \Phi\left(\frac{\lambda_1\theta_1 - \lambda_2\theta_2 - \gamma_{21}}{\sqrt{\psi_1^2 + \psi_2^2}}\right), \tag{21}$$
$$P^*_{(1,3)} = \Phi\left(\frac{\lambda_1\theta_1 - \lambda_3\theta_3 - \gamma_{31}}{\sqrt{\psi_1^2 + \psi_3^2}}\right), \tag{22}$$
$$P^*_{(2,3)} = \Phi\left(\frac{\lambda_2\theta_2 - \lambda_3\theta_3 - \gamma_{32}}{\sqrt{\psi_2^2 + \psi_3^2}}\right). \tag{23}$$

Equations 21–23 can be identified by fixing the error variance of one statement, for example, setting the error variance of the first statement to 1. The TIRT model accounts for the local dependence by estimating the covariances among the three pairs’ $y^*$s (Equation 15): $\mathrm{cov}(y^*_{(1,2)}, y^*_{(1,3)}) = \psi_1^2$, $\mathrm{cov}(y^*_{(1,2)}, y^*_{(2,3)}) = -\psi_2^2$, and $\mathrm{cov}(y^*_{(1,3)}, y^*_{(2,3)}) = \psi_3^2$. However, the local dependence is considered only in item parameter estimation; for estimating latent trait scores, the local dependence is ignored. Maydeu-Olivares and Brown (2010) argued that this simplification in scoring had little impact on the accuracy of latent score estimates.
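For concreteness, Equation 20 is a bivariate normal rectangle probability; a minimal R sketch (with made-up parameter values, using mvtnorm::pmvnorm) evaluates it directly:

```r
library(mvtnorm)  # provides pmvnorm()

# Hypothetical TIRT parameters for one triplet (Equation 14).
lambda <- c(1.0, 0.7, 1.2)    # factor loadings
u      <- c(0.2, -0.1, 0.5)   # utility intercepts
psi2   <- c(1.0, 0.8, 1.2)    # error variances
theta  <- c(0.5, -0.2, 1.0)   # latent trait scores

mu_t    <- u + lambda * theta # conditional means of the utilities t
Sigma_t <- diag(psi2)         # conditional covariance of t (independent errors)

# Contrast matrix mapping t to y* = (y*(1,2), y*(2,3))'.
C <- rbind(c(1, -1,  0),
           c(0,  1, -1))

# Probability that both utility differences are positive (Equation 20).
pmvnorm(lower = c(0, 0), upper = c(Inf, Inf),
        mean = as.vector(C %*% mu_t), sigma = C %*% Sigma_t %*% t(C))
```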

According to Equation 14, the TIRT model assumes the utilities of statements are locally independent conditional on latent scores. This assumption is the same as the one adopted by the Rank models: a test taker makes an independent selecting/not selecting decision on each statement. The item response function of the Rank-2PL model for triplets is in Equation 11. Based on Equation 11, we can derive the item response function of each pair (like Equations 21–23 for the TIRT model), which has the same form as Equation 12. However, as in the TIRT model, these ranking patterns of pairs are not locally independent, because the probability of a response pattern of a triplet is not equal to the product of the three probabilities of the individual pairs. Comparing Equation 20 to Equation 1, we can see that the TIRT model builds a full joint distribution of a ranking pattern (1, 2, 3), while the Rank-2PL models assume sequential selections to simplify this joint distribution into the product of two independent selection probabilities. Thus, Rank-2PL models are simpler than the TIRT model: the TIRT model has two more parameters (i.e., the error variances of two statements) than the Rank-2PL model for a triplet. Unlike the TIRT model, which assumes local independence among the three pairs when estimating latent trait scores, the Rank-2PL model estimates trait scores directly from the responses to triplets based on Equation 11. These differences between the TIRT and Rank-2PL models for triplets also apply to blocks larger than three statements.

MML-EM Estimation of Item Parameters and Intertrait Correlations

This paper focuses on the blocks with three statements (triplets) for model estimation. For the blocks with two statements, the MML-EM estimation of the multidimensional 2PL model has been discussed previously (e.g., Cai, 2010; Fu, 2019) and can be estimated by many computer programs for multidimensional IRT models, for example, the R package mirt (Chalmers, 2012). For blocks with more than three statements, the estimation and associated program code are similar to those provided for triplets.

Suppose that in an FCQ with triplets, there are I test takers in total responding to J items (i.e., blocks). Let $x_{ij} = s$ denote test taker i’s coded ranking score on item j (s = 1, 2, …, 6; see Table 1), $\mathbf{x}_i$ be test taker i’s coded response vector across the J items, and $\mathbf{X}$ be the $I \times J$ response matrix representing the total observed data. $\boldsymbol{\theta}$ represents a random sample of latent trait score vectors from the test taker population, which follows a standard multivariate normal distribution with a $D \times D$ correlation matrix $\boldsymbol{\Sigma}_\theta$, where D is the number of traits measured by the test. Fixing the population means and standard deviations of $\boldsymbol{\theta}$ is one way to identify the model. $\boldsymbol{\eta}$ represents the vector of item parameters in the whole FCQ, and $\boldsymbol{\eta}_j$ is item j’s parameter vector. $\boldsymbol{\zeta}$ is a column vector including all estimated item parameters and intertrait correlations. Assume $x_{ij}$ is locally independent conditional on $\boldsymbol{\eta}_j$ and $\boldsymbol{\theta}$; the observed data likelihood is then a marginal likelihood

$$L(\boldsymbol{\zeta} \mid \mathbf{X}) = \prod_{i=1}^{I} \int_R \prod_{j=1}^{J} P(x_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\eta}_j)\, \phi(\boldsymbol{\theta} \mid \boldsymbol{\Sigma}_\theta)\, d\boldsymbol{\theta}, \tag{24}$$

where R is a D-dimensional area of integration, $P(x_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\eta}_j)$ is the item response function defined in Equation 11, and $\phi$ is the standard multivariate normal density function with the $D \times D$ correlation matrix $\boldsymbol{\Sigma}_\theta$. Because directly maximizing the observed likelihood is difficult due to the integral in Equation 24, a common solution is the EM algorithm (Bock & Aitkin, 1981; Dempster et al., 1977), which works instead with the complete data loglikelihood

$$l(\boldsymbol{\zeta} \mid \mathbf{X}, \boldsymbol{\Theta}) = \sum_{i=1}^{I}\left[\sum_{j=1}^{J} \log P(x_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\eta}_j) + \log \phi(\boldsymbol{\theta}_i \mid \boldsymbol{\Sigma}_\theta)\right], \tag{25}$$

where $\boldsymbol{\Theta}$ is the $I \times D$ latent trait score matrix for all test takers, $\boldsymbol{\theta}_i$ is test taker i’s latent trait score vector of D traits, and $(\mathbf{X}, \boldsymbol{\Theta})$ is the complete data with observed data $\mathbf{X}$ and missing data $\boldsymbol{\Theta}$. The EM algorithm is an iterative process that repeatedly executes an E step and an M step. In the E step, the expectation of the complete data loglikelihood with respect to the posterior distribution of the missing data is estimated and serves as the EM map. In the M step, the EM map is maximized with respect to the item parameters and the distributional parameters of the traits. For conciseness and readability, we omit many technical details on the two steps and on the standard errors of parameter estimates in the subsequent sections; we provide these details in the supplemental online materials for interested readers. Schematically, the full cycle looks like the sketch below.
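In this schematic R sketch, e_step and m_step are hypothetical placeholders for the computations described in the next two sections and the supplemental materials:

```r
# Schematic MML-EM loop; e_step() and m_step() are hypothetical helpers
# standing in for the computations detailed in the E Step and M Step sections.
em_fit <- function(X, zeta0, max_cycles = 500, tol = 1e-4) {
  zeta <- zeta0
  for (cycle in seq_len(max_cycles)) {
    Q <- e_step(X, zeta)          # E step: expected complete-data loglik (Eq. 26)
    zeta_new <- m_step(Q, zeta)   # M step: maximize over item/correlation parameters
    if (max(abs(zeta_new - zeta)) < tol) {  # max absolute parameter change
      zeta <- zeta_new
      break
    }
    zeta <- zeta_new
  }
  zeta
}
```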

E Step

At EM cycle t, the expected complete data loglikelihood is estimated with respect to the posterior distribution of traits conditional on the current parameter estimates $\boldsymbol{\eta}^t$ and $\boldsymbol{\Sigma}_\theta^t$, that is,

$$Q(\boldsymbol{\zeta}^{t+1} \mid \boldsymbol{\zeta}^{t}) = \sum_{i=1}^{I} \int_R \left[\sum_{j=1}^{J} \log P(x_{ij} \mid \boldsymbol{\theta}, \boldsymbol{\eta}_j^t) + \log \phi(\boldsymbol{\theta} \mid \boldsymbol{\Sigma}_\theta^t)\right] f(\boldsymbol{\theta} \mid \mathbf{x}_i, \boldsymbol{\eta}^t, \boldsymbol{\Sigma}_\theta^t)\, d\boldsymbol{\theta}, \tag{26}$$
$$f(\boldsymbol{\theta} \mid \mathbf{x}_i, \boldsymbol{\eta}^t, \boldsymbol{\Sigma}_\theta^t) = \frac{\prod_{j=1}^{J} P(x_{ij} \mid \boldsymbol{\eta}_j^t, \boldsymbol{\theta})\, \phi(\boldsymbol{\theta} \mid \boldsymbol{\Sigma}_\theta^t)}{\int_R \prod_{j=1}^{J} P(x_{ij} \mid \boldsymbol{\eta}_j^t, \boldsymbol{\theta})\, \phi(\boldsymbol{\theta} \mid \boldsymbol{\Sigma}_\theta^t)\, d\boldsymbol{\theta}}, \tag{27}$$

where $f(\boldsymbol{\theta} \mid \mathbf{x}_i, \boldsymbol{\eta}^t, \boldsymbol{\Sigma}_\theta^t)$ is the posterior distribution function of the trait vector $\boldsymbol{\theta}$ conditional on $\mathbf{x}_i$, $\boldsymbol{\eta}^t$, and $\boldsymbol{\Sigma}_\theta^t$. There are several ways to approximate the posterior distribution of $\boldsymbol{\theta}$ to evaluate the D-fold integral in Equation 26.

The first method uses a grid of $L = q^D$ quadrature points to approximate the prior standard multivariate normal distribution of $\boldsymbol{\theta}$ with the current estimate $\boldsymbol{\Sigma}_\theta^t$, where q is the number of points on each dimension (trait). The quadrature points could be the Gauss-Hermite quadrature points (Davis & Rabinowitz, 1984), their adaptive variants (Schilling & Bock, 2005), or equidistant points within a range (e.g., −6 to 6). The second method is to draw $L_t$ samples from the prior distribution of $\boldsymbol{\theta}$ to approximate the posterior distribution at the tth EM cycle. The $L_t$ samples can be drawn randomly from the standard multivariate normal distribution of $\boldsymbol{\theta}$ with the estimated $\boldsymbol{\Sigma}_\theta^t$; this approach is called the Monte Carlo EM estimation method (MCEM) in mirt. Alternatively, the samples can be drawn by the quasi-Monte Carlo integration method (Morokoff & Caflisch, 1995); this approach is called the quasi-Monte Carlo EM estimation method (QMCEM) in mirt, and a sketch of such node generation follows below. The third method is to draw random samples from the posterior distribution $f(\boldsymbol{\theta} \mid \mathbf{x}_i, \boldsymbol{\eta}^t, \boldsymbol{\Sigma}_\theta^t)$ (Equation 27) for $i = 1, \ldots, I$, such that each sample is an $I \times D$ latent trait score matrix for all test takers. Because the posterior distribution is usually not available in closed form, Markov chain Monte Carlo (MCMC) is used to draw the samples (Cai, 2010; Chen & Zhang, 2021). Thus, this method is referred to as the stochastic approach.
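For example, quasi-Monte Carlo nodes for the second method could be generated as follows (a sketch assuming the randtoolbox package is available; degenerate endpoints are clipped before the normal transform):

```r
library(randtoolbox)  # provides sobol(); an assumed dependency for this sketch

# Draw L quasi-random nodes from N(0, Sigma_theta) for the E-step integral:
# Sobol uniforms -> standard normal scores -> Cholesky to impose Sigma_theta.
qmc_nodes <- function(L, Sigma_theta) {
  D <- ncol(Sigma_theta)
  u <- sobol(L, dim = D)                  # L x D points in (0, 1)
  u <- pmin(pmax(u, 1e-10), 1 - 1e-10)    # guard against 0/1 endpoints
  z <- qnorm(u)                           # independent N(0, 1) scores
  z %*% chol(Sigma_theta)                 # correlated nodes, L x D
}
```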

M Step

In the M step, the expected complete data loglikelihood function (Equation 26) is maximized with respect to estimated parameters ζ. We can use the Newton-Raphson method (Atkinson, 1989) or a quasi-Newton method (e.g., the BFGS algorithm; Broyden, 1970) for the maximization problem. Both methods find the model parameters that maximize Equation 26 iteratively. The search stops when a certain convergence criterion is met, for example, (a) the ratio of the difference of the function values between two consecutive iterations over the current function value is smaller than a threshold (e.g., 1e-6; i.e., the relative tolerance criterion), or (b) a preset maximum number of iterations is reached.
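A minimal sketch of one such update (Q_fun is a hypothetical closure returning the expected complete-data loglikelihood at a parameter vector; stats::optim minimizes, so the objective is negated):

```r
# One M-step update via BFGS; Q_fun stands in for the expected complete-data
# loglikelihood (Equation 26) built in the E step.
m_step_update <- function(par_start, Q_fun) {
  optim(par = par_start,
        fn = function(p) -Q_fun(p),       # optim() minimizes, so negate
        method = "BFGS",
        control = list(reltol = 1e-6,     # relative tolerance criterion (a)
                       maxit = 500))$par  # maximum-iterations criterion (b)
}
```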

When a set of solutions is obtained in the M step, these values are used in the next E step. The EM cycles continue until a convergence criterion is reached. Common criteria are (a) the maximum absolute change of the item parameter estimates between two consecutive EM cycles is smaller than a threshold (e.g., 1e-4), (b) the relative tolerance of the observed loglikelihood is smaller than a threshold, or (c) the EM cycles reach a preset maximum number of cycles.

Three methods fall under the stochastic approach. The first is also called the Monte Carlo EM algorithm in the literature (e.g., Meng & Schilling, 1996). The other two are the stochastic EM algorithm (Zhang et al., 2020) and the Metropolis-Hastings Robbins-Monro (MHRM) algorithm (Cai, 2010).

Standard Errors of Parameter Estimates

The standard errors (SEs) of the item parameter and intertrait correlation estimates from the MML-EM algorithm can be estimated by Louis’s (1982) observed information function. Louis’s observed information is consistent when the model is correctly specified; when the model is misspecified, the sandwich-type covariance matrix provides the best SE estimates (Yuan et al., 2014). Note that estimating parameter SEs based on the Hessian matrix of the expected complete data loglikelihood underestimates the parameter SEs because it includes information from the missing data (i.e., $\boldsymbol{\Theta}$) that should be excluded. Chalmers (2018) introduced another convenient estimation of the observed-likelihood information from Equation 26 by Oakes’s identity. For the stochastic approach, Cai (2010) provided an estimation of Louis’s observed information for the MHRM algorithm.

Simulation Study

We conducted a simulation study to check the parameter recovery of the Rank-2PL model with triplets.

Simulation Design

Four simulation factors were manipulated:

  1. Number of traits: 5 and 15. These represent relatively small and large numbers of traits, respectively, measured by an FCQ; both occur in real FCQs.

  2. Statement direction within a block: same and mixed. Previous studies (e.g., Brown & Maydeu-Olivares, 2011; Bürkner et al., 2019) showed that the keyed direction of statements within a block had a significant impact on trait score estimates in the TIRT model: mixed keyed statements led to much better estimation than equally keyed statements. In the current study, under the same-direction condition, the three statements within a block were all positive or all negative, with half of the blocks positive and half negative. Under the mixed condition, half of the blocks had one statement keyed in the opposite direction from the other two.

  3. Estimation of the intertrait correlation matrix: estimated and fixed. The intertrait correlation matrix can be freely estimated or fixed based on prior knowledge (e.g., from a previous administration of a Likert scale with the same statements). Previous studies (e.g., Bürkner et al., 2019) have shown that estimating the intertrait correlation matrix can cause unstable or infeasible estimation, especially for a high-dimensional model. By fixing the correlation matrix, estimations often converge easily and become more stable.

  4. Estimation methods: QMCEM and MHRM. As described previously, QMCEM uses a relatively small number of samples (5,000 in the current study) with fixed marginal cumulative probabilities on each trait to approximate the trait space at each EM cycle, while MHRM draws samples using MCMC from the posterior distribution of traits for each test taker at each EM cycle. Thus, QMCEM was expected to run faster but be less accurate than MHRM.

The initial design was to cross all four factors; however, estimating intertrait correlations for 15 traits turned out to be infeasible: the estimations always terminated abruptly because of an ill-conditioned estimate of the correlation matrix (e.g., a nonpositive definite correlation matrix). Thus, this factor was only partially crossed with the other three factors, resulting in 12 combinations in total.

Simulated Data Generation

In every condition, each statement in a triplet item measured a different trait, and each trait was measured by 12 statements in a test, resulting in 20 items (60 statements) for the 5-trait condition and 60 items (180 statements) for the 15-trait condition. For the 5-trait condition, the trait allocation to the 20 items was generated by repeating the C(5, 3) = 10 combinations of three traits twice. The discrimination parameters $a_{jd}$ were sampled from a lognormal distribution with a mean of 1.45 and a standard deviation (SD) of .79 on the normal scale. The intercept parameters $b_{jd}$ were sampled from a normal distribution with a mean of .44 and an SD of 1.54. Then, for each item, the $b_{jd}$ of the first statement was subtracted from the $b_{jd}$s of the three statements so that the $b_{jd}$ of the first statement became 0. The adjusted $b_{jd}$s were the generating parameters, and the $b_{jd}$ of the first statement was fixed to 0 during estimation to identify the models. All item parameters were generated once and fixed during the simulation. The distributional parameters of the item parameters came from real Likert tests, which measured 15 interpersonal and intrapersonal skills important in matching applicants with college majors. For the same-direction condition, the first ten items had positive statements, and the second ten had negative statements, created by multiplying the generated discrimination parameters by −1. For the mixed-direction condition, one statement was changed to the opposite keyed direction from the other two statements in half of the 20 items from the same-direction condition. The change of keyed direction was balanced in terms of the five traits and positive/negative items, so that one statement direction on each trait was changed in five positive and five negative items. The true intertrait correlations were also taken from the real Likert tests; the correlations among Traits 1, 6, 8, 11, and 13 were used for the 5-trait condition (see Table 2). The traits were sampled from the standard multivariate normal distribution with this correlation matrix. The sample size of test takers was 1,000 for all conditions. Then, based on the Rank-2PL response function for triplets (Equation 11), 100 simulated datasets were generated for the 5-trait condition. Each score of every item in a simulated dataset had at least five test takers.

Table 2.

Intertrait Correlation Matrix in Generating Simulated Data

Trait    1     2     3     4     5     6     7     8     9    10    11    12    13    14

2      .54
3      .62   .62
4      .29   .33   .31
5      .58   .65   .72   .28
6      .41   .34   .36   .16   .51
7      .37   .36   .45   .15   .58   .74
8      .52   .21   .21   .01   .23   .43   .22
9      .33   .07   .08   .16   .06   .35   .23   .51
10     .09  −.19  −.25  −.20  −.25   .19   .00   .39   .52
11     .41   .27   .38   .29   .33   .34   .48   .24   .49   .13
12     .18   .11   .22   .18   .23   .40   .48   .13   .32   .07   .53
13     .26   .13   .18  −.06   .20   .06   .18   .17   .20   .13   .28   .16
14     .43   .25   .38  −.10   .45   .25   .37   .31   .13   .02   .29   .17   .54
15     .46   .19   .27  −.12   .33   .37   .33   .49   .32   .22   .28   .17   .53   .63

The datasets in the 15-trait condition were generated similarly. For a cell in the 15-trait condition, the 20 items in the corresponding cell in the 5-trait condition were repeated three times, resulting in 60 items, and each set of 20 items measured a different set of 5 traits so that 15 traits were measured in total. Item parameters were generated once based on the same distributions as in the 5-trait condition. The intertrait correlation matrix came from the real Likert tests (Table 2). Because estimating a model with 15 traits took much longer than estimating one with five traits, only 50 simulated datasets with 1,000 test takers were generated in each cell. Each score in a simulated item had at least one test taker.

Estimation

We mainly used the mirt 1.34 package (Chalmers, 2012) in the 64-bit R 4.1.2 (R Core Team, 2021) to estimate and analyze the Rank-2PL model with triplets. The mirt program provides a function, createItem, for users to define their own models, and then all the estimation and analysis modules in mirt can be used to estimate and further analyze the models. These include, for example, different kinds of EM algorithms, standard error estimations of parameters, and trait score estimations; all kinds of fit statistics for model, item, and person; item/test information; reliability of trait score estimates; and all kinds of informative/diagnostic plots (e.g., item/test characteristic/empirical response curve).

For the QMCEM algorithm, mirt uses only the Halton sequence. As Morokoff and Caflisch (1995) suggested, the Halton sequence is appropriate for dimensions up to 6, and the Sobol sequence performs best for higher numbers of dimensions. We modified the relevant functions in mirt to use the Sobol sequence for estimations with dimensions larger than 6. For the MHRM algorithm, the mirt program provides the MHRM SE of parameter estimates by default. However, mirt estimates the MHRM SEs separately after the parameter estimation is done, and it takes much longer to obtain SE estimates than parameter estimates. We modified the mirt functions to save running time so that parameters and their SE estimations were done simultaneously. For the QMCEM algorithm, the mirt program provides Oakes’s SE of parameter estimates by default. Because it took too long to estimate Oakes’s SE in the 15-trait condition, the SEs of parameters were not estimated in the 15-trait QMCEM condition. The starting values of the discrimination parameters were set to 1 for positive statements and −1 for negative statements. All the starting values of the intercepts and intertrait correlations were set to 0 and .25, respectively. All trait scores were estimated by the maximum likelihood method. Otherwise, the default settings in mirt were used for all estimations and analysis functions. The supplemental online materials provide the R code to simulate and estimate data for the Rank-2PL models for pairs and triplets. All analyses were conducted on a computer with an Intel Core i7 CPU and 16GB RAM.
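As an illustration of the general approach, below is a stripped-down sketch of registering the Rank-2PL triplet response function (Equation 11) as a custom item via mirt's createItem. For simplicity it assumes the three statements load on the first three columns of Theta; the actual code used in this study, which maps each statement to its own trait and mirrors the study settings, is in the supplemental online materials.

```r
library(mirt)

# Trace function for a Rank-2PL triplet (Equation 11). This sketch assumes
# the three statements load on traits 1-3 of Theta (an N x D matrix); par
# holds the three discriminations followed by the three intercepts. Returns
# an N x 6 matrix of probabilities for the coded ranking patterns in Table 1.
P.rank2pl <- function(par, Theta, ncat) {
  a <- par[1:3]; b <- par[4:6]
  U <- sweep(sweep(Theta[, 1:3, drop = FALSE], 2, a, "*"), 2, b, "+")
  E <- exp(U)
  perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3),   # ABC, ACB, BAC,
                 c(2,3,1), c(3,1,2), c(3,2,1))   # BCA, CAB, CBA (Table 1)
  sapply(seq_len(nrow(perms)), function(s) {
    p <- perms[s, ]
    (E[, p[1]] / rowSums(E)) * (E[, p[2]] / (E[, p[2]] + E[, p[3]]))
  })
}

rank2pl <- createItem(name = 'rank2pl',
                      par = c(a1 = 1, a2 = 1, a3 = 1, b1 = 0, b2 = 0, b3 = 0),
                      est = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),  # fix b1 = 0
                      P = P.rank2pl)
# A fit call would then look roughly like (data coded 1-6 per Table 1):
# fit <- mirt(data, model = 3, itemtype = 'rank2pl',
#             customItems = list(rank2pl = rank2pl), method = 'MHRM')
```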

Evaluation Criteria

We calculated the average relative bias and root mean square error (RMSE) for each discrimination, intercept, or intertrait correlation estimate

$$\text{Average Relative Bias} = \frac{1}{R}\sum_{r=1}^{R} \frac{\hat{\xi}_r - \xi}{|\xi|}, \tag{28}$$
$$\text{RMSE} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\xi}_r - \xi\right)^2}, \tag{29}$$

where $\hat{\xi}_r$ is the estimate of a parameter from replicate dataset r, $\xi$ is the parameter’s true value, and R (equal to 50 or 100) is the number of replicate datasets. For the parameters whose SEs were estimated, we also calculated the average relative bias and RMSE of each estimator’s SE estimates by comparing them to their empirical standard deviation (SD)

$$SD(\hat{\xi}) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\xi}_r - \frac{1}{R}\sum_{r=1}^{R}\hat{\xi}_r\right)^2}, \tag{30}$$
$$\text{Average Relative SE Bias} = \frac{1}{R}\sum_{r=1}^{R}\frac{SE(\hat{\xi}_r) - SD(\hat{\xi})}{SD(\hat{\xi})}, \tag{31}$$
$$\text{RMSE\_SE} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(SE(\hat{\xi}_r) - SD(\hat{\xi})\right)^2}. \tag{32}$$

These average relative bias and RMSE statistics were then averaged within each parameter group (i.e., discrimination, intercept, and intertrait correlation) and compared.
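In R, these criteria reduce to a few helper functions (a sketch; est and se are the vectors of one parameter's R replicate estimates and SE estimates, and true is its true value):

```r
# Equation 28: average relative bias of R replicate estimates.
avg_rel_bias <- function(est, true) mean((est - true) / abs(true))
# Equation 29: root mean square error.
rmse <- function(est, true) sqrt(mean((est - true)^2))
# Equations 30-32: empirical SD of the estimates and the SE recovery criteria.
emp_sd <- function(est) sqrt(mean((est - mean(est))^2))
avg_rel_se_bias <- function(se, est) mean((se - emp_sd(est)) / emp_sd(est))
rmse_se <- function(se, est) sqrt(mean((se - emp_sd(est))^2))
```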

For the score estimates of each trait in a simulated dataset, we calculated (a) the squared correlation between the estimated and true scores, which we call the true reliability; (b) the empirical reliability (Kim, 2012) for maximum likelihood score estimates, given as

$$\hat{\rho}^2_{MAR} = \frac{\mathrm{var}(\hat{\theta}_d) - \sum_{i=1}^{I}\hat{e}_{\hat{\theta}_{id}}/I}{\mathrm{var}(\hat{\theta}_d)}, \tag{33}$$

where $\mathrm{var}(\hat{\theta}_d)$ is the sample variance of trait d’s score estimates, and $\hat{e}_{\hat{\theta}_{id}}$ is the error variance estimate of test taker i’s estimated score $\hat{\theta}_{id}$ on trait d; and (c) the RMSE between the estimated and true scores, given as

$$\text{RMSE}(\hat{\theta}_d, \theta_d) = \sqrt{\frac{1}{I}\sum_{i=1}^{I}\left(\hat{\theta}_{id} - \theta_{id}\right)^2}, \tag{34}$$

where I is the number of test takers in a simulated dataset with estimable trait scores and SEs. These statistics excluded test takers with nonconverged or infinite/inestimable trait estimates/SEs. Then, these three statistics’ means and standard deviations across the 100/50 simulated datasets were compared.
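A sketch of statistics (a)–(c) for one trait (theta_hat and se_hat are the ML score estimates and their SEs; theta is the vector of true scores):

```r
# (a) "True" reliability: squared correlation of estimated and true scores.
true_rel <- function(theta_hat, theta) cor(theta_hat, theta)^2
# (b) Empirical reliability for ML estimates (Equation 33): sample variance
#     of the estimates minus the mean error variance, over the sample variance.
emp_rel <- function(theta_hat, se_hat) {
  (var(theta_hat) - mean(se_hat^2)) / var(theta_hat)
}
# (c) RMSE between estimated and true scores (Equation 34).
rmse_scores <- function(theta_hat, theta) sqrt(mean((theta_hat - theta)^2))
```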

Results

Table 3 lists the mean estimation times and the average relative biases and RMSEs of the discrimination and intercept parameter estimates and SEs in all the simulation conditions, and Table 4 lists those for the intertrait correlations in the 5-trait condition. We have several observations based on these tables.

Table 3.

Average Relative Biases and RMSEs of Item Parameter Estimates and Standard Errors (SE) in Simulated Data

No. of traits | Est. method | Keyed direction | Est. corr.? | Est. time (min) | Disc. rel. bias | Disc. RMSE | Disc. SE inest. (%) | Disc. SE rel. bias | Disc. SE RMSE | Int. rel. bias | Int. RMSE | Int. SE inest. (%) | Int. SE rel. bias | Int. SE RMSE
5 | QMCEM | Same | Y | 34 | .00 | .24 | 0 | .03 | .05 | −.01 | .14 | 0 | .13 | .06
5 | QMCEM | Same | N | 29 | .00 | .26 | 0 | .08 | .06 | −.01 | .14 | 0 | .12 | .05
5 | QMCEM | Mixed | Y | 37 | .00 | .22 | 0 | .03 | .06 | −.01 | .15 | 0 | .11 | .07
5 | QMCEM | Mixed | N | 31 | .00 | .28 | 0 | .07 | .06 | −.01 | .16 | 0 | .10 | .07
5 | MHRM | Same | Y | 52 | .00 | .14 | 11 | −.09 | .07 | .00 | .14 | 6 | −.08 | .07
5 | MHRM | Same | N | 51 | .00 | .14 | 9 | .00 | .07 | .00 | .14 | 6 | −.03 | .07
5 | MHRM | Mixed | Y | 54 | .00 | .14 | 13 | .19 | .10 | .01 | .15 | 8 | .08 | .09
5 | MHRM | Mixed | N | 54 | .00 | .14 | 9 | .05 | .08 | .01 | .15 | 6 | .01 | .08
15a | QMCEM | Same | N | 354 | −.02 | .55 | — | — | — | .02 | .33 | — | — | —
15a | QMCEM | Mixed | N | 370 | −.01 | .56 | — | — | — | .03 | .36 | — | — | —
15a | MHRM | Same | N | 1270 | .00 | .17 | 10 | −.11 | .10 | −.01 | .16 | 5 | −.08 | .11
15a | MHRM | Mixed | N | 890 | .00 | .19 | 4 | −.09 | .11 | .01 | .17 | 2 | −.04 | .12

Note. — = Standard errors of item parameters were not estimated in this condition due to extremely long run time.

a Intertrait correlations were fixed in this condition due to estimation difficulty.

Table 4.

Average Relative Biases and RMSEs of Intertrait Correlation Estimates and Standard Errors in Simulated Data with 5 Traits

Estimation method | Keyed direction | Est. rel. bias | Est. RMSE | SE inest. (%) | SE rel. bias | SE RMSE
QMCEM | Same | −.87 | .20 | 0 | .18 | .01
QMCEM | Mixed | −.47 | .12 | 0 | .20 | .01
MHRM | Same | −.04 | .04 | 10 | −.26 | .01
MHRM | Mixed | .01 | .03 | 15 | .17 | .02

First, for the item parameter estimates, the average relative biases were very low (between 0% and 3% in magnitude) for all conditions. For the average RMSEs of the discrimination parameters, MHRM performed better than QMCEM (on average, .14 vs. .25 for the 5-trait condition and .18 vs. .56 for the 15-trait condition), while statement direction and estimation of intertrait correlations had no significant effect. The average RMSEs for the intercepts were similar across all conditions (between .14 and .17) except for the QMCEM 15-trait condition, where the average RMSEs more than doubled (.33 and .36 for the same and mixed keyed directions, respectively).

Second, for the item parameter SE estimates, the SEs of the discrimination parameters appeared to be overestimated (an average relative bias of 19%) in the 5-trait MHRM mixed-direction free-correlation condition, whereas the average RMSEs were similar across all conditions (between .05 and .11). The average relative biases for the intercept SEs were small in the MHRM fixed-correlation conditions (between 1% and 4% in magnitude), and the average RMSEs were similar across all conditions (between .05 and .12). In MHRM, 9% of discrimination SEs and 5% of intercept SEs, on average, were inestimable.

Third, for the intertrait correlation estimates in the 5-trait condition, MHRM performed significantly better than QMCEM regarding both average relative biases and RMSEs. The mixed-direction condition in QMCEM had better average relative biases (−47% vs. −87%) and RMSEs (.12 vs. .20) than the same-direction condition; however, both were too inaccurate to be acceptable. For the intertrait correlation SE estimates, all the average relative biases were relatively large in magnitude (between 17% and 26%); however, this was not a concern as their empirical standard deviations were very small (between .03 and .07). Their RMSEs were no larger than .02, indicating the SEs were well estimated. Note that for MHRM, 10% and 15% of correlation SEs were inestimable in the same-direction and mixed-direction conditions, respectively.

Finally, the MHRM runs were much slower than the QMCEM runs. On average, for the 5-trait condition, a QMCEM run took about half an hour, while an MHRM run took nearly an hour; for the 15-trait condition, a QMCEM run took about six hours, while an MHRM run took about 15 hours in the mixed-direction condition and 21 hours in the same-direction condition.

Table 5 shows the means and standard deviations of the trait score estimates’ true reliabilities, empirical reliabilities, and RMSEs using estimated item parameters and intertrait correlations. The mixed-direction condition had better trait score estimates than the same-direction condition: about 8%–9% and 5%–7% improvement on the mean true and empirical reliabilities, respectively, while the estimation method and intertrait correlation estimation conditions had no significant impact. Compared to the true reliabilities, the empirical reliabilities overestimated the reliabilities by an average of 9%–13% in the same-direction condition and 6% in the mixed-direction condition. For mean RMSEs, those for the mixed-direction condition were smaller than those for the same-direction condition by .11 to .20. The estimation method and intertrait correlation estimation conditions had no significant effect on mean RMSEs in the 5-trait condition. In contrast, MHRM had smaller mean RMSEs by .32 and .36 than QMCEM in the 15-trait same- and mixed-direction conditions, respectively. Table 6 lists the same statistics of the trait score estimates as Table 5 but is based on true item parameters and intertrait correlations. The corresponding statistics between the two tables were very close, indicating that the model’s estimated traits under various conditions were close to the optimal estimated values based on true item parameters and intertrait correlations. However, there was one exception: the mean RMSEs in the 15-trait QMCEM condition in Table 5 were larger by .31 and .34 than in Table 6. This is consistent with the large average RMSEs on item parameter estimates under this condition, as shown in Table 3. All standard deviations in Tables 5 and 6 were small, between .03 and .09, indicating that the reliability and trait score estimates were quite stable across replicated datasets.

Table 5.

Mean and Standard Deviation (SD) of True Reliabilities, Empirical Reliabilities, and RMSEs of Trait Score Estimates Based on Estimated Item Parameters and Intertrait Correlations

No. of traits | Est. method | Keyed direction | Est. corr.? | True rel. mean | True rel. SD | Emp. rel. mean | Emp. rel. SD | RMSE mean | RMSE SD
5 | QMCEM | Same | Y | .69 | .06 | .76 | .08 | .85 | .06
5 | QMCEM | Same | N | .70 | .06 | .77 | .06 | .79 | .06
5 | QMCEM | Mixed | Y | .78 | .05 | .83 | .03 | .67 | .05
5 | QMCEM | Mixed | N | .78 | .06 | .83 | .05 | .68 | .05
5 | MHRM | Same | Y | .70 | .06 | .77 | .07 | .78 | .07
5 | MHRM | Same | N | .70 | .06 | .77 | .06 | .78 | .06
5 | MHRM | Mixed | Y | .78 | .05 | .83 | .03 | .65 | .05
5 | MHRM | Mixed | N | .78 | .05 | .83 | .03 | .65 | .05
15 | QMCEM | Same | N | .67 | .09 | .73 | .08 | 1.15 | .04
15 | QMCEM | Mixed | N | .75 | .07 | .78 | .08 | .99 | .04
15 | MHRM | Same | N | .67 | .09 | .76 | .08 | .83 | .05
15 | MHRM | Mixed | N | .76 | .06 | .81 | .06 | .63 | .04

Table 6.

Mean and Standard Deviation (SD) of True Reliabilities, Empirical Reliabilities, and RMSEs of Trait Score Estimates Based on True Item Parameters and Intertrait Correlations

No. of traits | Est. method | Keyed direction | True rel. mean | True rel. SD | Emp. rel. mean | Emp. rel. SD | RMSE mean | RMSE SD
5 | QMCEM | Same | .71 | .05 | .77 | .05 | .76 | .06
5 | QMCEM | Mixed | .78 | .05 | .83 | .03 | .64 | .04
5 | MHRM | Same | .71 | .05 | .77 | .05 | .76 | .06
5 | MHRM | Mixed | .79 | .04 | .83 | .03 | .64 | .05
15 | QMCEM | Same | .68 | .09 | .76 | .07 | .84 | .04
15 | QMCEM | Mixed | .77 | .05 | .82 | .04 | .65 | .03
15 | MHRM | Same | .67 | .09 | .76 | .07 | .86 | .04
15 | MHRM | Mixed | .77 | .05 | .82 | .04 | .64 | .03

In sum, the simulation study produced the following findings:

  1. MHRM had better parameter estimates than QMCEM in general. The improvements were especially significant on the intertrait correlation estimates in the 5-trait condition and the item parameter and trait score estimates in the 15-trait condition.

  2. The mixed-direction condition had moderate improvement in the trait score estimates compared to the same-direction condition regarding reliability and RMSE; however, the improvement was not seen in the item parameter and intertrait correlation estimates.

  3. In general, fixing or estimating the intertrait correlations did not impact model estimation in the 5-trait condition. For the 15-trait condition, estimating the intertrait correlations proved infeasible.

  4. None of the conditions significantly affected the SE estimates of item parameters and intertrait correlations. For MHRM, a small portion of SEs could not be estimated.

  5. The empirical reliability overestimated the squared correlation between true and estimated trait scores by 6%–13% in our simulation study.

  6. The MHRM runs took longer to converge than the QMCEM runs, especially for high dimensions: on average (in minutes), 53 vs. 33 in the 5-trait condition and 1,080 vs. 362 in the 15-trait condition.

Real Data Application

The Rank-2PL model was applied to a real FCQ form with triplets. The form contained 60 items (blocks) with 180 statements, five of which were ideal point statements; the remainder were dominance statements. Each statement in an item measured a different trait. These statements measured 14 interpersonal and intrapersonal skills essential to higher education and career success, such as perseverance, leadership, creativity, curiosity, responsibility, and self-discipline. The number of statements per trait ranged from 5 to 18. There were 27 items with three negative statements, 29 with three positive statements, one with one positive and two negative statements, and three with two positive and one negative statement. The number of test takers without any missing responses was 552.

As a comparison, two Rank-2PL models were fitted to the real data: one used fixed item parameters from a 2PL calibration of the Likert data of these statements in a previous administration, and the other directly estimated the item parameters from the triplet data using the method developed in this paper. For the free calibration, the MHRM algorithm was applied because it produced better parameter estimates than QMCEM in our simulation study. For the fixed-parameter estimation, mirt does not allow the MHRM algorithm; thus, the QMCEM algorithm with the Sobol sequence was used. Because the fixed-parameter estimation involves no calibration per se, the difference between the two methods of approximating the posterior distribution of latent trait scores has little impact on the comparison of the fixed and free models. For both estimations, the intertrait correlation matrix was fixed at the values from the previous administration of the Likert scales (i.e., Table 2 with Trait 10 removed). For the free model, the intercept of the first statement in each item was fixed to its value from the Likert scale to identify the model. The estimation programs employed in the simulation study were also used here.

We compared the two models using model selection and fit statistics. Table 7 lists the two models’ loglikelihood, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) values. From the model selection perspective, the free model was preferred.

Table 7.

Comparison of Loglikelihood, AIC, and BIC

Model | logLik | AIC | AIC_dif a | BIC | BIC_dif a
Fixed | −59134 | 118868 | | 120162 |
Free | −53679 | 107958 | 10910 | 109252 | 10910

a Model above − model below.

Table 8 shows the M2* chi-square statistics (Cai & Hansen, 2013) and the related M2*-based RMSEA statistics (root mean square error of approximation; Maydeu-Olivares & Joe, 2014) for the two models. Although the M2* statistic was significant for both models, the RMSEA shows that the free model was a close fit to the data, while the fixed model was not even an adequate fit based on Maydeu-Olivares and Joe’s (2014) criteria (<.05, close fit; <.089, adequate fit).

Table 8.

Comparison of M2* and RMSEA

Model | M2* | DF | P | RMSEA | RMSEA_5 a | RMSEA_95 a
Fixed | 8735 | 1530 | .000 | .092 | .090 | .094
Free | 3381 | 1530 | .000 | .047 | .045 | .049

a 90% confidence interval of RMSEA.

Table 9 lists the number and percentage of misfitting items and item pairs for each model based on the zh (Drasgow et al., 1985) and LD-X² (Chen & Thissen, 1997) statistics, respectively. For the zh statistic, the conventional criterion, zh < −1.64, was used to flag misfitting items at the .05 significance level. The LD-X² was converted to Cramer’s V, and a value larger than .3 was used as the flagging criterion to indicate a strong association between two items. No item or pair was flagged for the free model, while the fixed model had 23% of items and 56% of item pairs identified as misfitting.

Table 9.

Comparison of Number (N) and Percentage (%) of Misfit Items

Model | zh a: N | zh a: % | LD-X² b: N | LD-X² b: %
Fixed | 14 | 23 | 995 | 56
Free | 0 | 0 | 0 | 0

a Flagging criterion: zh < −1.64.

b Flagging criterion: Cramer’s V > .3.

Summary and Discussion

For the first time, we developed an MML-EM estimation of the Rank-2PL model for triplets in FCQs. FCQs used in serious applications are characterized by a high number of dimensions. To address this, we described various options in the E step for evaluating the integral of the loglikelihood over high dimensions: the Monte Carlo EM algorithm (MCEM), the stochastic EM algorithm, and the Metropolis-Hastings Robbins-Monro (MHRM) algorithm. These algorithms make the MML-EM estimation of the Rank-2PL model with high dimensions feasible. We also described different estimation methods for the standard errors of model parameters.

We conducted a simulation study to check the parameter recovery of the Rank-2PL model for triplets. The parameter recovery was satisfactory in most cases. We recommend MHRM over QMCEM as the estimation method, especially for higher numbers of traits, because MHRM achieved more accurate item parameter and intertrait correlation estimates. This is consistent with the nature of the two algorithms. The drawbacks of MHRM include longer estimation times (because of MCMC sampling) and some inestimable (negative) SEs of model parameters. With five traits, we could always estimate the intertrait correlations successfully and reasonably accurately by MHRM, while Bürkner et al. (2019, Table 2) reported success rates between .28 and .38 for their mixed-direction condition with 45 triplet items. This difference may be due to different model configurations or estimation methods: Bürkner et al. estimated TIRT models in Mplus by the ULSMV method, a limited information factor analysis method under structural equation modeling. However, for FCQs with more traits (like the 15 in our study), estimating intertrait correlations seems infeasible; fixing the correlation matrix based on prior knowledge is necessary for successful model estimation.

We found that mixed-direction items moderately improved the trait score estimates compared to same-direction items in terms of trait score reliability and accuracy; however, this improvement was not seen in the item parameter and intertrait correlation estimates. Including mixed-direction items in FCQs was strongly recommended by Brown and Maydeu-Olivares (2011) from the psychometric perspective, while this recommendation is considered unrealistic because it conflicts with the goal of controlling test takers’ impression management that FCQs are meant to achieve (Bürkner et al., 2019). Brown and Maydeu-Olivares (2011) gave this recommendation mainly based on their simulation results with two traits; their results with five traits (Tables 6 and 7 in their paper) showed a pattern similar to ours. Bürkner et al. (2019) reported a large improvement in trait score estimation with five traits for mixed-direction items; however, this might be due to the particular conditions they simulated. Reports have shown that mixed-direction items are unnecessary for FCQs with more traits to achieve satisfactory psychometric properties (Bürkner et al., 2019; Brown & Maydeu-Olivares, 2012). Our study suggests that even with five traits, researchers can design a psychometrically sound FCQ without mixed-direction items by including an appropriate number of items with good statistical quality.

The Rank-2PL model was applied to a real triplet form. We compared two models, one freely estimating the item parameters and the other fixing the item parameters at values from previous Likert scales. The latter is a common practice due to the lack of available estimation methods for IRT models for FCQs. We demonstrated how model comparison can be done with model selection criteria (loglikelihood, AIC, and BIC) and model/item fit statistics. We found that the free model fit the FCQ data much better than the fixed model. This is understandable, considering that the fixed parameters came from a different population and test type. Therefore, caution should be exercised when using the fixed model in practice without fully verifying its psychometric properties (Fu et al., 2024). Ideally, the estimation model should be updated periodically as more FCQ data become available to incorporate new information. This practice becomes feasible only when estimation programs for FCQ IRT models, like the MML-EM estimation for the Rank-2PL model, are available.

Nowadays, data from psychological and educational tests, such as FCQ data and cognitive diagnostic test data, are often characterized by a high number of dimensions. Using the MML-EM method to estimate high-dimensional data is challenging due to the demands on computer RAM capacity and CPU power. The situation has improved in recent years with more computing power and more efficient estimation algorithms, such as those introduced in this paper. These algorithms have been implemented in many computer programs for multidimensional IRT models, including the mirt program used here, flexMIRT (Cai, 2023), and MIRT (Haberman, 2013). Successful MML-EM estimation of multidimensional test data with multidimensional IRT models has been demonstrated in numerous papers (e.g., Cai, 2010; Garnier-Villarreal et al., 2021; von Davier, 2008), including the current study. However, most applications have dealt with fewer than ten dimensions; calibrating test data with more than ten dimensions using a full information method is still challenging. This paper shows that the Rank-2PL model for triplets can handle 15 dimensions with a fixed correlation matrix of latent scores. For data with considerably more dimensions, such as 30, a limited information method, such as unweighted least squares (ULS), may be more appropriate (Brown & Maydeu-Olivares, 2012), provided that estimating the covariance matrix of parameter estimates is not required; it takes considerable time to compute the covariance matrix for high-dimensional data, even under the ULS method.

We conclude our paper with suggestions for future research. First, the design of the current simulation study is not comprehensive. Future studies can examine additional conditions on intertrait correlations, item parameters, number of traits, number of items per trait, sample size, and keyed direction. Another interesting topic is to compare the Rank-2PL model for triplets with the TIRT model under various simulated conditions; we discussed the relationships between the two models above, and an empirical study would help us understand these relationships further. Finally, differential item/test functioning (DIF/DTF) analysis is usually recommended for checking the fairness of an FCQ. A study of DIF/DTF under the Rank-2PL model would therefore shed light on IRT-based DIF/DTF analysis methods for FCQs. A multigroup Rank-2PL model may be developed for this purpose.

Supplementary Material

Technical Details of the MML-EM Estimation
Code Package

References

  1. Atkinson KE (1989). An introduction to numerical analysis (2nd ed.). John Wiley.
  2. Birkeland SA, Manson TM, Kisamore JL, Brannick MT, & Smith MA (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of Selection and Assessment, 14(4), 317–335. 10.1111/j.1468-2389.2006.00354.x
  3. Birnbaum A (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord FM & Novick MR (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley Pub. Co.
  4. Bock RD, & Aitkin M (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. 10.1007/BF02293801
  5. Brown A (2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(4), 135–160. 10.1007/s11336-014-9434-9
  6. Brown A, & Maydeu-Olivares A (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. 10.1177/0013164410375112
  7. Brown A, & Maydeu-Olivares A (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44, 1135–1147. 10.3758/s13428-012-0217-x
  8. Broyden CG (1970). The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1), 76–90. 10.1093/imamat/6.1.76
  9. Bürkner P-C, Schulte N, & Holling H (2019). On the statistical and practical limitations of Thurstonian IRT models. Educational and Psychological Measurement, 79(5), 827–854. 10.1177/0013164419832063
  10. Cai L (2010). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57. 10.1007/s11336-009-9136-x
  11. Cai L (2023). flexMIRT (Version 3.6): Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Vector Psychometric Group. https://vpgcentral.com/software/flexmirt/
  12. Cai L, & Hansen M (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245–276. 10.1111/j.2044-8317.2012.02050.x
  13. Chalmers RP (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
  14. Chalmers RP (2018). Numerical approximation of the observed information matrix with Oakes’ identity. British Journal of Mathematical and Statistical Psychology, 71(3), 415–436. 10.1111/bmsp.12127
  15. Chen WH, & Thissen D (1997). Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. 10.2307/1165285
  16. Chen Y, & Zhang S (2021). Estimation methods for item factor analysis: An overview. In Zhao Y & Chen DG (Eds.), Modern statistical methods for health research: Emerging topics in statistics and biostatistics (pp. 329–350). Springer, Cham. 10.1007/978-3-030-72437-5_15
  17. Davis PJ, & Rabinowitz P (1984). Methods of numerical integration. Academic Press. 10.1016/B978-0-12-206360-2.50012-1
  18. de la Torre J, Ponsoda V, Leenen I, & Hontangas P (2012, April). Examining the viability of recent models for forced-choice data [Paper presentation]. American Educational Research Association, Vancouver, BC, Canada.
  19. Drasgow F, Levine MV, & Williams EA (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67–86. 10.1111/j.2044-8317.1985.tb00817.x
  20. Forero CG, & Maydeu-Olivares A (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14(3), 275–299. 10.1037/a0015825
  21. Fu J (2019). Maximum marginal likelihood estimation with an expectation-maximization algorithm for multigroup/mixture multidimensional item response theory models (ETS Research Report No. 19-35). Educational Testing Service. 10.1002/ets2.12272
  22. Fu J, Kyllonen PC, & Tan X (2024). From Likert to forced choice: Statement parameter invariance and context effects in personality assessment. Measurement: Interdisciplinary Research and Perspectives. Advance online publication. 10.1080/15366367.2023.2258482
  23. Garnier-Villarreal M, Merkle EC, & Magnus BE (2021). Between-item multidimensional IRT: How far can the estimation methods go? Psych, 3, 404–421. 10.3390/psych3030029
  24. Gilks WR, Richardson S, & Spiegelhalter DJ (1996). Introducing Markov chain Monte Carlo. In Gilks WR, Richardson S, & Spiegelhalter DJ (Eds.), Markov chain Monte Carlo in practice (pp. 1–20). Chapman and Hall. 10.1201/b14835
  25. Haberman SJ (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm (Research Report No. RR-13-32). Educational Testing Service. 10.1002/j.2333-8504.2013.tb02339.x
  26. Hontangas PM, de la Torre J, Ponsoda V, Leenen I, Morillo D, & Abad FJ (2015). Comparing traditional and IRT scoring of forced-choice tests. Applied Psychological Measurement, 39(8), 598–612. 10.1177/0146621615585851
  27. Kim S (2012). A note on the reliability coefficients for item response model-based ability estimates. Psychometrika, 77(4), 153–162. 10.1007/s11336-011-9238-0
  28. Lee P, Joo SH, Stark S, & Chernyshenko OS (2019). GGUM-RANK statement and person parameter estimation with multidimensional forced choice triplets. Applied Psychological Measurement, 43(3), 226–240. 10.1177/0146621618768294
  29. Louis TA (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B, 44(2), 226–233. http://www.jstor.org/stable/2345828
  30. Luce RD (2005). Individual choice behavior. Dover. (Original work published 1959)
  31. Maydeu-Olivares A (1999). Thurstonian modeling of ranking data via mean and covariance structure analysis. Psychometrika, 64(3), 325–340. 10.1007/BF02294299
  32. Maydeu-Olivares A, & Brown A (2010). Item response modeling of paired comparison and ranking data. Multivariate Behavioral Research, 45(6), 935–974. 10.1080/00273171.2010.531231
  33. Maydeu-Olivares A, & Joe H (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. 10.1080/00273171.2014.911075
  34. McDonald RP (1997). Normal-ogive multidimensional model. In van der Linden WJ & Hambleton RK (Eds.), Handbook of modern item response theory (pp. 258–270). Springer.
  35. Meng X-L, & Schilling S (1996). Fitting full-information item factor models and an empirical investigation of bridge sampling. Journal of the American Statistical Association, 91(435), 1254–1267. 10.1080/01621459.1996.10476995
  36. Morillo D, Abad FJ, Kreitchmann RS, Leenen I, Hontangas P, & Ponsoda V (2019). The journey from Likert to forced-choice questionnaires: Evidence of the invariance of item parameters. Journal of Work and Organizational Psychology, 35(2), 75–83. 10.5093/jwop2019a11
  37. Morillo D, Leenen I, Abad FJ, Hontangas P, de la Torre J, & Ponsoda V (2016). A dominance variant under the multi-unidimensional pairwise-preference framework: Model formulation and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 40(7), 500–516. 10.1177/0146621616662226
  38. Morokoff WJ, & Caflisch RE (1995). Quasi-Monte Carlo integration. Journal of Computational Physics, 122(2), 218–230. 10.1006/jcph.1995.1209
  39. R Core Team (2021). R: A language and environment for statistical computing (Version 4.1.2). R Foundation for Statistical Computing. https://www.R-project.org/
  40. Roberts JS, Donoghue JR, & Laughlin JE (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24(1), 3–32. 10.1177/01466216000241001
  41. Salgado JF, Anderson N, & Tauriz G (2015). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 88(4), 797–834. 10.1111/joop.12098
  42. Schilling S, & Bock RD (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70(3), 533–555. 10.1007/s11336-003-1141-x
  43. Stark S, Chernyshenko OS, & Drasgow F (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise preference model. Applied Psychological Measurement, 29(3), 184–203. 10.1177/0146621604273988
  44. Thurstone LL (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. 10.1037/h0070288
  45. von Davier M (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61(2), 287–307. 10.1348/000711007X193957
  46. Wang W-C, Qiu X-L, Chen C-W, Ro S, & Jin K-Y (2017). Item response theory models for ipsative tests with multidimensional pairwise-comparison items. Applied Psychological Measurement, 41(8), 600–613. 10.1177/0146621617703183
  47. Yuan K-H, Cheng Y, & Patton J (2014). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79(1), 232–254. 10.1007/s11336-013-9334-4
  48. Zhang B, Tu N, Angrave L, Zhang S, Sun T, Tay L, & Li J (2023). The generalized Thurstonian unfolding model (GTUM): Advancing the modeling of forced-choice data. Organizational Research Methods. Advance online publication. 10.1177/10944281231210481
  49. Zhang S, Chen Y, & Liu Y (2020). An improved stochastic EM algorithm for large-scale full-information item factor analysis. British Journal of Mathematical and Statistical Psychology, 73(1), 44–71. 10.1111/bmsp.12153
