Published in final edited form as: J Educ Meas. 2023 Oct 29;61(1):125–149. doi: 10.1111/jedm.12379

Information Functions of Rank-2PL Models for Forced-Choice Questionnaires

Jianbin Fu 1, Xuan Tan 1, Patrick C Kyllonen 1

Abstract

This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference; Morillo et al., 2016), and for triplets it is the Triplet-2PLM (Fu et al., 2023a). Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute due to a typical test's vast number of item response patterns. The relationships of item/test information with the discrimination parameters of statements and with the standard error and reliability estimates of trait score estimates are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.

Keywords: information function, forced-choice questionnaire, rank two-parameter logistic models, item response theory

Introduction

In a forced-choice questionnaire (FCQ; Christiansen et al., 2005), an item is a block of statements that measures noncognitive traits, for example, attitudes, values, beliefs, or behaviors. An item includes two or more statements, which may measure the same or different traits. Respondents are asked to rank the statements based on the degree to which they endorse them, from the most to the least. The following is an example of a forced-choice item with three statements (triplet):

Order the following statements from the most to least like you:

  1. I hate to lose

  2. I find creative ideas come easily

  3. I conceal my feelings

As an alternative to the traditional Likert scale, the forced-choice format is designed to control test takers’ impression management on noncognitive tests and thus is a popular noncognitive test format, especially in high-stakes contexts where test takers may be more inclined to respond in a socially desirable way (Morillo et al., 2019).

Some researchers have proposed item response theory (IRT) models to calibrate FCQs and generate normative scores (Brown, 2016). The notable models include the Thurstonian IRT model (TIRT; Brown & Maydeu-Olivares, 2011) and the Rank models under the ranking framework (de la Torre et al., 2012), for example, MUPP-2PLM (Multi-Unidimensional Pairwise Preference with two-Parameter Logistic IRT Model; Morillo et al., 2016) and MUPP-GGUM (MUPP with Generalized Graded Unfolding Model; Stark et al., 2005) for items with two statements (pairs). Except for MUPP-2PLM, a special case of the commonly known compensatory two-parameter logistic multidimensional IRT Model (von Davier, 2008), no item parameter estimation methods are available for either the MUPP-GGUM or the Rank models for items with more than two statements. Recently, Fu et al. (2023a) developed a maximum marginal likelihood estimation with an expectation-maximization algorithm (MML-EM; Bock & Aitkin, 1981) to estimate item parameters and their standard errors on the Rank-2PLM for items with three statements (denoted as Triplet-2PLM). In a simulation study, they showed satisfactory parameter recovery and successfully applied it to real data.

The purpose of this paper is to develop the item and test information functions (IIF and TIF) for the Rank-2PLMs for pairs (i.e., MUPP-2PLM) and triplets (i.e., Triplet-2PLM). Item and test information are critical psychometric properties of items and tests and play an important role in test assembly, especially for computer adaptive tests. They also provide diagnostic information that test developers can use to identify and adjust problematic items and tests. The current study makes several unique contributions to educational and psychological measurement literature.

First, we studied the item and test information functions for MUPP-2PLM and Triplet-2PLM. The IIF and TIF for Triplet-2PLM were developed for the first time. The current study adds to the test information literature on IRT models for FCQ, for example, TIRT (Brown & Maydeu-Olivares, 2011), MUPP-GGUM (Stark et al., 2005), and Rank-GGUM (Joo et al., 2018).

Second, we proposed various summative IIF and TIF indices and plots to provide helpful information for diagnosing and adjusting items and tests. These indices can also be used in an automatic test assembly program. How to use IIF and TIF in practice for IRT models for FCQs, and more broadly for multidimensional IRT models, has been a long-standing issue. This is the first time these indices and plots have been presented systematically and demonstrated for FCQs with real data. Joo et al. (2018) also proposed two of the indices presented in this paper (i.e., expected overall item information and expected overall test information). However, Joo et al. calculated these indices using trait scores simulated from their prior distribution rather than using the prior densities directly, as we do in this paper. In addition, the use of these indices was little discussed and was not the focus of their paper.

Third, we described the inherent connections among test information, standard errors (SEs) of trait score estimates, and reliability. We showed that the directional IIF is a weighted sum of all elements in an item information matrix. We discourage its use in practice because of its disconnection from SE and reliability estimates: using directional IIF to estimate reliability (e.g., Brown & Maydeu-Olivares, 2011) will produce an estimate inconsistent with that based on a test sample. We proposed comparing expected SE and reliability estimates based on expected overall test information on a trait to those based on a test sample as an indication of sample representativeness.

Fourth, we distinguished different TIFs for three kinds of latent score estimates: Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP). These distinctions were not often emphasized in the literature. For example, in Joo et al. (2018), the TIF for ML scores was used for EAP scores. We also pointed out the difficulty in generating the proposed summative TIF indices for EAP scores.

Finally, using the real data on four FCQ forms, we obtained some interesting findings that are consistent with other studies in the literature: (a) item information increased with absolute item discrimination; (b) a triplets form needed fewer statements than a pairs form to achieve a given precision level of trait score estimates; and (c) the reliability estimates for ML and MAP estimates were similar although their TIF, SE, and score variance estimates were quite different.

In the following sections, we first describe the response functions of the Rank-2PLMs; second, we present three estimation methods for latent trait scores (Maximum Likelihood, Maximum A Posteriori, and Expected A Posteriori) because they are closely related to test information; third, we develop the item and test information functions, distinguishing the test information for the three types of trait score estimates and showing the connection between Fisher's information and directional information; fourth, we discuss the inherent relationships between test information and the standard error and reliability of trait score estimates; fifth, we propose several expected item and test information indices and plots, demonstrate their use in evaluating items and tests with real data, and provide practical recommendations; sixth, we show the relationships of item and test information with item discrimination parameters, standard errors, and reliabilities in the real data on four FCQ forms; finally, we discuss the limitations of the current study and suggest future research.

Item Response Functions of Rank-2PLM

Following Luce (2005/1959), a Rank model assumes that ranking K statements in a block is an iterative process of selecting the most agreeable statement among the remaining statements until all K statements are ranked. For MUPP-2PLM, K=2 and the probability of a ranking of the two statements is

P_{(1,2)}(\boldsymbol{\theta}) = \frac{P_{(1)}^{1}(\theta_1)\,P_{(2)}^{0}(\theta_2)}{P_{(1)}^{1}(\theta_1)\,P_{(2)}^{0}(\theta_2) + P_{(1)}^{0}(\theta_1)\,P_{(2)}^{1}(\theta_2)},  (1)

where θ is the column vector of the latent trait scores that the statements measure; the subscripts refer to the statements with Rank 1 and Rank 2, respectively (Rank 1 is the more agreeable statement); P_{(1,2)} is the probability of a ranking order; P_{(1)}^1(θ_1) and P_{(1)}^0(θ_1) are the probabilities of endorsing and rejecting the Rank 1 statement, respectively, conditional on the latent trait score θ_1 that the statement measures; and P_{(2)}^1(θ_2) and P_{(2)}^0(θ_2) are the corresponding probabilities for the Rank 2 statement. The probability of endorsing a statement follows the item response function of the 2PLM (Birnbaum, 1968). Inserting the 2PLM response function into Equation 1 and simplifying algebraically gives the item response function of the MUPP-2PLM (Morillo et al., 2016),

P_{(1,2)}(\boldsymbol{\theta}) = \left[1 + \exp\left(a_2\theta_2 + b_2 - a_1\theta_1 - b_1\right)\right]^{-1} = \left[1 + \exp\left(a_2\theta_2 - a_1\theta_1 - b_{12}\right)\right]^{-1},  (2)

where a_1 and a_2 are the discrimination/slope parameters of the Rank 1 and Rank 2 statements, respectively, in the 2PLM; b_1 and b_2 are the intercepts; and b_{12} = b_1 - b_2 is the estimable intercept, as only the difference between the two intercepts is identified in Equation 2. Alternatively, one of the two intercepts can be fixed to identify the model. Note that the MUPP-2PLM is just a special case of the multidimensional 2PLM. If the two statements measure the same trait, Equation 2 reduces to the 2PLM, and only one of the two discrimination parameters can be identified.
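To make Equation 2 concrete, here is a minimal R sketch of the MUPP-2PLM response probability; the function name p_mupp2pl and the example parameter values are our own illustrations, not code or values from the study.

```r
# Probability that the first statement in a pair is preferred (Equation 2).
# theta1, theta2: latent trait scores measured by statements 1 and 2
# a1, a2: statement discriminations; b12 = b1 - b2 is the estimable intercept
p_mupp2pl <- function(theta1, theta2, a1, a2, b12) {
  # [1 + exp(a2*theta2 - a1*theta1 - b12)]^(-1), written with plogis for numerical stability
  plogis(a1 * theta1 - a2 * theta2 + b12)
}

# Example: two statements measuring different traits
p_mupp2pl(theta1 = 0.5, theta2 = -1.0, a1 = 0.8, a2 = 0.6, b12 = 0.3)
```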

Based on the same rationale, we can get the item response function of Triplet-2PLM for triplets (Fu et al., 2023a),

P_{(1,2,3)}(\boldsymbol{\theta}) = \left[1 + \exp\left(a_2\theta_2 + b_2 - a_1\theta_1 - b_1\right) + \exp\left(a_3\theta_3 + b_3 - a_1\theta_1 - b_1\right)\right]^{-1}\left[1 + \exp\left(a_3\theta_3 + b_3 - a_2\theta_2 - b_2\right)\right]^{-1}.  (3)

Note that only two of the intercepts b_1, b_2, and b_3 can be identified in Equation 3; one of the b values can be fixed to identify the model. If all three statements measure a common trait, only two a values can be identified, and one may be fixed to identify the model.

For a triplet with three statements, A, B, and C, there are six possible ranking patterns, ABC, ACB, BAC, BCA, CAB, and CBA, which are coded as 1 to 6, respectively, in the observed data. For pairs, there are only two possible ranking patterns: an item score is coded as 1 if the first statement is ranked first and 0 otherwise.
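As a companion sketch (our own, with hypothetical parameter values), the six ranking probabilities of a triplet follow directly from the Luce decomposition behind Equation 3: with z_k = a_kθ_k + b_k, the probability of a ranking is the softmax share of the first-ranked statement among all three, times the share of the second-ranked statement among the remaining two.

```r
# Probabilities of the six ranking patterns of a triplet under the Triplet-2PLM.
# theta, a, b: length-3 vectors for statements A, B, and C (in that fixed order).
# Returns a vector over ABC, ACB, BAC, BCA, CAB, CBA (codes 1 to 6 in the text).
p_triplet2pl <- function(theta, a, b) {
  z <- a * theta + b                       # 2PLM logits of the three statements
  perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                 c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
  probs <- apply(perms, 1, function(o) {
    # P(first choice among all three) * P(second choice among the remaining two)
    exp(z[o[1]]) / sum(exp(z[o])) * exp(z[o[2]]) / sum(exp(z[o[2:3]]))
  })
  names(probs) <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")
  probs
}

# Example triplet; the six probabilities sum to 1
p <- p_triplet2pl(theta = c(0.2, -0.5, 1.0), a = c(0.9, 0.7, 0.8), b = c(0.1, -0.2, 0.4))
sum(p)
```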

Trait Score and Covariance Matrix Estimation

Item and test information in an IRT model is directly related to the error variance of a latent trait estimate. The information function of a latent trait estimate is the reciprocal of its error variance. Thus, we must first describe the different kinds of trait score estimates and their error variance estimations.

Suppose that in an FCQ, I test takers respond to J items. Let x_{ij} = s denote test taker i's coded score on item j (s = 0, 1 for pairs; s = 1, 2, …, 6 for triplets), and let x_i be test taker i's coded response vector across the J items. The FCQ measures D traits, and θ represents a random latent trait score vector sampled from the test-taker population, which follows a standard multivariate normal distribution with a D × D correlation matrix Σ_θ. η represents the vector of item parameters in the whole FCQ, and η_j is item j's parameter vector. When item parameter and intertrait correlation estimates are available, we can estimate IRT trait scores for each test taker based on Equations 2 and 3. Because the local independence assumption of item responses holds, each test taker's trait vector can be estimated separately. Typically, there are three types of trait score estimates: Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) (Baker & Kim, 2004).

Maximum Likelihood (ML) Score Estimates

For ML, the log-likelihood function for a test taker i is

l_{ML}\!\left(\boldsymbol{\theta}_i \mid \boldsymbol{x}_i, \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right) = \sum_{j=1}^{J} \log P\!\left(x_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\eta}_j\right),  (4)

where P(x_{ij} | θ_i, η_j) is the item response function of the MUPP-2PLM/Triplet-2PLM (Equation 2/3). The symbolic first and second derivatives of P(x_{ij} | θ_i, η_j) with respect to θ_i can be obtained with the R package Deriv (Clausen & Sokol, 2020); the code is provided in the supplemental online materials. Equation 4 can then be maximized by the Newton-Raphson method (NR; Atkinson, 1989) or a quasi-Newton method, for example, the BFGS algorithm (Broyden, 1970).

The variance-covariance matrix (COV) of the ML estimate (MLE) θ̂_i can be estimated by the inverse of the negative Hessian matrix evaluated at the MLE θ̂_i and η̂_j:

\Omega\!\left(\hat{\boldsymbol{\theta}}_i^{ML}\right) = \left[-\sum_{j=1}^{J}\frac{\partial^2 \log P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)}{\partial \hat{\boldsymbol{\theta}}_i\,\partial \hat{\boldsymbol{\theta}}_i'}\right]^{-1} = \left[\sum_{j=1}^{J}\frac{\partial P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)}{\partial \hat{\boldsymbol{\theta}}_i}\frac{\partial P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)}{\partial \hat{\boldsymbol{\theta}}_i'}P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)^{-2} - \sum_{j=1}^{J}\frac{\partial^2 P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)}{\partial \hat{\boldsymbol{\theta}}_i\,\partial \hat{\boldsymbol{\theta}}_i'}\Big/ P\!\left(x_{ij}\mid\hat{\boldsymbol{\theta}}_i,\hat{\boldsymbol{\eta}}_j\right)\right]^{-1}.  (5)
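As an illustration of Equations 4 and 5 (a sketch with a made-up five-item, two-trait pairs form; none of the values come from the study), the log-likelihood can be maximized numerically with BFGS and the covariance matrix obtained from the inverse of the numerical Hessian of the negative log-likelihood.

```r
# Toy pairs form measuring D = 2 traits; each row is one pair item
items <- data.frame(
  d1 = c(1, 1, 2, 1, 2), d2 = c(2, 2, 1, 2, 1),        # traits measured by statements 1 and 2
  a1 = c(0.9, 0.7, 0.8, 1.1, 0.6),
  a2 = c(0.8, 0.6, 0.9, 0.7, 1.0),
  b12 = c(0.2, -0.3, 0.1, 0.4, -0.1)
)
x <- c(1, 0, 1, 1, 0)                                   # item scores (1 = first statement preferred)

# Negative log-likelihood of Equation 4 for a trait score vector theta
negloglik <- function(theta, items, x) {
  p <- plogis(items$a1 * theta[items$d1] - items$a2 * theta[items$d2] + items$b12)
  -sum(ifelse(x == 1, log(p), log(1 - p)))
}

fit <- optim(par = c(0, 0), fn = negloglik, items = items, x = x,
             method = "BFGS", hessian = TRUE)
theta_ml <- fit$par              # ML trait score estimates
cov_ml   <- solve(fit$hessian)   # Equation 5: inverse of the negative Hessian of the log-likelihood
sqrt(diag(cov_ml))               # standard errors of the two trait score estimates
```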

Maximum A Posteriori (MAP) Score Estimates

The MAP estimate θ̂_i maximizes the posterior distribution of θ_i,

\log f\!\left(\boldsymbol{\theta}_i \mid \boldsymbol{x}_i, \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right) \propto \sum_{j=1}^{J}\log P\!\left(x_{ij} \mid \boldsymbol{\theta}_i, \boldsymbol{\eta}_j\right) + \log\phi\!\left(\boldsymbol{\theta}_i \mid \boldsymbol{\Sigma}_\theta\right),  (6)

which is just the ML log-likelihood function plus the logarithm of the multivariate normal density of θ_i, φ(θ_i | Σ_θ). The first and second derivatives of φ(θ | Σ_θ) with respect to θ are

\frac{\partial \phi\!\left(\boldsymbol{\theta}\mid\boldsymbol{\Sigma}_\theta\right)}{\partial\boldsymbol{\theta}} = -\phi\!\left(\boldsymbol{\theta}\mid\boldsymbol{\Sigma}_\theta\right)\boldsymbol{\Sigma}_\theta^{-1}\boldsymbol{\theta},  (7)
\frac{\partial^2 \phi\!\left(\boldsymbol{\theta}\mid\boldsymbol{\Sigma}_\theta\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'} = \phi\!\left(\boldsymbol{\theta}\mid\boldsymbol{\Sigma}_\theta\right)\left(\boldsymbol{\Sigma}_\theta^{-1}\boldsymbol{\theta}\boldsymbol{\theta}'\boldsymbol{\Sigma}_\theta^{-1} - \boldsymbol{\Sigma}_\theta^{-1}\right).  (8)

The COV of the MAP estimate θ̂_i can also be estimated by the negative inverse of the Hessian matrix of log f(θ_i | x_i, η, Σ_θ) evaluated at the MAP θ̂_i and η̂_j.
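Continuing the toy pairs example (again our own sketch, not the authors' code), MAP estimation simply adds the log multivariate normal prior of Equation 6 to the log-likelihood before optimizing; the covariance matrix is the inverse of the negative Hessian of the log posterior.

```r
# Toy pairs form and response vector as in the ML sketch above
items <- data.frame(d1 = c(1, 1, 2, 1, 2), d2 = c(2, 2, 1, 2, 1),
                    a1 = c(0.9, 0.7, 0.8, 1.1, 0.6), a2 = c(0.8, 0.6, 0.9, 0.7, 1.0),
                    b12 = c(0.2, -0.3, 0.1, 0.4, -0.1))
x <- c(1, 0, 1, 1, 0)
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)     # intertrait correlation matrix

# Negative log posterior (Equation 6, up to an additive constant)
neglogpost <- function(theta, items, x, Sigma) {
  p <- plogis(items$a1 * theta[items$d1] - items$a2 * theta[items$d2] + items$b12)
  loglik   <- sum(ifelse(x == 1, log(p), log(1 - p)))
  logprior <- -0.5 * drop(t(theta) %*% solve(Sigma) %*% theta)  # MVN log density, constant dropped
  -(loglik + logprior)
}

fit <- optim(par = c(0, 0), fn = neglogpost, items = items, x = x, Sigma = Sigma,
             method = "BFGS", hessian = TRUE)
theta_map <- fit$par
cov_map   <- solve(fit$hessian)   # negative inverse Hessian of the log posterior
```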

Expected A Posteriori (EAP) Score Estimates

The EAP estimate θ̂_i is the expectation of θ_i over its posterior distribution,

\hat{\boldsymbol{\theta}}_i^{EAP} = \int_{R}\boldsymbol{\theta}\, f\!\left(\boldsymbol{\theta} \mid \boldsymbol{x}_i, \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right) d\boldsymbol{\theta} \approx \sum_{l=1}^{L}\boldsymbol{\omega}_l\, f\!\left(\boldsymbol{\omega}_l \mid \boldsymbol{x}_i, \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right),  (9)
f\!\left(\boldsymbol{\omega}_l \mid \boldsymbol{x}_i, \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right) = \frac{\prod_{j=1}^{J}P\!\left(x_{ij} \mid \boldsymbol{\eta}_j, \boldsymbol{\omega}_l\right)\phi\!\left(\boldsymbol{\omega}_l \mid \boldsymbol{\Sigma}_\theta\right)}{\sum_{l=1}^{L}\prod_{j=1}^{J}P\!\left(x_{ij} \mid \boldsymbol{\eta}_j, \boldsymbol{\omega}_l\right)\phi\!\left(\boldsymbol{\omega}_l \mid \boldsymbol{\Sigma}_\theta\right)},  (10)

where ω_l (l = 1, 2, …, L) are the quadrature points used to approximate the prior distribution of θ, usually the standard multivariate normal distribution with correlation matrix Σ_θ.

The covariance matrix of θ̂_r^EAP conditional on response pattern r (Haberman & Sinharay, 2010) is

\Omega\!\left(\hat{\boldsymbol{\theta}}_r^{EAP}\mid\boldsymbol{x}_r\right) \approx \sum_{l=1}^{L} f\!\left(\boldsymbol{\omega}_l\mid\boldsymbol{x}_r,\boldsymbol{\eta},\boldsymbol{\Sigma}_\theta\right)\left(\boldsymbol{\omega}_l - \hat{\boldsymbol{\theta}}_r^{EAP}\right)\left(\boldsymbol{\omega}_l - \hat{\boldsymbol{\theta}}_r^{EAP}\right)',  (11)

where x_r is response pattern r, and θ̂_r^EAP is the EAP score estimate for x_r. The expected covariance matrix of errors of EAP score estimates is the expectation of Ω(θ̂_r^EAP | x_r) across all possible response patterns in a test,

\Omega\!\left(\hat{\boldsymbol{\theta}}^{EAP}\mid\boldsymbol{\theta}\right) \approx \sum_{r=1}^{R}\Omega\!\left(\hat{\boldsymbol{\theta}}_r^{EAP}\mid\boldsymbol{x}_r\right)P\!\left(\boldsymbol{x}_r\mid\boldsymbol{\eta},\boldsymbol{\Sigma}_\theta\right),  (12)
P\!\left(\boldsymbol{x}_r\mid\boldsymbol{\eta},\boldsymbol{\Sigma}_\theta\right) = \sum_{l=1}^{L}\prod_{j=1}^{J}P\!\left(x_{rj}\mid\boldsymbol{\eta}_j,\boldsymbol{\omega}_l\right)\phi\!\left(\boldsymbol{\omega}_l\mid\boldsymbol{\Sigma}_\theta\right),  (13)

where R is the number of all possible response patterns in a test. Because this number could be enormous, it may be infeasible to calculate the covariance matrix of θ̂^EAP errors from Equations 12 and 13. Alternatively, we can estimate Ω(θ̂^EAP | θ) from simulated data or a sample of test data,

\hat{\Omega}\!\left(\hat{\boldsymbol{\theta}}^{EAP}\mid\boldsymbol{\theta}\right) = \frac{1}{I}\sum_{i=1}^{I}\Omega\!\left(\hat{\boldsymbol{\theta}}_i^{EAP}\mid\boldsymbol{x}_i\right).  (14)
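The EAP computations in Equations 9-11 can be sketched with a rectangular quadrature grid. The following R code is our own illustration for the toy two-trait pairs example: the normalized posterior weights over the grid give the EAP score (Equation 9), and the weighted outer products of deviations give its conditional error covariance (Equation 11).

```r
# Toy pairs form, response vector, and intertrait correlation as in the earlier sketches
items <- data.frame(d1 = c(1, 1, 2, 1, 2), d2 = c(2, 2, 1, 2, 1),
                    a1 = c(0.9, 0.7, 0.8, 1.1, 0.6), a2 = c(0.8, 0.6, 0.9, 0.7, 1.0),
                    b12 = c(0.2, -0.3, 0.1, 0.4, -0.1))
x <- c(1, 0, 1, 1, 0)
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)

# Quadrature grid over the two traits (15 points from -3 to 3 per trait)
grid <- as.matrix(expand.grid(theta1 = seq(-3, 3, length.out = 15),
                              theta2 = seq(-3, 3, length.out = 15)))

# Unnormalized posterior weight of each grid point: likelihood times MVN prior density kernel
Sigma_inv <- solve(Sigma)
post <- apply(grid, 1, function(th) {
  p <- plogis(items$a1 * th[items$d1] - items$a2 * th[items$d2] + items$b12)
  lik   <- prod(ifelse(x == 1, p, 1 - p))
  prior <- exp(-0.5 * drop(t(th) %*% Sigma_inv %*% th))
  lik * prior
})
post <- post / sum(post)              # normalized posterior weights (Equation 10)

theta_eap <- colSums(grid * post)     # EAP score estimate (Equation 9)
dev <- sweep(grid, 2, theta_eap)      # deviations of grid points from the EAP
cov_eap <- t(dev) %*% (dev * post)    # conditional error covariance (Equation 11)
```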

Item and Test Information

The test information of a latent trait score estimate refers to the precision of the score estimate based on a test. Item information is the precision of a latent trait score estimate that an item contributes to the test information. In other words, test information is the sum of the item information of all items in a test by assuming item local independence. For MAP and EAP latent trait score estimates, the information from the prior population distribution of latent trait scores must be added to the test information.

Item Information

The item information for an item with multiple score categories in an IRT model is the expected Fisher’s information across score categories (Muraki, 1993; Joo et al., 2018). For an FCQ pair/triplet item j, its item information for a trait vector θ is

\boldsymbol{\beta}_j(\boldsymbol{\theta}) = -\sum_{s=1}^{S_j}\frac{\partial^2\log P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right) = \sum_{s=1}^{S_j}\frac{\partial P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}}\frac{\partial P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}'}P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)^{-1} - \sum_{s=1}^{S_j}\frac{\partial^2 P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'},  (15)

where S_j (= 2 for a pair; = 6 for a triplet) is the number of score categories of item j. Because \sum_{s=1}^{S_j}\partial^2 P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)/\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}' = \partial^2\left[\sum_{s=1}^{S_j}P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)\right]/\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}' = 0, Equation 15 reduces to

\boldsymbol{\beta}_j(\boldsymbol{\theta}) = \sum_{s=1}^{S_j}\frac{\partial P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}}\frac{\partial P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)}{\partial\boldsymbol{\theta}'}P\!\left(x_j=s\mid\boldsymbol{\theta},\boldsymbol{\eta}_j\right)^{-1}.  (16)

Fisher's information formula in Equation 15 is based on the Hessian matrix, while Equation 16 is the variance of Fisher's score. The equality between Equations 15 and 16 is well known in ML estimation, given that log P(x_j = s | θ, η_j) is twice differentiable with respect to θ and certain regularity conditions hold (Lehmann & Casella, 1998, p. 116), which are met here. In the literature, authors have used either Equation 15 (e.g., Muraki, 1993; Joo et al., 2018; Yao & Schwarz, 2006) or Equation 16 (e.g., Brown & Maydeu-Olivares, 2011; Stark et al., 2005), although the equivalence was not mentioned in these papers. Equation 16 is more computationally efficient because only the first derivatives are involved; thus, it is used here. The first derivatives in Equation 16 are the same ones used in estimating ML θ (Equation 4) and MAP θ (Equation 6). The derivatives are nonzero only for those θ_d's measured by item j. Thus, in the D × D information matrix β_j(θ), only the elements whose row and column traits are measured by item j carry information; the others are all 0s.
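For a pair item, Equation 16 has a simple closed form because there are only two categories and the derivatives of Equation 2 are ∂P/∂θ_1 = a_1P(1−P) and ∂P/∂θ_2 = −a_2P(1−P), so the information matrix reduces to (∂P/∂θ)(∂P/∂θ)'/[P(1−P)]. The following R code is our own minimal sketch of this matrix, with hypothetical parameters.

```r
# Item information matrix of a pair item (Equation 16) at a trait score point.
# Rows/columns 1 and 2 correspond to the traits measured by statements 1 and 2.
info_pair <- function(theta1, theta2, a1, a2, b12) {
  p  <- plogis(a1 * theta1 - a2 * theta2 + b12)   # Equation 2
  dp <- c(a1, -a2) * p * (1 - p)                  # first derivatives of P wrt (theta1, theta2)
  tcrossprod(dp) / (p * (1 - p))                  # the two-category sum collapses to this form
}

info_pair(theta1 = 0, theta2 = 0.5, a1 = 0.9, a2 = 0.8, b12 = 0.2)
```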

Some researchers use directional information, which considers the direction of information in a multidimensional space (e.g., Brown & Maydeu-Olivares, 2011; Yao & Schwarz, 2006). Let α = (α_1, …, α_D)' be a vector of angles to all D axes that defines a direction from a point θ. The item information function for an FCQ item j in the direction α is defined by

v_j^{\boldsymbol{\alpha}}(\boldsymbol{\theta}) = \boldsymbol{\alpha}'\boldsymbol{\beta}_j(\boldsymbol{\theta})\boldsymbol{\alpha}.  (17)

For the information along the axis of one trait θ_d, set the angle to θ_d to 0, so that cos(α_d) = 1, and set the angles to the other traits θ_{d'} (d' ≠ d) based on the correlation between θ_d and θ_{d'}, so that cos(α_{d'}) = cor(θ_d, θ_{d'}). This is referred to as the item information in the direction of trait θ_d. One can see that the directional item information along trait d is just a sum of the elements of the item information matrix weighted by the correlations between each trait and trait d. Because we emphasize consistency between item/test information and the standard errors/reliabilities of latent trait score estimates, directional information is not recommended in practice; the concept is also confusing to test practitioners.

Because item information in an IRT model is defined based on Fisher’s information, and Fisher’s information is a concept defined in the general theory of ML estimation, item information is only relevant to ML θ.

Test Information

Based on the local independence assumption, we get the test information by summing all item information in an FCQ

\boldsymbol{\beta}^{ML}(\boldsymbol{\theta}) = \sum_{j=1}^{J}\boldsymbol{\beta}_j(\boldsymbol{\theta}).  (18)

The test information in Equation 18 is defined for ML θ, for which Fisher's information defines item/test information. Because Fisher's test information matrix equals the inverse of the covariance matrix of the ML θ estimates, we define the test information for MAP/EAP θ as the inverse of the covariance matrix of the MAP/EAP θ estimates.

For MAP θ test information, we need to add the prior information from θ into Equation 18 and get the posterior test information (Du Toit, 2003)

\boldsymbol{\beta}^{MAP}(\boldsymbol{\theta}) = \sum_{j=1}^{J}\boldsymbol{\beta}_j(\boldsymbol{\theta}) - \frac{\partial^2\log\phi\!\left(\boldsymbol{\theta}\mid\boldsymbol{\Sigma}_\theta\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'},  (19)

where the second derivative of φ(θ | Σ_θ) is given in Equation 8. The diagonal elements of the negative second-derivative matrix of log φ(θ | Σ_θ) equal the diagonal elements of Σ_θ^{-1} (Brown & Maydeu-Olivares, 2011).
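A small R sketch (ours, reusing the toy pairs form and the closed-form pair information from the earlier sketches) of how Equations 18 and 19 are assembled: each item's information matrix is added into the rows and columns of the traits it measures, and for MAP the prior contribution Σ_θ^{-1} is added.

```r
# ML and MAP test information matrices at one trait score vector (Equations 18-19)
info_pair <- function(theta1, theta2, a1, a2, b12) {
  p <- plogis(a1 * theta1 - a2 * theta2 + b12)
  tcrossprod(c(a1, -a2) * p * (1 - p)) / (p * (1 - p))
}
items <- data.frame(d1 = c(1, 1, 2, 1, 2), d2 = c(2, 2, 1, 2, 1),
                    a1 = c(0.9, 0.7, 0.8, 1.1, 0.6), a2 = c(0.8, 0.6, 0.9, 0.7, 1.0),
                    b12 = c(0.2, -0.3, 0.1, 0.4, -0.1))
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
theta <- c(0.3, -0.2)

D <- 2
test_info_ml <- matrix(0, D, D)
for (j in seq_len(nrow(items))) {
  ij  <- info_pair(theta[items$d1[j]], theta[items$d2[j]],
                   items$a1[j], items$a2[j], items$b12[j])
  idx <- c(items$d1[j], items$d2[j])                      # traits measured by item j
  test_info_ml[idx, idx] <- test_info_ml[idx, idx] + ij   # Equation 18
}
test_info_map <- test_info_ml + solve(Sigma)              # Equation 19: add the prior information
```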

For EAP θ test information, we can only calculate the test information of an EAP score conditional on a response pattern. Using Equation 11 to approximate the covariance matrix of the EAP score θ̂_r^EAP associated with response pattern r and inverting it, we get the test information matrix for θ̂_r^EAP,

\boldsymbol{\beta}\!\left(\hat{\boldsymbol{\theta}}_r^{EAP}\right) \approx \left[\sum_{l=1}^{L} f\!\left(\boldsymbol{\omega}_l\mid\boldsymbol{x}_r,\boldsymbol{\eta},\boldsymbol{\Sigma}_\theta\right)\left(\boldsymbol{\omega}_l - \hat{\boldsymbol{\theta}}_r^{EAP}\right)\left(\boldsymbol{\omega}_l - \hat{\boldsymbol{\theta}}_r^{EAP}\right)'\right]^{-1}.  (20)

Generally, there is a one-to-one correspondence between θ̂_r^EAP and x_r (Kim, 2012). Because the number of possible response patterns could be vast, it may be practically impossible to calculate the test information for all θ̂_r^EAP.

For a trait score, its test information is the corresponding diagonal element in the test information matrix and is equal to the squared reciprocal of the standard error estimate of the latent trait score.

Reliability of Trait Score Estimates

The test information of a trait score estimate indicates the precision of the score estimate, and the precision differs across the types of score estimates in IRT. An IRT reliability coefficient is an overall index of the precision of the trait scores estimated from a test, analogous to reliability in classical test theory. For the MLE θ̂, the reliability of trait d's score estimates θ̂_d (d = 1, …, D) is defined as (Kim, 2012, 2013)

\rho_{ML}^2 = \frac{\mathrm{var}(\theta_d)}{\mathrm{var}(\theta_d) + E\!\left(e_{\hat{\theta}_d \mid \theta_d}\right)} = \frac{\mathrm{var}(\hat{\theta}_d) - E\!\left(e_{\hat{\theta}_d \mid \theta_d}\right)}{\mathrm{var}(\hat{\theta}_d)},  (21)

where var(θ_d) is the population variance of trait d's true scores, E(e_{θ̂_d|θ_d}) is the expected value of the error variances of predicting trait d's score estimates θ̂_d from θ_d, and var(θ̂_d) is the population variance of trait d's score estimates, equal to var(θ_d) + E(e_{θ̂_d|θ_d}). In Equation 21, var(θ_d) can be set to the population variance (usually 1) if known (e.g., Andersson & Xin, 2018) or can be estimated from the sample θ̂_d estimates of the test takers who took a test. Similarly, E(e_{θ̂_d|θ_d}) can be calculated as the expected value of the reciprocal of the test information of trait d scores (Equation 18) over the distribution of θ_d, given all model parameters are known, or estimated by the sample mean of test takers' error variance estimates on trait d. The former is referred to as marginal reliability and the latter as empirical reliability in the literature. The empirical reliability is estimated as

\hat{\rho}_{ML}^2 = \frac{\widehat{\mathrm{var}}(\hat{\theta}_d) - \sum_{i=1}^{I}\hat{e}(\hat{\theta}_{id})/I}{\widehat{\mathrm{var}}(\hat{\theta}_d)},  (22)

where vâr(θ̂_d) is the sample variance of θ̂_d, and ê(θ̂_{id}) is the error variance estimate of test taker i's estimated trait d score θ̂_{id}.

For MAP and EAP θ̂, the reliability of trait d's score estimates is defined as (Kim, 2012, 2013)

\rho_{M/EAP}^2 = \frac{\mathrm{var}(\hat{\theta}_d)}{\mathrm{var}(\hat{\theta}_d) + E\!\left(e_{\theta_d \mid \hat{\theta}_d}\right)},  (23)

where var(θ̂_d) is the population variance of trait d's MAP/EAP score estimates θ̂_d, and E(e_{θ_d|θ̂_d}) is the expected value of the error variances of predicting trait d's true scores θ_d from the estimates θ̂_d. Given the model holds, the denominator var(θ̂_d) + E(e_{θ_d|θ̂_d}) is equal to the variance of the prior distribution of θ_d. Usually, var(θ̂_d) is estimated by the variance of the sample θ̂_d estimates. For MAP, E(e_{θ_d|θ̂_d}) can be calculated as the expected value of the reciprocal of the test information of trait d scores (Equation 19) over the prior distribution of θ_d; for EAP, it can be calculated by Equation 12. E(e_{θ_d|θ̂_d}) can also be estimated by the sample mean of test takers' error variance estimates of trait d's MAP/EAP score estimates. Using a test-taker sample to estimate var(θ̂_d) and E(e_{θ_d|θ̂_d}), the empirical reliability for MAP or EAP scores is estimated as

\hat{\rho}_{M/EAP}^2 = \frac{\widehat{\mathrm{var}}(\hat{\theta}_d)}{\widehat{\mathrm{var}}(\hat{\theta}_d) + \sum_{i=1}^{I}\hat{e}(\theta_{id})/I}.  (24)

The E(e_{θ̂_d|θ_d}) and E(e_{θ_d|θ̂_d}) calculated from the expected test information/error variances represent theoretical values given the model. The expected test information is one of the expected item/test information indices we propose and demonstrate in the next section. Comparing the error variance or test reliability based on the expected test information to the estimate based on a test-taker sample provides an indication of sample adequacy.

Note that Kim (2012) distinguished two types of reliabilities: parallel-forms reliability and squared-correlation reliability. The two types are not equal for IRT latent score estimates because IRT score estimates of any kind are not unbiased estimators of the true scores. Equation 21 is based on the parallel-forms reliability, and Equation 23 is derived from the squared-correlation reliability under an assumption. These two reliabilities have been widely used in research and practice, partly because the statistics needed to calculate them are easily estimated from observed samples (Kim, 2013). Kim (2012) provided more formulas for IRT reliability estimation; these reliabilities can be estimated by Monte Carlo simulations with known item parameters if the statistics in the formulas are not available from the score estimation of observed samples.
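Before turning to the applications, here is a minimal R sketch (our own) of the empirical reliability computations: Equation 22 for ML scores and Equation 24 for MAP/EAP scores, given a sample of score estimates and error variance estimates for one trait. The simulated inputs are for illustration only.

```r
# Empirical reliability for one trait from a test-taker sample.
# theta_hat: trait score estimates; err_var: their error variance estimates.
reliability_ml <- function(theta_hat, err_var) {
  (var(theta_hat) - mean(err_var)) / var(theta_hat)     # Equation 22
}
reliability_map_eap <- function(theta_hat, err_var) {
  var(theta_hat) / (var(theta_hat) + mean(err_var))     # Equation 24
}

# Illustrative inputs (not real data)
set.seed(1)
theta_hat <- rnorm(500, 0, 0.8)
err_var   <- runif(500, 0.2, 0.4)
reliability_ml(theta_hat, err_var)
reliability_map_eap(theta_hat, err_var)
```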

Applications on Real Data

We demonstrate the estimation of item/test information based on the multidimensional MUPP-2PLM and Triplet-2PLM on four real datasets. We show how expected item/test information at different levels can be calculated and plotted. These indices and graphs provide a convenient way to show the precision distribution of latent trait score estimates at the item and test levels and provide diagnostic information for adjusting items and test assembly.

Real Data and IRT Calibrations

The real data included four FCQ forms: two pairs forms (A2 and B2) and two triplets forms (A3 and B3). Each pairs form included 130 items with 260 statements; each triplets form contained 60 items with 180 statements. These statements measured 14 interpersonal and intrapersonal skills essential to higher education and career success, such as perseverance, leadership, creativity, curiosity, responsibility, and self-discipline. All FCQ items were multidimensional; that is, all statements in an item measured different traits. Each trait was measured by 11 to 20 statements in a pairs form and by 5 to 18 statements in a triplets form.

The four forms were assembled from 560 statements (measuring 14 traits, with each trait measured by 40 statements) by an automatic assembly program using a linear programming solver. The two forms of each type were assembled to be parallel forms. The four FCQ forms were then administered to test takers recruited from the crowdsourcing website Amazon Mechanical Turk (AMT) and to undergraduate/graduate applicants at several schools. The order of items within a form was randomized separately for each test taker. The valid sample sizes for the A2, B2, A3, and B3 forms were 443, 441, 552, and 538, respectively. The samples included about 71% and 47% test takers from AMT for the pairs and triplets forms, respectively; 55% were male, 75% white, and 78% held Bachelor's or higher degrees.

We conducted a model comparison study on these four forms (Fu et al., 2023b). The models compared included, among others, the MUPP/Triplet-GGUMs with fixed item parameters (from previous administrations of the Likert scales of the statements), the MUPP/Triplet-2PLMs with fixed and directly estimated item parameters, and the TIRT with directly estimated item parameters. The evaluation criteria included model selection, model fit, item fit, reliabilities, and correlations of trait score estimates. The MUPP/Triplet-2PLMs with directly estimated item parameters appeared to be the best-fitting models and were recommended for the pairs/triplets forms (see Fu et al., 2023b, for details).

This paper used the directly estimated item parameters from the MUPP/Triplet-2PLMs to estimate item and test information on the four forms. Note that in the MUPP/Triplet-2PLM estimations, the intercept parameter of the first statement in an item was fixed to its intercept estimate from the 2PLM calibration of the Likert forms1 for model identification, as discussed for Equations 2 and 3. For latent trait scores, the MAP estimates were reported for these forms. Because all the FCQ forms are multidimensional, how to use item and test information to provide useful diagnostic information on items and a test is not straightforward. We thus proposed a series of expected information indices at different levels and demonstrated their use with real data.

Computer Program

All four forms were estimated by the MML-EM method in the mirt 1.34 package (Chalmers, 2012) in 64-bit R 4.1.2 (R Core Team, 2021). For the pairs forms, the MUPP-2PLM was estimated with mirt's internal implementation of the multidimensional 2PLM. The Triplet-2PLM estimation for the triplets forms was implemented with the createItem function in mirt, with some modifications of mirt's functions (see Fu et al., 2023a for details).

The iteminfo function in mirt calculates item information for IRT models; however, it has limitations for our purposes: it computes the second derivatives although only the first derivatives are needed for calculating item information, and for the Triplet-2PLM it uses numerical approximation to estimate the derivatives. These features unnecessarily lengthen computation times. Therefore, we modified the iteminfo function to use the symbolic derivatives for the Triplet-2PLM and to calculate only the first derivatives to speed up computations. The savings in running time were significant in our case, as we calculated item information for many items and many trait score vectors. As demonstrated below, we also wrote various R functions to calculate expected item/test information at different levels and to create information plots. All the R functions related to item/test information are available in the supplemental online materials.

Expected Item/Test Information Indices and Plots

We now describe and demonstrate the various expected item/test information indices and plots we propose. The following description focuses on triplets; however, it applies to pairs too. For a triplet item j, we use a grid of latent trait score points of three traits, each with 15 equally spaced points from −3 to 3, to represent the population latent trait score distribution; θ_{jd_kh_k} refers to latent score point h_k (= 1, …, 15) of trait d_k measured by the kth statement in triplet item j. The weight π_{jh_1h_2h_3} of a latent trait score vector ω_{jh_1h_2h_3} = (θ_{jd_1h_1}, θ_{jd_2h_2}, θ_{jd_3h_3})' is the prior probability of ω_{jh_1h_2h_3} (usually the normalized density from a standard multivariate normal distribution with the intertrait correlation matrix, as used here). Similarly, the weight π_{jh_k} of θ_{jd_kh_k} is the prior probability of θ_{jd_kh_k} (usually the normalized density of a standard normal distribution, as used here). At each latent trait score vector ω_{jh_1h_2h_3}, we calculate the item information matrix for the three traits' scores based on Equation 15. The item information for each trait score is on the trace of the information matrix β_j(θ_{jd_1h_1}, θ_{jd_2h_2}, θ_{jd_3h_3}). Examining item information on multiple dimensions is difficult, especially in graphs. Thus, we work with each trait's expected item/test information at different levels to facilitate comparisons among items and tests.

Expected Item Information of a Trait Score.

We first calculate the expected item information on a trait conditional on each latent score point of the trait; for example, for θ_{jd_1h_1},

\beta_j\!\left(\theta_{jd_1h_1}\right) = \sum_{h_2=1}^{15}\sum_{h_3=1}^{15}\mathrm{diag}_{d_1}\!\left[\boldsymbol{\beta}_j\!\left(\theta_{jd_1h_1},\theta_{jd_2h_2},\theta_{jd_3h_3}\right)\right]\pi_{jh_1h_2h_3}\big/\pi_{jh_1},  (25)

where diag_{d_1}[β_j(θ_{jd_1h_1}, θ_{jd_2h_2}, θ_{jd_3h_3})] is the first diagonal element of the item information matrix, representing the item information of θ_{jd_1h_1} at the latent trait score vector ω_{jh_1h_2h_3}.
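To make Equation 25 concrete, the following R sketch (ours, with hypothetical item parameters) computes the expected item information of the first statement's trait at each of its 15 score points for a pair item; the triplet version only adds one more summation. The weights are the normalized bivariate normal prior densities over the grid, divided by the marginal weight of the conditioning score point.

```r
# Expected item information of the first statement's trait at each grid point (Equation 25),
# illustrated for a pair item; info_pair() is the closed-form pair information used earlier.
info_pair <- function(theta1, theta2, a1, a2, b12) {
  p <- plogis(a1 * theta1 - a2 * theta2 + b12)
  tcrossprod(c(a1, -a2) * p * (1 - p)) / (p * (1 - p))
}

a1 <- 0.9; a2 <- 0.8; b12 <- 0.2        # hypothetical item parameters
rho <- 0.4                              # correlation between the two traits
pts <- seq(-3, 3, length.out = 15)      # 15 equally spaced score points per trait

# Normalized bivariate standard normal prior weights over the grid
dens <- outer(pts, pts, function(t1, t2)
  exp(-(t1^2 - 2 * rho * t1 * t2 + t2^2) / (2 * (1 - rho^2))))
w  <- dens / sum(dens)                  # joint weights pi_{h1 h2}
w1 <- rowSums(w)                        # marginal weights pi_{h1}

expected_info_trait1 <- sapply(seq_along(pts), function(h1) {
  info_h2 <- sapply(seq_along(pts), function(h2)
    info_pair(pts[h1], pts[h2], a1, a2, b12)[1, 1])   # diagonal element for the first trait
  sum(info_h2 * w[h1, ]) / w1[h1]                     # Equation 25 (pair case)
})

plot(pts, expected_info_trait1, type = "b",
     xlab = "Trait score", ylab = "Expected item information")
```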

Figure 1 includes the expected item information curve (IIC) for Item 1 in Form A3 as an example. This figure contains three expected individual trait IICs, which show the average amount of information on a trait at each score point of the trait that an item contributes to the test. This plot is useful to show the distribution of the expected item information on each trait that an item measures over the range of the trait’s scores. Trait 9, measured by the third statement, appears to have the largest item information across the latent trait score range, and the item information peaks in the trait’s score range between −1 and 0. This pattern is consistent with the discrimination parameter estimates associated with the three statements in the item: −0.55, −0.61, and −0.91 for the three statements, respectively. (The discrimination and intercept parameter estimates for each statement are shown at the top of the plot.) It is well known that a higher (absolute) discrimination parameter leads to more item information (Joo et al., 2018).

Figure 1. Expected Item Information by Traits’ Scores for Item 1 in Form A3

Note. Trait 2-Statement A = The first statement in the item measures Trait 2; Trait 4-Statement B = The second statement in the item measures Trait 4; Trait 9-Statement C = The third statement in the item measures Trait 9; a = Discrimination estimate; b = Intercept estimate.

Expected Item Information on a Trait.

We next calculate the expected item information on each trait with respect to the prior distribution of the trait. For example, for trait d_1 in item j,

\beta_{jd_1} = \sum_{h_1=1}^{15}\beta_j\!\left(\theta_{jd_1h_1}\right)\pi_{jh_1}.  (26)

This index gives the expected overall information (a scalar) on a trait that an item contains, which makes it convenient to compare overall item information on a trait across items. For a given trait, we can order and plot the expected item information of the trait across the items measuring it. Figure 2 is an example for Trait 1 in Form A3. Item 36 stands out as the most informative item on Trait 1. In contrast, the first three items in the plot have expected item information on Trait 1 close to 0.

Figure 2. Expected Item Information on Trait 1 by Items in Form A3

Expected Overall Item Information.

We can also calculate the expected overall item information of an item by averaging the expected overall item information on the three traits measured by the item,

O_j = \sum_{h_1=1}^{15}\sum_{h_2=1}^{15}\sum_{h_3=1}^{15}\mathrm{mean\_tr}\!\left[\boldsymbol{\beta}_j\!\left(\theta_{jd_1h_1},\theta_{jd_2h_2},\theta_{jd_3h_3}\right)\right]\pi_{jh_1h_2h_3},  (27)

where mean_tr refers to the average of a matrix's trace (diagonal elements). This scalar index provides the overall item information to facilitate comparison across items and is the same as the overall item information proposed by Joo et al. (2018). We can also order and plot this index across the items in an FCQ form. Figure 3 is an example for Form A3, where a few items at the right end of the x-axis (e.g., Items 36, 53, and 42) stand out as the most informative items. In contrast, the first two items on the x-axis (i.e., Items 52 and 9) have little information.

Figure 3. Expected Overall Item Information by Items in Form A3

Expected Test Information of a Trait Score.

Having the expected item information at each latent score point of a trait (Equation 25), we can calculate the expected test information at each latent score point of the trait. For latent score point h of trait d, θ_{dh}, the expected test information for ML θ is

T_{dh} = \sum_{j \in J_d}\beta_j\!\left(\theta_{dh}\right),  (28)

and for MAP θ

T_{dh} = \sum_{j \in J_d}\beta_j\!\left(\theta_{dh}\right) + \mathrm{diag}_d\!\left(\boldsymbol{\Sigma}_\theta^{-1}\right),  (29)

where J_d is the set of items measuring trait d. For EAP θ, this index should be calculated by Equation 20 over all response patterns: find the response patterns with EAP score estimates at or close to θ_{dh}, and T_{dh} is the weighted average of the test information of EAP θ_{dh} from these response patterns with normalized weights P(x_r | η, Σ_θ). Because of the vast number of response patterns for a typical real test, it is generally impractical to calculate this index for EAP scores. We can draw the expected test information curves of all traits in a test form against their latent trait scores to compare the expected test information of individual traits. Figure 4 is an example for Form A3. Because the form measures 14 traits, Figure 4 looks cumbersome; however, the distributional patterns of test information on each trait within and across traits are still salient. One can easily see the shape of the distribution for each trait (e.g., the peak and low points) and compare the distributions across traits; for example, some traits (e.g., Traits 17 and 13) had much more test information than other traits (e.g., Traits 4 and 11). We may also place the expected test information curves of a trait in different test forms in one plot and compare them. Figure 5 is an example for Trait 1: Forms B2 and A3 appear to have more test information on Trait 1 than Forms A2 and B3 across most of the score range.

Figure 4. Expected Test Information on Individual Traits by Trait Scores in Form A3

Note. “T” is the abbreviation of Trait so that, for example, T1 refers to Trait 1, and so on.

Figure 5. Comparison of Expected Test Information on Trait 1 by Trait Scores Across Four Test Forms (A2, B2, A3, and B3)

Expected Test Information on a Trait.

We can calculate the expected trait test information for each trait

T_d = \sum_{h=1}^{15} T_{dh}\,\pi_{h}.  (30)

This scalar index represents the overall test information on a trait. It can be used to compare test information across traits within a test form or across test forms. During assembly of the four test forms, this index, based on the MUPP/Triplet-GGUM, was used to control the variance of trait test information within a form and across forms.

Expected Overall Test Information.

Within a test form, the expected test information on each trait can be averaged to obtain the expected overall test information of a form

T = \sum_{d=1}^{D} T_d \big/ D.  (31)

For EAP scores, the expected overall test information can also be obtained by

T = \mathrm{mean\_tr}\!\left[\sum_{r=1}^{R}\boldsymbol{\beta}\!\left(\hat{\boldsymbol{\theta}}_r^{EAP}\right)P\!\left(\boldsymbol{x}_r \mid \boldsymbol{\eta}, \boldsymbol{\Sigma}_\theta\right)\right],  (32)

where β(θ̂_r^EAP) is the test information matrix of the EAP score for response pattern r (Equation 20). This scalar index of the overall test information of a test form was also proposed by Joo et al. (2018). It can be used to compare overall test information across test forms, and it was the statistic that the test assembly program tried to maximize in assembling the four forms.
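The aggregation in Equations 28-31 can be sketched in R as follows for a toy two-trait pairs form in which every item measures both traits (our own illustration; a real form would restrict each sum to the items in J_d). It reuses the closed-form pair information and the grid weights from the earlier sketches.

```r
# Expected test information indices (Equations 28-31) for a toy two-trait pairs form
info_pair <- function(theta1, theta2, a1, a2, b12) {
  p <- plogis(a1 * theta1 - a2 * theta2 + b12)
  tcrossprod(c(a1, -a2) * p * (1 - p)) / (p * (1 - p))
}
items <- data.frame(d1 = c(1, 1, 2, 1, 2), d2 = c(2, 2, 1, 2, 1),
                    a1 = c(0.9, 0.7, 0.8, 1.1, 0.6), a2 = c(0.8, 0.6, 0.9, 0.7, 1.0),
                    b12 = c(0.2, -0.3, 0.1, 0.4, -0.1))
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)
rho <- Sigma[1, 2]
pts <- seq(-3, 3, length.out = 15)
dens <- outer(pts, pts, function(t1, t2)              # normalized bivariate prior weights
  exp(-(t1^2 - 2 * rho * t1 * t2 + t2^2) / (2 * (1 - rho^2))))
w <- dens / sum(dens)
w_marg <- rowSums(w)                                  # marginal weight of each score point

# Expected item information of trait d at score point h for item j (Equation 25, pair case)
exp_item_info <- function(j, d, h) {
  k <- if (items$d1[j] == d) 1 else 2                 # which statement of item j measures trait d
  vals <- sapply(seq_along(pts), function(h2) {
    th <- numeric(2); th[d] <- pts[h]; th[-d] <- pts[h2]
    info_pair(th[items$d1[j]], th[items$d2[j]],
              items$a1[j], items$a2[j], items$b12[j])[k, k]
  })
  sum(vals * w[h, ]) / w_marg[h]
}

# Equations 28-29: expected ML and MAP test information at each score point of each trait
T_ml  <- sapply(1:2, function(d) sapply(seq_along(pts), function(h)
  sum(sapply(seq_len(nrow(items)), exp_item_info, d = d, h = h))))
T_map <- sweep(T_ml, 2, diag(solve(Sigma)), "+")
T_d   <- colSums(T_ml * w_marg)   # Equation 30, per trait (ML version)
T_all <- mean(T_d)                # Equation 31, expected overall test information
```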

We can order and plot the expected test information of all the traits measured in a form. Figure 6 shows the 14 traits in Form A3 as an example: Trait 11 had the lowest and Trait 7 the highest expected test information. The expected overall test information (5.61) appears as a reference line across the plot. We can also draw the expected test information of all the traits across test forms in one plot to facilitate across-form comparisons (see Figure 7). Figure 7 shows that, although there is some variation in the expected test information on some traits within each form type, the expected overall test information is basically the same between the two forms of each type. Note that the two forms of each type were designed to be parallel. The two triplets forms have slightly more test information than the two pairs forms (5.6 vs. 5.1).

Figure 6. Comparison of Expected Test Information on Individual Traits in Form A3

Note. Overall TIF = Expected overall Test Information.

Figure 7. Comparison of Expected Test Information on Individual Traits Across Four Test Forms (A2, B2, A3, and B3)

Note. Overall TIF = Expected overall Test Information.

Practical Recommendations

The item/test information indices and plots presented in this paper can be very useful in practice. Below, we provide suggestions for practitioners on reviewing the plots from the top down to identify and remedy information issues in a test form.

First, check Figure 7 when comparing multiple forms, or Figure 6 when checking only one form, and identify the traits with inadequate expected test information that need remedy (trait level). For a single form, users may instead start with the expected overall item information by item (Figure 3) to identify the items with very low overall item information (item level) and then jump to the third step (statement level).

Second, check the plots of expected item information on individual traits by items, like Figure 2, and identify the items with low information on a trait (item level).

Third, look at these items' plots of expected item information on individual traits by latent trait scores (Figure 1) to identify the problematic statements (statement level). Usually, low item information is caused by statements with low absolute discrimination parameters; replacing those statements with ones having larger absolute discrimination parameters will resolve the issue.

There are no one-size-fits-all threshold values for the various indices to judge whether an item/test contains satisfactory overall information or information on a particular trait. Rather, the thresholds depend on individual testing programs' objectives, for example, the desired precision distribution of score estimates. Each testing program will develop its own standards on item and test information, among other requirements on classical item statistics, IRT item parameters, and test statistics (e.g., reliability). These criteria may be determined from empirical data, simulation studies, or both. They can then be implemented as constraints in an automatic test assembly program. As demonstrated here, they can also be used in the post-administration review of items and test forms to adjust items accordingly.

Standard Error of Trait Score Estimates, Reliability, and Item Discrimination

As discussed previously, test information is directly related to the standard errors (SEs) of trait score estimates, and comparing the expected SEs and reliabilities of a trait's score estimates based on expected test information to those based on sample estimates can check the representativeness of a test sample. Figure 8 compares the expected SE estimates of MAP trait score estimates from the two approaches in Form A3, and Figure 9 compares the reliabilities. The gaps between the two approaches are consistent, with small variations across traits: the (more theoretical) TIF approach is consistently better than the sample approach, with smaller SEs and higher reliabilities. Figures 8 and 9 also compare the two statistics for ML trait score estimates, and the same pattern appears2. Therefore, the test sample for this triplets form does not seem to be adequate. Note that MAP test information is ML test information plus the prior information (Equation 19). Thus, the SE estimates of ML score estimates based on TIF and on the test sample are much higher than those of MAP score estimates. In fact, in the expected MAP test information on individual traits in this triplets form, the prior information contributes 29% to 60%, varying by trait. At the same time, based on Equation 6, it is well known that MAP score estimates shrink toward their population mean compared with ML estimates. Thus, the variance of MAP estimates is smaller than that of ML estimates (on average 0.55 vs. 1.90 for the test sample on this triplets form). Then, based on Equations 21 and 23, the reliability of MAP estimates is not necessarily higher than that of ML estimates, as shown in Figure 9: the reliabilities for the two types of trait score estimates are quite similar, and for some traits the reliabilities for ML estimates are higher than those for MAP estimates. However, as is well known, MLE produces many more inestimable scores for extreme/aberrant responses than MAP and EAP estimation. This issue seems more severe for high-dimensional FCQs: for example, in this triplets form, the ML score estimation did not converge for 52 test takers, versus none for MAP. Thus, MAP scores are used for reporting on these FCQ forms.

Figure 8. Comparison of Expected Standard Errors (SEs) of Trait Score Estimates in Form A3

Note. MAP_TIF = Square root of the reciprocal of the expected test information on a trait for MAP scores; MAP_Sample = Mean of the SE estimates of the sample MAP trait score estimates; ML_TIF = Square root of the reciprocal of the expected test information on a trait for ML scores; ML_Sample = Mean of the SE estimates of the sample ML trait score estimates.

Figure 9. Comparison of Reliabilities of Trait Score Estimates in Form A3

Note. MAP_TIF = Reliability based on the TIF of MAP trait scores; MAP_Sample = Reliability based on the sample mean of the standard error estimates of the MAP trait score estimates; ML_TIF = Reliability based on the TIFs of ML trait scores; ML_Sample = Reliability based on the sample mean of the standard error estimates of the ML trait score estimates.

A trait's item information is closely related to the absolute discrimination parameter of the statement measuring the trait. Table 1 shows the correlation between absolute discrimination parameters and expected item information on a trait for each of the four forms. All correlations are 0.97 or 0.98, indicating that larger absolute discrimination almost certainly leads to higher item information. This finding is consistent with the simulation results in Joo et al. (2018). Table 1 also lists the mean and SD of the absolute discrimination parameters and the expected overall ML test information for each form (as the ML TIF relates directly to the discrimination parameters). Within each form type (pairs or triplets), the two forms are almost identical on these three statistics, which supports the parallelism of the two forms. The triplets forms have slightly higher means on absolute discrimination parameters and expected overall ML test information than the pairs forms. Note that a pairs form includes 80 more statements than a triplets form. Thus, a triplets form can measure traits at a comparable level of precision with fewer statements than a pairs form. This finding is also consistent with the simulation results in Joo et al. (2018).

Table 1.

Discrimination Parameters and Test Information

Form    Correlation    Absolute discrimination parameter    ML test information
                       M        SD
A2      0.97           0.78     0.49                         2.77
B2      0.97           0.78     0.49                         2.78
A3      0.97           0.85     0.39                         3.29
B3      0.98           0.83     0.46                         3.29

Note. Correlation = Correlation between absolute discrimination parameters and expected item information on a trait; ML test information = Expected overall test information for ML trait scores.

Discussion

In this paper, we provided a comprehensive discussion of the information functions of the Rank-2PLMs in particular and of IRT models in general, from both theoretical and practical perspectives. The paper makes valuable contributions to the IRT literature, especially on IRT models for FCQs, as listed in the Introduction. Below, we discuss the current study's limitations and suggest future research.

Limitations and Future Research

This study did not look deeply into how the information functions behave with respect to the item parameters of the MUPP-2PLM and Triplet-2PLM. A future study may investigate this analytically or with simulations/examples, as done in Muraki (1993) and Joo et al. (2018) for their models.

The TIRT model is closely related to the Rank-2PLMs. For pairs, it is easy to show that TIRT is equivalent to the MUPP-2PLM (in the sense of the equivalence between the normal ogive and logistic forms of an IRT model)3. For a block with three or more statements, the two models treat the ranking patterns of a block differently. The TIRT model recodes ranking patterns into binary response patterns of pairs; for example, a triplet with statements A, B, and C can be recoded into three pairs: AB, AC, and BC. In TIRT item estimation, the dependence among pairs is taken into account; in score estimation, however, local independence among pairs is assumed to simplify the computation (avoiding multidimensional integrals), which may negatively affect score estimation. The Triplet-2PLM avoids this problem by modeling a triplet directly. Thus, comparing Triplet-2PLM and TIRT estimations, including their information functions, is an interesting direction for future research.

Supplementary Material

Data
Code

Footnotes

1

The Likert forms had four response levels (strongly disagree, disagree, agree, and strongly agree). The responses of the Likert statements were recoded to binary values (i.e., strongly disagree, disagree = 0; agree, strongly agree = 1) and fitted by the 2PLM.

2

A TIF reliability for ML score estimates is calculated by the first formula in Equation 21 with the population variance of true scores set to 1, while a TIF reliability for MAP score estimates is estimated by Equation 23 with the population variance of MAP estimates being set to the sample variance.

3

This assumes TIRT is estimated by a full-information method, as the MUPP-2PLM is, rather than by a limited-information method under structural equation modeling.

References

  1. Andersson B, & Xin T (2018). Large sample confidence intervals for item response theory reliability coefficients. Educational and Psychological Measurement, 78(1), 32–45. 10.1177/0013164417713570
  2. Atkinson KE (1989). An introduction to numerical analysis (2nd ed.). John Wiley.
  3. Baker FB, & Kim SH (2004). Item response theory: Parameter estimation techniques (2nd ed.). CRC Press. 10.1201/9781482276725
  4. Birnbaum A (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord FM & Novick MR (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
  5. Bock RD, & Aitkin M (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. 10.1007/BF02293801
  6. Brown A (2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(4), 135–160. 10.1007/s11336-014-9434-9
  7. Brown A, & Maydeu-Olivares A (2011). Item response modeling of forced choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. 10.1177/0013164410375112
  8. Broyden CG (1970). The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6, 76–90. 10.1093/imamat/6.1.76
  9. Chalmers RP (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
  10. Christiansen ND, Burns GN, & Montgomery GE (2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18(3), 267–307. 10.1207/s15327043hup1803_4
  11. Clausen A, & Sokol S (2020). Deriv: R-based symbolic differentiation. R package version 4.1. https://CRAN.R-project.org/package=Deriv
  12. de la Torre J, Ponsoda V, Leenen I, & Hontangas P (2012, April). Examining the viability of recent models for forced-choice data [Paper presentation]. American Educational Research Association, Vancouver, BC, Canada.
  13. Du Toit M (Ed.). (2003). IRT from SSI. SSI Scientific Software International.
  14. Fu J, Tan X, & Kyllonen PC (2023a). The Rank-2PL IRT models for forced-choice questionnaires: Maximum marginal likelihood estimation with an EM algorithm [Manuscript submitted for publication]. Educational Testing Service.
  15. Fu J, Tan X, & Kyllonen PC (2023b). A comparison of item response theory models for multidimensional forced-choice questionnaires: Real data examples [Unpublished manuscript]. Educational Testing Service.
  16. Haberman SJ, & Sinharay S (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75(2), 209–227. 10.1007/s11336-010-9158-4
  17. Joo S-H, Lee P, & Stark S (2018). Development of information functions and indices for the GGUM-RANK multidimensional forced choice IRT model. Journal of Educational Measurement, 55(3), 357–372. 10.1111/jedm.12183
  18. Kim S (2012). A note on the reliability coefficients for item response model-based ability estimates. Psychometrika, 77(1), 153–162. 10.1007/s11336-011-9238-0
  19. Kim S (2013). A diagnosis on the performance of BILOG-MG's empirical reliability estimators for IRT ability scores. Journal of Educational Evaluation, 26(2), 507–531.
  20. Lehmann EL, & Casella G (1998). Theory of point estimation (2nd ed.). Springer. 10.1007/b98854
  21. Luce RD (2005). Individual choice behavior. Dover. (Original work published 1959.)
  22. Morillo D, Abad FJ, Kreitchmann RS, Leenen I, Hontangas P, & Ponsoda V (2019). The journey from Likert to forced-choice questionnaires: Evidence of the invariance of item parameters. Journal of Work and Organizational Psychology, 35(2), 75–83. 10.5093/jwop2019a11
  23. Morillo D, Leenen I, Abad FJ, Hontangas P, de la Torre J, & Ponsoda V (2016). A dominance variant under the multi-unidimensional pairwise-preference framework: Model formulation and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 40(7), 500–516. 10.1177/0146621616662226
  24. Muraki E (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17(4), 351–363. 10.1177/014662169301700403
  25. R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
  26. Stark S, Chernyshenko OS, & Drasgow F (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise preference model. Applied Psychological Measurement, 29(3), 184–203. 10.1177/0146621604273988
  27. von Davier M (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61(2), 287–307. 10.1348/000711007X193957
  28. Yao L, & Schwarz RD (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30(6), 469–492. 10.1177/0146621605284537
