Abstract
Item response theory (IRT) true-score equating for the bifactor model is often conducted by first numerically integrating out specific factors from the item response function and then applying the unidimensional IRT true-score equating method to the marginalized bifactor model. However, an alternative procedure for obtaining the marginalized bifactor model is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension. Projection, which can be viewed as an approximation to numerical integration, has an advantage over numerical integration in providing item parameters for the marginalized bifactor model; therefore, projection could be used with existing equating software packages that require item parameters. In this paper, IRT true-score equating results obtained with projection are compared to those obtained with numerical integration. Simulation results show that the two procedures provide very similar equating results.
Keywords: equating, item response theory, MIRT, projective item response theory, bifactor model
In item response theory (IRT) true-score equating, the true score (i.e., expected score) associated with a given $\theta$ on the new form is equated to the true score associated with the same $\theta$ on the old form. For unidimensional IRT (UIRT) models, the true-score equating procedure is quite straightforward to implement because there is a one-to-one correspondence between values of $\theta$ and true scores. By contrast, multiple $\boldsymbol{\theta}$ vectors are related to the same true score for multidimensional IRT (MIRT) models, which makes it difficult to apply the true-score equating procedure to MIRT models. For the bifactor model, however, assuming that the specific factors are nuisance variables, one common approach that has been used to overcome this difficulty is to integrate out the specific factors from the item response function and equate the two test forms based solely on the general factor (e.g., Lee et al., 2015). For each item $j$, the specific factor is integrated out of the item response function as
$$P_j^*(\theta_g) = \int_{-\infty}^{\infty} P_j(\theta_g, \theta_s)\, f(\theta_s)\, d\theta_s \tag{1}$$
where $\theta_g$ is the general factor, $P_j(\theta_g, \theta_s)$ is the item response function for the bifactor model, and $f(\theta_s)$ is the probability density function of $\theta_s$. This procedure for obtaining $P_j^*(\theta_g)$ will hereinafter be referred to as the "exact" method. A numerical method, such as Gauss-Hermite quadrature, can be used to evaluate Equation 1. Although a unidimensional test characteristic curve (TCC) can be obtained by summing $P_j^*(\theta_g)$ across all items, it is difficult to use the resulting TCC for equating with existing IRT equating software packages because most packages require item parameters, not the TCC itself.
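As an illustration of this numerical step, a minimal R sketch of the exact method is given below. The 3PL logistic form of the bifactor item response function, the use of the statmod package, the 21-point quadrature rule, and all function and parameter names are our illustrative assumptions, not details of the original study.

```r
# Marginalize the specific factor out of a logistic bifactor IRF
# (Equation 1) with Gauss-Hermite quadrature. gauss.quad.prob() returns
# nodes and weights for integrating against a N(0, 1) density.
library(statmod)

# Bifactor item response function (3PL form, one specific factor)
p_bifactor <- function(theta_g, theta_s, a_g, a_s, d, c) {
  c + (1 - c) / (1 + exp(-(a_g * theta_g + a_s * theta_s + d)))
}

# "Exact" marginal probability: integrate over theta_s ~ N(0, 1)
p_marginal <- function(theta_g, a_g, a_s, d, c, n_quad = 21) {
  gq <- gauss.quad.prob(n_quad, dist = "normal", mu = 0, sigma = 1)
  sapply(theta_g, function(tg) {
    sum(gq$weights * p_bifactor(tg, gq$nodes, a_g, a_s, d, c))
  })
}

# Example: marginal probabilities for one item at a few theta_g values
p_marginal(c(-1, 0, 1), a_g = 1.2, a_s = 0.8, d = -0.3, c = 0.2)
```

Summing such marginal probabilities across items yields the unidimensional TCC referred to above.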
An alternative approach to obtaining $P_j^*(\theta_g)$ is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension (see Ip & Chen, 2012, for more details about projection). Such a projection results in a locally dependent UIRT model that is empirically indistinguishable from the original MIRT model (Ip, 2010). Projecting the specific factors onto the dimension of the general factor yields
$$\tilde{P}_j(\theta_g) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\!\left[-\left(\tilde{a}_j \theta_g + \tilde{d}_j\right)\right]} \tag{2}$$

where

$$\tilde{a}_j = \kappa_j \left( a_{jg} + \rho\, \frac{\sigma_s}{\sigma_g}\, a_{js} \right), \qquad \tilde{d}_j = \kappa_j\, d_j \tag{3}$$

with $\kappa_j = \left[ 1 + C^2 a_{js}^2 \sigma_s^2 \left( 1 - \rho^2 \right) \right]^{-1/2}$ and $C = 16\sqrt{3}/(15\pi) \approx 0.588$. In Equation 3, $a_{jg}$ and $a_{js}$ are the general factor and specific factor slopes for item $j$, respectively; $c_j$ is the pseudo-guessing parameter for item $j$; $d_j$ is the intercept parameter for item $j$; $\sigma_g$ and $\sigma_s$ are the standard deviations of the general and specific factors, respectively; and $\rho$ is the correlation between the general and specific factors. More details of the aforementioned formulae are available in Ip and Chen (2012). Under the standard bifactor assumption of complete independence, the correlation between the general and specific factors is zero (i.e., $\rho = 0$), and the standard deviations are set equal to 1 for the purpose of fixing the unit of measurement along each dimension (i.e., $\sigma_g = \sigma_s = 1$), so that Equation 3 reduces to $\tilde{a}_j = a_{jg}/\sqrt{1 + C^2 a_{js}^2}$ and $\tilde{d}_j = d_j/\sqrt{1 + C^2 a_{js}^2}$. As this procedure can be viewed as an approximation to Equation 1, it will hereinafter be referred to as the "approximation" method. Although the local independence assumption does not hold for the projected model, true-score equating can still be conducted because the expected value of the total score (i.e., the TCC) is equal to $\sum_j \tilde{P}_j(\theta_g)$ regardless of whether the item responses are independent or not. The approximation method provides item parameters for a marginalized bifactor item response function and therefore enables the use of existing standalone equating software packages, such as PIE (Hanson & Zeng, 2004), POLYEQUATE (Kolen, 2004), and IRTEQ (Han, 2009), as well as the R package equateIRT (Battauz, 2015).
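In code, the projection amounts to rescaling each item's slope and intercept. The following R sketch implements Equation 3; the function names are ours, and the scaling constant $C$ is the usual logistic-normal approximation constant rather than a value prescribed by the original study.

```r
# Project bifactor item parameters onto the general dimension (Equation 3).
# Defaults reflect the standard bifactor constraints rho = 0 and
# sigma_g = sigma_s = 1.
project_item <- function(a_g, a_s, d, c,
                         rho = 0, sigma_g = 1, sigma_s = 1) {
  C <- 16 * sqrt(3) / (15 * pi)  # logistic-normal approximation constant
  kappa <- 1 / sqrt(1 + C^2 * a_s^2 * sigma_s^2 * (1 - rho^2))
  list(a = kappa * (a_g + rho * (sigma_s / sigma_g) * a_s),
       d = kappa * d,
       c = c)  # pseudo-guessing is unchanged by the projection
}

# Projected (marginal) item response function: a UIRT 3PL in theta_g
p_projected <- function(theta_g, pars) {
  pars$c + (1 - pars$c) / (1 + exp(-(pars$a * theta_g + pars$d)))
}

# Example: compare with the quadrature result from the earlier sketch
pars <- project_item(a_g = 1.2, a_s = 0.8, d = -0.3, c = 0.2)
p_projected(c(-1, 0, 1), pars)
```

Because the projection leaves the pseudo-guessing parameter unchanged, the projected slope, intercept, and guessing values can be passed directly to software that expects 3PL item parameters.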
Simulation Example
Under the random groups design, equating results obtained with the exact and approximation methods were compared through simulation. Both methods were implemented in a program written in R (R Core Team, 2019) by the authors of this study. Readers interested in obtaining a copy of the IRT equating R package may contact the authors by e-mail at k_kim9@uncg.edu. Because the bifactor model is considered appropriate for fitting testlet-based items, two 40-item test forms, each consisting of five eight-item testlets, were generated. The eight items belonging to a common testlet were set to load on the same specific factor to account for the residual dependence that exists above and beyond the general dimension. Each of the five testlets was crossed with two levels of local item dependence (LID): low and moderate. The general factor slopes were sampled from a N(1, 0.3) distribution; the specific factor slopes were sampled from a U(0.3, 0.6) distribution for the low-LID condition and a U(0.7, 1.0) distribution for the moderate-LID condition; the multidimensional item difficulty (MDIFF) parameters for the old and new forms were sampled from N(0, 1) and N(0.1, 1) distributions, respectively; and the pseudo-guessing parameters were sampled from a U(0.05, 0.35) distribution. The MDIFF parameters were converted to intercept parameters using Reckase's (2009, p. 90) formula:
$$d_j = -\text{MDIFF}_j \sqrt{a_{jg}^2 + a_{js}^2} \tag{4}$$

where $a_{jg}$ is the general factor slope, $a_{js}$ is the specific factor slope, and $d_j$ is the intercept parameter.
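For illustration, the parameter generation and the Equation 4 conversion could be coded as follows. The seed is arbitrary, and the second argument of rnorm() is taken here to be the standard deviation, which is an assumption about the N(1, 0.3) and N(0, 1) notation above.

```r
# Generate item parameters for one 40-item form (5 testlets x 8 items),
# shown for the low-LID condition of the old form; the seed is arbitrary.
set.seed(1)
n_items <- 40
a_g   <- rnorm(n_items, mean = 1, sd = 0.3)  # general factor slopes
a_s   <- runif(n_items, 0.3, 0.6)            # specific slopes, low LID
mdiff <- rnorm(n_items, mean = 0, sd = 1)    # MDIFF, old form
c_par <- runif(n_items, 0.05, 0.35)          # pseudo-guessing

# Equation 4: convert MDIFF to intercept parameters
d_par <- -mdiff * sqrt(a_g^2 + a_s^2)
```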
For both groups, $\theta$ values for 3,000 simulees were sampled from a standard normal distribution. One hundred data sets were generated for each of the two LID conditions by fixing the item parameters but sampling different sets of $\theta$ values. For the two test forms, the average Cronbach's alpha across the 100 data sets was approximately .89 for the low-LID condition and approximately .87 for the moderate-LID condition. The mean absolute difference (MAD) between the exact and approximation methods was calculated for each number-correct score over the 100 replications (hereinafter referred to as the conditional MAD), defined as
$$\text{MAD}(x) = \frac{1}{100} \sum_{r=1}^{100} \left| \hat{e}_{E,r}(x) - \hat{e}_{A,r}(x) \right| \tag{5}$$

where the subscripts $E$ and $A$ denote the exact and approximation methods, respectively, and $\hat{e}_r(x)$ is the old form equivalent of number-correct score $x$ on the new form in replication $r$.
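To make the computation concrete, the following R sketch performs UIRT true-score equating for a score within the range of true scores and computes the conditional MAD of Equation 5. The data structures (pars_new and pars_old as data frames with columns a, d, and c) and all function names are illustrative assumptions, not the authors' package.

```r
# Test characteristic curve for a set of 3PL items (one row per item)
tcc <- function(theta, pars) {
  sum(pars$c + (1 - pars$c) / (1 + exp(-(pars$a * theta + pars$d))))
}

# True-score equating: find theta such that the new-form TCC equals x,
# then evaluate the old-form TCC at that theta. Only scores between
# sum(pars_new$c) and the maximum score can be inverted; scores outside
# the range of true scores require the ad hoc procedure.
equate_score <- function(x, pars_new, pars_old) {
  theta_x <- uniroot(function(th) tcc(th, pars_new) - x,
                     interval = c(-10, 10))$root
  tcc(theta_x, pars_old)
}

# Conditional MAD (Equation 5) at one score x, given 100-replication
# vectors of old form equivalents from the two methods
cond_mad <- function(eq_exact, eq_approx) mean(abs(eq_exact - eq_approx))
```

In a simulation like the one described here, equate_score() would be applied to each number-correct score in each replication twice, once with projected item parameters and once with tcc() replaced by the quadrature-based TCC for the exact method, and the two sets of old form equivalents would then be passed to cond_mad().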
The conditional MAD plots for the two LID conditions are depicted in Figure 1. The conditional MADs were zero for scores outside the range of true scores (scores between 0 and 7 and a score of 40) because the ad hoc procedure used to equate these scores produced the same equating results for the exact and approximation methods (see, e.g., Kolen & Brennan, 2014, for more details about the ad hoc procedure). In general, although the average and maximum MAD increased with increasing LID, the conditional MADs were smaller than .1 across the entire score scale for both LID conditions; specifically, the maximum MAD for the low- and moderate-LID conditions was .0270 and .0856, respectively (see Table 1). Note that number-correct score differences smaller than .5 are typically regarded as practically nonsignificant according to the difference that matters criterion (Dorans & Feigenbaum, 1994).
Figure 1. Conditional MAD between the exact and approximation methods.
Note. MAD = mean absolute difference.
Table 1.
Average and Maximum MAD Between the Exact and Approximation Methods.
| Evaluation measure | Average (Low LID) | Average (Moderate LID) | Maximum (Low LID) | Maximum (Moderate LID) |
|---|---|---|---|---|
| MAD | .0135 | .0349 | .0270 | .0856 |
Note. Average MAD was calculated over scores within the range of true scores. MAD = mean absolute difference; LID = local item dependence.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Kyung Yong Kim
https://orcid.org/0000-0001-7549-5800
References
- Battauz M. (2015). equateIRT: An R package for IRT test equating. Journal of Statistical Software, 68, 1-22.
- Dorans N. J., Feigenbaum M. D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT. In Lawrence I. M., Dorans N. J., Feigenbaum M. D., Feryok N. J., Schmitt A. P., Wright N. K. (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (Research Memorandum No. RM-94-10). Princeton, NJ: Educational Testing Service.
- Han K. T. (2009). IRTEQ: Windows application that implements IRT scaling and equating [Computer program]. Amherst: The University of Massachusetts Amherst. Retrieved from https://www.umass.edu/remp/software/simcata/irteq/
- Hanson B., Zeng L. (2004). PIE: A computer program for IRT equating [Computer program]. Iowa City: The University of Iowa. Retrieved from https://education.uiowa.edu/centers/casma
- Ip E. H. (2010). Empirically indistinguishable IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395-415.
- Ip E. H., Chen S. (2012). Projective item response model for test-independent measurement. Applied Psychological Measurement, 36, 581-601.
- Kolen M. J. (2004). POLYEQUATE [Computer program]. Iowa City: The University of Iowa. Retrieved from https://education.uiowa.edu/centers/casma
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking. New York, NY: Springer.
- Lee G., Lee W., Kolen M. J., Park I., Kim D., Yang J. S. (2015, April). Bi-factor MIRT true-score equating for testlet-based tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Philadelphia, PA.
- R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from http://www.R-project.org/
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
