Abstract
Item response theory (IRT) true-score equating for the bifactor model is often conducted by first numerically integrating out specific factors from the item response function and then applying the unidimensional IRT true-score equating method to the marginalized bifactor model. However, an alternative procedure for obtaining the marginalized bifactor model is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension. Projection, which can be viewed as an approximation to numerical integration, has an advantage over numerical integration in providing item parameters for the marginalized bifactor model; therefore, projection could be used with existing equating software packages that require item parameters. In this paper, IRT true-score equating results obtained with projection are compared to those obtained with numerical integration. Simulation results show that the two procedures provide very similar equating results.
Keywords: equating, item response theory, MIRT, projective item response theory, bifactor model
In item response theory (IRT) true-score equating, the true score (i.e., expected score) associated with a given $\theta$ on the new form is equated to the true score associated with the same $\theta$ on the old form. For unidimensional IRT (UIRT) models, the true-score equating procedure is quite straightforward to implement because there is a one-to-one correspondence between values of $\theta$ and true scores. By contrast, multiple $\boldsymbol{\theta}$ vectors are related to the same true score for multidimensional IRT (MIRT) models, which makes it difficult to apply the true-score equating procedure to MIRT models. For the bifactor model, however, assuming that the specific factors are nuisance variables, one common approach that has been used to overcome this difficulty is to integrate out the specific factors from the item response function and equate the two test forms based solely on the general factor (e.g., Lee et al., 2015). For each item $j$, the specific factor is integrated out of the item response function as
$$P_j^*(\theta_g) = \int_{-\infty}^{\infty} P_j(\theta_g, \theta_s)\, f(\theta_s)\, d\theta_s \tag{1}$$
where $\theta_g$ is the general factor, $P_j(\theta_g, \theta_s)$ is the item response function for the bifactor model, and $f(\theta_s)$ is the probability density function of $\theta_s$. This procedure for obtaining $P_j^*(\theta_g)$ will hereinafter be referred to as the "exact" method. A numerical method, such as Gauss-Hermite quadrature, can be used to evaluate Equation 1. Although a unidimensional test characteristic curve (TCC) can be obtained by summing $P_j^*(\theta_g)$ across all items, it is difficult to use the resulting TCC for equating with existing IRT equating software packages because most packages require item parameters, not the TCC itself.
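As an illustration of this numerical step, a minimal R sketch of the exact method is given below. The 3PL logistic form of the bifactor item response function, the use of the statmod package, the 21-point quadrature rule, and all function and parameter names are our illustrative assumptions, not details of the original study.

```r
# Marginalize the specific factor out of a logistic bifactor IRF
# (Equation 1) with Gauss-Hermite quadrature. gauss.quad.prob() returns
# nodes and weights for integrating against a N(0, 1) density.
library(statmod)

# Bifactor item response function (3PL form, one specific factor)
p_bifactor <- function(theta_g, theta_s, a_g, a_s, d, c) {
  c + (1 - c) / (1 + exp(-(a_g * theta_g + a_s * theta_s + d)))
}

# "Exact" marginal probability: integrate over theta_s ~ N(0, 1)
p_marginal <- function(theta_g, a_g, a_s, d, c, n_quad = 21) {
  gq <- gauss.quad.prob(n_quad, dist = "normal", mu = 0, sigma = 1)
  sapply(theta_g, function(tg) {
    sum(gq$weights * p_bifactor(tg, gq$nodes, a_g, a_s, d, c))
  })
}

# Example: marginal probabilities for one item at a few theta_g values
p_marginal(c(-1, 0, 1), a_g = 1.2, a_s = 0.8, d = -0.3, c = 0.2)
```

Summing such marginal probabilities across items yields the unidimensional TCC referred to above.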
An alternative approach to obtaining $P_j^*(\theta_g)$ is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension (see Ip & Chen, 2012, for more details about projection). Such a projection results in a locally dependent UIRT model that is empirically indistinguishable from the original MIRT model (Ip, 2010). Projecting the specific factors onto the dimension of the general factor yields
$$\tilde{P}_j(\theta_g) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\!\left[-\left(\tilde{a}_j \theta_g + \tilde{d}_j\right)\right]} \tag{2}$$

where

$$\tilde{a}_j = \kappa_j \left( a_{jg} + \rho\, \frac{\sigma_s}{\sigma_g}\, a_{js} \right), \qquad \tilde{d}_j = \kappa_j\, d_j \tag{3}$$

with $\kappa_j = \left[ 1 + C^2 a_{js}^2 \sigma_s^2 \left( 1 - \rho^2 \right) \right]^{-1/2}$ and $C = 16\sqrt{3}/(15\pi) \approx 0.588$. In Equation 3, $a_{jg}$ and $a_{js}$ are the general factor and specific factor slopes for item $j$, respectively; $c_j$ is the pseudo-guessing parameter for item $j$; $d_j$ is the intercept parameter for item $j$; $\sigma_g$ and $\sigma_s$ are the standard deviations of the general and specific factors, respectively; and $\rho$ is the correlation between the general and specific factors. More details of the aforementioned formulae are available in Ip and Chen (2012). Under the standard bifactor assumption of complete independence, the correlation between the general and specific factors is zero (i.e., $\rho = 0$), and the standard deviations are set equal to 1 for the purpose of fixing the unit of measurement along each dimension (i.e., $\sigma_g = \sigma_s = 1$), so that Equation 3 reduces to $\tilde{a}_j = a_{jg}/\sqrt{1 + C^2 a_{js}^2}$ and $\tilde{d}_j = d_j/\sqrt{1 + C^2 a_{js}^2}$. As this procedure can be viewed as an approximation to Equation 1, it will hereinafter be referred to as the "approximation" method. Although the local independence assumption does not hold for the projected model, true-score equating can still be conducted because the expected value of the total score (i.e., the TCC) is equal to $\sum_j \tilde{P}_j(\theta_g)$ regardless of whether the item responses are independent or not. The approximation method provides item parameters for a marginalized bifactor item response function and therefore enables the use of existing standalone equating software packages, such as PIE (Hanson & Zeng, 2004), POLYEQUATE (Kolen, 2004), and IRTEQ (Han, 2009), as well as the R package equateIRT (Battauz, 2015).
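In code, the projection amounts to rescaling each item's slope and intercept. The following R sketch implements Equation 3; the function names are ours, and the scaling constant $C$ is the usual logistic-normal approximation constant rather than a value prescribed by the original study.

```r
# Project bifactor item parameters onto the general dimension (Equation 3).
# Defaults reflect the standard bifactor constraints rho = 0 and
# sigma_g = sigma_s = 1.
project_item <- function(a_g, a_s, d, c,
                         rho = 0, sigma_g = 1, sigma_s = 1) {
  C <- 16 * sqrt(3) / (15 * pi)  # logistic-normal approximation constant
  kappa <- 1 / sqrt(1 + C^2 * a_s^2 * sigma_s^2 * (1 - rho^2))
  list(a = kappa * (a_g + rho * (sigma_s / sigma_g) * a_s),
       d = kappa * d,
       c = c)  # pseudo-guessing is unchanged by the projection
}

# Projected (marginal) item response function: a UIRT 3PL in theta_g
p_projected <- function(theta_g, pars) {
  pars$c + (1 - pars$c) / (1 + exp(-(pars$a * theta_g + pars$d)))
}

# Example: compare with the quadrature result from the earlier sketch
pars <- project_item(a_g = 1.2, a_s = 0.8, d = -0.3, c = 0.2)
p_projected(c(-1, 0, 1), pars)
```

Because the projection leaves the pseudo-guessing parameter unchanged, the projected slope, intercept, and guessing values can be passed directly to software that expects 3PL item parameters.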
Simulation Example
Under the random groups design, equating results obtained with the exact and approximation methods were compared through simulation. Both methods were implemented in a program written in R (R Core Team, 2019) by the authors of this study. Readers interested in obtaining a copy of the IRT equating R package may contact the authors by e-mail at k_kim9@uncg.edu. Because the bifactor model is considered appropriate for fitting testlet-based items, two 40-item test forms, each consisting of five eight-item testlets, were generated. The eight items belonging to a common testlet were set to load on the same specific factor to account for the residual dependence that exists above and beyond the general dimension. Each of the five testlets was crossed with two levels of local item dependence (LID): low and moderate. The general factor slopes were sampled from a N(1, 0.3) distribution; the specific factor slopes were sampled from a U(0.3, 0.6) distribution for the low-LID condition and a U(0.7, 1.0) distribution for the moderate-LID condition; the multidimensional item difficulty (MDIFF) parameters for the old and new forms were sampled from N(0, 1) and N(0.1, 1) distributions, respectively; and the pseudo-guessing parameters were sampled from a U(0.05, 0.35) distribution. The MDIFF parameters were converted to intercept parameters using Reckase's (2009, p. 90) formula:
$$d_j = -\text{MDIFF}_j \sqrt{a_{jg}^2 + a_{js}^2} \tag{4}$$

where $a_{jg}$ is the general factor slope, $a_{js}$ is the specific factor slope, and $d_j$ is the intercept parameter.
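For illustration, the parameter generation and the Equation 4 conversion could be coded as follows. The seed is arbitrary, and the second argument of rnorm() is taken here to be the standard deviation, which is an assumption about the N(1, 0.3) and N(0, 1) notation above.

```r
# Generate item parameters for one 40-item form (5 testlets x 8 items),
# shown for the low-LID condition of the old form; the seed is arbitrary.
set.seed(1)
n_items <- 40
a_g   <- rnorm(n_items, mean = 1, sd = 0.3)  # general factor slopes
a_s   <- runif(n_items, 0.3, 0.6)            # specific slopes, low LID
mdiff <- rnorm(n_items, mean = 0, sd = 1)    # MDIFF, old form
c_par <- runif(n_items, 0.05, 0.35)          # pseudo-guessing

# Equation 4: convert MDIFF to intercept parameters
d_par <- -mdiff * sqrt(a_g^2 + a_s^2)
```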
For both groups, $\theta$ values for 3,000 simulees were sampled from a standard normal distribution. One hundred data sets were generated for each of the two LID conditions by fixing the item parameters but sampling different sets of $\theta$ values. For the two test forms, the average Cronbach's alpha across the 100 data sets was approximately .89 for the low-LID condition and approximately .87 for the moderate-LID condition. The mean absolute difference (MAD) between the exact and approximation methods was calculated for each number-correct score over the 100 replications (hereinafter referred to as the conditional MAD), defined as
$$\text{MAD}(x) = \frac{1}{100} \sum_{r=1}^{100} \left| \hat{e}_{E,r}(x) - \hat{e}_{A,r}(x) \right| \tag{5}$$

where the subscripts $E$ and $A$ denote the exact and approximation methods, respectively, and $\hat{e}_r(x)$ is the old form equivalent of number-correct score $x$ on the new form in replication $r$.
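To make the computation concrete, the following R sketch performs UIRT true-score equating for a score within the range of true scores and computes the conditional MAD of Equation 5. The data structures (pars_new and pars_old as data frames with columns a, d, and c) and all function names are illustrative assumptions, not the authors' package.

```r
# Test characteristic curve for a set of 3PL items (one row per item)
tcc <- function(theta, pars) {
  sum(pars$c + (1 - pars$c) / (1 + exp(-(pars$a * theta + pars$d))))
}

# True-score equating: find theta such that the new-form TCC equals x,
# then evaluate the old-form TCC at that theta. Only scores between
# sum(pars_new$c) and the maximum score can be inverted; scores outside
# the range of true scores require the ad hoc procedure.
equate_score <- function(x, pars_new, pars_old) {
  theta_x <- uniroot(function(th) tcc(th, pars_new) - x,
                     interval = c(-10, 10))$root
  tcc(theta_x, pars_old)
}

# Conditional MAD (Equation 5) at one score x, given 100-replication
# vectors of old form equivalents from the two methods
cond_mad <- function(eq_exact, eq_approx) mean(abs(eq_exact - eq_approx))
```

In a simulation like the one described here, equate_score() would be applied to each number-correct score in each replication twice, once with projected item parameters and once with tcc() replaced by the quadrature-based TCC for the exact method, and the two sets of old form equivalents would then be passed to cond_mad().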
The conditional MAD plots for the two LID conditions are depicted in Figure 1. The conditional MADs were zero for scores outside the range of true scores (scores between 0 and 7 and a score of 40) because the ad hoc procedure used to equate these scores produced the same equating results for the exact and approximation methods (see, e.g., Kolen & Brennan, 2014, for more details about the ad hoc procedure). In general, although the average and maximum MAD increased with increasing LID, the conditional MADs were smaller than .1 across the entire score scale for both LID conditions; specifically, the maximum MAD for the low- and moderate-LID conditions was .0270 and .0856, respectively (see Table 1). Note that number-correct score differences smaller than .5 are typically regarded as practically nonsignificant according to the difference that matters criterion (Dorans & Feigenbaum, 1994).
Figure 1. Conditional MAD between the exact and approximation methods.
Note. MAD = mean absolute difference.
Table 1.
Average and Maximum MAD Between the Exact and Approximation Methods.
| Evaluation measure | Average (Low LID) | Average (Moderate LID) | Maximum (Low LID) | Maximum (Moderate LID) |
|---|---|---|---|---|
| MAD | .0135 | .0349 | .0270 | .0856 |
Note. Average MAD was calculated over scores within the range of true scores. MAD = mean absolute difference; LID = local item dependence.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Kyung Yong Kim
https://orcid.org/0000-0001-7549-5800
References
- Battauz M. (2015). equateIRT: An R package for IRT test equating. Journal of Statistical Software, 68, 1-22.
- Dorans N. J., Feigenbaum M. D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT. In Lawrence I. M., Dorans N. J., Feigenbaum M. D., Feryok N. J., Schmitt A. P., Wright N. K. (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (Research Memorandum No. RM-94-10). Princeton, NJ: Educational Testing Service.
- Han K. T. (2009). IRTEQ: Windows application that implements IRT scaling and equating [Computer program]. Amherst: The University of Massachusetts Amherst. Retrieved from https://www.umass.edu/remp/software/simcata/irteq/
- Hanson B., Zeng L. (2004). PIE: A computer program for IRT equating [Computer program]. Iowa City: The University of Iowa. Retrieved from https://education.uiowa.edu/centers/casma
- Ip E. H. (2010). Empirically indistinguishable IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395-415.
- Ip E. H., Chen S. (2012). Projective item response model for test-independent measurement. Applied Psychological Measurement, 36, 581-601.
- Kolen M. J. (2004). POLYEQUATE [Computer program]. Iowa City: The University of Iowa. Retrieved from https://education.uiowa.edu/centers/casma
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking. New York, NY: Springer.
- Lee G., Lee W., Kolen M. J., Park I., Kim D., Yang J. S. (2015, April). Bi-factor MIRT true-score equating for testlet-based tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Philadelphia, PA.
- R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from http://www.R-project.org/
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
