Abstract
When calibrating items using multidimensional item response theory (MIRT) models, item response theory (IRT) calibration programs typically set the probability density of latent variables to a multivariate standard normal distribution to handle three types of indeterminacies: (a) the location of the origin, (b) the unit of measurement along each coordinate axis, and (c) the orientation of the coordinate axes. As a consequence, however, the item parameter estimates obtained from two independent calibration runs on nonequivalent groups lie on two different coordinate systems. To handle this issue and place all the item parameter estimates on a common coordinate system, a process called linking is necessary. Although various linking methods have been introduced and studied for the full MIRT model, little research has been conducted on linking methods for the bifactor model. Thus, the purpose of this study was to provide detailed descriptions of two separate calibration methods and the concurrent calibration method for the bifactor model and to compare the three linking methods through simulation. In general, the concurrent calibration method provided more accurate linking results than the two separate calibration methods, demonstrating better recovery of the item parameters, item characteristic surfaces, and expected score distribution.
Keywords: linking, bifactor model, multidimensional item response theory
In item response theory (IRT), two major difficulties arise in item parameter estimation. First, the use of the marginal maximum likelihood estimation procedure implemented via the expectation-maximization algorithm (MMLE-EM; Bock & Aitkin, 1981; Woodruff & Hanson, 1997) or the marginalized Bayesian approach (Mislevy, 1986) entails the estimation of item parameters by marginalizing (integrating) out the person parameters. This type of estimation requires the person parameters to be described by a probability distribution; however, correctly specifying the distribution of person parameters is often difficult, if not impossible, because they are latent variables. The other difficulty related to item parameter estimation in IRT is the lack of a standard coordinate system. More specifically, in both unidimensional IRT (UIRT) and multidimensional IRT (MIRT), the location of the origin and the unit of measurement along each coordinate axis are arbitrary, and, in MIRT, the orientation of the coordinate axes is arbitrary as well. To resolve these indeterminacies, IRT calibration programs typically set the distribution of latent variables to a univariate or multivariate standard normal distribution, depending on the number of user-specified dimensions in the IRT model.
Under the common-item nonequivalent groups (CINEG) design, two different test forms sharing common items are administered to two groups of examinees from different populations. This data collection design is often used for test equating when testing programs cannot administer more than one test form on a given test date because of test security concerns. The CINEG design is also widely used in vertical scaling to place tests that differ in difficulty but are intended to measure similar constructs (e.g., math tests across different grade levels) on the same scale. In the context of IRT, two populations are different in the sense that their distributions of latent variables are different. Nonetheless, when estimating item parameters for two nonequivalent groups using two independent calibration runs, the distribution of latent variables for each group is commonly set to a univariate or multivariate standard normal distribution. Consequently, the two sets of item parameter estimates obtained from two independent calibration runs on nonequivalent groups are on two different coordinate systems. To handle this issue and place all the item parameter estimates on a common coordinate system, a process called linking is required. Note that the term linking in this article is used to describe explicitly the process of placing IRT parameters on a common coordinate system.
For UIRT models, various linking procedures have been introduced and used in applications such as vertical scaling, differential item functioning (DIF), computerized adaptive testing (CAT), and test equating. Among these linking procedures, the relative performance of the separate and concurrent calibration methods has been compared in numerous studies (e.g., Hanson & Béguin, 2002; Kim & Cohen, 1998; Kim & Kolen, 2006; W. Lee & Ban, 2010). In the context of MIRT, several studies have proposed separate calibration procedures for the full MIRT model and investigated the recovery of item parameters (e.g., Li & Lissitz, 2000; Oshima, Davey, & Lee, 2000; Yao, 2011). Through simulation, studies consistently found that the proposed methods recovered item parameters reasonably well. In addition, Simon (2008) compared some of these separate calibration methods with the concurrent calibration method and found that the concurrent calibration method generally performed better than the separate calibration methods.
The full MIRT model, which was used in all of the aforementioned MIRT linking studies, is a flexible model as it allows each item to freely load on any dimension in the model. Despite its flexibility, the computational burden of the full MIRT model increases exponentially with increasing dimensions, which substantially lowers the applicability of this model. An alternative MIRT model that is not only computationally tractable but also flexible enough to represent structures that are commonly found in educational and psychological measurement is the full-information bifactor model (Cai, Yang, & Hansen, 2011; Gibbons et al., 2007; Gibbons & Hedeker, 1992). In educational measurement, the bifactor model has been applied to a variety of areas, including calibration of testlet-based tests (DeMars, 2006), vertical scaling (Li & Lissitz, 2012), DIF (Jeon, Rijmen, & Rabe-Hesketh, 2013), multiple-group analysis (Cai et al., 2011), and test equating (G. Lee & Lee, 2016; G. Lee et al., 2015). However, although many of these procedures involve linking, little work has been done to compare the separate and concurrent calibration methods for the bifactor model. Thus, building upon earlier work on linking, the objective of this article is to (a) provide detailed descriptions of the separate and concurrent calibration procedures for the bifactor model, and (b) compare the relative performance of the two linking methods using a simulation study.
Bifactor Model
The full-information bifactor model gained its popularity due to Gibbons and Hedeker’s (1992) derivation of a dimension reduction MMLE-EM technique for binary response data in the form of a normal ogive model. Later, Gibbons et al. (2007) derived a dimension reduction estimation procedure for the normal ogive graded response model when the data had a bifactor structure. Recently, Cai et al. (2011) generalized the bifactor model to relate the latent variables and item responses using multidimensional extensions of the unidimensional logistic models. The bifactor model used in this study is the bifactor extension of the unidimensional three-parameter logistic (3PL) model presented in Cai et al. (2011).
In the bifactor model, there are two different sets of latent variables: (a) the general factor and (b) the specific factors. The model requires that all items load on the general factor and on at most one specific factor. Furthermore, it is typically assumed that all factors are independent to achieve dimension reduction during computations of the marginal maximum likelihood. The bifactor extension of the unidimensional 3PL model is defined by
$$P\left(u_{ij}=1\mid\theta_{i0},\theta_{is(j)}\right)=c_j+\frac{1-c_j}{1+\exp\left[-D\left(a_{j0}\theta_{i0}+a_{js(j)}\theta_{is(j)}+d_j\right)\right]}, \quad (1)$$
where the subscripts $i$ and $j$ denote person and item, respectively; $D$ is a scaling constant that is generally either 1.0 (logistic metric) or 1.7 (normal metric); $\theta_{i0}$ is the general factor; $\theta_{is(j)}$ is the specific factor on which item $j$ loads; $s(j)\in\{1,\ldots,S\}$ indexes that specific factor; $a_{j0}$ and $a_{js(j)}$ are the item discrimination parameters for the general and specific factors, respectively; $u_{ij}$ is the scored response of person $i$ to item $j$; $c_j$ is the pseudo-guessing parameter; and $d_j$ is the intercept parameter that is related to the multidimensional difficulty parameter defined in Reckase (2009, p. 117).
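For concreteness, the response function in Equation 1 can be sketched in code. The study's own analyses were implemented in R; the sketch below is a Python equivalent, and the function name is hypothetical:

```python
import numpy as np

def bifactor_3pl(theta_g, theta_s, a_g, a_s, d, c, D=1.0):
    """Probability of a correct response under the bifactor 3PL model (Equation 1).

    theta_g : general-factor score for the examinee
    theta_s : score on the one specific factor the item loads on
    a_g, a_s: discrimination parameters for the general and specific factors
    d       : intercept parameter
    c       : pseudo-guessing parameter
    D       : scaling constant (1.0 = logistic metric, 1.7 = normal metric)
    """
    z = D * (a_g * theta_g + a_s * theta_s + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))
```

For example, at the origin of the latent space the exponent is just $Dd_j$, so with $d_j = 0$ and $c_j = 0.2$ the probability is $0.2 + 0.8 \times 0.5 = 0.6$.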
Separate and Concurrent Calibration for the Bifactor Model
This section provides detailed descriptions of two separate calibration procedures and the concurrent calibration procedure for the bifactor model. It is assumed that the purpose of linking is to place the item parameters of the new form on the coordinate system of the base form. Note that “Form X” will be used interchangeably with “new form,” and “Form Y” will be used interchangeably with “base form.” In addition, the groups that take Forms X and Y will be referred to as the new and base groups, respectively.
Under the CINEG design, three decisions must be made for MIRT models to place item parameter estimates obtained from separate calibration runs on the same coordinate system: (a) the location of the origin, (b) the unit of measurement along each coordinate axis, and (c) the orientation of the coordinate axes (i.e., rotational indeterminacy). Note that the standard bifactor analysis assumption of complete independence of all dimensions, which ensures dimension reduction in likelihood computations, solves rotational indeterminacy for the bifactor model. However, as noted by Rijmen (2009), dimension reduction in likelihood computations can also be achieved by relaxing the complete independence assumption to conditional independence of the specific factors given the general factor. In this case, the rotational indeterminacy can be fixed by constraining the correlation between the general factor and each specific factor to zero. The separate and concurrent calibration procedures presented in this article are based on the standard bifactor analysis assumption of complete independence.
Separate Calibration
Extending the transformation equations for the 3PL model (Kolen & Brennan, 2014, p. 178) to the bifactor model, the item response probabilities remain unchanged under the following linear transformations:
$$\theta_i^{*}=\mathbf{A}\theta_i+\boldsymbol{\beta}, \quad (2)$$

$$\mathbf{a}_j=\mathbf{A}\mathbf{a}_j^{*}, \quad (3)$$

$$d_j=d_j^{*}+\mathbf{a}_j^{*\top}\boldsymbol{\beta}, \quad (4)$$

and

$$c_j=c_j^{*}, \quad (5)$$
where $\mathbf{A}$ is a diagonal matrix of size $(S+1)\times(S+1)$ with scaling constants on the diagonal ($S$ denotes the total number of specific factors in the model); $\boldsymbol{\beta}$ is the translation vector of length $S+1$; $\theta_i$ is the parameter vector for person $i$ that is on the scale of the base form; $\mathbf{a}_j$, $c_j$, and $d_j$ are the slope parameter vector, pseudo-guessing parameter, and intercept parameter for item $j$ that are on the scale of the base form, respectively; and $\theta_i^{*}$, $\mathbf{a}_j^{*}$, $c_j^{*}$, and $d_j^{*}$ are the corresponding parameters on the scale of the new form. Probability invariance can be shown mathematically by comparing the exponent of the bifactor model before and after applying the linear transformations:
$$\mathbf{a}_j^{\top}\theta_i+d_j=\left(\mathbf{A}\mathbf{a}_j^{*}\right)^{\top}\theta_i+d_j^{*}+\mathbf{a}_j^{*\top}\boldsymbol{\beta}=\mathbf{a}_j^{*\top}\left(\mathbf{A}\theta_i+\boldsymbol{\beta}\right)+d_j^{*}=\mathbf{a}_j^{*\top}\theta_i^{*}+d_j^{*}. \quad (6)$$
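The invariance of the exponent can be checked numerically. The sketch below (Python; variable names are hypothetical) assumes the transformation convention just described, with a diagonal scaling matrix, $\mathbf{a}_j = \mathbf{A}\mathbf{a}_j^{*}$, and $d_j = d_j^{*} + \mathbf{a}_j^{*\top}\boldsymbol{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(0)

S = 3                                        # number of specific factors
A = np.diag(rng.uniform(0.8, 1.3, S + 1))    # scaling matrix (diagonal)
beta = rng.normal(0.0, 0.5, S + 1)           # translation vector

# item parameters on the new-form scale (general + one specific slope)
a_star = np.zeros(S + 1)
a_star[0], a_star[1] = 1.2, 0.6              # loads on general and specific factor 1
d_star = -0.4

# transform the item parameters to the base scale (Equations 3 and 4)
a = A @ a_star
d = d_star + a_star @ beta

# person parameters: theta on the base scale, theta* = A theta + beta (Equation 2)
theta = rng.normal(size=S + 1)
theta_star = A @ theta + beta

# the exponent, and hence the response probability, is invariant (Equation 6)
print(np.isclose(a @ theta + d, a_star @ theta_star + d_star))  # True
```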
Because the scaling matrix $\mathbf{A}$ is diagonal and appears only in Equation 3 for the item parameters, one way to estimate the scaling parameter for dimension $v$, $A_v$, is to divide the mean of the slope parameter estimates for dimension $v$ of the base form by that of the new form:
$$A_v=\frac{E\left(\hat{a}_{jv}\right)}{E\left(\hat{a}_{jv}^{*}\right)}, \quad (7)$$
where $E$ is the expectation operator. Note that only the slope parameter estimates for the common items are used to compute each $A_v$. To estimate the translation parameters, one can solve Equation 4 for $\boldsymbol{\beta}$; that is,
$$\hat{\boldsymbol{\beta}}=\left(\hat{\mathbf{A}}_X^{\top}\hat{\mathbf{A}}_X\right)^{-1}\hat{\mathbf{A}}_X^{\top}\left(\hat{\mathbf{d}}_Y-\hat{\mathbf{d}}_X\right), \quad (8)$$
where $\hat{\mathbf{A}}_X$ is a matrix of the slope parameter estimates for the new form, and $\hat{\mathbf{d}}_Y$ and $\hat{\mathbf{d}}_X$ are vectors of the intercept parameter estimates for the base and new forms, respectively. As with the scaling parameters, only the common items are used for the computation of the translation parameters. Conducting linking with Equations 7 and 8 will hereinafter be referred to as the direct method.
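A minimal sketch of the direct method, assuming the common-item estimates are arranged as arrays with the general factor in column 0. Excluding items that do not load on a given dimension from that dimension's slope ratio is a choice of this sketch (in the bifactor model those slopes are fixed at zero); the function name is hypothetical:

```python
import numpy as np

def direct_linking(a_base, d_base, a_new, d_new):
    """Estimate the scaling matrix A and translation vector beta with the
    direct method (Equations 7 and 8), using common-item estimates only.

    a_base, a_new : (n_common, S+1) slope estimates on the base/new scales
    d_base, d_new : (n_common,) intercept estimates on the base/new scales
    """
    # Equation 7: per-dimension ratio of mean slopes (base over new),
    # computed over the common items that actually load on each dimension
    A = np.zeros(a_base.shape[1])
    for v in range(a_base.shape[1]):
        loads = a_new[:, v] != 0
        A[v] = a_base[loads, v].mean() / a_new[loads, v].mean()

    # Equation 8: least-squares solution for the translation vector
    beta, *_ = np.linalg.lstsq(a_new, d_base - d_new, rcond=None)
    return np.diag(A), beta
```

With error-free input, i.e., base-scale parameters generated exactly by Equations 3 and 4, the routine recovers the generating $\mathbf{A}$ and $\boldsymbol{\beta}$.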
The scaling matrix and the translation vector could also be determined by generalizing the Haebara linking method (Haebara, 1980) for UIRT models to the bifactor model. More specifically, the arguments $\mathbf{A}$ and $\boldsymbol{\beta}$ that minimize the sum of the squared differences between the item characteristic surfaces (ICSs) obtained with the item parameter estimates for the base form and those for the new form transformed to the base scale are determined by solving
$$\left(\hat{\mathbf{A}},\hat{\boldsymbol{\beta}}\right)=\underset{\mathbf{A},\,\boldsymbol{\beta}}{\arg\min}\sum_{j\in C}\int\left[P_j\left(\theta;\hat{\mathbf{a}}_j,\hat{c}_j,\hat{d}_j\right)-P_j\left(\theta;\tilde{\mathbf{a}}_j,\tilde{c}_j,\tilde{d}_j\right)\right]^2 w(\theta)\,d\theta, \quad (9)$$
where $C$ denotes the set of common items; $w(\theta)$ is a weight function; and $\tilde{\mathbf{a}}_j=\mathbf{A}\hat{\mathbf{a}}_j^{*}$, $\tilde{c}_j=\hat{c}_j^{*}$, and $\tilde{d}_j=\hat{d}_j^{*}+\hat{\mathbf{a}}_j^{*\top}\boldsymbol{\beta}$ are estimates for the slope parameter vector, pseudo-guessing parameter, and intercept parameter of the new form transformed to the base scale. In general, $\hat{c}_j$ and $\tilde{c}_j$ are different because they are estimates, not population parameters. Note that there is no analytical solution for Equation 9, and the integral must therefore be approximated using a numerical integration method. Finding the scaling matrix and translation vector using Equation 9 will hereinafter be referred to as the ICS method.
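The ICS method can be sketched as a numerical minimization of Equation 9. Because each item's ICS depends only on the general factor and the one specific factor the item loads on, the integral reduces to a two-dimensional one per item; the sketch below approximates it with a normal-weighted grid and uses scipy's Nelder-Mead optimizer. All names are hypothetical, and the grid-based weight function is a choice of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

def haebara_bifactor(a_base, d_base, c_base, a_new, d_new, c_new,
                     spec, n_q=15, D=1.0):
    """ICS (Haebara-type) linking for the bifactor model (Equation 9).

    a_* : (n_common, S+1) slope estimates; column 0 is the general factor
    spec: index (1..S) of the specific factor each common item loads on
    """
    q = np.linspace(-4.0, 4.0, n_q)
    w1 = np.exp(-0.5 * q**2)
    w1 /= w1.sum()
    T0, Ts = np.meshgrid(q, q, indexing="ij")
    W = np.outer(w1, w1)                 # normal-weighted grid: w(theta)
    S1 = a_base.shape[1]

    def p(ag, asp, d, c):                # bifactor 3PL surface on the grid
        z = D * (ag * T0 + asp * Ts + d)
        return c + (1.0 - c) / (1.0 + np.exp(-z))

    def loss(x):
        A, beta = x[:S1], x[S1:]
        total = 0.0
        for j in range(len(d_base)):
            s = spec[j]
            # transform new-form estimates to the base scale (Eqs. 3 and 4)
            ag, asp = A[0] * a_new[j, 0], A[s] * a_new[j, s]
            d = d_new[j] + a_new[j] @ beta
            diff = p(a_base[j, 0], a_base[j, s], d_base[j], c_base[j]) \
                 - p(ag, asp, d, c_new[j])
            total += np.sum(W * diff**2)
        return total

    x0 = np.concatenate([np.ones(S1), np.zeros(S1)])
    res = minimize(loss, x0, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-12})
    return np.diag(res.x[:S1]), res.x[S1:]
```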
Concurrent Calibration
When estimating item parameters for multiple test forms that are administered to nonequivalent groups, as mentioned earlier, the parameter estimates that result from separate runs of an IRT calibration program are not on the same coordinate system because the distribution of latent variables is arbitrarily set to a multivariate standard normal distribution for each calibration. In concurrent calibration, this scale problem is handled by estimating the item parameters and the distributions of latent variables for the base and new groups simultaneously. Because the latent variables in the bifactor model are completely independent under the standard bifactor analysis assumption and each item loads on at most one specific factor, the contribution of examinee $i$ in group $g$ to the marginal likelihood is
$$L_{i(g)}(\Gamma)=\int\left[\prod_{s=1}^{S}\int\prod_{j\in V_g\cap J_s}P_j\left(\theta_{0},\theta_{s}\right)^{u_{i(g)j}}Q_j\left(\theta_{0},\theta_{s}\right)^{1-u_{i(g)j}}g_s\left(\theta_s\right)d\theta_s\right]g_0\left(\theta_0\right)d\theta_0, \quad (10)$$
where the subscript $i(g)$ indicates that examinee $i$ is nested within group $g$; $\Gamma$ is the set of all item parameters and structural parameters (i.e., parameters for the distributions of latent variables); $u_{i(g)j}$ is the response of examinee $i$ in group $g$ to item $j$; $\theta_0$ is the general factor; $\theta_s$ is the specific factor on which the items in set $J_s$ load; $Q_j=1-P_j$; $V_g$ denotes the set of items that is given to the examinees in group $g$; $J_s$ denotes the set of items that load on specific factor $s$; $\cap$ is the intersection operator; and $g_0$ and $g_s$ are the distributions of $\theta_0$ and $\theta_s$, respectively. As implied by the term $V_g\cap J_s$, Equation 10 is computed using only the items that are taken by group $g$. Because examinees respond independently, the (overall) marginal log-likelihood function becomes $\log L(\Gamma)=\sum_{g=1}^{G}\sum_{i=1}^{N_g}\log L_{i(g)}(\Gamma)$, where $G$ is the number of groups and $N_g$ is the number of examinees in group $g$.
Applying the EM algorithm used for UIRT models to the bifactor model, the E-step of the EM algorithm involves creating three types of "pseudo data" for each group using item responses and provisional item parameter estimates obtained from the previous EM cycle. The three types of "pseudo data" are (a) the expected number of examinees in group $g$ at each quadrature point $X_k$ for the general dimension; (b) the expected number of examinees in group $g$ at each quadrature point $X_l$ for each specific dimension $s$; and (c) the expected number of examinees in group $g$ that respond in score category $u$ ($u=1$ for a correct response and $u=0$ for an incorrect response) of item $j$ at each combination $(X_k, X_l)$. These three quantities for the bifactor model, which will be denoted hereinafter by $\bar{n}_{g0}(X_k)$, $\bar{n}_{gs}(X_l)$, and $\bar{r}_{gj}(X_k, X_l, u)$, can be computed in a similar way to the two-tier model presented in Cai (2010).
In the M-step, the parameter estimates for item $j$ are updated by finding the values that maximize
$$\sum_{g=1}^{G}v_{gj}\sum_{k}\sum_{l}\left[\bar{r}_{gj}\left(X_k,X_l,1\right)\log P_j\left(X_k,X_l\right)+\bar{r}_{gj}\left(X_k,X_l,0\right)\log Q_j\left(X_k,X_l\right)\right], \quad (11)$$
where $v_{gj}$ is an indicator variable such that $v_{gj}=1$ if $j\in V_g$ and $v_{gj}=0$ otherwise. Because the first summation is over groups, item responses for all groups that take item $j$ (i.e., $\{g: j\in V_g\}$) are used to maximize Equation 11. This is the reason that the concurrent calibration method provides a single set of parameter estimates for common items.
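The M-step objective in Equation 11 is a weighted Bernoulli log-likelihood in the pseudo counts. A sketch for a single item follows (Python; holding the pseudo-guessing parameter fixed and using a general-purpose optimizer are simplifications of this sketch, not of the article, and all names are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def update_item(rbar_list, Xk, Xl, c, D=1.0):
    """M-step update for one item (Equation 11): maximize the expected
    complete-data log-likelihood over (a_g, a_s, d) with c held fixed.

    rbar_list : one (n_k, n_l, 2) array per group with v_gj = 1, giving the
                expected counts rbar_gj(Xk, Xl, u) from the E-step
    """
    T0, Ts = np.meshgrid(Xk, Xl, indexing="ij")

    def neg_ll(x):
        ag, asp, d = x
        p = c + (1.0 - c) / (1.0 + np.exp(-D * (ag * T0 + asp * Ts + d)))
        p = np.clip(p, 1e-10, 1 - 1e-10)
        # sum over groups taking the item, quadrature points, and u = 0, 1
        return -sum(np.sum(r[:, :, 1] * np.log(p) + r[:, :, 0] * np.log(1 - p))
                    for r in rbar_list)

    res = minimize(neg_ll, x0=np.array([1.0, 0.5, 0.0]), method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-10})
    return res.x
```

When the pseudo counts are exactly proportional to the model-implied probabilities, the maximizer coincides with the generating item parameters.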
In addition to the item parameters, the probability distributions of latent variables for all groups are updated in the M-step as well. Under the assumption that the general and specific factors for each of the groups jointly follow a multivariate normal distribution, estimating the probability distribution for each group is equivalent to estimating the mean vector and covariance matrix of the distribution. Under the standard bifactor analysis assumption of complete independence of the general and specific factors, each element of the mean vector and covariance matrix can be estimated separately. Specifically, the first two moments of the normal distributions for the general and specific factors can be estimated by finding the structural parameters that maximize
$$\sum_{k}\bar{n}_{g0}\left(X_k\right)\log\phi_0\left(X_k;\mu_{g0},\sigma_{g0}^2\right) \quad (12)$$

and

$$\sum_{l}\bar{n}_{gs}\left(X_l\right)\log\phi_s\left(X_l;\mu_{gs},\sigma_{gs}^2\right), \quad (13)$$
respectively, where $\phi_0$ and $\phi_s$ are probability density functions of univariate normal distributions. Once the latent variable distributions for all groups are updated, the distribution for the base group is linearly transformed to have a specific mean vector and covariance matrix (e.g., a mean vector of 0 and a covariance matrix of I). The purpose of doing so is to fix the location of the origin and the unit of measurement along each coordinate axis of the coordinate system. Then, the same linear transformation is applied to the distributions for the other groups to ensure that the relative locations of all distributions remain unchanged. Furthermore, the item parameter estimates should also be linearly transformed in such a way that the probability of a correct response remains unchanged. These updated distributions of latent variables and item parameter estimates are then used in the following EM cycle.
Instead of assuming that the latent variables are normally distributed, the distributions for all groups can be estimated empirically by generalizing the empirical histogram method for UIRT models (Bock & Aitkin, 1981; Mislevy, 1984; Woodruff & Hanson, 1997) to the bifactor model. This approach is relatively easy to implement because empirical histograms are by-products of the "pseudo data" computed in the E-step of the EM algorithm; that is, the empirical histograms for the general and specific factors for each group can be obtained by simply dividing $\bar{n}_{g0}(X_k)$ and $\bar{n}_{gs}(X_l)$ by $N_g$ at each quadrature point.
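Because the empirical histograms are simple normalizations of the E-step pseudo counts, the update for one group can be sketched in a few lines (Python; array names are hypothetical):

```python
import numpy as np

def empirical_histograms(nbar_g0, nbar_gs, N_g):
    """Empirical-histogram update for one group: divide the E-step pseudo
    counts by the group sample size to obtain discrete latent distributions.

    nbar_g0 : (n_k,) expected counts at the general-factor quadrature points
    nbar_gs : (S, n_l) expected counts at each specific factor's points
    N_g     : number of examinees in the group
    """
    return nbar_g0 / N_g, nbar_gs / N_g
```

Since the pseudo counts for each dimension sum to $N_g$, the resulting histogram weights sum to 1 at every EM cycle.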
Method
The relative performance of the two separate calibration methods (i.e., the direct and ICS methods) and the concurrent calibration method was explored based on a simulation study. The study factors are described first, followed by the simulation procedures and evaluation criteria.
Study Factors
Three study factors were considered in the simulation study: (a) two levels of sample size (N = 2,000 and 5,000); (b) two levels of the proportion of common items (CI = 20% and 40%); and (c) four population distributions of latent variables for the new group. More detailed descriptions of the distributions of latent variables are provided below. The three factors were completely crossed, resulting in 16 study conditions.
Simulation Procedures
Two dichotomously scored 45-item test forms, each consisting of three clusters of 15 items (one cluster per specific factor), were generated for the simulation study. To create the two test forms, item parameters for the bifactor model with a fixed scaling constant $D$ were sampled from probability distributions that are commonly used in IRT studies. The general slope parameters were sampled from a log-normal distribution with a mean of 0 and a standard deviation of 0.5, restricted to the range (0.5, 2); three sets of specific slope parameters, one for each cluster, were sampled from a uniform distribution between 0.5 and 0.7; the intercept parameters were sampled from a standard normal distribution restricted to the range [−3, 3]; and the pseudo-guessing parameters were sampled from a uniform distribution between 0.05 and 0.35. These item parameters will hereinafter be referred to as the generating item parameters.
To introduce two different proportions of common items, parameters for the 15 items in each cluster were sampled in such a way that six items had statistical characteristics comparable with those of all 15 items. These six items were considered as common items for the CI = 40% condition. For the CI = 20% condition, the original six common items were split into two sets such that the statistical characteristics of each set of three items were as similar as possible. One of the two sets of three items was considered as common items, and the other set was treated as unique items.
Using the generating item parameters, item responses for the base group were simulated by sampling the general factor and each of the three specific factors from a standard normal distribution. Item responses for the new group were simulated by sampling the general and specific factors from four different distributions: (a) a standard normal distribution, (b) a nonstandard normal distribution, (c) a negatively skewed distribution, and (d) a platykurtic distribution. The standard normal distribution condition was used to examine the case of the common-item equivalent groups design, and the other three distributions were used to examine the case of the CINEG design. Latent factors for the negatively skewed and platykurtic distributions were generated using the power method proposed by Fleishman (1978). Skewness and kurtosis were set to −0.75 and 0 for the negatively skewed distribution and to 0 and −0.5 for the platykurtic distribution. These values were selected to represent typical nonnormality situations described by Pearson and Please (1975): skewness less than 0.8 and kurtosis between −0.6 and 0.6. For the nonstandard normal, negatively skewed, and platykurtic distributions, the mean vector was set to a nonzero vector with different values across the specific dimensions, and the covariance matrix was set to an identity matrix. The reason for using different mean values for the specific dimensions was to examine the recovery of item parameters under a variety of plausible conditions.
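Fleishman's (1978) power method expresses the nonnormal variable as a cubic polynomial in a standard normal deviate, $Y = a + bZ + cZ^2 + dZ^3$, with coefficients chosen to match the target moments. A sketch follows (Python with scipy's root finder; the original study used R, and the function name is hypothetical), shown here for the negatively skewed condition:

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, ex_kurt):
    """Solve the Fleishman (1978) equations for (a, b, c, d) such that
    Y = a + bZ + cZ^2 + dZ^3 (Z ~ N(0,1)) has mean 0, variance 1, and the
    requested skewness and excess kurtosis."""
    def equations(x):
        b, c, d = x
        return [b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - ex_kurt]
    b, c, d = fsolve(equations, [1.0, 0.0, 0.0])
    return -c, b, c, d  # a = -c keeps the mean at 0

# latent scores for the negatively skewed condition (skew = -0.75, kurtosis 0)
a, b, c, d = fleishman_coefficients(-0.75, 0.0)
z = np.random.default_rng(0).standard_normal(100_000)
y = a + b * z + c * z**2 + d * z**3
```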
For each of the 16 conditions, 100 response data sets were simulated using the generating item parameters and 100 samples of latent traits drawn from the standard normal distribution for the base group and one of the four distributions for the new group. Then, item parameters were estimated and linked using the two separate calibration methods and the concurrent calibration method. All analyses, including calibration and linking, were conducted using an R (R Core Team, 2017) program that was written by the author of this study. For concurrent calibration, the empirical histogram method was used to estimate the distribution of latent variables.
Evaluation Criteria
After conducting linking, the item parameter estimates should be close to the generating item parameters. The extent to which this holds was assessed by three criteria. The first criterion compared the linked item parameter estimates to the generating item parameters. Only the new form items were used for the comparison because the focus of this article was to place the item parameter estimates of the new form on the coordinate system of the base form. The recovery of the item parameters was assessed in terms of bias, standard error (SE), and root mean square error (RMSE), which were defined, respectively, by
$$\mathrm{bias}\left(\hat{\eta}_j\right)=\frac{1}{R}\sum_{r=1}^{R}\hat{\eta}_{jr}-\eta_j, \quad (14)$$

$$\mathrm{SE}\left(\hat{\eta}_j\right)=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\eta}_{jr}-\bar{\eta}_{j}\right)^2}, \quad (15)$$

and

$$\mathrm{RMSE}\left(\hat{\eta}_j\right)=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\eta}_{jr}-\eta_j\right)^2}, \quad (16)$$
where $R$ is the number of replications; $\hat{\eta}_{jr}$ denotes an estimate for item $j$ at replication $r$ ($\eta$ represents either an item discrimination, intercept, or pseudo-guessing parameter); $\eta_j$ denotes the generating parameter for the same item; and $\bar{\eta}_j=\frac{1}{R}\sum_{r=1}^{R}\hat{\eta}_{jr}$. After computing the three statistics for each item, the average was taken over all the items to summarize the results at the test level. For bias, the absolute values were averaged to prevent cancellation of positive and negative values across different items.
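For reference, Equations 14 to 16 can be computed directly from the replicated estimates; note that the three statistics satisfy $\mathrm{RMSE}^2=\mathrm{bias}^2+\mathrm{SE}^2$ (Python sketch; the function name is hypothetical):

```python
import numpy as np

def recovery_stats(est, true):
    """Bias, SE, and RMSE (Equations 14-16) for one item parameter.

    est  : (R,) estimates across replications
    true : the generating value
    """
    bias = est.mean() - true
    se = np.sqrt(np.mean((est - est.mean()) ** 2))
    rmse = np.sqrt(np.mean((est - true) ** 2))
    return bias, se, rmse  # rmse**2 == bias**2 + se**2
```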
The second and third criteria compared the ICSs and expected score distributions (ESDs) obtained with the linked item parameter estimates to those obtained with the generating item parameters. As with the first criterion, only the new form ICSs and ESDs were used for comparison. The bias, SE, and RMSE at a given $\theta$ value (for the ICS criterion) or number-correct score (for the ESD criterion) were defined similarly to Equations 14, 15, and 16, respectively, but substituting the ICS or ESD computed with the item parameter estimates for $\hat{\eta}_{jr}$ and the ICS or ESD computed with the generating item parameters for $\eta_j$. The Lord–Wingersky recursion formula (Lord & Wingersky, 1984) was used to compute the ESDs. Each of the three statistics was averaged over all possible values using two weight functions: the first weight function was a constant of 1 (the unweighted criterion), and the second was the probability density function of a multivariate standard normal distribution (the weighted criterion).
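The Lord–Wingersky recursion builds the number-correct score distribution one item at a time: at a fixed point in the latent space, each item's correct-response probability updates the running score distribution. A Python sketch follows (the ESD is then obtained by averaging these conditional distributions over the latent distribution via quadrature):

```python
import numpy as np

def lord_wingersky(p):
    """Lord-Wingersky (1984) recursion: distribution of the number-correct
    score given per-item probabilities of a correct response at a fixed
    latent value.

    p : (n_items,) probabilities; returns (n_items + 1,) score probabilities
    """
    f = np.array([1.0])  # before any item, score 0 with probability 1
    for pj in p:
        # score x is reached by an incorrect response at x or a correct one at x-1
        f = np.concatenate([f * (1 - pj), [0.0]]) + np.concatenate([[0.0], f * pj])
    return f
```

For two items with $p = 0.5$ each, the recursion yields the binomial distribution (0.25, 0.5, 0.25) over scores 0, 1, and 2.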
Results
For the concurrent calibration method, calibration runs for eight data sets failed to converge within 500 EM cycles when the sample size was N = 2,000 and the proportion of common items was CI = 20%. Among the eight data sets, three were from the nonstandard normal distribution condition, one was from the negatively skewed distribution condition, and four were from the platykurtic distribution condition. These problematic replications were discarded and replaced with new data sets until all the data sets for the simulation study converged successfully. Because of space limitations, results for the N = 5,000 condition are not provided here. However, except where noted, the remaining discussion of the results applies to both sample sizes.
Recovery of Item Parameters
Figure 1 provides average bias, SE, and RMSE for the item parameter recovery criterion with N = 2,000 (hereinafter, the term average will be dropped unless confusion may arise). In each figure, there are 15 plots arranged in five rows and three columns. The plots in the three columns give results for bias, SE, and RMSE, respectively. The plots in the five rows give results for the general slope parameters, the three specific slope parameters, and the intercept parameters. The results for the pseudo-guessing parameters are not provided here because these parameters are unaffected by linking (see Equation 5).
Figure 1.
Average bias, SE, and RMSE for the item parameter recovery criterion with N = 2,000.
Note. CI = common item; SE = standard error; RMSE = root mean square error; SN = standard normal distribution; N = normal distribution; NS = negatively skewed distribution; P = platykurtic distribution.
Two common results were observed for the three linking methods. First, RMSE decreased with increasing sample size. This was because RMSE was dominated by SE, and SE generally decreases as the sample size increases when estimating population parameters. Second, increasing the proportion of common items tended to result in smaller RMSE, a tendency that was more noticeable for the two separate calibration methods. The proportion of common items had more impact on the direct and ICS methods probably because these two separate calibration methods estimated the transformation parameters using only the common items, whereas the concurrent calibration method estimated the latent-variable distribution for the new group using both the common and unique items.
In general, the concurrent calibration method showed smaller RMSE than the two separate calibration methods across all study conditions. Most of the difference in RMSE between the separate and concurrent calibration methods was due to the difference in SE. It is worth noting that linking two scales using the concurrent calibration method with CI = 20% showed smaller RMSE than linking scales using the two separate calibration methods with CI = 40%. As shown in Table 1, another noteworthy finding for the concurrent calibration method was that it tended to recover the item parameters more accurately than separate calibration with no linking when linking was unnecessary (i.e., the common-item equivalent groups condition). By contrast, the direct and ICS methods provided less accurate item parameter estimates than separate calibration with no linking for the common-item equivalent groups condition, with the direct method showing larger RMSE than the ICS method. The direct method also tended to work worse than the ICS method for the three nonequivalent groups conditions. Poor performance of the direct method was far more noticeable for the CI = 20% condition.
Table 1.
Average RMSE for the Item Parameter Recovery Criterion Under the Common-Item Equivalent Groups Condition With N = 2,000.
| | Method | $a_0$ | $a_1$ | $a_2$ | $a_3$ | $d$ |
|---|---|---|---|---|---|---|
| | No linking | .148 | .131 | .128 | .124 | .148 |
| CI = 20% | Direct | .173 | .179 | .187 | .162 | .215 |
| | ICS | .154 | .160 | .149 | .144 | .154 |
| | Concurrent | .150 | .129 | .124 | .125 | .145 |
| CI = 40% | Direct | .156 | .148 | .145 | .133 | .175 |
| | ICS | .149 | .138 | .138 | .128 | .153 |
| | Concurrent | .138 | .120 | .118 | .114 | .135 |

Note. CI = common item; Direct = separate calibration with the direct linking method; ICS = separate calibration with the item characteristic surface linking method; $a_0$ = general slope parameter; $a_1$ to $a_3$ = specific slope parameters; $d$ = intercept parameter (column order follows the parameter order in Figure 1).
The results for the recovery of item parameters across the common and unique items are presented in Figure 2. A larger difference in RMSE between the separate and concurrent calibration methods was observed for the common items than for the unique items. Consistent with the findings of Hanson and Béguin (2002) for the unidimensional 3PL model, the concurrent calibration method provided more accurate parameter estimates for the common items than the two separate calibration methods. The reason was that the concurrent calibration method used responses obtained from both the base and new groups to calibrate the common items, whereas the two separate calibration methods only used responses obtained from the new group for estimation.
Figure 2.
Recovery of the item parameters across common items and unique items with N = 2,000.
Note. CI = common item; SE = standard error; RMSE = root mean square error; SN = standard normal distribution; N = normal distribution; NS = negatively skewed distribution; P = platykurtic distribution.
Accuracy of ICS and ESD
Values of bias, SE, and RMSE for the weighted average ICS and ESD criteria with N = 2,000 are presented in Figure 3. The results for the unweighted average criterion were similar to those for the weighted average criterion and are therefore not provided here. In contrast to the ICS criterion, the ESD criterion provides bias, SE, and RMSE for each number-correct score. The results for the ESD criterion presented in Figure 3 are averages taken over all possible scores. In addition, because the ESD criterion is based on relative frequencies, the values of the three statistics were very small. Therefore, to make meaningful comparisons among the three linking methods, the initial values were multiplied by 100.
Figure 3.
Average bias, SE, and RMSE for the weighted item characteristic surface and expected score distribution criteria with N = 2,000.
Note. CI = common item; SE = standard error; RMSE = root mean square error; SN = standard normal distribution; N = normal distribution; NS = negatively skewed distribution; P = platykurtic distribution.
Results similar to those for the item parameter recovery criterion were observed for the ICS and ESD criteria. The concurrent calibration method showed the smallest RMSE among the three linking methods, followed by the ICS method. The poor performance of the direct method was mainly due to its large SE, but relatively large bias was also observed when the population distribution for the new group was negatively skewed. Note that this increase in bias was not found for the ICS and concurrent calibration methods under the negatively skewed distribution. Although the results are not shown here due to space limitations, the concurrent calibration method provided smaller conditional RMSE (i.e., RMSE at each number-correct score) than the two separate calibration methods across a wide range of score points.
Discussion
Overall, the concurrent calibration method provided more accurate linking results than the two separate calibration methods in terms of the recovery of item parameters, item characteristic surfaces, and expected score distributions. The superior performance of the concurrent calibration method was mainly due to the better recovery of the parameters for common items compared with the direct and ICS methods. Common items were more accurately recovered with the concurrent calibration method because it used item responses obtained from both the base and new groups to calibrate the common items, whereas the two separate calibration methods only used item responses obtained from the new group for calibration. In addition to its better performance, another appealing aspect of the concurrent calibration method is that it requires only a single computer run, compared with the multiple runs required for separate calibration. Despite these advantages, however, the concurrent calibration method has its own shortcomings. First, the concurrent calibration method appears to have convergence issues when the sample size and the proportion of common items are small. Second, when concurrent calibration is used, the new form that is calibrated concurrently with the base form influences the item parameter estimates for the base form. As a result, when item parameter estimates for the base form already exist from a previous calibration, concurrent calibration will likely produce different item parameter estimates for the same base form. Thus, if the base scale and the item parameter estimates for the base form must be preserved, linking two scales using concurrent calibration might complicate the problem.
Between the two separate calibration methods, the ICS method worked better than the direct method but not as well as the concurrent calibration method. Although the two separate calibration methods produced larger estimation error than the concurrent calibration method across all simulation conditions, one distinct advantage of separate calibration is that it can be used to examine parameter drift for the common items. Drift can be examined because separate calibration provides two separate sets of parameter estimates for the common items, whereas concurrent calibration provides only one. For this reason, as suggested by Hanson and Béguin (2002), it would be beneficial in practice to compute the separate calibration estimates for diagnostic purposes, even if the concurrent calibration method is used for operational purposes.
In contrast to separate calibration, concurrent calibration achieves a common scale for the IRT parameters by estimating the distributions of latent variables for the base and new groups along with the item parameters. In the context of UIRT, the two most widely used approaches for estimating the probability densities are the normal solution method (i.e., assuming normality for the latent variables) and the empirical histogram method. Although the use of the normal solution method appears to be more common in practice, the advantage of the empirical histogram method is that it can be used to handle nonnormal latent variables. Another possible approach for estimating the latent variable distribution is to apply the semi-parametric approach proposed by Woods and Thissen (2006) to the bifactor model. It is suspected that different approaches would result in different item parameter estimates. Thus, further research on this topic is warranted.
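The contrast between the normal solution and the empirical histogram method can be made concrete with a minimal sketch. The normal solution fixes the latent-density weights at a standard normal evaluated on a quadrature grid, whereas the empirical histogram re-estimates the weights each EM cycle from the examinees' posterior distributions. The grid size, function name, and toy posteriors below are assumptions for illustration only, not the study's implementation.

```python
import numpy as np

# Fixed quadrature grid for one latent dimension.
theta = np.linspace(-4.0, 4.0, 21)

# Normal solution: weights fixed at a (normalized) standard normal density.
normal_w = np.exp(-0.5 * theta ** 2)
normal_w /= normal_w.sum()

# Empirical histogram: re-estimate the weights each EM cycle as the
# average posterior mass at each quadrature point (hypothetical helper;
# `posteriors` is an N x 21 array of normalized posterior weights).
def update_histogram(posteriors):
    w = posteriors.mean(axis=0)
    return w / w.sum()   # renormalize to a proper distribution

# Toy posteriors standing in for one E-step's output.
rng = np.random.default_rng(2)
post = rng.random((500, 21))
post /= post.sum(axis=1, keepdims=True)
new_w = update_histogram(post)
```

Because the empirical histogram weights are free to depart from the normal shape, this approach can absorb skewed latent distributions like the negatively skewed condition examined in the simulation.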
There are several limitations inherent in this study. First, item responses were generated under the assumption that the bifactor model fits the data perfectly. However, this never happens in the real world, which limits the generalizability of the findings of this study. To examine the performance of the separate and concurrent calibration methods for the bifactor model under more realistic conditions, future studies might consider comparing these linking methods under varying degrees of model misfit. Second, in this study, the relative performance of separate and concurrent calibration was compared for tests that consisted of only dichotomous items. Future studies that compare the two linking methods for tests with polytomously scored items or mixed-format tests would be meaningful. Finally, the separate and concurrent calibration methods presented in this study were based on the standard bifactor analysis assumption of complete independence, which ensures dimension reduction for the computation of the marginal maximum likelihood and model identification. However, as mentioned earlier, this assumption can be relaxed to the case of conditional independence of the specific factors given the general factor.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Kyung Yong Kim
https://orcid.org/0000-0001-7549-5800
References
- Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443-459.
- Cai L. (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75, 581-612.
- Cai L., Yang J. S., Hansen M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16, 221-248.
- DeMars C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145-168.
- Fleishman A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.
- Gibbons R. D., Bock R. D., Hedeker D., Weiss D. J., Segawa E., Bhaumik D. K., . . . Stover A. (2007). Full-information item bifactor analysis of graded response data. Applied Psychological Measurement, 31, 4-19.
- Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423-436.
- Haebara T. (1980). Estimating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
- Hanson B. A., Béguin A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.
- Jeon M., Rijmen F., Rabe-Hesketh S. (2013). Modeling differential item functioning using a generalization of the multiple-group bifactor model. Journal of Educational and Behavioral Statistics, 38, 32-60.
- Kim S., Cohen A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22, 131-143.
- Kim S., Kolen M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19, 357-381.
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking. New York, NY: Springer.
- Lee G., Lee W. (2016). Bi-factor MIRT observed-score equating for mixed format tests. Applied Measurement in Education, 29, 224-241.
- Lee G., Lee W., Kolen M. J., Park I., Kim D., Yang J. S. (2015). Bi-factor MIRT true-score equating for testlet-based tests. Journal of Educational Evaluation, 28, 681-700.
- Lee W., Ban J. (2010). A comparison of IRT linking procedures. Applied Measurement in Education, 23, 23-48.
- Li Y. H., Lissitz R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, 115-138.
- Li Y. H., Lissitz R. W. (2012). Exploring the full-information bifactor model in vertical scaling with construct shift. Applied Psychological Measurement, 36, 3-20.
- Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings." Applied Psychological Measurement, 8, 453-461.
- Mislevy R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.
- Mislevy R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
- Oshima T. C., Davey T. C., Lee K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37, 357-373.
- Pearson E. S., Please N. W. (1975). Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika, 62, 223-241.
- R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from http://R-project.org/
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
- Rijmen F. (2009). Efficient full information maximum likelihood estimation for multidimensional IRT models (Tech. Rep. No. RR-09-03). Princeton, NJ: Educational Testing Service.
- Simon M. K. (2008). Comparison of concurrent and separate multidimensional IRT linking of item parameters (Unpublished doctoral dissertation). University of Minnesota, Minneapolis.
- Woodruff D. J., Hanson B. A. (1997). Estimation of item response models using the EM algorithm for finite mixtures (ACT Research Report Series 96-6). Iowa City, IA: ACT.
- Woods C. M., Thissen D. (2006). Item response theory with estimation of the latent population distribution using spline-based densities. Psychometrika, 71, 281-301.
- Yao L. (2011). Multidimensional linking for domain scores and overall scores for nonequivalent groups. Applied Psychological Measurement, 35, 48-66.



