Scale Linking for the Testlet Item Response Theory Model

Seonghoon Kim; Michael J Kolen

doi:10.1177/01466216211063234

2022 Feb 13;46(2):79–97.

PMCID: PMC8908412 PMID: 35281343

Abstract

In their 2005 paper, Li and her colleagues proposed a test response function (TRF) linking method for a two-parameter testlet model and used a genetic algorithm to find minimization solutions for the linking coefficients. In the present paper the linking task for a three-parameter testlet model is formulated from the perspective of bi-factor modeling, and three linking methods for the model are presented: the TRF, mean/least squares (MLS), and item response function (IRF) methods. Simulations are conducted to compare the TRF method using a genetic algorithm with the TRF and IRF methods using a quasi-Newton algorithm and the MLS method. The results indicate that the IRF, MLS, and TRF methods perform very well, well, and poorly, respectively, in estimating the linking coefficients associated with testlet effects, that the use of genetic algorithms offers little improvement to the TRF method, and that the minimization function for the TRF method is not as well-structured as that for the IRF method.

Keywords: scale linking methods, testlet model, item response theory

Educational test forms are often constructed using clusters of items based on a common stimulus or content area. For example, test items may be grouped around a reading passage, scenario, chart, or section associated with particular content. Wainer and Kiely (1987) called such a group of items a testlet and adopted it as a construction unit in computerized adaptive testing. From the perspective of item response theory (IRT; Lord, 1980; Yen & Fitzpatrick, 2006), the assumption of local independence among items nested within a testlet, given the primary latent trait, would be violated to some extent because the responses of examinees to the items might be affected by the testlet effect as well as the primary factor. An efficient way to deal with local dependence is to use the testlet model (Wainer et al., 2007), in which a secondary, random-effect factor is added to the primary factor. Researchers (e.g., DeMars, 2006; Li et al., 2006; Rijmen, 2010) have shown that the testlet model is a constrained version of the bi-factor model (Gibbons & Hedeker, 1992).

Like other IRT models, the testlet model has a model identification problem, specifically a scale indeterminacy problem, because the item parameters and person parameters are invariant within a linear transformation of the latent trait scale. In practice, scale indeterminacy typically is solved by choosing a scale such that the mean and standard deviation (SD) of the person parameters are arbitrarily set to certain values (e.g., 0 and 1) for the examinee group being analyzed (Rijmen, 2010). According to that convention, the latent scales obtained from separate calibrations of sample data from different populations are not likely to be equivalent, but they are assumed to be linearly related. This non-equivalency creates the need for a common scale, which can be developed through scale linking (or scale transformation), in which one scale is linked to another (base) scale with a linear function.

This paper is primarily concerned with the methods used to estimate the linking parameters for the testlet model under the common-item nonequivalent groups (CING) design (Kolen & Brennan, 2014). Many linking methods have been presented for use with traditional dichotomous IRT models such as the two-parameter logistic and three-parameter logistic (3PL) models (e.g., Divgi, 1985; Haebara, 1980; Loyd & Hoover, 1980; Marco, 1977; Stocking & Lord, 1983), and they have been extended to polytomous models (Kim & Lee, 2006). Most relevant to the present paper, Kim (2019) presented three linking methods for the 3PL bi-factor model, the direct least squares (DLS), item response function (IRF), and test response function (TRF) methods, which are bi-factor extensions of Divgi’s (1985), Haebara’s (1980), and Stocking and Lord’s (1983) approaches, respectively. Kim (2019) showed through simulations that the IRF, DLS, and TRF methods differed little in estimating the slope (dilation) linking coefficients, but they exhibited substantial differences in estimating the intercept (translation) linking coefficients, with the IRF method being the most accurate and the TRF method being the least accurate. However, in the IRT literature, only the TRF method has been formally extended for use with the testlet model. That extension is found in Li et al. (2005), who presented the TRF method for a two-parameter normal ogive (2PNO) testlet model. In this paper, Li et al.’s TRF method is presented under the 3PL testlet model since this general model is more widely used than the 2PNO testlet model in practice.

Questions and Purposes

As described in detail later, Li et al. (2005) formulated the linking task under the 2PNO testlet model such that, given $k$ common testlets, the linking parameters should include the means (denoted by $μ_{γ_{s}}$) of the testlet effect factors $γ_{s}$, $s = 1, \dots, k$, with the constraint $\sum_{s = 1}^{k} μ_{γ_{s}} = 0$, in addition to the linking coefficients $A$ and $B$ for the primary factor $θ$. The criterion function (also known as the loss function) for the TRF method is nonlinear with respect to the linking parameters, and thus a multivariate search technique such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, one of the quasi-Newton methods (Dennis & Schnabel, 1996), should be implemented to estimate the linking parameters. Li et al. (2005) combined the GENOUD genetic algorithm (Sekhon & Mebane, 1998; see also Mebane & Sekhon, 2011) with the BFGS algorithm. Li and her colleagues used the genetic algorithm because they were concerned that, with three or more linking parameters to be estimated, the criterion function might have multiple minima or saddle points, and the BFGS method might fail to find, or fail to converge to, the global minimum. However, the GENOUD genetic algorithm is very computationally intensive and time-consuming (taking 25 or more minutes for a linking task, as reported by Li et al.).

The present study was motivated by some related questions regarding Li et al.’s (2005) approach to the linking solutions for the TRF method. The first question is “Is it necessary to use genetic algorithms to find the linking solutions for the TRF method?” This question is important, because previous studies into multidimensional IRT linking (e.g., Davey et al., 1996; Oshima et al., 2000) that considered six or more linking parameters in a rotation matrix and translation vector have not reported any problem in finding the linking solutions using a modified version of the Newton method. If the genetic method is not substantially superior to the BFGS method, there would be no compelling reason to use it in practice. The second and third questions, which are closely related to each other, are “Are linking methods for the testlet model other than the TRF method available?” and “How do the different linking methods for the testlet model compare in their performance?” These questions are also important, because the availability of different methods for scale linking allows practitioners to choose the appropriate method depending on the situation. The choice of a linking method can be made more wisely if more information about the relative performance of different methods is given to the practitioners. Even if one method is operationally used, other methods should still be implemented for diagnostic purposes (Kolen & Brennan, 2014).

The primary purposes of this paper are two-fold. One is to answer the first question posed above regarding Li et al.’s (2005) TRF method. The other is to present the mean/least squares (MLS) and IRF methods for the testlet model and investigate their performance in linking accuracy relative to the TRF method. To achieve these purposes, we first present the 3PL testlet model (instead of the 2PNO model) in the next section for generality and recast the linking task formulated by Li et al. (2005) as a special case within the bi-factor modeling framework. Next, we use the reformulated linking framework to present the MLS, TRF, and IRF methods and conduct a simulation study to compare the accuracy of these methods.

Linking Methods for the 3PL Testlet Model

The IRT literature contains several versions of the testlet model for dichotomous items that differ slightly in parameterization (e.g., Bradlow et al., 1999; Glas et al., 2000; Wainer & Wang, 2000). For the purposes of this paper, we use the parameterization of Glas et al. (2000) to write a 3PL testlet model that defines the probability that an examinee $j$ will answer item $i$ correctly as

P_{i} (θ_{j}, γ_{j s (i)}) = P (θ_{j}, γ_{j s (i)}; a_{i}, b_{i}, c_{i}) = c_{i} + \frac{1 - c_{i}}{1 + \exp [- D a_{i} (θ_{j} + γ_{j s (i)} - b_{i})]},

(1)

where $a_{i}$, $b_{i}$, and $c_{i}$ are the discrimination, difficulty, and lower asymptote parameters for item $i$, respectively; $D$ is a scaling constant (usually set to 1 or 1.7); $θ_{j}$ is the primary trait (ability) parameter of examinee $j$; and $γ_{j s (i)}$ is a random-effect parameter (assumed to be independent of $θ_{j}$) for examinee $j$ on testlet $s$, the testlet to which item $i$ belongs. Equation (1), the IRF for the 3PL testlet model, can be viewed as a special case of the 3PL bi-factor model, written as

P_{i} (θ_{j 0}, θ_{j s}) = P (θ_{j 0}, θ_{j s}; a_{i}, C_{s (i)}, d_{i}, c_{i}) = c_{i} + \frac{1 - c_{i}}{1 + \exp [- D (a_{i} θ_{j 0} + a_{i} C_{s (i)} θ_{j s} + d_{i})]},

(2)

where $a_{i}$ , $c_{i}$ , and $D$ are the same as in Equation (1); $d_{i} = - a_{i} b_{i}$ is the intercept parameter; $θ_{j 0}$ is the parameter for examinee $j$ of the primary factor (i.e., $θ_{0}$ = $θ$ ); $θ_{j s}$ is the parameter of the specific factor of testlet $s$ with the relationship $γ_{s (i)}$ = $C_{s (i)} θ_{s}$ ; and $C_{s (i)}$ is a proportionality constant across all items nested within testlet $s$ .
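The correspondence between Equations (1) and (2) can be checked numerically. The following is a minimal Python sketch, not part of the authors' software. The function names and all item and person values are hypothetical and chosen only for illustration.

```python
import math

def p_testlet(theta, gamma, a, b, c, D=1.7):
    """Equation (1): 3PL testlet IRF with testlet effect gamma."""
    z = D * a * (theta + gamma - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def p_bifactor(theta0, theta_s, a, C, d, c, D=1.7):
    """Equation (2): 3PL bi-factor IRF whose specific-factor slope is a * C."""
    z = D * (a * theta0 + a * C * theta_s + d)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical values (assumptions, not taken from the paper).
a, b, c = 1.2, 0.5, 0.2
C = 0.7                      # proportionality constant of the testlet
theta0, theta_s = 0.3, -1.1  # primary and specific factor scores

# The two forms agree when gamma = C * theta_s and d = -a * b.
p1 = p_testlet(theta0, C * theta_s, a, b, c)
p2 = p_bifactor(theta0, theta_s, a, C, -a * b, c)
print(p1, p2)
```

The two printed probabilities should coincide up to floating-point error, which is the sense in which the testlet model is a constrained bi-factor model.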

Whether expressed as Equation (1) or (2), the 3PL testlet model cannot be identified unless some restrictions are imposed on the parameters. For Equation (1), the mean and SD of $θ$ are typically fixed to 0 and 1, respectively, and the means of the $γ_{s (i)}$s are fixed to 0, with each of their SDs, $σ_{γ_{s}}$, being free parameters. For Equation (2), the mean and SD of each of $θ_{0}$ and $θ_{s}$ are fixed to 0 and 1, respectively, and $C_{s (i)} = σ_{γ_{s}}$ are considered the free parameters to be estimated. In other words, a standardized scale (0–1 scale) is independently used for each $θ$ dimension to remove model indeterminacy. Throughout this paper, we assume that the 3PL testlet model is identified by 0–1 scaling, and its parameters are estimated using sample data.

The Linking Parameters Estimated

Consider two examinee groups, a base group and a new group, that can differ in each $θ$ dimension. Assume that an identical test, consisting of $k$ testlets, has been administered to both groups and that for each group, separate calibration has been conducted using 0–1 scaling to estimate all item parameters, including $C_{s}$ (dropping the nested subscript $i$ for simplicity). Furthermore, define the 0–1 scales from the base and new groups as $θ_{B}$ and $θ_{N}$ , respectively, where $θ_{B} =$ ( $θ_{0 B}$ , $θ_{1 B}$ ,…, $θ_{k B}$ ) and $θ_{N} =$ ( $θ_{0 N}$ , $θ_{1 N}$ ,…, $θ_{k N}$ ). Use $a_{i B}$ , $b_{i B}$ , $c_{i B}$ , $d_{i B}$ , and $C_{s B}$ to denote the item/testlet parameters estimated on the $θ_{B}$ scale, and use $a_{i N}$ , $b_{i N}$ , $c_{i N}$ , $d_{i N}$ , and $C_{s N}$ to denote the counterparts on the $θ_{N}$ scale.

By the “within a linear transformation” invariance property of IRT, the $θ_{B}$ and $θ_{N}$ scales are linearly related as follows (Kim, 2019):

θ_{0 B} = A θ_{0 N} + B,

(3)

θ_{s B} = λ_{s} θ_{s N} + β_{s}, s = 1, \dots, k,

(4)

where $A$ and $B$ are the linking coefficients for the $θ_{0}$ dimension and $λ_{s}$ and $β_{s}$ are the linking coefficients for the $θ_{s}$ dimension. The slopes, $A$ and $λ_{s}$ , adjust for unit differences between the new and base scales, and the translation intercepts, $B$ and $β_{s}$ , adjust for location differences. If scale linking is perfect, the two sets of item/testlet parameter estimates from separate calibrations should be related as follows:

a_{i B} = a_{i N} / A,

(5)

\frac{C_{s B}}{A} = \frac{C_{s N}}{λ_{s}},

(6)

d_{i B} = d_{i N} - a_{i N} (\frac{1}{A}) B - a_{i N} (\frac{C_{s B}}{A}) β_{s}

(7a)

= d_{i N} - \frac{1}{A} [\begin{matrix} a_{i N} & a_{i N} \end{matrix}] [\begin{matrix} 1 & 0 \\ 0 & C_{s B} \end{matrix}] [\begin{matrix} B \\ β_{s} \end{matrix}],

(7b)

b_{i B} = A [b_{i N} + (\frac{C_{s B}}{A}) β_{s}] + B,

(8)

c_{i B} = c_{i N} .

(9)

However, Equations (5) through (9) do not perfectly hold among estimated item/testlet parameters because of sampling errors and possible model-data misfit. In general, linking errors are unavoidable with sample data, and the linking coefficients should be properly estimated so as to minimize the errors (Kim & Lee, 2006; Kolen & Brennan, 2014).
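To make the relations in Equations (3) through (9) concrete, the sketch below transforms one item's new-scale parameters to the base scale and checks that the response probability is unchanged. It is an illustrative Python sketch under error-free conditions. The helper name `to_base_scale` and all numeric values are hypothetical.

```python
import math

def p_bifactor(theta0, theta_s, a, C, d, c, D=1.7):
    """Equation (2): 3PL bi-factor IRF."""
    z = D * (a * theta0 + a * C * theta_s + d)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def to_base_scale(a_N, d_N, c_N, C_N, A, B, lam, beta):
    """Apply Equations (5), (6), (7a), and (9) to one item."""
    a_B = a_N / A                                      # Eq. (5)
    C_B = A * C_N / lam                                # Eq. (6), solved for C_sB
    d_B = d_N - a_N * B / A - a_N * (C_B / A) * beta   # Eq. (7a)
    return a_B, d_B, c_N, C_B                          # Eq. (9): c is unchanged

# Hypothetical linking coefficients and new-scale item parameters.
A, B, lam, beta = 1.25, 0.37, 0.9, -0.36
a_N, b_N, c_N, C_N = 1.1, -0.4, 0.15, 0.8
d_N = -a_N * b_N

a_B, d_B, c_B, C_B = to_base_scale(a_N, d_N, c_N, C_N, A, B, lam, beta)
b_B = -d_B / a_B  # difficulty implied by d = -a * b

# One examinee expressed on both scales, Equations (3) and (4).
theta0_N, theta_s_N = 0.6, -0.2
theta0_B = A * theta0_N + B
theta_s_B = lam * theta_s_N + beta

p_new = p_bifactor(theta0_N, theta_s_N, a_N, C_N, d_N, c_N)
p_base = p_bifactor(theta0_B, theta_s_B, a_B, C_B, d_B, c_B)
print(p_new, p_base)
```

With these inputs `p_new` and `p_base` should agree, and `b_B` should satisfy Equation (8). With estimated parameters the agreement is only approximate, which is why the linking coefficients have to be estimated.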

The descriptions above might be read as if the linking methods for the 3PL testlet model should estimate $2 (k + 1)$ linking coefficients ( $A$ , $λ_{1}$ to $λ_{k}$ and $B$ , $β_{1}$ to $β_{k}$ ), but that is not the case. As can be seen from Equation (6), each $λ_{s}$ ( $s =$ 1,…, $k$ ) is a function of $A$ , $C_{s B}$ , and $C_{s N}$ , so if $A$ is estimated and the two constants are given, its value is determined. Thus, $k$ lambda coefficients are not considered as linking parameters to be estimated. For the beta coefficients, $β_{s}$ , $k - 1$ can be uniquely estimated due to the linear dependence among them, such as $\sum_{s = 1}^{k} β_{s} = 0$ . Note that the linear dependence $\sum β_{s} = 0$ agrees with the constraint $\sum μ_{γ_{s}} = 0$ used in Li et al. (2005), where $μ_{γ_{s}} = C_{s B} β_{s} / A$ given $A$ . Therefore, the three linking methods for the testlet model described below estimate $k + 1$ “free” parameters, the $A$ , $B$ , and $k - 1$ $β_{s}$ coefficients. Because the meaning of the linear dependence among the $β_{s}$ values can be clearly revealed in presenting the MLS method, and the TRF and IRF methods are akin to each other, we present the MLS method first and then present the two response function methods. For the following presentation, it is assumed that a common testlet $s$ contains $n_{s}$ items and the total number of items in the $k$ common testlets for linking is $n$ = $\sum n_{s}$ .

MLS Method

The MLS method presented here is a hybrid in that it uses part of the mean/mean method (Loyd & Hoover, 1980) to estimate the slope $A$ and then uses the linear least squares approach to estimate the intercepts, $B$ and $β_{s}$ . Unlike the TRF and IRF methods, the MLS method can estimate the linking coefficients without an iterative search for the solutions.

Taking the mean over a-parameters based on Equation (5) and solving the resulting equation for $A$ leads to a legitimate statistical solution for $A$ (Loyd & Hoover, 1980):

A = \frac{Mean (a_{N})}{Mean (a_{B})} = \frac{\sum_{i} a_{i N} / n}{\sum_{i} a_{i B} / n},

(10)

where $a_{N}$ and $a_{B}$ represent all the discrimination parameters estimated on the new and base scales, respectively. Once the $A$ coefficient is estimated by Equation (10), the $B$ and beta coefficients need to be estimated simultaneously because, as seen from Equation (7) or (8), they appear together in a single equation.

Based on Equation (7a), let $d_{i N}^{*} = d_{i N} - a_{i N} B / A - a_{i N} C_{s B} β_{s} / A$ and $e_{i} = d_{i N}^{*} - d_{i B}$ . According to the statistical approaches used in Divgi (1985) and Oshima et al. (2000), the $B$ and beta coefficients can be estimated as the values that minimize the sum of squared differences ( $e_{i}^{2}$ ) between $d_{i N}^{*}$ and $d_{i B}$ for all $i$ . To obtain the solutions for the intercept coefficients using the least squares method, we first write an error model, based on Equation (7b), as

d_{N} - d_{B} = \frac{1}{A} P D β + e,

(11)

where $d_{N} = (d_{1 N}, ..., d_{n N})^{'}$ and $d_{B} = (d_{1 B}, ..., d_{n B})^{'}$ are $n \times 1$ vectors of d-parameter estimates; $e$ is an $n \times 1$ error vector; $P$ is an $n \times (k + 1)$ matrix whose row elements are “factor loadings,” $a_{i N}$s, associated with the $(k + 1)$-dimensional space $θ = (θ_{0}, θ_{1}, \dots, θ_{k})$; $D$ (here a matrix, not the scaling constant of Equation (1)) is a diagonal matrix whose $k + 1$ diagonal elements are 1, $C_{1 B}$,…, $C_{k B}$; and $β = (B, β_{1}, ..., β_{k})^{'}$ is a vector of size $k + 1$. For instance, if there are two testlets and two items within each testlet, $P$, $D$, and $β$ are expressed as

P = [\begin{matrix} a_{1 N} & a_{1 N} & 0 \\ a_{2 N} & a_{2 N} & 0 \\ a_{3 N} & 0 & a_{3 N} \\ a_{4 N} & 0 & a_{4 N} \end{matrix}], \quad D = [\begin{matrix} 1 & 0 & 0 \\ 0 & C_{1 B} & 0 \\ 0 & 0 & C_{2 B} \end{matrix}], \quad \text{and} \quad β = [\begin{matrix} B \\ β_{1} \\ β_{2} \end{matrix}] .

(12)

Although the error model in Equation (11) resembles a regression model, where the dependent variable is $d_{N} - d_{B}$ and the coefficient vector is $β$ , the solutions of $β$ cannot be computed using the ordinary least squares approach because the factor pattern matrix $P$ is not of full column rank (as shown by the example matrix in Equation (12)). With the condition $n \geq k + 1$ , usually met in practice, the rank of $P$ is $k$ , not $k + 1$ . Such rank deficiency implies that except for the $B$ coefficient, the $k$ $β_{s}$ coefficients in $β$ are linearly dependent, and only $k - 1$ ones need to be estimated. Although the linear dependence among $β_{s}$ can be formulated in many ways, we here choose the constraint $\sum β_{s} = 0$ , corresponding to the constraint $\sum μ_{γ_{s}} = 0$ used in Li et al. (2005). In addition, we introduce a transformation matrix $T$ , which relates $β$ to an estimation vector $α$ = $(B, α_{1}, ..., α_{k - 1})^{'}$ such that

β = T α,

(13)

where $T = [\begin{matrix} 1 & 0 \\ 0 & T_{D} \end{matrix}]$ is a $(k + 1) \times k$ matrix, and $T_{D} = [\begin{matrix} 1 / k & 1 / k & \cdots & 1 / k \\ (1 - k) / k & 1 / k & \cdots & 1 / k \\ 1 / k & (1 - k) / k & \cdots & 1 / k \\ \vdots & \vdots & \ddots & \vdots \\ 1 / k & 1 / k & \cdots & (1 - k) / k \end{matrix}]$. Now the free coefficients in $α$ can be estimated using the least squares method, and the solution formula can be derived as

α = A {[(P D T)^{'} (P D T)]}^{- 1} (P D T)^{'} (d_{N} - d_{B}) .

(14)

Finally, the MLS solutions for the $B$ and $β_{s}$ coefficients in $β$ are obtained by plugging the resulting $α$ into Equation (13). Note that the $β_{s}$ estimates surely satisfy the zero-sum constraint due to the use of the $T$ matrix.
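The MLS steps in Equations (10) through (14) can be sketched in a few lines. The Python sketch below is illustrative only and is not the authors' R code. It assumes error-free parameter values, uses a small hand-written linear solver instead of a linear algebra library, and takes hypothetical item values. The true coefficients are those of data set 3 in Table 1.

```python
def solve(M, y):
    """Solve M x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def t_d(s, j, k):
    """Element (s, j) of T_D, 0-based: (1 - k)/k where s == j + 1, else 1/k."""
    return (1.0 - k) / k if s == j + 1 else 1.0 / k

def mls_link(a_N, d_N, a_B, d_B, C_B, testlet):
    """Return MLS estimates (A, B, [beta_1, ..., beta_k])."""
    n, k = len(a_N), len(C_B)
    A = sum(a_N) / sum(a_B)  # Eq. (10); the 1/n factors cancel
    # Rows of X = P D T for alpha = (B, alpha_1, ..., alpha_{k-1}).
    X = [[a_N[i]] + [a_N[i] * C_B[testlet[i]] * t_d(testlet[i], j, k)
                     for j in range(k - 1)] for i in range(n)]
    y = [dn - db for dn, db in zip(d_N, d_B)]
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    alpha = [A * v for v in solve(XtX, Xty)]  # Eq. (14)
    beta = [sum(t_d(s, j, k) * alpha[j + 1] for j in range(k - 1))
            for s in range(k)]                # Eq. (13)
    return A, alpha[0], beta

# Hypothetical base-scale values: two testlets with two items each.
a_B, d_B = [1.0, 0.8, 1.5, 1.2], [-0.5, 0.3, 0.1, -0.9]
C_B, testlet = [0.6, 0.9], [0, 0, 1, 1]
A_true, B_true, beta_true = 1.25, 0.37, [-0.362, 0.362]

# New-scale values implied by Equations (5) and (7a) without error.
a_N = [A_true * a for a in a_B]
d_N = [d_B[i] + a_N[i] * (B_true + C_B[s] * beta_true[s]) / A_true
       for i, s in enumerate(testlet)]

A_hat, B_hat, beta_hat = mls_link(a_N, d_N, a_B, d_B, C_B, testlet)
print(A_hat, B_hat, beta_hat)
```

Because the inputs contain no estimation error, the sketch should return the generating coefficients, and the beta estimates sum to zero by construction of $T$.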

TRF Method

For the traditional 3PL model, the TRF at a given $θ$ is defined as the sum of IRFs over all the items on the test, written as $T (θ) = \sum P_{i} (θ)$ . $T (θ)$ is the true score for an examinee with ability $θ$ . Conceptually, the analog of $T (θ)$ for the 3PL testlet model can be defined as the sum of the marginalized IRFs, each of which is computed by integrating the nuisance dimension $γ_{s}$ (or $θ_{s}$ ) out from the IRF in Equation (1) (or 2). In accordance with this conception, let $P_{i} (θ_{B})$ and $T (θ_{B})$ denote the marginalized IRF and TRF computed with the item/testlet parameters estimated on the $θ_{B}$ scale, respectively, and let $P_{i}^{*} (θ_{B})$ and $T^{*} (θ_{B})$ denote the marginalized IRF and TRF computed with the parameter estimates transformed to the $θ_{B}$ scale. The TRF method finds the solutions of $A$ and $β = {(B, β_{1}, ..., β_{k})}^{'}$ that minimize the criterion function, $f_{T} (A, β)$ ,

f_{T} (A, β) = \frac{1}{N} \sum_{q = 1}^{N} {[T (θ_{q B}) - T^{*} (θ_{q B})]}^{2} = \frac{1}{N} \sum_{q = 1}^{N} {[\sum_{i = 1}^{n} P_{i} (θ_{q B}) - \sum_{i = 1}^{n} P_{i}^{*} (θ_{q B})]}^{2},

(15)

where $q =$ 1, 2,…, $N$ indexes $N$ arbitrary points over the $θ_{B}$ scale.

Although the marginal IRF and TRF can be straightforwardly computed for each pair of $θ$ and $γ_{s}$ , Li et al. (2005) used a new composite variable $ξ_{s} = θ + γ_{s}$ for linking purposes. They noted that, assuming $θ$ and $γ_{s}$ have independent normal distributions with zero means and variances equal to 1 and $σ_{γ_{s}}^{2}$ , respectively, $ξ_{s}$ given $θ$ is distributed as $N (θ, σ_{γ_{s}}^{2})$ . With $ξ_{s}$ , more explicitly $ξ_{s (i)}$ , the 3PL testlet model in Equation (1) can be expressed as

P_{i} (ξ_{s}) = P (ξ_{s}; a_{i}, b_{i}, c_{i}) = c_{i} + \frac{1 - c_{i}}{1 + \exp [- D a_{i} (ξ_{s} - b_{i})]} .

(16)

Then, the probability of answering item $i$ within testlet $s$ correctly, conditional on $θ$ and $σ_{ξ_{s}} (= σ_{γ_{s}})$ , that is, the marginalized $P_{i} (θ)$ , is expressed as

P_{i} (θ) \equiv P_{i} (θ; σ_{ξ_{s}}) = \int P_{i} (ξ_{s}) h (ξ_{s} | θ; σ_{ξ_{s}}) d ξ_{s},

(17)

where $h$ is the probability density function of $ξ_{s}$ given $θ$ . The integral in Equation (17) can be approximated to any desired degree of accuracy by using Gauss–Hermite quadrature.
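The marginalization in Equation (17) can be illustrated with a short Python sketch. It substitutes a simple midpoint rule for Gauss–Hermite quadrature to stay within the standard library. The grid size, integration width, and item values are assumptions made for illustration.

```python
import math

def irf(xi, a, b, c, D=1.7):
    """Equation (16): 3PL IRF on the composite scale xi = theta + gamma."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (xi - b)))

def marginal_irf(theta, a, b, c, sigma, D=1.7, n_points=201, width=6.0):
    """Approximate Equation (17) by a midpoint rule over theta +/- width * sigma.

    This is a stand-in for Gauss-Hermite quadrature, not the same rule.
    """
    lo = theta - width * sigma
    step = 2.0 * width * sigma / n_points
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for q in range(n_points):
        xi = lo + (q + 0.5) * step
        dens = math.exp(-0.5 * ((xi - theta) / sigma) ** 2) / norm
        total += irf(xi, a, b, c, D) * dens * step
    return total

# Hypothetical item with a medium testlet effect (variance .5).
a, b, c, sigma = 1.2, 0.5, 0.2, math.sqrt(0.5)
p_cond = irf(1.5, a, b, c)                   # IRF at gamma = 0
p_marg = marginal_irf(1.5, a, b, c, sigma)   # IRF averaged over gamma
print(p_cond, p_marg)
```

For an ability above the item difficulty, averaging over the testlet effect pulls the probability toward the midpoint of the curve, so `p_marg` is smaller than `p_cond`.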

Let $ξ_{s B}$ and $ξ_{s N}$ denote the $ξ_{s}$ scales defined with the base and new groups, respectively. If the two scales are related as $ξ_{s B}$ = $A ξ_{s N} + B$ , the item parameters on the $ξ_{s N}$ scale can be transformed into those on the $ξ_{s B}$ scale as follows (Li et al., 2005):

a_{i N}^{*} = a_{i N} / A,

(18)

b_{i N}^{*} = A b_{i N} + B .

(19)

Although both transformations are legitimate in a technical sense, the transformation of $b_{i N}$ by Equation (19) is insufficient for linking purposes because $ξ_{s B}$ = $A ξ_{s N} + B$ takes into account possible mean and SD differences in $θ$ between the base and new groups but not possible mean differences in $γ_{s}$ between the two groups. Li et al. (2005) pointed out that if separate calibration results were obtained using the model in Equation (16), possible differences in the mean of $γ_{s}$ between the base and new groups would be absorbed into $b_{i N}$ , which would lead to a shift in $b_{i N}$ . Therefore, they used the following transformation to account for that possible shift

b_{i N}^{*} = A (b_{i N} + μ_{γ_{s}}) + B .

(20)

They further indicated that the zero-sum constraint $\sum μ_{γ_{s}} = 0$ ( $s = 1, ..., k$ ) should be imposed for model identification, although they did not detail why the model needed that constraint. Note that based on Equation (8), the $b_{i N}^{*}$ in Equation (20) can also be written as

b_{i N}^{*} = A [b_{i N} + (\frac{C_{s B}}{A}) β_{s}] + B,

(21)

where $C_{s B} β_{s} / A = μ_{γ_{s}}$ . Of course, the constraint $\sum β_{s} = 0$ is necessary for the reason revealed when the MLS method was addressed above.

The criterion function $f_{T} (A, β)$ in Equation (15), defined with the two sets, { $a_{i B}, b_{i B}, c_{i B}, C_{s B}$ } and { $a_{i N}^{*}, b_{i N}^{*}, c_{i N}^{*}$ }, for the $n$ common items associated with $k$ testlets, is nonlinear with respect to the linking coefficients, $A$ , $B$ , and $β_{s}$ , where the zero-sum constraint can be dealt with in practice by setting $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$ . Thus a multivariate search technique is required to find the linking solutions for the TRF method. Previous linking studies (e.g., Kim & Lee, 2006; Oshima et al., 2000) suggest that the minimization solutions can be obtained by using a modified Newton or quasi-Newton approach such as the BFGS method. However, Li et al. (2005) combined the GENOUD algorithm (Sekhon & Mebane, 1998) with the BFGS method to ensure that the global, not the local, minimum solutions are obtained. All of the search techniques are based on the vector of partial derivatives (i.e., gradient) of the criterion function with respect to the parameters. The analytic formulas for the gradient of $f_{T} (A, β)$ with respect to the linking coefficients are presented in the Appendix.

IRF Method

Given the marginalized IRFs $P_{i} (θ_{B})$ and $P_{i}^{*} (θ_{B})$ for all common items, the IRF linking method (Haebara, 1980) for the traditional 3PL model can be straightforwardly extended to the 3PL testlet model. Similarly to the TRF method, the IRF method finds the solutions of $A$ and $β = {(B, β_{1}, ..., β_{k})}^{'}$ that minimize the criterion function, $f_{I} (A, β)$ ,

f_{I} (A, β) = \frac{1}{N n} \sum_{q = 1}^{N} \sum_{i = 1}^{n} {[P_{i} (θ_{q B}) - P_{i}^{*} (θ_{q B})]}^{2},

(22)

where, as denoted earlier, $n$ is the number of common items and $q =$ 1, 2,…, $N$ indexes $N$ arbitrary points over the $θ_{B}$ scale. Although the marginalized IRFs $P_{i} (θ)$ or $P_{i} (θ_{0})$ , as generally denoted, can be evaluated by Equation (17), they can also be computed using the bi-factor model in Equation (2) as follows:

P_{i} (θ) \equiv P_{i} (θ_{0}) = \int P_{i} (θ_{0}, θ_{s}) g (θ_{s}) d θ_{s},

(23)

where $g$ is the probability density function of $θ_{s}$ . Of course, in that case, the $P_{i} (θ_{q B})$ and $P_{i}^{*} (θ_{q B})$ in Equation (22) are the probabilities evaluated at $θ_{0} = θ_{q B}$ with the parameter sets { $a_{i B}$ , $c_{i B}$ , $d_{i B}$ , $C_{s B}$ } and { $a_{i N}^{*}$ , $c_{i N}^{*}$ , $d_{i N}^{*}$ }, respectively.

The criterion function $f_{I} (A, β)$ is nonlinear, as is $f_{T} (A, β)$ , with respect to the linking coefficients, and thus a search technique is required to find the linking solutions. In this paper, we use the BFGS algorithm to find the linking solutions for the IRF method. The analytic formulas for the gradient of $f_{I} (A, β)$ are presented in the Appendix.
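As an illustration of Equations (15) and (22), the Python sketch below evaluates both criterion functions on a grid of 41 points from −4 to 4. It rests on several assumptions. The transformed IRF is marginalized with the base-scale constant $C_{s B}$, which is one reading of the parameter sets listed in the text. A midpoint rule replaces Gauss–Hermite quadrature. All item values are hypothetical. The zero-sum constraint is handled by setting $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$.

```python
import math

def irf(xi, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (xi - b)))

def marginal_irf(theta, a, b, c, sigma, n_points=101, width=6.0):
    """Midpoint-rule approximation of the marginalized IRF, Equation (17)."""
    lo = theta - width * sigma
    step = 2.0 * width * sigma / n_points
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for q in range(n_points):
        xi = lo + (q + 0.5) * step
        total += irf(xi, a, b, c) * math.exp(-0.5 * ((xi - theta) / sigma) ** 2) / norm * step
    return total

def criteria(params, base, new, testlet, grid):
    """Return (f_T, f_I); params = [A, B, beta_1, ..., beta_{k-1}]."""
    A, B = params[0], params[1]
    beta = list(params[2:])
    beta.append(-sum(beta))  # zero-sum constraint fixes beta_k
    f_T = f_I = 0.0
    for theta in grid:
        diffs = []
        for i, s in enumerate(testlet):
            C_B = base["C"][s]
            p = marginal_irf(theta, base["a"][i], base["b"][i], base["c"][i], C_B)
            a_star = new["a"][i] / A                               # Eq. (18)
            b_star = A * (new["b"][i] + (C_B / A) * beta[s]) + B   # Eq. (21)
            p_star = marginal_irf(theta, a_star, b_star, new["c"][i], C_B)
            diffs.append(p - p_star)
        f_T += sum(diffs) ** 2              # squared difference of TRFs
        f_I += sum(d * d for d in diffs)    # summed squared IRF differences
    N, n = len(grid), len(testlet)
    return f_T / N, f_I / (N * n)

# Hypothetical error-free example with k = 2 testlets of two items each.
testlet = [0, 0, 1, 1]
base = {"a": [1.0, 0.8, 1.5, 1.2], "b": [-0.5, 0.3, 0.1, 0.9],
        "c": [0.20, 0.15, 0.25, 0.10], "C": [0.6, 0.9]}
A, B, beta = 1.25, 0.37, [-0.362, 0.362]
new = {"a": [A * a for a in base["a"]],
       "b": [(base["b"][i] - B - base["C"][s] * beta[s]) / A
             for i, s in enumerate(testlet)],
       "c": base["c"]}
grid = [-4.0 + 0.2 * q for q in range(41)]

f_T_true, f_I_true = criteria([A, B, beta[0]], base, new, testlet, grid)
f_T_off, f_I_off = criteria([1.0, 0.0, 0.0], base, new, testlet, grid)
print(f_T_true, f_I_true, f_T_off, f_I_off)
```

Both criteria vanish at the generating coefficients and are positive at the identity transformation. In practice a function like `criteria` would be passed to a minimizer such as BFGS.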

Simulation Study

A simulation study was conducted to compare the performance of the TRF, MLS, and IRF methods. Two versions of the TRF method were conducted: one using the GENOUD algorithm and the other using the BFGS algorithm. The IRF method was implemented using only the BFGS algorithm. The design and methodology of this simulation study were closely matched to those used by Li et al. (2005) so that the comparison might be made under nearly the same conditions as those used in the previous study.

Design and Data

The CING design was used to evaluate the linking parameter recovery of the four methods for the 3PL testlet model: (a) the GENOUD-TRF method, (b) the BFGS-TRF method, (c) the MLS method, and (d) the IRF method based on the BFGS algorithm. As in Li et al. (2005), simulated tests and data sets were generated using different sets of item/testlet parameters and linking parameters. Each simulated test form consisted of six testlets, each of which contained 5 items, giving 30 items in total. The number $k$ of common testlets between the two (“base” and “new”) test forms to be linked was considered as the simulation factor. Two levels of $k$ were used: $k$ =2 and $k$ =4, resulting in the two common testlets condition (Condition 1) and the four common testlets condition (Condition 2), respectively.

For each simulation condition, 10 pairs of simulated tests with 5000 examinees per form were generated, as in Li et al. (2005). For each test form, with $D = 1.7$ , the $a_{i}$ parameters were generated from $L N (0, {0.5}^{2})$ , the log-normal distribution with log-mean=0 and log-SD = .5; the $b_{i}$ parameters were generated from $N (0, 1)$ under the restriction that $- 3 \leq b_{i} \leq 3$ ; and the $c_{i}$ parameters were generated from a uniform distribution ranging from .05 to .35. For each test form, the variances of $γ_{s}$ (that is, $C_{s (i)}^{2}$ ) were set to three levels, .1 (small testlet effect), .5 (medium testlet effect), and 1 (large testlet effect), and they were assigned to three testlet pairs that were randomly matched. Note that 10 or 20 common items had the same parameters between the base and new forms to be linked.
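The item-generation scheme just described can be sketched as follows. This Python sketch is illustrative and is not the authors' generation code. The truncation of $b_{i}$ by rejection sampling, the seed, and the way the three variance levels are assigned to randomly formed testlet pairs are assumptions about details the text leaves open.

```python
import random

def generate_form(rng, n_testlets=6, items_per_testlet=5):
    """Draw 3PL testlet item parameters for one simulated form."""
    items = []
    for s in range(n_testlets):
        for _ in range(items_per_testlet):
            a = rng.lognormvariate(0.0, 0.5)   # LN(0, 0.5^2)
            b = rng.gauss(0.0, 1.0)            # N(0, 1) ...
            while not -3.0 <= b <= 3.0:        # ... redrawn until within [-3, 3]
                b = rng.gauss(0.0, 1.0)
            c = rng.uniform(0.05, 0.35)
            items.append({"testlet": s, "a": a, "b": b, "c": c})
    # Assumed reading: shuffle the testlets, pair them off, and give each
    # pair one of the three variance levels (.1, .5, 1).
    order = list(range(n_testlets))
    rng.shuffle(order)
    levels = [0.1, 0.5, 1.0]
    variance = {s: levels[pos // 2] for pos, s in enumerate(order)}
    return items, variance

rng = random.Random(2022)  # arbitrary seed for reproducibility
items, variance = generate_form(rng)
print(len(items), variance)
```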

Because the linking coefficients $A$ and $B$ reflect, respectively, the differences in the SD and mean of the primary factor $θ_{0}$ between the base and new populations, and the $β_{s}$ coefficients reflect differences in the mean of the testlet effect factors $θ_{s}$, the generation of linking parameters began by fixing the distributions of all factors for the base population to $N (0, 1)$. Then the slope coefficients $A$ (the SDs of $θ_{0}$ for the new population) were generated from $L N (0, {0.2}^{2})$, and the intercept coefficients $B$ (the means of $θ_{0}$ for the new population) were generated from $N (0, {0.3}^{2})$. Ten combinations of $A$ and $B$ values were generated, and they were applied to both Conditions 1 and 2. Note that for the first combination, the values of $A$ and $B$ were set at 1 and 0, respectively, so that it could serve as the baseline combination. For each simulation condition, the $β_{s}$ coefficients (the means of $θ_{s}$ for the new population) were generated from $N (0, {0.3}^{2})$, subject to the constraint $\sum_{s = 1}^{k} β_{s} = 0$, where the first $k - 1$ beta coefficients were randomly sampled from the distribution and the last beta coefficient was set as $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$. Associated with the first combination of $A = 1$ and $B = 0$, all beta coefficients were set at 0. The true linking parameters, $A$, $B$, and $β_{s}$, used to generate 20 data sets (10 data sets per condition) are presented in Table 1.
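A minimal Python sketch of this generation step is given below, assuming the standard library random generator and an arbitrary seed. The function name is hypothetical, and the draws will not reproduce the values in Table 1.

```python
import random

def generate_linking(k, rng):
    """Draw A, B, and zero-sum beta_1..beta_k for k common testlets."""
    A = rng.lognormvariate(0.0, 0.2)                    # LN(0, 0.2^2)
    B = rng.gauss(0.0, 0.3)                             # N(0, 0.3^2)
    beta = [rng.gauss(0.0, 0.3) for _ in range(k - 1)]  # first k - 1 betas
    beta.append(-sum(beta))                             # last beta closes the sum
    return A, B, beta

rng = random.Random(1)  # arbitrary seed
A, B, beta = generate_linking(4, rng)
print(A, B, beta, sum(beta))
```

Only $k - 1$ of the beta coefficients are free draws, which mirrors the $k + 1$ free linking parameters that the methods estimate.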

Table 1.

True Linking Coefficients for Simulation Conditions 1 and 2.

		Condition 1			Condition 2
$A$	$B$	Data Set	$β_{1}$	$β_{2}$	Data Set	$β_{1}$	$β_{2}$	$β_{3}$	$β_{4}$
1.000	.000	1	.000	.000	11	.000	.000	.000	.000
.850	−.244	2	−.052	.052	12	−.202	.006	−.541	.737
1.250	.370	3	−.362	.362	13	−.164	.507	.063	−.406
1.013	.051	4	.224	−.224	14	−.141	.512	−.112	−.259
.900	−.360	5	.145	−.145	15	−.024	.238	−.143	−.071
1.093	.202	6	−.017	.017	16	−.138	−.008	.658	−.512
1.234	−.180	7	.132	−.132	17	.142	−.172	−.021	.051
.986	.026	8	−.321	.321	18	.087	−.188	.053	.048
.961	.150	9	.239	−.239	19	−.475	−.247	.155	.567
.857	−.108	10	.020	−.020	20	−.082	.013	.364	−.295

Estimation and Evaluation

For each data set, the item and testlet parameters for the 3PL testlet model were estimated using the computer program flexMIRT (Cai, 2017). By default, flexMIRT uses 0–1 scaling for each factor to estimate item parameters, and we applied that scaling approach to the separate calibrations of base and new sample data. With the separate calibration results, the linking parameters were estimated using the statistical programming language R (R Development Core Team, 2018). Specifically, the linking solutions for the MLS method were computed using the built-in linear algebra functions. The solutions for the GENOUD-TRF method were found using the “genoud” function included in the R package rgenoud (Mebane & Sekhon, 2011). The solutions for the BFGS-TRF and IRF methods were found using the “optim” function included in the R package stats. For the TRF and IRF methods, 41 $θ_{B}$ points, equally spaced from −4 to 4, were used to define their criterion functions (see Equations (15) and (22)).

For each data set in each condition, differences between the estimated and true linking parameters (i.e., estimation errors) were computed to evaluate the performance of each linking method. In addition, the means of the absolute differences across the 10 data sets in each condition were computed to summarize the estimation errors for each of the linking parameters.
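The summary statistic is simply the mean of the absolute estimation errors. As a small Python illustration, the ten $\hat{B} - B$ errors reported for the GENOUD-TRF method in Table 2 give the mean absolute error listed in that table.

```python
# Estimation errors of B for data sets 1-10 (GENOUD-TRF rows of Table 2).
errors_B = [0.013, -0.016, 0.056, 0.042, -0.023,
            -0.001, -0.054, -0.056, 0.025, 0.006]

# Mean absolute error across the 10 data sets.
mae_B = sum(abs(e) for e in errors_B) / len(errors_B)
print(round(mae_B, 3))  # 0.029
```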

Results

Results of Condition 1

The linking parameter recovery results of the GENOUD- and BFGS-TRF methods for Condition 1 (data sets 1–10), in which two common testlets were used, are presented in Table 2. The two TRF methods performed nearly equally in estimating the true linking parameters. For the $A$, $B$, and $β_{s}$ coefficients, in most cases, the estimates produced by the GENOUD-TRF method were equal to those produced by the BFGS-TRF method up to three decimal places. These results suggest that the use of a genetic algorithm offers little improvement to the TRF method. The recovery of the linking parameters differed by the type of linking coefficients. For most data sets, the estimation errors for $\hat{A}$ and $\hat{B}$ were close to zero, and the mean absolute errors of $\hat{A}$ and $\hat{B}$ were .021 and .029, respectively, indicating that the two TRF methods perform well in estimating the linking coefficients for the primary factor $θ_{0}$. By contrast, the estimation errors for ${\hat{β}}_{1}$ and ${\hat{β}}_{2}$ were greater (exceeding .4 in absolute value for data sets 3 and 9), and their mean absolute errors were both .182 (with $k = 2$, the zero-sum constraint forces the two errors to be equal in magnitude and opposite in sign). It is noteworthy that for the first baseline data set, the estimation errors for the beta coefficients (−.174 and .174) are much larger than the error for the $B$ coefficient (.013). This finding suggests that the TRF method can be poor at estimating the mean differences in testlet factors between the examinee groups being analyzed for linking.

Table 2.

Estimation Errors of the Linking Parameters from the Two TRF Methods for Condition 1.

Data Set	$\hat{A} - A$	$\hat{B} - B$	${\hat{β}}_{1} - β_{1}$	${\hat{β}}_{2} - β_{2}$
GENOUD-TRF method
1	−.019	.013	−.174	.174
2	.007	−.016	.053	−.053
3	−.018	.056	.445	−.445
4	−.045	.042	−.162	.162
5	−.007	−.023	−.069	.069
6	−.038	−.001	.063	−.063
7	.029	−.054	.154	−.154
8	−.020	−.056	.121	−.121
9	−.030	.025	−.530	.530
10	.002	.006	.045	−.045
Mean absolute error	.021	.029	.182	.182
BFGS-TRF method
1	−.019	.013	−.174	.174
2	.007	−.016	.053	−.053
3	−.017	.056	.444	−.444
4	−.045	.042	−.163	.163
5	−.007	−.023	−.069	.069
6	−.038	−.001	.063	−.063
7	.029	−.054	.154	−.154
8	−.020	−.056	.121	−.121
9	−.030	.025	−.530	.530
10	.002	.006	.045	−.045
Mean absolute error	.021	.029	.182	.182

The recovery results from the MLS and IRF methods for Condition 1 (data sets 1–10) are presented in Table 3. For both methods, the estimation errors of $\hat{A}$ and $\hat{B}$ were close to zero in most cases, as was found with the two TRF methods. The mean absolute errors for $\hat{A}$ and $\hat{B}$ with the MLS method were .035 and .037, respectively, and those with the IRF method were .021 and .033. For the recovery of the beta coefficients, the estimation errors for ${\hat{β}}_{1}$ and ${\hat{β}}_{2}$ with the MLS and IRF methods were closer to zero than those with the two TRF methods. The mean absolute error for either beta coefficient with the MLS method was .068, and that with the IRF method was .036. This finding suggests that the IRF, MLS, and TRF methods perform best, second best, and worst, respectively, in estimating the intercept linking coefficients ( $β_{s}$ ).

Table 3.

Estimation Errors of the Linking Parameters from the MLS and IRF Methods for Condition 1.

Data Set	$\hat{A} - A$	$\hat{B} - B$	${\hat{β}}_{1} - β_{1}$	${\hat{β}}_{2} - β_{2}$
MLS method
1	.015	−.019	.013	−.013
2	.108	−.025	−.155	.155
3	.013	−.150	−.182	.182
4	.025	.021	.002	−.002
5	−.032	−.006	.084	−.084
6	.056	−.067	−.190	.190
7	.017	−.020	.025	−.025
8	−.030	−.039	−.018	.018
9	−.044	.016	.006	−.006
10	.009	.006	.007	−.007
Mean absolute error	.035	.037	.068	.068
IRF method
1	.007	.003	−.005	.005
2	−.003	−.013	.026	−.026
3	.027	−.075	−.091	.091
4	−.033	.043	.012	−.012
5	−.036	−.050	.102	−.102
6	−.032	.013	.037	−.037
7	.025	−.036	.007	−.007
8	−.010	−.069	−.024	.024
9	.019	−.031	.046	−.046
10	.014	−.001	.013	−.013
Mean absolute error	.021	.033	.036	.036

Results of Condition 2

The recovery results of the GENOUD- and BFGS-TRF methods in Condition 2 (data sets 11–20) are presented in Table 4, and the results of the MLS and IRF methods are presented in Table 5. As was found in the results in Condition 1, all methods produced estimation errors for $\hat{A}$ and $\hat{B}$ that were close to zero in most cases. The mean absolute errors for $\hat{A}$ and $\hat{B}$ were .022 and .046, respectively, with the GENOUD-TRF method, .023 and .045 with the BFGS-TRF method, .026 and .053 with the MLS method, and .014 and .036 with the IRF method.

Table 4.

Estimation Errors of the Linking Parameters from the Two TRF Methods for Condition 2.

Data Set	$\hat{A} - A$	$\hat{B} - B$	${\hat{β}}_{1} - β_{1}$	${\hat{β}}_{2} - β_{2}$	${\hat{β}}_{3} - β_{3}$	${\hat{β}}_{4} - β_{4}$
GENOUD-TRF method
11	−.067	.006	−.504	.063	−.189	.630
12	−.013	−.088	−.317	−.621	.563	.376
13	−.055	−.021	−.062	−.072	−.189	.324
14	.013	−.055	.180	−.029	−.245	.094
15	−.009	.061	−.304	−.110	.091	.323
16	−.027	.040	−.335	−.196	.361	.170
17	−.001	−.013	.067	.054	−.194	.074
18	.001	.044	−.022	.119	−.298	.201
19	−.017	.115	−.193	−.097	−.063	.352
20	−.018	.020	.102	−.202	−.240	.340
Mean absolute error	.022	.046	.209	.156	.243	.288
BFGS-TRF method
11	−.067	.006	−.500	.057	−.189	.632
12	−.010	−.085	−.310	−.578	.555	.332
13	−.055	−.024	−.088	−.059	−.182	.330
14	.014	−.048	.066	−.038	−.160	.132
15	−.026	.060	.053	−.114	−.101	.162
16	−.028	.040	−.317	−.158	.296	.179
17	.001	−.012	.054	.047	−.169	.068
18	.001	.044	−.027	.119	−.295	.202
19	−.017	.115	−.193	−.097	−.063	.353
20	−.014	.019	.114	−.176	−.244	.306
Mean absolute error	.023	.045	.172	.144	.225	.270

Table 5.

Estimation Errors of the Linking Parameters from the MLS and IRF Methods for Condition 2.

Data Set	$\hat{A} - A$	$\hat{B} - B$	${\hat{β}}_{1} - β_{1}$	${\hat{β}}_{2} - β_{2}$	${\hat{β}}_{3} - β_{3}$	${\hat{β}}_{4} - β_{4}$
MLS method
11	.082	−.093	−.014	.078	−.010	−.054
12	.032	−.036	−.012	.170	−.167	.010
13	−.029	−.063	−.300	.126	.120	.054
14	.013	−.052	.005	.094	−.034	−.064
15	−.003	.039	.036	.267	−.085	−.218
16	−.038	.071	.008	−.226	.256	−.038
17	.003	−.004	.012	−.068	.012	.043
18	.005	−.021	.096	.063	−.017	−.141
19	−.020	.139	−.100	−.220	−.078	.398
20	−.035	−.012	−.142	.029	.209	−.097
Mean absolute error	.026	.053	.073	.134	.099	.112
IRF method
11	.047	−.006	.003	−.009	.044	−.037
12	.007	−.051	.086	−.036	−.081	.031
13	−.016	.005	−.038	−.001	.010	.029
14	−.004	−.042	.004	.066	−.018	−.053
15	.005	.063	−.066	.167	−.041	−.060
16	−.013	.053	−.124	−.167	.345	−.054
17	.006	.001	−.025	−.041	−.009	.074
18	.014	.019	.011	−.034	−.027	.050
19	−.011	.113	−.076	−.249	−.028	.353
20	.017	−.006	−.103	.041	.137	−.075
Mean absolute error	.014	.036	.053	.081	.074	.082

For the recovery of the beta coefficients, the two TRF methods produced estimation errors for ${\hat{β}}_{1}$ to ${\hat{β}}_{4}$ that deviated from zero by more than .3 in many cases. The mean absolute errors for ${\hat{β}}_{1}$ to ${\hat{β}}_{4}$ were .209, .156, .243, and .288, respectively, with the GENOUD-TRF method and .172, .144, .225, and .270 with the BFGS-TRF method. Thus, using a genetic algorithm for scale linking can lead to worse solutions for the beta coefficients than using a quasi-Newton algorithm. In contrast, the estimation errors for ${\hat{β}}_{1}$ to ${\hat{β}}_{4}$ with the MLS and IRF methods were closer to zero than those with the two TRF methods. The mean absolute errors of ${\hat{β}}_{1}$ to ${\hat{β}}_{4}$ were .073, .134, .099, and .112, respectively, with the MLS method and .053, .081, .074, and .082 with the IRF method. It is noteworthy that with the baseline data set 11, the two TRF methods resulted in much larger estimation errors for the beta coefficients than for the $B$ coefficient. In sum, the IRF, MLS, and TRF methods performed best, second best, and worst, respectively, in estimating the $β_{s}$ coefficients.

Discussion and Conclusions

Li et al. (2005) proposed a TRF linking method for the 2PNO testlet model and used the GENOUD genetic algorithm to find minimization solutions for the linking parameters by updating a population of solutions from generation to generation. In the present paper, we used the 3PL testlet model for generality to formulate the linking task from the perspective of bi-factor modeling and presented two alternatives (MLS and IRF) to the TRF linking method for the model. One of the purposes of the simulation study was to examine whether there is a compelling reason to use the genetic algorithm instead of the BFGS algorithm, which is one of the quasi-Newton methods widely applied, when using the TRF method to find linking solutions. The other purpose was to investigate the performance of the TRF method (based on either the GENOUD or BFGS algorithm) against the other linking methods, MLS and IRF.

The following main results were found from the simulation study. For the simulated linking data sets using two common testlets (Condition 1), the performance of the GENOUD-TRF method was nearly the same as that of the BFGS-TRF method in recovering the true linking parameters, $A$ (slope) and $B$ (intercept) for the primary dimension factor and $β_{s}$ (intercepts) for the testlet factors, subject to the zero-sum constraint $\sum β_{s} = 0$ . In Condition 2 involving four common testlets, the GENOUD-TRF method performed nearly as well as the BFGS-TRF method in estimating the $A$ and $B$ coefficients, but it tended to estimate the $β_{s}$ coefficients less accurately than the BFGS-TRF method. This finding suggests that using a genetic algorithm does not lead to better solutions for the linking coefficients, particularly the beta coefficients, than using a quasi-Newton algorithm. In both simulation conditions, there was a small difference in linking accuracy among the linking methods for the $A$ and $B$ coefficients, whereas for the $β_{s}$ coefficients, the methods differed substantially. In recovering the true $β_{s}$ coefficients, on average, the IRF method showed the least estimation error, and the TRF methods produced the largest errors, more than double the average error of the IRF method. Taken together, these results suggest that the IRF, MLS, and TRF methods perform best, second best, and worst, respectively, in estimating the linking parameters associated with testlet effects.

The poor performance of the TRF method against the IRF method in estimating the $β_{s}$ coefficients may be regarded as a bit unusual, but is not a new finding, as shown by Kim (2019). To understand why the TRF method estimated the beta coefficients more poorly than the IRF method, we examined the contour plots (i.e., level curves) of the negative criterion functions, $- f_{T} (A, β)$ and $- f_{I} (A, β)$ , for the two methods. With $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$ and $A$ and/or $B$ fixed to their true values, we drew the contour plots with the axes of $B$ and $β_{1}$ for the data sets in Condition 1 and the axes of each of the three pairs ( $β_{1}$ and $β_{2}$ , $β_{1}$ and $β_{3}$ , and $β_{2}$ and $β_{3}$ ) for the data sets in Condition 2. As illustrated in Figure 1 (where the contour plots for data sets 1 and 11 are presented as examples), for all data sets in Condition 1, the contour plots of $- f_{T} (A, β)$ had top-level curves shaped like elongated rings, narrow along the $B$ -axis but wide along the $β_{1}$ -axis, whereas those of $- f_{I} (A, β)$ had top-level curves shaped like small ellipses, narrow along both axes. Compared with the top-level curves for the IRF method, the shape of the top-level curves for the TRF method indicates that the $B$ coefficient can be more accurately estimated than the $β_{1}$ coefficient. For all data sets in Condition 2, the contour plots of $- f_{T} (A, β)$ had top-level curves shaped like distorted ellipses, big and wide, whereas those of $- f_{I} (A, β)$ had top-level curves shaped like small ellipses, tilted diagonally.

Figure 1. Contour Plots of $- f_{T} (A, β)$ and $- f_{I} (A, β)$ with Data Set 1 in Condition 1 and Data Set 11 in Condition 2.

The difference in shape between the top-level curves suggests that the TRF method produces less stable estimates of the beta coefficients than the IRF method and that the GENOUD and BFGS algorithms can converge to different neighborhoods and reach a global minimum. In other words, the criterion function of the IRF method is well-structured for the linking solutions but that of the TRF method is not. The criterion function of the IRF method is based on the sum of the squared differences in category response functions at the item level, and so the possible location differences between separate calibrations for each testlet are preserved and not mixed with those for other testlets. But the criterion function of the TRF method is based on the squared differences in true test scores, so that the possible location differences for each testlet are likely confounded at the test level. Such confounding likely leads to unstable estimation of the beta coefficients.
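The confounding argument can be illustrated with a deliberately simplified toy example. The sketch below is not the 3PL testlet model, nor the criterion functions of Equations (15) and (22): it assumes two testlets with identical two-parameter logistic items, no integration over testlet factors, and made-up item parameters, and all names in it are hypothetical. It compares a common location shift of all items (analogous to an error in $B$) with opposite shifts of the two testlets (analogous to errors in $β_{1}$ and $β_{2}$ under the zero-sum constraint).

```python
import math

def logistic(a, b, theta):
    """Two-parameter logistic response function (toy stand-in for the IRF)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

THETAS = [-4 + 0.2 * q for q in range(41)]    # 41 equally spaced points
A_SLOPE = 1.0                                  # assumed common slope
DIFFICULTIES = [-1.0, -0.5, 0.0, 0.5, 1.0]     # assumed, same in both testlets

def criteria(shift1, shift2):
    """Return (item-level, test-level) squared-difference criteria when the
    item locations of testlets 1 and 2 are shifted by shift1 and shift2."""
    item_sum, test_sum = 0.0, 0.0
    n_items = 2 * len(DIFFICULTIES)
    for theta in THETAS:
        diffs = []
        for shift in (shift1, shift2):
            for b in DIFFICULTIES:
                diffs.append(logistic(A_SLOPE, b, theta)
                             - logistic(A_SLOPE, b + shift, theta))
        item_sum += sum(d * d for d in diffs)   # differences squared item by item
        test_sum += sum(diffs) ** 2             # differences summed, then squared
    return item_sum / (len(THETAS) * n_items), test_sum / len(THETAS)

f_item_common, f_test_common = criteria(0.2, 0.2)
f_item_opposite, f_test_opposite = criteria(0.2, -0.2)

ratio_item = f_item_opposite / f_item_common
ratio_test = f_test_opposite / f_test_common
print(ratio_item, ratio_test)
```

In this toy setting the item-level criterion penalizes the opposite shifts about as much as the common shift (a ratio close to 1), whereas the test-level criterion barely registers them (a ratio far below 1) because the two shifts largely cancel in the summed score. This mirrors the elongated level curves along the $β_{1}$ -axis described for the TRF method, under the stated simplifications.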

From the simulation results, we find no compelling reason to use genetic algorithms instead of quasi-Newton algorithms to find minimization solutions for the TRF method. Furthermore, in a numerical sense, we find that the criterion function of the IRF method produces better-structured solutions than that of the TRF method. From a practical point of view, the BFGS algorithm should be preferred to the GENOUD algorithm because the latter takes much more time than the former to find the minimization solutions. In this simulation study, the GENOUD-TRF method often took more than 5 and 10 minutes for the data sets in Conditions 1 and 2, respectively, whereas the BFGS-TRF and IRF methods took less than 5 and 15 seconds for the corresponding data sets.

As pointed out by Li et al. (2005), the zero-sum constraint for the beta coefficients indicates that the average (across testlets) of testlet factor means should be the same between the examinee groups being analyzed for scale linking. Although the zero-sum constraint seems to be reasonable and flexible, the linear dependence among the beta coefficients can be solved in other ways. One feasible approach is choosing a common testlet whose factor mean is expected to change little across examinee groups and fixing the beta coefficient for the common testlet to zero. Then the rest of the beta coefficients are considered as the free parameters. An analog of this approach is found in differential item functioning (DIF) analyses. If we use the constraint that the overall mean difference across examinee groups in item difficulties is zero, any performance differences are absorbed into differences in ability and treated as impact. If we suspect that the DIF may not cancel out across items, we can designate a set of anchor items to have zero DIF. Of course, to apply that “fixed-to-zero” constraint to linking tasks, the transformation matrix in Equation (13) should be modified appropriately for the MLS method, and the partial derivatives of the criterion functions with respect to the linking parameters should be properly computed for the IRF and TRF methods.
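The two ways of resolving the linear dependence can be sketched as simple reparameterizations that map the $k - 1$ free parameters to the full vector of beta coefficients. This is an illustrative sketch in Python with hypothetical function names and made-up values, not code from the study.

```python
def expand_zero_sum(free_betas):
    """Zero-sum constraint: the last coefficient is minus the sum of the others."""
    return list(free_betas) + [-sum(free_betas)]

def expand_fixed_to_zero(free_betas, reference_index):
    """Fixed-to-zero constraint: the coefficient of the chosen reference
    testlet is set to 0 and the remaining coefficients are free."""
    betas = list(free_betas)
    betas.insert(reference_index, 0.0)
    return betas

free = [0.3, -0.1, 0.2]   # k - 1 = 3 free parameters for k = 4 common testlets
zero_sum = expand_zero_sum(free)
fixed = expand_fixed_to_zero(free, reference_index=0)
print(zero_sum)
print(fixed)
```

Under either mapping an optimizer searches over the same number of free parameters; what changes is which identification assumption is imposed, and therefore how the gradient with respect to the free parameters is assembled.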

Some studies need to be conducted to improve understanding of the three linking methods presented in this paper and enable a wise choice among them in practice. First, the simulation study in this paper did not address the effects of linking errors in transformed item parameter estimates on the estimation of the ability ( $θ_{0}$ ) parameters for the new group examinees. The accuracy of ability estimation would be affected most by the estimates of the $b_{i N}^{*}$ or $d_{i N}^{*}$ parameters, which are functions of the beta coefficients, given $A$ , $B$ , and $C_{s B}$ (see Equation (21)). It should be examined whether the ability parameters can be estimated better when the linking coefficients from the IRF method are used than when those from the MLS or TRF method are used. Second, although the separate calibration and linking approach is a basic and dependable method for developing a common IRT scale, a common scale can also be developed by using multiple-group concurrent calibration (Bock & Zimowski, 1997) or fixed parameter calibration (Kim, 2006). A comparative study on the performance of the three calibration types will offer practitioners in the areas of test equating and vertical scaling useful information about the advantages and disadvantages of the various linking methods. Third, the three linking methods presented for the 3PL testlet model need to be extended to polytomous testlet models such as the graded response testlet model and the generalized partial credit testlet model. It would be meaningful to investigate the performance of the linking methods using a variety of real and simulated data. Finally, it would be very useful to derive analytic formulas for the standard errors of linking coefficient estimates obtained from the different linking methods because those standard errors of estimates can be used as indices of linking precision in practice.

Acknowledgments

The authors are grateful to two anonymous reviewers, Dr. John R. Donoghue (the Editor-in-Chief), and Dr. Christine E. DeMars (the Associate Editor) for their beneficial comments and insightful suggestions to improve the quality of this paper.

Appendix: Partial Derivatives.

Based on $c_{i N}^{*} = c_{i N}$ and Equations (18) and (21), let us write $P_{i}^{*} (ξ_{s B})$ and $P_{i}^{*} (θ_{B})$ as

P_{i}^{*} (ξ_{s B}) = c_{i N}^{*} + (1 - c_{i N}^{*}) / {1 + \exp [- D a_{i N}^{*} (ξ_{s B} - b_{i N}^{*})]},

(A1)

P_{i}^{*} (θ_{B}) = \int P_{i}^{*} (ξ_{s B}) h (ξ_{s B} | θ_{B}; C_{s B}) d ξ_{s B} .

(A2)
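Equations (A1) and (A2) can be evaluated numerically with a one-dimensional quadrature. The following Python sketch is illustrative only and rests on assumptions that are ours: the density $h (ξ_{s B} | θ_{B}; C_{s B})$ is taken to be normal with mean $θ_{B}$ and standard deviation $C_{s B}$, the scaling constant is set to $D = 1.7$, and the function names and item parameter values are hypothetical. If $h$ is defined differently, only `h_density` needs to change.

```python
import math

D = 1.7  # scaling constant; assumed value

def irf_star(xi, a_star, b_star, c_star):
    """Equation (A1): 3PL response function of the composite xi."""
    return c_star + (1.0 - c_star) / (1.0 + math.exp(-D * a_star * (xi - b_star)))

def h_density(xi, theta, c_s):
    """Assumed conditional density of xi given theta: Normal(theta, c_s**2)."""
    z = (xi - theta) / c_s
    return math.exp(-0.5 * z * z) / (c_s * math.sqrt(2.0 * math.pi))

def marginal_irf(theta, a_star, b_star, c_star, c_s, n_points=201, half_width=6.0):
    """Equation (A2) approximated with the trapezoidal rule on
    [theta - half_width * c_s, theta + half_width * c_s]."""
    lower = theta - half_width * c_s
    step = 2.0 * half_width * c_s / (n_points - 1)
    total = 0.0
    for j in range(n_points):
        xi = lower + j * step
        weight = 0.5 if j in (0, n_points - 1) else 1.0
        total += weight * irf_star(xi, a_star, b_star, c_star) * h_density(xi, theta, c_s)
    return total * step

# Hypothetical item: theta equals b_star, so the logistic part averages to 1/2.
p = marginal_irf(theta=0.0, a_star=1.2, b_star=0.0, c_star=0.2, c_s=0.8)
print(p)
```

As a sanity check, when $θ_{B} = b_{i N}^{*}$ the symmetry of the assumed density implies a marginal probability of $c_{i N}^{*} + (1 - c_{i N}^{*}) / 2$, which is 0.6 for the values used here.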

The partial derivatives of $P_{i}^{*} (θ_{B})$ with respect to $A$ , $B$ , and $β_{s}$ (where $i \in s$ ) are computed as

\frac{\partial P_{i}^{*} (θ_{B})}{\partial A} = \int - D a_{i N}^{*} (\frac{ξ_{s B} - C_{s B} β_{s} - B}{A}) [\frac{P_{i}^{*} (ξ_{s B}) - c_{i N}^{*}}{1 - c_{i N}^{*}}] [1 - P_{i}^{*} (ξ_{s B})] h (ξ_{s B} | θ_{B}; C_{s B}) d ξ_{s B},

(A3)

\frac{\partial P_{i}^{*} (θ_{B})}{\partial B} = \int - D a_{i N}^{*} [\frac{P_{i}^{*} (ξ_{s B}) - c_{i N}^{*}}{1 - c_{i N}^{*}}] [1 - P_{i}^{*} (ξ_{s B})] h (ξ_{s B} | θ_{B}; C_{s B}) d ξ_{s B},

(A4)

\frac{\partial P_{i}^{*} (θ_{B})}{\partial β_{s}} = \int - D a_{i N}^{*} C_{s B} [\frac{P_{i}^{*} (ξ_{s B}) - c_{i N}^{*}}{1 - c_{i N}^{*}}] [1 - P_{i}^{*} (ξ_{s B})] h (ξ_{s B} | θ_{B}; C_{s B}) d ξ_{s B} .

(A5)
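Because $h$ does not involve the linking parameters, Equations (A3) to (A5) can be checked at the level of their integrands. The Python sketch below compares those integrands (without $h$) against central finite differences. It assumes the transformation $a_{i N}^{*} = a_{i N} / A$ and $b_{i N}^{*} = A b_{i N} + B + C_{s B} β_{s}$, which is our reading of the form of (A3) to (A5), together with $D = 1.7$; the function names and parameter values are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant

def p_star(xi, A, B, beta, a_new, b_new, c_new, c_s):
    """Response function (A1) under the assumed transformation
    a* = a_new / A and b* = A * b_new + B + c_s * beta."""
    a_star = a_new / A
    b_star = A * b_new + B + c_s * beta
    return c_new + (1.0 - c_new) / (1.0 + math.exp(-D * a_star * (xi - b_star)))

def analytic_gradient(xi, A, B, beta, a_new, b_new, c_new, c_s):
    """Integrands of (A3), (A4), and (A5), omitting the density h."""
    a_star = a_new / A
    p = p_star(xi, A, B, beta, a_new, b_new, c_new, c_s)
    core = -D * a_star * ((p - c_new) / (1.0 - c_new)) * (1.0 - p)
    d_A = core * (xi - c_s * beta - B) / A
    d_B = core
    d_beta = core * c_s
    return d_A, d_B, d_beta

def numeric_gradient(xi, A, B, beta, item, eps=1e-6):
    """Central finite differences with respect to A, B, and beta."""
    def f(A_, B_, beta_):
        return p_star(xi, A_, B_, beta_, *item)
    d_A = (f(A + eps, B, beta) - f(A - eps, B, beta)) / (2 * eps)
    d_B = (f(A, B + eps, beta) - f(A, B - eps, beta)) / (2 * eps)
    d_beta = (f(A, B, beta + eps) - f(A, B, beta - eps)) / (2 * eps)
    return d_A, d_B, d_beta

item = (1.1, 0.3, 0.2, 0.7)   # hypothetical a_new, b_new, c_new, c_s
analytic = analytic_gradient(0.5, 1.2, -0.4, 0.25, *item)
numeric = numeric_gradient(0.5, 1.2, -0.4, 0.25, item)
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print(max_gap < 1e-6)
```

A check of this kind is a common safeguard before supplying analytic gradients to a quasi-Newton routine, since a sign or scaling error in a derivative can slow convergence or send the search in the wrong direction.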

Then, the partial derivatives of $f_{I} (A, β)$ with respect to $A$ and $B$ are given by

\frac{\partial f_{I}}{\partial A} = - \frac{2}{N n} \sum_{q = 1}^{N} \sum_{i = 1}^{n} [P_{i} (θ_{q B}) - P_{i}^{*} (θ_{q B})] \frac{\partial P_{i}^{*} (θ_{q B})}{\partial A},

(A6)

\frac{\partial f_{I}}{\partial B} = - \frac{2}{N n} \sum_{q = 1}^{N} \sum_{i = 1}^{n} [P_{i} (θ_{q B}) - P_{i}^{*} (θ_{q B})] \frac{\partial P_{i}^{*} (θ_{q B})}{\partial B} .

(A7)

When $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$ , the partial derivatives of $f_{I} (A, β)$ with respect to $β_{s}$ ( $s = 1, ..., k - 1$ ) are given by

\frac{\partial f_{I}}{\partial β_{s}} = - \frac{2}{N n} \sum_{q = 1}^{N} \left\{ \sum_{i \in s} [P_{i} (θ_{q B}) - P_{i}^{*} (θ_{q B})] \frac{\partial P_{i}^{*} (θ_{q B})}{\partial β_{s}} - \sum_{i \in k} [P_{i} (θ_{q B}) - P_{i}^{*} (θ_{q B})] \frac{\partial P_{i}^{*} (θ_{q B})}{\partial β_{k}} \right\} .

(A8)
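The two sums and the minus sign in Equation (A8) follow from the chain rule applied to the zero-sum constraint. Each $P_{i}^{*}$ depends only on the beta coefficient of its own testlet, so the unconstrained derivative with respect to $β_{s}$ involves only the items $i \in s$, and the constraint adds a contribution through $β_{k}$. Writing ${\tilde{f}}_{I}$ for the criterion viewed as a function of the free parameters only (a notation introduced here for illustration), the step is:

```latex
\frac{\partial {\tilde{f}}_{I}}{\partial β_{s}}
  = \frac{\partial f_{I}}{\partial β_{s}}
  + \frac{\partial f_{I}}{\partial β_{k}} \frac{\partial β_{k}}{\partial β_{s}}
  = \frac{\partial f_{I}}{\partial β_{s}} - \frac{\partial f_{I}}{\partial β_{k}},
  \qquad \text{since } \frac{\partial β_{k}}{\partial β_{s}} = - 1 \quad (s = 1, \ldots, k - 1) .
```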

With Equations (A1) to (A5), the partial derivatives of $f_{T} (A, β)$ with respect to $A$ and $B$ are given by

\frac{\partial f_{T}}{\partial A} = - \frac{2}{N} \sum_{q = 1}^{N} [\sum_{i = 1}^{n} P_{i} (θ_{q B}) - \sum_{i = 1}^{n} P_{i}^{*} (θ_{q B})] \sum_{i = 1}^{n} \frac{\partial P_{i}^{*} (θ_{q B})}{\partial A},

(A9)

\frac{\partial f_{T}}{\partial B} = - \frac{2}{N} \sum_{q = 1}^{N} [\sum_{i = 1}^{n} P_{i} (θ_{q B}) - \sum_{i = 1}^{n} P_{i}^{*} (θ_{q B})] \sum_{i = 1}^{n} \frac{\partial P_{i}^{*} (θ_{q B})}{\partial B} .

(A10)

And if $β_{k} = - \sum_{s = 1}^{k - 1} β_{s}$ , the partial derivatives of $f_{T} (A, β)$ with respect to $β_{s}$ ( $s = 1, ..., k - 1$ ) are given by

\frac{\partial f_{T}}{\partial β_{s}} = - \frac{2}{N} \sum_{q = 1}^{N} [\sum_{i = 1}^{n} P_{i} (θ_{q B}) - \sum_{i = 1}^{n} P_{i}^{*} (θ_{q B})] \left\{ \sum_{i \in s} \frac{\partial P_{i}^{*} (θ_{q B})}{\partial β_{s}} - \sum_{i \in k} \frac{\partial P_{i}^{*} (θ_{q B})}{\partial β_{k}} \right\} .

(A11)

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Seonghoon Kim https://orcid.org/0000-0002-0357-8639

References

Bock R. D., Zimowski M. F. (1997). Multiple group IRT. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 433–448). Springer. https://doi.org/10.1007/978-1-4757-2691-6_25
Bradlow E. T., Wainer H., Wang X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168. https://doi.org/10.1007/bf02294533
Cai L. (2017). flexMIRT: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Vector Psychometric Group.
Davey T., Oshima T. C., Lee K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20(4), 405–416. https://doi.org/10.1177/014662169602000407
DeMars C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145–168. https://doi.org/10.1111/j.1745-3984.2006.00010.x
Dennis J. E., Schnabel R. B. (1996). Numerical methods for unconstrained optimization and nonlinear equations. Society for Industrial and Applied Mathematics.
Divgi D. R. (1985). A minimum chi-square method for developing a common metric in item response theory. Applied Psychological Measurement, 9(4), 413–415. https://doi.org/10.1177/014662168500900410
Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423–436. https://doi.org/10.1007/bf02295430
Glas C. A. W., Wainer H., Bradlow E. T. (2000). Maximum marginal likelihood and expected a posteriori estimation in testlet-based adaptive testing. In van der Linden W. J., Glas C. A. W. (Eds.), Computerized adaptive testing: Theory and practice (pp. 271–287). Kluwer Academic Publishers. https://doi.org/10.1007/0-306-47531-6_14
Haebara T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149. https://doi.org/10.4992/psycholres1954.22.144
Kim S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 353–381. https://doi.org/10.1111/j.1745-3984.2006.00021.x
Kim S. (2019). Common-item linking methods for the bi-factor three parameter model in MIRT. Journal of Educational Evaluation, 32(1), 27–52. https://doi.org/10.31158/jeev.2019.32.1.27
Kim S., Lee W.-C. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43(1), 53–76. https://doi.org/10.1111/j.1745-3984.2006.00004.x
Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
Li Y., Bolt D. M., Fu J. (2005). A testlet characteristic curve linking method for the testlet model. Applied Psychological Measurement, 29(5), 340–356. https://doi.org/10.1177/0146621605276678
Li Y., Bolt D. M., Fu J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3–21. https://doi.org/10.1177/0146621605275414
Lord F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
Loyd B. H., Hoover H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179–193. https://doi.org/10.1111/j.1745-3984.1980.tb00825.x
Marco G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139–160. https://doi.org/10.1111/j.1745-3984.1977.tb00033.x
Mebane W. R., Sekhon J. S. (2011). Genetic optimization using derivatives: The rgenoud package for R. Journal of Statistical Software, 42(11), 1–26. https://doi.org/10.18637/jss.v042.i11
Oshima T. C., Davey T. C., Lee K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37(4), 357–373. https://doi.org/10.1111/j.1745-3984.2000.tb01092.x
R Development Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Rijmen F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3), 361–372. https://doi.org/10.1111/j.1745-3984.2010.00118.x
Sekhon J. S., Mebane W. R. (1998). Genetic optimization using derivatives. Political Analysis, 7, 187–210. https://doi.org/10.1093/pan/7.1.187
Stocking M. L., Lord F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210. https://doi.org/10.1177/014662168300700208
Wainer H., Bradlow E. T., Wang X. (2007). Testlet response theory and its applications. Cambridge University Press.
Wainer H., Kiely G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185–201. https://doi.org/10.1111/j.1745-3984.1987.tb00274.x
Wainer H., Wang X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203–220. https://doi.org/10.1111/j.1745-3984.2000.tb01083.x
Yen W. M., Fitzpatrick A. R. (2006). Item response theory. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 111–153). American Council on Education and Praeger.
