Abstract
In their 2005 paper, Li and her colleagues proposed a test response function (TRF) linking method for a two-parameter testlet model and used a genetic algorithm to find minimization solutions for the linking coefficients. In the present paper the linking task for a three-parameter testlet model is formulated from the perspective of bi-factor modeling, and three linking methods for the model are presented: the TRF, mean/least squares (MLS), and item response function (IRF) methods. Simulations are conducted to compare the TRF method using a genetic algorithm with the TRF and IRF methods using a quasi-Newton algorithm and the MLS method. The results indicate that the IRF, MLS, and TRF methods perform very well, well, and poorly, respectively, in estimating the linking coefficients associated with testlet effects, that the use of genetic algorithms offers little improvement to the TRF method, and that the minimization function for the TRF method is not as well-structured as that for the IRF method.
Keywords: scale linking methods, testlet model, item response theory
Educational test forms are often constructed using clusters of items based on a common stimulus or content area. For example, test items may be grouped around a reading passage, scenario, chart, or section associated with particular content. Wainer and Kiely (1987) called such a group of items a testlet and adopted it as a construction unit in computerized adaptive testing. From the perspective of item response theory (IRT; Lord, 1980; Yen & Fitzpatrick, 2006), the assumption of local independence among items nested within a testlet, given the primary latent trait, would be violated to some extent because the responses of examinees to the items might be affected by the testlet effect as well as the primary factor. An efficient way to deal with local dependence is to use the testlet model (Wainer et al., 2007), in which a secondary, random-effect factor is added to the primary factor. Researchers (e.g., DeMars, 2006; Li et al., 2006; Rijmen, 2010) have shown that the testlet model is a constrained version of the bi-factor model (Gibbons & Hedeker, 1992).
Like other IRT models, the testlet model has a model identification problem, specifically a scale indeterminacy problem, because the item parameters and person parameters are invariant within a linear transformation of the latent trait scale. In practice, scale indeterminacy typically is solved by choosing a scale such that the mean and standard deviation (SD) of the person parameters are arbitrarily set to certain values (e.g., 0 and 1) for the examinee group being analyzed (Rijmen, 2010). According to that convention, the latent scales obtained from separate calibrations of sample data from different populations are not likely to be equivalent, but they are assumed to be linearly related. This non-equivalency creates the need for a common scale, which can be developed through scale linking (or scale transformation), in which one scale is linked to another (base) scale with a linear function.
This paper is primarily concerned with the methods used to estimate the linking parameters for the testlet model under the common-item nonequivalent groups (CING) design (Kolen & Brennan, 2014). Many linking methods have been presented for use with traditional dichotomous IRT models such as the two-parameter logistic and three-parameter logistic (3PL) models (e.g., Divgi, 1985; Haebara, 1980; Loyd & Hoover, 1980; Marco, 1977; Stocking & Lord, 1983), and they have been extended to polytomous models (Kim & Lee, 2006). Most relevant to the present paper, Kim (2019) presented three linking methods for the 3PL bi-factor model, the direct least squares (DLS), item response function (IRF), and test response function (TRF) methods, which are bi-factor extensions of Divgi’s (1985), Haebara’s (1980), and Stocking and Lord’s (1983) approaches, respectively. Kim (2019) showed through simulations that the IRF, DLS, and TRF methods differed little in estimating the slope (dilation) linking coefficients, but they exhibited substantial differences in estimating the intercept (translation) linking coefficients, with the IRF method being the most accurate and the TRF method being the least accurate. However, in the IRT literature, only the TRF method has been formally extended for use with the testlet model. That extension is found in Li et al. (2005), who presented the TRF method for a two-parameter normal ogive (2PNO) testlet model. In this paper, Li et al.’s TRF method is presented under the 3PL testlet model since this general model is more widely used than the 2PNO testlet model in practice.
Questions and Purposes
As described in detail later, Li et al. (2005) formulated the linking task under the 2PNO testlet model such that, given $K$ common testlets, the linking parameters should include the means (denoted by $\mu_k$) of the testlet effect factors $\gamma_k$, $k = 1, \ldots, K$, with the constraint $\sum_{k=1}^{K} \mu_k = 0$, in addition to the linking coefficients $A$ and $B$ for the primary factor $\theta$. The criterion function (also known as the loss function) for the TRF method is nonlinear with respect to the linking parameters, and thus a multivariate search technique such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, one of the quasi-Newton methods (Dennis & Schnabel, 1996), should be implemented to estimate the linking parameters. Li et al. (2005) combined the GENOUD genetic algorithm (Sekhon & Mebane, 1998; see also Mebane & Sekhon, 2011) with the BFGS algorithm. Li and her colleagues used the genetic algorithm because they were concerned that if there were three or more linking parameters to be estimated, the criterion function might have multiple minima or saddle points and the BFGS method might fail to find, or to converge to, the global minimum. However, the GENOUD genetic algorithm is very computationally intensive and time-consuming (taking 25 or more minutes for a linking task, as reported by Li et al.).
The present study was motivated by some related questions regarding Li et al.’s (2005) approach to the linking solutions for the TRF method. The first question is “Is it necessary to use genetic algorithms to find the linking solutions for the TRF method?” This question is important, because previous studies into multidimensional IRT linking (e.g., Davey et al., 1996; Oshima et al., 2000) that considered six or more linking parameters in a rotation matrix and translation vector have not reported any problem in finding the linking solutions using a modified version of the Newton method. If the genetic method is not substantially superior to the BFGS method, there would be no compelling reason to use it in practice. The second and third questions, which are closely related to each other, are “Are linking methods for the testlet model other than the TRF method available?” and “How do the different linking methods for the testlet model compare in their performance?” These questions are also important, because the availability of different methods for scale linking allows practitioners to choose the appropriate method depending on the situation. The choice of a linking method can be made more wisely if more information about the relative performance of different methods is given to the practitioners. Even if one method is operationally used, other methods should still be implemented for diagnostic purposes (Kolen & Brennan, 2014).
The primary purposes of this paper are two-fold. One is to answer the first question posed above regarding Li et al.’s (2005) TRF method. The other is to present the mean/least squares (MLS) and IRF methods for the testlet model and investigate their linking accuracy relative to the TRF method. To achieve these purposes, we first present the 3PL testlet model (instead of the 2PNO model) in the next section for generality and recast the linking task formulated by Li et al. (2005) as a special case of linking under bi-factor modeling. Next we use the reformulated linking framework to present the MLS, TRF, and IRF methods and conduct a simulation study to compare the accuracy of these methods.
Linking Methods for the 3PL Testlet Model
The IRT literature contains several versions of the testlet model for dichotomous items that differ slightly in parameterization (e.g., Bradlow et al., 1999; Glas et al., 2000; Wainer & Wang, 2000). For the purposes of this paper, we use the parameterization of Glas et al. (2000) to write a 3PL testlet model that defines the probability that examinee $j$ will answer item $i$ correctly as
$$P_i(\theta_j, \gamma_{jd(i)}) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-D a_i(\theta_j - b_i + \gamma_{jd(i)})]} \tag{1}$$
where $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and lower asymptote parameters for item $i$, respectively; $D$ is a scaling constant (usually set to 1 or 1.7); $\theta_j$ is the primary trait (ability) parameter of examinee $j$; and $\gamma_{jd(i)}$ is a random-effect parameter (assumed to be independent of $\theta_j$) for examinee $j$ of testlet $d(i)$, the testlet to which item $i$ belongs. Equation (1), the IRF for the 3PL testlet model, can be viewed as a special case of the 3PL bi-factor model, written as
$$P_i(\theta_{0j}, \theta_{kj}) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-D(a_i\theta_{0j} + a_i\sigma_k\theta_{kj} + d_i)]} \tag{2}$$
where $a_i$, $c_i$, and $D$ are the same as in Equation (1); $d_i = -a_i b_i$ is the intercept parameter; $\theta_{0j}$ is the parameter for examinee $j$ of the primary factor (i.e., $\theta_{0j} = \theta_j$); $\theta_{kj}$ is the parameter of the specific factor of testlet $k = d(i)$ with the relationship $\theta_{kj} = \gamma_{jk}/\sigma_k$; and $\sigma_k$ is a proportionality constant across all items nested within testlet $k$.
Whether expressed as Equation (1) or (2), the 3PL testlet model cannot be identified unless some restrictions are imposed on the parameters. For Equation (1), the mean and SD of $\theta$ are typically fixed to 0 and 1, respectively, and the means of the $\gamma_k$s are fixed to 0, with each of their SDs, $\sigma_k$, being free parameters. For Equation (2), the mean and SD of $\theta_0$ and $\theta_k$ are fixed to 0 and 1, respectively, and the $\sigma_k$ are considered the free parameters to be estimated. In other words, a standardized scale (0–1 scale) is independently used for each dimension to remove model indeterminacy. Throughout this paper, we assume that the 3PL testlet model is identified by 0–1 scaling, and its parameters are estimated using sample data.
The Linking Parameters Estimated
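Consider two examinee groups, a base group and a new group, that can differ in each dimension. Assume that an identical test, consisting of $K$ testlets, has been administered to both groups and that for each group, separate calibration has been conducted using 0–1 scaling to estimate all item parameters, including $\sigma_k$ (dropping the nested subscript for simplicity). Furthermore, define the 0–1 scales from the base and new groups as $\boldsymbol{\theta}^*$ and $\boldsymbol{\theta}$, respectively, where $\boldsymbol{\theta}^* = (\theta_0^*, \theta_1^*, \ldots, \theta_K^*)$ and $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_K)$. Use $a_i^*$, $b_i^*$, $c_i^*$, $d_i^*$, and $\sigma_k^*$ to denote the item/testlet parameters estimated on the $\boldsymbol{\theta}^*$ scale, and use $a_i$, $b_i$, $c_i$, $d_i$, and $\sigma_k$ to denote the counterparts on the $\boldsymbol{\theta}$ scale.
By the “within a linear transformation” invariance property of IRT, the $\boldsymbol{\theta}^*$ and $\boldsymbol{\theta}$ scales are linearly related as follows (Kim, 2019):
$$\theta_0^* = A\theta_0 + B \tag{3}$$
$$\theta_k^* = \lambda_k\theta_k + \beta_k, \quad k = 1, \ldots, K \tag{4}$$
where $A$ and $B$ are the linking coefficients for the $\theta_0$ dimension and $\lambda_k$ and $\beta_k$ are the linking coefficients for the $\theta_k$ dimension. The slopes, $A$ and $\lambda_k$, adjust for unit differences between the new and base scales, and the translation intercepts, $B$ and $\beta_k$, adjust for location differences. If scale linking is perfect, the two sets of item/testlet parameter estimates from separate calibrations should be related as follows:
$$a_i^* = \frac{a_i}{A} \tag{5}$$
$$\sigma_k^* = \frac{A}{\lambda_k}\,\sigma_k \tag{6}$$
$$d_i^* = d_i - \frac{a_i}{A}B - \frac{a_i\sigma_k}{\lambda_k}\beta_k \tag{7a}$$
$$d_i^* = d_i - a_i^* B - a_i^*\sigma_k^*\beta_k \tag{7b}$$
$$b_i^* = A b_i + B + \sigma_k^*\beta_k \tag{8}$$
$$c_i^* = c_i \tag{9}$$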
However, Equations (5) through (9) do not perfectly hold among estimated item/testlet parameters because of sampling errors and possible model-data misfit. In general, linking errors are unavoidable with sample data, and the linking coefficients should be properly estimated so as to minimize the errors (Kim & Lee, 2006; Kolen & Brennan, 2014).
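The invariance that motivates these relations can be illustrated numerically. The sketch below assumes the linear relations $\theta_0^* = A\theta_0 + B$ and $\theta_k^* = \lambda_k\theta_k + \beta_k$ together with a slope-intercept bi-factor form whose testlet loading is $a_i\sigma_k$. Under those assumptions, transforming the item/testlet parameters and the person parameters together leaves the response probability unchanged. All names and values are hypothetical.

```python
import math

def bifactor_irf(theta0, theta_k, a, d, c, sigma_k, D=1.7):
    """3PL bi-factor IRF with testlet loading a * sigma_k (assumed form)."""
    z = D * (a * theta0 + a * sigma_k * theta_k + d)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def to_base_scale(a, d, c, sigma_k, A, B, lam_k, beta_k):
    """Transform new-scale item/testlet parameters to the base scale."""
    a_star = a / A                      # discrimination is rescaled by the slope
    sigma_star = A * sigma_k / lam_k    # testlet SD is rescaled by A / lambda_k
    d_star = d - a_star * B - a_star * sigma_star * beta_k
    return a_star, d_star, c, sigma_star

# Hypothetical linking coefficients and parameters, for illustration only.
A, B, lam_k, beta_k = 1.25, 0.37, 0.9, -0.36
a, d, c, sigma_k = 1.1, -0.4, 0.15, 0.7
theta0, theta_k = 0.8, -1.3

a_s, d_s, c_s, sig_s = to_base_scale(a, d, c, sigma_k, A, B, lam_k, beta_k)
p_new = bifactor_irf(theta0, theta_k, a, d, c, sigma_k)
p_base = bifactor_irf(A * theta0 + B, lam_k * theta_k + beta_k, a_s, d_s, c_s, sig_s)

# Difficulty implied on the base scale by the transformed intercept.
b_star = -d_s / a_s
```

With estimated rather than true parameters the two probabilities no longer coincide exactly, which is why the linking coefficients have to be estimated by minimizing some measure of discrepancy.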
The descriptions above might be read as if the linking methods for the 3PL testlet model should estimate $2(K+1)$ linking coefficients ($A$, $\lambda_1$ to $\lambda_K$ and $B$, $\beta_1$ to $\beta_K$), but that is not the case. As can be seen from Equation (6), each $\lambda_k$ ($k = 1, \ldots, K$) is a function of $A$, $\sigma_k$, and $\sigma_k^*$, so if $A$ is estimated and the two constants are given, its value is determined. Thus, the lambda coefficients are not considered linking parameters to be estimated. For the beta coefficients, only $K-1$ of them can be uniquely estimated due to the linear dependence among them, such as $\sum_{k=1}^{K}\beta_k = 0$. Note that the linear dependence agrees with the zero-sum constraint, $\sum_{k=1}^{K}\mu_k = 0$, that Li et al. (2005) imposed on the testlet-effect means. Therefore, the three linking methods for the testlet model described below estimate $K+1$ “free” parameters, the $A$, $B$, and $\beta_1, \ldots, \beta_{K-1}$ coefficients. Because the meaning of the linear dependence among the $\beta_k$ values can be clearly revealed in presenting the MLS method, and the TRF and IRF methods are akin to each other, we present the MLS method first and then present the two response function methods. For the following presentation, it is assumed that a common testlet $k$ contains $n_k$ items and the total number of items in the $K$ common testlets for linking is $n = \sum_{k=1}^{K} n_k$.
MLS Method
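The MLS method presented here is a hybrid in that it uses part of the mean/mean method (Loyd & Hoover, 1980) to estimate the slope $A$ and then uses the linear least squares approach to estimate the intercepts, $B$ and $\beta_k$. Unlike the TRF and IRF methods, the MLS method can estimate the linking coefficients without an iterative search for the solutions.
Taking the mean over a-parameters based on Equation (5) and solving the resulting equation for $A$ leads to a legitimate statistical solution for $A$ (Loyd & Hoover, 1980):
$$\hat{A} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} a_i^*} \tag{10}$$
where $a_i$ and $a_i^*$ represent all the discrimination parameters estimated on the new and base scales, respectively. Once the $A$ coefficient is estimated by Equation (10), the $B$ and beta coefficients need to be simultaneously estimated because, as seen from Equation (7) or (8), they are related to each other in an equation.
Based on Equation (7), let $y_i = d_i - d_i^*$ and $\hat{y}_i = a_i^* B + a_i^*\sigma_k^*\beta_k$. According to the statistical approaches used in Divgi (1985) and Oshima et al. (2000), the $B$ and beta coefficients can be estimated as the values that minimize the sum of squared differences ($\sum_{i}(y_i - \hat{y}_i)^2$) between $y_i$ and $\hat{y}_i$ for all $i$. To obtain the solutions for the intercept coefficients using the least squares method, we first write an error model, based on Equation (7b), as
$$\mathbf{d} = \mathbf{d}^* + \mathbf{F}\mathbf{S}\boldsymbol{\beta} + \mathbf{e} \tag{11}$$
where $\mathbf{d} = (d_1, \ldots, d_n)'$ and $\mathbf{d}^* = (d_1^*, \ldots, d_n^*)'$ are vectors of d-parameter estimates; $\mathbf{e}$ is an error vector; $\mathbf{F}$ is an $n \times (K+1)$ matrix whose row elements are “factor loadings,” $a_i^*$s, associated with the $(K+1)$-dimensional space ($\theta_0^*$, $\theta_1^*$,…, $\theta_K^*$); $\mathbf{S}$ is a diagonal matrix whose diagonal elements are 1, $\sigma_1^*$,…, $\sigma_K^*$; and $\boldsymbol{\beta} = (B, \beta_1, \ldots, \beta_K)'$ is a vector of size $K+1$. For instance, if there are two testlets and two items within each testlet, $\mathbf{F}$, $\mathbf{S}$, and $\boldsymbol{\beta}$ are expressed as
$$\mathbf{F} = \begin{pmatrix} a_1^* & a_1^* & 0 \\ a_2^* & a_2^* & 0 \\ a_3^* & 0 & a_3^* \\ a_4^* & 0 & a_4^* \end{pmatrix}, \quad \mathbf{S} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \sigma_1^* & 0 \\ 0 & 0 & \sigma_2^* \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} B \\ \beta_1 \\ \beta_2 \end{pmatrix} \tag{12}$$
Although the error model in Equation (11) resembles a regression model, where the dependent variable is $\mathbf{d} - \mathbf{d}^*$ and the coefficient vector is $\boldsymbol{\beta}$, the solutions of $\boldsymbol{\beta}$ cannot be computed using the ordinary least squares approach because the factor pattern matrix $\mathbf{F}$ is not of full column rank (as shown by the example matrix in Equation (12), whose first column equals the sum of the remaining columns). With the condition $n \geq K+1$, usually met in practice, the rank of $\mathbf{F}$ is $K$, not $K+1$. Such rank deficiency implies that except for the $B$ coefficient, the $K$ beta coefficients in $\boldsymbol{\beta}$ are linearly dependent, and only $K-1$ of them need to be estimated. Although the linear dependence among the $\beta_k$ can be formulated in many ways, we here choose the constraint $\sum_{k=1}^{K}\beta_k = 0$, corresponding to the constraint $\sum_{k=1}^{K}\mu_k = 0$ used in Li et al. (2005). In addition, we introduce a transformation matrix $\mathbf{T}$, which relates $\boldsymbol{\beta}$ to an estimation vector $\boldsymbol{\beta}_f = (B, \beta_1, \ldots, \beta_{K-1})'$ such that
$$\boldsymbol{\beta} = \mathbf{T}\boldsymbol{\beta}_f \tag{13}$$
where $\mathbf{T}$ is a $(K+1) \times K$ matrix, and $\beta_K = -\sum_{k=1}^{K-1}\beta_k$. Now the $K$ free coefficients in $\boldsymbol{\beta}_f$ can be estimated using the least squares method, and the solution formula can be derived as
$$\hat{\boldsymbol{\beta}}_f = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{d} - \mathbf{d}^*), \quad \mathbf{X} = \mathbf{F}\mathbf{S}\mathbf{T} \tag{14}$$
Finally, the MLS solutions for the $B$ and $\beta_k$ coefficients in $\boldsymbol{\beta}$ are obtained by plugging the resulting $\hat{\boldsymbol{\beta}}_f$ into Equation (13). Note that the $\beta_k$ estimates surely satisfy the zero-sum constraint due to the use of the $\mathbf{T}$ matrix.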
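The MLS computation can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: it assumes the slope is the ratio of summed new-form to base-form discriminations, that the regression is on $d_i - d_i^*$ with loadings taken from the base-scale estimates, and that the zero-sum constraint is imposed by expressing the last beta coefficient as the negative sum of the others. The function names, the hand-written solver, and the data are all hypothetical.

```python
def solve_linear(M, v):
    """Solve M x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[r]] for r, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def mls_link(a_new, d_new, a_base, d_base, sigma_base, testlet):
    """Return (A, B, betas) for K >= 2 common testlets (testlet indices start at 0)."""
    K = len(sigma_base)
    A_hat = sum(a_new) / sum(a_base)          # mean/mean slope
    X, y = [], []
    for i, k in enumerate(testlet):
        row = [a_base[i]] + [0.0] * (K - 1)   # columns: B, beta_1 .. beta_{K-1}
        load = a_base[i] * sigma_base[k]
        if k < K - 1:
            row[1 + k] = load
        else:                                 # last testlet: beta_K = -(beta_1 + ...)
            for m in range(1, K):
                row[m] = -load
        X.append(row)
        y.append(d_new[i] - d_base[i])
    XtX = [[sum(r[p] * r[q] for r in X) for q in range(K)] for p in range(K)]
    Xty = [sum(r[p] * yi for r, yi in zip(X, y)) for p in range(K)]
    free = solve_linear(XtX, Xty)             # normal equations
    B_hat, betas = free[0], free[1:]
    betas.append(-sum(betas))                 # recover the constrained coefficient
    return A_hat, B_hat, betas

# Hypothetical error-free example: two testlets with two items each.
testlet = [0, 0, 1, 1]
a_new = [1.0, 1.5, 0.8, 1.2]
d_new = [0.2, -0.5, 0.3, -0.1]
sigma_base = [0.7, 1.0]
A_true, B_true, beta_true = 1.25, 0.3, [0.2, -0.2]
a_base = [a / A_true for a in a_new]
d_base = [d - ab * B_true - ab * sigma_base[k] * beta_true[k]
          for d, ab, k in zip(d_new, a_base, testlet)]

A_hat, B_hat, beta_hat = mls_link(a_new, d_new, a_base, d_base, sigma_base, testlet)
```

Because the example data are generated without error, the sketch recovers the generating coefficients; with estimated parameters the same steps give least squares estimates instead.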
TRF Method
For the traditional 3PL model, the TRF at a given $\theta$ is defined as the sum of IRFs over all the items on the test, written as $T(\theta) = \sum_i P_i(\theta)$. $T(\theta)$ is the true score for an examinee with ability $\theta$. Conceptually, the analog of $T(\theta)$ for the 3PL testlet model can be defined as the sum of the marginalized IRFs, each of which is computed by integrating the nuisance dimension $\gamma_k$ (or $\theta_k$) out from the IRF in Equation (1) (or 2). In accordance with this conception, let $P_i^*(\theta)$ and $T^*(\theta)$ denote the marginalized IRF and TRF computed with the item/testlet parameters estimated on the base scale, respectively, and let $\tilde{P}_i(\theta)$ and $\tilde{T}(\theta)$ denote the marginalized IRF and TRF computed with the new-form parameter estimates transformed to the base scale. The TRF method finds the solutions of $A$, $B$, and $\beta_k$ that minimize the criterion function, $Q_{\mathrm{TRF}}$,
$$Q_{\mathrm{TRF}} = \sum_{g=1}^{G}\left[T^*(\theta_g) - \tilde{T}(\theta_g)\right]^2 \tag{15}$$
where $g = 1, 2, \ldots, G$ indexes arbitrary points over the base scale.
Although the marginal IRF and TRF can be straightforwardly computed for each pair of $\theta_0$ and $\theta_k$, Li et al. (2005) used a new composite variable, $\xi_k = \theta + \gamma_k$ (more explicitly, $\xi_{jk} = \theta_j + \gamma_{jk}$), for linking purposes. They noted that, assuming $\theta$ and $\gamma_k$ have independent normal distributions with zero means and variances equal to 1 and $\sigma_k^2$, respectively, $\xi_k$ given $\theta$ is distributed as $N(\theta, \sigma_k^2)$. With $\xi_k$, the 3PL testlet model in Equation (1) can be expressed as
$$P_i(\xi_{jk}) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-D a_i(\xi_{jk} - b_i)]} \tag{16}$$
Then, the probability of answering item $i$ within testlet $k$ correctly, conditional on $\theta$ and $\sigma_k$, that is, the marginalized $P_i(\theta)$, is expressed as
$$P_i(\theta) = \int P_i(\xi_k)\, f(\xi_k \mid \theta)\, d\xi_k \tag{17}$$
where $f(\xi_k \mid \theta)$ is the probability density function of $\xi_k$ given $\theta$. The integral in Equation (17) can be approximated to any desired degree of accuracy by using Gauss–Hermite quadrature.
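A marginalized IRF of this kind can be approximated with a short quadrature routine. The sketch below assumes the composite variable is normally distributed around $\theta$ with SD $\sigma_k$ and uses the substitution $\xi = \theta + \sqrt{2}\,\sigma_k x$ with a 5-point Gauss–Hermite rule; five points are chosen only to keep the example small, and an operational analysis would likely use more. The function names and parameter values are hypothetical.

```python
import math

# 5-point Gauss-Hermite rule for the weight function exp(-x^2),
# built from its closed-form nodes and weights.
_inner = math.sqrt((5.0 - math.sqrt(10.0)) / 2.0)
_outer = math.sqrt((5.0 + math.sqrt(10.0)) / 2.0)
_w_inner = math.sqrt(math.pi) * (7.0 + 2.0 * math.sqrt(10.0)) / 60.0
_w_outer = math.sqrt(math.pi) * (7.0 - 2.0 * math.sqrt(10.0)) / 60.0
_w_center = 8.0 * math.sqrt(math.pi) / 15.0

GH_NODES = [-_outer, -_inner, 0.0, _inner, _outer]
GH_WEIGHTS = [_w_outer, _w_inner, _w_center, _w_inner, _w_outer]

def irf_3pl(xi, a, b, c, D=1.7):
    """3PL response function evaluated at the composite variable xi."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (xi - b)))

def marginal_irf(theta, a, b, c, sigma_k, D=1.7):
    """Approximate E[P(xi)] for xi ~ N(theta, sigma_k^2) by Gauss-Hermite quadrature."""
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        xi = theta + math.sqrt(2.0) * sigma_k * x   # change of variable
        total += w * irf_3pl(xi, a, b, c, D)
    return total / math.sqrt(math.pi)

# Hypothetical item: with no guessing, the marginal IRF at theta = b is 0.5 by symmetry.
p_mid = marginal_irf(0.5, a=1.2, b=0.5, c=0.0, sigma_k=0.7)
p_no_testlet = marginal_irf(0.3, a=1.2, b=0.5, c=0.2, sigma_k=0.0)
```

Setting `sigma_k` to zero collapses every node onto `theta`, so the routine then returns the ordinary 3PL probability.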
Let $\theta^*$ and $\theta$ denote the scales defined with the base and new groups, respectively. If the two scales are related as $\theta^* = A\theta + B$, the item parameters on the $\theta$ scale can be transformed into those on the $\theta^*$ scale as follows (Li et al., 2005):
$$a_i^* = \frac{a_i}{A} \tag{18}$$
$$b_i^* = A b_i + B \tag{19}$$
Although both transformations are legitimate in a technical sense, the transformation of $b_i$ by Equation (19) is insufficient for linking purposes because $\theta^* = A\theta + B$ takes into account possible mean and SD differences in $\theta$ between the base and new groups but not possible mean differences in $\gamma_k$ between the two groups. Li et al. (2005) pointed out that if separate calibration results were obtained using the model in Equation (16), possible differences in the mean of $\gamma_k$ between the base and new groups would be absorbed into $b_i$, which would lead to a shift in $b_i$. Therefore, they used the following transformation to account for that possible shift:
$$b_i^* = A b_i + B + \mu_k \tag{20}$$
They further indicated that the zero-sum constraint ($\sum_{k=1}^{K}\mu_k = 0$) should be imposed for model identification, although they did not detail why the model needed that constraint. Note that based on Equation (8), the $b_i^*$ in Equation (20) can also be written as
$$b_i^* = A b_i + B + \sigma_k^*\beta_k \tag{21}$$
where $\mu_k = \sigma_k^*\beta_k$. Of course, the constraint is necessary for the reason revealed when the MLS method was addressed above.
The criterion function in Equation (15), defined with the two sets, {$a_i^*$, $b_i^*$, $c_i^*$, $\sigma_k^*$} and {$a_i$, $b_i$, $c_i$, $\sigma_k$}, for the $n$ common items associated with $K$ testlets, is nonlinear with respect to the linking coefficients, $A$, $B$, and $\beta_k$, where the zero-sum constraint can be dealt with in practice by setting $\beta_K = -\sum_{k=1}^{K-1}\beta_k$. Thus a multivariate search technique is required to find the linking solutions for the TRF method. Previous linking studies (e.g., Kim & Lee, 2006; Oshima et al., 2000) suggest that the minimization solutions can be obtained by using a modified Newton or quasi-Newton approach such as the BFGS method. However, Li et al. (2005) combined the GENOUD algorithm (Sekhon & Mebane, 1998) with the BFGS method to ensure that the global, not the local, minimum solutions are obtained. All of the search techniques are based on the vector of partial derivatives (i.e., gradient) of the criterion function with respect to the parameters. The analytic formulas for the gradient of $Q_{\mathrm{TRF}}$ with respect to the linking coefficients are presented in the Appendix.
IRF Method
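Given the marginalized IRFs $P_i^*(\theta)$ and $\tilde{P}_i(\theta)$ for all common items (computed, respectively, from the base-scale estimates and from the new-form estimates transformed to the base scale), the IRF linking method (Haebara, 1980) for the traditional 3PL model can be straightforwardly extended to the 3PL testlet model. Similarly to the TRF method, the IRF method finds the solutions of $A$, $B$, and $\beta_k$ that minimize the criterion function, $Q_{\mathrm{IRF}}$,
$$Q_{\mathrm{IRF}} = \sum_{g=1}^{G}\sum_{i=1}^{n}\left[P_i^*(\theta_g) - \tilde{P}_i(\theta_g)\right]^2 \tag{22}$$
where, as denoted earlier, $n$ is the number of common items and $g = 1, 2, \ldots, G$ indexes arbitrary points over the base scale. Although the marginalized IRFs $P_i^*(\theta)$ or $\tilde{P}_i(\theta)$, generally denoted $P_i(\theta)$, can be evaluated by Equation (17), they can also be computed using the bi-factor model in Equation (2) as follows:
$$P_i(\theta_0) = \int P_i(\theta_0, \theta_k)\,\phi(\theta_k)\, d\theta_k \tag{23}$$
where $\phi(\theta_k)$ is the probability density function of $\theta_k$. Of course, in that case, the $P_i^*$ and $\tilde{P}_i$ in Equation (22) are the probabilities evaluated at $\theta_{0g}$ with the parameter sets {$a_i^*$, $d_i^*$, $c_i^*$, $\sigma_k^*$} and {$\tilde{a}_i$, $\tilde{d}_i$, $\tilde{c}_i$, $\tilde{\sigma}_k$}, respectively.
The criterion function $Q_{\mathrm{IRF}}$ is nonlinear, as is $Q_{\mathrm{TRF}}$, with respect to the linking coefficients, and thus a search technique is required to find the linking solutions. In this paper, we use the BFGS algorithm to find the linking solutions for the IRF method. The analytic formulas for the gradient of $Q_{\mathrm{IRF}}$ are presented in the Appendix.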
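The two criterion functions differ only in where the squaring happens, which a small example makes concrete. The sketch below assumes unweighted sums over grid points and takes already-marginalized probabilities as input (one row per grid point, one column per common item). The probability values are made up, and the example is an illustration of the definitions rather than the authors' analysis.

```python
def q_irf(P_base, P_trans):
    """Item-level criterion: squared differences summed over points and items."""
    return sum((pb - pt) ** 2
               for row_b, row_t in zip(P_base, P_trans)
               for pb, pt in zip(row_b, row_t))

def q_trf(P_base, P_trans):
    """Test-level criterion: items are summed first, then differences are squared."""
    return sum((sum(row_b) - sum(row_t)) ** 2
               for row_b, row_t in zip(P_base, P_trans))

# Evaluation grid of 41 equally spaced points from -4 to 4.
theta_grid = [-4.0 + 0.2 * g for g in range(41)]

# Made-up marginal probabilities at two grid points for two common items.
P_base = [[0.6, 0.4], [0.8, 0.7]]
P_trans = [[0.5, 0.5], [0.7, 0.8]]

value_irf = q_irf(P_base, P_trans)   # item-level mismatches are all counted
value_trf = q_trf(P_base, P_trans)   # opposite-signed mismatches cancel in the sum
```

In this contrived case the item-level discrepancies offset one another within each grid point, so the test-level criterion is essentially zero while the item-level criterion is not. Either function would be handed to a numerical minimizer over the linking coefficients in an actual linking run.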
Simulation Study
A simulation study was conducted to compare the performance of the TRF, MLS, and IRF methods. Two versions of the TRF method were conducted: one using the GENOUD algorithm and the other using the BFGS algorithm. The IRF method was implemented using only the BFGS algorithm. The design and methodology of this simulation study were closely matched to those used by Li et al. (2005) so that the comparison might be made under nearly the same conditions as those used in the previous study.
Design and Data
The CING design was used to evaluate the linking parameter recovery of the four methods for the 3PL testlet model: (a) the GENOUD-TRF method, (b) the BFGS-TRF method, (c) the MLS method, and (d) the IRF method based on the BFGS algorithm. As in Li et al. (2005), simulated tests and data sets were generated using different sets of item/testlet parameters and linking parameters. Each simulated test form consisted of six testlets, each of which contained 5 items, giving 30 items in total. The number of common testlets, $K$, between the two (“base” and “new”) test forms to be linked was considered as the simulation factor. Two levels of $K$ were used: $K = 2$ and $K = 4$, resulting in the two common testlets condition (Condition 1) and the four common testlets condition (Condition 2), respectively.
For each simulation condition, 10 pairs of simulated tests with 5000 examinees per form were generated, as in Li et al. (2005). For each test form, with , the parameters were generated from , the log-normal distribution with log-mean=0 and log-SD = .5; the parameters were generated from under the restriction that ; and the parameters were generated from a uniform distribution ranging from .05 to .35. For each test form, the variances of (that is, ) were set to three levels, .1 (small testlet effect), .5 (medium testlet effect), and 1 (large testlet effect), and they were assigned to three testlet pairs that were randomly matched. Note that 10 or 20 common items had the same parameters between the base and new forms to be linked.
Because the linking coefficients $A$ and $B$ reflect, respectively, the differences in the SD and mean of the primary factor $\theta_0$ between the base and new populations, and the $\beta_k$ coefficients reflect differences in the mean of the testlet effect factors $\theta_k$, the generation of linking parameters began by fixing the distributions of all factors for the base population to $N(0, 1)$. Then the slope coefficients $A$ (the SDs of $\theta_0$ for the new population) were generated from $LN(0, 0.2^2)$, and the intercept coefficients $B$ (the means of $\theta_0$ for the new population) were generated from $N(0, 0.3^2)$. Ten combinations of $A$ and $B$ values were generated, and they were applied to both Conditions 1 and 2. Note that for the first combination, the values of $A$ and $B$ were set at 1 and 0, respectively, so that it could serve as the baseline combination. For each simulation condition, the $\beta_k$ coefficients (the means of $\theta_k$ for the new population) were generated from $N(0, 0.3^2)$, subject to the constraint $\sum_{k=1}^{K}\beta_k = 0$, where the first $K-1$ beta coefficients were randomly sampled from the distribution and the last beta coefficient was set as $\beta_K = -\sum_{k=1}^{K-1}\beta_k$. Associated with the first combination of $A = 1$ and $B = 0$, all beta coefficients were set at 0. The true linking parameters, $A$, $B$, and $\beta_k$, used to generate 20 data sets (10 data sets per condition) are presented in Table 1.
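The sampling scheme for the true linking coefficients can be sketched as follows. This is a hypothetical illustration, assuming that the second arguments of the generating distributions are variances of 0.2 squared and 0.3 squared (so the SDs passed to the generators are 0.2 and 0.3). The function name and the seed are arbitrary, and the draws will not reproduce the values in Table 1.

```python
import random

def draw_linking_coefficients(K, rng):
    """Draw one set of true linking coefficients for K common testlets."""
    A = rng.lognormvariate(0.0, 0.2)          # slope: SD of the primary factor, new population
    B = rng.gauss(0.0, 0.3)                   # intercept: mean of the primary factor
    betas = [rng.gauss(0.0, 0.3) for _ in range(K - 1)]
    betas.append(-sum(betas))                 # last coefficient enforces the zero-sum constraint
    return A, B, betas

rng = random.Random(2005)                     # arbitrary seed for reproducibility
A_draw, B_draw, beta_draw = draw_linking_coefficients(4, rng)
```

Setting the last coefficient to the negative sum of the others mirrors how the constraint is described in the text and guarantees that each generated set sums to zero.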
Table 1.
True Linking Coefficients for Simulation Conditions 1 and 2.
| $A$ | $B$ | Condition 1 data set | $\beta_1$ | $\beta_2$ | Condition 2 data set | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ |
|---|---|---|---|---|---|---|---|---|---|
| 1.000 | .000 | 1 | .000 | .000 | 11 | .000 | .000 | .000 | .000 |
| .850 | −.244 | 2 | −.052 | .052 | 12 | −.202 | .006 | −.541 | .737 |
| 1.250 | .370 | 3 | −.362 | .362 | 13 | −.164 | .507 | .063 | −.406 |
| 1.013 | .051 | 4 | .224 | −.224 | 14 | −.141 | .512 | −.112 | −.259 |
| .900 | −.360 | 5 | .145 | −.145 | 15 | −.024 | .238 | −.143 | −.071 |
| 1.093 | .202 | 6 | −.017 | .017 | 16 | −.138 | −.008 | .658 | −.512 |
| 1.234 | −.180 | 7 | .132 | −.132 | 17 | .142 | −.172 | −.021 | .051 |
| .986 | .026 | 8 | −.321 | .321 | 18 | .087 | −.188 | .053 | .048 |
| .961 | .150 | 9 | .239 | −.239 | 19 | −.475 | −.247 | .155 | .567 |
| .857 | −.108 | 10 | .020 | −.020 | 20 | −.082 | .013 | .364 | −.295 |
Estimation and Evaluation
For each data set, the item and testlet parameters for the 3PL testlet model were estimated using the computer program flexMIRT (Cai, 2017). By default, flexMIRT uses 0–1 scaling for each factor to estimate item parameters, and we applied that scaling approach to the separate calibrations of base and new sample data. With the separate calibration results, the linking parameters were estimated using the statistical programming language R (R Development Core Team, 2018). Specifically, the linking solutions for the MLS method were computed using the built-in linear algebra functions. The solutions for the GENOUD-TRF method were found using the “genoud” function included in the R package rgenoud (Mebane & Sekhon, 2011). The solutions for the BFGS-TRF and IRF methods were found using the “optim” function included in the R package stats. For the TRF and IRF methods, 41 points, equally spaced from −4 to 4, were used to define their criterion functions (see Equations (15) and (22)).
For each data set in each condition, differences between the estimated and true linking parameters (i.e., estimation errors) were computed to evaluate the performance of each linking method. In addition, the means of the absolute differences across the 10 data sets in each condition were computed to summarize the estimation errors for each of the linking parameters.
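As a small worked example of this summary statistic, the sketch below recomputes two mean absolute errors from the per-data-set errors reported for the GENOUD-TRF method in Table 2 (the $B$ column and the first beta column). The function name is hypothetical. Because the tabled errors are already rounded to three decimals, a recomputed mean can differ from the published one in the last digit for some columns.

```python
def mean_absolute_error(errors):
    """Average of the absolute estimation errors across data sets."""
    return sum(abs(e) for e in errors) / len(errors)

# Estimation errors copied from Table 2 (GENOUD-TRF method, data sets 1-10).
errors_B = [.013, -.016, .056, .042, -.023, -.001, -.054, -.056, .025, .006]
errors_beta1 = [-.174, .053, .445, -.162, -.069, .063, .154, .121, -.530, .045]

mae_B = mean_absolute_error(errors_B)
mae_beta1 = mean_absolute_error(errors_beta1)
```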
Results
Results of Condition 1
The linking parameter recovery results of the GENOUD- and BFGS-TRF methods for Condition 1 (data sets 1–10), in which two common testlets were used, are presented in Table 2. The two TRF methods performed nearly equally in estimating the true linking parameters. For the $A$, $B$, and $\beta_k$ coefficients, in most cases, the estimates produced by the GENOUD-TRF method were equal to those by the BFGS-TRF method up to three decimal places. These results suggest that the use of a genetic algorithm offers little improvement to the TRF method. The recovery of the linking parameters differed by the type of linking coefficients. For most data sets, the estimation errors for $A$ and $B$ were close to zero, and the mean absolute errors of $A$ and $B$ were .021 and .029, respectively, indicating that the two TRF methods perform well in estimating the linking coefficients for the primary factor $\theta_0$. By contrast, the estimation errors for $\beta_1$ and $\beta_2$ were greater (exceeding .4 in absolute value for data sets 3 and 9), and their mean absolute errors were .182 and .182, respectively (the zero-sum constraint causes the two values to be the same). It is noteworthy that for the first baseline data set, the estimation errors for the beta coefficients (−.174 and .174) are much larger than the error for the $B$ coefficient (.013). This finding suggests that the TRF method can be poor at estimating the mean differences in testlet factors between the examinee groups being analyzed for linking.
Table 2.
Estimation Errors of the Linking Parameters from the Two TRF Methods for Condition 1.
| Data Set | $A$ | $B$ | $\beta_1$ | $\beta_2$ |
|---|---|---|---|---|
| GENOUD-TRF method | ||||
| 1 | −.019 | .013 | −.174 | .174 |
| 2 | .007 | −.016 | .053 | −.053 |
| 3 | −.018 | .056 | .445 | −.445 |
| 4 | −.045 | .042 | −.162 | .162 |
| 5 | −.007 | −.023 | −.069 | .069 |
| 6 | −.038 | −.001 | .063 | −.063 |
| 7 | .029 | −.054 | .154 | −.154 |
| 8 | −.020 | −.056 | .121 | −.121 |
| 9 | −.030 | .025 | −.530 | .530 |
| 10 | .002 | .006 | .045 | −.045 |
| Mean absolute error | .021 | .029 | .182 | .182 |
| BFGS-TRF method | ||||
| 1 | −.019 | .013 | −.174 | .174 |
| 2 | .007 | −.016 | .053 | −.053 |
| 3 | −.017 | .056 | .444 | −.444 |
| 4 | −.045 | .042 | −.163 | .163 |
| 5 | −.007 | −.023 | −.069 | .069 |
| 6 | −.038 | −.001 | .063 | −.063 |
| 7 | .029 | −.054 | .154 | −.154 |
| 8 | −.020 | −.056 | .121 | −.121 |
| 9 | −.030 | .025 | −.530 | .530 |
| 10 | .002 | .006 | .045 | −.045 |
| Mean absolute error | .021 | .029 | .182 | .182 |
The recovery results from the MLS and IRF methods for Condition 1 (data sets 1–10) are presented in Table 3. For both methods, the estimation errors of $A$ and $B$ were close to zero in most cases, as was found with the two TRF methods. The mean absolute errors for $A$ and $B$ with the MLS method were .035 and .037, respectively, and those with the IRF method were .021 and .033. For the recovery of the beta coefficients, the estimation errors for $\beta_1$ and $\beta_2$ with the MLS and IRF methods were closer to zero than those with the two TRF methods. The mean absolute error for either beta coefficient with the MLS method was .068, and that with the IRF method was .036. This finding suggests that the IRF, MLS, and TRF methods perform best, second best, and worst, respectively, in estimating the intercept linking coefficients ($\beta_k$).
Table 3.
Estimation Errors of the Linking Parameters from the MLS and IRF Methods for Condition 1.
| Data Set | $A$ | $B$ | $\beta_1$ | $\beta_2$ |
|---|---|---|---|---|
| MLS method | ||||
| 1 | .015 | −.019 | .013 | −.013 |
| 2 | .108 | −.025 | −.155 | .155 |
| 3 | .013 | −.150 | −.182 | .182 |
| 4 | .025 | .021 | .002 | −.002 |
| 5 | −.032 | −.006 | .084 | −.084 |
| 6 | .056 | −.067 | −.190 | .190 |
| 7 | .017 | −.020 | .025 | −.025 |
| 8 | −.030 | −.039 | −.018 | .018 |
| 9 | −.044 | .016 | .006 | −.006 |
| 10 | .009 | .006 | .007 | −.007 |
| Mean absolute error | .035 | .037 | .068 | .068 |
| IRF method | ||||
| 1 | .007 | .003 | −.005 | .005 |
| 2 | −.003 | −.013 | .026 | −.026 |
| 3 | .027 | −.075 | −.091 | .091 |
| 4 | −.033 | .043 | .012 | −.012 |
| 5 | −.036 | −.050 | .102 | −.102 |
| 6 | −.032 | .013 | .037 | −.037 |
| 7 | .025 | −.036 | .007 | −.007 |
| 8 | −.010 | −.069 | −.024 | .024 |
| 9 | .019 | −.031 | .046 | −.046 |
| 10 | .014 | −.001 | .013 | −.013 |
| Mean absolute error | .021 | .033 | .036 | .036 |
Results of Condition 2
The recovery results of the GENOUD- and BFGS-TRF methods in Condition 2 (data sets 11–20) are presented in Table 4, and the results of the MLS and IRF methods are presented in Table 5. As was found in the results in Condition 1, all methods produced estimation errors for $A$ and $B$ that were close to zero in most cases. The mean absolute errors for $A$ and $B$ were .022 and .046, respectively, with the GENOUD-TRF method, .023 and .045 with the BFGS-TRF method, .026 and .053 with the MLS method, and .014 and .036 with the IRF method.
Table 4.
Estimation Errors of the Linking Parameters from the Two TRF Methods for Condition 2.
| Data Set | $A$ | $B$ | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ |
|---|---|---|---|---|---|---|
| GENOUD-TRF method | ||||||
| 11 | −.067 | .006 | −.504 | .063 | −.189 | .630 |
| 12 | −.013 | −.088 | −.317 | −.621 | .563 | .376 |
| 13 | −.055 | −.021 | −.062 | −.072 | −.189 | .324 |
| 14 | .013 | −.055 | .180 | −.029 | −.245 | .094 |
| 15 | −.009 | .061 | −.304 | −.110 | .091 | .323 |
| 16 | −.027 | .040 | −.335 | −.196 | .361 | .170 |
| 17 | −.001 | −.013 | .067 | .054 | −.194 | .074 |
| 18 | .001 | .044 | −.022 | .119 | −.298 | .201 |
| 19 | −.017 | .115 | −.193 | −.097 | −.063 | .352 |
| 20 | −.018 | .020 | .102 | −.202 | −.240 | .340 |
| Mean absolute error | .022 | .046 | .209 | .156 | .243 | .288 |
| BFGS-TRF method | ||||||
| 11 | −.067 | .006 | −.500 | .057 | −.189 | .632 |
| 12 | −.010 | −.085 | −.310 | −.578 | .555 | .332 |
| 13 | −.055 | −.024 | −.088 | −.059 | −.182 | .330 |
| 14 | .014 | −.048 | .066 | −.038 | −.160 | .132 |
| 15 | −.026 | .060 | .053 | −.114 | −.101 | .162 |
| 16 | −.028 | .040 | −.317 | −.158 | .296 | .179 |
| 17 | .001 | −.012 | .054 | .047 | −.169 | .068 |
| 18 | .001 | .044 | −.027 | .119 | −.295 | .202 |
| 19 | −.017 | .115 | −.193 | −.097 | −.063 | .353 |
| 20 | −.014 | .019 | .114 | −.176 | −.244 | .306 |
| Mean absolute error | .023 | .045 | .172 | .144 | .225 | .270 |
Table 5.
Estimation Errors of the Linking Parameters from the MLS and IRF Methods for Condition 2.
| Data Set | Slope | Intercept | Beta 1 | Beta 2 | Beta 3 | Beta 4 |
|---|---|---|---|---|---|---|
| MLS method | | | | | | |
| 11 | .082 | −.093 | −.014 | .078 | −.010 | −.054 |
| 12 | .032 | −.036 | −.012 | .170 | −.167 | .010 |
| 13 | −.029 | −.063 | −.300 | .126 | .120 | .054 |
| 14 | .013 | −.052 | .005 | .094 | −.034 | −.064 |
| 15 | −.003 | .039 | .036 | .267 | −.085 | −.218 |
| 16 | −.038 | .071 | .008 | −.226 | .256 | −.038 |
| 17 | .003 | −.004 | .012 | −.068 | .012 | .043 |
| 18 | .005 | −.021 | .096 | .063 | −.017 | −.141 |
| 19 | −.020 | .139 | −.100 | −.220 | −.078 | .398 |
| 20 | −.035 | −.012 | −.142 | .029 | .209 | −.097 |
| Mean absolute error | .026 | .053 | .073 | .134 | .099 | .112 |
| IRF method | | | | | | |
| 11 | .047 | −.006 | .003 | −.009 | .044 | −.037 |
| 12 | .007 | −.051 | .086 | −.036 | −.081 | .031 |
| 13 | −.016 | .005 | −.038 | −.001 | .010 | .029 |
| 14 | −.004 | −.042 | .004 | .066 | −.018 | −.053 |
| 15 | .005 | .063 | −.066 | .167 | −.041 | −.060 |
| 16 | −.013 | .053 | −.124 | −.167 | .345 | −.054 |
| 17 | .006 | .001 | −.025 | −.041 | −.009 | .074 |
| 18 | .014 | .019 | .011 | −.034 | −.027 | .050 |
| 19 | −.011 | .113 | −.076 | −.249 | −.028 | .353 |
| 20 | .017 | −.006 | −.103 | .041 | .137 | −.075 |
| Mean absolute error | .014 | .036 | .053 | .081 | .074 | .082 |
For the recovery of the beta coefficients, the two TRF methods produced estimation errors for the four beta coefficients that deviated from zero by more than .3 in many cases. The mean absolute errors for the first to fourth beta coefficients were .209, .156, .243, and .288, respectively, with the GENOUD-TRF method and .172, .144, .225, and .270 with the BFGS-TRF method. Thus, using a genetic algorithm for scale linking can lead to worse solutions for the beta coefficients than using a quasi-Newton algorithm. In contrast, the estimation errors for the four beta coefficients with the MLS and IRF methods were closer to zero than those with the two TRF methods. The mean absolute errors for the first to fourth beta coefficients were .073, .134, .099, and .112, respectively, with the MLS method and .053, .081, .074, and .082 with the IRF method. It is noteworthy that with the baseline data set 11, the two TRF methods resulted in much larger estimation errors for the beta coefficients than for the coefficient. In sum, the IRF, MLS, and TRF methods performed best, second best, and worst, respectively, in estimating the beta coefficients.
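The mean absolute errors quoted above can be rechecked directly from the tabled values. The short sketch below does this for the first two columns of the IRF block in Table 5, which are assumed here to be the slope and intercept errors (the column labels and variable names are illustrative, not taken from the paper). Because the tabled errors are already rounded to three decimals, a recomputed mean can differ from the reported one in the last digit.

```python
def mean_abs_error(errors):
    """Mean of the absolute estimation errors across data sets."""
    return sum(abs(e) for e in errors) / len(errors)


# Errors for data sets 11-20, IRF method, copied from Table 5.
# Assumption: column 1 is the slope and column 2 is the intercept.
irf_slope_errors = [.047, .007, -.016, -.004, .005, -.013, .006, .014, -.011, .017]
irf_intercept_errors = [-.006, -.051, .005, -.042, .063, .053, .001, .019, .113, -.006]

mae_slope = mean_abs_error(irf_slope_errors)
mae_intercept = mean_abs_error(irf_intercept_errors)

print(f"slope MAE = {mae_slope:.3f}, intercept MAE = {mae_intercept:.3f}")
```

Rounded to three decimals, both values agree with the .014 and .036 reported for the IRF method.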
Discussion and Conclusions
Li et al. (2005) proposed a TRF linking method for the 2PNO testlet model and used the GENOUD genetic algorithm to find minimization solutions for the linking parameters by updating a population of solutions from generation to generation. In the present paper, we used the 3PL testlet model for generality to formulate the linking task from the perspective of bi-factor modeling and presented two alternatives (MLS and IRF) to the TRF linking method for the model. One purpose of the simulation study was to examine whether there is a compelling reason to use the genetic algorithm instead of the BFGS algorithm, a widely applied quasi-Newton method, when using the TRF method to find linking solutions. The other purpose was to compare the performance of the TRF method (based on either the GENOUD or BFGS algorithm) with that of the other linking methods, MLS and IRF.
The simulation study yielded the following main results. For the simulated linking data sets using two common testlets (Condition 1), the performance of the GENOUD-TRF method was nearly the same as that of the BFGS-TRF method in recovering the true linking parameters: the slope and intercept for the primary dimension factor and the intercepts (beta coefficients) for the testlet factors, subject to the zero-sum constraint on the beta coefficients. In Condition 2 involving four common testlets, the GENOUD-TRF method performed nearly as well as the BFGS-TRF method in estimating the slope and intercept coefficients, but it tended to estimate the beta coefficients less accurately than the BFGS-TRF method. This finding suggests that using a genetic algorithm does not lead to better solutions for the linking coefficients, particularly the beta coefficients, than using a quasi-Newton algorithm. In both simulation conditions, there was a small difference in linking accuracy among the linking methods for the slope and intercept coefficients, whereas for the beta coefficients, the methods differed substantially. In recovering the true beta coefficients, on average, the IRF method showed the least estimation error, and the TRF methods produced the largest errors, more than double the average error of the IRF method. Taken together, these results suggest that the IRF, MLS, and TRF methods perform best, second best, and worst, respectively, in estimating the linking parameters associated with testlet effects.
The poor performance of the TRF method relative to the IRF method in estimating the beta coefficients may seem somewhat unusual, but it is not a new finding, as shown by Kim (2019). To understand why the TRF method estimated the beta coefficients more poorly than the IRF method, we examined the contour plots (i.e., level curves) of the negative criterion functions for the two methods. With = and and/or fixed to their true values, we drew the contour plots with the axes of and for the data sets in Condition 1 and the axes of each of the three pairs ( and , and , and and ) for the data sets in Condition 2. As illustrated in Figure 1 (where the contour plots for data sets 1 and 11 are presented as examples), for all data sets in Condition 1, the contour plots for the TRF method had top-level curves shaped like elongated rings, narrow along the -axis but wide along the -axis, whereas those for the IRF method had top-level curves shaped like small ellipses, narrow along both axes. Compared with the top-level curves for the IRF method, the shape of the top-level curves for the TRF method indicates that the coefficient can be more accurately estimated than the coefficient. For all data sets in Condition 2, the contour plots for the TRF method had top-level curves shaped like distorted ellipses, big and wide, whereas those for the IRF method had top-level curves shaped like small ellipses, tilted diagonally.
Figure 1.
Contour Plots of the Negative Criterion Functions of the TRF and IRF Methods with Data Set 1 in Condition 1 and Data Set 11 in Condition 2.
The difference in shape between the top-level curves suggests that the TRF method produces less stable estimates of the beta coefficients than the IRF method and that the GENOUD and BFGS algorithms can converge to different neighborhoods and reach a global minimum. In other words, the criterion function of the IRF method is well-structured for the linking solutions but that of the TRF method is not. The criterion function of the IRF method is based on the sum of the squared differences in category response functions at the item level, and so the possible location differences between separate calibrations for each testlet are preserved and not mixed with those for other testlets. But the criterion function of the TRF method is based on the squared differences in true test scores, so that the possible location differences for each testlet are likely confounded at the test level. Such confounding likely leads to unstable estimation of the beta coefficients.
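The confounding argument above can be illustrated with a deliberately simplified toy example. The sketch below is not the 3PL testlet model or the criterion functions used in this paper: it assumes two testlets with identical logistic items (unit slopes, no guessing), holds the testlet factors at zero, and applies a candidate shift of `beta1` to testlet 1 and `-beta1` to testlet 2 (the zero-sum constraint) when the true shifts are zero. All names and values are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical item locations, shared by both testlets.
LOCATIONS = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Evaluation points on the primary dimension: -3.0, -2.75, ..., 3.0.
THETA_GRID = [-3.0 + 0.25 * i for i in range(25)]


def item_probs(theta, shift):
    """Response probabilities for one testlet whose items are moved by `shift`."""
    return [logistic(theta - b + shift) for b in LOCATIONS]


def criteria(beta1):
    """Return (irf_loss, trf_loss) for candidate shifts (+beta1, -beta1)."""
    irf_loss = 0.0
    trf_loss = 0.0
    for theta in THETA_GRID:
        base = item_probs(theta, 0.0)
        shifted_1 = item_probs(theta, beta1)
        shifted_2 = item_probs(theta, -beta1)
        diffs = [p - q for p, q in zip(shifted_1, base)]
        diffs += [p - q for p, q in zip(shifted_2, base)]
        irf_loss += sum(d * d for d in diffs)  # item-level squared differences
        trf_loss += sum(diffs) ** 2            # squared difference of test scores
    return irf_loss, trf_loss


for beta1 in (0.0, 0.1, 0.3):
    irf_loss, trf_loss = criteria(beta1)
    print(f"beta1={beta1:.1f}  IRF-type loss={irf_loss:.5f}  TRF-type loss={trf_loss:.5f}")
```

In this toy setting the opposite-signed shifts largely cancel once the item functions are summed, so the test-level loss rises far more slowly than the item-level loss as `beta1` moves away from its true value. That flatness is one way to read the wide top-level curves described for the TRF method.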
From the simulation results, we find no compelling reason to use genetic algorithms instead of quasi-Newton algorithms to find minimization solutions for the TRF method. Furthermore, in a numerical sense, we find that the criterion function of the IRF method produces better-structured solutions than that of the TRF method. From a practical point of view, the BFGS algorithm should be preferred to the GENOUD algorithm because the latter takes much more time than the former to find the minimization solutions. In this simulation study, the GENOUD-TRF method often took more than 5 and 10 minutes for the data sets in Conditions 1 and 2, respectively, whereas the BFGS-TRF and IRF methods took less than 5 and 15 seconds for the corresponding data sets.
As pointed out by Li et al. (2005), the zero-sum constraint for the beta coefficients indicates that the average (across testlets) of testlet factor means should be the same between the examinee groups being analyzed for scale linking. Although the zero-sum constraint seems to be reasonable and flexible, the linear dependence among the beta coefficients can be solved in other ways. One feasible approach is choosing a common testlet whose factor mean is expected to change little across examinee groups and fixing the beta coefficient for the common testlet to zero. Then the rest of the beta coefficients are considered as the free parameters. An analog of this approach is found in differential item functioning (DIF) analyses. If we use the constraint that the overall mean difference across examinee groups in item difficulties is zero, any performance differences are absorbed into differences in ability and treated as impact. If we suspect that the DIF may not cancel out across items, we can designate a set of anchor items to have zero DIF. Of course, to apply that “fixed-to-zero” constraint to linking tasks, the transformation matrix in Equation (13) should be modified appropriately for the MLS method, and the partial derivatives of the criterion functions with respect to the linking parameters should be properly computed for the IRF and TRF methods.
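The two ways of resolving the linear dependence among the beta coefficients can be sketched as two mappings from the free parameters to the full set of coefficients. The function names and the choice to express the last coefficient in terms of the others are illustrative assumptions, not the parameterization used in the paper's software.

```python
def betas_zero_sum(free_betas):
    """Zero-sum constraint: the last coefficient is minus the sum of the others."""
    free_betas = list(free_betas)
    return free_betas + [-sum(free_betas)]


def betas_fixed_reference(free_betas, ref_index):
    """Fixed-to-zero constraint: the reference testlet's coefficient is set to 0."""
    betas = list(free_betas)
    betas.insert(ref_index, 0.0)
    return betas


# With four common testlets, either approach leaves three free parameters.
free = [0.2, -0.5, 0.1]
print(betas_zero_sum(free))            # four values that sum to zero
print(betas_fixed_reference(free, 0))  # first testlet used as the reference
```

Either mapping lets an unconstrained optimizer search over the three free values only. Which one is appropriate depends on whether the testlet factor means are expected to balance out across testlets or whether one testlet can credibly serve as a stable reference.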
Some studies need to be conducted to improve understanding of the three linking methods presented in this paper and enable a wise choice among them in practice. First, the simulation study in this paper did not address the effects of linking errors in transformed item parameter estimates on the estimation of the ability ( ) parameters for the new group examinees. The accuracy of ability estimation would be affected most by the estimates of the or parameters, which are functions of the beta coefficients, given , , and (see Equation (21)). It should be examined whether the ability parameters can be estimated better when the linking coefficients from the IRF method are used than when those from the MLS or TRF method are used. Second, although the separate calibration and linking approach is a basic and dependable method for developing a common IRT scale, a common scale can also be developed by using multiple-group concurrent calibration (Bock & Zimowski, 1997) or fixed parameter calibration (Kim, 2006). A comparative study on the performance of the three calibration types will offer practitioners in the areas of test equating and vertical scaling useful information about the advantages and disadvantages of the various linking methods. Third, the three linking methods presented for the 3PL testlet model need to be extended to polytomous testlet models such as the graded response testlet model and the generalized partial credit testlet model. It would be meaningful to investigate the performance of the linking methods using a variety of real and simulated data. Finally, it would be very useful to derive analytic formulas for the standard errors of linking coefficient estimates obtained from the different linking methods because those standard errors of estimates can be used as indices of linking precision in practice.
Acknowledgments
The authors are grateful to two anonymous reviewers, Dr. John R. Donoghue (the Editor-in-Chief), and Dr. Christine E. DeMars (the Associate Editor) for their beneficial comments and insightful suggestions to improve the quality of this paper.
Appendix: Partial Derivatives.
Based on and Equations (18) and (21), let us write and as
| (A1) |
| (A2) |
The partial derivatives of with respect to , , and (where ) are computed as
| (A3) |
| (A4) |
| (A5) |
Then, the partial derivatives of with respect to and are given by
| (A6) |
| (A7) |
When , the partial derivatives of with respect to ( ) are given by
| (A8) |
With Equations (A1) to (A5), the partial derivatives of with respect to and are given by
| (A9) |
| (A10) |
And if , the partial derivatives of with respect to ( ) are given by
| (A11) |
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Seonghoon Kim https://orcid.org/0000-0002-0357-8639
References
- Bock R. D., Zimowski M. F. (1997). Multiple group IRT. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 433–448). Springer. https://doi.org/10.1007/978-1-4757-2691-6_25
- Bradlow E. T., Wainer H., Wang X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168. https://doi.org/10.1007/bf02294533
- Cai L. (2017). flexMIRT: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Vector Psychometric Group.
- Davey T., Oshima T. C., Lee K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20(4), 405–416. https://doi.org/10.1177/014662169602000407
- DeMars C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145–168. https://doi.org/10.1111/j.1745-3984.2006.00010.x
- Dennis J. E., Schnabel R. B. (1996). Numerical methods for unconstrained optimization and nonlinear equations. Society for Industrial and Applied Mathematics.
- Divgi D. R. (1985). A minimum chi-square method for developing a common metric in item response theory. Applied Psychological Measurement, 9(4), 413–415. https://doi.org/10.1177/014662168500900410
- Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423–436. https://doi.org/10.1007/bf02295430
- Glas C. A. W., Wainer H., Bradlow E. T. (2000). Maximum marginal likelihood and expected a posteriori estimation in testlet-based adaptive testing. In van der Linden W. J., Glas C. A. W. (Eds.), Computerized adaptive testing: Theory and practice (pp. 271–287). Kluwer Academic Publishers. https://doi.org/10.1007/0-306-47531-6_14
- Haebara T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149. https://doi.org/10.4992/psycholres1954.22.144
- Kim S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 353–381. https://doi.org/10.1111/j.1745-3984.2006.00021.x
- Kim S. (2019). Common-item linking methods for the bi-factor three parameter model in MIRT. Journal of Educational Evaluation, 32(1), 27–52. https://doi.org/10.31158/jeev.2019.32.1.27
- Kim S., Lee W.-C. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43(1), 53–76. https://doi.org/10.1111/j.1745-3984.2006.00004.x
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
- Li Y., Bolt D. M., Fu J. (2005). A testlet characteristic curve linking method for the testlet model. Applied Psychological Measurement, 29(5), 340–356. https://doi.org/10.1177/0146621605276678
- Li Y., Bolt D. M., Fu J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3–21. https://doi.org/10.1177/0146621605275414
- Lord F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
- Loyd B. H., Hoover H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179–193. https://doi.org/10.1111/j.1745-3984.1980.tb00825.x
- Marco G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139–160. https://doi.org/10.1111/j.1745-3984.1977.tb00033.x
- Mebane W. R., Sekhon J. S. (2011). Genetic optimization using derivatives: The rgenoud package for R. Journal of Statistical Software, 42(11), 1–26. https://doi.org/10.18637/jss.v042.i11
- Oshima T. C., Davey T. C., Lee K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37(4), 357–373. https://doi.org/10.1111/j.1745-3984.2000.tb01092.x
- R Development Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
- Rijmen F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3), 361–372. https://doi.org/10.1111/j.1745-3984.2010.00118.x
- Sekhon J. S., Mebane W. R. (1998). Genetic optimization using derivatives. Political Analysis, 7, 187–210. https://doi.org/10.1093/pan/7.1.187
- Stocking M. L., Lord F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210. https://doi.org/10.1177/014662168300700208
- Wainer H., Bradlow E. T., Wang X. (2007). Testlet response theory and its applications. Cambridge University Press.
- Wainer H., Kiely G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185–201. https://doi.org/10.1111/j.1745-3984.1987.tb00274.x
- Wainer H., Wang X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203–220. https://doi.org/10.1111/j.1745-3984.2000.tb01083.x
- Yen W. M., Fitzpatrick A. R. (2006). Item response theory. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 111–153). American Council on Education and Praeger.

