Applied Psychological Measurement. 2019 Jan 30;44(1):3–16. doi: 10.1177/0146621618824854

New Efficient and Practicable Adaptive Designs for Calibrating Items Online

Yinhong He, Ping Chen, Yong Li
PMCID: PMC6906388  PMID: 31853155

Abstract

When calibrating new items online, it is practicable to first compare all new items according to some criterion and then assign the most suitable one to the current examinee who reaches a seeding location. The modified D-optimal design proposed by van der Linden and Ren (denoted as D-VR design) works within this practicable framework with the aim of directly optimizing the estimation of item parameters. However, the optimal design point for a given new item should be obtained by comparing all examinees in a static examinee pool. Thus, D-VR design still has room for improvement in calibration efficiency from the view of traditional optimal design. To this end, this article incorporates the idea of traditional optimal design into D-VR design and proposes a new online calibration design criterion, namely, excellence degree (ED) criterion. Four different schemes are developed to measure the information provided by the current examinee when implementing this new criterion, and four new ED designs equipped with them are put forward accordingly. Simulation studies were conducted under a variety of conditions to compare the D-VR design and the four proposed ED designs in terms of calibration efficiency. Results showed that the four ED designs outperformed D-VR design in almost all simulation conditions.

Keywords: item response theory, computerized adaptive testing, online calibration, adaptive design, optimal design, sequential design

Introduction

Computerized adaptive testing (CAT) is an efficient testing mode that allows each examinee to answer appropriate test questions on the basis of his or her latent trait level (Lord, 1968). Compared with the traditional paper-pencil (P&P) test, CAT possesses several virtues such as immediate score reporting, flexible testing time, and more accurate ability estimates (e.g., Wang & Chang, 2011; Weiss, 1982). Therefore, many large-scale educational and psychological assessment programs such as the Graduate Management Admission Test (GMAT) and the Armed Services Vocational Aptitude Battery (ASVAB) have already been implemented via CAT (e.g., H.-H. Chang & Ying, 2009).

Because in CAT each examinee is evaluated by an optimal and individualized set of test items that are tailored to his or her provisional ability, the attrition speed of operational items in the item bank is faster than that in the P&P test. For test reliability, fairness, and security reasons, low-quality, overexposed, or obsolete items should be retired from the item bank periodically. On the other hand, new items should be developed to replace the unsuitable ones and maintain the usability and vitality of the item bank. Moreover, it is of great importance to precisely calibrate the new items, because inaccurate item parameter estimates may result in biased ability estimates and underestimated standard errors of ability estimates in the ensuing scoring process (e.g., Cheng & Yuan, 2010; Mislevy, Wingersky, & Sheehan, 1994; Tsutakawa & Johnson, 1990; Yang, Hansen, & Cai, 2012; Zhang, Xie, Song, & Lu, 2011).

Typically, new items are calibrated by administering them to a batch of volunteers recruited expressly for item calibration in the form of a P&P test. However, this calibration approach is costly, and the obtained calibration results may be somewhat unreliable, because the test takers know that their responses to the new items are unrelated to their final scores and may not take them seriously as a result. The other promising and widely used technique for calibrating new items is online calibration, in which the item responses are collected by randomly or adaptively assigning new items to examinees during the course of their testing with operational items (e.g., Stocking, 1988; Wainer & Mislevy, 1990). Compared with the traditional offline calibration approach, online calibration is typically applied in the CAT scenario; it is time-saving and cost-effective and collects more reliable data. Moreover, online calibration is capable of automatically placing new items on the same scale as the operational items without post hoc linking/scaling, thereby providing an effective solution to the complex linking/equating issues in the construction of a large-scale item bank (e.g., Chen, Xin, Wang, & Chang, 2012).

How to select a batch of examinees that provides the best estimate for a given new item is a matter of great concern. Many researchers have made efforts to directly apply some of the traditional experimental designs to calibrate new items online. For example, Jones and Jin (1994) used the D-optimal criterion to select individual examinees for a given new item and sequentially updated the item parameters once a subdesign was obtained; Y. C. I. Chang and Lu (2010) suggested using a two-stage D-optimal design to calibrate the two-parameter logistic (2PL; Birnbaum, 1968) model online under the variable-length CAT scenario; Lu (2014) calibrated the 2PL model online using the conclusions derived from D-, A-, and E-optimality. It is important to note that a key implementation requirement of the abovementioned designs is a static examinee pool from which examinees with suitable ability values are selected to calibrate new items. However, in the context of online calibration, the examinee pool is typically not static because examinees are allowed to leave immediately after their testing sessions. In this regard, it is hardly feasible to directly apply traditional optimal designs to the online calibration scenario.

To obtain a batch of examinees for calibrating new items online, a practical way is to administer new items via the following two steps: (a) compare a finite set of new items that need to be calibrated and pick out the most suitable one according to some criterion; and (b) assign it to the current active examinee when he or she reaches a seeding location (i.e., the location where new items are assigned to examinees during the adaptive testing session). Several online calibration designs have been developed within this practicable framework. For instance, the random design randomly selects a new item and then stochastically seeds it in the current examinee's adaptive test (Wainer & Mislevy, 1990). Although it is simple and convenient to implement, the randomly selected new items may be either too difficult or too easy relative to examinees' abilities and accordingly may be easily discerned from the operational items (Jones & Jin, 1994; Kingsbury, 2009). To address this issue, Chen et al. (2012) and Kingsbury (2009) proposed administering new items in an adaptive fashion by using the same item selection strategy that is used for selecting operational items. Nevertheless, this design, classified into the category of "examinee-centered adaptive design" by Zheng (2014), may not be efficient for calibrating new items, because the primary target of the item selection algorithm is to maximize the estimation accuracy of abilities rather than the calibration precision of new items. In contrast, the adapted versions of D-, A-, E-, and c-optimality proposed by van der Linden and Ren (2015) can be categorized as "item-centered adaptive designs" because their primary objective is to optimize the estimation of item parameters. Among them, the modified D-optimality concurrently considers all item parameters; thus, it is selected as the focus of this study. For ease of description, the modified D-optimal design proposed by van der Linden and Ren (2015) is referred to as the D-VR design henceforth.

From the implementation process of the D-VR design, one can note that if new item k is picked out from the set of new items and administered to an active examinee p at a seeding location, this implies that examinee p is more helpful in accurately calibrating new item k than in calibrating any other new item. Note that in traditional experimental designs, all examinees compete to provide the best estimate for a given new item. Hence, from the perspective of traditional optimal experimental design, the optimal design point for new item k should be derived by comparing all examinees in the entire examinee pool. In this case, the current active examinee p may not be the optimal examinee for calibrating new item k, implying that there is still room for improvement in the D-VR design.

To improve the calibration efficiency of the D-VR design, this article incorporates the idea of traditional optimal design into the D-VR design and proposes a new design criterion (denoted as the excellence degree [ED] criterion) for calibrating new items online. The ED criterion measures the excellence degree of the current active examinee relative to the optimal examinee when calibrating each new item. Besides, four different schemes are put forward to measure the information provided by the current examinee. On that basis, four new ED designs (i.e., ED-o, ED-min, ED-mean, and ED-lw) are proposed by equipping the ED criterion with the four schemes. Moreover, the whole procedure for assigning new items online is divided into a random calibration stage and an adaptive calibration stage, and the item parameters of each new item are updated sequentially in the adaptive calibration stage.

The remainder of this article is structured as follows. First, the original D-optimal design and the D-VR design are briefly summarized after some pre-knowledge. Then, the newly proposed online calibration designs are introduced in detail, followed by the elaboration on the four schemes for dealing with the uncertainty in ability estimates. Next, simulation studies comparing the four ED designs with the D-VR design under a variety of scenarios are presented. Finally, the article ends with a discussion and suggestions for further research.

Method

Pre-Knowledge

For the 2PL model, let $\theta_i$ represent the latent trait of examinee $i$, $y_{ij}$ denote his or her response to a dichotomously scored item $j$, and $P_j(\theta_i)$ represent the probability of a correct response, that is,

$$P_j(\theta_i) \equiv P(y_{ij}=1 \mid \theta_i, a_j, b_j) = \frac{1}{1+\exp\left[-a_j(\theta_i-b_j)\right]}, \quad (1)$$

where $a_j$ and $b_j$ are the discrimination parameter and difficulty parameter for item $j$, respectively.

Item calibration aims to estimate the item parameters $a_j$ and $b_j$ based on a sample of collected response data. Let $m$ be the number of new items that need to be calibrated, and let $\eta_k=(a_k,b_k)^T$ denote the item parameter vector of new item $k$ ($k=1,2,\ldots,m$). Suppose the $k$th new item has already been answered by $n_k$ examinees with ability vector $\boldsymbol{\theta}_{n_k}=(\theta_1,\theta_2,\ldots,\theta_{n_k})^T$. Using the terminology of experimental design, the desired ability values are termed design points, and all design points form the design space (denoted as $\Theta$).

In addition, let $I(\eta_k;\theta_i)$ represent the Fisher information about $\eta_k$ provided by the $i$th examinee who answers the $k$th new item. Under the local independence assumption, the Fisher information about $\eta_k$ provided by all $n_k$ examinees, denoted $I(\eta_k;\boldsymbol{\theta}_{n_k})$, is the sum of the individual contributions, that is, $I(\eta_k;\boldsymbol{\theta}_{n_k})=\sum_{i=1}^{n_k} I(\eta_k;\theta_i)$. For the 2PL model, $I(\eta_k;\theta_i)$ can be concretely expressed as
$$\begin{pmatrix} I_{ik}^{aa} & I_{ik}^{ab} \\ I_{ik}^{ab} & I_{ik}^{bb} \end{pmatrix},$$
where $I_{ik}^{aa}=(\theta_i-b_k)^2 P_k(\theta_i)\left(1-P_k(\theta_i)\right)$, $I_{ik}^{ab}=-a_k(\theta_i-b_k) P_k(\theta_i)\left(1-P_k(\theta_i)\right)$, and $I_{ik}^{bb}=a_k^2 P_k(\theta_i)\left(1-P_k(\theta_i)\right)$.
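To make these definitions concrete, the following minimal Python sketch (not part of the original article; function names are illustrative) computes the 2PL response probability in Formula 1 and the per-examinee Fisher information matrix given above.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response, P(y = 1 | theta, a, b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info_2pl(theta, a, b):
    """2x2 Fisher information matrix for (a, b) contributed by one examinee."""
    p = p_2pl(theta, a, b)
    pq = p * (1.0 - p)
    return np.array([[(theta - b) ** 2 * pq, -a * (theta - b) * pq],
                     [-a * (theta - b) * pq, a ** 2 * pq]])

def total_info_2pl(thetas, a, b):
    """Information accumulated over n_k examinees (local independence)."""
    return sum((item_info_2pl(t, a, b) for t in thetas), np.zeros((2, 2)))
```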

D-Optimal Design and D-VR Design

One of the commonly used criteria in the literature of optimal design is the D-optimality. The D-optimal design is executed by minimizing the generalized variance of item parameter estimates, or equivalently by maximizing the determinant of the Fisher information matrix (Anderson, 1984). In addition, as indicated by Silvey (1980), the D-optimal design is invariant against the linear transformation of item parameters; thus, it is particularly favored by experiment designers. Formally, the D-optimal design for calibrating the kth new item can be expressed as

$$\theta_{k_o}=\arg\max_{\theta}\left\{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta)\right]\right\}, \quad \theta\in\Theta. \quad (2)$$

As seen from Formula 2, the optimal examinee $k_o$ with ability $\theta_{k_o}$ for calibrating new item $k$ is derived by comparing all the examinees in a static examinee pool (i.e., $\Theta$). However, as alluded to above, the examinee pool is not static in the online calibration scenario, because adaptive testing is continuously administered at frequent time intervals and examinees are allowed to leave once they have finished their tests. Thus, van der Linden and Ren (2015) proposed assigning a new item to the current active examinee by comparing the expected contribution of the current examinee across all new items when he or she reaches a seeding location. Accordingly, the D-optimal design is adjusted as (i.e., the D-VR design)

$$\arg\max_{k}\left\{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_p)\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]\right\}, \quad k\in H, \quad (3)$$

where $\theta_p$ is the ability of the current active examinee $p$, and $H=\{k \mid 1\le k\le m\}$ is the ID set of new items that need to be calibrated.

The D-VR design has been proved to be feasible and practical in the context of online calibration (Chen, 2017), primarily because it does not require the examinee pool to be static when assigning new items to examinees. Regarding the full Bayesian version of D-VR design, interested readers are referred to van der Linden and Ren (2015).
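As an illustration of how the selection rule in Formula 3 can be operationalized, the following hedged Python sketch (building on the 2PL helpers sketched earlier; the dictionary layout for new items is an assumption of this example, not the authors' implementation) picks the new item with the largest determinant gain for the current examinee.

```python
import numpy as np

def dvr_select(theta_p, new_items):
    """new_items: list of dicts with current estimates 'a', 'b' and the list
    'thetas' of abilities of examinees who have already answered the item."""
    gains = []
    for item in new_items:
        base = total_info_2pl(item['thetas'], item['a'], item['b'])
        gain = (np.linalg.det(base + item_info_2pl(theta_p, item['a'], item['b']))
                - np.linalg.det(base))
        gains.append(gain)
    return int(np.argmax(gains))  # index of the new item assigned to examinee p
```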

New Designs for Online Calibration

Derive virtual optimal design point

Suppose new item $k$ is selected to be assigned to the current active examinee $p$ based on Formula 3; this implies that the $p$th examinee and the $k$th new item are the best match within the set of new items according to the D-VR design. However, from the view of the entire design space, the optimal examinee $k_o$ for calibrating new item $k$ should be derived according to Formula 2.

Note that although the examinee pool is dynamic in the adaptive testing scenario, all examinees' abilities are typically truncated by an interval $[C_L, C_U]$ in practice (e.g., Baker, 2001; Y. C. I. Chang & Lu, 2010). Thus, the ability of the optimal examinee $k_o$ can be obtained via the following formula:

$$\theta_{k_o}=\arg\max_{\theta}\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta)\right], \quad \theta\in[C_L,C_U]. \quad (4)$$

To derive $\theta_{k_o}$, one feasible way is to randomly discretize the interval $[C_L,C_U]$. The idea of discretizing the parameter space is widely used in experimental design, and similar treatments can be found in the studies of Berger and Tan (2004) and Zhu and Stein (2005). It is noteworthy that examinee $k_o$ does not necessarily correspond to a specific examinee in the examinee pool; it is more like a virtual examinee. In the limit, as the number of examinees in the pool goes to infinity, each point in the interval $[C_L,C_U]$ corresponds to a specific examinee, and $k_o$ becomes a real examinee.
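A minimal sketch of Formula 4, assuming the same helpers as above: the virtual optimal ability is approximated by a random discretization of $[C_L, C_U]$ with 100 candidate points, mirroring the simulation settings reported later; the function name and default values are illustrative.

```python
import numpy as np

def virtual_optimal_theta(item, c_l=-3.0, c_u=3.0, n_points=100, rng=None):
    """Approximate the ability of the virtual optimal examinee for one new item."""
    if rng is None:
        rng = np.random.default_rng()
    candidates = rng.uniform(c_l, c_u, size=n_points)  # random discretization
    base = total_info_2pl(item['thetas'], item['a'], item['b'])
    dets = [np.linalg.det(base + item_info_2pl(t, item['a'], item['b']))
            for t in candidates]
    return candidates[int(np.argmax(dets))]
```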

New design criterion for online calibration

Based on the derived optimal ability $\theta_{k_o}$ and the ability vector $\boldsymbol{\theta}_{n_k}=(\theta_1,\theta_2,\ldots,\theta_{n_k})^T$ that has already been obtained, the contribution of the virtual optimal examinee $k_o$ to calibrating new item $k$ can be measured according to the D-VR design criterion, that is,

$$\mathrm{INF}(\eta_k,\theta_{k_o})=\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_{k_o})\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]. \quad (5)$$

Similarly, the contribution of current examinee p to the calibration of new item k is obtained as

$$\mathrm{INF}(\eta_k,\theta_p)=\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_p)\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]. \quad (6)$$

Accordingly, when the current active examinee p reaches a seeding location, his or her contribution relative to the maximum contribution can be evaluated as

$$R_{\eta_k}(\theta_p;\theta_{k_o})=\frac{\mathrm{INF}(\eta_k,\theta_p)}{\mathrm{INF}(\eta_k,\theta_{k_o})}=\frac{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_p)\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]}{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_{k_o})\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]}. \quad (7)$$

The value of $R_{\eta_k}(\theta_p;\theta_{k_o})$ varies from 0 to 1, and a larger value indicates that the contribution provided by the current examinee $p$ is closer to that generated by the virtual optimal examinee $k_o$. Thus, by taking the virtual optimal examinee $k_o$ as a benchmark, $R_{\eta_k}(\theta_p;\theta_{k_o})$ can be used to measure the excellence degree of examinee $p$ for calibrating new item $k$. For convenience of description, $R_{\eta_k}(\theta_p;\theta_{k_o})$ is referred to as the "excellence degree (ED) criterion" in this article. Furthermore, the corresponding online calibration design is referred to as the ED design and can be formally expressed as

$$\arg\max_{k}\left\{R_{\eta_k}(\theta_p;\theta_{k_o})=\frac{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_p)\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]}{\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)+I(\eta_k;\theta_{k_o})\right]-\det\left[\sum_{i=1}^{n_k} I(\eta_k;\theta_i)\right]}\right\}, \quad k\in H. \quad (8)$$

More specifically, when the current examinee $p$ reaches a seeding location, the virtual optimal ability $\theta_{k_o}$ is first calculated for each new item $k$ ($k\in H$) according to Formula 4; then $R_{\eta_k}(\theta_p;\theta_{k_o})$ is computed for each new item based on Formula 7; finally, the new item with the largest ED criterion value is assigned to examinee $p$.
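Putting Formulas 4, 7, and 8 together, a hedged sketch of ED-based item selection might look as follows (again reusing the earlier helpers; this is an illustration of the criterion, not the authors' code).

```python
import numpy as np

def ed_select(theta_p, new_items, c_l=-3.0, c_u=3.0):
    """Assign the new item whose ED criterion value is largest for examinee p."""
    ratios = []
    for item in new_items:
        base = total_info_2pl(item['thetas'], item['a'], item['b'])
        base_det = np.linalg.det(base)
        inf_p = np.linalg.det(base + item_info_2pl(theta_p, item['a'], item['b'])) - base_det
        theta_opt = virtual_optimal_theta(item, c_l, c_u)
        inf_opt = np.linalg.det(base + item_info_2pl(theta_opt, item['a'], item['b'])) - base_det
        ratios.append(inf_p / inf_opt)   # R in Formula 7
    return int(np.argmax(ratios))
```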

Dealing with the uncertainty in ability estimates

In reality, the true abilities in Formula 8 are never known; thus, ability estimates rather than true abilities are used to calibrate new items. Considering that the ability of the current active examinee $p$ may not be accurately estimated because of the relatively short test length of operational items at some seeding locations, this article uses four different schemes to measure the Fisher information provided by the current examinee $p$, that is, $I(\eta_k;\theta_p)$.

Scheme 1

This scheme is used as a baseline for comparison, because it simply uses the estimated ability $\hat{\theta}_p$ to calculate $I(\eta_k;\theta_p)$.

Scheme 2

In this scheme, examinee $p$'s ability is represented by the worst case $\hat{\theta}_p^k$ in $[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]$, where $\delta_p>0$ defines a small neighborhood of the ability estimate $\hat{\theta}_p$. The worst case $\hat{\theta}_p^k$ is the ability point in the interval $[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]$ that provides the least Fisher information for calibrating new item $k$, which is given by

$$\hat{\theta}_p^k=\arg\min_{x}\det\left[\sum_{i=1}^{n_k} I(\eta_k;\hat{\theta}_i)+I(\eta_k;x)\right], \quad x\in[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]. \quad (9)$$

Then, the Fisher information $I(\eta_k;\theta_p)$ is calculated as $I(\eta_k;\hat{\theta}_p^k)$.

The half-width $\delta_p$ of the neighborhood can be set to the standard error of $\hat{\theta}_p$, whose squared value approximately equals the inverse of the test information evaluated at $\hat{\theta}_p$, that is, $1/I(\hat{\theta}_p)$, where $I(\hat{\theta}_p)=\sum_{j=1}^{L_p}\frac{\left[P_j'(\hat{\theta}_p)\right]^2}{P_j(\hat{\theta}_p)\left\{1-P_j(\hat{\theta}_p)\right\}}$, $L_p$ is the test length of operational items, and $P_j'(\hat{\theta}_p)=\left.\frac{dP_j(\theta_p)}{d\theta_p}\right|_{\theta_p=\hat{\theta}_p}$ is the derivative with respect to the ability parameter.
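A small sketch of Scheme 2 under the settings above: the standard error is obtained from the test information of the operational items, and the worst case is searched over random draws from the neighborhood (the 100-draw search mirrors the simulation section; names and data layout are illustrative).

```python
import numpy as np

def standard_error_2pl(theta_hat, operational_items):
    """SE of the ability estimate; operational_items is a list of (a, b) pairs."""
    info = sum(a ** 2 * p_2pl(theta_hat, a, b) * (1 - p_2pl(theta_hat, a, b))
               for a, b in operational_items)
    return 1.0 / np.sqrt(info)

def worst_case_info(item, theta_hat, delta, n_points=100, rng=None):
    """Scheme 2: information at the least informative ability in the neighborhood."""
    if rng is None:
        rng = np.random.default_rng()
    base = total_info_2pl(item['thetas'], item['a'], item['b'])
    candidates = rng.uniform(theta_hat - delta, theta_hat + delta, size=n_points)
    dets = [np.linalg.det(base + item_info_2pl(t, item['a'], item['b']))
            for t in candidates]
    return item_info_2pl(candidates[int(np.argmin(dets))], item['a'], item['b'])
```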

Scheme 3

Different from Scheme 2, which uses the worst case in the neighborhood of $\hat{\theta}_p$, Scheme 3 replaces $I(\eta_k;\theta_p)$ by the expected Fisher information $E[I(\eta_k;\theta_p)]$. That is,

$$E[I(\eta_k;\theta_p)]\approx\frac{1}{T}\sum_{t=1}^{T} I\left(\eta_k;\theta_p^{(t)}\right), \quad (10)$$

where $\theta_p^{(t)}$ ($t=1,2,\ldots,T$) are randomly drawn from the interval $[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]$, assuming that the true ability $\theta_p$ is uniformly distributed in a small neighborhood of $\hat{\theta}_p$.

In addition, if a reasonable and proper prior distribution $f(\theta_p)$ for $\theta_p$ is available, the posterior distribution $f(\theta_p\mid \mathbf{v}_{L_p})$ can be obtained as

$$f(\theta_p\mid \mathbf{v}_{L_p})=\frac{f(\theta_p)\prod_{j=1}^{L_p}\left[P_j(\theta_p)\right]^{v_j}\left[1-P_j(\theta_p)\right]^{1-v_j}}{\int_{-\infty}^{+\infty} f(\theta_p)\prod_{j=1}^{L_p}\left[P_j(\theta_p)\right]^{v_j}\left[1-P_j(\theta_p)\right]^{1-v_j}\,d\theta_p}, \quad (11)$$

where $\mathbf{v}_{L_p}=(v_1,v_2,\ldots,v_{L_p})^T$ is the response vector of examinee $p$ to the $L_p$ operational items. Then, $E[I(\eta_k;\theta_p)]$ can also be approximated by using $\theta_p^{(t)}$ ($t=1,2,\ldots,T$) randomly sampled from the posterior distribution $f(\theta_p\mid \mathbf{v}_{L_p})$.
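Scheme 3 can be sketched as a simple Monte Carlo average (Formula 10); the uniform draws below correspond to the neighborhood variant, and draws from the posterior in Formula 11 could be substituted. The value of T and the names are illustrative choices, and the 2PL helpers from the earlier sketches are assumed.

```python
import numpy as np

def expected_info_uniform(item, theta_hat, delta, T=100, rng=None):
    """Scheme 3: average the information matrix over T draws of theta_p."""
    if rng is None:
        rng = np.random.default_rng()
    draws = rng.uniform(theta_hat - delta, theta_hat + delta, size=T)
    return np.mean([item_info_2pl(t, item['a'], item['b']) for t in draws], axis=0)
```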

Scheme 4

In this scheme, $I(\eta_k;\theta_p)$ is weighted by the likelihood of the response vector $\mathbf{v}_{L_p}$ (denoted as $L(\mathbf{v}_{L_p};\theta_p)$), and then the unknown ability parameter is integrated out of the weighted information matrix, that is,

$$E\left[I(\eta_k;\theta_p)L(\mathbf{v}_{L_p};\theta_p)\right]=\int_{-\infty}^{+\infty} I(\eta_k;\theta_p)\prod_{j=1}^{L_p}\left[P_j(\theta_p)\right]^{v_j}\left[1-P_j(\theta_p)\right]^{1-v_j}\,d\theta_p. \quad (12)$$
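A hedged numerical sketch of Formula 12, approximating the integral on [−4, 4] (the bounds used in the simulations) by a simple grid sum; the grid size and function names are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def likelihood_weighted_info(item, responses, operational_items, n_grid=81):
    """Scheme 4: likelihood-weighted information with theta integrated out.
    responses: 0/1 vector; operational_items: list of (a, b) pairs."""
    grid = np.linspace(-4.0, 4.0, n_grid)
    weighted = np.zeros((2, 2))
    for theta in grid:
        lik = 1.0
        for v, (a, b) in zip(responses, operational_items):
            p = p_2pl(theta, a, b)
            lik *= p ** v * (1 - p) ** (1 - v)
        weighted += lik * item_info_2pl(theta, item['a'], item['b'])
    return weighted * (grid[1] - grid[0])  # Riemann-sum approximation of the integral
```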

In summary, Scheme 1 is relatively simple, and Scheme 2 is conservative because it uses the worst case in the interval, in much the same spirit as the minimax optimal design (e.g., Berger, King, & Wong, 2000; King & Wong, 2000). Compared with Scheme 2, both Scheme 3 and Scheme 4 are less sensitive to extreme points and are more robust, because they evaluate $I(\eta_k;\theta_p)$ by integrating $\theta_p$ out. Besides, Scheme 4 also fully utilizes auxiliary information from the operational items. Because Schemes 1 through 4, respectively, use the original estimated information (o), the minimum information (min), the mean information (mean), and the likelihood-weighted information (lw) to measure the information provided by the current examinee, the ED designs equipped with the four schemes are denoted as the ED-o, ED-min, ED-mean, and ED-lw designs for short.

Dealing with the dependence on unknown item parameters

Another noteworthy point is that the implementations of adaptive designs (i.e., D-VR, ED-o, ED-min, ED-mean, and ED-lw) also depend on the unknown item parameters of new items. To handle this kind of dependency, the idea of sequential design is used. In sequential designs, experiments are designed stage by stage, and the item parameter estimates obtained from the current stage will be used as the initial values in the next stage until the predefined sample size is reached or the item parameter estimates are precise enough.

Sequential designs start with initial item parameter values, which can be estimated by item writers based on subjective judgment (Wainer & Mislevy, 1990) or using data-based methods (e.g., Chen, Wang, Xin, & Chang, 2017; Chen et al., 2012). To be on the safe side, the initial item parameter values are determined based on a batch of examinees' responses in this article. Thus, the entire process of online calibration can be divided into a random calibration stage and an adaptive calibration stage, as in the study of Chen et al. (2012). More specifically, in the random calibration stage, new items are randomly administered to a subgroup of test takers, and the initial values are estimated based on the collected data; then, in the adaptive calibration stage, new items are assigned to examinees by using different adaptive design criteria, and the item parameters of each new item are sequentially updated once it receives a fixed number of new responses.

For simplicity, each item is retired from the whole calibration process once it receives a fixed total number of $N_T$ responses. The detailed procedures for administering new items online are sketched in the online Appendix A.
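The two-stage procedure can be summarized by the following high-level sketch (a simplification that administers one new item per examinee; the examinee and item data structures, and the `calibrate_oem` updating helper, are hypothetical placeholders for the steps detailed in online Appendix A).

```python
import numpy as np

def online_calibration(examinee_stream, new_items, n_r, n_b, N_T, design_select):
    """design_select(theta_hat, items) returns the index of the item to assign,
    e.g., dvr_select or ed_select from the earlier sketches."""
    for idx, examinee in enumerate(examinee_stream):
        active = [it for it in new_items if len(it['responses']) < N_T]
        if not active:
            break                                        # all items retired
        if idx < n_r:                                    # random calibration stage
            item = active[np.random.randint(len(active))]
        else:                                            # adaptive calibration stage
            item = active[design_select(examinee['theta_hat'], active)]
        y = examinee['answer'](item)                     # response at the seeding location
        item['responses'].append(y)
        item['thetas'].append(examinee['theta_hat'])
        if len(item['responses']) % n_b == 0:
            item['a'], item['b'] = calibrate_oem(item)   # hypothetical OEM update
```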

Simulation Studies

To fully evaluate the calibration efficiency of different adaptive designs relative to the random design, simulation studies were conducted under various conditions. In addition, the relative efficiency is calculated via the following formula:

$$\frac{1}{rep\times m}\sum_{r=1}^{rep}\left\{\sum_{j=1}^{m}\left[\frac{\det\left[\sum_{i=1}^{N_T} I\left(\eta_j^{(r)};\theta_i^{(jA_r)}\right)\right]}{\det\left[\sum_{i=1}^{N_T} I\left(\eta_j^{(r)};\theta_i^{(jR_r)}\right)\right]}\right]\right\}, \quad (13)$$

where $rep$ is the total number of replications, and $\eta_j^{(r)}$ is the parameter vector of the $j$th new item simulated in the $r$th replication; $\theta_i^{(jA_r)}$ and $\theta_i^{(jR_r)}$, respectively, denote the ability of the $i$th examinee selected by an adaptive design (e.g., D-VR) and by the random design for new item $j$ in the $r$th replication. To reduce random error, the entire simulation process was replicated 100 times (i.e., $rep=100$).
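For clarity, Formula 13 can be computed as in the following sketch, given the ability vectors actually assigned to each new item under the adaptive and random designs (the input layout and names are illustrative; the 2PL helpers from the earlier sketches are assumed).

```python
import numpy as np

def relative_efficiency(replications):
    """replications: for each replication, a list of tuples
    ((a, b), thetas_adaptive, thetas_random), one per new item."""
    ratios = []
    for rep in replications:
        for (a, b), thetas_adp, thetas_rnd in rep:
            det_adp = np.linalg.det(total_info_2pl(thetas_adp, a, b))
            det_rnd = np.linalg.det(total_info_2pl(thetas_rnd, a, b))
            ratios.append(det_adp / det_rnd)
    return float(np.mean(ratios))
```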

General Setup

For each examinee, his or her true ability was randomly drawn from the standard normal distribution. Because parameter $a$ is typically assumed to follow a lognormal distribution (Baker & Kim, 2004) and parameters $a$ and $b$ are often truncated by intervals in practice (Lord & Wingersky, 1984), the parameter vector $\gamma=(\log(a),b)^T$ was randomly drawn from a truncated bivariate normal distribution with $a$ constrained between 0 and 2 and $b$ truncated between −3 and 3. Moreover, the correlation coefficient between the simulated $a$ and $b$ was set to 0.25 to mimic a realistic scenario, in which parameters $a$ and $b$ are positively related to a certain extent (e.g., H.-H. Chang, Qian, & Ying, 2001). In addition, 1,000 operational items were generated to construct the CAT item bank, and a total of 20 new items (i.e., $m=20$) were simulated in the same way as the operational items.

Maximum Fisher information (MFI; Lord, 1980) was used as the item selection strategy in CAT, and each test taker's adaptive test was terminated after he or she finished a fixed number of 30 operational items. Ability was estimated via the expected a posteriori (EAP) method if the current test length of operational items was less than 5 or the response pattern was nonmixed (i.e., all incorrect or all correct); otherwise, ability was updated by using the weighted likelihood estimation method (Firth, 1993; Warm, 1989). In addition, the total number of new items that each examinee $p$ received was set to be less than 5.
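For completeness, the MFI rule used to select operational items can be sketched as follows (for the 2PL, item information at the provisional ability is $a^2P(1-P)$; the function and data layout are illustrative and reuse the `p_2pl` helper sketched earlier).

```python
import numpy as np

def mfi_select(theta_hat, item_bank, administered):
    """item_bank: list of (a, b) pairs; administered: set of indices already used."""
    best, best_info = None, -np.inf
    for j, (a, b) in enumerate(item_bank):
        if j in administered:
            continue
        p = p_2pl(theta_hat, a, b)
        info = a ** 2 * p * (1 - p)   # 2PL item information at theta_hat
        if info > best_info:
            best, best_info = j, info
    return best
```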

The interval $[C_L,C_U]$ in Formula 4 was chosen as [−3, 3], and the optimal ability $\theta_{k_o}$ was derived by comparing a finite set of 100 samples randomly drawn from this interval. To obtain the worst case $\hat{\theta}_p^k$ in Scheme 2, a batch of 100 samples was randomly drawn from $[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]$, where $\delta_p$ was set to $\sqrt{1/I(\hat{\theta}_p)}$ and $3\sqrt{1/I(\hat{\theta}_p)}$ to examine the impact of $\delta_p$ on the performance of the ED-min design. In like fashion, to explore the robustness of the ED-mean design against $\delta_p$, the expected information $E[I(\eta_k;\theta_p)]$ in Scheme 3 was also calculated by using samples randomly drawn from $[\hat{\theta}_p-\delta_p,\hat{\theta}_p+\delta_p]$ under the abovementioned two levels of $\delta_p$. Other highly accurate and efficient ways of approximating $E[I(\eta_k;\theta_p)]$, for example by using the posterior distribution, may be explored in future studies. In addition, the bounds of the integral in Scheme 4 were chosen as −4 and +4.

The "one expectation maximization (EM) cycle" (OEM; Wainer & Mislevy, 1990) method was selected to update item parameters in the adaptive calibration stage (see Step 3b in online Appendix A), because OEM works well in terms of item parameter recovery and is unaffected by the measurement errors contained in ability estimates by integrating the latent abilities out in the M step (e.g., Chen & Wang, 2016). The "multiple EM cycles" (MEM) method, which performs better than OEM (e.g., Ban, Hanson, Wang, Yi, & Harris, 2001), was not used because it is much more computationally intensive and therefore not suitable for frequent updating of item parameter estimates. For other calibration methods, please refer to Chen and Wang (2016), Kim (2006), and Stocking (1988).

Results

Comparison Among Different Adaptive Designs

Summarized in Table 1 are the relative efficiencies of different adaptive designs under three levels of NT (NT = 200, 400, and 600) and three levels of seeding location (Front, Middle, and Rear). Note that Front, Middle, and Rear denote that the wp seeding locations were randomly drawn from the first (wp+5), the middle (wp+6), and the last (wp+5) positions, respectively, in a test of (30+wp) items, where wp (wp ≤ 5) is the number of new items that examinee p receives. Moreover, both the ED-min and the ED-mean designs were conducted under two levels of δp (i.e., 1 SE and 3 SE of the ability estimate). In addition, the online calibration procedures depicted in online Appendix A were implemented with nr = 100 and nb = 30, where nr is the total number of examinees participating in the random calibration stage, and nb is the number of newly received responses required for each new item to update its item parameters during the adaptive calibration stage.

Table 1.

Relative Efficiencies of D-VR and Four ED Designs Under Different Seeding Locations.

NT   Sed.    D-VR    ED-o    ED-min (1 SE)  ED-min (3 SE)  ED-mean (1 SE)  ED-mean (3 SE)  ED-lw
200  Front   1.0206  1.0379  1.0686         1.0386         1.0320          1.0572          1.0901
200  Middle  1.0374  1.1066  1.1042         1.0196         1.1134          1.0873          1.1093
200  Rear    1.0392  1.1108  1.1089         1.0183         1.1169          1.0838          1.1100
400  Front   1.0102  1.0331  1.0731         1.0389         1.0287          1.0590          1.0944
400  Middle  1.0255  1.1166  1.1134         1.0222         1.1254          1.0915          1.1179
400  Rear    1.0253  1.1188  1.1178         1.0196         1.1314          1.0900          1.1171
600  Front   1.0066  1.0308  1.0729         1.0352         1.0266          1.0606          1.0966
600  Middle  1.0168  1.1230  1.1176         1.0209         1.1301          1.0930          1.1220
600  Rear    1.0170  1.1254  1.1195         1.0170         1.1340          1.0903          1.1217

Note. One SE and 3 SE represent that δp is obtained as one and three standard error(s) of ability estimate, respectively. The results for conditions where the ED-min, ED-mean, and ED-lw designs are more efficient than the ED-o design are in boldface. VR = van der Linden and Ren; ED = excellence degree; Sed. = Seeding location.

As shown in Table 1, the D-VR design was less efficient than the ED designs under almost all conditions. In addition, except for ED-min (3 SE), seeding new items at the middle and last locations of the adaptive test resulted in larger relative efficiency values than seeding them at the beginning. Moreover, the calibration results for the "Middle" seeding location were as good as those generated by the "Rear" seeding location. Furthermore, Table 1 shows that the ED-o design performed well if new items were seeded in the middle or toward the end of the adaptive test. To be specific, the ED-o design consistently outperformed the ED-min (1 SE), ED-min (3 SE), and ED-mean (3 SE) designs at the "Middle" and "Rear" seeding locations, and behaved better than the ED-lw design at the "Rear" seeding location. This is mainly because CAT is highly efficient and makes ability estimates rapidly approach their true values as the test length of operational items increases.

When new items were assigned to examinees at the "Front" seeding location, the ED-lw design generated the highest relative efficiency values, indicating that the information from operational items was helpful in improving calibration efficiency when the provisional test length was short.

Comparing the results of the ED-min and ED-mean designs side by side, one can note that the ED-mean design produced consistently larger calibration efficiency values than the ED-min design at the "Middle" and "Rear" seeding locations, whereas at the "Front" seeding location, the ED-min design outperformed the ED-mean design when the neighborhood size was small, and the result was exactly the opposite when the neighborhood size was large. In addition, the ED-mean design was more robust to the neighborhood size than the ED-min design for both the "Middle" and "Rear" seeding locations. For example, when the neighborhood size was changed from "1 SE" to "3 SE," the relative efficiencies of the ED-mean design decreased slightly, while the relative efficiencies of the ED-min design decreased dramatically and even fell below those produced by the D-VR design when NT = 200 and 400 (see the italics in Table 1).

In addition, the average computation times for the entire online calibration process are recorded in Table B1 (see online Appendix B). Table B1 shows that the D-VR design consumed the least time for calibrating new items as a whole, while the ED-min design took the longest time. Compared with the calibration time spent by the D-VR design, the additional computational costs associated with the ED-o, ED-mean, and ED-lw designs were acceptable.

In practice, van der Linden and Ren (2015) recommended seeding new items toward the end of CAT, because this profits maximally from the information available in the ability estimates. From the above results, one can notice that the ED-o design was time-saving and performed well at the "Rear" seeding location. Thus, among the four ED designs, the ED-o design with the "Rear" seeding location is used in the subsequent studies.

Exploring the Effect of nr and nb

The implementations of adaptive designs also depend on the values of nr and nb. To explore their effects on the performance of adaptive designs, the D-VR design and the ED-o design were thoroughly compared under five levels of nr (i.e., nr = 100 : 100 : 500) and six levels of nb (i.e., nb = 5 : 5 : 30). The resulting relative efficiency values are plotted in Figure 1. Note that the numbers from 1 to 5 on the nr-axis and from 1 to 6 on the nb-axis indicate that nr increased from 100 to 500 and nb increased from 5 to 30, respectively.

Figure 1. The calibration efficiencies of the D-VR design and the ED-o design relative to the random design under different conditions.

Note. VR = van der Linden and Ren; ED = excellence degree.

Consistent with the results in Table 1, all relative efficiency values in Figure 1 were larger than 1 under all simulation scenarios, indicating that both the D-VR design and the ED-o design were more efficient than the random design. Moreover, the ED-o design was superior to the D-VR design in terms of calibration efficiency, because the relative efficiency surfaces generated by the ED-o design were all above those resulting from the D-VR design.

For each level of NT, the ED-o design behaved differently from the D-VR design as nr increased. More specifically, the D-VR design worked well for relatively large nr, while the ED-o design was more efficient when nr was relatively small. This is primarily because larger nr produced more accurate initial item parameters, and the ED-o design was less sensitive to the calibration precision of the initial item parameters than the D-VR design. Hence, the performance of the ED-o design may be further improved by using more examinees in the adaptive calibration stage. In addition, the relative efficiencies of the D-VR design reached their peak at nr = 400 when NT = 200, implying that there might exist an optimal balance among nr, NT, the total number of new items (i.e., m), the number of new items that each examinee receives (i.e., wp), and so forth, for the D-VR design to achieve maximum relative efficiency.

In addition, little change was observed in the relative efficiency as nb varied; thus, nb was set to 30 henceforth to save calibration time. A possible explanation is that the impact of nb was not fully reflected by the six levels (i.e., nb = 5 : 5 : 30) selected in this study. Further research is needed to explore the impact of nb on the performance of adaptive designs.

Ability Estimation Accuracy by Using Calibrated Items

In practical terms, calibrated items will be put into use to estimate ability values. To determine whether the differences in item calibration have an impact on subsequent ability estimation, all 2,000 calibrated new items (20 items per replication multiplied by 100 replications) were used to recover 61 different ability levels (i.e., −3 : 0.1 : 3). The average bias values against the true abilities under different calibration sample sizes are plotted in Figures C1 through C3 (see online Appendix C).

As shown in Figures C1 through C3, medium ability levels were recovered better than extreme ones. As expected, items with true parameters worked better than calibrated items, and a larger calibration sample size produced more accurate ability estimates. In addition, items calibrated by the ED-o design performed slightly worse than those calibrated by the D-VR design in recovering positive abilities when NT = 200. When NT was relatively large, the ED-o design showed better performance than the D-VR design. Specifically, the ED-o design generated less biased ability estimates at most ability levels when NT = 400; and when NT = 600, new items calibrated by the ED-o design performed as well as the true items, while new items calibrated by the D-VR design still behaved clearly differently from the true ones.

Conclusion

The following conclusions can be drawn under the specific settings used in the simulations:

  1. Adaptive designs are consistently superior to the random design under all simulation conditions, and the calibration precision has an impact on the scoring process.

  2. Relative to the D-VR design, the ED-o design works better in terms of relative efficiency and is less sensitive to the calibration precision of initial item parameter values. Besides, the abilities can be better recovered by using items calibrated by ED-o for relatively large NT values (e.g., 400 and 600).

  3. The ED-min design is more efficient when new items are seeded at the beginning of the adaptive test, but it is time consuming and sensitive to the neighborhood size δp. By comparison, the ED-mean design is more robust to δp and performs best when new items are seeded at middle/rear locations with a small δp.

  4. Among all adaptive designs, the ED-lw design shows the best performance when new items are seeded at front locations.

  5. For all adaptive designs, seeding new items at middle/rear locations generally produces better results than seeding them at front locations.

Discussion

Item quality is essential for obtaining accurate ability estimates in the ensuing scoring process. The question of how to efficiently calibrate new items has received considerable attention in the context of online calibration. This article attempts to improve the calibration efficiency of the adapted D-optimal design proposed by van der Linden and Ren (2015) (i.e., the D-VR design) and further proposes the ED criterion for calibrating new items online. Moreover, the ED criterion is equipped with four different schemes for measuring the information provided by the current active examinee, and four ED designs (i.e., ED-o, ED-min, ED-mean, and ED-lw) are put forward accordingly. Besides, the dependence of the adaptive designs on unknown item parameters is addressed by sequentially updating the item parameters.

Results showed that the newly proposed ED designs perform better than the D-VR design when calibrating the 2PL model. Note that the ED designs can be readily generalized to other item response theory (IRT) models and other CAT formats. As an example, this article also applied one representative ED design (i.e., ED-o design) to calibrate the 3PL model, which is widely used in many large-scale assessment programs. Results indicated that ED-o is also a promising design when calibrating the 3PL model online. For more detailed information, interested readers are referred to the online Appendix D.

The current study can be expanded in several directions. First, it is worth exploring whether the ED designs still work well when they are extended to multidimensional CAT and cognitive diagnostic CAT. Second, this article derived the ED-mean design by using samples randomly drawn from a small neighborhood of the ability estimate; future research may explore its performance when samples are drawn directly from the posterior distribution of ability (e.g., Formula 11). Third, the virtual optimal ability is similarly obtained by randomly sampling from an interval. Because true abilities are typically normally distributed in practice, it would be interesting to further evaluate the ED designs by incorporating such distributional information. Fourth, this article sequentially updates the item parameters in the adaptive calibration stage to deal with the dependence on unknown item parameters. However, in addition to the sequential design, researchers may use other feasible approaches, such as the minimax optimal design and the Bayesian optimal design, to handle this kind of dependence. Last, but not least, this article retires each new item from the calibration process once it receives a fixed number of responses; thus, another line of research worth considering is to compare the number of examinees required by different designs to reach a predetermined calibration precision threshold.

Supplemental Material

Online_Appendix – Supplemental material for New Efficient and Practicable Adaptive Designs for Calibrating Items Online


Acknowledgments

The authors are indebted to the editor, associate editor, and three anonymous reviewers for their suggestions and comments on the earlier manuscript.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Natural Science Foundation of China (Grant No. 31300862), KLAS (Grant No. 130028732), and the Startup Foundation for Introducing Talent of NUIST (Grant No. 2018r041).

Supplemental Material: Supplemental material for this article is available online.

References

  1. Anderson T. W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York, NY: John Wiley.
  2. Baker F. B. (2001). The basics of item response theory (2nd ed.). Washington, DC: ERIC Clearinghouse on Assessment and Evaluation.
  3. Baker F. B., Kim S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
  4. Ban J. C., Hanson B. A., Wang T. Y., Yi Q., Harris D. J. (2001). A comparative study of on-line pretest item calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191-212.
  5. Berger M. P. F., King C. Y. J., Wong W. K. (2000). Minimax D-optimal designs for item response theory models. Psychometrika, 65, 377-390.
  6. Berger M. P. F., Tan F. E. S. (2004). Robust designs for linear mixed effects models. Applied Statistics, 53, 569-581.
  7. Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 379-479). Reading, MA: Addison-Wesley.
  8. Chang H.-H., Qian J. H., Ying Z. L. (2001). a-stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25, 333-341.
  9. Chang H.-H., Ying Z. L. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37, 1466-1488.
  10. Chang Y. C. I., Lu H. Y. (2010). Online calibration via variable length computerized adaptive testing. Psychometrika, 75, 140-157.
  11. Chen P. (2017). A comparative study of online item calibration methods in multidimensional computerized adaptive testing. Journal of Educational and Behavioral Statistics, 42, 559-590.
  12. Chen P., Wang C. (2016). A new online calibration method for multidimensional computerized adaptive testing. Psychometrika, 81, 674-701.
  13. Chen P., Wang C., Xin T., Chang H.-H. (2017). Developing new online calibration methods for multidimensional computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 70, 81-117.
  14. Chen P., Xin T., Wang C., Chang H.-H. (2012). Online calibration methods for the DINA model with independent attributes in CD-CAT. Psychometrika, 77, 201-222.
  15. Cheng Y., Yuan K. H. (2010). The impact of fallible item parameter estimates on latent trait recovery. Psychometrika, 75, 280-291.
  16. Firth D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27-38.
  17. Jones D. H., Jin Z. Y. (1994). Optimal sequential designs for on-line item estimation. Psychometrika, 59, 59-75.
  18. Kim S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43, 355-381.
  19. King J., Wong W. K. (2000). Minimax D-optimal designs for the logistic model. Biometrics, 56, 1263-1267.
  20. Kingsbury G. G. (2009). Adaptive item calibration: A process for estimating item parameters within a computerized adaptive test. In Weiss D. J. (Ed.), Proceedings of the 2009 GMAC conference on computerized adaptive testing. Retrieved from http://iacat.org/sites/default/files/biblio/cat09kingsbury.pdf
  21. Lord F. M. (1968). Some test theory for tailored testing (ETS Research Bulletin RB-68-38). Princeton, NJ: Educational Testing Service.
  22. Lord F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum.
  23. Lord F. M., Wingersky M. S. (1984). An investigation of methods for reducing sampling error in certain IRT procedures. Applied Psychological Measurement, 8, 347-364.
  24. Lu H. Y. (2014). Application of optimal designs to item calibration. PLoS ONE, 9(9), e106747.
  25. Mislevy R. J., Wingersky M. S., Sheehan K. M. (1994). Dealing with uncertainty about item parameters: Expected response functions (ETS Research Report RR-94-28-ONR). Princeton, NJ: Educational Testing Service.
  26. Silvey S. D. (1980). Optimal design. London, England: Chapman & Hall.
  27. Stocking M. L. (1988). Scale drift in on-line calibration (Research Report No. 88-28). Princeton, NJ: Educational Testing Service.
  28. Tsutakawa R. K., Johnson J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371-390.
  29. van der Linden W. J., Ren H. (2015). Optimal Bayesian adaptive design for test-item calibration. Psychometrika, 80, 263-288.
  30. Wainer H., Mislevy R. J. (1990). Item response theory, item calibration and proficiency estimation. In Wainer H. (Ed.), Computerized adaptive testing: A primer (Vol. 4, pp. 65-102). Hillsdale, NJ: Lawrence Erlbaum.
  31. Wang C., Chang H.-H. (2011). Item selection in multidimensional computerized adaptive testing—Gaining information from different angles. Psychometrika, 76, 363-384.
  32. Warm T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.
  33. Weiss D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
  34. Yang J. S., Hansen M., Cai L. (2012). Characterizing sources of uncertainty in IRT scale scores. Educational and Psychological Measurement, 72, 264-290.
  35. Zhang J. M., Xie M. G., Song X. L., Lu T. (2011). Investigating the impact of uncertainty about item parameters on ability estimation. Psychometrika, 76, 97-118.
  36. Zheng Y. (2014). New methods of online calibration for item bank replenishment (Unpublished doctoral dissertation). University of Illinois at Urbana–Champaign.
  37. Zhu Z., Stein M. L. (2005). Spatial sampling design for parameter estimation of the covariance function. Journal of Statistical Planning and Inference, 134, 583-603.
