Applied Psychological Measurement. 2020 Aug 14;45(1):22–36. doi: 10.1177/0146621620947177

The Block Item Pocket Method for Reviewable Multidimensional Computerized Adaptive Testing

Zhe Lin, Ping Chen, Tao Xin
PMCID: PMC7711249  PMID: 33304019

Abstract

Most computerized adaptive testing (CAT) programs do not allow item review because it can decrease estimation precision and invite aberrant manipulation strategies. In this article, a block item pocket (BIP) method that combines the item pocket (IP) method with the successive block method to realize reviewable CAT is proposed. A worst-case but still reasonable answering strategy and the Wainer-like manipulation strategy were simulated to evaluate the estimation precision of reviewable unidimensional computerized adaptive testing (UCAT) and multidimensional computerized adaptive testing (MCAT) under a series of BIP settings. For both UCAT and MCAT, the estimation precision of the BIP method improved as the number of blocks increased or the item pocket size decreased under the reasonable strategy. The BIP method was also more effective than the IP method in handling the Wainer-like strategy. With the help of the block design, the BIP method maintained acceptable estimation precision even when the total IP size was relatively large. These results suggest that the BIP method is a reliable solution for both reviewable UCAT and MCAT.

Keywords: computerized adaptive testing, reviewable CAT, item review, block item pocket


In recent decades, computerized adaptive testing (CAT) has become an important test form for many high-stakes and large-scale tests, and it has been favored by both test developers and test takers (Chang, 2015). To date, allowing item review is one of the major issues in transforming paper-and-pencil (P&P) testing into CAT and improving practical applications of CAT (Han, 2013; Papanastasiou & Reckase, 2007; Vispoel, 1998; Wise, 1997). Item review, consisting of reviewing previous items and changing answers, is an essential test strategy often seen in conventional P&P testing. Examinees can maximize test performance through item review, which may further improve the validity of test scores (Stylianou-Georgiou & Papanastasiou, 2017).

Previous research has shown that item review is also necessary for CAT. First, most examinees prefer to have item review options (and they have the right to do so) even if they may not necessarily use them (Vispoel et al., 2000, 2005; Waddell & Blankenship, 1994). Some previous studies have found that disallowing item review is likely to intensify test anxiety during the test and ultimately affect examinees’ overall performance (Lunz et al., 1992; Olea et al., 2000; Vispoel, 1998). This increase in test anxiety arises partly because test takers perceive less control over a nonreviewable test (Wise, 1994, 1997). Second, most examinees do review and change a certain proportion of items in both P&P testing and CAT, and they improve their performance by correcting more wrong answers than they change from correct to incorrect (e.g., Benjamin et al., 1984; Vispoel, 1998). Part of such item review behavior reflects examinees’ continued cognitive processing, and being able to capture the full cognitive process instead of interrupting it leads to more accurate measurement of examinees’ ability (Harvill & Davis, 1997).

Therefore, allowing item review in CAT is necessary and will benefit both examinees and test developers. Despite all the benefits associated with item review, however, most current operational CATs do not allow it because item review interferes with the CAT algorithm and can result in a series of negative consequences (Wise, 1996).

Consequences of Allowing Item Review in CAT Program

A decrease in estimation precision is one of the main reasons against item review in CAT (Olea et al., 2000; Stone & Lunz, 1994; Vispoel, 1998). More specifically, item selection algorithms in most CAT programs select the optimal item for each interim ability estimate. However, after an examinee has answered a batch of items, changing the answer to a previous item yields a new interim ability estimate that differs from the one obtained before the answer was changed. The items already selected after the revised item then become nonoptimal for the new interim ability estimate, which may eventually impair the final estimate. For instance, if the maximum Fisher information method is employed, the information of a nonoptimal item decreases when evaluated at the new interim ability estimate. This loss of information leads to a decrease in estimation precision, because the standard error of the ability estimate shrinks as test information grows (Papanastasiou, 2005). Furthermore, estimation precision is related to the frequency of item review. A large number of changes would force the CAT to select more mismatched items and drastically reduce estimation precision, especially when these changes are all from wrong to right or vice versa (Reckase, 1975; Stone & Lunz, 1994). Therefore, several methods have been proposed to limit the overuse of item review (Han, 2013; Stocking, 1997) or to improve the estimation procedure (Papanastasiou & Reckase, 2007; S. Wang et al., 2017). These methods have been shown to be effective in avoiding a large drop in estimation precision.
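This argument rests on a standard item response theory relation rather than a formula given in the article itself. Under the two-parameter logistic model, the information contributed by item i and the resulting standard error of the ability estimate are

$$P_i(\theta)=\frac{1}{1+\exp[-a_i(\theta-b_i)]},\qquad I_i(\theta)=a_i^{2}\,P_i(\theta)\,\bigl[1-P_i(\theta)\bigr],$$

$$I(\theta)=\sum_{i=1}^{n} I_i(\theta),\qquad SE(\hat{\theta})\approx\frac{1}{\sqrt{I(\hat{\theta})}}.$$

When an answer change shifts the interim estimate, items that were chosen to maximize $I_i(\theta)$ at the old estimate contribute less information at the new one, so $I(\hat{\theta})$ is smaller and $SE(\hat{\theta})$ larger than in the nonreview case.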

Another important reason why most current operational CATs do not allow item review is that, during reviewable CAT, examinees may adopt manipulation strategies that are likely to severely affect the validity and fairness of CAT (Stocking, 1997). For example, the Wainer strategy, in which examinees first try to answer all items incorrectly and then do their best to correct these items in the review phase, could result in an extremely easy but mismatched test (Wainer, 1993). Under these conditions, some examinees may obtain positively biased scores by correcting the incorrect responses at the review stage. The use of the Wainer strategy renders CAT invalid (Stocking, 1997; Vispoel et al., 1999). Therefore, manipulation strategies must be given serious attention when item review is allowed in CAT.

Representative Methods for Reviewable CAT

To address the above concerns, researchers have proposed several methods to allow item review in CAT from different perspectives (Bowles & Pommerich, 2001; Han, 2013; Papanastasiou & Reckase, 2007; Stocking, 1997; S. Wang et al., 2019; Yen et al., 2012). Two methods that are closely related to the proposed method are discussed in detail below.

Successive Block Method

The successive block method (Stocking, 1997) divides a test into several successive timed blocks,1 with each having a fixed number of items (see Figure 1). With the block design, examinees are only allowed to review items within each block, and cannot review items in previous blocks once they move to the next block. Thus, this method can deal with the Wainer strategy to some extent because only items in the present block can be manipulated. Previous studies have shown that this method can maintain good estimation precision and effectively handle the Wainer strategy if a relatively large number of blocks are employed (Stocking, 1997; Vispoel et al., 2000).

Figure 1. Illustration of three methods for reviewable CAT.

Note. CAT = computerized adaptive testing.

Item Pocket (IP) Method

Han (2013) pointed out several limitations of the successive block method. For example, examinees cannot temporarily skip an item and answer it later, as they often do in P&P tests. Moreover, examinees have to make frequent decisions about whether to proceed to the next block because a relatively large number of blocks is needed to ensure estimation precision. These features may cause adverse effects, such as increased test anxiety and extra cognitive load during the test, and may impair examinees’ performance.

To overcome these limitations, Han (2013) proposed the IP method, which provides an item review option that better resembles the P&P test. Specifically, the IP method provides a fixed-size IP in which examinees can place a certain number of items that they want to review and revise later (see Figure 1). Items in the IP can be reviewed and revised freely during the test, and examinees can confirm their final answers at any time. When examinees want to put a newly selected item into an already full IP, they must either replace an old item with it or give up saving the new item. Once an item is removed from the IP, it cannot be put back and reviewed again. Examinees have to empty the IP before the end of the test; otherwise, answers to the remaining items are scored as incorrect. In terms of item review, the IP method gives examinees a sufficient sense of control, as they can skip an item and decide which items should stay in the IP for later review. In addition, the IP method eliminates the negative impact of answer change on item selection because only items outside the IP are used for interim ability estimation and item selection. Answers to items stored in the IP are not recorded until the examinee confirms them.
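To make these pocket rules concrete, the following minimal Python sketch mimics the behavior described above; the class and method names are illustrative rather than taken from Han (2013), and the end-of-test scoring is simplified to marking leftover items incorrect.

```python
class ItemPocket:
    """Minimal sketch of the IP rules described above (names are illustrative)."""

    def __init__(self, size):
        self.size = size          # fixed IP capacity
        self.items = []           # items currently stored for later review
        self.removed = set()      # items taken out of the IP can never return

    def can_store(self, item):
        # A new item can be stored only if the IP is not full and the item
        # has never been removed from the IP before.
        return len(self.items) < self.size and item not in self.removed

    def store(self, item, replace=None):
        """Put an item into the IP; if the IP is full, an old item must be replaced."""
        if item in self.removed:
            raise ValueError("An item removed from the IP cannot be put back.")
        if len(self.items) >= self.size:
            if replace is None or replace not in self.items:
                raise ValueError("IP is full: replace an old item or skip saving.")
            self.items.remove(replace)
            self.removed.add(replace)   # the replaced item is finalized
        self.items.append(item)

    def confirm(self, item):
        """Finalize the answer to a pocketed item; it now counts for estimation."""
        self.items.remove(item)
        self.removed.add(item)

    def finalize_all(self):
        """Items still in the IP at the end of the test are scored as incorrect."""
        leftover = list(self.items)
        self.items.clear()
        self.removed.update(leftover)
        return leftover
```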

Block Item Pocket (BIP) Method

According to previous research, the effectiveness of the IP method relies on an appropriate IP size, and a large IP may cause some problems. Although answer change in the IP method does not directly affect item selection, too many items stored in a large IP may still affect the accuracy of subsequent item selection because items in the IP provide no information for item selection. Furthermore, a large IP may provide too many review opportunities at the beginning of the CAT and may substantially affect early item selection if a certain manipulation strategy is adopted. In sum, the IP method with a large IP size may still adversely affect item selection and decrease the final estimation precision to some extent.

To address the problems caused by a large IP, a BIP method that combines the IP method with the successive block method to realize reviewable CAT is proposed. More specifically, this method divides a test into several large blocks and assigns a sub-IP of reasonable size to each block. As shown in Figure 1, sub-IPs of identical size are assigned to each timed block. Within each block, examinees can review items in the same way as in the IP method, but they need to confirm their answers to the items in the sub-IP before proceeding to the next block. Although the BIP method introduces only a small sub-IP in each block, it still provides ample review opportunities because the total IP size can be quite large. Note that the BIP method reduces to the IP method when there is only one block in the test.
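The administration flow implied by this design can be sketched in Python as follows. The helper callables (select_next_item, get_response, wants_to_defer, update_estimate) stand in for whatever item selection, response, deferral, and estimation routines a particular CAT program uses, so this is an illustrative skeleton rather than the authors’ implementation.

```python
def administer_bip_cat(test_length, n_blocks, sub_ip_size,
                       select_next_item, get_response, wants_to_defer,
                       update_estimate):
    """Skeleton of a BIP-administered CAT: review is free within a block, but the
    sub-IP must be emptied (answers confirmed) before the next block starts."""
    block_size = test_length // n_blocks
    confirmed = {}            # item -> confirmed response; only these drive estimation
    theta = 0.0               # interim ability estimate (starting value)

    for _ in range(n_blocks):
        pocket = []           # sub-IP for the current block
        for _ in range(block_size):
            item = select_next_item(theta, confirmed)
            if len(pocket) < sub_ip_size and wants_to_defer(item):
                pocket.append(item)                   # deferred; ignored for estimation
            else:
                confirmed[item] = get_response(item)  # answered and finalized now
                theta = update_estimate(confirmed)
        # Block boundary: every pocketed item must be confirmed before moving on.
        for item in pocket:
            confirmed[item] = get_response(item)      # possibly revised answer
        theta = update_estimate(confirmed)

    return theta, confirmed
```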

In terms of item review, the BIP method inherits the advantages of the IP method, because the two methods handle item review in almost the same way except for the restrictions introduced by the blocks. Regarding CAT administration, the BIP method keeps a small IP available at any stage of the test. This feature may reduce the problems caused by a large IP to some extent and improve the estimation precision of CAT. Note that the BIP method differs from the successive block method in the following respects. First, apart from the restriction imposed by the blocks, the BIP method supports item review through a completely different mechanism from the successive block method. Second, with the help of the IP design, the blocks in the BIP method are fewer and contain more items than those in the successive block method.

The Current Study

The BIP method is a restricted review design embedded within CAT administration. From this perspective, it can be easily applied to various formats of CAT programs without changing the core algorithms. However, it is still unclear whether the BIP method can maintain good estimation precision in different CAT scenarios. Therefore, the present study investigated the effectiveness of the BIP method under both unidimensional computerized adaptive testing (UCAT) and multidimensional computerized adaptive testing (MCAT). UCAT was included to compare the effectiveness of the BIP method with that of existing methods, as most existing methods have only been evaluated in UCAT. Moreover, the authors went beyond reviewable UCAT and further examined the performance of the BIP method in reviewable MCAT, which has not been formally discussed in the literature.

MCAT has attracted increasing attention in recent years. Previous studies have shown that MCAT is better able to fit complex multidimensional constructs (e.g., Frey & Seitz, 2009; Haley et al., 2006) and possesses better measurement efficiency for multi-trait estimation, especially when dimensions are moderately to highly correlated (Segall, 1996; W. C. Wang & Chen, 2004). Therefore, MCAT has become a promising test form that can satisfy the demand for informative diagnostic evaluation (Mulder & van der Linden, 2009; C. Wang et al., 2011).

In this article, the authors conducted a Monte Carlo simulation study with conditions covering varying BIP settings to examine the estimation precision of the BIP method under the two reviewable CAT scenarios (i.e., UCAT and MCAT). In addition, a worst-case but reasonable strategy and a Wainer-like manipulation strategy were simulated separately to examine the effectiveness of the BIP method in dealing with the two test-taking strategies.

Method

CAT and Examinee Simulation

Typical UCAT and MCAT scenarios were simulated to examine the effectiveness of the BIP method, with response data generated according to a two-parameter logistic model or a three-dimensional two-parameter logistic model, respectively. Details of the CAT simulation procedures are provided in the Online Appendix. Regarding examinee simulation, the grid method (e.g., Finkelman & Roussos, 2009) was used to generate examinees’ true abilities. Specifically, for UCAT, 11 ability points from −2.5 to 2.5 with an interval of 0.5 were selected, and 500 examinees were generated for each ability point. For MCAT, the same 11 ability points were taken on each dimension, producing a total of 1,331 different ability vectors by completely crossing all points across the three dimensions, and 30 examinees were generated for each ability vector. Note that 30 examinees per vector is relatively small compared with common practice; this number was chosen because the total number of examinees in the MCAT study was already large and because of computing time limitations.
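The grid design for the true abilities described above can be reproduced with a few lines of Python (a sketch restating the design; numpy is assumed to be available):

```python
import itertools
import numpy as np

# UCAT: 11 ability points from -2.5 to 2.5 in steps of 0.5, 500 examinees per point.
ucat_points = np.arange(-2.5, 2.5 + 0.5, 0.5)                   # 11 points
ucat_true_theta = np.repeat(ucat_points, 500)                   # 5,500 examinees

# MCAT: the same 11 points crossed over 3 dimensions -> 11**3 = 1,331 ability vectors,
# with 30 examinees generated per vector.
mcat_vectors = np.array(list(itertools.product(ucat_points, repeat=3)))   # (1331, 3)
mcat_true_theta = np.repeat(mcat_vectors, 30, axis=0)                     # (39930, 3)

print(ucat_true_theta.shape, mcat_true_theta.shape)   # (5500,) (39930, 3)
```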

Simulation of the Reasonable Strategy (Denoted as S1)

The strategy that examinees are most likely to take during the reviewable test is to mark the uncertain items for later review (Vispoel et al., 2000). This strategy is consistent with the review rule of the IP system in which examinees can make full use of the IP to store uncertain items for later review (Han, 2013). In general, examinees tend to answer the most difficult items among all uncertain items last to better review them. Hence, the study first simulated this reasonable strategy, assuming that examinees would keep postponing answering the hardest items among all reviewable items within the limits of the BIP condition. Details about the simulation procedures for the S1 were provided in the Online Appendix.
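Because the exact simulation procedure for S1 is given only in the Online Appendix, the Python fragment below is an interpretive sketch of the rule just described: the hardest currently reviewable items are kept deferred (up to the IP capacity) and are answered last. The function name, the use of item difficulty alone, and the tie-breaking are assumptions; under a BIP condition, the same rule would be applied block by block with the sub-IP capacity.

```python
def simulate_s1_deferrals(item_difficulties, ip_size):
    """Interpretive sketch of the reasonable strategy (S1): keep the hardest
    reviewable items in the pocket and answer the easier ones immediately."""
    pocket = []                 # (difficulty, index) pairs currently deferred
    answer_order = []           # order in which items receive confirmed answers
    for idx, b in enumerate(item_difficulties):
        pocket.append((b, idx))
        pocket.sort()           # easiest pocketed item first
        while len(pocket) > ip_size:
            # Pocket over capacity: confirm the easiest deferred item now.
            _, easiest_idx = pocket.pop(0)
            answer_order.append(easiest_idx)
    # At the end of the test (or block), the remaining hardest items are confirmed.
    answer_order.extend(idx for _, idx in pocket)
    return answer_order


# Example: with an IP of size 2, the two hardest items are answered last.
print(simulate_s1_deferrals([0.1, 1.4, -0.8, 2.0, 0.5], ip_size=2))
```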

Simulation of the Wainer-Like Strategy (Denoted as S2)

The Wainer-like strategy (Han, 2013) is a manipulation strategy that follows the rules of the Wainer strategy within the limits of the IP system. It assumes that examinees first make full use of the IP to intentionally defer the easiest items and then attempt to gain inflated ability estimates by answering these easiest items correctly before the end of the test. According to this assumption, the Wainer-like strategy proceeds as follows: First, put as many items as the IP size allows into the IP at the beginning of the test; then, keep postponing the relatively easy items; and finally, answer all easy items in the IP before the end of the test. The Wainer-like strategy within the BIP system assumes that examinees follow this process within each block. Details of the simulation procedures for S2 are also provided in the Online Appendix. Notably, both S1 and S2 reflect worst-case situations in which all examinees consistently adopt the given strategy throughout the test, and readers should keep this in mind when interpreting the simulation results. To simulate responses under the two strategies, it is only necessary to obtain the final response to each item from the examinee’s true ability and the given item response theory (IRT) model (see CAT simulation in the Online Appendix), regardless of whether the answer has been revised.
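Analogously, the fragment below is an interpretive sketch of the Wainer-like strategy within a single block: the pocket is filled as early as possible, the easiest items encountered so far are kept deferred, and they are answered last. Again, the names and details are assumptions rather than the authors’ appendix code.

```python
def simulate_s2_deferrals(item_difficulties, ip_size):
    """Interpretive sketch of the Wainer-like strategy (S2): fill the pocket as
    early as possible, keep the easiest items deferred, and answer them last."""
    pocket = []                 # (difficulty, index) pairs currently deferred
    answer_order = []
    for idx, b in enumerate(item_difficulties):
        pocket.append((b, idx))
        pocket.sort()           # easiest pocketed item first
        while len(pocket) > ip_size:
            # Pocket over capacity: the hardest deferred item must be confirmed now.
            _, hardest_idx = pocket.pop()      # last element = hardest
            answer_order.append(hardest_idx)
    # Before the end of the block, the easiest deferred items are finally answered,
    # which is where the examinee hopes to push the interim estimate upward.
    answer_order.extend(idx for _, idx in pocket)
    return answer_order


# Example: with an IP of size 2, the two easiest items are saved for last.
print(simulate_s2_deferrals([0.1, 1.4, -0.8, 2.0, 0.5], ip_size=2))
```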

Research Design

In this study, a set of conditions with varying IP sizes and numbers of blocks was employed to examine the effectiveness of the BIP method. Specifically, the conditions were IP-6, BIP-3/2, BIP-2/3, BIP-1/3, BIP-3/3, and the nonreview baseline, where X/Y means that the BIP condition contains Y blocks and each block contains a sub-IP of size X, and IP-6 refers to the IP method with an IP of size 6. The first three reviewable conditions had an equal total IP size but different numbers of blocks, whereas the last three reviewable conditions had an equal number of blocks but different total IP sizes. Under the nonreview condition, examinees answered items in the sequence given by the CAT program, while under the reviewable conditions they answered items according to the given simulated strategy. The set of conditions formed a within-subject design. In each condition, there were 30 replications for each test-taking strategy, and the item bank and examinees were regenerated in each replication.
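For reference, the reviewable conditions can be encoded as (sub-IP size, number of blocks) pairs; the snippet below simply restates the design in the X/Y notation used above (a bookkeeping sketch, not code from the study).

```python
# (sub-IP size per block, number of blocks); total IP size is their product.
conditions = {
    "IP-6":      (6, 1),   # the IP method: one block with an IP of size 6
    "BIP-3/2":   (3, 2),
    "BIP-2/3":   (2, 3),
    "BIP-1/3":   (1, 3),
    "BIP-3/3":   (3, 3),
    "nonreview": (0, 1),
}

for name, (sub_ip, n_blocks) in conditions.items():
    total_ip = sub_ip * n_blocks
    print(f"{name}: {n_blocks} block(s), sub-IP size {sub_ip}, total IP size {total_ip}")
```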

Evaluation Criteria

For UCAT, the mean absolute error (MAE) and bias were used to evaluate overall estimation precision under the different reviewable conditions. The empirical standard error of the MAE deviation (SEMD) was also used to evaluate the stability of the differences between each reviewable condition and the nonreview condition across the 30 replications. For MCAT, the average Euclidean distance (AED) was used to evaluate the overall estimation precision of θ, and MAE, SEMD, and bias were used to evaluate the estimation precision of each ability dimension. In addition, the conditional mean absolute error (CMAE) and conditional bias (Cbias) were used to evaluate estimation precision at different ability levels on one dimension as well as the interaction between dimensions. There is no absolute standard for these criteria, but smaller values indicate higher estimation precision. The formulas are as follows:

$$\mathrm{MAE}_{\theta_p}=\frac{1}{MN}\sum_{k=1}^{M}\sum_{j=1}^{N}\bigl|\hat{\theta}_{pkj}-\theta_{pkj}\bigr|,\qquad \mathrm{bias}_{\theta_p}=\frac{1}{MN}\sum_{k=1}^{M}\sum_{j=1}^{N}\bigl(\hat{\theta}_{pkj}-\theta_{pkj}\bigr),$$

$$\mathrm{CMAE}_{\theta_p}^{\,g}=\frac{1}{\sum_{k=1}^{M}G_{k}}\sum_{k=1}^{M}\sum_{\theta_{pkj}\in g}\bigl|\hat{\theta}_{pkj}-\theta_{pkj}\bigr|,\qquad \mathrm{Cbias}_{\theta_p}^{\,g}=\frac{1}{\sum_{k=1}^{M}G_{k}}\sum_{k=1}^{M}\sum_{\theta_{pkj}\in g}\bigl(\hat{\theta}_{pkj}-\theta_{pkj}\bigr),$$

$$\mathrm{AED}=\frac{1}{MN}\sum_{k=1}^{M}\sum_{j=1}^{N}\sqrt{\sum_{p=1}^{P}\bigl(\hat{\theta}_{pkj}-\theta_{pkj}\bigr)^{2}},\qquad \mathrm{SEMD}_{c}=\sqrt{\frac{\sum_{k=1}^{M}\bigl(\mathrm{MD}_{kc}-\overline{\mathrm{MD}}_{c}\bigr)^{2}}{M}},$$

where $\hat{\theta}_{pkj}$ and $\theta_{pkj}$ are the estimated and true abilities on the $p$th dimension of the $j$th examinee in the $k$th replication ($p=1$ in UCAT); $M$ is the number of replications and $N$ is the number of examinees in each replication; $g$ is a given ability level on the $\theta_p$ scale; $G_k$ is the number of examinees belonging to ability level $g$ in the $k$th replication; and $\mathrm{MD}_{kc}$ is the MAE deviation between the $c$th reviewable condition and the nonreview condition in the $k$th replication.
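A direct Python implementation of these criteria might look like the sketch below, assuming the estimates and true values are stored in numpy arrays of shape (M replications, N examinees, P dimensions); the function names are illustrative.

```python
import numpy as np

def mae(theta_hat, theta_true):
    """MAE per dimension; inputs have shape (M, N, P)."""
    return np.mean(np.abs(theta_hat - theta_true), axis=(0, 1))

def bias(theta_hat, theta_true):
    """Bias per dimension; inputs have shape (M, N, P)."""
    return np.mean(theta_hat - theta_true, axis=(0, 1))

def aed(theta_hat, theta_true):
    """Average Euclidean distance over all examinees and replications."""
    return np.mean(np.sqrt(np.sum((theta_hat - theta_true) ** 2, axis=2)))

def conditional_mae(theta_hat, theta_true, p, level):
    """CMAE on dimension p for examinees whose true ability on p equals `level`."""
    mask = theta_true[:, :, p] == level
    return np.mean(np.abs(theta_hat[:, :, p][mask] - theta_true[:, :, p][mask]))

def semd(mae_review_per_rep, mae_nonreview_per_rep):
    """Empirical SE of the MAE deviation between a reviewable condition and the
    nonreview condition, computed over the M replications."""
    md = np.asarray(mae_review_per_rep) - np.asarray(mae_nonreview_per_rep)
    return np.sqrt(np.sum((md - md.mean()) ** 2) / md.size)
```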

Results

Results From the UCAT Scenario

The results from the UCAT scenario are shown in Table 1. Under both the reasonable strategy (S1) and the Wainer-like strategy (S2), there were small but stable MAE deviations between the reviewable and nonreview conditions, and all biases were close to zero, indicating that both the IP and BIP methods could maintain an acceptable level of estimation precision in reviewable UCAT, even though reviewable UCAT had slightly lower estimation precision than nonreviewable UCAT. It was also found that MAEs decreased gradually with an increasing number of blocks or a decreasing total IP size, indicating that the BIP method had slightly better estimation precision than the IP method given a fixed total IP size.

Table 1.

Estimation Precision Under Two Strategies in UCAT.

Indexes IP-6 BIP-3/2 BIP-2/3 BIP-1/3 BIP-3/3 Nonreview
S1
 MAE (SEMD) 0.1181 (0.0007) 0.1162 (0.0005) 0.1159 (0.0005) 0.1151 (0.0005) 0.1161 (0.0004) 0.1146
 Bias −0.0008 −0.0006 −0.0004 −0.0003 −0.0003 −0.0003
S2
 MAE (SEMD) 0.1218 (0.0014) 0.1178 (0.0007) 0.1165 (0.0006) 0.1151 (0.0005) 0.1178 (0.0007) 0.1141
 Bias −0.0007 −0.0011 −0.0010 −0.0008 −0.0010 −0.0008

Note. UCAT = unidimensional computerized adaptive testing; IP = item pocket; BIP = block item pocket; MAE = mean absolute error; SEMD = empirical standard error of mean absolute error deviation.

Results From the Reasonable Strategy (S1) in MCAT Scenario

The results for S1 are presented in Table 2. Overall, AED and MAE under the reviewable conditions were slightly larger than those under the nonreview baseline condition. Although the deviations were far smaller than the AED and MAE themselves, the small SEMDs indicated that these small deviations were consistently found between the reviewable and nonreviewable conditions. This suggests that both the IP and BIP methods were capable of avoiding a significant decrease in overall estimation precision, and that there were slight differences in estimation precision between conditions. Further comparison of the reviewable conditions showed that both AED and MAE became gradually smaller with an increasing number of blocks or a decreasing total IP size. For example, the MAEs of θ1 under the conditions with an equal total IP size of 6 were 0.1659, 0.1612, and 0.1597 for 1, 2, and 3 blocks, respectively; under the conditions with 3 blocks, they were 0.1623, 0.1597, and 0.1576 for sub-IP sizes of 3, 2, and 1. These results suggest that the BIP method with a smaller IP size or more blocks has better overall estimation precision and works slightly better than the IP method given an equal total IP size. It is noteworthy that the AED and MAE under the BIP-3/3 condition were still smaller than those under the IP-6 condition, which means that the BIP method can accommodate a larger total IP size than the IP method while maintaining almost identical estimation precision. In addition, the biases of all reviewable conditions were small in an absolute sense but larger (in the positive direction) than those under the nonreviewable condition, indicating that both the IP and BIP methods would slightly overestimate examinees’ overall ability levels when examinees adopted the reasonable strategy. Identifying the cause of these positive deviations in bias required further analysis of Cbias at different ability levels. However, because the deviations in Cbias displayed inconsistent patterns across ability levels (see Figure 2), comparing deviations in overall bias between conditions is not very informative; therefore, the empirical SE of the bias deviation is provided only in the Online Appendix.

Table 2.

Estimation Precision Under the Reasonable Strategy in MCAT.

Indexes IP-6 BIP-3/2 BIP-2/3 BIP-1/3 BIP-3/3 Nonreview
AED 0.3349 0.3255 0.3222 0.3176 0.3276 0.3146
MAE (SEMD)
θ1 0.1659(0.0015) 0.1612(0.0012) 0.1597(0.0008) 0.1576(0.0004) 0.1623(0.0011) 0.1562
θ2 0.1660(0.0015) 0.1610(0.0009) 0.1594(0.0007) 0.1570(0.0004) 0.1621(0.0009) 0.1554
θ3 0.1651(0.0015) 0.1604(0.0012) 0.1587(0.0008) 0.1564(0.0004) 0.1614(0.0012) 0.1548
Bias
θ1 0.0009 0.0005 0.0000 −0.0008 0.0013 −0.0012
θ2 0.0019 0.0015 0.0008 0.0000 0.0023 −0.0004
θ3 0.0020 0.0015 0.0009 0.0000 0.0022 −0.0004

Note. MCAT = multidimensional computerized adaptive testing; IP = item pocket; BIP = block item pocket; MAE = mean absolute error; SEMD = empirical standard error of mean absolute error deviation; AED = average Euclidean distance.

Figure 2. Estimation precision at different θ1 levels under the reasonable strategy in MCAT.

Note. This figure shows the deviations in CMAE and Cbias between the reviewable and nonreviewable conditions. The nonreview condition is the baseline that equals zero at all ability levels; deviations closer to zero mean smaller differences. The original CMAE and Cbias values are provided in the Online Appendix. CMAE = conditional mean absolute error; MCAT = multidimensional computerized adaptive testing; Cbias = conditional bias; IP = item pocket; BIP = block item pocket.

In addition to the overall criteria, estimation precision was further explored at different ability levels. Considering that the results did not differ substantially among the three dimensions, only the results for θ1 are displayed. Figure 2 shows the deviations in CMAE and Cbias between the reviewable conditions and the nonreviewable condition at the 11 θ1 levels. As can be seen, under all reviewable conditions, estimation precision at low ability levels was more affected than at middle or high ability levels. More specifically, abilities were overestimated at low ability levels, whereas they were slightly underestimated at high ability levels. The Cbias deviations were 0.0242, 0.0155, 0.0103, 0.0037, and 0.0203 for IP-6, BIP-3/2, BIP-2/3, BIP-1/3, and BIP-3/3, respectively, at the θ1 = −2.5 level, whereas they were −0.0136, −0.0075, −0.0050, −0.0020, and −0.0071, respectively, at the θ1 = 2.5 level. This is the main reason why overall bias under the IP and BIP conditions was more positive than under the nonreviewable condition. Furthermore, the overestimation at low ability levels and the underestimation at high ability levels were alleviated with more blocks or a smaller total IP size, and estimation precision under the BIP-1/3 condition was very close to that under the nonreview condition. In conclusion, a BIP design with more blocks or a smaller IP size could more effectively avoid overestimation for low-ability examinees and underestimation for high-ability examinees when the reasonable strategy was adopted.

Moreover, the interaction effect on estimation precision between dimensions was further analyzed. Three groups of examinees, whose true abilities on both θ2 and θ3 were below −2 (inclusive), between −0.5 and 0.5, and above 2 (inclusive), respectively, were selected to compare their estimation precision on θ1 (low-ability, medium-ability, and high-ability groups). Figure 3 presents the deviations in CMAE and Cbias for the three groups under the IP-6 condition and the BIP-2/3 condition separately. As the figure shows, under both the IP and BIP conditions, the low-ability group had lower estimation precision than the medium- and high-ability groups, meaning that examinees with higher abilities on the other dimensions obtained better estimation precision on the focal dimension. Under the IP-6 condition, examinees in the low-ability group were overestimated at low θ1 levels and underestimated at high θ1 levels, whereas examinees in the other two groups were overestimated at low θ1 levels but estimated precisely at high θ1 levels. Under the BIP-2/3 condition, a similar trend was found, but the differences between the three groups became smaller.

Figure 3. Interaction effects between dimensions under the reasonable strategy in MCAT.

Note. Low, medium, and high represent three groups of examinees whose true abilities on both θ2 and θ3 were, respectively, below −2 (including −2), between −0.5 and 0.5, and above 2 (including 2). MCAT = multidimensional computerized adaptive testing; CMAE = conditional mean absolute error; Cbias = conditional bias.

Results From the Wainer-Like Strategy (S2) in the MCAT Scenario

As shown in Table 3, the patterns of AED and MAE were similar to those under S1: their values became gradually smaller with an increasing number of blocks or a decreasing total IP size. However, the differences between the IP condition and the BIP conditions became slightly larger than under S1, and the biases under the reviewable conditions were no longer consistently positive. To clarify the reasons for these changes, further analysis of CMAE and Cbias is presented next.

Table 3.

Estimation Precision Under the Wainer-Like Strategy in MCAT.

Indexes IP-6 BIP-3/2 BIP-2/3 BIP-1/3 BIP-3/3 Nonreview
AED 0.3485 0.3292 0.3241 0.3193 0.3314 0.3162
MAE (SEMD)
θ1 0.1718(0.0021) 0.1621(0.0013) 0.1597(0.0009) 0.1574(0.0005) 0.1631(0.0015) 0.1559
θ2 0.1721(0.0020) 0.1619(0.0010) 0.1594(0.0007) 0.1569(0.0004) 0.1631(0.0010) 0.1554
θ3 0.1729(0.0018) 0.1634(0.0010) 0.1609(0.0007) 0.1584(0.0004) 0.1645(0.0011) 0.1570
Bias
θ1 −0.0007 −0.0018 −0.0017 −0.0010 −0.0030 −0.0006
θ2 0.0009 −0.0008 −0.0007 0.0002 −0.0018 0.0006
θ3 −0.0008 −0.0021 −0.0017 −0.0011 −0.0029 −0.0007

Note. MCAT = multidimensional computerized adaptive testing; IP = item pocket; BIP = block item pocket; MAE = mean absolute error; SEMD = empirical standard error of mean absolute error deviation; AED = average Euclidean distance.

The CMAE and Cbias of θ1 were used to further examine estimation precision at different ability levels and the interaction effects between dimensions. As shown in Figure 4, there was a similar overestimation at low ability levels but a more pronounced underestimation at high ability levels compared with S1; for this reason, the positive and negative deviations largely cancelled out in the overall bias under S2. In addition, estimation precision under the BIP conditions was slightly better than under the IP condition at the lowest and highest ability levels. The positive deviation in Cbias was 0.0334 at the θ1 = −2.5 level under the IP condition, whereas those under the BIP conditions were all less than 0.011, and the negative deviations in Cbias at the θ1 = 2.5 level also became smaller under the BIP conditions. These results indicate that the BIP method was more effective in keeping low-ability examinees from getting positively biased estimates and also reduced the underestimation of high-ability examinees when the Wainer-like strategy was adopted.

Figure 4. Estimation precision at different θ1 levels under the Wainer-like strategy in MCAT.

Note. CMAE = conditional mean absolute error; MCAT = multidimensional computerized adaptive testing; Cbias = conditional bias; IP = item pocket; BIP = block item pocket.

The results for the interaction effect (see Figure 5) showed that estimation precision in the low- and high-ability groups was lower than in the medium-ability group at all focal ability levels under the IP-6 condition, and the overestimation at low ability levels was more evident in these two groups. In contrast, there was no obvious interaction effect under the BIP-2/3 condition because the differences in estimation precision between the three groups were small. These results indicate that examinees who were weak on only one dimension or weak on all dimensions were likely to obtain positively biased estimates on the weak dimension through the Wainer-like strategy when the IP method was used, whereas this was harder to accomplish when the BIP method was used.

Figure 5. Interaction effects between dimensions under the Wainer-like strategy in MCAT.

Note. CMAE = conditional mean absolute error; MCAT = multidimensional computerized adaptive testing; Cbias = conditional bias.

Discussion

Allowing examinees to review items during CAT administration has always been a concern for test developers. The BIP method proposed in this article can provide item review opportunities for examinees without significantly reducing the estimation precision of their ability estimates in either UCAT or MCAT. In the BIP method, the IP design ensures that item review works in almost the same way as in the IP method, inheriting its advantages for item review. The block design ensures that the CAT maintains an appropriately sized available IP throughout the test, improving estimation precision by alleviating the problems caused by a large IP.

In terms of estimation precision, the block design helps the BIP method maintain good estimation precision in reviewable CAT. The simulation results indicated that the BIP method with more blocks had better estimation precision when examinees adopted the reasonable strategy. The reason may be that blocks reduce the impact of a large IP: too many items stored in the IP at the same time may affect subsequent item selection, but if one large IP is divided equally across several blocks, items confirmed from the sub-IPs of previous blocks can provide information for item selection in subsequent blocks, thereby improving the estimation precision of the CAT.

In addition, the results indicated that the BIP method was efficient in reducing the impact of the Wainer-like strategy on estimation precision. The original Wainer strategy cannot be fully implemented under the IP design because of the limited IP size and the way item review is handled: because no responses to the reviewable items in the IP are recorded, it is not feasible for examinees to fully adopt the Wainer strategy. Specifically, examinees can only put several moderately difficult items, rather than increasingly easy items, into the IP at the beginning, and they can only defer several easy items, rather than all easy items, during the test. Because the IP design limits the Wainer strategy to a large extent, the influence of the Wainer-like strategy on estimation precision is only slightly greater than that of the reasonable strategy. The block design also helps to alleviate the impact of the Wainer-like strategy on estimation precision, probably because blocks help to balance the review opportunities at each stage of the test. Specifically, a single large IP gives examinees many opportunities to manipulate items at the beginning of the CAT, and manipulation strategies such as the Wainer-like strategy may influence the CAT procedure because the first several responses are critically important in locating examinees’ abilities (Chang & Ying, 2008). This problem is alleviated in the BIP method because only a relatively small sub-IP is available at the early stage of the CAT.

In terms of item review, the IP design gives examinees the autonomy to review items, and examinees can adopt an answering strategy similar to the review strategy used in P&P tests (Han, 2013). Although the BIP method inherits the advantages of the IP method regarding item review, the additional block design inevitably increases the restrictions on item review because examinees need to empty the sub-IP before proceeding to the next block. However, the number of blocks required by the BIP method is much smaller than that required by the successive block method. In the present study, three blocks were enough for the BIP method to maintain acceptable estimation precision in the simulated fixed-length MCAT scenarios with 42 items, whereas at least four blocks were needed for the successive block method to effectively reduce the influence of the Wainer strategy in a simulated fixed-length UCAT with 28 items (Stocking, 1997). Therefore, the authors speculate that the block design will not significantly increase restrictions on item review in the BIP method, although this conclusion needs to be verified through further research.

In sum, the IP method and the BIP method have their own advantages. The IP method is recommended under conditions with small IP sizes to provide less restricted item review, whereas the BIP method is recommended under conditions with large IP sizes to ensure the estimation precision of reviewable CAT.

Limitations and Future Directions

The BIP method is still a restricted approach to item review; that is, once an item is finalized, the answer cannot be changed even if examinees later realize their mistakes. Hence, there are still some differences between traditional P&P testing and reviewable CAT equipped with the BIP method. Moreover, it is noteworthy that a simulation study cannot examine the various psychological effects of the BIP method in real situations. The BIP method may perform even better relative to the nonreview condition in real situations if adverse psychological effects are relieved by item review.

In the future, the effectiveness of the IP and BIP methods can be investigated in operational reviewable CAT. Considering that true ability is not available in real situations, new indicators such as the standard error of θ can be adopted as evaluation criteria. The standard error of θ was not adopted in this article because nonreview CAT always performs ideally in simulations, so it would not provide more useful information than MAE and bias in a simulation study; however, it may be very useful in real settings, where examinees’ responses are no longer ideal.

Although this study provides some evidence that reviewable MCAT can be realized with the BIP method, there are still some limitations in generalizing the results to more complicated CAT situations because this study does not take additional important factors into account in the implementation of an operational CAT. The effects of the BIP method in more complicated CAT scenarios are still unknown. For example, the exposure rate of moderately difficult items might increase if most examinees always put the first several items into the IP. Therefore, factors related to the design of CAT testing programs, including item exposure, selection algorithm, and ability estimation method, can be systematically varied in future studies to examine whether they will affect the performance of the BIP method.

Another future direction is to investigate the performance of the BIP method in variable-length CAT. Researchers can vary the total IP size in proportion to the test length. For example, test developers can group every five items into a block and provide a sub-IP of size one for each block. In this way, the opportunity for item review is approximately proportional to the test length.
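As an illustration of that suggestion (a sketch of one possible rule, not a design evaluated in this study), block boundaries and sub-IP sizes could be derived from the realized test length:

```python
import math

def bip_layout(test_length, items_per_block=5, sub_ip_size=1):
    """One block per `items_per_block` items, each with a sub-IP of `sub_ip_size`,
    so the total review opportunity grows roughly in proportion to test length."""
    n_blocks = math.ceil(test_length / items_per_block)
    total_ip = n_blocks * sub_ip_size
    return n_blocks, sub_ip_size, total_ip

print(bip_layout(28))   # (6, 1, 6)
print(bip_layout(42))   # (9, 1, 9)
```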


1. Note that the block here is only designed to limit item review, which is different from its function in multistage testing (Zheng & Chang, 2015).

Footnotes

Authors’ Note: Zhe Lin is affiliated with the School of Psychology, Beijing Normal University, China. Ping Chen and Tao Xin are affiliated with the Collaborative Innovation Center of Assessment toward Basic Education Quality, Beijing Normal University, China.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Natural Science Foundation of China (Grant No. U1911201) and the Research Program Funds of the Collaborative Innovation Center of Assessment toward Basic Education Quality (Grant Nos. 2019-01-082-BZK01 and 2019-01-082-BZK02).

Supplemental Material: Supplemental material for this article is available online.

References

1. Benjamin L. T., Cavell T. A., Shallenberger W. R., III (1984). Staying with initial answers on objective tests: Is it a myth? Teaching of Psychology, 11(3), 133–141.
2. Bowles R., Pommerich M. (2001, April). An examination of item review on a CAT using the specific information item selection algorithm [Paper presentation]. National Council on Measurement in Education Annual Meeting, Seattle, WA, United States.
3. Chang H. H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80(1), 1–20.
4. Chang H. H., Ying Z. L. (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73(3), 441–450.
5. Finkelman M., Roussos N. L. A. (2009). A conditional exposure control method for multidimensional adaptive testing. Journal of Educational Measurement, 46(1), 84–103.
6. Frey A., Seitz N. N. (2009). Multidimensional adaptive testing in educational and psychological measurement: Current state and future challenges. Studies in Educational Evaluation, 35(2), 89–94.
7. Haley S. M., Ni P., Ludlow L. H., Fragala-Pinkham M. A. (2006). Measurement precision and efficiency of multidimensional computer adaptive testing of physical functioning using the Pediatric Evaluation of Disability Inventory. Archives of Physical Medicine and Rehabilitation, 87(9), 1223–1229.
8. Han K. T. (2013). Item pocket method to allow response review and change in computerized adaptive testing. Applied Psychological Measurement, 37(4), 259–275.
9. Harvill L. M., Davis G., III (1997). Medical students’ reasons for changing answers on multiple-choice tests. Academic Medicine, 72(10, Suppl. 1), 97–99.
10. Lunz M. E., Bergstrom B. A., Wright B. D. (1992). The effect of review on student ability and test efficiency for computerized adaptive tests. Applied Psychological Measurement, 16(1), 33–40.
11. Mulder J., van der Linden W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74(2), 273–296.
12. Olea J., Revuelta J., Ximénez M. C., Abad F. J. (2000). Psychometric and psychological effects of review on computerized fixed and adaptive tests. Psicológica, 21(1–2), 157–173.
13. Papanastasiou E. C. (2005). Item review and the rearrangement procedure: Its process and its results. Educational Research and Evaluation, 11(4), 303–321.
14. Papanastasiou E. C., Reckase M. D. (2007). A “rearrangement procedure” for scoring adaptive tests with review options. International Journal of Testing, 7(4), 387–407.
15. Reckase M. D. (1975, April). The effect of item choice on ability estimation when using a simple logistic tailored testing model [Paper presentation]. American Educational Research Association Annual Meeting, Washington, DC, United States.
16. Segall D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61(2), 331–354.
17. Stocking M. L. (1997). Revising item responses in computerized adaptive tests: A comparison of three models. Applied Psychological Measurement, 21(2), 129–142.
18. Stone G. E., Lunz M. E. (1994). The effect of review on the psychometric characteristics of computerized adaptive tests. Applied Measurement in Education, 7(3), 211–222.
19. Stylianou-Georgiou A., Papanastasiou E. C. (2017). Answer changing in testing situations: The role of metacognition in deciding which answers to review. Educational Research and Evaluation, 23(3–4), 102–118.
20. Vispoel W. P. (1998). Reviewing and changing answers on computer-adaptive and self-adaptive vocabulary tests. Journal of Educational Measurement, 35(4), 328–345.
21. Vispoel W. P., Clough S. J., Bleiler T. (2005). A closer look at using judgments of item difficulty to change answers on computerized adaptive tests. Journal of Educational Measurement, 42(4), 331–350.
22. Vispoel W. P., Hendrickson A. B., Bleiler T. (2000). Limiting answer review and change on computerized adaptive vocabulary tests: Psychometric and attitudinal results. Journal of Educational Measurement, 37(1), 21–38.
23. Vispoel W. P., Rocklin T. R., Wang T. Y., Bleiler T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36(2), 141–157.
24. Waddell D. L., Blankenship J. C. (1994). Answer changing: A meta-analysis of the prevalence and patterns. The Journal of Continuing Education in Nursing, 25(4), 155–158.
25. Wainer H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12(1), 15–20.
26. Wang C., Chang H. H., Boughton K. A. (2011). Kullback-Leibler information and its applications in multi-dimensional adaptive testing. Psychometrika, 76(1), 13–39.
27. Wang S., Fellouris G., Chang H. H. (2017). Computerized adaptive testing that allows for response revision: Design and asymptotic theory. Statistica Sinica, 27(4), 1987–2010.
28. Wang S., Fellouris G., Chang H. H. (2019). Statistical foundations for computerized adaptive testing with response revision. Psychometrika, 84(2), 375–394.
29. Wang W. C., Chen P. H. (2004). Implementation and measurement efficiency of multidimensional computerized adaptive testing. Applied Psychological Measurement, 28(5), 295–316.
30. Wise S. L. (1994). Understanding self-adapted testing: The perceived control hypothesis. Applied Measurement in Education, 7(1), 15–24.
31. Wise S. L. (1996, April). A critical analysis of the arguments for and against item review in computerized adaptive testing [Paper presentation]. National Council on Measurement in Education Annual Meeting, New York, NY, United States.
32. Wise S. L. (1997, April). Overview of practical issues in a CAT program [Paper presentation]. National Council on Measurement in Education Annual Meeting, Chicago, IL, United States.
33. Yen Y. C., Ho R. G., Liao W. W., Chen L. J. (2012). Reducing the impact of inappropriate items on reviewable computerized adaptive testing. Journal of Educational Technology & Society, 15(2), 231–243.
34. Zheng Y., Chang H. H. (2015). On-the-fly assembled multistage adaptive testing. Applied Psychological Measurement, 39(2), 104–118.



