The Optimal Item Pool Design in Multistage Computerized Adaptive Tests With the p-Optimality Method

Lihong Yang; Mark D Reckase

doi:10.1177/0013164419901292

2020 Feb 6;80(5):955–974. doi: 10.1177/0013164419901292

PMCID: PMC7425329 PMID: 32855566

Abstract

The present study extended the p-optimality method to the multistage computerized adaptive test (MST) context to develop optimal item pools that support different MST panel designs under different test configurations. Using the Rasch model, simulated optimal item pools were generated with and without the practical constraint of exposure control. A total of 72 simulated optimal item pools were generated and evaluated with an overall sample and conditional samples using various statistical measures. Results showed that the optimal item pools built with the p-optimality method provide sufficient measurement accuracy under all simulated MST panel designs. Exposure control affected the item pool size, but not the item distributions and item pool characteristics. This study demonstrated that the p-optimality method can adapt to MST item pool design, facilitate the MST assembly process, and improve its scoring accuracy.

Keywords: item pool design, multistage computerized adaptive testing, item pool development

In recent years, the multistage computerized adaptive test (MST) has gained increasing popularity in the field of educational measurement and operational testing (Chen et al., 2014; Luecht et al., 2006; Yan et al., 2014; Zheng et al., 2012). Many testing programs have successfully shifted from either paper-and-pencil tests or computerized adaptive tests (CATs) to MSTs over the past decade (e.g., the Certified Public Accountants examination, the Graduate Record Examination, the Program for the International Assessment of Adult Competencies, and the Educational Records Bureau Comprehensive Testing Program).

With MST, adaptation occurs at the module level instead of the item level as in CAT, which results in fewer adaptation points, more efficient test assembly, and well-controlled content balancing (Berger, 1994; Luecht, 2000). In contrast to standard CAT programs, however, MST requires a more complex test assembly process. For instance, it involves a preconfigured panel structure with various routing pathways and multiple modules that are preassembled before administration. A well-designed item pool is a prerequisite for accommodating this complex assembly process. However, although multistage tests emerged in the 1950s and 1960s (Angoff & Huddleston, 1958; Cronbach & Gleser, 1965; Linn et al., 1968), little research has attended to the problem of their item pool design. In designing an operational item pool for MSTs, an optimal blueprint with desirable psychometric characteristics is sought. The blueprint is termed optimal in that (1) each panel provides a sufficient level of measurement precision conditional on ability across a wide range of the scale and (2) item utilization is well balanced during parallel panel construction and the cost of item creation is kept to a minimum.

In the related literature, two algorithms for optimizing item pools have been developed: integer programming, represented by the shadow-test approach (Veldkamp & van der Linden, 2000), and the heuristic approach, represented by the p-optimality method (Reckase, 2003, 2010). Veldkamp (2014) explored the possibility of using the integer programming approach to design optimal item pools for MSTs. Assuming that the number of modules and all the categorical, quantitative, and logical constraints are known, several objective functions in the integer programming models are optimized during MST optimal item pool design. The blueprint is optimal in the sense that the effort or “cost” of item pool creation, as well as the number of unused items in the pool, is minimized (van der Linden et al., 2000). Although the integer programming approach has been discussed theoretically for MST item pool design, it has never been applied in either simulation or empirical studies.

Another line of research in optimal item pool design is the p-optimality method initially proposed by Reckase (2003, 2010) for CATs. The basic idea is to randomly select examinees from a target population and administer to them a series of items with desirable attributes. The selected items are sorted into a set of “bins,” which are defined on the examinees’ proficiency scale. Within each bin, all the test items are treated as equivalent. The width of the bin is determined by the desired information for the target test and the item response theory model used for parameter calibration. As shown in Figure 1, under the Rasch model, an item whose information is at least 95% of the maximum possible has a difficulty within about 0.35 logit of the examinee's θ value. If an item pool meets the criterion of always having items available for selection that provide 95% or more of the maximum possible information, the item pool is called 0.95 p-optimal. In other words, if the item difficulty is 0.35 logit away from the current ability estimate, the maximum loss of information for the examinee with the selected item is 5% compared with a perfect match between the selected item and the current ability estimate.

Figure 1. Item information function specified by a Rasch model.
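The relation between the item-ability offset and the share of maximum information can be checked numerically. The following Python sketch assumes the logistic Rasch model without a scaling constant (D = 1), for which item information is P(1 − P) with a maximum of 0.25 at θ = b; the function names are illustrative and do not come from the original study. The exact percentage at a given offset depends on the scaling convention, so the sketch only checks that an offset of 0.35 logit retains at least 95% of the maximum information.

```python
import math


def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model (D = 1 assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def rasch_info(theta: float, b: float) -> float:
    """Rasch item information, P * (1 - P); its maximum is 0.25 at theta == b."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)


def relative_info(offset: float) -> float:
    """Information at |theta - b| == offset as a fraction of the maximum."""
    return rasch_info(offset, 0.0) / rasch_info(0.0, 0.0)


ratio_at_bin_edge = relative_info(0.35)
meets_095_criterion = ratio_at_bin_edge >= 0.95
```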

After one examinee takes the simulated CAT, the optimal item set for that examinee is allocated into bins. Another examinee is then selected and the same procedure is repeated. The minimum common item set for these individual examinees is determined by taking the union of the two individual item sets. The process continues until the number of items in the union reaches an asymptote (Reckase, 2010). In this way, the bins tally the number of administered items needed for the corresponding range of the proficiency scale. The p-optimality method is easy to implement, and its application does not require knowledge of a specialized subfield of psychometrics or specialized software. Currently, many CAT programs, such as the National Council Licensure Examination and the Armed Services Vocational Aptitude Battery, use the p-optimality method in designing their item pools (He & Diao, 2014).
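The union step can be sketched in terms of bin tallies. The Python example below assumes that, because items within a bin are treated as equivalent, the union of two item sets reduces to the per-bin maximum of the two tallies; this reading, the integer bin indices, and the example counts are assumptions made for illustration rather than details given in the article.

```python
from collections import Counter


def union_of_bin_counts(pool: Counter, examinee: Counter) -> Counter:
    """Smallest bin tally that covers both tallies (per-bin maximum)."""
    # For Counter objects, the | operator keeps the larger count for each key.
    return pool | examinee


# Hypothetical tallies: bin index -> number of items one simulated examinee needed.
examinee_tallies = [
    Counter({0: 3, 1: 2}),
    Counter({0: 1, 1: 4, 2: 1}),
]

pool = Counter()
pool_sizes = []  # pool size after each examinee, used to watch for the asymptote
for tally in examinee_tallies:
    pool = union_of_bin_counts(pool, tally)
    pool_sizes.append(sum(pool.values()))
```

In a full simulation the loop would continue over many more examinees and stop once `pool_sizes` no longer grows.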

The p-optimality method has been studied extensively in designing optimal item pools for CAT and has been shown to be efficient in reducing the cost of item pool design, simplifying test assembly, and enhancing measurement accuracy (Gu, 2007; He & Reckase, 2014; Mao, 2014; Zhou, 2014). In a simulation study, Reckase (2006) applied the p-optimality method to the MST context to explore the best configuration of a two-stage MST with a short test length (e.g., a 20-item test). However, the hands-on procedures for adapting this method to the MST context were not explicitly illustrated. Additionally, the effect of critical factors in MST (e.g., panel design, test length, routing test length, and exposure control) on the construction of p-optimal item pools was unclear. This article aims to address these concerns. First, the p-optimality method, initially proposed for CAT, was extended to the MST context, and its application to constructing MST item pool blueprints was demonstrated. Second, the effect of the critical factors in MST construction, such as the choice of MST panel design, test length, routing test length, and the practical constraint of exposure control, on the performance of optimal item pools was investigated in a comprehensive simulation study.

Simulation Study

MST Panel Design and Test Specifications

During the simulation, four types of MST panel designs were chosen among the most popularly investigated ones in the literature: MST 1-2, MST 1-3, MST 1-2-2, and MST 1-2-3. For each MST panel design, three factors of test configuration were simulated: (1) overall test length, (2) proportion of the routing test length, and (3) exposure control.

The overall test length was fixed at 20, 40, and 60 items to represent short, medium, and long MSTs, respectively. The routing test at the first stage was simulated to have 20%, 30%, or 40% of the overall test length, and the numbers of items administered in the modules of later stages were adjusted accordingly (see Table 1 for the detailed test configurations). Specifically, for the two-stage tests (MST 1-2 and MST 1-3), the second-stage modules contained 80%, 70%, and 60% of the total test length, respectively. For the three-stage tests (MST 1-2-2 and MST 1-2-3), the corresponding proportions for the second-stage modules were 40%, 30%, and 20%, and the proportion for the final-stage modules was always 40%. The rationale for allocating 40% to the third stage was to ensure that enough items are administered at the final stage for measurement accuracy.

The MSTs were also simulated with and without the practical constraint of exposure control. The inverse proportion method discussed in Zheng et al. (2012) was adopted for exposure control. With this method, the number of parallel test forms to be assembled for each module is inversely proportional to the number of modules in the stage to which it belongs. For example, in the MST 1-2 design, 18 parallel test forms are available to be selected and assembled at the initial stage, and nine forms are available for the easy module and nine for the difficult module at the second stage. The numbers of parallel test forms in the different MST designs with exposure control are summarized in Table 2.

Table 1.

MST Panel Design and the Number of Items Across Different Stages.

			Stages
Panel design	Test length	Routing proportions (%)	1	2	3
MST 1-2	20	20	4	16 (16)
		30	6	14 (14)
		40	8	12 (12)
	40	20	8	32 (32)
		30	12	28 (28)
		40	16	24 (24)
	60	20	12	48 (48)
		30	18	42 (42)
		40	24	36 (36)
MST 1-3	20	20	4	16 (16, 16)
		30	6	14 (14, 14)
		40	8	12 (12, 12)
	40	20	8	32 (32, 32)
		30	12	28 (28, 28)
		40	16	24 (24, 24)
	60	20	12	48 (48, 48)
		30	18	42 (42, 42)
		40	24	36 (36, 36)
MST 1-2-2	20	20	4	8 (8)	8 (8)
		30	6	6 (6)	8 (8)
		40	8	4 (4)	8 (8)
	40	20	8	16 (16)	16 (16)
		30	12	12 (12)	16 (16)
		40	16	8 (8)	16 (16)
	60	20	12	24 (24)	24 (24)
		30	18	18 (18)	24 (24)
		40	24	12 (12)	24 (24)
MST 1-2-3	20	20	4	8 (8)	8 (8, 8)
		30	6	6 (6)	8 (8, 8)
		40	8	4 (4)	8 (8, 8)
	40	20	8	16 (16)	16 (16, 16)
		30	12	12 (12)	16 (16, 16)
		40	16	8 (8)	16 (16, 16)
	60	20	12	24 (24)	24 (24, 24)
		30	18	18 (18)	24 (24, 24)
		40	24	12 (12)	24 (24, 24)

Note. MST = multistage computerized adaptive test. Numbers in parentheses are the numbers of items in the other module(s) at the same stage.

Table 2.

The Number of Test Forms in Different MST Designs With Exposure Control.

	Stages
Panel design	1	2	3
1-2	18	9 (9)
1-3	18	6 (6) (6)
1-2-2	18	9 (9)	9 (9)
1-2-3	18	9 (9)	6 (6) (6)


Note. MST = multistage computerized adaptive test. The number in parentheses means the number of test forms at the same stage for another/other module(s).
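The configurations in Tables 1 and 2 follow from simple arithmetic, which the Python sketch below reproduces. It assumes that a panel is labeled by its module counts per stage (e.g., "1-2-3"), that the final stage of a three-stage panel always receives 40% of the test length, and that the number of forms per module is 18 divided by the number of modules at that stage; the function names are illustrative.

```python
def stage_lengths(n_stages: int, test_length: int, routing_pct: int) -> list[int]:
    """Items per module at each stage, following the proportions behind Table 1."""
    routing = test_length * routing_pct // 100
    if n_stages == 2:
        return [routing, test_length - routing]
    if n_stages == 3:
        final = test_length * 40 // 100  # the last stage always gets 40%
        return [routing, test_length - routing - final, final]
    raise ValueError("only two- and three-stage panels are covered")


def parallel_forms(panel: str, base_forms: int = 18) -> list[int]:
    """Forms per module at each stage under the inverse proportion rule (Table 2)."""
    return [base_forms // int(n_modules) for n_modules in panel.split("-")]


short_three_stage = stage_lengths(3, 20, 30)
forms_1_2_3 = parallel_forms("1-2-3")
```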

Procedures of Constructing the Optimal Item Pool Blueprint

In the present study, the Rasch model was chosen as the measurement model for item calibration. The bin width was set to 0.35 so that any item within a bin provides at least 95% of its maximum information. We assumed that the target test is an achievement test and that examinees are drawn from a standard normal distribution.

The item pool blueprints using the p-optimality method were constructed in the following three steps. Step 1 creates the item pool blueprint for the first-stage routing test in MST. Initially, a cutoff score (M_t) used for routing examinees from the first stage to the second-stage modules was established. This cutoff score is a quantile of the ability distribution chosen so that examinees are divided equally among the second-stage modules. More specifically, for panel designs having two modules at stage two (e.g., MST 1-2, MST 1-2-2, MST 1-2-3), the cutoff score is located at the 50th percentile of the ability distribution. Following the same logic, the MST 1-3 design requires two cutoff scores corresponding to the 33rd and 67th percentiles of the ability distribution, respectively. Once these cutoff scores were determined, the required number of items for the first-stage routing test was selected uniformly at random from the bin(s) containing M_t. The bin location(s) and item counts within these selected bins composed the blueprint of the optimal item pool for the first-stage MST.
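As a sketch of Step 1, the Python example below computes the routing cutoffs as quantiles of a standard normal ability distribution and locates the bins that contain them. The bin alignment (bin k covers the interval from k × 0.35 up to, but not including, (k + 1) × 0.35) is an assumption, since the article does not state where bin boundaries fall, and the function names are illustrative.

```python
import math
from statistics import NormalDist

BIN_WIDTH = 0.35  # bin width used in the article


def routing_cutoffs(percentiles: list[int]) -> list[float]:
    """Cutoff scores M_t as quantiles of a standard normal ability distribution."""
    ability = NormalDist(mu=0.0, sigma=1.0)
    return [ability.inv_cdf(p / 100) for p in percentiles]


def bin_index(theta: float, width: float = BIN_WIDTH) -> int:
    """Index k of the bin [k * width, (k + 1) * width) containing theta.

    The alignment of bin edges at multiples of the width is an assumption.
    """
    return math.floor(theta / width)


two_module_cutoffs = routing_cutoffs([50])        # MST 1-2, 1-2-2, 1-2-3
three_module_cutoffs = routing_cutoffs([33, 67])  # MST 1-3
routing_bins = [bin_index(c) for c in three_module_cutoffs]
```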

Step 2 generates the blueprint for the second-stage modules. After taking the routing test, each examinee was routed to one module at the second stage by comparing the examinee’s ability estimate with the cutoff score. Regardless of which module the examinee took, items matching his or her true ability were administered until one of two criteria was met: the module length was exhausted or the preset information conditional on the specific examinee was reached. If the conditional information was reached before the module length was exhausted, a second examinee was randomly selected to take this module. After this simulation process, the bin locations and item counts within the selected bins were recorded to form the blueprint of the optimal item pool at the second stage.
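One possible reading of the Step 2 stopping rule is sketched below in Python. Several details are assumptions that the article does not specify: each administered item is tallied in the bin containing the examinee's true ability, its difficulty is taken to be the bin midpoint, the information target `info_target` is a hypothetical value, and successive examinees keep adding to the same module tally until the module length is reached.

```python
import math
from collections import Counter

BIN_WIDTH = 0.35


def rasch_info(theta: float, b: float) -> float:
    """Rasch item information P * (1 - P), with D = 1 assumed."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def fill_module(thetas: list[float], module_length: int, info_target: float) -> Counter:
    """Tally items per bin until the module length is exhausted.

    Each examinee receives items from the bin containing his or her true
    ability until the information target or the module length is reached;
    if length remains, the next examinee continues filling the module.
    """
    bins = Counter()
    for theta in thetas:
        k = math.floor(theta / BIN_WIDTH)
        b = (k + 0.5) * BIN_WIDTH  # bin midpoint as a stand-in item difficulty
        info = 0.0
        while sum(bins.values()) < module_length and info < info_target:
            bins[k] += 1
            info += rasch_info(theta, b)
        if sum(bins.values()) >= module_length:
            break
    return bins


# Two hypothetical examinees routed to the same 16-item module.
module_tally = fill_module([0.1, 0.5], module_length=16, info_target=2.0)
```

With these inputs the first examinee reaches the information target before the module is full, so the second examinee supplies the remaining items from a neighboring bin.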

Step 3 is only required for the three-stage tests (e.g., MST 1-2-2 and MST 1-2-3). As in the routing procedure at Step 1, the third-stage module an examinee took was decided by comparing his or her ability estimate with the cutoff scores. However, this routing differs from the first-stage routing procedure in two respects. First, the examinees’ ability estimates used for comparison are obtained from the previously administered items instead of being the true abilities. Second, the quantiles are selected based on the observed score distribution of the examinees. More specifically, the 25th and 75th percentiles of the observed score distribution are used as the cutoff scores for the MST 1-2-2 design, and the 33rd and 67th percentiles of the observed score distribution are used for the MST 1-2-3 design, so that an equal proportion of examinees is routed to each module at the final stage. When an examinee was routed to a third-stage module, he or she was administered the items corresponding to his or her true ability level. The stopping criteria and procedures used at Step 2 were applied to determine the bin locations and item numbers within these bins for the third stage.
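Routing on percentiles of the observed score distribution can be sketched as follows in Python. The example uses the 33rd and 67th percentiles (the MST 1-2-3 case); the evenly spaced ability estimates are placeholder data, and the function names are illustrative rather than part of the original procedure.

```python
from bisect import bisect_right
from statistics import quantiles


def percentile_cutoffs(estimates: list[float], percentiles: list[int]) -> list[float]:
    """Cutoff scores taken from the observed distribution of ability estimates."""
    cuts = quantiles(estimates, n=100)  # 99 cut points; cuts[p - 1] is the p-th percentile
    return [cuts[p - 1] for p in percentiles]


def route(theta_hat: float, cutoffs: list[float]) -> int:
    """Index of the module (0 = easiest) whose ability interval contains theta_hat."""
    return bisect_right(cutoffs, theta_hat)


# Placeholder ability estimates, evenly spaced between -1.5 and 1.49.
estimates = [i / 100 - 1.5 for i in range(300)]
cutoffs = percentile_cutoffs(estimates, [33, 67])
modules = [route(e, cutoffs) for e in estimates]
module_sizes = [modules.count(m) for m in range(3)]
```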

To reduce sampling error, 100 replications were conducted and the results were averaged and rounded to integers to form the final bin count table for each MST panel design. In the present study, under the unidimensionality assumption, item distributions for all hypothetical content areas were assumed to be the same for the simulated optimal item pool, and the impact of content areas on item pool design was not considered.

MST Assembly Process

Based on the optimal item pool blueprint obtained from the above procedures, the administered tests were assembled for all the MST panel types and test configurations (see Table 1). When no exposure control was considered, the items required within each bin were uniformly selected at random to form the MSTs.

When exposure control was implemented, multiple parallel modules at each stage were composed by randomly choosing the number of items required within each bin. The items in each module shared the same frequency distribution as those in the other parallel modules. Depending on the type of MST panel design, test length, and routing test length, the number of items was about 7 to 11 times the number required without exposure control. A total of 72 types of MSTs were ultimately generated.
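The count of 72 follows from fully crossing the four simulation factors (4 panel designs × 3 test lengths × 3 routing proportions × 2 exposure control conditions). A short Python sketch of the enumeration, with illustrative variable names:

```python
from itertools import product

PANELS = ["1-2", "1-3", "1-2-2", "1-2-3"]
TEST_LENGTHS = [20, 40, 60]
ROUTING_PCTS = [20, 30, 40]
EXPOSURE_CONTROL = [False, True]

# Each tuple is (panel design, test length, routing %, exposure control on/off).
conditions = list(product(PANELS, TEST_LENGTHS, ROUTING_PCTS, EXPOSURE_CONTROL))
```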

Evaluation Criteria

Each of the 72 simulated optimal item pools was evaluated for its overall and conditional performance in the following three aspects: the precision of ability estimation, classification accuracy, and item usage. The evaluation was conducted by administering the assembled tests to two types of target examinee populations. For the overall evaluation, 5,000 simulees randomly selected from the standard normal distribution were tested. For the conditional evaluation, 100 simulees were tested at each fixed θ value over the range from −3.5 to 3.5 at intervals of 0.05.
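The two evaluation samples can be generated as in the following Python sketch. The random seed and function names are hypothetical; the sample sizes and the θ grid come from the description above, which yields 141 grid points and therefore 14,100 simulees for the conditional evaluation.

```python
import random


def overall_sample(n: int = 5000, seed: int = 0) -> list[float]:
    """True abilities drawn from a standard normal distribution (seed is arbitrary)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def conditional_grid(lo: float = -3.5, hi: float = 3.5, step: float = 0.05) -> list[float]:
    """Fixed theta values from lo to hi inclusive, rounded to avoid float drift."""
    n_points = round((hi - lo) / step) + 1
    return [round(lo + i * step, 2) for i in range(n_points)]


SIMULEES_PER_POINT = 100

overall = overall_sample()
grid = conditional_grid()
conditional = [theta for theta in grid for _ in range(SIMULEES_PER_POINT)]
```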

The evaluation criteria used in this study include the following:

Precision of ability estimation. Bias and root mean square error (RMSE) were computed using Equations (1) and (2).

Bias = \frac{1}{N} \sum_{j = 1}^{N} ({\hat{θ}}_{j} - θ_{j}),

(1)

RMSE = \sqrt{\frac{1}{N} \sum_{j = 1}^{N} {({\hat{θ}}_{j} - θ_{j})}^{2}}

(2)

where ${\hat{θ}}_{j}$ and $θ_{j}$ are the estimated and true abilities of the jth examinee. These two statistics were computed from both the overall and conditional samples. N represents the number of examinees.
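Equations (1) and (2) translate directly into code. The Python sketch below uses illustrative function names and a three-examinee toy data set.

```python
import math


def bias(theta_hat: list[float], theta: list[float]) -> float:
    """Mean signed error of the ability estimates, Equation (1)."""
    n = len(theta)
    return sum(e - t for e, t in zip(theta_hat, theta)) / n


def rmse(theta_hat: list[float], theta: list[float]) -> float:
    """Root mean square error of the ability estimates, Equation (2)."""
    n = len(theta)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(theta_hat, theta)) / n)


# Toy data: the errors are +0.5, -0.5, and 0, so they cancel in the bias
# but not in the RMSE.
true_theta = [0.0, 1.0, -1.0]
estimated_theta = [0.5, 0.5, -1.0]
toy_bias = bias(estimated_theta, true_theta)
toy_rmse = rmse(estimated_theta, true_theta)
```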

Classification accuracy of the ability estimates. Three threshold scores were used in the study: the median, the 80th percentile, and the 95th percentile of the examinees’ score distribution. The purpose of choosing these cutoff values was to check how the procedures of optimal item pool design with the p-optimality method work across a range of values. In practice, the cutoff values may vary depending on the need and purpose of each individual test.
Item exposure rate. The item exposure rate is considered high if it is greater than 0.20 (Segall et al., 1997). In CAT, an item with an exposure rate lower than 0.02 is considered underexposed (He & Reckase, 2014). This value was adopted in the current MST context for evaluating the underexposure rate of modules.
Item overlap rate. The following equation was used to calculate the average item overlap rate in the present study:

R = \frac{T / \binom{N}{2}}{\sum_{j = 1}^{N} L_{j} / N},

(3)

where T is the total number of items shared across all pairs of tests among the N examinees, the number of such pairs is N(N − 1)/2, and $L_{j}$ is the number of items administered to the jth examinee, so that the denominator is the average test length. Both the average item overlap rate and the conditional average item overlap rate were calculated.
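The remaining criteria can be sketched in Python as follows. The classification accuracy function assumes that an examinee counts as correctly classified when the true and estimated abilities fall on the same side of a cutoff, a definition the article does not spell out. The overlap rate follows Equation (3), with each examinee's test represented as a list of item identifiers; all names and the toy data are illustrative.

```python
from collections import Counter
from itertools import combinations


def classification_accuracy(theta_hat, theta, cutoff):
    """Share of examinees whose estimate and true ability fall on the same side of cutoff."""
    agree = sum((e >= cutoff) == (t >= cutoff) for e, t in zip(theta_hat, theta))
    return agree / len(theta)


def exposure_rates(tests):
    """Proportion of examinees who saw each item."""
    counts = Counter(item for test in tests for item in set(test))
    return {item: c / len(tests) for item, c in counts.items()}


def overlap_rate(tests):
    """Average item overlap rate: shared items per pair divided by mean test length."""
    n = len(tests)
    shared = sum(len(set(a) & set(b)) for a, b in combinations(tests, 2))  # T
    n_pairs = n * (n - 1) / 2
    mean_length = sum(len(t) for t in tests) / n
    return (shared / n_pairs) / mean_length


# Toy data: three examinees, two items each.
toy_tests = [[1, 2], [2, 3], [3, 4]]
rates = exposure_rates(toy_tests)
overexposed = sorted(i for i, r in rates.items() if r > 0.20)
underexposed = sorted(i for i, r in rates.items() if r < 0.02)
toy_overlap = overlap_rate(toy_tests)
toy_accuracy = classification_accuracy([0.1, -0.2, 0.3], [0.2, 0.1, 0.4], cutoff=0.0)
```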

Results

Characteristics of the Optimal Item Pools

The item frequency distributions for all simulated optimal item pools without exposure control are presented in Figures 2 through 5. These figures show that different MST panel designs generated different item pool shapes. The number of stages and the number of modules at each stage were the main factors that shaped the item frequency distributions. For the MST 1-2 design, the distribution peaked in the middle, and the distribution for the MST 1-3 panel design looked approximately bimodal. The item frequency distributions for the MST 1-2-2 and MST 1-2-3 designs were roughly trimodal. When the routing test was longer, the modes in the distributions became more pronounced. As test length increased, the distribution extended to include items with extreme values, and a larger number of items was needed within each bin.

Figure 2. Item distributions for multistage computerized adaptive test (MST) 1-2 panel design without exposure control.

Figure 3. Item distributions for multistage computerized adaptive test (MST) 1-3 panel design without exposure control.

Figure 4. Item distributions for multistage computerized adaptive test (MST) 1-2-2 panel design without exposure control.

Figure 5. Item distributions for multistage computerized adaptive test (MST) 1-2-3 panel design without exposure control.

The figures for the exposure control conditions are not presented due to page limits. For each MST panel design, the primary features of the item frequency distribution with exposure control were similar to those without exposure control. The notable difference after implementing exposure control was that the number of items was 7 to 11 times larger than without exposure control, and more items were accumulated at the initial stages than at the later ones.

Figure 6 presents the test information for all four types of MST panel designs. It shows that administering more items substantially increased information, particularly within the range between −2 and 2 for all designs. However, the routing test length did not make a significant difference for any design given a fixed test length.

Performance of the Optimal Item Pools

The overall evaluation statistics show that all the generated item pools were comparable in recovering ability at the population level. As summarized in Table 3, the overall bias for all optimal item pools was not affected by the simulation factors to any large extent. On average, the magnitudes of the overall bias were close to zero or negligible. RMSE was likewise largely unaffected by these factors. The exception was test length: the longest test (60 items) lowered RMSE by 40% to 50% relative to the shortest test (20 items).

Table 3.

Performance of the Optimal Item Pools for Ability Estimates.

		Without exposure control						With exposure control
		Bias			RMSE			Bias			RMSE
Panel design	Test length	20%^a	30%	40%	20%	30%	40%	20%	30%	40%	20%	30%	40%
MST 1-2	20	0.01	0.00	0.01	0.35	0.35	0.35	0.00	0.00	−0.01	0.36	0.37	0.38
	40	0.00	0.00	0.00	0.24	0.24	0.24	0.00	0.00	0.00	0.26	0.27	0.27
	60	0.00	0.00	0.00	0.19	0.19	0.19	0.00	0.00	0.00	0.22	0.23	0.22
MST 1-3	20	−0.01	0.00	0.00	0.34	0.34	0.34	0.00	0.00	0.00	0.34	0.34	0.34
	40	0.00	0.00	0.00	0.23	0.23	0.23	0.00	0.00	0.00	0.23	0.23	0.23
	60	0.00	0.01	0.00	0.18	0.18	0.19	0.00	0.00	0.00	0.19	0.19	0.19
MST 1-2-2	20	0.01	0.00	0.00	0.36	0.37	0.39	0.00	0.00	0.01	0.36	0.37	0.39
	40	0.00	0.00	0.00	0.24	0.24	0.25	0.00	0.00	0.00	0.24	0.25	0.25
	60	0.00	0.00	0.00	0.19	0.20	0.20	0.00	0.00	0.00	0.20	0.20	0.20
MST 1-2-3	20	−0.01	0.00	0.01	0.33	0.34	0.34	0.00	0.00	0.00	0.33	0.34	0.34
	40	0.00	0.00	0.00	0.23	0.23	0.23	0.00	0.00	0.00	0.23	0.23	0.23
	60	0.00	0.00	0.00	0.19	0.19	0.19	0.00	0.00	0.00	0.19	0.19	0.19


Note. RMSE = root mean square error; MST = multistage computerized adaptive test.

^a Routing test length.

Table 4 reports the overall classification accuracy results. Generally speaking, all the optimal item pools performed well on the three classification accuracy criteria. Under the same condition, classification accuracy at the 95th percentile was higher than at the other two thresholds. The panel design, routing test length, and exposure control had no appreciable effect on these three types of classification. Although longer tests did improve classification accuracy, the increase added little to the already high classification accuracies of the various MST panel designs. An analysis of variance was conducted to evaluate whether the classification type and the simulation factors (e.g., panel design, test length, routing test length, and exposure control) had any effect on the accuracy of classification outcomes. The results indicated that all the main factors and interaction terms were statistically significant (p < 0.05). However, their effect sizes (η²) differed. The factors with the highest η² were the classification type (η² = 0.749) and test length (η² = 0.205), which suggests that these two factors had a large impact on the accuracy of classifications. By comparison, the effect sizes of the other factors, including the interaction terms, were all below 0.024. This implies that the impact of these factors on classification accuracy was not important because they explained only a very small proportion of its overall variance.

Table 4.

Performance of the Optimal Item Pools for Classification Accuracy.

		Median			80th percentile			95th percentile
Panel design	Test length	20%^a	30%	40%	20%	30%	40%	20%	30%	40%
Without exposure control
MST 1-2	20	0.90	0.90	0.91	0.93	0.93	0.93	0.97	0.97	0.96
	40	0.92	0.93	0.93	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.94	0.94	0.94	0.96	0.96	0.96	0.98	0.98	0.98
MST 1-3	20	0.90	0.90	0.91	0.93	0.93	0.94	0.97	0.98	0.97
	40	0.94	0.93	0.93	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.94	0.95	0.94	0.96	0.96	0.96	0.98	0.99	0.98
MST 1-2-2	20	0.90	0.91	0.91	0.93	0.93	0.93	0.97	0.97	0.97
	40	0.93	0.92	0.93	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.93	0.93	0.94	0.96	0.96	0.96	0.98	0.99	0.98
MST 1-2-3	20	0.90	0.91	0.90	0.93	0.93	0.93	0.97	0.97	0.97
	40	0.94	0.94	0.94	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.95	0.94	0.94	0.96	0.96	0.96	0.98	0.98	0.98
With exposure control
MST 1-2	20	0.90	0.91	0.90	0.93	0.92	0.93	0.97	0.97	0.97
	40	0.92	0.93	0.93	0.94	0.95	0.94	0.98	0.98	0.98
	60	0.94	0.94	0.94	0.96	0.95	0.96	0.98	0.98	0.98
MST 1-3	20	0.90	0.90	0.91	0.93	0.94	0.93	0.97	0.98	0.97
	40	0.93	0.93	0.93	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.95	0.95	0.94	0.96	0.96	0.96	0.99	0.98	0.99
MST 1-2-2	20	0.89	0.91	0.90	0.93	0.93	0.92	0.97	0.96	0.96
	40	0.93	0.93	0.93	0.95	0.95	0.94	0.98	0.98	0.98
	60	0.94	0.94	0.94	0.96	0.96	0.95	0.99	0.98	0.99
MST 1-2-3	20	0.90	0.91	0.91	0.93	0.94	0.92	0.97	0.97	0.97
	40	0.93	0.93	0.93	0.95	0.95	0.95	0.98	0.98	0.98
	60	0.94	0.95	0.94	0.96	0.96	0.96	0.99	0.98	0.99


Note. MST = multistage computerized adaptive test.

^a Routing test length.

Table 5 reports the item overlap rate and item exposure rate for all optimal item pools. Before exposure control was implemented, both the item overlap rate and the item exposure rate were high (e.g., the overlap rate was greater than 0.4 and the exposure rate was greater than 0.3). After the implementation of exposure control, both criteria were effectively controlled within an ideal range: the item overlap rate decreased to around 0.06, and the item exposure rate fell to 0.06 or below.

Table 5.

Performance of the Optimal Item Pools for Item Utilization.

		Without exposure control						With exposure control
		Overlap rate			Exposure rate			Overlap rate			Exposure rate
Panel design	Test length	20%^a	30%	40%	20%	30%	40%	20%	30%	40%	20%	30%	40%
MST 1-2	20	0.61	0.65	0.70	0.52	0.52	0.52	0.06	0.06	0.06	0.04	0.04	0.04
	40	0.60	0.65	0.70	0.53	0.52	0.52	0.06	0.06	0.06	0.04	0.04	0.03
	60	0.60	0.65	0.70	0.53	0.53	0.53	0.06	0.06	0.06	0.04	0.03	0.04
MST 1-3	20	0.48	0.56	0.60	0.38	0.35	0.36	0.06	0.06	0.06	0.06	0.06	0.06
	40	0.47	0.53	0.60	0.41	0.37	0.36	0.06	0.06	0.06	0.06	0.06	0.05
	60	0.47	0.53	0.60	0.39	0.38	0.39	0.06	0.06	0.06	0.05	0.06	0.06
MST 1-2-2	20	0.61	0.65	0.70	0.54	0.52	0.53	0.06	0.06	0.06	0.05	0.05	0.05
	40	0.60	0.65	0.70	0.53	0.53	0.53	0.06	0.06	0.06	0.05	0.05	0.05
	60	0.60	0.65	0.70	0.54	0.53	0.54	0.06	0.06	0.06	0.06	0.05	0.05
MST 1-2-3	20	0.54	0.59	0.64	0.37	0.38	0.39	0.06	0.06	0.06	0.05	0.05	0.06
	40	0.53	0.58	0.63	0.39	0.37	0.38	0.06	0.06	0.06	0.05	0.05	0.05
	60	0.53	0.59	0.63	0.38	0.38	0.39	0.06	0.06	0.06	0.05	0.05	0.06


Note. MST = multistage computerized adaptive test.

^a Routing test length.

Each optimal item pool was also evaluated for its conditional performance on ability estimation and item overlap rate, as shown in Figures 7 through 10. Due to page limits, only the conditional bias and RMSE after implementing exposure control are reported in this article.

Figure 7 shows that larger bias occurred only at the extreme ability levels (θ > 1.5 or θ < −1.5) for short MSTs (e.g., 20 items). For examinees located at the lower end, the estimates were negatively biased. Conversely, the estimates for examinees located at the higher end were positively biased. However, the amount of bias became smaller as the test length increased.

As demonstrated in Figure 8, the conditional RMSE showed a trend similar to that of the conditional bias. Figures 9 and 10 report the conditional item overlap rate without and with exposure control, respectively. When no exposure control was implemented, the item overlap rate was severe for all optimal item pools along the ability scale. The item overlap rate was comparatively smaller in the middle range of the ability scale, but still as high as 0.3. Among the four types of panel designs, the MST 1-2-3 design displayed some advantages over the others in controlling the item overlap rate. After exposure control was implemented, the conditional item overlap rate under all simulation conditions dropped to the ideal range of 0.02 to 0.20, as demonstrated in Figure 10.

Conclusion and Discussion

As outlined in this article, the p-optimality method can be readily adapted to the MST context to support various types of panel designs. The simulation results showed that all the optimal item pools achieved good measurement accuracy across a wide range of the ability scale. Generally speaking, the p-optimality method facilitates optimal item pool design for MSTs in the following respects. First, the heuristic algorithm adopted by the p-optimality method eases the burden of item pool optimization in MST. Because of the variation in panel designs and the module-level adaptive feature, the specifications of item characteristics in MST pools are more complicated than those in CAT programs. The p-optimality method simplifies the optimization process in that every item pool blueprint is tailored to a particular panel design. Second, the end product of the p-optimality method is an optimal item pool blueprint for MST, which provides a guideline for either item creation or test assembly. More specifically, during item development, item writers can adhere to the item psychometric properties and item number distributions outlined in the item pool blueprint. This will substantially reduce redundant item writing as well as the cost of item creation. If real items are already in stock, MSTs that best fit the testing purpose can be assembled on the basis of these blueprints. Third, the degree of p-optimality, and thus the measurement precision, can be easily adjusted by changing the bin width when the item pool blueprint is set up. For instance, a bin width of 0.35 corresponds to the boundary of 95% of the maximum item information (see Figure 1). Shrinking the bin width raises the level of information that every item has to provide, and therefore a higher level of measurement precision can be achieved for the corresponding estimated abilities. If classification precision is preferred at certain cutoff scores, the bin width can vary, with narrower bins around these cutoff scores and wider bins elsewhere.

To investigate whether the construction of optimal item pools is affected by the critical factors of MST panel design, a comprehensive simulation study was conducted. By simulating four panel design factors (panel structure, test length, routing test length, and exposure control), a total of 72 item pool blueprints were built with the p-optimality method. The results show that these simulation factors affected the performance of the optimal item pools to varying degrees. First, a longer test does improve measurement precision. This was evidenced by the lower overall average RMSE of ability estimates as test length increased. Moreover, the conditional bias and conditional RMSE were also substantially reduced for extremely low or high θs when the test was longer (see Figures 7 and 8). However, administering more items may work against the optimization target of lowering the item creation cost. Hence, test practitioners may have to balance the benefit of higher measurement precision against the increased cost of test development when deciding on the appropriate test length.

Neither the panel structure nor the routing test length made a noticeable difference to the precision of individual-level ability recovery or to group-level classification accuracy. These findings support one salient benefit of the p-optimality method in designing optimal item pools for MSTs: no matter how the MST panel is actually structured, the method can accommodate the complexity introduced by the number of stages, the number of modules at each stage, the routing test length, and the routing pathways. The fact that panel design had little effect on the performance of the optimal item pools also suggests that MST panel formats beyond those simulated in this study have the potential to be well accommodated by the p-optimality method.

The simulation procedure above assumes that an unlimited number of items is available for framing the structure of the optimal item pool blueprints. If a master pool already exists, the p-optimality method can be used directly for test assembly based on these blueprints. The number of items for each module is selected based on the actual test specifications of the target MSTs. The number of test forms is determined by the capacity of the master pool. During this process, the real item parameters are redistributed across the bins, and each bin is filled with the best available candidate items. Where a bin needs more items than the master pool provides, items are borrowed from the nearest adjacent bin.
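The bin-filling step described above can be sketched as a simple two-pass procedure. This is a hypothetical illustration, not the assembly algorithm used in the study: the function names, the equal-width bins defined by `lower` and `width`, and the rule of widening the search beyond the immediately adjacent bin when it has no spare items are all assumptions.

```python
import math

def assign_to_bins(difficulties, lower, width, n_bins):
    """Group item difficulties into equal-width bins starting at `lower`."""
    bins = [[] for _ in range(n_bins)]
    for b in difficulties:
        idx = math.floor((b - lower) / width)
        if 0 <= idx < n_bins:  # items outside the blueprint range are ignored
            bins[idx].append(b)
    return bins

def fill_blueprint(blueprint, difficulties, lower, width):
    """Fill each bin to its target count, borrowing from the nearest bins if short.

    `blueprint` is a list of target item counts, one per bin.
    """
    n_bins = len(blueprint)
    available = assign_to_bins(difficulties, lower, width, n_bins)
    selected = [[] for _ in range(n_bins)]
    # First pass: every bin takes its own items, up to its target count.
    for i, need in enumerate(blueprint):
        take = min(need, len(available[i]))
        selected[i] = available[i][:take]
        available[i] = available[i][take:]
    # Second pass: short bins borrow leftover items, nearest bins first.
    for i, need in enumerate(blueprint):
        distance = 1
        while len(selected[i]) < need and distance < n_bins:
            for j in (i - distance, i + distance):
                while 0 <= j < n_bins and available[j] and len(selected[i]) < need:
                    selected[i].append(available[j].pop())
            distance += 1
    return selected
```

Filling every bin from its own items before any borrowing takes place keeps a short bin from depleting a neighbor that still needs its items. A bin that cannot be filled even after borrowing is left short, which signals that the master pool cannot fully support the blueprint.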

Despite the advantages of the p-optimality method in designing optimal item pools for MSTs, this study is limited in the following respects. First, the Rasch model was applied for data calibration and item pool construction. In practice, other item response theory models, such as the two-parameter logistic (2PL) and three-parameter logistic (3PL) models, may fit the data better, and optimal item pool construction based on these models needs to be investigated further. Second, under the assumption of unidimensionality, content balancing was not considered in this study. Although researchers found that content balancing had little impact on measurement precision during the construction of optimal item pools in CAT (He & Reckase, 2014; Zhou, 2014), whether the same holds in the MST context remains to be verified.

Third, rather than using a fixed-length MST, future studies might consider extending the p-optimality method to support a variable-length MST panel design. For example, additional modules might be administered to examinees whose estimation precision still falls below the requirement after they complete the last-stage module. This would also compensate for possible accidental misrouting during testing. Fourth, the item types in the present study are restricted to dichotomous items. In reality, MSTs may be composed of different item types, such as polytomous items, testlet-based items, performance-based items, and mixed-format items. Future studies could incorporate different item types and examine how the p-optimality method applies in those contexts to support operational MST assembly.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Lihong Yang https://orcid.org/0000-0002-3862-122X

References

Angoff W. H., Huddleston E. M. (1958). The multi-level experiment: A study of a two-level test system for the college board scholastic aptitude test (ETS Statistical Report No. SR-58-21). Educational Testing Service.
Berger M. P. F. (1994). A general approach to algorithmic design of fixed-form tests, adaptive tests, and testlets. Applied Psychological Measurement, 18(2), 141-153. 10.1177/014662169401800204
Chen H., Yamamoto K., von Davier M. (2014). Controlling multistage testing exposure rates in international large-scale assessments. In Yan D., von Davier A. A., Lewis C. (Eds.), Computerized multistage testing: Theory and applications (pp. 391-408). CRC Press/Taylor & Francis.
Cronbach L. J., Gleser G. C. (1965). Psychological tests and personnel decisions. University of Illinois Press.
Gu L. (2007). Designing optimal item pools for computerized adaptive tests with exposure controls [Unpublished doctoral dissertation]. Michigan State University.
He W., Diao Q. (2014, April 4-6). Item pool design for cat-review, demonstration, and future prospects [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), Philadelphia, PA, United States.
He W., Reckase M. (2014). Item pool design for an operational variable-length computerized adaptive test. Educational and Psychological Measurement, 74(3), 473-494. 10.1177/0013164413509629
Linn R. L., Rock D. A., Cleary T. A. (1968). The development and evaluation of several programmed testing methods (ETS Research Bulletin Series No. i-29). Educational Testing Service.
Luecht R. (2000, April 25-27). Implementing the CAST framework to mass produce high quality computer-adaptive and mastery tests [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), New Orleans, LA, United States.
Luecht R. M., Brumfield T., Breithaupt K. (2006). A testlet assembly design for adaptive multistage tests. Applied Measurement in Education, 19(3), 189-202. 10.1207/s15324818ame1903_2
Mao L. (2014). Designing p-optimal item pools for multidimensional computerized adaptive testing [Unpublished doctoral dissertation]. Michigan State University.
Reckase M. D. (2003, April 22-24). Item pool design for computerized adaptive tests [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), Chicago, IL, United States.
Reckase M. D. (2006, April 8-10). Design of an ideal two-stage test [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), San Francisco, CA, United States.
Reckase M. D. (2010). Designing item pools to optimize the functioning of a computerized adaptive test. Psychological Test and Assessment Modeling, 52(2), 127-141. 10.1177/0013164413509629
Segall D. O., Moreno K. E., Hetter D. H. (1997). Item pool development and evaluation. In Sands W. A., Waters B. K., McBride J. R. (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 117-130). American Psychological Association.
van der Linden W. J., Veldkamp B. P., Reese L. M. (2000). An integer programming approach to item pool design. Applied Psychological Measurement, 24(2), 139-150. 10.1177/01466210022031570
Veldkamp B. P. (2014). Item pool design and maintenance for multistage testing. In Yan D., von Davier A. A., Lewis C. (Eds.), Computerized multistage testing: Theory and applications (pp. 39-53). CRC Press/Taylor & Francis.
Veldkamp B. P., van der Linden W. J. (2000). Designing item pools for computerized adaptive testing. In van der Linden W. J., Glas C. A. W. (Eds.), Computerized adaptive testing: Theory and practice (pp. 149-162). Kluwer Academic.
Yan D., Lewis C., von Davier A. (2014). Overview of computerized multistage tests. In Yan D., von Davier A. A., Lewis C. (Eds.), Computerized multistage testing: Theory and applications (pp. 3-20). CRC Press/Taylor & Francis.
Zheng Y., Nozawa Y., Gao X., Chang H. (2012, April 14-16). Multistage adaptive testing for a large-scale classification test: The designs, heuristic assembly, and comparison with other testing modes [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), Vancouver, British Columbia, Canada.
Zhou X. (2014). Optimal item pool design for computerized adaptive tests with polytomous items using GPCM. Psychological Test and Assessment Modeling, 56(3), 255-274. 10.1177/0013164410393956

