Applied Psychological Measurement
. 2023 Mar 17;47(4):275–290. doi: 10.1177/01466216231165313

The Impact of Item Model Parameter Variations on Person Parameter Estimation in Computerized Adaptive Testing With Automatically Generated Items

Chen Tian, Jaehwa Choi
PMCID: PMC10240571  PMID: 37283592

Abstract

Sibling items developed through automatic item generation share similar but not identical psychometric properties. However, accounting for sibling item variations may entail substantial computational burden with little improvement in scoring. Assuming identical characteristics among siblings, this study explores the impact of item model parameter variations (i.e., within-family variation between siblings) on person parameter estimation in linear tests and Computerized Adaptive Testing (CAT). Specifically, we explore (1) what happens if small/medium/large within-family variance is ignored, (2) whether the effect of larger within-model variance can be compensated for by greater test length, (3) whether the item model pool properties affect the impact of within-family variance on scoring, and (4) whether the issues in (1) and (2) differ between linear and adaptive testing. The related sibling model is used for data generation, and the identical sibling model is assumed for scoring. Manipulated factors include test length, the size of within-model variation, and item model pool characteristics. Results show that as within-family variance increases, the standard error of scores remains at similar levels. For the correlation between true and estimated scores and for RMSE, the effect of larger within-model variance was compensated for by test length. For bias, scores are biased towards the center, and the bias was not compensated for by test length. Although the within-family variation is random in the current simulations, to yield less biased ability estimates, the item model pool should provide balanced opportunities such that “fake-easy” and “fake-difficult” item instances cancel each other’s effects. The results for CAT are similar to those for linear tests, except for higher efficiency.

Keywords: automatic item generation, computerized adaptive testing, item model parameter variation, identical sibling model


Computerized Adaptive Testing (CAT) was designed to sequentially tailor the item difficulty to an examinee’s ability so that the examinee is always challenged. Specifically, during the test, every item is selected based on the examinee’s responses to previous items and a set of constraints, such as content coverage, answer-key distribution, and item exposure (e.g., Zheng & Chang, 2015; Wang, Chang & Douglas, 2016). The major advantage of CAT is that it can estimate the latent trait with fewer items than traditional linear tests (e.g., Weiss, 1976; Chang, 2015). However, CAT raises challenges in the area of item development. To continuously administer items to examinees, large numbers of diverse, high-quality test items are required to build the item pools, and the pools need to be constantly replenished to maintain test security and reduce the exposure rates of popular items. Traditionally, new test items are written by subject-matter experts and calibrated individually via pilot or field studies (Gierl et al., 2013; Haladyna & Rodriguez, 2013). The items are edited, reviewed, and revised iteratively until they meet the required standards of quality (Lane, Raymond, & Haladyna, 2016; Perie & Huff, 2016). Unfortunately, developing new items in this traditional way is time-consuming and expensive, so an approach is needed that can create large item banks efficiently and economically.

Automatic Item Generation (AIG) is a promising way to address this challenge (Irvine, 2002). It is a process of using item models or item templates developed by experts to generate item instances with the aid of computer technology. Given an explicitly represented and well-specified set of rules and principles, a computer can systematically generate sufficiently generalizable (i.e., large and diverse) item instances in a negligible amount of time. In contrast to the traditional way of item development, AIG avoids the time-consuming process of manual item writing. However, developing and calibrating item models may be computationally complex and time-consuming.

Though theoretically sound, the combination of these two advanced measurement techniques needs more exploration before being used in practical settings. The purpose of the current simulation study is to explore the impact of item model parameter variation (i.e., within-family variation) on the estimation of examinees’ abilities in CAT. Item model parameter variation refers to the variation of parameters across item instances generated from the same item family. In AIG, item instances from the same family should have very similar or even identical parameters, although variation may exist. If the variation is large but ignored, will the ability estimates of examinees be precise and unbiased? This study tries to answer that question. In the following, this paper first reviews the frameworks of CAT and AIG. Next, the simulation design and results are presented. The concluding section discusses limitations and future work.

Literature Review

Computerized Adaptive Testing

CAT requires a smaller number of items than the traditional linear tests for the same testing purpose. The test length is reduced because examinees do not need to answer questions that are too easy/difficult to provide useful information for estimating their ability levels. Previous studies show that CAT can reduce the test length while maintaining a comparable level of estimation accuracy or precision (e.g., Lewis & Sheehan, 1990). This advantage has also been validated in several testing programs, such as the national nursing licensure exam administered by the National Council of State Boards of Nursing (NCSBN, 2022), the Graduate Record Examination administered by Educational Testing Service (e.g., Mills, 1999), and the Armed Services Vocational Aptitude Battery administered by the U.S. Department of Defense (e.g., Moreno & Segall, 1997).

The challenging part of CAT is that test developers need to develop a huge number of high-quality items across diverse content areas, which is time-consuming and expensive. It is demanding for item writers to express cognitive problem-solving skills and knowledge in a specific item format and repeat the same process for every single item (Gierl et al., 2012). In addition, validating the psychometric properties of all new items is required prior to operational use. Usually, a substantial percentage of items do not perform as intended and must be revised or discarded (Haladyna, 1994). This high cost in time and money is intensified in the context of computerized adaptive testing, which must ensure continuous testing while minimizing item exposure rates (Gierl et al., 2012). For example, Breithaupt et al. (2009) estimated that for a 40-item high-stakes licensure examination in the CAT format with two administrations per year, at least 2000 items are needed.

The traditional way of handcrafting items individually may fail to meet the growing demand for items in digital assessments. Therefore, practitioners are seeking solutions to efficiently develop high-quality items in digital environments. One possible solution is to use automatic item generation, transforming the art of item writing into an industrial process through construct analysis.

Automatic Item Generation

Framework of AIG

AIG is a process of developing item models to generate item instances with the aid of computer technology. AIG promotes a generative process using item models rather than relying on content specialists to write each item individually (Gierl et al., 2012). An item model is a prototypical representation of item instances: it contains all the information necessary for item generation. Item models are developed through content analysis and content modeling, in which experts specify what examinees need to know or perform to complete a task and organize their understanding into cognitive model structures.

Gierl et al. (2012) described a three-stage process to generate multiple-choice items using AIG. The first stage is to develop a cognitive model structure that helps organize the relationships of cognitive- and content-specific knowledge, skills, and contents. This is a stage of creative work, where content specialists play an important role. The second stage is to develop item models by specifying the elements that are directly extracted from the cognitive model structure. The third stage is to generate item instances from item models using a computerized system (e.g., Computer Adaptive Formative Assessment Platform; Choi, Kim, & Yoon, 2021).

An item model has several elements or features that we want to manipulate to generate item instances. The surface features or incidentals are contextual variables that alter the basic appearance of the item. The deep features or radicals are variables that may influence the psychometric properties (Irvine, 2002). Items generated from the same item model or item family are called siblings, and they have similar conceptual and psychometric properties. In this paper, the terms item model and item family are used interchangeably.

Item Modeling and Item Model Calibration

Besides generating isomorphic item instances, another purpose of item models is to predict the item parameters (e.g., difficulty) and/or account for the dependence among siblings. For example, the linear logistic test model (LLTM; Fischer, 1973) generalizes the Rasch model and models the difficulty parameter by a linear combination of item features. Embretson (1999) generalized LLTM by assuming that the item discrimination parameters vary across items and are linear combinations of item features. Besides predicting the item parameters for each item instance, another way to calibrate the item model is to estimate the family-level model parameters. This approach accounts for the dependence among siblings. Once the family-level parameters are estimated from a few siblings, there is no need to estimate the parameters of new items generated from the same family.
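As a concrete illustration of the LLTM idea, item difficulty is written as a weighted sum of item features. The feature counts and weights below are invented purely for illustration:

```python
import numpy as np

def lltm_difficulty(q, eta):
    """LLTM-style prediction: item difficulty b_i = sum_k q_ik * eta_k,
    where q_ik counts how often item i requires cognitive operation k
    and eta_k is that operation's difficulty weight."""
    return float(np.asarray(q, float) @ np.asarray(eta, float))

# Hypothetical item requiring operation 1 twice and operation 2 once,
# with invented weights 0.8 and -0.3:
b_pred = lltm_difficulty([2, 1], [0.8, -0.3])  # 2*0.8 + 1*(-0.3) = 1.3
```

With calibrated weights, such a model predicts the difficulty of newly generated instances without individually piloting each one.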

Previous literature introduces three commonly used models accounting for the dependence among siblings (e.g., Johnson & Sinharay, 2005): the identical siblings model (ISM), the unrelated siblings model (USM), and the related siblings model (RSM). ISM assumes the same item response function (IRF) for all items in the same family (Hombo & Dresher, 2001). It ignores variation between siblings and may provide incorrect estimates of the item parameters if such variation is present. On the other hand, USM assumes separate, unrelated IRFs for all items. It ignores the similarities between siblings and will yield larger standard errors of item parameter estimates. RSM overcomes the limitations of ISM and USM. It is a hierarchical model that assumes a separate IRF for each sibling at the first level and relates siblings at the second level. Under RSM, the family expected response function can be used to summarize an item family. Figure 1 illustrates the similarities and differences between ISM, RSM, and USM.

Figure 1. An Illustration about the Similarities and Differences between ISM, RSM, and USM.

Glas and van der Linden (2003) described RSM in detail. The first level is for item instances and can be any item response model, such as the 2-parameter logistic (2PL) model. Taking the 2PL model as an example, for an item model p, the first level describes the probability of success on an item instance ip as a function of the latent trait parameter θ: P_ip(θ) = 1 / (1 + exp[−a_ip(θ − b_ip)]). Each item instance ip has a parameter vector (a_ip, b_ip). Those values are realizations of a random parameter vector that describes the item family p. The second-level model describes the distribution of a transformation of the parameter vector, ξ_ip = (log a_ip, b_ip). After transformation, the parameter vector is unbounded, and we may assume that it follows a multivariate normal distribution, that is, ξ_ip ~ N(μ_p, Σ_p). The hyperparameters μ_p and Σ_p are the mean vector and covariance matrix of the transformed item parameters in family p. The first-level model is not calibrated; instead, the hyperparameters for each family are estimated. Note that ISM and USM are special cases of RSM, where ISM assumes a within-family variance of 0 and USM assumes a within-family variance of infinity. This hierarchical model also has variants designed for testlets (e.g., Wainer, Bradlow, & Du, 2000) and variants that incorporate LLTM in the second level (e.g., the linear item cloning model; Geerlings et al., 2011).
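The two RSM levels can be sketched as follows. This is a minimal illustration with hypothetical hyperparameter values; the helper names `draw_sibling` and `p_correct_2pl` are our own, not from the paper:

```python
import numpy as np

def draw_sibling(mu_p, sigma_p, rng):
    """Second level: the transformed vector xi_ip = (log a_ip, b_ip) of a
    sibling is drawn from N(mu_p, Sigma_p); back-transform to get (a, b)."""
    log_a, b = rng.multivariate_normal(mu_p, sigma_p)
    return float(np.exp(log_a)), float(b)

def p_correct_2pl(theta, a, b):
    """First level, 2PL: P(theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical family p: mean log-discrimination 0 (so a is near 1),
# mean difficulty 0.5, small diagonal covariance.
rng = np.random.default_rng(0)
a_ip, b_ip = draw_sibling([0.0, 0.5], np.diag([0.01, 0.1]), rng)
p = p_correct_2pl(theta=1.0, a=a_ip, b=b_ip)
```

Setting the covariance to zero recovers ISM: every sibling then shares the family parameters exactly.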

Research Motivations and Questions

This study is motivated by the tradeoff between computational simplicity and estimation accuracy. Intuitively, ISM, which assumes identical item parameters across siblings, is less ideal than RSM because it ignores the variation introduced by incidental features. On the other hand, using RSM to calibrate item models is computationally challenging because a hierarchical model usually requires time-consuming Bayesian methods (e.g., Sinharay et al., 2003).

A majority of research shows that as long as the within-family variation is random, scores based on the item family parameters are robust to the within-family variation (e.g., Glas & van der Linden, 2003; Sinharay & Johnson, 2008). Despite the robustness, previous studies do not imply that the within-family variance is ignorable, and the within-family variance does contribute to score instability and potential estimation bias. Fortunately, the random errors of ability estimates associated with the within-family variance may be partially compensated for by increasing the test length.

The purpose of this study is to explore the effect of test length and the extent of within-family variance in the context of both linear and adaptive testing, assuming the true item-family parameters are known. Specifically, we explore (1) what happens if small/medium/large within-family variance is ignored, (2) whether the effect of larger within-model variance can be compensated by greater test length, (3) whether the item model pool properties affect the impact of within-family variance on scoring, and (4) whether the issues in (1) and (2) differ between linear and adaptive testing. The results will profile what counts as a “short/long” test and what counts as “small/medium/large” within-family variation. Using estimated rather than true item-family parameters, and the joint effect of item calibration methods and estimation methods, are topics for future research. For example, with large within-family variation, more siblings may be needed for estimating item-family parameters such that the scores are robust under ISM.

Methods

Simulation Designs

The testing scenario is that RSM is the true data-generating model, but ISM is assumed for scoring. To explore how test length, within-family variance, and item model pool properties affect scoring precision, we consider the following conditions. Six test lengths are explored: 10, 20, 30, 40, 50, and 80. Three levels of within-family variance are considered: small, medium, and large, defined by the ratio of within-family variance to across-family variance: 10% is small, 20% is medium, and 50% is large. A baseline condition with 0% within-family variance is also included. Finally, the distributions of the b parameters of the 1000 item families differ between pool 1 and pool 2; the two item model pools differ in the number of item models with difficulty levels near the two ends of the ability scale.

Item Model Pools

Two item model pools were generated. Both pools contain five content categories and 1000 item models, with each item model belonging to exactly one content category and an equal number of item models in each category. The pool size, 1000, was chosen based on the rule of thumb that the pool size should be 12 times the CAT test length; since the maximum test length is 80, 1000 is large enough. At the first level (i.e., the within-family level), the 2PL model is used, and the transformed item parameters ξ_ip = (log a_ip, b_ip) for a newly generated instance ip come from a normal distribution with family-specific means and the small, medium, or large variance defined above. At the second level (i.e., the across-family level), the 1000 means of the a parameters were drawn from Uniform(0.5, 1.5), and the 1000 means of the b parameters were drawn from Uniform(−1.73, 1.73) for item model pool 1 or N(0, 1) for item model pool 2. The distribution Uniform(−1.73, 1.73) is chosen because its variance is 1, the same as the variance of N(0, 1). The variance of the log of Uniform(0.5, 1.5) is 0.09478. Multiplying the covariance matrix diag(0.09478, 1) by 10%, 20%, and 50% yields the small, medium, and large within-family variances (e.g., Bejar et al., 2002).
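The pool construction described above might be sketched as follows (a simplified illustration; the function names are our own):

```python
import numpy as np

def build_pool(n_models=1000, pool=2, seed=0):
    """Family-level means: a-means ~ Uniform(0.5, 1.5) for both pools;
    b-means ~ Uniform(-1.73, 1.73) for pool 1 or N(0, 1) for pool 2."""
    rng = np.random.default_rng(seed)
    a_means = rng.uniform(0.5, 1.5, n_models)
    if pool == 1:
        b_means = rng.uniform(-1.73, 1.73, n_models)
    else:
        b_means = rng.normal(0.0, 1.0, n_models)
    return a_means, b_means

def within_cov(ratio):
    """Within-family covariance of (log a, b): the across-family
    covariance diag(0.09478, 1) scaled by 10%, 20%, or 50%."""
    return ratio * np.diag([0.09478, 1.0])

a_means, b_means = build_pool(pool=2)
cov_large = within_cov(0.50)  # the "large" within-family variance condition
```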

Test Specifications

For the linear test with no adaptive algorithm, 10, 20, 30, 40, 50, or 80 item models were randomly chosen from item model pool 2, with the constraint that the number of item models from each of the five content categories is equal. Item pool 1 was not considered because it is not practical for linear tests. One instance is generated from each item model to construct the test. The mean difficulties of the item models in tests of different lengths are 0.325, 0.071, −0.097, −0.134, −0.185, and −0.076. For CAT, to stabilize ability estimation, the first five items of medium difficulty are not adaptively chosen but are randomly selected from five difficulty categories and each of the five content domains. The following item models are chosen adaptively based on the current estimates of examinees, the content domain constraints, and the exposure rates of different models.
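The content-balanced random selection for the linear test could be sketched as follows (hypothetical helper, assuming a pool of 1000 models with 200 per category):

```python
import numpy as np

def assemble_linear_test(n_items, content, rng):
    """Randomly select item models subject to the constraint that each of
    the five content categories contributes n_items / 5 models."""
    per_cat = n_items // 5
    chosen = []
    for c in range(5):
        idx = np.flatnonzero(content == c)  # models in category c
        chosen.extend(rng.choice(idx, size=per_cat, replace=False))
    return np.array(chosen)

# Hypothetical pool: 1000 models, 200 per content category.
content = np.repeat(np.arange(5), 200)
test_models = assemble_linear_test(20, content, np.random.default_rng(0))
```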

For item model selection in CAT, the maximum priority index (MPI) method, which can accommodate various non-statistical constraints simultaneously, was used (Cheng & Chang, 2009). MPI can be considered a variant of the maximum information method: the next item model is selected if it maximizes the priority index (Supplemental Appendix A).
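The full MPI index is given in Supplemental Appendix A; as a sketch of the maximum-information core that MPI extends, a bare selection step (hypothetical helper, toy numbers) might look like:

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def select_next(theta_hat, a_means, b_means, administered):
    """Pick the unused item model with maximum information at theta_hat.
    (MPI would weight this information by a priority index encoding
    content-balance and exposure constraints.)"""
    info = item_information_2pl(theta_hat, a_means, b_means)
    info[list(administered)] = -np.inf  # exclude already-used models
    return int(np.argmax(info))

# Toy pool of three item models; model 0 was already administered.
a_means = np.array([1.0, 1.2, 0.8])
b_means = np.array([-1.0, 0.1, 2.0])
next_model = select_next(0.0, a_means, b_means, administered={0})  # -> 1
```

Model 1 wins because its difficulty (0.1) is closest to the current estimate (0.0) and its discrimination is highest.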

Data Generation Process

Fifteen ability points ranging from −3.5 to 3.5 were considered in the simulation. For each ability point, there are 100 replications under each of the simulation conditions. In CAT, the next item model appropriate for the current theta estimate is chosen. Given an item model, the true parameters of an item instance will be drawn from a multivariate normal distribution according to the family-specific mean and the current condition of within-family variance (Bejar et al., 2002, p. 19). We compute the true probability of correct response according to a 2PL model, using true theta and true item instance parameters drawn in the previous step. If the probability is greater than a random number from a rectangular [0,1] distribution, the response is 1 (correct), otherwise, it is 0 (incorrect). As for scoring, examinees’ abilities are estimated by Maximum Likelihood Estimation (MLE). MLE estimates are restricted within −4 to 4. Note that since ISM is assumed, the ability is estimated using family means rather than the true parameters of item instances. In the null condition where the within-family variance is 0, the family means are identical to the true parameters of all generated item instances. In this case, the null condition can also be interpreted as estimating the abilities using true parameters for item instances regardless of the extent of within-family variance.
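The generation-and-scoring loop described above can be sketched as follows (a simplified illustration using a grid-search MLE in place of the paper's unspecified optimizer; helper names are our own):

```python
import numpy as np

def simulate_response(theta_true, a_inst, b_inst, rng):
    """Score 1 if the true 2PL probability exceeds a Uniform(0, 1) draw."""
    p = 1.0 / (1.0 + np.exp(-a_inst * (theta_true - b_inst)))
    return int(p > rng.uniform())

def mle_theta(responses, a, b):
    """Grid-search MLE of theta, restricted to [-4, 4]. Under ISM, a and b
    are the family means, not the true instance parameters."""
    grid = np.linspace(-4, 4, 801)
    a, b, u = map(np.asarray, (a, b, responses))
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))
    loglik = (u * np.log(p) + (1 - u) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])

rng = np.random.default_rng(1)
a_fam = np.ones(50)                  # family-level discrimination means
b_fam = rng.normal(0.0, 1.0, 50)     # family-level difficulty means
# "Large" within-family variance: true instance parameters vary around means.
a_inst = np.exp(rng.normal(np.log(a_fam), np.sqrt(0.047)))
b_inst = rng.normal(b_fam, np.sqrt(0.5))
u = [simulate_response(1.0, ai, bi, rng) for ai, bi in zip(a_inst, b_inst)]
theta_hat = mle_theta(u, a_fam, b_fam)  # ISM scoring with family means
```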

Evaluation Criteria

The Standard Error (SE) of measurement at each ability point is calculated simply as the standard deviation of the 100 estimated abilities. Bias, Pearson’s correlation coefficient, and the Root Mean Squared Error (RMSE) comparing estimated and true ability parameters are also calculated and reported. Bias is the estimated ability minus the true ability.
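These criteria can be computed as follows (a minimal sketch; helper names are our own):

```python
import numpy as np

def se_bias_rmse(theta_hats, theta_true):
    """Per ability point: SE is the SD of the replicated estimates;
    bias is the mean of (estimate - true); RMSE combines both."""
    est = np.asarray(theta_hats, float)
    se = est.std()
    bias = (est - theta_true).mean()
    rmse = np.sqrt(((est - theta_true) ** 2).mean())
    return se, bias, rmse

def pearson_r(est, true):
    """Correlation between estimated and true abilities across examinees."""
    return float(np.corrcoef(est, true)[0, 1])

# Three hypothetical replications at true theta = 1.0:
se, bias, rmse = se_bias_rmse([0.9, 1.1, 1.0], 1.0)
```

Note that RMSE^2 = SE^2 + bias^2 for estimates at a single ability point, so RMSE summarizes both precision and bias.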

Results

Linear Test

For linear tests of different lengths, Figure 2 shows the standard error of estimates at 15 ability points using item model pool 2. In each plot, as the test length increases, the standard error decreases overall. For example, when the within-family variance is small, a test of length 10 yielded a larger SE than a test of length 50. When the test is long, the standard error is smaller at moderate ability points; when the test is as short as 10 items, the standard error is smaller at extreme ability points. This is because we constrained the estimated ability to be between −4 and 4, which restricts the variability of MLEs at the two ends. Overall, these results show that a longer test yields more precise estimates. The effect of the size of the within-family variance can be seen by comparing the four plots in Figure 2. There is no dramatic increase in SE when the within-family variance increases from small (10%) to large (50%), especially when the test length is greater than 30.

Figure 2. The Standard Error of Estimates at 15 Ability Points in the Linear Test using Item Model Pool 2.

Figure 3 shows the bias of estimates at 15 ability points using item model pool 2. When there was no within-family variance, the MLE estimates are biased away from zero except at the two end points (−3.5 and 3.5), where MLE estimates are constrained to be within −4 and 4. As the within-family variance increases, the MLE estimates become biased towards zero. The absolute values of the biases in all conditions are smaller than 0.3. If the test length is long enough and the within-family variance is small, biases in different directions may cancel each other out. Even though the increased within-family variance did not increase the standard error dramatically (Figure 2), the estimates became biased as the within-family variance increased (Figure 3), although the extent of the bias is small.

Figure 3. The Bias of Estimates at 15 Ability Points in the Linear Test using Item Model Pool 2.

Figure 4 compares the estimated and true abilities. For the correlation, a test as short as having 10 items would give a correlation greater than 0.92. As the test length increases, the correlation also increases. A test having 20 or more items would yield a correlation coefficient greater than 0.95, regardless of the size of within-family variance. For the RMSE, the positive effect of test length was also observed: longer tests yielded smaller RMSE. Even though we consistently observed less precise estimates (i.e., larger RMSE) under the condition of larger within-family variance, the difference is not dramatic and can easily be compensated for by extending the test length.

Figure 4. A Comparison between Estimated and True Abilities in the Linear Test.

CAT

In terms of SE and bias, the results for CAT are similar to those for linear tests, but CAT is more efficient at the same test length. Figure 5 shows the standard error of estimates at 15 ability points using both item model pools. Comparing the second row of Figure 5 with the results of the linear tests (Figure 2), we can see that the SEs are smaller overall. As the test length increases, the standard error decreases. Comparing the four columns in Figure 5, there is no dramatic increase in SE when the within-family variance increases from small (10%) to large (50%), especially when the test length is greater than 20. The distribution of the difficulty parameters in the item model pools does not seem to affect the impact of within-family variance on SE.

Figure 5. The Standard Error of Estimates at 15 Ability Points in CAT using Both Item Model Pools.

Figure 6 shows the bias under different conditions. Comparing the second row of Figure 6 with Figure 3, we can see that the pattern is the same as for the linear tests but of smaller magnitude. Comparing the four columns of Figure 6, we can see that for a given test length, as the within-family variance increases, the estimates are biased towards the center (i.e., zero). The distribution of the difficulty parameters in the item model pools does not seem to affect the impact of within-family variance on bias. The absolute values of bias are smaller than 0.3, but the bias could not be compensated for by longer test lengths. Fortunately, the large within-family variance depicted in Figure 8 is unrealistic.

Figure 6. The Bias of Estimates at 15 Ability Points in CAT using Both Item Model Pools.

Figure 8. Depiction of the Across-Family Variance in Model Pool 1 and Three Levels of Within-Family Variance for Three Example Item Models.

Figure 7 compares the estimated and true abilities. Like the results of the linear test, as the test length increases, the correlation also increases. A test having 20 or more items would yield a correlation coefficient greater than 0.97, regardless of the size of within-family variance. The test length for CAT needed to achieve the same level of correlation is smaller than that for a linear test, which shows the efficiency of the CAT. For the RMSE, the positive effect of test length was also observed as longer tests yielded smaller RMSE. Even though we consistently observed less precise estimates under the condition of larger within-family variance, the difference is not dramatic and can be compensated for by extending the test length. The distribution of the difficulty parameter did not dramatically affect correlation and RMSE.

Figure 7. A Comparison between Estimated and True Abilities in CAT for Item Model Pool 1 (Left Column) and 2 (Right Column).

Follow-up Exploration about Sources of Bias

In both CAT and linear tests, we observed that larger within-family variance is related to estimates biased towards zero. To explore the nature of this bias, a follow-up examination was conducted. This small simulation explores the source of bias in linear tests when the test length is 50 and the within-family variance is large (Figure 3, fourth plot, the line with diamonds). Recall that in the original simulation, at the item model level (i.e., the across-family level), the discrimination parameters come from Uniform(0.5, 1.5) and the difficulty parameters come from N(0, 1). When generating item instances for item model p (i.e., at the within-family level), the log of the discrimination parameters comes from N(log(a_p), 0.047), and the difficulty parameters come from N(b_p, 0.5). We replicated this condition 10,000 times for one examinee with ability 3. Results show that the mean estimated ability was 2.853 (SD = 0.552). In a modified condition, we only changed the distribution of the b parameters at the item model level from N(0, 1) to N(3, 1), such that the 50-item test is especially suitable for this high-ability examinee. In this case, the mean estimated ability was 2.994 (SD = 0.323). The size of the bias shrunk from 0.147 to 0.006.
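A compact re-creation of this follow-up check might look as follows (a sketch under the stated distributions, reduced to 500 replications rather than 10,000, so exact values will differ from those reported):

```python
import numpy as np

def replicate_bias(b_pool_mean, n_rep=500, theta=3.0, n_items=50, seed=0):
    """Mean MLE of theta over replications when ISM scoring (family means)
    ignores large within-family variance in the true instance parameters."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4, 4, 801)
    ests = []
    for _ in range(n_rep):
        a_fam = rng.uniform(0.5, 1.5, n_items)         # family a means
        b_fam = rng.normal(b_pool_mean, 1.0, n_items)  # family b means
        a_inst = np.exp(rng.normal(np.log(a_fam), np.sqrt(0.047)))
        b_inst = rng.normal(b_fam, np.sqrt(0.5))       # "large" variance
        p_true = 1.0 / (1.0 + np.exp(-a_inst * (theta - b_inst)))
        u = (p_true > rng.uniform(size=n_items)).astype(float)
        # Score with family means, as ISM assumes:
        p = 1.0 / (1.0 + np.exp(-a_fam * (grid[:, None] - b_fam)))
        loglik = (u * np.log(p) + (1 - u) * np.log(1 - p)).sum(axis=1)
        ests.append(grid[np.argmax(loglik)])
    return float(np.mean(ests))

mean_moderate = replicate_bias(0.0)  # pool centered at 0 (bias toward 0 expected)
mean_matched = replicate_bias(3.0)   # pool centered at the examinee's ability
```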

In summary, for examinees with ability 3, if the linear 50-item test comes from a moderately difficult item model pool, then the large within-family variance yielded ability estimates biased towards 0. If the linear 50-item test comes from a highly difficult item model pool, then the large within-family variance yielded an ability estimate much less biased towards 0. Results show that the source of bias is related to the characteristic of item model pool. Though the within-family variance is random, only the highly difficult item model pool allows random variations to cancel each other for a high-ability examinee.

Conclusions and Discussion

The first research question asks what happens if small/medium/large within-family variance is ignored. Results show that larger within-family variation did not decrease the precision of estimation dramatically in either linear tests or CAT. The second research question explores whether the effect of larger within-model variance can be compensated by greater test length. Results show that the increased standard error, increased RMSE, and decreased correlation brought by larger within-family variation can easily be compensated by extending the test length. This finding is promising because it indicates that scoring precision is robust to violations of the item isomorphism assumption. For long CATs with more than 20 items or long linear tests with more than 30 items, no substantial difference between the different extents of within-family variation was found regarding the correlation between estimates and true values, the RMSE, or the SE. For shorter tests, researchers need to be careful about the item isomorphism assumption. These findings support the application of AIG with on-the-fly item generation without calibrating all instances separately. This study provides evidence that a slight or moderate violation of the assumption does not destroy the usefulness of estimated ability scores as long as the test is long enough (i.e., around 35 items or more). For research question 4, we observed similar results for both linear and adaptive testing.

Results answer research question 3: the item model pool difficulty affects the impact of within-family variance on the bias of ability estimation. Despite robust scoring precision, the increased within-family variance did introduce bias, especially for examinees with extreme abilities. This observation is reasonable because, with larger within-family variance, there are more chances to administer a harder (or easier) item instance and mistakenly treat it as easier (or harder) for high-ability (or low-ability) examinees. Since the majority of item models in the pool are of moderate difficulty, high-ability examinees are more likely to receive “fake-easy” item instances than “fake-hard” item instances. For example, it is more likely to generate an item instance of difficulty 3 from an item model of difficulty 1.5 than from an item model of difficulty 4.5. Due to the imbalanced numbers of “fake-easy” and “fake-hard” item instances, the effects of large within-family variance could not cancel out, and the estimates ended up biased. This argument is supported by the follow-up examination presented in the results section: if the majority of item models in the pool have a difficulty level near the examinee’s ability, there is a balanced number of “fake-easy” and “fake-hard” item instances. Therefore, the effects of large within-family variance canceled out, and we observed in the follow-up simulation that the size of the bias shrunk from 0.147 to 0.006.

By considering a large within-family variance condition, this study adds to Luecht’s (2012) argument that “a majority of research studies have found that, for the most part, scores based on estimates of the item family statistics are robust with that variation at the level of the individual items, provided that the variation is random in nature”. Randomness plays a key role here because it allows the negative effects brought by within-family variations to cancel out. On the other hand, the characteristics of the item model pool also affect whether those negative effects can cancel out.

Though the large within-family variation conditions created scoring issues, the large within-family variance condition in this study may be unrealistic in practice as long as the item models are carefully designed. The first plot in Figure 8 depicts the item model characteristic curves for all 1000 item models. Each dashed line represents one item model, and the solid line describes the overall characteristic of the item pool. The second to fourth plots in Figure 8 each depict the item instance characteristic curves for 200 item instances generated from one example item model. Each dashed line represents one item instance, and the solid lines describe the characteristic of the item model. As Figure 8 shows, for one item model in the large within-family variance case (fourth plot), 95% of the difficulty parameters range from about −1.7 to 1.2, which is considerably wide. Practitioners are advised to pursue less variation of items within each family regardless of which AIG approach is adopted. Otherwise, one should be careful about bias for examinees with extreme ability parameters (e.g., |theta| > 3) regardless of the test length.

In summary, this study shows that the size of within-model variation did not dramatically affect the precision of estimation in either linear tests or CAT. If the test length is large enough (e.g., >20 items in CAT and >30 in linear tests), the tests yield a high correlation between true and estimated ability parameters, low RMSE, and low SE. For shorter tests, researchers need to be cautious about the item isomorphism assumption or extend the test length slightly. In the case of large within-model variation, the size of the bias exceeds .2 in the extreme range of person parameters (e.g., |theta| > 3). This is due to the characteristics of the item model pool, and we suggest pursuing smaller within-family variances, especially for item models of extreme difficulty.
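For reference, the evaluation criteria summarized here (bias, RMSE, and the correlation between true and estimated abilities) can be computed as in the sketch below; the data are hypothetical toy values, not the study's simulation output:

```python
# Toy computation of bias, RMSE, and the true/estimated-ability correlation.
# theta_hat is simulated as truth plus noise purely for illustration.
import math
import random

random.seed(7)
n = 1000
theta_true = [random.gauss(0, 1) for _ in range(n)]
theta_hat = [t + random.gauss(0, 0.3) for t in theta_true]  # hypothetical estimates

errors = [h - t for h, t in zip(theta_hat, theta_true)]
bias = sum(errors) / n
rmse = math.sqrt(sum(e * e for e in errors) / n)

# Pearson correlation between true and estimated abilities
mt = sum(theta_true) / n
mh = sum(theta_hat) / n
cov = sum((t - mt) * (h - mh) for t, h in zip(theta_true, theta_hat)) / n
sd_t = math.sqrt(sum((t - mt) ** 2 for t in theta_true) / n)
sd_h = math.sqrt(sum((h - mh) ** 2 for h in theta_hat) / n)
corr = cov / (sd_t * sd_h)

print(f"bias={bias:.3f}, RMSE={rmse:.3f}, r={corr:.3f}")
```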

This study has several limitations that call for future work. First, we assumed all item models are pre-calibrated. In practice, the true item model parameters are never known. It would therefore be desirable to investigate item model calibration methods and the joint effect of such methods and ignoring existing within-family variance. Second, in practice, not all template-based item generation processes have the same extent of within-family variation. Depending on the content areas or the skills to be measured, some item modeling processes simply replace incidentals, while others are more complicated and prone to greater within-family variation. Practitioners should note that the simulation conditions assume all item models share the same extent of within-family variation, an assumption that may be violated in practice; future studies may relax it. Third, this study does not consider how within-family variance in key core items (e.g., anchor items) affects equating, termination, or scoring. An item model with large within-family variance introduces more noise, and future work needs to explore how increased noise in core items affects test assembly and scoring. A further issue raised by the increased noise is that, if difficult item models are noisier (i.e., have larger within-family variance), it may be technically preferable to base scoring only on easy item models, especially for low-ability examinees. However, this is politically indefensible because it gives fewer “chances” to rejected examinees (Irvine, 2014, Chapter 5). In addition to the three directions above, future work may also refine the topic discussed in this project. For example, future work may investigate other psychometric evaluations, such as CTT indices (e.g., p or d indices) and IRT model fit, or integrate difficulty modeling with scoring.

Supplemental Material

Supplemental Material for The Impact of Item Model Parameter Variations on Person Parameter Estimation in Computerized Adaptive Testing With Automatically Generated Items by Chen Tian and Jaehwa Choi in Applied Psychological Measurement

Notes

1. Five content categories were chosen to reflect common assessment practices. For example, SAT math has 4 content areas (College Board, 2022), and ACT math has 7 (sub)categories (ACT Inc., 2022).

2. Item model pools 1 and 2 differ in the number of items with extreme difficulties. This relates to the chances for extreme-ability examinees to “cancel” the effects of fake-easy and fake-hard items. For example, under a normal distribution (i.e., pool 2), an examinee with ability −1.5 has more chances to get fake-hard items than fake-easy items. See the follow-up study and discussion for more details. The mean difficulty of item models in both pools is set to 0 to match the mean ability level of the simulated examinees.

3. The small and medium conditions are close to the realistic conditions adopted in the simulation study of Bejar et al. (2002, p. 15). The large condition is an extreme condition that may not occur in practice (see Figure 8 for an illustration), but it is included for readers’ information.

4. These numbers are the means of 10, 20, 30, 40, 50, and 80 values sampled from a distribution with mean 0.

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material: Supplemental material for this article is available online.

ORCID iDs

Chen Tian https://orcid.org/0000-0002-5823-7831

Jaehwa Choi https://orcid.org/0000-0003-4225-9582

References

  1. ACT, Inc. (2022). Mathematics test description for the ACT. https://www.act.org/content/act/en/products-and-services/the-act/test-preparation/description-of-math-test.html
  2. Bejar I. I., Lawless R. R., Morley M. E., Wagner M. E., Bennett R. E., Revuelta J. (2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report Series, 2002(2), i–44. 10.1002/j.2333-8504.2002.tb01890.x
  3. Breithaupt K., Ariel A. A., Hare D. R. (2009). Assembling an inventory of multistage adaptive testing systems. In Elements of adaptive testing (pp. 247–266). Springer.
  4. Chang H. H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80(1), 1–20.
  5. Cheng Y., Chang H. H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62(2), 369–383.
  6. Cho S. J., De Boeck P., Embretson S., Rabe-Hesketh S. (2014). Additive multilevel item structure models with random residuals: Item modeling for explanation and item generation. Psychometrika, 79(1), 84–104. 10.1007/s11336-013-9360-2
  7. Choi J., Kim S., Yoon K. (2012–2021). CAFA AIG manual: Computer adaptive formative assessment automatic item generation user’s guide [System manual] (2nd ed.). CAFA Lab, Inc.
  8. College Board (2022). SAT suite of assessments: The math test. https://satsuite.collegeboard.org/sat/whats-on-the-test/math
  9. Embretson S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64(4), 407–433. 10.1007/bf02294564
  10. Fischer G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6), 359–374. 10.1016/0001-6918(73)90003-6
  11. Geerlings H., Glas C. A. W., van der Linden W. J. (2011). Modeling rule-based item generation. Psychometrika, 76(2), 337–359. 10.1007/s11336-011-9204-x
  12. Gierl M. J., Lai H., Turner S. R. (2012). Using automatic item generation to create multiple-choice test items. Medical Education, 46(8), 757–765. 10.1111/j.1365-2923.2012.04289.x
  13. Gierl M. J., Lai H. (2013). Using automated processes to generate test items. Educational Measurement: Issues and Practice, 39, 36–50.
  14. Glas C. A. W., van der Linden W. J. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247–261. 10.1177/0146621603027004001
  15. Haladyna T. (1994). Developing and validating multiple-choice test items. Lawrence Erlbaum Associates.
  16. Haladyna T. M., Rodriguez M. C. (2013). Developing and validating test items. Routledge.
  17. Hombo C., Dresher A. (2001, April). A simulation study of the impact of automatic item generation under NAEP-like data conditions. Paper presented at the annual meeting of the National Council on Measurement in Education. NCME.
  18. Irvine S. H. (2002). Item generation for test development: An introduction. In Irvine S. H., Kyllonen P. (Eds.), Item generation for test development (pp. xv–xxv). Lawrence Erlbaum.
  19. Irvine S. H. (2014). Computerised test generation for cross-national military recruitment: A handbook. IOS Press.
  20. Johnson M. S., Sinharay S. (2005). Calibration of polytomous item families using Bayesian hierarchical modeling. Applied Psychological Measurement, 29(5), 369–400. 10.1177/0146621605276675
  21. Kane M. T. (2006). Validation. In Brennan R. (Ed.), Educational measurement (4th ed., pp. 17–64).
  22. Lane S., Raymond M., Haladyna T. (2016). Test development process. In Lane S., Raymond M., Haladyna T. (Eds.), Handbook of test development (2nd ed., pp. 3–18). Routledge.
  23. Lewis C., Sheehan K. (1990). Using Bayesian decision theory to design a computerized mastery test. ETS Research Report Series, 1990(2), i–48. 10.1002/j.2333-8504.1990.tb01364.x
  24. Luecht R. M. (2012). Automatic item generation for computerized adaptive testing. In Automatic item generation (pp. 206–226). Routledge.
  25. Mills C. N. (1999). Development and introduction of a computer adaptive Graduate Record Examination General Test. In Innovations in computerized assessment (pp. 117–135). Psychology Press.
  26. Moreno K. E., Segall D. O. (1997). Reliability and construct validity of CAT-ASVAB.
  27. National Council of State Boards of Nursing, Inc. (2022). Computerized adaptive testing (CAT). https://www.ncsbn.org/exams/before-the-exam/computerized-adaptive-testing.page
  28. Perie M., Huff K. (2016). Determining content and cognitive demands for achievement tests. In Lane S., Raymond M., Haladyna T. (Eds.), Handbook of test development (2nd ed., pp. 119–143). Routledge.
  29. Sinharay S., Johnson M. S. (2008). Use of item models in a large-scale admissions test: A case study. International Journal of Testing, 8(3), 209–236. 10.1080/15305050802262019
  30. Sinharay S., Johnson M. S., Williamson D. M. (2003). An application of a Bayesian hierarchical model for item family calibration. ETS Research Report Series, 2003(1), i–41. 10.1002/j.2333-8504.2003.tb01896.x
  31. Swaminathan H., Gifford J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In New horizons in testing (pp. 13–30). Academic Press.
  32. Wainer H., Bradlow E. T., Du Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In Computerized adaptive testing: Theory and practice (pp. 245–269).
  33. Wang S., Lin H., Chang H. H., Douglas J. (2016). Hybrid computerized adaptive testing: From group sequential design to fully sequential design. Journal of Educational Measurement, 53(1), 45–62.
  34. Weiss D. J. (1976). Adaptive testing research in Minnesota: Overview, recent results, and future directions. In Proceedings of the first conference on computerized adaptive testing (pp. 24–35). United States Civil Service Commission.
  35. Zheng Y., Chang H. H. (2015). On-the-fly assembled multistage adaptive testing. Applied Psychological Measurement, 39(2), 104–118.



Articles from Applied Psychological Measurement are provided here courtesy of SAGE Publications
