Abstract
Continuously administered examination programs, particularly credentialing programs that require graduation from educational programs, often experience seasonality, where distributions of examinee ability may differ over time. Such seasonality may affect the quality of important statistical processes, such as item response theory (IRT) item calibration and equating. The lead time required for producing pre-equated test forms in the continuous testing framework further complicates these issues. This study examines the effect of seasonality in test data on Rasch IRT item parameter estimates. Data came from four credentialing examination programs, representing programs both with and without seasonality and with medium and low examinee volume. Results showed that calibrating items at certain times can lead to quite poor item parameter estimates. While certain programs could conduct IRT calibrations without waiting for the full examination cycle to be completed, other types of programs should wait as long as possible before calibrating items.
Keywords: IRT, Rasch model, seasonality, calibration, continuous testing, calibration timing, certification, licensure, credentialing
One of the most important factors affecting one's ability to develop assessments with appropriate psychometric quality is the properties of the available test items that can be used to construct the assessments. The properties of these items are often maintained in an item bank. The item bank typically evolves over time as new items are written, items without data are piloted, and additional data on old items are collected through new administrations. Before a new scored assessment can be launched, most assessment programs require that items that will contribute to an examinee's score have been piloted and that sufficient data have been obtained to ensure that the items are functioning appropriately (Schmeiser & Welch, 2006). This study investigates what happens when examinees who test at different times of the year differ in ability, and how these differences may affect the quality of pilot item parameter estimates. Numerous exam programs experience examinee ability seasonality, in which the characteristics of the ability distribution change over time, yet little research has been conducted into the effects of this phenomenon.
Depending on the assessment program, items may be piloted in a variety of ways. A common design is to embed pilot items alongside scored items on existing tests, whether the test is administered as an event or continuously via computer. In event testing, examinees usually take the assessment on a specific test date or during a specific testing window, and the responses to all test items are collected and analyzed following the test administration. In this model of assessment, a testing program may administer multiple forms during a single testing window and may have multiple testing dates throughout the calendar year. When the test is continuously administered via computer, examinees can test on any day that they are able to schedule an appointment to take the assessment. This model of assessment can create challenges in analyzing data and launching new test forms, because a new form typically must be launched before the responses of all examinees from the prior testing period have been collected. This means that one must choose how to handle these assessment data. One option is to exclude new pilot items from subsequent test forms until all data for the prior testing period have been collected and analyzed. A second option is to trust that data analyzed before the forms are constructed are of sufficient quality and are representative of the full set of data that may be collected. Another option is to include the new items but retain the option to delete them if additional data show that they are not functioning appropriately. In this model of assessment, a testing program may have multiple forms available for some period of time, and the forms may be rotated and updated periodically (e.g., monthly, quarterly, or yearly). The idea of a form in these assessment models can be extended to an item pool, which can be used to deliver different tests.
For each model of assessment there are multiple options for test delivery. These include fixed form testing, computer adaptive testing (CAT), and multistage adaptive testing (Luecht & Sireci, 2011; Parshall, Spray, Kalohn, & Davey, 2002). In fixed form testing, the test has a set length of scored and pilot items, which are shared across examinees that take the same form. In CAT, the test questions given to examinees typically change based on their ability level, and the test can be either fixed length or variable length. Usually, a fixed number of pilot items is given to examinees, but the scored items and the sample of pilot items can change across candidates. Multistage adaptive testing shares elements with CAT in that the test adapts to the examinee, but the adaptation occurs after the examinee completes a block of items in a stage. The blocks of items across stages are usually common to all examinees taking a particular route through the stages. Pilot items are typically embedded in different blocks within the stages of the test.
A key result of the analysis of assessment data from any model of assessment and test delivery strategy is often a set of item response theory (IRT) item parameters for newly piloted items. The item parameters from these analyses are usually entered into item banks, which are then used to perform other test development activities. For example, a common practice is to use the IRT item parameters to create target test information or test characteristic curve functions. A testing program may then match test information or test characteristic curve functions to make the forms as close to parallel as possible (van der Linden, 2005). Another use of IRT item parameters is to pre-equate test forms by fixing the item parameters at the values stored in the item banks (Kolen & Brennan, 2004; Tong, Wu, & Xu, 2008). This allows raw-to-scale score conversion tables to be created before the tests are administered and makes it possible to report test results more rapidly. In CAT, fixing the item parameters plays an important role in deciding which items to administer, when to terminate the exam, and in determining examinee abilities after examinees have responded to each item (Babcock & Weiss, 2012; Thompson, 2007).
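To make the role of these bank parameters concrete, the following is a minimal sketch, not drawn from any of the programs studied here, of how Rasch difficulties might be turned into test characteristic and test information functions when comparing candidate forms; the item difficulties and form contents are invented for illustration:

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def tcc(theta, difficulties):
    """Test characteristic curve: expected raw score at each theta."""
    return sum(rasch_prob(theta, b) for b in difficulties)

def tif(theta, difficulties):
    """Test information function: sum of item informations p * (1 - p)."""
    return sum(rasch_prob(theta, b) * (1.0 - rasch_prob(theta, b)) for b in difficulties)

theta_grid = np.linspace(-3, 3, 61)
form_a = [-1.2, -0.4, 0.0, 0.3, 1.1]  # hypothetical bank difficulties
form_b = [-1.0, -0.5, 0.1, 0.4, 0.9]

# Forms are "close to parallel" when their TCC and TIF curves nearly coincide.
max_tcc_gap = np.max(np.abs(tcc(theta_grid, form_a) - tcc(theta_grid, form_b)))
max_tif_gap = np.max(np.abs(tif(theta_grid, form_a) - tif(theta_grid, form_b)))
print(f"Largest TCC gap: {max_tcc_gap:.3f}; largest TIF gap: {max_tif_gap:.3f}")
```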
Despite the critical role that IRT item parameters play in the development of future test forms, limited research has looked at how the timing of calibrations may affect the parameters that are estimated for pilot items. This is especially the case in the second assessment model where the test forms may be continually available for some period of time and some lead time is needed to develop new forms and make sure that appropriate quality control checks are completed prior to the launch of the new forms. An additional confounding factor that can occur for testing programs that continuously administer test forms on computer is the potential for seasonality in test data. In particular, there may be different times during the assessment cycle in which the population of underlying examinees may be more or less able or may have more or less variation in their ability compared with another point in time.
A common example of seasonality occurs in licensure and certification (credentialing) programs when candidates graduate from different institutions. In many cases, candidates graduate from their educational programs in the late spring or early summer months of the year. Based on anecdotal evidence, as well as statistical evidence collected by the authors for several programs with which they have worked, the most prepared candidates often tend to take the exam within a few weeks of graduation. Candidates who wait longer to take the exam usually tend to perform more poorly. This decrease in performance could be due to a variety of factors, including forgetting the material tested on the exam, personality traits (such as a tendency to procrastinate) that could correlate with both poorer performance and waiting to take the test, test anxiety, or some combination of these. Because candidates tend to graduate from various institutions near the same time of the year, examinees taking the exam in the late spring and early summer tend to be candidates taking the exam close to graduation, and they tend to score higher. Examinees taking the exam in the winter months tend to be taking the exam with a longer lag after graduation, and this often corresponds to lower examinee scores. These calendar differences can influence the calibration sample and the item parameters that may be obtained from the calibration.
The purpose of this article is to investigate how the timing of when items are calibrated for several different credentialing programs that continuously administer fixed form tests via computer may or may not affect item parameter estimates. Since the credentialing programs from which these data are drawn employ the Rasch model, the focus will be on the recovery of Rasch item parameters at different points in time. In the next section of this article, the literature that has looked at how item parameter estimates are affected by a variety of factors is reviewed. The data and methods used in this article are described next. This is followed by the results for each of four different credentialing programs. Results suggest that some different patterns can be observed for different programs, but in general the best recovery of item parameters was observed the longer one waits to calibrate data. The article concludes with additional discussion of these results and what they mean for the scheduling of item calibrations for testing programs that continuously administer tests online.
Literature Review and Background
A variety of research articles have looked at the recovery of IRT item parameters in different contexts. The most common design for this research is to simulate item response data with a known ability distribution and set of item parameters. Various factors may be manipulated in the study, such as examinee sample size, the examinee distribution, the number of test items, the item parameters of the items, item parameter and examinee ability estimation methods, and the IRT software packages used. Examples of this type of research for the Rasch model include Meyer and Hailey (2012); Svetina et al. (2013); Swaminathan and Gifford (1982); van den Wollenberg, Wierda, and Jansen (1988); Wang and Chen (2005); Wright and Douglas (1977); and Zwinderman and van den Wollenberg (1990). Studies using the two-parameter logistic (2PL) and three-parameter logistic (3PL) models include Drasgow (1989); Harwell and Janosky (1991); Hulin, Lissak, and Drasgow (1982); Kirisci, Hsu, and Yu (2001); Seong (1990); Skaggs and Stevenson (1989); Stone (1992); Swaminathan and Gifford (1983, 1985); and Yen (1987).
The primary focus of these articles is generally to evaluate the performance of the IRT models, estimation methods, and software packages across the conditions included in the study. This is typically accomplished by looking at the root mean square error and/or bias of the parameter estimates. For example, Wang and Chen (2005) looked at the bias of item parameter estimates for tests of length 10, 20, 40, and 60 items for examinee sample sizes of 100, 200, 400, 600, 1,000, 1,500, and 2,000 using the WINSTEPS (Linacre, 2001) computer program. They found that the bias in item parameter estimates tended to decrease as test length and examinee sample size increased. Similarly, Meyer and Hailey (2012) investigated the bias and root mean square error of WINSTEPS and jMetrik (Meyer, 2011) for tests of length 25, 50, and 75 items for examinee sample sizes of 250, 500, 1,000, and 5,000 examinees. They observed that increasing the number of items tended to improve the root mean square error, but often did not have a large effect on the bias. Results from item parameter recovery studies have led to suggestions that acceptable Rasch item parameters can be found with a couple of hundred examinees, while the 2PL and 3PL models typically require more data to yield acceptable item parameters (de Ayala, 2009; Kolen & Brennan, 2004).
None of the aforementioned studies looked at the recovery of item parameters for pilot items. These studies assumed that all the test items were administered to all examinees and that the sample sizes of examinees taking all the items were the same. Most of the work on methods for calibrating pilot items to add to an existing item bank has been done in the context of CAT. Ban, Hanson, Wang, Yi, and Harris (2001) compared five different methods for calibrating pilot items to add to a CAT item bank. They found that a method based on maximum likelihood with multiple expectation-maximization cycles tended to perform the best and that an approach based on BILOG-MG did not work well in several conditions. Ban, Hanson, Yi, and Harris (2002) extended the Ban et al. (2001) study and compared three methods for calibrating pilot items when both pilot items and scored items had large amounts of missing data. They also found that the maximum likelihood method with multiple expectation-maximization cycles performed the best. Kingsbury (2009) introduced a different method called adaptive calibration. He demonstrated using simulations that this method could be effective for adding to an existing CAT item bank. Studying item parameter recovery of pilot items in the context of CAT creates additional complications compared with other forms of testing where candidates see all scored test items, because there can be many scored items with extreme levels of missing data, and the examinees that respond to pilot items may have different ranges of abilities. These factors may lead to different results than those that may be obtained outside of the CAT context. Research by Babcock and Albano (2012) evaluated how new item parameter estimates performed on fixed forms using a fixed item parameter estimation strategy, but the authors only examined the case of equal sample sizes in calibrating all the items in the context of item difficulty drift.
The focus of the current study is on computer-based fixed form testing where tests are continuously administered via computer. In these cases, examinees take a set of scored items and a set of pilot items. These designs can reduce the range restriction and extreme levels of missing data for scored items compared with CAT contexts. Similar to the CAT context, an equating method is needed to place the item parameters for the pilot items onto the same scale as the scored items. Typical equating methods that may be used for these purposes include the Stocking and Lord (1983) procedure or fixing the item parameters for the scored items to their values from an earlier analysis (Kim, 2006).
An important assumption employed when calibrating pilot items and using them in subsequent analyses, such as the equating of test forms, is the IRT assumption of parameter invariance. This assumption states that the IRT model and the parameters for items and people should not change across populations, up to a linear transformation (Embretson & Reise, 2000; Engelhard, 2013; Fischer & Molenaar, 1995). In practical applications, however, the assumption of parameter invariance may not always hold. A common example is differential item functioning, where membership in different subgroups is associated with different expected item performance (Holland & Wainer, 1993). In this case, the pattern of responses to the items can change as a function of subgroup membership, which can result in different item parameter estimates if one calibrated data from different subgroups (Smith, 2004; Wyse, 2013). A similar, although not as well-documented, situation in which different groups of examinees may produce different patterns of responses and potential changes in item parameter estimates is when test data have seasonality. In these cases, examinees who take the exam at different points in time may have different abilities and may interact and respond to test items in different ways. The important realization is that seasonality has the potential to produce shifts in examinee ability that may not directly translate into corresponding shifts in the observed item responses across items, which may lead to different item parameters for various items over time. An important confounding factor in these contexts is also the sample size of examinees, as item parameter estimates tend to be more stable and have less bias when more item responses are used.
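For the Rasch model in particular, the admissible linear transformation is a translation of the logit scale. The following short derivation, a standard property of the model stated here for completeness rather than drawn from the article, shows why shifting all abilities and difficulties by the same constant $c$ leaves response probabilities unchanged:

$$P(X_{pi}=1 \mid \theta_p, b_i)=\frac{\exp(\theta_p-b_i)}{1+\exp(\theta_p-b_i)}, \qquad (\theta_p+c)-(b_i+c)=\theta_p-b_i.$$

Because only the difference $\theta_p - b_i$ enters the model, a seasonal shift in the ability distribution is benign so long as the observed item responses shift in the model-consistent way; invariance is threatened precisely when they do not.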
This study investigates two research questions in these contexts.
Research Question 1: How well can pilot item parameters from a full sample of examinees be recovered if the calibration of the pilots was done at an earlier point in time where there were fewer examinees?
Research Question 2: How will differences in the underlying characteristics of the sample due to seasonality effects impact pilot item parameters?
The authors are unaware of any study that has investigated how item parameter recovery of pilot items is affected by calibration timing and potential seasonality in test data in the context of fixed form tests that are continuously administered via computer. This study makes an important contribution to the literature by examining how well Rasch item parameter estimates for four credentialing programs, some of which experience seasonality and some of which do not, were recovered when items were calibrated at different time points. To examine how sample size may interact with the parameter estimates, two of the programs considered have medium examinee volume, while the other two have low volume.
Data and Method
Data and Exam Program Description
Data for this study come from four different medical imaging credentialing programs. The tests for these programs are continuously administered via computer throughout the year. Two of the certification programs have been reviewed and accredited by the National Commission for Certifying Agencies (Institute for Credentialing Excellence, 2014), though all four of the programs develop exams that conform to the National Commission for Certifying Agencies’ standards. In each program, candidates are randomly assigned a test form containing a set of pilot items and a set of scored items. The pilot items and scored items are completely randomized within a form and candidates can receive items in any order. This is done to minimize item context effects across the group of examinees, ensure that similar effort is given to pilot and scored items, and as a test security measure.
Test development activities for the different testing programs occur on an ongoing basis with various committees coming in to review test items, approve new test forms, and provide feedback at several points during the year. Assessment data for each program are reviewed weekly using ongoing item analyses and quality control checks. New test forms are launched following the reviews by volunteer subject matter experts at different points throughout the year for the different credentialing programs.
We selected these four programs because, although their exam development methods are quite similar, the characteristics of their examinee populations allow for good comparisons to be made. There are two non-content-related factors that influence how one would analyze these four programs' data. The first factor is examinee volume. In this study, two of the examination programs have 900 or more people taking each exam form every year (medium-volume programs), while two of the programs have fewer than 350 people per year (low-volume programs). Although 300 to 350 people appears to be sufficient for good Rasch item parameter estimates based on previous research, using only half or a fourth of this volume, as happens when calibrating less than a full year's worth of data, could lead to sample sizes small enough to yield suboptimal results. As previously noted, the psychometric literature is clear about the influence of sample size on IRT models and parameters; estimation error in item parameters tends to decrease as more data are collected. The psychometric literature is also clear that including additional items generally improves IRT item parameter estimates, though there may be diminishing returns once a test is already long.
Another factor that influences these programs is a cyclical trend in both examinee performance and examinee volume. Numerous credentialing organizations, particularly in the health sciences and the skilled vocations, require examinees to graduate from an educational program before becoming eligible to take a given exam. This can create seasonality in test data, the impact of which has not been thoroughly documented in the research literature as it relates to IRT models and parameters. Two of the examination programs require graduation from educational programs at hospitals, community colleges, and universities. These programs graduate most of their students from mid-May through August. Most graduates of these programs take the exams within a few weeks of graduation, so both exam volume and exam performance for the educational program-associated exams increase substantially during the middle months of the year. Two of the examinations, however, do not require graduation from educational programs. These exams have a relatively steady number of examinees throughout the year, and performance throughout the year is not extremely variable. The four programs thus represent a medium-volume exam with seasonality, a low-volume exam with seasonality, a medium-volume exam without seasonality, and a low-volume exam without seasonality. This variety allows us to examine the impact of the interaction between examinee volume and seasonality on important psychometric properties of to-be-published exams. Table 1 contains a simple tabular reference of the important information about each of the four exam programs used in this research.
Table 1.
Basic Information About the Four Exam Programs Used for this Study’s Conditions.
| Exam program | Number of operational items per form | Total number of pilot items | Has seasonality effects? | Approximate examinee volume |
|---|---|---|---|---|
| Condition 1 | 200 | 160 | Yes | Medium |
| Condition 2 | 200 | 80 | Yes | Low |
| Condition 3 | 165 | 80 | No | Medium |
| Condition 4 | 75 | 20 | No | Low |
IRT Data Analysis
We attempted to analyze the data from the examination programs in a way as similar as possible to the methods used in the production of the actual exam programs. For each testing program, the item parameters for the scored items on each form were fixed to the values previously stored in each item bank, and we applied the computer program WINSTEPS to test form data that had been collected over 12-month periods. The fixed values of the item parameters in the item banks were themselves obtained from a full-year calibration. Consistent with how data are analyzed operationally, various statistics for the scored items were examined to determine whether the item parameters should remain fixed at the values stored in the item bank or whether the items should be recalibrated to obtain new item parameters. Scored items were recalibrated when the item displacement statistics were greater than 0.5 in absolute value, as suggested by Linacre (2001) when using WINSTEPS. Conceptually, item displacement is the amount that the Rasch difficulty would change if the item's difficulty value were freely estimated instead of anchored to a fixed value. These final sets of item parameters for the scored items were then fixed and used to estimate item parameters for the pilot items that appeared on the test forms for the full 12-month periods. The pilot item results from the full 12-month periods were treated as the known item parameters and used to form baselines in several statistical indices. Table 1 shows the number of operational items on a form and the total number of pilot items for the four testing programs.
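As a concrete illustration of the displacement screen, here is a minimal sketch; the item names and difficulty values are hypothetical, and in practice WINSTEPS reports the displacement statistic directly, so the computation below is only conceptual:

```python
# Anchored bank difficulties versus the difficulties a free recalibration
# would produce for the same items (all values invented for illustration).
bank_difficulty = {"item_01": -0.80, "item_02": 0.25, "item_03": 1.10}
free_estimate = {"item_01": -0.75, "item_02": 0.95, "item_03": 1.05}

FLAG_THRESHOLD = 0.5  # cutoff in logits suggested by Linacre (2001)

# Displacement: free estimate minus anchored value; flag large shifts.
to_recalibrate = [
    item for item in bank_difficulty
    if abs(free_estimate[item] - bank_difficulty[item]) > FLAG_THRESHOLD
]
print(to_recalibrate)  # ['item_02']: displacement of 0.70 exceeds the cutoff
```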
To determine the impact of calibrating data at different times throughout the 12-month periods, we analyzed the data at the end of each month in the same fashion as for the full 12-month periods and compared these results with the results from the full 12-month periods. Taking this approach makes sense because there is often a desire to obtain parameters for the pilot items and use them in various test development activities at different points during the year. At the very least, the psychometric properties of the exam must be determined 2 to 3 months before the exam launches in order to give staff and the exam vendor sufficient time to quality control the exam for scoring accuracy prior to launch. The hope is that the parameter estimates from the full 12-month periods would be well approximated by data obtained from a shorter time period, which would imply that these items can be reviewed and included as potential scored items on new forms more rapidly than waiting to obtain data from the full sample.
Evaluation Criteria
Five separate criteria were used to evaluate the comparability of the results from the cumulative monthly calibrations with the results obtained from the full sample calibration. The first criterion was the percentage of scored items that had Rasch displacements greater than 0.5 in absolute value and, thus, required recalibration. The goal for this criterion was that the percentage of items that had large displacements should be similar for the cumulative monthly calibrations to what was observed for the full 12-month period. This signals that the percentage of recalibrated scored items would be similar. The operational exam programs monitor the Rasch displacement for item recalibration purposes, as is a common practice in other exam programs (O’Neill, Peabody, Tan, & Du, 2013).
The second criterion is similar in concept to the first, but looks at the item parameters of the pilot items: the percentage of items for which the difference between the estimated item parameter from a monthly calibration and the estimate from the full sample was greater than 0.5 in absolute value. The goal for this criterion was that the percentage of pilot items with large changes be close to zero, because this would signal that there was not a substantial percentage of items with large variations in estimated difficulty in the monthly calibrations. This criterion was similar to the first, only applied to the pilot items instead of the scored operational items. Nonanchored items, by definition, have a displacement of zero, so the second criterion provides a parameter stability measure for new items.
The third criterion is the root mean square difference (RMSD). The RMSD is the square root of the average squared difference between the monthly calibration and the full calibration item parameter estimates. The RMSD can be represented algebraically as

$$\mathrm{RMSD}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{b}_{iM}-\hat{b}_{iF}\right)^{2}},$$

where $\hat{b}_{iM}$ is the item parameter estimate obtained after a given month $M$ for pilot item $i$, $\hat{b}_{iF}$ is the item parameter estimate from the full sample $F$ for pilot item $i$, and $n$ is the total number of pilot items. The goal is that the RMSD be close to zero, because this signals that the item parameter estimates for the monthly calibration were close to those obtained for the full sample. A wide variety of other research has used the RMSD as a measure of item parameter performance, as it is the square root of the mean square error of the item parameters. This makes it a good overall criterion for evaluating item parameter performance.
The fourth criterion was the mean absolute difference (MAD). The MAD measures the average of the absolute differences between the monthly calibration and the full calibration item parameter estimates. The MAD can be represented algebraically as

$$\mathrm{MAD}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{b}_{iM}-\hat{b}_{iF}\right|,$$

with the terms in the equation having the same meaning as before. The goal is that the MAD be close to zero, because this signals that the item parameter estimates for the monthly calibration were close in absolute value to those obtained for the full sample. The MAD is similar to the RMSD, although it is on a scale that represents a direct difference in the item parameters.
The fifth criterion was the mean difference (MD).1 The MD measures the average difference between the monthly calibration and the full calibration item parameter estimates. The MD can be represented algebraically as

$$\mathrm{MD}=\frac{1}{n}\sum_{i=1}^{n}\left(\hat{b}_{iM}-\hat{b}_{iF}\right),$$

where the terms in the equation have the same meaning as before. When the MD is positive, it indicates that the item parameter estimates for the monthly calibration were more difficult on average than those from the full sample; when the MD is negative, it indicates that the monthly item parameter estimates were less difficult on average than those from the full sample. The goal is that the MD be close to zero, because this signals that the item parameter estimates for the monthly calibration were close to those obtained for the full sample. The MD is a measure of whether the parameter estimates from a given month's calibration are consistently too high or too low, either of which is undesirable. Interpreting the MD alongside the RMSD or MAD can indicate whether parameter inaccuracy is due to consistently missing the mark in one direction versus a large amount of overall estimation error in the parameters.
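The following minimal sketch, using invented difficulty estimates, shows how the RMSD, MAD, and MD, along with the percentage of pilot items changing by more than 0.5 logits (the second criterion), would be computed from a monthly calibration and the full-sample baseline:

```python
import numpy as np

b_month = np.array([-0.42, 0.10, 0.95, -1.30, 0.60])  # estimates after month M
b_full = np.array([-0.50, 0.05, 0.30, -1.25, 0.70])   # full 12-month estimates

diff = b_month - b_full
rmsd = np.sqrt(np.mean(diff ** 2))             # criterion 3
mad = np.mean(np.abs(diff))                    # criterion 4
md = np.mean(diff)                             # criterion 5 (signed; + = harder)
pct_large = 100 * np.mean(np.abs(diff) > 0.5)  # criterion 2

print(f"RMSD={rmsd:.3f}  MAD={mad:.3f}  MD={md:+.3f}  %|diff|>0.5: {pct_large:.0f}%")
```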
Using all five criteria in combination provides a good overall picture of how well the item parameters were recovered at different points in time, as the measures target unique and complementary aspects of item parameter recovery for these data. In addition to these five criteria, we also report the average cumulative sample size of examinees for the pilot items, the average examinee ability estimates, and the standard deviation (SD) of the ability estimates at the end of each month. These values allow one to see the different volumes of candidates that took the pilot items after different months, and they also allow one to identify whether there were changes in the examinee distributions. An important part of this research is investigating how potential changes in the examinee ability distributions, interacting with examinee volume, may affect pilot item parameters.
Results
To make the results easy to display and interpret, the five criteria and the average cumulative sample size, the average examinee ability, and the SD of examinee ability were displayed in graphical form with month on the x-axis and the criterion of interest on the y-axis.
Condition 1: Medium Volume With Seasonality
Figure 1 displays the results for Condition 1. The average cumulative sample size, average examinee ability, and SD of examinee ability show the seasonality that exists for this medium-volume program. In particular, one can see that there was a dramatic increase in volume in Month 5; this corresponded to an increase in average examinee ability and a decrease in the SD of ability. Following Month 5, the average examinee ability and the SD of examinee ability remained stable. Looking at the five criteria shows the challenges that can occur if data were calibrated prior to Month 5. In particular, a large percentage of scored items were flagged with large displacements, and the RMSD, MAD, and percentage of items with changes greater than 0.5 compared with the full sample were relatively high. This may signal that the item parameter invariance assumption of the Rasch model was less likely to hold in the earlier months. The MD statistic was not largely affected, but was still worse than the values seen after Months 8 and 9. Some of the poor performance in the first 4 months may be due to low candidate volume interacting with the nonrepresentativeness of the data. One can also see that the results for the five criteria appeared to reach a lower asymptote after 8 to 9 months of data had been collected, in that there were only relatively minor additional improvements in the values of the criteria. This suggests that, although the best values would be obtained the longer one waited to calibrate data, results obtained after 8 to 9 months well approximated those from a full year's calibration.
Figure 1.
Calibration timing results for Condition 1 (medium volume with seasonality).
Condition 2: Low Volume With Seasonality
Figure 2 shows the results for the monthly calibrations for the second condition. Compared with Figure 1, one can see that the seasonality experienced by this program has two points where there were jumps in examinee volume and ability. The first occurred in Month 5, and the second occurred in Month 8. The results for the first several months of calibrations were extremely poor, probably due in part to the fact that the candidates differed in ability from the full sample and that there were very few of them. Similar to Condition 1, the number of items flagged with large displacements was higher in the first few months of the test cycle, which again may signal that the item parameter invariance assumption of the Rasch model was less likely to hold. After about 9 to 10 months, one can see that the values for each of the five criteria appeared to reach a lower asymptote, with minimal further improvements in the values of each of the indices. This was the point after which the spikes in candidate volume had been observed and the mean and SD of candidate ability had stabilized. After 9 to 10 months, the percentage of items with large displacements was close to that of the full sample, and the RMSD, MAD, MD, and percentage of items with large changes were close to zero.
Figure 2.
Calibration timing results for Condition 2 (low volume with seasonality).
Condition 3: Medium Volume Without Seasonality
Figure 3 shows the calibration results for the third condition. In comparison with Figures 1 and 2, one can see that the candidate volume was steadier throughout the 12-month period and there were no big shifts in the average or SD of examinee ability. One can also see some differences in the five criteria. Specifically, the highest values for several criteria were notably lower than those in Figures 1 and 2. For example, the percentage of scored items flagged with large displacements was slightly more than 10% in Month 1, compared with almost 30% in Figure 1 and more than 50% in Figure 2. For the percentage of items with large displacements and the percentage of items with large changes criteria, minimal changes were observed after Month 3. This indicates that the Rasch parameter invariance assumption appeared to hold better for these data during the earlier months of the year. For the other three criteria, which are more sensitive to small changes in item parameters, there tended to be small incremental improvements with each successive month. Acceptable values for these criteria would probably be obtained after 8 or 9 months.
Figure 3.
Calibration timing results for Condition 3 (medium volume without seasonality).
Condition 4: Low Volume Without Seasonality
Figure 4 shows the results for the fourth condition. Here, one can see a steady increase in exam volume throughout the year and a relatively narrow range of average examinee ability. There were some minor shifts in the SD of examinee ability between months, but the SD remained between 0.6 and 0.7. Examining the five criteria shows that, except for the percentage of items with large changes criterion, the best results tended to be observed after Month 11. In these cases, the RMSD, MAD, MD, and percentage of items with large displacements were closest to the values from the full sample. Several of the criteria did not appear to reach a lower asymptote after 8 to 10 months of data had been collected. This was somewhat different from the results shown in some of the other figures, where, even though the best results were observed the longer one waited, it was possible to observe a leveling off of the criteria, with only relatively minor improvements for some of the later monthly calibrations. This may be due in part to the minor month-to-month changes in ability in the later months having enough of an impact to prevent this leveling off from occurring.
Figure 4.
Calibration timing results for Condition 4 (low volume without seasonality).
Discussion and Conclusion
An important consideration for many testing programs that administer tests with pilot items is when to calibrate the pilot items so as to obtain sufficiently accurate item parameters. This article investigated the impact of when item calibrations were conducted on item parameter estimates for four different credentialing programs. These programs differed in candidate volume and in whether seasonality was present during the examination cycle. Results showed that in some cases acceptable item parameters could be obtained as early as 8 months into the yearly exam cycle, while in other cases it was necessary to wait until later in the examination cycle for item parameter results to stabilize and approach the values from the full sample. Results suggested that, for the programs experiencing examinee seasonality, item parameters using the entire year's data were better approximated if one waited until a few months after the examinee shifts had occurred before calibrating. Results did not necessarily suggest that data could be calibrated more quickly for exam programs without seasonality than for exam programs with seasonality; the improvements in the item parameters, however, were more consistently incremental for programs without seasonality than for those with seasonality.

It appears that the volume and distributional representativeness of data were notable factors in obtaining appropriate item parameter estimates. In particular, we often found values of the five criteria that were three or more times larger than the values from the full sample when there were fewer candidates and the candidates did not represent the full sample of test takers well. For example, in Condition 2, the percentage of items flagged with large displacements was more than 50% in Month 1, compared with approximately 10% in Month 11. Generally, one needs to make sure that a sufficient number of candidates have responded to the items and that the candidates are similar in ability to the full sample in order to obtain stable parameter estimates. This makes sense given that the bias in item parameter estimates when using WINSTEPS is affected to a great extent by sample size (Wang & Chen, 2005). Previous research also suggests that for the invariance property of the Rasch model to hold with different samples of examinees, the data need to fit the model (Bond & Fox, 2007; Engelhard, 2013). Our results suggested that there was greater misfit and more recalibration of previously scored items required when data were not representative of the full sample of examinees or when there were fewer examinees. This lack of fit also may have affected the stability of item parameter estimates for some of the earlier monthly calibrations.
Although the finding that the longer one waits to calibrate item response data, the better the results will approximate a full year's calibration may seem intuitive and rather straightforward, this practical observation has not been clearly documented and investigated in the research literature. In operational testing programs, the option of waiting until all data have been collected to analyze item response data is often not logistically feasible, given the many quality control steps that need to be performed prior to the launch of new forms. This can also affect the use of these estimates in data review meetings or other test development activities. The implication of this research is that one should proceed with caution in using pilot item parameter estimates in test development activities and the creation of new test forms until sufficient and representative data have been collected. Interpreting the examples presented in this article conservatively, pilot item data for some of the programs that we investigated should not be used until 8 or 9 months of data have been collected, and in some cases one should wait longer than that, so long as it is operationally possible. Based on the results for the percentage of items with large changes criterion, some of the larger programs could possibly calibrate items sooner, so long as the sample size per item is large enough and no major shifts in examinee ability occur after that point in time.
Obviously, the need to wait to obtain sufficient and representative data can directly interfere with the amount of time needed to perform the activities necessary to launch a new form. An option in this case may be the staggered launch of new test forms throughout the year with different sets of pilot items, so that various sets of pilots can have reasonable data at different points in time. The testing programs used as examples in this article do stagger the launch of test forms and test development activities for different testing programs at various points throughout the year. The most important thing that a program can do, however, is to become familiar with the potential month-to-month changes in examinee volume and distribution. Without knowing the type of seasonal effects one may or may not have, it is impossible to know whether one’s data in earlier months are representative of a full exam cycle.
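As one possible way to operationalize this kind of monitoring, the sketch below tabulates examinee volume and the mean and SD of ability estimates by test month; the table of examinee records and the mid-year ability bump are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
month = np.repeat(np.arange(1, 13), 30)  # 30 hypothetical examinees per month
# Invented seasonal pattern: mean ability rises by 0.5 logits in Months 5-8.
theta = rng.normal(0.0, 1.0, month.size) + np.where((month >= 5) & (month <= 8), 0.5, 0.0)
records = pd.DataFrame({"month": month, "theta": theta})

# Month-by-month volume and ability summary; spikes flag seasonal months.
summary = records.groupby("month")["theta"].agg(
    volume="size", mean_ability="mean", sd_ability="std"
)
print(summary)
```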
In certain respects, these results suggest that Rasch item parameters may not be completely unbiased and invariant across all possible samples of examinees. While Rasch models have the advantage of the parameter invariance assumption, this assumption is critically dependent on the fit of the model. It is possible that different samples of examinees may fit the model to a greater or lesser degree, and the samples may lead to more or less bias in item parameter estimates. The assumption of invariance and model fit should be examined and tested with different samples. Model users should strive to obtain the best calibration sample possible so that item parameter estimates will be as unbiased and accurate as possible.
Although the study was designed to investigate how item parameter calibration timing may affect item parameter estimates in a variety of different contexts, it is possible that different results may be obtained in other situations. For example, it is possible that different patterns of seasonality may be observed in other testing programs compared with the patterns observed in the four examples used in this article. These different patterns may lead to somewhat different results than we found in our analyses. An example may be a testing program that has high volume and higher ability candidates in the first couple of months of testing and only marginal increases in volume in later months of the year. Programs may also observe spikes in examinee volume just before the end of the calendar year due to employer or state credentialing requirements that correspond to a 12-month calendar schedule. Other programs may also have more dramatic shifts in examinee ability and may have higher candidate volume compared with those observed in the examples in this article. Investigating how these differences may affect item parameter estimates is an interesting area for future research.
Another interesting area for future research would be to investigate the impact of item parameter calibration timing and seasonality for other IRT models. Our results only focused on the Rasch model with the WINSTEPS computer software, as this model and software package is used operationally with the four given credentialing programs. It is possible that the use of a different IRT software package, estimation strategy, or model, such as estimating the 2PL or 3PL model with BILOG-MG, would lead to different results from those obtained in this study. This may be particularly acute when comparing different parameters, such as the b and c parameters. Examinee ability cycles could potentially greatly affect the estimation of c, particularly if more low-ability examinees tended to take the examination earlier or later in the exam cycle. Additional research should examine these and numerous other possibilities.
Additional research on how item calibration timing and seasonality affects other test delivery strategies and assessment models would also be beneficial. This study focused on the continuous testing paradigm using fixed form testing for four credentialing programs in medical imaging. The use of continuous fixed form testing is common for many credentialing programs. However, it is possible that results may be different in a CAT or multistage testing setting. These other test delivery designs can create additional complexities and interactions compared with those observed with fixed form testing designs.
Given the critical role that IRT item parameters play in various test development activities, it is imperative that item parameters are well estimated. This article illustrates that there are important interactions that can exist between the timing of when item parameters are estimated and the quality of the estimates that one might obtain. It is hoped that additional research on item calibration timing, seasonality, and how these may affect item parameter estimates will continue to be conducted to enhance our understanding of these interactions and ensure that the item parameters that are used in different contexts are sufficiently accurate.
Acknowledgments
The authors would like to thank Dan Anderson, Jerry Reid, and Lauren Wood for comments and suggestions they provided on an earlier version of this article.
Footnotes

1. A similar criterion to this is sometimes referred to as bias. We do not refer to this criterion as bias in this article, as we do not have the true item parameters against which to compare.
Authors’ Note: Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and are not necessarily the official position of The American Registry of Radiologic Technologists.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- Babcock B., Albano A. D. (2012). Rasch scale stability in the presence of item parameter and trait drift. Applied Psychological Measurement, 36, 555-580. doi:10.1177/0146621612455090
- Babcock B., Weiss D. J. (2012). Termination criteria in computerized adaptive tests: Do variable-length CATs provide efficient and effective measurement? Journal of Computerized Adaptive Testing, 1, 1-18. doi:10.7333/1212-0101001
- Ban J., Hanson B. A., Wang T., Yi Q., Harris D. J. (2001). A comparative study of on-line pretest item calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191-212. doi:10.1111/j.1745-3984.2001.tb01123.x
- Ban J., Hanson B. A., Yi Q., Harris D. J. (2002). Data sparseness and on-line pretest item calibration-scaling methods in CAT. Journal of Educational Measurement, 39, 207-218. doi:10.1111/j.1745-3984.2002.tb01174.x
- Bond T. G., Fox C. M. (2007). Applying the Rasch model. Mahwah, NJ: Erlbaum.
- de Ayala R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
- Drasgow F. (1989). An evaluation of marginal maximum likelihood estimation for the two-parameter logistic model. Applied Psychological Measurement, 13, 77-90. doi:10.1177/014662168901300108
- Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
- Engelhard G. (2013). Invariant measurement. New York, NY: Routledge.
- Fischer G. H., Molenaar I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York, NY: Springer-Verlag.
- Harwell M. R., Janosky J. E. (1991). An empirical study of the effects of small datasets and varying prior variances on item parameter estimation in BILOG. Applied Psychological Measurement, 15, 279-291. doi:10.1177/014662169101500308
- Holland P. W., Wainer H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
- Hulin C. L., Lissak R. I., Drasgow F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6, 249-260. doi:10.1177/014662168200600301
- Institute for Credentialing Excellence. (2014). NCCA accreditation. Retrieved from http://www.credentialingexcellence.org/ncca
- Kim S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43, 355-381. doi:10.1111/j.1745-3984.2006.00021.x
- Kingsbury G. G. (2009). Adaptive item calibration: A process for estimating item parameters within a computerized adaptive test. In Weiss D. J. (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing.
- Kirisci L., Hsu T.-C., Yu L. (2001). Robustness of item parameter estimation programs to assumptions of unidimensionality and normality. Applied Psychological Measurement, 25, 146-162. doi:10.1177/01466210122031975
- Kolen M. J., Brennan R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York, NY: Springer-Verlag.
- Linacre J. M. (2001). WINSTEPS Rasch measurement computer program (Version 3.31) [Computer software]. Chicago, IL: Winsteps.com.
- Luecht R. M., Sireci S. G. (2011). A review of models for computer-based testing (College Board Research Report 2011-12). New York, NY: College Board.
- Meyer J. P. (2011). jMetrik 2.1 [Computer software]. Retrieved from http://www.itemanalysis.com/jmetrik-download-v2.php
- Meyer J. P., Hailey E. (2012). A study of Rasch, partial credit, and rating scale model parameter recovery in WINSTEPS and jMetrik. Journal of Applied Measurement, 13, 248-258.
- O'Neill T., Peabody M., Tan R., Du Y. (2013). How much item drift is too much? Rasch Measurement Transactions, 27, 1423-1424.
- Parshall C. G., Spray J. A., Kalohn J. C., Davey T. (2002). Practical considerations in computer-based testing. New York, NY: Springer.
- Schmeiser C. B., Welch C. J. (2006). Test development. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 307-353). Washington, DC: American Council on Education.
- Seong T. (1990). Sensitivity of marginal maximum likelihood estimation of item and ability parameters to the characteristics of the prior ability distributions. Applied Psychological Measurement, 14, 299-311. doi:10.1177/014662169001400307
- Skaggs G., Stevenson J. (1989). A comparison of pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters in the three-parameter IRT model. Applied Psychological Measurement, 13, 391-402. doi:10.1177/014662168901300405
- Smith R. M. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5, 430-449.
- Stocking M. L., Lord F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210. doi:10.1177/014662168300700208
- Stone C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1-16. doi:10.1177/014662169201600101
- Svetina D., Crawford A. V., Levy R., Green S. B., Scott L., Thompson M., . . . Kunze K. L. (2013). Designing small-scale tests: A simulation study of parameter recovery with the 1-PL. Psychological Test and Assessment Modeling, 55, 335-360. Retrieved from http://www.psychologie-aktuell.com/fileadmin/download/ptam/4-2013_20131217/01_Svetina.pdf
- Swaminathan H., Gifford J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175-191. doi:10.3102/10769986007003175
- Swaminathan H., Gifford J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In Weiss D. J. (Ed.), New horizons in testing (pp. 9-30). New York, NY: Academic Press.
- Swaminathan H., Gifford J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349-364. doi:10.1007/BF02294110
- Thompson N. A. (2007). A practitioner's guide for variable-length computerized classification testing. Practical Assessment, Research and Evaluation, 12(1). Retrieved from http://pareonline.net/pdf/v12n1.pdf
- Tong Y., Wu S., Xu M. (2008, March). A comparison of pre-equating and post-equating using large-scale assessment data. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
- van der Linden W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
- van den Wollenberg A. L., Wierda F. W., Jansen P. G. W. (1988). Consistency of Rasch model parameter estimation: A simulation study. Applied Psychological Measurement, 12, 307-313. doi:10.1177/014662168801200308
- Wang W.-C., Chen C.-T. (2005). Item parameter recovery, standard error estimates, and fit statistics of the WINSTEPS program for the family of Rasch models. Educational and Psychological Measurement, 65, 376-404. doi:10.1177/0013164404268673
- Wright B. D., Douglas G. A. (1977). Conditional versus unconditional procedures for sample-free item analysis. Educational and Psychological Measurement, 37, 573-586. doi:10.1177/001316447703700301
- Wyse A. E. (2013). DIF cancellation in the Rasch model. Journal of Applied Measurement, 14, 118-128.
- Yen W. M. (1987). A comparison of the efficiency and accuracy of BILOG and LOGIST. Psychometrika, 52, 275-291. doi:10.1007/BF02294241
- Zwinderman A. H., van den Wollenberg A. L. (1990). Robustness of marginal maximum likelihood estimation in the Rasch model. Applied Psychological Measurement, 14, 73-81. doi:10.1177/014662169001400107