Abstract
Methods for detecting item parameter drift may be inadequate when every exposed item is at risk for drift. To address this scenario, a strategy for detecting item parameter drift is proposed that uses only unexposed items deployed via stratified random assignment within an experimental design. The proposed method is illustrated by investigating unexpected score increases on a high-stakes licensure exam. Results for this example were suggestive of item parameter drift but not significant at the .05 level.
Keywords: item parameter drift, differential item functioning, invariance, test security
Introduction
Many activities in educational measurement require making score comparisons. These include comparisons across groups of test takers, between individual examinees, or between groups or individual examinees and various defined standards. Furthermore, there are occasions when such comparisons need to be made across different points in time. Before comparisons of any kind can be made, however, steps must be taken to ensure that all scores share a common meaning. In the context of item response theory (IRT), which is the context of interest in this paper, this means taking steps to ensure that all model parameters are expressed on a common scale.
IRT describes the relationship between presumed—but unobservable—examinee and item characteristics and observable examinee performances on test items (i.e., item responses; Lord, 1980; Hambleton & Swaminathan, 1985; Hambleton et al., 1991). As might be expected when model parameters are unobserved, IRT models have an identification problem that requires the origin (or, for more general models, the origin and unit) to be defined arbitrarily. Therefore, when IRT models are independently fit (calibrated) to different datasets, it cannot be assumed that all model parameters are expressed on a common scale. In practice, an adjustment must be estimated and applied to one set of estimates before it can be assumed that they share a common metric (Kolen & Brennan, 2004). To estimate this adjustment, something is needed that is expected to have the same value across calibrations were it not for the difference in scale (Baldwin & Clauser, 2022). Most often, this something is a set of common (or anchor) items. These sets of common items are included on two (or more) independently-calibrated administrations, and we expect their mean difficulty estimate, for example, to be the same for every calibration—excepting estimation error and whatever differences in scale arise from the arbitrary way the models were identified (Kolen & Brennan, 2004). For instance, on this basis, the estimated adjustment needed to place the estimates from two Rasch calibrations (which only differ in the arbitrary designation of each scale’s origin) on a common metric is simply the difference in mean difficulty.
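To make this linking step concrete, here is a minimal sketch, in our own notation rather than the source's, of the adjustment for two Rasch calibrations that share a set of anchor items:

```latex
% Sketch: linking two Rasch calibrations via anchor (common) items.
% \hat{b}_i^{(1)} and \hat{b}_i^{(2)} are anchor difficulty estimates from
% calibrations 1 and 2; overbars denote means over the anchor set.
\begin{align*}
\hat{c} &= \overline{\hat{b}^{(1)}} - \overline{\hat{b}^{(2)}}
  && \text{(estimated shift between the two arbitrary origins)} \\
\tilde{b}_j^{(2)} &= \hat{b}_j^{(2)} + \hat{c}, \qquad
\tilde{\theta}_k^{(2)} = \hat{\theta}_k^{(2)} + \hat{c}
  && \text{(calibration-2 estimates placed on calibration 1's scale)}
\end{align*}
```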
This routine practice of developing a common metric requires the true common item parameters to be the same across calibrations, a property called item parameter invariance. Despite its importance, however, in practice, items may not always have this property. When the calibrations of interest involve response data collected at different times, it is conventional to describe any lack of parameter invariance over time as item parameter drift (IPD; Goldstein, 1983). IPD may occur for a variety of reasons. Some of these, like item theft, are sinister (or at least unethical), while others are not—for example, items measuring newer content may become gradually easier as this content is assimilated more broadly into curricula and textbooks. 1 Whatever its origin, IPD is a model violation that can have important consequences for the validity of examinee scores (Rupp & Zumbo, 2006).
In this paper, we describe a method for identifying the presence of IPD using an experimental design. The proposed approach uses a large pool of new and unseen items that are divided into two randomly equivalent item sets and administered across two successive time points (e.g., years). In the context of a common item nonequivalent groups design, the random equivalency of these sets allows the assumption of item parameter invariance for the common items (i.e., the anchor test) to be tested formally. While the startup costs for this method are nontrivial, the effort needed to use the method on a long-term basis is low and data can be collected unobtrusively as part of routine test administrations. For this reason, the method will be most attractive for testing programs wanting to monitor IPD on an ongoing basis. The proposed method is illustrated with a multiyear study examining score increases on a large high-stakes licensure exam. For this exam program, the observed increases were unexpected and we hypothesized that they might be explained better by IPD than by true changes in the examinee population over time.
IPD and DIF
Early on, it was recognized that “the problem of identifying drift is statistically identical to the problem of identifying DIF” (differential item functioning; Donoghue & Isham, 1998), which also is concerned with detecting the absence of item parameter invariance across examinee groups. IPD is a special case of DIF wherein the groups of interest comprise samples of examinees who were tested at different times (e.g., annual testing cohorts). As with DIF studies, IPD investigations typically focus on differences in item performance after controlling for differences in examinee groups with respect to proficiency.
Many IRT-based methods have been developed to test for item parameter invariance across groups. These include, for example, Lord’s test (Lord, 1980), which uses a chi-squared statistic to test whether item parameters are invariant across groups; the comparison of signed and unsigned areas between group-specific item response functions as suggested by Raju (1988, 1990); and many others. All these methods share a common requirement: before item parameter estimates (or item response functions) can be compared, they first must be expressed on a common scale. Typically, this is accomplished using a common item non-equivalent groups design, which can create an awkward situation: a set of common items known (or identifiable) in advance to have the property of item parameter invariance is needed in order to evaluate whether a set of common items has the property of parameter invariance. The circular nature of this requirement can make it impossible to satisfy (or, at least, to have confidence that it has been met). And, when it is not met, we cannot be confident in the results produced by these methods.
A Partial Solution
Bechger and Maris (2015) proposed a method for identifying the presence of IPD that does not require a known set of common items with invariant item parameters. This method works by comparing the differences between the difficulties for pairs of items. Suppose we have two items, $i$ and $j$, that fit the Rasch model and are administered at two times, 1 and 2. Let $R_{ij}^{(1)}$ and $R_{ij}^{(2)}$ be the difference in difficulty parameters, $b$, between these items at times 1 and 2, respectively, such that $R_{ij}^{(1)} = b_i^{(1)} - b_j^{(1)}$ and $R_{ij}^{(2)} = b_i^{(2)} - b_j^{(2)}$. There are two reasons we might expect $R_{ij}^{(1)}$ and $R_{ij}^{(2)}$ to differ. First, we might believe (wrongly, as we will see in a moment) that differences may arise because the Rasch scales at each time differ by some constant, $c$, due to the way their respective models were (arbitrarily) identified. Second, $b_i$ and $b_j$ may have drifted by $\delta_i$ and $\delta_j$, respectively. It follows that the relationships between the difficulty parameters at times 1 and 2 are given by $b_i^{(2)} = b_i^{(1)} + c + \delta_i$ and $b_j^{(2)} = b_j^{(1)} + c + \delta_j$; and, further, that the relationship between $R_{ij}^{(1)}$ and $R_{ij}^{(2)}$ is given by

$$R_{ij}^{(2)} - R_{ij}^{(1)} = \big(b_i^{(1)} + c + \delta_i\big) - \big(b_j^{(1)} + c + \delta_j\big) - \big(b_i^{(1)} - b_j^{(1)}\big) \tag{1}$$

which can be written more compactly as

$$R_{ij}^{(2)} - R_{ij}^{(1)} = \delta_i - \delta_j \tag{2}$$
As alluded to above, this difference has a very useful property: it does not depend on the scaling constant, $c$, only on $\delta_i$ and $\delta_j$. Bechger and Maris (2015) show how to capitalize on this property to detect IPD (i.e., to detect when $\delta_i \neq \delta_j$) without first having to develop a common metric.
Bechger and Maris’ solution to the problem of detecting IPD is a powerful one because, unlike other methods, it does not require a set of known invariant item parameters that can be used to place the parameter estimates on a common scale. Still, even this approach has limitations. Suppose that all items have drifted by some amount such that the IPD for any item $i$ can be expressed by the sum $\delta_i = \delta + \delta_i^{\ast}$, where $\delta$ is common to all items and $\delta_i^{\ast}$ is specific to item $i$. Equation (2) can now be rewritten as

$$R_{ij}^{(2)} - R_{ij}^{(1)} = \delta_i^{\ast} - \delta_j^{\ast} \tag{3}$$

where, not unlike $c$, the common component $\delta$, which is unknown, cancels out. The implication of this result is twofold. First, it demonstrates that the method proposed by Bechger and Maris cannot detect IPD that is common to all items. Second, this method reveals nothing about the direction (sign) of overall IPD.
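A small numerical sketch may help make this limitation concrete. The code below is our own illustration (with made-up difficulty values and the notation used above); it shows that both the scale constant $c$ and any drift common to all items vanish from the pairwise relative difficulties, so the Bechger and Maris comparison cannot see them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Rasch difficulties at time 1 (illustrative values only).
b1 = rng.normal(0.0, 1.0, size=10)

c = 0.7       # arbitrary shift in scale origin between the two calibrations
delta = 0.3   # drift shared by ALL items

# Time-2 difficulties: every item shifted by the scale constant plus the common drift.
b2 = b1 + c + delta

# Relative difficulties (all pairwise differences) at each time point.
R1 = b1[:, None] - b1[None, :]
R2 = b2[:, None] - b2[None, :]

# Both c and the common drift cancel: the pairwise comparison detects nothing.
print(np.abs(R2 - R1).max())  # ~0, up to floating-point error
```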
The situation wherein all items have drifted in the same direction (i.e., $\delta_i > 0$ for all items, or $\delta_i < 0$ for all items) is of particular interest for this paper. Yet, a reasonable person might wonder: if all items have IPD such that, for example, all items become easier over time, does this mean that later subgroups are simply more able? Not necessarily. For example, suppose a security failure occurs such that subsequent examinees have advance access to a complete set of test questions. All items would be easier for examinees testing after the security breach, but it would be a mistake to interpret this as evidence that these later examinees are more able (except perhaps more able at cheating). Instead, this decreased difficulty should be attributed to IPD. For situations like this, where IPD has the potential to affect all items, a different approach to detecting IPD is needed. That is the aim of the proposed method.
Proposed Method
When examinees from a common pool are randomly assigned to more than one form for the purpose of estimating differences in form difficulty, it is sometimes called a random groups design (e.g., Kolen & Brennan, 2004). Because groups are randomly equivalent, any apparent group performance difference across forms can be attributed to random sampling error or differences in form difficulty. The proposed method uses an analogous design: a random items design (Baldwin & Clauser, 2022) wherein study items from a common pool are randomly assigned to examinee subgroups (e.g., annual cohorts). Here, however, the goal is not to equate as it is with the random groups design. Rather, it is to determine whether the observed difference in study item difficulties can be reasonably explained by sampling error alone, or whether a failure to place the difficulties on a common scale, because of IPD, is a better explanation.
Let’s suppose two successive cohorts, year 1 and year 2, are administered different forms of a common exam. Each form comprises three sets of items: an anchor set that appears on both forms, a set that is unique to each given form, and a study set, which we will denote study set 1 and study set 2 for years 1 and 2, respectively. In fact, the two study sets are also unique to each form; however, these sets have two additional properties. First, they contain only new items that have never appeared on a previous exam and, second, they are randomly equivalent. Random equivalence is accomplished by creating a superset of new study items at the start of the study and then randomly dividing this set into two subsets prior to year 1. 2 Study set 1 then is included on year 1’s form and study set 2 is included on year 2’s form. Forms for both years are administered in whatever way is typical for the given testing program and an item response theory model is fit to each form’s response data (Figure 1).
Figure 1.
Form assembly design. Anchor sets are common across years and the study sets are unique but randomly equivalent sets of new (previously unseen) items. To ensure that the study sets are randomly equivalent, both sets must be assembled prior to Year 1.
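As a sketch of how the superset of new items might be divided into two randomly equivalent study sets (stratified here, anticipating the design used in the example below), something like the following would suffice; the function and argument names are ours and not part of any operational system:

```python
import random
from collections import defaultdict

def split_study_pool(item_ids, stratum_of, seed=12345):
    """Randomly divide a pool of new items into two study sets, stratum by
    stratum, so that the two sets are randomly equivalent.

    item_ids: list of item identifiers
    stratum_of: dict mapping item id -> stratum label
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in item_ids:
        by_stratum[stratum_of[item]].append(item)

    study_set_1, study_set_2 = [], []
    for items in by_stratum.values():
        rng.shuffle(items)
        half = len(items) // 2
        study_set_1.extend(items[:half])   # administered in year 1
        study_set_2.extend(items[half:])   # held back for year 2
    return study_set_1, study_set_2
```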
To compare examinee performance across years, steps must be taken at the conclusion of year 2 to express all model parameters—at least putatively—on a common scale. When both years share a common set of items (as they do here), this conforms to a common item nonequivalent groups design (as described above), and several well-known methods exist for developing a common scale. 3 Nevertheless, as noted, this process only leads to model parameters that are putatively on a common scale. The qualifier putatively is needed here because placing model parameters on a common scale across years relies on the property of item parameter invariance and yet the motivation for investigating IPD is a lack of confidence that item parameters have this property over time. In other words, while we can follow the procedures for developing a common metric across cohorts, until we can demonstrate that the anchor item parameters are invariant, we can only assume that the parameters share a common metric. The proposed method works by making this assumption and then trying to falsify it.
If IPD has not prevented the anchor items from having the property of item parameter invariance, any summary statistic describing a study set’s difficulty parameters should be the same across years (excepting sampling and estimation error) because study sets are randomly equivalent across years. 4 Therefore, because we know the expected relationship between the study sets when IPD does not exist, the proposition of IPD can be tested by investigating the extent to which the study sets’ item difficulties conform to these expectations. This is a form of proof by contradiction: first, anchor item parameters are assumed to be invariant (i.e., IPD is assumed to be absent) and then, if a contradiction arises (in this case, an unlikely observed difference in a given statistic that describes each study set’s difficulty parameter estimates), this is interpreted as evidence that the proposition of IPD is true. In other words, if the null hypothesis of anchor item parameter invariance is rejected, this is evidence that IPD may be present. Next, this method is illustrated with an empirical dataset.
An Example
Data
All allopathic physicians wishing to be licensed in the United States are required to pass the United States Medical Licensing Examination (USMLE®) sequence. The data in this example were taken from one exam within this sequence. This exam comprises many equal-length forms that are active throughout the year, each with over 270 multiple-choice items, some scored and some unscored. Items on this exam have at least 5 response options, are relatively easy for the testing population (mean proportion correct of .74 in the dataset we used), and historically have been shown to fit the 1PL model well (Kelley & Schumacher, 1984). 5 All scored items are organized into forms that satisfy various content and statistical constraints; to make this possible, items must first appear on prior operational exams as unscored pretest items and be found to have acceptable model-data fit.
The present study spanned two years and included 1330 new and unscored pretest items nested within 18 strata. Table 1 shows this design. Sixteen of these strata were based on content classification (lettered A-P in Table 1); one comprised items that were presented as part of a two-item set (Q); and one comprised items that included a picture or graphic (R). Items within each stratum were randomly divided into two equal size groups—one group being assigned to year 1 and the other to year 2. In this way, each year had a set of 1330 ÷ 2 = 665 study items grouped into 18 strata. Once items were assigned to a year, each examinee in that year’s cohort was randomly assigned a relatively small number of study items that were presented alongside scored (and, in year 1, other unscored but non-study) material. Thus, while any given examinee was administered only a subset of study items, when aggregated across all examinees in a cohort, this design allows response data to be collected for every study item. Here, each study item was administered to approximately 250 examinees.
Table 1.
Stratified Random Sampling Design. An Initial Pool of 1330 Items Was Randomly Assigned to Years 1 or 2 Such That the Number of Items Within Each Stratum Was Equal Across Years.
| Stratum (Content Category Unless Otherwise Noted) | Initial Pool | Items Randomly Assigned to Year 1 | Items Randomly Assigned to Year 2 |
|---|---|---|---|
| A | 48 | 24 | 24 |
| B | 18 | 9 | 9 |
| C | 36 | 18 | 18 |
| D | 108 | 54 | 54 |
| E | 132 | 66 | 66 |
| F | 18 | 9 | 9 |
| G | 80 | 40 | 40 |
| H | 70 | 35 | 35 |
| I | 70 | 35 | 35 |
| J | 102 | 51 | 51 |
| K | 56 | 28 | 28 |
| L | 64 | 32 | 32 |
| M | 58 | 29 | 29 |
| N | 16 | 8 | 8 |
| O | 72 | 36 | 36 |
| P | 112 | 56 | 56 |
| Q (paired items) | 72 | 36 | 36 |
| R (items with a graphic) | 198 | 99 | 99 |
| Total | 1330 | 665 | 665 |
For these data, the year 1 cohort had 13,000+ examinees and the year 2 cohort had 12,000+ examinees. All examinees were first-time test takers from US Liaison Committee on Medical Education (LCME)-accredited medical schools. Each cohort tested between July 1 and March 1 (albeit one year apart). 6 This exam is administered continuously throughout the year and examinees can choose their testing date. As noted above, this exam has many parallel forms in simultaneous use each year and each examinee is administered one of these forms at random.
For the purpose of this study, the one-parameter logistic item response theory model (1PL) was independently fit to each cohort’s complete set of responses to the non-study items. 7 However, because of the identification problem, it cannot be assumed that these independently scaled model parameter estimates are comparable across cohorts (i.e., years). This was addressed using a common item non-equivalent groups design. Specifically, these datasets had 1800+ scored items in common and a linear transformation was calculated and then applied to all item difficulty, item discrimination, and examinee proficiency estimates from year 2 such that the common items had the same mean difficulty and mean discrimination estimates across years (mean/mean transformation method; Loyd & Hoover, 1980).
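The mean/mean transformation used here is standard; a minimal sketch of the calculation, in our own code with hypothetical argument names, is:

```python
import numpy as np

def mean_mean_link(a1_common, b1_common, a2_common, b2_common):
    """Mean/mean linking (Loyd & Hoover, 1980): slope A and intercept B that
    carry year-2 estimates onto the year-1 scale via b -> A*b + B,
    theta -> A*theta + B, and a -> a / A."""
    A = np.mean(a2_common) / np.mean(a1_common)
    B = np.mean(b1_common) - A * np.mean(b2_common)
    return A, B

# Toy anchor values (not the study's data):
A, B = mean_mean_link(a1_common=[1.0, 1.0], b1_common=[-0.5, 0.3],
                      a2_common=[1.1, 0.9], b2_common=[-0.2, 0.6])
b2_on_year1_scale = A * np.array([-0.2, 0.6, 1.4]) + B  # any year-2 difficulties
```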
After these initial calibrations and rescaling, item difficulty parameters were estimated for each study item by fixing the proficiency parameters to their estimated (and, in the case of year 2, adjusted) values and then fitting the 1PL model (with a common discrimination parameter equal to the scored items’ discrimination) to the response data. Finally, to make the model parameters more interpretable, a linear transformation was applied to all model parameters such that the combined group of examinee proficiencies for both years had a mean of 0.0 and a standard deviation of 1.0.
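Under the 1PL with proficiencies held fixed, calibrating a single study item reduces to solving one likelihood equation in that item's difficulty. A minimal sketch, under the assumption that each item has at least one correct and one incorrect response (our own code, not the operational software described in the notes), is:

```python
import numpy as np
from scipy.optimize import brentq

def estimate_difficulty(responses, thetas, a=1.0):
    """MLE of one 1PL item difficulty with examinee proficiencies fixed.
    `a` is the (fixed) common discrimination; `responses` are 0/1 scores."""
    u = np.asarray(responses, dtype=float)
    theta = np.asarray(thetas, dtype=float)

    def score(b):
        # The likelihood equation: observed minus expected number correct.
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return np.sum(u - p)

    # score(b) increases in b; bracket widely because pretest items can be extreme.
    return brentq(score, -30.0, 30.0)
```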
Evaluation Criteria
Ideally, after following the steps just described: (a) all model parameter estimates should be expressed on a common scale (excepting random equating error), and (b) the two sets of item difficulty estimates for the two 665-item study sets should be randomly equivalent. However, this ideal outcome requires that the assumptions for the common-item nonequivalent groups design are satisfied—most importantly for our purpose here, the assumption that the common items have the property of item parameter invariance over time. Or, in other words, the assumption that the anchor item parameters have not drifted between year 1 and year 2.
If the parameters are invariant, the difference between the mean true difficulty parameters for each year’s common items is zero when expressed on a common scale. In practice, the true parameters are unknown, but we can use this property of invariance to develop an approximate common metric by adjusting year 2 such that the common items have the same mean estimated difficulty across years. This is all well and good; however, recall that it was a lack of confidence that the common item parameter difficulties were invariant that motivated this study. And, if the common item difficulty parameters have drifted by some amount, this adjustment will not place the item difficulty and examinee proficiency estimates from year 2 on a common scale with the year 1 estimates. Instead, these estimates will be misaligned by (the negative of) the mean IPD.
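In symbols (reusing the drift notation from the Bechger and Maris discussion, with $\bar{\delta}$ the mean drift across the common items), the argument runs as follows; this is our own sketch and ignores estimation error:

```latex
% Year-2 anchor difficulties drift by \delta_i on a scale whose origin differs
% from year 1's by c; the mean-difficulty adjustment \hat{k} absorbs the mean
% drift along with the scale shift.
\begin{align*}
b_i^{(2)} &= b_i^{(1)} + c + \delta_i \\
\hat{k}   &= \overline{b^{(1)}} - \overline{b^{(2)}} \approx -c - \bar{\delta} \\
x + \hat{k} &\approx (x - c) - \bar{\delta} \quad \text{for any year-2 estimate } x
\end{align*}
```

so every rescaled year-2 estimate misses the year-1 scale by $-\bar{\delta}$, the negative of the mean IPD.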
As noted above, the proposition of IPD can be tested by comparing the study sets’ item difficulty estimates, and this comparison is a form of proof by contradiction: we first assume that the anchor item parameters are invariant (the null hypothesis) and then we attempt to falsify that assumption by comparing the randomly equivalent sets of study items. More specifically, we assume the absence of IPD and then, if a difference in a given statistic describing each study set’s difficulty estimates is observed that would be highly unlikely under the null hypothesis, this is interpreted as evidence of IPD. To determine whether an observed difference is highly unlikely, this difference must be compared with its sampling distribution. This can be accomplished using a permutation test.
A permutation test constructs a sampling distribution using permutations of study item assignment under the null hypothesis, which assumes these items’ parameters are exchangeable. This approach supposes that the given assignment of items to study set is but one equally probable permutation from the superset of all possible permutations; and, further, that under the null hypothesis of no IPD, the difference between some observed statistic for each year’s study items is but one random draw from the sampling distribution of differences based on all such permutations.
When study items were randomly assigned to years 1 and 2 at the start of the study, the particular randomization was only one of many possible permutations. Therefore, for this particular randomization, let $m_1^{\ast}$ and $m_2^{\ast}$ denote the median study item difficulty estimates for years 1 and 2, respectively; but, for some other random permutation $\pi$, let these medians be denoted $m_1^{(\pi)}$ and $m_2^{(\pi)}$. Likewise, it follows that the observed difference in median difficulty across years is given by

$$d^{\ast} = m_2^{\ast} - m_1^{\ast} \tag{4}$$

and the difference for some randomization $\pi$ is given by

$$d^{(\pi)} = m_2^{(\pi)} - m_1^{(\pi)} \tag{5}$$
The difference in median difficulty across years was of particular interest here because the median is unaffected by extreme difficulty estimates. This was an important consideration because the study items were all pretest items and, as often occurs in pretesting, some items may produce extreme and unreasonable difficulty estimates. More generally, there is nothing special about the median for the proposed method (e.g., in other contexts, the mean may work just as well or better).
There are a vast number of possible permutations of the 1330 study items into randomly equivalent item sets, and the associated distribution of $d^{(\pi)}$ is sometimes called a randomization distribution. If the null hypothesis is true, the randomization distribution is the exact distribution of differences in median item difficulty due to the random assignment of items to year; however, if the observed difference $d^{\ast}$ falls in the tail of this distribution, it casts doubt on the assumption that $d^{\ast}$ was sampled from this distribution. It would then follow that the randomization distribution is not the sampling distribution of median differences under the null hypothesis, and the null hypothesis would be rejected.
Because the number of possible permutations is impractically large, the randomization distribution was approximated by sampling 1,000,000 permutations. 8 The probability of observing a given $d^{\ast}$ or greater when the null hypothesis (of item parameter invariance) is true is given by

$$p = \frac{1}{N}\sum_{\pi=1}^{N} I\big(d^{(\pi)} \geq d^{\ast}\big) \tag{6}$$

where $N$ is the number of sampled permutations and $I(\cdot)$ is the indicator function, which evaluates to 0 when the inequality is false and 1 when it is true (Ernst, 2004). Thus, equation (6) provides the p-value for the hypothesis test. Note that this test is one-tailed because if items have been getting easier over time due to IPD (which is the hypothesized explanation for the historic trend in scores that motivated this study), examinees would appear more able, making the study items in year 2 appear more difficult.
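A compact sketch of this permutation test in code (our own illustration; for brevity it ignores the stratification used in the actual randomization, and the function and argument names are hypothetical):

```python
import numpy as np

def permutation_p_value(b_year1, b_year2, n_perm=1_000_000, seed=1):
    """One-tailed permutation test of the difference in median study-item
    difficulty (year 2 minus year 1), following equations (4)-(6)."""
    rng = np.random.default_rng(seed)
    b1 = np.asarray(b_year1, dtype=float)
    b2 = np.asarray(b_year2, dtype=float)

    d_obs = np.median(b2) - np.median(b1)  # d*, equation (4)

    pooled = np.concatenate([b1, b2])
    n1 = len(b1)
    d_perm = np.empty(n_perm)
    for k in range(n_perm):
        rng.shuffle(pooled)  # one random reassignment of items to years
        d_perm[k] = np.median(pooled[n1:]) - np.median(pooled[:n1])  # equation (5)

    p_value = np.mean(d_perm >= d_obs)  # equation (6)
    return d_obs, p_value
```

With the median replaced by the mean, or the shuffle carried out within strata, the same structure applies.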
Results
Figure 2 shows a frequency histogram of difficulty parameter estimates for the two years included in this study. It can be seen that the median difficulty increased by .46 (which, recall, is expressed in proficiency standard deviation units); however, while this increase seems substantial, its magnitude does not in itself rule out the possibility that the null hypothesis is true and sampling error is responsible for the observed differences. To shed light on the question of whether or not the observed difference is plausible given the variation we expect due to sampling error, we turn to the permutation test described above.
Figure 2.
Frequency histogram of item difficulty estimates for Year 1 and Year 2 study items. In the figure legend, the difficulty median and standard deviation for each year are also reported along with the difference in medians.
Figure 3 shows the frequency histogram of differences in median difficulties across years that we would expect if the null hypothesis were true. As described above, this randomization distribution is based on 1,000,000 permutations of item assignment to year. The specific difference observed in this study, $d^{\ast} = .46$, is also shown; it can be seen that, although a difference this large or larger is somewhat implausible, we would still expect to see one 9.8% of the time when the null hypothesis is true. That is, while we may find this result suggestive, it is not significant at the .05 level. Therefore, we fail to reject the null hypothesis that sampling error alone can explain the observed difference in median item difficulty estimates for the randomly equivalent sets of study items. These results are also summarized in Table 2.
Figure 3.
Frequency histogram of differences in median difficulties across years if the null hypothesis of no IPD is true. This is a randomization distribution based on 1,000,000 permutations of item assignment to year. It can be seen that a value equal to or more extreme than the observed value is expected to arise due to sampling error 9.8% of the time under the null hypothesis.
Table 2.
Summary of Permutation Test.
| Year | Median 1PL Difficulty | Observed Difference, d*, Between the Study Sets’ Median Difficulties | Proportion of Differences in the Randomization Distribution ≥ d* |
|---|---|---|---|
| Year 1 (Study Set 1) | −3.25 | .46 | .098 |
| Year 2 (Study Set 2) | −2.80 | | |
Discussion
This study was designed to investigate the presence of IPD under conditions where the assumptions of other DIF methods are not expected to be satisfied. In particular, most DIF methods require a set of item parameters that are known to be invariant. An exception to this, which does not require a set of item parameters known to be invariant, was proposed by Bechger and Maris (2015); however, its power to detect DIF is reduced by whatever DIF is common to all items. None of these limitations apply to the proposed method.
For our two-year dataset, this method provided provocative but ultimately unconvincing evidence of IPD. The explanation consistent with the null hypothesis, namely that examinees are simply performing ever better, is an appealing one; however, failure to reject the null hypothesis is not proof of its truth and several other possible explanations for the reported results remain. These explanations tend to be related to power—or lack thereof. Although the number of study items may seem, in an absolute sense, large, the population 9 for this exam is highly able and extremely homogeneous. Or, put another way, relative to examinee proficiencies, the median study item difficulty was approximately three proficiency standard deviations below the mean and the study item difficulty standard deviation was more than four times that of the proficiency standard deviation. Therefore, while the observed difference across cohorts in median difficulty for the study items was nearly half of a proficiency standard deviation, it was only about a tenth of a difficulty standard deviation. In this way, despite the large number of study items, sampling error still limited our power to identify significant differences. Furthermore, when an examinee population is extremely homogeneous, items will not discriminate well among different proficiency levels. This limitation, combined with a highly able population, leads to unstable difficulty estimates (it can be difficult to estimate the inflection point for an item response function that is nearly flat and poorly matched to the calibration sample’s proficiency distribution). This instability also reduces power when comparing summary statistics across study sets.
Although the proposed method offers some advantages over existing methods, several limitations also deserve comment. To begin, the proposed method relies on an experimental design. This is a powerful investigative tool; however, as with many experimental designs, data collection can be more arduous. In the case presented above, data collection itself was straightforward and was undertaken unobtrusively as part of standard exam administrations; however, two years were required to collect the necessary data.
Another practical constraint is the requirement for an additional set of items, equal in number to half the total study items, at the start of the study. As noted, this is necessary because two randomly equivalent sets of items are needed at the start of the study—one to be administered in year 1 and one to be administered in year 2—and this is accomplished by creating a superset of new study items at the start of the study and randomly dividing it into two subsets. Item development is expensive and time consuming and obtaining this initial superset of study items is a potentially burdensome—albeit, one-time—expense. However, once this initial requirement is met, the method can be applied to subsequent cohorts indefinitely with very little administrative effort and no additional item writing beyond whatever is normally required for pretesting. For this reason, as noted above, the proposed method will be most attractive to testing programs seeking to monitor IPD on a routine and ongoing basis.
Another limitation is that the proposed method only tests for an aggregate IPD effect—it cannot identify particular items that lack invariance. Moreover, testing for aggregate IPD in this way is especially susceptible to Type II error because the proposed method only tests for IPD effects that affect the development of a common metric. In the context of the 1PL model, this means IPD effects that produce scales misaligned by a constant. Note, however, that it is certainly possible for some common items to have positive IPD and others to have negative IPD. When this is so, some IPD will cancel out, at least with respect to estimating the adjustment needed to develop a common metric. Therefore, item difficulties still may be misaligned, but the misalignment will not reflect the total magnitude of IPD. And, this will also be true for the misalignment in the study item difficulties across years that is used to test for IPD. Nevertheless, concerns about IPD often presume a certain direction (e.g., items getting easier over time) or a particular aggregate effect of IPD (e.g., systematic errors in proficiency estimates). Therefore, this limitation may not be of great concern in all cases.
One final limitation is that the proposed method assumes that the recency with which an item was written does not affect its difficulty. In many cases, this assumption will be easily satisfied; however, there are scenarios where its violation is more likely. For example, suppose study items are written measuring some new practice or procedure, which is still being integrated into curricula. It is possible that this content will be taught more widely and/or more effectively one year after such items were written compared with how it would have been taught at the time it was written. This could create systematic differences in the expected distribution of study item difficulties that undermine the property of random equivalence. Exam programs that focus on content areas that are in a state of constant and often rapid development, such as those related to technology, will be especially at risk of violating the assumption that an item’s recency does not affect its difficulty. The proposed method may be less appealing to such programs or may require vetting of study items to ensure that their particular content is not at risk of violating this assumption.
This assumption notwithstanding, this paper describes a method that could be an attractive option for testing programs that have difficulty satisfying the requirements of other DIF methods—namely the requirement that a set of invariant common items is known and the requirement that IPD does not affect all items.
Notes
1. Rapidly developing content areas such as medicine or technology may be especially susceptible to IPD of this kind.
2. If item covariate data are available, as was the case in the example given below, stratified randomization may reduce the sampling effects, leading to more power to detect IPD.
3. For example, in the context of a one-parameter logistic item response theory model (1PL), which is the model used in this study, year 2’s item difficulty and examinee proficiency parameter estimates can be adjusted by a constant such that the means of the common item difficulties are equal across years.
4. Note that this expected equivalence would not necessarily be true for study item proportion correct values—these will only be randomly equivalent if the two cohorts are equally able, and the proposed method makes no assumptions about the relative ability of the two cohorts.
5. Note that while the 1PL is used for this exam, there is nothing about the proposed method that limits its use with more general item response theory models.
6. It was decided to include only those data collected prior to March 1, 2020 because of the disruption to exam administration and exam preparation caused by the COVID-19 pandemic. This constraint also was mirrored in year 1, which only included administrations that occurred before March 1, 2019.
7. Model estimation was accomplished using proprietary, non-commercial software. This software is also used operationally for this exam. Further, note that while the 1PL with unit discrimination and the Rasch model can be shown to be mathematically equivalent, the term 1PL is preferable here because Rasch scaling conventions were not adhered to (as described in the main text).
8. 1,000,000 is a needlessly large number; however, the process is fast, and a large number of permutations does not pose a burden.
9. The population referred to here is the population of first-time examinees from US medical schools, which is a subset of all examinees who take this exam. The 1PL model is fit to examinees from this group alone.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Peter Baldwin https://orcid.org/0000-0003-3472-0172
References
- Baldwin P., Clauser B. E. (2022). Historical perspectives on score comparability issues raised by innovations in testing. Journal of Educational Measurement, 59(2), 140–160. 10.1111/jedm.12318
- Bechger T. M., Maris G. (2015). A statistical test for differential item pair functioning. Psychometrika, 80(2), 317–340. 10.1007/s11336-014-9408-y
- Donoghue J. R., Isham S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22(1), 33–51. 10.1177/01466216980221002
- Ernst M. D. (2004). Permutation methods: A basis for exact inference. Statistical Science, 19, 676–685.
- Goldstein H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20(4), 369–377. 10.1111/j.1745-3984.1983.tb00214.x
- Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Kluwer-Nijhoff.
- Hambleton R. K., Swaminathan H., Rogers H. J. (1991). Fundamentals of item response theory. Sage Publications.
- Kelley P. R., Schumacher C. F. (1984). The Rasch model: Its use by the National Board of Medical Examiners. Evaluation & the Health Professions, 7(4), 443–454. 10.1177/016327878400700405
- Kolen M. J., Brennan R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). Springer.
- Lord F. M. (1980). Applications of item response theory to practical testing problems. Erlbaum.
- Loyd B. H., Hoover H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179–193. 10.1111/j.1745-3984.1980.tb00825.x
- Raju N. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502. 10.1007/bf02294403
- Raju N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2), 197–207. 10.1177/014662169001400208
- Rupp A. A., Zumbo B. D. (2006). Understanding parameter invariance in unidimensional IRT models. Educational and Psychological Measurement, 66(1), 63–84. 10.1177/0013164404273942