Educational and Psychological Measurement. 2019 Sep 30;80(3):499–521. doi: 10.1177/0013164419878483

Rasch Versus Classical Equating in the Context of Small Sample Sizes

Ben Babcock, Kari J. Hodge
PMCID: PMC7221499  PMID: 32425217

Abstract

Equating and scaling in the context of small sample exams, such as credentialing exams for highly specialized professions, has received increased attention in recent research. Investigators have proposed a variety of both classical and Rasch-based approaches to the problem. This study attempts to extend past research by (1) directly comparing classical and Rasch techniques of equating exam scores when sample sizes are small (N ≤ 100 per exam form) and (2) attempting to pool multiple forms’ worth of data to improve estimation in the Rasch framework. We simulated multiple years of a small-sample exam program by resampling from a larger certification exam program’s real data. Results showed that combining multiple administrations’ worth of data via the Rasch model can lead to more accurate equating compared to classical methods designed to work well in small samples. WINSTEPS-based Rasch methods that used multiple exam forms’ data worked better than Bayesian Markov chain Monte Carlo methods, as the prior distribution used to estimate the item difficulty parameters biased predicted scores when there were difficulty differences between exam forms.

Keywords: small samples, equating, linking, circle-arc equating, nominal weights mean equating, Rasch model, Markov chain Monte Carlo (MCMC)


Testing programs using multiple exam forms or introducing new test forms must provide evidence that scores and decisions based on those scores maintain equivalence. Even when programs adhere to the test content specifications, different forms may vary in difficulty level. To avoid score and classification bias, it is necessary to statistically adjust scores from different forms (i.e., equating). Most equating procedures, however, require large samples for accurate estimates (Kolen & Brennan, 2004). High-stakes certification and licensure programs in specialized professions use examination scores for pass/fail decisions (Brookhart & Nitko, 2007), but specialized professions often do not have the examinee volume to use traditional equating techniques (S. Kim, von Davier, & Haberman, 2008; Skaggs, 2005).

Recent literature has seen increased interest in procedures developed to address issues of measurement precision in small sample equating. Classical methods including circle-arc equating (Livingston & Kim, 2009) and nominal weights mean equating (Babcock, Albano, & Raymond, 2012) have been found to be robust in small samples compared to conducting standard setting for each administration (Dwyer, 2016). There is little research, however, on the use of Rasch-based techniques in small samples, particularly on comparing Rasch techniques to these small-sample equating methods. This study compares the equating outcomes of classical methods to Rasch methods in a small sample context.

Circle-Arc Equating

Circle-arc equating is a classical, nonlinear method that has shown promise for equating in small samples. The basic idea of circle-arc equating is to create three points and connect them by drawing an arc; one uses the points of the arc as a transformation function to go from one exam form’s scores to a different exam form’s scores. First, the user picks two exam score points, one high and one low, that correspond to equivalent scores on both exams. Most applications of this method at the time of this article have used guessing level and 100% correct on both exams as these two endpoints. Second, the user selects an equating function to find the arc’s required third point, which is somewhere between the selected endpoints. One can then draw a unique circle-arc between these points using the appropriate geometric equations. For further details about how to implement circle-arc equating, see Livingston and Kim (2009).
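
To make the geometry concrete, here is a minimal sketch of the symmetric version of circle-arc equating in Python (our own illustration, not the authors' code). It assumes the two endpoints and the mean-equated middle point have already been chosen; the function name and the example numbers are hypothetical.

```python
import numpy as np

def circle_arc_equate(x, low, middle, high):
    """Symmetric circle-arc equating sketch.

    low, middle, and high are (new-form score, old-form equivalent) points:
    typically low = (chance score, chance score), high = (max score, max score),
    and middle = (mean of the new form, its mean-equated value on the old form).
    Returns the equated value(s) of x on the old form's scale.
    """
    (x1, y1), (x2, y2), (x3, y3) = low, middle, high
    x = np.asarray(x, dtype=float)
    # If the three points are (nearly) collinear, fall back to the chord (a line).
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-10:
        return y1 + (x - x1) * (y3 - y1) / (x3 - x1)
    # Circumcenter and radius of the unique circle through the three points.
    ux = ((x1**2 + y1**2) * (y2 - y3) + (x2**2 + y2**2) * (y3 - y1)
          + (x3**2 + y3**2) * (y1 - y2)) / d
    uy = ((x1**2 + y1**2) * (x3 - x2) + (x2**2 + y2**2) * (x1 - x3)
          + (x3**2 + y3**2) * (x2 - x1)) / d
    r = np.hypot(x1 - ux, y1 - uy)
    # Use the half of the circle that passes through the middle point.
    sign = 1.0 if y2 >= uy else -1.0
    return uy + sign * np.sqrt(np.maximum(r**2 - (x - ux)**2, 0.0))

# Hypothetical 100-item forms with a chance score of 25; the middle point (62, 64)
# would come from an equating method applied at the new form's mean.
print(circle_arc_equate([40, 60, 85], low=(25, 25), middle=(62, 64), high=(100, 100)))
```

Because the endpoints are fixed, scores near the endpoints are pulled toward the identity line while scores near the middle point follow the chosen equating method, which is the compromise described above.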

Livingston and Kim (2009) found that the circle-arc method had lower root mean square difference (RMSD) and bias compared to mean equating, especially in the tails of the score distribution. This is not surprising, as circle-arc is a compromise between using an equating method and using identity equating (i.e., not equating). Scores near the equated middle arc point rely heavily on the equating method used, while scores far away from this point are regressed toward the identity equating solution. It, thus, may be preferable to many linear equating methods when test forms differ in difficulty and sample size is too small for equipercentile-type equatings. In a subsequent study, S. Kim and Livingston (2010) compared chained equipercentile equating with smoothing, chained linear equating, chained mean equating, symmetric circle-arc, and simplified circle-arc equating in a resampling common-item study. They found that circle-arc generally outperformed the other methods for sample sizes of 25 and 100.

Nominal Weights Mean Equating

The classical method of nominal weights mean equating is a simplified version of mean equating, which, in turn, is a simplified version of Tucker equating. Babcock et al. (2012) demonstrated that, if one assumes that mean performance on the anchor items and nonanchor items is roughly equal in the proportion-correct metric for the examinee population, one can simplify the slope term in the calculation of the synthetic means to be the number of items on the full exam divided by the number of anchor items. Removing the variances and covariances from the synthetic mean calculations removes the least stable of the statistics being calculated. The authors theorized that this should help stabilize the equating method against obtaining a strange sample, which can more easily happen in small-sample contexts. Babcock et al. (2012) compared nominal weights mean equating with six other techniques (smoothed equipercentile, Tucker, mean, synthetic, identity, and circle-arc equating) in a nonequivalent group anchor test design with sample sizes of 20, 50, and 80. They found that nominal weights mean equating was generally the most accurate method across the numerous conditions. They also found that circle-arc equating performed well in a variety of contexts.
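
To illustrate the computation, here is a minimal Python sketch of nominal weights mean equating as we read Babcock et al. (2012): Tucker-style synthetic means with the regression slopes replaced by the ratio of total items to anchor items, followed by mean equating. The function, variable names, and the default synthetic weight of .5 are our own illustrative choices.

```python
import numpy as np

def nominal_weights_mean_equate(x, new_scores, new_anchor, old_scores, old_anchor,
                                n_total_items, n_anchor_items, w_old_group=0.5):
    """Nominal weights mean equating sketch (NEAT design).

    new_* arrays come from the group taking the new form, old_* arrays from the
    group taking the old form; *_anchor are anchor-item scores.  The Tucker
    regression slopes are replaced by n_total_items / n_anchor_items.
    """
    gamma = n_total_items / n_anchor_items            # nominal-weights slope
    w_new_group = 1.0 - w_old_group                   # synthetic population weights
    anchor_gap = np.mean(new_anchor) - np.mean(old_anchor)
    mu_new_syn = np.mean(new_scores) - w_old_group * gamma * anchor_gap
    mu_old_syn = np.mean(old_scores) + w_new_group * gamma * anchor_gap
    # Mean equating: shift new-form scores by the difference in synthetic means.
    return np.asarray(x, dtype=float) - mu_new_syn + mu_old_syn
```

For the 100-item forms with 30 internal anchor items used later in this study, the nominal-weights slope would simply be 100/30.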

Rasch Calibration With Maximum Likelihood Estimation

Rasch models provide a framework to simultaneously score multiple exam forms, linking them onto the same scale (Lord, 1977). One of the most common applications of this method is anchor-item calibration, which uses data from multiple exam forms with shared items to link scores onto the same scale in just one step (Kelderman, 1988; Kolen & Brennan, 2004). This technique does not require an additional equating equation. Instead, the Rasch estimation algorithm automatically places all examinees’ scores onto the same scale (Kelderman, 1988). Many applications of this method use joint maximum likelihood (Linacre, 2009) or marginal maximum likelihood (Linacre, 2004) to estimate the item parameters. One advantage of the Rasch anchor-item calibration technique in a small sample context is that data from numerous previous exam administrations can be pooled to increase parameter estimation accuracy. One can take items and examinees from older exam forms and enter these data into the supermatrix that includes the current and immediate past exam forms. These additional data could increase the quality of the Rasch estimation. Recent research has begun to reveal the Rasch model’s properties in small sample contexts (Dwyer & Furter, 2018; Gregg, Peabody, & O’Neil, 2018).
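
As a concrete illustration of the pooling idea, the sketch below shows one way to stack several forms' response files into a single sparse supermatrix for concurrent calibration; the data frames, item labels, and person IDs are invented for illustration.

```python
import pandas as pd

def build_supermatrix(forms):
    """Pool response matrices from several forms into one examinee-by-item matrix.

    `forms` is a list of DataFrames, one per administration, indexed by examinee
    ID with item IDs as columns (0/1 responses).  Items an examinee never saw
    remain NaN, which a concurrent calibration treats as not administered.
    """
    return pd.concat(forms, axis=0, sort=False)

# Hypothetical toy example: two 3-item forms sharing item "i2" as an anchor.
form1 = pd.DataFrame({"i1": [1, 0], "i2": [1, 1], "i3": [0, 1]}, index=["p1", "p2"])
form2 = pd.DataFrame({"i2": [0, 1], "i4": [1, 1], "i5": [0, 0]}, index=["p3", "p4"])
print(build_supermatrix([form1, form2]))
```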

Rasch Calibration With Bayesian Methodologies

The typical sample size recommendation for Rasch modeling in a dichotomous context for the most accurate item parameter estimates in high-stakes situations is a few hundred examinees (Linacre, 1994; Lord, 2014). Matteucci and Veldkamp (2013), however, introduced the use of power prior distributions in Bayesian techniques to reduce error when estimating item response theory (IRT) models in small samples. Historical data have been shown to be a useful source for constructing more informative priors for situations where noninformative priors lead to suboptimal solutions (Ibrahim & Chen, 2000). Previous test administrations provide historical data on item responses, and item parameter point estimates and standard errors can inform the prior distributions (Matteucci & Veldkamp, 2013).

There have been numerous applications of Bayesian estimation via Markov chain Monte Carlo (MCMC) methods to the IRT family of models (e.g., Patz & Junker, 1999). To apply Bayesian estimation in a small sample Rasch setting, one could use historical data in a similar fashion as previously discussed with the non-Bayesian strategies and simply use MCMC as an estimation framework. The user typically applies a marginal prior distribution for a given set of parameters (say, normal [0, 1] for the item difficulties and a uniform distribution for θ). Another option, however, is to use previous calibrations’ results to place unique estimation priors on every item parameter. The Bayesian framework is sufficiently flexible to place a different estimation prior on each item that is based on past calibrations’ results. Using priors in this way could be quite helpful in small sample applications, as the estimation of item parameters based on the new data would be “reined in” by the priors based on previous uses of the items. This technique is an extension of the procedures Matteucci and Veldkamp (2013) proposed.
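
To make this concrete, the following is a minimal, self-contained sketch of a Metropolis-within-Gibbs sampler for the Rasch model in which every item difficulty can receive its own normal prior. It is our illustrative reimplementation in the spirit of Patz and Junker (1999), not the authors' algorithm or any published code, and the proposal step size and iteration counts are arbitrary.

```python
import numpy as np

def rasch_mwg(resp, prior_mean, prior_sd, n_iter=6000, burn=1000, step=0.5, seed=1):
    """Metropolis-within-Gibbs sketch for the Rasch model with item-specific priors.

    resp: persons-by-items array of 0/1 responses; np.nan marks items a person
    was not administered.  prior_mean and prior_sd give each item difficulty's
    normal prior.  Theta receives a flat (improper) prior.  Returns posterior
    mean item difficulties and thetas.
    """
    rng = np.random.default_rng(seed)
    n_persons, n_items = resp.shape
    administered = ~np.isnan(resp)
    y = np.nan_to_num(resp)
    theta, b = np.zeros(n_persons), np.zeros(n_items)

    def loglik(th, bb):
        # Rasch log-likelihood contributions, zeroed out for missing responses.
        eta = th[:, None] - bb[None, :]
        return np.where(administered, y * eta - np.logaddexp(0.0, eta), 0.0)

    b_draws, theta_draws = [], []
    for it in range(n_iter):
        # One Metropolis step per person (flat prior, symmetric random-walk proposal).
        prop = theta + rng.normal(0.0, step, n_persons)
        log_ratio = loglik(prop, b).sum(axis=1) - loglik(theta, b).sum(axis=1)
        theta = np.where(np.log(rng.uniform(size=n_persons)) < log_ratio, prop, theta)
        # One Metropolis step per item, using that item's own normal prior.
        prop_b = b + rng.normal(0.0, step, n_items)
        log_ratio = (loglik(theta, prop_b).sum(axis=0) - loglik(theta, b).sum(axis=0)
                     - 0.5 * ((prop_b - prior_mean) / prior_sd) ** 2
                     + 0.5 * ((b - prior_mean) / prior_sd) ** 2)
        b = np.where(np.log(rng.uniform(size=n_items)) < log_ratio, prop_b, b)
        if it >= burn:
            b_draws.append(b.copy())
            theta_draws.append(theta.copy())
    return np.mean(b_draws, axis=0), np.mean(theta_draws, axis=0)
```

With informative item priors, the model's scale is identified through the priors themselves; with a shared N(0, 1) item prior and a flat θ prior, the prior on the difficulties sets the scale.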

Current Study

Recent small-sample research has demonstrated that both classical approaches and IRT approaches may have a great deal of utility. Few studies, however, have compared the families of approaches directly in a small-sample context. We aim to make a direct comparison between two small-sample classical methods and several Rasch estimation methods to see which performs the best for small samples. The classical methods we used were circle-arc equating and nominal weights mean equating based on the past findings of these methods’ effectiveness in small samples. The Rasch techniques that we used were estimated using WINSTEPS (Linacre, 2009) or with Metropolis within Gibbs MCMC estimation (Babcock, 2011; Patz & Junker, 1999), varying how past data were incorporated into the new exam data’s estimation.

Method

Data

Data came from a major certification exam in medical imaging. This exam generally conforms to the Standards for Educational and Psychological Testing concerning credentialing exams, including standards for job analysis, exam construction, and measurement properties (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014). The exam is administered at professionally proctored computer-testing centers. Candidates took the fixed form of 220 items (200 scored, 20 pilots) in random order. This particular form had 1,116 first-time examinees. The scored portion of the exam consisted of mostly multiple-choice items with a few multiselect and sorted list items. The items used had relatively good fit to the Rasch model. No items had an INFIT statistic outside the range of 0.7 to 1.3, and only 12 questions had OUTFIT statistics outside the range of 0.7 to 1.3 (Wright & Linacre, 1994).

Simulating an Exam Program

To simulate a real examination context, this simulation drew six smaller exams and six corresponding samples of examinees’ real responses from the larger pool of 200 scored items and 1,116 examinees. We used only the 200 scored items from the exam form, as these items had all been previously used on exam forms and had desirable statistical properties.

The goal for the real data resampling was to simulate the first six administrations of a small-sample exam. When first creating an exam program, the exam creators generally place together an initial exam form of all new items. The exam creators then craft future exams from varying combinations of previously used items. Every new form of the exam adds to the item pool by piloting new items. We wanted to simulate this process of creating an initial exam and constructing subsequent exams that add to the item pool by taking data from a relatively long real exam and retroactively creating smaller exams.

The first exam drawn in each replication consisted of 100 scored items plus 20 pilot items, which were drawn from the larger pool of 200 scored items. The first exam draw simulated the initial creation of a new exam program. The second exam consisted of 100 scored items drawn from the items used on the first simulated exam plus 20 pilot items drawn from the remaining unused pool of 80 items and so on through the fifth simulated exam. These second through fifth simulated exams all added pilot items to the item pool for use on subsequent exams. The first four exams were, thus, drawn randomly and (on average) had no difficulty differences. The fifth exam could have had a difficulty difference depending on the condition.

The sixth and final exam was of the most interest to this study. All items in the pool of 200 were available for the sixth exam as scored items. We drew 30 anchor (equator) items from the fifth exam and drew the remaining 70 scored items from the pool of items that were not used on the fifth exam. A 30-item anchor for a 100-item test conforms to industry recommendations for internal anchor test length (Angoff, 1971; Wendler & Walker, 2006). The sixth exam had no pilot items, as it was the end of the simulation. The sixth exam simulates the point where, after multiple administrations and the gathering of small amounts of data, the exam producers may feel confident that there are enough examinees and items to begin equating.
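
For clarity, the form-drawing logic can be sketched as follows (our own schematic reconstruction, not the authors' simulation code). The sketch draws every form at random; in the actual study, examinee samples were drawn as well, and the difficulty of Exams 5 and 6 was manipulated in some conditions.

```python
import numpy as np

rng = np.random.default_rng(2019)
pool = np.arange(200)                         # the 200 previously used scored items

# Exam 1: 100 "scored" plus 20 "pilot" items drawn from the shared pool.
exams = [rng.choice(pool, size=120, replace=False)]
used = set(exams[0])

# Exams 2 through 5: scored items come from previously used items; pilots come from
# the unused remainder (which shrinks from 80 to 20 items) and then join the pool.
for _ in range(4):
    scored = rng.choice(sorted(used), size=100, replace=False)
    pilots = rng.choice(sorted(set(pool) - used), size=20, replace=False)
    exams.append(np.concatenate([scored, pilots]))
    used |= set(pilots)

# Exam 6: 30 anchor items from Exam 5 plus 70 scored items not used on Exam 5.
exam5 = exams[4]
anchors = rng.choice(exam5, size=30, replace=False)
others = rng.choice(sorted(set(pool) - set(exam5)), size=70, replace=False)
exam6 = np.concatenate([anchors, others])
```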

Table 1 illustrates the sampling design, along with some general performance statistics about the examinees and the exams for various conditions. Several points about this performance are worth noting. First, the performance of the first four randomly equivalent forms was virtually identical to the “Randomly Equivalent Exams” portion of Table 1. Second, we designed the harder exam, easier exam, greater ability, and lesser ability conditions to mimic differences in exam and examinee performance that we have observed in real data for small sample exam programs. Finally, concerning the test items, we drew a new sampling of items by the above algorithm for each replication in every cell of the study design.

Table 1.

Exam Sampling Design and Basic Statistics for Exams.

| Exam no. | Scored item pool (no. of items) | Pilot item pool (no. of items) | Scored item drawing methodology | Pilot item drawing methodology | Total exam items |
|---|---|---|---|---|---|
| 1 | Shared pool of 200 for scored and pilots | — | Random | Random | 120 |
| 2 | 120 | 80 | Random | Random | 120 |
| 3 | 140 | 60 | Random | Random | 120 |
| 4 | 160 | 40 | Random | Random | 120 |
| 5 | 180 | 20 | Condition-dependent | Last 20 remaining | 120 |
| 6 | 200 | 0 | Condition-dependent | N/A | 100 |

Exam Design Independent Variables

We manipulated three aspects of the exam. First, we manipulated the sample size for each administration (N = 5, 10, 20, 40, and 100). This gave a pooled total sample size across all administrations of N = 30, 60, 120, 240, and 600. Studies documenting the increased accuracy of equating methods with larger samples are too numerous to cite, and this study’s focus is on performance when the number of examinees is indeed quite small. Second, we manipulated the difficulty of the final two exams to be either randomly equivalent in difficulty, to have the final exam be easier than the penultimate exam, or to have the final exam be harder than the penultimate exam. Third, we manipulated the mean ability of the final two groups to be either randomly equal in ability, to have the final group be higher in ability, or to have the final group be lower in ability. We selected relative item difficulty and person ability based on the item and person statistics from the full data set of 200 items and 1,116 examinees. We conducted this selection based on classical item and person statistics, though selection based on Rasch IRT statistics would have yielded similar results. Table 1 also contains a summary of these difficulty differences from the 100 replications in each cell of the study design.

Dependent Variables

One challenge in comparing classical and IRT results is finding a suitable outcome metric. One way to compare which methods functioned best is to predict the real (i.e., observed from real data) number correct Exam 5 scores for the examinees taking Exam 6. Our simulation design started out with an operational data set of over 1,000 examinees; all the examinees had answered all 200 items. We selectively ignored the responses for certain items and examinees to simulate a small-sample testing program over multiple administrations. For example, Exam 6 had 100 scored items, so we de facto ignored 100 of the 200 real item responses from the full operational data set for the Exam 6 group in the equating simulation. We can, however, find each Exam 5 observed number correct score for the Exam 6 group using the full operational data set. Everyone in the Exam 6 group took all 200 items in the operational form. We can, therefore, calculate every person’s actual observed number correct score for any test that is a combination of these 200 items. This includes the all-important Exam 5 observed scores.

This observed number correct criterion measure was straightforward for two main reasons. First, one already has real number correct test results in hand, as all examinees have taken all 200 test questions. One simply sums up the number of correct responses for the Exam 6 group for those items selected for Exam 5. One does not have to rely on model-predicted scores, equating equations, or any sort of mathematical transformation (with its sets of assumptions). Second, one can use both classical and IRT equating methods to arrive at a projected number correct for the Exam 5 items. This is conducted automatically with classical equating. With IRT methods, one can take a person’s estimated θ and find the expected number correct by using the test response function transformation.
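
Under the Rasch model, for example, the test response function is simply the sum of the item response functions over the Exam 5 items, so the projection can be sketched as follows (our own generic code and names):

```python
import numpy as np

def expected_number_correct(theta, b_exam5):
    """Rasch test response function: expected number correct on a set of items.

    theta: array of examinee ability estimates (e.g., for the Exam 6 group).
    b_exam5: array of Rasch difficulties for the Exam 5 items.
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b_exam5, dtype=float)[None, :]
    p_correct = 1.0 / (1.0 + np.exp(-(theta - b)))   # Rasch item response function
    return p_correct.sum(axis=1)                      # sum over the Exam 5 items
```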

Using multiple equating techniques to predict performance on the Form 5 items for people taking Exam Form 6 compared the equating results against a solidly observed reality. After all, the purpose of equating or linking is to generalize examinee performance to something outside of the current exam (e.g., a different exam, Kolen & Brennan, 2004, Chapter 1, or a trait scale independent of an exam form, Embretson & Reise, 2000, Chapter 3). This technique of testing accuracy across all people is somewhat different from other equating simulation approaches, which attempt to evaluate equating accuracy at various points across a score scale. To help relate our technique to past techniques, we calculated bias (Equation 2) conditionally for people scoring below the mean and above the mean. These additional calculations give an indication as to how the equating methods function on the low end and the high end of the score scale.

The statistics we calculated on these outcome measures were the RMSD and bias, where values closer to zero indicated better performance. We calculated the RMSD as

\mathrm{RMSD} = \sqrt{\frac{\sum_{p=1}^{N} \left( X_p^{\mathrm{equated}} - X_p^{\mathrm{observed}} \right)^{2}}{N}}, \qquad (1)

where N was the total number of examinees taking Exam 6, $X_p^{\mathrm{equated}}$ was a person’s equated score from Exam 6 onto Exam 5, and $X_p^{\mathrm{observed}}$ was a person’s observed score from Exam 5 in the real data. We calculated bias as

\mathrm{Bias} = \frac{\sum_{p=1}^{N} \left( X_p^{\mathrm{equated}} - X_p^{\mathrm{observed}} \right)}{N}, \qquad (2)

where the terms have the same meanings as in Equation 1.
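
In code, Equations 1 and 2 reduce to the following (a trivial sketch with our own function names):

```python
import numpy as np

def rmsd(equated, observed):
    """Equation 1: root mean square difference between equated and observed scores."""
    d = np.asarray(equated, dtype=float) - np.asarray(observed, dtype=float)
    return np.sqrt(np.mean(d ** 2))

def bias(equated, observed):
    """Equation 2: mean signed difference between equated and observed scores."""
    return np.mean(np.asarray(equated, dtype=float) - np.asarray(observed, dtype=float))
```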

Finally, we examined the change in pass rate and the classification consistency. We determined the cut scores by finding the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th score percentiles on the fifth exam for the entire group of test takers. We then noted whether each examinee from Group 6 met or exceeded each of these cut scores on Exam 5 and whether that examinee’s equated score met or exceeded each of these cut scores. We combined the results of various cut scores for ease of presenting the results. The 10th, 20th, and 30th percentile cuts formed a low cut score group; the 40th, 50th, and 60th percentile cuts formed a middle cut score group; and the 70th, 80th, and 90th percentile cuts formed a high cut score group. Evaluating these multiple points of classification consistency also helps evaluate the methods’ performance across the entire score scale.
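
A sketch of this classification computation in our own code (the names are illustrative): the cuts are percentiles of the full group's observed Exam 5 scores, and each Exam 6 examinee's equated score is compared with his or her observed Exam 5 score at every cut.

```python
import numpy as np

def classification_summary(equated, observed_exam5, all_exam5_scores):
    """Pass-rate change and classification consistency at nine percentile cuts."""
    results = {}
    for pct in range(10, 100, 10):
        cut = np.percentile(all_exam5_scores, pct)
        pass_obs = np.asarray(observed_exam5) >= cut
        pass_equ = np.asarray(equated) >= cut
        results[pct] = {
            "pass_rate_change": pass_equ.mean() - pass_obs.mean(),
            "consistency": np.mean(pass_obs == pass_equ),
        }
    return results
```

Averaging the entries for the 10th through 30th, 40th through 60th, and 70th through 90th percentile cuts yields the low, middle, and high cut-score groups reported below.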

Equating Methods Used

To equate the sixth exam, we used two classical methods and five IRT methods. The two classical equating methods were nominal weights mean equating (Babcock et al., 2012) and circle-arc equating (Livingston & Kim, 2009). These methods were specifically designed for equating with small samples, and research has shown both to have promise for small-sample exam programs (Albano, 2015; Dwyer, 2016). We implemented nominal weights mean equating as described in Babcock et al. (2012) and circle-arc equating as described in Livingston and Kim (2009).

The five Rasch-based IRT methods were estimated with either joint maximum likelihood as implemented by WINSTEPS (Linacre, 2009) or Metropolis within Gibbs MCMC sampling as implemented by a user-written algorithm (Babcock, 2011). We combined data from multiple exam administrations to maximize the accuracy of the parameter estimates. This is somewhat different from previous research. Combining data across multiple administrations gives the model user more accurate results if the previous administrations are representative. Leveraging past data is an option available to operational testing programs that use IRT; it is more difficult to incorporate numerous past administrations in classical equating.

We will refer to the five Rasch methods as follows: fixed-parameter Rasch (FPR), supermatrix Rasch (SMR), Bayesian fixed-parameter Rasch (BFPR), Bayesian supermatrix Rasch (BSMR), and Bayesian item prior Rasch (BIPR). The two fixed-parameter conditions functioned as preequating conditions, where we estimated the Rasch item parameters based only on the first five administrations’ worth of data. We estimated these parameters by combining all five administrations’ data. We then fixed the item parameters to estimate the θs for the sixth administration. The two “supermatrix” conditions estimated the parameters by combining the data for all six administrations into a single calibration. We estimated the Rasch item parameters in the FPR and SMR conditions using WINSTEPS and in the Bayesian conditions with a user-written algorithm, as previously mentioned. The only constraint on the WINSTEPS estimation to identify the model was that the item difficulty parameters have a mean of 0. The only prior in the MCMC estimation of the Bayesian fixed-parameter and Bayesian supermatrix conditions was that the item difficulty parameters were distributed N(0, 1). The latent ability prior was a noninformative flat prior. In test runs, the model converged in less than 100 iterations by typical convergence criteria, such as R-hat and trace plots. In addition to these tests, we ran 50 replication simulations in selected cells and compared the Rasch b and θ values from a run with a 1,000-iteration burn-in and 5,000 estimation iterations against a run with a 2,000-iteration burn-in and 10,000 estimation iterations. The b and θ values correlated .99 or higher with RMSDs of .04 or less. We, thus, used a burn-in period of 1,000 with an additional 5,000 iterations retained for parameter estimation. This burn-in period is consistent with past research using the Rasch family of models (e.g., Cohen, Wollack, Bolt, & Mroch, 2002; S.-H. Kim, 2001; Maris, Bechger, & San Martín, 2015).

The final condition (Bayesian item prior Rasch) differed a bit from previous applications of Bayesian MCMC Rasch modeling. We estimated this condition’s Rasch difficulty parameters by combining the first five exam administrations, as in the Bayesian fixed-parameter Rasch condition. We then found each item’s point estimate and standard error. Finally, we estimated the Exam 6 item parameters using only the Exam 6 data but placing a prior on each individual item that was normal with a mean and standard deviation equal to the item’s estimate and standard error from the first five administrations’ calibration. This method leveraged the previous exam data through individualized item priors while directly calibrating only the current exam’s data. In fact, the ability of this method to use individual item-level prior distributions was the main reason for including Bayesian MCMC methods in this study. Table 2 illustrates some of the differences between the equating conditions for clarification.

Table 2.

Diagram of Equating Conditions Used in this Study.

Exam administration design (all conditions):
Administration 1: 100 “scored” and 20 “pilot” items, none previously used
Administration 2: 100 previously used “scored” and 20 new “pilot” items
Administration 3: 100 previously used “scored” and 20 new “pilot” items
Administration 4: 100 previously used “scored” and 20 new “pilot” items
Administration 5: 100 previously used “scored” and 20 new “pilot” items
Administration 6: 100 previously used “scored” items, with 30 overlapping from Exam 5

Classical conditions (circle-arc; nominal weights mean):
Data: Exams 5 and 6 only
Calibration: NA
Theta estimation, Exam 6: NA
Obtaining Exam 5 scores: conducted the equating procedure, placing Exam 6 scores on Exam 5’s scale

Rasch conditions estimated with WINSTEPS (FPR, SMR) and Bayesian MCMC (BFPR, BSMR):
Data: all six exams
Calibration: “pre-equating” fixed-parameter conditions calibrated the first five exams together and fixed the item parameters for the sixth exam; supermatrix conditions combined all data into a single calibration
Theta estimation, Exam 6: estimated sixth-exam thetas from the relevant item parameters using maximum likelihood (equivalent to MCMC with a flat prior)
Obtaining Exam 5 scores: used thetas to obtain model-predicted number correct scores for the items on the fifth exam

Bayesian item prior Rasch (BIPR) condition:
Data: all six exams
Calibration: calibrated the first five exams together; used item means and standard deviations as individual-item priors in the Exam 6 item estimation
Theta estimation, Exam 6: estimated maximum likelihood thetas using the item parameters from the individual-item prior estimation run
Obtaining Exam 5 scores: used thetas to obtain model-predicted number correct scores for the items on the fifth exam

Criterion (all conditions): compared predicted and observed fifth-exam scores for sixth-exam examinees

Note. FPR = fixed-parameter Rasch; SMR = supermatrix Rasch; BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch; MCMC = Markov chain Monte Carlo; NA = not applicable.
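
As a usage-level sketch, the BIPR condition in Table 2 could be wired together roughly as follows, reusing the rasch_mwg() sampler sketched earlier; the toy simulated data, item indices, and the constant prior standard deviation are all illustrative stand-ins rather than the study's actual values.

```python
import numpy as np

# Hypothetical end-to-end sketch of the BIPR workflow, reusing rasch_mwg() from the
# earlier sketch.  The toy data below stand in for the real response supermatrices
# and are complete rather than sparse for brevity.
rng = np.random.default_rng(7)
b_true = rng.normal(0.0, 1.0, 40)                    # toy pool of 40 items

def simulate(n_people, item_idx):
    theta = rng.normal(0.0, 1.0, n_people)[:, None]
    p = 1.0 / (1.0 + np.exp(-(theta - b_true[item_idx][None, :])))
    return (rng.uniform(size=p.shape) < p).astype(float)

# Step 1: concurrent Bayesian calibration of "Exams 1-5" with a common N(0, 1) prior.
hist_items = np.arange(40)
hist_resp = simulate(100, hist_items)
b_hist, _ = rasch_mwg(hist_resp, prior_mean=np.zeros(40), prior_sd=np.ones(40))

# Step 2: build one prior per Exam 6 item from the historical calibration.  The
# study used each item's point estimate and standard error; the constant 0.3 below
# is an illustrative stand-in for those standard errors.
exam6_items = np.arange(25)
prior_mean6 = b_hist[exam6_items]
prior_sd6 = np.full(exam6_items.size, 0.3)

# Step 3: calibrate Exam 6 alone with the item-specific priors; the resulting thetas
# would then feed the test response function to project Exam 5 number correct scores.
exam6_resp = simulate(20, exam6_items)
b6, theta6 = rasch_mwg(exam6_resp, prior_mean6, prior_sd6)
```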

Results

Results Displayed and Not Displayed

It is important to note the y-axis scales on the figures. The bottom value of the RMSD graphs is 4.0. We selected this floor psychometrically. Our criterion consisted of real observed scores, which contained measurement error. This caused the equatings’ RMSD to have a lower bound greater than zero. To find this lower bound, we conducted a separate simulation to calculate the classical standard error for 10,000 randomly generated exams of 100 items using all 1,116 examinees. The mean standard error was between 3.9 and 4.0, so it was reasonable to assume that the main study’s RMSD floor was around 4.0. The maximum of the RMSD graphs was 6.0, as we felt that RMSDs 50% greater than the randomly equivalent exams’ standard error were bad results no matter how much higher than 6.0 the RMSDs were. The bias graph y-axes go from −4.0 to 4.0, as 4.0 was roughly the size of the standard error for randomly selected forms. The proportion pass change graphs range from −.10 to .10, as we felt that a pass rate change of greater than 10% signified a substantial equating method failure.
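
One plausible way to reproduce this kind of floor check, though not necessarily the authors' exact computation, is to draw random 100-item forms from the full persons-by-items response matrix and average each form's classical standard error of measurement, here taken as the total-score standard deviation times the square root of one minus coefficient alpha.

```python
import numpy as np

def cronbach_alpha(x):
    """Coefficient alpha for a persons-by-items 0/1 matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def sem_floor(full_matrix, n_items=100, n_reps=10000, seed=3):
    """Mean classical SEM across randomly drawn n_items-item forms (a plausible floor check)."""
    rng = np.random.default_rng(seed)
    sems = np.empty(n_reps)
    for r in range(n_reps):
        cols = rng.choice(full_matrix.shape[1], size=n_items, replace=False)
        form = full_matrix[:, cols]
        total = form.sum(axis=1)
        sems[r] = total.std(ddof=1) * np.sqrt(1 - cronbach_alpha(form))
    return sems.mean()
```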

We display performance only for sample sizes of N = 5 and N = 20. After examining all results, we found that RMSD, bias, and classification levels were relatively stable for Ns of 20 or more, with improvements being extremely small and incremental. The N = 10 condition was, roughly speaking, a performance midpoint between the N = 5 and N = 20 conditions.

N = 5 Conditions

Figure 1 contains RMSDs for N = 5. Two key results reproduced previous studies’ findings. First, while identity equating performed well when the exams were (randomly) equal, identity equating performed extremely poorly when there were difficulty differences in the exams. The RMSDs for identity equating for the harder and easier exam conditions were all greater than 6.0, indicating an undesirable result. Second, when comparing the classical conditions to one another, circle-arc and nominal weights performed similarly in the aggregate of all conditions. Circle-arc tended to perform slightly better than nominal weights when the new exam was harder, while nominal weights performed better when the new exam was easier.

Figure 1. RMSD of equating conditions for N = 5.

Note. RMSD = root mean square difference; Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch (i.e., final calibration of only new items using priors from previous calibration); ↑ = Bar extends past the top of the graph, indicating very high RMSD. The “exam equal” conditions are randomly equal.

Some of the most interesting results came in comparing the Rasch methods to the classical methods. It appears that the Rasch methods, which combined multiple administrations’ data, outperformed the classical methods. The RMSD for every type of Rasch implementation was lower than the classical methods in every combination of examinee ability and exam difficulty for N = 5. The Rasch implementations even outperformed identity equating when the exam forms were randomly equal. When comparing the two WINSTEPS joint maximum likelihood implementations with the three Bayesian implementations, it appears that WINSTEPS performed slightly better when the exams were harder and Bayesian methods performed slightly better when exams were easier, though the performance differences were quite small.

Figure 2 contains the bias statistics. The mean bias levels for all conditions were similar and near zero for the equal exam conditions. The bias for identity equating went off the charts for the conditions with exam difficulty differences, which was consistent with previous research. Nominal weights appeared to have a somewhat wider range of conditional bias than circle-arc when the new exam was harder. When the new exam was easier, however, nominal weights had very little bias. The Rasch WINSTEPS conditions appeared to be relatively unbiased in aggregate, though the methods had a wider range of conditional bias when the newest exam was easier. The Bayesian conditions exhibited systematic bias when there were exam difficulty differences: they predicted scores that were too low when the new exam was harder and too high when the new exam was easier.

Figure 2. Bias of equating conditions for N = 5.

Note. Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch; middle bar = mean bias for condition; upper bar = bias for examinees above the new exam mean; lower bar = bias for examinees below the new exam mean; ↑ = Portions of bias plot are off of the chart, indicating high positive bias; ↓ = Portions of the bias plot are off of the chart, indicating high negative bias; ~ = condition where the bias for the group below the new exam mean was the upper bar and the bias for the group above the average was the lower bar.

Figure 3 contains the results for classification at N = 5. The results are largely consistent with the RMSD and bias results from Figures 1 and 2. All methods had pass rates similar to the observed Exam 5 results for the randomly equivalent exams, though there was a slight decrease in the passing proportion on average. Nominal weights and circle-arc did not classify examinees as consistently as the Rasch methods. When the exams were either easier or harder, the fixed-parameter Rasch and supermatrix Rasch methods tended to perform the best, with both a high level of classification consistency and a pass rate that was close to that observed in the real data. The Bayesian methods had lower overall pass rates when the final exam was harder and higher overall pass rates when the final exam was easier, which is consistent with Figure 2. Nominal weights and circle-arc often had overall pass rates that were similar to (though sometimes lower than) those observed overall, but those methods did not do as well when it came to classification agreement. As was the case with RMSD and bias, the identity conditions were often off the charts due to how poorly they performed.

Figure 3. Mean classification performance for N = 5.

Note. Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch. ▲ = mean of results for 10th, 20th, and 30th percentile cut scores; ◆ = mean of results for 40th, 50th, and 60th percentile cut scores, ▼ = mean results of 70th, 80th, and 90th percentile cut scores; black shading denotes mean classification consistency within 1% of the highest consistency among the 8 conditions, gray shading indicates 1% to 3% lower consistency compared to the highest decision, and white indicates consistency 3% lower or more.

N = 20 Conditions

Figure 4 contains the RMSD results for N = 20. Not surprisingly, all methods improved with higher N except for identity equating. Circle-arc equating appeared to slightly outperform nominal weights for RMSD except for conditions where the new exam was easier and there was a group ability difference, though the differences between the two methods were small. The Rasch methods slightly outperformed the classical methods, and there did not appear to be major RMSD differences among the different Rasch implementations at this sample size.

Figure 4. RMSD of equating conditions for N = 20.

Note. RMSD = root mean square difference; Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch.

Figure 5 contains the bias performance for the different methods. Not surprisingly, increasing the sample size contributed substantially to lowering the mean bias toward zero for all methods except identity equating. The bands representing the bias above and below the means also tended to be somewhat narrower. Nominal weights had extremely low conditional biases in the exam easier conditions compared with the other conditions. Bias performance was generally a bit better for the WINSTEPS Rasch methods when looking across all conditions, though these differences were small. The Bayesian methods were substantially less biased at N = 20 than at N = 5 for the conditions where there was an exam difficulty difference. The systematic direction of the bias, however, remained present: Score prediction was slightly too low in aggregate when the new exam was harder and slightly too high when the new exam was easier.

Figure 5. Bias of equating conditions for N = 20.

Note. Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch.

Finally, Figure 6 contains the classification metrics of pass rate and consistency, which improved for all methods. Consistent with Figures 4 and 5, all forms of the Rasch method performed quite similarly. Nominal weights and circle-arc had slightly worse performance than the Rasch methods when there was a difficulty difference. Identity equating was again off the charts with poor performance in many conditions.

Figure 6. Mean classification performance for N = 20.

Note. Id = identity (i.e., no equating); NW = nominal weights mean; Arc = circle-arc; FPR = fixed-parameter Rasch (i.e., preequating); SMR = supermatrix Rasch (i.e., new exam data included in calibration); BFPR = Bayesian fixed-parameter Rasch; BSMR = Bayesian supermatrix Rasch; BIPR = Bayesian item prior Rasch.

Discussion

We investigated two classical equating techniques and five Rasch techniques using small N, differences in form difficulty, and differences in examinee ability. In aggregate, the Rasch estimations using WINSTEPS (joint maximum likelihood) outperformed the other techniques on both the RMSD and bias outcome measures. The Rasch WINSTEPS methods did, however, have a wider range of conditional bias when the newest exam was easier. When previous data are available, Rasch techniques may be preferred regardless of difficulty differences. If numerous previous administrations are not available, however, the classical equating methods still do a good job of ensuring against the potentially highly biased results of not equating at all.

A key result that may influence method selection was that the Bayesian methods were slightly biased when there was an exam difficulty difference. A systematic difficulty difference in two forms’ items can interact with the priors used in item estimation. For example, if Exam B (new exam) is systematically more difficult than Exam A (old exam), then more of the Rasch difficulties from Exam B are above the mean of the prior than below, which is the opposite of Exam A. The overall item prior can then pull many of the Exam B difficulty parameters down toward the prior mean, as the normal prior used in this study did. Even with a flat prior on θ, which this study used, this systematic item parameter regression will cause the θs to be estimated slightly lower given a fixed response vector. The Exam A Rasch difficulties will, in contrast, be slightly regressed upward, as more would be below the mean of the prior. When using the thetas from Exam B to estimate scores on Exam A, one may be using lower than warranted Exam B thetas in combination with higher than warranted Exam A item estimates. This will cause the predicted number correct to be systematically too low. This interaction between differing exam difficulties and item parameter priors has not received much attention in the literature. Both an advantage and a limitation of Bayesian methods is the degree to which the priors influence the posterior distribution. This is particularly important in contexts with small N, when priors have the most influence over the results. Future research should investigate the use of various priors to see where systematic item parameter regression could be harmful to equating accuracy.
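
In symbols (our notation), the mechanism is the usual shrinkage of a maximum a posteriori estimate: with an N(0, 1) prior on item difficulty $b_i$, the log posterior for item $i$ is

\log p\!\left(b_i \mid \mathbf{y}\right) = \sum_{p=1}^{N}\left[ y_{pi}\left(\theta_p - b_i\right) - \log\!\left(1 + e^{\theta_p - b_i}\right)\right] - \frac{b_i^{2}}{2} + \text{constant},

and the $-b_i$ term in its derivative pulls every difficulty estimate toward the prior mean of 0. When most of a form’s items sit above (or below) 0, that pull is systematically downward (or upward), which in turn shifts the θs and the projected number correct scores in the directions observed in Figures 2 and 5.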

As with any study, it is important to highlight key limitations. Seven key limitations of this study were the data set that generated the results, the frequentist Rasch estimation method used, the relatively good fit of the data to the model, the Bayesian item priors used for the initial calibrations, that this study did not examine item drift, that we used only concurrent-type Rasch calibration, and the criterion measure used in the study. First, the data set we used was from a certification exam with rigorous eligibility criteria. Exams with different criteria or with different score distributions could produce different equating results. Second, we used only one frequentist-style IRT estimation method (WINSTEPS). It is possible that a different estimation method, such as one that implements marginal or conditional maximum likelihood, could perform better. In fact, future studies may also attempt using a different IRT model, such as the 2-parameter logistic model, for conditions where individual administrations were lower N but the combined data matrix had a larger N. Third, the data set we used fit the Rasch model relatively well on an item-level basis. It is possible that the Rasch model might not function quite as well in data sets that do not fit as well. Fourth, the overall item prior for the initial calibration of the Bayesian Rasch items was normal (0, 1). This sort of prior is common in practice. The prior did, however, produce bias in the equated scores when there was an exam difficulty difference. Future research should examine other priors that both set the model’s scale properly and do not bias the item parameter estimates. Fifth, it is unclear how the methods we examined would function in an item drift context. If selected items drift easier or harder over the course of the administrations used in the calibration, calibration results could yield additional person measurement bias (Babcock & Albano, 2012). Future research should examine which methods are most robust to item drift. Sixth, we used only concurrent-type calibrations. Future research on Rasch techniques may include a comparison of the concurrent-type calibration used in this study with an alternative approach for Rasch linking that uses test characteristic curves to transform a new form onto the same scale as the reference form (Kolen & Brennan, 2004). Finally, we used real scores as the criterion measure to evaluate whether the equating methods functioned well. It may be possible that other criterion measures could produce different results. However, given that a goal of equating is to provide an estimate of scores on one test from the scores of another test, one would be hard-pressed to provide a criterion with stronger links to using equating in an applied setting.

This study showed promise for using Rasch methods in small-sample contexts when previous administrations’ data are available. Future research will continue to investigate the usefulness of this technique in other contexts.

Acknowledgments

The authors would like to thank Jerry Reid, Lauren Wood, and Jeff McLeod for reviewing this article and giving suggestions.

Footnotes

Authors’ Note: Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and are not necessarily the official position of the American Registry of Radiologic Technologists or NACE International. The authors received no special funding or grants to conduct this research.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Albano A. D. (2015). A general linear method for equating with small samples. Journal of Educational Measurement, 52, 55-69. doi: 10.1111/jedm.12062
2. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: Author.
3. Angoff W. (1971). Norms, scales, and equivalent scores. In Thorndike R. (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
4. Babcock B., Albano A. (2012). Rasch scale stability in the presence of item drift. Applied Psychological Measurement, 36, 565-580. doi: 10.1177/0146621612455090
5. Babcock B. (2011). Estimating a noncompensatory IRT model using Metropolis within Gibbs sampling. Applied Psychological Measurement, 35, 317-329. doi: 10.1177/0146621610392366
6. Babcock B., Albano A., Raymond M. (2012). Nominal weights mean equating: A method for very small samples. Educational and Psychological Measurement, 72, 608-628. doi: 10.1177/0013164411428609
7. Brookhart S. M., Nitko A. J. (2007). Assessment and grading in classrooms. Upper Saddle River, NJ: Pearson.
8. Cohen A. S., Wollack J. A., Bolt D. M., Mroch A. A. (2002, April). A mixture Rasch model analysis of test speededness. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
9. Dwyer A. C. (2016). Maintaining equivalent cut scores for small sample test forms. Journal of Educational Measurement, 53, 3-22. doi: 10.1111/jedm.12098
10. Dwyer A. C., Furter R. (2018, April). Investigating the classification accuracy of Rasch equating with very small samples. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
11. Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
12. Gregg J., Peabody M., O’Neil T. (2018, April). Effect of sample size on common-item equating using the dichotomous Rasch model. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
13. Ibrahim J. G., Chen M.-H. (2000). Power prior distributions for regression models. Statistical Science, 15, 46-60. doi: 10.1214/ss/1009212673
14. Kelderman H. (1988). Common item equating using the loglinear Rasch model. Journal of Educational and Behavioral Statistics, 13, 319-336. doi: 10.3102/10769986013004319
15. Kim S., Livingston S. A. (2010). Comparisons among small sample equating methods in a common-item design. Journal of Educational Measurement, 47, 286-298. doi: 10.1111/j.1745-3984.2010.00114.x
16. Kim S.-H. (2001). An evaluation of a Markov chain Monte Carlo method for the Rasch model. Applied Psychological Measurement, 25, 163-176. doi: 10.1177/01466210122031984
17. Kim S., von Davier A. A., Haberman S. (2008). Small-sample equating using a synthetic linking function. Journal of Educational Measurement, 45, 325-342. doi: 10.1111/j.1745-3984.2008.00068.x
18. Kolen M. J., Brennan R. L. (2004). Test equating, scaling, and linking: Methods and practices. New York, NY: Springer-Verlag.
19. Linacre J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.
20. Linacre J. M. (2004). Estimation methods for Rasch measures. In Smith E. V., Smith R. M. (Eds.), Introduction to Rasch measurement (Chapter 2). Maple Grove, MN: JAM Press.
21. Linacre J. M. (2009). A user’s guide to WINSTEPS and MINISTEP: Rasch-model computer programs. Chicago, IL: Winsteps.
22. Livingston S. A., Kim S. (2009). The circle-arc method for equating in small samples. Journal of Educational Measurement, 46, 330-343. doi: 10.1111/j.1745-3984.2009.00084.x
23. Lord F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117-138. doi: 10.1111/j.1745-3984.1977.tb00032.x
24. Lord F. M. (2014). Small sample justifies Rasch model. In Weiss D. J. (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (Chapter 3). Cambridge, MA: Academic Press.
25. Maris G., Bechger T., San Martín E. (2015). A Gibbs sampler for the (extended) marginal Rasch model. Psychometrika, 80, 859-879. doi: 10.1007/s11336-015-9479-4
26. Matteucci M., Veldkamp B. P. (2013). Bayesian estimation of item response theory models with power priors. In Brentari E., Carpita M. (Eds.), Advances in latent variables. Retrieved from http://meetings.sis-statistica.org/index.php/sis2013/ALV/paper/view/2662
27. Patz R. J., Junker B. W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146-178. doi: 10.3102/10769986024002146
28. Skaggs G. (2005). Accuracy of random groups equating with very small samples. Journal of Educational Measurement, 42, 309-330. doi: 10.1111/j.1745-3984.2005.00018.x
29. Wendler C. L. W., Walker M. E. (2006). Practical issues in designing and maintaining multiple test forms for large-scale programs. In Downing S. M., Haladyna T. M. (Eds.), Handbook of test development (pp. 445-467). Mahwah, NJ: Lawrence Erlbaum.
30. Wright B. D., Linacre J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
