J Eval Clin Pract. 2024 Jul 29;31(1):e14114. doi: 10.1111/jep.14114

A many‐facet Rasch measurement model approach to investigating objective structured clinical examination item parameter drift

Karen Coetzee 1, Sandra Monteiro 2, Luxshi Amirthalingam 3
PMCID: PMC11664641  PMID: 39073068

Abstract

Rationale

Objective Structured Clinical Examinations (OSCEs) are widely used for assessing clinical competence, especially in high‐stakes environments such as medical licensure. However, the reuse of OSCE cases across multiple administrations raises concerns about parameter stability, known as item parameter drift (IPD).

Aims & Objectives

This study aims to investigate IPD in reused OSCE cases while accounting for examiner scoring effects using a Many‐facet Rasch Measurement (MFRM) model.

Method

Data from 12 OSCE cases, reused over seven administrations of the Internationally Educated Nurse Competency Assessment Program (IENCAP), were analyzed using the MFRM model. Each case was treated as an item, and examiner scoring effects were accounted for in the analysis.

Results

The results indicated that despite accounting for examiner effects, all cases exhibited some level of IPD, with an average absolute IPD of 0.21 logits. Three cases showed positive directional trends. IPD significantly affected score decisions in 1.19% of estimates, at an invariance violation of 0.58 logits.

Conclusion

These findings suggest that while OSCE cases demonstrate sufficient stability for reuse, continuous monitoring is essential to ensure the accuracy of score interpretations and decisions. The study provides an objective threshold for detecting concerning levels of IPD and underscores the importance of addressing examiner scoring effects in OSCE assessments. The MFRM model offers a robust framework for tracking and mitigating IPD, contributing to the validity and reliability of OSCEs in evaluating clinical competence.

Keywords: examiner scoring effects, item parameter drift, many‐faceted Rasch measurement model, OSCE

1. INTRODUCTION

Objective Structured Clinical Examinations (OSCEs) are considered the most reliable tool for assessing clinical competence. 1 These popular performance‐based assessments are employed particularly within the medical licensure field, where they serve as gatekeepers for public protection. They generally entail examinees rotating through a series of timed, simulated clinical tasks or cases while interacting with standardized patients. These interactions are scored by examiners (or, at times, by the standardized patients) using a list of pre‐defined criteria on a standardized rating scale or checklist. 2

Many studies have revealed an unavoidable source of error associated with the OSCE measurement environment, namely the error introduced by examiners and their scoring behaviour. 3 Most recently, Tavakol and Pinner 4 found that student ability did not map well onto the performance ratings awarded by the examiners tasked with scoring 16 OSCE cases; they highlighted examiner scoring error as the reason for incorrect inferences and advocated continuous monitoring. Despite consistent evidence of this examiner scoring error, the literature investigating other essential issues related to OSCEs and their psychometric properties largely fails to account for it, heightening the risk of reporting results that are less reflective of the unique test setting. 5 One such issue is item parameter stability over time, a topic that has raised concerns over and above those of security when adopting the OSCE format. 1

OSCEs are comparatively resource intensive in their content development and delivery processes. Consequently, content banks are limited, heightening the need to reuse cases over several test administrations. 6 The reuse of items across several exam administrations often results in a shift in their parametric values, a phenomenon commonly known as item parameter drift (IPD). 7 IPD is similar to differential item functioning (DIF) in that we ask, ‘Does the item function the same in two sets of data?’, 8 an important consideration when stability is paramount. For instance, the appropriate application of a Rasch‐based or Item Response Theory (IRT) model to evaluate the performance of items depends heavily upon the stability, or invariance, of parameter estimates across time. 9 A violation of invariance exists when an item drifts from its original parametric values within a population, compromising the accuracy of decisions and interpretations attached to generated test scores. 10 , 11 Item parameter estimates may change over time due to many factors other than sampling error, 12 for instance, changes in curriculum, aspects of instruction, social events or simply the disclosure of test content by previous test takers. 7 , 13 Given the high‐stakes environment in which OSCEs operate, the need to consistently address and monitor IPD is vital.

The detection of IPD and its impact on various test settings has been relatively well studied, with some work dating as far back as Lord's 14 proposed χ2 test for multiple‐group differential item functioning. Few studies, however, have investigated this issue within performance‐based contexts such as OSCEs, and those that have took place some time ago and produced contradictory and limited results, given their exclusion of the inherent examiner scoring error issue. For instance, McKinley and Boulet 15 investigated drift in OSCE‐like case difficulty estimates in a large‐scale, high‐stakes medical certification exam. Their study used an Analysis of Covariance (ANCOVA) procedure to detect significant changes in mean difficulty estimates over several test administrations and included the standardized patients portraying the case and the year quarter as independent variables. They detected significant changes in the mean difficulty of most repeated cases, with a few cases displaying large differences that warranted concern, and highlighted the need to monitor item parameters to ensure the accuracy of score comparisons and pass‐fail decisions for examinees assessed on multiple test forms throughout the year. In response, Baig and Violato applied a two‐parameter IRT model to six high‐stakes OSCE cases to investigate psychometric stability beyond shifts in difficulty parameters, also considering discrimination and reliability estimates. They concluded that the cases showed sufficient stability to allow repeat administrations without compromising psychometric properties and validity. Their study was criticized for including only two time points, which inhibits the investigation of any directional trends across administrations, an issue raised in a previous IPD study conducted by DeMars. 7

To represent the issue of IPD specific to the OSCE environment, we must conduct an investigation that accounts for the unique aspects of this test environment and adopt a model capable of doing so. 5 With the Many‐facet Rasch Measurement (MFRM) model, aspects considered to contribute a significant source of unwanted variance, such as examiner scoring behaviour, can be conveniently accounted for directly within the model, so that the calculated measurement estimates are more reflective of the measurement setting. 5 By taking advantage of additional model features, we can investigate the issue at the group, individual and interaction level, thus gaining further insight. 16 , 17 This study aimed to contribute further and more reflective insight into the issue of IPD in reused OSCE cases by using an MFRM model that accounts for inherent examiner scoring effects. In this study, a single OSCE case was treated as an item. The MFRM model was adopted to address the following questions:

  1. Accounting for examiner scoring effects, to what extent is IPD present in high‐stakes OSCE cases reused across several administrations?

If IPD is present, we can then ask:

  a. To what extent does IPD impact examinee performance scores?

  b. To what extent do score criteria biases contribute to IPD?

2. METHODS

This study was a retrospective analysis of data from a competency‐based assessment of nursing competencies. Data collected over the 2017–2019 test period from 12 OSCE cases, reused over seven separate and consecutive test administrations of the high‐stakes Internationally Educated Nurse Competency Assessment Program (IENCAP), were used to conduct this study.

To conduct the analysis for this study, we relied on the results of an MFRM model applied to the data of each case separately using FACETS software (Version 3.80.3 18 ), as well as traditional statistical analyses in IBM SPSS Statistics software (Version 24.0).

2.1. Assessment context

The IENCAP is a high‐stakes assessment administered to internationally educated nurse graduates (IENs) as part of their licensure to practice within the province of Ontario. The assessment is developed and administered by Touchstone Institute in collaboration with the College of Nurses of Ontario (CNO) and consists of two test components, namely a written knowledge‐based multiple‐choice test and a performance‐based 12‐case OSCE. For each attempted OSCE case, examinees complete a 7‐min standardized patient interaction and a 4‐min post‐encounter probe, both scored by a single examiner. Only data related to the 7‐min OSCE interactions were included for investigation in this study.

3. DATA

All assessment data were saved on the Touchstone Institute server. Data from seven administrations of the IENCAP were included in this study. No assessment data were excluded. Each examinee was scored on 12 OSCE cases. There were seven to eight competency scores per case for each examinee. There were no exclusion criteria.

3.1. OSCE cases

The 12 OSCE cases targeted different nursing competencies across various contexts and conditions, as outlined by the test blueprint. 19 Each case required examinee performances to be scored on the same seven to eight competency criteria during the 7‐min standardized patient interaction. These included Health History and Data Collection, Prioritization, Implementation of Care Strategies, Responsibility and Integrity, Ethical Safety, Communication, and Collaboration with the Client. An eighth criterion, Physical Assessment, was assessed in six of the cases. Examinee performances on each of these criteria were scored by a single examiner for each attempted case using the same 5‐point rating scale, with a score of one representing the poorest performance.

3.2. Examiners and examinees

A total of 588 examinees attempted the 12 OSCE cases administered across the seven consecutive test administrations (75 to 93 examinees per administration). As examinees are allowed only one attempt at the IENCAP assessment, the data contained no repeat examinees. A total of 27 examiners were assigned to score examinee performances in each of the cases (3 to 4 examiners assigned to score 3 to 4 separate examinee groups per case and test administration). As experienced examiners may be invited back to score multiple test administrations, examiners may be repeated within or across the case‐specific data. Repeat examiners were not tracked in this study and were therefore treated as individual examiners within each case.

4. ANALYSIS

4.1. MFRM model and assumptions

Each OSCE case was analysed individually using the same 4‐facet MFRM rating scale model; the model was therefore applied 12 times, once for each of the 12 OSCE cases. The rating scale model assumes that all thresholds on the 5‐point rating scale are equivalent across all elements of a given facet. 20 The mathematical model specified for each case is represented by the equation below:

$$\log\left(\frac{P_{njikx}}{P_{njik(x-1)}}\right) = B_n - C_j - D_i - E_k - F_x$$

Where:

  • P_{njikx} = probability of examinee n being assigned a rating score of x on competency criterion k in reused‐case administration i by examiner j

  • P_{njik(x−1)} = probability of examinee n being assigned a rating score of x − 1 on competency criterion k in reused‐case administration i by examiner j

  • B_n = ability of examinee n

  • C_j = severity of examiner j

  • D_i = difficulty of reused‐case administration i

  • E_k = difficulty of competency criterion k

  • F_x = difficulty of being assigned a rating score of x relative to a rating of x − 1.
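
To make the equation concrete, the following minimal Python sketch converts one set of facet measures into rating‐category probabilities under the rating scale model above. The parameter values are purely illustrative assumptions; in the study, these measures and the thresholds F_2 to F_5 are estimated by FACETS from the observed ratings.

```python
import numpy as np

def category_probabilities(B, C, D, E, F):
    """Rating-category probabilities for one observation under the 4-facet
    MFRM rating scale model: log(P_x / P_{x-1}) = B - C - D - E - F_x.

    B: examinee ability, C: examiner severity, D: administration measure,
    E: criterion measure (all in logits); F: thresholds F_2..F_5 for the
    5-point scale (illustrative values only).
    """
    adjacent_logits = [B - C - D - E - f for f in F]
    # Each category's numerator is exp of the running sum of adjacent-category
    # logits; the lowest category's sum is fixed at 0 by convention.
    cumulative = np.concatenate(([0.0], np.cumsum(adjacent_logits)))
    numerators = np.exp(cumulative)
    return numerators / numerators.sum()

# Illustrative values only; all measures are in logits.
probs = category_probabilities(B=0.8, C=0.3, D=-0.2, E=0.1,
                               F=[-1.6, -0.4, 0.5, 1.5])
print(np.round(probs, 3))         # probabilities of ratings 1-5
print(1 + int(np.argmax(probs)))  # most likely rating on the 5-point scale
```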

The four facets included in each case model were Facet 1: Examinees (N = 588), Facet 2: Reused‐Case Administrations (N = 7), Facet 3: Examiners (N = 27) and Facet 4: Criteria (N = 7–8). Given that examinees were scored by a single examiner in each attempted case, the data lacked the connectivity needed to make the necessary comparisons between elements within each facet, a typical issue in OSCE designs. To overcome this, we followed Linacre's 18 recommendation to anchor the elements of an appropriate facet. We chose to anchor the elements of Facet 4: Criteria, so that all element measurements for the remaining facets would be calculated relative to this facet. The anchor values were the difficulty estimates for the seven to eight competencies generated for each case in previous analyses; banked statistics for the IENCAP from its initial 2015/2016 years were used to identify these criteria estimates. Although less optimal than a fully connected data set, this approach addresses the connectivity issue sufficiently to illustrate the use of the model to detect station drift in OSCEs and the need to remove examiner error in doing so. In each case model, Facet 2: Reused‐Case Administrations, Facet 3: Examiners and Facet 4: Criteria were specified as centred (Facet 1: Examinees was non‐centred).

4.2. Overall model and fit statistics

To take advantage of Rasch‐based properties, such as measurement invariance, which describes ability and difficulty estimates that remain stable across different test items or examinee groups, the data need to meet the strict Rasch criterion of unidimensionality: the measurement of only one underlying construct. Where there is sufficient data–model fit, the assumption of unidimensionality is supported. 17 To determine the appropriateness of analysing the OSCE cases and their data using the MFRM model, we investigated fit both globally and at the individual facet level. In terms of global fit, we examined the standardized residuals between observed and expected responses, as recommended and generated as part of the software output. According to Linacre, 18 satisfactory model fit is indicated when about 5% or less of absolute standardized residuals are ≥2. In terms of fit at the facet level, we examined various fit statistics also generated as part of the output. The fit statistics provide a further indication of the degree to which the data fit the model predictions and the extent to which they can be considered productive for measurement. For this study, we used the weighted Infit Mean‐Square (MnSq) statistic and adopted the recommended range of 0.50–1.50 as an indication of sufficient fit. 18
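
As a rough illustration of the fit criteria described above, the sketch below computes the weighted Infit MnSq and the percentage of large standardized residuals from observed scores, model‐expected scores and model variances. In practice, these expected values and variances come from the fitted model (FACETS reports them); the function names and inputs here are illustrative assumptions.

```python
import numpy as np

def infit_mnsq(observed, expected, variance):
    """Weighted (information-weighted) Infit Mean-Square for one element:
    the sum of squared score residuals divided by the sum of model variances.
    Values within roughly 0.50-1.50 are treated as productive for measurement."""
    residuals = np.asarray(observed, float) - np.asarray(expected, float)
    return float(residuals @ residuals / np.sum(variance))

def pct_large_std_residuals(observed, expected, variance, cutoff=2.0):
    """Percentage of absolute standardized residuals at or above `cutoff`;
    about 5% or less is taken as satisfactory global model fit."""
    z = (np.asarray(observed, float) - np.asarray(expected, float)) / np.sqrt(variance)
    return float(100.0 * np.mean(np.abs(z) >= cutoff))
```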

4.3. Investigating item parameter drift

The MFRM model is essentially an additive linear model based on a logistic transformation of observed ratings to a common interval or logit scale. 17 Accounting for all facets included in the model, the analysis calculates a measurement estimate for each individual element that makes up each facet along this logit scale. At the same time, FACETS calculates various group‐level separation statistics that provide an indication of the overall magnitude of variance between the individual elements. 17

To address question 1 regarding the presence of IPD in each reused OSCE case, we relied specifically on the separation statistics generated for Facet 2: Reused‐Case Administrations and applied the Rasch‐based assumption of measurement invariance, which states that, given sufficient data fit and after accounting for examiner scoring behaviour, measurement estimates should remain relatively stable, or invariant, across each administration in which the OSCE case was reused. In other words, the seven administration measurement estimates generated for Facet 2: Reused‐Case Administrations in each of the 12 OSCE case models should display low variability, or high homogeneity, to be considered invariant. Where this was not the case, significant variability or heterogeneity across the administration estimates served as evidence of IPD at the group level.

The separation statistics investigated for Facet 2: Reused‐Case Administrations included, firstly, the homogeneity statistic, reported as the fixed (all same) Chi‐squared, which tests the null hypothesis that the measures in the population are all the same. It provides an indication of the statistical significance of differences between the measurement estimates of at least two elements included in a facet. The chi‐square statistic was used to investigate homogeneity across the seven elements included in Facet 2: Reused‐Case Administrations for each of the 12 cases in the study. Secondly, the Separation Index, reported as Strata in FACETS, indicates the number of statistically distinct levels that separate elements; if homogeneity is the goal (i.e., less variance between elements), lower numbers are preferred. Lastly, the Reliability of Separation Index, reported as the Reliability statistic in FACETS, provides an indication of how well the elements within a facet are separated and thereby reliably define the facet. 17 It indicates how dissimilar the elements within each facet performed. This statistic ranges from zero to one, with values close to zero indicating highly similar or interchangeable elements and values closer to one indicating higher levels of heterogeneity. Administration estimates were also investigated at the individual level to help better understand and describe the results.
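
These separation statistics can be reproduced from the element measures and their standard errors using standard Rasch formulas, as in the hedged sketch below. The standard errors of 0.10 logits are assumed for illustration only (the actual values come from the FACETS output), so the computed values will not exactly match those reported in Table 1.

```python
import numpy as np
from scipy.stats import chi2

def separation_statistics(measures, std_errors):
    """Group-level separation statistics for one facet, e.g., the seven
    reused-case administration measures of Facet 2 for a single case."""
    d = np.asarray(measures, float)           # element measures (logits)
    se = np.asarray(std_errors, float)        # their standard errors

    observed_var = d.var(ddof=1)              # observed variance of measures
    error_var = float(np.mean(se ** 2))       # mean-square measurement error
    true_var = max(observed_var - error_var, 0.0)

    reliability = true_var / observed_var if observed_var > 0 else 0.0
    separation = np.sqrt(true_var) / np.sqrt(error_var)   # "true" SD / RMSE
    strata = (4 * separation + 1) / 3         # statistically distinct levels

    # Fixed (all same) chi-square: are all elements plausibly one common measure?
    w = 1.0 / se ** 2
    common = np.sum(w * d) / np.sum(w)
    chi_sq = float(np.sum(w * (d - common) ** 2))
    p_value = float(chi2.sf(chi_sq, df=len(d) - 1))

    return {"reliability": reliability, "strata": strata,
            "chi_sq": chi_sq, "p": p_value}

# Case 1's administration measures from Table 1, with assumed SEs of 0.10 logits.
print(separation_statistics([0.04, 0.21, 0.00, -0.06, -0.39, 0.13, 0.08],
                            [0.10] * 7))
```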

Lastly, to gain insight into any directional trends over time, we conducted a correlation analysis for each case in SPSS using the individual Facet 2 measurement estimates and the administration order (1 to 7).
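
The directional‐trend check is a simple Pearson correlation between a case's seven administration measures and the administration order. As a sketch (using SciPy rather than SPSS), the example below applies it to Case 3's column from Table 1 and reproduces, to rounding, the r of 0.923 reported in Section 5.2.

```python
from scipy.stats import pearsonr

# Case 3's administration measures (logits) from Table 1, in administration order.
measures = [-0.40, -0.16, -0.03, 0.07, -0.07, 0.19, 0.40]
order = [1, 2, 3, 4, 5, 6, 7]

r, p = pearsonr(order, measures)
print(f"r = {r:.3f}, p = {p:.3f}")  # a significant positive r indicates a directional trend
```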

4.4. Impact of IPD on examinee scores

A fair average, the expected score based on the MFRM model parameter estimates, is reported for all elements and facets in the FACETS software. Fair scores result from the transformation of proficiency estimates reported in logits to the corresponding scores on the raw‐score scale. 17 For Facet 1: Examinees, for instance, the fair score is the score that a particular examinee would have obtained after accounting for all other facets included in the model, most importantly examiner scoring effects. The observed average is the score without accounting for these examiner effects.

To address question 1a regarding the impact of IPD on examinee performance scores, we relied specifically on the fair average score generated for each examinee in each OSCE case and submitted these to a one‐way ANOVA in SPSS together with the administration in which they completed the case. In other words, we conducted separate one‐way ANOVAs for each case with the examinee fair averages serving as the dependent variable and the administration (1 to 7) serving as the independent variable to determine any significant score differences as a result of IPD.
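
For the score‐impact check, the following sketch shows an equivalent one‐way ANOVA in Python rather than SPSS. The fair‐average scores here are simulated placeholders grouped by the seven administration sample sizes reported in Section 3.2; in the study, the dependent variable was the FACETS fair average for each examinee.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated fair-average scores on the 5-point scale, one group per administration
# (sample sizes 75-93 as in the study); real values would come from the FACETS output.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=3.7, scale=0.5, size=n) for n in (75, 78, 83, 83, 87, 93, 89)]

f_stat, p_value = f_oneway(*groups)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.3f}")
```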

4.5. Score criterion biases

Lastly, to answer question 1b regarding the extent to which bias surrounding the scored criteria contributed to the detected IPD, we conducted an exploratory interaction analysis applied to Facet 2: Reused‐Case Administrations and Facet 4: Criteria. For this analysis, the FACETS software reports a Bias statistic that can be used to judge the statistical significance of the size of the bias parameter estimate for a competency criterion scored within a specific test administration. In other words, the Bias statistic tests the hypothesis that there is no bias apart from measurement error. The analysis tests for patterns of unexpected ratings related to the scored criterion. 17 Bias estimates greater than zero indicate observed scores that are higher than the model expects, while estimates less than zero indicate observed scores that are lower than expected. Typically, t values associated with the Bias statistic that achieve absolute values of at least 2 can be considered indicative of bias. 16 , 17
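
Operationally, the bias flag reduces to dividing each administration‐by‐criterion bias estimate by its standard error and checking the |t| ≥ 2 rule referenced above. The sketch below illustrates this with hypothetical estimates and standard errors; the real values are taken from the FACETS bias/interaction report.

```python
def flag_bias(bias_estimates, std_errors, t_cutoff=2.0):
    """Return the (administration, criterion) interaction terms whose bias
    t value (estimate / standard error) reaches the conventional cutoff."""
    flagged = {}
    for key, bias in bias_estimates.items():
        t = bias / std_errors[key]
        if abs(t) >= t_cutoff:
            flagged[key] = round(t, 2)
    return flagged

# Hypothetical bias estimates (logits) and standard errors for two interaction terms.
estimates = {("Admin 1", "Communication"): 0.15,
             ("Admin 1", "Prioritization"): 0.45}
ses = {("Admin 1", "Communication"): 0.20,
       ("Admin 1", "Prioritization"): 0.18}
print(flag_bias(estimates, ses))  # only terms with |t| >= 2 are returned
```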

5. RESULTS

5.1. Overall model and facet results

An investigation of global model fit based on residuals revealed that, of the valid responses associated with the 12 cases, less than 5% of the absolute standardized residuals reached values ≥ 2 in each case (mean = 1.49%, range: 0.68%–2.43%). This is in line with the recommendation and suggests sufficient fit of the data to the model, deeming the data appropriate for investigation using the MFRM model.

Separate output with statistics and measurement estimates specific to each facet was generated for each of the 12 OSCE cases. Facet 1: Examinees represented examinee ability estimates on the seven to eight criteria scored in each case and accounting for all facets included in the model. Overall, across all 12 case models, examinees achieved an average of 0.89 logits (SE = 0.89) or 3.37 (SD = 0.59) on the original five‐point scale. Examinees achieved the lowest ability estimate average in Case 12 (−0.99 logits, SE = 0.97 or M = 3.03, SD = 0.60) and the highest in Case 7 (2.63 logits, SE = 0.89 or M = 3.76, SD = 0.50). For this facet, all case models achieved an average Infit MnSq value of 1.06 (Range: 0.89–1.26) indicating adequate data fit for productive measurement.

5.1.1. Facet 2: Reused‐Case Administrations

This facet represented the measurement estimates calculated for each of the seven administration elements. As the results of this facet were used to answer question 1, the specific estimates for each case and administration are reported in Table 1 and discussed further below. In terms of data fit, the administration elements for each case achieved values within the recommended range; specifically, the average Infit MnSq statistic across all 12 cases was 1.09 (Range: 0.96–1.20), indicating sufficient data fit.

Table 1.

 Administration measurement estimates (logits) generated for 12 reused OSCE cases.

| Administration | n | Case 1ᴾ | Case 2 | Case 3ᴾ | Case 4 | Case 5ᴾ | Case 6 | Case 7ᴾ | Case 8 | Case 9ᴾ | Case 10 | Case 11ᴾ | Case 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Admin 1 | 75 | 0.04 | −0.14 | −0.40 | 0.07 | −0.26 | 0.33 | 0.58 | 0.35 | −0.18 | 0.17 | 0.08 | −0.22 |
| Admin 2 | 78 | 0.21 | −0.40 | −0.16 | −0.38 | −0.29 | 0.34 | −0.38 | −0.32 | −0.01 | 0.04 | 0.12 | 0.17 |
| Admin 3 | 83 | 0.00 | −0.27 | −0.03 | −0.22 | −0.23 | 0.12 | 0.04 | 0.08 | 0.06 | −0.31 | 0.11 | −0.23 |
| Admin 4 | 83 | −0.06 | −0.21 | 0.07 | 0.20 | 0.03 | −0.25 | −0.28 | 0.26 | −0.03 | −0.03 | −0.11 | 0.03 |
| Admin 5 | 87 | −0.39 | 0.25 | −0.07 | −0.15 | −0.12 | −0.27 | 0.20 | −0.11 | −0.15 | 0.53 | −0.25 | −0.40 |
| Admin 6 | 93 | 0.13 | 0.34 | 0.19 | 0.11 | 0.45 | −0.37 | 0.03 | −0.01 | 0.26 | −0.09 | 0.02 | 0.14 |
| Admin 7 | 89 | 0.08 | 0.25 | 0.40 | 0.37 | 0.42 | 0.10 | −0.19 | −0.25 | 0.06 | −0.30 | 0.03 | 0.50 |
| Abs. average | | 0.13 | 0.27 | 0.19 | 0.21 | 0.26 | 0.25 | 0.24 | 0.20 | 0.11 | 0.21 | 0.10 | 0.24 |
| Separation index (Strata) | | 2.64 | 4.61 | 4.03 | 3.09 | 5.06 | 4.20 | 4.64 | 5.68 | 2.11 | 4.08 | 2.08 | 4.06 |
| Reliability of separation | | 0.75 | 0.91 | 0.88 | 0.81 | 0.93 | 0.89 | 0.91 | 0.82 | 0.64 | 0.89 | 0.63 | 0.89 |

Notes: ᴾ denotes physical assessment cases; N = 588; all cases achieved significant fixed (all same) Chi‐squared values.

Abbreviation: OSCE, Objective Structured Clinical Examination.

5.1.2. Facet 3: Examiners

This facet displayed the calculated estimates representing scoring stringency across the 27 examiners assigned to score examinee performances in each case. Overall, the 12 case models achieved an average Infit MnSq value of 1.11 (Range: 0.96–1.30), within the recommended range for data fit. The most lenient examiner achieved a measurement estimate of 1.68 logits (SE = 0.19) above the mean (0.00) and the most stringent examiner an estimate of −1.75 logits (SE = 0.18). Across all 12 models, an average absolute difference of 2.61 logits separated the most lenient and most stringent scoring examiner represented in each case. Separation statistics indicated significant variability across examiners for each case model. Specifically, all cases achieved a significant fixed (all same) Chi‐squared statistic, an average Strata of 4.86 (Range: 3.12–6.09), demonstrating at least four distinct examiner scoring levels within each case, and an average Reliability of Separation value of 0.92 (Range: 0.81–0.95), indicating a high level of dissimilarity in examiner score allocation.

5.1.3. Facet 4: Criteria

This facet represented the difficulty estimates for each of the competency criteria scored within each specific case. Overall, across all 12 cases, Communication was the easiest criterion with an average estimate of 0.51 logits (SE = 0.08) and Physical Assessment the most difficult with an average of −1.40 logits (SE = 0.09); for the cases that did not assess Physical Assessment, Implementation of Care was the most difficult (average −0.58 logits, SE = 0.08). For all 12 cases, the criteria elements within this facet achieved good fit, with an average Infit MnSq value of 1.11 (Range: 0.96–1.27), also within the recommended range for productive measurement.

5.2. Item parameter drift in 12 OSCE cases

The model‐generated administration measurement estimates and separation statistics calculated for Facet 2: Reused‐Case Administrations in each separate OSCE case are summarized in Table 1. Results indicated that, after accounting for all facets included in the model, including examiner scoring, all 12 cases reflected significant IPD across the seven administrations. All cases achieved a significant fixed (all same) Chi‐squared statistic, with between two (Strata = 2.11, Case 9) and six (Strata = 5.68, Case 8) statistically distinct levels among the individual administration elements and an average of four (Mean Strata = 3.86). Reliability of Separation values ranged from moderate at 0.63 (Case 11) to high at 0.93 (Case 5), with an average of 0.83 across all 12 cases, further highlighting dissimilarity between administration estimates.

The average absolute deviation of administration estimates from the mean of 0.00 logits ranged from 0.10 (Case 11) to 0.27 (Case 2), with an average of 0.20 across all cases. Case 7 contained the administration with the single greatest deviation above the mean, namely 0.58 logits for Admin 1, and Case 1 the single greatest deviation below the mean, namely −0.39 logits for Admin 5.

In terms of directional trends in estimates over time, the correlation analysis determined that three cases correlated significantly in the positive direction; in other words, as these cases were reused over the seven administrations, their estimates increased, that is, the cases were experienced as less difficult after each reuse iteration. The specific results were Case 2, r(6) = 0.799, p = 0.031; Case 3, r(6) = 0.923, p = 0.003; and Case 5, r(6) = 0.887, p = 0.008.

5.3. Impact of IPD on examinee scores

The ANOVAs conducted for each case using the model‐generated examinee fair averages and administration (1–7) revealed no significant differences for any case except Case 7. For this case, the main effects test achieved F(6, 581) = 2.55, p = 0.019, significant at the p < 0.05 level. The LSD post hoc test revealed that examinees who completed Case 7 during Admin 1 (n = 75) achieved significantly higher fair average scores (Mean = 3.95, SD = 0.77) than those in the other administrations, whose means ranged from 3.68 (SD = 0.38) in Admin 4 (n = 83) to 3.78 (SD = 0.51) in Admin 3 (n = 83).

5.4. Criterion bias interactions in IPD cases

An exploratory bias interaction analysis was conducted between Facet 2: Reused‐Case Administrations and Facet 4: Criteria for Case 7, given the significant difference revealed for question 1a. This was done to investigate the extent to which a specific competency criterion scored within the case contributed significantly to the detected variability across the administrations, particularly Admin 1, given its significant impact on examinee scores; in other words, to identify specific competency criteria that achieved estimates significantly higher than model expectations. Results indicated that for Case 7's Admin 1, none of the competency criteria achieved significant bias statistics (all absolute t values < 2), indicating difficulty consistent with model expectations across all eight performance criteria scored within this case and administration.

6. DISCUSSION AND CONCLUSION

The aim of this study was to address the gap identified in the literature by contributing insight into the issue of IPD specific to the OSCE environment while accounting for its inherent examiner scoring error.

To account for the examiner scoring error source conveniently and directly within the model, the study adopted an MFRM approach. Specifically, 12 OSCE cases reused across seven high‐stakes exam administrations were submitted to a four‐facet MFRM model, and the generated results were then further investigated using traditional statistical methods to answer three primary questions: the presence of IPD after accounting for examiner scoring effects, the impact of detected IPD on examinee performance scores and, lastly, the contribution of any score criterion bias to the detected IPD.

Results revealed that, for each case, the data achieved sufficient fit to the model at both the overall global and the individual facet level, providing evidence for the suitability of analysing the OSCE data using the MFRM model. The separation statistics calculated for Facet 3: Examiners revealed significant variability across examiners for all 12 OSCE case models, demonstrating at least four distinct and reliably dissimilar examiner scoring levels within each case. This contributes further evidence of this source of measurement error within OSCE settings and supports advocates for the inclusion of this facet when analysing OSCE data.

Despite accounting for examiner scoring effects, results revealed that all 12 cases displayed evidence of IPD, given the significant levels of variability or heterogeneity achieved across the measurement estimates for the seven administrations in which cases were reused. Levels of estimate variability, or IPD, ranged from low (two moderately reliable distinct groups) to highly dissimilar (up to six highly reliable distinct groups) across the seven administration estimates. Cases achieved an average absolute IPD across the seven administrations of 0.21 logits from the mean (0.00 logits), with Case 7 displaying the single greatest deviation of 0.58 logits above the mean during Admin 1 and Case 1 the single greatest deviation of −0.39 logits below the mean during Admin 5. Three of the cases, namely Cases 2, 3 and 5, achieved significant positive correlations across the seven administrations, displaying a trend of consistently decreasing difficulty, or increasing easiness, over time. These results highlight the need for further investigation into these specific cases and administrations. Overall, the findings are consistent with those of McKinley and Boulet, 15 who highlighted inevitable parameter drift in reused OSCE cases, and further indicate the need for ongoing investigation of IPD when employing reused OSCE cases. By using the MFRM model and including the examiner scoring error facet, the results generated for this study may more accurately reflect the extent of IPD in reused OSCE cases, a result that warrants further investigation. Secondly, and of essential value, the analysis and results offer a unique and convenient system for ongoing tracking and monitoring of the IPD issue at the individual estimate level along a common logit scale, allowing for comparisons both within and across cases as well as administrations. Such a system is vital to the exact pinpointing of problematic cases and administrations and directs necessary investigations, adjustments, and remediation. 11

The dominant concern behind the need for an ongoing IPD tracking and monitoring system is to ensure the validity of decisions and interpretations attached to test scores. 10 , 11 The results of this study revealed that although IPD was detected across all cases, the issue only appeared to impact examinee performance scores once a case reached an invariance violation of 0.58 logits from the mean, which occurred in one reused case and one administration (Case 7, Admin 1). This is somewhat more generous than Linacre's 21 recommended ‘big enough to be noticeable’ invariance violation level of 0.50 logits above or below the mean, but it offers a potential objective threshold for IPD at a level deserving concern and action. The extent to which this threshold holds stable over time warrants further investigation. Overall, the finding supports McKinley and Boulet's 15 conclusion that, for the most part, reused OSCE cases achieve sufficient stability to allow for reuse without compromising psychometric stability, given that in this study only 1.19% of the investigated IPD estimates (one case administration estimate out of a total of 84) showed variability at a level that compromised or impacted examinee scores.

Lastly, the results of this study revealed that the IPD in Case 7, Admin 1, namely the case with a concerning level of IPD given its significant impact on examinee scores, did not result from criterion scoring bias. A popular strategy to correct for the effects of parameter drift on examinee scores has been to remove the drifting item from the test. However, in tests with a low number of items, such as OSCEs, this can have a negative impact on overall test reliability. 11

Limitations of this study include the small sample size relative to those often recommended for Rasch‐based analyses; however, prior work suggests that infit statistics may be insensitive to sample size. 22 Additionally, the results of the current study were based on a model anchored to previously generated competency estimates to provide sufficient connectivity across the data. Ideally, this model would be applied in a performance‐based setting with sufficient linkages provided by examiners assessing a proportion of the same examinee performances on the same case, which would avoid the need for anchoring and its potential impact on results. The results are also limited to a single internationally educated nurse examination and a specific OSCE format; applying the model to a different population, exam purpose and performance‐based format would be vital for the generalizability of results.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

ACKNOWLEDGMENTS

The authors alone are responsible for the content and writing found within this article.

Coetzee K, Monteiro S, Amirthalingam L. A many‐facet Rasch measurement model approach to investigating objective structured clinical examination item parameter drift. J Eval Clin Pract. 2025;31:e14114. 10.1111/jep.14114

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

REFERENCES

  • 1. Baig LA, Violato C. Temporal stability of objective structured clinical exams: a longitudinal study employing item response theory. BMC Med Educ. 2012;12:121.
  • 2. Iramaneerat C, Yudkowsky R, Myford CM, Downing SM. Quality control of an OSCE using generalizability theory and many‐faceted Rasch measurement. Adv Health Sci Educ Theory Pract. 2008;13(4):479‐493.
  • 3. Iramaneerat C, Yudkowsky R. Rater errors in a clinical skills assessment of medical students. Eval Health Prof. 2007;30(3):266‐283.
  • 4. Tavakol M, Pinner G. Using the many‐facet Rasch model to analyse and evaluate the quality of objective structured clinical examination: a non‐experimental cross‐sectional design. BMJ Open. 2019;9(9):e029208.
  • 5. Bond TG, Yan Z, Heene M. Applying the Rasch model: fundamental measurement in the human sciences. 4th ed. Routledge; 2019.
  • 6. Gotzmann A, De Champlain A, Homayra F, et al. Cheating in OSCEs: the impact of simulated security breaches on OSCE performance. Teach Learn Med. 2017;29(1):52‐58.
  • 7. DeMars CE. Detection of item parameter drift over multiple test administrations. Appl Meas Educ. 2004;17(3):265‐300.
  • 8. Donoghue JR, Isham SP. A comparison of procedures to detect item parameter drift. Appl Psychol Meas. 1998;22(1):33‐51.
  • 9. De Champlain F. A primer on classical test theory and item response theory for assessments in medical education. Med Educ. 2010;44(1):109‐117.
  • 10. Han KT, Wells CS, Sireci SG. The impact of multidirectional item parameter drift on IRT scaling coefficients and proficiency estimates. Appl Meas Educ. 2012;25(2):97‐117.
  • 11. Lee H, Geisinger KF. Item parameter drift in context questionnaires from international large‐scale assessments. Int J Test. 2019;19(1):23‐51.
  • 12. Park YS, Lee YS, Xing K. Investigating the impact of item parameter drift for item response theory models with mixture distributions. Front Psychol. 2016;7:255. 10.3389/fpsyg.2016.00255
  • 13. Guo H, Robin F, Dorans N. Detecting item drift in large scale testing. J Educ Meas. 2017;54(3):265‐284.
  • 14. Lord FM. Applications of item response theory to practical testing problems. Routledge; 1980.
  • 15. McKinley DW, Boulet JR. Detecting score drift in a high‐stakes performance‐based assessment. Adv Health Sci Educ. 2004;9(1):29‐38.
  • 16. Myford CM, Wolfe EW. Detecting and measuring rater effects using many‐facet Rasch measurement: part I. J Appl Meas. 2003;4(4):386‐422.
  • 17. Eckes T. Many‐facet Rasch measurement. 2009. Accessed May 13, 2020. https://www.researchgate.net/publication/228465956_Manyfacet_Rasch_measurement/citation/download
  • 18. Linacre JM. Facets computer program for many‐facet Rasch measurement. Winsteps.com; 2020.
  • 19. College of Nurses of Ontario. Entry‐to‐practice competencies for registered nurses. College of Nurses of Ontario; 2020. https://www.cno.org/globalassets/docs/reg/41037-entry-to-practice-competencies-2020.pdf
  • 20. Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561‐573.
  • 21. Linacre JM. DIF‐DPF‐bias‐interactions concepts. 2020. www.winsteps.com/facetman/webpage.htm
  • 22. Smith AB, Rush R, Fallowfield LJ, Velikova G, Sharpe M. Rasch fit statistics and sample size considerations for polytomous data. BMC Med Res Methodol. 2008;8:33.
