Abstract
Despite numerous studies on the magnitude of differential item functioning (DIF), different DIF detection methods often define effect sizes inconsistently and fail to adequately account for testing conditions. To address these limitations, this study introduces the unified M-DIF model, which defines the magnitude of DIF as the difference in item difficulty parameters between the reference and focal groups. The M-DIF model can incorporate various DIF detection methods and test conditions to form a quantitative model. A pretrained approach was employed to leverage a sufficiently representative large sample as the training set and to ensure the model’s generalizability. Once the pretrained model is constructed, it can be applied directly to new data. Specifically, a training dataset comprising 144 combinations of test conditions and 144,000 potential DIF items, each described by 29 statistical metrics, was used, and the XGBoost method was adopted for modeling. Results show that, based on root mean square error (RMSE) and BIAS metrics, the M-DIF model outperforms the baseline model in both validation sets, under consistent and inconsistent test conditions. Across all 360 combinations of test conditions (144 consistent and 216 inconsistent with the training set), the M-DIF model demonstrates lower RMSE in 357 cases (99.2%), illustrating its robustness. Finally, we provide an empirical example to showcase the practical feasibility of implementing the M-DIF model.
Keywords: differential item functioning (DIF), magnitude, pretrained model, root mean square error (RMSE), bias
Introduction
Differential item functioning (DIF) refers to differences in the performance of a test item across diverse groups of examinees who are matched on the ability the test is intended to measure (Martinková et al., 2017). In addition to ensuring fairness and equity in testing, the detection of DIF can serve various purposes, such as addressing threats to internal validity, assessing the comparability of translated or adapted measures, understanding item response processes, and identifying instances where test results lack invariance (Zumbo, 2007). Therefore, detecting DIF is of utmost importance.
The exploration of DIF has persisted for over five decades, spanning from early pioneers (Angoff & Ford, 1971; Dorans & Kulick, 1986) to recent advancements (Bauer, 2023; Henninger et al., 2023; Hladká et al., 2024). The prevalent research paradigm for detecting DIF typically involves three main steps. DIF items are initially defined manually by researchers, for instance, by identifying items exhibiting a difficulty difference between focal and reference groups. Next, a simulated response dataset is generated under various conditions. Finally, one or multiple statistical methods for DIF detection are applied to compare the predicted outcomes with predefined true values (with or without DIF), assessing the accuracy of the detection method (Berrío et al., 2020). However, this research paradigm exhibits certain limitations.
One major concern is the inconsistency in defining DIF. Several studies defined DIF by comparing item difficulty parameter values between the focal and reference groups, while others analyzed disparities in response probability (Jin et al., 2018) or differences in the area under the item characteristic curve (Belzak & Bauer, 2020). In addition, even when the same indices are used, the specific criteria may vary. For instance, regarding the difficulty parameter threshold, some studies adhere to thresholds of 0.4 and 0.8 for medium and large DIF (Hladká et al., 2024; Jiang, 2019), while others opt for 0.3 and 0.5 (Lim et al., 2022; Zimbra, 2018). These variations in DIF definitions raise questions about the generalizability of findings.
Furthermore, relying solely on statistical significance tests for detecting DIF can lead to instability. Previous research usually begins by setting up the null hypothesis that an item is without DIF; however, altering the null hypothesis to “an item with DIF” may change the results (Wells et al., 2009). In addition, statistical outcomes can be sensitive to sample size variations, leading to an inflation of Type I errors (Berrío et al., 2019; Herrera & Gómez, 2008). It has also been argued that the conventional 95% significance level is inappropriate (Benjamin & Berger, 2019), and simulation experiments have demonstrated that this threshold cannot guarantee a consistent Type I error rate (Emmert-Streib & Dehmer, 2019). As a result, the statistical significance of DIF results may not provide valuable insights for practical test development or decision-making processes (Amrhein et al., 2019; Meade, 2010).
In contrast to the binary nature of statistical significance, it has been proposed that effect sizes serve as indicators for delineating the extent of divergence at the item and scale levels (Meade, 2010). Despite the various methods available for calculating effect sizes, it is essential to focus on their practical significance (Pek & Flora, 2018), and the Educational Testing Service (ETS) delta scale serves as a well-defined metric (Zwick, 2012). The ETS ABC classification system categorizes items into three groups based on the ETS delta scale: A (negligible or non-significant DIF), B (slight to moderate DIF), or C (moderate to large DIF). Items in category C are recommended for removal or modification, while those in category A require no action. For category B items, handling depends on how high-stakes the assessment is. This classification system offers practical guidelines for implementing item-handling procedures.
Unfortunately, the ETS delta scale still has shortcomings, primarily the lack of unified support for different DIF detection methods. While the ETS delta value is designed for the MH method (Zwick, 2012), various DIF detection methods employ diverse approaches to calculate effect sizes (French & Maller, 2007; Gómez-Benito et al., 2013; Suh, 2016). Although multiple IRT-based effect sizes have been compared, they cannot be standardized onto the same scale (Chalmers, 2023; Kleinman & Teresi, 2016). In addition, given that different approaches to detecting DIF can produce conflicting results (Karami & Salmani Nodoushan, 2011), it has been recommended to use multiple DIF detection methods together (Feinberg & Rubright, 2016).
Furthermore, the calculation of effect size is influenced by test conditions, such as the distribution of focal and reference groups (DeMars, 2011; Meade, 2010). Regrettably, the current methods for calculating DIF effect size are solely based on students’ response data and do not incorporate adjustments according to testing conditions.
To address the mentioned limitations, this study introduces a novel M-DIF model for estimating the magnitude of DIF. The M-DIF model can be represented as follows
$$Y_i = f(X_i) \qquad (1)$$

where $Y_i = \Delta b_i$ represents the difference in the item difficulty parameter $b$ of item $i$ between the reference and focal groups. This serves as a metric to describe the magnitude of DIF present between groups.

$X_i$ represents the predictor variables or model features, denoted as $\{x_i^{TC}, x_i^{(1)}, \ldots, x_i^{(M)}\}$. $x_i^{TC}$ encompasses factors related to test conditions, such as the total number of test items, the numbers of participants in the focal and reference groups, the ability distributions of these groups, and the estimated difficulty of item $i$. The variable $M$ denotes the number of DIF detection methods used, and $x_i^{(m)}$ denotes the statistics related to item $i$ obtained from applying the $m$-th DIF detection method, such as p values and effect sizes.
The central focus lies in constructing the expression $f(\cdot)$ in Equation (1). The fundamental approach entails generating representative training datasets through simulation, encompassing various test conditions. In these training data, both Y and X are known, enabling model construction using supervised machine learning methods. To ensure the representativeness of the training set, a large dataset is required, which implies significant time consumption, rendering it unrealistic for users to start from scratch. Hence, we adopt the pretrained model approach, in which models trained on sufficiently large-scale and representative datasets are provided directly to users (Han et al., 2021). Pretraining can effectively enhance a model’s robustness under uncertainty (Hendrycks et al., 2019) and is thus expected to support the applicability of the M-DIF pretrained model to different test conditions. While the pretrained model approach has been utilized to determine the number of factors in exploratory factor analysis (Goretzko & Bühner, 2020), its application in predicting DIF has not, to our knowledge, been observed.
The unified M-DIF pretrained model is expected to offer significant advantages. First, it can predict the magnitude of DIF as a continuous variable. This magnitude of DIF is defined as the difference in item difficulty parameter between the reference and focal groups for item i, rendering it easily interpretable. Second, the M-DIF pretrained model allows for the utilization of multiple DIF detection methods, which are integrated into a unified scale. Third, the M-DIF pretrained model is sufficiently robust to support its application under various test conditions, without requiring users to make specific selections. Once the M-DIF pretrained model is trained, users can directly apply it for estimating the magnitude of DIF in practice, without needing to retrain the model from the beginning.
The primary objective of this study is to establish and validate the accuracy and robustness of the M-DIF pretrained model.
Establishing the M-DIF Pretrained Model
Step 1: Constructing the Training Datasets
Test Condition Factors in Simulations
Sample Size: We employed three tiers of sample sizes (Small: n = 500, Medium: n = 1,000, and Large: n = 2,000), aligning with prior research (Ma et al., 2021). Sample Size Ratio: Two levels based on earlier studies (Jin et al., 2018): consistent (Ratio = 0.5), where the reference group constitutes 50% of the total sample, and inconsistent (Ratio = 0.8). Test Length: Test length was stipulated at three levels (Short: 20 items, Medium: 40 items, and Long: 60 items). Impact: The Impact factor had two levels: the presence of actual ability disparities (Impact = 1), with the focal group’s ability distribution as N(–1, 1) and the reference group’s as N(0, 1), or the absence of disparities (Impact = 0), with both groups’ distributions as N(0, 1). Proportion of Potential DIF: Two levels were established: mild (0.1 of items with DIF) and severe (0.4). The difficulty parameters for non-DIF items were identical across the focal and reference groups, serving as anchor items. Pattern of DIF: Two patterns, balanced DIF (i.e., DIF in both directions), where the DIF magnitudes $\Delta b_i$ were uniformly distributed as U[–1, 1], and unbalanced DIF (all DIF in one direction, against the focal group), where $\Delta b_i$ followed U[0, 1]. In summary, the six manipulated factors were crossed to create a comprehensive range of 144 test conditions. Simulated data generation was performed using the “irtoys” package (Partchev et al., 2022).
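For illustration, a minimal base-R sketch of how such a fully crossed design grid could be enumerated (the factor names are ours, not from the original scripts):

```r
# Hypothetical reconstruction of the simulation design grid (variable names are ours).
design <- expand.grid(
  sample_size = c(500, 1000, 2000),        # Small / Medium / Large
  ratio_ref   = c(0.5, 0.8),               # reference-group proportion of total sample
  test_length = c(20, 40, 60),             # Short / Medium / Long
  impact      = c(0, 1),                   # 0: both groups N(0, 1); 1: focal group N(-1, 1)
  prop_dif    = c(0.1, 0.4),               # mild / severe proportion of potential DIF items
  dif_pattern = c("balanced", "unbalanced")
)
nrow(design)  # 3 * 2 * 3 * 2 * 2 * 2 = 144 test conditions
```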
Generating Simulated Response Data Units
Because this study focuses exclusively on item difficulty, response data were generated using the one-parameter IRT model. The difficulty parameters for the reference group, $b_i^R$, followed a standard normal distribution, and the focal-group difficulty was set to $b_i^F = b_i^R + \Delta b_i$. All discrimination parameters were set to 1. A total of 100 repetitions were generated for each test condition, resulting in 14,400 individual data units.
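As a sketch of this data-generating step, the following minimal base-R version reproduces the 1PL logic for a single data unit (the variable names, the two-item DIF example, and the use of rbinom() rather than the “irtoys” simulation call are our assumptions):

```r
set.seed(1)
n_item <- 20; n_ref <- 500; n_foc <- 500
b_ref  <- rnorm(n_item)                      # reference-group difficulties ~ N(0, 1)
delta  <- c(rep(0, 18), runif(2, 0, 1))      # illustrative: last two items carry unbalanced DIF
b_foc  <- b_ref + delta                      # focal-group difficulty = b_ref + DIF magnitude
theta_ref <- rnorm(n_ref, 0, 1)              # impact = 0: both groups drawn from N(0, 1)
theta_foc <- rnorm(n_foc, 0, 1)

gen_resp <- function(theta, b) {
  p <- plogis(outer(theta, b, "-"))          # 1PL: P(X = 1) = logistic(theta - b)
  matrix(rbinom(length(p), 1, p), nrow = length(theta))
}
resp  <- rbind(gen_resp(theta_ref, b_ref), gen_resp(theta_foc, b_foc))
group <- c(rep(0, n_ref), rep(1, n_foc))     # 0 = reference, 1 = focal
```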
Step 2: Estimation of Model Features
The procedures in this step were conducted for each individual data unit. Model features were divided into two parts: factors related to test conditions $\{x^{TC}\}$ and statistics related to DIF detection methods $\{x^{(1)}, \ldots, x^{(M)}\}$. For further details, refer to Supplemental Table S1.
Features Related to Test Conditions $\{x^{TC}\}$
At the test level, a total of 10 variables were considered, including Test Length, Proportion of Potential DIF (the ratio of the number of items that need to be detected to the total number of items), Sample Size, the number of participants in both the reference and focal groups (n.ref, n.foc), and the ratio of the reference group size to the total sample size. Ability mean and variance for both the reference and focal groups (muR, sigmaR, muF, and sigmaF) were estimated under the one-parameter logistic model.
At the item level, each item’s difficulty values and their standard errors for both the reference and focal groups were estimated using the Empirical Bayes (EB) method, resulting in four variables (mRb, mRseb, mFb, and mFseb). Furthermore, we assessed the proportion of items for which mFb − mRb > 0 relative to the test length, serving as an indicator of the balance in the pattern of DIF. The estimation of abilities was conducted using the “ltm” package (Rizopoulos, 2007).
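A hedged sketch of how such features might be obtained with the “ltm” package, continuing from the simulated resp and group objects above (the coefficient and factor-score column names follow the package’s documented output but should be verified; the exact feature definitions follow Supplemental Table S1 and are our assumptions):

```r
library(ltm)

# Rasch (1PL) fits per group; coef() returns item difficulty ("Dffclt") and discrimination.
fit_ref <- rasch(resp[group == 0, ])
fit_foc <- rasch(resp[group == 1, ])
b_hat_ref <- coef(fit_ref)[, "Dffclt"]
b_hat_foc <- coef(fit_foc)[, "Dffclt"]

# Group ability distributions via empirical Bayes factor scores (column "z1" assumed).
th_ref <- factor.scores(fit_ref, resp.patterns = resp[group == 0, ])$score.dat$z1
th_foc <- factor.scores(fit_foc, resp.patterns = resp[group == 1, ])$score.dat$z1
muR <- mean(th_ref); sigmaR <- var(th_ref)
muF <- mean(th_foc); sigmaF <- var(th_foc)

# Indicator of DIF-pattern balance: proportion of items estimated harder for the focal group.
prop_foc_harder <- mean(b_hat_foc - b_hat_ref > 0)
```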
Features Related to DIF Detection Methods $\{x^{(1)}, \ldots, x^{(M)}\}$
This study employed a total of four common DIF detection methods, covering fundamental frameworks: classical test theory (CTT) and item response theory (IRT).
Mantel–Haenszel
The Mantel–Haenszel (MH) method is firmly established within educational testing institutions (Holland & Thayer, 1986). Five statistics were extracted: the MH chi-square statistic, MHP, varLambda, alphaMH, and deltaMH. The MH method is analogous to a chi-square test, and its statistic is given by the following formula (Martinková et al., 2017)
$$\chi^2_{MH} = \frac{\left(\left|\sum_k A_k - \sum_k E(A_k)\right| - 0.5\right)^2}{\sum_k \operatorname{Var}(A_k)}, \qquad (2)$$

with $E(A_k) = \dfrac{(A_k + B_k)(A_k + C_k)}{n_k}$ and $\operatorname{Var}(A_k) = \dfrac{(A_k + B_k)(C_k + D_k)(A_k + C_k)(B_k + D_k)}{n_k^2 (n_k - 1)}$.

In formula (2), $A_k$ and $B_k$ represent the counts of examinees in the reference group who achieved a score of $k$ and responded accurately or inaccurately to the item, the corresponding values for the focal group are denoted as $C_k$ and $D_k$, and $n_k = A_k + B_k + C_k + D_k$.
In addition, we extracted the corresponding p value (MHP) and the values of the variances of the log odds-ratio statistics (varLambda).
The effect size of the MH method is the common odds ratio of item responses between the reference and focal groups, denoted as $\alpha_{MH}$ (alphaMH). ETS transforms this value onto the delta scale, defined as deltaMH = −2.35 ln($\alpha_{MH}$) (Zwick, 2012).
Standardization Approach
The standardization approach is another frequently employed method (Dorans et al., 1992). Three indices were extracted: Std_P_DIF, alphaStd, and deltaStd. The primary index is defined as

$$\text{Std-}P\text{-DIF} = \frac{\sum_S w_S \left(P_{FS} - P_{RS}\right)}{\sum_S w_S},$$

where $w_S$ denotes the weighting factor implemented at the score level designated as $S$, and $P_{FS} - P_{RS}$ denotes the difference in the proportions of accurate responses between the focal group ($P_{FS}$) and the reference group ($P_{RS}$). The Std_P_DIF index ranges from −1 to +1, where positive values signify a preference for the focal group. Through transformation, standardized alpha (alphaStd) and deltaStd values can be computed as rough measures of effect size.
Logistic Regression
Three statistics were extracted: Logistik, LogisticP, and deltaR2. The formula for logistic regression is as follows (Magis et al., 2011; Rogers & Swaminathan, 1993)
$$\ln\!\left(\frac{\pi_g}{1 - \pi_g}\right) = \beta_0 + \beta_1 X + \beta_2 g + \beta_3 (X \times g), \qquad (3)$$

where $\pi_g$ is the probability of correctly answering the item in group $g$ and $X$ denotes the matching criterion (e.g., the total score on anchor items). The likelihood ratio test statistics (Logistik) and their corresponding p values (LogisticP) were obtained.
Nagelkerke’s $R^2$ (deltaR2) coefficient has been introduced to quantify the magnitude of parameter effects (Nagelkerke, 1991) and is expressed as

$$R^2 = \frac{1 - \left(L_0 / L_1\right)^{2/n}}{1 - L_0^{\,2/n}},$$

where $L_1$ and $L_0$ denote the likelihoods of the fitted and the null models, respectively.
Lord’s Chi-Squared Method
Within the framework of IRT, the primary objective of Lord’s chi-square method is to measure the difference in item difficulty parameters between two specific groups (Stocking & Lord, 1983). The method first estimates the difficulty parameters for both the focal and reference groups and then tests their difference using Lord’s chi-square. Accordingly, the chi-square statistic (LordChi) and its corresponding p value (Lordp) were extracted. Furthermore, the effect size for the Lord method is defined as the difference between the item difficulties of the reference group and the focal group.
In total, there are 29 model features {X} for each item. The computation of DIF was conducted using the “difR” package (Magis et al., 2010).
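A hedged sketch of extracting these statistics with the “difR” package, continuing from the simulated data above (the functions difMH, difStd, difLogistic, and difLord exist in difR; the exact component names used to assemble the 29 metrics are not shown and would follow the package’s output):

```r
library(difR)

dat <- as.data.frame(resp)      # item responses (examinees x items)
grp <- group                    # 0 = reference, 1 = focal

res_mh   <- difMH(Data = dat, group = grp, focal.name = 1)        # MH statistic, p value, alphaMH, ...
res_std  <- difStd(Data = dat, group = grp, focal.name = 1)       # Std_P_DIF, alphaStd, deltaStd
res_lr   <- difLogistic(Data = dat, group = grp, focal.name = 1)  # Logistik, LogisticP, deltaR2
res_lord <- difLord(Data = dat, group = grp, focal.name = 1,
                    model = "1PL")                                # LordChi, Lordp, item parameters

# The per-item statistics reported by each object are then combined with the
# test-condition features into the 29-column feature matrix {X}.
```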
Step 3: The Pretrained Modeling
In the training dataset, both Y and X were known. This allowed us to establish a pretrained model using supervised learning, $\hat{Y} = \hat{f}(X)$. Given the XGBoost algorithm’s strong predictive accuracy and its wide application across diverse domains (Chen & Guestrin, 2016; Sagi & Rokach, 2018), this study utilized XGBoost for modeling. XGBoost operates by iteratively adding weak learners (decision trees) and employs gradient boosting to minimize the loss function. In each iteration, it places greater emphasis on samples that were previously predicted poorly, thereby refining the ensemble and progressively improving its predictive performance. In our study, the modeling underwent 100 iterations using the “xgboost” package (Chen et al., 2019), with default settings applied to the other model hyperparameters.
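A minimal sketch of this supervised-learning step with the “xgboost” package (the feature matrix X, target y, and new-item matrix X_new are assumed to have been assembled as described above; all hyperparameters other than nrounds are left at their defaults, as in the article):

```r
library(xgboost)

# X: 144,000 x 29 numeric feature matrix; y: true difficulty differences (training set).
mdif_model <- xgboost(
  data      = as.matrix(X),
  label     = y,
  nrounds   = 100,                    # 100 boosting iterations, as in the article
  objective = "reg:squarederror",     # squared-error regression target
  verbose   = 0
)

# Applying the pretrained model to new items requires only their 29 features.
y_hat <- predict(mdif_model, as.matrix(X_new))
```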
Considering that the prediction errors may not strictly conform to a particular distribution, such as the normal distribution, we adopted an empirical ranking method for computing the confidence interval (CI), inspired by resampling methods (DiCiccio & Efron, 1996). During the model validation process, each item was treated as an individual sample. For every sample, the bias was computed (denoted as $bias_i$), resulting in a vector of biases, $\mathrm{BIAS} = \{bias_1, bias_2, \ldots, bias_n\}$, which was then sorted. For a 95% confidence level, the lower bound of the prediction interval was determined as ($\hat{Y}$ − the 97.5th percentile of BIAS), while the upper bound was ($\hat{Y}$ − the 2.5th percentile of BIAS).
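In R, the empirical interval can be written compactly; a sketch assuming bias is computed as predicted minus true on the validation items (object names are ours):

```r
# Validation-set biases: bias_i = predicted - true
bias <- y_hat_valid - y_valid

# 95% empirical prediction interval for a new point estimate y_hat_new
lower <- y_hat_new - quantile(bias, 0.975)
upper <- y_hat_new - quantile(bias, 0.025)
```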
Ultimately, this process resulted in an M-DIF pretrained model.
Application of Pretrained Model and Model Performance Validation
We applied the M-DIF pretrained model to analyze new data from the validation dataset and validate the model’s performance.
Validation Data Setting
The model validation process involved two rounds. In the first round, the validation dataset shared test conditions consistent with the training set. In the second round, additional test conditions were introduced to assess the generalizability of the pretrained model. Specifically, the validation dataset included new test conditions with different sample sizes, such as a value beyond the training set’s range (n = 100) (Rockoff, 2018) and values within the range but not identical to it (n = 600, 1,200) (Jin et al., 2018). Similarly, other settings such as Sample Size Ratio (Ratio = 0.5, 0.75) (Zimbra, 2018) and Test Length (10, 30, 50 items) (Jin et al., 2018; Liu et al., 2019) were manipulated. The Proportion of Potential DIF was set at 0.3 and 0.5 (Liu & Jane, 2022). The Impact factor was consistent with that of the training set. Notably, to examine the influence of the Pattern of DIF, we expanded the categories to include a “partially balanced” pattern, with $\Delta b_i$ following U[–0.8, 1]. In summary, the second round of validation encompassed a total of 216 combinations of inconsistent test conditions.
Baseline and Comparative Model Settings
Given that this study defines the magnitude of DIF as the difference in item difficulty parameter values between the focal and reference groups, the most direct way to estimate this difference is by directly estimating the relevant parameters for both groups separately and then comparing them after performing common scale linking. This method is one of the most straightforward ways to estimate item difficulty parameter differences and is therefore used as our baseline.
In addition to the baseline model, this study also introduced comparative models. These comparative models followed the same procedure as the M-DIF model, with the key difference being the model features. The comparative models were divided into two categories. The first category used only one type of statistical measure from a specific DIF detection method, such as model_MH, model_Std, and model_LR. For instance, the features for model_MH included only the statistics derived from the MH method, such as , MHP, varLambda, alphaMH, and deltaMH. This approach allowed the effect sizes from different DIF detection methods to be converted into the difference in item difficulty parameters between the focal and reference groups. The second category incorporated different model features. For example, the model (MH+TC) includes features derived from the MH method and test conditions (TC) (details can be found in Supplemental Table S1). We created a series of comparative models by combining one, two, or three non-IRT DIF detection methods with TC.
Metrics of the Model’s Performance
The model’s performance was assessed using root mean square error (RMSE) and Bias. RMSE was computed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2},$$

while Bias was determined as

$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right),$$

where n refers to the number of items with potential DIF, $Y_i$ stands for the actual difference in the difficulty parameter of item i, and $\hat{Y}_i$ is the value estimated by the M-DIF pretrained model.
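Expressed in R over the n items with potential DIF (y true, y_hat predicted; names are ours):

```r
rmse <- sqrt(mean((y_hat - y)^2))  # root mean square error
bias <- mean(y_hat - y)            # negative values indicate underestimation
```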
Result
Analysis of Validation Datasets Under Consistent Test Conditions
Overall Comparison of RMSE and BIAS
The paired t-test revealed a significant difference between the RMSE values of the pretrained model (M = 0.162, SD = 0.044) and the baseline (M = 0.216, SD = 0.064), with a mean difference of −0.054, t(143) = −13.946, p < .001, 95% CI [–0.062, −0.046]. Similarly, the paired t-test conducted on the BIAS results of the pretrained model (M = 0.000, SD = 0.009) and the baseline (M = −0.062, SD = 0.082) showed a significant difference, with a mean difference of 0.061, t(143) = 8.784, p < .001, 95% CI [0.048, 0.076]. The BIAS of the pretrained model was nearly zero, whereas the baseline tended to underestimate the difficulty gap between the focal and reference groups. As shown in Figure 1, the pretrained model demonstrated superior and more consistent performance on both the RMSE and BIAS metrics. For brevity, subsequent analyses focus solely on RMSE.
Figure 1.
Comparison of RMSE and BIAS Between Pretrained Model and Baseline
RMSE Comparison Across 144 Test Conditions Combinations
In 144 combinations of test conditions, we calculated the changes in RMSE compared with the baseline for the pretrained model. The analysis revealed that across all scenarios, the pretrained model consistently exhibited lower RMSE values, with the distribution summarized as follows: Min. = −0.154, First Qu. = −0.108, Median = −0.031, Mean = −0.054, Third Qu. = −0.019, Max. = −0.000.
RMSE Comparison in Each Test Condition
As depicted in Figure 2, the RMSE of the pretrained model tends to be lower across various test conditions. The trends in change between the pretrained model and the baseline remain largely consistent across test conditions, except for the “pattern of DIF.” Specifically, when the “pattern of DIF” is balanced, although the RMSE of the pretrained model remains lower than that of the baseline, the decrease is less pronounced compared with when the “pattern of DIF” is unbalanced.
Figure 2.
RMSE Comparison in Each Test Condition
Analysis of Validation Datasets Under Inconsistent Test Conditions
Overall Comparison of RMSE and BIAS
The paired t-test revealed a significant difference between the RMSE values of the pretrained model (M = 0.248, SD = 0.106) and the baseline (M = 0.342, SD = 0.155) with a mean difference of −0.095, t(215) = −18.421, p < .001, 95% CI [–0.105, −0.084]. Similarly, the paired t-test conducted on the BIAS results of the pretrained model (M = −0.027, SD = 0.050) and baseline (M = −0.104, SD = 0.087) showed a significant difference. The mean difference was 0.078, t(215) = 14.688, p < .001, 95% CI [0.067, 0.088]. As illustrated in Supplemental Figure S1, both the baseline and pretrained model exhibited underestimation, yet the pretrained model’s mean was closer to 0.
RMSE Comparison Across 216 Test Conditions Combinations
In 216 combinations of test conditions, a total of 213 cases (98.6%) demonstrated lower RMSE values for the pretrained model. Among the three instances where the RMSE of the pretrained model was higher, all occurred with a test length of 10. Specifically, two cases had a “pattern of DIF” classified as balanced, while one had a partially balanced pattern. The distribution can be summarized as follows: Min. = −0.291, First Qu. = −0.154, Median = −0.079, Mean = −0.095, Third Qu. = −0.028, Max. = 0.009.
RMSE Comparison in Each Test Condition
As shown in Supplemental Figure S2, the pretrained model consistently exhibited lower RMSE across all test conditions. Specifically, when the “pattern of DIF” was unbalanced, the decrease in RMSE compared with the baseline was more pronounced. For partially balanced patterns, the decrease was less pronounced, and for balanced patterns, it was the smallest.
Comparative Models Performance
As shown in Table 1, model_full consistently performs best overall, regardless of whether the test conditions are consistent or inconsistent. Except for model_LR, all pretrained models outperform the baseline. Interestingly, even model_MH and model_Std, which rely on non-IRT methods, achieved more accurate parameter recovery than the direct IRT estimates. Generally, as the number of model features increases, RMSE decreases. The most important features are those from MH and TC. Compared with model (MH+TC), the further reduction in RMSE for model (MH+LR+TC), model (MH+Std+TC), and model (MH+Std+LR+TC) is minimal.
Table 1.
Model Performance (RMSE) Across Different Models
Model | Consistent test conditions | Inconsistent test conditions |
---|---|---|
Baseline | 0.238 | 0.379 |
model_TC | 0.176 | 0.285 |
model_MH | 0.180 | 0.285 |
model_Std | 0.189 | 0.322 |
model_LR | 0.460 | 0.444 |
model (MH+TC) | 0.171 | 0.262 |
model (Std+TC) | 0.174 | 0.283 |
model (LR+TC) | 0.175 | 0.275 |
model (MH+LR+TC) | 0.171 | 0.262 |
model (MH+Std+TC) | 0.171 | 0.264 |
model (Std+LR+TC) | 0.172 | 0.264 |
model (MH+Std+LR+TC) | 0.171 | 0.262 |
model_full | 0.167 | 0.261 |
Empirical Data Application of M-DIF Pretrained Model
The analysis of real data utilized a portion of the Chinese Vocabulary Test (CVT) (Huang & Ishii, 2022). The real data consisted of 26 items, among which items 11 to 26 were from an item bank and had been previously validated for acceptable levels of DIF. These items served as anchor items. In contrast, items 1 to 10 were newly developed and targeted for DIF detection. Male students were the reference group, while female students were the focal group. Initially, we derived 29 statistical metrics {X} from the empirical data. Then, we inputted these {X} into the M-DIF model to predict Y.
Since the true magnitude of DIF in empirical data is unknown and cannot be directly compared with predicted values, we used a reliability-like metric to validate the effectiveness of M-DIF. Specifically, in step 1, we randomly divided the empirical data into two groups and calculated the effect sizes from M-DIF and traditional methods for each subgroup, resulting in MDIF1, MDIF2, alphaMH1, alphaMH2, and so on. In step 2, we computed the correlations between the effect sizes, obtaining MDIF_corr and alphaMH_corr, and so on. We repeated steps 1 and 2 a hundred times, averaging the results and calculating the standard deviation. As shown in Supplemental Table S1, the results derived from M-DIF are more reliable, indicated by the highest mean correlation (.536), and more stable, demonstrated by the smallest standard deviation (0.028).
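A sketch of this split-half, reliability-like check (the hypothetical function get_effect_sizes() stands in for the full feature-extraction and prediction pipeline, and resp_cvt and group_cvt denote the CVT responses and grouping; the splitting and correlation logic follows the description above):

```r
set.seed(2024)
split_half_corr <- replicate(100, {
  idx   <- sample(nrow(resp_cvt), nrow(resp_cvt) / 2)     # random half of examinees
  half1 <- resp_cvt[idx, ];  half2 <- resp_cvt[-idx, ]
  es1   <- get_effect_sizes(half1, group_cvt[idx])        # hypothetical: returns MDIF, alphaMH, ... per item
  es2   <- get_effect_sizes(half2, group_cvt[-idx])
  c(MDIF    = cor(es1$MDIF,    es2$MDIF),
    alphaMH = cor(es1$alphaMH, es2$alphaMH))
})
rowMeans(split_half_corr)                                  # mean correlation over 100 splits
apply(split_half_corr, 1, sd)                              # stability (standard deviation)
```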
Users could set two parameters as supplementary references, although their settings did not affect the M-DIF point predictions. The first parameter is the CI, with this study using a 95% confidence level as an example. As mentioned earlier, because the pretrained model slightly underestimated the DIF magnitude, the intervals needed to be adjusted accordingly; this adjustment is reflected in CIs that tend to be shifted toward the right of the point estimate rather than centered on it. The second parameter involves setting the cut points for DIF magnitude, which this study set at 0.4 and 0.8, separating small, moderate, and large levels of DIF.
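Classifying the predicted magnitudes by the chosen cut points is then a one-liner; a sketch using 0.4 and 0.8 as in this example, applied to absolute predicted values (the use of absolute values and the object name y_hat_new are our assumptions):

```r
dif_level <- cut(abs(y_hat_new),
                 breaks = c(0, 0.4, 0.8, Inf),
                 labels = c("small", "moderate", "large"),
                 include.lowest = TRUE)
```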
As illustrated in Figure 3, only the CI of Item 08 covered zero. Other items exhibited varying degrees of DIF, with only Item 06 showing moderate DIF. However, if it is a high-stakes exam, lowering the cut point criterion to 0.3 would enable experts to further evaluate Items 07 and 10. Nevertheless, M-DIF provides relatively objective and robust results, allowing users to make decisions based on their specific circumstances and choose appropriate standards accordingly.
Figure 3.
DIF Prediction Results Based on M-DIF Pretrained Model Using Real Data
Discussion and Conclusion
The M-DIF method offers two key advantages over traditional DIF detection methods. First, irrespective of how an effect size is originally defined, the M-DIF method allows for a uniform conversion to the difference in item difficulty between the reference and focal groups. Second, the M-DIF method provides more precise estimates. Regardless of whether RMSE or Bias was used as the performance metric, and whether the test conditions were consistent or inconsistent with the training set, the M-DIF pretrained model consistently outperformed the baseline. Across all 360 combinations of test conditions (144 consistent with the training set, 216 inconsistent), M-DIF exhibited lower RMSE in 357 cases (99.2%), showcasing its robustness. In the remaining three cases, all with a test length of 10, the limited number of items likely hindered the accuracy of the estimated basic statistics {X}, thereby decreasing the performance of the pretrained model. Overall, the M-DIF pretrained model demonstrates accuracy and wide applicability. While alphaMH (89.2%) primarily influences M-DIF predictions, the other statistical metrics and factors related to test conditions collectively account for the remaining 10.8%, contributing to enhanced prediction accuracy.
The well-trained M-DIF model streamlines the process for users, sparing them the need to start from scratch and ensuring accessibility for all researchers. Simply establishing CIs and defining cut points for DIF magnitude is all that is required to replicate results akin to those shown in Figure 3. Researchers can then focus on interpreting the magnitude of DIF effects according to their research objectives and the level of risk associated with the test, enabling them to make informed decisions.
Furthermore, the empirical data used in this study are particularly suitable for testing scenarios involving a mature testing program. Such a program typically has an established item bank while continuously developing and checking new items for potential DIF. If the entire test comprises newly developed items, a “purification” process can be conducted. Initially, all items are treated as anchor items, and M-DIF predictions are made. Items that do not meet the pre-established criteria are removed from the anchor set, and the process is repeated.
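A conceptual sketch of this purification loop (predict_mdif() is a hypothetical wrapper around the pretrained model and the feature-extraction step; the 0.4 criterion and stopping rule mirror the verbal description):

```r
anchors <- seq_len(ncol(resp_new))          # start: treat every item as an anchor
repeat {
  # hypothetical wrapper: per-item M-DIF predictions given the current anchor set
  pred    <- predict_mdif(resp_new, group_new, anchor.items = anchors)
  flagged <- which(abs(pred) >= 0.4)        # items exceeding the pre-established criterion
  keep    <- setdiff(anchors, flagged)
  if (identical(keep, anchors)) break       # stop once the anchor set no longer changes
  anchors <- keep
}
```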
Beyond the specific M-DIF pretrained model, this study introduces a novel approach: establishing pretrained models for DIF prediction. Within this framework, researchers are encouraged to explore alternative versions of pretrained models. First, regarding modeling methods, our study exclusively employed XGBoost without thorough hyperparameter tuning; exploring alternative methods and conducting such tuning may further optimize performance. Because constructing datasets for modeling is time-consuming, we openly share three datasets: Traindata.csv (144,000 items), ConsistentValidation.csv (144,000 items under consistent test conditions), and UnConsistentValidation.csv (258,600 items under inconsistent test conditions). Second, the current pretrained model only uses statistical information from four DIF detection methods. As shown in Table 1, incorporating more model features could potentially enhance model performance; future studies could incorporate statistics from other DIF detection methods. Third, the current pretrained model only handles the 1PL IRT model for analyzing two groups. Future research could adopt a similar methodology to extend the pretrained model to more complex IRT models and multiple groups.
Supplemental Material
Supplemental material, sj-docx-1-epm-10.1177_00131644241279882 for Enhancing Precision in Predicting Magnitude of Differential Item Functioning: An M-DIF Pretrained Model Approach by Shan Huang and Hidetoki Ishii in Educational and Psychological Measurement
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by JST SPRING, Grant Number JPMJSP2125. The authors would like to take this opportunity to thank the “THERS Make New Standards Program for the Next Generation Researchers.”
ORCID iD: Shan Huang
https://orcid.org/0009-0008-8779-155X
Supplemental Material: Supplemental material for this article is available online.
References
- Amrhein V., Greenland S., McShane B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305–307.
- Angoff W. H., Ford S. F. (1971). Item-race interaction on a test of scholastic aptitude. ETS Research Bulletin Series, 1971(2), i–24.
- Bauer D. J. (2023). Enhancing measurement validity in diverse populations: Modern approaches to evaluating differential item functioning. The British Journal of Mathematical and Statistical Psychology, 76, 435–461.
- Belzak W., Bauer D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods, 25(6), 673–690.
- Benjamin D. J., Berger J. O. (2019). Three recommendations for improving the use of p-values. The American Statistician, 73(Suppl. 1), 186–191.
- Berrío Á. I., Herrera A. N., Gómez-Benito J. (2019). Effect of sample size ratio and model misfit when using the difficulty parameter differences procedure to detect DIF. The Journal of Experimental Education, 87(3), 367–383.
- Berrío Á. I., Gómez-Benito J., Arias E. M. (2020). Developments and trends in research on methods of detecting differential item functioning. Educational Research Review, 31, Article 100340.
- Chalmers R. P. (2023). A unified comparison of IRT-based effect sizes for DIF investigations. Journal of Educational Measurement, 60(2), 318–350.
- Chen T., Guestrin C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. https://dl.acm.org/doi/10.1145/2939672.2939785
- Chen T., He T., Benesty M., Khotilovich V., Tang Y., Cho H., Chen K., Mitchell R., Cano I., Zhou T., Li M., Xie J., Geng Y., Tuan J. (2019). Package “xgboost” (R version 90(1-66), 40). R Core Team.
- DeMars C. E. (2011). An analytic comparison of effect sizes for differential item functioning. Applied Measurement in Education, 24(3), 189–209.
- DiCiccio T. J., Efron B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189–228.
- Dorans N. J., Schmitt A. P., Bleistein C. A. (1992). The standardization approach to assessing comprehensive differential item functioning. Journal of Educational Measurement, 29(4), 309–319.
- Dorans N. J., Kulick E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.
- Emmert-Streib F., Dehmer M. (2019). Understanding statistical hypothesis testing: The logic of statistical inference. Machine Learning and Knowledge Extraction, 1(3), 945–962.
- Feinberg R. A., Rubright J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement: Issues and Practice, 35(2), 36–49.
- French B. F., Maller S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67(3), 373–393.
- Gómez-Benito J., Hidalgo M. D., Zumbo B. D. (2013). Effectiveness of combining statistical tests and effect sizes when using logistic discriminant function regression to detect differential item functioning for polytomous items. Educational and Psychological Measurement, 73(5), 875–897.
- Goretzko D., Bühner M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods, 25(6), 776–786.
- Han X., Zhang Z., Ding N., Gu Y., Liu X., Huo Y., Qiu J., Yao Y., Zhang A., Zhang L., Han W., Huang M., Jin Q., Lan Y., Liu Y., Liu Z., Lu Z., Qiu X., Song R., . . . Zhu J. (2021). Pre-trained models: Past, present and future. AI Open, 2, 225–250.
- Hendrycks D., Lee K., Mazeika M. (2019). Using pre-training can improve model robustness and uncertainty. In International conference on machine learning (PMLR). https://arxiv.org/abs/1901.09960
- Henninger M., Debelak R., Strobl C. (2023). A new stopping criterion for Rasch trees based on the Mantel–Haenszel effect size measure for differential item functioning. Educational and Psychological Measurement, 83(1), 181–212.
- Herrera A., Gómez J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel–Haenszel and logistic regression techniques. Quality & Quantity, 42(6), 739–755.
- Hladká A., Martinková P., Magis D. (2024). Combining item purification and multiple comparison adjustment methods in detection of differential item functioning. Multivariate Behavioral Research, 59, 46–61.
- Holland P. W., Thayer D. T. (1986). Differential item functioning and the Mantel-Haenszel procedure. ETS Research Report Series, 1986(2), i–24.
- Huang S., Ishii H. (2022). Construction of Chinese vocabulary test for learners across multiple backgrounds. In The 13th conference of Asia-Pacific consortium of teaching Chinese as an international language (pp. 356–367). Hanoi National University Press. (In Chinese)
- Jiang J. (2019). Regularization methods for detecting differential item functioning. Boston College, Lynch School of Education.
- Jin K.-Y., Chen H.-F., Wang W.-C. (2018). Using odds ratios to detect differential item functioning. Applied Psychological Measurement, 42(8), 613–629.
- Karami H., Salmani Nodoushan M. A. (2011). Differential item functioning (DIF): Current problems and future directions. Online Submission, 5(3), 133–142.
- Kleinman M., Teresi J. A. (2016). Differential item functioning magnitude and impact measures from item response theory models. Psychological Test and Assessment Modeling, 58(1), 79–98.
- Lim H., Choe E. M., Han K. T. (2022). A residual-based differential item functioning detection framework in item response theory. Journal of Educational Measurement, 59, 80–104.
- Liu X., Jane R. H. (2022). Treatments of differential item functioning: A comparison of four methods. Educational and Psychological Measurement, 82(2), 225–253.
- Liu Y., Yin H., Xin T., Shao L., Yuan L. (2019). A comparison of differential item functioning detection methods in cognitive diagnostic models. Frontiers in Psychology, 10, Article 1137.
- Ma W., Terzi R., de la Torre J. (2021). Detecting differential item functioning using multiple-group cognitive diagnosis models. Applied Psychological Measurement, 45(1), 37–53.
- Magis D., Raîche G., Béland S., Gérard P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11(4), 365–386.
- Magis D., Béland S., Tuerlinckx F., De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862.
- Martinková P., Drabinová A., Liaw Y.-L., Sanders E. A., McFarland J. L., Price R. M. (2017). Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE: Life Sciences Education, 16(2), Article rm2.
- Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743.
- Nagelkerke N. J. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691–692.
- Partchev I., Maris G., Hattori T. (2022). Package irtoys: A collection of functions related to item response theory (IRT). https://cran.r-project.org/web/packages/irtoys/irtoys.pdf
- Pek J., Flora D. B. (2018). Reporting effect sizes in original psychological research: A discussion and tutorial. Psychological Methods, 23(2), 208–225.
- Rizopoulos D. (2007). ltm: An R package for latent variable modeling and item response analysis. Journal of Statistical Software, 17, 1–25.
- Rockoff D. (2018). A randomization test for the detection of differential item functioning. The University of Arizona.
- Rogers H. J., Swaminathan H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105–116.
- Sagi O., Rokach L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), Article e1249.
- Stocking M. L., Lord F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
- Suh Y. (2016). Effect size measures for differential item functioning in a multidimensional IRT model. Journal of Educational Measurement, 53(4), 403–430.
- Wells C. S., Cohen A. S., Patton J. (2009). A range-null hypothesis approach for testing DIF under the Rasch model. International Journal of Testing, 9(4), 310–332.
- Zimbra D. J. (2018). An examination of the MIMIC method for detecting DIF and comparison to the IRT likelihood ratio and Wald tests. University of Hawai’i at Mānoa.
- Zumbo B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223–233.
- Zwick R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. ETS Research Report Series, 2012(1), i–30.