Abstract
A study was conducted to implement a standardized effect size and corresponding classification guidelines for polytomous data with the POLYSIBTEST procedure and to compare those guidelines with prior recommendations. Two simulation studies were included. The first identifies new unstandardized test heuristics for classifying moderate and large differential item functioning (DIF) in polytomous response data with three to seven response options. These are provided for researchers studying polytomous data with previously published POLYSIBTEST software. The second simulation study provides one pair of standardized effect size heuristics that can be employed with items having any number of response options and compares true-positive and false-positive rates for the standardized effect size proposed by Weese with one proposed by Zwick et al. and two unstandardized classification procedures (Gierl; Golia). All four procedures kept false-positive rates generally below the level of significance at both moderate and large DIF levels. However, Weese’s standardized effect size was not affected by sample size and provided slightly higher true-positive rates than the Zwick et al. and Golia recommendations, while flagging substantially fewer items that might be characterized as having negligible DIF than Gierl’s suggested criterion. The proposed effect size allows for easier use and interpretation by practitioners, as it can be applied to items with any number of response options and is interpreted as a difference in standard deviation units.
Keywords: POLYSIBTEST, standardized effect size, DIF, polytomous data, differential item functioning
This article presents the results from two simulation studies investigating the effectiveness of a standardized effect size for items with varying numbers of response options using the POLYSIBTEST procedure (Chang et al., 1996; Shealy & Stout, 1993). POLYSIBTEST is a procedure that has been used for over two decades to investigate differential item functioning (DIF) in items or groups of items having polytomous response formats (e.g., Likert-type items). DIF occurs when individuals from different subgroups who have the same ability level have different probabilities of answering an item correctly (Hambleton & Swaminathan, 1985). There have been few attempts to classify the magnitude of DIF in polytomous items since the inception of the POLYSIBTEST procedure (e.g., Gierl, 2005; Golia, 2012; Zwick et al., 1997). These include using an unstandardized effect size (Gierl, 2005) that has been applied to different numbers of response options, rescaling the unstandardized effect size using the possible response range of the item (Golia, 2012), or rescaling the unstandardized effect size using a standard deviation that includes all examinees (Zwick et al., 1997). Recently, a standardized effect size for the SIBTEST procedure, the dichotomous predecessor of POLYSIBTEST, was developed (Weese, 2020). This standardized effect size accounts for both the variability in participant responses and the exclusion of examinees with insufficient subgroup sample sizes in its calculation; it also provides a natural extension for application to polytomous items. Because the effect size is standardized and the computations for SIBTEST and POLYSIBTEST are essentially identical, the heuristics extend naturally to the POLYSIBTEST procedure and polytomous data.
When developing achievement and psychological scales, it is important to have the ability to determine whether items display a type of bias or DIF against a specific subgroup to create scales that are equally valid for all participants within a population. The use of psychological scales is important in many different aspects of life (e.g., counseling, psychology, rehabilitation). Evaluating the validity of a construct being measured by a scale across subgroups is vital. For example, having a scale that overidentifies some subgroups for psychological or rehabilitation services and underidentifies others could be unethical (Gitchel et al., 2010). There are a plethora of studies evaluating the use of DIF detection procedures with dichotomous items; however, there are fewer studies evaluating the heuristics used for identifying DIF in polytomous items—similar to what would be used for psychological and attitudinal data—and specifically for the POLYSIBTEST procedure.
This study includes two investigations. The first study is the identification of new unstandardized effect size heuristics for researchers using POLYSIBTEST with polytomous items having three to seven response options. The unstandardized effect size heuristics are derived from the standardized effect size for data fitting a graded response model (GRM; Samejima, 1969). The results will provide guidelines for researchers using current POLYSIBTEST programs with polytomous items of varying response ranges. Furthermore, both the unstandardized and standardized effect size heuristics have been shown to display similar false-positive and true-positive rates for dichotomous items (Weese, 2020).
The second study is a comparison of the standardized effect size developed for SIBTEST with three other recommended applications for classifying moderate and large DIF for polytomous response data using POLYSIBTEST (Gierl, 2005; Golia, 2012; Weese, 2020; Zwick et al., 1997). This study provides the first evaluation of a standardized effect size created for use with POLYSIBTEST in comparison with previously suggested classification schemes for polytomous items using the POLYSIBTEST procedure. Furthermore, many DIF studies using polytomous data tend to simulate polytomous responses for the items being evaluated; however, they use dichotomous data for the items included on matching subtests (e.g., Chang et al., 1996; Penfield, 2007; Zwick et al., 1997), a combination that would be unlikely in empirical data. This study adds to the body of DIF research, similar to Bolt (2002) and French and Miller (1996), by investigating DIF outcomes when the complete set of items comprises polytomous data.
Background
DIF
Items on a test should minimize the variance that is attributed to factors other than the valid skills being measured (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014). These other factors are what contribute to DIF being present. When extraneous variance is minimized, there is greater confidence that the test is valid for all subgroups of individuals.
Due to the increased use of DIF analysis in a variety of fields (e.g., education, psychology, rehabilitation), the ability to detect whether polytomous items display DIF against a specific subgroup is important. Unlike dichotomous data, where DIF can be classified into uniform and nonuniform DIF, polytomous data can exhibit different types of DIF within individual items. The different types of DIF that can occur within polytomous items are constant, convergent, and divergent (Penfield et al., 2008). These three types of DIF can occur within each step of the item (pervasive) or within a subset of the steps of the item (nonpervasive). One reason behind the different types of DIF that can occur in polytomous items is that models for polytomous data are developed using step functions that allow for DIF that can occur within or across categories (for more information, see Penfield, 2014). Furthermore, the evaluation of DIF can be conducted using a net or a global DIF approach. A global approach analyzes “between-group differences specific to each score level” of an item, whereas the net approach is an analysis of the aggregated score-level differences (Penfield, 2010, p. 130). In this study, constant, pervasive DIF is investigated using a net DIF approach with the POLYSIBTEST procedure.
POLYSIBTEST
POLYSIBTEST is a DIF procedure that naturally extends the dichotomous SIBTEST procedure introduced by Shealy and Stout in 1993. POLYSIBTEST allows researchers and practitioners to assess whether items that have more than two response options (polytomous items) exhibit DIF (Chang et al., 1996). The POLYSIBTEST procedure calculates a weighted average difference between two groups, referred to as an unstandardized effect size ($\hat{\beta}_U$), and tests whether the difference is significantly different from zero (Chang et al., 1996; Shealy & Stout, 1993). The null hypothesis is tested using an a priori contrast (Chang et al., 1996; Shealy & Stout, 1993, p. 175):

$$\hat{\beta}_U = \sum_{k=0}^{K} \hat{p}_k \left( \bar{Y}^*_{Rk} - \bar{Y}^*_{Fk} \right) \quad (1)$$

where the proportion of all examinees at score category k is $\hat{p}_k$, the maximum total possible score on all matching subtest items is $K$, and the average bias-corrected score for group g examinees at score category k is $\bar{Y}^*_{gk}$ (g = R, F). Reference and focal groups are denoted by R and F, respectively. Positive scores favor the reference group and negative scores favor the focal group. Additional details for computing the bias-corrected average scores can be found in Shealy and Stout’s (1993, pp. 191–193) dichotomous work; however, the slope of the regression equation on observed scores was adapted from a dichotomous reliability coefficient (KR-20) to the more generalized coefficient alpha (Chang et al., 1996). A test statistic ($B \sim N(0,1)$) is used to evaluate statistical significance, with $\hat{\beta}_U$ as the numerator and the following standard error (SE) term (Bolt & Stout, 1996) as the denominator:

$$SE(\hat{\beta}_U) = \sqrt{\sum_{k=0}^{K} I(k)\, \hat{p}_k^{\,2} \left( \frac{\hat{\sigma}^2_{Rk}}{n_{Rk}} + \frac{\hat{\sigma}^2_{Fk}}{n_{Fk}} \right)} \quad (2)$$

where I(k) represents whether score category k should be included, I(k) = 1, or removed, I(k) = 0, in the calculation, and $\hat{p}_k$ is the proportion of all individuals at score category k. The variances at score category k for the reference and focal groups are represented by $\hat{\sigma}^2_{Rk}$ and $\hat{\sigma}^2_{Fk}$, respectively. Sample sizes for the reference and focal groups at score category k are denoted by $n_{Rk}$ and $n_{Fk}$, respectively.
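For readers who want to connect the formulas to code, the following base R sketch computes Equations 1 and 2. It is a simplified illustration rather than the published DIFSIB implementation: the function name is ours, and it assumes the bias-corrected suspect-item scores (`y_star`) have already been computed.

```r
# A minimal sketch of Equations 1 and 2, assuming `score_k` holds each
# examinee's matching-subtest score, `y_star` the (already bias-corrected)
# suspect-item score, and `group` the labels "R" and "F". The regression-based
# bias correction of Shealy and Stout (1993) is omitted for brevity.
beta_uni <- function(score_k, y_star, group, min_n = 2) {
  ks  <- sort(unique(score_k))
  p_k <- as.numeric(table(factor(score_k, levels = ks))) / length(score_k)
  stat_k <- function(k) {
    yR <- y_star[score_k == k & group == "R"]
    yF <- y_star[score_k == k & group == "F"]
    if (length(yR) < min_n || length(yF) < min_n)    # I(k) = 0: category excluded
      return(c(I = 0, d = 0, v = 0))
    c(I = 1,                                         # I(k) = 1: category included
      d = mean(yR) - mean(yF),                       # group difference at score k
      v = var(yR) / length(yR) + var(yF) / length(yF))
  }
  s    <- t(sapply(ks, stat_k))
  beta <- sum(s[, "I"] * p_k * s[, "d"])             # Equation 1
  se   <- sqrt(sum(s[, "I"] * p_k^2 * s[, "v"]))     # Equation 2
  B    <- beta / se
  c(beta_u = beta, B = B, p = 2 * pnorm(-abs(B)))    # B ~ N(0,1) under H0
}
```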
Studies have compared the power and type I error control of both the SIBTEST and POLYSIBTEST procedures with other DIF methods (e.g., Bolt & Stout, 1996; Chang et al., 1996; Jiang & Stout, 1998; Narayanan & Swaminathan, 1994). Bolt and Stout (1996) noted that POLYSIBTEST tends to display higher levels of power and controls type I error better than similar DIF methods (e.g., Mantel–Haenszel and standardized mean difference) under conditions such as unequal ability distributions, studied item discrimination parameters that depart from those of the matching subtest items, and varying sample sizes. These studies included comparisons that used only statistical significance and not a DIF effect size heuristic, as is recommended with empirical DIF studies. Roussos and Stout (1996) provided unstandardized DIF effect size heuristics for dichotomous items with SIBTEST that have been the most commonly used in research; however, Chang et al. (1996) did not provide suggested heuristics for polytomous items when using POLYSIBTEST.
Standardized Effect Size
The use of effect sizes for discussing magnitudes of difference between groups that are considered meaningful has been recommended by many researchers such as Thompson (1998), Kelley and Preacher (2012), and Steinberg and Thissen (2006). Standardized effect sizes are particularly useful in comparison with unstandardized effect sizes as the former take into account both differences in central tendency and variability in response, and they allow for comparisons across studies. A standardized effect size was developed that accounts for variability in responses as well as the inclusion (or exclusion) of individuals in the calculation of SIBTEST’s test statistic, based on the number of participants per score category (Weese, 2020). The implementation of the effect size to the POLYSIBTEST procedure includes the following steps:
1. Calculate the pooled standard deviation using the formula for an a priori contrast (Kirk, 1995):

$$\hat{\sigma}_{pooled} = \sqrt{\frac{\sum_{k} I(k)\left[(n_{Rk}-1)\,\hat{\sigma}^2_{Rk} + (n_{Fk}-1)\,\hat{\sigma}^2_{Fk}\right]}{\sum_{k} I(k)\left[(n_{Rk}-1) + (n_{Fk}-1)\right]}} \quad (3)$$

where $n_{gk}$ is the total number of individuals in score category k for reference or focal group g, and I(k) is a vector where a value of 1 indicates inclusion of score category k in the calculation and a value of 0 indicates exclusion of score category k.

2. Divide the weighted average difference by the pooled standard deviation to calculate the effect size:

$$\hat{\beta}^*_U = \frac{\hat{\beta}_U}{\hat{\sigma}_{pooled}} \quad (4)$$
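A minimal base R sketch of these two steps follows, assuming the per-category sample sizes, variances, and inclusion indicator from the POLYSIBTEST run are already available as vectors over score categories; the function name and argument layout are illustrative.

```r
# A sketch of Equations 3 and 4. `n_R`, `n_F`, `v_R`, `v_F` are per-category
# sample sizes and variances; `I_k` is the 0/1 inclusion indicator; `beta_u`
# is the unstandardized effect size from Equation 1.
standardize_beta <- function(beta_u, n_R, n_F, v_R, v_F, I_k) {
  num <- sum(I_k * ((n_R - 1) * v_R + (n_F - 1) * v_F))  # pooled sums of squares
  den <- sum(I_k * ((n_R - 1) + (n_F - 1)))              # pooled degrees of freedom
  sd_pooled <- sqrt(num / den)                           # Equation 3
  beta_u / sd_pooled                                     # Equation 4
}
```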
Classifying DIF With POLYSIBTEST
The classification criteria for DIF magnitudes are not widely agreed upon with the POLYSIBTEST procedure. There have been few attempts to classify the magnitude of DIF with the POLYSIBTEST procedure. Published recommendations include (a) scaling by the range of the item suspected of having DIF and utilizing the dichotomous heuristics of the standardized mean difference (SMD) procedure (Golia, 2012), (b) using a value of 0.100 to signify whether DIF is present (Gierl, 2005), and (c) calculating a standardized effect size using the pooled standard deviation of the suspect item (Zwick et al., 1997).
These three recommendations have their limitations. For example, Gierl (2005) suggested flagging a polytomous item for large DIF when the $\hat{\beta}_U$ value is equal to or greater than 0.100 for items with four response options. This suggestion has been used with polytomous items by other researchers (e.g., Gitchel et al., 2010, 2011; Walker & Göçer Şahin, 2020); however, the use of a single heuristic value that disregards the number of response options may not be appropriate. Golia (2012) takes this one step further and suggests a set of heuristics that account for the range of possible response options by dividing $\hat{\beta}_U$ by the exclusive range of the item:

$$\hat{\beta}_{U(\text{scaled})} = \frac{\hat{\beta}_U}{C - 1} \quad (5)$$
where C represents the number of categories for the suspect item. This reduction in item range, however, does not account for the actual variability of participants’ responses to the item. In other words, dividing by a range of four when there are five response options seems reasonable if all five response options are used in the sample. However, if most respondents only use three of the five response options, using the complete range as a control would negatively impact power.
Finally, the recommendation by Zwick et al. (1997) involves dividing by an item’s pooled standard deviation, but their calculation includes all respondents to the item. Including all respondents is not analogous to using only the respondents retained in the POLYSIBTEST DIF calculation. The variance suggested by Zwick et al. (1997) is

$$\hat{\sigma}^2_{i} = \frac{(n_{Ri} - 1)\,\hat{\sigma}^2_{iR} + (n_{Fi} - 1)\,\hat{\sigma}^2_{iF}}{n_{Ri} + n_{Fi} - 2} \quad (6)$$

where $\hat{\sigma}^2_{ig}$ represents the variance of item i for group g (g = R, F) and $n_{gi}$ is the total number of group g respondents to the item. In addition, Zwick et al. (1997) did not implement their effect size calculation with POLYSIBTEST directly in their report. Their 1997 analysis, along with further research by Hao (2014) and ACT (2020), was conducted by dividing the SMD by an item’s pooled standard deviation that used all respondents on the item. Zwick et al. (1997) alluded that using the POLYSIBTEST numerator would be a better measure than the SMD:

$$\hat{\beta}_{Z} = \frac{\hat{\beta}_U}{\sqrt{\hat{\sigma}^2_{i}}} \quad (7)$$
To our knowledge, that analysis has not been completed and/or published. In addition, in 2008, the National Assessment of Educational Progress (NAEP) reported that their DIF detection method was dividing SMD by the standard deviation of the suspect item and that they used a heuristic recommended by Zwick et al. (1997) for determining whether an item displays DIF. These attempts calculate a standardized effect size measure that is independent of the item scale, but they seem to include all examinees in the calculations (Hao, 2014; NAEP, 2008; Zwick et al., 1997). The inclusion of these examinees will have an impact on the standard deviation when using POLYSIBTEST.
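To make the competing denominators concrete, the following illustrative R function (the name and example values are ours) maps a single $\hat{\beta}_U$ value onto the four classification inputs compared in this study.

```r
# Illustrative comparison of the four classification inputs for one item:
# `sd_included` is the pooled SD over examinees retained by POLYSIBTEST
# (Weese, 2020); `sd_all` is the pooled SD over all respondents to the item
# (Zwick et al., 1997); `C` is the number of response categories.
classification_inputs <- function(beta_u, C, sd_included, sd_all) {
  c(gierl = beta_u,                # compared with the 0.100 criterion
    golia = beta_u / (C - 1),      # compared with 0.05 (moderate) / 0.100 (large)
    weese = beta_u / sd_included,  # compared with 0.164 / 0.241
    zwick = beta_u / sd_all)       # compared with 0.164 / 0.241 in this study
}
classification_inputs(beta_u = 0.20, C = 5, sd_included = 1.1, sd_all = 1.3)
```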
Study Goals
The goal of this study was to investigate the application of a standardized effect size to polytomous items using POLYSIBTEST. First, this study used a simulation to determine the unstandardized values for items with three through seven response categories associated with the standardized effect size heuristics developed for dichotomous data (Weese, 2020). These results can provide a guideline for researchers and practitioners using prior POLYSIBTEST packages with polytomous items of varying response formats. In the second simulation, false-positive and true-positive rates were compared among four polytomous DIF classification guidelines. These include an unstandardized effect size heuristic of 0.100 (Gierl, 2005), scaled heuristics of 0.05 and 0.100 (Golia, 2012), and standardized effect size heuristics of 0.164 and 0.241 for both Weese’s (2020) and Zwick et al.’s (1997) proposed effect size calculations. The results are provided as a resource for researchers who are evaluating polytomous items for DIF using POLYSIBTEST.
Method
Two simulation studies with 5,000 replications were conducted using R 3.6.3 (R Core Team, 2020) and customized R functions of POLYSIBTEST from the DIFSIB R package (Chang et al., 1996; Shealy & Stout, 1993; Weese, 2021). The first simulation used regression to identify the unstandardized effect size measures for polytomous items predicted by the aforementioned standardized effect size heuristics of 0.164 and 0.241 developed for identifying moderate and large dichotomous DIF items using POLYSIBTEST. The second study compared two ways of calculating a standardized effect size (Weese, 2020; Zwick et al., 1997) with two other suggested procedures for categorizing the magnitude of DIF in polytomous items (Gierl, 2005; Golia, 2012). False-positive and true-positive rates were compared across the four procedures for moderate and large DIF.
Data Simulation
Both studies utilized the same simulated data. Data were simulated from the GRM. The GRM calculates the probability that an examinee with ability $\theta$ will select category k as a response to item j with $m_j$ categories. The probability function is

$$P_{jk}(\theta) = P^*_{jk}(\theta) - P^*_{j(k+1)}(\theta) \quad (8)$$

where the probability of selecting category k or higher on item j is defined by

$$P^*_{jk}(\theta) = \frac{\exp\left[a_j\left(\theta - \tau_{jk}\right)\right]}{1 + \exp\left[a_j\left(\theta - \tau_{jk}\right)\right]} \quad (9)$$

such that the discrimination of item j is $a_j$ and the threshold for selecting category k or higher on item j is $\tau_{jk}$. Response strings were simulated in R using the simdata function within the mirt package (Chalmers, 2012).
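As an illustration of this step, the following sketch generates four-category GRM responses with mirt::simdata. The slope and threshold draws are illustrative stand-ins consistent with the Constants section below, and mirt parameterizes the graded model with intercepts d = −a·τ.

```r
# An illustrative GRM data simulation with mirt::simdata (Chalmers, 2012).
library(mirt)
set.seed(1)
N <- 1000; J <- 21                      # 20 matching items plus 1 suspect item
a <- matrix(rlnorm(J, log(1.5), 0.45))  # discriminations (illustrative draw)
tau <- cbind(runif(J, -2, -1),          # thresholds from noninclusive intervals
             runif(J, -1, 1),           # around cut-points on (-2, 2), as in
             runif(J, 1, 2))            # the Constants section
d <- -as.vector(a) * tau                # convert thresholds to mirt intercepts
theta <- matrix(rnorm(N))               # standard normal abilities
resp <- simdata(a = a, d = d, N = N, itemtype = "graded", Theta = theta)
```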
Manipulated Variables
Five variables were manipulated to create diverse data for studying the functionality of the standardized effect size and its corresponding DIF heuristics under varying conditions. The simulated conditions also allowed comparison of the standardized effect size with prior attempts to categorize DIF in polytomous items using $\hat{\beta}_U$ (Gierl, 2005; Golia, 2012) and with Zwick et al.’s (1997) suggested effect size calculation. The manipulated variables are described first, followed by the constants used in the simulations. A fully crossed simulation design involving three item discrimination parameter values, five magnitudes of between-group threshold differences, four sample sizes, and five response option formats resulted in 300 simulation conditions.
Response Options
The number of response categories included in the study ranged from three to seven. The entire test (suspect items and matching items) comprised items with the same number of response categories. This was done to evaluate the generalizability and effectiveness of $\hat{\beta}^*_U$ across a variety of commonly seen response formats (e.g., John et al., 1991; Rosenberg, 1965; Smith & Son, 2013; Yarkoni, 2010).
Item Parameters
Item parameters were manipulated for the suspect items being tested for DIF and included between-group differences in item threshold values and item discrimination values. Between-group differences in item thresholds (Δτ = 0, 0.2, 0.4, 0.6, 0.8) were applied to all thresholds in the GRM for the suspect item. Calculating the net DIF (Penfield, 2010) in terms of Raju’s (1988) signed area (SA) can be accomplished using Cohen et al.’s (1993) formula for polytomous items:

$$SA_j = \sum_{k=1}^{m_j - 1} \left( \tau^F_{jk} - \tau^R_{jk} \right) \quad (10)$$

where k indexes the thresholds and $m_j$ is the total number of categories of item j. Because constant DIF was investigated in this study, Equation 10 reduces to the following:

$$SA_j = \left( m_j - 1 \right) \Delta\tau \quad (11)$$

where $m_j$ is the number of response options for item j and Δτ is the common difference in threshold values between groups.
The between-group difference of Δτ = 0 was used to estimate false-positive DIF rates. In a prior study, a between-group difficulty difference of .30 was estimated to represent moderate DIF and a difference of .50 the minimum for large DIF in dichotomous items (Walker et al., 2011). Therefore, differences of .2, .4, .6, and .8 were selected for the purpose of comparing differences considered to be negligible (Δτ = 0.2; SA = 0.4–1.2 for items with three to seven categories, respectively), moderate (Δτ = 0.4; SA = 0.8–2.4), and large (Δτ = 0.6 and 0.8; SA = 1.2–3.6 and 1.6–4.8, respectively). To create the group differences on the suspect item, the difference in thresholds was split equally between the groups and applied to all thresholds. For example, for a four-category item whose middle threshold (k = 2) is 0, a between-group difference of 0.2 yields a reference group threshold of −0.1 and a focal group threshold of 0.1 (see the worked check below). Discrimination parameters for the suspect items were set at values of 1.0, 1.5, and 2.0 (Lautenschlager, Meade, & Kim, 2006).
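As a quick numerical check of the splitting scheme just described, the following base R lines reproduce Equations 10 and 11 for a four-category suspect item, using the Table 1 baseline thresholds and the Δτ = 0.2 condition.

```r
# Worked check of Equations 10 and 11: four-category item (m_j = 4),
# baseline thresholds (-0.6, 0, 0.6), delta_tau = 0.2 split equally.
tau_base  <- c(-0.6, 0, 0.6)
delta_tau <- 0.2
tau_R <- tau_base - delta_tau / 2   # reference group thresholds
tau_F <- tau_base + delta_tau / 2   # focal group thresholds
sum(tau_F - tau_R)                  # Equation 10: SA = 0.6
(4 - 1) * delta_tau                 # Equation 11: (m_j - 1) * delta_tau = 0.6
```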
Sample Sizes
The inclusion of equal (50–50 split) and unequal (75–25 split) sample sizes allowed for a comparison of both balanced and unbalanced designs (Clauser & Mazor, 1998; Keiffer, 2011; Narayanan & Swaminathan, 1994; Shealy & Stout, 1993). In addition to the equal and unequal sample size conditions, one large and one small total sample size condition were included. Because of the limited number of simulation studies in which the matching subtest is composed solely of polytomous items (e.g., Bolt, 2002; French & Miller, 1996; Woods, 2010), the large sample size for this study was determined by investigating an item with seven response options and the threshold for the lowest category set to an extreme value (e.g., τ₁ = −2.0). Shealy and Stout (1993) recommended a minimum of two respondents in any given score category for inclusion in the SIBTEST analysis. Because ability distributions were drawn from the standard normal distribution, approximately 2.3% of the sample is expected to score in the lowest category, which equates to approximately 88 respondents for an item. Multiplying the number of respondents by the number of items in the matching subtest (20 items) gives a focal group sample size of 1,760, to increase the likelihood of having at least two respondents in every score category. To construct a 75%–25% split for the unequal sample size condition, the reference group had a sample size of 5,280. The total sample size for the large sample condition, combining the focal and reference groups, was 7,040. Using this total, the equal sample size condition had 3,520 examinees in both the reference and focal groups. A smaller sample size that has been previously investigated (Woods, 2010) was also included to assess how sample size impacted $\hat{\beta}^*_U$ when compared with the other DIF heuristics. The smaller sample size included a total of 2,000 participants with equal (NR = NF = 1,000) and unequal (NR = 1,500, NF = 500) conditions.
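The sample size arithmetic can be verified directly in base R; the lines below simply restate the reasoning above.

```r
# Large-sample reasoning, verified numerically.
pnorm(-2)               # ~0.023: expected proportion in the lowest category
ceiling(2 / pnorm(-2))  # ~88 focal respondents needed per item
88 * 20                 # 1,760 focal examinees across the 20 matching items
1760 * 3                # 5,280 reference examinees for the 75-25 split
1760 + 5280             # 7,040 total for the large sample condition
```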
Constants
Test length was constrained to 21 items which included a matching subtest. The matching subtest was used as an estimate of examinees’ ability level and consisted of the first 20 items. Only one item was investigated for DIF in each simulated data set. Psychological scales consisting of approximately 20 items are common (e.g., Andresen et al., 1994; Beck et al., 1961; Radloff, 1977; Zung, 1971).
The latent ability distributions for both reference and focal groups were set to be equal and drawn from a standard normal distribution. The difference between adjacent thresholds for the suspect item was set to be 0.6. A difference of this magnitude between adjacent thresholds is generally seen in practice (Kim & Cohen, 1998).
Discrimination parameters for the matching subtest were drawn from a lognormal distribution with a mean of 1.50 and standard deviation of 0.45 (Lautenschlager et al., 2006). To avoid out-of-order thresholds, the threshold parameters for the matching subtest were generated from noninclusive uniform distributions around cut-points on the distribution scale from −2 to 2 (Jiang et al., 2016). For example, for an item with four categories, threshold parameters were drawn from the intervals (−2, −1), (−1, 1), and (1, 2). The discrimination parameters and threshold parameters were unique for each replication to account for potential sampling error (Walker et al., 2011). Table 1 provides a summary of the 300 combinations that investigate 60 conditions comparing five levels of between-group differences in threshold parameters (Δτ).
Table 1.
Summary of Simulation Conditions.
| Factor | Conditions |
|---|---|
| Sample size | |
| Equal (NR/NF) | 3,520/3,520 1,000/1,000 |
| Unequal (NR/NF) | 5,280/1,760 1,500/500 |
| Number of categories | 3 4 5 6 7 |
| Suspect item parameters | |
| Item discrimination (ai) | 1.0 1.5 2.0 |
| Threshold set | |
| Three-category | [−0.30, 0.30] |
| Four-category | [−0.60, 0.00, 0.60] |
| Five-category | [−0.90, −0.30, 0.30, 0.90] |
| Six-category | [−1.20, −0.60, 0.00, 0.60, 1.20] |
| Seven-category | [−1.50, −0.90, −0.30, 0.30, 0.90, 1.50] |
| Differences in item thresholds (Δτ) | 0 0.2 0.4 0.6 0.8 |
| Ability distributions | |
| Reference/Focal | N(0, 1)/N(0, 1) |
Analysis: Study 1
To determine the unstandardized $\hat{\beta}_U$ heuristics associated with the $\hat{\beta}^*_U$ effect sizes for varying numbers of response options, regression analyses were conducted for the five response option formats. The following steps were conducted for each of the three-, four-, five-, six-, and seven-category conditions (a brief sketch follows the list):

1. Calculate $\hat{\beta}_U$ and $\hat{\beta}^*_U$ for each replication;
2. Regress $\hat{\beta}_U$ on $\hat{\beta}^*_U$;
3. Using the results from Step 2, insert the moderate and large DIF heuristics (0.164 and 0.241) to determine the unstandardized heuristics for three to seven item response categories when analyzing data that fit a GRM.
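A minimal sketch of Steps 2 and 3 in base R, assuming `beta_u` and `beta_star` are vectors holding the replication-level unstandardized and standardized effect sizes for one response-format condition:

```r
# Regress the unstandardized effect size on the standardized effect size,
# then predict the unstandardized values at the 0.164 and 0.241 cutoffs.
fit <- lm(beta_u ~ beta_star)
predict(fit, newdata = data.frame(beta_star = c(0.164, 0.241)))
# returns the unstandardized values aligned with moderate and large DIF
```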
Results: Study 1
When comparing the relationship between $\hat{\beta}_U$ and $\hat{\beta}^*_U$ for three to seven item response categories, there is a strong linear relationship (Figure 1). In addition, the $\hat{\beta}_U$ values associated with the $\hat{\beta}^*_U$ heuristics increase as the number of response categories increases. There are substantial differences between the values for items with three response categories and those with seven response categories. Three lines are provided as references for classifying DIF in Figure 1.
Figure 1.
Relationship Between $\hat{\beta}_U$ and $\hat{\beta}^*_U$ for Items With Varying Response Categories.
$\hat{\beta}_U$ = 0.100 is the value Gierl (2005) recommended to classify a polytomous item with four response categories as having large DIF; $\hat{\beta}^*_U$ values of .164 and .241 are the standardized effect size heuristics that Weese (2020) recommended for classifying moderate and large DIF.
The results of regressing $\hat{\beta}_U$ on $\hat{\beta}^*_U$ for three-, four-, five-, six-, and seven-category items fitting a GRM indicated that $\hat{\beta}^*_U$ significantly predicted $\hat{\beta}_U$ in all regression models. In addition, the amount of variability in $\hat{\beta}_U$ explained by the models ranged from 99.2% for three-category data to 99.4% for seven-category data. All models had intercepts near 0 and positive slopes. The slope of the regression models increased as the number of response categories increased (see Figure 1). The slopes for the models ranged from 0.7988 for three-category data to 1.897 for seven-category data.
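As an illustrative check of this mapping (using the reported seven-category slope and treating the intercept as 0):

```r
# Multiplying the standardized cutoffs by the seven-category slope
# reproduces the Table 2 values to rounding.
0.164 * 1.897  # ~0.311 (moderate; Table 2 reports 0.310)
0.241 * 1.897  # ~0.457 (large; Table 2 reports 0.456)
```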
The $\hat{\beta}_U$ values associated with the moderate and large DIF heuristics were calculated (Table 2) by inputting the $\hat{\beta}^*_U$ heuristics of 0.164 and 0.241 into each of the regression equations. As seen in Figure 1, as the number of categories increases, so does the $\hat{\beta}_U$ value associated with the moderate and large DIF heuristics. When items have three categories, a $\hat{\beta}_U$ value of 0.130 is associated with a moderate DIF effect size, and a value of 0.192 aligns with a large DIF effect size. When items have seven categories, values of 0.310 and 0.456 are needed to classify DIF as moderate or large, respectively. Table 2 demonstrates that the use of only one set of criteria is likely inappropriate for classifying items with different numbers of response options as exhibiting DIF.
Table 2.
$\hat{\beta}_U$ Values Associated With $\hat{\beta}^*_U$ Heuristics.
| Number of categories | Moderate DIF | Large DIF |
|---|---|---|
| Three | 0.130 | 0.192 |
| Four | 0.184 | 0.271 |
| Five | 0.232 | 0.341 |
| Six | 0.274 | 0.402 |
| Seven | 0.310 | 0.456 |
Note. DIF = differential item functioning.
Analysis: Study 2
Study 2 compares true-positive and false-positive rates for selected DIF classification recommendations using unstandardized and standardized effect size measures in conjunction with hypothesis testing. Each replication was analyzed using the following DIF classification recommendations: (a) the unstandardized effect size criterion of 0.100 from Gierl (2005); (b) the scaled unstandardized effect size (scaled by dividing by the item range) criteria of 0.05 and 0.100 from Golia (2012); (c) the standardized effect size heuristics of 0.164 and 0.241 using the proposed standardized effect size (Weese, 2020); and (d) the same heuristics of 0.164 and 0.241 using the standardized effect size proposed by Zwick et al. (1997). Although the heuristics provided by Zwick et al. (1997) are smaller (.125 and .250), a technical manual by ACT (2020) recommended .17 and .25 for moderate and large DIF using the SMD divided by the pooled standard deviation. Because Zwick et al.’s (1997) recommendation was to investigate the use of the beta-uni numerator rather than the SMD for an effect size, we maintained the same criteria of .164 and .241 so that more equal comparisons could be made between the two standardized effect size calculations; use of .17 and .25 would otherwise have reduced the true-positive rate for the Zwick et al. comparison. In Gierl’s (2005) study, a single heuristic of .100 was recommended for the $\hat{\beta}_U$ values of the polytomous items being used in that study (four response options). He did not make recommendations for items with other response option ranges; however, this value has been adopted by researchers using items with other polytomous response formats. Note that the recommendation by Gierl (2005) did not include a heuristic for moderate DIF classification; therefore, Gierl’s recommendation was only implemented at the large DIF level.
Study 2 had two outcomes of interest. The first outcome was to evaluate the false-positive rates of the four selected DIF classification recommendations. A false-positive is determined at the moderate DIF level when an item that was simulated to not display DIF meets the following set of criteria: the magnitude of the heuristic is at least at the moderate DIF level and the hypothesis test is significant. Similarly, an item was classified as a false-positive at the large DIF level when the item was simulated to not display DIF, yet the magnitude of the heuristic is at least at the large DIF level and the hypothesis test is significant. The second outcome was to compare true-positive rates among the four DIF detection criteria. True-positive rates are also calculated for both moderate- and large-level DIF magnitudes. To be considered a true-positive, the item is simulated to display DIF and the same sets of criteria for a false-positive need to be met. A level of significance of .05 was used for all analyses.
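A minimal sketch of this joint decision rule in base R; the function name and example values are ours.

```r
# An item counts as moderate/large DIF only when the effect size meets the
# heuristic AND the POLYSIBTEST hypothesis test is significant. `es`, `p`,
# and `cutoff` are placeholders for one replication under a given criterion.
flag_dif <- function(es, p, cutoff, alpha = 0.05) {
  abs(es) >= cutoff & p < alpha
}
flag_dif(es = 0.27, p = 0.003, cutoff = 0.241)  # TRUE: flagged as large DIF
```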
Results: Study 2
The results for study 2 will be presented in two parts. The first part will present the comparison of false-positive rates, and the second part the comparison of true-positive rates. Because the four different criteria are compared throughout, each classification scheme will henceforth be referred to by the respective author’s name (e.g., Zwick et al.’s (1997) effect size calculation will be referred to as Zwick, Gierl’s (2005) as Gierl, and so forth).
False-Positive Comparison
False-positive rates (Table 3) for data simulated using a GRM varied by the number of response categories, the item type, and sample size. When items had three response options, all DIF criteria had false-positive rates below the level of significance at both moderate and large DIF levels regardless of condition. When the focal group had a sample size of 500, Golia’s criterion had the highest false-positive rate of 1.9%. At the large DIF level, all sets of DIF criteria, except for the recommendation made by Gierl, had false-positive rates near 0%.
Table 3.
False-Positive Rates for the POLYSIBTEST DIF Heuristics.
| Number of categories | Factor | Moderate DIF | Large DIF | |||||
|---|---|---|---|---|---|---|---|---|
| Zwick | Weese | Golia | Zwick | Weese | Golia | Gierl | ||
| Three | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
| 5,280/1,760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
| 1,000/1,000 | 0.000 | 0.013 | 0.520 | 0.000 | 0.000 | 0.000 | 0.520 | |
| 1,500/500 | 0.027 | 0.253 | 1.927 | 0.000 | 0.007 | 0.000 | 1.927 | |
| Item Type | ||||||||
| Low a | 0.010 | 0.040 | 0.845 | 0.000 | 0.000 | 0.000 | 0.845 | |
| Medium a | 0.005 | 0.060 | 0.600 | 0.000 | 0.005 | 0.000 | 0.600 | |
| High a | 0.005 | 0.100 | 0.390 | 0.000 | 0.000 | 0.000 | 0.390 | |
| Four | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.047 | |
| 5,280/1,760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.193 | |
| 1,000/1,000 | 0.007 | 0.033 | 0.340 | 0.000 | 0.000 | 0.000 | 4.533 | |
| 1,500/500 | 0.027 | 0.267 | 1.387 | 0.000 | 0.000 | 0.000 | 5.060 | |
| Item Type | ||||||||
| Low a | 0.010 | 0.075 | 0.560 | 0.000 | 0.000 | 0.000 | 2.570 | |
| Medium a | 0.010 | 0.080 | 0.520 | 0.000 | 0.000 | 0.000 | 2.490 | |
| High a | 0.005 | 0.070 | 0.215 | 0.000 | 0.000 | 0.000 | 2.315 | |
| Five | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.340 | |
| 5,280/1,760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.013 | |
| 1,000/1,000 | 0.007 | 0.047 | 0.187 | 0.000 | 0.000 | 0.000 | 4.700 | |
| 1,500/500 | 0.020 | 0.313 | 1.013 | 0.000 | 0.000 | 0.000 | 5.187 | |
| Item Type | ||||||||
| Low a | 0.010 | 0.110 | 0.490 | 0.000 | 0.000 | 0.000 | 2.880 | |
| Medium a | 0.010 | 0.080 | 0.240 | 0.000 | 0.000 | 0.000 | 2.925 | |
| High a | 0.000 | 0.080 | 0.170 | 0.000 | 0.000 | 0.000 | 2.625 | |
| Six | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.347 | |
| 5,280/1,760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 3.513 | |
| 1,000/1,000 | 0.000 | 0.040 | 0.087 | 0.000 | 0.000 | 0.000 | 4.907 | |
| 1,500/500 | 0.053 | 0.327 | 0.773 | 0.000 | 0.013 | 0.000 | 5.040 | |
| Item Type | ||||||||
| Low a | 0.030 | 0.085 | 0.305 | 0.000 | 0.010 | 0.000 | 4.070 | |
| Medium a | 0.005 | 0.110 | 0.235 | 0.000 | 0.000 | 0.000 | 3.580 | |
| High a | 0.005 | 0.080 | 0.105 | 0.000 | 0.000 | 0.000 | 3.455 | |
| Seven | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 2.907 | |
| 5,280/1,760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 4.980 | |
| 1,000/1,000 | 0.027 | 0.073 | 0.067 | 0.000 | 0.000 | 0.000 | 4.887 | |
| 1,500/500 | 0.060 | 0.360 | 0.553 | 0.000 | 0.013 | 0.000 | 5.133 | |
| Item Type | ||||||||
| Low a | 0.045 | 0.120 | 0.240 | 0.000 | 0.005 | 0.000 | 4.585 | |
| Medium a | 0.015 | 0.130 | 0.160 | 0.000 | 0.005 | 0.000 | 4.795 | |
| High a | 0.005 | 0.075 | 0.065 | 0.000 | 0.000 | 0.000 | 4.050 | |
Note. False-positive rates are represented as a percent. DIF = differential item functioning.
This overall trend for the four procedures held when the number of response options increased to four, five, six, and seven. The three criteria for Golia, Weese, and Zwick had false-positive rates far below the level of significance for moderate DIF classifications (between 0.0% and 1.39%) and near 0% for large DIF. The Gierl criterion had false-positive rates close to the level of significance (5.04%–5.19%) when the sample size of the focal group was 500, for all numbers of response categories greater than three. False-positive rates by item type were below the level of significance for all DIF criteria, regardless of the number of response categories. The false-positive rates for the Gierl criterion increased substantially as the number of response options increased, whereas rates for the Golia criteria appeared to decrease. Rates were fairly stable for the Zwick and Weese criteria as the number of response options increased, with a maximum false-positive rate of 0.36%.
True-Positive Comparison
The overall true-positive rates for the Golia, Weese, and Zwick DIF criteria were fairly stable at the moderate DIF level across all response category conditions (Figure 2). True-positive rates for the Weese and Zwick criteria at the moderate DIF level increased slightly as response options increased, whereas rates for the Golia criteria tended to decrease. The true-positive rates at the large DIF level were much lower than the moderate DIF outcomes, as expected. The same trends of slightly increasing rates for Weese and Zwick, and decreasing rates for Golia, occurred at the large DIF level. True-positive rates associated with the criterion recommended by Gierl (2005) at the large DIF level increased from approximately 76% for three response options to between 84% and 99% for seven response categories.
Figure 2.
Overall True-Positive Rates by Number of Response Categories.
Weese’s standardized effect size was consistently more powerful than the effect size proposed by Zwick et al. (1997) at both moderate and large DIF levels. At the moderate DIF level, the Golia (2012) criteria had true-positive rates that were consistently higher than both standardized effect sizes. However, the Golia true-positive rates decreased slightly as the number of response categories increased. At the large DIF level, Golia criteria had true-positive rates consistently lower than the Weese effect size and had higher true-positive rates than the Zwick effect size when items had fewer than six categories, and lower true-positive rates when items had six or seven categories.
Sample Size
When comparing across sample sizes (Table 4) at each of the number of response category levels, the true-positive rates for each of the four sets of DIF criteria were stable. This occurred at both moderate and large DIF levels. The standardized effect sizes had lower true-positive rates than the unstandardized effect sizes across all sample size conditions and all number of response category conditions. As the number of response categories increased, the true-positive rates for each sample size condition increased.
Table 4.
True-Positive Rates for the POLYSIBTEST DIF Heuristics.
| Number of categories | Factor | Moderate DIF | Large DIF | |||||
|---|---|---|---|---|---|---|---|---|
| Zwick | Weese | Golia | Zwick | Weese | Golia | Gierl | ||
| Three | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 56.61 | 66.67 | 75.76 | 30.38 | 44.25 | 39.76 | 75.76 | |
| 5,280/1,760 | 57.14 | 66.85 | 76.42 | 31.48 | 44.58 | 40.67 | 76.42 | |
| 1,000/1,000 | 56.26 | 65.68 | 76.28 | 30.35 | 43.73 | 40.04 | 76.28 | |
| 1,500/500 | 56.48 | 66.05 | 76.55 | 30.98 | 44.18 | 40.59 | 76.55 | |
| Item Type | ||||||||
| Low a | 66.08 | 70.86 | 81.02 | 44.39 | 49.91 | 50.84 | 81.02 | |
| Medium a | 56.33 | 66.65 | 76.38 | 30.40 | 44.86 | 41.35 | 76.38 | |
| High a | 47.45 | 61.43 | 71.35 | 17.60 | 37.79 | 28.60 | 71.35 | |
| Four | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 58.79 | 69.48 | 75.24 | 33.97 | 47.44 | 39.25 | 88.71 | |
| 5,280/1,760 | 58.95 | 69.23 | 75.60 | 34.10 | 47.54 | 39.63 | 88.88 | |
| 1,000/1,000 | 58.84 | 68.30 | 75.70 | 33.58 | 47.49 | 39.61 | 87.09 | |
| 1,500/500 | 59.03 | 68.30 | 76.03 | 34.42 | 47.67 | 40.07 | 83.23 | |
| Item Type | ||||||||
| Low a | 68.30 | 72.57 | 79.83 | 47.11 | 52.12 | 50.03 | 90.46 | |
| Medium a | 58.57 | 69.26 | 76.01 | 33.63 | 48.15 | 40.64 | 87.22 | |
| High a | 49.83 | 64.66 | 71.08 | 21.31 | 42.33 | 28.25 | 83.25 | |
| Five | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 60.43 | 71.34 | 75.13 | 36.18 | 49.52 | 38.39 | 95.83 | |
| 5,280/1,760 | 59.74 | 70.59 | 74.60 | 34.91 | 49.27 | 37.47 | 94.87 | |
| 1,000/1,000 | 60.05 | 69.97 | 75.40 | 35.48 | 49.54 | 37.90 | 88.51 | |
| 1,500/500 | 61.02 | 70.13 | 76.07 | 36.74 | 49.81 | 39.19 | 83.98 | |
| Item Type | ||||||||
| Low a | 69.54 | 73.60 | 79.23 | 48.35 | 53.42 | 48.56 | 92.99 | |
| Medium a | 60.16 | 70.91 | 75.65 | 35.77 | 49.91 | 38.87 | 91.19 | |
| High a | 51.24 | 67.01 | 71.02 | 23.36 | 45.28 | 27.28 | 88.21 | |
| Six | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 61.46 | 72.46 | 74.25 | 37.41 | 50.65 | 36.47 | 98.33 | |
| 5,280/1,760 | 61.30 | 71.96 | 74.22 | 37.53 | 50.87 | 36.53 | 97.79 | |
| 1,000/1,000 | 60.90 | 70.89 | 73.94 | 36.55 | 51.63 | 35.82 | 88.84 | |
| 1,500/500 | 62.07 | 71.26 | 75.01 | 38.50 | 51.90 | 37.61 | 84.46 | |
| Item Type | ||||||||
| Low a | 70.49 | 74.29 | 78.09 | 49.42 | 54.67 | 46.95 | 93.88 | |
| Medium a | 61.49 | 72.05 | 74.81 | 38.14 | 51.50 | 36.75 | 92.61 | |
| High a | 52.31 | 68.59 | 70.17 | 24.93 | 47.62 | 26.11 | 90.57 | |
| Seven | Sample Size (NR/NF) | |||||||
| 3,520/3,520 | 62.12 | 73.35 | 73.51 | 38.30 | 51.55 | 33.68 | 99.21 | |
| 5,280/1,760 | 61.95 | 72.78 | 73.02 | 37.98 | 51.98 | 33.39 | 98.61 | |
| 1,000/1,000 | 62.15 | 71.82 | 73.04 | 38.99 | 52.60 | 34.13 | 89.18 | |
| 1,500/500 | 62.20 | 71.89 | 73.29 | 38.54 | 52.91 | 34.18 | 84.45 | |
| Item Type | ||||||||
| Low a | 70.86 | 74.49 | 76.60 | 49.80 | 55.29 | 43.66 | 93.82 | |
| Medium a | 62.31 | 72.74 | 73.72 | 39.36 | 52.30 | 33.31 | 92.99 | |
| High a | 53.15 | 70.15 | 69.32 | 26.21 | 49.19 | 24.57 | 91.77 | |
Note. True-positive rates are represented as a percent. DIF = differential item functioning.
Item Type
Item type impacted the true-positive rates of all four sets of DIF criteria. Disregarding standardized or unstandardized effect size measures, items with low discrimination had higher true-positive rates than items with moderate or high levels of discrimination. Conversely, items with high discrimination had the lowest true-positive rate across the four sets of DIF criteria. This held across the number of response categories and at both moderate and large DIF levels.
Between-Group Differences in Thresholds
Between-group differences in the threshold parameter affected the true-positive rates (see Figures 3 and 4). Larger differences in Δτ equated to higher true-positive rates, as would be expected. At the moderate DIF level, the Zwick effect size true-positive rates were always the lowest. Golia’s (2012) criteria had the highest true-positive rates overall at the moderate DIF level; however, the rates decreased as the number of response options increased, until Golia’s rates approached those of Weese’s effect size measure at seven response categories.
Figure 3.
True-Positive Rate by Between-Group Differences in Threshold Parameters for Three to Five Categories.
Figure 4.
True-Positive Rate by Between-Group Differences in Threshold Parameters for Six and Seven Categories.
At the large DIF level, the Gierl criterion maintained the highest true-positive rates. Weese’s effect size had the second highest true-positive rates across all levels of Δτ, with Golia’s and Zwick’s criteria resulting in the lowest values. As the number of response categories increased, the true-positive rates associated with Golia’s criteria decreased, whereas the true-positive rates of Weese’s and Zwick’s effect size criteria increased.
A more critical analysis of true-positive rates by between-group differences in thresholds is important for evaluating DIF heuristic criteria. A prior study (Walker et al., 2011) with dichotomous data indicated that between-group differences of .30 are representative of moderate DIF levels and differences of .50 indicate large DIF. The Δτ = 0.2 difference was included in this study as a level of negligible DIF, with Δτ = 0.4 representing moderate DIF and values of 0.60 and 0.80 representing large DIF. Further using the simulated data from this study, a supplemental analysis was conducted to determine what values of Δτ corresponded to moderate and large DIF for varying response categories (three to seven). The results of this supplemental analysis indicated that, on average, Δτ values of 0.344 (SD = 0.019) and 0.505 (SD = 0.028) were associated with the moderate and large DIF heuristics of 0.164 and 0.241, respectively, similar to the results from Walker et al.’s (2011) study. In addition, the analysis showed that as the number of response categories increased, the value of Δτ needed for classification at both the moderate and large DIF levels decreased. When the number of response categories was three, Δτ values of 0.371 and 0.546 were associated with moderate and large DIF; when the number of response categories was seven, Δτ values of 0.324 and 0.505 were associated with moderate and large DIF. Therefore, to determine which set of heuristics correctly identified items as having DIF, we should expect the true-positive rates to be greater than 50% at the moderate DIF level when Δτ = 0.4. Similarly, the true-positive rates are expected to be greater than 50% at the large DIF level when Δτ = 0.6.
Although Gierl’s (2005) criterion provides the highest true-positive rate for all conditions, it also provides large DIF true-positive rates for Δτ = 0.2 (negligible DIF) of between 50% and 73% when there are four to seven response options. The large DIF true-positive rates for all three of the other criteria are less than 1% when Δτ = 0.2. When Δτ = 0.4 (previously considered a moderate DIF magnitude), Gierl’s large DIF true-positive rates are greater than 97% for four to seven response categories, whereas the true-positive rates are between 1% and 16% for the other three criteria. In comparison, for the Weese, Zwick, and Golia criteria, substantial increases in true-positive rates occur at Δτ = 0.6 for the large DIF criteria and at Δτ = 0.4 for the moderate DIF criteria, as might be expected (Figures 3 and 4).
Regardless of the number of response categories, the Zwick effect size criteria had true-positive rates below 50% at the moderate level when Δτ was 0.4. When Δτ was 0.6, the Zwick effect size criteria had moderate DIF true-positive rates below 50% when the number of response categories was less than six and true-positive rates of 51% and 54% for six- and seven-category items, respectively. The Weese effect size criteria had true-positive rates ranging from 63.2% to 83.9% at the moderate DIF level when Δτ was 0.4. Golia’s true-positive rates for moderate DIF at Δτ = 0.4 were even higher than Zwick’s and Weese’s, ranging from 83.0% to 88.9%.
When comparing the Zwick, Weese, and Golia criteria for large DIF at Δτ = 0.6, Weese had the highest true-positive rates (71.8%–92.7%), followed by Golia for three to five response options (58.9%–53.0%) and Zwick for six and seven response options (51.1%–53.8%). True-positive rates for large DIF were greater than 98.8% at Δτ = 0.8 for all response options for the Weese effect size. Large DIF true-positive rates were greater than 95.1% for the Golia and 86.7% for the Zwick criteria when Δτ = 0.8.
Discussion
Educational Testing Service (ETS) defines three levels of DIF classification: negligible, moderate, and large (Zwick, 2012). Negligible items show little to no differences between the tested subgroups, moderate DIF items show a difference and can still be used in the test if needed, and large DIF items should not be used unless they are needed (Zwick, 2012). The results from this study suggest that a single heuristic for identifying DIF should probably be avoided when data are polytomous. Specifically, Gierl (2005) made a suggestion in an applied study that a value of 0.100 might be useful to classify significant DIF as being meaningful (large DIF) when items had four response categories. Other studies (e.g., Gitchel et al., 2010, 2011; Taylor & Lee, 2012) have used this heuristic to classify DIF as meaningful when items had other numbers of response categories. The results from the first simulation study demonstrate that the recommendation from Gierl (2005) may overidentify items as having DIF, even when items fitting a GRM have three response categories. This can be seen in the relationships in Figure 1, where minor between-group differences would be flagged as DIF, especially when the number of response categories is greater than three.
When detecting moderate level DIF, Golia’s (2012) procedure has a high true-positive rate, even though it was slightly inflated for negligible levels of between-group differences in certain conditions. Weese’s (2020) effect size has the next highest true-positive rate for moderate level DIF while also maintaining low DIF detection rates for negligible DIF levels. Weese’s standardized effect size has the highest true-positive rates for large DIF levels when between-group differences are simulated as having large DIF magnitudes. Golia (2012) and Zwick et al. (1997) criteria have substantially lower true-positive rates for large DIF conditions. The effect size suggested by Zwick and colleagues had the overall lowest true-positive rates of those compared in this study.
False-positive rates for all four procedures remained below the level of significance for most conditions. All but Gierl’s (2005) DIF criteria had extremely low false-positive rates (less than 2%) under all conditions, which has been observed in other simulations that include both statistical testing and effect size heuristics for determining significant DIF (Turner et al., 2011). The results of the second simulation suggest that a standardized effect size that uses only the participants included in the calculation of both the numerator and denominator (Weese, 2020) identifies items as having moderate and large DIF more consistently than when a standard deviation that uses all participants is employed. Calculating the standardized effect size as recommended by Zwick et al. (1997) appeared to underidentify items as having DIF at both moderate and large DIF levels regardless of the number of response categories.
Two important pieces of information are provided in this study for researchers using POLYSIBTEST to investigate DIF in polytomous items. For those using prior POLYSIBTEST programs that do not incorporate effect size calculations, Table 2 provides estimates of unstandardized $\hat{\beta}_U$ values that can be used for items with three to seven response options when data fit a GRM. Alternatively, a new standardized effect size is provided that can be used with items having any range of response options. In comparison with the other three procedures used in this study (Gierl, 2005; Golia, 2012; Zwick et al., 1997), the Weese effect size minimizes the detection of negligible DIF while maintaining relatively strong true-positive rates for identifying large DIF. The use of a standardized effect size allows for the employment of only one set of criteria for detecting moderate and large DIF rather than multiple criteria based on the number of response options. Furthermore, Preacher and Kelley’s (2011) first point in describing a desirable effect size is that it should be scaled appropriately, which the proposed standardized effect size for POLYSIBTEST accomplishes by matching the participants included in the denominator with those included in the numerator.
Overall, caution is recommended regarding further use of Gierl’s (2005) DIF criterion of $\hat{\beta}_U \geq 0.100$ for identifying DIF in polytomous items, as researchers are likely to overidentify DIF items in many conditions and classify negligible DIF as being large. A $\hat{\beta}_U$ value of 0.100 is associated with a $\hat{\beta}^*_U$ value of 0.053 (approximately one-twentieth of a standard deviation) when items have seven categories, and this may not be meaningful considering that the associated large effect size heuristic for dichotomous items is 0.241. The overidentification of DIF items could cause researchers and practitioners to spend more resources (e.g., time, money) revising old items and creating new items. Furthermore, if researchers are creating items in multiple languages, the World Bank indicates that an additional 2 to 3 weeks of time and resources are needed (“Timeline of Survey Pilot,” 2020). Because the creation of items is difficult and requires multiple stages of development (Crocker & Algina, 1986), potentially false identification of DIF in items can be problematic. Therefore, the Weese (2020) standardized effect size is recommended for use with both dichotomous and polytomous items when employing SIBTEST or POLYSIBTEST, as the standardized heuristic values of 0.164 and 0.241 for moderate and large DIF do not have an inflated DIF identification rate for negligible differences and can be used with items having any range of response options.
Limitations
These findings are not without limitations. To begin, the standardized effect size heuristics used to classify DIF as moderate or large with both the Weese and Zwick standardized effect sizes for POLYSIBTEST were based on Weese’s (2020) dichotomous work. Zwick et al. (1997) originally recommended heuristics of 0.125 and 0.250 based on an SMD effect size that included a different effect size calculation. In a later report (ACT, 2020), the SMD effect size criteria recommended for polytomous items were .17 and .25. However, Zwick et al. (1997) recommended applying their standard deviation to the $\hat{\beta}_U$ value calculated in SIBTEST for further research. This study incorporated that recommended standard deviation with $\hat{\beta}_U$ values to calculate an effect size for POLYSIBTEST to be compared with the proposed recommended standard deviation. This study was not conducted to compare POLYSIBTEST heuristics with heuristics for other DIF detection procedures; thus, no information is provided to compare with the SMD effect size for polytomous items.
Future research needs to be conducted to compare the effect size heuristics used in this study with other DIF procedures such as the Mantel–Haenszel, logistic regression, and SMD using ACT’s (2020) suggested effect size heuristics under more expansive conditions and data models. More expansive conditions include, but are not limited to, additional moderate and small sample sizes commonly found in practice, a wider range of discrimination parameter conditions, varying types of pervasive and nonpervasive DIF, and polytomous data models beyond the GRM, such as the partial credit model (PCM; Masters, 1982), the generalized partial credit model (Muraki, 1992), and the nominal response model (Bock, 1972). In addition, the nonlinear correction introduced by Jiang and Stout (1998) is not included in any current R packages for the POLYSIBTEST procedure; therefore, studying the addition of this correction and its impact on the heuristics selected for the POLYSIBTEST procedure is important and should be considered in the future. Last, it is recommended that additional work be conducted on the functionality of POLYSIBTEST and the suggested heuristics under data conditions where item responses and ability distributions are not normally distributed. Open-source R code for calculating the standardized effect size for polytomous items using POLYSIBTEST is available in the DIFSIB package on GitHub (Weese, 2021).
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: James D. Weese
https://orcid.org/0000-0001-5530-1896
Ronna C. Turner
https://orcid.org/0000-0002-2984-7649
Xinya Liang
https://orcid.org/0000-0002-2453-2162
Allison Ames
https://orcid.org/0000-0002-1512-9830
References
- ACT. (2020). ACT Aspire summative technical manual. https://success.act.org/s/article/ACT-Aspire-Summative-Technical-Manual
- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
- Andresen E., Malmgren J., Carter W., Patrick D. (1994). Screening for depression in well older adults—Evaluation of a short-form of the CES-D. American Journal of Preventive Medicine, 10(2), 77–84. [PubMed] [Google Scholar]
- Beck A. T., Ward C. H., Mendelson M., Mock J., Earbaugh J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561–571. [DOI] [PubMed] [Google Scholar]
- Bock R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
- Bolt D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141.
- Bolt D. M., Stout W. (1996). Differential item functioning: Its multidimensional model and resulting SIBTEST detection procedure. Behaviormetrika, 23, 67–96.
- Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Chang H. H., Mazzeo J., Roussos L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33(3), 333–353.
- Clauser B. E., Mazor K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31–44. 10.1111/j.1745-3992.1998.tb00619.x
- Cohen A., Kim S., Baker F. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17(4), 335–350. 10.1177/014662169301700402
- Crocker L. M., Algina J. (1986). Introduction to classical and modern test theory. Harcourt Brace Jovanovich College Publishers.
- French A., Miller T. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33(3), 315–332.
- Gierl M. J. (2005). Using dimensionality-based DIF analyses to identify and interpret constructs that elicit group differences. Educational Measurement: Issues and Practice, 24(1), 3–14. 10.1111/j.1745-3992.2005.00002.x
- Gitchel W. D., Roessler R. T., Turner R. C. (2011). Gender effect according to item directionality on the perceived stress scale for adults with multiple sclerosis. Rehabilitation Counseling Bulletin, 55(1), 20–28. 10.1177/0034355211404567
- Gitchel W. D., Turner R., Rumrill P. (2010). Differential item functioning in rehabilitation research. Work (Reading, Mass.), 36(3), 361–369. 10.3233/WOR-2010-1072
- Golia S. (2012). Differential item functioning classification for polytomously scored items. Electronic Journal of Applied Statistical Analysis, 5(3), 367–373.
- Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Kluwer-Nijhoff.
- Hao S. (2014). Two SAS macros for differential item functioning analysis. Applied Psychological Measurement, 38(1), 81–82. 10.1177/0146621613493164
- Jiang H., Stout W. (1998). Improved type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291–322. 10.2307/1165279
- Jiang S., Wang C., Weiss D. J. (2016). Sample size requirements for estimation of item parameters in the multidimensional graded response model. Frontiers in Psychology, 7, 109. 10.3389/fpsyg.2016.00109
- John O. P., Donahue E. M., Kentle R. L. (1991). The Big Five Inventory (Versions 4a and 54). Institute of Personality and Social Research, University of California.
- Keiffer E. A. (2011). Group-specific effects of matching subtest contamination on the identification of differential item functioning (Order No. 3476071). Available from Dissertations & Theses @ University of Arkansas Fayetteville; ProQuest Dissertations & Theses Global. (894399625). https://search.proquest.com/docview/894399625?accountid=8361
- Kelley K., Preacher K. J. (2012). On effect size. Psychological Methods, 17(2), 137–152. 10.1037/a0028086
- Kim S., Cohen A. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22(4), 345–355.
- Kirk R. E. (1995). Experimental design: Procedures for the behavioral sciences. Brooks/Cole.
- Lautenschlager G. J., Meade A. W., Kim S.-H. (2006, April). Cautions regarding sample characteristics when using the graded response model [Paper presentation]. 21st Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
- Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
- Muraki E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
- NAEP. (2008). NAEP analysis and scaling—The SIBTEST procedure. https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_checks_dif_proced_sibtest.aspx
- Narayanan P., Swaminathan H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315–328.
- Penfield R. D. (2007). An approach for categorizing DIF in polytomous items. Applied Measurement in Education, 20(3), 335–355.
- Penfield R. D. (2010). Distinguishing between Net and Global DIF in polytomous items. Journal of Educational Measurement, 47(2), 129–149. 10.1111/j.1745-3984.2010.00105.x
- Penfield R. D. (2014). An NCME instructional module on polytomous item response theory models. Educational Measurement: Issues and Practice, 33(1), 36–48. 10.1111/emip.12023
- Penfield R. D., Alvarez K., Lee O. (2008). Using a taxonomy of differential step functioning to improve the interpretation of DIF in polytomous items: An illustration. Applied Measurement in Education, 22(1), 61–78. 10.1080/08957340802558367
- Preacher K. J., Kelley K. (2011). Effect size measures for mediation models: Quantitative strategies for communicating indirect effects. Psychological Methods, 16(2), 93–115. 10.1037/a0022658
- Radloff L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3), 385–401. 10.1177/014662167700100306
- Raju N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502. 10.1007/BF02294403
- R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. http://www.R-project.org/
- Rosenberg M. (1965). Society and the adolescent self-image. Princeton University Press.
- Roussos L. A., Stout W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel type I error performance. Journal of Educational Measurement, 33(2), 215–230.
- Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Psychometric Society. https://www.psychometricsociety.org/sites/default/files/pdf/MN17.pdf
- Shealy R., Stout W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194. 10.1007/BF02294572
- Smith T. W., Son J. (2013). Trends in public attitudes towards abortion [Final] (p. 50). University of Chicago.
- Steinberg L., Thissen D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11(4), 402–415. 10.1037/1082-989X.11.4.402
- Taylor C. S., Lee Y. (2012). Gender DIF in reading and mathematics tests with mixed item formats. Applied Measurement in Education, 25(3), 246–280. 10.1080/08957347.2012.687650
- Thompson B. (1998). Statistical significance and effect size reporting: Portrait of a possible future. Research in the Schools, 5(2), 33–38.
- Timeline of Survey Pilot. (2020). The World Bank. https://dimewiki.worldbank.org/wiki/Timeline_of_Survey_Pilot
- Turner R. C., Gitchel W. D., Keiffer E. A. (2011). Comparing Type I error and power rates in DIF analyses when combining significance tests with effect size criteria [Paper presentation]. National Council on Measurement in Education Annual Conference, New Orleans, LA.
- Walker C. M., Göçer Şahin S. (2020). Using differential item functioning to test for interrater reliability in constructed response items. Educational and Psychological Measurement, 80(4), 808–820. 10.1177/0013164419899731
- Walker C. M., Zhang B., Banks K., Cappaert K. (2011). Establishing effect size guidelines for interpreting the results of differential bundle functioning analyses using SIBTEST. Educational and Psychological Measurement, 72(3), 415–434. 10.1177/0013164411422250
- Weese J. D. (2020). Development of an effect size to classify the magnitude of DIF in dichotomous and polytomous items (Theses and Dissertations). https://scholarworks.uark.edu/etd/3896
- Weese J. D. (2021). DIFSIB: A SIBTEST package. Applied Psychological Measurement, 46(1), 68–69. 10.1177/01466216211040498
- Woods C. M. (2010). DIF testing for ordinal items with Poly-SIBTEST, the Mantel and GMH tests, and IRT-LR-DIF when the latent distribution is nonnormal for both groups. Applied Psychological Measurement, 35(2), 145–164. 10.1177/0146621610377450
- Yarkoni T. (2010). The abbreviation of personality, or how to measure 200 personality scales with 200 items. Journal of Research in Personality, 44(2), 180–198. 10.1016/j.jrp.2010.01.002
- Zung W. W. (1971). A rating instrument for anxiety disorders. Psychosomatics: Journal of Consultation and Liaison Psychiatry, 12(6), 371–379. 10.1016/S0033-3182(71)71479-0
- Zwick R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. ETS Research Report Series, 2012(1), i–30. 10.1002/j.2333-8504.2012.tb02290.x
- Zwick R., Thayer D. T., Mazzeo J. (1997). Describing and categorizing DIF in polytomous items. Educational Testing Service.