Applied Psychological Measurement
2023 May 13;47(4):328–346. doi: 10.1177/01466216231174559

Using a Generalized Logistic Regression Method to Detect Differential Item Functioning With Multiple Groups in Cognitive Diagnostic Tests

Xiaojian Sun 1,2, Shimeng Wang 3, Lei Guo 4, Tao Xin 5, Naiqing Song 1,2
PMCID: PMC10240570  PMID: 37283590

Abstract

Items with differential item functioning (DIF) compromise the validity and fairness of a test. Studies have investigated the DIF effect in the context of cognitive diagnostic assessment (CDA), and several DIF detection methods have been proposed. Most of these methods are designed to detect DIF between two groups; however, empirical situations may involve more than two groups. To date, only a handful of studies have examined the DIF effect with multiple groups in the CDA context. This study uses the generalized logistic regression (GLR) method to detect DIF items by using the estimated attribute profile as the matching criterion. A simulation study is conducted to examine the performance of two GLR methods, the GLR-based Wald test (GLR-Wald) and the GLR-based likelihood ratio test (GLR-LRT), in detecting DIF items; results based on the ordinary Wald test are also reported. Results show that (1) both GLR-Wald and GLR-LRT control Type I error rates better than the ordinary Wald test in most conditions; (2) the GLR methods also produce higher empirical rejection rates than the ordinary Wald test in most conditions; and (3) using the estimated attribute profile as the matching criterion produces similar Type I error rates and empirical rejection rates for GLR-Wald and GLR-LRT. A real data example is also analyzed to illustrate the application of these DIF detection methods with multiple groups.

Keywords: cognitive diagnostic assessment, differential item functioning, generalized logistic regression, multiple groups


Cognitive diagnostic assessment (CDA) has attracted much attention in psychological and educational measurement because it can provide diagnostic information about whether individuals have mastered the attributes in specific domains (Rupp et al., 2010). To ensure that this diagnostic information is reliable and valid, on the one hand, a number of cognitive diagnostic models (CDMs) have been proposed to address different situations. CDMs are a subset of psychometric models that classify individuals into different latent classes based on their performance on specific tasks (Rupp et al., 2010). On the other hand, high-quality cognitive diagnostic tests, which specify the attributes required by each item, should be developed whenever possible. Many factors can affect the validity of a test, among which differential item functioning (DIF) is one of the most important. In the context of CDA, DIF is defined as a difference in the probability of answering an item correctly between individuals who have the same attribute mastery pattern but come from different groups (Hou et al., 2014; Li, 2008). DIF not only compromises the fairness and validity of tests (Hou et al., 2014; Li & Wang, 2015; Liu et al., 2016; 2019), but also decreases the accuracy of classification and parameter estimation in CDA (Hou et al., 2014; Paulsen et al., 2020).

To date, several methods have been proposed to investigate the presence of DIF in the context of CDA (e.g., George & Robitzsch, 2014; Hou et al., 2014; 2020; Li, 2008; Li & Wang, 2015; Liu et al., 2016; 2019; Ma et al., 2021b; Wang et al., 2014; Zhang, 2006). Following the classification logic used for DIF detection methods under the item response theory (IRT) framework, these methods can be classified into CDM-based and non-CDM methods. The Wald test (e.g., Hou et al., 2014; 2020; Liu et al., 2019; Ma et al., 2021b), the likelihood ratio test (LRT; Ma et al., 2021b), and the log-linear cognitive diagnostic model for DIF assessment (LCDM-DIF; Li & Wang, 2015) are CDM-based DIF detection methods, while the Mantel-Haenszel (MH; Mantel & Haenszel, 1959), the simultaneous item bias test (Shealy & Stout, 1993), and the logistic regression (LR; Swaminathan & Rogers, 1990) methods are non-CDM DIF detection methods. These methods can produce acceptable Type I error rates and statistical power under certain conditions. For instance, Ma et al. (2021b) developed versions of the Wald test and LRT that use a scale purification procedure termed the forward anchor item search (FS). Simulation results showed that the Wald test and LRT with the FS algorithm (named Wald-FS and LRT-FS, respectively) produced better-controlled Type I error rates than the ordinary Wald test and LRT, especially for low item quality. Wang et al. (2014) found that the LR method can produce lower Type I error rates and higher power than the modified and ordinary Wald tests under the deterministic input, noisy, and gate (DINA; Junker & Sijtsma, 2001) model.

It is worth noting that most of the methods mentioned above are used to detect DIF items between two groups (for simplicity, these are named two-group methods). However, empirical studies may require the detection of DIF items among multiple groups (Magis et al., 2011; Penfield, 2001). For instance, researchers may be interested in investigating the DIF effect between countries in international surveys, such as the Trends in International Mathematics and Science Study (TIMSS). There are two ways to detect DIF in multiple-group situations. One is to perform pairwise comparisons between the base group (reference group) and each focal group using the two-group methods (e.g., Kim et al., 1995; Woods et al., 2013). For instance, researchers can use the LRT, which is commonly used to detect DIF between two groups in the IRT framework, to test for the presence of DIF between the base group and each focal group. This approach is feasible when the number of groups is small (e.g., fewer than four); otherwise, it is time consuming and complex. For instance, in a five-group situation, the LRT requires 10 pairwise comparisons. The other way is to use a method that can assess DIF across all groups simultaneously (named multiple-group methods; e.g., Hou & de la Torre, 2015; Magis et al., 2011; Penfield, 2001); among these, the Wald test is the most popular for DIF detection. Compared with the two-group methods (e.g., the LRT), the multiple-group methods (e.g., the Wald test) have some potential advantages (Magis et al., 2011; Penfield, 2001): (1) greater power for detecting DIF may be obtained; (2) the Type I error rate may be closer to the nominal level; and (3) the multiple-group methods are more efficient than the two-group methods.

To the authors' knowledge, several studies have detected the DIF effect with multiple groups in the CDA context. Li and Wang (2015) used the LCDM-DIF and the ordinary Wald test to compare item performance among three groups. The LCDM-DIF can estimate not only model parameters, such as the intercept and main effects of the LCDM, but also DIF effect parameters, such as the main effects for focal groups and all possible interaction effects among the attributes of specific items. A simulation study showed that the LCDM-DIF produces better parameter recovery and better-controlled Type I error rates than the ordinary Wald test. The LCDM-DIF, however, uses the Markov chain Monte Carlo algorithm to estimate the parameters; obtaining stable parameter estimates in this way requires a large sample size and is time-consuming (Ma et al., 2021b). Hou and de la Torre (2015) used the ordinary Wald test to detect the presence of DIF for more than two groups; their simulation study showed that the ordinary Wald test has Type I error rates close to the nominal level in the high item quality condition. Noting that the performance of the Wald test relies mainly on the quality of items, as well as on their asymptotic variance–covariance matrix (Σ̂), Hou and de la Torre (2015) used the item-wise information matrix to calculate Σ̂ (Liu et al., 2016; 2019; Ma et al., 2021b). Researchers have shown that the item-wise information matrix underestimates Σ̂ (Liu et al., 2019; Philipp et al., 2018). Moreover, a CDM must be specified before conducting the Wald test, but this is not available for models that lack a closed form (Wang et al., 2014). Recently, Svetina et al. (2017) investigated the potential sources of DIF among four groups by examining the skills and cognitive processes hypothesized to underlie student performance on the National Assessment of Educational Progress. They adopted a generalized logistic regression (GLR) method to detect DIF using the total score as the matching criterion, and then applied the reduced reparameterized unified model (Hartz, 2002) to examine the potential sources of DIF. In their analysis, 25 out of 53 items were flagged as DIF among the four groups, and the base group (i.e., the unaccommodated group) yielded higher mastery probabilities than the focal groups on the flagged items. Because the GLR method adopted by Svetina et al. (2017) uses the total score as the matching criterion, it would presumably produce larger Type I error rates than a GLR method that uses the estimated attribute profile as the matching criterion (Wang et al., 2014).

In the current study, we use the GLR method with the estimated attribute profile as the matching criterion to detect DIF with multiple groups in CDA. The GLR method is model independent (Wang et al., 2014), which means that any diagnostic classification method, either a parametric CDM such as the generalized DINA (GDINA; de la Torre, 2011) model or a non-parametric method such as the Hamming distance discrimination (HDD; Chiu et al., 2018) method, can be adopted to estimate the matching criterion (i.e., the attribute profile). Another advantage of the GLR method is that it can be directly applied to detect DIF when no clear reference group can be set up (Magis et al., 2011). Using the GLR method to detect DIF has received much attention in the IRT framework, but the method has not been systematically applied and investigated for multiple-group situations in the CDA framework. We expect that the GLR method can produce acceptable Type I error rates and power/detection rates with multiple groups. The remainder of the article is organized as follows. First, the GDINA model is introduced; in the current study, the GDINA model is used to generate item responses and to compute the Wald test. Next, the GLR method is introduced for the detection of DIF items with multiple groups. Then a simulation is conducted to examine the performance of the GLR method and the ordinary Wald test under different conditions, followed by an empirical study. Finally, a discussion of the findings is presented.

The GDINA Model

The GDINA model is a general CDM that has become among the most popular in recent years. For the identity link function, the item response function of the GDINA model is expressed as follows (de la Torre, 2011):

Pj(αi*) = δj0 + Σ_{k=1}^{Kj*} δjk αik + Σ_{k′=k+1}^{Kj*} Σ_{k=1}^{Kj*−1} δjkk′ αik αik′ + … + δj12…Kj* ∏_{k=1}^{Kj*} αik, (1)

where Kj* is the number of attributes that item j requires; αi* = (αi1, αi2, …, αiKj*) is the collapsed attribute mastery pattern for individual i, with αik ∈ {0, 1}; δj0 is the probability of correctly answering item j when none of the required attributes is mastered; δjk is the main effect of attribute αk, which indicates the change in the probability of a correct response when the single attribute αk is mastered; and δjkk′ and δj12…Kj* are interaction effects, among which the former is the interaction effect between attributes αk and αk′ and the latter represents the interaction effect among all attributes that item j requires. The link function of the GDINA model can also be a logit-link or log-link function; details can be found in de la Torre (2011).
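To make the identity-link formulation concrete, the following minimal Python sketch (not part of the original study; the δ values are invented for illustration) evaluates equation (1) for a hypothetical item that requires two attributes:

```python
from itertools import product

def gdina_prob(alpha, delta):
    """Identity-link GDINA success probability for a two-attribute item.

    alpha: tuple of 0/1 attribute indicators (a1, a2)
    delta: dict with intercept d0, main effects d1 and d2, interaction d12
    """
    a1, a2 = alpha
    return (delta["d0"] + delta["d1"] * a1 + delta["d2"] * a2
            + delta["d12"] * a1 * a2)

# Hypothetical parameters: P = .1 with no attributes; mastering attribute 1
# adds .2, attribute 2 adds .3, and mastering both adds a further .3.
delta = {"d0": 0.1, "d1": 0.2, "d2": 0.3, "d12": 0.3}
for alpha in product((0, 1), repeat=2):
    print(alpha, gdina_prob(alpha, delta))
```

The four printed probabilities correspond to the 2^Kj* = 4 collapsed attribute patterns of the item.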

The GLR Method for DIF Detection with Multiple Groups

The GLR method, which is an extension of the method proposed by Wang et al. (2014), is used to detect DIF in situations where there are more than two groups. The GLR DIF model across G groups, using the total score as the matching criterion, takes the following form:

logit(πij) = log[πij/(1 − πij)] = τ0j + γj Si + Σ_{t=1}^{G−1} λj,t git + Σ_{t=1}^{G−1} ωj,t (Si git), (2)

where πij is the probability that individual i answers item j correctly; Si is the total score for individual i; τ0j is the intercept; γj is the main effect of the total score; git is a set of dummy variables that equals 1 if individual i belongs to group t (t = 1, …, G − 1) and 0 otherwise (e.g., giT = (0, 0), (1, 0), or (0, 1) if individual i belongs to the base, first focal, or second focal group, respectively); λj,t is the main effect of group t; and ωj,t is the interaction effect between the total score and group t.

When the attribute profile is used as the matching criterion for the GLR method in the CDA framework, the GLR DIF model can be written as

logit(πij) = log[πij/(1 − πij)] = τ0j + Σ_{l=1}^{Lj*} τ1jl I(αi* = αl*) + Σ_{t=1}^{G−1} git τ2jt + Σ_{t=1}^{G−1} git Σ_{l=1}^{Lj*} τ3jl,t I(αi* = αl*), (3)

where αl* = (αl1, …, αlk, …, αlKj*) is the collapsed attribute profile, with the all-zero attribute profile serving as the baseline; I(x) is the indicator function, which equals 1 if x is true and 0 otherwise; Lj* = 2^Kj* − 1 is the number of unique collapsed attribute profiles minus 1 for item j; τ1jl and τ2jt are the main effects of collapsed attribute profile αl* and group t, respectively; and τ3jl,t is the interaction effect between collapsed attribute profile αl* and group t.

A comparison of models (2) and (3) reveals that τ0j, αi* = (αi1, …, αik, …, αiKj*), τ1j = (τ1j1, …, τ1jl, …, τ1jLj*), τ2jt, and τ3j,t = (τ3j1,t, …, τ3jl,t, …, τ3jLj*,t) are equivalent to τ0j, Si, γj, λj,t, and ωj,t, respectively. According to model (3), an item is flagged as a DIF item if the main effect of group or the interaction effect between group and attribute(s) differs from zero.
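The structure of model (3) can be seen in its design matrix. The Python sketch below (the helper `design_row` is hypothetical, not part of the study) builds one examinee's row for an item with Kj* = 2 attributes and G = 3 groups; its length matches the (F + 1) × 2^Kj* = 12 parameters of βj discussed later:

```python
from itertools import product

def design_row(alpha, group, K=2, G=3):
    """One row of the GLR DIF model (3) design matrix.

    alpha: collapsed attribute profile, tuple of K 0/1 entries
    group: 0 = base group, 1..G-1 = focal groups
    Columns: intercept, L* = 2^K - 1 profile indicators (the all-zero
    profile is the baseline), G - 1 group dummies, and
    (G - 1) * L* group-by-profile interactions.
    """
    profiles = [p for p in product((0, 1), repeat=K) if any(p)]
    prof_ind = [1 if alpha == p else 0 for p in profiles]
    grp_ind = [1 if group == t else 0 for t in range(1, G)]
    inter = [g * q for g in grp_ind for q in prof_ind]
    return [1] + prof_ind + grp_ind + inter

row = design_row(alpha=(1, 0), group=2)
print(len(row))  # (F + 1) * 2^K = 12 columns
```

An examinee in the base group contributes zeros to all group and interaction columns, so those coefficients capture pure between-group differences.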

Two approaches can be used to test whether an item exhibits DIF under the GLR method: the LRT and the Wald test, named GLR-LRT and GLR-Wald, respectively. The GLR-LRT compares two nested models. Based on model (3), the two nested models are formulated as follows:

logit(πij) = log[πij/(1 − πij)] = τ0j + Σ_{l=1}^{Lj*} τ1jl I(αi* = αl*) + Σ_{t=1}^{G−1} git τ2jt, (4)

and

logit(πij) = log[πij/(1 − πij)] = τ0j + Σ_{l=1}^{Lj*} τ1jl I(αi* = αl*). (5)

The lambda statistic Λ (Wilks, 1938) is used to test whether the difference between the augmented model (i.e., model (3)) and a nested model (i.e., model (4) or (5)) is significant: Λ = −2log(L0/L1), where L0 and L1 are the maximized likelihoods of the nested and augmented models, respectively. The Λ statistic follows an asymptotic chi-square distribution, with degrees of freedom (df) equal to the difference in the number of parameters between the augmented and nested models. For instance, the presence of nonuniform DIF is tested by comparing models (3) and (4): if the corresponding Λ statistic, with df = F × (2^Kj* − 1), where F is the number of focal groups, exceeds the critical value, then nonuniform DIF is present. Conversely, uniform DIF is present if the Λ statistic between models (4) and (5), with df = F, is significant. If researchers are merely interested in detecting whether an item exhibits DIF of any form (uniform or nonuniform), the Λ statistic between models (3) and (5), with df = F × 2^Kj*, can be tested.
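As a small numerical illustration of the Λ computation (the log-likelihood values below are invented, not results from the study), consider the nonuniform-DIF test for an item with Kj* = 2 and F = 2 focal groups:

```python
from math import log  # noqa: F401 (log-likelihoods are given directly below)

def lrt_stat(loglik_nested, loglik_augmented):
    """Lambda = -2 log(L0/L1) = 2 * (ll_augmented - ll_nested)."""
    return 2 * (loglik_augmented - loglik_nested)

# Hypothetical maximized log-likelihoods for models (3) and (4).
ll_aug, ll_nested = -512.4, -520.1
lam = lrt_stat(ll_nested, ll_aug)
df = 2 * (2**2 - 1)   # F * (2^K - 1) = 6 for the nonuniform-DIF test
crit = 12.592         # chi-square .95 quantile at df = 6
print(round(lam, 2), lam > crit)  # 15.4 True -> flag nonuniform DIF
```

Because the models are nested, Λ is nonnegative, and the comparison against the chi-square quantile gives the significance decision.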

The GLR-Wald test uses a matrix formulation, Cβj = 0, to test the null hypotheses of model (3), where βj collects the parameters in model (3), C is a contrast matrix, and 0 is a vector of zeros. In the DIF framework, C is an (F × 2^Kj*)-by-((F + 1) × 2^Kj*) matrix, C = [0, I], where 0 is an (F × 2^Kj*)-by-2^Kj* matrix of zeros and I is the identity matrix of dimension F × 2^Kj*. For instance, for an item j that requires two attributes, the C matrix in the DIF framework for three groups can be written as

CDIF =
[0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 1],

Meanwhile, in the nonuniform-DIF framework, C is an (F × (2^Kj* − 1))-by-((F + 1) × 2^Kj*) matrix and can be written as

CNUDIF =
[0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 1].

Both the DIF and nonuniform-DIF frameworks use the same vector βj = (τ0j, τ1j1, τ1j2, τ1j3, τ2j1, τ2j2, τ3j1,1, τ3j2,1, τ3j3,1, τ3j1,2, τ3j2,2, τ3j3,2)^T, where T denotes the transpose.

For the uniform-DIF framework, which is based on model (4), βj has 2^Kj* + F elements and C is an F-by-(2^Kj* + F) matrix; the C matrix and the vector βj can be written as
CUDIF =
[0 0 0 0 1 0]
[0 0 0 0 0 1]
and βj = (τ0j, τ1j1, τ1j2, τ1j3, τ2j1, τ2j2)^T, respectively.

The formulation of the GLR-Wald can be expressed as follows

Wj = (Cβ̂j)^T (C Σ̂j C^T)^{−1} (Cβ̂j), (6)

where β̂j and Σ̂j are the estimated parameter vector and variance–covariance matrix of item j, respectively. As with the Λ statistic, the W statistic follows an asymptotic chi-square distribution, with df equal to the rank of C. Note that the GLR-Wald and the ordinary Wald test share the same general formulation; the difference between the two statistics lies in the specific estimated parameter vector (i.e., β̂j) and the estimated variance–covariance matrix of item j (i.e., Σ̂j). For instance, assume that three groups exist and that item j measures two attributes. In this situation, the estimated parameter vector β̂j obtained from model (3) is (τ̂0j, τ̂1j1, τ̂1j2, τ̂1j3, τ̂2jF1, τ̂2jF2, τ̂3j1,F1, τ̂3j2,F1, τ̂3j3,F1, τ̂3j1,F2, τ̂3j2,F2, τ̂3j3,F2)^T for the GLR-Wald test, while that obtained from model (1) is (δ̂0,B, δ̂1,B, δ̂2,B, δ̂12,B, δ̂0,F1, δ̂1,F1, δ̂2,F1, δ̂12,F1, δ̂0,F2, δ̂1,F2, δ̂2,F2, δ̂12,F2)^T for the ordinary Wald statistic, where the subscripts B, F1, and F2 represent the base, first focal, and second focal groups, respectively.
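Equation (6) can be sketched in a few lines of numpy, using the C = [0 | I] contrast from the three-group, two-attribute example; the parameter estimates and covariance matrix below are simulated placeholders, not values from the study:

```python
import numpy as np

def wald_stat(beta_hat, sigma_hat, C):
    """W_j = (C b)^T (C Sigma C^T)^{-1} (C b); df = rank(C)."""
    Cb = C @ beta_hat
    return float(Cb @ np.linalg.solve(C @ sigma_hat @ C.T, Cb))

# Item with K*_j = 2 attributes and F = 2 focal groups: beta_j has
# (F + 1) * 2^K = 12 entries; the overall DIF contrast tests the last
# F * 2^K = 8 of them, so C = [0_{8x4} | I_8].
C = np.hstack([np.zeros((8, 4)), np.eye(8)])

rng = np.random.default_rng(1)
beta_hat = rng.normal(size=12)     # hypothetical estimates
sigma_hat = np.eye(12) * 0.05      # hypothetical covariance matrix
W = wald_stat(beta_hat, sigma_hat, C)
df = np.linalg.matrix_rank(C)      # 8
print(round(W, 2), df)
```

With a diagonal Σ̂j, W reduces to the sum of squared tested coefficients divided by their common variance, which makes the statistic easy to verify by hand.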

The GLR-LRT and GLR-Wald test will generally produce similar results, especially for large sample sizes (Magis et al., 2011); the GLR-LRT performs better than the GLR-Wald test when the sample size is relatively small (Agresti, 2002). In the CDA context, Ma et al. (2021b) found that when items have high or moderate quality, the Wald test had average Type I error rates close to the nominal level, whereas the LRT had slightly inflated Type I error rates; however, the Wald method was the worst option for low item quality. In contrast, Woods et al. (2013) found that the LRT and Wald test performed similarly for both two and three groups, regardless of sample size, in the IRT framework.

In addition to detecting whether DIF items exist across multiple groups, the GLR-LRT and GLR-Wald test can be used to detect DIF among a subset of groups. This is particularly useful for identifying exactly which groups show significant differences. Take the GLR-Wald test as an example: assume that item j, which requires two attributes, is flagged as DIF, and that the subtest is specified between the base group and the first focal group (with more than two groups in total). In this situation, the estimated parameter vector β̂j is (τ0j, τ1j1, τ1j2, τ1j3, τ2jF1, τ3j1,F1, τ3j2,F1, τ3j3,F1)^T, which comes from the GLR DIF model with the attribute profile as the matching criterion (i.e., equation (3)), and the contrast matrix C in the DIF framework is
[0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 1].
Therefore, the W statistic (equation (6)) between the base group and the first focal group is Wj = (τ̂2jF1, τ̂3j1,F1, τ̂3j2,F1, τ̂3j3,F1)(Σ̂j,F1)^{−1}(τ̂2jF1, τ̂3j1,F1, τ̂3j2,F1, τ̂3j3,F1)^T, where Σ̂j,F1 is the estimated variance–covariance matrix of the main effect and interaction effects associated with focal group F1. Subtests between any two groups (e.g., base group vs. second focal group, or first vs. second focal group) can be conducted in the same manner.

Overall, using the GLR method to detect DIF in the CDA framework involves three steps: (1) obtain the matching criterion (i.e., the estimated attribute profile) for each individual; (2) construct the logistic regression equations conditional on the matching criterion obtained in the previous step; and (3) use the LRT or Wald test to compute the DIF statistics.
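The three steps can be sketched end to end. The following Python sketch is an illustration only (the study itself uses R with the GDINA and difR packages): it simulates DIF-free responses for one item, treats the true attribute profiles as the matching criterion in place of HDD estimates, fits models (3) and (5) with a simple Newton-Raphson logistic fit, and forms the overall Λ statistic.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# Step 1: matching criterion. The true attribute profiles stand in for
# the HDD-estimated profiles (K = 2 attributes, G = 3 groups).
K, G, n = 2, 3, 1500
alphas = rng.integers(0, 2, size=(n, K))
groups = rng.integers(0, G, size=n)
profiles = [p for p in product((0, 1), repeat=K) if any(p)]  # L* = 3

# Step 2: logistic regression designs for the augmented model (3)
# and the nested model (5).
def design(alphas, groups, with_group_terms):
    rows = []
    for a, g in zip(alphas, groups):
        prof = [1 if tuple(a) == p else 0 for p in profiles]
        row = [1] + prof
        if with_group_terms:
            grp = [1 if g == t else 0 for t in range(1, G)]
            row += grp + [d * q for d in grp for q in prof]
        rows.append(row)
    return np.array(rows, dtype=float)

X5 = design(alphas, groups, with_group_terms=False)
X3 = design(alphas, groups, with_group_terms=True)

# DIF-free data: success probability depends on the profile only.
true_beta = np.array([-1.0, 0.8, 0.8, 2.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X5 @ true_beta))).astype(float)

# Step 3: maximum likelihood by Newton-Raphson, then the LRT statistic.
def max_loglik(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        hess = (X.T * (p * (1 - p))) @ X + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ beta))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

lam = 2 * (max_loglik(X3, y) - max_loglik(X5, y))
print(round(lam, 2))  # compare to the chi-square .95 quantile at df = F * 2^K = 8
```

Because the data are generated without DIF, Λ should typically stay below the chi-square critical value at df = 8.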

Simulation Design

The purpose of the simulation study is to examine the performance of the GLR method (both the GLR-LRT and GLR-Wald tests) for DIF detection with multiple groups in the context of CDA. To verify the claimed advantage of the GLR method (i.e., model independence), a pilot study investigated the performance of the GLR method when the matching criterion was the attribute mastery pattern estimated by a parametric CDM versus by the non-parametric HDD method; the results showed that the CDM- and HDD-based GLR methods produce similar Type I error rates and empirical power rates. Because the HDD method is easy to compute and suitable for all sample sizes, it is used here to estimate individuals' attribute profiles, and the estimated profiles serve as the matching criterion for the GLR method. In addition, the ordinary Wald test, which uses the item parameters estimated by the CDM and was adopted by Liu et al. (2019) and Ma et al. (2021b), is also included in this study and is named CDM-Wald. Thus, the CDM-Wald, GLR-LRT, and GLR-Wald tests are all examined. The number of attributes is five, and individuals' attribute profiles are generated from a discrete uniform distribution with equal probability for each latent class. The data are generated by the GDINA model.

Design

Six factors are manipulated in this study: number of groups, sample size, test length, item quality, proportion of DIF items, and DIF size. Specifically, (a) three and five groups are used, two levels that are common in practical DIF detection analyses (e.g., Kim et al., 1995; Penfield, 2001). (b) Previous studies have used 1000 and 500 individuals as large and small sample sizes (e.g., Hou et al., 2014; Liu et al., 2019), and empirical studies show that focal groups are usually smaller than the base group (e.g., Svetina et al., 2017). Therefore, two levels are used for the sample size: N1/N2/N3 = 1000/900/800 and N1/N2/N3 = 500/400/300 for the large and small sample sizes with three groups, respectively, and N1/N2/N3/N4/N5 = 1000/900/800/700/600 and N1/N2/N3/N4/N5 = 500/400/300/200/100 for the large and small sample sizes with five groups, respectively, where N1 is the base group. (c) Because test length has an important impact on classification accuracy (e.g., Chen & de la Torre, 2013), 15 and 30 items are used, representing short and long tests, respectively. The Q-matrix is the same as in Liu et al. (2019) and is presented in the Online Appendix. (d) High, low, and mixed item quality are used. For high-quality items, both 1 − P(αl* = 1) and P(αl* = 0) follow a uniform distribution U(.05, .20), where P(αl* = 1) is the probability of success on a specific item when all elements of the lth reduced attribute mastery pattern equal 1, and P(αl* = 0) is the probability of success when all elements equal 0 (Ma et al., 2021b). Similarly, U(.20, .35) and U(.05, .35) define low and mixed quality, respectively. (e) Two levels of the proportion of DIF items, 20% and 40%, are adopted, as is common in previous studies (Li & Wang, 2015; Ma et al., 2021b; Qiu et al., 2019). (f) The DIF size has two levels, .05 and .10, both commonly used in previous studies (e.g., Hou et al., 2014; Li & Wang, 2015; Liu et al., 2016; 2019; Ma et al., 2021b; Wang et al., 2014). Following Ma et al. (2021b), the DIF size for item j is defined as the absolute difference in success probabilities between the base group and one or more focal groups across all reduced attribute mastery patterns: δjl = |PB(Xj = 1 | αl*) − PF(Xj = 1 | αl*)|. In this study, an item may exhibit either uniform or nonuniform DIF in different focal groups; for instance, a specific item may exhibit uniform DIF in focal group 1, nonuniform DIF in focal group 2, and be DIF-free in focal group 3.
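Conditions (d) and (f) can be illustrated with a short sketch (hypothetical Python code, not the study's R implementation): it draws the endpoint success probabilities for one high-quality item and forms a DIF version of the item with DIF size .10 by shifting both probabilities for a focal group.

```python
import random
random.seed(3)

def item_endpoints(quality):
    """Draw the two endpoint success probabilities for one item.

    High: P(0) and 1 - P(1) ~ U(.05, .20); low: U(.20, .35);
    mixed: U(.05, .35).
    """
    lo, hi = {"high": (.05, .20), "low": (.20, .35), "mixed": (.05, .35)}[quality]
    p0 = random.uniform(lo, hi)       # success prob., no required attributes
    p1 = 1 - random.uniform(lo, hi)   # success prob., all required attributes
    return p0, p1

# DIF version of the item: shift each success probability by the DIF
# size (.10) for the focal group (directions keep both values in (0, 1)).
p0, p1 = item_endpoints("high")
p0_focal, p1_focal = p0 + .10, p1 - .10
print(round(abs(p0_focal - p0), 2), round(abs(p1 - p1_focal), 2))  # .1 .1
```

The absolute base-vs-focal differences equal the manipulated DIF size at both endpoints.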

In total, 120 conditions are generated, among which 24 conditions [2 (number of groups) × 2 (sample size) × 2 (test length) × 3 (item quality)] have no DIF for any item (hereinafter, non-DIF conditions), while the remaining 96 conditions [2 (number of groups) × 2 (sample size) × 2 (test length) × 3 (item quality) × 2 (proportion of DIF items) × 2 (DIF size)] contain some DIF items (DIF conditions). One hundred replications are used for each condition to reduce sampling error. The study is executed in R version 4.1.0 (R Development Core Team, 2021), and the GDINA package (Ma et al., 2021a) is used both to generate item response data and to estimate item parameters. The attribute profiles, which serve as the matching criterion for the GLR method, are estimated with the HDD method (Chiu et al., 2018). Specifically, we use the HDD method to estimate individuals' attribute profiles for all groups together, because the pilot study showed that estimating all groups together produces lower Type I error rates and comparable empirical rejection rates relative to estimating each group separately. The R code for the CDM-Wald and the GLR method is modified from the GDINA package (Ma et al., 2021a) and the difR package (Magis et al., 2010), respectively. The Wald test requires a covariance matrix; the outer-product-of-gradients method, which takes all model parameters into consideration and has been adopted in previous studies (Liu et al., 2016; 2019; Ma et al., 2021b), is used to calculate it. All code is available upon request.

Analysis

The Type I error rates and empirical power/rejection rates are used to assess the performance of the CDM-Wald, GLR-LRT, and GLR-Wald tests. Type I error rates are defined as the proportion of DIF-free items that are falsely identified as DIF items; they are acceptable if they fall within [.025, .075] (Bradley, 1978). To ensure that statistical power/rejection rates are comparable across methods, empirical power/rejection rates are calculated for each method. Specifically, the 95th percentile of the statistics for each method under the 24 non-DIF conditions is used as the empirical critical value. To obtain accurate empirical critical values, 300 replications are used for these non-DIF conditions. The empirical power/rejection rates are then estimated as the proportion of statistics for each method that exceed the corresponding critical values; these critical values are given in the Online Appendix. In addition, mixed analyses of variance (ANOVAs) are conducted for the Type I error rates and empirical power/rejection rates using the R package rstatix (Kassambara, 2021). The normality assumption for the response variables is justified before the ANOVAs: given the large sample size (N = 100) in each cell (condition) of the simulation study, the sampling distribution can be assumed to be approximately normal by the central limit theorem (Field et al., 2012; Rosenblatt, 1956; Wilcox, 2005). Consequently, parametric ANOVAs can be applied directly to both the Type I error rates and the empirical rejection rates. The generalized eta squared, ηG2, is used as the effect size; according to Cohen (1988), an effect is at least meaningful when ηG2 ≥ .01, with values of .01, .06, and .14 indicating small, moderate, and large effects, respectively.
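The empirical-critical-value step can be illustrated with a short sketch (the statistic values are stand-ins; the study computes these from 300 null replications per condition):

```python
def empirical_critical_value(stats, level=0.95):
    """Order-statistic cut: the value exceeded by ~5% of null statistics."""
    s = sorted(stats)
    idx = min(int(level * len(s)), len(s) - 1)
    return s[idx]

# Stand-in for 300 DIF statistics collected under a non-DIF condition.
null_stats = [i / 10 for i in range(300)]
print(empirical_critical_value(null_stats))
```

A method's empirical rejection rate is then the share of its statistics under a DIF condition that exceed this cutoff, which equates the effective Type I error rates across methods.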

Results

Type I error rates. The Type I error rates are presented in Table 1. Generally speaking, the GLR methods produce lower Type I error rates than the CDM-Wald in most conditions. Specifically, the GLR methods (both GLR-LRT and GLR-Wald) produce Type I error rates within [.025, .075] in most conditions, whereas the CDM-Wald produces inflated Type I error rates in most conditions (bold values in the table), especially under short test length, low item quality, and mixed item quality. For the CDM-Wald, the Type I error rates fall within [.025, .075] only when the test length is long and items have high or mixed quality. Meanwhile, the GLR-LRT and GLR-Wald produce similar Type I error rates in most conditions, with the GLR-Wald rates slightly smaller than the GLR-LRT rates. The average Type I error rates for the GLR-LRT and GLR-Wald are .054 (SD = .006) and .042 (SD = .007) under the non-DIF conditions, respectively; under the DIF conditions, the corresponding averages are .067 (SD = .014) and .053 (SD = .016). The average Type I error rates for the CDM-Wald are .272 (SD = .246) and .267 (SD = .241) for the non-DIF and DIF conditions, respectively. The results for five groups are similar to those for three groups, except that the CDM-Wald produces inflated Type I error rates in all conditions when the sample size is small.

Table 1.

Type I Error Rates for All Three DIF Detection Methods.

N  J  IQ  DIF%  DIF size  |  G = 3: CDM-Wald  GLR-Wald  GLR-LRT  |  G = 5: CDM-Wald  GLR-Wald  GLR-LRT
Small 15 High 0 0 .154 .039 .063 .119 .031 .078
20% .05 .151 .041 .063 .108 .028 .069
.10 .135 .038 .068 .111 .032 .087
40% .05 .163 .043 .058 .101 .038 .088
.10 .130 .050 .067 .121 .034 .077
Low 0 0 .620 .037 .055 .550 .036 .066
20% .05 .631 .027 .044 .580 .035 .068
.10 .589 .053 .067 .575 .028 .068
40% .05 .644 .054 .072 .572 .034 .073
.10 .583 .054 .063 .561 .043 .072
Mix 0 0 .421 .043 .057 .354 .029 .065
20% .05 .457 .038 .056 .383 .025 .056
.10 .416 .048 .063 .351 .035 .073
40% .05 .433 .051 .068 .366 .037 .080
.10 .412 .049 .068 .352 .032 .074
30 High 0 0 .006 .034 .054 .192 .023 .071
20% .05 .008 .033 .055 .185 .020 .061
.10 .004 .036 .059 .188 .024 .071
40% .05 .006 .040 .058 .196 .024 .068
.10 .003 .040 .058 .183 .039 .081
Low 0 0 .319 .043 .053 .554 .030 .055
20% .05 .315 .045 .057 .547 .033 .059
.10 .299 .043 .052 .549 .026 .056
40% .05 .309 .054 .062 .553 .032 .059
.10 .312 .065 .077 .547 .027 .059
Mix 0 0 .020 .034 .054 .277 .028 .063
20% .05 .011 .043 .058 .278 .033 .071
.10 .016 .040 .055 .275 .040 .077
40% .05 .022 .042 .062 .273 .028 .071
.10 .013 .051 .065 .264 .044 .078
Large 15 High 0 0 .115 .046 .052 .140 .040 .050
20% .05 .103 .054 .063 .163 .044 .056
.10 .097 .076 .086 .163 .073 .088
40% .05 .097 .061 .074 .138 .047 .058
.10 .093 .088 .108 .149 .088 .107
Low 0 0 .701 .035 .040 .907 .044 .052
20% .05 .721 .066 .073 .908 .050 .058
.10 .734 .071 .079 .908 .059 .067
40% .05 .740 .064 .072 .898 .048 .051
.10 .718 .086 .091 .906 .087 .098
Mix 0 0 .417 .055 .063 .629 .049 .061
20% .05 .393 .048 .052 .608 .036 .043
.10 .399 .076 .088 .631 .059 .069
40% .05 .394 .057 .070 .628 .052 .063
.10 .378 .099 .107 .599 .104 .126
30 High 0 0 .032 .047 .055 .020 .037 .053
20% .05 .026 .048 .055 .025 .046 .058
.10 .028 .049 .058 .016 .053 .065
40% .05 .027 .047 .057 .018 .042 .054
.10 .021 .064 .074 .020 .058 .073
Low 0 0 .429 .044 .049 .616 .042 .048
20% .05 .431 .058 .064 .587 .040 .045
.10 .395 .061 .064 .572 .059 .068
40% .05 .420 .056 .062 .578 .048 .053
.10 .384 .081 .085 .592 .074 .082
Mix 0 0 .033 .048 .055 .033 .044 .051
20% .05 .039 .056 .062 .031 .044 .058
.10 .034 .050 .056 .033 .062 .072
40% .05 .028 .045 .051 .028 .049 .057
.10 .040 .069 .078 .032 .063 .077

Note. N refers to sample size, J refers to test length, IQ refers to item quality, DIF% refers to proportion of DIF items, and G refers to number of groups. Bold values fall outside the interval [.025, .075].

Mixed ANOVAs are conducted to determine which factors have significant effects on the Type I error rates. Detailed results are given in the Online Appendix; only effects that are at least meaningful (i.e., ηG² ≥ .01; Cohen, 1988) are reported. The results can be summarized as follows: (1) Four main effects (number of groups, test length, item quality, and DIF method) have non-ignorable effects on the Type I error rates for the non-DIF conditions, with ηG² ranging from .045 (number of groups) to .763 (DIF method). For the DIF conditions, five main effects (number of groups, sample size, test length, item quality, and DIF method) are non-ignorable, with ηG² ranging from .014 (sample size) to .668 (DIF method). (2) Among the three-way interaction effects, test length × item quality × DIF method has the largest effect size for both the non-DIF conditions (ηG² = .135) and the DIF conditions (ηG² = .104). The CDM-Wald produces larger Type I error rates for low item quality or for the combination of mixed item quality and short test length, whereas the GLR methods (GLR-LRT and GLR-Wald) produce smaller Type I error rates across all conditions. (3) The highest-order non-ignorable interactions are four-way for both the DIF and non-DIF conditions. Specifically, for the DIF conditions the four-way interaction is number of groups × sample size × test length × DIF method (ηG² = .064); for the non-DIF conditions, three four-way interactions (number of groups × sample size × test length × DIF method, number of groups × sample size × item quality × DIF method, and sample size × test length × item quality × DIF method) are obtained, with ηG² of .085, .013, and .014, respectively. The GLR methods (GLR-LRT and GLR-Wald) remain stable across all conditions.
The CDM-Wald, in contrast, has the largest Type I error rates under the conditions of five groups, large sample size, and short test length (.558 for both non-DIF and DIF conditions) and the smallest under three groups, small sample size, and long test length (.115 for non-DIF conditions and .110 for DIF conditions).

Empirical rejection rates. When large differences in Type I error rates are observed across DIF detection methods, comparing null-hypothesis rejection rates provides little insight into which procedure has better power, so the rejection rates cannot be used for a direct "power" comparison. Therefore, in this study we use the empirical rejection rate instead of the empirical power rate. The empirical rejection rates for three and five groups are presented in Figures 1 and 2, respectively. Figure 1 shows that the GLR methods generally produce higher empirical rejection rates than the CDM-Wald, except for the Long-high and Long-mix conditions with a large sample size. In addition, the GLR-LRT produces relatively higher rates than the GLR-Wald for small sample sizes, while the two methods produce similar results for large sample sizes. Specifically, the average empirical rejection rates are .494 (SD = .265), .436 (SD = .270), and .328 (SD = .282) for the GLR-LRT, GLR-Wald, and CDM-Wald, respectively, and the corresponding rates range from .127 to .997, from .086 to .988, and from .070 to .988. All three DIF detection methods produce the highest empirical rejection rates under conditions with long test length and high item quality, and the CDM-Wald produces higher rates than the GLR-LRT and GLR-Wald under conditions with a large sample size, long test length, and high item quality. Moreover, compared with conditions with a large DIF size (DIF = .10), conditions with a small DIF size (DIF = .05) produce relatively lower empirical rejection rates, and the differences between the GLR methods and the CDM-Wald are relatively smaller in these conditions. Furthermore, as sample size increases, the empirical rejection rates increase for all three DIF detection methods.
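The metric itself is easy to state: an item's empirical rejection rate is the proportion of replications in which its DIF test rejects the null hypothesis at the nominal level. A minimal sketch (the p-values are invented purely for illustration):

```python
import numpy as np

def empirical_rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose p-value falls below alpha.

    p_values: (n_replications, n_items) array of per-replication p-values
    from any of the DIF tests. For DIF-free items the result estimates the
    Type I error rate; for DIF items it estimates power.
    """
    return (np.asarray(p_values) < alpha).mean(axis=0)

# Invented p-values: 4 replications x 3 items.
p = np.array([[0.001, 0.20, 0.03],
              [0.004, 0.60, 0.30],
              [0.020, 0.04, 0.80],
              [0.030, 0.50, 0.06]])
print(empirical_rejection_rate(p))  # -> [1.   0.25 0.25]
```

Because the three methods differ in Type I error control, this single quantity is reported for all items rather than labeling it "power" only for the DIF items.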

Figure 1. Empirical rejection rate for three groups. Note. Short-high refers to short test length and high item quality; Short-low refers to short test length and low item quality; Short-mix refers to short test length and mixed item quality; Long-high refers to long test length and high item quality; Long-low refers to long test length and low item quality; Long-mix refers to long test length and mixed item quality.

Figure 2. Empirical rejection rate for five groups.

The empirical rejection rates for five groups are presented in Figure 2. For large sample sizes, the GLR methods (GLR-LRT and GLR-Wald) show a pattern similar to the one observed with three groups: both produce higher empirical rejection rates than the CDM-Wald, except for the Long-high and Long-mix conditions, and the differences between the GLR-LRT and GLR-Wald are small. For small sample sizes, (1) the GLR-LRT produces higher empirical rejection rates than either the CDM-Wald or the GLR-Wald regardless of DIF size and the proportion of DIF items, and (2) the GLR-Wald produces higher empirical rejection rates than the CDM-Wald for large DIF sizes, while the two methods produce similar results for small DIF sizes.

The mixed ANOVA results for the empirical rejection rates are given in the Online Appendix; effects that are at least meaningful are reported. The results can be summarized as follows: (1) Five main effects (sample size, test length, item quality, DIF size, and DIF method) have large effects on the empirical rejection rates, with ηG² ranging from .197 (test length) to .505 (DIF size). (2) Eleven two-way interaction effects are at least meaningful; among them, test length × DIF method has the smallest effect size (ηG² = .012) and DIF size × DIF method the largest (ηG² = .043). (3) The highest-order meaningful interactions are three-way: sample size × test length × DIF method (ηG² = .014) and sample size × item quality × DIF method (ηG² = .021). For the latter, the difference in empirical rejection rates between the GLR methods and the CDM-Wald is very small under the combination of large sample size and high item quality (smaller than .030) but becomes large under the other conditions, especially for the combination of large sample size and low item quality (larger than .320).

Real data example

A real data example is presented to illustrate the application of the three DIF detection methods with multiple groups. The data come from the TIMSS program; we used a portion of the TIMSS 2007 Grade 4 mathematics achievement data, comprising 25 items from the fourth booklet administered in three cultural contexts: Western culture dominated, Eastern culture dominated, and a combination of the two. Specifically, the countries and regions dominated by Western culture are the United States (US) and England; those dominated by Eastern culture are Japan and Chinese Taipei; and those influenced by both Eastern and Western culture are Singapore and Hong Kong SAR. A total of 1496 students with no missing data are included in the current study; of these, 563, 513, and 420 students come from Western culture dominated places, Eastern culture dominated places, and places with a combination of the two cultures, respectively. The Q-matrix, which contains seven attributes, was specified by Park and Lee (2014), and the details are presented in an online appendix. We apply all three DIF detection methods to the data using cultural context as the grouping variable, with the students from the US and England treated as the base group.

Results

Table 2 presents the statistics and the Holm-based adjusted p-values (Holm, 1979) for items flagged by at least one of the three DIF detection methods. There are 10, 16, and 15 DIF items for the CDM-Wald, GLR-LRT, and GLR-Wald, respectively; the CDM-Wald flags the fewest DIF items, whereas the GLR-LRT flags the most. Eight items (items 3, 8, 9, 13, 16, 17, 22, and 24) are flagged as DIF by all three methods, and eight items (items 1, 10, 11, 12, 14, 18, 20, and 25) are flagged as DIF-free by all three methods. In addition, eight items (items 2, 4, 5, 7, 15, 19, 21, and 23) are flagged as DIF items by two of the three methods, and the single remaining item (item 6) is flagged as a DIF item only by the CDM-Wald. Although the real data provide no strong evidence that the GLR-based methods perform better than the CDM-Wald, the simulation study shows that the GLR-based methods (GLR-LRT and GLR-Wald) produce lower Type I error rates and higher empirical rejection rates when the sample size is small and the number of groups is three. It can therefore be inferred that the GLR-LRT and GLR-Wald may produce more reliable results than the CDM-Wald, which implies that more than half of the items may have a DIF effect. As a result, domain experts need to devote more effort to these items to ensure fairer test results.
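The Holm-based adjustment reported in Table 2 is a step-down procedure that multiplies the i-th smallest of the m p-values by m − i + 1 and enforces monotonicity. A minimal sketch (not the authors' code):

```python
import numpy as np

def holm_adjust(p_values):
    """Holm (1979) step-down adjusted p-values.

    Sort p-values ascending, multiply the i-th smallest by m - i + 1,
    carry forward the running maximum so adjusted values stay monotone,
    and cap at 1.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.005]))  # -> [0.03 0.06 0.06 0.02]
```

Compared with Bonferroni, which multiplies every p-value by m, Holm is uniformly more powerful while still controlling the family-wise error rate.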

Table 2.

DIF Items Flagged by Three DIF Detection Methods.

Item No. CDM-Based (Wald, df, p Value) GLR-Based (df a, LRT, p Value, Wald, p Value)
2 12.416 4 .218 4 19.556 .007 19.772 .008
3 140.663 8 .000 8 230.990 .000 182.897 .000
4 14.553 8 .754 8 30.108 .003 25.917 .014
5 9.481 8 1.00 6 22.698 .010 20.696 .025
6 42.375 8 .000 6 6.402 .759 0.001 1.00
7 9.910 16 1.00 11 30.238 .015 27.941 .036
8 53.341 8 .000 8 184.563 .000 163.083 .000
9 16.093 4 .046 4 51.438 .000 43.364 .000
13 31.749 8 .002 6 57.756 .000 51.127 .000
15 9.509 4 .645 4 50.021 .000 41.152 .000
16 26.837 4 .000 4 138.458 .000 118.008 .000
17 131.371 8 .000 6 191.980 .000 165.171 .000
19 15.456 8 .645 6 27.383 .001 26.450 .003
21 25.117 8 .026 6 31.438 .000 2.601 1.00
22 79.593 8 .000 6 60.957 .000 54.673 .000
23 4.642 4 1.00 4 47.589 .000 39.654 .000
24 29.517 4 .000 4 49.341 .000 48.758 .000

Note. p value is the Holm-based adjusted p-value; a the dfs are the same for the GLR-LRT and GLR-Wald methods. Theoretically, the CDM-based and GLR-based DIF detection methods should have the same df; however, some collapsed attribute profiles were not observed when estimating individuals’ attribute profiles, which leads to smaller dfs for some items (e.g., items 5, 6, and 7) under the GLR-based methods.

To further investigate exactly which groups show significant differences, the GLR-Wald method is used to detect DIF between subsets of groups. For items 3, 6, 7, and 16, for instance, the results of pairwise comparisons between the base group and each focal group (i.e., base group vs. the first focal group and base group vs. the second focal group) are presented in Table 3. Significant differences are observed between the base group and each focal group for items 3 and 16. For item 7, there is no significant difference between the base group and the first focal group, but a significant difference is observed between the base group and the second focal group. For item 6, no significant difference is observed between the base group and either focal group, which is consistent with expectations because this item is flagged as DIF-free by the GLR-Wald method.

Table 3.

Subtests of DIF for Items 3, 6, 7, and 16 Using the GLR-Wald Method.

Item Base Group versus 1st Focal Group Base Group versus 2nd Focal Group
χ² df p Value χ² df p Value
3 52.343 4 .000 58.285 4 .000
6 <.001 3 1.00 <.001 3 1.00
7 3.790 5 1.00 22.728 6 .002
16 84.798 2 .000 56.038 2 .000

Note. p value is the Holm-based adjusted p-value.
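Each subtest in Table 3 is a Wald test restricted to the generalized-logistic-regression coefficients that contrast the base group with a single focal group: W = b′V⁻¹b, referred to a chi-square distribution with df equal to the number of tested coefficients. A hedged sketch of the generic computation (the coefficient vector and covariance matrix below are hypothetical, not estimates from the TIMSS data):

```python
import numpy as np
from scipy import stats

def wald_subtest(beta, cov, idx):
    """Wald chi-square test that the coefficients indexed by idx are zero.

    beta: full coefficient vector of the fitted model;
    cov:  its estimated covariance matrix;
    idx:  positions of the base-vs-focal-group contrast parameters.
    """
    b = np.asarray(beta, dtype=float)[idx]
    V = np.asarray(cov, dtype=float)[np.ix_(idx, idx)]
    W = float(b @ np.linalg.solve(V, b))  # b' V^{-1} b
    df = len(idx)
    return W, df, stats.chi2.sf(W, df)

# Hypothetical estimates: test the two parameters for one focal group.
beta = np.array([0.10, 0.90, -0.80, 0.05])
cov = np.diag([0.04, 0.09, 0.16, 0.25])
W, df, p = wald_subtest(beta, cov, idx=[1, 2])
print(f"W = {W:.1f}, df = {df}, p = {p:.4f}")  # W = 13.0, df = 2, p = 0.0015
```

Because only one model is fitted, any pair of groups can be compared by choosing a different index set, which is what makes the GLR-Wald convenient for these follow-up subtests.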

Discussion

The current study uses the GLR method with the attribute profile as the matching criterion to detect DIF among multiple groups in the context of CDA. A simulation study and a real data example illustrate the application of these DIF detection methods with multiple groups. The simulations show that the GLR method produces smaller Type I error rates than the ordinary Wald test in most conditions, and most of its Type I error rates fall within [.025, .075], indicating that the GLR method controls the Type I error rate well. The ordinary Wald test, in contrast, produces a large number of Type I error rates falling outside [.025, .075], indicating inflated Type I error rates in most conditions. In terms of empirical rejection rates, the GLR methods outperform the ordinary Wald test in most conditions. In sum, the GLR methods perform better than the ordinary Wald test in detecting DIF items with multiple groups in the CDA context.

The results also indicate that all manipulated factors have a small impact on the Type I error rates of the GLR methods, whereas three factors (number of groups, test length, and item quality) have a large impact on those of the ordinary Wald test. Except for the proportion of DIF items, all factors have at least meaningful effects on the empirical rejection rates for all three DIF detection methods, which is consistent with previous studies (Li & Wang, 2015; Ma et al., 2021b; Wang et al., 2014). In addition, the Type I error rates of the two GLR methods, GLR-Wald and GLR-LRT, are similar in most conditions, confirming that these two tests can produce similar results (Agresti, 2002; Magis et al., 2011; Woods et al., 2013); a closer inspection shows that the GLR-Wald produces relatively smaller Type I error rates than the GLR-LRT. Note that the GLR-Wald can easily be used to perform comparisons between subgroups regardless of the number of groups. The GLR-LRT can also perform pairwise comparisons between subgroups, but it becomes much more complex and time-consuming when the number of groups is large because of the number of models that must be fit (Woods et al., 2013). A recommended way to test between subgroups, as suggested by Magis et al. (2011), is to conduct DIF detection using both the GLR-LRT and GLR-Wald tests and, once the DIF items are identified, to use the GLR-Wald for the subsequent comparisons.

This study demonstrates that the ordinary Wald test may not be suitable for detecting DIF items in the CDA context when items have low or mixed quality and more than two groups exist. One way to improve its performance is to combine the Wald test with scale purification procedures; for instance, Ma et al. (2021b) found that the Wald-FS produces lower Type I error rates and higher empirical power rates than the ordinary Wald test. The GLR method, including both the GLR-Wald and GLR-LRT, may be an alternative when the number of groups is larger than two. Similarly, scale purification procedures can be combined with the GLR method; for instance, French and Maller (2007) found that the LR method with a scale purification procedure is beneficial under some conditions. It is worth noting that a matching criterion is required before the GLR method can be used to detect DIF items. Because many factors can affect the estimation of the attribute mastery pattern, the performance of the GLR method, which uses the attribute mastery pattern as the matching criterion, depends heavily on factors such as test length (e.g., Chen & de la Torre, 2013), the Q-matrix (e.g., Madison & Bradshaw, 2015), and the number of attributes (e.g., Chiu et al., 2016).
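The purification idea is method-agnostic and easy to sketch: items flagged as DIF are dropped from the anchor set that defines the matching criterion, and the scan repeats until the flagged set stabilizes. A schematic sketch in which dif_test is a hypothetical stand-in for any of the DIF tests discussed here:

```python
def purified_dif_scan(items, dif_test, alpha=0.05, max_iter=10):
    """Iterative scale purification around a generic DIF test.

    dif_test(item, anchor_items) must return the p-value for `item`
    when the matching criterion is built from `anchor_items` only.
    """
    flagged = set()
    for _ in range(max_iter):
        anchor = [j for j in items if j not in flagged]
        new_flagged = {j for j in items if dif_test(j, anchor) < alpha}
        if new_flagged == flagged:      # flagged set has stabilized
            break
        flagged = new_flagged
    return flagged

# Toy stand-in: items 2 and 5 always "show DIF", the rest never do.
toy_test = lambda item, anchor: 0.001 if item in {2, 5} else 0.50
print(sorted(purified_dif_scan(range(10), toy_test)))  # -> [2, 5]
```

In a real application, each call to dif_test would re-estimate individuals' attribute profiles from the anchor items and refit the GLR models, so the loop trades extra computation for a cleaner matching criterion.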

The results obtained in this study have important implications for empirical applications. For instance, during the construction of a cognitive diagnostic test, the ordinary Wald test (CDM-Wald) may identify numerous non-DIF items as DIF items when more than two groups exist. To mitigate the negative impact of DIF items on test validity and fairness, test developers must modify or delete these so-called DIF items and develop new items to cover the content of the test, which increases the human, material, and financial costs of educational assessment. The GLR methods, in contrast, can both detect DIF items and correctly identify non-DIF items; with their results, test developers would need to modify or delete relatively few items, saving resources. In sum, compared with the ordinary Wald test, the GLR methods, including both the GLR-Wald and GLR-LRT, may be more suitable for analyzing test validity and fairness in multiple-group situations within the CDA framework. Of the two GLR-based methods, the GLR-Wald produces relatively smaller Type I error rates than the GLR-LRT and similar empirical rejection rates. Therefore, we recommend using the GLR-Wald method to detect DIF items when multiple groups exist in an assessment.

Although this study confirms that the GLR method is promising for DIF detection, it has some limitations. First, scale purification procedures were not adopted; researchers may systematically investigate the performance of the GLR method with different scale purification procedures in future studies. Second, this study assumes that the Q-matrix is correctly specified, whereas it may be misspecified in practice; researchers can manipulate the Q-matrix as an independent variable in future studies. Third, the number of attributes is fixed in this study, so it remains unknown whether the GLR method is suited to detecting DIF items in large-scale assessment programs (such as TIMSS) that involve more than 10 attributes (Dogan & Tatsuoka, 2008). Last but not least, effect size measures for DIF were not considered. Researchers have proposed effect size measures (e.g., the log odds ratio, the difference in probabilities, and the proportion of variance) to determine DIF sizes in two-group situations (e.g., DeMars, 2011; Feng, 2021; French & Maller, 2007; Nye et al., 2019) and thereby classify DIF items precisely and quantitatively. However, whether integrating these effect size measures into the GLR methods can facilitate precise classification of DIF size for multiple groups remains unknown.

Supplemental Material

Supplemental Material - Using a Generalized Logistic Regression Method to Detect Differential Item Functioning With Multiple Groups in Cognitive Diagnostic Tests

Supplemental Material for Using a Generalized Logistic Regression Method to Detect Differential Item Functioning With Multiple Groups in Cognitive Diagnostic Tests by Xiaojian Sun, Shimeng Wang, Lei Guo, Tao Xin, and Naiqing Song in Applied Psychological Measurement

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant No. 31900793, 32071093) and the Humanities and Social Science Foundation of the Ministry of Education of China (Grant No. 22YJC880065).

Supplemental Material: Supplemental material for this article is available online.

ORCID iD

Xiaojian Sun https://orcid.org/0000-0002-9392-4020

References

1. Agresti A. (2002). Categorical data analysis (2nd ed.). Wiley.
2. Bradley J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144–152. 10.1111/j.2044-8317.1978.tb00581.x
3. Chen J., Woollacott M., Pologe S., Moore G. P. (2013). Stochastic aspects of motor behavior and their dependence on auditory feedback in experienced cellists. Frontiers in Human Neuroscience, 7(6), 419–437. 10.3389/fnhum.2013.00419
4. Chiu C. Y., Köhn H. F., Zheng Y., Henson R. (2016). Joint maximum likelihood estimation for diagnostic classification models. Psychometrika, 81(4), 1069–1092. 10.1007/s11336-016-9534-9
5. Chiu C. Y., Sun Y., Bian Y. (2018). Cognitive diagnosis for small educational programs: The general nonparametric classification method. Psychometrika, 83(2), 355–375. 10.1007/s11336-017-9595-4
6. Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
7. de la Torre J. (2011). The generalized DINA model framework. Psychometrika, 76(2), 179–199. 10.1007/s11336-011-9207-7
8. DeMars C. E. (2011). An analytic comparison of effect sizes for differential item functioning. Applied Measurement in Education, 24(3), 189–209. 10.1080/08957347.2011.580255
9. Dogan E., Tatsuoka K. (2008). An international comparison using a diagnostic testing model: Turkish students’ profile of mathematical skills on TIMSS-R. Educational Studies in Mathematics, 68(3), 263–272. 10.1007/s10649-007-9099-8
10. Feng Y. (2021). Effect size measures for differential item functioning in cognitive diagnostic models (Unpublished doctoral dissertation). Indiana University.
11. Field A., Miles J., Field Z. (2012). Discovering statistics using R. Sage Publications.
12. French B. F., Maller S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67(3), 373–393. 10.1177/0013164406294781
13. George A. C., Robitzsch A. (2014). Multiple group cognitive diagnosis models, with an emphasis on differential item functioning. Psychological Test and Assessment Modeling, 56(4), 405–432.
14. Hartz S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality (Unpublished doctoral dissertation). University of Illinois at Urbana.
15. Holm S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
16. Hou L., de la Torre J. (2015). Applying Wald test to detect multi-group DIF in CDM. Paper presented at the Annual Meeting of the National Council on Measurement in Education Conference.
17. Hou L., de la Torre J., Nandakumar R. (2014). Differential item functioning assessment in cognitive diagnostic modeling: Application of the Wald test to investigate DIF in the DINA model. Journal of Educational Measurement, 51(1), 98–125. 10.1111/jedm.12036
18. Hou L., Terzi R., de la Torre J. (2020). Wald test formulations in DIF detection of CDM data with the proportional reasoning test. International Journal of Assessment Tools in Education, 7(2), 145–158.
19. Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258–272. 10.1177/01466210122032064
20. Kassambara A. (2021). rstatix: Pipe-friendly framework for basic statistical tests [Computer software manual] (R package version 0.7.0). https://CRAN.R-project.org/package=rstatix
21. Kim S.-H., Cohen A. S., Park T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32(3), 261–276. 10.1111/j.1745-3984.1995.tb00466.x
22. Li F. M. (2008). A modified higher-order DINA model for detecting differential item functioning and differential attribute functioning (Unpublished doctoral dissertation). University of Georgia.
23. Li X., Wang W. C. (2015). Assessment of differential item functioning under cognitive diagnosis models: The DINA model example. Journal of Educational Measurement, 52(1), 28–54. 10.1111/jedm.12061
24. Liu Y., Xin T., Li L., Tian W., Liu X. (2016). An improved method for differential item functioning detection in cognitive diagnosis models: An application of Wald statistic based on observed information matrix. Acta Psychologica Sinica, 48(5), 588–598. 10.3724/sp.j.1041.2016.00588
25. Liu Y., Yin H., Xin T., Shao L., Yuan L. (2019). A comparison of differential item functioning detection methods in cognitive diagnostic models. Frontiers in Psychology, 10, Article 1137. 10.3389/fpsyg.2019.01137
26. Ma W., de la Torre J., Sorrel M., Jiang Z. (2021a). GDINA: The generalized DINA model framework (R package version 2.8.0). https://CRAN.R-project.org/package=GDINA
27. Ma W., Terzi R., de la Torre J. (2021b). Detecting differential item functioning using multiple-group cognitive diagnosis models. Applied Psychological Measurement, 45(1), 37–53. 10.1177/0146621620965745
28. Madison M. J., Bradshaw L. P. (2015). The effects of Q-matrix design on classification accuracy in the log-linear cognitive diagnosis model. Educational and Psychological Measurement, 75(3), 491–511. 10.1177/0013164414539162
29. Magis D., Béland S., Tuerlinckx F., De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. 10.3758/BRM.42.3.847
30. Magis D., Raîche G., Béland S., Gérard P. (2011). A generalized logistic regression procedure to detect differential item functioning among multiple groups. International Journal of Testing, 11(4), 365–386. 10.1080/15305058.2011.602810
31. Mantel N., Haenszel W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719–748.
32. Nye C. D., Bradburn J., Olenick J., Bialko C., Drasgow F. (2019). How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organizational Research Methods, 22(3), 678–709. 10.1177/1094428118761122
33. Park Y. S., Lee Y. S. (2014). An extension of the DINA model using covariates: Examining factors affecting response probability and latent classification. Applied Psychological Measurement, 38(5), 376–390. 10.1177/0146621614523830
34. Paulsen J., Svetina D., Feng Y., Valdivia M. (2020). Examining the impact of differential item functioning on classification accuracy in cognitive diagnostic models. Applied Psychological Measurement, 44(4), 267–281. 10.1177/0146621619858675
35. Penfield R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235–259. 10.1207/s15324818ame1403_3
36. Philipp M., Strobl C., de la Torre J., Zeileis A. (2018). On the estimation of standard errors in cognitive diagnosis models. Journal of Educational and Behavioral Statistics, 43(1), 88–115. 10.3102/1076998617719728
37. Qiu X.-L., Li X., Wang W.-C. (2019). Differential item functioning in diagnostic classification models. In von Davier M., Lee Y.-S. (Eds.), Handbook of diagnostic classification models (pp. 379–393). Springer International Publishing.
38. R Development Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
39. Rosenblatt M. (1956). A central limit theorem and a strong mixing condition. Proceedings of the National Academy of Sciences of the United States of America, 42(1), 43–47. 10.1073/pnas.42.1.43
40. Rupp A. A., Templin J., Henson R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
41. Shealy R., Stout W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194. 10.1007/bf02294572
42. Svetina D., Dai S., Wang X. (2017). Use of cognitive diagnostic model to study differential item functioning in accommodations. Behaviormetrika, 44(2), 313–349. 10.1007/s41237-017-0021-0
43. Swaminathan H., Rogers H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. 10.1111/j.1745-3984.1990.tb00754.x
44. Wang Z., Guo L., Bian Y. (2014). Comparison of DIF detecting methods in cognitive diagnostic test. Acta Psychologica Sinica, 46(12), 1923–1932. 10.3724/sp.j.1041.2014.01923
45. Wilcox R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). Academic Press.
46. Wilks S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62. 10.1214/aoms/1177732360
47. Woods C. M., Cai L., Wang M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3), 532–547. 10.1177/0013164412464875
48. Zhang W. (2006). Detecting differential item functioning using the DINA model (Unpublished doctoral dissertation). The University of North Carolina at Greensboro.
