Abstract
This study proposes a multiple-group cognitive diagnosis model to account for the fact that students in different groups may use distinct attributes or use the same attributes but in different manners (e.g., conjunctive, disjunctive, and compensatory) to solve problems. Based on the proposed model, this study systematically investigates the performance of the likelihood ratio (LR) test and Wald test in detecting differential item functioning (DIF). A forward anchor item search procedure was also proposed to identify a set of anchor items with invariant item parameters across groups. Results showed that the LR and Wald tests with the forward anchor item search algorithm produced better calibrated Type I error rates than the ordinary LR and Wald tests, especially when items were of low quality. A set of real data were also analyzed to illustrate the use of these DIF detection procedures.
Keywords: cognitive diagnosis, differential item functioning, DIF, forward anchor item search, likelihood ratio, Wald test
Introduction
Cognitively diagnostic assessments (CDAs; Nichols et al., 1995) aim to provide students with diagnostic feedback by analyzing their responses to test items. To ensure that the feedback is psychometrically valid and sound, many cognitive diagnosis models (CDMs) have been proposed, including the deterministic inputs, noisy “and” gate (DINA; Haertel, 1989) model, the deterministic inputs, noisy “or” gate (DINO; Templin & Henson, 2006) model, and the generalized deterministic inputs, noisy “and” gate (G-DINA) model (de la Torre, 2011), to name a few. CDMs are restricted latent class models, where the latent variables are typically binary, representing the presence and absence of attributes of interest. The estimated attribute profile characterizes the strengths and weaknesses of the student and thus may be used for personalized learning.
Despite a large number of CDMs available, most of them assume that all students come from the same population, which may not be the case in practice. A few researchers have suggested using multiple-group models to handle sample heterogeneity (e.g., George & Robitzsch, 2014; Xu & Davier, 2008). The multiple-group models allow the comparison between students from different groups, such as countries or genders (Johnson et al., 2013), and may be used for detecting differential item functioning (DIF; for example, George & Robitzsch, 2014) and accommodating missing responses (Rose et al., 2017).
To unlock the potential of CDMs, many important statistical routines are needed. One of these routines is the procedure for detecting DIF items. Because DIF is closely related to test fairness, detecting DIF has become a routine task in psychometric analyses. An item is defined as a DIF item when students from different groups with the same ability show different probabilities of success (Magis et al., 2010). Similarly, in the CDM context, an item is said to function differently when the probability of success on an item differs across manifest groups of students with the same attribute profile (Hou et al., 2014). The presence of DIF items has been viewed as a potential threat to test validity and can worsen attribute estimation (Paulsen et al., 2020).
To date, only a few DIF detection procedures for CDMs have been investigated. Zhang (2006) investigated the performance of Mantel–Haenszel (MH; Holland & Thayer, 1988) and SIBTEST (Shealy & Stout, 1993) in DIF detection by matching students on their test scores, true scores, and attribute profiles from the DINA model. However, the attribute profiles for different groups were not estimated separately, which could yield biased estimates when DIF items exist. Also, the MH and SIBTEST performed poorly in detecting nonuniform DIF (Zhang, 2006). F. Li (2008) modified the higher-order DINA model (de la Torre & Douglas, 2004) to separate construct-relevant DIF from construct-irrelevant DIF. The higher-order DINA model was estimated without equality constraints on item parameters across the reference and focal groups, and DIF was then investigated through the marginalized differences in the probabilities of success on an item. However, under some conditions, Type I error rates were out of control. Hou et al. (2014) proposed using the Wald test to detect both uniform and nonuniform DIF in the DINA model and found that the Wald test, which performed as well as, if not better than, the MH and SIBTEST methods, suffered from inflated Type I error rates when items were of low quality. Hou et al. (2020) have recently used the Wald test for detecting DIF under the G-DINA model. Liu et al. (2019) examined the performance of the Wald test with different types of covariance matrix in DIF detection and found that the covariance matrix estimated using the complete information approach (Philipp et al., 2018) produced better calibrated Type I error rates than the item-wise information matrix. It should also be noted that all these studies (Hou et al., 2014; F. Li, 2008; X. Li & Wang, 2015; Liu et al., 2019; Paulsen et al., 2020; Zhang, 2006) investigated DIF detection based on the DINA model, which is one of the simplest CDMs and may not hold in practice. An exception is X. Li and Wang (2015), who developed a model for DIF detection based on the loglinear CDM (Henson et al., 2009) by introducing additional item parameters. A major limitation of this method is that the model can only be estimated using the Markov chain Monte Carlo (MCMC) algorithm, which can be very time-consuming. Another exception is Svetina et al. (2017), where the reparameterized unified model was fit to the data and the generalized logistic regression method with item purification was used to identify DIF items among accommodated and nonaccommodated groups in the National Assessment for Educational Progress (NAEP). However, Svetina et al. (2017) did not examine the performance of the item purification.
The goal of this study is threefold: (a) to develop a multiple-group generalized deterministic inputs, noisy “and” gate (MG-GDINA) model to relax the conjunctive assumption of the MG-DINA model by Johnson et al. (2013) and George and Robitzsch (2014), (b) to compare the performance of the likelihood ratio (LR) test and the Wald test in detecting DIF based on the MG-GDINA model, and (c) to propose a forward anchor item search (FS) procedure to be used along with the LR and Wald tests for DIF detection. The remainder of this article is laid out as follows. Section “Multiple-Group G-DINA Model” introduces the MG-GDINA model, based on which Section “Detecting DIF Items Using the MG-GDINA Model” presents the LR test and the Wald test for DIF detection. Section “Simulation Study” gives a simulation study evaluating the performance of the proposed procedures, followed by a real data example in Section “Real Data Analysis.” The article concludes with a brief summary of the findings and a discussion of future directions.
Multiple-Group G-DINA Model
Let $\mathbf{X}_i = (X_{i1}, \ldots, X_{iJ})'$ be a vector of binary responses of student $i$ to $J$ items measuring $K$ binary attributes, and let $\mathbf{X}$ be the responses from $N$ students. The Q-matrix (Tatsuoka, 1983) is a $J \times K$ binary matrix specifying which attributes are involved in answering each item. Specifically, element $q_{jk} = 1$ if item $j$ requires attribute $k$, and $q_{jk} = 0$ otherwise. In addition, the $K$ attributes produce $2^{K}$ latent classes, each having a unique attribute profile. The attribute profile of latent class $l$ is denoted by $\boldsymbol{\alpha}_l = (\alpha_{l1}, \ldots, \alpha_{lK})'$, where $\alpha_{lk} \in \{0, 1\}$ and $l = 1, \ldots, 2^{K}$.
The G-DINA model is a generalized DINA model developed by de la Torre (2011). For item $j$, the G-DINA model collapses the $2^{K}$ latent classes into $2^{K_j^{*}}$ latent groups, each having a distinct probability of success, where $K_j^{*}$ is the number of attributes required by item $j$. For notational convenience, the first $K_j^{*}$ attributes can be assumed to be the required attributes for item $j$, and $\boldsymbol{\alpha}_{lj}^{*}$ is denoted as the reduced attribute profile consisting of the columns of the required attributes, where $l = 1, \ldots, 2^{K_j^{*}}$. The conditional probability of success on item $j$ for a student with reduced attribute profile $\boldsymbol{\alpha}_{lj}^{*}$ is denoted by $P(\boldsymbol{\alpha}_{lj}^{*}) = P(X_j = 1 \mid \boldsymbol{\alpha}_{lj}^{*})$, which is given by

$$f[P(\boldsymbol{\alpha}_{lj}^{*})] = \delta_{j0} + \sum_{k=1}^{K_j^{*}} \delta_{jk}\alpha_{lk} + \sum_{k'=k+1}^{K_j^{*}} \sum_{k=1}^{K_j^{*}-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12\ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}, \quad (1)$$

where $f(\cdot)$ is the identity, log, or logit link function, $\delta_{j0}$ is the intercept, $\delta_{jk}$ is the main effect due to $\alpha_{lk}$, $\delta_{jkk'}$ is the two-way interaction effect due to $\alpha_{lk}$ and $\alpha_{lk'}$, and $\delta_{j12\ldots K_j^{*}}$ is the interaction effect due to $\alpha_{l1}$ through $\alpha_{lK_j^{*}}$. The G-DINA model is a saturated model and subsumes several widely used reduced CDMs, including the DINA model, the DINO model, and the additive CDM, among others. For more details, please refer to de la Torre (2011).
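To make Equation 1 concrete, the short R sketch below computes the success probabilities of a two-attribute item under the identity link from a set of delta parameters. The parameter values are hypothetical and chosen only for illustration.

```r
# Success probabilities of a two-attribute item under the identity-link G-DINA model.
# delta: c(intercept, main effect of alpha1, main effect of alpha2, interaction)
# Hypothetical values chosen only for illustration.
delta <- c(0.10, 0.35, 0.25, 0.20)

# All reduced attribute profiles for an item requiring two attributes
alpha_star <- expand.grid(alpha1 = 0:1, alpha2 = 0:1)

# Equation 1 with the identity link:
# P(alpha*) = d0 + d1*a1 + d2*a2 + d12*a1*a2
p_success <- with(alpha_star,
                  delta[1] + delta[2] * alpha1 + delta[3] * alpha2 +
                  delta[4] * alpha1 * alpha2)

cbind(alpha_star, P = p_success)
#> yields .10, .45, .35, and .90 for (0,0), (1,0), (0,1), and (1,1), respectively
```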
The MG-GDINA model is a straightforward extension of the G-DINA model that accounts for multiple groups. It assumes that different groups may have different q-vectors and item parameters for item $j$. Suppose the first $K_{jg}^{*}$ attributes are measured by item $j$ for group $g$ and $\boldsymbol{\alpha}_{ljg}^{*}$ is the reduced attribute profile for group $g$. The item response function can be written as

$$f[P(X_j = 1 \mid \boldsymbol{\alpha}_{ljg}^{*}, g)] = \delta_{j0g} + \sum_{k=1}^{K_{jg}^{*}} \delta_{jkg}\alpha_{lk} + \sum_{k'=k+1}^{K_{jg}^{*}} \sum_{k=1}^{K_{jg}^{*}-1} \delta_{jkk'g}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12\ldots K_{jg}^{*}g} \prod_{k=1}^{K_{jg}^{*}} \alpha_{lk}, \quad (2)$$

where $\boldsymbol{\delta}_{jg}$ collects the item parameters of group $g$ for item $j$. For common items between groups $g$ and $g'$ with the same q-vector and item parameters, it is straightforward to specify $\boldsymbol{\delta}_{jg} = \boldsymbol{\delta}_{jg'}$. The item parameters of the MG-GDINA model can be estimated using marginalized maximum likelihood estimation with the expectation–maximization (EM) algorithm (Bock & Aitkin, 1981). With the estimated item parameters, person parameters can be obtained using the expected a posteriori (EAP) method.
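As a practical note, the GDINA R package cited later in this article supports multiple-group estimation. The sketch below assumes a response matrix `dat`, a Q-matrix `Q`, and a group membership vector `gr` are already in the workspace and illustrates one plausible way to fit a two-group G-DINA model and obtain EAP attribute estimates; argument names should be checked against the package documentation.

```r
library(GDINA)

# dat: N x J binary response matrix; Q: J x K Q-matrix; gr: group labels (length N).
# These objects are assumed to exist; the call below is a sketch of the
# multiple-group G-DINA analysis described in the text, not a verbatim recipe.
fit <- GDINA(dat = dat, Q = Q, model = "GDINA", group = gr)

coef(fit)                              # item parameters, by group where freed
eap <- personparm(fit, what = "EAP")   # EAP estimates of attribute profiles
head(eap)
```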
Detecting DIF Items Using the MG-GDINA Model
This section presents how the LR test and the Wald test can be used to identify DIF items based on the MG-GDINA model. Although the MG-GDINA model can be used for more than two groups, in this article DIF detection is considered for only two groups, namely, the reference group and the focal group, as in previous studies (e.g., Hou et al., 2014, 2020; Zhang, 2006). It is, however, straightforward to extend the procedures to three or more groups.
LR Test for DIF Detection
The LR test used in the item response theory framework for DIF detection (IRT LR-DIF; for example, A. S. Cohen et al., 1996) can be theoretically used in conjunction with the aforementioned MG-GDINA model without any major modifications. Specifically, when it is unclear which items are DIF-free, the common practice is to fit data using two models: (a) a simpler model that treats all items as anchor items and (b) an augmented model that treats all items except the studied one as anchor items. The LR statistic can be calculated from the observed likelihoods of these two models. A limitation of this procedure is that the simpler model may not fit data well when some DIF items are assumed to be DIF-free (Wang & Yeh, 2003), which could yield an LR statistic deviating from its theoretical distribution (Maydeu-Olivares & Cai, 2006). To address this issue, some strategies have been proposed using a single DIF item or only a few DIF-free items to link two groups (González-Betanzos & Abad, 2012).
However, in CDMs, different groups are on the same scale naturally,1 and assuming some items are DIF-free is not necessary. Because of this, this study modifies the IRT LR-DIF procedure for DIF detection in CDMs. Specifically, two MG-GDINA models are fitted to the data. The simpler model constrains the item parameters of the studied item to be equal across groups while allowing the item parameters of all other items to be freely estimated across groups; its marginalized log-likelihood is denoted by $\ell_S$ and its number of parameters by $p_S$. In contrast, the augmented model allows the item parameters of all items to vary across groups; its marginalized log-likelihood is denoted by $\ell_A$ and its number of parameters by $p_A$. The LR statistic is

$$LR = 2(\ell_A - \ell_S), \quad (3)$$

which is asymptotically $\chi^{2}$ distributed with $p_A - p_S$ degrees of freedom.
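Assuming the two models have already been fitted, the LR test reduces to a one-line computation. In the sketch below, `ll_simple`, `ll_aug`, `p_simple`, and `p_aug` are hypothetical placeholders for the marginalized log-likelihoods and parameter counts extracted from the two fitted MG-GDINA models.

```r
# Hypothetical values extracted from the two fitted models (placeholders only)
ll_simple <- -10234.6   # log-likelihood, studied item constrained across groups
ll_aug    <- -10228.1   # log-likelihood, all items free across groups
p_simple  <- 151        # number of parameters in the simpler model
p_aug     <- 155        # number of parameters in the augmented model

LR <- 2 * (ll_aug - ll_simple)   # Equation 3
df <- p_aug - p_simple           # extra parameters of the augmented model
p_value <- pchisq(LR, df = df, lower.tail = FALSE)
c(LR = LR, df = df, p = p_value)
```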
Wald Test for DIF Detection
The Wald test (Wald, 1943) is a widely used hypothesis test in statistics. In the context of CDMs, it has been used for comparing nested models (de la Torre, 2011; de la Torre & Lee, 2013; Ma & de la Torre, 2019a; Ma et al., 2016), detecting DIF (George & Robitzsch, 2014; Hou et al., 2014, 2020; Liu et al., 2019), and validating the Q-matrix empirically (Ma & de la Torre, 2019b; Terzi, 2017; Terzi & Sen, 2019). To detect DIF items using the Wald test, Hou et al. (2014) calibrated the data for each group separately, whereas in this study the MG-GDINA model is adopted, which calibrates multiple groups concurrently. Unlike George and Robitzsch (2014), because the two groups are automatically on the same scale, the parameters of all items were allowed to vary across groups. The Wald test is then conducted for each studied item one by one. Let $\hat{\boldsymbol{\beta}}_j = (\hat{\boldsymbol{\beta}}_{jR}', \hat{\boldsymbol{\beta}}_{jF}')'$ denote the stacked item parameter estimates of item $j$ for the reference and focal groups. A restriction matrix $\mathbf{R}_j$ is needed for the Wald test; it can be created by horizontally concatenating two diagonal matrices of order $2^{K_j^{*}}$ with 1 and −1 as diagonal elements, respectively, that is, $\mathbf{R}_j = [\mathbf{I}, -\mathbf{I}]$. The Wald statistic is defined as

$$W_j = [\mathbf{R}_j \hat{\boldsymbol{\beta}}_j]' [\mathbf{R}_j \widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}_j) \mathbf{R}_j']^{-1} [\mathbf{R}_j \hat{\boldsymbol{\beta}}_j], \quad (4)$$

where $\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}_j)$ is the covariance matrix of $\hat{\boldsymbol{\beta}}_j$, which is of dimensions $2^{K_j^{*}+1} \times 2^{K_j^{*}+1}$. The Wald statistic $W_j$ is asymptotically $\chi^{2}$ distributed with $2^{K_j^{*}}$ degrees of freedom.
Hou et al. (2014) calculated the covariance matrix by inverting the information matrix for each item separately and ignored the population proportion parameters. The resulting Wald test has been found to be too liberal, especially when the sample size is small and items are of low quality (Hou et al., 2014). Philipp et al. (2018) showed that the covariance matrix can be better estimated using an outer-product of gradients (OPG) method when all parameters are taken into consideration. Liu et al. (2019) also showed that the Wald test based on the OPG method with all parameters produced better calibrated Type I error rates for DIF detection based on the DINA model. Therefore, this study considers all item and structural parameters when calculating the covariance matrix, and $\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}_j)$ is the submatrix associated with item $j$.
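The sketch below illustrates the computation in Equation 4 for a single-attribute item (two success probabilities per group). The parameter estimates and covariance matrix are hypothetical placeholders; in practice they would be taken from the fitted MG-GDINA model, with the covariance matrix obtained from the complete-information OPG estimator described above.

```r
# Stacked item parameters for a single-attribute studied item:
# (P_R(0), P_R(1), P_F(0), P_F(1)); hypothetical placeholder values
beta_j <- c(0.22, 0.85, 0.30, 0.80)

# Hypothetical 4 x 4 covariance matrix of beta_j, e.g., the item's submatrix
# of the complete-information OPG covariance matrix
V_j <- diag(c(0.0009, 0.0007, 0.0010, 0.0008))

# Restriction matrix [I, -I], so R %*% beta_j gives the between-group differences
m   <- length(beta_j) / 2
R_j <- cbind(diag(m), -diag(m))

d <- R_j %*% beta_j
W <- drop(t(d) %*% solve(R_j %*% V_j %*% t(R_j)) %*% d)   # Equation 4
p_value <- pchisq(W, df = m, lower.tail = FALSE)
c(W = W, df = m, p = p_value)
```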
DIF-Free Item Identification
Note that the aforementioned DIF detection methods based on the LR and Wald statistics do not assume any items to be DIF-free, but it is likely that not all items in a test exhibit DIF. Specifying items that are DIF-free as anchor items may stabilize parameter estimation and in turn improve the performance of the LR and Wald statistics in detecting DIF items. A related procedure that has been widely used in the IRT context is the item or scale purification (e.g., Clauser et al., 1993). Despite a number of variants, the purification usually treats all items as anchor items at the beginning of the process to obtain comparable ability scales for two groups and then removes items that exhibited DIF in each iteration from the anchor set to obtain a “purified” scale. The purification is not used in this study because, unlike IRT models, parameters of CDMs from two groups are naturally on the same scale and viewing all items as DIF-free is unnecessary. An FS algorithm is introduced below, which shares the same goal as the purification, that is, to identify a set of DIF-free anchor items, but starts by assuming none of the items is DIF-free. Compared with the purification, which can be viewed as a “backward” search algorithm, the FS algorithm has the potential to remove the impact of including DIF items in the anchor set.
LR test with FS
To detect DIF items using the LR test with the FS procedure (denoted by LR-FS for short), the aforementioned LR-DIF method is conducted first, based on which let $\mathcal{S}^{(0)}$ be the initial set of items estimated to be DIF-free. The LR-FS algorithm is an iterative procedure: at the $t$th iteration, the DIF status of each item is assessed one by one, and two MG-GDINA models are defined for each studied item $j$. In particular, when the studied item $j \notin \mathcal{S}^{(t-1)}$, the simpler model assumes that the studied item and the items in $\mathcal{S}^{(t-1)}$ have invariant item parameters across groups, whereas the augmented model assumes that only the items in $\mathcal{S}^{(t-1)}$ have invariant item parameters. When $j \in \mathcal{S}^{(t-1)}$, the simpler model assumes that all items in $\mathcal{S}^{(t-1)}$ have invariant item parameters across groups, whereas the augmented model assumes that all items in $\mathcal{S}^{(t-1)}$ except the studied one have invariant item parameters. The LR statistic is calculated using Equation 3 for each studied item, and all items estimated to be DIF-free after the $t$th iteration are indexed in $\mathcal{S}^{(t)}$. The LR-FS algorithm terminates when $\mathcal{S}^{(t)}$ does not differ from $\mathcal{S}^{(t-1)}$, when $\mathcal{S}^{(t)}$ is empty, or when the maximum number of iterations allowed is reached.
Wald test with FS
To detect DIF items using the Wald test with the FS procedure (denoted by Wald-FS for short), the aforementioned Wald-DIF method is conducted first, based on which let $\mathcal{S}^{(0)}$ be the initial set of items estimated to be DIF-free. The Wald-FS algorithm is also an iterative procedure: at the $t$th iteration, the DIF status of each item is assessed one by one. In particular, when the studied item $j \notin \mathcal{S}^{(t-1)}$, an MG-GDINA model is estimated by treating all items in $\mathcal{S}^{(t-1)}$ as anchor items (i.e., their parameters do not vary across groups), whereas when the studied item $j \in \mathcal{S}^{(t-1)}$, an MG-GDINA model is estimated by treating all items in $\mathcal{S}^{(t-1)}$ except the studied one as anchor items. The Wald test is then conducted using Equation 4 for the studied item, and all items estimated to be DIF-free after the $t$th iteration are indexed in $\mathcal{S}^{(t)}$. Similar to the LR-FS procedure, the Wald-FS algorithm terminates when $\mathcal{S}^{(t)}$ does not differ from $\mathcal{S}^{(t-1)}$, when $\mathcal{S}^{(t)}$ is empty, or when the maximum number of iterations allowed is reached.
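The FS loop common to the LR-FS and Wald-FS procedures is outlined in the sketch below. The function `dif_stat()` is a hypothetical placeholder standing for either the LR test of Equation 3 or the Wald test of Equation 4 applied to one studied item given the current anchor set; everything else is plain R bookkeeping.

```r
# Forward anchor item search (FS): a schematic sketch.
# dif_stat(item, anchors) is a hypothetical placeholder that refits the
# MG-GDINA model(s) as described in the text and returns the p value of the
# LR or Wald test for `item` given the current anchor set `anchors`.
forward_search <- function(J, initial_anchors, dif_stat,
                           alpha = .05, max_iter = 10) {
  anchors <- initial_anchors              # S^(0): items flagged DIF-free by LR-/Wald-DIF
  for (t in seq_len(max_iter)) {
    p_values <- vapply(seq_len(J), function(j) dif_stat(j, anchors), numeric(1))
    new_anchors <- which(p_values > alpha)   # S^(t): items not flagged as DIF
    if (length(new_anchors) == 0L) break     # empty anchor set: stop
    if (setequal(new_anchors, anchors)) {    # anchor set unchanged: converged
      anchors <- new_anchors
      break
    }
    anchors <- new_anchors
  }
  list(anchors = anchors, dif_items = setdiff(seq_len(J), anchors))
}
```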
Simulation Study
Design
A simulation study was conducted to assess the performance of the LR-DIF, the LR-FS, the Wald-DIF, and the Wald-FS. Five factors were manipulated.
Type of DIF
Both uniform and nonuniform DIF were considered. An item is said to exhibit uniform DIF when it favors one group relative to the other consistently for all attribute profiles, or nonuniform DIF when it does not consistently favor a certain group. In particular, for simulation purposes, if item $j$ exhibited nonuniform DIF, students in the reference group with some randomly selected attribute profiles were assumed to have higher success probabilities, whereas students in the focal group with the remaining attribute profiles had higher success probabilities.
DIF magnitude
The DIF magnitude for item $j$ is defined as the absolute difference in success probabilities between the reference and focal groups across all latent groups, that is, $d_j = |P_R(\boldsymbol{\alpha}_{lj}^{*}) - P_F(\boldsymbol{\alpha}_{lj}^{*})|$ for all $l$. Like Hou et al. (2014), two DIF sizes were considered, representing small and large DIF magnitudes, respectively.
Percentage of DIF items
Similar to Paulsen et al. (2020) and Qiu et al. (2019), this study considered that 0%, 20%, and 40% of items exhibited DIF. The DIF items were randomly selected from all possible items with the constraint that one-third of the DIF items required a single attribute, one-third required two attributes, and the remaining one-third required three attributes.
Sample size per group
This study considered three levels of sample size for each group: N = 500, 1,000, and 2,000. The first two levels were in line with Hou et al. (2014), and the third level was included because the G-DINA model is more complex than the DINA model used in Hou et al. (2014). These levels are also in line with the review of 36 CDM applications by Sessoms and Henson (2018), in which the mean and median sample sizes were 1,788 and 1,255, respectively, and 30% of the studies involved samples of 2,000 or more participants.
Item quality
Similar to Ma et al. (2016), item quality had three levels, defined by the success probabilities of students who have mastered none, $P(\mathbf{0})$, and all, $P(\mathbf{1})$, of the required attributes; the same values were used for all items within a condition, representing high, moderate, and low item quality, respectively.
In addition to the manipulated factors, other factors were fixed to keep the simulation manageable. In particular, the test length was fixed at 30 items, and the number of attributes was also held constant. The numbers of items requiring one, two, and three attributes were equal: there were 10 single-attribute items, 10 two-attribute items, and 10 three-attribute items. The Q-matrix, given in the Online Appendix, is balanced and has been used in several previous studies (Hou et al., 2014; Ma et al., 2016). Like Hou et al. (2014), students' attribute profiles were generated with equal probabilities from a discrete uniform distribution. Based on the G-DINA model, the success probabilities for reduced attribute profiles other than $\mathbf{0}$ and $\mathbf{1}$ were simulated randomly subject to the monotonicity constraint that $P(\boldsymbol{\alpha}^{*}) \geq P(\boldsymbol{\alpha}^{*\prime})$ whenever $\boldsymbol{\alpha}^{*}$ contains all the attributes mastered in $\boldsymbol{\alpha}^{*\prime}$.
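A minimal base-R sketch of the data-generating logic for one item is given below. The group sizes, success probabilities, and DIF shift are hypothetical placeholders, not the exact values used in the study (which used the GDINA package for simulation, as noted later).

```r
set.seed(123)

# Hypothetical settings: two groups of equal size, one two-attribute item
N_per_group <- 1000
dif_size    <- 0.10   # uniform DIF shift (placeholder value)

# Attribute profiles drawn with equal probability (discrete uniform over profiles)
profiles <- expand.grid(a1 = 0:1, a2 = 0:1)
draw_profiles <- function(n) profiles[sample(nrow(profiles), n, replace = TRUE), ]
ref <- draw_profiles(N_per_group)
foc <- draw_profiles(N_per_group)

# Monotone success probabilities for the reference group (hypothetical values):
# P(0,0) <= P(1,0) <= P(1,1) and P(0,0) <= P(0,1) <= P(1,1)
p_ref <- c(`00` = 0.20, `10` = 0.55, `01` = 0.45, `11` = 0.80)
# Uniform DIF: focal group probabilities shifted down by dif_size for every profile
p_foc <- p_ref - dif_size

prof_id <- function(d) paste0(d$a1, d$a2)
x_ref <- rbinom(N_per_group, 1, p_ref[prof_id(ref)])   # reference-group responses
x_foc <- rbinom(N_per_group, 1, p_foc[prof_id(foc)])   # focal-group responses
```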
In sum, this study consists of 3 (Sample Size) × 3 (Item Quality) = 9 conditions without any DIF items and 2 (Type of DIF) × 3 (Sample Size) × 3 (Item Quality) × 2 (Proportion of DIF Items) × 2 (DIF Sizes) = 72 conditions with some DIF items. Under each condition, 300 data sets were generated and four DIF detection procedures were carried out.
Analysis
To assess the performance of these four procedures in detecting DIF items, the following two criteria were considered.
Type I error
Type I error rates were calculated as the proportion of DIF-free items that were incorrectly flagged as DIF items. Note that nine conditions were considered in which all items in each replication were DIF-free and 72 conditions in which only a portion of the items was DIF-free. In either case, the Type I error rate was calculated for each DIF-free item and then averaged across all DIF-free items measuring the same number of attributes. The observed Type I error rates are not expected to equal the nominal level exactly because of sampling error; under a nominal level of $\alpha$ with $R$ replications, they have a 95% chance of falling within $\alpha \pm 1.96\sqrt{\alpha(1-\alpha)/R}$. With 300 replications, the observed Type I error rates are therefore expected to fall within approximately (.025, .075) with a 95% chance at the .05 alpha level.
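The sampling interval quoted above follows from the binomial standard error of an observed rejection rate; the short check below reproduces it.

```r
alpha <- 0.05   # nominal level
R     <- 300    # number of replications
half_width <- 1.96 * sqrt(alpha * (1 - alpha) / R)
round(alpha + c(-1, 1) * half_width, 3)
#> approximately 0.025 and 0.075
```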
Empirical power
Statistical power indicates the performance of a hypothesis test in rejecting a false null hypothesis. To compare statistical power rates of different procedures, all procedures should have comparable observed Type I error rates. However, this is not the case in this study as can be observed in Section “Results.” Consequently, the empirical power rates calculated from the empirical distributions under the null hypothesis were examined. In particular, the 95th percentile of the test statistic of each procedure was calculated under the null condition where all items were DIF-free and used as the empirical cutoff. The empirical power rate, which was calculated for each test under each condition, is defined as the percentage of obtained test statistics that were greater than the empirical cutoff under the same condition. The empirical power rates were also averaged across all items requiring the same number of attributes under each condition. As in de la Torre and Lee (2013), a test power of .8 or above is considered adequate.
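In code, the empirical cutoff and power computation amount to a quantile of the null-condition statistics and the proportion of DIF-condition statistics above it. The vectors `stat_null` and `stat_dif` below are placeholders for the test statistics of an item collected across replications under the matching null and DIF conditions; random chi-square draws merely stand in for them here.

```r
# stat_null: test statistics from replications where the item is DIF-free (stand-in)
# stat_dif : test statistics from replications where the item exhibits DIF (stand-in)
stat_null <- rchisq(300, df = 2)
stat_dif  <- rchisq(300, df = 2, ncp = 8)

cutoff <- quantile(stat_null, probs = .95)   # empirical 95th percentile cutoff
empirical_power <- mean(stat_dif > cutoff)   # proportion exceeding the cutoff
empirical_power
```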
The Wald test and the LR test were performed at the .05 alpha level. The maximum number of iterations for the FS algorithm was set at 10. Data simulation and DIF detection were implemented using the GDINA R package (Ma & de la Torre, 2020), and the sample code can be downloaded from https://doi.org/10.17605/OSF.IO/3579Y. To better understand the results, mixed analyses of variance (ANOVAs) were performed for each criterion using the R package rstatix (Kassambara, 2020). To examine the sizes of different effects, the generalized eta squared, denoted by $\eta_G^2$, was calculated, which has been recommended for mixed ANOVA (Bakeman, 2005; Olejnik & Algina, 2003). Following the guideline in J. Cohen (2013), an effect is considered nontrivial or essentially meaningful when $\eta_G^2 \geq .01$; more specifically, $\eta_G^2$ values of approximately .01, .06, and .14 indicate small, medium, and large effects, respectively.
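A plausible form of the mixed ANOVA call is sketched below using rstatix. The data frame `results`, its column names, and the factor layout are hypothetical and shown only to indicate how the generalized eta squared (ges) effect sizes referenced in the Results section could be obtained; argument names should be verified against the rstatix documentation.

```r
library(rstatix)

# `results` is a hypothetical long-format data frame with one row per
# replication-by-method combination: a replication identifier (id),
# between-replication design factors (sample_size, item_quality, dif_magnitude),
# the within-replication factor (method), and the outcome (type1_error).
aov_res <- anova_test(
  data        = results,
  dv          = type1_error,
  wid         = id,
  between     = c(sample_size, item_quality, dif_magnitude),
  within      = method,
  effect.size = "ges"          # generalized eta squared
)
get_anova_table(aov_res)
```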
Results
Type I error rates
Type I error rates calculated under the conditions where all items were DIF-free and under the conditions where some items exhibited DIF are presented separately. In particular, Figure 1 shows the observed Type I error rates when all items were DIF-free. It can be observed that, when items were of high or moderate quality, all procedures generally maintained the observed Type I error rates within a reasonable range around the nominal level, especially under large sample conditions. In particular, as shown in Figure 1, the Wald-DIF and Wald-FS produced averaged observed Type I error rates within the (.025, .075) interval with only one exception. The LR-DIF and LR-FS produced observed Type I error rates that were slightly inflated when N = 500 or 1,000, but well calibrated when N = 2,000. The observed Type I error rates when some items exhibited DIF showed similar patterns, as shown in Figure 2, except that the Wald-DIF tended to produce observed Type I error rates below the nominal level when items were of high quality and the sample size was small.
Figure 1.
Observed Type I error rates when all items were DIF-free.
Note. DIF = differential item functioning; LR = likelihood ratio; FS = forward anchor item search.
Figure 2.
Observed Type I error rates when some items exhibited DIF.
Note. DIF = differential item functioning; LR = likelihood ratio; FS = forward anchor item search.
To analyze the impact of the design factors on the observed Type I error rates, mixed ANOVAs were employed. The ANOVA tables and the plots of nontrivial interactions are given in the Online Appendix. The highest-order nontrivial interaction was the three-way interaction of Sample Size × Item Quality × Number of Attributes, both when all items were DIF-free and when some items exhibited DIF. It can be observed from the interaction plots that the Type I error rates were in general well controlled when items were of high or moderate quality, but inflated dramatically when items were of low quality, especially when, at the same time, the sample size was small or the number of required attributes was large. In addition, the two-way interaction of Item Quality × Method had a large effect when all items were DIF-free and a medium effect when some items exhibited DIF. In particular, when items were of high or moderate quality, the Wald-DIF and Wald-FS methods had averaged Type I error rates close to the nominal level, whereas the LR-DIF and LR-FS methods had slightly inflated Type I error rates. However, when items were of low quality, the LR-FS method outperformed the other investigated methods in controlling the Type I error rates, and the Wald-DIF method was the worst option.
Empirical power
The empirical power rates were analyzed using mixed ANOVA, and Figure 3 displays the empirical power rates of the four DIF detection methods at combinations of the factors with nontrivial effects. Results showed that the DIF detection method had a small main effect and a small interaction with item quality. In particular, the LR-FS performed the best across all levels of item quality (averaged empirical power rates = .81, .61, and .36 for high, moderate, and low item quality, respectively), whereas the Wald-DIF performed the worst (averaged empirical power rates = .77, .58, and .24 for high, moderate, and low item quality, respectively). The LR-DIF procedure performed similarly to the LR-FS when items were of high quality, but deteriorated dramatically as item quality degraded. The Wald-FS performed similarly to the Wald-DIF when items were of high quality, but outperformed the latter as item quality worsened. In general, the LR-based methods tended to outperform the Wald test–based methods when items were of high quality, and the FS algorithm only improved the empirical power rates when items were of moderate or low quality.
Figure 3.
Empirical power rates.
Note. DIF = differential item functioning; LR = likelihood ratio; FS = forward anchor item search.
The mixed ANOVA also revealed two nontrivial three-way interactions, namely, Item Quality × DIF Magnitude × Number of Attributes Measured and Item Quality × DIF Magnitude × Sample Size. As can be observed from the interaction plots in the Online Appendix, the lines in these plots do not cross, so the main effects of these factors can be roughly interpreted for simplicity. In particular, the empirical power rates of the four procedures dropped as item quality degraded (averaged empirical power rates = .79, .59, and .29 for high, moderate, and low item quality, respectively), the number of attributes measured increased (averaged empirical power rates = .63, .55, and .47 for items measuring one, two, and three attributes, respectively), the sample size decreased (averaged empirical power rates = .36, .56, and .75 for N = 500, 1,000, and 2,000, respectively), or the DIF magnitude decreased (averaged empirical power = .73 under large DIF magnitude conditions and .38 under small DIF magnitude conditions). The type of DIF had a small effect (mean empirical power = .58 for uniform DIF and .53 for nonuniform DIF) and had only trivial interactions with other factors. The proportion of DIF items had a trivial impact on the empirical power.
Finally, the conditions under which adequate power rates can be expected cannot be easily summarized. As shown in Figure 3, the four procedures may only be able to detect DIF items of small DIF magnitude with adequate power when items were of high quality and the sample size was large. In contrast, some procedures were more likely to correctly detect DIF items of large DIF magnitude even under less favorable conditions. For example, under the uniform DIF and large sample conditions, the Wald-FS and LR-FS yielded adequate power rates (i.e., ≥ .8) even when items were of low quality and the number of attributes measured was three.
Real Data Analysis
To illustrate the use of the Wald and LR tests in detecting DIF items in practice, a set of real data were analyzed, which are part of a larger data set obtained from a Dutch-language version of the Millon Clinical Multiaxial Inventory-III, a self-report clinical instrument (Millon et al., 2009; Rossi et al., 2007). For the current illustration, 30 items that were examined in Ma et al. (2016) were analyzed, with three clinical scales or attributes, namely, somatoform (Scale H), thought disorder (Scale SS), and major depression (Scale CC). Ma et al. (2016) only analyzed the item responses of male respondents, but in this study the responses of 471 female respondents and 739 male respondents were analyzed using the aforementioned Wald and LR tests with and without the FS procedure. The Q-matrix can be found in Ma et al. (2016).
Table 1 gives the test statistics and p values for the items that were flagged by at least one of the four DIF detection methods. Note that the p values were adjusted using the Holm (1979) method to control the familywise error rate at the .05 nominal level for multiple comparisons. It can be observed that 11 items were flagged by all four methods, and that the LR-DIF flagged the most DIF items (i.e., 15), whereas the Wald-FS method flagged the fewest (i.e., 11). In addition, the four DIF detection methods produced inconsistent results for Items 2, 20, 21, and 24. Based on the simulation study, the LR-FS performed relatively well when the sample size of each group was small. Therefore, the items identified as DIF-free by the LR-FS method were used as anchor items for the recalibration of the data. Figure 4 displays the estimated endorsement probabilities of female and male respondents for Items 2, 20, and 21. It can be observed that all three items exhibited uniform DIF, where male respondents appear to have lower endorsement probabilities than female respondents after controlling for their latent attribute profiles. Figures for the other items can be found in the Online Appendix; all DIF items identified by the LR-FS method exhibited uniform DIF (i.e., the female group had higher endorsement probabilities), with the only exception of Item 5.
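For reference, the Holm adjustment applied to the item-level p values is available in base R; the vector below is a hypothetical stand-in for the unadjusted p values of the 30 items, and the anchor-set construction mirrors the recalibration step described above.

```r
# raw_p: unadjusted p values of the item-level DIF tests (hypothetical stand-in)
raw_p <- runif(30, 0, 0.2)

adj_p   <- p.adjust(raw_p, method = "holm")      # familywise error control
flagged <- which(adj_p < .05)                    # items flagged as DIF
anchors <- setdiff(seq_along(adj_p), flagged)    # DIF-free items used as anchors
```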
Table 1.
DIF Detection Results Based on Different Methods.
| Item no. | df | Wald-DIF statistic | Wald-DIF p value | Wald-FS statistic | Wald-FS p value | LR-DIF statistic | LR-DIF p value | LR-FS statistic | LR-FS p value |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 22.434 | .004 | 13.271 | .180 | 22.596 | .003 | 21.648 | .004 |
| 5 | 4 | 23.449 | .002 | 22.591 | .004 | 31.928 | .000 | 31.380 | .000 |
| 7 | 2 | 18.595 | .002 | 16.431 | .007 | 20.597 | .001 | 21.671 | .000 |
| 9 | 2 | 21.675 | .001 | 20.656 | .001 | 28.621 | .000 | 27.654 | .000 |
| 11 | 2 | 15.407 | .010 | 13.941 | .021 | 22.925 | .000 | 22.972 | .000 |
| 16 | 2 | 12.781 | .030 | 12.179 | .045 | 17.503 | .003 | 17.833 | .003 |
| 18 | 4 | 28.575 | .000 | 22.037 | .005 | 29.956 | .000 | 29.254 | .000 |
| 20 | 2 | 11.379 | .057 | 10.612 | .094 | 13.520 | .020 | 14.295 | .013 |
| 21 | 4 | 17.895 | .025 | 11.282 | .377 | 21.754 | .004 | 20.293 | .008 |
| 23 | 4 | 23.920 | .002 | 21.134 | .007 | 27.758 | .000 | 26.667 | .001 |
| 24 | 8 | 20.769 | .125 | 18.299 | .325 | 23.452 | .045 | 22.517 | .065 |
| 25 | 2 | 28.411 | .000 | 27.160 | .000 | 34.346 | .000 | 35.721 | .000 |
| 26 | 4 | 26.077 | .001 | 24.533 | .002 | 34.513 | .000 | 34.680 | .000 |
| 27 | 2 | 14.474 | .014 | 14.902 | .013 | 20.684 | .001 | 20.730 | .001 |
| 30 | 2 | 15.096 | .011 | 13.956 | .021 | 21.456 | .001 | 21.958 | .000 |
Note. Bold values represent nonsignificant p values. df is the degrees of freedom for all four hypothesis tests. DIF = differential item functioning; FS = forward anchor item search; LR = likelihood ratio.
Figure 4.
Estimated endorsement probabilities (with standard errors) of Items 2, 20, and 21 for female and male respondents.
The simulation study showed that item quality had a major impact on the performance of the DIF detection procedures. The estimated guessing and slip parameters are given in the Online Appendix; the standard deviations of the guessing and slip parameter estimates were .10 and .21, respectively. Because item quality was less than optimal, the LR-FS method can be expected to have slightly inflated Type I error rates, especially for items of low quality, such as Item 16 (estimated slip = .43 and .56 for the female and male groups, respectively), Item 20 (estimated slip = .45 and .55 for the female and male groups, respectively), and Item 27 (estimated slip = .42 and .50 for the female and male groups, respectively). As a result, these items may warrant close scrutiny by domain experts.
Summary and Discussion
In this study, a multiple-group G-DINA model has been developed, which allows us to model item responses from different groups at the same time by accounting for the fact that students in different groups may solve the problems in distinct manners. Based on the MG-GDINA model, this study focuses on procedures for detecting DIF items using the Wald and LR tests. This study modifies the traditional IRT LR-DIF procedure for detecting DIF using the LR test in CDMs. This study also proposed an FS algorithm that can be used in conjunction with the LR and Wald tests for DIF detection.
The simulation study showed that the Type I error rates of all four procedures were relatively well behaved when items were of high or moderate quality, though the Wald-DIF and Wald-FS tended to be conservative when items were of high quality, the sample size was small, and some items exhibited DIF. The LR-DIF and LR-FS could be slightly liberal when the sample size was small. When items were of low quality, all four procedures, in general, yielded inflated Type I error rates, and the FS procedure becomes particularly important for controlling the inflation of the Type I error for both Wald and LR tests. The LR-FS exhibited better controlled Type I error rates than the Wald-FS when the number of attributes required was 2 or 3, but the Wald-FS method performed slightly better when the number of attributes required was 1. Although none of the procedures outperforms others consistently, the LR-FS method appears a reasonable choice under most conditions.
The Wald test has been well documented to produce inflated Type I error rates for model comparison (de la Torre & Lee, 2013; Ma et al., 2016) and DIF detection (Hou et al., 2014) when items are of poor quality. Although a different approach to estimating the variance–covariance matrix was employed in this study, the Wald test still tended to be liberal when items were of poor quality, though the incorporation of the FS procedure could help control the inflation to some degree. In contrast, although the LR test does not involve the estimation of the variance–covariance matrix, it also produced inflated Type I error rates when item quality was poor.
All procedures exhibited relatively low empirical power in detecting DIF. Acceptable levels of empirical power were observed only under favorable conditions (i.e., large DIF magnitude and sample size, fewer required attributes, high item quality, and uniform DIF), though the LR-FS method tended to perform similarly to, if not better than, the other methods investigated in terms of empirical power rates. It is clear from the simulation study that developing test items of good quality and performing DIF analysis with a relatively large sample are the most important factors for satisfactory DIF detection.
This study contributes to the literature by developing the multiple-group model and by systematically investigating several DIF detection procedures using the proposed model, but it is not without limitations. First, although this study manipulated several important factors, some factors were fixed. In particular, this study only considered DIF detection for two groups and assumed an equal sample size for both groups; students' attribute profiles were simulated from a discrete uniform distribution; and the structure of the Q-matrix, along with the test length and the number of attributes, was also fixed. Researchers may vary some of these factors in future research. In addition, although the findings from this study support the use of the FS procedure, the FS procedure can be time-consuming. This is because it usually involves multiple iterations and, at each iteration, the data need to be calibrated multiple times. Future studies may examine whether the proposed FS procedure can be further simplified.
Supplemental Material
Supplemental material, Online_Appendix for Detecting Differential Item Functioning Using Multiple-Group Cognitive Diagnosis Models by Wenchao Ma, Ragip Terzi and Jimmy de la Torre in Applied Psychological Measurement
Acknowledgments
The authors thank Gina Rossi for access to the data used in the Real Data Analysis section.
Footnotes
1. Attribute estimates for different groups are on the same scale because the interpretation of an attribute estimate is deterministic and the same for all groups (i.e., 0 for nonmastery and 1 for mastery).
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Wenchao Ma
https://orcid.org/0000-0002-6763-0707
Supplemental Material: Supplementary material is available for this article online.
References
- Bakeman R. (2005). Recommended effect size statistics for repeated measures designs. Behavior Research Methods, 37(3), 379–384.
- Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
- Clauser B., Mazor K., Hambleton R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4), 269–279.
- Cohen A. S., Kim S.-H., Wollack J. A. (1996). An investigation of the likelihood ratio test for detection of differential item functioning. Applied Psychological Measurement, 20, 15–26.
- Cohen J. (2013). Statistical power analysis for the behavioral sciences. Academic Press.
- de la Torre J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
- de la Torre J., Douglas J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
- de la Torre J., Lee Y. S. (2013). Evaluating the Wald test for item-level comparison of saturated and reduced models in cognitive diagnosis. Journal of Educational Measurement, 50, 355–373.
- George A. C., Robitzsch A. (2014). Multiple group cognitive diagnosis models, with an emphasis on differential item functioning. Psychological Test and Assessment Modeling, 56(4), 405–432.
- González-Betanzos F., Abad F. J. (2012). The effects of purification and the evaluation of differential item functioning with the likelihood ratio test. Methodology, 8, 134–145.
- Haertel E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26, 301–321.
- Henson R. A., Templin J. L., Willse J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
- Holland P. W., Thayer D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer H., Braun H. I. (Eds.), Test validity (pp. 129–145). Lawrence Erlbaum.
- Holm S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
- Hou L., de la Torre J., Nandakumar R. (2014). Differential item functioning assessment in cognitive diagnostic modeling: Application of the Wald test to investigate DIF in the DINA model. Journal of Educational Measurement, 51, 98–125.
- Hou L., Terzi R., de la Torre J. (2020). Wald test formulations in DIF detection of CDM data with the proportional reasoning test. International Journal of Assessment Tools in Education, 7(2), 145–158.
- Johnson M., Lee Y.-S., Sachdeva R. J., Zhang J., Waldman M., Park J. Y. (2013, April). Examination of gender differences using the multiple groups DINA model [Paper presentation]. Annual meeting of the National Council on Measurement in Education, San Francisco, CA, United States.
- Kassambara A. (2020). rstatix: Pipe-friendly framework for basic statistical tests [Computer software manual] (R Package Version 0.4.0). https://CRAN.R-project.org/package=rstatix
- Li F. (2008). A modified higher-order DINA model for detecting differential item functioning and differential attribute functioning (Unpublished doctoral dissertation). University of Georgia.
- Li X., Wang W.-C. (2015). Assessment of differential item functioning under cognitive diagnosis models: The DINA model example. Journal of Educational Measurement, 52, 28–54.
- Liu Y., Yin H., Xin T., Shao L., Yuan L. (2019). A comparison of differential item functioning detection methods in cognitive diagnostic models. Frontiers in Psychology, 10, Article 1137.
- Ma W., de la Torre J. (2019a). Category-level model selection for the sequential G-DINA model. Journal of Educational and Behavioral Statistics, 44, 45–77.
- Ma W., de la Torre J. (2019b). An empirical Q-matrix validation method for the sequential generalized DINA model. British Journal of Mathematical and Statistical Psychology, 73, 142–163.
- Ma W., de la Torre J. (2020). GDINA: An R package for cognitive diagnosis modeling. Journal of Statistical Software, 93, 1–26.
- Ma W., Iaconangelo C., de la Torre J. (2016). Model similarity, model selection, and attribute classification. Applied Psychological Measurement, 40, 200–217.
- Magis D., Béland S., Tuerlinckx F., De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862.
- Maydeu-Olivares A., Cai L. (2006). A cautionary note on using G2(dif) to assess relative model fit in categorical data analysis. Multivariate Behavioral Research, 41, 55–64.
- Millon T., Millon C., Davis R., Grossman S. (2009). MCMI-III manual (4th ed.). Pearson Assessments.
- Nichols P. D., Chipman S. F., Brennan R. L. (Eds.). (1995). Cognitively diagnostic assessments. Lawrence Erlbaum.
- Olejnik S., Algina J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434–447.
- Paulsen J., Svetina D., Feng Y., Valdivia M. (2020). Examining the impact of differential item functioning on classification accuracy in cognitive diagnostic models. Applied Psychological Measurement, 44, 267–281. https://doi.org/10.1177/0146621619858675
- Philipp M., Strobl C., de la Torre J., Zeileis A. (2018). On the estimation of standard errors in cognitive diagnosis models. Journal of Educational and Behavioral Statistics, 43, 88–115.
- Qiu X.-L., Li X., Wang W.-C. (2019). Differential item functioning in diagnostic classification models. In von Davier M., Lee Y.-S. (Eds.), Handbook of diagnostic classification models (pp. 379–393). Springer.
- Rose N., von Davier M., Nagengast B. (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82(3), 795–819.
- Rossi G., van der Ark L., Sloore H. (2007). Factor analysis of the Dutch-language version of the MCMI-III. Journal of Personality Assessment, 88, 144–157.
- Sessoms J., Henson R. A. (2018). Applications of diagnostic classification models: A literature review and critical commentary. Measurement: Interdisciplinary Research and Perspectives, 16(1), 1–17. https://doi.org/10.1080/15366367.2018.1435104
- Shealy R., Stout W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
- Svetina D., Dai S., Wang X. (2017). Use of cognitive diagnostic model to study differential item functioning in accommodations. Behaviormetrika, 44(2), 313–349.
- Tatsuoka K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354.
- Templin J. L., Henson R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.
- Terzi R. (2017). New Q-matrix validation procedures (Unpublished doctoral dissertation). Rutgers, The State University of New Jersey.
- Terzi R., Sen S. (2019). A nondiagnostic assessment for diagnostic purposes: Q-matrix validation and item-based model fit evaluation for the TIMSS 2011 assessment. SAGE Open, 9, 1–11.
- Wald A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
- Wang W.-C., Yeh Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27, 479–498.
- Xu X., Davier M. (2008). Comparing multiple-group multinomial log-linear models for multidimensional skill distributions in the general diagnostic model (ETS Research Report Series, 2008(1)). https://files.eric.ed.gov/fulltext/EJ1111190.pdf
- Zhang W. (2006). Detecting differential item functioning using the DINA model (Unpublished doctoral dissertation). The University of North Carolina at Greensboro.