Applied Psychological Measurement. 2018 Dec 4;43(7):527–542. doi: 10.1177/0146621618813104

Q-Matrix Refinement Based on Item Fit Statistic RMSEA

Chunhua Kang 1, Yakun Yang 1, Pingfei Zeng 1,
PMCID: PMC6739743  PMID: 31534288

Abstract

A Q-matrix, which reflects how attributes are measured for each item, is necessary when applying a cognitive diagnosis model to an assessment. In most cases, the Q-matrix is constructed by experts in the field and may be subjective and incorrect. One efficient method to refine the Q-matrix is to employ a suitable statistic that is calculated using response data. However, this approach is limited by its need to estimate all items in the Q-matrix even if only some are incorrect. To address this challenge, this study proposes an item fit statistic root mean square error approximation (RMSEA) for validating a Q-matrix with the deterministic inputs, noisy, “and” (DINA) model. Using a search algorithm, two simulation studies were performed to evaluate the effectiveness and efficiency of the proposed method at recovering Q-matrices. Results showed that using RMSEA can help define attributes in a Q-matrix. A comparison with the existing Delta method and residual sum of squares (RSS) method revealed that the proposed method had higher mean recovery rates and can be used to identify and correct Q-matrix misspecifications. When no error exists in the Q-matrix, the proposed method does not modify the correct Q-matrix.

Keywords: cognitive diagnosis, Q-matrix refinement, item fit statistic, DINA model


Within cognitive diagnostic assessment (CDA), the Q-matrix represents the relationship between items and attributes, and its elements are equal to 1 or 0, depending on whether an attribute is required by an item or not. The Q-matrix is a blueprint for test development, the accuracy and completeness of which are directly related to the quality of the test and the accuracy of any subsequent classification. Studies have demonstrated that Q-matrix errors result in item parameter bias and misclassification of examinees (de la Torre, 2008; de la Torre & Chiu, 2010; Kunina-Habenicht, Rupp, & Wilhelm, 2012; Rupp & Templin, 2008). Usually, underfitting (where a Q-matrix 1 is incorrectly labeled 0) leads to an overestimation of item parameters, and overfitting (a 0 incorrectly labeled with 1) leads to an underestimation of item parameters.

Numerous approaches are available for the evaluation of the Q-matrix, such as literature analysis, analysis of students’ oral reports, and judgment from domain experts. However, these approaches can be subjective, and the Q-matrix may be uncertain. For example, the Q-matrix of the fraction subtraction data set (Tatsuoka, 1990) is still controversial (DeCarlo, 2011, 2012; de la Torre, 2008; Tatsuoka, 1990). Therefore, identifying and modifying Q-matrix misspecifications to ensure the matrix is as accurate as possible is essential within CDA.

One approach is to consider a set of possible Q-matrices and determine which has the superior fitness using a fit index, such as the Akaike criterion or the Bayesian information criterion (Barnes, Bitzer, & Vouk, 2005; Cen, Koedinger, & Junker, 2005; DeCarlo, 2011; de la Torre & Douglas, 2008; Rupp & Templin, 2008). This is a simple and easily understood approach, but as the number of attributes increases, the size of the set required increases exponentially (DeCarlo, 2012; de la Torre, 2008).

DeCarlo (2012) proposed a Bayesian method to estimate the uncertainty in the Q-matrix using the reparameterized deterministic inputs, noisy, “and” (DINA) model (de la Torre, 2011). This method estimated q-entries appropriately, but knowledge of which entries or items were uncertain was necessary in advance. Cut points for the random elements were an additional problem (DeCarlo, 2012). Liu, Xu, and Ying (2012) tried to estimate a Q-matrix using an algorithm that minimized a loss function, and they proved that the estimated Q-matrix converged to the true Q-matrix as the number of examinees approached infinity. This was an objective and direct method of calculating the Q-matrix, but if some items were to measure more than a few attributes, this time-consuming method would not be computationally feasible (Chiu, 2013). Regarding Q-matrix estimation as latent variable selection, Chen, Liu, Xu, and Ying (2015) developed some theories and proposed a new method to directly estimate a Q-matrix from data. The method used the conditional maximum likelihood method, combined with an expectation maximization (EM) algorithm and a coordinate descent algorithm, to make the method more efficient and reasonable. Both simulations and real data studies were presented to show the performance of the method.

Another reasonable method of Q-matrix evaluation is to construct an index and then specify the Q-matrix through maximization or minimization of that index. For example, de la Torre (2008) proposed an item discrimination index δj based on the DINA model (Junker & Sijtsma, 2001) and specified the Q-matrix by maximizing δj, the difference in the probability of a correct response to item j between examinees who have mastered the required attributes and those who have not; this is called the Delta method in this article. This simple, empirical method is easy to understand. Using a search algorithm, the Delta method specifies the Q-matrix by entirely replacing each inappropriate q-vector, one after another. Both simulations and real data analyses were used to show this method’s viability. Results indicated that the Delta method is potentially viable and able to identify and correct inappropriate q-vectors and retain correctly specified items. Another item discrimination index, $\varsigma_j^2$, which was used for the Q-matrix validation of the generalized DINA (G-DINA) model (de la Torre, 2011), was proposed by de la Torre and Chiu (2010). Simulations demonstrated that it accurately identified and revised the misspecified q-entries; however, its robustness was not thoroughly assessed (Chiu, 2013). Chiu (2013) proposed a nonparametric method for Q-matrix refinement that minimizes the residual sum of squares (RSS) index. The RSS is the sum of the squared differences between the observed response patterns (ORPs) and ideal response patterns (IRPs) for each item. Simulations and an empirical study revealed that this method for improving the accuracy of the Q-matrix was effective and efficient. Chiu’s (2013) method is simple and easy to operate. However, it is nonparametric, and all items in the Q-matrix must be re-estimated until the algorithm converges. Therefore, much time can be wasted if numerous items in the Q-matrix are already correct. It may even have a negative impact on the precision of the RSS method if only a small number of items are correct, because it may lead to incorrect refinement.

To validate the Q-matrix using an index, this study employed an existing item fit statistic—the root mean square error approximation (RMSEA)—to refine the Q-matrix. The item fit statistic RMSEA was proposed by von Davier (2005) and is used for detecting whether items fit well or poorly. Kunina-Habenicht, Rupp, and Wilhelm (2009, 2012) used the RMSEA to evaluate item fitness within simulation studies and an empirical study, and gave its range of suitability. An RMSEA of less than 0.05 was discovered to be a good fit, 0.05 to 0.1 a moderate fit, and larger than 0.1 a poor fit. Computation of the RMSEA is straightforward. However, further study on the use of this statistic to evaluate the Q-matrix has not yet been performed. In the present study, the RMSEA was first used to determine which items were not fitted well, and then to specify them using the RMSEA minimization method. This method took into account the range of suitability to evaluate the Q-matrix appropriately, is theoretically justified, and was implemented using readily available statistical packages.

The presentation of the Q-matrix refinement method begins with an introduction of the DINA model and a description of the item fit statistic RMSEA, which the Q-matrix refinement algorithm uses to detect poorly fitting items and define their attributes. An explanation of the algorithm developed from the RMSEA follows. Chiu’s (2013) method is then reviewed, and how the RMSEA operates is described in detail. The performance of the RMSEA is evaluated under a wide range of conditions with two simulation studies. Finally, the discussion addresses the issues raised by the results of the simulation studies, suggests directions for future research, and comments on the advantages of the method.

Theoretical Framework

DINA Model

The DINA model is a noncompensatory and conjunctive model (Maris, 1999; Rupp, Templin, & Henson, 2010). In the absence of guessing and slipping, an examinee must master all the skills required by an item to answer it correctly. Let the K-dimensional vector αi denote the ith examinee’s attribute mastery pattern, with 0 indicating nonpossession and 1 indicating possession. The entry qjk in the Q-matrix indicates whether item j requires attribute k (0 = not required, 1 = required). The ideal item response ηij of examinee i to item j is

$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}, \tag{1}$$

where αik is the kth entry of the vector αi.

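Within the DINA model, each item has only two item parameters: the slip parameter sj, which is the probability of an incorrect response when ηij = 1, and the guessing parameter gj, which is the probability of a correct response when ηij = 0. Let Yij be a binary variable that indicates the observed response of examinee i to item j. The probability that examinee i provides a correct response to item j is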

$$P(Y_{ij} = 1 \mid \eta_{ij}) = (1 - s_j)^{\eta_{ij}} g_j^{(1 - \eta_{ij})}. \tag{2}$$
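As a minimal sketch of Equations 1 and 2, the following Python functions compute the ideal response and the probability of a correct response for one examinee and one item. The function names and the example slip and guessing values (0.1 and 0.2) are illustrative assumptions, not taken from the study.

```python
import random


def ideal_response(alpha, q):
    """Equation 1: eta is 1 only if every attribute required by q is mastered."""
    return int(all(a == 1 for a, qk in zip(alpha, q) if qk == 1))


def p_correct(alpha, q, slip, guess):
    """Equation 2: (1 - s)^eta * g^(1 - eta)."""
    eta = ideal_response(alpha, q)
    return (1 - slip) ** eta * guess ** (1 - eta)


def simulate_response(alpha, q, slip, guess, rng):
    """Draw one observed 0/1 response from the DINA model."""
    return int(rng.random() < p_correct(alpha, q, slip, guess))


# Illustrative item requiring attributes 1 and 3 (assumed values).
q = [1, 0, 1]
print(p_correct([1, 1, 1], q, slip=0.1, guess=0.2))  # master: 1 - s
print(p_correct([1, 1, 0], q, slip=0.1, guess=0.2))  # non-master: g
print(simulate_response([1, 1, 1], q, 0.1, 0.2, random.Random(0)))
```

An examinee who lacks any required attribute answers correctly only with probability g, regardless of how many other attributes are mastered, which is what makes the model conjunctive.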

RMSEA

The RMSEA is an item fit statistic, the computation of which was implemented using the R package developed by Robitzsch, Kiefer, George, and Uenlue (2015). The RMSEA of item j is

$$\mathrm{RMSEA}_j = \sqrt{\sum_{k} \sum_{c} \pi(\alpha_c) \left( P_j(\alpha_c) - \frac{n_{jkc}}{N_{jc}} \right)^2}, \tag{3}$$

where αc is the latent proficiency for class c, k is the item response category, π(αc) is the estimated probability of αc, Pj is the estimated item response function, njkc is the expected number of latent class αc examinees on item j in category k, and Njc is the expected number of latent class αc examinees on item j. RMSEA can be used within DINA, deterministic inputs, noisy output “or” gate, G-DINA, and general diagnostic models (Robitzsch et al., 2015). The DINA model was employed for this study to demonstrate how the RMSEA can be used to refine the Q-matrix. It is simple, easy to interpret, and provides a potential extension to more complex cognitive diagnostic models (Park & Lee, 2014). Most Q-matrix evaluations are also based on the DINA model (e.g., Chiu, 2013; DeCarlo, 2012; de la Torre, 2008).
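The following sketch illustrates how the RMSEA in Equation 3 could be computed for a single dichotomous item. It is an assumption-laden illustration rather than the R package's implementation: the expected counts njkc and Njc are taken to be posterior-weighted counts, the squared deviation is summed over both response categories (k = 0, 1), and all function and argument names are hypothetical.

```python
import math


def item_rmsea(responses_j, posterior, class_probs, p_correct_by_class):
    """RMSEA of one dichotomous item.

    responses_j[i]        : observed 0/1 response of examinee i
    posterior[i][c]       : posterior probability that examinee i is in class c
    class_probs[c]        : estimated class probability pi(alpha_c)
    p_correct_by_class[c] : model-implied P(Y = 1 | alpha_c)
    """
    total = 0.0
    for c, pi_c in enumerate(class_probs):
        # Expected number of class-c examinees (assumed posterior-weighted).
        n_class = sum(post[c] for post in posterior)
        if n_class == 0:
            continue
        # Expected number of class-c examinees answering correctly.
        n_correct = sum(post[c] * y for post, y in zip(posterior, responses_j))
        observed = {1: n_correct / n_class, 0: 1 - n_correct / n_class}
        predicted = {1: p_correct_by_class[c], 0: 1 - p_correct_by_class[c]}
        for k in (0, 1):
            total += pi_c * (predicted[k] - observed[k]) ** 2
    return math.sqrt(total)
```

When the model-implied probabilities match the observed proportions in every class, the statistic is zero; it grows as the predicted and observed proportions diverge in classes with non-negligible probability.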

Algorithm

The Lowest Index Search Algorithm

Chiu (2013) proposed an algorithm and an RSS index to refine the q-vector that is most likely to be misspecified, based on a nonparametric method. For an item j, the algorithm searches over all $2^K - 1$ possible q-vectors and replaces the q-vector under consideration with the q-vector that has the lowest RSS. Because the algorithm finds the minimum RSS to estimate an item’s q-vector, the algorithm is referred to as the lowest index search algorithm. The lowest index search algorithm is used to specify or refine the Q-matrix through estimation of the items one at a time. Generally, it includes two stages: estimation and self-tuning. During the estimation stage, parametric estimation methods such as the EM algorithm or nonparametric classification techniques can be used to determine examinees’ proficiency classes or item parameters. The corresponding index (e.g., RSS or RMSEA) of the different possible q-vectors for the item is calculated from the estimation, and the item’s q-vector is then set to the q-vector whose index is the lowest. During the self-tuning stage, all items are redefined until a certain stopping criterion has been met.
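The search over candidate q-vectors can be sketched as follows. This is a simplified illustration, not Chiu's (2013) procedure: it assumes the examinees' attribute patterns are already known, whereas the actual method estimates them nonparametrically, and the function names are hypothetical.

```python
from itertools import product


def candidate_q_vectors(K):
    """All 2**K - 1 nonzero binary q-vectors of length K."""
    return [list(q) for q in product([0, 1], repeat=K) if any(q)]


def rss(q, alphas, responses_j):
    """Sum of squared differences between observed and ideal responses."""
    total = 0
    for alpha, y in zip(alphas, responses_j):
        eta = int(all(a for a, qk in zip(alpha, q) if qk))
        total += (y - eta) ** 2
    return total


def lowest_index_q(candidates, index_fn):
    """Return the candidate q-vector with the smallest index value."""
    return min(candidates, key=index_fn)


# Toy check with K = 3: noise-free responses generated under q = [1, 0, 1].
alphas = [list(a) for a in product([0, 1], repeat=3)]
true_q = [1, 0, 1]
ys = [int(all(a for a, qk in zip(alpha, true_q) if qk)) for alpha in alphas]
best = lowest_index_q(candidate_q_vectors(3), lambda q: rss(q, alphas, ys))
print(best)
```

The number of candidates doubles with each additional attribute, so the per-item cost of the search grows exponentially in K.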

The Lowest RMSEA Search Algorithm

It can be shown that the correct q-vector for each item is expected to be the q-vector with the lowest RMSEA among all possible q-vectors (see the justification in the online appendix). The RMSEA can therefore be used directly by the lowest index search algorithm. For convenience, this is called the lowest RMSEA estimation (LRE) algorithm. Suppose that the examinees’ responses to the items are known and that the original Q-matrix, which may contain some uncertainty, is denoted by Q0. The steps of the LRE algorithm are then as follows:

  • Step 1: Use the EM algorithm to estimate examinees’ class memberships and item parameters based on Q0 and the response data.

  • Step 2: Compute all items’ RMSEA values based on Q0 and the class memberships and item parameters estimated at Step 1.

  • Step 3: Determine the items in Q0 with large RMSEA values and denote this set of items by $S^{(0)}$.

  • Step 4: Select item j in $S^{(0)}$, and denote the q-vector of the item as $q_j^{(1)}$, where the superscript (1) is the order of the corresponding RMSEA among all items in $S^{(0)}$.

  • Step 5: Compute each of the remaining $2^K - 2$ RMSEA values by replacing $q_j^{(1)}$ in Q0 with the other $2^K - 2$ q-vectors, one at a time.

  • Step 6: Update Q0 by replacing $q_j^{(1)}$ with $q_j^{*(1)}$, the q-vector with the lowest RMSEA value among all $2^K - 1$ possible q-vectors. Denote the updated Q-matrix as Q1.

  • Step 7: Omit item j from the item search pool ($S^{(1)} = S^{(0)} \setminus \{j\}$).

  • Step 8: Replace Q0 and $S^{(0)}$ with Q1 and $S^{(1)}$, respectively, and repeat Steps 4 to 7. Iterate until all items in $S^{(0)}$ have been visited.
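The steps above can be sketched as a short search loop. This is a hedged outline rather than the authors' implementation: `rmsea_fn` is a hypothetical callable standing in for Steps 1 and 2 (fitting the DINA model under a given Q-matrix and returning one RMSEA value per item), the cutoff for a "large" RMSEA is passed in as `threshold`, and visiting flagged items from the largest to the smallest RMSEA is an assumption about the ordering in Step 4.

```python
from itertools import product


def lre_refine(Q0, rmsea_fn, threshold=0.1):
    """Sketch of the LRE search.

    Q0       : list of q-vectors (lists of 0/1), one per item
    rmsea_fn : hypothetical callable; given a Q-matrix, it refits the model
               and returns a list with one RMSEA value per item
    """
    K = len(Q0[0])
    candidates = [list(q) for q in product([0, 1], repeat=K) if any(q)]
    Q = [list(row) for row in Q0]            # work on a copy of Q0

    rmsea = rmsea_fn(Q)                      # Steps 1-2
    flagged = [j for j, r in enumerate(rmsea) if r > threshold]  # Step 3
    # Assumption: items are visited in descending order of RMSEA.
    flagged.sort(key=lambda j: rmsea[j], reverse=True)

    for j in flagged:                        # Steps 4, 7, and 8
        best_q, best_val = Q[j], float("inf")
        for q in candidates:                 # Step 5 (current q re-evaluated too)
            trial = [row[:] for row in Q]
            trial[j] = q
            val = rmsea_fn(trial)[j]
            if val < best_val:
                best_q, best_val = q, val
        Q[j] = list(best_q)                  # Step 6
    return Q
```

Each flagged item is visited exactly once, so the number of model fits is bounded by the number of flagged items times $2^K - 1$, plus the initial fit.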

The original Q-matrix specified by experts or other methods may include some misspecifications; the LRE algorithm refines the Q-matrix by replacing the q-vectors of poorly fitting items with the q-vector that has the lowest RMSEA value. There is no need to estimate all items until a certain stopping criterion is reached, because the RMSEA value of an incorrect q-vector can be high. The item fit range of the RMSEA is used to find the misspecified items and modify them. Because random errors cause some items to have inherently high RMSEA values, the algorithm visits each misspecified item only once.

A remarkable feature of the LRE algorithm is that it can be used for defining a Q-matrix and then detecting and refining the Q-matrix input appropriately; the LRE algorithm performs all these tasks with a single statistic, the RMSEA. This algorithm may be more efficient than other parametric algorithms that have been proposed, because it re-estimates only the items with uncertainty in the Q-matrix and does not need to iterate until convergence. More importantly, the self-tuning stage can be used directly to refine and validate a Q-matrix with some correctly specified items.

Simulation Studies

The performance of the LRE algorithm was evaluated using two simulation studies: the first determined the effect of the number of items and the number of misspecified q-entries on the Q-matrix refinement method, and the second evaluated the performance of the proposed method when the Q-matrix is correct.

Study 1: Effect of the Number of Items and the Number of Misspecified Q-Entries on Q-Matrix Recovery During the Self-Tuning Stage

Design and method

As found in a preliminary analysis, an increase in the number of attributes negatively affected the algorithm. In this study, K = 5 attributes were assumed. Five variables were included in the study design: (a) number of examinees (N = 500, 1,000), (b) number of test items (J = 20, 40), (c) percentage of misspecified q-entries (10%, 20%), (d) upper bound of the slipping and guessing parameters (0.2, 0.3, 0.4, 0.5), and (e) underlying distribution of examinees’ attribute patterns (multivariate normal, uniform). Moreover, de la Torre (2008) proposed a DINA-model-based method for Q-matrix validation based on an item discrimination index, δj; de la Torre’s method is denoted as the Delta method. Chiu (2013) proposed a Q-matrix refinement method by minimizing the RSS between the observed responses and ideal responses; this method is called the RSS method. In this study, the Delta method and RSS method were adopted to study the effectiveness of the proposed method when the proposed method is applied to various conditions.

The correct Q-matrices for tests containing 20 items were constructed by including the 5 one-attribute q-vectors; 5, 5, and 4 randomly chosen two-, three-, and four-attribute q-vectors, respectively; and the single five-attribute q-vector (Chiu, 2013). The Q-matrices for tests containing 40 items were constructed by doubling the 20-item Q-matrices. The correct Q-matrix was used to generate the examinees’ item responses, and misspecified Q-matrices were created by randomly changing either 10% or 20% of the q-entries in the correct Q-matrix from 0 to 1 or from 1 to 0. The misspecified Q-matrix, denoted by Q_original, was used in Step 1 of the self-tuning stage.
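The construction described above can be sketched in Python as follows. The function names are hypothetical, and one detail is an assumption: the text does not say how a flip that leaves an item with an all-zero q-vector was handled, so this sketch does not guard against it.

```python
import random
from itertools import combinations


def build_true_q(rng, K=5):
    """20-item Q-matrix: 5 one-, 5 two-, 5 three-, 4 four-, 1 five-attribute items."""
    def vectors(n_attr):
        return [[1 if k in combo else 0 for k in range(K)]
                for combo in combinations(range(K), n_attr)]

    Q = vectors(1)
    for n_attr, count in ((2, 5), (3, 5), (4, 4)):
        Q += rng.sample(vectors(n_attr), count)
    Q += vectors(K)
    return Q


def misspecify(Q, proportion, rng):
    """Flip a random `proportion` of the q-entries (0 -> 1 or 1 -> 0)."""
    J, K = len(Q), len(Q[0])
    n_flip = round(proportion * J * K)
    cells = rng.sample([(j, k) for j in range(J) for k in range(K)], n_flip)
    Q_mis = [row[:] for row in Q]
    for j, k in cells:
        Q_mis[j][k] = 1 - Q_mis[j][k]   # may yield an all-zero row (unhandled)
    return Q_mis


rng = random.Random(0)
Q_true = build_true_q(rng)
Q_original = misspecify(Q_true, 0.10, rng)   # 10 of the 100 entries flipped
```

With J = 20 and K = 5 there are 100 q-entries, so the 10% and 20% conditions correspond to 10 and 20 flipped entries.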

To generate a uniform distribution of examinees’ attribute patterns, the $2^5 = 32$ possible attribute patterns were represented in equal proportion among the examinees. To generate a multivariate normal distribution, examinees’ underlying abilities were drawn from a multivariate normal distribution with all variances equal to 1.0 and all covariances equal to 0.5 (Chiu, 2013). Each attribute pattern $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{iK})$ was determined by comparing the examinee’s underlying ability $\theta_{ik}$ with $\Phi^{-1}(k/(K+1))$; that is,

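$$\alpha_{ik} = \begin{cases} 1, & \text{if } \theta_{ik} \ge \Phi^{-1}\left(\dfrac{k}{K+1}\right) \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where k = 1, …, K.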
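A minimal sketch of the multivariate normal generation follows. It rests on two assumptions that the text does not state: the abilities have mean zero, and an attribute is mastered when the ability is at or above the cutoff $\Phi^{-1}(k/(K+1))$. The common-factor construction is one convenient way to obtain unit variances and covariances of 0.5 without a linear algebra library; the function name is hypothetical.

```python
import math
import random
from statistics import NormalDist


def draw_attribute_pattern(K, rng):
    """Draw one attribute pattern by thresholding correlated normal abilities."""
    # theta_k = sqrt(0.5) * shared + sqrt(0.5) * unique_k
    # gives Var(theta_k) = 1 and Cov(theta_k, theta_l) = 0.5 for k != l.
    shared = rng.gauss(0, 1)
    thetas = [math.sqrt(0.5) * shared + math.sqrt(0.5) * rng.gauss(0, 1)
              for _ in range(K)]
    cutoffs = [NormalDist().inv_cdf(k / (K + 1)) for k in range(1, K + 1)]
    # Assumed direction: mastery when theta is at or above the cutoff.
    return [int(theta >= cut) for theta, cut in zip(thetas, cutoffs)]


rng = random.Random(1)
patterns = [draw_attribute_pattern(5, rng) for _ in range(20000)]
```

Under these assumptions the cutoffs increase with k, so the first attribute is mastered by roughly 5/6 of examinees and the fifth by roughly 1/6.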

The DINA model was used to generate examinees’ ORPs. Forty data sets were simulated for each of the 2 (examinee) × 2 (item) × 2 (misspecification) × 4 (slip upper bound) × 2 (distribution) = 64 combinations. Each data set was analyzed using the self-tuning stage of the LRE algorithm with two thresholds (0.05 and 0.1), denoted as LRE0.05 and LRE0.1. The Delta and RSS methods were applied to the misspecified Q-matrix. The final refined Q-matrix is denoted by Q_modify, and the q-entry recovery rate is the ratio of the number of correct q-entries in Q_modify to the total number of q-entries. Q-entry recovery was computed for each data set and then averaged across the 40 data sets within each design condition. The base accuracy rates were 0.90 and 0.80 for design conditions with 10% and 20% misspecification, respectively. A mean q-entry recovery rate (MRR) greater than the base accuracy rate was more informative and thus indicative of the method’s effectiveness (Chiu, 2013).
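The recovery measure and the base accuracy rate can be illustrated with a small sketch; the function name and the toy Q-matrix are hypothetical. The base rate is simply the recovery rate of the misspecified Q-matrix before any refinement, that is, 1 minus the proportion of flipped entries.

```python
def q_entry_recovery(Q_refined, Q_true):
    """Proportion of q-entries in the refined Q-matrix that match the true one."""
    total = sum(len(row) for row in Q_true)
    correct = sum(a == b
                  for r1, r2 in zip(Q_refined, Q_true)
                  for a, b in zip(r1, r2))
    return correct / total


# Toy example: 4 items x 5 attributes = 20 entries, 2 of them wrong (10%).
Q_true = [[1, 0, 0, 0, 0],
          [0, 1, 1, 0, 0],
          [1, 1, 0, 1, 0],
          [1, 1, 1, 1, 1]]
Q_unrefined = [[1, 0, 0, 0, 1],   # last entry flipped
               [0, 1, 1, 0, 0],
               [1, 0, 0, 1, 0],   # second entry flipped
               [1, 1, 1, 1, 1]]
print(q_entry_recovery(Q_unrefined, Q_true))   # base accuracy rate for 10% flips
```

A refinement method is informative only when its recovery rate exceeds this starting value.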

Results

Tables 1 to 4 present the means and standard deviations of the q-entry recovery rates in Q_modify for J = 20 (Tables 1 and 2) and J = 40 (Tables 3 and 4). The standard deviations of the q-entry recovery rates for the proposed method and the two other methods were all approximately 0.02, which indicated that the MRRs were stable and represented the results of the various conditions. As the bounds of the item parameters increased, the standard deviations tended to grow slightly; however, the three methods’ standard deviations were reasonably small and close to one another, especially when the test length was 40. None exceeded 0.1, and the test length and number of examinees had positive effects on the stability of the results. In almost all tables, the MRRs of the proposed LRE method and the RSS method were higher than those of the Delta method; however, the MRRs of LRE0.05 were lower under some conditions with 20 items. Calculations revealed that the standard deviations of LRE0.05 under those conditions were approximately 0.1, which was substantial. This suggested that LRE0.05 had the potential to define the incorrect items’ attributes. A further inspection of the mean number of correctly redefined items indicated that, on average, LRE0.05 corrected more items than did the Delta method. For the RSS method and the proposed method, most MRRs were comparable under the multivariate normal distribution. Under some conditions, LRE0.1 and LRE0.05 performed more accurately than did the RSS method (as indicated by the bold figures in Tables 1 and 3). When the distribution of attribute patterns was uniform, the RSS method was slightly superior to the proposed method. However, the MRRs of the proposed method were still higher than the base rate and in some conditions were higher than those of the RSS method, especially for LRE0.1. Therefore, the proposed method was reasonable and effective for detecting incorrect items and refining the Q-matrix, thereby rendering the Q-matrix more objective.

Table 1.

MRRs of 40 Replications in Each Condition (20 Items, Multivariate Normal Distribution).

           N = 500                                                     N = 1,000
Maximum s  RSS            Delta          LRE0.1         LRE0.05        RSS            Delta          LRE0.1         LRE0.05
10% misspecification
0.2        0.984 (0.02)   0.954 (0.02)   0.988 (0.02)   0.977 (0.04)   0.987 (0.01)   0.965 (0.02)   0.990 (0.01)   0.989 (0.01)
0.3        0.957 (0.02)   0.936 (0.03)   0.964 (0.03)   0.961 (0.04)   0.962 (0.02)   0.946 (0.02)   0.976 (0.03)   0.971 (0.05)
0.4        0.937 (0.02)   0.919 (0.03)   0.937 (0.03)   0.901 (0.06)   0.939 (0.02)   0.927 (0.02)   0.944 (0.03)   0.942 (0.05)
0.5        0.896 (0.03)   0.886 (0.03)   0.912 (0.03)   0.835 (0.07)   0.910 (0.02)   0.901 (0.02)   0.910 (0.02)   0.909 (0.04)
20% misspecification
0.2        0.957 (0.03)   0.864 (0.05)   0.891 (0.07)   0.871 (0.10)   0.952 (0.04)   0.877 (0.03)   0.891 (0.06)   0.889 (0.08)
0.3        0.896 (0.05)   0.845 (0.04)   0.869 (0.06)   0.841 (0.09)   0.906 (0.04)   0.866 (0.03)   0.899 (0.06)   0.898 (0.09)
0.4        0.843 (0.04)   0.816 (0.03)   0.839 (0.05)   0.814 (0.07)   0.851 (0.03)   0.821 (0.03)   0.832 (0.04)   0.844 (0.08)
0.5        0.815 (0.04)   0.796 (0.02)   0.809 (0.03)   0.770 (0.07)   0.815 (0.03)   0.810 (0.03)   0.813 (0.02)   0.816 (0.06)

Note. Numbers outside brackets are different conditions’ MRRs, whereas numbers inside brackets are the MRRs’ standard deviations for each condition. MRR = mean q-entry recovery rate; RSS = residual sum of squares; LRE = lowest RMSEA estimation.

Note: Bold values signify situations where the MRRs of the proposed LRE method are higher than those of the two other methods.

Table 4.

MRRs of 40 Replications in Each Condition (40 Items, Uniform Distribution).

           N = 500                                                     N = 1,000
Maximum s  RSS            Delta          LRE0.1         LRE0.05        RSS            Delta          LRE0.1         LRE0.05
10% misspecification
0.2        0.999 (0.001)  0.954 (0.02)   0.998 (0.01)   0.996 (0.04)   0.999 (0.002)  0.965 (0.02)   0.999 (0.003)  0.998 (0.01)
0.3        0.999 (0.003)  0.927 (0.02)   0.994 (0.03)   0.992 (0.02)   0.999 (0.003)  0.933 (0.03)   0.997 (0.01)   0.995 (0.01)
0.4        0.981 (0.01)   0.896 (0.03)   0.980 (0.02)   0.970 (0.04)   0.985 (0.01)   0.903 (0.02)   0.989 (0.01)   0.986 (0.02)
0.5        0.930 (0.02)   0.834 (0.03)   0.940 (0.03)   0.910 (0.05)   0.945 (0.02)   0.875 (0.03)   0.960 (0.02)   0.960 (0.04)
20% misspecification
0.2        0.999 (0.002)  0.869 (0.04)   0.960 (0.04)   0.959 (0.05)   0.996 (0.03)   0.889 (0.04)   0.971 (0.04)   0.968 (0.04)
0.3        0.996 (0.01)   0.832 (0.04)   0.932 (0.06)   0.928 (0.07)   0.996 (0.02)   0.852 (0.03)   0.943 (0.06)   0.943 (0.06)
0.4        0.964 (0.04)   0.807 (0.04)   0.903 (0.06)   0.8921 (0.07)  0.972 (0.03)   0.827 (0.04)   0.903 (0.06)   0.912 (0.08)
0.5        0.871 (0.04)   0.780 (0.04)   0.857 (0.05)   0.814 (0.09)   0.889 (0.03)   0.788 (0.05)   0.862 (0.04)   0.878 (0.07)

Note. MRR = mean q-entry recovery rate; RSS = residual sum of squares.

Note: Bold values signify situations where the MRRs of the proposed LRE method are higher than those of the two other methods.

Table 3.

MRRs of 40 Replications in Each Condition (40 Items, Multivariate Normal Distribution).

           N = 500                                                     N = 1,000
Maximum s  RSS            Delta          LRE0.1         LRE0.05        RSS            Delta          LRE0.1         LRE0.05
10% misspecification
0.2        0.998 (0.01)   0.989 (0.01)   0.995 (0.01)   0.992 (0.02)   0.999 (0.002)  0.995 (0.007)  0.999 (0.005)  0.999 (0.004)
0.3        0.995 (0.01)   0.974 (0.01)   0.996 (0.01)   0.996 (0.01)   0.998 (0.004)  0.982 (0.01)   0.996 (0.01)   0.995 (0.02)
0.4        0.965 (0.02)   0.951 (0.02)   0.981 (0.02)   0.977 (0.03)   0.978 (0.02)   0.964 (0.01)   0.991 (0.01)   0.991 (0.01)
0.5        0.913 (0.02)   0.908 (0.02)   0.944 (0.02)   0.917 (0.03)   0.931 (0.03)   0.932 (0.02)   0.966 (0.02)   0.966 (0.03)
20% misspecification
0.2        0.991 (0.03)   0.954 (0.03)   0.980 (0.03)   0.980 (0.04)   0.996 (0.01)   0.967 (0.02)   0.986 (0.02)   0.987 (0.02)
0.3        0.981 (0.02)   0.921 (0.03)   0.965 (0.04)   0.966 (0.04)   0.991 (0.02)   0.940 (0.03)   0.980 (0.02)   0.983 (0.02)
0.4        0.944 (0.03)   0.886 (0.04)   0.937 (0.04)   0.929 (0.05)   0.951 (0.03)   0.908 (0.03)   0.954 (0.03)   0.961 (0.05)
0.5        0.860 (0.04)   0.842 (0.04)   0.883 (0.04)   0.865 (0.06)   0.873 (0.03)   0.858 (0.04)   0.883 (0.04)   0.912 (0.05)

Note. MRR = mean q-entry recovery rate; RSS = residual sum of squares.

Note: Bold values signify situations where the MRRs of the proposed LRE method are higher than those of the two other methods.

For various percentages of misspecified q-entries, the MRRs of LRE0.1 and the RSS method were always greater than the base rates. The distribution of examinees’ attribute patterns seemed to affect the RSS method, the proposed method, and the Delta method differently. The uniform distribution had a positive effect on the RSS method: its MRRs were higher under the uniform distribution than under the multivariate normal distribution. The opposite effect operated on the proposed method and the Delta method, whose MRRs dropped when the distribution of examinees’ attribute patterns was uniform.

When the distribution was multivariate normal, only one of the 32 conditions (see Tables 1 and 3) had an MRR less than 0.8; this happened with 20 items and 500 examinees for the LRE0.05 and Delta methods. For the Delta method, only one condition had an MRR that reached 0.99; for the LRE0.05 method, multiple MRRs were higher than 0.99. When the distribution was uniform, for LRE0.05, two of the 32 conditions had MRRs less than 0.8 and five of the 32 conditions had MRRs lower than the base rate (see Tables 2 and 4); this occurred with 20 items when the upper bounds of the item parameters were 0.4 or 0.5. For the Delta method, five of the 32 conditions had MRRs less than 0.8 and 14 of the 32 conditions had MRRs lower than the base rate; this occurred with 40 items. Most MRRs of the LRE0.05 method (29 of the 32 conditions) were higher than those of the Delta method. Therefore, the LRE algorithm has a high recovery rate and can correct more mistakes in a Q-matrix than can the Delta method.

Table 2.

MRRs of 40 Replications in Each Condition (20 Items, Uniform Distribution).

           N = 500                                                     N = 1,000
Maximum s  RSS            Delta          LRE0.1         LRE0.05        RSS            Delta          LRE0.1         LRE0.05
10% misspecification
0.2        0.997 (0.01)   0.897 (0.03)   0.968 (0.06)   0.947 (0.04)   0.999 (0.001)  0.917 (0.03)   0.977 (0.03)   0.968 (0.04)
0.3        0.980 (0.02)   0.887 (0.03)   0.959 (0.03)   0.941 (0.05)   0.979 (0.02)   0.904 (0.02)   0.967 (0.02)   0.964 (0.04)
0.4        0.950 (0.02)   0.868 (0.03)   0.926 (0.04)   0.880 (0.07)   0.946 (0.03)   0.890 (0.03)   0.933 (0.03)   0.917 (0.06)
0.5        0.903 (0.03)   0.838 (0.04)   0.903 (0.03)   0.820 (0.07)   0.918 (0.02)   0.850 (0.04)   0.909 (0.02)   0.893 (0.07)
20% misspecification
0.2        0.952 (0.05)   0.804 (0.04)   0.899 (0.07)   0.881 (0.08)   0.960 (0.06)   0.817 (0.04)   0.904 (0.05)   0.896 (0.06)
0.3        0.926 (0.05)   0.797 (0.04)   0.860 (0.07)   0.835 (0.09)   0.900 (0.05)   0.795 (0.03)   0.851 (0.06)   0.853 (0.08)
0.4        0.860 (0.04)   0.789 (0.04)   0.814 (0.04)   0.771 (0.06)   0.865 (0.04)   0.795 (0.04)   0.824 (0.05)   0.840 (0.07)
0.5        0.824 (0.04)   0.779 (0.05)   0.809 (0.02)   0.743 (0.06)   0.828 (0.03)   0.761 (0.03)   0.811 (0.02)   0.807 (0.06)

Note. MRR = mean q-entry recovery rate; RSS = residual sum of squares.

It is important to note that the percentage of misspecified q-entries, the upper bound of item parameters, and the test length have similar effects on these methods. If either the percentage of misspecified q-entries or the upper bound of item parameters increases, all methods’ MRRs tend to decrease. The lowest MRRs of all conditions were found when the percentage of misspecified q-entries was 20% and the number of items was 20. When both the upper bound of item parameters and the percentage of misspecified q-entries are high, an MRR lower than the base rate (0.8) may result; a high upper bound and a high percentage of misspecified q-entries may combine to amplify the negative effect. When the percentage of misspecified q-entries was 10%, the differences in MRR between different upper bounds of item parameters were relatively small; the largest MRR drop was less than 0.1 even when the upper bound increased from 0.2 to 0.5. However, for 20% misspecified q-entries, the MRR drops were larger; the largest MRR drop was 0.14, which occurred when the RSS method was used with 20 items and 20% misspecification. Nevertheless, under conditions of both 10% and 20% misspecified q-entries, the changes in MRR for LRE0.1 across the various bounds of item parameters were all less than 0.1. This may indicate that the parameter upper bound exerts only a minor effect on the proposed LRE method.

When the other conditions were the same, longer tests had higher MRRs. This was surprising: before the study, it was expected that recovery would be more difficult for long tests, because an increase in the number of items corresponds to an increase in the number of misspecified q-entries if the misspecification percentage is kept constant. However, the number of correct items in the Q-matrix also increases as J increases, which can be essential for the RMSEA to detect unfit items; this increase in correct items can also be crucial for the two other methods to modify the Q-matrix, and this appeared to boost the MRR.

In addition, the number of examinees seemed to have little effect on the MRRs; when the number of examinees was increased to 1,000, the maximum improvement in MRR did not exceed 0.05. Compared with the other variables, the percentage of misspecified q-entries had a large effect on the performance of all three methods. The MRRs for tests with 10% misspecified q-entries were higher than those for tests with 20% misspecified q-entries. However, the upper bound of item parameters (which, to some extent, is a measure of item quality) tended to have a greater effect on the Delta method and the RSS method than on the proposed method. This phenomenon was especially notable when the test length was 40 and the upper bound was 0.4 or 0.5; the MRRs of the RSS method dropped below those of the LRE method (as indicated by the bold entries in Tables 3 and 4). This shows that Q-matrix modification by the Delta method and the RSS method was more sensitive to item quality than was modification by the proposed method. This may be because the Delta method maximizes δj and the RSS method minimizes the RSS, both of which require items to have high discrimination and quality. However, no matter how the bound of item parameters or the percentage of misspecified q-entries changed, the worst performance occurred when the tests contained only 20 items. This further illustrates that a certain number of correct items is needed if a high MRR is to be obtained for all methods, which is consistent with what was discovered in Chiu’s (2013) study.

A survey of all conditions showed that the modification threshold exerted a minor effect on the q-entry recovery rate for the proposed method, especially when the distribution was multivariate normal. Specifically, when 10% and 20% of q-entries were misspecified and the upper bound of item parameters was 0.5, the MRRs of LRE0.05 were lower than those of LRE0.1 and fell below the base accuracy rate when the test length was 20. Applying a low threshold to specify the Q-matrix can promote the detection of unfit items; however, if an insufficient quantity of correct items exists in the Q-matrix, the process may fail to define the items’ attributes correctly and may even incorrectly redefine the attributes of some items that fit poorly but are correct. In such a situation, a loose threshold may be a reasonable choice. Under other conditions, the differences in MRR between the two thresholds (i.e., 0.05 and 0.1) were small and did not exceed 0.04 when the distribution was multivariate normal and 0.08 when the distribution was uniform; most were approximately 0.01.

Regarding the performance of the three methods, the LRE0.1 and RSS methods performed better than the Delta method under all conditions. The LRE0.05 method exceeded the Delta method except in eight situations where the test length was short, the sample size was small, and the proportion of misspecification was high. The RSS method performed better than did the LRE method when the distribution was uniform; by contrast, when the test length was short and the proportion of misspecification was 10%, or when the test length increased and the upper bound of item parameters was large, the LRE method had an advantage over the RSS method. The RSS method seemed preferable when the test length was short and the proportion of misspecification was high; such factors indicate an exceptionally unfavorable situation that occurs only rarely in reality. The RSS method performed slightly better than the LRE method when the upper bound of item parameters was small and the test length was 40.

Study 2: Effect of the Proposed Method on Specifying the Correct Q-Matrix

In Study 1, the performance of the LRE algorithm to refine a Q-matrix was investigated. Study 2 was designed to determine to what extent refinement is performed when the Q-matrix is completely correct.

Design and method

Study 1 showed that a uniform distribution impaired the proposed method. Study 2 therefore assumed K = 5 attributes and a uniform distribution and examined performance under these conditions. Four variables were included in the study design: (a) number of examinees (N = 500, 1,000), (b) number of test items (J = 20, 40), (c) upper bound of the slipping and guessing parameters (0.2, 0.3, 0.4, and 0.5), and (d) three methods (RSS, Delta, and LSE). Generation of correct Q-matrices was the same as that in Study 1.

Forty data sets were simulated for each of the 2 (examinees) × 2 (items) × 4 (upper bound of the slipping and guessing parameters) = 16 combinations. Each data set was analyzed using the LSE0.05 method, LSE0.1 method, Delta method, and RSS method. MRRs and Q-matrix recovery times were computed for each data set and then averaged across the 40 data sets within each design condition.
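The design above can be enumerated in a few lines. The sketch below is illustrative only: it assumes that the recovery rate of one data set is the proportion of q-entries in the refined Q-matrix that match the true Q-matrix, and the helper name `q_entry_recovery_rate` and the toy matrices are hypothetical.

```python
import itertools

import numpy as np

# Design factors of Study 2 (the method factor is applied to every data set).
sample_sizes = [500, 1000]
test_lengths = [20, 40]
upper_bounds = [0.2, 0.3, 0.4, 0.5]
n_replications = 40

conditions = list(itertools.product(sample_sizes, test_lengths, upper_bounds))
total_data_sets = len(conditions) * n_replications


def q_entry_recovery_rate(q_true, q_refined):
    """Proportion of q-entries that agree between two Q-matrices.

    Assumption: this is how the recovery rate of a single data set is scored.
    """
    q_true = np.asarray(q_true)
    q_refined = np.asarray(q_refined)
    return float(np.mean(q_true == q_refined))


# Toy example: 3 items, 2 attributes, one q-entry recovered incorrectly.
q_true = [[1, 0], [0, 1], [1, 1]]
q_refined = [[1, 0], [1, 1], [1, 1]]
rate = q_entry_recovery_rate(q_true, q_refined)

print(len(conditions), total_data_sets, rate)
```

Averaging such rates over the 40 replications of one condition would give the mean recovery rate for that condition.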

Results

Tables 5 and 6 (see the online appendix) present the MRRs and the times of recovery of the correct Q-matrix from the simulated item responses for J = 20 and J = 40, respectively. The standard deviations of the q-entry recovery rates for the methods indicated that the results were reliable.

In contrast to Study 1, test length in Study 2 seemed to have little effect on the three methods. Even with 40 items, the RSS method and LSE method were able to recover 100% of the original correct Q-matrix (as indicated by the bold figures in Tables 5 and 6). However, the upper bounds of item parameters still had a negative effect on the three methods. As the sample size increased, the effects of the upper bounds of item parameters on the three methods became smaller. Specifically, the largest changes in MRRs and mean corrected items in the Q-matrix were 0.1 and 7 when the sample size was 500, and did not exceed 0.05 and 5 when the sample size was 1,000.

As in Study 1, the other two methods outperformed the Delta method on all MRRs. The recovery times and mean corrected items of the other methods were higher than those of the Delta method. A loose threshold can tolerate numerous moderately fit items and take advantage of these items’ information to define the attributes of unfit items; these factors enabled LSE0.1 to outperform LSE0.05. This also showed that a loose threshold can be more efficient and effective than a strict threshold. The performance levels of the LSE0.1 method and the RSS method were similar. When the sample size was 1,000 and the upper bound of parameters was 0.4 or 0.5, the LSE0.1 method outperformed the RSS method (as indicated by the bold and underlined figures in Tables 5 and 6).

Discussion

Accurate Q-matrix specification is a critical component of the assessment development process for researchers and practitioners, because if the Q-matrix is not specified correctly, the inferences drawn from the CDA, and hence the assessment itself, will be invalid. In practice, the Q-matrix for most tests is unknown and usually defined by experts in the field, which may result in a misspecified Q-matrix and thus the possible incorrect classification of examinees. Roberts, Alves, Chu, Thompson, and Gotzmann (2014) found that the Q-matrices specified from students’ verbal reports and by relevant experts were inconsistent. The item fit statistic RMSEA and the corresponding LRE algorithm described in this article offer an alternative solution to Q-matrix definition. Two simulation studies were performed using this algorithm, and they demonstrated that this algorithm is both effective and efficient for recovering the correct Q-matrix from a set of base items or a misspecified Q-matrix under a variety of conditions.

An advantage of the proposed method is that the RMSEA used by the LRE algorithm has its own fitting scope and can be used to detect and define incorrect or ill-fitting item attributes. The refinement process is thus targeted and quick, especially if numerous items in the Q-matrix are already correct. To investigate the effectiveness of the LRE algorithm, Study 1 evaluated the Q-matrix refinement method, which used the RMSEA to detect and modify misspecified q-entries. The Delta method and the RSS method served as benchmarks for gauging the relative advantage of the proposed method. All three methods use statistics to validate the Q-matrix; however, the modifications made by the proposed LSE method were somewhat more effective than those of the nonparametric RSS method, especially when the examinees’ distribution was multivariate normal, and also more effective than those of the DINA-model-based Delta method. This may indicate that the RMSEA is more effective for detecting misfit and modifying errors. The design factors affected all methods in the same direction. Test length and the percentage of misspecified q-entries were found to have opposite effects on the performance of the algorithm: a minimum number of correct items in the Q-matrix must be guaranteed, which implies either a long test or a small percentage of misspecifications. Moreover, the upper bound of item parameters seemed to have more influence on the performance of the RSS and Delta methods, which validate each Q-matrix by minimizing the differences between the observed responses and the ideal responses or by obtaining small slip and guessing parameters. As for the modification threshold of the LSE algorithm, there was no absolute cutoff point. Kunina-Habenicht et al. (2012) reported that, with the same probability of a Type I error, no single cutoff distinguished well-fitted from ill-fitted items across sample sizes and numbers of attributes. An alternative solution is for practitioners to conduct a similar simulation study under the specific conditions of their own assessment to obtain an appropriate threshold. A loose threshold can, however, clearly save time and lead to high algorithm performance when there are only a few mistakes in the Q-matrix. An appropriate threshold not only is useful for the detection and modification of mistakes in the Q-matrix but also helps to avoid unnecessary modifications.

To better understand the performance of the proposed method when no error exists in the Q-matrix, Study 2 examined the extent to which the RMSEA-based procedure modifies a correct Q-matrix. As found in Study 1, the LSE and RSS methods outperformed the Delta method and were able to completely recover the correct Q-matrix in some cases. The LSE method performed even better than the RSS method when the sample size and item parameters were large.
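To make concrete why an item-level RMSEA separates a correct q-vector from a misspecified one, the following sketch computes the statistic for a single DINA item. It is an illustration under stated assumptions rather than the authors' implementation: the RMSEA is taken in the commonly used form, the square root of the class-weighted squared difference between observed and model-implied proportions correct; class membership is treated as known; the observed proportions are noise-free; and the slip and guessing parameters are held fixed rather than re-estimated under the misspecified q-vector. The function names are hypothetical.

```python
import itertools

import numpy as np


def dina_prob(alpha, q, slip, guess):
    """DINA probability of a correct response for each attribute pattern.

    An examinee who has mastered every attribute required by q answers
    correctly with probability 1 - slip; otherwise with probability guess.
    """
    eta = np.all(alpha >= q, axis=1)  # ideal response: all required attributes mastered
    return np.where(eta, 1.0 - slip, guess)


def item_rmsea(class_weights, observed, expected):
    """Square root of the class-weighted squared observed-minus-expected gap."""
    return float(np.sqrt(np.sum(class_weights * (observed - expected) ** 2)))


# All attribute patterns for K = 2 attributes: 00, 01, 10, 11.
alpha = np.array(list(itertools.product([0, 1], repeat=2)))
class_weights = np.full(len(alpha), 0.25)  # uniform distribution over patterns

slip, guess = 0.1, 0.1
q_true = np.array([1, 0])  # the item truly requires attribute 1 only
q_mis = np.array([1, 1])   # misspecified: attribute 2 added by mistake

# Noise-free "observed" proportions generated from the true q-vector.
observed = dina_prob(alpha, q_true, slip, guess)

rmsea_true = item_rmsea(class_weights, observed, dina_prob(alpha, q_true, slip, guess))
rmsea_mis = item_rmsea(class_weights, observed, dina_prob(alpha, q_mis, slip, guess))

print(rmsea_true, rmsea_mis)
```

In this toy case only the pattern that has mastered attribute 1 but not attribute 2 is mispredicted by the misspecified q-vector, and that single class drives the misfit. With real data the proportions are noisy and the item parameters are re-estimated, so the contrast is less sharp, which is why the choice of threshold matters.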

Another advantage of the developed method is that RMSEA is itself a model–data fit statistic, so it can be used to evaluate whether the model chosen for data analysis is suitable, which is a common problem in real-world applications. Future research could explore the usefulness of the RMSEA in choosing the right model, and then refining the Q-matrix with the selected model. The effect of Q-matrix design and misspecification type on the performance of the algorithm should be further investigated. For example, relative to a matrix with one single attribute pattern, a Q-matrix including more than one single attribute pattern has higher classification accuracy (Madison & Bradshaw, 2015), and any single attribute item is less likely to be incorrect.

As a parametric model-based approach to Q-matrix refinement, the LRE algorithm requires neither a large number of examinees nor a great deal of computer time. Therefore, it should be particularly beneficial to medium-sized educational testing programs. The computation of RMSEA within different cognitive diagnosis models is possible using the R package CDM (Robitzsch, Kiefer, George, & Uenlue, 2015) and can satisfy various research-oriented and practical needs. Of course, no matter how well the LRE algorithm performs, it is still a statistical method. Within educational practice, it should be used as a helpful tool and be combined with suggestions from experts when used to specify or validate Q-matrices.

Supplemental Material

Online_Appendix_(3) – Supplemental material for Q-Matrix Refinement Based on Item Fit Statistic RMSEA

Supplemental material, Online_Appendix_(3) for Q-Matrix Refinement Based on Item Fit Statistic RMSEA by Chunhua Kang, Yakun Yang and Pingfei Zeng in Applied Psychological Measurement

Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for insightful comments and valuable suggestions.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Humanities and Social Science Foundation of the Ministry of Education of China (16YJA190002).

Supplemental Material: Supplemental material for this article is available online.

References

  1. Barnes T., Bitzer D., Vouk M. (2005). Experimental analysis of the q-matrix method in knowledge discovery. Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems 2005, Saratoga Springs, NY.
  2. Cen H., Koedinger K., Junker B. (2005). Learning factors analysis – A general method for cognitive model evaluation and improvement. In Ikeda M., Ashley K., Chan T. (Eds.), Intelligent tutoring systems: 8th International Conference (pp. 164-175). Berlin, Germany: Springer.
  3. Chen Y., Liu J., Xu G., Ying Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850-866.
  4. Chiu C. (2013). Statistical refinement of the Q-matrix in cognitive diagnosis. Applied Psychological Measurement, 37, 598-618.
  5. DeCarlo L. T. (2011). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. Applied Psychological Measurement, 35, 8-26.
  6. DeCarlo L. T. (2012). Recognizing uncertainty in the Q-Matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36, 447-468.
  7. de la Torre J. (2008). An empirically based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45, 343-362.
  8. de la Torre J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115-130.
  9. de la Torre J. (2011). The generalized DINA model framework. Psychometrika, 76, 179-199.
  10. de la Torre J., Chiu C.-Y. (2010, April). A general empirical method of Q-matrix validation. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.
  11. de la Torre J., Douglas J. A. (2008). Model evaluation and multiple strategies in cognitive diagnosis: An analysis of fraction subtraction data. Psychometrika, 73, 595-624.
  12. Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258-272.
  13. Kunina-Habenicht O., Rupp A. A., Wilhelm O. (2009). A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory factor analysis and diagnostic classification models. Studies in Educational Evaluation, 35, 64-70.
  14. Kunina-Habenicht O., Rupp A. A., Wilhelm O. (2012). The impact of model misspecification on parameter estimation and item-fit assessment in log-linear diagnostic classification models. Journal of Educational Measurement, 49, 59-81.
  15. Liu J. C., Xu G. J., Ying Z. L. (2012). Theory of the self-learning Q-matrix. Bernoulli, 19, 1790-1817.
  16. Liu J. C., Xu G. J., Ying Z. L. (2012). Data driven learning of Q matrix. Applied Psychological Measurement, 36, 548-564.
  17. Madison M. J., Bradshaw L. P. (2015). The effects of Q-matrix design on classification accuracy in the log-linear cognitive diagnosis model. Educational & Psychological Measurement, 75, 491-511.
  18. Maris E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187-212.
  19. Park Y. S., Lee Y. S. (2014). An extension of the DINA model using covariates: Examining factors affecting response probability and latent classification. Applied Psychological Measurement, 38, 376-390.
  20. Roberts M. R., Alves C. B., Chu M. W., Thompson M., Gotzmann L. M. B. A. (2014). Testing expert-based versus student-based cognitive models for a grade 3 diagnostic mathematics assessment. Applied Measurement in Education, 27, 173-195.
  21. Robitzsch A., Kiefer T., George A. C., Uenlue A. (2015). CDM: Cognitive diagnosis modeling. Retrieved from https://cran.r-project.org/web/packages/CDM/index.html
  22. Rupp A. A., Templin J. L. (2008). The effects of Q-matrix misspecification on parameter estimates and classification accuracy in the DINA model. Educational and Psychological Measurement, 68, 78-96.
  23. Rupp A. A., Templin J. L., Henson R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
  24. Tatsuoka K. K. (1990). Toward an integration of item-response theory and cognitive error diagnosis. In Frederiksen N., Glaser R., Lesgold A., Safto M. (Eds.), Monitoring skills and knowledge acquisition (pp. 453-488). Hillsdale, NJ: Lawrence Erlbaum.
  25. von Davier M. (2005). A general diagnostic model applied to language testing data (Research Report No. RR-05-16). Princeton, NJ: Educational Testing Service.

Articles from Applied Psychological Measurement are provided here courtesy of SAGE Publications