Abstract
Diagnostic classification models have recently gained prominence in educational assessment, psychiatric evaluation, and many other disciplines. Central to the model specification is the so-called Q-matrix, which provides a qualitative specification of the item-attribute relationship. In this paper, we develop identifiability theory for the Q-matrix under the DINA and the DINO models. We further propose an estimation procedure for the Q-matrix through regularized maximum likelihood. This procedure is not limited to the DINA or the DINO model; it can be applied to essentially all Q-matrix based diagnostic classification models. Simulation studies are conducted to illustrate its performance. Furthermore, two case studies are presented. The first is a data set on fraction subtraction (educational application) and the second is a subsample of the National Epidemiological Survey on Alcohol and Related Conditions concerning social anxiety disorder (psychiatric application).
Keywords: Classification and Clustering, Mathematical Statistics, Model Selection/Variable Selection, Categorical Data Analysis
1 Introduction
Cognitive diagnosis has recently gained prominence in educational assessment, psychiatric evaluation, and many other disciplines (Rupp and Templin, 2008b; Rupp et al., 2010). A cognitive diagnostic test, consisting of a set of items, provides each subject with a profile detailing the concepts and skills (often called “attributes”) that he/she masters. For instance, teachers identify students’ mastery of different skills (attributes) based on their solutions (responses) to exam questions (items); psychiatrists/psychologists learn patients’ presence/absence of disorders (attributes) based on their responses to diagnostic questions (items). Various diagnostic classification models (DCM) have been developed in the literature. A short list includes the conjunctive DINA and NIDA models (Junker and Sijtsma, 2001; Tatsuoka, 2002; de la Torre and Douglas, 2004; de la Torre, 2011), the reparameterized unified/fusion model (RUM) (DiBello et al., 1995), the compensatory DINO and NIDO models (Templin and Henson, 2006), the rule space method (Tatsuoka, 1985, 2009), the attribute hierarchy method (Leighton et al., 2004), and Generalized DINA models (de la Torre, 2011); see also Henson et al. (2009); Rupp et al. (2010) for more developments and approaches to cognitive diagnosis. The general diagnostic model (von Davier, 2005, 2008; von Davier and Yamamoto, 2004) provides a framework for the development of diagnostic models.
A common feature of these models is that the probabilistic distribution of subjects’ responses to items is governed by their latent attribute profiles. Upon observing the responses, one can make inferences on the latent attribute profiles. The key component in the model specification is the relationship between the observed item responses and the latent attribute profiles. A central quantity in this specification is the so-called Q-matrix. Suppose that there are J items measuring K attributes. Then, the Q-matrix is a J by K matrix with zero-one entries, each of which indicates whether an item is associated with an attribute. In the statistical analysis of diagnostic classification models, it is customary to work with a prespecified Q-matrix; for instance, an exam maker specifies the set of skills tested by each exam problem (Tatsuoka, 1990). However, such a specification is usually subjective and may not be accurate. Misspecification of the Q-matrix can lead to serious lack of fit and, further, to inaccurate inferences on the latent attribute profiles.
In this paper, we consider an objective construction of the Q-matrix, that is, estimating it based on the data. This estimation problem becomes easy or even trivial if the item responses and the attribute profiles are both observed. However, subjects’ attribute profiles are not directly observed, and their information can only be extracted from item responses. The estimation of the Q-matrix must therefore be based solely on the dependence structure among item responses. Due to the latent nature of the attribute profiles, whether and when the Q-matrix and other model parameters can be estimated consistently from the observed data under various model specifications is a challenging problem. Furthermore, theoretical results on identifiability usually do not imply practically feasible estimation procedures. The construction of an implementable estimation procedure is the second objective of this paper.
Following the above discussion, the main contribution of this paper is two-fold. First, we provide identifiability results for the Q-matrix. As we will specify in the subsequent sections, Q-matrix estimation is equivalent to a latent variable selection problem, and nontrivial conditions are necessary to guarantee consistent identification of the Q-matrix. We present results for both the DINA and the DINO models, two important diagnostic classification models. The theoretical results establish the possibility of estimating the Q-matrix, in particular, the consistency of the maximum likelihood estimator (MLE). However, due to the discrete nature of the Q-matrix, the MLE requires a substantial computational overhead and is practically infeasible. The second contribution of this paper is the proposal of a computationally affordable estimator. Formulating Q-matrix estimation as a latent variable selection problem, we propose an estimation procedure via regularized maximum likelihood. This regularized estimator can be computed by a combination of the expectation-maximization algorithm and the coordinate descent algorithm. We emphasize that the applicability of this estimator is not limited to the DINA or the DINO model for which the theoretical results are developed; it can be applied to a large class of diagnostic classification models.
Statistical inference for the Q-matrix has been a largely unexplored area in the cognitive assessment literature. Nevertheless, there are a few works related to the current one. Identifiability of the Q-matrix for the DINA model under a specific situation is discussed by Liu et al. (2013); the results require complete knowledge of the guessing parameters. The theoretical results in the current paper are a natural extension of Liu et al. (2013) to general DINA models and further to the DINO model. Furthermore, various diagnosis tools and testing procedures have been developed in the literature (de la Torre and Douglas, 2004; Liu et al., 2007; Rupp and Templin, 2008a; de la Torre, 2008), none of which, however, addresses the estimation problem. In addition to the estimation of the Q-matrix, we discuss the estimation of other model parameters. Although there have been results on estimation (Junker, 1999; Rupp and Templin, 2008b; de la Torre, 2009; Rupp et al., 2010), formal statistical analysis, including rigorous results on identifiability and asymptotic properties, has not been developed.
The rest of the paper is organized as follows. We present the theoretical results for the Q-matrix and other model parameters under DINA and DINO models in Section 2. Section 3 presents a computationally affordable estimation procedure based on regularized maximum likelihood. Simulation studies and real data illustrations are presented in Sections 4 and 5. Detailed proofs are provided in the supplemental material.
2 The identifiability results
2.1 Diagnostic classification models
We consider N subjects, each of whom responds to J items. To simplify the discussion, we assume that the responses are all binary; the analysis can be easily adapted to other types of responses. Diagnostic classification models assume that a subject’s responses to items are governed by his/her latent (unobserved) attribute profile, a K-dimensional vector α = (α1, …, αK) with each entry αk ∈ {0, 1}. In the context of educational testing, αk indicates the mastery of skill k. Let R = (R1, …, RJ) denote the vector of responses to the J items. Both α and R are subject-specific, and we will later use subscripts to indicate different subjects; that is, αi and Ri are the latent attribute profile and the response vector of subject i, for i = 1, …, N.
The Q-matrix provides a link between the responses to items and the attributes. In particular, Q = (qjk)J×K is a J × K matrix with binary entries. For each j and k, qjk = 1 means that the response to item j is associated with the presence of attribute k, and qjk = 0 otherwise. The precise relationship depends on the model parameterization.
We use θ as a generic notation for the unknown item parameters additional to the Q-matrix. Given a specific subject’s profile α, the response Rj to item j follows a Bernoulli distribution
$$R_j \mid \alpha \sim \mathrm{Bernoulli}(c_{j,\alpha}), \tag{1}$$

where cj,α is the probability for subjects with attribute profile α to provide a positive response to item j, i.e., cj,α = P(Rj = 1 | α).
The specific form of cj,α additionally depends on the Q-matrix, the item parameter vector θ, and the model parameterization. Conditional on α, (R1, …, RJ) are jointly independent. We further assume that the attribute profiles are i.i.d. following the distribution

$$P(\alpha_i = \alpha) = p_\alpha, \quad \alpha \in \{0, 1\}^K.$$

Let p = (pα : α ∈ {0, 1}^K). In what follows, we present a few examples.
Example 1 (DINA model, Junker and Sijtsma (2001)) For each item j and attribute vector α, we define the ideal response

$$\xi_j(\alpha) = \prod_{k=1}^{K} \alpha_k^{q_{jk}}, \tag{2}$$

that is, an indicator of whether α has all the attributes required by item j. For each item, there are two additional parameters, sj and gj, known as the slipping and guessing parameters. The response probability cj,α takes the form

$$c_{j,\alpha} = (1 - s_j)^{\xi_j(\alpha)}\, g_j^{1 - \xi_j(\alpha)}. \tag{3}$$

If ξj(α) = 1 (the subject is capable of solving the problem), then the positive response probability is 1 − sj; otherwise, the probability is gj. The item parameter vector is θ = {sj, gj : j = 1, ⋯, J}.
The DINA model assumes a conjunctive (non-compensatory) relationship among attributes. It is necessary to possess all the attributes indicated by the Q-matrix to be capable of providing a positive response. In addition, having additional unnecessary attributes does not compensate for the lack of necessary attributes. The DINA model is popular in the educational testing applications and is often employed for modeling exam problem solving processes.
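To make (2) and (3) concrete, the following minimal sketch computes DINA response probabilities for a hypothetical Q-matrix, attribute profile, and parameter values (all numbers are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 3-item, 2-attribute example.
Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])
alpha = np.array([1, 0])          # subject masters attribute 1 only
s = np.array([0.1, 0.2, 0.1])     # slipping parameters s_j
g = np.array([0.2, 0.1, 0.2])     # guessing parameters g_j

# Ideal response (2): xi_j = 1 iff alpha has every attribute required by item j.
xi = np.prod(np.where(Q == 1, alpha, 1), axis=1)

# Response probability (3): 1 - s_j if capable, g_j otherwise.
c = np.where(xi == 1, 1 - s, g)
print(xi, c)   # [1 0 0] [0.9 0.1 0.2]
```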
Example 2 (NIDA model) The NIDA model admits the following form

$$c_{j,\alpha} = \prod_{k=1}^{K} \big[(1 - s_k)^{\alpha_k} g_k^{1 - \alpha_k}\big]^{q_{jk}}.$$

The problem solving involves multiple skills indicated by the Q-matrix. For each required skill k, the student has a certain probability of implementing it correctly: 1 − sk upon mastery and gk upon non-mastery. The problem is solved correctly if all required skills have been implemented correctly by the student, which leads to the above positive response probability.
The following reduced RUM model is also a conjunctive model, and it generalizes the DINA and the NIDA models by allowing the item parameters to vary among attributes.
Example 3 (Reduced NC-RUM model) Under the reduced noncompensatory reparameterized unified model (NC-RUM), we have

$$c_{j,\alpha} = \pi_j \prod_{k=1}^{K} r_{j,k}^{\,q_{jk}(1 - \alpha_k)}, \tag{4}$$

where πj is the correct response probability for subjects who possess all required attributes and rj,k, 0 < rj,k < 1, is the penalty parameter for not possessing the kth attribute. The corresponding item parameters are θ = {πj, rj,k : j = 1, ⋯, J, k = 1, ⋯, K}.
In contrast to the DINA, NIDA, and Reduced NC-RUM models, the following DINO and C-RUM models assume compensatory (non-conjunctive) relationship among attributes, that is, one only needs to possess one of the required attributes to be capable of providing a positive response.
Example 4 (DINO model) The ideal response of the DINO model is given by

$$\xi_j(\alpha) = 1 - \prod_{k=1}^{K} (1 - \alpha_k)^{q_{jk}}, \tag{5}$$

that is, an indicator of whether α has at least one of the attributes required by item j. Similar to the DINA model, the positive response probability is

$$c_{j,\alpha} = (1 - s_j)^{\xi_j(\alpha)}\, g_j^{1 - \xi_j(\alpha)}.$$
The DINO model is the dual model of the DINA model. The DINO model is often employed in the application of psychiatric assessment, for which the positive response to a diagnostic question (item) could be due to the presence of one disorder (attributes) among several.
Example 5 (C-RUM model) A GLM-type parameterization with a logistic link function is used for the compensatory reparameterized unified model (C-RUM), that is,

$$\mathrm{logit}(c_{j,\alpha}) = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk} q_{jk} \alpha_k. \tag{6}$$

The corresponding item parameter vector is θ = {βjk : j = 1, ⋯, J, k = 0, 1, ⋯, K}. The C-RUM is a compensatory model, and one can recognize (6) as a structure appearing in multidimensional item response theory models and in factor analysis.
2.2 Some concepts of identifiability
We consider two matrices Q and Q′ to be equivalent if one can be obtained from the other by rearranging the column order. Each column in the Q-matrix corresponds to an attribute; reordering the columns corresponds to relabeling the attributes and does not change the model. In estimating the Q-matrix, the data contain no information about the specific meaning of each attribute. Therefore, one cannot differentiate Q and Q′ solely based on data if they are identical up to a column permutation. For this reason, we introduce the following equivalence relation.
Definition 1 We write Q ~ Q′ if and only if Q and Q′ have identical column vectors that could be arranged in different orders; otherwise, we write Q ≁ Q′.
Definition 2 We say that Q is identifiable if there exists an estimator Q̂ such that

$$\lim_{N \to \infty} P(\hat{Q} \sim Q) = 1.$$
Given a response vector R = (R1, ⋯, RJ)𝖳, the likelihood function of a diagnostic classification model can be written as

$$L(\theta, p, Q) = P(R \mid \theta, p, Q) = \sum_{\alpha \in \{0,1\}^K} p_{\alpha} \prod_{j=1}^{J} c_{j,\alpha}^{R_j} (1 - c_{j,\alpha})^{1 - R_j}.$$
Definition 3 (Definition 11.2.2 in Casella and Berger (2001)) For a given Q, we say that the model parameters θ and p are identifiable if distinct values of (θ, p) yield different distributions of R, i.e., there is no (θ̃, p̃) ≠ (θ, p) such that L(θ, p, Q) ≡ L(θ̃, p̃, Q) for all R ∈ {0, 1}^J.
Let Q̂ be a consistent estimator. Notice that the Q-matrix is a discrete parameter; the uncertainty of Q̂ in estimating Q is not captured by its standard deviation or by confidence-interval-type statistics. It is more natural to consider the probability P(Q̂ ≁ Q), which is usually very difficult to compute. Nonetheless, it is believed that P(Q̂ ≁ Q) decays exponentially fast as the sample size (the total number of subjects) tends to infinity. We do not pursue this direction in this paper. The parameters θ and p are both continuous. As long as they are identifiable, their analysis falls into the routine inference framework: the maximum likelihood estimator is asymptotically normal, centered around the true value, with covariance matrix equal to the inverse of the Fisher information matrix. In what follows, we present some technical conditions that will be referred to in the subsequent sections.
A1. α1, …, αN are independently and identically distributed random vectors following the distribution P(αi = α) = pα. The population is fully diversified, meaning that pα > 0 for all α.

A2. All items have discriminating power, meaning that 1 − sj > gj for all j.

A3. The true matrix Q0 is complete, meaning that {ek : k = 1, …, K} ⊂ RQ, where RQ is the set of row vectors of Q and ek is the row vector whose k-th element is one and whose other elements are zero.

A4. Each attribute is required by at least two items, that is, $\sum_{j=1}^{J} q_{jk} \geq 2$ for all k.
The completeness of the Q-matrix requires that for each attribute there exists at least one item requiring only that attribute. If Q is complete, then we can rearrange the row and column orders (corresponding to reordering the items and attributes) such that it takes the following form

$$Q = \begin{pmatrix} \mathcal{I}_K \\ Q^{*} \end{pmatrix}, \tag{7}$$

where the matrix ℐK is the K × K identity matrix and Q* contains the rows of the remaining J − K items. Completeness is an important assumption throughout the subsequent discussion. Without loss of generality, we assume that the rows and columns of the Q-matrix have been rearranged such that it takes the above form.
Remark 1 One of the main objectives of cognitive diagnosis is to identify subjects’ attribute profiles. It has been established that completeness is a sufficient and necessary condition for a set of items to consistently identify all types of attribute profiles for the DINA model when the slipping and the guessing parameters are both zero. It is usually recommended to use a complete Q-matrix. More discussions regarding this issue can be found in Chiu et al. (2009) and Liu et al. (2013).
2.3 Identifiability of Q-matrix for the DINA and the DINO model
We consider the models in Examples 1 and 4 and start the discussion by citing the main result of Liu et al. (2013).
Theorem 1 (Theorem 4.2, Liu et al. (2013)) For the DINA model, if the guessing parameters gj’s are known, under Conditions A1, A2, and A3, the Q-matrix is identifiable.
The first result in this paper generalizes Theorem 1 to the DINO model with a known slipping parameter. In addition, we provide sufficient and necessary conditions for the identifiability of the slipping and guessing parameters.
Theorem 2 For the DINO model with known slipping parameters, under Conditions A1, A2, and A3, the Q-matrix is identifiable; the guessing parameters gj and the attribute population p are identifiable if and only if Condition A4 holds.
Furthermore, under the setting of Theorem 1, the slipping parameters sj and the attribute population parameter p are identifiable if and only if Condition A4 holds.
Theorems 1 and 2 require knowledge of the guessing parameters (the DINA model) or the slipping parameters (the DINO model). They are applicable in certain situations. In the educational testing context, some test problems are difficult to guess; for instance, the guessing probability of “879 × 234 = ?” is essentially zero. For multiple choice problems, if all the choices look “equally correct,” then the guessing probability may be set to one over the number of choices.
We further extend the results to the situation when neither the slipping nor the guessing parameters are known, for which additional conditions are required.
A5. Each attribute of the Q-matrix is associated with at least three items, that is, $\sum_{j=1}^{J} q_{jk} \geq 3$ for all k.

A6. Q has two complete submatrices, that is, for each attribute, there exist at least two items requiring only that attribute. If so, we can appropriately arrange the columns and rows such that

$$Q = \begin{pmatrix} \mathcal{I}_K \\ \mathcal{I}_K \\ Q^{**} \end{pmatrix}, \tag{8}$$

where Q** contains the rows of the remaining J − 2K items.
Theorem 3 Under the DINA and DINO models with (s, g, p) unknown, if Conditions A1, A2, A5, and A6 hold, then Q is identifiable, i.e., one can construct an estimator Q̂ such that for all (s, g, p),

$$\lim_{N \to \infty} P(\hat{Q} \sim Q) = 1.$$
Theorem 4 Suppose that Conditions A1, A2, A5, and A6 hold. Then s, g, and p are all identifiable.
Theorems 3 and 4 state the identifiability results for Q and the other model parameters. They are nontrivial generalizations of Theorems 1 and 2. As mentioned in the previous section, given that s, g, and p are identifiable, their estimation falls into routine analysis: the maximum likelihood estimator and generalized estimating equation estimators are asymptotically multivariate normal, centered around the true values, and their variances can be estimated either by the inverse Fisher information matrix or by sandwich variance estimators.
The identifiability results for Q only state the existence of a consistent estimator. We present the following corollary that the maximum likelihood estimator is consistent under the conditions required by the above theorems. The maximum likelihood estimator (MLE) takes the following form

$$\hat{Q}_{\mathrm{MLE}} = \arg\max_{Q} \sup_{s, g, p} L_N(s, g, p, Q), \tag{9}$$

where

$$L_N(s, g, p, Q) = \prod_{i=1}^{N} L(s, g, p, Q; R_i)$$

is the likelihood of the N independent response vectors.
Corollary 1 Under the conditions of Theorem 3, Q̂MLE is consistent. Moreover, the maximum likelihood estimators of s, g, and p,

$$(\hat{s}, \hat{g}, \hat{p}) = \arg\max_{s, g, p} L_N(s, g, p, \hat{Q}_{\mathrm{MLE}}), \tag{10}$$

are asymptotically normal, centered at the true parameters, with variance given by the inverse Fisher information matrix.
Proof of Corollary 1. Based on the results and proofs of Theorems 3 and 4, this corollary follows straightforwardly by a Taylor expansion of the likelihood. We therefore omit the details.
To compute the maximum likelihood estimator Q̂MLE, one needs to evaluate the profile likelihood, sups,g,p LN(s, g, p,Q), for all possible J by K matrices with binary entries. The computation of Q̂MLE induces a substantial overhead and is practically impossible to carry out. In the following section, we present a computationally feasible estimator via the regularized maximum likelihood estimator.
Remark 2 The identifiability results are developed under the situation when there is no information about Q at all. In practice, partial information about the Q-matrix is usually available. For instance, a submatrix for some items (rows) is known and the rest needs to be estimated. This happens when new items are to be calibrated based on existing ones. Sometimes, a submatrix is known for some attributes (columns) and that corresponding to other attributes needs to be learned. This happens when some attributes are concrete and easily recognizable in a given item and the others are subtle and not obvious. Under such circumstances, the Q-matrix is easier to estimate and the identifiability conditions are weaker than those in Theorem 3. We do not pursue the partial information situation in this paper.
Remark 3 The equivalence relation “~” defines the finest equivalence classes up to which Q can be estimated from the data without the assistance of prior knowledge. In this sense, Theorem 4 provides the strongest type of identifiability, and in turn it also requires some restrictive conditions. For instance, Condition A6 is sometimes difficult to satisfy in practice, and it usually leads to some oversimplified items, especially when the number of attributes K is large. In that case, the Q-matrix can only be identified up to some weaker equivalence classes. We leave this investigation for future study.
3 Q-matrix estimation via a regularized likelihood
3.1 Alternative representation of diagnostic classification models via generalized linear models
We first formulate the Q-matrix estimation as a latent variable selection problem and then construct a computationally feasible estimator via regularized maximum likelihood, for which there is a large body of literature (Tibshirani, 1996, 1997; Fan and Li, 2001). The applicability of this estimator is not limited to the DINA or the DINO model; it can be applied to essentially all Q-matrix based diagnostic classification models in use. A short list includes DINA-type models (such as the DINA and HO-DINA models), RUM-type models (such as the NC-RUM, reduced NC-RUM, and C-RUM), and saturated models such as the log-linear cognitive diagnosis model (LCDM) and the generalized DINA model (Henson et al., 2009; Rupp et al., 2010; de la Torre, 2011).
In the model specification, the key element is the mapping from a latent attribute profile α to a positive response probability cj,α, which additionally depends on the Q-matrix and other model parameters. To motivate the general alternative representation with the DINA model, we consider the following equivalent representation of (3):

$$c_{j,\alpha} = \mathrm{logit}^{-1}\Big(\beta_{j0} + \sum_{k=1}^{K}\beta_{jk}\alpha_k + \sum_{k_1 < k_2}\beta_{j,k_1 k_2}\,\alpha_{k_1}\alpha_{k_2} + \cdots + \beta_{j,12\cdots K}\,\alpha_1\alpha_2\cdots\alpha_K\Big), \tag{11}$$

where logit(p) = log{p/(1 − p)} for p ∈ (0, 1). The argument of the logit-inverse function is a function of α = (α1, …, αK) including interactions of all orders. Notice that the response to item j is determined by the underlying attribute profile α. Thus, the above generalized linear representation of cj,α is a saturated model; that is, every diagnostic classification model admitting a K-dimensional attribute profile is a special case of (11).
In what follows, we explain the adaptation of (11) to the DINA model and further to a Q-matrix. The response distribution for each item under the DINA model is either Bernoulli(1 − sj) or Bernoulli(gj), depending on the ideal response ξj. Suppose that item j requires attributes 1, 2, …, Kj, that is, qjk = 1 for all 1 ≤ k ≤ Kj and 0 otherwise. Then, the positive response probability (3) can be written as

$$c_{j,\alpha} = \mathrm{logit}^{-1}\Big(\beta_{j0} + \beta_{j,12\cdots K_j} \prod_{k=1}^{K_j} \alpha_k\Big), \quad \beta_{j0} = \mathrm{logit}(g_j), \quad \beta_{j0} + \beta_{j,12\cdots K_j} = \mathrm{logit}(1 - s_j).$$

Thus, if αk = 1 for all 1 ≤ k ≤ Kj, then cj,α = 1 − sj; otherwise, cj,α = gj. Generally speaking, if an item requires attributes k1, …, kl, then the coefficients βj0 and βj,k1⋯kl are non-zero and all other coefficients are zero. Therefore, each row vector of the Q-matrix, corresponding to the attribute requirement of one item, maps to two non-zero β-coefficients. One of these two coefficients is the intercept and the other is the coefficient of the product of all the attributes required by that item.
Therefore, each Q-matrix corresponds to a non-zero pattern of the regression coefficients in (11), and estimating the Q-matrix is equivalent to identifying the non-zero regression coefficients. There is a vast literature on variable and model selection, most of it developed for linear and generalized linear models. Technically speaking, (11) is a generalized linear mixed model with α1, …, αK and their interactions being the random covariates and β being the regression coefficients. We will employ variable selection methods for the Q-matrix estimation.
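As an illustration of this correspondence, the sketch below builds the saturated design vector of (11) for K = 3 and shows that a hypothetical DINA item requiring attributes 1 and 2 activates exactly two coefficients: the intercept and the coefficient of α1α2. All numbers are illustrative.

```python
import itertools
import numpy as np

K = 3
# One subset of {1,...,K} per coordinate; h(alpha) holds the product over each subset.
subsets = [S for r in range(K + 1) for S in itertools.combinations(range(K), r)]

def h(alpha):
    """Saturated design vector of (11): prod_{k in S} alpha_k for every subset S."""
    return np.array([np.prod([alpha[k] for k in S]) for S in subsets])

def logit(p):
    return np.log(p / (1 - p))

# Hypothetical DINA item requiring attributes 1 and 2: q_j = (1, 1, 0), s = 0.1, g = 0.2.
s, g = 0.1, 0.2
beta = np.zeros(2 ** K)
beta[subsets.index(())] = logit(g)                     # intercept: logit(g_j)
beta[subsets.index((0, 1))] = logit(1 - s) - logit(g)  # full required interaction

for alpha in itertools.product([0, 1], repeat=K):
    c = 1 / (1 + np.exp(-beta @ h(alpha)))
    print(alpha, round(float(c), 2))   # 0.9 iff alpha_1 = alpha_2 = 1, else 0.2
```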
Notice that the current setup differs from the regular regression setting in that the covariates αi are not directly observed; the variables to be selected are therefore all latent. The results in the previous section establish sufficient conditions under which the latent variables can be consistently selected, and the validity of the methods proposed in this section rests on those theoretical results. We propose the use of the regularized maximum likelihood estimator. In doing so, we first present the general form of diagnostic classification models. For each item j, the positive response probability given the latent attribute profile admits the following generalized linear form
$$g(c_{j,\alpha}) = \beta_j^{\top} h(\alpha), \tag{12}$$

where g is a pre-specified link function, βj is a 2^K-dimensional parameter (column) vector, and h(α) is a 2^K-dimensional covariate (column) vector including all the necessary interaction terms. For instance, in representation (11), h(α) is the vector containing 1, α1, α2, …, αK, and their interactions of all orders: α1α2, α1α3, and so on. For different diagnostic classification models, we may choose different h(α) so that the coefficients correspond directly to a Q-matrix. Examples will be given in the sequel. The likelihood function upon observing αi for each subject is
$$L(\beta_1, \ldots, \beta_J, p \,;\, R_i, \alpha_i, i = 1, \ldots, N) = \prod_{i=1}^{N} p_{\alpha_i} \prod_{j=1}^{J} c_{j,\alpha_i}^{R_{ij}} (1 - c_{j,\alpha_i})^{1 - R_{ij}}, \tag{13}$$

where cj,α is given by (12) and Rij denotes the response of subject i to item j. Notice that the αi are i.i.d. following distribution pα. Then, the observed data likelihood is

$$L(\beta_1, \ldots, \beta_J, p \,;\, R_i, i = 1, \ldots, N) = \prod_{i=1}^{N} \sum_{\alpha \in \{0,1\}^K} p_{\alpha} \prod_{j=1}^{J} c_{j,\alpha}^{R_{ij}} (1 - c_{j,\alpha})^{1 - R_{ij}}. \tag{14}$$
To simplify the notation, we use L to denote both the observed and the complete data likelihood (with different arguments) when there is no ambiguity. A regularized maximum likelihood estimator of the β-coefficients is given by
$$(\hat{\beta}_1, \ldots, \hat{\beta}_J, \hat{p}) = \arg\max_{\beta_1, \ldots, \beta_J,\, p} \Big\{ \log L(\beta_1, \ldots, \beta_J, p) - N \sum_{j=1}^{J} p_{\lambda_j}(\beta_j) \Big\}, \tag{15}$$

where pλj is some penalty function and λj is the regularization parameter. In this paper, we choose pλ to be either the L1 penalty or the SCAD penalty (Fan and Li, 2001). In particular, to apply the L1 penalty, we let

$$p_{\lambda}(\beta) = \lambda \sum_{k} |\beta_k|,$$

where the sum ranges over the components of β = (β1, β2, …); to apply the SCAD penalty, we let

$$p_{\lambda}(\beta) = \sum_{k} \tilde{p}_{\lambda}(\beta_k).$$

The function p̃λ is defined as p̃λ(0) = 0 and

$$\tilde{p}_{\lambda}'(x) = \lambda \Big\{ I(x \leq \lambda) + \frac{(a\lambda - x)_+}{(a - 1)\lambda} I(x > \lambda) \Big\}$$

for x > 0; for x < 0, the function is symmetric, p̃λ(x) = p̃λ(−x). There is an additional parameter a, which is chosen to be a = 3.7 as suggested by Fan and Li (2001).
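For reference, integrating the derivative above yields the well-known closed form of the SCAD penalty (Fan and Li, 2001); a minimal sketch:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """Closed-form SCAD penalty, obtained by integrating the derivative
    given in the text; symmetric in t."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,                                            # L1-like near zero
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),  # quadratic blend
            lam**2 * (a + 1) / 2,                           # constant: large t unshrunk
        ),
    )

# The penalty flattens out, so large coefficients are not shrunk (unlike L1):
print(scad_penalty(np.array([0.5, 1.0, 5.0]), lam=1.0))  # [0.5, 1.0, 2.35]
```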
On the consistency of the regularized estimator
A natural issue is whether the consistency results developed in the previous section apply to the regularized estimator. The consistency of the regularized estimator can be established by means of techniques developed in the literature (Zhao and Yu, 2006; Fan and Lv, 2011; Fan and Li, 2001); we therefore only provide an outline and omit the details. First of all, the parameter dimension is fixed and the sample size grows large. The regularization parameter is chosen such that λj → 0 and $\sqrt{N}\lambda_j \to \infty$ as N → ∞. For the DINA (or DINO) model, let Q1 and Q2 be two matrices. If Q1 ≁ Q2, the identifiability results in the previous section ensure that the two families of distributions under the different Q’s are separated. Thus, with probability tending to one, the true matrix Q is the global maximizer of the profiled likelihood. Since λj = o(1) and the penalty term is of order o(N), the results in the previous section imply that the maximizer of the regularized likelihood must lie within ε distance of the true value; that is, the consistency results localize the regularized estimator to a small neighborhood of the truth. The oracle properties of the L1 and SCAD regularized estimators are developed for maximizing the penalized likelihood function locally around the true model parameters (Zhao and Yu, 2006; Fan and Lv, 2011; Fan and Li, 2001). Thus, combining the global results (Q-matrix identifiability) and the local results (oracle properties of the local penalized likelihood maximizer), we obtain that the regularized estimators enjoy the oracle property in estimating the Q-matrix under the identifiability conditions of the previous section. We mention that for the L1 regularized estimator, an irrepresentable condition on the Fisher information matrix is needed to ensure the oracle property (Zhao and Yu, 2006).
For other DCMs, such as the NIDA, reduced NC-RUM, and C-RUM models, whose representations are presented in the next subsection, the families of response distributions may be nested across different Q’s. Then, the consistency of the regularized estimator can be developed similarly to that for generalized linear models or generic likelihood functions, given that Q is identifiable and the regularization parameter λj is chosen carefully such that λj → 0 and $\sqrt{N}\lambda_j \to \infty$ as N → ∞. Further discussion on the choice of λj is provided later in the discussion section.
3.2 Reparameterization for other diagnostic classification models
We present a few more examples mentioned previously. For each of them, we present the link function g, h(α), and the non-zero pattern of the β-coefficients corresponding to each Q-matrix.
DINO model
For the DINO model, for an item requiring attributes 1, …, Kj, we write the positive response probability as

$$\mathrm{logit}(c_{j,\alpha}) = \beta_{j0} + \beta_{j,12\cdots K_j} \prod_{k=1}^{K_j} (1 - \alpha_k), \quad \beta_{j0} = \mathrm{logit}(1 - s_j), \quad \beta_{j0} + \beta_{j,12\cdots K_j} = \mathrm{logit}(g_j),$$

where h(α) now collects 1 and the products of the terms (1 − αk). Similar to the DINA model, each row of the Q-matrix, corresponding to one item, maps to two non-zero coefficients. One is the intercept βj0 and the other corresponds to the interaction of all the attributes required by the Q-matrix.
NIDA model
The positive response probability can be written, with a log link, as

$$\log(c_{j,\alpha}) = \sum_{k=1}^{K} q_{jk}\big\{\alpha_k \log(1 - s_k) + (1 - \alpha_k)\log g_k\big\} = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\alpha_k,$$

where βj0 = Σk qjk log gk and βjk = qjk log{(1 − sk)/gk}. Then, the corresponding Q-matrix entries are given by qjk = I(βjk ≠ 0). Unlike the DINA and the DINO models, the number of non-zero coefficients for each item is not fixed at two.
Reduced NC-RUM
This model is very similar to the NIDA model. The positive response probability can be written as

$$\log(c_{j,\alpha}) = \log \pi_j + \sum_{k=1}^{K} q_{jk}(1 - \alpha_k)\log r_{j,k} = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\alpha_k,$$

where βj0 = log πj + Σk qjk log rj,k and βjk = −qjk log rj,k, and qjk = I(βjk ≠ 0).
C-RUM
The positive response probability can be written as

$$\mathrm{logit}(c_{j,\alpha}) = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\alpha_k,$$

and qjk = I(βjk ≠ 0).
In summary, all the diagnostic classification models above admit the generalized linear form (12). Furthermore, each Q-matrix corresponds to a nonzero pattern of the regression coefficients, so the regularized estimator has wide applicability.
3.3 Computation via EM algorithm
The advantage of the regularized maximum likelihood estimation for the Q-matrix lies in computation. As mentioned previously, the computation of Q̂MLE in (9) requires evaluating the profiled likelihood for all possible Q-matrices, and there are 2^{J×K} such matrices. This is computationally infeasible even for practically small J and K. The computation of (15) can be done by combining the expectation-maximization (EM) algorithm and the coordinate descent algorithm. In particular, we view α as the missing data following the prior distribution pα. The EM algorithm consists of two steps. The E-step computes the function

$$H(\beta_1, \ldots, \beta_J, p) = E\big[\log L(\beta_1, \ldots, \beta_J, p \,;\, R_i, \alpha_i, i = 1, \ldots, N)\big],$$

where the expectation is taken with respect to αi, i = 1, …, N, under the posterior distribution P(· | Ri, i = 1, …, N, β1, …, βJ, pα) evaluated at the current parameter values. The E-step is a closed-form computation. First, the complete-data log-likelihood function is additive:

$$\log L = \sum_{i=1}^{N}\sum_{j=1}^{J}\big\{R_{ij}\log c_{j,\alpha_i} + (1 - R_{ij})\log(1 - c_{j,\alpha_i})\big\} + \sum_{i=1}^{N}\log p_{\alpha_i}.$$

Furthermore, under the posterior distribution, α1, …, αN are jointly independent. Therefore, one only needs to evaluate

$$E\big[R_{ij}\log c_{j,\alpha_i} + (1 - R_{ij})\log(1 - c_{j,\alpha_i}) \,\big|\, R_i\big] \quad \text{and} \quad E\big[\log p_{\alpha_i} \,\big|\, R_i\big]$$

for each i = 1, …, N and j = 1, …, J. Notice that α is a discrete random variable taking values in {0, 1}^K. Therefore, the posterior distribution of each αi can be computed exactly, and the complexity of each conditional expectation is of order 2^K, which is manageable for K up to about 10, already a very high dimension for diagnostic classification models in practice. The overall computational complexity of the E-step is therefore O(NJ2^K).
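A minimal sketch of this exact E-step, vectorized over subjects and profiles; the toy inputs are hypothetical, and C stacks cj,α over all 2^K profiles:

```python
import itertools
import numpy as np

def e_step(R, C, p):
    """Exact E-step posteriors P(alpha_i = alpha | R_i).

    R : (N, J) binary response matrix
    C : (M, J) matrix of c_{j,alpha} for every profile alpha, M = 2^K
    p : (M,)   current attribute-population probabilities p_alpha
    Returns the (N, M) matrix of posterior probabilities.
    """
    # log P(R_i | alpha) for every subject/profile pair (Bernoulli likelihood)
    log_lik = R @ np.log(C.T) + (1 - R) @ np.log(1 - C.T)   # (N, M)
    log_post = log_lik + np.log(p)
    log_post -= log_post.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical toy example: K = 2 attributes, J = 2 items, N = 3 subjects.
profiles = list(itertools.product([0, 1], repeat=2))  # row order of C and p
C = np.array([[0.2, 0.2], [0.2, 0.8], [0.8, 0.2], [0.8, 0.8]])
p = np.full(4, 0.25)
R = np.array([[1, 1], [0, 0], [1, 0]])
print(e_step(R, C, p))
```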
The M-step maximizes the H-function with the penalty term, that is,

$$\max_{\beta_1, \ldots, \beta_J,\, p} \Big\{ H(\beta_1, \ldots, \beta_J, p) - N \sum_{j=1}^{J} p_{\lambda_j}(\beta_j) \Big\}.$$

Before applying the coordinate descent algorithm, we further reduce the dimension. The objective function separates across items: for each j, the term

$$\sum_{i=1}^{N} E\big[R_{ij}\log c_{j,\alpha_i} + (1 - R_{ij})\log(1 - c_{j,\alpha_i}) \,\big|\, R_i\big] - N p_{\lambda_j}(\beta_j)$$

involves only βj. Thus, the M-step can be carried out by maximizing each such term independently. Each βj has 2^K coordinates, and we apply the coordinate descent algorithm (developed for generalized linear models) to maximize the above function for each j; for details about this algorithm, see Friedman et al. (2010). Furthermore, pα is updated by

$$\hat{p}_{\alpha} = \frac{1}{N}\sum_{i=1}^{N} P(\alpha_i = \alpha \mid R_i, \beta_1, \ldots, \beta_J, p).$$
The EM algorithm guarantees a monotonically increasing objective function. However, there is no guarantee that the algorithm converges to the global maximum. We empirically found that the algorithm sometimes does stop at a local maximum, especially when λ is large. Therefore, we suggest applying the algorithm with different starting points and selecting the best solution.
3.4 Further discussions
The theory suggests that the regularization parameter λ be chosen such that λ → 0 and $\sqrt{N}\lambda \to \infty$, which leaves a wide range of valid choices. For specific diagnostic classification models, we may have more specific choices of λ. For the DINA and the DINO models, each row of the Q-matrix, corresponding to the attribute requirement of one item, maps to two non-zero coefficients. Therefore, we may choose λj for each item separately, such that the resulting coefficient vector βj has exactly two non-zero elements.
The NIDA, NC-RUM, and C-RUM models do not admit a fixed number of non-zero coefficients for each item. To simplify the problem, instead of using item-specific regularization parameters, we choose a single regularization parameter for all items. Furthermore, we build a solution path over different λ. Thus, instead of providing one estimate of the Q-matrix, we obtain a set of estimated Q-matrices corresponding to different values of λ. We may further investigate these matrices for validation based on our knowledge of the item-attribute relationship. In case one does not have enough knowledge, one may choose λ via standard information criteria; for instance, we may choose the λ whose resulting selection of latent variables admits the smallest BIC.
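A sketch of the BIC-based choice of λ over a solution path; fit_model and log_likelihood are hypothetical helpers standing in for the EM routine of Section 3.3 and the observed-data likelihood (14).

```python
import numpy as np

def select_lambda(R, lambdas, fit_model, log_likelihood):
    """Pick lambda by BIC over a grid (solution path).

    fit_model(R, lam) -> (beta, p) and log_likelihood(R, beta, p) are
    assumed callables provided by the user; they are not defined here.
    """
    N = R.shape[0]
    best = None
    for lam in lambdas:
        beta, p = fit_model(R, lam)
        k = np.count_nonzero(beta)                   # number of selected latent effects
        bic = -2.0 * log_likelihood(R, beta, p) + k * np.log(N)
        if best is None or bic < best[0]:
            best = (bic, lam, beta, p)
    return best   # (BIC value, chosen lambda, coefficients, attribute distribution)
```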
4 Simulation study
In this section, simulation studies are conducted to illustrate the performance of the proposed method. The DINO model is mathematically equivalent to the DINA model (Proposition 1), so we only provide results for the DINA model. Data are generated from the DINA model under different settings, and the estimated Q-matrix is compared with the true Q-matrix. Two simulation studies are conducted: one in which the attributes α1, …, αK are independent and one in which they are dependent. In both, all model parameters are treated as unknown, including the Q-matrix, the attribute distribution, and the slipping and guessing parameters.
4.1 Study 1: independent attributes
Attribute profiles are generated from the uniform distribution over {0, 1}^K, that is, pα = 2^{−K} for every α. We consider the cases K = 3 and K = 4 with J = 18 items, with corresponding Q-matrices Q1 (for K = 3) and Q2 (for K = 4). These two matrices are chosen such that the identifiability conditions are satisfied. The slipping and guessing parameters are set to 0.2 but are treated as unknown when estimating Q. All other conditions are also satisfied. For each Q, we consider sample sizes N = 500, 1000, 2000, and 4000. For each combination of Q and N, 100 independent data sets are generated to evaluate the performance.
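For concreteness, a sketch of the Study 1 data-generating process under the DINA model (uniform attribute profiles, s = g = 0.2); the small Q below is illustrative, since Q1 and Q2 are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dina(N, Q, s=0.2, g=0.2):
    """Generate DINA responses with uniform attribute profiles on {0,1}^K."""
    J, K = Q.shape
    alpha = rng.integers(0, 2, size=(N, K))           # p_alpha = 2^{-K}
    xi = (alpha @ Q.T == Q.sum(axis=1)).astype(int)   # ideal responses, eq. (2)
    c = np.where(xi == 1, 1 - s, g)                   # response probabilities (3)
    return (rng.random((N, J)) < c).astype(int), alpha

# A small illustrative Q-matrix (5 items, 3 attributes).
Q = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
R, alpha = simulate_dina(N=1000, Q=Q)
```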
L1 regularized estimator
The simulation results of the L1 regularized estimator are summarized in Tables 1, 2, and 3. According to Table 1, for both K = 3 and K = 4, our method recovers the Q-matrix almost without error once the sample size reaches 4000. In addition, the higher the dimension, the more difficult the problem. Furthermore, in the cases where the estimator misses the true Q-matrix, Q̂ differs from the truth by only one or two rows. We take a closer look in Table 2, which reports the proportion of entries correctly specified by Q̂.
Table 1. Numbers of data sets (out of 100) with Q̂ = Q / Q̂ ≠ Q, L1 regularized estimator.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| K = 3 | 38 / 62 | 81 / 19 | 98 / 2 | 100 / 0 |
| K = 4 | 20 / 80 | 48 / 52 | 77 / 23 | 99 / 1 |
Table 2. Proportion of entries of Q correctly specified by Q̂, L1 regularized estimator.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| K = 3 | 98.1% | 99.6% | 100.0% | 100.0% |
| K = 4 | 97.7% | 98.9% | 99.6% | 100.0% |
Table 3. Recovery of submatrices of Q1 and Q2 (counts out of 100), L1 regularized estimator.

| Q1 | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| Q̂1:15 = Q1:15 | 100 | 100 | 100 | 100 |
| Q̂1:15 ≠ Q1:15 | 0 | 0 | 0 | 0 |
| Q̂16:18 = Q16:18 | 38 | 81 | 98 | 100 |
| Q̂16:18 ≠ Q16:18 | 62 | 19 | 2 | 0 |

| Q2 | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| Q̂1:14 = Q1:14 | 98 | 100 | 100 | 100 |
| Q̂1:14 ≠ Q1:14 | 2 | 0 | 0 | 0 |
| Q̂15:18 = Q15:18 | 20 | 48 | 77 | 99 |
| Q̂15:18 ≠ Q15:18 | 80 | 52 | 23 | 1 |
We empirically found that the row vectors of Q1 and Q2 that require three or four attributes (rows 16 to 18 in Q1 and rows 15 to 18 in Q2) are much more difficult to estimate than the others. This phenomenon is reflected in Table 3, in which the notation QI1:I2 denotes the submatrix of Q containing rows I1 through I2. In fact, for all simulations in this study, most misspecifications occur in the submatrices of Q1 and Q2 whose corresponding items require three attributes or more.
SCAD estimator
Under the same setting, we investigate the SCAD estimator. The results are summarized in Tables 4 and 5. Comparing Tables 1 and 4, the SCAD estimator performs better than the L1 regularized estimator.
Table 4. Numbers of data sets (out of 100) with Q̂ = Q / Q̂ ≠ Q, SCAD estimator.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| K = 3 | 98 / 2 | 100 / 0 | 100 / 0 | 100 / 0 |
| K = 4 | 30 / 70 | 96 / 4 | 100 / 0 | 100 / 0 |
Table 5. Proportion of entries of Q correctly specified by Q̂, SCAD estimator.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| K = 3 | 99.9% | 100.0% | 100.0% | 100.0% |
| K = 4 | 97.6% | 99.9% | 100.0% | 100.0% |
4.2 Study 2: dependent attributes
For each subject, we generate θ = (θ1, ⋯, θK) from a multivariate normal distribution N(0, Σ), where the covariance matrix Σ has unit variances and a common correlation ρ, that is,

$$\Sigma = (1 - \rho)\,\mathcal{I}_K + \rho\,\mathbf{1}\mathbf{1}^{\top},$$

where 1 is the vector of ones and ℐK is the K by K identity matrix. We consider the situations ρ = 0.05, 0.15, and 0.25. The attribute profile α is then given by

$$\alpha_k = I(\theta_k > 0), \quad k = 1, \ldots, K.$$

We consider K = 3 with Q1 as the Q-matrix. Table 6 shows the resulting probability distribution pα. The slipping and guessing parameters remain 0.2. The rest of the setting is the same as in Study 1.
Table 6. Attribute profile distribution pα induced by the dependent attribute generation.

| Class | (0,0,0) | (1,0,0) | (0,1,0) | (1,1,0) | (0,0,1) | (1,0,1) | (0,1,1) | (1,1,1) |
|---|---|---|---|---|---|---|---|---|
| ρ = 0.05 | 0.137 | 0.121 | 0.121 | 0.121 | 0.121 | 0.121 | 0.121 | 0.137 |
| ρ = 0.15 | 0.161 | 0.113 | 0.113 | 0.113 | 0.113 | 0.113 | 0.113 | 0.161 |
| ρ = 0.25 | 0.185 | 0.105 | 0.105 | 0.105 | 0.105 | 0.105 | 0.105 | 0.185 |
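A sketch of the Study 2 attribute generation; the zero threshold is a reconstruction consistent with the cell probabilities in Table 6 (for example, P(α = (0,0,0)) = 1/8 + 3 arcsin(ρ)/(4π) ≈ 0.161 at ρ = 0.15).

```python
import numpy as np

rng = np.random.default_rng(0)

def dependent_attributes(N, K, rho):
    """theta ~ N(0, (1 - rho) I_K + rho 11'); alpha_k = I(theta_k > 0)."""
    Sigma = (1 - rho) * np.eye(K) + rho * np.ones((K, K))
    theta = rng.multivariate_normal(np.zeros(K), Sigma, size=N)
    return (theta > 0).astype(int)

# Empirical check against Table 6: P(alpha = (0,0,0)) is about 0.161 when rho = 0.15.
alpha = dependent_attributes(100000, K=3, rho=0.15)
print(np.all(alpha == 0, axis=1).mean())
```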
L1 regularized estimator
The simulation results of the L1 regularized estimator are summarized in Tables 7 and 8. Based on Table 7, the estimation accuracy improves as the sample size increases. We also observe that the proposed algorithm performs better as ρ increases. A heuristic interpretation is as follows. A row vector of Q tends to be more difficult to estimate when the numbers of subjects who are and are not capable of answering the corresponding item are unbalanced. The row vector (1, 1, 1) is the most difficult to estimate because only subjects with attribute profile (1, 1, 1) can solve the corresponding items while all other subjects cannot. According to Table 6, as ρ increases, the proportion of subjects with attribute profile (1, 1, 1) increases, which explains the improvement in performance. In fact, similar to the case of independent αi’s, in most simulations where Q̂ misses the truth, Q̂ differs from the truth at the row vectors whose true value is (1, 1, 1).
Table 7. Numbers of data sets (out of 100) with Q̂ = Q / Q̂ ≠ Q, L1 regularized estimator, dependent attributes.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| ρ = 0.05 | 54 / 46 | 87 / 13 | 99 / 1 | 100 / 0 |
| ρ = 0.15 | 67 / 33 | 93 / 7 | 100 / 0 | 100 / 0 |
| ρ = 0.25 | 76 / 24 | 95 / 5 | 100 / 0 | 100 / 0 |
Table 8. Proportion of entries of Q correctly specified by Q̂, L1 regularized estimator, dependent attributes.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| ρ = 0.05 | 98.5% | 99.7% | 100.0% | 100.0% |
| ρ = 0.15 | 99.2% | 99.8% | 100.0% | 100.0% |
| ρ = 0.25 | 99.4% | 99.9% | 100.0% | 100.0% |
SCAD estimator
Under the same simulation setting, the results of the SCAD estimator are summarized in Tables 9 and 10. Its performance is empirically better than that of the L1 regularized estimator. Even when the sample size is as small as 500, it estimates all the entries of the Q-matrix correctly with very high probability.
Table 9. Numbers of data sets (out of 100) with Q̂ = Q / Q̂ ≠ Q, SCAD estimator, dependent attributes.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| ρ = 0.05 | 97 / 3 | 100 / 0 | 100 / 0 | 100 / 0 |
| ρ = 0.15 | 98 / 2 | 100 / 0 | 100 / 0 | 100 / 0 |
| ρ = 0.25 | 99 / 1 | 100 / 0 | 100 / 0 | 100 / 0 |
Table 10. Proportion of entries of Q correctly specified by Q̂, SCAD estimator, dependent attributes.

| | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|
| ρ = 0.05 | 99.7% | 100.0% | 100.0% | 100.0% |
| ρ = 0.15 | 100.0% | 100.0% | 100.0% | 100.0% |
| ρ = 0.25 | 100.0% | 100.0% | 100.0% | 100.0% |
Remark 4 Once an estimate of the Q-matrix has been obtained, the other model parameters, such as the slipping and guessing parameters and the attribute population, can be estimated via the maximum likelihood estimator (10). Simulation studies show that these parameters can be estimated accurately given that the Q-matrix is recovered with high probability. As the main focus of this paper is the Q-matrix, we do not report detailed simulation results for these parameters.
5 Real data analysis
5.1 Example 1: fraction subtraction data
The data set contains 536 middle school students’ responses to 17 fraction subtraction problems. The responses are binary: correct or incorrect solution to the problem. The data were originally described by Tatsuoka (1990) and later analyzed by Tatsuoka (2002), de la Torre and Douglas (2004), and many other studies of diagnostic classification models. In these works, the DINA model is fitted with a prespecified Q-matrix. We fit the DINA model to the data and estimate the Q-matrix for K = 3 and 4. We then validate the estimated Q-matrix against our knowledge of the cognitive processes involved in problem solving.
Table 11 presents the estimated Q-matrix, along with the slipping and guessing parameters, for K = 3 based on L1 regularization. The slipping and guessing parameters are estimated by (10). According to our knowledge of the cognitive processes, the items are reasonably clustered by Q̂. Roughly speaking, the three attributes can be interpreted as “finding common denominator”, “writing integer as fraction”, and “subtraction of two fraction numbers when there are integers involved”, respectively.
Table 11. Estimated Q-matrices and slipping/guessing parameter estimates for the fraction subtraction data under the L1 penalty (the item contents, the fraction subtraction problems themselves, are not reproduced).

| ID | Q̂ (K = 3) | ŝ | ĝ | Q̂ (K = 4) | ŝ | ĝ |
|---|---|---|---|---|---|---|
| 1 | 1 0 0 | 0.12 | 0.03 | 1 0 0 0 | 0.12 | 0.03 |
| 2 | 1 0 0 | 0.05 | 0.04 | 1 0 0 0 | 0.04 | 0.03 |
| 3 | 1 0 0 | 0.13 | 0.00 | 1 0 0 0 | 0.12 | 0.00 |
| 4 | 0 1 0 | 0.13 | 0.17 | 0 1 1 0 | 0.13 | 0.21 |
| 5 | 0 0 1 | 0.07 | 0.28 | 0 0 1 0 | 0.07 | 0.24 |
| 6 | 0 0 1 | 0.04 | 0.20 | 0 0 1 0 | 0.04 | 0.13 |
| 7 | 0 0 1 | 0.08 | 0.20 | 0 0 1 1 | 0.05 | 0.27 |
| 8 | 1 0 1 | 0.18 | 0.31 | 1 0 1 0 | 0.19 | 0.31 |
| 9 | 1 0 1 | 0.32 | 0.06 | 0 1 1 1 | 0.23 | 0.11 |
| 10 | 1 0 1 | 0.23 | 0.07 | 0 1 1 1 | 0.15 | 0.14 |
| 11 | 0 1 1 | 0.23 | 0.03 | 0 1 1 0 | 0.24 | 0.03 |
| 12 | 0 1 1 | 0.07 | 0.07 | 0 1 0 0 | 0.09 | 0.06 |
| 13 | 0 1 1 | 0.13 | 0.05 | 0 1 1 0 | 0.15 | 0.04 |
| 14 | 0 1 1 | 0.15 | 0.13 | 0 1 1 0 | 0.16 | 0.12 |
| 15 | 0 1 1 | 0.37 | 0.02 | 0 1 0 1 | 0.32 | 0.02 |
| 16 | 0 1 1 | 0.18 | 0.01 | 0 1 1 0 | 0.20 | 0.01 |
| 17 | 1 1 1 | 0.33 | 0.01 | 1 1 1 1 | 0.31 | 0.01 |
We further fit a four-dimensional DINA model; the results are also summarized in Table 11. The first attribute can be interpreted as “finding common denominator”, the second as “borrowing from the whole number part”, and the third and fourth attributes can be interpreted as “subtraction of two fraction numbers when there are integers involved”. However, it seems difficult to interpret the third and the fourth attributes separately.
This data set has been studied intensively. A Q-matrix (with slight variation from study to study) has also been prespecified based on understandings of each test problem. Table 12 presents the Q-matrix used in de la Torre and Douglas (2004), which contains eight attributes (K = 8). Each attribute corresponds to one type of manipulation of fractions:
- A1: Convert a whole number to a fraction
- A2: Separate a whole number from a fraction
- A3: Simplify before subtracting
- A4: Find a common denominator
- A5: Borrow from whole number part
- A6: Column borrow to subtract the second numerator from the first
- A7: Subtract numerators
- A8: Reduce answers to simplest form
We believe that K is overspecified given that there are only 17 items. Nevertheless, we are able to find an approximate matching between this prespecified matrix and ours. Attribute one in Table 11 roughly corresponds to attribute four in Table 12, attribute two in Table 11 to attribute five in Table 12, and attribute three in Table 11 to attribute two in Table 12.
Table 12. The Q-matrix used in de la Torre and Douglas (2004).

| ID | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 8 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 9 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 10 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 11 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 12 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 13 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 14 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 15 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 16 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 17 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
We also estimate the Q-matrix using SCAD. The estimated Q-matrices for K = 3 and 4 are given in Table 13. The estimates are not as interpretable as those obtained with the L1 penalty, although SCAD performs better in the simulation study. We believe that this is mostly due to lack of fit of the DINA model. This illustrates one of the difficulties in the analysis of cognitive diagnosis: most models impose restrictive parametric assumptions, so lack of fit may affect the quality of the inferences, and performance in simulations does not necessarily extrapolate to real data analysis. We also emphasize that the estimated Q-matrix only serves as a guide to the item-attribute association, and we strongly recommend that researchers verify or even modify the estimates based on their understanding of the items.
Table 13. Estimated Q-matrices and slipping/guessing parameter estimates for the fraction subtraction data under the SCAD penalty (item contents not reproduced).

| ID | Q̂ (K = 3) | ŝ | ĝ | Q̂ (K = 4) | ŝ | ĝ |
|---|---|---|---|---|---|---|
| 1 | 1 0 0 | 0.13 | 0.03 | 1 1 1 1 | 0.12 | 0.24 |
| 2 | 1 0 0 | 0.06 | 0.04 | 1 0 0 0 | 0.05 | 0.04 |
| 3 | 1 0 0 | 0.14 | 0.00 | 1 0 0 0 | 0.14 | 0.01 |
| 4 | 1 1 1 | 0.13 | 0.26 | 1 1 1 1 | 0.14 | 0.25 |
| 5 | 0 0 1 | 0.05 | 0.52 | 1 1 1 1 | 0.04 | 0.54 |
| 6 | 0 0 1 | 0.04 | 0.49 | 0 1 1 0 | 0.03 | 0.50 |
| 7 | 1 1 1 | 0.05 | 0.50 | 1 1 0 1 | 0.06 | 0.49 |
| 8 | 1 1 1 | 0.17 | 0.39 | 1 1 1 1 | 0.17 | 0.38 |
| 9 | 1 1 1 | 0.24 | 0.12 | 1 1 1 1 | 0.25 | 0.11 |
| 10 | 1 1 1 | 0.16 | 0.14 | 1 1 1 1 | 0.17 | 0.14 |
| 11 | 0 0 1 | 0.25 | 0.03 | 0 1 1 1 | 0.23 | 0.03 |
| 12 | 0 0 1 | 0.10 | 0.07 | 0 1 0 0 | 0.12 | 0.07 |
| 13 | 0 0 1 | 0.16 | 0.04 | 0 1 0 0 | 0.17 | 0.04 |
| 14 | 0 0 1 | 0.16 | 0.11 | 0 1 0 1 | 0.15 | 0.10 |
| 15 | 1 1 1 | 0.32 | 0.02 | 1 1 1 1 | 0.34 | 0.02 |
| 16 | 0 0 1 | 0.21 | 0.01 | 0 1 0 0 | 0.22 | 0.01 |
| 17 | 1 1 1 | 0.33 | 0.00 | 1 1 1 1 | 0.35 | 0.00 |
5.2 Example 2: Social anxiety disorder data
The social anxiety disorder data is a subset of the National Epidemiological Survey on Alcohol and Related Conditions (NESARC) (Grant et al., 2003). We consider participants’ binary responses (Yes/No) to thirteen diagnostic questions for social anxiety disorder. The questions are based on the Diagnostic and Statistical Manual of Mental Disorders, 4th edition, and are displayed in Table 14. Incomplete cases are removed from the data set, leaving a sample size of 5226. To understand the latent structure of social phobia, we fit the compensatory DINO model for K = 2, 3, and 4.
Table 14. Diagnostic questions for social anxiety disorder.

| ID | Have you ever had a strong fear or avoidance of … |
|---|---|
| 1 | speaking in front of other people? |
| 2 | taking part/speaking in class? |
| 3 | taking part/speaking at a meeting? |
| 4 | performing in front of other people? |
| 5 | being interviewed? |
| 6 | writing when someone watches? |
| 7 | taking an important exam? |
| 8 | speaking to an authority figure? |
| 9 | eating/drinking in front of other people? |
| 10 | having conversations with people you don’t know well? |
| 11 | going to parties/social gatherings? |
| 12 | dating? |
| 13 | being in a small group situation? |
We first consider the L1 penalty and fit the two-dimensional DINO model. The estimates Q̂, ŝ, and ĝ are summarized in the K = 2 columns of Table 15. In addition, the correlation between the two attributes is 0.47. We further explore the latent structure by considering the three-dimensional DINO model; the corresponding Q̂, ŝ, and ĝ are summarized in the K = 3 columns of Table 15. A latent structure similar to this Q̂ was considered in an item response theory based confirmatory factor analysis (Iza et al., 2014), where the item-attribute structure is prespecified. In that study, the three (continuous) factors are interpreted as “public performance”, “close scrutiny”, and “interaction”, which correspond roughly to the attributes in the K = 3 case of Table 15. Finally, the four-dimensional DINO model is considered, with results summarized in the K = 4 columns of Table 15. According to the corresponding Q̂, the third group of items (items 9–13) under the three-dimensional model splits into two attributes, and item 6 (“writing when someone watches”) becomes associated with attribute three. We also estimate the Q-matrix via SCAD; the estimates, summarized in Table 16, are similar to those obtained with the L1 penalty.
Table 15. Estimated Q-matrices, slipping, and guessing parameters for the social anxiety disorder data under the L1 penalty.

| ID | Q̂ (K = 2) | ŝ | ĝ | Q̂ (K = 3) | ŝ | ĝ | Q̂ (K = 4) | ŝ | ĝ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 0 | 0.05 | 0.54 | 1 0 0 | 0.05 | 0.49 | 1 0 0 0 | 0.05 | 0.49 |
| 2 | 1 0 | 0.09 | 0.27 | 1 0 0 | 0.11 | 0.21 | 1 0 0 0 | 0.11 | 0.21 |
| 3 | 1 0 | 0.13 | 0.15 | 1 0 0 | 0.16 | 0.09 | 1 0 0 0 | 0.16 | 0.09 |
| 4 | 1 0 | 0.12 | 0.30 | 1 0 0 | 0.15 | 0.25 | 1 0 0 0 | 0.14 | 0.25 |
| 5 | 1 0 | 0.46 | 0.07 | 0 1 0 | 0.29 | 0.09 | 0 1 0 0 | 0.29 | 0.08 |
| 6 | 0 1 | 0.66 | 0.07 | 0 1 0 | 0.68 | 0.06 | 0 0 1 0 | 0.56 | 0.08 |
| 7 | 1 0 | 0.42 | 0.22 | 0 1 0 | 0.26 | 0.21 | 0 1 0 0 | 0.27 | 0.20 |
| 8 | 0 1 | 0.34 | 0.16 | 0 1 0 | 0.30 | 0.09 | 0 1 0 0 | 0.30 | 0.08 |
| 9 | 0 1 | 0.68 | 0.02 | 0 0 1 | 0.68 | 0.02 | 0 0 1 0 | 0.58 | 0.02 |
| 10 | 0 1 | 0.13 | 0.21 | 0 0 1 | 0.13 | 0.20 | 0 0 1 1 | 0.14 | 0.16 |
| 11 | 0 1 | 0.20 | 0.12 | 0 0 1 | 0.17 | 0.10 | 0 0 0 1 | 0.21 | 0.08 |
| 12 | 0 1 | 0.59 | 0.05 | 0 0 1 | 0.59 | 0.05 | 0 0 1 0 | 0.47 | 0.06 |
| 13 | 0 1 | 0.67 | 0.01 | 0 0 1 | 0.68 | 0.01 | 0 0 1 0 | 0.57 | 0.01 |
Table 16. Estimated Q-matrices, slipping, and guessing parameters for the social anxiety disorder data under the SCAD penalty.

| ID | Q̂ (K = 2) | ŝ | ĝ | Q̂ (K = 3) | ŝ | ĝ | Q̂ (K = 4) | ŝ | ĝ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 0 | 0.05 | 0.54 | 1 0 0 | 0.05 | 0.49 | 1 0 0 0 | 0.05 | 0.49 |
| 2 | 1 0 | 0.09 | 0.27 | 1 0 0 | 0.11 | 0.21 | 1 0 0 0 | 0.11 | 0.21 |
| 3 | 1 0 | 0.13 | 0.14 | 1 0 0 | 0.16 | 0.09 | 1 0 0 0 | 0.16 | 0.09 |
| 4 | 1 0 | 0.12 | 0.30 | 1 0 0 | 0.15 | 0.25 | 1 0 0 0 | 0.15 | 0.25 |
| 5 | 1 0 | 0.46 | 0.07 | 0 1 0 | 0.29 | 0.09 | 0 1 0 0 | 0.30 | 0.08 |
| 6 | 0 1 | 0.65 | 0.07 | 0 1 0 | 0.68 | 0.06 | 0 0 1 0 | 0.55 | 0.07 |
| 7 | 1 1 | 0.43 | 0.20 | 0 1 0 | 0.26 | 0.21 | 0 1 0 0 | 0.27 | 0.20 |
| 8 | 0 1 | 0.33 | 0.16 | 0 1 0 | 0.30 | 0.09 | 0 1 0 0 | 0.31 | 0.08 |
| 9 | 0 1 | 0.68 | 0.02 | 0 0 1 | 0.68 | 0.02 | 0 0 0 1 | 0.68 | 0.02 |
| 10 | 0 1 | 0.13 | 0.21 | 0 0 1 | 0.13 | 0.20 | 0 0 0 1 | 0.11 | 0.19 |
| 11 | 0 1 | 0.20 | 0.13 | 0 0 1 | 0.17 | 0.10 | 0 0 0 1 | 0.16 | 0.09 |
| 12 | 0 1 | 0.59 | 0.05 | 0 0 1 | 0.59 | 0.05 | 0 0 1 0 | 0.44 | 0.05 |
| 13 | 0 1 | 0.67 | 0.01 | 0 0 1 | 0.68 | 0.01 | 0 0 1 0 | 0.55 | 0.01 |
We observe that the estimated slipping parameters are relatively large for some items (such as items 6, 9, 12, and 13) while their guessing parameters are small. These are low-prevalence items that are unlikely to be endorsed even among the disordered population. On the other hand, if someone responds positively to such an item, he/she is very likely to possess the corresponding disorder (attribute); this is reflected by the small guessing parameters.
6 Concluding remarks
This paper considers the estimation of the Q-matrix, a key quantity in the specification of diagnostic classification models. The contributions are two-fold. First, we present theoretical identifiability results for the Q-matrix under two stylized diagnostic classification models, the DINA model and the DINO model. A set of sufficient conditions is provided under which it is theoretically possible to reconstruct the matrix based only on the dependence structure of the response patterns. The theory is developed by means of maximum likelihood estimation (MLE). Unfortunately, the MLE, though consistent (under conditions), is not practically implementable due to an unaffordable computational overhead. Thus, the second objective is to present a computationally feasible estimator of Q. We formulate Q-matrix estimation as a latent variable selection problem and employ regularized maximum likelihood as the main tool, considering both the L1 penalty and the SCAD penalty. For the optimization, we combine the expectation-maximization algorithm and the coordinate descent algorithm, both well-studied numerical methods. The estimation procedure is applicable to most diagnostic classification models and is not limited to the DINA or the DINO model.
The performances of the two penalty functions are compared via simulation studies, in which the SCAD penalty yields better results. However, in the analysis of the fraction subtraction data, SCAD yields results that are difficult to interpret, while the L1 penalty produces more interpretable Q-matrices. We believe that this is mostly due to lack of fit of the DINA model. This data set in part illustrates the complications of real data analysis for diagnostic classification models. Although the theory and the estimation procedure do not require prior knowledge of Q, we strongly recommend that researchers combine their subject-matter knowledge with our inference tools. That is, the estimated Q-matrix serves as a guideline for the item-attribute association; further refinement (such as choosing the regularization parameter or the penalty function) should rely on an understanding of the items.
Throughout the discussion, the number of attributes (K) is assumed to be known. A natural extension is to estimate K simultaneously with the other parameters. This can be done by adding a further penalty term to the likelihood function. We leave this topic for future study.
Contributor Information

Yunxiao Chen, Columbia University, Statistics, New York, 10027 United States, yunxiao@stat.columbia.edu.

Jingchen Liu, Columbia University, Statistics, 1255 Amsterdam Avenue, New York, 10027 United States, jcliu@stat.columbia.edu.

Gongjun Xu, University of Minnesota, Statistics, Minneapolis, United States, xuxxx360@umn.edu.

Zhiliang Ying, Columbia University, Statistics, 1255 Amsterdam Avenue, 10th Floor, New York, 10027 United States, zying@stat.columbia.edu.
References
- Casella G, Berger RL. Statistical Inference. Belmont, CA: Duxbury Press; 2001.
- Chiu C, Douglas J, Li X. Cluster analysis for cognitive diagnosis: Theory and applications. Psychometrika. 2009;74:633–665.
- de la Torre J. An empirically-based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement. 2008;45:343–362.
- de la Torre J. DINA model and parameter estimation: a didactic. Journal of Educational and Behavioral Statistics. 2009;34:115–130.
- de la Torre J. The generalized DINA model framework. Psychometrika. 2011;76:179–199.
- de la Torre J, Douglas J. Higher order latent trait models for cognitive diagnosis. Psychometrika. 2004;69:333–353.
- DiBello L, Stout W, Roussos L. Unified cognitive psychometric assessment likelihood-based classification techniques. In: Nichols PD, Chipman SF, Brennan RL, editors. Cognitively diagnostic assessment. Hillsdale, NJ: Erlbaum; 1995.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–22.
- Grant BF, Kaplan K, Shepard J, Moore T. Source and accuracy statement for wave 1 of the 2001–2002 national epidemiologic survey on alcohol and related conditions. Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism; 2003.
- Henson RA, Templin JL, Willse JT. Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika. 2009;74:191–210.
- Iza M, Wall MM, Heimberg RG, Rodebaugh TL, Schneier FR, Liu S-M, Blanco C. Latent structure of social fears and social anxiety disorders. Psychological Medicine. 2014;44:361–370. doi: 10.1017/S0033291713000408.
- Junker B. Some statistical models and computational methods that may be useful for cognitively-relevant assessment. 1999. Technical Report, available from http://www.stat.cmu.edu/~brian/nrc/cfa/documents/final.pdf.
- Junker B, Sijtsma K. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement. 2001;25:258–272.
- Leighton JP, Gierl MJ, Hunka SM. The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of Educational Measurement. 2004;41:205–237.
- Liu J, Xu G, Ying Z. Theory of self-learning Q-matrix. Bernoulli. 2013;19:1790–1817. doi: 10.3150/12-BEJ430.
- Liu Y, Douglas J, Henson R. Testing person fit in cognitive diagnosis. In the annual meeting of the National Council on Measurement in Education (NCME); April; Chicago, IL. 2007.
- Rupp A, Templin J. Effects of Q-matrix misspecification on parameter estimates and misclassification rates in the DINA model. Educational and Psychological Measurement. 2008a;68:78–98.
- Rupp A, Templin J. Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspectives. 2008b;6:219–262.
- Rupp A, Templin J, Henson RA. Diagnostic Measurement: Theory, Methods, and Applications. New York, NY: Guilford Press; 2010.
- Tatsuoka C. Data-analytic methods for latent partially ordered classification models. Applied Statistics (JRSS-C). 2002;51:337–350.
- Tatsuoka K. A probabilistic model for diagnosing misconceptions in the pattern classification approach. Journal of Educational Statistics. 1985;12:55–73.
- Tatsuoka K. Cognitive assessment: an introduction to the rule space method. New York, NY: Routledge; 2009.
- Tatsuoka KK. Toward an integration of item-response theory and cognitive error diagnosis. In: Frederiksen N, Glaser R, Lesgold A, Shafto M, editors. Diagnostic monitoring of skill and knowledge acquisition. Hillsdale, NJ: Erlbaum; 1990.
- Templin J, Henson R. Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods. 2006;11:287–305. doi: 10.1037/1082-989X.11.3.287.
- Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R. The Lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395.
- von Davier M. A General Diagnostic Model Applied to Language Testing Data. Princeton, NJ: Educational Testing Service Research Report No. RR-05-16; 2005.
- von Davier M. A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology. 2008;61:287–307. doi: 10.1348/000711007X193957.
- von Davier M, Yamamoto K. A class of models for cognitive diagnosis. In 4th Spearman Conference; Philadelphia, PA. 2004.
- Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.