Abstract
Previous studies have demonstrated evidence of latent skill continuity even in tests intentionally designed for measurement of binary skills. In addition, the assumption of binary skills when continuity is present has been shown to potentially create a lack of invariance in item and latent ability parameters that may undermine applications. In this article, we examine measurement of growth as one such application, and consider multidimensional item response theory (MIRT) as a competing alternative. Motivated by prior findings concerning the effects of skill continuity, we study the relative robustness of cognitive diagnostic models (CDMs) and (M)IRT models in the measurement of growth under both binary and continuous latent skill distributions. We find CDMs to be a less robust way of quantifying growth under misspecification, and subsequently provide a real-data example suggesting underestimation of growth as a likely consequence. It is suggested that researchers should regularly attend to the assumptions associated with the use of latent binary skills and consider (M)IRT as a potentially more robust alternative if unsure of their discrete nature.
Keywords: relative robustness, cognitive diagnosis models, (multidimensional) item response theory, growth insensitivity
Introduction
The recent literature has witnessed a number of cognitive diagnostic modeling developments designed for longitudinal applications (e.g., Chen et al., 2018; H. Y. Huang, 2017; Kaya & Leite, 2016; Li et al., 2015; Madison & Bradshaw, 2018; Wang, Yang, et al., 2018; Wang, Zhang, et al., 2018; Zhan et al., 2019; Zhang & Chang, 2019). These models and applications can be appealing due to their potential not only to examine whether students have mastered specific types of skills at a fixed timepoint, but also to monitor skill acquisition over time. However, some recent papers (Bolt & Kim, 2018; Huang & Bolt, 2020) have expressed concern about the assumption of binary skills in cognitive diagnostic models (CDMs) and explored the consequences of binary skill assumption violations. These studies demonstrate how one form of misspecification, namely, latent skill continuity, is commonly observed even in tests intentionally designed for measurement of binary skills. In addition, the assumption of binary skills when continuity is present has been shown to potentially create a lack of invariance in binary skill metrics, and possibly a reduced sensitivity to actual growth in longitudinal contexts (Huang & Bolt, 2022). Figure 1 provides a hypothetical example to illustrate how a reduced sensitivity to actual growth may occur in CDMs in the presence of latent skill continuity. The illustration assumes a longitudinal growth measurement context where the latent skill is actually continuous. We consider three hypothetical examinees with corresponding pre-/post-levels on the continuous trait, as might correspond to actual change in a longitudinal context. The blue, red, and green solid vertical lines represent the examinees’ true skill proficiency levels on the continuous trait at pre-test, and the corresponding dashed lines represent their true skill proficiency levels at post-test. The solid black line reflects a hypothetical mastery/non-mastery cutoff on the continuous proficiency, as might correspond to the assumption of a binary classification of respondents as masters versus non-masters—examinees above the cut are “masters,” those below the cut “non-masters.” In this scenario, note that despite all three examinees showing actual growth in the proficiency, the binary mastery/non-mastery approximation only detects growth if the examinee crosses the mastery/non-mastery threshold, and neglects all other increases, regardless of their size. In this case, only the examinee in blue has shown growth on the binary metric, despite the fact that the examinees in red and green have shown much larger gains on the continuous metric.
Figure 1.
A Hypothetical Illustration of Continuous Growth and Classification Outcomes
Note. Three hypothetical examinees are considered with corresponding pre-/post-levels on the continuous trait. The solid lines and dashed lines represent their proficiency levels at pre-test and post-test, respectively.
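For concreteness, the thresholding logic of Figure 1 can be expressed in a few lines of R. The numeric values below are illustrative only (the figure is purely hypothetical and reports no numbers); the cut is placed at 0.

```r
pre  <- c(blue = -0.2, red = 0.5, green = -2.5)   # pre-test levels (made up)
post <- c(blue =  0.3, red = 2.0, green = -0.5)   # post-test levels (made up)
cut  <- 0                                          # hypothetical mastery cutoff
continuous_gain <- post - pre                                         # 0.5, 1.5, 2.0
binary_gain     <- as.integer(post >= cut) - as.integer(pre >= cut)   # 1, 0, 0
rbind(continuous_gain, binary_gain)  # only "blue" registers growth on the binary metric
```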
This hypothetical example is designed to highlight what may be a significant concern in the application of binary skills models in longitudinal contexts, namely, that actual growth occurring along a proficiency continuum may be treated as no growth if it occurs within a proficiency category. One can envision educators becoming frustrated with an assessment tool that credits gains only to the smaller percentage of students who transitioned from non-mastery to mastery status, when much of the growth that actually occurs keeps a student within their baseline mastery category (either non-mastery or mastery). Such an outcome may even seem somewhat ironic given the intended goal of diagnostic models to extract as much information as possible from the test performances.
Given frequent signs of latent skill continuity and the seemingly greater consequences of a binary misspecification in longitudinal contexts, a natural alternative is to consider a multidimensional item response theory (MIRT) model. However, it is likely that practitioners will often be unsure as to the true nature of the latent skill distribution in real data analysis, thus creating uncertainty over model selection. In light of this, we propose an approach that practitioners can easily adopt to examine the relative robustness of CDMs and (M)IRT models in the measurement of growth under both binary and continuous latent skill distributions, which can also hopefully support model selection. More generally, we consider the robustness of a latent variable model in terms of the model’s ability to capture true growth when the discrete/continuous nature of the skills is misspecified. Specifically, we examine whether the correlation between true changes in proficiencies and estimated changes by the model are acceptable when: (a) applying a binary model when proficiencies are actually continuous, (b) applying a continuous model when proficiencies are actually binary. As will be detailed in a later “Method” section, our general approach begins by generating longitudinal datasets that assume either binary skills or continuous skills. We then apply both CDMs and (M)IRT models to the two types of datasets and correlate the true changes in proficiencies with the estimated changes as defined by the two models. The correct model (MIRT for continuous proficiencies; CDMs for binary proficiencies) is expected to yield a relatively high correlation with the true changes. A smaller difference between the correlation of the misspecified model and the correlation of the correct model is taken to imply a greater robustness of the misspecified model. A large difference between the above two correlations indicates that the misspecified model has less robustness when the assumed nature of the proficiencies is not satisfied. As we show in a real data study, the reduced model robustness can be consequential in actual growth measurement applications.
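As a minimal sketch of this criterion (the function and object names are ours, not from any package), the comparison reduces to two correlations and their difference:

```r
# true_change: simulated true growth; change_correct / change_misspecified:
# growth estimates from the correctly specified and the misspecified model.
robustness_gap <- function(true_change, change_correct, change_misspecified) {
  r_correct <- cor(true_change, change_correct)
  r_misspec <- cor(true_change, change_misspecified)
  c(correct = r_correct, misspecified = r_misspec, gap = r_correct - r_misspec)
}
# A small gap suggests the misspecified model is relatively robust.
```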
The remainder of the article is organized as follows. First, we provide a review of the CDM and (M)IRT models used. Then we present simulation studies designed to replicate a real longitudinal diagnostic study that motivated the development of several recent longitudinal CDMs (Chen et al., 2018; Wang, Yang, et al., 2018; Wang, Zhang, et al., 2018), and generalize the findings to additional hypothetical conditions. Finally, the dataset collected by Wang, Yang, et al. (2018) involving an actual longitudinal diagnostic study is analyzed to illustrate the potential consequences of using CDMs when some realistic model misspecification is present due to the presence of latent skill continuity.
Cognitive Diagnostic Models and (Multidimensional) Item Response Models
As noted, we aim to compare the relative robustness of CDMs and (M)IRT models in measuring growth in latent skills. Although there now exist many different CDM and (M)IRT model types, we focus on comparisons of (a) a first-order hidden Markov (FOHM; a longitudinal CDM; Chen et al., 2018) model with the Rasch model and (b) the FOHM with the multidimensional two-parameter logistic (M2PL; Reckase, 1985, 1997) model, reflecting unidimensional and multidimensional simulations, respectively. The reason for selecting the FOHM as an example of CDMs is that it has been demonstrated to provide a respectable fit (Chen et al., 2018) to the real longitudinal diagnostic dataset noted earlier, which is the basis for our simulation and real data study. As representative (M)IRT models, we choose the Rasch and M2PL, two commonly used models. To estimate these models, we use the R package hmcdm (Wang, Zhang, et al., 2018) for the FOHM, ltm (Rizopoulos, 2006) for the Rasch, and mirt (Chalmers, 2012) for the M2PL analyses. Although the simulation studies only consider the correlations of the true changes with the estimated changes derived from these three models, the method we use here can naturally be applied to any comparison between CDMs and (M)IRT models. Indeed, one of the goals of this paper is to highlight the potential value of fitting both binary and continuous skills models to better understand the implications of model choice on estimated growth in a longitudinal framework.
Cognitive Diagnostic Models
FOHM Model
Chen et al. (2018) formulated the FOHM model as a longitudinal extension of the deterministic inputs, noisy “and” gate (DINA) model that incorporates a hidden Markov model to characterize first-order transition probabilities corresponding to changes in attribute mastery status over time. The DINA model is a commonly applied cognitive diagnosis model with several elements—an item × skill attribute incidence matrix (Q-matrix), parameters that characterize skill attribute proficiency at both the respondent and population levels, and item parameters (i.e., slip and guessing parameters) reflecting stochastic aspects of item scores in relation to the latent proficiency state. The FOHM model extends the DINA model in a longitudinal manner by also adding (a) multiple test forms f (= 1 . . . F) and (b) multiple time points t (= 1 . . . T) to account for a design where each examinee is administered a different form at each time point, and the time point at which a given form f is administered varies across examinees. The Q-matrix entry expands to $q_{jkf}$ to denote whether skill attribute k (= 1 . . . K) is required to answer item j (= 1 . . . J) on form f. Meanwhile, student i’s (i = 1 . . . I) discrete skill attribute pattern extends to $\boldsymbol{\alpha}_{it} = (\alpha_{i1t}, \ldots, \alpha_{iKt})$, where $\alpha_{ikt}$ is an indicator of whether the ith student has mastered the kth attribute at time point t. Given this parameterization, the ideal response of an examinee with attribute pattern $\boldsymbol{\alpha}_{it}$ to item j on form f is denoted $\eta_{ijft}$ and can be determined as:
$$\eta_{ijft} = \prod_{k=1}^{K} \alpha_{ikt}^{\,q_{jkf}} \qquad (1)$$
Item slip parameters ($s_{jf}$) define the probability of an incorrect response by an examinee whose ideal response to a given item is correct, while item guess parameters ($g_{jf}$) define the probability of a correct response by an examinee whose ideal response is incorrect:
$$s_{jf} = P\left(X_{ijft} = 0 \mid \eta_{ijft} = 1\right), \qquad g_{jf} = P\left(X_{ijft} = 1 \mid \eta_{ijft} = 0\right) \qquad (2)$$
When the parameters of Equation 2 are combined with Equation 1, the item response function of the longitudinal DINA model becomes:
$$P\left(X_{ijft} = 1 \mid \boldsymbol{\alpha}_{it}\right) = \left(1 - s_{jf}\right)^{\eta_{ijft}} g_{jf}^{\,1 - \eta_{ijft}} \qquad (3)$$
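A compact R sketch of Equations 1 to 3 (our own illustration, not code from the hmcdm package) may help fix the notation:

```r
# alpha: an examinee's attribute pattern at time t; q: the Q-matrix row for
# item j on form f; s, g: the item's slip and guessing parameters.
p_dina <- function(alpha, q, s, g) {
  eta <- prod(alpha^q)          # Equation 1: ideal response (note 0^0 = 1 in R)
  (1 - s)^eta * g^(1 - eta)     # Equation 3: P(X = 1 | alpha)
}
# An examinee mastering attributes 1 and 3 answering an item requiring attributes 1 and 2:
p_dina(alpha = c(1, 0, 1, 0), q = c(1, 1, 0, 0), s = 0.1, g = 0.2)  # returns 0.2 (the guess)
```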
With respect to the change process, the FOHM imposes a monotonic structural constraint such that the only change in attribute status that can occur is a move from non-mastery to mastery. In this case, the prior for the sequence of attribute patterns $\boldsymbol{\alpha}_i = (\boldsymbol{\alpha}_{i1}, \ldots, \boldsymbol{\alpha}_{iT})$ becomes,
$$P\left(\boldsymbol{\alpha}_{i1} = \boldsymbol{\alpha}_{c_1}, \ldots, \boldsymbol{\alpha}_{iT} = \boldsymbol{\alpha}_{c_T}\right) = \pi_{c_1} \prod_{t=2}^{T} \tau_{c_{t-1} c_t} \qquad (4)$$
where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_{2^K})$ is a vector containing the probabilities of all classes c (= 1 . . . $2^K$) at time t = 1, which denotes the initial class membership probabilities; $\boldsymbol{\alpha}_c$ is the discrete skill attribute pattern for class c; and $\mathbf{T} = [\tau_{cc'}]$ is a matrix of first-order probabilities for transitioning from class c to class c′ between any two time periods, with the restriction that the probability of moving from a class involving mastery to one of non-mastery on the same attribute is zero. For instance, if K = 2, the classes c = 1, . . . , 4 correspond to attribute patterns (00), (01), (10), and (11), respectively. The corresponding $\boldsymbol{\pi}$ and $\mathbf{T}$ then take the forms shown in Table 1, so that no examinee can lose mastery status on an attribute over time. A detailed Bayesian formulation for estimation of the FOHM is provided by Chen et al. (2018).
Table 1.
Initial Class Membership Probabilities and First-Order Transition Probabilities for a FOHM Model with Two Attributes
| Class | Initial class membership probability ($\pi_c$) | Transition to 00 | Transition to 01 | Transition to 10 | Transition to 11 |
|---|---|---|---|---|---|
| 00 | $\pi_1$ | $\tau_{11}$ | $\tau_{12}$ | $\tau_{13}$ | $\tau_{14}$ |
| 01 | $\pi_2$ | 0 | $\tau_{22}$ | 0 | $\tau_{24}$ |
| 10 | $\pi_3$ | 0 | 0 | $\tau_{33}$ | $\tau_{34}$ |
| 11 | $\pi_4$ | 0 | 0 | 0 | 1 |
Note. FOHM = first-order hidden Markov.
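The structural zeros in Table 1 can be generated mechanically. The following sketch (our own illustration) marks which transitions are permitted when mastery cannot be lost:

```r
K <- 2
classes <- as.matrix(expand.grid(rep(list(0:1), K)))   # all 2^K attribute patterns
C <- nrow(classes)
allowed <- matrix(FALSE, C, C)
for (c1 in 1:C) for (c2 in 1:C) {
  # A transition from class c1 to class c2 is allowed only if no attribute drops from 1 to 0
  allowed[c1, c2] <- all(classes[c2, ] >= classes[c1, ])
}
allowed   # FALSE cells correspond to the zeros in Table 1
```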
(Multidimensional) Item Response Models
In the (M)IRT analyses, we consider two modeling approaches—(a) a separate calibration in which the responses for each test form are associated with independent latent proficiency estimates for each form/time point and (b) a joint calibration in which the latent ability estimates for each examinee over time and across forms are estimated simultaneously. The joint approach can be viewed as a longitudinal extension where the correlations between skill attributes across time points play a role in each examinee’s ability estimates. This latter approach arguably offers a closer approximation to what occurs in the FOHM, but is more computationally demanding. Note that all of these approaches make use of the fact that forms were randomly assigned to examinees across time points; the assumption of random equivalence of test-taking populations across forms is the basis for assuming the parameters are on a common metric across forms. Using the two approaches for (M)IRT, we can illustrate the consequences of allowing normative information about skill attributes and skill attribute growth to impact the ensuing trait estimates. As the number of time points or skill attributes increases, the stability and credibility of the joint approach would naturally decrease. Moreover, the separate approach may be desirable under conditions where we wish to allow only the examinee’s own data to inform ability and ability growth estimates over time.
To extend the Rasch and M2PL models in the joint approach, we consider a multidimensional Rasch model and an M2PL of higher dimensionality. In the following introduction of the (M)IRT models, we maintain the meaning of the indices used for the FOHM for consistency, but naturally replace $\alpha$ with $\theta$ to distinguish the continuous latent proficiencies of IRT from the discrete latent skill attributes of CDMs.
Rasch Model (Unidimensional Case)
In our unidimensional simulation, we generate data assuming a continuous proficiency and conduct the separate IRT calibration using the Rasch model, with item response function:
$$P\left(X_{ijft} = 1 \mid \theta_{it}\right) = \frac{\exp\left(\theta_{it} - b_{jf}\right)}{1 + \exp\left(\theta_{it} - b_{jf}\right)} \qquad (5)$$
where $\theta_{it}$ represents the ith student’s proficiency at time point t, and $b_{jf}$ denotes the item difficulty parameter for item j in form f. For the joint calibration, we extend the Rasch model above to a multidimensional form with item response function:
$$P\left(X_{ijf} = 1 \mid \boldsymbol{\theta}_i\right) = \frac{\exp\left(\sum_{t=1}^{T} \delta_{ift}\,\theta_{it} - b_{jf}\right)}{1 + \exp\left(\sum_{t=1}^{T} \delta_{ift}\,\theta_{it} - b_{jf}\right)} \qquad (6)$$
where $\boldsymbol{\theta}_i = (\theta_{i1}, \ldots, \theta_{iT})$ is a vector containing the ith student’s proficiencies from times 1 to T, and $\boldsymbol{\Delta}_i = [\delta_{ift}]$ is a test form × time-point matrix in which $\delta_{ift}$ is an indicator denoting whether test form f is administered at time point t for examinee i. A core element of the joint approach is the characterization of a multivariate distribution for the trait parameters. In the case of two time points, we have a two-element trait vector that we can assume is bivariate normal with estimated mean vector and covariance matrix (subject to the imposition of identification constraints). In this way, the joint approach allows proficiency levels at other time points to inform the estimated individual proficiency at each time point, according to the estimated mean and covariance matrix. Where the number of time points is larger than two, we can also impose distributional assumptions based on a growth model (e.g., linear growth over time) to simplify computation, as we consider in further detail in our later illustrations.
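A small sketch of how the form-by-time indicator enters Equation 6 (our own illustration; the object names are assumptions, not the authors’ notation):

```r
# theta_i: examinee i's proficiencies at the T time points; delta_if: indicator of
# the time point at which examinee i took form f; b_jf: the item difficulty.
p_rasch_joint <- function(theta_i, delta_if, b_jf) {
  plogis(sum(delta_if * theta_i) - b_jf)   # Equation 6
}
p_rasch_joint(theta_i = c(-0.3, 0.8), delta_if = c(0, 1), b_jf = 0.5)  # form taken at time 2
```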
M2PL Model (Multidimensional Case)
In the multidimensional simulations (Illustrations 2 and 3), the model we use to generate item response data for continuous proficiencies is the M2PL model—a widely used compensatory MIRT model. In the M2PL, the probability of the ith student’s correct response to an item j in form f at time point t is modeled as
$$P\left(X_{ijft} = 1 \mid \boldsymbol{\theta}_{it}\right) = \frac{\exp\left(\sum_{k=1}^{K} a_{jfk}\,\theta_{itk} + d_{jf}\right)}{1 + \exp\left(\sum_{k=1}^{K} a_{jfk}\,\theta_{itk} + d_{jf}\right)} \qquad (7)$$
where $\boldsymbol{\theta}_{it} = (\theta_{it1}, \ldots, \theta_{itK})$ represents the K latent proficiencies for student i at time point t; $a_{jf1}, \ldots, a_{jfK}$ are the discrimination parameters for the K latent skills required to answer item j in form f correctly; and $d_{jf}$ is a multidimensional difficulty parameter for item j in form f.
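Equation 7 translates directly into a short function (again, a sketch with our own names):

```r
# theta: the K proficiencies at time t; a: discrimination parameters for item j
# on form f; d: the item's multidimensional difficulty (intercept).
p_m2pl <- function(theta, a, d) plogis(sum(a * theta) + d)
p_m2pl(theta = c(0.5, -0.2, 1.0, 0.0), a = c(1.2, 0, 0.8, 0), d = -0.4)
```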
Similar to how Equation (5) is extended to Equation (6), the joint calibration for M2PL has an item response function:
$$P\left(X_{ijf} = 1 \mid \boldsymbol{\Theta}_i\right) = \frac{\exp\left(\sum_{t=1}^{T} \delta_{ift} \sum_{k=1}^{K} a_{jfk}\,\theta_{itk} + d_{jf}\right)}{1 + \exp\left(\sum_{t=1}^{T} \delta_{ift} \sum_{k=1}^{K} a_{jfk}\,\theta_{itk} + d_{jf}\right)} \qquad (8)$$
where $\boldsymbol{\Theta}_i$ is a T × K matrix whose entry $\theta_{itk}$ denotes the ith student’s kth proficiency at time point t. All the other indices have the same meanings as noted before. Note that for the joint calibration, our multivariate distributions now extend to include not only the multiple time points but also the multiple attributes. For example, when there are four attributes and two time points, we have an eight-element trait vector that we assume is multivariate normal with an estimated mean vector and covariance matrix (subject to the imposition of identification constraints). In this way, the joint approach allows proficiency levels on other skills/attributes (as well as other time points) to inform the estimated individual proficiencies at each time point, as determined by the estimated mean and covariance matrix. Where the number of time points is greater than two, we can again impose distributional assumptions related to a multivariate growth model (e.g., linear growth over time), as we consider in further detail in our later illustrations.
In the ensuing sections, we seek to evaluate the implications of model misspecification on estimates of growth/change in a diagnostic context. It is important to note that there are various dimensions along which models can be compared, some of which have been examined in past research. For example, the use of empirical information criteria (e.g., Akaike information criterion [AIC], Bayesian information criterion [BIC]) to compare CDM and MIRT models has received previous attention. Although such criteria tend to support continuous traits (Bolt, 2019; Ma et al., 2020; von Davier & Haberman, 2014), others have argued that the theoretical complexity of skills and the empirical value of diagnostic classifications outweigh such considerations (Templin & Bradshaw, 2014). In this article, we focus our attention on the empirical consequences of misspecification in the context of longitudinal assessment, hopefully offering another criterion by which to compare these general measurement approaches in longitudinal contexts.
Simulation Study
Illustration 1: Relative Robustness of the FOHM and Rasch Models in a Unidimensional Framework
The real data study (Chen et al., 2018) that is a motivation for our simulation incorporated a counterbalanced test design to measure the growth of four skills across five time points. To more easily compare methods, we initially consider a unidimensional simulation study that has a counterbalanced pre- and post-test design with each test form including 25 items (Table 2). Thus, it has only two time points. We also assume 2,000 examinees are assigned to take the tests in two different test orders. The reason for selecting larger numbers of examinees and items for the simulation is that such conditions allow us to achieve more stable conclusions about misspecification with less concern about interference from random sources of sampling/measurement error. A two time-point design is also common in practice.
Table 2.
Simulated Balanced Test Block Design
| Test order | Time point 1 | Time point 2 |
|---|---|---|
| Order 1 | Form 1 | Form 2 |
| Order 2 | Form 2 | Form 1 |
As described earlier, in this design, we generate two types of datasets: One assumes a binary latent proficiency, and one assumes a continuous latent proficiency. For the binary condition, we assume 25% of the examinees are masters and 75% non-masters at the pre-test. At the post-test, two thirds of non-masters (50% of all the examinees) transition to mastery status, while the remaining 50% of the examinees retain their pre-test status. For the second type, the initial proficiencies are assumed to follow a normal distribution. For growth generation, we seek equivalent conditions between the binary and continuous metrics, and thus assume 50% of the examinees have increases in proficiency defined by a truncated normal distribution in which all simulated negative growth is fixed to 0. This growth distribution was deliberately chosen to create growth rates comparable to those used for the binary condition. For each dataset, we generate item responses from the corresponding model: For the binary condition, we use a DINA model, and for the continuous condition a Rasch model. The specific item and person parameters used in these two models, as well as the detailed procedures for data generation, are shown in Supplementary Appendix A.
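The following R sketch mirrors the two generating conditions just described. The actual item and person parameters are those of Supplementary Appendix A; the distributional parameters shown here are illustrative placeholders only.

```r
set.seed(1)
N <- 2000
## Binary condition: 25% masters at pre-test; two thirds of non-masters
## (i.e., 50% of all examinees) transition to mastery at post-test.
alpha_pre  <- rbinom(N, 1, 0.25)
alpha_post <- ifelse(alpha_pre == 0 & runif(N) < 2/3, 1, alpha_pre)
## Continuous condition: normally distributed initial proficiency; 50% of
## examinees receive growth from a truncated normal (negatives set to 0).
theta_pre  <- rnorm(N)                                                # assumed N(0, 1)
growth     <- ifelse(runif(N) < 0.5, pmax(rnorm(N, 1, 0.5), 0), 0)    # placeholder parameters
theta_post <- theta_pre + growth
## Item responses would then be generated from the DINA (slip/guess) and Rasch
## models, respectively, using the Appendix A parameters.
```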
Next, we apply the two Rasch approaches (separate vs. joint calibration) and the FOHM model to both datasets, and correlate the estimated growth with the true growth as simulated above. Note that because the FOHM places a monotonicity constraint on the growth estimation, we also force the estimated changes of the two Rasch approaches to be non-negative (replacing all negative estimates by 0), to reflect the FOHM assumption of no negative growth. As seen in Table 4, for the data generated from continuous traits, the correlations between true and estimated growth under the two Rasch approaches are much higher than under the FOHM. For the data generated from binary skills, the correlations for the Rasch analyses remain high, though slightly lower than the correlation observed for the FOHM. The results suggest greater robustness under IRT than under CDMs when the models are misspecified. In addition, the Rasch and multidimensional Rasch analyses yield rather similar mean change estimates (see the first column in Table 3) and correlation results (Table 4), implying that the choice between separate and joint calibration is largely inconsequential in the current analysis. However, we hasten to note that the similarity of results in the two Rasch calibrations does not imply that the correlations between skill attributes across time points play no role in each examinee’s ability estimates. The similar findings in the current analysis are likely a product of highly reliable measurement conditions due to the sizable number of items assumed at each time point (25). In Supplementary Appendix 1, we demonstrate greater differences, and a preference for the joint approach, when the measurement conditions are less favorable (e.g., shorter test lengths).
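A sketch of the separate Rasch calibration and the growth correlation reported in Table 4 is given below. The article fits the Rasch model with ltm and the FOHM with hmcdm; mirt is used here purely for illustration, and resp_t1, resp_t2, and true_growth stand in for the simulated response matrices and true changes.

```r
library(mirt)
fit_t1 <- mirt(resp_t1, 1, itemtype = "Rasch", verbose = FALSE)
fit_t2 <- mirt(resp_t2, 1, itemtype = "Rasch", verbose = FALSE)
est_change <- fscores(fit_t2)[, 1] - fscores(fit_t1)[, 1]
est_change <- pmax(est_change, 0)   # mirror the FOHM's no-negative-growth assumption
cor(true_growth, est_change)        # one entry of Table 4
```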
Table 4.
Pearson Correlations Between Estimated and True Growth, Simulation Illustration 1
| True proficiencies | FOHM | Rasch, separate | Rasch, joint |
|---|---|---|---|
| Binary | 0.68 | 0.60 | 0.64 |
| Continuous | 0.25 | 0.64 | 0.63 |
Note. FOHM = first-order hidden Markov.
Table 3.
Means of the Growth Estimated by the Two (M)IRT Modeling Approaches (Separate Calibration Versus Joint Calibration) in Each Simulation Condition
| Generating condition | Calibration approach | Rasch (Illustration 1) | M2PL_Q1 S1 | M2PL_Q1 S2 | M2PL_Q1 S3 | M2PL_Q1 S4 | M2PL_Q2 S1 | M2PL_Q2 S2 | M2PL_Q2 S3 | M2PL_Q2 S4 | M2PL_Q3 S1 | M2PL_Q3 S2 | M2PL_Q3 S3 | M2PL_Q3 S4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Binary | Separate | 1.00 | .32 | .33 | .35 | .33 | .31 | .33 | .34 | .33 | .32 | .30 | .33 | .28 |
| Binary | Joint | 1.04 | .27 | .28 | .29 | .29 | .29 | .29 | .29 | .30 | .27 | .28 | .28 | .26 |
| Continuous | Separate | .68 | .15 | .21 | .15 | .19 | .18 | .19 | .15 | .16 | .22 | .14 | .15 | .10 |
| Continuous | Joint | .70 | .16 | .17 | .16 | .17 | .20 | .22 | .19 | .21 | .23 | .20 | .19 | .15 |
Note. M2PL_Q1 represents that the response data are generated from a M2PL model with the factor matrices Q1. The same rule applies for M2PL_Q2 and M2PL_Q3 as well. Q1 denotes the Q-matrices used in Illustration 2, which are based on real data Q-matrices; Q2 and Q3 are the Q-matrices described in conditions 1 (no single-skill item) and 2 (no single-skill item & skill 4 nested within skill 1) of Illustration 3. S1, S2, S3, S4 denote Skill Attributes 1, 2, 3, 4. (M)IRT = (Multidimensional) item response theory; M2PL = multidimensional two-parameter logistic.
To better visualize the correlation results, we draw a scatterplot (Figure 2) of the estimated against the true growth under each condition (a random noise jitter is added to the binary growth values to avoid overplotting). Because of the largely similar results for the two IRT approaches, we only show the estimates from the separate calibration in the figure. In this case, the distinctions between the binary and continuous approaches are readily apparent. Although each estimation approach recovers growth best when the data are simulated from its own model, the continuous approach shows reasonably good recovery even when the true skill is binary, which indicates a greater robustness of IRT than of CDM under misspecification.
Figure 2.
Scatterplots of the True Growth from the Two Data Generating Conditions against the Estimated Growth by the Rasch and FOHM Model, Unidimensional Simulation
Note. FOHM = first-order hidden Markov.
The pre- and post-test design is common in practice and makes it easier to evaluate the misspecification/robustness issues in relation to the skill distributional assumptions. Having additional time points adds complexity to the comparison, as the monotonicity constraint on change in the FOHM likely plays a greater role in creating differences across methods. Nevertheless, it might be questioned whether the greater robustness of IRT remains under designs with more time points. Thus, in Supplementary Appendix 2, we extend Illustration 1 by considering five time points. In short, the results support the previous conclusion—IRT appears relatively more robust than CDM under misspecification.
Beyond the correlations seen across methods are questions regarding the aggregate levels of growth observed under correct and incorrect model specification. Note that because the IRT approach provides a continuous metric against which to study growth, even in the presence of no growth we expect half the respondents to show increases in their proficiency estimates. With this in mind, adopting an assumption common to the FOHM (no negative true growth over time), we can easily adjust the estimated proportion of respondents that demonstrate growth by taking the difference between the proportions of positive and negative growth estimates as an estimate of the proportion of students that have actually grown (assuming a symmetric distribution of growth estimates around 0 when no growth has occurred). When adopting this approach in the current application (where exactly 50% have grown under both the binary and continuous generating conditions), under the binary data generating condition we see growth estimates of 52% under FOHM and 42% under MIRT, suggesting some underestimation of growth related to MIRT misspecification. Under the continuous generating condition, however, we again see a growth estimate of 42% under MIRT (an 8% underestimate, likely due to some lack of reliability in the measures), but observe a growth rate of only 15% under FOHM, an underestimate of 35%. Similar to our illustration in Figure 1, there is a real potential for substantial underestimation of growth when the proficiencies are truly continuous.
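In code, the adjustment is a one-liner; est_change here stands for the vector of unconstrained change estimates from the continuous model.

```r
# If no one actually grew, positive and negative estimates would each occur about
# 50% of the time, and the adjusted proportion would be near zero.
prop_grown <- mean(est_change > 0) - mean(est_change < 0)
```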
Illustration 2: Relative Robustness of the FOHM and M2PL Models in a Multidimensional Framework
To further investigate the robustness of the two model types in a more realistic setting, we consider a multidimensional framework involving growth on multiple different skills. Similar to the previous simulation, two types of data are generated: one assumes all skills are continuous and the other assumes all skills are binary. To further follow the real data study, we consider two 25-item tests measuring a total of four skills constructed from the parameters observed across the 50 total items studied in Chen et al. (2018). The detailed Q-matrices/factor matrices of the two 25-item tests can be found in Supplementary Appendix B. We then assume the two tests are administered under the same design as in Table 2. In the simulation condition based on continuous skills, the true initial proficiencies follow a multivariate normal distribution in which the variance-covariance matrix is defined by the estimated correlations among the four MIRT proficiencies from the real data. At post-test, we assume all the examinees show increases (growth) sampled from a normal distribution. As under the FOHM, negative values of true growth are replaced by 0, so as to create measurement conditions more equivalent to those simulated under the FOHM and to avoid complications that are trivial to the purpose of the comparison. The generating item parameters for these data are the estimated parameters from a MIRT analysis of the real data. We also apply the same general strategy to generate data from binary attributes (i.e., using the real data Q-matrices, as well as slip and guessing parameter estimates), and assume as an initial distribution of attribute patterns the estimated initial distribution from Chen et al., with the true growth simulated according to the estimated transition probabilities between attribute patterns seen in their real data. Because of the monotonicity constraint noted earlier, the transition probabilities estimated by the FOHM analysis of the real data likewise imply no negative growth. Supplementary Appendix C displays the item parameters used for generating the two types of data.
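A hedged sketch of the continuous generating condition follows; the correlation matrix and growth distribution shown are illustrative stand-ins for the values reported in the appendices.

```r
library(MASS)
Sigma <- matrix(0.6, 4, 4); diag(Sigma) <- 1            # placeholder skill correlations
theta_pre  <- mvrnorm(2000, mu = rep(0, 4), Sigma = Sigma)
growth     <- pmax(matrix(rnorm(2000 * 4, mean = 0.5, sd = 0.5), ncol = 4), 0)  # negatives set to 0
theta_post <- theta_pre + growth
```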
As in the previous simulation, we apply two MIRT approaches and the FOHM model to both datasets. For MIRT, the separate calibration condition estimates the two four-element skill vectors for each examinee separately at each time point in evaluating growth; in the joint calibration condition, these skill vectors are estimated jointly across time points, thus attending not only to intercorrelations among the skills, but also intercorrelations among the growth across skills, in determining growth estimates. As seen in Table 5, the recovery correlations for the two MIRT calibrations tend to be larger than for the FOHM correlations under the presence of skill continuity. Under the presence of binary skills, however, while the recovery correlations for the FOHM are generally better, the FOHM and MIRT correlations are much closer. Like the findings of the previous section, these findings suggest a greater robustness of MIRT. Specifically, there is a greater likelihood of misspecified growth if using the FOHM to approximate growth in the presence of continuous proficiencies that actually grow in a continuous fashion than occurs when using a MIRT model to approximate growth in the presence of binary skills with truly discrete levels of growth. Moreover, the similarity between the results of the separate and joint MIRT calibrations (see Table 5 and the columns under “M2PL_Q1” in Table 3) further suggests that at least for these analyses, the two methods return comparable results.
Table 5.
Pearson Correlations Between Model Estimated Growth and True Growth for Each Skill, Binary and Continuous Data Generating Conditions, Simulation Illustration 2
| Skill | M2PL, separate (continuous) | M2PL, joint (continuous) | FOHM (continuous) | M2PL, separate (binary) | M2PL, joint (binary) | FOHM (binary) |
|---|---|---|---|---|---|---|
| Skill 1 | 0.52 | 0.57 | 0.21 | 0.30 | 0.27 | 0.39 |
| Skill 2 | 0.61 | 0.64 | 0.27 | 0.45 | 0.47 | 0.57 |
| Skill 3 | 0.48 | 0.58 | 0.22 | 0.40 | 0.40 | 0.45 |
| Skill 4 | 0.51 | 0.60 | 0.27 | 0.44 | 0.43 | 0.52 |
Note. M2PL = multidimensional two-parameter logistic; FOHM = first-order hidden Markov.
Illustration 3: Relative Robustness of the FOHM and M2PL Models in a Multidimensional Framework With Different Q-matrix Specifications
As seen in the Q-matrix of Illustration 2, each skill is individually measured by a number of single-skill items and no nesting structure is present in the Q-matrix, which is a relatively ideal specification condition, as it will largely increase classification accuracy (Madison & Bradshaw, 2015). However, this feature is difficult to achieve in many realistic diagnostic settings, such as in the fraction subtraction data set of Tatsuoka (1984), arguably the most common data set used for CDM illustrations. In that data set, Skills 1, 3, 4, 5, 6, and 8 are all nested within Skill 7. Therefore, to generalize the findings derived from the previous illustrations to more realistic settings, we further explore the relative robustness of the two model types under more complicated Q-matrix specifications, including ones where: (a) no skill is measured in complete isolation from the others (i.e., no single-skill items), and (b) a nested structure of skill measurement exists (e.g., whenever Skill 4 is required, so is Skill 1). Example Q-matrices corresponding to such conditions can be found in Supplementary Appendix D. We consider these two specifications here because they likely resemble realistic settings involving a high specificity of skills and a hierarchy among skill requirements that preclude the ideal simple-structure condition.
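To make the two conditions concrete, the following illustrative rows (not the actual Appendix D Q-matrices) satisfy both constraints: every item measures at least two skills, and Skill 4 never appears without Skill 1.

```r
Q_example <- rbind(
  c(1, 1, 0, 0),
  c(0, 1, 1, 0),
  c(1, 0, 1, 0),
  c(1, 0, 0, 1),   # Skill 4 accompanied by Skill 1
  c(1, 1, 0, 1)    # Skill 4 accompanied by Skill 1
)
colnames(Q_example) <- paste0("Skill", 1:4)
```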
Following our previous illustrations, two types of multidimensional response data are generated for each Q-matrix/factor matrix: One assumes continuous skills and the other assumes binary skills. Data generation follows the same procedures as introduced in Illustration 2. By fitting both MIRT approaches and the FOHM to the two generated data types within each Q-matrix specification condition, we again calculate the correlations between the true and the estimated growth for the four skills (Tables 6 and 7).
Table 6.
Pearson Correlations Between Model Estimated Growth and True Growth for Each Skill, Binary and Continuous Data Generating Conditions, Q-matrix/Factor Matrix With No Single Skill Items, Simulation Illustration 3
| Skill | M2PL, separate (continuous) | M2PL, joint (continuous) | FOHM (continuous) | M2PL, separate (binary) | M2PL, joint (binary) | FOHM (binary) |
|---|---|---|---|---|---|---|
| Skill 1 | 0.48 | 0.49 | 0.21 | 0.25 | 0.23 | 0.34 |
| Skill 2 | 0.53 | 0.52 | 0.24 | 0.26 | 0.25 | 0.33 |
| Skill 3 | 0.58 | 0.52 | 0.22 | 0.36 | 0.36 | 0.43 |
| Skill 4 | 0.57 | 0.54 | 0.28 | 0.38 | 0.37 | 0.44 |
Note. M2PL = multidimensional two-parameter logistic; FOHM = first-order hidden Markov.
Table 7.
Correlations Between Model Estimated Growth and True Growth for Each Skill, Binary and Continuous Data Generating Conditions, Simulation Study Based on a Nested Q-Matrix/Factor Matrix (Skill 4 Nested Within Skill 1) With No Single Skill Items, Simulation Illustration 3
| Skill | M2PL, separate (continuous) | M2PL, joint (continuous) | FOHM (continuous) | M2PL, separate (binary) | M2PL, joint (binary) | FOHM (binary) |
|---|---|---|---|---|---|---|
| Skill 1 | 0.64 | 0.69 | 0.29 | 0.35 | 0.34 | 0.43 |
| Skill 2 | 0.49 | 0.48 | 0.19 | 0.27 | 0.27 | 0.31 |
| Skill 3 | 0.56 | 0.51 | 0.16 | 0.38 | 0.37 | 0.46 |
| Skill 4 | 0.42 | 0.43 | 0.09 | 0.33 | 0.31 | 0.37 |
Note. M2PL = multidimensional two-parameter logistic; FOHM = first-order hidden Markov.
To compare the correlation results shown in Tables 5 to 7 more directly, we show both sets of results in Figure 3. Since the separate and joint calibrations again yielded highly correlated results under the MIRT approach (see the columns under “M2PL_Q2” and “M2PL_Q3” in Table 3), we only include the separate calibration results in Figure 3 to avoid redundancy. The patterns observed under the two more complex Q-matrix specification conditions (Q2 and Q3) are seen to be essentially the same as in Illustration 2 (which uses Q1). Comparing the differences between the bar graphs in the first row (FOHM is correct and M2PL is misspecified) and the differences between the second row (M2PL is correct and FOHM is misspecified), the M2PL again emerges as more robust under model misspecification. Also, when the FOHM is misspecified (the bottom-right bar graph), the nested Q matrix structure seemingly reduces even further the FOHM’s robustness in recovering the true growth of the nested skill (Skill 4 in this case). Since our intention in this illustration is more one of generalizing the findings seen in Illustration 2, rather than to study the actual impact of different Q-matrix designs, we do not dig deeper into this topic, but acknowledge it as an exploration worthy of future study, as results may vary for different Q matrices.
Figure 3.
Bar Graphs of the Recovery Correlations Shown in Tables 5 to 7
Note. Q1 denotes the Q-matrices used in Illustration 2, which are based on real data Q-matrices; Q2 and Q3 are the Q-matrices described in Conditions 1 (no single-skill item) and 2 (no single-skill item and Skill 4 nested within Skill 1) of Illustration 3.
Real Data Study: Potential Consequences of Using a Less Robust Model in Measuring Growth in Latent Skills
As shown in the simulation studies, while there is little disputing the relatively high accuracy of FOHM estimates when the latent skill distributions are binary, the binary skills approach does not appear to function as well when the latent proficiencies are in actuality continuously distributed. Therefore, in this section, we aim to compare methods in the context of the real-data example to examine potential growth insensitivity issues under CDMs when latent skill continuity is present. The real data (which, as noted earlier, were also the basis for our simulation studies) come from a spatial reasoning assessment described in detail by Wang, Yang, et al. (2018). As reported in Wang et al., the forms for this study were developed in accord with the Revised Purdue Spatial Visualization Test (Yoon, 2011); the study itself incorporated a training tool to facilitate the learning of rotation skills over time. Five test forms were developed, each containing 10 items. The five forms were administered in varying orders across five time points for five randomly defined student groups. Each of the test forms measures the same four spatial rotation skill attributes: (a) a 90° x-axis rotation, (b) a 90° y-axis rotation, (c) a 180° x-axis rotation, and (d) a 180° y-axis rotation. Students were also randomly assigned for exposure to two types of learning interventions that provided feedback related to their responses on the previous test form, although the effect of this manipulation is not a focus in the current analysis.
We consider the spatial reasoning test data here because it likely represents a setting where the specificity of skills is high. Moreover, in several longitudinal CDMs, it has been found to provide a respectable level of fit (Chen et al., 2018; Wang, Yang, et al., 2018; Wang, Zhang, et al., 2018). One difference in the real data application compared with our previous simulation illustrations concerns the design of the study. Although our previous simulations considered two time points with 25-item test forms administered at each time point, the real data analysis is based on data collected at five time points, with a different 10-item test being administered at each time point. The larger number of time points (and fewer items administered at each time point) is expected to have at least a couple of implications. First, due to the application of a monotonicity constraint, it creates larger differences between the FOHM approach to study change and that assumed with a MIRT approach. Due to the high dimensionality involved under the MIRT approach, we assumed (possibly correlated) linear growth trajectories for each skill across the five time points when applying the joint calibration (as detailed below), implying that each examinee is characterized by a baseline status (a four-element ability vector) and a growth vector (a four-element vector of linear slopes). Despite the use of shared information across attributes under both approaches, the combination of linear growth under the MIRT approach, and the application of a monotonicity constraint under FOHM, conceivably make the modeling approaches different in more ways than the binary/continuous nature of the skills. Second, with fewer items (10) administered at each time point, there is greater measurement error in the ability estimates at each time point, and a greater potential for the structural form of the growth model assumed to more heavily influence the nature of the change seen over time. This likely makes it more challenging to attribute differences between methods solely to the nature of the skills assumed (discrete vs. continuous) as was the case in the simulation illustrations. For this reason, in the MIRT analyses, we focus only on results from the joint approach, where like the FOHM we can similarly attend to associations across attributes and attribute growth. Specifically, we consider joint approaches that (a) attend to data from all five time points by assuming distinct (but possibly correlated) linear trajectories over time; and (b) attend to data from only the first and last time points, and quantifying growth as the difference between the Time 5 and Time 1 skill estimates. To the extent that this general approach can be replicated with other datasets, the current illustration hopefully helps motivate investigators to consider a similar type of comparison with their own data.
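As a simplified stand-in for the model-based linear growth parameters, the two quantifications compared below can be sketched as follows, where theta_hat is a hypothetical examinee-by-time matrix (five columns) of estimated proficiencies on one skill:

```r
diff_t5_t1 <- theta_hat[, 5] - theta_hat[, 1]                            # Time 5 minus Time 1
slope <- apply(theta_hat, 1, function(y) coef(lm(y ~ seq_along(y)))[2])  # per-examinee linear slope
table(diff_t5_t1 > 0)   # count showing an increase, as in Table 8
table(slope > 0)        # count with a positive slope
```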
Table 8 shows the total number of examinees estimated to have gains on the four skill attributes under the FOHM and the two joint approaches using the M2PL. Unlike the FOHM model, the M2PL imposes no monotonicity structural constraint; as a result, fluctuations may exist and lead to extreme start-point or end-point values. Thus, for the M2PL applications, we report the numbers (and corresponding proportions) of examinees demonstrating a positive change, as implied either by the estimated skills at the first and last time points or by the estimated linear slope across all five time points. For reasons we speculate on below, the differences between the FOHM and both MIRT approaches are substantial. The proportions of students showing skill gains according to the FOHM range from 12% to 20% depending on the skill attribute, while under the MIRT approaches we see percentages ranging from 32% to 99%.
Table 8.
Estimated Number (Percentage) of Students Showing Increases in Each Measured Skill Under the FOHM and the M2PL, Real Data Analysis
| Model | Growth | Skill 1 | Skill 2 | Skill 3 | Skill 4 |
|---|---|---|---|---|---|
| FOHM | Increase (TP5–TP1) | 42 (.120) | 57 (.163) | 69 (.197) | 71 (.203) |
| MIRT, joint | Increase (TP5–TP1) | 113 (.323) | 346 (.989) | 328 (.937) | 335 (.957) |
| MIRT, joint | Positive slope | 336 (.960) | 323 (.923) | 330 (.943) | 294 (.840) |
Note. TP5–TP1 denotes subtracting the proficiency estimate at Time point 1 from the proficiency estimate at Time point 5; in the MIRT “Increase” approach, skill estimates are based only on measures collected at Time 1 and Time 5; for the FOHM, all five time points are considered. FOHM = first-order hidden Markov; M2PL = multidimensional two-parameter logistic; (M)IRT = (multidimensional) item response theory.
As noted, due both to the complexity of the data collection design as well as the substantial error in estimation at each time point (due to the use of only 10 items to measure 4 skills), we think these large differences are due to multiple factors. First, the demonstrated presence of latent skill continuity in the dataset (Huang & Bolt, 2022) is likely a factor, a result we have also seen in the previous simulations. Second, the presence of correlated skill attributes and skill growth gains makes it easier to demonstrate growth—a student that has convincingly demonstrated growth on one skill is more likely to be identified as showing growth on other skills due to the strong positive correlations that exist among skill growth across dimensions, correlations that become especially influential when the data contributions toward estimations of individual skills at each time point are weak. Third, under the MIRT approaches, even small estimated gains in growth count as growth; consequently many students whose growth was not substantial enough to cross a mastery threshold will still be identified as growing under the MIRT approach.
Although Table 8 displays the marginal numbers showing gains for each skill, Table 9 provides some examples of individual-level growth estimates that demonstrate the issue at hand. As in Table 8, the MIRT estimates shown are based on the joint calibration approaches, and display the change estimates as defined both from the slope across all five timepoints, as well as the difference between the Time 5 and Time 1 skill estimates. Also shown are the FOHM change estimates for these example patterns defined from the FOHM analysis. As our interest here is in understanding the lower levels of robustness in capturing growth under the binary approach, we highlight examples from the first transition state (non-master → non-master) and the third transition state (master → master). Each row in Table 9 corresponds to a different student, identified by case number.
Table 9.
Individual-Level Missing Growth Under FOHM, Real Data Analysis
| Skill | FOHM state | ID | M2PL: Increase (TP5–TP1) | M2PL: Positive slope |
|---|---|---|---|---|
| 1 | Non-master to non-master | 182 | .230 | .088 |
| 1 | Master to master | 274 | .108 | .136 |
| 2 | Non-master to non-master | 315 | .282 | .102 |
| 2 | Master to master | 84 | .685 | .104 |
| 3 | Non-master to non-master | 248 | .527 | .226 |
| 3 | Master to master | 139 | .564 | .278 |
| 4 | Non-master to non-master | 248 | 1.176 | .159 |
| 4 | Master to master | 233 | 1.243 | .218 |
Note. TP5–TP1 denotes the difference between the skill estimates at Time point 5 and Time point 1. FOHM = first-order hidden Markov; M2PL = multidimensional two-parameter logistic.
Taking Skill 4 as an example, we see that examinees 248 and 233 have positive overall changes and slopes under the M2PL but are declared no-change examinees under the FOHM—ID 248 is classified as starting and ending as a non-master, while ID 233 is categorized as starting and ending as a master. Similar to the red and green examinees shown in Figure 1, despite seemingly large increases in the continuous skill estimates over time, the gains are overlooked. Similar results are also observed for IDs 182 and 274 under Skill 1, IDs 315 and 84 under Skill 2, and IDs 248 and 139 under Skill 3. Overall, there are substantially more cases in which examinees with positive slopes and overall increases show no gain under the FOHM. Admittedly, growth in the FOHM analyses is always being evaluated in a multivariate fashion, implying that the FOHM changes for any single attribute can also be affected by what is observed to occur for other attributes (this likely also explains why examinees with higher MIRT proficiencies sometimes have lower status under the FOHM). Nevertheless, it seems clear that there is a real potential for meaningful change on individual attributes to go unseen. Although our focus for model comparison is on M2PL versus FOHM, the method we used to examine potential consequences can be applied to any comparison between CDMs and (M)IRT models.
Beyond being a curiosity, such results, if observed in practice, arguably have the potential to create confusion in measuring growth. As demonstrated in Huang and Bolt (2022), in the presence of model misspecification due to latent skill continuity, the metrics of skill mastery in CDMs unintentionally change both when jointly calibrating alternate forms (i.e., when new items are added to, or taken away from, a previous calibration) and when superimposing the restrictions of monotonicity on the measurement model. Consequently, in the absence of consistent metrics, a student displaying overall growth in proficiency from one time point to the next may nevertheless show no increase on certain skills. Conversely, a diagnosis of transition from non-mastery to mastery might simply reflect a lack of skill attribute metric invariance across administrations rather than a true gain.
Conclusion and Discussion
This study explores the relative robustness of CDM and (M)IRT models to simulated growth from each model type. Unfortunately, the true nature of the latent distribution with real data is generally not known with certainty. However, as shown previously (Bolt & Kim, 2018), even for tests intentionally designed for diagnostic purposes, there is a strong likelihood of latent skill continuity, as evidenced by systematic violations of parameter invariance in CDMs observed across populations of different ability levels. A chief concern in longitudinal (growth) settings with binary skill misspecification is its potentially reduced sensitivity to actual growth. Specifically, if growth occurs with respect to a continuum, many students that actually grow may not show evidence of growth if the change in their continuous proficiency does not cross what might be viewed as an arbitrarily defined mastery/non-mastery threshold. In the current study, we see this in both real and simulation examples.
Although each model, as expected, functions well when applied to data generated from its own model, a goal of this article is to consider a feasible approach practitioners can adopt to compare approaches and to investigate the implications of model choice on estimated growth outcomes. From our simulations, the (M)IRT model appears to show considerably greater robustness to misspecification than the CDM. Furthermore, the real data comparison of the FOHM versus MIRT analyses suggests the danger of applying a less robust model in a setting where its assumption about the nature of the latent distribution is violated. Specifically, growth can easily be underestimated when skills are continuous but a binary representation is assumed. Taken together, we find (M)IRT to be a more robust way of quantifying growth, and it could thus arguably be viewed as a preferred approach if one is unsure of the nature of the latent skill distribution, as is often the case in real data analysis.
It is sometimes noted that binary skills models often provide estimates along a continuum in the form of posterior probabilities (which range between 0 and 1), thus seemingly providing a type of continuous score report like that of continuous skills. It is important to note, however, that posterior probabilities represent confidence in classification rather than an actual skill continuum; the continuum reflected by posterior probabilities is better viewed as the amount of evidence accumulated in support of a mastery/non-mastery classification, not an intermediate skill level. The distinction can be especially important to appreciate in longitudinal (growth) contexts, where one would not want to confound changes in skill level with changes in the amount of evidence gathered in support of skill mastery, the latter of which could be due, for example, to changes in the reliability of measurement.
Our intention is not to diminish the apparent value of diagnostic classifications, especially in cross-sectional contexts, nor to claim that one should always use a continuous metric for latent skills. When tests are suitably designed, the capacity to provide specific information about skills should naturally be an important application of educational measurement. Our studies are naturally also limited to the particular measurement conditions considered. Nevertheless, our results suggest some concern over applying binary skill models to measure growth when the discrete nature of the skills is less than clear. As seen in the real data illustration, quantifications of growth under binary skills models can be easily underestimated, and one can envision teachers becoming quickly frustrated by a scoring approach that fails to acknowledge actual student growth. Our comparisons of MIRT and CDM approaches with a real dataset show that considerably larger numbers of respondents demonstrate growth on each skill attribute under MIRT as compared with CDM.
It is also important to note that, by studying the relative robustness of CDM and (M)IRT, we intend to provide researchers another criterion worthy of consideration in deciding between model types, rather than to claim its superiority or neglect consideration of other criteria. Naturally, model fit criteria could also be taken into consideration if one aims to decide which model is more appropriate for a specific dataset. Limited-information statistics such as M2 (Hansen et al., 2016; Liu et al., 2016; Maydeu-Olivares, 2013) can be used to evaluate the absolute fit of the models; AIC (Akaike, 1974) and BIC (Schwarz, 1978) can assess the models’ relative fit. Our purpose in this article is to focus on the consequences of misspecification, although future work could consider how well robustness in estimating growth aligns with the outcomes of model fit criteria.
Of course, there are other aspects of comparing CDM and MIRT models that should receive research attention. Although this study focused on misspecification of the nature of the latent skills, the MIRT and CDM approaches could also be contrasted with respect to misspecification of the factor loading/Q-matrices. Bolt (2019), for example, suggested a higher tendency for Hooker’s paradox under CDMs than under an IRT approach, essentially implying that getting additional items correct can, under some circumstances, lead to a transition from mastery to non-mastery status. However, it is also possible that there is a greater potential to detect unusual skill patterns with conjunctive CDMs like the DINA model than with a MIRT approach. In longitudinal settings, the methods could also be compared when a monotonicity constraint on growth similar to that used in Wang et al. (2018) is also applied under the MIRT approach.
Supplemental Material
Supplemental material, sj-docx-1-epm-10.1177_00131644221117194 for Relative Robustness of CDMs and (M)IRT in Measuring Growth in Latent Skills by Qi (Helen) Huang and Daniel M. Bolt in Educational and Psychological Measurement
Supplemental material, sj-docx-2-epm-10.1177_00131644221117194 for Relative Robustness of CDMs and (M)IRT in Measuring Growth in Latent Skills by Qi (Helen) Huang and Daniel M. Bolt in Educational and Psychological Measurement
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: Qi (Helen) Huang
https://orcid.org/0000-0002-8091-5993
Daniel M. Bolt
https://orcid.org/0000-0001-7593-4439
Supplemental Material: Supplemental material for this article is available online.
References
- Akaike H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
- Bolt D. M. (2019). Bifactor MIRT as an appealing and related alternative to CDMs in the presence of skill attribute continuity. In von Davier M., Lee Y.-S. (Eds.), Handbook of diagnostic classification models (pp. 395–420). Springer.
- Bolt D. M., Kim J. S. (2018). Parameter invariance and skill attribute continuity in the DINA model. Journal of Educational Measurement, 55(2), 264–280.
- Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
- Chen Y., Culpepper S. A., Wang S., Douglas J. (2018). A hidden Markov model for learning trajectories in cognitive diagnosis with application to spatial rotation skills. Applied Psychological Measurement, 42(1), 5–23.
- Hansen M., Cai L., Monroe S., Li Z. (2016). Limited-information goodness-of-fit testing of diagnostic classification item response models. British Journal of Mathematical and Statistical Psychology, 69(3), 225–252.
- Huang H. Y. (2017). Multilevel cognitive diagnosis models for assessing changes in latent attributes. Journal of Educational Measurement, 54(4), 440–480.
- Huang Q., Bolt D. M. (2020). The potential for interpretational confounding in discrete skills models [Poster session]. Virtual IMPS 2020.
- Huang Q., Bolt D. M. (2022). The potential for interpretational confounding in cognitive diagnosis models. Applied Psychological Measurement, 46(4), 303–320.
- Kaya Y., Leite W. L. (2016). Assessing change in latent skills across time with longitudinal cognitive diagnosis modeling: An evaluation of model performance. Educational and Psychological Measurement, 77, 369–388.
- Li F., Cohen A., Bottge B., Templin J. (2015). A latent transition analysis model for assessing change in cognitive skills. Educational and Psychological Measurement, 76, 181–204.
- Liu Y., Tian W., Xin T. (2016). An application of M2 statistic to evaluate the fit of cognitive diagnostic models. Journal of Educational and Behavioral Statistics, 41, 3–26.
- Ma W., Minchen N., de la Torre J. (2020). Choosing between CDM and unidimensional IRT: The proportional reasoning test case. Measurement: Interdisciplinary Research and Perspectives, 18(2), 87–96.
- Madison M. J., Bradshaw L. P. (2015). The effect of Q-matrix design on classification accuracy in the log-linear cognitive diagnosis model. Educational and Psychological Measurement, 75(3), 491–511.
- Madison M. J., Bradshaw L. P. (2018). Assessing growth in a diagnostic classification model framework. Psychometrika, 83(4), 963–990.
- Maydeu-Olivares A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research and Perspectives, 11, 71–101.
- Reckase M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 21, 25–36.
- Reckase M. D. (1997). A linear logistic multidimensional model. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 271–296). Springer-Verlag.
- Rizopoulos D. (2006). ltm: An R package for latent variables modeling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25.
- Schwarz G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
- Tatsuoka K. (1984). Analysis of errors in fraction addition and subtraction problems (Final Report for NIE-G-81-0002). University of Illinois, Urbana–Champaign.
- Templin J., Bradshaw L. (2014). The use and misuse of psychometric models. Psychometrika, 79(2), 347–354.
- von Davier M., Haberman S. J. (2014). Hierarchical diagnostic classification models morphing into unidimensional “diagnostic” classification models—A commentary. Psychometrika, 79(2), 340–346.
- Wang S., Yang Y., Culpepper S. A., Douglas J. A. (2018). Tracking skill acquisition with cognitive diagnosis models: A higher-order, hidden Markov model with covariates. Journal of Educational and Behavioral Statistics, 43(1), 57–87.
- Wang S., Zhang S., Douglas K., Culpepper S. (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16(1), 45–58.
- Yoon S. Y. (2011). Psychometric properties of the revised Purdue Spatial Visualization Test: Visualization of rotations (the revised PSVT-R) [Unpublished doctoral dissertation]. Purdue University.
- Zhan P., Jiao H., Liao D., Li F. (2019). A longitudinal higher-order diagnostic classification model. Journal of Educational and Behavioral Statistics, 44(3), 251–281.
- Zhang S., Chang H.-H. (2019). A multilevel logistic hidden Markov model for learning under cognitive diagnosis. Behavior Research Methods, 52(1), 408–421.