Abstract
Binary examinee mastery/nonmastery classifications in cognitive diagnosis models may often be approximations to proficiencies that are better regarded as continuous. Such misspecification can lead to inconsistencies in the operational definition of “mastery” when binary skills models are assumed. In this paper we demonstrate the potential for an interpretational confounding of the latent skills when truly continuous skills are treated as binary. Using the DINA model as an example, we show how such forms of confounding can be observed through item and/or examinee parameter change when (1) different collections of items (such as those representing different test forms) previously calibrated separately are subsequently calibrated together, and (2) structural restrictions are placed on the relationships among skill attributes (such as the assumption of strictly nonnegative growth over time), among other possibilities. We examine these occurrences in both simulation and real data studies. It is suggested that researchers should regularly attend to the potential for interpretational confounding by studying differences in attribute mastery proportions and/or changes in item parameter (e.g., slip and guess) estimates attributable to skill continuity when the same samples of examinees are administered different test forms, or the same test forms are involved in different calibrations.
Keywords: interpretational confounding, cognitive diagnosis models, misspecification, latent skill continuity
Introduction
The past decades have seen a proliferation of models and techniques related to cognitive diagnosis models (CDMs). The models and applications are appealing due to their potential to provide information related to the specific types of skills students possess or lack, and thus permit the use of educational assessments to track learning in educational settings. There is little disputing the potential value of such applications and the diagnostic motivations that inspire them. The usefulness of diagnostic measurement models and associated assessments may especially be realized in longitudinal assessment contexts where educators can also administer student-tailored interventions and subsequently monitor the effectiveness of those interventions in terms of skill acquisition over time. The recent literature has seen a number of cognitive diagnosis modeling developments designed for such applications (e.g., Chen et al., 2018; Huang, 2017; Kaya & Leite, 2017; Li et al., 2016; Madison & Bradshaw, 2018; Wang et al., 2018; Zhan et al., 2019).
While diagnostic modeling is still relatively new in application, there is a much longer history to the psychometrics of longitudinal measurement, and in particular the need for conditions of measurement invariance to make longitudinal measurement meaningful (McArdle, 2007; Meredith & Horn, 2001; Millsap & Meredith, 2007). To effectively monitor growth in proficiencies over time, it is important not only that the same skill constructs be measured over time, but also that scoring occur against a consistent metric. The necessity of measurement invariance also holds true for applications of diagnostic assessments. Observing a lack of invariance over time can be problematic. For example, in the absence of consistent metrics, it is conceivable that a student designated a master of a skill at the beginning of the year might be declared a nonmaster at the end of the year despite no change, or even an increase, in their actual skill. One can envision educators becoming frustrated and confused with an assessment framework if the metrics of skill mastery are unintentionally changing across test administrations. As current CDM practice often relies heavily on empirically defined mastery thresholds for skill attributes, it becomes important to verify that different calibrations are in fact defining mastery in consistent ways.
A common strategy for examining invariance in latent variable models focuses on the consistency of measurement model item parameters (Millsap & Cham, 2012). The observation of changes in item parameter estimates across measurement occasions may reflect change in the construct being measured, in the latent metric being applied for measurement, or some combination of the two. Consequently, where systematic item parameter change is detected, there can be ambiguity in the meaning of scores and in the interpretation of person parameter change. Particular concerns over invariance arise when the groups/time points involved are associated with different proficiency distributions, as even small amounts of model misspecification can lead to a lack of model parameter invariance (Bolt, 2002; Shepard et al., 1984). In longitudinal studies, differences in proficiency distributions are naturally expected when growth occurs across time points. It therefore becomes especially important when evaluating measurement invariance to be sensitive to likely forms of model misspecification.
Bolt and Kim (2018) showed how one form of misspecification in CDMs, namely continuity in the latent skill attributes, is expected to produce violations of parameter invariance when binary skills are assumed and the CDM is applied across groups having different latent skill distributions. The item parameter change seen under such forms of misspecification is often predictable in nature, and can be seen to correspond to a change in the threshold applied to the continuous proficiency in defining attribute mastery. For example, Bolt and Kim (2018) demonstrated using the Tatsuoka (1990) fraction subtraction dataset how the threshold for mastery along the continuous proficiency metrics tended to increase as the mean skill levels of the population increased.
In this article, we build on the Bolt and Kim (2018) study to consider specific contexts in which skill attribute continuity can contribute to changes in the definition of skill attribute mastery used under binary CDMs. Unlike Bolt and Kim (2018), who focused on changes in the meaning of attribute mastery caused by applying CDMs to subpopulations with different ability distributions, in the current analyses we have only one population of interest but changing measurement conditions: either new (different) items are added to an assessment, or a structural constraint (i.e., forcing change in mastery status to be nondecreasing) is imposed. We study this issue in longitudinal applications of CDMs in which different test forms may be used over time as a basis for studying changes in attribute mastery status.
Cognitive Diagnosis Models and Interpretational Confounding
Although there now exist many different CDMs, we focus on the Deterministic-Input-Noisy-And (DINA) model (Junker & Sijtsma, 2001), a commonly applied model in diagnostic modeling contexts and one that is frequently at the core of more general CDMs. We then examine a recent generalization of the DINA approach for longitudinal applications, a first-order hidden Markov (FOHM; Chen et al., 2018) model. The FOHM model (1) simultaneously calibrates several different forms administered to the same examinees across discrete time points, and (2) superimposes a monotone growth constraint on the nature of skill attribute change over time. We consider how these two added components have the potential to alter the meaning of skill attribute mastery relative to the DINA model. The DINA and FOHM models are selected here for demonstration purposes only; the same approach can be applied to other CDMs. Following terminology commonly applied in structural equation modeling (SEM), we refer to changes in meaning of the latent attribute across conditions as an “interpretational confounding” of the latent variable. Simply described, an interpretational confounding implies a change in the meaning of the latent variable from that intended by the practitioner (Burt, 1976; Bainter & Bollen, 2014; Kline, 2015, p. 339). In the context of a binary skill attribute, such a confounding corresponds to a change in the meaning of attribute mastery which, in the presence of continuity, could imply a change in either the meaning of the proficiency or, alternatively, the level of the proficiency associated with a mastery classification.
For traditional Item Response Theory (IRT) models, such as the Rasch model, the item (i.e., difficulty) and person (i.e., proficiency) parameters can be interpreted in relation to the same latent metric. Thus, the implications of item parameter change for the person parameter metric can be easily evaluated. One of the challenges with CDMs such as the DINA model is that this shared metric is not present. Specifically, the extent to which a change in the item slip or guess probabilities is associated with changes in the underlying attribute metric is not as immediately interpretable. Thus, a secondary goal of this work is to understand the degree to which latent attribute metrics change in relation to item parameter change under the DINA model.
Deterministic-Input-Noisy-And Model
The DINA model is a commonly applied psychometric model for cognitive diagnosis. It includes several elements: an item-skill attribute incidence matrix (Q-matrix), parameters that characterize skill attribute proficiency at both the respondent and population levels, and item parameters (i.e., slip and guess parameters). Entry $q_{jk}$ in the Q-matrix indicates whether skill attribute $k$ ($k = 1, \ldots, K$) is required to answer item $j$ ($j = 1, \ldots, J$) correctly (0 = not required, 1 = required). Under the DINA model, $\boldsymbol{\alpha}_i = (\alpha_{i1}, \ldots, \alpha_{iK})$ denotes student $i$'s ($i = 1, \ldots, I$) discrete skill attribute pattern, where $\alpha_{ik}$ is an indicator of whether the $i$th student has mastered the $k$th attribute (0 = nonmastery, 1 = mastery). Given this parameterization, the ideal response of an examinee with attribute pattern $\boldsymbol{\alpha}_i$ to item $j$ is denoted $\eta_{ij}$, and can be expressed as
$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}} \qquad (1)$$
Item slip parameters ($s_j$) define the probability of an incorrect response by an examinee with a correct ideal response ($\eta_{ij} = 1$), while item guess parameters ($g_j$) define the probability of a correct response by an examinee with an incorrect ideal response ($\eta_{ij} = 0$), i.e.,
$$s_j = P(X_{ij} = 0 \mid \eta_{ij} = 1), \qquad g_j = P(X_{ij} = 1 \mid \eta_{ij} = 0) \qquad (2)$$
When these parameters are combined with equation (1), the item response function of the DINA model becomes
$$P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}} \qquad (3)$$
The DINA model differs fundamentally from traditional IRT models (e.g., Rasch, two-parameter logistic models) in the lack of metric indeterminacy. As only two examinee states (mastery/nonmastery) are assumed to exist for each attribute, the same slip and guess parameters apply across populations even when the proportions of masters in those populations change. Where skill continuity is present, however, changes in the slip and guess parameters for common items may reflect changes in how attribute mastery is being defined. For example, Bolt and Kim (2018) found that the Tatsuoka fraction subtraction items systematically showed both higher guess and lower slip probability estimates when calibrated in higher ability subpopulations, a result consistent with a change in the meaning of the latent attribute metrics implying higher mastery thresholds in the higher ability subpopulations. Below we consider a different context in which even for the same (sub)population, a related occurrence of an interpretational confounding has the potential to emerge.
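To make the mechanics of equations (1) and (3) concrete, the following minimal R sketch evaluates the ideal responses and DINA response probabilities for a hypothetical three-item, two-attribute example; the Q-matrix entries, attribute pattern, and slip/guess values are illustrative assumptions, not values from any analysis in this paper.

```r
# Hypothetical Q-matrix: 3 items, 2 attributes
Q <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)
alpha <- c(1, 0)                                # master of attribute 1 only
s <- c(0.10, 0.20, 0.15)                        # hypothetical slip values
g <- c(0.25, 0.20, 0.10)                        # hypothetical guess values
eta <- apply(Q, 1, function(q) prod(alpha^q))   # equation (1): ideal responses
p   <- (1 - s)^eta * g^(1 - eta)                # equation (3): P(correct)
p  # item 1 yields 1 - s; items 2 and 3 reduce to the guess probability
```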
First-Order Hidden Markov Model
As noted above, the DINA model has been generalized in useful ways to extend diagnostic measurement applications. Such generalizations can provide a meaningful context in which to explore the possible occurrences of interpretational confoundings due to the added modeling constraints. For example, Chen et al. (2018) formulated the FOHM model as a longitudinal extension of the DINA model that incorporates a hidden Markov model to characterize first-order transition probabilities that reflect changes in attribute mastery status over time. Similar to equation (3), the item response function for the FOHM model in Chen et al. is
$$P(X_{ijft} = 1 \mid \boldsymbol{\alpha}_{it}) = (1 - s_{jf})^{\eta_{ijft}} \, g_{jf}^{\,1 - \eta_{ijft}} \qquad (4)$$
where $t$ ($t = 1, \ldots, T$) denotes the time point of measurement. An additional index for form ($f$) indicates the form to which an item belongs; each examinee is administered a different form at each time point, and the time point at which a given form $f$ is administered varies across examinees.
With respect to the change process, a monotonic structural constraint is imposed such that the only permitted change in attribute mastery status is from nonmastery to mastery. In this case, the prior for $\boldsymbol{\alpha}_i = (\boldsymbol{\alpha}_{i1}, \ldots, \boldsymbol{\alpha}_{iT})$ becomes
$$P(\boldsymbol{\alpha}_i) = \pi_{\boldsymbol{\alpha}_{i1}} \prod_{t=2}^{T} \tau_{\boldsymbol{\alpha}_{i,t-1},\, \boldsymbol{\alpha}_{it}} \qquad (5)$$
where $\boldsymbol{\pi}$ denotes the initial class membership probabilities at time $t = 1$, and $\boldsymbol{\tau}$ is a matrix of first-order transition probabilities between classes, with the restriction that the probability of moving from mastery to nonmastery on an attribute is zero. A detailed Bayesian formulation for estimation of the FOHM is provided by Chen et al. (2018).
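For a single attribute, the restriction in $\boldsymbol{\tau}$ amounts to a triangular 2 × 2 transition matrix. A minimal R sketch follows; the transition probability `p` is a hypothetical value chosen only for illustration.

```r
# Per-attribute transition matrix under the monotonicity constraint:
# rows index status at time t-1, columns index status at time t.
p <- 0.3  # hypothetical probability of transitioning to mastery
tau <- matrix(c(1 - p, p,
                0,     1),  # mastery -> nonmastery fixed at zero
              nrow = 2, byrow = TRUE,
              dimnames = list(c("nonmaster", "master"),
                              c("nonmaster", "master")))
rowSums(tau)  # each row is a proper probability distribution
```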
In the real data application presented by Chen et al. (2018), to be further explored below, the application of the FOHM involved five time points and included data from five different test forms, with the forms administered in different orders to different examinees. To resolve invariance issues, the analysis assumed that the item parameters for each form remain constant over time. Fixing parameters in this way naturally helps reduce concerns about invariance; however, as we show below, it does not fully resolve concerns about metric changes.
A more general concern relates to how the skill attributes themselves may take on a different meaning when considered under the DINA versus the FOHM model. Through these two models we can draw parallels to the traditional observations of interpretational confounding that emerge when measurement model parameters estimated under a confirmatory factor analytic model are compared with those estimated under a full structural equation model.
Illustration of Interpretational Confounding Under the Deterministic-Input-Noisy-And Model due to Skill Attribute Continuity
As noted earlier, Bolt and Kim (2018) illustrated the lack of parameter invariance that emerges due to skill attribute continuity in the popular Tatsuoka (1990) fraction subtraction dataset. The patterns of invariance observed in their analysis were consistent with misspecification related to skill continuity; specifically, slip estimates tended to be lower, and guess estimates higher, when the calibration was applied to a higher ability subsample as compared to a lower ability subsample, effectively implying that a higher threshold for attribute mastery was applied in higher ability populations. Figure 1 provides a hypothetical illustration of the nature of the effect when considered in relation to a single attribute. In this example, we assume a continuous normally distributed proficiency $\theta$, as shown in Figure 1(a); for a hypothetical item measuring this proficiency, we assume the true item characteristic curve (ICC) characterizing performance on the item is as shown in Figure 1(b), essentially a Rasch item. To the extent that the DINA model assumes the skill is binary, it effectively discretizes the continuous proficiency at some location along the continuum. On that basis, the DINA approximation can be understood by defining the guess probability as the normal-density weighted average of the response probabilities given by the ICC below the threshold, and the complement of the slip as the corresponding weighted average above the threshold (Bolt & Kim, 2018). As seen in Figure 1, when the threshold is set lower, we see a lower guess and a higher slip probability, while as the threshold increases, the guess gets progressively higher and the slip progressively lower.
Figure 1.
Illustration of hypothetical item/attribute relationship when skill continuity is present. (a) Item slip/guess estimates for three different locations of a mastery threshold along a latent continuous proficiency. (b) Item characteristic curve of the hypothetical item producing the slip and guess estimates shown in (a).
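The discretization logic just described can be made concrete in a few lines of R. The sketch below assumes a standard normal proficiency and a Rasch ICC with difficulty 0 (both hypothetical choices for illustration), and computes the implied slip and guess values at several thresholds:

```r
# Implied DINA slip/guess when a mastery threshold tau is imposed on a
# continuous proficiency theta ~ N(0, 1) measured by a single Rasch item.
icc <- function(theta, b = 0) plogis(theta - b)   # Rasch ICC, b assumed 0
implied_sg <- function(tau) {
  below <- integrate(function(x) icc(x) * dnorm(x), -Inf, tau)$value
  above <- integrate(function(x) icc(x) * dnorm(x), tau, Inf)$value
  c(guess = below / pnorm(tau),            # mean P(correct) among nonmasters
    slip  = 1 - above / (1 - pnorm(tau)))  # mean P(incorrect) among masters
}
round(sapply(c(-1, 0, 1), implied_sg), 3)  # guess rises, slip falls with tau
```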
In the next sections we provide some simple simulation examples to illustrate how an interpretational confounding can manifest in binary skill CDMs in the presence of model misspecification due to skill continuity. For illustration purposes, we begin with a simple unidimensional (single attribute) example, and later consider the more common multi-attribute condition.
Simulation Illustration 1: Item and Person Parameter Change in the Joint Calibration of Alternate Forms
One way in which an interpretational confounding may be observed is in the joint calibration of alternate forms (i.e., when new items are added to, or taken away from, a previous calibration). We consider a hypothetical setting in which two distinct forms actually measure a common continuous proficiency, but where a binary skill is assumed for that single proficiency. We first consider parameter estimates observed under separate calibrations of the forms, and then examine parameter change under a joint calibration in which both test forms are calibrated together. If the binary skill assumption is correct and the same skills and metrics are involved, we expect no change in either the item or person parameters (apart from what can be explained by random error), as there is no indeterminacy in these metrics.
In our illustration, item response data were generated for a single test-taking population (500 participants) using a continuous latent proficiency distributed as $\theta \sim N(0, 1)$. Each form is assumed to consist of four items, and data are generated from a Rasch model with item response function
$$P(X_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \qquad (6)$$
where $b_j$ denotes the item difficulty parameter for item $j$. The item characteristic curves of the eight items are shown in Figure 2. Test 1 consists of four Rasch items (items 1–4) having $b$ = −1.99, −1.33, −0.80, −0.35; Test 2 has four Rasch items (items 5–8) with $b$ = 0.51, 1.03, 1.48, 1.89. Thus, from a traditional IRT perspective, Test 2 is more difficult than Test 1. We assume the same simulated examinees take both tests. We first apply the DINA model separately to Tests 1 and 2, and then also consider a DINA calibration in which both tests are combined. In each case, the CDM package (Robitzsch et al., 2020) was used to fit the DINA model, and one binary skill attribute was assumed.
Figure 2.
Item characteristic curves for eight hypothetical items.
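A minimal R sketch of this design is given below. The generating difficulties are those listed above; the seed and the use of `CDM::din()` with a single-column Q-matrix are our own illustrative choices:

```r
library(CDM)
set.seed(1)
N <- 500
b <- c(-1.99, -1.33, -0.80, -0.35, 0.51, 1.03, 1.48, 1.89)
theta <- rnorm(N)                                   # theta ~ N(0, 1)
dat <- sapply(b, function(bj) rbinom(N, 1, plogis(theta - bj)))
colnames(dat) <- paste0("item", 1:8)
Q <- matrix(1, nrow = 8, ncol = 1)                  # one binary attribute
fit1 <- din(dat[, 1:4], q.matrix = Q[1:4, , drop = FALSE])  # Test 1 alone
fit2 <- din(dat[, 5:8], q.matrix = Q[5:8, , drop = FALSE])  # Test 2 alone
fitJ <- din(dat,        q.matrix = Q)                       # joint calibration
# Item estimates (e.g., fit1$slip, fitJ$guess) and the mastery proportions
# reported by summary() can then be compared across calibrations.
```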
Table 1 displays the attribute mastery proportions for the separate test form DINA calibrations, as well as for the joint DINA calibration, applied to the same item response data. Note that when the forms are analyzed separately using the DINA model, we observe a higher proportion of masters on the first test than on the second test, a result consistent with the first test possessing easier items. Further, when both forms are calibrated collectively, we observe a type of compromise between the two proportions, with approximately half of the sample now being identified as masters. As a result, despite the presence of a common trait and the same simulated examinees across all calibrations, we see substantial proportions of examinees whose designation as master or nonmaster depends on the particular test form administered. Analogous to Figure 1, Figure 3 clarifies how attribute mastery in these calibrations can be viewed as different cutpoints (thresholds) along the continuous proficiency dimension, here defined from the empirical mastery proportions of this example. In each case the threshold is defined as the lower bound of the integral of the proficiency density needed to return the estimated proportion of masters observed for each calibration.
Table 1.
Estimated Percentages of Masters and Nonmasters Across Three Calibrations.
| Calibration | Master (%) | Nonmaster (%) |
|---|---|---|
| Test 1 | 75.33 | 24.67 |
| Test 2 | 22.54 | 77.46 |
| Joint | 49.41 | 50.59 |
Figure 3.
Mastery threshold for each of three calibrations.
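Under the assumed $N(0, 1)$ proficiency, the thresholds shown in Figure 3 follow directly from the Table 1 mastery proportions; a one-line R check:

```r
# Mastery thresholds implied by the Table 1 proportions under theta ~ N(0, 1)
p_master <- c(Test1 = 0.7533, Test2 = 0.2254, Joint = 0.4941)
round(qnorm(1 - p_master), 2)  # approx. -0.68, 0.75, and 0.01
```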
A comparison of the separate and joint item calibrations also yields varying guess and slip estimates for the individual items, as reported in Table 2 and Figure 4. As expected, we observe systematic changes in item parameter estimates for the same items across calibrations. As seen in Figure 4(a), Test 1 item slip estimates decrease when more difficult items (Test 2 items) are added in the joint calibration, while Test 2 item slip estimates increase when the easier items (Test 1 items) are added. The inverse effect occurs for the guess estimates, as shown in Figure 4(b). Importantly, the direction of these effects is entirely consistent with the change in the attribute mastery proportions across the separate and joint calibrations. In effect, correct item responses that are viewed as reflections of attribute mastery under a separate calibration of Test 1 are more frequently seen as guesses from nonmasters under the joint calibration, while incorrect responses generally seen as incorrect responses from nonmasters under the separate calibration of Test 2 become increasingly viewed as slips from masters under the joint calibration.
Table 2.
Slip and Guess Estimates and Standard Errors for Separate Calibrations and Joint Calibration.
| Test | Item | Slip ($\hat{s}_j$): Separate | Slip ($\hat{s}_j$): Joint | Guess ($\hat{g}_j$): Separate | Guess ($\hat{g}_j$): Joint |
|---|---|---|---|---|---|
| Test 1 | 1 | 0.103 (0.019) | 0.056 (0.020) | 0.635 (0.053) | 0.722 (0.031) |
| | 2 | 0.111 (0.021) | 0.069 (0.024) | 0.455 (0.061) | 0.636 (0.035) |
| | 3 | 0.184 (0.024) | 0.070 (0.024) | 0.361 (0.061) | 0.483 (0.037) |
| | 4 | 0.256 (0.030) | 0.196 (0.033) | 0.097 (0.062) | 0.369 (0.038) |
| Test 2 | 5 | 0.224 (0.076) | 0.400 (0.038) | 0.259 (0.029) | 0.157 (0.031) |
| | 6 | 0.325 (0.076) | 0.555 (0.037) | 0.157 (0.026) | 0.107 (0.025) |
| | 7 | 0.494 (0.067) | 0.639 (0.035) | 0.142 (0.023) | 0.091 (0.024) |
| | 8 | 0.607 (0.062) | 0.699 (0.033) | 0.121 (0.021) | 0.065 (0.021) |
Figure 4.
Slip and guess estimates for separate and joint calibrations. (a) Slip estimates. (b) Guess estimates.
As all of these analyses are based on the exact same simulated item response dataset, the observations are not affected by sampling error, but can be viewed as real changes in the meaning of the proficiency, an interpretational confounding of the binary skill attribute.
Important to the current application is that the size of the change in slip/guess estimates need not be particularly large to reflect a substantial change in the skill attribute. As noted in the introduction, one of the challenges in understanding violations of invariance in the DINA model is that the item and person parameters cannot be interpreted against a shared metric (as in the Rasch model). While we have simulated rather substantial differences in test difficulty in these examples, the corresponding changes in slip and guess estimates might appear rather small. Frequently the changes are on the order of .10–.15, or even less, yet from the results above we know these changes reflect changes in attribute mastery percentages of approximately 25%, a substantial change in the meaning of attribute mastery.
Simulation Illustration 2: Item and Person Parameter Change in the Presence of Structural Restrictions on the Deterministic-Input-Noisy-And Model
A second way in which an interpretational confounding can emerge is when structural restrictions are imposed on a measurement model. This is the common context in which interpretational confoundings are observed in traditional SEMs, that is, when a constrained structural model (often in the form of a constrained set of causal paths) is superimposed upon a measurement model. In such instances, it is not uncommon to find that the hybrid SEM (including both the measurement and structural components) fits the data, but at the cost of changing the measurement model parameters from what they were without the structural constraints imposed, implying an interpretational confounding, or a change in the meaning of the latent variables.
In the context of longitudinal CDMs, one type of structural constraint that has been considered concerns how mastery status may change over time. In many educational contexts, it may be reasonable to believe that once skill mastery is attained, it should be maintained in subsequent administrations. Under the FOHM model of Chen et al. (2018), for example, it is assumed that once an examinee has attained a latent state of mastery on any skill attribute, that mastery status will be maintained at subsequent time points. In other words, the only permitted transitions on a latent skill attribute are nonmaster → nonmaster, nonmaster → master, and master → master.
As in the previous simulation example, we again consider a unidimensional (single attribute) application. We assume a two time-point study in which two six-item tests are administered in a pre-post design. To assure good precision in estimates from the simulation, we further assume 20,000 students randomly assigned to different test administration orders, one group (Group 1) taking Test 1 followed by Test 2, the other (Group 2) taking Test 2 followed by Test 1. We also assume each group comes from a continuous proficiency distribution that is $N(0, 1)$ at pretest, and that the nature of growth is such that half of the members in each group show 1 unit of growth (change) from pre- to posttest, the other half zero growth. Test 1 has six Rasch items (items 1–6) with $b$ = −1, −0.8, −0.6, −0.4, −0.2, 0; Test 2 has six Rasch items (items 7–12) with $b$ = 0, 0.2, 0.4, 0.6, 0.8, 1. As in our previous example, Test 1 is an easier test than Test 2. Finally, as a structural constraint we assume mastery status can only stay the same or increase from pre- to posttest; thus, the only attribute mastery transition patterns permitted across the two time points are nonmastery → nonmastery, nonmastery → mastery, and mastery → mastery.
We fit both the DINA and FOHM models to the simulated data to illustrate the potential for an interpretational confounding due to the presence of skill attribute continuity and the added structural constraint of the FOHM model. Under the DINA calibration, we used the data for Test 1 (including Group 1 at pretest, Group 2 at posttest) in one calibration and the data for Test 2 (including Group 2 at pretest, Group 1 at posttest) in a second calibration. In the FOHM calibration, we fit the data for both groups and tests in a single calibration, also applying the structural constraint of monotone (nondecreasing) change in mastery status described above. A comparison of results across the DINA and FOHM analyses thus allows us to examine the effects of the structural constraint added by the FOHM model.
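A minimal R sketch of the data generation for this design follows. The item difficulties and growth pattern are those described above; the seed and the assumed $N(0, 1)$ pretest distribution are illustrative choices:

```r
set.seed(2)
N  <- 10000                              # examinees per group
b1 <- seq(-1, 0, by = 0.2)               # Test 1 difficulties (easier)
b2 <- seq( 0, 1, by = 0.2)               # Test 2 difficulties (harder)
gen <- function(theta, b)                # Rasch response generator
  sapply(b, function(bj) rbinom(length(theta), 1, plogis(theta - bj)))
growth <- rep(c(1, 0), each = N / 2)     # half gain one unit, half none
theta1 <- rnorm(N); theta2 <- rnorm(N)   # pretest proficiencies per group
g1_pre  <- gen(theta1,          b1)      # Group 1: Test 1 at pretest
g1_post <- gen(theta1 + growth, b2)      #          Test 2 at posttest
g2_pre  <- gen(theta2,          b2)      # Group 2: Test 2 at pretest
g2_post <- gen(theta2 + growth, b1)      #          Test 1 at posttest
```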
Table 3 shows the proportions of estimated masters across these calibrations, broken down by model type (DINA vs. FOHM) and time point (1, 2). From the DINA analysis, we observe the expected effects of form difficulty. Group 1, administered the easier form at Time 1, shows approximately 9% more masters at Time 1 than Group 2 (despite the random equivalence of the groups), but approximately 9% fewer masters at Time 2 than Group 2. Both differences are naturally explained by the difficulty differences between forms, and by how those difficulty differences lead to distinct mastery thresholds being applied to the continuous proficiency. If the intent were to study change in skill proficiency from Time 1 to Time 2, Group 1 would clearly show far smaller gains, despite the equal amounts of growth actually simulated across groups. Of course, these differences conform to the same patterns seen in the first simulation illustration.
Table 3.
Proportion of Masters for Each Group at Each Time Point.
| Group | DINA: Time Point 1 | DINA: Time Point 2 | FOHM: Time Point 1 | FOHM: Time Point 2 |
|---|---|---|---|---|
| 1 | 0.3978 | 0.4593 | 0.3595 | 0.5429 |
| 2 | 0.3081 | 0.5488 | 0.3710 | 0.5929 |
Of greater interest in the current illustration is the consequence of the added structural constraint. From Table 3, we see that the FOHM helps equalize form as well as administration order differences, clearly a positive outcome. However, at Time 2, we also see substantially higher percentages of masters in both groups relative to what was observed for each in the DINA analyses, an indication of interpretational confounding. As in the previous simulation, evidence of the interpretational confounding of the attributes may also be seen in changes to the item parameter estimates. Table 4 shows the estimated slip and guess parameters for the two tests under the DINA and FOHM analyses. Paralleling the previous simulation, the parameters of the same items change across calibrations: Test 1's slip estimates increase when calibrated with the more difficult items (Test 2) under the FOHM, while Test 2's slip estimates decrease when calibrated with the easier items (Test 1) under the constraint. The opposite effect is again seen in the guess estimates. We note again that while the changes in the slip and guess estimates across calibrations appear quite small, the changes in the attribute metric appear rather substantial. Again, as the calibrations all use the exact same data, these differences should be viewed as unrelated to random sampling error.
Table 4.
Slip and Guess Estimates With (First-Order Hidden Markov) or Without (Deterministic-Input-Noisy-And) the Monotonicity Constraint.
| Test | Item | FOHM: Slip ($\hat{s}_j$) | FOHM: Guess ($\hat{g}_j$) | DINA: Slip ($\hat{s}_j$) | DINA: Guess ($\hat{g}_j$) |
|---|---|---|---|---|---|
| Test 1 | 1 | 0.164 | 0.428 | 0.141 | 0.465 |
| | 2 | 0.202 | 0.389 | 0.177 | 0.425 |
| | 3 | 0.237 | 0.334 | 0.208 | 0.368 |
| | 4 | 0.278 | 0.301 | 0.250 | 0.335 |
| | 5 | 0.314 | 0.258 | 0.281 | 0.289 |
| | 6 | 0.359 | 0.220 | 0.326 | 0.250 |
| Test 2 | 7 | 0.279 | 0.294 | 0.322 | 0.261 |
| | 8 | 0.313 | 0.249 | 0.358 | 0.216 |
| | 9 | 0.354 | 0.210 | 0.402 | 0.180 |
| | 10 | 0.408 | 0.184 | 0.456 | 0.158 |
| | 11 | 0.445 | 0.161 | 0.491 | 0.135 |
| | 12 | 0.505 | 0.135 | 0.551 | 0.116 |
Taken together, it seems apparent that systematic changes in slip and guess estimates are often interpretable manifestations of changes in the latent attribute metrics, a result consistent with the cross-sectional analyses of Bolt and Kim (2018). While this is of theoretical interest, most CDM applications entail multiple attributes, wherein such evaluations may become more complicated.
Real Data Study
Unfortunately, the capacity to understand the nature of interpretational confounding through violations of item slip and guess parameter invariance may be less systematic when multiple attributes are involved. We consider a real data illustration that incorporates the design elements considered in both simulations to demonstrate the potential to explore skill attribute interpretational confoundings in real data contexts. As in the previous simulation illustration, the design is one in which different test forms are used across time. However, the design is more complex in several respects: (1) the tests involve more than one (specifically, four) attributes; (2) more than two time points (specifically, five) are considered; (3) the Q-matrix structure involves items measuring more than one attribute; and (4) five different test forms are involved. As a result of this added complexity, it becomes more difficult to isolate the sources of violations of invariance in the same systematic way they could be studied in the simulation analyses.
The dataset comes from a spatial reasoning test program described in detail by Wang et al. (2018). As reported in Wang et al., the forms for this study were developed in accord with the Revised Purdue Spatial Visualization Test (PSVT-R; Yoon, 2011); the study itself incorporated a training tool to facilitate the learning of rotation skills over time. Five test forms were developed, each containing 10 items; the five forms were administered in varying orders across randomly defined student groups, as shown in Table 5. Each of the test forms is assumed to measure the same four spatial rotation skill attributes: (1) a 90° x-axis rotation, (2) a 90° y-axis rotation, (3) a 180° x-axis rotation, and (4) a 180° y-axis rotation. Students were also randomly assigned to two types of learning interventions that provided feedback on their responses to the previous test form, although this manipulation is not a focus in the current analysis.
Table 5.
Test Administration Design, Chen et al. (2018) Study.
| Test Order | Time Point 1 | Time Point 2 | Time Point 3 | Time Point 4 | Time Point 5 |
|---|---|---|---|---|---|
| 1 | Test form 1 | Test form 2 | Test form 3 | Test form 4 | Test form 5 |
| 2 | Test form 2 | Test form 3 | Test form 4 | Test form 5 | Test form 1 |
| 3 | Test form 3 | Test form 4 | Test form 5 | Test form 1 | Test form 2 |
| 4 | Test form 4 | Test form 5 | Test form 1 | Test form 2 | Test form 3 |
| 5 | Test form 5 | Test form 1 | Test form 2 | Test form 3 | Test form 4 |
We compare results from (1) a DINA analysis applied to the data collected for each form (across all five time points), and (2) a joint FOHM analysis (as presented by Chen et al., 2018) applied simultaneously to data from all forms across all time points, with the structural transition constraint applied. We evaluate both item and person parameter invariance as a basis for studying the potential for an interpretational confounding in the four skill attributes measured by the tests over time.
A first consideration in our analysis addresses the practical reality of skill continuity in the four attributes measured by the forms. We evaluated this likelihood following the same general approach as presented in Bolt and Kim (2018). Specifically, we sought to evaluate whether the slip and guess estimates demonstrate systematic violations of invariance when calibrated separately for low and high ability respondents. Unlike Bolt and Kim (2018), however, in the current design we can define these ability groups using data external to the studied test form administration itself. Specifically, in defining the low and high ability groups, we used test performance on the administration immediately before the studied test administration (except for the first time point, in which case the test performance at the second time point was used). For each form, we considered separate calibrations for students who scored below the 65th percentile and those who scored above the 35th percentile on that adjacent test performance, thus creating samples of approximately 200 lower proficiency and 200 higher proficiency students, respectively. We then calibrated the DINA separately for each of these two groups in relation to the studied form, as well as for the combined groups.
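A sketch of this invariance check in R is shown below. The objects `form_resp`, `adj_score`, and `Q` are hypothetical stand-ins for one form's response matrix, the adjacent-administration total score, and the form's Q-matrix; synthetic versions are generated here only so the sketch runs end to end.

```r
library(CDM)
set.seed(3)
# Hypothetical stand-ins for the real data objects
Q <- matrix(sample(0:1, 10 * 4, replace = TRUE, prob = c(0.6, 0.4)), 10, 4)
Q[rowSums(Q) == 0, 1] <- 1                     # every item needs an attribute
form_resp <- matrix(rbinom(400 * 10, 1, 0.6), 400, 10)
adj_score <- rowSums(form_resp) + rpois(400, 2)
low  <- adj_score < quantile(adj_score, 0.65)  # lower proficiency group
high <- adj_score > quantile(adj_score, 0.35)  # higher proficiency group
fit_low  <- din(form_resp[low, ],  q.matrix = Q)
fit_high <- din(form_resp[high, ], q.matrix = Q)
fit_all  <- din(form_resp,         q.matrix = Q)
# With real data exhibiting skill continuity, fit_high should tend to show
# higher guess and/or lower slip estimates than fit_low, with fit_all between.
```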
Supplementary Appendix B shows the results observed from the DINA analyses for each of the five test forms. Following Bolt and Kim (2018) and the illustration in Figure 1, in the presence of skill continuity we expect to see consistently higher guesses and/or lower slips in the higher ability groups compared to the lower ability groups, with the total sample calibration yielding estimates in between. Indeed, across all five forms, this is the pattern of results consistently seen (the relatively few violations of this pattern are highlighted for each form). As occurred for the fraction subtraction dataset, we thus have good reason to believe that skill continuity is present in the measured skills for these data.
Tables 6 and 7 allow us to compare item and examinee parameter estimates between the DINA and FOHM analyses when applied to the exact same response data. As our intent is to explore the potential for an interpretational confounding of skill attributes across the DINA and FOHM analyses, we focus on the results by test form (pooling across the five time points) in order to understand how the meaning of attribute mastery will have changed. Admittedly the conditions in this analysis are more complex than in our simulations, both because (1) the Q-matrix is more complex, with many items measuring multiple attributes, and (2) the conditions studied separately in the two simulation illustrations (namely, the joint calibration of test forms and the structural constraint of monotone attribute mastery change) occur simultaneously here. To the extent that this application reflects a real-world application of longitudinal CDMs, it should give a more realistic sense of the likely magnitude of interpretational confounding that may occur in practice.
Table 6.
Deterministic-Input-Noisy-And and First-Order Hidden Markov Estimated Mastery Proportions for Each Skill and Each Test Form.
| Skill | DINA: Form 1 | DINA: Form 2 | DINA: Form 3 | DINA: Form 4 | DINA: Form 5 | FOHM: Form 1 | FOHM: Form 2 | FOHM: Form 3 | FOHM: Form 4 | FOHM: Form 5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.77 | 0.71 | 0.89 | 0.74 | 0.54 | 0.63 | 0.62 | 0.63 | 0.62 | 0.63 |
| 2 | 0.70 | 0.67 | 0.64 | 0.62 | 0.79 | 0.69 | 0.67 | 0.67 | 0.67 | 0.69 |
| 3 | 0.81 | 0.77 | 0.69 | 0.74 | 0.46 | 0.61 | 0.61 | 0.59 | 0.57 | 0.60 |
| 4 | 0.71 | 0.76 | 0.61 | 0.65 | 0.53 | 0.59 | 0.57 | 0.57 | 0.58 | 0.58 |
Table 7.
Item Parameter Estimates for the Five Test Forms Under Deterministic-Input-Noisy-And (Separate Calibrations) and First-Order Hidden Markov (Joint Calibration With Structural Transition Constraints).
| Form | Item | DINA: Slip ($\hat{s}_j$) | DINA: Guess ($\hat{g}_j$) | FOHM: Slip ($\hat{s}_j$) | FOHM: Guess ($\hat{g}_j$) |
|---|---|---|---|---|---|
| Form 1 | 1 | 0.020 | 0.728 | 0.023 | 0.811 |
| | 2 | 0.028 | 0.730 | 0.037 | 0.749 |
| | 3 | 0.124 | 0.462 | 0.109 | 0.562 |
| | 4 | 0.061 | 0.437 | 0.027 | 0.618 |
| | 5 | 0.171 | 0.441 | 0.136 | 0.487 |
| | 6 | 0.547 | 0.177 | 0.519 | 0.229 |
| | 7 | 0.090 | 0.566 | 0.058 | 0.601 |
| | 8 | 0.220 | 0.318 | 0.159 | 0.425 |
| | 9 | 0.010 | 0.706 | 0.017 | 0.848 |
| | 10 | 0.177 | 0.543 | 0.140 | 0.578 |
| Form 2 | 11 | 0.063 | 0.466 | 0.054 | 0.620 |
| | 12 | 0.041 | 0.744 | 0.056 | 0.754 |
| | 13 | 0.057 | 0.301 | 0.027 | 0.527 |
| | 14 | 0.031 | 0.781 | 0.040 | 0.819 |
| | 15 | 0.212 | 0.243 | 0.181 | 0.315 |
| | 16 | 0.091 | 0.489 | 0.122 | 0.547 |
| | 17 | 0.094 | 0.308 | 0.112 | 0.324 |
| | 18 | 0.050 | 0.449 | 0.056 | 0.639 |
| | 19 | 0.196 | 0.371 | 0.155 | 0.325 |
| | 20 | 0.170 | 0.308 | 0.132 | 0.370 |
| Form 3 | 21 | 0.041 | 0.650 | 0.059 | 0.645 |
| | 22 | 0.060 | 0.501 | 0.063 | 0.533 |
| | 23 | 0.089 | 0.282 | 0.115 | 0.446 |
| | 24 | 0.161 | 0.451 | 0.117 | 0.433 |
| | 25 | 0.026 | 0.523 | 0.018 | 0.805 |
| | 26 | 0.169 | 0.303 | 0.174 | 0.323 |
| | 27 | 0.143 | 0.477 | 0.176 | 0.553 |
| | 28 | 0.006 | 0.644 | 0.018 | 0.883 |
| | 29 | 0.026 | 0.799 | 0.041 | 0.794 |
| | 30 | 0.315 | 0.358 | 0.259 | 0.357 |
| Form 4 | 31 | 0.070 | 0.485 | 0.098 | 0.461 |
| | 32 | 0.006 | 0.296 | 0.031 | 0.522 |
| | 33 | 0.018 | 0.575 | 0.027 | 0.723 |
| | 34 | 0.173 | 0.612 | 0.151 | 0.613 |
| | 35 | 0.089 | 0.373 | 0.123 | 0.384 |
| | 36 | 0.167 | 0.320 | 0.141 | 0.320 |
| | 37 | 0.377 | 0.287 | 0.379 | 0.327 |
| | 38 | 0.055 | 0.787 | 0.056 | 0.751 |
| | 39 | 0.127 | 0.177 | 0.165 | 0.345 |
| | 40 | 0.124 | 0.508 | 0.135 | 0.500 |
| Form 5 | 41 | 0.000 | 0.557 | 0.055 | 0.569 |
| | 42 | 0.056 | 0.538 | 0.110 | 0.468 |
| | 43 | 0.056 | 0.830 | 0.056 | 0.785 |
| | 44 | 0.000 | 0.425 | 0.027 | 0.659 |
| | 45 | 0.103 | 0.490 | 0.123 | 0.461 |
| | 46 | 0.020 | 0.496 | 0.059 | 0.468 |
| | 47 | 0.180 | 0.316 | 0.230 | 0.287 |
| | 48 | 0.428 | 0.317 | 0.447 | 0.301 |
| | 49 | 0.020 | 0.623 | 0.018 | 0.727 |
| | 50 | 0.029 | 0.504 | 0.069 | 0.479 |
Table 6 displays the attribute mastery proportions observed under the DINA and FOHM analyses, while Table 7 shows the corresponding slip and guess estimates by form. From Table 6, we observe in the DINA analysis rather substantial differences in the proportion of masters for each of the four attributes across forms. Recalling that the samples administered each form are randomly equivalent, this observation alone is noteworthy, as it would appear reflective of the issue demonstrated in Simulation 1, namely that the forms may differ in the difficulty of items measuring certain attributes, and that these differences manifest in different empirically defined cutpoints for attribute mastery. We see the differences most notably for skill attributes 1 (with mastery proportions ranging from 0.54 on Form 5 to 0.89 on Form 3) and 3 (with mastery proportions ranging from 0.46 on Form 5 to 0.81 on Form 1). Under the FOHM analysis, these form differences are substantially diminished. While at first blush this may seem encouraging, it is also apparent that the FOHM result is not a true compromise. Specifically, the mastery proportions are on average substantially lower under FOHM than under DINA. Such a change can be attributed at least in part to the structural constraint applied within the FOHM. Specifically, in order to make the growth in each attribute monotone, it appears generally necessary to raise the threshold for attribute mastery under FOHM. Note that for certain individual forms the change is quite substantial: for Form 3's measurement of attribute 1, the proportion of masters changes from 0.89 to 0.63 from DINA to FOHM, while for Form 1's measurement of attribute 3, the proportion of masters changes from 0.81 to 0.61, despite being based on the same item response data. An example in the reverse direction occurs for attribute 3 on Form 5, which increases from 0.46 under DINA to 0.60 under FOHM.
Supplementary Appendix C displays plots showing the information in Table 6 broken down by the individual time points of measurement. Several things are apparent from these plots. First, the same general pattern as in Table 6 (typically lower mastery proportions under FOHM than under DINA) is seen in the tendency for most of the plotted points (representing attribute × time point combinations) to fall beneath the identity line, implying lower proportions of masters under FOHM than DINA, a result that is consistent for all forms except Form 5. However, the variability related to attribute (as seen also in Table 6), and also to time point, is quite substantial, suggesting that for a given form, DINA and FOHM can apply considerably different mastery thresholds, sometimes showing as much as a 35% difference in attribute mastery proportions.
To confirm corresponding changes in item parameters, we consider the slip and guess estimates in Table 7. Recalling the higher mastery thresholds under FOHM, it is apparent from Table 7 that while the slip estimates tend to be roughly comparable, the guess estimates under FOHM tend to be higher than under DINA. On the whole, a larger proportion of the correct responses observed under the FOHM analysis are attributed to guesses by nonmasters, further confirmation that the threshold for attribute mastery is generally higher under FOHM than under DINA.
We can also connect some of the more substantial differences in item parameters seen in Table 7 to differences seen for individual attributes in Table 6. We again consider in particular the results seen for attributes 1 and 3. Based on the estimated attribute mastery proportions from the DINA analysis, for attribute 1, test form 3 is the easiest form, while for attribute 3, test form 1 is the easiest and test form 5 the most difficult. Each of these conditions yields substantially different proportions of masters under the DINA and FOHM analyses. As expected, when the forms are calibrated together under FOHM, the items from these forms show noticeable violations of parameter invariance, particularly those items associated with the attribute showing the greatest differences in mastery proportion across calibrations, as seen in Table 7. For test form 1's items 4, 6, 9, and 10, each of which measures attribute 3, we generally see lower slips and higher guesses under FOHM (highlighted in Table 7). For test form 3, we observe the largest changes in the guess estimates for items 25 and 28 (the fifth and eighth items of the form), both of which increase substantially under FOHM. Meanwhile, the most difficult form for attribute 3 (Form 5) shows the reverse result for those items measuring attribute 3, with higher slips and lower guesses under FOHM. For the associated attributes (attributes 1 and 3), as the guess estimates decrease and the slip estimates increase or remain unchanged, the FOHM shows a larger mastery proportion on this form than does DINA. Thus, although the ability to systematically interpret changes in slip and guess estimates is more complex due to the features noted earlier, we generally see patterns of item parameter change under DINA and FOHM that are consistent with the changes in mastery proportions under each calibration type.
Discussion
This paper explores issues related to the interpretation of attribute mastery in CDMs. Most CDM applications assume binary skill attributes, despite the likely presence of some degree of skill continuity. In such contexts, it should naturally be asked: what level of the underlying skill attribute corresponds to “mastery” of the skill attribute? We argue that such considerations are important not only in helping users of CDMs understand what it means to be a skill attribute master, but also in appreciating the potential for interpretational confoundings of the skill attributes, especially as related to the unintentional application of attribute mastery thresholds other than those intended by the practitioner.
In the presence of skill continuity (and accompanying items associated with different levels of difficulty), it seems inevitable that definitions of skill mastery will often be data-driven, and may in many instances be viewed as defining a threshold of attribute mastery with respect to an underlying continuum. Under such circumstances, it becomes important to consider the factors or conditions that may influence the setting of such thresholds, or at least to attend to the possibility that different thresholds are being set in calibrations that are intended to yield comparable metrics. Bolt and Kim (2018) demonstrated one set of conditions where a confounding is induced in the presence of skill continuity, namely when subpopulations with different proficiency distributions are separately calibrated. This paper extends their findings to conditions involving a single population, but where we (1) change the collection of items (add/withdraw items) being jointly calibrated, or (2) constrain the nature of change over time (e.g., only allowing nondecreasing change). Thus, while the underlying cause of the confounding is the same as in Bolt and Kim (2018), namely skill continuity, the conditions that produce the confounding are different.
As noted, our primary intention in this paper is to highlight the real potential for interpretational confounding in the latent attributes across different measurement conditions. Unlike traditional IRT and factor analytic models, the item parameters of CDMs are not on the same metric as the skill attributes, making it difficult to appreciate the extent to which changes in item parameters are associated with changes in the underlying skill attributes. Our two simulations show that even seemingly small changes in slip/guess estimates can be associated with substantial changes in the meaning of attribute mastery. Further, the real data comparison of DINA versus FOHM analyses applied to the exact same response datasets is suggestive of considerable differences in the meaning attached to attribute mastery, presumably due both to changes in the groups of items simultaneously calibrated and to the imposition of structural constraints on the nature of change over time. Briefly stated, in the current real data application, fewer students were in general found to be masters of skill attributes under FOHM compared to DINA, suggesting that the FOHM substantially raised the threshold for attribute mastery. Depending on the attribute and form, these attribute mastery proportions were as much as 26% lower under FOHM than DINA, despite slip and guess estimates under both models that look quite similar. The fact that such changes emerge when applying the FOHM and DINA separately to the exact same data is indicative of an interpretational confounding issue, as opposed to sampling differences.
A natural question is what to do about this problem when it is encountered. One possibility is to transition to a (M)IRT model. Under such a framework it becomes possible to address difficulty differences across forms, as linking/equating methods can be applied with (M)IRT. As demonstrated in Bolt (2019), creating binary classifications based on MIRT estimates not only addresses concerns of threshold inconsistency in CDMs, but has other benefits as well, such as reducing the occurrence of Hooker's paradox (Hooker et al., 2009). One concern in using MIRT could arise if the latent proficiency distribution actually is discrete. This issue was investigated by Huang and Bolt (2021) within the context of measuring growth, where the relative robustness of CDMs and (M)IRT was evaluated and misspecification was simulated in both directions. They found that the effect of misspecification appears much more consequential when fitting a CDM to continuous trait(s) than the reverse.
Our intention in this study is neither to diminish the potential value of CDMs nor to advocate (M)IRT as a panacea, but instead to suggest that one needs to carefully consider the actual discreteness of the latent construct distribution. One approach is to attend to item parameter invariance. This approach can be easily generalized to other CDMs, and provides a straightforward way of evaluating latent discreteness when one is unsure of the distribution of the latent trait(s). In the current context, we find that even seemingly small changes in item parameters can reflect sizeable changes in the meaning of the latent traits. The extent to which this is observed in other models seems worthy of further exploration.
Supplemental Material
Supplemental Material, sj-pdf-1-apm-10.1177_01466216221084207 for The Potential for Interpretational Confounding in Cognitive Diagnosis Models by Qi (Helen) Huang and Daniel M. Bolt in Applied Psychological Measurement
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material: Supplemental material for this article is available online.
ORCID iDs
Qi (Helen) Huang https://orcid.org/0000-0002-8091-5993
Daniel M. Bolt https://orcid.org/0000-0001-7593-4439
References
- Bainter S. A., Bollen K. A. (2014). Interpretational confoundings or confounded interpretations of causal indicators? Measurement: Interdisciplinary Research and Perspectives, 12(4), 125–140. 10.1080/15366367.2014.968503
- Bolt D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141. 10.1207/s15324818ame1502_01
- Bolt D. M. (2019). Bifactor MIRT as an appealing and related alternative to CDMs in the presence of skill attribute continuity. In von Davier M., Lee Y.-S. (Eds.), Handbook of diagnostic classification models (pp. 395–417). Springer. 10.1007/978-3-030-05584-4_19
- Bolt D. M., Kim J.-S. (2018). Parameter invariance and skill attribute continuity in the DINA model. Journal of Educational Measurement, 55(2), 264–280. 10.1111/jedm.12175
- Burt R. S. (1976). Interpretational confounding of unobserved variables in structural equation models. Sociological Methods & Research, 5(1), 3–52. 10.1177/004912417600500101
- Chen Y., Culpepper S. A., Wang S., Douglas J. (2018). A hidden Markov model for learning trajectories in cognitive diagnosis with application to spatial rotation skills. Applied Psychological Measurement, 42(1), 5–23. 10.1177/0146621617721250
- Hooker G., Finkelman M., Schwartzman A. (2009). Paradoxical results in multidimensional item response theory. Psychometrika, 74(3), 419–442. 10.1007/s11336-009-9111-6
- Huang H.-Y. (2017). Multilevel cognitive diagnosis models for assessing changes in latent attributes. Journal of Educational Measurement, 54(4), 440–480. 10.1111/jedm.12156
- Huang Q., Bolt D. M. (2021). Relative robustness of CDMs and (M)IRT in measuring growth in latent skills [Paper presentation]. 2021 Annual Conference of the National Council on Measurement in Education, June 8–11, 2021.
- Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258–272. 10.1177/01466210122032064
- Kaya Y., Leite W. L. (2017). Assessing change in latent skills across time with longitudinal cognitive diagnosis modeling: An evaluation of model performance. Educational and Psychological Measurement, 77(3), 369–388. 10.1177/0013164416659314
- Kline R. B. (2015). Principles and practice of structural equation modeling. Guilford Press.
- Li F., Cohen A., Bottge B., Templin J. (2016). A latent transition analysis model for assessing change in cognitive skills. Educational and Psychological Measurement, 76(2), 181–204. 10.1177/0013164415588946
- Madison M. J., Bradshaw L. P. (2018). Assessing growth in a diagnostic classification model framework. Psychometrika, 83(4), 963–990. 10.1007/s11336-018-9638-5
- McArdle J. J. (2007). Five steps in the structural factor analysis of longitudinal data. In Cudeck R., MacCallum R. C. (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 99–130). Lawrence Erlbaum Associates.
- Meredith W., Horn J. (2001). The role of factorial invariance in modeling growth and change. In Sayer A. G., Collins L. M. (Eds.), New methods for the analysis of change (pp. 201–240). American Psychological Association. 10.1037/10409-007
- Millsap R. E., Cham H. (2012). Investigating factorial invariance in longitudinal data. In Laursen B., Little T. D., Card N. A. (Eds.), Handbook of developmental research methods (pp. 109–127). Guilford Press.
- Millsap R. E., Meredith W. (2007). Factorial invariance: Historical perspectives and new problems. In Cudeck R., MacCallum R. C. (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 131–152). Lawrence Erlbaum Associates.
- Robitzsch A., Kiefer T., George A. C., Ünlü A. (2020). CDM: Cognitive diagnosis modeling. R package version 7.5-15. https://CRAN.R-project.org/package=CDM
- Shepard L., Camilli G., Williams D. M. (1984). Accounting for statistical artifacts in item bias research. Journal of Educational Statistics, 9(2), 93–128. 10.3102/10769986009002093
- Tatsuoka K. K. (1990). Toward an integration of item-response theory and cognitive error diagnosis. In Frederiksen N., Glaser R., Lesgold A., Safto M. G. (Eds.), Monitoring skills and knowledge acquisition (pp. 453–488). Lawrence Erlbaum.
- Wang S., Yang Y., Culpepper S. A., Douglas J. A. (2018). Tracking skill acquisition with cognitive diagnosis models: A higher-order, hidden Markov model with covariates. Journal of Educational and Behavioral Statistics, 43(1), 57–87. 10.3102/1076998617719727
- Yoon S. Y. (2011). Psychometric properties of the revised Purdue spatial visualization test: Visualization of rotations (the revised PSVT-R) [Unpublished doctoral dissertation]. Purdue University.
- Zhan P., Jiao H., Liao D., Li F. (2019). A longitudinal higher-order diagnostic classification model. Journal of Educational and Behavioral Statistics, 44(3), 251–281. 10.3102/1076998619827593