Reducing the Misclassification Costs of Cognitive Diagnosis Computerized Adaptive Testing: Item Selection With Minimum Expected Risk

Chia-Ling Hsu; Wen-Chung Wang

doi:10.1177/01466216211066610

. 2022 Mar 1;46(3):185–199. doi: 10.1177/01466216211066610

Reducing the Misclassification Costs of Cognitive Diagnosis Computerized Adaptive Testing: Item Selection With Minimum Expected Risk

Chia-Ling Hsu ^1,^2,^✉, Wen-Chung Wang ¹

PMCID: PMC9073635 PMID: 35528270

Abstract

Cognitive diagnosis computerized adaptive testing (CD-CAT) aims to identify each examinee’s strengths and weaknesses on latent attributes for appropriate classification into an attribute profile. As the cost of a CD-CAT misclassification differs across user needs (e.g., remedial program vs. scholarship eligibilities), item selection can incorporate such costs to improve measurement efficiency. This study proposes such a method, minimum expected risk (MER), based on Bayesian decision theory. According to simulations, using MER to identify examinees with no mastery (MER-U0) or full mastery (MER-U1) showed greater classification accuracy and efficiency than other methods for these attribute profiles, especially for shorter tests or low quality item banks. For other attribute profiles, regardless of item quality or termination criterion, MER methods, modified posterior-weighted Kullback–Leibler information (MPWKL), posterior-weighted CDM discrimination index (PWCDI), and Shannon entropy (SHE) performed similarly and outperformed posterior-weighted attribute-level CDM discrimination index (PWACDI) in classification accuracy and test efficiency, especially on short tests. MER with a zero-one loss function, MER-U0, MER-U1, and PWACDI utilized item banks more effectively than the other methods. Overall, these results show the feasibility of using MER in CD-CAT to increase the accuracy for specific attribute profiles to address different user needs.

Keywords: cognitive diagnosis, computerized adaptive testing, item selection, bayesian decision theory, minimum expected cost

Computerized adaptive testing (CAT) administers items tailored to individual examinees’ ability levels—neither too difficult nor too easy—thereby using fewer items than linear tests to obtain high precision (van der Linden & Glas, 2000; Wainer, 2000; Weiss, 1982). Cognitive diagnosis CAT (CD-CAT) identifies an examinee’s mastery (vs. non-mastery) of a set of latent attributes from item responses (e.g., cognitive diagnostic model, CDM, Rupp et al., 2010), enabling teachers, clinicians, and other users of test scores to capitalize on such specific information to better understand their students or clients, and then to adapt their instructions/interventions to help them.

To precisely identify the mastered attributes of an examinee, existing algorithms select appropriate items from a large and well-developed bank. Greater item selection efficiency yields greater measurement precision for the examinee, which is especially important with limited testing time or burdensome testing. However, published item selection methods designed to select items in CD-CAT do not explicitly consider misclassification consequences for the examinees of a target group. To reflect different degrees of misclassification losses during item selection, some methods are weighted by the posterior probabilities of the attribute profiles (e.g., Cheng, 2009; Kaplan et al., 2015; Zheng & Chang, 2016), which ensure selection of specific item(s) to adapt to the estimated, interim attribute profile and/or to all attribute profiles. When applying CD-CAT in real-world tests, however, misclassification loss may differ across testing situations or user needs. For example, consider tests for remedial programs versus scholarships. CD-CAT misclassifies some students with false mastery of an attribute and others with false ignorance of it. False mastery can exclude eligible students from remedial programs and severely hinder their learning, while false ignorance can incorrectly include students who do not need the program, giving them unneeded learning tasks (a less harmful consequence). By contrast, false ignorance can remove scholarships from deserving students, and false mastery can unfairly award scholarships to undeserving students. In both situations, misclassifications outside the extremes have no harmful consequences.

To date, no published study has proposed and tested an item selection method in CD-CAT that maximizes information about an examinee’s attribute profile to minimize the harm of specific misclassifications. Hence, in this study, Bayesian decision theory (BDT) is used to theoretically propose and empirically test a set of minimum expected risk (MER) methods for selecting items.

BDT uses probability and the cost/loss of each classification to quantify the trade-offs among classifications and then selects the one with the lowest expected cost. Applying Bayes’ theorem, BDT (a) converts the prior information of the random variables and their conditional probabilities to form posterior probabilities, (b) adopts a cost structure to specify the total risks and benefits of all possible decisions according to these posterior probabilities; and (c) makes an optimal decision. For computerized classification testing (CCT), past studies showed that BDT-based item selection algorithms were more efficient than the Fisher information method (e.g., Rudner, 2009). Like CCT, CD-CAT also classifies each examinee into a category. Unlike CCT, CD-CAT allows many categories (e.g., 32 attribute patterns for five attributes), so whether BDT performs well in CD-CAT is an open question.

After the CDM model and the item selection algorithms are described, MER is introduced. Then, the simulation studies illustrate the new method’s performance against other methods. Lastly, the implications of this study are discussed for applying BDT to CD-CAT.

The Deterministic Input, Noisy “and” Gate Model

Based on an examinee’s test responses, CDMs seek to identify whether the examinee has mastery (or not) over a set of skills (or tasks) underlying the items in a test. Each skill is commonly called a latent attribute and the status (e.g., mastery or non-mastery) of a set of the latent attributes is an attribute profile (or latent class). For most CDMs, content experts and test developers specify a Q-matrix of attributes measured by each item (Tatsuoka, 1983). Given an item bank with J (j = 1, …, J) items and K (k = 1, …, K) attributes, the element q_jk has a value of one if item j requires attribute k or 0 otherwise.

We used the Deterministic Input, Noisy “and” Gate (DINA) model (Haertel, 1989; Junker & Sijtsma, 2001) to demonstrate the application of the proposed item selection method (leaving other CDMs for future studies). DINA assumes that an examinee must master all the attributes required by an item to produce a correct response. The probability of a correct response of examinee i on item j is represented by

(X_{i j} = 1 | α_{i}) = {(1 - s_{j})}^{η_{i j}} g_{j}^{(1 - η_{i j})}

(1)

where $α_{i} = (α_{i 1}, α_{i 2}, \dots, α_{i K})$ is the mastery status for examinee i (α_ik is 1 if examinee i has mastered attribute k, or 0 otherwise); $η_{i j} = \prod_{k = 1}^{K} α_{i k}^{q_{j k}}$ is 1 when examinee i has mastered all the required attributes of item j, or 0 otherwise; s_j is the slip parameter indicating that an examinee has mastered all the required attributes but answers item j incorrectly; g_j is the guessing parameter indicating that an examinee has not mastered all the required attributes but answers item j correctly.

Item Selection Methods in Cognitive Diagnosis Computerized Adaptive Testing

Researchers derived several item selection methods from discontinuous latent variables for CD-CAT, such as Kullback–Leibler information (KL) and Shannon entropy (SHE); SHE outperformed KL in CD-CAT (Tatsuoka, 2002; Tatsuoka & Ferguson, 2003; Xu et al., 2003). Cheng (2009) weighted KL by the posterior probabilities of the latent classes (posterior-weighted KL, PWKL) and by the distance between the latent classes (hybrid KL, HKL), showing that both methods were superior to KL and comparable with SHE. Also, in CD-CAT of short tests, mutual information (MI) was more efficient than KL, PWKL, and SHE (Wang, 2013). Kaplan et al. (2015) modified PWKL (MPWKL) by replacing the interim estimates with all posterior probabilities to calculate KL. Zheng and Chang (2016) weighted the CDM discrimination index (CDI) and attribute-level CDM discrimination index (ACDI) (Henson & Douglas, 2005; Henson, et al., 2008) with the posterior probabilities of the latent classes to create posterior-weighted CDI (PWCDI) and posterior-weighted CDI (PWACDI). Their results showed that PWCDI, PWACDI and MPWKL were all comparable with or outperformed both MI and PWKL. To evaluate our new item selection method in CD-CAT, we compared its performance with similar competitors that use all latent classes’ posterior probabilities for item selection: MPWKL, PWCDI and PWACDI. Like BDT, SHE uses posterior probabilities for item selection, so SHE was also included.

MPWKL is a modification of PWKL that considers the information of all possible latent classes’ posterior probabilities during CD-CAT. MPWKL information for item j is defined as follows:

M P W K L_{j} = \sum_{d = 1}^{2^{k}} \sum_{c = 1}^{2^{k}} \sum_{x = 0}^{1} P (X_{j} = x | α_{d}) \log (\frac{P (X_{j} = x | α_{d})}{P (X_{j} = x | α_{c})}) π^{(t)} (α_{c}) π^{(t)} (α_{d})

(2)

where $P (X_{j} = x | α_{c})$ and $P (X_{j} = x | α_{d})$ are the probabilities of response x to item j given the latent classes c and d, which together are defined as the DINA model (Equation (1)); $π^{(t)} (α_{c})$ and $π^{(t)} (α_{d})$ are the respective posterior probabilities of latent classes c and d (c, d = 1, 2, …, 2^K), given the responses to t items. An item with the maximum MPWKL information is selected to be administered next.

PWCDI integrates the posterior probability of each latent class into a $2^{K} \times 2^{K}$ posterior-weighted D (PWD) matrix (Henson & Douglas, 2005). The D matrix describes the expected KL information (Cover & Thomas, 1991) between the response distributions for any two latent classes $α_{c}$ and $α_{d}$ (c = 1, 2, …, 2^K, and d = 1, 2, …, 2^K). PWD for item j is defined as follows:

P W D_{j c d} = E_{α_{c}} [π (α_{c}) \times π (α_{d}) \times \log (\frac{P (X_{j} | α_{c})}{P (X_{j} | α_{d})})]

(3)

where $π (α_{c})$ and $π (α_{d})$ are the posterior probabilities of latent classes c and d (c, d = 1, 2, …, 2^K). Thus, PWCDI is defined as follows:

P W C D I_{j} = \frac{1}{\sum_{c \neq d} h {(α_{c}, α_{d})}^{- 1}} \sum_{c \neq d} h {(α_{c}, α_{d})}^{- 1} P W D_{j c d}

(4)

where $h (α_{c}, α_{d})$ is the squared Euclidean distance and is defined as ${\sum_{k = 1}^{K} (α_{c k} - α_{c k}^{'})}^{2}$ . Likewise, PWACDI is defined as follows:

P W A C D I_{j} = \frac{1}{2^{K}} \sum_{a l l r e l e v a n t c e l l s} P W D_{j c d}

(5)

where all relevant cells are the entries for any pair of $h (α_{c}, α_{d})$ ( $c \neq d$ ) with the squared Euclidean distance of 1. An item with the maximum PWCDI or PWACDI is selected to be administered next.

SHE quantifies the uncertainty of the posterior distribution of $\hat{α}$ given previous t responses as follows:

S H E_{j} = \sum_{x = 0}^{1} P (X_{j} = x | x^{(t)}) \sum_{c = 1}^{2^{K}} π (α_{c} | x^{(t)}, X_{j} = x) \log (\frac{1}{π (α_{c} | x^{(t)}, X_{j} = x)})

(6)

where $P (X_{j} = x | x^{(t)})$ is the predictive probability of given responses to t items for item j, and $π (α_{c} | x^{(t)}, X_{j} = x)$ is the posterior probability of latent class c given responses to t items and item j. An item that minimizes Equation (6) is selected to be administered next.

Bayesian Decision Theory in Cognitive Diagnosis Computerized Adaptive Testing

BDT involves a prior, a psychometric model, and a cost structure. A prior $π (α_{0})$ states the information about the distributions of different latent classes of interest; a psychometric model $P (X = x | α)$ characterizes how the data are generated; a cost structure λ indicates the cost of decisions. By using the posterior probabilities with an apt cost structure, BDT aims to minimize the cost (or risk/loss) of decisions by minimizing the probabilities of incorrect decisions and maximizing the probabilities of correct decisions. In CD-CAT, the minimum expected risk (MER) method is defined as follows:

M E R_{j} = \sum_{x = 0}^{1} P (X_{j} = x | x^{(t)}) \sum_{c = 1}^{2^{K}} π (α_{c} | x^{(t)}, X_{j} = x) \sum_{d = 1}^{2^{K}} λ_{d c} π (α_{d} | x^{(t)}, X_{j} = x)

(7)

where λ_dc is the cost of misclassifying an examinee into the wrong latent class d rather than the correct latent class c. An item with the lowest MER is selected to be administered next.

To minimize the expected risk of decisions, MER uses the appropriate weights (the cost structure) associated with correct and incorrect decisions in order of importance. A target latent class typically has the largest weight. As test users are primarily concerned about the risks of incorrect decisions, this study focuses on how different cost structures of incorrect decisions affect item selection. Without loss of generality, correct decision weights are set to 0: $λ_{d c} = 0$ for $\forall d = c$ , and $λ_{d c} \neq 0$ for $\forall d \neq c$ . MER is also applicable to scenarios in which the weights of correct decisions are not zero.

Two cost structures were used to illustrate scenarios with identical versus different risks of incorrect decisions. When incorrect decisions’ losses are identical, a cost structure $λ$ with a zero-one loss function is commonly used:

λ_{d c} = {\begin{cases} 0 d = c \\ 1 d \neq c \end{cases}

(8)

Compared to other misclassifications, misclassifications of the target latent classes are often more costly, so decisions regarding them can receive larger weights. Specifically, the greater the differences between the classifications and the true latent classes, the higher the risk of an incorrect decision. Euclidean distance is a common measure for identifying relations between any two latent classes (e.g., Henson & Douglas, 2005); thus, the cost structure $λ$ can be defined as:

λ_{d c} = \sqrt{\sum_{k = 1}^{K} {(α_{d k} - α_{c k})}^{2}}

(9)

Latent classes’ posterior probabilities in MER are further weighted by their distances to the target latent class; those with longer distances have larger weights to reflect greater misclassification costs.

The Relationship Between Minimum Expected Risk and Existing Methods

As shown in Equations (6) and (7), MER resembles SHE, but their weights differ: $\sum_{d = 1}^{2^{K}} λ_{d c} π (α_{d} | x^{(t)}, X_{j} = x)$ for MER, and $\log (\frac{1}{π (α_{c} | x^{(t)}, X_{j} = x)})$ for SHE. MER’s weight has two components: a pre-specified, constant, cost structure and a typically larger posterior, continually updated during item selection. Using equation (8) to define the λ-parameters, the weight can be rewritten as $\sum_{d = 1, d \neq c}^{2^{K}} π (α_{d} | x^{(t)}, X_{j} = x)$ and reformed as $1 - π (α_{c} | x^{(t)}, X_{j} = x)$ . As with SHE, the higher the posterior distribution of latent class c, the more concentrated the posterior distribution of $\hat{α}$ , and the lower the value $1 - π (α_{c} | x^{(t)}, X_{j} = x)$ in MER. As the precisions of the latent classes’ posterior probabilities substantially influence both MER and SHE, they might perform similarly. Specifically, estimates of latent classes with less uncertainty provide more information for classifying the latent classes.

For dissimilar latent classes (i.e., $\forall d \neq c$ ), the magnitude of ${(α_{d k} - α_{c k})}^{2}$ in equation (9) equals or exceeds that of equation (8). MER increases costs for latent classes that differ more from the target latent class and decreases costs for latent classes that resemble the target class more closely. This differentiation might accelerate the process for selecting optimal items for the target latent class.

Moreover, MER aims to minimize the risk of making decisions across the posterior probabilities of all possible latent classes, without relying on a possible latent class’s interim estimate. MPWKL, PWCDI, and PWACDI consider all possible latent classes, so they do not require an interim estimate, and might perform similarly.

Simulation Studies

Two simulation studies compared MER, MPWKL, PWCDI, PWACDI, and SHE. Using the simple, popular DINA model, five attributes were manipulated in the first study, and eight were manipulated in the second study. All of these methods can be directly applied to other CDMs.

Simulation Design

The first study used a simple Q-matrix structure; each item measured at least one attribute, and each attribute was measured by one-fifth of the items. We generated 50,000 examinees, such that each examinee had a 50% chance of mastering each attribute independently. (MER can also be applied to examinees generated from a multivariate normal distribution with correlated attributes (e.g., Henson & Douglas, 2005)). The item bank had 300 items that measured five attributes, yielding 32 latent classes.

Supplementary Tables A1 and A2 in the online supplement present the summary of the generated Q-matrix and examinee population. As the generation procedures for each item measuring an attribute and for each examinee’s mastery of each attribute were independent, the numbers of generated items for measuring an attribute and the numbers of examinees mastering each attribute were uniform. As each attribute was measured by one-fifth of the items, most q-vectors, around 65%, only measured one latent attribute. Both the g- and s-parameters in the DINA model were generated either from U(0.05, 0.15) or U(0.25, 0.35), indicating high or low item quality, respectively.

Eight item selection methods were tested: MPWKL, PWCDI, PWACDI, SHE, and four MER methods. The first MER uses the zero-one loss function (equation (8)) as the cost structure, with equal costs for all misclassifications (MER-Eq). The other three MER methods have unequal costs across misclassifications. The second MER applies the Euclidean distance (Equation (9)) as the weight (MER-Ud). Therefore, the corresponding λ_dc in equation (7) is high when the latent classes d and c differ substantially. The third MER integrates Equations (8) and (9) as the cost structure and the target profile (0,0,0,0,0). This scenario resembles selection of students who did not master any of the five attributes (0,0,0,0,0) for remedial programs. As misclassifying students with this profile can severely harm them, its misclassification cost was much higher than that of other misclassifications. Specifically, the cost increased linearly as the distance between latent classes increased: $\sqrt{\sum_{k = 1}^{K} {(α_{d k} - α_{c k})}^{2}} + 1$ for misclassifying students with profile (0,0,0,0,0) as mastering attribute 1, 2, 3, 4, or 5. For other misclassifications, the cost was 1. This method had unequal costs, and the target was profile (0,0,0,0,0), so it was denoted MER-U0. To help select students for scholarships, the fourth MER resembles MER-U0 but has a target profile of (1,1,1,1,1). The cost for misclassifying these target profile examinees as not mastering attribute 1, 2, 3, 4, or 5 was $\sqrt{\sum_{k = 1}^{K} {(α_{d k} - α_{c k})}^{2}} + 1$ . For other misclassifications, the cost was 1.

Both fixed-length and fixed-precision termination rules were applied. For the fixed-length rule, test length was set at 5, 10, 15, 20, and 30 items; for the fixed-precision rule, the largest posterior probability of latent classes (p_1st, Hsu et al., 2013; Tatsuoka, 2002) was set at .70, .80, .90, and .95. The fixed-precision rule had no maximum test length limit; if the fixed precision was not reached, examinees could receive all items in the item bank. The maximum a posteriori estimate with a uniform prior represented examinees’ latent classes. The first simulation study had 144 conditions: 2 (item quality) × 8 (item selection algorithm) × 9 (termination criterion: five test lengths for the fixed-length rule and four precision levels for the fixed-precision rule) = 144. In all conditions, the same simulated item bank and examinees’ latent classes were used.

The second simulation study resembles the first study except that it uses eight latent attributes (256 latent classes) rather than five, and only 30,000 examinees were generated to avoid excessively onerous computation time. Overall, there were 300 items measuring eight attributes. See the summary of the generated Q-matrix and examinee population for the eight-attribute bank in Supplementary Tables A3 and A4 of the online supplement. As these methods are independent of any CAT component, they can be readily applied to other settings.

Evaluation Criterion

Both simulation studies assess the effectiveness of the CD-CAT via the proportion of examinees whose latent classes were classified correctly (classification accuracy rate, CAR):

C A R = \frac{\sum_{i = 1}^{N} (I_{({\hat{α}}_{i}, α_{i})})}{N}

(10)

where an indicator function $I_{({\hat{α}}_{i}, α_{i})}$ equals 1 if ${\hat{α}}_{i} = α_{i}$ or 0 otherwise, and N is the number of examinees. Also, the test length required to terminate the CD-CAT and the percentage of examinees who took all 300 items in the item bank help evaluate the efficiency of the fixed-precision rule. In addition, the chi-square of the exposure rate helps evaluate efficiency (Chang & Ying, 1996):

χ^{2} (r) = \sum_{j = 1}^{J} \frac{{(r_{j} - \bar{r})}^{2}}{\bar{r}}

(11)

where r_j is the item exposure rate for item j and $\bar{r}$ is the average item exposure rate across items, which is $L / J$ ( $L$ is the test length) and $\bar{L} / J$ ( $\bar{L}$ is the mean test length across examinees) for the fixed-length and fixed-precision rules, respectively. This measure captures the distribution of item exposure rates across items to assess item exposure balance. A smaller $χ^{2}$ value indicates more uniform distribution of exposure rates and more items used from the item bank. Moreover, the execution time of each CD-CAT item selection method helps assess the efficiency.

Results

The Five-Attribute Bank

Figures 1 and 2 display the marginal classification accuracy rates of the eight algorithms with the fixed-length rule for the examinees who mastered 0, 1, 2, 3, 4, and 5 of the attributes in the low and high quality banks, respectively. With the low quality bank, MER-U0 performed the best, followed by MER-Ud, especially for the 10- and 15-item tests. Specifically, the CARs for the 10- and 15-item tests were: MER-U0 (.44, .50), MER-Ud (.39, .47), MER-Eq (.37, .46), MER-U1 (.36, .44), SHE (.32, .45), MPWKL (.31, .45), PWCDI (.31, .44), and PWACDI (.28, .37). Measured by CAR, MER-U0 outperformed other methods on the 10-item test by .05–.16 and on the 15-item test by .03–.13. With the high quality bank, all item selection methods yielded similar CARs for no mastery (0,0,0,0,0), except that PWACDI had a higher CAR for the 5-item test and MER-U1 and PWACDI produced lower CARs for the 10-item test.

Figure 1. — Classification accuracy of examinees mastering 0–5 attributes of the low quality five-attribute bank with the fixed-length rule.

Note. CAR = classification accuracy rate; MPWKL = modification of the PWKL; PWCDI = posterior-weighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

Figure 2. — Classification accuracy of examinees mastering 0 – 5 attributes of the high quality five-attribute bank with the fixed-length rule.

Note. CAR = classification accuracy rate; MPWKL = modification of thePWKL; PWCDI = posterior-weighted CDI; PWACDI =posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

When recovering the full mastery profile (1,1,1,1,1), MER-U1 performed best for test lengths shorter than 15 items with the low quality bank and for the five-item test with the high quality bank. With the low quality bank, MER-U1’s CARs (.28, .46, .61) were higher than other methods by .06–.08 for the 5-item test, .06–.17 for the 10-item test, and .02–.15 for the 15-item test. PWACDI showed the second best performance on the 10-item (CAR = .40) and 15-item (CAR = .59) tests. With the high quality bank, MER-U1’s CAR (.91) substantially exceeded that of other methods by .12–.16 for the 5-item test.

In the recovery of the other latent classes (examinees mastering 1–4 attributes), MPWKL, PWCDI, SHE, MER-Eq, MER-Ud, and MER-U0 yielded similar CARs irrespective of the number of attributes mastered by examinees or item bank quality. They all generally had higher CARs for the examinees who have mastered 1–4 attributes than the PWACDI and MER-U1, especially for the five-item test.

For the same statistics under the fixed-precision rule, see Supplementary Figure A1 for the low quality bank and Figure A2 for the high quality bank (both in the online supplement). As anticipated, MER-U0 and MER-U1 were the most cost-effective in identifying the no mastery and full mastery profiles, respectively. In general, MER-U1 and MER-U0 were more cost-effective than the others for the low quality bank, and MER-U1 was often more cost-effective than MER-U0, in part because the cost specification had more room to improve through MER-U1 in the low quality bank—identifying the examinees with full mastery was more difficult than identifying those with no mastery.

Also, the performances of different item selection methods on all latent classes were evaluated. See the results of the eight item selection methods with the fixed lengths in Table 1 and Figure 3, and with the fixed precisions in Figure 4. For every method with the fixed-length rule, longer tests or higher item quality yielded higher CARs. All methods were comparable, except that PWACDI and MER-U1 slightly underperformed the others for the five-item test. Figure 3 shows the $χ^{2}$ of the exposure rate for all algorithms. In general, irrespective of item quality, MER-Eq, MER-U0, and MER-U1 produced the lowest $χ^{2}$ values. PWACDI also yielded a low $χ^{2}$ with the low-quality bank, whereas MER-Ud yielded the highest $χ^{2}$ , especially when the test had fewer than 10 items. Hence, MER-Eq, MER-U0, MER-U1, and PWACDI mostly used the item bank more effectively than MPWKL, SHE, PWCDI, and MER-Ud. For every method with the fixed-precision rule (Figure 4), greater precision yielded higher CAR, longer tests, and higher percentages of examinees receiving the maximum number of 300 items. Higher quality item bank yielded higher CAR, shorter tests, or smaller percentages of examinees administering the 300 items. For both fixed-length and fixed precision rules, MER-Eq MER-U0, MER-U1, and PWACDI were more cost-effective than MPWKL, SHE, PWCDI, and MER-Ud.

Table 1.

Classification accuracy rate of latent classes of the five-attribute bank with the fixed-length rule.

Item Quality	Test Length	MPWKL	PWCDI	PWACDI	SHE	MER-Eq	MER-Ud	MER-U0	MER-U1
High	5	.75	.75	.57	.75	.75	.74	.75	.57
	10	.92	.92	.90	.92	.92	.92	.92	.91
	15	.99	.99	.98	.99	.99	.99	.99	.98
	20	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
	30	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Low	5	.22	.22	.21	.22	.22	.22	.22	.18
	10	.33	.34	.33	.33	.34	.34	.34	.33
	15	.47	.47	.45	.47	.47	.47	.47	.45
	20	.58	.57	.55	.58	.57	.58	.57	.56
	30	.74	.73	.71	.74	.73	.74	.73	.72

Open in a new tab

Note. MPWKL = modification of the PWKL; PWCDI = posterior-weighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

Figure 3. — Exposure balance of the five-attribute bank with the fixed-length rule.

Note. χ2(r) = chi-square of exposure rate; MPWKL = modification of the PWKL; PWCDI = posterior-weighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

Figure 4. — Summary statistics of the five-attribute bank with the fixed-precision rule.

Note. L = mean test length across examinees; p1st = largest posterior probability of latent classes; χ2(r) = chi-square of exposure rate; MPWKL = modification of the PWKL; PWCDI = posteriorweighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

To sum up, MER-U0 and MER-U1 outperformed the other item selection methods by successfully classifying the no mastery and full mastery profiles respectively, without suffering substantial misclassifications of other attribute profiles. Otherwise, the proposed MER-based methods were comparable to the current well-developed item selection methods with respect to classification accuracy and efficiency.

Furthermore, MER was efficient. For the 30-item test with the high quality bank, the average times for each examinee’s CD-CAT administration were within 31, 56, 59, 23, and 83 milliseconds for MPWKL, PWCDI, PWACDI, SHE and MER, respectively. Unlike the MER methods, MPWKL, PWCDI, and PWACDI pre-calculate the D matrix. Although MER required more computations for its $2^{K} \times 2^{K}$ cost structure posterior matrix during testing, it was within 52 milliseconds of the other methods.

The Eight-Attribute Bank

Supplementary Figures A3–A6 of the online supplement present the marginal classification accuracy rates of the eight algorithms for the examinees who mastered 0, 1, 2, 3, 4, 5, 6, 7, and 8 of the attributes with the eight-attribute bank. MER-U0 and MER-U1 were respectively more cost-effective in identifying no mastery and full mastery profiles, especially with the low quality bank. Table 2 shows the CARs for the eight item selection methods with the fixed-length rule. For every method, longer test lengths and higher item quality yielded larger CARs, similar to the 5-attribute results. All item selection methods yielded similar CARs irrespective of item quality and test length. Generally speaking (as shown in Figure 5), MER-Eq, MER-U0, MER-U1, and PWACDI had more even exposure rates, indicating that they had more effective bank utilization than MPWKL, PWCDI, SHE, and MER-Ud. Similarly, as shown in Figure 6 for the fixed-precision rule, the more conservative the termination criterion, the longer the mean test length, and the larger the percentage of examinees who took all 300 items.

Table 2.

Classification accuracy rate of latent classes of the eight-attribute bank with the fixed-length rule.

Item Quality	Test Length	MPWKL	PWCDI	PWACDI	SHE	MER-Eq	MER-Ud	MER-U0	MER-U1
High	5	.09	.09	.09	.09	.09	.09	.09	.09
	10	.66	.66	.65	.66	.65	.66	.65	.64
	15	.88	.89	.86	.88	.88	.88	.88	.87
	20	.97	.97	.96	.97	.96	.96	.96	.96
	30	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Low	5	.03	.03	.03	.03	.03	.03	.03	.03
	10	.10	.10	.09	.10	.10	.10	.10	.10
	15	.16	.17	.16	.16	.17	.16	.17	.17
	20	.24	.25	.24	.24	.24	.24	.24	.25
	30	.40	.40	.38	.40	.40	.40	.40	.40

Open in a new tab

Figure 5. — Exposure balance of the eight-attribute bank with the fixed-length rule.

Note. χ2(r) = chi-square of exposure rate; MPWKL = modification of the PWKL; PWCDI = posterior-weighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

Figure 6. — Summary statistics of the eight-attribute bank with the fixed-precision rule.

Note. L = mean test length across examinees; p1st largest posterior probability of latent classes; 2(r) = chi-square of exposure rate; MPWKL = modification of the PWKL; PWCDI = posteriorweighted CDI; PWACDI = posterior-weighted attribute-level CDI; SHE = Shannon entropy; MER-Eq = MER method with equal costs; MER-Ud = MER method with Euclidean distance’s costs; MER-U0 = MER method with Euclidean distance’s costs and target profile (0,0,0,0,0); and MER-U1 = MER method with Euclidean distance’s costs and target profile (1,1,1,1,1).

Discussion and Conclusion

Past item selection algorithms for CD-CAT did not account for different harmful consequences across specific misclassifications. As BDT considers the costs of different types of misclassifications, MER for CD-CAT was developed. MER considers both the posterior probabilities of different mastery statuses and the costs of their misclassifications to make an optimal decision with minimum expected risk. Unlike other item selection methods, MER’s flexibility enables users to specify cost structures to improve the classification of the most important target latent class(es) to the users. MER-Eq, MER-Ud, MER-U0, and MER-U1 are four MER examples that illustrate how to set appropriate cost structures to address different test user needs.

We conducted simulations to compare the performances of MER-Eq, MER-Ud, MER-U0, and MER-U1 with those of four competing methods: MPWKL, PWCDI, PWACDI, and SHE. MER-U0 was the most cost-effective for identifying no mastery profiles (e.g., students for remedial programs), and MER-U1 was the most cost-effective for identifying full mastery profiles (e.g., scholarship students). Otherwise, MER-Eq, MER-Ud and MER-U0 performed similarly to MPWKL, PWCDI, SHE, and slightly outperformed PWACDI, especially in the short tests (e.g., five-item) with the five-attribute bank. Moreover, MER-Eq, MER-U0, MER-U1, and PWACDI had better bank utilization. As a result, MER not only increased classification accuracy and efficiency for target profiles (e.g., MER-U0 and MER-U1), but also produced comparable classification accuracy and efficiency for other profiles, compared to other methods. These results suggest the following recommendations for specific goals: (a) for overall test accuracy, use MER-Eq, MER-Ud and MER-U0; (b) for test accuracy of target profiles (e.g., no mastery or full mastery), use MER-U0 or MER-U1; or (c) for both overall test accuracy and efficiency, use MER-Eq or MER-U0.

While we used educational scenarios to illustrate MER, MER can be easily applied to other scenarios with different cost structures. For instance, in oncology or other medical areas and clinical psychology where early intervention is critical, a false test result of no cancer despite actual cancer (false negative) can be fatal, whereas a false positive might simply waste time and resources via additional testing.

Our study has five limitations that future studies can address: few factors/levels, practical constraints, response times, initial item selection, and loss functions. We manipulated only a few factors in the simulations. Future studies can manipulate more factors or levels, such as different CDMs for dichotomous or polytomous items (e.g., Hartz, 2002; Junker & Sijtsma, 2001; Made la Torre, 2016), different Q-matrix structures, more attributes, or different maximum test lengths (when using the minimax rule to stop the CAT). Moreover, MER performance might differ under practical constraints (e.g., item exposure control, test overlap control, or content balancing), so future studies can examine these phenomena. As information technology has enabled recording of response times, they can be incorporated into item selection methods (e.g., Finkelman et al., 2014), so future studies might incorporate response times into MER. As initial item selection can affect CAT efficiency (Xu et al., 2016), future studies can combine MER with optimal initial item selection.

The current item selection methods do not incorporate loss functions directly into the algorithms (e.g., MPWKL or SHE), so future studies might consider doing so and testing their effectiveness, compared to MER. Also, we adopted three loss functions (zero-one, Euclidean distance, simple linear) to aid reader understanding of MER. Future studies can consider more complex loss functions. Also, future studies can identify the conditions under which MER-Ud yields the best item selection. As latent classes that differ more from the target latent class have higher costs, such differentiation accelerates the process for selecting optimal items for the target latent class. Lastly, PWACDI’s performance was equal to MER-U1 in most conditions or slightly better than MER-U1 in some situations. Further investigations can compare their performances to better understand both methods.

Supplemental Material

sj-pdf-1-apm-10.1177_01466216211066610 – Supplemental Material for Reducing the Misclassification Costs of Cognitive Diagnosis Computerized Adaptive Testing: Item Selection With Minimum Expected Risk

Click here for additional data file.^{(1.3MB, pdf)}

Supplemental Material, sj-pdf-1-apm-10.1177_01466216211066610 for Reducing the Misclassification Costs of Cognitive Diagnosis Computerized Adaptive Testing: Item Selection With Minimum Expected Risk by Chia-Ling Hsu, and Wen-Chung Wang in Applied Psychological Measurement

Acknowledgment

The authors thank the Editor and three anonymous reviewers for their constructive comments on this article. In addition, the authors would like to thank Professor Ming Ming CHIU for his valuable and insightful suggestions on this study.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Chia-Ling Hsu https://orcid.org/0000-0002-4267-0980

References

Chang H. H., Ying Z. L. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213–229. 10.1177/014662169602000303 [DOI] [Google Scholar]
Cheng Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74(4), 619–632. 10.1007/s11336-009-9123-2 [DOI] [Google Scholar]
Cover T., Thomas J. (1991). Elements of information theory. Wiley. [Google Scholar]
Finkelman M. D., Kim W., Weissman A., Cook R. J. (2014). Cognitive diagnostic models and computerized adaptive testing: Two new item-selection methods that incorporate response times. Journal of Computerized Adaptive Testing, 2(4), 59–76. 10.7333/1412-0204059 [DOI] [Google Scholar]
Haertel E. H. (1989). Using restricted latent class models to map skill structure of achievement items. Journal of Educational Measurement, 26(4), 301–323. http://www.jstor.org/stable/1434756 [Google Scholar]
Hartz S. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practice. [Unpublished doctoral thesis, University of Illinois at Urbana-Champaign; ]. [Google Scholar]
Henson R., Douglas J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. 10.1177/0146621604272623 [DOI] [Google Scholar]
Henson R., Roussos L., Douglas J., He X. (2008). Cognitive diagnostic attribute-level discrimination indices. Applied Psychological Measurement, 32(4), 275–288. 10.1177/0146621607302478 [DOI] [Google Scholar]
Hsu C.-L., Wang W.-C., Chen S.-Y. (2013). Variable-length computerized adaptive testing based on cognitive diagnosis models. Applied Psychological Measurement, 37(7), 563–582. 10.1177/0146621613488642 [DOI] [Google Scholar]
Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258–272. 10.1177/01466210122032064 [DOI] [Google Scholar]
Kaplan M., de la Torre J., Barrada J. R. (2015). New item selection methods for computerized adaptive testing. Applied Psychological Measurement, 39(3), 167–188. 10.1177/0146621614554650 [DOI] [PMC free article] [PubMed] [Google Scholar]
Ma W., de la Torre J. (2016). A sequential cognitive diagnosis model for polytomous responses. British Journal of Mathematical and Statistical Psychology, 69, 253–275. 10.1111/bmsp.12070 [DOI] [PubMed] [Google Scholar]
Rudner L. M. (2009). Scoring and classification examinees using measurement decision theory. Practical Assessment Research and Evolution, 14(8), 1–11. http://pareonline.net/getvn.asp?v=14&n=8 [Google Scholar]
Rupp A. A., Templin J., Henson R. A. (2010). Diagnostic measurement: Theory, methods, and applications (The statistical structure of core DCMs). Guilford. [Google Scholar]
Tatsouka C. (2002). Data analytic methods for latent partially ordered classification models. Journal of Royal Statistical Society: Series C (Apply Statistics), 51, 337–350. 10.1111/1467-9876.00272 [DOI] [Google Scholar]
Tatsouka C., Ferguson T. (2003). Sequential classification on partially ordered sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1), 143–157. http://www.jstor.org/stable/3088831 [Google Scholar]
Tatsuoka K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354. http://www.jstor.org/stable/1434951 [Google Scholar]
van der Linden W. J., Glas C.A.W. (Eds.), (2000).Computerized adaptive testing: Theory and practice. Kluwer. [Google Scholar]
Wainer H. (Ed.), (2000) Computerized adaptive testing: A primer (2nd Ed.). Erlbaum. [Google Scholar]
Wang C. (2013). Mutual information item selection method in cognitive diagnostic computerized adaptive testing with short test length. Educational and Psychological Measurement, 73(6), 1017–1035. 10.1177/0013164413498256 [DOI] [Google Scholar]
Weiss D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6(4), 473–492. 10.1177/014662168200600408 [DOI] [Google Scholar]
Xu X., Chang H., Douglas J. (2003). A simulation study to compare CAT strategies for cognitive diagnosis. Annual Meeting of the American Education Research Association. [Google Scholar]
Xu G., Wang C., Shang Z. (2016). On initial item selection in cognitive diagnostic computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 69(3), 291–315. 10.1111/bmsp.12072 [DOI] [PubMed] [Google Scholar]
Zheng C., Chang H. H. (2016). High-efficiency response distribution-based item selection algorithms for short-length cognitive diagnostic computerized adaptive testing. Applied Psychological Measurement, 40(8), 608–624. 10.1177/0146621616665196 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Click here for additional data file.^{(1.3MB, pdf)}

[bibr1-01466216211066610] Chang H. H., Ying Z. L. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213–229. 10.1177/014662169602000303 [DOI] [Google Scholar]

[bibr2-01466216211066610] Cheng Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74(4), 619–632. 10.1007/s11336-009-9123-2 [DOI] [Google Scholar]

[bibr3-01466216211066610] Cover T., Thomas J. (1991). Elements of information theory. Wiley. [Google Scholar]

[bibr4-01466216211066610] Finkelman M. D., Kim W., Weissman A., Cook R. J. (2014). Cognitive diagnostic models and computerized adaptive testing: Two new item-selection methods that incorporate response times. Journal of Computerized Adaptive Testing, 2(4), 59–76. 10.7333/1412-0204059 [DOI] [Google Scholar]

[bibr5-01466216211066610] Haertel E. H. (1989). Using restricted latent class models to map skill structure of achievement items. Journal of Educational Measurement, 26(4), 301–323. http://www.jstor.org/stable/1434756 [Google Scholar]

[bibr6-01466216211066610] Hartz S. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practice. [Unpublished doctoral thesis, University of Illinois at Urbana-Champaign; ]. [Google Scholar]

[bibr7-01466216211066610] Henson R., Douglas J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. 10.1177/0146621604272623 [DOI] [Google Scholar]

[bibr8-01466216211066610] Henson R., Roussos L., Douglas J., He X. (2008). Cognitive diagnostic attribute-level discrimination indices. Applied Psychological Measurement, 32(4), 275–288. 10.1177/0146621607302478 [DOI] [Google Scholar]

[bibr9-01466216211066610] Hsu C.-L., Wang W.-C., Chen S.-Y. (2013). Variable-length computerized adaptive testing based on cognitive diagnosis models. Applied Psychological Measurement, 37(7), 563–582. 10.1177/0146621613488642 [DOI] [Google Scholar]

[bibr10-01466216211066610] Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258–272. 10.1177/01466210122032064 [DOI] [Google Scholar]

[bibr11-01466216211066610] Kaplan M., de la Torre J., Barrada J. R. (2015). New item selection methods for computerized adaptive testing. Applied Psychological Measurement, 39(3), 167–188. 10.1177/0146621614554650 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr12-01466216211066610] Ma W., de la Torre J. (2016). A sequential cognitive diagnosis model for polytomous responses. British Journal of Mathematical and Statistical Psychology, 69, 253–275. 10.1111/bmsp.12070 [DOI] [PubMed] [Google Scholar]

[bibr13-01466216211066610] Rudner L. M. (2009). Scoring and classification examinees using measurement decision theory. Practical Assessment Research and Evolution, 14(8), 1–11. http://pareonline.net/getvn.asp?v=14&n=8 [Google Scholar]

[bibr14-01466216211066610] Rupp A. A., Templin J., Henson R. A. (2010). Diagnostic measurement: Theory, methods, and applications (The statistical structure of core DCMs). Guilford. [Google Scholar]

[bibr15-01466216211066610] Tatsouka C. (2002). Data analytic methods for latent partially ordered classification models. Journal of Royal Statistical Society: Series C (Apply Statistics), 51, 337–350. 10.1111/1467-9876.00272 [DOI] [Google Scholar]

[bibr16-01466216211066610] Tatsouka C., Ferguson T. (2003). Sequential classification on partially ordered sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1), 143–157. http://www.jstor.org/stable/3088831 [Google Scholar]

[bibr17-01466216211066610] Tatsuoka K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354. http://www.jstor.org/stable/1434951 [Google Scholar]

[bibr18-01466216211066610] van der Linden W. J., Glas C.A.W. (Eds.), (2000).Computerized adaptive testing: Theory and practice. Kluwer. [Google Scholar]

[bibr19-01466216211066610] Wainer H. (Ed.), (2000) Computerized adaptive testing: A primer (2nd Ed.). Erlbaum. [Google Scholar]

[bibr20-01466216211066610] Wang C. (2013). Mutual information item selection method in cognitive diagnostic computerized adaptive testing with short test length. Educational and Psychological Measurement, 73(6), 1017–1035. 10.1177/0013164413498256 [DOI] [Google Scholar]

[bibr21-01466216211066610] Weiss D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6(4), 473–492. 10.1177/014662168200600408 [DOI] [Google Scholar]

[bibr22-01466216211066610] Xu X., Chang H., Douglas J. (2003). A simulation study to compare CAT strategies for cognitive diagnosis. Annual Meeting of the American Education Research Association. [Google Scholar]

[bibr23-01466216211066610] Xu G., Wang C., Shang Z. (2016). On initial item selection in cognitive diagnostic computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 69(3), 291–315. 10.1111/bmsp.12072 [DOI] [PubMed] [Google Scholar]

[bibr24-01466216211066610] Zheng C., Chang H. H. (2016). High-efficiency response distribution-based item selection algorithms for short-length cognitive diagnostic computerized adaptive testing. Applied Psychological Measurement, 40(8), 608–624. 10.1177/0146621616665196 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Reducing the Misclassification Costs of Cognitive Diagnosis Computerized Adaptive Testing: Item Selection With Minimum Expected Risk

Chia-Ling Hsu

Wen-Chung Wang

Abstract

The Deterministic Input, Noisy “and” Gate Model

Item Selection Methods in Cognitive Diagnosis Computerized Adaptive Testing

Bayesian Decision Theory in Cognitive Diagnosis Computerized Adaptive Testing