Applied Psychological Measurement. 2018 Dec 10;43(7):543-561. doi: 10.1177/0146621618813113

Nonparametric CAT for CD in Educational Settings With Small Samples

Yuan-Pei Chang, Chia-Yi Chiu, Rung-Ching Tsai

Abstract

Cognitive diagnostic computerized adaptive testing (CD-CAT) has been suggested by researchers as a diagnostic tool for assessment and evaluation. Although model-based CD-CAT is relatively well researched in the context of large-scale assessment systems, this type of system has not received the same degree of research and development in small-scale settings, such as at the course-based level, where it would be the most useful. The main obstacle is that the statistical estimation techniques successfully applied within the context of a large-scale assessment require large samples to guarantee reliable calibration of the item parameters and accurate estimation of the examinees’ proficiency class membership; such samples are simply not obtainable in course-based settings. Therefore, this study proposes the nonparametric item selection (NPS) method, which does not require any parameter calibration and thus can be used in small educational programs. The proposed nonparametric CD-CAT uses the nonparametric classification (NPC) method to estimate an examinee’s attribute profile and, based on the examinee’s item responses, selects the item that best discriminates between the estimated attribute profile and the other attribute profiles. The simulation results show that the NPS method outperformed the compared parametric CD-CAT algorithms, and the differences were substantial when the calibration samples were small.

Keywords: cognitive diagnosis, nonparametric classification, computerized adaptive testing, nonparametric item selection


Assessments for cognitive diagnosis (CD) are designed to assess students’ mastery or nonmastery of a set of skills in a knowledge domain, with the goal of providing immediate feedback for further remediation. In the past several years, CD has become a new paradigm of educational testing, and CD-related research has drawn substantial attention. Recently, researchers went one step further to develop large-scale cognitive diagnostic computerized adaptive testing (CD-CAT), which features improved estimation accuracy and testing efficiency. A pioneer is the CD-CAT developed for administering an English language proficiency test to more than 120,000 students in China (Liu, You, Wang, Ding, & Chang, 2013). Another ongoing research program, the Dynamic Learning Maps Alternate Assessment System Consortium, supported by the U.S. Department of Education, plans to develop computerized cognitive-psychometric models to analyze large-scale data collected from students with the most significant cognitive disabilities. These programs confirm and support the view that the development of CD-CAT as a diagnostic tool for assessment and evaluation should be a primary interest in psychological and educational measurement (e.g., Cheng, 2009; Cheng & Chang, 2007; Liu et al., 2013; K. K. Tatsuoka & Tatsuoka, 1997).

In the CD-CAT framework, an effective and efficient item selection algorithm is the key to the success of a CD-CAT program. To date, various parametric item selection methods have been proposed (see the “Technical Background” section for a detailed review). These parametric methods follow a common rationale: they first calibrate the item parameters by fitting a large data set with a cognitive diagnostic model (CDM) and then select the best item for each examinee by maximizing an information function, minimizing the expected Shannon entropy, or maximizing the variability in the probabilities of success of the estimated latent classes.

Although model-based CD-CAT is relatively well researched in the context of large-scale assessment systems, where data for thousands of examinees are available, this type of system has not received the same degree of research and development in small-scale settings, such as at the course-based level, where it would be the most useful. There are at least two difficulties in adapting the current model-based CD-CAT to a small-scale setting. First, the statistical estimation techniques that are successfully applied within the context of a large-scale assessment require large samples to guarantee reliable calibration of the item parameters and accurate estimation of the examinees’ proficiency class membership. Such samples are simply not obtainable in course-based settings. In most, if not all, published CD-CAT papers on the topic of parametric item selection methods, the true item parameters are used in simulations without calibration to assess the performance of the methods, and thus no error from item parameter calibration is taken into account (Cheng, 2009; Kaplan, de la Torre, & Barrada, 2015). This setup is unrealistic, particularly in a small-scale setting, where the calibration of the item parameters could be unstable and the resulting estimates therefore unreliable. Second, item quality is a built-in criterion for these parametric methods, and only items of the best quality in terms of guessing and slipping parameters are selected. Such a routine guarantees optimal estimation accuracy but results in high exposure rates for certain items and consequently sacrifices test security. The trade-off between estimation accuracy and test security is well known (Ponsoda & Olea, 2003; Zheng & Wang, 2017); however, in the studies where these parametric methods were developed, the estimation accuracy was evaluated without controlling the item exposure (Cheng, 2009; Kaplan et al., 2015), and therefore the findings did not reflect the true performance of the parametric methods in practice. Zheng and Wang (2017) had the same concern and developed an item selection method with the binary searching algorithm for item exposure control. However, their method was compared only with the posterior-weighted Kullback–Leibler (PWKL) method, and a thorough evaluation of the degree of item overexposure the other parametric methods could encounter without controlling item exposure remains unavailable.

In response to these issues, this study proposes the nonparametric item selection (NPS) method, which first uses the nonparametric classification (NPC) method (Chiu & Douglas, 2013) to classify examinees and then identifies the best item for each examinee individually by maximizing the discrimination power of the items, as the basis for a nonparametric CD-CAT. Unlike the existing parametric methods for CD-CAT, which require a large sample to secure accurate item parameter calibration, the proposed method is free of any item calibration and thus is favored in a setting where the calibration sample is small or even unavailable.

The remainder of the article is organized as follows. The “Technical Background” section gives a brief review of the CDMs and the methods used in the study, followed by the “NPS Method for CD-CAT” section, where the key concepts and the algorithm of the NPS method are explained in detail. The “Simulation Studies” section includes two simulation studies assessing the performance of the proposed method in comparison with several existing parametric methods under various conditions. The article concludes with several suggestions and future research directions.

Technical Background

CDMs

The section begins with the key definitions and notation used in the article. The construct of interest in CD is referred to as an “attribute,” which is binary, with 1 and 0 indicating mastery and nonmastery, respectively. Suppose at most K attributes are required by the items in the test. The latent attribute profile of examinee i is a $K \times 1$ vector denoted as $\boldsymbol{\alpha}_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})^T$. The latent space spanned by the K attributes thus contains $2^K$ latent proficiency classes, and the ultimate goal of CD is to assign examinees to the proficiency class to which they belong. The assignments are typically done by fitting the data with a CDM, also referred to as a diagnostic classification model (DCM).

A common component of CD modeling is the prespecified Q-matrix (K. K. Tatsuoka, 1985), which indicates the associations between the test items and the required attributes. Specifically, suppose a test consists of J items. A $J \times K$ Q-matrix is specified along with the test such that the entry $(j, k)$, denoted as $q_{jk}$, is 1 if Item j requires attribute k and 0 otherwise. Given the Q-matrix, the examinees’ class memberships can be estimated by fitting the data with a prespecified CDM. Various CDMs have been developed, and those used in the study are briefly reviewed in the following.

The DINA model

The DINA model (Junker & Sijtsma, 2001; MacReady & Dayton, 1977) is perhaps the most frequently researched CDM due to its simplicity. Let $\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{iJ})$ be the observed item response pattern for examinee i. The item response function of the DINA model is defined as

$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{1 - \eta_{ij}}, \tag{1}$$

where $s_j$ and $g_j$ are the slipping and guessing parameters, respectively, for Item j, assuming $0 < g_j < 1 - s_j < 1$. In addition, $\eta_{ij}$, often referred to as the ideal item response, is defined as $\prod_{k \in A_j} \alpha_{ik}$, where $A_j = \{k \mid q_{jk} = 1\}$ for Item j. $\eta_{ij}$ indicates whether examinee i masters all the required attributes of Item j and thus makes the DINA model a typical conjunctive model. Equation 1 also shows that $\eta_{ij}$ is the most likely item response from examinee i to Item j when $s_j$ and $g_j$ are small.
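To make Equation 1 and the ideal response concrete, the following minimal Python sketch computes $\eta_{ij}$ and the DINA success probability for a single item; the function and variable names are illustrative, not taken from the original study.

```python
import numpy as np

def dina_prob(alpha, q, s, g):
    """P(Y=1 | alpha) under the DINA model (Equation 1) for one item.

    alpha : (K,) binary mastery profile of the examinee
    q     : (K,) binary q-vector of the item
    s, g  : slipping and guessing parameters, with 0 < g < 1 - s < 1
    """
    # Ideal response: 1 only if every required attribute is mastered
    eta = int(np.all(alpha[q == 1]))
    return (1 - s) ** eta * g ** (1 - eta)

# Example: an item requiring the first and third attributes
print(dina_prob(np.array([1, 0, 1]), np.array([1, 0, 1]), s=0.1, g=0.2))  # 0.9
```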

The reduced reparameterized unified model (RUM)

The reduced RUM (Hartz, 2002; Hartz & Roussos, 2008) is another CDM that has received considerable attention among psychometric researchers. The item response function of the reduced RUM is defined as

$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_i) = \pi_j^{*} \prod_{k \in A_j} r_{jk}^{*\,(1 - \alpha_{ik})}, \tag{2}$$

where $0 < \pi_j^{*} < 1$ denotes the probability of a correct answer for an examinee who has mastered all the attributes required by Item j, and $0 < r_{jk}^{*} < 1$ is a penalty parameter that reduces the probability of a correct response by a factor of $r_{jk}^{*}$ for those who do not possess attribute k. The reduced RUM relaxes the restricted conjunction rule used in the DINA model and allows different probabilities of success for different proficiency classes based on the number of attributes possessed.
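A matching sketch for Equation 2, under the same illustrative conventions as above, multiplies in the penalty $r_{jk}^{*}$ for each attribute that is required but not mastered.

```python
import numpy as np

def rrum_prob(alpha, q, pi_star, r_star):
    """P(Y=1 | alpha) under the reduced RUM (Equation 2) for one item.

    pi_star : success probability when all required attributes are mastered
    r_star  : (K,) penalty factors; r_star[k] applies only when attribute k
              is required (q[k] == 1) but not mastered (alpha[k] == 0)
    """
    penalty = np.prod(np.where((q == 1) & (alpha == 0), r_star, 1.0))
    return pi_star * penalty
```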

The general CD models

The DINA model and the reduced RUM are two examples of the restricted CDMs subsumed in general CDMs, such as the general diagnostic model (GDM; von Davier, 2005, 2008), the loglinear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), and the generalized deterministic inputs, noisy “and” gate (G-DINA) model (de la Torre, 2011). Taking the G-DINA model as an example, its item response function can be expressed by relating the probability of a correct response and the model specifications through one of several possible link functions, such as the identity, logit, or log link. For succinctness, suppose Item j requires $K_j^{*} \leq K$ attributes that, without loss of generality, have been moved to the first $K_j^{*}$ positions of the q-vector $\mathbf{q}_j$. Furthermore, let $\boldsymbol{\alpha}_{lj}^{*}$ be the reduced attribute profile that consists of the first $K_j^{*}$ attributes, for $l = 1, \ldots, 2^{K_j^{*}}$. The saturated form of the G-DINA model is formulated using the identity link, and its item response function can be written as

$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_{lj}^{*}) = \beta_{j0} + \sum_{k=1}^{K_j^{*}} \beta_{jk}\alpha_{lk} + \sum_{k' > k}^{K_j^{*}} \sum_{k=1}^{K_j^{*}-1} \beta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \beta_{j12\cdots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}, \tag{3}$$

where $\beta_{j0}$ denotes the intercept of Item j, $\beta_{jk}$ the main effect due to $\alpha_{lk}$, $\beta_{jkk'}$ the interaction effect of $\alpha_{lk}$ and $\alpha_{lk'}$, and $\beta_{j12\cdots K_j^{*}}$ the interaction effect of $\alpha_{l1}, \ldots, \alpha_{lK_j^{*}}$. With careful parameterization, the restricted CDMs can be respecified in terms of the general CDMs.
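The saturated G-DINA response function in Equation 3 is a sum of effects over all subsets of the required attributes. The sketch below, which uses our own hypothetical encoding of the $\beta$ parameters as a dictionary keyed by attribute subsets, accumulates every term whose interaction is “switched on” by the reduced profile.

```python
from itertools import combinations

def gdina_prob(alpha_star, beta):
    """Saturated G-DINA item response function with the identity link (Equation 3).

    alpha_star : sequence of 0/1, the reduced attribute profile
    beta       : dict mapping attribute subsets to effects, e.g., for K_j* = 2:
                 {(): b0, (0,): b1, (1,): b2, (0, 1): b12}
    """
    Kj = len(alpha_star)
    p = 0.0
    for size in range(Kj + 1):
        for subset in combinations(range(Kj), size):
            # The empty subset contributes the intercept; a nonempty subset
            # contributes only if all of its attributes are mastered.
            if all(alpha_star[k] == 1 for k in subset):
                p += beta.get(subset, 0.0)
    return p
```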

The NPC Method

The NPC method (Chiu & Douglas, 2013) was used in the study to obtain the estimates of examinees’ attribute profiles in the CD-CAT process. The NPC method classifies examinees by evaluating the distance between the observed and ideal item responses. Specifically, for the $2^K$ possible attribute profiles, the corresponding ideal response patterns are denoted as $\boldsymbol{\eta}^{(1)}, \boldsymbol{\eta}^{(2)}, \ldots, \boldsymbol{\eta}^{(2^K)}$, where $\boldsymbol{\eta}^{(m)} = (\eta_1^{(m)}, \ldots, \eta_J^{(m)})$ for $m = 1, \ldots, M = 2^K$. The examinee’s attribute profile is then estimated by minimizing the distance between the observed and ideal item responses, $d(\mathbf{Y}_i, \boldsymbol{\eta}^{(m)})$, where $m = 1, \ldots, M$. For binary data, a natural and frequently used distance measure is the Hamming distance, which simply counts the number of times that the entries in two vectors disagree, defined as

$$d_h(\mathbf{Y}_i, \boldsymbol{\eta}^{(m)}) = \sum_{j=1}^{J} \left| Y_{ij} - \eta_j^{(m)} \right|. \tag{4}$$

Then the estimator of the proficiency class, $\hat{\boldsymbol{\alpha}}_i$, is obtained by

$$\hat{\boldsymbol{\alpha}}_i = \arg\min_{\boldsymbol{\alpha}_m \in \{\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{2^K}\}} d_h(\mathbf{Y}_i, \boldsymbol{\eta}^{(m)}). \tag{5}$$

As an important theoretical result, S. Wang and Douglas (2015) proved that, assuming $P(Y_j = 1 \mid \eta_j = 0) < 0.5$ and $P(Y_j = 1 \mid \eta_j = 1) > 0.5$ for Item j and that the Q-matrix is complete (Chiu, Douglas, & Li, 2009; Köhn & Chiu, 2017), the estimator of the attribute profile obtained using the NPC method is statistically consistent, provided the test is long, with some regularity conditions on the number of examinees. Chiu and Douglas (2013) also pointed out some remarkable advantages of the NPC method. First, the method is computationally straightforward and inexpensive. Second, the resulting classification rates are almost as high as those obtained using parametric estimation methods. Third, the method can be used with data conforming to a variety of CDMs under the two assumptions provided above. Most importantly, there is no restriction on the sample size: the method is efficient and effective when samples are large, and it can be used even when the sample size is 1.
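As an illustration of the NPC method, the following sketch enumerates all $2^K$ attribute profiles, derives their conjunctive ideal response patterns from the Q-matrix, and applies Equations 4 and 5. It is a minimal NumPy rendering with our own naming, not the authors' implementation; the sorted output anticipates the “second most likely” profile used by the NPS algorithm later in the article.

```python
import numpy as np
from itertools import product

def npc_classify(y, Q):
    """NPC estimate of the attribute profile (Equations 4 and 5).

    y : (J,) observed binary responses
    Q : (J, K) binary Q-matrix
    Returns the closest profile, plus all profiles and distances sorted ascending.
    """
    J, K = Q.shape
    profiles = np.array(list(product([0, 1], repeat=K)))  # all 2^K profiles
    # Conjunctive ideal responses: eta_j = 1 iff the profile covers every
    # attribute the item requires.
    ideal = np.all(profiles[:, None, :] >= Q[None, :, :], axis=2).astype(int)  # (2^K, J)
    d = np.abs(y[None, :] - ideal).sum(axis=1)            # Hamming distances
    order = np.argsort(d, kind="stable")
    return profiles[order[0]], profiles[order], d[order]
```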

Parametric Item Selection Methods in CD-CAT

Item selection methods play a central role in CAT procedures. In traditional item response theory (IRT), the Fisher information is perhaps the most commonly used item selection method; however, it is no longer applicable to CDMs because the person parameter (i.e., the attribute profile α) is discrete. Thus, other information functions and indexes have been developed, and those used in the study are briefly reviewed in the following.

Shannon entropy-based approach

The Shannon entropy (SHE; Shannon, 1948; C. Tatsuoka, 2002) algorithm for CD-CAT aims to minimize the variability of the posterior distribution of the attribute profile $\boldsymbol{\alpha}$. Suppose t items have been given to examinee i. For Item j in the item bank, its expected Shannon entropy, if administered as the $(t+1)$th item, is expressed as

$$SHE_j = E\left[ H\left( \pi_i^{(t+1)}(\boldsymbol{\alpha}) \right) \right], \tag{6}$$

where the entropy H is in general defined as $H(f) = -\sum_x f(x) \log f(x)$, and $\pi_i^{(t+1)}(\boldsymbol{\alpha}) = P(\boldsymbol{\alpha} \mid y_i^{(1)}, \ldots, y_i^{(t)}, Y_i^{(t+1)})$ is the posterior probability of $\boldsymbol{\alpha}_i$ given the first $(t+1)$ item responses. Because the $(t+1)$th item has not yet been administered, its response is denoted as $Y_i^{(t+1)}$ and treated as a random variable; the expectation of the entropy H is thus taken with respect to $Y_i^{(t+1)}$. The item associated with the smallest SHE is then chosen as the best item for examinee i.
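For a candidate item, the expectation in Equation 6 runs over the two possible responses, each weighting the entropy of the corresponding Bayes-updated posterior. A minimal sketch, assuming the success probabilities $P(Y = 1 \mid \boldsymbol{\alpha}_m)$ for every proficiency class are available as a vector (our naming throughout):

```python
import numpy as np

def she_index(post, p_correct):
    """Expected Shannon entropy (Equation 6) of one candidate item.

    post      : (M,) current posterior over the attribute profiles
    p_correct : (M,) P(Y=1 | alpha_m) for the candidate item
    """
    she = 0.0
    for y in (0, 1):
        like = p_correct if y == 1 else 1.0 - p_correct
        marg = float(np.sum(like * post))    # predictive probability of response y
        if marg == 0.0:
            continue
        new_post = like * post / marg        # Bayes update of the posterior
        nz = new_post > 0
        she += marg * -np.sum(new_post[nz] * np.log(new_post[nz]))
    return she                               # administer the item minimizing this
```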

The PWKL index

In addition to the SHE method, another frequently referenced method is the PWKL index (Cheng, 2009), which belongs to the family of the Kullback–Leibler (KL) information-based approaches. The PWKL index is a variation that improves the KL information by taking the KL distance and the posterior probability of $\boldsymbol{\alpha}_m$ into account. In other words, the PWKL index of Item j at cycle t for examinee i is expressed as

$$PWKL_j^{(t)} = \sum_{m=1}^{2^K} D_j\!\left( f(y \mid \hat{\boldsymbol{\alpha}}_i) \,\middle\|\, f(y \mid \boldsymbol{\alpha}_m) \right) \pi_i^{(t)}(\boldsymbol{\alpha}_m), \tag{7}$$

where $\pi_i^{(t)}(\boldsymbol{\alpha}_m) = P(\boldsymbol{\alpha}_m \mid y_i^{(1)}, \ldots, y_i^{(t)})$ is the probability of examinee i being in proficiency class m, given the first t observed item responses. When the PWKL index is used, the goal is to select an item as discriminating as possible, so that the distributions of the response variable given examinees in different proficiency classes are far apart from each other, or in technical terms, to select the item that maximizes the PWKL index.
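For a binary item, the KL distance in Equation 7 has a closed form over the two response values, so the PWKL index reduces to a short weighted sum. The sketch below is our reading of the index, with a small epsilon guarding the logarithms:

```python
import numpy as np

def pwkl_index(p_hat, p_all, post, eps=1e-10):
    """PWKL index (Equation 7) of one candidate binary item.

    p_hat : P(Y=1 | alpha_hat), success probability at the current estimate
    p_all : (M,) P(Y=1 | alpha_m) for every proficiency class
    post  : (M,) posterior probabilities pi_i^(t)(alpha_m)
    """
    # KL distance between the response distributions under alpha_hat and alpha_m
    kl = (p_hat * np.log((p_hat + eps) / (p_all + eps))
          + (1 - p_hat) * np.log((1 - p_hat + eps) / (1 - p_all + eps)))
    return float(np.sum(kl * post))          # administer the item maximizing this
```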

As a side note, the mutual information (MI) item selection method (C. Wang, 2013) can be seen as a particular case of KL divergence (C. Wang, Chang, & Huebner, 2011). For readers who are interested in the method, please refer to C. Wang (2013) for more details.

The posterior-weighted cognitive diagnostic model discrimination index (PWCDI)

The CDM discrimination index (CDI) was first proposed by Henson and Douglas (2005) to facilitate test construction for CD purposes. Later on, Zheng and Chang (2016) extended it to the context of CD-CAT and developed the PWCDI for selecting items. The index for the jth item is defined as

$$PWCDI_j = \frac{1}{\sum_{m \neq m'} d_h(\boldsymbol{\alpha}_m, \boldsymbol{\alpha}_{m'})^{-1}} \sum_{m \neq m'} d_h(\boldsymbol{\alpha}_m, \boldsymbol{\alpha}_{m'})^{-1} \, \pi^{(t)}(\boldsymbol{\alpha}_m) \, \pi^{(t)}(\boldsymbol{\alpha}_{m'}) \, KL(\boldsymbol{\alpha}_m \| \boldsymbol{\alpha}_{m'}), \tag{8}$$

where $d_h(\boldsymbol{\alpha}_m, \boldsymbol{\alpha}_{m'}) = \sum_{k=1}^{K} |\alpha_{mk} - \alpha_{m'k}|$ is the Hamming distance between attribute profiles $\boldsymbol{\alpha}_m$ and $\boldsymbol{\alpha}_{m'}$. The best item is selected by maximizing $PWCDI_j$.
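Under the reconstruction of Equation 8 given above, the PWCDI can be sketched as a posterior- and distance-weighted average of pairwise KL terms. The double loop and the binary-item KL below are our reading of the index, not code from Zheng and Chang (2016):

```python
import numpy as np

def pwcdi_index(p_all, post, profiles, eps=1e-10):
    """PWCDI (Equation 8, as reconstructed) for one candidate binary item.

    p_all    : (M,) P(Y=1 | alpha_m) for the item
    post     : (M,) posterior over the profiles
    profiles : (M, K) all attribute profiles
    """
    M = len(post)
    num = den = 0.0
    for m in range(M):
        for mp in range(M):
            if m == mp:
                continue
            h = np.abs(profiles[m] - profiles[mp]).sum()  # Hamming distance between classes
            kl = (p_all[m] * np.log((p_all[m] + eps) / (p_all[mp] + eps))
                  + (1 - p_all[m]) * np.log((1 - p_all[m] + eps) / (1 - p_all[mp] + eps)))
            num += (1.0 / h) * post[m] * post[mp] * kl
            den += 1.0 / h
    return num / den                                      # administer the item maximizing this
```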

Dynamic and stratified dynamic binary searching methods

The dynamic binary searching (DBS) method was first proposed by C. Tatsuoka and Ferguson (2003) as a heuristic, named the halving algorithm in their paper, and was later renamed the DBS method by Zheng and Wang (2017). The index of Item j for examinee i is defined as

$$B_{ij} = \left| \sum_{\{m:\, \eta_j^{(m)} = 1\}} \pi_i^{(t)}(\boldsymbol{\alpha}_m) - 0.5 \right|,$$

and the best item is selected by minimizing this index. It should be noted that the DBS method, like the NPS method, also benefits from the randomization mechanism (Georgiadou, Triantafillou, & Economides, 2007) for item exposure control.
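The DBS index needs only the posterior mass on the classes whose ideal response to the item is 1; a one-line sketch under that reading (illustrative names):

```python
import numpy as np

def dbs_index(post, ideal_j):
    """DBS index B_ij: distance from an even split of the posterior mass.

    post    : (M,) posterior over the attribute profiles
    ideal_j : (M,) ideal responses eta_j^(m) of the candidate item
    """
    return abs(float(np.sum(post[ideal_j == 1])) - 0.5)  # administer the item minimizing this
```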

Inspired by an IRT-based CAT technique (Chang & Ying, 1999), Zheng and Wang (2017) modified the method by stratifying the item bank according to item discrimination (Rupp, Templin, & Henson, 2010), and proposed the stratified dynamic binary searching (SDBS) method for fixed-length CD-CAT.

G-DINA model discrimination index (GDI)

The GDI was first proposed by de la Torre and Chiu (2016) as an index for validating a Q-matrix, and the GDI for Item j is defined as

$$\varsigma_j^2 = \sum_{l=1}^{2^{K_j^{*}}} \left( P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_{lj}^{*}) - \bar{P}_j \right)^2 P(\boldsymbol{\alpha}_{lj}^{*} \mid \mathbf{y}), \tag{9}$$

where $P(\boldsymbol{\alpha}_{lj}^{*} \mid \mathbf{y})$ is the posterior probability of $\boldsymbol{\alpha}_{lj}^{*}$ and $\bar{P}_j = \sum_{l=1}^{2^{K_j^{*}}} P(\boldsymbol{\alpha}_{lj}^{*} \mid \mathbf{y}) P(Y_{ij} = 1 \mid \boldsymbol{\alpha}_{lj}^{*})$ is the mean success probability. Because the GDI measures the ability of an item to discriminate among the reduced proficiency classes, in the context of validating a Q-matrix, the correct q-vector is chosen by maximizing the GDI.

Kaplan et al. (2015) later extended it to CD-CAT as an item selection index, with the modification that $P(\boldsymbol{\alpha}_{lj}^{*} \mid \mathbf{y})$ is replaced by the posterior probability of $\boldsymbol{\alpha}_{lj}^{*}$ at cycle t for examinee i, denoted as $\pi_i^{(t)}(\boldsymbol{\alpha}_{lj}^{*}) = P(\boldsymbol{\alpha}_{lj}^{*} \mid \mathbf{y}_i^{(t)})$. Denoting the modified GDI of Item j for examinee i at cycle t as $\varsigma_{ij}^{2(t)}$, the best item is obtained by maximizing $\varsigma_{ij}^{2(t)}$.
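The modified GDI is simply the posterior-weighted variance of the item's success probabilities over the reduced proficiency classes, which reduces to two lines; names are again illustrative:

```python
import numpy as np

def gdi_index(p_reduced, post_reduced):
    """Modified GDI (Equation 9 with posterior weights at cycle t) for one item.

    p_reduced    : (2^Kj,) P(Y=1 | alpha_l*) over the reduced profiles
    post_reduced : (2^Kj,) posterior over the reduced profiles
    """
    p_bar = float(np.sum(post_reduced * p_reduced))                # mean success probability
    return float(np.sum(post_reduced * (p_reduced - p_bar) ** 2))  # administer the item maximizing this
```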

NPS Method for CD-CAT

The proposed NPS method first uses the NPC method (Chiu & Douglas, 2013) to estimate examinees’ attribute profiles and then selects an item that can best differentiate an examinee’s estimated attribute profile from the other attribute profiles by evaluating the distance between the examinee’s observed and ideal response patterns. After the examinee responds to the selected item, the estimate of his or her attribute profile is then updated using the NPC method again, and another item that can better pinpoint the examinee’s attribute profile is further selected following the same item selection procedure just described. The algorithm continues until the stopping rule is met. In the rest of the section, two key concepts of the method will be discussed first, followed by a presentation of the algorithm of the NPS method.

Key Concepts

Two key concepts need to be elaborated first to support the legitimacy of the NPS method as an item selection method in a CD-CAT system. The first key concept regards the Q-optimal criterion used from Step 1 to Step 5 of the algorithm. To optimize the classification of examinees, the Q-optimal criterion proposed by Xu, Wang, and Shang (2016), rather than random selection, was adopted to choose the first K items. Suppose $\boldsymbol{\alpha}$ is the true attribute profile of examinee i, and $\boldsymbol{\alpha}'$ is a different attribute profile. A Q-matrix is “Q-optimal” in terms of classification accuracy for examinee i if

$$P(\hat{\boldsymbol{\alpha}}_i = \boldsymbol{\alpha} \mid \boldsymbol{\alpha}, \mathbf{Q}_i) > \max_{\boldsymbol{\alpha}' \neq \boldsymbol{\alpha}} P(\hat{\boldsymbol{\alpha}}_i = \boldsymbol{\alpha}' \mid \boldsymbol{\alpha}, \mathbf{Q}_i). \tag{10}$$

It has been proven (Xu et al., 2016) that when data conform to the DINA model, a Q-matrix of the first K items is Q-optimal for examinee i if and only if the K q-vectors have the following lower triangular matrix structure, with or without column swapping,

$$\mathbf{Q}_i = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ q_{21} & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ q_{(K-1)1} & q_{(K-1)2} & \cdots & 1 & 0 \\ q_{K1} & q_{K2} & \cdots & q_{K(K-1)} & 1 \end{pmatrix}_{K \times K}, \tag{11}$$

where the diagonal entries are 1s, and the entry $q_{k'k}$, for $k = 1, \ldots, K-1$ and $k' = k+1, \ldots, K$, takes the value 0 if the kth item is answered incorrectly by examinee i and can take either 0 or 1 otherwise.

However, the Q-optimal condition in Equation 11 holds only for the DINA model. For other, more complex CDMs, additional assumptions need to be satisfied. Unfortunately, these assumptions could turn into a number of stringent and thus unrealistic constraints if the Q-optimal condition in Equation 11 were applied to a more complex CDM. An alternative is to use only single-attribute items; this initial Q-matrix guarantees Q-optimality for data conforming to any model but loses the flexibility Equation 11 provides. Considering the pros and cons, the Q-optimal criterion in Equation 11 without extra constraints was used in the simulation to select the initial K items, although multiple CDMs could underlie these items. As a consequence, the results should be interpreted with the caution that Q-optimality may be achieved only partially if any of the initial K items conforms to a CDM other than the DINA model.
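As a preview of Steps 1 to 3 of the algorithm presented below, the following sketch builds the target q-vector for the kth initial item so that the realized initial Q-matrix has the lower-triangular structure of Equation 11: attribute k is always required, and each earlier attribute is re-required with probability 0.5 only when the corresponding item was answered correctly. The function name and interface are hypothetical.

```python
import numpy as np

rng = np.random.default_rng()

def initial_q_vector(k, responses, K):
    """Target q-vector of the (k+1)th initial item (0-indexed cycle k < K).

    responses : list of 0/1 answers to the first k initial items
    """
    q = np.zeros(K, dtype=int)
    q[k] = 1                                  # the new attribute is always required
    for i, y in enumerate(responses):
        if y == 1 and rng.random() < 0.5:     # Bernoulli(0.5) inclusion, only after a correct answer
            q[i] = 1
    return q
```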

The second key concept concerns the search algorithm used in Steps 6 and 11 of the NPS algorithm for finding the best item for each examinee. Instead of using a complete search algorithm, where an item is selected if it can best distinguish $\hat{\boldsymbol{\alpha}}$ from all the other attribute profiles, the NPS method adopts a more economical algorithm by assessing only the separability of $\hat{\boldsymbol{\alpha}}$ from the “second most likely” attribute profile, denoted as $\tilde{\boldsymbol{\alpha}}$ in the algorithm. The underlying rationale is that $\tilde{\boldsymbol{\alpha}}$ is the attribute profile presumably most difficult to separate from $\hat{\boldsymbol{\alpha}}$, compared with the other $2^K - 2$ attribute profiles; once $\tilde{\boldsymbol{\alpha}}$ can be separated from $\hat{\boldsymbol{\alpha}}$, the other, less competitive attribute profiles will either be further “pushed” away or remain distant from $\hat{\boldsymbol{\alpha}}$, and thus there is no need to explore the separability of $\hat{\boldsymbol{\alpha}}$ from the other attribute profiles.

The graphs in Figure 1 are used as examples to illustrate the idea under different conditions. The solid trajectory represents the distance corresponding to $\boldsymbol{\alpha}$, and the dashed trajectories represent the others. Given that $\boldsymbol{\alpha}$ is the correct estimate of the attribute profile, the left graph shows that the distance corresponding to $\boldsymbol{\alpha}$ remains the minimum along the way, and the other, incorrect estimates move farther and farther away from $\boldsymbol{\alpha}$. However, it could happen that the attribute profile is incorrectly estimated in the early stage of the estimation. The right graph of Figure 1 further illustrates that the NPS algorithm is able to correct the error and eventually obtain the correct estimate.

Figure 1. Examples illustrating the trajectories of distances using the NPS algorithm.

Note. NPS = nonparametric item selection.

Algorithm

In the NPS algorithm, the first K items are selected based on the Q-optimal criterion (Xu et al., 2016) to yield the optimal classification results. After the first K items are determined, the NPS method is applied and iterated to select the personalized items for each examinee until the stopping rule is met. Note that column swapping for the Q-optimal criterion is not applied in the following algorithm description, but is implemented in the actual algorithm.

  • Step 0: Initialize the item pool $R^{(0)} = \{1, \ldots, J\}$.

  • Step 1: For each examinee, randomly select an Item j from $R^{(0)}$ such that the q-vector of Item j is $\mathbf{q}_j = \mathbf{e}_1$, where $\mathbf{e}_1$ is a unit vector whose first entry is 1 and all other entries are 0s. Update the set $R^{(0)}$ by deleting Item j; that is, $R^{(1)} = R^{(0)} \setminus \{j\}$.

  • Step 2: Administer Item j to the examinee and initialize the response vector as $\mathbf{y}^{(1)} = (y^{(1)})$.

  • Step 3: At cycle k, where $1 < k \leq K$, select an item, denoted as Item j, from $R^{(k-1)}$ such that $\mathbf{q}_j = \mathbf{e}_k + \sum_{i=1}^{k-1} B_i \mathbf{e}_i I(y^{(i)} = 1)$, where the $B_i$s are independent Bernoulli variables with success probability equal to 0.5. Update the set $R^{(k-1)}$ by deleting Item j; that is, $R^{(k)} = R^{(k-1)} \setminus \{j\}$.

  • Step 4: Administer Item j to the examinee and update the response vector by $\mathbf{y}^{(k)} = (\mathbf{y}^{(k-1)}, y^{(k)})$.

  • Step 5: Let $k = k + 1$ and repeat Steps 3 and 4 until $k = K$.

  • Step 6: Estimate the examinee’s attribute profile using the NPC method with the observed responses $\mathbf{y}^{(K)}$ and label the estimate as $\hat{\boldsymbol{\alpha}}^{(K)}$. Arrange the distances between $\mathbf{y}^{(K)}$ and the other $2^K - 1$ K-dimensional ideal response vectors in ascending order, denoted as $d_{(2)}^{(K)}, \ldots, d_{(2^K)}^{(K)}$. Denote the attribute profile corresponding to $d_{(2)}^{(K)}$ as $\tilde{\boldsymbol{\alpha}}^{(K)}$.

  • Step 7: At cycle t, where $t > K$, choose an item from $R^{(t-1)}$ that can discriminate between $\hat{\boldsymbol{\alpha}}^{(t-1)}$ and $\tilde{\boldsymbol{\alpha}}^{(t-1)}$ by establishing $\eta_{\hat{\boldsymbol{\alpha}}} \neq \eta_{\tilde{\boldsymbol{\alpha}}}$. Suppose the item is Item j. If Item j can be found, go to Step 10; otherwise, continue with Step 8.

  • Step 8: Replace $\tilde{\boldsymbol{\alpha}}^{(t-1)}$ with the attribute profile corresponding to $d_{(m)}^{(t-1)}$, where $m \geq 3$. Choose an item from $R^{(t-1)}$ that can discriminate between $\hat{\boldsymbol{\alpha}}^{(t-1)}$ and $\tilde{\boldsymbol{\alpha}}^{(t-1)}$.

  • Step 9: Repeat Step 8 until an Item j is found such that $\hat{\boldsymbol{\alpha}}^{(t-1)}$ and $\tilde{\boldsymbol{\alpha}}^{(t-1)}$ can be discriminated.

  • Step 10: Administer Item j to the examinee and update $R^{(t-1)}$ and $\mathbf{y}^{(t-1)}$ by $R^{(t)} = R^{(t-1)} \setminus \{j\}$ and $\mathbf{y}^{(t)} = (\mathbf{y}^{(t-1)}, y^{(t)})$, respectively.

  • Step 11: Estimate the attribute profile of the examinee using the NPC method with the responses in $\mathbf{y}^{(t)}$ and denote the estimate as $\hat{\boldsymbol{\alpha}}^{(t)}$. Arrange the distances between $\mathbf{y}^{(t)}$ and the other $2^K - 1$ t-dimensional ideal response vectors as $d_{(2)}^{(t)}, \ldots, d_{(2^K)}^{(t)}$. Denote the attribute profile corresponding to $d_{(2)}^{(t)}$ as $\tilde{\boldsymbol{\alpha}}^{(t)}$.

  • Step 12: Let t=t+1 and repeat Steps 7 through 11 until the stopping rule is met.

The fixed-length criterion is used in the study to stop the algorithm; other stopping criteria, if suitable, can easily be implemented in the algorithm. It should be emphasized that in Step 7, the discrimination power of an item is determined by whether the item can yield distinct $\eta$s for $\hat{\boldsymbol{\alpha}}_i$ and $\tilde{\boldsymbol{\alpha}}_i$. Recall that the goal of the NPS method is to find an item that can result in distinct observed responses in the $(t+1)$th cycle for the two attribute profiles $\hat{\boldsymbol{\alpha}}_i^{(t)}$ and $\tilde{\boldsymbol{\alpha}}_i^{(t)}$ estimated in the tth cycle. Whereas the observed responses $y^{(t+1)}$ from $\hat{\boldsymbol{\alpha}}_i^{(t)}$ and $\tilde{\boldsymbol{\alpha}}_i^{(t)}$ are not yet obtainable at cycle t, the $\eta^{(t+1)}$s can already be computed using $\hat{\boldsymbol{\alpha}}_i$ and $\tilde{\boldsymbol{\alpha}}_i$ and the q-vectors of the items in the item bank. Furthermore, it is known that examinees with $\hat{\boldsymbol{\alpha}}_i^{(t)}$ and $\tilde{\boldsymbol{\alpha}}_i^{(t)}$ are more likely to have distinct observed responses to an item at cycle $t+1$ if their ideal responses to the item are also distinct. Hence, it is justified to use the $\eta^{(t+1)}$s, rather than the unobtainable $y^{(t+1)}$s, to assess the discrimination power of the items.
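Putting Steps 7 through 9 together, one NPS selection cycle can be sketched as follows. The function assumes the profiles have already been ordered by their distances to the observed responses (Step 6 or 11, e.g., via the npc_classify sketch above) and randomizes among equally discriminating items, which is precisely the built-in exposure control discussed next; the naming is ours.

```python
import numpy as np

rng = np.random.default_rng()

def nps_select(alpha_hat, competitors, Q, available):
    """One NPS selection cycle (Steps 7-9).

    alpha_hat   : (K,) current NPC estimate
    competitors : profiles ordered by ascending distance, excluding alpha_hat
    Q           : (J, K) binary Q-matrix
    available   : set of item indices still in the pool
    """
    def eta(alpha, j):
        # Conjunctive ideal response of profile alpha to item j
        return int(np.all(alpha[Q[j] == 1]))

    for alpha_tilde in competitors:          # Step 7, then Step 8 fallbacks
        candidates = [j for j in available if eta(alpha_hat, j) != eta(alpha_tilde, j)]
        if candidates:
            return int(rng.choice(candidates))   # random tie-breaking controls item exposure
    return int(rng.choice(sorted(available)))    # no discriminating item left in the pool
```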

Several remarkable advantages of the NPS method are worth noting. First, because no model fitting is needed for the NPC method, the NPS method does not require any calibration and thus is ready to be used regardless of the availability of calibration samples. Second, it has been shown that the NPC method can be used when data conform to various CDMs under the assumptions described earlier in the “Technical Background” section (Chiu & Douglas, 2013; S. Wang & Douglas, 2015). Hence, the NPS method appears to be suitable for an item bank involving multiple CDMs. Third, because the NPS method requires only the q-vector, rather than the characteristics of a specific item, for item selection, items with the same q-vector have identical discrimination power, and the best item is selected randomly from the candidate items. Hence, this built-in mechanism for controlling item exposure falls into the category of randomization methods (Georgiadou et al., 2007) and nicely balances estimation accuracy and test security. Fourth, the NPS method is computationally straightforward and requires only a fraction of CPU time, a feature that makes the NPS an accommodating method for environments without powerful high-end computers.

A Variation: The Weighted NPS (WNPS) Algorithm

The algorithm can easily be modified if a variation of the NPC method is used to estimate the examinees’ attribute profiles. For instance, Chiu and Douglas (2013) proposed the weighted Hamming distance to accommodate item variability. Let $\bar{p}_j$ be the proportion of examinees responding correctly to Item j. Then the weighted Hamming distance is defined as

$$d_{wh}(\mathbf{Y}_i, \boldsymbol{\eta}^{(m)}) = \sum_{j=1}^{J} \frac{1}{\bar{p}_j (1 - \bar{p}_j)} \left| Y_{ij} - \eta_j^{(m)} \right|, \tag{12}$$

and can be plugged into Equation 5 to estimate the examinees’ profiles. The weighted Hamming distance may be favored when the test is short and ties could therefore occur with the ordinary Hamming distance. If the weighted Hamming distance is used, the only modification needed is to replace $d_h$ with $d_{wh}$ in Step 11. This version of the algorithm is referred to as the WNPS method and is also examined in the simulation. However, it is important to emphasize that, unlike the NPS algorithm, which is free of any calibration, the WNPS algorithm requires a sample to compute the weight $1/(\bar{p}_j(1 - \bar{p}_j))$ for each item before the test is administered.
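A sketch of Equation 12 under the same illustrative conventions as before; note that items answered correctly by nearly all or nearly no examinees in the calibration sample receive the largest weights, since $\bar{p}_j(1 - \bar{p}_j)$ is maximized at $\bar{p}_j = 0.5$:

```python
import numpy as np

def weighted_hamming(y, ideal, p_bar):
    """Weighted Hamming distance (Equation 12) between responses and one ideal pattern.

    y     : (J,) observed responses
    ideal : (J,) ideal responses of a candidate profile
    p_bar : (J,) proportions correct in the calibration sample (0 < p_bar < 1)
    """
    w = 1.0 / (p_bar * (1.0 - p_bar))    # item weights from the calibration sample
    return float(np.sum(w * np.abs(y - ideal)))
```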

Simulation Studies

The goal of the simulations is to evaluate the performance of the two proposed nonparametric algorithms, NPS and WNPS, in comparison with six frequently researched parametric methods, SHE, PWKL, PWCDI, GDI, DBS, and SDBS, with the random selection method serving as the baseline for all methods. Among these parametric methods, the first four do not have any mechanism to control item exposure and are used in Simulation Study I. The last two, similar to the NPS and WNPS methods, have a built-in randomization mechanism for item exposure control and are used in Simulation Study II. The specific focus is on the effectiveness and efficiency of the compared methods in classifying examinees and on their ability to control item exposure with small calibration samples. To provide information useful for and relevant to practice, multiple CDMs, rather than a single CDM, were used to generate data.

Design

To investigate the performance of the studied methods under conditions of small calibration samples, three sample sizes, $N_0 = 30$, 50, and 100, were considered. The calibration samples were used for the WNPS algorithm to compute the weight of each item and for the parametric methods to calibrate the item parameters; the samples were not required by the NPS algorithm. The item bank consisted of J = 300 items; 150 of them conformed to the DINA model and the other 150 to the reduced RUM. For both CDMs, the guessing, $P(Y_j = 1 \mid \boldsymbol{\alpha} = \mathbf{0})$, and slipping, $P(Y_j = 0 \mid \boldsymbol{\alpha} = \mathbf{1})$, parameters were randomly drawn from Unif(0.1, 0.2) and Unif(0.2, 0.3) for items with high and low discrimination (HD and LD), respectively. The other parameters in the reduced RUM were generated in such a way that each required attribute contributed equally to the probability of success. In addition, for the NPC method to perform adequately, the structure (i.e., conjunctive or disjunctive) of the items has to be known in advance (Chiu & Douglas, 2013). In the simulation studies, the conjunctive structure was assumed across all conditions, and the conjunctive ideal response $\eta = \prod_{k=1}^{K} \alpha_k^{q_{jk}}$ was calculated for each item. The number of attributes (K) was set to 5 and 8, and each entry of the Q-matrix was sampled from Bern(0.5). The fixed-length criterion with length $L = 4K$ was used to terminate the algorithms.

Examinees’ attribute profiles were generated either from a discrete uniform distribution with probability $1/2^K$ for each proficiency class or from a multivariate normal (MVN) threshold model (Chiu et al., 2009) with covariances of 0.5. Finally, because the items conformed to multiple CDMs, the item parameters were calibrated by fitting the calibration samples with the saturated G-DINA model to avoid model misfit. The R package GDINA was used to calibrate the item parameters.

Measures

The performances of the methods were evaluated with respect to classification accuracy and item exposure control. The classification accuracy was measured by the attribute profile recovery rates (RRs), computed as

$$RR = \frac{\sum_{i=1}^{N} I[\hat{\boldsymbol{\alpha}}_i = \boldsymbol{\alpha}_i]}{N}.$$

As implied, the methods were applied to a sample of N examinees, in this study N = 500, to produce smooth trajectories of the RRs along the test length. To provide more detailed information, the RRs for all methods under each condition were computed at cycles $K$, $2K$, $3K$, and $4K$, and then presented in the same graph.

To measure the effectiveness of the item exposure control, four indices were used. The first two indices were the unused rate (UR) and the overexposed rate (OR), where an item is considered overexposed when its exposure rate is larger than 0.2 (Mao & Xin, 2013; Zheng & Wang, 2017). The third index is the average between-test overlap rate (TOR; Chen, Ankenmann, & Spray, 2003). In the study, an approximation developed by Chen et al. (2003) was adopted to ease the computational burden of computing an exact TOR. Specifically, it is computed as

$$TOR = \frac{J}{L} S_r^2 + \frac{L}{J},$$

where $S_r^2$ is the sample variance of the exposure rates. Note that a high TOR implies low test security. The fourth index is the chi-square statistic, defined as

$$\chi^2 = \sum_{j=1}^{J} \frac{(r_j - \bar{r})^2}{\bar{r}},$$

where $r_j$ denotes the exposure rate of Item j and $\bar{r}$ is the expected exposure rate under random selection, defined as $\bar{r} = L/J = 4K/J$ in the study. The formula indicates that the larger the $\chi^2$, the less capable the algorithm is of selecting items randomly, which, in turn, implies low test security.
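All four indices reduce to a few lines of NumPy once the per-item administration counts are tallied; the following sketch uses the definitions above, with the overexposure threshold of 0.2 and $\bar{r} = L/J$, and illustrative names throughout.

```python
import numpy as np

def exposure_measures(counts, N, L):
    """UR, OR, TOR, and chi-square from per-item administration counts.

    counts : (J,) number of examinees who received each item
    N      : number of examinees
    L      : fixed test length
    """
    J = len(counts)
    r = counts / N                                # item exposure rates
    ur = float(np.mean(r == 0))                   # unused rate
    orate = float(np.mean(r > 0.2))               # overexposed rate (threshold 0.2)
    tor = (J / L) * np.var(r, ddof=1) + L / J     # approximate between-test overlap
    rbar = L / J                                  # expected rate under random selection
    chi2 = float(np.sum((r - rbar) ** 2 / rbar))
    return ur, orate, tor, chi2
```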

Result I: Comparisons With SHE, PWKL, PWCDI, and GDI Methods

Recall that a mechanism for controlling item exposure is automatically implemented in the NPS and WNPS algorithms but not in any of these four parametric methods. Although it is possible for the parametric methods to use an external item exposure control method, none of the developed methods is equivalent to the mechanism used in the nonparametric algorithms. Therefore, to avoid unfair comparisons, no external method was used to control item exposure for the parametric methods in the simulation; consequently, the results from the parametric methods represent the upper bounds of the RRs they could possibly produce in practice.

Figure 2 illustrates the RRs of the two nonparametric and four parametric item selection methods when the calibration sample size $N_0$ is 100. In this condition, the NPS and WNPS methods performed similarly, and both outperformed the parametric methods. Among the parametric methods, the GDI method outperformed the other three, and the SHE method performed slightly better than the PWKL and PWCDI methods. For the nonparametric methods, there was no significant difference in the RRs between the attribute structures; however, the parametric methods resulted in lower RRs when the attribute structure was uniform. In both instances, the NPS and WNPS methods resulted in RRs close to 0.85; however, the RRs for the GDI method were below 0.78, and those for the SHE, PWKL, and PWCDI methods went down to the 0.6s or worse.

Figure 2. Pattern-wise RRs for the SHE, PWKL, PWCDI, and GDI methods with HD items when $K = 5$ and $N_0 = 100$.

Note. PWKL = posterior-weighted Kullback–Leibler; PWCDI = posterior-weighted cognitive diagnostic model discrimination index; HD = high discrimination; NPS = nonparametric item selection; RR = recovery rate; GDI = G-DINA model discrimination index; G-DINA = generalized deterministic inputs, noisy “and” gate; WNPS = weighted NPS method; MVN = multivariate normal threshold model.

The results from $N_0 = 50$ were omitted because the effects of all the compared methods with this calibration sample size were, as expected, between those of $N_0 = 100$ and $N_0 = 30$.

When $N_0 = 30$, as shown in Figure 3, the NPS and WNPS methods considerably outperformed the parametric methods, and their corresponding RRs stayed high because their performance is independent of the calibration samples. The discrepancy between the nonparametric and parametric methods increased compared with the results in Figure 2. The parametric methods suffered noticeably from the small calibration samples, particularly when the attribute profiles were uniformly distributed. Across the two attribute structures, the RRs from the NPS and WNPS methods remained close to 0.85. The RRs from the GDI method dropped to 0.68 and 0.45 with the MVN and uniform attribute structures, respectively, and those from the SHE, PWKL, and PWCDI methods further dropped to the 0.5s and below 0.3s with the MVN and uniform attribute structures, respectively.

Figure 3. Pattern-wise RRs for the SHE, PWKL, PWCDI, and GDI methods with HD items when $K = 5$ and $N_0 = 30$.

Note. PWKL = posterior-weighted Kullback–Leibler; PWCDI = posterior-weighted cognitive diagnostic model discrimination index; GDI = G-DINA model discrimination index; G-DINA = generalized deterministic inputs, noisy “and” gate; HD = high discrimination; NPS = nonparametric item selection; RR = recovery rate.

To investigate the efficiency of the item selection methods when the number of attributes is large, a simulation with $K = 8$ was also conducted. However, due to the excessive CPU time some of the compared parametric methods required to compute their criteria for all items in the item bank, not all conditions were investigated when $K = 8$. Considering that the parametric methods performed best under the condition of $N_0 = 100$ with attributes following the multivariate normal threshold model when $K = 5$, only this condition was used when $K = 8$. The results presented here can serve as a benchmark for the parametric methods, and the results from the other conditions can be expected to be only worse.

As shown in the left graph of Figure 4, the RRs for the NPS and WNPS methods were between about 0.63 and 0.67, indicating that classification was more difficult when K increased. However, all parametric methods performed only slightly better than the random selection method, with RRs below 0.2, showing that when $K = 8$ and the calibration sample size was 100, these parametric methods did not behave much differently from random selection. The results further confirm the importance of the nonparametric methods when CD-CAT is used in a small-scale setting.

Figure 4. Pattern-wise RRs when $K = 8$.

Note. PWKL = posterior-weighted Kullback–Leibler; PWCDI = posterior-weighted cognitive diagnostic model discrimination index; GDI = G-DINA model discrimination index; G-DINA = generalized deterministic inputs, noisy “and” gate; RR = recovery rate; NPS = nonparametric item selection; DBS = dynamic binary searching; SDBS = stratified dynamic binary searching.

Result II: Comparisons With DBS and SDBS Methods

Because the randomization mechanism for controlling item exposure is also implemented in the DBS and SDBS methods, the four methods compared in this section were evaluated on the same basis. Figure 5 illustrates the RRs resulting from the studied methods when $N_0 = 100$ and the items had high discrimination.

Figure 5. Pattern-wise RRs for the NPS, WNPS, DBS, and SDBS methods with HD items when $K = 5$ and $N_0 = 100$.

Note. NPS = nonparametric item selection; DBS = dynamic binary searching; SDBS = stratified dynamic binary searching; HD = high discrimination; RR = recovery rate.

The graphs indicate that when the attribute structure was multivariate normal, the NPS and WNPS methods outperformed the DBS and SDBS methods. However, when the attribute structure was uniform, the four methods performed similarly well, with the NPS and WNPS methods only slightly better than the DBS and SDBS methods. When items had low discrimination, the performance of the compared methods across different test lengths followed patterns similar to those with high-discrimination items. The only noticeable difference was that the terminal RRs for the former condition were about 0.8, whereas those for the latter were about 0.6. To avoid redundancy, the graphs for the latter case are omitted.

Figure 6 illustrates the RRs resulting from the compared methods when $N_0 = 30$ and the items had high discrimination.

Figure 6. Pattern-wise RRs for the NPS, WNPS, DBS, and SDBS methods with HD items when $K = 5$ and $N_0 = 30$.

Note. NPS = nonparametric item selection; DBS = dynamic binary searching; SDBS = stratified dynamic binary searching; HD = high discrimination; RR = recovery rate.

The graphs showed that when $N_0 = 30$, the NPS and WNPS methods substantially outperformed the DBS and SDBS methods across the two attribute structures, with the RRs from the NPS and WNPS methods in the 0.8s and those from the DBS and SDBS methods in the 0.5s. When items of low discrimination were used, the performance of the four compared methods, although not shown here, followed similar patterns, but the RRs dropped to the 0.5s for the NPS and WNPS methods and the 0.3s for the DBS and SDBS methods.

Result III: Item Exposure Control

The results of item exposure control are reported in Table 1. The table shows that when $K = 5$, the NPS and WNPS methods performed similarly and had the lowest UR, OR, TOR, and $\chi^2$ among all compared methods across all conditions. The values resulting from the nonparametric methods were very close to those from the random selection method, which are omitted here. Specifically, the NPS and WNPS methods used all items in the item bank and had no overexposed items except in one condition (i.e., MVN and $N_0 = 100$), with only a few overused items. Regarding the parametric methods, the PWCDI method had the most severe item exposure problem in terms of the TOR and $\chi^2$ indices but did not seem to perform as poorly in terms of the OR index in some conditions. As for the DBS and SDBS methods, Table 1 also indicates that although they have a default exposure control mechanism similar to that of the NPS and WNPS methods, they could not control item exposure as well as the nonparametric methods in terms of any of the indices used in the study. In particular, the SDBS method performed the worst among the four compared methods, but both the DBS and SDBS methods performed, not surprisingly, better than the other parametric methods.

Table 1.

The Exposure Rates and χ² Index for the Nonparametric, SHE, PWKL, PWCDI, and GDI Methods.

| Condition | Index | NPS (I) | WNPS (I) | SHE | PWKL | PWCDI | GDI | NPS (II) | WNPS (II) | DBS | SDBS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| K = 5; N0 = 100; MVN | UR | 0 | 0 | 9.7 | 12.3 | 13.7 | 8.3 | 0 | 0 | 0.1 | 1.6 |
| | OR | 3.0 | 2.3 | 10.0 | 9.3 | 8.0 | 12.3 | 5.2 | 5.1 | 7.0 | 7.6 |
| | TOR | 11.0 | 11.5 | 31.0 | 72.4 | 83.8 | 35.2 | 11.8 | 11.9 | 15.6 | 20.5 |
| | χ² | 12.8 | 14.5 | 72.6 | 196.7 | 230.7 | 85.3 | 15.3 | 15.8 | 26.8 | 41.4 |
| K = 5; N0 = 100; UNIF | UR | 0 | 0 | 6.3 | 13.7 | 15.3 | 9.7 | 0 | 0 | 0 | 1.8 |
| | OR | 0 | 0 | 12.3 | 13.3 | 11.3 | 11.3 | 3.2 | 3.3 | 8.4 | 6.2 |
| | TOR | 9.4 | 9.4 | 27.0 | 41.7 | 59.8 | 32.8 | 11.2 | 11.1 | 15.2 | 19.0 |
| | χ² | 8.1 | 8.1 | 60.8 | 104.8 | 158.8 | 78.1 | 13.4 | 13.1 | 25.7 | 36.7 |
| K = 5; N0 = 30; MVN | UR | 0 | 0 | 8.0 | 15.7 | 16.0 | 8.3 | 0 | 0 | 0.3 | 2.3 |
| | OR | 0 | 0 | 10.0 | 9.7 | 7.3 | 11.0 | 0 | 0 | 5.3 | 5.7 |
| | TOR | 9.8 | 9.5 | 25.1 | 57.9 | 78.3 | 30.0 | 9.7 | 9.5 | 15.7 | 19.4 |
| | χ² | 9.4 | 8.4 | 55.2 | 153.1 | 214.1 | 69.7 | 9.0 | 8.4 | 27.1 | 38.0 |
| K = 5; N0 = 30; UNIF | UR | 0 | 0 | 6.0 | 13.7 | 13.3 | 8.3 | 0 | 0 | 0 | 1.3 |
| | OR | 0 | 0 | 11.0 | 12.0 | 12.0 | 12.3 | 0.3 | 0 | 8.3 | 8.3 |
| | TOR | 8.4 | 8.5 | 25.0 | 38.0 | 45.8 | 29.0 | 9.2 | 9.0 | 15.5 | 16.9 |
| | χ² | 5.1 | 5.6 | 54.7 | 93.6 | 117.1 | 66.9 | 7.5 | 7.0 | 26.5 | 30.6 |
| K = 8; N0 = 100; MVN | UR | 0.3 | 0.3 | 1.3 | 3.0 | 3.7 | 2.0 | 0 | 0 | 0.3 | 1.7 |
| | OR | 12.3 | 13.0 | 18.3 | 17.7 | 16.3 | 15.7 | 12.8 | 14.0 | 13.8 | 16.7 |
| | TOR | 22.4 | 22.6 | 23.0 | 31.3 | 37.1 | 24.9 | 28.7 | 29.0 | 43.2 | 37.4 |
| | χ² | 35.0 | 35.7 | 37.0 | 61.7 | 78.9 | 42.6 | 54.0 | 55.0 | 97.1 | 79.9 |

Note. Columns labeled (I) are from Study I; columns NPS (II) through SDBS are from Study II. All the rates (i.e., OR, UR, and TOR) are expressed as percentages. PWKL = posterior-weighted Kullback–Leibler; PWCDI = posterior-weighted cognitive diagnostic model discrimination index; NPS = nonparametric item selection; DBS = dynamic binary searching; SDBS = stratified dynamic binary searching; UR = unused rate; OR = overexposed rate; TOR = between-test overlap rate; GDI = G-DINA model discrimination index; G-DINA = generalized deterministic inputs, noisy “and” gate; UNIF = uniform distribution; MVN = multivariate normal threshold model.

The last panel of Table 1 shows that when $K = 8$, the NPS and WNPS methods again outperformed the other methods in terms of item exposure control. However, although there were no unused items for the nonparametric methods, the OR, TOR, and $\chi^2$ values were elevated, indicating that it was more difficult for the nonparametric methods to control item exposure when K was large.

Discussion

The NPS method was developed in this article to select items in the CD-CAT system by first using the NPC method (Chiu & Douglas, 2013) to classify examinees and then, for each examinee, selecting an item that best differentiates the examinee’s estimated attribute profile from the other attribute profiles based on the distance between the examinee’s observed and ideal responses. The examinee’s attribute profile is updated using the NPC method again after the new response is added to the response vector. The algorithm iterates until the stopping criterion is met.

The key concepts provided in this article confirm and support the legitimacy of using the NPS method in a CD-CAT system. The simulation study further demonstrated the effectiveness and efficiency of the NPS and WNPS algorithms and systematically verified that they outperform the parametric methods when the calibration samples are small. It was also reported that the NPS and WNPS algorithms yielded UR, OR, TOR, and $\chi^2$ values substantially lower than those from the parametric methods and almost as low as those from the random selection method across all conditions, indicating effective control of item exposure.

Recall that the estimator obtained using the NPC method is statistically consistent if $P(Y_j = 1 \mid \eta = 0) < 0.5$ and $P(Y_j = 1 \mid \eta = 1) > 0.5$. These assumptions hold by default for items generated from the DINA model; however, they may or may not hold for a complex CDM, such as the reduced RUM. The authors would like to point out that, in the simulation studies, the items conforming to the reduced RUM were generated in such a way that these assumptions held on average. Specifically, when $K^* = 2$ and 3, the average probabilities of success for the collapsed latent classes used to generate the data were $(P(00), P(10), P(01), P(11)) = (0.15, 0.36, 0.36, 0.85)$ and $(P(000), P(100), P(010), P(001), P(110), P(101), P(011), P(111)) = (0.15, 0.27, 0.27, 0.27, 0.48, 0.48, 0.48, 0.85)$, respectively (note that single-attribute items are not considered here because the estimation results from different CDMs are identical). These probabilities indicate that $P(Y_j = 1 \mid \eta = 0) < 0.5$ and $P(Y_j = 1 \mid \eta = 1) > 0.5$. Although the reduced RUM items satisfied the conjunctive assumptions, they had low discrimination due to large probabilities such as 0.36 and 0.48. Nevertheless, the NPC method performed sufficiently well that the NPS method outperformed the parametric methods used for comparison.

It is worth mentioning that the simulation required all examinees in the calibration sample to respond to all the items in the item bank for all the parametric methods and the WNPS algorithm, which is somewhat unrealistic. In this regard, the advantage of the NPS method, namely that no calibration sample is needed, cannot be overstated.

However, an issue remains to be addressed here. The Q-optimal criterion used in this article was developed for the parametric item selection methods that use maximum likelihood estimation of $\boldsymbol{\alpha}$. It is unknown whether the same criterion can also be applied to the NPS method. Further investigation is called for to either establish the legitimacy of using Equation 11 as a Q-optimal criterion for the NPS method or develop a Q-optimal criterion suitable for the NPS method.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project is supported by the National Science Foundation under Grant No. 1552563.

References

  1. Chang H.-H., Ying Z. (1999). α-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.
  2. Chen S.-Y., Ankenmann R. D., Spray J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129-145.
  3. Cheng Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74, 619-632.
  4. Cheng Y., Chang H.-H. (2007, April). Dual information method in cognitive diagnostic computerized adaptive testing. Paper presented at the Meeting of the National Council on Measurement in Education, Chicago, IL.
  5. Chiu C.-Y., Douglas J. (2013). A nonparametric approach to cognitive diagnosis by proximity to ideal response patterns. Journal of Classification, 30, 225-250.
  6. Chiu C.-Y., Douglas J. A., Li X. (2009). Cluster analysis for cognitive diagnosis: Theory and applications. Psychometrika, 74, 633-665.
  7. de la Torre J. (2011). The generalized DINA model framework. Psychometrika, 76, 179-199.
  8. de la Torre J., Chiu C.-Y. (2016). A general method of empirical Q-matrix validation. Psychometrika, 81, 253-273.
  9. Georgiadou E. G., Triantafillou E., Economides A. A. (2007). A review of item exposure control strategies for computerized adaptive testing developed from 1983 to 2005. The Journal of Technology, Learning and Assessment, 5, 4-28.
  10. Hartz S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality (Doctoral dissertation). University of Illinois at Urbana–Champaign, Urbana-Champaign, IL.
  11. Hartz S. M., Roussos L. A. (2008, October). The fusion model for skill diagnosis: Blending theory with practicality (Research Report No. RR-08-71). Princeton, NJ: Educational Testing Service.
  12. Henson R., Douglas J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29, 262-277.
  13. Henson R. A., Templin J. L., Willse J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191-210.
  14. Junker B. W., Sijtsma K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258-272.
  15. Kaplan M., de la Torre J., Barrada J. R. (2015). New item selection methods for cognitive diagnosis computerized adaptive testing. Applied Psychological Measurement, 39, 167-188.
  16. Köhn H.-F., Chiu C.-Y. (2017). A procedure for assessing the completeness of the Q-matrices of cognitively diagnostic tests. Psychometrika, 82, 112-132.
  17. Liu H., You X., Wang W., Ding S., Chang H.-H. (2013). The development of computerized adaptive testing with cognitive diagnosis for an English achievement test in China. Journal of Classification, 30, 152-172.
  18. MacReady G. B., Dayton C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99-120.
  19. Mao X., Xin T. (2013). The application of the Monte Carlo approach to cognitive diagnostic computerized adaptive testing with content constraints. Applied Psychological Measurement, 37, 482-496.
  20. Ponsoda V., Olea J. (2003). Adaptive and tailored testing (including IRT and non-IRT application). In Fernándes-Ballesteros R. (Ed.), Encyclopedia of psychological assessment (pp. 10-13). London, England: Sage.
  21. Rupp A. A., Templin J., Henson R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
  22. Shannon C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.
  23. Tatsuoka C. (2002). Data analytic methods for latent partially ordered classification models. Applied Statistics, 51, 337-350.
  24. Tatsuoka C., Ferguson T. (2003). Sequential classification on partially ordered sets. Journal of the Royal Statistical Society: Series B, 65, 143-157.
  25. Tatsuoka K. K. (1985). A probabilistic model for diagnosing misconceptions by the pattern classification approach. Journal of Educational Statistics, 10, 55-73.
  26. Tatsuoka K. K., Tatsuoka M. M. (1997). Computerized cognitive diagnostic adaptive testing: Effect on remedial instruction as empirical validation. Journal of Educational Measurement, 34, 3-20.
  27. von Davier M. (2005). A general diagnostic model applied to language testing data (Research Report No. RR-05-16). Princeton, NJ: Educational Testing Service.
  28. von Davier M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287-307.
  29. Wang C. (2013). Mutual information item selection method in cognitive diagnostic computerized adaptive testing with short test length. Educational and Psychological Measurement, 73, 1017-1035.
  30. Wang C., Chang H.-H., Huebner A. (2011). Restrictive stochastic item selection methods in cognitive diagnostic computerized adaptive testing. Journal of Educational Measurement, 48, 255-273.
  31. Wang S., Douglas J. (2015). Consistency of nonparametric classification in cognitive diagnosis. Psychometrika, 80, 85-100.
  32. Xu G., Wang C., Shang Z. (2016). On initial item selection in cognitive diagnostic computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 69, 291-315.
  33. Zheng C., Chang H.-H. (2016). High-efficiency response distribution-based item selection algorithms for short-length cognitive diagnostic computerized adaptive testing. Applied Psychological Measurement, 40, 608-624.
  34. Zheng C., Wang C. (2017). Application of binary searching for item exposure control in cognitive diagnostic computerized adaptive testing. Applied Psychological Measurement, 41, 561-576.
