Abstract
Sequential procedures have been shown to be effective methods for real-time detection of compromised items in computerized adaptive testing. In this study, we propose three item response theory-based sequential procedures that involve the use of item scores and response times (RTs). The first procedure requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure requires that a combined score and RT-based statistic be extreme. Results suggest that the third procedure is the most promising, providing a reasonable balance between the false-positive rate and the true-positive rate while also producing relatively short lag times across a wide range of simulation conditions.
Keywords: computerized adaptive testing, item compromise, item response theory, response time, test security
Introduction
In computerized adaptive testing (CAT), highly informative items tend to be selected more frequently, increasing their exposure and risk of compromise. When items become compromised, examinees with preknowledge (EWP) may gain an unfair advantage, potentially leading to score inflation and threatening the validity of test score interpretations. However, one benefit of CAT is that it allows for real-time detection and intervention. If an item is identified as compromised, the algorithm can exclude it from further administration, limiting its impact on future examinees.
To enable the detection of compromised items (CIs), several statistical procedures have been developed that monitor changes in item performance over time, including cumulative sum (CUSUM) procedures, change-point analysis (CPA) procedures, and sequential procedures. CUSUM procedures have been proposed to detect item parameter drift (Veerkamp & Glas, 2000) and to monitor changes in item residuals (Lee & Lewis, 2021; Lee et al., 2014). However, the former procedures, while useful, focus on changes in the item parameters and therefore assume that the model still holds. The latter procedures do not make this assumption, but they have not yet been applied to data beyond the item scores. In CAT, additional sources of data, such as item response times (RTs), are usually also available and may help improve detection rates. Similar to CUSUM procedures, CPA procedures also monitor changes in item performance over time (Du et al., 2023; Du & Zhang, 2025). These procedures have been extended for use with item scores and RTs; however, it is unknown whether they are appropriate for CAT data because, in CAT, missingness is not completely at random. Finally, sequential procedures have been proposed that use a series of statistical hypothesis tests to monitor whether item performance has changed significantly over time. These procedures have been applied using both item scores and RTs and have been shown to work well in a CAT environment. Thus, the use of sequential procedures is explored in this study.
Sequential procedures can generally be grouped into two categories: item response theory (IRT)-based procedures and non-IRT-based procedures. Non-IRT-based sequential procedures have been proposed that use item scores only, item RTs only, and both item scores and RTs (Choe et al., 2018; Zhang, 2014). Meanwhile, IRT-based sequential procedures have been proposed that use item scores only and item RTs only (Choe et al., 2018; Zhang & Li, 2016), but not item scores and RTs. Given that (a) Zhang and Li (2016) found that IRT-based sequential procedures tend to outperform non-IRT-based sequential procedures and (b) Choe et al. (2018) found that non-IRT-based sequential procedures using item scores and RTs tend to outperform non-IRT-based sequential procedures using item scores only or item RTs only, it is reasonable to suspect that an IRT-based sequential procedure incorporating both item scores and RTs would outperform existing methods. The purpose of this paper is to investigate this hypothesis and fill this gap in the literature by answering the following research questions:
Do IRT-based sequential procedures that incorporate both item scores and RTs outperform existing sequential procedures?
What factors affect the performance of the new procedures?
In the following section, we introduce three IRT-based sequential procedures for item scores and RTs. The subsequent section presents the simulation study, in which we compare the performance of the newly proposed sequential procedures with that of existing procedures. Finally, we conclude with a discussion of research contributions, suggestions for implementation, and limitations.
Method
In recent years, several IRT- and non-IRT-based sequential procedures have been proposed that involve the use of item scores and/or RTs (Choe et al., 2018; Zhang, 2014; Zhang & Li, 2016). In what follows, we first review common models for item scores and RTs. We then describe two existing IRT-based sequential procedures for item scores only and item RTs only. Finally, we introduce three new IRT-based sequential procedures for item scores and RTs. Note that descriptions of non-IRT-based sequential procedures for item scores only, item RTs only, and item scores and RTs are found in the Appendix.
Models for Item Scores and RTs
Let $i = 1, \ldots, I$ denote the items in an item pool, and let $X_{ij}$ and $T_{ij}$ denote the score and log RT, respectively, of person $j$ on item $i$. If, for example, the Rasch model is used to model the item scores, the probability of a correct response is assumed to be
$$P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}, \quad (1)$$
where $\theta_j$ is the ability parameter of person $j$, and $b_i$ is the difficulty parameter of item $i$. Similarly, if the lognormal model (van der Linden, 2006) is used to model the RTs, the density of the log RT is assumed to be
$$f(T_{ij}; \tau_j, \alpha_i, \beta_i) = \frac{\alpha_i}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left[ \alpha_i \left( T_{ij} - (\beta_i - \tau_j) \right) \right]^2 \right\}, \quad (2)$$
where $\tau_j$ is the speed parameter of person $j$, and $\alpha_i$ and $\beta_i$ are the time discrimination and time intensity parameters, respectively, of item $i$. Both the Rasch model and the lognormal model can be plugged into the hierarchical framework of van der Linden (2007), in which joint distributions are assumed for the person and/or item parameters. In the remainder of this paper, the item subscript $i$ will be dropped for notational convenience.
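To make the two models concrete, the following sketch draws a score and a log RT for a single person-item pair. This is a minimal illustration of Equations (1) and (2), not code from the paper; the function and variable names are our own.

```python
import math
import random

def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_score(theta: float, b: float, rng: random.Random) -> int:
    """Draw a dichotomous item score from the Rasch probability."""
    return int(rng.random() < rasch_prob(theta, b))

def simulate_log_rt(tau: float, alpha: float, beta: float, rng: random.Random) -> float:
    """Draw a log RT under the lognormal model (Equation 2): the log RT is
    normal with mean beta - tau and standard deviation 1/alpha."""
    return rng.gauss(beta - tau, 1.0 / alpha)
```

Note that a higher speed parameter `tau` shifts the expected log RT downward (faster responses), while a higher time discrimination `alpha` reduces its spread.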
IRT-Based Sequential Procedure for Item Scores Only
Zhang and Li (2016) proposed an IRT-based sequential procedure that tests whether a dichotomous item is getting easier over time—a phenomenon that may occur if the item has been compromised. The procedure monitors the difficulty of the item across multiple time points by comparing the observed proportion of correct responses to the expected proportion of correct responses for a moving sample of examinees. The moving sample is taken as the most recent $m$ examinees who were administered the item, where $m$ is the moving sample size. Thus, the observed proportion of correct responses for the moving sample is given by
$$\bar{X} = \frac{1}{m} \sum_{j=n-m+1}^{n} X_j, \quad (3)$$
where $n$ is the total number of examinees to whom the item has been administered.
Under the null hypothesis that the item has not been compromised, the test statistic
$$Z_S = \frac{\bar{X} - E(\bar{X})}{SD(\bar{X})} \quad (4)$$
has an asymptotic standard normal distribution, where $E(\bar{X})$ and $SD(\bar{X})$ denote the expected value and standard deviation, respectively, of $\bar{X}$, which are given by
$$E(\bar{X}) = \frac{1}{m} \sum_{j=n-m+1}^{n} P_j \quad \text{and} \quad SD(\bar{X}) = \frac{1}{m} \sqrt{\sum_{j=n-m+1}^{n} P_j (1 - P_j)},$$
where $P_j = P(X_j = 1 \mid \theta_j)$ is the Rasch probability in Equation (1).
In practice, the ability parameters $\theta_j$ of the examinees are unknown, and therefore, $Z_S$ cannot be computed. However, as noted by Zhang and Li (2016), the test statistic can be approximated by replacing each true ability $\theta_j$ with an estimate $\hat{\theta}_j$.
To monitor the difficulty of the item across multiple time points, one can compute $Z_S$ each time the item is administered. Extreme positive values of $Z_S$ indicate that the item was much easier than expected in recent administrations and may therefore have been compromised.
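The computation of the score-based statistic can be sketched as follows, with estimated abilities plugged in for the true ones as described above. The function name and argument layout are illustrative, not from Zhang and Li (2016).

```python
import math

def z_score_statistic(scores, theta_hats, b, m):
    """Score-based statistic Z_S (Equations 3 and 4) for the most recent m
    examinees administered an item with difficulty b.
    scores / theta_hats: the item's full administration history, oldest first."""
    x = scores[-m:]
    # Rasch probability P_j for each examinee in the moving sample
    p = [1.0 / (1.0 + math.exp(-(t - b))) for t in theta_hats[-m:]]
    x_bar = sum(x) / m                                   # observed proportion correct
    mean = sum(p) / m                                    # E(X-bar)
    sd = math.sqrt(sum(q * (1.0 - q) for q in p)) / m    # SD(X-bar)
    return (x_bar - mean) / sd
```

For example, if the last 25 examinees all answered correctly while each had only a 0.5 chance of doing so, the statistic equals 5, a strong signal that the item has become easier than the model predicts.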
IRT-Based Sequential Procedure for Item RTs Only
Choe et al. (2018) proposed an IRT-based sequential procedure that involves the use of item RTs. The procedure is similar to that which is used for the item scores, but rather than testing whether an item is getting easier over time, it tests whether the item is being answered more quickly over time. The procedure monitors the time intensity of the item by comparing the observed average log RT to the expected average log RT for a moving sample of examinees. The observed average log RT for the moving sample is given by
$$\bar{T} = \frac{1}{m} \sum_{j=n-m+1}^{n} T_j. \quad (5)$$
Under the null hypothesis that the item has not been compromised, the test statistic
$$Z_T = \frac{\bar{T} - E(\bar{T})}{SD(\bar{T})} \quad (6)$$
has an asymptotic standard normal distribution, where $E(\bar{T})$ and $SD(\bar{T})$ denote the expected value and standard deviation, respectively, of $\bar{T}$. If it is assumed that the lognormal model fits the RTs, then
$$E(\bar{T}) = \frac{1}{m} \sum_{j=n-m+1}^{n} (\beta - \tau_j) \quad \text{and} \quad SD(\bar{T}) = \frac{1}{\alpha \sqrt{m}}.$$
In practice, the speed parameters $\tau_j$ of the examinees are unknown, and therefore, $Z_T$ cannot be computed. However, as noted by Choe et al. (2018), the test statistic can be approximated by replacing each true speed $\tau_j$ with an estimate $\hat{\tau}_j$.
To monitor the time intensity of the item across multiple time points, one can compute $Z_T$ each time the item is administered. Extreme negative values of $Z_T$ indicate that the item was answered much more quickly than expected in recent administrations and may therefore have been compromised.
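A parallel sketch for the RT-based statistic, again with estimated speeds substituted for the true ones; names are illustrative.

```python
import math

def z_rt_statistic(log_rts, tau_hats, alpha, beta, m):
    """RT-based statistic Z_T (Equations 5 and 6) for the most recent m
    examinees. Under the lognormal model, each log RT T_j is normal with
    mean beta - tau_j and standard deviation 1/alpha."""
    t_bar = sum(log_rts[-m:]) / m                          # observed average log RT
    mean = sum(beta - tau for tau in tau_hats[-m:]) / m    # E(T-bar)
    sd = 1.0 / (alpha * math.sqrt(m))                      # SD(T-bar)
    return (t_bar - mean) / sd
```

Responses that are faster than the model expects push the statistic negative, which is the direction consistent with compromise.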
IRT-Based Sequential Procedures for Item Scores and RTs
In this paper, we propose three IRT-based sequential procedures that involve the use of item scores and RTs. The first procedure requires that either $Z_S$ or $Z_T$ be extreme in recent administrations in order for an item to be flagged as potentially compromised, while the second procedure requires that both $Z_S$ and $Z_T$ be extreme and is therefore more conservative.
The third procedure uses a different approach that involves combining the $Z_S$ and $Z_T$ statistics to create a new test statistic. By creating a combined statistic, we hope to control the false positive rate (FPR) by producing values that are neither too liberal nor too conservative. One possible way to combine the two statistics is by computing the sum of their squares ($Z_S^2 + Z_T^2$). However, the sum of squared statistics is inappropriate for identifying CI, since it implies that extreme negative values of $Z_S$ and extreme positive values of $Z_T$ also indicate potential compromise. Note that these conditions respectively correspond to situations in which an item was much more difficult than expected or answered much more slowly than expected in recent administrations. Therefore, rather than using the sum of squared statistics, we use the test statistic $Z_{ST}$ that is given by
$$Z_{ST} = (Z_S^+)^2 + (Z_T^-)^2, \quad (7)$$
where $Z_S^+ = \max(Z_S, 0)$ and $Z_T^- = \min(Z_T, 0)$. A similar strategy has been used, for example, by Sinharay and Johnson (2020) in the context of detecting EWP. Extreme positive values of $Z_{ST}$ indicate that the item was much easier and/or answered much more quickly than expected in recent administrations and may therefore have been compromised.
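The truncation logic can be sketched in a few lines; this is a reading of Equation (7) in which only the compromise-consistent directions contribute, with illustrative names.

```python
def z_combined(z_s: float, z_t: float) -> float:
    """Combined ST-3 statistic: square and sum only the directions
    consistent with compromise (item easier and/or answered faster
    than expected)."""
    z_s_plus = max(z_s, 0.0)   # keep only "easier than expected"
    z_t_minus = min(z_t, 0.0)  # keep only "faster than expected"
    return z_s_plus ** 2 + z_t_minus ** 2
```

An item that is harder and slower than expected (negative $Z_S$, positive $Z_T$) therefore yields a combined statistic of zero rather than a spurious flag.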
Simulation Study
Design
We conducted a simulation study to compare the performance of the IRT- and non-IRT-based sequential procedures using each of the five flagging methods: item scores only (S), item RTs only (T), and item scores and RTs (ST-1, ST-2, and ST-3). Among the methods that involved the use of item scores and RTs, the first method (ST-1) required that either the score-based statistic or the RT-based statistic be extreme, therefore using a union-based approach; the second method (ST-2) required that both the score-based statistic and the RT-based statistic be extreme, therefore using an intersection-based approach; and the third method (ST-3) required that the (combined) score and RT-based statistic be extreme.
Several factors were manipulated in the simulation study, including test length (20, 40), significance level (0.01, 0.05), percentage of CI (10, 20, 40), percentage of EWP (10, 30), and moving sample size (20, 50), which were chosen to be similar to previous research (e.g., Gorney et al., 2025). These five factors were fully crossed, yielding 48 conditions. In addition, eight null conditions—one for each combination of test length, significance level, and moving sample size—were simulated, in which none of the items were compromised (and therefore, no examinees had preknowledge). Thus, 56 conditions (48 + 8) were studied in total, where each condition was replicated 100 times.
The uncontaminated scores and RTs were generated using the Rasch and lognormal models, respectively, with settings similar to those used in previous research (e.g., Gorney et al., 2025; Gorney & Wollack, 2025). Specifically, for each test length, an item pool 10 times the length of the test was simulated, with the item difficulty, time intensity, and time discrimination parameters sampled from the same distributions used in that research. Then, for each replication, the person ability and speed parameters of 5,000 examinees were sampled from the corresponding person distributions.
To generate contaminated scores and RTs, CIs were randomly selected from all items in the item pool. Meanwhile, EWP were selected with probabilities proportional to the order in which they took the test: the first examinee had the smallest probability of having preknowledge, while the last examinee had the largest. When EWP were administered CI, the probability of a correct response was set equal to 0.90, and the mean of the log RT distribution was set equal to 75% of its original value, that is, $0.75(\beta - \tau)$. Again, all settings were chosen to be similar to those used in previous research (e.g., Gorney et al., 2025; Gorney & Wollack, 2025).
When administering each CAT, the initial item was randomly selected from the 10 maximally informative items at $\theta = 0$. Subsequent items were selected using the maximum Fisher information criterion at the interim ability estimate, subject to a maximum item exposure rate of 0.20. All interim and final ability and speed parameter estimates were obtained using maximum likelihood estimation, with estimates bounded between −4 and 4.
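Under the Rasch model, Fisher information $P(1 - P)$ is largest for the item whose difficulty is closest to the interim ability estimate, so the selection step above can be sketched as follows. This is a simplified illustration under our stated assumptions (Rasch information, a hard exposure cap), not the paper's exact algorithm; the names are our own.

```python
def select_next_item(theta_hat, difficulties, administered, exposure_rates,
                     max_exposure=0.20):
    """Maximum Fisher information selection for the Rasch model: among items
    not yet administered in this test and below the exposure cap, pick the
    one whose difficulty is closest to the interim ability estimate."""
    eligible = [i for i in range(len(difficulties))
                if i not in administered and exposure_rates[i] < max_exposure]
    # Rasch information P(1-P) is maximized where |b_i - theta_hat| is smallest
    return min(eligible, key=lambda i: abs(difficulties[i] - theta_hat))
```

In a full implementation, the exposure rates would be updated continuously as examinees are tested, and flagged items would simply be removed from the eligible set.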
To determine whether the value of a test statistic should be considered “extreme,” critical values were determined using Monte Carlo simulations. For a given item pool, this process involved simulating a null data set (one in which none of the items were compromised) and computing the test statistics starting at an initial monitoring point of 60 administrations. Then, for each item, the maximum value (for the score-based and combined score and RT-based statistics) or minimum value (for the RT-based statistics) of each test statistic across time points was recorded. This process of simulating and analyzing null data was repeated 1,000 times, and the resulting 95th and 99th percentiles (for the score-based and combined statistics) or 5th and 1st percentiles (for the RT-based statistics) for each item were taken as the critical values at the 0.05 and 0.01 significance levels, respectively.
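The percentile step can be sketched as follows for an upper-tailed statistic, given the per-replication maxima for one item; a nearest-rank percentile is assumed here, and the function name is illustrative. For the RT-based statistics, the same idea applies to the per-replication minima with the lower ($\alpha$-level) percentile.

```python
import math

def critical_value(null_maxima, alpha=0.05):
    """Critical value for one item: the (1 - alpha) nearest-rank percentile
    of the per-replication maxima of the statistic under no compromise."""
    ordered = sorted(null_maxima)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1  # 0-based nearest-rank index
    return ordered[k]
```

With 1,000 null replications and alpha = 0.05, this returns the 950th largest maximum, so that at most 5% of null replications would ever exceed the critical value at any monitoring point.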
Performance was evaluated using the false positive rate (FPR), the true positive rate (TPR), and lag. The FPR was computed as the proportion of secure items that were incorrectly flagged as compromised, while the TPR was computed as the proportion of CI that were correctly flagged as such. Lag was computed as the average number of examinees who were administered an item after it was compromised and before it was flagged.
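These three outcome measures are straightforward to compute once the flagging decisions are known; the following sketch uses illustrative names and assumes the per-item lags have already been tallied by the simulation.

```python
def false_positive_rate(flagged, compromised, all_items):
    """Proportion of secure items that were incorrectly flagged."""
    secure = set(all_items) - set(compromised)
    return len(set(flagged) & secure) / len(secure)

def true_positive_rate(flagged, compromised):
    """Proportion of compromised items that were correctly flagged."""
    return len(set(flagged) & set(compromised)) / len(compromised)

def average_lag(lags):
    """Mean number of examinees administered a compromised item between its
    compromise and its flagging, over the flagged compromised items only."""
    return sum(lags) / len(lags)
```

Because the lag is averaged over flagged items only, it is undefined when no compromised item is flagged—the case marked with a dash in the lag tables.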
Results
Results indicate that the FPRs and TPRs are similar for the two test lengths (20, 40), so we present the results for a test length of 20 in this section and include the results for a test length of 40 in the Supplemental Materials.
False Positive Rates
The FPRs are shown in Table 1 for a significance level of 0.05 and Table 2 for a significance level of 0.01. In both tables, it can be seen that the FPRs tend to be smaller for the IRT-based sequential procedures than for the non-IRT-based sequential procedures when item compromise is present. One potential explanation for this result is as follows. Consider that when the IRT-based sequential procedures are used, item flagging depends on the ability and/or speed estimates of the examinees. If some of the items have been compromised, the ability and/or speed estimates of the EWP would likely be positively biased. As a result, the examinees may be expected to perform better and faster on the secure items than they actually did, producing a signal that is the opposite of what we are looking for when detecting CI. Therefore, reduced FPRs are expected, especially as the percentage of CI increases.
Table 1.
False Positive Rates (Significance Level = 0.05, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 0 | 0 | 20 | 0.050 | 0.051 | 0.099 | 0.000 | 0.048 | 0.047 | 0.054 | 0.098 | 0.000 | 0.052 |
| | | 50 | 0.050 | 0.047 | 0.094 | 0.000 | 0.049 | 0.051 | 0.051 | 0.098 | 0.000 | 0.052 |
| 10 | 10 | 20 | 0.038 | 0.036 | 0.073 | 0.000 | 0.034 | 0.042 | 0.045 | 0.085 | 0.000 | 0.045 |
| | | 50 | 0.037 | 0.033 | 0.068 | 0.000 | 0.034 | 0.042 | 0.045 | 0.086 | 0.000 | 0.045 |
| | 30 | 20 | 0.030 | 0.024 | 0.053 | 0.000 | 0.024 | 0.042 | 0.046 | 0.087 | 0.000 | 0.044 |
| | | 50 | 0.028 | 0.018 | 0.046 | 0.000 | 0.019 | 0.039 | 0.044 | 0.081 | 0.000 | 0.048 |
| 20 | 10 | 20 | 0.030 | 0.026 | 0.055 | 0.000 | 0.028 | 0.035 | 0.038 | 0.072 | 0.000 | 0.040 |
| | | 50 | 0.031 | 0.024 | 0.053 | 0.000 | 0.024 | 0.037 | 0.040 | 0.075 | 0.000 | 0.041 |
| | 30 | 20 | 0.021 | 0.012 | 0.033 | 0.000 | 0.014 | 0.035 | 0.036 | 0.070 | 0.000 | 0.041 |
| | | 50 | 0.017 | 0.009 | 0.026 | 0.000 | 0.011 | 0.031 | 0.033 | 0.063 | 0.000 | 0.042 |
| 40 | 10 | 20 | 0.023 | 0.021 | 0.043 | 0.000 | 0.021 | 0.031 | 0.030 | 0.060 | 0.000 | 0.036 |
| | | 50 | 0.022 | 0.018 | 0.039 | 0.000 | 0.018 | 0.028 | 0.031 | 0.058 | 0.000 | 0.035 |
| | 30 | 20 | 0.010 | 0.006 | 0.016 | 0.000 | 0.006 | 0.024 | 0.025 | 0.047 | 0.000 | 0.032 |
| | | 50 | 0.007 | 0.005 | 0.012 | 0.000 | 0.005 | 0.021 | 0.023 | 0.043 | 0.000 | 0.037 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size.
Table 2.
False Positive Rates (Significance Level = 0.01, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 0 | 0 | 20 | 0.010 | 0.012 | 0.022 | 0.000 | 0.009 | 0.011 | 0.013 | 0.023 | 0.000 | 0.012 |
| | | 50 | 0.010 | 0.011 | 0.021 | 0.000 | 0.011 | 0.010 | 0.012 | 0.022 | 0.000 | 0.010 |
| 10 | 10 | 20 | 0.008 | 0.006 | 0.015 | 0.000 | 0.007 | 0.009 | 0.009 | 0.018 | 0.000 | 0.010 |
| | | 50 | 0.008 | 0.007 | 0.014 | 0.000 | 0.007 | 0.009 | 0.010 | 0.018 | 0.000 | 0.010 |
| | 30 | 20 | 0.006 | 0.004 | 0.010 | 0.000 | 0.004 | 0.008 | 0.010 | 0.018 | 0.000 | 0.009 |
| | | 50 | 0.005 | 0.003 | 0.009 | 0.000 | 0.004 | 0.008 | 0.010 | 0.017 | 0.000 | 0.010 |
| 20 | 10 | 20 | 0.006 | 0.006 | 0.012 | 0.000 | 0.006 | 0.007 | 0.007 | 0.014 | 0.000 | 0.009 |
| | | 50 | 0.006 | 0.004 | 0.010 | 0.000 | 0.005 | 0.007 | 0.008 | 0.016 | 0.000 | 0.009 |
| | 30 | 20 | 0.004 | 0.002 | 0.006 | 0.000 | 0.002 | 0.008 | 0.008 | 0.016 | 0.000 | 0.009 |
| | | 50 | 0.003 | 0.002 | 0.004 | 0.000 | 0.002 | 0.007 | 0.007 | 0.014 | 0.000 | 0.009 |
| 40 | 10 | 20 | 0.005 | 0.005 | 0.010 | 0.000 | 0.004 | 0.006 | 0.007 | 0.013 | 0.000 | 0.008 |
| | | 50 | 0.005 | 0.004 | 0.009 | 0.000 | 0.004 | 0.006 | 0.008 | 0.013 | 0.000 | 0.008 |
| | 30 | 20 | 0.002 | 0.001 | 0.003 | 0.000 | 0.001 | 0.006 | 0.005 | 0.011 | 0.000 | 0.007 |
| | | 50 | 0.001 | 0.001 | 0.002 | 0.000 | 0.001 | 0.005 | 0.005 | 0.009 | 0.000 | 0.009 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size.
The second major finding is that the FPRs tend to be close to or slightly smaller than the nominal level when either the S, T, or ST-3 flagging method is used. However, the FPRs for the ST-1 flagging method tend to be larger than the nominal level, a result that is not surprising given that this method uses a union-based approach. Meanwhile, the FPRs for the ST-2 flagging method are much smaller than the nominal level and, in fact, are very close to or equal to zero. This result is also not surprising given that this method uses an intersection-based approach. Based on all of these results, it would seem that the S, T, and ST-3 flagging methods should be preferred. In addition, practitioners should avoid using the ST-1 flagging method if controlling the FPR is a priority.
True Positive Rates
The TPRs are shown in Table 3 for a significance level of 0.05 and Table 4 for a significance level of 0.01. The general patterns are similar for both significance levels, though the TPRs are larger when the significance level is 0.05 than when it is 0.01, as expected.
Table 3.
True Positive Rates (Significance Level = 0.05, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 10 | 10 | 20 | 0.194 | 0.640 | 0.692 | 0.034 | 0.669 | 0.074 | 0.160 | 0.220 | 0.001 | 0.086 |
| | | 50 | 0.269 | 0.722 | 0.771 | 0.094 | 0.728 | 0.112 | 0.184 | 0.268 | 0.005 | 0.105 |
| | 30 | 20 | 0.692 | 0.989 | 0.991 | 0.564 | 0.990 | 0.161 | 0.444 | 0.526 | 0.013 | 0.221 |
| | | 50 | 0.856 | 0.993 | 0.994 | 0.829 | 0.993 | 0.351 | 0.601 | 0.712 | 0.094 | 0.451 |
| 20 | 10 | 20 | 0.182 | 0.528 | 0.586 | 0.032 | 0.546 | 0.081 | 0.138 | 0.203 | 0.001 | 0.079 |
| | | 50 | 0.262 | 0.600 | 0.658 | 0.095 | 0.598 | 0.119 | 0.162 | 0.252 | 0.006 | 0.103 |
| | 30 | 20 | 0.594 | 0.960 | 0.970 | 0.432 | 0.970 | 0.144 | 0.417 | 0.495 | 0.013 | 0.218 |
| | | 50 | 0.756 | 0.978 | 0.984 | 0.694 | 0.980 | 0.344 | 0.563 | 0.674 | 0.098 | 0.416 |
| 40 | 10 | 20 | 0.161 | 0.301 | 0.363 | 0.031 | 0.319 | 0.080 | 0.120 | 0.186 | 0.001 | 0.077 |
| | | 50 | 0.216 | 0.346 | 0.415 | 0.092 | 0.360 | 0.119 | 0.148 | 0.228 | 0.009 | 0.105 |
| | 30 | 20 | 0.398 | 0.748 | 0.780 | 0.247 | 0.772 | 0.144 | 0.341 | 0.414 | 0.014 | 0.197 |
| | | 50 | 0.516 | 0.810 | 0.841 | 0.405 | 0.818 | 0.326 | 0.458 | 0.556 | 0.104 | 0.353 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size.
Table 4.
True Positive Rates (Significance Level = 0.01, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 10 | 10 | 20 | 0.060 | 0.404 | 0.434 | 0.009 | 0.443 | 0.019 | 0.036 | 0.053 | 0.000 | 0.025 |
| | | 50 | 0.106 | 0.512 | 0.538 | 0.033 | 0.540 | 0.035 | 0.044 | 0.077 | 0.000 | 0.027 |
| | 30 | 20 | 0.434 | 0.970 | 0.975 | 0.307 | 0.983 | 0.031 | 0.168 | 0.190 | 0.001 | 0.071 |
| | | 50 | 0.684 | 0.988 | 0.990 | 0.647 | 0.991 | 0.139 | 0.333 | 0.413 | 0.016 | 0.205 |
| 20 | 10 | 20 | 0.068 | 0.305 | 0.334 | 0.011 | 0.339 | 0.017 | 0.036 | 0.053 | 0.000 | 0.017 |
| | | 50 | 0.116 | 0.390 | 0.426 | 0.037 | 0.421 | 0.031 | 0.049 | 0.078 | 0.001 | 0.030 |
| | 30 | 20 | 0.350 | 0.905 | 0.916 | 0.243 | 0.929 | 0.029 | 0.152 | 0.175 | 0.001 | 0.064 |
| | | 50 | 0.581 | 0.951 | 0.960 | 0.520 | 0.960 | 0.134 | 0.307 | 0.376 | 0.021 | 0.196 |
| 40 | 10 | 20 | 0.061 | 0.162 | 0.186 | 0.008 | 0.183 | 0.016 | 0.034 | 0.049 | 0.000 | 0.018 |
| | | 50 | 0.113 | 0.210 | 0.240 | 0.050 | 0.234 | 0.037 | 0.048 | 0.079 | 0.001 | 0.031 |
| | 30 | 20 | 0.242 | 0.603 | 0.626 | 0.143 | 0.644 | 0.030 | 0.135 | 0.159 | 0.001 | 0.066 |
| | | 50 | 0.373 | 0.699 | 0.724 | 0.298 | 0.732 | 0.146 | 0.250 | 0.316 | 0.029 | 0.176 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size.
The results show that the TPRs vary substantially across the different sequential procedures, flagging methods, percentages of CI, percentages of EWP, and moving sample sizes. Across all conditions, the IRT-based sequential procedures display larger TPRs than the non-IRT-based sequential procedures. For example, at a significance level of 0.05, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method range from 0.319 to 0.993, while the TPRs for the non-IRT-based sequential procedure with the ST-3 flagging method range from 0.077 to 0.451. These results align with the general findings of previous researchers who also compared the use of IRT- and non-IRT-based sequential procedures (e.g., Table 6 of Zhang & Li, 2016).
Table 6.
Lag (Significance Level = 0.01, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 10 | 10 | 20 | 384 | 340 | 333 | 495 | 335 | 321 | 319 | 318 | – | 316 |
| | | 50 | 393 | 335 | 325 | 543 | 334 | 334 | 338 | 338 | – | 337 |
| | 30 | 20 | 348 | 201 | 197 | 394 | 188 | 265 | 324 | 312 | 168 | 284 |
| | | 50 | 305 | 176 | 171 | 332 | 167 | 354 | 368 | 350 | 505 | 390 |
| 20 | 10 | 20 | 383 | 349 | 338 | 625 | 349 | 226 | 258 | 245 | – | 301 |
| | | 50 | 398 | 355 | 339 | 533 | 341 | 348 | 314 | 321 | 403 | 371 |
| | 30 | 20 | 337 | 228 | 223 | 403 | 213 | 237 | 350 | 331 | 143 | 327 |
| | | 50 | 296 | 199 | 193 | 328 | 190 | 382 | 386 | 361 | 454 | 397 |
| 40 | 10 | 20 | 472 | 395 | 374 | 734 | 377 | 244 | 339 | 312 | – | 298 |
| | | 50 | 459 | 377 | 349 | 602 | 356 | 404 | 396 | 389 | 804 | 423 |
| | 30 | 20 | 376 | 279 | 268 | 509 | 256 | 206 | 332 | 309 | 292 | 289 |
| | | 50 | 285 | 252 | 239 | 347 | 233 | 397 | 385 | 361 | 448 | 400 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size; – = lag cannot be computed when the TPR is 0.
For the IRT-based sequential procedures, the ST-1 flagging method usually displays the largest TPRs and is followed by ST-3, T, S, and then ST-2. For example, when the significance level is 0.05, the moving sample size is 20, and there are 10% CI and 10% EWP, the ST-1 flagging method achieves a TPR of 0.692, which is slightly larger than that of ST-3 (0.669) and T (0.640) and noticeably larger than that of S (0.194) and ST-2 (0.034). These results are not surprising, given that they agree with most of the patterns that were observed for the FPRs. These results also suggest that the ST-3 flagging method is the most promising method overall, as it tends to display relatively high TPRs while keeping the FPRs close to or below the nominal level.
It is worth noting that while the ST-3 flagging method performed well relative to the other flagging methods, its performance, like that of all methods, varies across the different conditions. From the tables, additional results can be gleaned that are common to both the IRT- and non-IRT-based sequential procedures. First, as the percentage of CI increases, the TPRs tend to decrease, particularly when the percentage of EWP is high. For example, for the IRT-based sequential procedure, the TPRs with the ST-3 flagging method at a significance level of 0.05 range from 0.669 to 0.993 when there are 10% CI, 0.546 to 0.980 when there are 20% CI, and 0.319 to 0.818 when there are 40% CI. For the IRT-based sequential procedures, this result is again related to the fact that item flagging depends on the ability and/or speed estimates of the examinees. If more of the items have been compromised, the ability and/or speed estimates of the EWP would likely be even more biased. As a result, the examinees would be expected to perform better and faster on all items (including those that have been compromised), making it difficult to identify a compromise signal.
An additional finding is that as the percentage of EWP increases, the TPRs increase. For example, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method at a significance level of 0.05 range from 0.319 to 0.728 when there are 10% EWP, while they range from 0.772 to 0.993 when there are 30% EWP. Presumably, the presence of more EWP (30%) produces a stronger compromise signal that is easier to detect.
Finally, as the moving sample size increases, the TPRs also increase. For example, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method at a significance level of 0.05 range from 0.319 to 0.990 when the moving sample size is 20, while they range from 0.360 to 0.993 when the moving sample size is 50. These results align with the findings of previous researchers who also found that the use of larger moving sample sizes generally improves the detection of CI (e.g., Table 6 of Zhang & Li, 2016).
Lag
The lag is shown in Table 5 for a significance level of 0.05 and Table 6 for a significance level of 0.01. It is important to note that because the lag was computed across only the CIs that were flagged, the results are directly related to the TPRs. Thus, both the TPRs and the lag are shown in Figures 1 and 2 for significance levels of 0.05 and 0.01, respectively, to illustrate the relationships between the two outcomes.
Table 5.
Lag (Significance Level = 0.05, Test Length = 20).
| %CI | %EWP | m | IRT-based | | | | | Non-IRT-based | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | S | T | ST-1 | ST-2 | ST-3 | S | T | ST-1 | ST-2 | ST-3 |
| 10 | 10 | 20 | 293 | 295 | 275 | 404 | 285 | 275 | 309 | 282 | 161 | 267 |
| | | 50 | 314 | 286 | 265 | 422 | 280 | 299 | 308 | 293 | 227 | 323 |
| | 30 | 20 | 282 | 163 | 156 | 336 | 154 | 270 | 314 | 289 | 234 | 293 |
| | | 50 | 253 | 145 | 137 | 284 | 142 | 316 | 311 | 286 | 402 | 342 |
| 20 | 10 | 20 | 340 | 301 | 286 | 493 | 294 | 241 | 281 | 260 | 374 | 284 |
| | | 50 | 323 | 295 | 272 | 447 | 285 | 301 | 300 | 290 | 436 | 313 |
| | 30 | 20 | 277 | 183 | 174 | 333 | 175 | 263 | 309 | 287 | 295 | 311 |
| | | 50 | 251 | 166 | 155 | 287 | 161 | 321 | 317 | 287 | 409 | 344 |
| 40 | 10 | 20 | 358 | 312 | 281 | 596 | 300 | 273 | 302 | 281 | 444 | 294 |
| | | 50 | 337 | 305 | 271 | 525 | 286 | 351 | 347 | 320 | 521 | 345 |
| | 30 | 20 | 275 | 228 | 214 | 390 | 216 | 261 | 310 | 285 | 314 | 296 |
| | | 50 | 240 | 209 | 194 | 289 | 201 | 324 | 322 | 282 | 421 | 343 |

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; m = moving sample size.
Figure 1.
True positive rates and lag (significance level = 0.05, test length = 20).
Figure 2.
True positive rates and lag (significance level = 0.01, test length = 20).
The results show that the lag tends to be shorter for the IRT-based sequential procedures than for the non-IRT-based sequential procedures. When taken together with the results for the TPRs, this result is encouraging, as it suggests that not only are the IRT-based sequential procedures able to detect more CI, but also that they tend to do so in shorter amounts of time.
For the IRT-based sequential procedures, the lag tends to be shortest when the ST-1 flagging method is used, followed by ST-3, T, S, and then ST-2. In addition, as the percentage of CI increases, the lag tends to increase, whereas as the percentage of EWP or the moving sample size increases, the lag tends to decrease. It is interesting to note that all of these results align with those of the TPRs. In particular, larger TPRs tend to correspond to shorter lag times.
Discussion
In summary, this study proposes and evaluates three IRT-based sequential procedures that use item scores and RTs to detect CI in CAT. The first procedure (ST-1) requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure (ST-2) requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure (ST-3) requires that a (combined) score and RT-based statistic be extreme. Results suggest that the ST-3 flagging method is the most promising method overall, as it tends to produce relatively high TPRs and relatively short lag times while keeping the FPRs close to the nominal level. By contrast, the ST-1 flagging method tends to produce FPRs that are unreasonably large, while the ST-2 flagging method tends to produce TPRs that are unreasonably small. The ST-3 flagging method also tends to perform better than both the item scores only method (S) and the item RTs only method (T), as well as all five of the non-IRT-based sequential procedures—including those that are based on item scores and RTs.
Some guidelines for implementing the proposed methods can be suggested. When item scores and RTs are available, use of the IRT-based sequential procedure with the ST-3 flagging method is recommended, as it appears to provide the best balance between the FPR and the TPR while also producing relatively short lag times. The selection of the moving sample size is also important, as the use of a larger moving sample size tends to improve CI detection. In addition, for the IRT-based sequential procedures, the use of a larger moving sample size also corresponds to shorter lag times. Thus, optimal selection of the moving sample size should be guided by the sequential procedure being used as well as the specific CAT configuration (e.g., test length, item selection method, ability estimation method), the total number of examinees, and the expected performance on various outcome measures. Finally, the significance level plays an important role. For both the IRT- and non-IRT-based sequential procedures, the use of a larger significance level, such as 0.05, tends to increase the likelihood of detecting CI and reduce the lag. However, it also increases the risk of incorrectly flagging a secure item as compromised.
When implementing any of the sequential procedures used in this study, item scores and/or RTs are continuously monitored to detect unexpected changes in performance. Thus, these procedures can be integrated into existing item selection algorithms, provided the CAT system allows interaction between the detection mechanism and the item selection process. Specifically, once an item is flagged as potentially compromised, it can be suspended from future administrations. The flagged item may remain inactive until further evidence is gathered to determine whether it should be permanently removed from the item pool, minimizing the impact of preknowledge and ultimately improving test fairness and security.
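As a sketch of how this interaction might be wired up (the class and method names below are hypothetical, not the API of any particular CAT system), the detection step can sit between scoring an item and selecting the next one:

```python
# Hypothetical sketch: wiring a sequential detector into a CAT loop.
# ItemPool, record_response, and update_and_flag are illustrative names,
# not part of any real CAT system's API.
class ItemPool:
    def __init__(self, item_ids):
        self.active = set(item_ids)   # items eligible for selection
        self.suspended = set()        # flagged items awaiting review

    def suspend(self, item_id):
        """Remove a flagged item from future administrations."""
        self.active.discard(item_id)
        self.suspended.add(item_id)

    def reinstate(self, item_id):
        """Return an item to the pool after review finds it secure."""
        self.suspended.discard(item_id)
        self.active.add(item_id)


def record_response(pool, detector, item_id, score, rt):
    # After each administration, update the item's sequential statistic;
    # suspend the item as soon as the detector flags it as extreme.
    if detector.update_and_flag(item_id, score, rt):
        pool.suspend(item_id)
```

A flagged item stays in `suspended` rather than being deleted outright, which mirrors the review step described above: it can be reinstated if further evidence suggests it is secure, or permanently retired otherwise.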
This study has several limitations that can be investigated in future research. First, the methods proposed in this paper were evaluated using only simulated data. In the future, it would be useful to apply these methods to real data that contain known compromise. Second, in our simulation study, we only examined a limited set of simulation conditions. Future researchers could investigate additional simulation conditions, including those involving different item pools, item selection methods, item exposure control methods, and ability estimation methods. Third, this study focused solely on comparing sequential procedures for detecting CI in CAT. Future researchers could compare the performance of these methods to other methods for detecting CI (e.g., Du et al., 2023; Du & Zhang, 2025; Kang, 2023; Lee & Lewis, 2021; Lee et al., 2014; Toton & Maynes, 2019; van der Linden, 2022; van der Linden & Belov, 2023; Wang & Liu, 2020) or methods that are designed to detect CI and EWP simultaneously (e.g., Belov, 2014; Pan et al., 2022). Finally, because the proposed methods assume an IRT model fits the data, they are inappropriate for exams that do not use IRT (e.g., small-volume exams). In such cases, non-IRT-based sequential procedures may be considered instead.
Supplemental Material
Supplemental material, sj-pdf-1-epm-10.1177_00131644251368335 for Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing by Chansoon Lee, Kylie Gorney and Jianshen Chen in Educational and Psychological Measurement
Appendix
Non-IRT-Based Sequential Procedures for Item Scores Only
Choe et al. (2018) and Zhang (2014) proposed non-IRT-based sequential procedures that test whether a dichotomous item is getting easier over time. Unlike the IRT-based sequential procedure, the non-IRT-based sequential procedures monitor the difficulty of an item by comparing the proportion of correct responses for a moving sample of examinees to the proportion of correct responses for a reference sample of examinees. The moving sample is taken as the $m$ most recent examinees who were administered the item, while the reference sample is taken as all other examinees who were administered the item. Thus, the reference sample contains the first $r$ examinees who were administered the item, and the proportions of correct responses for the moving and reference samples are given by

$$\hat{p}_m = \frac{1}{m} \sum_{i \in \mathcal{M}} x_i \quad \text{and} \quad \hat{p}_r = \frac{1}{r} \sum_{i \in \mathcal{R}} x_i, \tag{A1}$$

respectively, where $x_i$ denotes the score of examinee $i$ on the item and $\mathcal{M}$ and $\mathcal{R}$ denote the moving and reference samples.
Under the null hypothesis that the item has not been compromised, the test statistic

$$Z = \frac{\hat{p}_m - \hat{p}_r}{\sqrt{p(1 - p)\left(\frac{1}{m} + \frac{1}{r}\right)}} \tag{A2}$$

has an asymptotic standard normal distribution, where $p$ denotes the pooled population proportion of correct responses. In practice, the pooled population proportion is unknown, and therefore, $Z$ cannot be computed. Instead, Zhang (2014) used the approximation that is given by replacing $p$ with the reference sample proportion $\hat{p}_r$, while Choe et al. (2018) used the approximation that is given by replacing $p$ with the pooled sample proportion

$$\hat{p} = \frac{m \hat{p}_m + r \hat{p}_r}{m + r},$$

which is the approximation that we also use in this paper.
To monitor the difficulty of the item across multiple time points, one can compute $Z$ each time the item is administered. Extreme positive values of $Z$ indicate that the item was much easier than expected in recent administrations and may therefore have been compromised.
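To make the computation concrete, the statistic in Equations (A1) and (A2) can be sketched in a few lines of Python (a minimal illustration with our own function name and data layout, not the authors' implementation):

```python
# Minimal sketch of the score-based statistic Z in Equations (A1)-(A2),
# using the pooled sample proportion approximation of Choe et al. (2018).
import math

def z_statistic(moving_scores, reference_scores):
    """Two-sample z statistic for whether the item got easier recently.

    Each argument is a list of 0/1 item scores.
    """
    m, r = len(moving_scores), len(reference_scores)
    p_m = sum(moving_scores) / m        # moving sample proportion correct
    p_r = sum(reference_scores) / r     # reference sample proportion correct
    p = (m * p_m + r * p_r) / (m + r)   # pooled sample proportion
    se = math.sqrt(p * (1 - p) * (1 / m + 1 / r))
    return (p_m - p_r) / se

# The 10 most recent examinees answer correctly far more often than the
# first 40 did, so Z is large and positive.
z = z_statistic([1] * 9 + [0], [1] * 20 + [0] * 20)
```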
Non-IRT-Based Sequential Procedure for Item RTs Only
Choe et al. (2018) proposed a non-IRT-based sequential procedure that tests whether an item is being answered more quickly over time. The procedure monitors the time intensity of the item by comparing the average log RT for a moving sample of examinees to the average log RT for a reference sample of examinees. The average log RTs for the moving and reference samples are given by
$$\bar{t}_m = \frac{1}{m} \sum_{i \in \mathcal{M}} \ln t_i \quad \text{and} \quad \bar{t}_r = \frac{1}{r} \sum_{i \in \mathcal{R}} \ln t_i, \tag{A3}$$

respectively, where $t_i$ denotes the RT of examinee $i$ on the item.
If it can be assumed that the two population variances are equal ($\sigma_m^2 = \sigma_r^2$), then under the null hypothesis that the item has not been compromised, the test statistic

$$T = \frac{\bar{t}_m - \bar{t}_r}{s_p \sqrt{\frac{1}{m} + \frac{1}{r}}} \tag{A4}$$

has a $t$ distribution with $m + r - 2$ degrees of freedom, where $s_p^2$ denotes the pooled sample variance that is given by

$$s_p^2 = \frac{(m - 1) s_m^2 + (r - 1) s_r^2}{m + r - 2}$$

for the sample variances

$$s_m^2 = \frac{1}{m - 1} \sum_{i \in \mathcal{M}} (\ln t_i - \bar{t}_m)^2 \quad \text{and} \quad s_r^2 = \frac{1}{r - 1} \sum_{i \in \mathcal{R}} (\ln t_i - \bar{t}_r)^2.$$
To monitor the time intensity of the item across multiple time points, one can compute $T$ each time the item is administered. Extreme negative values of $T$ indicate that the item was answered much more quickly than expected in recent administrations and may therefore have been compromised.
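Under the same notation, the RT-based statistic in Equation (A4) can be sketched as follows (again an illustration, not the authors' code; the function name is ours):

```python
# Minimal sketch of the pooled two-sample t statistic T in Equations
# (A3)-(A4), computed on log RTs.
import math
import statistics

def t_statistic(moving_rts, reference_rts):
    """Negative values suggest the item is being answered more quickly."""
    log_m = [math.log(t) for t in moving_rts]
    log_r = [math.log(t) for t in reference_rts]
    m, r = len(log_m), len(log_r)
    # statistics.variance uses the (n - 1) divisor, i.e., sample variance.
    s2_pooled = ((m - 1) * statistics.variance(log_m)
                 + (r - 1) * statistics.variance(log_r)) / (m + r - 2)
    return (statistics.mean(log_m) - statistics.mean(log_r)) / (
        math.sqrt(s2_pooled) * math.sqrt(1 / m + 1 / r))
```

For example, a moving sample with RTs around 5–7 seconds against a reference sample around 20–30 seconds yields a large negative $T$.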
Non-IRT-Based Sequential Procedures for Item Scores and RTs
Choe et al. (2018) proposed three non-IRT-based sequential procedures that involve the use of item scores and RTs. The first procedure requires that either $Z$ or $T$ be extreme for an item to be flagged as potentially compromised, while the second procedure requires that both $Z$ and $T$ be extreme and is therefore more conservative.
The third procedure uses a different approach that involves the creation of a new test statistic. The idea is to monitor the difficulty and time intensity of the item simultaneously by comparing both the proportion of correct responses and the average log RT across the moving and reference samples of examinees. For each sample, the proportion of correct responses and the average log RT can be combined into a single vector as
$$\bar{\mathbf{y}}_m = \begin{pmatrix} \hat{p}_m \\ \bar{t}_m \end{pmatrix} \quad \text{and} \quad \bar{\mathbf{y}}_r = \begin{pmatrix} \hat{p}_r \\ \bar{t}_r \end{pmatrix}. \tag{A5}$$

Under the null hypothesis that the item has not been compromised, the test statistic

$$F = \frac{m + r - 3}{2(m + r - 2)} \, T^2 \tag{A6}$$

has an $F$ distribution with 2 and $m + r - 3$ degrees of freedom, where $T^2$ denotes the two-sample Hotelling's $T^2$ statistic that is given by

$$T^2 = \frac{mr}{m + r} \, (\bar{\mathbf{y}}_m - \bar{\mathbf{y}}_r)^\top \mathbf{S}_p^{-1} (\bar{\mathbf{y}}_m - \bar{\mathbf{y}}_r)$$

for the pooled sample covariance matrix

$$\mathbf{S}_p = \frac{(m - 1) \mathbf{S}_m + (r - 1) \mathbf{S}_r}{m + r - 2},$$

the sample covariance matrices

$$\mathbf{S}_m = \begin{pmatrix} s_{m,x}^2 & s_{m,xt} \\ s_{m,xt} & s_m^2 \end{pmatrix} \quad \text{and} \quad \mathbf{S}_r = \begin{pmatrix} s_{r,x}^2 & s_{r,xt} \\ s_{r,xt} & s_r^2 \end{pmatrix},$$

the sample variances

$$s_{m,x}^2 = \frac{1}{m - 1} \sum_{i \in \mathcal{M}} (x_i - \hat{p}_m)^2 \quad \text{and} \quad s_{r,x}^2 = \frac{1}{r - 1} \sum_{i \in \mathcal{R}} (x_i - \hat{p}_r)^2,$$

and the sample covariances

$$s_{m,xt} = \frac{1}{m - 1} \sum_{i \in \mathcal{M}} (x_i - \hat{p}_m)(\ln t_i - \bar{t}_m) \quad \text{and} \quad s_{r,xt} = \frac{1}{r - 1} \sum_{i \in \mathcal{R}} (x_i - \hat{p}_r)(\ln t_i - \bar{t}_r).$$

Extreme positive values of $F$ indicate that the item was much easier and/or answered much more quickly than expected in recent administrations and may therefore have been compromised.
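The combined statistic in Equations (A5) and (A6) can likewise be sketched with a small amount of linear algebra (an illustration under our own naming, not the authors' code):

```python
# Minimal sketch of the combined statistic F in Equations (A5)-(A6):
# a two-sample Hotelling's T^2 on (item score, log RT), rescaled to an
# F statistic with 2 and m + r - 3 degrees of freedom.
import numpy as np

def f_statistic(moving_scores, moving_rts, reference_scores, reference_rts):
    # Stack each sample as an (n, 2) matrix of (score, log RT) pairs.
    ym = np.column_stack([moving_scores, np.log(moving_rts)])
    yr = np.column_stack([reference_scores, np.log(reference_rts)])
    m, r = len(ym), len(yr)
    diff = ym.mean(axis=0) - yr.mean(axis=0)
    # Pooled sample covariance matrix (np.cov uses the n - 1 divisor).
    s_pooled = ((m - 1) * np.cov(ym.T) + (r - 1) * np.cov(yr.T)) / (m + r - 2)
    t2 = (m * r) / (m + r) * diff @ np.linalg.solve(s_pooled, diff)
    return (m + r - 3) / (2 * (m + r - 2)) * t2
```

Because $T^2$ is a quadratic form in a positive definite matrix, the resulting $F$ is nonnegative; large values arise when the moving sample differs sharply from the reference sample in scores, log RTs, or both.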
For all three procedures, Choe et al. (2018) further required that an item be easier than expected ($\hat{p}_m > \hat{p}_r$) and answered more quickly than expected ($\bar{t}_m < \bar{t}_r$) in recent administrations to be flagged as potentially compromised. In this paper, we remove this requirement to allow for a fairer comparison to the IRT-based sequential procedures, which do not impose this requirement.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: Chansoon Lee
https://orcid.org/0000-0002-3669-3019
Kylie Gorney
https://orcid.org/0000-0002-8924-0726
Supplemental Material: Supplemental material for this article is available online.
References
- Belov D. I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37–58.
- Choe E. M., Zhang J., Chang H.-H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83(3), 650–673. 10.1007/s11336-017-9596-3
- Du Y., Zhang S. (2025). Detecting compromised items with response times using a Bayesian change-point approach. Journal of Educational and Behavioral Statistics, 50(2), 296–330. 10.3102/10769986241290713
- Du Y., Zhang S., Chang H.-H. (2023). Compromised item detection: A Bayesian change-point perspective. British Journal of Mathematical and Statistical Psychology, 76(1), 131–153. 10.1111/bmsp.12286
- Gorney K., Lee C., Chen J. (2025). A score-based method for detecting item compromise and preknowledge in computerized adaptive testing. Journal of Computerized Adaptive Testing, 12(2), 123–136. 10.7333/2506-1202123
- Gorney K., Wollack J. A. (2025). Using response times in answer similarity analysis. Journal of Educational and Behavioral Statistics, 50(3), 449–470. 10.3102/10769986241248770
- Kang H.-A. (2023). Sequential generalized likelihood ratio tests for online item monitoring. Psychometrika, 88(2), 672–696. 10.1007/s11336-022-09871-9
- Lee Y.-H., Lewis C. (2021). Monitoring item performance with CUSUM statistics in continuous testing. Journal of Educational and Behavioral Statistics, 46(5), 611–648. 10.3102/1076998621994563
- Lee Y.-H., Lewis C., von Davier A. A. (2014). Monitoring the quality and security of multistage tests. In Yan D., von Davier A. A., Lewis C. (Eds.), Computerized multistage testing: Theory and applications (pp. 285–300). CRC Press.
- Pan Y., Sinharay S., Livne O., Wollack J. A. (2022). A machine learning approach for detecting item compromise and preknowledge in computerized adaptive testing. Psychological Test and Assessment Modeling, 64(4), 385–424.
- Sinharay S., Johnson M. S. (2020). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73(3), 397–419. 10.1111/bmsp.12187
- Toton S. L., Maynes D. D. (2019). Detecting examinees with pre-knowledge in experimental data using conditional scaling of response times. Frontiers in Education, 4, Article 49. 10.3389/feduc.2019.00049
- van der Linden W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. 10.3102/10769986031002181
- van der Linden W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. 10.1007/s11336-006-1478-z
- van der Linden W. J. (2022). Two statistical tests for the detection of item compromise. Journal of Educational and Behavioral Statistics, 47(4), 485–504. 10.3102/10769986221094789
- van der Linden W. J., Belov D. I. (2023). A statistical test for the detection of item compromise combining responses and response times. Journal of Educational Measurement, 60(2), 235–254. 10.1111/jedm.12346
- Veerkamp W. J. J., Glas C. A. W. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25(4), 373–389. 10.3102/10769986025004373
- Wang X., Liu Y. (2020). Detecting compromised items using information from secure items. Journal of Educational and Behavioral Statistics, 45(6), 667–689. 10.3102/1076998620912549
- Zhang J. (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38(2), 87–104. 10.1177/0146621613510062
- Zhang J., Li J. (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53(2), 131–151. 10.1111/jedm.12104