Educational and Psychological Measurement
2025 Sep 14:00131644251368335. Online ahead of print. doi: 10.1177/00131644251368335

Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing

Chansoon Lee 1, Kylie Gorney 2, Jianshen Chen 3
PMCID: PMC12433998  PMID: 40959735

Abstract

Sequential procedures have been shown to be effective methods for real-time detection of compromised items in computerized adaptive testing. In this study, we propose three item response theory-based sequential procedures that involve the use of item scores and response times (RTs). The first procedure requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure requires that a combined score and RT-based statistic be extreme. Results suggest that the third procedure is the most promising, providing a reasonable balance between the false-positive rate and the true-positive rate while also producing relatively short lag times across a wide range of simulation conditions.

Keywords: computerized adaptive testing, item compromise, item response theory, response time, test security

Introduction

In computerized adaptive testing (CAT), highly informative items tend to be selected more frequently, increasing their exposure and risk of compromise. When items become compromised, examinees with preknowledge (EWP) may gain an unfair advantage, potentially leading to score inflation and threatening the validity of test score interpretations. However, one benefit of CAT is that it allows for real-time detection and intervention. If an item is identified as compromised, the algorithm can exclude it from further administration, limiting its impact on future examinees.

To enable the detection of compromised items (CIs), several statistical procedures have been developed that monitor changes in item performance over time, including cumulative sum (CUSUM) procedures, change-point analysis (CPA) procedures, and sequential procedures. CUSUM procedures have been proposed to detect item parameter drift (Veerkamp & Glas, 2000) and monitor changes in item residuals (Lee & Lewis, 2021; Lee et al., 2014). However, the former procedures, while useful, focus on changes in the item parameters and therefore assume that the model still holds. Meanwhile, the latter procedures do not make this assumption, but they have not yet been applied to data beyond the item scores. In CAT, additional sources of data, such as item response times (RTs), are usually also available and may contribute to improving detection rates. Similar to CUSUM procedures, CPA procedures also monitor changes in item performance over time (Du et al., 2023; Du & Zhang, 2025). These procedures have been extended for use with item scores and RTs; however, it is unknown whether they are appropriate for CAT data due to the fact that missingness is not completely at random. Finally, sequential procedures have been proposed that utilize a series of statistical hypothesis tests to monitor whether item performance has significantly changed over time. These procedures have been applied using both item scores and RTs and have been shown to work well in a CAT environment. Thus, the use of sequential procedures is explored in this study.

Sequential procedures can generally be grouped into two categories: item response theory (IRT)-based procedures and non-IRT-based procedures. Non-IRT-based sequential procedures have been proposed that use item scores only, item RTs only, and both item scores and RTs (Choe et al., 2018; Zhang, 2014). Meanwhile, IRT-based sequential procedures have been proposed that use item scores only and item RTs only (Choe et al., 2018; Zhang & Li, 2016), but not item scores and RTs. Given that (a) Zhang and Li (2016) found that IRT-based sequential procedures tend to outperform non-IRT-based sequential procedures and (b) Choe et al. (2018) found that non-IRT-based sequential procedures using item scores and RTs tend to outperform non-IRT-based sequential procedures using item scores only or item RTs only, it is reasonable to suspect that an IRT-based sequential procedure incorporating both item scores and RTs would outperform existing methods. The purpose of this paper is to investigate this hypothesis and fill this gap in the literature by answering the following research questions:

  1. Do IRT-based sequential procedures that incorporate both item scores and RTs outperform existing sequential procedures?

  2. What factors affect the performance of the new procedures?

In the following section, we introduce three IRT-based sequential procedures for item scores and RTs. The subsequent section presents the simulation study, in which we compare the performance of the newly proposed sequential procedures with that of existing procedures. Finally, we conclude with a discussion of research contributions, suggestions for implementation, and limitations.

Method

In recent years, several IRT- and non-IRT-based sequential procedures have been proposed that involve the use of item scores and/or RTs (Choe et al., 2018; Zhang, 2014; Zhang & Li, 2016). In what follows, we first review common models for item scores and RTs. We then describe two existing IRT-based sequential procedures for item scores only and item RTs only. Finally, we introduce three new IRT-based sequential procedures for item scores and RTs. Note that descriptions of non-IRT-based sequential procedures for item scores only, item RTs only, and item scores and RTs are found in the Appendix.

Models for Item Scores and RTs

Let i = 1, …, I denote the items in an item pool, and let X_ji and Y_ji denote the score and log RT, respectively, of person j on item i. If, for example, the Rasch model is used to model the item scores, the probability of a correct response is assumed to be

$$p_i(\theta_j) = P(X_{ji} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}, \tag{1}$$

where θj is the ability parameter of person j, and bi is the difficulty parameter of item i. Similarly, if the lognormal model (van der Linden, 2006) is used to model the RTs, the density of the log RT is assumed to be

$$f(Y_{ji} \mid \tau_j, \alpha_i, \beta_i) = \frac{\alpha_i}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left[\alpha_i\left(Y_{ji} - (\beta_i - \tau_j)\right)\right]^2\right\}, \tag{2}$$

where τj is the speed parameter of person j, and αi and βi are the time discrimination and time intensity parameters, respectively, of item i. Both the Rasch model and the lognormal model can be plugged into the hierarchical framework of van der Linden (2007), in which joint distributions are assumed for the person and/or item parameters. In the remainder of this paper, the subscript i will be dropped for notational convenience.
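As a concrete illustration (not code from the study), the two models in Equations 1 and 2 can be evaluated directly; the function and argument names below are our own:

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model (Equation 1)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def logrt_density(y, tau, alpha, beta):
    """Density of the log RT under the lognormal model (Equation 2).

    y     : observed log response time
    tau   : person speed parameter
    alpha : item time discrimination
    beta  : item time intensity
    """
    z = alpha * (y - (beta - tau))
    return (alpha / math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)
```

For example, an examinee with ability equal to the item's difficulty has a 0.5 probability of answering correctly, and the log RT density peaks at y = β − τ.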

IRT-Based Sequential Procedure for Item Scores Only

Zhang and Li (2016) proposed an IRT-based sequential procedure that tests whether a dichotomous item is getting easier over time—a phenomenon that may occur if the item has been compromised. The procedure monitors the difficulty of the item across multiple time points by comparing the observed proportion of correct responses to the expected proportion of correct responses for a moving sample of examinees. The moving sample is taken as the most recent m examinees who were administered the item. Thus, the observed proportion of correct responses for the moving sample is given by

$$S = \frac{1}{m} \sum_{j=n-m+1}^{n} X_j, \tag{3}$$

where n is the total number of examinees to whom the item has been administered.

Under the null hypothesis that the item has not been compromised, the test statistic

$$Z_S = \frac{S - \mu_S}{\sigma_S} \tag{4}$$

has an asymptotic standard normal distribution, where μS and σS denote the mean and standard deviation, respectively, of S, which are given by

$$\mu_S = \frac{1}{m} \sum_{j=n-m+1}^{n} p(\theta_j) \quad\text{and}\quad \sigma_S = \sqrt{\frac{1}{m^2} \sum_{j=n-m+1}^{n} p(\theta_j)\left[1 - p(\theta_j)\right]}.$$

In practice, the ability parameters of the examinees are unknown, and therefore, ZS cannot be computed. However, as noted by Zhang and Li (2016), the test statistic can be approximated by replacing the true ability θj with an estimate θ̂j.

To monitor the difficulty of the item across multiple time points, one can compute ZS each time the item is administered. Extreme positive values of ZS indicate that the item was much easier than expected in recent administrations and may therefore have been compromised.
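A minimal sketch of the ZS computation for one monitoring point, assuming the Rasch model and ability estimates for the m most recent examinees (the function name is ours, not the authors'):

```python
import math

def z_score_statistic(x_recent, theta_hat, b):
    """Z_S for a moving sample of m examinees (Equations 3-4).

    x_recent  : 0/1 item scores of the m most recent examinees
    theta_hat : their estimated abilities (substituted for the true thetas)
    b         : the item's difficulty parameter
    """
    m = len(x_recent)
    # Rasch probability of a correct response for each examinee
    p = [math.exp(t - b) / (1.0 + math.exp(t - b)) for t in theta_hat]
    s = sum(x_recent) / m                              # observed proportion correct
    mu = sum(p) / m                                    # expected proportion correct
    sigma = math.sqrt(sum(pi * (1.0 - pi) for pi in p)) / m
    return (s - mu) / sigma
```

For instance, if four examinees with estimated abilities equal to the item difficulty (expected proportion 0.5) all answer correctly, the statistic is (1.0 − 0.5)/0.25 = 2.0, signaling an easier-than-expected item.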

IRT-Based Sequential Procedure for Item RTs Only

Choe et al. (2018) proposed an IRT-based sequential procedure that involves the use of item RTs. The procedure is similar to that which is used for the item scores, but rather than testing whether an item is getting easier over time, it tests whether the item is being answered more quickly over time. The procedure monitors the time intensity of the item by comparing the observed average log RT to the expected average log RT for a moving sample of examinees. The observed average log RT for the moving sample is given by

$$T = \frac{1}{m} \sum_{j=n-m+1}^{n} Y_j. \tag{5}$$

Under the null hypothesis that the item has not been compromised, the test statistic

$$Z_T = \frac{T - \mu_T}{\sigma_T} \tag{6}$$

has an asymptotic standard normal distribution, where μT and σT denote the mean and standard deviation, respectively, of T. If it is assumed that the lognormal model fits the RTs, then

$$\mu_T = \frac{1}{m} \sum_{j=n-m+1}^{n} (\beta - \tau_j) \quad\text{and}\quad \sigma_T = \sqrt{\frac{1}{m\alpha^2}}.$$

In practice, the speed parameters of the examinees are unknown, and therefore, ZT cannot be computed. However, as noted by Choe et al. (2018), the test statistic can be approximated by replacing the true speed τj with an estimate τ̂j.

To monitor the time intensity of the item across multiple time points, one can compute ZT each time the item is administered. Extreme negative values of ZT indicate that the item was answered much more quickly than expected in recent administrations and may therefore have been compromised.
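The ZT computation can be sketched analogously, assuming the lognormal model and speed estimates for the moving sample (again, the function name is ours):

```python
import math

def z_rt_statistic(y_recent, tau_hat, alpha, beta):
    """Z_T for a moving sample of m examinees (Equations 5-6).

    y_recent : log RTs of the m most recent examinees
    tau_hat  : their estimated speeds (substituted for the true taus)
    alpha    : the item's time discrimination parameter
    beta     : the item's time intensity parameter
    """
    m = len(y_recent)
    t = sum(y_recent) / m                              # observed mean log RT
    mu = sum(beta - tau for tau in tau_hat) / m        # expected mean log RT
    sigma = math.sqrt(1.0 / (m * alpha ** 2))
    return (t - mu) / sigma
```

Extreme negative values flag the item: for example, if the expected mean log RT is 3.0 but the moving sample averages 2.0 with σT = 0.25, the statistic is −4.0.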

IRT-Based Sequential Procedures for Item Scores and RTs

In this paper, we propose three IRT-based sequential procedures that involve the use of item scores and RTs. The first procedure requires that either ZS or ZT be extreme in recent administrations in order for an item to be flagged as potentially compromised, while the second procedure requires that both ZS and ZT be extreme and is therefore more conservative.

The third procedure uses a different approach that involves combining the ZS and ZT statistics to create a new test statistic. By creating a combined statistic, we hope to control the false positive rate (FPR) by producing values that are neither too liberal nor too conservative. One possible way to combine the two statistics is by computing the sum of their squares ( ZS2+ZT2 ). However, the use of the sum of squared statistics is inappropriate for identifying CI, since it implies that extreme negative values of ZS and extreme positive values of ZT indicate potential compromise. Note that these conditions respectively correspond to situations in which an item was much more difficult than expected or answered much more slowly than expected in recent administrations. Therefore, rather than using the sum of squared statistics, we use the test statistic that is given by

$$Z_{ST} = Z_{S+}^2 + Z_{T-}^2, \tag{7}$$

where Z_S+ = max{Z_S, 0} and Z_T− = min{Z_T, 0}. A similar strategy has been used, for example, by Sinharay and Johnson (2020) in the context of detecting EWP. Extreme positive values of ZST indicate that the item was much easier and/or answered much more quickly than expected in recent administrations and may therefore have been compromised.
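The combined statistic in Equation 7 reduces to a one-liner; this sketch (function name ours) makes explicit that only the easier-than-expected and faster-than-expected directions contribute:

```python
def z_combined(z_s, z_t):
    """Z_ST of Equation 7: truncate each statistic toward its compromise
    direction before squaring, so that harder-than-expected scores
    (negative Z_S) and slower-than-expected RTs (positive Z_T) add nothing."""
    return max(z_s, 0.0) ** 2 + min(z_t, 0.0) ** 2
```

For example, ZS = 2 and ZT = −3 yield ZST = 4 + 9 = 13, while ZS = −2 and ZT = 3 yield ZST = 0, as desired.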

Simulation Study

Design

We conducted a simulation study to compare the performance of the IRT- and non-IRT-based sequential procedures using each of the five flagging methods: item scores only (S), item RTs only (T), and item scores and RTs (ST-1, ST-2, and ST-3). Among the methods that involved the use of item scores and RTs, the first method (ST-1) required that either the score-based statistic or the RT-based statistic be extreme, therefore using a union-based approach; the second method (ST-2) required that both the score-based statistic and the RT-based statistic be extreme, therefore using an intersection-based approach; and the third method (ST-3) required that the (combined) score and RT-based statistic be extreme.

Several factors were manipulated in the simulation study, including test length (20, 40), significance level (0.01, 0.05), percentage of CI (10, 20, 40), percentage of EWP (10, 30), and moving sample size (20, 50), which were chosen to be similar to previous research (e.g., Gorney et al., 2025). These five factors were fully crossed. In addition, eight null conditions—one for each combination of test length, significance level, and moving sample size—were simulated, in which none of the items were compromised (and therefore, no examinees had preknowledge). Thus, 56 conditions ((2 × 2 × 3 × 2 × 2) + (2 × 2 × 2)) were studied in total, where each condition was replicated 100 times.

The uncontaminated scores and RTs were generated using the Rasch and lognormal models, respectively, with settings similar to previous research (e.g., Gorney et al., 2025; Gorney & Wollack, 2025). Specifically, for each test length, an item pool that was 10 times the length of the test was simulated in which the item difficulty and time intensity parameters were sampled such that $\begin{bmatrix} b_i \\ \beta_i \end{bmatrix} \sim N\!\left(\begin{bmatrix} 0.00 \\ 3.50 \end{bmatrix}, \begin{bmatrix} 1.00 & 0.20 \\ 0.20 & 0.15 \end{bmatrix}\right)$ and the item time discrimination parameters were sampled such that $\alpha_i \sim U(1.50, 2.50)$. Then, for each replication, the person ability and speed parameters of 5,000 examinees were sampled such that $\begin{bmatrix} \theta_j \\ \tau_j \end{bmatrix} \sim N\!\left(\begin{bmatrix} 0.00 \\ 0.00 \end{bmatrix}, \begin{bmatrix} 1.00 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}\right)$.
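These parameter-generation settings can be sketched as follows (an illustrative reconstruction, not the authors' code; the function name, seed, and use of NumPy are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

def simulate_parameters(test_length, n_examinees=5000):
    """Sample item and person parameters per the Design section (a sketch).

    Pool size is 10 x test length; (b, beta) and (theta, tau) are drawn
    from the bivariate normal distributions given above, and alpha from
    U(1.50, 2.50).
    """
    n_items = 10 * test_length
    item_mean, item_cov = [0.00, 3.50], [[1.00, 0.20], [0.20, 0.15]]
    b, beta = rng.multivariate_normal(item_mean, item_cov, n_items).T
    alpha = rng.uniform(1.50, 2.50, n_items)           # time discrimination
    person_mean, person_cov = [0.00, 0.00], [[1.00, 0.25], [0.25, 0.25]]
    theta, tau = rng.multivariate_normal(person_mean, person_cov, n_examinees).T
    return b, beta, alpha, theta, tau
```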

To generate contaminated scores and RTs, CIs were randomly selected from all items in the item pool. Meanwhile, EWP were selected with probabilities that were proportional to the order in which they were administered the test. Specifically, the first examinee to take the test had the smallest probability of having preknowledge, while the last examinee to take the test had the largest probability of having preknowledge. When EWP were administered CI, the probability of a correct response was set equal to 0.90, and the mean of the log RT distribution was set equal to 75% of its original value, that is, 0.75(βi − τj). Again, all settings were chosen to be similar to those used in previous research (e.g., Gorney et al., 2025; Gorney & Wollack, 2025).
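A sketch of how a contaminated score and log RT might be generated for an EWP on a compromised item under these settings (the function name and use of Python's random module are our own; the log RT standard deviation 1/α follows the lognormal model):

```python
import random

def preknowledge_response(beta, tau, alpha, p_correct=0.90, shrink=0.75):
    """Score and log RT for an EWP answering a compromised item (a sketch).

    The correct-response probability is fixed at 0.90, and the mean log RT
    is reduced to 75% of its original value, 0.75 * (beta - tau).
    """
    x = 1 if random.random() < p_correct else 0
    y = random.gauss(shrink * (beta - tau), 1.0 / alpha)
    return x, y
```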

When administering each CAT, the initial item was randomly selected as one of 10 maximally informative items at θ=0 . Subsequent items were selected using the maximum Fisher information criterion at the interim ability estimate, subject to a maximum item exposure rate of 0.20. All interim and final ability and speed parameter estimates were obtained using maximum likelihood estimation, where the estimates were bounded between −4 and 4.

To determine whether the value of a test statistic should be considered “extreme,” critical values were determined using Monte Carlo simulations. For a given item pool, this process involved simulating a null data set (one in which none of the items were compromised) and computing the test statistics starting at an initial monitoring point of 60. Then, for each item, the maximum value (for score-based statistics and score and RT-based statistics) or minimum value (for RT-based statistics) of each test statistic across time points was recorded. This process of simulating and analyzing null data was repeated 1,000 times, and the resulting 95th and 99th percentiles (for score-based statistics and score and RT-based statistics) or 5th and 1st percentiles (for RT-based statistics) for each item were taken as the critical values at the 0.05 and 0.01 significance levels, respectively.
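The final step of this Monte Carlo process, converting stored null extremes into a critical value, amounts to taking an empirical percentile. A sketch (the function name and the simple index-based percentile are our choices; inputs are assumed to be one extreme statistic per null replication for a given item):

```python
def critical_value(null_extremes, level=0.05, upper=True):
    """Monte Carlo critical value from null-replication extremes (a sketch).

    null_extremes : one max (score-based / combined) or min (RT-based)
                    statistic per null replication for a given item
    upper=True    : use the (1 - level) percentile; upper=False uses the
                    level percentile (appropriate for RT-based statistics)
    """
    s = sorted(null_extremes)
    q = (1.0 - level) if upper else level
    idx = min(int(q * len(s)), len(s) - 1)   # simple empirical percentile
    return s[idx]
```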

Performance was evaluated using the false positive rate (FPR), the true positive rate (TPR), and lag. The FPR was computed as the proportion of secure items that were incorrectly flagged as compromised, while the TPR was computed as the proportion of CI that were correctly flagged as such. Lag was computed as the average number of examinees who were administered an item after it was compromised and before it was flagged.
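The FPR and TPR definitions above can be expressed compactly with set inputs (an illustrative sketch; names are ours). Lag would be computed analogously, by averaging, over the flagged CIs, the number of administrations between compromise and flagging:

```python
def fpr_tpr(flagged, compromised, all_items):
    """FPR and TPR as defined above (a sketch using sets of item IDs).

    FPR: proportion of secure items incorrectly flagged as compromised.
    TPR: proportion of compromised items correctly flagged as such.
    """
    secure = all_items - compromised
    fp = len(flagged & secure)
    tp = len(flagged & compromised)
    fpr = fp / len(secure) if secure else 0.0
    tpr = tp / len(compromised) if compromised else 0.0
    return fpr, tpr
```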

Results

Results indicate that the FPRs and TPRs are similar for the two test lengths (20, 40), so we present the results for a test length of 20 in this section and include the results for a test length of 40 in the Supplemental Materials.

False Positive Rates

The FPRs are shown in Table 1 for a significance level of 0.05 and Table 2 for a significance level of 0.01. In both tables, it can be seen that the FPRs tend to be smaller for the IRT-based sequential procedures than for the non-IRT-based sequential procedures when item compromise is present. One potential explanation for this result is as follows. Consider that when the IRT-based sequential procedures are used, item flagging depends on the ability and/or speed estimates of the examinees. If some of the items have been compromised, the ability and/or speed estimates of the EWP would likely be positively biased. As a result, the examinees may be expected to perform better and faster on the secure items than they actually did, producing a signal that is the opposite of what we are looking for when detecting CI. Therefore, reduced FPRs are expected, especially as the percentage of CI increases.

Table 1.

False Positive Rates (Significance Level = 0.05, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
0 0 20 0.050 0.051 0.099 0.000 0.048 0.047 0.054 0.098 0.000 0.052
50 0.050 0.047 0.094 0.000 0.049 0.051 0.051 0.098 0.000 0.052
10 10 20 0.038 0.036 0.073 0.000 0.034 0.042 0.045 0.085 0.000 0.045
50 0.037 0.033 0.068 0.000 0.034 0.042 0.045 0.086 0.000 0.045
30 20 0.030 0.024 0.053 0.000 0.024 0.042 0.046 0.087 0.000 0.044
50 0.028 0.018 0.046 0.000 0.019 0.039 0.044 0.081 0.000 0.048
20 10 20 0.030 0.026 0.055 0.000 0.028 0.035 0.038 0.072 0.000 0.040
50 0.031 0.024 0.053 0.000 0.024 0.037 0.040 0.075 0.000 0.041
30 20 0.021 0.012 0.033 0.000 0.014 0.035 0.036 0.070 0.000 0.041
50 0.017 0.009 0.026 0.000 0.011 0.031 0.033 0.063 0.000 0.042
40 10 20 0.023 0.021 0.043 0.000 0.021 0.031 0.030 0.060 0.000 0.036
50 0.022 0.018 0.039 0.000 0.018 0.028 0.031 0.058 0.000 0.035
30 20 0.010 0.006 0.016 0.000 0.006 0.024 0.025 0.047 0.000 0.032
50 0.007 0.005 0.012 0.000 0.005 0.021 0.023 0.043 0.000 0.037

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items.

Table 2.

False Positive Rates (Significance Level = 0.01, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
0 0 20 0.010 0.012 0.022 0.000 0.009 0.011 0.013 0.023 0.000 0.012
50 0.010 0.011 0.021 0.000 0.011 0.010 0.012 0.022 0.000 0.010
10 10 20 0.008 0.006 0.015 0.000 0.007 0.009 0.009 0.018 0.000 0.010
50 0.008 0.007 0.014 0.000 0.007 0.009 0.010 0.018 0.000 0.010
30 20 0.006 0.004 0.010 0.000 0.004 0.008 0.010 0.018 0.000 0.009
50 0.005 0.003 0.009 0.000 0.004 0.008 0.010 0.017 0.000 0.010
20 10 20 0.006 0.006 0.012 0.000 0.006 0.007 0.007 0.014 0.000 0.009
50 0.006 0.004 0.010 0.000 0.005 0.007 0.008 0.016 0.000 0.009
30 20 0.004 0.002 0.006 0.000 0.002 0.008 0.008 0.016 0.000 0.009
50 0.003 0.002 0.004 0.000 0.002 0.007 0.007 0.014 0.000 0.009
40 10 20 0.005 0.005 0.010 0.000 0.004 0.006 0.007 0.013 0.000 0.008
50 0.005 0.004 0.009 0.000 0.004 0.006 0.008 0.013 0.000 0.008
30 20 0.002 0.001 0.003 0.000 0.001 0.006 0.005 0.011 0.000 0.007
50 0.001 0.001 0.002 0.000 0.001 0.005 0.005 0.009 0.000 0.009

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items.

The second major finding is that the FPRs tend to be close to or slightly smaller than the nominal level when either the S, T, or ST-3 flagging method is used. However, the FPRs for the ST-1 flagging method tend to be larger than the nominal level, a result that is not surprising given that this method uses a union-based approach. Meanwhile, the FPRs for the ST-2 flagging method are much smaller than the nominal level and, in fact, are very close to or equal to zero. This result is also not surprising given that this method uses an intersection-based approach. Based on all of these results, it would seem that the S, T, and ST-3 flagging methods should be preferred. In addition, practitioners should avoid using the ST-1 flagging method if controlling the FPR is a priority.

True Positive Rates

The TPRs are shown in Table 3 for a significance level of 0.05 and Table 4 for a significance level of 0.01. The general patterns are similar for both significance levels, though the TPRs are larger when the significance level is 0.05 than when it is 0.01, as expected.

Table 3.

True Positive Rates (Significance Level = 0.05, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
10 10 20 0.194 0.640 0.692 0.034 0.669 0.074 0.160 0.220 0.001 0.086
50 0.269 0.722 0.771 0.094 0.728 0.112 0.184 0.268 0.005 0.105
30 20 0.692 0.989 0.991 0.564 0.990 0.161 0.444 0.526 0.013 0.221
50 0.856 0.993 0.994 0.829 0.993 0.351 0.601 0.712 0.094 0.451
20 10 20 0.182 0.528 0.586 0.032 0.546 0.081 0.138 0.203 0.001 0.079
50 0.262 0.600 0.658 0.095 0.598 0.119 0.162 0.252 0.006 0.103
30 20 0.594 0.960 0.970 0.432 0.970 0.144 0.417 0.495 0.013 0.218
50 0.756 0.978 0.984 0.694 0.980 0.344 0.563 0.674 0.098 0.416
40 10 20 0.161 0.301 0.363 0.031 0.319 0.080 0.120 0.186 0.001 0.077
50 0.216 0.346 0.415 0.092 0.360 0.119 0.148 0.228 0.009 0.105
30 20 0.398 0.748 0.780 0.247 0.772 0.144 0.341 0.414 0.014 0.197
50 0.516 0.810 0.841 0.405 0.818 0.326 0.458 0.556 0.104 0.353

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items.

Table 4.

True Positive Rates (Significance Level = 0.01, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
10 10 20 0.060 0.404 0.434 0.009 0.443 0.019 0.036 0.053 0.000 0.025
50 0.106 0.512 0.538 0.033 0.540 0.035 0.044 0.077 0.000 0.027
30 20 0.434 0.970 0.975 0.307 0.983 0.031 0.168 0.190 0.001 0.071
50 0.684 0.988 0.990 0.647 0.991 0.139 0.333 0.413 0.016 0.205
20 10 20 0.068 0.305 0.334 0.011 0.339 0.017 0.036 0.053 0.000 0.017
50 0.116 0.390 0.426 0.037 0.421 0.031 0.049 0.078 0.001 0.030
30 20 0.350 0.905 0.916 0.243 0.929 0.029 0.152 0.175 0.001 0.064
50 0.581 0.951 0.960 0.520 0.960 0.134 0.307 0.376 0.021 0.196
40 10 20 0.061 0.162 0.186 0.008 0.183 0.016 0.034 0.049 0.000 0.018
50 0.113 0.210 0.240 0.050 0.234 0.037 0.048 0.079 0.001 0.031
30 20 0.242 0.603 0.626 0.143 0.644 0.030 0.135 0.159 0.001 0.066
50 0.373 0.699 0.724 0.298 0.732 0.146 0.250 0.316 0.029 0.176

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items.

The results show that the TPRs vary substantially across the different sequential procedures, flagging methods, percentages of CI, percentages of EWP, and moving sample sizes. Across all conditions, the IRT-based sequential procedures display larger TPRs than the non-IRT-based sequential procedures. For example, at a significance level of 0.05, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method range from 0.319 to 0.993, while the TPRs for the non-IRT-based sequential procedure with the ST-3 flagging method range from 0.077 to 0.451. These results align with the general findings of previous researchers who also compared the use of IRT- and non-IRT-based sequential procedures (e.g., Table 6 of Zhang & Li, 2016).

Table 6.

Lag (Significance Level = 0.01, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
10 10 20 384 340 333 495 335 321 319 318 -- 316
50 393 335 325 543 334 334 338 338 -- 337
30 20 348 201 197 394 188 265 324 312 168 284
50 305 176 171 332 167 354 368 350 505 390
20 10 20 383 349 338 625 349 226 258 245 -- 301
50 398 355 339 533 341 348 314 321 403 371
30 20 337 228 223 403 213 237 350 331 143 327
50 296 199 193 328 190 382 386 361 454 397
40 10 20 472 395 374 734 377 244 339 312 -- 298
50 459 377 349 602 356 404 396 389 804 423
30 20 376 279 268 509 256 206 332 309 292 289
50 285 252 239 347 233 397 385 361 448 400

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items; -- = lag cannot be computed when the TPR is 0.

For the IRT-based sequential procedures, the ST-1 flagging method usually displays the largest TPRs and is followed by ST-3, T, S, and then ST-2. For example, when the significance level is 0.05, m = 20, and there are 10% CI and 10% EWP, the ST-1 flagging method achieves a TPR of 0.692, which is slightly larger than that of ST-3 (0.669) and T (0.640) and noticeably larger than that of S (0.194) and ST-2 (0.034). These results are not surprising, given that they agree with most of the patterns that were observed for the FPRs. These results also suggest that the ST-3 flagging method is the most promising method overall, as it tends to display relatively high TPRs while keeping the FPRs close to or below the nominal level.

It is worth noting that while the ST-3 flagging method performed well relative to the other flagging methods, its performance, like that of all methods, varies across the different conditions. From the tables, additional results can be gleaned that are common to both the IRT- and non-IRT-based sequential procedures. First, as the percentage of CI increases, the TPRs tend to decrease, particularly when the percentage of EWP is high. For example, for the IRT-based sequential procedure, the TPRs with the ST-3 flagging method at a significance level of 0.05 range from 0.669 to 0.993 when there are 10% CI, 0.546 to 0.980 when there are 20% CI, and 0.319 to 0.818 when there are 40% CI. For the IRT-based sequential procedures, this result is again related to the fact that item flagging depends on the ability and/or speed estimates of the examinees. If more of the items have been compromised, the ability and/or speed estimates of the EWP would likely be even more biased. As a result, the examinees would be expected to perform better and faster on all items (including those that have been compromised), making it difficult to identify a compromise signal.

An additional finding is that as the percentage of EWP increases, the TPRs increase. For example, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method at a significance level of 0.05 range from 0.319 to 0.728 when there are 10% EWP, while they range from 0.772 to 0.993 when there are 30% EWP. Presumably, the presence of more EWP (30%) produces a stronger compromise signal that is easier to detect.

Finally, as the moving sample size increases, the TPRs also increase. For example, the TPRs for the IRT-based sequential procedure with the ST-3 flagging method at a significance level of 0.05 range from 0.319 to 0.990 when the moving sample size is 20, while they range from 0.360 to 0.993 when the moving sample size is 50. These results align with the findings of previous researchers who also found that the use of larger moving sample sizes generally improves the detection of CI (e.g., Table 6 of Zhang & Li, 2016).

Lag

The lag is shown in Table 5 for a significance level of 0.05 and Table 6 for a significance level of 0.01. It is important to note that because the lag was computed across only the CIs that were flagged, the results are directly related to the TPRs. Thus, both the TPRs and the lag are shown in Figures 1 and 2 for significance levels of 0.05 and 0.01, respectively, to illustrate the relationships between the two outcomes.

Table 5.

Lag (Significance Level = 0.05, Test Length = 20).

%CI % EWP m IRT-based Non-IRT-based
S T ST-1 ST-2 ST-3 S T ST-1 ST-2 ST-3
10 10 20 293 295 275 404 285 275 309 282 161 267
50 314 286 265 422 280 299 308 293 227 323
30 20 282 163 156 336 154 270 314 289 234 293
50 253 145 137 284 142 316 311 286 402 342
20 10 20 340 301 286 493 294 241 281 260 374 284
50 323 295 272 447 285 301 300 290 436 313
30 20 277 183 174 333 175 263 309 287 295 311
50 251 166 155 287 161 321 317 287 409 344
40 10 20 358 312 281 596 300 273 302 281 444 294
50 337 305 271 525 286 351 347 320 521 345
30 20 275 228 214 390 216 261 310 285 314 296
50 240 209 194 289 201 324 322 282 421 343

Note. IRT = item response theory; EWP = examinees with preknowledge; CI = compromised items.

Figure 1.

True positive rates and lag (significance level = 0.05, test length = 20, m = 50).

Figure 2.

True positive rates and lag (significance level = 0.01, test length = 20, m = 50).

The results show that the lag tends to be shorter for the IRT-based sequential procedures than for the non-IRT-based sequential procedures. When taken together with the results for the TPRs, this result is encouraging, as it suggests that not only are the IRT-based sequential procedures able to detect more CI, but also that they tend to do so in shorter amounts of time.

For the IRT-based sequential procedures, the lag tends to be shortest when the ST-1 flagging method is used, followed by ST-3, T, S, and then ST-2. In addition, as the percentage of CI increases, the lag tends to increase, whereas as the percentage of EWP or the moving sample size increases, the lag tends to decrease. It is interesting to note that all of these results align with those of the TPRs. In particular, larger TPRs tend to correspond to shorter lag times.

Discussion

In summary, this study proposes and evaluates three IRT-based sequential procedures that use item scores and RTs to detect CI in CAT. The first procedure (ST-1) requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure (ST-2) requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure (ST-3) requires that a (combined) score and RT-based statistic be extreme. Results suggest that the ST-3 flagging method is the most promising method overall, as it tends to produce relatively high TPRs and relatively short lag times while keeping the FPRs close to the nominal level. By contrast, the ST-1 flagging method tends to produce FPRs that are unreasonably large, while the ST-2 flagging method tends to produce TPRs that are unreasonably small. The ST-3 flagging method also tends to perform better than both the item scores only method (S) and the item RTs only method (T), as well as all five of the non-IRT-based sequential procedures—including those that are based on item scores and RTs.

Some guidelines for implementing the proposed methods can be suggested. When item scores and RTs are available, use of the IRT-based sequential procedure with the ST-3 flagging method is recommended, as it appears to provide the best balance between the FPR and the TPR while also producing relatively short lag times. The selection of the moving sample size is also important, as the use of a larger moving sample size tends to improve CI detection. In addition, for the IRT-based sequential procedures, the use of a larger moving sample size also corresponds to shorter lag times. Thus, optimal selection of the moving sample size should be guided by the sequential procedure being used as well as the specific CAT configuration (e.g., test length, item selection method, ability estimation method), the total number of examinees, and the expected performance on various outcome measures. Finally, the significance level plays an important role. For both the IRT- and non-IRT-based sequential procedures, the use of a larger significance level, such as 0.05, tends to increase the likelihood of detecting CI and reduce the lag. However, it also increases the risk of incorrectly flagging a secure item as compromised.

When implementing any of the sequential procedures used in this study, item scores and/or RTs are continuously monitored to detect unexpected changes in performance. Thus, these procedures can be integrated into existing item selection algorithms, provided the CAT system allows interaction between the detection mechanism and the item selection process. Specifically, once an item is flagged as potentially compromised, it can be suspended from future administrations. The flagged item may remain inactive until further evidence is gathered to determine whether it should be permanently removed from the item pool, minimizing the impact of preknowledge and ultimately improving test fairness and security.

This study has several limitations that can be investigated in future research. First, the methods proposed in this paper were evaluated using only simulated data. In the future, it would be useful to apply these methods to real data that contain known compromise. Second, in our simulation study, we only examined a limited set of simulation conditions. Future researchers could investigate additional simulation conditions, including those involving different item pools, item selection methods, item exposure control methods, and ability estimation methods. Third, this study focused solely on comparing sequential procedures for detecting CIs in CAT. Future researchers could compare the performance of these methods to other methods for detecting CIs (e.g., Du et al., 2023; Du & Zhang, 2025; Kang, 2023; Lee & Lewis, 2021; Lee et al., 2014; Toton & Maynes, 2019; van der Linden, 2022; van der Linden & Belov, 2023; Wang & Liu, 2020) or methods that are designed to detect CIs and EWP simultaneously (e.g., Belov, 2014; Pan et al., 2022). Finally, because the proposed methods assume that an IRT model fits the data, they are inappropriate for exams that do not use IRT (e.g., small-volume exams). In such cases, non-IRT-based sequential procedures may be considered instead.

Supplemental Material

sj-pdf-1-epm-10.1177_00131644251368335 – Supplemental material for Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing by Chansoon Lee, Kylie Gorney and Jianshen Chen in Educational and Psychological Measurement

Appendix

Non-IRT-Based Sequential Procedures for Item Scores Only

Choe et al. (2018) and Zhang (2014) proposed non-IRT-based sequential procedures that test whether a dichotomous item is getting easier over time. Unlike the IRT-based sequential procedure, the non-IRT-based sequential procedures monitor the difficulty of an item by comparing the proportion of correct responses for a moving sample of examinees to the proportion of correct responses for a reference sample of examinees. The moving sample is taken as the most recent $m$ of the $n$ examinees who were administered the item, while the reference sample is taken as all other examinees who were administered the item. Thus, the reference sample contains the first $n - m$ examinees who were administered the item, and the proportions of correct responses for the moving and reference samples are given by

$$\hat{\mu}_{X(\mathrm{mov})} = \frac{1}{m}\sum_{j=n-m+1}^{n} X_j \quad \text{and} \quad \hat{\mu}_{X(\mathrm{ref})} = \frac{1}{n-m}\sum_{j=1}^{n-m} X_j, \tag{A1}$$

respectively.

Under the null hypothesis that the item has not been compromised, the test statistic

$$W_S = \frac{\hat{\mu}_{X(\mathrm{mov})} - \hat{\mu}_{X(\mathrm{ref})}}{\sqrt{\mu_X\left(1-\mu_X\right)\left(\dfrac{1}{m}+\dfrac{1}{n-m}\right)}} \tag{A2}$$

has an asymptotic standard normal distribution, where $\mu_X$ denotes the pooled population proportion of correct responses. In practice, the pooled population proportion is unknown, and therefore, $W_S$ cannot be computed. Instead, Zhang (2014) used the approximation that is given by replacing $\mu_X$ with the reference sample proportion $\hat{\mu}_{X(\mathrm{ref})}$, while Choe et al. (2018) used the approximation that is given by replacing $\mu_X$ with the pooled sample proportion

$$\hat{\mu}_X = \frac{m\hat{\mu}_{X(\mathrm{mov})} + (n-m)\hat{\mu}_{X(\mathrm{ref})}}{m+(n-m)} = \frac{1}{n}\sum_{j=1}^{n} X_j,$$

which is the approximation that we also use in this paper.

To monitor the difficulty of the item across multiple time points, one can compute $W_S$ each time the item is administered. Extreme positive values of $W_S$ indicate that the item was much easier than expected in recent administrations and may therefore have been compromised.

Non-IRT-Based Sequential Procedure for Item RTs Only

Choe et al. (2018) proposed a non-IRT-based sequential procedure that tests whether an item is being answered more quickly over time. The procedure monitors the time intensity of the item by comparing the average log RT for a moving sample of examinees to the average log RT for a reference sample of examinees. The average log RTs for the moving and reference samples are given by

$$\hat{\mu}_{Y(\mathrm{mov})} = \frac{1}{m}\sum_{j=n-m+1}^{n} Y_j \quad \text{and} \quad \hat{\mu}_{Y(\mathrm{ref})} = \frac{1}{n-m}\sum_{j=1}^{n-m} Y_j, \tag{A3}$$

respectively.

If it can be assumed that the two population variances are equal ($\sigma_{Y(\mathrm{mov})}^2 = \sigma_{Y(\mathrm{ref})}^2$), then under the null hypothesis that the item has not been compromised, the test statistic

$$W_T = \frac{\hat{\mu}_{Y(\mathrm{mov})} - \hat{\mu}_{Y(\mathrm{ref})}}{\sqrt{\hat{\sigma}_Y^2\left(\dfrac{1}{m}+\dfrac{1}{n-m}\right)}} \tag{A4}$$

has a $t$ distribution with $n - 2$ degrees of freedom, where $\hat{\sigma}_Y^2$ denotes the pooled sample variance that is given by

$$\hat{\sigma}_Y^2 = \frac{(m-1)\hat{\sigma}_{Y(\mathrm{mov})}^2 + (n-m-1)\hat{\sigma}_{Y(\mathrm{ref})}^2}{n-2}$$

for the sample variances

$$\hat{\sigma}_{Y(\mathrm{mov})}^2 = \frac{\sum_{j=n-m+1}^{n}\left(Y_j - \hat{\mu}_{Y(\mathrm{mov})}\right)^2}{m-1} \quad \text{and} \quad \hat{\sigma}_{Y(\mathrm{ref})}^2 = \frac{\sum_{j=1}^{n-m}\left(Y_j - \hat{\mu}_{Y(\mathrm{ref})}\right)^2}{n-m-1}.$$

To monitor the time intensity of the item across multiple time points, one can compute $W_T$ each time the item is administered. Extreme negative values of $W_T$ indicate that the item was answered much more quickly than expected in recent administrations and may therefore have been compromised.

Non-IRT-Based Sequential Procedures for Item Scores and RTs

Choe et al. (2018) proposed three non-IRT-based sequential procedures that involve the use of item scores and RTs. The first procedure requires that either $W_S$ or $W_T$ be extreme for an item to be flagged as potentially compromised, while the second procedure requires that both $W_S$ and $W_T$ be extreme and is therefore more conservative.

The third procedure uses a different approach that involves the creation of a new test statistic. The idea is to monitor the difficulty and time intensity of the item simultaneously by comparing both the proportion of correct responses and the average log RT across the moving and reference samples of examinees. For each sample, the proportion of correct responses and the average log RT can be combined into a single vector as

$$\hat{\mu}_{(\mathrm{mov})} = \begin{bmatrix} \hat{\mu}_{X(\mathrm{mov})} \\ \hat{\mu}_{Y(\mathrm{mov})} \end{bmatrix} \quad \text{and} \quad \hat{\mu}_{(\mathrm{ref})} = \begin{bmatrix} \hat{\mu}_{X(\mathrm{ref})} \\ \hat{\mu}_{Y(\mathrm{ref})} \end{bmatrix}. \tag{A5}$$

Under the null hypothesis that the item has not been compromised, the test statistic

$$W_{ST} = \frac{n-3}{2(n-2)}\,H \tag{A6}$$

has an $F$ distribution with 2 and $n - 3$ degrees of freedom, where $H$ denotes the two-sample Hotelling's $T^2$ statistic that is given by

$$H = \left[\hat{\mu}_{(\mathrm{mov})} - \hat{\mu}_{(\mathrm{ref})}\right]^{\top} \left[\hat{\Sigma}\left(\frac{1}{m}+\frac{1}{n-m}\right)\right]^{-1} \left[\hat{\mu}_{(\mathrm{mov})} - \hat{\mu}_{(\mathrm{ref})}\right]$$

for the pooled sample covariance matrix

$$\hat{\Sigma} = \frac{m-1}{n-2}\hat{\Sigma}_{(\mathrm{mov})} + \frac{n-m-1}{n-2}\hat{\Sigma}_{(\mathrm{ref})},$$

the sample covariance matrices

$$\hat{\Sigma}_{(\mathrm{mov})} = \begin{bmatrix} \hat{\sigma}_{X(\mathrm{mov})}^2 & \hat{\sigma}_{XY(\mathrm{mov})} \\ \hat{\sigma}_{XY(\mathrm{mov})} & \hat{\sigma}_{Y(\mathrm{mov})}^2 \end{bmatrix} \quad \text{and} \quad \hat{\Sigma}_{(\mathrm{ref})} = \begin{bmatrix} \hat{\sigma}_{X(\mathrm{ref})}^2 & \hat{\sigma}_{XY(\mathrm{ref})} \\ \hat{\sigma}_{XY(\mathrm{ref})} & \hat{\sigma}_{Y(\mathrm{ref})}^2 \end{bmatrix},$$

the sample variances

$$\hat{\sigma}_{X(\mathrm{mov})}^2 = \frac{m\,\hat{\mu}_{X(\mathrm{mov})}\left(1-\hat{\mu}_{X(\mathrm{mov})}\right)}{m-1} \quad \text{and} \quad \hat{\sigma}_{X(\mathrm{ref})}^2 = \frac{(n-m)\,\hat{\mu}_{X(\mathrm{ref})}\left(1-\hat{\mu}_{X(\mathrm{ref})}\right)}{n-m-1},$$

and the sample covariances $\hat{\sigma}_{XY(\mathrm{mov})}$ and $\hat{\sigma}_{XY(\mathrm{ref})}$. Extreme positive values of $W_{ST}$ indicate that the item was much easier and/or answered much more quickly than expected in recent administrations and may therefore have been compromised.

For all three procedures, Choe et al. (2018) further required that an item be easier than expected ($W_S > 0$) and answered more quickly than expected ($W_T < 0$) in recent administrations to be flagged as potentially compromised. In this paper, we remove this requirement to allow for a fairer comparison to the IRT-based sequential procedures, which also do not impose this requirement.
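The combined statistic can be sketched in the same style. Note that for 0/1 scores, the unbiased sample variance computed by `np.cov` equals the binomial form $m\hat{\mu}(1-\hat{\mu})/(m-1)$ used in the text, so the generic covariance routine suffices for both variables.

```python
import numpy as np

def w_st(scores, log_rts, m):
    """Compute W_ST (Eq. A6) from a two-sample Hotelling's T^2 comparing
    (proportion correct, mean log RT) across the moving and reference
    samples. scores: 0/1 scores; log_rts: log RTs; m: moving sample size."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(log_rts, dtype=float)
    n = len(x)
    mov = np.stack([x[n - m:], y[n - m:]])   # 2 x m (variables in rows)
    ref = np.stack([x[: n - m], y[: n - m]])  # 2 x (n - m)
    d = mov.mean(axis=1) - ref.mean(axis=1)
    # pooled 2x2 sample covariance matrix (np.cov uses ddof=1 by default)
    sigma = ((m - 1) * np.cov(mov) + (n - m - 1) * np.cov(ref)) / (n - 2)
    h = d @ np.linalg.inv(sigma * (1 / m + 1 / (n - m))) @ d
    return (n - 3) / (2 * (n - 2)) * h  # compared against F(2, n - 3)
```

Because $H$ is a quadratic form in a positive-definite matrix, $W_{ST}$ is nonnegative, and only its upper tail is used for flagging.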

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material: Supplemental material for this article is available online.

References

  1. Belov D. I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37–58.
  2. Choe E. M., Zhang J., Chang H.-H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83(3), 650–673. 10.1007/s11336-017-9596-3
  3. Du Y., Zhang S. (2025). Detecting compromised items with response times using a Bayesian change-point approach. Journal of Educational and Behavioral Statistics, 50(2), 296–330. 10.3102/10769986241290713
  4. Du Y., Zhang S., Chang H.-H. (2023). Compromised item detection: A Bayesian change-point perspective. British Journal of Mathematical and Statistical Psychology, 76(1), 131–153. 10.1111/bmsp.12286
  5. Gorney K., Lee C., Chen J. (2025). A score-based method for detecting item compromise and preknowledge in computerized adaptive testing. Journal of Computerized Adaptive Testing, 12(2), 123–136. 10.7333/2506-1202123
  6. Gorney K., Wollack J. A. (2025). Using response times in answer similarity analysis. Journal of Educational and Behavioral Statistics, 50(3), 449–470. 10.3102/10769986241248770
  7. Kang H.-A. (2023). Sequential generalized likelihood ratio tests for online item monitoring. Psychometrika, 88(2), 672–696. 10.1007/s11336-022-09871-9
  8. Lee Y.-H., Lewis C. (2021). Monitoring item performance with CUSUM statistics in continuous testing. Journal of Educational and Behavioral Statistics, 46(5), 611–648. 10.3102/1076998621994563
  9. Lee Y.-H., Lewis C., von Davier A. A. (2014). Monitoring the quality and security of multistage tests. In Yan D., von Davier A. A., Lewis C. (Eds.), Computerized multistage testing: Theory and applications (pp. 285–300). CRC Press.
  10. Pan Y., Sinharay S., Livne O., Wollack J. A. (2022). A machine learning approach for detecting item compromise and preknowledge in computerized adaptive testing. Psychological Test and Assessment Modeling, 64(4), 385–424.
  11. Sinharay S., Johnson M. S. (2020). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73(3), 397–419. 10.1111/bmsp.12187
  12. Toton S. L., Maynes D. D. (2019). Detecting examinees with pre-knowledge in experimental data using conditional scaling of response times. Frontiers in Education, 4, Article 49. 10.3389/feduc.2019.00049
  13. van der Linden W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. 10.3102/10769986031002181
  14. van der Linden W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. 10.1007/s11336-006-1478-z
  15. van der Linden W. J. (2022). Two statistical tests for the detection of item compromise. Journal of Educational and Behavioral Statistics, 47(4), 485–504. 10.3102/10769986221094789
  16. van der Linden W. J., Belov D. I. (2023). A statistical test for the detection of item compromise combining responses and response times. Journal of Educational Measurement, 60(2), 235–254. 10.1111/jedm.12346
  17. Veerkamp W. J. J., Glas C. A. W. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25(4), 373–389. 10.3102/10769986025004373
  18. Wang X., Liu Y. (2020). Detecting compromised items using information from secure items. Journal of Educational and Behavioral Statistics, 45(6), 667–689. 10.3102/1076998620912549
  19. Zhang J. (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38(2), 87–104. 10.1177/0146621613510062
  20. Zhang J., Li J. (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53(2), 131–151. 10.1111/jedm.12104



Articles from Educational and Psychological Measurement are provided here courtesy of SAGE Publications
