Applied Psychological Measurement
2015 Oct 27;40(2):142–156. doi: 10.1177/0146621615611633

On Computing the Key Probability in the Stochastically Curtailed Sequential Probability Ratio Test

Alan R. Huebner, Matthew D. Finkelman
PMCID: PMC5982174  PMID: 29881044

Abstract

The Stochastically Curtailed Sequential Probability Ratio Test (SCSPRT) is a termination criterion for computerized classification tests (CCTs) that has been shown to be more efficient than the well-known Sequential Probability Ratio Test (SPRT). The performance of the SCSPRT depends on computing the probability that at a given stage in the test, an examinee’s current interim classification status will not change before the end of the test. Previous work discusses two methods of computing this probability, an exact method in which all potential responses to remaining items are considered and an approximation based on the central limit theorem (CLT) requiring less computation. Generally, the CLT method should be used early in the test when the number of remaining items is large, and the exact method is more appropriate at later stages of the test when few items remain. However, there is currently a dearth of information as to the performance of the SCSPRT when using the two methods. For the first time, the exact and CLT methods of computing the crucial probability are compared in a simulation study to explore whether there is any effect on the accuracy or efficiency of the CCT. The article is focused toward practitioners and researchers interested in using the SCSPRT as a termination criterion in an operational CCT.

Keywords: computerized testing, classification, item response theory

Introduction

Computerized classification tests (CCTs) are assessments that aim to categorize an examinee based on his or her responses to a set of items. CCTs are often used for professional certification exams that classify examinees as masters or non-masters (i.e., passes or fails). CCT methodology has been a consistently fertile area of psychometric research over the past several decades, and there has been much inquiry into and many innovations proposed for CCT item selection (e.g., Eggen, 1999; Lin, 2011; Spray & Reckase, 1994; Thompson, 2009; Van Groen, Eggen, & Veldkamp, 2014), termination rules (e.g., Finkelman, 2008, 2010; Thompson, 2011; Weiss & Kingsbury, 1984), and test security via item exposure or overlap control (e.g., Chen, Lei, Chen, & Liu, 2014; Huebner, 2012). Although it is possible to construct a CCT using a variety of measurement models to classify examinees into one of two or more categories, much research has been directed toward CCTs based on item response theory (IRT) models for dichotomously scored items that classify examinees into one of exactly two categories. This type of CCT will be the focus of the present study.

An essential consideration for the design of a CCT is its efficiency or ability to classify examinees correctly using a minimal number of items. Appropriate item selection rules may bolster the efficiency of CCTs, and termination rules can increase efficiency by stopping CCTs when an accurate classification can be made; the Sequential Probability Ratio Test (SPRT; Wald, 1947) and ability confidence intervals (ACI; Weiss & Kingsbury, 1984) are two well-known examples. This article will focus on a variation of the SPRT proposed by Finkelman (2008) for cases in which a maximum number of items is imposed on the CCT, a common practice in operational settings. (Some authors use the term truncated SPRT to describe this case, but the present study will use SPRT for simplicity.) Finkelman (2008) showed that the SPRT is inefficient in this situation; specifically, the SPRT may allow items to be administered even if the responses will not affect the ultimate classification decision. Thus, the exposure of these items is unnecessary, and he proposed the more efficient Stochastically Curtailed SPRT (SCSPRT). In a simulation study using item parameters from an operational CCT using item exposure and content balancing, the SCSPRT reduced average test length (ATL) by approximately 7 to 10 items compared with the SPRT for examinees with ability levels near the pass/fail threshold with virtually no loss of accuracy.

To facilitate further discussion of the SCSPRT and describe the purpose of this article, some notation must be introduced. Suppose a given CCT allows an examinee to respond to a maximum of J items, and let DJ denote the classification decision after the administration of all J items. This ultimate decision may be DJ=m or n, indicating mastery or non-mastery, respectively. Also, let Dk* represent the tentative, or interim, classification decision at Stage k of the test, after the administration of the kth item. Similar to DJ, Dk* may be m or n. As will be described in detail in the next section, at each stage, the SCSPRT computes P(Dk*=DJ), the probability that the current classification decision will not change before the test is terminated after item J. If P(Dk*=DJ) is greater than a pre-specified probability threshold γ, the test is stopped at Stage k even if the more conservative SPRT criterion does not indicate termination.

Clearly, the computation of P(Dk*=DJ) is a crucial element of the SCSPRT; Finkelman (2008) proposed two methods of computing it. The authors of the present study give a conceptual overview of the methods here and save the detailed explanation for the next section. At Stage k of the test, P(Dk*=DJ) may be computed exactly by considering all possible response patterns to the J − k remaining items. However, if J − k is large, the exact method may become computationally burdensome, and a central limit theorem (CLT) approximation may be used for simplicity. As will be described in the next section, the CLT method does not explicitly take into account each possible remaining response pattern. Rather, it uses a normal approximation to compute P(Dk*=DJ) conditional on the examinee’s interim classification status. Hereafter, P(Dk*=DJ) calculated under the exact and CLT methods are referred to as PExact and PCLT, respectively. This article will examine the relation between PExact and PCLT from a practical perspective, geared toward researchers and psychometricians interested in using the SCSPRT method to increase the efficiency of an operational CCT. Although Finkelman (2008) demonstrated the general efficacy of the SCSPRT in reducing ATLs, many issues need to be investigated if the SCSPRT were to be used in an operational setting.

For example, consider a CCT having a maximum of J=50 dichotomously scored items. Computing PExact at Stage 30 of the test, when J − k = 20, would require evaluating the probabilities of 2^20 potential response patterns. This may be too computationally burdensome for an operational CCT, as it is undesirable for examinees to experience a noticeable delay between the administrations of items. Thus, in this situation, the use of PCLT would be preferable. However, when J − k is small, say 5, PCLT may not be accurate, as the CLT generally performs better with larger sample sizes. Moreover, PExact could be computed easily in this case, as there are only 2^5 = 32 response patterns for which to account. Thus, it is reasonable that PCLT should be used early in the test whereas PExact should be used at later stages, but many as-yet-unanswered concerns would be raised by practitioners. This research addresses the following questions:

  • Research Question 1: Is there an optimal test stage for switching from using PCLT to PExact?

  • Research Question 2: Does one method consistently under- or overestimate the true probability that the classification decision will remain unchanged?

  • Research Question 3: How are the overall ATLs and classification accuracy affected by PCLT versus PExact?

  • Research Question 4: Do the answers to these questions depend on the quality of the item bank or other elements of the CCT design?

Moreover, this is the first study that compares PCLT and PExact with the quantity PTrue. As will be explained in the following sections, PTrue is the true probability of keeping the same classification decision, based on the true value of the examinee’s latent ability. Although PTrue would not be known in practice, it is treated in this simulation study as a gold standard with which PCLT and PExact are compared.

Method

IRT

The three-parameter logistic (3PL) IRT model quantifies the interaction between an item and an examinee with a given ability level. Specifically, let Xj denote the response of an examinee to item j, where Xj=1 or 0 indicates a correct or incorrect response, respectively, and θ is the latent ability level of the examinee. Then, the item response function (IRF) of the 3PL is given by

P(X_j = 1 | θ) = c_j + (1 − c_j) / (1 + exp[−1.7 a_j (θ − b_j)]),    (1)

where a_j, b_j, and c_j are the discrimination, difficulty, and pseudo-chance parameters of item j, respectively (Birnbaum, 1968). It is usually assumed that the responses to a series of k items for a particular examinee are independent conditional on the examinee’s θ level (Lord, 1980). Then, the likelihood function for the responses to items 1, …, k from a particular examinee is given by

L(X_k; θ) = Π_{j=1}^{k} L(X_j; θ) = Π_{j=1}^{k} P(X_j = 1 | θ)^{X_j} [1 − P(X_j = 1 | θ)]^{1 − X_j}.    (2)

Note that the boldface notation is used to denote a vector of responses, X_k = (X_1, …, X_k), while X_j represents the response to a single item.
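To make Equations 1 and 2 concrete, the IRF and likelihood can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are ours, not the article’s.

```python
import math

def irf_3pl(theta, a, b, c):
    """3PL item response function P(X_j = 1 | theta), with the 1.7 scaling constant."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def likelihood(responses, items, theta):
    """Likelihood of a dichotomous response vector under local independence.

    responses: list of 0/1 scores; items: list of (a, b, c) parameter tuples.
    """
    L = 1.0
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        L *= p ** x * (1 - p) ** (1 - x)
    return L
```

At θ = b_j, the IRF equals c_j + (1 − c_j)/2, halfway between the pseudo-chance floor and 1.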

SPRT

To classify examinees into two categories, a cut point on the θ scale is required. The authors denote this value θ0; examinees with θ ≥ θ0 are deemed masters, and non-masters otherwise. Reflecting the fact that classifications cannot be perfect because θ is a latent variable, a small quantity, δ, is added to and subtracted from θ0 to create an “indifference” region. The authors define θ+ = θ0 + δ and θ− = θ0 − δ as the upper and lower bounds of the indifference region, and misclassifications of examinees with true θ in this range are not considered severe errors. The SPRT formulates the classification decision as accepting a null (H0) or alternative (Ha) hypothesis, H0: θ = θ− versus Ha: θ = θ+. Under this formulation, the null and alternative serve as surrogates for the hypotheses θ ≤ θ− and θ ≥ θ+, respectively (Spray & Reckase, 1996). If Ha is accepted, the examinee is classified as a master (pass); if H0 is accepted, the examinee is judged a non-master (fail). The decision is based on the log of the likelihood ratio at Stage k of the test, defined as

log λ_k = log [ L(X_k; θ+) / L(X_k; θ−) ] = Σ_{j=1}^{k} log [ L(X_j; θ+) / L(X_j; θ−) ].    (3)

Also, the desired Type I (judging a non-master as a master) and Type II (judging a master as a non-master) error rates are specified and denoted as α and β, respectively. In addition, define constants A = α/(1 − β) and B = (1 − α)/β such that A < 1 < B. Then, the SPRT termination criterion proceeds as follows:

  1. At Stage k, calculate logλk as shown in Equation 3.

  2. If log λ_k ≥ log B or log λ_k ≤ log A, the test terminates with an m or n decision, respectively.

  3. If the test is not terminated by the criterion in Step 2, proceed to Stage k+1 by administering the next item and repeat Steps 1 to 2.

  4. If the maximum number of items J is administered without a classification being made, a “forced” decision results. If log λ_J ≥ C or log λ_J < C, the test terminates with an m or n decision, respectively, where C = (log A + log B)/2. (Note that if α and β are equal, C simplifies to zero.)
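The steps above can be sketched as a pair of small helpers (an illustration, not code from the article; we assume A = α/(1 − β) and B = (1 − α)/β as defined above). With α = β = .05, log A ≈ −2.94, log B ≈ 2.94, and C = 0.

```python
import math

def sprt_step(log_lr_k, alpha=0.05, beta=0.05):
    """One SPRT check at stage k: returns 'm', 'n', or None (administer another item)."""
    log_A = math.log(alpha / (1 - beta))
    log_B = math.log((1 - alpha) / beta)
    if log_lr_k >= log_B:
        return 'm'
    if log_lr_k <= log_A:
        return 'n'
    return None

def forced_decision(log_lr_J, alpha=0.05, beta=0.05):
    """Step 4: forced classification after the last item, using the midpoint C."""
    C = (math.log(alpha / (1 - beta)) + math.log((1 - alpha) / beta)) / 2
    return 'm' if log_lr_J >= C else 'n'
```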

The individual elements of the SPRT discussed above, α, β, θ0, and δ, are quantities that would be determined in practice by collaboration between psychometricians, subject-matter experts, and the testing agency. In recent CCT literature, δ values between .1 and .3 have been common. For example, Finkelman (2008, 2010) used δ=.2; Huebner and Fina (2015) used δ=.1 and .2; Lin (2011) examined δ=.1, .2, and .3; and Eggen (1999) conducted simulations in which δ was varied from .10 to .23 by increments of .01. Also, all simulations in those articles set α=β=.05. Values of θ0 used in the above references range from −1.32 to 1.50. Huebner and Fina (2015) presented a simulation study in which θ0 was set to −1, 0, and 1 in various testing conditions. On average, conditions with θ0=0 tended to have larger ATLs and lower classification accuracy than conditions with θ0=−1 or 1.

SCSPRT

As mentioned in the previous section, the SCSPRT termination criterion relies on the calculation of P(Dk*=DJ) to increase test efficiency compared with the SPRT. The authors first give a general description of the method and then provide details on computing the exact and CLT approximation versions. The SCSPRT is implemented according to the following steps:

  1. At Stage k, calculate logλk as shown in Equation 3.

  2. If log λ_k ≥ log B or log λ_k ≤ log A, the test terminates with an m or n decision, respectively.

  3. If the test is not terminated at Stage k by the criterion in Step 2, then calculate P(Dk*=DJ). If P(Dk*=DJ) ≥ γ, the test terminates with an m decision if log λ_k ≥ C or an n decision if log λ_k < C.

  4. If the test is not terminated by the criteria in Steps 2 or 3, proceed to Stage k+1 by administering the next item, and repeat Steps 1 to 3.

  5. If the maximum number of items J is administered without a classification being made, a forced decision results as described in the previous subsection for the SPRT.

In summary, the key difference between the SCSPRT and the traditional SPRT is that the SCSPRT aims to shorten the exam by taking into account potential future items, whereas the SPRT does not. The SCSPRT uses the SPRT within it, but it is allowed to terminate the exam earlier by calculating P(Dk*=DJ). The authors note that the SCSPRT terminates the exam whenever the SPRT does, and thus, the SCSPRT never administers more items to a given examinee than the SPRT. Also, note that as γ increases, the SCSPRT is less likely to terminate the exam before the SPRT. In many practical settings, it would be desirable to maintain classification accuracy, and hence, a conservative value (close to 1) should be chosen. Past studies on the SCSPRT (Finkelman, 2008, 2010) have used γ=.95.
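Steps 1 to 3 of the SCSPRT reduce to a single check per stage, sketched below (illustrative only; p_keep stands for P(Dk*=DJ), supplied by either the exact or the CLT computation described in the next subsection, and the constants follow the SPRT definitions above).

```python
import math

def scsprt_step(log_lr_k, p_keep, gamma=0.95, alpha=0.05, beta=0.05):
    """SCSPRT check at stage k: returns 'm', 'n', or None (continue testing).

    p_keep is P(D*_k = D_J), computed externally by the exact or CLT method."""
    log_A = math.log(alpha / (1 - beta))
    log_B = math.log((1 - alpha) / beta)
    C = (log_A + log_B) / 2
    if log_lr_k >= log_B:          # Step 2: ordinary SPRT termination
        return 'm'
    if log_lr_k <= log_A:
        return 'n'
    if p_keep >= gamma:            # Step 3: stochastic curtailment
        return 'm' if log_lr_k >= C else 'n'
    return None
```

Because the SPRT bounds are checked first, the SCSPRT stops whenever the SPRT would, and the curtailment branch can only stop the test earlier.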

Computation of PExact and PCLT

The authors now discuss the computation of P(Dk*=DJ), starting with the exact version, PExact. First, at Stage k, the current interim classification is obtained by setting Dk* to m or n if log λ_k ≥ C or log λ_k < C, respectively. Then, with J − k remaining items, there are 2^(J−k) possible response patterns. The probability of each response pattern may be computed using Equations 1 and 2, given the parameters of the remaining items and a θ value. It is assumed that an operational CCT has a well-calibrated item bank, so the only issue is choosing a θ value to input. Finkelman (2008) used a conservative approach originating in the field of clinical trials (Lan, Simon, & Halperin, 1982). This approach, termed “conditional power” by Jennison and Turnbull (2000), sets θ = θ+ if Dk* = n, and θ = θ− if Dk* = m. In addition to the probability of each potential response pattern, the 2^(J−k) corresponding potential values of λ_J are computed. Those values are used to indicate potential classification decisions, and the probabilities of the decisions that match Dk* are summed to obtain PExact.
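A brute-force sketch of this exact computation follows (our own code, not the authors’): enumerate every remaining response pattern, weight it by its probability under the conditional-power θ, and sum the probabilities of patterns whose final decision matches the interim one.

```python
import itertools
import math

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def p_exact(log_lr_k, remaining, theta_eval, theta_minus, theta_plus, C=0.0):
    """Exact P(D*_k = D_J) by enumerating all 2^(J-k) remaining patterns.

    remaining: (a, b, c) tuples for unadministered items; theta_eval is
    theta_plus if the interim decision is n, theta_minus if it is m."""
    decision_k = 'm' if log_lr_k >= C else 'n'
    total = 0.0
    for pattern in itertools.product((0, 1), repeat=len(remaining)):
        prob, log_lr = 1.0, log_lr_k
        for x, (a, b, c) in zip(pattern, remaining):
            p_hi = irf_3pl(theta_plus, a, b, c)
            p_lo = irf_3pl(theta_minus, a, b, c)
            p_ev = irf_3pl(theta_eval, a, b, c)
            prob *= p_ev if x else 1 - p_ev   # probability of this pattern
            log_lr += math.log(p_hi / p_lo) if x else math.log((1 - p_hi) / (1 - p_lo))
        if ('m' if log_lr >= C else 'n') == decision_k:
            total += prob
    return total
```

The loop cost doubles with each remaining item, which is precisely why PExact is reserved for late test stages.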

Note that if the true value of θ were known, it could be input into Equation 1 rather than θ− or θ+ and used in the subsequent calculations to obtain the “true” value of P(Dk*=DJ). This is, of course, not possible in practice, but it can easily be done in a simulated exam, where the true θs are known and used to generate item responses. This true calculation of P(Dk*=DJ) is denoted PTrue; it is interpreted as the probability of keeping the same classification decision calculated using the examinee’s true θ value. As will be further explained in the next section, PTrue will be regarded as a gold standard by which to judge the performance of PExact and PCLT. Specifically, in the “Simulation Study” section, the authors describe several measures of the agreement of PExact and PCLT with PTrue.

Finkelman (2008) provides the formulas for a CLT approximation of P(Dk*=DJ). In contrast to PExact, the calculation of PCLT does not explicitly take into account each potential remaining response pattern. Instead, PCLT is a function of log λ_k, the θ value obtained by the conditional power approach described above, and the parameters of the remaining J − k items. Thus, computing PCLT is much more efficient than computing PExact, especially when J − k is large. A review of the CLT formulas, as well as a worked example illustrating PExact, PCLT, and PTrue, is included in the supplementary online appendix.
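The article’s exact CLT formulas are in the supplementary appendix and are not reproduced here. The following is only a generic sketch of the idea under our own assumptions: treat log λ_J as log λ_k plus a sum of independent per-item log-likelihood-ratio increments, compute the mean and variance of that sum under the conditional-power θ, and apply a normal approximation to the probability that log λ_J stays on the current side of C.

```python
import math

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def p_clt(log_lr_k, remaining, theta_eval, theta_minus, theta_plus, C=0.0):
    """Normal approximation to P(D*_k = D_J) -- a generic conditional-power
    sketch, not the article's appendix formulas."""
    mu, var = 0.0, 0.0
    for a, b, c in remaining:
        p = irf_3pl(theta_eval, a, b, c)
        z1 = math.log(irf_3pl(theta_plus, a, b, c) / irf_3pl(theta_minus, a, b, c))
        z0 = math.log((1 - irf_3pl(theta_plus, a, b, c)) /
                      (1 - irf_3pl(theta_minus, a, b, c)))
        m = p * z1 + (1 - p) * z0        # mean log-LR increment for this item
        mu += m
        var += p * z1 ** 2 + (1 - p) * z0 ** 2 - m ** 2
    sd = math.sqrt(var)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    if log_lr_k >= C:   # interim m: probability log lambda_J remains >= C
        return phi((log_lr_k + mu - C) / sd)
    return phi((C - log_lr_k - mu) / sd)
```

Unlike the exact enumeration, the cost here is linear in the number of remaining items.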

Simulation Study

A simulation study was designed to compare the performance of the SCSPRT using PCLT and PExact at different stages of a CCT and to investigate whether there are any sizable differences between the two methods in terms of ATL or classification accuracy. For clarity of discussion, the study is divided into two parts, Part 1 and Part 2. All analyses described in both parts of the study were performed using the 3PL IRT model described in the “IRT” section for four different simulated testing conditions. For all conditions, examinees were administered CCTs with a maximum of J=50 items, cut point θ0=1.0, desired error rates α=β=.05, and, for the SCSPRT, probability threshold γ=.95. One CCT factor that was varied was the half length of the indifference region, δ. In two conditions, δ=.1, and in two conditions, δ=.2. These CCT parameter settings were chosen to be representative of the settings chosen in other recent studies, as summarized at the end of the “SPRT” section. Two different item banks were constructed. For both banks, the discrimination and pseudo-chance parameters were generated from the distributions a_j ~ Normal(0.7, 0.2) and c_j ~ Normal(0.25, 0.03), respectively. The difficulty parameters were generated in such a way that the item information was “peaked” about θ0 for one bank and “broad” about θ0 for the other. For the peaked bank, b_j ~ Normal(θ0 − 0.25, 0.2); for the broad bank, b_j ~ Normal(θ0 − 0.25, 1). The reasoning behind this setup is that the information for both banks should be centered near the cut point, but the spread around the cut point may differ. The authors note that this method of generating peaked and broad item banks, as well as the specific item parameter settings, is very similar to that of Thompson (2009).
The items were assumed to be calibrated without error, a simplifying assumption commonly used in the CCT and computer adaptive testing literature, for example, Thompson (2009), Wang and Chang (2011), Huebner and Fina (2015), and Fan, Wang, Chang, and Douglas (2012).

Both levels of δ were applied to both item banks, resulting in the four testing conditions. Furthermore, for each condition, the CCTs were administered to N=5,000 simulated examinees with θs generated from the Normal(0,1) distribution. Items were selected by the well-known method of maximizing Fisher information (FI) at θ0. The authors note that although Kullback–Leibler information has also been studied as an alternative to FI, the two item selection criteria perform very similarly in terms of efficiency and accuracy in practical testing situations (Lin, 2011). Moreover, recent studies on SCSPRT methodology (Finkelman, 2008, 2010; Huebner & Fina, 2015) use the FI index. Also, to mimic a practical testing situation, the Sympson–Hetter method of item exposure control (Sympson & Hetter, 1985) was used with a desired maximum exposure rate of 0.20.
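The bank-generating scheme can be reproduced as follows. This is a sketch under our assumptions: we read the second Normal parameter as a standard deviation, and the function name, seed, and defaults are ours.

```python
import random

def make_bank(n_items, theta0=1.0, kind="peaked", seed=1):
    """Generate a 3PL item bank with information centered near the cut point theta0."""
    rng = random.Random(seed)
    b_sd = 0.2 if kind == "peaked" else 1.0   # spread of difficulties around the cut
    return [(rng.gauss(0.7, 0.2),              # a_j: discrimination
             rng.gauss(theta0 - 0.25, b_sd),   # b_j: difficulty
             rng.gauss(0.25, 0.03))            # c_j: pseudo-chance
            for _ in range(n_items)]
```

With many items, the difficulty spread of the broad bank should be roughly five times that of the peaked bank, while both center near θ0.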

Simulation Study—Part 1

Part 1 of the simulation study investigated the performance of the SCSPRT termination rule when PCLT and PExact were used at different stages of the test for the computation of P(Dk*=DJ). As stated in the introduction, computing PCLT is not burdensome and can be done at any stage of the test. However, calculating PExact very early in the test would not be feasible in practice or in a simulation study because of the large number of potential response patterns for which to account. Thus, this study begins comparing PCLT and PExact at Test Stage 35, when there are 15 items remaining and 2^15 possible remaining response patterns. Specifically, let TExact_35 be the SCSPRT termination rule that uses PCLT until Stage 34 of the test and then switches to using PExact at Stage 35 and for the remainder of the test. Then, let TExact_36 be the SCSPRT termination rule that switches from using PCLT to PExact at Stage 36 of the test, and so on. Also, let TCLT be the SCSPRT termination rule that uses PCLT throughout the entire exam; note that for a 50-item test, TCLT is equivalent to TExact_50.

All the above termination rules, along with the original SPRT to serve as a baseline, were implemented simultaneously on the test of each simulated examinee. If one rule stopped the test, the others were allowed to keep administering items until they also terminated. Thus, a “matched” comparison of the methods was produced, as was done in the simulation study reported by Finkelman (2008). The termination rules were evaluated by their ATLs and proportion of examinees classified correctly (PCC).

Simulation Study—Part 2

Part 2 of the simulation study examined the relationships between PCLT, PExact, and PTrue for Test Stages 35 through 49. For each of those 15 test stages, the mean bias (MB) and correlation (COR) between (PCLT, PTrue), (PCLT, PExact), and (PExact, PTrue) were computed for those examinees whose test had not yet been terminated by TCLT (the reasoning behind this will be discussed momentarily). Specifically, MB(PCLT, PTrue) = (1/N_j) Σ (PCLT − PTrue) was calculated at each stage j = 35, …, 49, where N_j denotes the number of examinees with exams still in progress at Stage j and the sum runs over those examinees. Also, COR(PCLT, PTrue) is the Pearson product–moment correlation between PCLT and PTrue. The authors note that these quantities were also computed for (PCLT, PExact) and (PExact, PTrue), but for the sake of brevity, the formulas are displayed only for (PCLT, PTrue). The MB and COR describe the relationships between the three probabilities as the test progresses. As mentioned in the previous section, PTrue is regarded as a gold standard because it is based on the true, or generating, examinee θ value. Thus, small MB and high COR with PTrue would indicate good performance for PCLT and PExact.
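These agreement measures are straightforward to compute; a minimal sketch with our own helper names:

```python
import math

def mean_bias(est, truth):
    """MB: average of (estimate - gold standard) over examinees still testing."""
    return sum(e - t for e, t in zip(est, truth)) / len(est)

def pearson_cor(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den
```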

The study also examined quantities at Test Stages 35 through 49 that describe the relationships between PCLT, PExact, PTrue, and γ. The false positive and false negative rates (FPR and FNR) for PCLT and PExact at Stage j of the test are defined as

FPR(PCLT) = P(PCLT ≥ γ | PTrue < γ) = (# examinees with PCLT ≥ γ and PTrue < γ) / (# examinees with PTrue < γ)

and

FNR(PCLT) = P(PCLT < γ | PTrue ≥ γ) = (# examinees with PCLT < γ and PTrue ≥ γ) / (# examinees with PTrue ≥ γ),

and similarly for PExact. As above, the FPR and FNR were based only on those examinees whose exams had not yet been terminated by TCLT at that stage. For a given Test Stage j, the FPR and FNR have the following interpretations. FPR(PCLT) is the proportion of times the SCSPRT terminates the exam based on PCLT, among cases where testing would not have stopped based on PTrue. FNR(PCLT) is the proportion of times the SCSPRT does not terminate the exam based on PCLT, among cases where testing would have stopped based on PTrue. That is, FNR(PCLT) is the observed conditional probability that the SCSPRT fails to stop the test using PCLT, given that the SCSPRT would stop the test using PTrue. FPR(PCLT) indicates whether PCLT is causing the SCSPRT to terminate the exam too aggressively compared with PTrue, possibly resulting in decreased classification accuracy. Conversely, FNR(PCLT) indicates whether PCLT is making the SCSPRT too conservative, possibly resulting in reduced efficiency. Of course, these interpretations also hold for FPR(PExact) and FNR(PExact).
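The two rates can be computed directly from the per-examinee probabilities, as in this illustrative helper (not code from the article):

```python
def fpr_fnr(p_est, p_true, gamma=0.95):
    """Empirical FPR and FNR of an approximate stopping probability vs the gold standard.

    FPR: estimate says stop (>= gamma) when the gold standard says continue.
    FNR: estimate says continue when the gold standard says stop."""
    fp = sum(e >= gamma and t < gamma for e, t in zip(p_est, p_true))
    neg = sum(t < gamma for t in p_true)
    fn = sum(e < gamma and t >= gamma for e, t in zip(p_est, p_true))
    pos = sum(t >= gamma for t in p_true)
    fpr = fp / neg if neg else float('nan')
    fnr = fn / pos if pos else float('nan')
    return fpr, fnr
```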

The choice to base the computations described above only on examinees whose exams had not yet been terminated by the TCLT criterion is now discussed. First, consider the alternate plan of using all examinees at a given test stage, regardless of whether their tests had been terminated. This plan would not be representative of the actual usage of the SCSPRT, because in practice, PCLT, PExact, and PTrue are computed only for examinees whose tests have not yet been terminated. Moreover, because the examinees whose tests have already terminated are those who exhibit stable classifications, they typically exhibit values of PCLT, PExact, and PTrue that are all close to 1 (and therefore close to one another). Including such examinees in the analysis would falsely inflate the agreement between these probability estimates compared with their agreement when the SCSPRT is used in practice. Thus, to avoid an overly optimistic view of the agreement between PCLT, PExact, and PTrue, the computations should be based only on examinees whose tests have not yet been terminated at the given test stage. Of the termination criteria described in the previous subsection, the TCLT criterion is used because it has been the primary SCSPRT stopping rule in previous work (Finkelman, 2008, 2010).

Results

Part 1 Results

The PCC and ATL computed over all values of examinee θ for the termination rules in each testing condition are displayed in Table 1 (for brevity, results are omitted for termination rules with an even number of items remaining). The PCCs for all versions of the SCSPRT are never more than 0.001, or 0.1%, away from the PCC for the SPRT. Also, all versions of the SCSPRT always yield lower ATLs than the SPRT, with the largest reductions occurring with δ=.10 in Conditions 1 and 2. The difference in ATL between the SCSPRT and SPRT was smallest in Condition 3. For the various versions of the SCSPRT, a gradual increase in ATL going from TExact_35 to TCLT can be seen. This implies that the earlier in the test the SCSPRT is computed using PExact, the shorter the test, on average. However, this effect is small; the largest difference in ATL between TExact_35 and TCLT, which occurred in Condition 2, was only approximately 0.50 items.

Table 1.

Simulation Results: PCC and ATL Computed Over All Values of Examinee θ for All Studied Termination Rules.

Rule        Condition 1:      Condition 2:      Condition 3:      Condition 4:
            δ=.1, Peaked      δ=.1, Broad       δ=.2, Peaked      δ=.2, Broad
            PCC     ATL       PCC     ATL       PCC     ATL       PCC     ATL
TExact_35   0.955   21.87     0.942   22.86     0.953   22.12     0.935   24.37
TExact_37   0.955   21.96     0.942   22.99     0.953   22.19     0.935   24.45
TExact_39   0.955   22.02     0.942   23.08     0.953   22.25     0.935   24.54
TExact_41   0.955   22.10     0.942   23.20     0.953   22.33     0.935   24.62
TExact_43   0.955   22.13     0.942   23.24     0.953   22.36     0.935   24.65
TExact_45   0.955   22.17     0.942   23.27     0.953   22.39     0.935   24.68
TExact_47   0.955   22.20     0.942   23.32     0.953   22.42     0.935   24.72
TExact_49   0.954   22.23     0.942   23.37     0.953   22.44     0.936   24.75
TCLT        0.954   22.25     0.942   23.39     0.953   22.45     0.936   24.76
SPRT        0.954   38.64     0.942   42.62     0.953   24.46     0.936   29.08

Note. PCC = proportion classified correctly; ATL = average test length; TExact_35 = SCSPRT termination rule that switches to using PExact at Stage 35, and so on; SPRT = Sequential Probability Ratio Test; SCSPRT = Stochastically Curtailed Sequential Probability Ratio Test.

A somewhat surprising result is revealed when the effect of changing δ for the SCSPRT is examined while holding the bank constant, that is, comparing Conditions 1 with 3 and 2 with 4. It is seen that for all SCSPRT termination rules, regardless of whether they are based on PCLT or PExact, the ATLs are slightly larger for the higher value of δ. In contrast, Table 1 shows that increasing δ greatly reduces ATL for the SPRT, an effect that is well known. This difference in the interplay between δ and the SPRT and SCSPRT can be explained by once again considering the conditional power approach used in the calculation of PCLT and PExact. For example, when δ=.2 and the current decision is m, the conditional power approach assumes a smaller θ value of 0.80 while estimating the probability of changing decisions. In this case, making an early termination is less likely than if δ=.1, when a θ value of 0.90 would be used. However, this is attenuated by the fact that SCSPRT also stops when the SPRT does, and the SPRT tends to stop earlier when δ is higher. It seems that the two effects nearly cancel each other out, especially for the conditions with peaked banks.

Table 2 displays information on the forced decisions described in the “SPRT” section. Specifically, for each condition, it shows the percentage of examinees who received forced decisions and the proportion of those classified correctly for the TExact_35, TCLT, and SPRT stopping rules. The SPRT had, by far, the largest percentage of forced decisions in every condition; this is not surprising, given the relatively large ATLs shown in Table 1. Among the SCSPRT rules, in every condition, TCLT had a larger percentage of forced decisions than TExact_35. Again, this is expected given the trends seen in Table 1. The PCC of the forced decisions for TExact_35 and TCLT was markedly lower than that of the SPRT. However, because forced decisions occurred at a much lower rate for these rules than for the SPRT, their overall PCC was not adversely affected.

Table 2.

Percent of Examinees With Forced Decisions and the Proportion of Those Classified Correctly for All Conditions.

Condition   TExact_35           TCLT                SPRT
            % FD   PCC of FD    % FD   PCC of FD    % FD   PCC of FD
1           1.1    0.717        2.5    0.622        42.6   0.892
2           1.9    0.547        4.2    0.588        58.4   0.901
3           1.8    0.584        2.9    0.615        19.2   0.778
4           2.3    0.504        4.0    0.572        28.6   0.781

Note. TExact_35 = SCSPRT termination rule that switches to using PExact at Stage 35; TCLT = SCSPRT termination rule that uses CLT approximation throughout; SPRT = Sequential Probability Ratio Test; FD = forced decisions; PCC = proportion classified correctly; SCSPRT = Stochastically Curtailed Sequential Probability Ratio Test; CLT = central limit theorem.

Part 2 Results

Part 1 presented results indicating that the use of stochastic curtailment termination rules based on PCLT yielded slightly longer tests than PExact. Part 2 delves more deeply into this phenomenon by examining the relationship between PCLT, PExact, and PTrue at Test Stages 35 to 49; that is, when the number of remaining items is 15, 14, and so on. Figure 1 displays the correlations between the three probabilities as the number of remaining items decreases. For all conditions, it can be seen that COR(PExact,PTrue) increases as the tests progress, while COR(PCLT,PTrue) either levels off or drops when several items remain. This is due to the fact that the CLT does not generally perform well for very small sample sizes. Also, COR(PCLT,PExact) decreases or levels off as the tests end; the pattern is more erratic for the broad item banks than for the peaked.

Figure 1. Correlations among PCLT, PExact, and PTrue for Test Stages 35 to 49.

Figure 2 displays the MB among PCLT, PExact, and PTrue. The values of MB(PCLT, PTrue) and MB(PExact, PTrue) are negative for all test stages studied in all four conditions, though both approach zero as the test progresses. This indicates that both PCLT and PExact consistently underestimate P(Dk*=DJ) compared with PTrue; thus, using PCLT or PExact terminates the tests less aggressively than using PTrue would. Furthermore, PCLT generally underestimates P(Dk*=DJ) more than PExact does, especially in Conditions 1 and 3 with peaked item banks. This explains the results in Table 1 showing increasing ATLs the longer PCLT is used for the computation of P(Dk*=DJ). In addition to the MB and Pearson correlation described above, the mean absolute bias and Spearman’s correlation were examined in a similar manner. Both measures yielded plots with trends very similar to the Pearson correlations depicted in Figure 1; thus, for the sake of brevity, those plots are omitted.

Figure 2. Bias among PCLT, PExact, and PTrue for Test Stages 35 to 49.

Figures 3 and 4 display the FPR and FNR, respectively, for PCLT and PExact. Figure 3 shows that the FPRs are generally smaller in Conditions 1 and 3, where the item bank is peaked. However, in Condition 3, FPR(PCLT) may be as high as 10% when only a couple of items remain. The pattern revealed by Figure 4 for the FNR is quite clear: FNR(PCLT) is greater than FNR(PExact) in all conditions. In fact, for most of the studied test stages, FNR(PCLT) is far above 50%. This indicates that PCLT does not terminate the exam in the majority of cases in which PTrue would do so. This is consistent with the MB results depicted in Figure 2 and with Table 1. Also, note that in Figure 4, there are obvious increases, or jumps, in the FNR occurring when there are 9 or 10 items remaining. To investigate these, further simulations were run with different cut points (0.25 and −1; the full results are not presented due to space considerations). In those additional testing conditions, jumps occurred with 7 or 8 items remaining. The authors conclude that these jumps are randomly occurring phenomena that depend on the quality of the item bank and the CCT settings. Overall, the results suggest that these are quirks that can occur with the SCSPRT, but they do not negatively affect the general performance of the method.

Figure 3. False positive rates for PCLT and PExact.

Note. FPR = False positive rates.

Figure 4. False negative rates for PCLT and PExact.

Note. FNR = False negative rates.

Discussion

The SCSPRT termination criterion was proposed by Finkelman (2008) to increase the efficiency of CCTs. His results suggested that the SCSPRT was capable of significantly reducing ATLs compared with the SPRT while maintaining virtually identical classification accuracy. The crucial element of the SCSPRT is the computation of P(Dk*=DJ), the probability that the interim classification decision at Stage k of the test will not change before the maximum number of items J is reached. There are two methods of calculating P(Dk*=DJ), an exact method and a CLT approximation, resulting in PExact and PCLT, respectively. This article presented the results of a simulation study that aimed to provide detailed analyses concerning the relationship of PExact to PCLT as the test progresses and the performance of SCSPRT termination rules that switch from using PCLT to PExact at various stages of the test. Although past studies have shown the SCSPRT and its variations to be generally superior in efficiency to the SPRT, no previous study (to the best of the authors’ knowledge) has addressed such specific comparisons between PCLT and PExact.
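The distinction between the two computations can be sketched as follows. This is a simplified illustration under assumed conventions, not the authors' implementation: the SPRT statistic is taken to be the running sum of item log-likelihood ratios comparing θ1 with θ0, the interim decision is determined by the sign of that sum, and the response probabilities for the remaining items come from a 2PL model evaluated at a provisional ability estimate `theta_hat` (all of these specifics are assumptions). The exact method enumerates every pattern of responses to the remaining items, while the CLT method approximates the sum of the remaining log-likelihood-ratio increments by a normal distribution with matching mean and variance.

```python
import math
from itertools import product

def p2pl(theta, a, b):
    """2PL probability of a correct response (assumed response model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def llr(u, a, b, theta0, theta1):
    """Log-likelihood-ratio increment for response u on one item."""
    p0, p1 = p2pl(theta0, a, b), p2pl(theta1, a, b)
    return math.log((p1 if u else 1.0 - p1) / (p0 if u else 1.0 - p0))

def p_exact(s_k, items, theta_hat, theta0, theta1):
    """Exact P(D_k = D_J): enumerate all 2^(J-k) response patterns to the
    remaining items, weighting each by its probability at theta_hat."""
    same_decision = 0.0
    for pattern in product((0, 1), repeat=len(items)):
        prob, s = 1.0, s_k
        for u, (a, b) in zip(pattern, items):
            p = p2pl(theta_hat, a, b)
            prob *= p if u else 1.0 - p
            s += llr(u, a, b, theta0, theta1)
        if (s >= 0.0) == (s_k >= 0.0):  # final decision matches interim one
            same_decision += prob
    return same_decision

def p_clt(s_k, items, theta_hat, theta0, theta1):
    """CLT approximation: sum the means and variances of the remaining
    increments and evaluate a normal tail probability."""
    m = v = 0.0
    for a, b in items:
        p = p2pl(theta_hat, a, b)
        l1, l0 = llr(1, a, b, theta0, theta1), llr(0, a, b, theta0, theta1)
        mu = p * l1 + (1.0 - p) * l0
        m += mu
        v += p * l1 ** 2 + (1.0 - p) * l0 ** 2 - mu ** 2
    z = (s_k + m) / math.sqrt(v)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return phi if s_k >= 0.0 else 1.0 - phi

# Hypothetical remaining items (a, b) and interim statistic s_k = 1.0.
items = [(1.0, 0.0)] * 6
pe = p_exact(1.0, items, 1.0, -0.5, 0.5)
pc = p_clt(1.0, items, 1.0, -0.5, 0.5)
```

The sketch also shows why the switch point matters operationally: `p_exact` costs O(2^(J−k)) per examinee and is only practical when few items remain, whereas `p_clt` is linear in the number of remaining items, which is when the normal approximation is also at its best.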

At the outset, the authors stated that this article is intended to be of use to practitioners interested in implementing the SCSPRT in an operational CCT. Referring back to the original research questions stated in the introduction, the results of the simulation study suggest the following concrete recommendations and guidelines: (a) Based on their analyses, the authors cautiously suggest that the switch from using PCLT to PExact be made when between five and eight items remain in the exam. This range should ensure adequate performance from PCLT and should not be too computationally burdensome in an operational setting. (b) PCLT tends to underestimate the gold standard PTrue, and PExact is less biased overall. (c) The use of PCLT generally causes the SCSPRT to terminate the test less aggressively than PExact. If test efficiency is of the utmost importance, the switch from PCLT to PExact should be made as early as possible. Although it is not theoretically appealing, using PCLT throughout the exam did not significantly decrease the PCC in the conditions studied and only very slightly increased the ATL; unsurprisingly, however, PCLT tended to behave somewhat erratically at later stages of the tests. (d) The above conclusions are consistent across the four simulation conditions examined. The reduction in ATL yielded by the SCSPRT is generally more substantial for smaller values of δ than for larger values. Even in conditions where the SCSPRT produces only a small increase in efficiency, its use is worthwhile because there is virtually no reduction in classification accuracy.

Of course, the conditions studied in this article represent only a small subset of a vast array of potential CCT designs. No single study can address all possible variations of CCT parameters (e.g., δ, θ0, maximum test length) and item bank characteristics. Researchers and practitioners considering the SCSPRT method should perform similar simulations on their particular CCTs and item banks. Also, this article assumed a calibrated item bank with no estimation error; it is currently unknown how the performance of the SCSPRT is affected by a small amount of mismatch between the true item parameter values and their estimates.

Finally, there are several ways in which this research may be extended. Finkelman (2010) proposed variations on the SCSPRT that terminate CCTs even more aggressively than the original SCSPRT. Future research could explore whether PCLT and PExact perform differently for these variations than they did in this study. Also, it has been proposed that Bayesian methodology be used in the design of a CCT (Lewis & Sheehan, 1990; Vos, 1999, 2000). In addition, a reviewer pointed out that the pass/fail decision in a CCT may be posed as two separate models. Then, test stoppage could be based on the Bayes factor of the two models exceeding a pre-determined threshold, where the Bayes factor may be calculated using a method similar to that proposed by Klugkist and Hoijtink (2007). An examination of the relative merits of SCSPRT-based methods versus Bayesian frameworks would be appealing from both a theoretical and practical viewpoint.

Supplementary Material

Supplementary material
OnlineAppendix.docx (19KB, docx)

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

  1. Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.
  2. Chen S. Y., Lei P. W., Chen J.-H., Liu T. C. (2014). General test overlap control: Improved algorithm for CAT and CCT. Applied Psychological Measurement, 38, 229-244.
  3. Eggen T. J. H. M. (1999). Item selection in adaptive testing with the sequential probability ratio test. Applied Psychological Measurement, 23, 249-261.
  4. Fan Z., Wang C., Chang H.-H., Douglas J. (2012). Utilizing response time distributions for item selection in CAT. Journal of Educational and Behavioral Statistics, 37, 655-670.
  5. Finkelman M. (2008). On using stochastic curtailment to shorten the SPRT in sequential mastery testing. Journal of Educational and Behavioral Statistics, 33, 442-463.
  6. Finkelman M. (2010). Variations on stochastic curtailment in sequential mastery testing. Applied Psychological Measurement, 34, 27-45.
  7. Huebner A. (2012). Item overexposure in computerized classification tests using sequential item selection. Practical Assessment, Research & Evaluation, 17(12). Retrieved from http://pareonline.net/getvn.asp?v=17&n=12
  8. Huebner A., Fina A. (2015). The stochastically curtailed generalized likelihood ratio: A new termination criterion for variable-length computerized classification tests. Behavior Research Methods, 47, 549-561. doi:10.3758/s13428-014-0490-y
  9. Jennison C., Turnbull B. W. (2000). Group sequential methods with applications to clinical trials. Boca Raton, FL: Chapman & Hall/CRC.
  10. Klugkist I., Hoijtink H. (2007). The Bayes factor for inequality and about equality constrained models. Computational Statistics & Data Analysis, 51, 6367-6379.
  11. Lan K. K. G., Simon R., Halperin M. (1982). Stochastically curtailed tests in long term clinical trials. Communications in Statistics: Sequential Analysis, 1, 207-219.
  12. Lewis C., Sheehan K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367-386.
  13. Lin C.-J. (2011). Item selection criteria with practical constraints for computerized classification testing. Educational and Psychological Measurement, 71, 20-36.
  14. Lord F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
  15. Spray J. A., Reckase M. D. (1994, April 5-7). The selection of test items for decision making with a computerized adaptive test. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
  16. Spray J. A., Reckase M. D. (1996). Comparison of SPRT and sequential Bayes procedures for classifying examinees into two categories using a computerized adaptive test. Journal of Educational and Behavioral Statistics, 21, 405-414.
  17. Sympson J. B., Hetter R. D. (1985). Controlling item exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 937-977). San Diego, CA: Navy Personnel Research and Development Center.
  18. Thompson N. A. (2009). Item selection in computerized classification testing. Educational and Psychological Measurement, 69, 778-793.
  19. Thompson N. A. (2011). Termination criteria for computerized classification testing. Practical Assessment, Research & Evaluation, 16(4). Retrieved from http://pareonline.net/pdf/v16n4.pdf
  20. Van Groen M. M., Eggen T. J. H. M., Veldkamp B. P. (2014). Item selection methods based on multiple objective approaches for classifying respondents into multiple levels. Applied Psychological Measurement, 38, 187-200.
  21. Vos H. J. (1999). Applications of Bayesian decision theory to sequential mastery testing. Journal of Educational and Behavioral Statistics, 24, 271-292.
  22. Vos H. J. (2000). A Bayesian procedure in the context of sequential mastery testing. Psicológica, 21, 191-211.
  23. Wald A. (1947). Sequential analysis. New York, NY: John Wiley.
  24. Wang C., Chang H. (2011). Item selection in multidimensional computerized adaptive testing—Gaining information from different angles. Psychometrika, 76, 363-384.
  25. Weiss D. J., Kingsbury G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

