Abstract
The choice of response probability in the bookmark method has been shown to affect outcomes in important ways. These findings have implications for the validity of the bookmark method because panelists’ inability to internally adjust when given different response probabilities suggests that they are not performing the intended judgment task. In response to the concerns these findings raise, proponents of the bookmark method argue that such concerns can be addressed by using a response probability of .67. A crucial part of their argument includes the often-repeated claim that the .67 value corresponds with the maximum information for a correct response, which is believed to be beneficial in some way. In this article, it is shown that this claim is mistaken; that the formula upon which the .67 result is based is incorrect; that (for the relevant measurement model) there is no difference between the information for a correct response, for an incorrect response, or for the item overall; and, more generally, that the “maximize information” approach is based on the wrong likelihood function altogether.
Keywords: standard setting, item response theory, Fisher information, response probability
The Bookmark standard setting procedure was introduced in 1996 by Lewis, Mitzel, and Green in response to the perceived challenges and shortcomings of Angoff-type approaches to setting standards. Since that time, the method has been widely adopted—particularly in K-12 testing. For both the Angoff and the Bookmark method, content expert panelists discuss and conceptualize the minimally proficient examinee: The examinee whose proficiency level is just high enough to justify a given classification (e.g., Pass). With this conceptualization in mind, Angoff methods require content experts to then review individual test items and estimate the probability that the minimally proficient examinee will answer each item correctly. The Bookmark method calls upon panelists to make a related judgment, but the task is structured differently. With this procedure, test items are ordered by difficulty and arranged into what is called an ordered item booklet. Panelists are asked to review each item starting at the beginning of the booklet, and then place a bookmark at the point between the items in the booklet at which the probability of success for the minimally proficient examinee drops below a prespecified value referred to as the response probability (RP). Item response theory procedures are then used to identify the cut score as the place on the proficiency scale that is associated with that RP given the item response functions for the items in the vicinity of the bookmark.
The choice of RP was, as Lewis, Mitzel, Mercado, and Schulz (2012) characterized it, “an early source of controversy”; however, eventually this controversy subsided, and practice coalesced around an RP of .67 (hereafter, RP67; Lewis et al., 2012, p. 233). Yet, the evidentiary basis for this consensus is surprisingly modest. One (often-repeated) argument in support of RP67—what is sometimes called “the conceptual and psychometric basis of the RP value of .67” (Huynh, 2006, p. 19)—is examined in this article and is shown to be vulnerable to criticism on theoretical and mathematical grounds.
Background
Suppose we have some panelist and we know, without error, that his or her conceptualization of the minimally proficient candidate corresponds to some proficiency level $\theta_c$. We might call this the true cut score for our panelist. Now suppose our panelist is working their way through an ordered item booklet and encounters some item $i$. Upon reaching this item, our devoted panelist must make a judgment about the probability of success for an examinee who possesses their conceptualization of minimal proficiency—that is, an examinee with a proficiency equal to $\theta_c$. Specifically, our judge must decide whether or not the probability of success for such an examinee on item $i$, $P_i(\theta_c)$, drops below the RP. In a perfect world, our panelist would compare the probability $P_i(\theta_c)$ with the specified RP; however, in our imperfect world, this comparison will be contaminated in some way by error. So, if the RP is greater than $P_i(\theta_c)$, our judge should insert their bookmark, yet may fail to do so due to an error of some kind; likewise, if the RP is less than $P_i(\theta_c)$, our judge should advance to the next item, yet, again, may fail to do so due to error.
The above description of panelist error has been left deliberately vague because its form is unimportant for the present discussion. For our immediate purpose, the focus is on the error in which we are ultimately most interested—the error in the cut score estimate. Consider an example. Figure 1 shows the item characteristic curves for three items: some item $i$ together with an easier item $i-1$ and a harder item $i+1$. Also shown are the probabilities of success on each item for someone with a proficiency equal to the true cut score, $P_{i-1}(\theta_c)$, $P_i(\theta_c)$, and $P_{i+1}(\theta_c)$, and, given an RP value, the associated proficiencies for each item: $\theta_{i-1}$, $\theta_i$, and $\theta_{i+1}$ (where, e.g., $P_i(\theta_i) = \mathrm{RP}$). The RP value is indicated by the horizontal light dashed line, and were these items to appear in an ordered item booklet, a bookmark inserted between items $i-1$ and $i$ would result in a cut score of $(\theta_{i-1} + \theta_i)/2$, the midpoint between the proficiencies $\theta_{i-1}$ and $\theta_i$. Likewise, a bookmark inserted between items $i$ and $i+1$ would lead to a cut score of $(\theta_i + \theta_{i+1})/2$, the midpoint between the proficiencies $\theta_i$ and $\theta_{i+1}$.1
Figure 1.
Three item characteristic functions.
Note. The thin gray line shows the relationship between the true cut score $\theta_c$ and the probability of success on these items, $P_{i-1}(\theta_c)$, $P_i(\theta_c)$, and $P_{i+1}(\theta_c)$. An RP value is indicated by the horizontal light dashed line. A bookmark inserted between items $i-1$ and $i$ results in a cut score of $(\theta_{i-1} + \theta_i)/2$, whereas a bookmark inserted between items $i$ and $i+1$ leads to a cut score of $(\theta_i + \theta_{i+1})/2$. RP = response probability.
Since the probability of success, $P_i(\theta_c)$, is less than the RP, it follows that upon encountering item $i$, our judge should insert their bookmark, indicating a cut score estimate of $(\theta_{i-1} + \theta_i)/2$; however, suppose our panelist made some error such that they advance to the next item, item $i+1$. Our panelist’s judgment on item $i+1$ is, of course, also subject to error, but suppose they (correctly) conclude that $P_{i+1}(\theta_c)$ is less than the RP. Now, the resulting cut score estimate is $(\theta_i + \theta_{i+1})/2$ rather than $(\theta_{i-1} + \theta_i)/2$, and our cut score error is $(\theta_i + \theta_{i+1})/2 - \theta_c$ rather than the more modest $(\theta_{i-1} + \theta_i)/2 - \theta_c$.
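The arithmetic of this example is easy to make concrete. The following sketch is illustrative only: the item parameters are made up and the helper name rp_location is hypothetical, not anything drawn from the article. It computes the RP locations for three items, the two candidate cut scores under the midpoint convention (see note 1), and the resulting cut score errors for a true cut score of $\theta_c = 0$.

```python
import math

def rp_location(a, b, c, rp):
    """Theta at which a 3PL item's success probability equals the RP.

    Inverts rp = c + (1 - c) / (1 + exp(-a * (theta - b))); requires c < rp < 1.
    """
    return b - math.log((1.0 - c) / (rp - c) - 1.0) / a

rp = 2.0 / 3.0   # RP67
theta_c = 0.0    # the panelist's true cut score (illustrative)

# Illustrative (a, b, c) parameters for items i-1, i, and i+1, ordered by difficulty.
items = [(1.0, -0.8, 0.0), (1.0, 0.2, 0.0), (1.0, 1.0, 0.0)]
theta = [rp_location(a, b, c, rp) for a, b, c in items]

correct_cut = (theta[0] + theta[1]) / 2.0  # bookmark between items i-1 and i
late_cut = (theta[1] + theta[2]) / 2.0     # bookmark placed one item too late

print("RP locations:", [round(t, 3) for t in theta])
print(f"correct cut: {correct_cut:+.3f} (error {correct_cut - theta_c:+.3f})")
print(f"late cut:    {late_cut:+.3f} (error {late_cut - theta_c:+.3f})")
```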
On the Choice of RP
The preceding discussion demonstrates that were panelists able to complete the judgment task without error, any RP value between an item characteristic curve’s lower asymptote (0 in this example) and 1 could be specified without affecting the mechanics of the judgment exercise (Hambleton & Pitoniak, 2006). Nevertheless, even the best panelist will make errors, and research has shown (and common sense tells us) that, in fact, the choice of RP systematically affects outcomes—sometimes quite profoundly and in unexpected ways (Hauser, Edley, Koenig, & Elliott, 2005; Williams & Schulz, 2005). The implications of these findings are alarming—after all, if panelists are unable to adjust their judgments to reflect the choice of RP, what do their judgments actually mean? We may never know the answer to this existential question, but to the extent that the resultant cut scores are systematically affected by the choice of RP, they will not reflect the panelists’ beliefs (except by chance) and their validity will be in doubt. Yet, this matter does not seem to have preoccupied proponents of the Bookmark method especially. The focus in the literature has been on the advantages and defensibility of RP67, rather than the validity of judgments that depend so much on the choice of RP.
Although the question of whether or not panelists are capable of making the intended bookmark judgments remains unresolved, let us suppose the advocates of the Bookmark method are correct and that panelists are capable of making the intended judgments but that some RP values are less prone (or more robust) to panelist error than others. So, for example, it might be argued that very high or very low RP values are best avoided because item characteristic curves are asymptotic at these extreme values, and therefore multiple items, all with success probabilities very near such RPs, could nevertheless correspond to very different proficiencies. In addition to sensible restrictions like this, other rules governing the choice of RP value have been widely adopted and promoted with less obvious justifications. The rise of RP67 as the de facto RP is one such rule.
Aside from 2/3 being “a familiar value” (Lewis, Green, Mitzel, Baum, & Patz, 1998, p. 4), the argument in favor of RP67 falls along two distinct lines. The first comprises two related notions: (a) that the concept of mastery (or can do) allows panelists to concretize a specified RP value that they would otherwise find too abstract and (b) that panelists’ concept of mastery corresponds to an RP of .67. Together, we might refer to this line of reasoning as the mastery conceit. The mastery conceit seems benign enough as far as heuristics go and although its utility is not investigated here, it seems plausible that the concept of mastery provides a more tangible and comprehensible link between the knowledge, skills, and abilities associated with a given performance level and the judgment task required of the panelists. It is perhaps less obvious that panelists’ concept of mastery corresponds to (or can be made to correspond to) RP67—it seems quite possible that notions of mastery are neither static nor universal (what Lorié, 2002, called the problem of mastery probability indeterminacy). Nevertheless, the mastery conceit reportedly has been helpful (e.g., Lewis et al., 2012) and, in any case, the question of its actual utility lies outside the scope of this article.
The nature of these arguments deserves comment. The familiar value and mastery conceit assertions—however valid they may be—are by and large conjectural, rhetorical, and anecdotal. The second rationale for using RP67, which is the focus of this article, is the contention that the optimum response probability is the one that maximizes information for a correct response.2 Although this too is mere conjecture—it has yet to be demonstrated that there is anything special about maximizing information for a correct response that makes panelists less prone to committing errors or that makes the resultant cut score estimates more robust to panelists’ errors—there is an analytical character to the notion of maximizing information. This character is also evident in the literature, which concentrates on showing (albeit incorrectly) how maximizing such information corresponds to an RP of .67. At the risk of adding more conjecture still, one wonders if perhaps the research in this area has given the entire claim of optimality a veneer of analytical proof it was otherwise lacking.
The lack of evidence demonstrating what can be gained by maximizing information might be considered enough to question whether the use of RP67 is actually defensible at all, but, in any case, the focus here mirrors the focus in the literature and is therefore limited to the analytical reasoning that connects the maximization of information for a correct response to an RP of .67. This reasoning is fully explicated in the work of Huynh (1994, 1998, 2000a, 2000b, 2006) and his conclusions (and their implication that .67 is in some important way optimal) are frequently repeated (e.g., Beretvas, 2004; Cizek, Bunch, & Koons, 2004; Karantonis & Sireci, 2006; Lewis et al., 2012; Lin, 2006; Mitzel, Lewis, Patz, & Green, 2001; Skaggs, Hein, & Awuor, 2007; Skaggs & Tessema, 2001), most often uncritically (cf., Kolstad et al., 1998). Huynh’s analytical derivation for RP67 will be discussed and (to the extent possible) unpacked below. Yet, because of its central importance, some readers may find it helpful to first review some key concepts related to Fisher information. Such a review will be given next.
Observed Information and Expected Information
Supposing some test question fits the three-parameter logistic model (3PL; Birnbaum, 1968), let the probability of a respondent with a proficiency of $\theta$ responding positively, $P(\theta) \equiv \Pr(X = 1 \mid \theta)$, be given by
$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}} \quad (1)$$
where $a$ is the item discrimination, $b$ is the item difficulty, and $c$ is the (pseudo-)guessing parameter (all of which are assumed to be known). Generalizing, it follows that the probability of a response $x \in \{0, 1\}$ may be expressed as
$$\Pr(X = x \mid \theta) = P(\theta)^{x}\,[1 - P(\theta)]^{1 - x} \quad (2)$$
Now, suppose we have some observation, $x$. Let $L(\theta \mid x)$ denote the likelihood of $\theta$ given this response and let $\ell(\theta \mid x) = \log L(\theta \mid x)$, the log-likelihood. Intuitively—from a maximum likelihood perspective, for example—the steeper the curvature of the log-likelihood, the narrower the range of $\theta$ likely to have produced $x$, and for this reason, this curvature is often of great interest. To find the local curvature, we start with the first-order partial derivative (with respect to $\theta$) of $\ell$,
$$S(\theta) = \frac{\partial}{\partial \theta}\,\ell(\theta \mid x) \quad (3)$$
which is typically called the score, score function, or Fisher’s score function (not to be confused with a test score or a scored item response). The score function is the slope of the log-likelihood, and it is the slope of this slope—that is, the second-order partial derivative (again with respect to $\theta$) of $\ell$—that equals the local curvature3 of the log-likelihood:
$$\frac{\partial}{\partial \theta}\,S(\theta) = \frac{\partial^2}{\partial \theta^2}\,\ell(\theta \mid x) \quad (4)$$
The negative of this local curvature is defined as the observed Fisher information function, which can be denoted as
$$J(\theta \mid x) = -\frac{\partial^2}{\partial \theta^2}\,\ell(\theta \mid x) \quad (5)$$
In contrast, Fisher information proper is defined as the expectation of this observed information. In other words, Fisher information is, essentially, the curvature averaged over all possible responses (e.g., correct and incorrect for a dichotomously scored item). Of course, to take an expectation, we need to reintroduce the random variable $X$ rather than considering a particular instance of $X$, $x$, as we did in the case of the observed information. Likewise, we are now interested in the probability of the random variable $X$ given $\theta$ rather than the likelihood of $\theta$ given a particular instance of $X$, $x$, and so $L(\theta \mid x)$ is replaced with the equivalent probability $\Pr(X = x \mid \theta)$:
$$I(\theta) = E_X\!\left[J(\theta \mid X)\right] = -\sum_{x \in \{0,1\}} \Pr(X = x \mid \theta)\,\frac{\partial^2}{\partial \theta^2} \log \Pr(X = x \mid \theta) \quad (6)$$
When a maximum likelihood estimator exists (e.g., which is usually the case when we have responses from a set of items), the inverse of the Fisher information is asymptotically equal to the sampling variance of the maximum likelihood estimator, which is why information is of interest generally and, more specifically, why we often seek to maximize it (van der Linden, 2010).
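These definitions can be checked numerically. Below is a minimal sketch, assuming arbitrary 3PL item parameters (they are not from the article): it approximates the observed information (Equation 5) for each response with a central second difference on the log-likelihood of Equation 2, then averages over the two responses, per Equation 6, to obtain the expected information.

```python
import math

a, b, c = 1.0, 0.0, 0.25  # assumed (illustrative) 3PL item parameters
theta, h = 0.5, 1e-4      # evaluation point and difference step

def p3pl(t):
    """Equation 1: the 3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (t - b)))

def loglik(t, x):
    """Equation 2, logged: Bernoulli log-likelihood of theta given response x."""
    p = p3pl(t)
    return x * math.log(p) + (1 - x) * math.log(1.0 - p)

def observed_info(t, x):
    """Equation 5: observed information via a central second difference."""
    return -(loglik(t + h, x) - 2.0 * loglik(t, x) + loglik(t - h, x)) / h**2

p = p3pl(theta)
j1 = observed_info(theta, 1)      # observed information given a correct response
j0 = observed_info(theta, 0)      # observed information given an incorrect response
fisher = p * j1 + (1.0 - p) * j0  # Equation 6: expectation over the two responses

print(f"J(theta|x=1) = {j1:.4f}")
print(f"J(theta|x=0) = {j0:.4f}")
print(f"I(theta)     = {fisher:.4f}")
```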
Item Information, Item Response Information, and Information Share
Fisher information corresponds to what in the psychometric literature is called item information—or, in the case of a set of items, test information (Birnbaum, 1968),4 but what of Equation 5, the observed information? In psychometrics, the notion that each score, $x$, “carries” its own unique amount of information (probably) goes back to the work of Samejima, who suggested that the information for each score category of an item was measured by its item response information function (which corresponds to Equation 5), and that this function was more—or at least more uniquely—informative compared with Birnbaum’s item information function—that is, the standard Fisher information function shown in Equation 6 (Birnbaum, 1968; Samejima, 1973). Despite its allure, however, Samejima’s notion of a response-specific item information function has not been widely adopted within item response theory discourse—although it does arise occasionally (e.g., Bradlow, 1996, who extended the work of Samejima’s 1973 study; Baker & Kim, 2004; Huynh, 1994, 1998, 2000a, 2000b, 2006; Magis, 2015; van der Linden, 1998).
One final term remains: information share. In Samejima’s (1969) monograph introducing the graded response model, she defines information share as follows:
. . . information share is the information function of an individual graded response, which is the negative of the first derivative of the basic function [i.e., the observed information shown in Equation 5], multiplied by the operating characteristic [i.e., the item response function] . . . (p. 38)
Following the notation used here, information share for a correct response can be expressed as
$$\bar{I}_1(\theta) = J(\theta \mid x{=}1)\,P(\theta) \quad (7)$$
where $\bar{I}_1$ denotes the information share for a correct response. The notion of item information share is even less widely used than item response information (although, like item response information, it does arise now and then—for example, Baker & Kim, 2004; Bock, 1972; DeMars, 2010; Suh & Bolt, 2010).
For Baker and Kim (2004), the difference between item response information and item information share can be described as follows:
The amount of information due to an item response category is a measure of how well responses in that category estimate the examinee’s ability. The information share of a response category is the amount of information contributed by the category to the item information. (p. 223; the original notation was changed to be consistent with the notation used in this article.)
These authors argue that since the item response information functions across all possible response categories for a given proficiency do not sum to the total information, $I(\theta)$, they are difficult to interpret. Consequently, they conclude that “from an interpretive point of view, the amount of information share of each response category is the most informative” (p. 223).5 The reader can decide for themselves whether or not they agree with this claim; for our purposes moving forward, the point is that three distinct concepts related to information have been introduced: observed item information (Equation 5), expected item information (Equation 6), and information share (Equation 7).
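To keep the three concepts straight, the following sketch (same assumed, illustrative parameters and finite-difference device as above) computes each one for a single 3PL item; note that the two response informations do not sum to $I(\theta)$, whereas the two information shares do.

```python
import math

a, b, c = 1.0, 0.0, 0.25  # assumed (illustrative) 3PL item parameters
theta, h = 0.5, 1e-4

def p3pl(t):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (t - b)))

def observed_info(t, x):
    """Equation 5, via a central second difference on the log-likelihood."""
    ll = lambda u: x * math.log(p3pl(u)) + (1 - x) * math.log(1.0 - p3pl(u))
    return -(ll(t + h) - 2.0 * ll(t) + ll(t - h)) / h**2

p = p3pl(theta)
j1, j0 = observed_info(theta, 1), observed_info(theta, 0)  # Equation 5, per response
fisher = p * j1 + (1.0 - p) * j0                           # Equation 6
share1, share0 = j1 * p, j0 * (1.0 - p)                    # Equation 7, per category

print(f"response informations: {j1:.4f} + {j0:.4f} = {j1 + j0:.4f}  (does not equal I)")
print(f"information shares:    {share1:.4f} + {share0:.4f} = {share1 + share0:.4f}")
print(f"item information:      I = {fisher:.4f}")
```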
Huynh’s Unorthodox Derivation of RP67
The above background on Fisher information is well known and (it is hoped) uncontroversial. We begin our move away from this familiarity with a passage from Huynh’s (2006) note intended to clarify the rationale for the RP67:
For binary items, the focus of Huynh’s work on the RP value has been on information of the correct response and not on the (total) item information, which is the typical focus for item information. The crucial difference between these two types of information apparently has been overlooked by several writers on bookmark standard setting, including Cizek et al. (2004) and Karantonis and Sireci (2006). . . Certainly there are differences among (total) item information. . . and information of the correct response. (p. 20; emphasis in original)
Huynh’s broader point here is correct: Equations 5 and 6 are different, as are the concepts of observed and expected information. Nevertheless, despite these differences, observed information and expected information are, in certain important special cases, mathematically equivalent. Notably, equivalence is observed in the case of the 2PL (i.e., when $c = 0$), the specific case that preoccupies Huynh and the architects of the Bookmark method so much (Lewis et al., 2012).6 Let us take a moment to see how this equivalence arises.
Recalling Equations 2 and 5, the item response information function—that is, the observed information—for the two-parameter logistic (2PL) is given by
$$J(\theta \mid x) = -\frac{\partial^2}{\partial \theta^2}\left\{x \log P(\theta) + (1 - x)\log\left[1 - P(\theta)\right]\right\} \quad (8)$$
(Note that this is not the information for a correct response, $J(\theta \mid x{=}1)$, per se but rather for any response, $x = 1$ or $x = 0$.) Solving this derivative, we have
$$J(\theta \mid x) = a^2\,P(\theta)\,[1 - P(\theta)] \quad (9)$$
Equation 9 has an important property: the absence of the $x$ term. It follows that the observed information (Equation 5) for the 2PL is unaffected by the specific value of $x$ or, in other words, the observed information is the same without regard to the correctness of a given response: $J(\theta \mid x{=}1) = J(\theta \mid x{=}0) = I(\theta)$. As noted, this characteristic is not universally true, but it is true for exponential family models. This means that in addition to the 2PL, it is a well-known (e.g., Bradlow, 1996; DeMars, 2010; Kolstad et al., 1998; Magis, 2015; van der Linden, 1998; Veerkamp, 1996) property of the one-parameter logistic (1PL) as well as the so-called divide-by-total polytomous models: the partial credit model (Masters, 1982), the generalized partial credit model (Muraki, 1992), the rating scale model (Andrich, 1978), and the nominal response model (Bock, 1972). So, while there are occasions when observed information may yield better standard error estimates than (expected) information (Efron & Hinkley, 1978; Lindsay & Li, 1997) or may provide unique diagnostic information (Bradlow, 1996), this is not one of those occasions.
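A quick numerical check of this equivalence for the 2PL, again a sketch under assumed, illustrative parameters:

```python
import math

a, b = 1.2, 0.3       # assumed (illustrative) 2PL parameters; c = 0
theta, h = -0.4, 1e-4

def p2pl(t):
    return 1.0 / (1.0 + math.exp(-a * (t - b)))

def observed_info(t, x):
    """Equation 5 for the 2PL, via a central second difference."""
    ll = lambda u: x * math.log(p2pl(u)) + (1 - x) * math.log(1.0 - p2pl(u))
    return -(ll(t + h) - 2.0 * ll(t) + ll(t - h)) / h**2

p = p2pl(theta)
print(f"J(theta|x=1)  = {observed_info(theta, 1):.5f}")
print(f"J(theta|x=0)  = {observed_info(theta, 0):.5f}")
print(f"a^2 P (1 - P) = {a**2 * p * (1.0 - p):.5f}  # Equation 9")
```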
Given their equivalence, Huynh’s suggestion that the observed information for a correct response and the expected information function for a given item are distinct even when $c = 0$ has perhaps contributed to some of the confusion over precisely what information Huynh was referring to in his published work (e.g., Cizek et al., 2004; Karantonis & Sireci, 2006). To get to the bottom of this mystery, we turn now to Bock’s well-known 1972 study, which introduced the nominal response model and which appears to be foundational in Huynh’s thinking (e.g., Huynh, 1994, 1998, 2000a, 2000b). Bock (1972) suggests that expected item information can be partitioned into the “information due to” each response (p. 44, Equation 25),7
$$\tilde{I}_h(\theta) = P_h(\theta)\,I(\theta) \quad (10)$$
where $\tilde{I}_h(\theta)$ denotes the partitioned information8 corresponding to response category $h$. Bock states that Equation 10 follows from Samejima’s (1969) monograph, which introduced (at least in the context of item response theory) both the concepts of item response information and information share. As noted, Bock’s nominal model is an exponential family model and therefore $J(\theta \mid h) = I(\theta)$ for every response category, in which case the partitioned information, $P_h(\theta)\,I(\theta)$, equals Samejima’s information share, $J(\theta \mid h)\,P_h(\theta)$. So, while he does not use the term information share, had Bock been aware that his model shared this property of other exponential family models, he may have intended Equation 10 to describe Samejima’s concept of information share. However, without such awareness, Equation 10 would not appear to correspond to item information, item response information, or information share. Still, there is some evidence that Bock (1972) may have intended Equation 10 to describe item response information (i.e., observed information)—he appears to use Equation 10 in his example to calculate “the information due to response in category h [$\tilde{I}_h(\theta)$]” (p. 44)—and Huynh (1998) appears to interpret Equation 10 in this way (p. 38). Either way, when Huynh advocates using the RP associated with the maximum information for a correct response, he maximized Bock’s partitioned information, Equation 10, rather than Samejima’s formulation, the standard observed information for a correct response shown in Equation 5.9
Given that Equation 10 does not describe item response information, Huynh’s interpretation ends up being problematic. Specifically, the probability of a correct response when $\tilde{I}_1(\theta)$ (Equation 10) is maximized is potentially quite different from the result when $J(\theta \mid x{=}1)$ (Equation 5) is maximized. To see this, let $\theta^*$ denote the location on the proficiency metric where $\tilde{I}_1(\theta)$ is maximized:
$$\theta^* = \operatorname*{arg\,max}_{\theta}\, \tilde{I}_1(\theta), \qquad \tilde{I}_1(\theta) = P(\theta)\,I(\theta) = \frac{a^2\,[1 - P(\theta)]\,[P(\theta) - c]^2}{(1 - c)^2} \quad (11)$$
Here, $\theta^*$ can be found in the normal way by taking the root of Equation 11’s derivative,
$$\theta^* = b + \frac{\ln 2}{a} \quad (12)$$
Equation 1 can then be evaluated using the result from Equation 12. So doing yields
$$P(\theta^*) = \frac{2 + c}{3} \quad (13)$$
which is the formula Huynh (1998, 2000a, 2000b, 2006) adopts. When $c = 0$, Equation 13 evaluates to 2/3, the familiar RP67.
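Huynh’s result is easy to reproduce numerically: maximize the partitioned information for a correct response (Equation 10) over a grid of $\theta$ and evaluate Equation 1 at the maximizer. The sketch below assumes illustrative 3PL parameters; the grid search stands in for the analytic root-finding of Equations 11 and 12.

```python
import math

a, b, c = 1.0, 0.0, 0.2  # assumed (illustrative) 3PL parameters

def p3pl(t):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (t - b)))

def partitioned_info(t):
    """Equation 10 for a correct response: P(theta) times the 3PL item information."""
    p = p3pl(t)
    item_info = (a**2 * (1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
    return p * item_info

grid = [i / 1000.0 for i in range(-4000, 4001)]
theta_star = max(grid, key=partitioned_info)  # Equation 11, by grid search

print(f"theta*    = {theta_star:.3f}  (Equation 12: b + ln(2)/a = {b + math.log(2.0) / a:.3f})")
print(f"P(theta*) = {p3pl(theta_star):.4f}  (Equation 13: (2 + c)/3 = {(2.0 + c) / 3.0:.4f})")
```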
If we wanted to instead maximize the actual observed information for a correct response, $J(\theta \mid x{=}1)$, we would take the same approach, first finding
$$\theta^{**} = \operatorname*{arg\,max}_{\theta}\, J(\theta \mid x{=}1), \qquad J(\theta \mid x{=}1) = \frac{a^2\,[P(\theta) - c]\,[1 - P(\theta)]\,[P(\theta)^2 - c]}{(1 - c)^2\,P(\theta)^2} \quad (14)$$
by taking the root of the derivative of $J(\theta \mid x{=}1)$:
$$\frac{\partial}{\partial \theta}\,J(\theta \mid x{=}1) = 0 \quad\Longleftrightarrow\quad 2P^4 - (1 + c)P^3 - c(1 + c)P + 2c^2 = 0, \quad P = P(\theta) \quad (15)$$
and then plugging this result into Equation 1, this time yielding
$$P(\theta^{**}) = \text{the root of } 2P^4 - (1 + c)P^3 - c(1 + c)P + 2c^2 = 0 \text{ on the interval } (\sqrt{c},\, 1) \quad (16)$$
It might be of some interest (albeit hopefully very little) to compare the “optimal” RP values using Huynh’s Equation 13 with those using Equation 16, which—to reiterate—corresponds to the accepted definition of observed information for a correct response. To this end, Figure 2 shows the probability of a correct response at the point of maximum information for a correct response using these two different formulations. The resultant response probabilities are optimal insofar as they maximize information associated with a correct response (in the sense of Equations 5 and 10). Note that these response probabilities are plotted as a function of the 3PL’s c-parameter—the only parameter left in Equations 13 and 16. Somewhat ironically, the largest difference occurs at $c = 0$, the value proponents of the bookmark method suggest using.
Figure 2.

The probability of a correct response at the point of maximum information for a correct response as a function of the 3PL’s c-parameter.
Note. The solid line shows the result using Equation 16 and the dashed line shows the result using Huynh’s formulation shown in Equation 13. It can be seen that while the lines intersect at one point, they are quite distinct—most markedly when $c = 0$. 3PL = three-parameter logistic.
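The comparison in Figure 2 can be reproduced, at least approximately, by maximizing the observed information for a correct response numerically for each value of $c$ and recording the success probability at the maximizer. The sketch below assumes $a = 1$ and $b = 0$ (the resulting probabilities depend only on $c$) and uses a finite-difference approximation in place of Equation 15’s analytic root.

```python
import math

def rp_values(c, a=1.0, b=0.0, h=1e-4):
    """Return (Equation 13's RP, Equation 16's RP) for a given c."""
    def p(t):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (t - b)))

    def j_correct(t):
        # Equation 5 with x = 1: -d^2 log P(theta) / d theta^2, by differencing.
        return -(math.log(p(t + h)) - 2.0 * math.log(p(t))
                 + math.log(p(t - h))) / h**2

    grid = [i / 1000.0 for i in range(-6000, 6001)]
    theta_2star = max(grid, key=j_correct)  # Equation 14, by grid search
    return (2.0 + c) / 3.0, p(theta_2star)

for c in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    huynh, actual = rp_values(c)
    print(f"c = {c:.1f}:  Equation 13 -> {huynh:.3f}   Equation 16 -> {actual:.3f}")
```

At $c = 0$ the correct-response observed information reduces to $a^2 P(\theta)[1 - P(\theta)]$ (Equation 9), which is maximized at $P = .5$ rather than .67.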
Discussion
For more than 20 years, advocates and practitioners of the bookmark method have defended an RP of .67 based, in part, on Huynh’s claim that the optimal RP maximizes the information for a correct response, which he calculated using Equation 13, $P(\theta^*) = (2 + c)/3$. In this article, it was shown that Huynh’s result was inconsistent with the accepted definition of observed information and that the correct formula for maximizing the information for a correct response is actually the somewhat different Equation 16. Furthermore, the observation was made that in many cases—notably when $c = 0$, which proponents of the bookmark method strongly encourage practitioners to adopt and which moreover is the condition for which Equation 13 evaluates to .67—the information function for a correct response is mathematically equivalent to the information function for an incorrect response as well as the overall item information. That Huynh’s result has been repeated so often and so uncritically should perhaps give us pause. More generally, it raises the question of whether or not RP67 should truly allay concerns over panelists’ inability to correctly adjust their judgments when given different response probabilities.
For the avoidance of doubt, the fact that Equation 16 yields the actual probability of a correct response at the point of maximum information for a correct response does not mean that Equation 16 somehow provides an optimal RP. Despite the attraction of maximizing information, recall that our specific interest in Fisher information is because of its relationship to the (local) curvature of the log-likelihood. And the likelihood that preoccupies Huynh and others in this case is, regrettably, not the likelihood of a cut score estimate given a bookmark—it is the likelihood of proficiency given an examinee response. That is to say, the whole notion of maximizing item information (whether observed or expected) is premised on maximizing the curvature of the wrong likelihood altogether and is therefore not relevant to the question of what RP minimizes panelists’ errors.
Ultimately, finding the optimal RP will require a probabilistic description of what happens when a panelist interacts with an ordered item booklet. This is not a trivial undertaking and it will be left to future researchers. Until such a time, we should perhaps ask ourselves whether or not practice could be improved by making the “once controversial question” of optimal RP controversial once again.
Acknowledgments
I would like to express my gratitude to the National Board of Medical Examiners for supporting this work.
1. The approach of taking the midpoint corresponds to the instructions given by Lewis, Mitzel, and Green (1996); however, other approaches have been suggested (e.g., Cizek, Bunch, & Koons, 2004; Lewis, Green, Mitzel, Baum, & Patz, 1998). Although some are more defensible than others, the issues discussed in this article, which relate to the choice of RP, are independent of whichever approach is taken.
2. Strictly speaking, the claim is that the optimum RP is the one that equals the probability of a correct response on a given item for the proficiency at the location of maximum information for a correct response, but aside from being excessively pedantic, this description may be a case of sacrificing clarity at the altar of precision.
3. The actual (signed) curvature of $\ell(\theta \mid x)$ is given by $\kappa = \dfrac{\partial^2 \ell/\partial \theta^2}{\left[1 + (\partial \ell/\partial \theta)^2\right]^{3/2}}$, and it can be seen that here the curvature will only equal $\partial^2 \ell/\partial \theta^2$ when $\partial \ell/\partial \theta = 0$. Nevertheless, things sort themselves out nicely once an actual maximum likelihood estimate $\hat{\theta}$ exists because the slope of the log-likelihood is, of course, zero at its maximum—so, because $\partial \ell/\partial \theta = 0$ at $\hat{\theta}$, it follows that $\kappa = \partial^2 \ell/\partial \theta^2$ at $\hat{\theta}$. Thus, $\partial^2 \ell/\partial \theta^2$ is the local curvature because it equals the actual curvature of the log-likelihood only where $\partial \ell/\partial \theta = 0$.
4. The relationship between the form shown in Equation 6 and Birnbaum’s (1968) formulation can be found in Hambleton and Swaminathan (1985).
5. the problem with Birnbaum’s derivation . . . was that it obscures the contribution, that is, the amount of information share, of each of the two item response categories to the amount of item information at a given ability level. Conceptually, it is important to recognize that all response categories of an item contribute to the amount of item information, not just the correct response. (p. 77)
Given the context, the final sentence in this passage seems to be contrasting item information share with Birnbaum’s item information; however, oddly the final clause, “not just the correct response,” does not refer to either paradigm.
6. The question of why Huynh and bookmark’s advocates have a special interest in the two-parameter logistic (2PL) is fascinating in its own right. Briefly, it is asserted that panelists are unable to conceptualize the guessing behavior of examinees and therefore the (so-called) guessing parameter, $c$, should be changed to zero while keeping the other item parameters fixed to their estimated values. Although this is not the same as fitting a 2PL to the calibration sample’s response data (because no new estimation takes place), the result takes on the mathematical form of the 2PL.
7. Bock’s notation (which, incidentally, appears to contain an error) was changed to be consistent with the notation used in this article.
8. The term partitioned information is used here because it is consistent with Bock’s (1972) description (“item information may be partitioned among the response categories,” p. 44) and to distinguish it from observed information, expected information, and information share; however, the extent to which this term usefully characterizes Bock’s formulation is debatable.
9. Huynh’s thinking is enigmatic here: It is clear from Huynh (1998) that he was aware of cases wherein $J(\theta \mid x) \neq I(\theta)$ (e.g., the three-parameter logistic [3PL]), yet he nevertheless considered Equation 10 preferable to the traditional definition of observed information for a correct response. It appears that Huynh found the prospect of negative observed information in the case of the 3PL disqualifying because of the difficulty it posed for interpretation. He also offered a supposedly Bayesian argument (Huynh, 1998, 2000a); however, it was convincingly argued in Kolstad et al. (1998) that this justification was without merit.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- Andrich D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
- Baker F. B., Kim S. H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Dekker.
- Beretvas S. N. (2004). Comparison of bookmark difficulty locations under different item response models. Applied Psychological Measurement, 28, 25-47.
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.
- Bock R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
- Bradlow E. T. (1996). Teacher’s corner: Negative information and the three-parameter logistic model. Journal of Educational and Behavioral Statistics, 21, 179-185.
- Cizek G. J., Bunch M. B., Koons H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50.
- DeMars C. (2010). Item response theory. Oxford, UK: Oxford University Press.
- Efron B., Hinkley D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65, 457-483.
- Hambleton R. K., Pitoniak M. J. (2006). Setting performance standards. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 433-470). Washington, DC: American Council on Education.
- Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
- Hauser R. M., Edley C. F., Jr., Koenig J. A., Elliott S. W. (2005). Measuring literacy: Performance levels for adults. Washington, DC: National Academies Press.
- Huynh H. (1994, October). Some technical aspects of standard setting. In Crocker L., Zieky M. (Eds.), Proceedings of the joint conference on standard setting for large scale assessment programs, Volume II (pp. 75-91). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
- Huynh H. (1998). On score locations of binary and partial credit items and their applications to item mapping and criterion-referenced interpretation. Journal of Educational and Behavioral Statistics, 23, 35-56.
- Huynh H. (2000a, April). On Bayesian rules for selecting 3PL binary items for criterion-referenced interpretation and creating ordered item booklets for bookmark standard setting. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
- Huynh H. (2000b, April). On item mapping and statistical rules for selecting binary items for criterion-referenced interpretation and bookmark standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
- Huynh H. (2006). A clarification on the response probability criterion RP67 for standard settings based on bookmark and item mapping. Educational Measurement: Issues and Practice, 25(2), 19-20.
- Karantonis A., Sireci S. G. (2006). The bookmark standard-setting method: A literature review. Educational Measurement: Issues and Practice, 25(1), 4-12.
- Kolstad A., Cohen J., Baldi S., Chan T., DeFur E., Angeles J. (1998). The response probability convention used in reporting data from IRT assessment scales: Should NCES adopt a standard? Washington, DC: American Institutes for Research.
- Lewis D. M., Green D. R., Mitzel H. C., Baum K., Patz R. J. (1998, April). The bookmark standard setting procedure: Methodology and recent implementations. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
- Lewis D. M., Mitzel H. C., Green D. R. (1996, June). Standard setting: A bookmark approach. In Green D. R., IRT-based standard-setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.
- Lewis D. M., Mitzel H. C., Mercado R. L., Schulz E. M. (2012). The bookmark standard setting procedure. In Cizek G. J. (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 225-253). New York, NY: Routledge.
- Lin J. (2006). The bookmark procedure for setting cut-scores and finalizing performance standards: Strengths and weaknesses. Alberta Journal of Educational Research, 52(1), 36-52.
- Lindsay B. G., Li B. (1997). On second-order optimality of the observed Fisher information. The Annals of Statistics, 25, 2172-2199.
- Lorié W. A. (2002). Setting defensible cut scores: Canonical pseudo-responses, item types, and performance standards (Unpublished doctoral dissertation). Stanford University, Stanford, CA.
- Magis D. (2015). A note on the equivalence between observed and expected information functions with polytomous IRT models. Journal of Educational and Behavioral Statistics, 40, 96-105.
- Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
- Mitzel H. C., Lewis D. M., Patz R. J., Green D. R. (2001). The bookmark procedure: Psychological perspectives. In Cizek G. J. (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 249-281). Mahwah, NJ: Lawrence Erlbaum.
- Muraki E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
- Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17). Richmond, VA: Psychometric Society.
- Samejima F. (1973). A comment on Birnbaum’s three-parameter logistic model in the latent trait theory. Psychometrika, 38, 221-233.
- Skaggs G., Hein S. F., Awuor R. (2007). Setting passing scores on passage-based tests: A comparison of traditional and single-passage bookmark methods. Applied Measurement in Education, 20, 405-426.
- Skaggs G., Tessema A. (2001, April). Item disordinality with the bookmark standard setting procedure. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
- Suh Y., Bolt D. M. (2010). Nested logit models for multiple-choice item response data. Psychometrika, 75, 454-473.
- van der Linden W. J. (1998). Bayesian item selection criteria for adaptive testing. Psychometrika, 63, 201-216.
- van der Linden W. J. (2010). Item response theory. In McGaw B., Baker E., Peterson P. (Eds.), International encyclopedia of education (3rd ed., pp. 81-88). Oxford, UK: Elsevier.
- Veerkamp W. J. J. (1996). Statistical methods for computerized adaptive testing (Unpublished doctoral thesis). University of Twente, Enschede, The Netherlands.
- Williams N. J., Schulz E. M. (2005, April). An investigation of response probability (RP) values used in standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Québec, Canada.