Educational and Psychological Measurement
2026 Feb 25. Online ahead of print. doi: 10.1177/00131644261420391

Estimation of Conditional Standard Errors of Measurement for MLE Scores in MST

Yuanyuan J. Stirn and Won-Chan Lee
PMCID: PMC12945742  PMID: 41767786

Abstract

This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.’s analytic approach. Practical implications of the proposed method are discussed.

Keywords: multistage testing, conditional standard error of measurement, maximum likelihood estimation, test information function

Introduction

In traditional linear tests, the items are fixed: every test-taker receives the same set of items, spanning various difficulty levels, at the same time. Consequently, to measure test-takers of differing abilities precisely, a linear test requires a large number of items covering a broad range of difficulty, which makes it inefficient.

Computer adaptive testing (CAT), on the other hand, tailors the test to each test-taker by selecting and administering items based on real-time estimates of ability. This adaptive nature greatly improves test efficiency, as fewer items are needed per administration, and CAT has accordingly surged in popularity in recent years. However, because items are selected in real time, each test-taker may receive different items and even a different test length, which makes conventional review of test content impossible.

This challenge underscores the need for alternative methods like multistage testing (MST). In MST, test-takers are assigned to different modules with varying levels of difficulty based on their ability estimates, where items are predetermined for each module. This not only improves test efficiency but also allows content experts to review all the materials and content specifications before the test administration. Therefore, MST combines the advantages of both linear tests and CAT and is viewed as the best alternative for addressing content-related challenges (Kimura, 2017).

However, the unique structure of MST brings a challenge in assessing the precision of ability estimation, which is commonly measured by the conditional standard error of measurement (CSEM). As noted by Lee and Harris (2025), CSEMs are generally more informative than reliability coefficients or a single overall standard error of measurement when assessing measurement precision, because standard errors typically vary as a function of true scores. Moreover, CSEMs can be used to construct confidence intervals at different score levels (American Educational Research Association et al., 2014). In item response theory (IRT), CSEMs provide information about measurement precision for ability estimates across the theta scale. Furthermore, CSEMs under IRT are intrinsic properties of a test and are independent of any particular group of examinees, conditional on theta. As a result, CSEMs are recommended as the primary index of dependability (Samejima, 1977) and are commonly used to evaluate MST designs (e.g., MacGregor et al., 2022; Yamamoto et al., 2018).

For linear tests using maximum likelihood estimation, CSEMs can be approximated as the inverse of the square root of the test information function (TIF), which is defined as the sum of the item information functions (Fisher, 1922; Lord, 1980). Accurate ability estimation requires high test information, which in turn yields smaller CSEMs. In MST, however, the possibility that different test-takers follow different routes complicates the analytic computation of CSEMs. As a result, simulation methods have become the gold standard for calculating CSEMs and reliability, despite their substantial computational demands.
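To make the TIF-to-CSEM relationship concrete for a linear test, the sketch below computes 3PL item information, sums it into a TIF, and takes the CSEM as the reciprocal of the square root. This is a minimal illustration in R with hypothetical item parameters, not the operational code used in this study; the scaling constant D = 1.702 is one common convention.

```r
# Minimal sketch: CSEM for a linear test under MLE via the 3PL TIF.
# Item parameters below are hypothetical and for illustration only.
item_info_3pl <- function(theta, a, b, c, D = 1.702) {
  P <- c + (1 - c) / (1 + exp(-D * a * (theta - b)))   # 3PL response curve
  (D * a)^2 * ((1 - P) / P) * ((P - c) / (1 - c))^2    # Fisher information
}

csem_mle <- function(theta, a, b, c) {
  tif <- rowSums(mapply(function(ai, bi, ci)
    item_info_3pl(theta, ai, bi, ci), a, b, c))        # TIF = sum over items
  1 / sqrt(tif)                                        # CSEM = 1 / sqrt(TIF)
}

# 30 hypothetical items, CSEM across a theta grid
set.seed(1)
a <- runif(30, 0.5, 2.0); b <- rnorm(30); c <- runif(30, 0.05, 0.3)
round(csem_mle(theta = seq(-2.5, 2.5, 0.5), a, b, c), 3)
```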

Several analytic methods have been developed as more cost-efficient alternatives to simulation for estimating CSEMs in the context of MST, including those by Park et al. (2017) and Lim et al. (2021). Lim et al. (2021) introduced a recursive analytic method that utilizes number correct (NC) scores in conjunction with the test characteristic curve (TCC) to calculate CSEMs. This method first calculates the joint conditional distribution of NC scores of different MST routes using the Lord and Wingersky (1984) recursion method. It then uses equated number correct (ENC) scoring (Kolen & Brennan, 2014; Lord, 1980; Stocking, 1996) to obtain the joint conditional distributions of estimated theta values, from which CSEMs are derived. The accuracy of this analytic method was examined using simulation, with results showing that it produces CSEMs similar to those obtained from simulation. However, this method may not be directly applicable for approximating CSEMs in MST settings that employ MLE. In particular, the analytic method relies on ENC scoring, which differs fundamentally from MLE from a psychometric perspective (Stocking, 1996). Whereas MLE is an IRT pattern scoring method that fully utilizes information in the response pattern, ENC scoring is an IRT summed scoring method that does not fully exploit this information.
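For readers unfamiliar with the recursion at the core of Lim et al.’s approach, the sketch below implements the Lord and Wingersky (1984) algorithm for the conditional NC-score distribution given theta. This is a generic textbook version, assuming only a vector of per-item correct-response probabilities at a fixed theta; it is not Lim et al.’s full method, which layers ENC scoring and MST routing on top of this recursion.

```r
# Lord-Wingersky (1984) recursion: conditional distribution of
# number-correct (NC) scores given theta. The input p holds each item's
# probability of a correct response at that theta (e.g., from the 3PL).
lw_recursion <- function(p) {
  f <- 1                                      # P(score = 0) before any item
  for (pj in p) {
    f <- c(f * (1 - pj), 0) + c(0, f * pj)    # item answered wrong vs. right
  }
  f                                           # P(score = 0), ..., P(score = n)
}

# Hypothetical four-item example at a fixed theta
lw_recursion(c(0.8, 0.6, 0.7, 0.5))           # sums to 1
```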

Moreover, Stocking (1996) argued that ENC scoring can yield results reasonably comparable to those obtained using MLE. Consistent with this view, Lim et al. (2021) reported simulation results showing comparable performance between ENC and MLE for ability levels in the range −1.5 ≤ θ ≤ 1.5. However, their full-scale simulations indicated that the analytic results consistently aligned with simulation results under ENC scoring, whereas larger discrepancies emerged when compared with simulation results based on MLE scoring outside that range. Consequently, this analytic method appears most suitable for approximating CSEMs in MST with ENC scoring and may be less appropriate when MSTs are scored using MLE, particularly for examinees at more extreme ability levels.

Park et al.’s (2017) method estimates CSEMs by computing a moving-average TIF over five points around a target theta value, with the CSEM of the ability estimate obtained as the reciprocal of the square root of this average TIF. However, averaging TIF values across a small set of randomly selected points may compromise estimation accuracy. Lim et al. (2021) observed in their preliminary simulation study that the TIF varied substantially depending on both the locations and the number of points used in the averaging. In addition, their simulation results indicated that the CSEMs obtained using Park et al.’s (2017) analytic method exhibited larger discrepancies relative to simulation-based estimates. This finding contrasts with the results reported by Park et al. (2017) and may be attributable to the fact that their method does not explicitly account for secondary routing paths; consequently, the accuracy of the method may vary across MST designs.

Jing (2021) proposed an analytic method for calculating CSEMs for MLEs in MST using the exact TIF of each possible MST route. However, when calculating routing probabilities for MSTs with three or more stages, Jing assumed independence across stages, without accounting for the restricted range of theta values at the third stage, which may have led to inaccurate CSEM estimates.

Given the limitations of existing methods, this paper proposes a modified analytic method, building on Jing’s (2021) work, for estimating CSEMs and reliability for MLE in MST. We aim to estimate CSEMs accurately for a wide range of MST designs under MLE scoring without increasing computational complexity. The accuracy of the proposed method is evaluated by comparing its results with those obtained through simulation. Also, the CSEMs calculated by the proposed method are compared with those obtained using Park et al.’s (2017) analytic method.

In summary, Park et al. (2017), Jing (2021), and the proposed method in the current study all rely on the inverse of TIF to approximate CSEMs. Consequently, these approaches are theoretically aligned with MSTs that use MLE scoring, because the inverse of the TIF corresponds to the asymptotic variance of MLE and does not generally apply to other IRT pattern scoring methods, such as expected a posteriori (EAP). By contrast, the core of Lim et al.’s (2021) method is the use of ENC scoring. ENC is a summed scoring method that combines NC scores with the TCC to obtain ability estimates.

In general, MLE scoring utilizes the full response-pattern information and is considered more informative than summed scoring methods. As noted by Lord (1980, 1983), the asymptotic properties of the MLE, such as consistency and asymptotic unbiasedness, make it the most informative estimator of ability under appropriate conditions. However, MLE cannot provide estimates when all responses are identical and requires sufficient test information to yield accurate estimates. Among analytic methods based on the TIF, Park et al. (2017) employ an average TIF, which may compromise accuracy, whereas Jing (2021) assumes independence across stages and therefore does not account for the restricted ability ranges at later stages.

Methods

MST Design and Simulation Settings

Two two-stage (1–3) designs and two three-stage (1–2–3) designs were manually assembled from two item pools calibrated using the IRT 3PL model. The first item pool consisted of 60 items, and the second contained 800 items. Both item pools were calibrated using real data. The first two-stage 1–3 design (MST1) was assembled from the first item pool and comprised 25 items in total, with 10 items in the routing module and 5 items in each of the three second-stage modules. The second two-stage 1–3 MST (MST2) was assembled from the second item pool and contained 15 items in each module. Both three-stage MST (1–2–3) designs were assembled from the second pool. The first (MST3) included five items per module, whereas the second (MST4) included 10 items per module. Tables 1 and 2 provide the item-parameter distributions for the item pools and the four MST designs.

Table 1.

Item-Parameter Distribution for Item Pools.

Item Pool 1
Parameter    M        SD      Max     Min
a            1.11     0.31    1.91    0.53
b            0.09     0.98    2.10    −1.90
c            0.07     0.16    0.44    0.07

Item Pool 2
Parameter    M        SD      Max     Min
a            0.79     0.30    2.35    0.24
b            −0.05    0.88    2.80    −3.64
c            0.26     0.10    0.50    0.03

Table 2.

Item-Parameter Distribution for MST Designs.

MST1 (1–3 design, 15 items per path)
Statistic    a       b        c
Mean         1.14    0.45     0.17
SD           0.25    1.11     0.06
Max          1.84    2.10     0.29
Min          0.78    −1.85    0.07

MST2 (1–3 design, 30 items per path)
Statistic    a       b        c
Mean         1.27    0.08     0.28
SD           0.29    0.68     0.12
Max          2.35    2.07     0.50
Min          1.00    −1.37    0.05

MST3 (1–2–3 design, 15 items per path)
Statistic    a       b        c
Mean         1.30    −0.01    0.27
SD           0.32    0.91     0.12
Max          2.10    2.07     0.50
Min          1.00    −1.61    0.05

MST4 (1–2–3 design, 30 items per path)
Statistic    a       b        c
Mean         0.89    −0.03    0.26
SD           0.30    1.00     0.11
Max          2.00    2.12     0.48
Min          0.60    −2.51    0.04

Simulations were conducted using R (version 4.x), with Fisher’s Maximum Information (FMI) as the routing method, and MLE employed for ability estimation. One hundred theta values were randomly drawn from the standard normal distribution. Of these, 99 fell within the range −2.5 to 2.5, with one value at approximately −3.73. For each MST design and each theta value, 1,000 replications were performed.
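The simulation-based CSEM at each true theta is, in essence, the standard deviation of the MLE across replications. The sketch below illustrates that logic for a single fixed set of items; the FMI routing layer is omitted for brevity, and the item parameters, replication count, and bounded search interval are all assumptions for illustration rather than the settings of this study.

```r
# Hedged sketch of a simulation-based CSEM for one fixed item set:
# simulate responses at a true theta, score each replication by MLE,
# and take the SD of the estimates.
p3pl <- function(theta, a, b, c, D = 1.702)
  c + (1 - c) / (1 + exp(-D * a * (theta - b)))

mle_theta <- function(x, a, b, c) {
  nll <- function(th) {                       # negative log-likelihood
    p <- p3pl(th, a, b, c)
    -sum(x * log(p) + (1 - x) * log(1 - p))
  }
  # Bounded search; note the MLE is undefined for all-0/all-1 response
  # patterns, which this interval merely truncates rather than resolves.
  optimize(nll, interval = c(-4, 4))$minimum
}

sim_csem <- function(theta, a, b, c, reps = 1000) {
  est <- replicate(reps, {
    x <- rbinom(length(a), 1, p3pl(theta, a, b, c))   # simulated responses
    mle_theta(x, a, b, c)
  })
  sd(est)                                     # empirical CSEM at this theta
}

# Hypothetical 15-item example at theta = 0
set.seed(2)
a <- runif(15, 0.5, 2.0); b <- rnorm(15); c <- runif(15, 0.05, 0.3)
sim_csem(theta = 0, a, b, c, reps = 200)
```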

The Analytic Method

In this study, a “route” or “path” refers to a particular sequence of modules an examinee may follow across stages; the two terms are used interchangeably. Let $\theta_{\text{cut},s,k}$ denote the $k$th cut score used to route examinees into stage $s$. Cut scores are determined using approximate maximum information (AMI), which places each cut point at the intersection of the adjacent modules’ TIFs. Ability estimates are assumed to follow a normal distribution with mean equal to the true theta ($\theta$) and variance equal to the inverse of the TIF associated with the path taken.
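As a small illustration of the AMI idea, the sketch below finds the theta at which two adjacent module TIFs cross. The two bell-shaped TIFs are hypothetical stand-ins, not the module TIFs used in this study; in practice each would be the sum of its items’ 3PL information functions.

```r
# Hedged sketch of AMI cut-score placement: the routing cut is the theta
# where the TIFs of two adjacent modules intersect. These module TIFs are
# hypothetical stand-ins for illustration.
tif_easy   <- function(th) 6 * dnorm(th, mean = -1.0, sd = 1.2)
tif_medium <- function(th) 6 * dnorm(th, mean =  0.5, sd = 1.2)

cut <- uniroot(function(th) tif_easy(th) - tif_medium(th),
               interval = c(-2, 2))$root      # TIF intersection point
cut                                           # AMI cut between Easy and Medium
```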

Following Jing (2021), the analytic method for a two-stage 1–3 MST is illustrated in Figure 1. Assuming the estimated thetas ($\hat{\theta}$) after the routing module follow a normal distribution with mean equal to the true theta ($\theta$) and variance equal to the inverse of the routing module TIF ($I_R(\theta)$), the probability of an examinee being routed to the Easy module at Stage 2 can be calculated as the area ($p_1$) under the normal curve: $p_1 = P(\theta_{\min} \le \hat{\theta} \le \theta_{\text{cut},2,1})$. Similarly, the probabilities of being routed to the Medium and Hard modules are $p_2 = P(\theta_{\text{cut},2,1} \le \hat{\theta} \le \theta_{\text{cut},2,2})$ and $p_3 = P(\theta_{\text{cut},2,2} \le \hat{\theta} \le \theta_{\max})$, respectively. For each possible path $r$, the path-specific CSEM is computed as the square root of the inverse of the TIF associated with that path, which aggregates information across all items administered along the path ($I_r(\theta)$):

Figure 1. Calculating the Routing Probability of a Two-Stage (1–3) MST.

$$\sigma_{\hat{\theta} \mid \theta}^{(r)} = \frac{1}{\sqrt{I_r(\theta)}}.$$

Given the routing probabilities $p_r$ obtained using the normal distribution, the overall error variance ($\sigma_{\hat{\theta} \mid \theta}^{2}$) of the entire panel is calculated by taking a weighted average of the path-level error variances:

$$\sigma_{\hat{\theta} \mid \theta}^{2} = \sum_{r=1}^{R} p_r \left( \sigma_{\hat{\theta} \mid \theta}^{(r)} \right)^{2},$$

where $R$ represents the total number of possible paths in an MST design. The square root of $\sigma_{\hat{\theta} \mid \theta}^{2}$ gives the CSEM for the entire panel.
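Putting these pieces together, the sketch below evaluates the analytic CSEM for a two-stage 1–3 panel at a single true theta. The information values and cut scores passed in are hypothetical placeholders; in practice $I_R$ and the path TIFs come from the calibrated item parameters, and $\theta_{\min}$/$\theta_{\max}$ are treated here as $-\infty$/$\infty$.

```r
# Hedged sketch of the two-stage (1-3) analytic CSEM at one true theta.
# I_R and I_path are hypothetical scalar information values at that theta.
panel_csem_1_3 <- function(theta, I_R, I_path, cuts) {
  se_route <- 1 / sqrt(I_R)                     # SD of theta-hat after routing
  p1 <- pnorm(cuts[1], theta, se_route)         # routed to Easy
  p2 <- pnorm(cuts[2], theta, se_route) - p1    # routed to Medium
  p3 <- 1 - p1 - p2                             # routed to Hard
  sqrt(sum(c(p1, p2, p3) * (1 / I_path)))       # weighted variance -> CSEM
}

# Hypothetical path TIFs (R+E, R+M, R+H) and stage-2 cut scores
panel_csem_1_3(theta = 0.5, I_R = 4, I_path = c(8, 10, 9),
               cuts = c(-0.25, 0.96))
```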

For three-stage MST designs, Jing (2021) used a similar approach. Figure 2 illustrates the computation of routing probabilities for a three-stage MST. Using the normality assumption and TIF, the probability that an examinee follows a given path is calculated as the product of the probability of being routed from the routing module to a Stage 2 module and the probability of being routed from a Stage 2 module to a Stage 3 module. However, Jing (2021) assumed independence across stages when computing path probabilities, without accounting for the restricted range of theta values at the third stage. The probability of entering a given Stage 3 module should be conditional on the specific Stage 2 module previously administered.

Figure 2. Calculating the Routing Probability of a Three-Stage (1–2–3) MST.

Therefore, in this paper, we propose a modified approach that explicitly accounts for the restricted theta ranges at later stages. For example, for Path 2 (1R-2E-3M), the maximum possible theta value should be $\theta_{\text{cut},2,1}$ instead of $\theta_{\max}$, because examinees routed to Module 2E must have ability estimates below this cut. Thus, the probability of following Path 2 must be conditional on first being routed to Module 2E. Specifically, the probability of following Path 2 is given by:

$$
\begin{aligned}
P_2 &= P(\text{Stage 2} = E \mid \text{Stage 1} = R) \times P(\text{Stage 3} = M \mid \text{Stage 2} = E) \\
    &= P(\text{Stage 2} = E) \times P(\text{Stage 3} = M \mid \text{Stage 2} = E) \\
    &= P(\theta_{\min} \le \hat{\theta} \le \theta_{\text{cut},2,1}) \times P(\theta_{\text{cut},3,1} \le \hat{\theta} \le \theta_{\text{cut},3,2} \mid \theta_{\min} \le \hat{\theta} \le \theta_{\text{cut},2,1}) \\
    &= P(\theta_{\min} \le \hat{\theta} \le \theta_{\text{cut},2,1}) \times P(\theta_{\text{cut},3,1} \le \hat{\theta} \le \theta_{\text{cut},2,1}).
\end{aligned}
$$

Using this conditional framework, routing probabilities for all possible paths can be calculated in an analogous manner (full derivations are provided in the Supplementary Material). Results presented in the next section demonstrate that the proposed method yields more accurate CSEM estimates than those reported in Jing (2021), particularly by properly accounting for the restricted ability ranges induced by earlier routing decisions.
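The sketch below translates the Path 2 expression above into R. Following the equation, conditioning on the earlier routing decision is expressed by truncating the stage-3 interval at the stage-2 cut score (assuming $\theta_{\text{cut},2,1} \le \theta_{\text{cut},3,2}$). The standard errors are hypothetical: `se1` plays the role of $1/\sqrt{I_R(\theta)}$ for the routing-stage estimate and `se2` the role of $1/\sqrt{I_{R+2E}(\theta)}$ for the stage-2 estimate.

```r
# Hedged sketch of the proposed conditional routing probability for
# Path 2 (1R-2E-3M) in a 1-2-3 design. Per the equation in the text,
# the stage-3 upper bound is restricted to the stage-2 cut score.
path2_prob <- function(theta, se1, se2, cut_2_1, cut_3_1) {
  p_2E <- pnorm(cut_2_1, mean = theta, sd = se1)   # routed to Module 2E
  p_3M <- pnorm(cut_2_1, mean = theta, sd = se2) - # upper bound truncated
          pnorm(cut_3_1, mean = theta, sd = se2)   # at the stage-2 cut
  p_2E * p_3M
}

# Example with hypothetical cut scores and information values
path2_prob(theta = -0.5, se1 = 1 / sqrt(4), se2 = 1 / sqrt(8),
           cut_2_1 = -0.25, cut_3_1 = -1.0)
```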

Results

MST Simulation

Figure 3 and Tables 3 and 4 provide the comparison of the CSEMs obtained using the analytic method and simulation. For visual clarity and ease of interpretation, the figures in the main text focus on the theta range from −2.5 to 2.5, which contains 99% of the true theta values. Results across the full theta range, including the single extreme value at approximately −3.73, are provided in the Supplementary Material.

Figure 3. Comparing Analytic and Simulation CSEMs Across MST Designs (Theta Range −2.5 to 2.5).

Table 3.

CSEM Differences Between Analytic and Simulation Results Across MSTs.

MST     Theta range    Mean |ΔCSEM|    SD |ΔCSEM|    RMSE    Average % difference    N
MST1    (−2.5, 2.5)    0.09            0.08          0.12    −10.29                  99
MST2    (−2.5, 2.5)    0.04            0.06          0.07    −5.75                   99
MST3    (−2.5, 2.5)    0.21            0.10          0.23    −28.19                  99
MST4    (−2.5, 2.5)    0.12            0.05          0.14    −21.32                  99

Note. The theta value of −3.73 was excluded. Results across the full theta range are provided in the Supplementary Material.

Table 4.

Paired t-Test Results Comparing CSEMs Between Simulation and the Analytic Method Across MSTs.

Comparison       df    t         p
MST1 vs. MST2    98    −11.57    <.001
MST3 vs. MST4    98    −10.76    <.001
MST1 vs. MST3    98    20.69     <.001
MST2 vs. MST4    98    11.16     <.001

Note. The theta value of −3.73 was excluded from the paired t-test.

As shown in Figure 3, the CSEMs calculated using the analytic method are consistently smaller than those obtained from the simulation approach. This pattern is expected: analytic CSEMs are derived from the inverse of the Fisher information and therefore rely on large-sample asymptotic assumptions, so some degree of underestimation is anticipated unless the test is sufficiently long. This issue is examined further with linear tests in the next section.

Comparing MST1 and MST2 in Figure 3, the analytic CSEMs for MST2 are significantly closer (p < .001) to the simulation-based CSEMs than those for MST1 (see Table 4). MST1 includes 15 items per path, while MST2 includes 30 items per path. Similar patterns were observed for the three-stage MST designs. Compared with MST3, which has 15 items per path, the analytic CSEMs for MST4 with 30 items per path are significantly closer (p < .001) to the simulation-based CSEMs (see Table 4). As expected, the relative performance of the analytic method compared with the simulation approach depends on test length: as the number of items per path increases, the two methods yield increasingly similar results.

Furthermore, comparing the two-stage and three-stage MST designs suggests that the performance of the analytic method is also influenced by design complexity. As shown in Figure 3 and Tables 3 and 4, MST3 shows significantly larger discrepancies between the simulation and analytic CSEMs than MST1 (p < .001), even though both MST designs include the same number of items per path.

A similar pattern is observed when comparing MST2 and MST4, both of which have 30 items per path. Despite this equivalence in test length, MST4 exhibits significantly larger discrepancies (p < .001) between the simulation and analytic CSEMs than MST2. This finding suggests that more complex MST designs may require more items to improve the accuracy of the analytic method.

In summary, the simulation results suggest that the accuracy of the analytic method depends on both test length and design complexity. Accuracy improves as the number of items increases, and more complex MST designs require a greater number of items to achieve comparable accuracy.

Linear Test Simulation

The patterns observed in the present study differ from those reported in Jing (2021). To further evaluate the validity of our findings, we conducted a follow-up simulation using linear tests, thereby eliminating the additional complexity introduced by MST designs. Three linear tests containing 15, 30, and 60 items, respectively, were simulated under conditions comparable to those used in the MST simulations. As shown in Figure 4, the results reveal a pattern consistent with the MST findings. The analytic CSEMs were consistently smaller than simulation CSEMs, and the magnitude of these differences decreased as test length increased.

Figure 4. Results of Three Linear Test Simulation Studies (Theta Range −2.5 to 2.5).

Comparison to Park et al.’s Method

Overall, the proposed method provides a robust approximation of CSEMs in MST, particularly for longer tests. It can be applied to any MST design with MLE scoring without introducing additional computational complexity. Park et al.’s (2017) analytic method also utilizes the TIF and is applicable to MST with MLE scoring. Therefore, we conducted a direct comparison between the two approaches, using simulation-based CSEMs as the benchmark.

Using the same four MST designs (MST1–MST4) and the 99 theta values between −2.5 and 2.5, we calculated the absolute differences between CSEMs obtained from each analytic method and those derived from simulation. The results are presented in Figure 5. Overall, differences decrease as test length increases. For the two-stage MST designs, the proposed method generally produced CSEMs that were closer to the simulation-based estimates than those obtained using Park et al.’s method, especially for MST2, in which each path contains 30 items. For the three-stage MST designs, however, Park et al.’s method yielded CSEMs that were somewhat closer to the simulation results for the two designs examined.

Figure 5. Absolute Difference Between Analytic and Simulation CSEMs (Proposed Method vs. Park et al.’s (2017) Method).

As shown in Figure 5, the largest divergence between the two analytic methods occurs around the first cut score (−0.25), whereas the proposed method performs better around a theta value of 1. One possible explanation, based on the MST designs used in this study, relates to the distribution of test information across the ability scale. Specifically, MST3 and MST4 exhibit relatively high TIF values near $\theta = 1$, which corresponds to the third cut score (0.96), but lower TIF values near the first cut score. In general, when an examinee’s true theta is close to a cut score associated with relatively low test information, there is a greater likelihood of being routed to an “incorrect” path. Because the proposed analytic method relies on theoretically derived routing probabilities, this increased likelihood of misrouting may reduce its accuracy under such conditions. In contrast, Park et al.’s (2017) method uses a moving-average TIF, computed over five points surrounding the target theta value. This averaging may help attenuate the influence of low TIF near a cut score, thereby partially reducing the impact of misrouting and leading to improved performance in these regions.

Discussion and Conclusions

This study offers a novel TIF-based analytic method for estimating CSEMs in MST using MLE scoring. The proposed method provides a faster and more cost-effective alternative to simulation for predicting CSEMs across a target range of ability levels and for evaluating measurement precision. Moreover, once the CSEMs are calculated, classification accuracy, another important statistic for evaluating MST-based decisions, can also be computed using Rudner’s (2000) method. Both Park et al. (2017) and Lim et al. (2021) estimated classification accuracy using CSEMs derived from their analytic methods. Although classification accuracy was not explicitly examined in the current study, Rudner’s (2000) method can be readily applied to CSEMs derived from the proposed method.
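For completeness, the sketch below illustrates the normal-approximation logic of Rudner’s (2000) method applied to an analytic CSEM: treating the ability estimate as normally distributed around the true theta with SD equal to the CSEM, the probability of each classification is the normal mass falling in each category. The theta, CSEM, and cut values are hypothetical.

```r
# Hedged sketch of Rudner's (2000) normal-approximation idea: treat
# theta-hat as N(theta, CSEM^2) and read off the mass in each category.
classification_probs <- function(theta, csem, cuts) {
  bounds <- c(-Inf, sort(cuts), Inf)
  diff(pnorm(bounds, mean = theta, sd = csem))  # P(theta-hat per category)
}

# Hypothetical examinee just above a single classification cut at 0
probs <- classification_probs(theta = 0.2, csem = 0.3, cuts = 0)
probs[2]                                        # P(correctly classified above)
```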

Compared with Park et al.’s (2017) TIF-based analytic method, the proposed method demonstrated improved accuracy in approximating CSEMs for two-stage MST designs. In contrast, Park et al.’s (2017) method outperformed the proposed method for the three-stage MST designs examined. This difference may be attributable to the fact that Park et al.’s method does not explicitly account for routing probabilities determined by subsequent cut scores; consequently, for a 1–2–3 design, only the first cut score influences the analytic approximation. We hypothesize that when test information around subsequent cut scores is relatively high, the proposed method—which incorporates all possible routes beyond those cut points—may be more beneficial. However, this hypothesis is speculative and was not directly tested in the current study. Future simulation studies are needed to further examine the relative accuracy of the two TIF-based methods.

Although many factors can influence MST design and simulation outcomes, such as cut-score spacing and routing criteria, it is not feasible to examine all such factors within a single study. In the current study, factors affecting the accuracy of the proposed method were identified based on observed patterns rather than systematic manipulations. Additional simulation studies are therefore needed to more systematically evaluate the performance of the proposed method across a broader range of MST designs and to assess the impact of specific design features and simulation settings.

For example, the present study focused on dichotomous 3PL IRT models with 1–3 and 1–2–3 MST designs. However, the proposed method is applicable to general MST designs and IRT models with calibrated item parameters, including polytomous IRT models. Future research may further examine the accuracy of the proposed method when applied to such models.

Finally, the proposed method assumes normality of the estimated ability distribution when calculating routing probabilities. This assumption may be violated in small modules, where departures from normality can occur. Future research may therefore examine the robustness of the proposed method to violations of the normality assumption. In addition, the development of software tools or an R package that integrates existing analytic methods for evaluating MST performance would facilitate broader adoption and practical implementation.

Supplemental Material

Supplemental material for this article, sj-pdf-1-epm-10.1177_00131644261420391, is available online.

Acknowledgments

I thank my friend and colleague Hong Chen for her encouragement and for sharing her MST simulation code from her past project, which informed and supported the development of my own.

Footnotes

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

Ethical Considerations: This is a simulation study that used previously calibrated item parameters. There are no human participants or identifiable human data in this article. Informed consent is not required.

Consent to Participate: Not applicable

Data Availability: This is a simulation study. No real-world human participant data were collected. The simulation design and settings are fully described in the article, enabling replication of the study.

Supplemental Material: Supplemental material for this article is available online.

References

1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
2. Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222(594–604), 309–368. https://doi.org/10.1098/rsta.1922.0009
3. Jing, S. (2021). Estimating psychometric properties of computerized multistage testing (Unpublished doctoral dissertation). University of Iowa. https://iro.uiowa.edu/esploro/outputs/doctoral/Estimating-psychometric-properties-of-computerized-multistage/9984124571302771
4. Kimura, T. (2017). The impacts of computer adaptive testing from a variety of perspectives. Journal of Educational Evaluation for Health Professions, 14, 12. https://doi.org/10.3352/jeehp.2017.14.12
5. Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
6. Lee, W., & Harris, D. J. (2025). Reliability in educational measurement. In L. L. Cook & M. J. Pitoniak (Eds.), Educational measurement (5th ed., pp. 277–381). Oxford University Press.
7. Lim, H., Davey, T., & Wells, C. S. (2021). A recursion-based analytical approach to evaluate the performance of MST. Journal of Educational Measurement, 58(2), 154–178. https://doi.org/10.1111/jedm.12276
8. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
9. Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48(2), 233–245. https://doi.org/10.1007/BF02294018
10. Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409
11. MacGregor, D., Yen, S. J., & Yu, X. (2022). Using multistage testing to enhance measurement of an English language proficiency test. Language Assessment Quarterly, 19(1), 54–75. https://doi.org/10.1080/15434303.2021.1988953
12. Park, R., Kim, J., Chung, H., & Dodd, B. G. (2017). The development of MST test information for the prediction of test performances. Educational and Psychological Measurement, 77(4), 570–586. https://doi.org/10.1177/0013164416662960
13. Rudner, L. M. (2000). Computing the expected proportions of misclassified examinees. Practical Assessment, Research, and Evaluation, 7(1), 14. https://doi.org/10.7275/an9m-2035
14. Samejima, F. (1977). A use of the information function in tailored testing. Applied Psychological Measurement, 1(2), 233–247. https://doi.org/10.1177/014662167700100209
15. Stocking, M. L. (1996). An alternative method for scoring adaptive tests. Journal of Educational and Behavioral Statistics, 21(4), 365–389. https://doi.org/10.3102/10769986021004365
16. Yamamoto, K., Shin, H. J., & Khorramdel, L. (2018). Multistage adaptive testing design in international large-scale assessments. Educational Measurement: Issues and Practice, 37(4), 16–27. https://doi.org/10.1111/emip.12226
