Applied Psychological Measurement. 2022 Apr 4;46(3):236–249. doi: 10.1177/01466216221084371

A Comparison of Robust Likelihood Estimators to Mitigate Bias From Rapid Guessing

Joseph A. Rios
PMCID: PMC9073634  PMID: 35528268

Abstract

Rapid guessing (RG) behavior can undermine measurement properties and score-based inferences. To mitigate this potential bias, practitioners have relied on response time information to identify and filter RG responses. However, response times may be unavailable in many testing contexts, such as paper-and-pencil administrations. When this is the case, self-report measures of effort and person-fit statistics have been used. These methods are limited in that inferences concerning motivation and aberrant responding are made at the examinee level. As test takers can engage in a mixture of solution and RG behavior throughout a test administration, there is a need to limit the influence of potential aberrant responses at the item level. This can be done by employing robust estimation procedures. Because these estimators have received limited attention in the RG literature, the objective of this simulation study was to evaluate ability parameter estimation accuracy in the presence of RG by comparing maximum likelihood estimation (MLE) to two robust variants, the bisquare and Huber estimators. Two RG conditions were manipulated: RG percentage (10%, 20%, and 40%) and RG pattern (difficulty-based and changing state). Compared with the MLE procedure, both the bisquare and Huber estimators reduced bias in ability parameter estimates by as much as 94%. Given that the Huber estimator showed smaller standard deviations of error and performed as well as the bisquare approach under most conditions, it is recommended as a promising approach to mitigating bias from RG when response time information is unavailable.

Keywords: rapid guessing, noneffortful responding, low-stakes testing, robust likelihood estimation, item response theory, validity


Noneffortful responding is a test-taking behavior that has the potential to undermine the validity of score-based inferences (Cronbach, 1960). Although there are multiple types of noneffortful responding that are dependent on assessment contexts and item formats (see Meade & Craig, 2012; Wise, 2017), rapid guessing (RG) has recently received increased attention in the literature. RG occurs when an examinee randomly responds to an item without employing full effort. This behavior is typically represented by short response latencies that are reflective of examinees not investing adequate time in reading all components of the item and solving the problem (see Rios & Deng, 2021).

Assuming that an examinee is capable of effortfully responding (i.e., has had an opportunity to learn the content assessed, can understand the language of the assessment, and test speededness is not an issue), the underlying reasoning for RG is often related to low task value and/or low perceived probability of success (Penk & Schipolowski, 2015). 1 The former manifests itself when test-takers believe that their test performance has no personal consequences, are unaware of the consequences, or do not care about the consequences. For such examinees, given the lack of perceived personal benefits, the cost of expending full effort is too great, and thus disengagement with the assessment task occurs. The other reason for the manifestation of RG is that when an examinee perceives that they lack the requisite knowledge, skills, and/or abilities needed to answer a given item, a random response may provide a strategic attempt to increase their probability of a correct answer (Wise, 2017). Across these factors, RG has been documented to occur for a number of testing contexts (e.g., formative, accountability, and international assessments) and populations that range in age and nationality (e.g., Rios & Guo, 2020; Wise, 2017).

Failing to identify and account for RG responses is associated with biased measurement properties and score-based inferences. Concerning the former, prior research has shown that RG can undermine the accuracy of measurement property evaluations, such as item parameters (van Barneveld, 2007), test information (van Barneveld, 2007), measurement invariance (Rios, 2021), and linking coefficients (e.g., Mittelhaëuser et al., 2015). Furthermore, as RG is generally associated with underestimation of examinee ability (e.g., Rios & Soland, 2020; Silm et al., 2020), it has been documented to bias treatment effects (e.g., Osborne & Blanchard, 2011), achievement gains (e.g., Wise & DeMars, 2010), and subgroup comparisons (e.g., Rios, 2021).

These results have prompted the measurement community to call for test users to document test engagement in contexts where it may be a concern (Standard 13.9; American Educational Research Association et al., 2014). However, doing so requires that RG be accurately identified and appropriately accounted for prior to estimating measurement properties and examinee scores. Given that the identification of this construct-irrelevant behavior has typically relied on response times (Wise, 2017), it may be difficult to document RG when log file information is not collected, as is the case in paper-and-pencil assessment contexts. To address this situation, the objective of the present article is to examine the effectiveness of using robust estimation procedures to downweight potential RG responses based solely on the availability of item response data. The sections that follow describe methods for identifying RG behavior in the absence of response times, robust estimation procedures, and the rationale for the current study.

Methods for Identifying RG Behavior

Although response time approaches to identifying and filtering RG have been shown to provide improved parameter estimation accuracy (e.g., Rios & Soland, 2020; Wang & Xu, 2015), they are limited to computer-based assessment contexts. In paper-and-pencil administered tests, two proxies of RG behavior have generally been employed: self-report measures of effort and person-fit statistics/latent class models.

Self-Reported Effort

As low task value can result in reduced examinee engagement, researchers have employed posttest surveys of test-taking effort (e.g., Rios et al., 2014). Although a number of these surveys are readily accessible (e.g., Thelk et al., 2009), they have several disadvantages. To begin with, effortful self-reporting is assumed, even though an examinee may have been disengaged on the assessment administered prior to the survey. Thus, there is the possibility of reporting inaccuracies, such as socially desirable responding (i.e., exaggerating one's effort) and/or underreporting of effort (Jagacinski & Nicholls, 1990).

If accurate reporting can be assumed, this approach typically assesses effort at the global level, which provides a gauge of whether an examinee was generally effortful or noneffortful on the entire test of interest. However, prior research has shown that test-taking effort can change throughout different phases of a test administration (e.g., Wise & Kingsbury, 2016), suggesting that an examinee can display both behaviors on the same test. Thus, global evaluations may not provide a complete picture of an examinee’s testing experience. Taken together, these limitations suggest that employing self-report measures of test-taking effort may be an inadequate proxy of RG, particularly as inferences can only be made on the global as opposed to item response level.

Person-Fit Statistics/Latent Class Models

A second approach, which draws on the larger literature on outlier and residual analysis, is to identify individual examinees with aberrant response patterns via person-fit statistics. One way to accomplish this is to evaluate the likelihood of an examinee’s response pattern occurring under a theoretical measurement model. Response patterns deemed to be unlikely are assumed to be guided by a mechanism other than the construct assessed (Meijer & Sijtsma, 2001). An alternative is to employ latent class models to distinguish between aberrant and non-aberrant responders (e.g., Dayton & Macready, 1980). Similar to self-report measures, an advantage of these approaches is that they do not require the collection of log file information, and thus they can be applied to data collected from paper-and-pencil tests. However, a number of limitations should be noted.

First, this approach is sensitive to a host of aberrant behaviors (e.g., cheating; for a discussion on this topic, see Wise & Kong, 2005). Therefore, it is possible that a responding pattern identified to be aberrant may be unrepresentative of RG. Second, if an aberrant responding pattern is indicative of RG, this approach is limited in that it provides a measure of person rather than response behavior. Thus, at best, one could conclude that an examinee engaged in aberrant responding during the test, but could not answer questions related to how many or for which items the examinee engaged in RG (similar to self-report measures). Third, accuracy rates of these methods have been shown to diminish as the rate of aberrant responding in the sample increases (e.g., Karabatsos, 2003). As a consequence, this approach may be inappropriate for contexts in which RG rates are high. Although there have been some attempts to improve detection rates of these procedures (e.g., Yu & Cheng, 2019), the combined limitations point to why these methods have gained little traction in identifying RG behavior in practice (see Wise & Kong, 2005).

Robust Estimation Procedures

An alternative to these procedures that has received less attention in the RG literature is to utilize robust ability parameter estimation when item parameters are known. Similar to self-report measures and person-fit statistics, these estimators require only item response information, and they provide an added benefit in that they can downweight potential aberrant responding at the item level. This makes it possible to estimate ability for examinees who have engaged in some degree of RG, which stands in stark contrast to other procedures, such as self-report measures and person-fit statistics, that typically listwise delete data for examinees deemed to be unmotivated or aberrant responders. 2 Such an approach can lead to deleting as much as 25% of sample data (Rios et al., 2014). To provide a basis for the robust procedures examined, maximum likelihood estimation (MLE) is described next.

MLE

Suppose that item parameters based on the two-parameter logistic (2PL) model are known for a set of n items. The probability of a correct response from subject j to item i can be expressed as

P_{ji}(\theta) = \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}, \quad (1)

where θj is the ability of examinee j, bi is the item difficulty for item i, and ai is the item discrimination for item i. Assuming local independence, the probability of the vector of responses for the set of n items, PjI , given θj , is

P_{jI} = \prod_{i=1}^{n} P_{ji}^{X_{ji}} \left(1 - P_{ji}\right)^{1 - X_{ji}}, \quad (2)

where Xji is equal to 1 if examinee j’s response to item i is correct, and 0 otherwise. The maximum likelihood estimate of θj is the value that maximizes PjI given the item responses observed for examinee j.
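To make equations (1) and (2) concrete, the following is a minimal R sketch of 2PL response probabilities and ML ability estimation with known item parameters. It is illustrative only: the object names and item parameters are made up, and a bounded one-dimensional search (optimize()) stands in for the Newton-Raphson algorithm used in the study.

# Two-parameter logistic probability of a correct response (equation 1)
p_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

# Log of the response-pattern probability in equation (2) for ability theta
loglik_2pl <- function(theta, x, a, b) {
  p <- p_2pl(theta, a, b)
  sum(x * log(p) + (1 - x) * log(1 - p))
}

# ML estimate of theta for one examinee with known a and b; a bounded search
# is used here for brevity in place of Newton-Raphson
mle_theta <- function(x, a, b, lower = -4, upper = 4) {
  optimize(loglik_2pl, c(lower, upper), x = x, a = a, b = b,
           maximum = TRUE)$maximum
}

# Illustrative use with made-up item parameters and a short response vector
a <- c(1.2, 0.8, 1.0, 1.5, 0.9)
b <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
x <- c(1, 1, 1, 0, 0)
mle_theta(x, a, b)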

A key assumption of MLE is that the underlying model accurately reflects examinees’ behavior. If this assumption is violated, estimated abilities and their sampling variances may be biased (Schuster & Yuan, 2011). Early work conducted by Waller (1974) illustrated how ML estimates of ability are sensitive to response disturbances, such as noneffortful responding. Although there are models that attempt to account for response disturbances (e.g., the three-parameter logistic [3PL] model), they generally assume that the nature and extent of disturbances are equivalent across examinees. Given that these responses are likely to be inconsistent among examinees (see Wise, 2017), one approach to dealing with this issue is to modify the maximum likelihood equation by downweighting potential aberrant responses.

Bisquare Estimation

To mitigate bias introduced by response disturbances, Waller (1974) proposed trimming inconsistent or uninformative responses from an examinee’s vector of responses. This is done via the following modified likelihood equation

\sum_{i=1}^{n} W_{ji}\left(X_{ji} - P_{ji}\right) a_i = 0, \quad (3)

where

W_{ji} = \begin{cases} 1, & |r_{ji}| \le 2 \\ 0, & \text{otherwise,} \end{cases} \quad (4)

r_{ji} = a_i\left(b_i - \tilde{\theta}_j\right), \quad (5)

and θ̃j is a provisional maximum likelihood ability estimate. In these equations, Waller defined inclusion based on item i’s weighted difficulty being no greater than two logits above or below examinee j’s ML ability estimate. If greater, item i is excluded from the final ML ability estimate for examinee j; otherwise, it is included.

The argument for this definition of item inclusion is that items that are either very difficult or very easy for examinee j (defined as more than two logits above or below the examinee’s ML ability estimate) will provide minimal information about their underlying ability. Thus, excluding such a response, even if it reflects effortful responding, will have minimal impact on ability estimation, as this would amount to “taking away ‘information’ that really was not there” (Mosteller & Tukey, 1977; p. 348). However, if a low ability examinee engages in RG on this item and answers correctly by chance, their ability estimate will likely be artificially inflated. Therefore, limiting the influence of such a response may improve ability estimation accuracy. It should be noted, though, that this may come at the cost of increased ability estimate variance, due to the reduction of item information in estimation.

Mislevy and Bock (1982) proposed using bisquare estimates of latent ability, which modifies Waller’s (1974) solution by including a continuous weighting function with values ranging from zero to one. This choice was based on the belief that each observation could be utilized in proportion to its apparent value as opposed to dichotomously deciding on the inclusion or exclusion of an observation (Mislevy & Bock, 1982). This is accomplished by defining Wji for item i in equation (3) as

W_{ji} = \begin{cases} \left[1 - \left(r_{ji}/C\right)^2\right]^2, & \left|r_{ji}/C\right| \le 1 \\ 0, & \text{otherwise,} \end{cases} \quad (6)

where C is equal to a constant (also referred to as a tuning constant), which determines the degree of trimming that will occur. Specifically, equation (6) stipulates that as the difference between an item’s weighted difficulty and examinee j’s ability departs from zero, the contribution of this item response in ability estimation will decrease.

As is shown in equation (6), the degree of weighting is driven by the tuning constant, with larger values leading to less downweighting. To illustrate how the tuning constant value impacts weighting, let us examine a scenario in which an examinee with θ = 0 responds to an item with a = 1 and b = 3. If C ≤ 3, Wji = 0; however, if, for example, C = 4, Wji = 0.19. Using one set of tuning constants can lead to eliminating the observation from examinee j’s ability estimation, while the use of a different constant will allow for inclusion (though with reduced weighting) of the observation in estimating examinee j’s ability.
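The following R sketch reproduces the weighting logic of equations (5) and (6) and the worked example above; the function names are illustrative and not taken from any published syntax.

# Residual of item i given a provisional ability estimate (equation 5)
residual_ji <- function(theta, a, b) a * (b - theta)

# Bisquare (biweight) weight with tuning constant C (equation 6)
w_bisquare <- function(r, C) {
  ifelse(abs(r / C) <= 1, (1 - (r / C)^2)^2, 0)
}

# Worked example from the text: theta = 0, a = 1, b = 3, so r = 3
r <- residual_ji(theta = 0, a = 1, b = 3)
w_bisquare(r, C = 3)   # 0: the response is dropped from estimation
w_bisquare(r, C = 4)   # approximately 0.19: retained with reduced weight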

Huber Estimation

One issue with the bisquare estimation procedure is that it fails to converge when all item responses are significantly downweighted (see Schuster & Yuan, 2011). To address this issue, Schuster and Yuan (2011) proposed a modification to the bisquare scheme by utilizing Huber weights. In this approach, weights are defined for the 2PL model as

W_{ji} = \begin{cases} 1, & |r_{ji}| \le C \\ C/|r_{ji}|, & \text{otherwise.} \end{cases} \quad (7)

An attractive characteristic of the Huber approach is that no downweighting occurs unless the residual, rji, exceeds the tuning constant (C). In contrast, bisquare weights automatically downweight responses once the weighted item difficulty differs from an examinee’s ability estimate by more than zero. To highlight this difference, let us return to our example (θ = 0 responding to an item with a = 1 and b = 3) and assume that C is equal to 3. As noted previously, in this scenario, Wji = 0 for the bisquare approach, while for the Huber estimator, Wji = 1. Thus, under the bisquare weighting scheme, the item response for this examinee will not contribute to their ability estimation, whereas under the Huber weight, it will contribute fully.
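A companion sketch of the Huber weights in equation (7), again with illustrative names, shows how the two schemes treat the same residual differently.

# Huber weight with tuning constant C (equation 7): full weight until the
# absolute residual exceeds C, after which the weight decays as C / |r|
w_huber <- function(r, C) {
  ifelse(abs(r) <= C, 1, C / abs(r))
}

# Same example as above (theta = 0, a = 1, b = 3, so r = 3) with C = 3:
# the bisquare weight is 0, while the Huber weight is still 1
w_huber(3, C = 3)   # 1
w_huber(6, C = 3)   # 0.5: downweighted only once |r| exceeds C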

Schuster and Yuan (2011) conducted a simulation analysis to compare bias and root mean square error (RMSE) values in ability estimates between ML, bisquare, and Huber estimation procedures. Data were generated for a hypothetical test of 45 items administered to simulees at three ability levels (−2, −1, and 0). Two forms of response disturbances were manipulated based on random guessing and transcription errors. Results from this simulation supported previous findings by Mislevy and Bock (1982), which showed improvements in mean bias and RMSE for the bisquare approach compared to MLE across both response disturbance types (see also Meijer & Nering, 1997). Similarly, Huber estimation led to more accurate ability estimates when contrasted with MLE; however, of the three procedures, bias was smallest for the bisquare approach, while reduced sampling variance and non-convergence rates were observed for the Huber estimator. Taken together, these findings suggest that both the bisquare and Huber procedures provide more robust ability estimates in the presence of score pattern distortions.

Study Rationale

Although these procedures show some promise for improving ability estimation when aberrant responding occurs, there is minimal research investigating their utility in RG contexts. The need to investigate robust ability estimation procedures in this context is considerable, as these procedures offer three advantages over competing approaches. First, they do not require response times, which makes them attractive in paper-and-pencil testing situations. Second, they allow responses to be downweighted at the item level. Third, as prior research has demonstrated that bias due to RG is greatest when weighted item difficulties significantly depart from examinees’ ability (e.g., Rios et al., 2017; Rios & Soland, 2020), focusing on downweighting the responses that could produce the largest score distortions may provide the greatest benefit.

Given that robust estimation procedures provide a sensible solution to filtering RG responses when response times are unavailable, the objective of this simulation study is to examine ability parameter estimation accuracy across ML, bisquare, and Huber estimators under differing RG patterns and percentages. Specifically, this analysis addresses the following research questions:

  • 1. When does the bisquare estimator fail to converge in the presence of RG and what is the extent of non-convergence rates?

  • 2. Under what RG patterns and percentages do the bisquare and Huber estimators outperform the MLE procedure in terms of parameter estimate bias and RMSE?

  • 3. In comparing the bisquare and Huber estimators, which procedure provides the most accurate ability parameter estimates and lowest standard deviations of error?

Findings from these analyses will inform practitioners about the utility of employing robust estimation procedures in mitigating ability parameter estimate bias caused by RG.

Method

Data Generation

Data were generated for a context in which simulees with discrete ability levels engaged in different percentages and patterns of RG. This first required generating effortful response probabilities and then substituting a certain percentage of these probabilities with the chance rate (.25; assuming that all items were multiple-choice with four response options). The aforementioned process is described in greater detail below.

Generation of Effortful Response Probabilities

Effortful response probabilities for 30 items were generated based on the unidimensional 2PL model (see equation (1)). 3 Generating ability parameters were fixed at values of −2, −1, 0, 1, and 2, while item parameters were sampled from an operational administration of the NAEP math assessment (for a full list of item parameters, see Supplemental Appendix A of the supplemental file). The mean discrimination and difficulty parameters for these items were 1.08 (SD = 0.39) and 0.17 (SD = 0.89), respectively.

Substitution of RG Response Probabilities

To reflect RG in the simulation, effortful response probabilities were replaced with the chance rate. This was done by manipulating two factors: (a) RG percentage (10%, 20%, and 40%); and (b) RG pattern (difficulty-based and changing state). These two factors were fully crossed to produce six conditions, with each condition replicated 10,000 times for every ability level.

RG Percentage

The percentage of RG responses in the data matrix was manipulated to vary across three levels: 10%, 20%, and 40%. These levels reflect RG rates observed for individual examinees in applied analyses (e.g., Rios, 2021).

RG Pattern

Given that research has shown that RG is associated with perceived low probability of success and low task value, two RG patterns were manipulated. In the first, referred to as difficulty-based RG, simulees engaged in RG on the items for which they had the lowest probability of success. For each simulee, this was accomplished by ordering items by their true probability of success and replacing the true probabilities of the most difficult items with the chance rate, with the number of items dictated by the RG percentage.

The second RG pattern reflects the tendency for examinees’ task value to wane as the test progresses until they hit a point of continual low effort (i.e., once examinees become disengaged, they will not provide effortful responses for the remainder of the test; Wise & Kingsbury, 2016). This RG pattern, referred to as changing state RG, was generated by taking the number of known RG responses (based on the RG percentage) and replacing the true probabilities with the chance rate for the final items on the test. For instance, if a simulee engaged in RG for 10% of items, their true probabilities would be replaced with the chance rate for the last three of the 30 items on the assessment (akin to Yamamoto’s [1989] Hybrid model).

Model Estimation

Upon substituting chance probabilities, both effortful and RG item response probabilities were combined and compared to a random number sampled from a uniform distribution ranging from 0 to 1. If the random number was less than a given item response probability, a correct response was assigned; otherwise, the response was treated as incorrect. This led to the creation of a 10,000 x 30 item response matrix for every discrete ability level, which was used to estimate simulee ability based on the procedures noted below.
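As a rough illustration of the data-generation steps described above, the R sketch below simulates one simulee’s responses under both RG patterns. It reuses p_2pl() from the earlier sketch, draws item parameters from arbitrary distributions rather than the NAEP pool used in the study, and all object names are illustrative.

set.seed(1)

n_items <- 30
chance  <- 0.25   # four-option multiple choice

# Illustrative item parameters (the study sampled these from a NAEP item pool)
a <- rlnorm(n_items, meanlog = 0, sdlog = 0.3)
b <- rnorm(n_items, mean = 0.17, sd = 0.89)

# Effortful response probabilities for one simulee
theta  <- -1
p_true <- p_2pl(theta, a, b)

rg_pct <- 0.20
n_rg   <- round(rg_pct * n_items)

# Difficulty-based RG: replace the probabilities of the items with the lowest
# true probability of success with the chance rate
p_diff <- p_true
p_diff[order(p_true)[seq_len(n_rg)]] <- chance

# Changing-state RG: replace the probabilities for the last n_rg items
p_state <- p_true
p_state[(n_items - n_rg + 1):n_items] <- chance

# Convert probabilities to 0/1 responses via a uniform draw, as described above
x_diff  <- as.integer(runif(n_items) < p_diff)
x_state <- as.integer(runif(n_items) < p_state)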

Given known item difficulty and discrimination parameters, simulee ability was estimated separately with the ML, bisquare, and Huber procedures. Similar to Schuster and Yuan (2011), tuning constants of 4 and 1 were used for the latter two estimators, respectively. 4 Across procedures, the Newton-Raphson algorithm was used to locate the maximum of the likelihood function, with the starting value and convergence criterion set at 0 and 0.0001, respectively. The maximum number of iterations across estimation procedures was 100. Extreme response patterns consisting of all correct or all incorrect item scores were dropped from the analysis, since MLE and its variants do not work with these response strings. R syntax for all three estimators was adopted from Schuster and Yuan (2011). 5
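The sketch below outlines a weighted Newton-Raphson scheme of the kind described above, assuming the weight functions and simulated data from the earlier sketches are in the workspace. It follows the settings stated in the study (starting value 0, convergence criterion 0.0001, maximum of 100 iterations), but it is not the Schuster and Yuan (2011) syntax; in particular, holding the weights fixed within each Newton step is a simplification.

# Weighted Newton-Raphson ability estimation (equation 3), assuming known item
# parameters. weight_fun(r, C) is w_bisquare or w_huber from the earlier
# sketches; a function that always returns 1 gives ordinary MLE. Weights are
# recomputed from the provisional theta at each iteration and treated as fixed
# when forming the Newton step.
robust_theta <- function(x, a, b, weight_fun, C,
                         start = 0, tol = 1e-4, max_iter = 100) {
  theta <- start
  for (iter in seq_len(max_iter)) {
    p <- p_2pl(theta, a, b)
    w <- weight_fun(a * (b - theta), C)    # residuals, equation (5)
    score <- sum(w * a * (x - p))          # weighted score, equation (3)
    info  <- sum(w * a^2 * p * (1 - p))    # weighted information
    if (info <= 0) return(NA_real_)        # all responses downweighted to zero
    step  <- score / info
    theta <- theta + step
    if (abs(step) < tol) return(theta)
  }
  NA_real_                                 # flag non-convergence
}

# Tuning constants used in the study: 4 for bisquare, 1 for Huber
theta_bs <- robust_theta(x_diff, a, b, w_bisquare, C = 4)
theta_hu <- robust_theta(x_diff, a, b, w_huber,    C = 1)
theta_ml <- robust_theta(x_diff, a, b, function(r, C) rep(1, length(r)), C = NA)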

Outcome Variables

Non-convergence rates were calculated as the percentage of replications that failed to meet the convergence criterion (0.0001) within the maximum number of iterations (100). Ability estimation accuracy was evaluated based on average bias and RMSE across replications, with the latter measure equal to the standard deviation of the ability estimate errors. Bias was calculated as follows

\text{Bias} = \frac{\sum_{i=1}^{n}\left(\hat{\theta}_j - \theta\right)}{n}, \quad (8)

where θ̂j is the estimated parameter for simulee j, θ is the known parameter, and n is the number of replications. RMSE was calculated as

\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{\theta}_j - \theta\right)^2}{n - 1}}. \quad (9)

Both bias and RMSE were calculated separately for each discrete theta value.
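A short R helper, with illustrative names, computing average bias (equation (8)) and RMSE (equation (9)) across replications for a single true theta value; non-converged replications are assumed to be stored as NA and are dropped before the summaries are taken.

# Bias and RMSE across replications for one true theta value
bias_rmse <- function(theta_hat, theta_true) {
  err <- theta_hat[!is.na(theta_hat)] - theta_true   # drop non-converged runs
  c(bias = mean(err),
    rmse = sqrt(sum(err^2) / (length(err) - 1)))
}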

Results

When Does the Bisquare Estimator Fail to Converge?

Across conditions, the bisquare estimator failed to converge between 0% and 17% of replications. As is shown in Figure 1, the rate of non-convergence was dependent on RG pattern and simulee theta level. Concerning the former, the bisquare estimator failed to converge for more replications under changing state RG than difficulty-based RG. This likely occurred as simulees tended to engage in RG for items in which they possessed a high true probability of success, leading to large item residuals. As the rate of RG responses increased, a greater percentage of responses were downweighted, and thus, less information was available to provide a stable ability estimate. This was particularly the case at the lowest ability level, θ = −2, in which non-convergence rates were equal to approximately 17% when the percentage of RG responses exceeded 10%. Across conditions, neither the ML nor Huber procedure was susceptible to convergence issues.

Figure 1. Percentage of Non-converged Replications for the Bisquare Estimator.

Comparison of Ability Estimation Accuracy Across ML, Bisquare, and Huber Procedures

Ability estimation error distributions are displayed separately by RG pattern in Figures 2 and 3, while Table 1 presents average RMSE values for each estimator and condition. As can be seen, across RG patterns and estimation procedures, nearly identical degrees of bias and RMSE were obtained for theta values of 0 and 1. This suggests that RG introduced minimal bias at these ability levels, regardless of the RG percentage examined. However, larger differences in parameter estimate bias were observed for more extreme values of simulee ability. For instance, when simulee ability was equal to −1 and −2, large improvements in estimation accuracy were noted for the bisquare and Huber procedures under difficulty-based RG, particularly as the percentage of RG grew. As an example, in contrast to MLE, the bisquare and Huber estimators improved the accuracy of ability estimates by 81% and 49%, respectively (based on relative percentage change), for the condition in which theta was equal to −2 and RG was present for 40% of responses.

Figure 2. Comparison of Ability Bias by Estimation Procedure for Difficulty-based Rapid Guessing. Note. Vertical lines represent median bias values for a given distribution.

Figure 3. Comparison of Ability Bias by Estimation Procedure for Changing State Rapid Guessing. Note. Vertical lines represent median bias values for a given distribution.

Table 1.

RMSE Values by Estimation Procedure, RG Pattern, and RG Percentage.

                        Rapid Guessing Pattern
              Difficulty-Based              Changing State
Theta  RG %   ML     BS     HU           ML     BS     HU
  2     10   0.50   0.49   0.48         0.62   0.68   0.58
        20   0.54   0.58   0.58         0.80   0.76   0.74
        40   0.92   0.98   0.99         1.35   1.33   1.32
  1     10   0.38   0.38   0.38         0.42   0.44   0.43
        20   0.39   0.39   0.39         0.48   0.48   0.48
        40   0.53   0.53   0.54         0.73   0.73   0.73
  0     10   0.37   0.38   0.38         0.38   0.41   0.40
        20   0.37   0.38   0.39         0.40   0.44   0.43
        40   0.37   0.39   0.39         0.43   0.49   0.48
 −1     10   0.48   0.55   0.49         0.48   0.55   0.51
        20   0.52   0.57   0.50         0.49   0.62   0.57
        40   0.58   0.63   0.55         0.49   0.64   0.55
 −2     10   0.72   0.75   0.75         0.72   0.79   0.78
        20   0.80   0.78   0.70         0.74   0.96   0.92
        40   1.02   0.89   0.76         0.84   0.95   0.78

Note. RG = rapid guessing; ML = maximum likelihood estimation; BS = bisquare estimation; HU = Huber estimation.

For high-ability simulees, negative bias and RMSE in theta estimates increased slightly for the robust procedures. As an example, compared to MLE, bias increased by approximately 9% for both the bisquare and Huber estimators under extreme conditions (θ = 2 and 40% RG), while the RMSE values for these procedures also increased. This result reflects high-ability simulees engaging in RG on easy items, which led to underestimation of their ability and greater standard deviations of error, given the reduced item response information contributing to ability estimation. Across conditions under difficulty-based RG, the bisquare estimator performed equivalently to or better than the Huber procedure, with improvements noted under extreme conditions (see Figure 2).

Turning to changing state RG, the robust procedures consistently possessed lower median bias values for low-ability simulees when compared to MLE (Figure 3). For instance, under 40% RG responses, the bisquare estimator improved ability estimates for simulees with a theta value of −2 by as much as 94%, while the percentage improvement for the Huber estimator was 76%. Although the bisquare estimator outperformed the Huber estimator in this context in terms of bias, the latter had a smaller RMSE value, indicating greater precision in estimation. This is likely due to the bisquare’s more aggressive downweighting scheme, which generally left less item response information for ability estimation.

Unlike difficulty-based RG, the robust procedures provided improved estimation for high-ability simulees with a theta value equal to 2. This was particularly apparent for low rates of RG (10%), for which the percentage improvement for the bisquare and Huber estimators was 51% and 31%, respectively, compared to MLE. Across most contexts for this RG pattern, the bisquare estimator performed comparably to the Huber estimator, with the latter providing slightly improved precision (Figure 3). An applied example highlighting the differences between these estimation procedures is provided in Appendix B of the supplemental file.

Discussion

The objective of this simulation study was to evaluate ability parameter estimation accuracy in the presence of RG by comparing MLE to two weighted variants, the bisquare and Huber estimators. Contrasted to the MLE procedure, results demonstrated that both the bisquare and Huber estimators reduced bias in ability parameter estimates in the presence of RG by as much as 94%. As expected, the greatest reductions were observed as RG percentages increased. However, the bisquare estimator was found to be susceptible to non-convergence issues when a high percentage of item responses possessed large residuals, which is likely to occur for simulees at the extreme ends of the ability continuum. This finding supports prior literature (Schuster & Yuan, 2011), and suggests that the bisquare approach may be problematic in estimating ability for some examinees. In contrast, the Huber estimator showed no convergence issues, smaller RMSE, and similar bias compared to the bisquare approach under most conditions.

Recommendations

In deciding whether to employ the bisquare or Huber estimator, a number of factors must be considered in light of a given score purpose. First, if absolute ability estimation accuracy is of greatest need for examinees with very low (θ = −2) theta values who engage in a high percentage of RG (e.g., 40%), the bisquare estimator may be the best option, given that it performs better under extreme conditions. However, in such contexts, this estimation procedure is susceptible to elevated rates of non-convergence. One solution to this issue would be to increase the tuning constant; however, this would allow more item responses with large residuals to contribute to ability estimation. Such a consequence may largely undermine efforts to improve ability estimation accuracy.

However, if score users are concerned with providing ability estimates for examinees toward the middle of a normal distribution (i.e., one standard deviation above and below the mean), they may benefit from utilizing the Huber estimator for two reasons. First, under this context, the Huber procedure provides nearly identical ability estimation accuracy as the bisquare. Second, ability estimates from the Huber tend to have smaller standard deviations of error compared to the bisquare, particularly as the rate of RG increases. This effect may be of utility when making classification decisions in some practical contexts (see Rios & Soland, 2020).

Conclusion

Regardless of score use, these procedures are most appropriate when response time information is unavailable as a proxy of RG. Compared to competing methods, such as self-report measures of effort and person-fit statistics, both the bisquare and Huber estimators avoid significant loss of examinee data by allowing for ability estimation of examinees who engage in RG. This is of particular importance given that many examinees may employ both solution and RG behavior within the same test (see Wise & Kingsbury, 2016). That said, it is important that practitioners establish guidelines on the extent of RG that is allowable to provide an examinee with an individual score (Standard 13.9; American Educational Research Association et al., 2014). Although both bisquare and Huber estimators can improve ability estimates, they cannot fully remove all error introduced by RG.

Supplemental Material

sj-pdf-1-apm-10.1177_01466216221084371 – Supplemental Material for A Comparison of Robust Likelihood Estimators to Mitigate Bias From Rapid Guessing by Joseph A. Rios in Applied Psychological Measurement

Acknowledgments

The author would like to thank Samuel Ihlenfeldt and Jiayi Deng from the University of Minnesota for their feedback on earlier drafts of the manuscript.

Notes

1.

The assumption underlying RG is that it is uninformative. However, as noted by Wise (2017), when examinees engage in RG because they do not have the full capability to engage in an item or task, RG can be informative to better understanding examinee knowledge or lack thereof.

2.

Individual testing programs must decide on the level that they believe is acceptable to provide an individual ability estimate.

3.

Although data were generated for a multiple-choice test, the 3PL model was not employed for data generation or calibration purposes given that the guessing rate for an item is likely not equivalent across all examinees in practice. Thus, estimating item parameters based on the 2PL model with robust likelihood estimators, which accounts for potential aberrant responses, may be a more appropriate solution (Mislevy & Bock, 1982; Schuster & Yuan, 2011).

4.

Pilot analyses demonstrated increased non-convergence rates for the bisquare estimator with tuning constants of 2 and 3. Furthermore, greater bias was noted for the Huber estimator with tuning constants of 2 and 3. The results for these constants are available upon request from the author.

5.

The author would like to thank Schuster and Yuan (2011) for making their R syntax available.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Joseph A. Rios https://orcid.org/0000-0002-1004-9946

Supplemental Material: Supplemental material for this article is available online.

References

  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. American Educational Research Association.
  2. Cronbach L. J. (1960). Essentials of psychological testing (2nd ed.). Harper & Row.
  3. Dayton C. M., Macready G. B. (1980). A scaling model with response errors and intrinsically unscalable respondents. Psychometrika, 45(3), 343–356. 10.1007/bf02293908
  4. Jagacinski C. M., Nicholls J. G. (1990). Reducing effort to protect perceived ability: “They’d do it but I wouldn’t.” Journal of Educational Psychology, 82(1), 15–21. 10.1037/0022-0663.82.1.15
  5. Karabatsos G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4), 277–298. 10.1207/s15324818ame1604_2
  6. Meade A. W., Craig S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437–455. 10.1037/a0028085
  7. Meijer R. R., Nering M. L. (1997). Trait level estimation for nonfitting response vectors. Applied Psychological Measurement, 21(4), 321–336. 10.1177/01466216970214003
  8. Meijer R. R., Sijtsma K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107–135. 10.1177/01466210122031957
  9. Mislevy R. J., Bock R. D. (1982). Biweight estimates of latent ability. Educational and Psychological Measurement, 42(3), 725–737. 10.1177/001316448204200302
  10. Mittelhaëuser M. A., Béguin A. A., Sijtsma K. (2015). The effect of differential motivation on IRT linking. Journal of Educational Measurement, 52(3), 339–358. 10.1111/jedm.12080
  11. Mosteller F., Tukey J. W. (1977). Data analysis and regression: A second course in statistics. Addison-Wesley.
  12. Osborne J. W., Blanchard M. R. (2011). Random responding from participants is a threat to the validity of social science research results. Frontiers in Psychology, 1, 220. 10.3389/fpsyg.2010.00220
  13. Penk C., Schipolowski S. (2015). Is it all about value? Bringing back the expectancy component to the assessment of test-taking motivation. Learning and Individual Differences, 42(1), 27–35. 10.1016/j.lindif.2015.08.002
  14. Rios J. A. (2021). Is differential noneffortful responding associated with type I error in measurement invariance testing? Educational and Psychological Measurement, 81(5), 957–979. 10.1177/0013164421990429
  15. Rios J. A., Deng J. (2021). Does the choice of response time threshold procedure substantially affect inferences concerning the identification and exclusion of rapid guessing responses? A meta-analysis. Large-scale Assessments in Education, 9(18), 1–25. 10.1186/s40536-021-00110-8
  16. Rios J. A., Guo H. (2020). Can culture be a salient predictor of test-taking engagement? An analysis of differential noneffortful responding on an international college-level assessment of critical thinking. Applied Measurement in Education, 33(4), 263–279. 10.1080/08957347.2020.1789141
  17. Rios J. A., Guo H., Mao L., Liu O. L. (2017). Evaluating the impact of noneffortful responses on aggregated scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104. 10.1080/15305058.2016.1231193
  18. Rios J. A., Liu O. L., Bridgeman B. (2014). Identifying low-effort examinees on student learning outcomes assessment: A comparison of two approaches. New Directions for Institutional Research, 2014(161), 69–82. 10.1002/ir.20068
  19. Rios J. A., Soland J. (2020). Parameter estimation accuracy of the Effort-Moderated IRT model under multiple assumption violations. Educational and Psychological Measurement, 81(3), 569–594. 10.1177/0013164420949896
  20. Schuster C., Yuan K. H. (2011). Robust estimation of latent ability in item response models. Journal of Educational and Behavioral Statistics, 36(6), 720–735. 10.3102/1076998610396890
  21. Silm G., Pedaste M., Täht K. (2020). The relationship between performance and test-taking effort when measured with self-report or time-based instruments: A meta-analytic review. Educational Research Review, 31(2), 100335. 10.1016/j.edurev.2020.100335
  22. Thelk A. D., Sundre D. L., Horst S. J., Finney S. J. (2009). Motivation matters: Using the student opinion scale to make valid inferences about student performance. The Journal of General Education, 58(3), 129–151. 10.1353/jge.0.0047
  23. van Barneveld C. (2007). The effect of examinee motivation on test construction within an IRT framework. Applied Psychological Measurement, 31(1), 31–46. 10.1177/0146621606286206
  24. Waller M. I. (1974). Removing the effects of random guessing from latent trait ability estimates (Research Bulletin #74-32). Educational Testing Service.
  25. Wang C., Xu G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456–477. 10.1111/bmsp.12054
  26. Wise S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. 10.1111/emip.12165
  27. Wise S. L., DeMars C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15(1), 27–41. 10.1080/10627191003673216
  28. Wise S. L., Kingsbury G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105. 10.1111/jedm.12102
  29. Wise S. L., Kong X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183. 10.1207/s15324818ame1802_2
  30. Yamamoto K. Y. (1989). HYBRID model of IRT and latent class models (ETS research report #RR89-41). Educational Testing Service.
  31. Yu X., Cheng Y. (2019). A change-point analysis procedure based on weighted residuals to detect back random responding. Psychological Methods, 24(5), 658–674. 10.1037/met0000212


