Applied Psychological Measurement. 2023 Aug 3;47(5-6):365–385. doi: 10.1177/01466216231194358

Comparing Person-Fit and Traditional Indices Across Careless Response Patterns in Surveys

Eli A. Jones, Stefanie A. Wind, Chia-Lin Tsai, Yuan Ge
PMCID: PMC10552731  PMID: 37810542

Abstract

Methods to identify carelessness in survey research can be valuable tools in reducing bias during survey development, validation, and use. Because carelessness may take multiple forms, researchers typically use multiple indices when identifying carelessness. In the current study, we extend the literature on careless response identification by examining the usefulness of three item-response theory-based person-fit indices for both random and overconsistent careless response identification: infit MSE, outfit MSE, and the polytomous l z statistic. We compared these statistics with traditional careless response indices using both empirical data and simulated data. The empirical data included 2,049 high school student surveys of teaching effectiveness from the Network for Educator Effectiveness. In the simulated data, we manipulated type of carelessness (random response or overconsistency) and percent of carelessness present (0%, 5%, 10%, 20%). Results suggest that infit and outfit MSE and the l z statistic may provide complementary information to traditional indices such as LongString, Mahalanobis Distance, Validity Items, and Completion Time. Receiver operating characteristic curves suggested that the person-fit indices showed good sensitivity and specificity for classifying both over-consistent and under-consistent careless patterns, thus functioning in a bidirectional manner. Carelessness classifications based on low fit values correlated with carelessness classifications from LongString and completion time, and classifications based on high fit values correlated with classifications from Mahalanobis Distance. We consider implications for research and practice.

Keywords: careless responding, classification, person fit, Rasch model, surveys


Careless responding is a notable source of error in instruments that rely on respondent self-report, such as surveys. The central feature of careless responding is content nonresponsivity, where respondents do not show sufficient attention to the item content when responding (see Meade & Craig, 2012). While a number of mechanisms may result in carelessness (e.g., a lack of sufficient motivation, a lack of comprehension, a lack of a sense of responsibility, or a lack of effort; Bowling et al., 2021a; Curran, 2016; Godinho et al., 2016; Huang et al., 2012; Ward & Meade, 2018), carelessness ultimately reduces the quality of participant responses. Such responses are therefore substandard indicators of the “true” construct and may result in a reduction of quality and usability (Clark et al., 2003). When used for assessment purposes, data that include careless responses may result in undesirable outcomes such as poor item calibration, biased parameter estimates (Huang et al., 2015), and reduced scale reliability (Patton et al., 2019). Ultimately, careless responses reduce the ability of survey instruments to produce accurate estimates of the intended abilities or traits.

Most surveys are likely to include some level of careless responses due to the self-report nature of the method, although the amount of carelessness will vary from survey to survey. Careless response rates can range from the low single digits to upwards of 30% or 40%, depending on survey instrument characteristics (Curran, 2016; Meade & Craig, 2012), with a median of approximately 10% (Curran et al., 2010; Schroeders et al., 2021). Respondents may be more likely to give careless responses if they lack interest, if the survey length is excessive, if they are distracted, or if the survey lacks personal interaction (e.g., as in internet-based surveys; Meade & Craig, 2012). In low-stakes settings such as surveys, carelessness may be particularly troublesome (Arthur et al., 2021).

Because of the prevalence of such responses and the potentially negative consequences they may have on scale validation and use, careless response (CR) indices are frequently applied to identify problematic responses in measurement data. As careless responses may be classified as a source of either random or nonrandom systematic error (Huang et al., 2012; Meade & Craig, 2012), different CR indices may be needed to detect differing carelessness patterns. For example, participant responses may follow a repeating pattern (e.g., 1,2,1,2,1,2; Schroeders et al., 2021), may have no variation whatsoever (e.g., 1,1,1,1,1), or may not follow an identifiable pattern at all (Ulitzsch et al., 2021). No single CR index is useful for identifying every pattern of carelessness. Consistency indices, for example, are useful for detecting responses with no variation, whereas outlier indices may be useful for detecting highly random patterns (Meade & Craig, 2012). As the usefulness of indices varies depending on the careless response mechanism, understanding their strengths and limitations may help to improve their application.

Past comparisons of CR indices support the notion that individual indices may not show equivalent accuracy at detecting careless responses for all response patterns (see Goldammer et al., 2020). Additionally, past study results have not always agreed on the relative quality of the studied indices for carelessness identification. Some discrepancies stem from the different methodologies used. For example, while some studies employ empirical data, others have used simulated data to estimate the accuracy of various indices (Beck et al., 2019; Dupuis et al., 2019; Goldammer et al., 2020; Karabatsos, 2003; Schneider et al., 2018). Studies have been applied to differing data forms, such as dichotomous (Karabatsos, 2003) or polytomous (Beck et al., 2019) data. Past studies have also varied in how aberrant responses are conceptualized. For example, some studies have used only random response patterns to represent carelessness (Beck et al., 2019), have categorized carelessness using validity items (Schneider et al., 2018), or have modeled highly specific aberrant response mechanisms such as social desirability (Nazari et al., 2022).

Comparing Types of Careless Response Indices

In the current study, we focus on six specific CR indices to explore the consistency of responses across various carelessness conditions. We have selected indices to represent different types of careless response. In the following sections, we introduce these indices, as well as summarize previous evidence of their usefulness.

Consistency Indices

Consistency indices evaluate the degree to which responses follow a specific pattern relating to the underlying construct and relationship between items. Indices may flag responses as careless when they fall below a certain level of consistency across items with similar content, such as with even-odd consistency or psychometric synonyms/antonyms. Consistency indices may also categorize unexpected responses as mismatches when responses to easy items are paired with responses from more difficult items, such as with polytomous Guttman errors (Meijer, 1996). For other consistency indices, carelessness results in responses that are too consistent, such as the even-odd index or the LongString indicator (Curran, 2016).

Some consistency indices have shown good accuracy in detecting careless responses. However, the LongString indicator, which measures over-consistency by identifying the longest string of identical responses for each participant across the survey, may not be as accurate as other consistency indices (Goldammer et al., 2020), although others have found it to be quite reliable (Curran, 2016). Some researchers calculate this indicator for the entire survey (Curran, 2016), others do so for subsets of questions (Meade & Craig, 2012), and still others for survey halves (see Huang et al., 2012). As with all indices, researchers have tended to err on the side of caution when determining what string length might indicate carelessness. Curran (2016) suggests that a starting point for identifying the length of a problematic string is one that is greater than half the length of the total scale being considered. That is, for a scale of 32 items, a LongString value of 17 or more would indicate careless responding. Because they are a function of the entire scale, LongString values may be less sensitive in shorter scales or when the number of scale points is not consistent throughout the survey (Bowling et al., 2021b).
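To make the LongString computation concrete, the following is a minimal sketch in base R (not the authors' code); the responses matrix, sample size, and half-scale cutoff are illustrative assumptions. The careless package for R, used later in this article, provides an equivalent longstring() function.

```r
# Minimal sketch (base R): LongString = longest run of identical responses per
# respondent. 'responses' is an illustrative respondent-by-item rating matrix.
set.seed(1)
responses <- matrix(sample(0:3, 200 * 32, replace = TRUE), nrow = 200, ncol = 32)

long_string <- apply(responses, 1, function(x) max(rle(as.vector(x))$lengths))

# Curran's (2016) cautious heuristic: flag runs longer than half the scale length
flagged <- long_string > ncol(responses) / 2
```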

Outlier Indices

On the opposite end of the spectrum from consistency indices, outlier indices are used to identify carelessness that manifests in extreme responses relative to the rest of the distribution of scores. Because they follow no pattern, extreme responses may not be captured by using consistency indices alone (Goldammer et al., 2020), making it necessary to examine inconsistencies instead. Inconsistent scores may not follow the predicted pattern of scores, and thus may provide additional evidence of carelessness (Curran, 2016). Because carelessness occurs across many items and most surveys are multivariate by design, outlier indices for careless responses are typically multivariate in nature.

The Mahalanobis distance (D) is one example of a CR outlier indicator that has shown promise as a method of identifying careless responses (Niessen et al., 2016). Mahalanobis D is a multivariate estimate of the distance between an individual’s responses and the centroid of the data. Researchers have found that this indicator correlates well with other indicators of careless responding, although its sensitivity is limited in situations where responses deviate substantially from normality (Curran, 2016). In practice, Mahalanobis D is represented as

D = (x_i - \bar{x}) \, C_x^{-1} (x_i - \bar{x})^{T},

where x_i is an individual’s vector of observations, \bar{x} is the mean vector of observations, and C_x is the covariance matrix (Curran, 2016). Some research has suggested that Mahalanobis D may not function consistently as an indicator of carelessness (Goldammer et al., 2020), although it has been shown to function most consistently in multivariate normal situations (Curran, 2016). Also, Mahalanobis D requires a larger number of items than other indices (Bowling et al., 2021b).
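As a concrete illustration, base R's mahalanobis() computes the squared distance in the formula above for every respondent at once. The toy responses matrix below is a hypothetical stand-in for real survey data, and the 95th-percentile flag simply mirrors the style of cutoff used later in this article.

```r
# Minimal sketch (base R): squared Mahalanobis distance of each respondent's
# response vector from the centroid of the data. 'responses' is illustrative.
set.seed(1)
responses <- matrix(sample(0:3, 200 * 32, replace = TRUE), nrow = 200, ncol = 32)

d2 <- mahalanobis(responses, center = colMeans(responses), cov = cov(responses))

# Flag unusually distant respondents, e.g., beyond the 95th percentile of D^2
flag_outlier <- d2 > quantile(d2, .95)
```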

Observational Indices

A third group of indices measures carelessness by providing ancillary evidence of the level of effort exerted by participants. Self-report items are one such indicator. Also called “bogus” items or “validity” items, these items offer participants the chance to report how much effort they put forth in their responses. For example, participants may be asked to “Answer ‘mostly true’ to this question,” or “I am paying attention to my answers on this survey.” Unexpected responses from participants on these items may indicate carelessness (Meade & Craig, 2012).

Another readily available observational indicator for carelessness is response time. Similar to rapid guessing in assessment settings (Wise, 2017), participants who are careless may complete surveys in substantially less time than would be required based on the number and complexity of items being asked. Extremely short response times, then, can act as an indicator of insufficient effort (Curran, 2016). The amount of time a respondent takes to complete a survey is a basic indicator that can be used to identify careless responders. As an observational indicator, completion time is easily obtainable with computer-administered surveys and may be obtainable with some effort for paper surveys. Response time as a carelessness indicator is a function of the length of the survey in relation to the completion time. Some research has reported response time as the average time per page (Bowling et al., 2021b); others have classified it as the average time per item across the entire survey, or across subscales. Regardless of method, extremely short response times may indicate carelessness, depending on a prespecified threshold. Some researchers have suggested that a threshold of <2 s/item is a conservative level for this indicator (Huang et al., 2012). However, others suggest that the completion time threshold is dependent on characteristics of the survey. For example, Meade and Craig (2012) found that for their survey, a time-per-page limit of between 4 and 5 seconds was most appropriate.
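A per-item completion-time screen is straightforward to implement. The sketch below assumes hypothetical total completion times in seconds and applies the 2 s/item rule of thumb discussed above.

```r
# Minimal sketch: flag surveys completed faster than a seconds-per-item threshold.
completion_seconds <- c(420, 95, 260, 48)   # hypothetical total completion times
n_items <- 32
seconds_per_item <- completion_seconds / n_items
flag_fast <- seconds_per_item < 2           # e.g., the 2 s/item rule (Huang et al., 2012)
```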

Person-Fit Indices

Person-fit statistics are a group of indicators from item response theory (IRT) models that may be useful for classifying carelessness. Person-fit indices allow analysts to identify individual persons whose response patterns do not match what would be expected given their overall location estimate on the latent variable (Glas & Khalid, 2016). Statistics such as person-specific infit and outfit mean square error (MSE) in Rasch models therefore function similarly to traditional CR indices in that they can be used to identify examinees who exhibit patterns of unexpected responses (Wolfe & Smith, 2007a, 2007b). When examinees misfit an IRT model, their location estimates cannot be meaningfully interpreted. One potential cause of this misfit may be careless responding.

Researchers have observed that person fit statistics, such as those based on Rasch models, can identify examinees who exhibit patterns of unexpected responses (Karabatsos, 2003), such as those that occur as a result of carelessness (Wolfe & Smith, 2007a). Indeed, many person-fit indices show high levels of accuracy in detecting carelessness in both dichotomous and polytomous data (Beck et al., 2019; Karabatsos, 2003).

In the current study, we extend previous literature on person-fit statistics for polytomous surveys by exploring the use of the person-specific infit and outfit mean square error (MSE) fit statistics from Rasch models. Rasch-based person fit statistics are particularly useful because Rasch models have strict requirements for item responses related to principles of invariance. As a result of these strict requirements, person fit statistics flag participants whose response patterns warrant additional examination. In a person fit context, researchers can use Rasch-based person fit indicators to consider how careless responding may affect fundamental measurement requirements. Infit MSE for persons is calculated as

\text{Infit MSE} = \sum_{i=1}^{L} Q_{ni} Z_{ni}^{2} \Big/ \sum_{i=1}^{L} Q_{ni},

where Z_{ni} is the standardized residual (the difference between the observed and expected rating for person n on item i, divided by the square root of its variance), and Q_{ni} is the variance of the expected response for person n on item i. Outfit MSE for persons is calculated as

\text{Outfit MSE} = \sum_{i=1}^{L} Z_{ni}^{2} \Big/ L,

where Z_{ni} is defined as before, and L is the number of assessment opportunities (e.g., items). Because infit MSE is information-weighted, it tends to be less sensitive to extreme unexpected ratings than outfit MSE. Values of infit MSE and outfit MSE can range between zero and infinity. Many researchers consider values of these statistics around 1.00 as evidence of acceptable model-data fit (e.g., Linacre, 2002; Smith, 1986; Wu & Adams, 2013). Values lower than 1.00 may indicate overfit (i.e., responses exhibit less variability than predicted), whereas values greater than 1.00 suggest underfit (i.e., responses exhibit more variability than predicted).
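A minimal base R sketch of these two formulas is shown below, assuming Partial Credit Model (PCM) category probabilities. The helper names (pcm_probs, person_fit) and the inputs theta (person locations), delta (item-by-threshold parameters), and X (observed ratings) are illustrative, not the software the authors used.

```r
# Minimal sketch: person-level infit and outfit MSE under an assumed PCM.
pcm_probs <- function(theta, delta) {          # PCM probabilities for categories 0..K
  cum <- c(0, cumsum(theta - delta))
  exp(cum) / sum(exp(cum))
}

person_fit <- function(X, theta, delta) {      # X: persons x items; delta: items x thresholds
  n <- nrow(X); L <- ncol(X); K <- ncol(delta)
  infit <- outfit <- numeric(n)
  for (p in 1:n) {
    Z2 <- Q <- numeric(L)
    for (i in 1:L) {
      pr <- pcm_probs(theta[p], delta[i, ])
      E  <- sum((0:K) * pr)                    # expected rating
      W  <- sum(((0:K) - E)^2 * pr)            # model-implied variance of the rating
      Z2[i] <- (X[p, i] - E)^2 / W             # squared standardized residual
      Q[i]  <- W
    }
    infit[p]  <- sum(Q * Z2) / sum(Q)          # information-weighted mean square
    outfit[p] <- sum(Z2) / L                   # unweighted mean square
  }
  list(infit = infit, outfit = outfit)
}
```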

Another IRT-based person fit statistic is the polytomous l z statistic (Drasgow et al., 1985; Magis et al., 2012); this statistic is quite popular in person fit research (Rupp, 2013). For example, Niessen et al. (2016) found that the number of Guttman errors and the polytomous l z statistic could be used to complement traditional CR indices such as LongString and Mahalanobis distance. The l z statistic has a relatively straightforward interpretation because it is referenced against a standard normal distribution. This statistic can take on values between positive and negative infinity, and some researchers interpret values that exceed ±2.00 as evidence of person misfit. Polytomous l z is calculated as

l_z = \frac{l_0 - E(l_0)}{\sqrt{V(l_0)}},

where l_0 is the log-likelihood of the observed response pattern for a person, E(l_0) is the expected log-likelihood of the response pattern, and V(l_0) is the variance of the log-likelihood of the response pattern.
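The formula above can be computed directly from the model-implied category probabilities. The sketch below is a hedged illustration for a single person (the inputs x and probs are hypothetical), whereas the analyses reported later in this article use the PerFit package.

```r
# Minimal sketch: polytomous lz for one person. 'probs' is an L x (K+1) matrix of
# model-implied category probabilities; 'x' is the observed category vector (0..K).
lz_poly <- function(x, probs) {
  L  <- nrow(probs)
  l0 <- sum(log(probs[cbind(1:L, x + 1)]))      # observed log-likelihood
  Ei <- rowSums(probs * log(probs))             # expected log-likelihood per item
  Vi <- rowSums(probs * log(probs)^2) - Ei^2    # variance of log-likelihood per item
  (l0 - sum(Ei)) / sqrt(sum(Vi))
}
```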

The Current Study

The current paper seeks to expand the literature on careless responding by examining the usefulness of item response theory person fit statistics (Glas & Khalid, 2016; Niessen et al., 2016; Schneider et al., 2018) in the context of traditional CR indices. Specifically, this study extends prior studies (e.g., Beck et al., 2019; Dupuis et al., 2019; Karabatsos, 2003) in several ways. First, we use both traditional and person-fit indices in our analysis. Most studies that have compared CR indices have examined only traditional indices (e.g., Dupuis et al., 2019; Goldammer et al., 2020) or only person-fit indices (Karabatsos, 2003; Nazari et al., 2022). Second, we explore the accuracy of CR and person-fit indices at detecting carelessness resulting from both random responding and consistency-related responding. As we stated earlier, studies typically have only explored one form of careless response pattern. Third, we incorporate both real and simulated data in our analysis.

Finally, we classify carelessness of responses by using both ends of the person-fit indices distributions. One important benefit of person-fit statistics in the context of CR identification is that they function in a bi-directional manner, where person “underfit” may signal more variation than expected and person “overfit” may signal less variation than expected (discussed further below). This bidirectionality may parallel consistency and outlier analyses of traditional CR indices. In addition, Rasch person fit statistics reflect a theory-driven perspective on measurement. Researchers can use these statistics to explore the potential contribution of carelessness to deviations from measurement requirements that reflect this theoretical perspective.

The current paper contributes to research on CR identification by examining the performance of person-level infit and outfit MSE statistics compared with traditional CR indices. We combine empirical data from a student survey on teacher effectiveness with simulated data to evaluate the utility of person-fit indices in identifying careless responders. Specifically, we seek to answer the following questions:

  • 1. What is the association between person-fit indices and previously identified CR indices when used to identify careless responders? How does the association change across different carelessness conditions?

  • 2. How consistently do person-fit indices and traditional CR indices perform when classifying careless responders? How is the classification correspondence affected by different carelessness conditions?

Method

Measures of Careless Responding

The current paper compares IRT model person-fit statistics, with an emphasis on infit and outfit MSE, with previously studied and commonly used CR indices. We use empirical and simulated data to compare three categories of traditional CR indices (consistency, outlier, and observational) with three types of person-fit indicators. We include at least one commonly used indicator of each category. We discuss each indicator below.

Because employing CR indices requires a balance between the risk of removing accurate responses and the benefit of increased accuracy and precision (Curran, 2016), the choice of what values indicate carelessness must be thoughtful and conservative. Differences in survey complexity, item length, and mode of delivery (e.g., online vs. paper/pencil) make thresholds difficult to establish. Therefore, we use conservative thresholds to classify carelessness and, where appropriate, we test multiple cut points per indicator (Table 1), as we describe below.

Table 1.

Careless Response Indices and Classification Criteria.

Index Type: Careless Response Index (Careless Response Classification Criteria a)
Person-fit: Infit MSE (Low: .68; High: 1.35)
Person-fit: Outfit MSE (Low: .68; High: 1.39)
Person-fit: Polytomous l z (Low: −.94; High: 1.38)
Consistency: LongString (>50% of total scale)
Outlier: Mahalanobis distance (Low: 2.47; High: 60.81)
Observational (empirical only): Completion time (<2 sec/item; <3 sec/item; <4 sec/item)
Observational (empirical only): Validity items (Incorrect response)

Note. a Person-fit and outlier thresholds were calculated using empirical bootstrapping for 5% and 95% cutoffs.

Person-Fit Statistics

For CR identification using infit and outfit MSE statistics, we used a bootstrapping method based on our real data to classify values that exceed the 95th percentile or fall below the 5th percentile of the bootstrap distribution (discussed further below). Because low and high values indicate different types of misfit (overfit vs. underfit, respectively), we divide the classification into two categories: low values are classified as overfit, and high values are classified as underfit. Although there is a modified version of l z (l z*; Snijders, 2001) that overcomes some documented limitations with the l z statistic, l z* is not yet available for polytomous responses. We used a bootstrap method described below to identify critical values for classifying participants based on their polytomous l z statistics. As with infit/outfit, we subclassify carelessness into low values and high values using the 5th and 95th percentiles from our bootstrap results.

LongString Analysis

In the current study, we calculate the longest string of consecutive identical responses for the entire survey, inclusive of the 3 validity items in the empirical data (discussed below). Considering Curran’s (2016) recommendation toward cautiousness, we classify careless responses as those with a long string of greater than 50% of the scale length. For the empirical data (32 items), this corresponds to a threshold string of 16 identical sequential responses. For the simulated data (30 items), this corresponds to a threshold string of 15 consecutive identical responses.

Mahalanobis Distance

In the current study, we compute Mahalanobis D across the entire dataset, rather than subscales. We used the empirical bootstrapping procedure discussed below to identify threshold values for classifying participants as careless using the Mahalanobis D statistic.

Completion Time

In the current study, we used cutoffs of 3 and 4 s/item as conservative thresholds for carelessness. We originally tested a 2 s/item threshold, but it resulted in no respondents classified as careless. Completion time was only calculated for the empirical data. The association between the accuracy and speed of test takers (e.g., Myszkowski, 2017; van der Linden, 2006, 2007; Wang & Xu, 2015) may depend on specific response processes (e.g., He & von Davier, 2015; Qiao & Jiao, 2018). As we did not know the response processes underlying the response times in the empirical data, we chose not to include response time in the simulated data.

Validity Items

The survey used to collect our empirical data contained three self-report items that were located at roughly equal points throughout the survey. Participants self-reported carelessness in three different ways. Validity Item 1 asked students to “Select ‘not true’ for this item”; Validity Item 2 asked “I am paying attention to how I answer this survey”; and Validity Item 3 asked “I am being totally honest on this survey.” Aberrant responses to any item were classified as a careless response for that item. Although some scholars report the sum of the validity items, we report each validity item separately due to the differing nature of each question. Only the empirical data contain these items.

Data Sources

Simulated Data

We simulated participant responses to a four-category rating scale (x = 0, 1, 2, 3) using the Partial Credit Model (PCM; Masters, 1982). We simulated person locations and item difficulties using a normal distribution, which resembles the item distribution in the real data and reflects recent IRT simulation research (e.g., Buchholz & Hartig, 2019; Dodeen, 2004; Finch, 2011; Fox & Verhagen, 2018). Table 2 provides an overview of the simulation design, including characteristics that we held constant over conditions and manipulated factors. We manipulated only one key factor in the simulation: the proportion of participants who exhibited each type of careless responding pattern (0%, 5%, 10%, or 20%). We included two types of carelessness in each condition: overly consistent (LongString) and random responses, such that either 0%, 5%, 10%, or 20% of the sample exhibited each type of carelessness, and the total percentage of participants who exhibited carelessness could range from 0% to 40%. In conditions with careless participants, we created both types of carelessness for the selected random set of participants after we generated participant responses using the PCM. Recognizing that participants may not exhibit carelessness on all items and that some participants may exhibit more-frequent carelessness than others (Goldammer et al., 2020), we randomly selected a value between 50% and 100% for each participant to indicate the proportion of items on which they would exhibit carelessness. This value was not a simulation condition because we did not pre-specify these person-specific proportions, and we did not empirically examine the effects of different participant-specific proportions of carelessness. Using these proportions for each of the selected participants, we replaced between 50% and 100% of randomly selected responses with a new value to generate carelessness patterns. For each of the overly consistent responders, we randomly selected one value from the available rating scale categories (x = 0, 1, 2, 3) and replaced responses to a consecutive set of items of the selected length with this single selected value. For each of the random responders, we randomly selected a new value from the rating scale categories for each selected response and replaced responses to a randomly selected set of items of the selected length with those random responses. For participants who we did not simulate to exhibit carelessness (hereafter, “normal responders”), we used the original simulated responses. We simulated 100 replications of each condition.

Table 2.

Simulation Design Conditions.

Factor Type Factor Level(s)
Held constant Rating scale length 4 categories (0, 1, 2, 3)
Generating person locations N ∼ (0,1)
Generating item difficulties N ∼ (0, .5)
Distance between thresholds Selected from U ∼ (.8, 1.5)
Person sample size 2,000
Item sample size 30
Types of carelessness Overly consistent (long-string); random responses
Manipulated % overly consistent and random 0%, 5%, 10%, 20%
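The following is a minimal sketch of the carelessness-injection step described above, under simplified assumptions (placeholder responses rather than PCM-generated ones, and 10% of respondents per carelessness type); all object names are illustrative, not the authors' code.

```r
# Minimal sketch: overwrite 50%-100% of selected respondents' answers with either
# a single constant value (overly consistent) or random categories (random).
set.seed(2)
responses    <- matrix(sample(0:3, 2000 * 30, replace = TRUE), nrow = 2000)  # placeholder data
careless_ids <- sample(1:2000, 400)                       # e.g., 10% constant + 10% random
constant_ids <- careless_ids[1:200]
random_ids   <- careless_ids[201:400]

for (p in constant_ids) {                                 # overly consistent responders
  n_aff  <- sample(15:30, 1)                              # 50%-100% of the 30 items
  start  <- sample(1:(30 - n_aff + 1), 1)
  responses[p, start:(start + n_aff - 1)] <- sample(0:3, 1)
}
for (p in random_ids) {                                   # random responders
  n_aff <- sample(15:30, 1)
  items <- sample(1:30, n_aff)
  responses[p, items] <- sample(0:3, n_aff, replace = TRUE)
}
```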

Empirical Data

Data used in this study are from one single school district in the Network for Educator Effectiveness (NEE), a teacher evaluation system used by over 275 districts in Missouri. The current study includes responses to a student survey of teacher effectiveness from a suburban district with an enrollment of approximately 18,000 students. The Teacher Effectiveness Student Survey (TESS; Tsai et al., 2022) items were written and reviewed by a team of assessment experts, content experts, and practitioners to align with the InTASC core teaching standards as condensed by the Missouri State Department of Elementary and Secondary Education (Council of Chief State School Officers, 2013).

The TESS was administered electronically to students during one period in the latter half of the 2017–2018 school year. The students responded to the survey for the teacher of that period, but the survey was administered by proxy, meaning that the classroom teacher left the room and another staff member assigned by the school administered the survey. Classes varied by subject matter according to the period selected by the administrators. Data for this study include 2,049 complete surveys from 9th to 12th-grade students who completed the survey for 108 teachers. On average, 18 students evaluated each teacher. We only included surveys with no missing responses in our analysis. Most surveys were complete, and the average number of items skipped was 2.59 (SD = 2.54). Further, in some schools, not all questions were administered because schools had the option to select specific indicators, resulting in approximately 20% of the surveys having missing data. As missingness may adversely impact indices such as Mahalanobis D and LongString, missing data are often either excluded or imputed (Niessen et al., 2016). Because our data were missing systematically as well as potentially at random, we elected to exclude surveys with missing data.

In the TESS, students were asked to provide ratings on a 4-point Likert scale (Not True, Sort of True, True, Very True) on survey items spanning several interrelated teaching practices. Specifically, the TESS at the school contained 29 survey items measuring seven subdomains of effective teaching practices: (1) teachers’ content knowledge (4 items), (2) cognitive engagement of students (4 items), (3) support of students’ cognitive development (3 items), (4) facilitation of problem-solving and critical thinking (5 items), (5) teacher-student relationship (5 items), (6) communication (4 items), and (7) monitoring learning (5 items). The survey also included three validity items (discussed above), for a total of 32 items.

Data Analysis

We analyzed the simulated and real data using the same procedure, except for the observed indicators. For the empirical data, we calculated response time, and responses to validity items in addition to the other indices. For both simulated and empirical data, we first analyzed the data using the TAM package for R (Robitzsch, 2021) so that we could calculate infit and outfit MSE statistics for each person. We calculated polytomous l z based on the PCM using the PerFit package for R (Tendeiro et al., 2016). We calculated multivariate Mahalanobis D with the base R function (R Core Team, 2020), and LongString indices using the careless package for R (Yentes & Wilhelm, 2021).

We then used an empirical bootstrap method adapted from Wolfe (2013) to obtain thresholds with which to evaluate person fit and Mahalanobis D. Specifically, we used person, item, and rating scale threshold estimates from our PCM analysis of the real data as generating values to simulate 1,000 bootstrap datasets with the same item and person sample sizes using the PCM. We analyzed each of the new datasets using the PCM and calculated infit MSE, outfit MSE, and polytomous l z person fit statistics, along with Mahalanobis D statistics, for each simulated person. From each dataset, we identified the value of each person statistic at the 5th and 95th percentiles; these percentiles reflect a relatively conservative approach to identifying person misfit. We used the mean of these values across the 1,000 bootstrap samples as critical values for evaluating person fit in our analyses. For the MSE person fit statistics, we calculated the 5th and 95th percentiles after we removed any persons with constant extreme responses because the MSE fit statistics for those persons were near zero and resulted in a notably skewed distribution. In other Rasch model software programs such as Facets, Winsteps, and the eRm package for R, person fit statistics are not calculated for these persons with extreme constant scores because their estimates are interpolated, and model-data fit analysis is not meaningful. Because the goal of our bootstrap procedure was to generate data that reflected our real data while also reflecting good model-data fit and meaningful measures, this approach was in line with the procedure proposed by Wolfe (2013) and the interpretation of Rasch results for persons with extreme scores in general (Bond et al., 2020).
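A compressed sketch of this bootstrap-threshold idea is given below. It reuses the pcm_probs() and person_fit() helpers from the earlier infit/outfit sketch, and theta_hat and delta_hat stand in for the PCM estimates obtained from the real data; the specific values and the reduced number of replications are illustrative assumptions, not the authors' code.

```r
# Minimal sketch (adapted from the Wolfe, 2013, procedure described above).
simulate_pcm <- function(theta, delta) {           # one model-implied (well-fitting) dataset
  t(sapply(theta, function(th)
    apply(delta, 1, function(d) sample(0:length(d), 1, prob = pcm_probs(th, d)))))
}

set.seed(3)
theta_hat <- rnorm(2000)                            # placeholder person estimates
delta_hat <- matrix(rnorm(30 * 3, sd = 0.5), 30, 3) # placeholder item thresholds
n_boot    <- 100                                    # 1,000 replications in the paper

cuts <- replicate(n_boot, {
  X_b   <- simulate_pcm(theta_hat, delta_hat)
  fit_b <- person_fit(X_b, theta_hat, delta_hat)
  c(quantile(fit_b$infit, c(.05, .95)), quantile(fit_b$outfit, c(.05, .95)))
})
rowMeans(cuts)   # averaged 5th/95th-percentile critical values for infit and outfit MSE
```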

We classified careless responders based on the empirical thresholds for infit MSE, outfit MSE, polytomous l z , and Mahalanobis D and the theoretical thresholds for LongString and response time (for the real data). We calculated Spearman rank correlations using the raw scores to explore the correspondence among these person-specific indices. To compare alignment in classification of “careless” or “normal” responses between indices, we calculated phi correlations between the dichotomous classification variables (1 = careless, 0 = normal).
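Because the classification variables are dichotomous, the phi coefficient reduces to a Pearson correlation of 0/1 vectors. The sketch below uses made-up flags purely to illustrate the computation.

```r
# Minimal sketch: phi coefficient between two dichotomous careless-response flags.
set.seed(4)
flag_a <- rbinom(2000, 1, .10)   # hypothetical classifications from index A (1 = careless)
flag_b <- rbinom(2000, 1, .10)   # hypothetical classifications from index B
phi <- cor(flag_a, flag_b)       # Pearson correlation of 0/1 variables = phi
```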

To examine how accurately the person-fit indices and the traditional CR indices performed in detecting careless responders (i.e., constant and random responding) and whether some of the indices performed better than others, we plotted each index in a receiver operating characteristic (ROC) curve for different patterns and amounts of careless responding (see Figure 1). We also examined the area under the curve (AUC) to determine the accuracy of the detection. In this study, ROC graphs show the probability of a statistical index (classifier) correctly detecting the careless responders, the true positive rate (sensitivity), against the probability of false alarm (1-specificity) at various threshold settings (Fawcett, 2006). Accordingly, the AUC shows the ability of a classifier to distinguish between careless and non-careless responders. When AUC = 1, the classifier can perfectly detect careless responders and non-careless responders. When AUC = .5, the classifier is not able to distinguish careless responders better than a random guess. When AUC = 0, the classifier is incorrectly identifying every careless and non-careless responder (Fawcett, 2006). Generally, AUC values ≥.7 are considered acceptable, with values ≥.8 considered excellent, and values ≥.9 considered outstanding (Mandrekar, 2010). Because the interpretation of the infit MSE, outfit MSE, and l z indices was bidirectional, meaning that values greater than the upper bound and smaller than the lower bound were considered evidence of person misfit, we used the squared deviation scores for these three indices when plotting the ROCs. These squared deviation scores reflect how much the infit and outfit MSE values deviated from 1.00 and the l z values deviated from 0. We tested the differences between AUCs using the method proposed by DeLong et al. (1988).
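For readers who want to reproduce the AUC logic, the rank-based (Mann-Whitney) formulation below gives the same quantity as integrating the ROC curve; the scores and carelessness labels are simulated placeholders. The AUC comparisons reported here used DeLong et al.'s (1988) test, which is available, for example, in the pROC package for R.

```r
# Minimal sketch: AUC for one index via the rank (Mann-Whitney) formulation.
auc <- function(score, is_careless) {
  r  <- rank(score)
  n1 <- sum(is_careless == 1)
  n0 <- sum(is_careless == 0)
  (sum(r[is_careless == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(5)
is_careless <- rbinom(2000, 1, .10)                        # simulated "truth"
deviation   <- rnorm(2000, mean = ifelse(is_careless == 1, 1.5, 0))
score       <- deviation^2                                 # squared-deviation score, as above
auc(score, is_careless)
```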

Figure 1. Receiver operating curves for constant (left) and random (right) responses across 5%, 10%, and 20% carelessness conditions.

Because the PCM assumes a unidimensional scale, we tested the psychometric unidimensionality of the empirical data using principal components analysis of model residuals (Chou & Wang, 2010; Linacre, 1998). The variance explained by the items (55.26%) was much higher than the unexplained variance in the first contrast (4.74%). Similarly, the eigenvalue of the unexplained variance was 1.38, less than the threshold of 1.40 that would be expected by random noise (Smith & Miao, 1994). Therefore, for the purposes of this paper, we assume a unidimensional structure to the empirical data.

Results

Simulation Results

In all carelessness simulation conditions, the average Z-values for Mahalanobis D were notably lower in the overly consistent group (−1.05 ≤ M_Constant ≤ −.91, with intervals indicating the range of values over the conditions) compared to the normal group (−.14 ≤ M_Normal ≤ .00) and the random group (1.46 ≤ M_Random ≤ 2.41). The average long-string statistics were notably higher in the constant group (22.83 ≤ M_Constant ≤ 22.87) compared to the other two groups, which were similar (3.45 ≤ M_Random ≤ 4.31; 3.45 ≤ M_Normal ≤ 3.46).

The person fit indices showed some differences between person subgroups, with lower values of MSE in the constant group (.84 ≤ M_Constant ≤ 1.07) compared to the random (1.46 ≤ M_Random ≤ 1.70) and normal groups (.86 ≤ M_Normal ≤ .99). The l z statistics also differed across the constant, random, and normal groups, with the highest values for the constant group (.38 ≤ M_Constant ≤ .80), the lowest values for the random group (−3.13 ≤ M_Random ≤ −2.15), and values for the normal group in between (.16 ≤ M_Normal ≤ .63).

Empirical Results

Average completion time for the empirical data was 3.75 minutes (SD = 1.73), with an average of 7.03 seconds/item. The response time distribution was positively skewed (2.21, SE = .05). The longest string of continuous responses was 29, with an average LongString value of 12 continuous responses (SD = 9.09). LongString responses were positively skewed, but with a sharp spike of scores at the highest value, with 16% of respondents producing long strings equal to the entire scale. This spike was also evident in other indices, including person fit indices, suggestive of their utility in identifying extreme overconsistent carelessness. Average infit and outfit MSE values were comparable (M = .87, SD = .56; M = .84, SD = .54, respectively). The average l z value was .51 (SD = 1.38).

The number of responses classified as careless for the empirical data ranged between 9% and 32%, depending on CR index. The 5% infit MSE index classified the most participants as careless (32%), with the low threshold infit and outfit MSE indices both classifying 24% as careless. In contrast, the high infit and outfit MSE thresholds classified fewer responses as careless (12% and 10%, respectively), and the l z statistic classified only 6% as careless. Validity items classification proportions were more variable (Item 1: 3%; Item 2: 11%; Item 3: 9%). For completion time, the 4 second/item threshold classified the most carelessness (22%), while the 3 s/item threshold only classified 2%. No responses were classified as careless at the 2 second/item threshold.

RQ 1

What is the association between person-fit indices and previously identified CR indices when used to identify careless responders? How does the association change across different carelessness conditions?

Simulated Data

The correspondence among person-fit and CR indices ranged from M_r = −.94 to M_r = .98 across conditions. We observed strong positive average correlations between Mahalanobis D and the infit and outfit MSE statistics (.69 ≤ M_r ≤ .77), and strong negative correlations with polytomous l z (−.84 ≤ M_r ≤ −.80). LongString had weak negative correlations with the MSE fit statistics (−.28 ≤ M_r ≤ −.13) and weak positive correlations with l z (.21 ≤ M_r ≤ .34). These relationships were stable across carelessness conditions (Table 3).

Table 3.

Average Spearman Correlations of Careless Response Indices for Simulated Data (Complete Sample).

% Carelessness Careless Response Indices Mahalanobis D LongString Infit MSE Outfit MSE
0% LongString −.40
Infit MSE .76 −.13
Outfit MSE .77 −.15 .97
Polytomous l z −.84 .21 −.91 −.93
5% LongString −.44
Infit MSE .75 −.14
Outfit MSE .77 −.17 .97
Polytomous l z −.82 .22 −.90 −.93
10% LongString −.50
Infit MSE .74 −.17
Outfit MSE .77 −.21 .97
Polytomous l z −.82 .25 −.91 −.94
20% LongString −.64
Infit MSE .69 −.22
Outfit MSE .75 −.28 .98
Polytomous l z −.80 .34 −.91 −.94

Empirical Data

Correlations between all CR indicators in the empirical data were moderate to large between traditional and person-fit indices, and weak for the observational indices (response time and validity items; Table 4). Response time was positively correlated with Mahalanobis D (r_s = .34), infit MSE (r_s = .21), and outfit MSE (r_s = .23) but was negatively correlated with LongString (r_s = −.36) and l z (r_s = −.19). LongString also showed a strong negative correlation with Mahalanobis D (r_s = −.82). Infit MSE, outfit MSE, and l z statistics were significantly correlated with all other indices, although they were most strongly correlated with Mahalanobis D (r_s = .66, .71, −.67, respectively) and LongString (r_s = −.49, −.54, .54, respectively). Unsurprisingly, infit and outfit MSE were strongly correlated with one another (r_s = .97), and with l z (r_s = −.91, −.92, respectively). Validity items were not strongly correlated with any other indices (−.13 < r_s < .20).

Table 4.

Empirical Spearman Correlations Between Traditional and Person-Fit Careless Response Indices.

Traditional Indices Person-Fit Indices
Time Validity Item 1 Validity Item 2 Validity Item 3 Mahal.D Long-String Infit MSE Outfit MSE
Traditional
 Validity 1a −.08
 Validity 2 .02 .13
 Validity 3 .01 .14 .62
 Mahal. D .34 −.10 .02 .00
      LongString −.36 .14 .06 .07 −.82
Person-fit
 Infit MSE .21 −.09 −.13 −.15 .66 −.49
 Outfit MSE .23 −.09 −.12 −.13 .71 −.54 .97
lz −.19 .09 .18 .20 −.67 .54 −.91 −.92

Note. aValidity item 1 – “select ‘not true’ for this item.”; validity item 2 – “I am paying attention to how I answer this survey.”; validity item 3 – “I am being totally honest on this survey.”

RQ 2

How consistently do person-fit indices and traditional CR indices perform when classifying careless responders? How is the classification correspondence affected by different carelessness conditions?

Simulated Data

For the complete sample, Mahalanobis D classifications (>95%) were positively correlated with high infit and outfit MSE (>95%) classifications (Table 5), such that higher indicators of outliers were associated with noisier fit statistics (.42 ≤ r ≤ .57). Mahalanobis D classifications were positively correlated with low l z classifications (<5%) in all conditions (.38 ≤ r ≤ .47). Among examinees who we simulated to exhibit overly consistent responses, Mahalanobis D classifications were positively correlated with low MSE classifications (.35 ≤ r ≤ .53) and high l z classifications (.39 ≤ r ≤ .58), and negatively correlated with high MSE classifications (−.37 ≤ r ≤ −.21) and low l z classifications (−.22 ≤ r ≤ −.13). For examinees simulated to exhibit random responses, Mahalanobis D classifications were weakly correlated with low MSE person fit classifications and high l z classifications (−.14 ≤ r ≤ −.04), and they were moderately-to-strongly correlated with high MSE person fit classifications (.52 ≤ r ≤ .72) and low l z classifications (.30 ≤ r ≤ .41).

Table 5.

Simulated Average Phi Coefficients Between CR Classifications of Traditional and Person-Fit Indices.

% Careless Careless Response Indices Examinee Subgroup Infit MSE Outfit MSE Polytomous lz
<.05 >.95 <.05 >.95 <.05 >.95
0% Mahal D. > 95%a All .09 .44 .12 .43 .38 .04
LongString > 50% All .00 .03 .04 −.01 −.02 .01
5% Mahal. D > 95% All .14 .56 .17 .57 .44 .13
Constant .50 −.33 .53 −.22 .13 .58
Random −.14 .72 −.14 .70 .41 −.14
LongString > 50% All .29 .19 .32 .13 .03 .25
Constant .14 −.05 .15 −.10 .16 .14
Randomb
10% Mahal. D > 95% All .14 .53 .15 .56 .47 .15
Constant .51 −.37 .52 −.29 .22 .58
Random −.09 .61 −.09 .61 .36 −.09
LongString > 50% All .36 .20 .38 .13 .01 .32
Constant .12 −.04 .14 −.09 .15 .13
Randomb
20% Mahal. D > 95% All .07 .42 .07 .45 .40 .08
Constant .36 −.25 .35 −.21 .15 .39
Random −.04 .52 .04 .52 .30 −.04
LongString > 50% All .42 .15 .43 .07 −.09 .38
Constant .11 −.03 .12 −.07 −.14 .12
Random .00 .03 .00 .03 .02 .00

Note. aVery few participants were flagged for the <5% Mahalanobis D threshold. Those columns are omitted from this table.

bNo random response examinees were classified by LongString.

LongString classifications were weakly correlated with all person-fit statistics in the condition with no simulated carelessness (−.02 ≤ r ≤ .04); this result reflects the limited frequency of LongString responders in this condition. In the other conditions, for the complete sample, LongString statistics were positively correlated with person fit classifications, indicating consistency between these indices when considered for the complete sample. For examinees simulated to exhibit overly consistent responses, LongString classifications were weakly positively associated with low MSE person fit classifications (.11 ≤ r ≤ .15) and high polytomous l z classifications (.12 ≤ r ≤ .14), not meaningfully correlated with high MSE person fit classifications (−.10 ≤ r ≤ −.03), and weakly negatively correlated with low polytomous l z classifications (−.16 ≤ r ≤ −.14). For random examinees, LongString classifications were not meaningfully associated with any of the person fit classifications (.00 ≤ r ≤ .02), reflecting the limited frequency of LongString examinees in the Random subgroup.

ROC Plots

Constant Responses

The ROC plots and the corresponding AUCs (see Figure 1) suggested that the five indices (i.e., infit MSE, outfit MSE, l z , Mahalanobis, LongString) performed well overall across constant design conditions. The omnibus test for equality indicated a significant difference between the AUCs. The chi-square test statistic values (df = 4) were 12,992.24, 22,855.38, and 39,254.86, for the 5%, 10%, and 20% conditions, respectively. The subsequent pairwise Bonferroni-corrected comparisons of the AUCs indicated that all indices showed significant differences, except that outfit MSE and the Mahalanobis D were equally effective in the 20% condition, χ2(1) = .53, p = .597. The LongString index showed outstanding performance across all conditions (AUC = .999), with infit MSE and outfit MSE also showing excellent performance (AUC ≥ .865). While still acceptable, the l z statistic and Mahalanobis index did not perform as well as the other indices (AUC ≥ .746, .750, respectively). The l z statistic decreased in performance as the amount of carelessness increased. Conversely, the Mahalanobis index increased in accuracy as the amount of carelessness increased.

Random Responses

For the random responses, the chi-square test statistic values (df = 4) were 19,284.87, 36,974.07, and 70,633.09, for the 5%, 10%, and 20% conditions, respectively. The omnibus test for equality indicated a significant difference between the AUCs. The subsequent pairwise Bonferroni-corrected comparisons of the AUCs indicated that all indices showed significant differences across the 5%, 10%, and 20% conditions, p < .005. The Mahalanobis index performed with the highest consistency across CR conditions (AUC ≥ .970). Infit MSE, outfit MSE, and the l z index also functioned consistently well for the 5% and 10% conditions (AUC ≥ .902, .903, .915, respectively). While their performance was reduced for the 20% condition, they still performed excellently (AUC ≥ .809). The LongString index did not show acceptable accuracy and was consistently inferior across the range of careless conditions (.649 ≤ AUC ≤ .650).

Empirical Data

Correspondence between person-fit classifications and traditional index classifications was good (Table 6), but the strength of the relationship tended to differ between low and high person-fit classifications. Classification based on lower levels of infit MSE, outfit MSE, and l z correlated positively with LongString classifications (r = .48, .49, .53, respectively). Low infit MSE and outfit MSE classifications correlated positively with classifications based on the low Mahalanobis D threshold (r = .59, .57, respectively), whereas classifications based on high infit MSE and outfit MSE thresholds correlated positively with classifications based on the high Mahalanobis D threshold (r = .45, .50, respectively). This relationship was reversed for the l z statistic, with classifications based on low l z values positively correlated with classifications from the high Mahalanobis D threshold (r = .55), and classifications based on high l z values positively correlated with classifications from the low Mahalanobis D threshold (r = .63). Validity item classifications and response time were not strongly correlated with any other fit statistics.

Table 6.

Empirical Phi Coefficients Between CR Classification of Traditional and Person-Fit Indices.

Careless Response Indices Classification Condition Infit MSE Outfit MSE Polytomous l z
<.05 >.95 <.05 >.95 <.05 >.95
Time <3 s/item .17 −.06 .16 −.04 −.05 .18
<4 s/item .24 −.10 .23 −.07 −.09 .24
Validitya Item 1 .12 −.02 .10 −.06 −.07 .11
Item 2 .16 −.04 .15 −.06 −.07 .15
Item 3 .18 −.09 .16 −.09 −.07 .18
Long string >50% .48 −.14 .49 −.17 −.22 .53
Mahal. D
Low threshold <5% .59 −.18 .57 −.16 −.15 .63
High threshold >95% −.22 .45 −.22 .50 .55 −.20

Note. aValidity item 1 – “select ‘not true’ for this item.”; validity item 2 – “I am paying attention to how I answer this survey.”; validity item 3 – “I am being totally honest on this survey.”

Discussion

This study explored the correspondence between several commonly used CR indices and three person-fit statistics in detecting carelessness in surveys. Using simulated data and empirical survey data, we evaluated the detection accuracy and performance of infit MSE, outfit MSE, and polytomous l z compared with LongString and Mahalanobis D indices. We tested the utility of these indices for detecting two types of careless response: overconsistency (e.g., 1,1,1,1,1…) and random responding. Our findings suggest that person-fit statistics may function excellently, and at levels similar to traditional CR indices, when classifying carelessness from both random responses and from overconsistency. Person-fit indices were consistent as the amount of carelessness increased, and detected carelessness adequately in conditions with as much as 20% careless responses. This is a positive sign, since the proportion of careless respondents is likely to vary from survey to survey. All five indices we tested performed well, which aligns with previous research on other person-fit indices (Karabatsos, 2003; Schneider et al., 2018). However, LongString was most suitable for over-consistent carelessness, while Mahalanobis D appeared most accurate when used for random responding. This is not unexpected, given that LongString is designed to examine over-consistency, while Mahalanobis D is used to detect aberrant responses. Notably, person-fit statistics performed excellently for both constant and random careless response patterns.

These findings are slightly at odds with those of Goldammer et al. (2020), who found that LongString was ineffective at detecting carelessness. One reason for this discrepancy might be that Goldammer et al. examined carelessness stemming from random responding and did not explore overconsistency as a form of carelessness. Our findings emphasize the concern expressed by multiple researchers that carelessness is not homogenous, and that using multiple types of CR indices may be useful for improving carelessness detection (Curran, 2016; Meade & Craig, 2012). Our findings also differ from Niessen et al. (2016), who found that l z and other person-fit statistics did not perform well. One reason for this discrepancy might be methodological differences: Niessen et al. did not specify the pattern in which carelessness occurred. Thus, we suggest that when exploring consistency of CR indices, researchers should evaluate them across a variety of careless conditions rather than a single condition.

Our study also highlights the unique bidirectional feature of some indices, such as person-fit measures. The bidirectional nature of person-fit statistics may be one reason for their apparent accuracy in detecting both randomness and overconsistency. That is, they are able to flag over-consistency and under-consistency of responses. Some traditional indices for careless responding, such as LongString, are unidirectional and may capture only certain patterns of carelessness. In the current study, low person fit values classified carelessness in a similar manner to the consistency index of LongString. Similarly, high person-fit values classified carelessness in the same way as the outlier index (Mahalanobis D). Because carelessness is likely to present in both an over-consistent and under-consistent manner (Meade & Craig, 2012), researchers and practitioners should ensure that both the upper and lower bounds of bidirectional indices are examined for potential carelessness. As person-level fit values are already used regularly in psychometrics to remove degrading data during assessment development (e.g., Bond & Fox, 2011; Rupp, 2013), their use in low-stakes survey settings may be considered a useful addition to existing CR indices when preparing data for analysis or practical use. In addition, person fit statistics based on Rasch measurement theory have a theoretical benefit because they are aligned with a guiding framework for evaluating measurement procedures based on fundamental measurement requirements (Rasch, 1960). Still, because no single index performed best across all conditions, our results support the continued use of a combination of indices.

Our study has some limitations that warrant additional research. First, we focused on a limited number of CR indices. Numerous other indices are available that may offer valuable insight into carelessness (see Curran, 2016), including other person-fit indices or IRT methods not explored in this paper (Beck et al., 2019; Schneider et al., 2018). While the purpose of this study was to explore bidirectional person-fit indices, our paper was not an exhaustive exploration of the use of person-fit statistics as CR indices. Future studies could extend our findings by examining the detection accuracy of a wider array of traditional and person-fit indices using simulated and empirical data. However, we recommend that future studies simulate multiple types of carelessness rather than focusing on a single type of carelessness (e.g., both over- and under-consistent responses).

Another area that we did not explore, partially due to the difficulty in simulating such data, was observational indices such as validity items and response time. While these methods can be valuable in detecting carelessness, especially in longer surveys (Bowling et al., 2021b; Meade & Craig, 2012), observational indices may not maintain high levels of detection accuracy across conditions. The empirical data in our study showed weak relationships between the classification of carelessness from bogus items and response time and that from other indices. Given that others have identified concerns with validity items (Niessen et al., 2016), more research is needed to explore their sensitivity across conditions. This is also true for response time. While we did not simulate rapid responding as a carelessness pattern, future studies may wish to focus on simulating observational indices such as response time in a manner similar to that undertaken by those evaluating rapid responses in other assessment settings (see Rios, 2022).

We also did not explore the effect of varying rules-of-thumb for classification cutoffs, as our study focused primarily on carelessness type rather than classification cutoffs. We used a bootstrapping method to obtain 5% and 95% thresholds for our statistics, as well as a strongly conservative 50% cutoff for LongString and a 3 s/item completion-time cutoff, to minimize the chance of removing accurate data (Curran, 2016). However, there is no current empirical evidence to support a universally accepted or applied set of thresholds for indices such as person-fit measures, Mahalanobis D, or others, although studies have more extensively explored response time thresholds and some common thresholds exist for this index (see, e.g., Wise, 2019). While this study improves on past research exploring CR indices with bootstrapping (Beck et al., 2019), the performance of thresholds for CR identification may depend on contextual characteristics of surveys, and future studies should explicitly evaluate how varying cutoff thresholds function across different contexts and careless conditions.

A third limitation is that we only explored two patterns of careless responding: over-consistency and random response. In practice, overconsistency is a common occurrence in survey data, but carelessness may take other forms (Ulitzsch, 2021). Also, we only included complete data in our study. However, surveys are likely to include some missingness. The degree to which missing data will affect CR indices is unclear. While some indices, such as person-fit indices, may function well in the presence of missing data (Smith, 1986), others, such as LongString, may suffer. Further research is needed to explore the influence of missingness, as well as the type of missingness (e.g., missing at random, missing not at random) and additional forms of carelessness on the performance of CR indices. Along the same lines, our simulation study included a relatively limited set of conditions that allowed us to conduct a focused investigation of our research questions. In future studies, researchers should include additional factors in simulation research on carelessness, including, for example, different procedures for generating overly consistent responses that manipulate the probability for long-string responses in extreme or central rating scale categories.

Finally, we emphasize that the relationship between person-fit statistics and traditional CR indices may differ depending on the survey context. As other researchers have noted (Schroeders et al., 2021), the survey setting may affect both the amount of carelessness and, potentially, the performance of CR indices. The current study included a well-structured evaluative survey administered with a proctor. The performance of person-fit indices in fully online, on-the-spot, or other survey formats may vary, meriting continued attention to the quality of CR indices in survey data (Goldammer et al., 2020).

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Eli Jones https://orcid.org/0000-0002-0320-6341

Stefanie Wind https://orcid.org/0000-0002-1599-375X

References

1. Arthur W., Jr., Hagen E., George F., Jr. (2021). The lazy or dishonest respondent: Detection and prevention. Annual Review of Organizational Psychology and Organizational Behavior, 8(1), 105–137. 10.1146/annurev-orgpsych-012420-055324
2. Beck M. F., Albano A. D., Smith W. M. (2019). Person-fit as an index of inattentive responding: A comparison of methods using polytomous survey data. Applied Psychological Measurement, 43(5), 374–387. 10.1177/0146621618798666
3. Bond T. G., Fox C. M. (2011). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Routledge.
4. Bond T. G., Yan Z., Heene M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge.
5. Bowling N. A., Gibson A. M., Houpt J. W., Brower C. K. (2021a). Will the questions ever end? Person-level increases in careless responding during questionnaire completion. Organizational Research Methods, 24(4), 718–738. 10.1177/1094428120947794
6. Bowling N. A., Huang J. L., Brower C. K., Bragg C. B. (2021b). The quick and the careless: The construct validity of page time as a measure of insufficient effort responding to surveys. Organizational Research Methods, 26(2), 323–352. 10.1177/10944281211056520
7. Buchholz J., Hartig J. (2019). Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Applied Psychological Measurement, 43(3), 241–250. 10.1177/0146621617748323
8. Chou Y.-T., Wang W.-C. (2010). Checking dimensionality in item response models with principal component analysis on standardized residuals. Educational and Psychological Measurement, 70(5), 717–731. 10.1177/0013164410379322
9. Clark M. E., Gironda R. J., Young R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15(2), 223–234. 10.1037/1040-3590.15.2.223
10. Council of Chief State School Officers. (2013). Interstate teacher assessment and support consortium InTASC model core teaching standards: A resource for state dialogue. Council of Chief State School Officers. https://ccsso.org/sites/default/files/2017-11/InTASC_Model_Core_Teaching_Standards_2011.pdf
11. Curran P., Kotrba L., Denison D. (2010). Careless responding in surveys: Applying traditional techniques to organizational settings. Poster presented at the 25th annual conference of the Society for Industrial and Organizational Psychology, Atlanta, GA, April 8–10, 2010.
12. Curran P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19. 10.1016/j.jesp.2015.07.006
13. DeLong E. R., DeLong D. M., Clarke-Pearson D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845.
14. Dodeen H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41(3), 261–270. 10.1111/j.1745-3984.2004.tb01165.x
15. Drasgow F., Levine M. V., Williams E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67–86. 10.1111/j.2044-8317.1985.tb00817.x
16. Dupuis M., Meier E., Cuneo F. (2019). Detecting computer-generated random responding in questionnaire-based data: A comparison of seven indices. Behavior Research Methods, 51(5), 2228–2237. 10.3758/s13428-018-1103-y
17. Fawcett T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. 10.1016/j.patrec.2005.10.010
18. Finch H. (2011). Multidimensional item response theory parameter estimation with nonsimple structure items. Applied Psychological Measurement, 35(1), 67–82. 10.1177/0146621610367787
19. Fox J. P., Verhagen J. (2018). Random item effects modeling for cross-national survey data. In Cross-cultural analysis (pp. 529–550). Routledge.
20. Glas C. A. W., Khalid N. (2017). Person fit. In van der Linden W. J. (Ed.), Handbook of item response theory: Volume three: Applications (1st ed., pp. 107–126). Chapman and Hall/CRC. 10.1201/9781315117430
21. Godinho A., Kushnir V., Cunningham J. (2016). Unfaithful findings: Identifying careless responding in addictions research. Addiction, 111(6), 955–956. 10.1111/add.13221
22. Goldammer P., Annen H., Stöckli P. L., Jonas K. (2020). Careless responding in questionnaire measures: Detection, impact, and remedies. The Leadership Quarterly, 31(4), 101384. 10.1016/j.leaqua.2020.101384
23. He Q., von Davier M. (2015). Identifying feature sequences from process data in problem-solving items with n-grams. In Quantitative psychology research (pp. 173–190). Springer.
24. Huang J. L., Bowling N. A., Liu M., Li Y. (2015). Detecting insufficient effort responding with an infrequency scale: Evaluating validity and participant reactions. Journal of Business and Psychology, 30(2), 299–311. 10.1007/s10869-014-9357-6
25. Huang J. L., Curran P. G., Keeney J., Poposki E. M., DeShon R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. 10.1007/s10869-011-9231-8
26. Karabatsos G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4), 277–298. 10.1207/s15324818ame1604_2
27. Linacre J. M. (1998). Structure in Rasch residuals: Why principal components analysis (PCA)? Rasch Measurement Transactions, 12(2), 636.
28. Linacre J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878. https://www.rasch.org/rmt/rmt162f.htm
29. Magis D., Raîche G., Béland S. (2012). A didactic presentation of Snijders’s lz* index of person fit with emphasis on response model selection and ability estimation. Journal of Educational and Behavioral Statistics, 37(1), 57–81. 10.3102/1076998610396894
30. Mandrekar J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. Journal of Thoracic Oncology, 5(9), 1315–1316. 10.1097/JTO.0b013e3181ec173d
31. Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. 10.1007/bf02296272
32. Meade A. W., Craig S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437–455. 10.1037/a0028085
33. Meijer R. R. (1996). Person-fit research: An introduction. Applied Measurement in Education, 9(1), 3–8. 10.1207/s15324818ame0901_2
34. Myszkowski N. (2019). The first glance is the weakest: “Tasteful” individuals are slower to judge visual art. Personality and Individual Differences, 141, 188–195. 10.1016/j.paid.2019.01.010
35. Nazari S., Leite W. L., Huggins-Manley A. C. (2022). A comparison of person-fit indices to detect social desirability bias. Educational and Psychological Measurement. Advance online publication. 10.1177/00131644221129577
36. Niessen A. S. M., Meijer R. R., Tendeiro J. N. (2016). Detecting careless respondents in web-based questionnaires: Which method to use? Journal of Research in Personality, 63, 1–11. 10.1016/j.jrp.2016.04.010
37. Patton J. M., Cheng Y., Hong M., Diao Q. (2019). Detection and treatment of careless responses to improve item parameter estimation. Journal of Educational and Behavioral Statistics, 44(3), 309–341. 10.3102/1076998618825116
38. Qiao X., Jiao H. (2018). Data mining techniques in analyzing process data: A didactic. Frontiers in Psychology, 9, 2231. 10.3389/fpsyg.2018.02231
39. Rasch G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.
40. R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
41. Rios J. A. (2022). A comparison of robust likelihood estimators to mitigate bias from rapid guessing. Applied Psychological Measurement, 46(3), 236–249. 10.1177/01466216221084371
42. Robitzsch A., Kiefer T., Wu M. (2021). TAM: Test analysis modules. R package version 3.7-16. https://CRAN.R-project.org/package=TAM
43. Rupp A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55(1), 3–38.
44. Schneider S., May M., Stone A. A. (2018). Careless responding in internet-based quality of life assessments. Quality of Life Research, 27(4), 1077–1088. 10.1007/s11136-017-1767-2
45. Schroeders U., Schmidt C., Gnambs T. (2022). Detecting careless responding in survey data using stochastic gradient boosting. Educational and Psychological Measurement, 82(1), 29–56. 10.1177/00131644211004708
46. Smith R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46(2), 359–372. 10.1177/001316448604600210
47. Smith R. M., Miao C. Y. (1994). Assessing unidimensionality for Rasch measurement. In Wilson M. (Ed.), Objective measurement: Theory into practice (Vol. 2, pp. 316–327). Ablex.
48. Snijders T. A. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66(3), 331–342. 10.1007/bf02294437
49. Tendeiro J. N., Meijer R. R., Niessen A. S. M. (2016). PerFit: An R package for person-fit analysis in IRT. Journal of Statistical Software, 74(5), 1–27. 10.18637/jss.v074.i05
50. Tsai C. L., Bergin C., Jones E. (2022). Students in 4th to 12th grade can distinguish dimensions of teaching when evaluating their teachers: A multilevel analysis of the TESS survey. Educational Studies, 1–16. Advance online publication. 10.1080/03055698.2022.2058319
51. Ulitzsch E., Pohl S., Khorramdel L., Kroehne U., von Davier M. (2021). A response-time-based latent response mixture model for identifying and modeling careless and insufficient effort responding in survey data. Psychometrika, 1–27.
52. van der Linden W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. 10.3102/10769986031002181
53. van der Linden W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. 10.1007/s11336-006-1478-z
54. Wang C., Xu G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456–477. 10.1111/bmsp.12054
55. Ward M. K., Meade A. W. (2018). Applying social psychology to prevent careless responding during online surveys. Applied Psychology, 67(2), 231–263. 10.1111/apps.12118
56. Wise S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. 10.1111/emip.12165
57. Wise S. L. (2019). An information-based approach to identifying rapid-guessing thresholds. Applied Measurement in Education, 32(4), 325–336. 10.1080/08957347.2019.1660350
58. Wolfe E. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
59. Wolfe E. W., Smith E. V., Jr. (2007a). Instrument development tools and activities for measure validation using Rasch models: Part I - Instrument development tools. Journal of Applied Measurement, 8(1), 97–123.
60. Wolfe E. W., Smith E. V., Jr. (2007b). Instrument development tools and activities for measure validation using Rasch models: Part II - Validation activities. Journal of Applied Measurement, 8(2), 204–234.
61. Wu M., Adams R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339–355.
62. Yentes R. D., Wilhelm F. (2021). careless: Procedures for computing indices of careless responding. R package version 1.2.1. https://cran.r-project.org/web/packages/careless/index.html
