Abstract
Insufficient effort responding (IER) affects many forms of assessment in both educational and psychological contexts. Much research has examined different types of IER, IER’s impact on the psychometric properties of test scores, and preprocessing procedures used to detect IER. However, the literature offers little practical advice for applied researchers and psychometricians on evaluating multiple sources of IER evidence, including the best strategy or combination of strategies for preprocessing data. In this study, we demonstrate how the use of different IER detection methods may affect psychometric properties such as predictive validity and reliability. Moreover, we evaluate how different data cleansing procedures can detect different types of IER. We provide evidence via simulation studies and an applied analysis using ACT’s Engage assessment as a motivating example. Based on the findings of the study, we provide recommendations and future research directions for those who suspect their data may contain responses reflecting careless, random, or biased responding.
Keywords: insufficient effort responding, data quality, outlier detection, validity evidence
Insufficient effort responding (IER) refers to reduced effort by the respondent when answering questionnaires (Curran, 2016; Huang, Liu, & Bowling, 2015), which can stem from inattentiveness, fatigue, speededness, and other factors. IER plagues many different types of assessments in education, psychology, and other fields that use self-report measures (Curran, 2016; Dunn, Heggestad, Shanock, & Theilgard, 2018; Hauser & Schwarz, 2016; Huang et al., 2015). IER prevalence can be as low as 2% and may reach up to 50% in a given sample, which can have deleterious effects when evaluating validity evidence for scores derived from an assessment (Johnson, 2005; Meade & Craig, 2012). More broadly, IER can be thought of as an issue of data quality control, which is documented in the professional standards for evaluating scales (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). The risk of poor data quality is heightened in contemporary data collection methods such as Amazon’s Mechanical Turk, which are prone to IER (TurkPrime, 2018). Researchers and assessment practitioners may choose to ignore IER and hope respondents respond attentively throughout an assessment. However, doing so can lead to inaccurate estimates of respondent ability (e.g., for score reporting) and distorted evidence of validity (e.g., reliability coefficients) (Woods, 2006). Therefore, evaluating different methods and approaches for handling IER is of utmost importance to support sound inferences based on self-report measures.
Leveraging IER and outlier detection methods is one way to address the issue. IER detection methods, in general, distinguish “normal” respondents from respondents apparently exhibiting IER. For instance, identifying outlying observations based on Mahalanobis distance is a common data preprocessing practice across disciplines (Mahalanobis, 1936), and it can be used for IER detection. Several methods will be discussed in detail in subsequent sections. This article builds on the existing IER detection literature (e.g., Steedle, Hong, & Cheng, 2019) and addresses limitations of current methods for treating data with IER.
At least two challenges face researchers or practitioners in this context. First, there are multiple possible manifestations of IER described in the literature. For instance, IER could manifest as random responses (Huang et al., 2015) such that each response category has an equal chance of being endorsed. However, respondents may also produce a series of systematic responses. For example, they may endorse the same response category for a string of items. Either way, insufficient effort responders produce response vectors that are clearly unresponsive to item content, but the accuracy of IER detection may depend on the way IER manifests in the response data.
Second, many IER detection methods exist, and they differ in their effectiveness in detecting different types of IER. The literature has shown that different IER methods may be more or less effective for detecting random or systematic responses (Karabatsos, 2003; Meade & Craig, 2012; Niessen, Meijer, & Tendeiro, 2016). There does not seem to be a universally effective IER detection method. Moreover, previous work neither considered different ways to combine IER detection methods nor evaluated whether a combination of methods can be effective in the face of the multiple types of IER that may exist in real data.
With these challenges in mind, this article builds on the previous literature and aims to achieve the following: (1) offer an up-to-date review of different types of IER identified in the literature, (2) review different types of methods to detect IER, (3) evaluate the efficacy of each IER detection method under a myriad of conditions through simulation studies, (4) evaluate combinations of IER detection methods in terms of sensitivity and specificity, (5) demonstrate how IER detection methods can be implemented with an applied example, and (6) provide recommendations to aid researchers and practitioners.
Types of Insufficient Effort Responding
IER can result in aberrant response patterns, wherein individuals’ response patterns deviate from an assumed underlying measurement model (Karabatsos, 2003; Kim, Reise, & Bentler, 2018). An individual’s response vector may be a mixture of response behavior types: “normal” and “insufficient effort” behavior (Hong & Cheng, 2019a). Insufficient effort may lead to random responses (Meade & Craig, 2012). When respondents are presented with Likert-type scale items, random responses mean that the respondents are randomly picking a response option. Though purely random responding rarely occurs in practice, it is often studied when evaluating outlier detection methods (Johnson, 2005; Meade & Craig, 2012; Niessen et al., 2016). IER could also lead to nonrandom responses. For instance, assessment users may observe what has been called “heaping,” wherein respondents tend to endorse middle response categories (Meade & Craig, 2012; Tendeiro, 2017). Other respondents may endorse the same category over a long string of items, such as all 0s (Johnson, 2005). Certain scale features may also be related to IER. For instance, individuals are more likely to respond inattentively on longer tests (Berry et al., 1992; Clark, Gironda, & Young, 2003). Online, unproctored data collection is also more prone to IER (Pauszek, Sztybel, & Gibson, 2017).
Other scale features, such as the use of reverse-coded items, can illuminate IER because respondents may not read the items thoroughly and may endorse response categories inconsistent with their latent trait levels (Weijters, Baumgartner, & Schillewaert, 2013). In the literature, reverse-worded items may be referred to as “negatively worded” (as opposed to “positively worded”) items. Opposite responses occurring when individuals miss this type of wording may be called “misresponses” (Swain, Weathers, & Niedrich, 2008).
In short, IER may manifest in a variety of ways due to a plethora of possible mechanisms. Given the different manifestations of IER, strategies for detecting them may differ. Next, we provide a summary of various approaches to detect IER from the research literature.
Types of Detection Methods
There are several different approaches to counteract the effect of IER. One can try to directly model the hypothesized IER using a mixture modeling approach or treat IER as a separate dimension underlying responses (Falk & Cai, 2016; Suh, Cho, & Wollack, 2012; Weijters et al., 2013; Yamamoto & Everson, 2003). For instance, the hybrid model assumes that IER is manifested by random responses and may be directly modeled with a two-parameter logistic IRT model and a random response component (Yamamoto & Everson, 2003). Others have proposed to reparametrize item parameters and include multiple traits to directly model multiple response styles (Falk & Cai, 2016). Although these methods are viable, they have shortcomings. Directly modeling IER requires information about the data set that may be unknowable. For instance, one has to assume the underlying mechanism (e.g., time constraints) and manifestation (e.g., random responses) of the IER. Moreover, directly modeling the hypothesized IER serves a distinct goal: understanding the response process itself. When IER is considered a nuisance variable, IER detection methods offer a straightforward approach and interpretation when classifying respondents suspected of IER.
Several different IER detection methods have been well documented. For instance, evidence of IER can be obtained from so-called “bogus” items that were deliberately included in an assessment (Huang, Bowling, Liu, & Li, 2015; Meade & Craig, 2012). These items instruct the respondents to answer the question in a certain way (i.e., instructed responses), or they involve content unrelated to the construct of interest (e.g., “Have you ever brushed your teeth?”). Responses not following the instruction or deemed unreasonable by common sense (e.g., never having brushed one’s teeth) suggest IER. Unfortunately, such methods require changing the survey or questionnaire (e.g., by adding bogus or instructed-response items), which may have unintended effects on response behavior (Breitsohl & Steidelmüller, 2018). Other ancillary information such as response times may also be used to identify speeded responders, wherein individuals who respond too quickly are flagged for IER (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014). These methods can serve as indicators of IER and are easy to implement and interpret. However, if a researcher did not plan to collect this information beforehand (e.g., no bogus item was administered or no response time data were collected) or is analyzing secondary data, then he or she is left with only the raw response data to identify IER.
In this study, we focus on methods that rely on raw item response data only. One advantage of this class of tools is that they do not require ancillary information or modifying the assessment and are therefore widely applicable. Below we describe several IER detection methods that make use of raw response data. A summary of these methods and their respective IER detection goals is presented in Table 1. Code to implement these methods is provided in the appendix.
Table 1.
Methods to Detect IER.
| IER detection methods | IER type |
|---|---|
| Mean absolute difference (ACT, 2016) | Reverse word inattentiveness |
| Intra-individual response variability (Dunn et al., 2018) | Overly consistent responders |
| Long string (Johnson, 2005) | Overly consistent responders |
| Psychometric antonyms (Goldberg & Kilkowski, 1985) | Reverse word inattentiveness |
| Mahalanobis distance (Mahalanobis, 1936) | Any deviating pattern from multivariate normal distribution |
| Standardized log-likelihood $l_z$ (Drasgow, Levine, & Williams, 1985) | Any deviating pattern from IRT model |
Note. IER = insufficient effort responding; IRT = item response theory.
Mean Absolute Difference
Mean absolute difference (MAD) is calculated as the average score difference between positively worded and negatively worded items. If the difference between the two is greater than some threshold, then a respondent is flagged (ACT, 2016). It is important to note that the null distribution of MAD should not be expected to center around 0. The purpose of MAD is to see if respondents pay attention to negatively worded items. Respondents who ignore the stems of negatively worded items should exhibit larger average score differences between the positively worded and negatively worded items after coding all items in the same direction.
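As a rough illustration, a minimal base R sketch of the MAD index might look as follows; `resp` (a respondent-by-item score matrix coded in the same direction) and `neg_items` (the column indices of negatively worded items) are assumed names, and the exact operationalization used by ACT (2016) may differ.

```r
# Mean absolute difference between a respondent's average score on positively
# worded items and on negatively worded items (after same-direction coding).
mad_index <- function(resp, neg_items) {
  pos_mean <- rowMeans(resp[, -neg_items, drop = FALSE])
  neg_mean <- rowMeans(resp[, neg_items, drop = FALSE])
  abs(pos_mean - neg_mean)
}

# Flag respondents whose MAD exceeds an empirically derived cutoff
# (e.g., the 99th percentile of a simulated null distribution; see below).
# flagged_mad <- mad_index(resp, neg_items) > cutoff_mad
```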
Intraindividual Response Variability
The intraindividual response variability (IRV) is calculated as the within-person standard deviation of the raw scores (Dunn et al., 2018). It attempts to quantify the level of consistency of a person’s responses. If respondents respond in a manner consistent with their trait levels, they should endorse responses with scores that fall in a narrow range. Low IRV possibly indicates overly consistent responding (e.g., long strings of the same response). Researchers have also used large within-person standard deviation as an indicator for random responding. In some empirical studies, however, large IRV may suggest that the scale taps into unintended psychological constructs (Austin, Deary, Gibson, McGregor, & Dent, 1998). For the purpose of this study, we will use small IRV as an indicator of overly consistent responding.
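A minimal sketch of the IRV computation, under the same assumed `resp` matrix:

```r
# Within-person standard deviation of raw item scores (Dunn et al., 2018).
irv_index <- function(resp) {
  apply(resp, 1, sd)
}

# Small values suggest overly consistent responding; flag below a cutoff.
# flagged_irv <- irv_index(resp) < cutoff_irv
```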
Long String
Long string (LS) quantifies a long sequence of the same response. An individual who consistently endorses the same response may be flagged based on some normative cut-off that uses a “scree-like” approach similar to Johnson (2005). To conduct the analysis done by Johnson (2005), one first calculates the longest strings of the same response category and the number of participants with those longest strings. If one plots the frequency and length of the longest consecutive string for each response category, one can establish a cutoff for LS. In some previous studies, LS was the best method for identifying participants who respond too consistently, while others have found LS to be unhelpful in detecting IER (Huang, Curran, Keeney, Poposki, & DeShon, 2012; Meade & Craig, 2012).
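A minimal sketch of the LS computation is shown below; it returns, for each respondent, the longest run of each response category so that the category-specific cutoffs in Table 3 can be applied. The function name and the 0 to 5 category coding are assumptions.

```r
# Longest consecutive run of each response category per respondent
# (Johnson, 2005), computed with run-length encoding.
longest_runs <- function(resp, categories = 0:5) {
  t(apply(resp, 1, function(x) {
    runs <- rle(as.vector(x))
    sapply(categories, function(k) {
      len <- runs$lengths[runs$values == k]
      if (length(len) == 0) 0L else max(len)
    })
  }))
}

# Flag respondents whose longest run in any category exceeds that category's
# "scree-like" cutoff (a vector of six cutoffs, one per category).
# flagged_ls <- apply(sweep(longest_runs(resp), 2, cutoff_ls, FUN = ">"), 1, any)
```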
Psychometric Antonyms
Psychometric antonyms (PA) leverages scale features to identify individuals who respond unusually on items that are considered “opposite.” For instance, if one endorses the statement “I get nervous before exams,” they should not endorse the statement “I feel confident before taking exams.” To conduct a psychometric antonym analysis, one calculates the correlation between every item pair and finds item pairs with the most negative correlations (Goldberg & Kilkowski, 1985). Then a correlation is calculated for each individual between scores on the item pairs. This should be negative to reflect the fact these items are considered antonyms based on the empirical evidence. In prior research, psychometric antonyms were helpful for detecting IER related to ignoring negatively worded items (Huang et al., 2012).
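A minimal sketch of the PA index follows; the number of antonym pairs retained (here 10) is an assumption rather than a value reported in the original sources.

```r
# Psychometric antonyms: find the item pairs with the most negative inter-item
# correlations, then compute each respondent's correlation across those pairs
# (Goldberg & Kilkowski, 1985).
pa_index <- function(resp, n_pairs = 10) {
  r <- cor(resp)
  r[upper.tri(r, diag = TRUE)] <- NA                 # keep each pair once
  idx   <- order(r, na.last = NA)[seq_len(n_pairs)]  # most negative correlations
  pairs <- arrayInd(idx, dim(r))                     # item indices of each pair
  apply(resp, 1, function(x) cor(x[pairs[, 1]], x[pairs[, 2]]))
}

# Attentive respondents should have strongly negative values; flag values
# above an empirically derived cutoff (e.g., -0.21 in Table 3).
# flagged_pa <- pa_index(resp) > cutoff_pa
```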
Mahalanobis Distance
Mahalanobis distance (MD) estimates the distance between a response vector and the average response vector in some J-dimensional space (Mahalanobis, 1936). Formally, MD is calculated as $MD_i = \sqrt{(\mathbf{x}_i - \bar{\mathbf{x}})^{\top}\mathbf{S}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}})}$, where $\mathbf{x}_i$ is respondent $i$’s vector of item scores, $\bar{\mathbf{x}}$ is the vector of mean item scores, and $\mathbf{S}$ is the interitem covariance matrix. The squared MD asymptotically follows a central chi-square distribution with degrees of freedom equal to the number of items in the scale ($J$), assuming multivariate normality of the item scores. In previous simulation research, MD effectively detected random responding, but not IER reflecting systematic responding (Meade & Craig, 2012).
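Base R’s mahalanobis() returns squared distances, which is what the chi-square reference distribution applies to; a minimal sketch under the same assumed `resp` matrix:

```r
# Squared Mahalanobis distance of each response vector from the item means.
md_sq <- mahalanobis(resp, center = colMeans(resp), cov = cov(resp))

# The asymptotic cutoff would be qchisq(.99, df = ncol(resp)), but because
# Likert-type scores violate multivariate normality, an empirically derived
# cutoff (Table 3) is used instead.
# flagged_md <- md_sq > cutoff_md
```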
Standardized Log-Likelihood ($l_z$)
Standardized log-likelihood ($l_z$) is a person-fit statistic based on item response theory (IRT) that quantifies the discrepancy between the expected and empirical likelihood for each individual (Drasgow et al., 1985). The statistic asymptotically follows a standard normal distribution. Among IRT person-fit statistics, $l_z$ was found in previous studies to be very good at detecting random responses and some other forms of IER (Karabatsos, 2003; Meijer & Sijtsma, 2001).
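For concreteness, the sketch below computes $l_z$ for the dichotomous two-parameter logistic case (Drasgow et al., 1985); the present study used a multidimensional graded response model, for which the same standardization logic applies but the likelihood terms involve category probabilities instead.

```r
# Standardized log-likelihood l_z for a 0/1 response vector u, 2PL item
# parameters a (discrimination) and b (difficulty), and latent trait theta.
lz_2pl <- function(u, a, b, theta) {
  p  <- 1 / (1 + exp(-a * (theta - b)))              # response probabilities
  l0 <- sum(u * log(p) + (1 - u) * log(1 - p))       # observed log-likelihood
  e  <- sum(p * log(p) + (1 - p) * log(1 - p))       # expected log-likelihood
  v  <- sum(p * (1 - p) * log(p / (1 - p))^2)        # variance of log-likelihood
  (l0 - e) / sqrt(v)
}

# Flag response patterns falling below the normal-theory cutoff of -2.326.
# flagged_lz <- lz_values < -2.326
```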
In summary, we described six methods designed to detect various manifestations of IER. The rationale for evaluating these six IER detection methods is that they are representative of methods designed to identify particular kinds of IER identified in the literature review. It is clear that some IER detection methods were rigorously developed based on psychometric theory, such as $l_z$, which assumes an IRT model. Other methods were designed based on heuristics that were found to work in certain situations, such as LS. Given the design and intended purposes of certain IER detection methods (Table 1), some IER detection methods were expected to perform relatively well for certain types of IER (e.g., LS for long strings of the same response, psychometric antonyms for missing negatively worded items). However, we had no expectations about the performance of more general methods like $l_z$ and MD, which can potentially detect many types of IER. Thus, we made no explicit hypothesis about the relative efficacy of each IER detection method for different types of IER.
Treatment of IER
IER detection methods identify response patterns that potentially reflect IER. However, there is little guidance as to what the next step should be. We therefore examined how to integrate information from different detection methods. The first way was to remove respondents flagged by any IER detection method. This would result in the largest reduction of the sample compared with using a single method. This approach was intuitive and appealing because no one knows what kind of IER may be in the data, and it made sense to minimize any potential IER lurking in the data. However, it could lead to low specificity and leave the smallest sample size for subsequent analyses. We label this strategy the “kitchen sink” (or KS) approach.
Another option is to remove respondents based on a subset of the IER detection methods. We therefore tried various combinations of IER detection methods and identified the combination that achieved the greatest sensitivity for each type of IER while maintaining specificity. This is referred to as the “best subset” (or BS) approach (i.e., the approach combining the IER detection methods with the highest sensitivity for each type of IER). The BS approach was expected to result in lower specificity than individual IER methods, though higher than the KS approach.
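As a concrete illustration, the following sketch shows how the two rules could be implemented once each detection method has produced a logical flag vector. The flag names, and the composition of the best subset (MAD, IRV, LS, and $l_z$, the subset ultimately identified in Simulation 1), are illustrative.

```r
# Hypothetical logical vectors, one per IER detection method, each of length
# equal to the number of respondents (TRUE = flagged).
flag_ks <- flag_mad | flag_irv | flag_ls | flag_pa | flag_md | flag_lz  # kitchen sink
flag_bs <- flag_mad | flag_irv | flag_ls | flag_lz                      # best subset

# Data cleansing then drops every flagged respondent, e.g.:
# resp_clean <- resp[!flag_bs, ]
```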
In the following sections, we first evaluate through simulations how well different manifestations of IER can be detected by each IER detection method in terms of sensitivity and specificity. Second, we evaluate the KS and BS approaches to combining information from different IER detection methods. Finally, we evaluate how validity evidence for scores on a scale is influenced by IER and by IER cleansing procedures.
Simulation Studies
Simulation 1: Sensitivity and Specificity Evaluation
In the first simulation study, the goal was to evaluate the sensitivity and specificity of the six IER indices with respect to different types of IER. Sensitivity refers to the proportion of respondents who truly responded with insufficient effort that an index flags; specificity refers to the proportion of respondents who truly did not respond with IER that an index correctly leaves unflagged. To this end, we first examined the null distribution for each IER index and established a cutoff for flagging response patterns suspected of IER. These procedures are described in the following section.
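For reference, the two rates can be computed per replication as follows; `flag` and `truth` are assumed names for the logical vector of flagged respondents and the logical vector marking simulees generated with IER.

```r
# Proportion of true IER respondents that were flagged (sensitivity) and
# proportion of non-IER respondents that were left unflagged (specificity).
sensitivity <- mean(flag[truth])
specificity <- mean(!flag[!truth])
```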
Preliminary Analysis: Establishing the Cutoffs
Data were simulated based on the ACT Engage scale, which measures Student Motivation, Social Engagement, and Self-Regulation as indicators of college preparedness (ACT, 2016). Consistent with previous research, a three-factor model was fit to the data assuming a polytomous graded response model with six response categories ranging from strongly disagree (0) to strongly agree (5) (ACT, 2016; Le, Casillas, Robbins, & Langley, 2005). The total scale has 108 items, 31 of which are negatively worded. Sixty-three items belong to Student Motivation, 21 to Social Engagement, and 24 to Self-Regulation. In different conditions of the study, we reduced the test length to 27 and 54 items, preserving the proportion of negatively worded items and of items per factor. To generate data, latent trait values for 10,000 simulees were drawn from a multivariate normal distribution; the correlations between dimensions were fixed at the empirical correlations from the original calibration (Table 2). For responders with no IER, responses were generated from the underlying IRT model (i.e., the three-dimensional graded response model). All cutoffs for the different test lengths are reported in Table 3. Note that we chose a nominal Type I error rate of .01 to identify the cutoffs. This was done to counteract multiple testing when various IER detection methods are combined.
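A minimal sketch of this data-generating step is given below, assuming a simple-structure multidimensional graded response model. The item parameter objects `a` (a J x 3 slope matrix with one nonzero entry per row) and `b` (a J x 5 matrix of ordered thresholds) stand in for the Engage calibration values, which are not reproduced here.

```r
library(MASS)  # for mvrnorm()

# Latent-trait correlations from Table 2 (Motivation, Social Engagement,
# Self-Regulation).
Sigma <- matrix(c(1,    .634, .641,
                  .634, 1,    .507,
                  .641, .507, 1), nrow = 3)

# Generate IER-free responses from a graded response model with six
# categories scored 0-5.
gen_responses <- function(n, a, b, Sigma) {
  theta <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)  # correlated latent traits
  eta   <- theta %*% t(a)                             # n x J linear predictors
  t(sapply(seq_len(n), function(i) {
    sapply(seq_len(nrow(a)), function(j) {
      p_ge <- c(1, 1 / (1 + exp(-(eta[i, j] - b[j, ]))), 0)  # P(X >= k), k = 0,...,6
      sample(0:5, 1, prob = -diff(p_ge))                     # category probabilities
    })
  }))
}
```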
Table 2.
Correlation Matrix Between Latent Engage Variables.
| | Motivation | Social Engagement | Self-Regulation |
|---|---|---|---|
| Motivation | 1.000 | 0.634 | 0.641 |
| Social Engagement | 0.634 | 1.000 | 0.507 |
| Self-Regulation | 0.641 | 0.507 | 1.000 |
Table 3.
Cutoffs for Different IER Detection Methods.
| IER detection method | Used null distribution | 27-item cutoff | 54-item cutoff | 108-item cutoff | Average specificity |
|---|---|---|---|---|---|
| MAD | Parametric bootstrap | 2.170 | 1.860 | 1.680 | .990 |
| IRV | Parametric bootstrap | 1.260 | 1.260 | 1.260 | .990 |
| LS | “Scree-like” rule | 3, 3, 3, 4, 5, 5 | 3, 3, 3, 5, 5, 8 | 3, 3, 4, 6, 6, 12 | .974 |
| PA | Parametric bootstrap | −0.210 | −0.210 | −0.230 | .990 |
| MD | Parametric bootstrap | 71.590 | 118.710 | 206.350 | .991 |
| $l_z$ | Standard normal distribution | −2.326 | −2.326 | −2.326 | .998 |
Note. IER = insufficient effort responding; MAD = mean absolute difference; IRV = intra-individual response variability; LS = long string; PA = psychometric antonyms; MD = Mahalanobis distance; $l_z$ = standardized log-likelihood. The LS analysis cutoffs correspond to the first category, second category, and so on for all six categories. The nominal specificity rate is averaged across varying scale lengths.
MAD, IRV, and PA have no known asymptotic null distributions, so we generated a sample of 10,000 response vectors from the assumed IRT model. We then applied each IER detection method to this sample and established a null distribution for each statistic at each scale length. This approach mirrors the outlier detection practice implemented in common statistical software, such as the R package PerFit, and used by operational testing organizations (Tendeiro, Meijer, & Niessen, 2016; Van Krimpen-Stoop & Meijer, 2002). For LS, we generated a null distribution as well and used Johnson’s (2005) “scree-like” approach to identify cutoffs.
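A minimal sketch of this parametric-bootstrap step, reusing the hypothetical helper functions sketched in the method descriptions above, might look as follows; the object names and the choice of the .99 (or .01) quantile to match a nominal Type I error rate of .01 are assumptions.

```r
# IER-free null sample generated from the fitted graded response model.
null_resp <- gen_responses(10000, a, b, Sigma)

# Empirical cutoffs at a nominal Type I error rate of .01.
cutoff_mad <- quantile(mad_index(null_resp, neg_items), probs = .99)  # flag large MAD
cutoff_irv <- quantile(irv_index(null_resp), probs = .01)             # flag small IRV
cutoff_pa  <- quantile(pa_index(null_resp), probs = .99)              # flag near-zero PA
cutoff_md  <- quantile(mahalanobis(null_resp, colMeans(null_resp), cov(null_resp)),
                       probs = .99)                                   # flag large squared MD
```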
Both MD and $l_z$ have known asymptotic null distributions: the squared MD follows a central chi-square distribution, and $l_z$ follows a standard normal distribution. However, both statistics will most likely not follow their asymptotic null distributions in practice. MD assumes that the data follow a multivariate normal distribution, whereas survey data are often ordinal. In this study, MD was found to deviate from a central chi-square distribution. Specifically, the variance of the null distribution was much larger than expected, especially for longer tests. Using the asymptotic cutoff would have resulted in flagging almost half of the sample in some conditions even when no IER was present. Therefore, we used an empirically derived cutoff for MD.
The $l_z$ statistic assumes a known latent ability. Using estimated latent abilities causes the null distribution of $l_z$ to have a smaller variance than the standard normal distribution, which leads to larger specificity rates (Sinharay, 2015; Snijders, 2001). It would be more appropriate to use an asymptotically correct statistic here, such as $l_z^*$ (Snijders, 2001). However, it is not clear whether the derivations underlying $l_z^*$ for the dichotomous case, or its extensions to the polytomous case, carry over to multidimensional IRT models. The bias due to ability estimation using maximum likelihood was found to be negligible for practical purposes across different test lengths. Therefore, in this study we retained the $l_z$ statistic and used the theoretical cutoff of −2.326 (one-tailed significance test with α = .01). That is, a response pattern was flagged if $l_z$ was less than −2.326.
In summary, we used a parametric bootstrap to generate null distributions and identify empirical cutoffs for four indices: MAD, IRV, PA, and MD. To be consistent with popular practice, we did not generate the cutoffs for LS and $l_z$ the same way. For LS, we followed the “scree-like” approach proposed by Johnson (2005) to find the cutoffs. For $l_z$, given its asymptotic normal distribution, a cutoff of −2.326 was used, which corresponds to a specificity of .990 under the standard normal distribution. Consequently, the empirical specificity rates for LS and $l_z$ differed slightly from .990. For LS, the empirical specificity rate was .974, meaning that using these cutoffs results in some over-flagging. For $l_z$, the theoretical cutoff led to conservative flagging, with an empirical specificity rate of .998, which was expected. As explained earlier, a correction for $l_z$ that accounts for the uncertainty in ability estimation would probably lead to better performance. In spite of the slightly liberal (LS) and slightly conservative ($l_z$) flagging rates, we did not modify the cutoffs for LS or $l_z$ because we followed popular practices. Future research should address these limitations.
Main Simulation
A percentage of respondents was simulated to respond with IER for a portion of their response vector. The proportion of respondents exhibiting IER will be referred to as the “prevalence” of IER, whereas the proportion of items in their response vector affected by IER will be referred to as the “severity.” For respondents with no IER, responses followed the underlying IRT model (i.e., the three-dimensional graded response model). We simulated 500 and 5,000 respondents with three severity percentages (25%, 50%, 100%) and three prevalence percentages (10%, 20%, 30%). These conditions mimic similar IER studies (Hong & Cheng, 2019b; Wang, Xu, & Shang, 2018). We simulated four different types of IER: middle responses generated from a binomial distribution over the six categories with p = .5 (so that the most frequent responses are 2 or 3, followed by 1 or 4, and then 0 or 5); uniform or random responding, where each response category had an equal chance of being endorsed; reverse carelessness, where we reversed the endorsement of negatively worded items (e.g., 5 to 0, 4 to 1, etc.); and overly consistent responding, where respondents respond with a string of a single category. We also varied the total test length from 27 to 54 to 108 items. In sum, we simulated 4 types of carelessness, 9 combinations of severity and prevalence, and 3 test lengths for a total of 108 conditions. Each condition was replicated 100 times, and results were averaged across replications. We evaluated each IER detection method by calculating the average sensitivity and specificity for detecting individuals who respond with IER across the 100 replications.
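The sketch below illustrates one way the four IER types could be injected into the affected portion of a simulee’s response vector; the function and argument names are ours, not the authors’ code.

```r
# x:         a simulee's response vector (scores 0-5)
# ier_items: indices of the items affected by IER (their number reflects severity)
# neg_items: indices of negatively worded items
inject_ier <- function(x, ier_items, type, neg_items) {
  n_ier <- length(ier_items)
  if (type == "middle") {
    x[ier_items] <- rbinom(n_ier, size = 5, prob = .5)   # mode at categories 2 and 3
  } else if (type == "random") {
    x[ier_items] <- sample(0:5, n_ier, replace = TRUE)   # uniform over categories
  } else if (type == "reverse") {
    rev_items <- intersect(ier_items, neg_items)
    x[rev_items] <- 5 - x[rev_items]                     # 5 -> 0, 4 -> 1, ...
  } else if (type == "same") {
    x[ier_items] <- sample(0:5, 1)                       # one category repeated
  }
  x
}
```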
Results
In general, we found consistent trends across different test lengths and sample sizes. Therefore, we present only the results for a 54-item scale with 5,000 respondents and, for the remainder of this article, note differences due to test length or sample size when relevant. For results for all scale lengths and sample sizes, please contact the authors. Table 4 shows the average specificity rates for each individual IER detection method when there was IER in the data. In general, most methods maintained the nominal specificity rate of .990. Similar to when there was no IER, LS had a smaller specificity rate, and $l_z$ had a larger specificity rate across conditions. IER detection methods that relied on distributional assumptions (e.g., MD) had larger specificity rates in some conditions, such as when the IER manifestation was random or middle responding. The $l_z$ statistic maintained specificity rates larger than the nominal rate implied by its asymptotic critical value. The consistency of $l_z$’s performance could be due to the fact that we assumed knowledge of the underlying IRT model, an assumption MD does not require.
Table 4.
Specificity Rates—The True Negative Rate, Proportion of Non-IER Correctly Identified—for a 54-Item Test.
| Severity (%) | Pre. | MAD | IRV | LS | PA | MD | $l_z$ | BS | KS |
|---|---|---|---|---|---|---|---|---|---|
| Middle IER | |||||||||
| 25 | 10 | .989 | .990 | .976 | .990 | .997 | .998 | 0.955 | 0.943 |
| 50 | 10 | .989 | .990 | .976 | .991 | .999 | .998 | 0.956 | 0.946 |
| 100 | 10 | .989 | .990 | .976 | .993 | .997 | .998 | 0.955 | 0.947 |
| 25 | 20 | .989 | .990 | .976 | .989 | .999 | .998 | 0.955 | 0.945 |
| 50 | 20 | .989 | .990 | .977 | .992 | 1.000 | .998 | 0.955 | 0.948 |
| 100 | 20 | .989 | .990 | .977 | .991 | .999 | .998 | 0.955 | 0.947 |
| 25 | 30 | .989 | .990 | .976 | .989 | 1.000 | .998 | 0.955 | 0.946 |
| 50 | 30 | .989 | .990 | .976 | .992 | 1.000 | .998 | 0.956 | 0.949 |
| 100 | 30 | .989 | .990 | .976 | .994 | 1.000 | .998 | 0.955 | 0.950 |
| Random IER | |||||||||
| 25 | 10 | .989 | .990 | .976 | .990 | .998 | .998 | 0.955 | 0.945 |
| 50 | 10 | .989 | .990 | .977 | .990 | 1.000 | .998 | 0.955 | 0.947 |
| 100 | 10 | .989 | .990 | .976 | .994 | 1.000 | .998 | 0.955 | 0.950 |
| 25 | 20 | .989 | .990 | .976 | .989 | 1.000 | .998 | 0.955 | 0.946 |
| 50 | 20 | .989 | .990 | .976 | .991 | 1.000 | .998 | 0.955 | 0.948 |
| 100 | 20 | .989 | .990 | .976 | .990 | 1.000 | .998 | 0.955 | 0.948 |
| 25 | 30 | .989 | .990 | .977 | .989 | 1.000 | .998 | 0.956 | 0.945 |
| 50 | 30 | .989 | .990 | .976 | .992 | 1.000 | .998 | 0.955 | 0.949 |
| 100 | 30 | .989 | .990 | .977 | .994 | 1.000 | .998 | 0.955 | 0.950 |
| Reverse IER | |||||||||
| 25 | 10 | .989 | .990 | .976 | .990 | .992 | .998 | 0.955 | 0.940 |
| 50 | 10 | .989 | .990 | .977 | .989 | .993 | .998 | 0.956 | 0.940 |
| 100 | 10 | .989 | .990 | .977 | .988 | .992 | .998 | 0.955 | 0.938 |
| 25 | 20 | .989 | .990 | .976 | .989 | .993 | .998 | 0.955 | 0.940 |
| 50 | 20 | .989 | .990 | .977 | .988 | .994 | .998 | 0.955 | 0.941 |
| 100 | 20 | .989 | .990 | .977 | .987 | .992 | .998 | 0.955 | 0.937 |
| 25 | 30 | .989 | .990 | .976 | .988 | .993 | .998 | 0.956 | 0.940 |
| 50 | 30 | .989 | .990 | .976 | .987 | .994 | .998 | 0.955 | 0.939 |
| 100 | 30 | .989 | .990 | .977 | .980 | .992 | .998 | 0.954 | 0.929 |
| Same response IER | |||||||||
| 25 | 10 | .989 | .990 | .976 | .988 | .996 | .998 | 0.955 | 0.941 |
| 50 | 10 | .989 | .990 | .977 | .985 | .994 | .998 | 0.956 | 0.937 |
| 100 | 10 | .989 | .990 | .976 | .991 | .984 | .998 | 0.955 | 0.934 |
| 25 | 20 | .989 | .990 | .976 | .985 | .997 | .998 | 0.955 | 0.939 |
| 50 | 20 | .989 | .990 | .976 | .985 | .993 | .998 | 0.955 | 0.936 |
| 100 | 20 | .989 | .990 | .976 | .991 | .960 | .998 | 0.956 | 0.913 |
| 25 | 30 | .989 | .990 | .976 | .984 | .997 | .998 | 0.955 | 0.939 |
| 50 | 30 | .989 | .990 | .976 | .986 | .991 | .998 | 0.955 | 0.936 |
| 100 | 30 | .989 | .990 | .976 | .992 | .908 | .998 | 0.955 | 0.865 |
Note. IER = insufficient effort responding; Pre. = prevalence rate expressed as a percentage; MAD = mean absolute difference; IRV = intraindividual response variability; LS = long string; PA = psychometric antonyms; MD = Mahalanobis distance; $l_z$ = standardized log-likelihood; BS = best subset; KS = “kitchen sink” approach.
Table 5 presents the average sensitivity rate across conditions using each individual IER detection method. IER was better detected on longer scales because they provide more information about the individual. Different IER detection methods also flagged different types of IER with varying levels of sensitivity. $l_z$ was the best method for detecting middle responding at low to mid severity, where it flagged about .270 to .500 of the IER, respectively. IRV detected middle responding better when IER severity was 100%, with sensitivity above .880. $l_z$ was the best method for detecting random responding, with sensitivity ranging from .460 to .990 across conditions. MAD detected inappropriate item scores on negatively worded items only when the severity was 50% or greater, where it detected .300 to .620 of the IER. None of the methods were adequate for detecting negatively worded IER at low severity. With severity of 100%, LS was best at detecting respondents who selected many of the same responses on sequential items. IRV was also able to detect same response IER perfectly, but only when severity was 100%. It is worth noting that $l_z$ was a close contender as the “best” IER detector when flagging any type of IER.
Table 5.
Sensitivity Rates—the True Positive Rate, Proportion of IER Correctly Identified—for a 54-Item Test.
| Severity (%) | Pre. | MAD | IRV | LS | PA | MD | $l_z$ | BS | KS |
|---|---|---|---|---|---|---|---|---|---|
| Middle IER | |||||||||
| 25 | 10 | .002 | .022 | .039 | .017 | .032 | .271 | 0.327 | 0.339 |
| 50 | 10 | .000 | .060 | .085 | .030 | .079 | .492 | 0.591 | 0.609 |
| 100 | 10 | .000 | .884 | .332 | .087 | .001 | .148 | 0.951 | 0.953 |
| 25 | 20 | .002 | .022 | .038 | .018 | .009 | .271 | 0.327 | 0.338 |
| 50 | 20 | .000 | .061 | .085 | .029 | .016 | .496 | 0.596 | 0.611 |
| 100 | 20 | .000 | .882 | .334 | .058 | .000 | .148 | 0.949 | 0.952 |
| 25 | 30 | .002 | .023 | .039 | .019 | .002 | .272 | 0.325 | 0.339 |
| 50 | 30 | .000 | .060 | .085 | .031 | .003 | .496 | 0.593 | 0.610 |
| 100 | 30 | .000 | .882 | .336 | .036 | .000 | .148 | 0.950 | 0.952 |
| Random IER | |||||||||
| 25 | 10 | .004 | .001 | .037 | .023 | .194 | .469 | 0.491 | 0.518 |
| 50 | 10 | .001 | .000 | .047 | .039 | .541 | .873 | 0.879 | 0.889 |
| 100 | 10 | .000 | .000 | .094 | .087 | .804 | .990 | 0.990 | 0.992 |
| 25 | 20 | .004 | .001 | .036 | .022 | .071 | .473 | 0.494 | 0.506 |
| 50 | 20 | .001 | .000 | .047 | .038 | .195 | .872 | 0.876 | 0.884 |
| 100 | 20 | .000 | .000 | .095 | .071 | .262 | .990 | 0.991 | 0.992 |
| 25 | 30 | .004 | .002 | .037 | .022 | .020 | .470 | 0.491 | 0.506 |
| 50 | 30 | .001 | .000 | .047 | .038 | .041 | .874 | 0.880 | 0.885 |
| 100 | 30 | .000 | .000 | .097 | .049 | .024 | .990 | 0.991 | 0.991 |
| Reverse IER | |||||||||
| 25 | 10 | .066 | .012 | .044 | .025 | .017 | .031 | 0.142 | 0.171 |
| 50 | 10 | .308 | .020 | .061 | .056 | .021 | .142 | 0.405 | 0.449 |
| 100 | 10 | .621 | .375 | .107 | .257 | .023 | .501 | 0.766 | 0.822 |
| 25 | 20 | .067 | .012 | .043 | .026 | .014 | .032 | 0.142 | 0.171 |
| 50 | 20 | .308 | .020 | .062 | .058 | .017 | .143 | 0.404 | 0.446 |
| 100 | 20 | .619 | .375 | .108 | .234 | .016 | .496 | 0.766 | 0.818 |
| 25 | 30 | .067 | .012 | .044 | .025 | .012 | .033 | 0.142 | 0.169 |
| 50 | 30 | .308 | .019 | .061 | .059 | .014 | .142 | 0.402 | 0.444 |
| 100 | 30 | .617 | .374 | .108 | .197 | .012 | .496 | 0.766 | 0.807 |
| Same response IER | |||||||||
| 25 | 10 | .092 | .046 | 1.000 | .037 | .137 | .414 | 1.000 | 1.000 |
| 50 | 10 | .345 | .209 | 1.000 | .055 | .056 | .551 | 1.000 | 1.000 |
| 100 | 10 | .666 | 1.000 | 1.000 | .000 | .000 | .499 | 1.000 | 1.000 |
| 25 | 20 | .095 | .047 | 1.000 | .039 | .029 | .417 | 1.000 | 1.000 |
| 50 | 20 | .344 | .210 | 1.000 | .039 | .005 | .551 | 1.000 | 1.000 |
| 100 | 20 | .664 | 1.000 | 1.000 | .000 | .000 | .498 | 1.000 | 1.000 |
| 25 | 30 | .091 | .046 | 1.000 | .037 | .009 | .418 | 1.000 | 1.000 |
| 50 | 30 | .344 | .210 | 1.000 | .030 | .002 | .551 | 1.000 | 1.000 |
| 100 | 30 | .667 | 1.000 | 1.000 | .000 | .000 | .502 | 1.000 | 1.000 |
Note. IER = insufficient effort responding; Pre. = prevalence rate expressed as a percentage; MAD = mean absolute difference; IRV = intra-individual response variability; LS = long string; PA = psychometric antonyms; MD = Mahalanobis distance; $l_z$ = standardized log-likelihood; BS = best subset; KS = “kitchen sink” approach.
We also evaluated the KS and BS approaches, which combined information from different IER detection methods. Specificity and sensitivity results are shown in Tables 4 and 5, respectively. The KS approach flagged a respondent if he or she was flagged by any IER detection method. The BS approach used the IER detection methods that had the highest sensitivity for each IER type, given that the methods generally maintained specificity well. This led to the identification of MAD, IRV, LS, and $l_z$ as the best subset because they achieved the greatest sensitivity for the respective types of IER. A participant flagged by any of these four indices was flagged by the BS approach.
As expected, the KS approach and BS resulted in lower specificity rates compared to using a single method due to multiple testing. For all conditions, however, the BS approach led to specificity rates of .950 or above, which indicated that it maintained the specificity rate well. In contrast, the KS approach led to specificity rates as low as .865 in certain conditions.
KS and BS both led to improved sensitivity for any type of IER that may be in the data. Moreover, there were only small differences between the two approaches. In other words, removing PA and MD did not drastically reduce the ability to detect any IER. This suggests that PA and MD flagged individuals who were also flagged by other methods. For instance, PA was designed to flag reverse IER, but MAD clearly outperformed PA across simulation conditions. Therefore, there was little to gain from also using PA as an indicator of IER. MD likewise performed similarly to the $l_z$ statistic across conditions, which suggests redundancy among the participants flagged by MD. These findings indicate that it may be unnecessary to use every available IER detection method for the purpose of flagging individuals.
Ultimately, the first simulation showed that different IER detection methods were able to detect different types of IER with varying specificity and sensitivity. Moreover, IER detection accuracy varied depending on the amount of IER and the underlying IER mechanism. Based on the comparison of KS and BS, we recommend that researchers use a subset of the available IER detection methods to maximize the ability to detect specific types of IER while maintaining a large sample. Our simulation showed that leveraging MAD, IRV, LS, and $l_z$ allowed us to identify four common IER types with a specificity rate of .950 or above averaged across conditions and sensitivity rates of .140, .400, and .760 for severities of 25%, 50%, and 100%, respectively, averaged across conditions. When the underlying IER mechanism is unknown, the BS approach formulated in this study (i.e., MAD, IRV, LS, and $l_z$, with cutoffs established in the preliminary analysis) is recommended because it kept the specificity rate controlled while reaching higher sensitivity than applying any single IER detection method. That said, we need to point out that the sensitivity was not very high unless the severity rate was high for certain types of IER. In other words, IER on a small percentage of items is very difficult to detect. Further research is warranted to improve the sensitivity of detecting certain types of IER, such as IER with low severity.
When there was low severity of IER in a response vector (25%), the simulated respondent was less likely to be flagged by any IER detection method, with sensitivity as low as .140 in some conditions across IER types. This was true for all IER detection methods except $l_z$. For some IER types, $l_z$ performed better when fewer responses in the vector were affected by IER. For instance, $l_z$ could detect individuals with middle IER at a severity of 50% with power around .500, compared to power of about .150 at a severity of 100%. If no information about latent ability remains in the response vector, $l_z$ cannot detect the individual with IER. Prevalence had little impact on the ability to detect individuals who responded with IER.
Middle IER was difficult to detect at 25% severity. However, it was feasible to detect middle IER at greater severity when IRV and $l_z$ were combined. Random IER was generally easy to detect across conditions. When the severity was 25% or 50%, reverse IER was difficult to detect; none of the IER detection methods worked well for this type of IER except when the severity was 100%. Same response IER was the easiest to detect at any severity level because indicators such as LS were always able to detect it.
It remains unclear whether using a combination of indices to remove apparent IER leads to more accurate validity evidence. Next, we examine how different IER cleansing procedures may be used when evaluating the psychometric properties of a scale and how they affect validity evidence.
Simulation 2: Impact of Data Cleansing on Validity Evidence
Using information from the first simulation, the goal of the second simulation was to assess how IER and data cleansing procedures influenced psychometric properties of scores derived from a scale. One would expect that successful removal of responses affected by IER would produce more accurate validity evidence to support the interpretation of test scores for a certain use. However, some real data analyses suggest that IER detection has little impact on validity evidence derived from response data (Steedle et al., 2019). The current study builds on this finding to answer the following questions: Will using a combination of IER detection methods make a difference? Does IER detection lead to improved accuracy of validity evidence?
With this in mind, the goal of the second simulation study was to demonstrate how to best integrate information from different detection methods to mitigate IER’s potential impact on validity evidence. We investigated the impact of IER on reliability estimates for each of the three subscales and the total scale by calculating coefficient alpha before and after removing response data with suspected IER. We also evaluated the impact of IER on external validity by generating a variable correlating at 0.3 with each subscale, then correlating the sum score of each subscale and total score with the criterion variable. Impact was evaluated based on empirical bias:
$$\text{Bias}_{r} = \hat{\theta}_{r}^{\,\text{no IER}} - \hat{\theta}_{r}^{\,\text{IER}},$$
which we defined as the difference between the estimated alpha or predictive validity correlation coefficient when there was no IER in the data matrix, $\hat{\theta}_{r}^{\,\text{no IER}}$, and the estimate of the same parameter after adding simulated IER to the same data matrix, $\hat{\theta}_{r}^{\,\text{IER}}$, in replication $r$. We averaged the empirical bias across the 100 replications for each condition. We also evaluated the performance of each statistic with an empirical root mean square error (RMSE):
$$\text{RMSE} = \sqrt{\frac{1}{100}\sum_{r=1}^{100}\left(\hat{\theta}_{r}^{\,\text{no IER}} - \hat{\theta}_{r}^{\,\text{IER}}\right)^{2}}.$$
For brevity, we use the terms empirical bias and empirical RMSE interchangeably with bias and RMSE for the rest of the article.
To investigate the impact of data cleansing, we evaluated the bias and RMSE of the validity statistics in three scenarios. In the first scenario, we used the full data set containing both normal and IER responders. The second scenario used the KS approach to remove IER based on all identified IER detection statistics. The third scenario used the identified BS of IER detection methods to remove IER. Data were generated the same way as in the first simulation.
Results
In the data set with no IER, the average correlation of the total score with the criterion variable was .332, and coefficient alpha was .930 for the 54-item test. Because all subscale analyses showed similar patterns as we varied the test length, we present information based on the total test only. Figure 1 presents the empirical bias and RMSE for the predictive validity coefficient based on the total score of a 54-item test when using the entire data set with IER. In general, the results show that IER attenuated the relation of the total score with the criterion variable. Moreover, the RMSE increased when there was more IER of any type. Increasing both the prevalence and the severity increased the negative impact of IER. Random, middle response, and same response IER all had negative impacts on predictive validity evidence. For instance, the bias and RMSE reached up to .150 and .160, respectively, in certain conditions. Reverse IER had a negative impact, but less than the other types of IER; its bias and RMSE reached up to .071 and .053, respectively. It is important to note that the attenuating effect may not hold in different scenarios because the data were generated assuming there was no IER in the criterion variable. If there were IER in the criterion variable, the correlation coefficient could be inflated or deflated depending on the underlying mechanism (Hong & Cheng, 2019a).
Figure 1.
The empirical bias and root mean square error for the predictive validity using the total score for a 54-item test. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
Figures 2 and 3 present the empirical bias and RMSE of the correlation between the total score for a 54-item test and the criterion variable after employing the KS approach or the BS of IER detection methods, respectively. Across simulation conditions, the bias and RMSE never exceeded .025 and .026, respectively, which is substantially better than when IER was left in the data. In general, longer scales and subscales benefited more from either data cleansing procedure in the sense that there were greater reductions in empirical RMSE and bias. Moreover, there were negligible differences between the BS and KS approaches, again suggesting that using all of the IER detection methods was not necessary. The cost of discarding a sizable portion of the data using the KS approach is not likely worth the negligible reduction in RMSE.
Figure 2.
The empirical bias and root mean square error for the predictive validity using the total score for a 54-item test after data cleansing using a kitchen sink approach. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
Figure 3.
The empirical bias and root mean square error for the predictive validity using the total score for a 54-item test after data cleansing using a best subset approach. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
Figure 4 presents the empirical bias and RMSE of coefficient alpha for the 54-item test when the data set contained both normal responses and IER. In general, there was no systematic pattern based on the type, severity, or prevalence of IER in the data set. There were both positive and negative biases in the estimated coefficient alphas, which differed from the pattern observed for the predictive validity statistic. Bias ranged from −0.025 to 0.030, and RMSE ranged from 0.002 to 0.030.
Figure 4.
The empirical bias and root mean square error for coefficient alpha using the total test of 54-item test. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
Figures 5 and 6 present the empirical bias and RMSE for coefficient alpha using the 54-item scale after employing the KS or BS of IER detection methods, respectively. In general, we found smaller RMSE and less biased estimates of the reliability coefficient using either the KS or BS approach compared to not cleansing the data. The KS approach had a range of bias from −0.002 to 0.018. Moreover, the RMSE never exceeded 0.018 across simulation conditions.
Figure 5.
The empirical bias and root mean square error (RMSE) for coefficient alpha using the total score for a 54-item test after data cleansing using a kitchen sink approach. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
Figure 6.
The empirical bias and root mean square error (RMSE) for coefficient alpha using the total score for a 54-item test after data cleansing using a best subset approach. The light gray, gray, and black bars represent the average bias when the severity is 25%, 50%, and 100%, respectively.
The BS approach outperformed the KS approach in a few scenarios. The bias using the BS approach ranged from −0.002 to 0.014, and the RMSE ranged from 0.001 to 0.014. The BS approach also achieved smaller RMSE and less biased estimates of the reliability coefficient when the IER manifested as same response IER; in that case there was virtually no bias or variability when using the BS approach.
Unfortunately, some bias in the reliability coefficient remained even after data cleansing with either approach; the bias ranged from 0.003 to 0.016, and the RMSE ranged from 0.004 to 0.017. This was most likely because some reverse IER remained in the data even after any type of data cleansing.
When we examined longer scales, there was generally less bias and smaller RMSE across scales and subscales for coefficient alpha. Moreover, all reliability estimates had some positive empirical bias across simulation conditions. This means that our reliability estimates were smaller than expected sample reliability coefficients after using any data cleaning approach. The positive bias using either data cleansing procedure became more extreme for shorter scales and subscales.
Bias in reliability coefficients after either data cleansing procedure could occur for a few reasons. First, some of the methods used for IER detection are not independent of latent ability. For instance, IRV is a within-person standard deviation. Respondents with nonextreme latent trait levels could produce response vectors with more variability, whereas respondents at the extreme ends of the latent trait may consistently endorse the highest or lowest category. This can lead to flagging high- and low-ability respondents as overly consistent, making them indistinguishable from respondents exhibiting overly consistent IER such as same response or middle responding. The dependence between latent ability and IER detection methods could help explain the difference between the BS and KS approaches. Moreover, there is also the natural unpredictability of how the IER mechanism impacts reliability coefficients, as shown in the first simulation. Taken together, these findings suggest that it can be difficult to evaluate the impact of data cleansing. The only IER detection method considered here that conditions on latent ability is the $l_z$ statistic, which may make it an appropriate choice for data cleansing from this perspective.
In summary, data cleansing improved the recovery of predictive validity correlation coefficients. IER clearly attenuates the relation between the total score and a criterion variable. Using either the BS or the KS approach reduced the bias and RMSE of the correlation between the total score and the criterion, and the two approaches did not differ much. IER affected reliability coefficients in a less predictable fashion, sometimes inflating and sometimes deflating coefficient alpha, and there appeared to be no systematic relation between prevalence or severity and the impact of IER on reliability. In general, this simulation study demonstrated that IER and data removal can affect validity evidence when IER prevalence and severity are high. Either the KS or the BS approach performed well by making coefficient alpha less variable and less biased, but the BS approach worked best when IER manifested as overly consistent responding. The average correlation coefficients and coefficient alphas by simulation condition, before and after data cleansing, are provided in the appendix. The following section describes our analysis of real data.
Real Data Analyses
For the applied analysis, we used a new data set of individuals who responded to the Engage survey to evaluate the efficacy of the IER detection methods and the impact of IER on validity evidence. The total sample size for this analysis was 48,448 respondents. We removed any participant with missing data, which amounted to less than 5% of the total sample. Given the large sample size and the small proportion of missing data, the impact of missingness on validity evidence was considered negligible. Table 6 presents the reliability and predictive validity for each subscale and the total scale. The validity evidence suggested that the scores derived from each subscale and from the entire scale are reliable. The Motivation subscale had a greater reliability coefficient than the Social Engagement and Self-Regulation subscales, which was expected because it has more items. Predictive validity was assessed by correlating each subscale score and the total score with a criterion variable: respondents’ self-reported high school GPA. The Motivation subscale correlated the most with high school GPA (.508), followed by the total score (.500). The other subscales had less predictive validity: Social Engagement correlated .257 with GPA, and Self-Regulation correlated .411 with GPA.
Table 6.
Reliability and Predictive Validity Using a 54-Item Version of the Engage Scale.
| | Motivation | Social Engagement | Self-Regulation | Total |
|---|---|---|---|---|
| Predictive (all data) | .508 | .257 | .411 | .500 |
| Predictive (cleansed data) | .499 | .215 | .388 | .470 |
| Predictive (IER data) | .361 | .110 | .248 | .332 |
| Reliability (all data) | .930 | .835 | .841 | .938 |
| Reliability (cleansed data) | .911 | .844 | .831 | .928 |
| Reliability (IER data) | .931 | .847 | .728 | .925 |
Note. IER = insufficient effort responding.
Next, we applied the following IER detection methods: MAD, IRV, LS, and $l_z$. We used the same cutoffs as in the simulation study. Item parameters used to calculate the $l_z$ statistic were based on the calibration sample from the first wave of the Engage data. Table 7 presents descriptive statistics for the IER detection methods. IRV and LS flagged the most respondents in the data (.136 and .146 of the total sample, respectively). MAD and $l_z$ flagged fewer respondents (.072 and .004, respectively). Taken together, this suggests that there may be less purely random IER in the data than systematic responding, such as overly consistent and same responses. Meanwhile, our real data analysis showed that different types of IER lurked in the data, which is consistent with other applied IER studies (e.g., Meade & Craig, 2012). This finding further motivates the simultaneous use of different types of IER detection methods.
Table 7.
Number and Percentages of Participants Flagged and Correlation Between Indicators of IER Methods.
| | No. Flagged | MAD | IRV | LS | $l_z$ |
|---|---|---|---|---|---|
| MAD | .072 (3,328) | 1 | .131 | .125 | .043 |
| IRV | .136 (6,324) | .131 | 1 | .329 | .131 |
| LS | .146 (6,792) | .125 | .329 | 1 | .078 |
| $l_z$ | .004 (227) | .043 | .131 | .078 | 1 |
Note. IER = insufficient effort responding; MAD = mean absolute difference; IRV = intraindividual response variability; LS = long string; $l_z$ = standardized log-likelihood. Numbers in parentheses correspond to the total number of participants flagged by each individual IER detection method.
Table 7 also reports the consistency between each pair of IER detection methods based on the phi coefficient. LS and IRV had the strongest relationship, which makes sense because some individuals may have produced responses that were overly consistent. This consistency is confirmed by the simulation results, in which middle and same response types were flagged by both IRV and LS in several conditions. IRV correlated moderately with $l_z$, which was also expected because those two methods were best able to detect middle IER. Moreover, MAD flagged respondents similar to those flagged by IRV and LS. MAD and IRV likely flagged respondents who failed to notice reverse-coded questions, which again is consistent with the simulation studies. It is important to note that we could not account for severity when quantifying or comparing the amount of different types of IER in the data because none of these methods were designed for that purpose.
After flagging respondents, we removed any participant flagged by any of the four IER detection methods in the BS. This reduced the total sample size to 30,978 (a 36.1% reduction). This rate may seem high but is within the range of 2% to 50% reported in previous studies on IER (Johnson, 2005; Meade & Craig, 2012). Moreover, considering that Engage is a long assessment (108 items) and that the respondents (high school students) had no incentive to respond attentively, 36.1% may be a reasonable estimate of the proportion of respondents exhibiting IER. Validity evidence for the cleansed data is presented in Table 6. The new validity evidence showed little difference in reliability compared to the total sample. However, the predictive validity evidence changed: the predictive validity for the total score decreased from .500 to .470.
We also evaluated the psychometric properties of responses suspected to contain IER. First, we checked the face validity of our IER detection methods by inspecting the raw responses of individuals flagged by each IER detection method. Table 8 presents an example response vector flagged by each individual method. For those flagged by the MAD index, we would expect the negatively worded item scores to be similar to the other items; there appeared to be some differences, with a larger share of low response categories on the negatively worded items compared to the other items. IRV appeared to flag respondents who were overly consistent, although it may also have flagged respondents with high engagement scores. LS appeared to flag respondents with strings of the same response. $l_z$ appeared to flag individuals who responded frequently with the middle categories, although it is important to note that $l_z$ was capable of flagging many types of IER in our simulation study. Visually inspecting the flagged response patterns suggests that there was IER in this subsample of the data.
Table 8.
Raw Responses for Suspected IER Individuals for Each Type of IER Detection Method.
| Method | Example response vector |
|---|---|
| MAD | 566236424461555553642464466544451535634346524354344145 |
| IRV | 516666616461616626666663566666662626415632611566516134 |
| LS | 66615626161666666666663166666666166626666663164566314 |
| lz | 433333443533434333443334443453452633442433434342245343 |
Note. IER = insufficient effort responding; MAD = mean absolute difference; IRV = intraindividual response variability; LS = long string; lz = standardized log-likelihood. All responses are displayed as forward-coded items. Bolded values indicate reverse-worded items.
Table 6 presents the validity evidence for respondents suspected of IER. As with the "normal" individuals, most reliability coefficients showed little difference, although the Self-Regulation scale changed from .841 to .728. Moreover, the predictive validity coefficients were all attenuated relative to both the cleansed sample and the full sample. For instance, the predictive validity of the entire scale dropped from .500 to .332.
This analysis supports and expands previous analyses of the same data (Steedle et al., 2019). The current analysis suggests that different methods detect different types of IER. However, the types of IER in real data may differ from what is anticipated. In particular, purely random responding does not appear to be prevalent in the current sample, given that lz flagged only .004 of the total sample. This differs from previous work, in which .133 of participants were flagged by lz and .203 of the sample was flagged by MD. This discrepancy most likely occurred for two reasons. First, our simulations demonstrate that MD does not perform well with a large number of Likert-type items; the distributional violations can lead one to overestimate the prevalence of IER. Second, we separated the calibration sample from the flagging procedure in the current analysis. The rationale for data splitting was to remove the confounding effect of using the same data for both calibration and flagging. Concurrently calibrating and flagging participants with IER is challenging and may lead to overflagging.
Our applied analysis confirms some implications for data quality suggested by previous research. First, even if there is a large amount of IER in the sample, only small changes in the validity evidence should be expected after removing respondents suspected of IER; this is evident when examining the psychometric properties of scores for the respondents suspected of IER. Second, if there is only a small proportion of suspected IER in the sample, the psychometric properties should not be affected much. We also found that about 33.0% of the total sample may exhibit IER; however, we do not know how much IER is present within each individual response vector, which is a direction for future research. Finally, our analysis reaffirms that investigating different types of IER matters when drawing conclusions about how IER affects the psychometric properties of scores derived from a scale.
Discussion
This study investigated the efficacy of IER detection methods. We evaluated six IER detection methods and demonstrated which were best for flagging different types of IER. Furthermore, we showed how different IER detection methods can be combined to identify different types of IER, and we illustrated how different types of IER influence validity evidence such as predictive validity coefficients and internal consistency measures. We also investigated the effectiveness of two data quality control procedures in mitigating the impact of IER on validity evidence, a kitchen sink approach and a best subset approach, and both behaved in similar ways. Based on the applied analysis and simulation results, our findings converge on a single message: assessment researchers and practitioners need to think critically and carefully about IER when evaluating a single response vector or validity evidence based on responses to Likert-type, self-report items.
Our findings build on previous research with the same data set (ACT, 2016; Le et al., 2005; Steedle et al., 2019). For instance, our simulations build on the work of Steedle et al. (2019) by demonstrating how one may over-identify random responders when using MD. The current study provides evidence that MD may not follow a central chi-square distribution when the data are ordinal rather than normally distributed, which may explain why MD tends to overflag IER. This study identified four indices that were most effective for detecting different types of IER: MAD, IRV, LS, and lz. Each statistic was selected based on its effectiveness for flagging different types of IER. We also provide example code for researchers as an appendix.
Furthermore, the current study demonstrated that employing many IER detection methods may be unnecessary for flagging individuals who respond with IER. The combination of the four indices worked better than combining all six indices for detecting the types of IER considered in this study because it achieved the same detection outcomes without sacrificing as much of the data. Despite these promising results, there is certainly room for improvement. For certain types of IER, none of the detection methods, nor any combination of them, is very effective. For instance, middle and reverse IER are difficult to detect when severity is less than 25%, and reverse IER was the most difficult type to detect overall. More research should be conducted on how to detect this type of IER.
A limitation of the current study is the data generation scheme for the simulation study. We generated data based on the Engage scale. We would caution researchers against over-generalizing our findings. For example, the cutoffs generated in the current study should not be applied blindly, because other scales may have a different number of factors underlying responses, different numbers of categories per item, different numbers of items, examinees exhibiting different behaviors, and so on. MD, for instance, was shown to be affected by the number of response categories in the current study. That said, the procedure for obtaining the cutoffs presented in this study can be adopted when researchers analyze other scales.
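As one illustration of such a procedure, the sketch below derives a scale-specific cutoff empirically: simulate IER-free responses under an assumed response model (here, a crude one-factor structure standing in for parameters calibrated on one's own scale) and take an extreme quantile of the index's null distribution. Both the generating model and the chosen quantile are assumptions for illustration only.

```python
import numpy as np

def irv(X):
    # Same IRV definition as in the earlier sketch: within-person SD across items.
    return X.std(axis=1, ddof=1)

def simulate_clean_likert(n_persons, n_items, n_cat=6, rng=None):
    """Crude illustrative generator of IER-free Likert data: a one-factor
    continuous structure cut into ordered categories. Substitute the response
    model and item parameters calibrated for your own scale."""
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.normal(size=(n_persons, 1))
    loadings = rng.uniform(0.5, 1.5, size=(1, n_items))
    latent = theta @ loadings + rng.normal(size=(n_persons, n_items))
    cuts = np.quantile(latent, np.linspace(0, 1, n_cat + 1)[1:-1])
    return np.digitize(latent, cuts) + 1

# Take an extreme quantile of the index's distribution under IER-free responding
# as the scale-specific cutoff (here the 1st percentile of IRV, which would flag
# unusually low within-person variability).
rng = np.random.default_rng(7)
clean = simulate_clean_likert(20_000, 54, rng=rng)
cutoff = np.quantile(irv(clean), 0.01)
print(f"Illustrative IRV cutoff (1st percentile): {cutoff:.3f}")
```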
In addition, we evaluated only a few ways to detect IER. There is a wealth of other statistics to draw from, such as those in the more general person-fit literature (Karabatsos, 2003). Moreover, other information can be used to detect IER in a sample; for instance, one can use response times or geolocation techniques (Huang et al., 2012; TurkPrime, 2018). Such information belongs to a broader class called process data (i.e., auxiliary data collected during the measurement process that are not raw item responses). Future studies should evaluate the efficacy of using process data to identify respondents exhibiting IER.
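For example, a simple response-time screen can flag respondents whose average time per item falls below a threshold; the 2-seconds-per-item value used below is only an illustrative choice and would need to be justified for a given instrument and administration mode.

```python
import numpy as np

def flag_rapid_responders(total_seconds, n_items, seconds_per_item=2.0):
    """Flag respondents whose average time per item falls below a threshold.
    The default threshold is an illustrative assumption, not a recommendation."""
    total_seconds = np.asarray(total_seconds, dtype=float)
    return (total_seconds / n_items) < seconds_per_item

# Hypothetical usage with recorded completion times in seconds.
times = np.array([95.0, 300.0, 610.0, 48.0])
print(flag_rapid_responders(times, n_items=54))   # -> [ True False False  True]
```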
Furthermore, we examined only certain types of IER. Other forms certainly exist; for instance, "back random responding" (Clark et al., 2003) can be thought of as a mixture of systematic and random behavior in which more IER occurs toward the end of a survey. This means that scale features and IER may not be independent, as has been demonstrated previously (Shao, Li, & Cheng, 2016; Yu & Cheng, 2019). Moreover, the tendency to respond with IER may be directly related to a respondent's level on the trait measured by the scale (Falk & Cai, 2016).
Other sources of validity evidence may also be studied in the context of IER (American Educational Research Association et al., 2014). Previous studies indicate that IER can influence structural evidence when fitting a latent trait model, such as fit statistics in the structural equation modeling framework (Kim et al., 2018). As with internal consistency measures, IER can affect common fit statistics in unpredictable ways, which warrants future research.
It is worth reiterating the shortcomings of certain IER detection methods. Apart from the lz statistic, the IER detection methods are not truly independent of latent ability. Because the data were generated from a multidimensional IRT model, it is well known that the variance of responses is not constant across the latent ability continuum; respondents who are high or low on the latent ability tend to have smaller standard errors for their ability estimates. Because some IER detection methods do not take this into consideration, heuristic-based methods can flag respondents in certain ranges of ability more often. For instance, IRV, which is a direct indicator of within-person response variance, will be biased toward flagging respondents at certain ability levels. For such methods, it would therefore be prudent not to remove flagged data blindly. It is up to the researcher to gather further validity evidence to verify respondents' response vectors, possibly from other sources of information such as process data.
Moreover, IER detection methods are susceptible to the so-called "masking effect" (Hong & Cheng, 2019b; Yuan & Zhong, 2008, 2013). Some IER detection methods rest on parametric assumptions, such as a multivariate normal distribution or an IRT model, and therefore require unbiased estimates of the mean vector and covariance matrix or of the item parameters. Previous studies found that robust forms of these estimators, which weight observations according to how outlying they are, improve the detection of outliers such as IER. Such weighting also allows researchers to avoid a binary, fixed-cutoff decision about whether an observation is an outlier. Hong and Cheng (2019b) demonstrated how the lz index can be used in a robust form in IRT analysis.
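As a sketch of the robust-estimation idea (using scikit-learn's minimum covariance determinant estimator rather than the robust lz approach of Hong and Cheng, 2019b), the example below contrasts Mahalanobis distances computed from classical versus robust estimates of the mean and covariance; the data are simulated purely for illustration.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# Classical Mahalanobis distance relies on the sample mean and covariance, which
# outlying respondents can themselves distort (the masking effect). The minimum
# covariance determinant (MCD) estimator fits those parameters on the most
# concentrated subset of observations, downweighting outlying rows.
rng = np.random.default_rng(3)
X = rng.integers(1, 7, size=(2000, 20)).astype(float)   # illustrative Likert data

md_classical = EmpiricalCovariance().fit(X).mahalanobis(X)
md_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)

print("Classical MD^2, 99th percentile:", round(float(np.quantile(md_classical, 0.99)), 2))
print("Robust MD^2,    99th percentile:", round(float(np.quantile(md_robust, 0.99)), 2))
```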
Our article also provides evidence for applied researchers on when and when not to treat data as continuous. For instance, some researchers suggest that items with seven or more response categories can be treated as continuous and approximately multivariate normal (Cheng, Yuan, & Liu, 2012; Rhemtulla, Brosseau-Liard, & Savalei, 2012). Our results contradict this guidance, at least in the context of outlier or IER detection: there is a clear violation of the normality assumption when Mahalanobis distance is used as an IER detection method. To the authors' knowledge, this has not been reported in the literature and deserves further study.
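A simple way to probe this issue is to simulate IER-free ordinal data under an assumed structure, refer the squared Mahalanobis distances to the usual chi-square reference, and compare the empirical flag rate with the nominal level. The one-factor discretized structure below is only an assumption; the size and direction of any departure will depend on the scale being modeled.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, p, n_cat = 5000, 54, 6

# IER-free ordinal data: correlated continuous scores cut into ordered categories.
theta = rng.normal(size=(n, 1))
latent = theta + rng.normal(size=(n, p))
cuts = np.quantile(latent, np.linspace(0, 1, n_cat + 1)[1:-1])
X = (np.digitize(latent, cuts) + 1).astype(float)

# Squared Mahalanobis distances from the sample mean and covariance.
centered = X - X.mean(axis=0)
prec = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, prec, centered)

# Compare the empirical flag rate with the nominal 1% level under chi-square(p).
crit = chi2.ppf(0.99, df=p)
print(f"Empirical flag rate at the nominal 1% level: {(d2 > crit).mean():.3f}")
```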
In conclusion, our study contributes to data quality management for researchers and practitioners whose data may contain measurement error resulting from IER (Loken & Gelman, 2017; Maxwell, Lau, & Howard, 2015). To increase confidence in a single study, researchers would be wise to provide the research community with evidence of sufficient data quality. In practice, some researchers use structural validity evidence or internal consistency measures as indicators of data quality (TurkPrime, 2018); our study corroborates previous literature indicating that this is poor practice. Detecting IER is an important step in virtually any research context. This study provides new findings that extend the current literature and offers concrete recommendations for both applied researchers and methodologists.
Supplemental Material
Supplemental material, functions for Methods of Detecting Insufficient Effort Responding: Comparisons and Practical Recommendations by Maxwell Hong, Jeffrey T. Steedle and Ying Cheng in Educational and Psychological Measurement
Appendix
Reliability and Predictive Validity Before and After Applying Data Cleansing for a 54-Item Test.
Middle IER

| Severity | Pre. | Correlation (Original) | Correlation (Speeded) | Correlation (KS) | Correlation (BS) | Reliability (Original) | Reliability (Speeded) | Reliability (KS) | Reliability (BS) |
|---|---|---|---|---|---|---|---|---|---|
| 25 | 10 | 0.332 | 0.326 | 0.323 | 0.324 | 0.927 | 0.93 | 0.926 | 0.928 |
| 50 | 10 | 0.334 | 0.313 | 0.324 | 0.326 | 0.929 | 0.931 | 0.93 | 0.93 |
| 100 | 10 | 0.332 | 0.265 | 0.322 | 0.324 | 0.944 | 0.93 | 0.929 | 0.93 |
| 25 | 20 | 0.332 | 0.322 | 0.321 | 0.323 | 0.923 | 0.931 | 0.925 | 0.926 |
| 50 | 20 | 0.332 | 0.294 | 0.317 | 0.318 | 0.926 | 0.93 | 0.931 | 0.931 |
| 100 | 20 | 0.333 | 0.221 | 0.316 | 0.319 | 0.951 | 0.931 | 0.931 | 0.931 |
| 25 | 30 | 0.331 | 0.316 | 0.317 | 0.317 | 0.918 | 0.931 | 0.924 | 0.923 |
| 50 | 30 | 0.333 | 0.283 | 0.313 | 0.312 | 0.921 | 0.93 | 0.931 | 0.932 |
| 100 | 30 | 0.333 | 0.186 | 0.31 | 0.311 | 0.954 | 0.93 | 0.932 | 0.933 |
Random IER

| Severity | Pre. | Correlation (Original) | Correlation (Speeded) | Correlation (KS) | Correlation (BS) | Reliability (Original) | Reliability (Speeded) | Reliability (KS) | Reliability (BS) |
|---|---|---|---|---|---|---|---|---|---|
| 25 | 10 | 0.331 | 0.325 | 0.324 | 0.325 | 0.925 | 0.931 | 0.927 | 0.927 |
| 50 | 10 | 0.333 | 0.313 | 0.328 | 0.329 | 0.925 | 0.93 | 0.929 | 0.929 |
| 100 | 10 | 0.333 | 0.265 | 0.327 | 0.329 | 0.938 | 0.931 | 0.929 | 0.929 |
| 25 | 20 | 0.332 | 0.32 | 0.323 | 0.323 | 0.919 | 0.931 | 0.926 | 0.926 |
| 50 | 20 | 0.331 | 0.293 | 0.327 | 0.325 | 0.918 | 0.93 | 0.93 | 0.93 |
| 100 | 20 | 0.331 | 0.216 | 0.326 | 0.327 | 0.939 | 0.931 | 0.928 | 0.929 |
| 25 | 30 | 0.333 | 0.317 | 0.323 | 0.324 | 0.912 | 0.931 | 0.924 | 0.924 |
| 50 | 30 | 0.333 | 0.28 | 0.326 | 0.329 | 0.91 | 0.93 | 0.93 | 0.931 |
| 100 | 30 | 0.334 | 0.183 | 0.325 | 0.322 | 0.938 | 0.93 | 0.928 | 0.93 |
Reverse IER

| Severity | Pre. | Correlation (Original) | Correlation (Speeded) | Correlation (KS) | Correlation (BS) | Reliability (Original) | Reliability (Speeded) | Reliability (KS) | Reliability (BS) |
|---|---|---|---|---|---|---|---|---|---|
| 25 | 10 | 0.332 | 0.329 | 0.324 | 0.325 | 0.926 | 0.93 | 0.922 | 0.925 |
| 50 | 10 | 0.333 | 0.327 | 0.324 | 0.327 | 0.924 | 0.93 | 0.923 | 0.925 |
| 100 | 10 | 0.333 | 0.313 | 0.327 | 0.328 | 0.922 | 0.93 | 0.925 | 0.928 |
| 25 | 20 | 0.332 | 0.328 | 0.322 | 0.323 | 0.922 | 0.931 | 0.919 | 0.92 |
| 50 | 20 | 0.333 | 0.321 | 0.32 | 0.321 | 0.916 | 0.93 | 0.92 | 0.922 |
| 100 | 20 | 0.331 | 0.292 | 0.324 | 0.329 | 0.912 | 0.93 | 0.924 | 0.926 |
| 25 | 30 | 0.333 | 0.327 | 0.32 | 0.322 | 0.917 | 0.931 | 0.914 | 0.916 |
| 50 | 30 | 0.333 | 0.316 | 0.315 | 0.316 | 0.908 | 0.931 | 0.915 | 0.917 |
| 100 | 30 | 0.33 | 0.272 | 0.323 | 0.328 | 0.902 | 0.931 | 0.924 | 0.925 |
Same IER

| Severity | Pre. | Correlation (Original) | Correlation (Speeded) | Correlation (KS) | Correlation (BS) | Reliability (Original) | Reliability (Speeded) | Reliability (KS) | Reliability (BS) |
|---|---|---|---|---|---|---|---|---|---|
| 25 | 10 | 0.332 | 0.324 | 0.327 | 0.329 | 0.926 | 0.931 | 0.928 | 0.929 |
| 50 | 10 | 0.332 | 0.303 | 0.327 | 0.328 | 0.93 | 0.93 | 0.927 | 0.929 |
| 100 | 10 | 0.332 | 0.256 | 0.324 | 0.329 | 0.942 | 0.93 | 0.925 | 0.929 |
| 25 | 20 | 0.333 | 0.318 | 0.328 | 0.327 | 0.921 | 0.93 | 0.928 | 0.928 |
| 50 | 20 | 0.334 | 0.281 | 0.328 | 0.329 | 0.928 | 0.931 | 0.927 | 0.929 |
| 100 | 20 | 0.334 | 0.21 | 0.32 | 0.329 | 0.947 | 0.93 | 0.921 | 0.929 |
| 25 | 30 | 0.334 | 0.312 | 0.329 | 0.329 | 0.915 | 0.931 | 0.928 | 0.929 |
| 50 | 30 | 0.334 | 0.262 | 0.327 | 0.329 | 0.925 | 0.93 | 0.927 | 0.929 |
| 100 | 30 | 0.331 | 0.172 | 0.308 | 0.329 | 0.948 | 0.93 | 0.913 | 0.928 |
Note. IER = insufficient effort responding; Pre. = prevalence rates expressed as percentages; Original = original statistic without speededness; Speeded = speeded statistics; BS = best subset; KS = “kitchen sink” approach.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Maxwell Hong https://orcid.org/0000-0001-5984-5508
Supplemental Material: Supplemental material for this article is available online.
References
- ACT. (2016). Development and validation of ACT Engage: Technical manual. Retrieved from https://www.act.org/content/dam/act/unsecured/documents/act-engage-technical-manual.pdf
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Retrieved from https://www.apa.org/science/programs/testing/standards
- Austin E. J., Deary I. J., Gibson G. J., McGregor M. J., Dent J. B. (1998). Individual response spread in self-report scales: Personality correlations and consequences. Personality and Individual Differences, 24, 421-438.
- Berry D. T. R., Wetter M. W., Baer R. A., Larsen L., Clark C., Monroe K. (1992). MMPI-2 random responding indices: Validation using a self-report methodology. Psychological Assessment, 4, 340-345.
- Breitsohl H., Steidelmüller C. (2018). The impact of insufficient effort responding detection methods on substantive responses: Results from an experiment testing parameter invariance. Applied Psychology, 67, 284-308.
- Cheng Y., Yuan K.-H., Liu C. (2012). Comparison of reliability measures under factor analysis and item response theory. Educational and Psychological Measurement, 72, 52-67. doi: 10.1177/0013164411407315
- Clark M. E., Gironda R. J., Young R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15, 223-234.
- Curran P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4-19.
- Drasgow F., Levine M. V., Williams E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
- Dunn A. M., Heggestad E. D., Shanock L. R., Theilgard N. (2018). Intra-individual response variability as an indicator of insufficient effort responding: Comparison to other indicators and relationships with individual differences. Journal of Business and Psychology, 33, 105-121.
- Falk C. F., Cai L. (2016). A flexible full-information approach to the modeling of response styles. Psychological Methods, 21, 328-347.
- Goldberg L. R., Kilkowski J. M. (1985). The prediction of semantic consistency in self-descriptions: Characteristics of persons and of terms that affect the consistency of responses to synonym and antonym pairs. Journal of Personality and Social Psychology, 48, 82-98.
- Hauser D. J., Schwarz N. (2016). Attentive turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48, 400-407.
- Hong M., Cheng Y. (2019a). Clarifying the effect of test speededness. Applied Psychological Measurement. Advance online publication. doi: 10.1177/0146621618817783
- Hong M., Cheng Y. (2019b). Robust maximum marginal likelihood (RMML) estimation for item response theory models. Behavior Research Methods, 1, 573-588. doi: 10.3758/s13428-018-1150-4
- Huang J. L., Bowling N. A., Liu M., Li Y. (2015). Detecting insufficient effort responding with an infrequency scale: Evaluating validity and participant reactions. Journal of Business and Psychology, 30, 299-311.
- Huang J. L., Curran P. G., Keeney J., Poposki E. M., DeShon R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27, 99-114.
- Huang J. L., Liu M., Bowling N. A. (2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100, 828-845. doi: 10.1037/a0038510
- Johnson J. A. (2005). Ascertaining the validity of individual protocols from web-based personality inventories. Journal of Research in Personality, 39, 103-129.
- Karabatsos G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298. doi: 10.1207/S15324818AME1604
- Kim D. S., Reise S. P., Bentler P. M. (2018). Identifying aberrant data in structural equation models with IRLS-ADF. Structural Equation Modeling, 25, 343-358.
- Le H., Casillas A., Robbins S. B., Langley R. (2005). Motivational and skills, social, and self-management predictors of college outcomes: Constructing the student readiness inventory. Educational and Psychological Measurement, 65, 482-508.
- Loken E., Gelman A. (2017). Measurement error and the replication crisis. Science, 355, 584-585.
- Mahalanobis P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1), 49-55.
- Marianti S., Fox J.-P., Avetisyan M., Veldkamp B. P., Tijmstra J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39, 426-451. doi: 10.3102/1076998614559412
- Maxwell S. E., Lau M. Y., Howard G. S. (2015). Is psychology suffering from a replication crisis? What does "failure to replicate" really mean? American Psychologist, 70, 487-498. doi: 10.1037/a0039400
- Meade A. W., Craig S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17, 437-455. doi: 10.1037/a0028085
- Meijer R. R., Sijtsma K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135. doi: 10.1177/01466210122031957
- Niessen A. S. M., Meijer R. R., Tendeiro J. N. (2016). Detecting careless respondents in web-based questionnaires: Which method to use? Journal of Research in Personality, 63, 1-11. doi: 10.1016/j.jrp.2016.04.010
- Pauszek J. R., Sztybel P., Gibson B. S. (2017). Evaluating Amazon's Mechanical Turk for psychological research on the symbolic control of attention. Behavior Research Methods, 49, 1969-1983. doi: 10.3758/s13428-016-0847-5
- Rhemtulla M., Brosseau-Liard P. É., Savalei V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354-373. doi: 10.1037/a0029315
- Shao C., Li J., Cheng Y. (2016). Detection of test speededness using change-point analysis. Psychometrika, 81, 1118-1141. doi: 10.1007/s11336-015-9476-7
- Sinharay S. (2015). The asymptotic distribution of ability estimates: Beyond dichotomous items and unidimensional IRT models. Journal of Educational and Behavioral Statistics, 40, 511-528. doi: 10.3102/1076998615606115
- Snijders T. A. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331-342. doi: 10.1007/BF02294437
- Steedle J., Hong M., Cheng Y. (2019). The effects of inattentive responding on construct validity evidence when measuring social-emotional learning competencies. Educational Measurement: Issues and Practice, 38, 101-111.
- Suh Y., Cho S. J., Wollack J. A. (2012). A comparison of item calibration procedures in the presence of test speededness. Journal of Educational Measurement, 49, 285-311. doi: 10.1111/j.1745-3984.2012.00176.x
- Swain S. D., Weathers D., Niedrich R. W. (2008). Assessing three sources of misresponse to reversed Likert items. Journal of Marketing Research, 45, 116-131. doi: 10.1509/jmkr.45.1.116
- Tendeiro J. N. (2017). The lz(p)* person-fit statistic in an unfolding model context. Applied Psychological Measurement, 41, 44-59. doi: 10.1177/0146621616669336
- Tendeiro J. N., Meijer R. R., Niessen A. S. M. (2016). PerFit: An R package for person-fit analysis in IRT. Journal of Statistical Software, 74(5). doi: 10.18637/jss.v074.i05
- TurkPrime. (2018, September 18). After the bot scare: Understanding what's been happening with data collection on MTurk and how to stop it [Web log post]. Retrieved from https://blog.turkprime.com/after-the-bot-scare-understanding-whats-been-happening-with-data-collection-on-mturk-and-how-to-stop-it
- Van Krimpen-Stoop E. M. L. A., Meijer R. R. (2002). Detection of person misfit in computerized adaptive tests with polytomous items. Applied Psychological Measurement, 26, 164-180. doi: 10.1177/01421602026002004
- Wang C., Xu G., Shang Z. (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83, 223-254. doi: 10.1007/s11336-016-9525-x
- Weijters B., Baumgartner H., Schillewaert N. (2013). Reversed item bias: An integrative model. Psychological Methods, 18, 320-334. doi: 10.1037/a0032121
- Woods C. M. (2006). Careless responding to reverse-worded items: Implications for confirmatory factor analysis. Journal of Psychopathology and Behavioral Assessment, 28, 189-194. doi: 10.1007/s10862-005-9004-7
- Yamamoto K., Everson H. (2003). Estimating the effects of test length and test time on parameter estimation using the hybrid model. ETS Research Report Series, 1995, 277-298. doi: 10.1002/j.2333-8504.1995.tb01637.x
- Yu X., Cheng Y. (2019). A change-point analysis procedure based on weighted residuals to detect back random responding. Psychological Methods. Advance online publication. doi: 10.1037/met0000212
- Yuan K. H., Zhong X. (2008). Outliers, leverage observations, and influential cases in factor analysis: Using robust procedures to minimize their effect. Sociological Methodology, 38, 329-368. doi: 10.1111/j.1467-9531.2008.00198.x
- Yuan K. H., Zhong X. (2013). Robustness of fit indices to outliers and leverage observations in structural equation modeling. Psychological Methods, 18, 121-136. doi: 10.1037/a0031604