Author manuscript; available in PMC: 2025 Oct 1.
Published in final edited form as: Am Psychol. 2023 Jul 20;79(7):956–969. doi: 10.1037/amp0001183

A Quasi-Experimental Study Examining the Efficacy of Multimodal Bot Screening Tools and Recommendations to Preserve Data Integrity in Online Psychological Research

Melissa Simone 1, Cory J Cascalheira 2, Benjamin G Pierce 3
PMCID: PMC10799166  NIHMSID: NIHMS1916468  PMID: 37471008

Abstract

Bots are automated software programs that pose an ongoing threat to psychological research by invading online research studies, and their sophistication continues to increase over time. Despite this growing concern, research in this area has been limited to bot detection in existing datasets following an unexpected encounter with bots. The present three-condition, quasi-experimental study aimed to address this gap in the literature by examining the efficacy of three types of bot screening tools across three incentive conditions ($0, $1, and $5). Data were collected from 444 respondents via Twitter advertisements between July and September 2021. The efficacy of five Task-based (e.g., anagrams, visual search), Question-based (e.g., attention checks, ReCAPTCHA), and Data-based (e.g., consistency, meta-data) tools was examined with Bonferroni-adjusted univariate and multivariate logistic regression analyses. In general, study results suggest that bot screening tools function similarly for participants recruited across incentive conditions. Moreover, present analyses revealed heterogeneity in the efficacy of bot screening tool subtypes. Notably, the present results suggest that the least effective bot screening tools were among the most commonly used tools in the existing literature (e.g., ReCAPTCHA). In sum, study findings revealed highly effective and highly ineffective bot screening tools. Study design and data integrity recommendations for researchers are provided.

Keywords: bot screening, online research, experiment

Introduction

Bots are agent-like software programs that automate computer-based tasks (Franklin & Graesser, 2015) and they are increasingly prevalent across the Internet (Tsvetkova et al., 2017). Bots and other forms of fraudulent respondents (i.e., human participants who enroll in a study multiple times) are participating in online research studies at an alarming rate (Griffin et al., 2022; Simone, 2019; Xu et al., 2022). This is cause for concern, as bots have the capacity to submit hundreds to thousands of responses to online research within a matter of hours (Burnette et al., 2022; Prince et al., 2012; Xu et al., 2022; Yarrish et al., 2019). While some bots are programmed to perform benevolent actions (e.g., moderating hate speech on group forums; Tsvetkova et al., 2017), others contribute to costly losses across public domains ranging from network security to knowledge production (Geer, 2005; Orabi et al., 2020) because they are programmed to manipulate information maliciously (e.g., injecting misinformation in political elections; Orabi et al., 2020). Indeed, bots have far-reaching deleterious effects on psychological science that span sectors. Given the speed with which bots can enroll in online studies, researchers are often forced to decide between (a) devoting resources to sort through hundreds or thousands of responses to identify the handful of genuine human participants or (b) discarding the data and relaunching their study with new methodology. Unfortunately, both options are costly and may result in undue participant burden (e.g., if human data are discarded), loss of federal research dollars (e.g., if bots are compensated), or biased research findings (e.g., if bots are retained in the sample). Considering the consequences that bots may have, research is needed to identify effective bot prevention and detection tools.

While the exact methods that programmers use to code and deploy bots that engage in online survey research remain largely unknown, it is likely that they rely on techniques used by other common types of bots (e.g., chatbots featured on websites, social bots; Ahmad et al., 2018). Natural Language Processing techniques are the most common and widely accepted tools used to program bots across various online settings and are freely accessible (Van Rousselt, 2021; Yang et al., 2019). Natural Language Processing sits at the intersection of linguistics, machine learning, and computer science and enables computers to: (a) read, understand, derive meaning from, and measure sentiment in text and human language, (b) communicate with humans by mimicking natural human language with semantic meaning, and (c) improve their abilities to communicate with humans through iterative algorithmic learning (Sager et al., 2021; Yang et al., 2019). A famous example is ChatGPT. The longer bots developed with Natural Language Processing techniques are deployed and engage in online survey research, the more sophisticated they become at replicating human language (Sager et al., 2021), making it harder to detect and remove bots from online research studies. While financial incentives offered to research participants may draw bots to online research studies, bots may be programmed to engage in online research studies that do not offer compensation to improve their natural language processing. Thus, bots pose a significant and evolving threat to the integrity of online research that warrants the examination of efficacious tools to prevent and detect bots and investigation into the presence of bots in online research with varying incentive amounts for respondents.

In recent years, social scientists have introduced and evaluated dozens of tactics to screen for bots in online survey research (Burnette et al., 2022; Griffin et al., 2022; Simone, 2019; Xu et al., 2022). Much of the work in this area has focused on Question-Based (i.e., screening strategies that add non-substantive items to surveys solely for the purpose of detecting bots) and Data-Based (i.e., screening strategies that leverage participant data to identify inconsistencies or patterns in participant experience) bot screening tools. To this end, common Question-Based bot screening tools examined in the literature include items that are not visible to humans but detectable by bots (i.e., “honeypots”), attention checks (e.g., “please select strongly agree”), and ReCAPTCHA tools (i.e., small puzzles presented before completing a specific online activity, such as clicking pictures that correspond to text; Simone, 2019; Griffin et al., 2022; Storozuk et al., 2020). In other instances, researchers who described experiencing an overwhelming number of fraudulent respondents proposed several Data-Based screening tools which may be used to flag respondents in existing data (Griffin et al., 2022; Pozzar et al., 2020; Xu et al., 2022). Common Data-Based screening tools described in the literature include: (a) evaluating qualitative responses for duplicate phrasing or illogical answers, (b) identifying discrepancies to verifiable items, and (c) identifying identical response patterns to quantitative surveys (Simone, 2019; Griffin et al., 2022; Levi et al., 2021; Pozzar et al., 2020). Given that bots have the propensity to improve their capacity to pass many Question- and Data-Based screening tools due to algorithmic learning, we considered using common behavioral and cognitive psychological tasks (e.g., Anagrams) as an avenue to examine the efficacy of Task-Based bot screening tools.

One pertinent effectiveness-based gap in the bot detection literature relates to our understanding of (a) how many bot screening tools to use and (b) how combinations of bot screening tools increase the likelihood of eliminating bots. Two studies informed this line of inquiry. Pozzar et al. (2020) divided their bot screening tools into two categories, each with specific thresholds for eliminating respondents: a suspicious category, where each bot screening tool provided some evidence that a respondent might be a bot, but not sufficient evidence unless combined with evidence from at least two other bot screening tools, creating a threshold where respondents were eliminated if they failed at least three bot screening tools; and a fraudulent category, a group of bot screening tools where failing at least one tool provided enough evidence to remove respondents. For example, the team designated a survey completion time of <5 minutes as fraudulent, which meant that respondents failing this screening tactic were immediately removed without determining whether they failed additional screening tactics. Shaw et al. (2023) built upon this study by developing a confidence system, where confidence levels and application rules were assigned to bot screening tools to specify the confidence researchers can have in the ability to detect bots with each screening tool, including: (a) high confidence bot screening tools, where eliminating humans with a single tool was highly unlikely and, thus, only one bot screening tool should be used within this group; (b) moderate confidence bot screening tools, where information from at least two bot screening tools reduced the chances of erroneously eliminating humans; and (c) low confidence bot screening tools, a group of bot screening tools where evidence from at least three tools is necessary before classifying a respondent as a bot. Both studies established what we define as a Bot Threshold, or the number of failed bot screening tools necessary to classify a respondent as a bot. We also extend these studies by determining how Task- and Question-based tools overlap to detect potential bots. Knowing the degree to which screening tools overlap would provide evidence to guide researchers on tool selection, allowing researchers to maximize bot detection while minimizing undesirable consequences for real participants (e.g., test fatigue from completing ineffective screening tools).
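To make this kind of tiered decision rule concrete, the short Python sketch below encodes one possible version of it. The tier assignments and tool names are hypothetical assumptions for illustration; they are not the rules used by Pozzar et al. (2020) or Shaw et al. (2023).

```python
# Illustrative tiered elimination rule in the spirit of the schemes described
# above; the tier membership and tool names are hypothetical assumptions.

HIGH_CONFIDENCE = {"duplicate_submission"}                       # one failure suffices
MODERATE_CONFIDENCE = {"recaptcha", "inconsistent_age"}          # require >= 2 failures
LOW_CONFIDENCE = {"fast_completion", "odd_open_ended", "attention_check"}  # require >= 3

def classify_respondent(failed_tools: set) -> str:
    """Return 'potential_bot' or 'retain' from the set of failed screening tools."""
    if failed_tools & HIGH_CONFIDENCE:
        return "potential_bot"
    if len(failed_tools & MODERATE_CONFIDENCE) >= 2:
        return "potential_bot"
    if len(failed_tools & LOW_CONFIDENCE) >= 3:
        return "potential_bot"
    return "retain"

print(classify_respondent({"recaptcha", "fast_completion"}))                          # retain
print(classify_respondent({"fast_completion", "odd_open_ended", "attention_check"}))  # potential_bot
```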

Notably, most bot detection research has been conducted in response to an unexpected encounter with bots and thus no known study has used experimental methods to investigate the prospective effectiveness of distinct bot screening tools. All known existing contributions to this line of work are case studies that detail how the authors mitigated threats to data integrity when bots attacked their online surveys (Griffin et al., 2022; Levi et al., 2021; Pozzar et al., 2020; Simone, 2019; Storozuk et al., 2020). By using an experimental approach that replicates the features of data collection in extant case studies that report bot intrusions (i.e., health- and social-focused research, posting survey links on social media), and only varying the incentive amount offered, the present study may uncover how incentives affect the proportion of bots detected in online survey research and how incentives affect the effectiveness of bot screening tools.

The Present Study

Using a three-condition, quasi-experimental design, the present study sought to (a) explore the pass-fail rates of individual and combinations of Task-, Question-, and Data-Based bot screening tools, (b) identify the extent to which the proportion of participants identified as potential bots in online survey research differs across incentive conditions (with higher-incentive surveys expected to attract more bots), (c) examine whether bot screening tool effectiveness differs across $0, $1, and $5 incentive conditions, and (d) identify associations between individual bot screening tools and the selected Bot Threshold (≥4 failures).

Method

Participants and Procedure

Participants were 444 respondents who completed one of three surveys. Demographics are not provided given that the present study sought to solicit bot participation. Surveys were developed and administered through the Gorilla Experiment Builder platform and included six Question-Based, five Task-Based, and three Data-Based bot screening tools (see below). In addition to the bot screening tools, surveys inquired about experiences of disordered eating behavior, mood, and basic demographics to simulate a real social science survey. The decoy focus on disordered eating was selected to align with the first author's primary area of research. The content, aesthetics, and mechanics of each survey were identical except for the advertised incentive condition: the first survey offered $0 for survey completion, the second survey offered $1, and the final survey offered $5. The $0 incentive condition was selected to evaluate the extent to which bots engage in research studies without compensation. The $1 and $5 incentive conditions were informed by past research documenting experiences with bots (e.g., MTurk, online research) during recruitment (Gabrielli et al., 2020; Ophir et al., 2020).

Recruitment for each incentive condition was conducted independently and consecutively, which allowed researchers to include compensation details in study advertisements. Recruitment methods were specifically selected to attract bots and malicious actors, and thus each incentive condition was advertised exclusively on Twitter, as this platform has frequently been linked with the recruitment of bots (e.g., Simone, 2019). Enrollment for each incentive condition was open for a 24-hour period between July and September 2021. Given the nature of this line of inquiry, an element of deception was necessary to effectively recruit bots and fraudulent respondents. As such, advertisements and informed consent documents referred to the project as a research study on the relationship between mood and eating behaviors. Respondents clicked on a link to start the survey and provided informed consent. The informed consent documents stated that the study contained a series of self-report questionnaires and several online tasks (e.g., “you will be asked to unscramble anagrams”) to prepare participants for the presence of potentially unusual items in the study materials.

Consistent with institutional requirements, participants were asked three questions about the informed consent documents to ascertain whether respondents read and understood consent forms and had the capacity to consent. The three multiple choice questions included: (1) My participation is voluntary and I can withdraw at any time (yes/no); (2) This is a study on mood and eating behaviors (yes/no); and (3) In order to participate in this study I must be ______ years old (any, at least 25, at least 18, less than 25). In line with institutional guidelines, respondents who incorrectly answered the informed consent questions five times were not able to complete study materials. Participants who progressed onto the full study had three hours to complete all materials. Following recruitment for each incentive condition, participants were debriefed about the true nature of the study via email. All respondents who completed the study materials were compensated. The Institutional Review Board of the University of Minnesota approved this study.
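As an illustration of this gating logic, the minimal Python sketch below encodes the three comprehension items and the five-attempt rule described above. The answer key (e.g., the 18-year minimum) and function names are our own assumptions rather than the Gorilla Experiment Builder implementation.

```python
# Sketch of the consent-comprehension gate described above: a respondent who
# fails the three items on five attempts cannot proceed to the study materials.
# The answer key and field names are illustrative assumptions.

CORRECT_ANSWERS = {
    "participation_is_voluntary": "yes",
    "study_on_mood_and_eating": "yes",
    "minimum_age": "at least 18",
}
MAX_ATTEMPTS = 5

def passes_comprehension(responses: dict) -> bool:
    """True when all three comprehension items are answered correctly."""
    return all(responses.get(item) == answer for item, answer in CORRECT_ANSWERS.items())

def consent_gate(attempts: list) -> bool:
    """True if any of up to MAX_ATTEMPTS response sets passes the check."""
    return any(passes_comprehension(responses) for responses in attempts[:MAX_ATTEMPTS])
```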

Bot Screening Tools

Fourteen bot screening tools across three categories were examined. The order that the screening tools were presented to participants was randomized using a Latin Square, wherein each tool appeared an equal number of times at each position.
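For readers unfamiliar with the approach, a cyclic construction is one simple way to build such a Latin square. The Python sketch below is illustrative only and does not reproduce the study's exact orderings.

```python
# Illustrative cyclic Latin square for counterbalancing presentation order:
# in ordering r, the tool with index t appears at position (r + t) mod k, so
# every tool occupies every position exactly once across the k orderings.

def latin_square_orders(tools):
    k = len(tools)
    return [[tools[(row + col) % k] for col in range(k)] for row in range(k)]

for order in latin_square_orders(["Anagrams", "Visual Search", "Naming Task", "Counting"]):
    print(order)
```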

Task-Based Bot Screening Tools

Five task-based bot screening tools were programmed. First, during the Visual Search task, participants were shown a screen consisting of pictures of animals, one of which was a cat. The participant was instructed to click on the picture of the cat, which appeared in a different place on the screen, across three trials. Respondents who completed the task with partial success were not flagged. Second, respondents completed an Anagrams task (Charness & Villeval, 2010), wherein participants were shown two anagrams that they were asked to decode. For each anagram, letters were presented on the computer screen (e.g., “r, w, d, o, l”) and participants were asked to put the letters together to form a word (e.g., “world”) by clicking buttons on the screen for each letter in the order that they should appear in the resulting word. Only participants who were inaccurate on both anagrams were flagged. Third, participants completed a Naming Task, which included an animal recognition activity wherein participants were shown pictures of animals and asked to type the name of the animal into a response box across three trials including different animals. Respondents who inaccurately labelled animals on all three trials were flagged. Fourth, during the Real Effort Counting task (Abeler et al., 2011), participants were shown a grid of numbers consisting of 1s and 0s. The participants were asked to accurately count the 0s, typing how many 0s they had counted in a text box after each puzzle. Participants were shown three distinct puzzles. As is standard with the Real Effort Counting task, participants were given three attempts to correctly solve each puzzle. Respondents who failed all puzzles were flagged. Finally, in the Unscramble Sentence task, respondents were presented with a list of words that were scrambled in a nonsensical order for two sentences. Participants were asked to rearrange the words to unscramble the sentence (e.g., the words ‘cat, on, sat, The, mat, the’ can be unscrambled to form ‘The cat sat on the mat’) by clicking buttons on the screen for each word in the order that they should appear in the resulting sentence. Respondents who inaccurately unscrambled both sentences were flagged.
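To make the scoring rule concrete, the following minimal Python sketch applies the flag-only-if-all-trials-fail logic to the two-trial Anagrams task. The answer key and field names are illustrative assumptions, not the study's materials.

```python
# Sketch of the "flag only if every trial fails" scoring rule described above,
# using the two-trial Anagrams task as the example. Targets are hypothetical.

ANAGRAM_KEY = {"anagram_1": "world", "anagram_2": "plant"}  # hypothetical targets

def flag_anagrams(responses: dict) -> bool:
    """Return True (flag) only when both anagram responses are incorrect."""
    n_correct = sum(
        str(responses.get(item, "")).strip().lower() == target
        for item, target in ANAGRAM_KEY.items()
    )
    return n_correct == 0

print(flag_anagrams({"anagram_1": "world", "anagram_2": "xxxxx"}))  # False: partial success, not flagged
print(flag_anagrams({"anagram_1": "dlrow", "anagram_2": "xxxxx"}))  # True: both incorrect, flagged
```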

Question-Based Bot Screening Tools

Six question-based bot screening tools were examined. First, respondents completed the Google ReCAPTCHA (v2), wherein respondents were asked to click a checkbox indicating that the user is not a robot. The software determined whether to pass the user immediately without a ReCAPTCHA challenge or present the user with an image challenge to validate whether they are human. Consistent with all other bot screening tools, respondents who failed the ReCAPTCHA were permitted to progress in the study. Second, a Honeypot question was included in the study wherein a checkbox was embedded as a hidden field. Respondents who marked the checkbox were flagged and those with no mark in the hidden field were not.

Third, four distinct attention screeners and checks were examined as bot screening tools. First, the Feeling Screener (Berinsky et al., 2014) presented participants with a short passage, including two sentences on decision-making research and three sentences informing participants they should ignore all other text and instead select the response “None of the Above.” Below the passage, participants were asked to “check all words that describe how you are feeling,” and were presented with 20 response options representing a range of feelings and mood states and the intended response option of “None of the Above.” The Agreement Check asked participants to select the response option “strongly agree” out of seven potential response options ranging from strongly disagree to strongly agree. The Color Check asked participants to select a response that refers to a color out of eleven response options including emotions and colors. The Number Check asked participants to write in a number that is exactly ten digits long.
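A hedged sketch of how several of these Question-Based checks could be scored from exported responses is shown below; the field names and the color list are assumptions made for illustration rather than the study's exact response options.

```python
import re

# Sketch of simple pass/fail scoring for several Question-Based checks described
# above; field names and the color set are illustrative assumptions.

COLORS = {"red", "blue", "green", "yellow", "purple"}  # hypothetical subset of the color options

def fails_number_check(value: str) -> bool:
    """Flag unless the write-in response is exactly ten digits."""
    return re.fullmatch(r"\d{10}", value.strip()) is None

def fails_honeypot(hidden_checkbox_marked: bool) -> bool:
    """Flag any respondent who interacted with the hidden (invisible) field."""
    return hidden_checkbox_marked

def fails_agreement_check(choice: str) -> bool:
    """Flag unless the respondent selected 'strongly agree'."""
    return choice.strip().lower() != "strongly agree"

def fails_color_check(choice: str) -> bool:
    """Flag unless the selected response refers to a color."""
    return choice.strip().lower() not in COLORS

print(fails_number_check("0123456789"))  # False: passes
print(fails_number_check("12345"))       # True: flagged
```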

Data-Based Bot Screening Tools

Three Data-Based bot screening tools were examined. First, for Age Consistency, respondents reported their age (in years) in two different questions, one at the beginning of the survey and one in the middle of the survey, and respondents were flagged if their age was inconsistent. Second, the Numeric Flag examined the scores on two substantive measures, including the Eating Disorder Examination Questionnaire (EDEQ; Fairburn & Beglin, 1994) and the State-Trait Anxiety Inventory (STAI; Spielberger et al., 1970). Numeric selections on these measures were flagged if the responses were out of bounds or did not reflect available response options (e.g., responding with a “4.5” or “a” to a discrete Likert response scale with options ranging from 1 to 5), or if identical response patterns for all survey items were observed across multiple (i.e., >2) responses that were submitted consecutively in a brief period of time. Response patterns that followed an identical repeating sequence of numbers but started at different items, and were submitted consecutively in time, were also flagged. Finally, the Open-Ended Flag was formed by flagging nonsensical and identical qualitative responses to three open-ended questions, wherein participants were asked to describe the feelings, influences, and desires associated with weight control behaviors.
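The following Python sketch illustrates how the first two Data-Based flags described above might be computed from an exported data frame; all column names and the Likert bounds are illustrative assumptions, and the identical-pattern check is a simplified version that ignores submission timing.

```python
import pandas as pd

# Sketch of Data-Based flags applied to an exported survey data frame.
# Column names (age_q1, age_q2, item columns) and bounds are assumptions.

LIKERT_MIN, LIKERT_MAX = 1, 5  # illustrative bounds for a discrete scale

def flag_age_inconsistency(df: pd.DataFrame) -> pd.Series:
    """Flag respondents whose two reported ages do not match."""
    return df["age_q1"] != df["age_q2"]

def flag_out_of_bounds(df: pd.DataFrame, item_cols: list) -> pd.Series:
    """Flag non-numeric, non-integer, or out-of-range responses to a discrete scale."""
    items = df[item_cols].apply(pd.to_numeric, errors="coerce")
    non_numeric = df[item_cols].notna() & items.isna()      # e.g., "a" on a 1-5 scale
    non_integer = items.notna() & (items % 1 != 0)          # e.g., 4.5
    out_of_range = (items < LIKERT_MIN) | (items > LIKERT_MAX)
    return (non_numeric | non_integer | out_of_range).any(axis=1)

def flag_identical_patterns(df: pd.DataFrame, item_cols: list) -> pd.Series:
    """Flag rows whose full response pattern is shared by more than two submissions.
    A stricter version would also require the submissions to be consecutive in time."""
    pattern = df[item_cols].astype(str).apply(lambda row: "|".join(row), axis=1)
    return pattern.map(pattern.value_counts()) > 2
```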

Bot Threshold

Responses from participants who reached the end of the study materials (i.e., those who did not time out or drop out) were further examined to identify potential bots. Human participants may fail bot screening tools for several reasons, and thus it is critical that researchers balance ethical considerations (e.g., economic vulnerability of participants, researcher-participant power dynamics) with data integrity concerns (Simone, 2021). Indeed, overly stringent bot screening tactics may inadvertently harm human participants and may negatively impact the representativeness of the sample by excluding those we intend to recruit (e.g., participants for whom attention lapses are a clinical feature; Donegan & Gillian, 2021; Moeck, Bridgeland, & Takarangi, 2021). Additionally, some bot screening tools, such as Anagrams or Unscramble Sentences, are highly linked to English-language proficiency (i.e., humans for whom English is a second language or who struggle with language literacy could fail them). Relatedly, some of the screening tools are linked to visual abilities (e.g., Visual Search, Real Effort Counting), and thus individuals with visual impairment may be more likely to fail these screening tools. To balance ethical considerations with data integrity concerns and adequately account for human error on bot detection screening tools, the current study included a ≥ 4 decision rule wherein participants who failed four or more bot screening tools were identified as potential bots. Please note, this decision rule was rationally rather than empirically based, and thus future research should seek to validate an empirical decision rule. The proportion of respondents identified as potential bots using alternative bot thresholds, ranging from 1 through 8 failed bot screening tools, is presented in Supplementary Table 2.
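In practice, the decision rule amounts to summing per-tool fail indicators and comparing the total to the threshold, as in the minimal Python sketch below; the column names are assumptions, and varying BOT_THRESHOLD from 1 to 8 reproduces the kind of sensitivity analysis summarized in Supplementary Table 2.

```python
import pandas as pd

# Sketch of the >= 4 decision rule: sum per-tool fail indicators (1 = fail,
# 0 = pass) and flag respondents who fail four or more tools.
# Column names are illustrative assumptions.

TOOL_FAIL_COLS = [
    "fail_visual_search", "fail_anagrams", "fail_naming", "fail_counting",
    "fail_unscramble", "fail_recaptcha", "fail_honeypot", "fail_number_check",
    "fail_color_check", "fail_feeling_screener", "fail_agreement_check",
    "fail_age_consistency", "fail_numeric", "fail_open_ended",
]
BOT_THRESHOLD = 4

def apply_bot_threshold(df: pd.DataFrame, threshold: int = BOT_THRESHOLD) -> pd.DataFrame:
    out = df.copy()
    out["n_failed"] = out[TOOL_FAIL_COLS].sum(axis=1)
    out["potential_bot"] = out["n_failed"] >= threshold
    return out
```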

Analyses

Descriptive statistics were examined to identify pass-fail distributions for individual bot screening tools and joint screening rates. Chi-square tests were conducted to examine the extent to which the proportion of participants flagged on each bot screening tool differed between the $1 and $5 incentive conditions. Univariate logistic regression analyses were conducted to examine the associations between each individual screening tool (pass/fail) and surpassing the Bot Threshold. To identify the most efficient tools within each of the three screening categories (Task-, Question-, and Data-Based), multivariate logistic regressions were run to examine the associations between surpassing the Bot Threshold and pertinent bot screening tools identified in the univariate analyses. The Bonferroni correction was applied to all logistic regression analyses to control for inflated Type I error from multiple comparisons.
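A minimal Python sketch of this analytic workflow, assuming a respondent-level data frame with 0/1 fail indicators, a condition label, and the Bot Threshold flag, is shown below. It is an approximation of the analyses rather than the study's exact code, which is provided in the Supplementary Files.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Sketch of the analytic steps described above; column names are assumptions.

def chi_square_by_condition(df: pd.DataFrame, tool: str):
    """Chi-square test of condition ($1 vs $5) by pass/fail for one tool."""
    table = pd.crosstab(df["condition"], df[tool])
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    return chi2, p

def univariate_logits(df: pd.DataFrame, tools: list) -> pd.DataFrame:
    """Univariate logistic regressions of the Bot Threshold on each fail indicator,
    with a Bonferroni adjustment across the set of tests. Tools with very sparse
    failures may not converge (reflected as NA entries in Table 8)."""
    rows = []
    for tool in tools:
        X = sm.add_constant(df[tool].astype(float))
        fit = sm.Logit(df["potential_bot"].astype(float), X).fit(disp=False)
        rows.append({"tool": tool, "B": fit.params[tool], "SE": fit.bse[tool],
                     "p": fit.pvalues[tool]})
    out = pd.DataFrame(rows)
    out["p_bonferroni"] = np.minimum(out["p"] * len(tools), 1.0)
    return out
```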

Transparency and Openness

This study’s design and analyses were not preregistered. The study’s materials and analytic code are available in the Supplementary Files. The data are available upon request.

Results

Individual Screening Methods – Timed-Out Respondents

Table 1 presents the counts and percentages of respondents passing vs failing each bot screening tool in the $1 and $5 incentive conditions among those who had not timed-out before completing a given task or question of the survey.

Table 1.

Counts and Proportions of Respondents Passing Versus Failing Bot Detection Tactics

Type / Method                $1 Incentive (n = 95)        $5 Incentive (n = 48 a)        χ2 value     p-value
                             Pass          Fail           Pass          Fail
Task-Based
  Visual Search 94 (98.9%) 1 (1.1%) 47 (97.9%) 1 (2.1%) 0.25 .620
  Anagrams 15 (15.8%) 80 (84.2%) 11 (22.9%) 37 (77.1%) 1.09 .297
  Naming Task 72 (75.8%) 23 (24.2%) 37 (77.1%) 11 (22.9%) 0.03 .864
  Real Effort Counting 93 (97.9%) 2 (2.1%) 47 (97.9%) 1 (2.1%) 0.00 .993
  Unscramble Sentence 22 (23.2%) 73 (76.8%) 15 (31.3%) 33 (68.8%) 1.09 .297
Question-Based
  ReCAPTCHA 38 (40.0%) 57 (60.0%) 23 (47.9%) 25 (52.1%) 0.82 .366
  Honeypot 95 (100.0%) 0 (0.0%) 48 (100.0%) 0 (0.0%)
  Number Check 44 (46.3%) 51 (53.7%) 30 (62.5%) 18 (37.5%) 3.35 .067
  Color Check 61 (64.2%) 34 (35.8%) 27 (56.2%) 21 (43.8%) 0.85 .355
  Feeling Screener 37 (38.9%) 58 (61.1%) 15 (31.2%) 33 (68.8%) 0.82 .366
  Agreement Check 78 (82.1%) 17 (17.9%) 37 (77.1%) 11 (22.9%) 0.51 .475
Data-Based
  Age Consistency 81 (85.3%) 14 (14.7%) 44 (91.7%) 4 (8.3%) 1.19 .276
  Numeric Flag 94 (98.9%) 1 (1.1%) 39 (81.2%) 9 (18.8%) 15.36 < .001
  Open-Ended Flag 46 (49.5%) 48 (50.5%) 9 (18.7%) 39 (81.3%) 12.20 < .001

Note. Task-Based bot screening tools with multiple trials were only considered a ‘fail’ if all trials were unsuccessful. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming Task = type the name of an animal based on its picture. Real Effort Counting = count the number of zeros in a grid. Unscramble Sentence = form a sentence with disjointed words. ReCAPTCHA = complete the Google ReCAPTCHA (v2). Honeypot = present invisible information that humans cannot see, but bots can. Number Check = write in a number that is exactly 10 digits long. Color Check = select a response that refers to a color out of a list of emotions and colors. Feeling Screener = asked participants to select “None of the Above” and ignore all other instructions even though other instructions require the selection of a feeling. Agreement Check = select “strongly agree” for a question. Age Consistency = age requested at beginning and middle of the survey matches. Numeric Flag = responding with out-of-bound numbers to a discrete scale. Open-Ended Flag = providing nonsensical and identical qualitative responses to three open-ended questions.

a. One respondent provided no response to any bot screening tool, and therefore was excluded from the $5 study.

$0 Incentive Condition

Of the 37 respondents who initiated the survey, 28 timed out at the consent form. Of the remaining nine respondents, two failed initial consent form comprehension checks and timed out at subsequent comprehension checks. Out of the seven participants who enrolled, three timed out during the sociodemographic questionnaire and four reached the end of the survey.

$1 Incentive Condition

Two hundred and twenty-four prospective respondents initiated the survey, 29.0% of whom timed out at the consent form. Of the remaining 159 respondents who proceeded, 56 failed initial consent form comprehension checks; of these, 21 respondents timed out at subsequent comprehension checks, 5 failed all additional comprehension checks, and 30 passed a subsequent comprehension check and were permitted to consent and enroll in the study. Among the 133 participants permitted to enroll in the study, 38 timed out during the remaining survey materials (see Supplementary Table 1). Of respondents who enrolled, 95 reached the end of the survey. Most participants who completed the study materials in the $1 incentive condition were identified as potential bots by the Bot Threshold (n=72, 76.6%).

$5 Incentive Condition

One hundred and eighty-three prospective respondents initiated the survey, fifty-eight of whom timed out at the consent form. Of the remaining 125 respondents who proceeded, 49 failed initial consent form comprehension checks; of these, 12 respondents timed out at subsequent comprehension checks, 2 failed all additional comprehension checks, and 35 passed a subsequent comprehension check and were permitted to consent and enroll in the study. Among the 111 participants permitted to enroll in the study, 63 timed out during the remaining study materials (see Supplementary Table 1). As such, of those who enrolled, 48 (44.1%) reached the end of the study materials. Most participants who reached the end of the study materials were flagged by the Bot Threshold (n=36, 75.0%).

Bot Screening Tools Among Timed-Out and Complete Respondents

The proportion of flagged responses on each of the examined bot screening tools was similar among participants who completed all study materials and respondents who timed out of the research study. The counts and proportions of timed-out respondents passing versus failing bot screening tools are provided in the online supplementary materials (Supplementary Table 1).

Effectiveness of Question-, Task-, and Data-Based Bot Screening Tools

Given the high frequency of time-outs and the small proportion of completers in the $0 incentive condition, bot screening tools from this condition were not analyzed. All subsequent results and analyses include participants who completed the study materials in the $1 and $5 incentive conditions (Table 1). Of the Task-Based bot screening tools, respondents most frequently failed the Anagrams task, followed by the Unscramble Sentences task, while only one respondent in each condition failed the Visual Search task. Considering Question-Based bot screening tools, respondents most frequently failed the Feeling Screener, followed by the ReCAPTCHA. No respondents failed the Honeypot question.

Effectiveness of Distinct Combinations of Bot Screening Tools

Joint Bot Screening Tool Combinations

Tables 2, 3, and 4 present the counts of participants failing pairs of Task-Based, Question-Based, and Data-Based bot screening tools, respectively, in the $1 and $5 incentive conditions. The diagonals of each table present the counts of respondents failing each individual task, while the off-diagonals present the counts failing pairs of tasks. Considering Task-Based bot screening tools, a large proportion of respondents who failed the Anagrams task also failed other bot screening tools in both incentive conditions, although more exceptions to this trend occurred in the $5 incentive condition. Considering Question-Based bot screening tools, no similar overlap was observed for a single bot screening tool; rather, bot screening tool failure tended to occur across different pairs of tools for different respondents. Of the Data-Based bot screening tools, the respondents who were flagged based on their responses to open-ended items also tended to include (but were not limited to) those who were flagged by the other Data-Based tools.

Table 2.

Joint Failure Rates of Task-Based Bot Screening Tools

                 Visual Search    Anagrams    Naming Task    Counting    Unscramble
$1 Incentive
Visual Search 1
Anagrams 1 80
Naming 1 22 23
Counting 0 2 1 2
Unscramble 1 71 22 1 73
$5 Incentive
Visual Search 1
Anagrams 1 38
Naming 0 8 11
Counting 0 1 0 1
Unscramble 1 30 8 1 34

Note. Bolded diagonals represent the total number of respondents failing each task individually. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming = type the name of an animal based on its picture. Counting = count the number of zeros in a grid. Unscramble = form a sentence with disjointed words.

Table 3.

Joint Failure Rates of Question-Based Bot Screening Tools

                    ReCAPTCHA    Number Check    Color Check    Feeling Screener    Agreement Check
$1 Incentive
ReCAPTCHA 57
Number Check 26 51
Color Check 13 23 34
Feeling Screener 34 38 29 58
Agreement Check 9 11 8 13 17
$5 Incentive
ReCAPTCHA 23
  Honeypot NA NA NA
Number Check 6 18
Color Check 7 12 21
Feeling Screener 16 14 16 33
Agreement Check 6 2 4 8 11

Note. Bolded diagonals represent the total number of respondents failing each item individually. ReCAPTCHA = complete the Google ReCAPTCHA (v2). Honeypot = present invisible information that humans cannot see, but bots can. Number Check = write in a number that is exactly 10 digits long. Color Check = select a response that refers to a color out of a list of emotions and colors. Feeling Screener = asked participants to select “None of the Above” and ignore all other instructions even though other instructions require the selection of a feeling. Agreement Check = select “strongly agree” for a question.

Table 4.

Joint Failure Rates of Data-Based Bot Screening Tools

Inconsistent Age Numeric Flag Open-Ended Flag
$1 Incentive
Age Consistency 14
Numeric Flag 1 1
Open-Ended Flag 5 1 48
$5 Incentive
Age Consistency 4
Numeric Flag 0 9
Open-Ended Flag 2 9 39

Note. Inconsistent Age = respondent reported different ages (in years) across two questions. Numeric Flag = respondent flagged based on researchers’ assessment of numeric responses. Open-Ended Flag = respondent flagged based on researchers’ assessment of open-ended responses. Bolded diagonals represent the total number of respondents with each flag individually.

Multi-Method Bot Screening Tool Combination

Tables 5, 6, and 7 present the counts and percentages of respondents failing pairs of task-by-question bot screening tools in the $1 and $5 incentive conditions. Of respondents failing Task-Based bot screening tools, the largest proportions tended to also fail the Feeling Screener Question-Based bot screening tool. Conversely, of those failing Question-Based bot screening tools, the largest proportions tended also to fail the Anagrams Task-Based bot screening tool. However, of bot screening tools with more-than-modest failure rates, there was no case where all participants failing one Task-Based bot screening tool also failed a given Question-Based bot screening tool, and vice versa.

Table 5.

Task-by-Question Bot Screening Tool Contingent Failure, Counts of Respondents

                 ReCAPTCHA    Number Check    Color Check    Feeling Screener    Agreement Check    Row Totals
$1 Incentive
Visual Search 1 1 1 1 0 1
Anagrams 44 46 33 50 15 80
Naming 13 14 11 16 3 23
Counting 2 0 0 1 1 2
Unscramble 39 40 28 43 14 73
Column Totals 57 51 34 58 17
$5 Incentive
Visual Search 1 0 0 1 1 1
Anagrams 19 16 18 26 10 38
Naming 5 4 3 8 5 11
Counting 0 0 0 0 0 1
Unscramble 13 14 19 25 9 34
Column Totals 23 18 21 33 11

Note. Row and column totals represent the total number of people failing each question or task. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming = type the name of an animal based on its picture. Counting = count the number of zeros in a grid. Unscramble = form a sentence with disjointed words.

Table 6.

Task-by-Question Bot Screening Tool Contingent Failure, Percentages Failing Each Question-Based Bot Screening Tool Out of Those Failing Each Task-Based Bot Screening Tool

                      ReCAPTCHA    Number Check    Color Check    Feeling Screener    Agreement Check
$1 Incentive
Visual Search (1) 100.0% 100.0% 100.0% 100.0% 0.0%
Anagrams (80) 55.0% 57.5% 41.3% 62.5% 18.8%
Naming (23) 56.5% 60.9% 47.8% 69.6% 13.0%
Counting (2) 100.0% 0.0% 0.0% 50.0% 50.0%
Unscramble (73) 53.4% 54.8% 38.4% 58.9% 19.2%
$5 Incentive
Visual Search (1) 100.0% 0.0% 0.0% 100.0% 100.0%
Anagrams (38) 50.0% 42.1% 47.4% 68.4% 26.3%
Naming (11) 45.5% 36.4% 27.3% 72.7% 45.5%
Counting (1) 0.0% 0.0% 0.0% 0.0% 0.0%
Unscramble (34) 38.2% 41.2% 55.9% 73.5% 26.5%

Note. Values in parentheses represent the total count of respondents failing each task in each incentive condition. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming = type the name of an animal based on its picture. Counting = count the number of zeros in a grid. Unscramble = form a sentence with disjointed words.

Table 7.

Task-by-Question Bot Screening Tool Contingent Failure, Percentages Failing Each Task-Based Bot Screening Tool Out of Those Failing Each Question-Based Bot Screening Tool

                      ReCAPTCHA    Number Check    Color Check    Feeling Screener    Agreement Check
$1 Incentive (57) (51) (34) (58) (17)
Visual Search 1.8% 2.0% 2.9% 1.7% 0.0%
Anagrams 77.2% 90.2% 97.1% 86.2% 88.2%
Naming 22.8% 27.5% 32.4% 27.6% 17.6%
Counting 3.5% 0.0% 0.0% 1.7% 5.9%
Unscramble 68.4% 78.4% 82.4% 74.1% 82.4%
$5 Incentive (23) (18) (21) (33) (11)
Visual Search 4.3% 0.0% 0.0% 3.0% 9.1%
Anagrams 82.6% 88.9% 85.7% 78.8% 90.9%
Naming 21.7% 22.2% 14.3% 24.2% 45.5%
Counting 0.0% 0.0% 0.0% 0.0% 0.0%
Unscramble 56.5% 77.8% 90.5% 75.8% 81.8%

Note. Values in parentheses represent the total count of respondents failing each question-based screening item in each incentive condition. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming = type the name of an animal based on its picture. Counting = count the number of zeros in a grid. Unscramble = form a sentence with disjointed words.

Cross-Condition Comparisons

Cross-condition comparisons were only possible in the $1 and $5 incentive conditions. Group differences in the average number of failed bot screening tools in the $1 incentive condition (M = 4.83, SD = 1.81) and the $5 incentive condition (M = 5.06, SD = 1.80) were not statistically significant (t(140) = −0.73, p = .470). The proportion of participants flagged by the Bot Threshold in the $1 and $5 incentive conditions (76.6% and 75.0%, respectively) did not significantly differ by condition (χ2 = 0.04, p = .833). Chi-square comparisons (see Table 1) revealed a higher proportion of participants flagged on the Numeric (18.8%) and Open-Ended (81.3%) Flags in the $5 incentive condition relative to the $1 incentive condition (1.1% and 50.5%, respectively). No condition-level differences in the proportion of respondents flagged by individual Task-Based (p range: .297 - .993) or Question-Based (p range: .067 - .475) bot screening tools were observed.
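For readers who wish to verify these comparisons, the reported summary statistics are sufficient to approximately reproduce them, as in the Python sketch below. The group sizes (94 and 48) are inferred from the reported degrees of freedom and flagged percentages, so small discrepancies from the published values reflect rounding of the means, SDs, and proportions.

```python
from scipy.stats import ttest_ind_from_stats, chi2_contingency

# Approximate reproduction of the cross-condition comparisons from reported
# summary statistics; group sizes are inferred, not taken from raw data.

t, p = ttest_ind_from_stats(mean1=4.83, std1=1.81, nobs1=94,
                            mean2=5.06, std2=1.80, nobs2=48,
                            equal_var=True)
print(round(t, 2), round(p, 3))  # approximately t = -0.72, p = .47

# Two-by-two table of flagged vs. not-flagged respondents by condition.
table = [[72, 94 - 72], [36, 48 - 36]]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(round(chi2, 2), round(p, 3))  # approximately chi-square = 0.04, p = .83
```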

Efficacy of Individual Bot Screening Tools in Predicting Participant Exclusion

The results from univariate and multivariate logistic regressions testing the associations between bot screening tools and the Bot Threshold (i.e., failing 4+ bot screening tools) are presented in Table 8. Regarding Task-Based bot screening tools, univariate logistic regression analyses, using the Bonferroni correction method, revealed statistically significant associations between surpassing the Bot Threshold and failure on the Anagrams task (B = 3.51, SE = 0.58, adjusted p = .012) and the Unscramble Sentences task (B = 2.43, SE = 0.45, adjusted p = .012). Univariate analyses also revealed statistically significant positive associations between the Bot Threshold and the Number Check (B = 1.88, SE = 0.49, adjusted p = .012), Color Check (B = 1.90, SE = 0.57, adjusted p = .012), Feeling Screener (B = 1.79, SE = 0.43, adjusted p = .012), and Open-Ended Flag (B = 1.79, SE = 0.43, adjusted p = .012). No statistically significant associations between the Bot Threshold and the Naming Task, Real Effort Counting task, ReCAPTCHA, Agreement Check, or Age Consistency Flag were observed.

Table 8.

Univariate and Multivariate Associations Between Failing Individual Bot Screening Tools and Surpassing the 4+ Bot Threshold in Combined Sample

Type / Method                    Univariate Tests                     Multivariate Tests
                                 B      SE     Adjusted p-value       B      SE     Adjusted p-value
Task-Based
  Visual Search NA NA NA
  Anagrams 3.51 0.58 .012 2.84 0.62 .002
  Naming Task 2.63 1.04 .132
  Real Effort Counting − 0.47 1.24 .703
  Unscramble Sentence 2.43 0.45 .012 1.42 0.56 .022
Question-Based
  ReCAPTCHA − 0.59 0.41 .155
  Number Check 1.88 0.49 .012 1.51 0.52 .008
  Color Check 1.90 0.57 .012 1.32 0.61 .058
  Feeling Screener 1.79 0.43 .012 1.34 0.47 .008
  Agreement Check 1.57 0.76 .468
Data-Based
  Age Consistency 1.82 1.05 .996
  Numeric Flag NA NA NA
  Open-Ended Flag 1.79 0.43 .012

Note. p-values were calculated using a Bonferroni correction to adjust for potential bias resulting from multiple comparisons; NA = not available due to insufficient statistical power. Visual Search = click an image of a cat three times. Anagrams = rearrange letters to form an English word. Naming Task = type the name of an animal based on its picture. Real Effort Counting = count the number of zeros in a grid. Unscramble Sentence = form a sentence with disjointed words. ReCAPTCHA = complete the Google ReCAPTCHA (v2). Number Check = write in a number that is exactly 10 digits long. Color Check = select a response that refers to a color out of a list of emotions and colors. Feeling Screener = asked participants to select “None of the Above” and ignore all other instructions even though other instructions require the selection of a feeling. Agreement Check = select “strongly agree” for a question. Age Consistency = age requested at beginning and middle of the survey matches. Numeric Flag = responding with out-of-bound numbers to a discrete scale. Open-Ended Flag = providing nonsensical and identical qualitative responses to three open-ended questions.

Multivariate analyses within the Task- and Question-Based bot screening tool subtypes were also examined. The multivariate analyses for Task-Based bot screening tools revealed statistically significant positive associations between the Bot Threshold and the Anagrams (B = 2.84, SE = 0.62, adjusted p = .002) and Unscramble Sentences (B = 1.42, SE = 0.56, adjusted p = .022) tasks. The multivariate analyses for Question-Based bot screening tools revealed statistically significant positive associations between the Bot Threshold and the Number Check (B = 1.51, SE = 0.52, adjusted p = .008) and Feeling Screener (B = 1.34, SE = 0.47, adjusted p = .008).

Discussion

The present investigation is the first known quasi-experimental study aimed to examine a range of Task-, Question-, and Data-Based bot screening tools across three incentive conditions. In general, study results suggest that bot screening tools function similarly for participants recruited across incentive conditions. Moreover, the present analyses revealed heterogeneity in the efficacy of bot screening tool subtypes. All key findings are discussed and interpreted below.

Of the Task-Based bot screening tools, the largest proportions of respondents in the $1 and $5 incentive conditions failed the Anagrams and Unscramble Sentences tasks. In contrast, most respondents passed the Visual Search and Real Effort Counting tasks, suggesting that such tools may not be efficient bot screening tools. Of the Question-Based bot screening methods, the Feeling Screener and ReCAPTCHA checks were associated with the highest rates of failure in the $1 and $5 incentive conditions. Respondents who were flagged based on screening of their open-ended question responses tended to be flagged on other researcher-driven methods of bot screening (e.g., repetitive numeric responses, age inconsistencies). Age inconsistencies flagged a unique set of respondents not flagged by other researcher-driven methods. Failing the Anagrams task tended to coincide with failing other Task-Based methods, but not vice versa. Meanwhile, no such pattern was observed for Question-Based methods. Failure on Task-Based methods and failure on Question-Based methods of bot screening tended to coincide but did not completely overlap. Of the Task-Based screening methods, respondents who failed the Anagrams task tended to fail one or more Question-Based screeners. Of the Question-Based screening methods, respondents who failed the Feeling Screener tended to fail one or more Task-Based methods.

Cross-Condition Comparisons

The present findings revealed similar patterns of participant behavior among respondents in the $1 and $5 incentive conditions. Notably, the average total number of bot screening tools failed, and the percentage of respondents flagged by the Bot Threshold, did not significantly differ across incentive conditions. Moreover, only minimal group differences were observed for individual bot screening tool failure, wherein a higher proportion of respondents in the $5 condition were flagged on the Numeric and Open-Ended bot screening tools relative to the respondents in the $1 incentive condition. Findings also showed qualitatively similar patterns of passing versus failing each bot screening tool among participants in the $1 and $5 incentive conditions who timed-out of the study. Taken together, the current study suggests that the presence of bots in online research studies using social media as the primary recruitment strategy is relatively similar among studies offering $1 or $5 in compensation. Unfortunately, because most respondents in the $0 incentive condition timed-out of the study before completing any questions or task, group differences could not be examined for this condition. Moreover, additional research is needed to examine bot recruitment in online studies using social media as the primary recruitment strategy with other common compensation approaches (e.g., no compensation, gift card lottery, higher compensation levels).

Efficacy of Individual Bot Screening Tools in Predicting Participant Exclusion

While ReCAPTCHA tools are among the most cited bot screening tools in the literature (e.g., Yarrish et al., 2019), the present findings suggest that researchers should not rely on ReCAPTCHAs alone to identify potential bots. Specifically, while nearly half of respondents failed the ReCAPTCHA tool, this tool was not significantly associated with the bot screening tool exclusion threshold and thus was not effective at detecting respondents who failed at least four bot screening tools. It is possible that the ReCAPTCHA tool is no longer an effective bot screening tool due to the increasing sophistication of deployed bots built with Natural Language Processing techniques, as well as the sophistication of bot programmers. It is also possible that human participants are just as likely to fail ReCAPTCHA tools as are bots.

The results of the present study show support for bot screening tools across each category, including Task-Based (Anagrams, Unscramble Sentences), Question-Based (Feeling Screener, Number Check), and Data-Based (Open-Ended Flag) tools, as failing any of these bot screening tools was significantly associated with the Bot Threshold. Notably, both effective Task-Based bot screening tools required the reorganization of nonsensical text (e.g., letters, words) into a logical order. In contrast, the three ineffective Task-Based bot screening tools (Visual Search, Naming Task, and Real Effort Counting) included pictures or images that respondents were asked to interact with or derive meaning from. As such, picture-based tasks may be less efficient tools for bot screening purposes. Moreover, each of the ineffective Question-Based bot screening tools included short, succinct questions with a range of potential responses. In contrast, multiple choice Question-Based bot screening tools that include a more detailed passage (e.g., Feeling Screener; Berinsky et al., 2014) or write-in requests (e.g., Number Check) may be more effective methods to screen for bots. Similarly, the present findings support the use of open-ended questions, which may be examined to detect unusual or nonsensical responses indicative of potential bots.

Recommendations to Improve Bot Detection

Based on our experiments, the following recommendations should be considered by researchers who administer online survey research.

  1. Include More than One Bot Screening Tool: While some bot screening tools may be more effective and/or offer a higher degree of accuracy at detecting potential bot respondents (e.g., identical response patterns, suspicious or identical open-ended responses), researchers are encouraged to include more than one tool in their study design. Specifically, no single bot screening tool was 100% effective at identifying potential bots, and thus it is important that researchers include more than one screening tool to detect bots adequately and ethically.

  2. Make A Priori Decision Rules: Researchers should develop an a priori decision rule regarding the number of bot screening tools a respondent must fail before removal is considered. To balance ethical concerns and data integrity considerations, it is recommended that researchers choose a threshold of no fewer than three failed bot screening tools, although a more conservative number like the Bot Threshold in the current article may ensure that real participants are not unintentionally eliminated. Ultimately, researchers need to balance the risk of false negatives (i.e., incorrectly retaining bots) and false positives (i.e., incorrectly eliminating humans) when deciding on elimination rules for their bot screening tools.

  3. Document Bot Screen Tool Decisions in Manuscripts: All bot protection procedures, methods, and decision rules should be presented in research manuscripts. Researchers may consider presenting this information in the methods or results section alongside missing data procedures or, if space is limited, uploading their bot protocols as supplemental materials.

  4. Minimize Participant Burden and Maximize Bot Screen Tool Efficacy: Researchers should integrate at least one of the more effective bot screening tools, as identified in the present study, including: Anagrams, Unscramble Sentences, Number Check, Feeling Screener, and Open-Ended Flags. To minimize participant burden and maximize bot screening tool effectiveness, researchers should avoid the use of the least effective bot screening tools (i.e., Visual Search, Real Effort Counting, Honeypot).

  5. Cast a Wide Screening Net with Distinct Bot Detection Tools: Bot programmers vary in their skill and may program bots to engage with online survey research in distinct ways. As such, researchers should select bot screening tools that (a) are independently effective and (b) evaluate distinct kinds of respondent behavior (e.g., attention, tasks, data).

  6. Consider Study Design Strategies that Prevent Bots Before They Enroll: Future research might, for instance, require participants to complete a phone or video screener to verify respondents before they enroll in the study. Recruitment methods also impact whether bots or fraudulent respondents gain access to online research. Scholars should weigh the serious risk of attracting bots when they share study advertisements on social media platforms. Whenever possible, researchers are encouraged to distribute study ads through community resources specific to the population of interest (e.g., counseling/community centers, public libraries).

  7. If You Advertise Your Study on Social Media, Bot Screening Tools Are Mandatory: It may be impractical for some researchers to avoid social media altogether. However, researchers who advertise on social media must document bot screening tools in their article to increase confidence in the results. Otherwise, readers cannot assess the authenticity of the data. The present study, for instance, showed that advertising on Twitter for a total of three days (one 24-hour enrollment window per condition) resulted in 444 respondents, the majority of whom failed many of our bot screening tools.

  8. Find and/or Develop New Bot Screening Tools: This study offers a preliminary list of bot screening tools. Because bots are programmed with increasing sophistication, the bot screening tools reported in the paper may lose effectiveness over time. As such, researchers should frequently update the bot screening tools they use and/or develop new screeners. In addition, several of the task-based screening tools included in the present study may flag human participants with low literacy or visual impairment and thus future research should consider other task-based screening tools that are unrelated to literacy and visual abilities.

Limitations and Future Directions

While the present study had notable strengths, including a quasi-experimental design with three incentive conditions and the consideration of fourteen distinct bot screening tools, the study also includes several limitations. First, the selection of the 4+ Bot Threshold was rationally rather than empirically based, due to the lack of empirical guidelines for a bot detection threshold, and thus human participants may have been identified as potential bots, and vice versa, in the present study. Moreover, the study would be strengthened with the addition of a known human comparison group (i.e., in a lab setting), which would allow researchers to compare the responses and behaviors of known human participants with those observed in the present online study to better differentiate bots from human respondents and identify specific bot screening tools that humans are more likely to fail. Additionally, the present bot screening tools may have flagged well-meaning humans with lower English-language literacy or visual impairment (e.g., Anagrams eliminating a person for whom English is not their first language) as well as inattentive human respondents or other kinds of fraudulent respondents (e.g., humans enrolling more than once). This limitation is particularly salient in studies where multiple bot screening tools require English language proficiency, which may result in an inequitable exclusion of respondents who fail more than one of these forms of screening tools. Further, the small sample sizes for each incentive condition group presented limitations to the analyses conducted in the present study. Incentive conditions were collapsed for univariate and multivariate analyses, preventing the examination of associations between failing individual bot screening tools and the Bot Threshold across conditions. Study design features may have also had unintended consequences, including the potential exclusion of genuine participants, bots, or fraudulent respondents. For instance, the decoy focus on disordered eating may have caused undue distress among human participants and/or led to attrition, both because of the potentially emotionally challenging topics assessed in the surveys and because of the number of tasks that strayed from the advertised study topic (disordered eating). Thus, future research should choose a less emotionally charged decoy topic to minimize harm when investigating bot screening tools. Moreover, the survey platform, Gorilla Experiment Builder, did not distinguish participants who timed out of the study from those who elected to leave the study browser, which prevented further consideration of whether participants who timed out were more likely to be stalled bots or disinterested humans. While the collection of respondent IP addresses may help prevent respondents from participating multiple times within and across conditions, the collection of IP addresses is forbidden or illegal in some settings due to the sensitive data (e.g., location) linked to IP addresses. The present study was hosted on Gorilla Experiment Builder in the UK, where the collection of IP addresses was not permitted. Thus, respondent IP addresses were not collected, and it is possible that respondents participated in more than one incentive condition. In sum, replication of our findings is important to ensure the efficacy results of the bot screening tools generalize to other samples.
Distinguishing bots from inattentive or fraudulent respondents is a rich area of inquiry because the two groups may affect data integrity in distinct patterns which, if elucidated, could lead to a more comprehensive portfolio of survey design considerations, data cleaning strategies, and bot screening tools. Future research can address the limitations of the present study in a number of ways: (a) empirically test distinct bot detection thresholds to create numerical guidelines; (b) include a known human sample to help advance the field of bot screening and detection; (c) vary participant compensation further (e.g., higher compensation, gift card raffles); (d) replicate the current study with additional bot screening tools and in additional languages; and (e) include additional task-based bot screening tools that are not reliant upon English language literacy and/or visual abilities.

Supplementary Material

Supplemental Material 1
Supplemental Material 2

Public Significance Statement.

In recent years, psychologists conducting online research have described escalating cyber threats to online research resulting from survey bots. This study sought to identify effective and cost-efficient ways to detect and prevent bots from enrolling in online research by using recruitment methods that were likely to recruit bots and systematically examining responses. The data collected in this study allowed us to generate a preliminary list of empirically-supported recommendations for psychological researchers.

Acknowledgments:

Melissa Simone is supported by the Transition to Independence Award (K99MD015770) from the National Institute on Minority Health and Health Disparities. Cory Cascalheira is supported as a RISE Fellow by the National Institutes of Health (R25GM061222). The parent organization of Gorilla Experiment Builder, Cauldron Science, provided the researchers with the funds to compensate study participants.

Footnotes

The authors have no financial disclosures. The data analyzed in this manuscript do not appear in any prior dissemination and this sample has not been examined in other published works. Data are not shared publicly but may be requested by contacting the corresponding author.

CRediT: MS contributions: Conceptualization; Data Curation; Funding Acquisition; Methodology; Investigation; Project Administration; Writing – Original Draft; Writing – Review and Editing. CJC contributions: Conceptualization; Writing – Original Draft; Writing – Review and Editing. BGP contributions: Conceptualization; Formal Analysis; Investigation; Writing – Review and Editing.

References

  1. Abeler J, Falk A, Goette L, & Huffman D (2011). Reference points and effort provision. American Economic Review, 101(2), 470–492. 10.1257/aer.101.2.470
  2. Ahmad NA, Hafiz M, Hamid C, Zainal A, Fairuz M, Rauf A, & Adnan Z (2018). Review of chatbots design techniques. International Journal of Computer Applications, 181(8). https://www.researchgate.net/publication/327097910
  3. Berinsky AJ, Margolis MF, & Sances MW (2014). Separating the shirkers from the workers? Making sure respondents pay attention on self-administered surveys. American Journal of Political Science, 58(3), 739–753. 10.1111/ajps.12081
  4. Burnette CB, Luzier JL, Bennett BL, Weisenmuller CM, Kerr P, Martin S, Keener J, & Calderwood L (2022). Concerns and recommendations for using Amazon MTurk for eating disorder research. International Journal of Eating Disorders, 55(2), 263–272. 10.1002/eat.23614
  5. Charness G, & Villeval MC (2010). Cooperation and competition in intergenerational experiments in the field and the laboratory. http://www.gate.cnrs.fr/perso/villeval/
  6. Fairburn CG, & Beglin SJ (1994). Assessment of eating disorders: Interview or self-report questionnaire? International Journal of Eating Disorders, 16, 363–370.
  7. Franklin S, & Graesser A (2015). Is it an agent, or just a program? A taxonomy for autonomous agents. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1193, 21–35. 10.1007/BFb0013570
  8. Gabrielli J, Borodovsky J, Corcoran E, & Sink L (2020). Leveraging social media to rapidly recruit a sample of young adults aging out of foster care: Methods and recommendations. Children and Youth Services Review, 113, 104960. 10.1016/j.childyouth.2020.104960
  9. Geer D (2005). Malicious bots threaten network security. Computer, 38(1), 18–20. 10.1109/MC.2005.26
  10. Griffin M, Martino RJ, LoSchiavo C, Comer-Carruthers C, Krause KD, Stults CB, & Halkitis PN (2022). Ensuring survey research data integrity in the era of internet bots. Quality and Quantity, 56(4), 2841–2852. 10.1007/s11135-021-01252-1
  11. Levi R, Ridberg R, Akers M, & Seligman H (2021). Survey fraud and the integrity of web-based survey research. American Journal of Health Promotion, 36(1), 18–20. 10.1177/08901171211037531
  12. Ophir Y, Sisso I, Asterhan CSC, Tikochinski R, & Reichart R (2020). The Turker blues: Hidden factors behind increased depression rates among Amazon's Mechanical Turkers. Clinical Psychological Science, 8(1), 65–83. 10.1177/2167702619865973
  13. Orabi M, Mouheb D, Al Aghbari Z, & Kamel I (2020). Detection of bots in social media: A systematic review. Information Processing & Management, 57(4), 102250. 10.1016/j.ipm.2020.102250
  14. Pozzar R, Hammer MJ, Underhill-Blazey M, Wright AA, Tulsky JA, Hong F, Gundersen DA, & Berry DL (2020). Threats of bots and other bad actors to data quality following research participant recruitment through social media: Cross-sectional questionnaire. Journal of Medical Internet Research, 22(10), e23021. 10.2196/23021
  15. Prince KR, Litovsky AR, & Friedman-Wheeler DG (2012). Internet-mediated research: Beware of bots. The Behavior Therapist, 35(5), 85–88.
  16. Sager MA, Kashyap AM, Tamminga M, Ravoori S, Callison-Burch C, & Lipoff JB (2021). Identifying and responding to health misinformation on Reddit dermatology forums with artificially intelligent bots using natural language processing: Design and evaluation study. JMIR Dermatology, 4(2), e20975. 10.2196/20975
  17. Shaw TJ, Cascalheira CJ, Helminen E, Brisbin CD, Jackson SD, Simone M, Sullivan TP, Batchelder AW, & Scheer JR (2023). Yes stormtrooper, these are the droids you're looking for: A method paper evaluating bot detection strategies in online psychological research. PsyArXiv preprint. 10.31234/osf.io/gtp6z
  18. Simone M (2019, November 21). Bots started sabotaging my online research. I fought back. STAT News. https://www.statnews.com/2019/11/21/bots-started-sabotaging-my-online-research-i-fought-back/
  19. Spielberger CD, Gorsuch RL, & Lushene RE (1970). STAI: Manual for the State-Trait Anxiety Inventory. Consulting Psychologists Press. 10.1037/t06496-000
  20. Storozuk A, Ashley M, Delage V, & Maloney EA (2020). Got bots? Practical recommendations to protect online survey data from bot attacks. The Quantitative Methods for Psychology, 16(5), 472–481. 10.20982/tqmp.16.5.p472
  21. Tsvetkova M, García-Gavilanes R, Floridi L, & Yasseri T (2017). Even good bots fight: The case of Wikipedia. PLOS ONE, 12(2), e0171774. 10.1371/journal.pone.0171774
  22. Van Rousselt R (2021). Natural language processing bots. Pro Microsoft Teams Development, 161–185. 10.1007/978-1-4842-6364-8_9
  23. Xu Y, Pace S, Kim J, Iachini A, King LB, Harrison T, DeHart D, Levkoff SE, Browne TA, Lewis AA, Kunz GM, Reitmeier M, Utter RK, & Simone M (2022). Threats to online surveys: Recognizing, detecting, and preventing survey bots. Social Work Research. 10.1093/swr/svac023
  24. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, & Le QV (2019). XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237. https://github.com/zihangdai/xlnet
  25. Yarrish C, Groshon L, Mitchell JK, Appelbaum AI, Klock S, Winternitz T, & Friedman-Wheeler DG (2019). Finding the signal in the noise: Minimizing responses from bots and inattentive humans in online research. The Behavior Therapist, 42(7), 235–242. https://www.researchgate.net/publication/336564144_Finding_the_Signal_in_the_Noise_Minimizing_Responses_From_Bots_and_Inattentive_Humans_in_Online_Research
