Abstract
Crowdsourced methods of data collection such as Amazon Mechanical Turk (MTurk) have been widely adopted in addiction science. Recent reports suggest an increase in poor quality data on MTurk, posing a challenge to the validity of findings. However, empirical investigations of data quality in addiction-related samples are lacking. In this study of individuals with alcohol use disorder (AUD), we compared poor quality delay discounting data to randomly generated data. We reanalyzed previously published delay discounting data, comparing the included, excluded, and randomly generated samples, with non-systematic criteria implemented as the measure of data quality. The excluded data was statistically different from the included sample but did not differ from randomly generated data on multiple metrics. Moreover, a response bias was identified in the excluded data. This study provides empirical evidence that poor quality delay discounting data in an AUD sample is not statistically different from randomly generated data, suggesting that data quality concerns on MTurk persist in addiction samples. These findings support the use of rigorous, a priori defined criteria to remove poor quality data post hoc. They also highlight that using non-systematic delay discounting criteria to remove poor quality data is rigorous and not simply a way of removing data that does not conform to an expected theoretical model.
Keywords: crowdsourcing, Amazon Mechanical Turk, data quality, alcohol, delay discounting
Online crowdsourced data collection methods are versatile tools for behavioral science and have been widely adopted to study addiction-related phenomena. Platforms such as Amazon Mechanical Turk (MTurk; www.mturk.com) have grown increasingly popular and offer several benefits that complement more traditional research strategies. In general, crowdsourcing allows for efficient and cost-effective data collection, the potential to engage more diverse samples across greater geographic areas, and continuity of research during adverse circumstances (e.g., COVID-19) when conducting in-lab research may be challenging. Regarding addiction science specifically, the large pool of potential participants can aid in the development of novel measures and interventions, comparisons with and/or replication of in-lab research findings, and longitudinal investigations. For more in-depth reviews, see Strickland and Stoops (2019) and Mellis and Bickel (2020).
However, the relative anonymity of the online environment and the lack of face-to-face interaction with participants pose a challenge to data quality and interpretation. Starting in 2018, anecdotal reports of an increase in poor quality data on MTurk, possibly due to a proliferation of automated response methods (i.e., bots), received considerable attention (Dreyfuss, 2018; Stokel-Walker, 2018). Indeed, a subsequent investigation provided empirical evidence for decreases in data quality during this time period (Chmielewski & Kucker, 2020). Moreover, some investigators concluded that this decline in data quality may have begun as early as 2015, driven in part by international individuals using virtual private servers to circumvent US location criteria (Kennedy et al., 2020). However, investigations of data quality on MTurk remain limited in scope, and to our knowledge, none have specifically investigated data quality in addiction-related samples.
In this study, via secondary analysis of previously published data from individuals with alcohol use disorder (AUD), we test the hypothesis that poor quality delay discounting data does not differ from random responses. Poor quality data was defined as responses that violated systematic delay discounting criteria (Johnson & Bickel, 2008). Our primary goal was to investigate how delay discounting data identified as non-systematic compares to randomly generated data. This study also used attention check questions as an exclusion method, and we investigated the overlap and specificity of these two quality control methods.
Method
In this section we report how our sample size was determined, all data exclusions, all manipulations, and all study measures. The data, study materials, and analysis code are available upon request by emailing the corresponding author. This study was not preregistered. This study consisted of secondary analyses of previously published data, with the full study design and details available in Craft et al. (2021); therefore, the sample size was determined by the number of individuals who completed that prior study (n = 211). Briefly, individuals meeting criteria for AUD on the Alcohol Use Disorders Identification Test (AUDIT) in a standalone screener were recruited via Amazon Mechanical Turk for a within-subject investigation of the effects of scarcity narratives on delay discounting. After screening, participants completed a reCAPTCHA to guard against automated responding and were then randomly assigned to one of two narrative intervention groups (job or storm). In each of these groups, participants engaged with a narrative of sudden economic scarcity (job loss or hurricane) and a corresponding control (job neutral or mild storm). Participants then completed two delay discounting tasks during which they were instructed to imagine experiencing the effects of each narrative. An adjusting amount delay discounting task was used (Du et al., 2002) in which participants made repeated choices between an immediately available amount of money ($500) and a larger amount available after a delay ($1000). This task was repeated across seven delays (1 day, 1 week, 1 month, 3 months, 1 year, 5 years, 25 years) to yield a discounting curve representing the decline in subjective monetary value as a function of delay to receipt. The presentation of the immediate and delayed options was randomized during each trial (i.e., appearing on the right or left side of the screen). The data analyzed in this study were collected on September 15–17, 2020. Two data quality checks were utilized in Craft et al. (2021): 1) application of non-systematic criteria to delay discounting data (Johnson & Bickel, 2008) and 2) attention check questions during the delay discounting tasks (e.g., “Would you prefer $0.00 now or $1000 now?”). As reported in the prior study, 50 individuals failed attention check questions and 80 individuals violated systematic delay discounting criteria, leaving an included sample of n = 81 and an excluded sample of n = 130. For the present investigation, the two active (job loss or hurricane) and two control (job neutral or mild storm) conditions were collapsed into single “Active” and “Control” conditions.
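To make the task structure concrete, the following is a minimal sketch in R (the language used for the analyses) of the adjusting-amount logic described above. The six-trial adjustment schedule and the halving of the adjustment step are assumptions based on the typical implementation of Du et al. (2002), not details taken from the study materials.

```r
# Minimal sketch of the adjusting-amount procedure (assumed schedule):
# the immediate amount starts at half the delayed amount and is adjusted
# up or down by half of the previous adjustment after each choice; the
# value after the final adjustment serves as the indifference point.
adjusting_amount <- function(choose_immediate, delayed = 1000, n_trials = 6) {
  immediate <- delayed / 2            # first trial: $500 now vs. $1000 later
  step      <- delayed / 4            # first adjustment is $250
  for (trial in seq_len(n_trials)) {
    if (choose_immediate(immediate, delayed)) {
      immediate <- immediate - step   # immediate chosen: make it less attractive
    } else {
      immediate <- immediate + step   # delayed chosen: make immediate more attractive
    }
    step <- step / 2                  # adjustment halves on every trial
  }
  immediate                           # indifference point for this delay
}

# Example: a purely impulsive responder who always takes the immediate option
adjusting_amount(function(immediate, delayed) TRUE)
```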
Random Data Generation
Random dummy data were generated using the “Generate Test Responses” feature in Qualtrics (Qualtrics Survey Software, Provo, Utah). A sample of n = 130 responses was generated to match the sample size of the excluded data.
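For readers without access to this Qualtrics feature, a comparable set of random responders can be simulated by feeding coin-flip choices through the adjusting-amount sketch shown earlier; this is a hypothetical analogue for illustration only, not the procedure used to generate the data analyzed here.

```r
# Hypothetical analogue of Qualtrics' "Generate Test Responses": simulate
# n = 130 responders who answer every trial at random (reuses adjusting_amount
# from the earlier sketch).
set.seed(2020)
delays <- c(1, 7, 30, 90, 365, 1825, 9125)    # the seven task delays, in days
random_sample <- replicate(
  130,
  sapply(delays, function(d) adjusting_amount(function(imm, del) runif(1) < 0.5))
)
dim(random_sample)   # 7 indifference points x 130 simulated responders
```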
Statistical Analysis
Demographic characteristics were compared between the included and excluded samples using t-tests, chi-square tests, and nonparametric methods where appropriate. Area under the curve (AUC) was used as the measure of discounting rate (Myerson et al., 2001). AUC represents the total area under the curve plotting the subjective value of $1000 (y-axis) as a function of the delay to receipt (x-axis). AUC values range from 0 (maximum discounting of the delayed reward) to 1 (no discounting of the delayed reward). Paired t-tests were used to compare the within-subject effects of the active and control scarcity narratives on delay discounting rates for the total, included, excluded, and random groups. An analysis of variance (ANOVA) approach was used to compare discounting rates in the control and active conditions among the total, included, excluded, and random groups. To characterize the pattern of responding on the delay discounting tasks, participants’ choices were scored to calculate how many times the right versus left option was chosen. Linear regression was performed to predict the number of right choices from group (included, excluded, and random). Data were analyzed and graphed in R version 4.0.4 (R Core Team, 2021) and GraphPad Prism version 9.2.0 (GraphPad Software, San Diego, CA).
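As an illustration of the AUC calculation, the following is a minimal R sketch of the trapezoidal method of Myerson et al. (2001), with delays normalized by the longest delay and indifference points by the delayed amount. Anchoring the curve at the full $1000 value at zero delay is a common convention and an assumption here; the indifference points shown are illustrative only.

```r
# Trapezoidal area under the curve (Myerson et al., 2001): normalize delays by
# the maximum delay and indifference points by the delayed amount, then sum the
# areas of the trapezoids between successive points. Values range from 0
# (maximum discounting) to 1 (no discounting).
discounting_auc <- function(delays, indifference, amount = 1000) {
  x <- c(0, delays) / max(delays)         # assumed anchor at zero delay
  y <- c(amount, indifference) / amount   # subjective value as a proportion
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Illustrative example: subjective value declining across the seven task delays
delays <- c(1, 7, 30, 90, 365, 1825, 9125)       # in days
indiff <- c(950, 900, 800, 700, 500, 300, 150)   # hypothetical indifference points
discounting_auc(delays, indiff)
```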
Results
Demographic characteristics for the total, included, and excluded samples are presented in Table 1. The total sample was 33.4 years old on average, 34.6% female, 86.7% White, and 79.1% non-Hispanic. When stratifying the sample by inclusion status in Craft et al. (2021), two statistically significant differences were observed: the excluded sample had a higher proportion of individuals reporting Hispanic ethnicity (p=0.002) and a higher level of education (p<0.001) relative to the included sample.
Table 1.
Sample characteristics (N = 211).
| Characteristics: Mean (SD) / N (%) | Total N = 211 | Included N = 81 | Excluded N = 130 | P value^a |
|---|---|---|---|---|
| Age, years^b | 33.43 (9.03) | 33.65 (9.59) | 33.29 (8.69) | 0.778 |
| Gender^c | | | | 0.234 |
| Female | 73 (34.6) | 24 (29.6) | 49 (37.7) | |
| Male | 137 (64.9) | 56 (69.1) | 81 (62.3) | |
| Refuse to answer | 1 (0.5) | 1 (1.2) | 0 (0.0) | |
| Race^c | | | | 0.291 |
| Asian | 4 (1.9) | 1 (1.2) | 3 (2.3) | |
| Black or African American | 20 (9.5) | 11 (13.6) | 9 (6.9) | |
| White | 183 (86.7) | 67 (82.7) | 116 (89.2) | |
| Two or more races | 2 (1.0) | 1 (1.2) | 1 (0.8) | |
| Other | 1 (0.5) | 0 (0.0) | 1 (0.8) | |
| Refuse to answer | 1 (0.5) | 1 (1.2) | 0 (0.0) | |
| Ethnicity^c | | | | 0.002 |
| Hispanic | 42 (19.9) | 6 (8.2) | 36 (27.7) | |
| Non-Hispanic | 167 (79.1) | 74 (90.6) | 93 (71.5) | |
| Refuse to answer | 2 (1.0) | 1 (1.2) | 1 (0.8) | |
| Education level^c | | | | <0.001 |
| High school diploma or GED | 12 (5.7) | 12 (14.8) | 0 (0.0) | |
| Associate or Bachelor’s degree | 138 (65.4) | 47 (58.0) | 91 (70.0) | |
| Master’s degree | 55 (26.1) | 20 (24.7) | 35 (26.9) | |
| Professional degree (MD, JD, DDS) | 5 (2.4) | 1 (1.2) | 4 (3.1) | |
| Refuse to answer | 1 (0.5) | 1 (1.2) | 0 (0.0) | |
| Annual income^c | | | | 0.504 |
| Less than $30,000 | 20 (9.5) | 4 (5.0) | 16 (12.3) | |
| $30,000 to $49,999 | 38 (18.0) | 18 (22.2) | 20 (15.4) | |
| $50,000 to $69,999 | 46 (21.8) | 15 (18.5) | 31 (23.8) | |
| $70,000 to $89,999 | 63 (29.9) | 25 (30.9) | 38 (29.2) | |
| $90,000 or greater | 43 (20.4) | 18 (22.2) | 25 (19.3) | |
| Refuse to answer | 1 (0.5) | 1 (1.2) | 0 (0.0) | |

^a P value represents the comparison between the included and excluded samples.
^b Mean (standard deviation).
^c N (percentage).
Discounting curves and AUC values for the total, included, excluded, and random samples are shown in Figure 1. Paired t-tests indicated that discounting rates in the control and active conditions were significantly different for the included group (p=0.001), but not for the total sample (p=0.35), the excluded sample (p=0.61), or the random sample (p=0.66).
Figure 1.

Summary of discounting data for the total, included, excluded, and randomly generated samples. Mean (±SEM) discounting curves (left panel) and area under the curve values (right panel) in the active and control narrative conditions for the (A) total, (B) included, (C) excluded, and (D) randomly generated samples. SEM: standard error of the mean. Note: ns indicates a non-significant comparison; ** indicates p = 0.001.
ANOVA indicated that, in the control condition, discounting rates were significantly different between the included and excluded samples (p<0.001) and between the included and random samples (p<0.001), but not between the excluded and random samples (p=0.09). Similarly, in the active condition, discounting rates were significantly different between the included and excluded samples (p<0.001) and between the included and random samples (p<0.001), but not between the excluded and random samples (p=0.25).
In Craft et al. (2021), data was first excluded for failing attention checks in the discounting tasks (n=50) and then for violation of systematic delay discounting criteria (n=80). To determine the proportion of data representing non-systematic responding, we applied the Johnson & Bickel criteria to the full sample of excluded participants (n=130). Of these 130, 126 violated systematic criteria, with 77 violating criteria in two tasks, 49 violating in one task, and 4 passing in both tasks. This same procedure was applied to the random data, with 93 violating in two tasks, 34 violating in one task, and 3 passing in both tasks. Similar proportions passed in both tasks for the excluded and random data, 3.1% and 2.3%, respectively. Fisher’s exact test indicated that the distribution of participants violating systematic criteria in the delay discounting tasks was significantly different between the included and excluded samples (p<0.001) as well as between the included and random samples (p<0.001) but was not different between the excluded and random samples (p=0.103).
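For reference, the following is a minimal R sketch of the two Johnson and Bickel (2008) criteria as applied to a single task’s indifference points; the 20% and 10% thresholds are the defaults described in that paper, and the example indifference points are illustrative only.

```r
# Johnson & Bickel (2008) nonsystematic-data check for one task, given
# indifference points ordered from the shortest to the longest delay.
# Criterion 1: any indifference point exceeds the preceding one by more than
#              20% of the delayed amount.
# Criterion 2: the last indifference point is not lower than the first by at
#              least 10% of the delayed amount.
violates_jb <- function(indiff, amount = 1000) {
  c1 <- any(diff(indiff) > 0.20 * amount)
  c2 <- (indiff[1] - indiff[length(indiff)]) < 0.10 * amount
  c(criterion1 = c1, criterion2 = c2, nonsystematic = c1 | c2)
}

violates_jb(c(950, 900, 800, 700, 500, 300, 150))   # orderly discounting: passes
violates_jb(c(400, 750, 300, 900, 200, 650, 500))   # erratic pattern: flagged
```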
The linear regression results indicated that group significantly predicted the number of right choices (Type III sum of squares ANOVA: F = 7.12, p < 0.001). Estimated marginal means indicated that the excluded sample chose the right button significantly more often than the included sample (t = 3.426, p = 0.002) and the random sample (t = 2.922, p = 0.010). The included and random samples did not differ in the number of right choices (t = −0.866, p = 0.662), and neither differed statistically from random utilization of the left and right choices (i.e., 42 of each choice; 95% CI included: 40.6–43.4; 95% CI random: 41.7–43.9; 95% CI excluded: 44.0–46.2).
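A minimal sketch of this side-bias analysis is shown below, assuming a data frame with one row per participant containing the count of right-hand choices (out of 84 randomized-side trials, so that 42 corresponds to chance, consistent with the value reported above) and a group label. The data here are simulated stand-ins, and the car and emmeans packages are assumed to be installed.

```r
library(car)       # Type III sums of squares
library(emmeans)   # estimated marginal means and pairwise contrasts

# Simulated stand-in data (not the study data): number of right-hand choices
# out of 84 randomized-side trials for each participant, by group.
set.seed(1)
df <- data.frame(
  group   = rep(c("included", "excluded", "random"), times = c(81, 130, 130)),
  n_right = c(rbinom(81, 84, 0.50), rbinom(130, 84, 0.54), rbinom(130, 84, 0.50))
)

fit <- lm(n_right ~ group, data = df)
Anova(fit, type = "III")          # does group predict the number of right choices?
emmeans(fit, pairwise ~ group)    # pairwise contrasts between groups
# Chance responding corresponds to 42 right choices; compare each group's
# estimated marginal mean and confidence interval against that value.
```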
Discussion
We report empirical evidence that poor quality data does not differ from randomly generated data in an addiction-relevant sample. Our analysis indicated that data excluded due to a priori defined criteria was strikingly similar in appearance to, and not statistically different from, randomly generated data on two dimensions: area under the curve and the pattern of systematic criteria violation. These findings buttress previous reports of low quality responding on Amazon Mechanical Turk by showing evidence of poor quality data in individuals with AUD, highlighting that broad concerns regarding data quality on MTurk extend to addiction-related samples as well. Additionally, these findings suggest that the data quality concerns first widely discussed in 2018 were likely still present as of the September 2020 collection of the data presented here. Moreover, our primary analyses support the hypothesis that a cause of this poor data quality may be random responding. In addition, secondary analysis of the pattern of discounting task responding identified a response bias in the excluded sample, suggesting that the excluded individuals may have been behaving in a manner orthogonal to the task; that is, their responding was unrelated to delay discounting. We discuss the further relevance of these findings below.
A key takeaway from these findings is that, given the decreased control investigators have over participants and the concerns over data quality, in the absence of methods to screen out poor quality participants before a study, investigators should be comfortable excluding a higher proportion of data than might otherwise be expected. Simply put, our empirical evidence suggests that less is more in the case of crowdsourced data. All things being equal, exclusion of data for exclusion’s sake would not be in keeping with sound scientific principles. However, if such data can be identified as poor quality (i.e., random or biased responding), exclusion of those data is scientifically rigorous. Though post hoc data exclusion is one method to improve the quality of collected data, other strategies such as screening for English language proficiency, reCAPTCHA, attention checks, prescreening surveys, and their combinations may be employed. Investigations in non-substance-use samples have highlighted the use of prescreening surveys to exclude participants who may be likely to produce poor quality data (Hydock, 2018) and the exclusion of participants based on suspicious IP addresses (Kennedy et al., 2020), and these methods could be implemented in addiction research as appropriate. However, findings supporting a shift of data quality control methods, such as attention checks, to a prescreening survey are still forthcoming in addiction research. Further work is necessary to understand whether attention checks embedded in a survey measure (e.g., the delay discounting attention checks described here) have the same utility when used as standalone prescreening questions.
In addition to supporting the exclusion of poor quality data, our results support the use of the Johnson and Bickel criteria (Johnson & Bickel, 2008) to identify non-systematic delay discounting. These criteria have been widely adopted in behavioral science (Smith et al., 2018). While some have criticized their use as a method of excluding data that does not fit neatly into a particular model (i.e., a hyperbolic discounting function) (Bailey et al., 2021), our results provide empirical support that data identified as non-systematic is in fact of poor quality, not merely cherry-picked to advance a theoretical framework. The random responding observed in this study could be a function of participants’ alcohol use, a desire to finish the study as quickly as possible, or bots, though this last option may be unlikely given the inclusion of reCAPTCHA. Interestingly, while the Johnson and Bickel criteria were developed in the framework of a hyperbolic discounting function, the findings reported here suggest a broader utility for these criteria, given the lack of differences between data excluded for violating them and randomly generated data. Specifically, these criteria could still be implemented as a quality check in situations where a different discounting framework is applied (e.g., area under the curve or Bayesian approaches). In this study, application of the non-systematic criteria identified 97% of the total excluded sample (126 of 130), whereas the delay discounting attention checks identified 38.5% of the total excluded sample (50 of 130). This overlap suggests that using this type of attention check in combination with non-systematic delay discounting criteria may be effectively redundant, though further replications of this finding are needed for a more definitive conclusion. Lastly, while delay discounting was the only outcome assessed in this study, it is possible that the random responding observed here carries over to other tasks, and therefore inclusion of a discounting measure could represent a novel or even “covert” quality check that researchers could include in their crowdsourced studies.
We acknowledge the potential limitations of these findings. First, this sample only included individuals meeting criteria for AUD, and it is therefore unknown whether these findings generalize to populations of individuals using substances other than alcohol or to non-substance-using populations. Second, these data were collected via Amazon Mechanical Turk, and their relevance to other crowdsourcing platforms (e.g., Prolific, Qualtrics Panels), as well as to research conducted in-lab, is unknown. Third, delay discounting was the only measure evaluated in this study (via an adjusting amount task), so the relevance of this responding to other measures or other types of delay discounting tasks is unknown. However, one could speculate that random responding on discounting tasks would correlate with responding on other measures, which may lack a similar quality control. Further study of this phenomenon in similar crowdsourced addiction studies will help determine the generalizability of these findings. Lastly, the precise relevance of the demographic differences between the included and excluded samples (i.e., a higher proportion of Hispanic ethnicity and greater educational attainment in the excluded sample) to these findings is unknown. Many potential reasons for this observation may exist, including, but not limited to, lack of English proficiency, random or fraudulent responding, and reading comprehension. Future investigations of these differences will help to clarify their relevance.
Conclusions
This study provides an empirical demonstration that, in a sample of individuals with AUD, data identified as poor quality did not differ from randomly generated data. Additionally, the response bias identified in the excluded data suggests behavior that is orthogonal to the task. These findings reinforce previous concerns about a data quality “crisis” on MTurk and identify this as a challenge for addiction researchers specifically. Our findings support the use of a priori defined exclusion criteria to rigorously screen out poor quality data.
Supplementary Material
Public Health Significance.
This study provides empirical evidence that poor quality delay discounting data does not differ from random responding in a sample of individuals with alcohol use disorder. This highlights that previous reports of poor quality data on Amazon Mechanical Turk extend to addiction-related samples. Thus, the use of rigorous data quality controls and exclusion of poor quality data are warranted to ensure high quality scientific findings.
Disclosures and Acknowledgments
This work was supported by NIAAA R01AA027381 and the Fralin Biomedical Research Institute at VTC. The funders had no role other than financial support.
All authors contributed in a significant way and all authors have read and approved the final manuscript.
Although the following activities/relationships do not create a conflict of interest pertaining to this manuscript, in the interest of full disclosure, Dr. Bickel would like to report the following: W. K. Bickel is a principal of HealthSim, LLC; BEAM Diagnostics, Inc.; and Red 5 Group, LLC. In addition, he serves on the scientific advisory board for Sober Grid, Inc.; and Ria Health; serves as a consultant for Boehringer Ingelheim International; and works on a project supported by Indivior, Inc. None of the other authors have any disclosures to report.
Footnotes
The findings of this manuscript have not been previously presented or published. However, some of the data analyzed in this manuscript were presented as a poster at the 84th Annual Scientific Meeting of The College on Problems of Drug Dependence and published as a manuscript in Experimental and Clinical Psychopharmacology. The data, study materials, and analysis code are available upon request by emailing the corresponding author. This study was not preregistered.
References
- Bailey AJ, Romeu RJ, & Finn PR (2021). The problems with delay discounting: A critical review of current practices and clinical applications. Psychological Medicine, 1–8.
- Chmielewski M, & Kucker SC (2020). An MTurk crisis? Shifts in data quality and the impact on study results. Social Psychological and Personality Science, 11(4), 464–473.
- Craft WH, Tegge AN, & Bickel WK (2021). Narrative theory IV: Within-subject effects of active and control scarcity narratives on delay discounting in alcohol use disorder. Experimental and Clinical Psychopharmacology. https://doi.org/10.1037/pha0000478
- Dreyfuss E (2018, August 17). A bot panic hits Amazon’s Mechanical Turk. Wired. https://www.wired.com/story/amazon-mechanical-turk-bot-panic/
- Du W, Green L, & Myerson J (2002). Cross-cultural comparisons of discounting delayed and probabilistic rewards. The Psychological Record, 52(4), 479–492.
- Hydock C (2018). Assessing and overcoming participant dishonesty in online data collection. Behavior Research Methods, 50(4), 1563–1567.
- Johnson MW, & Bickel WK (2008). An algorithm for identifying nonsystematic delay-discounting data. Experimental and Clinical Psychopharmacology, 16(3), 264–274.
- Kennedy R, Clifford S, Burleigh T, Waggoner PD, Jewell R, & Winter NJG (2020). The shape of and solutions to the MTurk quality crisis. Political Science Research and Methods, 8(4), 614–629.
- Mellis AM, & Bickel WK (2020). Mechanical Turk data collection in addiction research: Utility, concerns and best practices. Addiction. https://doi.org/10.1111/add.15032
- Myerson J, Green L, & Warusawitharana M (2001). Area under the curve as a measure of discounting. Journal of the Experimental Analysis of Behavior, 76(2), 235–243.
- R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Smith KR, Lawyer SR, & Swift JK (2018). A meta-analysis of nonsystematic responding in delay and probability reward discounting. Experimental and Clinical Psychopharmacology, 26(1), 94–107.
- Stokel-Walker C (2018, August 10). Bots on Amazon’s Mechanical Turk are ruining psychology studies. New Scientist. https://www.newscientist.com/article/2176436-bots-on-amazons-mechanical-turk-are-ruining-psychology-studies/
- Strickland JC, & Stoops WW (2019). The use of crowdsourcing in addiction science research: Amazon Mechanical Turk. Experimental and Clinical Psychopharmacology, 27(1), 1–18.