Table 2.
Dataset information
| | Feedback data | Election data | Reddit data | Hate speech data |
|---|---|---|---|---|
| Texts | 2,000 | 3,832<sup>a</sup> | 10,000<sup>b</sup> | 24,783<sup>c</sup> |
| Avg. words | 13.3 (SD = 11.9) | 14.8 (SD = 6.7) | 13.3 (SD = 6.8) | 13.6 (SD = 7.0) |
| Coded for | Motives | Moral norms | Emotions | Hate speech |
| Categories | 7 | 10 | 6 | 3 |
| Classification problem | Multi-label<sup>d</sup> | Multi-label<sup>d</sup> | Multi-label<sup>d</sup> | Multi-class<sup>d</sup> |
| Data source | (Norbutas et al., 2020) | (Hoover et al., 2020) | (Demszky et al., 2020) | (Davidson et al., 2017) |
| Platform | Online market | Twitter | Reddit | Twitter |
<sup>a</sup> The original election dataset coded by Hoover et al. (2020) contained 5,358 tweets. However, we were unable to retrieve all of the original tweets via the Twitter API; some tweets, or the accounts that posted them, appeared to have been deleted at the time of our retrieval (third quarter of 2021).
<sup>b</sup> The original dataset contains 58,011 coded Reddit comments; we selected a random subsample for computational feasibility. We chose a sample of 10,000 texts as an intermediate size between the smaller Feedback and Election datasets and the larger Hate speech dataset.
<sup>c</sup> The original paper mentions 24,802 texts; we obtained 24,783 tweets from the publicly available dataset.
<sup>d</sup> Multi-label means that a single text can be assigned several categories at once; multi-class means that a text can belong to only one of several categories.
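
The distinction between multi-label and multi-class problems can be illustrated by how the classification target for a single text might be encoded. The following is a minimal sketch and not the encoding used for the datasets above; the category names and the helper functions `encode_multi_label` and `encode_multi_class` are hypothetical and chosen only for illustration.

```python
# Hypothetical category names, used only for illustration;
# they are not the coding schemes of the datasets in Table 2.
CATEGORIES = ["anger", "joy", "sadness"]


def encode_multi_label(labels, categories=CATEGORIES):
    """Multi-label target: one 0/1 indicator per category.

    Any number of categories (including none) may apply to a text.
    """
    present = set(labels)
    unknown = present - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in present else 0 for category in categories]


def encode_multi_class(label, categories=CATEGORIES):
    """Multi-class target: a single class index.

    Exactly one category applies to a text.
    """
    return categories.index(label)


# A text coded with two categories at once (multi-label).
multi_label_target = encode_multi_label(["anger", "sadness"])
# A text coded with exactly one category (multi-class).
multi_class_target = encode_multi_class("joy")

print(multi_label_target)  # [1, 0, 1]
print(multi_class_target)  # 1
```

In the multi-label case the target is a vector with one entry per category, so several entries can equal 1 for the same text; in the multi-class case the target is a single index, so the categories are mutually exclusive.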