2024 Apr 4;56(4):2782–2803. doi: 10.3758/s13428-024-02381-9

Table 2. Dataset information

| | Feedback data | Election data | Reddit data | Hate speech data |
|---|---|---|---|---|
| Texts | 2,000 | 3,832ᵃ | 10,000ᵇ | 24,783ᶜ |
| Avg. words | 13.3 (SD = 11.9) | 14.8 (SD = 6.7) | 13.3 (SD = 6.8) | 13.6 (SD = 7.0) |
| Coded for | Motives | Moral norms | Emotions | Hate speech |
| Categories | 7 | 10 | 6 | 3 |
| Classification problem | Multi-labelᵈ | Multi-labelᵈ | Multi-labelᵈ | Multi-classᵈ |
| Data source | (Norbutas et al., 2020) | (Hoover et al., 2020) | (Demszky et al., 2020) | (Davidson et al., 2017) |
| Platform | Online market | Twitter | Reddit | Twitter |

ᵃ The original election dataset coded by Hoover et al. (2020) contained 5,358 tweets. However, we were unable to retrieve all of the original tweets via the Twitter API; some of the tweets, or the accounts that posted them, appeared to have been deleted by the time of our retrieval (third quarter of 2021).

ᵇ The original dataset contains 58,011 coded texts; we selected a random subsample for computational feasibility. We chose a sample of 10,000 texts as an intermediate size between the smaller Feedback and Election datasets and the larger Hate speech dataset.

ᶜ The original paper mentions 24,802 texts; we obtained 24,783 tweets from the publicly available dataset.

ᵈ Multi-label means that a single text can be assigned multiple categories at once; multi-class means that a text belongs to exactly one of several categories.
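
The difference between the two classification problems shows up in how the target for a single text is encoded. The following is a minimal sketch, not the encoding used in the study: the category names and both helper functions are hypothetical and chosen only for illustration. A multi-class target is a single category index, whereas a multi-label target is a 0/1 indicator vector in which several entries (or, in this sketch, none) may be set.

```python
# Hypothetical category names, for illustration only; they are not the
# categories of any dataset in Table 2.
CATEGORIES = ["cat_a", "cat_b", "cat_c"]


def encode_multi_class(label, categories=CATEGORIES):
    """Multi-class: a text has exactly one category, stored as its index."""
    return categories.index(label)  # raises ValueError for an unknown label


def encode_multi_label(labels, categories=CATEGORIES):
    """Multi-label: a text may have several categories, stored as 0/1 flags."""
    present = set(labels)
    unknown = present - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if c in present else 0 for c in categories]


# One category per text (multi-class).
multi_class_target = encode_multi_class("cat_b")

# Two categories on the same text (multi-label).
multi_label_target = encode_multi_label(["cat_a", "cat_c"])

# A text with no category at all; whether this can occur depends on the
# coding scheme and is an assumption of this sketch.
empty_target = encode_multi_label([])
```

Under this encoding, a multi-label problem can be treated as one binary decision per category, while a multi-class problem is a single choice among all categories.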