2024 Apr 4;56(4):2782–2803. doi: 10.3758/s13428-024-02381-9

Table 2.

Dataset information

	Feedback data	Election data	Reddit data	Hate speech data
Texts	2,000	3,832^a	10,000^b	24,783^c
Avg. words	13.3 (SD = 11.9)	14.8 (SD = 6.7)	13.3 (SD = 6.8)	13.6 (SD = 7.0)
Coded for	Motives	Moral norms	Emotions	Hate speech
Categories	7	10	6	3
Classification problem	Multi-label^d	Multi-label^d	Multi-label^d	Multi-class^d
Data source	(Norbutas et al., 2020)	(Hoover et al., 2020)	(Demszky et al., 2020)	(Davidson et al., 2017)
Platform	Online market	Twitter	Reddit	Twitter

^a The original election dataset coded by Hoover et al. (2020) contained 5,358 tweets. However, we were unable to retrieve all of the original tweets via the Twitter API; some tweets, or the accounts that posted them, appeared to have been deleted at the time of our retrieval (third quarter of 2021).

^b The original dataset contains 58,011 coded Reddit comments; we selected a random subsample for computational feasibility. We chose a sample of 10,000 texts as an intermediate size between the smaller Feedback and Election datasets and the larger Hate speech dataset.

^c The original paper mentions 24,802 texts; we obtained 24,783 tweets from the publicly available dataset.

^d Multi-label means that a single text can be assigned several categories at once; multi-class means that each text belongs to exactly one of several categories.
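
To make the distinction in note d concrete, the following minimal sketch shows one common way the two kinds of targets could be encoded. It is an illustration only and is not taken from the paper: the category names in `CATEGORIES` and the helper functions `encode_multi_label` and `encode_multi_class` are hypothetical.

```python
# Hypothetical category names, for illustration only (not from the datasets).
CATEGORIES = ["cat_a", "cat_b", "cat_c"]


def encode_multi_label(labels):
    """Return a binary indicator vector; any number of categories may be active."""
    active = set(labels)
    unknown = active - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in active else 0 for category in CATEGORIES]


def encode_multi_class(label):
    """Return a single integer index; each text has exactly one category."""
    return CATEGORIES.index(label)


# Multi-label: one text coded with two categories at once.
print(encode_multi_label(["cat_a", "cat_c"]))
# Multi-label: a text coded with no category is also valid.
print(encode_multi_label([]))
# Multi-class: one text, one category.
print(encode_multi_class("cat_b"))
```

Under this assumed encoding, a multi-label target is a vector with one 0/1 entry per category, whereas a multi-class target is a single category index.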