Table 2. Comparison of studies that examined cross-dataset (“cross-domain”) generalisation, by the datasets and models they used.
Dataset types: H: hate speech, O: other offensive language, *: contains subtypes. Most studies carried out cross-dataset experiments, training and testing each model on all datasets. The exceptions are: Gröndahl et al. (2018) and Nejadgholi & Kiritchenko (2020) used different datasets for training and testing; Fortuna, Soler & Wanner (2020) compared datasets through class vector representations.
| Study | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset name | Type | Karan & Šnajder (2018) | Gröndahl et al. (2018) | Waseem (2016) | Wiegand, Ruppenhofer & Kleinbauer (2019) | Swamy, Jamatia & Gambäck (2019) | Pamungkas & Patti (2019), Pamungkas, Basile & Patti (2020) | Arango, Pérez & Poblete (2020) | Fortuna, Soler & Wanner (2020) | Caselli et al. (2020) | Nejadgholi & Kiritchenko (2020) | Glavaš, Karan & Vulić (2020) | Fortuna, Soler-Company & Wanner (2021) |
| Waseem | H* | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Davidson | H,O | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||
| Founta | H,O | ✓ | ✓ | ✓ | ✓ | ||||||||
| HatEval | H* | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||
| Kaggle | H,O* | ✓ | ✓ | ✓ | ✓ | ||||||||
| Gao | H | ✓ | ✓ | ||||||||||
| AMI | H* | ✓ | ✓ | ✓ | |||||||||
| Warner | H | ✓ | |||||||||||
| Zhang | H | ✓ | |||||||||||
| Stormfront | H | ✓ | |||||||||||
| TRAC | O | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||
| Wulczyn | O | ✓ | ✓ | ✓ | ✓ | ||||||||
| OLID | O | ✓ | ✓ | ✓ | ✓ | ||||||||
| AbuseEval | O* | ✓ | |||||||||||
| Kolhatkar | O | ✓ | |||||||||||
| Razavi | O | ✓ | |||||||||||
| Golbeck | O | ✓ | |||||||||||
| Model | | SVM | LR, MLP, LSTM, CNN-GRU | MLP | fastText | BERT | SVM, LSTM, GRU, BERT | LSTM, GBDT | N/A | BERT | BERT | BERT, RoBERTa | BERT, ALBERT, fastText, SVM |