Table 6. Description of the dataset for low-resource languages.
| Article | Language | Source | Number of categories | Categories | Size | Labeling method | Balanced? | Balancing technique used? | After balancing | Data split |
|---|---|---|---|---|---|---|---|---|---|---|
| Cruz & Cheng (2020) | Filipino | X Platform | 2 | Hate; Non-hate | 10,000 | N/A | Relatively | N/A | – | N/A |
| Rahman et al. (2020) | Bengali | Open source | | Include crime | | Annotated | No | No | – | Train: 80%; Test: 20% |
| Djandji et al. (2020) | Arabic | X Platform | 2 | Offensive; Not offensive | 10,000 | Annotated | No | Yes, over/undersampling | – | Train: 70%; Test: 20%; Development: 10% |
| Kittask, Milintsevich & Sirts (2020) | Estonian | Postimees Estonian newspaper | 4 | Include crime Topic: Negative; Ambiguous; Positive; Neutral | 4,088 | Annotated with sentiment and rubric labels | No | No | – | Train: 70%; Test: 20%; Development: 10% |
| Mubarak et al. (2020) | Arabic | X Platform | 4 | Offensive; Vulgar; Hate speech; Clean | 10,000 | Experienced annotator | No | No | – | N/A |
| Aluru et al. (2021) | 1. Indonesian 2. Polish 3. Arabic | Public dataset from X Platform | 2 | Hate speech; Normal | 13,169 & 713; 9,788; 4,120 & 1,670 | Annotated | Yes; No; No & Yes | No | – | Train: 70%; Test: 20%; Validation: 10% |
| Arcila-Calderón et al. (2022) | 1. Greek 2. Italian | PHARM datasets | 2 | Racist/xenophobic hate tweets; Non-racist/xenophobic hate tweets | 1. 10,399 2. 10,752 | Manual | Yes | – | – | N/A |
| Modha, Majumder & Mandl (2022) | Hindi | TRAC | 3 | Non-aggressive (NAG), Overtly Aggressive (OAG), & Covertly Aggressive (CAG) | 15,001 | Annotated | Yes | – | – | N/A |
| Sharif & Hoque (2022) | Bengali | BAD Facebook & YouTube | | 1. AG & NoAG 2. ReAG & PoAG & VeAG & GeAG | 14,443 | Manual | | | – | Train: 80%; Test: 10%; Validation: 10% |
| Kapil & Ekbal (2022) | Hindi | Eight datasets from social media | | Hate, normal; Hostile, non-hostile; CAG, NAG, OAG; Abusive, hate, natural | Different sizes | Annotated | Some are balanced and some are not | No | – | Train: 80%; Test: 20% |
| Shapiro, Khalafallah & Torki (2022) | Arabic | X Platform | 2 | Offensive; Not offensive | 12,700 | Annotated | No | No | – | Train: 70%; Test: 20%; Development: 10% |
| El-Alami, Alaoui & Nahnahi (2022) | Arabic | SemEval’2020 competition (Arabic dataset) | 2 | 1 = Offensive; 0 = Not offensive | 7,800 | Annotated | No | No | – | Train: 80%; Test: 20% |
| Akram, Shahzad & Bashir (2023) | Urdu (Pakistan) | X Platform | 2 | Hateful; Neutral | 21,759 | Manual | No | No | – | Train: 90%; Test: 10% |
| Hossain et al. (2023) | Bengali | Kaggle & Bangla Newspaper | | 1. Crime & Others 2. Murder, Drug, Rape & Others | Approximately 5.3 million entries | Annotated & Manual | N/A | N/A | – | N/A |
| Ullah et al. (2024) | Urdu (Pakistan) | X Platform | 5 | Cyber Terrorism; Hate Speech; Cyber Harassment; Normal; Offensive | 7,372 | Three manual annotators | No | No | – | Train: 80%; Test: 20% |
| Mozafari et al. (2024) | Persian | X Platform | 2 | Offensive; Not offensive | 6,000 | Manual | No | No | – | N/A |
| Adam, Zandam & Inuwa-Dutse (2024) | Hausa (Africa) | X & Facebook platforms | 2 | Offensive; Not offensive | N/A | Manual | N/A | N/A | – | N/A |