2025 Jul 24;11:e3017. doi: 10.7717/peerj-cs.3017

Table 6. Description of the dataset for low-resource languages.

| Article | Language | Source | Number of categories | Categories | Size | Labeling method | Balanced? | Balancing technique used | After balancing | Data split |
|---|---|---|---|---|---|---|---|---|---|---|
| Cruz & Cheng (2020) | Filipino | X Platform | 2 | Hate; Non-hate | 10,000 | N/A | Relatively | N/A |  | N/A |
| Rahman et al. (2020) | Bengali | Open source: (1) BARD; (2) OSBC; (3) ProthomAlo | (1) 5; (2) 11; (3) 6 | Includes crime | (1) 50,560; (2) 78,796; (3) 128,761 | Annotated | No | No |  | Train: 80%; Test: 20% |
| Djandji et al. (2020) | Arabic | X Platform | 2 | Offensive; Not offensive | 10,000 | Annotated | No | Yes, over/undersampling |  | Train: 70%; Test: 20%; Development: 10% |
| Kittask, Milintsevich & Sirts (2020) | Estonian | Postimees Estonian newspaper | 4 | Includes crime topic; Negative, Ambiguous, Positive, Neutral | 4,088 | Annotated with sentiment and rubric labels | No | No |  | Train: 70%; Test: 20%; Development: 10% |
| Mubarak et al. (2020) | Arabic | X Platform | 4 | Offensive; Vulgar; Hate speech; Clean | 10,000 | Experienced annotator | No | No |  | N/A |
| Aluru et al. (2021) | (1) Indonesian; (2) Polish; (3) Arabic | Public dataset from X Platform | 2 | Hate speech; Normal | 13,169 & 713; 9,788; 4,120 & 1,670 | Annotated | Yes; No; No & Yes | No |  | Train: 70%; Test: 20%; Validation: 10% |
| Arcila-Calderón et al. (2022) | (1) Greek; (2) Italian | PHARM datasets | 2 | Racist/xenophobic hate tweets; Non-racist/xenophobic hate tweets | (1) 10,399; (2) 10,752 | Manual | Yes |  |  | N/A |
| Modha, Majumder & Mandl (2022) | Hindi | TRAC | 3 | Non-aggressive (NAG); Overtly Aggressive (OAG); Covertly Aggressive (CAG) | 15,001 | Annotated | Yes |  |  | N/A |
| Sharif & Hoque (2022) | Bengali | BAD; Facebook & YouTube | (1) 2; (2) 4 | (1) AG & NoAG; (2) ReAG, PoAG, VeAG & GeAG | 14,443 | Manual | (1) Yes; (2) No | (1) -; (2) No |  | Train: 80%; Test: 10%; Validation: 10% |
| Kapil & Ekbal (2022) | Hindi | Eight datasets from social media | (1) 2; (2) 3 | Hate, normal; Hostile, non-hostile; CAG, NAG, OAG; Abusive, hate, natural | Different sizes | Annotated | Some are balanced and some are not | No |  | Train: 80%; Test: 20% |
| Shapiro, Khalafallah & Torki (2022) | Arabic | X Platform | 2 | Offensive; Not offensive | 12,700 | Annotated | No | No |  | Train: 70%; Test: 20%; Development: 10% |
| El-Alami, Alaoui & Nahnahi (2022) | Arabic | SemEval 2020 competition, Arabic dataset | 2 | 1 = Offensive; 0 = Not offensive | 7,800 | Annotated | No | No |  | Train: 80%; Test: 20% |
| Akram, Shahzad & Bashir (2023) | Urdu (Pakistan) | X Platform | 2 | Hateful; Neutral | 21,759 | Manual | No | No |  | Train: 90%; Test: 10% |
| Hossain et al. (2023) | Bengali | Kaggle & Bangla newspaper | (1) 2; (2) 4 | (1) Crime & Others; (2) Murder, Drug, Rape & Others | Approximately 5.3 million entries | Annotated & Manual | N/A | N/A |  | N/A |
| Ullah et al. (2024) | Urdu (Pakistan) | X Platform | 5 | Cyber Terrorism; Hate Speech; Cyber Harassment; Normal; Offensive | 7,372 | Three manual annotators | No | No |  | Train: 80%; Test: 20% |
| Mozafari et al. (2024) | Persian | X Platform | 2 | Offensive; Not offensive | 6,000 | Manual | No | No |  | N/A |
| Adam, Zandam & Inuwa-Dutse (2024) | Hausa (Africa) | X & Facebook platforms | 2 | Offensive; Not offensive | N/A | Manual | N/A | N/A |  | N/A |
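
Several rows of Table 6 report a 70/20/10 train/test/development split, and one row reports over/undersampling to balance the classes. The sketch below illustrates both ideas on toy data using only the Python standard library. It is an illustration under stated assumptions, not the procedure of any cited study: the function names `split_70_20_10` and `random_oversample`, the label strings, and the toy examples are all hypothetical, and the oversampling shown is plain random duplication of minority-class examples.

```python
import random
from collections import Counter


def split_70_20_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/test/dev parts (70/20/10)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_test = n * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]  # the remainder, roughly 10%
    return train, test, dev


def random_oversample(examples, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class is as large as the largest one. `examples` is a list of
    (text, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies (with replacement) to reach the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus: 20 "offensive" and 80 "not_offensive" examples.
data = [(f"offensive {i}", "offensive") for i in range(20)]
data += [(f"clean {i}", "not_offensive") for i in range(80)]

train, test, dev = split_70_20_10(data)
balanced_train = random_oversample(train)  # balance the training part only
label_counts = Counter(label for _, label in balanced_train)
```

Balancing only the training part keeps the test and development parts at the original class distribution, so evaluation scores are not computed on duplicated examples.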