2025 Jul 24;11:e3017. doi: 10.7717/peerj-cs.3017

Table 6. Description of the dataset for low-resource languages.

Article	Language	Source	Number of categories	Categories	Size	Labeling method	Balanced or not	Balancing technique used	After balancing	Data split
Cruz & Cheng (2020)	Filipino	X Platform	2	-hate -non-hate	10,000	N/A	Relatively	N/A	–	N/A
Rahman et al. (2020)	Bengali	Open source 1-BARD, 2-OSBC 3-ProthomAlo	1-5 2-11 3-6	Include crime	1-50,560 2-78,796 3-128,761	Annotated	No	No	–	Train: 80% Test: 20%
Djandji et al. (2020)	Arabic	X Platform	2	Offensive Not offensive	10,000	Annotated	No	Yes, over/undersampling	–	Train: 70% Test: 20% Development: 10%
Kittask, Milintsevich & Sirts (2020)	Estonian	Postimees Estonian newspaper	4	Include crime Topic: Negative Ambiguous Positive Neutral	4,088	Annotated with sentiment & with rubric labels.	No	No	–	Train: 70% Test: 20% Development: 10%
Mubarak et al. (2020)	Arabic	X Platform	4	- Offensive - Vulgar - Hate speech - Clean	10,000	Experienced annotator	No	No	–	N/A
Aluru et al. (2021)	1. Indonesian 2. Polish 3. Arabic	Public dataset from X platform	2	-Hate speech -Normal	13,169 & 713 9,788 4,120 & 1,670	Annotated	Yes No No & Yes	No	–	Train: 70% Test: 20% Validation: 10%
Arcila-Calderón et al. (2022)	1. Greek 2. Italian	PHARM datasets	2	-Racist/xenophobic hate tweets -Non-racist/xenophobic hate tweets	1-10,399 2-10,752	Manual	Yes	–	–	N/A
Modha, Majumder & Mandl (2022)	Hindi	TRAC	3	Non-aggressive (NAG), Overtly Aggressive (OAG), & Covertly Aggressive (CAG)	15,001	Annotated	Yes	–	–	N/A
Sharif & Hoque (2022)	Bengali	BAD Facebook & YouTube	1-2 2-4	1-AG & NoAG 2-ReAG & PoAG & VeAG & GeAG.	14,443	Manual	1-Yes 2-No	1- - 2-No	–	Train: 80% Test: 10% Validation: 10%
Kapil & Ekbal (2022)	Hindi	Eight datasets from social media	1-2 2-3	Hate, normal Hostile, non-hostile CAG, NAG, OAG Abusive, hate, natural	Different sizes	Annotated	Some are balanced and some not	No	–	Train: 80% Test: 20%
Shapiro, Khalafallah & Torki (2022)	Arabic	X Platform	2	Offensive Not offensive	12,700	Annotated	No	No	–	Train: 70% Test: 20% Development: 10%
El-Alami, Alaoui & Nahnahi (2022)	Arabic	SemEval’2020 competition -Arabic dataset	2	1 = Offensive 0 = Not offensive	7,800	Annotated	No	No	–	Train: 80% Test: 20%
Akram, Shahzad & Bashir (2023)	Urdu (Pakistan)	X Platform	2	Hateful Neutral	21,759	Manual	No	No	–	Train: 90% Test: 10%
Hossain et al. (2023)	Bengali	Kaggle & Bangla Newspaper	1-2 2-4	1-Crime & Others 2-Murder, Drug, Rape & Others	Approximately 5.3 million entries	Annotated & Manual	N/A	N/A	–	N/A
Ullah et al. (2024)	Urdu (Pakistan)	X Platform	5	Cyber Terrorism Hate Speech Cyber Harassment Normal Offensive	7,372	Three manual annotators	No	No	–	Train: 80% Test: 20%
Mozafari et al. (2024)	Persian	X Platform	2	Offensive Not offensive	6,000	Manual	No	No	–	N/A
Adam, Zandam & Inuwa-Dutse (2024)	Hausa (Africa)	X & Facebook platforms	2	Offensive Not offensive	N/A	Manual	N/A	N/A	–	N/A