2025 Jul 24;11:e3017. doi: 10.7717/peerj-cs.3017

Table 7. Description of the dataset for high-resource languages.

Article	Language	Source	Number of categories	Categories	Size	Labeling method	Balanced	Balancing technique	Size after balancing	Data split
Ding & Yang (2020)	Chinese	The Criminal Investigation Corps of the Shanghai Public Security Bureau	3	N/A	6,000	Manual	No	Yes, oversampling	9,000	Train: 60% Test: 20% Validation: 20%
Bugueno & Mendoza (2020)	English	ECML/PKDD 2019 conference SIMAH	1-2 2-4	1-Harassment or non-harassment 2-Non-harassment, sexual harassment, physical harassment & indirect harassment	10,622	Annotated	No	Yes, SMOTE	N/A	Train: 60% Test: 20% Validation: 20%
Clavie & Alphonsus (2021)	English	Public datasets 1-Overruling: Casetext, a company focused on legal research software (Zheng et al., 2021) 2-ECHR: the European Convention of Human Rights (Chalkidis, Androutsopoulos & Aletras, 2019)	1-2 2-2 & 66	1-N/A 2-Positive if any human rights article or protocol has been violated, negative otherwise	11,500	Annotated	Yes	–	–	N/A
Soldevilla & Flores (2021)	English	Schrading’s investigation (Schrading, 2015) 1-Reddit 2-X	2	violence & non-violence	1-113,910 2-30,377	Annotated	Yes	–	–	Train: 64% Test: 20% Validation: 16%
Schirmer, Kruschwitz & Donabauer (2022)	English	GTC The ECCC, the ICTR, & the ICTY	2	1 = violence & 0 = no violence	1,475	Manual	Yes	–	–	Train: 80% Test: 10% Validation: 10%
Cascavilla, Catolino & Sangiovanni (2022)	English	Duta10k Agora Dark & surface web	19	Includes drugs, substances for drugs, drug paraphernalia, violence, pornography & fraud	113,995	Partly manual, partly annotated	No	No	–	N/A
Cho & Choi (2022)	English	Kaggle 20 News Group	20	Includes guns	18,846	Annotated	No	Yes in GAN-BERT; no in BERT	N/A	Train: 60.04% Test: 39.96%
Agarwal et al. (2022)	English	PAN-2012 competition on the sexual predator identification task	2	1 = predatory & 0 = non-predatory	32,092	Annotated	No	Yes, more weight given to the minority class	–	Train: 30.4% Test: 69.6%
Jamil et al. (2023)	English	MAVEN Wikipedia	3	1) Dangerous events; 2) Top-level dangerous events; 3) Sub-level dangerous events	21,412	Annotated & manual	No	Yes, the model is trained equally on each class	–	Train: 70% Test/Validation: 30%
Yu, Li & Feng (2024)	English	Kaggle 20 News Group	20	Includes guns	18,846	Annotated	No	No	–	Train: 60.03% Test: 39.97% Validation: 10% from train set
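
Two of the preprocessing steps listed in the table, random oversampling and a 60/20/20 train/test/validation split, can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the function names `random_oversample` and `split_60_20_20` are hypothetical, and the class counts (30/20/10) are toy values chosen only to show how a three-class set grows when every class is raised to the size of the largest one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class is as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def split_60_20_20(n_items, seed=0):
    """Return shuffled index lists (train, test, validation) in a
    60/20/20 ratio; integer arithmetic avoids float rounding."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = n_items * 60 // 100
    n_test = n_items * 20 // 100
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


# Toy data (assumption): 60 items in three unbalanced classes.
toy_samples = list(range(60))
toy_labels = ["a"] * 30 + ["b"] * 20 + ["c"] * 10

balanced_x, balanced_y = random_oversample(toy_samples, toy_labels)
class_counts = Counter(balanced_y)
train_idx, test_idx, val_idx = split_60_20_20(len(balanced_x))
```

In practice, oversampling is usually applied to the training portion only, so that duplicated items do not leak into the test or validation sets. The table does not state the order in which the cited studies performed the two steps.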