2025 Jul 24;11:e3017. doi: 10.7717/peerj-cs.3017

Table 7. Description of the dataset for high-resource languages.

| Article | Language | Source | Number of categories | Categories of data | Size | Labeling method | Balanced? | Balancing technique | Size after balancing | Data split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ding & Yang (2020) | Chinese | The Criminal Investigation Corps of the Shanghai Public Security Bureau | 3 | N/A | 6,000 | Manual | No | Yes, oversampling method | 9,000 | Train: 60%; Test: 20%; Validation: 20% |
| Bugueno & Mendoza (2020) | English | ECML/PKDD 2019 conference; SIMAH | 1) 2; 2) 4 | 1) Harassment or non-harassment; 2) Non-harassment, sexual harassment, physical harassment & indirect harassment | 10,622 | Annotated | No | Yes, SMOTE method | N/A | Train: 60%; Test: 20%; Validation: 20% |
| Clavie & Alphonsus (2021) | English | Public dataset: 1) Overruling the European Convention of Human Rights (Zheng et al., 2021); 2) ECHR- Casetext, a company focused on legal research software (Chalkidis, Androutsopoulos & Aletras, 2019) | 1) 2; 2) 2 & 66 | 1) N/A; 2) Positive if any human rights article or protocol has been violated and negative otherwise | 11,500 | Annotated | Yes | | | N/A |
| Soldevilla & Flores (2021) | English | Schrading’s investigation (Schrading, 2015): 1) Reddit; 2) X | 2 | Violence & non-violence | 1) 113,910; 2) 30,377 | Annotated | Yes | | | Train: 64%; Test: 20%; Validation: 16% |
| Schirmer, Kruschwitz & Donabauer (2022) | English | GTC; the ECCC, the ICTR, & the ICTY | 2 | 1 = violence & 0 = no violence | 1,475 | Manual | Yes | | | Train: 80%; Test: 10%; Validation: 10% |
| Cascavilla, Catolino & Sangiovanni (2022) | English | Duta10k; Agora; Dark & surface web | 19 | Include: drugs, substances for drugs, drug paraphernalia, violence, pornography & fraud | 113,995 | Partly manual, partly annotated | No | No | | N/A |
| Cho & Choi (2022) | English | Kaggle; 20 News Group | 20 | Include Guns | 18,846 | Annotated | No | Yes in GAN-BERT; no in BERT | N/A | Train: 60.04%; Test: 39.96% |
| Agarwal et al. (2022) | English | PAN-2012 competition on the sexual predator identification task | 2 | 1 = predatory & 0 = non-predatory | 32,092 | Annotated | No | Yes, more weight given to the minority class | | Train: 30.4%; Test: 69.6% |
| Jamil et al. (2023) | English | MAVEN; Wikipedia | 3 | 1) Dangerous events; 2) Top-level dangerous events; 3) Sub-level dangerous events | 21,412 | Annotated & manual | No | Yes, the model is trained equally on each class | | Train: 70%; Test/Validation: 30% |
| Yu, Li & Feng (2024) | English | Kaggle; 20 News Group | 20 | Include Guns | 18,846 | Annotated | No | No | | Train: 60.03%; Test: 39.97%; Validation: 10% from train set |
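
The balancing and splitting figures reported for Ding & Yang (2020) in Table 7 (6,000 items in three categories, 9,000 after oversampling, then a 60%/20%/20% split) can be illustrated with a minimal Python sketch. The per-class counts (3,000, 2,000 and 1,000), the function names and the placeholder documents are assumptions made purely for illustration; the table does not report the class distribution or how the authors implemented oversampling.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of each smaller class until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((x, y) for x in items + extra)
    rng.shuffle(balanced)
    return balanced


def split_60_20_20(pairs, seed=0):
    """Shuffle and cut into train (60%), test (20%) and validation (rest)."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = len(pairs) * 60 // 100  # integer arithmetic avoids float rounding
    n_test = len(pairs) * 20 // 100
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    validation = pairs[n_train + n_test:]
    return train, test, validation


# Hypothetical class counts that sum to 6,000 (not taken from the paper).
labels = [0] * 3000 + [1] * 2000 + [2] * 1000
samples = [f"doc{i}" for i in range(len(labels))]

balanced = random_oversample(samples, labels)
train, test, validation = split_60_20_20(balanced)

print(Counter(y for _, y in balanced))  # 3000 items in each of the 3 classes
print(len(balanced), len(train), len(test), len(validation))  # 9000 5400 1800 1800
```

Oversampling before splitting, as done here for simplicity, can place copies of the same item in more than one partition. A common alternative is to split first and oversample only the training partition.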