Table 7. Description of the dataset for high-resource languages.
| Article | Language | Source | Number of categories | Categories of data | Size | Labeling method | Balanced? | Balancing technique used | Size after balancing | Data split |
|---|---|---|---|---|---|---|---|---|---|---|
| Ding & Yang (2020) | Chinese | The Criminal Investigation Corps of the Shanghai Public Security Bureau | 3 | N/A | 6,000 | Manual | No | Yes, oversampling method | 9,000 | Train: 60%, Test: 20%, Validation: 20% |
| Bugueno & Mendoza (2020) | English | ECML/PKDD 2019 conference SIMAH | 1) 2; 2) 4 | 1) harassment or non-harassment; 2) non-harassment, sexual harassment, physical harassment & indirect harassment | 10,622 | Annotated | No | Yes, SMOTE method | N/A | Train: 60%, Test: 20%, Validation: 20% |
| Clavie & Alphonsus (2021) | English | Public dataset: 1) Overruling the European Convention of Human Rights (Zheng et al., 2021); 2) ECHR- Casetext, a company focused on legal research software (Chalkidis, Androutsopoulos & Aletras, 2019) | 1) 2; 2) 2 & 66 | 1) N/A; 2) positive if any human rights article or protocol has been violated and negative otherwise | 11,500 | Annotated | Yes | – | – | N/A |
| Soldevilla & Flores (2021) | English | Schrading’s investigation (Schrading, 2015): 1) Reddit; 2) X | 2 | violence & non-violence | 1) 113,910; 2) 30,377 | Annotated | Yes | – | – | Train: 64%, Test: 20%, Validation: 16% |
| Schirmer, Kruschwitz & Donabauer (2022) | English | GTC The ECCC, the ICTR, & the ICTY | 2 | 1 = violence & 0 = no violence | 1,475 | Manual | Yes | – | – | Train: 80%, Test: 10%, Validation: 10% |
| Cascavilla, Catolino & Sangiovanni (2022) | English | Duta10k Agora Dark & surface web | 19 | Include: drugs, substances for drugs, drug paraphernalia, violence, pornography & fraud | 113,995 | Partly manual, partly annotated | No | No | – | N/A |
| Cho & Choi (2022) | English | Kaggle 20 News Group | 20 | Include Guns | 18,846 | Annotated | No | Yes in GAN-BERT; no in BERT | N/A | Train: 60.04%, Test: 39.96% |
| Agarwal et al. (2022) | English | PAN-2012 competition on the sexual predator identification task | 2 | 1 = predatory & 0 = non-predatory | 32,092 | Annotated | No | Yes, give more weight to the minority class | – | Train: 30.4%, Test: 69.6% |
| Jamil et al. (2023) | English | MAVEN Wikipedia | 3 | 1) Dangerous events; 2) Top-level dangerous events; 3) Sub-level dangerous events | 21,412 | Annotated & manual | No | Yes, the model is equally trained in each class | – | Train: 70%, Test/Validation: 30% |
| Yu, Li & Feng (2024) | English | Kaggle 20 News Group | 20 | Include Guns | 18,846 | Annotated | No | No | – | Train: 60.03%, Test: 39.97%, Validation: 10% from train set |
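
Two of the recurring entries in the table, a 60%/20%/20% train/test/validation split and random oversampling of minority classes, can be illustrated with a minimal sketch. The toy data, the function names `split_60_20_20` and `random_oversample`, and the choice to oversample only the training portion after splitting are assumptions made for illustration; the table does not state how the cited studies ordered these steps or which implementation they used.

```python
import random
from collections import Counter


def split_60_20_20(pairs, seed=0):
    """Shuffle (sample, label) pairs and cut them into 60% / 20% / 20% parts."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = len(pairs) * 60 // 100
    n_test = len(pairs) * 20 // 100
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    validation = pairs[n_train + n_test:]
    return train, test, validation


def random_oversample(pairs, seed=0):
    """Duplicate randomly chosen minority-class items until every class
    present in `pairs` is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in pairs:
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced toy corpus with three classes (not data from any cited study).
toy = (
    [(f"doc_a{i}", "a") for i in range(60)]
    + [(f"doc_b{i}", "b") for i in range(30)]
    + [(f"doc_c{i}", "c") for i in range(10)]
)

train_set, test_set, validation_set = split_60_20_20(toy)
# Assumption: balance the training portion only, so that duplicated items
# cannot appear in both the training and evaluation sets.
balanced_train = random_oversample(train_set)
print(Counter(label for _, label in balanced_train))
```

Oversampling after the split is a design choice of this sketch: balancing before splitting would let copies of the same item land in both training and test data, which tends to inflate reported scores.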