2021 Nov 11;11:22083. doi: 10.1038/s41598-021-01487-w

Table 3.

Performance of our hate speech classification model on the training set (cross-validation results) and on the out-of-sample evaluation set, compared with the inter-annotator agreement on the same datasets. Overall performance is measured by Krippendorff’s $Alpha$ and accuracy ($Acc$), and performance on individual classes by $F_{1}$. Note that the performance of our model is comparable to the inter-annotator agreement, except for the Violent class, where the model's $F_{1}$ is lower.

Performance and agreement	Overall $Alpha$	Overall $Acc$	Acceptable $F_{1}$	Inappropriate $F_{1}$	Offensive $F_{1}$	Violent $F_{1}$
Model
Training	0.59	0.79	0.87	0.54	0.64	0.52
Evaluation	0.55	0.84	0.91	0.59	0.58	0.39
Inter-annotator
Training	0.59	0.77	0.86	0.52	0.63	0.63
Evaluation	0.56	0.82	0.90	0.53	0.57	0.55
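
The comparison stated in the caption can be checked directly from the numbers in the table. The following is a minimal sketch, not part of the original analysis: it copies the per-class $F_{1}$ values from Table 3 and computes the difference between the model's $F_{1}$ and the inter-annotator $F_{1}$ for each class. The names `CLASSES`, `model_f1`, `annotator_f1`, and `f1_gaps` are illustrative choices made here.

```python
# Per-class F1 values copied from Table 3, in the order given by CLASSES.
CLASSES = ["Acceptable", "Inappropriate", "Offensive", "Violent"]

model_f1 = {
    "Training": [0.87, 0.54, 0.64, 0.52],
    "Evaluation": [0.91, 0.59, 0.58, 0.39],
}
annotator_f1 = {
    "Training": [0.86, 0.52, 0.63, 0.63],
    "Evaluation": [0.90, 0.53, 0.57, 0.55],
}


def f1_gaps(dataset):
    """Return {class: model F1 minus inter-annotator F1} for one dataset."""
    return {
        name: round(m - a, 2)
        for name, m, a in zip(CLASSES, model_f1[dataset], annotator_f1[dataset])
    }


for dataset in ("Training", "Evaluation"):
    print(dataset, f1_gaps(dataset))
```

With the values above, the model is at or slightly above the inter-annotator $F_{1}$ for the Acceptable, Inappropriate, and Offensive classes on both datasets, while for the Violent class it falls short by 0.11 on the training set and by 0.16 on the evaluation set.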