Table 2.
Summary of reviewed studies on Arabic hate/offensive/cyberbullying detection.
| No | Study | Model(s) | Dataset and Platform | Dialect/Domain | Performance Metrics | Limitations |
|---|---|---|---|---|---|---|
| 1 | Haidar et al. (2017) | Naïve Bayes, SVM | Posts (Twitter, Facebook, Formspring) | Saudi Arabic | NB: Precision 90.85%; SVM: Precision 81.5% (yes class) | Imbalanced dataset; few bullying instances; precision misleading |
| 2 | Haidar et al. (2018) | Feed-forward Neural Network (DL) | Twitter dataset (binary labels) | General Arabic | Validation accuracy 91.17% (7 hidden layers) | Limited to binary labels; dataset size not large |
| 3 | Alakrot et al. (2018) | SVM | YouTube comments | General Arabic | Precision 90.05% | Small dataset; not specific to cyberbullying |
| 4 | AlHarbi et al. (2019) | Lexicon + Sentiment Analysis (PMI, Chi-square, Entropy) | Tweets (Twitter) | Saudi Arabic | PMI accuracy 81% vs. Chi-square 62.11% | Lexicon-based; potential bias; dataset context-limited |
| 5 | Mubarak and Darwish (2019) | ML classifiers | Arabic tweets | General Arabic | High classification accuracy | Focused only on offensive language, not cyberbullying |
| 6 | Farid and El-Tazi (2020) | Lexicon-based Sentiment Analysis + Emojis | Tweets in Modern Standard + Egyptian Dialect | Egyptian Arabic | Accuracy >73% for bullying hashtags | Lexicon limited; reliance on emojis and history |
| 7 | Alsafari et al. (2020b) | LR, LSTM, Sluice, BERT, ELMo, SVM | Labeled tweets (Twitter) | Mixed Arabic dialects | SVM + n-grams: Acc. 85.16%; CNN + mBERT F1-macro 66.86% | Limited samples per class; subjectivity in annotation |
| 8 | Bashir and Bouguessa (2021) | LSTM, SVM, Naïve Bayes | Twitter dataset (cyberbullying keywords) | General Arabic | LSTM accuracy 72% | Keyword-based data collection; lower accuracy |
| 9 | Fati (2022) | Sentiment Analysis Framework | — | General Arabic | Accuracy 81% (10-fold CV) | Limited validation; binary annotation |
| 10 | Al-Hassan and Al-Dossari (2022) | LSTM, CNN + LSTM, GRU, CNN + GRU | Labeled tweets | General Arabic | CNN + LSTM: Precision 72%, Recall 75%, F1 73% | Moderate dataset size; limited categories |
| 11 | Alsubait and Alfageh (2021) | Multinomial NB, Complement NB, Logistic Regression | YouTube comments | General Arabic | Avg. F1: TF-IDF 77.9% vs. CountVec 77.5% | Dataset modest; no deep learning comparison |
| 12 | Alhashmi and Darem (2022) | RF, NB, SVM, XGB, ANN, Stacked DL; Consensus-Based Ensemble | (Twitter, WhatsApp, Vine, Instagram, Packet; incl. Translated data) | Mixed Arabic + translated | Consensus ensemble improved accuracy by 1.3% over best classifier; RF strongest | Dataset partly translated; mixed domains; modest gain over baselines |
| 13 | Bouliche and Rezoug (2022) | Dynamic Graph Neural Network (DGNN) | Arabic comments (tweets) | General Arabic | Accuracy 74% | Model performance modest; needs refinement; small dataset |
| 14 | El-Alami et al. (2022) | BERT (multilingual, transfer learning) | Bilingual dataset (English + Arabic tweets) | General Arabic + English | High accuracy and F1; BERT outperformed other models | Ambiguous language still difficult; early-stage |
| 15 | AbdelHamid et al. (2022) | AraBERT, ArabicBERT, GigaBERT vs. RF, SVM | Syrian/Levantine tweets | Levantine dialect | GigaBERT: AUC 94.6%, Macro F1 81% | Focused on Levantine; dataset scope limited |
| 16 | AlFarah et al. (2022) | SVM, RF, NB, LR, KNN | Twitter + YouTube, oversampled | General Arabic | NB highest AUC 89%; SVM and LR also strong | Class imbalance; dataset moderate in size |
| 17 | Anezi (2022) | Deep Recurrent Neural Network (DRNN) | Custom Arabic comments dataset | General Arabic | Binary Acc 99.73%; 3-class Acc 95.38%; 7-class Acc 84.14% | Dataset unique but limited disclosure; overfitting risk |
| 18 | Althobaiti (2022) | BERT + Sentiment + Emoji features vs. SVM, LR | Arabic tweets | General Arabic | BERT model highest F1 across all tasks | Single dataset; limited external validation |
| 19 | Ali and Kurdy (2022) | SVM, SGD, KNN, LR, AdaBoost, Bagging | Syrian Facebook comments + questionnaire | Syrian slang | SVM and SGD accuracy 77%; AdaBoost precision 94% | Imbalanced recall (47%); small dataset |
| 20 | Alduailaj and Belghith (2023) | SVM + FarasaNLTK vs. NB | Twitter + YouTube comments | General Arabic | SVM best accuracy 95.74% (TF-IDF n-gram) | Keyword-based collection; possible bias |
| 21 | Khairy et al. (2023) | Ensemble (Voting) vs. LR, SVC, KNN | New balanced dataset | General Arabic | Voting model highest Acc, F1, Recall, Precision; LR best single Acc 65.1% | Small dataset; limited to ML |
| 22 | Rachidi et al. (2023) | ML (SVM, NB, RF, LR) and DL (LSTM) | Instagram Moroccan dialect | Moroccan Arabic | LSTM Acc 83.64%; SVM Acc 75.04% | Scarcity of tools/datasets for dialect; modest results |
| 23 | Alrashidi et al. (2023) | Fine-tuned Arabic BERT, Multi-task Learning | Multi-aspect abusive tweets dataset | General Arabic | MTL + BERT > DL baselines; GitHub data shared | Imbalanced datasets; Arabic only |
| 24 | Elzayady et al. (2023) | CNN-LSTM, CNN-BiLSTM, CNN-GRU, AraBERT + Personality Features | Twitter hate speech dataset | General Arabic | AraBERT + personality features Acc 82.3%; CNN-LSTM 77% | Personality inference adds complexity; dataset size moderate |
| 25 | Khezzar et al. (2023) | LR, SVC, DT, CNN, AraBERT; web app (arHateDetector) | arHateDataset (merged public sets), Twitter | Standard + dialectal Arabic | AraBERT accuracy 93%; precision/recall/F1 reported | Aggregated datasets may introduce label/definition drift; external validation not detailed |
| 26 | Alsafari et al. (2020a) | Single and ensemble CNN/BiLSTM; AraBERT vs. non-contextual embeddings | Twitter; fine-grained two/three/six-class corpora | Mixed Arabic dialects | Ensemble F1: 91% (2-class), 84% (3-class), 80% (6-class); AraBERT > non-contextual; CNN > BiLSTM | Class granularity increases difficulty; error analysis shows issues with implicit/defensive language |
| 27 | Aljuhani et al. (2022) | BiLSTM with domain-specific embeddings; LR, SVM baselines | Tweets (seeded crawl, cleaned, labeled) | General Arabic (Twitter) | LR on char n-grams P/R/F1 = 92%; SVM ≈ 90%; BiLSTM competitive with domain embeddings | Seed-term collection bias; translation/generalization across topics not assessed |
| 28 | Amer Hamzah and Dhannoon (2023) | BiLSTM + Temporal Convolutional Network (TCN) | CASH: tweets on sexual harassment | Sexual-harassment domain | Accuracy 96.65%; F0.5 = 0.969; > XGBoost baseline | Task/domain specific; dialectal robustness not analyzed |
| 29 | Boulouard et al. (2022) | BERT EN, AraBERT, mBERT (AR/EN), LinearSVC, LSTM | YouTube comments (Gulf, Egyptian, Iraqi); Tweets | Mixed Arabic dialects; EN translations | BERT EN Acc 98%; AraBERT Acc 96%; mBERT-AR Acc 83%; LSTM Acc 82% | Translation pipeline may inflate EN results; sarcasm remains challenging |
| 30 | Aljarah et al. (2021) | SVM, NB, DT, RF; feature sets (TF-IDF, profile, emotion) | Tweets (Twitter) | General Arabic (varied topics) | RF best: Acc/G-mean 0.910; Recall 0.923; Precision 0.902 with all features | Small corpus after filtering; two-annotator protocol; neutrals excluded from training |
| 31 | Mouheb et al. (2019) | Naïve Bayes | Twitter + YouTube | General Arabic | Accuracy 95.9% | Small dataset; limited feature diversity |
| 32 | Alakrot et al. (2021) | LR, SVM/LinearSVC, NB, DT, RF; POS + n-grams; feature selection | YouTube comments | Mixed dialects (YouTube) | LinearSVC highest accuracy (reasonable overall); gains from feature selection | Focus on offensive language, not cyberbullying; dependence on preprocessing choices |
| 33 | Omar et al. (2021) | LinearSVC, NB variants, SVM, LR, DT, SGD, RF; multilabel pipeline | OSN posts across 11 classes; vulgar-speech set | General Arabic (Facebook/Twitter) | With Chi-square FS: Acc 97.92%; F1 97.92%; Precision 97.92%; Recall 97.93%; multilabel LinearSVC + TF-IDF Acc 82.29%, F1 92.48% | High feature counts; results sensitive to FS; generalizability outside OSN mix not shown |
| 34 | Shannaq et al. (2022) | Word-embedding fine-tuning + GA-optimized SVM/XGBoost | ArCybC (CB/Non-CB/Off/Non-Off) | Twitter; cyberbullying + offensive | SVM Acc 86.5% → 87.5%; XGB Acc 84.9% → 85.2% after optimization | Incremental gains; relies on a single public corpus |
| 35 | Kanan et al. (2021) | Unsupervised K-Means vs. EM (clustering) | (Facebook/Twitter) | General Arabic | Evaluated via training time, SSE (e.g., 7,796.363), and log-likelihood (e.g., 3,606.4669) | No precision/recall/F1; clustering quality hard to align with downstream moderation needs |
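Several of the classical pipelines in the table (e.g., rows 1, 3, 20, 24, 33) pair TF-IDF n-gram features with a linear classifier. The sketch below is a minimal, dependency-free illustration of that recipe, not any study's actual implementation: the toy English training strings, the `bully`/`clean` labels, and the nearest-centroid classifier (a lightweight stand-in for SVM/LR) are all assumptions for demonstration; the surveyed studies used Arabic text and stronger learners.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    # Overlapping character n-grams; tolerant of dialectal spelling variation.
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def fit_idf(docs, n=3):
    # Document frequency of each n-gram over the training corpus.
    df = Counter()
    for d in docs:
        df.update(set(char_ngrams(d, n)))
    return df, len(docs)

def transform(doc, df, N, n=3):
    # Smoothed TF-IDF weights: idf = log((1 + N) / (1 + df)) + 1.
    grams = char_ngrams(doc, n)
    tf = Counter(grams)
    return {g: (c / len(grams)) * (math.log((1 + N) / (1 + df[g])) + 1)
            for g, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidClassifier:
    # Nearest-centroid model: a lightweight stand-in for the linear
    # classifiers (SVM, LR) used in the surveyed pipelines.
    def fit(self, vecs, labels):
        sums = defaultdict(lambda: defaultdict(float))
        counts = Counter(labels)
        for vec, y in zip(vecs, labels):
            for g, v in vec.items():
                sums[y][g] += v
        self.centroids = {y: {g: v / counts[y] for g, v in s.items()}
                          for y, s in sums.items()}
        return self

    def predict(self, vec):
        return max(self.centroids, key=lambda y: cosine(vec, self.centroids[y]))

# Toy illustration (English placeholders; the surveyed studies used Arabic).
train = ["you are stupid", "idiot go away", "have a nice day", "thank you friend"]
labels = ["bully", "bully", "clean", "clean"]
df, N = fit_idf(train)
clf = CentroidClassifier().fit([transform(d, df, N) for d in train], labels)
```

Character n-grams (rather than word tokens) are a common choice for Arabic because they absorb clitic attachment and spelling variants without a morphological analyzer.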
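The lexicon approach of row 4 (AlHarbi et al., 2019) scores candidate terms by pointwise mutual information with each class. The sketch below illustrates PMI-based lexicon construction in general terms, not the paper's exact procedure: the toy corpus, the `offensive`/`clean` label names, the smoothing constant `eps`, and the zero decision threshold are illustrative assumptions.

```python
import math
from collections import Counter

def pmi_lexicon(docs, labels, pos="offensive", neg="clean", eps=1e-12):
    # Term polarity score(w) = PMI(w, pos) - PMI(w, neg), where
    # PMI(w, y) = log(P(w, y) / (P(w) * P(y))) estimated from token counts.
    term_class, term, cls_tokens = Counter(), Counter(), Counter()
    total = 0
    for doc, y in zip(docs, labels):
        words = doc.split()
        cls_tokens[y] += len(words)
        for w in words:
            term_class[(w, y)] += 1
            term[w] += 1
            total += 1

    def pmi(w, y):
        p_wy = term_class[(w, y)] / total
        p_w = term[w] / total
        p_y = cls_tokens[y] / total
        return math.log((p_wy + eps) / (p_w * p_y + eps))  # eps avoids log(0)

    return {w: pmi(w, pos) - pmi(w, neg) for w in term}

def classify(doc, lexicon, pos="offensive", neg="clean", threshold=0.0):
    # Sum the polarity of known terms; a positive total flags the document.
    score = sum(lexicon.get(w, 0.0) for w in doc.split())
    return pos if score > threshold else neg

# Toy illustration with English placeholder tokens.
docs = ["stupid idiot", "stupid fool go", "nice day", "good nice morning"]
labels = ["offensive", "offensive", "clean", "clean"]
lex = pmi_lexicon(docs, labels)
```

A fixed threshold of zero is the simplest decision rule; in practice the row-4 study's reported gap between PMI (81%) and Chi-square (62.11%) suggests the association measure matters more than the cutoff.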