Front Artif Intell. 2025 Oct 16;8:1666349. doi: 10.3389/frai.2025.1666349

Table 2.

Summary of reviewed studies on Arabic hate/offensive/cyberbullying detection.

| No | Study | Model(s) | Dataset and platform | Dialect/Domain | Performance metrics | Limitations |
|---|---|---|---|---|---|---|
| 1 | Haidar et al. (2017) | Naïve Bayes, SVM | Posts (Twitter, Facebook, Formspring) | Saudi Arabic | NB: precision 90.85%; SVM: precision 81.5% (yes class) | Imbalanced dataset; few bullying instances; precision misleading |
| 2 | Haidar et al. (2018) | Feed-forward neural network (DL) | Twitter dataset (binary labels) | General Arabic | Validation accuracy 91.17% (7 hidden layers) | Limited to binary labels; dataset not large |
| 3 | Alakrot et al. (2018) | SVM | YouTube comments | General Arabic | Precision 90.05% | Small dataset; not specific to cyberbullying |
| 4 | AlHarbi et al. (2019) | Lexicon + sentiment analysis (PMI, chi-square, entropy) | Tweets (Twitter) | Saudi Arabic | PMI accuracy 81% vs. chi-square 62.11% | Lexicon-based; potential bias; dataset context-limited |
| 5 | Mubarak and Darwish (2019) | ML classifiers | Arabic tweets | General Arabic | High classification accuracy | Focused only on offensive language, not cyberbullying |
| 6 | Farid and El-Tazi (2020) | Lexicon-based sentiment analysis + emojis | Tweets in Modern Standard Arabic + Egyptian dialect | Egyptian Arabic | Accuracy >73% for bullying hashtags | Limited lexicon; reliance on emojis and user history |
| 7 | Alsafari et al. (2020b) | LR, LSTM, Sluice, BERT, ELMo, SVM | Labeled tweets (Twitter) | Mixed Arabic dialects | SVM + n-grams: accuracy 85.16%; CNN + mBERT: macro-F1 66.86% | Limited samples per class; subjectivity in annotation |
| 8 | Bashir and Bouguessa (2021) | LSTM, SVM, Naïve Bayes | Twitter dataset (cyberbullying keywords) | General Arabic | LSTM accuracy 72% | Keyword-based data collection; lower accuracy |
| 9 | Fati (2022) | Sentiment analysis framework | Twitter | General Arabic | Accuracy 81% (10-fold CV) | Limited validation; binary annotation |
| 10 | Al-Hassan and Al-Dossari (2022) | LSTM, CNN + LSTM, GRU, CNN + GRU | Labeled tweets | General Arabic | CNN + LSTM: precision 72%, recall 75%, F1 73% | Moderate dataset size; limited categories |
| 11 | Alsubait and Alfageh (2021) | Multinomial NB, Complement NB, logistic regression | YouTube comments | General Arabic | Avg. F1: TF-IDF 77.9% vs. count vectorizer 77.5% | Modest dataset; no deep-learning comparison |
| 12 | Alhashmi and Darem (2022) | RF, NB, SVM, XGB, ANN, stacked DL; consensus-based ensemble | Twitter, WhatsApp, Vine, Instagram, Packet (incl. translated data) | Mixed Arabic + translated | Consensus ensemble improved accuracy by 1.3% over the best single classifier; RF strongest | Dataset partly translated; mixed domains; modest gain over baselines |
| 13 | Bouliche and Rezoug (2022) | Dynamic graph neural network (DGNN) | Arabic comments (tweets) | General Arabic | Accuracy 74% | Modest performance; small dataset; needs refinement |
| 14 | El-Alami et al. (2022) | BERT (multilingual, transfer learning) | Bilingual dataset (English + Arabic tweets) | General Arabic + English | High accuracy and F1; BERT outperformed other models | Ambiguous language still difficult; early-stage work |
| 15 | AbdelHamid et al. (2022) | AraBERT, ArabicBERT, GigaBERT vs. RF, SVM | Syrian/Levantine tweets | Levantine dialect | GigaBERT: AUC 94.6%, macro-F1 0.81 | Focused on Levantine; limited dataset scope |
| 16 | AlFarah et al. (2022) | SVM, RF, NB, LR, KNN | Twitter + YouTube, oversampled | General Arabic | NB highest AUC 89%; SVM and LR also strong | Class imbalance; moderate dataset size |
| 17 | Anezi (2022) | Deep recurrent neural network (DRNN) | Custom Arabic comments dataset | General Arabic | Binary accuracy 99.73%; 3-class 95.38%; 7-class 84.14% | Unique dataset but limited disclosure; overfitting risk |
| 18 | Althobaiti (2022) | BERT + sentiment + emoji features vs. SVM, LR | Arabic tweets | General Arabic | BERT model highest F1 across all tasks | Single dataset; limited external validation |
| 19 | Ali and Kurdy (2022) | SVM, SGD, KNN, LR, AdaBoost, bagging | Syrian Facebook comments + questionnaire | Syrian slang | SVM and SGD accuracy 77%; AdaBoost precision 94% | Imbalanced recall (47%); small dataset |
| 20 | Alduailaj and Belghith (2023) | SVM + FarasaNLTK vs. NB | Twitter + YouTube comments | General Arabic | SVM best accuracy 95.74% (TF-IDF n-grams) | Keyword-based collection; possible bias |
| 21 | Khairy et al. (2023) | Voting ensemble vs. LR, SVC, KNN | New balanced dataset | General Arabic | Voting model highest accuracy, F1, recall, and precision; best single model LR, accuracy 65.1% | Small dataset; classical ML only |
| 22 | Rachidi et al. (2023) | ML (SVM, NB, RF, LR) and DL (LSTM) | Instagram, Moroccan dialect | Moroccan Arabic | LSTM accuracy 83.64%; SVM accuracy 75.04% | Scarce tools/datasets for the dialect; modest results |
| 23 | Alrashidi et al. (2023) | Fine-tuned Arabic BERT, multi-task learning | Multi-aspect abusive tweets dataset | General Arabic | MTL + BERT outperformed DL baselines; data shared on GitHub | Imbalanced datasets; Arabic only |
| 24 | Elzayady et al. (2023) | CNN-LSTM, CNN-BiLSTM, CNN-GRU, AraBERT + personality features | Twitter hate-speech dataset | General Arabic | AraBERT + personality features accuracy 82.3%; CNN-LSTM 77% | Personality inference adds complexity; moderate dataset size |
| 25 | Khezzar et al. (2023) | LR, SVC, DT, CNN, AraBERT; web app (arHateDetector) | arHateDataset (merged public sets), Twitter | Standard + dialectal Arabic | AraBERT accuracy 93%; precision/recall/F1 reported | Aggregated datasets may introduce label/definition drift; external validation not detailed |
| 26 | Alsafari et al. (2020a) | Single and ensemble CNN/BiLSTM; AraBERT vs. non-contextual embeddings | Twitter; fine-grained two-/three-/six-class corpora | Mixed Arabic dialects | Ensemble F1: 91% (2-class), 84% (3-class), 80% (6-class); AraBERT > non-contextual; CNN > BiLSTM | Class granularity increases difficulty; error analysis shows issues with implicit/defensive language |
| 27 | Aljuhani et al. (2022) | BiLSTM with domain-specific embeddings; LR, SVM baselines | Tweets (seeded crawl, cleaned, labeled) | General Arabic (Twitter) | LR on char n-grams: P/R/F1 = 92%; SVM ≈ 90%; BiLSTM competitive with domain embeddings | Seed-term collection bias; generalization across topics not assessed |
| 28 | Amer Hamzah and Dhannoon (2023) | BiLSTM + temporal convolutional network (TCN) | CASH: tweets on sexual harassment | Sexual-harassment domain | Accuracy 96.65%; F0.5 = 0.969; outperformed XGBoost baseline | Task/domain-specific; dialectal robustness not analyzed |
| 29 | Boulouard et al. (2022) | BERT (EN), AraBERT, mBERT (AR/EN), LinearSVC, LSTM | YouTube comments (Gulf, Egyptian, Iraqi); tweets | Mixed Arabic dialects; EN translations | BERT (EN) accuracy 98%; AraBERT 96%; mBERT-AR 83%; LSTM 82% | Translation pipeline may inflate EN results; sarcasm remains challenging |
| 30 | Aljarah et al. (2021) | SVM, NB, DT, RF; feature sets (TF-IDF, profile, emotion) | Twitter | General Arabic (varied topics) | RF best: accuracy/G-mean 0.910, recall 0.923, precision 0.902 with all features | Small corpus after filtering; two-annotator protocol; neutral tweets excluded from training |
| 31 | Mouheb et al. (2019) | Naïve Bayes | Twitter + YouTube | General Arabic | Accuracy 0.959 | Small dataset; limited feature diversity |
| 32 | Alakrot et al. (2021) | LR, SVM/LinearSVC, NB, DT, RF; POS + n-grams; feature selection | YouTube comments | Mixed dialects (YouTube) | LinearSVC highest accuracy; gains from feature selection | Focus on offensive language, not cyberbullying; sensitive to preprocessing choices |
| 33 | Omar et al. (2021) | LinearSVC, NB variants, SVM, LR, DT, SGD, RF; multilabel pipeline | OSN posts across 11 classes; vulgar-speech set | General Arabic (Facebook/Twitter) | With chi-square feature selection: accuracy 97.92%, F1 97.92%, precision 97.92%, recall 97.93%; multilabel LinearSVC + TF-IDF: accuracy 82.29%, F1 92.48% | High feature counts; results sensitive to feature selection; generalizability beyond the OSN mix not shown |
| 34 | Shannaq et al. (2022) | Word-embedding fine-tuning + GA-optimized SVM/XGBoost | ArCybC (CB/Non-CB/Off/Non-Off), Twitter | Cyberbullying + offensive | SVM accuracy 86.5% → 87.5%, XGB 84.9% → 85.2% after optimization | Incremental gains; relies on a single public corpus |
| 35 | Kanan et al. (2021) | Unsupervised clustering: k-means vs. EM | Facebook/Twitter | General Arabic | Evaluated via training time, SSE (e.g., 7,796.363), and log-likelihood (e.g., 3,606.4669) | No precision/recall/F1; clustering quality hard to align with downstream moderation needs |
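A limitation that recurs throughout the table is that a single headline metric on an imbalanced corpus can mislead (e.g., row 1's "precision misleading" and the class-imbalance notes in rows 16, 19, and 23). The toy sketch below (hypothetical labels, not data from any reviewed study) shows how a classifier that flags almost nothing can report perfect precision and high accuracy while missing most bullying comments, and why the macro-averaged F1 used by several studies exposes this:

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical imbalanced corpus: 100 comments, only 5 bullying (label 1).
y_true = [1] * 5 + [0] * 95
# A timid classifier flags a single (correct) bullying comment.
y_pred = [1] + [0] * 99

p, r, f1 = prf(y_true, y_pred, positive=1)
accuracy = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
# Macro-F1 averages per-class F1, so the weak minority class drags it down.
macro_f1 = (f1 + prf(y_true, y_pred, positive=0)[2]) / 2

print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")   # precision=1.00 recall=0.20 f1=0.33
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")  # accuracy=0.96 macro_f1=0.66
```

Here precision is 100% and accuracy 96%, yet 4 of 5 bullying comments go undetected; recall (20%) and macro-F1 (about 0.66) make the failure visible, which is why studies reporting only precision or accuracy on skewed data warrant the caution noted in the table.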