Table 2.
Summary of reviewed studies on Arabic hate/offensive/cyberbullying detection.
| No | Study | Model(s) | Dataset and Platform | Dialect/Domain | Performance Metrics | Limitations |
|---|---|---|---|---|---|---|
| 1 | Haidar et al. (2017) | Naïve Bayes, SVM | Posts (Twitter, Facebook, Formspring) | Saudi Arabic | NB: Precision 90.85%; SVM: Precision 81.5% (yes class) | Imbalanced dataset; few bullying instances; precision misleading |
| 2 | Haidar et al. (2018) | Feed-forward Neural Network (DL) | Twitter dataset (binary labels) | General Arabic | Validation accuracy 91.17% (7 hidden layers) | Limited to binary labels; dataset size not large |
| 3 | Alakrot et al. (2018) | SVM | YouTube comments | General Arabic | Precision 90.05% | Small dataset; not specific to cyberbullying |
| 4 | AlHarbi et al. (2019) | Lexicon + Sentiment Analysis (PMI, Chi-square, Entropy) | Tweets (Twitter) | Saudi Arabic | PMI accuracy 81% vs. Chi-square 62.11% | Lexicon-based; potential bias; dataset context-limited |
| 5 | Mubarak and Darwish (2019) | ML classifiers | Arabic tweets | General Arabic | High classification accuracy | Focused only on offensive language, not cyberbullying |
| 6 | Farid and El-Tazi (2020) | Lexicon-based Sentiment Analysis + Emojis | Tweets in Modern Standard + Egyptian Dialect | Egyptian Arabic | Accuracy >73% for bullying hashtags | Lexicon limited; reliance on emojis and history |
| 7 | Alsafari et al. (2020b) | LR, LSTM, Sluice, BERT, ELMo, SVM | Labeled tweets (Twitter) | Mixed Arabic dialects | SVM + n-grams: Acc. 85.16%; CNN + mBERT F1-macro 66.86% | Limited samples per class; subjectivity in annotation |
| 8 | Bashir and Bouguessa (2021) | LSTM, SVM, Naïve Bayes | Twitter dataset (cyberbullying keywords) | General Arabic | LSTM accuracy 72% | Keyword-based data collection; lower accuracy |
| 9 | Fati (2022) | Sentiment Analysis Framework | — | General Arabic | Accuracy 81% (10-fold CV) | Limited validation; binary annotation |
| 10 | Al-Hassan and Al-Dossari (2022) | LSTM, CNN + LSTM, GRU, CNN + GRU | Labeled tweets | General Arabic | CNN + LSTM: Precision 72%, Recall 75%, F1 73% | Moderate dataset size; limited categories |
| 11 | Alsubait and Alfageh (2021) | Multinomial NB, Complement NB, Logistic Regression | YouTube comments | General Arabic | Avg. F1: TF-IDF 77.9% vs. CountVec 77.5% | Dataset modest; no deep learning comparison |
| 12 | Alhashmi and Darem (2022) | RF, NB, SVM, XGB, ANN, Stacked DL; Consensus-Based Ensemble | (Twitter, WhatsApp, Vine, Instagram, Packet; incl. Translated data) | Mixed Arabic + translated | Consensus ensemble improved accuracy by 1.3% over best classifier; RF strongest | Dataset partly translated; mixed domains; modest gain over baselines |
| 13 | Bouliche and Rezoug (2022) | Dynamic Graph Neural Network (DGNN) | Arabic comments (tweets) | General Arabic | Accuracy 74% | Model performance modest; needs refinement; small dataset |
| 14 | El-Alami et al. (2022) | BERT (multilingual, transfer learning) | Bilingual dataset (English + Arabic tweets) | General Arabic + English | High accuracy and F1; BERT outperformed other models | Ambiguous language still difficult; early-stage |
| 15 | AbdelHamid et al. (2022) | AraBERT, ArabicBERT, GigaBERT vs. RF, SVM | Syrian/Levantine tweets | Levantine dialect | GigaBERT: AUC 94.6%, Macro F1 81% | Focused on Levantine; dataset scope limited |
| 16 | AlFarah et al. (2022) | SVM, RF, NB, LR, KNN | Twitter + YouTube, oversampled | General Arabic | NB highest AUC 89%; SVM and LR also strong | Class imbalance; dataset moderate in size |
| 17 | Anezi (2022) | Deep Recurrent Neural Network (DRNN) | Custom Arabic comments dataset | General Arabic | Binary Acc 99.73%; 3-class Acc 95.38%; 7-class Acc 84.14% | Dataset unique but limited disclosure; overfitting risk |
| 18 | Althobaiti (2022) | BERT + Sentiment + Emoji features vs. SVM, LR | Arabic tweets | General Arabic | BERT model highest F1 across all tasks | Single dataset; limited external validation |
| 19 | Ali and Kurdy (2022) | SVM, SGD, KNN, LR, AdaBoost, Bagging | Syrian Facebook comments + questionnaire | Syrian slang | SVM and SGD accuracy 77%; AdaBoost precision 94% | Imbalanced recall (47%); small dataset |
| 20 | Alduailaj and Belghith (2023) | SVM + FarasaNLTK vs. NB | Twitter + YouTube comments | General Arabic | SVM best accuracy 95.74% (TF-IDF n-gram) | Keyword-based collection; possible bias |
| 21 | Khairy et al. (2023) | Ensemble (Voting) vs. LR, SVC, KNN | New balanced dataset | General Arabic | Voting model highest Acc, F1, Recall, Precision; LR best single Acc 65.1% | Small dataset; limited to ML |
| 22 | Rachidi et al. (2023) | ML (SVM, NB, RF, LR) and DL (LSTM) | Instagram Moroccan dialect | Moroccan Arabic | LSTM Acc 83.64%; SVM Acc 75.04% | Scarcity of tools/datasets for dialect; modest results |
| 23 | Alrashidi et al. (2023) | Fine-tuned Arabic BERT, Multi-task Learning | Multi-aspect abusive tweets dataset | General Arabic | MTL + BERT > DL baselines; GitHub data shared | Imbalanced datasets; Arabic only |
| 24 | Elzayady et al. (2023) | CNN-LSTM, CNN-BiLSTM, CNN-GRU, AraBERT + Personality Features | Twitter hate speech dataset | General Arabic | AraBERT + personality features Acc 82.3%; CNN-LSTM 77% | Personality inference adds complexity; dataset size moderate |
| 25 | Khezzar et al. (2023) | LR, SVC, DT, CNN, AraBERT; web app (arHateDetector) | arHateDataset (merged public sets), Twitter | Standard + dialectal Arabic | AraBERT accuracy 93%; precision/recall/F1 reported | Aggregated datasets may introduce label/definition drift; external validation not detailed |
| 26 | Alsafari et al. (2020a) | Single and ensemble CNN/BiLSTM; AraBERT vs. non-contextual embeddings | Twitter; fine-grained two/three/six-class corpora | Mixed Arabic dialects | Ensemble F1: 91% (2-class), 84% (3-class), 80% (6-class); AraBERT > non-contextual; CNN > BiLSTM | Class granularity increases difficulty; error analysis shows issues with implicit/defensive language |
| 27 | Aljuhani et al. (2022) | BiLSTM with domain-specific embeddings; LR, SVM baselines | Tweets (seeded crawl, cleaned, labeled) | General Arabic (Twitter) | LR on char n-grams P/R/F1 = 92%; SVM ≈ 90%; BiLSTM competitive with domain embeddings | Seed-term collection bias; translation/generalization across topics not assessed |
| 28 | Amer Hamzah and Dhannoon (2023) | BiLSTM + Temporal Convolutional Network (TCN) | CASH: tweets on sexual harassment | Sexual-harassment domain | Accuracy 96.65%; F0.5 = 0.969; > XGBoost baseline | Task/domain specific; dialectal robustness not analyzed |
| 29 | Boulouard et al. (2022) | BERT EN, AraBERT, mBERT (AR/EN), LinearSVC, LSTM | YouTube comments (Gulf, Egyptian, Iraqi); Tweets | Mixed Arabic dialects; EN translations | BERT EN Acc 98%; AraBERT Acc 96%; mBERT-AR Acc 83%; LSTM Acc 82% | Translation pipeline may inflate EN results; sarcasm remains challenging |
| 30 | Aljarah et al. (2021) | SVM, NB, DT, RF; feature sets (TF-IDF, profile, emotion) | Tweets (Twitter) | General Arabic (varied topics) | RF best: Acc/G-mean 0.910; Recall 0.923; Precision 0.902 with all features | Small corpus after filtering; two-annotator protocol; neutrals excluded from training |
| 31 | Mouheb et al. (2019) | Naïve Bayes | Twitter + YouTube | General Arabic | Accuracy 95.9% | Small dataset; limited feature diversity |
| 32 | Alakrot et al. (2021) | LR, SVM/LinearSVC, NB, DT, RF; POS + n-grams; feature selection | YouTube comments | Mixed dialects (YouTube) | LinearSVC highest accuracy (reasonable overall); gains from feature selection | Focus on offensive language, not cyberbullying; dependence on preprocessing choices |
| 33 | Omar et al. (2021) | LinearSVC, NB variants, SVM, LR, DT, SGD, RF; multilabel pipeline | OSN posts across 11 classes; vulgar-speech set | General Arabic (Facebook/Twitter) | With Chi-square FS: Acc 97.92%; F1 97.92%; Precision 97.92%; Recall 97.93%; multilabel LinearSVC + TF-IDF Acc 82.29%, F1 92.48% | High feature counts; results sensitive to FS; generalizability outside OSN mix not shown |
| 34 | Shannaq et al. (2022) | Word-embedding fine-tuning + GA-optimized SVM/XGBoost | ArCybC (CB/Non-CB/Off/Non-Off) | Twitter; cyberbullying + offensive | SVM Acc 86.5% → 87.5%; XGB Acc 84.9% → 85.2% after optimization | Incremental gains; relies on a single public corpus |
| 35 | Kanan et al. (2021) | Unsupervised K-Means vs. EM (clustering) | (Facebook/Twitter) | General Arabic | Evaluated via training time, SSE (e.g., 7,796.363), and log-likelihood (e.g., 3,606.4669) | No precision/recall/F1; clustering quality hard to align with downstream moderation needs |
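Several of the classical pipelines in the table (e.g., rows 1, 3, 20, 24, 33) pair TF-IDF n-gram features with a linear classifier. The sketch below is a minimal, dependency-free illustration of that recipe, not any study's actual implementation: the toy English training strings, the `bully`/`clean` labels, and the nearest-centroid classifier (a lightweight stand-in for SVM/LR) are all assumptions for demonstration; the surveyed studies used Arabic text and stronger learners.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    # Overlapping character n-grams; tolerant of dialectal spelling variation.
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def fit_idf(docs, n=3):
    # Document frequency of each n-gram over the training corpus.
    df = Counter()
    for d in docs:
        df.update(set(char_ngrams(d, n)))
    return df, len(docs)

def transform(doc, df, N, n=3):
    # Smoothed TF-IDF weights: idf = log((1 + N) / (1 + df)) + 1.
    grams = char_ngrams(doc, n)
    tf = Counter(grams)
    return {g: (c / len(grams)) * (math.log((1 + N) / (1 + df[g])) + 1)
            for g, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidClassifier:
    # Nearest-centroid model: a lightweight stand-in for the linear
    # classifiers (SVM, LR) used in the surveyed pipelines.
    def fit(self, vecs, labels):
        sums = defaultdict(lambda: defaultdict(float))
        counts = Counter(labels)
        for vec, y in zip(vecs, labels):
            for g, v in vec.items():
                sums[y][g] += v
        self.centroids = {y: {g: v / counts[y] for g, v in s.items()}
                          for y, s in sums.items()}
        return self

    def predict(self, vec):
        return max(self.centroids, key=lambda y: cosine(vec, self.centroids[y]))

# Toy illustration (English placeholders; the surveyed studies used Arabic).
train = ["you are stupid", "idiot go away", "have a nice day", "thank you friend"]
labels = ["bully", "bully", "clean", "clean"]
df, N = fit_idf(train)
clf = CentroidClassifier().fit([transform(d, df, N) for d in train], labels)
```

Character n-grams (rather than word tokens) are a common choice for Arabic because they absorb clitic attachment and spelling variants without a morphological analyzer.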
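The lexicon approach of row 4 (AlHarbi et al., 2019) scores candidate terms by pointwise mutual information with each class. The sketch below illustrates PMI-based lexicon construction in general terms, not the paper's exact procedure: the toy corpus, the `offensive`/`clean` label names, the smoothing constant `eps`, and the zero decision threshold are illustrative assumptions.

```python
import math
from collections import Counter

def pmi_lexicon(docs, labels, pos="offensive", neg="clean", eps=1e-12):
    # Term polarity score(w) = PMI(w, pos) - PMI(w, neg), where
    # PMI(w, y) = log(P(w, y) / (P(w) * P(y))) estimated from token counts.
    term_class, term, cls_tokens = Counter(), Counter(), Counter()
    total = 0
    for doc, y in zip(docs, labels):
        words = doc.split()
        cls_tokens[y] += len(words)
        for w in words:
            term_class[(w, y)] += 1
            term[w] += 1
            total += 1

    def pmi(w, y):
        p_wy = term_class[(w, y)] / total
        p_w = term[w] / total
        p_y = cls_tokens[y] / total
        return math.log((p_wy + eps) / (p_w * p_y + eps))  # eps avoids log(0)

    return {w: pmi(w, pos) - pmi(w, neg) for w in term}

def classify(doc, lexicon, pos="offensive", neg="clean", threshold=0.0):
    # Sum the polarity of known terms; a positive total flags the document.
    score = sum(lexicon.get(w, 0.0) for w in doc.split())
    return pos if score > threshold else neg

# Toy illustration with English placeholder tokens.
docs = ["stupid idiot", "stupid fool go", "nice day", "good nice morning"]
labels = ["offensive", "offensive", "clean", "clean"]
lex = pmi_lexicon(docs, labels)
```

A fixed threshold of zero is the simplest decision rule; in practice the row-4 study's reported gap between PMI (81%) and Chi-square (62.11%) suggests the association measure matters more than the cutoff.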