Table 4. Comparative summary of Arabic text preprocessing techniques across selected studies, highlighting common practices and variations.
| Study | Text Cleaning | Normalization (Alef/Ya) | Tokenization | Stemming/Lemmatization | Stopword Removal | Diacritics Removal |
|---|---|---|---|---|---|---|
| Ousidhoum et al. (2021a) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (verified) | ? Unclear | ✓ Explicitly mentioned | ✗ Not mentioned |
| Hassan et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✗ Not mentioned |
| Alshalan et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (BPE) | ✓ Explicitly mentioned (Light Stemmer) | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Alshalan & Al-Khalifa (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Chowdhury et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (WordPiece) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Ben Nessir et al. (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (verified) | ✓ Explicitly mentioned (Khoja Stemmer) | ✓ Explicitly mentioned | ✗ Not mentioned |
| Shapiro, Khalafallah & Torki (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (SentencePiece) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Husain & Uzuner (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✓ Explicitly mentioned (Light Stemmer) | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Abu Farha & Magdy (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
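
To make the column headings concrete, the following minimal Python sketch chains five of the steps compared in Table 4: text cleaning, diacritics removal, Alef/Ya normalization, word-based tokenization, and stopword removal. It does not reproduce the pipeline of any listed study. The function name `preprocess`, the regular expressions, the order of the steps, and the three-word stopword list are all illustrative assumptions; stemming and subword tokenization are omitted because they depend on external tools.

```python
import re

# Text cleaning (assumed scope): strip URLs and @mentions only.
NOISE = re.compile(r"https?://\S+|@\w+")
# Arabic diacritics (tashkeel): fathatan U+064B through sukun U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Alef variants: madda U+0622, hamza above U+0623, hamza below U+0625, wasla U+0671.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")

# Hypothetical stopword list, stored in already-normalized form
# (e.g. the preposition "على" appears as "علي" after Ya normalization).
STOPWORDS = {"في", "من", "علي"}


def preprocess(text: str) -> list[str]:
    """Return cleaned, normalized word tokens with stopwords removed."""
    text = NOISE.sub(" ", text)                 # text cleaning
    text = DIACRITICS.sub("", text)             # diacritics removal
    text = ALEF_VARIANTS.sub("\u0627", text)    # Alef normalization -> bare Alef
    text = text.replace("\u0649", "\u064A")     # Ya normalization: Alef maqsura -> Ya
    tokens = text.split()                       # word-based (whitespace) tokenization
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("ذَهَبَ أحمد إلى المدرسة https://example.com"))
```

Stopwords are matched after normalization in this sketch, so the list has to be stored in normalized form; a pipeline that filters stopwords before normalizing would need the original spellings instead.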