2025 Sep 17;11:e3133. doi: 10.7717/peerj-cs.3133

Table 4. Comparative summary of Arabic text preprocessing techniques across selected studies, highlighting common practices and variations.

| Study | Text Cleaning | Normalization (Alef/Ya) | Tokenization | Stemming/Lemmatization | Stopword Removal | Diacritics Removal |
|---|---|---|---|---|---|---|
| Ousidhoum et al. (2021a) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (verified) | ? Unclear | ✓ Explicitly mentioned | ✗ Not mentioned |
| Hassan et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✗ Not mentioned |
| Alshalan et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (BPE) | ✓ Explicitly mentioned (Light Stemmer) | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Alshalan & Al-Khalifa (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Chowdhury et al. (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (WordPiece) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Ben Nessir et al. (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (verified) | ✓ Explicitly mentioned (Khoja Stemmer) | ✓ Explicitly mentioned | ✗ Not mentioned |
| Shapiro, Khalafallah & Torki (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Subword-based (SentencePiece) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Husain & Uzuner (2022) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✓ Explicitly mentioned (Light Stemmer) | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
| Abu Farha & Magdy (2020) | ✓ Explicitly mentioned | ✓ Explicitly mentioned | Word-based (likely) | ✗ Not used | ✓ Explicitly mentioned | ✓ Explicitly mentioned |
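
To make the column headings concrete, the following is a minimal Python sketch of the steps that most of the listed studies share: diacritics removal, text cleaning, Alef/Ya normalization, word-based tokenization, and stopword removal. It is not the pipeline of any particular study. The function names, the regular expressions, the choice to map Alef Maqsura to Ya, and the stopword set passed in by the caller are all illustrative assumptions. Stemming and subword tokenization (BPE, WordPiece, SentencePiece) are omitted because they depend on external tools.

```python
import re

# Assumption: diacritics are the tashkeel marks U+064B..U+0652 plus the
# superscript Alef (U+0670); the tatweel (U+0640) is stripped alongside them.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, and hamza below are mapped to bare Alef.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
URLS = re.compile(r"https?://\S+")
MENTIONS = re.compile(r"@\w+")
NON_WORD = re.compile(r"[^\w\s]")


def strip_diacritics(text):
    """Remove short-vowel marks and tatweel (Diacritics Removal column)."""
    return DIACRITICS.sub("", text)


def clean_text(text):
    """Drop URLs, user mentions, and punctuation (Text Cleaning column)."""
    text = URLS.sub(" ", text)
    text = MENTIONS.sub(" ", text)
    return NON_WORD.sub(" ", text)


def normalize_letters(text):
    """Unify Alef variants and map Alef Maqsura to Ya (Normalization column)."""
    text = ALEF_VARIANTS.sub("\u0627", text)
    return text.replace("\u0649", "\u064A")


def preprocess(text, stopwords=frozenset()):
    """Run the steps in order and return a list of word tokens.

    Diacritics are stripped before cleaning because combining marks are not
    word characters, so the punctuation filter would otherwise split words.
    The stopword set is assumed to be in normalized form already.
    """
    text = strip_diacritics(text)
    text = clean_text(text)
    text = normalize_letters(text)
    tokens = text.split()  # whitespace-based word tokenization
    return [tok for tok in tokens if tok not in stopwords]
```

Calling `preprocess(tweet, stopwords=my_stopwords)` returns the surviving word tokens. Any stopword list should be passed through `normalize_letters` first so that its entries match the normalized text.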