PeerJ Comput Sci. 2025 Sep 17;11:e3133. doi: 10.7717/peerj-cs.3133

Table 3. Step-by-step examples of key preprocessing phases in Arabic NLP, with the corresponding input-output transformations and the tools used, for the Arabic text [inline graphic].

Step | Purpose | Example (Input → Output) | Tools/Techniques
Normalization | Standardize text by resolving diacritics, elongations, and dialectal variations. | [image: peerj-cs-11-3133-i006.jpg] | Python libraries: PyArabic, camel-tools; rules for MSA/dialect unification.
Text cleaning | Remove noise (hashtags, URLs, etc.). | [image: peerj-cs-11-3133-i007.jpg] | Regex patterns, NLTK, custom filters for Arabic social media text.
Tokenization | Split text into atomic units (words, subwords). | [image: peerj-cs-11-3133-i008.jpg] | AraBERT tokenizer, Farasa segmenter, or rule-based splitting for clitics.
Stopword removal | Filter out high-frequency, low-meaning words. | [image: peerj-cs-11-3133-i009.jpg] | Custom Arabic stopword lists (MSA + dialects), NLTK Arabic corpus.
Stemming/Lemmatization | Stemming: reduce words to their root form. Lemmatization: map words to their dictionary base form (context-aware). | [image: peerj-cs-11-3133-i010.jpg] | Stemming: Khoja's stemmer, ISRI stemmer, or Qalsadi for Arabic. Lemmatization: Farasa lemmatizer, CAMeL Tools, or MADAMIRA morphological analyzer.
POS tagging | Assign grammatical labels to tokens (noun, verb, etc.). | [image: peerj-cs-11-3133-i011.jpg] | StanfordNLP Arabic model, CAMeL Tools, or UDPipe with Arabic-trained data.