2025 Sep 17;11:e3133. doi: 10.7717/peerj-cs.3133

Table 3. Step-by-step examples of key preprocessing phases in Arabic NLP, with corresponding input-output transformations and the tools used for Arabic text.

Step	Purpose	Tools/Techniques
Normalization	Standardize text by resolving diacritics, elongations, and dialectal variations.	Python libraries: PyArabic, CAMeL Tools; rules for MSA/dialect unification.
Text cleaning	Remove noise (hashtags, URLs, etc.).	Regex patterns, NLTK, custom filters for Arabic social media text.
Tokenization	Split text into atomic units (words, subwords).	AraBERT tokenizer, Farasa segmenter, or rule-based splitting for clitics.
Stopword removal	Filter out high-frequency, low-meaning words.	Custom Arabic stopword lists (MSA + dialects), NLTK Arabic stopword corpus.
Stemming/Lemmatization	Stemming: reduce words to their root form. Lemmatization: map words to their dictionary base form (context-aware).	Stemming: Khoja’s stemmer, ISRI stemmer, or Qalsadi for Arabic. Lemmatization: Farasa lemmatizer, CAMeL Tools, or the MADAMIRA morphological analyzer.
POS tagging	Assign grammatical labels to tokens (noun, verb, etc.).	StanfordNLP Arabic model, CAMeL Tools, or UDPipe with Arabic-trained models.
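
The first four steps of the table (normalization, text cleaning, tokenization, stopword removal) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the pipeline of any tool named in the table: the function names, the specific normalization rules (alef and alef maqsura unification, collapsing letters repeated three or more times), the regex-based tokenizer, and the tiny stopword list are all assumptions chosen for illustration. A real system would more likely rely on PyArabic, CAMeL Tools, or Farasa, and would add stemming or lemmatization and POS tagging afterwards.

```python
import re

# Arabic short-vowel marks (U+064B to U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character (kashida)
# Alef with madda, hamza above, hamza below -> bare alef (U+0627).
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# An Arabic letter repeated 3+ times, as in emphatic social media spelling.
ELONGATION = re.compile(r"([\u0621-\u064A])\1{2,}")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
# Rule-based tokenizer: maximal runs of Arabic letters (no clitic splitting).
TOKEN = re.compile(r"[\u0621-\u064A]+")

# Illustrative stopword list only; real lists are far longer.
RAW_STOPWORDS = ["في", "من", "على", "إلى", "عن", "هذا", "هذه"]


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, unify letter variants, collapse elongation."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    return ELONGATION.sub(r"\1", text)


def clean(text: str) -> str:
    """Remove URLs, mentions, and hashtags (hashtag text is dropped too)."""
    text = URL.sub(" ", text)
    return MENTION_OR_HASHTAG.sub(" ", text)


# Stopwords go through the same normalization as the input text,
# otherwise a normalized token would never match its stopword entry.
STOPWORDS = {normalize(word) for word in RAW_STOPWORDS}


def preprocess(text: str) -> list[str]:
    """Run normalization, cleaning, tokenization, and stopword removal in order."""
    tokens = TOKEN.findall(clean(normalize(text)))
    return [token for token in tokens if token not in STOPWORDS]
```

Calling `preprocess` on a raw social media string returns a list of normalized Arabic tokens with URLs, hashtags, mentions, and the listed stopwords removed. Normalizing the stopword list with the same function as the input is a deliberate design choice: once alef and ya variants are unified in the text, an unnormalized stopword list would silently stop matching.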