| Normalization | Standardize text by resolving diacritics, elongations, and dialectal variations. | | Python libraries: PyArabic, CAMeL Tools; rules for MSA/dialect unification. |
| Text cleaning | Remove noise (hashtags, URLs, etc.). | | Regex patterns, NLTK, custom filters for Arabic social media text. |
| Tokenization | Split text into atomic units (words, subwords). | | AraBERT tokenizer, Farasa segmenter, or rule-based splitting for clitics. |
| Stopword removal | Filter out high-frequency, low-meaning words. | | Custom Arabic stopword lists (MSA + dialects), NLTK Arabic stopword list. |
| Stemming/Lemmatization | Stemming reduces words to their root form; lemmatization maps words to their dictionary base form (context-aware). | | Stemming: Khoja’s stemmer or ISRI stemmer. Lemmatization: Qalsadi, Farasa lemmatizer, CAMeL Tools, or the MADAMIRA morphological analyzer. |
| POS tagging | Assign grammatical labels to tokens (noun, verb, etc.). | | StanfordNLP Arabic model, CAMeL Tools, or UDPipe with an Arabic-trained model. |
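
The cleaning, normalization, tokenization, and stopword steps in the table can be illustrated with the Python standard library alone. The following is a minimal sketch, not the API of PyArabic, CAMeL Tools, or any other library named above: the function names, the regex patterns, and the five-word stopword list are all assumptions made for illustration, and a real pipeline would likely use a fuller stopword list and a clitic-aware segmenter.

```python
import re

# All patterns and names below are illustrative assumptions.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tashkeel marks and superscript alef
TATWEEL = "\u0640"                                  # kashida used to stretch words
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]") # alef with madda / hamza
ELONGATION = re.compile(r"(.)\1{2,}")               # same character 3+ times in a row
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")       # runs of Arabic letters


def clean(text):
    """Text cleaning: drop URLs, mentions, and hashtags."""
    text = URL.sub(" ", text)
    return MENTION_OR_HASHTAG.sub(" ", text)


def normalize(text):
    """Normalization: strip diacritics and tatweel, unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maksura -> ya
    return ELONGATION.sub(r"\1", text)         # collapse elongated characters


def tokenize(text):
    """Tokenization: whitespace/punctuation-level split, no clitic segmentation."""
    return ARABIC_WORD.findall(text)


# Tiny hypothetical stopword list, stored in normalized form so lookups match.
STOPWORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}


def preprocess(text):
    """Run cleaning, normalization, tokenization, and stopword removal in order."""
    tokens = tokenize(normalize(clean(text)))
    return [t for t in tokens if t not in STOPWORDS]
```

Normalizing the stopword list with the same function as the input is a deliberate choice here: after alef and ya unification, a stopword such as "إلى" no longer matches its original spelling, so both sides need to go through the same mapping. Collapsing every run of three or more identical characters to a single character is a blunt heuristic and may need tuning for dialectal text.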