Linguistic features |
Syntactic features |
Part-of-Speech (POS) |
Words are categorized into different parts of speech (such as noun, verb, and adverb) based on their grammatical use and function. |
45–47
|
|
|
Dependency parsing |
The grammatical structure of a sentence, represented as dependency relations between its words. |
205,206
|
|
Lexicon-based features |
Bag-of-words (BoW) |
The simplest numerical representation of text, based on counts of the vocabulary words that occur in it. |
48–50
|
|
|
Lexical diversity, lexical density |
The variety of unique vocabulary used and the proportion of content words. |
37 |
|
Emotion features |
Sentiment scores |
Sentiment scores quantify the feeling expressed in a text and determine its sentiment polarity (positive, negative, or neutral). Calculation methods include VADER sentiment analysis (Valence Aware Dictionary and sEntiment Reasoner)207, the SenticNet 5 lexicon208, and the AFINN lexicon209. |
63,210–212
|
|
|
Emotion scores |
Emotion scores indicate, to an extent, the emotions and opinions a user expresses in a text, which is beneficial for detecting mental health issues. The NRC Affect Intensity Lexicon213 is commonly used. |
56, 63, 109,214
|
|
Semantic features |
Semantic similarity |
Semantic similarity is used to predict whether a sentence or word is semantically related to a target sentence or word. |
60,215
|
|
Topic features |
Topic features |
Topics extracted from texts using topic-modeling algorithms such as Latent Dirichlet Allocation (LDA)216, Latent Semantic Analysis (LSA)217, and Non-negative Matrix Factorization (NMF)218. |
55, 65, 87,219
|
|
Linguistic features |
LIWC |
Linguistic Inquiry and Word Count (LIWC)220 is widely used to automatically extract linguistic styles from texts by calculating the percentages of words in different categories (linguistic, social affective, etc.). |
51–53,82
|
|
Others |
Hashtag, emoji |
A hashtag is a metadata tag from a social media platform that presents a theme or topic. Emoticons and emojis are often used to express various types of emotions. |
78, 79,221
|
Statistical features |
Statistical corpus features |
n-gram |
An n-gram is a contiguous sequence of n words. |
54–56
|
|
|
TF-IDF |
Term frequency-inverse document frequency (TF-IDF) reflects the importance of a word to a document in a corpus. |
57–59,222
|
|
|
Length statistics |
The length of posts or documents, or the average sentence length. |
60–62,223
|
|
Vector-based features |
Word embedding |
The vector-based representation of words. Examples: word2vec224, GloVe118. |
49, 56, 106,225
|
|
|
Document embedding |
The vector-based representation of a document. |
226 |
Domain knowledge features |
Conceptual features |
UMLS |
Unified Medical Language System (UMLS) is a set of key terminology, coding standards, and associated resources related to biomedical information. |
67,227
|
|
|
Linguistic dictionary |
A dictionary containing words related to mental illness
66, 228,229
|
Other auxiliary features |
Social behavioral features |
Social connectivity |
The degree of social interaction on social media, such as the number of followers, friends, and communities joined230. |
68 |
|
|
User behaviors |
The user’s behavioral signals on social media, such as the frequency of commenting and forwarding. |
65, 69,231
|
|
Time features |
Time features |
Time-related features, such as sending time and time intervals. |
72,73
|
|
User’s profile features |
User’s profile features |
The user’s profile features capture their personal information on social networks. |
70, 71,231
|
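Several of the statistical and lexicon-based features in the table (n-gram, bag-of-words, TF-IDF, lexical diversity) can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies. The whitespace tokenizer, the function names, the example posts, the use of the type-token ratio as the lexical diversity measure, and the unsmoothed `log(N / df)` form of IDF are all assumptions made for illustration; real pipelines typically rely on a dedicated NLP library and a smoothed IDF variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption made for brevity).
    return text.lower().split()


def ngrams(tokens, n):
    # All contiguous sequences of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    # Map each vocabulary word to its count in the document.
    return Counter(tokens)


def type_token_ratio(tokens):
    # One simple lexical diversity measure: unique tokens / total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tf_idf(docs):
    # docs is a list of token lists.
    # tf = count / document length; idf = log(N / document frequency).
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical example posts, used only to exercise the functions.
posts = ["i feel sad and i feel tired", "i feel fine today"]
docs = [tokenize(p) for p in posts]

bigrams = ngrams(docs[1], 2)
bow = bag_of_words(docs[0])
ttr = type_token_ratio(docs[0])
weights = tf_idf(docs)
```

In this sketch a word that appears in every post (such as "feel") receives a TF-IDF weight of zero, while a word found in only one post (such as "sad") receives a positive weight, which is the sense in which TF-IDF reflects how important a word is to a particular document.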