2022 Apr 8;5:46. doi: 10.1038/s41746-022-00589-7

Table 2.

An overview of features used in machine learning-based models.

| Feature categories | Feature types | Features | Description | Typical examples |
| --- | --- | --- | --- | --- |
| Linguistic features | Syntactic features | Part-of-Speech (POS) | Words are categorized into POS types (such as noun, verb, adverb) according to their grammatical use and function. | 45–47 |
| Linguistic features | Syntactic features | Dependency parsing | The grammatical structure of a sentence. | 205, 206 |
| Linguistic features | Lexicon-based features | Bag-of-words (BoW) | The simplest numerical representation of text, based on counts of vocabulary words. | 48–50 |
| Linguistic features | Lexicon-based features | Lexical diversity, lexical density | The unique vocabulary usage (lexical diversity) and the proportion of content words (lexical density). | 37 |
| Linguistic features | Emotion features | Sentiment scores | Sentiment scores quantify the feeling expressed in a text and determine its sentiment polarity (positive, negative, or neutral). They can be calculated with VADER sentiment analysis (Valence Aware Dictionary and sEntiment Reasoner) [207], the SenticNet 5 lexicon [208], or the AFINN lexicon [209]. | 63, 210–212 |
| Linguistic features | Emotion features | Emotion scores | Emotion scores indicate, to an extent, the user’s emotions and opinions expressed in a text, which is beneficial for detecting mental health issues. The NRC Affect Intensity Lexicon [213] is commonly used. | 56, 63, 109, 214 |
| Linguistic features | Semantic features | Semantic similarity | Semantic similarity is used to predict whether a sentence or word is semantically related to a target sentence or word. | 60, 215 |
| Linguistic features | Topic features | Topic features | Topics extracted from texts using topic-modeling algorithms such as Latent Dirichlet Allocation (LDA) [216], Latent Semantic Analysis (LSA) [217], and non-negative matrix factorization (NMF) [218]. | 55, 65, 87, 219 |
| Linguistic features | Linguistic features | LIWC | Linguistic Inquiry and Word Count (LIWC) [220] is commonly used to automatically extract linguistic styles from texts by calculating the percentages of words in different categories (linguistic, social affective, etc.). | 51–53, 82 |
| Linguistic features | Others | Hashtag, emoji | A hashtag is a metadata tag on a social media platform that presents a theme or topic. Emoticons or emojis are often used to show various types of emotions. | 78, 79, 221 |
| Statistical features | Statistical corpus features | n-gram | An n-gram is a contiguous sequence of n words. | 54–56 |
| Statistical features | Statistical corpus features | TF-IDF | Term frequency-inverse document frequency (TF-IDF) reflects the importance of a word in a document. | 57–59, 222 |
| Statistical features | Statistical corpus features | Length statistics | The length of posts or documents, or the average sentence length. | 60–62, 223 |
| Statistical features | Vector-based features | Word embedding | The vector-based representation of words. Examples: word2vec [224], GloVe [118]. | 49, 56, 106, 225 |
| Statistical features | Vector-based features | Document embedding | The vector-based representation of a document. | 226 |
| Domain knowledge features | Conceptual features | UMLS | The Unified Medical Language System (UMLS) is a set of key terminology, coding standards, and associated resources related to biomedical information. | 67, 227 |
| Domain knowledge features | Conceptual features | Linguistic dictionary | A dictionary containing words related to mental illness. | 66, 228, 229 |
| Other auxiliary features | Social behavioral features | Social connectivity | The degree of social interaction on social media, such as the number of followers, friends, and communities joined [230]. | 68 |
| Other auxiliary features | Social behavioral features | User behaviors | The user’s behavioral signals on social media, such as the frequency of comments and forwarding. | 65, 69, 231 |
| Other auxiliary features | Time features | Time features | Time-related features, such as sending time and time interval. | 72, 73 |
| Other auxiliary features | User’s profile features | User’s profile features | The user’s individual information from their social network profile. | 70, 71, 231 |
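
To make a few of the lexicon-based and statistical rows of Table 2 concrete, the following is a minimal Python sketch of bag-of-words counts, n-grams, lexical diversity (computed here as a type-token ratio), and TF-IDF. It is an illustration only and not the implementation used by any of the cited studies: the function names are hypothetical, the tokenizer is deliberately naive, and the TF-IDF variant (relative term frequency times an unsmoothed log inverse document frequency) is one common textbook form among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens):
    """Bag-of-words: map each vocabulary word to its count in the text."""
    return Counter(tokens)


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(tokens):
    """Lexical diversity as unique words divided by total words (assumed measure)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tf_idf(docs_tokens):
    """Per-document TF-IDF weights: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical example posts, invented purely for illustration.
posts = ["I feel sad and tired", "I feel fine today"]
tokenized = [tokenize(p) for p in posts]

print(bag_of_words(tokenized[0]))
print(ngrams(tokenized[0], 2))
print(type_token_ratio(tokenized[0]))
print(tf_idf(tokenized))
```

With this TF-IDF variant, a word that occurs in every document receives a weight of zero, which is why words specific to one post stand out. Library implementations typically apply smoothing and normalization, so their values will differ from this sketch.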