AMIA Summits on Translational Science Proceedings. 2020 May 30;2020:487–496.

Mental Health Severity Detection from Psychological Forum Data using Domain-Specific Unlabelled Data

Braja Gopal Patra 1, Reshma Kar 2, Kirk Roberts 3, Hulin Wu 1,3
PMCID: PMC7233051  PMID: 32477670

Abstract

Mental health has become a growing concern in the medical field, yet remains difficult to study due to both privacy concerns and the lack of objectively quantifiable measurements (e.g., lab tests, physical exams). Instead, the data available for mental health is largely based on subjective accounts of a patient’s experience, and thus is typically expressed exclusively in text. An important source of such data comes from online sources and directly from the patient, including many forms of social media. In this work, we utilize the datasets provided by the CLPsych shared tasks in 2016 and 2017, derived from posts on the ReachOut online forum that have been manually classified according to mental health severity. We implemented an automated severity labeling system using different machine and deep learning algorithms. Our approach combines both supervised and semi-supervised embedding methods using corpora from ReachOut (both labeled and unlabelled) and WebMD (unlabelled). Metadata, syntactic, semantic, and embedding features were used to classify the posts into four categories (green, amber, red, and crisis). The developed systems outperformed other state-of-the-art systems developed on the ReachOut dataset and obtained maximum micro-averaged F-scores of 0.86 and 0.80 on the CLPsych 2016 and 2017 test datasets, respectively, using the above features.

Introduction

Around 970 million people currently suffer from mental or neurological disorders, placing mental disorders among the leading causes of ill-health and disability worldwide*. The World Health Organization reported that 800,000 people die by suicide every year due to mental illness. A lack of awareness about mental health leads individuals to associate it with social stigma, which discourages early help-seeking behavior and may consequently reduce the chances of a complete or partial recovery2. Effective and evidence-based interventions can be implemented at the population, sub-population, and individual levels to prevent suicide and suicide attempts.

With recent technological advances, mental health forums are being used to share experiences anonymously and to exchange information that is helpful for users with similar mental health conditions. These forums are a simple way to facilitate peer support online, but when they involve vulnerable people and sensitive matter, they require careful cultivation11. Social media-driven peer-support groups such as ReachOut help patients feel better by allowing them to discuss their experiences freely with other patients who have been through similar situations3. ReachOut is a non-profit organization initially started in Australia and later extended to the USA, Ireland, and Canada. It is an online forum where psychologically affected young people and their parents can get advice from professionals or from other users who have recovered from similar issues. However, it requires the supervision of moderators to provide quick support to users expressing suicidal ideation. Human moderation of these online forums raises concerns about cost and scalability11, and the increasing number of users makes the moderators’ task unmanageable without the help of automated systems14. Recent progress in text mining and natural language processing (NLP) offers a solution to this problem by identifying urgent posts so that a human moderator can be assigned sooner; it has been observed that NLP can be employed to reduce the moderators’ average response time for urgent posts11.

More recently, several NLP techniques have been proposed to decipher psychological information or psychiatric symptoms from electronic health records (EHRs)17, psychiatric notes16, 18, tweets5–9, support groups11, 12, 14, and so on. Different supervised and semi-supervised methods have been employed to extract psychiatric information from text using many machine learning algorithms. Further, deep learning techniques have been widely used to improve the performance of these systems4, 16.

In this paper, we present a system developed to identify the severity of posts so that moderators can be assigned to the most serious cases. Latent Semantic Analysis (LSA), word2vec, Doc2Vec, and other handcrafted features were used to convert posts into feature vectors. Embedding models were built using the labeled and unlabelled corpora of the Computational Linguistics and Clinical Psychology (CLPsych) shared tasks as well as a large unlabelled corpus collected from the WebMD psychiatry forum. Finally, different machine learning and deep learning algorithms were used to classify posts into four categories. The systems were trained on the CLPsych 2016 and 2017 training datasets and tested on the CLPsych 2016 and 2017 test datasets, respectively. The CLPsych datasets are unbalanced (the number of instances differs across classes, with significantly fewer instances in the red and crisis categories), so we propose a multi-level classification approach for better system performance. The multi-level system first classifies posts as urgent (red and crisis) or non-urgent (green and amber), and then classifies them into their respective categories.

Background

Mental health issues often have a devastating impact on patients and their families. However, it is unclear how mental impairment manifests in common forms of communication, such as social media10. Hence, it is worthwhile to analyze subjects’ text for symptoms of mental illness. In addition to diagnosing diseases, detecting their severity is also important in the context of social peer-support groups, in order to identify the patients who require urgent attention. This section broadly reviews progress in text analytics for mental health. Three types of tasks have been performed in recent years, depending on the dataset: a) identifying symptoms or the type of disease from the medical notes of psychiatric patients18 or EHR data17; b) identifying the type of psychological disease from social media posts such as Twitter or blogs4–10; c) identifying the severity of mental health issues from psychological peer-support forums11–14.

Psychological Analysis of Medical Notes and EHRs: Electronic health records (EHRs) facilitate the storage of narrative interpersonal communications and observations of patients. Psychiatric symptoms are the key information in such narratives, which often consist of a subjective description of the patient’s experience18. Psychiatric symptoms play an important role in the prevention, diagnosis, and treatment of mental health problems, and detecting them automatically from psychiatric notes may serve as decision support for clinicians. There has been work on both supervised and unsupervised methods for EHR data17 and for labeling psychiatric notes. Tran and Kavuluru16 addressed the former task using deep learning and obtained average F-scores of 63.14 and 61.90 with convolutional neural networks (CNNs) and recurrent hierarchical networks with attention, respectively. Zhang et al.18 present an unsupervised technique for labeling psychiatric notes by comparing symptoms in the notes to ‘seed terms’ from medical knowledge sources, namely MedlinePlus§, Mayo Clinic, and the American Psychiatric Association. That study emphasizes embeddings created from different knowledge bases, which are useful for many tasks in mental health.

Psychological Analysis of Social Media Data: Most of the work on social media text is based on data collected from Twitter4–9. The common objectives are either a) automatic labeling of mental health issues or b) predicting the onset of diseases. For automatic labeling of mental health issues, the common approach is to label Twitter data manually and then train a classification algorithm5. Gkotsis et al.4 first identified mental health-related posts (accuracy 91.8%) and then classified the posts into 11 themes (accuracy 71.3%); the best result was obtained using a CNN deep learning model. Coppersmith et al.5 presented the results of a shared and an unshared task (the latter restricted to Johns Hopkins University); their primary objective was to identify patients with PTSD and depression who shared their mental health issues on social media. A major drawback of this work was that it primarily considered precision as an evaluation metric and did not consider recall. De Choudhury et al.6 recruited patients with depression and analyzed their Twitter history; they found certain behavioral metrics to be related to depression, and their system obtained an accuracy of 70.35% in predicting depression. Since mental illness presents symptoms that are common among otherwise healthy subjects1, proactive detection of diseases from social media is an important research area. However, recent privacy concerns have complicated the study of mental health from social media.

Psychological Analysis of Peer Support Forums: In 2016 and 2017, the CLPsych shared tasks presented the challenge of marking the severity of patients’ posts11. Severity was graded into four labels, with green signifying the least severe and crisis the most severe. The best performing team12 in the 2016 task used ensemble classification with three classifiers and multiple variations of n-gram features; interestingly, the classifiers were trained on 12 fine-grained labels and then mapped to the 4 coarse labels. In the CLPsych 2016 shared task, SVM or logistic regression was a common choice among teams11. In the CLPsych 2017 shared task, Xia and Liu’s system obtained the maximum macro-averaged F-score of 0.467 using voting-based classification14. Altszyler et al.14 used syntactic, semantic, n-gram, and embedding features with ensemble-based machine learning algorithms, and their system obtained a macro-averaged F-score of 0.462. A drawback of the CLPsych shared tasks was the choice of macro-averaged F-score as one of the official metrics for ranking participants’ systems: macro-averaging is appropriate when the number of instances in each category is balanced, whereas neither shared task has an equal number of instances per class. On the other hand, it is a useful metric for emphasizing the red and crisis categories, which have fewer instances. Also, some participants discarded instances from the green and amber categories to reduce the imbalance in the datasets. In the above CLPsych shared tasks, participants mainly used machine learning algorithms with stylistic, metadata, and word-embedding features; however, domain knowledge and deep learning algorithms had not been explored for this task to date.

Materials and Methods

Dataset

The datasets used in our experiments were provided by the organizers of the CLPsych shared tasks in 2016** and 2017††. These datasets contain posts collected from ReachOut and annotated with four categories (green, amber, red, and crisis). The four categories are:

  • Crisis indicates that the author of the post (or someone they know) is at imminent risk of harming themselves or others. Such posts should be prioritized above all others so that a moderator can be assigned immediately.

  • Red indicates that a moderator should respond to the post as soon as possible.

  • Amber indicates that a moderator should address the post at some point, but they need not do so immediately.

  • Green identifies posts that do not require direct input from a moderator and can safely be left for the wider community of peers to respond to11.

The training datasets consist of 947 and 1,188 posts, and the test datasets consist of 241 and 400 posts, for 2016 and 2017, respectively. The CLPsych 2016 training and test datasets are subsets of the CLPsych 2017 training and test datasets, respectively. The organizers also provided 140,892 unlabeled posts. The statistics for the training and test datasets for each category are provided in Table 1, which shows that the datasets are not evenly distributed among the categories; the number of instances in the crisis category is much smaller than in the others.

Table 1:

Training and test datasets of CLPsych 2016 and 2017 with class-wise distribution.

Dataset         Green  Amber  Red  Crisis  Total
Training 2016     549    249  110      39    947
Training 2017     715    296  137      40   1188
Test 2016         166     47   27       1    241
Test 2017         216     94   48      42    400

Each instance contains user-related information such as login id, response status, post time, edit time, etc., together with a subject and a comment. It also contains the number of kudos given to the post by other users in the forum. For the current study, we did not use kudos because no consistent pattern across the categories was found. We do not discount the importance of kudos for other tasks, where a high number of kudos after a post is published may be significant and may be considered key information about that post. Examples of posts from the four categories are presented in Table 2. Separate files were provided with gold annotations; the annotation details can be found in Milne et al.11.

Table 2:

Class wise instances from CLPsych datasets.

Class Posts
Crisis @Sans-RO I don’t know what’s gotten over me. Increased hopelessness, increased desperation.I’m spiralling downwards, I feel like I can’t hold on. I just want to give up. I feel like my life is over. I’m curled up in bed, thinking about how I can finish.
Red my dad doesnt know how i have been feeling i have been writing and listening to music most of the day today it just isnt doing much right now. i just everything hurts so much inside and i want to scream all the time i cant even explain it i am so agitated as well and i keep having panic attacks and i just feel really bad.
Amber I woke up with nightmares about people from placement hurting me/ torturing me. I hate sleep. The good thing was I woke up at a sensible time so I turned on the light before getting up. I also felt safe being at my family’s place and it’s quiet here atm (not heaps of people like at Christmas)
Green Ah no it was kind of a joke (have you seen the ‘I am forcibly removed from the premises’ meme?) But for real though internal screaming is just my Christmas setting unfortunately It’ll be over soon.

There are many online resources for mental health patients such as MedlinePlus, Mayo Clinic, and the American Psychiatric Association. However, these contain descriptions of mental health conditions and differ from user posts in forums. Hence, we crawled the WebMD community forum, which contains comments associated with different diseases. Posts matching keywords such as anxiety, panic, bipolar, bipolar disorder, depression, and mental health were selected and used for training embeddings along with the CLPsych datasets (both labeled and unlabelled). The filtered WebMD dataset consists of 13,607 comments related to psychological experiences.

Features

This sub-section lists the features used in our experiments. Linguistic features describing the syntax and style of posts, and non-linguistic features describing post metadata, were tested separately and in combination. Features are grouped into four categories for simplified comparison in experiments. Feature selection proceeds as follows: we start with a single feature group as input and then add one feature group per experiment to find the best set of features, as sketched below.
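As a rough illustration of this incremental search, the following is a minimal sketch with dummy data and scikit-learn; the group sizes, the five-fold cross-validation, and the keep-only-if-improved rule are our assumptions, not the authors’ protocol.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_posts = 200
y_train = rng.integers(0, 4, n_posts)  # dummy labels for the four severity classes

# Hypothetical precomputed feature groups: name -> (n_posts, n_features) array
feature_groups = {
    "metadata":  rng.random((n_posts, 4)),
    "stylistic": rng.random((n_posts, 4)),
    "semantic":  rng.random((n_posts, 20)),
    "embedding": rng.random((n_posts, 100)),
}

selected, best_score = [], 0.0
for name in feature_groups:                       # add one feature group per experiment
    candidate = selected + [name]
    X = np.hstack([feature_groups[g] for g in candidate])
    score = cross_val_score(SVC(kernel="poly", degree=3), X, y_train,
                            scoring="f1_micro", cv=5).mean()
    if score > best_score:                        # keep the group only if it helps
        selected, best_score = candidate, score

print(selected, round(best_score, 3))
```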

A. Metadata Features

Several non-linguistic features from a post’s metadata were included. The selected metadata features are the day of the week (1-7), whether the author is a moderator (1 or 0), the time difference between posting and last edit in minutes, and the time of day of the post. Following Altszyler et al.14, the posting time is encoded as one of eight 3-hour time slots, which gave better performance than using 24 hourly features.
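A minimal sketch of extracting these metadata features from a single post follows, assuming ISO-formatted timestamps; the field names 'post_time', 'edit_time', and 'author_rank' are hypothetical stand-ins for the actual dataset attributes.

```python
from datetime import datetime

def metadata_features(post):
    """post: dict with hypothetical keys 'post_time', 'edit_time', and 'author_rank'."""
    posted = datetime.fromisoformat(post["post_time"])
    edited = datetime.fromisoformat(post["edit_time"])
    return [
        posted.isoweekday(),                                # day of the week, 1-7
        1 if post["author_rank"] == "moderator" else 0,     # author is a moderator or not
        (edited - posted).total_seconds() / 60.0,           # minutes between posting and last edit
        posted.hour // 3,                                   # one of eight 3-hour time slots (0-7)
    ]

print(metadata_features({"post_time": "2016-01-05T22:10:00",
                         "edit_time": "2016-01-05T22:25:00",
                         "author_rank": "member"}))
```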

B. Stylistic Features

These features have been used extensively in several text classification tasks. The stylistic features used in the experiments are described below.

Number of question marks: It was observed that many users in the red and crisis categories were not confident and asked several questions in their posts. Thus, the number of question marks was counted.

Number of first-person pronouns: When users are severely affected by a psychological disorder, they tend to write more about themselves and use words like ‘I’, ‘me’, etc. Thus, the number of first-person pronouns in each post was counted.

Number of words: It was observed that users in the crisis and red categories wrote more than other users; in fact, some users in the green category used only one or two words in a post. Thus, the number of words was used as a feature.

Number of stop words: We counted the number of stop words present in each post, using the NLTK English stop word list29.
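A minimal sketch of these four stylistic counts using NLTK follows; the tokenization and the first-person pronoun list are our choices, not necessarily the authors’.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def stylistic_features(text):
    tokens = [t.lower() for t in word_tokenize(text)]
    return [
        text.count("?"),                          # number of question marks
        sum(t in FIRST_PERSON for t in tokens),   # number of first-person pronouns
        len(tokens),                              # number of words
        sum(t in STOP_WORDS for t in tokens),     # number of stop words
    ]

print(stylistic_features("Why does everything hurt? I can't even explain it. Am I okay?"))
```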

C. Semantic Features

Number of lexicon matches: There are relatively few instances of the red and crisis categories, which makes them challenging for machine learning algorithms to detect. Therefore, we manually collected two lexicons from the training datasets for the red and crisis categories, containing 56 and 69 words/phrases, respectively. The number of words in each post matching the red and crisis lexicons was used as two features.

Number of positive and negative sentences: The organizers provided sentence-level positive and negative sentiment annotations for the CLPsych 2016 and 2017 training datasets, but the test datasets were not annotated with sentiment. Therefore, for all datasets (both training and test) we used the VADER Sentiment Intensity Analyzer‡‡ to count the number of sentences with positive and negative sentiment automatically and used these counts as features. We also used the TextBlob sentiment analysis tool§§ to determine whether a post is positive, negative, or neutral and used this as a feature.

Number of positive and negative words: Sentiment word count features proved important in state-of-the-art severity detection systems14. Therefore, several sentiment and emotion lexicons, such as SentiWordNet21, WordNet-Affect26, SentiSense24, +/-EffectWordNet22, the AFINN dictionary19, the NRC Word-Emotion Association Lexicon20, the Taboada adjective list25, and the Whissell dictionary23, were used to count positive and negative words in posts, and these counts were used as features.

Number of positive and negative emoticons: Similarly, lists of happy and sad emoticons were created manually, containing 58 and 59 emoticons, respectively. The numbers of positive and negative emoticons present in each post were used as features.
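A minimal sketch of the sentence- and post-level sentiment features using VADER and TextBlob follows; the crisis lexicon and emoticon lists below are tiny illustrative stand-ins for the manually built ones described above.

```python
import nltk
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("punkt", quiet=True)
analyzer = SentimentIntensityAnalyzer()
CRISIS_LEXICON = {"give up", "end it", "can't hold on"}   # illustrative stand-in for the real lexicon
SAD_EMOTICONS = {":(", ":'(", ":-("}                      # illustrative stand-in for the emoticon list

def semantic_features(text):
    scores = [analyzer.polarity_scores(s)["compound"] for s in sent_tokenize(text)]
    polarity = TextBlob(text).sentiment.polarity
    return [
        sum(s >= 0.05 for s in scores),                               # positive sentences (VADER)
        sum(s <= -0.05 for s in scores),                              # negative sentences (VADER)
        1 if polarity > 0 else (-1 if polarity < 0 else 0),           # TextBlob post-level polarity
        sum(phrase in text.lower() for phrase in CRISIS_LEXICON),     # crisis lexicon matches
        sum(text.count(e) for e in SAD_EMOTICONS),                    # negative emoticons
    ]

print(semantic_features("I just want to give up. I feel like my life is over :("))
```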

D. Embedding Features

Embedding methods have gained immense popularity in the last few years and have been successfully used in several text mining and NLP tasks. They are easy to implement and capture the semantic information of words or documents. For the present study, we considered both word and document embeddings.

Word Embeddings: The pre-trained word embedding models Google N-grams (News corpus) and GloVe (dimension = 200) were used in our experiments. Additionally, a word2vec model (dimension = 100) was built on the corpus collected from WebMD together with the labeled and unlabelled CLPsych forum data. A post vector was created by taking the normalized mean of the vectors of the words in the post (excluding stop words and punctuation) that are present in the vocabulary of the pre-trained model: $v_p = \frac{1}{N}\sum_{i=1}^{N} w_i$. Here, $v_p$ is the vector for a post, $w_i$ is the vector of the $i$th word in that post, and $N$ is the number of words in the post present in the GloVe, Google N-gram, or word2vec vocabulary. The gensim¶¶ package was used to build all word and document embeddings.
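A minimal sketch of this post-vector construction with gensim follows, training a small word2vec model on a toy stand-in for the combined corpus and averaging the in-vocabulary word vectors; the window size and worker count are our assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for the combined WebMD + CLPsych (labeled and unlabelled) corpus
corpus = [["i", "feel", "hopeless", "tonight"],
          ["panic", "attacks", "keep", "coming", "back"],
          ["thanks", "everyone", "for", "the", "support"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=2)

def post_vector(tokens, keyed_vectors):
    """Mean of in-vocabulary word vectors, v_p = (1/N) * sum_i w_i, then unit-normalized."""
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    if not vecs:
        return np.zeros(keyed_vectors.vector_size)
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

print(post_vector(["panic", "attacks", "tonight"], model.wv).shape)  # (100,)
```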

Document Embeddings: Document embeddings are popular in the NLP community and have been used in several NLP tasks. We used Doc2Vec (dimension = 100) as implemented in gensim. The Doc2Vec model was trained on the same datasets as word2vec.

Latent Semantic Analysis: LSA is a technique for creating a vector representation of a document, similar to document embeddings or Doc2Vec, and has been used successfully in several NLP applications. We used LSA (dimension = 100) as implemented in gensim.
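A minimal sketch of the Doc2Vec and LSA document representations in gensim follows; apart from the stated 100 dimensions, the hyperparameters and the toy corpus are our assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.corpora import Dictionary
from gensim.models import LsiModel

corpus = [["i", "feel", "hopeless", "tonight"],
          ["panic", "attacks", "keep", "coming", "back"],
          ["thanks", "everyone", "for", "the", "support"]]

# Doc2Vec: one tagged document per post
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
doc_vec = d2v.infer_vector(["panic", "attacks", "tonight"])          # 100-dimensional post vector

# LSA (LSI in gensim): truncated SVD over the bag-of-words matrix
dictionary = Dictionary(corpus)
bow = [dictionary.doc2bow(doc) for doc in corpus]
lsa = LsiModel(bow, id2word=dictionary, num_topics=100)
lsa_vec = lsa[dictionary.doc2bow(["panic", "attacks", "tonight"])]   # list of (topic, weight) pairs

print(len(doc_vec), len(lsa_vec))
```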

System Architecture

We combined the subject and body of each post. The posts were pre-processed before creating word/document embeddings; pre-processing includes tokenization, removal of stop words and punctuation, and lemmatization of each word (a sketch is given below). During the development phase, we split the training set 80/20 for tuning parameters. Finally, two types of systems were developed based on the datasets: a) a classifier trained on the CLPsych 2017 training dataset and tested on the CLPsych 2017 test dataset, and b) a classifier trained on the CLPsych 2016 training dataset and tested on the CLPsych 2016 test dataset.
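A minimal sketch of this preprocessing step with NLTK; the specific tokenizer and the WordNet lemmatizer are our assumptions.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(subject, body):
    """Combine subject and body, tokenize, drop stop words and punctuation, lemmatize."""
    tokens = word_tokenize(f"{subject} {body}".lower())
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Feeling low", "I can't sleep and everything hurts."))
```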

Previously, most work on the CLPsych tasks used SVM or Random Forest classifiers, and SVM outperformed the other classifiers14; hence, we also implemented SVM for the present task. Deep learning models are popular and have outperformed other approaches in many NLP tasks, but they had not been applied to the CLPsych datasets; thus, we implemented a few popular deep learning algorithms, as described below. We first developed systems based on different feature combinations using SVM and identified the best performing feature set. This feature set was then used with different classification techniques: SVM, ensemble-based voting, a convolutional neural network (CNN), and long short-term memory (LSTM). Because the training and test datasets are not evenly distributed among the classes, and multi-level classification has been observed to perform well on imbalanced datasets27, we also used a multi-level ensemble-based voting method: posts are first classified as urgent (red and crisis combined) or non-urgent (amber and green combined), and then each branch is classified into its separate categories (a sketch follows).
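A minimal sketch of the two-level voting setup with scikit-learn, assuming string class labels and soft voting (the original voting scheme is not specified in the text):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

URGENT = ("red", "crisis")

def make_voter():
    # base learners follow the classifiers named in the Experimental Details section
    return VotingClassifier([
        ("svm", SVC(kernel="poly", degree=3, probability=True)),
        ("rf", RandomForestClassifier(max_depth=3)),
        ("lr", LogisticRegression(max_iter=1000)),
    ], voting="soft")

def fit_multilevel(X, y):
    X, y = np.asarray(X), np.asarray(y)
    urgent_mask = np.isin(y, URGENT)
    # level 1: urgent vs. non-urgent
    level1 = make_voter().fit(X, np.where(urgent_mask, "urgent", "non-urgent"))
    # level 2: one classifier per branch
    level2 = {"urgent": make_voter().fit(X[urgent_mask], y[urgent_mask]),
              "non-urgent": make_voter().fit(X[~urgent_mask], y[~urgent_mask])}
    return level1, level2

def predict_multilevel(level1, level2, X):
    X = np.asarray(X)
    branch = level1.predict(X)
    preds = np.empty(len(X), dtype=object)
    for b in ("urgent", "non-urgent"):
        mask = branch == b
        if mask.any():
            preds[mask] = level2[b].predict(X[mask])
    return preds
```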

The LSTM and CNN models were trained using the combined handcrafted features; the main goal of this experiment was to examine how LSTM and CNN perform on handcrafted features. These models were trained on CLPsych 2016 and 2017 separately and tested on the corresponding test datasets. However, deep learning models are typically trained on raw text rather than handcrafted features. Thus, we also implemented a deep learning model using only word and character embeddings. Word and character embeddings were processed separately by LSTMs, and the two outputs were concatenated. The concatenated output was then fed to a BiLSTM followed by a densely connected layer with four ReLU units, one for each of the four classes; finally, the output was passed through a softmax activation function. The detailed BiLSTM architecture is provided in Figure 1, and a minimal sketch is given below.
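A minimal Keras sketch of this word- and character-level architecture follows; the sequence lengths, vocabulary sizes, character embedding dimension, and the choice to concatenate the two LSTM output sequences along the time axis are our assumptions.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Activation, Bidirectional, Concatenate,
                                     Dense, Embedding, LSTM)

MAX_WORDS, MAX_CHARS = 200, 1000        # assumed maximum sequence lengths
WORD_VOCAB, CHAR_VOCAB = 20000, 100     # assumed vocabulary sizes

# word branch: word embeddings followed by an LSTM
word_in = Input(shape=(MAX_WORDS,), name="words")
w = Embedding(WORD_VOCAB, 100)(word_in)
w = LSTM(200, return_sequences=True, dropout=0.3)(w)

# character branch: character embeddings followed by an LSTM
char_in = Input(shape=(MAX_CHARS,), name="chars")
c = Embedding(CHAR_VOCAB, 50)(char_in)
c = LSTM(200, return_sequences=True, dropout=0.3)(c)

# concatenate the two output sequences and feed them to a BiLSTM
x = Concatenate(axis=1)([w, c])
h = Bidirectional(LSTM(200, dropout=0.3))(x)

# dense layer with four ReLU units (one per class), then softmax
h = Dense(4, activation="relu")(h)
out = Activation("softmax")(h)

model = Model([word_in, char_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit([X_words, X_chars], y_onehot, batch_size=50, epochs=25)
```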

Figure 1: Deep learning framework for severity detection.

Experimental Details

Each feature was normalized by dividing it by the maximum value of the corresponding feature’s input range. A polynomial kernel of degree three was used for the SVM. The ensemble-based voting classifier was constructed from a polynomial SVM (degree = 3), Random Forest (maximum depth = 3), and Logistic Regression. For the deep learning algorithms, the Adam optimizer was used to minimize the categorical cross-entropy loss. The training data was divided into batches of 50 samples, and each neural network was trained for 25 epochs. For LSTM and BiLSTM, the models were initialized with 200 hidden units and a dropout of 0.3. The ReLU activation function was used for each layer, and softmax was applied in the output layer for classification.
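A minimal sketch of the max-value normalization, assuming the maxima are taken from the training split and reused for the test split:

```python
import numpy as np

def normalize_by_max(X_train, X_test):
    """Divide each feature column by its maximum absolute value on the training split."""
    col_max = np.abs(X_train).max(axis=0)
    col_max[col_max == 0] = 1.0          # guard against all-zero columns
    return X_train / col_max, X_test / col_max

X_train = np.array([[3.0, 0.0, 120.0], [1.0, 0.0, 30.0]])
X_test = np.array([[2.0, 1.0, 60.0]])
print(normalize_by_max(X_train, X_test))
```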

For the CNN, two convolutional layers followed by max-pooling over a 2 × 2 grid and a dropout of 0.5 were passed through a ReLU activation function, and softmax was applied in the output layer for classification. The deep learning models were implemented using Keras.

Result Analysis and Discussion

We considered several evaluation metrics for comparing our results: micro-averaged F-score (the F-score over all categories, as well as per-category F-scores); macro-averaged F-score (the average F-score over the crisis, red, and amber labels); flagged vs. non-flagged F-score (the F-score of flagged (crisis + red + amber) vs. non-flagged (green) labels); and urgent vs. non-urgent F-score (the F-score of urgent (crisis + red) vs. non-urgent (amber + green) labels). A sketch of these metrics is given below.
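A minimal sketch of these four evaluation views with scikit-learn; the exact aggregation used by the shared task for the two binary views is an assumption (here, the F-score of the positive collapsed class).

```python
import numpy as np
from sklearn.metrics import f1_score

def collapse(labels, positive):
    """Binarize: 1 if the label belongs to the positive group, else 0."""
    return np.isin(labels, list(positive)).astype(int)

def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    flagged, urgent = {"crisis", "red", "amber"}, {"crisis", "red"}
    return {
        "micro": f1_score(y_true, y_pred, average="micro"),
        # macro-average restricted to crisis, red, and amber, per the shared-task definition
        "macro": f1_score(y_true, y_pred, labels=["crisis", "red", "amber"],
                          average="macro", zero_division=0),
        "flagged_vs_nonflagged": f1_score(collapse(y_true, flagged),
                                          collapse(y_pred, flagged)),
        "urgent_vs_nonurgent": f1_score(collapse(y_true, urgent),
                                        collapse(y_pred, urgent)),
    }

print(evaluate(["green", "red", "crisis", "amber"],
               ["green", "red", "amber", "amber"]))
```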

Results

Initially, several SVM-based systems were developed using different feature combinations. Detailed comparisons of the various feature sets for the CLPsych 2016 and 2017 datasets are provided in Table 3. A maximum micro-averaged F-score of 0.71 was obtained on the CLPsych 2017 test dataset using all feature categories (metadata, stylistic, semantic, and word2vec). The word2vec embeddings trained on domain-specific data performed better than the other embeddings; this word2vec model was built from the training dataset, the unlabeled corpus provided by the organizers, and the WebMD forum data. For the CLPsych 2016 test dataset, a maximum F-score of 0.76 was obtained using the same features. Adding embedding features to the metadata, stylistic, and semantic features improved performance substantially (by at least 11% in micro-averaged F-score) on the CLPsych 2017 test dataset, whereas the improvement was smaller on the CLPsych 2016 test dataset, where the metadata, stylistic, and semantic features alone already achieved a micro-averaged F-score of 0.70.

Table 3:

System performances using different feature sets with SVM (M: Metadata features, St: Stylistic features, Se: Semantic features, E: Embedding features, GN: Google N-gram). All values are F-scores.

Features micro-average macro-average flagged vs. non-flagged urgent vs. non-urgent crisis red amber green
CLPsych 2017 test dataset
M 0.36 0.29 0.52 0.58 0 0.11 0.25 0.52
M + St 0.52 0.33 0.56 0.77 0 0.12 0.27 0.66
M + St + Se 0.56 0.45 0.61 0.79 0.08 0.14 0.31 0.70
M + St + Se + E (LSA) 0.70 0.55 0.82 0.81 0.20 0.48 0.70 0.82
M + St + Se + E (GN) 0.67 0.52 0.80 0.81 0.25 0.42 0.59 0.82
M + St + Se + E (word2vec) 0.71 0.56 0.82 0.84 0.23 0.51 0.65 0.84
M + St + Se + E (GloVe) 0.68 0.61 0.81 0.83 0.25 0.45 0.63 0.83
M + St + Se + E (Doc2Vec) 0.67 0.50 0.71 0.82 0.23 0.46 0.50 0.79
CLPsych 2016 test dataset
M 0.53 0.19 0.59 0.62 0 0.13 0.14 0.69
M + St 0.65 0.28 0.73 0.80 0 0.13 0.18 0.80
M + St + Se 0.70 0.46 0.75 0.81 0 0.46 0.40 0.81
M + St + Se + E (LSA) 0.72 0.49 0.78 0.91 0 0.56 0.47 0.83
M + St + Se + E (GN) 0.71 0.47 0.76 0.89 0 0.55 0.49 0.82
M + St + Se + E (word2vec) 0.76 0.67 0.82 0.92 0.67 0.60 0.54 0.86
M + St + Se + E (GloVe) 0.73 0.59 0.79 0.90 0.67 0.48 0.35 0.86
M + St + Se + E (Doc2Vec) 0.73 0.36 0.77 0.90 0 0.47 0.27 0.85

There was only one instance of the crisis category in the CLPsych 2016 test dataset, and most feature sets were not able to identify it. The GloVe- and word2vec-based systems did identify this crisis instance, but they also misclassified another red instance as crisis. Although this is a misclassification, it still serves our purpose, since moderators are assigned to urgent posts (red and crisis) as soon as possible. Metadata and stylistic features alone were not sufficient to identify the crisis category in either dataset. The dominance of the green category can be observed from the per-category F-scores: in all settings, the F-score for the green category is higher than for the others.

Further, we implemented different machine and deep learning algorithms with the metadata, stylistic, semantic, and word2vec embedding features on the CLPsych 2016 and 2017 datasets. The detailed system performances are provided in Table 4. The maximum micro-averaged F-scores of 0.80 and 0.86 were obtained for the CLPsych 2017 and 2016 test datasets, respectively, using the multi-level ensemble voting classifier. The BiLSTM with raw input (word and character embeddings) achieved scores comparable to the ensemble voting multi-level (EVML) classifier, and for the CLPsych 2017 dataset the BiLSTM achieved a better F-score for the amber category.

Table 4:

Performances of different systems using the fixed feature set and embeddings (word and character) (EV: Ensemble voting, EVML: Ensemble voting multi-level). All values are F-scores.

Models micro-average macro-average flagged vs. non-flagged urgent vs. non-urgent crisis red amber green
CLPsych 2017 test dataset
SVM 0.71 0.56 0.82 0.84 0.23 0.51 0.65 0.84
LSTM 0.75 0.61 0.82 0.86 0.40 0.61 0.76 0.85
CNN 0.73 0.58 0.84 0.85 0.35 0.55 0.67 0.84
EV 0.78 0.68 0.85 0.88 0.44 0.68 0.77 0.85
EVML 0.80 0.69 0.89 0.91 0.50 0.68 0.78 0.85
BiLSTM 0.78 0.68 0.87 0.88 0.44 0.68 0.80 0.85
CLPsych 2016 test dataset
SVM 0.76 0.67 0.82 0.92 0.67 0.60 0.54 0.86
LSTM 0.85 0.75 0.82 0.92 0.67 0.78 0.75 0.90
CNN 0.79 0.71 0.82 0.92 0.67 0.67 0.72 0.86
EV 0.85 0.78 0.84 0.92 0.67 0.78 0.75 0.90
EVML 0.86 0.79 0.87 0.96 0.67 0.84 0.77 0.90
BiLSTM 0.85 0.79 0.85 0.93 0.67 0.82 0.75 0.90

Discussion

Our study proposes an NLP system based on a multi-level ensemble voting classifier to identify post severity in forum data. The system achieved better performance than the other approaches we tested (LSTM, CNN, BiLSTM, and single-level ensemble voting). We find that combining the handcrafted features with the domain-trained word2vec embedding outperforms other feature combinations. Most state-of-the-art systems were not able to identify urgent posts because of the imbalanced training datasets. The lexicon-based features helped our systems identify more urgent (red and crisis) posts, at the cost of an increased number of false positives (non-urgent posts flagged as urgent). However, such a system is more useful in practice, as no urgent post should be missed.

The BiLSTM with only embedding inputs performed comparably to the multi-level ensemble voting approach. One of the main reasons may be the small number of training instances: deep learning models are known to be data-hungry and require sufficient data to support their complex architectures. Deep learning-based systems may perform better if more data become available in the future.

Our system on the CLPsych 2016 test dataset outperformed the system of Kim et al.12, which achieved a maximum F-score of 0.85 using TF-IDF features and SVM. For the CLPsych 2017 dataset, the maximum macro-averaged F-score of 0.462 was achieved by the system of Altszyler et al.14 using TF-IDF-weighted unigrams, word2vec (trained only on the CLPsych training data), and ensemble-based classifiers; our proposed system outperforms this system in terms of macro-averaged F-score. However, the flagged vs. non-flagged F-score is an important evaluation criterion for the task organizers because it measures a system’s ability to identify posts that need a moderator’s attention. The system developed by Altszyler et al.14 achieved a maximum flagged vs. non-flagged F-score of 0.905, whereas our proposed system achieved 0.89.

The key contributions of this paper are listed below.

  • An unlabelled psychological forum corpus was extracted from WebMD and used, together with the CLPsych labeled and unlabelled datasets, to enhance the embeddings.

  • Multiple machine and deep learning algorithms were trained to predict the severity of posts in CLPsych data.

  • Multi-level ensemble-based voting classification was performed to mitigate the bias caused by the small number of posts with high severity. This system outperformed all other systems, including state-of-the-art systems, in terms of micro-averaged F-score.

Conclusion

In this paper, we designed an NLP-based system to detect patients’ mental health severity from forum data extracted from ReachOut. We applied different embedding features and found that word2vec performed best among all the embeddings. A multi-level ensemble-based voting classifier performed well for classifying flagged posts. We achieved better results in classifying severity on both the CLPsych 2016 and 2017 datasets than all state-of-the-art systems developed to date. We believe this system may be used to identify the severity of psychological issues in posts on other social networks such as Facebook and Twitter. In the future, we will explore different techniques for mitigating bias in the datasets; more advanced deep learning algorithms and character embeddings may also be explored in further studies.

Acknowledgements

This project is mainly supported by the Center for Big Data in Health Sciences (CBD-HS) at School of Public Health, University of Texas Health Science Center at Houston (UTHealth), and partially supported by the Cancer Research and Prevention Institute of Texas (CPRIT) project RP170668 (KR, HW) as well as the Patient-Centered Outcomes Research Institute (PCORI) project ME-2018C1-10963 (KR). We also thank the organizers of CLPsych 2016 and 2017 for providing datasets.


References

  • 1. Kroenke K, Spitzer RL, Williams JB, Linzer M, Hahn SR, deGruy III FV, Brody D. Physical symptoms in primary care: predictors of psychiatric disorders and functional impairment. Archives of Family Medicine. 1994;3(9):774. doi: 10.1001/archfami.3.9.774.
  • 2. McGorry PD, Yung AR, Phillips LJ, Yuen HP, Francey S, Cosgrave EM, Germano D, Bravin J, McDonald T, Blair A, Adlard S. Randomized controlled trial of interventions designed to reduce the risk of progression to first-episode psychosis in a clinical sample with subthreshold symptoms. Archives of General Psychiatry. 2002;59(10):921–928. doi: 10.1001/archpsyc.59.10.921.
  • 3. Solomon P. Peer support/peer provided services underlying processes, benefits, and critical ingredients. Psychiatric Rehabilitation Journal. 2004;27(4):392. doi: 10.2975/27.2004.392.401.
  • 4. Gkotsis G, Oellrich A, Velupillai S, Liakata M, Hubbard TJ, Dobson RJ, Dutta R. Characterisation of mental health conditions in social media using Informed Deep Learning. Scientific Reports. 2017;7:45141. doi: 10.1038/srep45141.
  • 5. Coppersmith G, Dredze M, Harman C, Hollingshead K, Mitchell M. CLPsych 2015 shared task: Depression and PTSD on Twitter. Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality; 2015. pp. 31–39.
  • 6. De Choudhury M, Gamon M, Counts S, Horvitz E. Predicting depression via social media. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media; 2013. pp. 128–137.
  • 7. Coppersmith G, Dredze M, Harman C. Quantifying mental health signals in Twitter. Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality; 2014. pp. 51–60.
  • 8. Chen X, Sykora MD, Jackson TW, Elayan S. What about mood swings: Identifying depression on Twitter with temporal measures of emotions. Companion Proceedings of The Web Conference; 2018. pp. 1653–1660.
  • 9. Leis A, Ronzano F, Mayer MA, Furlong LI, Sanz F. Detecting signs of depression in tweets in Spanish: Behavioral and linguistic analysis. Journal of Medical Internet Research. 2019;21(6):e14199. doi: 10.2196/14199.
  • 10. Park A, Conway M. Harnessing Reddit to understand the written-communication challenges experienced by individuals with mental health disorders: Analysis of texts from mental health communities. Journal of Medical Internet Research. 2018;20(4):e121. doi: 10.2196/jmir.8219.
  • 11. Milne DN, Pink G, Hachey B, Calvo RA. CLPsych 2016 shared task: Triaging content in online peer-support forums. Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology; 2016. pp. 118–127.
  • 12. Mac Kim S, Wang Y, Wan S, Paris C. Data61-CSIRO systems at the CLPsych 2016 shared task. Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology; 2016. pp. 128–132.
  • 13. Pfeiffer PN, Heisler M, Piette JD, Rogers MA, Valenstein M. Efficacy of peer support interventions for depression: a meta-analysis. General Hospital Psychiatry. 2011;33(1):29–36. doi: 10.1016/j.genhosppsych.2010.10.002.
  • 14. Altszyler E, Berenstein AJ, Milne D, Calvo RA, Slezak DF. Using contextual information for automatic triage of posts in a peer-support forum. Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic; 2018. pp. 57–68.
  • 15. Malmasi S, Zampieri M, Dras M. Predicting post severity in mental health forums. Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology; 2016. pp. 133–137.
  • 16. Tran T, Kavuluru R. Predicting mental conditions based on “history of present illness” in psychiatric notes with deep neural networks. Journal of Biomedical Informatics. 2017;75:S138–48. doi: 10.1016/j.jbi.2017.06.010.
  • 17. Gorrell G, Roberts A, Jackson R, Stewart R. Finding negative symptoms of schizophrenia in patient records. Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP; 2013. pp. 9–17.
  • 18. Zhang Y, Zhang O, Wu Y, Lee HJ, Xu J, Xu H, Roberts K. Psychiatric symptom recognition without labeled data using distributional representations of phrases and on-line knowledge. Journal of Biomedical Informatics. 2017;75(S):S129–S137. doi: 10.1016/j.jbi.2017.06.014.
  • 19. Naveed N, Gottron T, Kunegis J, Alhadi AC. Bad news travel fast: A content-based analysis of interestingness on Twitter. Proceedings of the 3rd International Web Science Conference; 2011. p. 8.
  • 20. Mohammad S, Turney P. Crowdsourcing a word-emotion association lexicon. Computational Intelligence. 2013;29(3):436–465.
  • 21. Baccianella S, Esuli A, Sebastiani F. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. Proceedings of the 7th Conference on International Language Resources and Evaluation; 2010. pp. 2200–2204.
  • 22. Choi Y, Wiebe J. +/-EffectWordNet: Sense-level lexicon acquisition for opinion inference. Proceedings of the Conference on Empirical Methods in Natural Language Processing; 2014.
  • 23. Whissell C, Fournier M, Pelland R, Weir D, Makarec K. A dictionary of affect in language: IV. Reliability, validity, and applications. Perceptual and Motor Skills. 1986;62(3):875–888.
  • 24. Carrillo-de-Albornoz J, Plaza L, Gervas P. SentiSense: An easily scalable concept-based affective lexicon for sentiment analysis. Proceedings of the 8th International Conference on Language Resources and Evaluation; 2012. pp. 3562–3567.
  • 25. Taboada M, Brooke J, Tofiloski M, Voll K, Stede M. Lexicon-based methods for sentiment analysis. Computational Linguistics. 2011;37(2):267–307.
  • 26. Strapparava C, Valitutti A. WordNet-Affect: an affective extension of WordNet. Proceedings of the 4th International Conference on Language Resources and Evaluation; 2004. pp. 1083–1086.
  • 27. Patra BG, Mazumdar S, Das D, Rosso P, Bandyopadhyay S. A multilevel approach to sentiment analysis of figurative language in Twitter. Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics; 2016. pp. 281–291.
  • 28. Passerello GL, Hazelwood JE, Lawrie S. Using Twitter to assess attitudes to schizophrenia and psychosis. BJPsych Bulletin. 2019:1–9. doi: 10.1192/bjb.2018.115.
  • 29. Bird S, Loper E, Klein E. Natural Language Processing with Python. O’Reilly Media Inc.; 2009.
