NIHPA Author Manuscripts. Author manuscript; available in PMC: 2022 Apr 1.
Published in final edited form as: J Psychiatr Res. 2021 Feb 2;136:95–102. doi: 10.1016/j.jpsychires.2021.01.052

Using Weak Supervision and Deep Learning to Classify Clinical Notes for Identification of Current Suicidal Ideation

Marika Cusick 1, Prakash Adekkanattu 1, Thomas R Campion Jr 1,2, Evan T Sholle 1, Annie Myers 2, Samprit Banerjee 2, George Alexopoulos 3, Yanshan Wang 4,5, Jyotishman Pathak 2,3
PMCID: PMC8009838  NIHMSID: NIHMS1669210  PMID: 33581461

Abstract

Mental health concerns, such as suicidal thoughts, are frequently documented by providers in clinical notes, as opposed to structured coded data. In this study, we evaluated weakly supervised methods for detecting “current” suicidal ideation from unstructured clinical notes in electronic health record (EHR) systems. Weakly supervised machine learning methods leverage imperfect labels for training, alleviating the burden of creating a large manually annotated dataset. After identifying a cohort of 600 patients at risk for suicidal ideation, we used a rule-based natural language processing (NLP) approach to label the training and validation notes (n=17,978). Using this large corpus of clinical notes, we trained several statistical machine learning models—logistic classifier, support vector machines (SVM), naive Bayes classifier—and one deep learning model, namely a text classification convolutional neural network (CNN), to be evaluated on a manually-reviewed test set (n=837). The CNN model outperformed all other methods, achieving an overall accuracy of 94% and an F1-score of 0.82 on documents with “current” suicidal ideation. This algorithm correctly identified an additional 42 encounters and 9 patients indicative of suicidal ideation but missing a structured diagnosis code. When applied to a random subset of 5,000 clinical notes, the algorithm classified 0.46% (n=23) as positive for “current” suicidal ideation, of which 87% were truly indicative upon manual review. Implementation of this approach for large-scale document screening may play an important role in point-of-care clinical information systems for targeted suicide prevention interventions and improve research on the pathways from ideation to attempt.

Keywords: weak supervision, deep learning, natural language processing, machine learning, suicidal ideation

INTRODUCTION

Suicide is one of the leading causes of death in the United States (Murphy et al., 2018). In 2017, 47,173 deaths were attributed to suicide, making up 1.2% of the total deaths in the U.S. (Heron, 2019). Based on evidence that a substantial portion of people who die by suicide come into contact with the healthcare system in the year before their death (Ahmedani et al., 2014), research has focused on developing suicide risk prediction algorithms using clinical data from electronic health record (EHR) systems (Belsher et al., 2019). However, a recent review of these algorithms found that they are limited by high false positive rates, offering little utility in a clinical setting (Kessler et al., 2020). One possible reason is their use of only structured EHR data, such as prior diagnoses, medications, and procedures, which may not contain information on a patient’s historical or current mental illnesses (Goldstein et al., 1991; Simon et al., 2019, 2018). Incorporating data extracted by natural language processing (NLP) from clinician encounter notes, which contain more pertinent information on a patient’s physical and mental health status, may improve prediction performance (Demner-Fushman et al., 2009).

Suicidal ideation, thoughts about committing suicide, is a known precursor to suicide attempts and suicide (Brown et al., 2000; Kuo et al., n.d.). While studies have shown that suicidal ideation does not always precede attempts, a subset of patients do transition from ideation to attempt (Louzon et al., 2016; Pirkis et al., 2000; ten Have et al., 2009), and it is critical to intervene for any patient known to be at risk for suicide. It has been established that this key predictor is frequently documented in unstructured clinical notes, but not as structured International Classification of Diseases (ICD-9/10) diagnosis codes (Anderson et al., 2015; Hammond et al., 2013). With the objective of extracting this predictor from clinical text, we developed an NLP approach using weak supervision, a branch of machine learning that uses inexact or noisy labels to provide the supervision signal, alleviating the burden of producing a large manually-reviewed training set, which is costly and time-intensive (Wang et al., 2019). In this study, we evaluated the performance of weakly supervised machine learning and deep learning methods for extracting “current” suicidal ideation from clinical notes. Instead of using a manually annotated training set, our algorithms were trained on notes with imperfect labels derived from a simple rule-based NLP approach.

Our approach is novel along two major dimensions. First, while prior research has demonstrated success in using NLP to detect “any” suicidal ideation regardless of temporality (Fernandes et al., 2018), our focus is to detect “current” suicidal ideation, ideation that occurs in the documented encounter, as the transition from ideation to a fatal or non-fatal suicide attempt can happen quickly (Britton et al., 2012). Studies indicate that affirmative responses to item 9 of the Patient Health Questionnaire (PHQ-9), frequency of suicidal ideation in the past two weeks, were associated with increased risk for fatal and non-fatal suicide attempts (Louzon et al., 2016; Simon et al., 2013). Among a subset of patients seen in the Veterans Health Administration (VHA) setting, frequency of suicidal ideation in the past two weeks reported as “several days”, “more than half the days”, and “nearly every day” was associated with 75%, 115%, and 185% increased risk of suicide in comparison to those who reported “none at all” (Louzon et al., 2016). This is further evidenced by results from the National Comorbidity Survey Replication (NCS-R) showing that a majority of first suicide attempts, 60% of planned and 90% of unplanned, occur within a year of initial suicidal ideation onset (Borges et al., 2006). Emphasis on “current” suicidal ideation is also evident in the Columbia Suicide Severity Rating Scale (C-SSRS), the most evidence-based suicide risk assessment tool used by clinicians (Posner et al., 2011), which assesses recent ideation (e.g. within the last week) rather than past ideation. In using the C-SSRS to predict suicide attempts, while researchers found prior suicide attempts to be the strongest predictor of future suicide, there was some evidence of predictive value from the suicidal ideation component (Brown et al., 2020).

Second, to extract “current” suicidal ideation from clinical text, we propose the use of weak supervision (Zhou, 2018), machine learning methods that leverage imperfect labels for training and thus do not rely on a manually-labeled, large-scale training corpus, which is resource-intensive, cumbersome, and time-consuming to produce. This methodology has proven successful in other areas, most notably in visual object detection and text classification, as researchers have developed models to identify interactions between humans and objects in images and speculative language in scientific literature (Medlock and Briscoe, 2007; Prest et al., 2012). However, this approach has yet to be widely applied in clinical research studies, including clinical text classification. The use of weakly supervised methods allowed us to make use of a much larger training corpus in comparison to previous studies, as we did not require any manual annotation of training notes.

Improved identification of patients who experience suicidal ideation has multiple benefits to the research and clinical community. First, since suicide pathways are not well understood, we can use these methods to better identify patterns of transition from suicidal ideation to future suicide attempts. Second, researchers working to predict suicide using structured EHR data can augment their algorithms with information from clinical text, namely repeated evidence of suicidal ideation. Finally, in an informatics approach, clinicians can utilize these algorithms for large-scale scanning of notes to identify patients at risk for suicide and target timely suicide prevention interventions appropriately.

MATERIALS AND METHODS

Data Source

Weill Cornell Medicine (WCM) is an academic medical center in New York City (NYC) with over 1,500 physicians at over 45 locations throughout the NYC metropolitan region, facilitating more than 3 million annual patient encounters (Weill Cornell Medicine, n.d.). WCM is affiliated with NewYork-Presbyterian Hospital (NYPH), which serves as the inpatient and emergency setting for WCM patients. While WCM physicians use the EpicCare Ambulatory EHR system to document outpatient care, inpatient and emergency care at NYPH is documented using the Allscripts Sunrise Clinical Manager (SCM) EHR system. In this study, we used clinical notes originating from both the outpatient (Epic) and inpatient/emergency care (SCM) EHR systems, which share all clinical data, including encounter notes, on the basis of a common medical record number (MRN) for each patient (Sholle et al., 2017). This study was approved by the WCM Institutional Review Board (IRB).

Study Population

Our study corpus consisted of 18,815 clinical notes documented during outpatient, emergency, or inpatient psychiatric encounters at WCM/NYPH between 2006 and 2020. To select our corpus, we identified 200 patients who had at least one encounter coded with one of the following International Classification of Diseases (ICD) codes for suicidal ideation: V62.84 (ICD-9) and R45.85 (ICD-10). We then selected 400 patients with potential suicidal ideation, those who had at least ten notes with one or more key suicidal ideation terms: suicidality, suicidal, SI, or suicide. This resulted in a patient cohort with a 1:2 ratio between patients coded for suicidal ideation and those with potential suicidal ideation. For each patient included in the study, we selected every note documented at WCM or NYPH with one of the key suicidal ideation terms. In doing so, for each patient, we captured the majority of notes related to suicidal ideation available in the EHR. We identified a total of 6,588 notes (mean: 33 notes/patient) for the 200 patients coded for suicidal ideation and 12,227 notes (mean: 31 notes/patient) for the 400 patients with potential suicidal ideation. Selected notes were on average 713 words in length (median: 539 words) with the shortest being 6 words and longest being 5,109 words. We then randomly divided the patient and document corpus into training (n=13,426 notes; 456 patients), validation (n=4,552 notes; 114 patients), and test (n=837 notes; 30 patients) datasets. Our splitting approach ensures no single patient can have documents in two datasets, and thus, our weakly supervised methods are tested on documents of unseen patients.
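The patient-level splitting described above can be sketched as follows. This is an illustrative assumption of how such a split might be implemented, not the authors' code: the note dictionaries and the `mrn` key are hypothetical stand-ins for the actual data model, and the split fractions roughly mirror the reported 456/114/30 patient counts.

```python
import random

def split_by_patient(notes, fractions=(0.76, 0.19, 0.05), seed=0):
    """Assign whole patients (not individual notes) to train/val/test,
    so that no patient's documents appear in more than one split."""
    patients = sorted({n["mrn"] for n in notes})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    cut1 = int(n * fractions[0])
    cut2 = cut1 + int(n * fractions[1])
    split_of = {}
    for i, p in enumerate(patients):
        split_of[p] = "train" if i < cut1 else ("val" if i < cut2 else "test")
    buckets = {"train": [], "val": [], "test": []}
    for note in notes:  # every note follows its patient's assignment
        buckets[split_of[note["mrn"]]].append(note)
    return buckets
```

Because assignment happens at the patient level, the test models are always evaluated on documents of unseen patients, as the study requires.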

Clinical notes in our study were documented in the following settings: ambulatory office visits (68%), social work sessions (15%), and hospital encounters (11%) and by the following most common outpatient provider specialties: psychiatry (34%), internal medicine (13%), infectious disease (13%), and chemical dependency (12%).

Table 1 below gives the characteristics of our study population, broken down between the two patient groups: coded suicidal ideation (SI) patients and potential suicidal ideation (SI) patients.

Table 1.

Study population characteristics

Coded SI N(%) Potential SI N (%) Total N(%)
Gender Female 125 (63%) 245 (61%) 370 (62%)
Male 75 (38%) 155 (39%) 230 (38%)
Ethnicity Hispanic or Latino or Spanish Origin 41 (21%) 75 (19%) 116 (19%)
Not Hispanic or Latino or Spanish Origin 93 (47%) 188 (47%) 281 (47%)
Unknown 75 (38%) 137 (34%) 212 (35%)
Race Asian 2 (1%) 10 (3%) 12 (2%)
Black or African American 20 (10%) 50 (13%) 70 (12%)
Native Hawaiian or Pacific Islander 0 (0%) 1 (<1%) 1 (<1%)
Other Combinations Not Described 38 (19%) 74 (19%) 112 (19%)
White 94 (47%) 172 (43%) 266 (44%)
Unknown 45 (23%) 93 (23%) 138 (23%)
Age* <20 9 (5%) 28 (7%) 37 (6%)
20-40 49 (25%) 95 (24%) 144 (24%)
40-60 73 (37%) 136 (34%) 209 (35%)
60-80 47 (24%) 112 (28%) 159 (27%)
80 + 22 (11%) 29 (7%) 51 (9%)
Encounters Hospital encounter 192 (96%) 296 (74%) 488 (81%)
Psychiatric hospital encounter 72 (36%) 39 (10%) 111 (19%)
Office visit 192 (96%) 377 (94%) 569 (95%)
Psychiatric Office visit 24 (12%) 129 (33%) 153 (26%)
Comorbidities** Depression 159 (80%) 237 (59%) 396 (66%)
Anxiety 76 (38%) 157 (39%) 233 (39%)
Hypertension 86 (43%) 134 (34%) 220 (37%)
Substance use disorder 85 (43%) 111 (28%) 196 (33%)
Diabetes 40 (20%) 69 (17%) 109 (18%)
HIV 15 (8%) 53 (13%) 68 (11%)
Total 200 400 600
*

Age at the time of first clinical note

**

5 most common comorbidities (ICD-10 diagnosis codes provided in Appendix A)

Labeling

As per the schematic outline of the study shown in Figure 1, at the document level, our clinical notes were annotated for “current” suicidal ideation according to their training, validation, and test split. Training and validation notes were annotated by the rule-based NLP approach further described below, and test notes were manually annotated by two annotators sharing annotation guidelines. We assessed inter-rater agreement using Cohen's kappa statistic. Patient-level classification was a direct product of the document-level classification, as patients were classified as positive if they had at least one note classified as positive for current suicidal ideation. Given this approach, we interpret patient-level classification for “current” suicidal ideation as patients who should be flagged for suicidal ideation at the time of their suicidality.
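Inter-rater agreement was assessed with Cohen's kappa, which compares observed agreement against the agreement expected by chance from each annotator's label marginals. A minimal sketch of the computation (not the authors' tooling):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # observed agreement: fraction of items both annotators label identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of the two annotators' marginal rates per category
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)
```

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the annotators' reported agreement of 0.98 is near the top of this scale.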

Figure 1.


Schematic outline of the study

We developed a rule-based NLP approach to classify notes for “current” suicidal ideation. Our rule-based NLP method was built on the foundation of NegEx, a simple algorithm for identifying negated findings and diseases in discharge summaries (Chapman et al., 2001). The algorithm relies on two lexicons: one defining target concepts (e.g. suicidal) and the other defining modifier concepts (e.g. not). We defined the target lexicon to be the key suicidal ideation terms used in our note filtering process: suicidality, suicidal, SI, or suicide. The modifier lexicon was designed to negate suicidal ideation based on four different categories: negation, historical, conditional, and unrelated (Kang, 2009). Table 2 displays a number of examples for each category. We also made adjustments to the algorithm to negate mentions of suicidal ideation not in a “typical” natural language form (e.g. “Suicidal ideation: not present”). Details of the rule-based NLP approach are available on GitHub1.

Table 2.

Rule-based approach modifier categories

Modifier category Examples
Negated Denied, negative for, never, not, no
Historical History, previous, attempted, YEAR, DATE
Conditional Lifeline, emergency number, even if, return if
Unrelated Brother, sister, family history, husband, wife
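A minimal sketch of how such a NegEx-style rule could assign weak labels at the sentence level, using the target terms and a few of the modifier cues from Table 2. The window size and matching details here are illustrative assumptions, not the authors' exact implementation (which also handles templated text such as “Suicidal ideation: not present”):

```python
import re

TARGETS = {"suicidality", "suicidal", "si", "suicide"}
MODIFIERS = {
    "negated": {"denied", "negative for", "never", "not", "no"},
    "historical": {"history", "previous", "attempted"},
    "conditional": {"lifeline", "emergency number", "even if", "return if"},
    "unrelated": {"brother", "sister", "family history", "husband", "wife"},
}

def label_sentence(sentence, window=5):
    """Weak label for one sentence: 1 if a target term appears with no
    modifier cue within `window` tokens on either side, else 0."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok in TARGETS:
            context = tokens[max(0, i - window): i + window + 1]
            padded = " " + " ".join(context) + " "
            # whole-token / whole-phrase match against every modifier cue
            if any(" %s " % m in padded
                   for cat in MODIFIERS.values() for m in cat):
                return 0
            return 1
    return 0
```

Applied over all sentences of a note, the maximum of these labels would yield the document-level weak label used to train the downstream classifiers.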

Prediction Models

In this study, we used four machine learning models—three statistical, and one deep learning—to classify clinical notes for “current” suicidal ideation. The statistical classifiers included logistic classifiers, linear support vector machines, and naive Bayes. Our deep learning classifier used a convolutional neural network (CNN) architecture. All of the models were developed, trained, and tuned on the training set using Python 3.6 and AWS computational resources (Appendix B).

To select the final models to be evaluated on the manually-reviewed test set, we identified the two best performing statistical classifiers and CNN models based on document-level performance—accuracy and area under the receiver operating characteristic curve (AUC)—on the validation set. We evaluated test set performance at both the document and patient level. In addition to reporting overall accuracy and AUC, we reported precision, recall, and F1-scores of manually affirmed “current” suicidal ideation after tuning for optimal thresholds. Thresholds are reported in Appendix E. We then conducted McNemar’s test of homogeneity to determine whether our methods’ predictions were significantly different from the rule-based approach. Finally, we conducted an analysis to determine the degree to which our approach was able to identify “current” suicidal ideation among those who never received an ICD-9/10 encounter diagnosis for suicidal ideation.
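McNemar's test on paired predictions depends only on the two discordant counts, i.e. the cases where exactly one of the two classifiers is correct. A minimal version of the statistic (the continuity correction here is our assumption; the paper does not state the exact variant used):

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic with continuity correction.
    b, c: discordant counts (only classifier A correct / only B correct).
    Compare against the chi-square(1 df) critical value,
    e.g. 3.84 for alpha = 0.05."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

When the two classifiers disagree about equally often in both directions, the statistic stays small and the null hypothesis of homogeneity is retained, which is the pattern the study reports for the CNNs versus the rule-based approach.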

To test the use of these approaches for large-scale document screening, we randomly selected 5,000 clinical notes from the WCM ARCH database on a pre-established depression cohort (DEID), defined as any patient diagnosed with depression or prescribed an antidepressant (Sholle et al., 2017). After running our best performing weakly supervised algorithm and the rule-based approach, we manually reviewed the subset of notes classified as affirmative for “current” suicidal ideation and reported the percentage of notes that were truly affirmative based on manual review. To assess sensitivity, we randomly reviewed 50 notes classified as negative for “current” suicidal ideation.

Statistical Classifiers

NLP statistical classifiers require that clinical text be transformed into vectors using word representations. We experimented with the following word representations: bag-of-words (BOW), bi-gram, and term frequency-inverse document frequency (TF-IDF) (Cavnar and Trenkle, 2001; Zhang et al., 2011, 2010). Using BOW and bi-gram representations, each clinical note was represented as a vector of binary indicators for the occurrence of each word (BOW) or two-word phrase (bi-gram). TF-IDF added another layer by scaling the vector based on a numerical representation of word importance, a value generated by evaluating both term frequency and document frequency in the training corpus. Further technical details of these word representations are included in Appendix C.
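For intuition, the TF-IDF weighting can be sketched in a few lines. This is a toy version with smoothed IDF, shown only to illustrate the weighting idea; production pipelines would use a library vectorizer such as scikit-learn's `TfidfVectorizer`:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: term frequency scaled by smoothed inverse document
    frequency, so words common across the corpus are down-weighted."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    # document frequency: in how many docs each term appears
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors
```

A term appearing in every note (e.g. a section header word) receives a low IDF and contributes little, while rarer, more discriminative terms carry more weight.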

Logistic classifiers find the best-fitting coefficients to describe the relationship between the independent variables and an outcome of interest (Genkin et al., 2007). A key benefit of this binary classifier is interpretability, as the coefficients indicate which features, in our case single words or word pairs, drive the classification. Support vector machines (SVMs) are used for text classification as they remain effective in high-dimensional spaces, as is the case when documents are represented as vectors (Joachims, 1998). Finally, the underlying assumption of naive Bayes classifiers, independence among predictor variables, allows for efficient word and word-pair recognition (McCallum et al., 1998).
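As an illustration of the naive Bayes independence assumption at work, a from-scratch multinomial version with Laplace smoothing is sketched below; the study presumably used standard library implementations, so this is only for intuition:

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes text classifier with
    Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc.lower().split())
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        v = len(self.vocab)
        scores = {}
        for c in self.classes:
            # log prior plus sum of per-word log likelihoods
            # (words treated as conditionally independent given the class)
            score = self.priors[c]
            for t in doc.lower().split():
                score += math.log((self.counts[c][t] + 1) / (self.totals[c] + v))
            scores[c] = score
        return max(scores, key=scores.get)
```

Each word contributes an independent log-likelihood term, which is exactly the “independence among predictor variables” assumption described above.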

Deep Learning Classifiers

As with the statistical classifiers, we transformed the clinical text into a vector of words for input into our convolutional neural network (CNN). After generating word-to-index mappings for unique words in the training corpus, we represented each clinical note as a list of indices, preserving the order of the words. To ensure that each vector is of the same length, all of the notes were padded with a uniform index.
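The index mapping and padding step can be sketched as follows; the reserved padding index and the `<pad>` token are our illustrative choices, not details from the paper:

```python
def build_vocab(tokenized_docs):
    """Map each unique training word to an integer index;
    index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for toks in tokenized_docs:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    return vocab

def encode(toks, vocab, max_len):
    """Order-preserving index sequence, truncated or padded to max_len.
    Out-of-vocabulary words fall back to the padding index here."""
    ids = [vocab.get(t, 0) for t in toks][:max_len]
    return ids + [0] * (max_len - len(ids))
```

Every note thus becomes a fixed-length integer sequence that the embedding layer of the CNN can consume while preserving word order.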

CNNs are successful methods for text classification as they are effective at extracting local and position-invariant features. This method is well-aligned with our classification problem as the model focuses on learning the specific “features,” words or word pairs, that are associated with “current” suicidal ideation. The model architecture is a variant of the CNN architectures of Collobert et al. and Kim (Collobert et al., 2011; Kim, 2014). As shown in Figure 2, our CNN is a relatively simple network with one layer of convolution on top of word embeddings. For the purposes of our experiment, we trained two separate CNN models, one for each type of word embedding: random embeddings and static pre-trained Word2vec embeddings. Random embeddings were initialized with no pre-trained weights and updated throughout training. Word2vec was trained by Mikolov et al. on 100 billion words of Google News, and the word embeddings are publicly available (Mikolov et al., 2013). We henceforth refer to the CNN with random word embeddings and static Word2vec word embeddings as random CNN and Word2vec CNN, respectively.

Figure 2.


Model Architecture of CNN Text Classification Model (adapted from Kim, 2014)

RESULTS

Based on document-level validation results, the following four models were evaluated on the manually-reviewed test set: SVM classifier with TF-IDF word representation, logistic classifier with TF-IDF word representation, random CNN, and Word2vec CNN. Among the statistical classifiers, the logistic and SVM classifiers with TF-IDF word representations achieved the highest AUC on the validation set, 0.84. The random CNN and Word2vec CNN both achieved an AUC value of 0.92 on the validation set (Appendix D).

As shown in Table 3, the CNN models outperformed the statistical models and the rule-based approach used for weak labeling. While the random CNN had a higher overall AUC, the Word2vec CNN had slightly higher accuracy and better overall classification of documents with “current” suicidal ideation, evidenced by slightly higher precision (0.81 vs. 0.80), recall (0.83 vs. 0.82), and F1-score (0.82 vs. 0.81). Our McNemar’s test results showed that while our statistical methods (logistic regression and SVM) produced results statistically different from the rule-based approach, our deep learning methods (random CNN and Word2vec CNN) did not (Appendix F). The rule-based approach had the highest recall, supported by having the fewest false negatives (18), in comparison to TF-IDF logistic regression, which had the most (44). The Word2vec CNN approach had the highest precision, evidenced by having the fewest false positives (28), in comparison to TF-IDF SVM, which had the most (65). Confusion matrices for our document-level results are provided in Appendix G. ROC curves for each of the four probabilistic models are displayed in Figure 3.

Table 3.

Document-Level Test Results of Prediction Models

Model Accuracy Precision (SI) Recall (SI) F1-Score (SI) AUC
TF-IDF Logistic Regression 0.81 0.69 0.69 0.69 0.929
TF-IDF SVM 0.89 0.64 0.80 0.71 0.930
Random CNN 0.93 0.80 0.82 0.81 0.962
Word2vec CNN 0.94 0.81 0.83 0.82 0.946
Rule-Based Approach 0.92 0.73 0.87 0.79 N/A*
*

AUC is not available for rule-based deterministic approaches
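The precision, recall, and F1 values in Table 3 follow directly from confusion-matrix counts. For instance, the Word2vec CNN figures are consistent with 118 true positives, 28 false positives, and 25 false negatives; these counts are inferred from the text rather than reported directly as a triple:

```python
def precision_recall_f1(tp, fp, fn):
    """Positive-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of flagged notes, how many were right
    recall = tp / (tp + fn)             # of true SI notes, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

Plugging in the inferred counts reproduces the reported 0.81 / 0.83 / 0.82 row for the Word2vec CNN.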

Figure 3.


ROC Curves

Document level analysis

Using a shared annotation guideline, our annotators had an overall inter-rater agreement of 0.98. Of the 837 notes in our manually-reviewed test dataset, only 24 notes were associated with encounters coded for suicidal ideation. Upon manual review, we identified that 143 notes were indicative of suicidal ideation during the encounter, exemplifying the under-coding of this condition by healthcare providers. While our Word2vec CNN approach correctly detected 118 (83%) of these notes, helping to mitigate the under-capture of suicidal ideation in structured data, the rule-based approach had even higher recall, identifying 125 notes (87%). The 118 notes detected by the weakly supervised CNN were associated with 49 distinct encounters, 42 (86%) of which lacked an ICD encounter diagnosis for suicidal ideation. Of the 42 encounters without an ICD diagnosis code, 50% were hospital encounters, 43% office visits, and the remaining 7% social work visits. In comparison, of the 7 encounters with an ICD diagnosis code, 5 (71%) were hospital encounters and 2 (29%) were office visits.

Patient level analysis

Of the 30 patients in our test set, 10 patients had an encounter diagnosis for suicidal ideation in their EHR. Based on manual review, 19 patients had at least one document indicative of “current” suicidal ideation and thus a positive patient-level classification. All of the weakly-supervised classifiers outperformed the original rule-based approach on all performance metrics, achieving 100% (n=19) recall. The TF-IDF logistic classifier had the highest accuracy (0.93) and F1-score (0.95). McNemar’s test results at the patient level indicated that both statistical classifiers (logistic, SVM) produced statistically different prediction results from the rule-based approach (Appendix F).

Large corpus screening

According to the document-level results, the Word2vec CNN had slightly better performance metrics (accuracy, precision, recall, and F1-score) in comparison to all other weakly supervised methods. We ran the Word2vec CNN and the rule-based approach on 5,000 clinical notes from the WCM DEID database. The two algorithms had 99% agreement in classifying these notes. The Word2vec CNN approach identified 0.46% (n=23) of notes as positive for “current” suicidal ideation, and the rule-based approach identified 0.56% (n=28). After manual review, 87% (n=20) of the notes identified by the Word2vec CNN were correctly classified, and 68% (n=19) of the notes identified by the rule-based approach were correctly classified. We also reviewed 50 notes classified as negative for “current” suicidal ideation, and all were confirmed as negative.

DISCUSSION

The weakly supervised methods developed in this study improved our ability to identify encounters indicative of “current” suicidal ideation, defined as ideation at the time of the encounter. Of the 837 notes describing 611 distinct encounters in our manually-reviewed test set, only 9 encounters (24 notes) had an associated structured ICD diagnosis (V62.84 (ICD-9), R45.85 (ICD-10)) for suicidal ideation. Using the weakly supervised CNN model, we identified an additional 42 encounters (94 notes) indicative of “current” suicidal ideation within our test set. These document-level results indicate that 45% (n=9) of patients who never received an ICD code for suicidal ideation in their EHR history did in fact experience suicidal ideation, suggesting under-coding. To exemplify how these methods could be useful in large-scale document screening, we ran our Word2vec CNN model on 5,000 randomly selected clinical notes; it classified 0.46% (n=23) of notes as positive for “current” suicidal ideation and correctly classified 87% (n=20) of them. These methods offer numerous benefits, as improved identification of encounters at which patients experience suicidal ideation can improve our understanding of suicide pathways, serve as a predictor in suicide prediction algorithms using EHR data, and assist clinicians in effectively targeting suicide prevention interventions.

Performance metrics on document-level classification indicate that the Word2vec CNN model was the most effective. However, this method offered only a modest improvement in classification compared to the rule-based approach, evidenced by a document-level F1-score of 0.82 vs. 0.79 and a patient-level F1-score of 0.93 vs. 0.82. In our large corpus screening experiment, while the Word2vec CNN model had better precision (87% vs. 68%), the rule-based approach and Word2vec CNN had nearly perfect agreement (99%) in classifying “current” suicidal ideation. This is further emphasized by the results of McNemar’s test for homogeneity, in which we found that the Word2vec CNN model’s predictions were not statistically different from the rule-based approach at either the document level or the patient level. When assessing these results, it is important to weigh the positives and negatives of implementing state-of-the-art machine learning techniques, which are technically complex, time-intensive, and, in general, lack clinical interpretability. While rule-based approaches make decisions strictly based on what they have been programmed to do, machine learning classifiers have the ability to learn and identify word and phrase relationships from the training data. For example, after conducting error analysis, we determined that our best CNN model could identify more complex and nuanced mentions of suicidal ideation, such as “No active SI/HI though passive SI persists.” However, despite this advantage, machine learning approaches, deep learning in particular, are criticized for their challenges in interpretability (Martens et al., 2011; Vellido et al., 2012). When looking towards the eventual goal of real-world, point-of-care implementation of suicide risk models within health systems, a rule-based approach has the benefits of more interpretable results and error analysis compared to marginally better deep learning algorithms.

There are several limitations to this study. First, this is a single-site study. While our deep learning methods only had a slight edge over our rule-based approach, results may differ on notes from an external institution. Because the rule-based approach was developed based on a random subset of notes at WCM, this approach may be overfit to clinical documentation practices at our institution. The deep learning methods may offer a greater performance advantage at external sites, as they are inherently more flexible and generalizable. In addition, with external validation, we could assess model performance on a more diverse set of patients. In this study, of the patients with known race (n=461), our patient population was largely white (58%), which is a reflection of WCM’s patient catchment rather than a reflection of the greater New York City area population at risk for suicide. This is an area of future research. Second, our manually-reviewed test set had a relatively small cohort of patients. Because each patient in our study had an average of 33 notes, we were not able to bolster the test data set (n=837 notes) without also significantly increasing the number of documents to manually review. This note selection process allowed us to capture a comprehensive view of our cohort’s suicidal ideation history, identifying the specific encounters at which patients did experience suicidal ideation. A future area of research is understanding our cohort’s pattern of suicidal ideation over multiple encounters. For this reason, the number of patients (n=30) in our test set remained relatively small, allowing several of the weakly supervised methods to achieve 100% recall on patient-level classification. 
To the best of our knowledge, our study had the largest training set (n=13,426 notes; 456 patients) for detecting suicidal ideation from text, as prior studies have limited their note corpora to 500 notes or their patient cohorts to fewer than 210 patients (Fernandes et al., 2018; Haerian et al., 2012; Poulin et al., 2014). Third, the distinction of “current” suicidal ideation proved difficult, particularly in inpatient admissions where suicidal thoughts may have oscillated during a single encounter. Despite these challenges, both annotators adhered to a single set of guidelines for review and had high inter-rater agreement. Fourth, we used a single rule-based approach for our weak-labeling process. While this approach was iteratively developed on a separate subset of WCM notes, this study is limited in understanding the sensitivity of the weakly supervised methods’ performance to the rule-based approach’s lexicons. An area of future research is improving the rule-based approach to capture both explicit and implicit mentions of suicidal ideation. Finally, our note selection process required explicit mentions of suicidality, such as “suicidal” and “si.” While our large corpus screening experiment helped us understand our model’s positive predictive value on a general corpus of notes, it is possible there are false negatives, particularly notes that contain only more indirect mentions of suicidal ideation, such as “wish to be dead” or “desire to kill oneself.” This is an area of future research.

Our weakly supervised methods offer a substantial improvement in identifying encounters at which patients are experiencing suicidal ideation, a common precursor to suicide attempts and death. Implementation of these methods may improve suicide prediction and, given the current limitations of structured diagnosis coding of ideation (Anderson et al., 2015; Hammond et al., 2013), pave the way for research focused on the pathways from suicidal ideation to attempt. However, before deploying these weakly supervised approaches at an institution, especially at the point of care, it is important to weigh the results against the technical complexity of state-of-the-art methods such as CNNs, as rule-based approaches may achieve comparable results and offer clinical interpretability.

CONCLUSION

We successfully developed weakly supervised machine learning methods that detect "current" suicidal ideation from clinical text without the need for a large manually-annotated data set, which can be time-intensive and costly to create. Our methods achieved 94% accuracy in classifying clinical text, identifying an additional 42 encounters and 9 patients indicative of suicidal ideation but missing an ICD-9/10 encounter diagnosis. In a real-world experiment, we ran our weakly supervised algorithm on 5,000 randomly selected clinical notes; it classified 0.46% (n=23) of notes as affirmative for "current" suicidal ideation, of which 87% were truly indicative based on manual review. By identifying encounters indicative of suicidal ideation, researchers may be better equipped to understand the timing of suicidal ideation prior to suicide, incorporate data from unstructured clinical text into suicide prediction models, and subsequently implement such models at the point of care for clinical decision making and suicide prevention efforts.
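The overall pipeline (rule-derived weak labels used to train a statistical text classifier) can be sketched with one of the simpler models evaluated in this study, a TF-IDF logistic classifier, using scikit-learn. The notes and labels below are synthetic placeholders, not study data, and the model configuration is a minimal illustration rather than the study's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins: in the study, labels come from the rule-based NLP
# weak-labeling step, not from manual annotation.
notes = [
    "Patient endorses active suicidal ideation with plan.",
    "Patient denies suicidal ideation at this time.",
    "Follow-up for hypertension; no acute complaints.",
    "Reports passive suicidal thoughts over the past week.",
]
weak_labels = [1, 0, 0, 1]

# TF-IDF bag-of-words features (unigrams and bigrams) feeding a logistic
# classifier, trained directly on the weak labels.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, weak_labels)

pred = model.predict(["Patient endorses suicidal ideation."])[0]
```

A real application would train on thousands of weakly labeled notes and evaluate against a manually reviewed test set, as done in this study.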

Supplementary Material

1

Table 4.

Patient-Level Test Results of Prediction Models

Model                        Accuracy   Precision (SI)   Recall (SI)   F1-Score (SI)
TF-IDF Logistic Regression   0.93       0.90             1.00          0.95
TF-IDF SVM                   0.83       0.79             1.00          0.88
CNN (Random)                 0.90       0.86             1.00          0.93
CNN (word2vec)               0.90       0.86             1.00          0.93
Rule-Based Approach          0.73       0.72             0.95          0.82
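The patient-level results in Table 4 imply rolling note-level predictions up to the patient level. One common aggregation rule, assumed here for illustration rather than taken from the paper, flags a patient as positive if any of their notes is classified positive; the metrics then follow from the standard precision and recall definitions.

```python
from collections import defaultdict

def patient_level(note_preds):
    """Aggregate (patient_id, prediction) pairs: a patient is positive (1)
    if ANY of their notes is predicted positive."""
    agg = defaultdict(int)
    for pid, pred in note_preds:
        agg[pid] = max(agg[pid], pred)
    return dict(agg)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (SI) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under an any-positive rule, a single positive note suffices to flag a patient, which tends to raise patient-level recall relative to note-level recall, consistent with the 95-100% recall values in Table 4.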

Acknowledgement

This research was funded in part by NIH grants R01MH105384, R01MH119177 and R01MH121922.

Footnotes

Conflicts of Interest

The authors declare that they have no competing interests.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

REFERENCES

  1. Ahmedani BK, Simon GE, Stewart C, Beck A, Waitzfelder BE, Rossom R, Lynch F, Owen-Smith A, Hunkeler EM, Whiteside U, Operskalski BH, Coffey MJ, Solberg LI, 2014. Health care contacts in the year before suicide death. J. Gen. Intern. Med 29, 870–877.
  2. Anderson HD, Pace WD, Brandt E, Nielsen RD, Allen RR, Libby AM, West DR, Valuck RJ, 2015. Monitoring Suicidal Patients in Primary Care Using Electronic Health Records. Journal of the American Board of Family Medicine 28.
  3. Belsher BE, Smolenski DJ, Pruitt LD, Bush NE, Beech EH, Workman DE, Morgan RL, Evatt DP, Tucker J, Skopp NA, 2019. Prediction Models for Suicide Attempts and Deaths: A Systematic Review and Simulation. JAMA Psychiatry 76, 642–651.
  4. Borges G, Angst J, Nock MK, Ruscio AM, Walters EE, Kessler RC, 2006. A risk index for 12-month suicide attempts in the National Comorbidity Survey Replication (NCS-R). Psychol. Med 36, 1747–1757.
  5. Britton PC, Ilgen MA, Rudd MD, Conner KR, 2012. Warning signs for suicide within a week of healthcare contact in Veteran decedents. Psychiatry Res. 200, 395–399.
  6. Brown GK, Beck AT, Steer RA, Grisham JR, 2000. Risk factors for suicide in psychiatric outpatients: a 20-year prospective study. J. Consult. Clin. Psychol 68, 371–377.
  7. Brown LA, Boudreaux ED, Arias SA, Miller IW, May AM, Camargo CA Jr, Bryan CJ, Armey MF, 2020. C-SSRS performance in emergency department patients at high risk for suicide. Suicide Life Threat. Behav 50, 1097–1104.
  8. Cavnar WB, Trenkle JM, 2001. N-Gram-Based Text Categorization.
  9. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG, 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform 34, 301–310.
  10. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res.
  11. Demner-Fushman D, Chapman WW, McDonald CJ, 2009. What can natural language processing do for clinical decision support? J. Biomed. Inform 42, 760–772.
  12. Fernandes AC, Dutta R, Velupillai S, Sanyal J, Stewart R, Chandran D, 2018. Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing. Sci. Rep 8, 7426.
  13. Genkin A, Lewis DD, Madigan D, 2007. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics 49, 291–304.
  14. Goldstein RB, Black DW, Nasrallah A, Winokur G, 1991. The prediction of suicide. Sensitivity, specificity, and predictive value of a multivariate model applied to suicide among 1906 patients with affective disorders. Arch. Gen. Psychiatry 48, 418–422.
  15. Haerian K, Salmasian H, Friedman C, 2012. Methods for identifying suicide or suicidal ideation in EHRs. AMIA Annu. Symp. Proc 2012, 1244–1253.
  16. Hammond KW, Laundry RJ, O'Leary TM, Jones WP, 2013. Use of Text Search to Effectively Identify Lifetime Prevalence of Suicide Attempts among Veterans, in: System Sciences (HICSS), 2013 46th Hawaii International Conference on. pp. 2676–2683.
  17. Heron M, 2019. Deaths: Leading Causes for 2017. National Vital Statistics Reports 68.
  18. Joachims T, 1998. Text categorization with Support Vector Machines: Learning with many relevant features, in: Machine Learning: ECML-98. Springer Berlin Heidelberg, pp. 137–142.
  19. Kang P, 2009. negex. GitHub.
  20. Kessler RC, Bossarte RM, Luedtke A, Zaslavsky AM, Zubizarreta JR, 2020. Suicide prediction models: a critical review of recent research with recommendations for the way forward. Mol. Psychiatry 25, 168–179.
  21. Kim Y, 2014. Convolutional Neural Networks for Sentence Classification. arXiv [cs.CL].
  22. Kuo W-H, Gallo JJ, Tien AY, n.d. Incidence of suicide ideation and attempts in adults: the 13-year follow-up of a community sample in Baltimore, Maryland. 10.1017/S0033291701004482
  23. Louzon SA, Bossarte R, McCarthy JF, Katz IR, 2016. Does Suicidal Ideation as Measured by the PHQ-9 Predict Suicide Among VA Patients? Psychiatr. Serv 67, 517–522.
  24. Martens D, Vanthienen J, Verbeke W, Baesens B, 2011. Performance of classification models from a user perspective. Decis. Support Syst 51, 782–793.
  25. McCallum A, Nigam K, 1998. A comparison of event models for naive bayes text classification, in: AAAI-98 Workshop on Learning for Text Categorization. Citeseer, pp. 41–48.
  26. Medlock B, Briscoe T, 2007. Weakly supervised learning for hedge classification in scientific literature, in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. pp. 992–999.
  27. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, 2013. Distributed Representations of Words and Phrases and their Compositionality, in: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (Eds.), Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp. 3111–3119.
  28. Murphy SL, Xu J, Kochanek KD, Arias E, 2018. Mortality in the United States, 2017. NCHS Data Brief 1–8.
  29. Pirkis J, Burgess P, Dunt D, 2000. Suicidal ideation and suicide attempts among Australian adults. Crisis: The Journal of Crisis Intervention and Suicide Prevention 21, 16–25.
  30. Posner K, Brown GK, Stanley B, Brent DA, Yershova KV, Oquendo MA, Currier GW, Melvin GA, Greenhill L, Shen S, Mann JJ, 2011. The Columbia-Suicide Severity Rating Scale: initial validity and internal consistency findings from three multisite studies with adolescents and adults. Am. J. Psychiatry 168, 1266–1277.
  31. Poulin C, Shiner B, Thompson P, Vepstas L, Young-Xu Y, Goertzel B, Watts B, Flashman L, McAllister T, 2014. Predicting the risk of suicide by analyzing the text of clinical notes. PLoS One 9, e85733.
  32. Prest A, Schmid C, Ferrari V, 2012. Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell 34, 601–614.
  33. Sholle ET, Kabariti J, Johnson SB, Leonard JP, Pathak J, Varughese VI, Cole CL, Campion TR Jr, 2017. Secondary Use of Patients' Electronic Records (SUPER): An Approach for Meeting Specific Data Needs of Clinical and Translational Researchers. AMIA Annu. Symp. Proc 2017, 1581–1588.
  34. Simon GE, Johnson E, Lawrence JM, Rossom RC, Ahmedani B, Lynch FL, Beck A, Waitzfelder B, Ziebell R, Penfold RB, Shortreed SM, 2018. Predicting Suicide Attempts and Suicide Deaths Following Outpatient Visits Using Electronic Health Records. Am. J. Psychiatry 175, 951–960.
  35. Simon GE, Rutter CM, Peterson D, Oliver M, Whiteside U, Operskalski B, Ludman EJ, 2013. Does response on the PHQ-9 Depression Questionnaire predict subsequent suicide attempt or suicide death? Psychiatr. Serv 64, 1195–1202.
  36. Simon GE, Shortreed SM, Johnson E, Rossom RC, Lynch FL, Ziebell R, Penfold ARB, 2019. What health records data are required for accurate prediction of suicidal behavior? J. Am. Med. Inform. Assoc 26, 1458–1465.
  37. ten Have M, de Graaf R, van Dorsselaer S, Verdurmen J, van 't Land H, Vollebergh W, Beekman A, 2009. Incidence and course of suicidal ideation and suicide attempts in the general population. Can. J. Psychiatry 54, 824–833.
  38. Vellido A, Martín-Guerrero JD, Lisboa PJG, 2012. Making machine learning models interpretable. ESANN.
  39. Wang Y, Sohn S, Liu S, Shen F, Wang L, Atkinson EJ, Amin S, Liu H, 2019. A clinical text classification paradigm using weak supervision and deep representation. BMC Med. Inform. Decis. Mak 19, 1.
  40. Weill Cornell Medicine, n.d. Weill Cornell Medicine Facts and Figures [WWW Document]. URL https://news.weill.cornell.edu/sites/default/files/publications/fact_sheet_2019_final.pdf
  41. Zhang W, Yoshida T, Tang X, 2011. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl 38, 2758–2765.
  42. Zhang Y, Jin R, Zhou Z-H, 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1, 43–52.
  43. Zhou Z-H, 2018. A brief introduction to weakly supervised learning. Natl Sci Rev 5, 44–53.
