Applied Clinical Informatics. 2024 Dec 18;15(5):1107–1120. doi: 10.1055/a-2411-5796

Enhancing Suicide Attempt Risk Prediction Models with Temporal Clinical Note Features

Kevin J Krause 1, Sharon E Davis 1, Zhijun Yin 1, Katherine M Schafer 1, Samuel Trent Rosenbloom 1, Colin G Walsh 1
PMCID: PMC11655152  PMID: 39251213

Abstract

Objectives  The objective of this study was to investigate the impact of enhancing a structured-data-based suicide attempt risk prediction model with temporal Concept Unique Identifiers (CUIs) derived from clinical notes. We aimed to examine how different temporal schemes, model types, and prediction ranges influenced the model's predictive performance. This research sought to improve our understanding of how the integration of temporal information and clinical variable transformation could enhance model predictions.

Methods  We identified modeling targets using diagnostic codes for suicide attempts within 30, 90, or 365 days following a temporally grouped visit cluster. Structured data included medications, diagnoses, procedures, and demographics, whereas unstructured data consisted of terms extracted with regular expressions from clinical notes. We compared models trained only on structured data (controls) to hybrid models trained on both structured and unstructured data. We used two temporalization schemes for clinical notes: fixed 90-day windows and flexible epochs. We trained and assessed random forests and hybrid long short-term memory (LSTM) neural networks using area under the precision recall curve (AUPRC) and area under the receiver operating characteristic, with additional evaluation of sensitivity and positive predictive value at 95% specificity.

Results  The training set included 2,364,183 visit clusters with 2,009 30-day suicide attempts, and the testing set contained 471,936 visit clusters with 480 suicide attempts. Models trained with temporal CUIs outperformed those trained with only structured data. The window-temporalized LSTM model achieved the highest AUPRC (0.056 ± 0.016) for the 30-day prediction range. Hybrid models generally showed better performance compared with controls across most metrics.

Conclusion  This study demonstrated that incorporating electronic health record-derived clinical note features enhanced suicide attempt risk prediction models, particularly with window-temporalized LSTM models. Our results underscored the critical value of unstructured data in suicidality prediction, aligning with previous findings. Future research should focus on integrating more sophisticated methods to continue improving prediction accuracy, which will enhance the effectiveness of future intervention.

Keywords: suicide, machine learning, natural language processing, neural networks, random forest

Background and Significance

The Centers for Disease Control and Prevention reported that in 2021 approximately 48,000 people in the United States died of suicide. 1 Proven interventions such as psychiatric medication and removing access to firearms might help individuals at risk for suicide. 2 Barriers to identifying at-risk individuals and delivering timely prevention efforts include limited access to mental health services, social stigma surrounding mental health, insufficient training for health care providers in recognizing suicide risk, and the fragmented nature of health records. 3 Informatics efforts have been underway to address the growing need for improved screening and treatment of at-risk individuals. 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Much of this work has explored the interplay of structured and unstructured electronic health record (EHR) data for clinical predictive machine learning tasks. 9 13 19 20 21 22 23 24 25

In mental health assessments, information that isn't neatly organized in fields (like free-text notes from clinicians) plays a crucial role, especially when assessing suicide risk. 9 11 18 26 Broadly, researchers have used this so-called “unstructured data” found in Reddit comments to detect suicidal ideation. 27 Likewise, more centrally to health care, the Veterans Health Administration demonstrated that sentiment analysis fueled by unstructured data in EHR clinic notes enhances suicidality prediction accuracy. 9 In previous work, we leveraged Concept Unique Identifier (CUI) counts from clinical notes to generate suicidality risk factor networks. 28 Techniques that analyze the natural language used in clinical notes (called NLP or Natural Language Processing) have been shown to improve the detection of suicidal thoughts in pregnant women, predict suicide risk after hospital discharge, and even perform better than mental health professionals in identifying suicide risk from notes. 11 12 18 The first of these is a clinically significant milestone because the perinatal period is a time of elevated suicide risk for this vulnerable population. 18 The lack of structured data on suicidality has also been noted as a serious problem that limits the utility of EHR data in health care. Indeed, Boggs et al reported substantial gaps in follow-up assessments related to suicidal ideation due to the lack of structured EHR data on suicidality. 26 Together, these advancements underscore the potential of unstructured data in enhancing predictive models for suicide risk.

Researchers often avoid including the timing of events (temporality) in their models to keep the analysis simpler and to reduce the risk of the model becoming too tailored to the specific data (a problem known as overfitting). 29 However, recent research advocates for incorporating temporal elements in clinical models to enhance performance. 30 31 32 33 34 Indeed, the ideation-to-action framework suggests that only after interacting with acquired capability to inflict painful harm upon oneself (via lowered fear of death and increased pain tolerance) does suicidal ideation progress toward suicide attempts. 35 36 37 38 39 The order (i.e., relative temporal precedence) of these constructs is integral to the ideation-to-action framework. Integrating this theory into prediction models supports the temporalization of input data to reflect critical indicators along the ideation-to-action continuum.

Objective

The objective of this study was to investigate the impact of extending a validated structured-data-based suicide attempt risk prediction model with CUIs derived from clinical notes and to examine how temporal schemes, model types, and different prediction ranges affected predictive performance. Building upon previous research in suicide risk prediction, this research aimed to enhance our understanding of how clinical variable selection, transformation, and periodization could influence model outcomes. 9 11 12 13 18 19 40 41 Specifically, we sought to leverage detailed clinical note data, enriched with temporal information, to develop a model that improves prediction of rare outcomes like suicide-related behaviors. 42 We compared the model with other baselines on area under the precision recall curve (AUPRC), sensitivity, positive predictive value (PPV), and risk stratification capability beyond prior models lacking these enriched features.

Methods

Study Setting

Vanderbilt University Medical Center (VUMC) operates a large regional health care network with over 1,000 beds across multiple facilities and manages over 1.5 million outpatient visits and 40,000 inpatient admissions annually. VUMC also includes a dedicated psychiatric facility to provide comprehensive mental health care services. The patient population includes urban and rural communities with diverse demographic and health care experiences, which is vital for research initiatives like suicide prevention. The Vanderbilt Suicide Attempt and Ideation Likelihood (VSAIL) model generates patient encounter suicide risk scores, using structured EHR data. 4 5 It has been validated retrospectively, prospectively, and in the context of universal screening in high-acuity clinical settings. 4 43 44 A decision support module driven by VSAIL has been evaluated via randomized controlled trial and shown to increase face-to-face suicide risk assessment. 45

Cohort, Clusters, and Outcomes

We analyzed adult outpatient and emergency department encounters at VUMC between January 1, 2010, and December 31, 2022. We grouped sequences of visits with gaps of 3 or fewer days into discrete clusters. Thus, patients could appear multiple times in the dataset if they visited VUMC care centers on separate occasions more than 3 days apart. We determined modeling targets (outcomes) by the presence or absence of at least one ICD diagnostic code for suicide attempt within 30, 90, or 365 days (prediction ranges) following a visit cluster. Supplementary Material S1 (available in the online version) details the 1,526 ICD codes used to ascertain suicide attempts; these codes include only diagnoses that explicitly indicate suicidal intent or intent to inflict lethal self-harm/injury. In 2015, health systems converted billing diagnostic codes from the 9th revision of the International Classification of Diseases (ICD-9) to the 10th revision (ICD-10). Our previous work has shown ICD-10 diagnostic codes to have higher PPV (0.85) 6 for suicide attempt ascertainment than ICD-9 codes (PPV 0.58). 5 Early experimentation supported the inclusion of ICD-9 era data in the training set, while ensuring that only ICD-10 data were used for evaluation. We excluded visits occurring within 3 days of a suicide attempt to avoid inadvertently including predictions for visits initiated by a suicide event, i.e., visits in which prediction would not be necessary.
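The clustering and outcome rules described above can be illustrated with a minimal Python sketch. The function names, the choice to measure the prediction range from the last visit in a cluster, and the toy dates are assumptions made for illustration; the exclusion of visits close to an attempt is not shown.

```python
from datetime import date


def cluster_visits(visit_dates, max_gap_days=3):
    """Group one patient's visit dates into clusters.

    A visit joins the current cluster when it falls within `max_gap_days`
    of the previous visit; otherwise it starts a new cluster.
    """
    clusters = []
    for d in sorted(visit_dates):
        if clusters and (d - clusters[-1][-1]).days <= max_gap_days:
            clusters[-1].append(d)
        else:
            clusters.append([d])
    return clusters


def has_outcome(cluster, attempt_dates, horizon_days):
    """True if any coded attempt falls within `horizon_days` after the cluster.

    Assumption: the horizon is counted from the cluster's last visit.
    """
    end = cluster[-1]
    return any(0 < (a - end).days <= horizon_days for a in attempt_dates)


# Toy example (hypothetical dates, not study data).
visits = [date(2020, 1, 1), date(2020, 1, 3), date(2020, 1, 6),
          date(2020, 1, 10), date(2020, 3, 1)]
attempts = [date(2020, 2, 5)]
clusters = cluster_visits(visits)
labels_30d = [has_outcome(c, attempts, 30) for c in clusters]
```

In this toy example the first three visits form one cluster because each gap is 3 days or fewer, while the 4-day gap before January 10 starts a new cluster.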

Features and Measurements

We collected structured and unstructured data from a 5-year window preceding each episode. Structured data, based on the original VSAIL model, included medications (n = 693), diagnoses (n = 83), procedures (n = 46), and demographics (n = 3). We imputed missing values with constant zeros and missing labels. Features were scaled to a maximum absolute value of 1 without shifting or centering. Structured features were not temporally aggregated, as preliminary analysis showed low temporal feature variance, a finding also supported by prior work by Shortreed et al. 46

From unstructured clinical notes, we extracted note-level count aggregations of medical concepts using 13,997 CUIs from the Unified Medical Language System. 47 CUI extraction was performed with the VUMC Wordcloud Indexer, a negation-aware regular-expressions-based NLP tool. 48 We processed CUIs using two temporalization schemes: 20 fixed, 90-day windows (window scheme) and 20 flexible epochs grouping note activity with gaps of 30 or fewer days (epoch scheme). CUI counts were normalized using term frequency-inverse document frequency (TF-IDF) 49 transformation to emphasize terms that occur less frequently and reduced with latent semantic analysis (LSA) 50 into 100 components to reduce the high-dimensional data into fewer components of greater information density. The LSA transformer was first fitted on the 5-year aggregate of CUI counts before transforming individual windows and epochs, allowing for consistent comparison across temporal schemes.
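The "fit on the aggregate, then transform each period" step can be sketched as follows. This is an illustration only: it assumes scikit-learn's `TfidfTransformer` and `TruncatedSVD` as the TF-IDF and LSA implementations (the text does not name the classes used), and the matrix sizes and Poisson-distributed counts are toy values, far smaller than the 13,997 CUIs and 100 components reported.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_patients, n_cuis, n_periods, n_components = 50, 200, 4, 10  # toy sizes

# Hypothetical CUI counts: one (patients x CUIs) matrix per period.
period_counts = [
    rng.poisson(0.3, size=(n_patients, n_cuis)).astype(float)
    for _ in range(n_periods)
]
# Aggregate counts over the whole lookback by summing the periods.
aggregate_counts = np.sum(period_counts, axis=0)

# TF-IDF followed by truncated SVD is the usual way to compute LSA.
lsa = make_pipeline(
    TfidfTransformer(),
    TruncatedSVD(n_components=n_components, random_state=0),
)
lsa.fit(aggregate_counts)  # fitted once, on the aggregate only

# Apply the same fitted transform to every period, then stack into a
# (patients, periods, components) array suitable for a sequence model.
sequence = np.stack([lsa.transform(p) for p in period_counts], axis=1)
```

Fitting once and reusing the transform keeps the component space identical across periods, so a given component means the same thing in every window or epoch.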

Experimental Overview

We trained random forests and hybrid long short-term memory (LSTM) neural networks to test the impact of input temporality on suicide attempt risk prediction. Models trained on nontemporal structured features were compared with models trained on nontemporal structured features plus temporally aggregated CUIs (Supplementary Material S2, available in the online version). We performed experiments with three modeling groups: random forest structured only (control), random forest structured plus temporal CUIs, and LSTM structured plus temporal CUIs. We used two CUI temporalization schemes (windows and epochs) and three prediction ranges (30, 90, and 365 days). Thus, our experimental variations included random-forest-control (VSAIL), random-forest-epoch, random-forest-window, LSTM-epoch, and LSTM-window models across each prediction range.

We divided encounters using a mixed temporal split design. The training set consisted of the earliest 80% of visit clusters (January 1, 2010–September 2, 2021), whereas the latest records were reserved for testing, ensuring that only ICD-10 era outcomes were included in the test set for evaluation with more recent data. We randomly assigned the remaining 20% of visit clusters to either development or testing in a 1:4 ratio, resulting in an 80/4/16 training/development/testing split (Fig. 1).
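A minimal sketch of this mixed temporal split is given below, using only the Python standard library. The function name, the integer "dates," and the seed are illustrative assumptions; in practice the sort key would be each visit cluster's date.

```python
import random


def mixed_temporal_split(cluster_dates, train_frac=0.8, dev_share=0.2, seed=0):
    """Return index lists (train, dev, test) for visit clusters.

    The earliest `train_frac` of clusters (by date) form the training set.
    The remaining clusters are shuffled and divided so that `dev_share`
    of them go to development and the rest to testing (0.2 gives 1:4).
    """
    order = sorted(range(len(cluster_dates)), key=lambda i: cluster_dates[i])
    n_train = round(len(order) * train_frac)
    train, held_out = order[:n_train], order[n_train:]
    random.Random(seed).shuffle(held_out)
    n_dev = round(len(held_out) * dev_share)
    return train, held_out[:n_dev], held_out[n_dev:]


# Toy example: 100 clusters with shuffled ordinal "dates".
toy_dates = list(range(100))
random.Random(1).shuffle(toy_dates)
train_idx, dev_idx, test_idx = mixed_temporal_split(toy_dates)
```

With 100 toy clusters this yields 80 training, 4 development, and 16 testing clusters, and every training cluster precedes every held-out cluster in time.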

Fig. 1. (A) depicts the training, development, and testing set split methodology. The training set is composed of the earliest 80% of visit clusters. The latest 20% is randomly split into development and testing sets, in a 1:4 ratio, yielding an 80/4/16 train/dev/test split. Diagnostic codes used to ascertain suicide attempts switch from ICD-9 to ICD-10 in October 2015. (B) depicts the window (W) and epoch (E) input data temporalization schemes. Windows split input data into fixed, 90-day periods. Epochs split input data into flexible periods defined by gaps of 30 days or greater in EHR activity. Both schemes use 20 total periods. EHR, electronic health record; ICD-9/-10, International Classification of Diseases, 9th revision/10th revision.
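The two temporalization schemes in panel B can be sketched in a few lines of Python. This is a simplified illustration: note timing is represented as days before the index visit cluster, period 0 is taken to be the most recent, and the function names and the handling of a gap of exactly 30 days are assumptions rather than details from the study.

```python
def window_index(days_before_index, width=90, n_periods=20):
    """Fixed windows: period 0 covers the most recent `width` days.

    Returns None when the note falls outside the `n_periods` windows.
    """
    idx = days_before_index // width
    return idx if 0 <= idx < n_periods else None


def epoch_indices(days_before_index, max_gap=30, n_periods=20):
    """Flexible epochs: notes at most `max_gap` days apart share an epoch.

    Returns a dict mapping each note offset to its epoch number
    (0 = most recent). Notes beyond `n_periods` epochs are dropped.
    """
    out, epoch, prev = {}, 0, None
    for d in sorted(set(days_before_index)):
        if prev is not None and d - prev > max_gap:
            epoch += 1  # a long silence in the record starts a new epoch
        if epoch >= n_periods:
            break
        out[d] = epoch
        prev = d
    return out


# Toy example: notes written 2, 10, 35, 100, and 400 days before the index.
note_offsets = [2, 10, 35, 100, 400]
windows = [window_index(d) for d in note_offsets]
epochs = epoch_indices(note_offsets)
```

The example shows the practical difference: the note at 400 days lands in window 4 under the fixed scheme but in epoch 2 under the flexible scheme, because epochs count bursts of activity rather than calendar time.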

Model Implementation

We created random forest models using scikit-learn 1.2.0 for preprocessing and classification, with a custom pipeline for efficient hyperparameter selection. 51 These models were developed and evaluated in Python 3.8.0, managed with the built-in virtual environment venv. We created hybrid LSTM models using PyTorch 2.2.2 (CPU) with a custom neural network module to handle a mixture of temporal (CUI counts) and nontemporal (VSAIL) features. 52 The model has two input layers: an LSTM input layer for the temporal CUI count features and a dense linear input layer for the nontemporal VSAIL features, which are combined to feed into a single output prediction layer. Fig. 2 depicts the hybrid LSTM model architecture in detail. Supplementary Material S3 (available in the online version) provides additional modeling details.
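The two-branch design described above might look roughly like the following PyTorch module. This is a sketch under stated assumptions, not the authors' implementation: the class name `HybridLSTM`, the layer sizes, the ReLU activations, and the use of the LSTM's final hidden state are all illustrative choices. The input sizes follow the text (100 LSA components per period; 693 + 83 + 46 + 3 = 825 structured features).

```python
import torch
from torch import nn


class HybridLSTM(nn.Module):
    """LSTM branch for temporal features plus a dense branch for static ones."""

    def __init__(self, n_temporal, n_static, lstm_size=16, dense_size=16,
                 hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(n_temporal, lstm_size, batch_first=True)
        self.dense = nn.Linear(n_static, dense_size)
        self.hidden = nn.Linear(lstm_size + dense_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, temporal, static):
        # temporal: (batch, periods, n_temporal); static: (batch, n_static)
        _, (h_n, _) = self.lstm(temporal)
        # h_n[-1] is the last layer's final hidden state, shape (batch, lstm_size)
        merged = torch.cat([h_n[-1], torch.relu(self.dense(static))], dim=1)
        # Return one logit per example; apply a sigmoid to obtain a risk score.
        return self.out(torch.relu(self.hidden(merged))).squeeze(-1)


torch.manual_seed(0)
model = HybridLSTM(n_temporal=100, n_static=825)
# Random toy batch: 4 examples, 20 periods of 100 components, 825 static features.
logits = model(torch.randn(4, 20, 100), torch.randn(4, 825))
risk = torch.sigmoid(logits)
```

Returning logits rather than probabilities lets the module pair with `nn.BCEWithLogitsLoss`, which is numerically more stable than applying a sigmoid before a binary cross-entropy loss.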

Fig. 2. This diagram depicts the hybrid LSTM model architecture used to process a mixture of temporal and nontemporal features. Temporal features feed into an LSTM layer, and nontemporal features feed into a dense layer. A fully connected hidden layer connects both the LSTM layer and dense layer to an output layer with a single node. The diagram shows the LSTM, dense, and hidden layers with size two each, but we explored multiple configurations of layer sizes. The diagram also provides a simplified depiction of the inner workings of an LSTM cell. LSTM, Long Short-Term Memory.

Model Training

We used 10-fold stratified, grouped cross-validation for hyperparameter selection. Outcome stratification ensured each fold had nearly equal numbers of suicide attempts, while grouping prevented visit clusters from the same individual from being placed in different folds, reducing overfitting risk. We selected hyperparameters via a two-stage grid search. Initial ranges were geometrically distributed by a ratio of 2 (e.g., 4, 8, 16, 32). A second search used a linear distribution of values centered around the best-performing initial hyperparameters. We trained hybrid LSTM models with early stopping, continuing until cross-validation performance declined. We determined the best hyperparameters using mean cross-validation AUPRC. The best model was calibrated to the prevalence of suicide attempts in the development set using Platt's method. 53
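The two-stage grid can be illustrated with a short sketch. The geometric stage reproduces the example values in the text; for the second stage, the step size (one quarter of the best value) and the number of candidates on each side are assumptions, since the text does not specify them.

```python
def geometric_grid(start, ratio, n):
    """First-stage candidates: start, start*ratio, start*ratio**2, ..."""
    return [start * ratio**i for i in range(n)]


def linear_refinement(best, n_each_side=2, step=None):
    """Second-stage candidates: evenly spaced values centered on `best`.

    Assumption: the default step is one quarter of `best` (at least 1).
    """
    if step is None:
        step = max(1, best // 4)
    values = [best + k * step for k in range(-n_each_side, n_each_side + 1)]
    return [v for v in values if v > 0]


coarse = geometric_grid(4, 2, 4)   # candidates for the first search
fine = linear_refinement(16)       # if 16 scored best in the first search
```

A coarse geometric pass covers several orders of magnitude cheaply, and the linear pass then spends its evaluations only in the neighborhood that looked most promising.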

Evaluation

We assessed general performance on the final test set using AUPRC and area under the receiver operating characteristic (AUROC). We measured sensitivity and PPV at 95% specificity to characterize each model's potential cost-effectiveness as a screening tool, as described by Ross et al. 54 We used bootstrapping with 1,000 iterations to generate 95% confidence intervals for all metrics and applied one-sided Wilcoxon rank-sum tests to compare scores. We evaluated risk stratification by counting true positives within probability deciles, providing a visit-centered view of model improvements in identifying future suicide attempts. We measured model calibration (Supplementary Material S4, available in the online version) on the development set before and after adjustment using Spiegelhalter's z-score. 55
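Sensitivity and PPV at a fixed specificity can be computed as in the following sketch. The thresholding rule (flag a case when its score exceeds the score of the negative at the target quantile) is one reasonable convention and is an assumption here; the function name and toy scores are likewise illustrative.

```python
import math


def sens_ppv_at_specificity(y_true, y_score, specificity=0.95):
    """Sensitivity and PPV at a threshold achieving at least `specificity`.

    Cases are flagged when score > threshold, where the threshold is the
    score of the negative at the target quantile, so at most
    (1 - specificity) of negatives are flagged.
    """
    neg = sorted(s for y, s in zip(y_true, y_score) if y == 0)
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    k = math.ceil(specificity * len(neg)) - 1
    threshold = neg[k]
    tp = sum(s > threshold for s in pos)
    fp = sum(s > threshold for s in neg)
    sensitivity = tp / len(pos)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, ppv


# Toy example: 100 negatives with evenly spread scores and 4 positives.
y_true = [0] * 100 + [1] * 4
y_score = [i / 100 for i in range(100)] + [0.99, 0.97, 0.50, 0.20]
sens, ppv = sens_ppv_at_specificity(y_true, y_score)
```

In the toy data, 5 of 100 negatives are flagged (95% specificity) alongside 2 of 4 positives, which gives a sensitivity of 0.5 and a PPV of 2/7. The example also shows why PPV stays low for rare outcomes even when specificity is high: false positives are drawn from a much larger pool.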

To quantify clinical note feature importances, we used random-forest-derived mean decrease in impurity (MDI). Given the transformation of temporal CUI counts with LSA, direct comparison of each temporalized CUI feature's importance was challenging. Therefore, we created an additional random forest model trained entirely on nontemporal CUI counts (13,997 features) that were only TF-IDF transformed. We averaged the MDI from 100 bootstrapped variations of this model to calculate approximate feature importances and 95% confidence intervals for our CUIs, repeated for each prediction range. 56

Results

The training set contained 2,364,183 visit clusters and 2,009 30-day attempts. The development set contained 118,534 visit clusters and 117 30-day attempts. The testing set contained 471,936 visit clusters and 480 30-day attempts. The overall prevalence of 30-day attempts was approximately 1 in 1,000 (0.1%). Table 1 describes the demographics of the entire study cohort. Fig. 3 shows the distribution of health care utilization by demographics. The distribution of visit cluster counts per patient was similar among individuals without recorded suicide attempts across racial, ethnic, and gender groups, with Hispanic, Middle Eastern/North African, and Black individuals trending slightly higher than the rest. The distributions were higher among those with recorded suicide attempts, across all demographic groups. The highest distributions were among Hispanic, Middle Eastern/North African, American Indian/Native Alaskan, and Black individuals with recorded suicide attempts.

Table 1. Study cohort demographics (N = 1,281,001).

| Demographics | Attempts: any | Attempts: 30 d | Attempts: 90 d | Attempts: 365 d | Total | Util. |
|---|---|---|---|---|---|---|
| Race | | | | | | |
| White | 9,283 (1.01%) | 1,528 (0.17%) | 2,200 (0.24%) | 3,241 (0.35%) | 920,080 | 2.71 |
| Black | 2,087 (1.11%) | 332 (0.18%) | 500 (0.27%) | 766 (0.41%) | 188,130 | 3.39 |
| Other | 456 (0.31%) | 131 (0.09%) | 167 (0.11%) | 198 (0.13%) | 148,046 | 2.76 |
| Asian | 130 (0.53%) | 26 (0.11%) | 39 (0.16%) | 62 (0.25%) | 24,324 | 2.70 |
| Hispanic | 130 (0.69%) | 27 (0.14%) | 44 (0.23%) | 64 (0.34%) | 18,961 | 3.70 |
| Pacific Islander | 13 (0.28%) | 3 (0.06%) | 5 (0.11%) | 7 (0.15%) | 4,664 | 2.41 |
| American Indian/Alaskan | 50 (1.29%) | 8 (0.21%) | 16 (0.41%) | 21 (0.54%) | 3,868 | 3.28 |
| Middle Eastern/North African | 15 (0.57%) | 4 (0.15%) | 7 (0.27%) | 11 (0.42%) | 2,628 | 3.98 |
| Ethnicity | | | | | | |
| Non-Hispanic | 8,699 (1.12%) | 1,360 (0.17%) | 1,950 (0.25%) | 2,908 (0.37%) | 779,243 | 2.64 |
| Hispanic/Latino | 622 (0.64%) | 132 (0.14%) | 189 (0.20%) | 272 (0.28%) | 96,885 | 2.73 |
| Unknown | 306 (0.34%) | 52 (0.06%) | 68 (0.08%) | 84 (0.09%) | 89,488 | 1.32 |
| Legal gender | | | | | | |
| Female | 6,862 (1.03%) | 1,092 (0.16%) | 1,618 (0.24%) | 2,448 (0.37%) | 665,108 | 2.73 |
| Male | 5,036 (0.82%) | 903 (0.15%) | 1,265 (0.21%) | 1,788 (0.29%) | 612,749 | 2.64 |
| Unknown/other | 1 (0.04%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 2,750 | 1 |

Notes: This table summarizes the overall cohort demographics of the study, including suicide attempt rates within each demographic. Attempt rates are divided into four groups, indicating either (1) any suicide attempt or (2–4) only suicide attempts within a fixed number of days after a visit. The total column indicates the total number of patients within each demographic. The utilization column (Util.) indicates the mean number of health care visit clusters per patient within each demographic.

Fig. 3. This boxplot compares the distribution of health care visit clusters per patient, divided by demographic group.

The distributions of the bootstrapped AUPRC, AUROC, sensitivity at 95% specificity, and PPV at 95% specificity evaluation metrics are shown in Figs. 4, 5, 6, and 7, respectively. Table 2 provides the exact means and 95% confidence intervals for each evaluation metric. As the prediction range increased (i.e., from 30 to 90 to 365 days), AUPRC and PPV increased, whereas AUROC and sensitivity generally decreased. Window temporalization schemes outperformed epochs across all four metrics, except in the case of LSTM with a 90-day prediction range, where the epoch scheme resulted in higher AUROC. LSTMs performed better than random forests only in terms of AUPRC. Hybrid models (epochs and windows) outperformed controls in every metric except for PPV. In terms of our primary ranking metric (AUPRC) and primary use case (30-day prediction range), the highest performing model was the window-temporalized LSTM model (0.056 ± 0.016), followed by LSTM-epoch (0.041 ± 0.010), random-forest-window (0.036 ± 0.008), random-forest-epoch (0.028 ± 0.006), and control (0.015 ± 0.003). These rankings were confirmed by the Wilcoxon rank-sum tests (p < 0.001).

Fig. 4. This boxplot compares AUPRC performance across 1,000 bootstrap iterations, separated by model-type, temporalization scheme, and prediction range. AUPRC, area under the precision recall curve.

Fig. 5. This boxplot compares AUROC performance across 1,000 bootstrap iterations, separated by model-type, temporalization scheme, and prediction range. AUROC, area under the receiver operating characteristic.

Fig. 6. This boxplot compares sensitivity at 95% specificity performance across 1,000 bootstrap iterations, separated by model-type, temporalization scheme, and prediction range.

Fig. 7. This boxplot compares PPV at 95% specificity performance across 1,000 bootstrap iterations, separated by model-type, temporalization scheme, and prediction range. PPV, positive predictive value.

Table 2. Model evaluation summary. Metrics are reported as mean ± 95% CI.

| Prediction range | Model | Scheme | AUPRC | AUROC | Sn. @ 95% Sp. | PPV @ 95% Sp. |
|---|---|---|---|---|---|---|
| 30 d | Random Forest | Control | 0.0152 ± 0.0033 | 0.9243 ± 0.0079 | 0.4557 ± 0.0386 | 0.0175 ± 0.0024 |
| 30 d | Random Forest | Epoch | 0.0275 ± 0.0062 | 0.9455 ± 0.0087 | 0.6728 ± 0.0390 | 0.0136 ± 0.0015 |
| 30 d | Random Forest | Window | 0.0363 ± 0.0079 | 0.9482 ± 0.0099 | 0.7481 ± 0.0437 | 0.0158 ± 0.0019 |
| 30 d | LSTM | Epoch | 0.0407 ± 0.0097 | 0.9148 ± 0.0131 | 0.6332 ± 0.0439 | 0.0131 ± 0.0015 |
| 30 d | LSTM | Window | 0.0563 ± 0.0157 | 0.9256 ± 0.0127 | 0.6764 ± 0.0402 | 0.0140 ± 0.0015 |
| 90 d | Random Forest | Control | 0.0316 ± 0.0062 | 0.9114 ± 0.0075 | 0.4497 ± 0.0283 | 0.0300 ± 0.0029 |
| 90 d | Random Forest | Epoch | 0.0510 ± 0.0091 | 0.9251 ± 0.0079 | 0.6090 ± 0.0315 | 0.0231 ± 0.0021 |
| 90 d | Random Forest | Window | 0.0697 ± 0.0129 | 0.9320 ± 0.0080 | 0.6699 ± 0.0294 | 0.0261 ± 0.0020 |
| 90 d | LSTM | Epoch | 0.0624 ± 0.0104 | 0.9024 ± 0.0111 | 0.6505 ± 0.0327 | 0.0246 ± 0.0019 |
| 90 d | LSTM | Window | 0.0818 ± 0.0133 | 0.9012 ± 0.0113 | 0.6567 ± 0.0303 | 0.0256 ± 0.0020 |
| 365 d | Random Forest | Control | 0.0610 ± 0.0072 | 0.8944 ± 0.0065 | 0.4809 ± 0.0195 | 0.0566 ± 0.0036 |
| 365 d | Random Forest | Epoch | 0.0987 ± 0.0116 | 0.9021 ± 0.0066 | 0.6132 ± 0.0215 | 0.0490 ± 0.0027 |
| 365 d | Random Forest | Window | 0.1153 ± 0.0128 | 0.9189 ± 0.0063 | 0.6681 ± 0.0213 | 0.0526 ± 0.0026 |
| 365 d | LSTM | Epoch | 0.1288 ± 0.0143 | 0.8875 ± 0.0078 | 0.6177 ± 0.0216 | 0.0489 ± 0.0026 |
| 365 d | LSTM | Window | 0.1391 ± 0.0142 | 0.8968 ± 0.0078 | 0.6268 ± 0.0227 | 0.0505 ± 0.0027 |

Abbreviations: AUPRC, area under the precision recall curve; AUROC, area under the receiver operating characteristic; LSTM, long short-term memory.

Notes: This table summarizes the metric averages and 95% confidence intervals (CIs) for each model variation and prediction range across 1,000 bootstrapped evaluation iterations. CIs are calculated using the average of the 5th and 95th percentile score differences from the mean. Sensitivity (Sn.) and positive predictive value (PPV) are reported at 95% specificity (Sp.).
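The "mean ± half-width" summary described in the table notes can be sketched as follows. This is an illustration only: the function name, the simple floor-based percentile lookup, and the toy prevalence metric are assumptions, and the percentile bounds are exposed as parameters set to the values stated in the notes.

```python
import random
import statistics


def bootstrap_summary(metric, n_items, n_iter=1000, lower_pct=5.0,
                      upper_pct=95.0, seed=0):
    """Return (mean, half_width) of `metric` over bootstrap resamples.

    `metric` receives a list of resampled indices and returns a float.
    The half-width averages the distances from the mean down to the lower
    percentile and up to the upper percentile.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores.append(metric(idx))
    scores.sort()
    mean = statistics.fmean(scores)
    lo = scores[int(lower_pct / 100 * (n_iter - 1))]
    hi = scores[int(upper_pct / 100 * (n_iter - 1))]
    half_width = ((mean - lo) + (hi - mean)) / 2
    return mean, half_width


# Toy example: bootstrap the prevalence of a 10% outcome in 200 records.
toy_labels = [1] * 20 + [0] * 180
mean_prev, half_width = bootstrap_summary(
    lambda idx: sum(toy_labels[i] for i in idx) / len(idx),
    n_items=len(toy_labels),
)
```

Averaging the two distances yields a single symmetric "±" value even when the bootstrap distribution is skewed, which is convenient for tables but hides any asymmetry.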

Fig. 8 depicts the stratification of suicide attempts within prediction deciles, organized by model type, temporalization scheme, and prediction range. Increasing prediction ranges increased the number of suicide attempts but decreased the fraction of suicide attempts captured in the 10th prediction decile of each model. In the 30-day prediction range 10th-decile stratification, the rankings were: random-forest-window (90.6%), random-forest-epoch (87.1%), LSTM-window (80.0%), LSTM-epoch (75.8%), and control (72.7%). The gap in stratification performance between LSTM-window and random-forest-window narrowed from 10.6 percentage points in the 30-day prediction range to 5.7 in the 90-day prediction range (74.3 vs. 80.0%) and 6.8 in the 365-day prediction range (71.0 vs. 77.8%).
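The top-decile capture rate reported here can be computed as in the sketch below. It assumes deciles are formed by ranking visit clusters on predicted risk, so the 10th decile is the highest-scoring tenth; the function name and toy data are illustrative.

```python
def attempts_in_top_decile(y_true, y_score):
    """Fraction of positives whose predicted risk falls in the highest decile.

    Assumption: deciles are formed by ranking all examples on score.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)
    n_top = len(ranked) // 10
    captured = sum(y for _, y in ranked[:n_top])
    return captured / sum(y_true)


# Toy example: 100 examples, 4 positives, 2 of them among the top 10 scores.
toy_scores = [i / 100 for i in range(100)]
toy_labels = [1 if i in (99, 95, 50, 10) else 0 for i in range(100)]
capture = attempts_in_top_decile(toy_labels, toy_scores)
```

A capture rate near 1.0 means that reviewing only the highest-risk tenth of visits would surface nearly all subsequent attempts, which is why this view complements threshold-free metrics such as AUPRC.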

Fig. 8. This bar chart depicts the stratification of suicide attempts within prediction deciles, organized by model type, temporalization scheme, and prediction range. The blue bars at the base of the plot correspond to each model's highest prediction decile. The number of attempts captured by the highest prediction decile is given for each model; further, the total number of attempts for each prediction range is given beside the y-axis.

Fig. 9 shows the top 20 feature importances by MDI across 100 bootstrap iterations for the 30-day prediction range. The top five most important features were suicide attempt, feeling hopeless, self-injurious behavior, active suicidal ideation, and impaired judgment. Fig. 10 compares the relative scaled importances by MDI of the top 10 features for all three prediction ranges. The 365- and 90-day prediction ranges showed a tighter cluster of feature importances, whereas the 30-day prediction range showed greater variance in feature importances.

Fig. 9. This point plot shows the top 20 most important features by MDI, averaged across 100 bootstrap iterations, with 95% confidence intervals, for the 30-day prediction range. CUI, Concept Unique Identifier; MDI, mean decrease in impurity.

Fig. 10. This radar plot compares the scaled importances of the top 10 features from each prediction range (30, 90, and 365 days). Each feature is scaled between 0 and 1, indicating the relative importance of a single feature across prediction ranges.

Discussion

In this study, we demonstrated how EHR-derived clinical note features improve a deployed suicide attempt risk prediction model. 45 We also showed that both the clinical note data temporalization scheme and model type significantly impact model performance across various testing dimensions and prediction ranges. The importance of temporality echoes the ideation-to-action framework, wherein proximal risk factors influence whether suicidal ideation progresses toward or away from suicide attempts. That is, the order in which people experience the intensification or easing of risk factors for suicidal ideation or attempts influences the presence or absence of suicide attempts. In clinical practice, this may appear as ambivalence regarding the desire for death or the intent to enact a suicide attempt. Although the model employed in the present project was not built to depend on or replicate theories of suicide, the fact that the model benefits from temporality offers support for theorists who in their own right are seeking to understand the causes of suicidal behavior. 38 39

Within the clinically preferred 30-day prediction range, window-temporalized LSTM models achieved the highest test AUPRC, whereas random forests achieved the highest test sensitivity and PPV at 95% specificity. AUPRC is more reliable than setpoint metrics for evaluation until the clinical burden of this model can be studied. 54 Models that included features from free-text clinical notes (such as specific medical terms related to suicidality and mental health) outperformed those trained purely on structured data, and the most impactful clinical terms were related to suicidality, mental health, depression, social stress, and drug use, complementing structured features. 57

This work builds on the efforts of others by comparing different approaches to incorporating temporal features in suicide attempt risk prediction models and highlighting the effectiveness of using unstructured data from clinical notes. In comparing our methods and results to those of Shortreed et al, for example, several key differences and outcomes emerge. 46 While both studies explored the use of added temporal features for suicide attempt risk prediction, our study demonstrated improved performance with features derived from clinical notes, including TF-IDF and LSA-transformed clinical concepts. In contrast, Shortreed et al did not observe significant performance improvements with added temporal predictors engineered from clinical data. Unstructured free text might better capture temporal data relevant to suicide attempt risk prediction than structured data. Replication studies of temporal features for both structured and unstructured data are indicated.

Our findings align with the prior work of Tsui et al, who also found that incorporating unstructured data significantly enhanced their prediction model's accuracy compared with using only structured data. 13 Currently, we use a bag-of-words approach for suicide attempt risk prediction based on medical concept counts from clinical notes. Although not theory-driven, this method offers easy accessibility, fast implementation, and scalability. In contrast, Meerwijk et al advocated for a theory-driven approach using the three-step Theory of Suicide (3ST), which, while potentially more accurate, required extensive setup and manual annotation. 14 36 Overall, our bag-of-words model remains a feasible and effective method until more refined strategies become practical.

In future work, we aim to enhance our risk prediction model using advanced NLP techniques, including vector embeddings like word2vec and cui2vec, which capture term similarities within the corpus. 21 22 58 Coppersmith et al and Ji highlighted the effectiveness of pretrained embeddings and large language models such as BERT, RoBERTa, MentalBERT, and MentalRoBERTa in fine-tuning suicide text classifiers. 10 15 58 We also suspect it may be prudent to retrain with additional NLP-derived features like lexical, syntactic, and sentimental elements, shown to improve outcomes in suicide note classification. 12 Levis et al suggested sentiment analysis of psychotherapy notes can improve prediction models, whereas Ji noted varying effectiveness depending on the data source. 9 15 16 59 60 61 Given the reliance on clinician-entered ICD codes in the present study, which may under-identify suicide attempts, especially those presented at an outside facility, 7 future work may adopt a weakly supervised NLP approach as used by Bejan et al. 6 Importantly, our current study improved performance using a simple, understandable, fast, and more transportable method, underscoring the critical value of unstructured data in suicidality prediction. This highlights that even straightforward approaches can make significant contributions, paving the way for further enhancements with more advanced techniques.

In summary, the field of predictive modeling is steadily incorporating unstructured data and NLP methods to improve screening efforts. Our work, and the works of others discussed in this paper, support the integration of these complex data sources into risk prediction models. There are several interesting paths for future research, such as the use of vector embeddings, better ground-truth ascertainment methods, improved feature extraction techniques, additional sentiment analysis, and the use of theory-based approaches to inform model design. Differences in study outcomes could also stem from variations in datasets, model implementations, and population characteristics, which should be considered in future comparisons. These potential improvements show promise for better predicting suicide risk, which could lead to more effective interventions and fewer deaths from suicide. The challenge moving forward is finding ways to continually improve, fine-tune, and validate these models.

Clinical Relevance Statement

This study highlights the significant improvement in suicide attempt risk prediction models when incorporating temporal CUIs derived from clinical notes. By enhancing structured data with temporally organized unstructured data, particularly through window-temporalized LSTM models, predictive performance notably increased. These findings underscore the importance of utilizing both structured and unstructured EHR data in clinical risk assessments. Improved prediction models can lead to more accurate identification of high-risk individuals, potentially allowing for timely and targeted interventions. Future advancements in integrating sophisticated methods with clinical data hold promise for further enhancing predictive accuracy and ultimately improving patient outcomes.

Multiple-Choice Questions

  1. Which of the following approaches to ascertaining suicide attempts in health records should have the highest PPV?

    a. Analyzing ICD-9 codes

    b. Analyzing ICD-10 codes

    c. Employing weakly supervised NLP

    d. Manual chart review by expert clinicians

    Correct Answer: The correct answer is option d. ICD-10 codes have higher PPV for ascertaining suicide attempt than ICD-9, and novel weakly supervised NLP approaches show promise for further improving PPV. However, the gold standard to which all approaches are currently compared is clinical chart review.

  2. Which of the following evaluation metrics is prioritized to address the rarity of outcomes in this study?

    a. AUROC

    b. Accuracy

    c. AUPRC

    d. PPV

    Correct Answer: The correct answer is option c. AUROC, accuracy, and PPV can all be artificially inflated with rare outcomes.

  3. Which of the following is a foundational framework for studying suicidality?

    a. Ideation to Action (I2A)

    b. Spontaneous Action (SA)

    c. Hopelessness and Loneliness (HaL)

    Correct Answer: The correct answer is option a. The leading theories on suicidality are collectively referred to as ideation to action frameworks.
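The reasoning behind Question 2 can be illustrated with the counts reported in the abstract. The following is a minimal sketch, assuming that the 480 suicide attempts in the testing set are the 30-day outcomes and using the standard approximation (not stated in this article) that a no-skill classifier's expected AUPRC is close to the outcome prevalence. The function and constant names are hypothetical.

```python
# Counts reported in the abstract for the testing set.
TEST_VISIT_CLUSTERS = 471_936
TEST_SUICIDE_ATTEMPTS = 480  # assumed here to be the 30-day outcomes
# Highest AUPRC reported for the 30-day prediction range.
REPORTED_AUPRC = 0.056


def prevalence(n_positive, n_total):
    """Fraction of visit clusters followed by the outcome."""
    return n_positive / n_total


def always_negative_accuracy(n_positive, n_total):
    """Accuracy of a trivial model that never predicts the outcome."""
    return (n_total - n_positive) / n_total


# Approximate AUPRC expected from a classifier with no discriminative ability.
BASELINE_AUPRC = prevalence(TEST_SUICIDE_ATTEMPTS, TEST_VISIT_CLUSTERS)
# Accuracy achieved without identifying a single case.
NO_SKILL_ACCURACY = always_negative_accuracy(TEST_SUICIDE_ATTEMPTS, TEST_VISIT_CLUSTERS)
# How many times larger the reported AUPRC is than the no-skill level.
AUPRC_LIFT = REPORTED_AUPRC / BASELINE_AUPRC

if __name__ == "__main__":
    print(f"prevalence (no-skill AUPRC): {BASELINE_AUPRC:.5f}")
    print(f"accuracy of always-negative model: {NO_SKILL_ACCURACY:.5f}")
    print(f"reported AUPRC relative to no-skill: {AUPRC_LIFT:.1f}x")
```

Under these assumptions the prevalence is roughly 0.1%, so a model that never flags anyone would still be about 99.9% accurate, whereas an AUPRC of 0.056 is on the order of 55 times the no-skill level. This is why AUPRC, read against prevalence, is more informative than accuracy for rare outcomes.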

Acknowledgments

The authors would like to express their sincere gratitude to Dario A. Giuse, the Director of Vanderbilt Health IT, for generously providing access to the Vanderbilt Wordcloud Indexer. This tool was invaluable in facilitating the research and contributing to the conclusions drawn in this work. The authors also acknowledge the support from their team, peers, and the wider research community, whose collaboration and input have enriched this study.

Funding Statement

This research has been supported by several funding bodies. The primary source of funding was the National Library of Medicine (NLM) T15 training grant (grant number: 2T15LM007450-20). Additional support came from the Evelyn Selby Stead Fund for Innovation, Vanderbilt University Medical Center, specifically grants R01 MH121455 and R01 MH116269. The Military Suicide Research Consortium also provided funding through grant W81XWH-10-2-0181. Finally, funding for the Research Derivative and BioVU Synthetic Derivative was provided by the National Center for Research Resources (grant number: UL1 RR024975/RR/NCRR). The funders had no role in study design, data collection and analysis, or manuscript preparation.

Conflict of Interest

None declared.

Protection of Human and Animal Subjects

No human subjects were involved in this project.

Note

The study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects and was reviewed by the VUMC Institutional Review Board.

Supplementary Material

10-1055-a-2411-5796-s202308ra0185.pdf (1.7MB, pdf)

References

  • 1. Centers for Disease Control and Prevention, National Center for Health Statistics. National Vital Statistics System, Mortality 2018–2021 on CDC WONDER Online Database; 2021
  • 2. Zalsman G, Hawton K, Wasserman D et al. Suicide prevention strategies revisited: 10-year systematic review. Lancet Psychiatry. 2016;3(07):646–659. doi: 10.1016/S2215-0366(16)30030-X.
  • 3. Mann J J, Apter A, Bertolote J et al. Suicide prevention strategies: a systematic review. JAMA. 2005;294(16):2064–2074. doi: 10.1001/jama.294.16.2064.
  • 4. Walsh C G, Johnson K B, Ripperger M et al. Prospective validation of an electronic health record-based, real-time suicide risk model. JAMA Netw Open. 2021;4(03):e211428. doi: 10.1001/jamanetworkopen.2021.1428.
  • 5. Walsh C G, Ribeiro J D, Franklin J C. Predicting risk of suicide attempts over time through machine learning. Clin Psychol Sci. 2017;5(03):457–469.
  • 6. Bejan C A, Ripperger M, Wilimitis D et al. Improving ascertainment of suicidal ideation and suicide attempt with natural language processing. Sci Rep. 2022;12(01):15146. doi: 10.1038/s41598-022-19358-3.
  • 7. Young J, Bishop S, Humphrey C, Pavlacic J M. A review of natural language processing in the identification of suicidal behavior. J Affect Disord Rep. 2023;12:100507.
  • 8. Cohen J, Wright-Berryman J, Rohlfs L, Trocinski D, Daniel L, Klatt T W. Integration and validation of a natural language processing machine learning suicide risk prediction model based on open-ended interview language in the emergency department. Front Digit Health. 2022;4:818705. doi: 10.3389/fdgth.2022.818705.
  • 9. Levis M, Leonard Westgate C, Gui J, Watts B V, Shiner B. Natural language processing of clinical mental health notes may add predictive value to existing suicide risk models. Psychol Med. 2021;51(08):1382–1391. doi: 10.1017/S0033291720000173.
  • 10. Coppersmith G, Leary R, Crutchley P, Fine A. Natural language processing of social media as screening for suicide risk. Biomed Inform Insights. 2018;10:1178222618792860.
  • 11. McCoy T H, Jr, Castro V M, Roberson A M, Snapper L A, Perlis R H. Improving prediction of suicide and accidental death after discharge from general hospitals with natural language processing. JAMA Psychiatry. 2016;73(10):1064–1071. doi: 10.1001/jamapsychiatry.2016.2172.
  • 12. Pestian J, Nasrallah H, Matykiewicz P, Bennett A, Leenaars A. Suicide note classification using natural language processing: a content analysis. Biomed Inform Insights. 2010;2010(03):19–28. doi: 10.4137/BII.S4706.
  • 13. Tsui F R, Shi L, Ruiz V et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open. 2021;4(01):ooab011. doi: 10.1093/jamiaopen/ooab011.
  • 14. Meerwijk E L, Tamang S R, Finlay A K, Ilgen M A, Reeves R M, Harris A HS. Suicide theory-guided natural language processing of clinical progress notes to improve prediction of veteran suicide risk: protocol for a mixed-method study. BMJ Open. 2022;12(08):e065088. doi: 10.1136/bmjopen-2022-065088.
  • 15. Ji S. Towards intention understanding in suicidal risk assessment with natural language processing. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics; 2022:4028–4038. Accessed September 15, 2024 at: https://aclanthology.org/2022.findings-emnlp.297
  • 16. Ji S, Yu C P, Fung S fu, Pan S, Long G. Supervised learning for suicidal ideation detection in online user content. Complexity. 2018;2018:1–10.
  • 17. Arowosegbe A, Oyelade T. Application of natural language processing (NLP) in detecting and preventing suicide ideation: a systematic review. Int J Environ Res Public Health. 2023;20(02):1514. doi: 10.3390/ijerph20021514.
  • 18. Zhong Q Y, Mittal L P, Nathan M D et al. Use of natural language processing in electronic medical records to identify pregnant women with suicidal behavior: towards a solution to the complex classification problem. Eur J Epidemiol. 2019;34(02):153–162. doi: 10.1007/s10654-018-0470-0.
  • 19. Zhang D, Yin C, Zeng J, Yuan X, Zhang P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inform Decis Mak. 2020;20(01):280. doi: 10.1186/s12911-020-01297-6.
  • 20. Thompson K. Programming techniques: regular expression search algorithm. Commun ACM. 1968;11(06):419–422.
  • 21. Beam A L, Kompa B, Schmaltz A et al. Clinical concept embeddings learned from massive sources of multimodal medical data. Pac Symp Biocomput. 2020;25:295–306.
  • 22. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781. Accessed February 13, 2022 at: http://arxiv.org/abs/1301.3781
  • 23. Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. J Mach Learn Res. 2003;3(Jan):993–1022.
  • 24. Dey L, Haque S KM. Opinion mining from noisy text data. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data - AND '08. ACM Press; 2008:83–90.
  • 25. Turney P D. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. arXiv:cs/0212032. Accessed February 13, 2022 at: http://arxiv.org/abs/cs/0212032
  • 26. Boggs J M, Quintana L M, Powers J D, Hochberg S, Beck A. Frequency of clinicians' assessments for access to lethal means in persons at risk for suicide. Arch Suicide Res. 2022;26(01):127–136. doi: 10.1080/13811118.2020.1761917.
  • 27. Yeskuatov E, Chua S L, Foo L K. Leveraging reddit for suicidal ideation detection: a review of machine learning and natural language processing techniques. Int J Environ Res Public Health. 2022;19(16):10347. doi: 10.3390/ijerph191610347.
  • 28. Krause K J, Shelley J, Becker A, Walsh C. Exploring risk factors in suicidal ideation and attempt concept cooccurrence networks. AMIA Annu Symp Proc. 2023;2022:644–652.
  • 29. Montesinos López O A, Montesinos López A, Crossa J. Overfitting, model tuning, and evaluation of prediction performance. Springer International Publishing; 2022:109–139.
  • 30. Zhao J, Henriksson A. Learning temporal weights of clinical events using variable importance. BMC Med Inform Decis Mak. 2016;16 02:71. doi: 10.1186/s12911-016-0311-6.
  • 31. Zhao J, Henriksson A, Kvist M, Asker L, Boström H. Handling temporality of clinical events for drug safety surveillance. AMIA Annu Symp Proc. 2015;2015:1371–1380.
  • 32. Singh A, Nadkarni G, Gottesman O, Ellis S B, Bottinger E P, Guttag J V. Incorporating temporal EHR data in predictive models for risk stratification of renal function deterioration. J Biomed Inform. 2015;53:220–228. doi: 10.1016/j.jbi.2014.11.005.
  • 33. Choi E, Bahadori M T, Schuetz A, Stewart W F, Sun J. Doctor AI: predicting clinical events via recurrent neural networks. JMLR Workshop Conf Proc. 2016;56:301–318.
  • 34. Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent neural networks for multivariate time series with missing values. Sci Rep. 2018;8(01):6085. doi: 10.1038/s41598-018-24271-9.
  • 35. Joiner T E. Why People Die by Suicide. Harvard University Press; 2005.
  • 36. Klonsky E D, May A M. The three-step theory (3ST): a new theory of suicide rooted in the “ideation-to-action” framework. Int J Cogn Ther. 2015;8(02):114–129.
  • 37. Klonsky E D, May A M, Saffer B Y. Suicide, suicide attempts, and suicidal ideation. Annu Rev Clin Psychol. 2016;12(01):307–330. doi: 10.1146/annurev-clinpsy-021815-093204.
  • 38. Klonsky E D, Saffer B Y, Bryan C J. Ideation-to-action theories of suicide: a conceptual and empirical update. Curr Opin Psychol. 2018;22:38–43. doi: 10.1016/j.copsyc.2017.07.020.
  • 39. Van Orden K A, Witte T K, Cukrowicz K C, Braithwaite S R, Selby E A, Joiner T E, Jr. The interpersonal theory of suicide. Psychol Rev. 2010;117(02):575–600. doi: 10.1037/a0018697.
  • 40. Schafer K M, Kennedy G, Gallyer A, Resnik P. A direct comparison of theory-driven and machine learning prediction of suicide: a meta-analysis. PLoS One. 2021;16(04):e0249833. doi: 10.1371/journal.pone.0249833.
  • 41. Walker R L, Shortreed S M, Ziebell R A et al. Evaluation of electronic health record-based suicide risk prediction models on contemporary data. Appl Clin Inform. 2021;12(04):778–787. doi: 10.1055/s-0041-1733908.
  • 42. Carter G, Milner A, McGill K, Pirkis J, Kapur N, Spittal M J. Predicting suicidal behaviours using clinical instruments: systematic review and meta-analysis of positive predictive values for risk scales. Br J Psychiatry. 2017;210(06):387–395. doi: 10.1192/bjp.bp.116.182717.
  • 43. Wilimitis D, Turer R W, Ripperger M et al. Integration of face-to-face screening with real-time machine learning to predict risk of suicide among adults. JAMA Netw Open. 2022;5(05):e2212095. doi: 10.1001/jamanetworkopen.2022.12095.
  • 44. McKernan L C, Lenert M C, Crofford L J, Walsh C G. Outpatient engagement and predicted risk of suicide attempts in fibromyalgia. Arthritis Care Res (Hoboken). 2019;71(09):1255–1263. doi: 10.1002/acr.23748.
  • 45. Walsh C G, Ripperger M A, Novak L et al. Randomized controlled comparative effectiveness trial of risk model-guided clinical decision support for suicide screening. medRxiv. 2024. doi: 10.1101/2024.03.14.24304318.
  • 46. Shortreed S M, Walker R L, Johnson E et al. Complex modeling with detailed temporal predictors does not improve health records-based suicide risk prediction. NPJ Digit Med. 2023;6(01):47. doi: 10.1038/s41746-023-00772-4.
  • 47. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267–D270.
  • 48. Mandani S, Giuse D, McLemore M, Weitkamp A. Augmenting NLP Results by Leveraging SNOMED CT Relationships for Identification of Implantable Cardiac Devices from Patient Notes. Presented at: SNOMED CT Expo 2019; October 31, 2019; Kuala Lumpur, Malaysia. Accessed September 15, 2024 at: https://confluence.ihtsdotools.org/display/FT/201905+Augmenting+NLP+results+by+leveraging+SNOMED+CT+relationships+for+identification+of+implantable+cardiac+devices+from+patient+notes?preview=/87042613/87043024/201905%20SCT%20Expo%202019%20-%20Madani.pdf
  • 49. Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. J Doc. 1972;28(01):11–21.
  • 50. Landauer T K, Dumais S T. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev. 1997;104(02):211–240.
  • 51. Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
  • 52. Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library. 2019. Accessed September 15, 2024 at: 10.48550/ARXIV.1912.01703
  • 53. Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 2000;10.
  • 54. Ross E L, Zuromski K L, Reis B Y, Nock M K, Kessler R C, Smoller J W. Accuracy requirements for cost-effective suicide risk prediction among primary care patients in the US. JAMA Psychiatry. 2021;78(06):642–650. doi: 10.1001/jamapsychiatry.2021.0089.
  • 55. Spiegelhalter D J. Probabilistic prediction in patient management and clinical trials. Stat Med. 1986;5(05):421–433. doi: 10.1002/sim.4780050506.
  • 56. Scornet E. Trees, forests, and impurity-based variable importance; 2021. Accessed May 16, 2022 at: http://arxiv.org/abs/2001.04295
  • 57. Boggs J M, Beck A, Hubley S et al. General medical, mental health, and demographic risk factors associated with suicide by firearm compared with other means. Psychiatr Serv. 2018;69(06):677–684. doi: 10.1176/appi.ps.201700237.
  • 58. Pennington J, Socher R, Manning C. Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2014:1532–1543.
  • 59. Sarsam S M, Al-Samarraie H, Alzahrani A I, Alnumay W, Smith A P. A lexicon-based approach to detecting suicide-related messages on Twitter. Biomed Signal Process Control. 2021;65:102355.
  • 60. Gaur M, Aribandi V, Alambo A et al. Characterization of time-variant and time-invariant assessment of suicidality on Reddit using C-SSRS. PLoS One. 2021;16(05):e0250448. doi: 10.1371/journal.pone.0250448.
  • 61. Cambria E, Li Y, Xing F Z, Poria S, Kwok K. SenticNet 6: ensemble application of symbolic and subsymbolic AI for sentiment analysis. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM; 2020:105–114.
  • 62. Chawla N V, Bowyer K W, Hall L O, Kegelmeyer W P. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–357.
  • 63. Lemaître G, Nogueira F, Aridas C K. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1–5.

Articles from Applied Clinical Informatics are provided here courtesy of Thieme Medical Publishers
