PLOS One. 2025 Sep 15;20(9):e0331459. doi: 10.1371/journal.pone.0331459

Automated detection and prediction of suicidal behavior from clinical notes using deep learning

Brian E Bunnell 1,*, Athanasios Tsalatsanis 2, Chaitanya Chaphalkar 2, Sara Robinson 1, Sierra Klein 1, Sarah Cool 1, Elizabeth Szwast 3, Paul M Heider 3,4, Bethany J Wolf 3, Jihad S Obeid 3,4
Editor: Braja Gopal Patra
PMCID: PMC12435685  PMID: 40953025

Abstract

Background

Deep learning approaches have tremendous potential to improve the predictive power of traditional suicide prediction models to detect and predict intentional self-harm (ISH). Existing research is limited by a general lack of consistent performance and replicability across sites. We aimed to validate a deep learning approach used in previous research to detect and predict ISH using clinical note text and evaluate its generalizability to other academic medical centers.

Methods

We extracted clinical notes from electronic health records (EHRs) of 1,538 patients with International Classification of Diseases codes for ISH and 3,012 matched controls without ISH codes. We evaluated the performance of two traditional bag-of-words models (i.e., Naïve Bayes, Random Forest) and two convolutional neural network (CNN) models including randomly initialized (CNNr) and pre-trained Word2Vec initialized (CNNw) weights to detect ISH within 24 hours of and predict ISH from clinical notes 1–6 months before the first ISH event.

Results

In detecting concurrent ISH, both CNN models outperformed bag-of-words models with AUCs of 0.99 and F1 scores of 0.94. In predicting future ISH, the CNN models outperformed Naïve Bayes models with AUCs of 0.81–0.82 and F1 scores of 0.61–0.64.

Conclusions

We demonstrated that leveraging EHRs with a well-defined set of ISH ICD codes to train deep learning models to detect and predict ISH using clinical note text is feasible and replicable at more than one institution. Future work will examine this approach across multiple sites under less controlled settings using both structured and unstructured EHR data.

Introduction

Suicide has consistently been among the top ten leading causes of death in the U.S. during the past decade, resulting in more than 480,000 deaths, with rising rates each year [1]. These suicidal deaths and related nonfatal suicide-related injuries are associated with significant economic burden in the U.S., with annual medical costs and associated loss in productivity reaching over $500 billion [2]. There are numerous risk factors for suicide, some of which include being male, American Indian or Alaska Native, non-Hispanic, and having mental illness (e.g., depression, anxiety, substance abuse), prior trauma, communication difficulties, decision-making impulsivity, and aggression [1]. Other important risk factors include prior suicide attempts or intentional self-harm (ISH) behaviors [3,4]. Although many risk factors have been identified, meta-analytic data suggest that they only provide a marginal improvement in diagnostic accuracy above chance and prospectively predict suicide attempts only 26% of the time and suicide deaths only 9% of the time [4,5].

Identifying individuals at risk for suicide is critical to suicide prevention [6]. However, current approaches to suicide risk assessment involve healthcare professionals administering clinical interviews and questionnaires, which can be inefficient, costly, and limited in their ability to predict future ISH and suicide deaths [5,7–11]. Analytical approaches, such as machine learning that makes use of data from electronic health records (EHRs), have tremendous potential to improve the efficient identification and predictive accuracy of numerous risk factors without having to repeat clinical interviews and questionnaires across multiple treatment settings. For example, machine learning approaches analyzing structured EHR data [12–18] and natural language processing (NLP) of clinical note text (e.g., word pair frequencies, positive or negative valence words) [12,19–26] to predict suicide and ISH have shown promising, yet variable results, with area under the receiver operating characteristic curves (AUCs) ranging between 0.61 and 0.89.

Recent advances in computational approaches, such as deep learning, have the potential to significantly improve suicide/ISH prediction and consistency in predictive performance by harnessing contextual content, rather than simple word frequencies within clinical notes [27,28]. Deep learning encompasses computational models composed of multiple layers of artificial neural networks, or deep neural network (DNN) based models, such as convolutional neural networks (CNNs), that “learn” representations of data with multiple levels of abstraction [29]. Several studies have examined innovative deep learning approaches to detect suicidality. For example, studies have successfully employed deep learning using text from social media posts to detect psychiatric stressors for suicide [30], predefined suicide risk categories [31], and users with signs of suicidal ideation [32,33]. Another novel study used deep learning and video-recorded interviews of suicidal patients to detect suicide severity from digitally quantified measurements of facial expressivity, head movement, and speech prevalence [34]. However, despite the innovative nature of these studies, they did not capitalize on the abundance of valuable and relevant information within EHRs, which has the potential to improve risk assessment.

Currently, few studies have examined the utility of deep learning approaches to identify suicide-related clinical records using clinical text from an EHR (e.g., for surveillance purposes) [27,35–37], only one of which examined its utility to predict future suicidal behavior [27]. Cusick and colleagues [35] used deep learning to detect current suicidal ideation from EHR clinical note text and found their CNN model to be superior to other machine learning methods (i.e., logistic classifier, support vector machines [SVM], Naive Bayes classifier), demonstrating an AUC of 0.962. Rozova et al. [36] used several machine learning approaches to detect instances of ISH based on text from emergency department triage notes while also implementing a long short-term memory (LSTM) network, but found that the LSTM approach performed worse than a gradient boosting model (i.e., AUCs of 0.801 vs. 0.832, respectively). Using clinical notes from the U.S. Veterans Affairs, Workman et al. [37] found that a trained zero-shot learning DNN model was able to effectively identify ICD-10-CM diagnostic codes related to suicide with an AUC of 0.946. Obeid and colleagues [27] compared the performance of two DNN approaches (i.e., CNN and LSTM) with several traditional bag-of-words models (e.g., naïve Bayes, decision tree, random forest) in identifying current ISH and predicting future ISH events from notes 1–6 months earlier. The CNN approach achieved the highest performance in detecting current ISH (i.e., AUC = 0.999, F1 score = 0.985) and predicting future ISH (i.e., AUC = 0.882; F1 score = 0.769). Despite the promising results of these machine learning and deep learning approaches, they are limited by a general lack of consistent performance and replicability of results [27,29–31,33–36,38]. Furthermore, although the report by Obeid et al. included an examination of overrepresented words in clinical notes of patients with ISH events, it did not include an interpretation of why some clinical notes were indicative of ISH.

Given the limited prior work evaluating deep learning approaches using EHR clinical note text for the detection and prediction of ISH, there is a critical need to conduct more studies that exhibit stronger predictive power and show consistent model performance across multiple data sets. Thus, the purpose of the current study was to validate the approach used previously by Obeid et al. [27], specifically, to replicate the results at another institution [39], and to evaluate its generalizability to other academic medical centers. Of note, this approach leverages medical records coded with a well-defined set of ISH International Classification of Diseases, Clinical Modification (ICD) codes to generate large amounts of labeled records to train machine learning models using supervised learning [40]. This relies on the quality and accuracy of the ICD coded records. Specifically, herein, we aimed to assess the accuracy of ISH ICD codes at another institution and assess their value as “silver standard” labels for training models on the following two tasks: 1) automated detection of suicide attempt and ISH events in clinical note text concurrent with documented codes for ISH (hereafter referred to as concurrent ISH), and 2) the prediction of future ISH-labeled encounters. We also aimed to examine the interpretability of some models by identifying key words in clinical notes that may be indicative of suicidal behavior. We envision that the proposed deep-learning approach will benefit current suicide risk assessment by providing additional valuable information from existing clinical documentation that is unavailable to a provider during the patient’s assessment, as it may not be included in the patient’s primary problem list.

Methods

Participants

Adult patients, ages 18–90 years old, with clinical notes in the Epic (Epic Systems Corporation) EHR system at the University of South Florida (USF) or its affiliate Tampa General Hospital (TGH) between August 2015 and August 2021 were eligible for inclusion in the study. Cases included all patients with ICD-10 codes for a suicide attempt (i.e., T14.91) or ISH as defined in the Centers for Disease Control and Prevention’s National Health Statistics Report (e.g., X71–X83) [40]. Controls included a cohort of randomly selected contemporary patients without ICD codes for a suicide attempt or ISH diagnosis in their records during the study period. We first pulled 5 times more controls than cases from the organization’s EHR. The controls were then matched to cases by age, gender, race, ethnicity, and number of notes using the nearest neighbor approach, such that 2 controls per case were selected for the study. The index event for cases was defined as the first suicide attempt or ISH diagnosis, and the index event for controls included the most recent interaction with the EHR. Clinical notes up to 180 days before and including the index event day were extracted. The study was approved by the USF Institutional Review Board (IRB) under the STUDY002988 protocol. The IRB waived the requirement for informed consent. The data were accessed for research purposes on August 12, 2022. Some authors had access to information that could identify individual participants during or after data collection.
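As an illustration of the 2:1 nearest-neighbor matching described above, a minimal sketch using the MatchIt R package (cited in the Software packages section) is shown below; the data frame and column names (cohort, case, n_notes, etc.) are hypothetical placeholders, since the actual extraction schema is not described at this level of detail.

```r
# Minimal sketch of 2:1 nearest-neighbor matching; column names are hypothetical.
library(MatchIt)

# 'cohort' is assumed to hold one row per patient with a binary 'case' indicator,
# demographics, and the number of notes per patient ('n_notes').
m <- matchit(case ~ age + gender + race + ethnicity + n_notes,
             data   = cohort,
             method = "nearest",   # nearest-neighbor matching
             ratio  = 2)           # two controls per case

matched_cohort <- match.data(m)    # cases plus their matched controls
summary(m)                         # covariate balance before and after matching
```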

Clinical note selection and preprocessing

Note types.

While all types of clinical notes were extracted from the EHR, to reduce noise and an excessive number of notes, we excluded note types with low frequencies and emphasized types such as Emergency Department (ED), Progress, Plan of Care, ED Provider, Consults, Interdisciplinary Plan of Care, History and Physical, Case Management, and ED After Visit Summary (AVS) Snapshot (see S1 Table).

Preprocessing of notes for concurrent detection of ISH.

Notes recorded within 24 hours (i.e., ±24 hours) of the index event were used to (1) assess the reliability of ICD codes in capturing suicide attempts or ISH events by manual review of the notes and (2) detect suicide attempts or ISH using machine learning models. Following the procedures used by Obeid et al. (2020), individual notes longer than 800 words were truncated to 800 words (n = 2,793). Notes belonging to the same patient were sorted by timestamp with newer notes first and then concatenated. Concatenated notes longer than 8,000 words were truncated to 8,000 words (n = 38). The notes were stored in the Detection cohort dataset, in which each record represented a patient and their concatenated note. Patients without notes within 24 hours of the index event were not included in the Detection cohort dataset.
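A rough sketch of this truncation and concatenation step is given below, assuming a hypothetical notes data frame with one row per note (patient_id, label, note_time, index_time, text); the same logic applies to the Prediction cohort with the longer cutoffs described in the next subsection.

```r
# Illustrative Detection-cohort note preparation: filter to the ±24-hour window,
# truncate individual notes, sort newest first, and concatenate per patient.
library(dplyr)

truncate_words <- function(text, max_words) {
  words <- unlist(strsplit(text, "\\s+"))
  paste(head(words, max_words), collapse = " ")
}

detection <- notes %>%
  filter(abs(as.numeric(difftime(note_time, index_time, units = "hours"))) <= 24) %>%
  mutate(text = vapply(text, truncate_words, character(1), max_words = 800)) %>%
  arrange(patient_id, desc(note_time)) %>%              # newer notes first
  group_by(patient_id, label) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  mutate(text = vapply(text, truncate_words, character(1), max_words = 8000))
```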

Preprocessing of notes for predicting ISH.

Notes recorded between 30 and 180 days before the index event were used for the prediction of suicide attempts or ISH events using machine learning models. Individual notes were truncated to 1,500 words (n = 7,545), and concatenated notes were truncated to 10,000 words (n = 744). The rationale for using longer word cutoffs was to capture more information in the longitudinal record prior to the ISH visit than we did with the Detection cohort. The notes were stored in the Prediction dataset, in which each record represented a patient and their concatenated note. Patients without notes in the 180-to-30-day window were not included in the Prediction cohort dataset.

ICD validation through manual review

A sample of 400 patient records from the Detection cohort dataset was manually reviewed to assess the reliability of ICD codes in capturing suicide attempt or ISH events reported in clinical notes. The sample included records from 200 cases and 200 controls. The notes were imported into REDCap [41] and reviewed by two medical students blinded to the related ICD code. The reviewers were instructed to read the concatenated notes and label them as case (ISH) or control (no ISH). Suicidal ideation alone was not labeled as a case. Each reviewer was assigned 250 note strings with 100 overlapping to allow for estimation of interrater reliability. Any conflicting ratings were resolved by a licensed Clinical Psychologist (i.e., the first author). Reviewer labels were considered the gold standard and were compared with the suicide attempt or ISH ICD codes.

Text processing and word embeddings

We tested three machine learning models: a deep learning model with word embeddings (WEs), evaluated with two weight-initialization schemes, and two traditional bag-of-words models. For the deep learning model, we applied the following preprocessing steps to the concatenated notes of the Detection and Prediction cohort datasets: lower casing; sentence segmentation; removal of punctuation and numbers; and tokenization. To create word embeddings, token sequences were pre-padded with zeros to match the length of the longest string in the training set. WEs had 200 dimensions per word. WE weights were initialized a) randomly and b) with a pretrained Word2Vec (W2V) [42] model, built on 550,000 notes from the EHR using 200 dimensions per word, a skip window size of 5 words in each direction, and negative sampling of 5. For the bag-of-words models, text processing included lower casing; removal of punctuation, stop words, and numbers; word stemming; and tokenization.
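The sketch below illustrates one way the tokenization, pre-padding, and skip-gram pre-training could be carried out in R; train_texts and corpus_notes are hypothetical objects, and the word2vec package is only one plausible implementation, since the paper does not specify which W2V implementation was used.

```r
# Illustrative tokenization and pre-padding with the keras R interface.
library(keras)

tok    <- text_tokenizer() %>% fit_text_tokenizer(train_texts)   # train_texts: hypothetical
seqs   <- texts_to_sequences(tok, train_texts)
maxlen <- max(lengths(seqs))                                      # longest training string
x_train <- pad_sequences(seqs, maxlen = maxlen, padding = "pre")  # pre-pad with zeros

# Skip-gram embeddings (200 dimensions, window of 5, 5 negative samples), e.g. with
# the 'word2vec' package; the original study may have used a different implementation.
library(word2vec)
w2v <- word2vec(x = corpus_notes, type = "skip-gram",
                dim = 200, window = 5, negative = 5)
emb_matrix <- as.matrix(w2v)   # word-by-200 matrix used to seed the embedding layer
```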

Bag-of-words models

The models we tested were Naïve Bayes (NB) [43] and a Random Forest (RF) [44] with 1,000 trees sampling 700 variables per split.
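A minimal sketch of these two baselines is shown below, assuming hypothetical objects train_texts and train_labels; note that textmodel_nb() now ships in the quanteda.textmodels companion package rather than quanteda itself.

```r
# Sketch of the bag-of-words baselines: stemmed, stop-word-free document-feature
# matrix, then Naive Bayes and a 1,000-tree Random Forest with mtry = 700.
library(quanteda)
library(quanteda.textmodels)
library(ranger)

toks <- tokens(train_texts, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)
dfm_train <- dfm(toks)

nb_fit <- textmodel_nb(dfm_train, y = train_labels)

rf_data <- convert(dfm_train, to = "data.frame")[, -1]   # drop the doc_id column
rf_data$label <- factor(train_labels)
rf_fit <- ranger(label ~ ., data = rf_data,
                 num.trees = 1000, mtry = 700,            # as stated in the Methods
                 probability = TRUE, importance = "impurity")
```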

Deep learning models

The CNN architecture consisted of several layers (Fig 1). We used an architecture that has been previously developed and used for text classification [28,45]. The input layer had a dimension of 10,000 tokens. The next layer was a WE layer with a drop rate of 0.2. Next was a convolutional layer with multiple filter sizes (3, 4, and 5) in parallel, each with 200 nodes, ReLU activation, and a stride of one, followed by global max-pooling and a merging of the resulting tensors. This was followed by a fully connected 200-node hidden layer with ReLU activation and a drop rate of 0.2, and, lastly, an output layer with a single binary node and a sigmoid activation function. The CNN training parameters were as follows: optimizer: adaptive moment estimation gradient descent algorithm (ADAM); number of epochs: 37; batch size: 32; learning rate: 0.0004; validation split: 0.1; and early stopping based on validation loss with a patience of 5. We tested the architecture with randomly initialized weights and with W2V-initialized weights.

Fig 1. The CNN architecture.


ReLU: rectified linear units activation function; W2V: word2vec embeddings.
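An approximate re-creation of this architecture using the keras R interface is sketched below; vocab_size, the W2V seed matrix, and the training objects (x_train, y_train) are hypothetical placeholders, the word-embedding drop rate is approximated by a dropout layer on the embedding output, and older keras versions spell the learning-rate argument lr rather than learning_rate.

```r
# Approximate re-creation of the CNN in Fig 1 (not the authors' exact code).
library(keras)

input <- layer_input(shape = c(10000))                         # 10,000-token sequences

embedded <- input %>%
  layer_embedding(input_dim = vocab_size, output_dim = 200) %>%  # optionally seeded with W2V weights
  layer_dropout(rate = 0.2)

branches <- lapply(c(3, 4, 5), function(k) {                   # parallel filter sizes
  embedded %>%
    layer_conv_1d(filters = 200, kernel_size = k, strides = 1, activation = "relu") %>%
    layer_global_max_pooling_1d()
})

output <- layer_concatenate(branches) %>%                      # merge the three branches
  layer_dense(units = 200, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")               # binary ISH output

model <- keras_model(input, output)
model %>% compile(optimizer = optimizer_adam(learning_rate = 4e-4),
                  loss = "binary_crossentropy", metrics = "accuracy")

history <- model %>% fit(x_train, y_train,                     # x_train / y_train: hypothetical
                         epochs = 37, batch_size = 32, validation_split = 0.1,
                         callbacks = list(callback_early_stopping(monitor = "val_loss",
                                                                  patience = 5)))
```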

Software packages

The R Statistical Software ver. 4.0.2 was used for all analyses [46]. To match controls and cases, we used the MatchIt package [47]. The Quanteda package [48] was used for text processing and for the NB model. The Keras package [49] was used to initialize WEs. Pre-trained W2V weights [42] were used to initialize the WEs. The CNN was constructed using TensorFlow [50]. The RF was constructed using the ranger package [51]. Metrics were calculated using the caret [52] and pROC [53] packages.

Evaluation and metrics

For the ICD manual review, we calculated and reported false positive and false negative counts, interrater reliability (Cohen’s Kappa), accuracy, precision, and recall. In preparation for the machine learning models, both Detection and Prediction cohort datasets were split into a training and cross-validation set including cases and controls with an index event occurring between 2015 and 2019, and a testing set including cases and controls with an index event occurring between 2020 and 2021. For all classification models, we calculated the AUC, accuracy, precision, recall, and F1 score. The evaluation metrics of the CNN models were based on the median values of 50 runs. The DeLong test was used to determine possible significant differences between performances of models [54]. We also reported variable (stemmed word) importance calculated by the RF model. Finally, we report the performance of the CNN models against the gold standard as defined by the manual review of 400 records.
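The following sketch shows how the AUC comparison and classification metrics could be computed with the pROC and caret packages named above; y_test, p_cnn, and p_nb are hypothetical objects holding the test labels and per-model predicted probabilities.

```r
# Illustrative evaluation under assumed objects: 'y_test' (factor of labels) and
# predicted probabilities 'p_cnn' and 'p_nb' for two competing models.
library(pROC)
library(caret)

roc_cnn <- roc(y_test, p_cnn)
roc_nb  <- roc(y_test, p_nb)

auc(roc_cnn); ci.auc(roc_cnn)                    # AUC with 95% confidence interval
roc.test(roc_cnn, roc_nb, method = "delong")     # DeLong test for correlated ROC curves

pred_class <- factor(ifelse(p_cnn >= 0.5, "case", "control"),
                     levels = levels(y_test))
confusionMatrix(pred_class, y_test, positive = "case",
                mode = "everything")             # accuracy, precision, recall, F1
```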

Results

Patient population

Data from 11,298 patients (i.e., 1,883 cases and 9,415 controls) were extracted from the EHR, as per S1 Fig. A total of 6,814 patients (i.e., 1,538 cases and 5,276 controls) had notes recorded within 24 hours of the index event (i.e., the Detection cohort). Of these, 1,538 cases were matched with 3,012 controls (N = 4,550). A total of 3,474 patients (i.e., 1,158 cases and 2,316 matched controls) with an index event between 2015 and 2019 were selected for the training and cross-validation of the CNN used for detecting suicide attempts or ISH. Lastly, 1,076 patients (i.e., 380 cases and 696 matched controls) with an index event between 2020 and 2021 were selected to test the CNN.

A total of 7,925 patients (i.e., 593 cases and 7,332 controls) had notes recorded between 30 and 180 days prior to the index event. Of these, 593 cases were matched with 1,186 controls (N = 1,779). A total of 487 cases with an index event between 2015 and 2019 and 974 matched controls (N = 1,461) were selected for the training and cross-validation of the prediction model. The 106 cases with an index event between 2020 and 2021 and 212 matched controls (N = 318) were selected for the prediction model testing (see S1 Fig).

S2 Table shows the demographic characteristics of cases and controls from the Detection and Prediction cohorts used to develop and test the CNN models. The Detection cohort was largely white (i.e., 64% of cases and 64% of controls) and non-Hispanic/Latino (i.e., 81% of cases and 80% of controls). There was a statistically significant difference in age (Mann-Whitney U = 1,782,666, n1 = 1,538, n2 = 3,012, p < 0.001) and sex (χ2 (1, N = 4,550) = 126.270, p < 0.001) between cases and controls. The case group was younger on average and had more males (55%) than the control group (48%). There were no significant differences in race (χ2 (2, N = 4,550) = 0.1632, p = .921), or ethnic background (χ2 (2, N = 4550) = 4.4055, p = .110) between cases and controls.

The Prediction cohort was largely female (i.e., 57% of cases and 60% of controls), white (i.e., 67% of cases and 69% of controls), and non-Hispanic/Latino (i.e., 84% of cases and 85% of controls). There were no significant differences in age (Mann-Whitney U = 349,408, n1 = 593, n2 = 1,186, p = 0.826), sex (χ2 (1, N = 1,779) = 1.596, p = .207), race (χ2 (2, N = 1,779) = 0.126, p = .937), or ethnicity (χ2 (2, N = 1,779) = 0.079, p = .961) between cases and controls.

Clinical note characteristics

S1 Table lists the types of notes used in the analyses and their frequency per study group. In the Detection cohort, the most frequent note type for cases was Emergency Department (ED) Notes, while the most frequent note type for controls was Progress Notes. On average, cases had roughly six times as many notes per person as controls (11.6 vs. 1.98 notes per person) in the Detection cohort. In the Prediction cohort, the most frequent note type was Progress Notes for both cases and controls. On average, controls had approximately 4.5 times as many notes as cases (i.e., 163.5 vs. 36.4 notes per person) in the Prediction cohort.

ICD validation through manual review

The interrater reliability for the manual review of clinical notes recorded within 24 hours after the index event was 0.986. In the sample of 400 patients (200 cases and 200 controls), 39 were diagnosed with ICD codes for suicide attempts or ISH when it was not documented in their clinical notes (i.e., false positives), and 2 patients were not assigned ICD codes for suicide attempts or ISH when their clinical notes indicated they should have (i.e., false negatives). Overall, the accuracy of the ICD codes against the manual review was 0.90, Precision was 0.80, and Recall was 0.99.

Detection of concurrent ISH

The performance of the machine learning models in detecting suicide or ISH events from clinical notes recorded within 24 hours of the index event is presented in Table 1. Fig 2 displays the receiver operating characteristic (ROC) curves for the different models. All models performed well on the ISH detection task, with AUCs over 0.94. In general, the CNN models outperformed the bag-of-words models (i.e., NB and RF), with p-values < .05 using the DeLong test. When rounded to two digits after the decimal, both CNN models (whether WEs were randomly initialized or initialized with W2V) had very similar performance with no significant differences (p-value approaching 1). Both showed an AUC of 0.99 and an F1 score of 0.94. The NB model (Table 1, NB) demonstrated an AUC of 0.95 and an F1 score of 0.86. Lastly, for the RF model (Table 1, RF), the AUC was 0.98 and the F1 score was 0.92. Accuracy, Precision (or Positive Predictive Value), Recall (or Sensitivity), Specificity, 95% AUC confidence intervals, and the significance of differences in AUC for all models are available in Table 1.

Table 1. Performance of the ML models in Detecting and Predicting Suicide/ISH. The highest metrics are bolded.

Models Detecting Concurrent ISH Events
Model AUC (95% CI) Accuracy Precision Recall F1-score Specificity AUC Significantly Higher Than
NB 0.947 (0.933-0.961) 0.894 0.801 0.932 0.861 0.874
RF 0.981 (0.973-0.989) 0.946 0.924 0.924 0.924 0.958 NB
CNNr 0.986 (0.980-0.993) 0.956 0.926 0.953 0.939 0.958 NB, RF
CNNw 0.987 (0.979-0.994) 0.957 0.928 0.953 0.940 0.960 NB, RF
Models Predicting Future ISH Events
Model AUC (95% CI) Accuracy Precision Recall F1-score Specificity AUC Significantly Higher Than
NB 0.757 (0.701-0.813) 0.745 0.634 0.557 0.593 0.840
RF 0.790 (0.734-0.846) 0.789 0.800 0.491 0.608 0.939
CNNr 0.817 (0.768-0.866) 0.792 0.756 0.557 0.641 0.910 NB
CNNw 0.812 (0.763-0.861) 0.774 0.718 0.528 0.609 0.896 NB
Detecting Concurrent ISH Events based on the Gold Standard (manual review of 400 notes)
Model AUC (95% CI) Accuracy Precision Recall F1-score Specificity
CNNr 0.950 (0.915-0.986) 0.902 0.93 0.866 0.897 0.936
CNNw 0.956 (0.923-0.989) 0.912 0.942 0.877 0.908 0.947

NB: Naïve Bayes model; RF: Random Forest model; CNNr: Convolutional neural network model with randomly initialized word embeddings; CNNw: Convolutional neural network model with W2V initialized embeddings; significant differences in AUC were determined using the DeLong test.

Fig 2. ROC of models for suicide/ISH Detection.


CNN 1: Convolutional neural network model with randomly initialized word embeddings; CNN 2: Convolutional neural network model with W2V initialized embeddings; NB: Naïve Bayes model; RF: Random Forest model.

Fig 3 depicts the 15 most important stemmed words determined using variable importance analysis of the RF model. The two most important stemmed words for classification were “ed” and “suicid”, followed by three words related to Florida’s Baker Act (legislation providing for emergency services and temporary detention of individuals for up to 72 hours for mental health examination): “act”, “ba”, and “baker”.

Fig 3. Variable importance as measured by the RF for suicide/ISH Detection.


“Act”, “ba”, and “baker” likely relate to Florida’s Baker Act, legislation surrounding mental health crises. “Rn” and “sw” are common abbreviations for providers (i.e., registered nurse and social worker).

Prediction of future ISH events

The performance of the machine learning models in the prediction of suicidal behavior or ISH events from clinical notes recorded 30–180 days before the index event is presented in Table 1. Fig 4 displays the ROCs for all models. Here again, the CNN models outperformed both the NB and RF models. The CNN model with randomly initialized WEs (CNNr) demonstrated an AUC of 0.82 and an F1 score of 0.64. The CNN model with WEs initialized from the pre-trained W2V (CNNw) demonstrated an AUC of 0.81 and an F1 score of 0.61. The NB model demonstrated an AUC of 0.76 and an F1 score of 0.59. Lastly, for the RF model, the AUC was 0.79 and the F1 score was 0.61. Using the DeLong test, both CNN AUCs were significantly higher than that of NB (p < 0.05). However, there were no significant differences between either of the CNNs and the RF, and no significant differences between the RF and NB. Whereas the CNNw model was numerically highest across all metrics for the detection task, the CNNr model was highest for the prediction task on all metrics except Precision and Specificity, for which the RF model scored highest.

Fig 4. ROC of models for the prediction of suicide/ISH.


CNN 1: Convolutional neural network model with randomly initialized word embeddings; CNN 2: Convolutional neural network model with W2V initialized embeddings; NB: Naïve Bayes model; RF: Random Forest model.

S2 Fig shows the 15 most important stemmed words as measured by the RF. The three most important stemmed words for classification were “mdm” (i.e., Medical Decision Making), “ed”, and “suicid”.

Detection of concurrent ISH based on gold standard labels

The performance of the CNN models in detecting concurrent ISH based on the gold standard labels is presented in Table 1. The CNN model with randomly initialized WEs (CNNr) demonstrated an AUC of 0.950 and an F1 score of 0.897. The CNN model with WEs initialized from the pre-trained W2V (CNNw) demonstrated an AUC of 0.956 and an F1 score of 0.908.

Discussion

Computational approaches such as deep learning have strong potential to improve ISH detection and prediction, as well as consistency in predictive performance, but studies to date have been limited by a general lack of consistent performance and replication of the process [27,29–31,33–36,38]. The benefit of such deep learning models is that they can add value to the predictive power of traditional suicide prediction models. In this study, we demonstrated that leveraging medical records with a well-defined set of ISH ICD codes of reasonably high accuracy (0.90 in our study) as “silver standard” labels to train deep learning models to detect and predict ISH using clinical note text is feasible and replicable at more than one institution. The results of this study are compatible with our previous work at another institution, where we compared different deep learning models using a similar temporal validation approach (i.e., training on data from 2012–2017 and testing on data from 2018–2019) [27]. As in that work, the models achieved near-ceiling performance on the phenotyping task (i.e., the automated identification of concurrent ISH events in clinical notes) and a comparable AUC of 0.82 for the predictive task, suggesting that this approach could be replicated at other academic medical centers for both surveillance purposes and the identification of patients at risk of future suicidal behavior. Further, upon closer examination of the false positives and false negatives by ICD as compared with manual review, we noted that the CNN classification was correct in 5 of the 39 ICD false positives and 1 of the 2 ICD false negatives, which suggests that the deep learning model could provide more precise retrieval of cases (e.g., for surveillance tasks) than simple ICD queries. It is worth noting that, although the F1 scores seem low (0.64), the predictive model’s performance (especially as highlighted by the AUC) is fairly competitive with what has been previously reported in the literature using different approaches with EHR data, including structured data; however, consideration should be given to the different types of predictive modeling previously used as well as different settings and feature sets. Nonetheless, one can calibrate the classification threshold to trade recall/sensitivity against specificity depending on the setting. For example, to improve recall, we could reduce the threshold to pick up more cases for referral. The high AUC suggests that we are more likely to find a good balance for a given use case or environment in which such models could be used. In critical applications like suicide prediction, maximizing recall is critical. By lowering the probability threshold, we can increase the likelihood of identifying most patients at risk, even if it means having more false positives. We should note, however, that it is also crucial to continuously evaluate and refine the models, incorporating feedback from healthcare professionals and data from real-world outcomes to improve their accuracy and effectiveness over time.
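As a small illustration of the threshold calibration discussed above, one could select the operating point with the highest specificity among all thresholds that meet a target recall; the sketch below assumes the pROC roc object from the evaluation step, a recent pROC version (coords returning a data frame), and an arbitrary 0.90 sensitivity target.

```r
# One way to pick a lower operating threshold that favors recall (sensitivity).
library(pROC)

ops <- coords(roc_cnn, x = "all",
              ret = c("threshold", "sensitivity", "specificity"))
candidates <- ops[ops$sensitivity >= 0.90, ]            # keep thresholds meeting the recall target
threshold  <- candidates$threshold[which.max(candidates$specificity)]
threshold   # classify a note string as ISH when predicted probability >= this value
```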

The variable importance analysis of the RF model in both the detection of concurrent events and the predictive models provided some insight into words in clinical notes that are indicative of ISH. Detected important variables included word stems that could potentially be generalizable to other sites, such as “suicid”, “attempt” and “ed”, as well as words that are relevant only in the local context, such as “ba” as in the Baker Act, which are specific to Florida. These words could be highlighted during the implementation of such models to draw the attention of clinicians in busy settings (e.g., primary care settings) and instill confidence and trust by clinicians in the results of risk prediction, with the increased likelihood of referral to mental health services for appropriate management [5,5557].

Along these lines, a limitation of the bag-of-words models is that they rely on instances of single words (unigrams). Therefore, in a situation where a clinician documents the absence of an event (i.e., not suicidal), the model may misclassify the note. However, these models still evaluate the whole note by assessing the frequencies of other words relevant to ISH that increase the probability that a patient is, in fact, an ISH patient. Moreover, with deep learning models (i.e., CNNs), the sequence of the words (or tokens) is preserved, which has a greater potential of picking up such negations.

Clinical implications

Current approaches to suicide risk assessment can be inefficient, costly, and limited in their ability to predict future ISH and suicide deaths [4,7,8,10,11]. As such, machine and deep learning methods that can improve the efficient and accurate identification of individuals who engage in or are likely to engage in ISH behavior have tremendous potential clinical benefits. For example, EHR systems that integrate these advanced analytical approaches can alert providers across a patient’s care team to possible risk without relying on the repeated administration of clinical interviews and questionnaires, and encourage appropriate referral, intervention, and follow-up over time. This would be especially useful in cases where ISH risk is documented in prior clinical notes but is not added to a patient’s primary problem list, and therefore not easily viewable by providers who were not the original documenter. Similarly, these approaches can be used to analyze data from multiple clinical encounters, accounting for risk factors documented by some providers and not others, resulting in a more complete synthesis of a patient’s risk. This would be useful for informing patient risk assessment–including quantifying risk levels–as well as the observation of those risk levels over time. Therein also lies the potential for data-driven clinical decision support tools that inform the development of individualized treatment plans. Lastly, from a clinical and research perspective, they provide an opportunity to improve our understanding of how ISH and suicidal behavior risk can change over time to inform prevention and intervention initiatives on an individual- and systems-level.

Limitations

This study has limitations that provide opportunities for future research. First, although we were able to show similar results to the original work conducted by Obeid and colleagues [27] at an additional medical center using the same methodology, it is necessary to show replicability across multiple sites. Ideally, we would also want to perform external validation of a trained model (i.e., train at one institution and test the trained model at another site), which was not feasible in this study due to funding limitations and regulatory restrictions. This presents a valuable opportunity to use smaller datasets to train and fine-tune the models. Moreover, although we demonstrated that we can clearly identify ISH, this does not specify an intention to die, so future work will need to examine this approach while incorporating data on fatalities resulting from suicide. Although the variable importance analysis sheds light on important key words in text, this work would benefit from further exploration of interpretable modeling approaches, especially regarding the deep learning models, which are currently thought of as “black boxes”. However, previous work using hierarchical attention networks [57] seems promising for highlighting text that contributes to the outcome of the classifier. Another limitation is the relatively small sample size that we worked with for deep learning modeling. Having a larger sample would allow further examination of predictive time windows, for example, looking at clinical notes 6 months prior to the index visit vs. 4 months, etc. We chose to use a 5-month-long time window between 6 months and 1 month prior to the index visit. Reducing the predictive time window to examine the temporal impact prior to the index event would require more stringent inclusion criteria and thus reduce the number of patients available for each prediction. Further, our case-control matching artificially reduces the non-ISH to ISH ratio, which may not reflect real-world scenarios and may artificially improve the reported accuracy and AUC. Lastly, our approach used text from clinical notes found in the EHR but did not include the integration of structured data, which are also available in the EHR and have strong potential to strengthen the predictive power of the models.

Future directions

There are several directions for this work that our team plans to pursue in the near future. The first is to show replicability and external validation across multiple sites using novel transfer learning methods. We also plan to examine more advanced deep learning models based on transformer architectures, including newer embedding models and large language models, which require a more advanced compute infrastructure than was used with the models described herein. Furthermore, we intend to compare the text-based predictors with traditional models using structured EHR data (e.g., history of previous mental health codes, medications, and other variables; [18]) as well as datasets with patients at high risk for ISH (e.g., with a history of depression). Lastly, as part of these efforts, we will include data on suicide deaths to examine the predictive power of our approach in the context of this outcome while also integrating explanatory modeling approaches.

Conclusions

In this study, we demonstrated that leveraging medical records with a well-defined set of ISH ICD codes to train deep learning models to detect and predict ISH using clinical note text is feasible and generalizable given replicability at another institution in a different region of the U.S. Our results were similar to our prior work where we compared different deep learning models using a similar temporal validation approach. These approaches can improve the efficient and accurate identification and prediction of individuals who engage in ISH behavior and have enormous potential benefits for clinical practice. However, further research is needed to evaluate this approach across multiple sites and in less controlled, real-world settings, incorporating both structured and unstructured EHR data, as well as suicide fatality outcomes—despite the inherent challenges of modeling highly imbalanced outcome variables.

Supporting information

S1 Fig. Flowchart of study population.

(TIF)

pone.0331459.s001.tif (223KB, tif)
S2 Fig. Variable importance as measured by the RF for the prediction of suicide/ISH.

(TIF)

pone.0331459.s002.tif (53.5KB, tif)
S1 Table. Clinical note types and frequency per group.

(DOCX)

pone.0331459.s003.docx (17.2KB, docx)
S2 Table. Participant demographics.

(DOCX)

pone.0331459.s004.docx (16.4KB, docx)

Data Availability

The University of South Florida’s Human Research Protections restricts data sharing to protect patient information and comply with local and national regulations due to sensitive patient information. Public sharing could compromise patient privacy. Researchers seeking access to the data can contact USF Human Research Protections (IRB) at RSCH-IRB@usf.edu referencing the study number STUDY002988.

Funding Statement

This project was supported in part by the National Institute of Mental Health grant number R56 MH124744 and the National Center for Advancing Translational Sciences under Grant Number UL1 TR001450. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Centers for Disease Control and Prevention. Facts about suicide 2025. 2025. Available from: https://www.cdc.gov/suicide/facts/
  • 2.Centers for Disease Control and Prevention. WISQARS Cost of Injury 2025. 2025. Available from: https://wisqars.cdc.gov/cost
  • 3.Chan MKY, Bhatti H, Meader N, Stockton S, Evans J, O’Connor RC, et al. Predicting suicide following self-harm: systematic review of risk factors and risk scales. Br J Psychiatry. 2016;209(4):277–83. doi: 10.1192/bjp.bp.115.170050 [DOI] [PubMed] [Google Scholar]
  • 4.Franklin JC, Ribeiro JD, Fox KR, Bentley KH, Kleiman EM, Huang X, et al. Risk factors for suicidal thoughts and behaviors: a meta-analysis of 50 years of research. Psychol Bull. 2017;143(2):187–232. doi: 10.1037/bul0000084 [DOI] [PubMed] [Google Scholar]
  • 5.Ribeiro JD, Franklin JC, Fox KR, Bentley KH, Kleiman EM, Chang BP, et al. Self-injurious thoughts and behaviors as risk factors for future suicide ideation, attempts, and death: a meta-analysis of longitudinal studies. Psychol Med. 2016;46(2):225–36. doi: 10.1017/S0033291715001804 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Stone DM, Crosby AE. Suicide prevention: state of the art review. Am J Lifestyle Med. 2014;8(6):404–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Bolton JM, Gunnell D, Turecki G. Suicide risk assessment and intervention in people with mental illness. BMJ. 2015;351:h4978. doi: 10.1136/bmj.h4978 [DOI] [PubMed] [Google Scholar]
  • 8.Larkin C, Di Blasi Z, Arensman E. Risk factors for repetition of self-harm: a systematic review of prospective hospital-based studies. PLoS One. 2014;9(1):e84282. doi: 10.1371/journal.pone.0084282 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Agency for Healthcare Research and Quality, U.S. Preventive Services Task Force. The Guide to Clinical Preventive Services 2014. 2014. [PubMed] [Google Scholar]
  • 10.Runeson B, Odeberg J, Pettersson A, Edbom T, Jildevik Adamsson I, Waern M. Instruments for the assessment of suicide risk: a systematic review evaluating the certainty of the evidence. PLoS One. 2017;12(7):e0180292. doi: 10.1371/journal.pone.0180292 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Sall J, Brenner L, Millikan Bell AM, Colston MJ. Assessment and management of patients at risk for suicide: synopsis of the 2019 U.S. Department of Veterans Affairs and U.S. Department of Defense Clinical Practice Guidelines. Ann Intern Med. 2019;171(5):343–53. doi: 10.7326/M19-0687 [DOI] [PubMed] [Google Scholar]
  • 12.Lu H, Barrett A, Pierce A, Zheng J, Wang Y, Chiang C, et al. Predicting suicidal and self-injurious events in a correctional setting using AI algorithms on unstructured medical notes and structured data. J Psychiatr Res. 2023;160:19–27. doi: 10.1016/j.jpsychires.2023.01.032 [DOI] [PubMed] [Google Scholar]
  • 13.Chen Q, Zhang-James Y, Barnett EJ, Lichtenstein P, Jokinen J, D’Onofrio BM, et al. Predicting suicide attempt or suicide death following a visit to psychiatric specialty care: a machine learning study using Swedish national registry data. PLoS Med. 2020;17(11):e1003416. doi: 10.1371/journal.pmed.1003416 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Edgcomb JB, Shaddox T, Hellemann G, Brooks JO 3rd. Predicting suicidal behavior and self-harm after general hospitalization of adults with serious mental illness. J Psychiatr Res. 2021;136:515–21. doi: 10.1016/j.jpsychires.2020.10.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Kessler RC, Hwang I, Hoffmire CA, McCarthy JF, Petukhova MV, Rosellini AJ, et al. Developing a practical suicide risk prediction model for targeting high-risk patients in the Veterans health Administration. Int J Methods Psychiatr Res. 2017;26(3):e1575. doi: 10.1002/mpr.1575 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Kessler RC, Stein MB, Petukhova MV, Bliese P, Bossarte RM, Bromet EJ, et al. Predicting suicides after outpatient mental health visits in the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). Mol Psychiatry. 2017;22(4):544–51. doi: 10.1038/mp.2016.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Kessler RC, Warner CH, Ivany C, Petukhova MV, Rose S, Bromet EJ, et al. Predicting suicides after psychiatric hospitalization in US Army soldiers: the Army Study To Assess Risk and rEsilience in Servicemembers (Army STARRS). JAMA Psychiatry. 2015;72(1):49–57. doi: 10.1001/jamapsychiatry.2014.1754 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Simon GE, Johnson E, Lawrence JM, Rossom RC, Ahmedani B, Lynch FL, et al. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Am J Psychiatry. 2018;175(10):951–60. doi: 10.1176/appi.ajp.2018.17101167 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Young J, Bishop S, Humphrey C, Pavlacic JM. A review of natural language processing in the identification of suicidal behavior. J Affect Disord Rep. 2023;12:100507. doi: 10.1016/j.jadr.2023.100507 [DOI] [Google Scholar]
  • 20.Levis M, Levy J, Dufort V, Gobbel GT, Watts BV, Shiner B. Leveraging unstructured electronic medical record notes to derive population-specific suicide risk models. Psychiatry Res. 2022;315:114703. doi: 10.1016/j.psychres.2022.114703 [DOI] [PubMed] [Google Scholar]
  • 21.Cliffe C, Seyedsalehi A, Vardavoulia K, Bittar A, Velupillai S, Shetty H, et al. Using natural language processing to extract self-harm and suicidality data from a clinical sample of patients with eating disorders: a retrospective cohort study. BMJ Open. 2021;11(12):e053808. doi: 10.1136/bmjopen-2021-053808 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Fernandes AC, Dutta R, Velupillai S, Sanyal J, Stewart R, Chandran D. Identifying suicide ideation and suicidal attempts in a psychiatric clinical research database using natural language processing. Sci Rep. 2018;8(1):7426. doi: 10.1038/s41598-018-25773-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.McCoy TH Jr, Castro VM, Roberson AM, Snapper LA, Perlis RH. Improving prediction of suicide and accidental death after discharge from general hospitals with natural language processing. JAMA Psychiatry. 2016;73(10):1064–71. doi: 10.1001/jamapsychiatry.2016.2172 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Poulin C, Shiner B, Thompson P, Vepstas L, Young-Xu Y, Goertzel B, et al. Predicting the risk of suicide by analyzing the text of clinical notes. PLoS One. 2014;9(1):e85733. doi: 10.1371/journal.pone.0085733 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Tsui FR, Shi L, Ruiz V, Ryan ND, Biernesser C, Iyengar S, et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open. 2021;4(1):ooab011. doi: 10.1093/jamiaopen/ooab011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Walsh CG, Ribeiro JD, Franklin JC. Predicting risk of suicide attempts over time through machine learning. Clin Psychol Sci. 2017;5(3):457–69. [Google Scholar]
  • 27.Obeid JS, Dahne J, Christensen S, Howard S, Crawford T, Frey LJ, et al. Identifying and predicting intentional self-harm in electronic health record clinical notes: deep learning approach. JMIR Med Inform. 2020;8(7):e17784. doi: 10.2196/17784 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Obeid JS, Weeda ER, Matuskowitz AJ, Gagnon K, Crawford T, Carr CM, et al. Automated detection of altered mental status in emergency department clinical notes: a deep learning approach. BMC Med Inform Decis Mak. 2019;19(1):164. doi: 10.1186/s12911-019-0894-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. doi: 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
  • 30.Du J, Zhang Y, Luo J, Jia Y, Wei Q, Tao C, et al. Extracting psychiatric stressors for suicide from social media using deep learning. BMC Med Inform Decis Mak. 2018;18(Suppl 2):43. doi: 10.1186/s12911-018-0632-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Fu G, Song C, Li J, Ma Y, Chen P, Wang R, et al. Distant supervision for mental health management in social media: suicide risk classification system development study. J Med Internet Res. 2021;23(8):e26119. doi: 10.2196/26119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Abdulsalam A, Alhothali A. Suicidal ideation detection on social media: a review of machine learning methods. Soc Netw Anal Min. 2024;14(1). doi: 10.1007/s13278-024-01348-0 [DOI] [Google Scholar]
  • 33.Ramírez-Cifuentes D, Freire A, Baeza-Yates R, Puntí J, Medina-Bravo P, Velazquez DA, et al. Detection of suicidal ideation on social media: multimodal, relational, and behavioral analysis. J Med Internet Res. 2020;22(7):e17758. doi: 10.2196/17758 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Galatzer-Levy I, Abbas A, Ries A, Homan S, Sels L, Koesmahargyo V, et al. Validation of visual and auditory digital markers of suicidality in acutely suicidal psychiatric inpatients: proof-of-concept study. J Med Internet Res. 2021;23(6):e25199. doi: 10.2196/25199 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Cusick M, Adekkanattu P, Campion TR Jr, Sholle ET, Myers A, Banerjee S, et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res. 2021;136:95–102. doi: 10.1016/j.jpsychires.2021.01.052 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Rozova V, Witt K, Robinson J, Li Y, Verspoor K. Detection of self-harm and suicidal ideation in emergency department triage notes. J Am Med Inform Assoc. 2022;29(3):472–80. doi: 10.1093/jamia/ocab261 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Workman TE, Goulet JL, Brandt CA, Warren AR, Eleazer J, Skanderson M, et al. Identifying suicide documentation in clinical notes through zero-shot learning. Health Sci Rep. 2023;6(9):e1526. doi: 10.1002/hsr2.1526 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Burke TA, Ammerman BA, Jacobucci R. The use of machine learning in the study of suicidal and non-suicidal self-injurious thoughts and behaviors: a systematic review. J Affect Disord. 2019;245:869–84. doi: 10.1016/j.jad.2018.11.073 [DOI] [PubMed] [Google Scholar]
  • 39.Beam AL, Manrai AK, Ghassemi M. Challenges to the reproducibility of machine learning models in health care. JAMA. 2020;323(4):305–6. doi: 10.1001/jama.2019.20866 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Hedegaard H, Schoenbaum M, Claassen C, Crosby A, Holland K, Proescholdbell S. Issues in Developing a Surveillance Case Definition for Nonfatal Suicide Attempt and Intentional Self-harm Using International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) Coded Data. Natl Health Stat Report. 2018;(108):1–19. [PubMed] [Google Scholar]
  • 41.Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377–81. doi: 10.1016/j.jbi.2008.08.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Mikolov T. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013. [Google Scholar]
  • 43.McCallum A, Nigam K. A comparison of event models for naive bayes text classification. AAAI-98 workshop on learning for text categorization. Madison, WI: 1998. . [Google Scholar]
  • 44.Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32. doi: 10.1023/a:1010933404324 [DOI] [Google Scholar]
  • 45.Kim Y. Convolutional Neural Networks for Sentence Classification. Conference on Empirical Methods in Natural Language Processing. Doha, Qatar; 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. [Google Scholar]
  • 47.Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Polit Anal. 2007;15(3):199–236. [Google Scholar]
  • 48.Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, et al. Quanteda: An R package for the quantitative analysis of textual data. J Open Sourc Softw. 2018;3(30):774. [Google Scholar]
  • 49.Google LLC. Keras: the Python deep learning API. 2022 Available from: https://keras.io/
  • 50.TensorFlow [Internet]. Zenodo.org. 2022. Available from: 10.5281/ZENODO.4724125 [DOI] [Google Scholar]
  • 51.Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv. 2015. [Google Scholar]
  • 52.Kuhn M. The caret package. Vienna, Austria: R Found Stat Comput. 2011. Available from: https://CRAN.R-project.org/package=caret
  • 53.Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74. doi: 10.1016/j.patrec.2005.10.010 [DOI] [Google Scholar]
  • 54.DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45. doi: 10.2307/2531595 [DOI] [PubMed] [Google Scholar]
  • 55.Panigutti C, Beretta A, Giannotti F, Pedreschi D. Understanding the impact of explanations on advice-taking: a user study for AI-based clinical Decision Support Systems. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems; 2022. [Google Scholar]
  • 56.Ribeiro MT, Singh S, Guestrin C. “ Why should i trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. [Google Scholar]
  • 57.Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 2016. [Google Scholar]

Decision Letter 0

Braja Gopal Patra

11 Oct 2024

PONE-D-24-31813: Automated detection and prediction of suicidal behavior from clinical notes using deep learning. PLOS ONE

Dear Dr. Tsalatsanis,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 25 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Braja Gopal Patra, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Thank you for stating the following financial disclosure: "This project was supported in part by the National Institute of Mental Health grant number R56 MH124744 and the National Center for Advancing Translational Sciences under Grant Number UL1 TR001450. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health." Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. We note that you have indicated that there are restrictions to data sharing for this study. For studies involving human research participant data or other sensitive data, we encourage authors to share de-identified or anonymized data. However, when data cannot be publicly shared for ethical reasons, we allow authors to make their data sets available upon request. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Before we proceed with your manuscript, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., a Research Ethics Committee or Institutional Review Board, etc.). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories. You also have the option of uploading the data as Supporting Information files, but we would recommend depositing data directly to a data repository if possible. Please update your Data Availability statement in the submission form accordingly.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The authors need to more explicitly articulate why and how they demonstrate the generalizability of the methodology they apply.

I appreciate that the authors' objective is to reproduce the methodology of Obeid et al. (2020), but it is not clear why it would be a challenge to reproduce the methodology given the existence of ICD-coded data and access to notes. Why is it interesting to investigate the reproduction of this method? What can we learn from this attempt? The authors do not explain this point or truly "validate" the reproduction of the method, as they claim is the core objective of the paper (p. 5). Is it enough to follow the same methodology and present results of a model built from that methodology to "evaluate [the methodology's] generalizability to other academic medical centers"? Given that -- as the authors state -- the method hinges on the availability and quality of ICD coded records, that seems to be the only point. Or is it that a certain expected level of performance of the model must be achieved to claim that the methodology is generalizable? It simply isn't clear what the criteria for "validating" the generalizability of the methodology are. How can we claim success on this goal?

2. The real-world applicability of the method is still not adequately addressed/discussed given artificial selection criteria when constructing the cohorts.

More importantly, the significance of the original methodology and the new application of that methodology in this new setting is unclear. As the authors themselves point out, it is more important to examine how well a trained model itself generalizes -- both to more real-world settings, and to other hospital sites. This is not tackled by the authors of this work. As I will mention below, the datasets used to train and test the authors' model do not appear to be fully realistic. The test data in particular would be much more interesting if it reflected the full complexity of the real-world data environment, including the low prevalence of ISH presentations overall.

I grant that the performance of the models is well beyond random chance, but overall it seems that there is significant uncertainty as to how well this model would work prospectively in the natural data environment, given the steps taken to artificially ensure a minimum number of Cases in the datasets (at least 1/3), while also not controlling for Controls that are the most similar to the Cases. Controls are "matched" only in terms of a superficial note profile, plus to control numbers, not to identify the most likely potential confusers for the model.

When constructing the cohort datasets, did you ensure that the same patients did not appear in both training and test data sets? i.e. not simply splitting at the event level, but rather at the patient level. This is important to control for data leakage.
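
A patient-level split of the kind the reviewer describes can be illustrated with a minimal sketch (toy data, using scikit-learn's GroupShuffleSplit; this is not the authors' actual pipeline):

    # Minimal sketch of a patient-level train/test split to avoid leakage
    # (illustrative only; toy data, not the authors' pipeline).
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    notes_df = pd.DataFrame({
        "patient_id": [1, 1, 2, 3, 3, 3, 4, 5],
        "text": ["note"] * 8,
        "label": [1, 1, 0, 0, 0, 0, 1, 0],
    })

    # Group by patient so all of a patient's notes land in the same partition.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
    train_idx, test_idx = next(splitter.split(notes_df, groups=notes_df["patient_id"]))
    train_df, test_df = notes_df.iloc[train_idx], notes_df.iloc[test_idx]

    # Sanity check: no patient contributes notes to both partitions.
    assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])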

The data also appears to include "ED After Visit Summary (AVS)" reports, which I suppose are something like a Discharge Summary. Given that this is very likely to have a complete review/summary at the end or after of an ED episode, it seems that this type of report should be excluded as it will not be relevant to detecting a patient with ISH while they are in care.

Why was a 1:5 ratio adopted for Cases:Controls? To what extent is this representative of the overall prevalence of ISH in the data? Then, why was a 1:2 ratio assumed for "matched controls" in the final training/testing cohorts? This results in training/test cohorts that are very far out of balance from the a priori probability of ISH in the overall dataset.

I assume that "Controls" are patients that have no (0) suicide attempts or ISH ICD Codes at all in the data, although this is not fully specified. Then you filter out Controls that meets the notes requirements.

- What is the "index event" that is considered for a Control?

- Significant differences in the types of clinical texts are reported for cases vs controls in both the Detection and Prediction cohorts. This arguably means that distinguishing between cases and controls is easier; it is not only the content of the notes that varies, but the a priori nature of the information they reflect. These differences also imply substantive differences in the nature of problems and patterns of care that the cases and controls may be attended for. There would presumably be much more extra-cases variability than intra-cases variability, which again makes distinguishing between them potentially very easy.

- What is the reason for Cases having so many more notes than Controls (11.6 vs 1.98) -- is this because there are many (short) ED notes cf. fewer (long) Progress Notes for Controls?

- Given the predominance of ED notes for Cases cf Controls, why was a "matched control" not one that had a similar notes profile to a Case, in terms of type of notes, number of notes, length?

3. Clarify manual review process.

It is not fully clear what data the manual review of clinical notes includes -- does this mean up to 8000 words of text, as is prepared for the model? Or full chart review?

4. Gold standard needed for testing.

It is one thing to *train* supervised learning models with "silver standard" data. It is another to *test* the models on silver standard data. What did you do to ensure the reliability of the final results on the held-out test data?

Reviewer #2: In this manuscript the authors posit that deep learning approaches have great potential to significantly improve traditional suicide prediction models to detect and predict intentional self-harm and report that existing research is limited by a general lack of “consistent performance” and “external reproducibility”. The work in the paper is aimed to “validate a deep learning approach” used in previous research [reference 27] to detect and predict ISH using clinical note text and “evaluate its generalizability to other academic medical centers”.

The manuscript is well written and properly structured without significant obstacles to understandability in terms of language and format. The methodology is sound and well explained, though the methodology is not the novel contribution of the manuscript and has been published and implemented in [27]. Therefore, as I understand it, the main contribution of this paper is applying this modeling process (methodology) to another data set, as also indicated by the statement from the manuscript “Thus, the purpose of the current study was to validate the approach used previously by Obeid et al. (2020) and to evaluate its generalizability to other academic medical centers.” (page 5 of manuscript).

In the limitations section the authors indicate “This study has limitations that provide opportunities for future research. First, although we were able to reproduce the original work conducted by Obeid and colleagues (2020) at an additional medical center, it is necessary to show reproducibility across multiple sites. Ideally, we would also want to perform external validation of a trained model (i.e., train at one institution and test the trained model at another site).”

I believe the following revisions will improve the manuscript:

1. The authors should go through the manuscript and clarify the distinction between model and process (methodology) validation and specify that the contribution of the paper is to process validation and not model validation.

2. The authors indicate that lack of consistent performance is an important limitation of existing research. While, at least as far as process validation is concerned, they are able to produce consistent results, the authors are silent on what could be the potential cause of such variability. Is it the data used, the methodologies, a combination of the two? Or is it simply to be expected that such variability is inevitable when such a wide net is considered, including structured EHR data vs. clinical note text and sub-populations such as those with eating disorders, first-time suicide attempts, etc.? I believe the authors possess the experience to address, or at a minimum discuss, the reason for this gap in existing research.

3. The audience would be greatly interested in understanding why a model validation (along with process validation) was not reported in the manuscript. Did the authors face difficulties or simply have not attempted it (to me it does not seem like a difficult extension)? As the authors themselves indicate, this is also very important, and the readers would appreciate their experience and perspective.

4. The cited literature is until 2022 and should be updated as we are approaching 2025.

5. The term “concurrent ISH” is used several times in the manuscript. The reader is not clear on the meaning of the word “concurrent” here.

6. On page 11, the section on “Clinical Note Types and Frequencies” presents a number of results, and it is unclear why these are of interest or how these impact the detection and/or the prediction process. Perhaps these should be discussed in the discussion section.

7. On page 14, “Although the use of pre-trained W2V word embeddings did not significantly improve the accuracy in the phenotyping task or exceed the accuracy using randomly initialized word embeddings in the predictive task, we have shown in previous unpublished work that W2V can result in faster training requiring fewer epochs during the training of classifiers. This effect was not evaluated in this study.” perhaps should be removed as it relies on unpublished work.

Reviewer #3: This research paper attempted to validate a previous deep learning approach they had developed to detect and predict intentional self-harm using clinical note text. A secondary aim was to assess a set of ICD codes as a silver standard for training models on the automated detection of intentional self-harm and to predict future encounters of ISH.

Some questions below:

1. I’m getting a little lost in some circular logic in terms of clinical validity and utility that mostly stem from use of the ICD codes.

a. Minor comment: the term “well-defined set” is vague. Is it the full list from Hedegaard, 2018?

b. As stated in that same paper and then found extensively in the suicide NLP literature, ICD codes are very incomplete in detecting suicidality. Perhaps self-harm behavior and suicide attempt is a bit better than suicide ideation, but it is still known to have very low sensitivity for identifying cases, with less than half (and sometimes even much, much less than that) of cases detected by ICD codes. The authors’ manual validation of cases and controls as identified by ICD codes is in stark contrast to any literature known by this reviewer. That begs the question of why. Is it the underlying cohort of patients/corpus of notes? The paper does not specify the details of the underlying corpus of 1,538 and 3,012 cases and controls. Is this from the entire EHR in the specified date range? Perhaps the manual review for the controls was affected by ISH being a rare event so not enough notes were seen? Perhaps the control group should be enriched to get a better sense of the false-negative rate of ICD code detection. Another thing that made this reviewer question the validity of the cases and controls was the supplemental material that indicated that many of the ISH encounter notes were from things like anesthesia pre-op OR nursing, and procedure notes. That seems quite odd that discussion and/or coding of self-harm would come up in this setting. During a surgical procedure? Or while the anesthesiologist is seeing the patient for 5 minutes to brief them? I would be curious about the strings that are being reviewed by the manual reviewers from these notes. Are the strings extracted out of order from the original note? Perhaps there is something that is causing a lot of false-positive mentions here… If one of the objectives of the paper is to validate the ICD list as a silver standard (which again goes against the literature), there needs to be more discussion here to establish credibility.

c. If the determination is that ICD codes have precision of 0.8 and recall of 0.99 (which again seem suspect), those accuracy scores are much greater than the deep learning approaches. What then is the rationale for using the deep learning approach?

2. Question regarding concurrent ISH: This reviewer is unclear if this means the ISH event occurred leading up to this encounter? That is what the text seems to imply. But it’s impossible to distinguish from it just being the first time that the ISH is known and documented, right? Like if it’s their first time seeing a psychiatrist and the psychiatrist asks about history of suicide attempts even though it happened 10 years ago? Or do you exclude historic mentions somehow?

3. Along the same lines, how were negated mentions accounted for? Many clinicians might document the absence of ISH.

4. The justification sentence: “However, current approaches to suicide risk assessment involve healthcare professionals administering clinical interviews and questionnaires, which can be inefficient, costly, and limited in their ability to predict future ISH and suicide deaths.” This is circular logic since the notes themselves are documentation of the clinical interviews and assessments. We wouldn’t want clinicians to stop doing what they do—assessing patients. And if they theoretically did that, what would the algorithm use to predict, since it is relying on mentions of “suicid” etc.?

5. This implication in the Clinical Implications section makes a lot of sense and could be emphasized more including in the objectives: “This would be especially useful in cases where ISH risk is documented in prior clinical notes but is not added to a patient’s primary problem list, and therefore not easily viewable by providers who were not the original documenter.”

6. Later on in the same section: “also the observation and adjustment of those risk levels over time” – not sure what that means.

7. In the limitations section: not sure what this means: “This highlights the need to examine this approach in the using data on fatalities resulting from suicide.”

8. This reviewer is not sure how similar/different this approach was to the original 2020 article from which it drew its methods. It would be helpful to know what has changed and what is different. And why the authors chose to continue these now outdated techniques and to not use language models. Not necessarily saying language models are always better, but would be an important discussion.

9. Finally, the 0.61-0.64 F1 scores are not so good… What does that mean about how to interpret this work and future steps?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Sep 15;20(9):e0331459. doi: 10.1371/journal.pone.0331459.r002

Author response to Decision Letter 1


14 Feb 2025

We thank the reviewers for their valuable comments. We have revised the manuscript according to the comments and have provided a detailed response to every comment received in the "Response to Reviewers" document. We believe that we have addressed the concerns and that the revised manuscript is much improved.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0331459.s005.docx (43.3KB, docx)

Decision Letter 1

Braja Gopal Patra

5 Jun 2025

PONE-D-24-31813R1

Automated detection and prediction of suicidal behavior from clinical notes using deep learning

PLOS ONE

Dear Dr. Tsalatsanis,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 20 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Braja Gopal Patra, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: (No Response)

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Overall re-review: While I appreciate the authors' response to the previous round of reviews and the corresponding revisions to the paper, I still have some concerns about the overall significance of the study in relation to the claims that are being made. A stronger emphasis on process replicability cf model replicability is required, as is more discussion of the limitations of the process, particularly the data selection methodology. I do agree that it is valuable to explore the application of the process in a new data context, but claims related to the "high AUC" should be moderated given the unnatural data distribution adopted for evaluation.

[Original] REVIEWER COMMENT: Given the predominance of ED notes for Cases cf Controls, why was a "matched control" not one that had a similar notes profile to a Case, in terms of type of notes, number of notes, length?

AUTHOR RESPONSE: The reason for not having similar note profiles between cases and matched controls is because we used the nearest neighbor matching algorithm, which prioritizes cases based on proximity in selected characteristics rather than aiming to create an identical profile for all features, such as type, number, and length of notes.

[New] REVIEWER COMMENT: I find that this is not satisfactorily addressed. The differences in data between cases and matched controls would seem to bias the model, i.e. making it easier to detect the cases based on arguably spurious factors. I understand that the characteristics that were selected include "age, gender, race, ethnicity, and number of notes" -- so number of notes is considered, but only as one of 5 factors, of which 3 are categorical demographic features and therefore have less variance (so will dominate the matching). It would have been more interesting to identify controls that are more likely to be confusers, e.g. those with ICDs for other non-ISH psychological presentations, or as previously suggested that align more strongly in terms of the types of note -- noting the substantial differences between cases and controls described in the section "Clinical Note Characteristics". Clearly if the cases and controls have very different types of notes then distinguishing them shouldn't be very difficult for the model.
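
The matching step under discussion can be illustrated with a minimal sketch of nearest-neighbor control matching on the five characteristics listed above (toy data and hypothetical column names; not the authors' actual code):

    # Illustrative sketch of nearest-neighbor control matching on age, gender,
    # race, ethnicity, and note count (toy data, hypothetical column names).
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import NearestNeighbors

    cases_df = pd.DataFrame({"age": [34, 52], "gender": ["F", "M"],
                             "race": ["W", "B"], "ethnicity": ["NH", "NH"],
                             "n_notes": [12, 9]})
    controls_df = pd.DataFrame({"age": [30, 55, 47, 21, 60, 33],
                                "gender": ["F", "M", "M", "F", "M", "F"],
                                "race": ["W", "B", "W", "B", "W", "W"],
                                "ethnicity": ["NH", "NH", "H", "NH", "NH", "H"],
                                "n_notes": [2, 3, 1, 4, 2, 2]})

    # One-hot encode categoricals and align control columns to the case columns.
    cases_X = pd.get_dummies(cases_df)
    controls_X = pd.get_dummies(controls_df).reindex(columns=cases_X.columns, fill_value=0)

    # Standardize, then pick the two nearest controls per case (a 1:2 ratio).
    scaler = StandardScaler().fit(pd.concat([cases_X, controls_X]))
    nn = NearestNeighbors(n_neighbors=2).fit(scaler.transform(controls_X))
    _, idx = nn.kneighbors(scaler.transform(cases_X))
    matched_controls = controls_df.iloc[idx.ravel()].drop_duplicates()
    print(matched_controls)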

In general, more error analysis would help to elucidate what the model is actually learning. For example, for the "39 were diagnosed with ICD codes for suicide attempts or ISH when it was not documented in their clinical notes (i.e., False Positives)" and the "2 patients were not assigned ICD codes for suicide attempts or ISH when their clinical notes indicated they should have (i.e., False Negatives)." -- what was the performance of the model on these cases? Did the models also get confused by the FP cases, and did they detect the FN cases?

[New] REVIEWER COMMENT on Model Validation:

The authors have responded to the concern raised by both reviewers about lack of *model* validation (cf process validation) by citing (a) funding limitations, and (b) challenges of data sharing. I do not understand why data sharing is relevant -- you simply need to share the *trained model* from one hospital, and apply it to the other data. This does not require sharing any data itself. You have access to and have already prepared both data sets; it should be straightforward to take a trained model and apply it to the test data from the other hospital. This is far less time-consuming or costly than the manual analysis that you have added.
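
The model-only sharing the reviewer describes can be illustrated with a minimal sketch (a toy Keras CNN, synthetic data, and hypothetical file names purely for illustration; in practice the fitted tokenizer/vocabulary would also need to be shared so notes are vectorized identically at both sites):

    # Illustrative sketch: cross-site validation by sharing only the trained
    # model artifact (toy model and synthetic data; hypothetical names).
    import numpy as np
    from tensorflow import keras
    from sklearn.metrics import roc_auc_score

    # Site A: a stand-in for the trained CNN text classifier.
    model = keras.Sequential([
        keras.Input(shape=(200,)),
        keras.layers.Embedding(input_dim=5000, output_dim=16),
        keras.layers.Conv1D(32, 5, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.save("ish_cnn_site_a.keras")  # only this artifact is shared

    # Site B: load the shared artifact and score locally prepared note sequences.
    shared_model = keras.models.load_model("ish_cnn_site_a.keras")
    np.random.seed(0)
    site_b_sequences = np.random.randint(0, 5000, size=(100, 200))  # toy token ids
    site_b_labels = np.random.randint(0, 2, size=100)                # toy labels
    site_b_probs = shared_model.predict(site_b_sequences).ravel()
    print("External validation AUC:", roc_auc_score(site_b_labels, site_b_probs))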

[New] REVIEWER COMMENT on prior work:

I appreciate the added paragraph on page 4 with prior work; individual cited work should have the relevant citation numbers added there, e.g. "Cusick and colleagues (2021)^35". Note, however, that here and elsewhere (e.g. p3) it is not meaningful to compare AUC for different task framing and different input data; the inherent difficulty of prediction will be very different from that of detection, and the difficulty of detection will vary depending on how much data is available (e.g. a single ED triage note cf. 6 months of clinical notes). The added statement "the predictive model's performance (especially as highlighted with the AUC) is fairly competitive with what has been previously reported in the literature using different approaches with EHR data, including structured data." really isn't meaningful without more explicit clarification of comparability of the task settings.

[New] REVIEWER COMMENT on index event:

"Notes recorded within 24 hours of the index event" -- does this mean 24 hours before or after the index event? I assume before, given how the data was collected, but this should be clarified.

REVIEWER RESPONSE to Author response "the replication of the prior methodology is significant as this is a core tenet of the scientific method." I humbly submit that in the context of data-driven methods, the authors' argument here about scientific method is not the key concern. What you are measuring in this work is not the inherent validity of the method, which requires some extrinsic truth to hold. You are measuring the stability of the method as applied in two different data contexts. However, if the method is biased, and the data selection methods are biased, two similar results only reflect consistency in the bias, not evidence that the method is "valid". The similar performance of the method in these two distinct data contexts may be a result of the process itself, not the inherent correctness of the method.

Neither the original work nor this replication are applied prospectively under the condition of real-world data distribution. Rather, both are evaluated only with highly controlled (and similarly biased) selection of cases and controls. The authors have stated as justification for doing this "However, at a single institution, we do not have sufficient data to develop and test the models in a real-world environment." -- yet, they have down-selected controls substantially (although arguably even at 5:1 it would not reflect the true distribution) so they could use this to test in a less biased data set.

Note also you do have to focus on replication rather than reproduction, given that you are not reproducing the methodology on the same dataset. Please review your terminology. As the other reviewer has highlighted as well, given that your focus is on replicating a *process* rather than the study itself this should be clearly identified. This is distinct even from the typical discussions of reproducibility vs replicability of models or their findings (see e.g. https://doi.org/10.1001/jama.2019.20866 )

Reviewer #2: (No Response)

Reviewer #3: Authors have greatly improved the manuscript and this is rigorous and sensical work!

A couple comments/questions (in order of importance):

1. Regarding the ICD Validation through Manual Review on page 12, and the implications.

Previous work has shown that ICD codes are very poor at identifying psychiatric conditions, including suicidality, so it was surprising to see only 2 false negatives. How do the authors explain this unusual finding and what does it mean for the generalizability of the results? If ICD codes have such a high recall in particular, why even the need to develop alternative methods using unstructured data?

2. Page 3, last paragraph. When discussing the current approaches to suicide risk assessment... Healthcare professionals administering clinical interviews, which are then documented in the notes and structured data (along with clinical decision making such as whether to admit or send to the ED, etc.), is exactly where the EHR data comes from. The existence of EHR data means that the encounter happened, and happened well. For example, page 4, "abundance of valuable and relevant information within the EHRs" -- that is put there by clinicians after a clinical assessment. From reading the discussion, it seems that not having to *repeat* the clinical interview/questionnaires in different treatment settings (or highlighting it more in the chart) after it's already been done... that's the utility! That could be explicit in the intro & discussion.

No other comments!

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Decision Letter 2

Braja Gopal Patra

17 Aug 2025

Automated detection and prediction of suicidal behavior from clinical notes using deep learning

PONE-D-24-31813R2

Dear Dr. Tsalatsanis,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Braja Gopal Patra, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments:

It would be great if you could make the code available on GitHub.

Acceptance letter

Braja Gopal Patra

PONE-D-24-31813R2

PLOS ONE

Dear Dr. Tsalatsanis,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Braja Gopal Patra

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Flowchart of study population.

    (TIF)

    pone.0331459.s001.tif (223KB, tif)
    S2 Fig. Variable importance as measured by the RF for the prediction of suicide/ISH.

    (TIF)

    pone.0331459.s002.tif (53.5KB, tif)
    S1 Table. Clinical note types and frequency per group.

    (DOCX)

    pone.0331459.s003.docx (17.2KB, docx)
    S2 Table. Participant demographics.

    (DOCX)

    pone.0331459.s004.docx (16.4KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0331459.s005.docx (43.3KB, docx)
    Attachment

    Submitted filename: R2 Response to Reviewers.docx

    pone.0331459.s006.docx (28.1KB, docx)

    Data Availability Statement

    The University of South Florida’s Human Research Protections restricts data sharing to protect sensitive patient information and to comply with local and national regulations. Public sharing could compromise patient privacy. Researchers seeking access to the data can contact USF Human Research Protections (IRB) at RSCH-IRB@usf.edu, referencing the study number STUDY002988.


    Articles from PLOS One are provided here courtesy of PLOS
