Translational Psychiatry. 2024 Jul 31;14:316. doi: 10.1038/s41398-024-03034-3

Accuracy and transportability of machine learning models for adolescent suicide prediction with longitudinal clinical records

Chengxi Zang 1,2,#, Yu Hou 1,2,#, Daoming Lyu 1,2, Jun Jin 3, Shane Sacco 3, Kun Chen 3, Robert Aseltine 4, Fei Wang 1,2
PMCID: PMC11291985  PMID: 39085206

Abstract

Machine learning models trained on real-world data have demonstrated promise in predicting suicide attempts in adolescents. However, their transportability, namely the performance of a model trained on one dataset and applied to different data, is largely unknown, hindering the clinical adoption of these models. Here we developed machine learning-based suicide prediction models on real-world data collected in different contexts (inpatient, outpatient, and all encounters) for varying purposes (administrative claims and electronic health records), and compared their cross-data performance. The three datasets used were the All-Payer Claims Database in Connecticut, the Hospital Inpatient Discharge Database in Connecticut, and the Electronic Health Records data provided by the Kansas Health Information Network. We included 285,320 patients, among whom we identified 3389 (1.2%) suicide attempters; 66% of the attempters were female. Different machine learning models were evaluated on the source datasets where they were trained and then applied to the target datasets. More complex models, particularly deep long short-term memory neural network models, did not outperform simpler regularized logistic regression models in terms of either local or transported performance. Transported models exhibited varying performance, showing drops or even improvements compared to their source performance. While transported models can achieve satisfactory performance, they are usually upper-bounded by the best performance of locally developed models, and they can identify additional new cases in target data. Our study uncovers complex transportability patterns and could facilitate the development of suicide prediction models with better performance and generalizability.

Subject terms: Psychiatric disorders, Scientific community

Introduction

Youth suicide is a major public health threat. Suicide is the second most common cause of death among adolescents and young adults [1, 2], with recent data from the CDC indicating that death by suicide among children and young adults ages 10–24 in the US increased by 57% between 2007 and 2018, from 6.8 to 10.7 per 100,000 [2]. Abundant data indicate that individuals often have contact with both their primary care and mental health care providers before suicide [3]. In a longitudinal study of eight Mental Health Research Network healthcare systems in the US, Ahmedani et al. found that 83% of individuals who died of suicide had received healthcare during the year before death [4]. In a recently published study, 62% of pediatric patients treated for suicide attempts in an urban pediatric hospital had a non-suicide-related visit within 90 days before the attempt [5]. These data point to the enormous opportunity for suicide prevention through improved surveillance, detection, and intervention in the healthcare system.

As a result of NIMH’s prioritization of the development of suicide risk prediction, there are now several published algorithms using data mining and machine learning (ML) approaches with real-world clinical data (RWD) to predict suicidal behavior and suicide mortality among patients in large healthcare systems [5–13]. Follow-ups of patients completing suicide risk assessments have found that predictive models achieved higher sensitivity and specificity in identifying suicidal behavior than clinical assessments [14]. As part of the Food and Drug Administration’s Mini-Sentinel pilot program, a systematic review examined five validated algorithms for predicting suicide attempts and completed suicide and found that the sensitivity of the algorithms ranged up to 65% and the positive predictive value up to 100% [15, 16]. Moreover, recent studies indicate that such models are both generalizable and durable: they can be used effectively outside of the specific clinical settings in which they were produced [17], and maintain predictive accuracy over short to intermediate timeframes [18].

With the exception of the two aforementioned studies, however, barriers to the use of ML models developed for suicide risk prediction have not been adequately investigated. Doing so would require a more comprehensive investigation of the transportability of different ML models across different data types, with subsequent examination of cross-data performance and cross-data risk factors. For example, the discussion of transportability in ML-based prediction models often assumes that complex prediction models with many variables outperform simpler models [19]. Since complex models might require more resources to estimate, validate, and implement, it is crucial to investigate whether complex models offer a substantial improvement over simpler ones, given these resource constraints. In addition, before ML-based prediction models can be implemented in clinical practice, their transportability must be evaluated, i.e., the models should be able to produce accurate predictions on new sets of patients from different settings, e.g., different clinical settings, geographical locations, or time periods. However, there are many barriers to independent validation of these models, including patient heterogeneity, clinical process variability, EHR configuration, and non-interoperable databases across hospitals [20–23]. Furthermore, prediction model performance can vary with changing patient populations and shifts in clinical practice.

To fill this research gap, here we compared the performance and transportability of different ML-based suicide prediction models [including regularized logistic regression (LR), gradient boosting machine (GBM), and the deep long short-term memory neural network (LSTM)] for children and adolescents across three RWD sets with different data types (claims data and EHR data) collected at different touchpoints (inpatient, outpatient, and all encounters). The local and transported performance of the different models was compared across all enumerated source-target data pairs. We observed complex transportability patterns: a) transported models exhibited not only performance drops but also improvements on target data compared to their performance on source data; b) the LSTM model did not necessarily outperform simpler models, such as LR and GBM, in either local or transported settings; c) while transported models might achieve satisfactory performance, it was generally upper-bounded by the best performance of locally developed models; and d) transported models might identify new cases missed by locally developed models. Our analyses can help evaluate the readiness of these models for application or transport in different clinical settings, and facilitate the development of suicide prediction models with better performance and generalizability.

Method

Datasets

The following three real-world patient databases were used in our study:

  • The All-Payer Claims Database (APCD) from Connecticut [24] included medical and pharmacy claims for Connecticut residents from January 1, 2012, to December 31, 2017. The APCD contains both inpatient and outpatient encounters from approximately 35% of the commercially insured Connecticut population.

  • The Hospital Inpatient Discharge Database (HIDD) from Connecticut [25] contained inpatient hospitalization encounters from all acute care hospitals in the state from October 1, 2005, to September 30, 2017.

  • The Electronic Health Records (EHR) data provided by the Kansas Health Information Network (KHIN) [26] included EHRs collected from a patient population in Kansas across all encounter types (e.g., outpatient, inpatient, and emergency room) from 2013 to 2018.

This study was approved by the University of Connecticut Health Center Institutional Review Board and Weill Cornell Medical College Institutional Review Board, the CT Department of Public Health Human Investigations Committee, and the CT APCD Data Release Committee. All EHRs used in this study were appropriately deidentified, and thus no informed consent from patients was obtained. The study was performed in accordance with the ethical standards of human experimentation established in the Declaration of Helsinki and subsequent amendments [27].

Study cohorts

Study cohorts consisted of children, adolescents, and young adults aged 10 to 24 from the three datasets (APCD, HIDD, and KHIN) who had at least one non-suicidal diagnosis in the recruiting window. Specifically, the recruiting windows were January 1, 2014, to December 31, 2015, for APCD; January 1, 2012, to September 30, 2015, for HIDD; and January 1, 2014, to December 31, 2015, for KHIN. Each recruited patient was followed up for two years. We excluded patients whose first documented visit involved a suicide attempt. Suicide attempts were identified using ICD-9 codes, with detailed rules listed in Supplementary Table S1. Figure 1 summarizes the inclusion-exclusion cascades for the three databases. Detailed descriptions of the cohorts can be found in our previous analyses [13].

Fig. 1. Cohort selection.

Fig. 1

Cohort selection from three datasets: a APCD, the All-Payer Claims Database from Connecticut; b HIDD, the Hospital Inpatient Discharge Database from Connecticut; and c KHIN, the Electronic Health Records data from the Kansas Health Information Network.

Follow-up and outcome of interest

The outcome of interest was the first documented suicide attempt (SA) in the follow-up period. For each qualified patient, the follow-up period started at the first non-suicide-related hospital encounter within the recruiting window and continued until a suicide attempt, two years of follow-up, or the end of the study period (the end of 2017), whichever happened first. The index time was defined as the time of the last non-suicide visit, namely the visit just before the outcome of interest for cases, or the last visit before the end of follow-up for non-cases.
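To make the index-time rule concrete, the following is a minimal pandas sketch, assuming a visit-level table with hypothetical columns patient_id, visit_date, and is_sa (suicide-attempt flag); it illustrates the definition above and is not the authors' code.

```python
import pandas as pd

def index_times(visits: pd.DataFrame) -> pd.Series:
    """Last non-suicide visit per patient: the visit just before the first
    suicide attempt for cases, or the last visit in follow-up for non-cases."""
    first_sa = (visits[visits["is_sa"]]
                .groupby("patient_id")["visit_date"].min())
    non_sa = visits[~visits["is_sa"]].merge(
        first_sa.rename("sa_date"), left_on="patient_id",
        right_index=True, how="left")
    # Keep non-suicide visits strictly before the first attempt (if any).
    keep = non_sa["sa_date"].isna() | (non_sa["visit_date"] < non_sa["sa_date"])
    return non_sa[keep].groupby("patient_id")["visit_date"].max()
```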

Baseline covariates

Predictor covariates included self-reported demographic information (age and gender) and diagnosis information (ICD-9/10 codes) from each medical encounter. Age was categorized into three groups (10–14, 15–19, 20–24) and self-reported gender into two groups (female and male). We reported each patient's age at their last recorded non-suicidal visit, and for suicide attempters, we also reported their age at their first recorded suicide attempt visit. The diagnosis information in both APCD and HIDD was encoded with ICD-9 codes, whereas KHIN was encoded with both ICD-9 and ICD-10 codes. For the KHIN data only, we used the R package “touch” [28] to convert ICD-10 codes to ICD-9 codes for consistency. The diagnostic codes were aggregated by their first three digits, and the top 300 most prevalent codes from each site were selected and combined into a shared feature space. For each patient, we used the earliest possible historical information up to the last non-suicide visit and required at least one non-suicide visit. We considered both aggregated features and sequences of features for the different ML models, as described below.
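As an illustration of this feature construction, here is a minimal sketch assuming each dataset is a long-format diagnosis table with hypothetical columns patient_id and icd9; the helper names and data layout are ours, not the authors'.

```python
import pandas as pd

def top_codes(df: pd.DataFrame, k: int = 300) -> set:
    # Aggregate ICD-9 codes by their first three characters and keep the
    # k most prevalent codes at this site.
    return set(df["icd9"].str[:3].value_counts().head(k).index)

# Union of the per-site top-300 code sets forms the shared feature space.
shared = top_codes(apcd) | top_codes(hidd) | top_codes(khin)

def aggregate_features(df: pd.DataFrame) -> pd.DataFrame:
    # One binary indicator per shared code over each patient's history,
    # assumed pre-filtered to visits up to the last non-suicide visit.
    codes = df.assign(code=df["icd9"].str[:3])
    codes = codes[codes["code"].isin(shared)]
    counts = pd.crosstab(codes["patient_id"], codes["code"])
    return (counts > 0).astype(int).reindex(columns=sorted(shared), fill_value=0)
```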

Machine learning predictive models

Three machine learning (ML) models were used for predictive modeling. The first was regularized logistic regression (LR); we adopted both L1-norm and L2-norm penalties and grid searched the inverse of the regularization strength from 10^-2 to 10^2 with 0.2 as the sampling step size for the exponent. The second was the gradient boosting machine (GBM) with random forest as the base learner. We grid searched hyperparameters over the maximum depth (3, 4, 5), the maximum number of leaves in one tree (5, 15, 25, 35, 45), and the minimal number of data points in one leaf (50, 100, 150, 200, 250). The third was the long short-term memory (LSTM) neural network with temporal attention mechanisms [29–32]. We grid searched hyperparameter configurations including a two-layered bidirectional LSTM with hidden dimensions (32, 64), learning rate (1e-3, 1e-4), and weight decay (1e-3, 1e-4, 1e-5). The LR and GBM used features aggregated before the index visit, while the deep LSTM model used a sequence of features for modeling.
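For the LR and GBM, the hyperparameter grids above can be expressed roughly as follows; a minimal sketch using scikit-learn and LightGBM (the packages named in the Code availability section), with settings not stated in the text (e.g., the solver and bagging fraction) chosen by us for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# Inverse regularization strength C on a log grid from 1e-2 to 1e2,
# stepping the exponent by 0.2 (21 values), with L1 and L2 penalties.
lr = LogisticRegression(solver="saga", max_iter=5000)  # saga supports l1 and l2
lr_grid = {
    "penalty": ["l1", "l2"],
    "C": 10.0 ** np.arange(-2.0, 2.0 + 1e-9, 0.2),
}

# GBM with random-forest-style base learners; LightGBM's "rf" mode
# requires bagging to be enabled.
gbm = LGBMClassifier(boosting_type="rf", bagging_freq=1, bagging_fraction=0.8)
gbm_grid = {
    "max_depth": [3, 4, 5],
    "num_leaves": [5, 15, 25, 35, 45],
    "min_child_samples": [50, 100, 150, 200, 250],
}
```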

All datasets were first divided randomly into a training set and a testing set with a ratio of 80:20. We used the training set to select and learn the best model parameters via a five-fold cross-validation pipeline, and evaluated the final performance on the held-out testing set in the evaluations that follow.
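A sketch of this split-and-tune pipeline, reusing the lr estimator and lr_grid from the sketch above; X and y stand for the assumed feature matrix and outcome labels, and the 20 repetitions follow the evaluation procedure described below.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

aurocs = []
for seed in range(20):  # 20 repetitions of split/train/test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Five-fold cross-validated grid search on the 80% training split.
    search = GridSearchCV(lr, lr_grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_tr, y_tr)
    scores = search.predict_proba(X_te)[:, 1]
    aurocs.append(roc_auc_score(y_te, scores))

lo, hi = np.percentile(aurocs, [2.5, 97.5])  # bootstrapped 95% interval
```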

Model evaluation

Both local and transported performance were evaluated for all investigated ML predictive models. For local performance, models were trained and tested on a single data source. For the transported setting, models were trained on one source dataset and then tested on another target dataset. We compared the performance of the transported model on the target data with (a) its original performance on the source data, and (b) the performance of the locally trained model on the target data. Performance metrics included the area under the receiver operating characteristic curve (AUROC), as well as the sensitivity and the positive predictive value (PPV) at specificity levels of 90% and 95%. The bootstrapped 95% confidence intervals or standardized variances were calculated by repeating the above-mentioned splitting, training, and testing 20 times. Regarding the importance of the predictors, we also calculated the average of the logistic regression coefficients over the 20 repetitions. We used the standardized mean difference (SMD) [33] to measure the mean prevalence difference of a particular feature between two datasets. We assumed a significant difference in a feature's value across two datasets if its corresponding SMD fell outside the range of -0.2 to 0.2 [34].
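A hedged sketch of these metrics: sensitivity and PPV at a fixed specificity can be computed by thresholding risk scores at the corresponding quantile of the non-case scores, and the SMD for a binary feature follows Austin [33]; the function names are ours.

```python
import numpy as np

def sens_ppv_at_specificity(y_true, scores, specificity=0.90):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    # Threshold so that `specificity` of non-cases score at or below it.
    thr = np.quantile(scores[y_true == 0], specificity)
    pred = scores > thr
    tp = np.sum(pred & (y_true == 1))
    sensitivity = tp / max(np.sum(y_true == 1), 1)
    ppv = tp / max(np.sum(pred), 1)
    return sensitivity, ppv

def smd_binary(p1, p2):
    # Standardized mean difference between two prevalences (Austin, 2009).
    return (p1 - p2) / np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
```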

Results

Population characteristics

As shown in Table 1, the basic population characteristics varied across the three study cohorts. Among the three cohorts APCD (claims), HIDD (inpatient EHR), and KHIN (all-settings EHR), the APCD and HIDD had similar suicide attempt rates (1.4% for APCD and 1.5% for HIDD), while the KHIN cohort had a lower suicide attempt rate of 0.9%. We observed more female suicide attempters in all three datasets; however, among suicide attempters, the proportion of females in KHIN (70.2%) was higher than in APCD (61.4%) and HIDD (62.4%). We also observed different index age distributions across the three cohorts: in the younger 10–14 index age group, we observed a higher proportion of suicide attempters in KHIN (18.3%) than in APCD (12.1%) and HIDD (14.7%), and in the older 20–24 age group, APCD (38.7%) had a higher proportion than HIDD (32.7%) and KHIN (30.6%). Regarding suicide attempt methods, we observed more poisoning in HIDD (70.0%) than in APCD (63.5%) and KHIN (64.2%).

Table 1.

Population characteristics of three datasets.

| Demographics | APCD (Claims) Case | APCD (Claims) Non-case | HIDD (EHR Inpatient) Case | HIDD (EHR Inpatient) Non-case | KHIN (EHR All Settings) Case | KHIN (EHR All Settings) Non-case |
|---|---|---|---|---|---|---|
| N (%) | 2163 (1.4%) | 153,318 (98.6%) | 434 (1.5%) | 28,407 (98.5%) | 906 (0.9%) | 100,047 (99.1%) |
| Sex - no. (%) | | | | | | |
| Female | 1329 (61.4) | 76,483 (49.9) | 271 (62.4) | 16,907 (59.5) | 636 (70.2) | 52,917 (52.9) |
| Male | 834 (38.6) | 76,835 (50.1) | 163 (37.6) | 11,500 (40.5) | 270 (29.8) | 47,130 (47.1) |
| Age - no. (%)* | | | | | | |
| 10-14 | 262 (12.1) | 19,943 (13.0) | 64 (14.7) | 4481 (15.8) | 166 (18.3) | 28,099 (28.1) |
| 15-19 | 1065 (49.2) | 52,166 (34.0) | 228 (52.5) | 9375 (33.0) | 463 (51.1) | 34,519 (34.5) |
| 20-24 | 836 (38.7) | 81,209 (53.0) | 142 (32.7) | 14,551 (51.2) | 277 (30.6) | 37,429 (37.4) |
| Suicide attempt methods - no. (%) | | | | | | |
| Poisoning | 1373 (63.5) | – | 304 (70.0) | – | 582 (64.2) | – |
| Cutting | 578 (26.7) | – | 89 (20.5) | – | 189 (20.9) | – |
| Other | 150 (6.9) | – | 25 (5.8) | – | 49 (5.4) | – |
| Suicide type >= 2 | 62 (2.9) | – | 16 (3.7) | – | 86 (9.5) | – |

*Age at index, namely the age at the last non-suicidal visit. See the age at the time of the event of interest in Supplementary Table S4. EHR, electronic health records; APCD, the All-Payer Claims Database from Connecticut; HIDD, the Hospital Inpatient Discharge Database from Connecticut; KHIN, the Electronic Health Records (EHR) data from the Kansas Health Information Network.

Local performance

Overall, the performance of the regularized logistic regression (LR) for suicide prediction was good across all three datasets when trained and tested on the same data source, with the test data held out and not used for training. Specifically, as shown in Table 2, the average AUROCs by LR were 0.879 (95% CI [0.876, 0.882]) on APCD, 0.831 (95% CI [0.821, 0.840]) on HIDD, and 0.787 (95% CI [0.781, 0.794]) on KHIN.

Table 2.

Local and transported performance of the regularized logistic regression across three datasets.

| Source Dataset | Target Dataset | AUROC (95% CI) | Sensitivity at 90% specificity (95% CI) | PPV at 90% specificity (95% CI) | Sensitivity at 95% specificity (95% CI) | PPV at 95% specificity (95% CI) |
|---|---|---|---|---|---|---|
| APCD | APCD (local) | 0.879 (0.876, 0.882) | 0.662 (0.653, 0.669) | 0.086 (0.084, 0.088) | 0.538 (0.527, 0.548) | 0.132 (0.129, 0.135) |
| APCD | HIDD | 0.797 (0.785, 0.809) | 0.464 (0.444, 0.482) | 0.066 (0.061, 0.071) | 0.282 (0.260, 0.304) | 0.079 (0.071, 0.087) |
| APCD | KHIN | 0.749 (0.739, 0.757) | 0.419 (0.403, 0.433) | 0.036 (0.035, 0.037) | 0.259 (0.246, 0.271) | 0.044 (0.042, 0.046) |
| HIDD | HIDD (local) | 0.831 (0.821, 0.840) | 0.456 (0.434, 0.479) | 0.065 (0.061, 0.069) | 0.279 (0.259, 0.299) | 0.078 (0.073, 0.084) |
| HIDD | APCD | 0.802 (0.795, 0.808) | 0.501 (0.489, 0.514) | 0.066 (0.064, 0.067) | 0.331 (0.319, 0.343) | 0.085 (0.082, 0.088) |
| HIDD | KHIN | 0.735 (0.726, 0.744) | 0.410 (0.392, 0.427) | 0.035 (0.034, 0.037) | 0.259 (0.245, 0.274) | 0.044 (0.042, 0.047) |
| KHIN | KHIN (local) | 0.787 (0.781, 0.794) | 0.478 (0.460, 0.494) | 0.041 (0.040, 0.043) | 0.348 (0.335, 0.362) | 0.059 (0.056, 0.061) |
| KHIN | APCD | 0.845 (0.841, 0.849) | 0.581 (0.570, 0.591) | 0.075 (0.073, 0.076) | 0.423 (0.412, 0.433) | 0.105 (0.103, 0.107) |
| KHIN | HIDD | 0.829 (0.821, 0.838) | 0.452 (0.431, 0.472) | 0.063 (0.060, 0.067) | 0.276 (0.259, 0.294) | 0.076 (0.071, 0.081) |
AUROC, area under the receiver operating characteristic curve. PPV, Positive predictive value. CI, confidence intervals. APCD, the All-Payer Claims Database from Connecticut; HIDD, the Hospital Inpatient Discharge Database from Connecticut; KHIN, the Electronic Health Records (EHR) data from the Kansas Health Information Network.

However, more complex ML models, particularly LSTM, did not necessarily outperform the simpler regularized logistic regression. As shown in Fig. 2a, the LR achieved an average AUROC of 0.879 (95% CI [0.876, 0.882]), which is similar to the AUROCs achieved by the GBM at 0.887 ([0.884, 0.890]) and LSTM at 0.901 ([0.896, 0.905]) on the APCD data. The average AUROCs on the HIDD were 0.831 ([0.821, 0.840]), 0.827 ([0.818, 0.837]), and 0.811 ([0.799, 0.824]) by LR, GBM, and LSTM, respectively. On the KHIN data, the LSTM showed worse performance than the other models, with an average AUROC of 0.759 ([0.745, 0.771]), compared to 0.787 ([0.781, 0.794]) by LR and 0.793 ([0.788, 0.799]) by GBM.

Fig. 2. Local and transported AUROC performance of three ML-based predictive models across different source-target data pairs.

Fig. 2

The three panels show the same set of experiments ordered differently to highlight different comparisons: a local and transported performance of different ML models, b transported performance compared to source-data performance, and c transported performance compared to target-data performance. The three datasets are APCD, the All-Payer Claims Database from Connecticut; HIDD, the Hospital Inpatient Discharge Database from Connecticut; and KHIN, the Electronic Health Records data from the Kansas Health Information Network. The three ML-based suicide prediction models are regularized logistic regression (LR), gradient boosting machine (GBM), and the deep long short-term memory neural network (LSTM). The arrows indicate transporting models developed on the source data to the target data. AUROC, area under the receiver operating characteristic curve.

Transported performance

When transporting models developed on source data to target data, their transported performance did not necessarily drop, showing data-dependent transportability. Specifically, as shown in Fig. 2b, compared to the LR models developed on the APCD or HIDD data, we observed performance drops when applying them to other datasets. As shown in Table 2, the locally learned LR model on APCD achieved an average AUROC of 0.879 ([0.876, 0.882]), with transported performance of 0.797 ([0.785, 0.809]) and 0.749 ([0.739, 0.757]) on HIDD and KHIN, respectively. Similar drop patterns were observed for the LR model trained on HIDD and transported to the other datasets. However, increased transported performance was observed for the LR model developed on KHIN, with an average AUROC of 0.787 ([0.781, 0.794]) on KHIN and transported performance of 0.845 ([0.841, 0.849]) and 0.829 ([0.821, 0.838]) on APCD and HIDD, respectively. As shown in Fig. 2b, these data-dependent transportability patterns were also observed for other machine learning models, including GBM (details in Supplementary Table S2) and LSTM (details in Supplementary Table S3).

All the transported models showed either inferior or comparable performance relative to the locally developed models, implying that while transported performance was sometimes still good, it was generally upper-bounded by the best performance of locally developed models. For example, as shown in Table 2 and Fig. 2c, the LR model trained on APCD achieved an average AUROC of 0.879 ([0.876, 0.882]) on the APCD testing data, whereas the transported models trained on HIDD and KHIN achieved average AUROCs of 0.802 ([0.795, 0.808]) and 0.845 ([0.841, 0.849]), respectively, when tested on the same APCD data. As shown in Fig. 2c, similarly bounded transported performance was observed for the other machine learning models, including GBM (details in Supplementary Table S2) and LSTM (details in Supplementary Table S3), across all three datasets. Though upper-bounded, the good transported performance in several cases suggests the potential usefulness of transported models, particularly when locally trained models are not available.

The relatively more complex LSTM model usually exhibited inferior transported performance compared to the simpler LR model, as shown in Fig. 2a. For example, the LSTM showed very poor transported performance when transported from HIDD to KHIN, with an AUROC of 0.595 ([0.572, 0.616]), and from HIDD to APCD, with an AUROC of 0.654 ([0.641, 0.668]) (see Supplementary Table S3). Only in the case of APCD to HIDD did the LSTM show transported performance comparable to the other models.

Predictor importance

We identified some common features consistently associated with increased suicide attempt risk across all three datasets, as shown in Table 3 and Fig. 3, including episodic mood disorders (with average rankings of 1.1, 1.8, and 2.1 in APCD, HIDD, and KHIN, respectively), other psychosocial circumstances (8.4, 5.0, and 2.4), drug abuse (7.7, 16.5, and 3.7), and being female (3.6, 12.7, and 5.5). In addition, depressive disorders, anxiety, drug dependence, poisoning by analgesics, antipyretics, and antirheumatics, and being in the 15-to-19-year age group were also associated with increased risk of suicide attempts across all three datasets. Overall, the most predictive features identified by our LR model across the three datasets align with existing knowledge of risk factors for suicide, e.g., episodic mood disorder, anxiety, being female, and depressive disorder. See the corresponding features identified by the gradient boosting machine in Supplementary Fig. S2.

Table 3.

Top 30 Diagnostic codes (ICD-9 codes) associated with increased risk of suicide attempt in three datasets.

| APCD features | Rank* | HIDD features | Rank* | KHIN features | Rank* |
|---|---|---|---|---|---|
| 296, Episodic mood disorders | 1.1 | 296, Episodic mood disorders | 1.8 | 296, Episodic mood disorders | 2.1 |
| 300, Anxiety, dissociative and somatoform disorders | 3.5 | 311, Depressive disorder, not elsewhere classified | 4.4 | V62, Other psychosocial circumstances | 2.4 |
| Sex Female | 3.6 | V62, Other psychosocial circumstances | 5.0 | 305, Nondependent abuse of drugs | 3.7 |
| 311, Depressive disorder, not elsewhere classified | 4.5 | 301, Personality disorders | 10.0 | Sex Female | 5.5 |
| 309, Adjustment reaction | 6.2 | Sex Female | 12.7 | 300, Anxiety, dissociative and somatoform disorders | 13.8 |
| 305, Nondependent abuse of drugs | 7.7 | 304, Drug dependence | 13.1 | 787, Symptoms involving digestive system | 14.7 |
| V62, Other psychosocial circumstances | 8.4 | 300, Anxiety, dissociative and somatoform disorders | 14.0 | 789, Other symptoms involving abdomen and pelvis | 15.8 |
| 298, Other nonorganic psychoses | 11.9 | 305, Nondependent abuse of drugs | 16.5 | 965, Poisoning by analgesics, antipyretics, and antirheumatics | 18.6 |
| 304, Drug dependence | 13.7 | 314, Hyperkinetic syndrome of childhood | 19.9 | 625, Pain and other symptoms associated with female genital organs | 23.3 |
| 299, Pervasive developmental disorders | 14.6 | 309, Adjustment reaction | 23.7 | 277, Other and unspecified disorders of metabolism | 27.5 |
| 995, Certain adverse effects not elsewhere classified | 15.1 | 965, Poisoning by analgesics, antipyretics, and antirheumatics | 30.0 | 307, Special symptoms or syndromes, not elsewhere classified | 31.8 |
| 293, Transient mental disorders due to conditions classified elsewhere | 15.4 | 724, Other and unspecified disorders of back | 33.5 | 301, Personality disorders | 32.8 |
| 319, Unspecified intellectual disabilities | 18.7 | 303, Alcohol dependence syndrome | 34.4 | 311, Depressive disorder, not elsewhere classified | 34.8 |
| 312, Disturbance of conduct, not elsewhere classified | 20.2 | V60, Housing, household, and economic circumstances | 35.0 | 729, Other disorders of soft tissues | 43.7 |
| 295, Schizophrenic disorders | 24.7 | E939, Psychotropic agents | 38.9 | 295, Schizophrenic disorders | 46.4 |
| 310, Specific nonpsychotic mental disorders due to brain damage | 28.9 | 307, Special symptoms or syndromes, not elsewhere classified | 40.6 | 536, Disorders of function of stomach | 47.6 |
| 599, Other disorders of urethra and urinary tract | 30.4 | 298, Other nonorganic psychoses | 44.7 | 969, Poisoning by psychotropic agents | 48.3 |
| 794, Nonspecific abnormal results of function studies | 33.4 | 728, Disorders of muscle, ligament, and fascia | 46.9 | E006, Activities involving other sports and athletics played individually | 55.1 |
| 250, Diabetes mellitus | 34.9 | 368, Visual disturbances | 50.5 | 780, General symptoms | 57.4 |
| V71, Observation and evaluation for suspected conditions not found | 35.5 | 308, Acute reaction to stress | 51.6 | 308, Acute reaction to stress | 59.5 |
| 620, Noninflammatory disorders of ovary, fallopian tube, and broad ligament | 36.3 | 788, Symptoms involving urinary system | 58.9 | 796, Other nonspecific abnormal findings | 61.5 |
| E000, External cause status | 37.3 | Age 15-19 | 60.2 | 923, Contusion of upper limb | 62.3 |
| 920, Contusion of face, scalp, and neck except eye(s) | 38.6 | V61, Other family circumstances | 62.5 | 401, Essential hypertension | 67.5 |
| 314, Hyperkinetic syndrome of childhood | 38.6 | 295, Schizophrenic disorders | 63.6 | 338, Pain, not elsewhere classified | 67.9 |
| 493, Asthma | 39.4 | 473, Chronic sinusitis | 65.5 | 599, Other disorders of urethra and urinary tract | 71.8 |
| 965, Poisoning by analgesics, antipyretics, and antirheumatics | 39.6 | 881, Open wound of elbow, forearm, and wrist | 66.1 | 784, Symptoms involving head and neck | 74.3 |
| 307, Special symptoms or syndromes, not elsewhere classified | 42.9 | 277, Other and unspecified disorders of metabolism | 66.4 | V15, Other personal history presenting hazards to health | 74.9 |
| E960, Fight, brawl, rape | 43.1 | 253, Disorders of the pituitary gland and its hypothalamic control | 66.6 | V69, Problems related to lifestyle | 76.1 |
| 112, Candidiasis | 49.7 | V12, Personal history of certain other diseases | 66.9 | V60, Housing, household, and economic circumstances | 77.1 |
| 682, Other cellulitis and abscess | 51.1 | E850, Accidental poisoning by analgesics, antipyretics, and antirheumatics | 67.6 | V64, Persons encountering health services for specific procedures, not carried out | 78.1 |

* The average ranks were determined by the absolute value of the coefficient of the regularized logistic regression models over 20 repetitions. APCD, the All-Payer Claims Database from Connecticut; HIDD, the Hospital Inpatient Discharge Database from Connecticut; KHIN, the Electronic Health Records (EHR) data from the Kansas Health Information Network.

Fig. 3. Estimated predictor importance by LR and source data difference across three datasets.

Fig. 3

The feature importances of the regularized logistic regression models locally estimated from APCD, HIDD, and KHIN are presented as bar plots with 95% confidence intervals as error bars. The standardized mean differences (SMD) of feature prevalence between datasets are shown as dots or triangles. A significant difference in feature prevalence was assumed if the SMD fell beyond ±0.2. The top 30 features are shown, and the dashed lines at ±0.2 are guides for the eye.

On the other hand, we also observed heterogeneous predictor importance learned from different datasets. As shown in Fig. 3, for example, ‘Health supervision of infant or child’ and ‘General medical examination’ were associated with decreased risk of suicide attempt in the APCD and KHIN models but not in the HIDD model. Indeed, these codes were prevalent in APCD (general claims data) and KHIN (all-settings EHR) but not in the inpatient EHR data of HIDD (see the standardized mean differences of feature prevalence in Fig. 3). In contrast, ‘outcome of delivery’ and ‘acute appendicitis’ were associated with a more significantly decreased risk of suicide attempt in HIDD than in APCD or KHIN.

Overall, the same ML-based suicide prediction approach might identify a different set of important predictors when trained on different datasets, which, together with the underlying data differences, helps explain the varying transported performance. However, the features consistently identified across different datasets also suggest the potential feasibility of transporting models.

Sensitivity analysis

To assess the robustness of our results, we conducted several sensitivity analyses. First, the observed patterns were not due to idiosyncrasies of the metrics: the performance patterns remained consistent when evaluated using other metrics, including sensitivity and positive predictive value (PPV) at either 90% or 95% specificity (see Table 2, Supplementary Tables S2 and S3). Second, we investigated how the modeling performance and transportability would change when removing the crosswalk of ICD-10 to ICD-9 in the KHIN dataset. Specifically, we used only the ICD-9-coded portion of the KHIN data, leading to 62,636 eligible patients in the new KHIN cohort with 156 identified suicide attempters. We replicated our primary analyses in the newly built cohort and feature spaces. As shown in Supplementary Fig. S3 and Supplementary Tables S5, S6, and S7, similar local and transported results were observed. Third, to make more “apples-to-apples” comparisons, we replicated our primary analyses in a shared feature space built only from inpatient encounters across all three datasets. We observed similar local and transported performance patterns, as shown in Supplementary Fig. S4 and Supplementary Tables S8, S9, and S10.

Discussion

In this study, we investigated three ML-based suicide prediction models (regularized logistic regression, gradient boosting machine, and LSTM) on three real-world datasets collected in different contexts (inpatient, outpatient, and all encounters) for varying purposes (administrative claims and electronic health records), and compared their local and transported performance across datasets. Regarding local performance, where models were trained and tested on the same data source, we observed similarly good performance across the three models; moreover, the relatively more complex models, e.g., LSTM or GBM, did not necessarily outperform the relatively simpler regularized logistic regression model. We observed that as an ML model becomes more complex, it tends to overfit the training data more (see Supplementary Tables S11–S13); however, we did not observe superior performance for the more complex models on unseen test data, aligning with one recent work [19].

The transported performance of ML-based suicide prediction models exhibited more complex patterns. First, when compared to performance on the source data where the models were developed, we observed both performance drops and increases in different transportation scenarios using different data pairs, suggesting data-dependent transportability. Second, when compared to the locally developed model on the target data, transported performance was generally upper-bounded by the best performance of the locally developed model. Third, relatively more complex models, particularly the deep learning model LSTM, which is good at capturing sequential data, showed inferior transported performance compared to the simpler LR models, suggesting model-dependent transportability. Although models developed on the source data may demonstrate lower AUROC when transported to the target data and are expected to be inferior to models developed on the target data, in several cases the transported performance on target data was still good, suggesting potential utility of transporting suicide prediction models, especially when target models are unavailable. The utility of a transported model is also evident in situations where the target setting has an insufficient sample size (e.g., rare events like suicide or a small population), which makes it difficult to develop a robust model within the target setting. In such cases, models developed on a much larger yet comparable sample from another setting may perform better. In addition, the simpler LR model is a viable choice for suicide prediction using EHR/claims data, whether in local or transported settings.

The differences across the three datasets might account for the complex patterns of transported performance. Specifically, the APCD dataset contained administrative claims data covering both inpatient and outpatient encounters, the HIDD dataset contained data from inpatient hospitalizations only, and the KHIN included EHRs from all encounters (e.g., inpatient, outpatient, and emergency room settings), suggesting potentially substantial differences in the clinical information captured. Taking the difference in the prevalence of ‘Acute appendicitis’ between APCD and HIDD as an example, the standardized mean difference (SMD) of its prevalence between the two datasets was -0.7, indicating that, compared to HIDD, the APCD captured this potential inpatient event much less often. On the other hand, the SMD values of ‘Health supervision of infant or child’, ‘General medical examination’, ‘Special investigations and examinations’, and ‘Special screening for malignant neoplasms’ between APCD and HIDD were 1.7, 1.4, and 1.8, implying different capture of potential outpatient events. What is more, the SMDs of the ‘Poisoning by psychotropic agents’ diagnostic event between KHIN and the other datasets, as shown in Fig. 3, indicate that KHIN potentially captured more emergency encounters than the other two datasets. Usually, performance drops are anticipated, considering that these different captures of clinical events under different clinical contexts may limit the transportability of the investigated ML-based risk prediction models [19]. However, the increased performance when transporting from KHIN to the other two datasets, compared to the modeling performance on KHIN itself, suggests potentially generalizable use of suicide prediction models developed from all-encounter EHRs in other settings (administrative claims and inpatient EHRs). These increased transported performances differ from findings in one recent work in the schizophrenia domain [19], suggesting disease-specific transportability of ML predictive models.

In addition, the transported models identified different sets of patients, including new cases that were missed by the locally trained models. We illustrate a Venn diagram of correctly identified suicide attempters by different models in Supplementary Fig. S1. Taking the target APCD data as an example, the number of true positives correctly identified by the APCD local model was 235 (among 440 cases), while the number correctly identified by the HIDD model decreased to 141. Specifically, 115 of the cases identified by the APCD model were not identified by the HIDD model, while the HIDD model identified an additional 21 new cases that were not identified by the APCD model. We also examined several true positive patient records to explore the differences between models. For instance, the predictor ‘Other disorders of urethra and urinary tract’, a potential risk factor that has been reported by other studies [35, 36], was only selected by the APCD model due to differences in the data (mean ranking: 23.3 in APCD, 371.7 in HIDD, and 47.6 in KHIN). Thus, a suicide attempter from the APCD dataset whose profile had a diagnosis code of ‘disorders of urethra and urinary tract’ in his or her visit record was identified only by the model developed from the APCD. Indeed, the HIDD model transported to the APCD data identified fewer cases overall, and thus had potentially worse AUROC performance; however, the transported model was able to identify novel cases.

By demonstrating good local performance of simpler LR models, complex transported performance patterns, and potentially different transportability of three different ML-based suicide prediction models across three different patient populations with varying clinical settings, this study contributes to a small but developing literature on the challenges of deploying suicide risk models in clinical practice. Recent studies suggest that suicide risk models have both generalizability and durability: they can be used effectively in clinical settings outside of but similar to those in which they were produced [17], and maintain predictive accuracy over short to intermediate time frames [18]. Our study uncovers more complex patterns in transporting suicide risk models derived from particular patient populations to populations derived from other clinical settings, by comprehensively comparing transported performance with both source and target modeling performance.

Based on our results, we recommend that future research focus on fusing knowledge learned from different populations and settings [13], potentially leading to better performance and generalizability. Through our exploration, we found that different models can identify suicide attempters with different characteristics. If prior knowledge from different models, such as their feature importance weights, can be obtained, integrating multiple models might improve predictive performance so that the integrated model correctly identifies more suicide attempters. Alternatively, a solution that may improve prediction performance is to provide a model pool, namely a set of pre-trained models from the different datasets, together with a designed metric for selecting a best-fit model from the pool for the prediction task and providing statistical evidence, allowing an optimal prediction result to be achieved; a hypothetical instantiation is sketched below.
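The sketch below illustrates one possible selection metric for the model-pool idea: choosing the pre-trained model with the best AUROC on a small labeled validation sample from the target setting. This is our illustration of the concept, not a method evaluated in the paper.

```python
from sklearn.metrics import roc_auc_score

def select_from_pool(pool, X_val, y_val):
    """Pick the best-fit pre-trained model from a pool using validation AUROC.

    `pool` maps a source-dataset name (e.g., "APCD") to a fitted classifier
    with a predict_proba method; X_val/y_val come from the target setting.
    """
    scored = {name: roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
              for name, model in pool.items()}
    best = max(scored, key=scored.get)
    return pool[best], scored
```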

There are also limitations to this study. First, the APCD database relies on claims for commercially insured residents, and recruited patients can lose their follow-up records by losing their commercial insurance or moving out of state. Similar issues are relevant for the KHIN and HIDD as well. Despite this limitation, we selected patients with continuous insurance coverage and valid enrollment during the recruiting window in our experimental design. Second, any eligibility criteria selecting patients based on information after time zero, e.g., complete follow-up information, might lead to selection bias; thus, we applied such eligibility criteria only within the recruiting windows before time zero rather than after it. In addition, we followed each patient for up to two years (rather than five years), attempting to minimize loss-to-follow-up bias as much as possible in our analysis. Third, studying the factors accounting for transported performance and fusing models learned from different sources are promising future directions. Fourth, we did not investigate model calibration, an important consideration for transportability that deserves dedicated study in the future.

Conclusion

This study investigated different ML-based suicide prediction models on three real-world datasets collected in different contexts (inpatient, outpatient, and all encounters) with varying purposes (administrative claims and electronic health records), and compared their local and transported performance. The relatively more complex models (e.g., LSTM and GBM) did not necessarily outperform relatively simpler models (e.g., LR) regarding the local performance as well as the transported performance. The transported performance is data-dependent, model-dependent, and upper-bounded by the performance of the locally developed models. The transported models might achieve good performance and identify additional new cases on target data, suggesting a fusion of knowledge learned from different datasets might improve the performance. Our analyses could facilitate the development of ML-based suicide prediction models with better performance and generalizability.

Acknowledgements

The work was supported by NIH grant numbers R01MH124740 and R01MH112148. The authors are grateful to the Editor, the Associate Editor, and the referees for their valuable comments and suggestions, which have led to significant improvement of the article.

Author contributions

CZ, FW, KC, and RA designed the study. RA, JJ, and SS contributed to the data acquisition and processing. CZ, YH, and DL undertook the experiments and interpreted the data. CZ, YH, and FW drafted the manuscript. All the authors took critical revisions of the manuscript and approved the final version to be published.

Data availability

The data used in this analysis were obtained from the Connecticut Department of Public Health (DPH) and the Connecticut Office of Health Strategy (OHS). Neither agency endorses nor assumes any responsibility for any analyses, interpretations, or conclusions based on the data. The authors assume full responsibility for all such analyses, interpretations, and conclusions. The data use agreements governing access to and use of these datasets do not permit the authors to re-release the datasets or make the data publicly available. However, these data can be obtained by other researchers using the same application processes used by the authors.

Code availability

For reproducibility, our codes are available at https://github.com/houyurain/2022_Suicide. We implemented the regularized logistic regression (LR) with the Scikit-learn package 1.0.2 [37], the GBM with the LightGBM package 3.3.2 [38], and the LSTM with PyTorch 1.12 with GPU acceleration.

Competing interests

The authors declare no competing interests.

Ethics declarations

Our study was approved by the University of Connecticut Health Center Institutional Review Board and Weill Cornell Medical College Institutional Review Board, the CT Department of Public Health Human Investigations Committee, and the CT APCD Data Release Committee. All EHRs and claims used in this study were appropriately deidentified and thus no informed consent from patients was obtained.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Chengxi Zang, Yu Hou.

Contributor Information

Kun Chen, Email: kun.chen@uconn.edu.

Robert Aseltine, Email: aseltine@uchc.edu.

Fei Wang, Email: few2001@med.cornell.edu.

Supplementary information

The online version contains supplementary material available at 10.1038/s41398-024-03034-3.

References

  • 1. Curtin SC. State suicide rates among adolescents and young adults aged 10–24: United States, 2000–2018. Natl Vital Stat Rep. 2020;69:1–10.
  • 2. Leading Causes of Death and Injury - PDFs | Injury Center | CDC. https://www.cdc.gov/injury/wisqars/LeadingCauses.html (2022).
  • 3. Luoma JB, Martin CE, Pearson JL. Contact with mental health and primary care providers before suicide: a review of the evidence. Am J Psychiatry. 2002;159:909–16.
  • 4. Ahmedani BK, Simon GE, Stewart C, Beck A, Waitzfelder BE, Rossom R, et al. Health care contacts in the year before suicide death. J Gen Intern Med. 2014;29:870–7.
  • 5. Su C, Aseltine R, Doshi R, Chen K, Rogers SC, Wang F. Machine learning for suicide risk prediction in children and adolescents with electronic health records. Transl Psychiatry. 2020;10:1–10.
  • 6. Kessler RC, Warner CH, Ivany C, Petukhova MV, Rose S, Bromet EJ, et al. Predicting suicides after psychiatric hospitalization in US Army soldiers: the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). JAMA Psychiatry. 2015;72:49–57.
  • 7. Barak-Corren Y, Castro VM, Javitt S, Hoffnagle AG, Dai Y, Perlis RH, et al. Predicting suicidal behavior from longitudinal electronic health records. Am J Psychiatry. 2017;174:154–62.
  • 8. Simon GE, Johnson E, Lawrence JM, Rossom RC, Ahmedani B, Lynch FL, et al. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Am J Psychiatry. 2018;175:951–60.
  • 9. Poulin C, Shiner B, Thompson P, Vepstas L, Young-Xu Y, Goertzel B, et al. Predicting the risk of suicide by analyzing the text of clinical notes. PLoS ONE. 2014;9:e85733.
  • 10. McCarthy JF, Bossarte RM, Katz IR, Thompson C, Kemp J, Hannemann CM, et al. Predictive modeling and concentration of the risk of suicide: implications for preventive interventions in the US Department of Veterans Affairs. Am J Public Health. 2015;105:1935–42.
  • 11. Sanderson M, Bulloch AG, Wang J, Williamson T, Patten SB. Predicting death by suicide using administrative health care system data: can recurrent neural network, one-dimensional convolutional neural network, and gradient boosted trees models improve prediction performance? J Affect Disord. 2020;264:107–14.
  • 12. Doshi RP, Chen K, Wang F, Schwartz H, Herzog A, Aseltine RH. Identifying risk factors for mortality among patients previously hospitalized for a suicide attempt. Sci Rep. 2020;10:15223.
  • 13. Xu W, Su C, Li Y, Rogers S, Wang F, Chen K, et al. Improving suicide risk prediction via targeted data fusion: proof of concept using medical claims data. J Am Med Inform Assoc. 2022;29:500–11.
  • 14. Tran T, Luo W, Phung D, Harvey R, Berk M, Kennedy RL, et al. Risk stratification using data from electronic medical records better predicts suicide risks than clinician assessments. BMC Psychiatry. 2014;14:76.
  • 15. Walkup JT, Townsend L, Crystal S, Olfson M. A systematic review of validated methods for identifying suicide or suicidal ideation using administrative or claims data. Pharmacoepidemiol Drug Saf. 2012;21:174–82.
  • 16. Platt R, Carnahan RM, Brown JS, Chrischilles E, Curtis LH, Hennessy S, et al. The U.S. Food and Drug Administration’s Mini-Sentinel program: status and direction. Pharmacoepidemiol Drug Saf. 2012;21:1–8.
  • 17. Barak-Corren Y, Castro VM, Nock MK, Mandl KD, Madsen EM, Seiger A, et al. Validation of an electronic health record-based suicide risk prediction modeling approach across multiple health care systems. JAMA Netw Open. 2020;3:e201262.
  • 18. Walker RL, Shortreed SM, Ziebell RA, Johnson E, Boggs JM, Lynch FL, et al. Evaluation of electronic health record-based suicide risk prediction models on contemporary data. Appl Clin Inform. 2021;12:778–87.
  • 19. Chekroud AM, Hawrilenko M, Loho H, Bondar J, Gueorguieva R, Hasan A, et al. Illusory generalizability of clinical prediction models. Science. 2024;383:164–7.
  • 20. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2020;21:345–52.
  • 21. Shilo S, Rossman H, Segal E. Axes of a revolution: challenges and promises of big data in healthcare. Nat Med. 2020;26:29–38.
  • 22. Song X, Yu ASL, Kellum JA, Waitman LR, Matheny ME, Simpson SQ, et al. Cross-site transportability of an explainable artificial intelligence model for acute kidney injury prediction. Nat Commun. 2020;11:5668.
  • 23. Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med. 2021;385:283–6.
  • 24. All-Payer Claims Database. CT.gov - Connecticut’s Official State Website. https://portal.ct.gov/OHS/Services/Data-and-Reports/To-Access-Data/All-Payer-Claims-Database.
  • 25. Hospital Patient Data. CT.gov - Connecticut’s Official State Website. https://portal.ct.gov/OHS/Services/Data-and-Reports/To-File-Data/Patient-Data.
  • 26. KHIN - Health Information Network. https://www.khinonline.org/Product-Sevices/HEALTH-INFORMATION-NETWORK.aspx.
  • 27. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310:2191–4.
  • 28. Wang W, Li Y, Yan J. touch: Tools of Utilization and Cost in Healthcare. R package. 2022.
  • 29. Choi E, Bahadori MT, Kulas JA, Schuetz A, Stewart WF, Sun J. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. arXiv:1608.05745 [cs]. 2017.
  • 30. Liu R, Wei L, Zhang P. A deep learning framework for drug repurposing via emulating clinical trials on real-world patient data. Nat Mach Intell. 2021;3:68–75.
  • 31. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80.
  • 32. Zang C, Zhang H, Xu J, Zhang H, Fouladvand S, Havaldar S, et al. High-throughput target trial emulation for Alzheimer’s disease drug repurposing with real-world data. Nat Commun. 2023;14:1–16.
  • 33. Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Commun Stat Simul Comput. 2009;38:1228–34.
  • 34. Zhang Z, Kim HJ, Lonjon G, Zhu Y. Balance diagnostics after propensity score matching. Ann Transl Med. 2019;7:16.
  • 35. Braga AANM, Veiga MLT, Ferreira MGCDS, Santana HM, Barroso U. Association between stress and lower urinary tract symptoms in children and adolescents. Int Braz J Urol. 2019;45:1167–79.
  • 36. Carson CM, Phillip N, Miller BJ. Urinary tract infections in children and adolescents with acute psychosis. Schizophr Res. 2017;183:36–40.
  • 37. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  • 38. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc.; 2017.
