Abstract
Background:
Strategies to detect the presence of suicidal ideation (SI) or characteristics of ideation that indicate marked suicide risk are critically needed to guide interventions and improve care during care transition periods. Some studies indicate that machine learning can be applied to momentary data to improve classification of SI. This study examined whether the classification accuracy of these models varies as a function of type of training data or characteristics of ideation.
Methods:
A total of 257 psychiatric inpatients completed a 3-week battery of ecological momentary assessment and measures of suicide risk factors. The accuracy of machine learning models in classifying the presence, duration, or intensity of ideation was compared across models trained on baseline and/or momentary suicide risk data. Relative feature importance metrics were examined to identify the risk factors that were most important for outcome classification.
Results:
Models including both baseline and momentary features outperformed models with only one feature type, and both feature types provided important information for correctly classifying and differentiating individual characteristics of SI. Models classifying SI presence, duration, and intensity performed similarly.
Limitations:
Results of this study may not generalize beyond a high-risk, psychiatric inpatient sample, and additional work is needed to examine temporal ordering of the relationships identified.
Conclusions:
Our results support using machine learning approaches for accurate identification of SI characteristics and underscore the importance of understanding the factors that differentiate and drive different characteristics of SI. Expansion of this work can support use of these models to guide intervention strategies.
Keywords: ecological momentary assessment, proximal risk
Introduction
Suicide results in 800,000 deaths annually worldwide (National Action Alliance for Suicide Prevention, 2014). Patients are at substantially elevated risk of dying by suicide during critical care transitions (Haglund et al., 2019). Intervening early during periods of increases in suicidal ideation (SI) can prevent a cascade to suicidal behavior (SB). However, the onset of SI can occur relatively quickly (Bryan & Rudd, 2016; Kleiman et al., 2017), which makes delivering interventions in a timely manner challenging. Strategies to detect the presence of SI or characteristics of SI that indicate marked suicide risk are thus critically needed to guide timely, targeted interventions and improve care during this important transitional period.
Advances in intensive longitudinal sampling, such as Ecological Momentary Assessment (EMA), can facilitate characterization and detection of SI. EMA involves sending brief questionnaires to individuals’ mobile phones for completion at different times throughout the day. This method allows for repeated and frequent assessment of experiences as they occur in real-world settings, enabling investigation of short-term changes in both suicide risk processes and characteristics of SI that can fluctuate over periods of minutes to hours (Kleiman et al., 2017). This approach offers advantages over retrospective self-report, such as greater ecological validity and minimal recall bias (Kendall et al., 1999). Momentary assessments have minimal social desirability and self-monitoring effects (Hufford et al., 2002), even when assessing suicidality (Coppersmith, Fortgang, et al., 2022). EMA has been increasingly used in the suicide field to describe the onset and trajectory of SI (Kleiman et al., 2017) and to characterize the phenomenological contexts surrounding SI in real-world settings (Armey et al., 2018).
EMA research has provided several key insights about the experience and characteristics of SI that should inform detection strategies. It shows that the phenomenology of SI is highly dynamic, with the presence and intensity of SI often fluctuating over periods of hours to days (Bryan & Rudd, 2016; Kleiman et al., 2017). While SI has usually been studied as a homogeneous construct, understanding the characteristics of SI may help to distinguish which patients with SI are most likely to make a future suicide attempt (Bryan et al., 2019). Although work in this area is in its early stages, suicide risk profiles are distinguishable by characteristics of SI, such as how intensely and frequently SI occurs (Bryan et al., 2019). These findings underscore the clinical relevance of characterizing the phenomenology surrounding not only the presence of SI, but also characteristics of SI such as its intensity and duration.
Several factors have been linked with risk of SI in general. Characteristics typically measured during initial, or baseline, patient assessments, such as psychiatric diagnoses, suicide risk history, and facets of emotional (e.g., hopelessness), cognitive (e.g., rumination), and behavioral (e.g., impulsivity) functioning, have been linked with current and subsequent SI (Allen et al., 2019; Franklin et al., 2017). However, associations between these “baseline” risk factors and SI tend to be small (Franklin et al., 2017), reflecting the challenge of using historical factors captured at a single point in time to predict an outcome that may be highly variable and influenced by ecological or contextual factors. Recently, there has been an emphasis on identifying momentary factors occurring close in time to SI that may signify the presence of suicide risk states (Galynker et al., 2017). Studies show that momentary risk factors measured via EMA at the state level are short-term correlates of the onset of momentary ideation (Armey et al., 2018; Kleiman et al., 2017). However, whether these factors have similar or differential utility in accurately classifying characteristics of SI (i.e., intensity, duration) in vivo is unclear.
The complex nature of suicide risk has also historically made it challenging to develop models that accurately classify characteristics of suicide risk. Suicide theory suggests that relationships between risk factors and SI are nonlinear and dynamic (Bryan & Rudd, 2016), and there are likely complex interrelationships between risk factors that are important for accurately identifying suicide risk. These complexities are difficult to capture with traditional regression-based approaches, which require these relationships and parameters to be pre-specified. Advanced modeling approaches such as machine learning (ML) are data-driven and designed to maximize model accuracy by handling large numbers of variables and modeling complex associations between predictors and outcome(s). Some meta-analyses show that these data-driven ML approaches are highly promising and can predict SI and SB with up to an 18-fold higher odds ratio than the theory-driven, regression-based models historically used in the field (Schafer et al., 2021). Importantly, however, several studies cited in these meta-analyses have since been criticized, or even retracted, for failing to properly validate their ML models, resulting in overfitting and potentially overstating the benefits of ML in suicide risk classification (Jacobucci et al., 2021; Just et al., 2023). As a result, evidence on the relative benefits of ML-based approaches over simpler approaches (e.g., generalized linear models and other linear models) is limited.
Moreover, most studies using ML have used information about baseline patient characteristics (e.g., psychiatric history) to identify which patients are at increased risk of SI months to years later (Schafer et al., 2021). Only a few studies have successfully applied ML to classify suicide risk from momentary data (e.g., risk factors, features of ideation), but those that did showed that next-day SI in youth (Czyz, Koo, et al., 2021) and near-term SB in adults (Wang et al., 2021) can be predicted from these data. While these studies suggest that ML approaches using momentary data can improve prediction or classification of SI/SB, they have important limitations. First, despite research indicating that there are likely both longstanding/stable and time-varying/dynamic aspects of suicide risk (Bryan & Rudd, 2016), it is unknown whether and how the classification accuracy of short-term SI risk varies across models trained on different types of data (i.e., baseline, momentary, or both). Second, which risk factors are most important in accurately classifying patients at increased short-term risk of SI is unclear. Third, whether model accuracy and/or variable importance differ when classifying different characteristics of SI (i.e., presence, duration, intensity) is unknown. Fourth, whether ML models are superior to simpler models for these aims is unknown. This study examines these research questions to derive information necessary to develop more accurate risk classification models and targeted, timely intervention strategies.
Methods
Participants
Participants were 460 patients at an inpatient psychiatric hospital in the northeastern United States. We recruited inpatients hospitalized for SI (72%) or a suicide attempt (SA; 15%), as well as inpatients with no history of SA and no SI in the month prior to hospitalization (13%). Inclusion criteria were age 18-70, English fluency, and comfort with smartphones. Current psychotic/manic symptoms severe enough to interfere with participation were exclusionary. Analyses included only participants who completed EMA (n = 257). Participants varied in age (M = 40.53, SD = 13.33), and 54% were women. Approximately 88% of participants were white, 6% Black/African American, 2% Asian, and 1% American Indian/Native American. Most were non-Hispanic (91%). Most were single/never married (45%) or divorced/separated (25%).
Procedures
Staff screened patient charts for eligibility. Patients provided informed consent and completed an assessment battery to ascertain eligibility and measure SI risk factors. Interviews were administered by bachelor’s-level staff supervised by a licensed clinical psychologist. Following discharge, participants received EMA prompts to complete brief (<5 minute) assessments scheduled four times a day, at least one hour apart, at random intervals over three weeks. Participants also completed identical, self-initiated assessments during times when they engaged in suicidal or non-suicidal self-injurious behavior or experienced an exacerbation in SI. Study procedures were approved by the Butler Hospital IRB.
Measures
Baseline Risk Factors
Depressive Symptoms.
The Quick Inventory of Depressive Symptomatology assesses the severity of depressive symptoms over the past week (Rush et al., 2003).
Borderline Personality Disorder Symptoms.
The McLean Screening Instrument for Borderline Personality Disorder (Zanarini et al., 2003) screens for symptoms of borderline personality disorder (Cronbach’s alpha = .76).
Negative Attitudes.
The Dysfunctional Attitudes Scale (DAS) (Weissman & Beck, 1978) measures pervasive negative attitudes towards the self, the world, and the future (Cronbach’s alpha = .94).
Childhood Trauma.
The Childhood Trauma Questionnaire (Bernstein et al., 1994) assesses the severity of different types of childhood trauma (Cronbach’s alpha ranged from .75 to .95 across subscales).
Impulsivity.
The Barratt Impulsiveness Scale (Patton et al., 1995) assesses different facets of impulsive tendencies (Cronbach’s alpha = .65-.73).
Emotional Dysregulation.
Trait-level perceived ability to regulate emotions was assessed using the 36-item Difficulties in Emotion Regulation Scale (DERS) (Gratz & Roemer, 2004) (Cronbach’s alpha ranged from .77 to .90 across subscales).
Acquired Capability.
The Acquired Capability for Suicide Scale (Van Orden et al., 2008) (ACSS) assesses fearlessness of death and perceived tolerance for physical pain (Cronbach’s alpha = .33).
Perceived Burdensomeness and Thwarted Belongingness.
The Interpersonal Needs Questionnaire (Van Orden et al., 2012) measures perceptions of burdensomeness and low belongingness (Cronbach’s alphas = .87-.91).
Depressive Rumination.
Tendencies towards brooding and pondering depressive rumination were assessed using subscales from the Response Styles Questionnaire (RSQ) (Nolen-Hoeksema & Morrow, 1991) (Cronbach’s alphas = .65-.80).
Hopelessness.
We used the Beck Hopelessness Scale (Beck, 1988) to assess negative expectations for the future (Cronbach’s alpha = .91).
Suicide Attempt History.
We assessed lifetime frequency of suicide attempts using the Columbia Suicide Severity Rating Scale (C-SSRS) interview (Posner et al., 2008).
Momentary Risk Factors
Momentary risk factors were assessed via EMA prompts delivered via Ilumivu’s HIPAA certified mEMA phone application, which provides a cross-platform (iOS and Android) application for delivery of multiple simultaneous scheduled EMA protocols. Participants completed an average of 33 (SD=31.18) EMA surveys, resulting in 8,412 completed surveys. SI was endorsed in 1,043 (13.10%) EMAs.
Response Context.
Participants reported their location, whether they were alone, and whether they had used substances since the last assessment.
Negative Life Events.
Participants reported whether they had experienced a negative event since the last assessment.
Positive and Negative Affect.
Items measuring positive (e.g., “happy”) and negative affect (e.g., “sad”) were derived from the PANAS-X (Watson & Clark, 1994).
Ruminative thinking and emotional reactivity.
Items assessed current difficulties in emotional regulation from the DERS (Gratz & Roemer, 2004), and ruminative tendencies from the RSQ (Nolen-Hoeksema & Morrow, 1991).
Distress Tolerance.
Participants answered items pertaining to their ability to manage distress from the Distress Tolerance Scale (Simons & Gaher, 2005).
Non-Suicidal Self-Injury.
Participants reported non-suicidal self-injury since the last assessment.
Momentary Outcomes
Suicidal Ideation Characteristics.
Items based on the Modified Scale for Suicide Ideation (Miller et al., 1986) assessed the presence, duration, and intensity of SI since the last assessment. Duration was classified as shorter (SI denied to several minutes) vs. longer (an hour or more to continuously), and intensity as lower (SI denied to weak) vs. higher (strong to very strong).
Data Analytic Strategy
Random Forest Algorithm.
We used random forest (RF) classifiers (Breiman, 2001) to model the data.¹ RF models are classification algorithms made up of ensembles of decision trees. A decision tree models the relationships between predictor variables and an outcome as a series of nodes and branches: each node uses one variable to split the data so as to best separate the classes, and compounding these splits over successive nodes yields class probabilities for new data. In an RF, each tree is trained on a different bootstrap sample of the training data (i.e., a unique dataset generated by randomly resampling the training data with replacement), and each split considers only a randomly selected subset of the available predictor variables. To classify new data, each tree ‘votes’ for one class and the RF selects the class with the majority of votes. Specific rules for tree growing, tree combination, and self-testing make RF models relatively robust to overfitting, outliers, and noisy data, and well suited to non-linear relationships and high-dimensional data (Caruana & Niculescu-Mizil, 2006; Menze et al., 2009). Splits are chosen to maximize the decrease in Gini impurity, and these decreases also provide a measure of variable importance: the Gini importance of a predictor is the total impurity decrease achieved by splits on that predictor, weighted by the proportion of samples reaching each split and averaged across all trees in the ensemble. It can be used to rank the importance of each predictor variable (Colic et al., 2022).
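The split criterion and the Gini importance described above can be illustrated on a toy node. This minimal plain-Python sketch computes the Gini impurity, 1 - sum(p_k^2), of a set of class labels, the weighted impurity decrease produced by one split, and a single tree's normalized importances from a list of split records. The labels, the split, and the `(feature_index, decrease)` record format are hypothetical illustrations, not the study's data or the internals of any library. A forest's Gini importance averages such per-tree values over all trees; scikit-learn's `feature_importances_` additionally normalizes the result to sum to 1.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of all
    training samples (n_total) that reach the parent node."""
    n = len(parent)
    child = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return (n / n_total) * (gini_impurity(parent) - child)

def gini_importance(split_records, n_features):
    """Sum weighted decreases per feature for one tree, then normalize to 1.
    split_records: hypothetical (feature_index, weighted_decrease) pairs."""
    totals = [0.0] * n_features
    for feature, decrease in split_records:
        totals[feature] += decrease
    s = sum(totals)
    return [t / s for t in totals] if s > 0 else totals

# Toy root node: 4 EMA prompts with SI (1) and 4 without (0).
parent = [1, 1, 1, 1, 0, 0, 0, 0]
# A split on a hypothetical feature that partly separates the classes.
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
print(gini_impurity(parent))                                      # 0.5
print(weighted_impurity_decrease(parent, left, right, n_total=8))  # 0.125
print(gini_importance([(0, 0.125), (1, 0.1), (0, 0.025)], n_features=3))
```

A perfect split of this node (all SI prompts on one side) would yield the maximum decrease of 0.5, so features that cleanly separate SI from non-SI prompts near the top of many trees accumulate the largest importances.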
Missing Data.
A special reserved value of −1 was assigned to missing observations. Because no negative values occurred naturally in the data, this value serves as a flag that allows the model to explicitly detect missing data and reason over non-random patterns of missingness. In this application (as opposed to, say, vital-sign features in the electronic health record), we do not expect data to be missing at random, but rather to follow systematic patterns: the fact that an EMA or other item was not answered might itself carry information about the participant’s situational mental state. For this reason, we introduced a separate reserved feature value to indicate missingness rather than hiding it from the ML algorithm via traditional (e.g., mean) imputation.
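A minimal numpy sketch of the reserved-value approach, assuming missing responses arrive as NaN and that all genuine responses are non-negative, as stated above. The helper name `flag_missing` and the toy array are hypothetical. Because the flag lies below every observed value, a single tree split (e.g., value <= -0.5) can isolate the missing responses, which is what lets the forest treat missingness as informative.

```python
import numpy as np

MISSING_FLAG = -1.0  # reserved value outside the range of valid responses

def flag_missing(X, flag=MISSING_FLAG):
    """Replace NaN entries with a reserved flag instead of imputing a mean."""
    X = np.asarray(X, dtype=float)
    observed = X[~np.isnan(X)]
    # The flag must sit below every real response, or missingness becomes ambiguous.
    if observed.size and observed.min() <= flag:
        raise ValueError("flag value overlaps the observed data range")
    return np.where(np.isnan(X), flag, X)

# Two hypothetical EMA rows with one unanswered item each.
ema = np.array([[3.0, np.nan, 1.0],
                [np.nan, 0.0, 4.0]])
print(flag_missing(ema))
# [[ 3. -1.  1.]
#  [-1.  0.  4.]]
```

Inside a scikit-learn pipeline, `SimpleImputer(strategy="constant", fill_value=-1)` performs the same substitution, although it does not check that the flag falls outside the observed range.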
Description of Model Training Conditions.
We trained and compared a separate model for each of three momentary SI outcomes (SI presence, duration, and intensity, rated during EMAs) to assess and compare model performance when different types of data were included. Specifically, for each outcome we trained three models, each including demographic and history variables (i.e., age, sex, and number of lifetime SAs) together with 1) baseline risk factors (e.g., hopelessness, emotional dysregulation, and suicide attempt history), 2) momentary risk factors (e.g., positive and negative affect, and rumination), or 3) both baseline and momentary risk factors. These three models were trained to classify each of our three SI characteristic outcomes: SI presence/absence (Models 1A, 2A, and 3A), SI duration (Models 1B-3B), and SI intensity (Models 1C-3C), resulting in nine RF models.
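The nine training conditions form a 3 x 3 grid of feature sets by outcomes. The sketch below enumerates them with labels that follow Table 1 (rows 1-3 for feature sets, columns a-c for outcomes). All column names are hypothetical placeholders, not the study's variable names; each yielded tuple would feed one model-fitting run.

```python
from itertools import product

# Hypothetical column groups; the real variables are listed under Measures.
DEMOGRAPHIC = ["age", "sex", "lifetime_sa_count"]
BASELINE = ["bhs_hopelessness", "ders_nonacceptance", "qids_depression"]
MOMENTARY = ["ema_hopelessness", "ema_sadness", "ema_rumination"]

FEATURE_SETS = {
    "1": ("BL", DEMOGRAPHIC + BASELINE),
    "2": ("EMA", DEMOGRAPHIC + MOMENTARY),
    "3": ("BL + EMA", DEMOGRAPHIC + BASELINE + MOMENTARY),
}
OUTCOMES = {"A": "si_presence", "B": "si_duration", "C": "si_intensity"}

def training_conditions():
    """Yield (model_id, feature_set_label, feature_columns, outcome_column)."""
    for (set_id, (label, cols)), (out_id, outcome) in product(
            FEATURE_SETS.items(), OUTCOMES.items()):
        yield f"{set_id}{out_id}", label, cols, outcome

conditions = list(training_conditions())
print(len(conditions))     # 9
print(conditions[-1][:2])  # ('3C', 'BL + EMA')
```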
Model Training.
We used a stratified 80/20 split into training and test data: 80% of the data were available to the model for training, and the remaining 20% were withheld from the model and used to evaluate performance and generalization. A 10-fold cross-validation (stratified folds, no repeats) of the training set was used for hyperparameter tuning. A grid search was used to tune three RF hyperparameters: (1) the number of trees in the forest; (2) the maximum number of splits per tree; and (3) the minimum number of data points required to split a node.
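The split-and-tune procedure above can be sketched with scikit-learn on synthetic data. This is an illustration under stated assumptions, not the authors' code: the grid values are made up, "maximum number of splits per tree" is mapped to `max_depth` (it could equally be expressed as `max_leaf_nodes`), and ROC-AUC is assumed as the tuning score because the source does not name one. Oversampling is omitted for brevity; if it is used, it should be applied within each training fold (e.g., with imblearn's `Pipeline`) so synthetic samples never reach the validation folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the EMA data: roughly 12% positive class (SI endorsed).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           weights=[0.88, 0.12], random_state=0)

# Stratified 80/20 split keeps the SI base rate equal in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid; "number of splits" is mapped to max_depth (an assumption).
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, 8],
    "min_samples_split": [2, 5],
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10 stratified folds
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)  # refits the best configuration on all training data

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))  # held-out ROC-AUC
```

This sketch splits at the level of individual observations. Because each participant contributes many EMAs, a participant-level alternative such as scikit-learn's `StratifiedGroupKFold` keeps all of one person's prompts on the same side of a split; the source does not state which level was used.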
Class Imbalance.
Like many clinical datasets, our data show a stark class imbalance skewed towards the negative class, which can impede the performance of machine learning models. To counter these effects, we oversampled the minority class in the training data with the Synthetic Minority Over-sampling Technique (SMOTE; Chawla et al., 2002), which generates synthetic minority-class examples by interpolating between existing ones, to obtain a 1:1 class balance. The class distribution of the test set was not changed and remained at the original clinical incidence rate.
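SMOTE differs from plain random oversampling: rather than duplicating minority rows, it places new points on the line segments between a minority sample and one of its k nearest minority-class neighbors. The numpy sketch below illustrates that core idea for a 1:1 result; the function names are hypothetical and this is not the imblearn implementation (the paper lists imblearn for resampling, where `SMOTE(k_neighbors=5).fit_resample(X, y)` is the library equivalent).

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among minority samples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def balance_with_smote(X, y, minority=1, rng=0):
    """Append synthetic minority rows until both classes have equal counts."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - (y == minority).sum())
    X_syn = smote_sketch(X_min, n_new, rng=rng)
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(n_new, minority)])

# Hypothetical training data: 12 SI-positive rows, 88 negative rows.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.r_[np.ones(12, int), np.zeros(88, int)]
Xb, yb = balance_with_smote(X, y)
print(Xb.shape, np.bincount(yb))  # (176, 3) [88 88]
```

Only the training split is passed through this step; the test split keeps its original base rate, as described above.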
Model Evaluation.
After training, models classified new observations (the previously unseen 20% of the data), and performance was quantified by accuracy, precision, recall/sensitivity, specificity, negative predictive value (NPV), and the area under the receiver operating characteristic curve (ROC-AUC). Results are also depicted visually as confusion matrices (Figure 2).
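The threshold-based metrics all follow from the four cells of a confusion matrix. The sketch below computes them and, assuming the stratified test set keeps the overall EMA base rate, shows how precision, NPV, and accuracy follow from recall, specificity, and prevalence. Using the recall (.74) and specificity (.92) reported in Table 1 for the RF presence model trained on BL + EMA data, and a base rate of 1,043/8,412 EMAs, it approximately recovers that model's reported precision (.57), NPV (.96), and accuracy (.90). It also illustrates why precision stays modest even at high specificity: at a base rate near 12%, a small false-positive rate applied to the large negative class still produces many false positives. ROC-AUC, by contrast, requires the continuous predicted scores (e.g., via `sklearn.metrics.roc_auc_score`).

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix cells (counts or rates)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),   # positive predictive value
        "recall": tp / (tp + fn),      # sensitivity
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),         # negative predictive value
    }

def expected_metrics(recall, specificity, prevalence):
    """Metrics implied by recall and specificity at a given base rate."""
    tp = recall * prevalence
    fn = (1 - recall) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return metrics_from_counts(tp, fp, tn, fn)

m = expected_metrics(recall=0.74, specificity=0.92, prevalence=1043 / 8412)
print({k: round(v, 2) for k, v in m.items()})
# {'accuracy': 0.9, 'precision': 0.57, 'recall': 0.74, 'specificity': 0.92, 'npv': 0.96}
```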
Figure 2.
Confusion matrices for models classifying A) SI Presence (3A), B) SI Duration (3B), and C) SI Intensity (3C).
Variable Importance.
The Gini importance was used to rank order the importance of variables/factors in each model and compare them across models.
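In scikit-learn, ranking features by Gini importance amounts to sorting a fitted forest's `feature_importances_` attribute. The sketch fits a forest on synthetic data with placeholder feature names and prints the top-ranked features; the helper `rank_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def rank_features(model, names, top_k=20):
    """Return (rank, name, importance) for the top_k features by Gini importance."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:top_k]  # descending importance
    return [(r + 1, names[i], float(importances[i])) for r, i in enumerate(order)]

# shuffle=False keeps the 3 informative columns first; the rest are noise.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for rank, name, imp in rank_features(rf, names, top_k=5):
    print(rank, name, round(imp, 3))
```

scikit-learn's documentation cautions that impurity-based importances are computed on training data and can favor features with many distinct values; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check, which may be relevant here because the feature set mixes binary items (e.g., NSSI yes/no) with continuous ratings.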
Additional comparisons.
We also conducted additional analyses in which we evaluated model performance on the training set, alongside the test set, to examine whether the degree of overfitting or underfitting was acceptable. We also ran additional models (i.e., GLM, LDA, and ElasticNet) to compare the RF models with simpler, linear ML models. Details of these experiments are provided in the Supplement, and the performance of these models is presented in Table 1. All experiments were conducted in Python, using the numpy (linear algebra), scikit-learn (machine learning), imblearn (resampling), and matplotlib (plotting) libraries. The GLM and LDA models were fit without hyperparameter tuning. For the ElasticNet, we cross-validated alpha (overall regularization strength, 0.0-5.0, increments of 0.1), the L1 ratio (mixing parameter controlling the relative weight of the L1 vs. L2 penalties, 0.0-1.0, increments of 0.1), the number of iterations (100-1000, increments of 100), and whether to estimate the intercept (yes/no). The selected ElasticNet model, whose performance is reported here, used alpha = 1.3, L1 ratio = 1.0 (i.e., a pure L1/lasso penalty), 800 iterations, and an estimated intercept.
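A comparison of this kind can be sketched with scikit-learn on synthetic, imbalanced data, scoring each model by test-set ROC-AUC. The ElasticNet grid described above uses the parameter names of scikit-learn's `ElasticNet` (`alpha`, `l1_ratio`, `max_iter`, `fit_intercept`), which is a linear regressor; because the source does not say how its output was turned into class predictions, this sketch instead uses an elastic-net-penalized `LogisticRegression` as an assumed classification analogue. Data, settings, and the dictionary keys are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the study data.
X, y = make_classification(n_samples=800, n_features=12, n_informative=5,
                           weights=[0.88, 0.12], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

models = {
    # Logistic GLM; scikit-learn's default adds a mild L2 penalty.
    "GLM": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    # Assumed stand-in for "EN": elastic-net-penalized logistic regression.
    "EN": make_pipeline(StandardScaler(),
                        LogisticRegression(penalty="elasticnet", solver="saga",
                                           l1_ratio=0.5, max_iter=5000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=1),
}

auc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]  # predicted probability of the SI class
    auc[name] = roc_auc_score(y_te, scores)
    print(f"{name}: ROC-AUC = {auc[name]:.3f}")
```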
Table 1.
Models' performance by classification target
| Features | Model | Presence AUC | Presence Rec | Presence Spec | Presence Prec | Presence NPV | Presence Acc | Duration AUC | Duration Rec | Duration Spec | Duration Prec | Duration NPV | Duration Acc | Intensity AUC | Intensity Rec | Intensity Spec | Intensity Prec | Intensity NPV | Intensity Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1) BL | GLM | .68 | .73 | .63 | .22 | .93 | .64 | .70 | .71 | .70 | .11 | .98 | .70 | .72 | .76 | .69 | .19 | .96 | .69 |
| 1) BL | LDA | .68 | .71 | .65 | .22 | .93 | .66 | .72 | .71 | .72 | .12 | .97 | .72 | .74 | .77 | .70 | .20 | .96 | .71 |
| 1) BL | EN | .68 | .71 | .64 | .22 | .93 | .65 | .71 | .71 | .71 | .12 | .98 | .71 | .73 | .75 | .70 | .20 | .96 | .70 |
| 1) BL | RF | .80 | .79 | .80 | .36 | .95 | .80 | .84 | .80 | .88 | .25 | .99 | .87 | .82 | .80 | .83 | .32 | .97 | .83 |
| 2) EMA | GLM | .75 | .71 | .80 | .33 | .95 | .78 | .79 | .74 | .84 | .19 | .98 | .83 | .77 | .74 | .81 | .28 | .97 | .80 |
| 2) EMA | LDA | .75 | .73 | .77 | .31 | .95 | .77 | .79 | .78 | .81 | .18 | .99 | .81 | .78 | .76 | .80 | .27 | .97 | .80 |
| 2) EMA | EN | .76 | .73 | .79 | .32 | .95 | .78 | .79 | .76 | .82 | .18 | .98 | .82 | .77 | .73 | .81 | .28 | .96 | .80 |
| 2) EMA | RF | .80 | .68 | .92 | .55 | .95 | .89 | .81 | .66 | .96 | .49 | .98 | .95 | .80 | .65 | .94 | .52 | .96 | .91 |
| 3) BL + EMA | GLM | .76 | .72 | .81 | .35 | .95 | .80 | .82 | .79 | .85 | .21 | .99 | .84 | .80 | .77 | .83 | .31 | .97 | .82 |
| 3) BL + EMA | LDA | .77 | .74 | .81 | .35 | .95 | .80 | .82 | .80 | .84 | .21 | .99 | .84 | .80 | .79 | .82 | .30 | .97 | .82 |
| 3) BL + EMA | EN | .77 | .74 | .80 | .34 | .95 | .80 | .81 | .78 | .84 | .21 | .99 | .84 | .80 | .77 | .83 | .32 | .97 | .83 |
| 3) BL + EMA | RF | .83 | .74 | .92 | .57 | .96 | .90 | .86 | .75 | .97 | .55 | .99 | .96 | .84 | .74 | .94 | .55 | .97 | .92 |
EMA = Ecological Momentary Assessment; BL = Baseline; GLM = Generalized Linear Model; LDA = Linear Discriminant Analysis; EN = Elastic Net; RF = Random Forest; Acc = Accuracy; AUC = Area Under the Curve; Prec = Precision; Rec = Recall/sensitivity; Spec = Specificity; NPV = Negative Predictive Value
Results
Model Performance.
Accuracy, precision, recall/sensitivity, specificity, NPV, and ROC-AUC for all models are reported in Table 1. Overall, model performance improved as more types of data were made available for training: models trained with both baseline and EMA variables generally outperformed models trained with either baseline or EMA variables alone, with some nuance. For all three SI characteristics, recall/sensitivity improved when baseline data were added to EMA data; however, RF models trained on baseline data alone had higher recall/sensitivity than those trained on both baseline and EMA data, indicating limited improvement in false-negative rates. Models 3A-C (Table 1) performed best across outcomes (recall/sensitivity was the only metric on which they were not optimal) and performed comparably to one another on all metrics (accuracy, precision, recall/sensitivity, specificity, NPV, and ROC-AUC). The best tuning parameter values were: (1) number of trees in the forest = 100; (2) maximum number of splits per tree = 8; and (3) minimum number of data points per split = 2. These hyperparameters were used consistently across all outcomes and datasets.
Important Features.
The 20 most important features (by Gini importance) for Models 3A-C are presented in Figure 1 and Table 2 and discussed below. All models shared the same top five features (momentary hopelessness, sadness, experiencing emotions as overwhelming, having difficulty making sense of feelings, and thinking “why do I always react this way?”). Twelve features were within the top 20 of all three models, 11 of which were EMA-measured and one of which was baseline-assessed (depression severity).
Figure 1.
Random forest feature importance for models classifying A) SI Presence (3A), B) SI Duration (3B), and C) SI Intensity (3C).
Table 2.
Importance of features included in models
| Feature | Presence | Duration | Intensity |
|---|---|---|---|
| EMA-rated Hopelessness | 1 | 1 | 1 |
| EMA-rated Sadness | 2 | 2 | 3 |
| Having Difficulties Making Sense of Feelings | 3 | 4 | 4 |
| Experiencing emotions as overwhelming | 4 | 3 | 2 |
| Thinking “Why do I always react this way?” | 5 | 4 | 5 |
| EMA-rated Happiness | 6 | 5 | 7 |
| EMA-rated Anger at Self | 7 | 8 | 9 |
| Thinking “Why do I have problems others don’t?” | 8 | 10 | 8 |
| EMA-rated Shame | 9 | 9 | 11 |
| Lifetime SA Count | 10 | 15 | >20 |
| EMA-rated Irritability | 11 | >20 | 6 |
| Beck-rated Hopelessness | 12 | >20 | >20 |
| Endorsement of NSSI (yes/no) | 13 | 16 | 15 |
| DERS Non-acceptance of emotions | 14 | >20 | 17 |
| Experiencing Distress as Unacceptable | 15 | 14 | 12 |
| EMA-rated Guilt | 16 | >20 | 18 |
| Negative Event (yes/no) | 17 | >20 | 14 |
| DERS Limited Access to Emotion Regulation Strategies | 18 | >20 | >20 |
| QIDS-rated Depression Level | 19 | 12 | 10 |
| EMA-rated Worry | 20 | 6 | >20 |
| Experiencing Shame Regarding Own Distress | >20 | 13 | 20 |
| Childhood Emotional Abuse | >20 | 17 | 13 |
| Childhood Sexual Abuse | >20 | 18 | 19 |
| EMA-rated Confidence | >20 | 11 | >20 |
| DERS Lack of Emotional Clarity | >20 | 19 | >20 |
| BIS Attentional Impulsivity | >20 | 20 | >20 |
| EMA-rated Excitement | >20 | >20 | 10 |
| Age | >20 | >20 | >20 |
| Sex/Gender | >20 | >20 | >20 |
| Time Since Discharge | >20 | >20 | >20 |
| RSQ Brooding Subscale | >20 | >20 | >20 |
| RSQ Pondering Subscale | >20 | >20 | >20 |
| DERS Lack of Emotional Awareness | >20 | >20 | >20 |
| DERS Difficulties with Goal-Directed Behavior | >20 | >20 | >20 |
| DERS Impulse Control Difficulties | >20 | >20 | >20 |
| BIS Impulsive non-planning | >20 | >20 | >20 |
| BIS Motor Impulsivity | >20 | >20 | >20 |
| Childhood Physical Abuse | >20 | >20 | >20 |
| Childhood Physical Neglect | >20 | >20 | >20 |
| Childhood Emotional Neglect | >20 | >20 | >20 |
| Childhood Positive Family Score | >20 | >20 | >20 |
| BPD Features | >20 | >20 | >20 |
| Dysfunctional Attitudes | >20 | >20 | >20 |
| Acquired Capability for Suicide | >20 | >20 | >20 |
| Thwarted Belongingness | >20 | >20 | >20 |
| Perceived Burdensomeness | >20 | >20 | >20 |
| EMA response type (random vs user initiated) | >20 | >20 | >20 |
| Current Location | >20 | >20 | >20 |
| Isolation (alone vs with others) | >20 | >20 | >20 |
| Substance Use (yes/no) | >20 | >20 | >20 |
| Thinking about recent situation and wishing it had gone better | >20 | >20 | >20 |
| Desire to Avoid Feeling Distressed | >20 | >20 | >20 |
| Experiencing Distress as Unacceptable | >20 | >20 | >20 |
Note. EMA = Ecological Momentary Assessment; SA = Suicide Attempt; NSSI = Non-Suicidal Self-Injury; BPD = Borderline Personality Disorder; DERS = Difficulties in Emotion Regulation Scale; QIDS = Quick Inventory of Depressive Symptomatology; RSQ = Response Styles Questionnaire; BIS = Barratt Impulsiveness Scale
Additional Model Comparisons.
When compared with RF models, other modeling approaches performed comparably or worse on all metrics except recall/sensitivity (see Table 1), on which they slightly outperformed RF models, indicating slightly lower false-negative rates. However, the superior recall/sensitivity of LDA models was offset by their worse performance on all other metrics; their relatively poor precision, reflecting elevated rates of false positives, particularly undercuts their utility.
Discussion
In this study, we applied machine learning methods to baseline and momentary risk factor data to classify the presence, duration, and intensity of momentary SI. To our knowledge, this is the first study to evaluate the utility of both baseline and momentary risk factors in machine learning models classifying different characteristics of momentary SI.
Regardless of SI outcome, models trained with only baseline or only momentary data were outperformed by models trained using both sources. This is consistent with prior EMA research in which models trained with more variables predicted SI better than models trained with fewer variables (Czyz, Koo, et al., 2021; Czyz, Yap, et al., 2021; Wang et al., 2021). Interestingly, models trained with only momentary data tended to outperform models trained with only baseline data, which makes sense given that our outcomes were momentary in nature.
Assessing and comparing the relative importance of variables/factors across SI outcomes (Models 3A-C) yielded a nuanced pattern of results. For all models, momentary hopelessness was the most important feature for classifying SI. All models shared the same top five variables, albeit in different orders: constructs reflecting momentary negative affect (i.e., hopelessness and sadness) and momentary emotional reactivity (i.e., experiencing emotions as overwhelming, having difficulty making sense of emotions, and thinking “why do I always react this way?”). This is in line with existing models of suicidality (Bryan & Rudd, 2016; Selby & Joiner, 2013) and research about the risk processes that indicate acute suicidal crises (Galynker et al., 2017). Surprisingly, constructs frequently studied in relation to suicide risk, including borderline personality disorder symptoms and interpersonal theory-related factors, were not among the 20 most important features in any of the models examined. These findings are consistent with research suggesting that these constructs are useful indicators of longstanding risk for suicide but are not markers of shifts into elevated suicide risk states (Galynker et al., 2017). That most contextual variables assessed via EMA (e.g., location) were not among the top classification features suggests that internal context (i.e., affect and cognitions) is more relevant to identifying characteristics of SI.
Some differences emerged when the top features of the models were compared. All models had the same top five features and shared eight of their top ten, which indexed affect, emotional dysregulation, and momentary ruminative thinking, features previously implicated in momentary SI (Armey et al., 2018). In contrast, childhood maltreatment (emotional and sexual abuse) appeared relevant to SI intensity and duration but not presence. Additionally, variables pertaining to longstanding lack of emotional clarity, trait attentional impulsivity, and momentary (EMA-rated) confidence distinguished the model classifying SI duration. These findings highlight the relevance of trait-like indicators, suggesting that the intensity and duration of SI are more closely related to dispositional, memory, or self-evaluative factors than is SI presence. These results are consistent with research underscoring variation in the patterns and factors associated with different characteristics of SI (Coppersmith, Ryan, et al., 2022).
This diverging pattern of results across models also underscores the importance of studying both baseline and momentary risk factors in relation to state-level manifestations of a range of SI characteristics. Across all models, 4-6 of the 20 most important features were baseline-assessed constructs, although the five most important features in each model were EMA constructs. Taken together with the superior performance of the combined models (3A-C), these findings suggest that while momentary constructs are more robust classifiers of SI, baseline/stable constructs remain relevant, especially for distinguishing different characteristics of SI. Effective characterization of aspects of SI, as well as the best classification across all SI outcomes, therefore appears to result from their combined use.

There were also important differences in performance across modeling approaches (i.e., RF models compared with other modeling approaches). In general, RF models outperformed other modeling approaches on most metrics (accuracy, precision, specificity, NPV, and AUC), indicating fewer false positives and better overall discrimination. Notably, however, RF models’ recall/sensitivity was comparable to or slightly worse than that of other modeling approaches, suggesting that our RF models were slightly more prone to misclassifying individuals as not having suicidal ideation. Given the potentially serious costs of misclassifying suicidal individuals (both false positives and false negatives), efforts to implement these kinds of models in clinical settings should carefully weigh the relative costs and benefits of bias towards false positives or false negatives in the context of the application, and future research in this area should pay particular attention to model sensitivity and consider additional strategies to improve it.
Limitations and future directions
Our findings have several limitations. First, we examined SI, not SB. Additional research identifying the presence of SB is therefore needed. Second, our sample consisted predominantly of white, psychiatrically hospitalized patients, which may limit the generalizability of the findings. Third, replication in an independent dataset is needed to validate the generalizability of our results more robustly. Fourth, there is evidence that repeated assessment of SI has no iatrogenic effects (Coppersmith, Fortgang, et al., 2022; Glenn et al., 2020; Law et al., 2015). Even so, participants may have shown some reactivity to EMA items that could have affected our findings. Fifth, research is needed to examine the temporal ordering of effects. Sixth, we did not examine potential contributions of biological or neuroimaging factors, which is an important direction for future research. Seventh, determining whether or how RF models should be incorporated into clinical practice to aid in risk classification (including in preference to simpler models) is outside the scope of the current study. RF models may not have been interpretable at the local (i.e., single-prediction) level in the past. Current methods, however, allow for both global explanations (e.g., Gini importance) and local explanations (why a particular classification was made for a specific instance) to facilitate model interpretation (Hatwell et al., 2020). Given the novelty of machine learning approaches, additional research is needed to guide effective implementation of such models in clinical practice with low provider burden. Eighth, our RF models had slightly higher false-negative rates, which could be costly in real-world applications. Subsequent work should identify and explore ways to reduce the false-negative rate of classification models.
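Global Gini importance is also known as mean decrease in impurity (Breiman, 2001; Menze et al., 2009). It credits each feature with the impurity reduction achieved by the splits that use it, weighted by the share of training samples that reach each split. The minimal single-tree Python sketch below illustrates that arithmetic. The split records and feature names are hypothetical. A random forest would average these per-tree scores across all trees. The sketch does not implement the local explanation method of Hatwell et al. (2020).

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_importance(splits, n_total):
    """Sum weighted impurity decreases per feature for one tree, normalized to 1.

    Each split is (feature, left_labels, right_labels) for one internal node;
    the node's own labels are the union of its two children.
    """
    scores = defaultdict(float)
    for feature, left, right in splits:
        node = left + right
        n = len(node)
        children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        # Weight the decrease by the fraction of all training samples at this node.
        scores[feature] += (n / n_total) * (gini(node) - children)
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()} if total else dict(scores)


# Hypothetical tree on 8 samples: the root splits on "mood",
# then the impure left child splits on "rumination".
splits = [
    ("mood", [1, 1, 1, 0], [0, 0, 0, 0]),
    ("rumination", [1, 1, 1], [0]),
]
print(gini_importance(splits, n_total=8))  # {'mood': 0.6, 'rumination': 0.4}
```

In scikit-learn, the `feature_importances_` attribute of a fitted `RandomForestClassifier` reports a normalized impurity-based importance of this kind, averaged over the forest's trees.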
Nevertheless, this study represents an important and novel contribution to the literature. It is the first to examine the unique contributions of baseline and momentary risk factors to the classification of multiple characteristics of momentary SI. We found that both baseline and momentary features provide important information for correctly classifying SI and for differentiating its individual characteristics. Our results support the use of machine learning approaches for accurate identification of SI characteristics. They also underscore the importance of understanding the factors that differentiate and drive different characteristics of SI. Expansion of this work can support use of these models to guide intervention strategies.
Supplementary Material
Highlights:
Strategies to detect suicidal ideation are needed for timely intervention.
Machine learning models are relevant for accurately identifying suicidal ideation.
Different risk factors are useful for classifying ideation characteristics.
Both baseline and momentary risk factor training data are essential.
Acknowledgements
The authors would like to thank the research assistants in the CELL lab for their assistance in collecting the data involved in this manuscript.
Funding sources
This study was supported by grants from the National Institute of Mental Health (R01MH095786 and R01MH097741) to M. Armey.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Declarations of interest: None.
The unit of classification in this study (i.e., a single training or test data point) was the individual EMA survey. Participants completed an average of 33 surveys (SD = 31.18), yielding 8,412 data points. This number of data points provides a sufficient sample size for the present analyses.
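Each EMA survey is a separate data point. When baseline and momentary features are combined, a participant's baseline measures are therefore presumably attached to every survey row that participant contributes. The Python sketch below shows one way to build such rows for baseline-only, momentary-only, or combined feature sets. The field names (`pid`, `hopelessness`, `negative_affect`, `si_present`) and all values are hypothetical. This is not the study's preprocessing pipeline.

```python
from collections import Counter
from statistics import mean

FEATURE_SETS = ("baseline", "momentary", "combined")
NON_FEATURE_KEYS = ("pid", "si_present")  # hypothetical ID and outcome fields


def build_rows(baseline, ema_surveys, feature_set="combined"):
    """Return one (pid, features, outcome) tuple per EMA survey.

    baseline: participant ID -> dict of baseline (trait) features.
    ema_surveys: list of dicts with 'pid', momentary features, and 'si_present'.
    """
    if feature_set not in FEATURE_SETS:
        raise ValueError(f"feature_set must be one of {FEATURE_SETS}")
    rows = []
    for survey in ema_surveys:
        features = {}
        if feature_set in ("baseline", "combined"):
            # The same baseline values are attached to each of a participant's surveys.
            features.update(baseline[survey["pid"]])
        if feature_set in ("momentary", "combined"):
            features.update({k: v for k, v in survey.items() if k not in NON_FEATURE_KEYS})
        rows.append((survey["pid"], features, survey["si_present"]))
    return rows


def mean_surveys_per_participant(ema_surveys):
    """Average number of surveys contributed per participant."""
    return mean(Counter(s["pid"] for s in ema_surveys).values())


# Hypothetical toy data: two participants, three surveys.
baseline = {"p01": {"hopelessness": 12}, "p02": {"hopelessness": 5}}
ema = [
    {"pid": "p01", "negative_affect": 3.5, "si_present": 1},
    {"pid": "p01", "negative_affect": 2.0, "si_present": 0},
    {"pid": "p02", "negative_affect": 1.5, "si_present": 0},
]
rows = build_rows(baseline, ema)
print(rows[0])  # ('p01', {'hopelessness': 12, 'negative_affect': 3.5}, 1)
print(mean_surveys_per_participant(ema))  # 1.5
```

Keeping `pid` on each row lets surveys be traced back to the participants who completed them. As a consistency check on the figures above, 8,412 surveys across the 257 participants described in the abstract works out to roughly 32.7 surveys per participant, in line with the reported mean of 33.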
References
- Allen KJ, Bozzay ML, & Edenbaum ER (2019). Neurocognition and suicide risk in adults. Current Behavioral Neuroscience Reports, 1–15.
- Armey MF, Brick L, Schatten HT, Nugent NR, & Miller IW (2018). Ecologically assessed affect and suicidal ideation following psychiatric inpatient hospitalization. General Hospital Psychiatry.
- Beck AT (1988). Beck Hopelessness Scale. The Psychological Corporation.
- Bernstein DP, Fink L, Handelsman L, & Foote J (1994). Initial reliability and validity of a new retrospective measure of child abuse and neglect. The American Journal of Psychiatry, 151(8), 1132–1136.
- Breiman L (2001). Random forests. Machine Learning, 45(1), 5–32.
- Bryan CJ, Rozek DC, Butner J, & Rudd MD (2019). Patterns of change in suicide ideation signal the recurrence of suicide attempts among high-risk psychiatric outpatients. Behaviour Research and Therapy, 120, 103392. https://doi.org/10.1016/j.brat.2019.04.001
- Bryan CJ, & Rudd MD (2016). The importance of temporal dynamics in the transition from suicidal thought to behavior. Clinical Psychology: Science and Practice, 23(1), 21–25.
- Caruana R, & Niculescu-Mizil A (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning.
- Colic S, He JC, Richardson JD, Cyr KS, Reilly JP, & Hasey GM (2022). A machine learning approach to identification of self-harm and suicidal ideation among military and police Veterans. Journal of Military, Veteran and Family Health, 8(1), 56–67.
- Coppersmith DD, Fortgang RG, Kleiman EM, Millner AJ, Yeager AL, Mair P, & Nock MK (2022). Effect of frequent assessment of suicidal thinking on its incidence and severity: high-resolution real-time monitoring study. The British Journal of Psychiatry, 220(1), 41–43.
- Coppersmith DD, Ryan O, Fortgang R, Millner A, Kleiman E, & Nock M (2022). Mapping the timescale of suicidal thinking.
- Czyz E, Koo H, Al-Dajani N, King C, & Nahum-Shani I (2021). Predicting short-term suicidal thoughts in adolescents using machine learning: developing decision tools to identify daily level risk after hospitalization. Psychological Medicine, 1–10.
- Czyz E, Yap J, King C, & Nahum-Shani I (2021). Using intensive longitudinal data to identify early predictors of suicide-related outcomes in high-risk adolescents: Practical and conceptual considerations. Assessment, 28(8), 1949–1959.
- Franklin JC, Ribeiro JD, Fox KR, Bentley KH, Kleiman EM, Huang X, Musacchio KM, Jaroszewski AC, Chang BP, & Nock MK (2017). Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin, 143(2), 187.
- Galynker I, Yaseen ZS, Cohen A, Benhamou O, Hawes M, & Briggs J (2017). Prediction of suicidal behavior in high risk psychiatric patients using an assessment of acute suicidal state: The suicide crisis inventory. Depression and Anxiety, 34(2), 147–158.
- Gratz K, & Roemer L (2004). Multidimensional assessment of emotion regulation and dysregulation: Development, factor structure, and initial validation of the Difficulties in Emotion Regulation Scale. Journal of Psychopathology and Behavioral Assessment, 26(1), 41–54. https://doi.org/10.1023/B:JOBA.0000007455.08539.94
- Haglund A, Lysell H, Larsson H, Lichtenstein P, & Runeson B (2019). Suicide immediately after discharge from psychiatric inpatient care: a cohort study of nearly 2.9 million discharges. The Journal of Clinical Psychiatry, 80(2).
- Hatwell J, Gaber MM, & Azad RMA (2020). CHIRPS: Explaining random forest classification. Artificial Intelligence Review, 53, 5747–5788.
- Hufford MR, Shields AL, Shiffman S, Paty J, & Balabanis M (2002). Reactivity to ecological momentary assessment: an example using undergraduate problem drinkers. Psychology of Addictive Behaviors, 16(3), 205.
- Jacobucci R, Littlefield AK, Millner AJ, Kleiman EM, & Steinley D (2021). Evidence of inflated prediction performance: A commentary on machine learning and suicide research. Clinical Psychological Science, 9(1), 129–134.
- Just MA, Pan L, Cherkassky VL, McMakin DL, Cha C, Nock MK, & Brent D (2023). Retraction note: Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth. Nature Publishing Group; London, UK.
- Kendall PC, Butcher JN, & Holmbeck GN (1999). Handbook of research methods in clinical psychology (2nd ed.). John Wiley and Sons.
- Kleiman EM, Turner BJ, Fedor S, Beale EE, Huffman JC, & Nock MK (2017). Examination of real-time fluctuations in suicidal ideation and its risk factors: Results from two ecological momentary assessment studies. Journal of Abnormal Psychology, 126(6), 726.
- Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, & Hamprecht FA (2009). A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10(1), 1–16.
- Miller IW, Norman WH, Bishop SB, & Dow MG (1986). The Modified Scale for Suicidal Ideation: Reliability and validity. Journal of Consulting and Clinical Psychology, 54(5), 724–725.
- National Action Alliance for Suicide Prevention: Research Prioritization Task Force (2014). A prioritized research agenda for suicide prevention: An action plan to save lives. Rockville, MD: National Institute of Mental Health and the Research Prioritization Task Force.
- Nolen-Hoeksema S, & Morrow J (1991). A prospective study of depression and posttraumatic stress symptoms after a natural disaster: The 1989 Loma Prieta earthquake. Journal of Personality and Social Psychology, 61(1), 115–121. https://doi.org/10.1037/0022-3514.61.1.115
- Patton JH, Stanford MS, & Barratt ES (1995). Factor structure of the Barratt Impulsiveness Scale. Journal of Clinical Psychology, 51(6), 768–774.
- Posner K, Brent D, Lucas C, Gould M, Stanley B, Brown G, Fisher P, Zelazny J, Burke A, & Oquendo M (2008). Columbia-Suicide Severity Rating Scale (C-SSRS). New York, NY: Columbia University Medical Center.
- Rush AJ, Trivedi MH, Ibrahim HM, Carmody TJ, Arnow B, Klein DN, Markowitz JC, Ninan PT, Kornstein S, Manber R, Thase ME, Kocsis JH, & Keller MB (2003). The 16-Item Quick Inventory of Depressive Symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression. Biological Psychiatry, 54(5), 573–583.
- Schafer KM, Kennedy G, Gallyer A, & Resnik P (2021). A direct comparison of theory-driven and machine learning prediction of suicide: A meta-analysis. PLoS ONE, 16(4), e0249833.
- Selby EA, & Joiner TE (2013). Emotional cascades as prospective predictors of dysregulated behaviors in borderline personality disorder. Personality Disorders: Theory, Research, and Treatment, 4(2), 168.
- Simons JS, & Gaher RM (2005). The Distress Tolerance Scale: Development and validation of a self-report measure. Motivation and Emotion, 29(2), 83–102.
- Van Orden KA, Cukrowicz KC, Witte TK, & Joiner TE Jr (2012). Thwarted belongingness and perceived burdensomeness: Construct validity and psychometric properties of the Interpersonal Needs Questionnaire. Psychological Assessment, 24(1), 197.
- Van Orden KA, Witte TK, Gordon KH, Bender TW, & Joiner TE Jr (2008). Suicidal desire and the capability for suicide: Tests of the interpersonal-psychological theory of suicidal behavior among adults. Journal of Consulting and Clinical Psychology, 76(1), 72–83. https://doi.org/10.1037/0022-006X.76.1.72
- Wang SB, Coppersmith DD, Kleiman EM, Bentley KH, Millner AJ, Fortgang R, Mair P, Dempsey W, Huffman JC, & Nock MK (2021). A pilot study using frequent inpatient assessments of suicidal thinking to predict short-term postdischarge suicidal behavior. JAMA Network Open, 4(3), e210591.
- Watson D, & Clark L (1994). The PANAS-X: Manual for the Positive and Negative Affect Schedule – Expanded Form. Unpublished manuscript.
- Weissman AN, & Beck AT (1978). Development and validation of the Dysfunctional Attitude Scale: A preliminary investigation.
- Zanarini MC, Vujanovic AA, Parachini EA, Boulanger JL, Frankenburg FR, & Flennen J (2003). A screening measure for BPD: The McLean Screening Instrument for Borderline Personality Disorder (MSI-BPD). Journal of Personality Disorders, 17(6), 568–573. https://doi.org/10.1521/pedi.17.6.568.25355