Table 2.
Studies on AI-assisted diagnosis in mental health
| Ref. | Subject description | Mental health condition | Aim | AI-based method | Models | Variables | Predictors | Results and accuracy | Conclusions |
|---|---|---|---|---|---|---|---|---|---|
| Chen et al. (2024) | Patients with MDD and healthy controls (n = 156) | MDD | To detect lifetime diagnosis of MDD and nonremission status | Machine learning models combined with natural language processing | | Clinical psychiatric diagnosis and HAMD–17 (cutoff score of 7) | | Artificial neural networks (all variables used as predictors to identify patients with MDD and HAMD–17 > 7) | The prediction performance of artificial neural networks was generally more favorable than that of other machine learning methods for both lifetime MDD diagnosis and nonremission (HAMD–17 > 7) with the fusion of all digital variables |
| Das and Naskar (2024) | Patients with depression and controls (DAIC-WOZ dataset, n = 219; MODMA dataset, n = 52) | Depression | To identify symptoms of depression from individuals' speech and responses | Machine learning models | e.g., SVM, CNN, LSTM, DT | PHQ–8 binary and professional judgment | Variables from an audio spectrogram | DAIC-WOZ dataset accuracy: 90.26%; MODMA dataset accuracy: 90.47% | A novel deep learning-based approach using audio signals for automatic depression recognition demonstrated superior detection accuracy compared with existing methods |
| Maekawa et al. (2024) | Individuals with depressive symptoms and healthy controls (n = 35,628) | Depressive symptoms | To identify individuals with depressive symptoms | Machine learning algorithms | Stochastic gradient descent (evaluated with two feature-selection methods: Bayesian network or Markov blanket) | PHQ–9 | | Bayesian network: AUCs of 0.736, 0.801, and 0.809 in three different datasets (using different variables as predictors) | The Bayesian network feature-selection method outperformed the Markov blanket method; the models emphasized the ability to do usual activities, chest pain, and sleep problems as key indicators for detecting depressive symptoms |
| Yang et al. (2024) | Suicidal ideators and suicide attempters (n = 438) | Suicide attempts | To identify predictors of suicide attempts and suicides | Machine learning | | Number of suicide attempts reported | 136 variables in total | Classical logistic regression (136 variables included): AUC 0.535; elastic net regression (136 variables included): AUC 0.812; classical logistic regression (15 variables included): | Young age, suicidal ideation, previous suicide attempts, anxiety, alcohol abuse, stress, and impulsivity were significant predictors of suicide attempts |
| C Manikis et al. (2023) | Women with highly treatable breast cancer (n = 706) | Depression, anxiety, overall mental health, and QoL | To identify women at risk of poor mental health, declining mental health, and declining global QoL following a diagnosis of breast cancer | Machine learning algorithms | Balanced RF | | | Model A: patients with poor mental health at month 0; Model B: patients with good mental health at month 0; Model C: patients with good QoL at month 0; (i) variables at months 0 and 3 as predictors; (ii) clinical and biological variables at month 0 with other variables at month 6 as predictors; 12-month AUC: | The top predictors of adverse mental health and QoL outcomes include common variables in clusters: negative affect, cancer coping responses/self-efficacy to cancer, sense of control/optimism, social support, lifestyle factors, and treatment-related symptoms |
| Geng et al. (2023) | Patients with MDD and healthy controls (n = 80) | MDD | To optimize initial screening for MDD in both male and female patients | Machine learning algorithms | | PHQ–9 | | SVM: | Feature-importance analysis found that MeanNN, MedianNN, pNN20, and gender were the most important features; HRV parameters during sleep stages can be used to identify patients with MDD |
| Kourou et al. (2023) | Women diagnosed with stage I–III breast cancer with a curative treatment intention (n = 600) | Symptoms of anxiety and depression | To predict adverse mental health outcomes among patients who show a fairly good initial emotional response to the diagnosis and the prospect of cancer treatments | Adaptive machine learning algorithms | Balanced RF | HADS–14 | Sociodemographic, lifestyle, and medical variables and self-reported psychological characteristics recorded at diagnosis and assessed 3 months after diagnosis | Model 1: all variables at months 0 and 3 as predictors; Model 2: excluding mental health and subjective QoL ratings at months 0 and 3; 12-month AUC: | The top predictors of adverse mental health and QoL outcomes include common variables in clusters: negative affect, cancer coping responses/self-efficacy to cancer, sense of control/optimism, social support, lifestyle factors, and treatment-related symptoms |
| Lønfeldt et al. (2023) | Adolescents with mild-to-moderate-severe obsessive-compulsive disorder (n = 9) | Obsessive-compulsive disorder | To detect obsessive-compulsive episodes in the daily lives of adolescents | Machine learning models | | Obsessive-compulsive events marked by participants | Blood volume pulse, external skin temperature, electrodermal activity, and heart rate (calculated from blood volume pulse) | 10-fold random cross-validation | Better performance was obtained when generalizing across time than across patients; generalized temporal models trained on multiple patients outperformed personalized single-patient models; RF and mixed-effects RF models consistently achieved superior accuracy, reaching 70% in random and participant cross-validation |
| Adler et al. (2022) | Patients with schizophrenia, schizoaffective disorder, or psychosis not otherwise specified in treatment, and university students (n = 109) | Mental health symptoms | To explore whether machine learning models can be trained and validated across multiple longitudinal mobile sensing studies (CrossCheck and StudentLife) to predict mental health symptoms | Machine learning algorithms | | EMA | Mobile sensing data on sleep quality and stress | Improved model performance for predicting sleep: | Machine learning models trained across longitudinal mobile sensing datasets generalized and provided a more efficient way to build predictive models of the targeted outcomes, e.g., sleep and stress |
| Chilla et al. (2022) | Patients with schizophrenia and healthy controls (n = 234) | Schizophrenia | To classify schizophrenia and healthy control cohorts using a diverse set of neuroanatomical measures | Machine learning | | Structured Clinical Interview for DSM-IV Disorders–Patient Version; clinical history, existing medical records, and interviews with significant others (e.g., family members, spouse, children) | MRI measures of subcortical volumes, cortical volumes, cortical areas, cortical thickness, and mean cortical curvature | Classification performance was comparable between independent measure sets, with accuracy, sensitivity, and specificity ranging 70%–73%, 73%–81%, and 57%–61%, respectively; employing a diverse set of measures (merged and used in an ensemble) improved accuracy, sensitivity, and specificity to 77%–87%, 79%–98%, and 65%–74%, respectively | Subcortical and cortical measures and ensemble methods achieved better classification performance for people with schizophrenia |
| Hüfner et al. (2022) | Individuals residing in Austria aged ≥16 or in Italy aged ≥18 with confirmed SARS-CoV–2 infection who were not hospitalized (n = 2,050) | Depression, anxiety, overall mental health, and QoL | To identify indicators of poor mental health following COVID–19 outpatient management and to identify high-risk individuals | Machine learning algorithm | RF | PHQ–4; self-perceived overall mental health and QoL rated on a 4-point Likert scale | 201 surveyed demographic, socioeconomic, medical history, COVID–19 course, and recovery parameters | RMSE: 0.15–0.18 (Austrian data) and 0.21–0.23 (Italian data) | Machine learning achieved moderate-to-good performance in mental health risk prediction |
| Jacobson et al. (2022) (also included in the monitoring domain) | Users who made queries related to mental health screening tools on the Microsoft Bing search engine between December 1, 2018, and January 31, 2020 (n = 126,060) | Suicidal ideation, active suicidal intent | To examine the impact and qualities of widely used, freely available online mental health screening on potential benefits, including suicidal ideation and active suicidal intent | Machine learning algorithm | RF | | Exposure to online screening tools and past search behaviors | AUC of: | Websites with referrals to in-person treatment could put persons at greater risk of active suicidal intent; the machine learning prediction accuracy for suicidal ideation and intent was moderate |
| Matsuo et al. (2022) | Pregnant women who delivered at ≥35 weeks of gestation (n = 34,710) | PPD | To develop and validate machine learning models for the prediction of postpartum depression and to compare their predictive accuracy with conventional logistic regression models | Four machine learning algorithms | | EPDS | | AUC assessing predictive accuracy: Model 1 (variables collected in the first to second trimester): | The approach did not achieve better predictive performance than conventional logistic regression models |
| Susai et al. (2022) | Participants from NEURAPRO, aged 13 to 40, who fulfilled one of the CAARMS criteria for an at-risk state (n = 158) | Psychosis: functioning | To investigate the combined predictive ability of blood-based biological markers for functional outcome | Machine learning model | SVM | SOFAS | Clinical predictors: four demographic variables (sex, age, smoking status, BMI) and seven symptom scale scores; biomarker predictors: ten cytokines, 157 proteomic markers, and ten fatty acid markers | Model based on clinical predictors: | The machine learning model based on clinical and biological data poorly predicted functional outcome in clinical high-risk participants |
| Andersson et al. (2021) | Pregnant women aged 18 years or older (n = 4,313) | PPD | To predict women at risk of depressive symptoms at six weeks postpartum from clinical, demographic, and psychometric questionnaire data available after childbirth | Machine learning algorithm | | EPDS | | Accuracy based on the BP dataset: | All machine learning models had similar performance based solely on the BP dataset; there was greater variation in model performance for the combined dataset |
| Du et al. (2021) | College students (n = 30) | Depression | To design a deep learning-based mental health monitoring scheme to detect depression in college students | Deep learning | Convolutional neural network model | Confirmation of the diagnosis of depression based on questionnaires and bodily feelings | EEG signal | The model showed a classification accuracy of 97.54% | The proposed deep learning-based mental health monitoring scheme achieved high accuracy in detecting depression from EEG data |
| Mongan et al. (2021) | | Psychosis | To investigate whether proteomic biomarkers may aid prediction of: | Machine learning algorithms | SVM | For the transition to psychotic disorders in the clinical high-risk group: | Proteomic data from plasma samples | For the transition to psychotic disorders in the clinical high-risk group, model based on clinical and proteomic data: | Models based on proteomic data demonstrated excellent predictive performance for the transition to psychotic disorder in clinical high-risk individuals; models based on proteomic data at age 12 had fair predictive performance for psychotic experiences at age 18 |
| Tsui et al. (2021) | Inpatients and emergency department patients aged 10–75 (n = 45,238) | First-time suicide attempt | To predict first-time suicide attempts from unstructured (narrative) clinical notes and structured EHR data | NLP | | ICD–9 and ICD–10 | Unstructured data (clinical notes): history and physical examination, progress, and discharge summary notes; structured data: demographics, diagnoses, healthcare utilization data, and medications | AUC for prediction windows of 30 days or less: | Using both structured and unstructured data resulted in significantly higher accuracy than structured data alone |
| Maglanoc et al. (2020) | Patients with depression from outpatient clinics and healthy controls (n = 241) | Depression, anxiety | To classify patients and controls, and to predict symptoms of depression and anxiety | Machine learning | Shrinkage discriminant analysis | | Brain components, including cortical macrostructure (thickness, area, gray matter density), white matter diffusion properties (radial diffusivity), and resting-state functional magnetic resonance imaging (fMRI) default mode network amplitude; sex; age | Classifying patients and controls: | Machine learning showed low model performance in discriminating patients from controls and in predicting symptoms of depression and anxiety, but high accuracy for age |
| Tate et al. (2020) | Twins born between 1994 and 1999 (n = 7,638) | Mental health problems: parent-rated emotional symptoms, conduct problems, prosocial behavior, hyperactivity/inattention, and peer relationship problems | To investigate whether various machine learning techniques outperform logistic regression in predicting mental health problems in mid-adolescence | Machine learning algorithms | | Strengths and Difficulties Questionnaire | Birth information, physical illness, mental health symptoms, and environmental factors such as neighborhood and parental income | AUC with 95% interval of: | All models performed with relatively similar accuracy; the machine learning algorithms did not statistically outperform logistic regression |
| Byun et al. (2019) | Patients with MDD and healthy controls matched for age and gender (n = 78) | MDD | To investigate the feasibility of automated MDD detection based on heart rate variability features | Machine learning algorithm | SVM-RFE for feature selection; SVM for classification | HAMD–17 | Heart rate variability features extracted from electrocardiogram recordings | Best AUC of heart rate variability feature selection for: | SVM-RFE marginally outperformed the statistical filter, requiring fewer heart rate variability features for MDD classification |
| Ebdrup et al. (2019) | Antipsychotic-naive first-episode schizophrenia patients and healthy controls (n = 104) | Schizophrenia, schizoaffective psychosis | To investigate whether machine learning algorithms applied to multimodal data can serve as a framework for clinical translation into diagnostic utility | Machine learning algorithms | | Structured diagnostic interview to ensure fulfillment of ICD–10 diagnostic criteria for schizophrenia or schizoaffective psychosis | Four modalities | Unimodal diagnostic accuracy: cognition ranged between 60% and 69%; electrophysiology, sMRI, and DTI ranged between 49% and 56% and did not exceed chance accuracy, where 'chance accuracy' = 56% [58/(46 patients + 58 healthy controls) × 100%]. Multimodal diagnostic accuracy: none of the multimodal analyses combining cognition with one or more of the remaining modalities (electrophysiology, sMRI, DTI) showed significantly higher accuracy than cognition alone (range 51%–68%) | Only cognitive data, and no other modality, significantly discriminated patients from healthy controls; no accuracy gains were obtained by combining cognition with other modalities |
| Jaroszewski et al. (2019) | Koko app users who signed up for the service (n = 39,450) | Mental health crisis: suicide (ideation, plan, and attempt), self-harm, eating disorder, physical abuse, unspecified abuse, emotional abuse, and otherwise unspecified | To develop and evaluate a brief, automated risk assessment and intervention platform designed to increase the use of crisis resources among individuals routed to a digital mental health app who were identified as likely experiencing a mental health crisis | Machine learning classifiers | Recurrent neural networks with word embeddings | A binary classification of "crisis" or "not crisis", with "crisis" defined as possibly at risk of serious, imminent physical harm, either through self-inflicted actions or through abuse by a third party | Semantic content of posts in real time | Performance: | The classifiers demonstrated excellent performance in classifying crisis risk from real-time posts, regardless of whether the posts referred to the writer or to a third party |
| Lyu and Zhang (2019) | Suicide attempters randomly recruited through the hospital emergency and patient registration systems (n = 659) | Suicide attempt | To establish a prediction model based on a back-propagation neural network to improve prediction accuracy | Artificial neural network | Back-propagation neural network | Whether a suicide attempt had been made | Demographic information (e.g., age, gender, education level, marital status), family history of suicide, mental problems, aspiration strain, health status variables, hopelessness, impulsivity, anxiety, depression, suicide attitude, negative life events, social support, coping skills, community environment, etc. | The back-propagation neural network: | The back-propagation neural network prediction model was superior in predicting suicide attempts |
| Simon et al. (2019) | Members of seven health systems with outpatient visits, either to a specialty mental health provider or to a general medical provider when a mental health diagnosis was recorded (n = 25,373) | Suicide death; probable suicide attempt | To evaluate how the availability of different types of health records data affects the accuracy of machine learning models predicting suicidal behavior | Machine learning models | Logistic regression with penalized LASSO variable selection | ICD–9th Revision cause-of-injury codes indicating intentional self-harm (E950–E958) or undetermined intent (E980–E989); ICD–10th Revision diagnoses of self-inflicted injury (X60–X84) or injury or poisoning with undetermined intent (Y10–Y34) | Historical insurance claims data; sociodemographic characteristics (race, ethnicity, and neighborhood characteristics); past patient-reported outcome questionnaires from electronic health records; data (diagnoses and questionnaires) recorded during medical visits | Prediction of suicide attempt following mental health visits: | For prediction of suicide attempts following mental health visits, the model limited to historical insurance claims data performed approximately as well as the model using all available data; for prediction following general medical visits, adding data recorded during visits improved model accuracy |
| Carrillo et al. (2018) | Patients with treatment-resistant depression (n = 35) | Depression | To classify patients with depression and healthy controls with a machine learning algorithm | Natural speech algorithm combined with machine learning | Gaussian naive Bayes classifier | Quick Inventory of Depressive Symptoms | AMT structured interview in which participants were asked to provide specific autobiographical memories in response to specific cue words | Mean accuracy of identifying patients with depression versus controls was 82.85% | The natural speech analysis distinguished patients with depression from healthy controls |
| Liang et al. (2018) | First-episode patients with schizophrenia or MDD and demographically matched healthy controls (n = 577) | Schizophrenia, MDD | To investigate the accuracy of neurocognitive graphs in classifying individuals with first-episode schizophrenia and MDD in comparison with healthy controls | Machine learning algorithm | Graphical LASSO logistic regression | | Neurocognitive graphs based on cognitive features including general intelligence, immediate and delayed logical memory, processing speed, visual memory, planning, shifting, and psychosocial functioning | Classification accuracy of: | The machine learning algorithm achieved moderate accuracy in classifying first-episode schizophrenia and MDD against healthy controls; classification accuracy between first-episode schizophrenia and MDD was substantially lower |
| Xu et al. (2018) | Postmenopausal obese or overweight early-stage breast cancer survivors participating in a weight loss treatment (n = 333) | Depression and QOLm | To elicit bio-behavioral pathways implicated in obesity and health in breast cancer survivorship | Machine learning | Bayesian networks | | | Insomnia predicted depression with: | Higher levels of insomnia were associated with higher levels of depression; worse depression and sleep were associated with poorer QOLm |
| Cook et al. (2016) | Adults discharged after self-harm from emergency services or after a short hospitalization (n = 1,453) | Suicidal ideation, heightened psychiatric symptoms | To develop and employ a predictive algorithm on a free-text platform (i.e., physician notes in EHRs, texts, and social media) to predict suicidal ideation and heightened psychiatric symptoms | Machine learning algorithm | NLP | Suicidal ideation assessed by the question "Have you felt that you do not have the will to live?"; heightened psychiatric symptoms measured by the GHQ–12 | Structured items (e.g., relating to sleep and well-being); responses to one unstructured question, "How do you feel today?" | Suicidal ideation: | NLP-based models generated relatively high predictive values based solely on responses to a simple general mood question |
| Pestian et al. (2016) | Suicidal (intervention group) or orthopedic (control group) teenage patients aged 13 to 17 admitted to the emergency department (n = 61) | Suicidal ideation | To evaluate whether machine learning methods discriminate between conversations of suicidal and non-suicidal individuals | NLP | SVM | | Language | 96.67% of classifications accurately matched the gold-standard C-SSRS | Machine learning methods accurately distinguished between suicidal and non-suicidal teenagers |
| Setoyama et al. (2016) | Patients with any depressive symptoms (HAMD–17 > 0), both medicated and medication-free (n = 115) | Depression and suicidal ideation | To create a more objective system for evaluating the severity of depression, especially suicidal ideation | Machine learning | Partial least squares regression; logistic regression; support vector machine; random forest | | Aqueous metabolites in blood plasma | Each model evaluating the severity of depression showed a fairly good correlation: R² = 0.24 (PHQ–9) and R² = 0.263 (HAMD–17); the three models discriminating depressive patients with or without SI showed a true rate > 0.7 | Plasma metabolome analysis is a useful tool for evaluating the severity of depression; an algorithm estimating the grade of SI from only a few metabolites was successfully created |
| Schnack et al. (2014) | Schizophrenia patients, bipolar disorder patients, and healthy controls selected from a database (n = 334) | Schizophrenia and bipolar disorder | To classify patients with schizophrenia, patients with bipolar disorder, and healthy controls on the basis of their structural MRI scans | Machine learning algorithms | Three SVMs: | DSM-IV criteria for schizophrenia; DSM-IV criteria for bipolar disorder | Gray matter density | M(sz-hc): | Models based on gray matter density separated schizophrenia patients from healthy controls and bipolar disorder patients with high accuracy, and separated bipolar disorder patients from healthy controls with much lower accuracy |
| Marquand, Mourão-Miranda, Brammer, Cleare, & Fu (2008) | Patients meeting criteria for major depression in an acute episode of moderate severity with a minimum score of 18 on the 17-item HRSD; healthy controls with no history of psychiatric disorder, neurological disorder, or head injury resulting in loss of consciousness, and an HRSD score < 7 (n = 40) | Depression | To examine the sensitivity and specificity of the diagnosis of depression achieved with the neural correlates of verbal working memory | Machine learning algorithms | SVM | | fMRI data | Accuracy of 68%, with sensitivity of 65% and specificity of 70%, using the blood oxygenation level-dependent convolution model at the mid-level of difficulty, corresponding to a distributed network of cerebral regions involved in verbal working memory | The functional neuroanatomy of verbal working memory provides a statistically significant but clinically moderate contribution as a diagnostic biomarker for depression |
Abbreviations: AUDIT: Alcohol Use Disorder Identification Test; ALSPAC: Avon Longitudinal Study of Parents and Children; AMT: Autobiographical memory test; AUROC and AUC: Area under the receiver operating characteristic curve; BACS: Brief Assessment of Cognition in Schizophrenia; BAI: Beck Anxiety Inventory; BIS-11: Barratt Impulsiveness Scale-11; BP: Background, medical history, and pregnancy/delivery variables; CAARMS: Comprehensive Assessment of At-Risk Mental State; CANTAB: Cambridge Neuropsychological Test Automated Battery; CNN: Convolutional Neural Network; CPTB: Copenhagen Psychophysiology Test Battery; C-SSRS: Columbia Suicide Severity Rating Scale; DART: Danish Adult Reading Test; DRF: Distributed Random Forests; DSM-IV: Diagnostic and Statistical Manual of Mental Disorders-IV; DT: Decision tree; DTI: Diffusion tensor imaging; EEG: Electroencephalogram; EHR: Electronic Health Record; EMA: Ecological momentary assessment; EPDS: Edinburgh Postnatal Depression Scale; ERTC: Extremely randomized trees classifier; ETI-SF: Early Trauma Inventory-Short Form; EU-GEI: European Network of National Schizophrenia Networks Studying Gene–Environment Interactions; EXGB: Ensemble of extreme gradient boosting; fMRI: Functional magnetic resonance imaging; GBRT: Gradient Boosting Regression Trees; GHQ: General Health Questionnaire; HADS: Hospital Anxiety and Depression Scale; HAMD and HRSD: Hamilton Rating Scale for Depression; ICD: International Classification of Diseases; Koko: An online peer-to-peer crowdsourcing platform that teaches users cognitive reappraisal strategies, which they use to help other users manage negative emotions; LASSO: Least absolute shrinkage and selection operator; LSTM: Long Short-Term Memory; MDD: Major Depressive Disorder; M.I.N.I.: Mini-International Neuropsychiatric Interview; MRI: Magnetic resonance imaging; p: p-value; NEURAPRO: A clinical trial conducted between March 2010 and the end of September 2014 that tested the potential preventive role of omega-3 fatty acids in clinical high-risk participants; NLP: Natural Language Processing; NPV: Negative predictive value; PHQ: Patient Health Questionnaire; PPD: Postpartum depression; PPV: Positive Predictive Value; QoL: Quality of life; QOLm: Mental quality of life; RBC: Rank-biserial correlation; RF: Random Forest; RMSE: Root mean square error; RS: Resilience-14; SOC: Sense of Coherence-29; VPSQ: Vulnerable Personality Scale Questionnaire; SE: Regression coefficients; SF-36: 36-Item Short Form Survey; SIQ: Suicidal Ideation Questionnaire; sMRI: Structural magnetic resonance imaging; SOFAS: Social and Occupational Functional Assessment Score; SQ for KNHANES-SF: Stress Questionnaire for Korean National Health and Nutrition Examination Survey-Short Form; SVC and SVM: Support Vector Machine; SVM-RFE: Support Vector Machine Recursive Feature Elimination; UQ: Ubiquitous Questionnaire; WAIS III: Wechsler Adult Intelligence Scale® – Third Edition; W: Wilcoxon signed-rank test (one-sided) statistics; XRT: Extreme randomized forest.