Abstract
Objective:
Psychiatric evaluation suffers from subjectivity and bias, and is hard to scale due to intensive professional training requirements. In this work, we investigated whether behavioral and physiological signals, extracted from tele-video interviews, differ between individuals with psychiatric disorders and controls.
Methods:
Temporal variations in facial expression, vocal expression, linguistic expression, and cardiovascular modulation were extracted from simultaneously recorded audio and video of remote interviews. Averages, standard deviations, and Markovian process-derived statistics of these features were computed from 73 subjects. Four binary classification tasks were defined: detecting 1) any clinically-diagnosed psychiatric disorder, 2) major depressive disorder, 3) self-rated depression, and 4) self-rated anxiety. Each modality was evaluated individually and in combination.
Results:
Statistically significant feature differences were found between psychiatric and control subjects. Correlations were found between features and self-rated depression and anxiety scores. Heart rate dynamics provided the best unimodal performance, with areas under the receiver operating characteristic curve (AUROCs) of 0.68–0.75, depending on the classification task. Combining multiple modalities provided AUROCs of 0.72–0.82.
Conclusion:
Multimodal features extracted from remote interviews revealed informative characteristics of clinically diagnosed and self-rated mental health status.
Significance:
The proposed multimodal approach has the potential to facilitate scalable, remote, and low-cost assessment for low-burden automated mental health services.
Keywords: Telehealth, Digital Biomarker, Multimodal, Depression, Anxiety, Mental Health, Remote Photoplethysmography, Computer Vision, Foundation model, Machine Learning
I. Introduction
The World Health Organization estimated that 13% of the world population, or close to one billion people worldwide, live with a mental disorder, most of whom do not have access to effective care [1]. In addition to being the second most common cause of years of life lived with disability worldwide [2], this crisis of psychiatric disorders translates to an economic burden of $280 billion every year in the United States alone [3]. To reduce the financial cost and to delay the transition into often chronic or life-long psychiatric conditions, it is critical to gain a better understanding of those disorders and to provide an objective, fast, and accessible evaluation of them to enable early and effective interventions. However, the present diagnosis and phenotyping of psychiatric disorders fail to fully satisfy this dire need due to their subjectivity and biases, and access to psychiatric care is limited even in high-income countries such as the US [4].
Current clinical practice diagnoses psychiatric disorders such as depression and anxiety disorders using the subjective clinical evaluation of signs and symptoms specified by the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [5] or the International Classification of Diseases, 10th revision [6]. These diagnostic criteria often suffer from low inter-rater reliability. In the DSM-5 field trials [7], inter-rater reliability (Cohen’s kappa, κ) was just 0.28 for a diagnosis of major depressive disorder (MDD) and 0.20 for generalized anxiety disorder (GAD). Factors such as differences in training, biases (race, gender, culture), and interview style were the most common explanations for discrepancies between raters [8], [9]. Self-rated questionnaires such as the General Anxiety Disorder-7 (GAD-7) [10] and the Patient Health Questionnaire-9 (PHQ-9) [11] are also widely used in practice for initial screening and symptom monitoring. Naturally, these scales are highly subjective because they are self-reported: symptoms tend to be over-reported and rated as more severe than in observer ratings, and the results depend heavily on subjective response processes [12].
The rapid development of objective automated digital assessment tools could potentially aid clinicians in diagnosing and evaluating mental illness [13]. These tools help address the potential bias and inaccuracy of current diagnostic practice by providing objective and more quantitative measurements of behavioral and physiological symptoms. Research groups have developed tools using various data modalities, validated in numerous mental health populations, including depression [14], [15], anxiety [16], schizophrenia [17], and posttraumatic stress disorder (PTSD) [18]. Diverse modalities of signals have been investigated, including behavioral signals, such as facial and body movements [14], [15], [19], [20], speech acoustics [21]–[23], verbal or written content [24], and sleep [25] and activity [18], [26] patterns, as well as physiological signals such as cardiovascular (heart rate [18], [27], electrocardiogram [23], [28]) and neural signals (electroencephalogram [29], [30], functional magnetic resonance imaging [31], [32], and functional near-infrared spectroscopy [33]). The multimodal approach, or the combination of multiple types of signals, has been widely adopted to improve the accuracy and robustness of these automated assessments [34], [35]. For example, [36], [37] combined behavioral signals, including cues from video, audio, and text, while others [18], [38] found the combination of, and interaction between, physiological and behavioral signals useful in evaluating disorders.
While the findings of the above studies are promising, unsolved challenges remain: data in most of them were collected in a lab-controlled environment and/or with specialized hardware, which limits accessibility and may not generalize to actual clinical practice. The increasing use of telemedicine in psychiatry in recent years, further accelerated by the COVID-19 pandemic [39], provides a promising approach to improving the access to and effectiveness of psychiatric care [40]–[42], while also presenting an unprecedented opportunity to collect data for developing objective psychiatric assessments without the limitations of geographical location and specialized hardware [43]. This raises the question of whether data collected remotely, such as in [44], [45], and in our previous research protocol [13], can provide a level of information comparable to data collected in a lab-controlled environment.
To address these challenges, we investigated whether behavioral and physiological signals, individually and in combination, extracted from audio-visual recordings of remote telehealth interviews collected using heterogeneous generic electronic devices (laptops, tablets, or smartphones), were informative in assessing multiple facets of psychiatric disorders in control subjects and subjects with mental health conditions (MHC). Specifically, we evaluated the differences in the behavioral and physiological features between diagnostic groups and studied whether mental health conditions could be accurately assessed using those features. Classification rather than regression tasks were used because the Mini-International Neuropsychiatric Interview (MINI) [46] was the primary diagnostic tool in this study, which resulted in binary categorizations (control vs. MHC).
The main contributions of this work are as follows: (1) We showed that audio-visual recordings of interviews collected fully remotely and without device restrictions could be used to assess mental health states, with performance similar to that reported in previous studies where data were collected in lab-controlled environments. (2) We proposed a multimodal machine learning analysis framework in which we extracted both hand-crafted features and self-supervised-learned representations of facial, vocal, linguistic, and remote photoplethysmography (rPPG) patterns using signal processing approaches and state-of-the-art deep learning models, including convolutional neural networks (CNNs) and transformer-based [47] foundation models. (3) Using those features and their derived temporal dynamics, we compared the performance of features extracted from different modalities with different models, as well as the performance of combined features from multiple modalities, in classifying states of depression, anxiety, and the absence of any diagnosed disorder, using both self-reported scales (PHQ-9, GAD-7) and diagnoses made by clinicians.
II. Dataset
A. Participants
The overall recruitment protocol can be found in Cotes et al. [13]; it was designed to recruit three outpatient groups: 50 schizophrenia patients, 50 unipolar major depressive disorder patients, and 50 individuals with no psychiatric history. Due to the difficulty of recruiting enough schizophrenia subjects in person during COVID-19, this work focused on analyzing subjects recruited as control and depressed subjects. A total of 84 subjects had been recruited as of July 17th, 2023, excluding schizophrenia subjects. The Emory University Institutional Review Board and the Grady Research Oversight Committee granted approval for this study (IRB# 00105142). Interviewees were recruited from Research Match (researchmatch.org), a National Institutes of Health-funded online recruitment strategy designed to connect potential participants to research studies, and through Grady’s Behavioral Health Outpatient Clinic utilizing a database of interested research participants. Participants were aged 18–65 and were native English speakers. For the initial screening, interviewees were recruited for either a control group (no history of mental illness within the past 12 months) or a group currently experiencing depression. All diagnoses and group categorizations were verified and finalized by the overseeing psychiatrist and clinical team after the semi-structured interview.
Two subjects did not meet the inclusion criteria based on the information shared during the interview. Interviews from four subjects were accidentally interrupted or unrecorded due to technical issues with the subjects’ devices, and the recorded audio or video files from five subjects were corrupted or led to signal extraction errors in certain modalities (for example, rPPG extraction errors due to a large percentage of facial occlusion caused by a large yaw angle). Hence, data from 73 subjects were included in the analyses. Table I shows the demographics of the included participants.
TABLE I.
Demographics of the subjects grouped by diagnoses
| | Controls | MHC |
| --- | --- | --- |
| Number of Subjects | 22 | 51 |
| Age (Years) | 42.7 ± 14.0 | 36.6 ± 13.2 |
| Gender (M/F/NB/NA) | 9/13/0/0 | 10/38/2/1 |
| Race (W/B/A/H/O/NA) | 10/7/2/0/2/1 | 28/10/9/2/2/0 |
| Years of Education† | 17.3 ± 4.6 | 16.7 ± 2.5 |
± indicates the standard deviation of the measured variable. Subjects with current mental health conditions or a history of diagnosis within 12 months were grouped as “MHC”, while the rest were considered “Controls”. For gender, “M” = male, “F” = female, “NB” = non-binary, and “NA” = no answer. For race, “W” = white, “B” = Black, “A” = Asian, “H” = Hispanic, “O” = more than one race, and “NA” = no answer. The year of education indicates the number of academic years a person completed in formal programs. High school completion usually corresponds to 12 years of education, whereas college completion usually corresponds to 16 years.
† Education levels for two subjects were not recorded; the last row is therefore based on 22 Controls and 49 MHC individuals. No significant differences (Mann-Whitney, p > 0.05) were found in age or years of education between Controls and MHC.
B. Interviews and measurements
The study team created the interview guide and protocol, which include components that simulate a psychiatric intake interview [13]. All interviews were conducted remotely via Zoom’s secure, encrypted, HIPAA-compliant telehealth platform. Both video and audio were recorded. The remote interview was divided into three parts: 1) a semi-structured interview composed of a series of open-ended questions, a thematic apperception test [48], a phonetic fluency test [49], and a semantic fluency test [50]; 2) a sociodemographic section; and 3) clinical assessments, which included the MINI 6.0 [46], the McGill Quality of Life Questionnaire [51], the General Anxiety Disorder-7 [10], and the Patient Health Questionnaire-9 [11].
C. Categorization
Subjects were categorized into four different two-class categorizations based on self-rated scales or clinicians’ diagnoses to evaluate feature performance in classifying categorizations generated under different assessment procedures. The mental health assessment was formulated as classification tasks to align with clinical practice and mental health screening paradigms.
The first and primary categorization is control (n=22) vs. subjects with mental health conditions (MHC, n=51), based on diagnoses made using the MINI. The characteristics of the two groups can be found in Table I. The latter group included subjects with any current mental health condition or a history of diagnosis within the past 12 months, including MDD, comorbid or primary GAD, PTSD, panic disorder, social anxiety, agoraphobia, psychotic disorders, manic illnesses, personality disorders, and obsessive-compulsive disorder. The control group included the remaining subjects, who could have mild suicidality, mild agoraphobia, mild substance abuse and dependence, or a remote history (not in the previous 12 months) of MDD while not currently on an antidepressant medication.
The following three categorizations only included a subset of subjects due to inclusion/exclusion criteria and missing self-rating results. The self-reported scales were also dichotomized to align with the primary categorization for easier performance comparison and cross-categorization analysis.
The second categorization is non-MDD-control (n=18) vs. MDD (n=38, past or current). Since both groups in the first categorization are heterogeneous, we used this categorization to further assess whether differences could be found between controls and subjects with past or current MDD, diagnosed using the MINI and supported by self-reported PHQ-9 scores. In this case, we defined non-MDD-controls as subjects with no lifetime history of MDD or other mental health conditions (but possibly with mild suicidality, mild agoraphobia, or mild substance abuse and dependence), while the MDD subjects had primary diagnoses of MDD but could have comorbid GAD, PTSD, panic disorder, social anxiety, agoraphobia, or substance use disorder.
The third categorization is moderately depressed (PHQ-9 scores > 10, n=24) vs. rest (PHQ-9 scores <= 10, n=43). PHQ-9 scores were not reported for six subjects, resulting in 67 subjects in this categorization. To evaluate performance in classifying the severity of self-rated depression symptoms, we used a PHQ-9 score-based categorization and adopted a cutoff of 10, which indicates moderate depression [11].
The fourth categorization is moderate anxiety (GAD-7 scores > 10, n=16) vs. rest (GAD-7 scores <= 10, n=49). GAD-7 scores were not reported for 8 subjects, resulting in 65 subjects in this categorization. Similar to the third categorization, we used a GAD-7 score-based categorization and adopted a cutoff of 10, which indicates moderate anxiety and a reasonable cut for identifying cases of GAD [10], to evaluate performance in classifying the severity of self-rated anxiety symptoms.
III. Methods
A. Multimodal feature extraction
Figure 1 shows the proposed multimodal analysis framework, which extracts visual, vocal, language, and rPPG time series at the frame or segment level, summarizes those time series with statistical and temporal dynamic features at the subject level (except for the text embedding from the large language model, where the model directly generates a subject-level embedding), and evaluates the performance of these features in the clinical diagnosis and self-rated severity classification tasks described in Section II-C.
Fig. 1. Overview of the processing pipeline.
Color-dashed boxes denote features from different modalities, including physiological, visual, audio, and language features. The audiovisual recordings were first preprocessed with face detection and segmentation for each frame, automatic transcription of the patient-side audio, and audio resampling and segmentation. Then, low-level and subject-level features from the various modalities were extracted and used for the classification of mental health conditions. Abbreviations are as follows: “AWS” denotes Amazon Web Services; “rPPG” denotes the remote photoplethysmogram extracted from the face [52]; “DINOv2” [53], “WavLM” [54], “LLAMA” [55], and “RoBERTa” [56] are the foundation models for each modality; “HMM” denotes the hidden Markov model; “MHC” refers to subjects with mental health conditions; “MDD” refers to subjects with major depressive disorder; “PHQ” and “GAD” refer to the Patient Health Questionnaire-9 and General Anxiety Disorder-7 scores, respectively. Please refer to Section II-C for more details.
1). Facial expressions and visual patterns:
We followed the CNN-based facial expression analysis framework proposed in our previous work [14], [57]. For each frame of the recordings, sampled at 1 Hz (one frame per second), the face of the participant was detected with RetinaFace [58] using a ResNet-50 [59] backbone network trained on the WIDER face dataset [60]. The face detector achieved an accuracy of 95.5% on the “Easy” validation set of the WIDER face dataset, where the faces are much more difficult to detect than those in our use case. The segmented face was fed into another CNN with the VGG19 [61] structure, trained on the AffectNet dataset [62], to estimate facial emotion probabilities for seven categories: neutral, happy, sad, surprised, fearful, disgusted, and angry. The AffectNet dataset and the Radboud Faces Database (RaFD) [63] were used to test this facial emotion classifier, which achieved an accuracy of 63.3% on the AffectNet evaluation set and 90.1% on RaFD.
To include facial behaviors less affected by cultural differences, we adopted JAA-Net [64] to recognize 49 facial landmarks and 12 facial action units [65] (AUs, the individual components of facial muscle movement) expressed in each frame. JAA-Net is a deep learning model that combines a CNN with an adaptive attention module; it achieved an average AU detection accuracy of 78.6% (covering AU1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, and 24) and a face alignment mean error of 3.8% of the inter-ocular distance on the BP4D dataset [66] with three-fold cross-validation.
In addition to manually defined facial expression signals, including facial emotions, AUs, and facial landmark movements, a self-supervised large vision foundation model named DINOv2 [53] was used to extract general visual embeddings of the segmented facial area. While video foundation models perform better on short video clips, an image foundation model was used because the videos recorded in this study were far longer (roughly one hour vs. a few seconds). DINOv2 is a vision transformer (ViT) [67] with one billion parameters trained on 1.2B unique images that achieved strong performance on video classification tasks with linear evaluation, including an accuracy of 90.5% on the UCF-101 dataset [68]. A 1024-dimensional visual embedding was generated from frames sampled at 1 Hz using the “ViT-L/14” [67] model.
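As an illustration, a minimal sketch of the frame-level embedding extraction is shown below, assuming the publicly released DINOv2 checkpoint on torch.hub and standard ImageNet-style preprocessing; the exact preprocessing of the face crops used in this study is not reproduced here.

```python
# Minimal sketch: 1024-d DINOv2 (ViT-L/14) embedding for one face crop sampled at 1 Hz.
# Assumes the public facebookresearch/dinov2 torch.hub release and ImageNet-style
# normalization; face crops come from the RetinaFace step described above.
import torch
from torchvision import transforms
from PIL import Image

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # 224 px = 16 patches of 14 px for ViT-L/14
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_face_frame(face_crop_path: str) -> torch.Tensor:
    """Return the 1024-dimensional global embedding of one segmented face frame."""
    img = preprocess(Image.open(face_crop_path).convert("RGB")).unsqueeze(0)
    return dinov2(img).squeeze(0)        # global (CLS-token) embedding, shape (1024,)
```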
2). Language sentiments and representations:
The patient-side audio files were transcribed into texts using Amazon Transcribe on HIPAA-compliant Amazon web services (AWS) at Emory, following the protocol detailed in our previous study [69]. Similar to the audio analysis, only patient-side transcripts during the semi-structured interview section were used to avoid using subjects’ answers to sociodemographic or clinical assessment questions.
We have previously found different word use patterns in subjects with and without MDD using the Linguistic Inquiry and Word Count (LIWC-22) dictionary [70]. Here, large language models (LLMs) were used to identify sentiments and extract general representations to better understand the subjects’ linguistic patterns. More specifically, three LLMs were used: (1) At the utterance level, a distilled RoBERTa model [56], [71] finetuned on 80% of 20k emotional texts (the remaining 20% was used as the test set, with an average accuracy of 66%) was used to recognize one of seven emotions: neutral, happiness, sadness, surprise, fear, disgust, and anger. (2) Also at the utterance level, another RoBERTa-based model, finetuned on 15 diverse review datasets with a leave-one-dataset-out accuracy of 93.2% [72], was used to recognize positive or negative sentiment. Such fine-tuned utterance-level deep learning models have been found to generate effective representations in related contexts such as anxiety [73]. (3) LLAMA-65B [55], a state-of-the-art open-sourced decoder-only transformer model with 65 billion parameters trained on over one trillion tokens of text, was used to generate an 8192-dimensional text embedding for the entire transcript of the semi-structured interview.
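A minimal sketch of the utterance-level scoring is shown below. The two Hugging Face checkpoints are public models matching the descriptions above (a distilled RoBERTa emotion classifier and a review-tuned RoBERTa sentiment classifier); they are assumptions rather than the authors' exact fine-tuned weights.

```python
# Sketch: per-utterance emotion and sentiment scoring with off-the-shelf RoBERTa models.
# The model ids are assumptions consistent with the text, not the authors' exact weights.
from transformers import pipeline

emotion_clf = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # 7 emotion classes
    top_k=None,                                             # return all class scores
)
sentiment_clf = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",        # positive / negative
)

def utterance_sentiments(utterances):
    """Return per-utterance emotion score lists and binary sentiment labels."""
    emotions = emotion_clf(utterances)      # [[{label, score} x 7], ...] per utterance
    sentiments = sentiment_clf(utterances)  # [{label, score}, ...] per utterance
    return emotions, sentiments

# Example: utterance_sentiments(["I have been feeling tired and down lately."])
```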
3). Vocal features and representations:
Both manually defined acoustic features and general audio representations were extracted from audio files. Only patient-side audio during the semi-structured interview section was used to avoid the potential information leak directly from subjects’ answers to sociodemographic or clinical assessment questions in MINI or in self-rated questionnaires described in section II-B.
For manually defined features, the pyAudioAnalysis [74] package was used to extract acoustic features in 100 ms windows with 50% overlap, including the zero crossing rate, energy, entropy of energy, spectral centroid/spread/entropy/flux/rolloff, Mel frequency cepstral coefficients, and a 12-dimensional chroma vector, along with the corresponding standard deviations. WavLM [54], a self-supervised audio foundation model with 316M parameters (“WavLM Large”) trained on 94k hours of audio, was used to extract general audio representations; it has shown state-of-the-art performance on the universal speech representation benchmark [75]. Recorded audio files were first resampled to 16 kHz and then segmented into non-overlapping 20 ms segments following [54]. A 1024-dimensional audio embedding was generated for each 20 ms segment using WavLM.
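The sketch below illustrates both feature paths, assuming the pyAudioAnalysis short-term feature API and the public microsoft/wavlm-large checkpoint; the input normalization and the chunking of hour-long recordings are simplifying assumptions.

```python
# Sketch: hand-crafted acoustic features (100 ms windows, 50% overlap) and WavLM-Large
# embeddings (~one 1024-d vector per 20 ms of 16 kHz audio). Normalization and chunking
# of long recordings are assumptions, not the authors' exact pipeline.
import numpy as np
import torch
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures
from transformers import WavLMModel

def acoustic_features(wav_path: str):
    sr, signal = audioBasicIO.read_audio_file(wav_path)
    signal = audioBasicIO.stereo_to_mono(signal)
    feats, names = ShortTermFeatures.feature_extraction(
        signal, sr, int(0.100 * sr), int(0.050 * sr))   # 100 ms window, 50 ms step
    return feats, names                                  # feats: (n_features, n_windows)

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

@torch.no_grad()
def wavlm_embeddings(waveform_16k: np.ndarray) -> torch.Tensor:
    x = torch.tensor(waveform_16k, dtype=torch.float32)
    x = (x - x.mean()) / (x.std() + 1e-7)                # zero-mean, unit-variance input
    out = wavlm(x.unsqueeze(0))                          # (1, n_frames, 1024)
    return out.last_hidden_state.squeeze(0)              # ~20 ms per output frame
```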
4). Remote PPG cardiovascular features:
Remote PPG signals were extracted from the video recordings using the pyVHR package [52], [76]. The facial skin areas were recognized in each frame using a CNN, 100 regions of interest (ROIs) were sampled, and the pixel values were averaged across the pixels in each ROI for each RGB channel. Then, an unsupervised method named orthogonal matrix image transformation [77] was used to transform the RGB values in each ROI into an estimated 25 Hz rPPG signal based on QR decomposition. The power spectral density of the rPPG at each ROI was computed in six-second windows sliding every second, and the median across ROIs of the heart rate derived from the spectral peak frequency was used to estimate the heart rate at every second.
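The per-second heart rate estimation can be sketched as follows, using Welch periodograms and band-limited peak picking; the exact pyVHR internals may differ, and the 0.65–4 Hz band is an assumed physiological range.

```python
# Sketch: per-second heart-rate estimation from ROI-level rPPG traces via PSD peaks.
# Window length and sliding step mirror the text; the band limits and the use of
# scipy.signal.welch are assumptions rather than the exact pyVHR implementation.
import numpy as np
from scipy.signal import welch

FS = 25              # rPPG sampling rate (Hz)
WIN = 6 * FS         # 6-second analysis window

def heart_rate_series(rppg_rois: np.ndarray) -> np.ndarray:
    """rppg_rois: array of shape (n_rois, n_samples). Returns one BPM value per second."""
    n_seconds = (rppg_rois.shape[1] - WIN) // FS + 1
    hr = np.zeros(n_seconds)
    for t in range(n_seconds):
        segment = rppg_rois[:, t * FS: t * FS + WIN]
        bpm_per_roi = []
        for roi in segment:
            freqs, psd = welch(roi, fs=FS, nperseg=WIN)
            band = (freqs >= 0.65) & (freqs <= 4.0)       # ~40-240 beats per minute
            bpm_per_roi.append(60.0 * freqs[band][np.argmax(psd[band])])
        hr[t] = np.median(bpm_per_roi)                    # median across the ROIs
    return hr
```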
Lastly, the rPPG estimates averaged across ROIs were used to extract cardiovascular dynamic features using the PhysioNet Cardiovascular Signal Toolbox [78] with a 300-second window sliding every 30 seconds. The cardiovascular dynamic features included time- and frequency-domain heart rate variability, acceleration and deceleration capacity, entropy measures, and heart rate turbulence measures. Highly tolerant rejection thresholds were set to avoid discarding a high percentage of data, including setting the lowest tolerable mean signal quality index (as defined in [78]) to 0.1, allowing certain R-R intervals to be longer than ten seconds, allowing two neighboring R-R intervals to differ in length by more than one second, and allowing a 30-second gap at the beginning of the PPG signals.
B. Subject-level features and temporal analyses
Due to the high dimensionality of the low-level features and the limited number of subjects, only two simple statistics of the time series extracted above were used as subject-level features to avoid potential overfitting, as explored in our previous work [14]. Both the average and the standard deviation over time were calculated for lower-dimensional (< 100) time series, including time series of facial expressions (facial emotions, AUs, and facial landmark locations sampled at 1 Hz), acoustic features (sampled at 20 Hz), language sentiments (sampled at each utterance), and estimated heart rates (sampled at 1 Hz). Only averages were calculated for higher-dimensional (> 100) time series, including time series of the WavLM audio embedding and the DINOv2 visual embedding. The LLAMA-65B embedding of the entire semi-structured interview was used directly as a subject-level feature.
In addition to these nonparametric statistics, hidden Markov models (HMMs) were used to model the dynamics of the low-dimensional time series, and statistics (duration and frequency of inferred states) of the HMMs, learned in an unsupervised manner, were used as subject-level features. An HMM with a Gaussian observation model and four states was learned for each modality separately using the SSM package [79]. Four states were chosen as the smallest number needed to represent known distinct states: asymptomatic, symptomatic, uncertain, and padding. It is worth noting that the states learned from the data do not directly correspond to those four states, nor do we aim to directly interpret the learned states. Instead, we used the downstream analysis of the duration and frequency of the states as an approximate model of the dynamics of the time series.
Each time series of one modality from subject k, Xk, was considered one noisy observation and was padded with zeros to the maximum temporal length Tmax found across X1 to XN (N = 73); i.e., Xk, a Tk × d matrix with feature dimension d and temporal length Tk, was padded with (Tmax − Tk) × d zeros at the end, so that all Xk have the same shape of Tmax × d. The modality-specific HMM was then fitted on X, and the most likely hidden states Zk, with the shape of Tmax × 4, were inferred for each sequence Xk. Lastly, the time steps spent in, and the frequency (number of non-neighboring occurrences) of, all four states were calculated for each subject and used as subject-level dynamic features.
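A minimal sketch of this step is shown below, assuming the ssm package's Gaussian HMM interface; the duration and frequency definitions follow the description above.

```python
# Sketch: four-state Gaussian HMM per modality, with per-subject state duration and
# frequency (number of non-neighboring visits) as dynamic features. Assumes the ssm
# package; zero-padding to T_max is done before calling this function.
import numpy as np
import ssm

def hmm_dynamic_features(sequences, n_states=4):
    """sequences: list of zero-padded (T_max, d) arrays, one per subject."""
    d = sequences[0].shape[1]
    hmm = ssm.HMM(n_states, d, observations="gaussian")
    hmm.fit(sequences, method="em")                       # unsupervised fit on all subjects

    features = []
    for x in sequences:
        z = hmm.most_likely_states(x)                     # Viterbi state path, length T_max
        duration = np.array([(z == k).sum() for k in range(n_states)])
        run_starts = np.flatnonzero(np.diff(z, prepend=z[0] - 1) != 0)
        frequency = np.array([(z[run_starts] == k).sum() for k in range(n_states)])
        features.append(np.concatenate([duration, frequency]))
    return np.vstack(features)                            # shape (n_subjects, 2 * n_states)
```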
C. Classification analyses
We evaluated the features generated from the above-described processes in the four two-class classification tasks described in Section II-C. Classification performance was measured by the average area under the receiver operating characteristic curve (AUROC) and the average accuracy over 100 repeated five-fold cross-validations. In each repetition, subjects were randomly split into five approximately equally sized folds. A cross-validation was performed on those folds, where in each of the five validations, four folds were used for training and hyper-parameter tuning and the remaining fold was held out for testing.
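The evaluation loop can be sketched as follows with scikit-learn; stratified fold splitting and the omission of the inner hyper-parameter search are simplifying assumptions.

```python
# Sketch: 100 repetitions of five-fold cross-validation reporting the mean and standard
# deviation of AUROC. Stratified splits and the absence of an inner tuning loop are
# assumptions made for brevity.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def repeated_cv_auroc(clf, X, y, n_repeats=100, n_folds=5, seed=0):
    aurocs = []
    for rep in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in skf.split(X, y):
            model = clone(clf).fit(X[train_idx], y[train_idx])
            scores = model.predict_proba(X[test_idx])[:, 1]
            aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```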
1). Demographic variables:
Demographic variables, including one-hot-encoded race, one-hot-encoded gender, age, and years of education, were combined into a demographic feature vector for each subject and also evaluated as a benchmark in unimodal classification. However, demographic features were not considered in the multimodal classification.
2). Unimodal evaluation:
For each type of feature extracted from the different modalities (shown in each row of Table II), statistics (averages and standard deviations) and HMM-derived features were evaluated separately using logistic regression (LR) with l2 regularization or a gradient boosting decision tree (GBDT) classifier, depending on the dimensionality of the features; LR was used for features with fewer than 100 dimensions. For GBDT, a default of 100 base decision tree estimators and a maximum depth of two were used across all feature types.
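A sketch of the dimensionality-based choice of classifier, assuming scikit-learn implementations with the stated settings:

```python
# Sketch: classifier selection rule for the unimodal evaluations. L2-regularized
# logistic regression below 100 feature dimensions, otherwise gradient boosting with
# 100 trees of maximum depth two; other scikit-learn defaults are assumed.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def make_classifier(n_features: int):
    if n_features < 100:
        return LogisticRegression(penalty="l2", max_iter=1000)
    return GradientBoostingClassifier(n_estimators=100, max_depth=2)
```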
TABLE II.
Classification performance of clinical diagnoses and self-rated depression/anxiety severity.
| Feature type | Metric | 1. Control vs. MHC | 2. Non-MDD-Control vs. MDD | 3. PHQ-9 > 10? | 4. GAD-7 > 10? |
| --- | --- | --- | --- | --- | --- |
| 1. Demographic variables | AUROC | 0.54 ± 0.04 | 0.54 ± 0.04 | 0.61 ± 0.03 | 0.57 ± 0.05 |
| | Accuracy | 0.56 ± 0.04 | 0.57 ± 0.04 | 0.58 ± 0.04 | 0.60 ± 0.04 |
| 2. Facial emotions + AUs | | | | | |
| 2.1 Avgs and stds | AUROC | random | random | 0.56 ± 0.06 | 0.55 ± 0.04 |
| | Accuracy | random | random | 0.60 ± 0.05 | 0.62 ± 0.04 |
| 2.2 HMM features | AUROC | 0.65 ± 0.03 | 0.66 ± 0.04 | 0.61 ± 0.04 | 0.68 ± 0.05 |
| | Accuracy | 0.64 ± 0.03 | 0.66 ± 0.04 | 0.60 ± 0.04 | 0.67 ± 0.03 |
| 3. DINOv2 avgs and stds | Both | random | random | random | random |
| 4. Language sentiment | | | | | |
| 4.1 Avgs and stds | AUROC | 0.69 ± 0.03 | 0.66 ± 0.04 | 0.64 ± 0.05 | 0.63 ± 0.04 |
| | Accuracy | 0.67 ± 0.04 | 0.68 ± 0.04 | 0.67 ± 0.05 | 0.64 ± 0.04 |
| 4.2 HMM features | AUROC | 0.62 ± 0.03 | 0.64 ± 0.04 | random | 0.65 ± 0.05 |
| | Accuracy | 0.65 ± 0.03 | 0.60 ± 0.03 | random | 0.73 ± 0.04 |
| 5. LLAMA-65B | AUROC | 0.64 ± 0.07 | 0.53 ± 0.08 | 0.68 ± 0.04 | 0.64 ± 0.05 |
| | Accuracy | 0.68 ± 0.05 | 0.59 ± 0.07 | 0.68 ± 0.03 | 0.70 ± 0.04 |
| 6. WavLM avgs and stds | AUROC | random | 0.58 ± 0.05 | 0.60 ± 0.06 | 0.59 ± 0.02 |
| | Accuracy | random | 0.61 ± 0.05 | 0.64 ± 0.05 | 0.71 ± 0.02 |
| 7. Vocal acoustics | | | | | |
| 7.1 Avgs and stds | AUROC | random | random | 0.68 ± 0.05 | random |
| | Accuracy | random | random | 0.67 ± 0.05 | random |
| 7.2 HMM features | AUROC | 0.57 ± 0.05 | 0.51 ± 0.06 | 0.51 ± 0.05 | 0.53 ± 0.05 |
| | Accuracy | 0.59 ± 0.04 | 0.53 ± 0.05 | 0.53 ± 0.05 | 0.60 ± 0.04 |
| 8. rPPG | | | | | |
| 8.1 Cardiovascular features | AUROC | random | 0.55 ± 0.07 | 0.65 ± 0.04 | 0.56 ± 0.05 |
| | Accuracy | random | 0.61 ± 0.05 | 0.60 ± 0.04 | 0.59 ± 0.04 |
| 8.2 HMM features | AUROC | 0.72 ± 0.05 † | 0.73 ± 0.05 † | 0.75 ± 0.03 † | 0.68 ± 0.04 |
| | Accuracy | 0.76 ± 0.04 † | 0.73 ± 0.05 † | 0.71 ± 0.03 † | 0.67 ± 0.03 |
| 9. Fusion (rows 2–8) | | | | | |
| 9.1 Feature concatenation | AUROC | 0.63 ± 0.07 | random | 0.59 ± 0.05 | 0.53 ± 0.05 |
| | Accuracy | 0.68 ± 0.05 | random | 0.61 ± 0.04 | 0.62 ± 0.04 |
| 9.2 Majority vote | AUROC | 0.70 ± 0.05 | 0.68 ± 0.07 | 0.75 ± 0.05 | 0.71 ± 0.05 |
| | Accuracy | 0.73 ± 0.03 | 0.71 ± 0.04 | 0.71 ± 0.04 | 0.76 ± 0.03 |
| 9.3 Selected vote | AUROC | 0.82 ± 0.04 ‡ | 0.77 ± 0.02 ‡ | 0.82 ± 0.04 ‡ | 0.72 ± 0.04 ‡ |
| | Accuracy | 0.75 ± 0.03 ‡ | 0.76 ± 0.01 ‡ | 0.74 ± 0.04 ‡ | 0.75 ± 0.03 ‡ |
Each column shows the performance of two-class classification using one of the four categorizations defined in Section II-C, in the same order. The averages and standard deviations of AUROCs and accuracies from one hundred randomly split five-fold cross-validations are reported. The term "avgs" denotes averages, and "stds" denotes standard deviations. "random" indicates that the classifier did not perform significantly better (McNemar's test, p > 0.05) than random guessing (AUROC = 0.5). The best classification performance in each task (column) achieved by a single modality is shown in bold text, while the second best is underlined; multiple metrics are underlined or marked bold when no statistically significant difference (McNemar's test, p > 0.05) between classifiers was found. The best classification performance in each task (column) achieved by multimodal fusion is also shown in bold text.
† indicates that significantly better performance (McNemar's test, p < 0.05) was achieved with the indicated feature type in this classification task (column) compared to other unimodal features, whereas ‡ indicates that significantly better performance (McNemar's test, p < 0.05) was achieved with multimodal voting compared to using any unimodal features.
3). Multimodal fusion:
Both early and late fusion of the different modalities were considered. For early fusion, features from all modalities were concatenated into a single feature vector used as the input to a GBDT classifier. For late fusion, the majority vote of the unimodal classifiers was used as the multimodal classification result. To avoid noise from classifiers without classification power, we also computed the majority vote using only the classifiers that showed non-random performance (defined as AUROC > 0.5) on a validation set (a 20% subset of the training fold). These non-random classifiers were re-trained on all the data in the training fold before being used for testing.
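A minimal sketch of the "selected vote" procedure is shown below; the stratified 20% validation split and the simple thresholded vote are assumptions consistent with the description above.

```python
# Sketch: "selected vote" late fusion. Each unimodal classifier is screened on a 20%
# validation split of the training fold; only those with validation AUROC above 0.5
# are refit on the full training fold and vote on the test fold.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def selected_vote(classifiers, train_sets, y_train, test_sets, seed=0):
    """classifiers, train_sets, test_sets: per-modality lists of models and feature arrays."""
    votes = []
    for clf, X_tr, X_te in zip(classifiers, train_sets, test_sets):
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_train, test_size=0.2, stratify=y_train, random_state=seed)
        probe = clone(clf).fit(X_fit, y_fit)
        if roc_auc_score(y_val, probe.predict_proba(X_val)[:, 1]) > 0.5:
            model = clone(clf).fit(X_tr, y_train)   # refit on the full training fold
            votes.append(model.predict(X_te))
    # assumes at least one classifier passes the screening; ties count as positive
    return (np.mean(votes, axis=0) >= 0.5).astype(int)
```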
D. Statistical Analyses
We used statistical tests to assess differences in the probability distributions of features between groups of subjects (such as the groups described in Section II-C and demographic groups) and differences in performance resulting from different features. Mann-Whitney rank tests were applied to the features or characteristics of different subject groups to determine whether significant differences existed between the two groups. McNemar's test was used to test the classification disagreement between pairs of classification settings. The Wald test was used to determine whether a significant correlation existed between two variables. Statistical significance was assumed at a level of p < 0.05 for all tests.
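For reference, the three tests map onto standard SciPy/statsmodels calls as sketched below; all inputs are illustrative placeholders.

```python
# Sketch: the three statistical tests used in this work, assuming SciPy and
# statsmodels implementations.
import numpy as np
from scipy.stats import mannwhitneyu, linregress
from statsmodels.stats.contingency_tables import mcnemar

def group_difference_p(feature_group_a, feature_group_b):
    """Mann-Whitney U test on one feature between two subject groups."""
    return mannwhitneyu(feature_group_a, feature_group_b).pvalue

def wald_correlation(feature_values, scale_scores):
    """Correlation coefficient and Wald-test p-value of a linear fit (e.g., vs. PHQ-9)."""
    fit = linregress(feature_values, scale_scores)
    return fit.rvalue, fit.pvalue

def classifier_disagreement_p(pred_a, pred_b, y_true):
    """McNemar's test on the 2x2 table of correct/incorrect predictions of two classifiers."""
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    table = np.array([[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
                      [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]])
    return mcnemar(table, exact=True).pvalue
```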
IV. Results
A. Unimodal feature patterns across groups
Here we present a selected set of analyses of the clinically relevant patterns found in different modalities across groups of subjects, providing additional objective evidence for previous clinical observations.
1). Blunted visual affect and increased sadness in language:
While “blunted affect” has mostly been discussed in the context of negative symptoms of schizophrenia, it has been widely reported in other mental disorders such as MDD [80]–[82] and other non-psychotic disorders [83]. Measured as the sum of average AU intensities over the interview, facial expressivity was lower in non-medicated subjects with current MDD compared to non-MDD controls (Mann-Whitney, p = 0.04), and lower in subjects with mental health conditions compared to controls (Mann-Whitney, p = 0.03). However, no differences in facial expressivity were found between subjects with past MDD and non-MDD controls, and no statistically significant linear correlations were found between facial expressivity and self-rated PHQ-9 or GAD-7 scores.
In the language sentiment analysis, neither was verbally blunted affect found in the MDD or MHC groups, nor did language expressivity correlate with self-rated scores. However, the average sadness level expressed in language was higher in the MDD group compared to non-MDD-controls (Mann-Whitney, p = 0.02) and was positively correlated with PHQ-9 (Wald test, ρ = 0.31, p = 0.01) and GAD-7 (Wald test, ρ = 0.37, p = 0.002) scores. In comparison, the sadness level expressed visually was not increased in the MDD group.
2). Increased acoustic spectral flux:
The average spectral flux, defined as the squared difference between the normalized magnitudes of the spectra of two successive frames, averaged across the semi-structured interview, was positively correlated with PHQ-9 (Wald test, ρ = 0.26, p = 0.03) and GAD-7 (Wald test, ρ = 0.25, p = 0.04) scores, indicating a faster change of acoustic tone in subjects with more severe depression and anxiety symptoms.
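For reference, this corresponds to the standard short-term spectral flux definition (a sketch; the exact normalization in pyAudioAnalysis may differ slightly):

$$\mathrm{Flux}(t) = \sum_{k}\left(\hat{X}_t(k) - \hat{X}_{t-1}(k)\right)^{2}, \qquad \hat{X}_t(k) = \frac{|X_t(k)|}{\sum_{j}|X_t(j)|},$$

where $X_t(k)$ is the magnitude of the $k$-th DFT bin of the $t$-th 100 ms frame.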
3). Increased complexity in heartbeat intervals:
No significant alteration of the average heart rate or the standard deviation of heart rate during the interview was found between groups. The complexity of the heartbeat time series, measured by the area under the multiscale entropy curve, was significantly higher in the non-medicated MDD group compared to non-MDD-controls (Mann-Whitney, p = 0.01), consistent with previous findings using the electrocardiogram [84], [85].
4). Effect of medication:
Compared to non-medicated MDD subjects, medicated MDD subjects showed a higher level of facial expressivity (Mann-Whitney, p = 0.05) and sadness (Mann-Whitney, p = 0.04), while only non-medicated subjects with current MDD showed a higher level of sadness through language compared to medicated subjects with current MDD (Mann-Whitney, p = 0.04). In addition, decreased heartbeat interval complexity (Mann-Whitney, p = 0.02) and increased standard deviation of heart rate (Mann-Whitney, p = 0.02) were observed with medication in subjects with past and current MDD compared to non-medicated MDD subjects, while the average heart rate remained similar between both groups.
B. Dynamics inferred from HMM state duration and frequency
Dynamic features, including inferred HMM state duration and frequency, were found to be the most useful features in classification tasks, as shown in Table II, especially for facial expressions and rPPG modalities. Significant linear correlations were found between these dynamic features and PHQ-9/GAD-7 scores.
Figure 2 shows the correlation plots for the frequencies of the states in the emotion and heart rate time series. The padding states (described in Section III-B) from the rPPG and facial expression HMMs were omitted, as they occur only once (frequency = 1), as the padding at the end. Statistically significant positive correlations were found between all non-padding state frequencies and self-rated scores, except for emotion state 2, indicating that a higher switching rate between hidden states may be related to more severe depression and anxiety symptoms.
Fig. 2. PHQ-9 and GAD-7 scores vs. rPPG and facial expression HMM state frequencies.
Each subfigure shows the scatter plot between self-rated scores and the frequency of a learned HMM state, along with a linear regression fit and its 95% confidence interval. The top row shows how the learned states correlate with PHQ-9 scores, and the bottom row shows how they correlate with GAD-7 scores. The text in each subfigure denotes the Pearson correlation coefficient (ρ) and the p-value from the Wald test.
C. Classification performance
Table II shows the classification performance for both clinically diagnosed and self-rated mental health disorders using static and dynamic features from vision, audio, language, and physiology. Each column shows the performance of two-class classification using one of the four categorizations defined in Section II-C, in the same order. The best-performing features achieved AUROCs of 0.68 to 0.75 in the unimodal classification tasks, while the selected majority voting described in Section III-C3 achieved an AUROC of 0.82 in detecting current or recent (last 12 months) mental disorders, 0.77 in detecting past or current MDD, 0.82 in detecting PHQ-9-based moderate depression, and 0.72 in detecting GAD-7-based moderate anxiety. Late fusion using selected majority voting (row “9.3”) outperformed early fusion with the direct concatenation of features (McNemar’s test, p ≪ 0.01) due to the extremely high dimensionality of the concatenated features.
While demographic variables achieved higher than random performance in all four tasks, we found they were not strong predictors of mental health disorders compared to the proposed features, as shown in row “1” in Table II.
A similar level of performance in MDD vs. healthy control classification was achieved compared to results reported in existing studies using in-lab data collection. For example, an AUROC of 0.68 was achieved using facial and speech emotions in our previous in-lab study [14]. Other researchers, such as Schultebraucks et al. [86], achieved an AUROC of 0.86 by combining facial action units with acoustic and language features in another in-lab study. In-lab studies [87], [88] using heart rate variability features achieved AUROCs of 0.74–0.82, a similar range of performance to the AUROCs achieved by the rPPG-based method proposed in this study when detecting self-reported and clinical MDD. Note that these performance metrics cannot be directly compared, as different subjects, data collection hardware and processes, categorization criteria, and evaluation methods were used.
1). Moments and dynamics of facial expressions revealed mental states but general visual patterns did not:
While we also extracted facial landmarks as described in Section III-A1, we found that adding static statistics of the facial landmarks or including them in the HMM modeling deteriorated performance. Row “2” in Table II shows the performance using just the statistics of facial emotions and AUs. Interestingly, the average and standard deviation of facial expressions failed to classify the clinical diagnoses but succeeded in classifying self-rated depression and anxiety. In comparison, the temporal properties derived from the HMM resulted in significantly better (McNemar’s test, p ≪ 0.01) classification performance, except for self-rated depression detection. Lastly, the temporal dynamics of facial expressions achieved the best performance among all modalities in detecting self-rated anxiety.
In comparison, visual embedding generated from DINOv2 failed to generalize to this specialized dataset and did not achieve non-random classification in any of the tasks.
2). Language sentiments beat general language representations in a small and specialized dataset:
Compared to other modalities, language features were extracted at a lower sampling rate (per utterance or per entire semi-structured interview), as the LLMs were able to abstract the text into much shorter sequences of features, or even into a single vector when LLAMA-65B was used. The averages and standard deviations of the language sentiments achieved the best performance among the static features of all modalities. In contrast to the other modalities, using an HMM to model the sentiment dynamics did not improve performance (compare rows “4.2” and “4.1”). These results suggest that part of the dynamics expressed through the words was already captured by the LLM and abstracted into utterance-level sentiments, and that the sentiment dynamics across multiple utterances might not be as important.
Additionally, while the LLAMA-65B embedding showed decent performance compared to other non-language modalities, the language sentiments achieved similar or better results in all tasks. This shows that general language representations might not be as useful as disorder-related sentiment analysis, especially in smaller and highly specialized datasets such as the one in this study, as also suggested in related work on text-based depression and personality detection [89], [90].
3). Vocal features underperformed compared to other modalities:
While many previous studies [21], [22], [91] have shown that vocal features are useful in detecting depression and anxiety disorder, in this study, other modalities outperformed both spectral/entropy-based acoustic features and general speech representation from WavLM except in self-rated depression detection.
4). HMM-modeled dynamics were more informative than cardiovascular features for highly noisy rPPG signals:
As shown in row “8.1” in Table II, using the cardiovascular features yielded inferior performance compared to other modalities. The key reason is that the estimated rPPG signals were highly noisy, both at each ROI and after averaging across all ROIs, which led to errors (such as peak detection errors) in the downstream cardiovascular feature calculations. On average, 25.8% of the estimated rPPG data were not used for downstream analyses, even with the highly tolerant rejection thresholds described in Section III-A4. In contrast, using HMM-derived features from modeling the heart rate time series resulted in the best or second-best performance among all unimodal approaches in all four tasks, reaching AUROCs from 0.68 to 0.75.
5). False positives with respect to self-reported depression were not necessarily false in the clinical view:
When examining the false positives (cases falsely classified as depression when evaluated against self-reported labels) of the best-performing multimodal classifier, we found that 85% of those cases were actually correctly classified in the view of the clinicians; i.e., those cases had a current or past MDD or another comorbid mental health condition clinically and were correctly captured by the classifier trained with self-reported PHQ-9-based labels. Although further investigation is required, this showcases that a model trained with self-reported labels can be helpful for clinical assessments.
V. Discussion and Conclusion
In this work, we performed a detailed multimodal analysis of 73 subjects using remotely recorded telehealth interviews and showed that facial, vocal, linguistic, and cardiovascular features extracted from these audiovisual recordings can reveal informative characteristics of both clinically diagnosed and self-rated mental health status. The results provide early evidence of the usefulness of multimodal digital biomarkers extracted from low-cost and non-lab-controlled data with minimal hardware limitations. Comparisons were made between modalities and between features derived from the latest transformer-based foundation models and manually defined features derived from traditional methods, offering insights into which modalities and methods might be best suited for automated remote mental health assessments.
A. Performance of different modalities
When comparing the classification performance of features extracted from different modalities, the physiological characteristics overall outperformed the manually defined and data-driven behavioral characteristics. Although the heart rates were estimated indirectly from light changes on the face, heart rate dynamics were highly relevant in classifying both self-rated and clinician-diagnosed disorders. While it is not surprising to find associations between cardiovascular dynamics and psychiatric disorders, as shown in previous studies of neurobiological mechanisms [92] and statistical analyses [88], [93], the results raise questions about the behavioral features extracted in this study. More investigation is needed to determine whether they underperformed because current state-of-the-art models cannot capture enough information in remote interviews, or because behavioral signals are not as useful as physiological signals in telehealth settings, even for human experts.
Among the behavioral modalities, overall facial and language patterns led to better classification performance than patterns derived from audio, although the latter achieved comparable performance in detecting self-rated depression. While overall facial and language patterns led to similar levels of performance, it is worth noting that they performed very differently across tasks, suggesting that the same modality may perform differently for different mental health assessment tasks. For example, facial expression dynamics were much more useful in detecting self-rated anxiety than self-rated depression, yet similarly useful in detecting clinical MDD. On the contrary, language embeddings were more powerful in detecting self-rated disorders than clinically diagnosed disorders. These findings caution against directly translating and interpreting results obtained with self-rated or self-reported scales for clinical applications, where the categorization criteria and process differ, in addition to the subject distribution shift, which was not examined in this study (as all categorizations were evaluated on the same group of subjects).
B. The use of foundation models
Foundation models have gained enormous popularity in the last few years with the rapid development of pre-training and self(semi)-supervised training methods [94], [95], especially since the release of OpenAI’s ChatGPT. LLMs, along with visual [53], audio [54], and multimodal [96] foundation models, have been widely applied in many disciplines, including the mental health domain, but primarily limited to language analyses and self-rated (self-reported) conditions [97], [98]. By comparing the direct use of unimodal foundation model-generated embeddings to manually defined features from the same modalities, we showed how they perform in more clinically relevant tasks under telehealth settings.
The statistics of the visual embedding from DINOv2 were not at all useful in detecting mental health disorders. This finding was partially expected because the majority of the extracted general visual representations are likely more relevant to the texture and appearance of the face, especially after averaging, while the dynamics of the high-dimensional embedding would be hard to discover with a limited number of recordings (discussed in more detail in Section V-C below). Preliminary results using other vision foundation models on this dataset did not show better classification performance either, including models tuned for facial representation (“FaRL” [99]) and for facial video representation (“MARLIN” [100]).
Although the audio embedding from WavLM only marginally outperformed the acoustic features in our experiments, it demonstrates the potential of using general audio embeddings from more diversely pre-trained audio foundation models on datasets with more subjects. Interestingly, the general text embedding of the entire semi-structured interview from LLAMA-65B performed comparably to the sentiment analyses, which is notable considering the extremely high feature dimensionality and the small number of recordings. With the rapid development of LLMs and the inclusion of more diverse training texts, such as the recent release of LLAMA2 [101], general LLMs could potentially outperform fine-tuned task-specific LLMs in mental health assessment tasks in the future.
C. Limitations and future directions
Several limitations of this study need to be acknowledged, as they provide valuable insights into the boundaries of our findings and potential directions for future studies.
First, the number of subjects (n=73) and their heterogeneity might limit our findings’ generalizability. While the number of subjects will grow as we keep collecting data following our defined protocol [13], the heterogeneity issue might not be easily addressed. Although we recruited subjects with clear inclusion/exclusion criteria and further excluded subjects after the interview if they did not fall into our criteria, the intrinsic nature of high comorbidity levels in different mental disorders makes it difficult to recruit a “clean” cohort of subjects with clear diagnoses of a single type of disorder. Another heterogeneity comes from medication status, which has been known to affect both behaviors and physiology of the patients [102]–[104]. Nevertheless, we believe the heterogeneity could be partially addressed as the number of subjects grows because analyses of smaller and more well-defined groups, for which we do not currently have enough samples, could be performed. As the number of subjects grows, models used for feature extractions could potentially be fine-tuned on the target population instead of being only trained on open-access datasets, which could further close the gap in identifying the most pertinent features from the target population.
Second, potential biases may exist in the feature extraction and subject categorization processes. The facial expression model used in this study was evaluated in our previous research [57], but the features from other modalities were extracted using open-access models that may be biased towards certain demographic groups, leading to potential skew in the findings. For example, LLAMA is reported to be biased with respect to religion, age, gender, and other attributes, as it was trained on internet-crawled data [55]. A thorough bias analysis must be performed in a future study on a larger cohort before clinical application. Additionally, the subject categorization in this study may contain inaccuracies or biases due to the limitations of the diagnostic process. While we partially addressed this issue by using both self-reported and clinician-rated measures and analyzing their relationships, future studies are needed to investigate this challenge directly. For example, the use of evaluations from multiple clinicians and a complete review of medical records may result in more accurate categorizations [105], [106], and specifically designed learning methods could be used to address the presence of noisy labels [107]. Such approaches would require reinterviewing each subject, which is costly and time-consuming, and would necessarily reduce the size of the cohort we have recruited.
Third, the unimodal and multimodal classification and fusion methods used in this study could be improved given a larger and more densely labeled dataset. Only one label (per task) was available for the entire recording, which made it difficult to apply temporal models such as recurrent neural networks or transformers to directly classify high-dimensional time series with thousands to tens of thousands of steps. Similarly, a multimodal transformer could be potentially used for fusion, provided the label sparsity challenge is addressed. A potential future direction is to label the entire recording more frequently in time. For example, simple measurements like self-rated or clinician-rated levels of distress could be adopted. Another potential direction is to utilize the potential improvement in pre-trained foundation models. For instance, LLMs with larger context windows might enable few-shot classification by including a few examples of transcripts and categorizations in the prompt.
Finally, the interpretability and explainability of the features and models remain unexplored [108]. They are key to fostering the trust of patients and clinicians and pushing the final clinical adoption. Interpretable machine learning methods [109] could be applied to explain which modality, which feature, and which temporal section contribute to the system outputs. Additionally, visualization and reporting through dashboards and text summaries could be beneficial for clinicians and patients to understand better what was assessed and measured.
D. Potential clinical applications
With larger and more diverse samples, we see considerable potential clinical utility for the proposed multimodal objective assessment approach (and future applications informed by this technology) in several areas: 1) deepening understanding of psychopathology and the outward manifestations of symptoms, 2) utility for diagnostic purposes, 3) assessing changes in symptoms longitudinally for the same patient, and 4) patient self-report, engagement, and empowerment. First, this technology has the potential to better objectify and quantify core signs and symptoms of certain mental health conditions, like affect flattening or tangential speech. Second, this technology has the potential to augment the initial diagnostic process for clinicians in both research and clinical settings. Developing real-time reporting of digital biomarker outputs in the form of a dashboard may help clinicians hone in on a certain line of clinical questioning to better establish a diagnosis. This technology may also have a role in reducing bias and discrimination in the diagnostic process, as the preponderance of evidence currently suggests that Black/African American individuals and Hispanic individuals are disproportionately diagnosed with psychotic disorders [110]. In time, combining digital biomarkers with other blood-based and imaging markers could play a role in subtyping mental health conditions according to treatment response or in identifying at-risk individuals who might develop a condition [111]. Third, applications of this technology can help clinicians and researchers assess changes in symptoms over time for the same patient. This is crucial for the health care team to understand whether the treatment plan is working and may help accelerate measurement-based care efforts and overcome some of the barriers to their implementation [112]. Additionally, accurate assessment is the cornerstone of clinical research studies, which ultimately determines whether new treatments are approved, and unreliable assessments can have significant consequences for the study and for the field more broadly [113]. Automated systems can also provide quality assurance and feedback to clinicians on how well they performed during the interview, and potentially highlight areas where their technique may be improved. Finally, future applications informed by this technology can play an important role in empowering patients to participate in self-assessment and ongoing monitoring of their symptoms. Such applications may help improve the accessibility and timeliness of assessments and potentially reduce the stigma around mental health [114].
Acknowledgments
This research received support from Emory School of Medicine’s Imagine, Innovate, Impact Funds and a Georgia Clinical & Translational Science Alliance National Institutes of Health award (UL1-TR002378).
Contributor Information
Zifan Jiang, Department of Biomedical Informatics, Emory School of Medicine, and the Department of Biomedical Engineering, Emory University and Georgia Institute of Technology.
Salman Seyedi, Department of Biomedical Informatics, Emory School of Medicine.
Emily Griner, Department of Psychiatry and Behavioral Sciences, Emory School of Medicine.
Ahmed Abbasi, Department of IT, Analytics, and Operations, University of Notre Dame.
Ali Bahrami Rad, Department of Biomedical Informatics, Emory School of Medicine.
Hyeokhyen Kwon, Department of Biomedical Informatics, Emory School of Medicine.
Robert O. Cotes, Department of Psychiatry and Behavioral Sciences, Emory School of Medicine.
Gari D. Clifford, Department of Biomedical Informatics, Emory School of Medicine, and the Department of Biomedical Engineering, Emory University and Georgia Institute of Technology.
References
- [1].Charlson F et al., “New WHO prevalence estimates of mental disorders in conflict settings: A systematic review and meta-analysis,” The Lancet, vol. 394, no. 10194, pp. 240–248, Jul. 2019.
- [2].Vos T et al., “Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: A systematic analysis for the global burden of disease study 2019,” The Lancet, vol. 396, no. 10258, pp. 1204–1222, 2020.
- [3].Substance Abuse and Mental Health Services Administration, “Projections of national expenditures for treatment of mental and substance use disorders, 2010–2020,” https://store.samhsa.gov/product/Projections-of-National-Expenditures-for-Treatment-of-Mental-and-Substance-Use-Disorders-2010-2020/SMA14-4883, Tech. Rep.
- [4].Mental Health America. (2022) Mental health in America - Access to care data 2018. https://mhanational.org/issues/2022/mental-health-america-access-care-data. Accessed: Jul 23, 2023.
- [5].Vahia VN, “Diagnostic and statistical manual of mental disorders 5: A quick glance,” Indian Journal of Psychiatry, vol. 55, no. 3, p. 220, 2013.
- [6].International Statistical Classification of Diseases and related health problems: Alphabetical index. World Health Organization, 2004, vol. 3.
- [7].Clarke DE et al., “DSM-5 field trials in the United States and Canada, part I: Study design, sampling strategy, implementation, and analytic approaches,” American Journal of Psychiatry, vol. 170, no. 1, pp. 43–58, 2013.
- [8].Aboraya A, “Clinicians’ opinions on the reliability of psychiatric diagnoses in clinical settings,” Psychiatry (Edgmont), vol. 4, no. 11, p. 31, 2007.
- [9].Garb HN, “Race bias and gender bias in the diagnosis of psychological disorders,” Clinical Psychology Review, vol. 90, p. 102087, 2021.
- [10].Spitzer RL et al., “A brief measure for assessing generalized anxiety disorder: the GAD-7,” Archives of Internal Medicine, vol. 166, no. 10, pp. 1092–1097, 2006.
- [11].Kroenke K, Spitzer RL, and Williams JB, “The PHQ-9: Validity of a brief depression severity measure,” Journal of General Internal Medicine, vol. 16, no. 9, pp. 606–613, 2001.
- [12].Fried EI, Flake JK, and Robinaugh DJ, “Revisiting the theoretical and methodological foundations of depression measurement,” Nature Reviews Psychology, vol. 1, no. 6, pp. 358–368, 2022.
- [13].Cotes RO et al., “Multimodal assessment of schizophrenia and depression utilizing video, acoustic, locomotor, electroencephalographic, and heart rate technology: Protocol for an observational study,” JMIR Res Protoc, vol. 11, no. 7, p. e36417, Jul 2022.
- [14].Jiang Z et al., “Classifying Major Depressive Disorder and response to deep brain stimulation over time by analyzing facial expressions,” IEEE Transactions on Biomedical Engineering, vol. 68, no. 2, pp. 664–672, 2021.
- [15].Stratou G et al., “Automatic nonverbal behavior indicators of depression and PTSD: the effect of gender,” Journal on Multimodal User Interfaces, vol. 9, no. 1, pp. 17–29, 2015.
- [16].Pintelas EG et al., “A review of machine learning prediction methods for anxiety disorders,” in Proceedings of the 8th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion, 2018, pp. 8–15.
- [17].Jiang Z et al., “Utilizing computer vision for facial behavior analysis in schizophrenia studies: A systematic review,” PLOS ONE, vol. 17, no. 4, p. e0266828, Apr. 2022.
- [18].Reinertsen E et al., “Continuous assessment of schizophrenia using heart rate and accelerometer data,” Physiological Measurement, vol. 38, no. 7, p. 1456, 2017.
- [19].Harati S et al., “Classifying depression severity in recovery from major depressive disorder via dynamic facial features,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 3, pp. 815–824, 2020.
- [20].Jiang Z et al., “Disentangling visual exploration differences in cognitive impairment,” IEEE Transactions on Biomedical Engineering, pp. 1–12, 2023.
- [21].Harati S et al., “Depression severity classification from speech emotion,” in Proc. Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC), 2018, pp. 5763–5766.
- [22].France DJ et al., “Acoustical properties of speech as indicators of depression and suicidal risk,” IEEE Transactions on Biomedical Engineering, vol. 47, no. 7, pp. 829–837, 2000.
- [23].Qayyum A et al., “High-density electroencephalography and speech signal based deep framework for clinical depression diagnosis,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2023.
- [24].Ahmed U, Lin JC-W, and Srivastava G, “Graph attention-based curriculum learning for mental healthcare classification,” IEEE Journal of Biomedical and Health Informatics, 2023.
- [25].Cakmak AS et al., “Classification and prediction of post-trauma outcomes related to PTSD using circadian rhythm changes measured via wrist-worn research watch in a large longitudinal cohort,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 8, pp. 2866–2876, 2021.
- [26].“Detecting bipolar depression from geographic location data,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 8, pp. 1761–1771, 2016.
- [27].Valenza G et al., “Wearable monitoring for mood recognition in bipolar disorder based on history-dependent long-term heart rate variability analysis,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 5, pp. 1625–1635, 2013.
- [28].Boscarino JA and Chang J, “Electrocardiogram abnormalities among men with stress-related psychiatric disorders: Implications for coronary heart disease and clinical research,” Annals of Behavioral Medicine, vol. 21, no. 3, pp. 227–234, 1999.
- [29].Acharya UR et al., “Computer-aided diagnosis of depression using EEG signals,” European Neurology, vol. 73, no. 5–6, pp. 329–336, 2015.
- [30].Zhang Y et al., “Identification of psychiatric disorder subtypes from functional connectivity patterns in resting-state electroencephalography,” Nature Biomedical Engineering, vol. 5, no. 4, pp. 309–323, 2021.
- [31].Yoon JH et al., “Automated classification of fMRI during cognitive control identifies more severely disorganized subjects with schizophrenia,” Schizophrenia Research, vol. 135, no. 1–3, pp. 28–33, 2012.
- [32].Du Y et al., “NeuroMark: An automated and adaptive ICA based pipeline to identify reproducible fMRI markers of brain disorders,” NeuroImage: Clinical, vol. 28, p. 102375, 2020.
- [33].Song H et al., “Automatic schizophrenic discrimination on fNIRS by using complex brain network analysis and SVM,” BMC Medical Informatics and Decision Making, vol. 17, pp. 1–9, 2017.
- [34].Moura I et al., “Digital phenotyping of mental health using multimodal sensing of multiple situations of interest: A systematic literature review,” Journal of Biomedical Informatics, p. 104278, 2022.
- [35].Garcia-Ceja E et al., “Mental health monitoring with multimodal sensing and machine learning: A survey,” Pervasive and Mobile Computing, vol. 51, pp. 1–26, 2018.
- [36].Gupta R et al., “Multimodal prediction of affective dimensions and depression in human-computer interactions,” in Proc. ACM Int. Workshop on Audio/Visual Emotion Challenge, 2014, pp. 33–40.
- [37].Ghosh S, Chatterjee M, and Morency L-P, “A multimodal context-based approach for distress assessment,” in Proc. ACM Int. Conf. Multimodal Interaction, 2014, pp. 240–246.
- [38].Zhang X et al., “Multimodal depression detection: Fusion of electroencephalography and paralinguistic behaviors using a novel strategy for classifier ensemble,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 6, pp. 2265–2275, 2019.
- [39].Mann DM et al., “COVID-19 transforms health care through telemedicine: Evidence from the field,” Journal of the American Medical Informatics Association, vol. 27, no. 7, pp. 1132–1135, 2020.
- [40].Moffatt JJ and Eley DS, “The reported benefits of telehealth for rural Australians,” Australian Health Review, vol. 34, no. 3, pp. 276–281, 2010.
- [41].Cunningham NR et al., “Addressing pediatric mental health using telehealth during coronavirus disease-2019 and beyond: A narrative review,” Academic Pediatrics, vol. 21, no. 7, pp. 1108–1117, 2021.
- [42].Sultana S and Pagán JA, “Use of telehealth to address depression and anxiety in low-income US populations: A narrative review,” Journal of Primary Care & Community Health, vol. 14, p. 21501319231168036, 2023.
- [43].Wright-Berryman J et al., “Virtually screening adults for depression, anxiety, and suicide risk using machine learning and language from an open-ended interview,” Frontiers in Psychiatry, vol. 14, p. 1143175, 2023.
- [44].Abbas A et al., “Computer vision-based assessment of motor functioning in schizophrenia: Use of smartphones for remote measurement of schizophrenia symptomatology,” Digital Biomarkers, vol. 5, no. 1, pp. 29–36, 2021.
- [45].Matcham F et al., “Remote assessment of disease and relapse in major depressive disorder (RADAR-MDD): Recruitment, retention, and data availability in a longitudinal remote measurement study,” BMC Psychiatry, vol. 22, no. 1, p. 136, 2022.
- [46].Sheehan DV et al., “The mini-international neuropsychiatric interview (MINI): the development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10,” Journal of Clinical Psychiatry, vol. 59, no. 20, pp. 22–33, 1998.
- [47].Vaswani A et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [48].Moretti RJ and Rossini ED, The Thematic Apperception Test (TAT), ser. Comprehensive Handbook of Psychological Assessment, Vol. 2: Personality Assessment. Hoboken, NJ, US: John Wiley & Sons, Inc., 2004, pp. 356–371.
- [49].Benton AL, Hamsher KdeS, and Sivan AB, Multilingual Aphasia Examination. AJA Associates, 1994.
- [50].Goodglass H and Kaplan E, The Assessment of Aphasia and Related Disorders. Lea & Febiger, 1972.
- [51].Cohen SR et al., “Measuring the quality of life of people at the end of life: The McGill quality of life questionnaire–revised,” Palliative Medicine, vol. 31, no. 2, pp. 120–129, 2017.
- [52].Boccignone G et al., “pyVHR: A Python framework for remote photoplethysmography,” PeerJ Computer Science, vol. 8, p. e929, 2022.
- [53].Oquab M et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
- [54].Chen S et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [55].Touvron H et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- [56].Liu Y et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [57].Jiang Z et al., “Automated analysis of facial emotions in subjects with cognitive impairment,” PLOS ONE, vol. 17, no. 1, p. e0262527, Jan. 2022.
- [58].Deng J et al., “RetinaFace: Single-stage dense face localisation in the wild,” arXiv preprint arXiv:1905.00641, 2019.
- [59].He K et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [60].Yang S et al., “WIDER face: A face detection benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
- [61].Simonyan K and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [62].Mollahosseini A, Hasani B, and Mahoor MH, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
- [63].Langner O et al., “Presentation and validation of the Radboud Faces Database,” Cognition and Emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
- [64].Shao Z et al., “JAA-Net: Joint facial action unit detection and face alignment via adaptive attention,” International Journal of Computer Vision, vol. 129, pp. 321–340, 2021.
- [65].Ekman P and Friesen WV, “Facial action coding system,” Environmental Psychology & Nonverbal Behavior, 1978.
- [66].Zhang X et al., “BP4D-spontaneous: A high-resolution spontaneous 3D dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
- [67].Dosovitskiy A et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [68].Soomro K, Zamir AR, and Shah M, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
- [69].Seyedi S et al., “Using HIPAA (Health Insurance Portability and Accountability Act)–compliant transcription services for virtual psychiatric interviews: Pilot comparison study,” JMIR Ment Health, vol. 10, p. e48517, Oct 2023.
- [70].Corbin L et al., “A comparison of linguistic patterns between individuals with current major depressive disorder, past major depressive disorder, and those without major depressive disorder in a virtual, psychiatric research interview,” Journal of Affective Disorders Reports, vol. 14, p. 100645, 2023.
- [71].Hartmann J, “Emotion English DistilRoBERTa-base,” https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/, 2022.
- [72].Hartmann J et al., “More than a feeling: Accuracy and application of sentiment analysis,” International Journal of Research in Marketing, vol. 40, no. 1, pp. 75–87, 2023.
- [73].Ahmad F et al., “A deep learning architecture for psychometric natural language processing,” ACM Transactions on Information Systems (TOIS), vol. 38, no. 1, pp. 1–29, 2020.
- [74].Giannakopoulos T, “pyAudioAnalysis: An open-source Python library for audio signal analysis,” PLOS ONE, vol. 10, no. 12, 2015.
- [75].Yang S-w et al., “SUPERB: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021.
- [76].Boccignone G et al., “An open framework for remote-PPG methods and their assessment,” IEEE Access, vol. 8, pp. 216083–216103, 2020.
- [77].Casado CA and López MB, “Face2PPG: An unsupervised pipeline for blood volume pulse extraction from faces,” IEEE Journal of Biomedical and Health Informatics, 2023.
- [78].Vest AN et al., “An open source benchmarked toolbox for cardiovascular waveform and interval analysis,” Physiological Measurement, vol. 39, no. 10, p. 105004, 2018.
- [79].Linderman S et al., “SSM: Bayesian learning and inference for state space models,” https://github.com/lindermanlab/ssm, Oct. 2020.
- [80].Ma H, Cai M, and Wang H, “Emotional blunting in patients with major depressive disorder: A brief non-systematic review of current research,” Frontiers in Psychiatry, vol. 12, p. 792960, 2021.
- [81].Trémeau F et al., “Facial expressiveness in patients with schizophrenia compared to depressed patients and nonpatient comparison subjects,” American Journal of Psychiatry, vol. 162, no. 1, pp. 92–101, 2005.
- [82].Bylsma LM, Morris BH, and Rottenberg J, “A meta-analysis of emotional reactivity in major depressive disorder,” Clinical Psychology Review, vol. 28, no. 4, pp. 676–691, 2008.
- [83].Davies H et al., “Facial expression to emotional stimuli in non-psychotic disorders: A systematic review and meta-analysis,” Neuroscience & Biobehavioral Reviews, vol. 64, pp. 252–271, 2016.
- [84].Valenza G et al., “Mood states modulate complexity in heartbeat dynamics: A multiscale entropy analysis,” Europhysics Letters, vol. 107, no. 1, p. 18003, 2014.
- [85].Zhao L et al., “Cardiorespiratory coupling analysis based on entropy and cross-entropy in distinguishing different depression stages,” Frontiers in Physiology, vol. 10, p. 359, 2019.
- [86].Schultebraucks K et al., “Deep learning-based classification of post-traumatic stress disorder and depression following trauma utilizing visual and auditory markers of arousal and mood,” Psychological Medicine, vol. 52, no. 5, pp. 957–967, 2022.
- [87].Xing Y et al., “Task-state heart rate variability parameter-based depression detection model and effect of therapy on the parameters,” IEEE Access, vol. 7, pp. 105701–105709, 2019.
- [88].Byun S et al., “Detection of major depressive disorder from linear and nonlinear heart rate variability features during mental task protocol,” Computers in Biology and Medicine, vol. 112, p. 103381, 2019.
- [89].Hasib KM et al., “Depression detection from social networks data based on machine learning and deep learning techniques: An interrogative survey,” IEEE Transactions on Computational Social Systems, vol. 10, no. 4, pp. 1568–1586, 2023.
- [90].Yang K, Lau RY, and Abbasi A, “Getting personal: A deep learning artifact for text-based measurement of personality,” Information Systems Research, vol. 34, no. 1, pp. 194–222, 2023.
- [91].Weeks JW et al., “‘The sound of fear’: Assessing vocal fundamental frequency as a physiological indicator of social anxiety disorder,” Journal of Anxiety Disorders, vol. 26, no. 8, pp. 811–822, 2012.
- [92].Grippo AJ and Johnson AK, “Stress, depression and cardiovascular dysregulation: A review of neurobiological mechanisms and the integration of research from preclinical disease models,” Stress, vol. 12, no. 1, pp. 1–21, 2009.
- [93].Zang X et al., “End-to-end depression recognition based on a one-dimensional convolution neural network model using two-lead ECG signal,” Journal of Medical and Biological Engineering, vol. 42, no. 2, pp. 225–233, 2022.
- [94].Devlin J et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [95].Brown T et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [96].OpenAI, “GPT-4 technical report,” https://arxiv.org/abs/2303.08774, 2023.
- [97].Lau C, Zhu X, and Chan W-Y, “Automatic depression severity assessment with deep learning using parameter-efficient tuning,” Frontiers in Psychiatry, vol. 14, p. 1160291, 2023.
- [98].Farruque N et al., “Depression symptoms modelling from social media text: A semi-supervised learning approach,” arXiv preprint arXiv:2209.02765, 2022.
- [99].Zheng Y et al., “General facial representation learning in a visual-linguistic manner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18697–18709.
- [100].Cai Z et al., “MARLIN: Masked autoencoder for facial video representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1493–1504.
- [101].Touvron H et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- [102].Goodman WK, Murphy TK, and Storch EA, “Risk of adverse behavioral effects with pediatric use of antidepressants,” Psychopharmacology, vol. 191, pp. 87–96, 2007.
- [103].Buyukdura JS, McClintock SM, and Croarkin PE, “Psychomotor retardation in depression: Biological underpinnings, measurement, and treatment,” Progress in Neuro-Psychopharmacology and Biological Psychiatry, vol. 35, no. 2, pp. 395–409, 2011.
- [104].Halper JP and Mann JJ, “Cardiovascular effects of antidepressant medications,” The British Journal of Psychiatry, vol. 153, no. S3, pp. 87–98, 1988.
- [105].Nasiri S et al., “Exploiting labels from multiple experts in automated sleep scoring,” Sleep, vol. 46, no. 5, p. zsad034, 2023.
- [106].Nasir M et al., “Redundancy analysis of behavioral coding for couples therapy and improved estimation of behavior from noisy annotations,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1886–1890.
- [107].Song H et al., “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [108].Johnson DS, Hakobyan O, and Drimalla H, “Towards interpretability in audio and visual affective machine learning: A review,” arXiv preprint arXiv:2306.08933, 2023.
- [109].Stiglic G et al., “Interpretability of machine learning-based prediction models in healthcare,” WIREs Data Mining and Knowledge Discovery, vol. 10, no. 5, p. e1379, 2020.
- [110].Schwartz RC and Blankenship DM, “Racial disparities in psychotic disorder diagnosis: A review of empirical literature,” World Journal of Psychiatry, vol. 4, no. 4, p. 133, 2014.
- [111].Goldsmith DR et al., “An update on promising biomarkers in schizophrenia,” Focus, vol. 16, no. 2, pp. 153–163, 2018.
- [112].Lewis CC et al., “Implementing measurement-based care in behavioral health: A review,” JAMA Psychiatry, vol. 76, no. 3, pp. 324–335, 2019.
- [113].Berendsen S et al., “Burying our heads in the sand: The neglected importance of reporting inter-rater reliability in antipsychotic medication trials,” Schizophrenia Bulletin, vol. 46, no. 5, pp. 1027–1029, 2020.
- [114].Rodríguez-Rivas ME et al., “Innovative technology–based interventions to reduce stigma toward people with mental illness: Systematic review and meta-analysis,” JMIR Serious Games, vol. 10, no. 2, p. e35099, 2022.