Table 2. Methodological details of included studies that classify UD^a, BD^b, and HC^c.
| Study | Data recording | Data used | Data preprocessing | Specific variable or feature selection | Machine learning models or statistical test | Validation | Findings |
|---|---|---|---|---|---|---|---|
| **Wearable devices** | | | | | | | |
| Anmella et al [35], 2023 | 2 d | Physiological data with wearable devices (Empatica E4) | Rules-based filter for invalid physiological data; time unit set to 1 second | X-, Y-, and Z-axis acceleration, blood volume pulse, electrodermal activity, heart rate, and skin temperature | BiLSTM^d | —^e | 7-class classification task: ACC^f=0.7; AUROC^g=0.69; F1-score=0.6927 |
| Zakariah and Alotaibi [36], 2023 | 5-20 d | General levels of activity with a wearable Actiwatch | Imputation techniques (mean, median, or regression-based imputation) for missing values; categorical variables transformed into numerical representations | Motor activity measurement from the Actiwatch | UMAP^h and NN^i | Leave-one-out validation | 4-class classification task: ACC=0.991; F1-score=0.9887 |
| **Audiovisual recordings** | | | | | | | |
| Yang et al [37], 2016 | 1 d | Speech responses to 5 questions after participants watched 6 videos | Silence removal and speech segmentation based on energy and spectral centroid as features for threshold definition | Emotion profiles, 39 dimensions of Mel-frequency cepstral coefficients, and acoustic features of 384 dimensions | SVM^j, MLP^k, LSTM^l, and BiLSTM | 13-fold cross-validation | 3-class classification task: optimal ACC=76.92% |
| Su et al [38], 2017 | 1 d | Facial expressions elicited by 6 emotional video clips | Selection of a time interval and segmentation of each facial image into 12 mutually independent facial regions | 8 basic orientations of the motion vector in microscopic facial expressions | HMM^m and LSTM | 12-fold cross-validation | 3-class classification task: optimal ACC=67.7% |
| Hong et al [39], 2018 | 1 d | Facial expressions elicited by 6 emotional video clips | Selection of a time interval and alignment of facial points to a new coordinate system | 12 action units | MLP, SVM, GMM^n, and LSTM | 12-fold cross-validation | 3-class classification task: optimal ACC=61.1% |
| Huang et al [40], 2019 | 1 d | Speech responses from interviews with a clinician after participants watched 6 videos | Use of a hierarchical spectral clustering algorithm for database adaptation | Emotion profiles and 32-dimensional acoustic features | SVM, CNN^o, and LSTM | Leave-one-out cross-validation | 3-class classification task: optimal ACC=75.56% |
| Su et al [41], 2020 | 1 d | Facial expressions and speech responses from interviews with a clinician after participants watched 6 videos | Hierarchical spectral clustering and denoising autoencoder method for database adaptation | Emotion profiles, action units, 384 acoustic features, and 49 facial expression feature points | SVM, HMM, MLP, GRU^p, CNN, RNN^q, and LSTM | 13-fold cross-validation | 3-class classification task: optimal ACC=76.9% |
| Hong et al [42], 2021 | 1 d | Facial expressions elicited by 6 emotional video clips | Selection of four 4-second intervals per elicitation video based on the facial expression intensity of all participants | Action units for macroscopic facial expressions and motion vectors for microscopic facial expressions | MLP, NN, and LSTM | 12-fold cross-validation | 3-class classification task: optimal ACC=72.2% |
| Luo et al [43], 2024 | 1 d | Voice signals collected from 7 pieces of reading material | Power normalization and speech segmentation into 7 parts corresponding to the 7 reading materials | 120 vocal features for classification, such as the mean root-mean-square energy | DT^r, NB^s, SVM, KNN^t, EL^u, and CNN | — | 3-class classification task: optimal ACC=95.6% |
| **Multimodal technology** | | | | | | | |
| Wu et al [44], 2024 | 1 d | Text, audio, facial attributes, heart rate, and eye movement captured with mobile devices while participants conversed with a virtual assistant | — | Word embeddings; 5 spectral features, facial attribute embeddings, 23 heart rate variability indices, and 7 eye movement features (fixation and saccade) | RF^v, LSTM, and DT | 5-fold cross-validation | 5-class classification task: optimal ACC=90.26% |
^a UD: unipolar depression.
^b BD: bipolar disorder.
^c HC: healthy control.
^d BiLSTM: bidirectional long short-term memory.
^e Not available.
^f ACC: accuracy.
^g AUROC: area under the receiver operating characteristic curve.
^h UMAP: uniform manifold approximation and projection.
^i NN: neural network.
^j SVM: support vector machine.
^k MLP: multilayer perceptron.
^l LSTM: long short-term memory.
^m HMM: hidden Markov model.
^n GMM: Gaussian mixture model.
^o CNN: convolutional neural network.
^p GRU: gated recurrent unit.
^q RNN: recurrent neural network.
^r DT: decision tree.
^s NB: naive Bayes.
^t KNN: k-nearest neighbor.
^u EL: ensemble learning.
^v RF: random forest.
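Several of the models summarized in Table 2 are recurrent sequence classifiers applied to windowed multichannel signals. The sketch below is purely illustrative and is not any included study's implementation: it shows, in PyTorch, the general shape of a BiLSTM classifier over wearable physiological channels such as those listed for Anmella et al [35] (acceleration, blood volume pulse, electrodermal activity, heart rate, and skin temperature); the window length, hidden size, and any training details are assumptions for illustration only.

```python
# Illustrative sketch only: a BiLSTM classifier over windowed wearable signals.
# 7 input channels (3-axis acceleration, BVP, EDA, heart rate, skin temperature)
# at 1-second resolution, mapped to 7 output classes. Sizes are assumed values.
import torch
import torch.nn as nn

class WearableBiLSTM(nn.Module):
    def __init__(self, n_channels=7, hidden=64, n_classes=7):
        super().__init__()
        # Bidirectional LSTM runs over the time dimension of each window
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        # Linear head maps the concatenated forward/backward states to class scores
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time_steps, n_channels)
        out, _ = self.lstm(x)
        # Use the last time step's hidden state (both directions) for classification
        return self.head(out[:, -1, :])

# Example: a batch of 8 windows, each 60 one-second samples of 7 channels
model = WearableBiLSTM()
windows = torch.randn(8, 60, 7)
logits = model(windows)  # shape (8, 7): scores for a 7-class task
print(logits.shape)
```

In practice, the studies in Table 2 evaluate such models with subject-aware schemes (for example, leave-one-out or k-fold cross-validation) rather than a single random split, which is why the Validation column is reported separately from the model architecture.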