BMJ Mental Health. 2025 Sep 29;28(1):e301858. doi: 10.1136/bmjment-2025-301858

Acoustic signatures of depression elicited by emotion-based and theme-based speech tasks

Qunxing Lin 1, Xiaohua Wu 1, Yueshiyuan Lei 1, Wanying Cheng 1, Shan Huang 1, Weijie Wang 1, Chong Li 2, Jiubo Zhao 1,2,
PMCID: PMC12481279  PMID: 41027679

Abstract

Background

Major depressive disorder (MDD) remains underdiagnosed worldwide, partly due to reliance on self-reported symptoms and clinician-administered interviews.

Objective

This study examined whether a speech-based classification model using emotionally and thematically varied image-description tasks could effectively distinguish individuals with MDD from healthy controls.

Methods

A total of 120 participants (59 with MDD, 61 healthy controls) completed four speech tasks: three emotionally valenced images (positive, neutral, negative) and one Thematic Apperception Test (TAT) stimulus. Speech responses were segmented, and 23 acoustic features were extracted per sample. Classification was performed using a long short-term memory (LSTM) neural network, with SHapley Additive exPlanations (SHAP) applied for feature interpretation. Four traditional machine learning models (support vector machine, decision tree, k-nearest neighbour, random forest) served as comparators. Within-subject variation in speech duration was assessed with repeated-measures analysis of variance (ANOVA).

Findings

The LSTM model outperformed traditional classifiers, capturing temporal and dynamic speech patterns. The positive-valence image task achieved the highest accuracy (87.5%), followed by the negative-valence (85.0%), TAT (84.2%) and neutral-valence (81.7%) tasks. SHAP analysis highlighted task-specific contributions of pitch-related and spectral features. Significant differences in speech duration across tasks (p<0.01) indicated that affective valence influenced speech production.

Conclusions

Emotionally enriched and thematically ambiguous tasks enhanced automated MDD detection, with positive-valence stimuli providing the greatest discriminative power. SHAP interpretation underscored the importance of tailoring models to different speech inputs.

Clinical implications

Speech-based models incorporating emotionally evocative and projective stimuli offer a scalable, non-invasive approach for early depression screening. Their reliance on natural speech supports cross-cultural application and reduces stigma and literacy barriers. Broader validation is needed to facilitate integration into routine screening and monitoring.

Keywords: Depression, Mental Health


WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Speech-based approaches using acoustic features and deep learning show promise for identifying depression. However, most previous studies have relied on narrow or uniform speech tasks and offered limited insight into which acoustic features drive classification across emotional contexts.

WHAT THIS STUDY ADDS

  • This study shows that the use of emotionally and thematically varied speech tasks, particularly positive-valence prompts, improves the performance of long short-term memory-based depression classification. Task-specific SHapley Additive exPlanations analysis highlights distinct acoustic feature contributions and provides interpretable evidence of how emotional valence influences depressive speech markers.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Incorporating emotionally diverse speech tasks into screening protocols may enhance both the sensitivity and ecological validity of depression detection. Combining interpretable artificial intelligence with natural speech input could support scalable, language-independent tools for digital mental healthcare.

Background

Depression is a common mental disorder with a substantial global impact. Between 1990 and 2017, the number of new cases increased by almost 50%, reaching 258 million worldwide. The 2021 Global Burden of Disease Study reported continued rises in incidence and disability-adjusted life years, particularly in high-sociodemographic index regions and among women. Depression is now one of the leading causes of disability and contributes to more than 700 000 suicides each year, with especially high risk among young people. Despite its prevalence, depression often goes undiagnosed and untreated. Globally, only 7%–28% of affected individuals receive adequate care.1 This treatment gap is driven largely by the limitations of current screening methods, which rely on self-report questionnaires and clinician-administered interviews. Self-reports are prone to under-reporting and social desirability bias, while clinical interviews, although considered the gold standard, are time-consuming, costly and resource-intensive.2 Evidence suggests that many people with depression deny symptoms in self-reports, reducing the reliability of such tools. Diagnostic interviews, while accurate, are impractical for large-scale use because of personnel demands and potential interviewer bias.3 These limitations hinder early detection, particularly in low-resource and primary care settings, leaving many cases unidentified until symptoms become severe.

Given the limitations of traditional screening tools, objective biomarkers for depression are attracting growing attention. Speech is a particularly promising candidate because it reflects both emotional and cognitive processes, is easy to collect and incurs minimal cost. Importantly, language-independent acoustic features allow automated analysis, making speech-based methods feasible even in low-resource language settings. Early studies reported that individuals with depression tend to speak with lower pitch, reduced intensity, slower rate and greater monotony.4 More recent research has confirmed these patterns, noting reductions in fundamental frequency, narrower pitch range, slower articulation, longer pauses and lower overall volume. Voice quality markers such as increased jitter (frequency instability) and shimmer (amplitude instability) are also common.5 Collectively, these characteristics render depressive speech perceptibly ‘flat’ or ‘lifeless’, offering a reliable basis for automated detection.

Building on these findings, numerous studies have applied acoustic features and machine learning to identify depression. For instance, a study of Chinese speakers achieved 84.2% accuracy in distinguishing patients with major depressive disorder (MDD) from healthy controls, with acoustic features significantly correlating with 17-item Hamilton Depression Rating Scale (HAMD-17) scores.6 Cross-linguistic evidence, including studies in English and Chinese, further highlights the potential of acoustic markers as objective and scalable tools that can complement or partially replace traditional self-report measures.

Projective techniques have long been used to explore emotional and cognitive states by eliciting responses to ambiguous stimuli such as pictures or drawings. Classic examples include the Thematic Apperception Test (TAT), the House–Tree–Person test and various drawing tasks, all based on the assumption that individuals project internal conflicts and emotions onto ambiguous material. In depression research, evidence shows that projective responses differ between affected and unaffected individuals. Drawings by depressed individuals often display reduced detail, minimal social elements and symbolic features such as shaded eyes or missing accessories, reflecting withdrawal and avoidance.7 Narrative responses to ambiguous images, such as those used in the TAT, frequently feature negative emotions, diminished agency and fewer interpersonal themes. Recent advances indicate that automated assessments of projective material are feasible. For instance, digital analysis of tree drawings has identified features such as canopy size and trunk proportions that correlate with depressive symptoms.8 These findings suggest that projective tasks can capture affective disturbances through both visual and verbal modalities. Although limitations remain, particularly subjective scoring practices and the absence of standardised norms, projective methods provide a unique opportunity to access emotional disturbances. Unlike structured questionnaires, they elicit spontaneous, open-ended responses and allow personal content to emerge when individuals may resist or struggle to articulate their emotions directly.

This study integrates projective picture-description tasks with acoustic speech analysis to improve the detection of depression. Projective tasks elicit emotionally expressive speech without requiring explicit disclosure, whereas acoustic analysis provides an objective and scalable means of identifying vocal markers of depression. By combining these approaches, both psychological content and paralinguistic features are captured, enhancing the sensitivity, accessibility and cultural adaptability of early screening tools.

Objective

Traditional depression screening relies on self-report questionnaires and clinical interviews, which face limitations such as stigma and cognitive burden. This study evaluates a lightweight and interpretable long short-term memory (LSTM) model for classifying depression based on speech elicited from emotionally valenced and thematically ambiguous tasks. Multiple speech tasks were compared, including descriptions of positive, neutral and negative images, as well as responses to a TAT. Contributions of acoustic features were examined using SHapley Additive exPlanations (SHAP). Traditional machine learning models, including support vector machines, decision trees, k-nearest neighbours and random forests, were implemented as baselines. This comparison illustrates the advantages of the LSTM model in capturing temporal dynamics and interpreting sequential data.

Methods

Experimental procedure

Speech data were collected in multiple batches from September 2019 to December 2021 using a consistent setup to ensure uniformity. Patients were recorded at the Psychotherapy Department of Zhujiang Hospital, while healthy controls were recorded in a sound-attenuated laboratory at the Psychology Department of Southern Medical University. Ambient noise was kept below 25 dB to ensure recording quality. Participants completed a 15 min acclimation period, including deep-breathing exercises, before being seated in ergonomic chairs with head supports, maintaining a fixed 15 cm distance from a microphone placed directly in front. Speech tasks were presented using MATLAB 2023b with randomised image order, and participants responded without time limits. Audio was recorded with a KAXISAIER microphone and Realtek sound card at 96 kHz and 24-bit resolution and saved as Waveform files. Microphone sensitivity was calibrated before each session to maintain consistent quality. The full protocol is detailed in online supplemental sFigure 1.

Participants

A priori power analysis was conducted using G*Power V.3.1.9.7 to determine the minimum sample size required to detect a medium effect size (Cohen’s d=0.5) in planned group comparisons using a two-tailed t-test. The analysis indicated that 36 participants, with 18 in each group, were needed to achieve adequate statistical power at an alpha of 0.05 and power of 0.95. This calculation applied to traditional inferential statistics, not machine learning. To improve reliability and allow for exclusions, 122 participants were initially recruited, including 62 healthy controls and 60 individuals diagnosed with MDD. After applying inclusion and exclusion criteria, the final sample comprised 61 healthy controls, including 24 males, and 59 individuals with MDD, including 19 males. MDD participants were recruited from the outpatient psychiatric department of Zhujiang Hospital in Guangzhou. Diagnoses were confirmed by an experienced psychiatrist according to Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) criteria. Only individuals experiencing a first episode of severe depression without psychotic features were included. Participants with comorbid psychiatric disorders, prior medication use or major physical illnesses were excluded. All MDD patients were assessed before initiating pharmacological treatment. Both groups were screened for physical health and excluded if they had abnormal body mass index or major medical conditions. Healthy controls had no psychiatric history or prior treatment. To control physiological variation, all participants abstained from alcohol, caffeine and other substances affecting cardiovascular function for at least 12 hours before recording.

The study was approved by the Medical Ethics Committee of Southern Medical University, Guangzhou (NFYKDX003) and registered with the Chinese Clinical Trial Registry (ChiCTR2400083328). Written informed consent was obtained from all participants and their guardians. The study adhered to the Declaration of Helsinki and followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis reporting guidelines. An overview of the study is provided in online supplemental sFigure 2.

Materials

Pictorial stimuli were selected from the Chinese Affective Picture System, a standardised database of 852 images rated on valence, arousal and dominance.9 For each valence category (positive, neutral and negative), three candidate images with comparable arousal and dominance scores were initially chosen. One image from each category was then randomly selected for the formal experiment, resulting in three stimuli representing distinct emotional states. In addition, one image from the TAT, depicting a ‘Middle-aged Woman and Elderly Woman’, was included to introduce thematic ambiguity.

Feature extraction

MATLAB (2023b) was used to extract speech features through a standardised pipeline that included preprocessing, frame segmentation, computation and normalisation. Audio files were converted to mono-channel, and sampling rates were recorded for consistency. Frame-based analysis employed 100 ms Hamming windows with a 50 ms step size. A total of 23 frame-level acoustic features were extracted. Pitch track captured prosodic variation through fundamental frequency, while energy reflected signal intensity and vocal effort. Zero-crossing rate measured changes in signal polarity, indicating spectral properties. Mel-frequency cepstral coefficients (MFCCs) described timbral aspects of speech, and Teager energy quantified instantaneous energy sensitive to subtle vocal modulations. Spectral centroid, spectral flatness and spectral slope described the distribution and shape of the spectrum. Formant frequencies F1 to F3 represented vocal tract resonances relevant to vowel identity and articulation. To ensure quality, consecutive duplicate values were removed, and each feature sequence was normalised to 500 frames using linear interpolation for deep learning compatibility. For traditional machine learning models, features were summarised using the mean and SD to create fixed-length vectors. In addition, four duration-related metrics were included: total duration, pause duration, speech duration and speech rate. Detailed definitions and methods are provided in online supplemental sMethods and sTable 1.
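
To make the pipeline concrete, below is a minimal Python/librosa sketch of the same steps (100 ms Hamming frames, 50 ms step, interpolation to 500 frames, mean/SD summaries for the traditional models). The original extraction was performed in MATLAB 2023b, so the library, function choices and pitch-search bounds here are illustrative assumptions rather than the authors' code, and only a subset of the 23 features is shown.

```python
# Illustrative re-implementation sketch of the frame-level feature pipeline
# (assumption: librosa in Python; the paper used MATLAB 2023b).
import numpy as np
import librosa

TARGET_FRAMES = 500  # fixed sequence length used for the LSTM input


def extract_frame_features(path: str) -> np.ndarray:
    """Return an array of shape (TARGET_FRAMES, n_features) for one recording."""
    y, sr = librosa.load(path, sr=None, mono=True)  # keep the native sampling rate
    frame = int(0.100 * sr)                         # 100 ms Hamming window
    hop = int(0.050 * sr)                           # 50 ms step size

    # A subset of the 23 reported features: pitch track, energy, ZCR, MFCCs.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, frame_length=frame, hop_length=hop)
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame,
                                hop_length=hop, window='hamming')

    n = min(len(f0), len(energy), len(zcr), mfcc.shape[1])
    feats = np.vstack([f0[:n], energy[:n], zcr[:n], mfcc[:, :n]]).T  # (n_frames, n_feats)

    # Normalise every feature sequence to 500 frames with linear interpolation.
    old_x = np.linspace(0.0, 1.0, n)
    new_x = np.linspace(0.0, 1.0, TARGET_FRAMES)
    return np.column_stack([np.interp(new_x, old_x, feats[:, j])
                            for j in range(feats.shape[1])])


def summarise(seq: np.ndarray) -> np.ndarray:
    """Mean/SD summary giving the fixed-length vectors used by the traditional models."""
    return np.concatenate([seq.mean(axis=0), seq.std(axis=0)])
```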

Traditional machine learning models and deep learning methods

To provide a baseline for comparison with the LSTM model, four traditional machine learning classifiers were implemented: support vector machine, decision tree, k-nearest neighbours and random forest. These models served as reference points for evaluating the added value of incorporating temporal information in deep learning. Performance was assessed using fivefold cross-validation, with accuracy, precision, recall, F1-score, confusion matrices and receiver operating characteristic (ROC) curves reported for each classifier.
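
A minimal sketch of this baseline comparison, assuming scikit-learn and the fixed-length mean/SD vectors described above; the tuned hyperparameters reported in online supplemental sTable 2 are not reproduced here, so library defaults serve as placeholders.

```python
# Baseline classifiers with fivefold cross-validation (sketch; scikit-learn assumed).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_baselines(X: np.ndarray, y: np.ndarray) -> dict:
    """Fivefold cross-validated accuracy, precision, recall, F1 and ROC-AUC per model."""
    models = {
        'SVM': make_pipeline(StandardScaler(), SVC(probability=True)),
        'DecisionTree': DecisionTreeClassifier(),
        'kNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
        'RandomForest': RandomForestClassifier(n_estimators=200),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    return {name: cross_validate(model, X, y, cv=cv, scoring=scoring)
            for name, model in models.items()}
```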

An LSTM network was applied to classify participants’ depression status from their speech recordings. Classification accuracy was the primary performance metric, and precision, recall and F1-score were also reported. Final evaluation included a confusion matrix, a permutation test confirming the statistical significance of accuracy (p<0.01), ROC curves with Area Under the Curve (AUC) values and precision–recall curves with average precision scores. Together, these analyses provided a comprehensive and statistically supported assessment of model performance across folds.
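
The implementation is not published with the paper; the sketch below shows one plausible Keras LSTM classifier for inputs of 500 frames × 23 features. Layer sizes, dropout, learning rate, batch size and epoch count are placeholders, not the tuned values listed in online supplemental sTable 2.

```python
# Sketch of an LSTM classifier for depression detection (Keras assumed).
import tensorflow as tf


def build_lstm(n_frames: int = 500, n_features: int = 23) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_frames, n_features)),
        tf.keras.layers.LSTM(64),                        # captures temporal dynamics
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # MDD vs healthy control
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.AUC(name='auc')])
    return model


# Typical use inside one fold of the fivefold cross-validation:
# model = build_lstm()
# model.fit(X_train, y_train, epochs=50, batch_size=16, validation_data=(X_val, y_val))
```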

Model optimisation involved tuning hyperparameters such as the number of layers, size of LSTM units, learning rate, batch size and training epochs. Detailed parameters for both the traditional models and the LSTM are presented in online supplemental sTable 2, with the network architecture shown in online supplemental sFigure 3. To improve interpretability, SHAP values were applied to evaluate the contribution of acoustic features, including pitch, energy and MFCCs, clarifying the factors most influential in classification outcomes.
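
As a sketch of the interpretability step, the snippet below assumes the Python shap package and its GradientExplainer applied to the Keras model above; the paper does not state which SHAP explainer was used, and frame-level attributions are simply averaged over time to yield one importance score per acoustic feature.

```python
# SHAP feature attribution for the sequence model (sketch; explainer choice is an assumption).
import numpy as np
import shap


def feature_importance(model, X_background, X_test, feature_names):
    """Mean |SHAP| per acoustic feature, aggregated over samples and frames."""
    explainer = shap.GradientExplainer(model, X_background)
    shap_values = explainer.shap_values(X_test)
    # Collapse any list/output axis returned by different shap versions to
    # (samples, frames, features).
    sv = np.asarray(shap_values).reshape(-1, X_test.shape[1], X_test.shape[2])
    importance = np.abs(sv).mean(axis=(0, 1))
    return dict(zip(feature_names, importance))
```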

Statistical analysis

All analyses were performed using SPSS V.23.0. Independent t-tests were used to compare continuous variables such as age, and χ2 tests were applied to assess categorical variables such as gender. The significance level was set at 0.01. A repeated-measures ANOVA was conducted to examine speech duration across tasks and groups, assessing within-subject variation, between-group differences and interaction effects to determine whether task-related patterns differed by group.
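
The ANOVA itself was run in SPSS V.23.0; purely as an illustration, the same task-by-group mixed design could be specified in Python with pingouin (an assumed tool), using hypothetical column names for long-format data.

```python
# Mixed-design ANOVA on speech duration: task (within, 4 levels) x group (between).
# Sketch only; the study used SPSS V.23.0, and the column names below are hypothetical.
import pandas as pd
import pingouin as pg


def duration_anova(df: pd.DataFrame) -> pd.DataFrame:
    """Expects long-format data with columns 'participant', 'group', 'task', 'duration'.

    With the default correction='auto', pingouin tests sphericity on the within
    factor and reports a Greenhouse-Geisser-corrected p value when it is violated,
    mirroring the correction applied in the paper.
    """
    return pg.mixed_anova(data=df, dv='duration', within='task',
                          subject='participant', between='group')
```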

Findings

Demographic analysis between groups

Table 1 presents participant demographics, including age, gender, residence, education, household economic status, only-child status, marital status, major physical illnesses, family psychiatric history, family suicide history, acquaintance suicide history and childhood left-behind experience. Among these 12 variables, only family psychiatric history differed significantly between groups (p<0.01). Specifically, 60 healthy controls reported no family psychiatric history and 1 reported a positive history, whereas 49 patients with depression reported no history and 10 reported a positive history. All other demographic variables showed no significant differences between groups (all p>0.01).

Table 1. Statistics of demographic data.

Variable Total (n=120) Healthy controls (n=61) Patients with MDD (n=59) t/χ² p value
Age, mean±SD 21.76±2.29 21.30±2.14 22.24±2.36 −2.293 0.024
Gender, n (%) 0.665 0.415
 Male 43 (35.83) 24 (39.34) 19 (32.20)
 Female 77 (64.17) 37 (60.66) 40 (67.80)
Residential location, n (%) 1.727 0.189
 Countryside 52 (43.33) 30 (49.18) 22 (37.29)
 City 68 (56.67) 31 (50.82) 37 (62.71)
Education, n (%) 0.167 0.683
 Bachelor degree or below 100 (83.33) 50 (81.97) 50 (84.75)
 Above bachelor’s degree 20 (16.67) 11 (18.03) 9 (15.25)
Economic, n (%) 0.101 0.751
 Good 27 (22.5) 13 (21.21) 14 (23.73)
 Poor 93 (77.5) 48 (78.79) 45 (76.27)
Only child, n (%) 4.051 0.044
 No 72 (60) 42 (68.85) 30 (50.85)
 Yes 48 (40) 19 (31.15) 29 (49.15)
Marriage, n (%) 0.503 0.478
 No 79 (65.83) 42 (68.85) 37 (62.71)
 Yes 41 (34.17) 19 (31.15) 22 (37.29)
History of a major illness, n (%) 1.985 0.159
 No 115 (95.83) 60 (98.36) 55 (93.22)
 Yes 5 (4.17) 1 (1.64) 4 (6.78)
Family history of mental illness, n (%) 8.443 0.004*
 No 109 (90.83) 60 (98.36) 49 (83.05)
 Yes 11 (9.17) 1 (1.64) 10 (16.95)
Family history of suicide, n (%) 6.143 0.013
 No 111 (92.5) 60 (98.36) 51 (86.44)
 Yes 9 (7.5) 1 (1.64) 8 (13.56)
History of suicide among acquaintances, n (%) 2.833 0.092
 No 104 (86.66) 56 (96.72) 48 (81.36)
 Yes 16 (13.34) 5 (3.28) 11 (18.64)
Left-behind, n (%) 1.627 0.202
 No 76 (63.33) 42 (68.85) 34 (57.63)
 Yes 44 (36.67) 19 (31.15) 25 (42.37)

Repeated measures ANOVA on speech duration

A two-way repeated-measures ANOVA was conducted to examine differences in speech duration across tasks and groups. Because Mauchly’s test indicated a violation of sphericity, the Greenhouse–Geisser correction was applied. There was a significant main effect of task, F(2.64)=25.51, p<0.01, partial η²=0.178, indicating a large effect size. No significant main effect of group was found, F(2.64)=1.01, p>0.01, partial η²=0.008, and no significant task-by-group interaction was observed, F(1)=0.38, p>0.01, partial η²=0.003. Post hoc analyses showed that the TAT and negative-valence tasks elicited longer speech durations, with the TAT producing the longest responses. Full descriptive and ANOVA results are provided in online supplemental sTables 3–5.

Performance of traditional machine learning models

Among the traditional classifiers, the support vector machine (SVM) performed best on the negative-valence task, with an AUC of 0.716 and an accuracy of 0.660. For the positive-valence task, the random forest (RF) showed the highest performance, with an AUC of 0.703 and an accuracy of 0.693. The neutral-valence task was also best classified by the SVM, yielding an AUC of 0.638 and an accuracy of 0.622. In the TAT task, the decision tree achieved the best results, with an AUC of 0.648 and an accuracy of 0.649. ROC curves are presented in figure 1. Average fivefold confusion matrices and complete results, including accuracy, precision, recall and F1 scores, are provided in online supplemental sFigure 4 and sTable 6.

Figure 1. ROC curves of traditional machine learning models across four image tasks. AUC represents the classification performance of each model. AUC, Area Under the Curve; k-NN, k-nearest neighbours; ROC, receiver operating characteristic curve; SVM, support vector machine; TAT, Thematic Apperception Test.


LSTM model performance and SHAP values

Model performance

Classification accuracies for task-specific LSTM models were 87.50% (p<0.01) for the positive-valence image task, 85.00% (p<0.01) for the negative-valence image task, 81.67% (p<0.01) for the neutral-valence image task and 84.17% (p<0.01) for the TAT image task. Detailed fold-wise results, including precision, recall and F1-score, are provided in online supplemental sTable 7. In addition, a composite figure summarising the fivefold average confusion matrix, permutation accuracy distribution, ROC curves and precision–recall curves across all four tasks and models is presented in figure 2.

Figure 2. Integrated visualisation of classification performance across tasks and models. AP, average precision; AUC, Area Under the Curve; HC, healthy controls; PR, precision–recall; ROC, receiver operating characteristic curve; TAT, Thematic Apperception Test.


SHAP values

SHAP analyses revealed distinct feature importance patterns across the four image tasks (figure 3). Positive SHAP values indicated stronger support for a class, while negative values denoted inhibitory effects.

Figure 3. The SHAP values of the four materials in the LSTM model. MFCCs are widely used acoustic features representing the short-term power spectrum of speech. HC, healthy control; LSTM, long short-term memory; MDD, major depressive disorder; MFCC, Mel-frequency cepstral coefficient; SHAP, SHapley Additive exPlanations.


Negative-valence image

MFCC3, Energy and TeagerEnergy supported MDD, with MFCC3 also showing a negative effect on the healthy control (HC) class. SpectralFlatness, zero-crossing rate (ZCR) and PitchTrack supported HC, with SpectralFlatness and PitchTrack additionally inhibiting MDD.

Positive-valence image

ZCR, MFCC11 and MFCC12 supported MDD. SpectralFlatness, Energy, TeagerEnergy and MFCC11 contributed to HC, with MFCC11 showing a dual role.

Neutral-valence image

SpectralFlatness, Energy, MFCC11 and MFCC9 supported HC. Energy, ZCR, MFCC12 and MFCC9 contributed to MDD, with Energy showing dual contributions.

TAT image

SpectralFlatness, TeagerEnergy, MFCC9 and Energy supported MDD, with MFCC9 also inhibiting HC. MFCC3, ZCR and Energy contributed to HC, with Energy again supporting both groups.

Discussion

The comparative analysis indicated that LSTM models consistently outperformed traditional machine learning approaches, as even the best-performing conventional classifier did not reach LSTM accuracy. This advantage reflects the ability of LSTMs to capture temporal dependencies and dynamic acoustic patterns that static summary features cannot represent. The positive-valence task yielded the highest accuracy (87.50%, p<0.01), followed by the negative-valence (85.00%, p<0.01) and TAT (84.17%, p<0.01) tasks, whereas the neutral-valence task showed the lowest accuracy (81.67%, p<0.01) but remained robust. These results highlight the strength of sequential models in detecting subtle, task-specific acoustic markers of depression and support their potential for reliable, context-sensitive screening tools.

Positive-valence image on depression detection

Positive emotional materials typically evoke approach-related affect and activate brain reward regions such as the ventral striatum, anterior cingulate cortex and insula.10 However, individuals with MDD show consistently blunted neural and behavioural responses to these stimuli. This includes reduced facial expressivity, lower subjective arousal and impaired recall of positive content, a phenomenon referred to as ‘positive attenuation’.11 In speech, this diminished emotional engagement manifests as reduced prosodic variation, monotone intonation, decreased vocal energy and longer pauses during positive tasks. Consequently, individuals with depression tend to speak for longer when discussing positive material, reflecting flattened speech dynamics and reduced temporal variability. These alterations provide valuable cues for automatic emotion detection systems. Moreover, depression is associated with a preference for maintaining dysphoric states, which lowers motivation and cognitive effort when responding to positive stimuli.12 This reduced engagement results in speech patterns that differ from those of healthy individuals, allowing models to distinguish depressive speech during the positive-valence image task.

Negative-valence image on depression detection

Tasks involving negative emotional stimuli produced relatively high classification accuracy (85.00%), emphasising their utility in revealing vocal patterns associated with depression. Neuroimaging studies indicate that individuals with MDD process aversive content atypically, showing reduced activation in the hippocampus and insula, alongside heightened activity in the amygdala and orbitofrontal cortex.10 These regions support threat detection, emotional salience and regulation, and their altered activity reflects impaired control over negative affect. Such neurocognitive disruptions frequently manifest in speech. When exposed to negative stimuli, individuals with depression exhibit delayed responses, increased hesitation, slower articulation and more variable acoustic features.13 This is reflected in longer speech durations and more pronounced vocal anomalies. Indicators such as jitter, shimmer, flat intonation and extended pauses become more evident under negative emotional conditions, suggesting increased cognitive strain and reduced emotional expressiveness.14 These effects likely enhance group differences, making depressive speech patterns more detectable. The LSTM model may have leveraged these acoustic signals to improve classification, consistent with previous findings showing that emotionally charged content strengthens voice-based depression markers.15

Neutral-valence image on depression detection

The lowest classification accuracy (81.67%) occurred in the neutral-valence image task, likely reflecting the ambiguous nature of neutral stimuli and their limited ability to evoke emotionally salient speech features. Unlike positive or negative content, neutral material does not elicit strong affective responses, which may result in more uniform vocal patterns across both healthy and depressed individuals.16 This reduces the distinctiveness of depression-related speech markers. Neuroimaging studies indicate that neutral content engages emotional processing circuits, such as the limbic system and salience network, to a much lesser extent.17 Consequently, vocal output tends to be less variable in pitch, intensity and rhythm, resembling baseline speech. Under these conditions, features such as flattened prosody or slowed articulation are more difficult to differentiate from typical patterns. This increases within-group variability and diminishes distinctions between groups, thereby reducing classification performance. In addition, the neutral-valence image task imposes lower cognitive and emotional demands, leading to weaker engagement of compensatory speech-related neural mechanisms that might otherwise amplify depression-related vocal cues.18

TAT material on depression detection

The TAT task achieved high classification accuracy (84.17%), comparable to the negative-valence condition and higher than the neutral-valence task. This likely reflects the cognitive and emotional demands of constructing narratives from ambiguous social scenes, which elicit deep self-referential and affect-laden processes. Such contexts can amplify vocal markers of depression, as individuals with MDD often display maladaptive cognitive styles, including negative attribution, rumination and reduced positive expectancy. These tendencies are particularly likely to emerge during projective tasks such as the TAT.19 The TAT is designed to tap unconscious emotional and cognitive processes. In individuals with depression, responses often feature themes of hopelessness, guilt, conflict and passivity.20 These are commonly accompanied by acoustic markers such as reduced prosodic variation, increased pauses, slower speech rate and monotony. Such patterns are strongly associated with depressive symptoms and tend to be amplified in introspective, emotionally engaging tasks.

Neuroimaging evidence supports this perspective. Narrative and autobiographical tasks activate brain regions frequently disrupted in depression, including the medial prefrontal cortex, posterior cingulate cortex and hippocampus, which are involved in self-referential thought and emotional memory.21 When interpreting ambiguous TAT stimuli, individuals with depression may recruit these circuits differently, resulting in altered speech and narrative patterns. Research indicates that they often produce shorter, emotionally flat and more pessimistic stories than non-depressed participants.22 The open-ended format of the TAT allows for broad variability in linguistic and emotional content, providing a richer set of vocal features for extraction by models such as LSTM. Compared with structured tasks, such as word reading, the TAT elicits spontaneous speech that integrates affective tone, linguistic complexity and coherence, which are frequently impaired in depression.23

Cross-task interpretation

Taken together, these task-specific findings indicate that differences in classification accuracy are influenced not only by the acoustic features themselves but also by the degree of emotional engagement and cognitive processing required by each task. The superior performance observed in the positive-valence image task may result from the pronounced contrast it produces between groups. Healthy participants typically exhibit enriched prosodic variation and vocal energy in response to positive material, whereas individuals with depression display positive attenuation, characterised by flattened affect and reduced expressivity. This divergence enhances the discriminability of speech profiles. By comparison, the negative-valence image and TAT tasks also support high classification accuracy because they engage emotionally intense and self-referential processes that amplify depressive vocal markers, such as increased hesitation, longer pauses and monotone prosody. In contrast, the neutral-valence image task elicits relatively weak affective responses, producing more homogeneous and less distinctive speech patterns across groups, which likely explains its lower classification accuracy.

These results suggest that the richness of emotional salience and self-relevance embedded in a task is a key determinant of its diagnostic utility. Positive stimuli, by generating the largest contrast between depressed and healthy individuals, appear particularly effective in enhancing the sensitivity of voice-based depression detection models.

Task-specific acoustic features in depression classification

The SHAP analysis clarified the contribution of acoustic features to depression classification across tasks. In negative-valence tasks, MFCC3, Energy and TeagerEnergy supported MDD classification, whereas SpectralFlatness, ZCR and PitchTrack supported healthy status, with some features also inhibiting the opposite group to enhance discrimination. Energy exhibited a particularly dynamic role: it predicted MDD in negative contexts, supported healthy status in positive and neutral tasks, and contributed to both groups in the TAT task. This variability aligns with prior evidence that reduced speech energy reflects psychomotor slowing and emotional blunting in depression and suggests that its diagnostic value is better understood within a dimensional rather than categorical framework.24

MFCCs further demonstrated task-specific contributions. MFCC11 and MFCC12 were key markers of MDD in positive and neutral conditions, consistent with evidence that these coefficients capture subtle spectral changes linked to articulatory effort and glottal tension, which are often disrupted in depressive speech.25 In contrast, MFCC3 and MFCC9 contributed to both MDD and healthy classifications across tasks, likely reflecting normative prosodic variation and vocal tract stability. These results underscore the dual role of MFCCs in capturing both pathological deviations and typical patterns of vocal expression.26 TeagerEnergy, reflecting vocal effort and tension, consistently demonstrated strong positive contributions to MDD, particularly in negative and TAT tasks, consistent with findings that depressive speech often involves altered vocal fold dynamics and compensatory phonatory effort due to affective dysregulation.27

Spectral features showed complex, context-dependent roles. SpectralFlatness was most strongly associated with healthy classification in positive, neutral and TAT tasks, but in negative conditions it contributed negatively to MDD, suggesting that flatter, noise-like spectral profiles are characteristic of depression while their diagnostic value varies by task demands.28 SpectralSlope supported healthy classification, especially in positive-valence and TAT tasks, indicating preserved harmonic richness in non-depressed speech, although its effect was weaker in negative and neutral conditions. ZCR generally supported healthy status in negative, positive and TAT tasks but also contributed to MDD in neutral and TAT tasks, possibly reflecting compensatory articulatory adjustments or task engagement.29

Collectively, these findings indicate that accurate depression detection relies on the combined effects of spectral, prosodic and non-linear features, which vary with task context and emotional valence. They emphasise the need for task-specific modelling and support a dimensional view of vocal biomarkers in depression.

Limitation and conclusion

Despite promising results, several limitations should be acknowledged. The free-response format occasionally constrained speech length, limiting the depth of linguistic analysis. The study focused exclusively on acoustic features, and incorporating facial or physiological data could potentially enhance classification accuracy. The sample consisted primarily of young Mandarin-speaking adults, which limits generalisability to high-risk groups, including women, older adults and individuals with a history of trauma or familial depression. The cross-sectional design and moderate sample size may increase the risk of overfitting and preclude tracking symptom progression. Additionally, assuming uniform emotional responses may overlook individual differences. While this study compared LSTM models with traditional machine learning approaches to highlight the advantages of sequence modelling, future work should extend these comparisons to other deep learning architectures, such as gated recurrent units (GRUs) or Transformers, to further validate model robustness.

Future research should recruit participants from diverse age groups, cultural backgrounds and clinical risk populations to evaluate model transferability. Longitudinal and multimodal studies are needed to monitor symptom changes and improve predictive performance. Validation in real-world clinical settings is essential, particularly for identifying high-risk individuals. Cross-cultural investigations will further establish the generalisability of acoustic markers and support broader implementation of voice-based depression detection.

Clinical implications

Interpretable LSTM-based speech models provide a scalable and non-invasive approach for detecting depression and suicide risk by analysing natural speech, thereby reducing reliance on self-reports and the impact of stigma. SHAP offers transparent insights into the contribution of key acoustic features, while tasks incorporating varied emotional and projective content enhance sensitivity and ecological validity. These strengths highlight the potential of LSTM-based speech analysis as a tool for early screening, risk assessment and ongoing monitoring in mental healthcare.

Supplementary material

online supplemental file 1
bmjment-28-1-s001.docx (562.3KB, docx)
DOI: 10.1136/bmjment-2025-301858

Footnotes

Funding: This study was supported by the National Natural Science Foundation of China (Grant Numbers 72174082 and 82373695), the Natural Science Foundation of Guangdong Province (Grant Number 2023A1515011825) and the Guangdong-Hong Kong Joint Laboratory for Psychiatric Disorders (2023B1212120004).

Provenance and peer review: Not commissioned; externally peer-reviewed.

Patient consent for publication: Not applicable.

Ethics approval: This study was approved by the Medical Ethics Committee of Southern Medical University, Guangzhou (Ethics approval number: NFYKDX003). Participants gave informed consent to participate in the study before taking part.

Data availability free text: The datasets generated and analysed in this study involve sensitive personal information and are therefore not publicly accessible. Researchers who wish to access the data may submit a formal request, which will be considered following approval by the relevant ethics committee.

Correction notice: This paper has been amended since it was first published. The first affiliation has been corrected.

Data availability statement

Data are available upon reasonable request.

References

  • 1.Liu Q, He H, Yang J, et al. Changes in the global burden of depression from 1990 to 2017: Findings from the Global Burden of Disease study. J Psychiatr Res. 2020;126:134–40. doi: 10.1016/j.jpsychires.2019.08.002. [DOI] [PubMed] [Google Scholar]
  • 2.Eack SM, Greeno CG, Lee BJ. Limitations of the Patient Health Questionnaire in Identifying Anxiety and Depression: Many Cases Are Undetected. Res Soc Work Pract. 2006;16:625–31. doi: 10.1177/1049731506291582. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Arias-de la Torre J, Vilagut G, Serrano-Blanco A, et al. Accuracy of Self-Reported Items for the Screening of Depression in the General Population. Int J Environ Res Public Health. 2020;17:7955. doi: 10.3390/ijerph17217955. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Mundt JC, Snyder PJ, Cannizzaro MS, et al. Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology. J Neurolinguistics. 2007;20:50–64. doi: 10.1016/j.jneuroling.2006.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Mao K, Wu Y, Chen J. A systematic review on automated clinical depression diagnosis. Npj Ment Health Res . 2023;2:20. doi: 10.1038/s44184-023-00040-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Liang L, Wang Y, Ma H, et al. Enhanced classification and severity prediction of major depressive disorder using acoustic features and machine learning. Front Psychiatry. 2024;15:1422020. doi: 10.3389/fpsyt.2024.1422020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Deng X, Mu T, Wang Y, et al. The Application of Human Figure Drawing as a Supplementary Tool for Depression Screening. Front Psychol. 2022;13:865206. doi: 10.3389/fpsyg.2022.865206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Gu S, Liu Y, Liang F, et al. Screening Depressive Disorders With Tree-Drawing Test. Front Psychol. 2020;11:1446. doi: 10.3389/fpsyg.2020.01446. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Lu B, Hui MA, Yu-Xia H. The Development of Native Chinese Affective Picture System--A pretest in 46 College Students. Chin Ment Health J. 2005 [Google Scholar]
  • 10.Lee B-T, Seok J-H, Lee B-C, et al. Neural correlates of affective processing in response to sad and angry facial stimuli in patients with major depressive disorder. Prog Neuropsychopharmacol Biol Psychiatry. 2008;32:778–85. doi: 10.1016/j.pnpbp.2007.12.009. [DOI] [PubMed] [Google Scholar]
  • 11.Bylsma LM, Morris BH, Rottenberg J. A meta-analysis of emotional reactivity in major depressive disorder. Clin Psychol Rev. 2008;28:676–91. doi: 10.1016/j.cpr.2007.10.001. [DOI] [PubMed] [Google Scholar]
  • 12.Vanderlind WM, Millgram Y, Baskin-Sommers AR, et al. Understanding positive emotion deficits in depression: From emotion preferences to emotion regulation. Clin Psychol Rev. 2020;76:101826. doi: 10.1016/j.cpr.2020.101826. [DOI] [PubMed] [Google Scholar]
  • 13.Cannizzaro M, Harel B, Reilly N, et al. Voice acoustical measurement of the severity of major depression. Brain Cogn. 2004;56:30–5. doi: 10.1016/j.bandc.2004.05.003. [DOI] [PubMed] [Google Scholar]
  • 14.Kappen M, Vanhollebeke G, Van Der Donckt J, et al. Acoustic and prosodic speech features reflect physiological stress but not isolated negative affect: a multi-paradigm study on psychosocial stressors. Sci Rep. 2024;14:5515. doi: 10.1038/s41598-024-55550-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Espinola CW, Gomes JC, Pereira JMS, et al. Detection of major depressive disorder using vocal acoustic analysis and machine learning—an exploratory study. Res Biomed Eng. 2021;37:53–64. doi: 10.1007/s42600-020-00100-9. [DOI] [Google Scholar]
  • 16.Chlasta K, Wołk K, Krejtz I. Automated speech-based screening of depression using deep convolutional neural networks. Procedia Comput Sci. 2019;164:618–28. doi: 10.1016/j.procs.2019.12.228. [DOI] [Google Scholar]
  • 17.Hill KE, South SC, Egan RP, et al. Abnormal emotional reactivity in depression: Contrasting theoretical models using neurophysiological data. Biol Psychol. 2019;141:35–43. doi: 10.1016/j.biopsycho.2018.12.011. [DOI] [PubMed] [Google Scholar]
  • 18.Moore E, Clements M, Peifer J, et al. Analysis of prosodic variation in speech for clinical depression. Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No. 03CH37439), Vol. 3, IEEE; 2003. pp. 2925–8. [Google Scholar]
  • 19.Aronow E, Weiss KA, Reznikoff M. A practical guide to the thematic apperception test: the TAT in clinical practice. Routledge; 2013. [Google Scholar]
  • 20.Perry S, Cooper AM, Michels R. The Psychodynamic Formulation: Its Purpose, Structure, and Clinical Application. FOC. 2006;4:297–305. doi: 10.1176/foc.4.2.297. [DOI] [Google Scholar]
  • 21.Lemogne C, Delaveau P, Freton M, et al. Medial prefrontal cortex and the self in major depression. J Affect Disord. 2012;136:e1–11. doi: 10.1016/j.jad.2010.11.034. [DOI] [PubMed] [Google Scholar]
  • 22.Adler JM, Skalina LM, McAdams DP. The narrative reconstruction of psychotherapy and psychological health. Psychother Res. 2008;18:719–34. doi: 10.1080/10503300802326020. [DOI] [PubMed] [Google Scholar]
  • 23.Gong Q, He Y. Depression, neuroimaging and connectomics: a selective overview. Biol Psychiatry. 2015;77:223–35. doi: 10.1016/j.biopsych.2014.08.009. [DOI] [PubMed] [Google Scholar]
  • 24.Alpert M, Pouget ER, Silva RR. Reflections of depression in acoustic measures of the patient’s speech. J Affect Disord. 2001;66:59–69. doi: 10.1016/s0165-0327(00)00335-9. [DOI] [PubMed] [Google Scholar]
  • 25.Verma A, Jain P, Kumar T. An Effective Depression Diagnostic System Using Speech Signal Analysis Through Deep Learning Methods. Int J Artif Intell Tools. 2023;32:2340004. doi: 10.1142/S0218213023400043. [DOI] [Google Scholar]
  • 26.Zhang H, Wang H, Han S, et al. Detecting depression tendency with multimodal features. Comput Methods Programs Biomed. 2023;240:107702. doi: 10.1016/j.cmpb.2023.107702. [DOI] [PubMed] [Google Scholar]
  • 27.Zhou G, Hansen JHL, Kaiser JF. Nonlinear feature based classification of speech under stress. IEEE Trans Speech Audio Process. 2001;9:201–16. doi: 10.1109/89.905995. [DOI] [Google Scholar]
  • 28.Long H, Guo Z, Wu X, et al. Detecting depression in speech: comparison and combination between different speech types. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2017. pp. 1052–8. [Google Scholar]
  • 29.König A, Mina M, Schäfer S, et al. Predicting Depression Severity from Spontaneous Speech as Prompted by a Virtual Agent. Eur Psychiatr. 2023;66:S157–8. doi: 10.1192/j.eurpsy.2023.387. [DOI] [Google Scholar]


