Abstract
Background:
The diagnosis of posttraumatic stress disorder (PTSD) is usually based on clinical interviews or self-report measures. Both approaches are subject to under- and over-reporting of symptoms, and an objective test is lacking. We have developed a classifier of PTSD based on objective speech-marker features that discriminate PTSD cases from controls.
Methods:
Speech samples were obtained from warzone-exposed veterans, 52 cases with PTSD and 77 controls, assessed with the Clinician-Administered PTSD Scale. Individuals with major depressive disorder were excluded. Audio recordings of clinical interviews were used to obtain 40,526 speech features which were input to a random forest (RF) algorithm.
Results:
The selected RF used 18 speech features and the receiver operating characteristic curve had an AUC of 0.954. At a probability of PTSD cut point of 0.423, Youden’s index was .787, and the overall correct classification rate was 89.1%. The probability of PTSD was higher for markers that indicated slower, more monotonous speech, less change in tonality, and less activation. Depression symptoms, alcohol use disorder, and TBI did not meet statistical tests to be considered confounders.
Conclusions:
This study demonstrates that a speech-based algorithm can objectively differentiate PTSD cases from controls. The RF classifier had a high AUC. Further validation in an independent sample and appraisal of the classifier to identify those with MDD only compared to those with PTSD comorbid with MDD are required.
Keywords: Speech-based Assessment, Feature Extraction, Biomarkers, Post-Traumatic Stress Disorder, Veterans, Diagnostics, Military
Introduction
Posttraumatic stress disorder (PTSD) is frequently associated with functional impairment including relationship conflicts (Taft, Watkins, Stafford, Street, & Monson, 2011), reduced academic attainment (Bachrach & Read, 2012; Kessler, Foster, Saunders, & Stang, 1995), substance abuse (Mills, Teesson, Ross, & Peters, 2006; Pietrzak, Goldstein, Southwick, & Grant, 2011), unemployment (Sripada et al., 2016), and adverse health outcomes (Boscarino, 2008; O’Donovan, Slavich, Epel, & Neylan, 2013; Roberts et al., 2015; Zen, Whooley, Zhao, & Cohen, 2012). The ability to accurately screen for and diagnose PTSD, however, remains challenging (Shalev, Liberzon, & Marmar, 2017). There are numerous self-report screening tools (Sijbrandij et al., 2013) and several clinician-administered interview protocols (Blake et al., 1995; Foa & Tolin, 2000; Weathers et al., 2017). The gold standard for diagnosing PTSD is the Clinician-Administered PTSD Scale (CAPS) (Blake et al., 1995). The CAPS is a structured clinical interview for assessing frequency and severity of PTSD symptoms and related functioning impairments. The CAPS has been shown to have 79% overall agreement with a clinician’s diagnosis, with a sensitivity of .74 and a specificity of .84 (Hovens et al., 1994).
The assessment of PTSD with a structured interview is based in part on the subjective complaints of the patient and interpretations of the clinician. This process is subject to a number of biases that may distort the accuracy of the diagnosis, including cultural and racial biases (Snowden, 2003), distortions in memory (Donaldson, Corrigan, & Kohn, 2000; Ely, Graber, & Croskerry, 2011), or financial and social incentives (Hall & Hall, 2006). Additionally, because of stigma, patients vary in their willingness to candidly discuss traumatic experiences, symptoms and functioning. Moreover, the interview requires a lengthy visit to a clinician’s office, which some patients may be unwilling or unable to do. For these reasons, there is an imperative to develop objective measures for screening and diagnosing psychiatric disorders (Kapur, Phillips, & Insel, 2012; Singh & Rose, 2009), including PTSD (Lehrner & Yehuda, 2014; Shalev, Liberzon, & Marmar, 2017).
Multiple studies have been initiated to identify biological markers for PTSD including alterations in neural structures and circuit functioning, genomics, neurochemistry, immune functioning, and psychophysiology (Lehrner & Yehuda, 2014; Shalev, Liberzon, & Marmar, 2017; Zoladz & Diamond, 2013). Despite these advances, problems in accuracy, cost, and patient burden preclude routine use in clinical practice.
There has been growing interest in speech-based techniques to screen for psychiatric disorders (Bedi et al., 2014, 2015; Grünerbl et al., 2015; Karam et al., 2014; Muaremi, Gravenhorst, Grünerbl, Arnrich, & Tröster, 2014; Osmani et al., 2015; Vanello et al., 2012). Speech is an attractive candidate, as it can be measured at low-cost, remotely, non-invasively, and naturalistically. Clinicians have long observed that individuals suffering from psychiatric disorders display changes in speech (Newman & Mather, 1938) and routinely use impressions of voice quality as an element of mental status examination, including “pressured” speech in bipolar disorder or “monotone”, “lifeless”, and “metallic” speech in depression (Hall, Harrigan, & Rosenthal, 1995; Moses, 1954; Sobin & Sackeim, 1997). More recently, automated techniques to analyze speech have been able to classify mood disorders on a number of speech features. For example, combining prosodic, voice quality, spectral, and glottal features for automated speech classification has shown encouraging sensitivity and specificity (van der Broek, van der Sluis, & Dijkstra, 2010).
Less is known about speech alterations in PTSD. Van der Broek, van der Sluis, and Dijkstra (2010) asked individuals with PTSD to generate two affective narratives and found that 65 parameters of speech accounted for 69%−83% of the variance of stress symptoms. Scherer and colleagues found that in response to positive, negative, and neutral interview prompts, those with PTSD exhibited more tense voice features (Scherer, Stratou, Gratch, & Morency, 2013) and decreased vowel space (Scherer, Lucas, Gratch, Rizzo, & Morency, 2016). Recent work applying multi-view learning algorithms demonstrated that diagnostic classification of PTSD increased by 20–37% using two speech classifiers (Zhuang, Rozgić, Crystal, & Marx, 2014). Although promising, these findings are limited due to reliance on self-report measures rather than validated interviews to classify PTSD (Scherer, Lucas, Gratch, Rizzo, & Morency, 2016; Scherer, Stratou, Gratch, & Morency, 2013), samples with major depressive disorder (MDD) comorbidity (Scherer, Lucas, Gratch, Rizzo, & Morency, 2016; Scherer, Stratou, Gratch, & Morency, 2013), limited use of control groups (van der Broek, van der Sluis, & Dijkstra, 2010), and small samples (Scherer, Lucas, Gratch, Rizzo, & Morency, 2016; Scherer, Stratou, Gratch, & Morency, 2013; van der Broek, van der Sluis, & Dijkstra, 2010).
This is the first study to identify features of speech that differentiate PTSD cases from controls in an age- and gender-matched sample of veterans excluding current MDD.
Methods
Participants
Participants included 129 American warzone-exposed male Iraq and Afghanistan veterans who gave written informed consent. All procedures were approved by the Institutional Review Board of NYU Langone School of Medicine and conform to the US Federal Policy for the Protection of Human Subjects. Participants were assessed for PTSD with the Clinician-Administered PTSD Scale (CAPS-IV) by a clinical psychologist. Participants in the PTSD group met diagnostic criteria for PTSD based on DSM IV-TR criteria (Blake et al., 1990). Controls were age- and gender-matched warzone-exposed veterans who did not meet criteria for current or lifetime PTSD.
Participants were excluded from the study if they met any of the following criteria, with DSM 5 diagnoses assessed by the Structured Clinical Interview for DSM Diagnosis (SCID-5): severe drug use in the past 6 months, lifetime history of any psychiatric disorder with psychotic features, bipolar I or II disorder, current major depressive disorder (MDD), depression due to a general medical condition (GMC), current exposure to recurrent trauma or exposure to a traumatic event within the past month, prominent suicidal ideation, homicidal ideation, suicide attempt in the past three months, history of open-head injury, illness affecting central nervous system (CNS) functioning, cardiovascular disease, major medical illness, or starting psychotropic medications in the past month.
Procedure
Speech Feature Extraction
The audio of each CAPS interview was recorded in two channels, using separate microphones for the interviewer and participant. A rich set of speech features was extracted from the participant’s recording using the following steps:
Audio quality control: This step was manual, targeting the selection of only good audio samples (clear, audible speech in the signal) to avoid noise in the feature extraction process. During this step the participant’s audio channel was also manually marked.
Audio segmentation: This step identified the participant’s speech regions, excluding the interviewer, who was often audible in the participant microphone channel. Very short participant segments (e.g., “yes/no”) were also removed, resulting in 1–120 minutes of clean speech per participant (mean = 35 minutes/speaker). This step could also be done manually, but due to cost and time constraints we applied SRI’s automatic Voice Activity Detector (VAD) (open source alternatives can also be used, e.g., https://chromium.googlesource.com/external/webrtc/+/master/common_audio/vad/). VAD was run on the clinician and participant audio channels independently to mark the locations of speech for both speakers. Only the participant channel speech regions with a higher VAD score than the corresponding clinician channel segments were retained, avoiding segments that included the interviewer’s voice. In the following, we refer to these automatically identified participant speech regions, separated by long pauses or speech from the interviewer, as speech “spurts”.
Extraction of frame-level features: A frame is a short sliding window of speech, typically 5–25 milliseconds, depending on feature type. The frame-level features included: spectral (i.e., Mel-Frequency Cepstral Coefficients (MFCCs)), linear predictive coding (LPC), noise-robust spectral (i.e., DOCC, RASTA), prosodic (chroma features, pitch, voicing, correlation), time-based (zero crossing, RMS energy, L1-norm), spectro-temporal (LTSV, MFCC, and RASTA derivatives), articulatory, temporal, and machine-learning-based (autoencoders learned from prior speaker databases). The features were extracted using SRI’s speech feature extraction tools, but there are also open source alternatives that can be used for the same feature types (e.g., https://www.audeering.com/opensmile/).
Computation of spurt-level features: These were computed based on frame-level features for every spurt. They included: (a) statistics: mean, variance, kurtosis, skewness, variation from mean, percentiles, range, and slope; (b) locational information: absolute and relative distances from the beginning of the spurt for the occurrence of important feature values (min, max, 5% of max, 50% of max, 95% of max); and (c) durational information: distances between the occurrences of important feature values, e.g., the distance between reaching 5% and 50% of the feature max value within the spurt.
Computation of speaker-level features: The final feature vector was extracted by taking statistics of the spurt-level features for each speaker: mean, variance, kurtosis, skewness, variation from mean, various percentiles, interquartile range (IQR), and slope.
These features aim to capture the nuances, variability, and behavior, both short-term and long-term, of a rich set of low-level speech features over the entire session focusing only on the patient speech segments of the conversation. A total of 40,526 features were computed at the speaker level and were used for the feature selection and model building.
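The segmentation rule and the frame-to-spurt-to-speaker aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the SRI tooling: the function names, the `speech_threshold` and `min_frames` parameters, the per-frame VAD score format, and the reduced set of statistics are all assumptions made for the example, and `first_reach` assumes a non-negative feature track.

```python
from statistics import fmean


def participant_spurts(part_scores, clin_scores, speech_threshold=0.5, min_frames=3):
    """Return (start, end) frame ranges where the participant channel wins.

    A frame is kept when the participant VAD score marks speech and exceeds
    the clinician score; runs shorter than min_frames are dropped.
    All thresholds here are hypothetical.
    """
    n = min(len(part_scores), len(clin_scores))
    spurts, start = [], None
    for i in range(n):
        keep = part_scores[i] >= speech_threshold and part_scores[i] > clin_scores[i]
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            if i - start >= min_frames:
                spurts.append((start, i))
            start = None
    if start is not None and n - start >= min_frames:
        spurts.append((start, n))
    return spurts


def moments(values):
    """Mean, population variance, skewness, and excess kurtosis."""
    n = len(values)
    mu = fmean(values)
    dev = [v - mu for v in values]
    var = sum(d * d for d in dev) / n
    if var == 0:
        return {"mean": mu, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    skew = sum(d ** 3 for d in dev) / (n * var ** 1.5)
    kurt = sum(d ** 4 for d in dev) / (n * var ** 2) - 3.0
    return {"mean": mu, "variance": var, "skewness": skew, "kurtosis": kurt}


def first_reach(track, fraction):
    """Index of the first frame at or above fraction * max (non-negative track assumed)."""
    threshold = fraction * max(track)
    return next(i for i, v in enumerate(track) if v >= threshold)


def spurt_features(track):
    """Statistical, locational, and durational descriptors for one spurt."""
    span = max(len(track) - 1, 1)
    feats = moments(track)
    feats["range"] = max(track) - min(track)
    feats["rel_pos_max"] = track.index(max(track)) / span
    feats["dur_5_50"] = (first_reach(track, 0.50) - first_reach(track, 0.05)) / span
    return feats


def speaker_features(tracks):
    """Summarize each spurt-level feature across all spurts of one speaker."""
    per_spurt = [spurt_features(t) for t in tracks]
    out = {}
    for name in per_spurt[0]:
        column = [f[name] for f in per_spurt]
        for stat, value in moments(column).items():
            out[f"{name}__{stat}"] = value
        out[f"{name}__min"] = min(column)
        out[f"{name}__max"] = max(column)
    return out


# Toy data: per-frame VAD scores and three made-up frame-level feature tracks.
spurts = participant_spurts(
    [0.9, 0.9, 0.9, 0.2, 0.8, 0.9, 0.9, 0.9, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.95, 0.1, 0.1, 0.1, 0.1],
)
toy_tracks = [[0.0, 1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [0.0, 4.0, 0.0, 4.0, 0.0, 4.0]]
features = speaker_features(toy_tracks)
print(spurts, len(features))
```

Each frame-level track fans out into several spurt-level descriptors, and each of those into several speaker-level statistics, which is how a modest set of low-level features can grow into tens of thousands of speaker-level features.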
Statistical Analysis
Comparing the two groups on demographic variables
Categorical variables were compared using Chi-squared tests (or Fisher’s exact tests when at least one cell count was five or fewer), and continuous variables were compared using Wilcoxon rank sum tests.
The Random Forest probabilistic classifier
A Random Forest (RF) algorithm was used to build a classifier function using speech markers to predict PTSD. It is an algorithm (Breiman, Friedman, Olshen, & Stone, 1984; Malley, Kruppa, Dasgupta, Malley, & Ziegler, 2012) based on multiple classification and regression trees (CART) (Strobl, Malley, & Tutz, 2009) yielding a probability estimate of membership in a target prediction class based on marker values. CART grows a decision tree whose hierarchical nodes are each based on a cut-point split of a predictor found by an exhaustive search to minimize misclassification error. The process continues recursively until a tree is grown with nodes that contain members from only one group. This tree is pruned to a set of nodes for which little is gained from further splits in improving misclassification error. An estimate of the probability of membership in the target group for an individual in a terminal node is given by the fraction of individuals in that node who belong to the target group. RF makes use of an ensemble of CART decision trees for prediction, which acts to decrease the variance of the predictions and the inherent potential for over-fitting of a single decision tree. Bootstrap samples of subjects are used to grow a random forest of trees. Data on the “out-of-bag” (OOB) subjects in each sample, consisting of approximately one third of the full sample whose data were not used to grow the particular tree, are used to obtain predictions of target class membership. Features of the OOB subjects are scored and the estimate of the probability of being in the target class is the fraction of the target class in the terminal node into which they fall. The average of these estimates over the trees grown is the RF estimate of the probability. These are then used to generate a receiver operating characteristic (ROC) curve and its area under the curve (AUC).
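The "approximately one third" out-of-bag share follows from sampling with replacement: a given subject is missed by all n draws with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368. The sketch below checks this for the study's sample size of 129, both exactly and by simulation. The function names, the number of simulated trees, and the seed are illustrative assumptions.

```python
import random


def oob_probability(n_subjects):
    """Exact chance that one subject is left out of a bootstrap sample of size n."""
    return (1.0 - 1.0 / n_subjects) ** n_subjects


def simulated_oob_fraction(n_subjects, n_trees, seed=0):
    """Average share of subjects not drawn, over n_trees bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # One bootstrap sample: n draws with replacement; keep the distinct subjects.
        drawn = {rng.randrange(n_subjects) for _ in range(n_subjects)}
        total += (n_subjects - len(drawn)) / n_subjects
    return total / n_trees


exact = oob_probability(129)
simulated = simulated_oob_fraction(129, n_trees=2000)
print(f"exact={exact:.3f} simulated={simulated:.3f}")
```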
The importance of a predictor is assessed by randomly permuting its value in the OOB sample and comparing the differences in predictive performance (AUCs) between the non-permuted and permuted samples. The AUCs are averaged across the entire forest and ranked on the decrease in AUCs (Breiman, Friedman, Olshen, & Stone, 1984). “Shaving” is a method for reducing the number of predictors based on variable importance. The variable of least importance is shaved off first and a new RF is obtained. The procedure is repeated until all variables have been shaved. The shaved RF with a parsimonious mix of a small number of features and a large AUC is chosen.
In this study, based on 20,000 bootstrap samples, an RF based on the 40,526 voice markers was grown, and shaving began with the 500 variables that had the highest importance rankings.
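The shaving procedure can be sketched as a simple backward-elimination loop. In the sketch below, `importance_fn` and `auc_fn` are hypothetical placeholders for refitting a forest on the current subset and reading off permutation importances and the OOB AUC; the toy stand-ins and the `tolerance` rule for picking a parsimonious subset are assumptions for illustration, since the text does not state a formal selection rule.

```python
def shave(features, importance_fn, auc_fn):
    """Drop the least important feature one at a time, recording each step.

    importance_fn(subset) -> {feature: importance}; auc_fn(subset) -> float.
    Both are placeholders for refitting a forest on the current subset.
    """
    current = list(features)
    path = []
    while current:
        path.append((tuple(current), auc_fn(current)))
        importances = importance_fn(current)
        weakest = min(current, key=lambda name: importances[name])
        current.remove(weakest)
    return path


def pick_parsimonious(path, tolerance=0.005):
    """Smallest subset whose AUC is within tolerance of the best AUC on the path."""
    best = max(auc for _, auc in path)
    eligible = [step for step in path if step[1] >= best - tolerance]
    return min(eligible, key=lambda step: len(step[0]))


# Toy stand-ins: fixed importances, and an AUC that depends only on "a" and "b".
TOY_IMPORTANCE = {"a": 0.30, "b": 0.20, "c": 0.02, "d": 0.01}


def toy_importance(subset):
    return {name: TOY_IMPORTANCE[name] for name in subset}


def toy_auc(subset):
    return 0.5 + 0.25 * ("a" in subset) + 0.2 * ("b" in subset)


path = shave(["a", "b", "c", "d"], toy_importance, toy_auc)
chosen, chosen_auc = pick_parsimonious(path)
print(chosen, chosen_auc)
```

In the toy run, the uninformative features "c" and "d" are shaved without loss of AUC, and the AUC drops only once "b" is removed, so the two-feature subset is selected.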
Testing for confounding
We tested for the possibility that the findings of the relationship between voice markers and PTSD are confounded by the presence of comorbidities of TBI, AUD, and symptoms of depression. Participants who met criteria for PTSD co-morbid with MDD had been excluded. Residual symptoms of depression were measured by the Beck Depression Inventory-II (BDI-II). A variable was considered to be a confounder if two null hypotheses related to the prediction of PTSD were rejected (Pearl, 2009). The null hypotheses to be rejected are (1) that the potential confounder is not associated with the predictive voice markers and (2) that the probability of being a PTSD case is not different when including the confounder in the model from predicting with the voice markers alone. The confounder hypothesis tests were run separately for TBI, AUD, the individual symptoms of depression, and total BDI-II score. For tests of the first confounder hypothesis, Chi square tests were run on contingency tables of confounder by voice marker. For tests of the second hypothesis, estimates of the probability of PTSD obtained from the final RF with and without the inclusion of the candidate confounder were obtained and compared using a Wilcoxon rank sum test. If the latter test was statistically significant, we also required that the difference in AUCs be greater than 0.05.
Results
Demographics
The PTSD cases and controls did not differ significantly by age, ethnicity, educational attainment, number of warzone deployments or current cannabis, cocaine, hallucinogen, opioid, or stimulant use (Table 1). The PTSD group had significantly higher total BDI scores, TBI exposure levels, and current rates of alcohol use disorder.
Table 1.
Variable | PTSD+ (N=52) | PTSD- (N=77) |
---|---|---|
 | N (%) or Mean (SD) | N (%) or Mean (SD) |
Age (years) | 31.92 (5.97) | 32.47 (7.22) |
Number of Deployments | 1.73 (1.01) | 1.79 (1.09) |
Race | ||
Asian | 2 (3.9%) | 5 (6.5%) |
Black / African American | 9 (17.3%) | 6 (7.8%) |
White / Caucasian | 29 (55.8%) | 46 (59.7%) |
Hispanic / Latino | 11 (21.2%) | 14 (18.2%) |
Other | 1 (1.9%) | 6 (7.8%) |
Education | ||
Up to 12th grade | 1 (1.9%) | 0 (0.0%) |
High school / GED | 18 (34.6%) | 17 (22.1%) |
2 years college / Associate’s Degree | 14 (26.9%) | 14 (18.2%) |
4 years college/ Bachelor’s degree | 12 (23.1%) | 32 (41.6%) |
Master’s Degree | 7 (13.5%) | 14 (18.2%) |
TBI Exposure | ||
Yes | 19 (36.5%) | 5 (6.5%) * |
Current Alcohol Use | ||
Yes | 14 (26.9%) | 5 (6.5%) * |
Current Cannabis Use | ||
Yes | 2 (3.9%) | 0 (0%) |
Current Cocaine Use | ||
Yes | 0 (0%) | 0 (0%) |
Current Hallucinogen Use | ||
Yes | 1 (1.9%) | 0 (0%) |
Current Stimulant Use | ||
Yes | 0 (0%) | 0 (0%) |
Current Opioid Use | ||
Yes | 0 (0%) | 0 (0%) |
BDI total score † | 12.54 (8.65) | 3.59 (4.15) * |
* Significant at p<.05.
† BDI (Beck Depression Inventory) total score is the sum of all 21 BDI items.
Properties of the Random Forest
The final shaved RF selected was based on 18 voice markers with an AUC = .954. At a PTSD probability cut point of 0.423, Youden’s index, defined as the sum of sensitivity + specificity −1, was .904+.883 −1 = .787 with an overall correct classification rate of 89.1%.
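The reported figures can be cross-checked with a few lines of arithmetic. The counts of 47 correctly classified cases and 68 correctly classified controls are not stated in the text; they are back-calculated assumptions that reproduce the reported sensitivity (.904), specificity (.883), and overall correct classification rate (89.1%) for 52 cases and 77 controls.

```python
def classification_summary(true_pos, n_cases, true_neg, n_controls):
    """Sensitivity, specificity, Youden's index, and overall accuracy."""
    sensitivity = true_pos / n_cases
    specificity = true_neg / n_controls
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_j": sensitivity + specificity - 1.0,
        "accuracy": (true_pos + true_neg) / (n_cases + n_controls),
    }


# 47 and 68 are assumed counts inferred from the reported rates, not source data.
summary = classification_summary(true_pos=47, n_cases=52, true_neg=68, n_controls=77)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```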
Voice marker features in the Random Forest
Table 2 lists the 18 features used to build the selected model. Among individuals with PTSD, feature 3 reflects speech segments containing articulators that move more slowly than in controls or contain long extended vowels, including hesitations (e.g. “uh….”). In addition, features 1, 2, 4, 5, 11, 15, and 17 contain speech features that were more monotonous in PTSD cases than in controls. Additionally, features 6, 9, 10, 12, and 16 revealed that individuals with PTSD were more likely to generate flat speech. Moreover, features 8 and 13 contained speech features indicating less speech activation among cases. Table 3 displays means, standard deviations, and medians of each feature and results of a Wilcoxon test comparing the distributions of speech markers between groups. All but feature 18 significantly differed between cases and controls.
Table 2:
Feature # | Quality of Speech | Description of Feature Computation |
---|---|---|
1 | More monotonous speech (less varying tonality) | For each spurt we computed the relative time distance between the occurrence of the low (5%) and median (50%) values for a specific spectral feature (1st chroma FFT coefficient), representing variability in certain speech frequencies. Then we extracted the lowest value across the speaker spurts. |
2 | Monotonous speech segments | For each spurt we computed the relative time distance between the maximum and the minimum values for a specific spectral feature (2nd LMFCC coefficient), representing variability in certain speech frequencies. Then we extracted the lowest value across the speaker spurts. |
3 | Occurrences of slow speech production | For each speech spurt we estimated the average time it took for the tongue to move from the minimum to the maximum point. Then we extracted the highest value (slowest changing spurt) across each speaker’s speech. |
4 | More monotonous speech (less varying tonality) | For each spurt we computed the relative time distance between the occurrence of the maximum and the median value for a tonal frequency, representing the tonal variability on a certain frequency. Then we found the average value across all spurts. |
5 | Less bursty (more monotonous) voice | For each spurt we computed the kurtosis value for a specific spectral feature (3rd Chroma filter) detecting existence of anomalies/outliers in the distribution of certain speech frequencies. Then we extracted the skewness of this value across the speaker spurts. This measured whether there were outliers (bursts) in speech tonality or whether it was mostly within expected ranges during the session. |
6 | Flatter speech | For each spurt we computed the normalized variance of a specific spectral feature (11th LMFCC coefficient). Then we computed the kurtosis (consistency) across each speaker’s spurts. |
7 | Less animated speech | For each spurt we computed the skewness for a specific spectral feature (1st LMFCC coefficient), which represented the symmetry of the distribution (indicating whether there were outlier values). Then we extracted the lowest value across the speaker spurts. It examined the least varying spurt, which may have contained single vowel sounds. |
8 | Speech segments with very low activation | For each spurt we computed the relative time distance between the occurrence of the minimum and the maximum value for a specific spectral feature (11th LMFCC coefficient), representing variability in certain speech frequencies. Then we extracted the lowest value across the speaker spurts. |
9 | Flatter speech in terms of energy variation | For each spurt we computed the highest tonal energy for a certain frequency range (chroma FFT 9th coefficient). Then we computed the variability of that energy across all spurts. |
10 | Flat tone in speech | For each spurt we computed the highest tonal energy for a certain frequency range (chroma FFT 9th coefficient). As for feature 9, but we took the 95th percentile across all spurts. |
11 | More monotonous speech | For each spurt we computed a tonal frequency (0th chroma FFT coefficient) with the highest value across all spurts. |
12 | Flatter speech in terms of energy variation | Similar to feature 6, but examined the 10th LMFCC coefficient. |
13 | Less activated speech | For each spurt we computed the skewness of a specific spectrogram frequency range and then we found the minimum across all spurts. It measured the flatness of the spectrogram for that frequency range. |
14 | Slow speech production or long hesitations | For each spurt we estimated the kurtosis of the position of an articulator. Then we extracted the highest value (flattest spurt) across the speaker spurts. Similar to feature 3, it was examining the most consistently articulated spurt |
15 | More monotonous speech (less varying tonality) | For each spurt we computed the range of values for a specific spectral feature (4th Chroma filter) and then computed the deviation of the range across all spurts. This indicated how variable the range of tonality was across spurts. |
16 | Flatter speech in terms of energy variation | For each spurt we computed the relative time distance between the occurrence of the low (5%) and median (50%) value for specific spectral frequencies (low frequency range), representing how fast the energy changed within that range. Then we extracted the skewness across the speaker spurts, which showed whether that energy changed in a consistent manner across all spurts |
17 | More monotonous speech in energy | For each spurt we computed the 5% value for a certain spectral feature (19th Rasta coeff.) and then computed the deviation of this value across spurts. It captured energy variability in a certain frequency range. |
18 | Description of speech quality could not be made | For each spurt we estimated the kurtosis of the position of an articulator. Then we extracted the highest value (flattest spurt) across the speaker spurts. Similar to feature 1, it was examining the most consistently articulated spurt |
Table 3:
Variable | PTSD− Mean | PTSD− Median | PTSD− Std Dev | PTSD+ Mean | PTSD+ Median | PTSD+ Std Dev | Wilcoxon Test (* = p<.05) |
---|---|---|---|---|---|---|---|
var1 | −0.964 | −0.973 | 0.030 | −0.982 | −0.987 | 0.024 | * |
var2 | −0.937 | −0.942 | 0.035 | −0.965 | −0.970 | 0.022 | * |
var3 | 0.936 | 0.945 | 0.053 | 0.967 | 0.972 | 0.021 | * |
var4 | −0.065 | −0.081 | 0.084 | −0.039 | −0.059 | 0.095 | * |
var5 | 409.400 | 240.187 | 557.587 | 1026.070 | 630.206 | 1154.760 | * |
var6 | 2.682 | 2.139 | 2.498 | 2.862 | 2.744 | 0.757 | * |
var7 | −0.959 | −0.967 | 0.038 | −0.980 | −0.983 | 0.014 | * |
var8 | 0.279 | 0.269 | 0.048 | 0.249 | 0.250 | 0.039 | * |
var9 | −1.430 | −1.364 | 0.336 | −1.763 | −1.657 | 0.632 | * |
var10 | 0.004 | 0.003 | 0.002 | 0.003 | 0.003 | 0.001 | * |
var11 | −1.810 | −1.883 | 0.463 | −2.316 | −2.271 | 1.120 | * |
var12 | 0.929 | 0.945 | 0.074 | 0.940 | 0.970 | 0.196 | * |
var13 | 355.616 | 208.173 | 364.526 | 897.390 | 605.526 | 769.039 | * |
var14 | 12.838 | 12.131 | 4.279 | 17.006 | 15.581 | 5.937 | * |
var15 | 0.00024 | 0.00018 | 0.00017 | 0.00018 | 0.00015 | 0.00019 | * |
var16 | 0.035 | 0.164 | 1.617 | −0.131 | −0.208 | 1.883 | * |
var17 | 0.0040 | 0.0037 | 0.0015 | 0.0032 | 0.0031 | 0.0010 | * |
var18 | 0.170 | 0.169 | 0.012 | 0.171 | 0.169 | 0.006 | |
Confounders
BDI symptoms, TBI, and AUD failed to meet statistical criteria required to confirm that they are confounders. For the tests of independence of potential confounders with each marker, TBI and AUD were correlated only with feature 12. BDI symptoms were each individually correlated with at most two markers. For the second statistical test, Table 4 shows the results of comparisons of the probabilities of PTSD and the AUCs of the RFs determined with and without inclusion of the confounders for BDI total score, TBI, and AUD as predictors. While estimates of the probability of PTSD differed between the two models, AUCs were not improved by including either TBI or alcohol use. In addition, individual BDI symptom scores did not significantly increase AUCs and the BDI total score improved AUC by only 1%. Appendix I contains results of the complete confounder analyses including tests for each BDI symptom.
Table 4: RF (Random forest) results:
Variable | Wilcoxon test * =p<.05 | AUC of Model with 18 markers | AUC of Model with 18 markers plus confounder |
---|---|---|---|
TBI | * | 0.954 | 0.954 |
Alcohol_use | * | 0.954 | 0.954 |
BDI_total_score | * | 0.946 | 0.957 |
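The final criterion from the Methods (a significant Wilcoxon test must also be accompanied by an AUC difference greater than 0.05) can be applied directly to the AUC values in Table 4. The sketch below covers only this last step of the confounder rule; the dictionary layout and variable names are illustrative assumptions.

```python
MIN_AUC_GAIN = 0.05  # threshold stated in the Methods

# (AUC with 18 markers, AUC with 18 markers plus confounder), copied from Table 4.
table4 = {
    "TBI": (0.954, 0.954),
    "Alcohol_use": (0.954, 0.954),
    "BDI_total_score": (0.946, 0.957),
}

# AUC gain from adding each candidate confounder, rounded to table precision.
gains = {
    name: round(with_confounder - markers_only, 3)
    for name, (markers_only, with_confounder) in table4.items()
}
flagged = [name for name, gain in gains.items() if gain > MIN_AUC_GAIN]
print(gains, flagged)
```

None of the three candidates clears the 0.05 threshold; the largest gain, for BDI total score, is about 0.011.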
Discussion
This study demonstrated that by using speech-based techniques, male Iraq and Afghanistan veterans with PTSD could be distinguished from warzone-exposed veterans without PTSD, neither of whom had MDD. The findings suggest that by combining frame-level short-duration and longer-duration prosodic features, high accuracy, sensitivity, and specificity for classifying PTSD can be achieved. The classifier assigns higher probabilities of PTSD to those with features indicating speech that is slower and more monotonous, with less change in tonality and less activation.
Although the biological mechanisms underlying the link between these speech features and PTSD were not examined in this study, previous work has documented that changes to the autonomic nervous system cause disturbances in similar speech-based features, such as muscle tension (Scherer, 1986) and respiratory rate (Kreibig, 2010). Additionally, the neurotransmitter gamma-amino butyric acid (GABA), which has been linked with a vulnerability to both depression (Croarkin, Levinson, & Daskalakis, 2011) and suicidality (Poulter et al., 2008), has also been associated with changes in muscle tonality (Croarkin, Levinson, & Daskalakis, 2011). Importantly, GABA has been identified as a promising neural marker of vulnerability and resilience to PTSD, as well as a therapeutic target (Faye, McGowan, Denny, & David, 2018; Kelmendi et al., 2016).
Furthermore, changes in muscle tension alter vocal tract dynamics and constrain articulatory movement. A speaker’s level of depression has been shown to affect prosodic and source features relevant to articulation (Moore II, Clements, Peifer, & Weisser, 2008; Mundt, Vogel, Feltner, & Lenderking, 2012; Quatieri & Malyska, 2012; Scherer et al., 2013; Trevino, Quatieri, & Malyska, 2011) and has been correlated with reduced tonal range (Breznitz, 1992; Darby, Simmons, & Berger, 1984; Flint, Black, Campbell-Taylor, Gailey, & Levinton, 1993). Findings from the current study indicate that a reduction in tonal range may also be associated with PTSD, even among individuals who do not meet criteria for MDD. The reduction in tonality observed in PTSD is consistent with previous studies showing decreased formant frequencies in depressed individuals (Mundt, Snyder, Cannizzaro, Chappie, & Geralts, 2007). These findings may reflect that tonality (e.g. F0 frequency) is influenced by factors such as current mood (Ellgring & Scherer, 1996), level of agitation and anxiety (Alpert, Pouget, & Silva, 2001; Tolkmitt, Helfrich, Standke, & Scherer, 1982), and personality traits (Yang, Fairbairn, & Cohn, 2013). Although the exact processes contributing to the observed reduction in tonality in this study were not tested, future experimental studies would allow for a more specific understanding.
Taken together, these data offer strong preliminary evidence that speech features can serve as an objective probability classifier for PTSD. Compared to the more extensive literature linking speech-based features and mood disorders (Cummins et al., 2015), there has been a paucity of research examining speech in PTSD. The few published studies relied on small sample sizes, assessed PTSD with self-report measures, and had high levels of co-morbid MDD, making it difficult to determine whether those features were associated with PTSD or related psychopathologies.
This is the first study to use a structured clinical interview, the CAPS 5, both for classifying cases and controls and for the collection of speech segments for vocal analysis. The ability to use data collected naturalistically suggests that clinicians may be able to employ speech-based analyses to aid in the diagnostic process from information routinely collected by clinicians. On the other hand, the CAPS interview may have been more stressful for those with PTSD, compared to controls. It is unclear whether these differences are only found under conditions of stress or if they would be found in speech segments generated from less affectively charged content.
There were a number of limitations in the study. While we have conducted extensive internal cross validation, classifier endorsement requires a newly recruited external validation sample. We are confident that TBI and AUD did not confound voice marker findings in this study because there are a substantial number of subjects with these disorders in the sample, yielding sufficient power for the confounder analyses. Nevertheless, larger sample sizes in future studies would increase confidence in these findings.
Previous work suggested that similar alterations in speech are associated with affective dysregulation (Breiman, 2001). For example, “monotony” and “dullness” have long been associated with a depressed or sad voice. Kraepelin described the speech quality of depressed patients as “low voice, slowly, hesitatingly, monotonously, sometimes stuttering, whispering” (Kraepelin, 1921). The question of whether the panel predicts depression rather than PTSD must be considered. To minimize this possibility, participants with MDD were excluded from both groups. Further, for symptomatology not meeting criteria for MDD, we tested the BDI symptoms and did not find them to be confounders. Clarification of the value of the classifier in clinical settings requires studies of persons with the diagnosis of MDD without PTSD and those with co-morbid MDD and PTSD.
Despite these limitations, we believe that our panel of voice markers represents a rich, multidimensional set of features that, with further validation, holds promise for developing an objective, low-cost, non-invasive, and, given the ubiquity of smartphones, widely accessible tool for assessing PTSD in veteran, military, and civilian contexts.
Supplementary Material
Acknowledgements
This research was supported by the U.S. Army Medical Research & Acquisition Activity (USAMRAA), the Telemedicine & Advanced Technology Research Center (TATRC), W81XWH-11-C-0004, and grants from the Steven A. and Alexandra M. Cohen Foundation, Inc. and Cohen Veterans Bioscience Inc. (CVB) to NYU Langone School of Medicine.
Footnotes
Dr. Marmar receives support from the NIAAA, Department of Defense, Steven and Alexandra Cohen Veterans Center, Cohen Veterans Bioscience, Cohen Veterans Network, the Steven & Alexandra Cohen Foundation, Robin Hood Foundation, McCormick Foundation, Home Depot Foundation, and the City of New York. Dr. Vergyri and Dr. Knoth have a patent pending, “Systems for Speech-Based Assessment of a Patient’s State-of-Mind.” All other authors have no financial relationships with commercial interests to declare.
References
- Alpert M, Pouget ER, & Silva RR (2001). Reflections of depression in acoustic measures of the patient’s speech. Journal of Affective Disorders, 66(1), 59–69. 10.1016/S0165-0327(00)00335-9
- Bachrach RL, & Read JP (2012). The role of posttraumatic stress and problem alcohol involvement in university academic performance. Journal of Clinical Psychology, 68(7), 843–859. 10.1002/jclp.21874
- Bedi G, Carrillo F, Cecchi GA, Slezak DF, Sigman M, Mota NB, … & Corcoran CM (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. Npj Schizophrenia, 1, 15030. 10.1038/npjschz.2015.30
- Bedi G, Cecchi GA, Slezak DF, Carrillo F, Sigman M, & De Wit H (2014). A window into the intoxicated mind? Speech as an index of psychoactive drug effects. Neuropsychopharmacology, 39(10), 2340–2348. 10.1038/npp.2014.80
- Blake DD, Weathers FW, Nagy LM, Kaloupek DG, Gusman FD, Charney DS, & Keane TM (1995). The development of a clinician-administered PTSD scale. Journal of Traumatic Stress, 8(1), 75–90. 10.1007/BF02105408
- Blake DD, Weathers F, Nagy LM, Kaloupek DG, Klauminzer G, Charney DS, & Keane TM (1990). A clinician rating scale for assessing current and lifetime PTSD: The CAPS-1. The Behavior Therapist, 13, 187–188.
- Boscarino JA (2008). A prospective study of PTSD and early-age heart disease mortality among Vietnam veterans: Implications for surveillance and prevention. Psychosomatic Medicine, 70(6), 668–676. 10.1097/PSY.0b013e31817bccaf
- Breiman L (2001). Random forests. Machine Learning, 45(1), 5–32. 10.1023/A:1010933404324
- Breiman L, Friedman JH, Olshen RA, & Stone CJ (1984). Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Monterey, California: Wadsworth and Brooks.
- Breznitz Z (1992). Verbal indicators of depression. The Journal of General Psychology, 119(4), 351–363. 10.1080/00221309.1992.9921178
- Croarkin PE, Levinson AJ, & Daskalakis ZJ (2011). Evidence for GABAergic inhibitory deficits in major depressive disorder. Neuroscience & Biobehavioral Reviews, 35(3), 818–825. 10.1016/j.neubiorev.2010.10.002
- Cummins N, Scherer S, Krajewski J, Schnieder S, Epps J, & Quatieri TF (2015). A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71, 10–49. 10.1016/j.specom.2015.03.004
- Darby JK, Simmons N, & Berger PA (1984). Speech and voice parameters of depression: A pilot study. Journal of Communication Disorders, 17(2), 75–85. 10.1016/0021-9924(84)90013-3
- Donaldson MS, Corrigan JM, & Kohn LT (Eds.). (2000). To err is human: Building a safer health system (Vol. 6). National Academies Press.
- Ellgring H, & Scherer KR (1996). Vocal indicators of mood change in depression. Journal of Nonverbal Behavior, 20(2), 83–110. 10.1007/BF02253071
- Ely JW, Graber ML, & Croskerry P (2011). Checklists to reduce diagnostic errors. Academic Medicine, 86(3), 307–313. 10.1097/ACM.0b013e31820824cd
- Faye C, McGowan JC, Denny CA, & David DJ (2018). Neurobiological mechanisms of stress resilience and implications for the aged population. Current Neuropharmacology, 16(3), 234–270. 10.2174/1570159X15666170818095105
- Flint AJ, Black SE, Campbell-Taylor I, Gailey GF, & Levinton C (1993). Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression. Journal of Psychiatric Research, 27(3), 309–319. 10.1016/0022-3956(93)90041-Y
- Foa EB, & Tolin DF (2000). Comparison of the PTSD symptom scale–interview version and the clinician‐administered PTSD scale. Journal of Traumatic Stress: Official Publication of the International Society for Traumatic Stress Studies, 13(2), 181–191. 10.1023/A:1007781909213
- Grünerbl A, Muaremi A, Osmani V, Bahle G, Oehler S, Tröster G, … & Lukowicz P (2015). Smartphone-based recognition of states and state changes in bipolar disorder patients. IEEE Journal of Biomedical and Health Informatics, 19(1), 140–148. 10.1109/JBHI.2014.2343154
- Hall RC, & Hall RC (2006). Malingering of PTSD: Forensic and diagnostic considerations, characteristics of malingerers and clinical presentations. General Hospital Psychiatry, 28(6), 525–535. 10.1016/j.genhosppsych.2006.08.011
- Hall JA, Harrigan JA, & Rosenthal R (1995). Nonverbal behavior in clinician—patient interaction. Applied and Preventive Psychology, 4(1), 21–37. 10.1016/S0962-1849(05)80049-6
- Hovens JE, Van der Ploeg HM, Klaarenbeek MTA, Bramsen I, Schreuder JN, & Rivero VV (1994). The assessment of posttraumatic stress disorder: With the Clinician Administered PTSD Scale: Dutch results. Journal of Clinical Psychology, 50(3), 325–340.
- Kapur S, Phillips AG, & Insel TR (2012). Why has it taken so long for biological psychiatry to develop clinical tests and what to do about it? Molecular Psychiatry, 17(12), 1174–1179. 10.1038/mp.2012.105
- Karam ZN, Provost EM, Singh S, Montgomery J, Archer C, Harrington G, & Mcinnis MG (2014). Ecologically valid long-term mood monitoring of individuals with bipolar disorder using speech. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference, 4858–4862. 10.1109/ICASSP.2014.6854525
- Kelmendi B, Adams TG, Yarnell S, Southwick S, Abdallah CG, & Krystal JH (2016). PTSD: From neurobiology to pharmacological treatments. European Journal of Psychotraumatology, 7(1), 31858. 10.3402/ejpt.v7.31858
- Kessler RC, Foster CL, Saunders WB, & Stang PE (1995). Social consequences of psychiatric disorders, I: Educational attainment. American Journal of Psychiatry, 152(7), 1026–1032. 10.1176/ajp.152.7.1026
- Kraepelin E (1921). Manic depressive insanity and paranoia. The Journal of Nervous and Mental Disease, 53(4), 350.
- Kreibig SD (2010). Autonomic nervous system activity in emotion: A review. Biological Psychology, 84(3), 394–421. 10.1016/j.biopsycho.2010.03.010
- Lehrner A, & Yehuda R (2014). Biomarkers of PTSD: Military applications and considerations. European Journal of Psychotraumatology, 5. 10.3402/ejpt.v5.23797
- Malley JD, Kruppa J, Dasgupta A, Malley KG, & Ziegler A (2012). Probability machines. Methods of Information in Medicine, 51(1), 74–81. 10.3414/ME00-01-0052
- Mills KL, Teesson M, Ross J, & Peters L (2006). Trauma, PTSD, and substance use disorders: Findings from the Australian National Survey of Mental Health and Well-Being. American Journal of Psychiatry, 163(4), 652–658. 10.1176/ajp.2006.163.4.652
- Moore II E, Clements MA, Peifer JW, & Weisser L (2008). Critical analysis of the impact of glottal features in the classification of clinical depression in speech. IEEE Transactions on Biomedical Engineering, 55(1), 96–107. 10.1109/TBME.2007.900562
- Moses PJ (1954). The Voice of Neurosis. New York: Grune & Stratton.
- Muaremi A, Gravenhorst F, Grünerbl A, Arnrich B, & Tröster G (2014). Assessing bipolar episodes using speech cues derived from phone calls. International Symposium on Pervasive Computing Paradigms for Mental Health, 103–114. 10.1007/978-3-319-11564-1_11
- Mundt JC, Snyder PJ, Cannizzaro MS, Chappie K, & Geralts DS (2007). Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology. Journal of Neurolinguistics, 20(1), 50–64. 10.1016/j.jneuroling.2006.04.001
- Mundt JC, Vogel AP, Feltner DE, & Lenderking WR (2012). Vocal acoustic biomarkers of depression severity and treatment response. Biological Psychiatry, 72(7), 580–587. 10.1016/j.biopsych.2012.03.015
- Newman S, & Mather VG (1938). Analysis of spoken language of patients with affective disorders. American Journal of Psychiatry, 94(4), 913–942. 10.1176/ajp.94.4.913
- O’donovan A, Slavich GM, Epel ES, & Neylan TC (2013). Exaggerated neurobiological sensitivity to threat as a mechanism linking anxiety with increased risk for diseases of aging. Neuroscience & Biobehavioral Reviews, 37(1), 96–108. 10.1016/j.neubiorev.2012.10.013
- Osmani V, Gruenerbl A, Bahle G, Haring C, Lukowicz P, & Mayora O (2015). Smartphones in mental health: Detecting depressive and manic episodes. IEEE Pervasive Computing, 14(3), 10–13. 10.1109/MPRV.2015.54
- Pearl J (2009). Causality: Models, Reasoning, and Inference (2nd ed). Cambridge, UK: Cambridge University Press.
- Pietrzak RH, Goldstein RB, Southwick SM, & Grant BF (2011). Prevalence and Axis I comorbidity of full and partial posttraumatic stress disorder in the United States: Results from Wave 2 of the National Epidemiologic Survey on Alcohol and Related Conditions. Journal of Anxiety Disorders, 25(3), 456–465. 10.1016/j.janxdis.2010.11.010
- Poulter MO, Du L, Weaver IC, Palkovits M, Faludi G, Merali Z, … & Anisman H (2008). GABAA receptor promoter hypermethylation in suicide brain: Implications for the involvement of epigenetic processes. Biological Psychiatry, 64(8), 645–652. 10.1016/j.biopsych.2008.05.028
- Quatieri TF, & Malyska N (2012). Vocal-source biomarkers for depression: A link to psychomotor activity. Interspeech-2012, 1059–1062.
- Roberts AL, Agnew-Blais JC, Spiegelman D, Kubzansky LD, Mason SM, Galea S, … & Koenen KC (2015). Posttraumatic stress disorder and incidence of type 2 diabetes mellitus in a sample of women: A 22-year longitudinal study. JAMA Psychiatry, 72(3), 203–210. 10.1001/jamapsychiatry.2014.2632
- Scherer KR (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99(2), 143. 10.1037/0033-2909.99.2.143
- Scherer S, Lucas GM, Gratch J, Rizzo AS, & Morency LP (2016). Self-reported symptoms of depression and PTSD are associated with reduced vowel space in screening interviews. IEEE Transactions on Affective Computing, 7(1), 59–73. 10.1109/TAFFC.2015.2440264
- Scherer S, Stratou G, Gratch J, & Morency LP (2013). Investigating voice quality as a speaker-independent indicator of depression and PTSD. Interspeech, 847–851.
- Scherer S, Stratou G, Mahmoud M, Boberg J, Gratch J, Rizzo A, & Morency LP (2013). Automatic behavior descriptors for psychological disorder analysis. 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 1–8. 10.1109/FG.2013.6553789
- Shalev A, Liberzon I, & Marmar C (2017). Post-traumatic stress disorder. New England Journal of Medicine, 376(25), 2459–2469. 10.1056/NEJMra1612499
- Sijbrandij M, Reitsma JB, Roberts NP, Engelhard IM, Olff M, Sonneveld LP, & Bisson JI (2013). Self-report screening instruments for post-traumatic stress disorder (PTSD) in survivors of traumatic experiences (protocol). Cochrane Database of Systematic Reviews, 2013(6), 1–15. 10.1002/14651858.CD010575
- Singh I, & Rose N (2009). Biomarkers in psychiatry. Nature, 460(7252), 202–207. 10.1038/460202a
- Snowden LR (2003). Bias in mental health assessment and intervention: Theory and evidence. American Journal of Public Health, 93(2), 239–243. 10.2105/AJPH.93.2.239
- Sobin C, & Sackeim HA (1997). Psychomotor symptoms of depression. American Journal of Psychiatry, 154(1), 4–17. 10.1176/ajp.154.1.4
- Sripada RK, Henry J, Yosef M, Levine DS, Bohnert KM, Miller E, & Zivin K (2016). Occupational functioning and employment services use among VA primary care patients with posttraumatic stress disorder. Psychological Trauma: Theory, Research, Practice, and Policy, 10(2), 140–143. 10.1037/tra0000241
- Strobl C, Malley J, & Tutz G (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348. 10.1037/a0016973
- Taft CT, Watkins LE, Stafford J, Street AE, & Monson CM (2011). Posttraumatic stress disorder and intimate relationship problems: A meta-analysis. Journal of Consulting and Clinical Psychology, 79(1), 22–33. 10.1037/a0022196
- Tolkmitt F, Helfrich H, Standke R, & Scherer KR (1982). Vocal indicators of psychiatric treatment effects in depressives and schizophrenics. Journal of Communication Disorders, 15(3), 209–222. 10.1016/0021-9924(82)90034-X
- Trevino AC, Quatieri TF, & Malyska N (2011). Phonologically-based biomarkers for major depressive disorder. EURASIP Journal on Advances in Signal Processing, 2011(1), 42. 10.1186/1687-6180-2011-42
- van den Broek EL, van der Sluis F, & Dijkstra T (2010). Telling the story and re-living the past: How speech analysis can reveal emotions in post-traumatic stress disorder (PTSD) patients. Sensing Emotions, 153–180. 10.1007/978-90-481-3258-4_10
- Vanello N, Guidi A, Gentili C, Werner S, Bertschy G, Valenza G, … & Scilingo EP (2012). Speech analysis for mood state characterization in bipolar patients. Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, 2104–2107. 10.1109/EMBC.2012.6346375
- Weathers FW, Bovin MJ, Lee DJ, Sloan DM, Schnurr PP, Kaloupek DG, … & Marx BP (2017). The Clinician-Administered PTSD Scale for DSM–5 (CAPS-5): Development and initial psychometric evaluation in military veterans. Psychological Assessment, 30(3), 383–395. 10.1037/pas0000486
- Yang Y, Fairbairn C, & Cohn JF (2013). Detecting depression severity from vocal prosody. IEEE Transactions on Affective Computing, 4(2), 142–150. 10.1109/T-AFFC.2012.38
- Zen AL, Whooley MA, Zhao S, & Cohen BE (2012). Post-traumatic stress disorder is associated with poor health behaviors: Findings from the heart and soul study. Health Psychology, 31(2), 194–201. 10.1037/a0025989
- Zhuang X, Rozgić V, Crystal M, & Marx BP (2014). Improving speech-based PTSD detection via multi-view learning. Spoken Language Technology Workshop (SLT), 2014 IEEE, 260–265. 10.1109/SLT.2014.7078584
- Zoladz PR, & Diamond DM (2013). Current status on behavioral and biological markers of PTSD: A search for clarity in a conflicting literature. Neuroscience & Biobehavioral Reviews, 37(5), 860–895. 10.1016/j.neubiorev.2013.03.024