Abstract
In this paper, we report experiments on the Interspeech 2013 Autism Sub-Challenge, which comprises two subtasks: detecting children with ASD and classifying them into four subtypes. We apply our recently developed algorithm for extracting speech features, which overcomes certain weaknesses of other currently available algorithms [1, 2]. From the input speech signal, we estimate the parameters of a harmonic model of voiced speech for each frame, including the fundamental frequency (f0). From the fundamental frequencies and the reconstructed noise-free signal, we compute derived features such as the harmonic-to-noise ratio (HNR), shimmer, and jitter. In previous work, we found that these features detect voiced segments and speech more accurately than other algorithms and that they are useful in rating the severity of a subject's Parkinson's disease [3]. Here, we employ these features along with standard energy, cepstral, and spectral features. With these features, we detect ASD using a regression and identify the subtype using a classifier. We find that our features improve performance, measured in terms of unweighted average recall (UAR), over the baseline results by 2.8% for detecting autism spectrum disorder and by 2.3% for classifying the disorder into four categories.
Keywords: speech analysis, autism spectrum disorder
1. Introduction
Autism spectrum disorder (ASD) covers a range of developmental disabilities that can cause significant social, communication, and behavioral challenges. Children with ASD are often absorbed in a private world and have difficulty communicating and interacting with others. While not every child with ASD has a language problem, the majority have difficulty using language effectively, especially when conversing with others. They often exhibit unusual pitch and intonation, for example, monotonous pitch, reduced stress, odd rhythm, and a large pitch range [4], and even differences in the harmonic structure of their speech [5]. There has been continual interest in characterizing these variations in ASD and potentially exploiting them to objectively quantify and categorize the language impairments in ASD.
The range of disorders in ASD can be categorized according to the Diagnostic and Statistical Manual of Mental Disorders (DSM), published by the American Psychiatric Association. Most clinicians in the US follow the fourth edition (DSM-IV) [6]. The diagnostic category pervasive developmental disorders (PDD) refers to disorders characterized by delays in the development of multiple basic functions, including socialization and communication. This category includes Asperger and Rett syndromes. Pervasive developmental disorder not otherwise specified (PDD-NOS) is one of the five ASDs, characterized as "severe and pervasive impairment in the development of reciprocal social interaction or verbal and nonverbal communication skills, or when stereotyped behavior, interests, and activities are present, but the criteria are not met for a specific PDD" or for several other disorders. Unrelated to the above conditions, a child could have limited ability to socialize and communicate, not because of general developmental disorders, but due to specific language impairments such as dysphasia. In all these cases, prosody and intonation are compromised, perhaps in different ways, and characterizing these differences is currently a topic of considerable research interest, especially for developing useful intervention strategies.
In this paper, we report our experiments on the Autism Sub-Challenge of Interspeech 2013. The challenge consists of two tasks: 1) a binary 'Typicality' classification task with two classes, TYPically developing (TYP) and ATYpically developing (ATY), and 2) a four-way 'Diagnosis' task for classifying children into four categories: TYP, PDD, PDD-NOS, and specific language impairment such as DYSphasia (DYS). The paper is organized as follows. We describe the harmonic model of voiced speech and our feature extraction algorithms in Section 2. The corpus is described in Section 3, and our experiments and results are reported in Section 4. Finally, we conclude with a summary of our key results.
2. Speech Analysis Using Harmonic Model
The popular source-channel model of voiced speech treats the glottal pulses as a source of periodic waveforms that is modified by the shape of the mouth, which is assumed to act as a linear channel. Thus, the resulting speech is rich in harmonics of the glottal pulse period.
2.1. Harmonic Model
The harmonic model is a special case of a sinusoidal model where all the sinusoidal components are assumed to be harmonically related, that is, the frequencies of the sinusoids are multiples of the fundamental frequency. This model is tailored to capture the rich harmonic nature of voiced segments in speech.
Stylianou introduced a Harmonic plus Noise Model (HNM) for speech analysis and synthesis, in which speech signals are represented as a time-varying harmonic component plus a modulated noise component [7]. The harmonic part accounts for the periodic component of the speech signal while the noise part accounts for its non-periodic components. Speech decomposition using a HNM is useful for applications in speech synthesis, voice conversion, speech enhancement, and speech coding.
Let y = [y(t1), y(t2), … , y(tN)]T denote the speech samples in a voiced frame, measured at times t1, t2, … , tN. The samples can be represented with a harmonic model plus additive noise n = [n(t1), n(t2), … , n(tN)]T as follows:
$y(t) = a_0 + \sum_{h=1}^{H}\left[a_h\cos(2\pi h f_0 t) + b_h\sin(2\pi h f_0 t)\right]\psi(t) + n(t)$    (1)
where H denotes the number of harmonics and 2πf0 is the fundamental angular frequency. The harmonic signal can be factored into the coefficients α and β of the basis functions and the harmonic components, which are determined solely by the given angular frequency 2πf0 and the choice of the basis function ψ(t).
$y(t) = \begin{bmatrix}1 & A_c(t) & A_s(t)\end{bmatrix}\begin{bmatrix}a_0 \\ \boldsymbol{\alpha} \\ \boldsymbol{\beta}\end{bmatrix} + n(t), \quad A_c(t) = \psi(t)\left[\cos(2\pi f_0 t)\,\cdots\,\cos(2\pi H f_0 t)\right], \quad A_s(t) = \psi(t)\left[\sin(2\pi f_0 t)\,\cdots\,\sin(2\pi H f_0 t)\right]$    (2)

where $\boldsymbol{\alpha} = [a_1,\ldots,a_H]^T$ and $\boldsymbol{\beta} = [b_1,\ldots,b_H]^T$.
Stacking the rows [1 Ac(t) As(t)] at t = t1, · · · , tN into a matrix A, Equation (2) can be compactly represented in matrix notation as:
$\mathbf{y} = A\,\mathbf{m} + \mathbf{n}$    (3)
where $\hat{\mathbf{y}} = A\,\mathbf{m}$ corresponds to an expansion of the harmonic part of the voiced frame in terms of windowed sinusoidal components, and $\mathbf{m} = [a_0\;\boldsymbol{\alpha}^T\;\boldsymbol{\beta}^T]^T$ is the set of unknown parameters.
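As a concrete illustration, the sketch below builds the design matrix A of Equation (3) for one frame. The function name, the use of NumPy, and the choice of a Hanning analysis window for ψ(t) are our own assumptions, not details from the original system.

```python
import numpy as np

def harmonic_design_matrix(t, f0, H, psi):
    """Design matrix A of Eq. (3): columns [1, Ac(t), As(t)] evaluated at the
    sample times t (seconds) for H harmonics of f0 (Hz), windowed by psi(t)."""
    w = psi(t)
    cols = [np.ones_like(t)]
    cols += [w * np.cos(2 * np.pi * h * f0 * t) for h in range(1, H + 1)]
    cols += [w * np.sin(2 * np.pi * h * f0 * t) for h in range(1, H + 1)]
    return np.column_stack(cols)          # shape (N, 2H + 1)

# Example: a 25 ms frame at 16 kHz with a Hanning window as psi(t)
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
A = harmonic_design_matrix(t, f0=200.0, H=10, psi=lambda x: np.hanning(len(x)))
```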
2.2. Parameter Estimation
Assuming the noise samples n are independent and identically distributed zero-mean Gaussian random variables, the likelihood function of the observed vector y given the model parameters can be formulated as in the following equation. The parameters m can then be estimated by the maximum likelihood (ML) approach.
$p(\mathbf{y}\mid\mathbf{m}, f_0, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\left\|\mathbf{y} - A\mathbf{m}\right\|^2\right), \qquad \hat{\mathbf{m}} = (A^{T}A)^{-1}A^{T}\mathbf{y}$    (4)
Under the harmonic model, the reconstructed signal is given by $\hat{\mathbf{y}} = A\,\hat{\mathbf{m}}$. So far, we have assumed that the pitch f0 is given. In practice, however, the pitch needs to be estimated. It can be computed by maximizing the energy of the reconstructed signal over a pre-determined grid of discrete f0 values ranging from f0 min to f0 max.
$\hat{f}_0 = \operatorname*{argmax}_{f_0 \in \{f_{0\min},\ldots,f_{0\max}\}} p\!\left(\mathbf{y}\mid\hat{\mathbf{m}}(f_0), f_0, \sigma^2\right) = \operatorname*{argmax}_{f_0} \left\|A(f_0)\,\hat{\mathbf{m}}(f_0)\right\|^2$    (5)
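A minimal sketch of the ML fit and the grid search of Equation (5) might look as follows; the helper names, the Hanning window for ψ(t), the number of harmonics, and the grid spacing are illustrative assumptions.

```python
import numpy as np

def fit_harmonic_frame(y, t, f0, H=10):
    """Least-squares fit (the ML estimate under i.i.d. Gaussian noise) of the
    harmonic model for a fixed f0; returns m_hat and the reconstruction A m_hat."""
    w = np.hanning(len(t))                                   # psi(t), assumed Hanning
    cols = [np.ones_like(t)]
    cols += [w * np.cos(2 * np.pi * h * f0 * t) for h in range(1, H + 1)]
    cols += [w * np.sin(2 * np.pi * h * f0 * t) for h in range(1, H + 1)]
    A = np.column_stack(cols)
    m_hat, *_ = np.linalg.lstsq(A, y, rcond=None)            # (A^T A)^{-1} A^T y
    return m_hat, A @ m_hat

def estimate_f0(y, t, f0_grid, H=10):
    """Eq. (5): pick the grid f0 whose reconstruction has maximal energy."""
    energies = [np.sum(fit_harmonic_frame(y, t, f0, H)[1] ** 2) for f0 in f0_grid]
    return float(f0_grid[int(np.argmax(energies))])

# Example usage on one voiced frame (y is a length-N NumPy array):
# f0_hat = estimate_f0(y, t, np.arange(60.0, 400.0, 1.0))
```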
2.3. Segmental Pitch Tracking
Pitch variations are inherently limited by the motion of the articulators during speech production, and hence the pitch cannot vary arbitrarily between adjacent frames. This smoothness constraint can be enforced using a first-order Markov dependency between the pitch estimates of successive frames. Adopting the popular hidden Markov model (HMM) framework, the estimation of pitch over an utterance can be formulated as follows. Let Y = {y0, … , yM} and F = {f0(0), … , f0(M)} denote the sequences of observed frames and candidate pitch estimates, respectively.
The observation probabilities are assumed to be independent given the hidden states, i.e., the candidate pitch frequencies. A zero-mean Gaussian distribution defined over the pitch difference between two successive frames is a reasonable approximation of the first-order Markov transition probabilities [8], $p\!\left(f_0^{(k)} \mid f_0^{(k-1)}\right) \propto \exp\!\left(-\frac{(f_0^{(k)} - f_0^{(k-1)})^2}{2\sigma_f^2}\right)$. Putting all this together and substituting the likelihood from Equation (5), the pitch over an utterance can be estimated as follows.
$\hat{F} = \operatorname*{argmax}_{f_0^{(1)},\ldots,f_0^{(M)}} \prod_{k=1}^{M} p\!\left(\mathbf{y}_k \mid f_0^{(k)}\right)\, p\!\left(f_0^{(k)} \mid f_0^{(k-1)}\right)$    (6)
Thus, the estimation of pitch over an utterance can be cast as an HMM decoding problem and solved efficiently using the Viterbi algorithm.
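A sketch of this decoding step is given below; the per-frame log-likelihood matrix, the transition standard deviation, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def viterbi_pitch(log_lik, f0_grid, sigma_f=20.0):
    """Viterbi decoding of the pitch track in Eq. (6).
    log_lik[k, j] is the per-frame log-likelihood of candidate f0_grid[j]
    (e.g. the log reconstruction energy from Eq. (5)); transitions are
    zero-mean Gaussian in the f0 difference between successive frames."""
    M, K = log_lik.shape
    diff = f0_grid[None, :] - f0_grid[:, None]
    log_trans = -0.5 * (diff / sigma_f) ** 2        # unnormalized Gaussian log-prob
    delta = log_lik[0].copy()                       # best score ending in each state
    back = np.zeros((M, K), dtype=int)
    for k in range(1, M):
        scores = delta[:, None] + log_trans         # scores[i, j]: from state i to j
        back[k] = np.argmax(scores, axis=0)
        delta = scores[back[k], np.arange(K)] + log_lik[k]
    path = np.empty(M, dtype=int)
    path[-1] = int(np.argmax(delta))
    for k in range(M - 1, 0, -1):                   # backtrack the best state sequence
        path[k - 1] = back[k, path[k]]
    return f0_grid[path]
```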
2.4. Jitter and shimmer
Jitter and shimmer refer to short-term (cycle-to-cycle) perturbations in the f0 and the amplitude of the voice waveform, respectively. Perturbation analysis is based on the fact that small fluctuations in the frequency and amplitude of the waveform reflect the inherent noise of the voice. These measures can be sensitive to noise. We alleviate this problem by estimating jitter and shimmer from the signal reconstructed using the estimated parameters of the harmonic model [3].
2.4.1. Shimmer
In order to compute shimmer, we first represent the speech waveform using a harmonic model with time-varying amplitudes (HM-VA), as shown in Equation (7) [9].
$y(t) = \sum_{h=1}^{H}\left[a_h(t)\cos(2\pi h f_0 t) + b_h(t)\sin(2\pi h f_0 t)\right] + n(t)$    (7)
Note that this differs from the harmonic model in Equation (1). Unlike the previous model, whose harmonic coefficients are fixed within a frame, the time-varying model, as the name implies, allows the coefficients ah(t) and bh(t) to vary over time. Thus, this model is capable of capturing sample-to-sample variation in harmonic amplitude within a frame. Given the limitations of the articulators, it is reasonable to assume that the sample-to-sample variation is smooth. This can be represented as a superposition of a small number of basis functions ψi, as in Equation (8) [9].
$a_h(t) = \sum_{i=1}^{I} a_{h,i}\,\psi_i(t), \qquad b_h(t) = \sum_{i=1}^{I} b_{h,i}\,\psi_i(t)$    (8)
We represent this smoothness constraint within a frame using four (I = 4) Hanning windows as basis functions. For a frame of length M, the windows are centered at 0, M/3, 2M/3, and M. Each basis function is 2M/3 samples long and has an overlap of M/3 with the immediately adjacent window. The parameters of this model can be estimated, once again, with a linear model similar to Equation (3), but this time A and m have four times the original dimensions. Given the fundamental frequency from Equation (6), we compute ah(t) and bh(t) using the maximum likelihood framework.
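The sketch below shows one plausible construction of these four Hanning basis functions; the truncation of the windows at the frame boundaries and the function name are our own assumptions.

```python
import numpy as np

def hanning_basis(M, I=4):
    """The I Hanning basis functions psi_i(t) for the HM-VA: centred at
    0, M/3, 2M/3 and M, each 2M/3 samples long, truncated at the frame edges."""
    L = 2 * M // 3
    centres = np.linspace(0, M, I).astype(int)
    win = np.hanning(L)
    basis = np.zeros((I, M))
    for i, c in enumerate(centres):
        for n in range(L):
            idx = c - L // 2 + n
            if 0 <= idx < M:            # keep only the part inside the frame
                basis[i, idx] = win[n]
    return basis

# The time-varying design matrix then has columns psi_i(t)*cos(2*pi*h*f0*t) and
# psi_i(t)*sin(2*pi*h*f0*t), i.e. four times as many columns as in Eq. (3).
```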
Shimmer can be considered as a function f(t) that scales the amplitudes of all the harmonics in the time-varying model.
$c_h(t) = f(t)\,\bar{c}_h, \qquad h = 1,\ldots,H$    (9)
where $\bar{c}_h = \sqrt{a_h^2 + b_h^2}$ denotes the amplitude of the harmonic components in the harmonic model with constant amplitudes and $c_h(t) = \sqrt{a_h(t)^2 + b_h(t)^2}$ is its counterpart from the time-varying model. Once again, assuming uncorrelated noise, f(t) can be estimated using the maximum likelihood criterion.
$\hat{f}(t) = \frac{\sum_{h=1}^{H}\bar{c}_h\,c_h(t)}{\sum_{h=1}^{H}\bar{c}_h^{\,2}}$    (10)
The larger the tremor in the voice, the larger the variation in f(t). Hence, we use the standard deviation of f(t) as a summary statistic to quantify shimmer.
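Under the uncorrelated-noise assumption, the ML estimate of f(t) in Equation (10) reduces to a per-sample least-squares ratio across harmonics. A minimal sketch, assuming the constant and time-varying harmonic amplitudes are available as NumPy arrays:

```python
import numpy as np

def shimmer_from_amplitudes(const_amp, tv_amp):
    """const_amp: (H,) amplitudes from the fixed-coefficient model;
    tv_amp: (H, M) per-sample amplitudes from the time-varying model.
    The least-squares estimate of the common scaling f(t) (Eq. (10))
    is summarized by its standard deviation."""
    f_t = const_amp @ tv_amp / np.sum(const_amp ** 2)   # shape (M,)
    return float(np.std(f_t))
```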
2.4.2. Jitter
Given an estimated pitch period for the frame, we first create a matched filter by excising a segment one pitch period long from the center of the signal reconstructed with the harmonic model. This matched filter is then convolved with the reconstructed signal, and the distances between the resulting maxima define the pitch periods within the frame. The perturbation in the period is normalized with respect to the given pitch period, and its standard deviation is an estimate of jitter.
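A rough sketch of this procedure is shown below, assuming the frame's harmonic-model reconstruction, the sampling rate, and the pitch estimate are available. The simple peak-picking rule, the admissible-period tolerance, and the use of correlation (equivalent to convolution with the time-reversed template) are our own illustrative choices.

```python
import numpy as np

def jitter_from_reconstruction(recon, fs, f0):
    """Jitter from the harmonic-model reconstruction of one voiced frame."""
    T = int(round(fs / f0))                             # pitch period in samples
    centre = len(recon) // 2
    template = recon[centre - T // 2: centre - T // 2 + T]   # one-period matched filter
    corr = np.correlate(recon, template, mode='valid')
    # local maxima of the correlation mark successive pitch cycles
    peaks = [i for i in range(1, len(corr) - 1)
             if corr[i] >= corr[i - 1] and corr[i] >= corr[i + 1]]
    periods = np.diff(peaks)
    periods = periods[(periods > 0.5 * T) & (periods < 1.5 * T)]  # discard spurious peaks
    if len(periods) < 2:
        return 0.0
    return float(np.std(periods / T))                   # normalized by the given period
```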
2.5. Harmonic-to-noise ratio (HNR)
Researchers have used HNR in acoustic studies for the evaluation of voice disorders. Treating the reconstructed signal as the harmonic component of the vocal tract output, the noise part is obtained by subtracting the reconstructed signal from the original speech signal. The noise part encompasses everything in the signal that is not described by the harmonic components, including frication noise, waveform fluctuations, etc. The HNR and the ratio of energy in the first and second harmonics (H12) can be computed from the HM-VA as follows:
$\mathrm{HNR} = 10\log_{10}\frac{\sum_{t}\hat{y}(t)^{2}}{\sum_{t}\left(y(t)-\hat{y}(t)\right)^{2}}, \qquad \mathrm{H12} = 10\log_{10}\frac{\sum_{t}c_{1}(t)^{2}}{\sum_{t}c_{2}(t)^{2}}$    (11)
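A minimal sketch of these two measures, assuming the original frame, its harmonic reconstruction, and the per-sample HM-VA amplitudes are available as NumPy arrays:

```python
import numpy as np

def hnr_db(y, y_hat):
    """Harmonic-to-noise ratio in dB: energy of the harmonic reconstruction
    over the energy of the residual (original minus reconstruction)."""
    noise = y - y_hat
    return 10.0 * np.log10(np.sum(y_hat ** 2) / np.sum(noise ** 2))

def h12_db(tv_amp):
    """Ratio of the energy in the first and second harmonics in dB,
    from the HM-VA amplitudes tv_amp of shape (H, M)."""
    return 10.0 * np.log10(np.sum(tv_amp[0] ** 2) / np.sum(tv_amp[1] ** 2))
```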
3. Corpus
The empirical evaluations reported in this paper were performed on the "Child Pathological Speech Database" (CPSD) [1], collected from 99 children, aged 9 to 18, at two hospitals in Paris, France. The dataset provides 2542 short speech utterances collected to assess children's ability to imitate different types of prosodic contours. Reflecting the prosodic characteristics of the French language, the sentences carry four intonation types: descending, falling, floating, and rising.
Subjects were asked to read 26 phonetically easy sentences, each recorded in a separate file. As a clinical reference, the severity of each subject's condition was assessed by clinicians using the DSM-IV criteria [6]; 35 of these children showed PDD, either autism spectrum condition (ASC, 12 children), specific language impairment (SLI, 13 children), or PDD not otherwise specified (PDD-NOS, 10 children). The corpus includes rich annotation such as speaker meta-data, orthographic transcripts, phonemic transcripts, and segmentation. The corpus treats sentences read by the same speaker as independent samples, partitioned randomly into training, development, and test sets.
4. Experiments
4.1. Features
As in most speech processing systems, we extract 25-millisecond frames using a Hanning window at a rate of 100 frames per second before computing the frame-level features. The voicing-related features, including pitch, HNR, the ratio of energy in the first to the second harmonic (H12), jitter, and shimmer, are derived from the harmonic analysis described above over the voiced frames. The features computed at the frame level need to be summarized into a global feature vector of fixed dimension for each read sentence. Each feature was summarized across all frames from the voiced segments in terms of standard distribution statistics such as mean, median, variance, minimum, and maximum. We also computed the covariance matrix (upper triangular elements) of the frame-level feature vectors over voiced segments to capture interactions between features. The resulting per-sentence voice quality feature vector was then augmented with the per-sentence energy, spectral, and cepstral features provided by the baseline. For more detail regarding the baseline frame-level features and the functionals applied to them, we refer the reader to the challenge paper [10].
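As an illustration of this summarization step, the sketch below computes the five distribution statistics and the upper-triangular covariance entries for one sentence; the function name and array layout are assumptions.

```python
import numpy as np

def summarize_sentence(frames):
    """frames: (num_voiced_frames, num_features) matrix of frame-level features
    for one sentence. Returns the per-feature mean, median, variance, minimum
    and maximum, followed by the upper-triangular covariance entries, as a
    single fixed-length per-sentence vector."""
    stats = np.concatenate([
        frames.mean(axis=0),
        np.median(frames, axis=0),
        frames.var(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ])
    cov = np.cov(frames, rowvar=False)
    upper = cov[np.triu_indices(cov.shape[0])]
    return np.concatenate([stats, upper])
```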
4.2. Regression and classification models
In clinical applications, the class distributions are typically highly unbalanced, as they are across the four subtypes within this corpus. The challenge evaluation metric, unweighted average recall, attempts to normalize the influence of the highly skewed classes. We employed a support vector classifier and support vector regression, respectively, to detect ASD cases and to identify the subtypes. Both the regression and the classifier were learned from the data using the open-source WEKA toolkit [11]. For training them, we retained the hyperparameter from the baseline system, C = 0.001. For the test set, all labeled data from the training and development sets were pooled and a new model was learned using the parameters reported in the baseline. Since the class distribution in the training data was skewed, we upsampled instances in the atypical categories (PDD, PDD-NOS, and DYS) by a factor of five. We refer the reader to the baseline challenge paper for more detail [10]. Table 1 reports the UAR obtained with the baseline feature vectors and the proposed feature vectors for detecting ASD and classifying the subtypes. From the results, it is clear that our voice quality features (derived from harmonic analysis) significantly improve UAR in both tasks.
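The actual models were trained with WEKA; purely as an illustrative substitute, the sketch below shows the upsampling-by-five step and a linear SVM with the baseline complexity C using scikit-learn, which is not the toolkit used in this work.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_with_upsampling(X, y, minority=('PDD', 'PDD-NOS', 'DYS'), factor=5, C=0.001):
    """Replicate the rows of the minority (atypical) classes factor times,
    then fit a linear SVM with the baseline complexity C.
    X: (num_sentences, num_features); y: array of class labels."""
    idx = [np.flatnonzero(y == lab) for lab in minority]
    idx = [i for i in idx if len(i)]
    extra = np.concatenate([np.tile(i, factor - 1) for i in idx]) if idx else np.array([], dtype=int)
    X_up = np.vstack([X, X[extra]])
    y_up = np.concatenate([y, y[extra]])
    return LinearSVC(C=C).fit(X_up, y_up)
```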
Table 1: Unweighted average recall (UAR, %) for ASD detection and subtype classification using the baseline and the proposed features.

| Speech Features | ASD vs. TD | 4 subtypes of ASD |
| --- | --- | --- |
| Baseline | 90.7 | 67.1 |
| Improved Features | 93.58 | 69.42 |
5. Conclusions
In summary, we considered several speech measures to detect children with ASD and to classify them into four subtypes. For both tasks, our features can be categorized into four groups – voice quality features (estimated from harmonic analysis), energy-related features, spectral features, and cepstral features. We found that our features, specifically the voice quality features, improve the performance of both tasks in terms of unweighted average recall (UAR).
6. Acknowledgements
This research was supported in part by NIH Award AG033723, NSF Awards 1027834, 0964102 and 0905095 and Google Faculty Award. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NIH or NSF.
7. References
- [1] Ringeval Fabien, Demouy Julie, Szaszak Gyorgy, Chetouani Mohamed, Robel Laurence, Xavier Jean, Cohen David, and Plaza Monique, "Automatic intonation recognition for the prosodic assessment of language-impaired children," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1328–1342, 2011.
- [2] Asgari Meysam, Shafran Izhak, and Bayestehtashk Alireza, "Robust detection of voiced segments in samples of everyday conversations using unsupervised HMMs," in IEEE Spoken Language Technology Workshop (SLT), 2012, pp. 438–442.
- [3] Asgari Meysam and Shafran Izhak, "Extracting cues from speech for predicting severity of Parkinson's disease," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2010, pp. 462–467.
- [4] Hubbard Kathleen and Trauner Doris A, "Intonation and emotion in autistic spectrum disorders," Journal of Psycholinguistic Research, vol. 36, no. 2, pp. 159–173, 2007.
- [5] Bonneh Yoram S, Levanon Yoram, Dean-Pardo Omrit, Lossos Lan, and Adini Yael, "Abnormal speech spectrum and increased pitch variability in young autistic children," Frontiers in Human Neuroscience, vol. 4, 2010.
- [6] Guze Samuel B, "Diagnostic and statistical manual of mental disorders, 4th ed. (DSM-IV)," American Journal of Psychiatry, vol. 152, no. 8, pp. 1228–1228, 1995.
- [7] Stylianou Y, "Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification," Ph.D. dissertation, École Nationale des Télécommunications, 1996.
- [8] Tabrikian J, Dubnov S, and Dickalov Y, "Maximum a-posteriori probability pitch tracking in noisy environments using harmonic model," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 1, pp. 76–87, 2004.
- [9] Godsill S and Davy M, "Bayesian harmonic models for musical pitch estimation and analysis," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2, pp. 1769–1772.
- [10] Schuller Björn, Steidl Stefan, Batliner Anton, Vinciarelli Alessandro, Scherer Klaus, Ringeval Fabien, Chetouani Mohamed, Weninger Felix, Eyben Florian, Marchi Erik, et al., "The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Proc. Interspeech, 2013.
- [11] Hall Mark, Frank Eibe, Holmes Geoffrey, Pfahringer Bernhard, Reutemann Peter, and Witten Ian H, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.