Abstract
Purpose:
Acoustic analysis is a commonly used method for quantitatively measuring vocal fold function. The accuracy of acoustic analysis depends upon the operator selecting a stable segment of the voice sample to analyze. This paper proposes a novel method to more accurately and reliably select a stable voice segment.
Study Design:
Four selection methods were implemented to evaluate each raw audio signal and determine its most stable segment: the proposed modal periodogram method, the moving window method, the mid-vowel method, and the whole vowel method. Three acoustic parameters of interest, namely perturbation (jitter), correlation dimension (D2), and spectrum convergence ratio (SCR), were calculated for forty-eight phonation samples to evaluate each method.
Methods:
The proposed modal periodogram method utilizes a minimum mean-square error (MMSE) based approach to calculate a stable modal periodogram and obtain the most stable segment. The Wilcoxon Signed-Rank test was used to compare jitter, D2, and SCR values acquired using the modal periodogram method against the current standard segment selection methods.
Results:
The modal periodogram method yielded significantly lower D2 values, and a significantly higher SCR for both normal and disordered voice samples (p < 0.01). This indicates that the modal periodogram method is more apt for selecting a stable audio segment than the other selection methods.
Keywords: Modal Presence Probability, Minimum Mean Square Error, Modal Periodogram, Voice Segment Selection
Introduction
Acoustic analysis is an objective method for quantitatively assessing the characteristics of audio signals, and it has been a focus of voice research in recent years [3]. It has been used to characterize normal and pathological voices, to further explore phonation mechanisms, and to provide objective parameters for evaluating the efficacy of voice treatments [1, 2]. When implementing acoustic analysis methods, selecting a stable phonation segment is an essential step in generating accurate and reliable results. The selected segment ultimately determines whether the analysis indicates healthy or pathological phonation and how severe the disorder in the voice appears, which makes the determination of a stable signal segment a crucial aspect of acoustic analysis [3].
There are currently four methods that are widely used for selecting stable audio segments for acoustic analysis: subjective selection, the whole vowel method, the middle vowel method, and the moving window method. While each of these methods has shown utility in acoustic analysis of voice samples, each also has limitations or inconsistencies that hinder its efficacy. The subjective selection method requires the evaluator to visually determine the most stable portion of the audio signal by inspecting the waveform. An evaluator using this method generally selects the portion of the phonation signal with minimum amplitude variation and rarely considers frequency variations, which often leads to significant inconsistencies between evaluators [4-6]. Efforts have been made to reduce this subjective variability: previous work showed that 20 cycles of onset/offset instability affected perturbation values, and that a segment of at least 100 cycles is needed to measure perturbation adequately [25]. The whole vowel method uses the entire voice signal, excluding voice onset and offset, to determine the most stable segment. However, onset and offset periods differ between subjects, ranging from 200 milliseconds to 1 second, and their durations cannot be determined objectively, which limits the applicability of this method [7, 8]. The middle vowel method selects an equal portion of the signal on either side of the exact midpoint of the audio signal, which ignores changes in sample stability over time [9-11]. The moving window method uses a fixed-length window that sweeps incrementally along the entire sample to determine the most stable region. With this method it is still difficult to determine the most stable segment, because a single voice segment is unlikely to display desirable characteristics for every parameter of interest [12-14].
In this paper, a new method based on modal presence probability, which uses minimum mean-square error (MMSE) to construct a modal periodogram, is utilized to obtain the most stable voice segment [15, 16]. The results of this modal periodogram method were compared with the segments selected by the whole vowel, mid-vowel, and moving window methods to determine which method is the most suitable for consistently providing stable voice segments for use in acoustic analysis.
Methods
Proposed Probability-Based Sample Selection Method
In an acoustic signal, every location in the signal has two probabilities: one being the probability that the location is stable, and the other that it is unstable. When it is uncertain if a segment is stable or unstable, an MMSE estimator for the stable periodogram in noisy signal Y is given by:
$$E(A \mid Y) = P(H_0 \mid Y)\,E(A \mid Y, H_0) + P(H_1 \mid Y)\,E(A \mid Y, H_1) \tag{1}$$
In this expression, H0 indicates an unstable location; H1 indicates a stable location; A is a parameter chosen to express the stability of Y; and E( ) is the statistical expectation operator. Accordingly, E(A∣Y, H0) is the expected value of A given the observation Y and the hypothesis H0.
The noisy signal Y is constructed using the following relations:
| (2) |
| (3) |
S and N represent a purely acoustic signal and noise, respectively. is the long-form of parameter A present in (1).
P(H0∣Y), defined as the probability that Y is unstable, is given by [15, 16]:
$$P(H_0 \mid Y) = \frac{P(H_0)\,P(Y \mid H_0)}{P(H_0)\,P(Y \mid H_0) + P(H_1)\,P(Y \mid H_1)} \tag{4}$$
The likelihood functions P(Y∣H0) and P(Y∣H1) indicate how well the observation fits the model for unstable presence and stable presence, respectively. The observation is assumed to follow a complex Gaussian distribution. Under this assumption, the likelihood that stable conditions exist is given by:
| (5) |
Similarly, the likelihood that unstable conditions exist is given by:
| (6) |
ξH1 is a parameter for stable presence and is proportional to the signal-to-noise ratio (SNR), where under stable conditions SNR = 10 log10 ξH1 [17]. ξH1 has been used in radar and communication work, where it is selected to guarantee a specified performance in terms of false alarms or missed detections [18]. It has also been used previously in voice research applications [19, 20]. For our purposes, to ensure that the detection of vocal stability is accurate, we select ξH1 = 10^1.5, which corresponds to 15 dB.
Applying these likelihoods to the posterior probability of unstable conditions, P(H0∣Y), we derive:
| (7) |
In (7), the prior probabilities are initialized to P(H0) = P(H1) = 0.5; that is, before any signal information has been analyzed, unstable and stable conditions are assumed to be equally likely. The modal periodogram method is an iterative process, so these probabilities change as more signal information is analyzed.
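To illustrate how a posterior of the kind in (4) responds to an observation, the following Python sketch applies Bayes' rule to two zero-mean complex Gaussian likelihoods with equal priors. The function names, the variance parameters `var_h0` and `var_h1`, and the choice to make one variance (1 + ξ) times the other are assumptions introduced for this example; the sketch is not a reproduction of the likelihoods in (5) and (6) or of the closed form in (7).

```python
import math


def log_complex_gaussian(power, variance):
    """Log-density of a zero-mean complex Gaussian observation with |Y|^2 = power."""
    return -power / variance - math.log(math.pi * variance)


def posterior_h0(power, var_h0, var_h1, prior_h0=0.5):
    """Bayes posterior P(H0 | Y); the prior must lie strictly between 0 and 1."""
    prior_h1 = 1.0 - prior_h0
    log_w0 = math.log(prior_h0) + log_complex_gaussian(power, var_h0)
    log_w1 = math.log(prior_h1) + log_complex_gaussian(power, var_h1)
    # Subtract the larger log-weight before exponentiating to avoid underflow.
    top = max(log_w0, log_w1)
    w0 = math.exp(log_w0 - top)
    w1 = math.exp(log_w1 - top)
    return w0 / (w0 + w1)


# Hypothetical setting: the H1 variance is (1 + xi) times the H0 variance.
xi = 10 ** 1.5
p_h0 = posterior_h0(power=4.0, var_h0=1.0, var_h1=1.0 + xi)
p_h1 = 1.0 - p_h0  # the two posteriors are complementary
```

With equal variances the observation carries no information and the posterior stays at the prior of 0.5; as the observed power grows, the hypothesis with the larger variance becomes more probable.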
The statistical expectation of the purely acoustic signal energy in noisy signal Y is:
| (8) |
In (8), it is necessarily true that P(H1∣Y) = 1 − P(H0∣Y). When applying (8) to acoustic analysis, the stable periodogram of the t-th frame is:
| (9) |
where P(H1∣Y(t)) = 1 − P(H0∣Y(t)), and
| (10) |
In (9) and (10), Y(t) is the t-th frame of Y, and is the signal power estimate from the previous frame.
In practice, the signal energy in frame t is estimated via recursive smoothing:
| (11) |
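Recursive smoothing of a frame-wise energy sequence can be sketched in Python as follows. This is a generic first-order smoother, not necessarily the exact form of (11): the smoothing constant `alpha` and its default of 0.8 are assumptions, and the input is taken to be the per-frame estimate being smoothed. The initial value follows the text, which uses the mean of the first five frames.

```python
def recursive_smoothing(frame_energies, alpha=0.8, n_init=5):
    """First-order recursive smoothing of a per-frame energy sequence.

    alpha (hypothetical default) weights the previous estimate;
    (1 - alpha) weights the current frame.
    """
    if len(frame_energies) < n_init:
        raise ValueError("need at least n_init frames to initialise")
    # Initial estimate: mean of the first n_init frames.
    estimate = sum(frame_energies[:n_init]) / n_init
    smoothed = []
    for energy in frame_energies:
        estimate = alpha * estimate + (1.0 - alpha) * energy
        smoothed.append(estimate)
    return smoothed
```

A larger `alpha` gives a smoother but slower-tracking estimate; a smaller value follows frame-to-frame changes more closely.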
When implementing (11), the most stable segment of the voice sample is the one with the highest purely acoustic signal power. The voice samples in this study were all taken from the same database, and all were recorded in a sound-proof booth. Therefore, the highest purely acoustic signal power, and thus the most stable segment of the signal, can be reliably selected as the value of SL in the following relation:
| (12) |
The initial signal energy is the mean of the first five frames of Y. The adjacent 30 ms of the signal is then appended to the current segment iteratively until SL no longer increases; the segment at that point is taken as the best sample. The iteration flow chart is shown in Figure 1.
Figure 1:
Iteration flow chart for segment selection utilizing the modal periodogram method.
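The iterative extension described above can be sketched in Python as a simple greedy loop. The `score` callable is a hypothetical stand-in for SL, whose defining relation (12) is not reproduced here, and growing the segment only in the forward direction from `start` is likewise an assumption of this sketch.

```python
def grow_segment(samples, sample_rate, score, start=0, step_ms=30.0):
    """Extend a segment in 30 ms steps until the score stops increasing.

    score: hypothetical callable mapping a list of samples to a number
    (a stand-in for SL). Returns (start, end, best_score), end exclusive.
    """
    if not 0 <= start < len(samples):
        raise ValueError("start must index into samples")
    step = max(1, int(round(sample_rate * step_ms / 1000.0)))
    end = min(start + step, len(samples))
    best = score(samples[start:end])
    while end < len(samples):
        candidate_end = min(end + step, len(samples))
        candidate = score(samples[start:candidate_end])
        if candidate <= best:
            break  # the score no longer increases: keep the current segment
        best, end = candidate, candidate_end
    return start, end, best
```

Because the loop stops at the first step that fails to improve the score, it finds a local rather than a global optimum; that matches the stopping rule stated in the text but is worth keeping in mind when choosing the starting point.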
Each voice sample was analyzed, and the most stable segment was selected using four different methods: whole vowel, mid-vowel, moving window, and the proposed modal periodogram method. The effectiveness of each method was determined by comparing the jitter, D2, and SCR values obtained from the segment using each analysis method.
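For readers unfamiliar with perturbation measures, the sketch below computes a commonly used percent jitter from a sequence of glottal cycle periods. The text does not specify which jitter variant was used, so this local (cycle-to-cycle) definition is an assumption, and period extraction is assumed to have been done beforehand.

```python
def percent_jitter(periods):
    """Local jitter in percent.

    Mean absolute difference between consecutive cycle periods,
    divided by the mean period, times 100.
    """
    if len(periods) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period
```

A perfectly periodic segment yields 0, and larger values indicate greater cycle-to-cycle variation in period.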
Other Sample Selection Methods Utilized
Whole vowel: The complete vocal sample of the subject uttering a prolonged vowel, minus 200 milliseconds (ms) from the beginning and end to remove vocal onset and offset, was used for acoustic analysis. The onset and offset segments are removed because they are often the most unstable parts of an audio signal [4-6].
Mid-vowel: A 250 ms segment on each side of the midpoint was selected from a longer vocal sample of a subject uttering a prolonged vowel. The resulting 500 ms segment was used for the acoustic analysis.
Moving window: An 800 ms window length and a 25 ms time shift were previously determined to be optimal for calculating acoustic parameters [14]. After the parameters of interest (jitter, D2, SCR) were measured in each moving window segment, the optimal parameter values (minimum jitter, minimum D2, and maximum SCR) were determined and used to identify the most stable portion of the voice. These optimal values sometimes occur in different segments of the voice sample. In practice, the moving window method selects only a single window as the representative portion of the voice. However, because we are concerned with comparing the best acoustic values that each method can generate, we allowed the optimal value of each parameter to come from a different window.
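The three comparison methods reduce to simple index arithmetic on the sampled signal, which can be sketched in Python as follows. The durations are those given above; the function names and the representation of the signal as a plain sequence of samples are assumptions made for illustration.

```python
def _ms_to_samples(sample_rate, ms):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * ms / 1000.0))


def whole_vowel(samples, sample_rate, trim_ms=200.0):
    """Drop trim_ms from the beginning and the end (onset and offset)."""
    trim = _ms_to_samples(sample_rate, trim_ms)
    if len(samples) <= 2 * trim:
        raise ValueError("signal too short for the requested trim")
    return samples[trim:len(samples) - trim]


def mid_vowel(samples, sample_rate, half_ms=250.0):
    """Take half_ms on each side of the midpoint (500 ms in total)."""
    half = _ms_to_samples(sample_rate, half_ms)
    mid = len(samples) // 2
    if mid - half < 0 or mid + half > len(samples):
        raise ValueError("signal too short for the requested segment")
    return samples[mid - half:mid + half]


def moving_windows(samples, sample_rate, window_ms=800.0, shift_ms=25.0):
    """Return (start, end) index pairs for every full window position."""
    window = _ms_to_samples(sample_rate, window_ms)
    shift = max(1, _ms_to_samples(sample_rate, shift_ms))
    last_start = len(samples) - window
    return [(s, s + window) for s in range(0, last_start + 1, shift)]
```

For a 2 s recording, `moving_windows` yields 49 candidate windows; the acoustic parameters would then be computed on each window and the best values retained.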
Subjects
Recordings from forty-eight subjects uttering a prolonged vowel sound were analyzed using each of the four segment selection methods. All voice samples were taken from the KayPentax Disordered Voice Database (Kay Elemetrics Corp, Lincoln Park, NJ). Twenty-seven normal voices (7 male and 20 female) were analyzed with an age range of 19–81 years and an average age of 36.7 years. Twenty-one disordered voices (8 male and 13 female) were analyzed with an age range of 40–93 years and an average age of 62.5 years. Table 1 summarizes the subject pool for the voice samples analyzed.
Table 1.
Summary of subject pools for normal and disordered voice
| Voice Type | Number of Samples | Mean Age (years) | Gender |
|---|---|---|---|
| Normal | 27 | 36.7 | 7 male, 20 female |
| Disordered | 21 | 62.5 | 8 male, 13 female |
Results
The mean and standard deviation of each parameter for normal and disordered voices are displayed in Table 2 and Table 3, respectively. One-way ANOVA was used to determine whether each analysis method can differentiate between normal and disordered voice samples; the resulting p-values are displayed in Table 4. Significant differences (p < 0.01) exist between the means of the normal and disordered voices in D2 and SCR for every method. Table 4 also indicates that jitter is less suitable than D2 and SCR for differentiating between normal and disordered subjects, so it was excluded from the analysis of disordered voices. Box plots of D2 and SCR in normal and disordered voices are displayed in Figure 2. The Wilcoxon Signed-Rank test was conducted to check for significant differences between the parameter values obtained using the modal periodogram method and those obtained using the other three methods. The p-values of the Wilcoxon tests for jitter, D2, and SCR in normal voices are displayed in Table 5, and those for D2 and SCR in disordered voices are displayed in Table 6.
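Because every voice sample is analyzed by each selection method, the comparisons are paired, which is what the Wilcoxon Signed-Rank test assumes. A minimal Python sketch of the test follows. The text does not state which software was used; this version uses average ranks for ties and a normal approximation without tie or continuity correction, which is crude for very small samples. The function name is hypothetical and the example values are synthetic placeholders, not data from this study.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-) and the
    p-value comes from a normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (statistic - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return statistic, p_value


# Synthetic placeholder values for two methods applied to the same samples.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [0.0, 0.0, 0.0, 0.0, 0.0]
stat, p = wilcoxon_signed_rank(method_a, method_b)
```

In practice an established statistics package would normally be preferred, since it can provide exact p-values for small samples and handle ties more carefully.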
Table 2.
Summary of results (mean ± standard deviation) for each selection method with normal voices (n = 27)
| Parameter | Whole Vowel | Mid-vowel | Moving Window | Modal Periodogram |
|---|---|---|---|---|
| Jitter | 0.4833 ± 0.2468 | 0.5044 ± 0.2843 | 0.5716 ± 0.2710 | 0.4265 ± 0.2502 |
| D2 | 2.9921 ± 0.2302 | 2.5121 ± 0.3291 | 3.3468 ± 0.3690 | 1.8791 ± 0.3543 |
| SCR | 0.3057 ± 0.1212 | 0.5870 ± 0.2011 | 0.6016 ± 0.1854 | 1.3561 ± 0.4106 |
Table 3.
Summary of results (mean ± standard deviation) for each selection method with disordered voices (n = 21)
| Parameter | Whole Vowel | Mid-vowel | Moving Window | Modal Periodogram |
|---|---|---|---|---|
| D2 | 3.7276 ± 0.8098 | 3.8532 ± 0.7838 | 4.3690 ± 0.7668 | 2.8652 ± 0.6697 |
| SCR | 0.0415 ± 0.0427 | 0.0696 ± 0.0590 | 0.0547 ± 0.0499 | 0.4430 ± 0.1603 |
Table 4.
Comparison Between Normal Voices and Disordered Voices for Each Parameter
| Method | Jitter | D2 | SCR |
|---|---|---|---|
| Whole Vowel | p = 0.004** | p < 0.001** | p < 0.001** |
| Mid-vowel | p = 0.054 | p < 0.001** | p < 0.001** |
| Moving Window | p = 0.047* | p < 0.001** | p < 0.001** |
| Modal Periodogram | p = 0.002** | p < 0.001** | p < 0.001** |
\* Denotes p-value significant at the 0.05 level
\*\* Denotes p-value significant at the 0.01 level
Figure 2:
Box plots of (a) D2 for normal voices (b) D2 for disordered voices (c) SCR for normal voices (d) SCR for disordered voices. Median values are indicated by the solid red lines. The red pluses represent outliers.
Table 5.
Statistical Results for Comparison Between Modal Periodogram and Competing Methods For Normal Voices
| Method | Jitter | D2 | SCR |
|---|---|---|---|
| Whole vowel | p = 0.046* | p < 0.001** | p < 0.001** |
| Mid-vowel | p = 0.040* | p < 0.001** | p < 0.001** |
| Moving Window | p = 0.002** | p < 0.001** | p < 0.001** |
\* Denotes p-value significant at the 0.05 level
\*\* Denotes p-value significant at the 0.01 level
Table 6.
Statistical Results for Comparison Between Modal Periodogram and Competing Methods For Disordered Voices
| Method | D2 | SCR |
|---|---|---|
| Whole vowel | p = 0.001** | p < 0.001** |
| Mid-vowel | p = 0.001** | p < 0.001** |
| Moving Window | p = 0.001** | p < 0.001** |
\* Denotes p-value significant at the 0.05 level
\*\* Denotes p-value significant at the 0.01 level
Discussion
In this study, four segment selection methods (whole vowel, mid-vowel, moving window, and modal periodogram) were compared to determine which could obtain the most stable phonation segment for acoustic analysis in adults with and without a voice disorder. Stability was characterized by low perturbation and D2 measurements coupled with high SCR values. To calculate these stability values for a subject, our approach describes an acoustic signal as having two probabilities: the probability that the subject produces a stable signal, and the probability that the subject produces an unstable signal. These probabilities are initially unknown, but the method developed here estimates them, providing new information on how likely a specific subject is to produce a stable signal. The higher the probability that the subject can produce a stable voice segment, the lower the resulting perturbation values. In this way, the voice segment with the highest probability of being stable represents the patient’s best phonatory capability and reflects their phonatory skill.
During phonation, fluctuations occur randomly in voice signals because of the nonstationary nature of the voice. Isolating and selecting a stable segment of the voice signal is therefore difficult, but it is essential to performing accurate acoustic analysis. The statistical results show that jitter is less suitable than D2 and SCR for discriminating between normal and disordered voices, which agrees with previous findings [21]. Therefore, in normal voices, stability was characterized using jitter, D2, and SCR, while for disordered samples, stability was determined using D2 and SCR.
The ability of each segment selection method to discriminate between normal and disordered voices was demonstrated, which is consistent with previous studies [13]. Compared to previous studies, the values obtained in this study for jitter and D2 in normal voices were larger, which can be attributed to differences in the voice samples selected for analysis. Furthermore, when the methods are compared with one another, the parameters of interest increase or decrease in the same direction as previously reported [13, 14]. The same can be said of the greater D2 values obtained for the disordered voices, with the added consideration that pathological voices vary widely between pathologies and patients [14]. The voice samples in this study come from adult subjects only. Voice samples from children have been shown to have different acoustic properties: for a given window length, a child's voice would not contain a similar number of cycles as an adult's and would likely yield different acoustic measurements, so future studies of pediatric voice samples are essential to determine the correct window length [23]. Similar considerations may apply to differences between male and female subjects, which would require additional analysis [24].
The superior performance of the modal periodogram in selecting stable vocal samples enables numerous potential improvements in the clinical diagnosis and treatment of vocal pathology. Clinicians and voice pathologists using this selection method or its measurements can be more confident that they are obtaining a stable sample from the patient they are treating. The method also improves the ability of voice researchers to obtain reproducible acoustic measurements for both normal and pathological subjects. Possible future directions for the modal periodogram method include evaluating its efficacy in differentiating between classes of voice disorders to predict whether a patient has a structural or neurogenic disorder, evaluating its potential to improve the classification of a voice sample into its appropriate voice type, and determining ranges of acoustic parameter values that correspond to normal or disordered voices [22].
Conclusion
The modal periodogram method applies modal presence probability and minimum mean-square error (MMSE) to construct a modal periodogram of the acoustic signal, which in turn is used to obtain the most stable segment of the signal. The results show that for both normal and disordered voices, the segments selected by the modal periodogram method were more stable than those selected by the other three traditional segment selection methods. Specifically, the D2 values obtained with the modal periodogram method were significantly lower, and the SCR values significantly higher, than those obtained with the other methods. These results indicate that segments selected using the modal periodogram method are more stable and therefore better suited to acoustic analysis. Given these findings, the proposed modal periodogram method may soon be the standard method for selecting stable acoustic signal segments.
Acknowledgments
Financial disclosure: This study was supported by the National Institutes of Health, NIH Grant No. R01-DC006019 from the National Institute on Deafness and Other Communication Disorders.
Footnotes
Conflict of interest: None
References
- [1] Laver J, Hiller S, MacKenzie J, Rooney E. “An acoustic screening system for detection of laryngeal pathology”. J Phon. 1986;14:517–524.
- [2] Kasuya H, Ogawa S, Kikuchi Y. “An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology”. Speech Commun. 1986;5:171–181.
- [3] Titze IR. Workshop on Acoustic Voice Analysis: Summary Statement. Denver, CO: National Center for Voice and Speech; 1995:1–36.
- [4] MacCallum JK, Olszewski AE, Zhang Y, Jiang JJ. “Effects of low-pass filtering on acoustic analysis of voice”. J Voice. 2011;25:15–20.
- [5] Jiang JJ, Zhang Y, MacCallum J, Sprecher A, Zhou L. “Objective acoustic analysis of pathological voices from patients with vocal nodules and polyps”. Folia Phoniatr Logop. 2009;61:342–349.
- [6] Scherer RC, Gould WJ, Titze IR, Meyers AD, Sataloff RT. “Preliminary evaluation of selected acoustic and glottographic measures for clinical phonatory function analyses”. J Voice. 1988;2:230–244.
- [7] Revis J, Giovanni A, Wuyts F, Triglia JM. “Comparison of different voice samples for perceptual analysis”. Folia Phoniatr Logop. 1999;51:108–116.
- [8] Choi SH, Lee J, Sprecher AJ, Jiang JJ. “The effect of segment selection on acoustic analysis”. J Voice. 2012;26:1–7.
- [9] Glaze LE, Bless DM, Susser RD. “Acoustic analysis of vowel and loudness differences in children’s voice”. J Voice. 1990;4:37–44.
- [10] Gelfer MP. “Fundamental frequency, intensity, and vowel selection: effects on measures of phonatory stability”. J Speech Lang Hear Res. 1995;38:1189–1198.
- [11] Bielamowicz SS, Kreiman JJ, Gerratt BRB, Dauer MSM, Berke GSG. “Comparison of voice analysis systems for perturbation measurement”. J Speech Lang Hear Res. 1996;39:126–134.
- [12] Olszewski AE, Shen L, Jiang JJ. “Objective Methods of Sample Selection in Acoustic Analysis of Voice”. Annals of Otology, Rhinology & Laryngology. 120(3):155–161.
- [13] Choi SH, Lee J, Sprecher AJ, Jiang JJ. “The Effect of Segment Selection on Acoustic Analysis”. Journal of Voice. 2010;26(1):1–7.
- [14] Shu M, Jiang JJ, Willey M. “The Effect of Moving Window on Acoustic Analysis”. Journal of Voice. 2014;30(1):5–10.
- [15] Gerkmann T, Hendriks R. “Unbiased MMSE-Based Noise Power Estimation with Low Complexity and Low Tracking Delay”. IEEE Transactions on Audio, Speech, and Language Processing. 2012;20(4):1383–1393.
- [16] Gerkmann T, Hendriks R. “Noise Power Estimation Based on the Probability of Speech Presence”. 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. 2011:145–148.
- [17] Gerkmann T, Breithaupt C, Martin R. “Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors”. IEEE Trans. Audio, Speech, Language Process. 2008 Jul;16(5):910–919.
- [18] McAulay RJ, Malpass ML. “Speech enhancement using a soft-decision noise suppression filter”. IEEE Trans. Acoust., Speech, Signal Process. 1980;28(2):137–145.
- [19] Ephraim Y, Malah D. “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator”. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1984;32(6):1109–1121.
- [20] Sohn J, Sung W. “A voice activity detector employing soft decision based noise spectrum adaptation”. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’98). 1998.
- [21] Titze IR, Liang H. “Comparison of Fo extraction methods for high-precision voice perturbation measurements”. Journal of Speech, Language, and Hearing Research. 1993 Dec;36(6):1120–33.
- [22] Sprecher A, Olszewski A, Zhang Y, Jiang JJ. “Updating Signal Typing in Voice: Addition of Type 4 Signals”. J. Acoust. Soc. Am. 2010 Jun;127(6):3710–3716.
- [23] Patel RR, Dubrovskiy D, Döllinger M. “Measurement of glottal cycle characteristics between children and adults: physiological variations”. J Voice. 2014 Mar;28(4):476–486.
- [24] Higgins MB, Saxman JH. “A comparison of selected phonatory behaviors of healthy aged and young adults”. J Speech Hear Res. 1991 Oct;34(5):1000–1010.
- [25] Scherer RC, Gould WJ, Titze IR, Meyers AD, Sataloff RT. “Preliminary evaluation of selected acoustic and glottographic measures for clinical phonatory function analysis”. J Voice. 1988;2(3):230–244.

