Abstract
The purpose of this study was to investigate the degree to which the coupling between the oscillating sound source and the vocal tract filter occurs in connected speech samples, and to provide insight into how humans may choose to deploy this coupling for intelligibility, intensity, or both. A technique was developed to extract, from minutes-long speech samples, the time-dependent fundamental frequency (fo) and the first two formant frequencies (F1 and F2) to permit an analysis that determines whether a talker aligns a voice source harmonic with a vocal tract resonance, and also measures a normalized vowel space area. The accuracy of the processing method was validated by applying it to a set of audio samples generated via speech simulation that provided “ground-truth” data. It was then applied to a 41-talker database of clear and conversational speech. Results indicated that talkers make adjustments for different speaking styles that include not only increased vowel space area but also alignment of harmonics and formant frequencies, although future work is needed to determine whether these adjustments are directed toward maximizing transfer of information or transfer of acoustic power.
Keywords: synchronization, fundamental frequency, resonance, formant
I. INTRODUCTION
Much of our knowledge of speech acoustics is built on the traditional source-filter theory (cf. Chiba and Kajiyama, 1941; Fant, 1960; Flanagan, 1972; Stevens, 2000) in which the vibration of the vocal folds produces a sound source in the form of pulsatile glottal airflow that is acoustically filtered by the vocal tract before it is released into free space as the speech waveform. In the linear form of the source-filter theory, the production of the voiced sound source is considered to be independent of the wave propagation in the vocal tract such that the filter does not influence the source. Rather, the filter can only enhance or suppress the amplitudes of the spectral components produced by the source. However, interaction of the source and filter, through nonlinear coupling, might occur when the waves propagating within the airways affect the sound source.
Source-filter interaction and its effects have been known for a long time (cf. Flanagan and Landgraf, 1968; Ishizaka and Flanagan, 1972; Rothenberg, 1987; Fant and Lin, 1987; Titze, 1988) but have more recently received increased attention. For example, Titze et al. (2008) used a set of vocal exercises consisting of fundamental frequency (fo) and formant glides to show that human phonation exhibits various forms of instability when the fo coincided with the first formant frequency (F1). In a related article, Titze (2008) provided the theoretical background for “Level 1” and “Level 2” source-filter interactions, where “Level 1” interaction is a modification of the glottal airflow by reflected energy that occurs without a significant change in vocal fold vibration, and “Level 2” interaction includes changes in vocal fold vibration.
In a similar vein, Maxfield et al. (2017) reported an experiment designed specifically to induce source-filter interaction by artificial means. Participants were asked to produce a fundamental frequency (fo) glide with a tube fixed between their lips, effectively lengthening the vocal tract and shifting the resonances to frequencies that were unfamiliar. Participants were instructed to maintain a steady rate of change and comfortable loudness throughout the glide, setting up the conditions for fo or a harmonic (2fo, 3fo, etc.) to cross a vocal tract resonance. An eventual resonance-harmonic crossing took one of two forms. In one form, the glide progressed smoothly until a harmonic equaled a resonance frequency. The glide then leveled off until the subject (or the system naturally) was able to accommodate the change again, resulting in an abrupt jump in fo away from the resonance. In the second form, the glide was disrupted by a gap in phonation until a jump to an fo equal to the resonance frequency had occurred. These entrainments were observed as the first four harmonics crossed one of the first three resonances, although the probability of a phonatory instability occurring at one of those crossings was estimated to be about 65%. It was also shown that when fo was just below the first resonance, its harmonic intensity increased and then dropped as the second harmonic moved through the resonance. Simultaneously, however, the harmonic intensity of 2fo dropped by several dB when fo was just below the first resonance, and then increased with frequency. These effects were reversed when 2fo was just below the first resonance. Together these results suggest that both Level 1 and Level 2 interactions occurred in the experimental conditions.
Source-tract interaction effects were similarly investigated by Wade et al. (2017). In their study, singers were asked to engage in several experiments designed to produce variations of both fo and the first vocal tract resonance frequency, fR1, to determine if vocal instability occurs when the two are aligned in frequency. For example, in one experiment each singer produced a vowel glide while holding fo constant, and then produced an fo glide while holding the vowel configuration constant, thus providing ample opportunities for observing the vibratory characteristics as fo and fR1 passed through each other. Their findings indicated that when a singer’s vocal tract was unconstrained, they tended to alter their lip geometry just enough so that fR1 was higher in frequency than fo, thus avoiding vocal instability. When their vocal tracts were constrained with a mouth ring that limited lip movement, however, the singers did exhibit vocal instabilities at fo values above fR1.
Natural source-filter interactions were studied in classical singers by Echternach et al. (2021). During repetitions of a vowel transition sequence produced on the pitch D4 (294 Hz), transnasal high-speed videoendoscopy (20,000 fps) images, electroglottography (EGG) signals, and audio recordings were collected from 12 classical singers (6 male, 6 female). Analysis of the signals indicated that, for some of the singers, there was evidence of Level 2 interaction (i.e., changes in vocal fold vibration) between the vocal tract and voice source. The lack of systematic evidence across all of the singers suggested that the presence of the interaction is idiosyncratic and may involve a range of factors such as learned fine-tuned adjustments of the voice production musculature and/or anatomical differences.
It is likely that the mechanisms of vocal sound production operate along a continuum. At one end of the continuum the sound generated by vocal fold vibration is weakly coupled to the resonances of the vocal tract such that the output is effectively a linear combination of their respective acoustic characteristics, whereas at the other end there is strong coupling of the vibratory source to the vocal tract resonances. It is not clear, however, where along such a continuum human speech would typically be produced. A low, relatively stable, fundamental frequency would generate densely spaced harmonics and little interaction between the sound source and the acoustic filter of the vocal tract, a condition in which the envelope of the output speech spectrum would clearly express the frequency response of the vocal tract. In contrast, a variable fundamental frequency that allows for frequent and prolonged alignment of harmonics with resonances of the vocal tract may serve as a means of optimizing radiated sound intensity and tone quality, as well as enhancement of vowel quality. Interaction between the source and filter may also facilitate the generation of harmonic energy in the source signal with minimal vocal fold collision and glottal closure (Titze, 2008; Story and Bunton, 2013).
The purpose of this study was to investigate the degree to which coupling between the oscillating sound source and the vocal tract filter occurs in connected speech samples of talkers who were recorded producing both conversational and “clear” speech. Ferguson (2004) reported a study in which 41 talkers were recorded producing a wide range of speech material as if they were in conversation with a partner and again as if they were speaking to be understood by a hearing-impaired person. The latter condition is referred to as “clear” speech and has been shown, in some studies, to enhance intelligibility for both normal-hearing and hearing-impaired listeners, although Ferguson (2004) reported that the benefits of clear speech vary widely among talkers. Regardless, the database of audio recordings of the 41 talkers provides a useful test case for investigating the possible occurrence of vocal tract and voice source interaction in two speaking styles that were intended to be distinctly different. The specific aims of the study were to 1) develop and test a method for visualizing and quantifying the degree of tuning of the voice source harmonics to the first resonance of the vocal tract, a condition for which source-tract interaction is likely to occur, and 2) apply the method to the Ferguson (2004) database of audio recordings for 41 talkers.
II. METHOD
A process was developed to extract, from minutes-long speech samples, the time-dependent fundamental frequency (fo) and the first two formant frequencies (F1 and F2) to allow for analysis that determines whether a talker aligns a voice source harmonic with a vocal tract resonance. Alignment would be interpreted as indicating synchronization of the voice source and vocal tract filter. In order to be assured that the audio processing method generates accurate measurements, it was first applied to an extensive set of audio samples generated via speech simulation. The simulation process allows for the vocal tract resonances to be calculated independently of the formant measurement algorithm, thus providing “ground-truth” data that could be compared to the measurements.
The following subsections lay out the main parts of the study: explanation of the simulated speech samples; description of the audio processing approach used to measure fo, F1, and F2; validation of the processing approach based on application to simulated speech samples and comparison to ground-truth data; and, finally, application of the processing approach to a large set of audio samples of natural speech.
A. Speech simulations as “ground-truth” test samples
Simulated signals were generated by a speech production model composed of area function representations of the trachea and vocal tract, and a kinematic representation of the vocal folds (specifically the model detailed in Story, 2013 and Story and Bunton, 2019). The vocal tract portion of the model can be scaled along an age and gender continuum (Story et al., 2018); the trachea and vocal fold component can be similarly scaled. Temporal variation of the vocal tract area function is controlled in this model by the approach detailed in Story and Bunton (2019) where speech segments can be encoded by specifying relative acoustic events along a time axis that consist of directional changes of the first three vocal tract resonance frequencies (fR1, fR2, fR3); these changes are called resonance deflection patterns (RDPs). The events are transformed, via acoustic sensitivity functions calculated for a neutral or baseline vocal tract area function, into time-varying modulations of the vocal tract shape. The speech production model generates an audio signal by coupling the interactive vocal fold model to the tracheal exit and vocal tract entry. At each time sample, acoustic wave propagation in the airways is calculated over the time course of the utterance with a wave-reflection algorithm (Liljencrants, 1985; Story, 1995; Titze, 2002) that includes energy losses due to yielding walls, viscosity, and heat conduction, as well as radiation at the lips, nares, and skin surfaces (specifically as described in Story (1995)). The glottal flow signal results from an interaction of the voice source model with the wave propagation in the vocal tract and trachea (cf., Titze, 2008). The acoustic output signal is the sum of the radiated pressure at the lips, nares, and skin surfaces, and is analogous to an audio signal recorded with a microphone. Because the voice source model is a kinematic representation of the vibrating vocal folds, there is only the possibility of Level 1 interaction (i.e., modification of the glottal airflow by reflected energy in the vocal tract and trachea) in the simulated signals. Level 2 interaction, for which the reflected vocal tract energy can affect both glottal airflow and vocal fold vibration, would require the use of a self-oscillating model of the vocal folds as the voice source.
To emulate the wide variation of vowels present in typical connected speech, RDPs were generated randomly over an utterance duration of about 13 seconds to produce a continuously changing vocal tract shape representing a random sequence of vowel transitions. The randomized approach ensured a sampling of vowels across the vowel space represented by the first and second vocal tract resonances over many seconds of simulated speech. This process was performed independently with vocal tracts representative of an adult male talker and an adult female talker; i.e., the vowel sequence was different for the two simulated talkers. This was repeated five times, resulting in time-varying area functions for each talker that totaled about 65 seconds in duration (one 65-second simulation could have been generated, but executing the process in five blocks was more computationally manageable).
Time-varying vocal tract resonance frequencies were determined from the frequency response function calculated for the area function at each time point in the temporal vowel sequences for the male and female vocal tracts. These calculations were accomplished with a transmission-line model of the vocal tract (e.g., Sondhi and Schroeter, 1987; Story, Laukkanen, and Titze, 2000) that included energy losses due to yielding walls, viscosity, heat conduction, and acoustic radiation at the lips. The resulting resonance frequencies were determined with a peak-picking algorithm with parabolic interpolation (Titze et al., 1987). The time-varying resonance frequencies were calculated for comparison to the formant frequencies measured by the algorithm described in the next section.¹ The resonance frequencies were also used to prescribe fundamental frequency contours used in the simulation of the acoustic signals in which a selected harmonic (either 2fo or 3fo) was aligned with the first resonance at all points in time to provide a stringent test of the formant tracking algorithm.
Simulated acoustic speech signals were generated to be representative of one adult male talker and one adult female talker based on the time-varying area functions described previously and with several different fundamental frequency contours. In the first speech-like simulation condition for each talker, the fundamental frequency was varied randomly within a range that would be reasonable for typical male and female speech; although harmonic frequencies will occasionally pass through the vocal tract resonances, any such alignment was coincidental in these two cases. In the second simulation condition, a constant fundamental frequency was maintained throughout the duration of the utterance; for the male, fo = 100 Hz, and for the female, fo = 200 Hz. This was intended to serve as a case in which the acoustic characteristics of the voice source and vocal tract filter were not synchronized, and thus there was no deliberate enhancement of the output signal. In the third condition, the fundamental frequency at each point in time across the utterance was set to equal the first vocal tract resonance frequency divided by three (i.e., fo = fR1/3) for the male and divided by two (i.e., fo = fR1/2) for the female. The simulations with the third set of conditions were intended to serve as extreme cases of synchronization of the voice source and vocal tract filter characteristics such that a selected harmonic was tuned to the first vocal tract resonance throughout the entire duration of the utterance. Simulations for each condition were run independently for each of the five sets of time-varying area functions generated for the male and female talkers.
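To make the three conditions concrete, the sketch below shows how each fo contour relates to the simulated first-resonance track; fR1 is assumed to be a vector holding the first resonance frequency at each time sample of the male simulation (variable names are illustrative, not from the original code).

```matlab
% Minimal sketch of the three fo conditions (male values shown).
% fR1 is assumed: the time-varying first vocal tract resonance in Hz.
fo_const = 100 * ones(size(fR1));   % condition 2: constant fo (female: 200 Hz)
fo_tuned = fR1 / 3;                 % condition 3: 3*fo = fR1 at every sample
                                    %   (female: fo = fR1/2, so 2*fo = fR1)
% Condition 1 used a randomized, speech-like fo contour within a typical
% male (or female) range; any harmonic-resonance alignment was coincidental.
```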
Shown in Fig. 1 is the first block (approximately 13 seconds in duration) of each of the three simulation conditions for the male talker (top row) and the female talker (bottom row). The other blocks in each condition are not shown but contain essentially the same information except that they were generated with a different randomized sequence of vocal tract shapes. Plotted in each panel of the figure are the first two vocal tract resonances (thick lines) along with the fundamental frequency and harmonics (thin lines). The plots in Figs. 1a and 1d are the speech-like cases, whereas the others are simulations in which the fundamental frequency was either constant over time (Figs. 1b and 1e) or varied so that a selected harmonic remained tuned to the first vocal tract resonance (Figs. 1c and 1f). These cases were intended to provide a range of test material for the fo and formant analysis algorithms.
FIG. 1.

Calculated vocal tract resonance frequencies, fR1 and fR2 (thick lines), and prescribed fundamental frequency and harmonics, fo, 2fo, 3fo, … (thin lines), for the three male and female simulation cases described in the text. (Color online)
B. Measurement of fundamental frequency and formants
The goal of the audio processing approach was simply to provide accurate measurements of the fundamental frequency, fo, and the first two formant frequencies, F1 and F2, over the time course of a long utterance. These measurements were based on fairly standard methods, but the challenge was in limiting them to segments of the signal that were voiced and of duration long enough to allow for accurate measurement of both fo and formant frequencies. The process is explained below in the numbered list of steps where the “signal” can be assumed to be an audio recording that is several minutes long (typically 7–10 minutes in duration), and “segment” is used to refer to a short portion of the signal subjected to some aspect of analysis. All analyses were coded in Matlab (MathWorks, 2023).
The first step was to apply a voicing detector to the signal to identify segments that contained periodicity based on vocal fold vibration. Using a sliding 0.025 second window with 0.0025 sec. overlap, periodicity was determined as the largest value of the normalized autocorrelation function within each window (cf., Xie & Niyogi, 2006), resulting in a periodicity function over the time course of the signal that varied between 0 and 1. A threshold of 0.7 was used to identify voiced segments.
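A minimal sketch of such a voicing detector is given below, assuming the audio is a column vector; the 0.0025 s value is treated here as the analysis hop, and the 2 ms minimum lag (which caps the detectable fo near 500 Hz) is an added assumption. The xcorr function requires the Signal Processing Toolbox.

```matlab
% Sketch of a sliding-window voicing detector based on the peak of the
% normalized autocorrelation (threshold of 0.7, as in the text).
function [voiced, pk] = voicing_contour(x, fs)
    winLen = round(0.025 * fs);               % 0.025 s analysis window
    hop    = round(0.0025 * fs);              % 0.0025 s advance per frame
    minLag = round(0.002 * fs);               % skip lags near zero (assumed)
    nWin   = floor((length(x) - winLen) / hop) + 1;
    pk     = zeros(nWin, 1);
    for k = 1:nWin
        seg = x((k-1)*hop + (1:winLen));
        seg = seg - mean(seg);
        r   = xcorr(seg, 'coeff');            % normalized autocorrelation
        r   = r(winLen+1:end);                % keep positive lags only
        pk(k) = max(r(minLag:end));           % periodicity strength, 0 to 1
    end
    voiced = pk > 0.7;                        % voicing decision per frame
end
```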
For each voiced segment, the fo was determined by a two-step process. First, an estimate of the average fo over the duration of the segment was determined as the first spectral peak below 600 Hz from the mean of consecutive spectra calculated for 0.05 sec. windows. The fo estimate was then used to define a sliding time window in which cycles of the segment waveform were detected by finding time points at which amplitude minima occurred at the center of each window. The resulting collection of time points defines the start and end of consecutive cycles so that the fo can be computed for each cycle. Median and smoothing filters were applied to the resultant fo contour for each segment to eliminate spurious peaks.
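The following sketch illustrates this two-step fo procedure under stated assumptions: the “first peak below 600 Hz” is approximated by the strongest peak in a 60–600 Hz band, cycle boundaries are located with a simple minimum-picking loop, and movmedian/movmean stand in for the unspecified median and smoothing filters.

```matlab
% Sketch of two-step fo estimation for one voiced segment (column vector).
function [t_fo, fo] = estimate_fo(seg, fs)
    % Step 1: average fo from the mean of consecutive 0.05 s spectra
    win  = round(0.05 * fs);
    w    = 0.5 - 0.5*cos(2*pi*(0:win-1)'/(win-1));   % Hann window, no toolbox
    nfft = 2^nextpow2(4 * win);
    S    = zeros(nfft, 1);
    for k = 1:floor(length(seg) / win)
        S = S + abs(fft(seg((k-1)*win + (1:win)) .* w, nfft));
    end
    f    = (0:nfft-1)' * fs / nfft;
    band = f > 60 & f < 600;                   % assumed search band
    [~, i] = max(S(band));  fb = f(band);  foAvg = fb(i);
    % Step 2: find one amplitude minimum per estimated cycle
    T = round(fs / foAvg);                     % approximate cycle length
    mins = []; c = 1;
    while c + T - 1 <= length(seg)
        [~, j] = min(seg(c:c+T-1));
        mins(end+1) = c + j - 1;               %#ok<AGROW> cycle boundary
        c = mins(end) + round(0.7 * T);        % advance most of one cycle
    end
    fo   = fs ./ diff(mins);                   % fo of each detected cycle
    t_fo = mins(1:end-1) / fs;                 % time of each cycle start
    fo   = movmean(movmedian(fo, 5), 5);       % despike, then smooth
end
```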
The F1 and F2 formant frequencies were measured for each voiced segment with LPC (Linear Predictive Coding) analysis. The segment waveform was first pre-emphasized, and then analyzed with a 0.05 sec. Hamming window with 0.0125 sec. overlap. For this study, all audio signals were sampled at 22050 Hz, and the number of LPC coefficients was set according to whether the maximum fo of the segment was below or above 200 Hz. The formants in each time window were estimated as the frequencies of the first two peaks in the frequency response function of the LPC filter, located with a peak-picking algorithm with parabolic interpolation (Titze et al., 1987). Similar to the fo analysis, median and smoothing filters were applied to the resultant F1 and F2 contours for each segment to eliminate spurious peaks.
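A sketch of the LPC formant step is shown below. The LPC order is left as a caller-supplied argument because the exact orders used for the two fo ranges are not recoverable from the text; the 0.98 pre-emphasis coefficient, the 0–4000 Hz evaluation grid, and the 90 Hz floor are likewise assumptions. The lpc, hamming, and freqz functions require the Signal Processing Toolbox.

```matlab
% Sketch of F1/F2 measurement from one voiced segment via LPC analysis.
function [F1, F2] = lpc_formants(seg, fs, order)
    seg = filter([1 -0.98], 1, seg);              % pre-emphasis (assumed 0.98)
    win = round(0.05 * fs);  hop = round(0.0125 * fs);
    w   = hamming(win);
    f   = (0:5:4000)';                            % 5 Hz evaluation grid
    nF  = floor((length(seg) - win) / hop) + 1;
    F1  = nan(nF, 1);  F2 = nan(nF, 1);
    for k = 1:nF
        x = seg((k-1)*hop + (1:win)) .* w;
        a = lpc(x, order);                        % LPC coefficients
        H = abs(freqz(1, a, f, fs));              % LPC filter response
        pk = find(H(2:end-1) > H(1:end-2) & H(2:end-1) > H(3:end)) + 1;
        pk = pk(f(pk) > 90);                      % discard near-DC peaks
        if numel(pk) >= 2                         % first two peaks = F1, F2
            F1(k) = refine_peak(f, H, pk(1));
            F2(k) = refine_peak(f, H, pk(2));
        end
    end
end

function fpk = refine_peak(f, H, i)
    % Parabolic interpolation through three points around a spectral peak
    d   = 0.5 * (H(i-1) - H(i+1)) / (H(i-1) - 2*H(i) + H(i+1));
    fpk = f(i) + d * (f(i+1) - f(i));
end
```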
Because different analyses were used to measure the time-varying fundamental frequency and formants, the F1 and F2 tracks were resampled to be time-synchronous with the corresponding fo contour. This resampling of the formant tracks allows the fo and formant tracks to be directly compared at every time sample.
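Assuming t_fm holds the LPC frame times and t_fo the fo cycle times from the earlier steps (names are illustrative), this resampling reduces to a simple interpolation:

```matlab
% Resample the formant tracks onto the fo time base for sample-by-sample
% comparison; linear interpolation is an assumption, not from the text.
F1s = interp1(t_fm, F1, t_fo, 'linear');
F2s = interp1(t_fm, F2, t_fo, 'linear');
ratio = F1s ./ fo;          % used later for the STCD probability analysis
```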
Linear prediction techniques (cf. Makhoul, 1975; Markel & Gray, 1976) are commonly used as a component in speech analysis algorithms because they can provide reasonably accurate measurements of formant frequencies of adult speech (Monsen & Engebretsen, 1983; Vallabha & Tuller, 2002). In part, the success of LPC algorithms is due to the ample harmonics produced by the adult speaking voice, which adequately sample the vocal tract transfer function and express it clearly as the envelope of the speech spectrum. High fundamental frequencies, however, generate widely spaced harmonic components, producing an apparent undersampling of the vocal tract transfer function (cf. Lindblom, 1962; Kent, 1976). Measuring the formant frequencies from the spectral envelope becomes difficult in such cases because the spectral envelope peaks are strongly influenced by individual harmonic amplitudes rather than by the overall effect of many closely-spaced harmonics. The effect of high fo on LPC-based formant measurement was explained in Klatt (1986), and the limitations of LPC were more recently addressed by Chen et al. (2019) and Whalen et al. (2022). Such limitations are acknowledged for the present study, particularly for female speech, and are the reason for the extensive ground truth testing of the analysis algorithms.
An example analysis is shown in Fig. 2 for the simulated male and female speech-like cases. Figs. 2a and 2d are analogous to Figs. 1a and 1d, but now show the measured fundamental frequency as the thick gray contour (along with harmonics shown as light gray lines), and the measured formants as thick black contours for a 13 second segment. Visual comparison suggests that the measured quantities in Fig. 2 are similar to the ground-truth simulation-based quantities in Fig. 1. To quantify the accuracy of the measurements across the entire 65 seconds of simulated speech, they are plotted against the ground truth values in the middle and right columns of Fig. 2, where a perfect match would be indicated by all points falling along the diagonal.
FIG. 2.

Measurement of formants and fundamental frequency, along with correlation analysis of measured quantities and ground-truth values based on simulated speech-like cases (compare to Figs. 1a and 1d). Mean absolute error (MAE) and mean absolute percentage error (MAPE) provide an additional measure of accuracy. Although the time axes in (a) and (d) are limited to 13 seconds, the correlation plots and measurements comprise 65 seconds of simulated speech. (Color online)
In Fig. 2b and 2e are the assessments of the fundamental frequency measurements, where the correlation of measured and ground truth values and the slope of the line through the points (dashed line) are shown on each subplot. For both the male and female versions of the simulations, the correlation and slope are essentially equal to 1, indicating that the time-varying patterns of the measurements are nearly perfectly aligned with the ground truth contours. In addition, the mean absolute error (MAE) and mean absolute percentage error (MAPE) of the measured quantities relative to the known (prescribed) values are shown in the upper corner of each plot, and indicate errors less than 1 Hz and less than 1 percent, respectively.
Similar comparisons are shown in Fig. 2c and 2f for the measurements of the formants where, in each panel, both F1 and F2 are plotted against the ground truth vocal tract resonances, fR1 and fR2, across 65 seconds of simulated speech; correlation values and slopes of the data clusters are indicated for each formant. In the case of the male simulation (Fig. 2c), the correlation values are just less than 1.0 and the slopes are just greater than 1.0, indicating that the measured formants are slightly overestimated. The measured formants for the female simulation (Fig. 2f) diverge a bit more from the ground truth values than for the male, but still the correlation is just less than 1.0 and the slope values are just greater than 1.0, indicating, again, a slight overestimation of the formants. The error measures are shown in each of these plots for both the first and second formants. MAE values are less than 15 Hz and 30 Hz, respectively, for the male and female simulations, whereas the MAPE values for the male are less than 2 percent and well below 5 percent for the female.
Figure 3 shows comparisons of measured formants and ground truth vocal tract resonances across the 65 seconds of the male and female simulations with a constant fo (Fig. 3a and 3c) and for the cases where a harmonic was always tuned to the first resonance frequency (Fig. 3b and 3d). The correlation coefficients range from 0.976 to 0.996, and the slopes range from 0.997 to 1.168. As would be expected, there is somewhat larger divergence from the ground truth value when either the second or third harmonic is tuned to the first resonance frequency because the interaction of the voice source model with the vocal tract can modify the relative amplitudes of the harmonics in the vicinity of a resonance, potentially making the formant estimate of the resonance frequency less accurate. The error measures shown in each plot also indicate increased error with increased fundamental frequency due to the wider spacing of the harmonics. The fundamental frequency measurements for all four cases were nearly as accurate as shown in Fig. 2, hence they are not shown graphically here (i.e., MAE < 1.5 Hz, and MAPE < 0.6 % for all cases).
FIG. 3.

Correlation analysis of measured formant frequencies and calculated ground-truth vocal tract resonance frequencies based on four simulated cases. (a) Male, constant fo = 100 Hz; (b) Male, 3fo = fR1; (c) Female, constant fo = 200 Hz; and (d) Female, 2fo = fR1. (Color online)
C. Assessment of source-tract coupling
For any given speech sample, the measured contours fo, F1, and F2 can be used to provide visualization and quantification of the vowel space and to indicate whether a talker tended toward aligning a harmonic with a vocal tract resonance while speaking. Of interest for this study were visualizations of the spaces formed by grouping [F1, F2] and [fo, F1], where the first is the traditional vowel space and the other indicates the relation of the fundamental frequency to the first formant frequency. Following the approach detailed in Story and Bunton (2017), the density of data points within each of the two spaces was calculated to provide a third dimension that allows the data to be displayed as a color map showing a talker’s use of the space. These density plots are called vowel space density for the [F1, F2] space and source-tract coupling density for the [fo, F1] space.
Vowel space density plots were generated by first normalizing F1 and F2 at every time sample relative to their median values determined over the entire duration of the signal such that,
FnN(t) = [Fn(t) − F̃n] / F̃n,  0 ≤ t ≤ T,  n = 1, 2 | (1)
where T is the duration of the formant measurements, F̃n is the median of Fn over [0, T], and n is the formant number. This normalization scheme was used so that the origin of the vowel space was equivalent to the median value of the formants, and the range of both formant axes was roughly constrained to be within the interval [−1, 1]. To determine the density of the normalized vowel space, a circular field of view, or kernel, with radius equal to 0.05, which is unitless in the normalized space, was moved throughout a grid spanning the range [−1, 1.5] in both the F1N and F2N dimensions with an increment of 0.01. With the center of the kernel positioned at a given point in the grid, the number of [F1N, F2N] pairs that fell within the boundaries of the kernel was counted; this value was logged as the density at that grid point. Once the density at every point in the grid was determined, each density value was divided by the maximum across the entire grid so that the density range was [0, 1] regardless of the total number of formant pairs in the vowel space, and then transformed to a 20log10 scale for viewing. The final visual representation will be referred to as the VSDn (i.e., normalized vowel space density).
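The density computation can be sketched as a direct kernel count over the grid, as below; F1n and F2n are assumed to be vectors of normalized formant values from Eq. (1), and the brute-force double loop favors clarity over speed.

```matlab
% Sketch of the VSDn density map: count [F1n, F2n] pairs inside a circular
% kernel (radius 0.05) centered at each point of a [-1, 1.5] grid.
g = -1:0.01:1.5;                       % grid for both normalized axes
r = 0.05;                              % kernel radius (unitless)
D = zeros(numel(g), numel(g));
for i = 1:numel(g)                     % F1n dimension (columns)
    for j = 1:numel(g)                 % F2n dimension (rows)
        D(j,i) = sum((F1n - g(i)).^2 + (F2n - g(j)).^2 <= r^2);
    end
end
DdB = 20*log10(D / max(D(:)) + eps);   % normalize to [0,1], view in dB
```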
Figures 4a and 4c show the vowel space density plots for the roughly 65 seconds of the male and female speech-like simulations. The darkest red portions correspond to the largest densities and indicate that the talker traversed this region of the vowel space often during the time course of the analyzed signal. The density scale on the right side of each plot is shown in dB; this is simply to denote that a 20log10 scale was applied to the data for viewing, and is unrelated to sound pressure level or voice intensity. The normalization scheme (Eqn. 1) allows for direct comparison across talkers regardless of gender or other idiosyncratic differences. The white line around the vowel space within each plot is the “conforming boundary” at a density level of −30 dB and allows for measurement of the vowel space area in normalized units (i.e., unitless). The area measurements shown in Fig. 4a and 4c are quite similar across the male and female simulated talkers (0.81 and 0.87, respectively), as expected; they are not, however, exactly the same because the scaling of the computational model from male to female is not linear and the randomization processes for generating the vowel transitions and fo contours were unique.
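One simple way to approximate the conforming boundary and its enclosed area, given the grid g and density map DdB from the sketch above, is to trace the −30 dB contour and integrate it; the published conforming-boundary algorithm may differ in detail, and only the first contour segment is used here.

```matlab
% Sketch: trace the -30 dB boundary of the density map and measure the
% enclosed, unitless vowel space area (VSAn).
C  = contourc(g, g, DdB, [-30 -30]);   % contour matrix at a single level
n  = C(2, 1);                          % points in the first segment
xb = C(1, 2:n+1);                      % boundary F1n coordinates
yb = C(2, 2:n+1);                      % boundary F2n coordinates
VSAn = polyarea(xb, yb);               % normalized vowel space area
```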
FIG. 4.

Normalized vowel space density (VSDn) and source-tract coupling density (STCD) plots for the male and female speech-like simulations. The density scale on the right side of each plot is shown in dB; this is simply to denote that a 20log10 scale was applied to the data for viewing, and is unrelated to sound pressure level or voice intensity. (Color online)
Density plots were similarly generated for the [fo, F1] space, referred to as the source-tract coupling density (STCD). Because there is no normalization across frequency, the units are in Hz and the kernel radius was set to 10 Hz. As was done for the VSDn plots, each density value was divided by the maximum across the entire grid so that the density range is [0, 1] regardless of the total number of [fo, F1] pairs in the STCD space, and then transformed to a 20log10 scale for viewing. STCD plots for the male and female speech-like simulations are shown in Figs. 4b and 4d, respectively. The white inclined lines indicate where the fundamental frequency or its harmonics would align with the first formant frequency. For the male simulation (Fig. 4b), the fo range can be seen to be limited to 70–130 Hz, resulting in only occasional crossing of the fourth harmonic with F1; this is expected since there was no attempt in this case to align harmonics with F1. For the female speech-like simulation, the fo range was 150–260 Hz, which does provide some coincidental alignment of the second, third, and fourth harmonics with F1.
Shown in Figs. 5a and 5c are STCD plots for the constant fo simulations, which resulted in vertical bars at fo = 100 Hz for the male version and fo = 200 Hz for the female version, both spanning the range of F1 in their respective simulations. STCD plots for the male and female simulations with either the third or second harmonic aligned with the first formant, respectively, are shown in Figs. 5b and 5d. In both cases, the density lies mostly along the 3fo line for the male and the 2fo line for the female. For the male, however, there is an upward divergence of F1 away from the 3fo line when the fundamental frequency is above 200 Hz, demonstrating some degree of overestimation of the formants by the LPC algorithm as the distance between harmonics increases. Considering the extreme nature of the harmonic/resonance aligned simulations, all four cases in Fig. 5 provide, more or less, the expected result based on the simulation parameters, providing further validation of the signal analysis approach.
FIG. 5.

Source-tract coupling density (STCD) plots for male and female simulations. (a) Male, constant fo of 100 Hz; (b) Male, 3fo aligned with fR1; (c) Female, constant fo of 200 Hz; (d) Female, 2fo aligned with fR1. See Fig. 4 caption regarding interpretation of density scale. (Color online)
D. Analysis of speech recording database
The analysis approach detailed in the previous sections was applied to a database of audio recordings collected by Ferguson (2004). This database is comprised of speech samples collected from 41 adult talkers (21 female, 20 male) whose native language was American English. Each talker was recorded speaking two lists of 188 sentences, one list spoken in a conversational speaking style and the other spoken as clear speech. Within each list were three types of sentences: 1) /bVd/ words in which “V” was one of ten vowels, each word set in seven neutral sentence frames for a total of 70 sentences, 2) consonant-vowel-consonant words chosen or adapted from the Northwestern University Auditory Test, No. 6 (NU-6; Tillman and Carhart, 1966), each set in two neutral sentence frames for a total of 104 sentences, and 3) 14 additional sentences from the Central Institute for the Deaf (CID) Everyday Sentences test (Davis and Silverman, 1978). Ultimately, each talker produced the same material in both the conversational and clear conditions. The 188 sentences from each talker in each speaking style were trimmed and concatenated, resulting in 82 audio files (conversational and clear for each of the 41 talkers), each containing between seven and ten minutes of connected speech.
For each talker’s conversational and clear speech samples, the analysis procedure generated VSDn and STCD plots, along with normalized vowel space area. Although the present study did not specifically seek to investigate all possible differences between the two speaking styles, they were chosen because it was hypothesized that clear speech may involve a more volitional attempt by the talkers to synchronize the sound production at the larynx with the vocal tract resonances. Hence, the goal of the analysis was to provide visual, side-by-side, comparison of the two styles so that at least a qualitative degree of source-tract coupling may be assessed.
III. RESULTS
Density plots for the simulated samples, as shown in the previous section, provided a compelling visualization of the extreme conditions, where source-filter coupling was either non-existent or nearly constant. For human talkers, there is naturally more variability. In this section, VSDn and STCD plots are presented for two male (M04, M10) and two female (F01, F14) talkers from the Ferguson (2004) database. The density plots of these four talkers are featured specifically because they demonstrated alignment of one or more harmonics with the first formant, and hence possible source-filter synchronization. Figures 6 and 7 are based on analysis of the two male talkers and Figures 8 and 9 are based on the two female talkers; in each figure the left column is the analysis of conversational speech and the right column is clear speech.
FIG. 6.

VSDn and STCD plots for talker M04. See Fig. 4 caption regarding interpretation of density scale. (Color online)
FIG. 7.

VSDn and STCD plots for talker M10. See Fig. 4 caption regarding interpretation of density scale. (Color online)
FIG. 8.

VSDn and STCD plots for talker F01. See Fig. 4 caption regarding interpretation of density scale. (Color online)
FIG. 9.

VSDn and STCD plots for talker F14. See Fig. 4 caption regarding interpretation of density scale. (Color online)
For talker M04, shown in Fig. 6, the VSDn increases only slightly in the clear speaking style relative to the conversational. Comparison of the STCD spaces for each style, however, indicates that the talker shifted to aligning the second harmonic with the first formant much of the time during clear speech, as indicated by the red portion of the color map falling along the 2fo line in the plot. Interestingly, such an alignment required the talker to increase his fundamental frequency range from about 100–170 Hz in the conversational style to 100–240 Hz in clear speech.
Fig. 7 shows that talker M10 did not increase the vowel space area at all from conversational to clear speech, but fairly dramatically shifted the fundamental frequency range and harmonic/formant alignment as seen in the STCD plots. The talker used a limited range of fo, mostly under 100 Hz, while speaking conversationally, but expanded his fo range to 90–220 Hz in clear speech while spending much of the talking time with the second, third, or fourth harmonic aligned with F1.
Female talker F01 increased her vowel space area somewhat during clear speech relative to conversational as can be seen in the VSDn plots shown in Figs. 8a and 8b. Much like talker M10, however, the more distinct speaking style changes are observable in the STCD plots in Figs. 8c and 8d. Although some alignment of the second and third harmonics with F1 can be seen for conversational speech in Fig. 8c, the effect is more pronounced in Fig. 8d for clear speech and is accompanied by an expanded fundamental frequency range.
Talker F14, whose VSDn and STCD plots are shown in Fig. 9, is a case where there is almost no change from conversational to clear speech with regard to either vowel space area or alignment of harmonics with F1. This is because in both speech types the talker produces a fairly large vowel space (Figs. 9a and 9b) along with observable alignment of the first four harmonics with F1. For this talker, conversational speech has essentially the same quality as her clear speech; hence, there is little change to observe in shifting from one style to the other.
As further quantification of the degree to which a harmonic is aligned with F1, a histogram can be calculated based on the ratio F1/fo and then filtered to provide a smooth curve. The result can be shown as the probability of F1/fo taking on any particular value between 0 and 10 (i.e., this covers the possibility of the first 10 harmonics of the voice source being aligned with F1); the probability can be interpreted as the percentage of the signal duration for which F1/fo assumes any particular value.
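A sketch of this computation follows; the bin width and smoothing span are assumptions, since the text specifies only the 0–10 ratio range, and F1s and fo are the time-synchronous tracks described in Sec. II.B.

```matlab
% Sketch of the F1/fo probability curve: percent of voiced duration spent
% at each ratio value, lightly smoothed to a continuous curve.
edges  = 0:0.05:10;                         % ratio bins (width assumed)
counts = histcounts(F1s ./ fo, edges);
prob   = 100 * counts / sum(counts);        % percent of signal duration
probSm = movmean(prob, 7);                  % smoothing span assumed
ctrs   = edges(1:end-1) + diff(edges)/2;
plot(ctrs, probSm);
xlabel('F1 / fo'); ylabel('Percent of duration');
```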
Figure 10 shows the probability curves based on the STCD plots for talkers M04, M10, F01, and F14, where the dotted line represents conversational speech and the solid line is for clear speech. For M04 (Fig. 10a), the probability curves quantify what could be seen qualitatively in Figs. 6c and 6d: the talker shifted from some alignment of the third harmonic with F1 in the conversational style to a strong alignment of 2fo with F1 in clear speech. The plot shows that, during clear speech, M04 spent 30 percent of the speaking duration with the second harmonic aligned with F1, suggesting a strong synchronization of the voice source and vocal tract. The probability curves for M10 in Fig. 10b show that the probability for conversational speech is distributed across F1/fo values ranging from about 2.5 to 8 with no particular alignment or tuning pattern. The clear speech curve, however, indicates a shift to alignment of 2fo, 3fo, and 4fo with F1, thus confirming the qualitative impression observed in Fig. 7.
FIG. 10.

Probability curves that indicate percentage of time that talkers M04, M10, F01, and F14 produced speech while aligning a harmonic with F1. (Color online)
The probability curves in Fig. 10c show that, for conversational speech, talker F01 spent about 4, 18, and 12 percent of the signal duration with the first, second, and third harmonics aligned with F1, respectively, but for clear speech there is a shift to slightly more alignment of the first and second harmonics with F1. Talker F14 was a case where the conversational and clear speech density plots in Fig. 9 were similar, but both styles demonstrated strong alignment of the first four harmonics with F1. The probability curves in Fig. 10d are indeed quite similar but do show small shifts, with some harmonics more likely to be aligned with F1 in clear speech than in conversational speech and others less likely.
As a means of capturing harmonic alignment with F1 across all 41 talkers in the database, Figure 11 shows four collections of probability curves where the gray lines are those of individual talkers, the thick black line is the mean, and the blue shaded area indicates ± one standard deviation. Shown in Fig. 11a are all the probability curves for the male talkers producing sentences in conversational speech, whereas Fig. 11b shows the curves calculated for their clear speech. Probability curves for the female talkers’ conversational and clear speech are similarly shown in Figs. 11c and 11d. It can be noted that, for the male talkers, the mean probability curve representing conversational speech is somewhat more broadly distributed across the x-axis, and contains peaks that are less prominent than in clear speech, whereas for the female talkers there appears to be little difference in the mean probability curves between the two speaking styles.
FIG. 11.

Probability curves that indicate percentage of time that the 41 talkers in the Ferguson (2004) database produced speech while aligning a harmonic with F1. In each subplot, the thin gray lines are the probability curves of each talker, the thick black line is the mean probability curve, and the shaded area indicates ± one standard deviation. (Color online)
The clear and conversational speech probability curves may be more easily compared by plotting differences between them. Shown in Fig. 12a are functions that result from subtracting the mean probability curve of conversational speech from that of clear speech, for both male and female talkers. The solid curve, representing male talkers, indicates a shift toward alignment of the second harmonic, 2fo, with the first formant, F1, when speech is produced with the clear speech instruction; the negative portion of the curve shows that harmonics above 2fo are aligned with F1 less in clear speech than in conversational speech. The difference curve for female talkers shows similar features, but the actual differences are much smaller than for males. Plotted in Fig. 12b are differences between the maximum value (highest peak) of the clear and conversational probability curves for each individual talker (i.e., the light curves in Fig. 11). Sorted in ascending order, these plots show that well over half of both male (60%) and female (57.1%) talkers had a peak of higher amplitude in the clear speech condition than in the conversational style.
FIG. 12.

Comparison of conversational and clear speech based on probability differences between the speaking styles for the talkers in the Ferguson (2004) database. (a) Difference obtained by subtracting the mean probability curve of conversational speech from that of clear speech, for both male (solid line) and female (dotted line) talkers. (b) Differences between the maximum value (highest peak) of the clear and conversational probability curves for each individual talker, sorted in ascending order. (Color online)
As a final assessment of differences between conversational and clear speech across the 41 talkers in the database, Fig. 13 shows the normalized vowel space areas (VSAn) measured for each case. In Fig. 13a, each curve in the figure indicates the VSAn values in one of the four conditions (i.e., female-clear, female-conversational, male-clear, male-conversational), where the values in the conversational conditions (male and female) are plotted in ascending order (shown with solid data points). Each talker’s corresponding VSAn value in the clear speech condition is represented by the open circles. For 16 of the 20 male talkers (80%), the vowel space is larger in the clear speech condition than for conversational speech; the same is true for 18 of the 21 female talkers (86%). In Fig. 13b, all four sets of data are shown in ascending order, which provides a clearer picture of the overall trends; note that the x-axis is labeled as number of talkers, to indicate that the four values at any point along the x-axis may be from different talkers. In this plot it can be observed that the VSAn is around 30 percent larger for female talkers than for males in both the conversational and clear speech styles. It can also be seen that VSAn tends to be larger in the clear speech style for both male and female talkers.
FIG. 13.

Normalized vowel space areas, VSAn, for the 41 talkers in the Ferguson (2004) database in conversational and clear speech conditions. (a) The values in the conversational conditions (male and female) are plotted in ascending order (shown with solid data points); each talker’s corresponding VSAn value in the clear speech condition is represented by the open circles. (b) All four sets of VSAn data are shown in ascending order. (Color online)
IV. DISCUSSION
This study presented a method for determining the presence of possible synchronization of the voice source with the vocal tract resonances in minutes-long recordings of speech. Specifically, the fundamental frequency and the first two formant frequencies measured over the duration of the audio sample were presented in two types of density plots. The first type is the normalized vowel space called the VSDn, from which the size of the vowel space was measured with a conforming boundary algorithm (similar to a convex hull). The second type, called the source-tract coupling density or STCD, provides visualization of the degree to which a harmonic of the fo is aligned with F1. The STCD plots were quantified with probability curves that indicate the percentage of signal duration that the ratio F1/fo took on any particular value, where harmonic alignment is indicated by peaks in the probability curves.
The analysis technique was first applied to a set of artificial speech signals generated by a computational model of speech production (cf., Story and Bunton, 2019; Story, 2013). These signals provided test material representative of adult male and female speech, as well as speech generated with either an unnaturally high or unnaturally low degree of synchronization of the fo and the first vocal tract resonance. The latter cases were intended to provide a particularly harsh test of the method. Results of the simulation study showed the analysis technique to be successful at obtaining measurements that were reasonably well matched to the known “ground-truth” values of the model-generated signals.
Although the formant tracking technique was deemed to be reasonably accurate for purposes of this study, the limitations of using LPC as the basis for measuring formants were apparent in the ground-truth simulated cases. For example, when 2fo for the female simulation and 3fo for the male simulation were prescribed to follow the same time course as fR1, the first formant frequency, F1, was somewhat overestimated (see Fig. 3). This likely results from the interaction of the glottal flow with the vocal tract reactance, which alters the relative amplitudes of several harmonics when one or more of the harmonics is near a resonance (cf. Maxfield et al., 2017). Even small changes in the distribution of harmonic amplitudes, particularly when the fo is high and harmonic spacing is large, could slightly change the shape of the spectral envelope generated by the LPC algorithm, thus potentially locating a formant at a frequency that is slightly above (or below) the actual frequency of the resonance.
Following the model-based, ground-truth validation, the analysis technique was applied to the speech of 41 talkers who comprised a database collected by Ferguson (2004), where each talker produced the same speech protocol in both conversational and clear speaking styles. Although the interest in the present study was not to specifically investigate differences between conversational and clear speech, having two different speaking styles produced by the same talkers provided an opportunity to observe the ways in which talkers might modify the acoustic characteristics of their speech in response to instructions. That is, did the talkers change the degree of alignment of harmonics with the first formant, and did they modify their vowel space in one speaking style versus the other? The VSDn and STCD density plots, demonstrated for four of the talkers in the database, indicated a general trend toward an increase in harmonic alignment with F1 as well as increased vowel space area in clear versus conversational speech. A visually salient feature of the STCD density plots is that they become more “glove-like” as the incidences of harmonic-F1 alignment increase, where the “fingers of the glove” are the regions in the plot where a harmonic of fo and F1 coincide.
Calculating probability curves from the STCD plots allowed for quantification of how frequently harmonics and F1 were aligned. In their raw form, these showed a wide range of variability across talkers but, in general, female talkers appear likely to align harmonics with F1 in both speaking styles, whereas male talkers increase the alignment in the clear speech condition. The difference between the clear and conversational mean probability curves showed a distinct shift of the male talkers toward a condition of aligning 2fo with F1. Overall results showed that well over half (58.5%) of the talkers in the database had a higher probability of aligning a harmonic with F1 in the clear speech condition than when speaking conversationally. In addition, 82.9% of the talkers increased their vowel space area in clear versus conversational speech. Interestingly, Ferguson (2004, see Table I) showed that 82.9% of the talkers were judged to have higher vowel intelligibility in clear versus conversational speech, exactly the same proportion as the vowel space area results here.
Although the analysis of the speech samples indicated clear alignment of harmonics and F1, there appeared to be no evidence of vocal instabilities in either the conversational or clear speech data, as might be expected based on previous studies. Both Titze et al. (2008) and Maxfield et al. (2017) used fo glides combined with various vocal tract configurations and reported fo jumps and phonation gaps when a harmonic crossed over a formant. Not all participants in their studies exhibited these instabilities, however, suggesting that source-tract interaction varies considerably from person to person, and perhaps even from one vocalization to the next, even in unnatural experimental conditions. Based on similar experiments, Wade et al. (2017) observed few instabilities in the condition where the participants maintained volitional control of their vocal tract configuration. The apparent absence of instabilities in the conversational and clear speech samples analyzed in the present study suggests that, when a talker has the freedom to configure the vocal tract and control the fo, there is a natural tendency to position the fo and its harmonics just below resonance frequencies to avoid undesirable instability in phonation while still benefitting from their close proximity.
These results provide some insight into how talkers modify their production of speech to meet varying communicative demands. When asked to produce clear speech, specifically with the instruction “...so that a hearing-impaired person would be able to understand you” (Ferguson, 2004; p. 2366), some talkers responded by increasing their vowel space, others by increasing harmonic alignment with F1, and some did both. These modifications generally had the effect of increasing vowel intelligibility and perceived clarity for both listeners with normal hearing and listeners with hearing loss (Ferguson, 2004, 2012; Ferguson and Morgan, 2018). This could result from increased range and precision of articulatory movement, increase in overall intensity (although all speech in the Ferguson database had been scaled to the same peak intensity for the perceptual study), and synchronization of a voice source harmonic with time-varying changes of formant frequencies in connected speech. An alignment between the second harmonic and the first resonance frequency of the vocal tract was a common synchronization observed in the analysis. This is a well-known interaction in outdoor calling (Titze et al., 2020) and belting in musical theatre singing (cf. Titze et al., 2011). If the fundamental frequency is raised to be on the order of 300–400 Hz and the first resonance frequency is around 600–800 Hz for an open vowel, this synchronization is useful for more intense sound production (cf. Herbst & Story, 2022). For lower fundamental frequencies, synchronization of a higher harmonic with the first resonance was also found to be a frequent choice, suggesting that intonation in speech can follow the formant frequency contours, perhaps as a means of enhancing intelligibility and clarity. As might be suggested from the STCD plots in Figs. 6–9, enhancement of the alignment of harmonics with the first resonance could perhaps be achieved simply by increasing the fo range and variability used during speaking. Use of a wide range of fo would indeed, by itself, increase the “opportunities” for alignment to occur, but the “glove-like” STCD plots, indicating strong alignment, would not likely be produced unless the talker volitionally modulated fo and the vocal tract configuration to do so.
Although the methods used in this investigation are not able to discern whether source-tract interaction occurs in a speech sample, the analyses suggest that talkers do adjust 1) the degree to which they synchronize the characteristics of the voice source with the vocal tract resonances, and 2) the size of their vowel space. Based on these results it could be hypothesized that the primary objectives in producing sound for speech communication are to maximize power transfer and information transfer. To achieve both objectives, however, would require precise volitional coordination of the voice source and the vocal tract by the talker to permit strong interaction between source and filter for power transfer while simultaneously producing distinct, clear vowels and consonants for information transfer. In this preliminary study, the two objectives were not prospectively revealed to the talkers (i.e., this was a database of recordings collected independently of this study). They simply produced clear speech with an intent to be well understood by a hearing-impaired person. To the talkers, that could have meant louder speech, slower speech, or more carefully articulated speech. Future studies will attempt to differentiate the two objectives with additional analyses that incorporate sound pressure level and articulation rate differences across speaking styles. In addition, signals such as the electroglottograph (EGG) could be used to enhance the accuracy of fo measurements and provide additional measures such as contact index and contact quotient to characterize vibratory characteristics across different speaking conditions.
V. CONCLUSIONS
This study provides evidence that humans can optimize their vocalization to align formant frequencies with harmonics of the sound source. Although the talkers who were analyzed were instructed to produce conversational or clear speech, it seems unlikely that they were consciously deciding to make adjustments based on alignment of harmonics and formants. Instead, it appears plausible that the talkers’ actions toward a specific speech quality goal allowed the voice system to self-organize, optimizing beneficial interactions between the source and filter. Future research will investigate if and how subjects may optimize their vocalizations in response to external context or directions to target their voice production (e.g., speak loudly or speak clearly).
ACKNOWLEDGEMENTS
Research supported by NIH R01DC017998.
Footnotes
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
No new human subjects data were collected for this study. The recordings that were analyzed were from the Ferguson Clear Speech Database (Ferguson, 2004).
¹The term “resonance frequency” and the associated notation, fRn, are used here to denote a physical property of a particular vocal tract configuration, whereas “formant frequency” and its notation, Fn, refer to the enhancement of energy in the speech signal as a result of the interaction of the resonances with a sound source (Titze et al., 2015).
DATA AVAILABILITY
Audio files of the speech-like simulations and ground-truth data of the fundamental frequency and vocal tract resonances are available from the corresponding author upon reasonable request. The recordings in the Ferguson Clear Speech Database (Ferguson, 2004) are available upon reasonable request from author Sarah Hargus Ferguson.
REFERENCES
- Chen WR, Whalen DH, and Shadle CH (2019). F0-induced formant measurement errors result in biased variabilities. J. Acoust. Soc. Am., 145(5), EL360–EL366. 10.1121/1.5103195
- Chiba T, and Kajiyama M (1941). The Vowel: Its Nature and Structure. Phonetic Society of Japan.
- Davis H, and Silverman SR (1978). Hearing and Deafness, 4th ed. Holt, Rinehart and Winston, New York.
- Echternach M, Herbst CT, Köberlein M, Story B, Döllinger M, and Gellrich D (2021). Are source-filter interactions detectable in classical singing during vowel glides? J. Acoust. Soc. Am., 149(6), 4565–4578. 10.1121/10.0005432
- Fant G (1960). The Acoustic Theory of Speech Production. Mouton, The Hague.
- Fant G, and Lin Q (1987). Glottal voice source-vocal tract acoustic interaction. Q. Prog. Status Rep. STL-QPSR, 4, 13–27.
- Ferguson SH (2004). Talker differences in clear and conversational speech: Vowel intelligibility for normal-hearing listeners. J. Acoust. Soc. Am., 116(4), 2365–2373. 10.1121/1.1788730
- Ferguson SH (2012). Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss. J. Speech Lang. Hear. Res., 55(3), 779–790. 10.1044/1092-4388(2011/10-034
- Ferguson SH, and Morgan SD (2018). Talker differences in clear and conversational speech: Perceived sentence clarity for young adults with normal hearing and older adults with hearing loss. J. Speech Lang. Hear. Res., 61, 159–173. 10.1044/2017_JSLHR-H-17-0082
- Flanagan JL (1972). Speech Analysis Synthesis and Perception, 2nd ed. Springer-Verlag, Berlin Heidelberg. 10.1007/978-3-662-01562-9
- Flanagan JL, and Landgraf L (1968). Self-oscillating source for vocal tract synthesizers. IEEE Trans. Audio Electroacoust., AU-16, 57–64. 10.1109/TAU.1968.1161949
- Herbst CT, and Story BH (2022). Computer simulation of vocal tract resonance tuning strategies with respect to fundamental frequency and voice source spectral slope in singing. J. Acoust. Soc. Am., 152(6), 3548–3561. 10.1121/10.0014421
- Ishizaka K, and Flanagan JL (1972). Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell Syst. Tech. J., 51, 1233–1268. 10.1002/j.1538-7305.1972.tb02651.x
- Kent RD (1976). Anatomical and neuromuscular maturation of the speech mechanism: Evidence from acoustic studies. J. Speech Hear. Res., 19, 421–447.
- Klatt DH (1986). Representation of the first formant in speech recognition and in models of the auditory periphery. In Units and Their Representation in Speech Recognition: Proceedings, Montreal, 5–7.
- Liljencrants J (1985). Speech Synthesis with a Reflection-Type Line Analog. DS dissertation, Dept. of Speech Comm. and Music Acous., Royal Inst. of Tech., Stockholm, Sweden.
- Lindblom B (1962). Accuracy and limitations of sonagraph measurements. Proc. Fourth Intl. Cong. Phon. Sci., Helsinki. The Hague: Mouton, 188–202.
- Makhoul J (1975). Linear prediction: A tutorial review. Proc. IEEE, 63(4), 561–580. 10.1109/proc.1975.9792
- Markel JD, and Gray AH (1976). Linear Prediction of Speech. Springer, Berlin. 10.1007/978-3-642-66286-7
- The MathWorks Inc. (2023). MATLAB, version R2023b. Natick, MA: The MathWorks Inc. https://www.mathworks.com
- Maxfield L, Palaparthi A, and Titze I (2017). New evidence that nonlinear source-filter coupling affects harmonic intensity and fo stability during instances of harmonics crossing formants. J. Voice, 31(2), 149–156. 10.1016/j.jvoice.2016.04.010
- Monsen RB, and Engebretson AM (1983). The accuracy of formant frequency measurements: A comparison of spectrographic analysis and linear prediction. J. Speech Hear. Res., 26, 89–97. 10.1044/jshr.2601.89
- Rothenberg M (1987). Cosi fan tutte and what it means or nonlinear source-tract acoustic interaction in the soprano voice and some implications for the definition of vocal efficiency. In Laryngeal Function in Phonation and Respiration, edited by Baer T, Sasaki C, and Harris KS. College-Hill Press, Little, Brown and Company, Boston, pp. 254–269.
- Sondhi MM, and Schroeter J (1987). A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Speech Signal Process., ASSP-35(7), 955–967. 10.1109/tassp.1987.1165240
- Stevens KN (2000). Acoustic Phonetics. MIT Press, Cambridge, MA. 10.1121/1.1327577
- Story BH (1995). Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract. Ph.D. dissertation, University of Iowa, Iowa City, IA.
- Story BH, Laukkanen A-M, and Titze IR (2000). Acoustic impedance of an artificially lengthened and constricted vocal tract. J. Voice, 14(4), 455–469. 10.1016/s0892-1997(00)80003-x
- Story BH, and Bunton K (2013). Production of child-like vowels with nonlinear interaction of glottal flow and vocal tract resonances. Proc. Mtgs. Acoust., 19(1). 10.1121/1.4798416
- Story BH (2013). Phrase-level speech simulation with an airway modulation model of speech production. Comput. Speech Lang., 27(4), 989–1010. 10.1016/j.csl.2012.10.005
- Story BH, and Bunton K (2017). Vowel space density as an indicator of speech performance. J. Acoust. Soc. Am., 141(5), EL458–EL464. 10.1121/1.4983342
- Story BH, Vorperian H, Bunton K, and Durtschi R (2018). An age-dependent vocal tract model for males and females based on anatomic measurements. J. Acoust. Soc. Am., 143(5), 3079–3102. 10.1121/1.5038264
- Story BH, and Bunton K (2019). A model of speech production based on the acoustic relativity of the vocal tract. J. Acoust. Soc. Am., 146(4), 2522–2528. 10.1121/1.5127756
- Tillman T, and Carhart RC (1966). An expanded test for speech discrimination utilizing CNC monosyllabic words: N.U. Auditory Test No. 6. USAF School of Aerospace Medicine Report No. SAM-TR-66-55.
- Titze IR, Horii Y, and Scherer RC (1987). Some technical considerations in voice perturbation measurements. J. Speech Hear. Res., 30, 252–260.
- Titze IR (1988). The physics of small-amplitude oscillation of the vocal folds. J. Acoust. Soc. Am., 83, 1536–1552. 10.1121/1.395910
- Titze IR (2002). Regulating glottal airflow in phonation: Application of the maximum power transfer theorem to a low dimensional phonation model. J. Acoust. Soc. Am., 111, 367–376. 10.1121/1.1417526
- Titze IR, Riede T, and Popolo P (2008). Nonlinear source-filter coupling in phonation: Vocal exercises. J. Acoust. Soc. Am., 123(4), 1902–1915. 10.1121/1.2832339
- Titze IR (2008). Nonlinear source-filter coupling in phonation: Theory. J. Acoust. Soc. Am., 123(5), 2733–2749. 10.1121/1.2832337
- Titze IR, Worley AS, and Story BH (2011). Source-vocal tract interaction in female operatic singing and theater belting. J. Singing, 67(5), 561–572.
- Titze IR, Baken RJ, Bozeman KW, et al. (2015). Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization. J. Acoust. Soc. Am., 137(5), 3005–3007. 10.1121/1.4919349
- Titze IR, Blake D, and Wodzak J (2020). Intelligibility of long-distance emergency calling. J. Voice, 34(1), 44–52. 10.1016/j.jvoice.2018.08.008
- Vallabha GK, and Tuller B (2002). Systematic errors in the formant analysis of steady-state vowels. Speech Comm., 38, 141–160. 10.1016/s0167-6393(01)00049-8
- Wade L, Hanna N, Smith J, and Wolfe J (2017). The role of vocal tract and subglottal resonances in producing vocal instabilities. J. Acoust. Soc. Am., 141(3), 1546–1559. 10.1121/1.4976954
- Whalen DH, Chen WR, Shadle CH, and Fulop SA (2022). Formants are easy to measure; resonances, not so much: Lessons from Klatt (1986). J. Acoust. Soc. Am., 152(2), 933–941. 10.1121/10.0013410
- Xie Z, and Niyogi P (2006). Robust acoustic-based syllable detection. In Interspeech Proc., 1571–1574, Sept. 17–21, Pittsburgh, PA.
