Vocal tract modes based on multiple area function sets from one speaker

Brad H Story

doi:10.1121/1.3082263

Author manuscript; available in PMC: 2009 May 11.

Published in final edited form as: J Acoust Soc Am. 2009 Mar 4;125(4):141–147. doi: 10.1121/1.3082263

PMCID: PMC2677261 NIHMSID: NIHMS110398 PMID: 19354352

Abstract

The purpose of this study was to derive vocal tract modes from a wider range of vowel area functions for a specific speaker than has been previously reported. Area functions from Story et al. [(1996). J. Acoust. Soc. Am. 100, 537–554] and Story [(2008). J. Acoust. Soc. Am. 123, 327–335] were combined in a composite set from which modes were derived with principal component analysis. Along with scaling coefficients, these modes were used to generate a [F1, F2] formant space. In comparison to formant spaces similarly generated based on the two area function sets alone, the combined version provides a wider range of both F1 and F2 values. This new set of modes may be useful for inverse mapping of formant frequencies to area functions or for modeling of vocal tract shape changes.

Introduction

For production of vowels, the vocal tract area function has been shown to be fairly well represented by two canonical deformation patterns, or “modes,” (e.g., Story and Titze, 1998; Story, 2005b; Mokhtari et al., 2007) derived from principal component analysis (PCA). Such modes have been derived from speaker-specific area function sets, but are similarly shaped (in terms of their variation along the length of the vocal tract) across speakers, and are related to specific patterns of formant frequencies when appropriately scaled and superimposed on a mean area function.

The modal representation of the vocal tract shape allows for an essentially one-to-one mapping between the scaling coefficients of the modes and the first two formant frequencies and has led to a method for mapping time-varying formant frequencies extracted from recorded speech to a time-varying sequence of area functions (Story and Titze, 1998; Story and Titze, 2002; Mokhtari et al., 2007). Development of a kinematic model of the vocal tract area function that can be used to simulate speech is also based on this same modal representation (Story, 2005a). Although both the inverse mapping technique and kinematic model have been shown to be reasonably successful at bridging the area function-to-acoustic relation, the ability of the modal representation to produce a wide range of vocal tract shapes is limited by the original set of area functions on which it is based. For example, the modal representation derived in Story and Titze (1998) and subsequently utilized in Story and Titze (2002) and Story (2004, 2005b) was based on the ten vowel area functions reported for an adult male (Story et al., 1996). Hence, the boundaries of the potential [F1, F2] vowel space afforded by the modal representation are essentially defined by the formant frequencies of those original area functions. Although additional sets of speaker-specific area functions reported by Story (2005b) and Mokhtari et al. (2007) have been used to derive modal representations of individual speakers, they are also limited by the range of formant frequencies produced by the original area functions obtained from each speaker. This is not necessarily a severe limitation assuming that each speaker produced hyperarticulated vowels (thus, producing extreme [F1, F2] formant values). If the vowels tend toward being centralized, however, the working space of a modal-based vocal tract model will be constrained.

Recently, Story (2008) reported an additional set of 11 vowel area functions for the same speaker who produced the 10 vowels originally reported in Story et al. (1996). F1 and F2 formant values were determined from the frequency response function calculated for each area function in the two sets. It was noted that, in general, the F2 values calculated for the new area function set were shifted downward in frequency relative to those calculated for the area functions of Story et al. (1996),¹ as shown in Fig. 1. It was suggested that this difference originated from a tendency, in the new area functions, to slightly constrict the pharyngeal portion of the vocal tract while expanding the oral cavity. Since the new area functions are configured somewhat differently than the 1996 versions and produce different [F1, F2] values, they provide additional samples of the same speaker’s vocal tract that might be combined with the original set in a PCA. The resulting vocal tract modes and mean area function would then be representative of a wider range of vowel configurations than either area function set alone.

Figure 1. Vowel space plot of F1 and F2 frequencies calculated from the A96 and A08 versions of each vowel (see text). The data points are connected by solid or dashed lines to clarify the set to which they belong and to provide a rough outline of the possible vowel space. Because of their positions in the F2 vs F1 plane, the two [ʌ] vowels are not connected to the other vowels within their respective sets. The large open circles denote those vowels chosen from each set that were combined in a composite set.

The purpose of this letter is to demonstrate that a modal representation derived from a particular combination of area functions selected from the data of Story et al. (1996) and Story (2008) produces an [F1, F2] vowel space that is larger than that produced by modes derived from either area function set alone.

Principal component analysis

Area function sets from both Story et al. (1996) and Story (2008) were used in the analysis. The area functions of Story et al. (1996) included the vowels [i ɪ ε æ ʌ ɑ ɔ o ʊ u] and will henceforth be referred to as “A96.” In addition to these same ten vowels, the Story (2008) data include the vowel [e]; this set of 11 area functions will be referred to as “A08.”

Any given area function can be represented by two components, an area vector and a length increment. The area vector A(i) contains 44 cross-sectional areas,² assumed to represent a concatenation of tubelets ordered consecutively from glottis to lips. The index i denotes this ordering and extends from 1 to 44. The length increment Δ is the distance between consecutive cross-sectional areas and can be considered to be the tubelet length. The 44-section area vectors and length increments, as used in the present study, were reported in Story and Titze (1998) for A96³ and in Story (2008) for A08. The A96 set, however, was smoothed for this study in the same manner as the A08 set (see Story, 2008, p. 328).

As mentioned in the Introduction, the calculated [F1, F2] formant frequencies for each of the two area function sets are shown in Fig. 1. The open circles indicate the vowels in the plot that are representative of the boundaries of the displayed vowel space, regardless of set membership, and their corresponding area functions served as the combined data set on which a new PCA was performed. This combined set is referred to as “A9608x” and specifically includes [i ɪ ε æ ɑ] from A96 and [æ ɑ ɔ o ʊ u] from A08, where the “x” denotes that the combined set excludes some of the vowels in A96 and A08. Note that the target vowels [æ] and [ɑ] from both A96 and A08 are included in the combined set because they represent part of the overall boundary of the vowel space.

The collection of area vectors for a given set (i.e., A96, A08, or A9608x) can be represented in matrix form as A(i,j), where i is the area index and j denotes the particular vowel in the clockwise order indicated by the points along the dashed or solid lines for A96 and A08, respectively, or by the large circles in Fig. 1 for A9608x (i.e., j=1 for [i], j=2 for [ɪ],…, j=11 for [u]). Following Story (2005b), the PCA was performed on the equivalent diameters of the cross-sectional areas rather than on the areas themselves. In addition, the length of each area function was included in the same manner as Yehia et al. (1996) and Mokhtari et al. (2007) where the variance of the length increment Δ_j is first normalized by the largest variance of the equivalent diameters, and then becomes the 45th element. Thus, a matrix D(i,j), containing diameter and length information, is constructed as

D (i, j) = \begin{cases} \sqrt{\frac{4}{π} A (i, j)} & \text{for } i = 1, \dots, 44, \\ \frac{(Δ_{j} - \bar{Δ}) σ_{D}}{σ_{Δ}} + \bar{Δ} & \text{for } i = 45, \end{cases}

(1)

where Δ_j is the length increment of area vector j, $\bar{Δ}$ is the mean length increment, σ_Δ is the standard deviation of the length increments, and σ_D is the largest standard deviation within any section i of the equivalent diameters.
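As an illustration of Eq. (1), the following is a minimal NumPy sketch of how the 45-row matrix D(i,j) might be assembled. The function name `build_diameter_matrix` and the randomly generated demo data are hypothetical and are not values from the study. The choice of population versus sample standard deviation is an assumption; it does not affect row 45, because only the ratio σ_D/σ_Δ enters.

```python
import numpy as np

def build_diameter_matrix(areas, lengths):
    """Assemble D(i, j) following Eq. (1).

    areas   : (44, J) array of cross-sectional areas, one column per vowel
    lengths : (J,) array of tubelet length increments, one per vowel
    """
    areas = np.asarray(areas, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    diam = np.sqrt(4.0 / np.pi * areas)      # equivalent diameters, rows 1..44
    sigma_D = diam.std(axis=1).max()         # largest std. dev. over the 44 sections
    mean_len = lengths.mean()
    sigma_len = lengths.std()                # assumed nonzero (lengths differ across vowels)
    row45 = (lengths - mean_len) * sigma_D / sigma_len + mean_len
    return np.vstack([diam, row45]), (mean_len, sigma_len, sigma_D)

# Hypothetical demo data (NOT the A96/A08 area functions).
rng = np.random.default_rng(0)
areas_demo = rng.uniform(0.2, 6.0, size=(44, 11))
lengths_demo = rng.uniform(0.36, 0.42, size=11)
D_demo, (mean_len, sigma_len, sigma_D) = build_diameter_matrix(areas_demo, lengths_demo)
```

After this rescaling, the 45th row has the same mean as the original length increments but a spread equal to σ_D, so the length information is weighted comparably to the most variable diameter.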

Matrix D(i,j) can be represented by a mean and variable part,

D (i, j) = Ω (i) + α (i, j),

(2)

where Ω(i) is the mean vector across D(i,j), and α(i,j) is the variation superimposed on Ω(i) to produce a specific vector. The PCA was then carried out by calculating the eigenvectors of a covariance matrix formed with α(i,j). The specific implementation was essentially the same method as reported in Story and Titze (1998), and results in the following parametric representation of the original D(i,j) matrix:

\hat{D} (i, j) = [Ω (i) + \sum_{k = 1}^{N} q_{k} (j) ϕ_{k} (i)], i = [1, N] (N = 45), j = [1, 11],

(3)

where k indexes the modes and the ϕ_k(i)’s are 45-element eigenvectors (modes) that, when multiplied by the appropriate scaling coefficients q_k(j), will reconstruct each area vector in A(i,j) for i=[1,44] and each length increment for i=45 by the following:

\hat{A} (i, j) = \frac{π}{4} \hat{D} {(i, j)}^{2}, i = [1, 44], j = [1, 11],

(4a)

\hat{Δ} (j) = \frac{(\hat{D} (45, j) - \bar{Δ}) σ_{Δ}}{σ_{D}} + \bar{Δ}, j = [1, 11] .

(4b)
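The decomposition and reconstruction in Eqs. (2)–(4) can be sketched as follows. This is an illustrative eigen-decomposition of the covariance of α(i,j) using `numpy.linalg.eigh`; the exact implementation of Story and Titze (1998) is not reproduced here, and the function names and synthetic input data are hypothetical.

```python
import numpy as np

def pca_modes(D):
    """PCA of a (45, J) matrix D: mean vector, modes, coefficients, variance (%)."""
    omega = D.mean(axis=1)                    # mean vector, Omega(i)
    alpha = D - omega[:, None]                # variable part, Eq. (2)
    cov = alpha @ alpha.T / D.shape[1]        # 45 x 45 covariance matrix
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]           # re-sort: largest variance first
    evals, modes = evals[order], evecs[:, order]
    q = modes.T @ alpha                       # scaling coefficients, one row per mode
    pct = 100.0 * evals / evals.sum()         # percent variance per mode
    return omega, modes, q, pct

def reconstruct(omega, modes, q, n_modes, mean_len, sigma_len, sigma_D):
    """Rebuild areas and length increments from the first n_modes modes."""
    D_hat = omega[:, None] + modes[:, :n_modes] @ q[:n_modes]            # Eq. (3)
    A_hat = np.pi / 4.0 * D_hat[:44] ** 2                                # Eq. (4a)
    len_hat = (D_hat[44] - mean_len) * sigma_len / sigma_D + mean_len    # Eq. (4b)
    return A_hat, len_hat

# Hypothetical input data (NOT the published area functions).
rng = np.random.default_rng(1)
areas = rng.uniform(0.2, 6.0, size=(44, 11))
lengths = rng.uniform(0.36, 0.42, size=11)
diam = np.sqrt(4.0 / np.pi * areas)
sigma_D, mean_len, sigma_len = diam.std(axis=1).max(), lengths.mean(), lengths.std()
D = np.vstack([diam, (lengths - mean_len) * sigma_D / sigma_len + mean_len])

omega, modes, q, pct = pca_modes(D)
A_full, len_full = reconstruct(omega, modes, q, 45, mean_len, sigma_len, sigma_D)
A_two, len_two = reconstruct(omega, modes, q, 2, mean_len, sigma_len, sigma_D)
```

With all 45 modes the reconstruction is exact up to rounding; keeping only the first two modes gives the kind of low-order approximation on which the rest of the analysis relies.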

Area functions generated from the mean diameter function Ω(i) of each set A96, A08, and A9608x [i.e., with all scaling coefficients in Eq. 3 set to zero] are plotted against the normalized distance⁴ from the glottis in Fig. 2a. It is noted that the area function based on the combined set (thick line) is more constricted in the pharyngeal portion (from approximately 0.2 to 0.6 along the length) and more expanded in the oral portion (from 0.6 to 1) than either of the other two.

Figure 2. Mean area functions and modes derived for each of the three area function sets A96, A08, and A9608x: (a) mean area functions and (b) ϕ₁ and ϕ₂ modes where the thick line corresponds to set A9608x, the thin line to A96, and the dashed line to A08.

The two modes that accounted for most of the variance in the analysis of the three area function sets [referred to as ϕ₁(i) and ϕ₂(i)] are shown in Fig. 2b, again plotted against the normalized distance from the glottis. Note that the modes have been smoothed by fitting them with eighth-order polynomials. This simplifies the visual comparison of the modes but maintains their gross characteristics. As in Fig. 2a, the thick, thin, and dashed lines represent the A9608x, A96, and A08 sets, respectively. The variances accounted for by each mode in each of the three sets are given in Table 1. The first mode ϕ₁ accounts for greater than 64% of the variance, whereas ϕ₂ accounts for about 20%. The total amount of the variance accounted for by these two modes was greatest for the A9608x set with 89.6%. Qualitatively there are some minor differences in the amplitude of the modes across the three sets, but the overall shape is essentially the same. That is, with a positive scaling coefficient, ϕ₁ will expand the front half of the vocal tract and constrict the back half, and vice-versa with a negative scaling coefficient. When positively scaled, the second mode imposes an expansion near the lips, followed by a constriction, an expansion, and another constriction above the glottis; the opposite effect occurs with a negative scaling coefficient.
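The eighth-order polynomial smoothing mentioned above could be carried out along the following lines. This sketch assumes the fit is taken over the 44 diameter elements against normalized distance from the glottis; how the 45th (length) element was treated is not stated in the text, so it is left out here. The function name `smooth_mode` and the test curve are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def smooth_mode(mode, order=8):
    """Least-squares polynomial fit of a mode over normalized distance [0, 1]."""
    mode = np.asarray(mode, dtype=float)
    x = np.linspace(0.0, 1.0, mode.size)   # normalized distance from the glottis
    coeffs = P.polyfit(x, mode, order)     # coefficients, lowest order first
    return P.polyval(x, coeffs)

# Hypothetical check: a cubic curve is already within the span of an
# eighth-order polynomial, so smoothing should leave it unchanged.
x_demo = np.linspace(0.0, 1.0, 44)
raw_demo = 1.0 - 3.0 * x_demo + 2.0 * x_demo ** 3
smoothed_demo = smooth_mode(raw_demo)
```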

Table 1.

Percentage of the total variance in the matrix D (i,j) accounted for by the two most significant modes (in their smoothed form) calculated for each of the sets A96, A08, and A9608x.

Mode	A96	A08	A9608x
ϕ₁	64.8	69.8	65.9
ϕ₂	21.3	18.5	23.7
Total	86.1	88.3	89.6


Vowel space calculations

A mapping was generated for each area function set that relates the two scaling coefficients q₁ and q₂ to the first two formant frequencies (F1 and F2). Based on the PCA performed on each set, an equal increment continuum was generated for each mode coefficient that ranged from their respective minimum to maximum values. The increments were specified as

Δ q_{1} = \frac{q_{1}^{\max} - q_{1}^{\min}}{M - 1},

(5a)

Δ q_{2} = \frac{q_{2}^{\max} - q_{2}^{\min}}{N - 1},

(5b)

where M and N are the numbers of points along each coefficient dimension, so that each continuum spans M−1 (or N−1) equal increments; N here is unrelated to the N=45 of Eq. 3. For this study M=N=80. The coefficient continua were then generated by

q_{1 m} = q_{1}^{\min} + m Δ q_{1}, m = 0, \dots, M - 1,

(6a)

q_{2 n} = q_{2}^{\min} + n Δ q_{2}, n = 0, \dots, N - 1,

(6b)

with m and n serving as indices along each continuum. By modifying Eq. 3 to contain only two modes and eliminating the dependence on a specific vowel (i.e., jth vowel), an area vector and length increment can be directly generated with

A_{m n} (i) = \frac{π}{4} {[Ω (i) + q_{1 m} ϕ_{1} (i) + q_{2 n} ϕ_{2} (i)]}^{2}, i = [1, 44],

(7a)

Δ_{m n} = \frac{(Ω (i) + q_{1 m} ϕ_{1} (i) + q_{2 n} ϕ_{2} (i) - \bar{Δ}) σ_{Δ}}{σ_{D}} + \bar{Δ}, i = 45 .

(7b)
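Equations (5)–(7) amount to sampling each coefficient on an evenly spaced continuum and evaluating a two-mode area function at every grid point. The sketch below illustrates this; the mean vector, mode shapes, coefficient limits, and scaling constants are hypothetical placeholders, not the values obtained in the study.

```python
import numpy as np

def coefficient_continuum(q_min, q_max, n_points):
    """Equal-increment continuum from q_min to q_max, Eqs. (5) and (6)."""
    step = (q_max - q_min) / (n_points - 1)
    return q_min + step * np.arange(n_points)

def area_function(omega, phi1, phi2, q1, q2, mean_len, sigma_len, sigma_D):
    """Area vector and length increment for one coefficient pair, Eq. (7)."""
    d = omega + q1 * phi1 + q2 * phi2                                # 45 elements
    area = np.pi / 4.0 * d[:44] ** 2                                 # Eq. (7a)
    length = (d[44] - mean_len) * sigma_len / sigma_D + mean_len     # Eq. (7b)
    return area, length

# Hypothetical mean vector, modes, and scaling constants (illustration only).
x = np.linspace(0.0, 1.0, 44)
mean_len, sigma_len, sigma_D = 0.40, 0.02, 0.60
omega = np.append(np.full(44, 1.5), mean_len)
phi1 = np.append(-0.3 * np.cos(np.pi * x), 0.05)
phi2 = np.append(0.2 * np.sin(2.0 * np.pi * x), -0.05)

q1 = coefficient_continuum(-3.0, 3.0, 80)       # M = 80 points
q2 = coefficient_continuum(-1.5, 1.5, 80)       # N = 80 points
grid = [(a, b) for a in q1 for b in q2]         # 80 x 80 coefficient pairs
area0, len0 = area_function(omega, phi1, phi2, 0.0, 0.0, mean_len, sigma_len, sigma_D)
```

Setting both coefficients to zero returns the mean area function and the mean length increment, which is a convenient sanity check on any implementation.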

Shown in Fig. 3a is the 80×80 coefficient grid for the A9608x set along with the coefficient pairs that reconstruct the original 11 vowels used in the PCA for this set. The squares and circles represent those vowels extracted from the A96 and A08 sets, respectively, to make up the combined set. Also shown in this figure are the outlines of coefficient grids based on the separate PCAs of the A96 (solid line) and A08 (dashed line) sets. The coefficient ranges are similar for the A9608x and A08 sets, but the range for the A96 set is contracted relative to the other two. It is noted that all three sets have nearly the same upper limit of the q₂ coefficient.

Figure 3. Coefficient and [F1, F2] grids. (a) The grid in the background, bounded by the thin line, represents the coefficient pairs generated for the A9608x set. The squares denote the coefficient pairs corresponding to the vowels chosen from the A96 set whereas the circles are those chosen from the A08 set. The thick solid and dashed lines indicate the outlines of the coefficient grids generated separately for the A96 (solid) and A08 (dashed) sets, respectively. (b) The deformed grid in the background, bounded by the thin line, represents the [F1, F2] space generated from the coefficient grid in (a). The squares and circles denote the calculated [F1, F2] values for the A96 and A08 vowels, respectively. The thick solid and dashed lines represent the outlines of the [F1, F2] spaces generated from coefficient grids for the A96 and A08 sets, respectively.

For each of the 6400 coefficient pairs in the grid (i.e., every intersection point), an area vector and length increment were generated. A frequency response function of each resulting area function was then calculated with a frequency-domain technique based on cascaded “ABCD” matrices (Sondhi and Schroeter, 1987; Story et al., 2000). This calculation included energy losses due to yielding walls, viscosity, heat conduction, and acoustic radiation at the lips; side branches such as the piriform sinuses were not considered. Formant frequencies (F1 and F2) were determined by finding the peaks in the frequency response functions with an automated peak-picking algorithm (Titze et al., 1987). These operations were carried out separately for the coefficient grids corresponding to all three area function sets (A96, A08, and A9608x).
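To make the chain-matrix idea concrete, the following is a deliberately simplified sketch. Unlike the calculation described above, it assumes a lossless tube with an ideal open end at the lips (zero radiation impedance) and no wall, viscous, or thermal losses. Under those assumptions the glottis-to-lips volume-velocity transfer function is 1/D, where D is the lower-right element of the cascaded ABCD matrix, so formants appear as zero crossings of D rather than as finite peaks. The physical constants and function names are assumptions, and the values produced will differ from those of a lossy model.

```python
import numpy as np

C_SOUND = 35000.0   # speed of sound in cm/s (assumed value)
RHO = 0.00114       # air density in g/cm^3 (assumed value)

def chain_matrix_D(areas, length, freqs):
    """D element of the cascaded lossless ABCD matrix at each frequency."""
    k = 2.0 * np.pi * freqs / C_SOUND
    cos_kl, sin_kl = np.cos(k * length), np.sin(k * length)
    K = np.tile(np.eye(2, dtype=complex), (freqs.size, 1, 1))
    for a in areas:                          # tubelets ordered glottis to lips
        z = RHO * C_SOUND / a                # characteristic impedance of the tubelet
        T = np.empty((freqs.size, 2, 2), dtype=complex)
        T[:, 0, 0] = cos_kl
        T[:, 0, 1] = 1j * z * sin_kl
        T[:, 1, 0] = 1j * sin_kl / z
        T[:, 1, 1] = cos_kl
        K = K @ T                            # batched 2x2 matrix product
    return K[:, 1, 1].real                   # D is real in the lossless case

def formants(areas, length, fmax=5000.0, df=1.0):
    """Resonances as zero crossings of D, refined by linear interpolation."""
    freqs = np.arange(df, fmax, df)
    d = chain_matrix_D(np.asarray(areas, dtype=float), length, freqs)
    idx = np.where(d[:-1] * d[1:] < 0)[0]
    return freqs[idx] - d[idx] * (freqs[idx + 1] - freqs[idx]) / (d[idx + 1] - d[idx])

# Check against a uniform tube closed at the glottis: F_n = (2n - 1) c / (4 L).
total_len = 17.0                             # cm, hypothetical tract length
est = formants(np.full(44, 3.0), total_len / 44)
f1_expected = C_SOUND / (4.0 * total_len)
```

For the uniform tube the cross-sectional area cancels out and the first two resonances fall at c/(4L) and 3c/(4L), which gives a quick check that the matrix cascade is ordered and multiplied correctly.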

The resulting formant space for A9608x is shown in Fig. 3b. With the exception of a small amount of overlap along the upper edge, this represents a one-to-one mapping between [q₁,q₂] coefficient pairs and [F1, F2] formant pairs, similar to those demonstrated in previous publications (e.g., Story and Titze, 1998; Story, 2005b). The formant pairs corresponding to the 11 vowel area functions in the A9608x set are shown as squares and circles. The formant grid outlines for the A96 and A08 sets are also shown in this figure as solid and dashed lines, respectively. Note that the A08 outline is shifted downward relative to that for A96, as would be expected based on the data shown previously in Fig. 1. More importantly, the formant grid based on A9608x encompasses nearly all of the space outlined by both the A96 and A08 sets.
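Because the mapping is essentially one-to-one, a measured [F1, F2] pair can be traced back to a coefficient pair. The sketch below uses a brute-force nearest-grid-point lookup purely as an illustration; the inverse-mapping work cited above may well interpolate within the grid instead. The affine formant grid used here is a hypothetical stand-in for one computed acoustically, and `nearest_coefficients` is an invented name.

```python
import numpy as np

def nearest_coefficients(f1_grid, f2_grid, q1, q2, f1_target, f2_target):
    """Return the (q1, q2) grid point whose formants lie closest to the target."""
    dist2 = (f1_grid - f1_target) ** 2 + (f2_grid - f2_target) ** 2
    m, n = np.unravel_index(np.argmin(dist2), dist2.shape)
    return q1[m], q2[n]

# Hypothetical, invertible stand-in for a computed formant grid (Hz).
q1 = np.linspace(-4.0, 4.0, 80)
q2 = np.linspace(-2.0, 2.0, 80)
Q1, Q2 = np.meshgrid(q1, q2, indexing="ij")
F1 = 500.0 + 60.0 * Q1 + 20.0 * Q2
F2 = 1500.0 - 150.0 * Q1 + 200.0 * Q2

q1_hat, q2_hat = nearest_coefficients(F1, F2, q1, q2, F1[10, 20], F2[10, 20])
```

A lookup of this kind is only well defined where the grid does not fold over itself, which is why the small overlap along the upper edge of the formant space matters for inverse mapping.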

Discussion

Finding that the formant grid for the A9608x set is larger than that calculated for either the A96 or A08 set suggests that the modes based on this combined set may be more useful for purposes of inverse mapping and vocal tract modeling. In the case of inverse mapping, a wider variation in [F1, F2] formant frequencies can be accommodated within the formant space and thus related to a wider range of vocal tract area function shapes. For modeling vocal tract shape change as a component of synthesizing speech (cf. Story, 2005a), the modes from the combined set allow for production of a wider range of formant frequencies than those from the other two sets, and could potentially facilitate more natural sounding synthesis.

The similarity of the mode shapes across each of the three area function sets (A96, A08, and A9608x) does, however, raise the question of whether it would be possible to produce an expanded formant space for either the A96 or A08 sets simply by allowing the mode coefficient values to exceed the ranges produced by the PCA. For example, could the A96 modes and mean diameter function produce a formant space similar to the grid shown in Fig. 3b if the A9608x coefficient space were used instead of the more limited set of A96 coefficients? It is almost certain that expanding the coefficient ranges will have the effect of enlarging the formant space. In fact, Story (2004) used this approach to model a compensation for a lip-tube constraint imposed on the vocal tract shape. But expanding the coefficient range will always be limited by the degree of constriction produced within the vocal tract. That is, increasing the magnitude of either the q₁ or q₂ coefficient will ultimately result in occlusion of the vocal tract, and possibly in unrealistically short or long tract lengths.
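The occlusion limit described above can be illustrated for a single mode: since each diameter is Ω(i) + qϕ(i), the coefficient q can grow only until some section reaches zero diameter. The sketch below computes that range for one mode at a time, with the other coefficient held at zero, which is a simplification. The mean diameter, mode shape, and function name are hypothetical.

```python
import numpy as np

def occlusion_limits(omega, phi):
    """Range (q_lo, q_hi) of one coefficient keeping all 44 diameters positive."""
    omega = np.asarray(omega, dtype=float)[:44]
    phi = np.asarray(phi, dtype=float)[:44]
    with np.errstate(divide="ignore"):
        ratio = -omega / phi                 # q at which each section closes
    q_hi = ratio[phi < 0].min() if np.any(phi < 0) else np.inf
    q_lo = ratio[phi > 0].max() if np.any(phi > 0) else -np.inf
    return q_lo, q_hi

# Hypothetical mode that constricts the back and expands the front for q > 0.
x = np.linspace(0.0, 1.0, 44)
omega_demo = np.full(44, 1.5)                # mean equivalent diameters, cm
phi_demo = -0.5 * np.cos(np.pi * x)
q_lo, q_hi = occlusion_limits(omega_demo, phi_demo)
```

In this toy case the tract closes at the glottal end for q = 3 and at the lip end for q = −3, so any coefficient range extended beyond those values would produce a non-physical area function.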

Acknowledgments

This research was supported by NIH Grant No. R01-DC04789.

Footnotes

¹ F1 and F2 formant frequencies measured from audio recordings also indicated the same downward shift of F2 for the Story (2008) study relative to Story et al. (1996). The magnitude of the shift, however, was less than that observed in the calculated formants.

² The use of 44 elements to represent an area function has been adopted by the author in previous publications, but is not a requirement. The number derives from the approximate spatial resolution obtained in MRI-based reconstructions of vocal tract shape (e.g., Story et al., 1996). It is also convenient to use 44 elements for simulating speech with acoustic waveguide models because it allows for a sampling frequency of 44.1 kHz when the tract length is approximately 17.5 cm (typical adult male).

³ The area functions in the A96 set were indeed originally reported in Story et al. (1996) but were provided in a form with a variable number of sections (tubelets) where the length increment was constant across all vowels. Story and Titze (1998) resampled the area functions so that each area vector consisted of 44 sections, but the length increment was vowel dependent.

⁴ Normalized distance is used for displaying the reconstructed area functions and modes because the actual vocal tract length is variable and depends on a given combination of mode scaling coefficients [see Eq. 7].

References and links

Mokhtari, P., Kitamura, T., Takemoto, H., and Honda, K. (2007). “Principal components of vocal tract area functions and inversion of vowels by linear regression of cepstrum coefficients,” J. Phonetics 35, 20–39. doi:10.1016/j.wocn.2006.01.001
Sondhi, M. M., and Schroeter, J. (1987). “A hybrid time-frequency domain articulatory speech synthesizer,” IEEE Trans. Acoust., Speech, Signal Process. ASSP-35, 955–967. doi:10.1109/TASSP.1987.1165240
Story, B. H. (2004). “On the ability of a physiologically-constrained area function model of the vocal tract to produce normal formant patterns under perturbed conditions,” J. Acoust. Soc. Am. 115, 1760–1770. doi:10.1121/1.1689347
Story, B. H. (2005a). “A parametric model of the vocal tract area function for vowel and consonant simulation,” J. Acoust. Soc. Am. 117, 3231–3254. doi:10.1121/1.1869752
Story, B. H. (2005b). “Synergistic modes of vocal tract articulation for American English vowels,” J. Acoust. Soc. Am. 118, 3834–3859. doi:10.1121/1.2118367
Story, B. H. (2008). “Comparison of magnetic resonance imaging-based vocal tract area functions obtained from the same speaker in 1994 and 2002,” J. Acoust. Soc. Am. 123, 327–335. doi:10.1121/1.2805683
Story, B. H., Laukkanen, A.-M., and Titze, I. R. (2000). “Acoustic impedance of an artificially lengthened and constricted vocal tract,” J. Voice 14, 455–469. doi:10.1016/S0892-1997(00)80003-X
Story, B. H., and Titze, I. R. (1998). “Parameterization of vocal tract area functions by empirical orthogonal modes,” J. Phonetics 26, 223–260. doi:10.1006/jpho.1998.0076
Story, B. H., and Titze, I. R. (2002). “A preliminary study of voice quality transformation based on modifications to the neutral vocal tract area function,” J. Phonetics 30, 485–509. doi:10.1006/jpho.2002.0168
Story, B. H., Titze, I. R., and Hoffman, E. A. (1996). “Vocal tract area functions from magnetic resonance imaging,” J. Acoust. Soc. Am. 100, 537–554. doi:10.1121/1.415960
Titze, I. R., Horii, Y., and Scherer, R. C. (1987). “Some technical considerations in voice perturbation measurements,” J. Speech Hear. Res. 30, 252–260.
Yehia, H. C., Takeda, K., and Itakura, F. (1996). “An acoustically oriented vocal-tract model,” IEICE Trans. Inf. Syst. E79-D, 1198–1208.
