
Keywords: motor learning, sensorimotor adaptation, speech intelligibility, speech motor control, vowel contrast
Abstract
When auditory feedback is perturbed in a consistent way, speakers learn to adjust their speech to compensate, a process known as sensorimotor adaptation. Although this paradigm has been highly informative for our understanding of the role of sensory feedback in speech motor control, its ability to induce behaviorally relevant changes in speech that affect communication effectiveness remains unclear. Because reduced vowel contrast contributes to intelligibility deficits in many neurogenic speech disorders, we examined human speakers’ ability to adapt to a nonuniform perturbation field designed to affect vowel distinctiveness, applying a shift that depended on the vowel being produced. Twenty-five participants were exposed to this “vowel centralization” feedback perturbation in which the first two formant frequencies were shifted toward the center of each participant’s vowel space, making vowels less distinct from one another. Speakers adapted to this nonuniform shift, learning to produce corner vowels with increased vowel space area and vowel contrast to partially overcome the perceived centralization. The increase in vowel contrast occurred without a concomitant increase in duration and persisted after the feedback shift was removed, including after a 10-min silent period. These findings establish the validity of a sensorimotor adaptation paradigm to increase vowel contrast, showing that complex, nonuniform alterations to sensory feedback can successfully drive changes relevant to intelligible communication.
NEW & NOTEWORTHY To date, the speech motor learning evoked in sensorimotor adaptation studies has had little ecological consequence for communication. By inducing complex, nonuniform acoustic errors, we show that adaptation can be leveraged to cause an increase in speech sound contrast, a change that has the capacity to improve intelligibility. This study is relevant for models of sensorimotor integration across motor domains, showing that complex alterations to sensory feedback can successfully drive changes relevant to ecological behavior.
INTRODUCTION
Real-time alterations of auditory feedback have been used extensively over the past 20 years to probe the sensorimotor control system for speech. In these studies, speech is recorded, processed, altered, and played back to participants in near real time, enabling experimental manipulation of acoustic parameters. For example, altering the first and second resonant frequencies of the vocal tract, or formants (F1/F2), can cause the perception of a different vowel (1–8). Over the course of many repetitions, participants learn to alter their speech to oppose the perturbation, a process known as sensorimotor adaptation. The ability of sensorimotor adaptation to cause rapid changes in speech without conscious control or awareness (4) has made it a promising avenue for rehabilitation in motor disorders, where current treatments rely on intensive training over an extended time period. However, this promise remains unfulfilled, as most current adaptation paradigms lack behaviorally relevant outcomes (9). In speech, the vast majority of studies to date have examined a single vowel (often in a single word), with a single perturbation in F1/F2 space. Although this paradigm can cause local changes in the production of a particular vowel, these changes have little ecological consequence for communication.
Critically, the intelligibility impairments prevalent in many speech disorders have been linked to acoustic parameters that adaptation can target: formant contrast between different vowels (10–16). An increase in this contrast is often taken as an outcome metric in research on speech rehabilitation (17–21). Because increasing contrast in acoustic space relies on changing the produced formant frequencies for different vowels, this is a particularly promising avenue for the application of sensorimotor adaptation, though it would necessarily entail more complex perturbation paradigms. A few studies have begun to move beyond the common paradigm of single perturbations applied to a single vowel. Recent work has shown that speakers will adapt to a constant perturbation applied to read sentences (3) and can simultaneously adapt to two opposite perturbations that are applied to different monosyllabic words, even when these words share the same vowel (7). Together, these results indicate that more complex paradigms, such as those that would be needed to enhance vowel contrasts globally, are learnable.
In the current study, we implemented an auditory perturbation explicitly designed to enhance the acoustic contrast between vowels. We accomplished this through a nonuniform formant perturbation paradigm that pushes all vowels toward the center of the vowel space (Fig. 1A), making them less distinct from one another. We measured how participants adapt their speech to oppose this perturbation and tested whether this leads to increased vowel contrast in the four “corner” vowels of English (/i/ as in bead, /æ/ as in bad, /ɑ/ as in bod, and /u/ as in booed). We additionally tested how these changes were retained immediately after the perturbation was removed as well as after a 10-min delay. To control for any changes in speech over the course of the experiment, all participants completed an additional control session with the same structure as the main adapt session, only with no auditory perturbation applied. We include two global measures of speech articulation as outcome variables. Vowel space area (VSA), the area of the irregular quadrilateral formed by the four corner vowels (13), is the most common metric of global vowel contrast and allows us to compare our results to previous work. We also include a measure of average vowel spacing (AVS), the average of the pairwise distances between the four vowels, which may be more sensitive to changes in vowel contrast and a better predictor of intelligibility (13). To disambiguate whether changes in speech production are caused by sensorimotor adaptation or by a general hyperarticulation mechanism that occurs in contexts that encourage clearer speech, we include measures of duration and several other speech parameters associated with clear speech (pitch range, maximum pitch, and amplitude).
Figure 1.

Experiment design. A: illustration of the perturbation field applied to speech. All perturbations point toward the center of the speaker’s vowel space, effectively centralizing their vowels. B: example spectrograms of the four target words showing the produced formants (blue), perturbed formants during the hold phase (red), and the vowel space center (yellow). C: magnitude of the perturbation applied throughout the experiment. In the adapt session (red), the perturbation during the hold phase is 50% of the 2D distance (in F1/F2 space) between the current formant values and the vowel center. In a separate control session (blue), no perturbation is applied. F1/F2, first and second resonant frequencies of the vocal tract, or formants.
MATERIALS AND METHODS
Participants
Twenty-five participants were tested in the current study (21 female/4 male, mean age ± SD: 20.4 ± 2.9 yr). This sample is slightly larger than those of previous studies of sensorimotor adaptation in speech, most of which have tested 10–20 participants (e.g., 3, 4, 8). All participants were native speakers of American English, without any reported history of neurological, speech, or hearing disorders. Participants gave informed consent before participation in the study and were compensated either monetarily or with course credit. All procedures were approved by the Institutional Review Board of the University of Wisconsin–Madison.
Auditory Perturbation
Auditory feedback was recorded, altered, and played back to participants using Audapter (22, 23). Speech was recorded at 16 kHz via a head-mounted microphone (AKG C520), digitized with a Focusrite Scarlett sound card, and sent to a desktop workstation. The speech signal was then perturbed using Audapter, which identifies the vowel formants using linear predictive coding (LPC) and filters the speech signal to introduce a shift to those formants (details of the applied shift are given in Experimental Procedures below). If no shift is applied, Audapter outputs the unmodified input signal with the same processing delay. The output of Audapter was played back to participants via closed-back circumaural headphones (Beyerdynamic DT 770) through all phases of the experiment. The measured latency of audio playback on our system was ∼18 ms. Speech was played back at a volume of ∼80 dB SPL and mixed with speech-shaped noise at ∼60 dB SPL. The noise served to mask potential perception of the participants’ own unaltered speech, which may have otherwise been perceptible through air or bone conduction.
We used a modified version of Audapter that is able to specify formant perturbations as a function of the current values of F1 and F2. A participant-specific perturbation field was calculated such that all vowels were pushed toward the center of that participant’s vowel space (Fig. 1A). This central point was defined as the centroid of the quadrilateral formed by the four corner vowels of English (/i/, /æ/, /ɑ/, /u/). The magnitude of the perturbation was defined as a percentage of the distance between the currently produced vowel formants and the center of the vowel space. The magnitude varied across the experiment, ramping up to 50% of the distance between the current formant values and those at the center of the vowel space (Fig. 1C). Acoustically, this had the effect of centralizing all produced vowels.
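As an illustration, the mapping from a produced vowel to its centralized feedback can be sketched as follows. This is a minimal MATLAB sketch with made-up formant values and variable names; the actual field is implemented inside the modified version of Audapter and parameterized per participant.

```matlab
% Minimal sketch of the centralization field (illustrative values only; not
% the actual implementation, which runs inside the modified Audapter).
cornerF = [300 2300;   % /i/   baseline [F1 F2] in Hz (example values)
           800 1750;   % /ae/
           750 1100;   % /a/
           350  900];  % /u/
center = mean(cornerF, 1);          % vowel-space center (vertex average of the quadrilateral)

scale = 0.5;                        % hold phase: 50% of the distance to the center
Fcur  = [320 2250];                 % currently produced [F1 F2]
shift = scale * (center - Fcur);    % shift vector points toward the vowel-space center
Ffb   = Fcur + shift;               % formants the participant hears in the altered feedback
```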
Stimuli and Trial Structure
Stimuli consisted of four English words containing the four corner vowels in a /bVd/ context: bead, bad, bod, and booed (containing the vowels /i/, /æ/, /ɑ/, and /u/, respectively). Stimuli were presented on an LED computer screen, with one word presented per trial. Participants were instructed to read each word out loud as it appeared. Each stimulus was presented for 1.5 s. The interstimulus interval was randomly jittered between 0.75 and 1.5 s. All participants produced the stimuli with the intended vowels. Stimuli were randomly ordered within successive groups of four trials, such that each word occurred once per group.
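A minimal sketch of this ordering scheme (variable names and group count are illustrative):

```matlab
% Shuffle the four words within each successive group of four trials,
% so every word occurs exactly once per group.
words   = {'bead', 'bad', 'bod', 'booed'};
nGroups = 10;                                  % e.g., one 40-trial phase
order   = cell(1, 4 * nGroups);
for g = 1:nGroups
    order(4*g-3 : 4*g) = words(randperm(4));   % random order within this group
end
```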
Experimental Procedures
The experiment consisted of six phases, described below. Participants received auditory feedback through headphones in all phases of the experiment.

1. A 40-trial calibration phase, used to define a participant-specific LPC order for formant tracking in Audapter. No perturbation was given during the calibration phase.
2. A 60-trial baseline phase, used to measure each participant’s baseline formant values for the four corner vowels of English. No perturbation was given during the baseline phase. These values were used to calculate the participant-specific perturbation field.
3. A 40-trial ramp phase. During the ramp phase, the magnitude of the perturbation was increased by 5% of the distance to the vowel space center at the start of each group of four trials, up to 50% (see the schedule sketch following this list).
4. A 320-trial hold phase. During the hold phase, the magnitude of the perturbation was held at 50% of the distance to the vowel space center.
5. A 40-trial washout phase. No perturbation was given during the washout phase.
6. A 40-trial retention phase. A 10-min break was given between the washout and retention phases; participants were allowed to read during this time but not to talk. No perturbation was given during the retention phase.

A self-timed short break was given every 30 trials.
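The resulting perturbation-magnitude schedule for the adapt session (cf. Fig. 1C) can be written out as a short sketch; the code below is illustrative and assumes the trial counts listed above.

```matlab
% Perturbation magnitude per trial, as a fraction of the distance to the
% vowel-space center, across the six phases of the adapt session.
nCalib = 40; nBase = 60; nRamp = 40; nHold = 320; nWash = 40; nRet = 40;
rampSteps = repelem(0.05:0.05:0.50, 4);   % +5% at the start of each group of 4 trials
pertMag = [zeros(1, nCalib + nBase), rampSteps, 0.5 * ones(1, nHold), ...
           zeros(1, nWash + nRet)];
plot(pertMag); xlabel('Trial'); ylabel('Perturbation magnitude (fraction)');
```

In the control session, the corresponding schedule is zero throughout.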
To control for possible changes in vowel space area that may occur over the course of producing 500 words, each participant also completed a control session, which had the same structure as the adapt session but without any auditory perturbations. All participants completed both adapt and control sessions, with at least 1 wk between sessions (mean time between sessions ± SD: 8.36 ± 3.3 days). The order of the sessions was counterbalanced across participants.
After the end of the second session, participants completed a brief questionnaire that assessed their awareness of the perturbation as well as any potential strategies they used during the study. Participants were initially told there were two groups: a group that received a perturbation to the auditory feedback in both sessions and a group that did not receive a perturbation in either session (all participants received a perturbation in one session). They were asked which group they thought they were in and, if they selected the perturbed group, what they thought the perturbation was. Subsequently, participants were asked if they adopted any strategies during either session of the experiment.
Quantification and Statistical Analysis
Formant data were tracked using wave_viewer (24), which provides a MATLAB GUI interface to formant tracking using Praat (25). LPC order and pre-emphasis values were set individually for each participant. Vowels were initially identified automatically by locating the samples that were above a participant-specific amplitude threshold. Subsequently, all trials were hand-checked for errors. Errors in formant tracking were corrected by adjusting the pre-emphasis value or LPC order. Errors in the location of vowel onset and offset were corrected by hand-marking these times using the audio waveform and spectrogram. A small number of trials were excluded due to errors in production (i.e., the participant said the wrong word), disfluencies, or unresolvable errors in formant tracking. Across participants, 1.6% of trials were excluded (range: 0%–8%). Single values for F1 and F2 were measured for each trial as the average of these formants during the middle 50% of the vowel (steady-state portion). These formant values were converted to mels for calculating AVS and VSA. Because vowel production is inherently variable, both AVS and VSA were calculated in bins of 40 trials, using the average formants from 10 repetitions of each stimulus word. AVS and VSA were normalized by dividing these raw values by the values measured during the 60-trial baseline phase, giving a measure of the percentage change from baseline across the experiment.
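As a concrete illustration, the sketch below computes VSA and AVS from the mean corner-vowel formants of one 40-trial bin. The formant values, variable names, and the specific mel formula are illustrative assumptions; the published analysis code (see Data and Code Availability) is the authoritative implementation.

```matlab
% Sketch of the contrast metrics for one 40-trial bin. meanF holds the average
% [F1 F2] for each corner vowel, in the order /i/, /ae/, /a/, /u/.
hz2mel = @(f) 2595 * log10(1 + f / 700);                    % a standard Hz-to-mel conversion
meanF  = hz2mel([300 2300; 800 1750; 750 1100; 350 900]);   % illustrative Hz values

% Vowel space area: area of the quadrilateral traced in vowel order.
vsa = polyarea(meanF(:, 1), meanF(:, 2));

% Average vowel spacing: mean of the six pairwise distances between vowels.
pairs = nchoosek(1:4, 2);
avs = mean(sqrt(sum((meanF(pairs(:, 1), :) - meanF(pairs(:, 2), :)).^2, 2)));

% Normalization: divide by the same metric computed over the baseline phase,
% e.g., vsaNorm = vsa / vsaBaseline.
```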
Normalized measures were calculated for each participant during the last 40 trials of the hold phase (adaptation), during the washout phase, and during the retention phase for both adapt and control sessions. Because the control session accounts for any overall change in production over the course of 500 trials, a repeated-measures ANOVA with main effects of phase and session, as well as their interaction, was used to test for differences among the adaptation, washout, and retention phases and between the adapt and control sessions. Post hoc tests were conducted using the Tukey–Kramer method with α = 0.05 to correct for multiple comparisons. To additionally test for changes in absolute value from the baseline within each session, two-tailed t tests were used to determine whether the values in each phase differed from the baseline phase (where normalized AVS and VSA were, by definition, equal to 1). Holm–Bonferroni corrections were used to maintain an overall session-wise α of 0.05.
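One way such a phase-by-session repeated-measures model could be set up in MATLAB (Statistics and Machine Learning Toolbox) is sketched below; the table layout, variable names, and placeholder data are purely illustrative and are not the published analysis code.

```matlab
% Illustrative phase x session repeated-measures ANOVA on normalized AVS.
nSubj = 25;
avs = 1 + 0.05 * randn(nSubj, 6);                 % placeholder data: 3 phases x 2 sessions
tbl = array2table(avs, 'VariableNames', ...
    {'adaptA', 'washA', 'retA', 'adaptC', 'washC', 'retC'});
within = table( ...
    categorical(repmat({'adaptation'; 'washout'; 'retention'}, 2, 1)), ...
    categorical([repmat({'adapt'}, 3, 1); repmat({'control'}, 3, 1)]), ...
    'VariableNames', {'phase', 'session'});
rm = fitrm(tbl, 'adaptA-retC ~ 1', 'WithinDesign', within);
ranovaTbl = ranova(rm, 'WithinModel', 'phase*session');   % main effects and interaction
```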
In addition to these global measures of vowel contrast, we evaluated the changes in each of the four corner vowels independently for both the adapt and control sessions. To do this, we calculated each trial’s Euclidean distance in F1/F2 space from the center of the vowel quadrilateral in the baseline phase (see Fig. 1A). Adaptation magnitude was then defined as the change in this distance-from-the-center between the baseline phase and each of the three test phases (adaptation, washout, and retention). This procedure was performed for both the adapt and control sessions. We evaluated adaptation magnitude through repeated-measures ANOVAs with fixed effects of vowel, phase, and session, as well as their interactions. Post hoc tests were conducted using the Tukey–Kramer method with α = 0.05 to correct for multiple comparisons. Two-tailed t tests, with Holm–Bonferroni corrections, were used to determine if these values differed from the baseline value of 0.
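The distance-to-center measure underlying adaptation magnitude reduces to a Euclidean distance in F1/F2 (mel) space. A minimal sketch with made-up values (variable names are assumptions, not the authors' code):

```matlab
% Per-vowel adaptation magnitude: change in distance from the baseline
% vowel-space center (all values in mels; made-up example data).
centerBase = [600 1500];                                  % baseline vowel-space center
F_base  = centerBase + [150 400] + 10 * randn(40, 2);     % baseline-phase trials for one vowel
F_phase = centerBase + [170 430] + 10 * randn(40, 2);     % test-phase trials for the same vowel

distToCenter = @(F) sqrt(sum((F - centerBase).^2, 2));    % per-trial Euclidean distance
adaptMag = mean(distToCenter(F_phase)) - mean(distToCenter(F_base));  % >0 means farther from center
```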
To facilitate comparison with previous work (3), we additionally measured the magnitude of the changes in vowel production that directly opposed the perturbation (compensation). We first calculated the difference in F1/F2 space between the average values in each test phase and the average values in the baseline phase. This difference vector, representing change from baseline, was then projected onto the inverse of the vector defining the average perturbation for that vowel, calculated from all trials in the baseline phase, to yield the measure of compensation (Supplemental Fig. S4A; all Supplemental material is available at https://doi.org/10.6084/m9.figshare.13244378). In other words, compensation is the component of the deviation that directly opposed the formant shift. We measured compensation both in mels and as a percentage of the perturbation. This procedure was performed for both the adapt and control sessions.
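The projection step can be written compactly; the sketch below uses hypothetical numbers, as the actual perturbation vectors are participant- and vowel-specific.

```matlab
% Compensation: the component of the formant change (test phase minus
% baseline) that directly opposes the average perturbation for this vowel.
pert   = [-80 -120];                       % average applied shift, in mels (toward the center)
deltaF = [ 30   55];                       % change from baseline in [F1 F2], in mels
oppose = -pert / norm(pert);               % unit vector pointing against the perturbation
compMels = dot(deltaF, oppose);            % compensation in mels
compPct  = 100 * compMels / norm(pert);    % compensation as a percentage of the perturbation
```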
We additionally evaluated the variability of vowel production. For each participant, we measured the standard deviation of each vowel in both F1 and F2 separately for each phase in both sessions. We then used these standard deviation measurements in separate repeated-measures ANOVAs to analyze variability in F1 and F2. Each model had fixed factors of vowel, session, and phase, as well as all two-way interactions between these terms. Post hoc tests were conducted using the Tukey–Kramer method with α = 0.05 to correct for multiple comparisons.
To test for any differences in baseline vowel contrast between the two sessions, we examined raw (nonnormalized) AVS and VSA values. Repeated-measures ANOVAs with fixed factors of session and session order (adapt first vs. control first) were used to assess any potential differences.
Lastly, we evaluated any potential changes in a broad range of speech parameters that are associated with hyperarticulation and clear speech. To adapt to the vowel-centralization perturbation, participants must produce more extreme versions of vowels that lie farther away from the center of their vowel space. A similar increase in vowel space occurs when speakers pronounce words more clearly than they are normally pronounced, often referred to as hyperarticulation (26, 27). Importantly, all circumstances known to induce hyperarticulation also result in increases in vowel duration and often affect other parameters of speech as well. We measured vowel duration (in ms), maximum intensity (as measured from the root mean square signal from Audapter in arbitrary units), maximum vocal pitch (in Hz), and pitch range (in Hz). All of these measures were normalized by subtracting the average value in the baseline from the remaining trials. Repeated-measures ANOVAs were used to evaluate statistical significance.
Data and Code Availability
Analysis code is available on GitHub at https://github.com/blab-lab/vsaCentralize. Some functions rely on additional code available at https://github.com/carrien/free-speech.
RESULTS
Speakers Adapt to Vowel Centralization by Increasing Global Contrast
Speakers responded to the centralization perturbation by expanding VSA in the adapt session relative to the control session (Figs. 2, 3A, and 6A), indicated by a main effect of session [F(1,48) = 5.32, P = 0.03]. In the adapt session, VSA remained high until the end of the hold phase (adaptation phase), dropping closer to baseline values in the washout and retention phases. In the control session, VSA fell slightly below baseline in the adaptation, washout, and retention phases. These changes are reflected in significant effects of phase [F(2,48) = 3.81, P = 0.03] and the interaction between session and phase [F(2,48) = 4.44, P = 0.02]. During the adaptation phase, VSA was significantly greater in the adapt session (9.7% ± 4.2%) than in the control session (−4.3% ± 3%, P < 0.05), though neither session was significantly different from baseline after correction for multiple comparisons. Neither the washout nor retention phases differed between sessions, nor were any of these values different from the baseline (all P > 0.07).
Figure 2.
Illustration of vowel space increases. Data from three example participants showing the movement of the four corner vowels after exposure to perturbed feedback in the adapt session (red) or unperturbed feedback in the control session (blue) compared with their values in the baseline phase of each session (dashed black). F1/F2, first and second resonant frequencies of the vocal tract, or formants. S1/S2/S3 represent three individual speakers.
AVS showed similar results (Figs. 3B and 6B). In all phases after the baseline, AVS was larger in the adapt session than the control session [P < 0.05, main effect of session: F(1,48) = 11.02, P = 0.003]. The difference between sessions was greatest in the adaptation phase, where participants increased the spacing between vowels, relative to baseline, by an average of 6.3% ± 1.7% in the adapt session and decreased it by −1.4% ± 1.4% in the control session. A significant difference was maintained throughout both the washout (3.2% ± 1.5% vs. −1.6% ± 1.2%) and retention (1.5% ± 1.5% vs. −1.5% ± 1.2%) phases. The change in AVS across phases was shown by a main effect of phase [F(2,48) = 7.25, P = 0.002] as well as an interaction between phase and session [F(2,48) = 5.33, P = 0.009]. Only the adaptation phase in the adapt session was significantly different from baseline [t(24) = 3.73, P = 0.001] after correction for multiple comparisons. The only significant within-session difference was between the adaptation and retention phases in the adapt session (P < 0.05).
Figure 3.

Vowel contrast adaptation. A: baseline-normalized vowel space area (VSA) increases in the adapt session (red) but not the control session (blue). C: individual and group means for the adaptation, washout, and retention phases. Each pair of points connected by a gray line represents data from a single participant. B and D: same as A and C, showing average vowel spacing (AVS). Error bars show standard error. See also Fig. 6, A and B.
Figure 6.

Individual variability in vowel space adaptation. A and B: individual participant differences between adapt and control sessions in normalized VSA (A) and normalized AVS (B), as measured in the adaptation phase. Each bar represents a participant, ordered in both panels by descending (positive) difference in VSA between the two sessions. C: baseline VSA and AVS values do not predict adaptation magnitude. D: as in A, individual participant differences between the adapt and control sessions in the normalized distance to center for each vowel, as measured in the adaptation phase. AVS, average vowel spacing; VSA, vowel space area.
VSA and AVS values were highly correlated (r = 0.95, P < 0.0001, Supplemental Fig. S1). Baseline values did not differ between the adapt and control sessions [VSA: F(1,46) = 0.01, P = 0.92, AVS: F(1,46) = 0.01, P = 0.91]. There was no change from session 1 to session 2 in either metric [VSA: F(1,46) = 0.15, P = 0.70, AVS: F(1,46) = 0.02, P = 0.89], and no interaction between session type and order [VSA: F(1,46) = 0.6, P = 0.44, AVS: F(1,46) = 0.4, P = 0.55].
Speech Contrast Increases as Duration Decreases
In contrast to VSA and AVS, vowel duration decreased slightly over the course of the experiment (Fig. 4). In the adapt session, vowels in the adaptation phase were 17 ± 33 ms shorter than the baseline (P = 0.02). Duration continued to decrease in the washout (23 ± 37 ms, P = 0.006) and retention phases (25 ± 38 ms, P = 0.004). Decreases in duration from baseline were smaller in the control session, with no phase significantly shorter than baseline (adaptation: 5 ± 40 ms, P = 0.53; washout: 10 ± 34 ms, P = 0.17; retention: 11 ± 31 ms, P = 0.09). The difference between the sessions did not reach significance [F(1, 48) = 3.1, P = 0.08]. There was no significant effect of phase [F(2, 48) = 2.2, P = 0.12], nor any interaction between phase and session [F(2, 48) = 0.9, P = 0.92]. We similarly observed only minimal changes in other speech parameters that did not differ across sessions (Supplemental Table S1; Supplemental Fig. S2): maximum pitch and peak intensity slightly increased, and pitch range did not change.
Figure 4.
Changes in vowel duration. A: baseline-normalized changes in vowel duration in the adapt session (red) and the control session (blue). B: individual and group means for normalized vowel duration in the adaptation, washout, and retention phases. Each pair of points connected by a gray line represents data from a single participant. See also Supplemental Fig. S2.
Speakers Simultaneously Learn Multiple Vowel-Specific Compensatory Changes
Speakers in the adapt session achieved these increases in speech contrast by increasing the distance between each vowel and the center of the vowel space (Figs. 5 and 6), as reflected by a main effect of session [F(1,414) = 9.49, P < 0.0001]. The increase in distance to the center was greatest in the adaptation phase (20.9 ± 3.9 mels) and smaller in the washout (10.7 ± 3.4 mels) and retention (4.6 ± 3.4 mels) phases. Adaptation, washout, and retention phases in the adapt session all differed from the control session (P < 0.05), where the distances decreased from baseline (adaptation: −2.9 ± 2.9 mels, washout: −3.1 ± 3.3 mels, retention: −4.5 ± 2.8 mels). These changes were reflected in a main effect of phase [F(2,414) = 8.41, P = 0.007] and an interaction between phase and session [F(2,414) = 5.47, P = 0.005]. There were no significant differences between vowels [F(3,414) = 2.0, P = 0.12], though there was a significant interaction between vowel and session [F(3,414) = 3.81, P = 0.01]. Post hoc tests comparing individual vowels between the sessions showed that /i/ was farther from the center in both the adaptation and washout phases, and that /ɑ/ was farther from the center in all three test phases (all P < 0.05). No other comparison was significant after correction for multiple comparisons. Results were highly similar when examining the portion of compensatory changes that directly opposed the perturbation (Supplemental Fig. S3).
Figure 5.

Vowel-specific compensatory changes. A–C: mean change (± SE) in formants for each vowel in the adaptation, washout, and retention phases relative to baseline values [normalized to (0,0)]. Bright colors (open circles) show data from the adapt session; dull colors (filled circles) show data from the control session. D–F: group means (± SE) and individual participant adaptation for each vowel. Colors as in A–C. See also Fig. 6 and Supplemental Fig. S2. F1/F2, first and second resonant frequencies of the vocal tract, or formants.
Individual Variability in Adaptation
Although the vowel centralization paradigm resulted in increased speech contrast at the group level, there was substantial variability in response magnitude across participants, both in global measures of vowel spacing (Fig. 6, A and B) and in the compensatory movement of individual vowels away from the vowel center (Fig. 6D). Similar interindividual variability is consistently seen in studies of sensorimotor adaptation in speech (4, 5, 8, 28). We examined whether individual adaptation magnitude was predicted by vowel spacing in the baseline phase; we found no such correlation for either global measure of vowel spacing (VSA: r = 0.08, P = 0.70; AVS: r = 0.26, P = 0.21, Fig. 6C), suggesting that the variability seen across participants in their response to the centralization perturbation was not driven by differences in baseline production. Notably, similar increases in VSA/AVS across participants were driven by different patterns of adaptation at the individual vowel level (Fig. 2). Adaptation magnitude for a given vowel was not well-predicted by that vowel’s baseline formant variability (/i/: r = −0.21, P = 0.32; /æ/: r = −0.20, P = 0.34; /ɑ/: r = 0.37, P = 0.07; /u/: r = −0.09, P = 0.66). Adaptation was also not well-predicted by the initial distance to the center of the vowel space across vowels (/i/: r = 0.32, P = 0.12; /æ/: r = 0.21, P = 0.31; /ɑ/: r = 0.42, P = 0.04; /u/: r = 0.065, P = 0.76; no vowel significant after correcting for multiple comparisons).
Awareness of Perturbation and Strategy Use
When participants were queried about their awareness of the perturbation, 16 of 25 responded that they thought they received a perturbation, but no participant correctly identified it as a change to their vowels (Table 1). Only three of 25 participants reported using a strategy, and no strategy addressed the applied perturbation. The reported strategies were “Saying the words slower,” “Kept mouth open between words,” and “Looking away from the screen between words.”
Table 1.
Participant awareness of perturbation
| Number of Participants | Perceived Perturbation |
|---|---|
| 9 | Did not perceive a perturbation |
| 6 | Thought audio feedback had added noise (likely reflecting the 60 dB speech-shaped noise added to the signal) |
| 5 | Perceived a perturbation but unable to identify what it was |
| 2 | Pitch of voice altered |
| 1 | Speech delayed |
| 1 | Speech volume altered |
| 1 | Speech was “more nasal” |
DISCUSSION
In the current study, we used alterations of auditory feedback to drive participants to expand their working vowel space and increase the contrast between vowels. Speakers who were exposed to vowel centralization feedback learned to produce corner vowels farther from the center of their vowel space, partially overcoming the perceived centralization. This was reflected in global measures of vowel space (VSA) and vowel contrast (AVS), as well as vowel-specific measures. These changes were partially retained after the perturbation was removed, as well as in an assay of retention 10 min after the main experiment.
Overall, our findings show that speakers are capable of adapting to nonuniform transformations of vowel space feedback. Because the direction of the feedback shift was dependent on the produced formants, participants in the study had to learn vowel-dependent compensatory adjustments, each of which required unique changes to articulatory movements. The current results build on previous work (7) to show that multiple opposing transformations can be learned simultaneously across the extent of producible vowel space, establishing the ability of sensorimotor adaptation paradigms to enhance global contrast between vowels.
Increased vowel contrast is a commonly used metric for quantifying the effects of rehabilitative interventions for motor speech disorders (17–21), as it is associated with greater intelligibility (10–16). In the current study, ∼12 min of speaking resulted in a VSA increase of 15.8% (calculated using /u/, /i/, and /ɑ/), half the increase seen in individuals with Parkinson’s disease after sixteen 1-h sessions of LSVT LOUD (31.6%, calculated with the same three vowels) (18), the current standard of care for the hypokinetic dysarthria secondary to that disorder. Furthermore, participants adapted their speech without conscious effort or awareness of the specific targets of the perturbation, which may be useful for patients who do not, or cannot, respond well to existing treatments due to issues with explicit strategy use (29).
The increase in vowel formant contrast in the current study is similar to the hyperarticulation observed when people are explicitly instructed to speak clearly (17, 30–35). Vowel hyperarticulation also occurs in many contexts that implicitly encourage clearer speech: when repeating a word after being misunderstood (36–39); when speaking to infants (40, 41); when speaking to people who speak a foreign language (42); when speaking to individuals who have hearing difficulties (33, 43); when speaking in “challenging” acoustic situations, such as when a conversation partner is wearing headphones (44, 45); when using a word for the first time in a discourse compared with subsequent repetitions of that word (26); when words are less predictable from sentential context (e.g., “The next word is nine” vs. “A stitch in time saves nine”) (46, 47); for words that are relatively uncommon compared with words that occur more frequently (26, 47, 48); for words with a greater number of phonologically similar words, i.e., higher lexical neighborhood density (43); and in words with prosodic stress or emphasis (49–51). In every case, the changes in formants are accompanied by relative increases in vowel duration, which allow more time for full articulatory movement (27). Hyperarticulation is so closely tied to increased duration, in fact, that duration is often used as a proxy metric for hyperarticulation (26, 52, 53). In contrast, we observed no differences between the adapt and control sessions in vowel duration, intensity, or pitch, and in fact saw an overall decrease in vowel duration in the adapt session relative to baseline. These results strongly suggest that the increase in vowel contrast observed in the current study does not arise from general mechanisms of clarity-driven hyperarticulation but is instead an adaptive response to counteract the auditory perturbation. Future work could address this question directly by examining adaptation to an outward-pushing perturbation that enhances vowel contrast in the feedback (and would thus favor hypoarticulation as a compensatory response), or to perturbations with unrelated vowel-specific effects.
It remains unclear from the current data whether participants learned a global vowel expansion pattern or local, vowel-specific changes. The variability in learning across vowels, with low learning for /u/ in particular, suggests that participants may have learned separate transformations. These results must be interpreted cautiously, however, as the capacity for adaptation may vary between vowels (54). Additionally, F2 variability for /u/ in our data (36.2 mels) was substantially higher than that of the other vowels (20.4 mels, all P < 0.05, see Supplemental Table S2), consistent with previous reports (55). Thus, perturbations for /u/ may have caused fewer productions to fall outside acceptable category boundaries (56, 57). Future work examining generalization to other, untrained, vowels may help resolve whether the learning observed here is achieved through a combination of local transformations or a generalized, global pattern (58, 59).
Increases in vowel contrast persisted even after a washout period and a 10-min silent interval, suggesting that sensorimotor adaptation may cause longer-term changes, consistent with previous results in whispered speech (2). Although some evidence of retention after a short break can be found in a figure in a previous study (60), no statistical evidence for retention of sensorimotor learning in voiced speech has been reported previously. More research on the retention of learning, including how it is affected by different communicative settings, will be vital to the clinical translation of sensorimotor adaptation.
In the control session, there was a trend for both VSA and AVS to decrease over the course of the experiment. This gradual decrease explains the pattern of results: consistent differences between sessions, even when individual phases did not always differ from the baseline. This can also be seen in the individual vowel data (Fig. 5 and Supplemental Fig. S3). This pattern suggests that vowels have a tendency to become more centralized over the course of an hour-long study, highlighting the importance of conducting a control session to account for potential changes in speech unrelated to the auditory perturbation.
Similar to previous studies of sensorimotor adaptation in speech, we observed substantial variability across individuals in the magnitude of adaptation. This variability was seen in both global measures of vowel spacing as well as in changes in productions of individual vowels. Variability in adaptation was not correlated with speech behavior in the baseline phase. Previous work has suggested that variability in the magnitude of sensorimotor adaptation for perturbations of a single vowel may be caused by differences in the balance between auditory and somatosensory feedback use across individuals (61–63). It is possible the variability we observed has a similar source.
Finally, we found a strong relationship between VSA and AVS. Previous work has suggested that AVS may be a more accurate predictor of intelligibility impairments than VSA in populations with motor speech disorders (13). The present results suggest that both measures capture similar effects, though AVS may be a slightly more sensitive measure. However, more research with a wider set of stimuli is needed to determine how this relationship is affected when noncorner vowels are included in the AVS measure, and it remains to be seen how intelligibility may be correlated with either measure.
In sum, we have shown that individuals can modify their speech to oppose auditory perturbations that target vowel distinctiveness. This demonstration of the ability to drive increased vowel contrast without the need for explicit strategies or conscious control moves research on sensorimotor adaptation closer to realizing the long-standing promise of this technique for clinical use.
GRANTS
This work was supported by NIH grants R01 DC017091 and R00 DC014520.
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the authors.
AUTHOR CONTRIBUTIONS
B.P. and C.A.N. conceived and designed research; B.P. and C.A.N. performed experiments; B.P. and C.A.N. analyzed data; B.P. and C.A.N. interpreted results of experiments; B.P. and C.A.N. prepared figures; B.P. and C.A.N. drafted manuscript; B.P. and C.A.N. edited and revised manuscript; B.P. and C.A.N. approved final version of manuscript.
ENDNOTE
At the request of the authors, readers are herein alerted to the fact that additional materials related to this manuscript may be found at https://github.com/blab-lab/vsaCentralize and https://github.com/carrien/free-speech. These materials are not a part of this manuscript, and have not undergone peer review by the American Physiological Society (APS). APS and the journal editors take no responsibility for these materials, for the website address, or for any links to or from it.
REFERENCES
1. Houde JF, Jordan MI. Sensorimotor adaptation in speech production. Science 279: 1213–1216, 1998. doi: 10.1126/science.279.5354.1213.
2. Houde JF, Jordan MI. Sensorimotor adaptation of speech I: compensation and adaptation. J Speech Lang Hear Res 45: 295–310, 2002. doi: 10.1044/1092-4388(2002/023).
3. Lametti DR, Smith HJ, Watkins KE, Shiller DM. Robust sensorimotor learning during variable sentence-level speech. Curr Biol 28: 3106–3113.e2, 2018. doi: 10.1016/j.cub.2018.07.030.
4. Munhall KG, MacDonald EN, Byrne SK, Johnsrude I. Talkers alter vowel production in response to real-time formant perturbation even when instructed not to compensate. J Acoust Soc Am 125: 384–390, 2009. doi: 10.1121/1.3035829.
5. Parrell B, Agnew Z, Nagarajan S, Houde JF, Ivry RB. Impaired feedforward control and enhanced feedback control of speech in patients with cerebellar degeneration. J Neurosci 37: 9249–9258, 2017. doi: 10.1523/JNEUROSCI.3363-16.2017.
6. Purcell DW, Munhall KG. Adaptive control of vowel formant frequency: evidence from real-time formant manipulation. J Acoust Soc Am 120: 966–977, 2006. doi: 10.1121/1.2217714.
7. Rochet-Capellan A, Ostry DJ. Simultaneous acquisition of multiple auditory-motor transformations in speech. J Neurosci 31: 2657–2662, 2011. doi: 10.1523/JNEUROSCI.6020-10.2011.
8. Villacorta VM, Perkell JS, Guenther FH. Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am 122: 2306–2319, 2007. doi: 10.1121/1.2773966.
9. Roemmich RT, Bastian AJ. Closing the loop: from motor neuroscience to neurorehabilitation. Annu Rev Neurosci 41: 415–429, 2018. doi: 10.1146/annurev-neuro-080317-062245.
10. Ansel BM, Kent RD. Acoustic-phonetic contrasts and intelligibility in the dysarthria associated with mixed cerebral palsy. J Speech Hear Res 35: 296–308, 1992. doi: 10.1044/jshr.3502.296.
11. Bang Y-I, Min K, Sohn YH, Cho S-R. Acoustic characteristics of vowel sounds in patients with Parkinson disease. NeuroRehabilitation 32: 649–654, 2013. doi: 10.3233/NRE-130887.
12. Kim H, Hasegawa-Johnson M, Perlman A. Vowel contrast and speech intelligibility in dysarthria. Folia Phoniatr Logop 63: 187–194, 2011. doi: 10.1159/000318881.
13. Neel AT. Vowel space characteristics and vowel identification accuracy. J Speech Lang Hear Res 51: 574–585, 2008. doi: 10.1044/1092-4388(2008/041).
14. Skodda S, Visser W, Schlegel U. Vowel articulation in Parkinson’s disease. J Voice 25: 467–472, 2011 [Erratum in J Voice 26: 267–268, 2012]. doi: 10.1016/j.jvoice.2010.01.009.
15. Tjaden K, Rivera D, Wilding G, Turner GS. Characteristics of the lax vowel space in dysarthria. J Speech Lang Hear Res 48: 554–566, 2005. doi: 10.1044/1092-4388(2005/038).
16. Weismer G, Jeng JY, Laures JS, Kent RD, Kent JF. Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders. Folia Phoniatr Logop 53: 1–18, 2001. doi: 10.1159/000052649.
17. Lam J, Tjaden K. Clear speech variants: an acoustic study in Parkinson’s disease. J Speech Lang Hear Res 59: 631–646, 2016. doi: 10.1044/2015_JSLHR-S-15-0216.
18. Sapir S, Spielman JL, Ramig LO, Story BH, Fox C. Effects of intensive voice treatment (the Lee Silverman Voice Treatment [LSVT]) on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: acoustic and perceptual findings. J Speech Lang Hear Res 50: 899–912, 2007 [Erratum in J Speech Lang Hear Res 50: 1652, 2007]. doi: 10.1044/1092-4388(2007/064).
19. Tjaden K, Lam J, Wilding G. Vowel acoustics in Parkinson’s disease and multiple sclerosis: comparison of clear, loud, and slow speaking conditions. J Speech Lang Hear Res 56: 1485–1502, 2013. doi: 10.1044/1092-4388(2013/12-0259).
20. Tjaden K, Wilding GE. Rate and loudness manipulations in dysarthria: acoustic and perceptual findings. J Speech Lang Hear Res 47: 766–783, 2004. doi: 10.1044/1092-4388(2004/058).
21. Whitfield JA, Goberman AM. Articulatory–acoustic vowel space: application to clear speech in individuals with Parkinson’s disease. J Commun Disord 51: 19–28, 2014. doi: 10.1016/j.jcomdis.2014.06.005.
22. Cai S, Boucek M, Ghosh S, Guenther FH, Perkell J. A system for online dynamic perturbation of formant trajectories and results from perturbations of the Mandarin triphthong /iau/. In: Proceedings of the 8th International Seminar on Speech Production. Strasbourg, France, December 8–12, 2008, p. 65–68.
23. Tourville JA, Cai S, Guenther F. Exploring auditory-motor interactions in normal and disordered speech. Proc Mtgs Acoust 19: 060180, 2013. doi: 10.1121/1.4800684.
24. Niziolek CA, Houde J. Wave_Viewer: First Release. 2015. doi: 10.5281/zenodo.13839.
25. Boersma P, Weenink D. Praat: doing phonetics by computer (Online). http://www.praat.org/.
26. Baker RE, Bradlow AR. Variability in word duration as a function of probability, speech style, and prosody. Lang Speech 52: 391–413, 2009. doi: 10.1177/0023830909336575.
27. Lindblom B. Explaining phonetic variation: a sketch of the H&H theory. In: Speech Production and Speech Modelling, edited by Hardcastle JW, Marchal A. Dordrecht: Kluwer Academic Publishers, 1990, p. 403–439. doi: 10.1007/978-94-009-2037-8_16.
28. Martin CD, Niziolek CA, Duñabeitia JA, Perez A, Hernandez D, Carreiras M, Houde JF. Online adaptation to altered auditory feedback is predicted by auditory acuity and not by domain-general executive control resources. Front Hum Neurosci 12: 91, 2018. doi: 10.3389/fnhum.2018.00091.
29. Sadagopan N, Huber JE. Effects of loudness cues on respiration in individuals with Parkinson’s disease. Mov Disord 22: 651–659, 2007. doi: 10.1002/mds.21375.
30. Ferguson SH, Kewley-Port D. Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. J Acoust Soc Am 112: 259–271, 2002. doi: 10.1121/1.1482078.
31. Krause JC, Braida LD. Acoustic properties of naturally produced clear speech at normal speaking rates. J Acoust Soc Am 115: 362–378, 2004. doi: 10.1121/1.1635842.
32. Moon S-J, Lindblom B. Interaction between duration, context, and speaking style in English stressed vowels. J Acoust Soc Am 96: 40–55, 1994. doi: 10.1121/1.410492.
33. Picheny MA, Durlach NI, Braida LD. Speaking clearly for the hard of hearing. II: acoustic characteristics of clear and conversational speech. J Speech Lang Hear Res 29: 434–446, 1986. doi: 10.1044/jshr.2904.434.
34. Smiljanić R, Bradlow AR. Stability of temporal contrasts across speaking styles in English and Croatian. J Phon 36: 91–113, 2008. doi: 10.1016/j.wocn.2007.02.002.
35. Smiljanić R, Bradlow AR. Speaking and hearing clearly: talker and listener factors in speaking style changes. Lang Linguist Compass 3: 236–264, 2009. doi: 10.1111/j.1749-818X.2008.00112.x.
36. Burnham D, Joeffry S, Rice L. “D-o-e-s-Not-C-o-m-p-u-t-e”: vowel hyperarticulation in speech to an auditory-visual avatar. In: Proceedings of the 9th International Conference on Auditory-Visual Speech Processing (AVSP-2010). Hakone, Japan, September 30–October 3, 2010.
37. Burnham D, Joeffry S, Rice L. Computer- and human-directed speech before and after correction. In: Speech Science and Technology. Melbourne, Australia, December 14–16, 2010, p. 13–17. https://assta.org/proceedings/sst/SST-10/SST2010/PDF/AUTHOR/ST100077.PDF.
38. Oviatt S, Levow G-A, Moreton E, MacEachern M. Modeling global and focal hyperarticulation during human-computer error resolution. J Acoust Soc Am 104: 3080–3098, 1998. doi: 10.1121/1.423888.
39. Oviatt S, MacEachern M, Levow G-A. Predicting hyperarticulate speech during human-computer error resolution. Speech Commun 24: 87–110, 1998. doi: 10.1016/S0167-6393(98)00005-3.
40. Kuhl P, Andruski EJ, Chistovich AI, Chistovich AL, Kozhevnikov VE, Ryskina VV, Stolyarova IE, Sundberg U, Lacerda F. Cross-language analysis of phonetic units in language addressed to infants. Science 277: 684–686, 1997. doi: 10.1126/science.277.5326.684.
41. Lam C, Kitamura C. Mommy, speak clearly: induced hearing loss shapes vowel hyperarticulation. Dev Sci 15: 212–221, 2012. doi: 10.1111/j.1467-7687.2011.01118.x.
42. Scarborough R, Brenier J, Zhao Y, Hall-Lew L, Dmitrieva O. An acoustic study of real and imagined foreigner-directed speech. In: Proceedings of the 15th International Congress of Phonetic Sciences. Barcelona, Spain, August 3–9, 2007, p. 2165–2168.
43. Scarborough R, Zellou G. Clarity in communication: “clear” speech authenticity and lexical neighborhood density effects in speech production and perception. J Acoust Soc Am 134: 3793–3807, 2013. doi: 10.1121/1.4824120.
44. Hazan V, Baker R. Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions. J Acoust Soc Am 130: 2139–2152, 2011. doi: 10.1121/1.3623753.
45. Koster S. Acoustic-phonetic characteristics of hyperarticulated speech for different speaking styles. In: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Salt Lake City, UT, May 7–11, 2001, p. 873–876. doi: 10.1109/ICASSP.2001.941054.
46. Aylett M, Turk A. Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. J Acoust Soc Am 119: 3048–3058, 2006. doi: 10.1121/1.2188331.
47. Scarborough R. Lexical and contextual predictability: confluent effects on the production of vowels. In: Laboratory Phonology 10, edited by Fougeron C, Kuehnert B, D’Imperio M, Vallee N. Berlin/New York: De Gruyter Mouton, 2010, p. 557–586. doi: 10.1515/9783110224917.5.557.
48. Scarborough R. Neighborhood-conditioned patterns in phonetic detail: relating coarticulation and hyperarticulation. J Phon 41: 491–508, 2013. doi: 10.1016/j.wocn.2013.09.004.
49. Cho T, Lee Y, Kim S. Communicatively driven versus prosodically driven hyper-articulation in Korean. J Phon 39: 344–361, 2011. doi: 10.1016/j.wocn.2011.02.005.
50. de Jong KJ. The supraglottal articulation of prominence in English: linguistic stress as localized hyperarticulation. J Acoust Soc Am 97: 491–504, 1995. doi: 10.1121/1.412275.
51. de Jong K, Beckman ME, Edwards J. The interplay between prosodic structure and coarticulation. Lang Speech 36: 197–212, 1993. doi: 10.1177/002383099303600305.
52. Aylett M, Turk A. The smooth signal redundancy hypothesis: a functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Lang Speech 47: 31–56, 2004. doi: 10.1177/00238309040470010201.
53. Freeman V. Hyperarticulation as a signal of stance. J Phon 45: 1–11, 2014. doi: 10.1016/j.wocn.2014.03.002.
54. Mitsuya T, MacDonald EN, Munhall KG, Purcell DW. Formant compensation for auditory feedback with English vowels. J Acoust Soc Am 138: 413–424, 2015. doi: 10.1121/1.4923154.
55. Clopper CG, Burdin RS, Turnbull R. Variation in /u/ fronting in the American Midwest. J Acoust Soc Am 146: 233–244, 2019. doi: 10.1121/1.5116131.
56. Mitsuya T, Samson F, Ménard L, Munhall KG. Language dependent vowel representation in speech production. J Acoust Soc Am 133: 2993–3003, 2013. doi: 10.1121/1.4795786.
57. Niziolek CA, Guenther FH. Vowel category boundaries enhance cortical and behavioral responses to speech feedback alterations. J Neurosci 33: 12090–12098, 2013. doi: 10.1523/JNEUROSCI.1008-13.2013.
58. Malfait N, Gribble PL, Ostry DJ. Generalization of motor learning based on multiple field exposures and local adaptation. J Neurophysiol 93: 3327–3338, 2005. doi: 10.1152/jn.00883.2004.
59. Rochet-Capellan A, Richer L, Ostry DJ. Nonhomogeneous transfer reveals specificity in speech motor learning. J Neurophysiol 107: 1711–1717, 2012. doi: 10.1152/jn.00773.2011.
60. Lametti DR, Rochet-Capellan A, Neufeld E, Shiller DM, Ostry DJ. Plasticity in the human speech motor system drives changes in speech perception. J Neurosci 34: 10339–10346, 2014. doi: 10.1523/JNEUROSCI.0108-14.2014.
61. Katseff S, Houde JF, Johnson K. Partial compensation for altered auditory feedback: a tradeoff with somatosensory feedback? Lang Speech 55: 295–308, 2012. doi: 10.1177/0023830911417802.
62. Lametti DR, Nasir SM, Ostry DJ. Sensory preference in speech production revealed by simultaneous alteration of auditory and somatosensory feedback. J Neurosci 32: 9351–9358, 2012. doi: 10.1523/JNEUROSCI.0404-12.2012.
63. Parrell B, Ramanarayanan V, Nagarajan S, Houde J. The FACTS model of speech motor control: fusing state estimation and task-based control. PLoS Comput Biol 15: e1007321, 2019. doi: 10.1371/journal.pcbi.1007321.


