2021 Mar 20;11(3):394. doi: 10.3390/brainsci11030394

Figure 2.

Examples of degraded speech manipulations used experimentally and their acoustic effects on the speech signal. Broadband time-frequency spectrograms of the same speech token (“tomatoes”), subjected to different forms of speech degradation (all samples apart from B were recorded by a native British speaker with a Standard Southern English accent; wavefiles of A–G are in Supplementary Material online). (A) Natural speech token. (B) Same speech token spoken with an American-Californian accent (an accent is a meta-linguistic feature that reveals information about the speaker’s geographical or socio-cultural background [53]; normal listeners make predictions about speakers’ accents that tend to facilitate faster accent processing [54]). (C) Speech in multi-talker babble (the level of speech-in-noise can be adaptively adjusted to find the point at which speech switches from intelligible to unintelligible [55]; background “noise” used experimentally typically comprises either “energetic” masking (e.g., steady-state white noise) or “informational” masking (e.g., multi-talker babble, as illustrated here) [56]). (D) Perceptual (or phonemic) restoration (Warren [57] originally observed that when a key phoneme is artificially excised from a given sentence, control participants are unable to identify the location of the missing phoneme when the gap is “filled in” with a burst of white noise (bottom panel), but are able to identify the location accurately if the gap remains silent (top panel), i.e., they perceptually “restore” the excised phoneme). (E) Noise-vocoded speech (vocoding removes fine spectral detail from speech, whilst preserving temporal cues [58,59]; three bands of modulated noise (i.e., three “channels”; top panel) are the minimum needed for consistent recognition by normal listeners [59]; spectrograms for six (middle panel) and twelve (bottom panel) channels are also shown here).
(F) Time-compressed speech (created by artificially increasing the rate at which a recorded speech stimulus is presented; intelligibility decreases as speech compression rate increases [60,61,62]). (G) Sinewave speech (this transformation reduces speech to a series of “whistles” or sinewave tones that track formant contours [63]). Note that these speech manipulations vary widely in the cognitive processes they target, the degree to which they degrade the speech signal and their ecological resonance (see also Table 1). Accented speech and speech-in-noise or babble are commonly encountered in daily life through exposure to diverse speakers and noisy environments. Perceptual restoration simulates the frequent everyday phenomenon of speech interruption by intermittent extraneous sounds (e.g., a slamming door). Sinewave speech, by contrast, is a drastic impoverishment of the speech signal that sounds highly unnatural but becomes intelligible with exposure due to perceptual learning [64].
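
Two of the manipulations above, speech-in-noise (C) and noise-vocoding (E), can be illustrated with a minimal Python sketch. This is not the procedure used to generate the stimuli in this figure. The function names (`mix_at_snr`, `bandpass_fft`, `noise_vocode`), the brick-wall FFT filters, the log-spaced band edges between 100 Hz and 5 kHz and the 30 Hz envelope cutoff are all assumptions chosen for brevity; published vocoders typically use smoother filterbanks.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target signal-to-noise ratio (in dB).

    Assumes `noise` is at least as long as `speech`.
    Returns the mixture and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes p_speech / (gain^2 * p_noise) equal 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = gain * noise
    return speech + scaled, scaled

def bandpass_fft(x, fs, lo, hi):
    """Brick-wall band-pass filter via the FFT (a deliberate simplification)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def noise_vocode(speech, fs, n_channels=3, f_lo=100.0, f_hi=5000.0,
                 env_cutoff=30.0, seed=0):
    """Replace fine spectral detail with band-limited noise per channel,
    keeping each channel's slow amplitude envelope (the temporal cue)."""
    rng = np.random.default_rng(seed)
    speech = np.asarray(speech, dtype=float)
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)  # assumed log spacing
    out = np.zeros_like(speech)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass_fft(speech, fs, lo, hi)
        # Envelope: full-wave rectify, then low-pass below env_cutoff.
        env = np.clip(bandpass_fft(np.abs(band), fs, 0.0, env_cutoff), 0.0, None)
        carrier = bandpass_fft(rng.standard_normal(len(speech)), fs, lo, hi)
        out += env * carrier
    return out
```

In this sketch, increasing `n_channels` from three to six or twelve restores progressively more spectral detail, and lowering `snr_db` in `mix_at_snr` moves the mixture toward the unintelligible end of the range that adaptive procedures search.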