Want to quickly adapt to distorted speech and become a better listener? Read lips, not text

Faezeh Pourhashemi; Martijn Baart; Thijs van Laarhoven; Jean Vroomen

doi:10.1371/journal.pone.0278986

PLoS One. 2022 Dec 29;17(12):e0278986. doi: 10.1371/journal.pone.0278986

Editor: Kristof Strijkers³

PMCID: PMC9799298 PMID: 36580461

Abstract

When listening to distorted speech, does one become a better listener by looking at the face of the speaker or by reading subtitles that are presented along with the speech signal? We examined this question in two experiments in which we presented participants with spectrally distorted speech (4-channel noise-vocoded speech). During short training sessions, listeners received auditorily distorted words or pseudowords that were partially disambiguated by concurrently presented lipread information or text. After each training session, listeners were tested with new degraded auditory words. Learning effects (based on proportions of correctly identified words) were stronger if listeners had trained with words rather than with pseudowords (a lexical boost), and adding lipread information during training was more effective than adding text (a lipread boost). Moreover, the advantage of lipread speech over text training was also found when participants were tested more than a month later. The current results thus suggest that lipread speech may have surprisingly long-lasting effects on adaptation to distorted speech.

Introduction

Human speech is sometimes difficult to understand due to background noise, an unfamiliar accent of the speaker, or poor quality of the speech signal itself. However, listeners quickly adapt to this situation because there is often other information available that informs the listeners what the intended spoken message should be. This ‘other’ information might be lipread speech if the speaker can be seen, or it might consist of subtitles that can be read while speech is heard when watching a movie [e.g., 1–3].

Knowledge about the possible words in a language is also important for disambiguating the intended message. For example, the Ganong effect shows that an ambiguous auditory phoneme is perceived as /g/ when followed by ‘ift’, but perceived as /k/ when followed by ‘iss’ because only ‘gift’ and ‘kiss’ are legal words in English [4]. Here, we hypothesized that these extra information sources–lipread speech, written text, and lexical knowledge–all provide a basis for determining the discrepancy between the intended and actual speech signal that guides adaptive changes in speech perception. However, until now, their effectiveness has never been directly compared in a systematic way. Here, we therefore examined whether lipread speech, written text, and lexical information differ in their efficacy to drive learning of distorted speech. Remarkably, the least constraining and, arguably, most difficult information source to decode–lipread speech–proved to be the most effective teacher.

Learning effects in distorted speech have been demonstrated many times, and can even occur without other information sources that disambiguate the distorted signal [e.g., 5, 6]. Also, mere exposure to clear but accented speech results in improvements in performance in the absence of other information about the correct interpretation [e.g. 7–9]. Simply listening to time-compressed speech [8–10] can also lead to intelligibility improvements. However, there is no doubt that speech perception also makes use of other information sources to adapt or ‘recalibrate’ auditory perception [11–16]. One example comes from the literature on ‘phonetic recalibration’ where ambiguous phonetic segments are adjusted so that they fit in the context. For lipread speech, this was first demonstrated by Bertelson et al. [12] who exposed listeners to the view of a speaker who pronounced either /aba/ or /ada/ while an ambiguous speech sound halfway between /aba/ and /ada/ was heard. In auditory-only posttests, identification of the ambiguous sound was shifted towards the previously seen lipread information, so the same test sound was perceived more likely as /aba/ when the previous exposure contained lipread /aba/, and more likely as /ada/ when the previous exposure contained lipread /ada/. The rationale behind this assimilative effect is that the perceptual system during audiovisual exposure minimizes the discrepancy between the heard and seen information by shifting the auditory phonetic boundary towards the lipread stimulus. This then leads to assimilative auditory aftereffects in posttests.

Similar, but generally smaller assimilative aftereffects have been reported when instead of lipread speech, written text [17] or lexical information [16, 18–20] is used as the teacher-signal that disambiguates the ambiguous phoneme. The reason why lipread speech is more potent than text or lexical information remains, for the time being, rather elusive. On the one hand, it seems that lipread speech is biologically closely tied to auditory speech production. Reading, on the other hand, has emerged only late in evolution and our ability to read is typically built on our ability to process auditory speech. However, this does not imply that lipread speech should, by default, be more helpful than text because the mapping between lipread visemes and auditory phonemes is not as clear as mapping between phonemes and text. For example, the visible articulatory movements that correspond to /b/ (i.e., a bilabial closure) can easily be mistaken for /p/ or /m/, and even professional lipreaders struggle with decrypting silent lipread videos [21]. In contrast, for typical readers of an alphabetical script, the mapping between orthography and phonology is constrained and leaves little room for visual-auditory ambiguities (at least not in a language with a relatively transparent orthography like Dutch). Nevertheless, only dynamic lipread information is tightly correlated with the unfolding auditory speech stream [e.g., 22, 23], which may explain why silent lip-read information can activate auditory cortex [24] and can drive so-called ‘cortical entrainment’ in which oscillatory cortical activity in auditory areas synchronizes with the (silent) moving lips [25, 26].

Lexical context (rather than orthographic context) may, arguably, also be more constraining than lipread speech: in the previously mentioned Ganong effect, only the words ‘gift’ or ‘kiss’ provided a fully constraining context when /?ift/ or /?iss/ was presented to the listeners. Despite these differences in biology, acquisition, and information content, though, it has been repeatedly demonstrated that all three information sources induce phonetic recalibration, although lipread information seems more potent in inducing this recalibration effect on phoneme identification.

Learning effects of lipread speech, written text, and lexical context have also been found in studies on spoken word recognition where instead of ambiguous phonemes, longer segments like words or sentences are used. One striking example from the literature on spoken word recognition comes from noise-vocoded speech in which spectral (and some temporal) details from the speech signal are removed [27]. When heard for the first time, noise-vocoded speech sounds rather unintelligible, but intelligibility drastically improves if the content of the sentence is revealed by a clear (undistorted) speech example. The second presentation of the vocoded sentence then seems much more intelligible than on initial hearing. Davis et al. [5] examined whether learning to identify noise-vocoded speech differed if the identity of the distorted items was revealed via a clear speech example or concurrently presented written text. Their results showed that both information sources were equally effective. Presumably, clear speech prior to, or written presentation concurrent with distorted speech presentation drove learning by providing the correct target representation. Furthermore, they reported that learning effects were absent when the distorted target sentence contained pseudowords instead of real words, thus suggesting that lexical context matters. However, the training sentences themselves may contain sentence-level information that can boost identification of test items. This potential issue was acknowledged in a follow-up study by the same group [28], and was circumvented by using single items rather than full sentences. However, in that study, text was no longer included as a training context (learning was assessed via clear speech feedback only), and the only other study we are aware of that investigated text-based learning in NVS again used full sentences rather than single items [29]. Pilling et al. [29] did compare the boosting effect of written text versus lipread speech and showed that both visual contexts were equally effective in driving learning. Note that this result is somewhat different from the one obtained in phonetic recalibration with single items, where lipread speech is usually more potent than text, but again, in the work by Pilling et al. [29], the effect of sentence-level content during training (and its potential interaction with lipread information or text) could not be ruled out.

Here, we systematically examined, for the first time, whether lipread speech, printed text, and lexical information are able to boost auditory-only noise-vocoded speech identification when the training items consist of mono- or bi-syllabic noise-vocoded words or pseudowords. By using single items rather than sentences, we could assess the role of lexical information without sentence-level perceptual support (which can, for example, induce semantic carry-over effects between items), and rule out potential differences in parsing difficulty of connected speech that contains words versus pseudowords (or in the degree to which short-term memory is taxed).

In the current study, we used a training-test paradigm in which we measured recognition of auditory-only noise-vocoded words during a pretest followed by three short training blocks and test blocks. The training items and test items were all novel and were thus never repeated during the experiment. The training stimuli consisted of noise-vocoded words or pseudowords that were combined with either lipread speech, printed text, or a still frame of the speaker that served as a baseline condition. Training and test blocks always contained 15 items, which allowed us to determine not only the overall effect of training context, but also how learning effects would build up over time. In general, adaptation to distorted speech occurs rather rapidly [30], and our experimental design thus allowed us to compare adaptation when visual information is not informative (the still frame condition) with conditions where the visual context may boost learning via audiovisual integration (the text and lipread conditions).

We expected that training with words relative to pseudowords would boost auditory learning because lexical information provides top-down information that can drive learning (a lexical boost). As noted, both text and lipread information can drive perceptual learning (phonetic recalibration, see e.g., [12, 17]) so we also expected that lipread speech and text during training would boost learning relative to the static control condition. The relative contributions of lipread speech and text when combined with single words or pseudowords during training had not been examined explicitly and were thus uncharted terrain. In Experiment 1, we assessed lexical, lipread, and text-based learning effects immediately after training, while in Experiment 2 we examined, in a subset of the participants (those who had trained with words), whether learning effects were stable and would last more than a month after the initial training-test procedure was completed.

Experiment 1

Materials and methods

Participants

One hundred and twenty-eight students from Tilburg University participated in this study in return for course credits. All participants were native speakers of Dutch who reported normal hearing and (corrected to) normal vision. Two participants in the word-training group were excluded from the analyses because more than 30% of the responses could not be analyzed (i.e., no response was provided, or responses were given in a different language than Dutch despite clear instructions). Also, one participant from the pseudoword-training group in the text condition was excluded because he/she had also participated in the word-training group. Including this participant in the data analyses did not change the patterns of (non)significance. Mean age in the final sample (92 females) was 19 years (SD = 1.70). The experiment was conducted in accordance with the Declaration of Helsinki. All participants provided written informed consent and the experiment was approved by the Tilburg University Ethics Review Board (project ID: EC-2016.48). The raw (concatenated) data are available for download from the DataverseNL platform (https://doi.org/10.34894/2D83AI).

Stimuli

For the auditory stimuli, a male native speaker of Dutch (MB, one of the authors) was recorded with a NIKON D7200 camera while pronouncing a set of 120 mono- and bi-syllabic Dutch words. This set (but different recordings) was originally used by van der Zande et al. [31], and the same set of items was later used by van Laarhoven et al. [32]. The items had a mean word frequency of 153.75 per million, as determined with the SUBTLEX-NL database [33], see Supporting information for the full list (S1 Table). For training with pseudowords, a set of 120 pseudowords was recorded by the same speaker. The pseudowords were created by switching consonant positions within an item (e.g., “fout” [error] was changed into “touf”), or replacing a particular consonant with another one from the same place of articulation category (e.g., “kamer” [room] was changed into “pamer”). There were nine items for which these procedures could not yield pseudowords, and for these, consonants were swapped across items (e.g., “kip” [chicken] and “vijf” [five] were changed into “kif” and “vijp”) or a consonant was replaced with one from a different place of articulation category (“bom” [bomb] was changed into “nom”). Video recordings (25 f/s) were framed as headshots and included the entire face (from shoulders upwards, including the speaker’s hair) against a black background. The audio was recorded with an external microphone attached to the camera (RØDE VideoMicro). Individual clips of each item were extracted with Adobe Premiere Pro. The extracted audio was manipulated in the Praat software [34], using the Shannon-AM-noise script by Darwin [35]. Specifically, we created 4-channel noise-vocoded speech as follows: each auditory signal was decomposed into four non-overlapping frequency bands (50–800 Hz, 800–1500 Hz, 1500–2500 Hz, and 2500–4000 Hz). Next, for each band, the amplitude envelope was extracted and combined with a Hann band-pass filtered white noise signal (the smoothing value was defined as the highest frequency/10, so including filter skirts, the filtered bands now overlapped at the frequency band boundaries) and the bands were recombined into a 4-channel noise-vocoded speech stimulus [2], see Fig 1.
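The vocoding steps described above (band splitting, envelope extraction, noise modulation, recombination) can be illustrated with a minimal sketch. This is not the Praat script used for the stimuli: it swaps the Hann band-pass filter for a simple second-order (biquad) band-pass filter and extracts envelopes by rectification plus low-pass smoothing. The function names, the 30 Hz envelope cutoff, and the fixed random seed are assumptions made for illustration only; the band edges are the ones listed in the text.

```python
import math
import random

# Band edges in Hz, as listed in the stimulus description.
BANDS = [(50, 800), (800, 1500), (1500, 2500), (2500, 4000)]


def bandpass(signal, lo, hi, fs):
    """Second-order band-pass filter (audio-cookbook biquad).

    Illustrative stand-in for the Hann band-pass filter in the Praat script.
    """
    f0 = math.sqrt(lo * hi)            # geometric centre frequency
    q = f0 / (hi - lo)                 # quality factor from the bandwidth
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0   # b1 is zero for this filter type
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def envelope(signal, fs, cutoff=30.0):
    """Amplitude envelope: full-wave rectification plus one-pole low-pass.

    The 30 Hz cutoff is an assumed value, not taken from the paper.
    """
    coeff = math.exp(-2 * math.pi * cutoff / fs)
    env, state = [], 0.0
    for x in signal:
        state = coeff * state + (1 - coeff) * abs(x)
        env.append(state)
    return env


def noise_vocode(signal, fs, bands=BANDS, seed=0):
    """Replace the fine structure in each band with band-limited noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    out = [0.0] * len(signal)
    for lo, hi in bands:
        env = envelope(bandpass(signal, lo, hi, fs), fs)
        carrier = bandpass(noise, lo, hi, fs)
        for i, (e, c) in enumerate(zip(env, carrier)):
            out[i] += e * c            # envelope-modulated noise, summed over bands
    return out
```

Calling `noise_vocode(samples, fs)` on a list of audio samples returns a list of the same length. The sampling rate `fs` should be comfortably above 8000 Hz so that the highest band (2500–4000 Hz) stays below the Nyquist frequency.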

Fig 1 — Original and noise-vocoded spectrogram for the item ‘kamer’.

The rationale for using this type of NVS was that potential context-driven auditory learning effects are, generally, maximal when the auditory signal is poor, but still contains sufficient information to allow for context-driven perceptual restoration. Prior research has indicated that NVS word identification accuracy at least doubles when moving up from 2 to 4 channels, but then increases with about half that effect size when moving from 4 to 8 channels (where accuracy tapered-off and was comparable to 16 and 32 channel NVS, [36]). Moreover, two-year-olds start to show the first signs of word recognition with 4-channel NVS (when compared to 2-channel NVS, [37]), and Senan et al. [38] showed that dual task interference from 4-channel NVS was in between, but statistically comparable to 2-channel and 6-channel NVS (the primary task was a digit-recall task), whereas interference from 4 and 6-channel NVS was also statistically comparable to natural speech. We thus assumed that 4-channel NVS has a relatively poor intelligibility, but at the same time, might contain sufficient spectral detail to accommodate context-driven learning.

For the audiovisual training stimuli, the noise-vocoded audio tracks were dubbed onto three different types of visual stimuli to create the training set: 1) the original lipread video of the speaker (size: 1920 px wide × 1080 px high), 2) printed text of the original word (Courier New font, bold typeface, size: 18 pts) and 3) a static image (referred to as still face) of the speaker (size: 1920 px wide × 1080 px high) that served as baseline. In the text condition, the orthographic item was presented 1800 ms before onset of the noise-vocoded audio and remained on the screen for the entire duration of the auditory stimulus. This was done to ensure that listeners had ample time to read the word/pseudoword, and activate the corresponding phonological and lexical code, before and during its auditory presentation. Because noise-compensation in audiovisual word processing may occur at a lexical/semantic level of processing [39], providing (more than) sufficient time to process the text before/during the AV presentation should provide participants with an optimal condition to learn from the text.

For both the word and pseudoword condition, all 120 items were randomly distributed across eight 15-item blocks. One of these was selected as the designated auditory-only familiarization block that was administered for all participants. The remaining 7 blocks of items were, for each participant, presented in random order, and randomly assigned to the AV training or auditory-only test procedures. Moreover, within each block, item order was also randomized across participants.

Design and procedure

Participants were randomly assigned to either a group that trained with words or a group that trained with pseudowords. In each group, participants were randomly assigned to one of the three different audiovisual training conditions: lipread speech, text, or a still face. In total, participants were thus assigned to six different groups in a 2 (Lexical status: words, pseudowords) × 3 (Audiovisual training: lipread, text, still face) between-subjects design. There were 21 participants in each group, except for the text group that received pseudoword training, which contained 20 participants, as mentioned before.

Participants were seated in front of a 20-inch widescreen flat panel monitor (1680 × 1050 px resolution, 60 Hz refresh rate) in a sound-attenuated booth and were instructed to attentively listen to the speech sounds while looking at the screen. Sounds were delivered through headphones (Sennheiser HD 203) at ~65 dBA (measured at ear level). The experiment was run using the E-Prime 2.0 software (available at https://pstnet.com), and total testing lasted ~25 minutes. On all trials (training and test), participants typed in what they perceived, and this response was stored on the hard drive. Trials were automatically scored as ‘correct’ when the typed input exactly matched the original item. All incorrect responses were manually checked by a native Dutch speaker (MB) to find other orthographically ‘correct’ responses that were missed in the automatic procedure (e.g., ‘noot’ [nut] spelled as the phonologically identical word ‘nood’ [need]; in Dutch, the voiced /d/ is pronounced as unvoiced /t/ when in final position).
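The two-stage scoring procedure (automatic exact matching, then manual review of the remaining responses) can be sketched as follows. This is a hypothetical illustration rather than the scoring code used in the study: the function names are invented, and trimming whitespace and ignoring letter case before comparison are assumptions that go slightly beyond the strict exact match described above.

```python
def normalize(text):
    """Lower-case and collapse whitespace (an assumed pre-processing step)."""
    return " ".join(text.strip().lower().split())


def score_exact(response, target):
    """Automatic scoring: correct only if the typed input matches the item."""
    return normalize(response) == normalize(target)


def split_for_review(trials):
    """Separate automatically correct trials from those needing manual review.

    `trials` is a list of (response, target) pairs. Responses that fail the
    exact match are not counted as wrong yet; they are returned for a human
    rater, who may accept homophonous spellings such as 'nood' for 'noot'.
    """
    correct, to_review = [], []
    for response, target in trials:
        if score_exact(response, target):
            correct.append((response, target))
        else:
            to_review.append((response, target))
    return correct, to_review


# Hypothetical responses for three items from the stimulus set.
example_trials = [("kamer", "kamer"), ("nood", "noot"), ("", "kip")]
auto_correct, needs_review = split_for_review(example_trials)
```

In this example only the first trial is scored correct automatically; the ‘nood’ response is passed on for manual review, where a rater could accept it as phonologically identical to ‘noot’.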

All participants first received a familiarization block (15 auditory-only trials) followed by a pretest (henceforth T1; 15 auditory-only trials). Next, there were three AV training blocks (Training 1, Training 2, and Training 3; 15 items per block) interspersed with auditory-only test blocks named T2, T3, and T4 (15 trials per block, see Fig 2).

Fig 2 — The familiarization phase and T1 consisted of 15 auditory words. Training blocks consisted of 15 audiovisual words or pseudowords, combined with either a dynamic face (i.e., lipread information), text, or a still face. Each training block was followed by an auditory-only test block (T2, T3 and T4) of 15 words. In total, there were three AV training/test blocks. After each training or test item, participants typed in what they had heard.

In total, 120 items were presented across 8 blocks of 15 unique items. Block order was counterbalanced across participants, except for the familiarization block, which was the same for all participants. In the pseudoword training groups, participants never received the two items from a particular word–pseudoword pair: if, for example, the pseudoword “pamer” (derived from the Dutch word “kamer” [room]) was presented during AV training, the word “kamer” was never used during the auditory test blocks. Auditory test words were randomly assigned to participants, and never repeated.
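The block structure can be made concrete with a short sketch. It assumes that items are identified by an index from 0 to 119 (each index standing for one word and its derived pseudoword), that the first block serves as the fixed familiarization block, and that the remaining seven blocks are simply shuffled per participant. The function names and seeds are hypothetical, and plain shuffling is a simplification of the counterbalancing described in the text.

```python
import random

# Order of the seven blocks that follow familiarization.
PHASES = ["T1", "Training 1", "T2", "Training 2", "T3", "Training 3", "T4"]


def make_blocks(n_items=120, block_size=15, seed=0):
    """Randomly distribute item indices over blocks of equal size."""
    rng = random.Random(seed)
    items = list(range(n_items))
    rng.shuffle(items)
    return [items[i:i + block_size] for i in range(0, n_items, block_size)]


def participant_schedule(blocks, participant_id):
    """Build one participant's sequence of (phase, items) pairs.

    The first block is the familiarization block for everyone; the other
    blocks are assigned to phases in a participant-specific random order,
    and item order is shuffled within every block.
    """
    rng = random.Random(participant_id)
    familiarization, remaining = blocks[0], list(blocks[1:])
    rng.shuffle(remaining)
    schedule = [("Familiarization", list(familiarization))]
    schedule += [(phase, list(block)) for phase, block in zip(PHASES, remaining)]
    for _, block_items in schedule:
        rng.shuffle(block_items)
    return schedule


blocks = make_blocks()
schedule = participant_schedule(blocks, participant_id=1)
```

Because the blocks partition the item set, each index occurs in exactly one phase. Under the assumption that an index stands for a word–pseudoword pair, a pseudoword presented during training and the word it was derived from can therefore never both be presented to the same participant.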

Analysis

All data were analyzed in R (version 4.10) using the lme4 package, version 1.1–27 [40]. Data were analyzed using generalized linear mixed effects models of the binomial family, which were fitted to the data by maximum likelihood estimation (Laplace Approximation) using the logit link function and the optimizer ‘bobyqa’. Significant main and interaction effects were further examined with post-hoc pairwise comparisons (two-tailed, Holm-Bonferroni corrected p values) on the model-predicted estimated means using the R package lsmeans (version 2.30–0).
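For readers unfamiliar with the Holm-Bonferroni correction applied to the post-hoc comparisons, the following sketch shows the step-down adjustment on a set of raw p values. The analyses themselves were run in R; this Python function is only an illustration of the procedure, and the p values in the example are made up.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p values in the original order.

    The smallest p value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

Compared with a plain Bonferroni correction, which multiplies every p value by the number of comparisons, the step-down procedure is less conservative while still controlling the family-wise error rate.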

Results

The grand average proportions of correct responses during test and training blocks for each training condition (lipread speech, text, still face) and lexical status (words, pseudowords) are shown in Fig 3. Note that accuracy at test is rather low. This was expected, given that Pilling et al. [29] reported accuracy values of around 50% when participants identified keywords embedded in 8-channel auditory noise-vocoded sentences (as opposed to the 4-channel auditory noise-vocoded single items that we presented here) after AV training that included text or lipread speech.

Fig 3 — Grand average proportion of correct responses during test and training blocks for each type of training conditions (lipread speech, text, still face) and lexical status (words, pseudowords). The upper panels show accuracy of auditory-only word recognition during test blocks for participants trained with words (panel A) or pseudowords (panel B). The lower panels show accuracy during the audiovisual training blocks for words (panel C) or pseudowords (panel D). Error bars represent one standard error of the mean.

Performance during auditory-only test blocks

Visual inspection of the data obtained from the auditory-only test blocks showed that overall performance was more accurate if participants were trained with words rather than pseudowords, a lexical boost of ~6 percentage points (i.e., ~3–4 items out of the total set of 60 A-only test items, see Fig 3, upper panels). Accuracy also improved across consecutive test blocks, and this effect was largest when participants were trained with lipread speech, intermediate when trained with text, and close to zero when trained with a still face.

To test these observations more formally, a generalized linear mixed effects logistic model was fitted to the data. The model included fixed effects for Training type (lipread speech, text, still face), Lexical status (words, pseudowords) and Test block (1, 2, 3, 4). We used the maximal random effect structure supported by the data, with by-subject and by-item random intercepts. Training type was dummy-coded such that training with a still face was set as the reference category for this factor. The factors Lexical status and Test block were recoded such that their values were centered around 0 (i.e., pseudowords and words were recoded into -1 and 1; T1, T2, T3 and T4 were recoded into -1.5, -0.5, 0.5 and 1.5). This ensured that all levels of Lexical status and Test block were considered in the fitted coefficients for the main effects and interactions including these factors. As a result, the fitted coefficient for Lexical status could be interpreted as the difference in correct responses (in log-odds) between training with words vs. pseudowords. Similarly, the fitted coefficient for Test block reflected the main effect of this factor. The fitted model was: Correct ~ 1 + Training type × Test block × Lexical status + (1 | subject) + (1 | item). Fixed effect coefficient estimates are shown in Table 1.

Table 1. Fixed effect coefficient estimates for the generalized linear mixed effects model fitted on the auditory-only test data.

Correct ~ 1 + Training type × Test block × Lexical status + (1 | subject) + (1 | item).

Fixed factor	Estimate	SE	z-value	p
(Intercept)	-2.50	0.23	-10.85	< .001***
Training type (lipread speech)	0.46	0.14	3.56	< .001***
Training type (text)	0.12	0.14	0.85	.40
Test block	0.11	0.06	1.98	.048*
Lexical status	0.32	0.10	3.26	.001**
Training type (lipread speech) × Test block	0.34	0.08	4.30	< .001***
Training type (text) × Test block	0.27	0.08	3.33	< .001***
Training type (lipread speech) × Lexical status	0.04	0.14	0.30	.76
Training type (text) × Lexical status	-0.00	0.14	-0.07	.95
Test block × Lexical status	0.00	0.06	0.06	.95
Training type (lipread speech) × Test block × Lexical status	0.11	0.08	1.48	.14
Training type (text) × Test block × Lexical status	0.10	0.08	1.24	.22

* p < .05; ** p < .01; *** p < .001. SE: standard error.

The model revealed a significant intercept (b = −2.50, SE = 0.23, p < .001), indicating an overall bias towards an incorrect response when participants were trained with a still face, which fits the overall response distribution (see Fig 3, upper panels). There was a main effect of Lexical status (b = 0.32, SE = 0.10, p = .001), indicating that overall test performance was higher when listeners were trained with words rather than pseudowords. Overall test performance was significantly higher after training with lipread speech when compared to training with a still face (b = 0.46, SE = 0.14, p < .001), while there was no difference in overall performance between training with text and a still face (b = 0.12, SE = 0.14, p = .40). In addition, there was a main effect of Test block (b = 0.11, SE = 0.06, p = .048), indicating that, on average, listeners were more likely to correctly identify the speech sounds with each successive test block. However, the model also showed that, compared to training with a still face, this improvement over time was larger when participants were trained with lipread speech (b = 0.34, SE = 0.08, p < .001) and text (b = 0.27, SE = 0.08, p < .001). There were no other significant main or interaction effects (ps > .13).
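To make the log-odds coefficients easier to read, the sketch below converts the fixed effects from Table 1 into predicted probabilities of a correct response for any combination of training type, test block, and lexical status, using the coding scheme described above. It is an illustration under simplifying assumptions: it uses the rounded estimates as printed, sets both random intercepts to zero, and therefore yields conditional predictions for a typical participant and item, not the observed group means. The dictionary keys and the function name are hypothetical.

```python
import math

# Rounded fixed-effect estimates copied from Table 1 (log-odds scale).
COEF = {
    "intercept": -2.50,
    "lipread": 0.46, "text": 0.12,
    "block": 0.11, "lexical": 0.32,
    "lipread:block": 0.34, "text:block": 0.27,
    "lipread:lexical": 0.04, "text:lexical": -0.00,
    "block:lexical": 0.00,
    "lipread:block:lexical": 0.11, "text:block:lexical": 0.10,
}
BLOCK_CODE = {"T1": -1.5, "T2": -0.5, "T3": 0.5, "T4": 1.5}
LEXICAL_CODE = {"pseudowords": -1, "words": 1}


def predicted_probability(training, test_block, lexical_status):
    """Predicted P(correct) with random effects fixed at zero.

    `training` is "lipread", "text", or "still face" (the reference level).
    """
    b = BLOCK_CODE[test_block]
    lex = LEXICAL_CODE[lexical_status]
    logit = (COEF["intercept"] + COEF["block"] * b + COEF["lexical"] * lex
             + COEF["block:lexical"] * b * lex)
    if training != "still face":
        # Dummy-coded terms apply only to the non-reference training types.
        logit += (COEF[training]
                  + COEF[training + ":block"] * b
                  + COEF[training + ":lexical"] * lex
                  + COEF[training + ":block:lexical"] * b * lex)
    return 1.0 / (1.0 + math.exp(-logit))


# Predicted learning trajectories for participants trained with words.
trajectories = {
    training: [predicted_probability(training, t, "words") for t in BLOCK_CODE]
    for training in ("lipread", "text", "still face")
}
```

For example, the linear predictor for lipread training with words at T4 is −0.84, which corresponds to a predicted probability of about .30, whereas the trajectory for the still face condition stays nearly flat across test blocks.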

Post-hoc pairwise comparisons on the model-predicted means revealed that the effect of Test block was significant for training with lipread speech and with text (ps < .001), but not in the still face condition (p = .29). Importantly, these effects were not due to performance differences between training types during the first test block (ps > .38). After the second and the third (final) training block (i.e., at T3 and T4), performance after training with lipread speech was significantly higher than after training with a still face (T3: b = 0.63, SE = 0.14, p < .001; T4: b = 0.97, SE = 0.18, p < .001) or text (T3: b = 0.38, SE = 0.14, p = .014; T4: b = 0.45, SE = 0.17, p = .01). After the final training block (at T4), recognition performance was also significantly higher after training with text compared to training with a still face (b = 0.52, SE = 0.18, p = .007).

Performance during audiovisual training blocks

Visual inspection of the data from the training blocks showed that overall performance was more accurate if participants were trained with words compared to pseudowords (see Fig 3, lower panels). Intelligibility of the speech sounds was highest when they were accompanied by text, intermediate when accompanied by lipread speech, and lowest when a still face was presented simultaneously.

These observations were confirmed by a generalized linear mixed effects logistic model similar to the one fitted on the auditory-only test data. The model included fixed effects for Training type (lipread speech, text, still face), Lexical status (words, pseudowords) and Training block (1, 2, 3). The maximal random effect structure supported by the data was used, with by-subject and by-item random intercepts. Training type was dummy-coded such that training with a still face was set as the reference category. The factors Lexical status and Training block were recoded such that their values were centered around 0. The fitted model was: Correct ~ 1 + Training type × Training block × Lexical status + (1 | subject) + (1 | item). Fixed effect coefficient estimates are shown in Table 2.

Table 2. Fixed effect coefficient estimates for the generalized linear mixed effects model fitted on the training data.

Correct ~ 1 + Training type × Training block × Lexical status + (1 | subject) + (1 | item).

Fixed factor	Estimate	SE	z-value	p
(Intercept)	-3.66	0.29	-12.42	< .001***
Training type (lipread speech)	2.84	0.35	8.04	< .001***
Training type (text)	6.66	0.39	17.15	< .001***
Training block	-0.13	0.17	-0.74	.46
Lexical status	1.74	0.29	5.98	< .001***
Training type (lipread speech) × Training block	0.36	0.19	1.92	.055
Training type (text) × Training block	0.63	0.20	3.14	.002**
Training type (lipread speech) × Lexical status	-0.00	0.35	-0.003	.99
Training type (text) × Lexical status	-1.39	0.37	-3.75	< .001***
Training block × Lexical status	0.21	0.17	1.27	.21
Training type (lipread speech) × Training block × Lexical status	-0.23	0.19	-1.23	.21
Training type (text) × Training block × Lexical status	-0.42	0.20	-2.11	.035*

* p < .05; ** p < .01; *** p < .001. SE: standard error.

The model revealed a significant main effect for the intercept (b = −3.66, SE = 0.29, p < .001), indicating an overall bias towards an incorrect response when participants were trained with a still face. There was a significant main effect of Lexical status, b = 1.74, SE = 0.29, p < .001, indicating that recognition of words was noticeably better than recognition of pseudowords. Overall performance was higher during training with text (b = 6.66, SE = 0.39, p < .001) and lipread speech (b = 2.84, SE = 0.35, p < .001) when compared to the training with a still face. There were no other main or interaction effects (ps > .21).
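Because the coefficients are reported in log-odds, it can help to express them as odds ratios. The sketch below exponentiates an estimate and adds an approximate 95% Wald interval (estimate ± 1.96 × SE). These intervals are not reported in the paper; they are computed here from the rounded values in Table 2 purely as an illustration, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(estimate, se, z_crit=1.96):
    """Convert a log-odds estimate into an odds ratio with a Wald interval."""
    lower = math.exp(estimate - z_crit * se)
    upper = math.exp(estimate + z_crit * se)
    return math.exp(estimate), lower, upper


# Training type contrasts against the still face reference (Table 2).
or_lipread = odds_ratio_with_ci(2.84, 0.35)
or_text = odds_ratio_with_ci(6.66, 0.39)
```

Under these assumptions, the odds of a correct response during training with lipread speech were roughly 17 times the odds during training with a still face, with an approximate interval from about 9 to 34. Because Training type is dummy-coded while the other factors are centered, this contrast applies at the midpoint of Training block and Lexical status.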

Post-hoc pairwise comparisons on the model-predicted means showed that performance significantly differed between all three types of training (ps < .001), such that overall recognition performance was highest during training with text, intermediate during training with lipread speech, and lowest during training with a still face. There was a significant two-way interaction between training with text and Training block (b = 0.63, SE = 0.20, p = .002), suggesting an increase in accuracy over time during training with text. However, there was also a three-way interaction between training with text, Training block and Lexical status (b = −0.42, SE = 0.20, p = .035; see Table 2). Post-hoc simple effects analysis showed that the effect of Training block was significant for training with text and pseudowords (p < .001), but not for training with text and words (p = .33).

Discussion

Experiment 1 showed that auditory learning of noise-vocoded speech was larger if listeners were trained with words rather than pseudowords (a lexical boost). Secondly, auditory learning of noise-vocoded speech was largest if listeners were trained with lipread speech (a lipread boost), intermediate if trained with text, and almost absent if trained with a still face. Words with lipread speech, rather than words with text, were thus the best guides for auditory speech learning. This is intriguing if one considers that during training, words with lipread speech were more difficult to recognize than words with text. So, despite the fact that lipread speech was more difficult to decode than text, it was nevertheless the best guide. Perhaps the perceived difficulty of the AV lipread training trials (when compared to the AV text condition) provided listeners with a more engaging or motivating learning context, but since previous work had revealed no differences between text and lipread training contexts [29], this tentative explanation of the current results requires an in-depth follow-up investigation. Presumably though, there are differences in binding–or integration–of text versus lipread speech with the audio, and we will return to this issue in the General Discussion.

The results from Experiment 1 further indicate that the learning effect in the lipread condition built up somewhat quicker than in the text condition: averaged across words/pseudowords, performance in the lipread condition was higher than in the text and still face condition in test blocks 3 and 4 (i.e., from 30 AV training trials onwards), whereas performance in the text condition was higher than in the still face condition in test block 4 only (after receiving 45 AV training trials). So lipread speech not only provided stronger learning effects than text, but these effects also started earlier.

Early and rapid improvements in performance–as observed here–are mostly driven by perceptual learning (as opposed to procedural learning, see [41]), and such effects may dissipate quite quickly (i.e., auditory training regimes that produce longer-lasting (rehabilitation) effects typically span a longer time-frame [42]). However, because it is actually not clear whether the short-term auditory learning effects we observed vanish in minutes, or possibly reflect much longer-term effects lasting for days, weeks, or even months, we re-invited the participants from Experiment 1 who were originally trained with words (because they had the largest training effects and because this is the most natural situation encountered in real life) to test their auditory-only word recognition after a ~45-day rest period.

Experiment 2

Participants from Experiment 1 who were originally trained with words were asked to return to the lab ~45 days later. During this session, they received no training and were thus only tested with the auditory-only noise-vocoded words they had heard in Experiment 1.