Working memory training to improve speech perception in noise across languages

Erin M Ingvalson; Sumitrajit Dhar; Patrick C M Wong; Hanjun Liu

doi:10.1121/1.4921601

2015 Jun;137(6):3477–3486.

PMCID: PMC4474942 PMID: 26093435

Abstract

Working memory capacity has been linked to performance on many higher cognitive tasks, including the ability to perceive speech in noise. Current efforts to train working memory have demonstrated that working memory performance can be improved, suggesting that working memory training may lead to improved speech perception in noise. A further advantage of working memory training to improve speech perception in noise is that working memory training materials are often simple, such as letters or digits, making them easily translatable across languages. The current effort tested the hypothesis that working memory training would be associated with improved speech perception in noise and that materials would easily translate across languages. Native Mandarin Chinese and native English speakers completed ten days of reversed digit span training. Reading span and speech perception in noise both significantly improved following training, whereas untrained controls showed no gains. These data suggest that working memory training may be used to improve listeners' speech perception in noise and that the materials may be quickly adapted to a wide variety of listeners.

I. INTRODUCTION

Recent years have seen a surge of interest in cognitive training. Among other claims, improving one's cognitive performance has been suggested to lead to improved academic performance (Klingberg, 2010), improved language learning (Ingvalson and Wong, 2013), and reduced incidence of dementia (Willis et al., 2006).

Much of the interest in cognitive training has focused on improving working memory. Training working memory has received particular attention because larger working memory capacities have been linked to better academic performance, better language learning, and reduced incidence of pathological aging (Klingberg, 2010; Morrison and Chein, 2010). Additionally, it has become apparent that working memory plays a role in listeners' ability to perceive speech in noisy situations. For both individuals with normal hearing and individuals who use hearing aids, listeners with larger working memory capacities appear to have more success perceiving speech in noise. Working memory capacity successfully predicts speech recognition in noise by normal hearing older adults (Gordon-Salant and Fitzgibbons, 1997; Parbery-Clark et al., 2009). In older adults who use hearing aids, greater working memory capacity also predicts more success recognizing speech in noise (Foo et al., 2007; Lunner and Sundewall-Thorén, 2007). Within an individual listener, working memory capacity is correlated with degree of listening effort and success perceiving speech in noise (Koelewijn et al., 2014; Zekveld et al., 2011).

Increasingly frequent complaints of speech perception difficulties in background noise have drawn attention to the relationship between working memory and speech perception in noise. Listeners with hearing aids, users of cochlear implants, as well as older adults with normal hearing, all complain of difficulty perceiving speech in noisy situations. As the population ages (Kinsella and He, 2008), more individuals are likely to suffer from difficulties in speech perception in noise, making effective interventions increasingly necessary. Knowing that working memory capacity predicts speech perception in noise success suggests that one avenue for intervention may be cognitive training: increasing listeners' working memory capacity may lead to improved speech perception in noise performance.

Current knowledge supports the hypothesis that working memory capacity can be improved. Older adults with no known cognitive impairment show increases in working memory capacity following working memory training (Bherer et al., 2005; Li et al., 2008). More generalized cognitive training, which emphasizes working memory capacity, speed of processing, inhibitory control, and semantic memory, has also been shown to improve working memory capacity in healthy older adults (Ball et al., 2002). Older adults with a diagnosis of mild cognitive impairment, a precursor to dementia, also show a benefit of cognitive training. Following training, older adults with mild cognitive impairment showed gains not only on measures of working memory but also on measures of cognitive health such as the Mini Mental State Exam (Folstein et al., 1975) and the Dementia Rating Scale (O'Bryant et al., 2008).

However, though the available data support the claim that working memory capacity can be increased in older adults, it is less clear whether this increase would lead to improvements in speech perception in noise. The extent to which cognitive training gains transfer to tasks that are only distally related to the training, called far transfer, remains an open question. Some studies have found success with far transfer to skills such as inhibitory control (Persson and Reuter-Lorenz, 2008) or reading comprehension (Chein and Morrison, 2010). On the other hand, other studies have found little evidence of far transfer following working memory training (Dahlin et al., 2008; Schmiedek et al., 2010). A direct comparison of far transfer in younger and older adults suggested that only younger adults may benefit from far transfer (Schmiedek et al., 2010). Supporting the potential weakness of far transfer in older adulthood, both a review and a meta-analysis of cognitive training in older adulthood found little evidence of benefit in distally related tasks post-training (Morrison and Chein, 2010). A recent meta-analysis of working memory training, although not specific to older adults, also found little evidence of far transfer across the trained populations (Melby-Lervåg and Hulme, 2013). Examining the lists of tasks included, however, reveals that assessments of far transfer in older adults often include tasks that, like working memory, decline with age [e.g., fluency tasks (Dahlin et al., 2008)] or that have been shown to share neural substrates with working memory [e.g., episodic memory tasks (Cabeza et al., 2002)], but for which no behavioral connection has been made between working memory performance and performance on the tested task. There is therefore little theoretical reason to expect increases in working memory to lead to increases in these tasks, other than an expectation that training working memory should lead to across-the-board increases in cognitive ability (Shipstead et al., 2012). Evidence of far transfer is apparent where links have been made between greater working memory capacity and greater performance [e.g., reading comprehension (Morrison and Chein, 2010)]. The mismatch between what can theoretically be expected to improve following training and the distal outcome measures was also noted in a recent review of auditory training (Henshaw and Ferguson, 2013). Henshaw and Ferguson concluded that more research into auditory training is needed and that assessments of auditory training should include measures that will capture a clinically significant improvement in listening ability. Looking over the data in these reviews, it is apparent that both working memory and auditory training studies would benefit from more theoretically driven outcome measures, which will provide a reliable indicator of whether far transfer is possible. In the present study, we note that behavioral connections have been made between working memory capacity and speech perception in noise (Foo et al., 2007; Lunner, 2003; Lunner and Sundewall-Thorén, 2007), leading to a theoretical connection between the two (Rönnberg et al., 2008). We therefore hypothesize that improvements in working memory will produce far transfer and lead to improvements in speech perception in noisy situations.

A secondary goal of working memory training to improve speech perception in noise is its potential transferability to additional language environments. Commercially available speech perception in noise training programs such as LACE or SPATS (Miller et al., 2008; Sweetow and Sabes, 2006) are successful in improving speech perception in noise, but they rely extensively on sentence-level speech items. Currently these materials are only available in American English, limiting their utility to listeners who are not proficient in English or who do not speak American English as their native dialect (Brouwer et al., 2012; Van Engen and Bradlow, 2007). Conversely, many working memory training programs rely on very simple items, such as digits, letters, or single-syllable words (Klingberg, 2010; Morrison and Chein, 2010). Sentence-level materials would take a large amount of effort to translate across languages, necessitating checks to ensure items are grammatically correct, culturally relevant, and produced by native speakers (Brouwer et al., 2012; Wong et al., 2008), whereas translating typical working memory items across languages requires only production by native speakers. Furthermore, digits, letters, and single words are ecologically valid in that they are readily familiar to listeners with hearing impairments. Because they do not require a high semantic load, they are also relatively robust against hearing loss (Smits et al., 2013; Wilson et al., 2010), ensuring that the listener is engaging in a working memory task and not a speech perception task. We therefore hypothesize that working memory training will demonstrate speech-in-noise benefits in multiple languages.

In addition to the ecological validity of digits, letters, and single words for speech in noise testing and training (Smits et al., 2013; Wilson et al., 2010), there is the concern that sentence-level auditory materials may not be optimal for improving working memory. Traditionally, training working memory has been thought of as attempting to increase the capacity of the central executive (Baddeley, 2003; Klingberg, 2010). In commercially available products, listeners are asked to guess what the missing word in a sentence might be or to indicate which word appeared before a target word in a sentence. These tasks rely heavily on the ability to extract meaning from the sentence. Though extracting meaning is important for understanding speech in noise, it is not a working memory task in itself. Instead, the task of extracting meaning, and thereby filling in missing words or finding target words, requires interactions of working memory, lexical access, and access of long-term memory (Hickok and Poeppel, 2007). Improvement may therefore be a result of listeners' increased practice filling in information missed in noise rather than of true working memory gains. Here we aim to improve listeners' working memory performance using the simple stimuli customary among working-memory training paradigms. Simultaneously, our training paradigm provides practice listening to speech in a variety of noise types, including steady-state and environmental noise at two signal-to-noise ratios (SNRs). The speech-perception-in-noise practice was introduced to avoid asymptotic performance on the working memory task, but it provides the additional benefit of allowing listeners to practice perceiving speech in noise while increasing auditory working memory capacity. We differentiate the practice perceiving speech in noise in earlier studies from the present study by noting that those studies used whole sentences, whereas we use only digits. Although the recognition of whole sentences in noise may utilize multiple memory systems, we suggest that rehearsing and storing a string of digits will utilize only working memory (specifically, the phonological loop; Baddeley, 1986). Consequently, we hypothesize that the training paradigm will be associated with gains on both working memory measures and measures of speech perception in noise.

When developing an intervention for speech perception in noise, one consideration is the outcome measure. This consideration is compounded when the intervention is intended to be translated across multiple languages, as is the case here. In the present study, we opted to assess speech-perception-in-noise gains in Mandarin via the M-HINT and in English via the QuickSIN. The QuickSIN has repeatedly been shown to be more sensitive than the HINT as a measure of speech perception in noise (Duncan and Aarts, 2006; Wilson et al., 2007) particularly for listeners with normal hearing (Parbery-Clark et al., 2009). Ideally, therefore, we would use the QuickSIN for assessment in both language environments, but the M-HINT is the only speech-perception-in-noise assessment currently available in Mandarin Chinese (Wong et al., 2008). We therefore opted to use the more sensitive test when available to better capture how working memory training might lead to speech perception in noise gains. We note that performance on the HINT and performance on the QuickSIN are closely correlated (Duncan and Aarts, 2006; Weber et al., 2010), suggesting that gains seen on one test are likely to also be seen on the other and to be of similar magnitude (Weber et al., 2010).

The present study trained working memory using digits. Based on the demonstrated relationship between working memory capacity and speech perception in noise, we hypothesize that training will be associated with gains in both working memory and speech perception in noise. We further hypothesize that digits will be easily translated across languages and that we will see gains in more than one linguistic context. To test these hypotheses, we conducted two experiments in two language environments, one in Mandarin Chinese and one in English. In both experiments, we found that backward digit-span training was associated with speech-perception-in-noise gains.

II. EXPERIMENT 1

A. Methods

1. Participants

Twenty-five native Mandarin Chinese speakers (15 female, 22.08 ± 1.93 yr, mean ± SD) were recruited and tested at Sun Yat-sen University in China. Participants self-reported normal hearing and had no known cognitive deficits. Fifteen participants (eight female) were assigned to the training group; ten (seven female) were assigned to the control group. Group assignments were random; trained listeners were simultaneously enrolled in an additional study of speech perception in noise that used identical protocols but required a larger number of subjects, resulting in the disparity in group size. There were no differences between the trained and control participants in gender, level of education, or age (all p > 0.05).

2. Materials

A native male Mandarin speaker recorded the digits one through nine. Recordings were sampled at 44 100 Hz with 16-bit resolution, then RMS matched to a 70 dB sound pressure level (SPL) pure tone at 1 kHz. Two types of noise were added to the recordings to create two distinct speech-in-noise environments. Using MATLAB, we created steady-state noise shaped to match the long-term average spectrum of the recordings. The steady-state noise was mixed with the target recordings in Adobe Audition to create stimuli with SNRs of −5 and −10 dB. Pilot testing indicated that these SNR levels were challenging but left the digits sufficiently perceptible that the noise did not interfere with digit rehearsal for the working memory task.
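
The level matching and mixing described above can be illustrated with a minimal sketch. It assumes SNR is defined from RMS amplitudes as 20·log10(RMS of speech / RMS of noise) and that the noise, not the speech, is rescaled. The function names and the toy sine-wave signals are hypothetical; the study itself performed these steps in MATLAB and Adobe Audition, and this is not a reproduction of that workflow.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def scale_to_rms(samples, target_rms):
    """Rescale samples so that their RMS equals target_rms."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise to the requested SNR (in dB) and add it to the speech.

    Assumes SNR = 20*log10(rms(speech) / rms(noise)); the speech level is unchanged.
    Returns the mixture and the scaled noise.
    """
    noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scaled_noise = scale_to_rms(noise, noise_rms)
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Toy signals standing in for a digit recording and a noise token (hypothetical data).
speech = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(4410)]
noise = [math.sin(2 * math.pi * 97 * t / 44100 + 1.0) for t in range(4410)]

mixture, scaled_noise = mix_at_snr(speech, noise, -5)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

At a negative SNR such as −5 dB the scaled noise has a higher RMS than the speech, which is why these conditions are demanding for the listener.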

We obtained a set of 27 non-speech sounds (e.g., mechanical, human non-speech, animal, and musical) from online databases. Non-speech distractors were chosen because they serve as a fluctuating masker and have cross-linguistic informational masking value (Garcia Lecumberri and Cooke, 2006). Although the informational masking of non-speech sounds is not likely to be as great as with speech, using speech as a masker would require the development of new maskers for each language environment; non-speech maskers thereby allow for more rapid transfer across languages. Twelve sounds were animal sounds (e.g., bird song, dog bark), two sounds were mechanical (e.g., clock ticking, gun shot), five sounds were non-speech human (e.g., cough, laugh), and eight sounds were musical (e.g., violin, flute). These sounds were cropped to 1 s, RMS matched in amplitude to the calibration tone, then mixed with the target recordings at −5 and −10 dB SNR. All non-speech distractors were mixed with all targets at all SNR levels, for a total of 468 stimulus combinations.

In both the steady-state and non-speech noise conditions, total stimulus duration was 1 s. Target onset was 15 ms after distractor onset, and target offset was 15 ms before stimulus offset. Presentation level for the final stimulus was set to 70 dB SPL.

Speech perception in noise performance was assessed using the Mandarin Chinese HINT (M-HINT; Wong et al., 2008). Performance on the HINT is measured as a reception threshold for sentences (RTS). Working memory performance was assessed using a Chinese version of the reading span test (Daneman and Carpenter, 1980). Both the reading span task used for assessment and the backward digit span task used for training are considered complex working memory tasks (Daneman and Merikle, 1996), suggesting that they tap the same working memory mechanisms; training on one task is therefore expected to transfer to the other (Baddeley, 2003; Chein and Morrison, 2010).

3. Procedure

Total experimental duration was 12 days. Participants assigned to the control group completed the pre- and post-tests on the 1st and 12th day, and made no contact with the experimenters on the intervening days. Participants assigned to the training group trained on the 2nd through 11th days.

a. Testing.

Before testing, audibility of the speech in noise was verified for all listeners. List 1 of the M-HINT was presented in quiet. Following correct identification of the list, participants indicated their ability to perceive the M-HINT noise in isolation. Loudness levels for both targets and noise were adjusted until listeners correctly identified all target words in quiet and indicated perception of the noise. All participants were able to perceive both the target and the noise at the recommended presentation level of 65 dB A.

At the pre- and post-test participants completed the M-HINT and the reading span test. Three M-HINT lists were chosen at random for each test presentation with the constraint that different lists be used at pre- and post-test within subject. All lists were presented via Etymotic ER1 insert earphones in a sound-attenuated booth. An experimenter outside the booth scored keywords repeated correctly according to standard procedure (Wong et al., 2008). Prior to completing the three test lists, participants heard and repeated one practice list to ensure audibility and familiarize participants with the task.

The reading span test was given and scored according to standard procedures (Daneman and Carpenter, 1980). Participants read a sentence, indicated whether the sentence made semantic sense or not, then remembered the final word in the sentence. After reading all the sentences in a list, participants recalled all the final words from that list. There were three lists of the same length per block. Reading span was defined as the last block on which the participant correctly recalled all the items on two of three lists.
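
The scoring rule above can be written as a short function. This is a sketch under stated assumptions: blocks are administered in order of increasing list length, the span is reported as the list length of the last block passed, and "two of three lists" is read as at least two. The function name, the data layout, and the example data are hypothetical.

```python
def reading_span(blocks):
    """Return the list length of the last block passed, or 0 if no block was passed.

    blocks: list of (list_length, [bool, bool, bool]) tuples in the order administered,
    where each bool marks whether every final word on that list was recalled.
    A block counts as passed when at least two of its three lists were fully recalled.
    """
    span = 0
    for list_length, perfect_lists in blocks:
        if sum(perfect_lists) >= 2:
            span = list_length
    return span

# Hypothetical participant: passes list lengths 2 to 4, fails at length 5.
example_blocks = [
    (2, [True, True, True]),
    (3, [True, False, True]),
    (4, [True, True, False]),
    (5, [False, True, False]),
]
example_span = reading_span(example_blocks)
```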

b. Training.

Training was implemented in E-Prime (Psychology Software Tools, Pittsburgh, PA) using a backward digit span paradigm (i.e., if the participant heard, “1-2-3-4,” the correct response was, “4-3-2-1”). On each trial, participants heard a list of digits presented over Sennheiser HD-250 headphones in a sound-attenuated booth. At the end of the list, the participant was visually cued by the computer to respond. No feedback was given on individual trials.

We developed an adaptive training metric that adjusted the length of digit span according to listener ability. There were seven digit lists per block. If block accuracy was greater than 70%, span length on the subsequent block was increased by one digit. Conversely, if block accuracy was less than 50%, span length on the subsequent block was decreased by one digit. Ten blocks were completed each day. Each day's first span length was the same as the last span length of the previous training day; span length on the first block of the first day was set to 2.
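
The adaptive rule can be sketched as follows. The thresholds (above 70% to step up, below 50% to step down), the seven lists per block, the ten blocks per day, and the starting span of 2 come from the text. The function names, the floor of one digit (`min_span`), and the example accuracy counts are assumptions made for illustration and are not taken from the E-Prime implementation.

```python
def is_correct(presented, response):
    """A backward digit span trial is correct when the response reverses the presented list."""
    return list(response) == list(reversed(presented))

def next_span(span, n_correct, n_lists=7, min_span=1):
    """Apply the block-level rule: >70% correct adds a digit, <50% removes one.

    min_span is an assumed floor; the text does not state a minimum span length.
    """
    accuracy = n_correct / n_lists
    if accuracy > 0.70:
        return span + 1
    if accuracy < 0.50:
        return max(min_span, span - 1)
    return span

def run_day(start_span, correct_per_block):
    """Return the span length used on each block of one training day."""
    span = start_span
    spans = []
    for n_correct in correct_per_block:
        spans.append(span)
        span = next_span(span, n_correct)
    return spans

# Hypothetical first day: ten blocks, starting at a span of 2.
day_spans = run_day(2, [7, 6, 5, 4, 3, 5, 6, 2, 4, 7])
next_day_start = day_spans[-1]  # the next day begins at the last span length used
```

With seven lists per block, five or more correct lists (71% and above) raise the span, four correct lists (57%) leave it unchanged, and three or fewer (43% and below) lower it.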

Pilot testing demonstrated that participants quickly reached asymptotic performance when performing the digit span task in quiet. We therefore implemented training in noise to maintain task engagement. On training days one and two, participants trained in quiet. Training days three and four were done in steady-state noise at −5 dB SNR. Training days five and six were done in non-speech distractors at −5 dB SNR; distractors were chosen at random for each digit presentation by the training program. Days seven through ten replicated training days three through six, but at −10 dB SNR.
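
For reference, the ten-day schedule can be expressed as a lookup from training day to listening condition. This is only a sketch of the schedule as described in the text; the function name and condition labels are hypothetical. Training days 1 through 10 correspond to experimental days 2 through 11.

```python
def training_condition(day):
    """Return (noise_type, snr_db) for training day 1-10; snr_db is None in quiet."""
    if not 1 <= day <= 10:
        raise ValueError("training ran for ten days")
    if day <= 2:
        return ("quiet", None)
    snr_db = -5 if day <= 6 else -10
    # Within each four-day cycle: two days of steady-state noise,
    # then two days of non-speech distractors.
    position = (day - 3) % 4
    noise_type = "steady-state" if position < 2 else "non-speech distractors"
    return (noise_type, snr_db)

schedule = {day: training_condition(day) for day in range(1, 11)}
```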

B. Results

Figure 1 demonstrates the training gains. Participants made consistent increases in their digit spans throughout training. An analysis of variance (ANOVA) for linear trends was significant, F(1,8) = 379.15, p < 0.001, confirming the upward trajectory.

FIG. 1. — Mean backward digit span performance for the native Mandarin speakers as a function of training day. Training days 3, 4, 7, and 8 were presented in speech-shaped noise, and days 5, 6, 9, and 10 were presented in multiple distractors (e.g., nature sounds, music). Training days 3–6 were presented at −5 dB SNR and days 7–10 were at −10 dB SNR. Error bars are standard error of the mean.

We entered the data from the M-HINT into a 2 (group: trained vs control) × 2 (session: pre- vs post-test) mixed ANOVA where session was the within-subjects factor. We found the expected interaction of group and session, F(1,23) = 11.24, p = 0.003, shown in Fig. 2. Planned t-tests showed a significant improvement by the trained subjects, t(14) = 5.87, p < 0.001 (pre-test RTS M = −3.01 dB SNR, post-test RTS M = −3.85 dB SNR), but not the control subjects, t(9) = 0.34, p = 0.74 (pre-test RTS M = −2.81 dB SNR, post-test RTS M = −2.87 dB SNR). There was also a main effect of group, F(1,23) = 7.65, p = 0.01, and one of session, F(1,23) = 21.69, p < 0.001. Overall the trained group performed better than the control group (trained RTS M = −3.43 dB SNR, control RTS M = −2.84 dB SNR) and overall performance was better at the post-test than at the pre-test (pre-test RTS M = −2.93 dB SNR, post-test RTS M = −3.46 dB SNR). However, these main effects should be interpreted with an eye to the significant interaction.
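
As a consistency check, the marginal means reported above can be recomputed from the four cell means and the group sizes (15 trained, 10 control). This sketch assumes the session means are participant-weighted averages across groups; the variable and function names are hypothetical, and the inputs are the rounded values given in the text rather than raw data.

```python
# Cell means (dB SNR) and group sizes as reported in the text.
cells = {
    "trained": {"n": 15, "pre": -3.01, "post": -3.85},
    "control": {"n": 10, "pre": -2.81, "post": -2.87},
}

def gain(cell):
    """Pre-to-post change in RTS; negative values mean a lower (better) threshold."""
    return cell["post"] - cell["pre"]

def group_mean(cell):
    """Mean of the two sessions for one group."""
    return (cell["pre"] + cell["post"]) / 2

def session_mean(session):
    """Participant-weighted mean across groups for one session."""
    total_n = sum(c["n"] for c in cells.values())
    return sum(c["n"] * c[session] for c in cells.values()) / total_n

trained_gain = round(gain(cells["trained"]), 2)
control_gain = round(gain(cells["control"]), 2)
trained_mean = round(group_mean(cells["trained"]), 2)
control_mean = round(group_mean(cells["control"]), 2)
pre_mean = round(session_mean("pre"), 2)
post_mean = round(session_mean("post"), 2)
```

Under these assumptions the recomputed group and session means match the reported marginal means. The trained-group change from the rounded cell means is 0.84 dB, close to the 0.85 dB quoted in the Discussion; the small difference is presumably due to rounding of the reported means.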

FIG. 2. — Mean dB RTS performance on the HINT at pre- and post-test by the trained and control native Mandarin speaking subjects. Trained subjects showed a significant improvement on their ability to perceive speech in noise whereas control subjects did not. Error bars are standard error of the mean.

The reading span data were entered into a separate 2 (group: trained vs control) × 2 (session: pre- vs post-test) mixed ANOVA. We again found the anticipated interaction of group and session, F(1,23) = 7.87, p = 0.01, shown in Fig. 3. Planned t-tests again revealed a significant improvement by the trained group, t(14) = 3.56, p = 0.003 (pre-test M = 4.87 words, post-test M = 6.33 words), but not the control group, t(9) = 0, p = 1 (pre-test M = 4.80 words, post-test M = 4.80 words). The main effect of session was also significant, F(1,23) = 11.80, p = 0.002. Post-test performance was better than pre-test performance overall (pre-test M = 4.84 words, post-test M = 5.72 words), but, again, this should be interpreted with an eye to the significant interaction.

FIG. 3. — Mean reading span performance by the native Mandarin speaking subjects. Trained subjects significantly increased their reading spans following training, whereas there was no change in control spans at posttest. Error bars are standard error of the mean.

C. Discussion

The results from the first experiment indicate a benefit of working memory training for improving speech perception in noise. Trained listeners showed gains in span length throughout training, gains of approximately 1.5 words on the reading span test, and gains of 0.85 dB in their reception thresholds for sentences (RTS). That these gains were seen even in young, cognitively healthy, normal-hearing listeners is especially notable, as this population typically shows neither difficulty perceiving speech in noisy situations nor deficits in working memory performance.

Despite the significant gains seen on all components, this experiment does not address the fact that currently available commercial speech perception in noise training paradigms are difficult to translate across languages. One motivation for using digits as training stimuli is the hypothesis that digits are easily translatable across languages, which would make the training paradigm easily adaptable to improve both working memory and speech perception in noise for listeners in a variety of languages. Experiment 2 was designed to test this hypothesis.