The Journal of the Acoustical Society of America
2020 Feb 19; 147(2): EL189–EL195. doi: 10.1121/10.0000748

Combining partial information from speech and text

Daniel Fogerty, Irraj Iftikhar, and Rachel Madorskiy

Abstract

The current study investigated how partial speech and text information, distributed at various interruption rates, is combined to support sentence recognition in quiet. Speech and text stimuli were interrupted by silence and presented unimodally or combined in multimodal conditions. Across all conditions, performance was best at the highest interruption rates. Listeners were able to gain benefit from most multimodal presentations, even when the rate of interruption was mismatched between modalities. Supplementing partial speech with incomplete visual cues can improve sentence intelligibility and compensate for degraded speech in adverse listening conditions. However, individual variability in benefit depends on unimodal performance.

1. Introduction

Listeners are often tasked with comprehending incomplete speech signals in everyday environments. Speech is often masked, or interrupted, creating a signal with partial information that the listener must resolve. Surprisingly, listeners perform quite well in these situations, maintaining high intelligibility even when 50% of the speech is periodically deleted (e.g., Bashford et al., 1996; Miller and Licklider, 1950). However, under these conditions of periodically interrupted speech, intelligibility is determined not only by the amount of the speech signal preserved, but also by the interruption rate (e.g., Wang and Humes, 2010), which determines how speech information is distributed over time. Higher accuracy occurs with slower or faster interruption rates, with poorest performance in the middle around 2 Hz (Miller and Licklider, 1950; Shafiro et al., 2018). The involvement of these two factors defining the preservation and distribution of speech information also holds for speech glimpses defined by a single-talker modulated noise (Gibbs and Fogerty, 2018).

The ability to resolve partial information can also be examined in the visual modality, by periodically interrupting written text with bar patterns obscuring parts of the sentence. Previous research has observed an association between interrupted text and interrupted speech perception abilities (George et al., 2007; Humes et al., 2013). Due to this association across modalities, it seems plausible that a similar underlying mechanism is utilized to process both interrupted speech and interrupted text. However, modality-specific processing is also likely (Shafiro et al., 2018), due to the unique perceptual differences between speech, a continuous multidimensional signal, and text, a discrete serially ordered sequence of abstract symbols.

A few studies have begun to examine how partial information of speech and text might be combined to support a multimodal percept of the intended message. Degraded or partially correct text cues can be used to successfully augment speech recognition for younger and older listeners (Krull and Humes, 2016), as well as middle-aged listeners (Zekveld et al., 2009). However, little is known about how listeners combine these signals to enhance recognition of the multimodal message. In particular, as interruption rate affects the recognition of unimodal speech and text recognition, it is not known how rate, determining the distribution of perceptual information, might affect the integration of these two modalities.

1.1. Unimodal processing

As already reviewed, the recognition of interrupted speech is determined by both the proportion of speech preserved as well as the rate of interruption (e.g., Miller and Licklider, 1950; Wang and Humes, 2010). When the portion of the sentence preserved remains at a constant 50%, slower rates tend to preserve entire words across the sentence, allowing the listener to clearly comprehend the uninterrupted words in the signal. Conversely, at faster rates of interruption, all words in the sentence are frequently sampled. Middle rates of interruption at 2 Hz, however, degrade word units but also provide less frequent sampling across the sentence, resulting in lower intelligibility. Indeed, performance for interrupted speech, at least for adults with normal hearing listening in quiet, can be successfully modeled using these two parameters: the preservation of whole words and the frequency of word sampling (Shafiro et al., 2018).

In the visual modality, interrupted text has been examined through the text reception threshold (TRT), which utilizes one interruption rate with various percentages of unmasked text (Zekveld et al., 2007). This task was designed as an analog to the speech interruption paradigm. Performance on the TRT has been correlated with speech-in-noise perception in steady-state and modulated noise (Zekveld et al., 2007). This suggests a shared involvement of a modality-general cognitive-linguistic ability in forming meaningful wholes from fragments of sentences (Zekveld et al., 2009). Examining text recognition across interruption rates demonstrates a very similar performance function to interrupted speech with a dip in performance around 2 Hz, supporting the argument for similar perceptual processes for both modalities (Shafiro et al., 2018). However, differences in interrupted text and speech processing are also observed, with better text performance at extreme rates, indicating that modality-specific processes are likely involved as well (Smith and Fogerty, 2015; Shafiro et al., 2018). This may be due to the fundamental differences between speech and text. For example, speech provides additional co-articulatory and prosodic cues, while text consists of discrete context-free symbols. These modality-general and modality-specific perceptual processes may have significant consequences for how listeners are able to process multimodal signals, particularly across interruption rates where the perceptual and linguistic processing constraints appear to change (e.g., word-based linguistic processes at slower rates, with perceptual integration across distributed units required at faster rates).

1.2. Multimodal processing

Previous research has also examined simultaneous presentation of interrupted speech and text materials (i.e., multimodal presentation), though not across interruption rate. When speech in noise was presented concurrently with partially correct subtitles, an improvement of between 15% and 25% was observed in comparison to unimodal conditions (Zekveld et al., 2009). Furthermore, there is a greater benefit obtained for adding interrupted text to interrupted speech when the text is better preserved or the speech is highly degraded (Smith and Fogerty, 2015). Thus, the ability to extract partial information from both modalities determines how multimodal information is combined to support speech recognition. A similar conclusion was found when combining speech in noise with text output from a speech recognizer at varying signal-to-noise ratios (SNRs; Krull and Humes, 2016). Benefit from the addition of text was greatest at the poorest SNRs. Furthermore, adding speech at poor SNRs to highly preserved text cues resulted in poorer performance relative to the text-only condition. This evidence suggests that the benefit of multimodal presentation depends on the level of degradation of both speech and text stimuli.

A significant benefit can be obtained by adding degraded text cues to an interrupted auditory speech signal. The current study extended these previous results by examining multimodal speech-text processing across interruption rates. This methodology varies the perceptual requirements of processing in each modality, i.e., whole perceptual units versus integration across units. Furthermore, perceptual requirements were altered between the two modalities by independently varying the interruption rate of each modality.

When the two modalities are presented simultaneously, it is possible that sentence intelligibility will be greatest when the perceptual and cognitive-linguistic cues distributed across the two modalities are maximized. This may occur when top-down cues are provided in one modality at slower rates along with distributed perceptual cues at high interruption rates in the other modality, similar to interpretations using dual-rate interruption (Shafiro et al., 2011). In contrast, multimodal integration might be best when the perceptual units of the speech and text stimuli are matched, i.e., interrupted at the same rate. A lack of congruence between the modalities could interfere with multimodal processing, resulting in lower levels of benefit. The current experiment tested these possibilities for how the two modalities are combined.

2. Methods

2.1. Participants

Fourteen normal hearing participants were recruited from the University of South Carolina. The participants were all female, with ages ranging from 20 to 27 years (mean 23 years). The participants were native speakers of English and reported normal or corrected-to-normal vision. The participants also passed a pure-tone hearing screening with audiometric thresholds less than 20 dB hearing level at octave frequencies between 250 and 8000 Hz.

2.2. Stimuli and design

IEEE sentences (IEEE, 1969) were used for both the speech and text stimuli. Audio recordings were spoken by a single male talker, who was a native speaker of American English (Loizou, 2013). The creation of speech and text stimuli was identical to the methods used by Shafiro et al. (2018).

Speech stimuli were low-pass filtered using an 80th order linear-phase finite impulse response filter with a cutoff frequency of 2000 Hz. This reduced the overall redundancy of the speech stimulus and resulted in greater variation across interruption rates. A control condition was conducted to determine maximum sentence intelligibility without interruption.
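For illustration, the low-pass filtering step can be sketched as follows. This is not the study's own processing code: the sampling rate below is an assumed placeholder, and scipy's standard firwin routine is used to design the 80th-order linear-phase FIR filter with a 2000-Hz cutoff described above.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000                      # assumed sampling rate of the recordings (Hz)
numtaps = 81                    # an 80th-order FIR filter has 81 taps
cutoff_hz = 2000.0              # cutoff frequency from the text

# Linear-phase FIR low-pass filter (windowed design).
b = firwin(numtaps, cutoff_hz, fs=fs)

def lowpass_2k(speech):
    """Low-pass filter a speech waveform (1-D numpy array) at 2 kHz."""
    return lfilter(b, [1.0], speech)

# Demo with a synthetic signal standing in for a sentence recording:
# the 500-Hz component passes, the 4-kHz component is strongly attenuated.
t = np.arange(fs) / fs
demo = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 4000 * t)
filtered = lowpass_2k(demo)
```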

Text stimuli were displayed in red Arial font, centered in a visual window of 200 × 1600 pixels. For the time-to-space conversion of interruption rate, the text window was equated to a duration of 4 s. A sentence of median duration across the sentence corpus was selected to calibrate the font size such that the width of the displayed text corresponded to the duration of the calibration sentence under this mapping (i.e., a sentence duration of 2 s would result in a visual width equal to one-half of the display, or 800 pixels). The resulting font size of 22.5 points was used for the remaining sentences. Further discussion of the auditory-to-visual conversion can be found in Shafiro et al. (2018).
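As an arithmetic illustration of this time-to-space conversion, a minimal sketch follows (assuming only the 1600-pixel window equated to 4 s described above; the constant and function names are illustrative, not from the original study):

```python
# The 1600-pixel text window is equated to 4 s,
# so each second of audio maps onto 400 pixels of horizontal text extent.
WINDOW_WIDTH_PX = 1600
WINDOW_DURATION_S = 4.0
PX_PER_SECOND = WINDOW_WIDTH_PX / WINDOW_DURATION_S   # 400 px/s

def text_width_px(sentence_duration_s: float) -> float:
    """Horizontal extent (in pixels) allotted to a sentence's text."""
    return sentence_duration_s * PX_PER_SECOND

# A 2-s sentence occupies half the display, as in the example above.
assert text_width_px(2.0) == 800.0
```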

Both modalities were presented to participants at the following interruption rates: 0.5, 2.0, 8.0, and 32.0 Hz. Speech stimuli were interrupted using silent intervals, while text stimuli were interrupted using white space. For the text conditions, interruption rates were defined according to pixels. All interruption conditions used a 50% duty cycle. In both modalities the starting phase (or pixel location) of the interruption cycle was randomized for each sentence. For auditory speech presentations, interruption windows used a 2-ms raised cosine on/off ramp to minimize the introduction of transients.
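A minimal sketch of the auditory interruption procedure is given below. It is not the study's original presentation code: the sampling rate is left as a parameter, and the 2-ms raised-cosine on/off ramps are approximated by smoothing the square-wave gate with a short Hann-shaped kernel.

```python
import numpy as np

def interrupt_speech(speech, fs, rate_hz, duty=0.5, ramp_ms=2.0, rng=None):
    """Gate a waveform with periodic silence at `rate_hz`.

    Uses a 50% duty cycle by default, a randomized starting phase per
    sentence, and smoothed on/off transitions approximating raised-cosine ramps.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(speech)
    period = fs / rate_hz                      # samples per interruption cycle
    phase = rng.uniform(0.0, period)           # random starting phase

    # Square-wave gate: 1 during the "on" half of each cycle, 0 otherwise.
    idx = (np.arange(n) + phase) % period
    gate = (idx < duty * period).astype(float)

    # Smooth each on/off transition over roughly ramp_ms milliseconds with a
    # normalized Hann kernel, approximating a raised-cosine ramp.
    ramp_len = max(3, int(round(ramp_ms * 1e-3 * fs)))
    win = np.hanning(ramp_len)
    win /= win.sum()
    gate = np.convolve(gate, win, mode="same")

    return speech * gate
```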

2.3. Procedure

Stimuli were presented in three blocks for unimodal speech, unimodal text, and multimodal conditions. Unimodal conditions consisted of the four interruption rates, while multimodal conditions consisted of all pairings of interruption rates between the speech and text (4 speech rates × 4 text rates = 16 multimodal conditions). Each condition consisted of ten sentences for a total of 240 sentences (40 speech sentences + 40 text sentences + 160 multimodal sentences). Blocks were counterbalanced across participants and stimuli were randomized within each block. A demo of five additional sentences was presented before each block to familiarize the listener with the stimuli. No feedback was provided and no sentences were ever repeated.
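For concreteness, the condition structure and sentence counts described above can be enumerated as in the following sketch (the labels are illustrative only):

```python
from itertools import product

RATES_HZ = [0.5, 2.0, 8.0, 32.0]
SENTENCES_PER_CONDITION = 10

# Unimodal conditions: each rate presented in one modality alone.
unimodal = [("speech", r) for r in RATES_HZ] + [("text", r) for r in RATES_HZ]

# Multimodal conditions: every pairing of speech and text rates.
multimodal = [("speech+text", s, t) for s, t in product(RATES_HZ, RATES_HZ)]

n_conditions = len(unimodal) + len(multimodal)          # 8 + 16 = 24
n_sentences = n_conditions * SENTENCES_PER_CONDITION    # 240 sentences total
assert (n_conditions, n_sentences) == (24, 240)
```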

Participants listened to stimuli presented monaurally over Sennheiser HD 280 Pro headphones (Wedemark, Germany) at a presentation level of 70 dB sound pressure level in a sound-attenuating booth. Stimulus levels were calibrated using a noise matching the long-term average spectrum of the stimuli prior to interruption. The visual stimulus was presented on a 17 in. display for the duration of the corresponding audio recording to equate processing time between the two modalities. The entire sentence was presented at once during text presentations. The on/off timing of the text was synchronized with the speech during multimodal conditions. A custom response interface designed in MATLAB was used to present the speech and text signals simultaneously. Following presentation, the word RESPOND appeared to cue participants to repeat the sentence. Presentation was self-paced, and participants pressed a NEXT button to play the next stimulus. For unimodal speech presentations, the word LISTEN was displayed during the stimulus interval instead of interrupted text. Participants were only able to listen to or read each stimulus once (i.e., stimuli were not repeated).
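A calibration noise of this kind can be approximated by keeping the magnitude spectrum of the concatenated, uninterrupted speech and randomizing its phase. The sketch below illustrates this general technique; it is not the study's own calibration routine.

```python
import numpy as np

def ltas_matched_noise(speech, rng=None):
    """Noise with the same long-term average spectrum as `speech`.

    Retains the magnitude spectrum of the (concatenated) speech, replaces
    its phase with random phase, inverse-transforms, and matches RMS level.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(speech)
    random_phase = np.exp(1j * rng.uniform(0.0, 2 * np.pi, len(spectrum)))
    noise = np.fft.irfft(np.abs(spectrum) * random_phase, n=len(speech))
    # Match the overall RMS level of the original speech.
    noise *= np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2))
    return noise
```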

For each trial the participants were instructed to either repeat or read the sentence aloud as accurately as possible and were encouraged to guess. Accuracy was scored based on the number of keywords that were correctly repeated. Responses were audio recorded for offline scoring and analysis by trained raters. A strict scoring procedure required participants to repeat each keyword exactly (i.e., no missing or extra suffixes). Keyword correct scores were transformed to rationalized arcsine units to stabilize the error variance prior to analysis (Studebaker, 1985).
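For reference, a sketch of the two-step rationalized arcsine transform commonly attributed to Studebaker (1985) follows; the function name is illustrative and the sketch is not the study's own scoring code.

```python
import numpy as np

def rationalized_arcsine(correct, total):
    """Rationalized arcsine units (RAU) for `correct` keywords out of `total`."""
    theta = (np.arcsin(np.sqrt(correct / (total + 1.0)))
             + np.arcsin(np.sqrt((correct + 1.0) / (total + 1.0))))
    return (146.0 / np.pi) * theta - 23.0

# Example: 25 of 50 keywords correct maps to approximately 50 RAU.
print(rationalized_arcsine(25, 50))
```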

3. Results

3.1. Unimodal presentation

Analysis of individual performance on the non-interrupted control condition demonstrated that listeners were able to recognize 2-kHz low-pass filtered speech with 97% accuracy (Range = 94%–100%, standard deviation = 2.3%). Thus, without interruption, speech was sufficiently preserved to result in high recognition scores.

Results for the unimodal interrupted conditions are displayed in Fig. 1(A). Consistent with Shafiro et al. (2018), a similar performance function for speech and text was observed across rates, with poorest performance at the mid rates. Furthermore, while there was a modality benefit for speech at the mid rates, keyword recognition scores were highest for text at the slowest 0.5-Hz rate. These observations were confirmed using a 2 (modality) × 4 (rate) repeated-measures analysis of variance. Significant main effects of modality [F(1, 13) = 8.12, p = 0.014, ηp² = 0.39] and rate [F(3, 39) = 305.73, p < 0.001, ηp² = 0.96] were observed, as well as a significant interaction [F(3, 39) = 27.06, p < 0.001, ηp² = 0.68]. Overall, for both modalities, performance dipped at 2 Hz and rose to reach maximum performance at 32 Hz.
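An analysis of this form can be outlined with a repeated-measures ANOVA in statsmodels, as in the hedged sketch below. The data file and column names (subject, modality, rate, rau) are hypothetical placeholders for per-listener scores in rationalized arcsine units; the sketch is not the study's analysis script.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# `scores` holds one row per subject x modality x rate cell, e.g.:
#   subject  modality  rate   rau
#   1        speech    0.5    62.3
#   ...
scores = pd.read_csv("unimodal_scores.csv")   # hypothetical file

# 2 (modality) x 4 (rate) repeated-measures ANOVA.
res = AnovaRM(scores, depvar="rau", subject="subject",
              within=["modality", "rate"]).fit()
print(res)   # F and p values for modality, rate, and their interaction
```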

Fig. 1. Sentence recognition for speech and text across interruption rates. Panel (A) displays unimodal performance for speech (black) and text (gray). The remaining two panels display the same multimodal data, plotted across either text (B) or speech (C) interruption rates. Unimodal performance is again plotted in these latter panels as isolated symbols to facilitate comparisons. Error bars = standard error of the mean.

3.2. Multimodal presentation

Results for multimodal performance are plotted twice in Fig. 1, once as a function of text interruption rate [Fig. 1(B)] and once as a function of speech interruption rate [Fig. 1(C)]. The leftmost column in each panel displays the unimodal performance for ease of comparison. Multimodal function lines above these unimodal symbols indicate the benefit obtained from adding either text to the speech modality [Fig. 1(B)] or speech to the text modality [Fig. 1(C)].

A 4 (speech rate) × 4 (text rate) repeated-measures analysis of variance was used to determine how the multimodal presentation of interrupted text and speech stimuli affected sentence intelligibility. Results indicated significant main effects of speech [F(3, 39) = 379.83, p < 0.001, ηp² = 0.97] and text interruption rates [F(3, 39) = 729.10, p < 0.001, ηp² = 0.98], as well as a significant interaction [F(9, 117) = 37.37, p < 0.001, ηp² = 0.74]. The main effect of speech rate [compare lines in Fig. 1(B)] demonstrates that performance in multimodal presentations followed the unimodal order of performance, with the best performance at 32 Hz (circle symbols) and the worst performance at 2 Hz (square symbols). Likewise, the main effect of text rate also generally followed unimodal performance [compare lines in Fig. 1(C)], again with performance extremes at 32 Hz (circles) and 2 Hz (squares). However, while 2-Hz text interruption is worst at slow speech rates, 8-Hz text interruption [triangles in Fig. 1(C)] results in the poorest performance at speech rates of 8 and 32 Hz. Particularly notable is the much steeper slope for the 2-Hz text rate (i.e., the rapid increase in performance from the 2-Hz to the 32-Hz speech rate for the square symbols) compared to the flatter function for the 8-Hz text rate (triangles) across increasing speech interruption rates.

Comparisons between unimodal and multimodal presentations overall demonstrate that, as a group, listeners effectively integrated partial information from the two modalities to improve keyword recognition in sentences. This occurred whether speech and text information were interrupted at similar or different rates. There was only one exception to this net multimodal benefit at the group level. While the 32-Hz speech-only condition was near ceiling performance, there was a notable dip in the performance function when it was combined with 8-Hz interrupted text. This dip, 28 percentage points below 32-Hz speech-only performance, can be seen in Fig. 1(B), and it also appears as the previously noted flatter function for 8-Hz text in Fig. 1(C), which falls well below the other text conditions at the 32-Hz speech rate (see footnote 1). The performance decline for this multimodal condition (32-Hz speech + 8-Hz text) is measured relative to unimodal performance (32-Hz speech) and may reflect a perceptual interaction between the two modalities. Here we use the term “interference” to refer to this perceptual interaction leading to multimodal performance declines relative to unimodal performance.

Given this notable decline in performance, individual multimodal scores were compared to unimodal performance. For particular individuals, several other conditions also demonstrated a multimodal decrement relative to unimodal performance, possibly indicating perceptual interference. The conditions that resulted in interference at an individual level are plotted in Fig. 2 (see footnote 2). This figure displays performance for individual listeners who were rank-ordered according to performance on three unimodal conditions: (A) 32-Hz speech, (B) 8-Hz speech, and (C) 32-Hz text. In this figure, the bold line displays unimodal performance and the thin lines display multimodal performance (i.e., combining a second modality with the displayed unimodal condition). Multimodal performance above the bold unimodal line indicates enhancement, whereas multimodal performance falling below the bold line indicates interference from the addition of the second modality. From Fig. 2 several notable patterns can be observed. First, the large interference for adding 8-Hz text to 32-Hz speech is clearly observed in Fig. 2(A). Second, participants who demonstrated some interference in one condition were also likely to demonstrate interference in other conditions. Third, individual interference patterns were largely predicted by performance in the unimodal condition. That is, as participants performed better in the unimodal condition, they were more likely to demonstrate an effect of interference from the addition of a second modality (at one or multiple rates).

Fig. 2. Keyword recognition scores for individual participants, rank-ordered by performance in the unimodal task (solid bold line) for (A) 32-Hz speech, (B) 8-Hz speech, and (C) 32-Hz text, the conditions in which some multimodal interference was observed. The thin lines display multimodal performance for the conditions in which a second modality was added to the respective unimodal condition. Multimodal performance above the bold unimodal line indicates enhancement, whereas multimodal performance falling below the bold line indicates interference from the addition of the second modality.

Overall, for most conditions and most listeners, supplementing partial speech information with interrupted text was beneficial. On a group level, interference was only observed when supplementing 32-Hz speech with 8-Hz text, although this interference might be extended to other conditions for certain high performing listeners. Why did 8-Hz text result in the most interference? Perhaps it is because this rate corresponded closely to the spatial distribution of individual letters (i.e., one interruption per letter). However, this possibility will need to be explored in future studies, especially as 8-Hz text, although still poor, did not result in the poorest unimodal performance.

There are at least four possibilities that might underlie the difference observed here between unimodal and multimodal perception. (1) It may be that metrics that explain unimodal conditions do not reflect the underlying multimodal stimulus available for perception. That is, the number of interruptions per word in both modalities may need to be captured, rather than estimating performance based on any one modality. (2) Explanations of multimodal performance may have to consider the combined multimodal stimulus rather than each modality independently. For example, the alignment of interruptions across both modalities may be an important parameter in explaining performance. That is, redundant information across modalities, such as the same word preserved, may be treated differently than words presented only in one modality, or divided among the two. (3) As indicated earlier, the integration of the two modalities may also involve a smaller unit of perception. For example, multimodal perception might involve greater reliance on the preservation of phonemes/letters, versus the longer word-level (or syllable-level) units that appear to be important in unimodal conditions (Shafiro et al., 2018). (4) The results might not reflect modality interactions at all, but interactions of interruption rate. The independent interruption rates of the two modalities might involve processes similar to dual-rate interrupted speech (Shafiro et al., 2011). Here, the slower and faster interruptions of the two modalities may involve different processing requirements that determine performance. Further work will need to explore these, or alternative, possibilities.

4. Discussion

Unimodal performance for interrupted speech and text closely replicated the previous findings by Shafiro et al. (2018), demonstrating a dip in performance around 2 Hz for both modalities. As Shafiro et al. (2018) showed, this performance dip can be explained by a simple model accounting for the number of glimpses per word and the number of words within a glimpse. The current study adds to these findings by investigating how adults might combine the partial information in both of these modalities for sentence recognition.
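To illustrate the kind of glimpse statistics such a model draws on (this sketch is not the published model's implementation), the following code counts, for a periodic interruption pattern, how many glimpses overlap each keyword and how many keywords fall entirely within a single glimpse; the word timing input is hypothetical.

```python
import numpy as np

def glimpse_metrics(word_times, rate_hz, duty=0.5, phase_s=0.0):
    """Per-sentence glimpse statistics for a periodic interruption pattern.

    `word_times` is a list of (onset_s, offset_s) pairs for the keywords.
    Glimpse k occupies [phase_s + k/rate_hz, phase_s + (k + duty)/rate_hz).
    Returns (mean number of glimpses overlapping each word,
             number of words lying entirely within a single glimpse).
    """
    period = 1.0 / rate_hz
    on_dur = duty * period

    def n_overlapping(t0, t1):
        # Glimpse k overlaps [t0, t1) iff it starts before t1 and ends after t0,
        # i.e., (t0 - phase_s - on_dur)/period < k < (t1 - phase_s)/period.
        a = (t0 - phase_s - on_dur) / period
        b = (t1 - phase_s) / period
        return max(0, int(np.ceil(b) - np.floor(a) - 1))

    def fully_glimpsed(t0, t1):
        # A word is fully glimpsed if the glimpse cycle containing its onset
        # also extends past its offset.
        k = np.floor((t0 - phase_s) / period)
        on_start = phase_s + k * period
        return (t0 >= on_start) and (t1 <= on_start + on_dur)

    per_word = [n_overlapping(t0, t1) for t0, t1 in word_times]
    whole_words = sum(fully_glimpsed(t0, t1) for t0, t1 in word_times)
    return float(np.mean(per_word)), int(whole_words)

# Example: three hypothetical keywords at a 2-Hz interruption rate (0.25-s glimpses).
words = [(0.05, 0.35), (0.40, 0.90), (1.00, 1.20)]
print(glimpse_metrics(words, rate_hz=2.0))
```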

Multimodal speech and text recognition largely followed trends observed for unimodal performance. That is, performance functions across speech or text rates generally demonstrated best performance when the second modality was added at a 32-Hz rate and worst performance when the second modality was added at a 2-Hz rate. Thus, the dip in performance at mid rates is preserved for multimodal presentations. For the most part, multimodal performance was largely accounted for by the independent contributions of each modality. That is, modalities did not need to be interrupted at highly different or at the same rates to sample either different or congruent speech information. Rather, multimodal benefit was determined by unimodal performance, not by how perceptual information was combined across the modalities. However, a significant interaction was observed such that performance declined when 8-Hz text was added to 32-Hz speech. This observation was best explained when looking at scores at the individual level in comparison to unimodal performance.

Comparison between unimodal and multimodal presentations demonstrates that participants benefited from the addition of partial text information or partial speech information in a second modality for 83% and 96% of individual cases, respectively. The exceptions to this trend were predicted by individual performance in unimodal conditions. In general, participants who had higher unimodal performance were the individuals who demonstrated interference from the addition of a second modality. Furthermore, interference occurred when the second modality was interrupted at the mid rates, where unimodal performance is poorest. Therefore, two conclusions can be drawn from these findings. First, these results demonstrate that participants with the poorest unimodal scores are most likely to benefit from the addition of a second modality. Second, if performance is already high for unimodal perception, the addition of a highly degraded second modality may interfere with performance. Obviously, the current results are limited to partial information created through a periodic interruption paradigm, which, while useful as a model of partially glimpsed speech, is unnatural. Further work is required to determine the generalizability of this observation for speech and text degraded under more realistic conditions. However, the results are generally consistent with other multimodal studies of speech and text using masked speech (Krull and Humes, 2016; Zekveld et al., 2009).

The results from this study are significant because they show that supplementing speech recognition with text is a viable option for improving sentence intelligibility and compensating for degraded speech in adverse listening conditions, even with variability in the distribution of partial information between the two modalities.

Acknowledgments

Portions of this work were completed as part of the requirements for an undergraduate honors thesis (I.I.). This work was supported, in part, by the South Carolina Honors College Exploration Scholars Program and by National Institutes of Health/National Institute on Deafness and Other Communication Disorders Grant No. R01-DC015465.

Footnotes

1. Several follow-up analyses were conducted to ensure the reliability of the performance dip for this condition (32-Hz speech + 8-Hz text). First, on an individual level, 13/14 participants demonstrated interference. Second, the interference pattern for this condition is consistent with individual patterns on other conditions (Fig. 2). Third, a second group of four participants was tested using a different set of sentences for a subset of conditions. The observation reported here was replicated, with all participants demonstrating interference for 32-Hz speech + 8-Hz text, but enhancement for 32-Hz speech + 2-Hz text and 2-Hz speech + 8-Hz text.

2. All conditions that had multimodal interference at the individual level are displayed in Fig. 2, except for two cases from subject 10, who also had interference for 2-Hz speech when combined with 2- and 8-Hz text. Consistent with the other conditions, subject 10 had the highest unimodal performance for 2-Hz speech.


References and links

Bashford, J. A., Warren, R. M., and Brown, C. A. (1996). "Use of speech-modulated noise adds strong 'bottom-up' cues for phonemic restoration," Percept. Psychophys. 58, 342–350. doi: 10.3758/BF03206810

George, E. L., Zekveld, A. A., Kramer, S. E., Goverts, S. T., Festen, J. M., and Houtgast, T. (2007). "Auditory and nonauditory factors affecting speech reception in noise by older listeners," J. Acoust. Soc. Am. 121, 2362–2375. doi: 10.1121/1.2642072

Gibbs, B. E., and Fogerty, D. (2018). "Explaining intelligibility in speech-modulated maskers using acoustic glimpse analysis," J. Acoust. Soc. Am. 143, EL449–EL455. doi: 10.1121/1.5041466

Humes, L. E., Kidd, G. R., and Lentz, J. J. (2013). "Auditory and cognitive factors underlying individual differences in aided speech-understanding among older adults," Front. Syst. Neurosci. 7(55), 1–16. doi: 10.3389/fnsys.2013.00055

IEEE (1969). "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust. 17, 225–246. doi: 10.1109/TAU.1969.1162058

Krull, V., and Humes, L. E. (2016). "Text as a supplement to speech in young and older adults," Ear Hear. 37, 164–176. doi: 10.1097/AUD.0000000000000234

Loizou, P. C. (2013). Speech Enhancement: Theory and Practice, 2nd ed. (CRC, Boca Raton, FL), pp. 665–668.

Miller, G. A., and Licklider, J. C. (1950). "The intelligibility of interrupted speech," J. Acoust. Soc. Am. 22, 167–173. doi: 10.1121/1.1906584

Shafiro, V., Fogerty, D., Smith, K., and Sheft, S. (2018). "Perceptual organization of interrupted speech and text," J. Speech Lang. Hear. Res. 61, 2578–2588. doi: 10.1044/2018_JSLHR-H-17-0477

Shafiro, V., Sheft, S., and Risley, R. (2011). "Perception of interrupted speech: Effects of dual-rate gating on the intelligibility of words and sentences," J. Acoust. Soc. Am. 130, 2076–2087. doi: 10.1121/1.3631629

Smith, K. G., and Fogerty, D. (2015). "Integration of partial information within and across modalities: Contributions to spoken and written sentence recognition," J. Speech Lang. Hear. Res. 58, 1805–1817. doi: 10.1044/2015_JSLHR-H-14-0272

Studebaker, G. A. (1985). "A 'rationalized' arcsine transform," J. Speech Lang. Hear. Res. 28, 455–462. doi: 10.1044/jshr.2803.455

Wang, X., and Humes, L. E. (2010). "Factors influencing recognition of interrupted speech," J. Acoust. Soc. Am. 128, 2100–2111. doi: 10.1121/1.3483733

Zekveld, A. A., George, E. L., Kramer, S. E., Goverts, S. T., and Houtgast, T. (2007). "The development of the text reception threshold test: A visual analogue of the speech reception threshold test," J. Speech Lang. Hear. Res. 50, 576–584. doi: 10.1044/1092-4388(2007/040)

Zekveld, A. A., Kramer, S. E., Kessens, J. M., Vlaming, M. S., and Houtgast, T. (2009). "The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system," Ear Hear. 30, 262–272. doi: 10.1097/AUD.0b013e3181987063
