Abstract
Speech recognition is robust to background noise. One underlying neural mechanism is that the auditory system segregates speech from the listening background and encodes it reliably. Such robust internal representation has been demonstrated in auditory cortex by neural activity entrained to the temporal envelope of speech. A paradox, however, then arises, as the spectro-temporal fine structure rather than the temporal envelope is known to be the major cue to segregate target speech from background noise. Does the reliable cortical entrainment in fact reflect a robust internal “synthesis” of the attended speech stream rather than direct tracking of the acoustic envelope? Here, we test this hypothesis by degrading the spectro-temporal fine structure while preserving the temporal envelope using vocoders. Magnetoencephalography (MEG) recordings reveal that cortical entrainment to vocoded speech is severely degraded by background noise, in contrast to the robust entrainment to natural speech. Furthermore, cortical entrainment in the delta-band (1–4 Hz) predicts the speech recognition score at the level of individual listeners. These results demonstrate that reliable cortical entrainment to speech relies on the spectro-temporal fine structure, and suggest that cortical entrainment to the speech envelope is not merely a representation of the speech envelope but a coherent representation of multiscale spectro-temporal features that are synchronized to the syllabic and phrasal rhythms of speech.
Keywords: envelope entrainment, auditory cortex, auditory scene analysis, MEG
Introduction
Normal-hearing listeners exhibit a surprising ability to understand speech in noisy acoustic environments, even in the absence of visual cues. A number of studies have suggested that the target speech and the listening background are separated in auditory cortex (Kerlin et al., 2010; Ding and Simon, 2012a; Mesgarani and Chang, 2012; Power et al., 2012; Zion Golumbic et al., 2013; Horton et al., 2013). In particular, when a listener attends to a speech stream, auditory cortical activity is reliably entrained to the temporal envelope of that stream, regardless of the listening background. This reliable neural representation of the speech envelope, i.e. the slow temporal modulations below 16 Hz, is a key candidate mechanism underlying robust speech recognition, since the temporal envelope carries important cues for speech recognition (Shannon et al., 1995). It remains mysterious, however, how such reliable cortical entrainment to the speech envelope is achieved, since the envelope is not an effective cue for segregating speech from noise (Friesen et al., 2001).
Moreover, even the nature of cortical entrainment to the speech envelope is heavily debated, especially regarding whether it encodes the temporal envelope per se or instead other speech features that are correlated with it (Obleser et al., 2012; Peelle et al., 2013). Many speech features, including pitch and spatial cues, are temporally coherent and correlated with the temporal envelope (Shamma et al., 2011). It has therefore been proposed that envelope entrainment in fact reflects a collective neural representation of multiple speech features that are synchronized to the syllabic and phrasal rhythm of speech (Ding and Simon, 2012a). Because of its collective nature, this representation has been suggested to encode speech as a whole auditory object.
If envelope entrainment indeed reflects an object-level, collective representation of speech features, reliable envelope entrainment in complex auditory scenes is likely to involve an analysis-by-synthesis process (Poeppel et al., 2008; Shinn-Cunningham, 2008; Shamma et al., 2011): In such a process, multiple features of a complex auditory scene are extracted subcortically in the analysis phase and then, based on speech segregation cues such as pitch, features belonging to the same speech stream are grouped into an auditory object in the synthesis phase. In contrast, if envelope entrainment involves only direct neural processing of the envelope, its robustness to noise may arise from more basic processes such as contrast gain control (Rabinowitz et al., 2011; Ding and Simon, 2013).
In this study, we investigate whether noise-robust cortical entrainment to the speech envelope involves merely envelope processing or instead reflects an analysis-by-synthesis process that includes the processing of spectro-temporal fine structure and reflects envelope properties of the re-synthesized auditory object. Here, the spectro-temporal fine structure refers to the acoustic information not included in the broadband envelope of speech (<16 Hz), including, for example, the acoustic cues responsible for the pitch and formant structure of speech. We degrade the spectro-temporal fine structure of speech or speech-noise mixtures using noise vocoders and use MEG to investigate whether vocoded stimuli are cortically represented differently from natural speech. If cortical entrainment depends only on the temporal envelope, it should not be affected by degradation of the spectro-temporal fine structure, even in a noisy listening environment. In contrast, if reliable cortical entrainment to speech requires an analysis-by-synthesis process that relies on the spectro-temporal fine structure, it should be severely degraded for vocoded speech.
Materials & Methods
Subjects
Twelve normal-hearing, right-handed (Oldfield, 1971) young adults (6 female), aged between 19 and 32 years (mean 23 years), participated in the experiment. Subjects were paid, and the experimental procedures were approved by the University of Maryland institutional review board. Written informed consent was obtained from each subject before the experiment.
Stimuli
The stimuli were selected from a narration of the story Alice’s Adventures in Wonderland (Chapter One, http://librivox.org/alices-adventures-in-wonderland-by-lewis-carroll-4/). The sound recording was low-pass filtered below 4 kHz and divided into twelve 50-second segments, after long speaker pauses (> 300 ms) were shortened to 300 ms. All sound stimuli were presented binaurally (diotically). Six types of stimuli were created (2 noise levels × 3 vocoding conditions).
Background Noise
Half of the speech segments (N = 6) were presented in a quiet listening environment (no noise added), while the other half were mixed with spectrally matched stationary noise, generated using a 12th-order linear predictive model estimated from the speech recording. The speech-to-noise intensity ratio was fixed at 3 dB, measured by RMS.
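For illustration, a minimal sketch of how such spectrally matched noise can be generated is given below: white noise is shaped by an all-pole filter fitted to the speech. The use of librosa.lpc, the white-noise excitation, and the function and parameter names are assumptions of this sketch and not taken from the original implementation.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def spectrally_matched_noise(speech, order=12, snr_db=3.0, seed=0):
    """Stationary noise with the long-term spectrum of the speech recording:
    white noise shaped by an all-pole (LPC) model estimated from the speech.
    Sketch only; the white-noise excitation is an assumption."""
    a = librosa.lpc(np.asarray(speech, dtype=float), order=order)  # [1, a1, ..., a_order]
    rng = np.random.default_rng(seed)
    noise = lfilter([1.0], a, rng.standard_normal(len(speech)))    # shape noise with 1/A(z)
    # Scale the noise so that the speech-to-noise RMS ratio equals snr_db.
    noise_rms_target = np.sqrt(np.mean(np.square(speech))) / 10 ** (snr_db / 20)
    return noise * noise_rms_target / np.sqrt(np.mean(np.square(noise)))
```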
Noise Vocoding
Each stimulus, whether speech in quiet or speech in noise, was either left unprocessed or noise vocoded through a 4-channel or 8-channel vocoder. The noise vocoder filtered the stimulus into 4 or 8 frequency channels between 123 and 3951 Hz using 4th-order Butterworth filters, with the channels evenly spaced on the Cam scale (Glasberg and Moore, 1990; Qin and Oxenham, 2003). In each frequency band, the envelope of the stimulus was extracted by taking the absolute value of the Hilbert transform, low-pass filtering below 160 Hz using a 4th-order Butterworth filter, and then half-wave rectifying the filtered signal. The extracted envelope was used to modulate white noise filtered into the same frequency band from which the envelope was derived. The envelope-modulated noises were then summed over frequency bands to create the noise-vocoded stimulus, and the RMS intensity of the noise-vocoded stimulus was adjusted to match that of the unprocessed stimulus.
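A minimal sketch of this vocoding procedure is shown below, assuming a sampled waveform x at rate fs. The zero-phase filtering, the Cam-scale formula, and the function and parameter names are assumptions of the sketch rather than details taken from the original implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def cam_edges(f_lo, f_hi, n_bands):
    """Channel edges evenly spaced on the Cam (ERB-number) scale
    (Glasberg and Moore, 1990)."""
    cam = lambda f: 21.4 * np.log10(0.00437 * f + 1.0)
    cam_inv = lambda c: (10.0 ** (c / 21.4) - 1.0) / 0.00437
    return cam_inv(np.linspace(cam(f_lo), cam(f_hi), n_bands + 1))

def noise_vocode(x, fs, n_bands=8, f_lo=123.0, f_hi=3951.0, env_cutoff=160.0):
    """Noise-vocoder sketch: band-pass analysis, envelope extraction
    (|Hilbert|, <160 Hz low-pass, half-wave rectification), and modulation
    of band-limited white noise."""
    edges = cam_edges(f_lo, f_hi, n_bands)
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # scipy's butter() doubles the order for band-pass designs,
        # so N=2 yields a 4th-order band-pass filter.
        b, a = butter(2, [lo, hi], btype="bandpass", fs=fs)
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))                            # Hilbert envelope
        be, ae = butter(4, env_cutoff, fs=fs)                  # 4th-order low-pass
        env = np.maximum(filtfilt(be, ae, env), 0.0)           # half-wave rectify
        carrier = filtfilt(b, a, rng.standard_normal(len(x)))  # noise in same band
        out += env * carrier
    # Match the RMS intensity of the unprocessed stimulus.
    return out * np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))
```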
Stimulus Characterization
The auditory spectrogram of the stimulus was calculated using a sub-cortical auditory model (Yang et al., 1992) and expressed on a logarithmic amplitude scale. The frequency-by-time auditory spectrogram has 128 logarithmically spaced frequency channels and 10-ms temporal resolution. The broadband temporal envelope of the stimulus was extracted by summing the auditory spectrogram over frequency.
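As a rough illustration of this envelope definition, the sketch below sums a log-amplitude spectrogram over frequency. A mel filterbank is used here only as a stand-in for the auditory model of Yang et al. (1992), so the spectral analysis itself is an approximation, and the function and parameter names are illustrative.

```python
import numpy as np
import librosa

def broadband_envelope(x, fs, n_channels=128, hop_ms=10):
    """Broadband temporal envelope: log-amplitude spectrogram summed over
    frequency.  The mel filterbank is a stand-in for the auditory model of
    Yang et al. (1992), not the model itself."""
    hop = int(fs * hop_ms / 1000)                       # 10-ms frame step
    spec = librosa.feature.melspectrogram(y=x, sr=fs, n_mels=n_channels,
                                          hop_length=hop)
    log_spec = np.log(spec + 1e-10)                     # logarithmic amplitude scale
    return log_spec.sum(axis=0)                         # sum over frequency channels
```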
Procedure
The stimuli were presented in two orders, each to half of the subjects. In either order, the story continued naturally from one stimulus to the next, and each stimulus was repeated twice after its first presentation (3 trials in total). In the progressive order, the first two speech segments were natural speech presented in quiet, followed by 8-band vocoded speech in quiet and then 4-band vocoded speech in quiet. Then, natural speech in noise, 8-band vocoded speech in noise, and 4-band vocoded speech in noise were presented sequentially. To control for the effect of presentation order, we also created a random-order condition, in which each acoustic manipulation (e.g. vocoding or background noise) was assigned randomly to a segment for each subject. The two presentation orders did not result in any difference in speech intelligibility or in the neural synchronization spectrum and were therefore not distinguished in the following analysis.
The subjects were asked to listen to the story with their eyes closed. Questions about the story were asked after each 50-second stimulus to ensure the subjects’ attention. The subjects were also asked to rate the percentage of words they understood after the first presentation of each stimulus, on a scale from 0% (not intelligible) to 100% (fully intelligible). The grand-averaged subjective intelligibility rating was highly correlated with the grand-averaged percentage of questions answered correctly (R = 0.96). Before the experiment, the subjects listened to 100 repetitions of a 500-Hz tone, and the responses were used to extract the M100 response, a salient MEG response localized to auditory cortex (Lütkenhöner and Steinsträter, 1998).
The magnetic field generated by cortical activity was recorded using a 157-channel whole-head MEG system (KIT, Kanazawa, Japan). The signal was sampled at 1 kHz and filtered online with a 200-Hz low-pass filter and a 60-Hz notch filter. Environmental noise was further removed using time-shift PCA (TS-PCA; de Cheveigné and Simon, 2007). The whole-head MEG recording was used for analysis unless otherwise specified. When the two hemispheres were analyzed separately, hemisphere-specific responses were extracted using the 55 sensors located above each hemisphere. Further details of the recording procedure are described in Ding and Simon (2012a).
Inter-trial Correlation Analysis
The phase locking of the neural response was evaluated by its inter-trial correlation in narrow (2-Hz wide) frequency bands (Ding and Simon, 2013; Zion Golumbic et al., 2013). The inter-trial correlation is the Pearson correlation coefficient between two trials of the neural response to the same stimulus, averaged over all possible pairs of trials. It measures the reliability of the neural response when the same stimulus is repeated and therefore reflects the strength of phase-locked neural activity. The major phase-locked component of the MEG response was extracted using a blind source separation method, denoising source separation (DSS) (de Cheveigné and Simon, 2008); the first DSS component was used for this analysis.
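A minimal sketch of this measure for one frequency band is given below, assuming the repeated trials of one response component are stacked in a matrix. The filter type, order, and zero-phase filtering are assumptions of the sketch, not specifications from the paper.

```python
import numpy as np
from itertools import combinations
from scipy.signal import butter, sosfiltfilt

def inter_trial_correlation(trials, fs, band):
    """Inter-trial correlation in one narrow frequency band.

    trials: array (n_trials, n_samples) of repeated responses to the same
    stimulus (e.g. the first DSS component).  band: (low, high) in Hz,
    e.g. (1, 3) for a 2-Hz-wide bin."""
    sos = butter(2, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, trials, axis=1)
    # Pearson correlation for every pair of trials, averaged over pairs.
    pairs = combinations(range(len(filtered)), 2)
    r = [np.corrcoef(filtered[i], filtered[j])[0, 1] for i, j in pairs]
    return float(np.mean(r))
```

Averaging this quantity over the 2-Hz bins spanning 1–4 Hz or 4–8 Hz then gives the delta- and theta-band summaries shown in Fig. 2C.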
Temporal Response Function
The response from each MEG sensor is modeled as the speech envelope convolved with a temporal response function (TRF), which characterizes the cortical response evoked by a unit power increase of the stimulus (Ding and Simon, 2012b). The TRF is derived by summing a spectro-temporal response function (STRF) over frequency.
The STRF is estimated using boosting with 10-fold cross validation (David et al., 2007). For computational efficiency, the 157 MEG sensors were reduced to 10 DSS components (de Cheveigné and Simon, 2008). The STRFs separately estimated for the 10 DSS components were converted back to the sensor space for further analysis (Ding and Simon, 2012a).
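As an illustration of the boosting idea, the following is a simplified, single-predictor sketch that estimates a TRF from the broadband envelope, using a held-out segment for early stopping. It omits the spectro-temporal dimension, the 10-fold cross-validation, and other details of David et al. (2007), and all function and parameter names are illustrative only.

```python
import numpy as np

def boost_trf(envelope, response, n_lags, delta=0.01, n_iter=500, val_frac=0.1):
    """Coordinate-boosting sketch of a temporal response function (TRF).

    envelope, response: 1-D arrays of equal length (stimulus envelope and one
    response component).  n_lags: number of TRF taps.  Simplified version of
    the boosting approach of David et al. (2007)."""
    n = len(envelope)
    # Lagged design matrix: column k holds the envelope delayed by k samples.
    X = np.zeros((n, n_lags))
    for k in range(n_lags):
        X[k:, k] = envelope[:n - k]
    # Hold out the final segment for early stopping.
    n_val = max(1, int(n * val_frac))
    X_tr, y_tr = X[:-n_val], response[:-n_val]
    X_va, y_va = X[-n_val:], response[-n_val:]
    trf = np.zeros(n_lags)
    step = delta * np.std(y_tr) / (np.std(envelope) + 1e-12)
    best_val = np.mean((y_va - X_va @ trf) ** 2)
    for _ in range(n_iter):
        resid = y_tr - X_tr @ trf
        gains = X_tr.T @ resid               # residual correlation with each lag
        # Columns are shifted copies of the same envelope, so their norms are
        # comparable and the largest |gain| marks the most helpful small update.
        k = int(np.argmax(np.abs(gains)))
        candidate = trf.copy()
        candidate[k] += step * np.sign(gains[k])
        val = np.mean((y_va - X_va @ candidate) ** 2)
        if val >= best_val:                  # stop once validation error stops improving
            break
        trf, best_val = candidate, val
    return trf
```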
The two major peaks of the TRF have latencies near 50 ms and 100 ms and are referred to as the M50TRF and the M100TRF, respectively. The M50TRF is extracted as the peak of the root mean square (RMS) of the MEG response over sensors between 20 and 80 ms, while the M100TRF is extracted as the peak between 90 and 160 ms.
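This peak extraction can be sketched as follows, assuming the TRF is stored as a sensors-by-lags matrix sampled at rate fs; the function name and return values are illustrative only.

```python
import numpy as np

def trf_peak(trf, fs, window):
    """Peak of the sensor-RMS TRF within a latency window.

    trf: array (n_sensors, n_lags); window: (t_min, t_max) in seconds,
    e.g. (0.02, 0.08) for the M50TRF or (0.09, 0.16) for the M100TRF."""
    rms = np.sqrt(np.mean(trf ** 2, axis=0))    # RMS over sensors at each lag
    lo, hi = (int(round(t * fs)) for t in window)
    k = lo + int(np.argmax(rms[lo:hi]))
    return rms[k], k / fs                       # peak amplitude and latency (s)
```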
When deriving the STRF, the auditory spectrogram (Yang et al., 1992) of clean speech is always used, even when modeling the neural response to noisy speech. Previous studies have shown that the spectrogram of clean speech models the MEG response slightly better than the spectrogram of the actual noisy stimulus (Ding and Simon, 2013). More importantly, since the background noise is stationary, the shape of the spectrogram of the noisy stimulus closely resembles that of clean speech but the dynamic range, or contrast, of the spectrogram is much smaller for the noisy stimulus. Therefore, it is the gain, rather than the shape of the TRF, that depends strongly on which spectrogram is used. Here, by using the original speech spectrogram, the effect of noise on the TRF amplitude depends on how the neural response amplitude changes with noise, rather than on how the stimulus dynamic range changes with noise.
Results
MEG responses were recorded from subjects listening to a narrated story presented either in quiet or in spectrally matched stationary noise (3 dB SNR). The speech stimuli were presented either without additional processing, referred to as natural speech, or after being processed by a noise vocoder (4-band or 8-band), referred to as vocoded speech. Noise vocoding reduces the spectral resolution of speech, as demonstrated by the auditory spectrograms of the stimuli (Fig. 1). The temporal envelope (the summation of the auditory spectrogram over frequency), however, is essentially identical before and after noise vocoding (R > 0.99 for the stimuli used in this study).
Figure 1.

Examples of the auditory spectrograms of the experimental stimuli (2 noise levels × 3 vocoding conditions). Subjectively rated speech recognition score (Mean ± SEM) is labeled in the upper right corner of each spectrogram. Vocoding (8-band or 4-band) degrades the spectro-temporal fine structure of speech, for example, the harmonic structure, but preserves the temporal envelope.
Neural Synchronization Spectrum
We first characterize how each stimulus synchronizes the neural responses in different frequency bands. The degree of neural synchronization is measured by the inter-trial correlation of the neural recording (Fig. 2). Consistent with previous studies (Luo and Poeppel, 2007; Ding and Simon, 2012b), neural synchronization to speech is observed in the delta (1–4 Hz) and theta (4–8 Hz) bands. A comparison of the neural responses to speech in quiet and speech in noise indicates that neural synchronization is robust to noise for natural but not for noise-vocoded speech (Fig. 2A). Note that for speech presented in noise, the same noise signal is used across trials; the inter-trial correlation therefore reflects neural phase locking to any available stimulus feature, including the background noise. Consequently, the reduced neural phase locking for vocoded speech in noise indicates an overall reduction in the response to the speech-noise mixture.
Figure 2.
The inter-trial correlation of the neural response to natural and vocoded speech. (A) The inter-trial correlation grouped by spectral resolution. Background noise decreases the inter-trial correlation of the neural response for vocoded speech but not for natural speech. Frequency regions where the response is significantly affected by noise (P < 0.001, 1-way ANOVA) are shaded in yellow. (B) The inter-trial correlation grouped by noise level. Noise vocoding affects the phase-locking spectrum differentially for speech in quiet and speech in noise. Frequency regions where the response is significantly affected by vocoding (P < 0.001, 1-way ANOVA) are shaded in yellow. (C) The inter-trial correlation averaged over the delta (1–4 Hz) and theta (4–8 Hz) bands. Error bars represent one SEM. In quiet, the delta-band inter-trial correlation increases with reduced spectral resolution, while the theta-band inter-trial correlation decreases. For speech in noise, the inter-trial correlation decreases with decreasing spectral resolution in both bands. *P < 0.01, **P < 0.001 (paired t-test).
In a quiet listening environment, as the spectral resolution of the stimulus decreases, neural synchronization below 4 Hz is enhanced (P < 0.01, 1-way repeated measures ANOVA) while neural synchronization above 4 Hz is reduced (P < 0.003, 1-way repeated measures ANOVA) (Fig. 2B,C). In a noisy listening environment, however, the degree of neural synchronization is reduced in both the delta (P < 0.003, 1-way repeated measures ANOVA) and theta bands (P < 10^−5, 1-way repeated measures ANOVA) as the stimulus spectral resolution decreases (Fig. 2B). In this analysis, the two hemispheres are combined. When each hemisphere is analyzed separately, no significant hemispheric lateralization is seen in any of the 6 stimulus conditions for any 2-Hz band between 2 and 8 Hz (P > 0.17, mean P = 0.37, uncorrected paired t-test).
Predicting Individual Speech Recognition Score
The subjectively rated speech recognition score varies strongly across subjects. The individual recognition score significantly correlates with delta-band neural synchronization for 4-band vocoded speech in quiet, and for 8-band vocoded speech both in quiet and in noise (P < 0.002, bootstrap). Since there are 6 stimulus conditions, the P-value remains below 0.012 after Bonferroni correction. The correlation coefficients are 0.66 ± 0.14, 0.55 ± 0.14, and 0.71 ± 0.11 (Mean ± SEM) for these 3 conditions (Fig. 3, from left to right). For 4-band vocoded speech in noise, a weaker correlation is also found (P < 0.02, bootstrap; R = 0.43 ± 0.20). For natural speech in quiet and in noise, speech intelligibility reaches ceiling, obscuring any observable correlation between neural synchronization and speech intelligibility. In this correlation analysis, the two hemispheres are combined. If each hemisphere is considered separately, the only significant correlation is for 8-band vocoded speech in noise, in the left hemisphere (P < 0.002, bootstrap). This result indicates that the two hemispheres encode speech similarly and that integrating measurements across hemispheres therefore increases the statistical power of the correlation analysis. The correlation between neural synchronization and speech intelligibility is observed only in the delta band, not in the theta band, in any condition.
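A plausible sketch of such a bootstrap test across subjects is given below: subjects are resampled with replacement, the correlation is recomputed on each resample, and the fraction of resamples with a non-positive correlation serves as a one-sided P value. The authors' exact resampling scheme is not specified here, so this is illustrative only.

```python
import numpy as np

def bootstrap_correlation(x, y, n_boot=10000, seed=0):
    """Bootstrap the correlation between a neural measure x (e.g. delta-band
    inter-trial correlation) and the recognition score y across subjects.

    x, y: 1-D numpy arrays, one entry per subject.  Returns the mean and SD of
    the bootstrap distribution of R and the one-sided P value P(R <= 0)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    r_boot = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample subjects with replacement
        r_boot[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return r_boot.mean(), r_boot.std(), float(np.mean(r_boot <= 0))
```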
Figure 3.

Delta-band neural synchronization correlates with the speech recognition score of individual listeners. Each cross shows the data from a listener and the solid line is the regression line. The correlation coefficient between delta-band inter-trial correlation and the speech score is shown at the upper left corner of each plot (Mean ± SEM).
Temporal Response Function
The neural synchronization analysis characterizes response reliability over trials; in the following, we further investigate how the neural response follows the speech envelope, using a temporal response function (TRF). The TRF can be interpreted as the neural response evoked by a unit broadband power increase of the stimulus (Ding and Simon, 2012b). The RMS of the TRFs from all MEG sensors is shown in Fig. 4. The amplitude of the TRF is dimensionless and is normalized by the maximal amplitude of the TRF for natural speech in quiet. The TRF in quiet shows two early peaks near 50 ms and 100 ms, referred to as the M50TRF and M100TRF, respectively (Ding and Simon, 2013).
Figure 4.

Properties of the temporal response function (TRF). (A) The RMS of the TRF over MEG sensors, which can be interpreted as the total energy of the neural response evoked by a unit power increase of the sound stimulus. Time intervals where the TRF amplitude is significantly modulated by background noise (P < 0.001) are shaded in yellow. The early peak of the TRF near 50 ms, i.e. the M50TRF, is most strongly affected by background noise and stimulus spectral resolution. Its amplitude decreases in the presence of background noise and increases when the stimulus spectral resolution decreases in a quiet listening environment.
The early response component M50TRF is sensitive to noise while the late response component M100TRF is not. Specifically, the M50TRF amplitude is significantly reduced by noise (P < 0.001, F(1, 71) = 41.26, SNR × spectral resolution 2-way repeated measures ANOVA) and there is an interaction between the influence of noise and the influence of spectral resolution (P = 0.033 with Geisser-Greenhouse correction, F(2, 71) = 5.34). To investigate this interaction, two separate ANOVAs are applied to the M50TRF amplitude in quiet and in noise, with spectral resolution as the analysis factor. In quiet, the M50TRF amplitude increases with reduced spectral resolution (P = 0.006, F(2, 33) = 6.03, 1-way repeated measures ANOVA). In noise, the M50TRF amplitude is weak and does not change significantly as the stimulus spectral resolution decreases. The M100TRF amplitude is not significantly affected by noise or spectral resolution.
Discussion
This study demonstrates that although cortical entrainment to natural speech is robust to noise, cortical entrainment to vocoded speech is not. This phenomenon cannot be explained by passive envelope-tracking mechanisms, since noise vocoding does not directly affect the stimulus envelope to which cortical activity is entrained. Instead, the results illustrate that the spectro-temporal fine structure, which is degraded for noise-vocoded speech, is critical to segregating speech from noise and to constructing an object-level neural representation of speech that is robust to the listening background.
Object-based vs. Stimulus-based Representation
Only after simultaneous auditory objects are neurally segregated can each of them be represented and processed independently of the others (Ding and Simon, 2012a). On the other hand, if the auditory scene is represented as a whole in auditory cortex, the cortical representation will be affected by every component of that auditory scene. In this study, we found that the neural representation of natural speech is largely invariant to a moderate amount of background noise, i.e. at 3 dB SNR, indicating that natural speech is neurally segregated from noise and represented as an individual auditory object. This result is consistent with a previous study in which the neural representation of speech was found to remain largely independent of background noise down to the substantially worse SNR of −3 dB (Ding and Simon, 2013). In contrast, the neural representation of vocoded speech is significantly degraded by noise. This encoding interference between noise and speech suggests that the vocoded speech-noise mixture is not neurally segregated in auditory cortex.
The Role of Contrast Gain Control
Stationary noise significantly reduces the intensity contrast of speech but does not strongly affect the shape of the temporal envelope of speech (Fig. 1). As a result, the robust neural representation of natural speech may be accounted for by contrast gain control (Ding and Simon, 2013), a relatively passive mechanism that can be observed in anesthetized animals (Dean et al., 2005; Rabinowitz et al., 2011). Nevertheless, although contrast and intensity gain control surely play an important role in maintaining the noise-robust representation of speech, they cannot explain the phenomenon observed in this study without assuming prior neural segregation of speech and noise. First, background noise reduces the intensity contrast of natural speech and vocoded speech in the same way, but a noise-robust cortical representation is observed only for natural speech. Second, even in auditory cortex, contrast gain control is incomplete, in the sense that even though the neural gain changes, the cortical response is still affected by the stimulus intensity contrast (Rabinowitz et al., 2011).
Robust Cortical Entrainment: An Analysis-by-Synthesis Approach
An analysis-by-synthesis approach is ubiquitously adopted in sensory systems (Yuille and Kersten, 2006; Poeppel et al., 2008). In a primary analysis stage, the sensory system breaks up the sensory input into fundamental features, e.g. edges are encoded in the visual system and spectro-temporal features in the auditory system. After this stage, however, a synthesis stage is necessary to reconstruct the sensory experience, usually with top-down modulations that contribute pertinent a priori information about the observer’s world. Such mechanisms are likely to be useful in attenuating the effects of unwanted noise.
The sensitivity to the spectro-temporal fine structure indicates that robust cortical entrainment to speech is the consequence of the analysis-by-synthesis process rather than just bottom-up envelope tracking. In the sub-cortical auditory system, acoustic cues that are important for sound source segregation, such as pitch and binaural cues, are extracted (Nelken, 2008). This decomposition process can be viewed as an analysis stage. In order to achieve speech recognition or auditory perception in general, however, features belonging to the same speech stream need to be bound or re-synthesized into an auditory object (Shinn-Cunningham, 2008). In speech, multiple acoustic features are temporally coupled and the spectro-temporal fine structure is modulated by the temporal envelope (Sheft, 2007; Shamma et al., 2011). Therefore, in this synthesis stage, sound segregation cues play a guiding role: The auditory system is proposed to group features based on their temporal coherence with the sound segregation cues (Shamma et al., 2011). As a consequence of this temporal coherence based grouping, features belonging to the attended speech stream are recovered from a complex auditory scene and appear as neural activity entrained to the temporal envelope of the attended speech. When sound segregation cues such as the spectro-temporal fine structure are degraded, features extracted from a complex auditory scene can no longer be selectively grouped into a representation specific to the attended speech stream. Therefore, cortical entrainment is degraded.
Influence of Spectral Resolution on Speech Encoding
As the spectral resolution of speech decreases, speech intelligibility decreases mildly in a quiet listening environment but severely in noisy environments (Friesen et al., 2001). The same trend is seen in theta- but not delta-band cortical synchronization. In a quiet environment, as the stimulus spectral resolution decreases, theta-band synchronization is moderately reduced, consistent with previous studies (Luo and Poeppel, 2007; Peelle et al., 2013), while delta-band synchronization is enhanced. It is possible that the reduction in theta-band activity reflects an impairment of the neural processing of syllabic-level speech features (Giraud and Poeppel, 2012; Peelle et al., 2013), while the enhancement in delta-band activity, hypothesized as an instrument of top-down attention (Schroeder and Lakatos, 2009), may reflect increased listening effort. Both delta- and theta-band synchronization in auditory cortex are likely to be modulated by higher-level cortical areas involved in language processing (Scott et al., 2006; Obleser and Weisz, 2012). Furthermore, when the spectral resolution decreases, the M50TRF is enhanced in the time domain. This is likely to be related to the observations that the M50 is stronger at the onset of a noise burst than at the onset of a tone (Chait et al., 2004) and that the onset response to vocoded speech is stronger than the onset response to natural speech (Obleser and Kotz, 2010).
Delta Band Synchronization and Speech Intelligibility
Cortical entrainment to speech is generally observed in the delta and theta bands. In this study and in other studies using long (>10 s) continuous speech stimuli, delta-band entrainment dominates the measured neural activity (e.g. Ding and Simon, 2012a; Ding and Simon, 2013; Zion Golumbic et al., 2013). In studies using isolated sentences (<5 s in duration), however, delta-band entrainment is much weaker than theta-band entrainment (e.g. Howard and Poeppel, 2010; Luo and Poeppel, 2007; Peelle et al., 2013). Therefore, it is likely that delta-band entrainment requires the longer-time-scale contextual information available in running speech.
Here, we observed that delta-band synchronization correlates with listeners’ speech recognition scores for vocoded speech, and a previous study found a similar correlation for speech embedded in strong noise (Ding and Simon, 2013). It is possible that delta-band synchronization to speech is a signature of the auditory cortical representation subserving subsequent language processing (Schroeder et al., 2008; Schroeder and Lakatos, 2009; Giraud and Poeppel, 2012). Alternatively, it is possible that speech intelligibility is required for delta-band synchronization to occur. This possibility, however, is not well supported, since strong neural synchronization has been observed for reversed speech (Howard and Poeppel, 2010) and for amplitude- and frequency-modulated tones (Henry and Obleser, 2012; Wang et al., 2012).
The correlation between neural synchronization and individual speech recognition score is not found in the theta band, consistent with previous studies (Peelle et al., 2013). One possible reason is that theta-band synchronization is relatively weak compared with delta-band synchronization for discourse-level speech stimuli, and the lower signal-to-noise ratio of theta activity makes it less reliably measured at the individual-subject level. Alternatively, it is also possible that theta-band activity faithfully reflects properties of the stimulus but not individual differences in neural processing (see also Schroeder et al., 2008).
In summary, the spectro-temporal fine structure is required to maintain noise-robust cortical entrainment to the speech envelope. These results demonstrate that envelope entrainment in auditory cortex is not just a neural representation of the speech envelope per se but instead is likely to be a collective, object-level neural representation that is achieved by an analysis-by-synthesis approach. Furthermore, since a degraded ability to separate simultaneous auditory objects is common in hearing-impaired listeners (Shinn-Cunningham and Best, 2008), the results here are indicative of cortical processing in impaired auditory systems.
Highlights.
Cortical entrainment to vocoded speech is sensitive to background noise
Robust cortical entrainment to speech relies on the spectro-temporal fine structure
Delta-band entrainment predicts individual speech recognition score
Acknowledgments
This work was supported by NIH grants R01 DC 008342 (to J.Z.S.) and R01 DC 004786 (to M.C.).
Contributor Information
Nai Ding, Email: nd45@nyu.edu.
Monita Chatterjee, Email: monita.chatterjee@boystown.org.
References
- Chait M, Simon JZ, Poeppel D. Auditory M50 and M100 responses to broadband noise: functional implications. Neuroreport. 2004;15:2455–2458. doi: 10.1097/00001756-200411150-00004.
- David SV, Mesgarani N, Shamma SA. Estimating sparse spectro-temporal receptive fields with natural stimuli. Network: Computation in Neural Systems. 2007;18:191–212. doi: 10.1080/09548980701609235.
- de Cheveigné A, Simon JZ. Denoising based on time-shift PCA. Journal of Neuroscience Methods. 2007;165:297–305. doi: 10.1016/j.jneumeth.2007.06.003.
- de Cheveigné A, Simon JZ. Denoising based on spatial filtering. Journal of Neuroscience Methods. 2008;171:331–339. doi: 10.1016/j.jneumeth.2008.03.015.
- Dean I, Harper NS, McAlpine D. Neural population coding of sound level adapts to stimulus statistics. Nature Neuroscience. 2005;8:1684–1689. doi: 10.1038/nn1541.
- Ding N, Simon JZ. Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences of the United States of America. 2012a;109:11854–11859. doi: 10.1073/pnas.1205381109.
- Ding N, Simon JZ. Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. Journal of Neurophysiology. 2012b;107:78–89. doi: 10.1152/jn.00297.2011.
- Ding N, Simon JZ. Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. Journal of Neuroscience. 2013;33:5728–5735. doi: 10.1523/JNEUROSCI.5297-12.2013.
- Friesen LM, Shannon RV, Baskent D, Wang X. Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants. Journal of the Acoustical Society of America. 2001;110:1150–1163. doi: 10.1121/1.1381538.
- Giraud A-L, Poeppel D. Cortical oscillations and speech processing: emerging computational principles and operations. Nature Neuroscience. 2012;15:511–517. doi: 10.1038/nn.3063.
- Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hearing Research. 1990;47:103–138. doi: 10.1016/0378-5955(90)90170-t.
- Zion Golumbic EM, Ding N, Bickel S, Lakatos P, Schevon CA, McKhann GM, Goodman RR, Emerson R, Mehta AD, Simon JZ, Poeppel D, Schroeder CE. Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”. Neuron. 2013;77:980–991. doi: 10.1016/j.neuron.2012.12.037.
- Henry MJ, Obleser J. Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences. 2012;109:20095–20100. doi: 10.1073/pnas.1213390109.
- Horton C, D’Zmura M, Srinivasan R. Suppression of competing speech through entrainment of cortical oscillations. Journal of Neurophysiology. 2013;109:3082–3093. doi: 10.1152/jn.01026.2012.
- Howard MF, Poeppel D. Discrimination of speech stimuli based on neuronal response phase patterns depends on acoustics but not comprehension. Journal of Neurophysiology. 2010;104:2500–2511. doi: 10.1152/jn.00251.2010.
- Kerlin JR, Shahin AJ, Miller LM. Attentional gain control of ongoing cortical speech representations in a “cocktail party”. Journal of Neuroscience. 2010;30:620–628. doi: 10.1523/JNEUROSCI.3631-09.2010.
- Luo H, Poeppel D. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron. 2007;54:1001–1010. doi: 10.1016/j.neuron.2007.06.004.
- Lütkenhöner B, Steinsträter O. High-precision neuromagnetic study of the functional organization of the human auditory cortex. Audiology and Neuro-Otology. 1998;3:191–213. doi: 10.1159/000013790.
- Mesgarani N, Chang EF. Selective cortical representation of attended speaker in multi-talker speech perception. Nature. 2012;485:233–236. doi: 10.1038/nature11020.
- Nelken I. Processing of complex sounds in the auditory system. Current Opinion in Neurobiology. 2008;18:413–417. doi: 10.1016/j.conb.2008.08.014.
- Obleser J, Kotz SA. Expectancy constraints in degraded speech modulate the language comprehension network. Cerebral Cortex. 2010;20:633–640. doi: 10.1093/cercor/bhp128.
- Obleser J, Weisz N. Suppressed alpha oscillations predict intelligibility of speech and its acoustic details. Cerebral Cortex. 2012;22:2466–2477. doi: 10.1093/cercor/bhr325.
- Obleser J, Herrmann B, Henry MJ. Neural oscillations in speech: don’t be enslaved by the envelope. Frontiers in Human Neuroscience. 2012;6. doi: 10.3389/fnhum.2012.00250.
- Oldfield RC. The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia. 1971;9:97–113. doi: 10.1016/0028-3932(71)90067-4.
- Peelle JE, Gross J, Davis MH. Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cerebral Cortex. 2013. doi: 10.1093/cercor/bhs118.
- Poeppel D, Idsardi WJ, van Wassenhove V. Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences. 2008;363:1071–1086. doi: 10.1098/rstb.2007.2160.
- Power AJ, Foxe JJ, Forde EJ, Reilly RB, Lalor EC. At what time is the cocktail party? A late locus of selective attention to natural speech. European Journal of Neuroscience. 2012;35:1497–1503. doi: 10.1111/j.1460-9568.2012.08060.x.
- Qin MK, Oxenham AJ. Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers. Journal of the Acoustical Society of America. 2003;114:446–454. doi: 10.1121/1.1579009.
- Rabinowitz NC, Willmore BDB, Schnupp JWH, King AJ. Contrast gain control in auditory cortex. Neuron. 2011;70:1178–1191. doi: 10.1016/j.neuron.2011.04.030.
- Schroeder CE, Lakatos P. Low-frequency neuronal oscillations as instruments of sensory selection. Trends in Neurosciences. 2009;32:9–18. doi: 10.1016/j.tins.2008.09.012.
- Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A. Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences. 2008;12:106–113. doi: 10.1016/j.tics.2008.01.002.
- Scott SK, Rosen S, Lang H, Wise RJ. Neural correlates of intelligibility in speech investigated with noise vocoded speech - a positron emission tomography study. Journal of the Acoustical Society of America. 2006;120:1075–1083. doi: 10.1121/1.2216725.
- Shamma SA, Elhilali M, Micheyl C. Temporal coherence and attention in auditory scene analysis. Trends in Neurosciences. 2011;34:114–123. doi: 10.1016/j.tins.2010.11.002.
- Shannon RV, Zeng F-G, Kamath V, Wygonski J, Ekelid M. Speech recognition with primarily temporal cues. Science. 1995;270:303–304. doi: 10.1126/science.270.5234.303.
- Sheft S. Envelope processing and sound-source perception. In: Yost WA, Popper AN, Fay RR, editors. Auditory Perception of Sound Sources. New York: Springer; 2007.
- Shinn-Cunningham BG. Object-based auditory and visual attention. Trends in Cognitive Sciences. 2008;12:182–186. doi: 10.1016/j.tics.2008.02.003.
- Shinn-Cunningham BG, Best V. Selective attention in normal and impaired hearing. Trends in Amplification. 2008;12:283–299. doi: 10.1177/1084713808325306.
- Wang Y, Ding N, Ahmar N, Xiang J, Poeppel D, Simon JZ. Sensitivity to temporal modulation rate and spectral bandwidth in the human auditory system: MEG evidence. Journal of Neurophysiology. 2012;107:2033–2041. doi: 10.1152/jn.00310.2011.
- Yang X, Wang K, Shamma SA. Auditory representations of acoustic signals. IEEE Transactions on Information Theory. 1992;38:824–839.
- Yuille A, Kersten D. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences. 2006;10:301–308. doi: 10.1016/j.tics.2006.05.002.

