PLOS Biology. 2020 Oct 22;18(10):e3000883. doi: 10.1371/journal.pbio.3000883

Neural speech restoration at the cocktail party: Auditory cortex recovers masked speech of both attended and ignored speakers

Christian Brodbeck 1,*, Alex Jiao 2, L Elliot Hong 3, Jonathan Z Simon 1,2,4
Editor: Manuel S Malmierca
PMCID: PMC7644085  PMID: 33091003

Abstract

Humans are remarkably skilled at listening to one speaker out of an acoustic mixture of several speech sources. Two speakers are easily segregated, even without binaural cues, but the neural mechanisms underlying this ability are not well understood. One possibility is that early cortical processing performs a spectrotemporal decomposition of the acoustic mixture, allowing the attended speech to be reconstructed via optimally weighted recombinations that discount spectrotemporal regions where sources heavily overlap. Using human magnetoencephalography (MEG) responses to a 2-talker mixture, we show evidence for an alternative possibility, in which early, active segregation occurs even for strongly spectrotemporally overlapping regions. Early (approximately 70-millisecond) responses to nonoverlapping spectrotemporal features are seen for both talkers. When competing talkers’ spectrotemporal features mask each other, the individual representations persist, but they occur with an approximately 20-millisecond delay. This suggests that the auditory cortex recovers acoustic features that are masked in the mixture, even if they occurred in the ignored speech. The existence of such noise-robust cortical representations, of features present in attended as well as ignored speech, suggests an active cortical stream segregation process, which could explain a range of behavioral effects of ignored background speech.


How do humans focus on one speaker when several are talking? MEG responses to a continuous two-talker mixture suggest that, even though listeners attend only to one of the talkers, their auditory cortex tracks acoustic features from both speakers. This occurs even when those features are locally masked by the other speaker.

Introduction

When listening to an acoustic scene, the signal that arrives at the ears is an additive mixture of the different sound sources. Listeners trying to selectively attend to one of the sources face the task of determining which spectrotemporal features belong to that source [1]. When multiple speech sources are involved, as in the classic cocktail party problem [2], this is a nontrivial problem because the spectrograms of the different sources often have strong overlap. Nevertheless, human listeners are remarkably skilled at focusing on one out of multiple talkers [3,4]. Binaural cues can facilitate segregation of different sound sources based on their location [5] but are not necessary for this ability, because listeners are able to selectively attend even when 2 speech signals are mixed into a monophonic signal and presented with headphones [6]. Here we are specifically interested in the fundamental ability to segregate and attend to one out of multiple speakers even without such external cues.

The neural mechanisms involved in this ability are not well understood, but previous research suggests at least 2 separable cortical processing stages. In magnetoencephalography (MEG) responses to multiple talkers [7], the early (approximately 50-millisecond) cortical component is better described as a response to the acoustic mixture than as the sum of the responses to the individual (segregated) source signals, consistent with an early unsegregated representation of the mixture. In contrast, the later (> 85 millisecond) response component is dominated by the attended (segregated) source signal. Recent direct cortical recordings largely confirm this picture, suggesting that early responses in Heschl’s gyrus (HG) reflect a spectrotemporal decomposition of the acoustic mixture that is largely unaffected by selective attention, whereas later responses in the superior temporal gyrus (STG) dynamically change to represent the attended speaker [8]. In general, cortical regions further away from core auditory cortex tend to mainly reflect information about the attended speaker [9]. Together, these results suggest a cortical mechanism that, based on a detailed representation of the acoustic input, detects and groups features belonging to the attended source.

A long-standing question is whether early cortical processing of the acoustic mixture is restricted to passive spectrotemporal filtering, or whether it involves active grouping of acoustic features leading to the formation of auditory object representations. The filter theory of attention suggests that early representations reflect physical stimulus characteristics independent of attention, with attention selecting a subset of these for further processing and semantic identification [10,11]. Consistent with this, electroencephalography (EEG) and MEG results suggest that time-locked processing of higher order linguistic features, such as words and meaning, is restricted to the attended speech source [12,13]. However, it is not known whether, in the course of recovering the attended source, the auditory cortex also extracts acoustic features of the ignored source from the mixture. Individual intracranially recorded HG responses to a 2-speaker mixture can be selective for either one of the speakers, but this selectivity can be explained merely by spectral response characteristics favoring the spectrum of a given speaker over the other [8]. A conservative hypothesis is thus that early auditory cortical responses represent acoustic features of the mixture based on stable (possibly predefined) spectrotemporal receptive fields, allowing the attended speech to be segregated through an optimally weighted combination of these responses. Alternatively, the auditory cortex could employ more active mechanisms to dynamically recover potential speech features, regardless of what stream they belong to. Selective attention could then rely on these local auditory (proto-) objects to recover the attended speech [14]. This hypothesis critically predicts the existence of representations of acoustic features from an ignored speech source, even when those features are not apparent in the acoustic mixture, i.e., when those features are masked by acoustic energy from another source. Here we report evidence for such representations in human MEG responses.

Cortical responses to speech reflect a strong representation of the envelope (or spectrogram) of the speech signal [15,16]. Prior work has also shown that acoustic onsets are prominently represented in auditory cortex, both in naturalistic speech [17,18] and in nonspeech stimuli [19,20]. Studies using paradigms similar to the one used here often predicted brain responses from only envelopes or only onsets [16,21–24], but more recent studies show that both representations explain nonredundant portions of the responses [12,18]. Behaviorally, acoustic onsets are also specifically important for speech intelligibility [25,26]. Here we consider both envelope and onset features but focus on onset features in particular because of their relevance for stream segregation, as follows [1]. If acoustic elements in different frequency regions are co-modulated over time, they likely stem from the same physical source [27]. A simultaneous onset in distinct frequency bands thus provides sensory evidence that these cross-frequency features originate from the same acoustic source and should be processed as an auditory object. Accordingly, shared acoustic onsets promote perceptual grouping of acoustic elements into a single auditory object, such as a complex tone; conversely, separate onsets lead to perceptual segregation [28,29]. For example, the onset of a vowel is characterized by a shared onset at the fundamental frequency of the voice and its harmonics. Correspondingly, if the onset of a formant is artificially offset by as little as 80 milliseconds, it can be perceived as a separate tone rather than as a component of the vowel [30]. This link to object perception makes acoustic onsets particularly relevant cues: they might be represented distinctly from envelope cues, used to detect the beginning of local auditory objects, and thus aid segregation of the acoustic input into different, potentially overlapping auditory objects.

We analyzed human MEG responses to a continuous 2-talker mixture to determine to what extent the auditory cortex reliably tracks acoustic onset or envelope features of the ignored speech, above and beyond the attended speech and the mixture. Participants listened to 1-minute-long continuous audiobook segments, spoken by a male or a female speaker. Segments were presented in 2 conditions: a single talker in quiet (“clean speech”), and a 2-talker mixture, in which a female and a male speaker were mixed at equal perceptual loudness. MEG responses were analyzed as an additive, linear response to multiple concurrent stimulus features (see Fig 1). First, cross-validated model comparisons were used to determine which representations significantly improve prediction of the MEG responses. Then, the resultant spectrotemporal response functions (STRFs) were analyzed to gain insight into the nature of the representations.

Fig 1. Additive linear response model based on STRFs.


(A) MEG responses recorded during stimulus presentation were source localized with distributed minimum norm current estimates. A single virtual source dipole is shown for illustration, with its physiologically measured response and the response prediction of a model. Model quality was assessed by the correlation between the measured and the predicted response. (B) The model’s predicted response is the sum of tonotopically separate response contributions generated by convolving the stimulus envelope at each frequency (C) with the estimated TRF of the corresponding frequency (D). TRFs quantify the influence of a predictor variable on the response at different time lags. The stimulus envelopes at different frequencies can be considered a collection of parallel predictor variables, as shown here by the gammatone spectrogram (8 spectral bins); the corresponding TRFs as a group constitute the STRF. Physiologically, the component responses (B) can be thought of as corresponding to responses in neural subpopulations with different frequency tuning, with MEG recording the sum of those currents. MEG, magnetoencephalographic; STRF, spectrotemporal response function; TRF, temporal response function.
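To make this forward model concrete, the following minimal sketch (in Python, with hypothetical array names) predicts the response of a single virtual source dipole by summing band-wise convolutions of the envelope with the corresponding TRF, and scores the model by the correlation between measured and predicted responses, as in panel A:

```python
import numpy as np

def predict_response(envelopes, strf):
    """Additive linear forward model: sum of band-wise convolutions.

    envelopes: (n_bands, n_times) gammatone envelope (e.g., 8 spectral bins)
    strf: (n_bands, n_lags), one TRF per frequency band (together, the STRF)
    """
    n_bands, n_times = envelopes.shape
    y_hat = np.zeros(n_times)
    for band in range(n_bands):
        # response contribution of one tonotopic subpopulation
        y_hat += np.convolve(envelopes[band], strf[band])[:n_times]
    return y_hat

def model_quality(y, y_hat):
    """Correlation between the measured and the predicted response."""
    return np.corrcoef(y, y_hat)[0, 1]
```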

Results and discussion

Auditory cortex represents acoustic onsets

MEG responses to clean speech were predicted from the gammatone spectrogram of the stimulus and, simultaneously, from the spectrogram of acoustic onsets (Fig 2A). Acoustic onsets were derived from a neural model of auditory edge detection [19]. The 2 predictors were each binned into 8 frequency bands, such that the MEG responses were predicted from a model of the acoustic stimulus encompassing 16 time series in total. Each of the 2 predictors was assessed based on how well (left-out) MEG responses were predicted by the full model, compared with a null model in which the relevant predictor was omitted. Both predictors significantly improve predictions (onsets: tmax = 12.00, p ≤ 0.001; envelopes: tmax = 9.39, p ≤ 0.001), with an anatomical distribution consistent with sources in HG and STG bilaterally (Fig 2B). Because this localization agrees with findings from intracranial recordings [8,17], results were henceforth analyzed in an auditory region of interest (ROI) restricted to these 2 anatomical landmarks (Fig 2C). When averaging the model fits in this ROI, almost all subjects showed evidence of responses associated with both predictors (Fig 2D).

Fig 2. MEG responses to clean speech.


(A) Schematic illustration of the neurally inspired acoustic edge detector model, which was used to generate onset representations. The signal at each frequency band was passed through multiple parallel pathways with increasing delays, so that an “edge detector” receptive field could detect changes over time. HWR removed the negative sections to yield onsets only. An excerpt from a gammatone spectrogram (“envelope”) and the corresponding onset representation are shown for illustration. (B) Regions of significant explanatory power of onset and envelope representations, determined by comparing the cross-validated model fit from the combined model (envelopes + onsets) to that when omitting the relevant predictor. Results are consistent with sources in bilateral auditory cortex (p ≤ 0.05, corrected for whole brain analysis). (C) ROI used for the analysis of response functions, including superior temporal gyrus and Heschl’s gyrus. An arrow indicates the average dominant current direction in the ROI (upward current), determined through the first principal component of response power. (D) Individual subject data corresponding to (B), averaged over the ROI in the LH and RH, respectively. (E) STRFs corresponding to onset and envelope representations in the ROI; the onset STRF exhibits a clear pair of positive and negative peaks, while peaks in the envelope STRF are less well-defined. Different color curves reflect the frequency bins, as indicated next to the onset and envelope spectrograms in panel A. Shaded areas indicate the within-subject standard error (SE) [31]. Regions in which STRFs differ significantly from 0 are marked with more saturated (less faded) colors (p ≤ 0.05, corrected for time/frequency). Data are available in S1 Data. HWR, half-wave rectification; LH, left hemisphere; MEG, magnetoencephalographic; RH, right hemisphere; ROI, region of interest; SE, standard error; STRF, spectrotemporal response function; TRF, temporal response function.

Auditory cortical STRFs were summarized for each subject and hemisphere using a subject-specific spatial filter based on principal component analyses of overall STRF power in the ROI. The average direction of that spatial filter replicates the direction of the well-known auditory MEG response (Fig 2C, arrows). This current vector is consistent with activity in core auditory cortex and the superior temporal plane. However, MEG sensors are less sensitive to radial currents, as would be expected from lateral STG areas implicated by intracranial recordings [8]. Because of this, we focus here on the temporal information in STRFs rather than drawing conclusions from the spatial distribution of sources. STRFs can thus be interpreted as reflecting different processing stages associated with different latencies, possibly involving multiple regions in the superior temporal lobe. STRFs were initially separately analyzed by hemisphere, but because none of the reported results interact significantly with hemisphere, the results shown are collapsed across hemisphere to simplify presentation.

STRFs to acoustic onsets exhibit a well-defined 2-peaked shape, consistent across frequency bands (Fig 2E). An early, positive peak (average latency 65 milliseconds) is followed by a later, negative peak (126 milliseconds). This structure closely resembles previously described auditory response functions to envelope representations when estimated without consideration of onsets [16]. In comparison, envelope STRFs in the present results are diminished and exhibit a less well-defined structure. This is consistent with acoustic onsets explaining a large portion of the signal usually attributed to the envelope; indeed, when the model was refitted with only the envelope predictor, excluding the onset predictor, the envelope STRFs exhibited that canonical pattern, with larger amplitudes (see S1 Fig).

STRFs have disproportionately higher amplitudes at lower frequencies (Fig 2E), which is consistent with previous tonotopic mapping of speech areas and may follow from the spectral distribution of information in the speech signal [32,33]. This explanation is also supported by simulations, where responses to speech were generated using equal temporal response functions (TRFs) for each band, and yet estimated STRFs exhibited higher amplitudes in lower frequency bands (see S1 Simulations, Fig S1).

Auditory cortex represents ignored speech

MEG responses to a 2-speaker mixture were then analyzed for neural representations of ignored speech. Participants listened to a mixture of a male and a female talker at perceptually equal loudness and were instructed to attend to one talker and ignore the other. The speaker to be attended was counterbalanced across trials and subjects. Responses were predicted using both onset and envelope representations for: the acoustic mixture, the attended speech source, and the ignored source (Fig 3A). The underlying rationale is that, because the brain does not have direct access to the individual speech sources, if there is neural activity corresponding to either source separately (above and beyond the mixture), this indicates that cortical responses have segregated or reconstructed features of that source from the mixture. Both predictors representing the ignored speech significantly improve predictions of the responses in the ROI (both p < 0.001, onsets: tmax = 6.70, envelopes: tmax = 6.28; group-level statistics were evaluated with spatial permutation tests; subject-specific model fits, averaged in the ROI, are shown in Fig 3B). This result indicates that acoustic features of the ignored speech are represented neurally even after controlling for features of the mixture and the attended source. The remaining 4 predictors also significantly increased model fits (all p < 0.001; mixture onsets: tmax = 8.61, envelopes: tmax = 5.70; attended onsets: tmax = 6.32, envelopes: tmax = 7.37).

Fig 3. Responses to the 2-speaker mixture, using the stream-based model.


(A) The envelope and onset representations of the acoustic mixture and the 2 speech sources were used to predict MEG responses. (B) Individual subject model fit improvement due to each predictor, averaged in the auditory cortex ROI. Each predictor explains neural data not accounted for by the others. (C) Auditory cortex STRFs to onsets are characterized by the same positive/negative peak structure as STRFs to a single speaker. The early, positive peak is dominated by the mixture but also contains speaker-specific information. The second, negative peak is dominated by representations of the attended speaker and, to a lesser extent, the mixture. As with responses to a single talker, the envelope STRFs have lower amplitudes, but they do show a strong and well-defined effect of attention. Explicit differences between the attended and ignored representations are shown in the bottom row. Details as in Fig 2. (D) The major onset STRF peaks representing individual speech sources are delayed compared with corresponding peaks representing the mixture. To determine latencies, mixture-based and individual-speaker-based STRFs were averaged across frequency (lines with shading for mean ±1 SE). Dots represent the largest positive and negative peak for each subject between 20 and 200 milliseconds. Note that the y-axis is scaled by an extra factor of 4 beyond the indicated break points at y = 14 and −6. Data are available in S2 Data. LH, left hemisphere; MEG, magnetoencephalography; RH, right hemisphere; ROI, region of interest; SE, standard error; STRF, spectrotemporal response function.

Onset STRFs exhibit the same characteristic positive–negative pattern as for responses to a single talker but with reliable distinctions between the mixture and the individual speech streams (Fig 3C and 3D). The early, positive peak occurs earlier and has a larger amplitude for onsets in the mixture than for onsets in either of the sources (latency mixture: 72 milliseconds; attended: 81 milliseconds, t25 = 4.47, p < 0.001; ignored: 89 milliseconds, t25 = 6.92, p < 0.001; amplitude mixture > attended: t25 = 8.41, p < 0.001; mixture > ignored: t25 = 7.66, p < 0.001). This positive peak is followed by a negative peak only in responses to the mixture (136 milliseconds) and the attended source (150 milliseconds; latency difference t25 = 3.20, p = 0.004). The amplitude of these negative peaks is statistically indistinguishable (t25 = 1.56, p = 0.132).

The mixture predictor is not completely orthogonal to the source predictors, raising the concern that a true response to the mixture could cause spurious responses to the sources. Simulations using the same predictors as used in the experiment suggest, however, that such contamination is unlikely to have occurred (see S1 Simulations).

Envelope processing is strongly modulated by selective attention

Although the envelope STRFs seem to be generally less structured than those of the onsets, a comparison of the STRFs to the attended and the ignored source revealed a strong and well-defined effect of attention (Fig 3C, right column). The attended-ignored difference wave exhibits a negative peak at approximately 100 milliseconds, consistent with previous work [16], and an additional positive peak at approximately 200 milliseconds. In contrast with previous work, however, a robust effect of attention on the envelope representation starts almost as early as the very earliest responses. Thus, accounting for responses to onset features separately reveals that envelope processing is thoroughly influenced by attention. The reason for this might be that onsets often precede informative regions in the spectrogram, such as the spectral detail of voiced segments. The onsets might thus serve as cues to direct attention to specific regions in the spectrogram [28], which would allow early attentional processing of the envelope features.

Auditory cortex “un-masks” masked onsets

The analysis using the stream-based predictors suggests that the auditory cortex represents acoustic onsets in both speech sources separately, in addition to onsets in the acoustic mixture. This is particularly interesting because, while envelopes combine mostly in an additive manner, acoustic onsets may experience negative interference. This can be seen in the spectrograms in Fig 3A: The envelope mixture representation largely looks like a sum of the 2 stream envelope representations. In contrast, the onset mixture representation has several features that have a lower amplitude than the corresponding feature in the relevant source. The finding of a separate neural representation of the source onsets would thus suggest that the auditory cortex reconstructs source features that are masked in the mixture. Such reconstruction might be related to instances of cortical filling-in, in which cortical representations show evidence of filling in missing information to repair degraded, or even entirely absent, input signals [34–36]. The latency difference between mixture and source onsets might then reflect a small additional processing cost for the recovery of underlying features that are not directly accessible in the sensory input.

However, the specifics of what we call mixture and source features depend to some degree on the model of acoustic representations, i.e., the gammatone and edge detection models used here. Specifically, source features that are here masked in the mixture might be considered overt in a different acoustic model. It is unlikely that all our source features are, in reality, overt, because then our mixture representation should not be able to predict any brain responses beyond the acoustic sources. However, the apparent neural representations of stream-specific onsets could reflect a secondary set of features to which the mixture is transparent. An example could be a secondary stage of onset extraction based on pitch; the delay in responses to source-specific onsets might then simply reflect the latency difference of spectral and pitch-based onset detection.

Although these 2 possibilities could both explain the results described so far, they make different predictions regarding responses to masked onsets. A passive mechanism, based on features to which the mixture is transparent, should be unaffected by whether the features are masked in the gammatone representation, because the masking does not actually apply to those features. Such responses should thus be exhaustively explained by the stream-based model described in Fig 3. On the other hand, an active mechanism that specifically processes masked onsets might generate an additional response modulation for masked onsets. To test for such a modulation, we subdivided the stream-based onset predictors to allow for different responses to overt and masked onsets. The new predictors were implemented as element-wise operations on the onset spectrograms (Fig 4A). Specifically, for each speech source, the new “masked onsets” predictor models the degree to which an onset in the source is attenuated (masked) in the mixture, i.e., the amount by which a segregated speech source exceeds the (physically presented) resultant acoustic mixture, or zero when it does not: max(source − mixture, 0). The new “overt onsets” predictor models all other time-frequency points, where an onset in the source is also seen as a comparable onset in the mixture: element-wise min(mixture, source). Note that with this definition the sum of overt and masked onsets exactly equals the original speech source onset representation, i.e., the new predictors model the same onsets but allow responses to differ depending on whether an onset is masked or not. Replacing the 2 speech-source-based onset predictors with the 4 overt/masked onset predictors significantly improves the model fit (tmax = 6.81, p < 0.001), suggesting that cortical responses indeed distinguish between overt and masked onsets. Each of the 4 new predictors individually contributes to the MEG responses, although masked onsets do so more robustly (attended: tmax = 8.42, p < 0.001; ignored: tmax = 5.23, p < 0.001) than overt onsets (attended: tmax = 3.34, p = 0.027; ignored: tmax = 3.82, p = 0.016; Fig 4B); this difference could be due to overt source onsets being more similar to the mixture onsets predictor. Critically, the significant effect for masked onsets in the ignored source confirms that the auditory cortex recovers masked onsets even when they occur in the ignored source.

Fig 4. Responses to overt and masked onsets.


(A) Spectrograms (note that in this figure, the onset representations are placed below the envelope representations, to aid visual comparison of the different onset representations) were transformed using element-wise operations to distinguish between overt onsets, i.e., onsets in a source that are apparent in the mixture, and masked onsets, i.e., onsets in a source that are masked in the presence of the other source. Two examples are marked by rectangles: The light blue rectangle marks a region with an overt (attended) onset, i.e., an onset in the attended source that also corresponds to an onset in the mixture. The dark blue rectangle marks a masked (attended) onset, i.e., an onset in the attended source which is not apparent in the mixture. (B) All predictors significantly improve the cross-validated model fit (note that improvements were statistically tested with a test sensitive to spatial variation, whereas these plots show single-subject ROI average fits). (C) The corresponding overt/masked STRFs exhibit the previously described positive–negative 2-peaked structure. The first, positive peak is dominated by a representation of the mixture but also contains segregated features of the 2 talkers. For overt onsets, only the second, negative peak is modulated by attention. For masked onsets, even the first peak exhibits a small degree of attentional modulation. (D) Responses to masked onsets are consistently delayed compared with responses to overt onsets. Details are analogous to Fig 3D, except that the time window for finding peaks was extended to 20–250 milliseconds to account for the longer latency of masked onset response functions. (E) Direct comparison of the frequency-averaged onset TRFs highlights the amplitude differences between the peaks. For overt onsets, the negative deflection due to selective attention starts decreasing the response magnitude even near the maximum of the first, positive peak. For masked onsets, the early peak reflecting attended onsets is increased despite the subsequent enhanced negative peak. Results for envelope predictors are omitted from this figure because they are practically indistinguishable from those in Fig 3. Data are available in S3 Data. LH, left hemisphere; RH, right hemisphere; ROI, region of interest; STRF, spectrotemporal response function; TRF, temporal response function.

Masked onsets are processed with a delay and an early effect of attention

Model comparison thus indicates that the neural representation of masked onsets is significantly different from that of overt onsets. The analysis of STRFs suggests that this is for at least 2 reasons (Fig 4C–4E): response latency differences and a difference in the effect of selective attention.

First, responses to masked onsets are systematically delayed compared with overt onsets (as can be seen in Fig 4D). Specifically, this is true for the early, positive peak (mixture: 72 milliseconds), both for the attended speaker (overt: 72 milliseconds, masked: 91 milliseconds, t25 = 2.85, p = 0.009) and the ignored speaker (overt: 83 milliseconds, masked: 97 milliseconds, t25 = 6.11, p < 0.001). It is also the case for the later, negative peak (mixture: 138 milliseconds), which reflects only the attended speaker (overt: 133 milliseconds, masked: 182 milliseconds, t25 = 4.45, p < 0.001). Thus, at each major peak, representations of masked onsets lag behind representations of overt onsets by at least 15 milliseconds.

Second, for overt onsets, the early representations at the positive peak appear to be independent of the target of selective attention (Fig 4C, top right). In contrast, for masked onsets, even these early representations are enhanced by attention (Fig 4C, bottom right). This difference is confirmed in a stream (attended, ignored) by masking (overt, masked) ANOVA on peak amplitudes with a significant interaction (F(1,25) = 24.45, p < 0.001). For overt onsets, Fig 4E might suggest that the early peak is actually enhanced for the ignored speaker; however, this difference can be explained by the early onset of the second, negative response to attended onsets, which overlaps with the earlier peak. This observation makes the early effect of attention for masked onsets all the more impressive, because the early peak is larger despite the onset of the subsequent, negative peak (note the steeper slope between positive and negative peak for attended masked onsets). Also note that we here interpreted the timing of the effect of attention relative to the peak structure of the TRFs; in terms of absolute latency, the onset of the effect of attention is actually more similar between masked and overt onsets (see Fig 4E).

Delayed response to masked onsets

Previous research has found that the latency of responses to speech increases with increasing levels of stationary noise [37,38] or dynamic background speech (Fig 3 in [21]). Our results indicate that, for continuous speech, this is not simply a uniform delay but that the delay varies dynamically for each acoustic element based on whether this element is overt or locally masked by the acoustic background. This implies that the stream is not processed as a homogeneous entity of constant signal-to-noise ratio (SNR) but that the acoustic elements related to the onsets constitute separate auditory objects, with processing time increasing dynamically as features are more obscured acoustically. Notably, the same applies to the ignored speech stream, suggesting that acoustic elements from both speakers are initially processed as auditory objects.

The effect of SNR on response amplitude and latency is well established and clearly related to the results here. We consider it unlikely that SNR can itself play the role of a causal mechanistic explanation, because the measure of SNR presupposes a signal and noise that have already been segregated. Consequently, SNR is not a property of features in the acoustic input signal. Acoustic features only come to have an SNR after they are designated as acoustic objects and segregated against an acoustic background, such that their intensity can be compared with that of the residual background signal. This is illustrated in our paradigm in that the same acoustic onset can be a signal for one process (when detecting onsets in the mixture) and part of the noise for another (when attending to the other speaker). Rather than invoking SNR itself as an explanatory feature, we thus interpret the delay as evidence for a feature detection mechanism that requires additional processing time when the feature in question is degraded in the input—although leaving open the specific mechanism by which this happens. The addition of stationary background noise to simple acoustic features is associated with increased response latencies to those features as early as the brainstem response wave-V [39]. This observed shift in latency is in the submillisecond range and may have a mechanistic explanation in terms of different populations of auditory nerve fibers: Background noise saturates high spontaneous rate fibers, and the response is now dominated by somewhat slower, low spontaneous rate fibers [40]. In cortical responses to simple stimuli, like tones, much larger delays are observed in the presence of static noise, in the order of tens of milliseconds [41].

Latency shifts due to absolute signal intensity [42] might be additive with shifts due to noise [43]. Such a nonlinear increase in response latency with intensity might be a confounding factor in our analysis, which is based on linear methods: Compared with overt onsets, masked onsets in the ignored talker should generally correspond to weaker onsets in the mixture. Splitting overt and masked onsets might thus improve the model fit because it allows modeling different response latencies for different intensity levels of onsets in the mixture, rather than reflecting a true response to the background speech. In order to control for this possibility, we compared the model fit of the background-aware model with a model allowing for 3 different intensity levels of onsets in the mixture (and without an explicit representation of onsets in the ignored speaker). The background-aware model outperformed the level-aware model (tmax = 9.21, p < 0.001), suggesting that the present results are not explained by this level-based nonlinear response to the mixture. Furthermore, a background-unaware nonlinearity would not explain the difference in the effect of attention between overt and masked onsets. Together, this suggests that the observed delay is related to recovering acoustic source information, rather than a level-based nonlinearity in the response to the mixture.

It is also worth noting that the pattern observed in our results diverges from the most commonly described pattern of SNR effects. Typically, background noise causes an amplitude decrease along with the latency increase [38]: Although the latency shift observed here conforms to the general pattern, the amplitude of responses to masked onsets is not generally reduced. Even more importantly, selective attention affects the delayed responses more than the undelayed, suggesting that the delay is not a simple effect of variable SNR but is instead linked to attentive processing. A study of single units in primary auditory cortex found that neurons with delayed, noise-robust responses exhibited response properties suggestive of network effects [44]. This is consistent with the interaction of selective attention and delay found here on the early peak, because the delayed responses to masked onsets also exhibit more evidence of goal-driven processing than the corresponding responses to overt onsets.

In sum, masking causes latency increases at different stages of the auditory system. These latency shifts increase at successive stages of the ascending auditory pathway, as does the preponderance of noise-robust response properties [45]. It is likely that different levels of the auditory system employ different strategies to recover auditory signals of interest and do so at different time scales. Together with these considerations, our results suggest that the auditory cortex actively recovers masked speech features, and not only of the attended, but also of the ignored speech source.

Early effect of selective attention

Besides the shift in latency, response functions to overt and masked onsets differed in a more fundamental way: While the early, positive response peak to overt onsets did not differentiate between attended and ignored onsets, the early peak to masked onsets contained significantly larger representations of attended onsets (see Fig 4C). Thus, not only do early auditory cortical responses represent masked onsets, but these representations are substantively affected by whether the onset belongs to the attended or the ignored source. This distinction could have several causes. In the extreme, it could indicate that the 2 streams are completely segregated and represented as 2 distinct auditory objects. However, it might also be due to a weighting of features based on their likelihood of belonging to the attended source. This could be achieved, for example, through modulation of excitability based on spectrotemporal prediction of the attended speech signal [46]. Thus, onsets that are more likely to belong to the attended source might be represented more strongly, without yet being ascribed to one of the sources exclusively.

One discrepancy in previous studies using extra- and intracranial recordings is that the former were unable to detect any early effects of selective attention [7,16], whereas the latter showed a small but consistent enhancement of feature representations associated with the attended acoustic source signal [8]. Furthermore, an early effect of attention would also be expected based on animal models that show task-dependent modulations of A1 responses [47,48]. Our results show that this discrepancy may depend, in part, on which acoustic features are analyzed: While overt acoustic onsets were not associated with an effect of selective attention, masked onsets were.

Overall, the early difference between the attended and ignored source suggests that acoustic information from the ignored source is represented to a lesser degree than information from the attended source. This is consistent with evidence from psychophysics suggesting that auditory representations of background speech are not as fully elaborated as those of the attended foreground [49]. More generally, it is consistent with results that suggest an influence of attention early on in auditory stream formation [50].

Stages of speech segregation through selective attention

Regardless of whether listening to a single talker or 2 concurrent talkers, response functions to acoustic onsets are characterized by a prominent positive–negative 2-peaked structure. Brain responses to 2 concurrent talkers allow separating these response functions into components related to different representations and thus reveal different processing stages. Fig 5 presents a plausible model incorporating these new findings. Early on, the response is dominated by a representation of the acoustic mixture, with a preliminary segregation, possibly implemented through spectral filters [8]. This is followed by restoration of speech features that are masked in the mixture, regardless of speaker, but with a small effect of selective attention, suggesting a more active mechanism. Finally, later responses are dominated by selective attention, suggesting a clearly separated representation of the attended speaker as would be expected in successful streaming.

Fig 5. Model of onset-based stream segregation.


A model of cortical processing stages compatible with the results reported here. Left: The auditory scene, with additive mixture of the waveforms from the attended and the ignored speakers (red and blue, respectively). Right: Illustration of cortical representations at different processing stages. Passive filtering: At an early stage, onsets are extracted from the acoustic mixture and representations are partially segregated, possibly based on frequency. This stage corresponds to the early positive peak in onset TRFs. Active Restoration: A subsequent stage also includes representations of onsets in the underlying speech sources that are masked in the mixture, corresponding to the first peak in TRFs to masked onsets. At this stage, a small effect of attention suggests a preliminary selection of onsets with a larger likelihood of belonging to the attended speaker. Streaming: Finally, at a third stage, the response to onsets from the ignored speaker is suppressed, suggesting that now the 2 sources are clearly segregated (see also [8]). This stage corresponds to the second, negative peak, which is present in TRFs to mixture and attended onsets but not to ignored onsets. TRF, temporal response function.

An open question concerns how the overt and masked feature representations are temporally integrated. Representations of masked onsets were consistently delayed compared with those of overt onsets by approximately 20 milliseconds (see Fig 4D). This latency difference entails that upstream speech processing mechanisms may receive different packages of information about the attended speech source with some temporal desynchronization. Although this might imply a need for a higher order corrective mechanism, it is also possible that upstream mechanisms are tolerant to this small temporal distortion. A misalignment of 20 milliseconds is small compared with the normal temporal variability encountered in speech (although phonetic contrasts do exist where a distortion of a few tens of milliseconds would be relevant). Indeed, in audio-visual speech perception, temporal misalignment up to 100 milliseconds between auditory and visual input can be tolerated [51].

Broadly, the new results are consistent with previous findings that early cortical responses are dominated by the acoustic mixture, rather than receiving presegregated representations of the individual streams [7,8]. However, the new results do show evidence of an earlier, partial segregation, in the form of representations of acoustic onsets, which are segregated from the mixture, though not grouped into separate streams. Because these early representations do not strictly distinguish between the attended and the ignored speaker, they likely play the role of an intermediate step in extracting the information needed to selectively attend to one of the 2 speakers. Overall, these results are highly consistent with object-based models of auditory attention, in which perception depends on an interplay between bottom-up analysis and formation of local structure, and top-down selection and global grouping, or streaming [14,52].

Implications for processing of “ignored” acoustic sources

The interference in speech perception from a second talker can be very different from the interference caused by nonspeech sounds. For instance, music is cortically segregated from speech even when both signals are unattended, consistent with a more automatic segregation, possibly due to distinctive differences in acoustic signal properties [22]. In contrast, at moderate SNRs, a second talker causes much more interference with speech perception than a comparable nonspeech masker. Interestingly, this interference manifests not just in the inability to hear attended words but in intrusions of words from the ignored talker [53]. The latter fact, in particular, has been interpreted as evidence that ignored speech might be segregated and processed to a relatively high level. On the other hand, listeners seem to be unable to process words in more than 1 speech source at a time, even when the sources are spatially separated [54]. Furthermore, demonstrations of lexical processing of ignored speech are rare and usually associated with specific perceptual conditions such as dichotic presentation [55]. Consistent with this, recent EEG/MEG evidence suggests that unattended speech is not processed in a time-locked fashion at the lexical [12] or semantic [13] level. The results described here, showing systematic recovery of acoustic features from the ignored speech source, suggest a potential explanation for the increased interference from speech compared with other maskers. Representing onsets in 2 speech sources could be expected to increase cognitive load compared with detecting onsets of a single source in stationary noise. These representations of ignored speech might also act as bottom-up cues and cause the tendency for intrusions from the ignored talker. They might even explain why a salient and overlearned word, such as one’s own name [56], might sometimes capture attention, which could happen based on acoustic rather than lexical analysis [57]. Finally, at very low SNRs, the behavioral pattern can invert, and a background talker can be associated with better performance than stationary noise maskers [53]. In such conditions, there might be a benefit of being able to segregate the ignored speech source and use this information strategically [21].

An open question is how the auditory system deals with the presence of multiple background speakers. When there are multiple background speakers, does the auditory system attempt to unmask the different speakers all separately, or are they represented as a unified background [7]? An attempt to isolate speech features even from multiple background talkers might contribute to the overly detrimental effect of babble noise with a small number of talkers [58].

Limitations

Many of the conclusions drawn here rest on the suitability of the auditory model used to predict neural responses. The gammatone and onset models are designed to reflect generalized cochlear and neural processing strategies and were chosen as more physiologically realistic models than engineering-inspired alternatives such as envelope and half-wave rectified derivative models. Yet they might also be missing critical aspects of truly physiological representations. An important consideration for future research is thus to extend the class of models of lower level auditory processing and how they relate to the large-scale neural population activity as measured by EEG/MEG.

In addition, our model incorporates the masking of acoustic features as a binary distinction, by splitting features into overt and masked features. In reality, features can be masked to varying degrees. In our model, intermediate degrees of maskedness would result in intermediate values in both predictors and thus in a linear superposition of 2 responses. A model that takes the degree of maskedness into account as a continuous variable would thus likely provide a better fit to the neural data.

Conclusions

How do listeners succeed in selectively listening to one of 2 concurrent talkers? Our results suggest that active recovery of acoustic onsets plays a critical role. Early responses in the auditory cortex represent not only overt acoustic onsets but also reconstruct acoustic onsets in the speech sources that are masked in the mixture, even if they originate from the ignored speech source. This suggests that early responses, in addition to representing a spectrotemporal decomposition of the mixture, actively reconstruct acoustic features that could originate from either speech source. Consequently, these early responses make comparatively complex acoustic features from both speech sources available for downstream processes, thus enabling both selective attention and bottom-up effects of salience and interference.

Materials and methods

Participants

The data analyzed here have been previously used in an unrelated analysis [12] and can be retrieved from the Digital Repository at the University of Maryland (see Data Availability). MEG responses were recorded from 28 native speakers of English, recruited by media advertisements from the Baltimore area. Participants with medical, psychiatric, or neurological illness, head injury, or substance dependence or abuse were excluded. Data from 2 participants were excluded, one due to corrupted localizer measurements and one due to excessive magnetic artifacts associated with dental work, resulting in a final sample of 18 male and 8 female participants with mean age 45.2 years (range 22–61).

Ethics statement

All participants provided written informed consent in accordance with the University of Maryland Baltimore Institutional Review Board and were paid for their participation.

Stimuli

Two chapters were selected from an audiobook recording of A Child’s History of England by Charles Dickens, one chapter read by a male and one by a female speaker (https://librivox.org/a-childs-history-of-england-by-charles-dickens/, chapters 3 and 8, respectively). Four 1-minute-long segments were extracted from each chapter (referred to as male-1 through male-4 and female-1 through female-4). Pauses longer than 300 milliseconds were shortened to an interval randomly chosen between 250 and 300 milliseconds, and loudness was matched perceptually (such that either speaker was deemed equally easy to attend to). Two-talker stimuli were generated by additively combining 2 segments, one from each speaker, with an initial 1-second period containing only the to-be-attended speaker (mix-1 through mix-4 were constructed by mixing male-1 with female-1, and so on through male-4 with female-4).

Procedure

During MEG data acquisition, participants lay supine and were instructed to keep their eyes closed to minimize ocular artifacts and head movement. Stimuli were delivered through foam pad earphones inserted into the ear canal at a comfortably loud listening level, approximately 70 dB SPL.

Participants listened 4 times to mix-1 while attending to one speaker and ignoring the other (which speaker they attended to was counterbalanced across participants), then 4 times to mix-2 while attending to the other speaker. After each segment, participants answered a question relating to the content of the attended stimulus. Then, the 4 segments just heard were all presented once each, as single talkers. The same procedure was repeated for stimulus segments 3 and 4.

Data acquisition and preprocessing

Brain responses were recorded with a 157-channel axial gradiometer whole-head MEG system (KIT, Kanazawa, Japan) inside a magnetically shielded room (Vacuumschmelze GmbH & Co. KG, Hanau, Germany) at the University of Maryland, College Park. Sensors (15.5-mm diameter) are uniformly distributed inside a liquid-He dewar, spaced approximately 25 mm apart, and configured as first-order axial gradiometers with 50-mm separation and sensitivity better than 5 fT/√Hz in the white-noise region (> 1 kHz). Data were recorded with an online 200-Hz low-pass filter and a 60-Hz notch filter at a sampling rate of 1 kHz.

Recordings were preprocessed using mne-python (https://github.com/mne-tools/mne-python) [59]. Flat channels were automatically detected and excluded. Extraneous artifacts were removed with temporal signal space separation [60]. Data were filtered between 1 and 40 Hz with a zero-phase FIR filter (mne-python 0.15 default settings). Extended infomax independent component analysis [61] was then used to remove ocular and cardiac artifacts. Responses time-locked to the onset of the speech stimuli were extracted and resampled to 100 Hz. For responses to the 2-talker mixture, the first second of data, in which only the to-be attended talker was heard, was discarded.
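For orientation, a minimal sketch of this pipeline with mne-python follows; the file name, tSSS buffer length, and excluded ICA components are hypothetical placeholders, and current mne-python versions may differ from the 0.15 defaults used originally:

```python
import mne

# Hypothetical KIT recording of one stimulus block
raw = mne.io.read_raw_kit("subject01_speech.sqd", preload=True)

# Temporal signal space separation to remove extraneous artifacts [60]
raw = mne.preprocessing.maxwell_filter(raw, st_duration=10.0)

# Zero-phase FIR band-pass filter, 1-40 Hz
raw.filter(1.0, 40.0, method="fir", phase="zero")

# Extended infomax ICA for ocular and cardiac artifact removal [61]
ica = mne.preprocessing.ICA(method="infomax", fit_params=dict(extended=True))
ica.fit(raw)
ica.exclude = [0, 1]  # components identified as EOG/ECG by inspection
ica.apply(raw)

# Resample before extracting stimulus-locked responses
raw.resample(100)
```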

Five marker coils attached to participants’ heads served to localize the head position with respect to the MEG sensors. Two measurements, one at the beginning and one at the end of the recording, were averaged. The FreeSurfer (https://surfer.nmr.mgh.harvard.edu) [62] “fsaverage” template brain was coregistered to each participant’s digitized head shape (Polhemus 3SPACE FASTRAK) using rotation, translation, and uniform scaling. A source space was generated using 4-fold icosahedral subdivision of the white matter surface, with source dipoles oriented perpendicularly to the cortical surface. Minimum ℓ2 norm current estimates [63,64] were computed for all data. Initial analysis was performed on the whole brain as identified by the FreeSurfer “cortex” label. Subsequent analyses were restricted to sources in the STG and Heschl’s gyrus as identified in the “aparc” parcellation [65].

Predictor variables

Predictor variables were based on gammatone spectrograms sampled at 256 frequencies, ranging from 20 to 5,000 Hz in ERB space [66], resampled to 1 kHz and scaled with exponent 0.6 [67].

Acoustic onset representations were computed by applying an auditory edge detection model [19] independently to each frequency band of the spectrogram. The model was implemented with a delay layer with 10 delays ranging from τ2 = 3 to 5 milliseconds, a saturation scaling factor of C = 30, and a receptive field based on the derivative of a Gaussian window with SD = 2 input delay units. Negative values in the resulting onset spectrogram were set to 0. We initially explored using higher levels of saturation (smaller values for C) but found that the resulting stimulus representations emphasized nonspeech features during pauses more than features relevant to speech processing, because responses quickly saturated during ongoing speech. We chose the given, narrow range of τ2 to keep the model flexible: a wider range of τ2 would only lead to smoother representations, and smoothing can also be achieved by the TRF model fitted to the neural data.
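The following simplified sketch illustrates the core computation of this onset model (saturation, a bank of delayed copies of the input, a derivative-of-Gaussian receptive field across delays, and half-wave rectification). The tanh saturation stage and the kernel normalization are illustrative stand-ins, not the exact published implementation [19]; the gammatone envelope spectrogram is assumed as input.

```python
import numpy as np

def onset_spectrogram(env, fs=1000, c=30.0, delays_ms=np.linspace(3, 5, 10)):
    """Simplified auditory edge detection (illustrative sketch).

    env: gammatone envelope spectrogram, shape (n_bands, n_times), at fs Hz.
    """
    sat = np.tanh(env / c)  # saturating nonlinearity (scaling factor C)
    delays = np.round(delays_ms * fs / 1000).astype(int)
    # "edge detector" receptive field: derivative of a Gaussian window
    # across the delay axis (SD = 2 input delay units)
    i = np.arange(len(delays), dtype=float)
    gauss = np.exp(-((i - i.mean()) ** 2) / (2 * 2.0**2))
    weights = np.gradient(gauss)  # positive for short, negative for long delays
    onsets = np.zeros_like(sat)
    for w, d in zip(weights, delays):
        delayed = np.zeros_like(sat)
        delayed[:, d:] = sat[:, : sat.shape[1] - d]  # delayed copy of the input
        onsets += w * delayed
    return np.maximum(onsets, 0)  # half-wave rectification: onsets only
```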

Onset representations of the 2 speakers were split into masked and overt onsets using element-wise operations on the onset spectrograms. Masked onsets were defined by the extent to which onsets were larger in the source than in the mixture:

$o_{\text{masked}} = \max(o_{\text{source}} - o_{\text{mixture}},\ 0)$

Overt onsets were onsets that were not masked, i.e., speech source onsets that were also visible in the mixture:

$o_{\text{overt}} = o_{\text{source}} - o_{\text{masked}} = \min(o_{\text{source}},\ o_{\text{mixture}})$

Using this procedure, approximately 67% of total onset magnitudes (ℓ1 norm) was assigned to overt onsets and 33% to masked onsets.
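In code, this split amounts to a pair of element-wise operations; a minimal numpy sketch (with o_source and o_mixture as onset spectrograms of matching shape, e.g., from the onset model above) is:

```python
import numpy as np

# o_source, o_mixture: onset spectrograms, shape (n_bands, n_times)
o_masked = np.maximum(o_source - o_mixture, 0)  # attenuated (masked) in the mixture
o_overt = np.minimum(o_source, o_mixture)       # also apparent in the mixture
# by construction, the two components sum exactly to the source onsets
assert np.allclose(o_masked + o_overt, o_source)
```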

For model estimation, envelope and onset spectrograms were then binned into 8 frequency bands equally spaced in ERB space (omitting frequencies below 100 Hz because the female speaker had little power below that frequency) and resampled to match the MEG data. As part of the reverse correlation procedure, each predictor time series (i.e., each frequency bin) was scaled by its ℓ1 norm over time.

For testing an intensity-based nonlinear response (see “Delayed response to masked onsets”), the onset predictor was split into 3 separate predictors, one for each of 3 intensity levels. For each of the 8 frequency bins, individual onsets were identified as contiguous nonzero elements; each onset was assigned an intensity based on the sum of its elements, and the onsets were then assigned to one of 3 predictors based on intensity tertiles (calculated separately for each band). This resulted in three 8-band onset spectrograms modeling low-, medium-, and high-intensity onsets.
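A sketch of this split for a single frequency band, using scipy's connected-component labeling to find contiguous nonzero runs (function and variable names are illustrative):

```python
import numpy as np
from scipy import ndimage

def split_by_intensity(onsets):
    """Split one onset band into low/mid/high-intensity predictors.

    onsets: 1-D onset time series for a single frequency band.
    Returns an array of shape (3, n_times): low, mid, high intensity.
    """
    labels, n = ndimage.label(onsets > 0)  # contiguous nonzero runs
    sums = ndimage.sum(onsets, labels, index=np.arange(1, n + 1))
    cuts = np.quantile(sums, [1 / 3, 2 / 3])  # intensity tertiles
    out = np.zeros((3, len(onsets)))
    for i, s in enumerate(sums, start=1):
        level = np.searchsorted(cuts, s)  # 0 = low, 1 = mid, 2 = high
        out[level, labels == i] = onsets[labels == i]
    return out
```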

Reverse correlation

STRFs were computed independently for each virtual current source [see 68]. The neural response at time t, yt, was predicted from the sum of N predictor variables xn convolved with a corresponding response function hn of length T:

$\hat{y}_t = \sum_{n=1}^{N} \sum_{\tau=1}^{T} h_{n,\tau}\, x_{n,t-\tau}$

STRFs were generated from a basis of 50-millisecond-wide Hamming windows and were estimated using an iterative coordinate descent algorithm [69] to minimize the ℓ1 error.

For model evaluation, left-out data were predicted using 4-fold cross-validation. Folds were created by assigning successive trials to the different folds in order (1, 2, 3, 4, 1, 2, …). In an outer loop, the responses in each fold were predicted with STRFs estimated from the remaining 3 folds. These predictions, combined, served to calculate the correlation between measured and predicted responses used for model tests. In an inner loop, each of the 3 estimation folds was, in turn, used as validation set for STRFs trained on the 2 remaining folds. STRFs were iteratively improved based on the maximum error reduction in the training set (the steepest coordinate descent) and validated on the validation set. Whenever a predictor time series (i.e., one spectrogram bin) would have caused an increase in the error in the validation set, the kernel for this predictor was frozen, continuing until all predictors were frozen (see [70] for further details). The 3 STRFs from the inner loop were averaged to predict responses in the left-out testing data.
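This estimator is implemented in the eelbrain toolbox associated with [70]; the following call sketches the setup described above. Variable names are placeholders, and parameter spellings may vary across eelbrain versions:

```python
from eelbrain import boosting

# y: source-localized MEG response (NDVar); onsets, envelope: 8-band predictors
res = boosting(
    y, [onsets, envelope], tstart=0.0, tstop=0.5,
    basis=0.050, basis_window='hamming',  # basis of 50-ms Hamming windows
    error='l1',                           # minimize the l1 error
    partitions=4,                         # 4-fold cross-validation
)
strf_onsets, strf_envelope = res.h  # one STRF per predictor
r = res.r  # correlation between measured and predicted responses
```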

Model tests

Each spectrogram comprising 8 time series (frequency bins) was treated as an individual predictor. Speech in quiet was modeled using the (envelope) spectrogram and acoustic onsets:

$\mathrm{MEG} \sim o + e$

where o = onsets and e = envelope. Models were estimated with STRFs with T = [0,…,500) milliseconds. Model quality was quantified through the Pearson correlation r between actual and predicted responses. The peak of the averaged r-map was 0.143 in the single-speaker condition and 0.158 in the 2-talker condition (0.162 when the model incorporated masking). Because single-trial MEG responses contain a relatively high proportion of spontaneous and nonneural (and hence unexplainable) signals, the analysis focused on differences between models that are reliable across participants, rather than on absolute r-values. Each model was thus associated with a map of Fisher z-scored r-values, smoothed with a Gaussian kernel (SD = 5 mm). In order to test the predictive power of each predictor, a corresponding null model was generated by removing that predictor. For each predictor, the model quality of the full model was compared with the model quality of the corresponding null model using a mass-univariate related-measures t-test with threshold-free cluster enhancement [71] and a null distribution based on 10,000 permutations. This procedure results in a map of p-values across the tested area, corrected for multiple comparisons based on the nonparametric null distribution (see [70] for further details). For each model comparison, we report the smallest p-value across the tested area, as an indicator of whether the given model significantly explains any neural data. In addition, for effect size comparison, we report tmax for each comparison, the largest t-value in the significant (p ≤ 0.05) area. For single-talker speech (Fig 2), this test included the whole cortex (as labeled by FreeSurfer). For subsequent tests of the 2-talker condition, the same tests were used, but the test area was restricted to the auditory ROI comprising the STG and transverse temporal gyrus in each hemisphere.
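The comparison metric itself is simple; a toy sketch of the Fisher z-scored correlation difference computed per source point and subject (all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(5000)                 # measured response (toy)
y_full = y + rng.standard_normal(5000)        # prediction of the full model
y_null = y + 1.5 * rng.standard_normal(5000)  # prediction without one predictor

z_full = np.arctanh(np.corrcoef(y, y_full)[0, 1])  # Fisher z-scored r
z_null = np.arctanh(np.corrcoef(y, y_null)[0, 1])
delta_z = z_full - z_null  # one value per source point and subject;
                           # these maps enter the mass-univariate t-test
```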

Initially, responses to speech in noise (Fig 3) were predicted from:

$\mathrm{MEG} \sim o_{\mathrm{mix}} + o_{\mathrm{att}} + o_{\mathrm{ign}} + e_{\mathrm{mix}} + e_{\mathrm{att}} + e_{\mathrm{ign}}$

where mix = mixture, att = attended, and ign = ignored. Masked onsets (Fig 4) were analyzed with the following:

$\mathrm{MEG} \sim o_{\mathrm{mix}} + o_{\mathrm{att,overt}} + o_{\mathrm{att,masked}} + o_{\mathrm{ign,overt}} + o_{\mathrm{ign,masked}} + e_{\mathrm{mix}} + e_{\mathrm{att}} + e_{\mathrm{ign}}$

In order to test for a level-dependent nonlinear response to onsets in the mixture, this model was compared with the following:

$\mathrm{MEG} \sim o_{\mathrm{mix\text{-}low}} + o_{\mathrm{mix\text{-}mid}} + o_{\mathrm{mix\text{-}high}} + o_{\mathrm{att,overt}} + o_{\mathrm{att,masked}} + e_{\mathrm{mix}} + e_{\mathrm{att}} + e_{\mathrm{ign}}$

where mix-low, -mid, and -high = mixture low, mid, and high intensity. This model has the same number of predictors but assumes no awareness of onsets in the ignored speaker.

STRF analysis

To evaluate STRFs, the corresponding model was refit with T = [−100,…,500) milliseconds to include an estimate of baseline activity (because of occasional edge artifacts, STRFs are displayed between −50 and 450 milliseconds). Using the same 4-fold split of the data as for the model fits, 4 STRF estimates were averaged, each using 1 fold of the data for validation and the remaining 3 for training. Because predictors and responses were ℓ1 normalized for the reverse correlation, and STRFs were analyzed in this normalized space, STRFs provide an SNR-like measure of response strength at different latencies for each subject.

Auditory STRFs were computed for each subject and hemisphere as a weighted sum of STRFs in the auditory ROI encompassing the STG and transverse temporal (Heschl’s) gyrus. Weights were computed separately for each subject and hemisphere. First, each source point was assigned a vector with direction orthogonal to the cortical surface and length equal to the total TRF power for responses to clean speech (sum of squares over time, frequency, and predictor). The ROI direction was then determined as the first principal component of these vectors, with the sign adjusted to be positive on the inferior–superior axis. A weight was then assigned to each source as the dot product of this direction with the source’s direction, and these weights were normalized within the ROI.
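In code, this weighting scheme might look as follows. This is a sketch: we compute the first principal component via an uncentered SVD and assume the third coordinate is the inferior–superior axis, both of which are our implementation choices.

```python
import numpy as np

def roi_weights(normals, trf_power):
    """ROI summary weights from source orientations and TRF power (sketch).

    normals:   (n_sources, 3) unit vectors orthogonal to the cortical surface
    trf_power: (n_sources,) sum of squared TRF values per source
    """
    vecs = normals * trf_power[:, None]      # direction scaled by TRF power
    _, _, vt = np.linalg.svd(vecs, full_matrices=False)
    direction = vt[0]                        # first principal component
    if direction[2] < 0:                     # sign: positive on the
        direction = -direction               # inferior-superior axis
    w = vecs @ direction                     # dot product per source
    return w / np.abs(w).sum()               # normalize within the ROI
```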

In order to make STRFs more comparable across subjects, they were smoothed on the frequency axis with a Hamming window of width 7 bins. STRFs were statistically analyzed in the time range [0,…,450) milliseconds using mass-univariate t-tests and ANOVAs, with p-values calculated from null distributions based on the maximum statistic (t, F) in 10,000 permutations [72].
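For reference, a p-value based on the permutation distribution of the maximum statistic can be sketched as below; this is our illustration of the general approach in [72], not the Eelbrain implementation.

```python
import numpy as np

def max_stat_pvalues(t_map, null_t_maps):
    """Permutation p-values based on the maximum-|t| distribution.

    t_map:       observed t-values (any shape, e.g., time x frequency)
    null_t_maps: (n_permutations, ...) t-values from permuted data
    """
    # maximum |t| anywhere in the map, for each permutation
    null_max = np.abs(null_t_maps).reshape(len(null_t_maps), -1).max(axis=1)
    t = np.abs(t_map)
    # p-value per element: fraction of permutations reaching the observed |t|
    p = np.array([(null_max >= v).mean() for v in t.ravel()])
    return p.reshape(t.shape)
```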

For visualization and peak analysis, STRFs were upsampled to 500 Hz. Peak latencies were computed by first averaging auditory STRFs along the frequency axis and then finding the largest or smallest value in each subject’s TRF in a window of [20, 200) milliseconds for single-speaker and stream-based analysis (Figs 2 and 3) or [20, 250) milliseconds for the masked onset analysis (Fig 4). Reported peak latencies are always average latencies across subjects.
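The peak-latency measure reduces to a windowed argmax on the frequency-averaged TRF. A sketch (window and sampling defaults follow the text above; taking the absolute value covers both the largest and smallest deflections):

```python
import numpy as np

def peak_latency(strf, tmin=0.02, tmax=0.2, fs=500):
    """Latency (s) of the largest TRF deflection in a window (sketch).

    strf: (n_freqs, n_times) STRF sampled at `fs`, with time 0 at index 0.
    """
    trf = strf.mean(axis=0)            # average over the frequency axis
    lo, hi = int(tmin * fs), int(tmax * fs)
    i = np.argmax(np.abs(trf[lo:hi]))  # largest positive or negative value
    return (lo + i) / fs
```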

Supporting information

S1 Simulations. Simulations to assess TRF cross-contamination.

TRF, temporal response function.

(PDF)

S1 Fig. MEG responses to clean speech, envelope only.

Spectrotemporal response function to the envelope spectrogram, when estimated without considering onsets. All other details are analogous to Fig 2E. Data in S4 Data.

(PDF)

S1 Data. Data from Fig 2.

Model prediction accuracy maps and spectrotemporal response functions for plots shown in Fig 2. Data are stored as pickled Python/Eelbrain objects with corresponding meta-data.

(ZIP)

S2 Data. Data from Fig 3.

Details as for S1 Data.

(ZIP)

S3 Data. Data from Fig 4.

Details as for S1 Data.

(ZIP)

S4 Data. Data from S1 Fig.

Details as for S1 Data.

(ZIP)

S5 Data. Data from S1 Simulation.

Details as for S1 Data.

(ZIP)

Acknowledgments

We would like to thank Shihab Shamma for several fruitful discussions and Natalia Lapinskaya for her help in collecting data and for excellent technical support.

Abbreviations

EEG: electroencephalography

HG: Heschl’s gyrus

MEG: magnetoencephalography

ROI: region of interest

SNR: signal-to-noise ratio

STG: superior temporal gyrus

STRF: spectrotemporal response function

TRF: temporal response function

Data Availability

Preprocessed MEG recordings and stimuli are available from the Digital Repository at the University of Maryland. The MEG dataset is available at http://hdl.handle.net/1903/21109; additional files specific to this paper are available at http://hdl.handle.net/1903/26370. Subject-specific results are provided for each figure in supplementary data files.

Funding Statement

This work was supported by a National Institutes of Health grant R01-DC014085 (to JZS; https://www.nih.gov) and by a University of Maryland Seed Grant (to LEH and JZS; https://umd.edu). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bregman AS. Auditory scene analysis: the perceptual organization of sound. Cambridge, Mass: MIT Press; 1990.
2. Cherry EC. Some Experiments on the Recognition of Speech, with One and with Two Ears. J Acoust Soc Am. 1953;25:975–979. doi:10.1121/1.1907229
3. McDermott JH. The cocktail party problem. Curr Biol. 2009;19:R1024–R1027. doi:10.1016/j.cub.2009.09.005
4. Middlebrooks JC, Simon JZ, Popper AN, Fay RR, editors. The Auditory system at the cocktail party. Cham: Springer International Publishing; 2017.
5. Brungart DS, Simpson BD. The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal. J Acoust Soc Am. 2002;112:664–676. doi:10.1121/1.1490592
6. Kidd G, Mason CR, Swaminathan J, Roverud E, Clayton KK, Best V. Determining the energetic and informational components of speech-on-speech masking. J Acoust Soc Am. 2016;140:132–144. doi:10.1121/1.4954748
7. Puvvada KC, Simon JZ. Cortical Representations of Speech in a Multitalker Auditory Scene. J Neurosci. 2017;37:9189–9196. doi:10.1523/JNEUROSCI.0938-17.2017
8. O’Sullivan J, Herrero J, Smith E, Schevon C, McKhann GM, Sheth SA, et al. Hierarchical Encoding of Attended Auditory Objects in Multi-talker Speech Perception. Neuron. 2019;104:1195–1209. doi:10.1016/j.neuron.2019.09.007
9. Zion Golumbic EM, Ding N, Bickel S, Lakatos P, Schevon CA, McKhann GM, et al. Mechanisms Underlying Selective Neuronal Tracking of Attended Speech at a “Cocktail Party.” Neuron. 2013;77:980–991. doi:10.1016/j.neuron.2012.12.037
10. Broadbent DE. Perception and communication. London: Pergamon Press; 1958.
11. Lachter J, Forster KI, Ruthruff E. Forty-five years after Broadbent (1958): Still no identification without attention. Psychol Rev. 2004;111:880–913. doi:10.1037/0033-295X.111.4.880
12. Brodbeck C, Hong LE, Simon JZ. Rapid Transformation from Auditory to Linguistic Representations of Continuous Speech. Curr Biol. 2018;28:3976–3983.e5. doi:10.1016/j.cub.2018.10.042
13. Broderick MP, Anderson AJ, Liberto GMD, Crosse MJ, Lalor EC. Electrophysiological Correlates of Semantic Dissimilarity Reflect the Comprehension of Natural, Narrative Speech. Curr Biol. 2018;28:803–809.e3. doi:10.1016/j.cub.2018.01.080
14. Shinn-Cunningham BG. Object-based auditory and visual attention. Trends Cogn Sci. 2008;12:182–186. doi:10.1016/j.tics.2008.02.003
15. Lalor EC, Foxe JJ. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. Eur J Neurosci. 2010;31:189–193. doi:10.1111/j.1460-9568.2009.07055.x
16. Ding N, Simon JZ. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc Natl Acad Sci U S A. 2012;109:11854–11859. doi:10.1073/pnas.1205381109
17. Hamilton LS, Edwards E, Chang EF. A Spatial Map of Onset and Sustained Responses to Speech in the Human Superior Temporal Gyrus. Curr Biol. 2018;28:1860–1871.e4. doi:10.1016/j.cub.2018.04.033
18. Daube C, Ince RAA, Gross J. Simple Acoustic Features Can Explain Phoneme-Based Predictions of Cortical Responses to Speech. Curr Biol. 2019;29:1924–1937.e9. doi:10.1016/j.cub.2019.04.067
19. Fishbach A, Nelken I, Yeshurun Y. Auditory Edge Detection: A Neural Model for Physiological and Psychoacoustical Responses to Amplitude Transients. J Neurophysiol. 2001;85:2303–2323. doi:10.1152/jn.2001.85.6.2303
20. Zhou Y, Wang X. Cortical Processing of Dynamic Sound Envelope Transitions. J Neurosci. 2010;30:16741–16754. doi:10.1523/JNEUROSCI.2016-10.2010
21. Fiedler L, Wöstmann M, Herbst SK, Obleser J. Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions. NeuroImage. 2019;186:33–42. doi:10.1016/j.neuroimage.2018.10.057
22. Hausfeld L, Riecke L, Valente G, Formisano E. Cortical tracking of multiple streams outside the focus of attention in naturalistic auditory scenes. NeuroImage. 2018;181:617–626. doi:10.1016/j.neuroimage.2018.07.052
23. Petersen EB, Wöstmann M, Obleser J, Lunner T. Neural tracking of attended versus ignored speech is differentially affected by hearing loss. J Neurophysiol. 2017;117:18–27. doi:10.1152/jn.00527.2016
24. Fiedler L, Wöstmann M, Graversen C, Brandmeyer A, Lunner T, Obleser J. Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech. J Neural Eng. 2017;14:036020. doi:10.1088/1741-2552/aa66dd
25. Stilp CE, Kluender KR. Cochlea-scaled entropy, not consonants, vowels, or time, best predicts speech intelligibility. Proc Natl Acad Sci U S A. 2010;107:12387–12392. doi:10.1073/pnas.0913625107
26. Koning R, Wouters J. The potential of onset enhancement for increased speech intelligibility in auditory prostheses. J Acoust Soc Am. 2012;132:2569–2581. doi:10.1121/1.4748965
27. Elhilali M, Ma L, Micheyl C, Oxenham AJ, Shamma SA. Temporal Coherence in the Perceptual Organization and Cortical Representation of Auditory Scenes. Neuron. 2009;61:317–329. doi:10.1016/j.neuron.2008.12.005
28. Bregman AS, Ahad P, Kim J, Melnerich L. Resetting the pitch-analysis system: 1. Effects of rise times of tones in noise backgrounds or of harmonics in a complex tone. Percept Psychophys. 1994;56:155–162. doi:10.3758/bf03213894
29. Bregman AS, Ahad PA, Kim J. Resetting the pitch-analysis system: 2. Role of sudden onsets and offsets in the perception of individual components in a cluster of overlapping tones. J Acoust Soc Am. 1994;96:2694–2703. doi:10.1121/1.411277
30. Hukin RW, Darwin CJ. Comparison of the effect of onset asynchrony on auditory grouping in pitch matching and vowel identification. Percept Psychophys. 1995;57:191–196. doi:10.3758/bf03206505
31. Loftus GR, Masson MEJ. Using confidence intervals in within-subject designs. Psychon Bull Rev. 1994;1:476–490. doi:10.3758/BF03210951
32. Moerel M, De Martino F, Formisano E. Processing of Natural Sounds in Human Auditory Cortex: Tonotopy, Spectral Tuning, and Relation to Voice Sensitivity. J Neurosci. 2012;32:14205–14216. doi:10.1523/JNEUROSCI.1388-12.2012
33. Hullett PW, Hamilton LS, Mesgarani N, Schreiner CE, Chang EF. Human Superior Temporal Gyrus Organization of Spectrotemporal Modulation Tuning Derived from Speech Stimuli. J Neurosci. 2016;36:2014–2026. doi:10.1523/JNEUROSCI.1779-15.2016
34. Cervantes Constantino F, Simon JZ. Dynamic cortical representations of perceptual filling-in for missing acoustic rhythm. Sci Rep. 2017;7:17536. doi:10.1038/s41598-017-17063-0
35. Leonard MK, Baud MO, Sjerps MJ, Chang EF. Perceptual restoration of masked speech in human cortex. Nat Commun. 2016;7:13619. doi:10.1038/ncomms13619
36. Cervantes Constantino F, Simon JZ. Restoration and Efficiency of the Neural Processing of Continuous Speech Are Promoted by Prior Knowledge. Front Syst Neurosci. 2018;12:56. doi:10.3389/fnsys.2018.00056
37. Ding N, Simon JZ. Adaptive Temporal Encoding Leads to a Background-Insensitive Cortical Representation of Speech. J Neurosci. 2013;33:5728–5735. doi:10.1523/JNEUROSCI.5297-12.2013
38. Billings CJ, McMillan GP, Penman TM, Gille SM. Predicting Perception in Noise Using Cortical Auditory Evoked Potentials. J Assoc Res Otolaryngol. 2013;14:891–903. doi:10.1007/s10162-013-0415-y
39. Burkard RF, Sims D. A Comparison of the Effects of Broadband Masking Noise on the Auditory Brainstem Response in Young and Older Adults. Am J Audiol. 2002;11:13–22. doi:10.1044/1059-0889(2002/004)
40. Mehraei G, Hickox AE, Bharadwaj HM, Goldberg H, Verhulst S, Liberman MC, et al. Auditory Brainstem Response Latency in Noise as a Marker of Cochlear Synaptopathy. J Neurosci. 2016;36:3755–3764. doi:10.1523/JNEUROSCI.4460-15.2016
41. Billings CJ, Tremblay KL, Stecker GC, Tolin WM. Human evoked cortical activity to signal-to-noise ratio and absolute signal level. Hear Res. 2009;254:15–24. doi:10.1016/j.heares.2009.04.002
42. Drennan DP, Lalor EC. Cortical Tracking of Complex Sound Envelopes: Modeling the Changes in Response with Intensity. eNeuro. 2019;6:ENEURO.0082-19.2019. doi:10.1523/ENEURO.0082-19.2019
43. Teschner MJ, Seybold BA, Malone BJ, Hüning J, Schreiner CE. Effects of Signal-to-Noise Ratio on Auditory Cortical Frequency Processing. J Neurosci. 2016;36:2743–2756. doi:10.1523/JNEUROSCI.2079-15.2016
44. Malone BJ, Heiser MA, Beitel RE, Schreiner CE. Background noise exerts diverse effects on the cortical encoding of foreground sounds. J Neurophysiol. 2017;118:1034–1054. doi:10.1152/jn.00152.2017
45. Rabinowitz NC, Willmore BDB, King AJ, Schnupp JWH. Constructing Noise-Invariant Representations of Sound in the Auditory Pathway. PLoS Biol. 2013;11:e1001710. doi:10.1371/journal.pbio.1001710
46. Lakatos P, Musacchia G, O’Connel MN, Falchier AY, Javitt DC, Schroeder CE. The Spectrotemporal Filter Mechanism of Auditory Selective Attention. Neuron. 2013;77:750–761. doi:10.1016/j.neuron.2012.11.034
47. Fritz J, Shamma S, Elhilali M, Klein D. Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat Neurosci. 2003;6:1216–1223. doi:10.1038/nn1141
48. Atiani S, Elhilali M, David SV, Fritz JB, Shamma SA. Task Difficulty and Performance Induce Diverse Adaptive Patterns in Gain and Shape of Primary Auditory Cortical Receptive Fields. Neuron. 2009;61:467–480. doi:10.1016/j.neuron.2008.12.027
49. Shinn-Cunningham BG, Lee AKC, Oxenham AJ. A sound element gets lost in perceptual competition. Proc Natl Acad Sci U S A. 2007;104:12223–12227. doi:10.1073/pnas.0704641104
50. Carlyon RP. How the brain separates sounds. Trends Cogn Sci. 2004;8:465–471. doi:10.1016/j.tics.2004.08.008
51. van Wassenhove V, Grant KW, Poeppel D. Temporal window of integration in auditory-visual speech perception. Neuropsychologia. 2007;45:598–607. doi:10.1016/j.neuropsychologia.2006.01.001
52. Elhilali M, Xiang J, Shamma SA, Simon JZ. Interaction between Attention and Bottom-Up Saliency Mediates the Representation of Foreground and Background in an Auditory Scene. PLoS Biol. 2009;7:e1000129. doi:10.1371/journal.pbio.1000129
53. Brungart DS. Informational and energetic masking effects in the perception of two simultaneous talkers. J Acoust Soc Am. 2001;109:1101–1109. doi:10.1121/1.1345696
54. Kidd G, Arbogast TL, Mason CR, Gallun FJ. The advantage of knowing where to listen. J Acoust Soc Am. 2005;118:12. doi:10.1121/1.2109187
55. Rivenez M, Darwin CJ, Guillaume A. Processing unattended speech. J Acoust Soc Am. 2006;119:4027–4040. doi:10.1121/1.2190162
56. Wood N, Cowan N. The cocktail party phenomenon revisited: How frequent are attention shifts to one’s name in an irrelevant auditory channel? J Exp Psychol Learn Mem Cogn. 1995;21:255–260. doi:10.1037//0278-7393.21.1.255
57. Woods KJP, McDermott JH. Schema learning for the cocktail party problem. Proc Natl Acad Sci U S A. 2018;115:E3313–E3322. doi:10.1073/pnas.1801614115
58. Simpson SA, Cooke M. Consonant identification in N-talker babble is a nonmonotonic function of N. J Acoust Soc Am. 2005;118:2775–2778. doi:10.1121/1.2062650
59. Gramfort A, Luessi M, Larson E, Engemann DA, Strohmeier D, Brodbeck C, et al. MNE software for processing MEG and EEG data. NeuroImage. 2014;86:446–460. doi:10.1016/j.neuroimage.2013.10.027
60. Taulu S, Simola J. Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Phys Med Biol. 2006;51:1759. doi:10.1088/0031-9155/51/7/008
61. Bell AJ, Sejnowski TJ. An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Comput. 1995;7:1129–1159. doi:10.1162/neco.1995.7.6.1129
62. Fischl B. FreeSurfer. NeuroImage. 2012;62:774–781. doi:10.1016/j.neuroimage.2012.01.021
63. Hämäläinen MS, Ilmoniemi RJ. Interpreting magnetic fields of the brain: minimum norm estimates. Med Biol Eng Comput. 1994;32:35–42. doi:10.1007/BF02512476
64. Dale AM, Sereno MI. Improved Localization of Cortical Activity by Combining EEG and MEG with MRI Cortical Surface Reconstruction: A Linear Approach. J Cogn Neurosci. 1993;5:162–176. doi:10.1162/jocn.1993.5.2.162
65. Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, et al. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006;31:968–980. doi:10.1016/j.neuroimage.2006.01.021
66. Heeris J. Gammatone Filterbank Toolkit [Internet]. 2018 [cited 2020 Oct 14]. Available from: https://github.com/detly/gammatone
67. Biesmans W, Das N, Francart T, Bertrand A. Auditory-Inspired Speech Envelope Extraction Methods for Improved EEG-Based Auditory Attention Detection in a Cocktail Party Scenario. IEEE Trans Neural Syst Rehabil Eng. 2017;25:402–412. doi:10.1109/TNSRE.2016.2571900
68. Brodbeck C, Presacco A, Simon JZ. Neural source dynamics of brain responses to continuous stimuli: Speech processing from acoustics to comprehension. NeuroImage. 2018;172:162–174. doi:10.1016/j.neuroimage.2018.01.042
69. David SV, Mesgarani N, Shamma SA. Estimating sparse spectro-temporal receptive fields with natural stimuli. Netw Comput Neural Syst. 2007;18:191–212. doi:10.1080/09548980701609235
70. Brodbeck C, Das P, Brooks TL, Reddigari S. Eelbrain 0.31 [Internet]. Zenodo; 2019 [cited 2020 Oct 14]. doi:10.5281/ZENODO.3564850
71. Smith SM, Nichols TE. Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage. 2009;44:83–98. doi:10.1016/j.neuroimage.2008.03.061
72. Maris E, Oostenveld R. Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods. 2007;164:177–190. doi:10.1016/j.jneumeth.2007.03.024

Decision Letter 0

Roland G Roberts

24 Jan 2020

Dear Dr Brodbeck,

Thank you for submitting your manuscript entitled "Dynamic processing of background speech at the cocktail party: Evidence for early active cortical stream segregation" for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Jan 28 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

Decision Letter 1

Roland G Roberts

4 Mar 2020

Dear Dr Brodbeck,

Thank you very much for submitting your manuscript "Dynamic processing of background speech at the cocktail party: Evidence for early active cortical stream segregation" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by four independent reviewers.

You'll see that all four reviewers are broadly positive about the research question that you're addressing, but they also raise some significant concerns. For example, several of them question the focus of the paper and its relationship to the literature. Reviewers #2 and #3 think that you may have failed to exclude some alternative explanations for your observations (to the extent that rev #3 even contemplated rejection), while reviewers #1 and #4 want more robust assessment of model performance.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 2 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

Brodbeck and colleagues present an experiment and its results on the neural processing of two-talker auditory scenes as measured with MEG. Participants were presented with two audio-books and asked to focus on one of them. In a first analysis, using the same analysis framework throughout the manuscript, the authors present support of using onset predictors in addition to envelope predictors. Next, modelling MEG signals with onset and envelope predictors at different stimulus frequencies for the mixed, attended and unattended speech, the authors show TRF models showing similarities and differences between coefficients of the onset and envelope pointing out differences in the shape and latency differences of high coefficients (peaks) between attended and ignored. In a final step, the stimuli are differently described; this time in terms of overt and masked onsets, separately for attended and ignored speech. Here, the authors find a longer latency for overt vs. masked onsets concerning a first peak in coefficients and a second peak of coefficients in attended speech that is absent for both overt and masked onsets for ignored speech. The authors suggest that this latency difference indicates active unmasking processes (similar to "filling in") in auditory cortex.

While the study is well designed and the analyses seem to be carefully performed, I have some concerns regarding the presentation of the study as well as the methodology. Main and minor points are listed below.

Major points:

1. The analysis focuses mostly on the TRF models and does a good job in presenting the spread across participants in figures 2 to 4. However, such a strong (visual) focus on model coefficients underlines these outcomes while model performance seems "second stage". However, the focus of the effects should be on the model performance which provides a more convincing argument in this type of analyses as model coefficients are difficult to interpret (e.g., being inherently multivariate). Thus, it would be beneficial to provide figures on model performance as well (boxplot/violin plots + single participant indicators similar to 3D, 4C). Similarly, an analysis restricted to specific delay values (e.g., first, second, later peaks; similar to e.g. Puvvada and Simon, 2017) would make a better argument concerning their presence/absence. For latency changes, TRF models might be a good way to pursue if one doesn't want to use shifting-window-like or single delay analyses.

2. The statistical testing and results reporting was unclear. This concerned both model performance comparisons as well as statistics on TRF coefficients. For example, I could not follow why, for comparing model performances from filtered MEG data optimized to sources in AC, TFCE is required. Why is a cluster analysis needed at this stage when the model performance is a single r-value that is compared between different models or with respect to a baseline? Also, what does t_max mean in this context? I see the use for the analysis of TRF coefficients but I don't see how analyses of model performance could benefit from it. In addition, the multiple comparison correction for the model coefficients remains unclear. I believe that the results are corrected as there is an indication in the final paragraph of the methods but that was not enough to fully follow the approach.

3. While I like and agree with the conclusions drawn especially from the overt vs masked onset comparisons, I don't follow why this manuscript's title (and to a lesser degree the abstract and discussion) focuses to such a large extent on background speech processing rather than the masking results. In my opinion, the effects and conclusions on overt vs. masked onsets (amplitude and latency differences) constitute the main content being novel to the field while the absence of the second negative peak is interesting (it lends support to the hierarchical processing accounts discussed by the authors) seems rather like a strong but secondary result/conclusion.

4. One of the main points of this paper is the comparison of onset and envelope descriptions of sound stimuli to explain MEG activity. While I agree with the conclusions and like how carefully these distinctions were analyzed and interpreted, I want to remind the authors that studies have already applied onset descriptions for the analysis of continuous sounds in similar analysis frameworks which is not mentioned in this study (e.g., Petersen et al., 2017, JNeurophysiol; Fiedler et al., 2017, JNeurEng, Fiedler et al., 2018, NeuroImage; Hausfeld et al., 2018, NeuroImage; and maybe more). Still, the explicit comparison and quantification of including onset features provided in this study provides a novel element.

Minor points:

5. Related to point 2: figures 2D, 3B and 4B include grey-scale indicators of significance for specific delays, which are markers for significantly non-zero coefficients in any frequency band (l. 147/8). Were these already corrected for multiple comparisons (e.g., by removing small clusters not surviving a cluster-based thresholding) and was this performed across 2 dimensions (frequency and delay) or across the delays? While there is a lot of information conveyed in these graphs, simply indicating any significance across frequencies is a big simplification. How about adjusting the height of the significance bar according to the number of significant frequency bins? Another small point is that at the moment it is unclear which p-values are indicated: max, min, an average of some sort.

6. Ll. 185-188 and Fig 3C: This was difficult to follow.

7. It is suggested that masked onsets of both attended and unattended speech undergo active processing to be "unmasked". Especially for unattended sounds this might be interesting for further investigation, as it remains unclear with these data whether this unmasking works on single speech sources or more generally on the unattended sounds as a whole (representing attended vs all other sounds). Could you please comment on that in your response?

8. The figures are difficult to read given their size, please make them bigger or maybe split them up such that single panels have more space.

9. The color differences between minus and plus peaks are impossible to distinguish (maybe it was the printout, but the colors seem to be very similar). Are the color differences between positive and negative peaks necessary? For the line graphs these peaks are easily discernible, which also holds for the single participant markers in Fig. 3C.

Reviewer #2:

This is an interesting paper that uses previously published MEG data from humans listening to connected speech, either as a single talker or two simultaneous talkers (one male, one female) with participants attending to one and ignoring the other. The study reports a difference in how "masked" onsets in speech are represented in the human auditory cortex, relative to onsets that are not masked by the competing talker. This difference is observed both for attended and unattended talkers. The authors suggest in the paper (including in the title) that this difference reflects an "active" process, and that the longer latency observed for the masked responses may reflect this active processing (or recovery) of the masked speech.

Although the premise is interesting, I do not believe that the authors have demonstrated what they claim. In particular the conclusions are based on linear systems analysis, which is applied to a highly nonlinear system. This approach of assuming linearity is almost universally adopted and has been a very useful first approximation, in part because speech is quite sparse, and so even mixtures of two talkers will have relatively limited spectrotemporal overlap. Nevertheless, the limitations of a linear-systems approximation to a nonlinear (and time-variant) system should not be overlooked. An important limitation is that a linear system assumes perfect superposition, whereas neural recordings do not demonstrate this. Indeed, the simplest auditory responses, from Wave I, reflecting auditory-nerve responses, through brainstem responses (e.g., Wave V), to early cortical responses, all show changes in the morphology of the responses (amplitude and latency) based on the signal-to-noise ratio, i.e., the degree to which a stimulus is masked. By this reasoning, it would be expected that partially masked sounds would produce smaller-amplitude and longer-latency responses, just as reported by the authors, not because of any active stream segregation, but because neural responses to partially masked sounds typically have smaller responses and longer latencies. Perhaps I am missing something important, which answers this criticism, and for this reason I have recommended revision rather than initial rejection. In any case, this seems an obvious alternative explanation, so it is not clear why the authors haven't addressed it, if only to rule it out.

Specific comments

The onset detection used by the authors seems to be based on a neural model. This needs a little more explanation, as well as more detail. For instance, what proportion of onsets overall were deemed "masked"? Also the measure itself (Max(0, Max(attended, ignored) − mixture)) could do with a little more explanation. Are the units in terms of the model's neural response? Does this measure have any influence on the size of the MEG responses reported in the figures?

Line 319. Early selective attention. The authors suggest that the masked responses show earlier manifestations of attention than the overt responses. But according to Fig. 4B, both the unmasked (overt) and the masked responses start to show significant effects of attention prior to 100 ms, and the difference seems to be more in the sign than in the magnitude of the difference. This needs to be clarified or reframed.

Line 439. Add ", respectively" after "Chapters 3 and 8" so that it's clear that the male read 3 and the female read 8.

Line 442. How was loudness matched, and what was the approximate sound pressure level of the two talkers? (Something more exact than "Comfortably loud")

Line 507. Typo: "am" should be "an".

Reviewer #3:

This manuscript presents research aimed at investigating the representation of attended and, especially, unattended speech in the auditory cortex. The authors ask subjects to attend to one of two concurrent stories while they record their MEG. Then they examine how well they can predict MEG, source localized to auditory cortex, from two different representations of the speech stimuli. One representation is a gammatone spectrogram of the stimulus (essentially eight envelopes corresponding to the energy in eight frequency bands). The second is a spectrogram of the acoustic onsets of the speech, again in eight bands, and derived using an auditory onset model. The goal is to test if early auditory cortex either conducts a fairly straightforward and stable spectrotemporal analysis of all input or carries out a more active process that involves representing speech features from attended and unattended streams in a dissociable way. They reason that including the acoustic onset representations should allow them to answer this question because of the importance of onsets to auditory scene analysis and also because, analytically, the onsets in a mixture of two speech streams are not very predictable from the mixture of the onsets of the two streams. They report that… and conclude that…

Overall I thought this was a very interesting manuscript with a well conducted experiment and a sensible and sophisticated data analysis framework. I did have a few queries and comments for the authors though.

Main comments:

1) My first comment just speaks to the idea of making the rationale for the approach a bit more reader friendly in the introduction. I have to admit it took me two reads through the introduction for me to start to become comfortable with the ideas underlying the strategy. So, can I first clarify: the authors use the onset representation for two reasons: a) because of their importance for ASA, and b) because of this property that "the proportion of the variability in the mixture representations that cannot be predicted from the two sources is small for the envelopes, but substantially larger for the onsets". Is that accurate? If so, I wondered about giving readers a slightly more intuitive description of this. The way I have phrased it above is that "the onsets in a mixture of two speech streams are not very predictable from the mixture of the onsets of the two streams", at least less than for the envelope. Am I understanding what you are saying? Also, I was confused about how this links with the idea that "With acoustic onset features, interference is more pronounced because ongoing acoustic energy in one source can hide an increase in another source (see below for empirical verification of this claim)." Is this making the same point? It didn't seem so on first reading. And it was not at all obvious to me - I would have felt that envelopes were more likely to interfere with each other because they are more continuous. You say you empirically verify this below - is that in Figure 3C? Anyway, the reason I am commenting on this, is that when I got to the results I just had this nagging feeling that the delay I was seeing in the STRFs for the individual sources might just be a signal processing glitch. Does the supplementary figure answer that specific concern? Or is that only to provide evidence that the ignored speech source contribution cannot be explained from the mixture. It feels like only the latter to me.

2) Related to the previous point, is it also fair to say that the current analysis basically seeks to overturn the claims made in the Puvvada & Simon (2017) paper - based on an analysis that is more sensitive to the neurophysiological responses to the individual speech streams?

3) Given recent (cited) work on the different effects of attention in HG vs STG (O'Sullivan et al., 2019), I also wondered about the auditory localization used in the present study. I think I understand that the present study does not contradict the O'Sullivan paper. It merely suggests that both attended and unattended speech are separably represented in early auditory cortex, not that both are modulated strongly by attention in early auditory cortex. Nonetheless, I wondered about how confident we can be that we are really looking at early auditory cortex and not at contamination by strong attention effects from, for example, STG. I guess it all comes down to the latency of the effects you are seeing? The source onset STRFs in Figure 3B show strong attentional modulations of a later component around 150 ms - might this be STG? And so we are safe to assume everything before 100 is from a "lower" cortical area than STG?

4) I also wondered about how much the difference in performance between onset STRFs and envelope STRFs might be due to the choice to only have 8 frequency bands in the representation. I wonder might the authors care to wax lyrical on whether or not this result is likely to still be true if one used, say, 100 bands. Might that be enough spectral resolution for us to see these effects also in the envelope STRFs (assuming it was possible to fit the models reliably)?

Minor comments:

1) I wondered why the STRF to the mixture envelope was not plotted in Figure 3 - just for the sake of completeness.

2) For people who have not read Bob Carlyon's 2004 paper, I thought it might be nice to slightly unpack the following sentence "Alternatively, the auditory cortex could employ an active process to dynamically recover and represent potential speech features regardless of what stream they belong to, and in doing so provide selective attention with more elaborate representations [11]."

3) The value of the un-masking section on page 16 - and indeed the whole paper really - was brought home to me when I read the line about auditory cortex potentially filling in the underlying speech sources. I thought that was a nice way to think about what kind of thing cortex might be doing in terms of active segregation. I wonder might the authors want to move that point earlier and/or unpack it a small bit for folks who are not familiar (as I happened to be) with references 26 and 27. Just a suggestion.

Reviewer #4:

This study examines how the foreground and background speech streams are processed when presented simultaneously. Using MEG recordings and linear STRF modeling, it examines how the onsets and envelopes of the sources are represented in the responses. The results suggest that when the spectro-temporal features mask each other, an active process extracts the onsets of the sources, resulting in a MEG activation that is delayed by 20 ms compared to the activation in response to overt (unmasked) sources. These findings are novel and significant, identifying the brain mechanisms that underlie speech perception in complex situations. The study is technically sound and the experiment is well designed. The statistical analysis is appropriate and the supplementary information is useful. Also, the data needed to replicate the study are available.

There is only one major issue I think needs to be addressed. The authors use linear modeling based on information extracted from the onsets and envelopes of the signals. A primary conclusion is that onsets are critical for many aspects of the processing. However, the model does not allow too many alternatives to that, as the predictors are only based on onsets or on envelopes. I think the conclusions (including those in the supplement) would be much stronger if offset detectors were built into the model, equivalent to the onset detectors, and if it was shown that no model improvement is obtained from predictors based on the offsets, or that the offsets themselves cannot predict the results. So, I think it is critical to do additional modeling with the offsets considered before the conclusions of the modeling can be presented as they stand.

Minor issues:

L36: in "… an acoustic scene, the acoustic signal …" the word "acoustic" is repetitive. Saying "… an acoustic scene, the signal …" is sufficient.

L507 am -> an

Decision Letter 2

Roland G Roberts

13 Aug 2020

Dear Dr Brodbeck,

Thank you for submitting your revised Research Article entitled "Neural speech restoration at the cocktail party: Auditory cortex recovers masked speech of both attended and ignored speakers" for publication in PLOS Biology. I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor.

IMPORTANT: You'll see that reviewer #2 continues to express deep scepticism about your findings, and there is also a cautious note from reviewer #3. However, after discussing the reviewers' comments with the Academic Editor, we have decided to consider your manuscript further if you make these limitations clear to the readers. Specifically, the Academic Editor would like you to further consider and discuss the issue of the SNR as mentioned by reviewer #2. You should also address the remaining requests from revs #1 and #3.

Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the Data and other policy-related requests noted at the end of this email.

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer. In addition to the remaining revisions and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

*Copyediting*

Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

*Submitting Your Revision*

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include a cover letter, a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable), and a track-changes file indicating any changes that you have made to the manuscript.

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

ETHICS STATEMENT:

-- Please include the full name of the IACUC/ethics committee that reviewed and approved the animal care and use protocol/permit/project license. Please also include an approval number.

-- Please include the specific national or international regulations/guidelines to which your animal care and use protocol adhered. Please note that institutional or accreditation organization guidelines (such as AAALAC) do not meet this requirement.

-- Please include information about the form of consent (written/oral) given for research involving human participants. All research involving human participants must have been approved by the authors' Institutional Review Board (IRB) or an equivalent committee, and all clinical investigation must have been conducted according to the principles expressed in the Declaration of Helsinki.

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

We note that your raw data are made available in an institutional repository. While this is accessible and clearly laid out, ideally (for long-term robustness) we would prefer it to be hosted on a non-institutional repository (e.g. Dryad, Figshare, Github), and ask you to explore that.

In addition, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2BCDE, 3ABCD, 4ABCD and S1, S2. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

This work by Brodbeck, Jiao, Hong and Simon is a resubmission of a previously reviewed manuscript. The current version addresses most of the concerns raised about the initial version, such that it reads more strongly and provides clearer links between results and claims. I very much appreciate the reanalysis and the introduction of model fit statistics, with their non-trivial creation of permuted data/predictors, as well as the extensive work on the readability of text and figures. However, some comments and unclear points (mainly on the statistics) remain and are listed below.

1. Model fit statistics I. Although consistent and significant, the z-values and corresponding r-values for improvement seem small: around delta_z = 0.01 in Fig 2D and below ~0.005 in Figs 3B and 4B. What do you think is the reason for this? Could it be due to correlations between predictors and/or small effect sizes in general? One suggestion might be to report the absolute z- or r-value obtained with the tested model, for better judgement.
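For reference, assuming the reported z-values are Fisher-transformed prediction correlations (a common convention for this kind of model comparison, though that convention is an assumption here, not restated in the comment), the quantities would relate as

    z = \operatorname{arctanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r},
    \qquad
    \Delta z = z_{\text{full}} - z_{\text{reduced}},

and because arctanh(r) ≈ r for small r, a delta_z of 0.01 corresponds to an increase in prediction correlation of roughly 0.01.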

2. Model fit statistics II. I just want to be sure that I follow the approach exactly, in combination with the source localization. I assume that the dots in Figs 2D, 3B and 4B depict the individual improvements when including a certain predictor vs. not including it, averaged across the x dipole models in each hemisphere. When considering the single-speaker condition, do these values show the difference between, for example, the model with onsets and envelope vs. the model with only onsets, to infer the significance of the envelope in this case (this is my guess)? Or do you test the model with the envelope against a null model established with a noise simulation (ref 72)?
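Concretely, the first of these two readings (a nested model comparison) could be sketched as follows. This is a generic illustration using ridge regression and simple cross-validation, not the authors' actual estimation code; all function names are hypothetical.

    import numpy as np

    def lagged_design(predictors, n_lags):
        """Design matrix of time-lagged predictor copies (FIR/TRF model)."""
        n_times, n_pred = predictors.shape
        X = np.zeros((n_times, n_pred * n_lags))
        for p in range(n_pred):
            for lag in range(n_lags):
                X[lag:, p * n_lags + lag] = predictors[:n_times - lag, p]
        return X

    def cv_accuracy(y, predictors, n_lags=40, alpha=1.0, n_folds=4):
        """Cross-validated prediction accuracy (Pearson r) of a ridge TRF."""
        X = lagged_design(predictors, n_lags)
        folds = np.array_split(np.arange(len(y)), n_folds)
        rs = []
        for test in folds:
            train = np.setdiff1d(np.arange(len(y)), test)
            beta = np.linalg.solve(
                X[train].T @ X[train] + alpha * np.eye(X.shape[1]),
                X[train].T @ y[train])
            rs.append(np.corrcoef(y[test], X[test] @ beta)[0, 1])
        return float(np.mean(rs))

    # Unique contribution of the envelope, per subject and dipole:
    #   delta_z = arctanh(cv_accuracy(y, onsets_and_envelope))
    #           - arctanh(cv_accuracy(y, onsets_only))
    # averaged across dipoles within a hemisphere, then tested across subjects.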

3. l. 326-331. It is investigated whether masked onsets are delayed; however, only descriptive statistics are provided. For better support of the claim, please test this formally. This comes back in l. 382, where it is stated as a summary of the masking results that "[these] latency shifts increase at successive stages", which requires a comparison of the latency shifts between the earlier and later windows.
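For what it is worth, the requested formal test could be run along these lines; all latency values below are invented placeholders (one TRF peak latency per subject and condition), and a nonparametric paired test is used purely as one reasonable choice:

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    n_subjects = 20  # placeholder; latencies in seconds

    # Hypothetical per-subject TRF peak latencies for overt vs. masked
    # onsets, extracted separately in an early and a late analysis window.
    overt_early = rng.normal(0.070, 0.008, n_subjects)
    masked_early = rng.normal(0.090, 0.008, n_subjects)
    overt_late = rng.normal(0.150, 0.012, n_subjects)
    masked_late = rng.normal(0.190, 0.012, n_subjects)

    # Test 1: are masked-onset peaks reliably later than overt-onset peaks?
    print(wilcoxon(masked_early - overt_early))

    # Test 2: does the latency shift increase at the later processing stage?
    print(wilcoxon((masked_late - overt_late) - (masked_early - overt_early)))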

4. Unlike for t_max, it remains unclear to me what the p-values that are mentioned (e.g., ll. 142, 208, …) denote, mainly because signals are source-modeled at each vertex of the modeled brain surface (as also noted in my comments on the first version). Are these the p-values of the statistic of the spatial cluster showing the highest significance (as there might be several significant clusters within one hemisphere), or something else? The explanation that "group-level statistics were evaluated with spatial permutation tests" (l. 204) didn't help my understanding, and might benefit from a (very short) description of the procedure outlined in ref 72.
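For readers unfamiliar with such tests, a much-simplified sketch of a cluster-mass sign-permutation test follows. It treats the source space as a 1-D chain of neighboring dipoles; the actual procedure of ref 72 operates on the cortical mesh adjacency and may use threshold-free cluster enhancement rather than a fixed cluster-forming threshold.

    import numpy as np
    from scipy import ndimage
    from scipy.stats import t as t_dist

    def cluster_permutation_test(diff_maps, threshold_p=0.05, n_perm=1000,
                                 seed=0):
        """diff_maps: array (n_subjects, n_sources) of per-subject differences.

        Returns (mass, p) for each suprathreshold cluster; p is the fraction
        of sign-flip permutations whose largest cluster mass reaches the
        observed mass, controlling family-wise error over the source space.
        """
        rng = np.random.default_rng(seed)
        n_subj, n_src = diff_maps.shape
        t_crit = t_dist.ppf(1 - threshold_p / 2, n_subj - 1)

        def cluster_masses(maps):
            t_map = maps.mean(0) / (maps.std(0, ddof=1) / np.sqrt(n_subj))
            labels, n = ndimage.label(np.abs(t_map) > t_crit)
            return [np.abs(t_map)[labels == i + 1].sum() for i in range(n)]

        observed = cluster_masses(diff_maps)
        null_max = np.empty(n_perm)
        for i in range(n_perm):
            signs = rng.choice([-1.0, 1.0], size=n_subj)[:, None]
            null_max[i] = max(cluster_masses(diff_maps * signs), default=0.0)
        return [(m, float((null_max >= m).mean())) for m in observed]

Under this scheme, the single p-value reported for a hemisphere would naturally be that of its largest (most significant) cluster.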

5. Just to be sure this comment is not misunderstood: the following is a final, constructive and, by all means, friendly remark on the topic of masked speech. While re-reading the manuscript and focusing on the masking aspect, I was reminded of Fiedler et al. (2019, NeuroImage) (ref 21). The results presented in their figure 3 (contrasting relative SNR changes in attended vs ignored speakers) look very similar to the results presented in this manuscript's Figs 4D and 4E, and indicate a similar latency shift (which was not tested for or discussed in Fiedler et al., 2019). It's great and encouraging to see convergence here for different types of data, stimuli and languages (disclaimer: I am neither one of the authors nor in any way affiliated with this lab, and have no other interests here).

Reviewer #2:

My main concern when reading the previous version of the manuscript was that the authors had not supported their primary claim that the cortex "recovers masked speech". Instead, I suggested that the results could be explained by the fact that responses to partially masked onsets (i.e., onsets presented at a low signal-to-noise ratio) would be expected to be lower in amplitude and have a longer latency, based on what we already know about auditory responses to stimuli in noise, from the earliest stages of auditory processing. Indeed, the only justification for calling it "recovery" is that the authors' neural model of onset detection does not detect partially masked onsets in the same way as the measured responses from human cortex seem to do. But this could easily be a problem with their very simple model, rather than any insight into how the auditory cortex processes sound.

The authors now acknowledge this problem, and counter it by providing additional analyses to show that their model (incorporating the overt vs. masked distinction) outperforms a model that incorporates information about the instantaneous target level (quantized into 3 level regions) but ignores the signal-to-noise ratio (SNR). However, this approach does not address the problem satisfactorily: the effect of nonlinearity goes beyond producing level-dependent effects, because nonlinearity produces interactions when two or more stimuli are combined. We already know that at low SNRs (as is the case here, where the average SNR is 0 dB), the SNR will modulate the responses to a target sound more strongly than simple intensity does. This has nothing to do with "cortical recovery" of signals. Thus, the additional analysis does not rule out, or even reduce, the possibility that what the authors have measured is simply the expected outcome of stimuli that overlap acoustically, producing partial masking.

In summary, the major potential problem raised in the first round of reviews seems confirmed: the results are predictable simply from the fact that neural responses to onsets are reduced (and slightly delayed) when the onsets are partially masked. This type of response is well established at many stages of the auditory pathway and does not provide evidence for "recovery" or any other type of active cortical processing. Indeed, the time course itself (with latencies of less than 100 ms) should make the authors (and readers) suspicious of the claim that any active recovery process is undertaken.

With this interpretation, most of the speculation provided in the Discussion (Lines 416-494) and the unhelpful black-box model in Fig. 5 become irrelevant. As the data themselves are not new and were already published elsewhere, there seems little else to justify publishing the current study.

Reviewer #3:

Many thanks to the authors for their extensive efforts in addressing my previous comments. I think the new introduction is much clearer. Most of my other comments have been fully addressed also.

I guess I still have a little bit of residual unease about the analysis of masked vs overt onsets. You have determined which onsets are masked and which are not based on your own particular choices of how to represent the acoustic onsets (i.e., using your auditory edge-detection model with particular parameters). However, it is not obvious that these onsets will be genuinely masked from the perspective of the cochlear output. If they are not, then should we really be considering them as masked and overt? And if they are, then I still struggle to understand how one can confidently model their separate contributions to the neural responses. I do appreciate that the simulations in Figs S1 and S2 aim to ease our worries on this front, and they mostly achieve that goal. But, I guess, I still feel like we are in a bit of a Catch-22. If they are not perfectly masked, then what does it even mean to call them masked? And, if they are perfectly masked, then how is it possible to separate them at all? If you could help clarify this for me, I would greatly appreciate it.
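To make the masked/overt distinction at issue concrete, here is one generic way such a labeling could be computed from onset representations of a source and of the mixture. This is an illustration under assumed conventions (threshold and names included), not the authors' exact model or parameters.

    import numpy as np

    def classify_onsets(onsets_source, onsets_mixture, threshold=0.5):
        """Split a source's onset spectrogram into overt and masked parts.

        An onset element counts as 'overt' when the mixture also contains
        a comparable onset at that time-frequency point (the competing
        talker did not obscure it), and as 'masked' otherwise.
        """
        overt_idx = onsets_mixture >= threshold * onsets_source
        overt = np.where(overt_idx, onsets_source, 0.0)
        masked = np.where(overt_idx, 0.0, onsets_source)
        return overt, masked

An alternative, graded convention would define the masked component as np.maximum(onsets_source - onsets_mixture, 0), so that partial masking yields partial "masked" energy rather than an all-or-none label.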

Also, although I hesitate to suggest this given the amount of work you have already done, I wondered about the simulations in S1 and S2. Might it not have been more compelling to allow non-zero ignored TRFs and show that there was no warping/delaying of those TRF peaks by the coincidence of the attended and ignored (masked) onsets? Maybe not – the simulation you already have seems fairly compelling. So maybe a general answer to my first comment would suffice.
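The suggested extension could look roughly like the following self-contained simulation, in which the ignored stream has a non-zero ground-truth TRF. Every kernel shape, event rate, and noise level here is invented for illustration; the estimator is plain ridge regression, not the authors' pipeline.

    import numpy as np

    rng = np.random.default_rng(1)
    n_times, n_lags = 60000, 100
    lags = np.arange(n_lags)

    def kernel(peak, width):
        """Gaussian TRF kernel peaking at the given lag (in samples)."""
        return np.exp(-0.5 * ((lags - peak) / width) ** 2)

    trf_att = kernel(30, 4)        # ground-truth attended TRF
    trf_ign = 0.5 * kernel(30, 4)  # non-zero ignored TRF (suggested change)

    x_att = (rng.random(n_times) < 0.01).astype(float)  # sparse onset trains;
    x_ign = (rng.random(n_times) < 0.01).astype(float)  # coincidences by chance

    y = (np.convolve(x_att, trf_att)[:n_times]
         + np.convolve(x_ign, trf_ign)[:n_times]
         + 0.5 * rng.standard_normal(n_times))

    # Joint ridge estimate of both TRFs from the simulated response.
    X = np.zeros((n_times, 2 * n_lags))
    for lag in lags:
        X[lag:, lag] = x_att[:n_times - lag]
        X[lag:, n_lags + lag] = x_ign[:n_times - lag]
    beta = np.linalg.solve(X.T @ X + 10.0 * np.eye(2 * n_lags), X.T @ y)

    # If estimation is unbiased, the recovered ignored-TRF peak stays at the
    # ground-truth lag (30) despite chance onset coincidences.
    print('recovered ignored-TRF peak lag:', int(np.argmax(beta[n_lags:])))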

Reviewer #4:

The revised version of the manuscript sufficiently addressed all of my concerns. I do not have any more suggestions.

Decision Letter 3

Roland G Roberts

14 Sep 2020

Dear Dr Brodbeck,

On behalf of my colleagues and the Academic Editor, Manuel S. Malmierca, I am pleased to inform you that we will be delighted to publish your Research Article in PLOS Biology.

The files will now enter our production system. You will receive a copyedited version of the manuscript, along with your figures for a final review. You will be given two business days to review and approve the copyedit. Then, within a week, you will receive a PDF proof of your typeset article. You will have two days to review the PDF and make any final corrections. If there is a chance that you'll be unavailable during the copy editing/proof review period, please provide us with contact details of one of the other authors whom you nominate to handle these stages on your behalf. This will ensure that any requested corrections reach the production department in time for publication.

Early Version

The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans, so that we may opt you out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards,

Vita Usova

Publication Assistant,

PLOS Biology

on behalf of

Roland Roberts,

Senior Editor

PLOS Biology

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Simulations. Simulations to assess TRF cross-contamination.

TRF, temporal response function.

(PDF)

S1 Fig. MEG responses to clean speech, envelope only.

Spectrotemporal response function to the envelope spectrogram, when estimated without considering onsets. All other details are analogous to Fig 2E. Data in S4 Data.

(PDF)

S1 Data. Data from Fig 2.

Model prediction accuracy maps and spectrotemporal response functions for the plots shown in Fig 2. Data are stored as pickled Python/Eelbrain objects with corresponding metadata.

(ZIP)

S2 Data. Data from Fig 3.

Details as for S1 Data.

(ZIP)

S3 Data. Data from Fig 4.

Details as for S1 Data.

(ZIP)

S4 Data. Data from S1 Fig.

Details as for S1 Data.

(ZIP)

S5 Data. Data from S1 Simulations.

Details as for S1 Data.

(ZIP)

Attachment

Submitted filename: R2-Response v8 clean.docx

Attachment

Submitted filename: Responses.docx

Data Availability Statement

Preprocessed MEG recordings and stimuli are available from the Digital Repository at the University of Maryland. The MEG dataset is available at http://hdl.handle.net/1903/21109; additional files specific to this paper are available at http://hdl.handle.net/1903/26370. Subject-specific results are provided for each figure in the supplementary data files.

