Abstract
The ability to detect and track relevant acoustic signals embedded in a background of other sounds is crucial for hearing in complex acoustic environments. This ability is exemplified by a perceptual phenomenon known as “rhythmic masking release” (RMR). To demonstrate RMR, a sequence of tones forming a target rhythm is intermingled with physically identical “Distracter” sounds that perceptually mask the rhythm. The rhythm can be “released from masking” by adding “Flanker” tones in adjacent frequency channels that are synchronous with the Distracters. RMR represents a special case of auditory stream segregation, whereby the target rhythm is perceptually segregated from the background of Distracters when they are accompanied by the synchronous Flankers. The neural basis of RMR is unknown. Previous studies suggest the involvement of primary auditory cortex (A1) in the perceptual organization of sound patterns. Here, we recorded neural responses to RMR sequences in A1 of awake monkeys in order to identify neural correlates and potential mechanisms of RMR. We also tested whether two current models of stream segregation, when applied to these responses, could account for the perceptual organization of RMR sequences. Results suggest a key role for suppression of Distracter-evoked responses by the simultaneous Flankers in the perceptual restoration of the target rhythm in RMR. Furthermore, predictions of stream segregation models paralleled the psychoacoustics of RMR in humans. These findings reinforce the view that preattentive or “primitive” aspects of auditory scene analysis may be explained by relatively basic neural mechanisms at the cortical level.
Keywords: multiunit activity, simultaneous suppression, spectral integration, auditory stream segregation
Hearing in natural environments requires the ability to detect and track ecologically relevant sounds, such as conspecific vocalizations, embedded among other, less relevant sounds (see, e.g., Bee and Micheyl 2008; Fay 2008). This ability, in turn, requires that animals perceptually segregate or integrate sounds emanating from multiple sources into auditory “streams,” a process known as “auditory scene analysis” (Bregman 1990).
The neural basis of auditory scene analysis is an active research topic (Carlyon 2004; Fishman and Steinschneider 2010; Gutschalk et al. 2005; Nelken and Bar-Yosef 2008; Shamma and Micheyl 2010; Snyder and Alain 2007). Importantly, many aspects of auditory scene analysis, including auditory stream segregation, are phylogenetically widespread (Fay 2008) and are thought to rely on “primitive” mechanisms that do not require learning or attention (Bregman 1990). These considerations motivate the investigation of neural correlates of auditory scene analysis in experimental animal models.
A classic demonstration of the perceptual organization of sound sequences is provided by the “auditory stream segregation” paradigm. In this paradigm, low- and high-frequency tones (denoted A and B, respectively) are presented in an alternating pattern, “…ABAB…”. Depending on the frequency separation between the tones and on the alternation rate, the sequence is heard either as a single coherent stream of tones alternating in pitch or as two separate streams—one comprised of “A” tones and the other of “B” tones (Bregman 1990). Previous neurophysiological studies suggest that auditory stream segregation can be explained by relatively basic neural response properties, such as frequency selectivity, forward suppression, and adaptation, which are present at, or below, the level of primary auditory cortex (A1) (for reviews, see Fishman and Steinschneider 2010; Micheyl et al. 2007). Earlier investigations led to the “population-separation” model of stream segregation. According to this model, sounds are perceptually organized into separate auditory streams if they excite distinct neural populations in auditory cortex and into a single stream if they excite largely overlapping populations; frequency selectivity, forward suppression, and adaptation promote functional separation of neural populations and, therefore, stream segregation (Fishman et al. 2001; Fishman and Steinschneider 2010; Micheyl et al. 2007). More recent studies have emphasized that in order to be perceived as separate streams, sounds must not only excite separate neural populations but do so in a temporally incoherent (or anticorrelated) fashion. These findings have led to the “temporal-coherence” model of auditory streaming (Elhilali et al. 2009; Shamma and Micheyl 2010).
An important test of both the population-separation and temporal-coherence models of auditory streaming is to examine whether they can account for the perceptual organization of sound sequences in which the same sound element (e.g., a frequency component) can be assigned perceptually to two different streams, depending upon its acoustic context. Such acoustic sequences are exemplified by the rhythmic masking release (RMR) paradigm, which is schematically illustrated in Fig. 1. In the RMR paradigm, a sequence of tones, “Targets,” is presented in a rhythmic pattern (Fig. 1A). The target rhythm is perceptually masked by the introduction of “Distracter” tones that are physically identical to, and temporally intermingled with, the Targets (Fig. 1B). However, the target rhythm can be perceptually restored (“released from masking”) when “Flanker” tones, occupying adjacent frequency channels, are added synchronously to the Distracters (Fig. 1C). This “rhythmic masking release” causes the Targets to be “heard out” and perceptually segregated from the Distracters, thereby facilitating identification of the target rhythm (Turgeon et al. 2002, 2005). Thus physically identical Target and Distracter tones are perceived as belonging to the same or to separate streams depending on the absence or presence of the Flankers. Importantly, the RMR effect is most pronounced when Flankers and Distracters are synchronous and declines with increasing asynchrony between them, effectively disappearing when their onset asynchrony exceeds ∼20 ms (Fig. 1D; Turgeon et al. 2002, 2005). Onset synchrony and asynchrony are powerful cues that the auditory system uses to group and segregate, respectively, acoustic components in auditory scene analysis (Darwin and Carlyon 1995). 
Thus RMR has been explained in terms of perceptual “capture” of the Distracters by the Flankers, i.e., perceptual grouping of the synchronous Distracter and Flanker tones, which are then segregated from the Target tones (Turgeon et al. 2002, 2005). RMR can be viewed as a particular case of auditory stream segregation, with the two perceptual streams corresponding to the sequence of pure-tone Targets and to the sequence of Distracter-Flanker complexes (see Fig. 1). As RMR is a perceptual phenomenon that combines two key principles of auditory perceptual organization, sequential segregation and simultaneous grouping (Bregman 1990), the RMR stimulus paradigm provides an interesting context in which to evaluate neural models of stream segregation.
As no study so far has examined neural responses to stimulus sequences eliciting RMR, the neural basis of the phenomenon remains unknown. Moreover, it is unclear whether existing physiological models of auditory stream segregation can account for RMR. One hypothesis is that the perceptual segregation of the Target tones from the Distracter-Flanker complexes is paralleled, at the neural level, by the excitation of two functionally separate populations: one population responding selectively to the Target tones and the other responding selectively to the Distracter-Flanker complexes. The question addressed in the present study is whether such neural populations could be found in A1. The existence of “Target-selective” neurons in A1 is not trivial, because frequency-selective neurons in A1 that are strongly activated by the Target tones are likely to respond also to the (same frequency) Distracter tones. Therefore, it was unclear a priori whether neural response patterns in A1 would parallel the perceptual segregation of the Target and Distracter tones.
We considered three possible scenarios. In the first scenario, A1 neurons that are tuned to the Targets would be insensitive to the Flankers, so that they would respond equally strongly to the Distracters regardless of whether the Flankers are present or absent; consequently, the temporal response pattern of these neurons would not parallel the psychophysical findings. In the second scenario, A1 neurons that respond strongly to the Targets would also respond equally strongly to the Distracters in the absence of the Flankers; however, owing to the suppression of the Distracter responses by the synchronous Flankers (Sadagopan and Wang 2011; Shamma and Symmes 1985; Sutter and Loftus 2003), these neurons would respond less strongly to the Distracters when they are accompanied by the synchronous Flankers. This suppression of Distracter responses by the synchronous Flankers would effectively create “Target-selective” neural responses. These “Target-selective” responses could support the selective detection of the Target rhythm when Flankers are added to the Distracters, i.e., the main RMR effect (Turgeon et al. 2002, 2005). In the third scenario, neurons that respond strongly to the Target tones would respond equally strongly to the Distracter tones when the Flankers are absent; however, because of spectral integration (Kanwal and Rauschecker 2007; Sadagopan and Wang 2009; Schwarz and Tomlinson 1990), when Flankers are added to the Distracters these neurons would show larger responses to the Distracters than to the Targets. The enhanced responses to the Distracter-Flanker complexes could potentially support listeners' ability to attend selectively to the rhythm carried by the Distracter-Flanker complexes. 
It is important to note that the three above-described outcomes are not mutually exclusive, and, in fact, considering the variety of neural response interactions observed in auditory cortex in previous studies, a priori it is conceivable that different sites in A1 would exhibit different response patterns.
To determine which of these scenarios is realized in A1, we recorded neural responses to sound sequences similar to those illustrated in Fig. 1 in A1 of awake monkeys. In accordance with the rationale described above, we tested the hypothesis that A1 contains two separate neural populations that respond selectively to the Targets and to the Distracter-Flanker complexes, respectively. Furthermore, we predicted that “Target-selective” responses would be generated by simultaneous suppression of responses to Distracters by the simultaneous Flankers. Consistent with psychoacoustic findings (Turgeon et al. 2002, 2005), we predicted that suppression of Distracter responses by the Flankers in A1 would be independent of the frequency separation between the Flankers and Distracters (at least across the ∼1-octave range tested in psychoacoustic RMR experiments), would be maximal when Flankers are presented synchronously with the Distracters, and would diminish or disappear when Flankers are delayed relative to the Distracters by 20 ms or more. Finally, we examined whether the population-separation and temporal-coherence models of stream segregation, when applied to A1 responses, could account for the perceptual organization of RMR sequences in human psychoacoustic studies.
MATERIALS AND METHODS
Four adult male macaque monkeys (Macaca fascicularis) were studied with previously described methods (Fishman et al. 2001; Steinschneider et al. 2003). All experimental procedures were reviewed and approved by the Association for Assessment and Accreditation of Laboratory Animal Care (AAALAC)-accredited Animal Institute of Albert Einstein College of Medicine and were conducted in accordance with institutional and federal guidelines governing the experimental use of primates. Animals were housed in our AAALAC-accredited Animal Institute under daily supervision of laboratory and veterinary staff. To minimize the number of monkeys used, other auditory experiments were conducted in the same animals during each recording session. Animals were acclimated to the recording environment and trained while sitting in custom-fitted primate chairs prior to surgery.
Stimuli
Stimuli were generated and delivered at a sample rate of 48.8 kHz by a PC-based system using an RX8 module (Tucker Davis Technologies). Frequency response functions (FRFs) based on pure-tone responses characterized the spectral tuning of the cortical sites. Pure tones used to generate the FRFs ranged from 0.15 to 18.0 kHz, were 200 ms in duration (including 10-ms linear rise/fall ramps), and were presented with a stimulus onset-to-onset interval of 658 ms. Resolution of FRFs was 0.25 octaves or finer across the 0.15–18.0 kHz frequency range tested.
RMR stimulus parameters were similar to those used in psychoacoustic studies (Turgeon et al. 2002) and in the audio demonstration of RMR provided at the website of Albert Bregman (http://webpages.mcgill.ca/staff/Group2/abregm1/web/). Stimuli consisted of three continuously looped sequences of tones: Targets Alone, Targets & Distracters, and Targets & Distracters + Flankers presented in separate blocks (Fig. 1). Tones comprising RMR sequences were 50 ms in duration (including 5-ms linear on/off ramps). The frequency of Targets was set equal to the best frequency (BF) of the recording sites (defined below). In the Targets Alone sequences, Target tones were presented with an onset-to-onset interval of 246 ms. In the Targets & Distracters sequences, Distracter tones were introduced and offset from the Target tones by 123 ms. Distracters had a frequency equal to that of the Targets. Thus the Targets & Distracters sequences were comprised of BF tones presented with an onset-to-onset interval of 123 ms (a tone presentation rate that is twice that of the Targets Alone sequences). In the Targets & Distracters + Flankers sequences, two Flanker tones, one higher and one lower in frequency than the Distracters, were added in sine phase to each Distracter tone. To test the effects of frequency separation (ΔF) on responses to the Distracter tones, Flankers were separated in frequency from Distracters by 0.25, 0.5, and 1.0 octaves. As RMR can also be demonstrated by using Distracter-Flanker complexes comprised of more than a single pair of Flankers (Turgeon et al. 2005), it is important to test whether the physiological findings could be generalized to cases in which more than two Flankers are presented. Thus the effect of including additional pairs of Flankers was examined in a limited number of electrode penetrations in the fourth animal. 
For these electrode penetrations, one pair of Flankers was placed at 0.25 octaves, two pairs of Flankers at 0.25 and 0.50 octaves, and three pairs of Flankers at 0.25, 0.50, and 0.75 octaves above and below the frequency of the Distracter tones. The condition in which the onsets of the Flankers were synchronous with those of Distracters (onset asynchrony of 0 ms) was tested in all electrode penetrations. To test effects of onset asynchrony on responses to the Distracters, onsets of Flankers were delayed relative to those of Distracters by 20 and 40 ms. For these Flanker delays, psychoacoustic studies show a reduced or absent RMR effect in human listeners (Turgeon et al. 2002, 2005; see Fig. 1D). Effects of onset asynchrony were tested with a constant ΔF between Flankers and Distracters of 0.25 octaves. As RMR stimuli were comprised of continuous, uninterrupted sequences of tones, there was no “first” tone of the sequences.
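The timing and frequency layout of one cycle of a Targets & Distracters + Flankers sequence can be sketched as follows. This is an illustrative reading of the parameters given above (50-ms tones, 246-ms Target period, Targets offset from Distracters by 123 ms, Flankers placed ΔF octaves above and below the Distracter); the function names and the event-tuple layout are hypothetical, and keeping delayed Flankers at the full 50-ms duration is an assumption.

```python
def flanker_frequencies(distracter_hz, delta_f_octaves):
    """Return (lower, upper) Flanker frequencies delta_f octaves from the Distracter."""
    ratio = 2.0 ** delta_f_octaves
    return distracter_hz / ratio, distracter_hz * ratio


def rmr_cycle(bf_hz, delta_fs=(0.25,), flanker_delay_ms=0.0):
    """Describe one 246-ms cycle, with time measured from Distracter onset.

    Returns a list of (onset_ms, duration_ms, frequency_hz, role) tuples.
    Assumption: delayed Flankers keep the same 50-ms duration as other tones.
    """
    events = [(0.0, 50.0, bf_hz, "Distracter")]
    for delta_f in delta_fs:
        low, high = flanker_frequencies(bf_hz, delta_f)
        events.append((flanker_delay_ms, 50.0, low, "Flanker"))
        events.append((flanker_delay_ms, 50.0, high, "Flanker"))
    # Targets share the Distracter frequency and start 123 ms later.
    events.append((123.0, 50.0, bf_hz, "Target"))
    return events
```

For example, `rmr_cycle(1000.0, delta_fs=(0.25, 0.5, 0.75))` would describe the three-pair Flanker condition for a site with a 1-kHz BF.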
All stimuli were presented via a free-field speaker (Microsatellite; Gallo) located 60° off the midline in the field contralateral to the recording sites and 1 m away from the animal's head (Crist Instruments). Sound intensity was measured with a sound level meter (type 2236; Bruel and Kjaer) positioned at the location of the animal's ear. The frequency response of the speaker was essentially flat (within ±5 dB SPL) over the frequency range tested. Pure-tone stimuli and individual tones comprising RMR stimuli (Targets, Distracters, and Flankers) were presented at 60 dB SPL. Thus the complex sounds formed by the addition of the two Flankers to the Distracter tones had an overall level of ∼65 dB SPL.
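The ∼65 dB SPL figure follows from summing the powers of the three equal-level components (the Distracter and two Flankers). The short calculation below assumes that the tones, being of different frequencies, add in power rather than in amplitude:

```latex
L_{\text{total}}
  = 10 \log_{10}\!\left( 3 \times 10^{60/10} \right)
  = 60 + 10 \log_{10} 3
  \approx 60 + 4.8
  = 64.8 \ \text{dB SPL}
```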
Surgical Procedure
Under pentobarbital anesthesia and with the use of aseptic techniques, holes were drilled bilaterally into the dorsal skull to accommodate matrices composed of 18-gauge stainless steel tubes glued together in parallel. Tubes served to guide electrodes toward A1 for repeated intracortical recordings. Matrices were stereotaxically positioned to target A1. They were oriented at a 20–30° anterior-posterior angle and with a slight medial-lateral tilt in order to direct electrode penetrations perpendicular to the superior surface of the superior temporal gyrus, thereby satisfying one of the major technical requirements of one-dimensional current source density (CSD) analysis (Vaughan and Arezzo 1988). Matrices and Plexiglas bars, used for painless head fixation during the recordings, were embedded in a pedestal of dental acrylic secured to the skull with inverted bone screws. Peri- and postoperative antibiotic and anti-inflammatory medications were always administered. Recordings began after a 2-wk postoperative recovery period.
Neurophysiological Recordings
Recordings were conducted in an electrically shielded, sound-attenuated chamber. Monkeys were monitored via video camera throughout each recording session. To confirm and maintain alertness, the investigator entered the recording chamber and observed the animals before each stimulus block. Between stimulus blocks, animals also received liquid reinforcements and preferred treats. However, the possibility of intermittent drowsiness cannot be excluded. Data acquisition began ∼10 s after the beginning of RMR sequence presentation to allow neural responses to reach a “steady state” and to exclude potential “buildup” effects associated with stream segregation (Micheyl et al. 2005).
Intracortical recordings were performed with linear-array multicontact electrodes containing 16 contacts, evenly spaced at 150-μm intervals (U-Probe; Plexon). Individual contacts were maintained at an impedance of ∼200 kΩ. An epidural stainless steel screw placed over the occipital cortex served as a reference electrode. Neural signals were band-pass filtered from 3 Hz to 3 kHz (roll-off 48 dB/octave) and digitized at 12.2 kHz with an RA16 PA Medusa 16-channel preamplifier connected via fiber-optic cables to an RX5 data acquisition system (Tucker-Davis Technologies). Signals were averaged online by computer to yield auditory evoked potentials (AEPs). CSD analyses characterized the laminar pattern of net current sources and sinks within A1 generating the AEPs and were used to identify the laminar location of concurrently recorded multiunit activity (MUA). CSD was calculated with a 3-point algorithm that approximates the second spatial derivative of voltage recorded at each recording contact (Freeman and Nicholson 1975; Nicholson and Freeman 1975).
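A three-point estimate of the second spatial derivative can be sketched as below. This is a minimal illustration rather than the analysis code used in the study: the function name is hypothetical, tissue conductivity is omitted (treated as a constant scale factor), and the sign convention (negative second difference, so that sinks come out negative) is an assumption.

```python
def csd_three_point(potentials, spacing_um=150.0):
    """Approximate CSD at the interior contacts of a linear electrode array.

    potentials: voltages at equally spaced contacts, ordered by depth.
    Returns len(potentials) - 2 values, one per interior contact, computed
    as the negative second spatial difference divided by spacing squared.
    Conductivity is left out, so the units are arbitrary (assumption).
    """
    h2 = spacing_um ** 2
    return [
        -(potentials[i - 1] - 2.0 * potentials[i] + potentials[i + 1]) / h2
        for i in range(1, len(potentials) - 1)
    ]
```

With 16 contacts this yields 14 CSD estimates, because the outermost contacts lack a neighbor on one side.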
To derive MUA, signals were simultaneously high-pass filtered at 500 Hz (roll-off 48 dB/octave), full-wave rectified, and then low-pass filtered at 520 Hz (roll-off 48 dB/octave) prior to digitization and averaging (see Super and Roelfsema 2005 for a methodological review). MUA is a measure of the envelope of summed action potential activity of neuronal aggregates within a sphere of ∼100-μm diameter surrounding each recording contact (Brosch et al. 1997; Super and Roelfsema 2005; Vaughan and Arezzo 1988). With the use of an electrode impedance similar to that of the electrodes utilized in the present study, MUA and single-unit recordings in primary visual cortex were shown to yield similar results with regard to orientation tuning of closely spaced neurons in the vicinity of the electrode (Super and Roelfsema 2005). While MUA displays greater response stability than single-unit activity (Nelken et al. 1994; Stark and Abeles 2007), details of activity occurring in single neurons are lost.
Peristimulus time histograms (PSTHs) provided a complementary and more selective measure of action potential activity in neuronal ensembles. PSTHs were derived from unrectified high-pass-filtered data with a custom spike detection program implemented in MATLAB (MathWorks). Trigger thresholds were set at 4 standard deviations above and below the mean baseline activity. As RMR sequences were presented in a continuous loop (with no designated prestimulus interval), the baseline used for PSTH analyses was defined as the mean activity occurring within the first 100 ms of response epochs evoked by Targets Alone sequences.
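The thresholding step can be illustrated with a short sketch. It does not reproduce the custom MATLAB program mentioned above; the function name is hypothetical, and counting only the first sample of each threshold excursion as a spike is an assumption.

```python
from statistics import mean, pstdev


def detect_spikes(trace, baseline, n_sd=4.0):
    """Return sample indices at which the trace first leaves the threshold band.

    Thresholds sit n_sd standard deviations above and below the baseline mean.
    Assumption: each contiguous excursion outside the band counts as one spike,
    time-stamped at its first sample.
    """
    mu = mean(baseline)
    sd = pstdev(baseline)
    upper, lower = mu + n_sd * sd, mu - n_sd * sd
    spikes = []
    was_outside = False
    for index, value in enumerate(trace):
        is_outside = value > upper or value < lower
        if is_outside and not was_outside:
            spikes.append(index)
        was_outside = is_outside
    return spikes
```

Spike indices pooled across trials and binned in time would then give the PSTH.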
Positioning of electrodes was guided by online examination of click-evoked potentials. Pure-tone stimuli were delivered when the electrode channels bracketed the inversion of early AEP components and when the largest MUA was situated in the middle channels. Evoked responses to 50 presentations of each pure-tone stimulus were averaged with an analysis time of 500 ms (including a 100-ms prestimulus baseline interval). After determination of the BF, RMR stimuli were presented. One hundred to one hundred fifty responses to individual tones comprising RMR sequences were acquired and averaged online, with a 225-ms analysis window whose onset was locked to the onset of the Distracter tones (even when there was no Distracter tone present, as in the Targets Alone condition). Single-trial data from which averaged responses were derived were stored for off-line statistical and PSTH analyses.
At the end of the recording period, monkeys were deeply anesthetized with pentobarbital sodium and transcardially perfused with 10% buffered formalin. Tissue was sectioned in the coronal plane (80-μm thickness) and stained for Nissl substance to reconstruct the electrode tracks and to identify A1 according to previously published physiological and histological criteria (Kaas and Hackett 1998; Merzenich and Brugge 1973; Morel et al. 1993). On the basis of these criteria, all electrode penetrations considered in this report were localized to A1. However, the possibility that some sites situated near the lower-frequency anterolateral border of A1 were located in the rostral field R cannot be excluded.
General Data Analysis
MUA and PSTH data analyzed in the present study were based on responses recorded within lower lamina 3, as identified by the presence of large-amplitude initial current sinks that are balanced by concurrent superficial sources in upper lamina 3 (Steinschneider et al. 1992). The BF was defined as the pure tone frequency eliciting the maximal MUA or PSTH amplitude within a time window of 10–75 ms after stimulus onset. This time window brackets the duration of the responses to the individual 50-ms tones comprising the RMR sequences.
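The BF definition amounts to a peak search restricted to the 10–75 ms window. A minimal sketch, assuming the pure-tone responses are held in a dictionary keyed by tone frequency (the data layout and function name are hypothetical):

```python
def best_frequency(frf, times_ms, window_ms=(10.0, 75.0)):
    """Return the tone frequency whose response peaks highest inside the window.

    frf: dict mapping tone frequency (Hz) to a response waveform (MUA or PSTH)
    sampled at the time points in times_ms, measured from stimulus onset.
    """
    start, stop = window_ms

    def peak(waveform):
        return max(v for t, v in zip(times_ms, waveform) if start <= t <= stop)

    return max(frf, key=lambda freq: peak(frf[freq]))
```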
It was not possible to conduct a statistically reliable analysis of responses in supragranular and infragranular layers, as frequencies of the stimulus components comprising RMR stimuli were tailored to match the BF of neurons located in lower lamina 3, which was not always identical to that of neurons located at more supragranular and infragranular depths. This potential discrepancy may be due in part to the insertion angle of the multicontact electrode relative to a “cortical column” after traversing many millimeters of brain tissue from the dorsal surface of the brain. However, previous findings based on MUA recorded in lower lamina 3 are generally concordant with results obtained from single neurons located in various cortical layers of A1 (e.g., Fishman and Steinschneider 2009; Scholl et al. 2010), suggesting that response patterns in lower lamina 3 are largely representative of those recorded in other cortical laminae. Importantly, lamina 3 neurons in A1 send their output projections to nonprimary auditory areas (Galaburda and Pandya 1983; Jones et al. 1995). Thus responses in lower lamina 3 may reflect information transmitted to higher-level cortical areas for further sound processing.
Analysis of Neural Responses to RMR Sequences
Neural responses to the RMR sequences were analyzed as follows. For each electrode penetration and under each stimulus condition, the difference between the peak amplitude of spike activity evoked by the Targets and by the Distracters in single trials was computed. Peak amplitude of the onset response was chosen for analysis instead of other measures (e.g., total spike count) because onset responses are the most prominent features of responses to short-duration sounds and signal the occurrence of a new acoustic event, thus making them relevant for auditory scene analysis (see, e.g., Petkov et al. 2007). Peaks were defined as the maximal amplitude of spike activity occurring within a 50-ms time window beginning at the onset of the Distracter and Target tones. A nondirectional paired t-test based on these peak response amplitude differences was used to test the aforementioned hypothesis (α-level = 0.05). A significant positive difference between peak responses to Targets and to Distracters (when accompanied by simultaneous Flankers) would indicate a “Target-selective” neural population, whereas a significant negative difference would indicate a “Complex-tone-selective” neural population that responds more robustly to the Distracter-Flanker complexes than to the pure-tone Targets. Standard error of the mean difference between Distracter and Target responses was estimated via bootstrap resampling methods using a custom program implemented in MATLAB (20,000 resamples; Efron and Tibshirani 1993). Similar paired (directional) t-tests were performed on mean data (averaged over single trials) across sites.
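The single-trial comparison can be sketched as a paired t statistic on Target-minus-Distracter peak differences, plus a bootstrap estimate of the standard error of their mean. This is an illustration only, not the custom MATLAB program; the function names are hypothetical, and the p value (which requires the t distribution with the returned degrees of freedom) is left out.

```python
import math
import random


def paired_t(target_peaks, distracter_peaks):
    """Return (mean difference, t statistic, degrees of freedom).

    Differences are Target minus Distracter, so a positive t points toward a
    "Target-selective" site and a negative t toward a "Complex-tone-selective" one.
    """
    diffs = [t - d for t, d in zip(target_peaks, distracter_peaks)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((x - mean_diff) ** 2 for x in diffs) / (n - 1)
    return mean_diff, mean_diff / math.sqrt(variance / n), n - 1


def bootstrap_se(diffs, n_resamples=20000, seed=0):
    """Bootstrap standard error of the mean of diffs (resampling with replacement)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n = len(diffs)
    means = [sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)]
    grand = sum(means) / n_resamples
    return math.sqrt(sum((m - grand) ** 2 for m in means) / (n_resamples - 1))
```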
We also examined whether sites displaying “Target-selective” and “Complex-tone-selective” neural response patterns differed with respect to other response properties, including BF and tuning bandwidth. To this end, we fitted FRFs with a rounded-exponential, roex(p, r), model (Fishman and Steinschneider 2006; Glasberg and Moore 1990) to derive estimates of the BF and 3-dB excitatory tuning bandwidth. Results of nondirectional paired t-tests, used to characterize sites as “Target-selective” or as “Complex-tone-selective” (as described above), were then analyzed as a function of the BFs and bandwidths derived from the model fits.
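For orientation, the rounded-exponential filter shape described by Glasberg and Moore (1990) is commonly written roex(p, r), with weight W(g) = (1 − r)(1 + pg)e^(−pg) + r, where g is the normalized deviation from the center frequency. The sketch below evaluates that function and finds its half-power point by bisection. Treating the fitted FRF as a symmetric power weighting and reading "3-dB" as the half-power point are assumptions; the exact fitting procedure is not given in the text.

```python
import math


def roex_pr(g, p, r):
    """Rounded-exponential weight at normalized deviation g = |f - BF| / BF."""
    return (1.0 - r) * (1.0 + p * g) * math.exp(-p * g) + r


def half_power_deviation(p, r, tol=1e-9):
    """Bisect for the g at which the weight falls to 0.5 (the -3 dB point).

    Assumption: the weight is treated as a power ratio. Returns infinity when
    the floor r is at or above 0.5, since the weight never reaches 0.5 then.
    """
    if r >= 0.5:
        return math.inf
    low, high = 0.0, 1.0
    while roex_pr(high, p, r) > 0.5:
        high *= 2.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if roex_pr(mid, p, r) > 0.5:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)
```

Under these assumptions, a symmetric filter would have a 3-dB bandwidth of twice `half_power_deviation(p, r)` times the BF.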
Computational Models Used to Relate Neural and Psychophysical Responses
To examine whether current models relating neural responses to auditory streaming percepts could account for the perceptual organization of RMR sequences, we implemented a within-channel “threshold” model (Micheyl et al. 2005) and a “temporal-coherence” model (Elhilali et al. 2009) and applied them to the neural responses evoked by the RMR sequences. Model predictions were then compared with psychophysical findings from RMR experiments in which the subjects' task was to detect a Target rhythm embedded in a background of Distracters, presented in the absence or presence of synchronous or asynchronous Flankers (Turgeon et al. 2002, 2005).
Threshold model.
This model is similar to that used by Micheyl et al. (2005), which was found to successfully account for several features of the perceptual organization of alternating-tone sequences into auditory streams (see also Bee et al. 2010; Pressnitzer et al. 2008). The underlying principle is straightforward and follows directly from basic signal detection theory, as applied to neural responses: Spike counts are compared to a threshold, or, in the parlance of signal detection theory, a decision “criterion” (Green and Swets 1966). If the number of spikes in the PSTH exceeds the threshold, a “detect” decision is made. This process is assumed to take place within frequency-selective neural populations—hereafter referred to as “neural channels.” The time course of detect and nondetect decisions in the Target channel determines the percept predicted by the model. If a detect decision was made within the time frame (124–225 ms) corresponding to the presentation of a Target, this was counted as a “hit”; if a detect decision was made within a time frame (1–123 ms) corresponding to the presentation of either silence (for the Targets Alone condition) or a Distracter (in the Targets & Distracters and Targets & Distracters + Flankers conditions), this was counted as a “false alarm.”
The detection threshold was varied systematically over a range bounded below by the average spike count corresponding to spontaneous activity and above by the largest observed spike count to compute a “receiver operating characteristic” (ROC). The rationale for letting the criterion vary in the model is based on the assumption that, in psychophysical experiments, participants strive to achieve optimal performance—which requires that their detection “threshold” (or, in terms of signal detection theory, their “internal decision criterion”) is adjusted so as to detect as many Targets as possible, while at the same time rejecting as many Distracters as possible. The probability of correct responses that is achieved by using this maximum-likelihood decision strategy in a two-interval forced-choice task corresponds to the area under the ROC (Green and Swets 1966). To assess statistical significance of differences between median areas under ROCs across stimulus conditions (unpaired, 1-tailed t-test), standard error of the median was estimated by bootstrap (20,000 resamples).
Implementation details are as follows. First, for each stimulus condition and each cortical site, the number of spikes in the PSTH falling within 1-ms bins between 0 and 225 ms after the onset of the Distracter tone was tallied across all stimulus presentations (or trials)—as was done to produce the PSTHs shown in Fig. 2. The resulting number of spikes was divided by the number of trials to estimate a time-dependent spike rate. We denote the spike rate as r[k], where k refers to the bin number. Second, the “false alarm” and “hit” probabilities were estimated as:
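$$P_{\mathrm{FA}}[j] = 1 - \prod_{k=1}^{123} F\bigl(c[j],\, n\,r[k]\bigr) \tag{1}$$
and
$$P_{\mathrm{H}}[j] = 1 - \prod_{k=124}^{225} F\bigl(c[j],\, n\,r[k]\bigr) \tag{2}$$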
F(c, λ) denotes the cumulative Poisson mass function with parameter λ evaluated at c[j], the current value of the spike-count threshold (j = 1, …, m), and n denotes the number of sites in the simulated pool. The Poisson distribution is the simplest theoretically motivated description of spike-count distributions, and it has been found to provide a good approximation in many previous studies. The products over k in these two equations compute the probability that the spike count across the n sites never exceeded the threshold, c[j], during the time frame indexed by k. Since we used 1-ms time bins, the index, k, is expressed in milliseconds. One minus this product is simply the probability that the spike count exceeded the threshold at least once during the considered time frame. The number of A1 sites that are involved in human listeners' decisions concerning the presence of a tone is currently unknown. Therefore, we explored the influence of the corresponding model parameter, n, on model predictions and found that, while performance increased with increasing n, the pattern of performance across the different stimuli remained qualitatively unchanged across a wide range of n values (from n = 3 to n = 10,000). The results shown here were obtained with n set to 100, a number that was found to yield performance levels within the range observed in psychophysical studies. The ROC area was evaluated as:
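$$A = \sum_{j=1}^{m-1} \bigl(P_{\mathrm{FA}}[j] - P_{\mathrm{FA}}[j+1]\bigr)\,\frac{P_{\mathrm{H}}[j] + P_{\mathrm{H}}[j+1]}{2} \tag{3}$$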
where m denotes the index corresponding to the largest value of c, which was set to 4n times the largest value of r[k] observed for the current cortical site and stimulus condition. Although indices for stimulus condition and A1 site have been omitted in the above equations for clarity of notation, the above-described computations were carried out separately for each stimulus condition and for each A1 site. If the threshold model is sufficient to account for RMR, then the resultant areas under the model ROCs should parallel subjects' ability to detect the Target rhythm as reported in psychoacoustic RMR experiments.
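The threshold model can be sketched end to end as follows. This is an illustrative reading of the description above, not the authors' code. Several choices are assumptions: the pooled count in each 1-ms bin is taken to be Poisson with mean n·r[k], thresholds run over the integers from 0 up to 4n times the largest rate, the ROC is closed with the points (1, 1) and (0, 0), and the area is computed with the trapezoidal rule.

```python
import math


def poisson_cdf(c, lam):
    """P(X <= c) for X ~ Poisson(lam), with integer c >= 0."""
    term = math.exp(-lam)
    total = term
    for i in range(1, c + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def exceed_probability(rates, c, n_sites):
    """Probability that the pooled spike count exceeds c in at least one bin."""
    never_exceeds = 1.0
    for rate in rates:
        never_exceeds *= poisson_cdf(c, n_sites * rate)
    return 1.0 - never_exceeds


def roc_area(rate, n_sites=100, split=123):
    """Area under the model ROC for one site and one stimulus condition.

    rate: per-bin spike rates r[k] for 1-ms bins, bin 1 at Distracter onset.
    Bins 1..split are the false-alarm frame; the remaining bins are the hit frame.
    """
    fa_bins, hit_bins = rate[:split], rate[split:]
    c_max = math.ceil(4 * n_sites * max(rate))
    points = [(1.0, 1.0)]  # assumed ROC endpoint for a zero-effort criterion
    for c in range(c_max + 1):
        points.append((exceed_probability(fa_bins, c, n_sites),
                       exceed_probability(hit_bins, c, n_sites)))
    points.append((0.0, 0.0))  # assumed endpoint for an unreachable criterion
    area = 0.0
    for (fa0, hit0), (fa1, hit1) in zip(points, points[1:]):
        area += (fa0 - fa1) * (hit0 + hit1) / 2.0  # trapezoidal rule
    return area
```

An area near 0.5 would mean the Target frame is no more detectable than the Distracter frame, whereas an area near 1 would correspond to a clearly detectable Target rhythm.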
Temporal-coherence model.
To examine whether a direct application of the temporal-coherence model to neural responses in A1 could explain the enhanced detection of the Target rhythm in RMR, we followed an approach similar to that described by Elhilali et al. (2009). In that study, the model tested whether temporal coherence of responses across tonotopically organized neural channels in A1 could explain the perceptual organization of alternating and simultaneous tones. Testing the temporal-coherence model requires that neuronal responses in the Target channel and the upper and lower Flanker channels be compared and their correlation evaluated over time. As we were unable to record simultaneously from three sites with BFs corresponding to the Target, upper Flanker, and lower Flanker, we simulated responses in the Flanker channels elicited under the Targets & Distracters + Flankers condition. A similar approach was used by Elhilali et al. (2009) to simulate the tonotopic distribution of activity in A1 evoked by a stimulus comprised of two tones of different frequency based on the responses recorded at a single site. To this end, we presented two additional sets of stimulus conditions in which the Flanker tone was set at the BF of the site and the frequency of the Targets and Distracters was set at 0.25 octaves above or below the BF. Thus responses recorded under these conditions simulate what each Flanker site “sees” when the Distracters and Targets are presented simultaneously with the Flankers under the ΔF = 0.25 condition. The temporal-coherence model was then tested by evaluating the correlations over time between neural responses evoked in the Target/Distracter channel and in the (simulated) Flanker channels.
To evaluate the temporal coherence of responses across the Target/Distracter and Flanker channels, we computed Pearson product-moment correlation coefficients between the temporal response patterns evoked by Targets & Distracters + Flankers sequences with the Target (and Distracter) frequency set equal to the BF and the temporal response patterns evoked by Targets & Distracters + Flankers sequences with the upper and lower Flanker frequency set equal to the BF. We then computed the eigenvectors of the resulting three-by-three correlation matrix. The eigenvector(s) associated with the largest significant eigenvalue(s) were taken to represent the channel groupings (or “streams”) predicted by the model, as in Elhilali et al. (2009). The eigenvectors were normalized so that their largest element was equal to 1. Normalized in this way, the eigenvectors can be interpreted as “weight vectors,” which indicate how much (as a proportion) each neural “channel” (i.e., neural populations with a BF equal to the frequency of the upper Flanker, the lower Flanker, or the Target/Distracter) contributes to the corresponding stream. Thus these normalized eigenvectors reveal the spectral content of the auditory stream percepts predicted by the temporal-coherence model. If the temporal-coherence model is sufficient to explain RMR, then the auditory stream percepts predicted by the model should correspond to the two perceptual streams heard during RMR, one comprised of the Targets and the other of the Distracter-Flanker complexes.
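The correlation and eigenvector steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the three response vectors are invented toy data, the function names are hypothetical, and the eigenvectors are obtained by power iteration with deflation (a technique swapped in here to keep the example free of external libraries). The test for significance of eigenvalues is not reproduced.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(channels):
    return [[pearson(a, b) for b in channels] for a in channels]

def matvec(mat, v):
    return [sum(m * x for m, x in zip(row, v)) for row in mat]

def leading_eigenpair(mat, iters=1000):
    """Power iteration for a symmetric positive semidefinite matrix."""
    v = [1.0 + 0.1 * i for i in range(len(mat))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(mat, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(mat, v)))
    return eigenvalue, v

def top_eigenpairs(mat, count=2):
    """Largest eigenpairs, found by repeatedly deflating the matrix."""
    work = [row[:] for row in mat]
    pairs = []
    for _ in range(count):
        lam, v = leading_eigenpair(work)
        pairs.append((lam, v))
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def to_weight_vector(v):
    """Scale an eigenvector so that its largest-magnitude element equals 1."""
    peak = max(v, key=abs)
    return [x / peak for x in v]

# Invented toy responses (one value per tone event), not recorded data.
lower_flanker = [0.1, 1.0, 0.2, 0.9, 0.1, 1.0, 0.9, 0.2]
upper_flanker = [0.2, 0.9, 0.1, 1.0, 0.2, 0.9, 1.0, 0.1]
target_distracter = [1.0, 0.3, 0.9, 0.4, 1.0, 0.2, 0.3, 0.9]

corr = correlation_matrix([lower_flanker, upper_flanker, target_distracter])
streams = [(lam, to_weight_vector(v)) for lam, v in top_eigenpairs(corr)]
```

Each entry of `streams` pairs an eigenvalue with a weight vector over the three channels (lower Flanker, upper Flanker, Target/Distracter). In these toy data the two Flanker channels carry weights of the same sign in the leading vector, and the Target/Distracter channel carries the opposite sign.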
RESULTS
Results are based on multiunit responses evoked by RMR sequences in a total of 48 electrode penetrations into A1 of three macaque monkeys (monkeys A, D, and E). Data from two additional electrode penetrations were obtained in a fourth animal (monkey L) to examine effects of adding additional pairs of Flankers to the Distracters. A1 sites displaying multipeaked tuning were uncommon (6 sites in the 3 animals) and are not considered in this report. All other sites displayed a clear BF and sharp frequency tuning characteristic of small neural populations in A1 (Fishman and Steinschneider 2009). Mean onset latency and mean 6-dB bandwidths of FRFs of MUA recorded in lower lamina 3 evoked by tones presented at 60 dB SPL were ∼14 ms and ∼0.75 octaves, respectively. These values are comparable to those reported for single neurons in A1 of awake monkeys (Recanzone et al. 2000). BF of recording sites ranged from 200 to 14,000 Hz. Data obtained under ΔF = 0.5 and 1.0 octave conditions at sites with a BF <250 Hz or >12 kHz were not included, as the level of Flanker tones under these conditions was significantly attenuated because of the frequency response of the speakers.
Responses to RMR Sequences: Individual Sites
“Target-selective” sites.
PSTHs of spike activity evoked by RMR sequences at a representative “Target-selective” site are shown in Fig. 2. The FRF of the site is shown in Fig. 2A. Frequency of Target and Distracter tones (set equal to the BF) and frequencies of Flanker tones (set at 0.25 octaves above and below the BF) are indicated by the three dashed vertical lines superimposed upon the FRF. Responses evoked by Targets Alone sequences and by Targets & Distracters sequences are shown in Fig. 2B. Targets & Distracters sequences evoke two onset responses, occurring at twice the rate of those evoked under the Targets Alone condition. Responses to Distracters and Targets have comparable peak amplitudes that are not significantly different from each other (paired, 1-tailed t-test: t = −0.99; P > 0.10; df = 149). However, when Flankers are presented synchronously with the Distracters, responses to Distracters are markedly diminished relative to responses to the Targets (Fig. 2C). Significant attenuation of responses to Distracters in the presence of Flankers is observed under all three ΔF conditions tested (paired, 1-tailed t-tests: all P < 0.00001; df = 149). These latter results are consistent with psychoacoustic data indicating that RMR is largely independent of the ΔF between Distracters and Flankers over a ∼1-octave range (Turgeon et al. 2002). Findings illustrated in this example support the hypothesis that “Target-selective” sites may be generated by suppression of responses to Distracters by the synchronous Flankers. Quantitatively similar results are obtained for MUA data (not shown).
One prediction of our hypothesis is that, at “Target-selective” sites, suppression of responses to Distracters by Flankers will be maximal when Flankers are presented synchronously with the Distracters and reduced when Flankers and Distracters are asynchronous. Results are consistent with this prediction, as exemplified by PSTHs of spike activity recorded at three representative sites shown in Fig. 3, A–C, respectively. Onset responses evoked by Targets and Distracters under the Targets & Distracters condition have comparable peak amplitudes that are not significantly different from each other (paired, 1-tailed t-tests; all sites P > 0.10; df = 149). When Flankers (ΔF = 0.25 octaves) are presented synchronously with the Distracters, responses to Distracters are significantly attenuated relative to responses to Targets (paired, 1-tailed t-tests; all sites P < 0.00001; df = 149). However, attenuation of responses to Distracters disappears when Flankers are delayed by 40 ms relative to the Distracters (paired, 1-tailed t-tests; all sites P > 0.10; df = 149). Thus the relative peak amplitudes of responses to Distracters and Targets under the delayed Flankers condition are comparable to those observed under the Targets & Distracters condition in the absence of Flankers.
“Complex-tone-selective” sites.
PSTHs of spike activity evoked by RMR sequences at three example “Complex-tone-selective” sites are shown in Fig. 4, A–C, respectively. FRFs of the sites are included on the left of the figure (same conventions as in Figs. 2 and 3). In contrast to the suppression of Distracter responses by the simultaneous Flankers at the “Target-selective” sites (Figs. 2 and 3), “Complex-tone-selective” sites display enhanced responses to Distracters when they are accompanied by the simultaneous Flankers. Responses to the Distracter-Flanker complexes are significantly larger than responses to Targets (paired, 1-tailed t-tests; all sites P < 0.00001; df = 149).
Distribution of “Target-selective” and “Complex-tone-selective” sites.
The response patterns illustrated in Figs. 2–4 support the hypothesis that A1 contains two functionally distinct populations of neurons, one that responds robustly to the pure-tone Targets and whose activity is suppressed by the Distracter-Flanker complexes and another that responds more robustly to the Distracter-Flanker complexes than to the pure-tone Targets. Together, these two neural populations may be sufficient to account for the perceptual phenomenon of RMR. The former would be suited for detecting the Target rhythm—the primary task evaluated in psychoacoustic studies of RMR—while the latter could account for subjects' ability to attend selectively to the rhythm carried by the Distracter-Flanker complexes. To examine the prevalence of each of these two populations in A1, we characterized individual sites as “Target-selective” or “Complex-tone-selective” according to the results of nondirectional paired t-tests, with significant positive t-values (Target responses > Distracter-Flanker responses) and negative t-values (Target responses < Distracter-Flanker responses) indicating “Target-selective” and “Complex-tone-selective” responses, respectively. We then examined whether the different response patterns evoked by RMR sequences were associated with more basic response properties of small neural ensembles, including BF and excitatory tuning bandwidth.
Results of this analysis are represented in Fig. 5, which shows scatterplots of t-values as a function of log relative excitatory tuning bandwidth (3-dB bandwidth divided by BF) and BF under the ΔF = 0.25 octave condition. Points above and below the horizontal dashed lines at critical t-values = ±1.96 represent sites displaying statistically significant differences (α-level = 0.05) between responses to Targets and Distracter-Flanker complexes (paired, nondirectional t-test; df > 125 for all sites). Hence, points above and below the dashed lines represent “Target-selective” and “Complex-tone-selective” sites, respectively. The majority of our sample of A1 sites displayed a “Target-selective” response pattern (24 of 48 sites), with only seven sites displaying significantly larger responses to the Distracter-Flanker complexes (1 site with exceptionally broad tuning is excluded from the plot of t-values vs. log relative excitatory tuning bandwidths, as its excitatory bandwidth could not be determined). Interestingly, t-values were significantly negatively correlated with log relative excitatory tuning bandwidths (Spearman r = −0.62; P < 0.0001) and positively correlated with BF (Spearman r = 0.42; P < 0.005). A significant, although smaller, negative correlation between t-values and log relative excitatory tuning bandwidths was also observed under the ΔF = 0.5 octave condition (Spearman r = −0.35; P < 0.02; Spearman correlations were not significant for the ΔF = 1.0 octave condition, perhaps, in part, because of the smaller sample size). The present findings suggest that while some sites in A1 may show significantly enhanced responses to the Distracters in the presence of the simultaneous Flankers, suppression of responses to Distracters by Flankers (“Target selectivity”) represents the predominant response pattern in A1. 
Accordingly, to the extent that A1 is involved in the perceptual phenomenon of RMR, it would be engaged primarily in the task of detecting a narrowband Target rhythm embedded in the background of Distracters, i.e., the task typically evaluated in psychoacoustic studies of RMR.
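The site classification described above can be sketched as follows, under stated assumptions. The paired t statistic is computed from per-presentation response amplitudes, and the ±1.96 criterion is the large-sample critical value used in Fig. 5. The function names, the "nonselective" label, and the example amplitudes are hypothetical and are not taken from the study.

```python
import math

def paired_t(target_resp, complex_resp):
    """Paired t statistic for Target minus Distracter-Flanker responses."""
    diffs = [t - c for t, c in zip(target_resp, complex_resp)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean / math.sqrt(var / n)

def classify_site(target_resp, complex_resp, t_crit=1.96):
    """Label a site from the sign and size of the paired t statistic."""
    t = paired_t(target_resp, complex_resp)
    if t > t_crit:
        return "Target-selective"
    if t < -t_crit:
        return "Complex-tone-selective"
    return "nonselective"  # hypothetical label for nonsignificant sites

# Invented peak amplitudes for a handful of stimulus presentations.
targets = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0]
complexes = [5.0, 6.0, 5.0, 7.0, 6.0, 6.0]
label = classify_site(targets, complexes)
```

With only a few presentations, as in this toy example, the 1.96 criterion is too lenient; it is appropriate only for large df, as in the data reported here (df > 125).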
Responses to RMR Sequences: Mean Population Data
Given the prevalence of “Target-selective” sites demonstrated above, we sought to determine whether overall population activity averaged across sites in A1 also displayed a “Target-selective” response pattern. Mean PSTHs evoked by RMR sequences under synchronous and asynchronous Distracters + Flankers conditions are shown in Fig. 6, A and B, respectively. For display purposes, PSTHs have been temporally aligned relative to the peak of Target-evoked responses prior to averaging across sites to compensate for latency differences across sites with different BFs (see, e.g., Lakatos et al. 2005). Peak amplitudes of mean responses to Targets and Distracters under the Targets & Distracters condition are comparable, whereas responses to Distracters are suppressed when Flankers are added synchronously to them. Suppression of responses to Distracters is observed under all ΔF conditions and disappears when Flankers are delayed by either 20 or 40 ms relative to the onset of the Distracters. Accordingly, results of paired one-tailed t-tests shown in Fig. 6C indicate nonsignificant differences between peak amplitudes of Distracter- and Target-evoked responses under the Targets & Distracters condition and significantly larger responses to Targets when Flankers are presented synchronously with the Distracters under all ΔF conditions (P < 0.0001). However, when Flankers are delayed relative to Distracters, differences between Distracter- and Target-evoked responses are not statistically significant (P > 0.05). Mean population data in A1 thus show a “Target-selective” response pattern similar to that observed in the majority of individual sites and are consistent with the hypothesis that perceptual segregation of Targets from Distracter tones in RMR is facilitated by the suppression of Distracter-evoked responses by the synchronous Flankers. Similar results are obtained for complementary MUA measures of neuronal spiking activity (Fig. 7).
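The alignment of PSTHs to the peak of the Target-evoked response before averaging can be sketched as follows. This is a minimal illustration, not the procedure used to generate Fig. 6. It assumes that the peak is sought within a fixed search window, that each PSTH is shifted by whole bins, and that bins shifted in from outside the record are filled with zeros. The function names and example values are hypothetical.

```python
def align_to_peak(psth, search_window, ref_index):
    """Shift a PSTH so that its peak within search_window lands on ref_index.

    Assumption: bins that have no source sample after the shift are set to 0.
    """
    start, stop = search_window
    peak_index = max(range(start, stop), key=lambda i: psth[i])
    shift = ref_index - peak_index
    aligned = [0.0] * len(psth)
    for i, value in enumerate(psth):
        j = i + shift
        if 0 <= j < len(psth):
            aligned[j] = value
    return aligned

def mean_psth(psths):
    """Bin-by-bin average across sites (all PSTHs must have equal length)."""
    n_sites = len(psths)
    return [sum(bins) / n_sites for bins in zip(*psths)]

# Invented PSTHs from two sites whose response peaks differ by one bin.
site_a = [0.0, 0.0, 5.0, 1.0, 0.0, 0.0]
site_b = [0.0, 0.0, 0.0, 4.0, 1.0, 0.0]
aligned = [align_to_peak(p, (0, 6), 2) for p in (site_a, site_b)]
population = mean_psth(aligned)
```

Without alignment, the one-bin latency difference would smear the averaged peak across two bins; after alignment, the population peak is the mean of the two site peaks.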
Modeling Results
There currently exist two main types of models describing the relationship between neural responses and auditory stream percepts: models based on “spatial” (e.g., tonotopic) separation between neural populations—the neural “population-separation” model (here evaluated with a neural response threshold)—and models based on temporal correlations between neural responses—the “temporal-coherence” model. In the following, we quantitatively test whether these models, when applied to the neural responses obtained in the present study, can account for the perceptual organization of RMR sequences.
Threshold model.
Predictions of the threshold model are illustrated in Fig. 8. Figure 8A shows the median ROCs derived from PSTH data obtained under each stimulus condition tested in the study. Figure 8B shows the median area under the ROCs in Fig. 8A, which provides a quantitative measure of the neural discrimination of the Target rhythm under each stimulus condition. According to the hypothesis that performance in the two-interval forced-choice psychophysical rhythm-discrimination experiments (Turgeon et al. 2002, 2005) depends on the detection of the Target tones by neurons tuned to the Target frequency, the area under the ROC should vary across stimulus conditions in a manner that parallels human listeners' performance. Accordingly, the ROC area should be 1) close to 1.0 (perfect performance) in the Targets Alone condition, 2) close to 0.5 (chance performance) in the Targets & Distracters condition, and 3) close to 1.0 in the Targets & Distracters + Flankers conditions. Moreover, for the model to account successfully for the effects of Flanker-Distracter asynchrony on RMR, the ROC area should be 4) smaller when Flankers are delayed than when they are presented synchronously with the Distracters. Finally, ROC area should be 5) larger in the 20-ms delay condition than in the 40-ms delay condition and close to chance (0.5) in the 40-ms delay condition (see Turgeon et al. 2002, 2005). The results shown in Fig. 8B are consistent with these predictions. Median ROC areas in the Targets Alone and Targets & Distracters + Flankers conditions are significantly larger than the median ROC area in the Targets & Distracters condition (unpaired, 1-tailed t-test; P values are indicated in figure). On the other hand, median ROC area in the 40-ms delay condition is not significantly different from that in the Targets & Distracters condition. An intermediate value is obtained in the 20-ms delay condition. 
These results suggest that a simple threshold model is sufficient to account for the enhanced perceptual salience of the target rhythm in RMR.
Temporal-coherence model.
Whereas a simple threshold model appears to be sufficient to account for RMR, it may not be necessary. Hence, we evaluated whether the temporal-coherence model could also successfully account for RMR. Specifically, the auditory stream percepts predicted by the temporal-coherence model should correspond to the two perceptual streams that are heard during RMR, one comprised of the Targets and the other of the Distracter-Flanker complexes. Figure 9A, left, shows the mean MUA evoked under the Targets & Distracters + Flankers condition (ΔF = 0.25 octaves) for the three stimulus configurations relative to the BF that were used to evaluate the temporal-coherence model: BF equal to the Target frequency (Target & Distracter channel), the lower-Flanker frequency (lower Flanker channel), and the upper-Flanker frequency (upper Flanker channel). The resulting two-dimensional profile simulates the spatiotemporal response pattern in A1 that would be expected if neurons with BFs corresponding to the Target tones and neurons with BFs corresponding to the Flanker tones were recorded from simultaneously. The resulting spatiotemporal profile is as predicted: It shows a strong response to the Flankers and a weak response to the Targets in the two channels tuned to the Flankers and larger responses to the Targets in the channel tuned to the Distracters and Targets.
The across-channel correlation matrix is shown next to the MUA waveforms in Fig. 9A. As expected from the similar temporal activation pattern in the two Flanker channels, the correlation coefficient for these two channels is close to 1.0. Lower correlations are obtained between responses in the upper and lower Flanker channels and those in the Target/Distracter channel. These lower correlations reflect the fact that the response to the Target is large in the Target/Distracter channel while it is small in the Flanker channels.
Figure 9B shows the first two normalized eigenvectors of the across-channel correlation matrix. Each element in one of these eigenvectors indicates the proportion to which a given BF channel “contributes” to the corresponding stream (as explained in materials and methods); thus these two normalized eigenvectors represent the spectral content of the two streams predicted by the temporal-coherence model. Figure 9B reveals that, according to the temporal-coherence model, stimuli under the Targets & Distracters + Flankers condition should evoke a percept of two streams, one corresponding mainly to the Flanker tones and the other corresponding mainly to the Target and Distracter tones. Importantly, while the temporal-coherence model is capable of “grouping together” highly correlated activity occurring in the Flanker channels into a single stream and segregating this activity from the less-correlated activity occurring in the Target/Distracter channel, it fails to segregate responses evoked by the Targets from those evoked by the Distracters. Hence, when applied to the responses of frequency-selective neural populations in A1, the temporal-coherence model cannot replicate the perceptual organization heard during RMR.
On the other hand, the temporal-coherence model can successfully account for RMR if one considers responses of “combination-selective” neuronal populations that are tuned to higher-level acoustic features, e.g., the spectral composition of the Distracter-Flanker complexes. Combination-selective neurons, which show enhanced responses to particular simultaneous tone combinations, have been found in both primary and nonprimary auditory cortex of monkeys (Kanwal and Rauschecker 2007; Sadagopan and Wang 2009; Schwarz and Tomlinson 1990). Such neurons could be tuned so that they respond optimally when Flankers are presented simultaneously with the Distracters and suboptimally when the pure-tone Targets are presented. Indeed, this response pattern is exemplified by the “Complex-tone-selective” sites identified in the present study, which show larger responses to the Distracter-Flanker complexes than to the pure-tone Targets. As illustrated in Fig. 9C, two neural populations that are selectively tuned to the pure-tone Targets and to the Distracter-Flanker complexes, respectively, will respond in a temporally incoherent or “anticorrelated” fashion when they are stimulated by Targets & Distracters + Flankers sequences. Consequently, activity in the channel representing the Targets will form a separate stream from that in the channel representing the Distracter-Flanker complexes.
Regardless of which model (threshold or temporal coherence) is used to examine RMR, some suppressive mechanism, similar to that exemplified in the present study, must be involved in shaping neuronal selectivity to the Targets, such that neurons respond robustly to the Targets and respond weakly to the identical Distracters when they are accompanied by the simultaneous Flankers. Thus, in principle, both the threshold model and the temporal-coherence model, if applied to neural responses selective to higher-level acoustic features, can account for the perceptual organization of RMR sequences, provided that suppression and spectral integration are included to shape neuronal selectivity to the Targets and to the Distracter-Flanker complexes.
Effect of number of flankers.
As RMR can be demonstrated by using more than a single pair of Flankers, it is important to determine whether the simultaneous suppression observed in “Target-selective” sites when a single pair of Flankers is presented also occurs when multiple pairs of Flankers are presented. Indeed, based on informal listening tests, the RMR effect tends to be perceptually more robust when the Flankers contain multiple frequency components than when they each consist of a single pure tone. Hence, we examined the effects of including additional Flankers at 0.5 and 0.75 octaves above and below the Distracter frequency at two additional sites in a fourth animal (Fig. 10, A and B, respectively). Both sites display a “Target-selective” response pattern, with suppression of Distracter-evoked responses when Flankers are presented simultaneously with the Distracters and diminished or absent suppression when Flankers are delayed by 20 and 40 ms. Suppression tends to be more pronounced when more than a single pair of Flankers is added to the Distracters. Moreover, suppression of Distracter-evoked responses by the simultaneous Flankers does not depend upon the presence of the Targets, as it is also observed when Target tones are omitted from the RMR sequences (“No T” condition; Fig. 10, A and B). Thus the suppression observed when a single pair of Flankers is presented appears to represent a general neural phenomenon that is not specific to the particular RMR stimuli used in the present study.
DISCUSSION
The present study builds on previous investigations of the neural bases of auditory scene analysis by examining neural correlates of RMR, a perceptual phenomenon that combines two key principles of auditory perceptual organization: sequential segregation and simultaneous grouping (Bregman 1990). Previous studies lend support to the neural “population-separation” model of stream segregation, whereby different auditory streams are represented by separate populations of neurons that are selectively tuned to sounds comprising each stream (Bee and Klump 2004; Fishman et al. 2001; Itatani and Klump 2009; Micheyl et al. 2005). Given the frequency selectivity of neurons throughout the auditory pathway, spectral contrast is a key determinant of neural population separation. Thus, population-separation models are consistent with the psychophysical observation that frequency separation is an important factor contributing to stream segregation (Bregman 1990; Bregman et al. 2000).
However, under certain conditions, sounds that share frequency components, and that therefore excite the same frequency-selective channels, may nonetheless be perceived as belonging to separate streams (Grimault et al. 2002; Roberts et al. 2002; Vliegen and Oxenham 1999). In RMR, frequency components comprising the Targets represent a subset of the components comprising the Distracter-Flanker complexes. Thus a population-separation model based solely on neuronal frequency selectivity cannot explain the perceptual segregation of Targets from the spectrally identical Distracters. Another neural mechanism is needed. The present findings indicate that this mechanism is provided by simultaneous suppression of Distracter responses by the concurrently presented Flankers. Suppression effectively generates “Target-selective” neurons and accounts for two key perceptual features of RMR: the enhanced ability to “hear out” the Targets embedded in the sequence of identical Distracters and the reduced ability to “hear out” the Distracters when they are accompanied by the simultaneous Flankers. Importantly, simultaneous suppression is needed to generate neuronal selectivity to the Targets, even if the Targets and Distracters are complex tones, because neurons responding strongly to the Target complex must be prevented from responding when the Distracter complex (which is spectrally identical to the Target complex) is accompanied by a Flanker complex.
The present findings are thus qualitatively consistent with the neural population-separation model of stream segregation. While in previous studies the differential responses to the two tones, A and B, comprising “ABAB” sequences could be explained in terms of frequency selectivity and forward (sequential) suppression, here the differential responses to the Targets and the Distracter-Flanker complexes are attributed to simultaneous suppression across frequency channels. Several previous studies have reported suppression of responses in A1 by simultaneously presented tones, even when tones are separated in frequency by an octave or more and presented over a broad range of levels (Sadagopan and Wang 2011; Shamma and Symmes 1985; Sutter and Loftus 2003). Our findings are also consistent with results of psychoacoustic studies suggesting a role for wide-band nonlinear spectral interactions in the perceptual organization of sound sequences (Holmes and Roberts 2006; O'Connor and Sutter 2000).
Suppression of Distracter-evoked responses provides a simple cue for differentiating Targets and Distracters and for perceptually segregating the Target rhythm. Our modeling results demonstrate that this cue can be exploited by the simple and physiologically plausible mechanism of thresholding. With the threshold set high enough to selectively detect the larger Target responses, this model can replicate the qualitative pattern of perceptual performance across most stimulus conditions observed in psychophysical RMR experiments. Thus, owing to simultaneous suppression, a simple read-out mechanism is sufficient to account for the basic RMR phenomenon. A similar thresholding scheme was previously found to be sufficient to explain important aspects of auditory streaming based on neural responses to alternating pure tones differing in frequency, at the level of both the cochlear nucleus (Pressnitzer et al. 2008) and A1 (Micheyl et al. 2005).
This simple threshold-based mechanism can also account for the finding that RMR is reduced by introducing an asynchrony between the Flankers and the Distracters (Turgeon et al. 2002, 2005). A1 responses to short-duration stimuli are predominantly phasic, with a sharp onset peak occurring within 20 ms after stimulus onset. Hence, if the Flankers are delayed, onset responses to the Flankers will no longer temporally overlap and suppress onset responses to the Distracters, thus yielding Distracter and Target responses that are comparable in amplitude. One aspect of the psychoacoustic data, however, is not fully accounted for by population responses in A1, namely, the graded effect on RMR that is observed as stimulus onset asynchrony (SOA) between Flankers and Distracters is increased from 0 to 40 ms. Whereas listeners can reliably discriminate the Target rhythm at SOAs ranging from 0 to 25 ms, neural responses to Distracters tend to be fully restored from suppression at an SOA as short as 20 ms. This suggests that time constants of onset responses and simultaneous suppression in A1 are too short to account for the graded effect of SOA. Although this is speculative, the graded effect may be explained by activity occurring in other auditory cortical areas that display longer response time constants. For instance, neural responses in the rostral auditory field (area R) are more sluggish than A1 responses, exhibiting longer onset and peak latencies, diminished temporal precision, and slower decay rates (Bendor and Wang 2008; Recanzone et al. 2000; Scott et al. 2011). This temporal smearing (on the order of 10–20 ms relative to that observed in A1; Scott et al. 2011) would result in greater temporal overlap of onset responses to Distracter and Flanker tones when SOA is equal to 20 ms, and hence a longer time window during which suppression could occur. Thus response time constants in nonprimary areas may be more consistent with those needed to account for the graded effect of SOA. 
Indeed, RMR stimuli are expected to activate multiple auditory cortical fields in addition to A1, all of which could contribute to shaping the perceptual phenomenon of RMR. Physiological differences across species may also account for the discrepancy between the present physiological data and psychoacoustic data in humans.
Although the above-described population-separation model is sufficient to account for important aspects of the RMR phenomenon, results do not demonstrate that this model is necessary. Moreover, this type of model does not explicitly explain why tones that are widely separated in frequency—as in the 1-octave-separation condition—and (presumably) activate well-separated neural populations in A1, are nevertheless perceptually grouped (Elhilali et al. 2009). Indeed, in the psychoacoustic context, RMR has been explained in terms of perceptual “capture” of the Distracters by the synchronous Flankers, leading to their “perceptual grouping” (Turgeon et al. 2002, 2005). This problem is addressed by the temporal-coherence model, which extends the neural population-separation models by considering also the temporal correlations between responses of neurons tuned to different frequencies—or to other sound attributes (Shamma and Micheyl 2010). According to the temporal-coherence model, neural populations whose responses are temporally correlated are grouped together and represent a single stream, whereas neural populations whose responses are temporally uncorrelated are segregated and represent different streams.
However, we found that when a simple temporal-coherence model was applied to the responses of frequency-selective neurons evoked by Targets & Distracters + Flankers sequences, the predicted stream percepts did not correspond to those heard during RMR. Specifically, while the temporal-coherence model segregated neural activity in the Target/Distracter channel from that in the Flanker channels, it failed to segregate the Targets from the Distracters. On the other hand, the temporal-coherence model may successfully account for RMR if it operates on the output of neurons that are selectively tuned to more complex acoustic features, such as the Distracter-Flanker complexes (Fig. 9C).
Selectivity to these higher-level acoustic features may be achieved via spectral integration and suppression. Spectral integration is exemplified by “combination-selective” neurons in A1, which show enhanced responses when several pure tone components are presented simultaneously (see, e.g., Sadagopan and Wang 2009), and by neurons in lateral belt regions of auditory cortex, which respond more vigorously to frequency-modulated sweeps and band-pass noise than to pure tones (Rauschecker and Tian 2004; Tian and Rauschecker 2004). Although limited in number, the “Complex-tone-selective” neural populations identified in the present study provide additional support for the existence of combination-selective neurons in A1. As shown in Fig. 5, “Complex-tone-selective” response patterns were more common at sites with larger relative excitatory tuning bandwidths and with lower BFs, which tend to exhibit broader tuning bandwidths than sites with higher BFs (e.g., Fishman and Steinschneider 2009). These observations suggest a role for spectral integration in generating the enhanced responses to the Distracter-Flanker complexes at “Complex-tone-selective” sites. While relatively rare in our sample of sites, neurons with multipeaked frequency tuning functions, which are also indicative of spectral integration, may further contribute to combination selectivity in A1 (Kadia and Wang 2003). Thus, from a more general perspective that considers neural populations tuned to higher-level acoustic features, both the population-separation and temporal-coherence models of streaming are compatible with the perceptual phenomenon of RMR. This conclusion is concordant with the view that the brain will exploit whatever cues are available (e.g., spectral or temporal differences) to perceptually organize sounds comprising auditory scenes (Darwin and Carlyon 1995). 
Accordingly, a hybrid model that incorporates both neuronal population separation and temporal coherence may be best suited for explaining all of the relevant perceptual phenomena related to stream segregation.
An interesting question is whether neurons that respond preferentially to pure tones (Targets) and neurons that respond preferentially to complex tones (Distracter-Flanker complexes) occupy different anatomical locations in A1. We were unable to identify any clear differences between these two populations with respect to their location within A1. While our methods may lack sufficient resolution to discern a systematic organization, these two populations may instead form a “patchy” distribution in A1, an organization that is still consistent with the population-separation and temporal-coherence models of stream segregation.
While the present findings demonstrate parallels between neural response patterns in A1 and the perceptual organization of RMR sequences in humans, the relevance of A1 responses to RMR is tempered by the absence of behavioral data related to the animals' perception. A similar caveat applies to previous studies of the neural bases of auditory streaming in experimental animals. While there is behavioral evidence that many nonhuman animals perceptually organize auditory sequences similarly to humans (Fay 2008), the connection between neural response patterns and perception has been indirect. Thus the present results should be conservatively interpreted as representing a “proof of principle” concerning how neural response patterns at the cortical level may provide the physiological “building blocks” underlying the perceptual organization of RMR sequences. Present findings reinforce the view that preattentive aspects of auditory scene analysis rely on relatively basic neural response properties at the cortical level, including frequency selectivity, forward and simultaneous suppression, spectral integration, adaptation, and temporal coherence (Elhilali et al. 2009; Fishman et al. 2001; Micheyl et al. 2005). Additional mechanisms, including synchronized gamma-band activity in auditory cortex, may also contribute to perceptual grouping (or “binding”) of simultaneous acoustic components (Bidet-Caulet et al. 2008), such as those comprising the Distracter-Flanker complexes. As auditory cortical areas outside of A1 are also involved in stream segregation (Gutschalk et al. 2005), future studies should examine neural responses to RMR sequences in nonprimary auditory cortical fields and investigate how these responses are modulated by learning and attention (Fritz et al. 2007; Yin et al. 2007).
GRANTS
This work was supported by National Institute on Deafness and Other Communication Disorders Grants DC-00657 and DC-07657.
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
Author contributions: Y.I.F. and M.S. conception and design of research; Y.I.F. and M.S. performed experiments; Y.I.F. and C.M. analyzed data; Y.I.F., C.M., and M.S. interpreted results of experiments; Y.I.F. and C.M. prepared figures; Y.I.F., C.M., and M.S. drafted manuscript; Y.I.F., C.M., and M.S. edited and revised manuscript; Y.I.F., C.M., and M.S. approved final version of manuscript.
ACKNOWLEDGMENTS
We are grateful to Jeannie Hutagalung and Kyoko Kamishima for providing invaluable assistance with animal training, surgery, and data collection and to Dr. Steven Walkley for providing histological facilities. We also thank Dr. Shihab Shamma for helpful comments.
REFERENCES
- Bee MA, Klump GM. Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J Neurophysiol 92: 1088–1104, 2004
- Bee MA, Micheyl C. The cocktail party problem: what is it? How can it be solved? And why should animal behaviorists study it? J Comp Psychol 122: 235–251, 2008
- Bee MA, Micheyl C, Oxenham AJ, Klump GM. Neural adaptation to tone sequences in the songbird forebrain: patterns, determinants, and relation to the build-up of auditory streaming. J Comp Physiol A 196: 543–557, 2010
- Bendor D, Wang X. Neural response properties of primary, rostral, and rostrotemporal core fields in the auditory cortex of marmoset monkeys. J Neurophysiol 100: 888–906, 2008
- Bidet-Caulet A, Fischer C, Bauchet F, Aguera PE, Bertrand O. Neural substrate of concurrent sound perception: direct electrophysiological recordings from human auditory cortex. Front Hum Neurosci 1: 5, 2008
- Bregman AS. Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press, 1990
- Bregman AS, Ahad PA, Crum PA, O'Reilly J. Effects of time intervals and tone durations on auditory stream segregation. Percept Psychophys 62: 626–636, 2000
- Brosch M, Bauer R, Eckhorn R. Stimulus-dependent modulations of correlated high-frequency oscillations in cat visual cortex. Cereb Cortex 7: 70–76, 1997
- Carlyon RP. How the brain separates sounds. Trends Cogn Sci 8: 465–471, 2004
- Darwin CJ, Carlyon RP. Auditory grouping. In: Hearing (2nd ed.), edited by Moore BCJ. London: Academic, 1995
- Efron B, Tibshirani R. An Introduction to the Bootstrap. New York: Chapman and Hall/CRC, 1993
- Elhilali M, Ma L, Micheyl C, Oxenham AJ, Shamma SA. Temporal coherence in the perceptual organization and cortical representation of auditory scenes. Neuron 61: 317–329, 2009
- Fay RR. Sound source perception and stream segregation in nonhuman vertebrate animals. In: Auditory Perception of Sound Sources, edited by Yost WA, Popper AN, Fay RR. New York: Springer, 2008
- Fishman YI, Reser DH, Arezzo JC, Steinschneider M. Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear Res 151: 167–187, 2001
- Fishman YI, Steinschneider M. Spectral resolution of monkey primary auditory cortex (A1) revealed with two-noise masking. J Neurophysiol 96: 1105–1115, 2006
- Fishman YI, Steinschneider M. Temporally dynamic frequency tuning of population responses in monkey primary auditory cortex. Hear Res 254: 64–76, 2009
- Fishman YI, Steinschneider M. Formation of auditory streams. In: The Oxford Handbook of Auditory Science: The Auditory Brain, edited by Rees A, Palmer AR. New York: Oxford Univ. Press, 2010
- Freeman JA, Nicholson C. Experimental optimization of current source-density technique for anuran cerebellum. J Neurophysiol 38: 369–382, 1975
- Fritz JB, Elhilali M, David SV, Shamma SA. Auditory attention—focusing the searchlight on sound. Curr Opin Neurobiol 17: 437–455, 2007
- Galaburda AM, Pandya DN. The intrinsic architectonic and connectional organization of the superior temporal region of the rhesus monkey. J Comp Neurol 221: 169–184, 1983
- Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hear Res 47: 103–138, 1990
- Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Krieger, 1966
- Grimault N, Bacon SP, Micheyl C. Auditory stream segregation on the basis of amplitude-modulation rate. J Acoust Soc Am 111: 1340–1348, 2002
- Gutschalk A, Micheyl C, Melcher JR, Rupp A, Scherg M, Oxenham AJ. Neuromagnetic correlates of streaming in human auditory cortex. J Neurosci 25: 5382–5388, 2005
- Holmes SD, Roberts B. Inhibitory influences on asynchrony as a cue for auditory segregation. J Exp Psychol Hum Percept Perform 32: 1231–1242, 2006
- Itatani N, Klump GM. Auditory streaming of amplitude-modulated sounds in the songbird forebrain. J Neurophysiol 101: 3212–3225, 2009
- Jones EG, Dell'Anna ME, Molinari M, Rausell E, Hashikawa T. Subdivisions of macaque monkey auditory cortex revealed by calcium-binding protein immunoreactivity. J Comp Neurol 362: 195–208, 1995
- Kaas JH, Hackett TA. Subdivisions of auditory cortex and levels of processing in primates. Audiol Neurootol 3: 73–85, 1998
- Kadia SC, Wang X. Spectral integration in A1 of awake primates: neurons with single- and multipeaked tuning characteristics. J Neurophysiol 89: 1603–1622, 2003
- Kanwal JS, Rauschecker JP. Auditory cortex of bats and primates: managing species-specific calls for social communication. Front Biosci 12: 4621–4640, 2007
- Lakatos P, Pincze Z, Fu KM, Javitt DC, Karmos G, Schroeder CE. Timing of pure tone and noise-evoked responses in macaque auditory cortex. Neuroreport 16: 933–937, 2005
- Merzenich MM, Brugge JF. Representation of the cochlear partition of the superior temporal plane of the macaque monkey. Brain Res 50: 275–296, 1973
- Micheyl C, Tian B, Carlyon RP, Rauschecker JP. Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48: 139–148, 2005
- Micheyl C, Carlyon RP, Gutschalk A, Melcher JR, Oxenham AJ, Rauschecker JP, Tian B, Courtenay Wilson E. The role of auditory cortex in the formation of auditory streams. Hear Res 229: 116–131, 2007
- Morel A, Garraghty PE, Kaas JH. Tonotopic organization, architectonic fields, and connections of auditory cortex in macaque monkeys. J Comp Neurol 335: 437–459, 1993
- Nelken I, Bar-Yosef O. Neurons and objects: the case of auditory cortex. Front Neurosci 2: 107–113, 2008
- Nelken I, Prut Y, Vaadia E, Abeles M. Population responses to multifrequency sounds in the cat auditory cortex: one- and two-parameter families of sounds. Hear Res 72: 206–222, 1994
- Nicholson C, Freeman JA. Theory of current source-density analysis and determination of conductivity tensor for anuran cerebellum. J Neurophysiol 38: 356–368, 1975
- O'Connor KN, Sutter ML. Global spectral and location effects in auditory perceptual grouping. J Cogn Neurosci 12: 342–354, 2000
- Petkov CI, O'Connor KN, Sutter ML. Encoding of illusory continuity in primary auditory cortex. Neuron 54: 153–165, 2007
- Pressnitzer D, Sayles M, Micheyl C, Winter IM. Perceptual organization of sound begins in the auditory periphery. Curr Biol 18: 1124–1128, 2008
- Rauschecker JP, Tian B. Processing of band-passed noise in the lateral auditory belt cortex of the rhesus monkey. J Neurophysiol 91: 2578–2589, 2004
- Recanzone GH, Guard DC, Phan ML. Frequency and intensity response properties of single neurons in the auditory cortex of the behaving macaque monkey. J Neurophysiol 83: 2315–2331, 2000
- Roberts B, Glasberg BR, Moore BC. Primitive stream segregation of tone sequences without differences in fundamental frequency or passband. J Acoust Soc Am 112: 2074–2085, 2002
- Sadagopan S, Wang X. Nonlinear spectrotemporal interactions underlying selectivity for complex sounds in auditory cortex. J Neurosci 29: 11192–11202, 2009
- Sadagopan S, Wang X. Contribution of inhibition to stimulus selectivity in primary auditory cortex of awake primates. J Neurosci 30: 7314–7325, 2011
- Scholl B, Gao X, Wehr M. Nonoverlapping sets of synapses drive on responses and off responses in auditory cortex. Neuron 65: 412–421, 2010
- Schwarz DW, Tomlinson RW. Spectral response patterns of auditory cortex neurons to harmonic complex tones in alert monkey (Macaca mulatta). J Neurophysiol 64: 282–298, 1990
- Scott BH, Malone BJ, Semple MN. Transformation of temporal processing across auditory cortex of awake macaques. J Neurophysiol 105: 712–730, 2011
- Shamma SA, Micheyl C. Behind the scenes of auditory perception. Curr Opin Neurobiol 20: 361–366, 2010
- Shamma SA, Symmes D. Patterns of inhibition in auditory cortical cells in awake squirrel monkeys. Hear Res 19: 1–13, 1985
- Snyder JS, Alain C. Toward a neurophysiological theory of auditory stream segregation. Psychol Bull 133: 780–799, 2007
- Stark E, Abeles M. Predicting movement from multiunit activity. J Neurosci 27: 8387–8394, 2007
- Steinschneider M, Fishman YI, Arezzo JC. Representation of the voice onset time (VOT) speech parameter in population responses within primary auditory cortex of the awake monkey. J Acoust Soc Am 114: 307–321, 2003
- Steinschneider M, Tenke CE, Schroeder CE, Javitt DC, Simpson GV, Arezzo JC, Vaughan HG Jr. Cellular generators of the cortical auditory evoked potential initial component. Electroencephalogr Clin Neurophysiol 84: 196–200, 1992
- Super H, Roelfsema PR. Chronic multiunit recordings in behaving animals: advantages and limitations. Prog Brain Res 147: 263–282, 2005
- Sutter ML, Loftus WC. Excitatory and inhibitory intensity tuning in auditory cortex: evidence for multiple inhibitory mechanisms. J Neurophysiol 90: 2629–2647, 2003
- Tian B, Rauschecker JP. Processing of frequency-modulated sounds in the lateral auditory belt cortex of the rhesus monkey. J Neurophysiol 92: 2993–3013, 2004
- Turgeon M, Bregman AS, Ahad PA. Rhythmic masking release: contribution of cues for perceptual organization to the cross-spectral fusion of concurrent narrow-band noises. J Acoust Soc Am 111: 1819–1831, 2002
- Turgeon M, Bregman AS, Roberts B. Rhythmic masking release: effects of asynchrony, temporal overlap, harmonic relations, and source separation on cross-spectral grouping. J Exp Psychol Hum Percept Perform 31: 939–953, 2005
- Vaughan HGJ, Arezzo JC. The neural basis of event-related potentials. In: Human Event-Related Potentials: EEG Handbook, edited by Picton TW. New York: Elsevier, 1988
- Vliegen J, Oxenham AJ. Sequential stream segregation in the absence of spectral cues. J Acoust Soc Am 105: 339–346, 1999
- Yin P, Ma L, Elhilali M, Fritz J, Shamma SA. Primary auditory cortical responses while attending to different streams. In: Hearing: from Basic Research to Applications, edited by Kollmeier B, Klump G, Hohmann V, Langemann U, Mauermann M, Uppenkamp S, Verhey J. Berlin: Springer, 2007