Abstract
A molecular (trial-by-trial) analysis of data from a cocktail-party, target-talker search task was used to test two general classes of explanations accounting for individual differences in listener performance: cue weighting models, for which errors are tied to the speech features talkers have in common with the target, and internal noise models, for which errors are largely independent of these features. The speech of eight different talkers was played simultaneously over eight different loudspeakers surrounding the listener. The locations of the eight talkers varied at random from trial to trial. The listener's task was to identify the location of a target talker with which they had previously been familiarized. An analysis of the response counts to individual talkers showed predominant confusion with one talker sharing the same fundamental frequency and timbre as the target and, secondarily, with other talkers sharing the same timbre. These confusions occurred on a roughly constant 31% of the trials for all of the listeners. The remaining errors were uniformly distributed across the remaining talkers and were responsible for the large individual differences in performance observed. The results are consistent with a model in which largely stimulus-independent factors (internal noise) are responsible for the wide variation in performance across listeners.
I. INTRODUCTION
At one time or another, most of us have managed a successful conversation with someone at a social gathering despite noise from the crowd. In such situations, we are able to focus on the speech of one talker to the exclusion of others speaking at the same time. In hearing science, this is known as the cocktail-party effect (Cherry, 1953) and has been a topic of vigorous study for well over a half century (see Kidd and Colburn, 2017; Bronkhorst, 2000, 2015 for reviews).
One of the perplexing outcomes of research on the cocktail-party effect is the wide variation observed in the degree to which individual listeners manifest the effect. The performance of individuals in different studies involving different tasks regularly ranges from near chance to near perfect within the same condition (see Lutfi et al., 2021, for review). The reasons for this wide variation are not well understood. The listeners of these studies are young, healthy individuals who are evaluated to have normal hearing and are well-practiced in their given task. Current speculation is that any one or a combination of factors might be responsible, and that those factors may be different for different individuals. The possible factors suggested in the literature include differences in the capacity of working memory (Conway et al., 2001; Tamati et al., 2013; McLaughlin et al., 2018), differences in selective attention (Ruggles and Shinn-Cunningham, 2011; Dai and Shinn-Cunningham, 2016; Oberfeld and Klöckner-Nowotny, 2016; Shinn-Cunningham, 2017), differences in energetic and informational masking (Lutfi et al., 2003; Kidd and Colburn, 2017; Buss et al., 2021), momentary lapses in attention (Brungart and Simpson, 2007; Bidelman and Yoo, 2020), normal variation in hearing sensitivity (Lee and Long, 2012; Plack et al., 2014; Dewey and Dhar, 2017), cochlear pathology missed by conventional audiometry (Kujawa and Liberman, 2009; Bharadwaj et al., 2015), and differences in sound source localization acuity (see Yost and Pastore, 2021, for a review).
Where attempts have been made to evaluate such factors, the conclusions have been qualified. Part of the reason is the near singular focus of studies on predictions for metrics of overall performance accuracy (d′, percent correct or threshold measures). Different factors can make similar predictions for performance accuracy. For example, a correspondence between individual performance on auditory and visual selective attention tasks may reasonably be taken as evidence for a role of selective attention underlying individual differences (Oberfeld and Klöckner-Nowotny, 2016); but it just as reasonably may be taken for a role of working memory, attentional lapses, and/or other decision processes common to both tasks. Interactions among factors can further complicate conclusions, as, for example, when auditory sensitivity affects selective attention to specific cues (e.g., Doherty and Lutfi, 1996; Lee et al., 2016). In such cases, the individual effects of the factors as they interact will be conflated in the single metric of performance accuracy (one measure, two or more unknowns).
Lutfi et al. (2018) and Lutfi et al. (2020) took an alternative approach to understanding the individual differences by testing general classes of explanation based on a molecular (trial-by-trial) analysis of the types of errors listeners make (Watson, 1963; Ahumada, 2002). Two classes of explanation were considered, cue weighting models, for which errors are tied to specific voice features identifying the target, and internal noise models, for which errors are largely independent of these features. Selective attention to voice features identifying the target is an example of cue weighting and probably the most common explanation given for individual differences in the cocktail-party effect (Ruggles and Shinn-Cunningham, 2011; Dai and Shinn-Cunningham, 2016; Oberfeld and Klöckner-Nowotny, 2016; Shinn-Cunningham, 2017). Listeners must rely on the distinguishing voice features of the target to segregate target from nontarget talkers, but when target and nontarget talkers share similar voice features, confusions result tied to those features.1 Momentary lapses in attention fall in the category of internal noise (Brungart and Simpson, 2007; Bidelman and Yoo, 2020). During a lapse, the listener is not listening (effectively has the headphones off) and, therefore, at the random times in which a lapse occurs, the listener makes a guess largely unrelated to the voice features of the talkers. Lutfi et al. (2018) and Lutfi et al. (2020) found evidence for both types of errors in their experiments, but the type that distinguished the poorer from the better performing listeners was overwhelmingly the one that was unrelated to the voice features of the target. 
The authors interpret their results as inconsistent with popular accounts attributing individual differences to cue weighting and consistent instead with a broad class of internal noise models for which largely stimulus-independent, stochastic processes at different stages of auditory processing act to decorrelate the listener's response from the stimulus from trial to trial.
The results from Lutfi et al. (2018) and Lutfi et al. (2020) were unexpected. Given the obvious importance of voice similarity in the studies (Kidd and Colburn, 2017; Bronkhorst, 2000, 2015), one might have expected confusion based on voice similarity to have played a much larger role in accounting for the individual differences. Lutfi et al. (2020) undertook simulations to ensure that individual differences in performance for the conditions of their experiments could realistically result from either or both types of errors. They also replicated the results for different levels of task difficulty and durations of the talkers' speech affecting the relative proportion of voice-dependent and voice-independent errors. Notwithstanding, the authors cautioned that there were peculiarities of their experiments that could limit the generality of their results. Only two talkers were presented on each trial (hardly a party). The stimuli were synthesized vowels alternating in a fixed, repetitive ABA pattern. The only cues distinguishing talkers were location and voice fundamental frequency, and the locations were simulated over headphones using the same KEMAR (Knowles Electronics Manikin for Acoustic Research) head-related transfer functions for all of the listeners. The tasks given to listeners were also among the simplest one can imagine for a study on the cocktail-party effect: talker segregation (listener reports whether one or two talkers are speaking) and talker identification (listener reports which one of two targets was presented with the distracter).
The goal of the present study was to test the generality of the results from Lutfi et al. (2020) for conditions much more closely approximating real-world listening and a much more demanding listener task. The stimuli were recordings of naturally spoken words by eight talkers speaking simultaneously and asynchronously, similar to what occurs for the natural speech of talkers at a cocktail party. The locations of the talkers were determined by the positions of loudspeakers surrounding the listener, who, as in normal cocktail-party listening, was allowed free head movement during the experiment. In addition to talker location and voice fundamental frequency, the cues distinguishing talkers included speech rate and timbre, both viable cues in conditions of natural cocktail-party listening. The listener's task was talker search: the location of the target talker was not known and had to be determined by identifying the target's voice among others in the crowd. Talker search places greater demands on the listener because it involves the dual tasks of determining the location and identity of the talker (Eramudugolla et al., 2008; Ericson et al., 2004; Simpson et al., 2006, 2007).
II. METHODS
A. Test environment
The experiment was conducted in The Spatial Hearing Laboratory at Arizona State University, Tempe, AZ. The room measured 10 ft × 15 ft × 10 ft, with all six surfaces covered by 4 in. thick acoustic foam, yielding a broadband reverberation time (RT60) of 102 ms. Twenty-four loudspeakers (Boston Acoustics 100×, Peabody, MA) were spaced equidistant from each other on a 5-ft radius circle (i.e., azimuth array with 15° loudspeaker spacing) at approximately the same height as the listeners' pinnae (see Yost et al., 2015 for further details). The stimuli were presented from every third loudspeaker, resulting in a spacing of 45° between possible sound source positions. The loudspeakers presenting stimuli were clearly labeled 1–8 (see Fig. 1). The 45° separation and ability of listeners to freely move their heads ensured that the speech of the target in isolation could be accurately localized from each loudspeaker (Wightman and Kistler, 1989). An intercom and camera enabled the experimenter to monitor the listener's head position and communicate with the listener from a remote-control room. All of the sounds were presented via a 32-channel digital-to-analog converter (PreSonus Quantum 4848, PreSonus Co., Baton Rouge, LA) at a rate of 44 100 samples/s/channel. Using a small diaphragm condenser microphone and sound level meter, the sound level at the entrance to the ear was measured for each loudspeaker and ranged from 67 to 73 dB sound pressure level (SPL) across the different word sequences.
FIG. 1.
A schematic view of the experimental setup with an example stimulus configuration for one trial of the simulated cocktail-party listening task. Appearing next to the talkers are designations indicating the speech features they share with the target talker TFR: timbre (T), fundamental frequency (F), and rate (R) (see the text for further details).
B. Search task
On each trial, recordings of the speech of eight different talkers were played simultaneously over the eight loudspeakers with a different talker for each loudspeaker. The locations of the eight talkers varied randomly and uniformly across the trials. In addition to location, the speech of the eight talkers differed in timbre (male or female), pitch (normal or high fundamental frequency), and/or rate (normal or fast). The two values for each of the three speech features yielded a total of eight different combinations, one for each talker. (More details regarding the speech and synthesis of speech features are given in Sec. II C.) One of the eight talkers served as the target throughout the experiment and we shall refer to him as TFR. TFR had a typical male timbre (T), normal fundamental frequency for a male (F), and normal average speaking rate (R). The speech of the nontarget talkers is denoted based on those features they share in common with TFR. For example, the talker whose speech had the same male timbre and fundamental frequency, but differed in rate is denoted TF; the talker whose speech occurred at the same rate, but differed in timbre and fundamental frequency is denoted R; and the talker whose speech differed in all respects from TFR is denoted “0.”
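The combinatorics of this design are simple enough to spell out in code. The short Python snippet below is purely illustrative (it is not part of the study's methods); it enumerates the eight feature combinations and generates the corresponding talker designations:

```python
from itertools import product

# The target's values for the three binary speech features:
# timbre (T), fundamental frequency (F), and speaking rate (R).
features = ('T', 'F', 'R')

# Each talker either shares (True) or differs from (False) the target
# on each feature; a talker sharing no features is designated "0".
designations = [
    ''.join(f for f, shared in zip(features, mask) if shared) or '0'
    for mask in product([True, False], repeat=3)
]
print(designations)  # ['TFR', 'TF', 'TR', 'T', 'FR', 'F', 'R', '0']
```

The first designation is the target itself; the remaining seven label the nontarget talkers by the features they share with it.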
Prior to performing the task, listeners were familiarized with TFR's speech by hearing it presented twice in quiet from each loudspeaker in clockwise order, beginning at 0 deg azimuth. In the main experiment, all eight talkers were presented simultaneously, and the listener's task was to identify, by numerical keypad, the location of the loudspeaker presenting the speech of TFR. Correct feedback was given after each trial. To maintain the naturalness of the search task, listeners were allowed to move their heads freely during the trials. Listeners completed one approximately 20-min session, consisting of 2 blocks of 50 trials, with a break between blocks. One week later, the same listeners returned and were presented with the same familiarizing routine, and then completed another 2 blocks of 50 trials each.
C. Speech stimuli
The speech stimuli were sequences of six words selected independently and at random on each trial for each of the eight talkers. The words were selected from 360 exemplars of the Maryland consonant-nucleus-consonant (CNC) word lists spoken by a native English-speaking male and female adult (Peterson and Lehiste, 1962). The linguistic criteria for selecting the words are given in Lehiste and Peterson (1959). The recordings were edited to eliminate the indicator phrase “ready,” leaving only approximately 50 ms of silence preceding and following each recorded word. To increase the naturalness of the speech, the silent intervals separating words were chosen at random for each talker on each trial with uniform probability over the interval 0–100 ms.
The speech of the target talker, TFR, was identified with the original recordings from the male talker. The speech of the nontarget talkers differing in timbre (FR, F, R, and 0) was identified with the original recordings from the female talker. The high-pitch voice for the male and female timbres was created by shifting the fundamental frequency (F0) of the recording up by 20%. This was performed using the overlap-and-add method, implemented by the solafs function available on the matlab file exchange (Hejna and Musicus, 1991). For the male recordings, the shift in F0 resulted in voicing with a characteristic male timbre (male vocal tract length) but a more characteristic female pitch (female-sized larynx). For the female recordings, the shift resulted in voicing with characteristic female timbre and atypically high pitch. For the normal rate of speech, the recordings were played at their naturally recorded rate. For the fast rate, the word duration was reduced by 25%, keeping the same variation in F0. This was done using a combination of the solafs and interpolation (interp1) routines in matlab. Note that because of the differences in rates and lengths of words, the speech of some talkers terminated before that of the others. This occurs naturally in cocktail-party listening and, thus, is considered a viable cue for the search task. The results of the signal processing gave rise to eight talkers whose speech was clearly distinguishable from one another.
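To give a feel for this style of processing, the following Python sketch implements a plain overlap-add (OLA) time stretch and a resample-plus-stretch pitch shift. It is a toy illustration only, not the SOLAFS algorithm of Hejna and Musicus (1991) used in the study, and the frame length and rates are arbitrary choices:

```python
import numpy as np

def ola_time_stretch(x, rate, frame_len=1024):
    """Toy overlap-add time-scale modification: rate > 1 shortens the
    signal (faster speech) while leaving the spectrum roughly intact.
    A plain OLA sketch, not the synchronized SOLAFS algorithm."""
    hop_syn = frame_len // 2                  # synthesis hop (50% overlap)
    hop_ana = int(round(hop_syn * rate))      # analysis hop
    win = np.hanning(frame_len)
    n_frames = max(1, (len(x) - frame_len) // hop_ana + 1)
    out = np.zeros(n_frames * hop_syn + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        seg = x[i * hop_ana : i * hop_ana + frame_len]
        seg = np.pad(seg, (0, frame_len - len(seg)))  # pad final frame
        out[i * hop_syn : i * hop_syn + frame_len] += win * seg
        norm[i * hop_syn : i * hop_syn + frame_len] += win
    return out / np.maximum(norm, 1e-8)       # undo window gain

def pitch_shift(x, f0_ratio, frame_len=1024):
    """Raise F0 by f0_ratio via resampling (which also shortens the
    signal), then time-stretch back toward the original duration."""
    idx = np.arange(0, len(x), f0_ratio)
    resampled = np.interp(idx, np.arange(len(x)), x)
    return ola_time_stretch(resampled, rate=len(resampled) / len(x),
                            frame_len=frame_len)
```

An `f0_ratio` of 1.2 corresponds to the 20% upward F0 shift described above; a stretch rate of about 1.33 corresponds to the 25% reduction in word duration.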
D. Listeners
Listeners were 14 students and staff from Arizona State University (11 females, 3 males; ages 19–47 years, with an average age of 24.9 years). Listeners were screened based on their audiometric thresholds, with an exclusion criterion of hearing thresholds greater than 20 dB hearing level (HL) at any of the audiometric frequencies from 250 to 4000 Hz. Listeners were compensated $15 per session for their participation in the study.
III. RESULTS
Two of the listeners, S3 and S13, were unable to return for data collection in the second week of the experiment, so only their first week's data were used; a third listener, S12, performed near chance in the first week, so only that listener's second week's data were used. Figure 2 plots the first against the second week's proportional response counts for each talker for the remaining 11 listeners. Open symbols are the proportional counts for nontarget talkers and filled symbols are the proportional counts for the target. The target counts show evidence of a learning effect in the second week for some listeners, but the relative counts for the first and second weeks are consistent across listeners. A chi-square test failed to reject the hypothesis that the relative counts were the same for the first and second weeks [p > 0.99, degrees of freedom (df) = 89], so the data were pooled across weeks to produce the histograms of Fig. 3.
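A chi-square test of homogeneity like this one compares the observed counts against those expected if the week-by-talker proportions were the same. A minimal sketch of the computation, using hypothetical response counts rather than the study's data:

```python
import numpy as np

def chi2_homogeneity(table):
    """Chi-square statistic and degrees of freedom for an r x c
    contingency table (rows = weeks, columns = talkers here).
    Compare the statistic against the chi-square distribution with
    the returned df (e.g., scipy.stats.chi2.sf) for a p-value."""
    table = np.asarray(table, dtype=float)
    row_tot = table.sum(axis=1, keepdims=True)
    col_tot = table.sum(axis=0, keepdims=True)
    expected = row_tot * col_tot / table.sum()   # counts under homogeneity
    stat = ((table - expected) ** 2 / expected).sum()
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, df

# Hypothetical week-1 / week-2 response counts for the eight talkers:
week1 = [55, 20, 8, 5, 4, 3, 3, 2]
week2 = [53, 22, 7, 6, 4, 4, 2, 2]
stat, df = chi2_homogeneity([week1, week2])
```

A small statistic relative to its df, as in the hypothetical counts above, supports pooling the two weeks.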
FIG. 2.
The proportional response counts across listeners for each talker for the first and second week of the experiment. Nontarget talkers are designated by unfilled symbols, and target talkers are designated by filled symbols (see the text for further details).
FIG. 3.
Histograms for individual listeners (panels) giving the total response counts for each talker as identified on the abscissas by the speech features they share with TFR: timbre T, fundamental frequency F, and rate R. The first bar gives the response count to the target TFR. Percent correct (PC) performance is shown in the upper-right corner of each plot. The horizontal dashed curve gives the prediction for a uniform distribution of errors across nontarget talkers (total error count divided by seven nontarget talkers, conditioned on the listener) for each listener.
Each panel of Fig. 3 gives the total response counts to the different talkers for an individual listener. Only the first week's data are shown for S3 and S13, and only the second week's data are shown for S12. The eight talkers are identified on the abscissas by their respective speech designation; the first bar gives the response count to the target, TFR. The percent correct (PC) performance is shown in the upper-right corner of each histogram. As expected, the overall performance varies considerably across listeners, ranging from near chance, PC = 16% for S13 (where chance performance equals 12.5%), to PC = 55% for S1. Thirteen of the 14 listeners (the exception being S13) most often confused TF for TFR, indicating that the judgments for these listeners were based primarily on the combination of voice fundamental frequency and timbre (the two voice features TF shares with TFR). For S13, the performance was just slightly above chance and the count for R was actually greater than that for TFR.
Figure 4 gives the corresponding response counts for the localization responses. The values on the abscissa give the target-response separation as the difference in degrees azimuth between TFR's location and the location that the listener pointed to. Each listener's data are represented by a different colored curve. The data are superimposed to show the similarity in the pattern of localization responses across listeners. Figure 4 shows the errors for all of the listeners to be near uniformly distributed across the different target-response separations. This is, to some extent, expected based on the frequent confusions of the target with TF (voice-dependent errors). However, with this many talkers speaking simultaneously, it may also, in part, reflect the listener's inability to resolve the location of talkers on some trials (Yost, 2017; Yost et al., 2015; Zhong and Yost, 2017). Most notably, for the conditions of this study, one might expect front-back confusions to be a common localization error (cf. Wightman and Kistler, 1989). It is difficult to evaluate this possibility reliably for individual listeners because the number of times that the target was directly in front of or behind the listener was quite small and differed for each listener. On the other hand, a preponderance of front-back confusions, if they exist, should be evident in the response counts averaged across listeners. The dashed curve in Fig. 4 shows the number of correct responses averaged across listeners for each location of the target (ordinate now to be read as average count and abscissa to be read as target location). In this case, front-back confusions would be evidenced by dips in the curve at 0 and 180 deg (middle, leftmost, and rightmost points of the curves). There is little indication of such dips, suggesting front-back confusions did not play a significant role in determining performance. 
This, in retrospect, may not be surprising as listeners had the opportunity to resolve front-back confusions by free movement of their heads.
FIG. 4.
The response counts for the localization responses of individual listeners (different colored curves). The abscissa gives the target-response separation as the difference in degrees azimuth between the location of TFR and the location that the listener pointed to. The dashed curve gives the average number of correct responses across subjects (ordinate) for each actual location of the target (abscissa).
The fact that errors are largely independent of target location (absolute and relative to the listener's response) does not rule out the possibility that the results were influenced by an inability to resolve location. The uniform distribution of errors, however, indicates that if there was such an influence, listeners must have been so uncertain of the location of the target they identified that they essentially chose one of the eight locations at random. Although possible, this seems unlikely. Given the comparably much stronger dependence of errors on the voice properties of talkers (Fig. 3), a simpler interpretation is that errors resulted predominantly from the similarity in voice properties. In either case, it is important to underscore that our approach does not assume or require that voice errors be dominant. The search task is just as much a localization as an identification task; therefore, any limiting influence of localization acuity would simply contribute as an additional source of internal noise in this analysis.
Returning to the original goal of the study, we wish to evaluate two general classes of models accounting for individual differences in percent correct performance: cue weighting models, for which errors result from similarity in the voice/speech features of talkers, and so are tied to those features, and internal noise models, for which errors result from factors unrelated to the voice/speech features of talkers, and so are independent of those features. Looking at the histograms of Fig. 3, it is evident that listeners make both types of errors. If errors were entirely unrelated to the voice/speech features of talkers, we should expect the response error counts for each subject to be uniformly distributed across the seven nontarget talkers (horizontal dashed curves in Fig. 3). They are not. Conversely, if errors were strictly tied to similarity in the voice/speech features of talkers, we should expect the response counts to talker 0, who shares no speech features with TFR, to be close to zero or at least considerably less than those for all of the other nontarget talkers. They are not.
We can generate precise expressions estimating the relative frequency of the two types of errors for each listener. Returning to the histograms, we see that, except for the one talker most often confused with TFR (TF for all but one listener), the response counts to nontarget talkers are close to being uniformly distributed. The mean data further bear this out. This observation suggests a simple account in which internal noise causes some uniform frequency of response to nontarget talkers and cue weighting adds to this count for the one talker whose speech is perceived most similar to TFR (again, TF). Let Cmax denote the count for TF and let C be the vector of counts for all of the other nontarget talkers. Now, to estimate the number of errors due to cue weighting, we must subtract from Cmax the errors resulting from internal noise. Since we take the errors due to internal noise to be uniformly distributed across talkers, that number is estimated to be avg(C). Our estimate for the number of errors attributable to cue weighting is then
C(err)wgt = Cmax - avg(C).        (1)
Conversely, to compute the number of errors due to internal noise, we must add avg(C) to C to fill in for the missing talker TF that is most often confused with TFR. The number of errors due to internal noise is then estimated to be
C(err)nos = sum(C) + avg(C).        (2)
Combining Eqs. (1) and 2, we have the fundamental parsing of total errors into two types: those tied to the stimulus, C(err)wgt, and those that are not, C(err)nos,
C(err) = C(err)wgt + C(err)nos = Cmax + sum(C).        (3)
Note that the errors tied to the stimulus, in this case, are primarily reflected in the height of the bar for TF (the value of Cmax). Errors that are not tied to the stimulus are reflected in the height of the bars for all of the other nontarget talkers [related to sum(C)].
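The parsing of Eqs. (1)–(3) is straightforward to compute. The sketch below uses hypothetical response counts (not data from this study) and assumes, as in the text, that internal-noise errors are uniform across nontarget talkers:

```python
import numpy as np

def parse_errors(counts, target='TFR'):
    """Split one listener's errors into a cue-weighting count,
    Cmax - avg(C), and an internal-noise count, sum(C) + avg(C),
    where Cmax is the count for the most-confused nontarget talker
    and C holds the counts for the remaining nontarget talkers."""
    nontarget = {k: v for k, v in counts.items() if k != target}
    confused = max(nontarget, key=nontarget.get)            # TF for most listeners
    c = np.array([v for k, v in nontarget.items() if k != confused], float)
    err_wgt = nontarget[confused] - c.mean()                # Eq. (1)
    err_nos = c.sum() + c.mean()                            # Eq. (2)
    return err_wgt, err_nos                                 # together: Eq. (3)

# Hypothetical response counts over 200 trials for one listener:
counts = {'TFR': 110, 'TF': 40, 'T': 12, 'TR': 11,
          'F': 6, 'FR': 5, 'R': 4, '0': 12}
err_wgt, err_nos = parse_errors(counts)
```

By construction the two components sum to the total error count (90 in this hypothetical case), mirroring the parsing of total errors into the two types.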
The empirical question now is which one or combination of these two types of errors is responsible for the individual differences in the overall performance. Equation (3) places no constraints on what the answer might be. For example, three listeners, all of whom perform near chance, may do so for entirely different reasons: one because of consistently confusing TF with TFR, another because of the influence of internal noise, and the third because of some combination of TF confusions and internal noise. The last case, parenthetically, would be indicative of the interaction mentioned earlier in which auditory sensitivity affects selective attention to voice cues (see Fig. 2 of Lutfi et al., 2020, for simulations). Being able to decide among these alternatives represents the great advantage of molecular analyses over single measures of performance accuracy.
Figure 5 shows the results of this analysis. The error counts have been converted to the obtained proportion of correct responses to be comparable to the figures in Lutfi et al. (2018) and Lutfi et al. (2020). The ordinate is the total obtained proportion of correct responses,
P(cor) = 1 - [C(err)wgt + C(err)nos]/N,        (4)

where N is the total number of trials;
the abscissa is the predicted proportion of correct responses for the two models: the internal noise,
P(cor)nos = 1 - C(err)nos/N,        (5)
given by the filled symbols, and the cue weighting,
P(cor)wgt = 1 - C(err)wgt/N,        (6)
given by the unfilled symbols. The individual symbols represent the data for each listener. They are shown separately for the first and second weeks to give some sense of the variability across estimates.2 Reading from the abscissa, cue weighting predicts an error rate of a little more or less than 13% across trials for all of the listeners [the average P(cor)wgt = 0.87]. Internal noise accounts for the remaining errors and is also responsible for the wide variation in overall performance across listeners [P(cor)nos ranging from 0.21 to 0.75]. Had cue weighting instead been responsible for the individual differences, the filled and unfilled symbols would simply have changed places (again, see Lutfi et al., 2020, Fig. 2 for the results of simulations).
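A sketch of the conversion from error counts to these proportions, assuming each model's proportion correct is one minus its error count divided by the total number of trials; all numbers below are hypothetical, chosen only to mirror the roughly 13% weighting error rate reported above:

```python
def model_proportions(err_wgt, err_nos, n_trials):
    """Convert the cue-weighting and internal-noise error counts into
    the proportions plotted in Fig. 5: the overall proportion correct
    and the proportion correct predicted by each model alone."""
    p_cor = 1 - (err_wgt + err_nos) / n_trials   # obtained P(cor)
    p_nos = 1 - err_nos / n_trials               # internal noise model
    p_wgt = 1 - err_wgt / n_trials               # cue weighting model
    return p_cor, p_nos, p_wgt

# Hypothetical listener: 26 weighting errors, 70 noise errors, 200 trials.
p_cor, p_nos, p_wgt = model_proportions(26.0, 70.0, 200)
# Note the identity p_cor = p_nos + p_wgt - 1.
```

Because the two error counts partition the total, the two model proportions jointly determine the obtained proportion correct.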
FIG. 5.
The results of error analysis. The obtained proportion of correct responses, P(cor), for individual listeners (symbols) is plotted against the predicted proportion of correct responses for two models, internal noise [P(cor)nos, filled symbols] and cue weighting [P(cor)wgt, unfilled symbols]. See the text for further details. The dashed and dotted lines give expectations assuming only one of the two types of errors is responsible for individual differences in performance.
Looking more closely at Fig. 5, there does appear to be a secondary effect wherein P(cor)wgt increases slightly as P(cor) approaches chance performance. This effect was not seen in Lutfi et al. (2018) and Lutfi et al. (2020). In those studies, the weighting efficiency, comparable to P(cor)wgt, was essentially constant. The discrepancy is due to the difference in the way that cue weighting is measured in the two studies. In Lutfi et al. (2018) and Lutfi et al. (2020), cue weighting was determined from regression coefficients relating the listener's trial-by-trial response to random perturbations in the cues. In the present study, cue weighting was estimated from the proportion of responses to TF, assuming uniform responses to all other nontarget talkers. The uniformity assumption is an approximation. It seems likely, and the data indeed show evidence, that at least some of the responses to the other nontarget talkers were due to confusions from cue weighting. In particular, note that the mean data of Fig. 3 show that the two talkers receiving the next highest response counts after TF are the ones sharing the same timbre as TF (T and TR). The additional errors due to timbre confusions can be estimated by repeating the analysis with Cmax and avg(C) adjusted to include the counts for T and TR. Figure 6 shows that when the adjustment is made for the broader timbre confusions, the results are in better agreement with Lutfi et al. (2018) and Lutfi et al. (2020), in which confusions due to cue weighting were roughly constant across listeners.3 The constant error rate for cue weighting is now estimated to be 31%. The error rate due to internal noise is estimated to range from 43% to 55% across listeners.
FIG. 6.
The same as that in Fig. 5 except that P(cor)wgt is corrected for the estimated additional confusions based on speech timbre.
IV. DISCUSSION
The present study replicates the results of Lutfi et al. (2018) and Lutfi et al. (2020) for a fundamentally different and more challenging cocktail-party listening task, under conditions much more closely approximating real-world cocktail-party listening. The results indicate that the findings of Lutfi et al. (2018) and Lutfi et al. (2020) were not likely artifacts of the tasks, stimuli, or methods used. They provide further support for the conclusion of those studies: that the variation in overall performance across listeners is consistent with a class of internal noise models for which errors are largely independent of the stimulus. The construct of internal noise has a long history in psychophysics dating back to Fechner [1860 (Fechner, 1966)]. It is identified with stochastic events occurring at different stages of auditory processing that act to decorrelate the listener's response from the stimulus from trial to trial. Some examples proposed to account for the individual differences in the present study, which we consider toward the end of this section, include neural noise caused by cochlear pathology (Lopez-Poveda, 2014), uncertainty resulting from informational masking (Lutfi et al., 2003; Lutfi et al., 2013), and guessing resulting from a momentary lapse in attention.
The results are new for studies of the cocktail-party effect, but similar results have been reported in related studies of sound source identification. In Lutfi and Liu (2007), for example, the stimuli were impact sounds of bars, plates, and membranes synthesized from standard textbook equations for the free motion of these sources. In a two-interval, forced-choice procedure, listeners identified predefined target sources based on their material, size, the hardness of the striking mallet, and/or the presence or absence of light damping. The results, shown in Figs. 7 and 11 of that study, are directly comparable to the results of the present study, which are shown in Fig. 6. Large numbers of errors in that study, as in the present study, were tied to the stimulus, but it was the errors not tied to the stimulus that were responsible for the individual differences in performances.
Altogether, these studies raise the question as to whether there are any conditions of cocktail-party listening for which a molecular analysis might reveal individual differences in performance due to cue weighting. The authors know of one demonstrated case. Gilbertson and Lutfi (2015) had a group of elderly and young adults perform a masked vowel discrimination task under conditions similar to those of Lutfi et al. (2018) and Lutfi et al. (2020), and they also used a similar trial-by-trial analysis. Within groups, individual differences were attributed to internal noise, but between groups, the elderly listeners performed more poorly due to cue weighting. The results were attributed to the effects of aging on selective attention. Another case in which cue weighting may be responsible for individual differences in performance would be if some listeners predominantly weighted one or more cues unrelated to the target that were not analyzed (e.g., particular words or the time intervals between words in the present study). The possibility is a degenerate case, but it is a possibility. One way to test it is to use the classic procedure of estimating internal noise by evaluating the consistency of the listener's response to the same stimulus repeated across trials (see, for example, Gilkey and Robinson, 1986).4
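The logic of this repeated-stimulus (double-pass) procedure can be illustrated with a small simulation, in which the stimulus evidence is identical across two passes and only the internal noise differs; all parameter values below are hypothetical:

```python
import numpy as np

def double_pass_agreement(noise_sd, n_trials=10000, seed=0):
    """Simulate a double-pass experiment: the same stimulus evidence
    is judged twice with independent internal noise on each pass.
    Returns the proportion of trials with the same binary response;
    more internal noise drives agreement toward chance (0.5)."""
    rng = np.random.default_rng(seed)
    evidence = rng.normal(0.0, 1.0, n_trials)              # fixed across passes
    resp1 = evidence + rng.normal(0.0, noise_sd, n_trials) > 0
    resp2 = evidence + rng.normal(0.0, noise_sd, n_trials) > 0
    return float(np.mean(resp1 == resp2))

low_noise = double_pass_agreement(0.2)    # little internal noise: high consistency
high_noise = double_pass_agreement(2.0)   # much internal noise: near-chance consistency
```

The across-pass agreement thus provides an estimate of internal noise that does not depend on which stimulus features the listener weights.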
A. The listener effect
Historically, research in psychoacoustics has focused on group behavior, where individual differences in performances are treated as error variance in the analysis of the main effects of experimental factors. There are, however, compelling reasons to pay more attention to individual differences in studies of speech-on-speech masking. Evidence has been accumulating to suggest that the individual differences in these studies result not from random measurement error (e.g., listeners occasionally having a “bad day”) but rather from real differences among listeners in the perceptual processing of signals. Kidd et al. (2016), for example, obtained speech reception thresholds from listeners for three conditions of cocktail-party listening: differences in the gender of talkers, differences in the spatial separation of talkers, and time-reversal of the interfering speech. They report wide variation in thresholds across listeners within conditions but strong correlations of thresholds across conditions within listeners. Although it is not the specific focus of that study, the results suggest a fixed effect of listeners across conditions nearly as large as the main effect of the experimental factors that they examined. Lutfi et al. (2013) report similar results for word recognition where the conditions involved differences in the similarity and uncertainty in the F0 of talkers.
A still different type of evidence for a fixed effect of listeners in speech-on-speech masking was reported by Arbogast et al. (2002). They measured word identification performance for target sentences spatially separated from a masker: either comb-filtered speech or noise. Complete psychometric functions (PFs) were obtained relating performance to the level of the target sentence. The study included only four listeners, but the PFs for the individual listeners showed clear differences in slopes. In contrast, the deviations of the points about the PFs showed little variation either within or across listeners. The results provide convergent evidence for a fixed effect of listeners (overall effect on slope) far exceeding the random effect of listeners (effect on the individual estimates of slope given by each point).
More recently, Lutfi et al. (2021) have presented data regarding the magnitude of the fixed effect of listeners relative to the effect of major experimental factors known to influence the cocktail-party effect. They used naturally spoken sentences and synthesized vowels to measure the PFs for four major experimental variables: differences in task (talker segregation vs identification), differences in the voice features of talkers (pitch vs location), differences in the voice similarity and uncertainty of talkers (informational masking), and the presence or absence of linguistic cues. The main effects of these variables on performance were the same as those that had been observed in previous studies; however, when performance was expressed relative to that of an ideal observer (best performance possible), the effects of the experimental variables were largely eliminated; the only effect to remain was that of the listener. Similar to the data of Arbogast et al. (2002), the listener effect was on the slopes of the PFs with the variation in slopes across listeners (fixed effect) far exceeding the variation in the estimates of slopes within listeners (random effect). Lutfi et al. (2021) suggest from these data that the listener should be considered a factor in cocktail-party listening studies, equal in status to any major experimental variable.
B. The role of internal noise
The evidence for a fixed effect of listeners in these studies brings us back to the question of what specifically is responsible for the differences in performances among listeners. The present results, along with those of Lutfi et al. (2018) and Lutfi et al. (2020), suggest that the answer might be found in one or more explanations related to internal noise. Several authors have noted in this regard that studies of the cocktail-party effect share important features with studies of informational masking (see Kidd and Colburn, 2017, and Kidd et al., 2008, for reviews). Both entail some uncertainty on the part of the listener about the background noise from trial to trial, and both yield large individual differences in performances. Lutfi et al. (2003) offer an internal noise model of listener uncertainty designed to account for individual differences in informational masking; the model would make similar predictions for the individual differences in the cocktail-party effect observed here (also see Oh and Lutfi, 1998; Lutfi et al., 2013). Lopez-Poveda (2014), on the other hand, suggests that the individual differences in the cocktail-party effect could result from cochlear deafferentation (Kujawa and Liberman, 2009) giving rise to the stochastic under-sampling of signals, another internal noise model. Both models would predict that the PFs of the more poorly performing listeners would have a shallower slope, as was observed in the studies of Arbogast et al. (2002) and Lutfi et al. (2021). Alternatively, listeners who are more often distracted than others (the all-or-none model of attention given as an example in the Introduction) would be expected to show an upper asymptote converging to a lower level of performance, which was not observed in these studies. Future studies employing molecular analyses may be key to distinguishing among the different sources of internal noise that contribute to individual differences in these studies.
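The contrasting predictions of the two model classes can be made concrete with a toy sketch (the functional forms and parameter values below are illustrative assumptions, not the fitted models of the cited studies): internal noise scales the argument of the PF and so shallows its slope, whereas all-or-none lapses mix in chance guessing and so cap the upper asymptote.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pf_internal_noise(snr, sigma):
    """Internal noise model: larger sigma gives a shallower slope,
    but performance still converges to 1 at high SNR."""
    return phi(snr / sigma)

def pf_all_or_none(snr, p_lapse, chance=1/8, sigma=1.0):
    """All-or-none attention model: on a lapse trial the listener
    guesses among eight talkers, lowering the upper asymptote."""
    return p_lapse * chance + (1 - p_lapse) * phi(snr / sigma)

# At a high SNR the two models separate:
hi_snr = 6.0
print(pf_internal_noise(hi_snr, sigma=3.0))  # approaches 1, just more slowly
print(pf_all_or_none(hi_snr, p_lapse=0.2))   # capped near 0.825
```

Plotting these two families against the measured PFs of individual listeners is, in essence, the test that favored the internal noise account in the studies discussed above.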
C. Final caveat
When Cherry (1953) first coined the term “the cocktail-party effect” to describe the results of his pioneering experiments, he alluded primarily to the acoustic properties of speech that serve to segregate target from nontarget talkers (largely, the same ones investigated here). Notwithstanding, the reason we have cocktail parties is to communicate with one another through words. For this reason, there could be interest in using methods similar to those applied in this study to test the role of the linguistic properties of words: their semantic, syntactic, and pragmatic properties. Linguistic cues are clearly not required for the large individual differences in the cocktail-party effect we and others have observed, but there are reasons to consider their possible role in mitigating these differences, specifically by providing compensatory cues for listeners having greater difficulty segregating talkers based on acoustic cues. Linguistic cues are abundantly available in everyday listening; they have been demonstrated to have a significant impact on listener performance (Kidd and Colburn, 2017); and they have not been widely investigated as a factor affecting individual differences in performance, except for differences related to one's native language (Brouwer et al., 2012; Calandruccio et al., 2014; Calandruccio et al., 2017).
Another factor to consider is the relative salience of the different acoustic cues used to segregate talkers. In the present study, the differences in the timbre, pitch (F0), and speaking rate of talkers were all clearly audible, but that does not mean that they were equally salient. The predominant errors associated with the timbre and pitch of talker voices in the present study could very well reflect such differences in salience, and there is no guarantee that the same results would be obtained if the situation were different. This, of course, is also true of real cocktail parties and, therefore, it is something else to consider regarding the generality of the present results.
ACKNOWLEDGMENTS
The authors would like to thank Kathryn Pulling for her assistance in collecting the data and Dr. Emily Buss and two anonymous reviewers for comments on an earlier version of the manuscript. This research was supported by National Institute on Deafness and Other Communication Disorders (NIDCD) Grant Nos. R01 DC001262 (R.A.L.), R01 DC015214 (W.A.Y.), and F32 DC016808 (W.A.Y. and T.P.), and in part by a grant from Facebook Reality Laboratories (W.A.Y. and T.P.).
Footnotes
1. The references to selective attention in this and the aforementioned studies focus exclusively on the identifying features of targets. It is worth noting, however, that in some degenerate cases, attention may also be selective for features not at all associated with the target.
2. Note that the data of Fig. 3 should yield 25 data points (filled and unfilled symbol pairs) in Fig. 5. Only 24 data points appear in Fig. 5 because one listener (S5) gave the same response counts, C(err)wgt and C(err)nos, for the first and second weeks.
3. One might be tempted here to take the goodness of fit of the data about the diagonal dashed lines in Figs. 5 and 6 as a measure of the dependence of individual differences on errors not tied to the stimulus. However, this can be misleading. The present analysis is simply designed to parse listener responses into two categories; there are no models in the sense of having free parameters whose values are regressed on the data to obtain a goodness of fit. Note, in this regard, that the goodness of fit is clearly better for Fig. 5, but Fig. 6 shows the stronger dependence of individual differences on responses not tied to the stimulus.
4. We thank the Associate Editor, Dr. Emily Buss, for offering this suggestion.
References
- Ahumada, A. J., Jr. (2002). “Classification image weights and internal noise level estimation,” J. Vision 2(1), 121–131. 10.1167/2.1.8
- Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2002). “The effect of spatial separation on informational and energetic masking of speech,” J. Acoust. Soc. Am. 112, 2086–2098. 10.1121/1.1510141
- Bharadwaj, H. M., Masud, S., Mehraei, G., Verhulst, S., and Shinn-Cunningham, B. G. (2015). “Individual differences reveal correlates of hidden hearing deficits,” J. Neurosci. 35(5), 2161–2172. 10.1523/JNEUROSCI.3915-14.2015
- Bidelman, G. M., and Yoo, J. (2020). “Musicians show improved speech segregation in competitive, multi-talker cocktail-party scenarios,” Front. Psychol. 11, 1927. 10.3389/fpsyg.2020.01927
- Bronkhorst, A. W. (2000). “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,” Acta Acust. united Ac. 86(1), 117–128.
- Bronkhorst, A. W. (2015). “The cocktail-party problem revisited: Early processing and selection of multi-talker speech,” Atten. Percept. Psychophys. 77, 1465–1487. 10.3758/s13414-015-0882-9
- Brouwer, S., Van Engen, K., Calandruccio, L., and Bradlow, A. R. (2012). “Linguistic contributions to speech-on-speech masking for native and non-native listeners: Language familiarity and semantic content,” J. Acoust. Soc. Am. 131(2), 1449–1464. 10.1121/1.3675943
- Brungart, D. S., and Simpson, B. D. (2007). “Cocktail party listening in a dynamic multitalker environment,” Percept. Psychophys. 69(1), 79–91. 10.3758/BF03194455
- Buss, E., Calandruccio, L., Oleson, J., and Leibold, L. J. (2021). “Contribution of stimulus variability to word recognition in noise versus two-talker speech for school-age children and adults,” Ear Hear. 42(2), 313–322. 10.1097/AUD.0000000000000951
- Calandruccio, L., Buss, E., and Bowdrie, K. (2017). “Effectiveness of two-talker maskers that differ in talker congruity and perceptual similarity to the target speech,” Trends Hear. 21, 1–14.
- Calandruccio, L., Buss, E., and Hall, J. W. (2014). “Effects of linguistic experience on the ability to benefit from temporal and spectral masker modulation,” J. Acoust. Soc. Am. 135, 1335–1343. 10.1121/1.4864785
- Cherry, E. C. (1953). “Some experiments on the recognition of speech, with one and two ears,” J. Acoust. Soc. Am. 25, 975–979. 10.1121/1.1907229
- Conway, A. R., Cowan, N., and Bunting, M. F. (2001). “The cocktail party phenomenon revisited: The importance of working memory capacity,” Psychon. Bull. Rev. 8, 331–335. 10.3758/BF03196169
- Dai, L., and Shinn-Cunningham, B. G. (2016). “Contributions of sensory coding and attentional control to individual differences in performance in spatial auditory selective attention tasks,” Front. Hum. Neurosci. 10(530), 1–19. 10.3389/fnhum.2016.00530
- Dewey, J. B., and Dhar, S. (2017). “A common microstructure in behavioral hearing thresholds and stimulus-frequency otoacoustic emissions,” J. Acoust. Soc. Am. 142, 3069–3083. 10.1121/1.5009562
- Doherty, K. A., and Lutfi, R. A. (1996). “Spectral weights for overall level discrimination in listeners with sensorineural hearing loss,” J. Acoust. Soc. Am. 99(2), 1053–1058. 10.1121/1.414634
- Eramudugolla, R., McAnally, K. I., Martin, R. L., Irvine, D. R. F., and Mattingley, J. B. (2008). “The role of spatial location in auditory search,” Hear. Res. 238, 139–146. 10.1016/j.heares.2007.10.004
- Ericson, M. A., Brungart, D. S., and Simpson, B. D. (2004). “Factors that influence intelligibility in multitalker speech displays,” Int. J. Aviat. Psychol. 14, 313–334. 10.1207/s15327108ijap1403_6
- Fechner, G. T. (1966). Elemente der Psychophysik (Elements of Psychophysics) (Holt, Rinehart and Winston, New York).
- Gilbertson, L., and Lutfi, R. A. (2015). “Estimates of decision weights and internal noise for the masked discrimination of vowels by young and elderly adults,” J. Acoust. Soc. Am. 137, EL403–EL407. 10.1121/1.4919701
- Gilkey, R. H., and Robinson, D. E. (1986). “Models of auditory masking: A molecular psychophysical approach,” J. Acoust. Soc. Am. 79, 1499–1510. 10.1121/1.393676
- Hejna, D., and Musicus, B. R. (1991). “The SOLAFS time-scale modification algorithm,” Bolt, Beranek, and Newman Technical Report (University of Cambridge, Cambridge, England).
- Kidd, G., Jr., and Colburn, S. (2017). “Informational masking in speech recognition,” in Springer Handbook of Auditory Research: The Auditory System at the Cocktail Party, edited by Middlebrooks, J. C., Simon, J. Z., Popper, A. N., and Fay, R. R. (Springer, New York), pp. 75–110.
- Kidd, G., Jr., Mason, C. R., Richards, V. M., Gallun, F. J., and Durlach, N. I. (2008). “Informational masking,” in Springer Handbook of Auditory Research: Auditory Perception of Sound Sources, edited by Yost, W. A., and Popper, A. N. (Springer, New York), pp. 143–190.
- Kidd, G., Jr., Mason, C. R., Swaminathan, J., Roverud, E., Clayton, K. K., and Best, V. (2016). “Determining the energetic and informational components of speech-on-speech masking,” J. Acoust. Soc. Am. 140, 132–144. 10.1121/1.4954748
- Kujawa, S. G., and Liberman, M. C. (2009). “Adding insult to injury: Cochlear nerve degeneration after ‘temporary’ noise-induced hearing loss,” J. Neurosci. 29(45), 14077–14085. 10.1523/JNEUROSCI.2845-09.2009
- Lee, J., Heo, I., Chang, A.-C., Bond, K., Stoelinga, C., Lutfi, R., and Long, G. (2016). “Individual differences in behavioral decision weights related to irregularities in cochlear mechanics,” Adv. Exp. Med. Biol. 894, 457–465.
- Lee, J., and Long, G. (2012). “Stimulus characteristics which lessen the impact of threshold fine structure on estimates of hearing status,” Hear. Res. 283, 24–32. 10.1016/j.heares.2011.11.011
- Lehiste, I., and Peterson, G. E. (1959). “Linguistic considerations in the study of speech intelligibility,” J. Acoust. Soc. Am. 31, 280–286. 10.1121/1.1907713
- Lopez-Poveda, E. A. (2014). “Why do I hear but not understand? Stochastic undersampling as a model of degraded neural encoding of speech,” Front. Neurosci. 8, 348. 10.3389/fnins.2014.00348
- Lutfi, R. A., Gilbertson, L., Chang, A.-C., and Stamas, J. (2013). “The information-divergence hypothesis of informational masking,” J. Acoust. Soc. Am. 134(3), 2160–2170. 10.1121/1.4817875
- Lutfi, R. A., Kistler, D. J., Oh, E. L., Wightman, F. L., and Callahan, M. R. (2003). “One factor underlies individual differences in auditory informational masking within and across age groups,” Percept. Psychophys. 65(3), 396–406. 10.3758/BF03194571
- Lutfi, R. A., and Liu, C. J. (2007). “Individual differences in source identification from synthesized impact sounds,” J. Acoust. Soc. Am. 122, 1017–1028. 10.1121/1.2751269
- Lutfi, R. A., Rodriguez, B., and Lee, J. (2021). “The listener effect in multitalker speech segregation and identification,” Trends Hear. 25, 1–11. 10.1177/23312165211051886
- Lutfi, R. A., Rodriguez, B., Lee, J., and Pastore, T. (2020). “A test of model classes accounting for individual differences in the cocktail-party effect,” J. Acoust. Soc. Am. 148(6), 4014–4024. 10.1121/10.000296
- Lutfi, R. A., Tan, A. Y., and Lee, J. (2018). “Modeling individual differences in cocktail-party listening,” Acta Acust. united Ac. 104, 926–929. 10.3813/AAA.919246
- McLaughlin, D. J., Baese-Berk, M. M., Bent, T., Borrie, S. A., and Van Engen, K. J. (2018). “Coping with adversity: Individual differences in the perception of noisy and accented speech,” Atten. Percept. Psychophys. 80, 1559–1570. 10.3758/s13414-018-1537-4
- Oberfeld, D., and Klöckner-Nowotny, F. (2016). “Individual differences in selective attention predict speech identification at a cocktail party,” eLife 5, e16747. 10.7554/eLife.16747
- Oh, E., and Lutfi, R. A. (1998). “Nonmonotonicity of informational masking,” J. Acoust. Soc. Am. 104, 3489–3499. 10.1121/1.423932
- Peterson, G., and Lehiste, I. (1962). “Revised CNC list for auditory tests,” J. Speech Hear. Disord. 27, 62–70. 10.1044/jshd.2701.62
- Plack, C. J., Barker, D., and Prendergast, G. (2014). “Perceptual consequences of ‘hidden’ hearing loss,” Trends Hear. 18, 1–11. 10.1177/2331216514550621
- Ruggles, D., and Shinn-Cunningham, B. G. (2011). “Spatial selective auditory attention in the presence of reverberant energy: Individual differences in normal-hearing listeners,” JARO 12(3), 395–405. 10.1007/s10162-010-0254-z
- Shinn-Cunningham, B. (2017). “Cortical and sensory causes of individual differences in selective attention ability among listeners with normal hearing thresholds,” J. Speech Lang. Hear. Res. 60(10), 2976–2988. 10.1044/2017_JSLHR-H-17-0080
- Simpson, B. D., Brungart, D. S., Iyer, N., Gilkey, R. H., and Hamil, J. T. (2006). “Detection and localization of speech in the presence of competing speech signals,” in Proceedings of the 12th International Conference on Auditory Display (ICAD2006), pp. 129–133.
- Simpson, B. D., Brungart, D. S., Iyer, N., Gilkey, R. H., and Hamil, J. T. (2007). “Localization in multiple source environments: Localizing the missing source,” in Proceedings of the 13th International Conference on Auditory Display, Montreal, Canada, June 26–29, pp. 280–284.
- Tamati, T. N., Gilbert, J. L., and Pisoni, D. B. (2013). “Some factors underlying individual differences in speech recognition on PRESTO: A first report,” J. Am. Acad. Audiol. 24(7), 616–634. 10.3766/jaaa.24.7.10
- Watson, C. S. (1963). “Signal detection and certain physical characteristics of the stimulus during the observation interval,” Doctoral dissertation, Indiana University, Bloomington, IN.
- Wightman, F. L., and Kistler, D. J. (1989). “Headphone simulation of free-field listening. II: Psychophysical validation,” J. Acoust. Soc. Am. 85, 868–878. 10.1121/1.397558
- Yost, W. A. (2017). “Spatial release from masking based on binaural processing for up to six maskers,” J. Acoust. Soc. Am. 141, 2093–2106. 10.1121/1.4978614
- Yost, W. A., and Pastore, M. T. (2021). “Individual listener differences in azimuthal front-back reversals,” J. Acoust. Soc. Am. 146, 2709–2715. 10.1121/1.5129555
- Yost, W. A., Zhong, X., and Najam, A. (2015). “Judging sound rotation when listeners and sound rotate: Sound source localization is a multisensory process,” J. Acoust. Soc. Am. 138, 3293–3308. 10.1121/1.4935091
- Zhong, X., and Yost, W. A. (2017). “How many images are in an auditory scene?,” J. Acoust. Soc. Am. 141, 2882–2892. 10.1121/1.4981118