Pupillometry shows the effort of auditory attention switching

Daniel R McCloy; Bonnie K Lau; Eric Larson; Katherine A I Pratt; Adrian K C Lee

doi:10.1121/1.4979340

2017 Apr 6;141(4):2440–2451. doi: 10.1121/1.4979340

PMCID: PMC5848839 PMID: 28464660

Abstract

Successful speech communication often requires selective attention to a target stream amidst competing sounds, as well as the ability to switch attention among multiple interlocutors. However, auditory attention switching negatively affects both target detection accuracy and reaction time, suggesting that attention switches carry a cognitive cost. Pupillometry is one method of assessing mental effort or cognitive load. Two experiments were conducted to determine whether the effort associated with attention switches is detectable in the pupillary response. In both experiments, pupil dilation, target detection sensitivity, and reaction time were measured; the task required listeners to either maintain or switch attention between two concurrent speech streams. Secondary manipulations explored whether switch-related effort would increase when auditory streaming was harder. In experiment 1, spatially distinct stimuli were degraded by simulating reverberation (compromising across-time streaming cues), and target-masker talker gender match was also varied. In experiment 2, diotic streams separable by talker voice quality and pitch were degraded by noise vocoding, and the time allotted for mid-trial attention switching was varied. All trial manipulations had some effect on target detection sensitivity and/or reaction time; however, only the attention-switching manipulation affected the pupillary response: greater dilation was observed in trials requiring switching attention between talkers.

I. INTRODUCTION

The ability to selectively attend to a target speech stream in the presence of competing sounds is required to communicate in everyday listening environments. Evidence suggests that listener attention influences auditory stream formation;¹ for listeners with peripheral hearing deficits, changes in the encoding of stimuli often result in impaired stream selection and consequent difficulty communicating in noisy environments.² In many situations (e.g., a debate around the dinner table), it is also necessary to rapidly switch attention among multiple interlocutors—in other words, listeners must be able to continuously update what counts as foreground in their auditory scene, in order to keep up with a lively conversation.

Prior results show that when cueing listeners in a target detection task to either maintain attention to one stream or switch attention to another stream mid-trial, switching attention both reduced accuracy and led to longer response latency even on targets prior to the attentional switch.³ This suggests that the act of preparing or remembering to switch imposes some degree of mental effort or cognitive load that can compromise the success of the listening task. Given that listeners are aware of linguistic cues to conversational turn-taking,⁴ the pre-planning of attention switches (and associated hypothesized load) may be part of ordinary listening behavior in everyday conditions, not just an artifact of laboratory experimentation.

Pupillometry, the tracking of pupil diameter, has been used for over five decades to measure cognitive load in a variety of task types.^5,6 Pupil dilation is an involuntary, time-locked, physiological response that is present from infancy in humans and other animal species. In general, as the cognitive demands of a task increase, pupil dilation of up to about 5–6 mm can be observed up to 1 s after onset of relevant stimuli.^5–7 While this task-evoked pupillary response is slow (∼1 Hz), recent results show that it is possible to track attention and cognitive processes with higher temporal resolution (∼10 Hz) with deconvolution of the pupillary response.^8,9

Prior work has shown that the pupillary response co-varies with differences in memory demands,¹⁰ sentence complexity,¹¹ lexical frequency of isolated written words,¹² or difficulty of mathematical operations.¹³ In the auditory domain, larger pupil dilations have been reported in response to decreased speech intelligibility due to background noise,¹⁴ speech maskers versus fluctuating noise maskers,¹⁵ and severity of spectral degradation of spoken sentences.¹⁶ The pupillary response has also emerged as a measure of listening effort, which has been defined as “the mental exertion required to attend to, and understand, an auditory message,”¹⁷ or, more broadly, as “the deliberate allocation of mental resources to overcome obstacles in goal pursuit when carrying out a task” involving listening.¹⁸ In this guise, pupillometry has been used in several studies to investigate the effects of age and hearing loss on listening effort.^16,19,20

Recent evidence suggests that the pupillary response is also sensitive to auditory attention. Dividing attention between two auditory streams is known to negatively affect performance in psychoacoustic tasks;^21,22 greater pupil dilation and later peak pupil-size latency have also been reported for tasks in which listeners must divide their attention between both speech streams present in the stimulus instead of attending to only one of the two,²² or when the expected location or talker of a speech stream was unknown as opposed to predictable.²³

However, it is unknown whether the greater pupil dilation in divided attention tasks is due to the demands of processing more information or the effort of switching attention back and forth between streams (or both). The present study was designed to test whether auditory attention switches in a strictly selective attention task would elicit mental effort that was detectable using pupillometry. Both experiments involve selective attention to one of two auditory streams (spoken alphabet letters) and a pre-trial cue indicating (1) which stream to attend to and (2) whether to maintain attention on that stream throughout the trial, or switch attention to the other stream at a designated mid-trial gap. In this way, there is no need or advantage for listeners to try to attend both streams throughout the trial, so any increase in pupil dilation seen in the switch attention trials should index the effort due to attention switching, rather than effort due to processing two streams' worth of information. On the assumption that the divided attention results of Koelewijn and colleagues²² were at least partially due to listeners switching back and forth between streams, we predicted greater pupil dilation on trials that required attention switching.

Additionally, the two experiments include manipulations of the stimuli designed to compromise auditory streaming, and thereby make the task of maintaining or switching attention more difficult. We thus expected that the pupillary response would be larger in trials with more degraded stimuli, trials where target and masker streams were harder to distinguish, or trials where the time allocated for switching between streams was shorter. Secondarily, these manipulations provide a test of whether the kind of pupillary response seen in previous studies that required semantic processing of meaningful sentences might also be seen in a simpler, closed-set target detection task. Based on findings showing that harder pitch discrimination trials elicit larger dilations than easier trials,²⁴ and based on findings from Winn and colleagues that differences in dilation to sentences with different degrees of spectral degradation occurred during sentential stimuli as well as in the post-stimulus delay and response period,¹⁶ we expected that the stimulus degradations in and of themselves might also yield larger dilations (in addition to any effect the degradations might have on auditory stream selection).

II. EXPERIMENT 1

Experiment 1 involved target detection in one of two spatially separated speech streams. In addition to the maintain- versus switch-attention manipulation, there was a stimulus manipulation previously shown²⁵ to cause variation in task performance: degradation of binaural cues to talker location (implemented as presence/absence of simulated reverberation). Reduced task performance and greater pupil dilation were predicted for the reverberant condition. This manipulation was incorporated into the pre-trial cue (i.e., on reverberant trials, the cue was also reverberant). Additionally, the voice of the competing talker was varied (either the same male voice as the target talker or a female voice); this manipulation was not signaled in the pre-trial cue. The same-voice condition was expected to degrade the separability of the talkers²⁶ and therefore decrease task performance and increase pupil dilation.

A. Methods

1. Participants

Sixteen adults (ten female, aged 21–35 yr, mean 25.1 yr) participated in experiment 1. All participants had normal audiometric thresholds [20 dB hearing level (HL) or better at octave frequencies from 250 Hz to 8 kHz], were compensated at an hourly rate, and gave informed consent to participate as overseen by the University of Washington Institutional Review Board.

2. Stimuli

Stimuli comprised spoken English alphabet letters from the ISOLET v1.3 corpus²⁷ from one female and one male talker. Mean fundamental frequencies of the unprocessed recordings were 103 Hz (male talker) and 193 Hz (female talker). Letter durations ranged from 351 to 478 ms; letters were silence-padded to a uniform duration of 500 ms, root-mean-square (RMS) normalized, and windowed at the edges with a 5 ms cosine-squared envelope. Two streams of four letters each were generated for each trial, with a gap of 600 ms between the second and third letters of each stream. The letters “A” and “B” were used only in the pre-trial cues (described in Sec. II A 3); the target letter was “O” and letters “IJKMQRUXY” were non-target items. To allow unambiguous attribution of button presses, the letter “O” was always separated from another “O” (in either stream) by at least 1 s; thus there were between zero and two “O” tokens per trial. The position of “O” tokens in the letter sequence was balanced across trials and conditions, with ∼40% of all “O” tokens occurring in the third letter slot (just after the switch gap, since that slot is most likely to be affected by attention switches), and ∼20% in each of the other three timing slots.
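The letter conditioning described above (edge windowing, RMS normalization, silence padding) can be illustrated with a minimal sketch. Several details are assumptions not stated in the text: the sampling rate `fs` is left to the caller, the target RMS value of 0.01 is arbitrary, padding is placed after the letter, the RMS is computed over the speech portion only, and the order of the three steps is chosen for convenience. The function name `condition_letter` is hypothetical.

```python
import numpy as np

def condition_letter(wave, fs, total_dur=0.5, ramp_dur=0.005, target_rms=0.01):
    """Window, RMS-normalize, and silence-pad one letter recording.

    Assumed details (not given in the text): trailing padding, RMS taken
    over the speech portion only, and an arbitrary target RMS.
    """
    wave = np.asarray(wave, dtype=float)
    n_total = int(round(total_dur * fs))
    n_ramp = int(round(ramp_dur * fs))
    if wave.size > n_total:
        raise ValueError("recording is longer than the padded duration")
    # Rising sin^2 ramp; reversed, it is the falling cos^2 ramp.
    ramp = np.sin(0.5 * np.pi * np.arange(n_ramp) / n_ramp) ** 2
    out = wave.copy()
    out[:n_ramp] *= ramp
    out[-n_ramp:] *= ramp[::-1]
    # Scale so that every letter has the same RMS level.
    out *= target_rms / np.sqrt(np.mean(out ** 2))
    # Pad with trailing silence to the uniform 500 ms duration.
    return np.concatenate([out, np.zeros(n_total - out.size)])
```

Applied to every token, a routine of this kind yields letters of equal length and level, so that the two concurrent streams stay time-aligned slot by slot.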

Reverberation was implemented using binaural room impulse responses (BRIRs) recorded by Shinn-Cunningham and colleagues.²⁸ Briefly, an “anechoic” condition was created by processing the stimuli with BRIRs truncated to include only the direct impulse response and exclude reverberant energy, while stimuli for the “reverberant” condition were processed with the full BRIRs. In both conditions, the BRIRs recorded at ±45° for each stream were used, simulating a separation of 90° azimuth between target and masker streams.

3. Procedure

All procedures were performed in a sound-treated booth; illumination was provided only by the LCD monitor that presented instructions and fixation points. Auditory stimuli were delivered via a TDT RP2 real-time processor (Tucker Davis Technologies, Alachua, FL) to Etymotic ER-2 insert earphones at 65 dB sound pressure level (SPL). A white-noise masker with π-interaural-phase was played continuously during experimental blocks at a level of 45 dB SPL, yielding a stimulus-to-noise ratio of 20 dB. The additional noise was included to provide masking of environmental sounds (e.g., friction between subject clothing and earphone tubes) and to provide consistency with follow-up neuroimaging experiments (required due to the acoustic conditions in the neuroimaging suite).

Pupil size was measured continuously during each block of trials at a 1000 Hz sampling frequency using an EyeLink1000 infra-red eye tracker (SR Research, Kanata, ON, Canada). Participants' heads were stabilized by a chin rest and forehead bar, fixing their eyes at a distance of 50 cm from the EyeLink camera. Target detection accuracy and response time were also recorded for comparison with pupillometry data and the results of past studies.

Participants were instructed to fixate on a white dot centered on a black screen and maintain this gaze throughout test blocks. Each trial began with a 1 s auditory cue (spoken letters “AA” or “AB”); the cue was always in a male voice, and its spatial location prompted the listener to attend first to the male talker at that location. The letters spoken in the cue indicated whether to maintain attention to the cue talker's location throughout the trial (“AA” cue) or to switch attention to the talker at the other spatial location at the mid-trial gap (“AB” cue). The cue was followed by 0.5 s of silence, followed by the main portion of the trial: two concurrent four-letter streams with simulated spatial separation and varying talker gender (either the same male voice in both streams or one male and one female voice), with a 600 ms gap between the second and third letters. The task was to respond by button press to the letter “O” spoken by the target talker while ignoring “O” tokens spoken by the competing talker (Fig. 1).

FIG. 1. — (Color online) Illustration of maintain and switch trial types in experiment 1. In the depicted switch trial (heavy dashed line), listeners would hear cue “AB” in a male voice, attend to the male voice (“QU”) for the first half of the trial, switch to the female voice (“OM”) for the second half of the trial, and respond once (to the “O” occurring at 3.1–3.6 s). In the depicted maintain trial (heavy solid line), listeners would hear cue “AA” in a male voice, maintain attention to the male voice (“QUJR”) throughout the trial, and not respond at all. In the depicted trials, a button press anytime during timing slot 2 would be counted as response to the “O” at 2–2.5 s, which is a “foil” in both trial types illustrated; a button press during slot 3 would be counted as response to the “O” at 3.1–3.6 s (which is considered a target in the switch-attention trial and a foil in the maintain-attention trial), and button presses at any other time would be counted as non-foil false alarms. Note that “O” tokens never occurred in immediately adjacent timing slots (unless separated by the switch gap) so response attribution to targets or foils was unambiguous.

Before starting the experimental task, participants heard two blocks of ten trials for familiarization with anechoic and reverberant speech (one with a single talker, one with two simultaneous talkers). Next, listeners did three training blocks of ten trials each (one block of “maintain” trials, one block of “switch” trials, and one block of randomly mixed maintain and switch trials). Training blocks were repeated until participants achieved ≥50% of trials correct on the homogeneous blocks and ≥40% of trials correct on the mixed block. During testing, the three experimental conditions (maintain/switch, anechoic/reverberant speech, and male-male versus male-female talker combinations) were counterbalanced, intermixed within each block, and presented in 10 blocks of 32 trials each for a total of 320 trials.
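As an illustration of the block structure, the 2 × 2 × 2 design gives eight condition cells, and 32 trials per block allows four trials per cell. The sketch below assumes exactly that per-block balance and a simple shuffle within each block; the text states only that conditions were counterbalanced and intermixed, so the exact scheme, the names `FACTORS` and `make_blocks`, and the seed are all assumptions. Balancing of “O” positions is not modeled.

```python
import itertools
import random

# The three two-level trial manipulations of experiment 1.
FACTORS = {
    "attention": ("maintain", "switch"),
    "reverb": ("anechoic", "reverberant"),
    "talkers": ("male-male", "male-female"),
}

def make_blocks(n_blocks=10, reps_per_block=4, seed=0):
    """Return a list of blocks, each a shuffled list of condition dicts.

    Assumes every block holds the same number of trials from each of the
    eight cells (8 cells x 4 repetitions = 32 trials per block).
    """
    rng = random.Random(seed)
    cells = [dict(zip(FACTORS, combo))
             for combo in itertools.product(*FACTORS.values())]
    blocks = []
    for _ in range(n_blocks):
        block = [dict(cell) for cell in cells for _ in range(reps_per_block)]
        rng.shuffle(block)  # intermix conditions within the block
        blocks.append(block)
    return blocks
```

With the default arguments this produces 10 blocks of 32 trials, i.e., 320 trials with 40 trials in each of the eight cells.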

4. Behavioral analysis

Listener responses were labeled as “hits” if the button press occurred between 100 and 1000 ms after the onset of “O” stimuli in the target stream. Responses at any other time during the trial were considered “false alarms.” False alarm responses occurring between 100 and 1000 ms following the onset of “O” stimuli in the masker stream were additionally labeled as “responses to foils” to aid in assessing failures to selectively attend to the target stream. As illustrated in Fig. 1, the response windows for adjacent letters partially overlap in time; responses that occurred during these overlap periods were attributed to an “O” stimulus if possible (e.g., given the trial depicted in Fig. 1, a button press at 3.8 s was assumed to be in response to the “O” at 3.1–3.6 s, and not to the “M”). If no “O” tokens had occurred in that period of time, the response was coded as a false alarm for the purpose of calculating sensitivity, but no reaction time was computed (in other words, only responses to targets and foils were considered in the reaction time analyses).
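The scoring rule above can be sketched as a small function. Times are in seconds relative to cue onset; for the 600 ms gap, the letter onsets implied by the trial timing (1 s cue, 0.5 s silence, 500 ms letters) are 1.5, 2.0, 3.1, and 3.6 s, consistent with Fig. 1. The function name `classify_press` and the choice to check targets before foils are assumptions; because “O” tokens are at least 1 s apart, their response windows never overlap, so that priority should not matter in practice. In the text, responses to foils are a labeled subset of false alarms; the sketch returns them under a separate label for convenience.

```python
def classify_press(press_time, target_onsets, foil_onsets, window=(0.1, 1.0)):
    """Label one button press as 'hit', 'foil', or 'false_alarm'.

    target_onsets / foil_onsets: onset times of "O" tokens in the attended
    and unattended stream. Returns (label, reaction_time); reaction_time is
    None when the press cannot be attributed to any "O" token.
    """
    lo, hi = window
    for label, onsets in (("hit", target_onsets), ("foil", foil_onsets)):
        for onset in onsets:
            rt = press_time - onset
            if lo <= rt <= hi:
                return label, rt
    return "false_alarm", None

# Switch trial of Fig. 1: foil "O" at 2.0 s, target "O" at 3.1 s.
label, rt = classify_press(3.8, target_onsets=[3.1], foil_onsets=[2.0])
```

In this example the press at 3.8 s is attributed to the “O” at 3.1 s, giving a hit with a reaction time of about 0.7 s; in the maintain trial of Fig. 1 the same press would be a response to a foil.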

Listener sensitivity and reaction time were analyzed with (generalized) linear mixed-effects regression models. A model for listener sensitivity was constructed to predict probability of button press at each timing slot (four timing slots per trial; see Fig. 1) from the interaction among the fixed-effect predictors specifying trial parameters (maintain/switch, anechoic/reverberant, and talker gender match/mismatch) and an indicator variable encoding whether a target, foil, or neither was present in the timing slot. A random intercept was also estimated for each listener. An inverse probit link function was used to transform button press probabilities (bounded between 0 and 1) into unbounded continuous values suitable for linear modeling. This model has the convenient advantage that coefficient estimates are interpretable as differences in bias and sensitivity on a $d'$ scale resulting from the various experimental manipulations.^29–31 Full model specifications are given in Eqs. (1) and (3) of the supplementary material;³⁹ the general form of this model is given here in Eq. (1), where Φ⁻¹ is the inverse probit link function, Pr(Y = 1) is the probability of button press, X is the design matrix of trial parameters and indicator variables, and β is the vector of parameter coefficients to be estimated:

$$\Phi^{-1}\bigl(\Pr(Y = 1 \mid X)\bigr) = X'\beta \tag{1}$$
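To see why coefficients of such a model read on a $d'$ scale, consider the simplest case with only an intercept and a target indicator: the difference between Φ⁻¹ of the press probability with a target present and Φ⁻¹ of the press probability with no target is the classical $d'$, and the intercept plays the role of a bias term. The sketch below uses Python's standard-library normal distribution; the rates are hypothetical illustration values, not data from this study, and `dprime` is a hypothetical helper name.

```python
from statistics import NormalDist

def dprime(hit_rate, fa_rate):
    """Classical sensitivity index: z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical rates: 90% presses to targets, 10% presses otherwise.
print(round(dprime(0.9, 0.1), 2))  # prints 2.56
```

The full mixed-effects model generalizes this difference by letting it vary with the trial parameters and by adding a random intercept per listener.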

Reaction time was analyzed using linear mixed-effects regression (i.e., with identity link function), but was otherwise analyzed similarly to listener sensitivity. Significance of predictors in the reaction time model was computed via F-tests using the Kenward-Roger approximation for degrees of freedom; significance in the sensitivity model was determined by likelihood ratio tests between models with and without the predictor of interest (as the Kenward-Roger approximation has not been demonstrated to work with non-normally-distributed response variables, i.e., when modeling probabilities). See Secs. III A and III B and Tables I–III of the supplementary material³⁹ for full details.

5. Analysis of pupil diameter

Recordings of pupil diameter for each trial were epoched from −0.5 to 6 s, with 0 s defined as the onset of the pre-trial cue. Periods where eye blinks were detected by the EyeLink software were linearly interpolated from 25 ms before blink onset to 100 ms after blink offset. Epochs were normalized by subtracting the mean pupil size between −0.5 and 0 s on each trial, and dividing by the standard deviation of pupil size across all trials (to allow pooling across subjects). Normalized pupil size data were then deconvolved with a pupil impulse response kernel.^8,9 Briefly, the pupil response kernel represents the stereotypical time course of a pupillary response to an isolated stimulus, modeled as an Erlang gamma function with empirically determined parameters t_max (latency of response maximum) and n (Erlang shape parameter).⁷ The parameters used here were t_max = 0.512 s and n = 10.1, following previous literature.^7,9
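The normalization and kernel described above can be sketched as follows. The kernel uses the commonly cited Erlang gamma form h(t) = tⁿ·exp(−n·t/t_max), which peaks at t = t_max; the text names the function family and its parameters but does not print the formula, so that form is an assumption here. The 4 s kernel length, the peak normalization, and the reading of “standard deviation of pupil size across all trials” as the SD of all raw samples from one subject are also assumptions, as are the function names.

```python
import numpy as np

def pupil_kernel(fs, t_max=0.512, n=10.1, dur=4.0):
    """Erlang gamma pupil impulse response, scaled to a peak of 1."""
    t = np.arange(int(round(dur * fs))) / fs
    h = t ** n * np.exp(-n * t / t_max)  # maximum falls at t = t_max
    return t, h / h.max()

def normalize_epochs(epochs, times, baseline=(-0.5, 0.0)):
    """Baseline-correct each trial and scale by the SD over all trials.

    epochs: array (n_trials, n_times) of raw pupil size for one subject.
    times:  array (n_times,) of sample times in s relative to cue onset.
    """
    epochs = np.asarray(epochs, dtype=float)
    times = np.asarray(times, dtype=float)
    mask = (times >= baseline[0]) & (times < baseline[1])
    centered = epochs - epochs[:, mask].mean(axis=1, keepdims=True)
    return centered / epochs.std()  # assumed: SD of all raw samples
```

Dividing by a per-subject SD puts all listeners on a common scale, which is what allows pooling across subjects.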

Fourier analysis of the subject-level mean pupil size data and the deconvolution kernel indicated virtually no energy at frequencies above 3 Hz, so for computational efficiency the deconvolution was realized as a best-fit linear sum of kernels spaced at 100 ms intervals (similar to downsampling both signal and kernel to 10 Hz prior to deconvolution), as implemented in the pyeparse software.³² After deconvolution, the resulting time series can be thought of as an indicator of mental effort that is time-aligned to the stimulus (i.e., the response latency of the pupil has been effectively removed). Statistical comparison of deconvolved pupil dilation time series (i.e., “effort” in Figs. 4 and 8) was performed using a non-parametric cluster-level one-sample t-test on the within-subject differences in deconvolved pupil size between experimental conditions (clustering across time only),³³ as implemented in mne-python.³⁴
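The idea of realizing the deconvolution as a best-fit linear sum of kernels spaced at 100 ms can be illustrated with ordinary least squares. This is an independent minimal sketch, not the pyeparse implementation: the design-matrix construction, the kernel formula and length, and the function names are assumptions.

```python
import numpy as np

def pupil_kernel(fs, t_max=0.512, n=10.1, dur=4.0):
    """Erlang gamma pupil impulse response (assumed form), peak of 1."""
    t = np.arange(int(round(dur * fs))) / fs
    h = t ** n * np.exp(-n * t / t_max)
    return h / h.max()

def deconvolve_lstsq(pupil, fs, spacing=0.1):
    """Fit pupil ~ sum_k w[k] * kernel delayed by k * spacing seconds.

    Returns (times, weights): one weight per 100 ms step, which can be
    read as the time-aligned 'effort' series.
    """
    pupil = np.asarray(pupil, dtype=float)
    kernel = pupil_kernel(fs)
    step = int(round(spacing * fs))
    starts = np.arange(0, pupil.size, step)
    # Each column of the design matrix is the kernel delayed to one start.
    design = np.zeros((pupil.size, starts.size))
    for col, start in enumerate(starts):
        seg = kernel[: pupil.size - start]
        design[start:start + seg.size, col] = seg
    weights, *_ = np.linalg.lstsq(design, pupil, rcond=None)
    return starts / fs, weights
```

Because the weights are indexed by kernel onset time rather than by the time of peak dilation, the fitted series is aligned to the stimulus events instead of lagging them by the pupil's response latency.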

FIG. 4. — (Color online) Deconvolved pupil size (mean ± 1 standard error across subjects) for (a) reverberant versus anechoic trials, (b) talker gender-match versus -mismatch trials, and (c) maintain- versus switch-attention trials, with trial schematics showing the time course of stimulus events (compare to Fig. 1). Hatched region shows temporal span of statistically significant differences between time series. The onset of statistically significant divergence (vertical dotted line) of the maintain/switch conditions is in close agreement with the end of the cue. a.u. = arbitrary units (see Sec. II A 5 for explanation of effort).

FIG. 8. — (Color online) Deconvolved pupil size (mean ±1 standard error across subjects) for (a) 10- versus 20-band vocoded stimuli, (b) 200 versus 600 ms mid-trial switch gap durations, and (c) maintain- versus switch-attention trials, with trial schematics showing the time course of stimulus events (compare to Fig. 5). Hatched region shows temporal span of statistically significant differences between time series. The late-trial divergence in (b) is attributable to the delay of stimulus presentation in the long-gap condition; the onset of divergence in (c) aligns with the end of the cue, as in experiment 1 [see Fig. 4(c)]. a.u. = arbitrary units (see Sec. II A 5 for explanation of effort).

B. Results

1. Sensitivity

Over all trials, sensitivity ($d'$) ranged across subjects from 1.7 to 4.2 (first quartile 1.9, median 2.4, third quartile 3.0). Box-and-swarm plots displaying quartile and individual differences in $d'$ values between experimental conditions are shown in Fig. 2. Note that $d'$ is an aggregate measure of sensitivity that does not distinguish between responses to foil items versus other types of false alarms; however, the statistical model does separately estimate significant differences between experimental conditions for both target response rate and foil response rate, and also estimates a bias term for each condition that captures non-foil false alarm response rates.

The model indicated significant main effects for all three trial type manipulations, as seen in Fig. 2(a), with effect sizes around 0.2–0.3 on a $d'$ scale. Model results indicate that the attentional manipulation led to more responses to both targets (Wald z = 5.23, p < 0.001) and foils (Wald z = 2.82, p = 0.005) in maintain- versus switch-attention trials, though the net effect was an increase in $d'$ in the maintain attention condition for nearly all listeners. The model also showed a significant difference in response bias in the attentional contrast (Wald z = −2.57, p = 0.01), with responses more likely in the switch- than the maintain-attention condition. In fact, there were slightly fewer total button presses in the switch-attention trials, but there were more non-foil false alarm responses in those trials. This suggests that the bias term is in fact capturing a difference in non-foil false alarm responses (i.e., presses that are not captured by terms in the model equation encoding responses to targets and foils).

Regarding reverberation, listeners were better at detecting targets in the anechoic trials (Wald z = 3.08, p = 0.002), but there was no significant difference in response to foils between anechoic and reverberant trials. Regarding talker gender (mis)match, the model indicated both better target detection (Wald z = 2.43, p = 0.015) and fewer responses to foils (Wald z = −2.31, p = 0.021) when the target and masker talkers were different genders. The model also indicated a two-way interaction for target detection between reverberation and talker gender (Wald z = −2.09, p = 0.036); this can be seen in Fig. 2(b): the difference between anechoic and reverberant trials was smaller when the target and masker talkers were of different genders. The three-way interaction among attention, reverberation, and talker gender was not significant.

To address the concern that listeners might have attempted to monitor both streams, and especially that they might do so differently in maintain- versus switch-attention trials, the rate of listener response to foil items was examined separately for each timing slot. Foil response rates ranged from 1% to 4% for slots 1 and 2 (before the switch gap), and from 9% to 15% for slots 3 and 4 (after the switch gap), but showed no statistically reliable difference between maintain- and switch-attention trials for any of the four slots (see Sec. III D 1 of the supplementary material³⁹ for details).

2. Reaction time

Over all correct responses, median reaction time for each subject ranged from 434 ms to 692 ms after the onset of the target letter. Box-and-swarm plots showing quartile and individual differences in reaction time values between experimental conditions are shown in Fig. 3. The statistical model indicated significant main effects of attentional condition, reverberation, and talker gender mismatch. Faster response times were seen for targets in maintain-attention trials [9 ms faster on average, F(1, 5868.1) = 4.45, p = 0.035], anechoic trials [13 ms faster, F(1, 5868.1) = 9.35, p = 0.002], and trials with mismatched talker gender [25 ms faster, F(1, 5868.2) = 35.74, p < 0.001]. The model showed no significant interactions in reaction time among these trial parameters.

FIG. 3. — (Color online) Box-and-swarm plots of between-condition differences in reaction time for experiment 1. Boxes show first and third quartiles and median values; individual data points correspond to each listener; asterisks indicate comparisons with corresponding coefficients in the statistical model that were significantly different from zero. (a) Main effects of attention (faster reaction time in maintain than switch trials), reverberation (faster reaction time in anechoic than reverberant trials), and talker gender (mis)match (faster reaction time in trials with different-gendered target and masker talkers). (b) Two-way interactions (no statistically significant differences). (c) Three-way interaction (no statistically significant difference). * = p < 0.05; ** = p < 0.01; *** = p < 0.001; MM = matching talker genders; MF = mismatched talker genders.

Post hoc analysis of reaction time by response slot showed no significant differences for the reverberation contrast. For the talker gender (mis)match contrast and the maintain- versus switch-attention contrasts, there were significant differences only in slot 3 (see Sec. III D 2 of the supplementary material³⁹ for details). This is consistent with a view that the act of attention switching creates a lag or slow-down in auditory perception.³

3. Pupillometry

Mean deconvolved pupil diameter as a function of time for the three stimulus manipulations (reverberant/anechoic trials, talker gender match/mismatch trials, and maintain/switch attention trials) is shown in Fig. 4. Only the attentional manipulation shows a significant difference between conditions, with “switch attention” trials showing greater pupillary response than “maintain attention” trials in the time range from 1.0 to 5.5 s (t_crit = 2.13, p < 0.001; see Sec. III C and Table IV of the supplementary material³⁹ for full statistical details). The time courses diverge as soon as listeners have heard the cue, and the response remains significantly higher in the switch-attention condition throughout the remainder of the trial.
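For readers unfamiliar with the cluster-level test reported here (Sec. II A 5), the following is a minimal sketch of a one-sample cluster permutation test over time using random sign flips of the within-subject differences. It is a simplified stand-in written for illustration, not the mne-python routine used in the study; the function name, the use of summed t-values as cluster mass, and the number of permutations are assumptions. The threshold of 2.13 matches the reported t_crit.

```python
import numpy as np

def cluster_perm_1samp(diffs, threshold, n_perm=1000, seed=0):
    """One-sample cluster permutation test, clustering across time only.

    diffs: array (n_subjects, n_times) of within-subject differences.
    Returns a list of (start, stop, p_value), one entry per cluster.
    """
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    n_subj = diffs.shape[0]

    def t_stat(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n_subj))

    def find_clusters(t):
        # Runs of consecutive samples beyond the threshold, per sign.
        found = []
        for sign in (1, -1):
            above = np.append(sign * t > threshold, False)
            start = None
            for i, flag in enumerate(above):
                if flag and start is None:
                    start = i
                elif not flag and start is not None:
                    found.append((start, i, abs(t[start:i].sum())))
                    start = None
        return found

    observed = find_clusters(t_stat(diffs))
    # Null distribution: largest cluster mass after random sign flips.
    null_max = np.zeros(n_perm)
    for p in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(n_subj, 1))
        masses = [m for _, _, m in find_clusters(t_stat(diffs * flips))]
        null_max[p] = max(masses, default=0.0)
    return [(start, stop, float(np.mean(null_max >= mass)))
            for start, stop, mass in observed]
```

Each returned cluster spans sample indices `start` to `stop - 1`; comparing every observed cluster against the distribution of the largest permuted cluster controls the family-wise error rate across time points.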

C. Discussion

The models of listener sensitivity and reaction time showed main effects in the expected directions for all three manipulations: put simply, listener sensitivity was better and responses were faster when the talkers had different voices, when there was no reverberation, and when mid-trial switching of attention was not required. The difference between anechoic and reverberant trials was smaller in trials where the talkers had different voices, suggesting that the advantage of anechoic conditions and the advantage due to talker voice differences are not strictly additive. A possible explanation for this finding is that either talker voice difference or anechoic conditions are sufficient to support auditory source separation and streaming,^25,26 but the presence of both conditions cannot overcome difficulty arising from other aspects of the task. Conversely, one might say that both segregating two talkers with the same voice and segregating two talkers in highly reverberant conditions are hard tasks, which when combined make for a task even more difficult than would be expected if the manipulations were additive (i.e., reverberation hurt performance more when both talkers were male).

Unlike listener sensitivity and reaction time, the pupillary response differed only in response to the attentional manipulation. Interestingly, the difference in pupillary response was seen across the entire trial, whereas the reaction time difference for the maintain-versus-switch contrast was restricted to slot 3 (the immediately post-switch time slot). The fact that patterns of pupillary response do not recapitulate patterns of listener behavior would make sense if, for normal-hearing listeners, reverberation and talker gender mismatch are not severe enough degradations to cause sufficient extra mental effort or cognitive load to be observable in the pupil (in other words, the pupillary response may reflect the same processes as the behavioral signal, but may not be as sensitive). However, the magnitude of the effect size in $d'$ is roughly equal for all three trial parameters [see Fig. 2(a)]; if behavioral effect size reflects degree of effort or load, then the explanation that pupillometry is just “not sensitive enough” seems unlikely. Another possibility is that the elevated pupil response is simply due to a higher number of button presses in the switch trials: motor planning and execution are known to cause pupillary dilations.³⁵ However, as mentioned in Sec. II B 1, the total number of button presses is in fact higher in the maintain-attention condition. A third possibility is that the pupil dilation only reflects certain kinds of effort or load, and that stimulus degradations that mainly affect listener ability to form and select auditory streams are not reflected in the pupillary response, whereas differences in listener attentional state, such as preparing for a mid-trial attention switch, are reflected by the pupil. Experiment 2 tests this last explanation by repeating the maintain/switch manipulation while increasing stimulus degradation to further impair formation and selection of auditory streams.

III. EXPERIMENT 2

Since no effect of talker gender on pupil dilation was seen in experiment 1, in experiment 2 the target and masker talkers were always of opposite gender, and their status as initial target or masker was counterbalanced across trials. Since no effect of reverberation on pupillary response was seen in experiment 1, experiment 2 also removed the simulated spatial separation of talkers and involved a more severe cued stimulus degradation known to cause variation in task demand: spectral degradation implemented as variation in number of noise-vocoder channels, 10 or 20. Based on results from Winn and colleagues showing increased dilation for low versus high numbers of vocoder channels with full-sentence stimuli,¹⁶ greater pupil dilation was expected here in the (more difficult, lower-intelligibility) ten-channel condition. As in experiment 1, a pre-trial cue indicated whether to maintain or switch attention between talkers at the mid-trial gap; here, the cue also indicated whether spectral degradation was mild or severe (i.e., the cue underwent the same noise vocoding procedure as the main portion of the trial).

Additionally, in experiment 2 the duration of the mid-trial temporal gap provided for attention switching was varied (either 200 ms or 600 ms). Behavioral and neuroimaging research suggests that the time course of attention switching in the auditory domain is around 300–400 ms;^3,36 accordingly, we expected the short gap trials to be challenging and thus predicted greater pupil dilation in short-gap trials (though only in the post-gap portion of the trial). The duration of the gap was not predictable from the pre-trial cue.