PLOS One. 2021 Mar 3;16(3):e0233251. doi: 10.1371/journal.pone.0233251

Disentangling listening effort and memory load beyond behavioural evidence: Pupillary response to listening effort during a concurrent memory task

Yue Zhang 1,2,3,4,*,#, Alexandre Lehmann 1,2,3,4,#, Mickael Deroche 1,2,3,4,5,#
Editor: Claude Alain 6
PMCID: PMC7928507  PMID: 33657100

Abstract

Recent research has demonstrated that pupillometry is a robust measure for quantifying listening effort. However, pupillary responses in listening situations where multiple cognitive functions are engaged and sustained over a period of time remain hard to interpret. This limits our conceptualisation and understanding of listening effort in realistic situations, because rarely in everyday life are people challenged by one task at a time. Therefore, the purpose of this experiment was to reveal the dynamics of listening effort in a sustained listening condition using a word repeat and recall task. Words were presented in quiet and in speech-shaped noise at different signal-to-noise ratios (SNR): 0dB, 7dB, 14dB and quiet. Participants were presented with lists of 10 words, and required to repeat each word after its presentation. At the end of the list, participants either recalled as many words as possible or moved on to the next list. Simultaneously, their pupil dilation was recorded throughout the whole experiment. When only word repeating was required, peak pupil dilation (PPD) was bigger at 0dB than in the other conditions; whereas when recall was required, PPD showed no difference among SNR levels and PPD at 0dB was smaller than in the repeat-only condition. Baseline pupil diameter and PPD followed different variation patterns across the 10 serial positions within a block for conditions requiring recall: baseline pupil diameter built up progressively and plateaued in the later positions (but shot up when listeners were recalling the previously heard words from memory); PPD decreased at a quicker pace than in the repeat-only condition. The current findings demonstrate that additional cognitive load during a speech intelligibility task could disturb the well-established relation between pupillary response and listening effort. Both the magnitude and temporal pattern of task-evoked pupillary response differ greatly in complex listening conditions, calling for more listening effort studies in complex and realistic listening situations.

1 Introduction

Effortless as it seems, everyday communication is cognitively demanding. Degraded speech input induced by adverse listening conditions (e.g., background noise, reverberation etc.) and peripheral hearing loss introduces mismatch between perceived acoustic signals and their canonical forms [1-3]. Resolving this mismatch demands more resources from the finite pool of cognitive resources, leading to fewer resources for other cognitive tasks and eventually overload [4, 5]. Populations facing long-term auditory challenges are specifically at risk. For instance, people with hearing impairment and particularly those using cochlear implants (CI) often experience high and sustained effort, even when speech communication reaches a satisfactory level [6-10]. CI listeners have to engage and deploy more cognitive resources to achieve a satisfactory level of speech communication due to electric hearing. Such elevated and sustained listening effort is associated with detrimental psychosocial consequences including greater need for recovery after work, increased incidence of sick leave and social interaction withdrawal [6, 11-13]. Therefore, there is a growing interest in the field of hearing science to conceptualise and quantify listening effort during speech perception for different populations.

Pupillometry (the continuous recording of changes in pupil diameter) has been one of the most widely used methods for assessing listening effort. Its popularity can be attributed to its sensitivity to a wide range of cognitive tasks and processing that relate to the concept of listening effort [3, 14, 15]. Past studies have shown that pupil size varies with different speech intelligibility, hearing impairment, lexical manipulation, masker type, spectral resolution, memory load and divided/focused attention [4, 16-25]. Typically, when task demands increase, for instance, with lower SNR, degraded spectral resolution or more digits to remember, pupil size increases. However, when the task becomes so demanding that it exceeds the capacity limit, pupil size stops increasing and/or starts decreasing, forming a relation similar to an inverse-U shape between task demands and listening effort [14, 26-32].

Because pupil size variation is the result of a complex interplay between the parasympathetic and sympathetic systems, pupillometry can also reveal aspects of listening effort relating to fatigue, motivation and arousal [33-36]. For instance, Wang et al. [36] showed a negative correlation between the need for recovery and peak pupil dilation relative to the baseline (PPD), supporting the assumption that high fatigue could be related to a reduced state of arousal (hence smaller pupillary response) [37]. Furthermore, pupillometry has a reasonable temporal locking to cognitive events, with some delay due to the slow locus coeruleus (LC)-norepinephrine (NE) response. Typically, the peak of event-evoked pupillary dilation arrives within the time window from 0.7 to 1.5 sec following the target stimuli [22, 38, 39]. This allows pupillary response to show trial-by-trial and within-trial variation in listening effort, which can reveal the underlying cognitive processing and allocation policy that are hardly measurable via behavioural outcomes. For instance, pupil size typically decreases with increasing trial/block numbers within one condition, suggesting fatigue or habituation with similar stimuli and task [17, 40-42]; it also varies with the level of engagement that changes from one trial to the next [43].

Due to these multiple influences on pupillary responses, there is only limited understanding of how pupil size varies in complex situations, where multiple cognitive functions are engaged and effort sustained over a period of time. Rarely in everyday life are people challenged by one task at a time. Even in a simple conversation, one needs to decode the incoming speech input embedded in various types of background noise, retain some information for mental processing, ponder over the best choices of words and articulate a verbal response, all of which require sustained cognitive processing over time. Understanding pupillary response to speech understanding in those situations is essential to conceptualise and quantify listening effort in ecological conditions, especially in the case of hearing aid or cochlear implant users.

Specifically, the relation between single task demand and pupil dilation has been shown and well-replicated in studies manipulating speech intelligibility and memory load [18, 26-31]. However, there are only a handful of pupillometry studies involving multiple and sustained tasks within hearing science. For instance, Karatekin et al. [16] found that pupil diameter increased progressively with more digits to remember during a digit span task and a dual task (digit span with visual response time task), but the rate of increase was shallower in the dual task than in the single task. McGarrigle et al. [44] asked NH participants to listen to one short passage per trial, presented with multi-talker babble noise, and at the end of each passage judge whether images presented on the screen were mentioned in the previous passage. A steeper decrease in (baseline corrected) pupil size was found for the difficult than for the easy SNR, but only in the second paragraph. This was interpreted as an index of the onset of fatigue in listening conditions requiring sustained effort. However, paragraphs were between 13 and 18s long and the target word was periodically varied inside the paragraph, making it difficult to measure directly the pupillary response evoked by recognising and encoding the target item. In Zekveld et al. [45], participants had to recall the four-word cues (either related or unrelated to the following sentence) presented visually before the onset of the sentence embedded in a speech masker. The 7dB SNR difference between two sentence-in-noise conditions (-17dB and -10dB) elicited a difference in intelligibility, but not in peak and mean pupil dilation. This contrasted with the well-established effect of auditory task demands on the pupillary response, suggesting that an external cognitive load (i.e., memory) during speech processing could nullify the intelligibility effects on pupil dilation response. 
Overall, these studies point to a complicated, but under-investigated, relation between the speech task and the pupil dilation response, when other cognitive task load is present.

Therefore, the current experiment starts addressing the lack of systematic investigation on the dynamics of pupillary response in complex and sustained listening situations. To do so, we designed a behavioural paradigm including two TASKs with different demands in cognitive resources: a repeat-only condition where participants listen and repeat one word consecutively for ten words, and a repeat-with-recall condition where after listening and repeating each of the ten words, they need to recall as many words as possible at the end of the tenth word. Using words instead of digits or paragraphs, the paradigm utilises natural speech, yet still provides precise time-locking to the canonical task-evoked pupil response. The recall task poses a substantial and sustained requirement of cognitive resources (attention and working memory) that are also essential for speech understanding: participants had to complete both word recognition and memorising tasks within the same time window, and keep retaining more words in the memory until the end of the list. The task difficulty was further manipulated by embedding words in different levels of speech-shaped noise to compare pupillary responses under high and low listening effort (LISTENING condition). Simultaneously, pupil size variations were recorded. Participants’ subjective ratings on effortfulness were also collected, and results were correlated with individuals’ behavioural and pupillary responses. This analysis helps to disentangle further pupil responses corresponding to word recognition and memory, by identifying pupillary metrics that are significantly related to word recognition, recall and self-rating performance.

The main hypotheses were:

  • Fewer words correctly repeated in difficult versus easy SNR conditions due to more degraded acoustic input, and fewer stated words recalled with more adverse SNRs due to limited cognitive capacity to prioritise the word recognition task.

  • Bigger pupillary response in difficult versus easy SNR conditions, due to more degraded acoustic input. Bigger pupillary response in repeat-with-recall versus repeat-only condition: bigger baseline pupil diameter due to accumulating memory load and bigger PPD due to greater cognitive demands. This difference might also depend on the serial position.

  • Quick and large increase in pupil diameter when listeners were prompted to recall the words previously heard (similar to Cabestrero et al. [28]), and possibly bigger increase in difficult versus easy SNR conditions.

  • Higher self-report effort in difficult SNR and repeat-with-recall conditions, reflecting the increased subjective experience of effort for conditions with more degraded acoustic input and sustained effort.

2 Methods and results

2.1 Methods: Participants

Data were collected from 25 adults (age range: 18-49 years; average: 29 years; 17 females). Pure tone audiometry was administered to ensure that all participants had binaural thresholds at or better than 25 dB HL at 0.25, 0.5, 1, 2, 4 and 8 kHz. All participants were native speakers of either French or North American English (10 French speakers). The study was always run in their native language. This work received ethical approval from the McGill University Faculty of Medicine Research Ethics Board (IRB) under the number A05-B11-18B. Prior to the experiment, participants were given enough time to read the information sheet and consent form approved by the ethics board. All gave written informed consent for their participation.

2.2 Methods: Stimuli

Stimuli were standard CNC words recorded from a male American English Speaker (mean duration = 0.61s, SD = 0.08s) and monosyllabic Fournier words [46] recorded from a male French Speaker (mean duration = 0.63s, SD = 0.11s). Words were fully randomised, grouped into lists of 10 and occurred only once in each list. Altogether, 240 words were used. They were then masked by speech-shaped noise (filtered on the long-term excitation pattern of the entire material, respectively in English or French) at three SNR levels of 0 dB, 7 dB and 14 dB. A quiet condition was also included, making a total of 4 LISTENING conditions.

Each LISTENING condition (three SNRs and quiet) was paired with TASK condition (repeat-only and repeat-with-recall) and was repeated three times (using three different word lists), making a total of 4x2x3 = 24 test blocks. Condition sequences and word lists were fully randomised for each participant.
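The block structure described above can be sketched in a few lines of Python. This is an illustrative reconstruction and not the authors' Matlab code: the function name `build_blocks`, the seed argument and the assumption that each of the 240 words is used exactly once per session (24 blocks of 10 words) are ours.

```python
import itertools
import random

LISTENING = ["0 dB", "7 dB", "14 dB", "quiet"]
TASK = ["repeat-only", "repeat-with-recall"]
N_REPEATS = 3          # three word lists per LISTENING x TASK cell
WORDS_PER_LIST = 10


def build_blocks(words, seed=None):
    """Return a randomised sequence of 4 x 2 x 3 = 24 blocks.

    Assumption: every word is used once per session, so 240 words are needed.
    """
    rng = random.Random(seed)
    conditions = [
        cell
        for cell in itertools.product(LISTENING, TASK)
        for _ in range(N_REPEATS)
    ]
    needed = len(conditions) * WORDS_PER_LIST
    if len(words) < needed:
        raise ValueError(f"need at least {needed} words, got {len(words)}")

    pool = list(words)
    rng.shuffle(pool)        # randomise which words go into which list
    rng.shuffle(conditions)  # randomise the order of conditions

    blocks = []
    for i, (listening, task) in enumerate(conditions):
        start = i * WORDS_PER_LIST
        blocks.append({
            "listening": listening,
            "task": task,
            "words": pool[start:start + WORDS_PER_LIST],
        })
    return blocks
```

Calling `build_blocks(word_pool, seed=participant_id)` with a hypothetical `word_pool` of 240 items would give one reproducible sequence per participant.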

2.3 Methods: Procedure

During the test, participants sat on a chair in a soundproof room, 2m in front of a 35-inch screen monitor and wearing an infrared binocular eyetracker (Tobii Glasses Pro2, 100 Hz sampling rate). The room and screen luminance levels were adjusted to reach 75lx (measured using a luxometer with the sensor positioned at the same height as the participants’ left eye and facing the screen). The luminance levels were fixed throughout the experiment, to avoid changes in light level inducing task-unrelated pupillary response. Participants were instructed to minimise sudden movements that could dislocate the head-mounted eyetracker. The eyetracker was calibrated for each participant, and again during the experiment if participants made big postural changes. Pupil diameter (in mm), timestamps, horizontal and vertical gaze positions of the eyes were collected from the eyetracker outputs. All audio stimuli were presented through a Beyer Dynamics DT 990 Pro headphone via an external soundcard (Edirol UA), calibrated at 65 dB SPL using a 1kHz pure tone. Experiments were run in Matlab 2016b, using Psychtoolbox and custom software.

After demonstrating the task and explaining the procedure, participants practised with one repeat-with-recall condition at 14dB SNR to familiarise themselves with the test sequence and requirements for pupil recording.

The SNR of each block was achieved by fixing the masker level and adjusting the target level. In the quiet condition, the speech level was set at 65 dB SPL. In this way, listeners could not estimate the upcoming block difficulty based on the noise level that was presented first (except for the quiet condition) [32]. The SNR of each trial within a 10-word block was held constant. Before each test block, participants were notified by words on the screen to either recall (printed in red) or not recall (printed in black) at the end of the ten words. Condition sequences were fully randomised for each participant. 3s after the notification, a black fixation cross appeared and stayed for another 1s, to indicate the start of the first trial and eliminate any carry-over effect from reading the coloured words in the pre-block notification. In each trial within the block, the presentation of speech-shaped noise masker (or quiet in the quiet condition) started 1.5s before the onset of the word. Participants were instructed to fixate on the black fixation cross displayed at the centre of the screen. After 1.5s, the word was played, and the presentation of the masker noise (or quiet in the quiet condition) was turned off 1s after the word offset. Upon the masker offset, the fixation cross turned into a circle, and this prompted participants to repeat back the word. They were instructed to fixate on the black circle during the verbal response. The experimenter then typed down the repeated word and pressed ENTER to proceed to the next trial. Words were scored automatically based on whether the characters typed matched the transcripts. The experimenter was always presented with the intended word on the Matlab interface, so potential homophones were scored as correct. No fixed time was enforced on the participants and experimenter to repeat back and type down the correct word. Both the participants and the experimenter were instructed to take time. 
This was to avoid extra mental stress and ensure the correct scoring of word recognition and recall performance. On average, it took 2.11s (SD = 1.08s) from the onset of the prompt cue to the onset of the next trial.
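Fixing the masker level and adjusting the target level amounts to scaling each word so that its RMS level sits at the desired SNR above the masker RMS. The sketch below illustrates that arithmetic only; the function names are hypothetical, and RMS-based level matching is an assumption, since the text does not state how levels were computed.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def target_gain(word, masker, snr_db):
    """Linear gain for `word` so that 20*log10(rms(word)/rms(masker)) = snr_db.

    The masker is left untouched, as in the procedure described above.
    """
    return (rms(masker) / rms(word)) * 10 ** (snr_db / 20)


def measured_snr_db(word, masker):
    """SNR in dB between two signals, based on their RMS levels."""
    return 20 * math.log10(rms(word) / rms(masker))
```

For example, a gain computed with `snr_db=0` brings the word to the same RMS level as the masker, and each additional 7 dB multiplies the gain by roughly 2.24.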

In blocks requiring recall, 2s after the end of the 10th trial, the word RECALL appeared on the screen followed by a black circle to prompt the participants to recall as many words as possible from the previous 10 words in any order. Participants were instructed to fixate on the black circle during recall. Their responses were typed down by the experimenter and scored automatically based on character matching with the response typed during word repeat. Therefore, correctly recalled words would include words that were correctly recalled misperceptions (similar to [47]), dissociating the impact of intelligibility from recall performance.
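The stated-word scoring rule can be made concrete with a short sketch: recall is scored against what the listener said during the repeat phase, not against the word that was played. The function name, the case and whitespace normalisation, and the decision to count a repeated recall of the same word only once are assumptions added for illustration.

```python
def score_recall(repeated, recalled):
    """Count recalled words that match a word stated during the repeat phase.

    `repeated` holds the responses typed during word repeat (including
    misperceptions); `recalled` holds the responses typed during free recall.
    """
    stated = {w.strip().lower() for w in repeated}
    already_counted = set()
    n_correct = 0
    for w in recalled:
        key = w.strip().lower()
        if key in stated and key not in already_counted:
            already_counted.add(key)
            n_correct += 1
    return n_correct
```

If a listener heard "pat", repeated "bat", and later recalled "bat", the recall counts as correct under this rule even though recognition was scored as incorrect.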

At the end of each block (containing a list of 10 words and a recall session if it was a repeat-with-recall condition), participants were asked verbally to rate how effortful the last block was on a scale from 1 to 10, with 10 being the most effortful. Their subjective ratings were typed down by the experimenter. An illustration of the test sequence is shown in Fig 1.

Fig 1. Test sequence in a block.

Before each block, participants were presented with either the words ‘please listen, repeat and recall’ in red or the words ‘please listen, repeat and no recall’ in black against a white screen, indicating whether the incoming block was a repeat-with-recall or repeat-only condition. 3s after the words notification, a black fixation cross appeared and stayed for another 1s, to signal the start of the first trial. The trial started with acoustic presentation of 0.5s speech-shaped noise (or quiet in the quiet condition) and visual presentation of a black fixation cross (‘intertrial’). Another 1s of baseline measurement followed, with the same acoustic and visual presentation (‘baseline’). The word was then played at 1.5s into the trial, followed by noise presentation (or quiet in the quiet condition) for 1s (‘waitpeak’), with the same visual presentation. Upon the offset of ‘waitpeak’, the black fixation cross turned into a black circle to prompt listeners to repeat back the word (‘repeat’). If the block was a repeat-with-recall condition, at the end of the 10th word, participants were prompted by the word RECALL followed by a black circle on the screen to start recalling previously repeated words. At the end of the block, participants were verbally reminded to rate how effortful the last block was on a scale from 1 to 10, with 10 being the most effortful.

The experiment lasted for 1 hour.

2.4 Behavioural data analysis and results

There were no differences between the French-speaking and English-speaking listeners in word recognition (t = 0.32, df = 20.50, p = 0.75), word recall (t = 0.09, df = 20.66, p = 0.93) and subjective rating (t = 0.68, df = 22.57, p = 0.50), using between-subjects Welch two-tailed t-tests. Therefore, data were firstly aggregated over language (as this played no role and was not a factor of interest in our study).

2.4.1 Methods: Word recognition performance

To examine the effect of LISTENING and TASK conditions on word recognition, a logistic mixed-effect model was fitted on listeners’ word recognition, using LISTENING and TASK conditions as fixed effect factors, with LISTENER and WORD LIST as random effect factors. Mixed effect models allow for controlling the variance associated with random factors without data aggregation. Therefore, by using LISTENER and WORD LIST used for stimuli as random effect factors in the model, we controlled for the variance in overall performance (random intercept) and dependency on other fixed factors (random slope) that were associated with LISTENER and WORD LIST. Models were constructed using the lme4 package [48] in R [49], and figures were produced using the ggplot2 package [50]. Fixed and random effect factors entered the model, and remained in the model only if they significantly improved the model fitting, using Chi-squared tests based on changes in deviance (p < 0.05). Differences between levels of each factor and interactions were examined with post-hoc Wald test. p values were estimated using the z distribution in the test as an approximation for the t distribution [51].
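The post-hoc Wald test with a z approximation reduces to dividing an estimate by its standard error and reading a two-tailed p value from the standard normal distribution. The sketch below shows that step only, using Python's standard library; it does not fit a mixed model, and the function name is hypothetical.

```python
import math


def wald_z_test(beta, se):
    """Two-tailed Wald test using the standard normal approximation.

    Returns (z, p), where p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Illustrative input: an estimate of 0.27 with a standard error of 0.12.
z, p = wald_z_test(0.27, 0.12)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With an estimate of 0.27 and a standard error of 0.12, z is 2.25 and the two-tailed p value falls between 0.02 and 0.03.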

2.4.2 Results: Word recognition performance

There was a significant main effect of LISTENING condition (χ2 = 684.11, df = 3, p < 0.001) and a significant interaction between LISTENING and TASK conditions (χ2 = 10.64, df = 3, p = 0.01), but no main effect of TASK (χ2 = 1.49, df = 1, p = 0.22).

Post-hoc Wald test showed that word recognition performance at 0dB SNR condition (mean = 73.78%, SD = 10.37%) was the lowest of the four LISTENING conditions. 7dB SNR condition (mean = 94.4%, SD = 5.03%) had higher word recognition performance than 0dB SNR (β = 1.8, se = 0.13, p < 0.001), and lower performance than 14 dB SNR (mean = 97.57%, SD = 3.55%) (β = −0.82, se = 0.2, p < 0.001) and quiet condition (mean = 98.93%, SD = 2.07%) (β = −1.92, se = 0.34, p < 0.001). 14 dB SNR condition had lower performance than quiet condition (β = −1.1, se = 0.36, p < 0.001). At 0dB SNR, word recognition was higher in repeat-with-recall (mean = 76.27%, SD = 10.01%) than in repeat-only condition (mean = 71.73%, SD = 10.41%) (β = 0.27, se = 0.12, p = 0.03). Surprisingly, in quiet, word recognition was lower in repeat-with-recall (mean = 98.26%, SD = 2.57%) than in repeat-only condition (mean = 99.6%, SD = 1.11%) (β = −1.5, se = 0.64, p = 0.02) (Fig 2). Recognition performance did not vary across ten word positions within each block (χ2 = 15.14, df = 9, p = 0.09).

Fig 2. Behavioural performance.

All data are averaged across 25 listeners. The error bars denote 1 standard error of the mean. (a) shows word recognition performance as a function of LISTENING and TASK conditions, and (b) shows free recall performance (when listeners were recalling as many of the previously heard words as possible from memory) as a function of the LISTENING condition.

2.4.3 Methods: Word recall performance

To examine the effect of background noise on stated word recall performance, a logistic mixed-effect model was fitted on the number of words correctly recalled, with LISTENING condition as fixed effect factor, with LISTENER and WORD LIST as random effect factors, and following the same procedure reported above. Note that the recall performance was counted as stated word correct, and as such a word could be misunderstood and yet correctly recalled.

2.4.4 Results: Word recall performance

There was a significant main effect of LISTENING condition (χ2 = 18.46, df = 3, p < 0.001). Post-hoc Wald test showed that fewer stated words were recalled at 0dB SNR (mean = 5.67, SD = 1.86) than at 7dB SNR (mean = 6.47, SD = 1.77) (β = 0.38, se = 0.11, p < 0.001), 14dB SNR (mean = 6.41, SD = 1.82) (β = 0.34, se = 0.11, p = 0.003) and in the quiet condition (mean = 6.6, SD = 1.76) (β = 0.45, se = 0.11, p < 0.001), with no other significant differences (Fig 2b).

2.5 Methods: Pupil data preprocessing

Baseline pupil diameter in each trial was calculated as the average of the pupil trace over the 1s before each word onset. This baseline level was subtracted from the pupil diameter measured from the word onset to the end of the trial to obtain relative changes in pupil diameter elicited by the task. Sample points were coded as blinks when pupil diameter values were more than 3 standard deviations (SD) below the mean of the unprocessed trace or when gaze positions were more than 3 SD away from the centre of the fixation. Traces from 10 data points (0.1s) before the start of a blink to 10 data points after its end were interpolated cubically in Matlab, to further decrease the impact of the obscured pupil from blinks. Trials that had over 20% of the data points coded as blinks from the start of baseline to the start of the next trial were excluded. Trials containing blinks longer than 0.4s were also excluded, because they were more likely to be artefacts than normal blinks [52]. Three participants had more than 20% of the overall trials discarded and were excluded from the pupillometry analysis (but kept for behavioural and subjective rating analysis).
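A simplified sketch of these preprocessing rules is given below. It covers the pupil-size blink criterion, the two trial-exclusion rules and the baseline correction at a 100 Hz sampling rate; the gaze-position criterion and the cubic interpolation are left out. All function names are hypothetical, and the use of the population SD of a single trace is an assumption.

```python
from statistics import mean, pstdev

FS = 100  # eyetracker sampling rate in Hz, as reported


def flag_blinks(trace, n_sd=3.0):
    """Flag samples more than n_sd SDs below the mean of the raw trace."""
    m, sd = mean(trace), pstdev(trace)
    return [v < m - n_sd * sd for v in trace]


def longest_run(flags):
    """Length (in samples) of the longest run of consecutive flagged samples."""
    best = current = 0
    for f in flags:
        current = current + 1 if f else 0
        best = max(best, current)
    return best


def keep_trial(flags, fs=FS, max_fraction=0.2, max_blink_s=0.4):
    """Apply both exclusion rules: >20% blink samples, or a blink > 0.4 s."""
    fraction = sum(flags) / len(flags)
    return fraction <= max_fraction and longest_run(flags) / fs <= max_blink_s


def baseline_correct(trace, onset_idx, fs=FS, baseline_s=1.0):
    """Subtract the mean of the 1 s before word onset from the later samples."""
    start = onset_idx - int(baseline_s * fs)
    baseline = mean(trace[start:onset_idx])
    return baseline, [v - baseline for v in trace[onset_idx:]]
```

One property of an SD-based threshold worth noting: when a trace contains many blink samples, they inflate the SD and lower the threshold, so heavily contaminated trials may need the 20% rule or a separate criterion to be caught.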

All valid traces were low-pass filtered at 10 Hz with a first-order Butterworth filter to preserve only cognitively related pupil size modulation [53]. Processed traces were then aligned by the onset of the response prompt (the display of circle to signal participants to repeat back the word) and aggregated per listener, by each WORD POSITION in the 10-word list, TASK and LISTENING conditions.
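For readers who want to see what a first-order Butterworth low-pass filter at 10 Hz does to a 100 Hz pupil trace, a minimal standard-library sketch follows. It uses the bilinear transform with frequency pre-warping and runs forward only; whether the original Matlab analysis used forward-only or zero-phase filtering is not stated, so that choice, the function name and the steady-state initialisation are assumptions.

```python
import math


def butter1_lowpass(x, cutoff_hz=10.0, fs=100.0):
    """First-order Butterworth low-pass filter (bilinear transform).

    Difference equation: y[n] = b*x[n] + b*x[n-1] - a*y[n-1],
    with k = tan(pi*fc/fs), b = k/(1+k), a = (k-1)/(k+1).
    """
    k = math.tan(math.pi * cutoff_hz / fs)
    b = k / (1.0 + k)
    a = (k - 1.0) / (k + 1.0)

    # Assumption: start from steady state at the first sample to avoid
    # an artificial onset transient.
    prev_x = prev_y = x[0]
    y = []
    for v in x:
        out = b * v + b * prev_x - a * prev_y
        y.append(out)
        prev_x, prev_y = v, out
    return y
```

The filter has unit gain at 0 Hz, so a constant pupil diameter passes through unchanged, while sample-to-sample jitter at the Nyquist frequency is removed.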

2.6 Methods: Pupil data analysis

Two indices of task-evoked pupillary response (peak pupil dilation, PPD, and peak latency) were obtained from the aggregated traces, consistent with the method in [17]. PPD was the maximum pupil diameter between word onset and response prompt (time window 1), relative to the baseline pupil diameter. Note that we used the averaged pupil trace 1s before each word as the baseline during baseline correction; therefore, PPD corresponded to the phasic pupillary response evoked by word recognition. This method was in line with the aim of our experiment to investigate pupillary response to listening effort when another cognitive load was present. (For comparison, S1 Appendix shows an alternative method to calculate PPD, i.e. baseline corrected by the averaged pupil trace 1s before the first word in the list, and its impact on understanding the results. To summarise, this alternative method could not disentangle the compound impact of listening effort and memory load on pupillary response.) Peak latency was the time from word onset to the peak dilation. During this time window, listeners were predominantly listening and decoding the acoustic signals. There were also no significant differences in baseline pupil diameter (t = 0.75, df = 19.7, p = 0.46), PPD (t = −0.49, df = 18.53, p = 0.63) and peak latency (t = 1.02, df = 17.04, p = 0.32) between native English and French speakers using between-subjects Welch two-tailed t-tests, so data were aggregated over language.
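The two indices can be illustrated on a single trace as follows. This is a sketch under stated assumptions: the trace is a plain list of diameters in mm sampled at 100 Hz, `onset_idx` and `prompt_idx` are hypothetical sample indices of word onset and response prompt, and the peak is taken as the first maximum in time window 1.

```python
def ppd_and_latency(trace, onset_idx, prompt_idx, fs=100, baseline_s=1.0):
    """Return (PPD in mm, peak latency in s) for one pupil trace.

    Baseline: mean diameter over the 1 s before word onset.
    PPD: maximum diameter between word onset and response prompt,
    minus the baseline. Latency: time from word onset to that maximum.
    """
    start = onset_idx - int(baseline_s * fs)
    baseline = sum(trace[start:onset_idx]) / (onset_idx - start)

    window = trace[onset_idx:prompt_idx]
    # Index of the first maximum in the window.
    peak_idx = max(range(len(window)), key=window.__getitem__)
    return window[peak_idx] - baseline, peak_idx / fs
```

Because the baseline is recomputed before every word, a slow build-up in tonic pupil size across the list moves the baseline and not the PPD, which is the property the per-word baseline is meant to provide.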

2.6.1 The effect of noise and memory load on pupillary response

To investigate how the experimental manipulations on listening effort and memory load affected the dynamics of pupillary response, three mixed effect models were fitted on baseline diameter, PPD and peak latency respectively. LISTENING and TASK conditions were entered as fixed effect factors to investigate the impact of experimental conditions on the pupillary response averaged over the ten-word list. WORD POSITION was coded from 1 to 10, corresponding to the serial position of each word in the list. Entering this variable as another fixed factor enabled us to examine the temporal variations of different pupil metrics. Also, the interaction between WORD POSITION and other fixed effect factors showed how the pupil dynamics differed in the conditions with and without memory load, and under high and low listening effort. LISTENER was entered in the model as a random effect factor. Model building followed the same procedure as above.

2.6.2 Pupillary response of incorrectly versus correctly repeated words, recalled and forgotten words

To further explore the sequence of different cognitive processing stages, pupil traces of words correctly versus incorrectly recognised, and pupil traces of words forgotten versus recalled were compared. For words correctly and incorrectly recognised, two logistic mixed effect models were fitted on the word recognition correct, using PPD and peak latency (calculated in time window 1 from word onset to response prompt) as fixed effect factors, with LISTENER as random effect factor. For words recalled and forgotten, a new time window was added into analysis. New PPD and peak latency were calculated at the time window from the response prompt to 1.5s after the response prompt (time window 2). The inclusion of extra 1.5s after the response prompt in the analysis was to include the time for rehearsing and encoding the perceived word to working memory storage [54]. Logistic mixed effect models were fitted on the word recall, using PPD and peak latency in two time windows as fixed effect factors. Note that in this particular analysis pupillary parameters were used as independent variables to assess behavioural outcomes, to understand how the strategy of cognitive resources allocation affected word recognition and recall. In other words, it was examined as a predictive tool: predict whether a given word would be correctly understood or not, and recalled or forgotten, from the particular shape of a pupil trace.

2.6.3 The effect of noise on pupillary response during word recall at the end of a block

Finally, to explore the impact of LISTENING condition on the pupillary response during recall (i.e. when listeners were recalling as many of the previously heard words as possible from memory), pupil traces from the recall onset cue to 15s after the cue were first baseline-corrected by subtracting the average diameter of all previous word trials in the block. They were then de-blinked and low-pass filtered using the same parameters as above. Processed traces were then aggregated per listener by LISTENING condition. The mean of the trace during word recall was calculated. A mixed effect model was fitted on the mean pupil diameter during recall, with LISTENING condition as fixed effect factor and LISTENER as random effect factor.

2.7 Results: Pupil data

Figs 3a and 4a show the pupil diameter variation from the onset of baseline to 1.5s after the response cue.

Fig 3. Pupillometry results as a function of LISTENING and TASK conditions.

All data are aggregated across 22 listeners, and WORD POSITION, LISTENING, TASK conditions. The error bars and shaded width denote 1 standard error of the mean. (a) shows changes in pupil size as a function of time during each trial, for each LISTENING and TASK conditions. (b) and (c) plot baseline pupil diameter and PPD as a function of LISTENING and TASK conditions respectively.

Fig 4. Pupillometry results as a function of TASK and WORD POSITION.

All data are aggregated across 22 listeners, and WORD POSITION, LISTENING, TASK conditions. The error bars and shaded width denote 1 standard error of the mean. (a) shows changes in pupil size as a function of time at each WORD POSITION for each TASK condition. (b) and (c) plot baseline pupil diameter and PPD as a function of WORD POSITION and TASK condition respectively.

2.7.1 The effect of noise and memory load on pupillary response

For baseline pupil diameter, there was a significant main effect of LISTENING condition (χ2 = 11.21, df = 3, p = 0.01), TASK (χ2 = 283.49, df = 1, p < 0.001) and WORD POSITION (χ2 = 24.85, df = 9, p = 0.003), and significant interaction between TASK:WORD POSITION (χ2 = 82.99, df = 9, p < 0.001). Post-hoc tests showed that baseline pupil diameter at 0dB SNR (mean = 3.89, SD = 0.76) was not different from 7dB SNR condition (mean = 3.91, SD = 0.81) (β = 0.004, se = 0.01, p = 0.68). Both were bigger than 14dB SNR condition (mean = 3.84, SD = 0.74) (β = 0.04, se = 0.01, p = 0.002;β = 0.03, se = 0.01, p = 0.007) and quiet condition (mean = 3.86, SD = 0.78) (β = 0.04, se = 0.01, p = 0.04; β = 0.03, se = 0.01, p = 0.04); 14dB was not different from quiet (β = 0.01, se = 0.01, p = 0.32). Overall, baseline pupil diameter at repeat-with-recall condition (mean = 3.95, SD = 0.78) was significantly bigger (about 0.15 mm) than that at repeat-only condition (mean = 3.81, SD = 0.76) (β = 0.18, se = 0.01, p < 0.001) (Fig 3b). A trend analysis on WORD POSITION showed that from the 1st to 10th word, repeat-only condition had a linearly decreasing trend (β = −0.18, se = 0.01, p < 0.001), whereas repeat-with-recall condition had a linearly increasing trend (β = 0.18, se = 0.01, p < 0.001) (Fig 4b). Baseline diameter in repeat-with-recall condition also showed a significant quadratic trend (β = −0.09, se = 0.03, p < 0.001), suggesting that the greatest increase in baseline diameter occurred in the mid-section of the word list. No significant cubic trend was detected.

For PPD, there was a significant main effect of WORD POSITION (χ2 = 104.39, df = 9, p < 0.001), and no significant main effect of LISTENING (χ2 = 2.55, df = 3, p = 0.47) and TASK conditions (χ2 = 1.85, df = 1, p = 0.17). Interactions between LISTENING:TASK (χ2 = 13.15, df = 3, p = 0.004) and TASK:WORD POSITION (χ2 = 22.98, df = 9, p = 0.006) were significant, with no significant three-way interaction (χ2 = 31.05, df = 27, p = 0.27). Post-hoc tests showed that at 0dB SNR, repeat-only condition (mean = 0.23, SD = 0.19) evoked bigger PPD than repeat-with-recall condition (mean = 0.19, SD = 0.14) (β = 0.03, se = 0.01, p = 0.04), and no difference between two tasks at other SNR levels (Fig 3c). Examining the same interaction differently: SNR only affected the repeat-only condition, showing a bigger PPD at 0 dB than at other SNR conditions. A trend analysis on WORD POSITION showed that from the 1st to the 10th word, there was a decrease in PPD (χ2 = 55.73, df = 1, p < 0.001, β = −0.08, se = 0.01, p < 0.001), and this decrease was steeper in the repeat-with-recall condition than repeat-only condition (β = −0.07, se = 0.007, p < 0.001) (Fig 4c). No further significant quadratic or cubic trend.

For peak latency, there was a significant main effect of LISTENING condition (χ2 = 8.67, df = 3, p = 0.03) and WORD POSITION (χ2 = 66.98, df = 9, p < 0.001), and a significant interaction between TASK:WORD POSITION (χ2 = 21.93, df = 9, p = 0.009). Post-hoc tests showed that at 0dB SNR condition (mean = 1.12, SD = 0.59) pupil size peaked significantly later than at 7dB SNR (mean = 1.06, SD = 0.61) (β = 0.07, se = 0.03, p = 0.008), 14dB SNR (mean = 1.05, SD = 0.61) (β = 0.06, se = 0.02, p = 0.01), and quiet (mean = 1.06, SD = 0.59) (β = 0.05, se = 0.03, p = 0.05). From the 1st to the 10th word, peak latency decreased in repeat-only condition (β = −0.11, se = 0.04, p = 0.007), and also decreased in repeat-with-recall condition (β = −0.3, se = 0.04, p < 0.001), where the decrease was steeper than in repeat-only condition (β = 0.2, se = 0.05, p = 0.001). No further significant quadratic or cubic trend was detected.
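As a rough illustration of what the linear and quadratic trend terms in this section capture, the sketch below projects ten per-position means onto orthogonal polynomial contrasts. It is a simplification and not the mixed-model trend analysis reported here; the function names and the example values are hypothetical, chosen only to mimic a build-up that flattens in later positions.

```python
def poly_contrasts(n):
    """Linear and quadratic contrasts for n equally spaced positions.

    Both contrasts sum to zero and are orthogonal to each other.
    """
    centred = [i - (n - 1) / 2 for i in range(n)]
    mean_sq = sum(x * x for x in centred) / n
    linear = centred
    quadratic = [x * x - mean_sq for x in centred]
    return linear, quadratic


def contrast_coefficient(values, contrast):
    """Least-squares coefficient of `values` on a zero-mean contrast."""
    num = sum(v * c for v, c in zip(values, contrast))
    den = sum(c * c for c in contrast)
    return num / den


# Hypothetical per-position baseline means in mm (not data from the study):
# a rise that slows down towards the end of the list.
means = [3.80 + 0.03 * i - 0.002 * i * i for i in range(10)]

linear, quadratic = poly_contrasts(10)
linear_coef = contrast_coefficient(means, linear)        # positive: overall increase
quadratic_coef = contrast_coefficient(means, quadratic)  # negative: the rise flattens

print(f"linear = {linear_coef:.4f}, quadratic = {quadratic_coef:.4f}")
```

A positive linear coefficient together with a negative quadratic coefficient corresponds to the pattern described for baseline diameter in repeat-with-recall condition: an increase that levels off rather than continuing at a constant rate.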

2.7.2 Pupillary response of incorrectly versus correctly repeated words

For the pupillary responses of words that were correctly and incorrectly recognised, no difference in baseline diameter was found (χ2 = 0.001, df = 1, p = 0.94), suggesting that there was no differential arousal that could explain the word intelligibility. There was a main effect of PPD (χ2 = 12.59, df = 1, p < 0.001) and a significant interaction of TASK:PPD (χ2 = 13.9, df = 1, p < 0.001). No significant effect of peak latency (χ2 = 1.96, df = 1, p = 0.16) was found. Post-hoc tests showed that at repeat-only condition, bigger PPD was associated with incorrectly repeated words (β = −1.8, se = 0.35, p < 0.001), and no such relation at repeat-with-recall task (Fig 5a).
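For readers who want to check how a reported χ2 value and its degrees of freedom map onto a p-value, the sketch below computes the upper-tail probability of a chi-square distribution for integer degrees of freedom using only the Python standard library. It assumes the reported p-values are plain upper-tail chi-square probabilities; the function name `chi2_sf` is hypothetical, and because the published statistics are rounded, recomputed values may differ slightly from those printed in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x < 0 or df < 1 or int(df) != df:
        raise ValueError("x must be >= 0 and df a positive integer")
    half = x / 2.0
    # Closed-form starting points: df = 2 (exponential tail) or df = 1 (normal tail).
    if df % 2 == 0:
        k = 2
        q = math.exp(-half)
    else:
        k = 1
        q = math.erfc(math.sqrt(half))
    # Step up two degrees of freedom at a time using the incomplete-gamma recurrence.
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


# Two of the statistics reported in this section.
for chi2, df in [(12.59, 1), (1.96, 1)]:
    print(f"chi2 = {chi2}, df = {df}, p = {chi2_sf(chi2, df):.4f}")
```

The recurrence avoids any dependency on a statistics package, at the cost of supporting integer degrees of freedom only, which is sufficient for the tests reported here.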

Fig 5. Comparing pupil traces for words correctly and incorrectly repeated, recalled and forgotten.

Fig 5

All data are averaged across 22 listeners. The shaded width denotes 1 standard error of the mean. (a) compares the pupil traces for words correctly and incorrectly repeated in each TASK condition. (b) compares the pupil traces for words that are successfully recalled or forgotten. Traces in two time windows are analysed: first analysis window is from the onset of word to the onset of the response prompt, and the second analysis window is from the onset of the response prompt to 1.5s after the prompt.

2.7.3 Pupillary response of recalled versus forgotten words

Comparing the pupillary responses of words that were later recalled or forgotten, no difference in baseline size was found (χ2 = 0.001, df = 1, p = 0.9). In the first time window, there was no significant main effect of PPD (χ2 = 1.76, df = 1, p = 0.18) or latency (χ2 = 1.49, df = 1, p = 0.22). In the second time window, there was a significant main effect of PPD (χ2 = 4.87, df = 1, p = 0.03). A post-hoc Wald test showed that bigger PPD in the second time window was associated with the successful recall of the word (β = 3.18, se = 1.47, p = 0.03) (Fig 5b).
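The two-window analysis can be sketched as follows: for a single trial, the peak dilation relative to baseline and its latency are extracted separately from the window between word onset and the response prompt, and from the window between the prompt and 1.5 s after it. This is a minimal sketch, not the analysis pipeline used here. The sampling step, the prompt time of 2.0 s, the pre-onset baseline window and all trace values are hypothetical.

```python
def window_peak(times, diameters, baseline, t_start, t_end):
    """Return (peak dilation relative to baseline, latency from t_start) within [t_start, t_end]."""
    best = None
    for t, d in zip(times, diameters):
        if t_start <= t <= t_end:
            dilation = d - baseline
            if best is None or dilation > best[0]:
                best = (dilation, t - t_start)
    if best is None:
        raise ValueError("no samples fall inside the requested window")
    return best


# Hypothetical trial: time in seconds relative to word onset, diameter in mm.
times = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
diameters = [3.8, 3.8, 3.8, 3.9, 4.0, 3.9, 3.9, 4.1, 4.0, 3.9]

# Assumption: baseline is the mean diameter before word onset.
pre_onset = [d for t, d in zip(times, diameters) if t < 0]
baseline = sum(pre_onset) / len(pre_onset)

prompt_onset = 2.0  # hypothetical time of the response prompt
ppd_1, latency_1 = window_peak(times, diameters, baseline, 0.0, prompt_onset)
ppd_2, latency_2 = window_peak(times, diameters, baseline, prompt_onset, prompt_onset + 1.5)

print(f"window 1: PPD = {ppd_1:.2f} mm at {latency_1:.1f} s")
print(f"window 2: PPD = {ppd_2:.2f} mm at {latency_2:.1f} s")
```

Keeping the baseline fixed across both windows means that the second-window peak is still expressed relative to the pre-word pupil size, so the two windows remain directly comparable.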

2.7.4 The effect of noise on pupillary response during word recall at the end of a block

For the mean pupil diameter during the listeners’ word recall, there was no difference among SNRs (χ2 = 0.67, df = 3, p = 0.88) (Fig 6); the mean pupil diameter jumped from about 4.0 mm to 4.3–4.4 mm (just short of 10%).
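The size of this jump can be checked with simple arithmetic. The short sketch below converts the approximate diameters quoted above into a percentage change; the helper name `percent_change` is hypothetical.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Approximate mean diameters (mm) before and during recall, as quoted in the text.
low = percent_change(4.0, 4.3)
high = percent_change(4.0, 4.4)

print(f"increase between {low:.1f}% and {high:.1f}%")
```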

Fig 6. Pupil traces from 10s before the recall onset to 15s after the recall onset.

Fig 6

Each panel shows the averaged traces in each LISTENING condition. All data are aggregated across 22 listeners. The shaded width denotes 1 standard error of the mean. The line is further smoothed using the default gam method in ggplot2 package to highlight the general trend.

2.8 Methods: Subjective listening effort rating and individual differences

To examine the effect of LISTENING and TASK conditions on subjective rating, a logistic mixed-effect model was fitted on ratings, with LISTENING and TASK conditions as fixed effect factors, with LISTENER and WORD LIST as random effect factors, and following the same procedure reported above.

In a final attempt to delineate different components of the pupillary dynamics, each participant’s pupillary responses (baseline diameter and PPD) were correlated with their age, word recognition, word recall and subjective rating performance using Pearson correlation.

All best fitting models, model parameter estimates and model comparison statistics were reported in the S2 Appendix.

2.9 Results: Subjective listening effort rating

There was a significant main effect of LISTENING (χ2 = 2278.51, df = 3, p < 0.001) and TASK conditions (χ2 = 7137.01, df = 1, p < 0.001), and a significant interaction of LISTENING:TASK (χ2 = 239.78, df = 3, p < 0.001) on subjective rating. Subjective rating at 0dB (mean = 5.87, SD = 2.13) was higher than at 7dB (mean = 4.05, SD = 2.29) (β = 0.85, se = 0.04, p < 0.001), 14dB (mean = 3.98, SD = 2.43) (β = 0.89, se = 0.04, p < 0.001) and quiet (mean = 3.24, SD = 2.36) (β = 1.29, se = 0.05, p < 0.001); 7dB was higher than quiet (β = 0.44, se = 0.05, p < 0.001) but not 14dB (β = 0.04, se = 0.05, p = 0.38); and 14dB was higher than quiet (β = 0.4, se = 0.05, p < 0.001). Overall, subjective rating at repeat-with-recall condition was higher than that at repeat-only condition (β = 1.56, se = 0.03, p < 0.001), and the difference was smaller at 0dB than other SNR levels (β = −1.13, se = 0.06, p < 0.001) (Fig 7a).

Fig 7. Subjective effort and individual differences.

Fig 7

Each data point corresponds to one participant. The error bars denote 1 standard error of the mean. (a) plots subjective rating as a function of LISTENING and TASK conditions. (b) to (d) show the significant correlations (p < 0.05) between behavioural and pupillary measures.

2.10 Results: Individual differences

On an individual level, baseline diameter (within word lists) positively correlated with word recall performance (r = 0.45, p = 0.04, Fig 7b), and negatively correlated with subjective rating (r = −0.45, p = 0.04, Fig 7c). PPD negatively correlated with word recognition performance (r = −0.48, p = 0.02, Fig 7d), but this was only true when no memory requirement was involved: in repeat-with-recall condition, there was no significant correlation between PPD and word recognition performance (r = 0.08, p = 0.21). Note that these relations were modulated by participants’ age: word recall performance worsened with age (r = −0.5, p = 0.01); baseline diameter shrunk with age (r = −0.52, p = 0.01); and subjective rating shifted up with age (r = 0.5, p = 0.01). After correcting for the effect of age, the correlations were not significant between baseline diameter and word recall performance (r = 0.18, p = 0.08), and between baseline diameter and subjective rating (r = −0.21, p = 0.22). PPD and word recognition performance (r = −0.02, p = 0.01) remained significant after the correction.
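One common way to correct a correlation for a third variable such as age is a first-order partial correlation. Whether this is the correction applied here is an assumption, so the sketch below should be read only as an illustration of the idea. The function names and the six-listener data set are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data for six listeners (illustration only).
age = [22, 28, 35, 41, 50, 58]
baseline_mm = [4.4, 4.1, 4.2, 3.8, 3.7, 3.5]
recall_pct = [72, 70, 61, 66, 55, 52]

r_xy = pearson(baseline_mm, recall_pct)
r_xz = pearson(baseline_mm, age)
r_yz = pearson(recall_pct, age)
r_partial = partial_corr(r_xy, r_xz, r_yz)

print(f"raw r = {r_xy:.2f}, age-corrected r = {r_partial:.2f}")
```

When both measures covary strongly with age, as in this toy example, the raw correlation can be carried largely by age, which is why a correction of this kind may remove an otherwise significant relation.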

3 Discussion

The current experiment used a word recall paradigm to elicit sustained and concurrent memory load on word recognition in noise. Pupil diameters were recorded simultaneously to investigate the dynamics of pupillary response in complex listening situations. A number of our findings can be contrasted with the literature, advancing current debates on 1) interferences between concurrent tasks, 2) the nature of pupil dynamics in dual versus single tasks, 3) the predictive power of pupillometry for intelligibility and memory, and 4) individual differences.

3.1 Word recall task interfering with the word recognition task

Consistent with our first hypothesis, results showed that noise impaired both word recognition and recall. Fewer stated words were recalled at 0dB than at 7dB, 14dB and quiet conditions. Note that to dissociate the impact of word recognition from recall performance, word recall scoring was based on whether the recalled words matched the words repeated by participants, rather than the transcripts (similar to [47]). Past studies using the recall paradigm reported similar results. For instance, Surprenant showed that even when nonsense syllable recognition performance was similar across SNRs, NH participants’ recall performance was impaired in difficult SNR [55]. In [56], NH participants repeated the final word of each of 8 sentences embedded in babble-speech noise, and at the end of the 8th sentence recalled as many of the previously reported words as possible. Results showed that a challenging signal-to-noise ratio (SNR) impaired both word recognition and recall of the stated words. Ng et al. [57] tested participants with moderate to severe hearing loss using a similar memory recall paradigm referred to as the sentence-final word identification and recall (SWIR). Results showed that even under similar intelligibility, babble-speech noise impaired word recall performance more than speech-shaped noise. A similar effect was also illustrated for young versus old listeners [58], native versus foreign speech masker [59] and replicated in different languages [47]. In line with the interpretation in previous studies, we believe that this SNR effect on recall reflects that higher listening effort during word recognition evoked at lower SNR leaves fewer cognitive resources for encoding and retrieving words, leading to the decreased performance in the word recall task [4, 8, 60–62].

Surprisingly, we found a possible interference from the recall task on the word recognition task. At 0dB, word recognition performance was better when participants expected word recall at the end of the list; and in quiet, word recognition was worse when participants expected the word recall task at the end. Although word recognition was essentially the same task in repeat-only condition and repeat-with-recall condition, participants might evaluate and anticipate the amount of cognitive resources differently. At 0dB, listeners might be more attentive and ready to engage overall more cognitive resources when they were notified at the beginning of the block that they should recall at the end of the 10th word, because they anticipated the incoming block to be demanding. When no recall was required, they might have judged beforehand that the incoming block did not warrant mobilising many resources, hence the worse recognition performance. If this were the case, then we should observe a corresponding interaction in baseline pupil diameter, because baseline has been shown to be associated with task readiness and engagement [33, 43]. Trend analysis from the 1st to the 10th word showed that the baseline diameter in the repeat-only condition had a decreasing trend that was consistent with other studies where listeners showed fatigue or habituation with similar stimuli and task within a block [17, 40–42]. In comparison, the baseline diameter from the 1st to the 10th word in the repeat-with-recall condition had an increasing trend. However, without manipulation of both memory load and task engagement in our experimental design, it is impossible to disentangle the convoluted effect of memory accumulation and engagement on the baseline. Furthermore, in quiet with repeat-with-recall condition, listeners should have sufficient capacity to reach a better primary task performance (as shown by a higher word recognition in repeat-only condition), but instead, they performed worse in the word recognition task than in the repeat-only condition. This might suggest that they did not prioritise the word recognition task (although they were instructed explicitly to do so by the experimenter), and may have shifted some resources to the recall task, probably because it was more interesting and rewarding [63–66].

This interference warrants further investigation, because it concerns the validity of using a dual-task paradigm in measuring listening effort. In order to interpret safely the difference in secondary task performance as a result of listening effort, implicit assumptions of the dual-task paradigm need to be reviewed [67]. Firstly, the paradigm assumes that participants have a limited pool of cognitive resources, but the Framework for Understanding Effortful Listening (FUEL) model also notes that the resources available for allocation fluctuate with factors other than overall task demands [3, 4]. In other words, the relationship between task difficulty and effort is not linear, but modulated by factors like fatigue, motivation and (dis)pleasure [35, 68–73]. Secondly, the paradigm assumes that listeners, under explicit instructions, will prioritise the primary task by investing as many resources as possible, and leave only whatever resources remain for the secondary task. However, individual differences and task characteristics might affect listeners’ actual strategy [3]. For instance, older adults may differ from younger adults in the extent to which they prioritise one task over another [63–65]. And when the primary task is too complex or the secondary task more novel, participants may consciously or unconsciously shift more resources to the secondary task relative to the primary task [74–76]. Although various recall paradigms from previous studies are sensitive to the relative allocation of cognitive resources [47, 56, 57, 59], there is no direct method to gauge the total amount of resources deployed and how they are allocated [67]. As illustrated in the current experiment, listeners might not mobilise and/or allocate the same amount of cognitive resources for the speech recognition task when a secondary recall task was anticipated, even under explicit instruction. This makes it unclear whether the difference in the recall performance is due to differences in the listening effort, or prior mobilisation of overall cognitive resources, or internal shift of resources between primary and secondary task. Previous studies using the SWIR paradigm have typically fixed the SNR levels at or close to ceiling performance, to ensure no substantial differences in sentence intelligibility [47, 57]. But this still does not exclude the possibilities mentioned above, because even at ceiling performance level (similar to the quiet condition in the current experiment), interferences could occur.

Our results highlight the importance of considering these factors when designing behavioural paradigms for measuring listening effort and interpreting their outputs. The possible interference from the recall task on the word recognition task showed that the behavioural outcome might not be indicative of the listening effort alone. This might be particularly important when applying the test to listener groups who are susceptible to fatigue and task interference, for instance hearing impaired populations and children, because they might either give up or not fully be motivated in the first place even when the available capacity can meet the processing demand [3, 72, 74–76].

3.2 Pupillary response to intelligibility during a concurrent and sustained memory load

Consistent with our second hypothesis regarding the pupillary response during the word listening and encoding section, pupil diameter was larger in repeat-with-recall than repeat-only condition. In this respect, the present design has the advantage of dissecting how this difference arises, thanks to the trial-by-trial sensitivity of pupillometry. The difference arises from a progressive decrease in baseline diameter within the repeat-only condition, and a progressive increase in baseline diameter within the repeat-with-recall condition from the 1st to the 10th word. Although past studies have reported similar trends, they used different materials and test designs, making it hard to demonstrate clearly the impact of an additional memory task on listening effort in both magnitude and dynamics. For instance, within one speech perception task, pupil diameter gradually decreased with increasing trial numbers, due to task/stimuli habituation [17, 40–42]. However, when listeners needed to remember the digits [16, 26, 28] or pseudo-words [77] presented auditorily, pupil diameter increased progressively, until the memory span was exceeded. Note that in the current experiment, listeners needed to continuously decode words embedded in noise, which might be more effortful than listening to digits or pseudo-words in quiet due to higher cognitive and perceptual processing demands. The more demanding primary speech recognition task led to more accumulated and sustained effort over time. This might explain the earlier plateau in baseline diameter in our experiment than observed in those studies. We observed a quadratic trend of baseline pupil diameter from the 1st to the 10th word within a list. [28] reported the plateau at the 9th digit for young adults, and [78] reported the plateau at the 6th digit for children and the 8th digit for adults. Our results are in good agreement with such estimates, and confirm that an additional memory task places a heavier and sustained load on cognitive effort. More specifically, baseline diameter could reveal the impact on cognitive effort from the additional task, and the rate of increase in baseline diameter could be suggestive of the magnitude of sustained effort in a test paradigm with multiple sources of cognitive effort.

However, the steeper decrease of PPD in repeat-with-recall condition compared to repeat-only condition was unexpected. PPD has been shown to be sensitive to memory load; therefore, with more words to be remembered, we expected PPD to increase accordingly over time [16, 27, 28]. A decrease in PPD was reported when listeners tended to give up in tasks that were impossibly difficult [29, 31]. In those cases, performance level was typically low (around 0%). But we did not observe a decrease in recognition and recall performance for words in the later part of the list in our results, or a worse word recognition performance in repeat-with-recall condition at the difficult 0dB condition (in fact, word recognition was higher in repeat-with-recall than repeat-only condition). This suggests that listeners did not give up at the later part of the word list, or at 0dB. Similarly, a smaller PPD at 0dB in repeat-with-recall than repeat-only condition was surprising. An additional recall task with difficult SNR is certainly more demanding than a single task; therefore, we expected PPD to be larger in the repeat-with-recall condition and at the difficult SNR level. But we observed the opposite: PPD was actually smaller in the repeat-with-recall condition. We do not believe that these are spurious results. This marked contrast with the well-established effect of task demands on the pupillary response was also observed in Zekveld et al. [45]. In Zekveld et al. [45], participants had to recall the four-word cues (either related or unrelated to the following sentence) presented visually before the onset of the sentence embedded in speech masker. The 7dB SNR difference between two sentence-in-noise conditions (-17dB and -10dB) elicited a difference in intelligibility, but not in peak and mean pupil dilation. Zekveld et al. [45] interpreted the absence of pupillary difference between two SNRs as participants prioritising the central factors (memory task) over peripheral factors (sentence recognition task). There are a few characteristics that distinguish our design from Zekveld et al. [45]. Firstly, the memory and sentence recognition tasks in Zekveld et al. [45] were more independent: participants read the cue words for 5s before the auditory stimulus onset; after the auditory stimulus offset, participants either repeated the sentence or the cue words. This separation between two tasks could facilitate intentional prioritisation of the memory over the speech recognition task. Secondly, participants in Zekveld et al. [45] only needed to memorise a four-word cue at the start of each trial, with no accumulation of memory load over time. In comparison, the memory task in our paradigm was more imposing on the limited cognitive resources: participants had to complete both word recognition and memorising tasks within the same time window, and they needed to keep retaining more words in memory from the 1st to the 10th word. Therefore, it is not surprising that we observed not only a lack of correlation between task demands and pupillary response at easier SNR levels, but also a reversal of that relation at the most cognitively demanding condition (0dB and repeat-with-recall).

One explanation for the steep decrease of PPD in sustained listening condition could be due to fatigue. In a similar sustained listening condition, McGarrigle et al. [44] asked NH participants to listen to two short passages of text with multi-talker babble noise at either -8 dB and 15 dB, and at the end of each passage judge whether images presented on the screen were mentioned in the previous passage. A steeper decrease in (normalised and baseline corrected) pupil size during listening was found for difficult SNR than easy SNR, but only in the second half of the trial block. This was interpreted as fatigue kicking in at the second section of the test. It is likely that in our study, the steeper decrease of PPD in repeat-with-recall condition could also be the sign of overload and fatigue with continuing effort to recognise, encode and rehearse isolated words. However, the decreasing trend reported in McGarrigle et al. [44] was not found in McGarrigle et al. [79] when using a similar test for school-aged children, so it is still unclear how reliably and accurately this metric is related to fatigue.

A more likely explanation for the steeper decrease of PPD in repeat-with-recall condition is that the dynamic range of the pupil could be constrained by baseline diameter. Critically, for the first word in the list, PPD was bigger in repeat-with-recall than repeat-only condition but the baseline diameter was similar. As the baseline diameter grew bigger and plateaued in repeat-with-recall condition, PPD did not have much space to grow, so it decreased faster than in repeat-only condition. Similarly, at repeat-with-recall condition, baseline diameter was already bigger than in the repeat-only condition for all SNR levels to start with, leaving little room for PPD to increase further during the task. It looks as if under sustained listening condition, there is a limit on the magnitude of pupil dilation, beyond which no further increase is possible. This limit cannot be imposed by a physiological constraint of the iris muscles, because at the onset of the recall, pupil diameter increased dramatically, on average by 0.3 mm or equivalent to an effect six times bigger than the average PPD at the 10th word (also seen in Cabestrero et al. [28] and discussed in Zekveld et al. [45]). Instead, this limit might be of cognitive origin. Puma et al. [80] reported a similar ceiling in EEG alpha and theta band power when participants were overloaded with multiple concurrent tasks. This limit might be associated with the saturation in cognitive resources allocation. In order to ensure successful retrieval of words from long- and short-term memory storage at the recall stage, some cognitive resources should be preserved and held until the later part of the test. Therefore, as memory load accumulated (increase in baseline diameter) and approached the limit allocated for the recognition and encoding stage, fewer new resources would be assigned (decrease in PPD), so that enough resources were reserved for the recall stage. The reserved cognitive resources were finally put to use at the onset of recall, leading to a big ‘jump’ in pupil diameter.

This could be a phenomenal illustration of how cognitive resources are managed in a highly flexible and goal-directed manner. More importantly, as demonstrated in our experiment, this cognitive planning is reflected in the pupillary response. When listeners need to reserve some cognitive resources for later tasks, the pupillary response might show a cap until the next task. In Cabestrero et al. [28], the biggest ‘jump’ at the onset of recall was when 5 digits were to be recalled (low load), and the smallest ‘jump’ was when 11 digits were to be recalled (overload), suggesting that this sharp increase in pupil diameter is proportionate to the cognitive resources left for the recall task. Arguably, how cognitive resources are allocated to different tasks could also depend on individual cognitive capacity and cognitive abilities. Listeners with bigger cognitive capacity and better abilities to process speech in noise, might allocate fewer resources (lower limit) to word recognition and encoding, because they will be more efficient in completing the task [59, 81]. In this case, listeners might show a bigger increase in pupil diameter at the onset of recall because they have more cognitive resources left for the recall section. To fully test this hypothesis, future studies need to include more individual cognitive ability measurements and different types of manipulations on cognitive load (for instance, manipulating the memory load by varying the number of words to recall).

3.3 Pupillary response to word recognition and memory

Baseline pupil diameter reflected the accumulation of memory load from one serial position to the next. On an individual level, baseline diameter was also related to recall performance, as shown by their significant correlation.

Bigger PPD and more delayed dilation for incorrectly than correctly repeated words in repeat-only condition have also been observed in other studies using sentence stimuli [17, 21, 29]. But in the condition requiring heavy and sustained effort (repeat-with-recall), PPD saturated too quickly, especially later in the word list, to support the correlation with word recognition. It seemed that the dynamic range of pupillary response was constrained by the baseline diameter. This further highlights the issue raised earlier, namely that the saturation in pupillary response under sustained load might make PPD problematic for quantifying the actual effort.

Nevertheless, PPD remains a reliable index of listening effort during a single listening task and a potential biomarker for memory processing. Typically, when comparing the recall performance, we found words that were successfully recalled had bigger pupillary response than those forgotten at the encoding stage during time window 2. Papesh et al. [82] suggested a similar relation between PPD and memory encoding success: words that were remembered with higher degree of confidence showed bigger PPD, relative to words that were remembered with less confidence or forgotten. Kucewicz et al. [54] showed that subsequently recalled words had higher peak pupil dilation 1s after the onset of words being presented visually on the screen for memorising. This is at a similar time point as the difference we observed in the time window 2.

Taken as a whole, these results picture a complex story of the allocation and dynamics of cognitive resources during speech perception and memory task. Failure to recognise the word is associated with more effortful processing, possibly because more lexical competitors are activated for explicit decision when listeners fail to decode the acoustic signals without ambiguity. This might also initiate retroactive corrective processing that would keep the effort elevated post-stimulus [22]. When words need to be remembered for the recall task, the memory encoding probably becomes a priority after completing the word recognition. If more cognitive resources are expended at this stage to encode the word in the working memory storage, there is a higher chance that it will be retrieved successfully later.

3.4 Individual differences

Behavioural performance was correlated with pupillary response, but in different manners: better word recognition performance was related to smaller PPD; better stated word recall performance was related to bigger baseline diameter; bigger baseline diameter was related to easier subjective rating; better word recall performance was related to easier subjective rating. Consistent with the results discussed above, these suggest that different metrics of pupillary responses might relate to different cognitive processing. PPD was an indicator of transient effort expended for decoding the words presented in noise, hence correlated with the word recognition performance. Listeners’ subjective feeling is affected both by external task demands (SNR levels and TASK), and one’s evaluation of recall success. Note that all three measures (pupillary response, word recall performance and subjective rating) also significantly correlated with age, making it possible that the correlations observed were due to a latent variable, for instance individual cognitive capacity [24, 29, 47, 57, 58, 83–85].

To summarise, while behavioural performance (i.e., recall) and subjective rating indicate the final outcome of a series of cognitive processes, pupillometry can reveal the difference in listening effort between conditions, the temporal dynamics of different stages of cognitive processing, as well as the allocation policy of cognitive resources. However, the present findings highlight that there are still many open questions about what the pupil dilation reflects. Only a handful of studies have looked into the dynamics of pupillary response in realistic conditions, where listening is not the only task demanding cognitive resources. The lack of research makes it difficult to develop theories and methods to disentangle the rich information pupillary responses contain. The current experiment enriches our knowledge on the topic by showing the importance of looking at pupillary metrics (time-series variations, baseline diameter) other than PPD when investigating listening effort under sustained memory or other cognitive loads. In a nutshell, here we found that the baseline carries critical information about the overall level of engagement of cognitive resources and the moment-by-moment allocation of these resources in a complex task. As such it might potentially be used as a predictor for the likelihood of success in a memory task on an individual level. In contrast, the PPD might be a good indicator of the success of word decoding and a potential biomarker for memory processing, specifically at the level of a single word. But PPD as an indication of listening effort is not as robust in a complex task, when PPD would be constrained by the elevated baseline diameter induced by concurrent tasks and when listeners employ different resource allocation strategies. Therefore, future pupillary metrics and analysis pipelines should, similar to our method, devote more attention to the trial-by-trial variation patterns of baseline diameter.

3.5 Limitation

Pupil recordings during word repeat and recall were inevitably contaminated by movements during speech production and involuntary eye movement. No algorithm has been developed yet to reliably adjust pupil diameter for these factors. Special care was taken during the experiment and data preprocessing: participants were instructed to keep fixating at the fixation circle during verbal responses; we extrapolated points in the pupil traces where the centre of gazing was beyond 3SD from the centre and excluded trials where over 20% of the traces were either blinks or erratic gazing. Although this led to loss of data, we ensured that the data left for analysis was valid.
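The trial exclusion rule described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the preprocessing code used here: blinks are assumed to be encoded as `None`, gaze is assumed to be already expressed as an offset from the screen centre, the gaze SD is assumed to be supplied by the caller, and the function names are hypothetical.

```python
def invalid_fraction(pupil, gaze_offset, gaze_sd, n_sd=3.0):
    """Fraction of samples that are blinks (None) or have gaze beyond n_sd * gaze_sd."""
    if not pupil:
        raise ValueError("trial contains no samples")
    bad = 0
    for p, g in zip(pupil, gaze_offset):
        if p is None or abs(g) > n_sd * gaze_sd:
            bad += 1
    return bad / len(pupil)


def keep_trial(pupil, gaze_offset, gaze_sd, max_invalid=0.20):
    """Keep a trial unless more than `max_invalid` of its samples are invalid."""
    return invalid_fraction(pupil, gaze_offset, gaze_sd) <= max_invalid


# Hypothetical 10-sample trial: one blink and one erratic gaze sample.
pupil = [3.9, 3.9, None, 4.0, 4.0, 4.1, 4.0, 4.0, 3.9, 3.9]
gaze_offset = [0.1, 0.2, 0.0, 0.1, 5.0, 0.3, 0.2, 0.1, 0.0, 0.1]
gaze_sd = 1.0  # assumed SD of gaze offset, in the same units as gaze_offset

print(invalid_fraction(pupil, gaze_offset, gaze_sd))  # 2 of 10 samples flagged
print(keep_trial(pupil, gaze_offset, gaze_sd))        # exactly 20% is not "over 20%"
```

Using a strict "over 20%" threshold means a trial with exactly one fifth of its samples flagged is retained, which follows the wording of the criterion as stated.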

Nevertheless, speech production following the response cue could potentially interfere with the pupillary response corresponding to memory encoding. Individual differences in the timing of responding could also interfere with the correspondence between memory encoding and pupillary response. However, this artefact was present for every word because participants needed to repeat words in all conditions. Therefore, the difference in pupil trace observed within this time window could not be entirely due to production confounds.

4 Conclusion

As one of the first few studies to investigate pupillary responses under sustained and complex listening conditions, the present study serves as a bridge between established listening effort research and future directions in understanding and quantifying listening effort in real-life communication in various populations. The concurrent recall task did not allow listeners to process just one item, shake off the load once finished and start afresh for the next item. Instead, they needed to remain constantly attentive and to allocate cognitive resources to processing new items while holding other information in (working) memory. This is similar to a real-life communication scenario where multiple tasks compete for a limited pool of cognitive resources over a period of time. Our results suggest that PPD, a traditional pupillometry metric for listening effort in a single listening task, is no longer robust in a complex task. Instead, the baseline, and specifically its trial-by-trial variations, is more indicative of the overall cognitive load. The results also suggest that both the magnitude and the temporal pattern of the pupillary response in sustained listening conditions differ greatly from those in a single task.

Although real-life speech communication is even more complex and dynamic, the present study serves as a good starting point by choosing a paradigm that approximates cognitive processing in speech communication closely enough, yet provides sufficient time locking to a given type of cognitive processing to ensure the interpretability of the results. A better understanding of listening effort in ecological environments is also important for developing clinical measures, especially for CI users and HI listeners. It is possible that prior motivational, emotional and cognitive factors, as well as social pressure, could disturb the relation between pupillary response and listening effort that is well-established in research settings.

Supporting information

S1 Appendix. Alternative method to calculate PPD.

Results and discussions on the alternative method to perform baseline correction using the averaged pupil trace 1s before the first word in the list.

(PDF)

S2 Appendix. Model summary outputs.

Model parameter estimates and model comparison statistics for the best fitting models. The reference level for the categorical factor LISTENING is 0dB, for the factor TASK is repeat-only.

(PDF)

S3 Appendix. Position effect in the word recall task.

Analysis on the position of the words recalled in the repeat-with-recall task.

(PDF)

S1 Raw data

(ZIP)

Acknowledgments

We thank Florian Malaval and Arthur Delage for assistance with running the experiment with native French-speaking participants.

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

This work received funding from the MITACS Accelerate program, in collaboration with Oticon Medical Canada, under the grant number IT10517 (https://www.mitacs.ca/en/programs/accelerate; https://www.oticonmedical.com/ca). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Rönnberg J, Rudner M, Foo C, Lunner T. Cognition counts: A working memory system for ease of language understanding (ELU). International Journal of Audiology. 2008;47(sup2):S99–S105. doi: 10.1080/14992020802301167
  • 2. Mattys SL, Davis MH, Bradlow AR, Scott SK. Speech recognition in adverse conditions: A review. Language and Cognitive Processes. 2012;27(7-8):953–978. doi: 10.1080/01690965.2012.705006
  • 3. Pichora-Fuller MK, Kramer SE, Eckert MA, Edwards B, Hornsby BW, Humes LE, et al. Hearing impairment and cognitive energy: The framework for understanding effortful listening (FUEL). Ear and Hearing. 2016;37:5S–27S. doi: 10.1097/AUD.0000000000000312
  • 4. Kahneman D. Attention and effort. vol. 1063. Citeseer; 1973.
  • 5. Rudner M. Cognitive spare capacity as an index of listening effort. Ear and hearing. 2016;37:69S–76S. doi: 10.1097/AUD.0000000000000302
  • 6. Kramer SE, Kapteyn TS, Festen JM, Kuik DJ. Assessing aspects of auditory handicap by means of pupil dilatation. Audiology. 1997;36(3):155–164. doi: 10.3109/00206099709071969
  • 7. McCoy SL, Tun PA, Cox LC, Colangelo M, Stewart RA, Wingfield A. Hearing loss and perceptual effort: Downstream effects on older adults’ memory for speech. The Quarterly Journal of Experimental Psychology Section A. 2005;58(1):22–33. doi: 10.1080/02724980443000151
  • 8. Gosselin PA, Gagné JP. Use of a Dual-Task Paradigm to Measure Listening Effort Utilisation d’un paradigme de double tâche pour mesurer l’attention auditive. Inscription au Répertoire. 2010;34(1):43.
  • 9. Rönnberg J, Lunner T, Zekveld A, Sörqvist P, Danielsson H, Lyxell B, et al. The Ease of Language Understanding (ELU) Model: theoretical, empirical, and clinical advances. Frontiers in systems neuroscience. 2013;7:31. doi: 10.3389/fnsys.2013.00031
  • 10. McGarrigle R, Munro KJ, Dawes P, Stewart AJ, Moore DR, Barry JG, et al. Listening effort and fatigue: What exactly are we measuring? A British Society of Audiology Cognition in Hearing Special Interest Group ‘white paper’. International journal of audiology. 2014;53(7):433–445. doi: 10.3109/14992027.2014.890296
  • 11. Nachtegaal J, Kuik DJ, Anema JR, Goverts ST, Festen JM, Kramer SE. Hearing status, need for recovery after work, and psychosocial work characteristics: Results from an internet-based national survey on hearing. International journal of audiology. 2009;48(10):684–691. doi: 10.1080/14992020902962421
  • 12. Grimby A, Ringdahl A. Does having a job improve the quality of life among post-lingually deafened Swedish adults with severe-profound hearing impairment? British Journal of Audiology. 2000;34(3):187–195. doi: 10.3109/03005364000000128
  • 13. Kramer SE, Kapteyn TS, Houtgast T. Occupational performance: Comparing normally-hearing and hearing-impaired employees using the Amsterdam Checklist for Hearing and Work: Desempeño laboral: Comparación de empleados con audición normal o alterada usando el Listado Amsterdam para Audición y Trabajo. International journal of audiology. 2006;45(9):503–512.
  • 14. Ohlenforst B, Zekveld AA, Jansma EP, Wang Y, Naylor G, Lorens A, et al. Effects of hearing impairment and hearing aid amplification on listening effort: A systematic review. Ear and hearing. 2017a;38(3):267. doi: 10.1097/AUD.0000000000000396
  • 15. Zekveld AA, Koelewijn T, Kramer SE. The pupil dilation response to auditory stimuli: Current state of knowledge. Trends in hearing. 2018;22:2331216518777174. doi: 10.1177/2331216518777174
  • 16. Karatekin C, Couperus JW, Marcus DJ. Attention allocation in the dual-task paradigm as measured through behavioral and psychophysiological responses. Psychophysiology. 2004;41(2):175–185. doi: 10.1111/j.1469-8986.2004.00147.x
  • 17. Zekveld AA, Kramer SE, Festen JM. Pupil response as an indication of effortful listening: The influence of sentence intelligibility. Ear and hearing. 2010;31(4):480–490. doi: 10.1097/AUD.0b013e3181d4f251
  • 18. Goldinger SD, Papesh MH. Pupil dilation reflects the creation and retrieval of memories. Current directions in psychological science. 2012;21(2):90–95. doi: 10.1177/0963721412436811
  • 19. Koelewijn T, Zekveld AA, Festen JM, Rönnberg J, Kramer SE. Processing load induced by informational masking is related to linguistic abilities. International journal of otolaryngology. 2012;2012. doi: 10.1155/2012/865731
  • 20. Koelewijn T, de Kluiver H, Shinn-Cunningham BG, Zekveld AA, Kramer SE. The pupil response reveals increased listening effort when it is difficult to focus attention. Hearing research. 2015;323:81–90. doi: 10.1016/j.heares.2015.02.004
  • 21. Winn MB, Edwards JR, Litovsky RY. The impact of auditory spectral resolution on listening effort revealed by pupil dilation. Ear and hearing. 2015;36(4):e153. doi: 10.1097/AUD.0000000000000145
  • 22. Winn MB. Rapid release from listening effort resulting from semantic context, and effects of spectral degradation and cochlear implants. Trends in Hearing. 2016;20:2331216516669723. doi: 10.1177/2331216516669723
  • 23. McMurray B, Farris-Trimble A, Rigler H. Waiting for lexical access: Cochlear implants or severely degraded input lead listeners to process speech less incrementally. Cognition. 2017;169:147–164. doi: 10.1016/j.cognition.2017.08.013
  • 24. Wendt D, Hietkamp RK, Lunner T. Impact of noise and noise reduction on processing effort: A pupillometry study. Ear and hearing. 2017;38(6):690–700. doi: 10.1097/AUD.0000000000000454
  • 25. Zhao S, Bury G, Milne A, Chait M. Pupillometry as an objective measure of sustained attention in young and older listeners. Trends in hearing. 2019;23:2331216519887815. doi: 10.1177/2331216519887815
  • 26. Peavler WS. Pupil size, information overload, and performance differences. Psychophysiology. 1974;11(5):559–566. doi: 10.1111/j.1469-8986.1974.tb01114.x
  • 27. Granholm E, Asarnow RF, Sarkin AJ, Dykes KL. Pupillary responses index cognitive resource limitations. Psychophysiology. 1996;33(4):457–461. doi: 10.1111/j.1469-8986.1996.tb01071.x
  • 28. Cabestrero R, Crespo A, Quirós P. Pupillary dilation as an index of task demands. Perceptual and motor skills. 2009;109(3):664–678. doi: 10.2466/pms.109.3.664-678
  • 29. Zekveld AA, Kramer SE. Cognitive processing load across a wide range of listening conditions: Insights from pupillometry. Psychophysiology. 2014;51(3):277–284. doi: 10.1111/psyp.12151
  • 30. Kramer SE, Teunissen CE, Zekveld AA. Cortisol, chromogranin A, and pupillary responses evoked by speech recognition tasks in normally hearing and hard-of-hearing listeners: a pilot study. Ear and hearing. 2016;37:126S–135S. doi: 10.1097/AUD.0000000000000311
  • 31. Ohlenforst B, Zekveld AA, Lunner T, Wendt D, Naylor G, Wang Y, et al. Impact of stimulus-related factors and hearing impairment on listening effort as indicated by pupil dilation. Hearing Research. 2017b;351:68–79. doi: 10.1016/j.heares.2017.05.012
  • 32. Ohlenforst B, Wendt D, Kramer SE, Naylor G, Zekveld AA, Lunner T. Impact of SNR, masker type and noise reduction processing on sentence recognition performance and listening effort as indicated by the pupil dilation response. Hearing research. 2018;365:90–99. doi: 10.1016/j.heares.2018.05.003
  • 33. Aston-Jones G, Cohen JD. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu Rev Neurosci. 2005;28:403–450. doi: 10.1146/annurev.neuro.28.061604.135709
  • 34. Murphy PR, O’connell RG, O’sullivan M, Robertson IH, Balsters JH. Pupil diameter covaries with BOLD activity in human locus coeruleus. Human brain mapping. 2014;35(8):4140–4154. doi: 10.1002/hbm.22466
  • 35. Koelewijn T, Zekveld AA, Lunner T, Kramer SE. The effect of reward on listening effort as reflected by the pupil dilation response. Hearing research. 2018;367:106–112. doi: 10.1016/j.heares.2018.07.011
  • 36. Wang Y, Naylor G, Kramer SE, Zekveld AA, Wendt D, Ohlenforst B, et al. Relations between self-reported daily-life fatigue, hearing status, and pupil dilation during a speech perception in noise task. Ear and Hearing. 2018;39(3):573–582. doi: 10.1097/AUD.0000000000000512
  • 37. Hockey R. The psychology of fatigue: Work, effort and control. Cambridge University Press; 2013.
  • 38. Verney SP, Granholm E, Marshall SP. Pupillary responses on the visual backward masking task reflect general cognitive ability. International Journal of Psychophysiology. 2004;52(1):23–36. doi: 10.1016/j.ijpsycho.2003.12.003
  • 39. Winn MB, Wendt D, Koelewijn T, Kuchinsky SE. Best practices and advice for using pupillometry to measure listening effort: An introduction for those who want to get started. Trends in hearing. 2018;22:2331216518800869. doi: 10.1177/2331216518800869
  • 40. Beatty J. Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological bulletin. 1982;91(2):276. doi: 10.1037/0033-2909.91.2.276
  • 41. Damsma A, van Rijn H. Pupillary response indexes the metrical hierarchy of unattended rhythmic violations. Brain and cognition. 2017;111:95–103. doi: 10.1016/j.bandc.2016.10.004
  • 42. Marois A, Labonté K, Parent M, Vachon F. Eyes have ears: Indexing the orienting response to sound using pupillometry. International Journal of Psychophysiology. 2018;123:152–162. doi: 10.1016/j.ijpsycho.2017.09.016
  • 43. Gilzenrat MS, Nieuwenhuis S, Jepma M, Cohen JD. Pupil diameter tracks changes in control state predicted by the adaptive gain theory of locus coeruleus function. Cognitive, Affective, & Behavioral Neuroscience. 2010;10(2):252–269. doi: 10.3758/CABN.10.2.252
  • 44. McGarrigle R, Dawes P, Stewart AJ, Kuchinsky SE, Munro KJ. Pupillometry reveals changes in physiological arousal during a sustained listening task. Psychophysiology. 2017a;54(2):193–203. doi: 10.1111/psyp.12772
  • 45. Zekveld AA, Kramer SE, Rönnberg J, Rudner M. In a concurrent memory and auditory perception task, the pupil dilation response is more sensitive to memory load than to auditory stimulus characteristics. Ear and hearing. 2019;40(2):272. doi: 10.1097/AUD.0000000000000612
  • 46. Fournier JE. Audiométrie vocale: les épreuves d’intelligibilité et leurs applications au diagnostic, à l’expertise et à la correction prothétique des surdités. Maloine; 1951.
  • 47. Lunner T, Rudner M, Rosenbom T, Ågren J, Ng EHN. Using speech recall in hearing aid fitting and outcome evaluation under ecological test conditions. Ear and hearing. 2016;37:145S–154S. doi: 10.1097/AUD.0000000000000294
  • 48. Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software. 2015;67(1):1–48. doi: 10.18637/jss.v067.i01
  • 49. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
  • 50. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York; 2016. Available from: https://ggplot2.tidyverse.org.
  • 51. Mirman D. Growth curve analysis and visualization using R. CRC Press; Boca Raton, FL; 2014.
  • 52. Bristow D, Frith C, Rees G. Two distinct neural effects of blinking on human visual processing. Neuroimage. 2005;27(1):136–145. doi: 10.1016/j.neuroimage.2005.03.037
  • 53. Klingner J, Kumar R, Hanrahan P. Measuring the task-evoked pupillary response with a remote eye tracker. In: Proceedings of the 2008 symposium on Eye tracking research & applications. ACM; 2008. p. 69–72.
  • 54. Kucewicz MT, Dolezal J, Kremen V, Berry BM, Miller LR, Magee AL, et al. Pupil size reflects successful encoding and recall of memory in humans. Scientific reports. 2018;8(1):1–7. doi: 10.1038/s41598-018-23197-6
  • 55. Surprenant AM. The effect of noise on memory for spoken syllables. International Journal of Psychology. 1999;34(5-6):328–333. doi: 10.1080/002075999399648
  • 56. Sarampalis A, Kalluri S, Edwards B, Hafter E. Objective measures of listening effort: Effects of background noise and noise reduction. Journal of Speech, Language, and Hearing Research. 2009;. doi: 10.1044/1092-4388(2009/08-0111)
  • 57. Ng EHN, Rudner M, Lunner T, Pedersen MS, Rönnberg J. Effects of noise and working memory capacity on memory processing of speech for hearing-aid users. International Journal of Audiology. 2013;52(7):433–441. doi: 10.3109/14992027.2013.776181
  • 58. Pichora-Fuller MK, Schneider BA, Daneman M. How young and old adults listen to and remember speech in noise. The Journal of the Acoustical Society of America. 1995;97(1):593–608. doi: 10.1121/1.412282
  • 59. Ng EHN, Rudner M, Lunner T, Rönnberg J. Noise reduction improves memory for target language speech in competing native but not foreign language speech. Ear and Hearing. 2015;36(1):82–91. doi: 10.1097/AUD.0000000000000080
  • 60. Downs DW. Effects of hearing aid use on speech discrimination and listening effort. Journal of Speech and Hearing Disorders. 1982;47(2):189–193. doi: 10.1044/jshd.4702.189
  • 61. Wingfield A. Evolution of models of working memory and cognitive resources. Ear and hearing. 2016;37:35S–43S. doi: 10.1097/AUD.0000000000000310
  • 62. Edwards B. A model of auditory-cognitive processing and relevance to clinical applicability. Ear and hearing. 2016;37:85S–91S. doi: 10.1097/AUD.0000000000000308
  • 63. Li KZ, Lindenberger U, Freund AM, Baltes PB. Walking while memorizing: Age-related differences in compensatory behavior. Psychological science. 2001;12(3):230–237. doi: 10.1111/1467-9280.00341
  • 64. Madden DJ, Langley LK. Age-related changes in selective attention and perceptual load during visual search. Psychology and aging. 2003;18(1):54. doi: 10.1037/0882-7974.18.1.54
  • 65. Hein G, Schubert T. Aging and input processing in dual-task situations. Psychology and Aging. 2004;19(3):416. doi: 10.1037/0882-7974.19.3.416
  • 66. Plummer P, Eskes G. Measuring treatment effects on dual-task performance: a framework for research and clinical practice. Frontiers in human neuroscience. 2015;9:225. doi: 10.3389/fnhum.2015.00225
  • 67. Gagne JP, Besser J, Lemke U. Behavioral assessment of listening effort using a dual-task paradigm: A review. Trends in hearing. 2017;21:2331216516687287. doi: 10.1177/2331216516687287
  • 68. Brehm JW, Self EA. The intensity of motivation. Annual review of psychology. 1989;40(1):109–131. doi: 10.1146/annurev.ps.40.020189.000545
  • 69. Eckert MA, Teubner-Rhodes S, Vaden KI Jr. Is listening in noise worth it? The neurobiology of speech recognition in challenging listening conditions. Ear and hearing. 2016;37(Suppl 1):101S. doi: 10.1097/AUD.0000000000000300
  • 70. Hornsby BW, Naylor G, Bess FH. A taxonomy of fatigue concepts and their relation to hearing loss. Ear and hearing. 2016;37(Suppl 1):136S. doi: 10.1097/AUD.0000000000000289
  • 71. Matthen M. Effort and displeasure in people who are hard of hearing. Ear and hearing. 2016;37:28S–34S. doi: 10.1097/AUD.0000000000000292
  • 72. Richter M. The moderating effect of success importance on the relationship between listening demand and listening effort. Ear and Hearing. 2016;37:111S–117S. doi: 10.1097/AUD.0000000000000295
  • 73. Peelle JE. Listening effort: How the cognitive consequences of acoustic challenge are reflected in brain and behavior. Ear and Hearing. 2018;39(2):204. doi: 10.1097/AUD.0000000000000494
  • 74. Paas F, Renkl A, Sweller J. Cognitive load theory and instructional design: Recent developments. Educational psychologist. 2003;38(1):1–4. doi: 10.1207/S15326985EP3801_1
  • 75. Choi S, Lotto A, Lewis D, Hoover B, Stelmachowicz P. Attentional modulation of word recognition by children in a dual-task paradigm. Journal of Speech, Language, and Hearing Research. 2008;. doi: 10.1044/1092-4388(2008/076)
  • 76. McFadden B, Pittman A. Effect of minimal hearing loss on children’s ability to multitask in quiet and in noise. Language, speech, and hearing services in schools. 2008;.
  • 77. López-Ornat S, Karousou A, Gallego C, Martín L, Camero R. Pupillary measures of the cognitive effort in auditory novel word processing and short-term retention. Frontiers in psychology. 2018;9.
  • 78. Johnson EL, Miller Singley AT, Peckham AD, Johnson SL, Bunge SA. Task-evoked pupillometry provides a window into the development of short-term memory capacity. Frontiers in psychology. 2014;5:218. doi: 10.3389/fpsyg.2014.00218
  • 79. McGarrigle R, Dawes P, Stewart AJ, Kuchinsky SE, Munro KJ. Measuring listening-related effort and fatigue in school-aged children using pupillometry. Journal of experimental child psychology. 2017b;161:95–112. doi: 10.1016/j.jecp.2017.04.006
  • 80. Puma S, Matton N, Paubel PV, Raufaste É, El-Yagoubi R. Using theta and alpha band power to assess cognitive workload in multitasking environments. International Journal of Psychophysiology. 2018;123:111–120. doi: 10.1016/j.ijpsycho.2017.10.004
  • 81. Unsworth N, Engle RW. The nature of individual differences in working memory capacity: active maintenance in primary memory and controlled search from secondary memory. Psychological review. 2007;114(1):104. doi: 10.1037/0033-295X.114.1.104
  • 82. Papesh MH, Goldinger SD, Hout MC. Memory strength and specificity revealed by pupillometry. International Journal of Psychophysiology. 2012;83(1):56–64. doi: 10.1016/j.ijpsycho.2011.10.002
  • 83. Surprenant AM. Effects of noise on identification and serial recall of nonsense syllables in older and younger adults. Aging, Neuropsychology, and Cognition. 2007;14(2):126–143. doi: 10.1080/13825580701217710
  • 84. Kuchinsky SE, Vaden KI Jr, Ahlstrom JB, Cute SL, Humes LE, Dubno JR, et al. Task-related vigilance during word recognition in noise for older adults with hearing loss. Experimental aging research. 2016;42(1):50–66. doi: 10.1080/0361073X.2016.1108712
  • 85. Tsukahara JS, Harrison TL, Engle RW. The relationship between baseline pupil size and intelligence. Cognitive psychology. 2016;91:109–123. doi: 10.1016/j.cogpsych.2016.10.001

Decision Letter 0

Claude Alain

15 Jun 2020

PONE-D-20-12674

Disentangling listening effort and memory load beyond behavioural evidence

PLOS ONE

Dear Dr. Zhang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 30 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Claude Alain

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please amend either the title on the online submission form (via Edit Submission) or the title in the manuscript so that they are identical.

3. Please provide additional details regarding participant consent. In the Methods section, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

4. Please ensure that you have outlined how you recorded data on pupil diameter in your Methods section.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Review on Zhang et al. “Disentangling listening effort and memory load beyond behavioural evidence”

In this work, the authors aim to characterize the effect of listening effort and memory load on pupil diameter using a dual-task paradigm. They reported the behavioural and pupillometry results from one experiment, where subjects were instructed to repeat a word immediately after its presentation. Listening effort was manipulated by varying the SNR of the word. Memory load was manipulated by having an additional task where subjects also needed to memorise 10 words in a row and recall these 10 words after the 10th word. After the response, subjects self-reported their subjective rating of effort.

Most of their findings were consistent with previous studies or very much expected: (1) memory recall performance decreased in the more difficult listening conditions (i.e. when the word was hard to hear); (2) larger pupil dilation was associated with harder-to-hear words, due to greater listening effort, and with words in the dual task, due to greater cognitive load; (3) larger pupil dilation was associated with the response (verbally producing the word), especially in the more difficult listening conditions; (4) subjective effort ratings were higher in harder conditions. The authors were surprised by the fact that, although the absolute pupil diameter was larger in the dual task over the 10-word trial, one of the pupil metrics, peak pupil dilation (= max pupil diameter – pre-trial baseline), decreased from the second word onwards. From this, they claim that the involvement of extra cognitive load interferes with the effect of listening effort on pupil diameter, possibly due to fatigue.

Overall, I found the study well-designed and the data carefully analysed. I particularly like the dual-task paradigm used here, which is quite elegant; the two conditions both involve the same acoustic stimuli (i.e. words) and the same basic task (i.e. recognize and repeat the word) and differ only in the need to maintain the words in memory. This makes the study distinct from past studies (as the authors mention around line 514) and makes the comparison of pupil data neat and clear. The paradigm is also of great potential as it is very close to real-life listening situations: listeners recognize the word in noise, reproduce the word accurately, and occasionally remember the word for future recall.

***Major concerns***

1. Possibly the most important finding of this study is the fact that the PPD (peak pupil dilation) tends to be smaller in the last few words in the repeat-with-recall condition. The authors were very surprised by this result and tried to interpret it by comparing with previous listening effort studies like Zekveld et al., 2019. However, the explanation the authors offered in the discussion was extensive but not satisfying. It is well known that PPD is not only related to the effort or load but also strongly related to its baseline; the larger the baseline, the smaller the PPD. Figure 4b clearly shows that the baseline is indeed large. Thus, a simple explanation for this result is that the pupil simply saturated in the repeat-with-recall condition and cannot expand further in the presence of additional words and responses. If this is the case, the result is not surprising at all.

To exclude this possibility, the authors should consider running further analysis (e.g. regress out the effect of baseline from PPD) or conducting additional experiments to show that pupil still CAN dilate further in the repeat with recall condition. If these cannot be done, the authors should at least discuss it in the discussion. The saturation could be not only due to the mechanical limitation of the muscles controlling the pupil diameter, but also because pupil diameter is strongly correlated with the norepinephrine activity in the LC system. Since the authors are aware of the link between pupil diameter and LC-NE system as this was briefly mentioned in Introduction (line 36), they should also take this into account in the discussion.

2. As mentioned before, this paradigm is very neat and of great potential. The authors have already manipulated the level of listening effort using different levels of SNR. However, the paradigm lacks a manipulation of cognitive/memory load, although this could easily be done. One way to manipulate the memory load is to vary the number of words required to recall, for example 5/10/15 words. This would add an additional but necessary dimension to the existing study; otherwise it is not possible to disentangle the effect of listening effort on pupil diameter from that of memory load, which is what the title of this paper states. This additional experiment with varying memory load should also provide some answers to the question raised in (1), namely whether the pupil is saturated in the repeat-with-recall condition.

3. [Line 549] A minor concern related to this part is that Zekveld et al., 2019 might not be the best study to compare with. The paradigm used in the current study requires listeners to sustain attention over a much longer period; 10 words in total, including the sound presentation, the word reproduction, and the word type-in by the experimenter, might take almost 30 seconds. I would recommend that the authors take a look at recent pupillometry publications using a sound signal of similar length.

4. [line 122] The authors need to justify how this sample size was determined.

5. [line 240] How was PPD computed here? Was it extracted from each trial and then averaged within each subject? Or was PPD directly extracted from each subject’s average pupil diameter response?

6. [line 248] Similar question (5) applies to peak latency.

7. [Line 387] The stat test shows a significant difference in the second time window. However, by looking at Figure 5b the error bar of Forgotten and Correct largely overlaps and makes this test result unconvincing. Could the authors run a time-series stat analysis on the pupil data (like the analysis used in Figure 3a, Zhao et al. 2019 Trends in Hearing) to double-check whether the significance is true and if so when the significant interval starts?

8. [line 410] Please state the method of the correlation. e.g. Spearman or Pearson?

***Other comments***

1. [Fig.5] Please use different colours for different conditions’ shaded area. The current colour and pattern makes it hard to tell which area belongs to which condition.

2. [Fig.5] How are these time windows determined?

3. [Figure 6] The solid curve looks over-smoothed compared to its shaded area. Compared with the relationship between the shaded area and the solid curve in Figure 5, the shaded area in Figure 6 is extremely spiky. Maybe additional smoothing was accidentally applied to the group average? If so, the authors should justify the difference in the analysis pipeline.

4. [line 394] Possibly I misunderstood the content, but could the authors please provide more details or make it clearer: how is the mean pupil diameter computed here? Over which time window? Also, to support the statement in this line, could you also plot the mean pupil diameter against the number of stated words and test the correlation as in Figure 7?

5. [line 417] Nice to see that the authors noticed the relationship between these metrics with the age as ageing is a known factor in pupil diameter. As the authors stated “Note that these correlations should be considered with caution due to no corrections” I was expecting to see these results with the age being regressed out.

6. [line 472] “the recall task probably because it was more interesting and rewarding”. It’s unclear how the recall task can be rewarding here. Did the authors apply a special bonus to the repeat with recall condition?

Reviewer #2: I think this is a pretty good paper, in principle. The work seems to have been well done, and it addresses some very interesting and timely questions. That said, I found the original manuscript relatively difficult to read, not because of any problems with language but because I think it needs one more thorough revision now that the authors have successfully thought through all of their ideas - there are good ideas in here, but they're jumbled up still, and hard to find. Thus, most of my suggestions relate to writing and organization.

Abstract

When stating “different signal to noise ratios” please state what these are. Otherwise it makes no sense to state that “ (PPD) was bigger in the 0dB versus other conditions” because we have no idea whether those were all negative, or all positive, or by how much compared to 0 dB.

It’s not clear to me from the abstract how baseline PPD can follow a growth pattern. I assume this means across trials or perhaps across words within a trial, but that is not clear.

If PPD increased, how did it then decrease? This whole paragraph could be rewritten more clearly.

No need to refer to “concur with the recent literature” in the abstract. It's a statement that is too vague to be useful.

11 speech recognition is similar to what?

48-52 weird change of tense in the middle of this sentence. (“needs to decode… pondering…”)

93+ I think the section on LISTENING condition should be a separate paragraph. Also, I’m not sure it makes sense to go into this much detail about the study before the methods section. Right now, I’m left wanting to know more – for example, was the SNR variation blocked or was it randomized within trials? So either more info is needed here, or possibly less.

95-97 this sentence seems out of place (“The effect of SNR …”)

98-99 the issue of pupil traces for recalled vs. not recalled tokens seems quite a bit different from the topics that have been discussed so far. This is a big issue in the memory literature, and you probably need to take a look at some of that to restructure your introduction to better reflect this emphasis and the scientific context in which it fits.

105 I’d leave out “according to past studies” – you’re doing your own work here.

109-113 This prediction needs to be broken up, and perhaps re-thought. If baseline PD is expected to increase with memory load then it’s not obvious to me that PPD will also increase – if you’re raising the floor over the course of the 10 items in the trial, does it make sense to assume that the peak from that increasingly higher floor will *also* be increasing? Also, How do you think the increased memory load-related increasing baseline will interact with the previously mentioned decrease in PD as a function of time-on-task?

114-116 It is not clear whether this “rise” in PD is referring to the appearance of the pupil diameter trace in the course of a single stimulation, or over the course of a 10-word trial, or over the course of the entire experiment. The issue of time needs to be MUCH clearer throughout the paper.

121-126 It’s common to provide a gender breakdown for participants. But, more importantly, it’s incredibly important to identify the number of participants run in French and the number run in English. This could be a very significant factor and should probably be included in the final analysis [it is analyzed, so definitely mention it here].

128-138 How many words in all? How many lists per condition? This is starting to seem like a rather incomplete Stimuli section.

127-134 Please break down durations by language. Ideally, please list all words in the supplementary materials. Was there any attempt to balance the intelligibility of the different word lists? If not, word list might need to be a factor in the eventual model as well. Also, a reference to “Fournier” words is needed; I don't know what those are.

147 Was the calibration done with a pure tone, or with the speech stimuli and (if speech) then with or without noise (and if with noise, then at what SNR level)?

150 I presume “at 14 dB” means “at 14 dB SNR” but that should be made clear

159 what is the significance of the (0.5s) in this line and the (1s) in the next line?

From the procedures section, it’s still not clear whether SNR was constant for a given set of words, or block of trials, or randomized (either across words within a trial, which would be admittedly strange) or across trials. Also, the ordering of TASK and LISTENING condition combinations is not specified.

162 Adjustment of the Target level in a constant noise level means that in the highest SNR condition the Target was 14 dB louder than in the 0 dB condition, right? This seems like an extreme difference, even without the presence of noise. What was the noise level alone? More importantly, what was the level of the signal in the QUIET condition – the same as that in the 0dB condition, or the 14 dB condition, or something else? Given that autonomic responses can be influenced by absolute level, what procedures were implemented to ensure that the differences observed were not simply due to differences in overall signal level?

169 Were there procedures for dealing with homophones? Were the transcribers well-versed in the set of words being used? Could the experimenter/transcriber see the intended word?

175 what was the time delay before the word RECALL appeared?

184 Was a “block” one set of 10 words, or 3?

189-192 Please provide degrees of freedom for the t-tests.

203 missing “were” before “retained” (or change to “remained”)

1.4.1+ It would be easier to understand the statistical analyses if you would provide the actual model, either in lme4 syntax (easy to do in this case) or in standard mathematical notation. This is quite common nowadays and could be put into supplementary materials if space is an issue.

232 What does “aggregated per word” mean? Averaged?

Also, in this context, it is confusing to say “per word” if you actually mean (as I think you do) “per word position” (i.e. 1-10). Aggregating “per word” seems impossible if listeners never heard the same word twice as implied in the methods section.

236 Please clarify – these were the “aggregated” traces, right? 1 trace per subject per word-position?

Also, given that the actual words were presumably of different durations, I think using absolute time (i.e. in seconds) is a bad idea, because it could blur effects that are related to the duration of the word. You might consider normalizing all times before any averaging is done. Or fit a curve to each individual trace and then compute the peak and the latency from that, then do averages over those values.
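The time-normalisation idea mentioned above can be sketched as follows. This is a minimal illustration only, not a method taken from the manuscript: the function name, the linear interpolation, and the toy traces are all assumptions. Each trace is resampled onto the same number of points spanning its own duration, so that traces for words of different lengths line up on proportional time before averaging.

```python
def resample_to_normalized_time(trace, n_points):
    """Linearly interpolate `trace` onto `n_points` evenly spaced points
    spanning normalized time 0..1 (first sample to last sample)."""
    if n_points < 2 or len(trace) < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(trace) - 1
    resampled = []
    for i in range(n_points):
        position = i * last / (n_points - 1)  # fractional index in the input
        lower = int(position)
        upper = min(lower + 1, last)
        weight = position - lower
        resampled.append(trace[lower] * (1 - weight) + trace[upper] * weight)
    return resampled


# Hypothetical pupil traces for a short and a long word (arbitrary units).
short_word = [0.0, 1.0, 0.0]
long_word = [0.0, 0.5, 1.0, 0.5, 0.0]

short_norm = resample_to_normalized_time(short_word, 5)
long_norm = resample_to_normalized_time(long_word, 5)
print(short_norm)
print(long_norm)
```

In this toy case the two traces have the same shape once expressed in proportional time, whereas averaging them in absolute time would smear the peak.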

243-26 I like the comparison between the “block baseline” and the “word position baseline”.

252-291 I find it a little confusing (and quite demanding on my working memory) to present all statistical analyses prior to any results. I think this section would be helped by using sub-headings and, again, by providing the actual models in either lme4 syntax or as an equation. Alternatively, these paragraphs could be put as the initial paragraph of the respective results (sub)sections.

265-282 I think the discussion of the second time window suggests that what you really should be doing is looking at the entire pupil diameter curve from the onset of the word-position to 1.5s after its offset. See Winn & Moore (2018) for a really clever way of breaking such long(ish) traces down for analysis.

Figure 2 a minor point, but it seems needlessly complicated to present the results with different Y axes representing essentially the same thing - % correct words repeated vs. Average number of words remembered (presumably out of 10?)

323-418 The results section is incredibly hard to read. Please revise to put things into complete sentences. It’s not just about presenting a bunch of equations here, you need to organize them in such a way that the reader can understand what you’re talking about. At this point I can’t really. Please give values (i.e. don’t just tell me baseline PD was bigger in one level than another, tell me what the value was for each level). It’s confusing to read that baseline pupil diameter was “bigger than 14 dB” (line 332) when to my knowledge we don’t typically measure pupil diameter in decibels. Yes, I can figure out what you mean, but this is currently written as it might be written in a lab notebook, for personal consumption, not as it should be written for scientific communication. And in some cases, it’s opaque: in line 335, is the (0.2 mm) referring to the absolute diameter, or the amount by which it is bigger? Even at the end “due to no corrections” is practically txting the results…

Also, I think the trend analyses could be discussed separately. Basically, right now it seems as if you're more or less just listing the results of all the tests you did, maybe in chronological order or perhaps loosely organized (?) according to dependent measure. Please consider some way of organizing the results in a way that facilitates the reader’s understanding of why you conclude what you will eventually conclude or, at a minimum, that reflects the issues that you determined were relevant to investigate as described in the introduction. Ideally, the results section should be presented in the same order as the discussion section, which should walk the reader through the data toward the eventual theoretical claims that you want to make (and which should reflect the relative importance of topics as discussed in the introduction). Right now I honestly can’t figure out what data point(s) are particularly relevant or irrelevant, it just sort of devolved into a giant mass of statistical tests presented without obvious organization.

[Discussion section is also a bit confusing - mostly due to digressions, though]

Figures 3 & 4 Looking at the traces in Figures 3a and 4a it seems apparent to me that peak pupil dilation may not be a useful metric here. Except in the first word position there really isn’t much of a *peak* of any sort visible in 4a. And you can see that when those word positions get averaged together (for the images in 3a) any potential peakedness disappears. So why not use average PD or something like that? I think that would tell the story at least as well, and would be less subject to potentially weird micro-effects such as the weird flip of the black and red dots in positions 5, 8, and 9 of figure 4c.

Also, the Y axis of 3c and 4c should somehow indicate that this is change from baseline.

In general, I’d recommend considering a very different way of doing this analysis, perhaps along the lines of Winn & Moore (2018).

340 should this be 3b or 4b?

432-443 references needed here to Pichora-Fuller et al. 1995, Surprenant (1999, 2007).

461-506 I really struggled with this discussion. I think the long and detailed references to the noise reduction work are distracting and superfluous. So lines 436-455 could be reduced to just lines 451-455.

Also, this discussion brings up the question of what, exactly, pupil dilation tells us. Arguably, it could provide information about the overall level of engagement of cognitive resources (I think that’s what the baseline measurement is supposed to get at, here) in these two conditions, as well as the moment-by-moment allocation of those resources during part of a task (encoding, repetition, recall). Given that you *have* pupil dilation data, I think this needs to be addressed somehow, before going into details of what dual task paradigms may or may not tell us.

And, finally, what do you conclude? I appreciate that there are multiple possible interpretations, but you've thought about this far more than most. Could you lead the reader from this apparent bafflement into something that we can be more satisfied with?

488 Could you examine age differences in your data? What would you predict to see either in terms of behavior or pupil dilation if people are prioritizing things differently?

490 what does it mean that the recall paradigm is from previous studies? Which recall paradigm?

498 Define SWIR acronym.

509 What second hypothesis? There are so many hypotheses swirling around by now I’ve lost track of which one is which. Please restate.

519 these references did not all use the same speech perception task. Clarify.

523 You don’t really have data showing any greater effort of your task over other tasks.

540 The lack of position effect is extremely unusual for a serial recall task and needs to be discussed in much more detail. It should also be presented in the results. This is one of those memory effects that is so basic it’s taught in intro psych textbooks... I would very much like to see a graph of word recognition and recall by word position. I have great difficulty imagining that there wasn’t some kind of recency effect at least, if not also a primacy effect, with a 10-item list to be recalled.

585-621 It seems to me that the best explanation for smaller growth of the PPD is that the baseline is increasing. So the limit (probably physiological, based on light levels) is imposed not in terms of how much the pupil can dilate, but in terms of how much of a dilation it will reach. In other words, illumination, which you held constant, may have imposed an upper limit on pupil dilation, such that as the baseline creeps up with increasing memory load in the recall condition, or creeps down with increasing habituation in the repeat-only condition, you get the difference between the two gradually shrinking (in the recall condition) or increasing (in the repeat only condition).

623-626 Word choice seems problematic. What does it mean to “hold predictive power” or to be “responsive for recall performance”? Say what you want to say in a simple way.

References that should be incorporated into a revision

Goldinger, S. D., & Papesh, M. H. (2012). Pupil dilation reflects the creation and retrieval of memories. Current directions in psychological science, 21(2), 90-95.

Kucewicz, M. T., Dolezal, J., Kremen, V., Berry, B. M., Miller, L. R., Magee, A. L., ... & Worrell, G. A. (2018). Pupil size reflects successful encoding and recall of memory in humans. Scientific reports, 8(1), 1-7.

Miller, A. L., Gross, M. P., & Unsworth, N. (2019). Individual differences in working memory capacity and long-term memory: The influence of intensity of attention to items at encoding as measured by pupil dilation. Journal of Memory and Language, 104, 25-42.

Pichora‐Fuller, M. K., Schneider, B. A., & Daneman, M. (1995). How young and old adults listen to and remember speech in noise. The Journal of the Acoustical Society of America, 97(1), 593-608.

Surprenant, A. M. (1999). The effect of noise on memory for spoken syllables. International Journal of Psychology, 34(5-6), 328-333.

Surprenant, A. M. (2007). Effects of noise on identification and serial recall of nonsense syllables in older and younger adults. Aging, Neuropsychology, and Cognition, 14(2), 126-143.

Winn, M. B., & Moore, A. N. (2018). Pupillometry reveals that context benefit in speech perception can be disrupted by later-occurring sounds, especially in listeners with cochlear implants. Trends in hearing, 22, 2331216518808962.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sijia Zhao

Reviewer #2: Yes: Alexander L. Francis


PLoS One. 2021 Mar 3;16(3):e0233251. doi: 10.1371/journal.pone.0233251.r002

Author response to Decision Letter 0


21 Aug 2020

All line numbers in our response refer to the marked-up manuscript for clear comparison with the original draft.

We thank the editor and reviewers for their comments. We have incorporated their feedback in our revised article to improve the methods reporting, better organise the sections, update the article with more relevant literature and clarify our main findings.

In summary, we believe that we have strengthened our research article to meet your publication criteria.

Reviewer #1: Review on Zhang et al. “Disentangling listening effort and memory load beyond behavioural evidence” Overall, I found the study well-designed and the data carefully analysed. I particularly like the dual-task paradigm used here, which is quite elegant; the two conditions both involve the same acoustic stimuli (i.e. words) and basic task (i.e. recognize and repeat the word) but differ only in the need to maintain the words in memory. This makes this study distinct from past studies (as the authors mentioned around line 514) and makes the comparison of pupil data neat and clear. The paradigm is also of great potential as it is very close to the real-life listening situation: listeners need to recognize the word in noise, reproduce the word accurately, and occasionally remember the word for future recall.

We thank Reviewer 1 for the very useful feedback on the manuscript. We address each concern below:

***Major concerns***

1. Possibly the most important finding of this study is the fact that the PPD (peak pupil dilation) tends to be smaller in the last few words in the repeat-with-recall condition. The authors were very surprised by this result and tried to interpret it by comparing with previous listening effort studies like Zekveld et al., 2019. However, the explanation the authors offered in the discussion was extensive but not satisfying. It is well known that PPD is not only related to the effort or load but also strongly related to its baseline; the larger the baseline, the smaller the PPD. Figure 4b clearly shows that the baseline is indeed large. Thus, a simple explanation for this result is that the pupil simply saturated in the repeat-with-recall condition and cannot expand further in the presence of additional words and responses. If this is the case, the result is not surprising at all.

To exclude this possibility, the authors should consider running further analysis (e.g. regress out the effect of baseline from PPD) or conducting additional experiments to show that pupil still CAN dilate further in the repeat with recall condition. If these cannot be done, the authors should at least discuss it in the discussion. The saturation could be not only due to the mechanical limitation of the muscles controlling the pupil diameter, but also because pupil diameter is strongly correlated with the norepinephrine activity in the LC system. Since the authors are aware of the link between pupil diameter and LC-NE system as this was briefly mentioned in Introduction (line 36), they should also take this into account in the discussion.

We thank the reviewer for contributing this interpretation of the results.

We agree that baseline and PPD are correlated mathematically, because (X-Y) and Y will always be correlated with R² = 0.5 when X and Y are independent random variables with equal variance. The past literature has also demonstrated this correlation. However, the exact relation between baseline and PPD during a hearing or cognitive task depends on the underlying cause. For instance, old age induces a smaller baseline and smaller PPD due to physiological constraints and changes of activity in the peripheral and/or central nervous system (Piquado et al., 2010; Kuchinsky et al., 2016; Wang et al., 2018). Lower luminance induces a bigger baseline but smaller PPD due to the ‘gripping’ of the parasympathetic system (Wang et al., 2018). Therefore, it is unclear what direction the relation between PPD and baseline should take in a task with concurrent listening and cognitive demands. What is surprising in our results is that we initially hypothesised that PPD might increase with more difficult SNR and more items to retain in memory, but we see instead that the pupil dilation ‘capped’ during the listening section (before the word recall section). However, looking at the baseline dynamics lets us partially understand the cause of the ‘capping’. This highlights the importance of looking at both PPD and baseline in future experiments that involve more ecologically realistic tests.
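The mathematical point can be checked numerically. The sketch below is illustrative only and is not part of the analysis reported in the article; it assumes that X (peak diameter) and Y (baseline) are independent draws with equal variance, and the function names and the Gaussian choice are hypothetical. Under those assumptions Cov(X-Y, Y) = -Var(Y) and Var(X-Y) = 2 Var(Y), so the correlation is -1/√2 and R² is 0.5.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_r_squared(n_samples=100_000, seed=0):
    """r and R^2 between (X - Y) and Y for independent, equal-variance X, Y."""
    rng = random.Random(seed)
    peak = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]      # hypothetical X
    baseline = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # hypothetical Y
    dilation = [x - y for x, y in zip(peak, baseline)]          # X - Y
    r = pearson_r(dilation, baseline)
    return r, r * r


r, r_squared = simulate_r_squared()
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")  # r near -0.707, R^2 near 0.5
```

If the variances of X and Y differ, R² moves away from 0.5, which is one reason the sign and size of the baseline-PPD relation in real data cannot be read off from this identity alone.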

We share with Reviewer 1 the desire to further disentangle baseline from PPD by either regressing out the effect of baseline or showing that the pupil can still dilate further in the repeat-with-recall condition. The first approach, however, is problematic as long as we cannot distinguish the conditions where baseline and PPD are negatively correlated from those where they may be positively correlated (whether this is seen within or across subjects). So, we opted for the second approach: while a sort of pupil saturation was present during the listening and encoding section from the 1st to the 10th word, the pupil increased at the onset of recall by 0.3 mm on average. Reviewer 1 did not notice this finding, so we made it more explicit in the article: the ceiling of the pupil during the recall blocks cannot be due to mechanical limitation of the muscles controlling the pupil diameter, because right at the end of the block, the pupil diameter rose considerably, an effect equivalent to six times the average PPD at the 10th word. Therefore, it is clear that the pupil ceiling during listening and encoding was not due to mechanical constraints but originated from a cognitive resource allocation strategy. The best interpretation we can offer, and the one we discussed, is that listeners reserve their resources during the 1st to 10th words in order to retrieve the words during the recall section.

2. As mentioned before, this paradigm is very neat and of great potential. The authors have already manipulated the level of listening effort using different levels of SNR. However, the paradigm lacks a manipulation of cognitive/memory load, although this could easily be done. One way to manipulate the memory load is to vary the number of words required to recall, for example 5/10/15 words. This would add an additional but necessary dimension to the existing study; otherwise it is not possible to disentangle the effect of listening effort on pupil diameter from that of memory load, which is what the title of this paper states. This additional experiment with varying memory load should also provide some answers to the question raised in (1), namely whether the pupil is saturated in the repeat-with-recall condition.

We agree with the reviewer that a manipulation of memory load would in principle add further support to the major point raised in the current experiment. But this is not a trivial thing to do: changing the size of the list will likely not do the trick. It has been done several times in the literature (i.e. different flavors of the SWIR paradigm with varying list sizes), and the problem is that listeners tend to “normalize” their performance. They perform surprisingly poorly with short lists and surprisingly well with long lists, such that the manipulation intended to vary the difficulty level within the same task is ineffective. So, this is a good idea in theory, but there are hurdles to overcome beforehand that originate directly from the non-linearity, and possibly non-monotonicity, of the response. This is why we took a first step in this study, looking at how drastically different the pupil dynamics are when memory is involved. And we showed that an SNR manipulation, which is not easily compensated by flexible resource allocation, had relatively little impact in the recall conditions. It is likely that the same will hold whether the lists are 5, 10, or 15 words long. That said, it would be a topic worthy of future investigation, so we added this point to highlight the important next step following this experiment (line 698).

3. [Line 549] A minor concern related to this part is that Zekveld et al., 2019 might not be the best study to compare with. The paradigm used in the current study requires listeners to sustain attention over a much longer period; 10 words in total, including the sound presentation, the word reproduction, and the word type-in by the experimenter, might take almost 30 seconds. I would recommend that the authors take a look at recent pupillometry publications using a sound signal of similar length.

We thank the reviewer for the advice on choosing other publications with stimuli of similar length. We have added a few papers using stimuli of similar length to the introduction and discussion sections (Zhao et al., 2019; Goldinger and Papesh, 2012; Kucewicz et al., 2018).

4. [line 122] The authors need to justify how this sample size was determined.

An a priori analysis using G*Power 3 indicated N = 26 (three predictors: SNR, recall condition and word position; alpha error probability = 0.05) for an effect size of 0.8, using an F test for linear multiple regression.

(Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior research methods, 39(2), 175-191.)

5. [line 240] How was PPD computed here? Was it extracted from each trial and then averaged within each subject? Or was PPD directly extracted from each subject’s average pupil diameter response?

We thank Reviewer 1 for this comment, which we have repeatedly heard while presenting these results at conferences. It is a matter of constant debate. For sentences, the pupil dilation is more stable than it is for individual words, and thus, arguably, one might want to extract PPD directly from individual sentences. We found that this did not apply well to individual words. So, we opted for the PPD taken from the averaged traces. We first performed the baseline correction, subtracting the baseline of each trial from the pupil trace. Then traces were aligned by the onset of the response prompt and aggregated per listener per condition. PPD was then calculated at this aggregated level, instead of the trial level. This method was chosen in line with past studies and ensured that PPD was more robust (Zekveld et al., 2010; Zekveld et al., 2013; Zekveld et al., 2014).

We have re-organised the method and result sections to clarify the detailed procedure (line280).
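To make the distinction between the two options concrete, here is a minimal sketch with made-up numbers. It is not our analysis code: the function names, the toy traces and the 10 Hz sampling rate are hypothetical and serve only to show that a peak taken from the averaged trace can differ from the average of trial-level peaks.

```python
from statistics import mean

def baseline_correct(trace, n_baseline):
    """Subtract the mean of the first n_baseline samples from the whole trace."""
    base = mean(trace[:n_baseline])
    return [x - base for x in trace]

def average_traces(traces):
    """Average equal-length traces sample by sample."""
    return [mean(samples) for samples in zip(*traces)]

def peak_and_latency(trace, start, stop, sample_rate_hz):
    """Return (peak value, latency in seconds from window start) in trace[start:stop]."""
    window = trace[start:stop]
    peak = max(window)
    return peak, window.index(peak) / sample_rate_hz

# Hypothetical raw traces (mm) for two trials of one listener in one condition;
# the first two samples of each trial are treated as the baseline period.
trials = [
    [4.0, 4.0, 4.2, 4.6, 4.3],
    [5.0, 5.0, 5.1, 5.2, 5.4],
]
corrected = [baseline_correct(t, n_baseline=2) for t in trials]

# Approach described above: average first, then extract peak and latency.
averaged = average_traces(corrected)
ppd, latency = peak_and_latency(averaged, start=2, stop=5, sample_rate_hz=10)

# Alternative raised by the reviewer: extract the peak per trial, then average.
trial_level_ppd = mean(
    peak_and_latency(t, start=2, stop=5, sample_rate_hz=10)[0] for t in corrected
)
```

In this toy case the two trials peak at different samples, so the peak of the averaged trace is smaller than the mean of the trial-level peaks; the same averaged trace also yields the peak latency.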

6. [line 248] Similar question (5) applies to peak latency.

The latency was also calculated on the averaged traces, not on each trial. We have also re-organised the method and result section to clarify the procedure.

7. [Line 387] The stat test shows a significant difference in the second time window. However, by looking at Figure 5b the error bar of Forgotten and Correct largely overlaps and makes this test result unconvincing. Could the authors run a time-series stat analysis on the pupil data (like the analysis used in Figure 3a, Zhao et al. 2019 Trends in Hearing) to double-check whether the significance is true and if so when the significant interval starts?

We appreciate Reviewer 1’s concern about the overlapping error bars in Figure 5b. This was mostly due to plotting pupil traces using raw pupil diameter in mm. Traces were not baseline-corrected by each trial here, so plotting the average of all individuals did not illustrate within-individual differences effectively. We replaced Fig. 5b with a baseline-corrected comparison. Raw traces have their merit though, so we still keep the original Figure 5a in mm to fully illustrate the baseline and trace separation between the repeat-only and repeat-with-recall conditions.

We thank the reviewer for pointing us to the time-series analysis method in Zhao et al., 2019. Such analyses can accurately delineate the point of separation between two signals, provided there is tight control of the timing of successive events. As pointed out in the limitation section, we did not ask participants to respond as quickly as possible, on purpose, because applying time pressure to the memory task would further affect the pupillary response. The pupil trace after the onset of the response therefore contains individual differences in speech production timing, speech production confounds and memory processing, and the onset of the trace separation seems to us too entangled with these cognitive events. Instead, we used a time window of 1.5 s, based on the successful memory encoding effect observed in Kucewicz et al., 2018 with similar stimulus length and timing, which seems more relevant here.

Nevertheless, we intend to employ a more sensitive time-series analysis method, as in Zhao et al., 2019, in future studies with better timing control. Specifically, in a follow-up experiment (as yet unpublished), we asked participants NOT to repeat but only to recall, so that we had a 'clean' trace without the effect of speech production and its variable timing. We will then be able to tackle the onset of separation more seriously. Once again, our current aim was to demonstrate, first at a qualitative level, that a bigger PPD is not always a bad sign. Many readers would find this result highly controversial because, in listening-only situations, a bigger PPD is indeed a reliable sign that speech was misunderstood (Fig. 5a).

8. [line 410] Please state the method of the correlation. e.g. Spearman or Pearson?

We used Pearson correlation. Thanks for pointing this missing information out. We have further specified this detail (line431).

***Other comments***

1. [Fig.5] Please use different colours for different conditions’ shaded area. The current colour and pattern makes it hard to tell which area belongs to which condition.

Clarified. We have used both lineshape and colour to differentiate word recall and word repeat performance.

2. [Fig.5] How are these time windows determined?

Time window 1 was determined according to the practice in past studies that looked at the listening effort associated with listening to sound stimuli (line238). Time window 2 was determined by using a similar waiting period (1s) after the event of interest occurred (line273). We hypothesised that time window 2 corresponded to participants rehearsing and encoding the perceived word into working memory storage. Also, Kucewicz et al., 2018 showed that the effect of memory encoding occurred 1s after the visual word presentation, corresponding to the time window 2.

But we noted in the Limitation section that many factors could interfere with the pupil trace in this time window (speech production, individual differences in the timing of responding/memorisation, etc.). Therefore, only the comparison between the repeat-with-recall and repeat-without-recall conditions is meaningful, because the two differ only in rehearsal/memorisation.

3. [Figure 6] The solid curve looks over-smoothed compared to its shaded area. Comparing the relationship between the shaded area and the solid curve in Figure 5, the shaded area in Figure 6 are extremely spiky. Maybe additional smooth was accidentally applied to the group average? If so, the authors should justify the difference in the analysis pipeline.

We thank the reviewer for pointing out this missing information. Indeed, because there were fewer traces to average over and large individual differences in recall behaviour, the pupil traces were more variable (shaded areas), which we wished to show for transparency. To plot the mean trace (black solid line), however, we used the default GAM smoothing in the ggplot2 package to highlight the general trend of the pupil data.

We have added this detail in the legends of Figure 6.

4. [line 394] Possibly I misunderstood the content, but could the authors please provide more details or make it clearer: how is the mean pupil diameter computed here? Over which time window? Also, to support the statement this line, could you also plot the mean pupil diameter against the number stated words and test the correlation like Figure 7?

We described the procedure to calculate the mean pupil diameter on line333. But we appreciate that the clustered organisation of the method section was unhelpful. We have re-organised the methods and results section so that they are closer. Hopefully this has now eased the comparison.

Upon closer inspection, we realised that we did not discuss this result later and that, given the disruptions from verbal responses noted in the Limitation section, we could not interpret it with confidence. We have now deleted this report to simplify the results.

5. [line 417] Nice to see that the authors noticed the relationship between these metrics with the age as ageing is a known factor in pupil diameter. As the authors stated “Note that these correlations should be considered with caution due to no corrections” I was expecting to see these results with the age being regressed out.

We thank the reviewer for pointing this out. We have now added the results of these correlations after regressing out the effect of age (line 461).
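For readers less familiar with the procedure, regressing out age amounts to a partial correlation: each metric is first regressed on age, and the Pearson correlation is then computed on the residuals. The sketch below illustrates this with invented values; the variable names and data are hypothetical and are not taken from our dataset.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(y, x):
    """Residuals of y after an ordinary least-squares fit on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def partial_corr(x, y, covariate):
    """Correlation between x and y after regressing the covariate out of both."""
    return pearson(residuals(x, covariate), residuals(y, covariate))

# Invented example values (not our data): age in years, baseline pupil
# diameter in mm, and number of words recalled per block.
age = [20, 25, 30, 35, 40, 45]
baseline_mm = [5.1, 4.9, 4.6, 4.4, 4.0, 3.9]
words_recalled = [6, 7, 5, 6, 4, 5]

r_raw = pearson(baseline_mm, words_recalled)
r_age_removed = partial_corr(baseline_mm, words_recalled, age)
```

The residual-based value is equivalent to the textbook partial-correlation formula computed from the three pairwise correlations, which offers a quick sanity check.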

6. [line 472] “the recall task probably because it was more interesting and rewarding”. It’s unclear how the recall task can be rewarding here. Did the authors apply a special bonus to the repeat with recall condition?

We did not give special bonuses. But for NH listeners, the word recognition task, even at 0 dB SNR, was not very difficult: intelligibility might seem relatively low (at 75% on average) due to the lack of semantic context, but their intelligibility of sentences was 90% at 0 dB (not shown in this article). The real challenge, and the rewarding part of the experiment, was to recall the words, because participants knew there were always 10 words. Indeed, we saw individual differences in how they reacted to their recall performance: some were curious about how many words they recalled correctly and some did not care.

Reviewer #2: I think this is a pretty good paper, in principle. The work seems to have been well done, and it addresses some very interesting and timely questions. That said, I found the original manuscript relatively difficult to read, not because of any problems with language but because I think it needs one more thorough revision now that the authors have successfully thought through all of their ideas - there are good ideas in here, but they're jumbled up still, and hard to find. Thus, most of my suggestions relate to writing and organization.

We thank the reviewer for the detailed and useful feedback on the manuscript. We have addressed the concerns as follows:

Abstract

When stating “different signal to noise ratios” please state what these are. Otherwise it makes no sense to state that “ (PPD) was bigger in the 0dB versus other conditions” because we have no idea whether those were all negative, or all positive, or by how much compared to 0 dB.

We have added the specific SNRs.

It’s not clear to me from the abstract how baseline PPD can follow a growth pattern. I assume this means across trials or perhaps across words within a trial, but that is not clear.

We have clarified the growth pattern is within a trial.

If PPD increased, how did it then decrease? This whole paragraph could be rewritten more clearly.

We have used the word ‘variation’ instead of ‘growth’ to avoid ambiguity.

No need to refer to “concur with the recent literature” in the abstract. It's a statement that is too vague to be useful.

We have deleted that section.

11 speech recognition is similar to what?

We have clarified the sentence (line10).

48-52 weird change of tense in the middle of this sentence. (“needs to decode... pondering...”)

We have corrected the tense here (line50).

93+ I think the section on LISTENING condition should be a separate paragraph. Also, I’m not sure it makes sense to go into this much detail about the study before the methods section. Right now, I’m left wanting to know more – for example, was the SNR variation blocked or was it randomized within trials? So either more info is needed here, or possibly less.

We have simplified this section (line 96).

95-97 this sentence seems out of place (“The effect of SNR ...”)

We have deleted this sentence to simplify this section (line95).

98-99 the issue of pupil traces for recalled vs. not recalled tokens seems quite a bit different from the topics that have been discussed so far. This is a big issue in the memory literature, and you probably need to take a look at some of that to restructure your introduction to better reflect this emphasis and the scientific context in which it fits.

We agree with Reviewer 2 that the memory literature is important here, and truly appreciate the suggestions of relevant memory studies. We have included them in the discussion section (line 723), because they help interpret the pupillary response in the repeat-with-recall condition. We think that the discussion section is more suitable for elaborating on the effect of memory, and we reserve the introduction section for highlighting the lack of literature on concurrent cognitive load during listening tasks.

105 I’d leave out “according to past studies” – you’re doing your own work here.

Done.

109-113 This prediction needs to be broken up, and perhaps re-thought. If baseline PD is expected to increase with memory load then it’s not obvious to me that PPD will also increase – if you’re raising the floor over the course of the 10 items in the trial, does it make sense to assume that the peak from that increasingly higher floor will *also* be increasing? Also, How do you think the increased memory load-related increasing baseline will interact with the previously mentioned decrease in PD as a function of time-on-task?

Absolutely, all of these are perfectly valid and important questions, but it has not been easy to articulate these hypotheses before we started this project because of the inherent entanglement between baseline and PPD. This is a point also raised by Reviewer 1 for good reasons: people have a priori assumptions with regard to how PPD should behave if baseline goes in one direction or another. Typically, the idea that PPD will be necessarily restricted if baseline is too high does not hold in many cases. In response to R1, we mention the role of age or luminance as factors that completely disrupt the presumed inverse relationship between the two. Therefore, instead of revising our hypotheses now that we know the results to better differentiate baseline-related hypotheses from PPD-related hypotheses, we think it is more transparent and honest to present the hypotheses as we would have done in a pre-registered format. And what we knew two years ago was that the PPD would increase with adverse SNR, and that the baseline would increase incrementally as listeners rehearse words in their mind within a block, eventually leading to a task difference when averaging over successive trials of a block.

114-116 It is not clear whether this “rise” in PD is referring to the appearance of the pupil diameter trace in the course of a single stimulation, or over the course of a 10-word trial, or over the course of the entire experiment. The issue of time needs to be MUCH clearer throughout the paper.

Sorry for not making this more explicit: we referred to the end of the recall blocks, where listeners were prompted to report to the experimenter all the words they could remember. We clarified this in the hypothesis (line 114), and reminded the reader with similar phrasing in method section 2.6.3.

121-126 It’s common to provide a gender breakdown for participants. But, more importantly, it’s incredibly important to identify the number of participants run in French and the number run in English. This could be a very significant factor and should probably be included in the final analysis [it is analyzed, so definitely mention it here].

We have clarified the gender and language breakdown of the participants (line 122).

128-138 How many words in all? How many lists per condition? This is starting to seem like a rather incomplete Stimuli section.

We have clarified these: three different lists per condition and 480 words used overall (lines 135 and 140).

127-134 Please break down durations by language. Ideally, please list all words in the supplementary materials. Was there any attempt to balance the intelligibility of the different word lists? If not, word list might need to be a factor in the eventual model as well. Also, a reference to “Fournier” words is needed, I don't know what those are.

Indeed, although the word lists are standardised, there might still be differences between them. We have now added word list as a random effect in the models, and kept it only if it significantly improved the model fit according to a chi-squared test. Details have been added to Supplementary material table 2.

We have added the reference to the Fournier corpus (line 132). All words are listed in the cited references; therefore, we have not listed them again in the supplementary materials.
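The model-comparison step can be sketched as a likelihood ratio test between the model without and the model with the word-list random effect. The sketch below assumes the two models differ by a single parameter (one degree of freedom), for which the chi-squared tail probability has a closed form; the function name and the log-likelihood values are hypothetical, not those of our fitted models.

```python
from math import erfc, sqrt

def likelihood_ratio_test_1df(loglik_reduced, loglik_full, alpha=0.05):
    """Compare two nested models that differ by one parameter.

    Returns (chi-squared statistic, p-value, whether to keep the extra term).
    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    statistic = max(statistic, 0.0)  # guard against tiny negative values
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value, p_value < alpha

# Hypothetical log-likelihoods: model without and with the word-list term.
stat, p, keep_word_list = likelihood_ratio_test_1df(
    loglik_reduced=-512.4, loglik_full=-508.1
)
```

With these invented values the statistic is 8.6, so the word-list term would be retained at alpha = 0.05; with a negligible gain in log-likelihood it would be dropped.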

147 Was the calibration done with a pure tone, or with the speech stimuli and (if speech) then with or without noise (and if with noise, then at what SNR level)?

The calibration of the headphones was performed using a 1 kHz pure tone (line 156).

150 I presume “at 14 dB” means “at 14 dB SNR” but that should be made clear

We have clarified this here and at later occurrences.

159 what is the significance of the (0.5s) in this line and the (1s) in the next line?

The 1 s baseline was set following the recommendations in Winn et al., 2018 (between 100 ms and 2 s), and more specifically those for word stimuli in Kuchinsky et al., 2013 and Kuchinsky et al., 2014 (1 s).

The 0.5 s intertrial interval allows the pupil to return gradually to baseline. Although a longer intertrial interval is preferable for pupillometry measures, and might be relevant to longer speech material, it is not ideal for the memory task.

From the procedures section, it’s still not clear whether SNR was constant for a given set of words, or block of trials, or randomized (either across words within a trial, which would be admittedly strange) or across trials. Also, the ordering of TASK and LISTENING condition combinations is not specified.

To clarify, what we call a trial is the presentation of a single word. There is no trial with several words. SNR was kept constant for a block of 10 trials. This was necessary to evaluate the effect of SNR when memory was involved. But the rest was fully randomized. We have reorganised and clarified the details in the procedure section (line167).

162 Adjustment of the Target level in a constant noise level means that in the highest SNR condition the Target was 14 dB louder than in the 0 dB condition, right? This seems like an extreme difference, even without the presence of noise. What was the noise level alone? More importantly, what was the level of the signal in the QUIET condition – the same as that in the 0dB condition, or the 14 dB condition, or something else? Given that autonomic responses can be influenced by absolute level, what procedures were implemented to ensure that the differences observed were not simply due to differences in overall signal level?

Yes, the target level was 65 dB at 0 dB SNR, 79 dB at +14 dB SNR, and back to 65 dB in quiet. We have further specified the speech level at line 161.

Fixing the masker noise level was chosen following the reference cited in the text (Ohlenforst et al., 2018) to prevent participants from guessing the upcoming block difficulty. This choice was thus preferable to fixing the target level at 65 dB, especially with a large range of SNRs in the same experiment, similar to the case of Ohlenforst et al., 2018.

In principle, absolute level should elicit an automatic response, but this is not necessarily the case during a listening effort task, which can dominate over the automatic response. Also, the impact of signal level was not seen in Ohlenforst et al., 2018.
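The level arithmetic can be written out explicitly. The sketch below assumes a masker fixed at 65 dB, which is inferred from the target being 65 dB at 0 dB SNR; the constant and function names are hypothetical.

```python
# Assumption: masker level fixed at 65 dB, inferred from "65 dB at 0 dB SNR".
NOISE_LEVEL_DB = 65.0

def target_level_db(snr_db, noise_level_db=NOISE_LEVEL_DB):
    """Target speech level when the masker is fixed: target = noise + SNR."""
    return noise_level_db + snr_db

# Target level for each SNR condition used in the experiment.
levels = {snr: target_level_db(snr) for snr in (0, 7, 14)}

# Quiet condition: no masker is presented; as stated above, the target
# returns to 65 dB.
levels["quiet"] = 65.0
```

Under this assumption the three noise conditions correspond to target levels of 65, 72 and 79 dB, so the largest level difference between conditions is 14 dB.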

169 Were there procedures for dealing with homophones? Were the transcribers well-versed in the set of words being used? Could the experimenter/transcriber see the intended word?

No. Remember that the listener did not type the words (we could not do that because the eye gaze had to stay in the middle of the screen). The experimenter could see the correct word in the Matlab interface, so they could type the correct word rather than one of its homophones.

We have further specified this in the procedure (line180).

175 what was the time delay before the word RECALL appeared?

We thank the reviewer for pointing this out. We have clarified this delay to be 2s (line187).

184 Was a “block” one set of 10 words, or 3?

Each block contains 10 words. We have clarified this (line195).

189-192 Please provide degrees of freedom for the t-tests.

We have added more details on the t-tests performed (line 202), here and at other occurrences.

203 missing “were” before “retained” (or change to “remained”)

Corrected

1.4.1+ It would be easier to understand the statistical analyses if you would provide the actual model, either in lme4 syntax (easy to do in this case) or in standard mathematical notation. This is quite common nowadays and could be put into supplementary materials if space is an issue.

We have included information on the best-fitting models in Supplementary material 2.

232 What does “aggregated per word” mean? Averaged?

Yes, we averaged the pupil diameter at the same point in time. This was following the method in past studies and ensured PPD was more robust (Zekveld et al., 2010; Zekveld et al., 2013; Zekveld et al., 2014).

Also, in this context, it is confusing to say “per word” if you actually mean (as I think you do) “per word position” (i.e. 1-10). Aggregating “per word” seems impossible if listeners never heard the same word twice as implied in the methods section.

Thank you for this clarification! We have corrected the wording from ‘word’ to more precise ‘word position’ here and in other occurrences.

236 Please clarify – these were the “aggregated” traces, right? 1 trace per subject per word-position?

Yes, the traces were aggregated/averaged per participant per WORD POSITION.

Also, given that the actual words were presumably of different durations, I think using absolute time (i.e. in seconds) is a bad idea, because it could blur effects that are related to the duration of the word. You might consider normalizing all times before any averaging is done. Or fit a curve to each individual trace and then compute the peak and the latency from that, then do averages over those values.

Very nice comment. Indeed, this is a concern for longer speech materials, which typically have bigger duration SD and longer peak latency. In fact, we are doing exactly the same recommended method on sentence stimuli in a manuscript under preparation (sentences at 0, 7, 14 dB and quiet, of different durations).

But this essentially does not apply here to individual words. In the current experiment, the SD of the duration of the monosyllabic words is small (0.09 s), and not large enough to disturb the time scale of cognitively driven pupil responses (~10 Hz).

Perhaps more to the point, the normalisation procedure Reviewer 2 suggested is problematic because it distorts time with different ratios depending on the time window considered. There are fixed windows before (baseline) and after the word stimuli (waitpeak) where time should not be compressed/stretched. Also, the normalised time units are less interpretable than the absolute units in seconds, which becomes an issue in some applications of pupillometry.

In the current method, we simply minimized the effect of variable word duration by aligning the traces at the onset of the response before aggregating/averaging over repetitions, so that the time window of interest always captured the peak pupil response.
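The effect of aligning before averaging can be illustrated with a toy example. This is only a sketch: the function name, the traces and the event indices are invented, and the window sizes are given in samples rather than seconds.

```python
from statistics import mean

def align_to_event(trace, event_index, n_before, n_after):
    """Cut a fixed window around an event (e.g. onset of the response prompt)."""
    start, stop = event_index - n_before, event_index + n_after
    if start < 0 or stop > len(trace):
        raise ValueError("window extends beyond the recorded trace")
    return trace[start:stop]

def average_traces(traces):
    """Average equal-length traces sample by sample."""
    return [mean(samples) for samples in zip(*traces)]

# Two invented baseline-corrected traces with the same response shape, but
# the response prompt occurs later in the second trial (a longer word).
trial_a = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
trial_b = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0]
prompt_onsets = [2, 4]  # sample index of the response prompt in each trial

# Averaging from trial start smears the peak across samples.
unaligned_peak = max(average_traces([trial_a, trial_b]))

# Aligning to the prompt onset first preserves the peak in the average.
aligned = [
    align_to_event(trace, onset, n_before=1, n_after=3)
    for trace, onset in zip([trial_a, trial_b], prompt_onsets)
]
aligned_peak = max(average_traces(aligned))
```

Here the aligned average keeps the full peak, whereas averaging from trial start halves it, which is why the analysis window is anchored to the response prompt rather than to word onset.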

243-26 I like the comparison between the “block baseline” and the “word position baseline”.

Thank you!

252-291 I find it a little confusing (and quite demanding on my working memory) to present all statistical analyses prior to any results. I think this section would be helped by using sub-headings and, again, by providing the actual models in either lme4 syntax or as an equation. Alternatively, these paragraphs could be put as the initial paragraph of the respective results (sub)sections.

We thank the reviewer for the advice on improving the presentation of the results. We have now re-organised the methods and results sections so that readers can compare the results directly after reading the corresponding methods. This organisation is more efficient and easier to digest.

265-282 I think the discussion of the second time window suggests that what you really should be doing is looking at the entire pupil diameter curve from the onset of the word-position to 1.5s after its offset. See Winn & Moore (2018) for a really clever way of breaking such long(ish) traces down for analysis.

Yes, their time-series approach is interesting, but once again it applies better to sentences, and to situations where neither a memory task nor verbal responses are involved. Reviewer 1 asked a related question about determining the exact point at which the pupil traces separated. But with the current design, in which participants took their time to report words with no time constraints and time window 2 contained verbal responses (see Limitation for more details), it would not be particularly useful.

Indeed, a curve analysis (growth curve analysis) would supply more information than the feature extraction analysis (extracting PPD and peak latency), given its rich information on the variation of pupil size over time. But the feature extraction method still captured the most prominent task-evoked pupillary response, and with our trend analysis we could also gain information on its variation over time. While the slow, cognition-related pupil variation (around 10 Hz) has enough time to unfold in sentence stimuli, it has little time to do so within the duration of single words.

The key point we wish to emphasize is that a larger PPD is not always a bad sign, provided that the task involved memory. The exact timing at which this differentiation happens is not particularly interesting and would likely only be determined with a design that places heavy constraints on the manner with which listeners are allowed to report the words recalled (which would itself hinder the memory task).

Figure 2 a minor point, but it seems needlessly complicated to present the results with different Y axes representing essentially the same thing - % correct words repeated vs. Average number of words remembered (presumably out of 10?)

Not at all: the % words repeated is about intelligibility, assessed when participants repeat back the word right after it was played. The number of stated words recalled is about memory, assessed at the end of a block. They are completely different DVs.

We have clarified this in the subtitle of the plots in the new figure 2 and its legend.

323-418 The results section is incredibly hard to read. Please revise to put things into complete sentences. It’s not just about presenting a bunch of equations here, you need to organize them in such a way that the reader can understand what you’re talking about. At this point I can’t really. Please give values (i.e. don’t just tell me baseline PD was bigger in one level than another, tell me what the value was for each level). It’s confusing to read that baseline pupil diameter was “bigger than 14 dB” (line 332) when to my knowledge we don’t typically measure pupil diameter in decibels. Yes, I can figure out what you mean, but this is currently written as it might be written in a lab notebook, for personal consumption, not as it should be written for scientific communication. And in some cases, it’s opaque: in line 335, is the (0.2 mm) referring to the absolute diameter, or the amount by which it is bigger? Even at the end “due to no corrections” is practically txting the results...

Also, I think the trend analyses could be discussed separately. Basically, right now it seems as if you're more or less just listing the results of all the tests you did, maybe in chronological order or perhaps loosely organized (?) according to dependent measure. Please consider some way of organizing the results in a way that facilitates the reader’s understanding of why you conclude what you will eventually conclude or, at a minimum, that reflects the issues that you determined were relevant to investigate as described in the introduction. Ideally, the results section should be presented in the same order as the discussion section, which should walk the reader through the data toward the eventual theoretical claims that you want to make (and which should reflect the relative importance of topics as discussed in the introduction). Right now I honestly can’t figure out what data point(s) are particularly relevant or irrelevant, it just sort of devolved into a giant mass of statistical tests presented without obvious organization.

We thank the reviewer for the feedback on the organisation of this section. We have addressed the issue by:

1) grouping each analysis method directly with the corresponding results, for the behavioural data, the pupil data and the subjective ratings.

2) adding meaningful subtitles for each chunk of analyses or results to indicate the purpose of the analysis.

3) improving the wording so that values are reported more accurately.

[Discussion section is also a bit confusing - mostly due to digressions, though]

We have removed a few digressions to stay focussed on the organization outlined at the start of the discussion, namely 1) interference between concurrent tasks, 2) pupil dynamics, 3) predictive power of pupillometry, and 4) individual differences.

Figures 3 & 4 Looking at the traces in Figures 3a and 4a it seems apparent to me that peak pupil dilation may not be a useful metric here. Except in the first word position there really isn’t much of a *peak* of any sort visible in 4a. And you can see that when those word positions get averaged together (for the images in 3a) any potential peakedness disappears. So why not use average PD or something like that? I think that would tell the story at least as well, and would be less subject to potentially weird micro-effects such as the weird flip of the black and red dots in positions 5, 8, and 9 of figure 4c.

That is a fair point: average pupil dilation within a restricted window would likely do just as well. The question then becomes how narrow this window should be, which would certainly be open to debate and criticism. By opting for the PPD, we took a more traditional approach that has the advantage of being directly comparable to methods used for sentences.

Also, the Y axis of 3c and 4c should somehow indicate that this is change from baseline.

Yes, we corrected them with “baseline-corrected PPD”.

In general, I’d recommend considering a very different way of doing this analysis, perhaps along the lines of Winn & Moore (2018).

As mentioned earlier, this might not apply directly to a memory task with single words.

340 should this be 3b or 4b?

Figure 4b shows the trend analysis from the 1st to 10th word.

432-443 references needed here to Pichora-Fuller et al. 1995, Surprenant (1999, 2007).

461-506 I really struggled with this discussion. I think the long and detailed references to the noise reduction work are distracting and superfluous. So lines 436-455 could be reduced to just lines 451-455.

We thank the reviewer for pointing out the relevant literature here. We have removed the discussion of noise reduction so that only studies relating to recall performance are included. We also added references to Pichora-Fuller et al., 1995 and Surprenant, 1999.

Also, this discussion brings up the question of what, exactly, pupil dilation tells us. Arguably, it could provide information about the overall level of engagement of cognitive resources (I think that’s what the baseline measurement is supposed to get at, here) in these two conditions, as well as the moment-by-moment allocation of those resources during part of a task (encoding, repetition, recall). Given that you *have* pupil dilation data, I think this needs to be addressed somehow, before going into details of what dual task paradigms may or may not tell us.

And, finally, what do you conclude? I appreciate that there are multiple possible interpretations, but you've thought about this far more than most. Could you lead the reader from this apparent bafflement into something that we can be more satisfied with?

We thank the reviewer for pointing this out. We understand that although we explored different possible interpretations, we did not do enough to provide summarising take-home points that would ease the digestion of all the messages. Nor did we do enough to connect the behavioural and pupillometry results into a linked picture.

To address this, we have emphasised the findings and the relevance of our experiment at the end of both the behavioural (line 567) and pupillary results (line 682, line 752). We have also strengthened the link between our behavioural and pupillary findings (line 520).

488 Could you examine age differences in your data? What would you predict to see either in terms of behavior or pupil dilation if people are prioritizing things differently?

We further tested and discussed the effect of age on individual performance in the later section titled Individual Differences, and added some hypotheses to test (line 694). There are, however, no easy answers to the question of task prioritisation. One could imagine that someone who prioritizes the repeat task would show a smaller PPD (and better intelligibility), along with a lower baseline (and worse recall). But given the observed saturation of the PPD, we would speculate that a listener prioritizing the recall task is also likely to exhibit a small PPD (and poorer intelligibility), along with a higher baseline (and better recall). Whether age would incite the former or the latter pattern is itself largely unclear.

490 what does it mean that the recall paradigm is from previous studies? Which recall paradigm?

We have added the references to the two recall paradigms mentioned earlier (line553).

498 Define SWIR acronym.

Clarified here by referring to the references (line562).

509 What second hypothesis? There are so many hypotheses swirling around by now I’ve lost track of which one is which. Please restate.

Clarified.

519 these references did not all use the same speech perception task. Clarify.

We have re-organised the references to be clear (line589).

523 You don’t really have data showing any greater effort of your task over other tasks.

Indeed a direct comparison is not available here. We have changed the wording (line592).

540 The lack of position effect is extremely unusual for a serial recall task and needs to be discussed in much more detail. It should also be presented in the results. This is one of those memory effects that is so basic it’s taught in intro psych textbooks... I would very much like to see a graph of word recognition and recall by word position. I have great difficulty imagining that there wasn’t some kind of recency effect at least, if not also a primacy effect, with a 10-item list to be recalled.

Certainly, we replicate the recency and primacy effects just as expected. We had not reported these results in the main body because we did not observe an interaction with SNR conditions, therefore, it was not directly tied to the main purpose of the experiment. But this is surely a finding that would reassure the reader, so we have decided to include them in a third Supplementary material.

585-621 It seems to me that the best explanation for smaller growth of the PPD is that the baseline is increasing.

So the limit (probably physiological, based on light levels) is imposed not in terms of how much the pupil can dilate, but in terms of how much of a dilation it will reach. In other words, illumination, which you held constant, may have imposed an upper limit on pupil dilation, such that as the baseline creeps up with increasing memory load in the recall condition, or creeps down with increasing habituation in the repeat-only condition, you get the difference between the two gradually shrinking (in the recall condition) or increasing (in the repeat only condition).

No, illumination did not impose an upper limit on pupil dilation. We know this because the pupil did jump considerably as soon as participants were prompted to report as many words as possible. This is a huge effect compared to the PPD measured by the end of the block (when it seems like it’s saturating), meaning that there was definitely space for the pupil to dilate more. This phenomenon seems to have been overlooked by Reviewer 1 as well, so we highlighted this effect more explicitly (line 682). In other words, the apparent saturation was of cognitive origin, not mechanical. Also, refer to our response to the first comment by Reviewer 1 on the same idea.

623-626 Word choice seems problematic. What does it mean to “hold predictive power” or to be “responsive for recall performance”? Say what you want to say in a simple way.

We have simplified wording.

References that should be incorporated into a revision

Goldinger, S. D., & Papesh, M. H. (2012). Pupil dilation reflects the creation and retrieval of memories. Current directions in psychological science, 21(2), 90-95.

Kucewicz, M. T., Dolezal, J., Kremen, V., Berry, B. M., Miller, L. R., Magee, A. L., ... & Worrell, G. A. (2018). Pupil size reflects successful encoding and recall of memory in humans. Scientific reports, 8(1), 1-7.

Miller, A. L., Gross, M. P., & Unsworth, N. (2019). Individual differences in working memory capacity and long-term memory: The influence of intensity of attention to items at encoding as measured by pupil dilation. Journal of Memory and Language, 104, 25-42.

Pichora-Fuller, M. K., Schneider, B. A., & Daneman, M. (1995). How young and old adults listen to and remember speech in noise. The Journal of the Acoustical Society of America, 97(1), 593-608.

Surprenant, A. M. (1999). The effect of noise on memory for spoken syllables. International Journal of Psychology, 34(5-6), 328-333.

Surprenant, A. M. (2007). Effects of noise on identification and serial recall of nonsense syllables in older and younger adults. Aging, Neuropsychology, and Cognition, 14(2), 126-143.

Winn, M. B., & Moore, A. N. (2018). Pupillometry reveals that context benefit in speech perception can be disrupted by later-occurring sounds, especially in listeners with cochlear implants. Trends in hearing, 22, 2331216518808962

Thank you for these references, which we included.

Attachment

Submitted filename: Response_to_Reviewers.pdf

Decision Letter 1

Claude Alain

19 Oct 2020

PONE-D-20-12674R1

Disentangling listening effort and memory load beyond behavioural evidence:

Pupillary response to listening effort during a concurrent memory task

PLOS ONE

Dear Dr. Zhang,

Thank you for submitting your manuscript to PLOS ONE. Your revised manuscript has been reviewed by the original reviewers. One is satisfied with your revision while the other requests further clarification. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 03 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Claude Alain

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revised version of the manuscript gained on clarity and furthermore improved in its readability. I have read the authors’ responses and I am happy with most of them. However, two of my major concerns —which are also actually the most important major concerns of my first review— still require authors’ response.

I also checked the raw pupil data the authors uploaded. It’s great, however, it’s unclear if each row represents a time sample. To make the analysis reproducible, could you please also add the time coordinate for each entry?

——————

# MAJOR CONCERN #1:

## REVIEWER 1 IN ROUND 1:

1. The possibly most important finding of this study is the fact the PPD (peak pupil diameter) tends to be smaller in the last few words in the repeat-to-recall condition. The authors were very surprised by this result and tried to interpret it by comparing with the previous listening effort studies like Zekveld et al, 2019. However, the explanation the authors offered in the discussion was extensive but not satisfying. It has been well-known that PPD is not only related to the effort or load but also strongly related to its baseline; the larger the baseline, the smaller the PPD. Figure 4b clearly showed that the large baseline is the case. Thus, a simple explanation for this result is that pupil simply saturated in the repeat with recall condition and the pupil simply cannot expand further in the presence of additional words and responses. If this is the case, the result is not surprising at all.

To exclude this possibility, the authors should consider running further analysis (e.g. regress out the effect of baseline from PPD) or conducting additional experiments to show that pupil still CAN dilate further in the repeat with recall condition. If these cannot be done, the authors should at least discuss it in the discussion. The saturation could be not only due to the mechanical limitation of the muscles controlling the pupil diameter but also because pupil diameter is strongly correlated with the norepinephrine activity in the LC system. Since the authors are aware of the link between pupil diameter and LC-NE system as this was briefly mentioned in Introduction (line 36), they should also take this into account in the discussion.
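As an illustration of what "regressing out the effect of baseline from PPD" could look like, here is a minimal sketch in plain Python. It is not the analysis used in the manuscript: the function name `residualize_on_baseline` and the toy numbers are hypothetical, and a simple ordinary least squares fit across trials is assumed (a mixed model with subject-level terms would be a natural alternative).

```python
def residualize_on_baseline(ppd, baseline):
    """Fit ppd = intercept + slope * baseline by ordinary least squares
    and return (slope, intercept, residuals).

    The residuals are the part of PPD that is not linearly
    predictable from the baseline.
    """
    n = len(ppd)
    if n != len(baseline) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_b = sum(baseline) / n
    mean_p = sum(ppd) / n
    sxx = sum((b - mean_b) ** 2 for b in baseline)
    if sxx == 0:
        raise ValueError("baseline has no variance; slope is undefined")
    sxy = sum((b - mean_b) * (p - mean_p) for b, p in zip(baseline, ppd))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_b
    residuals = [p - (intercept + slope * b) for p, b in zip(ppd, baseline)]
    return slope, intercept, residuals


# Hypothetical toy values in mm (NOT data from the study).
toy_baseline = [3.6, 3.8, 4.0, 4.1, 4.1]
toy_ppd = [0.20, 0.15, 0.10, 0.06, 0.05]
slope, intercept, residuals = residualize_on_baseline(toy_ppd, toy_baseline)
```

Any effect of condition or serial position would then be tested on `residuals` rather than on the raw PPD. By construction the residuals are uncorrelated with the baseline, so an effect that survives cannot be a linear by-product of the baseline level.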

## Authors:

We thank the reviewer’s contribution to this interpretation of the results.

We agree that baseline and PPD are correlated mathematically because (X-Y) and Y will always be correlated with R² = 0.5, assuming X and Y are completely random. Past literature has also demonstrated this correlation. However, the exact relation between baseline and PPD during a hearing or cognitive task depends on the underlying cause. For instance, old age induces a smaller baseline and a smaller PPD due to physiological constraints and changes of activity in the peripheral and/or central nervous system (Piquado et al., 2010; Kuchinsky et al., 2016; Wang et al., 2018). Lower luminance induces a bigger baseline but a smaller PPD due to the ‘gripping’ of the parasympathetic system (Wang et al., 2018). Therefore, it is unclear what direction this relation between PPD and baseline should take in a task with concurrent listening and cognitive demands. What is surprising in our results is that we initially hypothesised that PPD might increase with more difficult SNR and more items to retain in memory, but we see instead that the pupil dilation ‘capped’ during the listening section (before the word recall section). However, looking at the baseline dynamics lets us partially understand the cause of the ‘capping’. This highlights the importance of looking at both the PPD and the baseline in future experiments that involve more ecologically realistic tests.

We share with Reviewer 1 the desire to further disentangle baseline from PPD by either regressing out the effect of baseline or showing that the pupil can still dilate further in the repeat with recall condition. The first approach, however, is problematic as long as we cannot distinguish the conditions where baseline and PPD are negatively correlated from those where they may be positively correlated (whether this is seen within or across subjects). So, we opted for the second approach: while a sort of pupil saturation was present during the listening and encoding section from 1st to 10th word, the pupil increased at the onset of recall on average by 0.3mm! Reviewer 1 did not realize this finding, so we made it more explicit in the article: the ceiling of the pupil during the recall blocks cannot be due to mechanical limitation of the muscles controlling the pupil diameter, because right at the end of the block, the pupil diameter rose considerably, an effect equivalent to six times the average PPD at the 10th word. Therefore, it is clear that the pupil ceiling during listening and encoding was not at all due to mechanical constraints but originated from a cognitive resource allocation strategy. The best interpretation we can offer – and that we discussed – is that listeners would reserve their resources during the 1st to 10th word in order to retrieve the words during the recall section.

## REVIEWER 1 IN ROUND 2:

Sorry for being very fussy about this. This has been surprisingly under-addressed in the literature. As you are aware of it now, please properly discuss it in the manuscript and point out that, although it’s less exciting, it is a reasonable explanation of your result.

(1) I am not convinced by the authors' refusal to regress out the baseline from PPD on a trial basis. As shown in the first paragraph of their response, the authors clearly understood the concern about the correlation between baseline and PPD. As such a correlation potentially exists and explains the key result, it should be carefully examined and reported, because it potentially “fully” not just “partially” explains the key finding here. No matter whether it’s a positive or negative correlation, no matter whether it’s within- or across- subjects.

(2) The authors’ response to my second approach is not satisfying either. Remember, the key finding here is the PPD (peak pupil diameter) tends to be smaller in the last few words in the repeat-to-recall condition. The authors need to examine if this is due to the pupillary saturation in the last few words. In other words, the authors need to show that in the last few words, the pupil can still dilate more flexibly just like in other conditions. The dilation from 1st to 10th word (as shown in figure 4b) does not solve the concern at all. Actually, based on figure 4b, the baseline reached a plateau between 4.0-4.1mm after the 6th word, strongly suggesting that the saturation is the case.

(3) Moreover, I am not convinced by the authors’ statement that “The best interpretation we can offer – and that we discussed – is that listeners would reserve their resources during the 1st to 10th word to retrieve the words during the recall section.” This statement is strong, but the link between “small PPD in the last few words” and “reserving the resource” is weak to me. Please elaborate on it.

——————

# MAJOR CONCERN #2

## REVIEWER 1 IN ROUND 1:

5. [line 240] How was PPD computed here? Was it extracted from each trial and then averaged within each subject? Or was PPD directly extracted from each subject’s average pupil diameter response?

## AUTHORS: We thank Reviewer 1 for this comment, which we have repeatedly heard while presenting these results at conferences. It is a matter of constant debate. For sentences, the pupil dilation is more stable than it is for individual words, and thus arguably, one might want to extract PPD directly from individual sentences. We found that this did not apply well to individual words. So, we opted for the PPD taken from the averaged traces. We first performed the baseline correction to subtract the baseline of each trial from the pupil trace. Then traces were aligned by the onset of the response prompt and aggregated per listener per condition. PPD was then calculated at this aggregated level, instead of the trial level. This method was chosen in line with past studies and ensured PPD was more robust (Zekveld et al., 2010; Zekveld et al., 2013; Zekveld et al., 2014).

We have re-organised the method and result sections to clarify the detailed procedure (line280).
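The difference between the two ways of extracting PPD discussed in this exchange can be sketched as follows. This is an illustrative example only, not the authors' pipeline: the function names, the choice of the first `n_baseline` samples as the baseline window, and the toy traces are all assumptions.

```python
def baseline_correct(trace, n_baseline):
    """Subtract the mean of the first n_baseline samples from the whole trace."""
    baseline = sum(trace[:n_baseline]) / n_baseline
    return [sample - baseline for sample in trace]


def mean_trace(traces):
    """Average equally long, already aligned traces sample by sample."""
    return [sum(samples) / len(samples) for samples in zip(*traces)]


def ppd_aggregated(traces, n_baseline):
    """Baseline-correct each trial, average the traces, then take one peak."""
    corrected = [baseline_correct(t, n_baseline) for t in traces]
    return max(mean_trace(corrected)[n_baseline:])


def ppd_trial_level(traces, n_baseline):
    """Take the peak of each baseline-corrected trial, then average the peaks."""
    peaks = [max(baseline_correct(t, n_baseline)[n_baseline:]) for t in traces]
    return sum(peaks) / len(peaks)


# Hypothetical pupil traces in mm (NOT data from the study); the first two
# samples of each trace stand in for the pre-stimulus baseline window.
toy_traces = [
    [4.0, 4.0, 4.1, 4.3, 4.2],
    [3.9, 3.9, 4.2, 4.0, 3.9],
    [4.1, 4.1, 4.0, 4.1, 4.3],
]
aggregated = ppd_aggregated(toy_traces, n_baseline=2)
trial_level = ppd_trial_level(toy_traces, n_baseline=2)
```

The mean of per-trial maxima can never be smaller than the maximum of the mean trace, so trial-level PPD is at least as large as the aggregated PPD. The gap grows when peaks occur at different latencies across trials or when single traces are noisy, which is one reason the two methods can give different pictures for short stimuli such as single words.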

## REVIEWER 1 IN ROUND 2:

Thanks.

(1) could you clarify what you meant by “this did not apply well to individual words”? How did you determine that method doesn’t apply well?

(2) although it’s ok to choose this method as it has been consistently used by Zekveld lab since 2010, it does not mean that in 2020 we, the pupillometry field, should still ONLY rely on this simple method. So, to demonstrate that your result is robust and replicable, please at least report the result with the trial level PPD in supplementary materials.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sijia Zhao

Reviewer #2: Yes: Alexander L. Francis

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Mar 3;16(3):e0233251. doi: 10.1371/journal.pone.0233251.r004

Author response to Decision Letter 1


3 Dec 2020

Reviewer #1: The revised version of the manuscript gained on clarity and furthermore improved in its readability. I have read the authors’ responses and I am happy with most of them. However, two of my major concerns —which are also actually the most important major concerns of my first review— still require authors’ response.

I also checked the raw pupil data the authors uploaded. It’s great, however, it’s unclear if each row represents a time sample. To make the analysis reproducible, could you please also add the time coordinate for each entry?

We have now uploaded pupil data files with extra columns representing the aligned time sequence. The traces are aligned by the onset of baseline (basAtime), onset of word (wordAtime), onset of word repetition (wordAtime), and onset of recall (speakAtime).


## REVIEWER 1 IN ROUND 2:
Sorry for being very fussy about this. This has been surprisingly under-addressed in the literature. As you are aware of it now, please properly discuss it in the manuscript and point out that, although it’s less exciting, it is a reasonable explanation of your result.

We must not have been clear enough in our first response. It is simply not true that the pupil saturated in the last few words of the lists in the repeat-with-recall condition. We know this because the pupil jumped by 0.3 mm as soon as the recall started. Please refer to Figure 6, which shows the sudden jump and proves that although the baseline in the final few trials may be high (around 4 mm), it does NOT prevent the pupil from dilating further. In this regard, we do not need any further experiments to demonstrate that this is the case, because we have already done so: it is shown in Figure 6. Please also see the attached figure showing the pupil trace 15 s before the onset of recall. Although every 10-word block varied in duration due to different response times, on average 15 s covered around the last 5 words in the list. Please notice how, regardless of how high the baseline and PPD already were in the final few seconds of word listening, pupil size still increased when the recall stage started. It is clear that there are no physical constraints on pupil dilation and that pupil size could still increase at the end of the repeat-with-recall block.

(1) I am not convinced by the authors' refusal to regress out the baseline from PPD on a trial basis. As shown in the first paragraph of their response, the authors clearly understood the concern about the correlation between baseline and PPD. As such a correlation potentially exists and explains the key result, it should be carefully examined and reported, because it potentially “fully” not just “partially” explains the key finding here. No matter whether it’s a positive or negative correlation, no matter whether it’s within- or across- subjects.

The inherent mathematical relationship between PPD (y-x) and baseline (x) is a theoretical relationship. Even with two variables, x and y, that are completely random and independent of one another, x shares 50% of the variance with y-x. Yet, in practice, there are many factors that can exacerbate this link (i.e. make its r² stronger than 0.5) or on the contrary impair this link (i.e. make its r² weaker than 0.5). But we know that this inherent relationship does NOT fully explain the capping of the pupil towards the end of the list, because as soon as recall starts the pupil (which is at a high baseline, 4 mm) can dilate by an impressive amount, 0.3 mm. This is twice the PPD observed by the end of the list. Therefore, evidently, the pupil CAN dilate even when the baseline is high. We need no further evidence that the pupil was not mechanically constrained by this baseline being at 4 mm.
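The 50% figure can be checked with a short simulation. The sketch below is an illustration, not part of the authors' analysis. It assumes that x and y are independent and have equal variance, in which case Cov(x, y-x) = -Var(x) and Var(y-x) = 2 Var(x), so the expected correlation is -1/√2 and r² = 0.5. The function names and the choice of a standard normal distribution are hypothetical.

```python
import random


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_baseline_vs_difference(n=100_000, seed=0):
    """Correlate a random 'baseline' x with the difference y - x.

    x and y are independent draws from the same normal distribution,
    i.e. there is no true relationship between them.
    Returns (r, r squared).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stands in for baseline
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stands in for peak value
    diff = [yi - xi for xi, yi in zip(x, y)]     # stands in for PPD
    r = pearson_r(x, diff)
    return r, r * r


r, r_squared = simulate_baseline_vs_difference()
```

Under these assumptions `r` comes out negative and `r_squared` lands close to 0.5. If the two variances differ, the shared variance moves away from 0.5, which is one concrete way the link can become stronger or weaker in practice.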


(2) The authors’ response to my second approach is not satisfying either. Remember, the key finding here is the PPD (peak pupil diameter) tends to be smaller in the last few words in the repeat-to-recall condition. The authors need to examine if this is due to the pupillary saturation in the last few words. In other words, the authors need to show that in the last few words, the pupil can still dilate more flexibly just like in other conditions. The dilation from 1st to 10th word (as shown in figure 4b) does not solve the concern at all. Actually, based on figure 4b, the baseline reached a plateau between 4.0-4.1mm after the 6th word, strongly suggesting that the saturation is the case.

We would argue that one of the key findings is also the large jump (which is twice the PPD size) that occurs after the 10th word, while the baseline is at 4 mm. So, figure 4b is not the evidence you are looking for; figure 6 is, and it demonstrates that there is NO saturation. In other words, the plateau that is visible during the 10 words does not reflect a physical or mathematical limit, but is of cognitive origin. This is what we discussed extensively (line 638). We also cited a past study showing a similar pupillary response for digit recall tasks (line 654).


(3) Moreover, I am not convinced by the authors’ statement that “The best interpretation we can offer – and that we discussed – is that listeners would reserve their resources during the 1st to 10th word to retrieve the words during the recall section.” This statement is strong, but the link between “small PPD in the last few words” and “reserving the resource” is weak to me. Please elaborate on it.

We keep circling around the same point: as the limitation comes from a cognitive source, there must be cognitive reasons why the pupil does not dilate more during the listening-and-repeat task, and why listeners instead reallocate their resources to the task coming after the 10th word. We elaborated extensively on this topic from line 650 onwards.
——————
# MAJOR CONCERN #2

## REVIEWER 1 IN ROUND 2:
Thanks.
(1) could you clarify what you meant by “this did not apply well to individual words”? How did you determine that method doesn’t apply well?

Very simply, words are short in comparison with sentences, on the order of 700 ms versus 3-4 seconds. As you know, there is intrinsic variability in the pupil dynamics over time. With a longer auditory event (such as a sentence), it is more likely that the pupil will dilate at some point over the next 5 seconds. There may be occasional traces that show a negative peak but they are rare. This is not rare with individual words, as the auditory event is very short and places little demand on the listener (especially here with normally-hearing and relatively young adults). Kuchinsky et al. (2014) used individual words and applied a similar pre-processing method as here, i.e. aggregating across all trials for a given listener and condition. It’s not impossible to analyze the data on a trial-by-trial basis but it implies a good deal of variability in the shape of the pupil response, and any naïve reader may feel skeptical to grant meaning to a PPD when the shape of the pupil trace is largely flat or even going down. So, we simply followed recommended practice (Winn et al., 2018).

(2) although it’s ok to choose this method as it has been consistently used by Zekveld lab since 2010, it does not mean that in 2020 we, the pupillometry field, should still ONLY rely on this simple method. So, to demonstrate that your result is robust and replicable, please at least report the result with the trial level PPD in supplementary materials.

Of course, the fact that something has been done a certain way in the past does not imply that we should follow it. But here, we very much agree with past recommendations and there are strong reasons to do so, namely with regard to the stereotypical shape of the pupil trace during an auditory event. In the literature, there are many studies that first average across all trials of a block, and then calculate baseline and PPD. This is NOT what we have done because we believe there is much information to be gained by looking at the evolution of the pupil dynamics within a block (and have shown many pieces of evidence throughout our manuscript to prove it). We calculated baseline and PPD for each position in the block, but since we had three repetitions of each experimental condition, we could aggregate, to a very small degree, the three traces of each position before deriving our pupillary metrics. In contrast to what Reviewer 1 is asking, we have been asked several times at conferences to show the opposite, i.e. first aggregate the traces through all repetitions of a condition (in our case the 30 trials) and derive baseline and PPD from the average traces. We have done so and have already shown additional analysis methods in the supplementary materials. We think that it would be confusing to add yet another analysis method that would differ very little from the present one.

Attachment

Submitted filename: Response_to_reviewers_round2.pdf

Decision Letter 2

Claude Alain

16 Feb 2021

Disentangling listening effort and memory load beyond behavioural evidence:

Pupillary response to listening effort during a concurrent memory task

PONE-D-20-12674R2

Dear Dr. Zhang,

Thank you for the revision of your manuscript and for your patience in awaiting our response. For some reason, I could no longer reach one of the reviewers, although s/he had agreed to review the manuscript. I have now decided not to wait any longer. We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Claude Alain

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you for the revision of your manuscript and for your patience in awaiting our response. For some reason, I could no longer reach one of the reviewers. I have now decided not to wait any longer and am pleased to tell you that your work has now been accepted for publication in PLoS ONE.

Reviewers' comments:

Acceptance letter

Claude Alain

22 Feb 2021

PONE-D-20-12674R2

Disentangling listening effort and memory load beyond behavioural evidence: Pupillary response to listening effort during a concurrent memory task

Dear Dr. Zhang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Claude Alain

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Alternative method to calculate PPD.

    Results and discussions on the alternative method to perform baseline correction using the averaged pupil trace 1s before the first word in the list.

    (PDF)

    S2 Appendix. Model summary outputs.

    Model parameter estimates and model comparison statistics for the best fitting models. The reference level for the categorical factor LISTENING is 0dB, for the factor TASK is repeat-only.

    (PDF)

    S3 Appendix. Position effect in the word recall task.

    Analysis on the position of the words recalled in the repeat-with-recall task.

    (PDF)

    S1 Raw data

    (ZIP)

    Attachment

    Submitted filename: Response_to_Reviewers.pdf

    Attachment

    Submitted filename: Response_to_reviewers_round2.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
