Abstract
Auditory memory is an important everyday skill, evaluated more and more frequently in clinical settings as the cognitive costs of hearing loss gain wider recognition. Testing often involves reading a list of unrelated items aloud, but prosodic variations in pitch and timing across the list can affect the number of items remembered. Here, we ran a series of online studies on normal-hearing participants to provide normative data (with a larger and more diverse population than the typical student sample) on a novel protocol characterizing the effects of suprasegmental properties in speech, namely investigating pitch patterns, fast and slow pacing, and interactions between pitch and time grouping. In addition to free recall, and in line with our goal of eventually working with individuals exhibiting more limited cognitive capacity, we included a cued recall task to help participants recover specifically the words forgotten during the free recall part. We replicated key findings from previous research, demonstrating the benefits of slower pacing and of grouping on free recall. However, only slower pacing led to better performance on cued recall, indicating that grouping effects may decay surprisingly fast (over a matter of one minute) compared to the effect of slowed pacing. These results provide a benchmark for future comparisons of short-term recall performance in hearing-impaired listeners and users of cochlear implants.
Keywords: short-term memory, speech, prosody, melody, chunking
Graphical Abstract.
Auditory short-term memory is a key skill in everyday life, helping us to take notes, follow directions, or remember what someone said in a conversation. It is also assessed in psychology experiments, where lists of numbers or words are read to a participant, who must repeat them back to the experimenter. This task is meant to quickly evaluate an individual's memory capacity, and is simple to administer (Wechsler, 2007). However, the way the experimenter reads these digits—their intonation, speed, and rhythm—may affect memory outcomes. Though these suprasegmental qualities and their contribution to memory may be a nuisance when it comes to standardized testing, they are interesting for understanding real-world cognition, and may be useful to teachers, musicians, audiologists, and others who wish to impart information effectively. Understanding how pitch and timing variations from word to word in a list may interact with short-term memory can thus inform pedagogical techniques and even explain why people with less access to pitch information, like those with cochlear implants (Moore & Carlyon, 2005), may expend more cognitive effort during a conversation (Pichora-Fuller et al., 2016; Rönnberg et al., 2019).
As free recall measurements are recruited more routinely in the field of Audiology for ecological purposes, such as evaluating hearing aid fitting, and translated for use around the world (see, e.g., Lunner et al., 2016; Micula et al., 2022; Zhang et al., 2021), it becomes increasingly urgent to reconcile any inconsistency in the literature and refine a paradigm with clinical relevance in mind. To fully understand the effect of suprasegmental properties on memory, one might desire a paradigm differentially sensitive to aspects of immediate recall and cued retrieval, since these would provide clinicians and manufacturers of auditory prostheses with different metrics to focus on.
Clive Frankish conducted seminal experiments on this topic in the late 1980s and early 1990s, asking college students to remember a series of spoken digits. He found that manipulating the mean fundamental frequency (F0) of different words, in and of itself, did not enhance recall; rather, it only helped when the F0 pattern suggested a grouping of the words (Frankish, 1989; 1995). This grouping phenomenon has since been demonstrated using a number of characteristics of the auditory signal: pitch, timing, voice, stress, sound location— or even explicit instruction (Bower, 1970; Farrell, 2008; Farrell et al., 2011; Gilbert et al., 2014; Ng & Maybery, 2002; Parmentier & Maybery, 2008; Savino et al., 2014; Savino et al., 2020; Towse et al., 1999). As long as any of these characteristics indicate a grouping, recall is enhanced, and this grouping becomes a cue for memory. According to the work of Frankish, different grouping cues were not additive: having two of them was not better than having one. In ecological situations, of course, speech usually has redundant cues (e.g., both descending F0 and lengthening of a phrase-final word), so without a controlled experiment, it is difficult to disentangle the roles of different grouping cues.
When comparing the relative strengths of pitch and timing in support of memory, there is some disagreement in the literature. For example, Savino and colleagues recently found that the effect of intonation grouping was stronger than the effect of temporal grouping, provided that a natural within-word intonation was used (Savino et al., 2020). On the other hand, a number of earlier papers had found that the effect of temporal grouping is stronger (Gilbert, Boucher & Jemel, 2015; Parmentier & Maybery, 2008; Silverman, 2007, 2010). Some research even suggests that unfamiliar melodic patterns, if they do not indicate grouping very obviously, can instead interfere with recall, acting as a distraction (Purnell-Webb & Speelman, 2008; Racette & Peretz, 2007; Rainey & Larsen, 2002; Wallace, 1994), while random timing variations have no effect (Gorin, 2020). Finally, total presentation time of stimuli is an important factor that is not always accounted for when comparing different cues, although it has been shown that longer presentation times lead to better recall (Kilgour et al., 2000; Unsworth, 2014; Unsworth & Miller, 2021). Thus, the insertion of pauses to create temporal groups, as well as the length of those pauses, can create confounds for recall, to the degree that they elongate the overall duration of a list. It seems that the influence of suprasegmental properties is highly dependent on the exact way they are implemented, and we cannot make sweeping statements such as “timing is more important than pitch” or vice versa. An experiment like Savino and colleagues’, which used two-syllable words with the rich, natural intonation contours of Bari Italian, and very short (300 ms) pauses between temporal groups, may indeed show a “pitch advantage,” whereas one like Parmentier and Maybery (2008), with simple, short consonants spoken in American English and pauses extending over 1000 ms, will likely show a “rhythm advantage.”
Although our ultimate goal is to examine recall performance of cochlear implant users when presented with different acoustic manipulations (background noise, fundamental frequency contour, reverberation, etc.), the current study is our attempt to revisit the role of suprasegmental cues (pitch and timing) in free recall for a non-tonal language, using more modern analytical approaches. In particular, we revisit these phenomena using Bayesian statistics. This allows us to take advantage of previous research and realistic constraints to the data in the form of priors, compare entire distributions instead of point-based measures like the mean, and use credible intervals, which have a more straightforward interpretation than confidence intervals (i.e., a 95% chance that the mean lies within this range).
Our study raised the following research questions, addressed in three successive experiments. First, is pitch information only useful insofar as it suggests grouping, or does it add another association space (e.g., melodic) to help store words, independent of the grouping phenomenon? Second, what role does presentation speed play, and can it suffice in explaining grouping by timing cues? Third, how do pitch grouping and time grouping interact—are they additive or, as Frankish asserted, somewhat redundant?
In experiment 1, we examined pitch patterns that did not indicate grouping to see if the simple act of varying pitch could affect immediate recall for a list of words. We had expected that pitch variation might add another dimension to the stimulus that could aid in memory retrieval compared to a monotone condition, but we found no evidence for this. In experiment 2, we tested different timing patterns, including fast-paced, slow-paced, jittered intervals, and one time-grouped condition, to look at the effect of pause insertion. We expected that both slower presentation rates and grouping would enhance recall, but found that presentation rate influenced recall much more strongly, making the grouping effect all but impossible to see. In experiment 3, we finally compared pitch grouping and time grouping while controlling for overall speed of presentation, replicating Frankish's results (an effect of grouping, but no additive effect for multiple grouping cues).
To augment the classical free-recall paradigm, we added a cued recall (multiple choice) portion, similar to what is used in the Montreal Cognitive Assessment (MoCA, Nasreddine et al., 2005). This cued recall gives participants a chance to retrieve additional words (which could have been well encoded but not retrieved), and gives future researchers the possibility to leverage measures like reaction time or gaze tracking on possible distracters (phonetic, semantic, or random). Overall, we showed that presentation speed was the only factor that significantly affected discrimination in the cued recall portion of this paradigm.
This series of experiments was conducted online and recruited a large sample of diverse participants using the Prolific platform, in line with our desire for solid statistical power, and used modern tools (Bayesian logistic or linear mixed-effect models) to analyze different aspects of the data (free recall, d’ and subjective measures), providing baseline data for future studies in clinical populations. We will present the methods for all experiments together, followed by a section of cross-experiment results, then the results and a short discussion of each individual study, and finally a general discussion. Visual summaries of each experiment can be found in Figures 3–5.
Figure 3.
Summary of experiment 1. Top: schematics of the four pitch conditions used in this experiment (fixed pitch, chromatic scale, roved (random) pitches, and melody). Middle: proportion of words recalled for each position in the list of ten by condition. Bottom: summary of recall, d’, and subjective performance by condition. Error bars signify confidence intervals.
Figure 5.
Summary of experiment 3. Top: schematics of the four conditions used in this experiment (fixed pitch/isochronous, arpeggiated pitch/isochronous, fixed pitch/time-grouped, and arpeggiated pitch/time-grouped). Middle: proportion of words recalled for each position in the list of ten. Bottom: summary of recall, d’, and subjective performance by condition. Error bars signify confidence intervals.
Methods
Data Availability
Anonymized datasets and code are publicly available at https://osf.io/94m65 for secondary analysis, teaching, and other academic or clinical pursuits.
Participants
Participants were recruited anonymously through the online recruitment service Prolific (www.prolific.co). Through this platform, we invited participants who reported both that they were fluent in English and that it was their first language. They also reported no hearing difficulties or ongoing mental health problems. A given participant never participated in more than one experiment. Informed consent was obtained from all participants included in the study.
For practical reasons (e.g., cost, time) we aimed to recruit approximately 80 participants per study. According to a frequentist power estimate and an estimated distribution of free recall performance in Zhang et al. (2021), 80 participants is enough to detect recall differences of close to 0.5 out of 10 words at 70% power—however, we used Bayesian statistics for the main analyses. Submitted data files were evaluated for quality using a flag system, excluding participants who had three or more flags. Flags included extremely high or low recall, absence of primacy and recency effects (Ebbinghaus, 1913; Osth & Farrell, 2019), unreasonably short reaction times, chance performance on the multiple-choice portions, and so on (see supplementary materials for more detail). Three participants were retained manually despite having 3–3.5 flags: two had voluntarily repeated the practice block, and the third had been flagged twice for scores exactly equal to the exclusion criterion (rather than falling below it).
For experiment 1, 82 of 113 data files were retained (39 male; aged 33 ± 11 s.d. years; four did not report age). For experiment 2, 80 of 96 data files were retained (36 male; aged 33 ± 12 s.d. years). For experiment 3, 77 of 104 data files were retained (46 male; aged 36 ± 13 s.d. years; one did not report age). Participant characteristics were not significantly different across the three experiments (analysis of variance on age: Experiment F(2,228) = 0.05, p = .951; Sex F(1,228) = 1.6, p = .205; Experiment × Sex F(2,228) = 0.7, p = .494; Experiment χ2s with sex, student status, & multilingualism all p > .063). Statistically, there were small differences in geographic origin, as one might expect from a random selection of English speakers across the globe (χ2(6) = 21.1, p = .002), but each experiment had a majority of participants from the UK, followed by a large proportion from the USA, and smaller numbers from other countries. This research was approved by Concordia University's ethics board.
Stimuli
Stimuli were divided between three studies: experiment 1 explored suprasegmental pitch variations (realized through F0 manipulation), experiment 2 explored suprasegmental timing variations, and experiment 3 looked at crossed pitch and time grouping effects. The conditions for all three studies are as reported in Table 1.
Table 1.
Summary of Experimental Conditions for all Three Studies. In Experiment 1, Timing was Kept Constant and Pitch Patterns were Manipulated. In Experiment 2, Pitch was Kept Constant and Timing Patterns were Manipulated. In Experiment 3, Pitch Grouping and Time Grouping were Crossed in a 2 × 2 Design While Controlling for Processing Time
| Condition | F0 manipulation | Time manipulation |
|---|---|---|
| Experiment 1 | | |
| Fixed | Monotonized at 140 Hz | 750 ms spacing |
| Chromatic | Chromatic ascending scale (beginning at 90 Hz and stepping up 10 semitones to 160 Hz) | 750 ms spacing |
| Roved | Random pitches (drawn without replacement from a 12-semitone chromatic scale, starting at 90 Hz & ending at 180 Hz) | 750 ms spacing |
| Melodic | 10-note melodic sequence (first 10 notes of the melody “Somewhere over the Rainbow,” starting at 90 Hz, with the highest note at 180 Hz) | 750 ms spacing |
| Experiment 2 | | |
| Fast | Monotonized at 140 Hz | 750 ms spacing |
| Slow | Monotonized at 140 Hz | 1500 ms spacing |
| Jittered | Monotonized at 140 Hz | Randomly selected temporal intervals (ranging from 750–1500 ms) |
| Grouped | Monotonized at 140 Hz | Groups of 3-3-4 (750 ms interval within group; 1500 ms between groups) |
| Experiment 3 | | |
| Fixed | Monotonized at 140 Hz | 917 ms spacing |
| Pitch-Grouped | Major arpeggios starting at 90 Hz of the form do-mi-sol-do-mi-sol-do⇑-sol-mi-do | 917 ms spacing |
| Time-Grouped | Monotonized at 140 Hz | Groups of 3-3-4 (750 ms interval within group; 1500 ms between groups) |
| Pitch- & Time-Grouped | Major arpeggios starting at 90 Hz of the form do-mi-sol-do-mi-sol-do⇑-sol-mi-do | Groups of 3-3-4 (750 ms interval within group; 1500 ms between groups) |
Words were drawn from a standard CNC (consonant-nucleus-consonant) word list (Peterson & Lehiste, 1962), and were recorded by a male American English speaker, whose original productions were centered around 119 Hz (specifically, taking the median F0 of each word, and then the median of those values, gave 119 Hz). Median production length was 594 ms. We used 24 lists of 10 words each; list order was randomly shuffled for each participant (but not word order within each list, to reduce online processing time). Because we were interested in suprasegmental patterns only, words were monotonized for all experiments; i.e., the F0 contour within each word was flattened. Note that this intentionally departs from the F0 contours found in natural language because 1) it generates a stronger sense of melody (akin to piano notes) across the ten words, and 2) it offers a potentially closer comparison to cochlear implant users’ perception (our clinical population of interest for future implementation of this work). F0 (pitch) patterns were then created by shifting each monotonized word to the desired fundamental frequency. All manipulations were achieved using Praat's (Boersma & Weenink, 2015) PSOLA algorithm with a custom Matlab (The MathWorks Inc., 2020) wrapper. These words were then concatenated with different spacings to achieve the desired temporal patterns. For each study, there were four conditions, and six word lists (replications) per condition, with condition order randomized for each participant. Any word list could be used in any condition.
In a separate study (yet unpublished) that took place in person with almost identical stimuli, the average intelligibility of these stimuli was measured at over 95%. Of course, in this online experiment, some variation in audio quality was introduced by participants’ sound systems at home, which likely enlarged individual differences, but we presumed that it had relatively negligible impact on the within-subject difference between conditions.
Task
In all three studies, participants were first trained on the task and asked to adjust their volume to a comfortable level during this time. Due to the nature of online testing, absolute sound levels could neither be controlled nor measured. Instead, after practice and calibration, participants were briefly asked how their audio was delivered (device audio, external speakers, headphones, or earbuds) and asked to rate its quality on a Likert scale from 1 to 7.
In the test phase, participants listened to 24 spoken lists of 10 words, and after hearing each list 3 times, they typed as many words as they could remember. In contrast to some other experiments of this type (especially with digits), we did not enforce serial recall: participants could type the words in any order they happened to remember them.
In previous protocols in this literature, words may have been encoded but not retrieved during free recall, also termed “available but not accessible” (Tulving & Pearlstone, 1966). We reasoned that it would be beneficial to hearing/speech clinicians to expand on their protocols by giving participants with more limited storage (e.g., because more cognitive resources would be spent deciphering the words in the first place) a chance to retrieve those words with a recognition memory task. It is known that noisy or otherwise challenging stimuli can affect later retrieval (Piquado et al., 2010; Rabbitt, 1966; Surprenant, 1999), and we hoped that using a recognition task would boost overall performance, giving “more room” for certain factors (e.g., grouping cues, or the quality of the speech signals delivered by a given speech processor versus another) to have an impact. Later cued retrieval can also be analyzed separately (which we have done here).
To this aim, after the recall phase, we used a multiple-choice paradigm to probe participants’ memory for the words they did not type, having them select the missed word from among several distracters (described below, Fig. 1), or not select any if they did not belong to the list just presented. Multiple choice trials containing a forgotten word (“target” trials) were mixed with trials containing no previously presented words (“catch” trials) in a ratio of 2:1, rounded up. For example, 3 forgotten words would lead to five trials, three being target trials and two being catch trials. This permitted us to calculate d prime (d’, Macmillan & Creelman, 2005), reaction time, and other interesting metrics for these difficult-to-retrieve words.
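The trial bookkeeping described above can be sketched as follows (a minimal illustration; the function name is ours, not from the original materials):

```python
import math

def multiple_choice_counts(n_forgotten):
    """Number of multiple-choice trials after free recall.

    One target trial per forgotten word, plus catch trials at a
    2:1 target:catch ratio, rounded up.
    """
    n_target = n_forgotten
    n_catch = math.ceil(n_target / 2)
    return n_target, n_catch
```

For example, three forgotten words yield three target trials and two catch trials, five trials in total.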
Figure 1.
Experimental schematic. Participants listened to a word list three times, then typed as many words as they could remember in the free recall portion. Next, for any words not recalled, participants got a chance to retrieve them through multiple choice questions. These questions include “target trials” with a word they heard from the list, and “catch trials” with none of the words from the list.
Finally, participants were administered the six subscales of the NASA-TLX (Hart & Staveland, 1988b) after each block to probe subjective mental demand, physical demand, time pressure, performance success, effort, and frustration. The experimental paradigm is summarized in Figure 1.
To determine how many words a participant got correct in the free recall portion, we needed to check for all possible ways of writing the words that had been presented by audio. For example, the written words “leak” and “leek” could be valid typed responses to the same spoken word. These alternatives were handled with homophone lists, so that either response was considered correct. Misspelled words were considered incorrect.
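A minimal sketch of this scoring rule (the homophone groups shown are illustrative examples, not the full lists used in the study):

```python
# Illustrative homophone groups; the study used fuller lists.
HOMOPHONE_GROUPS = [{"leak", "leek"}, {"sun", "son"}]

def is_correct(typed, presented):
    """Score a typed free-recall response against the spoken word.

    Any homophone of the presented word counts as correct;
    misspellings do not.
    """
    typed = typed.strip().lower()
    presented = presented.strip().lower()
    if typed == presented:
        return True
    return any(typed in group and presented in group
               for group in HOMOPHONE_GROUPS)
```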
Multiple Choice
For each word in the CNC list, a set of three other words was created to form the multiple-choice trial. We included both phonetic and semantic distracters, as well as random words. Phonetic distracters were evaluated with the ALINE algorithm implemented in R (Downey et al., 2008; Downey et al., 2017; Kondrak, 2000). The phonetic similarity of each distracter to the target word was computed (based on phonetic transcriptions), and phonetic advantage was calculated by subtracting the similarity scores of the other distracters in the set from that of the phonetic distracter. These advantage scores were positive and significant (t(254) = 16.5, p < .0001), demonstrating that phonetic distracters were, on average, phonetically closer to the target words.
A similar process was followed for semantic distracters, except that semantic similarity was measured with a pre-trained word2vec model, in a package implemented in R (Mikolov et al., 2013; Wijffels, 2021). Word2vec evaluates word similarity based on co-occurrence in a large collection of text using machine learning. Again, semantic distracters were indeed more semantically related to the target word than the others in their set (t(256) = 10.5, p < .0001). Care was taken so that it was not possible to logically deduce the word based on the patterns of distracters. In the hope that this phonetic/semantic word database may foster future endeavors in recall paradigms, we have made these word sets available along with our experimental data (see the beginning of the Methods section).
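The distracter-advantage computation generalizes across both distracter types. The sketch below uses cosine similarity over toy vectors as a stand-in for the ALINE and word2vec scores (all vectors and the focal-index convention are hypothetical illustrations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def advantage(target_vec, distracter_vecs, focal):
    """Similarity of the focal distracter to the target, minus the
    mean similarity of the other distracters in the set."""
    sims = [cosine(target_vec, d) for d in distracter_vecs]
    others = [s for i, s in enumerate(sims) if i != focal]
    return sims[focal] - sum(others) / len(others)
```

A positive advantage indicates that the focal distracter is closer to the target than its set-mates, mirroring the t-tests reported above.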
Analyses
Overview
Recall, d prime (d’), and subjective ratings were all analyzed with Bayesian generalized linear models. Bayesian models were created in the Stan computational framework (http://mc-stan.org/) accessed through the brms package in R (Bürkner, 2017). All models were run with two chains, a burn-in of 750 iterations, and a total of 4000 iterations.
Before adding experimental condition as a predictor to each model, we tried to create a “maximal” null model for each type of data (Barr et al., 2013), building up from simpler models. We progressively added factors such as quadratic primacy-recency curves, random effects of participant, block effects to capture learning or fatigue, etc. Model comparisons were carried out using a leave-one-out algorithm (LOO) with Pareto-smoothed importance sampling (PSIS), implemented in the loo package in R (Vehtari et al., 2017). If any Pareto k values were above the recommended threshold of 0.7, the LOO algorithm was re-run with moment matching (moment_match = TRUE). In all cases, the maximal null model was best, or not significantly different from the best null model. These maximal null models could then be compared to alternative models including experimental condition (that is, the pitch or timing pattern in which the words were presented) as a factor. For more details on the model comparisons, see the supplementary materials.
Free Recall Model
As stated above, a null Bayesian logistic model was first built to predict whether a word would be recalled or not, accounting for effects that are well-attested in the literature and generally not debated. In R syntax, it was written as follows:
recalled ∼ 1 + poly(position,2) * block + (1 | log_frequency) + (1 | phonological_density) +
(1 + poly(position,2) * block | participant) #[null model]
Predictors include a fixed effect of the position of each word in the presented list (1 to 10), as well as the quadratic function of position, to account for primacy and recency effects. Position and its quadratic were orthogonalized using the poly() function. The block number was introduced to account for practice and/or fatigue effects during the experiment. Random effects included participant, the log of the word's frequency, and the word's phonological neighborhood density, the latter two being obtained from the Clearpond database (Marian et al., 2012). We avoided a simple random effect of word (i.e., item): since each item was always presented in the same position in a list, the effect of word would be confounded with the effect of position. Finally, random slopes by participant were added for position, position squared, and block. Other than participant, which was categorical, all other measures were treated as numeric and scaled before being entered into the model (position was scaled after being orthogonalized).
Based on a previous in-person experiment from our lab using the same words in lists of 10 (Zhang et al., 2021), which had a mean recall in quiet of 66%, we set a corresponding prior for the grand intercept on the logit scale: Normal(0.66, 0.5); note that, coincidentally, the logit transform of 66% is also approximately 0.66. This prior corresponds to a Gaussian distribution on the logit scale, centered at the mean from the previous study, with approximately comparable variability. All other priors for beta coefficients were set to Normal(0,1) to be somewhat informative but direction-agnostic. The null model was verified with data from each experiment to ensure that it converged and behaved as expected.
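The coincidence noted above is easy to check with a one-line log-odds transform:

```python
import math

def logit(p):
    """Log-odds transform, used here for the prior on the grand intercept."""
    return math.log(p / (1 - p))

# logit(0.66) ≈ 0.663, numerically close to 0.66 itself
```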
Next, we built alternative models which included a main effect of condition, the slope of condition by participant, and potential interactions between condition and other fixed variables.
recalled ∼ 1 + poly(position,2) * block * audio_condition + (1 | log_frequency)
+ (1 | phonological_density) + (1 + poly(position,2) * block * audio_condition | participant) #[alternative model]
To understand whether the experimental conditions (pitch, timing) accounted for differences in word recall, we compared the null model to alternative models. We also inspected the alternative models for the estimated effects of condition (β), to see whether any of the 95% credible intervals included zero. Finally, to understand the effect of condition at each position in the list, we performed post-hoc (non-Bayesian) linear models at each word position with condition and a random intercept by participant, using a Bonferroni alpha of .005 to account for the 10 comparisons.
D Prime Model
Catch trials contain interesting information and can help to account for the bias of each participant (i.e., tendency to choose a word versus selecting “none of these”). In particular, they allow us to calculate a d prime (d’) measure, which can give an idea of how well participants discriminate previously heard words from unheard words (i.e., a form of primed memory) regardless of their bias. For this reason, we calculated d’ values by condition for each participant and corrected for extreme d’ values using the 1/2n rule. We then submitted the results to a Bayesian linear (Gaussian) model as follows:
dPrime ∼ 1 + (1 | participant) #[null model]
dPrime ∼ 1 + audio_condition + (1 | participant) #[alternative model]
Given that the theoretical maximum and minimum performance in this experiment, after corrections to prevent infinite d’ values, result in d’ of +/- 5.078, we set a prior for the intercept of Normal(0,2), which has 99% of its mass between these two values. All other betas were given a prior of Normal(0,1).
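The d’ computation with the 1/2n correction can be sketched as follows (the function name is ours; NormalDist is from the Python standard library):

```python
from statistics import NormalDist

def corrected_dprime(hits, n_targets, false_alarms, n_catch):
    """d' with the 1/2n rule: rates of exactly 0 or 1 are pulled in to
    1/(2n) or 1 - 1/(2n) so the z-transform stays finite."""
    z = NormalDist().inv_cdf
    hit_rate = min(max(hits / n_targets, 1 / (2 * n_targets)),
                   1 - 1 / (2 * n_targets))
    fa_rate = min(max(false_alarms / n_catch, 1 / (2 * n_catch)),
                  1 - 1 / (2 * n_catch))
    return z(hit_rate) - z(fa_rate)
```

With 10 target and 10 catch trials, for example, a perfect participant gets a finite d’ of about 3.29 rather than infinity.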
Subjective Measures
Participants completed the NASA-TLX (Hart & Staveland, 1988a) after each block, evaluating their own performance and effort in various ways on a scale from one to seven (and any point in between, not just integers). We analyzed these judgments using a Bayesian multivariate ordinal model, rounding responses to each half-point on the scale (giving 13 steps). We rounded this way in order to capture the larger number of responses close to each tick on the scale, while still accounting for some participants who answered in between. We also included the block number to account for fatigue or learning effects:
NASAtlx ∼ 1 + block + (1 + block | participant) #[null model]
NASAtlx ∼ 1 + block + audio_condition + (1 + block | participant) #[alternative model; audio condition intercept]
NASAtlx ∼ 1 + block * audio_condition + (1 + block * audio_condition | participant) #[alternative model; audio condition intercept & slope]
For these models, the intercept and all other betas were given a prior of Normal(0,1). In this case, LOO was not possible with the most complex model due to a limitation in computing power. Thus, LOO is only reported up to the model with a random intercept by participant, but no random slope or interaction.
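The half-point rounding onto 13 ordinal steps can be sketched as follows (a minimal illustration; the function name is ours):

```python
def to_ordinal_step(response):
    """Map a continuous 1-7 NASA-TLX response to ordinal steps 1-13
    by rounding to the nearest half point."""
    half = round(response * 2) / 2
    half = min(max(half, 1.0), 7.0)   # clamp to the scale endpoints
    return int((half - 1) * 2) + 1
```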
Cross-Experiment Analyses
Overall correlations between recall performance and NASA-TLX scores were assessed for each participant (24 data points each) and plotted to see which of the subscores was most closely associated with recall performance. The same was done for d’ scores, but as there was only one d’ score per condition, there were only four data points per participant.
Availability of Data and Materials
This work was not preregistered. Data, sample code, and word sets are available on OSF: https://osf.io/94m65.
Results
Cross-Experiment Analyses
Over the three experiments, participants recalled an average of 5.9 ± 1.5 words per block and had a d’ of 1.2 ± 0.8 in the multiple-choice section. The multiple-choice portion brought the mean number of retrieved words up to 7.8 per block (more than randomly responding to the multiple choice, which would have yielded approximately 6.7 words retrieved).
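The chance baseline quoted above can be reproduced with a short calculation, assuming each target trial offered four candidate words plus a “none of these” option, i.e., five equiprobable choices under random responding:

```python
mean_recall = 5.9                # words recalled per block of 10
forgotten = 10 - mean_recall     # words probed in multiple choice
chance_per_trial = 1 / 5         # four words + "none of these"

# free recall plus randomly guessed multiple-choice trials
expected_by_chance = mean_recall + forgotten * chance_per_trial
# ≈ 6.7 words retrieved per block under random responding
```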
Overall recall was related to subjective audio quality rating in experiment 1 (r2(80) = 0.06, p = 0.026), but not in experiment 2, experiment 3, or overall (overall r2(237) < 0.001, p = 0.429). Earbud users performed best on average; significantly better than those with device audio and external speakers (overall ANOVA F(3,235) = 4.48, p = 0.004, η2G = 0.05; earbuds > device audio t(71.2) = 2.7, p = 0.024, d = 0.85; earbuds > external speakers t(69.1) = 3.8, p = 0.005, d = 0.52). Headphone users fell in between these groups. Results for d’ followed a similar pattern: no overall correlation with subjective audio quality rating (r2(237) < 0.001, p = 0.499), but an effect of audio type (overall ANOVA F(3,235) = 2.9, p = 0.038, η2G = 0.04; earbuds > external speakers t(76.7) = 3.0, p = 0.019, d = 0.68). Despite the differences by audio type, we were not concerned that audio quality would bias the results in favor of any one condition over another, since all participants experienced all conditions and the primary comparisons are within-participant.
Of all errors made in the multiple-choice portion of the task, the vast majority were misses—when a word from the list was present but the participant said they had heard “none of these.” After that, false alarms were most common—when participants selected a word during a catch trial—followed by all other kinds of errors. Although phonetic- and semantic-labeled errors were equally possible, participants made, on average, twice as many phonetic-labeled errors (6% ± 0.6%) as semantic- or random-labeled errors (3% ± 0.4%; 3.1% ± 0.4%). However, with so few semantic and phonetic errors per participant, no further analysis was conducted on these errors. Participants’ subjective evaluations were correlated with their recall (Figure 2), with subjective performance showing a strong positive association while all other scores (effort, frustration, mental demand, temporal demand, and physical demand) showed negative associations.
Figure 2.
Cross-experiment analyses. a) Absolute frequencies of different error types across all participants: “miss” errors indicate an answer of “none of these” when the target word was present, false alarms indicate a word was selected when the correct response was “none of these,” and all other categories indicate various distracters (see supplementary materials for details). b) Rates of choosing phonetic errors, semantic errors, and random word errors, with one data point per participant. c) Participants’ NASA-TLX scores were correlated with their free recall scores. Pearson correlation R values are plotted, with one data point per participant. d) A similar correlation with d’ scores; however, this correlation is variable and weak since there were only four points per participant (d’ for each condition), compared to 24 points for recall (recall per block). Error bars signify confidence intervals.
Experiment 1
Description
Experiment 1 asked whether pitch variation aids memory independent of grouping. In other words, can suprasegmental pitch patterns that do not suggest a chunking pattern enhance recall? We compared four conditions: 1) no pitch variation, 2) a highly predictable rising chromatic scale, 3) an unpredictable random pitch condition that was different each time the participant encountered it, and 4) a relatively volatile melody (the sequence of notes that forms the opening phrase of “Somewhere Over the Rainbow” from The Wizard of Oz). Note that when we use the term “volatility,” we mean that the sequence contained a variety of interval sizes from one note to the next. Thus, even though the melodic condition may become somewhat predictable after being repeated within the experiment, it remains volatile in its structure. The roved condition, on the other hand, was both unpredictable and volatile. Though they ranged in predictability and volatility, these conditions did not strongly suggest any particular groupings (see Figure 3).
Results
Experiment 1 showed no effect of pitch patterns on recall (see Figure 3; statistics in online appendix tables 2–4 and 11). Models that included pitch pattern as a predictor were no better than the null model (supplementary tables S2, S3, and S5), and the 95% credible intervals for each pitch pattern (compared to the monotone condition) included zero. One interaction in the analysis of free recall had a credible interval excluding zero: a positive effect of melody on linear position, indicating a shift towards recency in the melodic condition. However, since the model accounting for pitch conditions did not outperform the null model overall, and since the analyses by position yielded no significant effects of melody, this interaction was not considered further.
Discussion
Though we provided a mix of predictable, volatile, repeated, and novel conditions, none of these manipulations of suprasegmental pitch made a detectable difference to short-term memory. It is striking that highly rigid and predictable stimuli (most evident in the monotone and chromatic cases) did not differ at all from the more volatile conditions. There was also no effect of pattern familiarity: three of the patterns (fixed, chromatic, and melodic) were the same every time and were thus encountered six times throughout the experiment, whereas the roved pitch condition was different every time and thus completely novel to the participant at each encounter. Still, there was no difference between the roved pitch condition and the other pitch conditions in our data, suggesting that familiarity and predictability of the pattern play little role at this time scale. In fact, the roved condition was the best of the four (though not significantly so), perhaps because the randomized pitch pattern could occasionally create accidental grouping patterns, which (as we will see later) do enhance recall, or because the novelty of the pattern briefly piqued participants’ interest and attention. Previous studies have suggested that extraneous pitch information may interfere with memorization (Purnell-Webb & Speelman, 2008; Racette & Peretz, 2007; Rainey & Larsen, 2002), while familiar melodies may provide helpful structure. However, other studies find that familiarity with pitch patterns does not help memory (Silverman, 2010). If there is an effect of pitch pattern familiarity on memory, six presentations may not be enough to induce it, or familiar melodies may only help when they also suggest a grouping (see experiment 3).
Long-term familiarity with the “melodic” condition (an isochronous version of “Somewhere Over the Rainbow”) was probably minimal in this experiment, as the rhythm of the original song was not preserved and none of the over 80 participants were able to explicitly name the melody. Thus, although this pitch sequence was melodically plausible, we surmise that it was perceived much like the roved condition. Still, given that the melodic condition differed in interval size, possible familiarity, and tonal structure, we cannot rule out the possibility that these features contributed to memory but counteracted each other in our experiment.
Experiment 2
Description
Experiment 2 was concerned with temporal aspects like speed, consistency, and grouping that might affect recall. We included an evenly spaced presentation with one word every 750 ms (which was the same as the fixed pitch condition in experiment 1), a slowed-down presentation with one word every 1500 ms, a jittered timing condition with randomized intervals between 750–1500 ms, and finally a grouped timing condition with a 1500 ms interval between words 3–4 and 6–7 and 750 ms intervals everywhere else (see Figure 4).
Figure 4.
Summary of experiment 2. Top: schematics of the four timing conditions used in this experiment (short intervals, long intervals, jittered timing, time-grouped). Middle: proportion of words recalled for each position in the list of ten. Bottom: summary of recall, d’, and subjective performance by condition. Error bars signify confidence intervals.
Results
In contrast to experiment 1, there was a robust effect of temporal condition, with the largest difference being between the fast- and slow-paced isochronous conditions, the slow pace yielding better performance (see Figure 4; statistics in online appendix tables 5–7 and 12). The jittered and grouped timing conditions showed similar recall performance, better than the fast isochronous condition but worse than the slow isochronous condition. Models accounting for these temporal conditions outperformed the null model (supplementary tables S7, S8, and S10). In these models, all temporal conditions (compared to the fixed 750 ms condition) showed 95% credible intervals well away from zero (log-odds credible intervals: slow isochronous [0.38, 0.61], jittered [0.20, 0.40], time-grouped [0.17, 0.36]). Improvement in the slower-paced conditions was most evident in the middle of the lists, as reflected by interactions between condition and the square of position (curves became shallower in the slower conditions), as well as by the post-hoc tests by position (significant differences between conditions in the middle positions and not at either end). The slow isochronous condition also interacted with the linear effect of position, shifting recall towards earlier words (primacy; log-odds credible interval: [-0.25, −0.01]). Compared to the fixed condition, recall was particularly good in the grouped condition at positions four and seven, which is hardly a coincidence since these positions marked the beginning of each temporal group (i.e., a “mini-primacy” effect). D’ for the multiple choice and subjective estimations of performance both showed the same basic pattern as recall, though for d’ the grouped condition was not significantly better than the fixed condition.
To further explore this effect of presentation time, we conducted a simple post-hoc linear model within the jittered condition, which, because the intervals were randomly determined, had a slightly different presentation time for each list. The relation between length of audio presentation and recall performance within the jittered condition, including a random intercept of participant, was significant (effect of block duration (416.6) = 0.28, p = .027), again with a slower overall pace leading to better recall, with an estimated gain of 0.28 words per second of presentation time. However, when the short and long fixed-interval conditions were included in the model, the estimate shrank to 0.12 words gained per second of presentation time (effect of block duration (1359) = 0.12, p < .00001). This is likely because the recall gain with presentation time is not linear but asymptotic (Unsworth, 2014).
Discussion
These results indicate that presentation rate is a critical factor for recall performance, consistent with previous work (Dowling, 1973; Kilgour et al., 2000; Unsworth, 2014). Longer presentation times allow for more elaborative processes, such as rehearsal and engagement of the phonological loop (Baddeley & Hitch, 2019). In fact, the slight increase in presentation time in the jittered condition compared to the time-grouped condition obscured any effect of grouping entirely. The jittered condition, with an average presentation time of 11.3 s per list, had similar recall and perhaps a slightly better d’ than the grouped condition, with its 9 s presentation time. If we were to “correct” a posteriori for these extra two seconds of presentation time using the estimates from our linear models above, we would expect recall in the jittered condition to fall by 0.24–0.56 words, to between 5.4 and 5.72 words out of 10, and the grouping condition would then indeed be superior. However, as noted above, though recall clearly improves with presentation time, the magnitude of that improvement is context-dependent and not linear. It is thus better to control for presentation time at the experimental design stage than to apply a correction afterwards.
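The arithmetic behind this a posteriori correction can be spelled out; the slope values come from the linear models reported in the Results, and the two-second figure is the rounded difference between the 11.3 s and 9 s mean presentation times:

```python
extra_seconds = 2.0      # jittered lists ran roughly 2 s longer than grouped ones (11.3 s vs. 9 s)
slopes = (0.12, 0.28)    # words gained per second: full-model and jitter-only estimates
corrections = [round(extra_seconds * s, 2) for s in slopes]
print(corrections)  # → [0.24, 0.56]
```

Because the true gain is asymptotic rather than linear, these corrections bracket the plausible range rather than pinpointing it.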
Additionally, we observed mini-primacy effects induced by the grouped timing condition, which have previously been observed (e.g., Ng & Maybery, 2002). These mini-primacy effects will become even more evident in experiment 3, where we look at different types of grouping.
Experiment 3
Description
Experiment 3 examined the relative contributions of grouping in time versus grouping by pitch, specifically testing whether these cues were redundant or additive. We used arpeggios to create the pitch groupings. Pitch and time groupings were fully crossed: 1) no grouping cues, 2) pitch grouping only, 3) time grouping only, and 4) both grouping cues. In the models, instead of specifying the experimental condition as a single predictor with four levels, we included pitch (flat or arpeggiated/pitch-grouped) and time (isochronous or time-grouped) as two separate predictors and assessed their main effects along with their interaction (see Figure 5).
In light of the results of experiment 2, a robust difference between the fixed and grouped timing conditions may not only be due to grouping, but also to the longer presentation time coming from the inserted pauses, leaving time for elaborative processes to kick in—this is supported by results from the literature, like a more robust pupil response when listening to slowly-presented word lists (Unsworth & Miller, 2021). Thus, we created stimuli that were controlled for presentation time, making each interval in the ungrouped condition slightly longer to account for the pauses in the grouped condition.
Results
In experiment 3, there was a slight improvement in free recall when stimuli were grouped, whether by pitch cues, timing cues, or both (see Figure 5; statistics in online appendix tables 8–10 and 13). This appeared as main effects of both pitch and time grouping (log-odds credible intervals: pitch = [0.03, 0.24]; time = [0.09, 0.29]) with a negative interaction (log-odds credible interval: [-0.29, −0.01]) underscoring the non-additivity of the two cues. In addition, there was an interaction of pitch grouping with linear position, indicating a shift towards remembering items later in the list (recency effect; log-odds credible interval: [0.06, 0.31]).
The plot of recall by position also suggested a “mini-primacy” effect interrupting the smoothness of the primacy-recency curve in the grouped conditions, as has been seen in the literature (Parmentier & Maybery, 2008; Savino et al., 2020). As before, this was investigated with post-hoc tests for the effect of condition at each list position, with a Bonferroni correction for 10 tests. These tests confirmed the mini-primacy effect at positions four and seven for the time grouping (position four t(228) = 2.7, p = .008 [trend]; position seven t(228) = 4.3, p = .00003). However, the pitch grouping effect was not significant at position four (t(228) = 1.2, p = .225), nor at position seven once the Bonferroni correction was applied (t(228) = 2.4, p = .018). Instead, there was an effect of pitch grouping at position eight (t(228) = 3.5, p = .0006), consistent with a shift towards recency.
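To make the corrected criterion explicit: with ten per-position tests, the Bonferroni-adjusted threshold is 0.05/10 = 0.005, which is what classifies each of the p values above:

```python
alpha, n_tests = 0.05, 10
threshold = alpha / n_tests  # 0.005 after correcting for ten positions
# p values from the post-hoc tests reported above
p_values = {
    "time grouping, position 4": 0.008,    # trend: just misses the corrected threshold
    "time grouping, position 7": 0.00003,  # significant
    "pitch grouping, position 4": 0.225,   # n.s.
    "pitch grouping, position 7": 0.018,   # n.s. once corrected
    "pitch grouping, position 8": 0.0006,  # significant
}
for label, p in p_values.items():
    print(f"{label}: {'significant' if p < threshold else 'n.s.'}")
```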
Visually, there also appeared to be some transfer of this grouping effect into the fixed condition, perhaps because three of the four conditions in this experiment presented a 3–3–4 grouping, leading participants to think of the list of 10 as grouped even when there was no audio cue to that effect. This could account for the lack of significance of pitch grouping at positions four and seven. Subjective ratings were consistent with the free recall results: participants thought they did better on the grouped conditions, and rated them as less mentally demanding.
Unlike the presentation-time effects of experiment 2, the grouping effect did not extend to multiple-choice performance. For the d’ models, including the pitch- or time-grouping conditions did not reliably improve the predictive power of the model (supplementary table S13), nor were any of the grouping conditions significant in the full model (online appendix table 9). Note that this is consistent with the grouped condition in experiment 2, which did not reach significance for d’.
Discussion
These results confirm those of Frankish (1989, 1995), who also concluded that grouping by more than one factor did not further enhance recall. However, compared to the processing-time effect demonstrated in experiment 2, the grouping effect of experiment 3 was weaker and did not extend into the multiple-choice section. This could indicate that the elaborative processes available from longer presentation times facilitate a more robust encoding, or even a transfer to long-term memory, whereas grouping simply allows for better storage in short-term memory.
General Discussion
We present here a thorough exploration of suprasegmental pitch and timing properties and their effects on immediate recall of word-list stimuli presented to normally-hearing participants. This systematic series of experiments is among the largest of its kind to date in terms of participants per condition (around 80 per study, with a within-subjects design; but see also Savino et al., 2020). Online recruitment allowed us to move beyond the university undergraduate population (only 30% reported being students). In addition, we introduce a cued recall paradigm that can be used to retrieve additional information about memory in future experiments with clinical populations, typically hearing-impaired listeners or users of cochlear implants. We have made these word sets and the data available for use or adaptation, as a useful baseline going forward for the field of audiology and listening effort (Pichora-Fuller et al., 2016).
Our results indicate that pitch patterns, at least in English, impact immediate recall only insofar as they suggest a grouping (compare experiment 1 to experiment 3). Highly predictable stimuli (fixed pitch, chromatic scale) were not statistically different from highly volatile stimuli (randomized pitches, melody). In addition, we saw no effect of familiarity with the melodies on this timescale: if there were a familiarity effect, it should have resulted in worse performance on the randomized pitch pattern in experiment 1, which was different at every exposure, while all other conditions were repeated six times. Thus, we are skeptical that the familiarity of a series of pitches alone can help people memorize new information in the short term. In other studies showing a sung-word advantage for memory, the advantage may instead be due to longer processing times compared to spoken words (see experiment 2), or to pitch grouping phenomena (see experiment 3). We surmise that it is these qualities—grouping cues and slowing—that primarily account for music's effects on memory. However, the time frame of our study could only address short-term memory; multi-day studies would be needed to examine long-term learning. For the randomized pitch condition, there might also be a competing effect of novelty in the context of an otherwise repetitive experiment, something that could be dissociated in future studies. Further investigation could also try to rule out inverted-U patterns related to volatility, predictability, or familiarity: since we tested the extremes of these dimensions, we might have missed a “Goldilocks” effect.
Consistent with previous research, both pitch- and time-grouped stimuli improved recall, but in experiment 2 this was obscured by an effect of presentation time—the total amount of time over which the stimuli are presented. Other experiments confirm the effect of presentation time, or the lack of an effect of interest once presentation time is controlled for (Gorin, 2020; Kilgour et al., 2000; Unsworth, 2014). Future studies should take presentation time into account when inserting pauses to create temporal grouping patterns for memory experiments, as we did in experiment 3. The length of inserted pauses will directly affect presentation time, which will enhance recall to different degrees, and may be confounded with the manipulation of interest.
In experiment 3, we controlled for presentation time and found that temporal grouping was stronger than pitch grouping, but this is likely dependent on the specific experimental context. Savino and colleagues (Savino et al., 2020), for example, found that their pitch-grouped stimuli had a larger impact on memory than their time-grouped stimuli. A number of methodological choices could account for this difference: first, their temporal grouping condition used pauses of 300 ms between groups (0 ms within groups), while our study used 750 ms pauses, over twice as long. Additionally, their study was conducted in Bari Italian, a language with rich intonation cues, and used two-syllable words with natural F0 contours, whereas ours was conducted in English, with monotonized one-syllable words. It seems intuitive that the more syllables words contain (and, more generally, the longer the speech materials, e.g., in the widely used SWIR, or Sentence-final Word Identification and Recall test), the more intonation could matter to encoding and memory consolidation. In the end, the two studies differ in the degree of influence of different grouping cues, but both agree that grouping cues enhance free recall for participants with normal hearing. This is an important finding to have confirmed, as it is unclear whether it will hold for some clinical populations. In addition, we observed grouping effects that seemed to “bleed over” into the monotone condition in experiment 3. This echoes the results of Farrell (2008), who found that explicit instruction to think of words in groups could improve memory in the same way that time-grouping could. In our experiment, the “instruction” was more implicit: participants were encouraged to think of the words in groups through their exposure to the other grouped conditions.
These carryover effects are subtle and not what we set out to test, but they are a good reminder that participants are constantly taking context into account while performing a task.
On the subject of context-based performance, it is also worth noting that there were differences in the average number of words recalled in the “fixed” condition of each experiment, despite the auditory stimuli being identical for this condition across the three experiments. When all the conditions were similarly difficult, as in experiment 1, performance on the fixed condition was higher, at 6 words out of 10; but when an easier condition was present, like the slow condition in experiment 2, performance on the fixed condition dropped to around 5.5 words out of 10 (t(154.6) = 2.4, p = 0.017, d = 0.38). We speculate that this could reflect a subtle context-based effort allocation strategy, which would be of interest to future memory researchers. However, our experiments were not designed to capture this effect, which was between subjects and could have been affected by other differences between our samples.
Our recall task was followed by a multiple-choice task that aimed to help participants recover some of the words they had been unable to freely recall. In experiments 1 and 2, the results of the multiple-choice task (d’) were mostly consistent with free recall, such that conditions with better recall also showed better multiple-choice performance for words that had not been freely recalled. The one case where the multiple-choice measure deviated from free recall was the grouped stimuli: although there was an effect of both pitch- and time-grouping on free recall, this pattern was surprisingly absent from multiple-choice performance. This does not mean that the grouping effect is suspect; indeed, it has been confirmed in a number of independent auditory and visual studies over the years (Bower, 1970; Farrell et al., 2011; Frankish, 1989, 1995; Ng & Maybery, 2002; Savino et al., 2020). However, the current results do highlight a distinction between processing-time effects and grouping effects: the former seems to boost memory in a more sustained manner that extends into cued recall, perhaps because there is time for more elaborative processes (Unsworth & Miller, 2021).
Such a stimulus set for cued recall can be useful in future studies, whether for participants who may not do well on free recall, or for experiments using pupillometry/eyetracking or reaction time as the measures of primary interest. It could even be used without the free recall component, in two-, three-, or four-alternative forced-choice tasks, depending on how many words from each set are presented at once. Our inclusion of phonetic and semantic distracters also allows for a richer analysis of errors. These errors were rare in this implementation and for this normally-hearing population, but removing the option “none of these” would force them to occur and might reveal the cognitive strategies recruited by different clinical populations (perhaps leaning more on phonetic than semantic cues for storage, or vice versa).
Participants’ subjective ratings were consistent with the free recall results; five of the six subscales of the NASA-TLX were affected by task condition, all except physical demand. Lower subjective mental demand, frustration, and effort were associated with higher subjective performance. Participants also indicated that they thought they did better, and spent less effort, on the grouped conditions in experiment 3, agreeing with the free recall results rather than the multiple choice. It thus seems that people found free recall easier to gauge, and that these simple self-reports remain effective indicators of effort.
Conclusion
As the field of audiology progressively bridges to the cognitive sciences (Pichora-Fuller et al., 2016; Rönnberg et al., 2019), there is a need to revisit classical experiments on short-term recall of word lists and to adapt them to populations of interest. Here, we expanded on the classical paradigm and provided a stimulus set along with normative data from normally-hearing adults. We found that slow pacing, as well as both pitch- and time-grouping, improved free recall performance, replicating findings with a larger and more diverse sample than previous studies. On the other hand, pitch patterns that did not imply grouping did not benefit short-term memory. Results from our cued recall task (using multiple choice to retrieve words not freely recalled) suggest that grouping cues may be less effective than increased processing time for prolonged retention of difficult-to-remember words. If grouping promotes memory compression (Norris & Kalm, 2021), this effect may decay more quickly than the effect of the elaborative processes afforded by longer processing time (Unsworth & Miller, 2021). These findings speak to why music may be such an effective learning tool: it is usually slower paced than speech and contains well-defined grouping structures (even if the sequence of pitches does not itself affect memory).
Supplemental Material
Supplemental material, sj-docx-1-tia-10.1177_23312165231181757 for Grouping by Time and Pitch Facilitates Free but Not Cued Recall for Word Lists in Normally-Hearing Listeners by Anastasia G. Sares, Annie C. Gilbert, Yue Zhang, Maria Iordanov, Alexandre Lehmann and Mickael L. D. Deroche in Trends in Hearing
Footnotes
Authors’ contributions: AGS, MD, YZ, and AL participated in experiment conceptualization and design. MD, YZ, AL, and ACG provided background research and guided the literature review. AGS and MD created the stimuli. AGS coded the experiment and ran it on the Prolific interface. AGS and MI inspected all data for quality control. AGS analyzed the data, created the figures, and drafted the manuscript. All authors edited the manuscript and approved the final version. MD and AL provided funding in partnership with Oticon Medical. Special thanks to François Patou and Manuel Segovia Martinez at Oticon for their support, suggestions, and feedback at various stages of this project.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding and competing interests: This work received funding from the MITACS Accelerate program, in collaboration with Oticon Medical Canada, under grant number IT16878 (https://www.mitacs.ca/en/programs/accelerate; https://www.oticonmedical.com/ca). The funders did not dictate study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors have declared that no other competing interests exist.
Availability of data and materials: This work was not preregistered. Data, code, and word sets are available at https://osf.io/94m65
ORCID iDs: Anastasia G. Sares https://orcid.org/0000-0002-1440-2639
Mickael L. D. Deroche https://orcid.org/0000-0002-8698-2249
Supplemental Material: Supplemental material for this article is available online.
References
- Baddeley A. D., Hitch G. J. (2019). The phonological loop as a buffer store: An update. Cortex, 112, 91–106. 10.1016/j.cortex.2018.05.015
- Barr D. J., Levy R., Scheepers C., Tily H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. 10.1016/j.jml.2012.11.001
- Boersma P., Weenink D. (2015). Praat: Doing phonetics by computer [Computer program, version 6.0.50].
- Bower G. H. (1970). Organizational factors in memory. Cognitive Psychology, 1(1), 18–46. 10.1016/0010-0285(70)90003-4
- Bürkner P. C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. 10.18637/jss.v080.i01
- Dowling W. J. (1973). Rhythmic groups and subjective chunks in memory for melodies. Perception & Psychophysics, 14(1), 37–40. 10.3758/BF03198614
- Downey S. S., Hallmark B., Cox M. P., Norquest P., Lansing J. S. (2008). Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics, 15(4), 340–369. 10.1080/09296170802326681
- Downey S. S., Sun G., Norquest P. (2017). Aliner: An R package for optimizing feature-weighted alignments and linguistic distances. R Journal, 9(1), 138–152. 10.32614/rj-2017-005
- Ebbinghaus H. (1913). On memory: A contribution to experimental psychology. Teachers College.
- Farrell S. (2008). Multiple roles for time in short-term memory: Evidence from serial recall of order and timing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(1), 128–145. 10.1037/0278-7393.34.1.128
- Farrell S., Wise V., Lelièvre A. (2011). Relations between timing, position, and grouping in short-term memory. Memory and Cognition, 39(4), 573–587. 10.3758/s13421-010-0053-0
- Frankish C. (1989). Perceptual organization and precategorical acoustic storage. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(3), 469–479. 10.1037/0278-7393.15.3.469
- Frankish C. (1995). Intonation and auditory grouping in immediate serial recall. Applied Cognitive Psychology, 9(7), S5–S22. 10.1002/acp.2350090703
- Gilbert A. C., Boucher V. J., Jemel B. (2014). Perceptual chunking and its effect on memory in speech processing: ERP and behavioral evidence. Frontiers in Psychology, 5, 220. 10.3389/fpsyg.2014.00220
- Gilbert A. C., Boucher V. J., Jemel B. (2015). The perceptual chunking of speech: A demonstration using ERPs. Brain Research, 1603, 101–113. 10.1016/j.brainres.2015.01.032
- Gorin S. (2020). The influence of rhythm on short-term memory for serial order. Quarterly Journal of Experimental Psychology, 73(12), 2071–2092. 10.1177/1747021820941358
- Hart S. G., Staveland L. E. (1988a). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology, 52, 139–183. 10.1016/S0166-4115(08)62386-9
- Hart S. G., Staveland L. E. (1988b). NASA-TLX Scale. 10.1016/S0166-4115(08)62386-9
- Kilgour A. R., Jakobson L. S., Cuddy L. L. (2000). Music training and rate of presentation as mediators of text and song recall. Memory and Cognition, 28(5), 700–710. 10.3758/BF03198404
- Kondrak G. (2000). A new algorithm for the alignment of phonetic sequences. Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, 288–295. https://dl.acm.org/doi/10.5555/974305.974343
- Lunner T., Rudner M., Rosenbom T., Ågren J., Ng E. H. N. (2016). Using speech recall in hearing aid fitting and outcome evaluation under ecological test conditions. Ear and Hearing, 37, 145S–154S. 10.1097/AUD.0000000000000294
- Macmillan N. A., Creelman C. D. (2005). Detection theory: A user’s guide (2nd ed.). Erlbaum.
- Marian V., Bartolotti J., Chabal S., Shook A. (2012). Clearpond: Cross-linguistic easy-access resource for phonological and orthographic neighborhood densities. PLoS ONE, 7(8), e43230. 10.1371/journal.pone.0043230
- Micula A., Rönnberg J., Książek P., et al. (2022). A glimpse of memory through the eyes: Pupillary responses measured during encoding reflect the likelihood of subsequent memory recall in an auditory free recall test. Trends in Hearing, 26, 23312165221130581. 10.1177/23312165221130581
- Mikolov T., Chen K., Corrado G., Dean J. (2013). Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, 1–12. 10.48550/arXiv.1301.3781
- Moore B. C. J., Carlyon R. P. (2005). Perception of pitch by people with cochlear hearing loss and by cochlear implant users. In Pitch: Neural coding and perception (pp. 234–277). Springer Science + Business Media, Inc. 10.1007/0-387-28958-5_7
- Nasreddine Z. S., Phillips N. A., Bédirian V., Charbonneau S., Whitehead V., Collin I., Cummings J. L., Chertkow H. (2005). The Montreal Cognitive Assessment, MoCA: A brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4), 695–699. 10.1111/j.1532-5415.2005.53221.x (Erratum in: Journal of the American Geriatrics Society, 2019, 67(9))
- Ng H. L. H., Maybery M. T. (2002). Grouping in short-term verbal memory: Is position coded temporally? Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 55(2), 391–424. 10.1080/02724980143000343
- Norris D., Kalm K. (2021). Chunking and data compression in verbal short-term memory. Cognition, 208, 104534. 10.1016/j.cognition.2020.104534
- Osth A. F., Farrell S. (2019). Using response time distributions and race models to characterize primacy and recency effects in free recall initiation. Psychological Review, 126(4), 578–609. 10.1037/rev0000149
- Parmentier F. B. R., Maybery M. T. (2008). Equivalent effects of grouping by time, voice, and location on response timing in verbal serial memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(6), 1349–1355. 10.1037/a0013258
- Peterson G. E., Lehiste I. (1962). Revised CNC lists for auditory tests. Journal of Speech and Hearing Disorders, 27(1), 62–70. 10.1044/jshd.2701.62 [DOI] [PubMed] [Google Scholar]
- Pichora-Fuller M. K., Kramer S. E., Eckert M. A., Edwards B., Hornsby B. W. Y., Humes L. E., … & Wingfield A. (2016). Hearing impairment and cognitive energy. Ear and Hearing, 37, 5S–27S. 10.1097/AUD.0000000000000312 [DOI] [PubMed] [Google Scholar]
- Piquado T., Cousins K. A. Q., Wingfield A., Miller P. (2010). Effects of degraded sensory input on memory for speech: Behavioral data and a test of biologically constrained computational models. Brain Research, 1365, 48–65. 10.1016/j.brainres.2010.09.070 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Purnell-Webb P., Speelman C. P. (2008). Effects of music on memory for text. Perceptual and Motor Skills, 106(3), 927–957. 10.2466/PMS.106.3.927-957 [DOI] [PubMed] [Google Scholar]
- Rabbitt P. (1966). Recognition: Memory for words correctly heard in noise. Psychonomic Science, 6(8), 383–384. 10.3758/BF03330948 [DOI] [Google Scholar]
- Racette A., Peretz I. (2007). Learning lyrics: To sing or not to sing? Memory and Cognition, 35(2), 242–253. 10.3758/BF03193445 [DOI] [PubMed] [Google Scholar]
- Rainey D. W., Larsen J. D. (2002). The effect of familiar melodies on initial learning and long-term memory for unconnected text. Music Perception, 20(2), 173–186. 10.1525/mp.2002.20.2.173 [DOI] [Google Scholar]
- Rönnberg J., Holmer E., Rudner M. (2019). Cognitive hearing science and ease of language understanding. International Journal of Audiology, 58(5), 247–261. 10.1080/14992027.2018.1551631 [DOI] [PubMed] [Google Scholar]
- Savino M., Bosco A., Grice M. (2013). Intonation and positional effects in spoken serial recall. Speech Prosody, (July), 1–6. https://escholarship.org/uc/item/69v8h0jf [Google Scholar]
- Savino M., Winter B., Bosco A., Grice M. (2020). Intonation does aid serial recall after all. Psychonomic Bulletin and Review, 27(2), 366–372. 10.3758/s13423-019-01708-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Silverman M. J. (2007). The effect of paired pitch, rhythm, and speech on working memory as measured by sequential digit recall. Journal of Music Therapy, 44(4), 415–427. 10.1093/jmt/44.4.415 [DOI] [PubMed] [Google Scholar]
- Silverman M. J. (2010). The effect of pitch, rhythm, and familiarity on working memory and anxiety as measured by digit recall performance. Journal of Music Therapy, 47(1), 70–83. 10.1093/jmt/47.1.70 [DOI] [PubMed] [Google Scholar]
- Surprenant A. M. (1999). The effect of noise on memory for spoken syllables. International Journal of Psychology, 34(5–6), 328–333. 10.1080/002075999399648 [DOI] [Google Scholar]
- The MathWorks Inc. (2020). MATLAB version: 9.8 (R2020a), Natick, Massachusetts: The MathWorks Inc. https://www.mathworks.com [Google Scholar]
- Towse J. N., Hitch G. J., Skeates S. (1999). Developmental sensitivity to temporal grouping effects in short-term memory. International Journal of Behavioral Development, 23(2), 391–411. 10.1080/016502599383883 [DOI] [Google Scholar]
- Tulving E., Pearlstone Z. (1966). Availability versus accessibility of information in memory for words. Journal of Verbal Learning and Verbal Behavior, 5(4), 381–391. 10.1016/S0022-5371(66)80048-8 [DOI] [Google Scholar]
- Unsworth N. (2014). The influence of encoding manipulations on the dynamics of free recall. Memory and Cognition, 43(1), 60–69. 10.3758/s13421-014-0447-5 [DOI] [PubMed] [Google Scholar]
- Unsworth N., Miller A. L. (2021). Encoding dynamics in free recall: Examining attention allocation with pupillometry. Memory and Cognition, 49(1), 90–111. 10.3758/s13421-020-01077-7 [DOI] [PubMed] [Google Scholar]
- Vehtari A., Gelman A., Gabry J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. 10.1007/s11222-016-9696-4 [DOI] [Google Scholar]
- Wallace W. T. (1994). Memory for music: Effect of melody on recall of text. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(6), 1471–1485. 10.1037/0278-7393.20.6.1471 [DOI] [Google Scholar]
- Wechsler D. (2007). The measurement of adult intelligence (3rd ed.). Williams & Wilkins Co. 10.1037/11329-000 [DOI] [Google Scholar]
- Wijffels J. (2021). word2vec: Distributed Representations of Words [R package, version 0.3.4].
- Zhang Y., Lehmann A., Deroche M. (2021). Disentangling listening effort and memory load beyond behavioural evidence: Pupillary response to listening effort during a concurrent memory task. PLoS ONE, 16(3), e0233251. 10.1371/journal.pone.0233251 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
Supplementary Materials
Supplemental material, sj-docx-1-tia-10.1177_23312165231181757 for Grouping by Time and Pitch Facilitates Free but Not Cued Recall for Word Lists in Normally-Hearing Listeners by Anastasia G. Sares, Annie C. Gilbert, Yue Zhang, Maria Iordanov, Alexandre Lehmann and Mickael L. D. Deroche in Trends in Hearing
Data Availability Statement
Anonymized datasets, sample code, and word sets are publicly available at https://osf.io/94m65 for secondary analysis, teaching, and other academic or clinical pursuits. This work was not preregistered.