Salient features of task-irrelevant continuous speech distort subjective time

Ashley E Symons; Fred Dick; Lori Holt; Adam T Tierney

doi:10.1037/xge0001946

Author manuscript; available in PMC: 2026 May 20.

Published before final editing as: J Exp Psychol Gen. 2026 May 18:10.1037/xge0001946. doi: 10.1037/xge0001946

PMCID: PMC13186111 NIHMSID: NIHMS2165017 PMID: 42149489

Abstract

Computational models of auditory salience predict that acoustic change and divergence from prediction increase the salience of sound streams. Confirming these predictions, prior research has shown that acoustic change and unpredictable sound features are linked to increases in physiological arousal and disruption of concurrent task performance. However, it remains unclear whether linguistic features, such as phonemic and lexical/semantic surprisal, help drive attentional orienting, or whether instead attentional capture takes place prior to linguistic analysis. To address this question, we introduce a new technique for assessing attentional capture by naturalistic task-irrelevant speech. In this paradigm, participants tap to a metronome while ignoring a spoken passage from an audiobook. Salient features of the task-irrelevant speech capture attention, increase arousal, and expand subjective time, leading to shifts in tap timing. We show that distortions of subjective time are driven not only by acoustic change but also by phonemic surprisal. Thus, attentional orienting to sound takes place after the initial stages of linguistic analysis.

Keywords: Attention, speech, time

Introduction

Imagine that you are in a coffee shop, trying to work on a grant proposal. The ambient noise of silverware clinking and coffee being assembled recedes into the background as you focus your mind. Then, behind you, a conversation turns heated: a couple begins to argue, their voices suddenly louder, higher in pitch, and spiked with emotional words. Despite your best intentions, your attention drifts away from your proposal and you begin to eavesdrop. As this emotional conversation captures your attention, your physiological arousal increases: your pupils dilate, your skin sweats, your pulse quickens, and your perception of time expands. This is a common experience because speech is particularly good at capturing our attention. However, it remains an open question which factors cause speech to capture attention, and which cause speech to fade into the background.

Researchers have developed several computational models of the factors which drive attentional capture by sound. However, these models have been developed to apply to sound in general and so cannot capture speech-specific factors such as phonology or lexical semantics. Instead, these models have focused on modelling changes in salience over time in complex sound streams due to acoustic factors. For example, some bottom-up models have created salience maps with center-surround inhibition in time-frequency “space” (Kayser et al., 2005; Kalinli & Narayanan, 2007; Duangudom & Anderson, 2007), inspired by vision research using eye tracking data as ground truth (Niebur et al., 2002). These models predict that sudden acoustic changes across multiple dimensions (amplitude, pitch, spectral shape) will be linked to transient increases in attentional capture. Other contextual models track dynamic changes in feature-specific deviance from prediction relative to local and longer-term statistics (Tsuchida & Cottrell, 2012; Kaya & Elhilali, 2014). These models predict that moments of high unpredictability within a speech stream will be followed by short-term capture of attention.

Predictions of auditory salience models have been tested via behavior and physiology. One simple method of assessing salience used to validate computational models is to ask participants how salient a given sound or auditory scene is (Kalinli & Narayanan, 2007; Duangudom & Anderson, 2007) or to ask participants which of two competing scenes is more salient (Huang & Elhilali, 2017). This research has found that salience ratings are greater after change along several acoustic dimensions, including loudness, pitch, and spectral shape, as well as when a sound stream’s acoustic characteristics diverge from the distributional statistics of the surrounding context. However, this approach relies upon participants having consistent and valid interpretations of the word/concept “salience”. An alternate approach is to examine the effects of presentation of a sound stream on performance of a concurrent task, such as serial visual short-term memory (Jones & Macken, 1993). Visual memory is disrupted more by sound streams featuring acoustic change, and larger changes are linked to greater disruption (Jones et al., 2000; Schlittmeier et al., 2012). Moreover, unpredictable changes cause more disruption of performance than predictable changes (Bell et al., 2012; Bell et al., 2019). However, disruption of performance is not a pure measure of attentional capture, because it can potentially reflect interference with pre-conscious automatic processes (such as processing of serial order) as well as diversion of attention (Hughes, 2014).

An alternate approach to studying capture of attention by sound streams is to measure the physiological components of attentional orienting. Capture of attention by a particularly salient sound is accompanied by an increase in arousal which prepares the listener for action (Sokolov, 1963). These arousal effects can be a confounding factor when investigating attentional capture by assessing disruption of behavior, as distraction and arousal can have opposite effects on performance (Bonmassar et al., 2023; Masson & Bidet-Caulet, 2019). However, arousal can be assessed more directly by measuring physiological responses such as pupil dilation, skin conductance, and MEG/EEG. For example, sound intensity is linked to the degree of pupil dilation (Antikainen & Niemi, 1983; Liao et al., 2016) and the amplitude of the galvanic skin response (Barry, 1975). The degree of acoustic modulation is also linked to the extent of microsaccadic inhibition (Zhao et al., 2019a), pupil dilation (Bala & Takahashi, 2000; Marois et al., 2018), involuntary peripheral muscle responses (Schultz et al., 2021), decreased neural phase-locking to target stimuli (Huang & Elhilali, 2020), and the size of the P3a, an ERP component thought to reflect attentional orienting (Berti et al., 2004; Rinne et al., 2006). These physiological responses are not only driven by acoustic change but also factor in the surrounding context: unpredictable stimuli lead to greater changes in pupil dilation (Friedman et al., 1973; Liao et al., 2018; Milne et al., 2021a; Qiyuan et al., 1985; Southwell et al., 2017; Zhao et al., 2019b) and larger neural responses (Kaya et al., 2020).

In summary, computational modelling and behavioral/physiological research have demonstrated that acoustic change and unpredictability are linked to disruption of behavior and increased arousal. Acoustic factors alone, however, may not be sufficient to explain why certain sounds capture attention. Task-irrelevant comprehensible speech, for example, interferes with task performance more than acoustically matched non-speech sounds (Dorsi et al., 2018; Le Compte et al., 1997; Little et al., 2010; Viswanathan et al., 2014), suggesting that certain linguistic factors additionally play a role in driving attentional orienting. One possible explanation for why speech can so effectively capture attention is that it contains probabilistic regularities on many different levels, including phonemic and semantic patterns, leading to predictions which capture attention when not fulfilled. However, modelling has not addressed the question of whether unpredictability of linguistic features can lead to attentional orienting. This question has also largely not been addressed experimentally, either using physiological or behavioral measures. An important exception is Röer et al. (2019), who found that semantically unexpected words can interfere with visual short-term recall. However, as mentioned above, interference of an auditory stimulus with visual recall can reflect either attentional capture or interference-by-process. For example, as suggested by Röer et al. (2019), the sequence processing necessary for chunking during visual recall could overlap cognitively with the process of integrating a word with its preceding semantic context. Moreover, Röer and Cowan (2021) found that unexpected words in a distractor stream do not interfere with comprehension of a target speech stream. It remains, therefore, an open question whether linguistic surprisal in a task-irrelevant stream of speech can lead to attentional orienting.

Here we demonstrate a method of tracking attentional capture by task-irrelevant speech which can be used to assess the salience of phonemic and semantic surprisal while ruling out the influence of interference-by-process. This approach takes advantage of a well-documented link between increased arousal and expansion of subjective time. Expanded subjective time has been demonstrated due to a wide variety of experimental manipulations of arousal, including administration of methamphetamine to rats (Maricq et al., 1981), emotional content of stimuli (Droit-Volet et al., 2004; Droit-Volet & Meck, 2007; Gil & Droit-Volet, 2012; Lake et al., 2016), breath-holding (Schwarz et al., 2013), artificially raised body temperature (Wearden & Penton-Voak, 2007), and presentation of simple fluctuating stimuli such as clicks and flashes (Buffardi, 1971; Droit-Volet & Wearden, 2002; Ortega & López, 2008; Penton-Voak et al., 1996; Wearden et al., 1999). Moreover, the rate of subjective passage of time and pupil size have been shown to correlate in monkeys (Suzuki et al., 2016). These findings are compatible with models of sub-second time perception featuring an internal central clock (or clocks) which can vary in speed due to changes in internal state (Gibbon, Church, & Meck, 1984; Allman & Meck, 2012; Merchant, Harrington, & Meck, 2013; Allman, Teki, Griffiths, & Meck, 2014). Assessing subjective time, therefore, enables measurement of arousal-induced task bias separately from task performance, which reflects a complex combination of attention, arousal, and process-based interference.

Prior research on arousal and bias in internal timing has presented single short sound events and assessed retrospective time perception. However, we have developed a technique that enables the assessment of ongoing subjective time throughout presentation of a complex sound stream. Participants are asked to tap to the beat of a 2-Hz click track while ignoring the presentation of distracting sounds. An auditory rather than visual pacing signal is used—i.e. clicks rather than flashes—because participants have been reported to tap more consistently to auditory stimuli (Chen, Repp, & Patel, 2002), and so this choice minimizes noise in our data due to intrinsic synchronization variability. Synchronized tapping requires participants to keep track of time so that an upcoming movement can be planned to align with the next click. In a previous paper (Symons et al., 2024), we showed that presentation of distracting sounds and sound changes led to an expansion of subjective time, causing participants to wait for less time before making their next movement. This finding is conceptually similar to the filled duration illusion, in which silent intervals are perceived as being shorter in duration than intervals filled with sensory events (Buffardi, 1971; Ortega & López, 2008; Wearden et al., 2007). A likely explanation is that unexpected sounds and sound changes lead to increased arousal, speeding up internal pacemakers and expanding subjective time (Gibbon et al., 1984). Importantly, larger acoustic changes led to greater temporal distortions: a one-semitone pitch change, for example, did not affect tap timing, but a six-semitone pitch change did, suggesting that the shift in timing was driven by sound salience rather than simple perception of acoustic change. These findings were consistent across online and in-lab participant samples, suggesting that this online paradigm can be used to accurately measure subtle tapping shifts.

This previous study used relatively simple sounds (e.g., complex tones and white noise bursts) that allowed for precise control over variations in the acoustic features of interest. The use of synthesized sounds enabled us to vary individual acoustic dimensions while keeping the stimuli otherwise constant. However, it remains an open question to what extent those results generalize to more naturalistic listening scenarios where sounds vary across multiple acoustic and linguistic features simultaneously. To address this question, here we used this synchronized tapping paradigm to investigate capture of attention by task-irrelevant naturalistic speech. Participants were asked to tap to the beat of a 2-Hz click track while ignoring a continuous stream of narrative speech (an audiobook recording of “The Old Man and the Sea”; Di Liberto et al., 2015; Broderick et al., 2018; Teoh et al., 2019). The degree to which listeners’ tapping deviated from the beat of the click track (tapping asynchrony) provided a measure of temporal distortion. Based on prior work (Symons et al., 2024), we predicted that salient acoustic changes, including changes in intensity, pitch, and spectral shape, would increase autonomic arousal, leading to an overestimation of the passage of time and more negative asynchronies (earlier tapping). To test whether temporal distortions could be elicited by linguistic unpredictability, we computed measures of word frequency, phoneme surprisal, and semantic surprisal, features that have been shown to elicit changes in neural tracking of continuous speech (Broderick et al., 2018, 2022; Gillis et al., 2021; Weissbart et al., 2020).

Experiment 1

Methods

Participants

A sample of 101 participants between the ages of 20 and 41 (M = 28.34 years, SD = 5.83; 65 female, 36 male, 0 non-binary) was recruited from the Prolific (prolific.co) online recruitment platform. Due to the National Institutes of Health (NIH) funding requirements, data on race and ethnicity were collected. We placed no geographic restrictions on Prolific, and therefore, racial and ethnic categories may not have applied to participants outside of the United States. However, we report them here for completeness: From the original sample, 99 participants reported their ethnicity as not Hispanic or Latino and 2 preferred not to report. Forty-five participants were Black or African American, 43 were White, 5 were Asian, 7 were more than one race, and 1 participant preferred not to report.

Automated screening procedures were set to ensure that participants spoke English as their native language and had no known hearing impairments. The experiment was conducted using the online experiment platform Gorilla Experiment Builder (Anwyl-Irvine et al., 2020). Participants were required to complete the experiment on a desktop or laptop with Google Chrome as the web browser and instructed to wear headphones for the duration of the experiment. All experimental procedures were approved by the Ethics Committee in the Department of Psychological Sciences at Birkbeck, University of London. Each participant provided informed consent and received monetary compensation for their participation at a standard rate.

Data from participants who did not report English as one of their native languages on a subsequent questionnaire were excluded from analysis (N = 15). To ensure that participants were complying with task instructions to tap along to the click track, we imposed an additional set of exclusion criteria. We excluded participants who did not tap at all during one or more runs (N = 2), who failed to synchronize with the clicks (N = 9), meaning that they showed no significant clustering of phases across taps according to circ_rtest in the Matlab Circular Statistics Toolbox (Berens, 2009), or whose tapping variability (standard deviation of intervals between tap and click) was greater than 100 ms (N = 17). We also removed any participant whose responses were coarsely quantized (> 15 ms quantization) due to the use of Bluetooth keyboards (which participants were explicitly requested not to use). Compared to wired keyboards, Bluetooth keyboards bin responses in much longer intervals, and do not permit the temporal precision needed to measure small tapping shifts (~4–5 ms; Symons et al., 2024). To identify participants showing coarsely quantized responses, we binned the inter-tap intervals with a 0.1 Hz resolution, computed the autocorrelation function (0–100 ms lags), and then identified peaks in the autocorrelation function (minimum prominence = 0.3). This resulted in the removal of 1 additional participant. Lastly, we removed 1 participant who had < 70% of valid taps for one or more excerpts. Valid taps were those occurring within 250 ms of a click (so if participants stopped tapping temporarily, these missed taps would be considered invalid) and falling within 3 standard deviations of their mean. The final sample consisted of 55 participants ages 20 to 40 (M = 29.09, SD = 6.17, 35 female, 20 male; racial/ethnic status using NIH reporting groupings: 55 not Hispanic or Latino, 20 Black or African American, 29 White, 6 more than one racial grouping). Of this sample, 26 participants reported receiving musical training (ranging from 1–20 years). However, only 10 participants could be classed as musicians based on the 6-year criterion suggested by prior research (Zhang et al., 2020). Therefore, musical training was not factored into the analysis.¹ Additionally, 24 participants reported experience with other languages (with age of second language acquisition ranging from 1 to 35 years).
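
The synchronization criterion described above can be illustrated with a minimal Python sketch of a Rayleigh test on tap phases. This is not the authors' Matlab code: the conversion of asynchronies to phases using the 500-ms click period, the function names, and the closed-form p-value approximation are assumptions made for illustration, and the result should be checked against circ_rtest before being relied upon.

```python
import math

CLICK_PERIOD_MS = 500.0  # 2-Hz click track, from the text


def tap_phases(asynchronies_ms, period_ms=CLICK_PERIOD_MS):
    """Convert tap-click asynchronies (ms) to phases in radians (assumed mapping)."""
    return [2.0 * math.pi * a / period_ms for a in asynchronies_ms]


def rayleigh_test(phases):
    """Return (mean resultant length, Rayleigh z, approximate p-value)."""
    n = len(phases)
    c = sum(math.cos(p) for p in phases)
    s = sum(math.sin(p) for p in phases)
    big_r = math.hypot(c, s)          # resultant vector length
    r = big_r / n                     # mean resultant length, 0..1
    z = big_r ** 2 / n                # Rayleigh statistic
    # Commonly used closed-form approximation to the Rayleigh p-value.
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return r, z, p
```

Under this sketch, a participant whose taps cluster around a consistent phase yields a small p-value, whereas taps spread evenly around the click cycle yield a p-value near 1 and would meet the "failed to synchronize" criterion.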

Stimuli

Tapping sequences.

The continuous speech consisted of an audiobook recording of “The Old Man and the Sea” spoken by a professional male narrator with an American English accent (see Broderick et al., 2018; Teoh et al., 2019). The audiobook was divided into four excerpts, each 2–3 minutes in duration. Each excerpt was presented simultaneously with a 2-Hz isochronous click sequence (Figure 1). Clicks were broadband impulses spanning 10 time points (0.23 ms in duration with a 44.1 kHz sample rate). To ensure that the clicks were audible against the continuous speech, the peak amplitude of the speech was set to be 70% of the click amplitude. Prior to the onset of the continuous speech, 10 clicks presented against silence were added to allow participants time to synchronize their tapping with the click track.
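
The timing and level relationships described above can be checked with a short Python sketch. The constants come from the text; the function names, the normalized click peak of 1.0, and the list-based representation of the speech waveform are hypothetical choices for illustration only.

```python
SAMPLE_RATE_HZ = 44_100      # from the text
CLICK_RATE_HZ = 2.0          # from the text
CLICK_SAMPLES = 10           # samples per click, from the text
SPEECH_TO_CLICK_RATIO = 0.7  # speech peak as a fraction of click peak, from the text


def click_onsets(n_clicks, sample_rate=SAMPLE_RATE_HZ, click_rate=CLICK_RATE_HZ):
    """Sample indices of isochronous click onsets."""
    period = round(sample_rate / click_rate)  # 22050 samples = 500 ms
    return [i * period for i in range(n_clicks)]


def click_duration_ms(n_samples=CLICK_SAMPLES, sample_rate=SAMPLE_RATE_HZ):
    """Duration of one click in milliseconds."""
    return 1000.0 * n_samples / sample_rate


def scale_speech_to_click(speech, click_peak=1.0, ratio=SPEECH_TO_CLICK_RATIO):
    """Rescale speech so its peak amplitude is `ratio` times the click peak."""
    peak = max(abs(x) for x in speech)
    if peak == 0:
        return list(speech)
    gain = ratio * click_peak / peak
    return [x * gain for x in speech]
```

Ten samples at 44.1 kHz give roughly 0.23 ms per click, and the 10 lead-in clicks at 2 Hz span 5 seconds, consistent with the values reported here.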

Figure 1. *Note.* On the left, examples of speech feature vectors, which were extracted across the full duration of each excerpt. Acoustic and linguistic features were extracted from the audiobook and represented as vectors that were time-aligned with the tapping responses. On the right, a schematic of the tapping analysis. Listeners tapped to the beat of a click track (dark grey lines) while ignoring continuous speech in the background (light grey). Taps are represented as dotted red vertical bars. Using the mTRF toolbox, a decoder was trained to predict variations in speech features (amplitude envelope shown here) based on tapping asynchrony, which was represented as a single vector with values at each click time representing the difference between tap and click time.

Acoustic features.

The amplitude envelope and relative pitch of the speech recordings were obtained from previous research using this audiobook (Teoh et al., 2019). The amplitude envelope was extracted by filtering the speech waveform between 80 and 2,800 Hz and computing the absolute value of the Hilbert transform. The envelope was then low-pass filtered (cut-off = 30 Hz) and down-sampled to 128 Hz. This provided a measure of amplitude level. In addition, we computed the change in amplitude across successive time points by calculating the derivative over time. Relative pitch was computed by extracting the fundamental frequency (F0) of the speech signal at 128 Hz and z-scoring it to normalize the pitch based on the speaker’s vocal range. In addition, we computed the change in relative pitch across successive time points by calculating the derivative over time. Spectral centroid and spectral flux were extracted from each speech recording using the spectralCentroid and spectralFlux functions in the Audio Toolbox implemented in Matlab (version 2021a). Spectral centroid describes the center of gravity of the spectrum, while spectral flux measures the change in the spectrum over time. Both spectral measures were computed with a 23.4-ms window and 15.6-ms overlap between successive windows, resampled to a 128 Hz sampling rate, and z-scored.
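
The spectral and change measures described above can be sketched in plain Python. This is an illustrative approximation, not the Matlab Audio Toolbox implementation: the naive discrete Fourier transform, the magnitude-weighted centroid, the Euclidean-norm definition of flux, and the scaling of the derivative by the sampling rate are all assumptions, and the function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mag, fs, n_fft):
    """Magnitude-weighted mean frequency (Hz) of one frame."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum((k * fs / n_fft) * m for k, m in enumerate(mag)) / total


def spectral_flux(prev_mag, mag):
    """Euclidean distance between successive magnitude spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag, prev_mag)))


def derivative(x, fs):
    """First difference scaled to units per second; first sample set to 0."""
    return [0.0] + [(b - a) * fs for a, b in zip(x, x[1:])]
```

Note that a 23.4-ms window with 15.6-ms overlap implies a hop of about 7.8 ms, which corresponds to the 128 Hz rate used for the other features.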

Linguistic features.

All linguistic features were based on written transcripts of the audiobook. Phoneme surprisal was computed using a similar procedure to that reported in Brodbeck et al. (2018). First, phoneme onsets were automatically marked using the Gentle forced aligner (https://lowerquality.com/gentle/). Missing or incorrectly marked phonemes were adjusted by hand. Next, a phonetic dictionary with linked lexical frequencies was assembled by combining information from the SUBTLEX subtitle database (Brysbaert & New, 2009) and the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Any words in the text that were not present in SUBTLEX were given a frequency equal to the lowest possible value. For each phoneme, we calculated two “cohort frequency values”: first, we calculated the summed frequency of all words in the dictionary matching the set of phonemes spanning the beginning of the word to the current phoneme. Second, we calculated the sum of the frequency of all words matching the set of phonemes from the beginning of the word to the previous phoneme. Phoneme surprisal was calculated as the negative log2 of the ratio between the first and the second value. For the first phoneme of a word, phoneme surprisal was the negative log2 of the ratio between the summed frequency of all words in the dictionary beginning with that phoneme and the summed frequency of all words in the dictionary. Finally, phoneme surprisal was stored as a vector with values equal to the surprisal of each phoneme across the duration of the phoneme and zeros elsewhere. Non-zero portions of this vector were z-scored such that the mean of the non-zero values was 0 and the standard deviation was 1, to ensure that the phoneme surprisal analysis did not simply reflect the difference between the presence versus absence of phonemes.
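
The cohort-based calculation described above can be made concrete with a small Python sketch. The toy lexicon, its frequencies, and the function name are invented for illustration; the real analysis used the SUBTLEX and CMU resources named in the text.

```python
import math


def phoneme_surprisal(word_phonemes, lexicon):
    """Surprisal (bits) of each phoneme in a word given its word-initial cohort.

    lexicon: list of (phoneme_tuple, frequency) pairs (hypothetical format).
    """
    previous_cohort = sum(freq for _, freq in lexicon)  # all words, for phoneme 1
    surprisals = []
    for i in range(1, len(word_phonemes) + 1):
        prefix = tuple(word_phonemes[:i])
        # Summed frequency of all words whose first i phonemes match the prefix.
        cohort = sum(freq for phones, freq in lexicon if tuple(phones[:i]) == prefix)
        if cohort == 0:
            raise ValueError("prefix not found in lexicon: %r" % (prefix,))
        surprisals.append(-math.log2(cohort / previous_cohort))
        previous_cohort = cohort
    return surprisals


# Toy lexicon with made-up frequencies.
TOY_LEXICON = [
    (("K", "AE", "T"), 6.0),
    (("K", "AE", "B"), 2.0),
    (("D", "AO", "G"), 8.0),
]
```

In this toy lexicon, the first phoneme of "cat" halves the candidate frequency mass (1 bit), the second phoneme rules nothing out (0 bits), and the final phoneme carries the remaining information. The per-phoneme values sum to the negative log2 probability of the whole word under the lexicon.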

Word frequency was extracted from the SUBTLEX-US database (Brysbaert & New, 2009). Word onsets were marked using Prosodylab-Aligner (Gorman et al., 2011). These markings were obtained from previous studies using this audiobook (Broderick et al., 2018). For each speech recording, we downsampled the recording to 128 Hz and extracted the time points corresponding to each word onset and offset. A custom Matlab script then searched for each word in the database. Word frequency was stored as a vector with values equal to the frequency of each word, with the value only changing with the onset of the next word. Non-zero portions of the vector were z-scored prior to analysis. Words that were not found in the database were set to the minimum word frequency value in the database. Contractions such as “aren’t” are included in the database without the apostrophe (i.e., listed as “arent”). However, it was not clear how contractions that form other words when the apostrophe is removed (e.g., “I’ll” or “we’re”) are represented in the database. Since these instances were rare in the speech recordings we used, these words were ignored in the analysis. A full list of words not found in the database or excluded from analysis here can be found in the Supplementary Materials (Table S2.1).

The semantic surprisal measure was obtained from previous research using this audiobook (Anderson et al., 2024). The text corresponding to each speech recording was passed to OpenAI’s GPT-2, which computed a single surprisal value for each word based on the preceding context (up to 1024 words). Each value represented the negative log probability estimate of each word. Semantic surprisal values were time-aligned with word onset and stored as a vector with surprisal values lasting the duration of the word, with the value changing at the onset of the next word. Non-zero portions of the vector were z-scored prior to analysis.
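
The vector representation shared by the linguistic features (values held across each unit, z-scored over non-zero samples only) can be sketched as follows. The event format, function names, and use of the sample standard deviation (n − 1 denominator) are assumptions for illustration.

```python
import statistics

FS_HZ = 128  # feature sampling rate, from the text


def step_vector(events, duration_s, fs=FS_HZ):
    """Build a vector holding each event's value from its onset to its offset.

    events: list of (onset_s, offset_s, value) tuples (hypothetical format).
    """
    n = round(duration_s * fs)
    vec = [0.0] * n
    for onset, offset, value in events:
        start = round(onset * fs)
        stop = min(round(offset * fs), n)
        for i in range(start, stop):
            vec[i] = value
    return vec


def zscore_nonzero(vec):
    """Z-score only the non-zero samples; zeros (no event) are left at zero."""
    values = [v for v in vec if v != 0.0]
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption
    return [(v - mu) / sd if v != 0.0 else 0.0 for v in vec]
```

Standardizing only the non-zero samples means the resulting values express how surprising or frequent a unit is relative to other units, rather than the contrast between speech and silence.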

Procedure

Upon signing up to the study, participants were provided with a link to the experiment. After providing informed consent, participants completed a demographics questionnaire in which they reported their age, gender, language background, and musical experience. On-screen instructions were then provided. Participants were told that they would hear a series of clicks against some background sounds and instructed to tap to the beat of the clicks by pressing the ‘j’ key on the keyboard while ignoring the background speech. An example sequence of clicks presented against silence was provided to allow participants to practice tapping to the clicks. During the main task, participants heard each excerpt with the order of the four runs (and thus the order of stimulus presentation) randomized across participants. At the end of the experiment, participants were asked whether they experienced any technical issues that could have affected their performance on the task. No technical issues were reported. This experiment lasted approximately 20 minutes.

Data processing and analysis

Tapping asynchrony.

Sound timing and participant response times were recorded in Gorilla (gorilla.sc). Custom Matlab scripts extracted the sound offset (relative to the start of the run) and subtracted the known sound duration from it to measure the sound delay for each run (with each run consisting of one 2–3-minute speech recording with clicks) and each individual. Participants’ taps were extracted for each run and the difference between participants’ tap time and the nearest click onset (tap-click asynchrony) was computed. The true asynchrony between participants’ tap time and the click onsets could not be reliably recorded due to variations in the computer setup. To account for this variability, we subtracted the mean tap-click asynchrony across the entire run from the tap-click asynchrony at each time point. Instances in which there was no tap within ±250 ms of a click onset were classified as missing taps and excluded from analysis. Likewise, taps greater than 3 standard deviations from the participant’s mean tapping asynchrony for a given run were removed from analysis. Of the participants included, the percentage of taps removed on this basis ranged from 0.28% to 5.41% (M = 1.99%, SD = 1.42%). Only participants with > 70% of valid taps were included in the analysis. Taps within the first 5 seconds (10 clicks prior to the onset of the speech) were not included in the analysis.
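
The asynchrony pipeline described above can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions: asynchronies are matched per click rather than per tap, and centering is taken as each asynchrony minus the run mean, so that negative values indicate earlier-than-average taps.

```python
import statistics

WINDOW_S = 0.25  # taps must fall within +/-250 ms of a click, from the text


def tap_click_asynchronies(tap_times, click_times, window=WINDOW_S):
    """For each click, return (nearest tap - click) in seconds, or None if missing."""
    result = []
    for click in click_times:
        if not tap_times:
            result.append(None)
            continue
        nearest = min(tap_times, key=lambda t: abs(t - click))
        diff = nearest - click
        result.append(diff if abs(diff) <= window else None)
    return result


def clean_asynchronies(asynchronies, n_sd=3.0):
    """Drop outliers beyond n_sd SDs of the run mean, then mean-center the rest."""
    valid = [a for a in asynchronies if a is not None]
    mu = statistics.fmean(valid)
    sd = statistics.stdev(valid)
    return [
        None if a is None or abs(a - mu) > n_sd * sd else a - mu
        for a in asynchronies
    ]


def valid_fraction(asynchronies):
    """Proportion of clicks with a usable tap."""
    return sum(a is not None for a in asynchronies) / len(asynchronies)
```

Centering on the run mean removes the constant, setup-dependent latency of each participant's hardware, so that only within-run fluctuations in tap timing enter the analysis.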

Stimulus reconstruction.

To determine the relationship between the features of continuous speech and tapping asynchrony, a linear model was trained to reconstruct an estimate of each feature separately based on tapping asynchrony using the multivariate temporal response function (mTRF) toolbox (Crosse et al., 2016) implemented in Matlab (Version 2023b). To do this, we treated the tapping data as a vector consisting of a series of impulses at the click time points and zeros at all other time points, with the amplitude of each impulse equal to the corresponding tapping asynchrony. Non-zero portions of the tapping vector were z-scored prior to analysis. This tapping vector was then used to predict the speech features in the two seconds preceding the click using the ‘backwards’ option in the mTRF toolbox. A nested cross-validation procedure was used to identify the optimal regularization parameter and then test the model. First, we divided the data from the four runs into a training set consisting of 3 runs and a test set consisting of 1 run. Then a leave-one-out cross-validation procedure was conducted on the training data over time lags from zero to a maximum of two seconds, with the time lags selected based on previous research (Symons, Dick, & Tierney, 2024). This procedure was used to obtain the optimal regularization parameter within the range of 2¹ to 2¹⁰. The model was then trained using the optimal regularization parameter. To evaluate the performance of the model, we determined the accuracy with which the model predicted speech features based on tapping asynchrony in the test data. The similarity between the predicted and observed data was computed using Spearman’s correlation coefficient. This process was repeated four times, with each run taking its turn serving as the test data and the remaining three runs serving as the training data (e.g., model 1 was trained on runs 2–4 and tested on run 1). The prediction accuracy (rho value) of the model and model coefficients were averaged across all four models.
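
The data layout for this analysis can be sketched in Python. The sketch covers only the construction of the tapping impulse vector, the outer leave-one-run-out folds, and the lag and regularization grids; it does not implement the regularized regression performed by the mTRF toolbox. The function names and the choice of integer exponents for the regularization grid are assumptions.

```python
FS_HZ = 128          # sampling rate of the feature vectors, from the text
MAX_LAG_S = 2.0      # maximum time lag, from the text

# Lags from 0 to 2 s inclusive at 128 Hz.
N_LAGS = int(MAX_LAG_S * FS_HZ) + 1

# Candidate regularization values; integer exponents are an assumption.
LAMBDAS = [2.0 ** k for k in range(1, 11)]


def impulse_vector(click_times, asynchronies, duration_s, fs=FS_HZ):
    """Zeros everywhere except at click samples, which hold the tap asynchrony."""
    n = round(duration_s * fs)
    vec = [0.0] * n
    for t, a in zip(click_times, asynchronies):
        if a is None:  # missing or excluded tap
            continue
        i = round(t * fs)
        if 0 <= i < n:
            vec[i] = a
    return vec


def leave_one_run_out(n_runs=4):
    """Outer folds: each run serves once as test data, the rest as training data."""
    return [([r for r in range(n_runs) if r != test], test) for test in range(n_runs)]
```

With four runs this yields four train/test splits, and the regularization parameter would be selected by a further cross-validation within each set of three training runs.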

Statistical analysis.

To test whether variations in tapping asynchrony predicted the stimulus features of interest, we conducted a Monte Carlo analysis. During each of 1000 iterations, we randomly shuffled the tapping data within each excerpt for each participant. We then ran the temporal response function analysis reported above on the shuffled tapping data, computing the resulting prediction accuracy for each participant. Next, we took the median prediction accuracy across participants, resulting in a null distribution of median cross-participant prediction accuracy over the 1000 iterations. Finally, we compared the median of the prediction accuracy across participants for the original data to this null distribution. P-values, therefore, represent the probability of the observed rho-value given the distribution of rho-values obtained when shuffling the tapping data.
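
The logic of this Monte Carlo analysis can be sketched in Python. The `accuracy_fn` argument is a hypothetical stand-in for the full temporal response function analysis, and the one-sided comparison with a +1 correction is an assumption rather than a detail reported in the text.

```python
import random
import statistics


def null_distribution(tapping_by_participant, accuracy_fn, n_iter=1000, seed=0):
    """Null distribution of the across-participant median prediction accuracy.

    tapping_by_participant: list (participants) of lists (excerpts) of tap values.
    accuracy_fn: maps one participant's shuffled excerpts to a prediction accuracy.
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_iter):
        accuracies = []
        for excerpts in tapping_by_participant:
            # Shuffle tap values within each excerpt, keeping excerpts separate.
            shuffled = [rng.sample(excerpt, len(excerpt)) for excerpt in excerpts]
            accuracies.append(accuracy_fn(shuffled))
        null.append(statistics.median(accuracies))
    return null


def permutation_p_value(observed, null_values):
    """One-sided p-value: share of null values at least as large as observed."""
    n_extreme = sum(1 for v in null_values if v >= observed)
    return (n_extreme + 1) / (len(null_values) + 1)
```

Shuffling within excerpts preserves each participant's overall distribution of asynchronies while destroying their temporal alignment with the speech, which is the relationship under test.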

Prior work has shown that temporal distortions occur between 250–750 ms following salient acoustic changes (Symons, Dick, & Tierney, 2024). To examine the time course of temporal distortions elicited by variations in continuous speech, we examined the behavioral temporal response function (behavioral TRF), which represents the degree to which tapping asynchrony predicted variations in each feature at each time point (within the 2-second time window) preceding the tap. One-sample Wilcoxon signed-rank tests compared model coefficients to zero across participants at each time point within the 2-second time window preceding the tap to establish the time course of the tapping shift. P-values were corrected for multiple comparisons across time (257 time points; Benjamini & Hochberg, 1995).
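
The false discovery rate correction across time points can be illustrated with a minimal Python sketch of the Benjamini-Hochberg step-up procedure. The function name and the alpha level of 0.05 are assumptions; the text does not state the threshold used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject
```

In the present analysis, the input would be the 257 p-values (one per time lag) for a given feature, and the rejected entries would correspond to the lags marked as significant.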

There was a significant correlation between amplitude envelope and many of the other stimulus features (see Supplementary Materials, Figure S1.1). Therefore, any effects observed for the other stimulus features could be partially driven by amplitude. To ensure that the effects observed for other features were not solely driven by amplitude, we ran a follow-up analysis covarying out amplitude. To do this, we constructed a linear regression using amplitude to predict each of the other features and extracted the residuals from the model. We then conducted the same statistical analyses as above, but with the residuals from the regression model as the dependent variable.
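
The covariate step described above amounts to replacing each feature with the residuals of a simple regression on amplitude. A minimal Python sketch, assuming ordinary least squares with an intercept and a hypothetical function name:

```python
import statistics


def residualize(feature, amplitude):
    """Residuals of `feature` after regressing it on `amplitude` (with intercept)."""
    mean_x = statistics.fmean(amplitude)
    mean_y = statistics.fmean(feature)
    sxx = sum((x - mean_x) ** 2 for x in amplitude)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(amplitude, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(amplitude, feature)]
```

By construction the residuals are linearly uncorrelated with amplitude, so any remaining relationship with tapping asynchrony cannot be attributed to a linear effect of amplitude alone.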

Transparency and Openness

This study was not preregistered. The sample size was determined based on previous research (Symons, Dick, & Tierney, 2024). All data inclusion criteria, manipulations, and measures are reported here. Stimulus vectors, anonymous data, and analysis code are available on OSF (https://osf.io/pc5tw/; Symons et al., 2026).

Results

Tapping asynchrony significantly predicted amplitude, amplitude change, pitch change, spectral flux, and phoneme surprisal (Figure 2A). Across features, tapping asynchrony was most consistently linked to speech features occurring 550–600 ms before the click (Figure 2B). Moments in the speech featuring high amplitude, for example, were linked to earlier tapping roughly half a second later. Earlier tapping was also linked to moments of high acoustic change in the speech half a second earlier, including changes in amplitude, pitch, and spectral shape (frequency content). These results suggest that acoustic change increases physiological arousal, with a lag of around 500 ms, causing participants to experience an expansion of subjective time and, therefore, tap earlier.

*Note.* (A) Prediction accuracy. The dashed line shows the correlation coefficient (Spearman’s rho) representing the relationship between the time series of each speech feature and the predicted time series based on tapping asynchrony as estimated by the mTRF model. Histograms show the permutation-generated null distribution of rho values, representing the relationship between the time series of each speech feature and the time series predicted by the shuffled tapping data. P-values represent the probability of the observed rho value given the distribution of rho values obtained when shuffling the tapping data. (B) Behavioral temporal response function (TRF). Coefficients (in arbitrary units) represent the degree to which tapping asynchrony predicts the speech feature at each time lag. Along the x-axis, the zero time lag indicates the onset of the click to which participants were attempting to synchronize. Along the y-axis, a positive coefficient indicates that a higher value (e.g., larger amplitude) is associated with later tapping, while a negative coefficient indicates that a higher value is associated with earlier tapping. Thick blue lines represent lags at which the coefficients significantly differ from zero (with FDR correction for multiple comparisons).

Importantly, effects of speech characteristics on tapping asynchrony were not limited to acoustic measures but extended to linguistic characteristics: low phonemic predictability also led participants to tap sooner. Therefore, failure of linguistic prediction led to faster tapping roughly a half second later, suggesting an expansion of subjective time due to increased arousal.

To ensure that the links between speech features and tapping speed were not simply driven by variations in amplitude, we ran a follow-up analysis covarying out amplitude. Figure 3 shows the median prediction accuracy versus a histogram of the null distribution of prediction accuracies, as well as model coefficients, when covarying for amplitude. Effects of amplitude and pitch change (both highly correlated with amplitude, see Supplementary Materials) were no longer significant. However, even when covarying out amplitude, tapping asynchrony was significantly linked to both preceding spectral flux and phoneme surprisal. Both acoustic change and linguistic surprisal, therefore, continued to predict tapping speed, even once the effects of amplitude were controlled for.

*Note.* (A) Prediction accuracy when covarying out amplitude. Amplitude was covaried out by regressing each speech feature against amplitude and extracting the residuals. The dashed line shows the correlation coefficient (Spearman’s rho) representing the relationship between the time series of each speech feature (with amplitude removed) and the predicted time series based on tapping asynchrony as estimated by the mTRF model. Histograms show the permutation-generated null distribution of rho values, representing the relationship between the time series of each speech feature (with amplitude removed) and the time series predicted by the shuffled tapping data. P-values represent the probability of the observed rho value given the distribution of rho values obtained when shuffling the tapping data. (B) Behavioral temporal response function (TRF). Coefficients (in arbitrary units) represent the degree to which tapping asynchrony predicts the speech feature after removing the contribution of amplitude at each time lag. Along the x-axis, the zero time lag indicates the onset of the click to which participants were attempting to synchronize. Along the y-axis, a positive coefficient indicates that a higher residual value (e.g., higher pitch after accounting for amplitude) is associated with later tapping, while a negative coefficient indicates that a higher residual value is associated with earlier tapping. Thick blue lines represent lags at which the coefficients significantly differ from zero (with FDR correction for multiple comparisons).

Discussion

We find that acoustic changes in task-irrelevant speech, including changes in amplitude, pitch, and spectral shape, are linked to distortions in subjective time, as measured with a synchronized tapping paradigm. Our prediction, based on the results of our previous experiment (Symons et al., 2024), was that acoustic changes would be linked to earlier tapping 250–750 ms later. The functions relating tapping asynchrony to acoustic change are roughly consistent with these predictions: for amplitude change, pitch change, and spectral flux, greater change was linked to earlier tapping between 500 and 1250 ms later. This pattern suggests that acoustic change led to an increase in arousal arising within around 500 ms, expanding subjective time for around 750 ms before returning to baseline. We also find that sounds with greater amplitude are linked to earlier tapping in the same time range, suggesting that louder sounds are more salient, leading to greater attentional orienting, increased arousal, and expanded subjective time.

Importantly, we find that the link between tapping asynchrony and characteristics of task-irrelevant speech is not limited to acoustic features. There was a robust relationship between phonemic surprisal and asynchrony, such that greater surprisal was linked to earlier tapping. The time course of this effect closely aligned with the time course of the effect of amplitude on tapping; however, phoneme surprisal and amplitude were only weakly correlated (r_s = 0.07), and the phoneme surprisal effect remained significant even after covarying for amplitude. This finding suggests that phonemic surprisal captures attention, increasing arousal and expanding subjective time. We did not find any significant relationship between semantic surprisal and time perception; however, this null result could reflect a lack of statistical power and so should be interpreted with caution. The absence of a semantic surprisal effect on tapping performance conflicts somewhat with the finding of Röer et al. (2019) that semantic unpredictability in speech can interfere with the performance of a concurrent visual serial memory task, but as the authors of that paper suggest, this could reflect interference by process between semantic integration and tracking of serial order. Our results also conflict somewhat with Kothinti & Elhilali (2023), who found that semantic surprisal in non-linguistic auditory scenes was a predictor of perceptual salience, as measured via salience ratings.

Our primary framework for explaining our results is that they reflect fluctuations in arousal, which have been shown to be linked to expansions and contractions in subjective time (Maricq et al., 1981; Droit-Volet & Meck, 2007; Wearden & Penton-Voak, 2007; Schwartz et al., 2013). However, an alternative explanation is that acoustic events in the period just before a click are perceptually fused with the click onset, resulting in a hybrid percept with an earlier time of onset. This could cause participants to perceive that their tapping is later than the hybrid perceived click, leading them to make their next movement earlier in time. Similar effects of integration of auditory events on the phase of synchronized tapping are demonstrated in Repp (2004), with a fixed window for temporal integration of around 120 ms. This perceptual fusion account could explain why the functions relating amplitude change, spectral flux, and amplitude level all peak just before the previous click (at 500 ms). However, this account would have difficulty explaining why phoneme surprisal is linked to tapping asynchrony, given that correlations between phoneme surprisal and acoustic features were rather weak (r_s = 0.10 or lower; see Figure S1.1 in the Supplementary Materials). Nevertheless, to rule out this explanation, we ran a follow-up experiment in which the clicks and the task-irrelevant speech were presented in separate ears, preventing perceptual fusion. This additional experiment also enabled us to determine the replicability of the shape of the functions relating speech features to tapping over time.

Experiment 2

Overview

Our results from Experiment 1 showed consistent shifts in synchronized tapping across multiple acoustic and linguistic features. The time course of this effect aligned with our previous research using simpler sounds (250–750 ms; Symons et al., 2024). However, because the clicks and speech were presented in the same ear, we could not rule out the possibility that the observed tapping shifts were driven by perceptual fusion of the speech with clicks as opposed to increases in arousal. Therefore, we conducted a second experiment aimed at (i) ruling out the possibility that tapping shifts were driven by perceptual fusion and (ii) determining the replicability and generalizability of the relationship between stimulus dynamics and tapping shifts. To this end, Experiment 2 aimed to replicate the results of Experiment 1 with a new sample of participants and new audiobook recordings. Listeners tapped to the beat of a click track while ignoring excerpts from “The Old Man and the Sea” (different from the excerpts used in Experiment 1). To determine whether the effects observed in Experiment 1 were driven by perceptual fusion, Experiment 2 presented clicks and speech in opposite ears. Half of the participants heard clicks in the left ear and speech in the right, while the other half heard clicks in the right ear and speech in the left. If the previous results were driven by perceptual fusion, there should be no relationship between tapping asynchrony and the features present in continuous speech. However, if variations in continuous speech distort internal timekeeping by changing physiological arousal, tapping asynchrony should predict acoustic and linguistic features even when the two streams are presented in opposite ears.