Abstract
Sound structures such as phonemes and words have highly variable durations. Therefore, there is a fundamental difference between integrating across absolute time (for example, 100 ms) versus sound structure (for example, phonemes). Auditory and cognitive models have traditionally cast neural integration in terms of time and structure, respectively, but the extent to which cortical computations reflect time or structure remains unknown. Here, to answer this question, we rescaled the duration of all speech structures using time stretching and compression and measured integration windows in the human auditory cortex using a new experimental and computational method applied to spatiotemporally precise intracranial recordings. We observed slightly longer integration windows for stretched speech, but this lengthening was very small (~5%) relative to the change in structure durations, even in non-primary regions strongly implicated in speech-specific processing. These findings demonstrate that time-yoked computations dominate throughout the human auditory cortex, placing important constraints on neurocomputational models of structure processing.
Subject terms: Cortex, Neural encoding
Temporal integration throughout the human auditory cortex is predominantly locked to absolute time and does not vary with the duration of speech structures such as phonemes or words.
Main
Natural sounds are composed of hierarchically organized structures that span many temporal scales, such as phonemes, syllables and words in speech1, or notes, contours and melodies in music2. Understanding the neural computations of hierarchical temporal integration is therefore critical to understanding how people perceive and understand complex sounds such as speech and music3–6. Neural computations in the auditory cortex are constrained by time-limited ‘integration windows’, within which stimuli can alter the neural response and outside of which stimuli have little effect7. These integration windows grow substantially as one ascends the auditory hierarchy7,8 and are thought to have a central role in organizing hierarchical computation in the auditory system5,9.
A major, unanswered question is whether neural integration in the auditory cortex is tied to absolute time (for example, 100 ms; time-yoked integration) or sound structures (for example, a phoneme; structure-yoked integration). Time-yoked and structure-yoked integration reflect fundamentally different computations because sound structures such as phonemes, syllables and words are highly variable in duration10 (Fig. 1a and Extended Data Fig. 1). The duration of a phoneme, for example, can vary by a factor of four or more across talkers and utterances (Fig. 1b). Therefore, if the auditory cortex integrated across speech structures such as phonemes or words—or sequences of phonemes or words—then the effective integration time would necessarily vary with the duration of those structures.
Fig. 1. Distinguishing time-yoked and structure-yoked neural integration.
a, Histogram of durations for five example phonemes (ARPAbet notation) across a large speech corpus (LibriSpeech), illustrating the substantial durational variability of speech structures. Each line corresponds to a single phoneme (Extended Data Fig. 1 plots all phonemes). b, Histogram of the durational variability across all phonemes (median value, 4.33). Durational variability was measured by computing the central 95% interval from the duration histogram for each phoneme (a) and then computing a ratio between this interval’s upper and lower bounds. This figure plots a histogram of this ratio across all 39 phonemes. c, Schematic illustration of the TCI paradigm used to measure integration windows. In this paradigm, each stimulus segment (here, speech) is presented in two different ‘contexts’, where context is simply defined as the stimuli that surround a segment. This panel illustrates two contexts: one in which a segment is part of a longer segment and surrounded by its natural context (left), and one in which a segment is surrounded by randomly selected alternative segments of the same duration (right) (concatenated using crossfading). The top versus bottom panels illustrate predictions when the window is shorter versus longer than the shared segment. If the window is shorter than the segment (top), there will be a moment when the window is fully contained within the shared segment. At that moment, the response will be the same between the two contexts because the context falls outside the window. By contrast, if the integration window is longer than the segment duration (bottom), the context can always alter the response. d, Schematic illustration of the effect of time compression and stretching on a time-yoked (left, green) versus structure-yoked (right, purple) integration window. Compression and stretching rescale the duration of all speech structures and should therefore compress or stretch the integration window if it reflects structure but not time. e, Schematic illustration of the primary hypotheses tested in this study. Each circle plots the hypothesized integration window (logarithmic scale) for stretched (x axis) and compressed (y axis) speech for a single electrode, with color indicating the electrode’s anatomical location. Time-yoked integration windows will be invariant to stretching or compression (green line), while structure-yoked windows will scale with the magnitude of stretching and compression, leading to a shift on a logarithmic scale (purple line). The left panel (hypothesis 1) illustrates the predicted organization if neural integration windows increased from primary to non-primary regions but remained time-yoked throughout the auditory cortex. The right panel (hypothesis 2) illustrates the predicted organization if there were a transition from time-to-structure-yoked integration between primary and non-primary regions. Struct. dur., structure duration.
Extended Data Fig. 1. Phoneme duration statistics.
a, Histograms showing the distribution of phoneme durations for all 39 phonemes in the LibriSpeech corpus. Phonemes have been collected into five groups that capture some of the diversity across classes. Figure 1a shows a representative example from each group. b, Variability index for all phonemes broken down by group. Durational variability was measured by computing the central 95% interval from the duration histogram for each phoneme (panel a) and then computing a ratio between this interval’s upper and lower bound. Box plots show the median, 25th, and 75th percentile of the distribution for each group. c, Median duration and variability index for all phonemes, colored by the same groups shown in all other panels. Phonemes listed in this figure (ARPAbet notation): Voiced obstruent – B, D, G, V, DH, Z, ZH, JH, Voiceless obstruent – P, T, K, F, TH, S, SH, CH, HH, Sonorant Consonant – M, N, NG, L, R, W, Y, Diphthong – AY, AW, OY, EY, OW, Monophthong – IY, IH, UH, UW, EH, AH, AE, AA, AO, ER.
Currently, little is known about the extent to which neural integration in the auditory cortex reflects time or structure. Auditory models have typically assumed that neural integration is tied to absolute cortical timescales8,11–17; for example, by modeling cortical responses as integrating spectral energy within a fixed spectrotemporal receptive field8,13–15,17,18 (STRF). By contrast, cognitive and psycholinguistic models have often assumed that information integration is tied to abstract structures such as phonemes or words4,19–26. Distinguishing between time-yoked and structure-yoked integration is therefore important for relating auditory and cognitive models, building more accurate neurocomputational models of auditory processing and interpreting findings from the prior literature.
Distinguishing time-yoked and structure-yoked integration has been difficult in part because of methodological challenges. Temporal receptive field models have been used to characterize selectivity for many different acoustic features (for example, STRFs) and speech structures20,25–28. These models, however, implicitly assume time-yoked integration even when applied to sound structures (for example, phoneme labels), require strong assumptions about the particular features or structures that underlie a neural response, and cannot account for nonlinear computations, which are prominent in the auditory cortex29–34 and are likely critical to structure-yoked computations35. In addition, standard human neuroimaging methods have poor spatial (for example, electroencephalography (EEG)) or temporal (for example, functional MRI) resolution, both of which are critical for measuring and mapping integration windows in the human auditory cortex.
To address these limitations, we measured integration windows using intracranial recordings from human neurosurgical patients, coupled with a recently developed method for measuring integration windows from nonlinear systems, which does not depend on any assumptions about the features that underlie the neural response or the nature of the stimulus–response mapping (the temporal context invariance (TCI) paradigm)7 (Fig. 1c). We then tested whether neural integration windows varied with speech structures by rescaling the duration of all structures using stretching and compression (preserving pitch). Prior studies have measured the degree to which the overall neural response timecourse compresses or stretches with the stimulus36,37, but this type of ‘timecourse rescaling’ could be caused by changes in the stimulus rather than changes in the neural integration window, as we demonstrate. By contrast, we show that our approach can clearly distinguish time-yoked and structure-yoked integration, including from deep artificial neural network (DANN) models that have complex and nonlinear stimulus–response mappings.
We used uniform stretching and compression in our study, even though it is not entirely natural, because it rescales the duration of all speech structures by the same amount. As a consequence, a structure-yoked integration window should rescale with the magnitude of stretching or compression, irrespective of the particular structures that underlie the window, while a time-yoked window should be invariant to stretching and compression (Fig. 1d). In a second experiment, we replicated our key finding using naturally faster or slower speech. We tested two primary hypotheses (Fig. 1e): (1) integration windows increase hierarchically in non-primary regions but remain yoked to absolute time throughout the auditory cortex, and (2) integration windows become structure-yoked in non-primary auditory cortex.
Results
TCI effectively distinguishes time-yoked versus structure-yoked integration
Integration windows are often defined as the time window within which stimuli alter a neural response and outside of which stimuli have little effect7,38. This definition is simple and general and applies regardless of the particular features that underlie the response or the nature of the stimulus–response mapping (for example, linear versus nonlinear). To estimate the integration window, we measure responses to sound segments surrounded by different ‘context’ segments7 (Fig. 1c). Although context has many meanings39, here, we operationally define context as the stimuli that surround a segment. If the integration window is less than the segment duration, there will be a moment when it is fully contained within each segment and thus unaffected by the surrounding context. By contrast, if the integration window is larger than the segment duration, the surrounding segments can alter the response. We can therefore estimate the integration window as the smallest segment duration yielding a context-invariant response. Our stimuli consist of segment sequences presented in a pseudorandom order, with shorter segments being excerpted from longer segments. As a consequence, we can compare contexts in which a segment is part of a longer segment, and thus surrounded by its natural context (Fig. 1c, left), with contexts in which a segment is surrounded by random other segments (Fig. 1c, right).
We assessed context invariance through the ‘cross-context correlation’7 (Fig. 2a). We aligned and compiled the response timecourses surrounding all segments as a segment-by-time matrix, separately for each context. We then correlated the corresponding columns across the segment-aligned matrices for each context. Before segment onset, the cross-context correlation should be approximately zero since the integration window must overlap the preceding context segments, which are independent across contexts. As the integration window begins to overlap the shared central segments, the cross-context correlation will rise, and if the window is less than the segment duration, there will be a lag when the response is the same across contexts, yielding a correlation of 1.
Fig. 2. Validating approach using computational models.
a, Schematic of the cross-context correlation analysis used to measure context invariance for one segment duration. Response timecourses for all segments are organized as a segment-by-time matrix, separately for each of the two contexts. Each row contains the response timecourse to a different segment, aligned to segment onset. The central segments are the same across contexts, but the surrounding segments differ. The gray region highlights the time window when the shared segments are present. To determine whether the response is context-invariant, we correlate corresponding columns across matrices from different contexts (‘cross-context correlation’), illustrated by the linked columnar boxes. For each box, we plot a schematic of the integration window at that moment in time. At the start of the shared segments (first box pair), the integration window will fall on the preceding context segments, which are random, yielding a cross-context correlation near zero. If the integration window is less than the segment duration, there will be a lag when the integration window is fully contained within the segment, yielding a context-invariant response and a correlation of 1 (second box pair). b, The cross-context correlation for stretched and compressed speech from two computational models: a model that integrates spectrotemporal energy within a time-yoked window (STRF, top) and a model that integrates phonemes within a structure-yoked window (phoneme integration, bottom). c, Structure-yoking index measuring the change in integration windows between stretched and compressed speech on a logarithmic scale (logis – logic), relative to the change in structure durations between stretched and compressed speech (log(Δd)). The index provides a graded metric measuring the extent of time-yoked (index = 0) versus structure-yoked (index = 1) integration. This panel plots the distribution of structure-yoking indices for STRF and phoneme integration models. Models were fit to electrode responses, and violin plots show the distribution across electrodes. d, Schematic of a DANN speech recognition model (DeepSpeech2) used to assess whether the TCI paradigm can distinguish time-yoked and structure-yoked integration in a complex, multilayer, nonlinear model. The DANN consists of eight layers (two convolutional, five recurrent layers, one linear readout), trained to transcribe text from a speech mel spectrogram without any stretching or compression. Each layer contains model units with a response timecourse to sound. Integration windows were estimated by applying the TCI paradigm to these unit timecourses. e, Distribution of structure-yoking indices (violin plots) across all units from each layer for both a trained and untrained DANN model. f, Distribution of overall integration windows, averaged across stretched and compressed speech, for all model units from each layer for both a trained and untrained DANN.
We tested whether we could use our paradigm to distinguish time-yoked and structure-yoked integration by applying the cross-context correlation analysis to responses from computational models. We began by testing two simple models for which there was a clear ground truth. The first model was a STRF model that linearly integrated spectrotemporal energy within a fixed spectrotemporal window. The second model linearly integrated phonemic labels within a window that varied with the speech rate, increasing for stretched speech and decreasing for compressed speech. To make the windows more similar to those for neural data, they were fit using a dataset of neural responses to natural speech alone. The phoneme window was then stretched and compressed by interpolation when predicting responses to stretched and compressed speech, respectively (the STRF model was unchanged). The fitting step is not critical to the interpretation of the analysis and was done only to make the windows more similar to those that would typically be found in the human auditory cortex. We used a stretching and compression factor of √3 in both our model simulations and neural experiments because this factor maintains high intelligibility36,40 while still producing a large, threefold difference in structure durations between compressed and stretched speech.
For both models, the cross-context correlation rose and fell at segment onset and offset and increased with the segment duration, as predicted. Notably, the STRF cross-context correlation was nearly identical for stretched and compressed speech, consistent with a time-yoked integration window (Fig. 2b, top panel), while the phoneme cross-context correlation was lower and more delayed for stretched speech (Fig. 2b, bottom panel), consistent with a longer integration window. We quantified these effects by computing a structure-yoking index that varied between 0 (time-yoked) and 1 (structure-yoked), calculated by dividing the change in integration windows on a logarithmic scale by the threefold change in structure duration between stretched and compressed speech (Fig. 2c and Extended Data Fig. 2). Numerically, the structure-yoking index can fall outside the range of 0–1 if the integration window for stretched speech is smaller than that for compressed speech (structure yoking of <0) or if the integration window for stretched speech is more than three times larger than that for compressed speech (structure yoking of >1); however, such values were rare. Integration windows were computed as the smallest segment duration needed to achieve a threshold invariance level, as measured by the peak cross-context correlation across lags (threshold set to 0.75; results were robust to the threshold value). We consistently observed structure-yoking values near 0 for our STRF model responses and structure-yoking values near 1 for our phoneme integration models, verifying that we can distinguish time-yoked and structure-yoked integration from ground-truth models.
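As a concrete reference, the index can be computed directly from the two estimated integration windows. The sketch below is a minimal illustration of this calculation (the function name and example values are ours, not from the study); with a threefold change in structure duration, the normalizer is log 3.

```python
import numpy as np

def structure_yoking_index(window_stretched, window_compressed, duration_ratio=3.0):
    """Change in integration window on a logarithmic scale between stretched and
    compressed speech, normalized by the log change in structure durations.
    Values near 0 indicate time-yoked and values near 1 structure-yoked integration."""
    return (np.log(window_stretched) - np.log(window_compressed)) / np.log(duration_ratio)

# Illustrative values (in seconds): a nearly time-yoked electrode
print(structure_yoking_index(0.150, 0.140))  # ~0.06
```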
Extended Data Fig. 2. Structure yoking for phoneme and STRF models estimated using a parametric Gamma window.

We used a simple, model-free approach to estimate integration windows from the STRF and phoneme integration models since the model responses are noise-free and not constrained by limited data. Here, we show that structure-yoking values are similar when estimated using the parametric Gamma window that was applied to the neural data. The format is the same as in Fig. 2c.
We next examined whether we could detect a transition from time-to-structure-yoked integration from a DANN model (DeepSpeech2) that had been trained to recognize speech structure from sound (Fig. 2d) (trained to transcribe speech from a mel spectrogram). DANNs trained on challenging tasks have been shown to learn nonlinear representations that are predictive of cortical responses and replicate important aspects of hierarchical cortical organization33,34,41–43, and thus provide a useful testbed for evaluating new methods and generating hypotheses for neural experiments. Moreover, unlike our STRF and phoneme models, the stimulus–response mapping in these models is complex and nonlinear and therefore provides a more challenging setting in which to measure neural integration windows and hence a stronger test of our method. We measured integration windows using the same procedure just described, but applied to the response of each unit from each layer of the trained DANN model. Importantly, the DANN model was only ever trained on natural speech, so any structure yoking present in the model must have been learned solely from the structural variability of natural speech.
This analysis revealed a clear transition from time-to-structure-yoked integration across network layers (Fig. 2e). This transition was completely absent from an untrained model, demonstrating that it was learned from the structural variability of natural speech. The overall integration time, averaged across stretched and compressed speech, also increased substantially across network layers for the trained model (Fig. 2f), an effect that was also primarily a result of training (the increase from layer one to layer two is present in the untrained network because of striding in the architecture). These results demonstrate that we can distinguish time-yoked and structure-yoked integration, as well as transitions between the two, using our approach, including from complex, nonlinear models that have only ever been trained on natural speech. These results also help motivate our second hypothesis (Fig. 1e, right panel), as prior studies have found that later network layers better predict later-stage regions of the human auditory cortex43, suggesting that there may be a transition from short, time-yoked integration to long, structure-yoked integration across the auditory hierarchy.
Neural integration throughout human auditory cortex is predominantly time-yoked
We next sought to test whether integration windows in the human auditory cortex reflect time-yoked or structure-yoked integration. We measured cortical responses to speech segments (37, 111, 333, 1,000, 3,000 ms) from an engaging spoken story (from the Moth Radio Hour), following time compression or stretching (preserving pitch). Our paradigm was designed to characterize sub-second integration windows within the auditory cortex, and it was not possible to characterize multi-second integration windows beyond the auditory cortex owing to the small number of segments at these longer timescales. All analyses were performed on the broadband gamma power response of each electrode (70–140 Hz). We focus on broadband gamma because it provides a robust measure of local electrocortical activity and can be extracted using filters with narrow integration windows (~19 ms), which we have shown have a negligible effect on the estimated integration window7. By contrast, low-frequency, phase-locked activity requires long-integration filters that can substantially bias the measured integration window7.
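For illustration, a broadband gamma power timecourse can be obtained by band-pass filtering each electrode’s raw signal to 70–140 Hz and taking its envelope. The sketch below is a generic band-pass-plus-Hilbert implementation, not the authors’ exact filtering pipeline (which used filters with ~19 ms integration windows); the function and parameter names are ours.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def broadband_gamma_power(signal, sr, low=70.0, high=140.0, order=4):
    """Band-pass one electrode's raw trace to the broadband gamma range
    (70-140 Hz) and return the squared Hilbert envelope as a power estimate."""
    b, a = butter(order, [low / (sr / 2), high / (sr / 2)], btype="band")
    band = filtfilt(b, a, signal)       # zero-phase band-pass filtering
    return np.abs(hilbert(band)) ** 2   # instantaneous broadband gamma power
```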
The cross-context correlation for stretched and compressed speech is shown for two example electrodes: one from a primary region overlapping right Heschl’s gyrus and one from a non-primary region overlapping the right superior temporal gyrus (STG) (Fig. 3a). Given that neural data are noisy (unlike model responses), we plot a noise ceiling for each electrode. The noise ceiling is calculated by performing the same analysis as the cross-context correlation, but with the context held constant, using repeated presentations of the same segment sequence. The STG electrode required longer segment durations and lags for the cross-context correlation to reach the noise ceiling, indicating a longer integration window. Notably, however, the cross-context correlation was similar for stretched and compressed speech for both the Heschl’s gyrus and STG electrodes, suggesting a window predominantly yoked to absolute time.
Fig. 3. Neural integration in human auditory cortex is predominantly time-yoked.
a, Cross-context correlation (red line) and noise ceiling (black line) for stretched (left) and compressed (right) speech from two example electrodes overlapping right Heschl’s gyrus (HG) (top) and right STG (bottom). b, Integration windows for stretched (x axis) and compressed (y axis) speech for all sound-responsive electrodes. Green and purple lines indicate what would be predicted from a time-yoked and structure-yoked integration window. Electrodes are colored based on annular ROIs that reflect the distance of each electrode to the primary auditory cortex (inset; gray colors reflect electrodes outside the annular ROIs). c, Map plotting the average integration window across stretched and compressed speech (108 electrodes, 15 participants). Right panel plots the median integration window within annular ROIs. d, Map of structure yoking (108 electrodes, 15 participants). Right panel plots the median structure-yoking index within annular ROIs. e, Cross-context correlation (normalized by noise ceiling) for stretched and compressed speech averaged across all electrodes within each annular ROI (top to bottom). Error bars in c and d plot the central 68% interval of the sampling distribution (equivalent to one standard error for a Gaussian distribution) computed by hierarchical bootstrapping across both participants and electrodes.
We used a parametric model developed in our prior work that makes it possible to estimate integration windows from noisy neural data by pooling across all lags and segment durations, allowing us to quantify the integration window for each electrode for both stretched and compressed speech (Fig. 3b and Extended Data Figs. 3 and 4). Electrodes were localized on the cortical surface, which we used to create a map of the overall integration window (Fig. 3c), averaged across stretched and compressed speech, as well as a map of structure yoking (Fig. 3d). As in prior work7,31, we quantified differences in neural integration related to the cortical hierarchy by binning electrodes into annular regions of interest (ROIs) based on their distance to primary auditory cortex (center of TE1.1; three 10 mm-spaced bins) (Fig. 3b–e). As a simple, model-free summary metric, we also computed the average cross-context correlation across all electrodes within each ROI separately for stretched and compressed speech and normalized the cross-context correlation by the noise ceiling at each lag and segment duration (Fig. 3e; Extended Data Fig. 5 plots the cross-context correlation and noise ceiling separately). For our annular ROI analyses, we pooled across hemispheres because integration windows were similar between the right and left hemispheres both in this study (Fig. 3d,e) and in our prior work7, and because we had a small sample size in the right hemisphere (33 electrodes). Similar results were observed using standard anatomical parcellations of the auditory cortex (Heschl’s gyrus, planum polare, planum temporale and superior temporal gyrus) (Extended Data Fig. 6).
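The annular ROI assignment used above amounts to binning electrodes by their distance to the center of primary auditory cortex. The sketch below illustrates this binning under the assumption that each electrode’s distance to the TE1.1 center (in mm, measured on the cortical surface) has already been computed; the function and variable names are ours.

```python
import numpy as np

def annular_roi_labels(distances_mm, bin_width=10.0, n_bins=3):
    """Assign electrodes to 10 mm annular ROIs by distance to primary
    auditory cortex; electrodes beyond the last bin are labeled -1."""
    labels = np.floor(np.asarray(distances_mm, float) / bin_width).astype(int)
    labels[labels >= n_bins] = -1
    return labels

print(annular_roi_labels([3.2, 14.7, 26.1, 41.0]))  # [0 1 2 -1]
```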
Extended Data Fig. 3. Consistency across participants.

Integration windows for stretched (x-axis) and compressed (y-axis) speech for all sound-responsive electrodes. Electrodes from the same participant are given the same symbol/color. The format is otherwise the same as in Fig. 3b.
Extended Data Fig. 4. Unconstrained window shape.

To reduce the number of free parameters and increase robustness, we constrained the shape of the Gamma-distributed window to be the same for stretched and compressed speech. This figure shows the results when the shape is unconstrained. See Methods (Estimating integration windows from noisy neural responses) for a more detailed discussion. The format is the same as in Fig. 3b.
Extended Data Fig. 5. Average cross-context and noise ceiling within annular ROIs.
Electrodes were grouped based on their distance to primary auditory cortex in 10 mm bins (see rightmost panel), and we averaged the cross-context correlation (red line) and noise ceiling (black line) for all electrodes within each group. This figure plots these quantities for stretched (left) and compressed speech (right) for varying segment durations (37, 111, 333 ms). Figure 3e shows the normalized cross-context correlation, computed by pointwise division of the cross-context correlation with the noise ceiling.
Extended Data Fig. 6. Standard anatomical ROIs.
This figure plots results using a standard anatomical parcellation of the human auditory cortex as an alternative to our annular ROI analysis. Auditory cortex was subdivided into five regions: medial Heschl’s gyrus (dark blue, TE1.1), lateral Heschl’s gyrus (light blue, TE1.0/1.2), planum polare (green), planum temporale (orange), superior temporal gyrus (red). For each ROI, we plot the median overall integration window across both stretched and compressed stimuli as well as the median structure yoking index. Error bars plot the central 68% interval of the sampling distribution (equivalent to 1 standard error for a Gaussian) computed via hierarchical bootstrapping across both subjects and electrodes.
We found that the overall integration window, averaged across stretched and compressed speech, increased substantially across the cortical hierarchy with an approximately threefold increase from primary to non-primary regions (Fig. 3b,c) (median integration windows for annular ROIs 1–3: 80, 130 and 272 ms; β = 0.107 octaves per mm, s.d. = 0.031, 90% credible interval = [0.060, 0.159], 108 electrodes, 15 subjects, via Bayesian linear mixed-effects (LME) model; see Methods). The cross-context correlation showed a lower peak value and a more gradual build-up and fall-off in ROIs further from the primary auditory cortex, again consistent with a longer integration time (Fig. 3e). These findings replicate prior work7 showing that the human auditory cortex integrates hierarchically across time with substantially longer integration windows in higher-order regions.
We next investigated structure yoking. Across all electrodes, the median difference between stretched and compressed speech was only 0.06 octaves (Fig. 3b) (β = 0.091 octaves, s.d. = 0.046, 90% credible interval = [0.015, 0.164], 110 electrodes, 15 subjects). The magnitude of this increase was much smaller than the 1.58-octave difference in structure durations, yielding a median structure-yoking index of 0.04. Notably, structure yoking was similarly weak throughout both the primary and non-primary auditory cortex (Fig. 3d): structure yoking did not increase with distance from primary auditory cortex (β = −0.006 Δoctaves per mm, s.d. = 0.013, 90% credible interval = [−0.029, 0.015], 108 electrodes, 15 subjects) and did not become stronger in electrodes with longer overall integration times (β = −0.053 Δoctaves per octave, s.d. = 0.086, 90% credible interval = [−0.197, 0.083], 110 electrodes, 15 subjects) as would be predicted by hypothesis 2 (Fig. 1e). The average cross-context correlation was very similar for stretched and compressed speech, even in annular ROIs far from primary auditory cortex, again consistent with time-yoked integration (Fig. 3e). Although structure-yoking indices were clearly centered around zero, indicating time-yoked integration, there was some variation around zero from electrode to electrode. This variation, however, was not reliable across independent data splits (non-overlapping segments) (Spearman’s rank correlation, −0.07), unlike the electrode-to-electrode variation in overall integration windows, which was highly reliable (Spearman’s rank correlation, 0.82). These findings suggest that the auditory cortex integrates hierarchically across time, but that the fundamental unit of integration is absolute time and not structure.
Our experiments used uniform time stretching and compression because this increases and decreases the duration of all speech structures by a known magnitude. Although clearly intelligible, uniform stretching and compression are not natural because some speech structures vary more than others (Extended Data Fig. 1). Therefore, we conducted a second experiment in three patients (30 electrodes) in which we measured integration windows for naturally fast or slow speech, accomplished by recording sentences spoken by the same speaker at either a fast or slow pace. We again observed very similar integration windows for naturally fast and naturally slow speech, despite a large difference in the average durations for those stimuli (Extended Data Fig. 7) (slow sentences were on average 2.53× longer in duration than the fast sentences). This result indicates that time-yoked integration is the dominant form of integration even for natural manipulations of the speech rate.
Extended Data Fig. 7. Integration windows for naturally fast and slow speech.

This figure plots integration windows estimated using natural recordings – without any stretching or compression – for a set of sentences spoken by the same talker at either a fast or slow pace. The slow speech was 2.53x longer on average than the fast speech, which is indicated by the purple line. The green line is the line of unity.
Timecourse rescaling does not distinguish time-yoked and structure-yoked integration
Prior studies have observed that stretching or compressing speech causes the neural response timecourse to stretch and compress36,37. We replicated these prior studies by correlating the response timecourse of each electrode to stretched and compressed speech after time-stretching the neural response to compressed speech (Fig. 4a) (by the threefold difference in speech rates) and then dividing by the maximum possible correlation given the data reliability. Consistent with prior studies, we observed substantial timecourse rescaling values throughout primary and non-primary auditory cortex (Fig. 4b) (median rescaling values across the three annular ROIs were 0.47, 0.43 and 0.39) (β = 0.064, s.d. = 0.012, 90% credible interval = [0.045, 0.083], 132 electrodes, 15 subjects).
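The timecourse rescaling metric can be summarized as follows (a minimal sketch, assuming one response timecourse per condition and a precomputed noise ceiling; names are ours): stretch the compressed-speech response by the threefold difference in speech rates, correlate it with the stretched-speech response and normalize by the maximum correlation achievable given the data reliability.

```python
import numpy as np
from scipy.signal import resample

def timecourse_rescaling(resp_stretched, resp_compressed, factor=3.0, noise_ceiling=1.0):
    """Stretch the compressed-speech timecourse by `factor`, correlate it
    with the stretched-speech timecourse and divide by the noise ceiling."""
    rescaled = resample(resp_compressed, int(round(len(resp_compressed) * factor)))
    n = min(len(rescaled), len(resp_stretched))   # match lengths before correlating
    r = np.corrcoef(rescaled[:n], resp_stretched[:n])[0, 1]
    return r / noise_ceiling
```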
Fig. 4. Timecourse rescaling does not distinguish time-yoked and structure-yoked integration.
a, The timecourse for compressed speech was rescaled (stretched) using resampling and then correlated with the neural response timecourse for stretched speech. b, The correlation between stretched and compressed speech with and without rescaling (noise-corrected by response reliability). Electrodes were grouped into annular ROIs, and violin plots show the distribution of the timecourse rescaling metric across all electrodes within each ROI. c, Timecourse rescaling for different layers of a trained and untrained DANN model, as well as baseline correlations without rescaling (for the trained model).
Neural timecourse rescaling might naively be interpreted as reflecting a rescaled (structure-yoked) integration window, but might alternatively reflect a rescaled stimulus with a fixed (time-yoked) integration window. To address this question, we examined timecourse rescaling in our DANN model, which showed a clear, training-dependent change in structure-yoked integration across its layers. We found that all DANN layers showed substantial timecourse rescaling, even the earliest layers, and that rescaling was similar for trained and untrained models (Fig. 4c). These findings contrast sharply with the results from our TCI analysis (Fig. 2e) and suggest that timecourse rescaling provides a poor metric of structure-yoked integration, being driven to a large extent by stimulus rescaling rather than integration window rescaling. This finding underscores the significance of our approach, which can cleanly isolate time-yoked and structure-yoked integration from complex, nonlinear responses.
Discussion
We tested whether the human auditory cortex integrates information across absolute time or sound structure in speech. Leveraging our recently developed TCI method, we developed a novel approach that we demonstrated was effective at distinguishing time-yoked and structure-yoked integration in computational models, including revealing a transition from time-yoked to structure-yoked integration in nonlinear DANN models trained to recognize structure from natural speech. We then applied our approach to spatiotemporally precise human intracranial recordings, enabling us to measure integration windows throughout primary and non-primary auditory cortex. Across all electrodes, we observed a slight increase in integration times for stretched compared with compressed speech. This change, however, was small relative to the difference in structure durations between stretched and compressed speech, even in non-primary regions of the STG, which are strongly implicated in structure processing6,44,45. These findings demonstrate that the primary unit of integration throughout the human auditory cortex is absolute time and not structure duration.
Implications for computational models
Auditory and cognitive neuroscientists have often studied sound processing using distinct sets of models and assumptions. Auditory models typically assume that neurons integrate simple types of acoustic features over fixed temporal windows8,11–17 (for example, STRFs). These models have been successful in explaining neural responses early in the auditory system, but have had greater difficulty in predicting neural responses to natural sounds in higher-order regions of the auditory cortex29,43. For example, prior work has revealed prominent selectivity for speech and music that cannot be explained by STRFs6,31,45, as well as tuning in the STG for speech-specific structures such as phonemes and phonotactics20,25,27,28,46.
Cognitive models, by contrast, have often assumed that temporal integration is yoked to abstract structures such as phonemes or words4,19–26, often leaving unspecified the acoustic computations used to derive these structures from sound. For example, models of spoken word recognition are often cast as a sequential operation applied to phonemes19,24, and features inspired by these models have shown promise in predicting cortical responses in non-primary auditory cortex20,46. Language models, which integrate semantic and syntactic information across words (or ‘word pieces’), have also shown strong predictive power, including in non-primary regions of the auditory cortex such as the STG21–23. Collectively, these observations suggest that there might be a transition from time-yoked integration of acoustic features in primary regions to structure-yoked integration of abstract structures such as phonemes or words in non-primary regions.
Our findings are inconsistent with this hypothesis, as we find that neural integration is predominantly yoked to absolute time throughout both primary and non-primary auditory cortex, with little change in structure yoking between primary and non-primary regions. These findings do not contradict prior findings that non-primary auditory cortex represents speech-specific and music-specific structure, but do suggest that the underlying computations are not explicitly aligned to sound structures as in many cognitive and psycholinguistic models. For example, phonotactic models that compute measures of phoneme probability using a sequence of phonemes19,24 implicitly assume that the neural integration window is yoked to phonemes and thus varies with phoneme duration. Similarly, language models that integrate across words and word pieces implicitly assume that the integration window varies with word duration21–23. Our findings indicate that these types of structure-yoked computations are weak in the auditory cortex, which has important implications for models of neural coding. For example, a corollary of our finding is that the amount of information analyzed within a fixed integration window will scale with the rate at which that information is presented: compressing speech increases the information present within the integration window, and stretching speech decreases it. Similarly, the number of phonemes and words that the auditory cortex effectively analyzes will be smaller when those phonemes and words have longer durations.
How can people recognize speech structures using time-yoked integration windows, given their large durational variability? One possibility is that integration windows in the auditory cortex are sufficiently long to achieve recognition of the relevant sound structures, even if they are yoked to absolute time. Non-primary regions integrate across hundreds of milliseconds and do not become fully invariant to context even at segment durations of 333 ms (Fig. 3e and Extended Data Fig. 5), and thus will integrate across many phonemes even if those phonemes have long durations. This may be analogous to higher-order regions of visual cortex that have large spatial receptive fields, sufficient to recognize objects across many spatial scales47. Structure-yoked computations may also be instantiated in downstream regions, such as the superior temporal sulcus or frontal cortex, that integrate across longer, multi-second timescales3,4,48, either by enhancing weak structure-yoked computations already present in the auditory cortex or by explicitly aligning their computations to speech structures and structural boundaries19,49.
Our findings are broadly consistent with anatomical models that posit a hierarchy of intrinsic timescales, driven by hierarchically organized anatomical and recurrent connections50. Given that intrinsic timescales are stimulus-independent, they are, by definition, time-yoked and will not vary with structure duration. Stimulus-independent dynamics can, nonetheless, influence the integration window by controlling the rate at which sensory information is accumulated and forgotten within neural circuits. Several recent neurocognitive models have posited that neural integration in the cortex is yoked to ‘event boundaries’ in speech, such as the boundary between words, sentences or longer narrative structures51,52. Our results rule out simple models of event-based integration within the auditory cortex, such as models in which the integration window grows based on the distance to a structural boundary53, as the distance scales with the magnitude of stretching and compression.
Relationship to prior methods
Many methods have been developed to study the temporal characteristics of neural responses, but in most cases, these methods do not provide a direct estimate of the integration window. For example, many prior studies have measured the autocorrelation of the neural response in the absence of or after removing stimulus-driven responses as a way to assess intrinsic timescales50. Intrinsic timescales are thought, in part, to reflect network dynamics such as the integration time of a neural population with respect to its synaptic inputs54. The neural integration window, as defined here, specifies the integration time of a neural response with respect to the stimulus and includes the cumulative effects of all network dynamics on stimulus processing, both feedforward and recurrent.
Many prior studies have measured temporal properties of the neural response timecourse, such as frequency characteristics of the response55 or the degree to which the neural response tracks or phase-locks to the stimulus56. These methods, although useful, do not provide a direct measure of the neural integration window, in part because they are influenced by both the stimulus and neural response characteristics. For example, stretching the stimulus will tend to stretch the neural response (changing the frequency spectrum as well), even in the absence of a change in neural integration, which explains why we observed strong timecourse rescaling for neural responses with time-yoked integration windows. This observation underscores the utility of having a method like the TCI paradigm that can directly estimate the neural integration window, separate from stimulus characteristics, to distinguish time-yoked and structure-yoked integration.
Temporal receptive field models can be conceptualized as estimating the integration window of a best-fitting linear system; for example, between a spectrogram-like representation and the neural response in the case of a STRF8,13,18. However, as our DANN analyses show, nonlinear processing may be critical to instantiating structure-yoked computations, and it is therefore important to be able to measure the integration window of a nonlinear system to distinguish between time-yoked and structure-yoked integration. Temporal receptive field models have also been applied to investigate selectivity for speech structures, such as phonemes or phonotactic features20,27,46, which implicitly assume time-yoked integration of these structures. By demonstrating strong, time-yoked integration throughout the auditory cortex, our findings provide some justification for this assumption.
Several studies have reported neural responses that change at structural event boundaries in speech or music51,57, including in the auditory cortex, which has motivated event-based integration models52. However, many features of sound change at structural boundaries, which could produce a neural response at the boundary, even if the integration window does not change at the boundary. These prior findings are therefore not inconsistent with our study, and more research is needed to determine whether there are any event-based changes in integration within the auditory cortex.
Many studies have measured selectivity for longer-term temporal structure by comparing intact and temporally scrambled sounds3,4,6. These scrambling metrics, however, do not provide a direct measure of the integration window, given that many regions of the auditory cortex (for example, primary auditory cortex) show no effect of scrambling even at very short timescales6 (for example, 30 ms) despite having a meaningful integration window at these durations7. Here, we were able to identify a wide range of integration times throughout primary and non-primary auditory cortex by combining our TCI method with the spatiotemporal precision of intracranial recordings. We were able to detect a small change in integration windows between stretched and compressed speech across the auditory cortex, while simultaneously showing that this change was small relative to the magnitude of stretching and compression and similar between primary and non-primary regions. Our methods were thus critical to distinguishing between time-yoked and structure-yoked integration and provide clear evidence that integration windows in the auditory cortex are predominantly time-yoked.
Methodological choices, limitations and future directions
We chose to use uniform stretching and compression so that all speech structures would be stretched or compressed by the same amount. This choice was made because speech contains many perceptually relevant structures, and we do not know a priori which structures will underlie a particular neural response. Although uniform stretching and compression are not entirely natural, people can still understand speech with moderate amounts of uniform stretching and compression36,40, like that used in this study, and therefore, whatever mechanisms are used by the brain to recognize speech must still be operational in this regime. Consistent with this conclusion, we found similar results using naturally faster and slower speech, produced by the same talker speaking at a fast or slow pace (Extended Data Fig. 7). Moreover, we showed that we could use uniform stretching and compression to detect a transition to structure-yoked integration from a nonlinear DANN model that had only ever been trained on natural speech (Fig. 2e).
We chose to use segments with a fixed duration because this simplified our analyses. Fixed-duration segments contain mixtures of full and partial structures (for example, half a word), but this property is not problematic for identifying structure-yoked integration because the same mixtures are present for both stretched and compressed speech. For example, if a structure-yoked population only responded strongly to complete words, then the segment duration that is needed to achieve a strong response would still be three times as long for stretched versus compressed speech, given that the duration needed for a given number of complete words to be present would be three times as long for the stretched speech. Our computational models verified that we could detect structure-yoked integration using fixed-duration segments.
Our analysis focused on broadband gamma power because it provides a standard measure of local electrocortical activity and because it can be extracted with short-duration filters that have little effect on the measured integration7. By contrast, low-frequency phase-locked activity is measured using filters that are long and fixed, which inevitably biases responses toward long, time-yoked integration. Future work could potentially examine lower-frequency, phase-locked activity using spatial filters designed to extract such activity rather than temporal filters58.
The large majority of the segments tested in this experiment were less than 1 s to allow us to measure integration windows from the auditory cortex, which has sub-second integration times7. Our data demonstrate that the auditory cortex responds similarly to segments of at least 333 ms compared to those of longer segment durations (for example, 1 s or above), indicating that the use of sub-second segments is not problematic for characterizing integration in the auditory cortex. Specifically, we found that the cross-context correlation was close to the noise ceiling for segment durations of at least 333 ms, even in non-primary regions (Fig. 3a,e and Extended Data Fig. 5). The cross-context correlation was computed by comparing segments surrounded by their natural context in a longer segment with segments that were surrounded by random other segments of the same duration. Therefore, a high cross-context correlation indicates a similar response to shorter and longer segments. Regions other than the auditory cortex in the superior temporal sulcus and frontal cortex show longer, multi-second integration windows3,4,48, and therefore probably require longer segment durations to characterize their integration times. Future work could examine whether these higher-order regions show time-yoked or structure-yoked integration windows by testing a larger number of long-duration segments, although this is practically challenging because longer-duration segments inevitably require longer experiment times.
Methods
All subjects provided informed written consent to participate in the study, which was approved by the Institutional Review Boards of Columbia University (Columbia University Human Research Protection Office), New York University (NYU) Langone Medical Center (Human Research Protections) and the University of Rochester (Research Subjects Review Board).
Measuring durational variability of speech phonemes
To illustrate the variability of speech structures, we measured the durational variability of phonemes in the LibriSpeech corpus59 (Fig. 1a,b and Extended Data Fig. 1). Phoneme alignments were computed using the Montreal Forced Aligner60,61. The duration estimates for word-initial and word-final phonemes were occasionally contaminated by periods of silence that were included in the phoneme’s segmentation; therefore, we discarded word-initial and word-final phonemes from our estimates to ensure they did not inflate our variability estimates. For each phoneme, we then measured the distribution of durations across all speakers and utterances in the corpus. We then calculated the central 95% interval of this distribution and measured the ratio between the upper and lower boundaries of this interval as a measure of durational variability.
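A minimal sketch of this variability measure is shown below (the synthetic duration distribution is purely illustrative; the function name is ours).

```python
import numpy as np

def durational_variability(durations):
    """Ratio between the upper and lower bounds of the central 95% interval
    of a phoneme's duration distribution."""
    lo, hi = np.percentile(durations, [2.5, 97.5])
    return hi / lo

# Illustrative synthetic distribution with a median duration of ~90 ms
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(0.09), sigma=0.37, size=5000)
print(durational_variability(durations))  # ~4.3 for this synthetic example
```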
Cross-context correlation analysis
We used the ‘cross-context correlation’ to estimate context invariance from both computational models and neural data. In this analysis, we first compile the response timecourses to all segments of a given duration in a segment-by-time matrix (Fig. 2a). Each row contains the response timecourse surrounding a single segment, aligned to segment onset. Different rows thus correspond to different segments, and different columns correspond to different lags relative to segment onset. We compute a separate matrix for each of the two contexts being compared. The central segment is the same between contexts, but the surrounding segments differ.
Our goal is to determine whether there is a lag when the response is the same across contexts. We instantiate this idea by correlating corresponding columns across segment-aligned response matrices from different contexts (schematized by the linked columnar boxes in Fig. 2a). At segment onset (Fig. 2a, first box pair), the cross-context correlation should be near zero because the integration window must overlap the preceding segments, which are random across contexts. As time progresses, the integration window will start to overlap the shared segment, and the cross-context correlation should increase. If the integration window is less than the segment duration, there will be a lag in which the integration window is fully contained within the shared segment, and the response should thus be the same across contexts, yielding a correlation of 1 (Fig. 2a, second box pair).
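A minimal sketch of this column-wise correlation is shown below, assuming the two segment-by-lag response matrices have already been compiled (the function name is ours).

```python
import numpy as np

def cross_context_correlation(resp_context1, resp_context2):
    """Pearson correlation between corresponding columns of two
    segment-by-lag response matrices (same shared segments, different
    surrounding contexts). Returns one correlation value per lag."""
    a = np.asarray(resp_context1, float)
    b = np.asarray(resp_context2, float)
    a = a - a.mean(axis=0)              # de-mean each lag across segments
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return num / den
```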
Our stimuli enable us to investigate and compare two types of contexts: cases in which a segment is a subset of a longer segment and thus surrounded by its natural context (Fig. 1c, left) and cases in which a segment is surrounded by randomly selected segments of the same duration (Fig. 1c, right). We computed the cross-context correlation by comparing natural and randomly selected contexts as well as two different randomly selected contexts (see our previous publication7 for details), but the results were very similar when only comparing natural and random contexts. We note that any response that is selective for naturalistic structure (for example, a word) will, by definition, show a difference between natural and random contexts, and thus our paradigm will be sensitive to this change. In practice, we found that the cross-context correlation was close to the noise ceiling for segment durations of at least 333 ms (Fig. 3a,e and Extended Data Fig. 5), and this was true even when only comparing natural and random contexts. This fact demonstrates that segment durations of 333 ms produce similar responses to those of longer segments (1 s or longer) in the auditory cortex.
Estimating integration windows from computational model responses
We tested whether our approach could distinguish time-yoked versus structure-yoked integration by applying our analyses to the outputs of computational models. Our methods were similar to our neural analyses, except that in the case of computational models, the response is noise-free and we are not constrained by experiment time, and therefore can measure responses to many segments. We measured the cross-context correlation using 30 segment durations with 100 sequences of 27 s per duration (segment durations of 20, 40, 60, 80, 120, 160, 200, 240, 280, 320, 400, 480, 560, 640, 720, 800, 880, 960, 1,040, 1,120, 1,200, 1,280, 1,440, 1,600, 1,760, 1,920, 2,080, 2,240, 2,400, 2,560 ms). Segments were excerpted from LibriSpeech following time compression and stretching by a factor of √3, as in the neural experiments (time stretching and compression were implemented using waveform similarity overlap-add, as implemented by SoX in Python62).
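For reference, the sketch below shows how such tempo changes can be applied with SoX from Python (assuming the pysox package; file names and the factor value are illustrative). A tempo factor greater than 1 shortens the audio (compression) and a factor less than 1 lengthens it (stretching), in both cases preserving pitch.

```python
import numpy as np
import sox  # pysox bindings around the SoX command-line tool

factor = np.sqrt(3)  # ~1.73; assumed compression/stretching factor

tfm_compress = sox.Transformer()
tfm_compress.tempo(factor)          # > 1: faster playback, shorter duration
tfm_compress.build('natural_speech.wav', 'compressed_speech.wav')

tfm_stretch = sox.Transformer()
tfm_stretch.tempo(1.0 / factor)     # < 1: slower playback, longer duration
tfm_stretch.build('natural_speech.wav', 'stretched_speech.wav')
```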
To estimate the integration window, we applied our cross-context correlation analysis, which is described both in the main text and in our previous publication7 (Fig. 2a) (comparing natural and random contexts). We then calculated the peak cross-context correlation value for each segment duration as a measure of context invariance. Finally, we interpolated the correlation versus segment duration curve to determine the smallest segment duration needed to achieve a threshold cross-context correlation value (threshold set to 0.75).
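A minimal sketch of this last step is shown below, assuming a list of cross-context correlation curves (one per segment duration) whose peak values increase with segment duration (the function name is ours).

```python
import numpy as np

def estimate_integration_window(segment_durations_s, cross_context_curves, threshold=0.75):
    """Smallest segment duration whose peak cross-context correlation
    (across lags) reaches `threshold`, obtained by interpolating the
    peak-versus-duration curve on a logarithmic duration axis."""
    peaks = np.array([np.nanmax(curve) for curve in cross_context_curves])
    log_durations = np.log(segment_durations_s)
    # np.interp assumes the peak values increase with segment duration
    return float(np.exp(np.interp(threshold, peaks, log_durations)))
```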
STRF model
Following standard practice, our STRF model was defined by applying a linear transformation to a spectrogram representation of sound (mel spectrogram with torchaudio; sample_rate = 16,000, n_fft = 1,280, hop_length = 320, f_min = 40, f_max = 8,000, n_mels = 128, power = 2.0, norm = ‘slaney’). To make the STRF model more realistic, the STRFs were fit to approximate human intracranial responses to natural speech. The STRFs were fit using regularized regression (ridge regression) against lagged spectrogram features (tenfold cross-validation was used to select the regularization parameter63). The weights from the regression analysis define the STRF window. We used a window size of 1 s sampled at 100 Hz. A 50 ms half-Hanning window was applied to the beginning and end of each STRF to suppress edge artifacts64. The data and stimuli used to fit the STRF models have been described previously46 and consisted of 566 electrode responses from 15 patients. Models were fit to all electrodes from that prior study46. Each patient listened to 30 min of speech excerpted from a children’s storybook (Hank the Cowdog) and an instructional audio guide (four voice actors; two male, two female).
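The sketch below illustrates the core of such a fit: ridge regression of the neural response against time-lagged spectrogram features, with a 1 s window sampled at 100 Hz. The regularization parameter is fixed here for brevity (the study selected it by tenfold cross-validation), the half-Hanning edge taper is omitted, and the names are ours.

```python
import numpy as np

def fit_strf(spectrogram, response, n_lags=100, alpha=1.0):
    """Ridge-regression STRF fit. spectrogram: (time, freq) sampled at
    100 Hz; response: (time,). n_lags=100 gives a 1 s causal window."""
    T, F = spectrogram.shape
    X = np.zeros((T, n_lags * F))
    for lag in range(n_lags):                       # stack lagged spectrogram copies
        X[lag:, lag * F:(lag + 1) * F] = spectrogram[:T - lag]
    w = np.linalg.solve(X.T @ X + alpha * np.eye(n_lags * F), X.T @ response)
    return w.reshape(n_lags, F)                     # weights indexed by (lag, frequency)
```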
Phoneme integration model
As a simple model of structure-yoked integration, we instantiated a model that integrated phonemic features within a window whose temporal extent varied inversely with the speech rate. The features of the phoneme model were defined by 22 binary phonetic features, each indicating the presence (1) or absence (0) of a single feature (for example, manner of articulation)46. As is common, we used onset features, in which the presence of a feature is represented by a '1' at the onset of each phoneme. As with the STRF model, the window was fit to neural data to make the model more comparable to a neural experiment: time-lagged features were regressed against the neural response, and the regression weights defined the window. To simulate a structure-yoked response, we then stretched or compressed this window by interpolation when predicting responses to stretched and compressed speech, respectively.
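A minimal sketch of the window-rescaling step is shown below. Linear interpolation is used here for concreteness; whether the authors applied any additional gain normalization after rescaling is not stated, and the function names are ours.

```python
import numpy as np

def rescale_window(window, factor):
    """Stretch (factor > 1) or compress (factor < 1) a fitted integration
    window in time by interpolation, simulating a structure-yoked window that
    scales with the speech rate. `window` is sampled at a fixed rate."""
    n_in = len(window)
    n_out = int(round(n_in * factor))
    t_in = np.arange(n_in)
    t_out = np.linspace(0, n_in - 1, n_out)
    return np.interp(t_out, t_in, window)

# e.g. a window fit to natural-rate speech, rescaled to predict responses
# to stretched speech (factor equals the stimulus stretching factor):
# stretched_window = rescale_window(fitted_window, factor)
```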
DANN model
We computed integration windows from a popular DANN speech recognition model (DeepSpeech2)35,65. The DANN consists of two convolutional layers, five recurrent layers (long short-term memory cells) and one linear readout layer, and it was trained to transcribe text from a mel spectrogram using a connectionist temporal classification loss (applied to graphemes). The model was trained on 960 h of speech from the LibriSpeech corpus using standard data augmentation techniques35 (background noise, reverberation and frequency masking) to make recognition more challenging (25 epochs; optimization was implemented in PyTorch using the Adam optimizer). Each layer is defined by a set of unit response timecourses, and we measured the integration window of each unit by applying the TCI analysis to its response timecourse (in the same manner as for all other models). Like most DANN models, DeepSpeech2 is acausal, but this is not problematic for measuring its integration window35 because acausality shifts the cross-context correlation to earlier lags, and our measure of context invariance (the peak correlation across lags) is invariant to such shifts.
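For readers unfamiliar with extracting per-layer unit timecourses from a PyTorch model, the sketch below shows one standard way to do so using forward hooks. The model and layer names are placeholders, not the authors' DeepSpeech2 implementation, and the forward call of a real speech model may require additional arguments (for example, sequence lengths).

```python
import torch

def record_unit_timecourses(model, mel_input, layer_names):
    """Run a spectrogram through the model and capture each named layer's
    output (unit activation timecourses) via forward hooks."""
    captured, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Recurrent modules return (output, hidden_state); keep the output.
            out = output[0] if isinstance(output, tuple) else output
            captured[name] = out.detach()
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(mel_input)
    for h in handles:
        h.remove()
    return captured  # each unit timecourse is then analyzed with the TCI analysis

# activations = record_unit_timecourses(speech_model, mel_batch, ["conv.0", "rnn.2"])  # placeholder names
```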
Intracranial recordings from human patients
Participants and data collection
Data were obtained from 15 patients undergoing treatment for intractable epilepsy at the NYU Langone Hospital (six patients) and the Columbia University Medical Center (CUMC) (nine patients) (seven male, eight female; mean age, 36 years, standard deviation, 13 years). Three additional patients were tested in a follow-up experiment (all female; ages 29, 30 and 50 years) in which we measured responses to speech that was naturally faster or slower. Two of these patients were tested at the University of Rochester Medical Center (URMC), and one patient was tested at NYU. Electrodes were implanted to localize epileptogenic zones and delineate these zones from eloquent cortical areas before brain resection. NYU patients were implanted with subdural grids, strips and depth electrodes depending on the clinical needs of the patient. CUMC patients were implanted with depth electrodes. NYU patients were compensated $20 per hour. URMC patients were compensated $35 per hour. CUMC patients were not compensated owing to Institutional Review Board prohibition.
Stimuli
Segments of speech were excerpted from a recording of a spoken story from the Moth Radio Hour (Tina Zimmerman, Go In Peace). The recording was converted to mono and resampled to 20 kHz; pauses longer than 500 ms were excised. The stimuli were then compressed and stretched while preserving pitch using the high-quality speech vocoder STRAIGHT66. There were five logarithmically spaced segment durations (37, 111, 333, 1,000 and 3,000 ms). Each stimulus was 27 s long and was composed of a sequence of segments of a single duration. There were two stimuli per segment duration, each with a different ordering of segments. We also tested a 27 s stimulus composed of a single, undivided excerpt. Shorter segments were created by subdividing longer segments. Each stimulus was repeated twice (in one subject, the stimuli were repeated four times). In two participants, we tested an additional segment duration (9 s) and used a longer stimulus duration (45 s) with only one ordering of segments rather than two.
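Segment sequences of this kind are assembled by concatenating equal-duration segments into a single waveform. The sketch below is a minimal illustration assuming raised-cosine crossfades between adjacent segments; the crossfade duration and shape actually used follow the authors' earlier TCI work7 and are not specified here, and the function name is ours.

```python
import numpy as np

def concatenate_with_crossfade(segments, sr, fade_ms=20.0):
    """Concatenate a list of equal-duration 1D waveforms into one sequence,
    overlapping adjacent segments with raised-cosine crossfades.
    `fade_ms` is illustrative, not the value used in the study."""
    n_fade = int(round(fade_ms / 1000 * sr))
    fade_in = 0.5 * (1 - np.cos(np.linspace(0, np.pi, n_fade)))  # rising raised-cosine ramp
    fade_out = fade_in[::-1]

    out = segments[0].copy()
    for seg in segments[1:]:
        seg = seg.copy()
        out[-n_fade:] = out[-n_fade:] * fade_out + seg[:n_fade] * fade_in
        out = np.concatenate([out, seg[n_fade:]])
    return out
```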
We conducted a subsequent experiment to test whether similar results would be observed for speech that is naturally faster or slower. Specifically, we recorded the same talker (author S.V.N.-H.) producing a common set of sentences at either a fast or slow pace. There were 28 fast sentences and 11 slow sentences, such that the total duration of the material was approximately the same (40.41 s, 41.24 s). There were 11 shared sentences that were present for both the fast and slow conditions, plus 17 additional sentences that were only tested for the fast condition (because more sentences were needed). For the 11 shared sentences, the slow sentences were 2.53 times longer than the fast sentences (3.75 s versus 1.48 s). Each sentence was root mean square-normalized after removing very low frequencies below 50 Hz (fourth-order Butterworth filter). We then generated segment sequences in the same manner as that described above (segment durations of 62.5, 125, 250, 500 ms).
Preprocessing
Our preprocessing pipeline was similar to that of prior studies7,67. Electrode responses were common-average referenced to the grand mean across electrodes from each subject. We excluded noisy electrodes from the common-average reference by detecting anomalies in 60 Hz power (measured using an IIR resonance filter with a 3 dB down bandwidth of 0.6 Hz; implemented using MATLAB's iirpeak.m). Specifically, we excluded electrodes whose 60 Hz power exceeded the median across electrodes by more than five standard deviations. Because the standard deviation is itself sensitive to outliers, we estimated it from the central 20% of samples, which are unlikely to be influenced by outliers (dividing the range of the central 20% of samples by the range expected for a Gaussian distribution of unit variance). After common-average referencing, we used notch filters to remove harmonics and fractional multiples of the 60 Hz noise (60, 90, 120 and 180 Hz; IIR notch filters with a 3 dB down bandwidth of 1 Hz, applied forward and backward; implemented using MATLAB's iirnotch.m).
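The sketch below is a Python (scipy) analogue of these MATLAB steps, included to make the robust standard-deviation estimate and the electrode-exclusion rule concrete. Function names are ours, and whether the peak filter was applied zero-phase is not stated.

```python
import numpy as np
from scipy.signal import iirpeak, iirnotch, lfilter, filtfilt
from scipy.stats import norm

def robust_std_central20(x):
    """Standard deviation estimated from the central 20% of samples: the
    40th-60th percentile range divided by the range expected for that
    interval under a standard normal distribution (~0.507)."""
    lo, hi = np.percentile(x, [40, 60])
    return (hi - lo) / (norm.ppf(0.6) - norm.ppf(0.4))

def line_noise_power(x, fs, f0=60.0, bw=0.6):
    """Power near f0, measured with an IIR resonance (peak) filter whose
    3 dB bandwidth is `bw` Hz (Q = f0 / bw)."""
    b, a = iirpeak(f0, Q=f0 / bw, fs=fs)
    return np.mean(lfilter(b, a, x) ** 2)

def exclude_noisy_electrodes(data, fs, thresh=5.0):
    """data: electrodes x time. Flag electrodes whose 60 Hz power exceeds the
    median across electrodes by `thresh` robust standard deviations."""
    p = np.array([line_noise_power(ch, fs) for ch in data])
    return p <= np.median(p) + thresh * robust_std_central20(p)

def notch_line_noise(x, fs, freqs=(60, 90, 120, 180), bw=1.0):
    """Remove line-noise harmonics and fractional multiples with zero-phase
    IIR notch filters (applied forward and backward)."""
    for f0 in freqs:
        b, a = iirnotch(f0, Q=f0 / bw, fs=fs)
        x = filtfilt(b, a, x)
    return x
```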
We measured integration windows from the broadband gamma power response timecourse of each electrode. We computed broadband gamma power by measuring the envelope of the preprocessed signal filtered between 70 and 140 Hz (implemented using a sixth-order Butterworth filter with 3 dB down cutoffs of 70 Hz and 140 Hz; the filter was applied forward and backward; envelopes were measured using the absolute value of the analytic signal, computed using the Hilbert transform; implemented using fdesign.bandpass in MATLAB). We have previously shown that the filter does not strongly bias the measured integration time because the integration window of the filter (~19 ms) is small relative to the integration window of the measured cortical responses7. Envelopes were downsampled to 100 Hz. We detected occasional artifactual time points as time points that exceeded five times the 90th percentile value for each electrode (across all time points for that electrode), and we interpolated these outlier time points from nearby non-outlier time points (using ‘piecewise cubic Hermite interpolation’ as implemented by MATLAB’s interp1.m function).
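A Python (scipy) analogue of the broadband gamma extraction and artifact interpolation is sketched below. Note that scipy's butter returns a bandpass of order 2N, so N = 3 yields the sixth-order filter described above; the code assumes an integer original sampling rate, and all names are ours.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample_poly
from scipy.interpolate import PchipInterpolator

def broadband_gamma(x, fs, band=(70.0, 140.0), fs_out=100):
    """Broadband gamma power: zero-phase sixth-order Butterworth band-pass
    (3 dB cutoffs at 70 and 140 Hz), Hilbert envelope, downsample to 100 Hz."""
    b, a = butter(3, band, btype="bandpass", fs=fs)   # order 3 per edge -> sixth-order bandpass
    env = np.abs(hilbert(filtfilt(b, a, x)))          # envelope via the analytic signal
    return resample_poly(env, int(fs_out), int(fs))   # downsample to fs_out (fs must be integer)

def interpolate_artifacts(env, thresh_factor=5.0):
    """Replace outlier time points (exceeding `thresh_factor` times the 90th
    percentile) with piecewise-cubic Hermite interpolation from non-outliers."""
    thresh = thresh_factor * np.percentile(env, 90)
    bad = env > thresh
    good_idx = np.flatnonzero(~bad)
    interp = PchipInterpolator(good_idx, env[good_idx])
    out = env.copy()
    out[bad] = interp(np.flatnonzero(bad))
    return out
```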
As is standard, we time-locked the intracranial EEG recordings to the stimuli by either cross-correlating the audio with a recording of the audio collected synchronously with the intracranial EEG data or by detecting a series of pulses at the start of each stimulus that were recorded synchronously with the intracranial EEG data. We used the stereo jack on the experimental laptop to either send two copies of the audio or to send audio and pulses on separate channels. The audio on one channel was used to play sounds to subjects, and the audio/pulses on the other were sent to the EEG recording system. Sounds were played through a Bose Soundlink Mini II speaker (at CUMC), an Anker Soundcore speaker (at NYU) or a Genelec 8010a speaker (at CUMC). Responses were converted to units of percent signal change relative to silence by subtracting and then dividing the response of each electrode by the average response during the 500 ms before each stimulus.
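For the audio-based alignment and the conversion to percent signal change, a minimal sketch is given below. The cross-correlation-based lag estimate is one standard way to implement the alignment described above; the function names and the assumption that the pre-stimulus baseline occupies the first 500 ms of each epoch are ours.

```python
import numpy as np
from scipy.signal import correlate

def find_audio_lag(recorded_audio, stimulus_audio, fs):
    """Lag (in seconds) that best aligns the stimulus waveform with the copy
    recorded synchronously with the intracranial EEG."""
    xc = correlate(recorded_audio, stimulus_audio, mode="full")
    lag_samples = np.argmax(xc) - (len(stimulus_audio) - 1)
    return lag_samples / fs

def percent_signal_change(resp, fs, prestim_s=0.5):
    """Convert a stimulus-locked response (first `prestim_s` seconds = silence
    before stimulus onset) to percent signal change relative to that baseline."""
    n_pre = int(round(prestim_s * fs))
    baseline = resp[:n_pre].mean()
    return 100 * (resp - baseline) / baseline
```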
We selected electrodes with a significant test–retest correlation (Pearson correlation) across the two presentations of each stimulus. Significance was measured with a permutation test, in which we randomized the mapping between stimuli across repeated presentations and recomputed the correlation (using 1,000 permutations). We used a Gaussian fit to the distribution of permuted correlation coefficients to compute small P values. Only electrodes with a highly significant correlation relative to the null were retained (P < 10−5). We also excluded electrodes for which the test–retest correlation fell below 0.05, which resulted in 132 total electrodes.
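The sketch below illustrates the permutation test with a Gaussian fit to the null distribution, which allows P values smaller than 1/(number of permutations). It assumes responses are organized as (stimuli x time points) per repetition; this layout and the function names are ours.

```python
import numpy as np
from scipy.stats import norm, pearsonr

def test_retest_permutation_p(rep1, rep2, n_perm=1000, seed=0):
    """Permutation test for the test-retest correlation across repeated
    presentations. rep1, rep2: (n_stimuli, n_timepoints) responses from two
    presentations. The null is built by shuffling which stimulus in rep2 is
    paired with each stimulus in rep1."""
    rng = np.random.default_rng(seed)
    observed = pearsonr(rep1.ravel(), rep2.ravel())[0]
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(rep1.shape[0])
        null[i] = pearsonr(rep1.ravel(), rep2[perm].ravel())[0]
    # Gaussian fit to the null distribution permits very small P values.
    z = (observed - null.mean()) / null.std()
    return norm.sf(z)
```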
Following standard practice, we localized electrodes as bright spots on a post-operative computed tomography (CT) image or dark spots on an MRI, depending on which was available. The post-operative CT or MRI was aligned to a high-resolution, pre-operative MRI that was undistorted by electrodes. Each electrode was projected onto the cortical surface computed by FreeSurfer from the pre-operative MRI, excluding electrodes greater than 10 mm from the surface. We used the same correction procedure as in our prior studies7,67 to correct gross-scale errors: electrodes that are nearby in 3D space but far apart on the 2D cortical surface (for example, on two abutting gyri) are encouraged to be localized to regions where sound responses are common.
Estimating integration windows from noisy neural responses
Neural responses are noisy and therefore never produce exactly the same response across repeated presentations, even when the underlying response is context-invariant. Moreover, we are limited in the number of segment durations and segments that we can test owing to limited experimental time with patients. To address these challenges, we measure a noise ceiling for the cross-context correlation when the context is identical, using repeated presentations of the same segment sequence. The noise ceiling is a stimulus-dependent measure because it reflects the relative strength of the stimulus-driven and noise variance, which necessarily varies with the stimuli. Therefore, we compute a separate noise ceiling for each time lag, segment duration and speech rate. We then estimate the integration window by finding a parametric window that best predicts the cross-context correlation, pooling across all lags and segment durations. The predictions are computed by multiplying a noise-free prediction from the parametric model window by the measured noise ceiling. We have previously described and justified our parametric model in detail and have extensively tested the method, showing that it can correctly estimate integration windows from a variety of ground-truth models without substantial bias using noisy, broadband gamma responses with signal-to-noise ratios similar to those in actual neural data7. We report results from 110 electrodes (out of 132) for which the model predictions were highly significant (P < 0.001; measured using a significance test with phase-randomized predictions7).
In our original formulation, the window was parametrized using a gamma probability density function, with three parameters that control the width, delay and shape of the window (note that the window is not treated as a probability distribution; the gamma window just provides a convenient parametric form). We found previously that the best-fitting delay is highly correlated with the width and is close to the minimum possible value for a given width and shape, and therefore constrained the delay to take this minimum value to reduce the number of free parameters. For a structure-yoked response, we predict that the window will scale with the temporal scaling of the stimulus structures, which will change the width but not the shape. Therefore, we constrained the shape of the window to be the same for compressed and stretched speech for our main analysis. When we did not constrain the shape to be the same, there was more variance between the estimates for stretched and compressed speech, and overall we observed slightly higher structure yoking (median structure-yoking index of 0.15 for untied shapes versus 0.04 for tied shapes; Extended Data Fig. 4). To determine whether these differences were primarily a result of lower reliability or genuine differences in the integration window shape, we estimated integration windows using two different splits of data (non-overlapping segments). We found that tying the shapes increased the reliability of the estimates, measured as the Spearman correlation between splits (untied correlation, 0.614; tied correlation, 0.716). We also found that tying improved predictions for untied estimates. Specifically, we found that correlating the untied estimates from one split with the tied estimates from another split increased the correlation (0.668) compared with just correlating two untied estimates from different splits (0.614). These results suggest that the primary effect of tying is to enhance the reliability of the estimates. We also measured the average cross-context correlation within each annular ROI as a simple, model-free way of assessing the average integration window for stretched and compressed speech (Fig. 3e). The results of this model-free analysis support our model-dependent results by showing that integration windows increase in non-primary regions but change little with structure duration. Although these results do not rule out the possibility that there might be small changes in the integration window between stretched and compressed speech, they point to the same conclusion: that time-yoked integration predominates throughout the auditory cortex, even in non-primary regions.
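To make the parametric form concrete, the sketch below generates a gamma-shaped window from shape and width parameters. It is a simplified illustration: the authors' exact mapping between the gamma parameters and the reported width, and the constraint that the delay takes its minimum possible value for a given width and shape, follow their earlier publication7 and are not reproduced here; the width-to-scale mapping in the code is an assumption.

```python
import numpy as np
from scipy.stats import gamma

def gamma_window(width_ms, shape, sr=100, dur_ms=4000):
    """Gamma-density integration window (used only as a parametric form, not
    as a probability distribution). `shape` controls asymmetry; `width_ms` is
    mapped to the gamma scale parameter via a crude width/shape ratio
    (assumption). Returns a unit-area window sampled at `sr` Hz."""
    t = np.arange(0, dur_ms / 1000, 1 / sr)       # time axis in seconds
    scale = (width_ms / 1000) / shape             # crude width -> scale mapping (assumption)
    w = gamma.pdf(t, a=shape, scale=scale)
    return w / w.sum()
```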
We also applied our parametric window analysis to our STRF and phoneme integration models, which revealed the expected result with time-yoked integration for our STRF model and structure-yoked integration for our phoneme model (Extended Data Fig. 2). It was not possible to apply our parametric window analysis to our DANN model because, unlike the brain, the model’s responses are acausal, and a gamma-distributed window is therefore inappropriate.
Timecourse rescaling
We measured the degree to which the neural response timecourse rescales by correlating the response timecourse to stretched and compressed speech after rescaling the timecourse for the compressed speech (accomplished by resampling; that is, upsampling by a factor of three, and then discarding the additional time points). After rescaling, we measured the average correlation between all pairs of stimulus repetitions, and we measured a ceiling for this across-condition correlation (stretched versus rescaled compressed) by measuring the average test–retest correlation across repetitions for the same condition (stretched versus stretched and rescaled compressed versus rescaled compressed). We only used responses to the intact 27 s stimuli for this analysis. We applied the same analysis to the trained and untrained DANN models.
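A minimal sketch of the rescaling step is shown below. The factor of three and the discarding of extra time points follow the description above; the use of linear interpolation for upsampling is an assumption, and the function names are ours.

```python
import numpy as np

def rescale_compressed_response(resp_compressed, factor=3):
    """Upsample the response to compressed speech by `factor` (so its content
    unfolds at the same rate as the response to stretched speech) and keep
    only the time points that overlap the stretched-speech response."""
    n = len(resp_compressed)
    t_out = np.arange(n * factor) / factor                    # factor-times denser sample grid
    upsampled = np.interp(t_out, np.arange(n), resp_compressed)
    return upsampled[:n]                                      # discard the additional time points

def rescaling_correlation(resp_stretched, resp_compressed, factor=3):
    """Correlation between the stretched-speech response and the rescaled
    compressed-speech response (averaging across repetitions is omitted)."""
    rescaled = rescale_compressed_response(resp_compressed, factor)
    return np.corrcoef(resp_stretched, rescaled)[0, 1]
```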
Statistics and reproducibility
No formal tests were used to determine the sample size, but the number of participants (15 in our primary experiment) was larger than in most intracranial studies, which often test fewer than ten participants. The only data inclusion or exclusion criterion was the presence of a reliable response to sound (described above). Unlike most intracranial studies, we obtained responses from a sufficient number of participants to perform across-subject statistics and used a linear mixed-effects (LME) model to account for subjects as a random effect (using lmefit.m in MATLAB). To evaluate whether the integration windows differed between stretched and compressed speech, we used the following model:

log(is) − log(ic) = β0 + S0,s + ε

which models the logarithmic difference in integration windows between stretched (is) and compressed (ic) speech using a fixed-effects intercept (β0) plus a subject-specific random intercept (S0,s).
To examine the effect of distance (d) on the overall integration window across stretched and compressed speech (denoted ī here), we modeled ī as a function of distance to the primary auditory cortex:

ī = β0 + β1·d + S0,s + S1,s·d + ε
To evaluate the effect of distance on structure yoking, we modeled the difference in integration windows (which is proportional to the structure-yoking index) as a function of distance:

log(is) − log(ic) = β0 + β1·d + S0,s + S1,s·d + ε
To evaluate whether structure yoking increased at longer timescales, we modeled the difference in integration windows as a function of the overall integration window:

log(is) − log(ic) = β0 + β1·ī + S0,s + S1,s·ī + ε
We evaluated whether there was a significant effect of timecourse rescaling by fitting an intercept-only model to the difference in correlations between rescaled and non-rescaled responses (Δr):

Δr = β0 + S0,s + ε
We used a Bayesian implementation of LME models because we found that maximum likelihood estimates often resulted in zero-variance random-effects terms for subjects. Bayesian LME models were implemented in Stan (CmdStan v.2.35). For a model with both intercepts and slopes, the Bayesian model took the form:

y = β0 + β1·x + S0,s + S1,s·x + ε

where β0 and β1 are the fixed intercepts and slopes, respectively, and S0,s and S1,s are the random intercepts and slopes for subjects. We used weakly informative priors for all parameters.
We report the standard deviation and 90% credible intervals of the posterior distribution for the fixed-effect parameters of interest. Data (y) were standardized to unit variance, and predictors (x) were standardized and demeaned. The standardization factors were accounted for when reporting effect sizes (for example, multiplying by the standard deviation of the data when reporting intercept terms). We verified convergence by checking the R-hat (R̂) statistic, which was always close to 1 (±0.001), and by examining trace plots, which showed clear evidence of mixing (ten chains, 10,000 samples per chain, 1,000-sample burn-in period). Some chains occasionally had divergent transitions, which we addressed using a non-centered parametrization and a high adapt_delta parameter (0.99). Parameters were initialized to the following values: β0 and β1 to 0; τ0,s and τ1,s to 0.1; and σ to 1. Results were robust to all of these choices (that is, similar for a centered parametrization and robust to initialization). Posterior distributions showed clear evidence of unimodality.
Bootstrapping was used to compute error bars (Fig. 3c,d), resampling both subjects and electrodes with replacement (resampling subjects, and for each sampled subject, resampling electrodes). Error bars plot the central 68% interval (equivalent to one standard deviation) of the bootstrapped distribution.
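The sketch below illustrates this hierarchical bootstrap. The summary statistic (the median across electrodes) is chosen for illustration only and may differ from the statistics plotted in the figure; function names are ours.

```python
import numpy as np

def hierarchical_bootstrap(values, subject_ids, n_boot=10_000, seed=0):
    """Bootstrap a statistic (here the median across electrodes) by resampling
    subjects with replacement and, within each sampled subject, resampling
    that subject's electrodes with replacement. Returns the central 68%
    interval of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)

    stats = np.empty(n_boot)
    for b in range(n_boot):
        sampled_subjects = rng.choice(subjects, size=len(subjects), replace=True)
        resampled = []
        for s in sampled_subjects:
            vals_s = values[subject_ids == s]
            resampled.append(rng.choice(vals_s, size=len(vals_s), replace=True))
        stats[b] = np.median(np.concatenate(resampled))
    return np.percentile(stats, [16, 84])  # central 68% interval
```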
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41593-025-02060-8.
Acknowledgements
We thank L. Long for help with data collection. This study was supported by the National Institutes of Health (NIDCD-K99/R00-DC018051, NIDCD-R01-DC020960) and by a grant from Marie-Josée and Henry R. Kravis. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author contributions
S.V.N.-H. and M.K. collected data for the experiments. O.D., W.D., G.M.M. and C.A.S. collectively planned, coordinated and executed the neurosurgical electrode implantation needed for intracranial monitoring. S.V.N.-H. performed the analyses of intracranial data, and M.K. performed the analyses of the phoneme/STRF and DANN models. S.V.N.-H. designed and implemented the experiment with mentorship and guidance from N.M. and A.F. S.V.N.-H. wrote the paper with feedback from N.M. and A.F., as well as all other co-authors.
Peer review
Peer review information
Nature Neuroscience thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Data availability
Data needed to replicate the results of this study are available on GitHub (https://github.com/snormanhaignere/time-vs-structure).
Code availability
Code needed to replicate the results of this study is available on GitHub (https://github.com/snormanhaignere/time-vs-structure).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Sam V. Norman-Haignere, Menoua Keshishian.
Contributor Information
Sam V. Norman-Haignere, Email: samuel_norman-haignere@urmc.rochester.edu
Nima Mesgarani, Email: nima@ee.columbia.edu.
Extended data is available for this paper at 10.1038/s41593-025-02060-8.
Supplementary information
The online version contains supplementary material available at 10.1038/s41593-025-02060-8.
References
- 1.Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nat. Rev. Neurosci.8, 393–402 (2007). [DOI] [PubMed] [Google Scholar]
- 2.Patel, A. D. Music, Language, and the Brain (Oxford University Press, 2007).
- 3.Lerner, Y., Honey, C. J., Silbert, L. J. & Hasson, U. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. J. Neurosci.31, 2906–2915 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Farbood, M. M., Heeger, D. J., Marcus, G., Hasson, U. & Lerner, Y. The neural processing of hierarchical structure in music and speech at different timescales. Front. Neurosci.9, 157 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Rauschecker, J. P. & Scott, S. K. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat. Neurosci.12, 718–724 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Overath, T., McDermott, J. H., Zarate, J. M. & Poeppel, D. The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nat. Neurosci.18, 903–911 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Norman-Haignere, S. V. et al. Multiscale integration organizes hierarchical computation in human auditory cortex. Nat. Hum. Behav.6, 455–469 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Khatami, F. & Escabí, M. A. Spiking network optimized for word recognition in noise predicts auditory system hierarchy. PLoS Comput. Biol.16, e1007558 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Sharpee, T. O., Atencio, C. A. & Schreiner, C. E. Hierarchical representations in the auditory cortex. Curr. Opin. Neurobiol.21, 761–767 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.House, A. S. On vowel duration in English. J. Acoust. Soc. Am.33, 1174–1178 (1961). [Google Scholar]
- 11.Chi, T., Ru, P. & Shamma, S. A. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am.118, 887–906 (2005). [DOI] [PubMed] [Google Scholar]
- 12.Dau, T., Kollmeier, B. & Kohlrausch, A. Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J. Acoust. Soc. Am.102, 2906–2919 (1997). [DOI] [PubMed] [Google Scholar]
- 13.Meyer, A. F., Williamson, R. S., Linden, J. F. & Sahani, M. Models of neuronal stimulus-response functions: elaboration, estimation, and evaluation. Front. Syst. Neurosci.10, 109 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Sadagopan, S., Kar, M. & Parida, S. Quantitative models of auditory cortical processing. Hear. Res.429, 108697 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Woolley, S. M. N., Fremouw, T. E., Hsu, A. & Theunissen, F. E. Tuning for spectro-temporal modulations as a mechanism for auditory discrimination of natural sounds. Nat. Neurosci.8, 1371–1379 (2005). [DOI] [PubMed] [Google Scholar]
- 16.Walker, K. M. M., Bizley, J. K., King, A. J. & Schnupp, J. W. H. Multiplexed and robust representations of sound features in auditory cortex. J. Neurosci.31, 14565–14576 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Thorson, I. L., Liénard, J. & David, S. V. The essential complexity of auditory receptive fields. PLoS Comput. Biol.11, e1004628 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Hullett, P. W., Hamilton, L. S., Mesgarani, N., Schreiner, C. E. & Chang, E. F. Human superior temporal gyrus organization of spectrotemporal modulation tuning derived from speech stimuli. J. Neurosci.36, 2014–2026 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Norris, D. & McQueen, J. M. Shortlist B: a Bayesian model of continuous speech recognition. Psychol. Rev.115, 357 (2008). [DOI] [PubMed] [Google Scholar]
- 20.Brodbeck, C., Hong, L. E. & Simon, J. Z. Rapid transformation from auditory to linguistic representations of continuous speech. Curr. Biol.28, 3976–3983 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Jain, S., Vo, V. A., Wehbe, L. & Huth, A. G. Computational language modeling and the promise of in silico experimentation. Neurobiol. Lang.5, 80–106 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nat. Neurosci.25, 369–380 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Caucheteux, C., Gramfort, A. & King, J.-R. Evidence of a predictive coding hierarchy in the human brain listening to speech. Nat. Hum. Behav.7, 430–441 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Marslen-Wilson, W. D. & Welsh, A. Processing interactions and lexical access during word recognition in continuous speech. Cogn. Psychol.10, 29–63 (1978). [Google Scholar]
- 25.Gwilliams, L., King, J.-R., Marantz, A. & Poeppel, D. Neural dynamics of phoneme sequences reveal position-invariant code for content and order. Nat. Commun.13, 6606 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Donhauser, P. W. & Baillet, S. Two distinct neural timescales for predictive speech processing. Neuron105, 385–393 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Hamilton, L. S., Oganian, Y., Hall, J. & Chang, E. F. Parallel and distributed encoding of speech across human auditory cortex. Cell184, 4626–4639 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Di Liberto, G. M., O’Sullivan, J. A. & Lalor, E. C. Low-frequency cortical entrainment to speech reflects phoneme-level processing. Curr. Biol.25, 2457–2465 (2015). [DOI] [PubMed] [Google Scholar]
- 29.Keshishian, M. et al. Estimating and interpreting nonlinear receptive field of sensory neural responses with deep neural network models. eLife9, e53445 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Harper, N. S. et al. Network receptive field modeling reveals extensive integration and multi-feature selectivity in auditory cortical neurons. PLoS Comput. Biol.12, e1005113 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Norman-Haignere, S. V. & McDermott, J. H. Neural responses to natural and model-matched stimuli reveal distinct computations in primary and nonprimary auditory cortex. PLoS Biol.16, e2005127 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Pennington, J. R. & David, S. V. A convolutional neural network provides a generalizable model of natural sound coding by neural populations in auditory cortex. PLoS Comput. Biol.19, e1011110 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Giordano, B. L., Esposito, M., Valente, G. & Formisano, E. Intermediate acoustic-to-semantic representations link behavioral and neural responses to natural sounds. Nat. Neurosci.26, 664–672 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Li, Y. et al. Dissecting neural computations in the human auditory pathway using deep neural networks for speech. Nat. Neurosci.26, 2213–2225 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Keshishian, M., Norman-Haignere, S. V. & Mesgarani, N. Understanding adaptive, multiscale temporal integration in deep speech recognition systems. Adv. Neural. Inf. Process. Syst.34, 24455–24467 (2021). [PMC free article] [PubMed] [Google Scholar]
- 36.Nourski, K. V. et al. Temporal envelope of time-compressed speech represented in the human auditory cortex. J. Neurosci.29, 15564–15574 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Lerner, Y., Honey, C. J., Katkov, M. & Hasson, U. Temporal scaling of neural responses to compressed and dilated natural speech. J. Neurophysiol.111, 2433–2444 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Theunissen, F. & Miller, J. P. Temporal encoding in nervous systems: a rigorous definition. J. Comput. Neurosci.2, 149–162 (1995). [DOI] [PubMed] [Google Scholar]
- 39.Angeloni, C. & Geffen, M. N. Contextual modulation of sound processing in the auditory cortex. Curr. Opin. Neurobiol.49, 8–15 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Dupoux, E. & Green, K. Perceptual adjustment to highly compressed speech: effects of talker and rate changes. J. Exp. Psychol. Hum. Percept. Perform.23, 914–927 (1997). [DOI] [PubMed] [Google Scholar]
- 41.Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA111, 8619–8624 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci.1, 417–446 (2015). [DOI] [PubMed] [Google Scholar]
- 43.Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V. & McDermott, J. H. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron98, 630–644 (2018). [DOI] [PubMed] [Google Scholar]
- 44.Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science343, 1006–1010 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Norman-Haignere, S. V., Kanwisher, N. G. & McDermott, J. H. Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron88, 1281–1296 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Keshishian, M. et al. Joint, distributed and hierarchically organized encoding of linguistic features in the human auditory cortex. Nat. Hum. Behav.7, 740–753 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Levy, I., Hasson, U., Avidan, G., Hendler, T. & Malach, R. Center–periphery organization of human object areas. Nat. Neurosci.4, 533–539 (2001). [DOI] [PubMed] [Google Scholar]
- 48.Blank, I. A. & Fedorenko, E. No evidence for differences among language regions in their temporal receptive windows. NeuroImage219, 116925 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Graves, A., Fernández, S., Gomez, F. & Schmidhuber, J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. 23rd International Conference on Machine Learning (eds Cohen, W. & Moore, A.) 369–376 (Omni Press, 2006).
- 50.Murray, J. D. et al. A hierarchy of intrinsic timescales across primate cortex. Nat. Neurosci.17, 1661 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Baldassano, C. et al. Discovering event structure in continuous narrative perception and memory. Neuron95, 709–721 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Chien, H.-Y. S. & Honey, C. J. Constructing and forgetting temporal context in the human cerebral cortex. Neuron106, 675–686 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Skrill, D. & Norman-Haignere, S. V. Large language models transition from integrating across position-yoked, exponential windows to structure-yoked, power-law windows. In Proc. 37th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 638–654 (Curran Associates, 2023). [PMC free article] [PubMed]
- 54.Chaudhuri, R., Knoblauch, K., Gariel, M.-A., Kennedy, H. & Wang, X.-J. A large-scale circuit mechanism for hierarchical dynamical processing in the primate cortex. Neuron88, 419–431 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Ding, N., Melloni, L., Zhang, H., Tian, X. & Poeppel, D. Cortical tracking of hierarchical linguistic structures in connected speech. Nat. Neurosci.19, 158–164 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Wang, X., Lu, T., Bendor, D. & Bartlett, E. Neural coding of temporal information in auditory thalamus and cortex. Neuroscience154, 294–303 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Williams, J. A. et al. High-order areas and auditory cortex both represent the high-level event structure of music. J. Cogn. Neurosci.34, 699–714 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.de Cheveigné, A. & Parra, L. C. Joint decorrelation, a versatile tool for multichannel data analysis. NeuroImage98, 487–505 (2014). [DOI] [PubMed] [Google Scholar]
- 59.Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5206–5210 (IEEE, 2015).
- 60.McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. Montreal forced aligner: trainable text-speech alignment using Kaldi. In Proc. 18th Annual Conference of the International Speech Communication Association: Interspeech 2017 498–502 (ISCA, 2017).
- 61.Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V. S. & Bengio, Y. Speech model pre-training for end-to-end spoken language understanding. Preprint at 10.48550/arxiv.1904.03670 (2019).
- 62.Bittner, R., Humphrey, E. & Bello, J. PySOX: leveraging the audio signal processing power of SOX in Python. In Proc. 17th International Society for Music Information Retrieval Conference: Late breaking/Demo (eds Devaney, J. et al.) (International Society for Music Information Retrieval, 2016).
- 63.Bialas, O., Dou, J. & Lalor, E. C. mTRFpy: a Python package for temporal response function analysis. J. Open Source Softw.8, 5657 (2023). [Google Scholar]
- 64.Crosse, M. J., Di Liberto, G. M., Bednar, A. & Lalor, E. C. The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front. Hum. Neurosci.10, 604 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Amodei, D. et al. Deep Speech 2: end-to-end speech recognition in English and Mandarin. In Proc. 33rd International Conference on Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 173–182 (JMLR, 2016).
- 66.Kawahara, H. & Morise, M. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana36, 713–727 (2011). [Google Scholar]
- 67.Norman-Haignere, S. V. et al. A neural population selective for song in human auditory cortex. Curr. Biol.32, 1470–1484.e12 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]