PLOS One. 2022 Oct 14;17(10):e0275631. doi: 10.1371/journal.pone.0275631

Hierarchical amplitude modulation structures and rhythm patterns: Comparing Western musical genres, song, and nature sounds to Babytalk

Tatsuya Daikoku 1,2,3,*, Usha Goswami 1
Editor: Yann Benetreau
PMCID: PMC9565671  PMID: 36240225

Abstract

Statistical learning of physical stimulus characteristics is important for the development of cognitive systems like language and music. Rhythm patterns are a core component of both systems, and rhythm is key to language acquisition by infants. Accordingly, the physical stimulus characteristics that yield speech rhythm in “Babytalk” may also describe the hierarchical rhythmic relationships that characterize human music and song. Computational modelling of the amplitude envelope of “Babytalk” (infant-directed speech, IDS) using a demodulation approach (Spectral-Amplitude Modulation Phase Hierarchy model, S-AMPH) can describe these characteristics. S-AMPH modelling of Babytalk has shown previously that bands of amplitude modulations (AMs) at different temporal rates and their phase relations help to create its structured inherent rhythms. Additionally, S-AMPH modelling of children’s nursery rhymes shows that different rhythm patterns (trochaic, iambic, dactylic) depend on the phase relations between AM bands centred on ~2 Hz and ~5 Hz. The importance of these AM phase relations was confirmed via a second demodulation approach (PAD, Probabilistic Amplitude Demodulation). Here we apply both S-AMPH and PAD to demodulate the amplitude envelopes of Western musical genres and songs. Quasi-rhythmic and non-human sounds found in nature (birdsong, rain, wind) were utilized for control analyses. We expected that the physical stimulus characteristics in human music and song from an AM perspective would match those of IDS. Given prior speech-based analyses, we also expected that AM cycles derived from the modelling may identify musical units like crotchets, quavers and demi-quavers. Both models revealed an hierarchically-nested AM structure for music and song, but not for nature sounds. This AM structure for music and song matched IDS. Both models also generated systematic AM cycles yielding musical units like crotchets and quavers.
Both music and language are created by humans and shaped by culture. Acoustic rhythm in IDS and music appears to depend on many of the same physical characteristics, facilitating learning.

1. Introduction

The potential parallels between language and music have long fascinated researchers in cognitive science. In this paper, we examine whether a statistical learning approach previously applied to understand the development of phonology as a cognitive system in language-learning infants and children may enable theoretical advances in understanding the acoustic basis of rhythm in music. Infant language learning has been argued to begin with speech rhythm [1], and infant-directed speech (IDS), also called Babytalk or Parentese, has been described as sing-song speech. The particular prosodic or quasi-musical characteristics of IDS have been suggested to explain both natural selection for human language from an anthropological perspective [2], and to facilitate infant learning of the phonological structure of human languages [3]. Although language acquisition by human infants was once thought to require specialized neural architecture, studies of infant statistical learning have revealed that basic acoustic processing mechanisms are sufficient for infants to learn phonology (speech sound structure at different linguistic levels such as words, syllables, rhymes and phonemes; e.g. [4]). Further, the cognitive capacity of statistical learning is not restricted to verbal language, but extends to non-linguistic sounds such as tones (e.g., [5, 6]), timbres (e.g., [7, 8]), as well as rhythm and timing (e.g., [9–11]). Children who exhibit difficulties with phonological learning also exhibit rhythm processing difficulties, with both speech and musical stimuli [12]. This implies that there may be inherent common statistical properties shared by language and music, and that such statistical properties contribute to the acquisition of both language and music [13].

Modelling of the speech signal aimed at understanding the potential sensory/neural statistical properties that underpin phonological and rhythmic learning in childhood has revealed a novel set of acoustic statistics that underpin speech rhythm in infant- and child-directed speech (IDS and CDS). These novel statistics are consistent across two different modelling approaches, a spectral-amplitude modulation phase hierarchy (S-AMPH) approach based on the neural speech encoding literature [14, 15], and probabilistic amplitude demodulation (PAD, [16, 17]). The S-AMPH model was first applied to English nursery rhymes and subsequently to Babytalk [18]. The key parameter that emerged with respect to rhythm in both models and for both genres was the phase relation between a band of amplitude modulations (AMs) centred on ~2 Hz and a band of AMs centred on ~5 Hz. When both bands peaked together, a strong syllable was heard. When a trough in the slower AM band (~2 Hz) coincided with a peak in the faster AM band (~5 Hz), a weak syllable was heard. Adult listeners’ perception of vocoded English nursery rhymes could be shifted from a trochaic to iambic rhythm simply by phase-shifting the slower AM band by 180 degrees. Related experimental work using PAD showed that the phase relations between peaks and troughs in AM bands centred on ~2 Hz and ~5 Hz were critical for perceiving rhythmic metrical patterning in nursery rhymes (trochaic versus iambic, [14, 15, 18, 19]). These phase relations between peaks and troughs in AM bands centred on ~2 Hz and ~5 Hz have also been revealed by statistical modelling of other languages like Portuguese and Spanish [20, 21]. For example, Pérez-Navarro et al. [21] reported that CDS in Spanish was characterized by higher temporal regularity of the placement of stressed syllables (phase synchronization of ~2 Hz and ~5 Hz AM bands) compared to ADS in Spanish.
Further, phase relations are statistical characteristics that describe music as well as language, and phase relations appear relatively uniform across music from different cultures [22, 23] and across songs of different species [24]. Even prior to the acquisition of culture-specific biases of musical rhythm, infants are affected by ratio complexity [25]. Thus, phase hierarchies may be a universal aspect of both music and language.
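The strong/weak syllable mechanism described above can be illustrated with a small synthetic sketch. Below, a delta-rate AM (here 2.5 Hz, the top of the delta band) is multiplied with a theta-rate AM (5 Hz); sampling the result at the theta peaks gives alternating strong and weak “syllables”, and shifting the slower AM by 180 degrees swaps the pattern, as in the trochaic-to-iambic perceptual shift. The specific rates and the multiplicative combination are illustrative assumptions, not the S-AMPH model itself.

```python
import numpy as np

FS = 1000  # analysis sampling rate in Hz (illustrative choice)
t = np.arange(0, 2.0, 1.0 / FS)

def envelope(delta_phase_deg=0.0):
    """Product of a delta-rate (2.5 Hz) and a theta-rate (5 Hz) AM.
    Each theta peak is one 'syllable'; its strength depends on the
    phase of the slower delta-rate AM at that moment."""
    delta = 0.5 * (1 + np.cos(2 * np.pi * 2.5 * t + np.deg2rad(delta_phase_deg)))
    theta = 0.5 * (1 + np.cos(2 * np.pi * 5.0 * t))
    return delta * theta

def syllable_strengths(env, n=4):
    # sample the envelope at the first n theta peaks (one every 200 ms)
    idx = (np.arange(n) * 0.2 * FS).astype(int)
    return np.round(env[idx], 2)

print(syllable_strengths(envelope(0.0)))    # strong-weak-strong-weak (trochaic)
print(syllable_strengths(envelope(180.0)))  # weak-strong-weak-strong (iambic)
```

Only the relative phase of the two AMs changes between the two calls; the component rates and amplitudes are identical.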

Accordingly, here we investigate the characteristics of music and child songs from the same S-AMPH modelling perspective previously applied to English, Portuguese and Spanish. In particular, it is of interest to establish whether the phase dependency between bands of AMs centred on ~2 Hz and ~ 5 Hz will relate to musical rhythm across different genres. Theoretically, it is plausible that the physical stimulus characteristics that describe rhythm patterns in nursery rhymes and IDS may also describe the hierarchical rhythmic relationships that characterize music and child songs. According to anthropological analyses [2], it was IDS that emerged first, subsequently enabling the development of adult-directed speech (ADS, which is notably not sing-song in nature). As primitive human cultures also developed music, the same evolutionary adaptations that enabled Babytalk may underpin music as well. That is, it is possible that the AM hierarchy in music has similar structure to the AM hierarchy in IDS. The core research question addressed here is whether music will exhibit similar salient bands of AMs and similar phase dependencies between AM bands to IDS and English nursery rhymes (child-directed speech, CDS).

The theoretical framework underpinning the AM modelling approach used for CDS and Babytalk was Temporal Sampling theory (TS theory, [26]). TS theory was initially developed to explain why children with language disorders show difficulties in AM processing, with the aim of supporting musical interventions. TS theory now provides a systematic sensory/neural/cognitive framework for explaining childhood language disorders [27]. TS theory proposes that accurate sensory/neural processing of the amplitude envelope of speech is one foundation of language acquisition, and that impairments in discriminating key aspects of the envelope such as amplitude rise times at different temporal rates (which relate to speech rhythm) is one cause of developmental language disorders [28]. The amplitude envelope of any sound is the slower changes in AM (intensity or signal energy) that unfold over time. The amplitude rise time of the vowel in any syllable is a core acoustic feature related to speech rhythm [29]. Amplitude rise times are important for the perception of rhythm because they determine the acoustic experience of “P-centers.” P-centers are the perceptual moment of occurrence (“perceptual center”) of each musical beat or syllable for the listener [30, 31]. Amplitude rise times are typically called attack times in the musical literature [32, 33]. By TS theory, it is the rhythmic components of musical therapies for children that explain the language gains that are found, for example via the matching of the P-centers of syllable beats and musical beats [27, 34]. If musical remediation of developmental language disorders is to be optimised, then by TS theory the rhythm structures that underpin language and music should be matched at the level of physical stimulus characteristics. 
It is known that developmental disorders of both syntax and phonology can be helped by musical interventions [35], but related modelling of the amplitude envelope of the music used in such interventions (typically classical music) has yet to be carried out. We present such modelling of classical music and also other Western genres here.

As a control for our prediction that the AM structure of music and IDS/CDS should be highly similar, we also modelled other natural sounds that have quasi-rhythmic structure such as wind, fire, river, storms, rain, as well as non-human vocal sounds, namely birdsong. A priori, we expect nature sounds to have a different AM structure to IDS and CDS. Nature sounds such as rain and storms were originally used to derive PAD [16], and are characterized by AM patterns correlated over long time scales and across multiple frequency bands. However, as these sounds are not produced by humans nor shaped by human physiology and culture, there is no reason a priori to expect them to be similar in AM structure to IDS and CDS. Birdsong may be different, as it is more musically sophisticated and closer to human song than the other nature sounds such as wind, fire, river, storms, and rain. Indeed, a previous study revealed that the structure of nightingale rhythms, unlike the song rhythms of other birds such as zebra finches, is similar to the structure of human musical rhythms [24]. Therefore, we also modelled the corpus of nightingale song studied by Roeske et al. [24]. We expected the AM patterns here to be more similar to IDS and CDS than the AM patterns for wind, rain etc. Other approaches to modelling hierarchical temporal relations in sound signals, such as the Allan Factor approach (which detects clusters of peaks in the amplitude envelope), have suggested that thunderstorms and classical music have a similar hierarchical temporal modulation structure [36]; we would not predict this. An Allan Factor approach only reveals the overall degree of clustering found in a sound signal according to window lengths input by the modeller, which for Kello et al. [36] varied from 15 ms to 15 seconds. In contrast, the S-AMPH modelling approach utilizes the known characteristics of the human cochlea to determine its windows. Further, as noted by Kello et al. [36] themselves, an Allan Factor approach does not throw light on the relationship between individual clusters that may be identified and linguistic units. This stands in marked contrast to the S-AMPH approach to modelling infant and child language [15].

The S-AMPH model analyses the AM structure of the amplitude envelope of any sound by separating the AM characteristics from the frequency modulation (FM) characteristics. This is achieved by acoustic engineering methods for decomposing the amplitude envelope (demodulation; [17]). Demodulation approaches to characterizing the physical stimulus structure of IDS and CDS decompose the amplitude envelope into the same narrow bands imposed by the human cochlea, and then seek systematic patterns of AM [15, 18]. The AM patterns are associated with fluctuations in loudness or sound intensity, a primary acoustic correlate of perceived rhythm which is based on onset timing, beat, accent, and grouping [37]. In contrast, the FM patterns can be interpreted as fluctuations in pitch and noise [16]. Prior analyses of the average modulation spectra of Western musical genres have revealed a peak at ~2 Hz, consistent across genres like jazz, rock and classical music [38]. This is theoretically interesting, as the 2 Hz peak observed by Ding et al. [38] for music matches the modulation peak in IDS identified by S-AMPH modelling [18]. It is notable that Allan Factor modelling, which identifies nested clusters of peaks in the amplitude envelope, also finds differences between the temporal modulation structure of jazz, rock and classical music respectively. Ding et al. [38] did not find such differences in their modulation spectra approach. As the S-AMPH also takes a modulation spectra approach (but governed by knowledge about cochlear function), here we expected that the musical genres explored (which were adopted from [38]) would show a similar modulation structure to each other, as well as to IDS and CDS.

The music listener also needs to identify discrete units to gain meaning, for example musical notes and phrasing. This is analogous to infants needing to identify discrete units like syllables, words and syntactic phrases from the prosodic rhythm structure of IDS. The systematic patterns of AM nested in the amplitude envelope of both IDS and CDS, in five core spectral bands, have been demonstrated to support the identification of these discrete units. For example, application of the S-AMPH to English nursery rhyme corpora showed that the model identified 72% of stressed syllables correctly, 82% of syllables correctly, and 78% of onset-rime units correctly if a particular AM cycle was assumed to match a particular speech unit [15]. If the nursery rhymes were chanted to a regular 2 Hz beat, then the model identified over 90% of each type of linguistic unit. Accordingly, decomposition of the amplitude envelope of different musical genres may identify similar hierarchical AM structures in predictable spectral bandings that provide a perceptual basis for perceiving rhythm patterns, musical notes and musical phrasing. Whether music will exhibit similar salient bands of AMs, similar spectral banding and similar phase dependencies between AM bands to IDS and CDS is currently unknown.

It should also be noted that the modulation statistics of adult-directed speech (ADS) as revealed by the S-AMPH modelling approach are markedly different to IDS [18, 20]. ADS has significantly weaker phase synchronization between the slower bands of AMs centred on ~2 Hz and ~5 Hz compared to IDS, probably reflecting the fact that ADS is not sing-song or rhythmic. However, ADS has significantly stronger phase synchronization of bands of AMs centred on ~5 Hz and ~ 20 Hz compared to IDS. These different modulation statistics for ADS can be interpreted as increasing the salience of acoustic information related to the phonemes in syllables [20]. The differences in statistical AM structure of the amplitude envelope of ADS vs IDS have been hypothesized to reflect the acquisition of literacy [20]. This was because the phase synchronisation between bands of AMs centred on ~2 Hz and ~5 Hz, and also ~5 Hz and ~ 20 Hz, in natural conversational adult speech increased parametrically with literacy levels (illiterate, low literate, high literate, see [20]). The acquisition of literacy remaps phonology in the human brain [39]. Music and language are ubiquitous in human societies [22], but literacy is a relatively recent cultural acquisition, so arguably the AM structure of music is more likely a priori to match IDS than to match ADS. Wind, rain, storms and birdsong have also all been present since early hominid times, but their statistical structure has not been constrained by the human brain. It was thus expected a priori that the range of nature sounds would show different statistical AM structures to music, with the possible exception of nightingale song.

Two contrasting mathematical approaches to demodulation of the amplitude envelope of music, song and nature sounds were employed, the S-AMPH [15], and PAD [16, 17]. Both models parse the amplitude envelope of the signals into an hierarchy of AM bands, but the principles underpinning their operation are different. The S-AMPH simulates the frequency decomposition known to be carried out by the cochlea [40–42], thereby aiming to decompose the amplitude envelope of music in the same way as the human ear. PAD infers the modulators and carriers in the envelope based purely on Bayesian inference, thereby carrying out amplitude demodulation on a neutral statistical basis that makes no adjustments for the human hearing system. PAD is thus a “brain-neutral” approach, but the use of Bayesian statistics means that it may reveal priors relevant to human neural learning [43]. Our expectation that the perception of musical meter may depend on the temporal alignment of AM bands centred on ~2 Hz and ~5 Hz also relates to linguistic theory [44–46]. Classically, hierarchical linguistic structures like the phonological hierarchy of prosodic, syllabic, rhyme and phoneme levels nested within speech rhythm are represented as a tree that captures the relative prominence of units [46, 47]. Such tree representations may also provide a good model of the core principles of metrical structure in music [48]. In the tree representation, a “parent” node (element) at one tier of the hierarchy encompasses one or more “daughter” nodes at a lower level of the hierarchy. The connections between adjacent parent and daughter nodes are indicated as “branches” in the tree. To give an example from CDS, a parent node such as the trisyllabic word “pussycat” in the nursery rhyme “Pussycat pussycat where have you been,” which is also the prosodic foot, would have 3 daughter nodes at the next hierarchical level, comprising the three syllables.
From the prior S-AMPH modelling, the level of the prosodic foot would be derived from the cycles of AM at the ~2 Hz rate. Two AM cycles would encompass all three daughter nodes in “pussycat”, while the individual syllables would be derived from the cycles of AM at the ~5 Hz rate. The phase alignment of the ~2 Hz and ~5 Hz AM cycles would then determine metrical structure. When modelled with the S-AMPH, English nursery rhymes with different metrical structures like “Jack and Jill went up the hill” (trochaic rhythm), “As I was going to St Ives” (iambic rhythm) and “Pussycat pussycat where have you been” (dactyl rhythm) all showed the same acoustic hierarchical AM structure, with three core AM bands centered on ~2 Hz, ~5 Hz, and ~20 Hz. Which metrical structure was perceived by the listener depended on the temporal alignment of AM peaks in the two slower AM bands identified by the S-AMPH, centred on ~2 Hz and ~5 Hz [14].

We note that in the previous S-AMPH research the terms “delta-rate” and “theta-rate” AM bands were adopted to describe the results of the speech demodulation analyses (see also [18]). The band of AMs centred on ~2 Hz was designated the delta-rate AM band, and the band of AMs centred on ~5 Hz was designated the theta-rate AM band. This was because TS theory was based in part on the neural oscillatory bands that track human speech in adult cortex [49–55]. The AM bands in the speech signal revealed by the S-AMPH modelling equate temporally to electrophysiological rhythms found across the brain at the oscillatory rates of delta, theta and beta-low gamma. It is known that human speech perception relies in part on neural tracking of the temporal modulation patterns in speech at different timescales simultaneously. These temporal modulation patterns are then bound into a single speech percept (“multi-time resolution processing”) [49–51, 56]. This neural tracking (also described as phase alignment, temporal alignment or entrainment) relies on acoustic components of the speech signal such as the amplitude rise times of nested AM components phase-resetting oscillatory cortical activity. In adult work, neural (“speech-brain”) alignment has been shown to contribute to parsing of the speech signal into phonological units such as syllables and words [56]. For language, delta, theta, and beta/gamma oscillators in auditory cortex appear to contribute to the perception of prosodic, syllabic, and phonetic information respectively [49, 55, 57–60]. For music, oscillatory rhythms may align with rhythmic features of the acoustic input such as crotchets or musical beats [61–65]. However, possible correspondences between different oscillators and different musical units like crotchets and quavers have yet to be investigated.

Finally, there are mechanistic phase dependencies in the neural system which mirror the acoustic phase dependencies between AM bands revealed by the S-AMPH modelling of IDS, CDS and ADS. The biological evidence shows that the adjacent-band neural oscillators are not independent of, but interdependent on, each other [59, 66]. For example, the phase of delta oscillators modulates the phase of theta oscillators, and theta phase modulates beta/gamma power [59]. To date, despite a number of studies of music encompassing brain-based analyses [61, 63, 67–69], no studies have examined the temporal correlates of musical rhythm from an amplitude demodulation perspective. Our prior speech modelling suggests that it is biologically plausible to propose that rhythm perception in music and language may depend on neural entrainment to the AM hierarchies nested in the amplitude envelope of music versus IDS/CDS respectively, and that parsing of units in language and music may be an automatic consequence of neural entrainment to this hierarchy. Regarding musical signals, there are already relevant data. For example, it has been shown that neural phase locking to periodic rhythms present in musical tempi is selectively enhanced compared to frequencies unrelated to the beat and meter [65, 68]. Further, Di Liberto and colleagues revealed that musical expertise increases the accuracy of cortical tracking [62].

However, to date the amplitude envelope of different musical inputs has not been decomposed in order to discover whether beat and meter are systematically related to adjacent bands of AMs that are physically connected by mutual phase dependencies. These phase dependencies between AM bands should be consistent across different beat rates falling within each AM band since, like electrophysiological bandings, the AM bands span a range of temporal rates (e.g., S-AMPH ‘delta’ AM band, 0.9–2.5 Hz; ‘theta’ AM band, 2.5–7 Hz; see Supplementary Figure, Table c in S4 Appendix). This enables the phase dependencies to be maintained across environmental variations such as speaker rate or musical tempo. Given the biological evidence that each neural oscillator modulates the adjacent-band oscillator during speech perception [59, 66], and our prior acoustic modelling data with the S-AMPH, we also hypothesized that the adjacent tiers in the temporal hierarchies of music would be highly dependent on each other compared with non-adjacent tiers, particularly for delta-theta AM coupling. By hypothesis, phase locking to different bands of AM present in the amplitude envelope of each musical genre may enable parsing of the signal to yield the perceptual experience of musical components such as minim, crotchet, and quaver (half, quarter, and eighth notes). The acoustic structure of the amplitude envelope should also contribute systematically to the perceptual experience of beat, tempo, and musical phrasing.
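The kind of delta-theta coupling at issue can be quantified with an n:m phase synchronization index between band-limited modulators. The sketch below, on synthetic delta-rate (2.5 Hz) and theta-rate (5 Hz) signals, is our own illustrative choice of index and test material, not the measure used in the S-AMPH or PAD analyses.

```python
import numpy as np
from scipy.signal import hilbert

FS = 1000
t = np.arange(0, 10.0, 1.0 / FS)
rng = np.random.default_rng(0)

# synthetic delta-rate (2.5 Hz) and theta-rate (5 Hz) modulators;
# theta_locked keeps a fixed 1:2 relation to the delta AM, while
# theta_drift lets its phase wander randomly (no stable coupling)
delta_am = np.cos(2 * np.pi * 2.5 * t)
theta_locked = np.cos(2 * np.pi * 5.0 * t)
theta_drift = np.cos(2 * np.pi * 5.0 * t
                     + np.cumsum(rng.standard_normal(t.size)) * 0.05)

def nm_psi(slow, fast, n=1, m=2):
    """n:m phase synchronization index (0 = none, 1 = perfect):
    magnitude of the mean of exp(i*(m*phase_slow - n*phase_fast)),
    with phases taken from the analytic (Hilbert) signals."""
    ph_slow = np.angle(hilbert(slow))
    ph_fast = np.angle(hilbert(fast))
    return np.abs(np.mean(np.exp(1j * (m * ph_slow - n * ph_fast))))

print(nm_psi(delta_am, theta_locked))  # near 1: tight delta-theta coupling
print(nm_psi(delta_am, theta_drift))   # lower: coupling broken by drift
```

Because the index compares m cycles of the slow phase with n cycles of the fast phase, it stays high across tempo changes as long as the 1:2 relation between the tiers is preserved.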

Note finally that our modelling approach is theoretically distinct from models that seek to identify the tactus or beat markers in singing [70], models of pulse perception based on neural resonance [71], oscillatory models of auditory attention based on dynamic attending [72], and models of temporal hierarchical structure based on the Allan Factor approach [36, 73]. Ours is the only modelling approach to analyze the modulation structure of the amplitude envelope and further to make specific a priori predictions concerning expected key temporal AM rates and key hierarchical AM phase relations related to the perception of musical rhythm structure and the parsing of musical units. We predict that the phase dependency between bands of AMs centred on ~2 Hz and ~ 5 Hz will relate to musical rhythm across different genres, and that music will show similar hierarchical AM structures in predictable spectral bandings to IDS, structures that can provide a perceptual basis for perceiving musical notes and musical phrasing. The amplitude envelope is recognized as core to speech processing by speech engineers [29]. Our modelling decomposes the amplitude envelope of music instead of speech and then relates the resulting AM bands and their phase relationships to individual musical units. In principle, this approach provides a novel acoustic perspective on musical rhythm, motivated by our prior novel acoustic analyses of Babytalk.

2. Materials and methods

The music samples for modelling consisted of the music corpora used in the study by Ding et al. [38], with the addition of 23 children’s songs in order to characterize more general properties of modulation spectra across musical genres. The final samples consisted of over 39 h of recordings (sampling rate = 44.1 kHz) of Western music (Western-classical music, Jazz, adult song, and children’s songs) and musical instruments (single-voice, Violin, Viola, Cello, and Bass; multi-voice, Piano and Guitar). In addition, a range of natural sounds like birdsong (nightingale), wind and rain were extracted from sound files available on the internet (https://mixkit.co; https://www.zapsplat.com; https://www.xeno-canto.org/). The sample size and number of items in each category is provided in S1 Appendix.

The acoustic signals were z-score normalized (mean = 0, SD = 1). The spectro-temporal modulation of the signals was analyzed using two different algorithms for deriving the dominant AM patterns: Probabilistic Amplitude Demodulation based on Bayesian inference (PAD; [16]) and Spectral Amplitude Modulation Phase Hierarchy (S-AMPH; [15]). The PAD model infers the modulators and a carrier based on Bayesian inference. PAD is biologically neutral and can be run recursively using different demodulation parameters each time to identify potential “priors” in the input stimulus. The S-AMPH model is a low-dimensional representation of the auditory signal, using an equivalent rectangular bandwidth (ERBN) filterbank, which simulates the frequency decomposition by the cochlea [40, 42, 74]. The number and edges of the bands are determined by principal component analysis (PCA) dimensionality reduction of the original high-dimensional spectral and temporal envelope representations of the input stimuli (for detail, please see Fig a in S2 Appendix). This modulation filterbank can generate a cascade of amplitude modulators at different oscillatory rates, producing the AM hierarchy. The model generates an hierarchical representation of the core spectral (acoustic frequency spanning 100–7,250 Hz) and temporal (oscillatory rate spanning 0.9–40 Hz) modulation hierarchies in the amplitude envelopes of speech and music.

2.1 Probabilistic Amplitude Demodulation (PAD) model based on Bayesian inference

Amplitude demodulation is the process by which a signal (y_t) is decomposed into a slowly-varying modulator (m_t) and a quickly-varying carrier (c_t):

y_t = m_t * c_t (1)

Probabilistic amplitude demodulation (PAD) [17] implements amplitude demodulation as a problem of learning and inference. Learning corresponds to the estimation of the parameters that describe the distributional constraints, such as the expected time-scale of variation of the modulator. Inference corresponds to the estimation of the modulator and carrier from the signals based on the learned or manually defined parametric distributional constraints. This information is encoded probabilistically in the likelihood: P(y_{1:T} | c_{1:T}, m_{1:T}, θ), the prior distribution over the carrier: p(c_{1:T} | θ), and the prior distribution over the modulators: p(m_{1:T} | θ). Here, the notation x_{1:T} represents all the samples of the signal x, running from 1 to a maximum value T. Each of these distributions depends on a set of parameters θ, which controls factors such as the typical time-scale of variation of the modulator or the frequency content of the carrier. In more detail, the parametrized joint probability of the signal, carrier and modulator is:

P(y_{1:T}, c_{1:T}, m_{1:T} | θ) = P(y_{1:T} | c_{1:T}, m_{1:T}, θ) * p(c_{1:T} | θ) * p(m_{1:T} | θ) (2)

Bayes’ theorem is applied for inference, forming the posterior distribution over the modulators and carriers, given the signal:

P(c_{1:T}, m_{1:T} | y_{1:T}, θ) = P(y_{1:T}, c_{1:T}, m_{1:T} | θ) / P(y_{1:T} | θ) (3)

The full solution to PAD is a distribution over possible pairs of modulator and carrier. The most probable pair of modulator and carrier given the signal is returned:

m*_{1:T}, c*_{1:T} = argmax_{c_{1:T}, m_{1:T}} P(c_{1:T}, m_{1:T} | y_{1:T}, θ) (4)

The solution takes the form of a probability distribution which describes how probable a particular setting of the modulator and carrier is, given the observed signal. Thus, PAD summarizes the posterior distribution by returning the specific envelope and carrier that have the highest posterior probability and therefore represent the best match to the data. As noted, PAD can be run recursively using different demodulation parameters each time, thereby generating a cascade of amplitude modulators at different oscillatory rates [16]. The positive slow envelope is modelled by applying an exponential nonlinear function to a stationary Gaussian process. This produces a positive-valued envelope whose mean is constant over time. The degree of correlation between points in the envelope can be constrained by the timescale parameters of variation of the modulator (envelope), which may either be entered manually or learned from the data. In the present study, we entered the PAD parameters manually to produce modulators below 40 Hz because it is known that the core AM frequencies that contribute to speech rhythm lie below 40 Hz [15]. The carrier is interpreted as components including noise and pitches whose frequencies are much higher than the core modulation bands in phrasal, prosodic, syllabic and other phonological components. After extracting the modulators below 40 Hz, a continuous wavelet transform (CWT) was run on each AM envelope. The procedure is depicted in Fig 1 via heat maps, which show the demodulation outputs from the CWT for an example of each stimulus type. Next, the demodulation outputs were normalized between 0 and 1, and averaged across all samples in each genre (instrumental music, song, and nature sounds).
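To make the modulator/carrier decomposition and the recursive cascade concrete, the sketch below splits a synthetic signal into a slow positive modulator and a fast carrier (y = m * c), then demodulates the modulator again to obtain a slower tier. PAD does this by Bayesian inference; the Hilbert-envelope-plus-low-pass demodulator used here is only a crude stand-in, with the 40 Hz cutoff taken from the text and the toy signal and 5 Hz second-stage cutoff our own assumptions.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

FS = 1000

def demodulate(y, cutoff_hz, fs=FS):
    """Split y into a slow positive modulator m and a carrier c with
    y = m * c.  PAD infers m and c by Bayesian inference; a Hilbert
    envelope plus low-pass filter serves here as a crude stand-in."""
    env = np.abs(hilbert(y))
    b, a = butter(2, cutoff_hz / (fs / 2), btype="low")
    m = np.maximum(filtfilt(b, a, env), 1e-12)  # keep the modulator positive
    c = y / m
    return m, c

# a toy signal: 2 Hz and 20 Hz modulators riding on a 200 Hz carrier
t = np.arange(0, 5.0, 1.0 / FS)
y = ((1 + 0.9 * np.cos(2 * np.pi * 2 * t))
     * (1 + 0.5 * np.cos(2 * np.pi * 20 * t))
     * np.cos(2 * np.pi * 200 * t))

m40, c40 = demodulate(y, 40.0)  # all modulation below 40 Hz
m5, c5 = demodulate(m40, 5.0)   # recursive step: the slower tier

# dominant rate of the second-tier modulator
freqs = np.fft.rfftfreq(t.size, 1.0 / FS)
spec = np.abs(np.fft.rfft(m5 - m5.mean()))
dom = freqs[1:][np.argmax(spec[1:])]
print(dom)  # ~2 Hz, the slow modulator planted in the toy signal
```

Each recursion peels off a slower tier of the AM hierarchy, mirroring the cascade of modulators described above.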

Fig 1. Scalograms depicting the amplitude modulation (AM) envelopes derived by recursive application of PAD.

Fig 1

We depict music (classical), IDS (naturalistic conversation), ADS (naturalistic conversation, [75]), bird song (nightingale), nature sounds (averaged) and a man-made rhythmic sound (a machine) using Continuous Wavelet Transform (CWT), which was run on each AM envelope from randomly chosen 30-s excerpts of music, IDS, ADS, bird song, nature sounds, and machine sounds. Note that similar scalograms cannot be generated for S-AMPH because of the use of cochlear filterbanks, which means that boundary frequencies would disappear. The x-axis denotes time (30 s) and the y-axis denotes modulation rate (0.1–40 Hz). The maximal amplitude is normalized to 0 dB. The demodulation outputs are shown as a heat map. It should be noted that the low-frequency structure (<5 Hz) visible in music and IDS is absent for the nature and machine sounds and weak for ADS and bird song. That is, systematic patches of red can be seen recurring at low frequencies for speech and music (~2 Hz and ~5 Hz), but not for nature sounds or mechanical sounds. Comparison of the temporal structures of these sounds for the low-frequency modulation rates (0–5 Hz) shows that only music and speech show strong delta- and theta-AM band patterning. The nested structure of AM patterning across the higher modulation bands (12–40 Hz) is also clearly visible for the quasi-rhythmic sounds found in nature. This patterning is clearly absent for the man-made rhythmic sound of a machine.

2.2. Spectral Amplitude Modulation Phase Hierarchy (S-AMPH) model

2.2.1. Signal processing: Spectral and temporal modulations

This study used the same methodologies and parameters as a previous study based on CDS by Leong and Goswami [15] (for the wiki, please see https://www.cne.psychol.cam.ac.uk). To establish the patterns of spectral modulation, the raw acoustic signal was passed through a 28 log-spaced ERBN filterbank spanning 100–7250 Hz, which simulates the frequency decomposition by the cochlea in a normal human [40, 42]. For further technical details of the filterbank design, see Stone and Moore [76]. The parameters of the ERBN filterbanks and the frequency response characteristics are provided in S2 Appendix. Then, the Hilbert envelope was obtained for each of the 28 filtered signals. Using the 28 Hilbert envelopes, the core spectral patterning was defined by PCA. This can identify the appropriate number and spacing of non-redundant spectral bands by detecting co-modulation in the high-dimensional ERBN representation. To establish the patterns of temporal modulation, the raw acoustic signal was filtered into the number of spectral bands that were identified in the spectral PCA analysis. Then, the Hilbert envelope was extracted from each of the spectral bands. Further, the Hilbert envelopes of each of the spectral bands were passed through a 24 log-spaced ERBN filterbank spanning 0.9–40 Hz. Using the 24 Hilbert envelopes in each of the spectral bands, the core AM hierarchy was defined by PCA. This approach clarifies co-activation patterns across modulation rate channels.
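The spectral stage of this pipeline (log-spaced filterbank → per-channel Hilbert envelopes → PCA on the envelope matrix) can be sketched as follows. This is a simplified stand-in, not the published code: second-order Butterworth band-pass filters replace the ERBN filterbank of Stone and Moore [76], and the sampling rate, filter order and two-band toy input are our assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def spectral_pca(x, fs, n_bands=28, f_lo=100.0, f_hi=7250.0):
    """Log-spaced band-pass filterbank, Hilbert envelope per channel,
    then PCA (via SVD) over the channel-by-time envelope matrix."""
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envelopes.append(np.abs(hilbert(band)))    # Hilbert envelope of the band
    E = np.array(envelopes)                        # (n_bands, n_samples)
    E = E - E.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    loadings = np.abs(U[:, :5])                    # top-5 PC loadings (absolute value)
    var_explained = s[:5] ** 2 / np.sum(s ** 2)
    return loadings, var_explained

fs = 16000
t = np.arange(0, 2, 1.0 / fs)
# Toy input: two spectral bands (300 Hz and 2500 Hz) with different AM rates.
x = np.sin(2 * np.pi * 300 * t) * (1 + np.sin(2 * np.pi * 2 * t)) + \
    0.5 * np.sin(2 * np.pi * 2500 * t) * (1 + np.sin(2 * np.pi * 5 * t))
loadings, var_explained = spectral_pca(x, fs)
print(loadings.shape, round(float(var_explained.sum()), 2))
```

Channels whose envelopes co-modulate load onto the same component, so peaks in a loading column mark clusters of co-modulated spectral channels, which is the quantity the band-finding step below operates on.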

To determine the number and the edges of the core spectral (acoustic frequency spanning 100–7,250 Hz) and temporal (oscillatory rate spanning 0.9–40 Hz) modulation bands, PCA was applied separately for spectral and temporal dimensionality reductions. PCA has previously been used for dimensionality reduction in speech studies (e.g., [77, 78]). The present study focused on the absolute value of the component loadings rather than the component scores. The loadings indicate the underlying patterns of correlation between high-dimensional channels. That is, PCA loadings were adopted to identify patterns of covariation between the high-dimensional channels of spectral (28 channels) and temporal (24 channels) modulations, and to determine groups (or clusters) of channels that belonged to the same core modulation bands.

2.2.2. PCA to find the core modulation hierarchy in the high-dimensional ERBN representation

In the spectral PCA, the 28 spectral channels were taken as separate variables, yielding a total of 28 principal components. Only the top 5 principal components (PCs) were considered for further analysis, because these already cumulatively accounted for over 58% (on average) of the total variance in the original sound signal. In the temporal PCA, the 24 channels in each of the spectral bands were entered as separate variables. Only the top 3 PCs were considered for further analysis, because these cumulatively accounted for over 55% of the total variance in the original sound signal. Each PC loading value was averaged across all samples in each genre (Western-classical music, Jazz, adult and children’s song, nature sounds, birdsong) and musical instruments (single-voice: Violin, Viola, Cello, and Bass; multi-voice: Piano and Guitar). The absolute value of the PC loadings was used to avoid mutual cancellation when averaging opposite valences across samples [14]. Then, peaks in the grand average PC loading patterns were taken to identify the core modulation hierarchy. Troughs were also identified because they reflect boundaries or edges between co-modulated clusters of channels. To ensure adequate spacing between the resulting inferred modulation bands, a minimum peak-to-peak distance of 2 and 5 channels was set for the spectral and temporal PCAs, respectively. After detecting all the peaks and troughs, the core spectral and temporal modulation bands were determined based on the criterion that at least 2 of the 5 PCs (spectral) or 1 of the 3 PCs (temporal) showed a peak. The boundary edges between modulation bands were determined based on the most consistent locations of “flanking” troughs for each group of PC peaks that indicated the presence of a band. More detailed methodologies and examples can be found in Leong and Goswami [15] and Fig a of the S2 Appendix.
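The peak/trough procedure on a grand-average loading pattern can be illustrated as follows. The three-bump loading curve over 24 temporal channels is invented for illustration; `scipy.signal.find_peaks` with its `distance` argument implements the minimum peak-to-peak spacing of 5 channels described above, and troughs are found as peaks of the negated pattern.

```python
import numpy as np
from scipy.signal import find_peaks

# Hypothetical grand-average PC loading pattern over 24 modulation-rate channels:
# three co-modulated clusters centred on channels 4, 12 and 20.
channels = np.arange(24)
loading = (np.exp(-0.5 * ((channels - 4) / 1.5) ** 2) +
           0.8 * np.exp(-0.5 * ((channels - 12) / 1.5) ** 2) +
           0.6 * np.exp(-0.5 * ((channels - 20) / 1.5) ** 2))

# Peaks mark core modulation bands; a minimum spacing of 5 channels is imposed,
# matching the temporal PCA criterion in the text.
peaks, _ = find_peaks(loading, distance=5)
# Troughs (candidate band boundaries) are peaks of the negated pattern.
troughs, _ = find_peaks(-loading, distance=5)
print(peaks.tolist(), troughs.tolist())
```

The peak channels would then be checked against the "at least 2 of 5 PCs" (or "1 of 3 PCs") criterion across components before a band is accepted.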

2.3. Mutual information between different modulation bands

We also examined whether one tier of the temporal hierarchy of music may be mutually dependent on the timing of another tier by conducting mutual information (MI) analyses. MI is a measure of the mutual dependence between two variables, and can be expressed as

$$
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} \\
&= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)} \;-\; \sum_{x,y} p(x,y)\log p(y) \\
&= \sum_{x,y} p(x)\,p(y|x)\log p(y|x) \;-\; \sum_{y}\log p(y)\sum_{x} p(x,y) \\
&= \sum_{x} p(x)\Big(\sum_{y} p(y|x)\log p(y|x)\Big) \;-\; \sum_{y} p(y)\log p(y) \\
&= -\sum_{x} p(x)\,H(Y|X{=}x) \;+\; H(Y) \\
&= -H(Y|X) + H(Y) \;=\; H(Y) - H(Y|X) \quad \text{(bits)}
\end{aligned} \tag{5}
$$

where p(x,y) is the joint probability function of X and Y, p(x) and p(y) are the marginal probability distribution functions of X and Y respectively, H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y [79].
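A histogram-based estimate of I(X;Y) = H(Y) − H(Y|X) over two phase-angle series can be sketched as follows; the bin count and the toy phase-locked signals are our assumptions, not the estimator actually used in the paper.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) in bits, per Eq (5)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)            # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)            # marginal p(y)
    nz = pxy > 0                                   # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 4000)
# Phase-locked "delta" (2 Hz) and "theta" (4 Hz) phase series: theta phase is a
# deterministic function of delta phase, so their MI should be high.
delta_phase = np.angle(np.exp(1j * 2 * np.pi * 2 * t))
theta_phase = np.angle(np.exp(1j * (2 * np.pi * 4 * t + 0.3)))
noise_phase = rng.uniform(-np.pi, np.pi, t.size)   # independent control

mi_locked = mutual_information(delta_phase, theta_phase)
mi_noise = mutual_information(delta_phase, noise_phase)
print(round(mi_locked, 2), round(mi_noise, 2))
```

The locked pair yields several bits of MI while the independent pair yields close to zero (the small positive residual is the usual histogram-estimator bias), mirroring the adjacent-tier vs. non-adjacent-tier contrast the analysis tests for.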

This analysis should reveal whether a certain oscillatory rhythm X (i.e., delta, theta, alpha and beta) is dependent on another oscillatory rhythm Y. Given prior evidence regarding the interdependence of neuronal oscillatory bands [59, 66], we hypothesized that the adjacent tiers that connect via so-called “branches” in the AM hierarchy would be mutually dependent on each other, but non-adjacent tiers would not. If so, the results may support a hierarchical “tree-based” structure of musical rhythm, highlighting the applicability of an AM hierarchy to music as well as speech.

To explore this, we manually entered the PAD parameters to produce the modulators at each of five tiers of oscillatory band (i.e., delta: <4 Hz, theta: 4–8 Hz, alpha: 8–12 Hz, beta: 12–30 Hz, and gamma: 30–50 Hz) (see S3 Appendix). Note that manual entry of these parameters does not predetermine the results; rather, it enables exploration of whether there is a prominent peak frequency in each oscillatory rate band regardless of any tempo variations (such as speeding up or slowing down) that may depend on the performer or the particular music. Accordingly, this method determines the frequencies that comprise the core temporal modulation structure of each musical genre. In each of the music samples, the modulators (envelopes) of the five oscillatory bands were converted into the frequency domain by the Fast Fourier Transform (FFT). That is, PAD was run recursively using different demodulation parameters each time, generating a cascade of amplitude modulators at different oscillatory rates (i.e., delta, theta, alpha, beta, and gamma) and thereby forming an AM hierarchy. We adopted the phase angle “θ” of the core temporal modulation envelopes corresponding to delta, theta, alpha and beta/gamma waves that were detected by PAD. In the S-AMPH modelling, the 5 spectral envelopes (see S2 Appendix) were passed through a second series of band-pass filters to isolate the 4 different AM bands based on the results of the temporal PCA (channel edge frequencies: 0.9, 2.5, 7, 17 and 30 Hz). The phase angles were then calculated using each of the 4 × 5 temporal modulation envelopes. Then, using the phase angle values derived from S-AMPH and PAD respectively, the MI between different temporal modulation bands was measured.

2.4. Phase synchronization analyses

Based on the findings of the MI analyses, we further investigated possible multi-timescale phase synchronization between bands by computing the integer ratios between “adjacent” tiers of the AM hierarchy (i.e., the number of parent vs. daughter elements in an AM hierarchy). This analysis addressed how many daughter elements a parent element typically encompasses in a particular musical genre. We adopted the core temporal modulation envelopes corresponding to delta, theta, alpha and beta/gamma waves detected by each of the S-AMPH and PAD modelling approaches. In the S-AMPH model, the five spectral envelopes were passed through a second series of band-pass filters to isolate the four different AM bands based on the results of the temporal PCA (channel edge frequencies: 0.9, 2.5, 7, 17 and 30 Hz), yielding a total of 4 × 5 = 20 temporal modulation envelopes. In the PAD model, by contrast, we made use of the four core modulators (envelopes) corresponding to the delta, theta, alpha, and beta/gamma bands, respectively.

The Phase Synchronization Index (PSI) was computed between the adjacent AM bands in the S-AMPH representation for each of the five spectral bands, and in the corresponding AM bands in the PAD representation (i.e., delta vs. theta, theta vs. alpha, alpha vs. beta, and beta vs. gamma phase synchronization). The n:m PSI was originally conceptualized to quantify phase synchronization between two oscillators of different frequencies (e.g., in muscle activity; Tass et al., 1998 [80]), and was subsequently adapted for neural analyses of oscillatory phase-locking [81]. For example, if the integer ratio is 1:2, then the parent element encompasses 2 daughter elements of the rhythm. The PSI was computed as:

$$
\mathrm{PSI} = \left| \left\langle e^{\,i(n\theta_1 - m\theta_2)} \right\rangle \right| \tag{6}
$$

Here, n and m are integers describing the frequency relationship between the lower and higher AM bands, respectively. The n:m ratio for each PSI was constrained such that n, m < 10 and 1 < n/m < 3. The values θ1 and θ2 refer to the instantaneous phases of the two AMs at each point in time. Therefore, (nθ1 − mθ2) is the generalized phase difference between the two AMs, which was computed by taking the circular distance (modulus 2π) between the two instantaneous phase angles. The angled brackets denote averaging of this phase difference over all time-points. The PSI is the absolute value of this average, and can take values between 0 and 1 (i.e., from no synchronization to perfect synchronization) [18]. A sound with a PSI of 1 is perceived as being perfectly rhythmically regular (a repeating pattern of strong and weak beats), whereas a sound with a PSI of 0 is perceived as being random in rhythm.
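Equation (6) translates directly into code. The sketch below applies it to a toy pair of phase series with an exact 1:2 frequency relation (so n = 2, m = 1) and to a random-phase control; the signals are invented for illustration.

```python
import numpy as np

def psi(theta1, theta2, n, m):
    """n:m phase synchronization index: |<exp(i(n*theta1 - m*theta2))>|, Eq (6)."""
    return float(np.abs(np.mean(np.exp(1j * (n * theta1 - m * theta2)))))

t = np.linspace(0, 10, 5000)
delta = 2 * np.pi * 2 * t                 # instantaneous phase of a 2 Hz AM
theta = 2 * np.pi * 4 * t + 0.7           # 4 Hz AM with a constant phase offset
rng = np.random.default_rng(1)
random_phase = rng.uniform(0, 2 * np.pi, t.size)

print(round(psi(delta, theta, 2, 1), 2))          # perfect 2:1 locking -> near 1
print(round(psi(delta, random_phase, 2, 1), 2))   # no locking -> near 0
```

Because 2 × (2 Hz phase) − 1 × (4 Hz phase) is constant, the unit vectors all point the same way and the PSI is 1; for the random control they cancel and the PSI approaches 0 as the number of samples grows.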

To investigate whether the resulting outputs truly represented systematic characteristics of natural musical rhythm, we conducted simulation analyses. We generated synthesized sounds that consisted of four temporal modulation envelopes (i.e., the modulator) and one spectral frequency (the carrier). That is, 2 Hz, 4 Hz, 8 Hz and 16 Hz sine waves were summed to synthesize one compound waveform. The compound waveform was then multiplied by a 200 Hz sine wave. The synthesized waveform was treated as a sound that includes temporal information at delta, theta, alpha and gamma rhythms, and spectral information at a pitch close to that of natural human voices. It is important to note that all of the temporal envelopes comprised simple sine waves with frequencies at powers of 2. Hence, we hypothesized that 1:2 integer ratios should appear clearly and consistently compared with other integer ratios. If the PSIs of music show different findings from these artificial sounds, then the results may indicate that natural musical rhythm has systematic integer ratios in an AM hierarchy.
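The stimulus synthesis described above can be sketched as follows; the sampling rate, duration, and the simple product demodulation used to check the rhythm content are our choices, not parameters stated in the paper.

```python
import numpy as np

fs = 8000
t = np.arange(0, 5, 1.0 / fs)

# Compound modulator: sum of 2 Hz (delta), 4 Hz (theta), 8 Hz (alpha)
# and 16 Hz (gamma) sine waves, as in the simulation stimulus.
modulator = sum(np.sin(2 * np.pi * f * t) for f in (2, 4, 8, 16))

# Carrier at 200 Hz, a pitch close to that of natural human voices.
carrier = np.sin(2 * np.pi * 200 * t)
synth = modulator * carrier

# Sanity check: multiplying by the carrier again shifts the modulator back to
# baseband (sin^2 = (1 - cos)/2), so the low-frequency spectrum of the product
# should contain exactly the four modulation rates.
demod = synth * carrier
spec = np.abs(np.fft.rfft(demod))
freqs = np.fft.rfftfreq(demod.size, 1.0 / fs)
low = freqs <= 40
top4 = sorted(freqs[low][np.argsort(spec[low])[-4:]].tolist())
print(top4)
```

Since the duration spans an integer number of cycles of every component, the four modulation rates appear as clean spectral lines with no leakage.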

3. Results

3.1. Amplitude modulation properties of Western musical genres, song, and nature sounds from PAD

The modelling outputs from PAD are considered first, as this modelling is “brain-neutral”, implementing amplitude demodulation by estimating the most appropriate modulator (envelope) and carrier based on Bayesian inference and ignoring the logarithmic frequency sensitivity of human hearing (for more detail, see Methods). Accordingly, PAD provides a good test of the hypothesis that there is a systematic hierarchy of temporal modulations underpinning both Western music and (English) IDS, but not nature sounds, with the possible exception of birdsong. Further, PAD is exempt from the possibility that the filterbank used in the S-AMPH modelling may have partially introduced artificial modulations into the stimuli through “ringing”.

The PAD results are presented in Fig 2. The modelling showed that the AM bands in music matched those previously found in IDS, but the AM bands in the nature sounds did not. In particular, in panel 2d strong peaks in the delta and theta bands are clearly visible for instrumental music (red line, mean peaks: delta 1.1 Hz and 2.2 Hz, theta 4.7 Hz) and IDS (black line, mean peaks: delta 1.8 Hz, theta 3.3 Hz), but not for nature sounds (blue line). Although the delta and theta peaks occur at slightly different modulation rates, they are within close range of each other. Further, there are two matching peaks at delta and theta rates between IDS (black line in Fig 2D) and child song (light green in Fig 2D), but not in adult song, birdsong, and nature sounds. As predicted, therefore, the demodulation results for Western music match prior studies of English CDS and IDS [14, 15, 18], suggestive of shared statistical temporal characteristics of the acoustic input, to which the brain can entrain.

Fig 2. Core temporal modulation rates in PAD.

Fig 2

The raw data (panel a, sound waveform of a part of the 33 Variations on a waltz by Anton Diabelli, Op. 120 by Ludwig van Beethoven) are demodulated using PAD to yield an AM envelope, shown in panel b. Individual lines in panels c represent different speakers, musical genres, and nature sounds. Panel c shows the similar acoustic statistical properties of IDS and Music: for example, hierarchical peaks in lower frequencies (~5Hz) for IDS and music, but not for nature sounds. The average of the normalized power of scalograms of Continuous Wavelet Transforms of the AM envelopes across infant-directed speech (IDS, black lines), music (red lines), nature sounds (blue lines), birdsong (yellow lines), child song (light green), and adult song (thick green) are shown in panel d. Panel d shows 2Hz and 5Hz peaks in both IDS and child song, but not in adult song, bird song, and nature sounds.

3.2. Amplitude modulation properties of Western musical genres, song, and nature sounds from S-AMPH

To investigate whether a demodulation approach based on an equivalent rectangular bandwidth (ERBN) filterbank (which simulates the frequency decomposition by the cochlea) would yield similar AM bands, we applied the S-AMPH model to the same materials. We expected to find a similar modulation structure to that revealed by PAD. Based on the a priori criteria (see Methods), the spectral PCA provided evidence for the presence of 5 core spectral bands in the spectral modulation data (300, 500, 1000, 2500 and 5500 Hz), with at least 2 out of 5 PCs showing peaks in each of these 5 spectral regions. This is shown in Fig a in S4 Appendix, which shows the grand average as well as the loading patterns and cumulative contribution ratios for each musical genre and instrument. Furthermore, we consistently observed 4 boundaries between these 5 spectral bands (350, 700, 1750 and 3900 Hz). Table a in the S4 Appendix provides a summary of these 5 spectral bands and their boundaries. It is noteworthy that these 5 spectral bands, which were consistent across musical instruments and the human voice, are proportionately-scaled with respect to the logarithmic frequency sensitivity of human hearing. As predicted, these results are similar to the spectral bands previously revealed by modelling IDS and CDS using the S-AMPH approach [14, 15, 18]. It can also be noted that the loading patterns for the 5 PCA components showed roughly similar characteristics across the genres, although there was some individual variation at each spectral modulation band (see Fig a in S4 Appendix).

Based on the a priori criteria (Methods), the temporal PCA provided evidence for the presence of 4 core bands with 3 boundaries across the different musical genres and instruments. This is shown in Fig 3. These AM bands in music matched those previously found in IDS, but the AM bands in the nature sounds did not (see PC3 in Fig 3B). Further statistical detail is given in Table b in the S4 Appendix. Fig b in S4 Appendix shows the grand average loading patterns (absolute value) for each genre for the first three principal components arising from the temporal PCA of each of the 5 spectral bands determined in the spectral PCA. Fig b in S4 Appendix also shows the temporal loading patterns and cumulative contribution ratios for each music genre and each instrument. Table c in the S4 Appendix provides a summary of these 4 temporal bands and their boundaries. Fig e in S4 Appendix shows the grand average for the modulation spectra of the FFT as well as the loading patterns and cumulative contribution ratios for each music genre and instrument, along with individual variation. Perceptually, cycles in these AM bands may yield the experience of crotchets, quavers, demiquavers and onsets, as shown in Table d in S4 Appendix.

Fig 3. Core temporal modulation rates in S-AMPH.

Fig 3

Grand average absolute value of the core temporal PCA component loading patterns in the S-AMPH. (a) Individual lines represent each of speakers (gray scale), musical genres (red scale) and nature sounds (blue scale). (b) Individual lines represent average of each speaker (black), music (red), child song (light green), adult song (dark green), bird song (yellow) and nature sound (blue). For detailed information about the core temporal bands and their flanking boundaries, please see Table c in S4 Appendix.

In summary, the strong peaks in the delta and theta bands visible in Fig 3, along with the strong flanking trough between these bands, are clearly visible for instrumental music, human song (adult and child songs), bird song, and infant-directed speech compared to nature sounds. As predicted, therefore, the results of the temporal PCA for music essentially match prior studies of CDS and IDS [14, 15, 18]. The one difference observed compared to the PAD modelling is for birdsong, which to the human ear (i.e., to a decomposition based on the cochlear filterbank) sounds more similar to human song, at least for the corpus of 47 nightingale songs (see S1 Appendix) analyzed here.

3.3. Mutual information in both models

To examine whether the mutual dependencies between AM bands in the temporal modulation structure of different musical genres were more similar to the dependencies identified in IDS by prior S-AMPH modelling [18], an MI analysis method was employed. As noted earlier, prior modelling of IDS and CDS has revealed a significantly higher phase dependency between delta- and theta-rate AM bands compared to ADS. ADS, by contrast, shows a significantly higher phase dependency between theta- and beta/low gamma-rate AM bands compared to IDS. Accordingly, for music we expected to find significantly higher phase dependency between delta- and theta-rate AM bands than between any other pair of AM bands. We did not expect to find this for nature sounds. Please note that birdsong was not included in the nature sound MI analyses because the previously-presented modelling shows that the AM properties of birdsong differ with the modelling technique applied (see 3.1, 3.2; the AM structure of birdsong is more similar to Babytalk when a model that mimics the human cochlea is utilized). For music and nature sounds (e.g., fire, river), we investigated whether higher phase dependency between delta- and theta-rate AM bands would only be detected in music, thereby matching prior studies of IDS.

When the MI analyses were applied for PAD in music, four peak frequencies were detected at ~2.4 Hz, ~4.8 Hz, ~9 Hz and ~16 Hz. Both the delta-theta and theta-alpha mutual dependencies were consistently greater than the other dependencies. The MI for nature sounds looked different, with little apparent variation in MI associated with different pairings of bands. In particular, the mutual dependence between delta- and theta-rate AM bands of natural sounds was similar to that of non-adjacent tiers. Further detail is given in Fig d in S5 Appendix.

The MI results for the S-AMPH model showed that adjacent tiers of the AM hierarchy were mutually dependent on each other compared with non-adjacent tiers for the musical genres and for child songs. S5 Appendix shows the MI for each music genre and instrument, revealing high consistency between Western music genres, instruments and child song. Accordingly, the S-AMPH model yielded similar findings to PAD, detecting peak frequencies in AM bands corresponding temporally to delta, theta, alpha and beta/gamma neural oscillatory bands. Further, the mutual dependence between delta- and theta-rate AM bands was the strongest of all mutual dependencies detected in music, for both models. This stronger AM phase dependence matched the results of the prior speech-based modelling with IDS and CDS rather than ADS [14, 15, 18]. Accordingly, both the PAD and S-AMPH MI modelling suggest that metrical structure, a feature shared by both music and speech, depends on the same core delta-theta AM phase relations in both domains.

3.4. Multi-timescale phase synchronization in both models

The demonstration of mutual dependency does not by itself capture metrical structure, as each AM cycle at a particular timescale may encompass one or more AM cycles at a faster timescale. To identify how many daughter elements a parent element could encompass in general, we next investigated the integer ratios between adjacent AM bands. For example, if the integer ratio is 1:2, then the parent element encompasses 2 daughter elements of the rhythm. An example from speech would be a tongue twister like “Peter Piper picked a peck of pickled peppers,” which follows a 1:2 ratio (two syllables in each prosodic foot). To assess the integer ratios for each pair of mutually dependent AM bands in our selected musical genres, we computed PSIs (please see S6 Appendix for the PSI for each musical genre and instrument). The PSI analyses revealed high consistency between musical genres for the phase synchronization indices generated by both the S-AMPH and PAD models. Further analysis focused on the grand average (shown in Fig 4).

Fig 4. Phase synchronization index between different tiers in the amplitude modulation hierarchy for music.

Fig 4

Both S-AMPH (a) and PAD models (b) showed that the simpler integer ratios (i.e., m/n) synchronize their phase with each other. The inverted dissonance curve (c) was obtained by including the first five upper partials of tones with a 440 Hz (i.e., pitch standard, A4) fundamental frequency in calculating the total dissonance of intervals [82]. It is of note that the peaks of PSI demonstrated by PAD correspond to those of the dissonance curve.

The PSIs of the S-AMPH model suggested that the PSI for 1:2 integer ratios was the highest across all of the adjacent oscillatory bands. The PSIs for 1:3 and 2:3 integer ratios in the S-AMPH modelling were also higher than those for the other integer ratios, suggesting that the simpler integer ratios (i.e., m/n) were the most likely to synchronize between adjacent bands. For spoken languages, the m/n ratio between two adjacent AM bands tends to vary with linguistic factors such as how many phonemes typically comprise a syllable (e.g., 2 phonemes per syllable for a language with a consonant-vowel syllable structure like Spanish, hence a theta-beta/low gamma PSI of 1:2, but 3 phonemes per syllable for a language with largely consonant-vowel-consonant syllable structures like English, hence a theta-beta/low gamma PSI of 1:3). For music, the dominance of the 1:3 and 2:3 PSIs across genres and instruments suggests more tightly controlled rhythmic dependencies than for speech.

The PSIs generated by the PAD model were similar to those of the S-AMPH model, but PAD was more sensitive to the simple integer ratios. In PAD, the PSIs of not only the 1:2 integer ratios but also the 2:3, 3:4 and 4:5 integer ratios were notably higher than the other integer ratios, particularly for the delta-theta AM band pairing (see Fig 4). The differences between the models may have arisen because the filterbank used in the S-AMPH model may partially introduce artificial modulations into the stimuli through “ringing.” However, the ERBN filterbank in the S-AMPH model reflects the frequency decomposition performed by the cochlea in the normal human ear. Hence, the different findings between the S-AMPH and PAD models regarding multi-timescale phase synchronization may imply that there are differences between the physical stimulus characteristics of musical rhythm as perceived by the human brain and the purely physical and statistical structure of music.

Nevertheless, as shown in Fig 4, the PSI between delta- and theta-rate AM bands was consistently the largest PSI in both the S-AMPH and PAD models. Again, this finding is consistent with our prior findings for IDS and rhythmic CDS [15, 18]. As a further check, we also examined the PSI of sounds found in nature. The human hearing system has been receiving these quasi-rhythmic sounds at least as long as it has been receiving language and music, but unlike language and music, these sounds have not been produced by humans and shaped by human physiology and culture. Accordingly, it would not be expected that the temporal modulation structure of these natural sounds would be shared with IDS and CDS. The results showed that compared with music, the PSI between delta- and theta-rate AM bands was not consistently the largest PSI (S6 Appendix). This shows that the strong phase dependence between slower bands of AMs revealed for music and for IDS/CDS is not an artifact of the modelling approaches employed, but a core physical feature of their rhythmic structure.

Accordingly, the strong rhythmic character and acoustic temporal regularity of both infant- and child-directed speech, child song and Western music appear to be influenced by AMs in the delta band (a 2 Hz modulation peak, in music reflecting a 120 bpm rate) and by delta-theta AM phase alignment. Our modelling data for temporal frequency (i.e., “rhythm”) also map nicely onto the Plomp and Levelt [82] modelling of the dissonance curve for spectral frequency (i.e., “pitch”) (shown in Fig 4, bottom). This may imply that the physical properties governing fast spectral frequencies are also involved in the very slow temporal modulation envelopes below 40 Hz. In sum, both modelling approaches showed that the PSI for 1:2 integer ratios was the highest in all the AM band pairings, and that the other simpler integer ratios (1:3, 2:3, etc.) were also higher than non-integer ratios. Fig 5 provides a schematic example of the 1:2 integer ratio in the likely AM hierarchy in music. The figure shows in principle how musical rhythm could be hierarchically organized based on note values (i.e., crotchets, quavers, demiquavers and onsets; Fig 5, left) and the AM hierarchy (Fig 5, right).

Fig 5. Schematic depiction of the hierarchical AM structure yielding Rhythm in music.

Fig 5

The left and right panels show, respectively, the musical score and the corresponding sound waveform of a part of the 33 Variations on a waltz by Anton Diabelli, Op. 120 (commonly known as the Diabelli Variations) by Ludwig van Beethoven. In principle, musical rhythm could be hierarchically organized based on note values (left) matched to nested amplitude modulations (AM, right) in bandings spanning different temporal rates (for example, green ~2 Hz, blue ~4 Hz, red ~8 Hz, matching Table d in S4 Appendix). In the framework of Temporal Sampling theory, the AM bands (right) equate temporally to neural oscillatory rhythms. Auditory rhythm perception relies in part on neural tracking of the AM patterns at different timescales simultaneously (e.g., neural tracking of the green, blue, and red AMs in Fig 5 by neurophysiological delta, theta and alpha bands). This neural tracking is triggered by acoustic components of the sound signal such as the amplitude rise times (musical attack times) of the nested AM components, which phase-reset oscillatory cortical activity. There is of course a large range of tempi used in music, for example slow ballads and fast dance songs. However, as shown by the black lines in the musical note hierarchy (left) and the dotted vertical lines in the AM hierarchy (right), the adjacent tiers of the hierarchy (i.e., the green & blue and blue & red AM pairs) are dependent on each other compared with non-adjacent hierarchical relations (i.e., the green-red AM pairing), and thus the hierarchy itself will expand or contract to fit the tempo.

3.5. Simulation analyses

Finally, to investigate whether the detected (dissonance curve-like) characteristics revealed by the MI and PSI analyses really represent systematic features of natural musical rhythm, we conducted simulation analyses with synthesized rhythmic but non-musical sounds. The final synthesized waveform comprised a sound that included clear rhythmic information at delta (2 Hz), theta (4 Hz), alpha (8 Hz) and gamma (16 Hz) timescales, and spectral information at a pitch around that of natural human voices (200 Hz) (for the figure, see S6 Appendix). The resulting percept was similar to a harsh rhythmic whisper. The sound is available here: https://osf.io/6s8kp/. As all of the temporal envelopes were composed of simple sine waves with frequencies at powers of 2, PSI analyses of these artificial sounds should clearly and consistently reveal only 1:2 integer ratios compared with other integer ratios. This was the case. Thus, the simulation analyses revealed that the PSIs for natural Western musical genres were different from those for artificial rhythmic sounds. This suggests that natural musical rhythm has covert and systematic integer ratios (i.e., 2:3, 3:4 and 4:5 as well as 1:2) within the AM hierarchy, at least when considering Western musical genres.

4. Discussion

Here we explored the possibility that the hierarchical rhythmic (statistical AM) relationships that characterize both English Babytalk and children’s nursery rhymes would also characterize Western music [15, 18, 38]. We tested the prediction that the physical stimulus characteristics (acoustic statistics) that describe the amplitude envelope structure of IDS and CDS from a demodulation perspective would also describe Western music and child song. If child language and human music depend on the same acoustic statistics, this should facilitate initial neural learning of these culturally-determined systems. Decomposition of the amplitude envelope of IDS and CDS has previously revealed (a) that the modulation peak in IDS is ~2 Hz [18], (b) that perceived rhythmic patterning depends on three core AM bands in the amplitude envelope centred on ~2 Hz, ~5 Hz and ~20 Hz that are found systematically across the spectral range of speech [15], and (c) that varying metrical patterns such as trochaic and iambic meters can be identified by the phase relations between two of these bands of AMs (delta- and theta-rate AMs, ~2 Hz and ~5 Hz) [14]. The phase alignment (rhythmic synchronicity) of these relatively slow AM rates represents a unique statistical clue to rhythmic patterning in speech, relevant to language acquisition [83]. We predicted a priori that this statistical parameter (delta rate-theta rate AM phase alignment) would be present in music and human song, but not in nature sounds such as wind and rain. The physical stimulus characteristics of the amplitude envelope of different musical genres and of music produced by different instruments were expected to yield similar acoustic statistics, with classical, rock and jazz music all producing similar modulation structures. The acoustic statistics describing nature sounds were expected a priori to be different, as these sounds are neither created by humans nor dependent on human physiology and culture.
A possible exception could be the non-human-created rhythms of birdsong. Nightingale song, which has been shown to be more similar in structure to human music than other birds’ song by Roeske et al. [24], was thus also modelled from a demodulation perspective.

Our demodulation analyses indeed revealed a hierarchy of temporal modulations that systematically described the acoustic properties of musical rhythm for a range of Western musical genres and instruments, as well as child song (Fig 1). Our modelling indicated acoustic statistical properties highly similar to those of IDS and CDS: a 2 Hz modulation peak (Fig 2, panel d and Fig 3, panel b), particularly strong phase alignment between delta- and theta-rate AM bands across musical genres and human song (Mutual Information analyses), and a distinct set of preferred PSIs that indicated multi-timescale synchronization across different AM bands (Fig 4). As the brain begins learning language using IDS, and consolidates this learning via the rhythmic routines of the nursery (CDS), the present findings are consistent with the theoretical view that perceiving rhythm in both music and language may (at least early in development, prior to acquiring expertise) rely on statistical learning of the same physical stimulus characteristics. Although not tested directly here, it is likely that similar neural oscillatory entrainment mechanisms are used for encoding this hierarchical AM structure in both domains [61, 63, 67, 68]. The natural sounds analysed here have also been present since early hominid times, but their statistical structure has not been constrained by the human brain. Accordingly, learning their AM structure is less critical for human communities, and their acoustic temporal modulation structure is somewhat different to that of Babytalk and music.

Indeed, the multi-timescale synchronization found here was systematic across Western musical genres and instruments (see S6 Appendix), suggesting that this AM hierarchy contributes to building perceived rhythmic structures. The nested AM hierarchies in music may yield nested musical units (crotchets, quavers, demiquavers and onsets), just as nested AM hierarchies in CDS yield linguistic units like syllables and rhymes [15]. This possibility is depicted in Fig 5. Our modelling shows that acoustically-emergent musical units can in principle be parsed reliably from the temporal modulation spectra of the different musical genres examined, and that these units are reflected in each of delta-, theta-, alpha- and beta/gamma-rate bands of AM (Table d in S4 Appendix). To the best of our knowledge, our modelling is the first to reveal a set of temporal statistics related to the perception of different musical units. Just as cycles of AM in IDS and CDS relate to prosodic patterns (e.g. trochaic versus iambic) and to identifying stressed syllables, syllables and rhymes, cycles of AM in music may relate to metrical structures and to units such as crotchets, quavers and demi-quavers. It is of note that these hierarchical statistical temporal dependencies should be consistent across different tempi. The dependencies refer to temporal bandings of AMs, hence the hierarchical dependencies should simply adjust to fit the tempo used in the music, for example slow ballads and fast dance songs. In similar fashion, it has been demonstrated that the hierarchical AM dependencies in speech adapt to speech rate (see [18]). Indeed, the current modelling revealed statistically strong mutual dependence (using MI estimates) between adjacent bands in the AM hierarchy across musical genres (Western classical, jazz, rock, children’s songs) and musical instruments (piano, guitar, violin, viola, cello, bass, single-voice, multi-voice). 
This strong mutual dependence was not observed in nature sounds (shown in S5 Appendix), although birdsong was not included in these latter analyses.
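The mutual dependence between adjacent AM bands can be quantified with a mutual information (MI) estimate over the band amplitudes. The sketch below uses a simple histogram-based plug-in estimator for illustration only; the bin count of 16 is an arbitrary assumption, and this is not necessarily the exact estimator used in the reported analyses.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in mutual information estimate (in bits) from a 2-D histogram.
    Larger values indicate stronger statistical dependence between signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal distribution of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal distribution of y
    nz = pxy > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Two harmonically related "AM bands" share information;
# an independent noise signal does not.
t = np.linspace(0.0, 10.0, 5000)
delta_like = np.sin(2 * np.pi * 2 * t)    # ~2 Hz band
theta_like = np.sin(2 * np.pi * 4 * t)    # phase-locked ~4 Hz band
noise = np.random.default_rng(1).standard_normal(t.size)

mi_locked = mutual_information(delta_like, theta_like)
mi_independent = mutual_information(delta_like, noise)
```

On these toy signals, MI between the harmonically related bands exceeds MI between a band and independent noise, mirroring the contrast reported between music and nature sounds.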

In particular, for music, the mutual dependence between delta- and theta-rate bands of AM was the strongest dependence identified by both models. A comparably strong mutual dependence between delta- and theta-rate AM bands was not detected in the nature sounds (river, fire, wind, storms, rain), even though these natural sounds are also quasi-rhythmic. The current modelling thus suggests that for Western music, delta-theta phase alignment of AM bands may underpin metrical rhythmic patterns, matching the acoustic structure of IDS and CDS. Convergent results from the phase synchronization analyses further showed that multi-timescale synchronization between delta- and theta-rate AM bands was always higher than the other PSIs, regardless of the integer ratios. This was not replicated for nature sounds. The phase alignment of delta- and theta-rate bands of AM has been suggested to be a key acoustic statistic for the language-learning brain [83, 84], reflecting the placement of stressed syllables, which governs metrical patterning in speech (e.g., trochaic, iambic and dactylic meters). The present findings concerning mutual dependence and phase synchronization indicate that music may share these properties: phase alignment between delta- and theta-rate AM bands may contribute to establishing musical metrical structure as well.

Accordingly, our findings differ from a prior study using the same Western music materials, which claimed that the rhythmic properties of music and language are distinct [38]. In their analyses, the modulation spectrum for music peaked at 2 Hz and the modulation spectrum for speech peaked at 5 Hz. The analyses presented here suggest that the apparent dissimilarity between music and speech arises from the exclusive reliance of the speech modelling on ADS, coupled with the absence of further investigation of the AM structure of each musical genre. By contrast, our demodulation modelling approaches show better matching with temporal data from studies of IDS and CDS, where the modulation spectrum also peaks at 2 Hz (Figs 2 and 3), as well as a similar set of phase relations between AM bands (as noted, the latter were not explored by [38]). We would predict that these statistical regularities in temporal modulation may be the same for other forms of music, and for IDS and CDS in other languages; this remains to be explored. The demonstration that temporal modulation bands play a key role in rhythm hierarchies in music as well as in speech may also suggest that the same evolutionary adaptations underpin both music and language.
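The modulation-spectrum comparison at issue (a ~2 Hz versus a ~5 Hz peak) can be reproduced in outline by taking the spectrum of the mean-removed amplitude envelope and locating its maximum. This is a generic sketch under simple assumptions (Hilbert envelope, plain FFT), not the exact pipeline of either study; the 2 Hz toy stimulus is an illustrative stand-in for an IDS-like recording.

```python
import numpy as np
from scipy.signal import hilbert

def modulation_spectrum(x, fs, fmax=40.0):
    """Magnitude spectrum of the mean-removed amplitude envelope.
    The location of its peak is the dominant modulation rate in Hz."""
    env = np.abs(hilbert(x))
    env = env - env.mean()
    freqs = np.fft.rfftfreq(env.size, 1.0 / fs)
    spec = np.abs(np.fft.rfft(env))
    keep = freqs <= fmax
    return freqs[keep], spec[keep]

# Toy IDS-like stimulus: a noise carrier modulated at 2 Hz.
fs = 1000
t = np.arange(0, 10, 1.0 / fs)
rng = np.random.default_rng(0)
x = (1.0 + 0.8 * np.sin(2 * np.pi * 2 * t)) * rng.standard_normal(t.size)

freqs, spec = modulation_spectrum(x, fs)
peak_hz = freqs[np.argmax(spec)]   # dominant modulation rate of the stimulus
```

An ADS-like stimulus with a dominant ~5 Hz (syllable-rate) modulation would instead yield a peak near 5 Hz with the same procedure.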

Another interesting result from the phase synchronization analyses regarding music was the appearance of systematic integer ratios within the AM hierarchy. These ratios were relatively uniform for nature sounds, whereas for music and child song, the 1:2 integer ratio was strongest for both models. The PSIs for 1:3 and 2:3 were also higher than the other integer ratios explored for music, for both models. For the PAD modelling approach, which does not make any adjustments for the cochlea, the 2:3, 3:4 and 4:5 integer ratios were also prominent. This statistical patterning appears to reflect the tightly-controlled rhythmic dependencies in music, and may offer an acoustic model for capturing the different metrical structures and integer ratios that characterize music from different cultures [22, 23], as well as the songs of different species [24]. For example, even prior to the acquisition of culture-specific biases of musical rhythm, young infants (5-month-olds) are influenced by ratio complexity [25]. Our modelling further suggests that the AM bands in music are related by integer ratios in a similar way to the integer ratios relating notes of different fundamental frequencies that create harmonicity (see the similarity between the PSIs for the two models shown in Fig 4 and the dissonance curve measured by Plomp & Levelt, [82]). Converging prior modelling of speech has shown that the probability distribution of amplitude–frequency combinations in human speech sounds relates statistically to the harmonicity patterns that comprise musical universals [85]. Our modelling appears to suggest that the simple integer ratios (i.e., 1:2, 1:3, and 2:3) in the AM hierarchy comprise a fundamental set of statistics for musical rhythm perception. This fits well with prior data from Jacoby and McDermott [86], who demonstrated that certain integer ratios are prominent across music from both Western and non-Western cultures. 
Our acoustic modelling suggests that AM phase hierarchies may play as strong a role as harmonicity regarding universal aspects of human hearing that are important for both music and language.
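The n:m phase synchronization index used to quantify these integer-ratio relations follows Tass et al. [80]: PSI = |mean(exp(i(n·phi1 − m·phi2)))|, which equals 1 under perfect n:m phase locking and approaches 0 when the phases drift freely. Below is a minimal sketch with hypothetical sinusoidal stand-ins for two AM bands; real analyses would use the band-limited envelopes.

```python
import numpy as np
from scipy.signal import hilbert

def psi_nm(x1, x2, n, m):
    """n:m phase synchronization index (after Tass et al., 1998):
    PSI = |mean(exp(i * (n * phi1 - m * phi2)))|, ranging from 0
    (phases drift freely) to 1 (perfect n:m phase locking)."""
    phi1 = np.angle(hilbert(x1))
    phi2 = np.angle(hilbert(x2))
    return float(np.abs(np.mean(np.exp(1j * (n * phi1 - m * phi2)))))

# Hypothetical stand-ins for two AM bands.
fs = 1000
t = np.arange(0, 10, 1.0 / fs)
slow = np.sin(2 * np.pi * 2.0 * t)       # delta-rate AM (~2 Hz)
locked = np.sin(2 * np.pi * 4.0 * t)     # exact 1:2 frequency ratio to slow
drifting = np.sin(2 * np.pi * 4.6 * t)   # no fixed integer ratio to slow

# A 1:2 frequency ratio means 2 * phi_slow - 1 * phi_fast stays constant.
psi_locked = psi_nm(slow, locked, 2, 1)
psi_drift = psi_nm(slow, drifting, 2, 1)
```

Evaluating this index over a grid of (n, m) pairs for each pair of AM bands yields the PSI profiles compared across genres and nature sounds in Fig 4.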

The modelling presented here also converges conceptually with past studies designed to detect pulse based on neural resonance theory [71]. Pulse is the perceptual phenomenon in which an individual perceives a steady beat. Large et al. [71] suggested that the perception of pulse emerges through nonlinear coupling between two oscillatory networks, one representing the physical properties of the stimulus and a second network that integrates inputs from the sensory system. The nonlinear interactions between the two give rise to oscillatory activity not only at the frequencies present in the physical stimulus, but also at more complex combinations, including the pulse frequency. Consistent with this view, Tal et al. [87] reported phase locking for the adult brain at the times of a missing pulse, even though the pulse was absent from the physical stimulus. This suggests that neural activity at the pulse frequency is (for adults) internally generated rather than being purely stimulus-driven. From this perspective, our modelling (i.e., S-AMPH and PAD) captures the physical stimulus characteristics (the modulation structure of the amplitude envelope and its internal phase relations) rather than internally-generated oscillatory activity. To our knowledge, missing pulse phenomena have not yet been studied in infants. It may be that early learning of hierarchical phase relations from the amplitude envelopes of musical inputs is required for the internal generation of missing pulse phenomena. On the other hand, ERP studies show that even newborns can detect beat violations in oddball paradigms, where occasionally a deviant rhythm with a missing downbeat is heard in place of a standard metrical rhythm [88]. Further studies with infants may also be able to investigate the phase relationships between missing pulses or beats and higher hierarchical units such as musical phrasing or prosody.

The modelling presented here is also relevant to the remediation of childhood language disorders. The possible utility of musical interventions for children with disorders of language learning such as developmental language disorder (DLD) and developmental dyslexia has long been recognized [35, 89–91]. Such interventions are likely to be most beneficial when the temporal hierarchy of the music corresponds to the temporal hierarchy underpinning speech rhythm [27, 83]. Careful consideration of the statistical rhythm structures characterizing speech in different languages may thus lead to better remedial outcomes. For example, our findings suggest that for children with disorders of English language learning, interventions using Western music should be beneficial via the shared temporal hierarchy with English IDS and CDS. Further, it is possible that such interventions could be beneficial for second language learners. A caveat is that here we modelled musical genres that could be designated WEIRD corpora (originating from Westernized, educated, industrialized, rich and democratic societies). Accordingly, further studies are necessary to understand how music interventions can contribute to improving speech processing in other languages.

In conclusion, the present study revealed that the acoustic statistics that describe rhythm in music from an amplitude envelope decomposition perspective match those that describe IDS and CDS. The physical stimulus characteristics that describe nature sounds are different. The modelling demonstrates a core acoustic hierarchy of AMs that yield musical rhythm across the amplitude envelopes of different Western musical genres and instruments, with mutual dependencies between AM bands playing a key role in organizing rhythmic units in the musical hierarchy for each genre. Accordingly, biological mechanisms that exploit AM hierarchies may underpin the perception and development of both language and music. In terms of evolution, the novel acoustic statistics revealed here could also explain cross-cultural regularities in musical systems [23]; this remains to be tested.

Supporting information

S1 Appendix. Corpora of music, speech, and nature sounds.

(DOCX)

S2 Appendix. Signal processing steps in S-AMPH model.

(DOCX)

S3 Appendix. Signal processing steps in PAD model.

(DOCX)

S4 Appendix. Individual variation of PCA loadings in S-AMPH model and those of FFT in the PAD model.

(DOCX)

S5 Appendix. Individual variation of mutual information.

(DOCX)

S6 Appendix. Individual variation of PSI in each integer ratio.

(DOCX)

Data Availability

All data files can be found at the following link: https://osf.io/6s8kp/. All original sound files are publicly available from the Figshare database: http://figshare.com/articles/SAMPH_CDS/1318572 (DOI: 10.6084/m9.figshare.1318572). Please see the original article that used the speech data for more detailed information (Leong et al., 2015); for the related wiki, please see https://www.cne.psychol.cam.ac.uk. All original birdsong and nature sound files are available from https://www.xeno-canto.org/, https://mixkit.co/free-sound-effects/nature/, and https://www.zapsplat.com. The music and human song data are under copyright; detailed information is provided in S1 Appendix.

Funding Statement

This study was supported by the Nakatani Foundation; by JSPS KAKENHI Grant Numbers 21H05063 (Transformative Research Areas (B)), 22K17986 (Grant-in-Aid for Early-Career Scientists), 20K22676 (Grant-in-Aid for Research Activity Start-up), and 22H05210 (Grant-in-Aid for Transformative Research Areas (A)); and by the World Premier International Research Centre Initiative (WPI), MEXT, Japan. The sponsors played no role in the study design or in the collection, analysis, interpretation, or writing up of the data.

References

  • 1.Mehler J., Jusczyk P., Lambertz G., Halsted N., Bertoncini J., & Amiel-Tison C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2), 143–178. doi: 10.1016/0010-0277(88)90035-2 [DOI] [PubMed] [Google Scholar]
  • 2.Falk D. (2004). Prelinguistic evolution in early hominins: Whence motherese? Behavioral and Brain Sciences, 27(4), 491–503; discussion 503. doi: 10.1017/s0140525x04000111 [DOI] [PubMed] [Google Scholar]
  • 3.Nazzi T., Bertoncini J., & Mehler J. (1998). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24, 756–766. doi: 10.1037//0096-1523.24.3.756 [DOI] [PubMed] [Google Scholar]
  • 4.Saffran J. R. (2001). Words in a sea of sounds: The output of infant statistical learning. Cognition, 81, 149–169. doi: 10.1016/s0010-0277(01)00132-9 [DOI] [PubMed] [Google Scholar]
  • 5.Saffran J. R., Johnson E. K., Aslin R. N., & Newport E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70(1), 27–52. doi: 10.1016/s0010-0277(98)00075-4 [DOI] [PubMed] [Google Scholar]
  • 6.Francois C., & Schön D. (2011). Musical expertise boosts implicit learning of both musical and linguistic structures. Cerebral Cortex, 21(10), 2357–2365. doi: 10.1093/cercor/bhr022 [DOI] [PubMed] [Google Scholar]
  • 7.Loui P. (2022). New music system reveals spectral contribution to statistical learning. Cognition, 224, 105071. doi: 10.1016/j.cognition.2022.105071 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Tsogli V., Jentschke S., Daikoku T., & Koelsch S. (2019). When the statistical MMN meets the physical MMN. Scientific reports, 9(1), 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Prince J. B., Stevens C. J., Jones M. R., & Tillmann B. (2018). Learning of pitch and time structures in an artificial grammar setting. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(8), 1201. doi: 10.1037/xlm0000502 [DOI] [PubMed] [Google Scholar]
  • 10.Brandon M., Terry J., Stevens C. K. J., & Tillmann B. (2012). Incidental learning of temporal structures conforming to a metrical framework. Frontiers in Psychology, 3, 294. doi: 10.3389/fpsyg.2012.00294 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Daikoku T., & Yumoto M. (2020). Musical expertise facilitates statistical learning of rhythm and the perceptive uncertainty: A cross-cultural study. Neuropsychologia, 146, 107553. doi: 10.1016/j.neuropsychologia.2020.107553 [DOI] [PubMed] [Google Scholar]
  • 12.Goswami U. (2015). Sensory theories of developmental dyslexia: three challenges for research. Nature Reviews Neuroscience, 16(1), 43–54. doi: 10.1038/nrn3836 [DOI] [PubMed] [Google Scholar]
  • 13.Politimou N., Dalla Bella S., Farrugia N., & Franco F. (2019). Born to speak and sing: Musical predictors of language development in pre-schoolers. Frontiers in Psychology, 10, 948. doi: 10.3389/fpsyg.2019.00948 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Leong V., Stone M. A., Turner R. E., & Goswami U. (2014). A role for amplitude modulation phase relationships in speech rhythm perception. Journal of the Acoustical Society of America, 136(1), 366–381. doi: 10.1121/1.4883366 [DOI] [PubMed] [Google Scholar]
  • 15.Leong V., & Goswami U. (2015). Acoustic-emergent phonology in the amplitude envelope of child-directed speech. PLOS ONE, 10(12), e0144411. doi: 10.1371/journal.pone.0144411 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Turner, R. (2010). Statistical models for natural sounds [PhD dissertation]. University College London.
  • 17.Turner R. E., & Sahani M. (2011). Demodulation as probabilistic inference. IEEE Transactions on Audio, Speech, and Language Processing, 19(8), 2398–2411. 10.1109/TASL.2011.2135852 [DOI] [Google Scholar]
  • 18.Leong V., Kalashnikova M., Burnham D., & Goswami U. (2017). The temporal modulation structure of infant-directed speech. Open Mind, 1(2), 78–90. 10.1162/OPMI_a_00008 [DOI] [Google Scholar]
  • 19.Leong, V. (2012). Prosodic rhythm in the speech amplitude envelope: Amplitude modulation phase hierarchies (AMPHs) and AMPH models [PhD Thesis].
  • 20.Araújo J., Flanagan S., Castro-Caldas A., & Goswami U. (2018). The temporal modulation structure of illiterate versus literate adult speech. PLOS ONE, 13(10), e0205224. doi: 10.1371/journal.pone.0205224 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Pérez-Navarro J., Lallier M., Clark C., Flanagan S., & Goswami U. (2022). Local temporal regularities in child-directed speech in Spanish. Journal of Speech, Language and Hearing Research, in press. doi: 10.1044/2022_JSLHR-22-00111 [DOI] [PubMed] [Google Scholar]
  • 22.Mehr S. A., Krasnow M. M., Bryant G. A., & Hagen E. H. (2020). Origins of music in credible signaling. Behavioral and Brain Sciences, 1–41. doi: 10.1017/S0140525X20000345 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.McPherson M. J., Dolan S. E., Durango A., Ossandon T., Valdés J., Undurraga E. A., et al. (2020). Perceptual fusion of musical notes by native Amazonians suggests universal representations of musical intervals. Nature Communications, 11(1), 2786. doi: 10.1038/s41467-020-16448-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Roeske T. C., Tchernichovski O., Poeppel D., & Jacoby N. (2020). Categorical rhythms are shared between songbirds and humans. Current Biology, 30(18), 3544–3555.e6. doi: 10.1016/j.cub.2020.06.072 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Hannon E. E., Soley G., & Levine R. S. (2011). Constraints on infants’ musical rhythm perception: Effects of interval ratio complexity and enculturation. Developmental Science, 14(4), 865–872. doi: 10.1111/j.1467-7687.2011.01036.x [DOI] [PubMed] [Google Scholar]
  • 26.Goswami U. (2011). A temporal sampling framework for developmental dyslexia. Trends in Cognitive Sciences, 15(1), 3–10. doi: 10.1016/j.tics.2010.10.001 [DOI] [PubMed] [Google Scholar]
  • 27.Goswami U. (2019. a). A neural oscillations perspective on phonological development and phonological processing in developmental dyslexia. Language and Linguistics Compass, 13(5), e12328. 10.1111/lnc3.12328 [DOI] [Google Scholar]
  • 28.Goswami U. (2022). Language acquisition and speech rhythm patterns: An auditory neuroscience perspective. Royal Society Open Science, 9, 211855. doi: 10.1098/rsos.211855 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Greenberg S. (2006). A multi-tier framework for understanding spoken language. In Greenberg S.& Ainsworth W.(Eds.), Listening to speech: An auditory perspective. Lawrence Erlbaum Associates. [Google Scholar]
  • 30.Morton J., Marcus S., & Frankish C. (1976). Perceptual centers (P-centers). Psychological Review, 83(5), 405–408. 10.1037/0033-295X.83.5.405 [DOI] [Google Scholar]
  • 31.Hoequist C. E Jr. (1983). The perceptual center and rhythm categories. Language and Speech, 26(4), 367–376. doi: 10.1177/002383098302600404 [DOI] [PubMed] [Google Scholar]
  • 32.Scott, S. (1993). P-centres in speech: An acoustic analysis [PhD thesis]. University College London.
  • 33.Gordon J. W. (1987). The perceptual attack time of musical tones. Journal of the Acoustical Society of America, 82(1), 88–105. doi: 10.1121/1.395441 [DOI] [PubMed] [Google Scholar]
  • 34.Huss M., Verney J. P., Fosker T., Mead N., & Goswami U. (2011). Music, rhythm, rise time perception and developmental dyslexia: Perception of musical meter predicts reading and phonology. Cortex, 47(6), 674–689. doi: 10.1016/j.cortex.2010.07.010 [DOI] [PubMed] [Google Scholar]
  • 35.Ladányi E., Persici V., Fiveash A., Tillmann B., & Gordon R. L. (2020). Is atypical rhythm a risk factor for developmental speech and language disorders? Wiley Interdisciplinary Reviews. Cognitive Science, 11(5), e1528. doi: 10.1002/wcs.1528 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kello C.T., Dalla Bella S., Butovens M., & Balasubramaniam R. (2017). Hierarchical temporal structure in music, speech and animal vocalizations: Jazz is like a conversation, humpbacks sing like hermit thrushes. J. R. Soc Interface, 14, 20170231. doi: 10.1098/rsif.2017.0231 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Patel A. D. (2008). Music, language, and the brain. Oxford University Press. [Google Scholar]
  • 38.Ding N., Patel A. D., Chen L., Butler H., Luo C., & Poeppel D. (2017). Temporal modulations in speech and music. Neuroscience and Biobehavioral Reviews, 81(B), 181–187. 10.1016/j.neubiorev.2017.02.011 [DOI] [PubMed] [Google Scholar]
  • 39.Frith U., Wimmer H., & Landerl K. (1998). Differences in phonological recoding in German-and English-speaking children. Scientific Studies of reading, 2(1), 31–54. [Google Scholar]
  • 40.Moore B. C. J. (2012). An introduction to the psychology of hearing. Brill. [Google Scholar]
  • 41.Zeng F. G., Nie K., Stickney G. S., Kong Y. Y., Vongphoe M., Bhargave A., Wei C., & Cao K. (2005). Speech recognition with amplitude and frequency modulations. Proceedings of the National Academy of Sciences of the United States of America, 102(7), 2293–2298. doi: 10.1073/pnas.0406460102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Dau T., Kollmeier B., & Kohlrausch A. (1997. b). Modeling auditory processing of amplitude modulation I. Detection and masking with narrow-band carriers. Journal of the Acoustical Society of America, 102(5 Pt 1), 2892–2905. 10.1121/1.420344 [DOI] [PubMed] [Google Scholar]
  • 43.Fiser J., Berkes P., Orbán G., Lengyel M. (2010). Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences, 14(3), 119–130. doi: 10.1016/j.tics.2010.01.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Liberman M., & Prince A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249–336. [Google Scholar]
  • 45.Selkirk E. (1984). Phonology and syntax. The relation between sound and structure. [Google Scholar]
  • 46.Selkirk E. O. (1980). The role of prosodic categories in English word stress. Linguistic Inquiry, 11, 563–605. [Google Scholar]
  • 47.Hayes B. (1995). Metrical stress theory: Principles and case studies. University of Chicago Press. [Google Scholar]
  • 48.Lerdahl F., Jackendoff R., & Jackendoff R. S. (1983). A generative theory of tonal music. MIT Press. https://books.google.de/books?id=38YcngEACAAJ [Google Scholar]
  • 49.Luo H., & Poeppel D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54(6), 1001–1010. doi: 10.1016/j.neuron.2007.06.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Ahissar E., Nagarajan S., Ahissar M., Protopapas A., Mahncke H., & Merzenich M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 98(23), 13367–13372. doi: 10.1073/pnas.201400998 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Giraud A. L., & Poeppel D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517. doi: 10.1038/nn.3063 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Henry M. J., & Obleser J. (2012). Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences of the United States of America, 109(49), 20095–20100. doi: 10.1073/pnas.1213390109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Overath T., McDermott J. H., Zarate J. M., & Poeppel D. (2015). The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nature Neuroscience, 18(6), 903–911. doi: 10.1038/nn.4021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Ding N., Melloni L., Zhang H., Tian X., & Poeppel D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19(1), 158–164. doi: 10.1038/nn.4186 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Park H., Ince R. A. A., Schyns P. G., Thut G., & Gross J. (2015). Frontal top-down signals increase coupling of auditory low-frequency oscillations to continuous speech in human listeners. Current Biology, 25(12), 1649–1653. doi: 10.1016/j.cub.2015.04.049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Poeppel D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as “asymmetric sampling in time.” Speech Communication, 41(1), 245–255. 10.1016/S0167-6393(02)00107-3 [DOI] [Google Scholar]
  • 57.Fontolan L., Morillon B., Liegeois-Chauvel C., & Giraud A. L. (2014). The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nature Communications, 5, 4694. doi: 10.1038/ncomms5694 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Obleser J., & Kayser C. (2019). Neural entrainment and attentional selection in the listening brain. Trends in Cognitive Sciences, 23(11), 913–926. doi: 10.1016/j.tics.2019.08.004 [DOI] [PubMed] [Google Scholar]
  • 59.Gross J., Hoogenboom N., Thut G., Schyns P., Panzeri S., Belin P., et al. (2013). Speech rhythms and multiplexed oscillatory sensory coding in the human brain. PLOS Biology, 11(12), e1001752. doi: 10.1371/journal.pbio.1001752 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Di Liberto G. M., O’Sullivan J. A., & Lalor E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19), 2457–2465. doi: 10.1016/j.cub.2015.08.030 [DOI] [PubMed] [Google Scholar]
  • 61.Doelling K. B., & Poeppel D. (2015). Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences of the United States of America, 112(45), E6233–E6242. doi: 10.1073/pnas.1508431112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Di Liberto G. M., Pelofi C., Shamma S., & de Cheveigné A. (2020). Musical expertise enhances the cortical tracking of the acoustic envelope during naturalistic music listening. Acoustical Science and Technology, 41(1), 361–364. [Google Scholar]
  • 63.Baltzell L. S., Srinivasan R., & Richards V. (2019). Hierarchical organization of melodic sequences is encoded by cortical entrainment. Neuroimage, 200, 490–500. doi: 10.1016/j.neuroimage.2019.06.054 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Fujioka T., Ross B., & Trainor L. J. (2015). Beta-band oscillations represent auditory beat and its metrical hierarchy in perception and imagery. Journal of Neuroscience, 35(45), 15187–15198. doi: 10.1523/JNEUROSCI.2397-15.2015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Large E. W., Herrera J. A., & Velasco M. J. (2015). Neural networks for beat perception in musical rhythm. Frontiers in systems neuroscience, 9, 159. doi: 10.3389/fnsys.2015.00159 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Lakatos P., Shah A. S., Knuth K. H., Ulbert I., Karmos G., & Schroeder C. E. (2005). An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of Neurophysiology, 94(3), 1904–1911. doi: 10.1152/jn.00263.2005 [DOI] [PubMed] [Google Scholar]
  • 67.Norman-Haignere S., Kanwisher N. G., & McDermott J. H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron, 88(6), 1281–1296. doi: 10.1016/j.neuron.2015.11.035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Nozaradan S., Peretz I., Missal M., & Mouraux A. (2011). Tagging the neuronal entrainment to beat and meter. Journal of Neuroscience, 31(28), 10234–10240. doi: 10.1523/JNEUROSCI.0411-11.2011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Harding E. E., Sammler D., Henry M. J., Large E. W., & Kotz S. A. (2019). Cortical tracking of rhythm in music and speech. Neuroimage, 185, 96–101. doi: 10.1016/j.neuroimage.2018.10.037 [DOI] [PubMed] [Google Scholar]
  • 70.Coath M., Denham S. L., Smith L. M., Honing H., Hazan A., Holonowicz P., et al. (2009). Model cortical responses for the detection of perceptual onsets and beat tracking in singing. Connection Science, 21(2–3), 193–205. 10.1080/09540090902733905 [DOI] [Google Scholar]
  • 71.Large E. W., Wasserman C. S., Skoe E., & Read H. L. (2019). Neural entrainment to missing pulse rhythms. Journal of the Acoustical Society of America, 144(3), 1760–1760. 10.1121/1.5067790 [DOI] [Google Scholar]
  • 72.Large E. W., & Jones M. R. (1999). The dynamics of attending: How people track time-varying events. Psychological Review, 106(1), 119–159. 10.1037/0033-295X.106.1.119 [DOI] [Google Scholar]
  • 73.Falk S. & Kello C.T. (2017). Hierarchical organization in the temporal structure of infant-direct speech and song. Cognition, 163, 80–86. doi: 10.1016/j.cognition.2017.02.017 [DOI] [PubMed] [Google Scholar]
  • 74.Dau T., Kollmeier B., & Kohlrausch A. (1997. a). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. Journal of the Acoustical Society of America, 102(5 Pt 1), 2906–2919. doi: 10.1121/1.420345 [DOI] [PubMed] [Google Scholar]
  • 75.Albert S., de Ruiter L. E., & de Ruiter J.P. (2015) CABNC: the Jeffersonian transcription of the Spoken British National Corpus. https://saulalbert.github.io/CABNC/. [Google Scholar]
  • 76.Stone M. A., & Moore B. C. J. (2003). Tolerable hearing aid delays. III. Effects on speech production and perception of across-frequency variation in delay. Ear and Hearing, 24(2), 175–183. doi: 10.1097/01.AUD.0000058106.68049.9C [DOI] [PubMed] [Google Scholar]
  • 77.Klein W., Plomp R., & Pols L. C. W. (1970). Vowel spectra, vowel spaces, and vowel identification. Journal of the Acoustical Society of America, 48(4), 999–1009. doi: 10.1121/1.1912239 [DOI] [PubMed] [Google Scholar]
  • 78.Pols L. C. W., Tromp H. R. C., & Plomp R. (1973). Frequency analysis of Dutch vowels from 50 male speakers. Journal of the Acoustical Society of America, 53(4), 1093–1101. doi: 10.1121/1.1913429 [DOI] [PubMed] [Google Scholar]
  • 79.Daikoku T. (2018). Entropy, uncertainty, and the depth of implicit knowledge on musical creativity: Computational study of improvisation in melody and rhythm. Frontiers in Computational Neuroscience, 12, 97. doi: 10.3389/fncom.2018.00097 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Tass P., Rosenblum M. G., Weule J., Kurths J., Pikovsky A., Volkmann J., Schnitzler A., & Freund H. -J. (1998). Detection of n:m phase locking from noisy data: Application to magnetoencephalography. Physical Review Letters, 81(15), 3291–3294. doi: 10.1103/PhysRevLett.81.3291 [DOI] [Google Scholar]
  • 81.Schack B., & Weiss S. (2005). Quantification of phase synchronization phenomena and their importance for verbal memory processes. Biological Cybernetics, 92(4), 275–287. doi: 10.1007/s00422-005-0555-1 [DOI] [PubMed] [Google Scholar]
  • 82.Plomp R., & Levelt W. J. M. (1965). Tonal consonance and critical bandwidth. Journal of the Acoustical Society of America, 38(4), 548–560. doi: 10.1121/1.1909741 [DOI] [PubMed] [Google Scholar]
  • 83.Goswami U. (2019b). Speech rhythm and language acquisition: An amplitude modulation phase hierarchy perspective. Annals of the New York Academy of Sciences, 1453(1), 67–78. doi: 10.1111/nyas.14137 [DOI] [PubMed] [Google Scholar]
  • 84.Flanagan S., & Goswami U. (2018). The role of phase synchronisation between low frequency amplitude modulations in child phonology and morphology speech tasks. The Journal of the Acoustical Society of America, 143(3), 1366–1375. doi: 10.1121/1.5026239 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Schwartz D. A., Howe C. Q., & Purves D. (2003). The statistical structure of human speech sounds predicts musical universals. Journal of Neuroscience, 23(18), 7160–7168. doi: 10.1523/JNEUROSCI.23-18-07160.2003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Jacoby N., & McDermott J. H. (2017). Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction. Current Biology, 27(3), 359–370. doi: 10.1016/j.cub.2016.12.031 [DOI] [PubMed] [Google Scholar]
  • 87.Tal I., Large E. W., Rabinovitch E., Wei Y., Schroeder C. E., Poeppel D., et al. (2017). Neural entrainment to the beat: The “missing-pulse” phenomenon. Journal of Neuroscience, 37(26), 6331–6341. doi: 10.1523/JNEUROSCI.2500-16.2017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Winkler I., Háden G. P., Ladinig O., Sziller I., & Honing H. (2009). Newborn infants detect the beat in music. Proceedings of the National Academy of Sciences, 106(7), 2468–2471. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Cumming R., Wilson A., Leong V., Colling L. J., & Goswami U. (2015). Awareness of rhythm patterns in speech and music in children with specific language impairments. Frontiers in Human Neuroscience, 9, 672. doi: 10.3389/fnhum.2015.00672 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Kodály Z. (1974). The selected writings of Zoltán Kodály (L. Halápy & F. Macnicol, Trans.). Boosey and Hawkes. [Google Scholar]
  • 91.Jacques-Dalcroze E. (1980). Rhythm, music and education (H. Rubinstein, Trans.). Dalcroze Society, Inc. [Google Scholar]

Decision Letter 0

Caicai Zhang

2 Mar 2022

PONE-D-21-38111 Hierarchical Amplitude Modulation Structures and Rhythm Patterns: Comparing Western Musical Genres, Song, and Nature Sounds to Babytalk PLOS ONE

Dear Dr. Daikoku,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

 

Two expert reviewers have reviewed your submission. While both reviewers acknowledged that this manuscript addresses a timely and interesting question, they raised a number of issues related to the hypothesis, the methodology, and the selection of testing materials. They also asked about the issue of language specificity and questioned to what extent the findings can generalize to languages other than English. I'd encourage you to incorporate the comments from the two reviewers into your revision as much as possible.

Please submit your revised manuscript by Apr 16 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Caicai Zhang

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf  and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This study was supported by Nakatani Foundation, JSPS KAKENHI Grant Numbers 20K22676 (Research Activity Start-up), 21B101(Transformative Research Areas), and World Premier International Research Centre Initiative (WPI), MEXT, Japan. The sponsor played no role in the study design nor in the collection, analysis, interpretation and writing up of the data.”

Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This study was supported by Nakatani Foundation, JSPS KAKENHI Grant Numbers 20K22676 (Research Activity Start-up), 21B101(Transformative Research Areas), and World Premier International Research Centre Initiative (WPI), MEXT, Japan. The sponsor played no role in the study design nor in the collection, analysis, interpretation and writing up of the data.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. Thank you for stating the following in your Competing Interests section: 

“NO authors have competing interests.”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

 This information should be included in your cover letter; we will change the online submission form on your behalf.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Although the basic question raised by the paper is valid and interesting, it could be formulated in a more logically sound manner, and it seems that the research question has been based on more presumptions than it should. Here are some questions for the authors to address further: (1) Although music and language share many overlapping features, one essential difference is pitch variation. In a non-tonal language such as English, there is no pitch variation that can match that in music. Therefore, setting out by directly "transplanting" the statistical learning approach to understand rhythm in music is questionable. A potentially more objective way to start might be to examine some simple rhythmic patterns (without pitch, e.g., monotone duple/triple metre) first. (2) Why would the statistical learning approaches that work in explaining linguistic rhythm lead the authors to expect comparable results in explaining musical rhythm in the first place? This seems not entirely convincing, and more literature and logical flow should be added there. (3) Using the many quasi-rhythmic sounds as the control was a nice try; however, birdsong can be more musically sophisticated than other sounds such as rain or wind, so why should it be counted as quasi-rhythmic? Why choose nightingale birdsong as the representative of birdsong? More explanation is needed.

Reviewer #2: Daikoku and Goswami present two computational approaches to analyze rhythmic patterns and hierarchical amplitude modulation structures in different acoustic materials. They build on previous work and extend the analyses here to more diverse acoustic materials (music, song, nature sounds, babytalk) using two analysis approaches (S-AMPH, PAD).

This manuscript contributes to a timely and rapidly growing research domain and will stimulate new experimentation and perspectives for the investigation of typical and atypical functioning, as well as of rehabilitation. The potential impact of this contribution could be enhanced by considering the following extensions and revisions, as well as by making the two analysis approaches and programs available via Open Science (or on reasonable request), which would allow interested research groups to further explore and test their behavioral and neural relevance. The discussion section could also gain further impact by proposing more concrete testing hypotheses, e.g., what kind of specific music/language material matching could be used for training and/or cueing.

- The authors propose a matching between AM cycles and musical units like crotchets or quavers (e.g., figure 5). Considering the large range of tempi used in the musical repertoire, such as slow ballads and fast dance songs, which contrasts with the smaller tempo range of nursery rhymes, the proposed matching needs further explanation, and/or should be restricted to an illustrative example, and/or be removed. Indeed, it does not seem straightforward how the quarter-note level would be matched between a musical piece at 60 bpm versus one at 130 bpm.

- The figures and descriptions of the findings (also in comparison to previous work) reveal that it would be interesting to extend the present set of analyses to adult-directed speech as well. This would allow for directly integrating the present work with previous approaches and would open additional perspectives for future research.

- Relatedly, the authors discuss potential differences between different types of music (e.g., from different cultures with different underlying metric structures) as well as different languages (e.g., page 36 and elsewhere). It would be great to see how the models react to musical pieces in even versus uneven meters as well as to speech excerpts of different languages.

- Figure 2 suggests two slightly shifted peaks for music vs. speech (e.g., with maxima of ~1.5 vs. ~2 Hz?). Please add the exact maxima in the text.

The introduction includes a presentation of Temporal Sampling theory. This section could be extended and clarified, notably regarding how it links to other research domains focusing on the potential role of oscillations in this processing (whether typical or atypical). The explanation should be extended from the amplitude rise time of the vowel to that of a consonant, which contains the onset part and should be particularly relevant for the extraction of timing information. Regarding potential musical interventions and the discussion of amplitude envelopes of music, the authors should clarify the potential mechanisms that are boosted by the interventions.

Regarding the window lengths selected by the modeler for the Allan Factor approach, it would be interesting to specify the time windows (time scales) used, and whether this approach would also work for longer utterances (page 7).
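
As background for this question: the Allan Factor quantifies count variability in an event sequence (e.g., acoustic onsets) at a given counting-window size T, AF(T) = E[(N(i+1) − N(i))²] / (2·E[N(i)]), where N(i) is the number of events in the i-th window of length T. The sketch below is a generic illustration under assumed inputs (a simulated Poisson event train and a 1-second window), not the modeler's implementation:

```python
import numpy as np

def allan_factor(event_times, window):
    """Allan Factor at one counting-window size T:
    AF(T) = mean((N[i+1] - N[i])**2) / (2 * mean(N[i]))."""
    edges = np.arange(0.0, event_times.max() + window, window)
    counts, _ = np.histogram(event_times, bins=edges)
    diffs = np.diff(counts.astype(float))
    return float(np.mean(diffs ** 2) / (2.0 * np.mean(counts)))

# Illustrative input: a Poisson event train (rate ~10 events/s for ~1000 s).
rng = np.random.default_rng(1)
events = np.cumsum(rng.exponential(scale=0.1, size=10_000))

af = allan_factor(events, window=1.0)
```

For a memoryless (Poisson) event train, AF(T) stays near 1 at every time scale, while values that grow with T indicate clustering of events across scales. Longer utterances mainly help by providing enough windows for stable estimates at the longer time scales.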

The authors did a great job in explaining TS, AM and brain oscillations (Page 12), but the explanations could be clarified by adding a figure illustrating the different elements (e.g., page 11).

Tal et al. (2017), addressing the ‘missing pulse’ phenomenon, might be an interesting reference in the present context too. How do the present modeling approaches handle situations of the missing beat? Or, more generally, syncopation and groove? The work by E. W. Large (University of Connecticut) proposes some interesting modeling perspectives for rhythm, meter and temporal information (as well as tonality), including the implication of neural oscillations and entrainment. It would be relevant for the present approach, and for its integration into the research domain, to also address these research approaches.

Methods

Page 15:

- Which are the “a priori assumptions” here?

- Did the S-AMPH model use the same parameters as in Leong et al?

Page 19:

- Does Figure 1 present just one example for music, IDS and machine? Could the authors propose a summary across items and show frequency ranges (in an additional figure?), aiming to provide information about the generalization of the observed pattern.

- Figure 1 proposes a comparative presentation of the four categories only for PAD. It would be informative for the reader to see the same type of presentation for S-AMPH.

Page 27: “suggestive of shared physical stimulus characteristics to which the brain can entrain” – So this would concern only the evoked responses (entrainment based on the input), can the authors also address entrainment based on cognitive construct (such as metrical hierarchies) that are not necessarily implemented in the acoustic signal?

Figure 3: How many items were used for each category here?

Discussion

Page 46: Please clarify the extension to music intervention. How could the findings be linked? Which type of music or characteristics should be used to train speech (across development as well as across different languages)?

Additional comments:

Abstract:

Please clarify the wording, notably to which part of the sentence “which matched IDS” refers.

Introduction:

Page 4: this cognitive capacity has been shown to go beyond verbal material, that is, it is not restricted to language, but extends to non-linguistic materials, such as tones (e.g., Saffran et al., 1999; Tillman & Poulin-Charronnat, 2010), timbres (e.g., Loui et al., 2022; Tillman & McAdams, 2003; Tillman & Hoch, 2010) as well as rhythm and timing (e.g., Prince et al., 2018; Brandon et al., 2012).

Page 5: Does this depend on the language? Here the authors refer to their own work in English (Leong et al). It would be interesting to comment on extensions to other languages (e.g., German, French, Spanish, Italian) and whether it would be affected by differences between American English, British English, Australian ..?

Page 13. A cortical tracking approach has been applied to music (even though not specifically to rhythm) by Pelofi, Shamma, and collaborators (see https://clame.nyu.edu/scientists/claire-pelofi), and which might be of interest for the authors here too.

Page 30 “the nightingale song” – this suggests that only one item was used here. Please clarify (and, if so, justify) or, preferably, extend to more items per category.

Page 32: “When MI analyses were applied …..” Does this refer to the music material? Please clarify.

Page 40 “similar to a harsh rhythmic whisper” For the interested reader, it would be great to add a sound file to the supplementary material section.

Page 42: The text refers here to “across Western musical genres and instruments” while most figures do not separate across genres. It would be helpful to see how findings potentially change as well as regarding whether the musical excerpts include voice or not.

Appendix 1 lists the used materials - would the sound files be available upon (reasonable) request?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Oct 14;17(10):e0275631. doi: 10.1371/journal.pone.0275631.r002

Author response to Decision Letter 0


11 May 2022

RESPONSE TO REVIEWER 1:

We are extremely grateful for the Reviewer’s insightful comments on our manuscript. We have studied these comments carefully and have made the necessary corrections to our paper. We believe that the comments have helped us significantly improve and refine the manuscript. Our responses to the Reviewer’s comments are as follows.

Comment 1:

Although music and language share many overlapping features, one essential difference is pitch variation. In a non-tonal language such as English, there is no pitch variation that can match that in music. Therefore, setting out by directly "transplanting" the statistical learning approach to understand rhythm in music is questionable. A potentially more objective way to start might be to examine some simple rhythmic patterns (without pitch, e.g., monotone duple/triple metre) first.

Response: We thank the Reviewer for this pertinent comment. However, pitch variation is independent of rhythm variation, as indeed demonstrated by our modelling. The central research question of the current study concerns "temporal" structure (not spectral structure, such as pitch). Infant language learning has been argued to begin with speech rhythm (Mehler et al., 1988), and infant-directed speech (IDS), also called Babytalk or Parentese, has been described as sing-song speech. Spectral features such as pitch variation have been widely examined in IDS as well as in music (Schwartz, Howe & Purves, 2003). Our relatively new topic (i.e., the temporal features of the sound waveform) is becoming increasingly important in the fields of both speech processing and music. As noted, we did examine the spectral features that carry temporal rhythm within our framework of amplitude modulation (see Fig b in S2 Appendix). The results are shown in Fig b of S4 Appendix. The modelling showed that the spectral component does not affect temporal rhythm.
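
To make the spectral/temporal distinction concrete for the interested reader, the temporal (AM) side of such an analysis can be illustrated with a generic envelope-demodulation sketch: extract the broadband amplitude envelope, then band-pass it around the slow AM rates of interest. The band edges below (~1–3 Hz and ~4–7 Hz, approximating the ~2 Hz and ~5 Hz AM bands) and the toy test signal are illustrative assumptions; this is not the S-AMPH or PAD implementation:

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def amplitude_envelope(signal):
    """Broadband amplitude envelope via the Hilbert transform."""
    return np.abs(hilbert(signal))

def am_band(envelope, fs, lo, hi, order=2):
    """Band-pass the envelope to isolate one AM band (e.g. ~2 Hz or ~5 Hz)."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, envelope)

# Toy signal: a 5 Hz "syllable-rate" AM riding on a 2 Hz "stress-rate" AM,
# both modulating a 200 Hz carrier.
fs = 1000  # sample rate in Hz
t = np.arange(0, 10, 1 / fs)
am = (1 + 0.8 * np.sin(2 * np.pi * 2 * t)) * (1 + 0.5 * np.sin(2 * np.pi * 5 * t))
x = am * np.sin(2 * np.pi * 200 * t)

env = amplitude_envelope(x)
stress_am = am_band(env, fs, 1.0, 3.0)    # ~2 Hz band
syllable_am = am_band(env, fs, 4.0, 7.0)  # ~5 Hz band
```

Note that the extracted AM bands recover the 2 Hz and 5 Hz modulators regardless of the carrier's pitch, which is one way to see that the temporal analysis is independent of pitch variation.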

Comment 2:

Why would the statistical learning approaches that work in explaining linguistic rhythm lead the authors to expect comparable results in explaining musical rhythm in the first place? This seems not entirely convincing, and more literature and logical flow should be added there.

Response: Thank you for the comment. To address this, we have restructured parts of the Introduction accordingly. Although language acquisition by human infants was once thought to require specialized neural architecture, studies of infant statistical learning have revealed that basic acoustic processing mechanisms are sufficient for infants to learn phonology. Further, the cognitive capacity of statistical learning is not restricted to verbal language, but extends to non-linguistic sounds, such as tones (e.g., Saffran et al., 1999; Francois et al., 2011), timbres (e.g., Loui et al., 2022; Tsogli et al., 2019), as well as rhythm and timing (e.g., Prince et al., 2018; Brandon et al., 2012; Daikoku et al., 2020). Children who exhibit difficulties with phonological learning also exhibit rhythm processing difficulties, with both speech and musical stimuli (Goswami, 2015, for review). This implies that there are inherent common statistical properties shared by language and music. We explained these points in the Introduction section as follows.

“Although language acquisition by human infants was once thought to require specialized neural architecture, studies of infant statistical learning have revealed that basic acoustic processing mechanisms are sufficient for infants to learn phonology (speech sound structure at different linguistic levels such as words, syllables, rhymes and phonemes; e.g. Saffran, 2001). Further, the cognitive capacity of statistical learning is not restricted to verbal language, but extends to non-linguistic sounds such as tones (e.g., Saffran et al., 1999; Francois et al., 2011), timbres (e.g., Loui et al., 2022; Tsogli et al., 2019) as well as rhythm and timing (e.g., Prince et al., 2018; Brandon et al., 2012; Daikoku et al., 2020). Children who exhibit difficulties with phonological learning also exhibit rhythm processing difficulties, with both speech and musical stimuli (Goswami, 2015, for review). This implies that there may be inherent common statistical properties shared by language and music, and that such statistical properties contribute to the acquisition of both language and music (Politimou et al., 2019).”

“Theoretically, it is plausible that the physical stimulus characteristics that describe rhythm patterns in nursery rhymes and IDS may also describe the hierarchical rhythmic relationships that characterize music and child songs. According to anthropological analyses (Falk, 2004), it was IDS that emerged first, subsequently enabling the development of adult-directed speech (ADS, which is notably not sing-song in nature). As primitive human cultures also developed music, the same evolutionary adaptations that enabled Babytalk may underpin music as well. That is, it is possible that the AM hierarchy in music has similar structure to the AM hierarchy in IDS. The core research question addressed here is whether music will exhibit similar salient bands of AMs and similar phase dependencies between AM bands to IDS and English nursery rhymes (child-directed speech, CDS).”

Comment 3:

Using the many quasi-rhythmic sounds as the control was a nice try; however, birdsong can be more musically sophisticated than other sounds such as rain or wind, so why should it be counted as quasi-rhythmic? Why choose nightingale birdsong as the representative of birdsong? More explanation is needed.

Response: We thank the Reviewer for this pertinent comment. To address it, we have explained in more detail why we chose nightingale songs. A priori, it could either be argued that birdsong will differ from human music (our original argument), or that it will be more similar to human music than other nature sounds, for the reasons pointed out by the Reviewer. We now acknowledge both possibilities. In some of the analyses, we indeed detected similarity of AM structure between the human voice and birdsong, as shown in Fig 3. In addition, we chose nightingales as the representative case for birdsong because a previous study revealed that nightingale rhythms, rather than other birds' song rhythms such as those of zebra finches, are most similar to human musical rhythms (Roeske et al., 2020). We revised this point throughout the manuscript as follows.

“Abstract: Quasi-rhythmic and non-human sounds found in nature (birdsong, rain, wind) were utilized for control analyses.”

“Introduction: As a control for our prediction that the AM structure of music and IDS/CDS should be highly similar, we also modelled other natural sounds that have quasi-rhythmic structure such as wind, fire, river, storms, rain, as well as non-human vocal sounds, namely birdsong. A priori, we expect nature sounds to have a different AM structure to IDS and CDS. Nature sounds such as rain and storms were originally used to derive PAD (Turner, 2010), and are characterized by AM patterns correlated over long time scales and across multiple frequency bands. However, as these sounds are not produced by humans nor shaped by human physiology and culture, there is no reason a priori to expect them to be similar in AM structure to IDS and CDS. Birdsong may be different, as it is more musically sophisticated and closer to human song than the other nature sounds such as wind, fire, river, storms, and rain. Indeed, a previous study revealed that the structure of nightingale rhythms, rather than other bird song rhythms such as zebra finches, are similar to the structure of human musical rhythms (Roeske et al., 2020). Therefore, we also modelled the corpus of nightingale’s song studied by Roeske et al. (2020). We expected the AM patterns here to be more similar to IDS and CDS than the AM patterns for wind, rain etc.” 

RESPONSE TO REVIEWER 2:

We wish to express our strong appreciation to the Reviewer for the insightful comments on our manuscript. We have studied these comments very carefully and have made the necessary corrections. We feel the comments have helped us significantly improve the manuscript.

Comment 1:

Daikoku and Goswami present two computational approaches to analyze rhythmic patterns and hierarchical amplitude modulation structures in different acoustic materials. They build on previous work and extend the analyses here to more diverse acoustic materials (music, song, nature sounds, babytalk) using two analysis approaches (S-AMPH, PAD). This manuscript contributes to a timely and rapidly growing research domain and will stimulate new experimentation and perspectives for the investigation of typical and atypical functioning, as well as of rehabilitation. The potential impact of this contribution could be enhanced by considering the following extensions and revisions, as well as by making the two analysis approaches and programs available via Open Science (or on reasonable request), which would allow interested research groups to further explore and test their behavioral and neural relevance. The discussion section could also gain further impact by proposing more concrete testing hypotheses, e.g., what kind of specific music/language material matching could be used for training and/or cueing.

Response: We thank the reviewer for all of the pertinent comments. Based on these comments, we have thoroughly revised the manuscript, and we have also proposed more concrete testing hypotheses as suggested. We believe the manuscript has improved considerably. Further, as indicated, we have made all of the analyzed data available via Open Science at the following link (https://osf.io/6s8kp/). In addition, all original sound files are publicly available from the Figshare database: http://figshare.com/articles/SAMPH_CDS/1318572 DOI: 10.6084/m9.figshare.1318572. Please see the original article that used the speech data for more detailed information (Leong et al., 2015). All original birdsong and nature sound files are available from https://www.xeno-canto.org/, https://mixkit.co/free-sound-effects/nature/, and https://www.zapsplat.com. The music and human song data are under copyright, but detailed information is provided in S1 Appendix. We describe all of this in the Data Availability section.

Comment 2:

The authors propose a matching between AM cycles and musical units like crotchets or quavers (e.g., figure 5). Considering the large range of tempo used in the musical repertoire, such as slow ballads and fast dance songs, which contrast with the smaller tempo range for nursery rhymes, the proposed matching needs some further explanation and/or be restricted to an illustrative example and/or be removed. Indeed, it does not seem straightforward how the quarter-note level would be matched between a musical piece at 60 bpm versus 130 bpm.

Response: We thank the reviewer for the pertinent comments. Our intention was not to claim that the quarter notes were always at 4 Hz, but rather that the phase relations in the hierarchy could be matched to notes at different temporal levels in a piece of music within the AM bands described by our modelling. We have revised Fig 5 accordingly and altered the corresponding part of the manuscript as follows.

“Discussion section: It is of note that these hierarchical statistical temporal dependencies should be consistent across different tempi. The dependencies refer to temporal bandings of AMs, hence the hierarchical dependencies should simply adjust to fit the tempo used in the music, for example slow ballads and fast dance songs. In similar fashion, it has been demonstrated that the hierarchical AM dependencies in speech adapt to speech rate (see Leong et al., 2017). Indeed, the current modelling revealed statistically strong mutual dependence (using MI estimates) between adjacent bands in the AM hierarchy across musical genres (Western classical, jazz, rock, children’s songs) and musical instruments (piano, guitar, violin, viola, cello, bass, single-voice, multi-voice).”

“Result section (3.4. Multi-Timescale Phase Synchronization in Both Models): Fig 5 provides a schematic example of the 1:2 integer ratio regarding the likely AM hierarchy in music. The figure shows in principle how musical rhythm could be hierarchically organized based on note values (i.e., crotchets, quavers, demiquavers and onsets, Fig 5, left) and the AM hierarchy (Fig 5, right).”

“Fig 5. Schematic Depiction of the Hierarchical AM Structure yielding Rhythm in Music. The left and right panels show, respectively, the musical score and the corresponding sound waveform of part of the 33 Variations on a waltz by Anton Diabelli, Op. 120 (commonly known as the Diabelli Variations) by Ludwig van Beethoven. In principle, musical rhythm could be hierarchically organized based on note values (left) matched to nested amplitude modulations (AM, right) in bandings spanning different temporal rates (for example, green ~2 Hz, blue ~4 Hz, red ~8 Hz, matching S4 Appendix Table d). In the framework of Temporal Sampling theory, the AM bands (right) equate temporally to neural oscillatory rhythms. Auditory rhythm perception relies in part on neural tracking of the AM patterns at different timescales simultaneously (e.g., neural tracking of the green, blue, and red AMs in Fig 5 by neurophysiological delta, theta and alpha bands). This neural tracking is triggered by acoustic components of the sound signal such as the amplitude rise times (musical attack times) of the nested AM components which phase-reset oscillatory cortical activity. There is of course a large range of tempi used in music, for example slow ballads and fast dance songs. However, as shown by the black lines in the musical note hierarchy (left) and the dotted vertical lines in the AM hierarchy (right), the adjacent tiers of the hierarchy (i.e., green & blue and blue & red AM pairs) are more dependent on each other than non-adjacent hierarchical relations (i.e., the green-red AM pairing), and thus the hierarchy itself will expand or contract to fit the tempo.”

Comment 3:

The figures and descriptions of the findings (also in comparison to previous work) reveals that it would be interesting to extend the present set of analyses also to adult-directed speech in addition. This would allow for directly integrating the present work in previous approaches and allow for additional perspectives for future research. Relatedly, the authors discuss potential differences between different types of music (e.g., from different cultures with different underlying metric structures) as well as different languages (e.g., page 36 and elsewhere). It would be great to see how the models react to musical pieces in even versus uneven meters as well as to speech excerpts of different languages.

Response: Thank you for the comment. As suggested, we have added an ADS panel and a birdsong panel to Fig 1. Further, as indicated in the paper, we have already published analyses of adult-directed speech (see Leong et al., 2017; Araujo et al., 2018). In this previous work examining the AM hierarchy of ADS, we demonstrated that ADS has significantly weaker phase synchronization between the slower bands of AMs centred on ~2 Hz and ~5 Hz compared to IDS. We also demonstrated that literacy affects phase synchronization. These prior analyses are discussed in the Introduction (see pages 9-10), where we motivate why we predict greater similarity between music and IDS than between music and ADS. We agree that in future work it would be interesting to apply the modelling to even versus uneven meters and to more languages. However, that is beyond the scope of the current paper.

Comment 4:

Figure 2 suggests two slightly shifted peaks for music vs speech (e.g., with maxima of ~1.5 vs. ~2 Hz?). Please add the exact maxima in the text.

Response: We thank the reviewer for the pertinent comments. As indicated, we now report the mean peaks for music and IDS in the manuscript as follows. Further, we have also shown the mean FFT results in S4 Appendix for both S-AMPH and PAD.

“The modelling showed that the AM bands in music matched those previously found in IDS, but the AM bands in the nature sounds did not. In particular, in panel 2d strong peaks in the delta and theta bands are clearly visible for instrumental music (red line, mean peaks: delta 1.1 Hz and 2.2 Hz, theta 4.7 Hz) and IDS (black line, mean peaks: delta 1.8 Hz, theta 3.3 Hz), but not for nature sounds (blue line). Although the delta and theta peaks occur at slightly different temporal points, they are within close range of each other. Further, there are two matching peaks at delta and theta rates between IDS (black line in Fig 2d) and child song (light green in Fig 2d), but not in adult song, birdsong, or nature sounds.”

Comment 5:

The introduction includes a presentation of the Temporal Sampling theory. This section could be extended and clarified, notably how it links to other research domains focusing on the potential role of oscillations in this processing (whether typical or atypical). The explanation should be extended from the amplitude rise time of the vowel to that of a consonant, which contains the onset part and should be particularly relevant for the extraction of timing information. Regarding potential musical interventions and the discussion of amplitude envelopes of music, the authors should clarify the potential mechanisms that are boosted by the interventions.

Response: We thank the reviewer for all of the pertinent comments. However, prior research on rhythmic timing and amplitude rise times has already demonstrated that the rise time of the vowel is the key to speech rhythm; the consonants do not play a core role. Although a consonant before the vowel can move the temporal position of the perceived beat (e.g. sonorous consonant onsets produce later vowel peaks), the P-centres literature has already shown that the consonant in a syllable is not key to rhythmic timing (e.g., Scott, 1991, showed that if two syllables with differing consonant onsets, like STREET and EAT, are spoken to a rhythm, the rise time of the vowel governs syllable production). We now briefly mention P-centres in the ms, noting potential mechanisms; please see page 7.

Comment 6:

Regarding the window lengths selected by the modeler for the Allan Factor approach, it would be interesting to specify the time windows used (time scales) and whether this approach would work also for longer utterances? (page 7)

Response: We thank the reviewer for the comments. Allan factor analysis quantifies the clustering of events in terms of their variances in timing at different timescales. Time windows of a given size are tiled across a time series of events, and events are counted within each window. For example, in the study by Kello et al. (2017), recordings were chosen to be at least 4 min long, and window sizes were varied from approximately 15 ms to 15 s. Their preliminary results showed that there was no need for windows shorter than 15 ms because events stopped being clustered, and 15 s is the largest window possible given a 4 min long recording. We now note their windows in our text (page 8).
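To make the windowing concrete, the counting scheme described above can be sketched as follows. This is a minimal illustration only, not the implementation used by Kello et al.; the Poisson event train and the window sizes are hypothetical, chosen to show that an unclustered process yields an Allan factor near 1 at every timescale.

```python
import random
import statistics

def allan_factor(event_times, window_size):
    """Allan factor at one timescale T: the variance of successive
    window-count differences, normalized by twice the mean count.
    AF is near 1 for a Poisson (unclustered) process; AF > 1 signals
    clustering of events at that timescale."""
    duration = max(event_times)
    n_windows = int(duration // window_size)
    counts = [0] * n_windows
    for t in event_times:           # tile windows across the event series
        i = int(t // window_size)
        if i < n_windows:
            counts[i] += 1
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    return statistics.fmean(d * d for d in diffs) / (2 * statistics.fmean(counts))

# Hypothetical example: a homogeneous Poisson event train (no clustering)
random.seed(0)
t, events = 0.0, []
for _ in range(5000):
    t += random.expovariate(20.0)   # mean inter-event interval of 50 ms
    events.append(t)

for T in (0.015, 0.15, 1.5):        # window sizes from 15 ms upward
    print(f"T = {T:5.3f} s, AF = {allan_factor(events, T):.2f}")
```

Clustered signals such as speech or music would instead show AF rising above 1 as the window size grows toward the timescales at which events group together.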

Comment 7:

The authors did a great job in explaining TS, AM and brain oscillations (Page 12), but the explanations could be clarified by adding a figure illustrating the different elements (e.g., page 11).

Response: Thank you for the pertinent comments. We have adapted Fig 5 to meet this point. The AM bands (Fig 5 right) equate temporally to neural oscillatory rhythms. Human auditory rhythm perception relies in part on neural tracking of the AM patterns at different timescales simultaneously (e.g., green, blue, and red line in Fig 5). These temporal modulation patterns are then bound into a single sound percept. This neural tracking relies on acoustic components of the sound signal such as the amplitude rise times of nested AM components phase-resetting oscillatory cortical activity. We added the explanation in the legend of Fig 5 as follows.

“In the framework of Temporal Sampling theory, the AM bands (right) equate temporally to neural oscillatory rhythms. Auditory rhythm perception relies in part on neural tracking of the AM patterns at different timescales simultaneously (e.g., neural tracking of the green, blue, and red AMs in Fig 5 by neurophysiological delta, theta and alpha bands). This neural tracking is triggered by acoustic components of the sound signal such as the amplitude rise times (musical attack times) of the nested AM components which phase-reset oscillatory cortical activity.”

Comment 8:

Tal et al. (2017) addressing the ‘missing pulse’ phenomenon might be an interesting reference in the present context too. How do the present modeling approaches handle situations of the missing beat? Or more generally of syncopation and groove? The work by E. W. Large (University of Connecticut) proposes some interesting modeling perspectives for rhythm, meter and temporal information (as well as tonality), including the implication of neural oscillations as well as entrainment. It would be relevant for the present approach and for the integration in the research domain to also address these research approaches.

Response: We thank the reviewer for bringing these interesting papers to our attention. We have cited them in the Discussion and related them to our modelling work (see page 49). In our study, S-AMPH and PAD could detect missing beats as a silent gap (no tone in music, no voice in speech). However, our modelling approach is basically only relevant to the part of Ed Large’s resonance theory that is based on the physical characteristics of the stimulus, as we now state on page 49.

“Introduction section: For music, oscillatory rhythms may align with rhythmic features of the acoustic input such as crotchets or musical beats (Doelling & Poeppel, 2015; Large et al., 2015; Di Liberto et al., 2020; Baltzell et al., 2019; Fujioka et al., 2015). However, possible correspondences between different oscillators and different musical units like crotchets and quavers have yet to be investigated.”

“Introduction section: Note finally that our modelling approach is conceptually distinct from models that identify the tactus or beat markers in singing (Coath et al., 2010), models of pulse perception based on neural resonance (Large et al., 2019), oscillatory models of auditory attention based on dynamic attending (Large & Jones, 1999), and models of temporal hierarchical structure based on the Allan Factor approach (Falk & Kello, 2017; Kello et al., 2017). Conceptually, ours is the only modelling approach to analyze the modulation structure of the amplitude envelope, recognized as core to speech processing by speech engineers (Greenberg, 2006). Our modelling decomposes the amplitude envelope and then relates the resulting AM bands and their phase relationships to individual musical units. In principle, this approach provides a novel acoustic perspective on musical rhythm, motivated by our prior novel acoustic analyses of Babytalk.”

“Discussion section: The modelling presented here also converges with past studies designed to detect pulse based on neural resonance theory (Large et al., 2019). Pulse is the perceptual phenomenon in which an individual perceives a steady beat. Large et al. (2019) suggested that the perception of pulse emerges through nonlinear coupling between two oscillatory networks, one representing the physical properties of the stimulus and a second network that integrates inputs from the sensory system. The nonlinear interactions between the two give rise to oscillatory activity not only at the frequencies present in the physical stimulus, but also at more complex combinations, including the pulse frequency. Consistent with this view, Tal et al. (2017) reported phase locking for the adult brain at the times of a missing pulse, even though the pulse was absent from the physical stimulus. This suggests that neural activity at the pulse frequency is (for adults) internally generated rather than being purely stimulus-driven. From this perspective, our modelling (i.e., S-AMPH and PAD) is capturing the physical stimulus characteristics (the modulation structure of the amplitude envelope and its internal phase relations) rather than capturing internally-generated oscillatory activity. To our knowledge, missing pulse phenomena have not yet been studied in infants. It may be that early learning of hierarchical phase relations from the amplitude envelopes of musical inputs may be required for the internal generation of missing pulse phenomena. On the other hand, ERP studies show that even newborns can detect beat violations in oddball paradigms, where occasionally a deviant rhythm with a missing downbeat is heard in place of a standard metrical rhythm (Winkler et al., 2009). Further studies with infants may also be able to investigate the phase relationships between missing pulses or beats and higher hierarchical units such as musical phrasing or prosody.”

Comment 9:

Methods

Page 15:

- Which are the “a priori assumptions” here?

Response: Thank you for the useful comment. We agree that the wording was misleading to readers. We have therefore revised it as follows.

“The PAD model infers the modulators and a carrier based on Bayesian inference. PAD is biologically neutral and can be run recursively using different demodulation parameters each time to identify potential “priors” in the input stimulus.”

Comment 10:

Did the S-AMPH model use the same parameters as in Leong et al?

Response: Yes, the methodologies were based on the previous study by Leong and Goswami (2015). To establish the patterns of spectral modulation, the raw acoustic signal was passed through a 28-channel log-spaced ERBN filterbank spanning 100–7250 Hz. Further, the Hilbert envelopes of each of the spectral bands were passed through a 24-channel log-spaced ERBN filterbank spanning 0.9–40 Hz. This is specified at the beginning of section 2.2.1 (Signal Processing: Spectral and Temporal Modulations) as follows.

“The methodologies were based on a previous study by Leong and Goswami (2015).”
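For illustration, the logarithmic spacing of the channel centre frequencies implied by these parameters can be sketched as below. This is a simple log-spacing sketch only: the actual S-AMPH filters use ERB_N bandwidths and filter shapes as specified in Leong and Goswami (2015), which are not reproduced here.

```python
import math

def log_spaced_centres(f_min, f_max, n):
    """Return n centre frequencies spaced evenly on a logarithmic
    axis between f_min and f_max (endpoints included)."""
    step = (math.log(f_max) - math.log(f_min)) / (n - 1)
    return [f_min * math.exp(i * step) for i in range(n)]

# Spectral filterbank: 28 log-spaced channels spanning 100-7250 Hz
spectral = log_spaced_centres(100.0, 7250.0, 28)
# Temporal (modulation-rate) filterbank: 24 log-spaced channels, 0.9-40 Hz
temporal = log_spaced_centres(0.9, 40.0, 24)

print(len(spectral), round(spectral[0]), round(spectral[-1]))
print(len(temporal), temporal[0], round(temporal[-1], 1))
```

Log spacing means each successive channel centre is the previous one multiplied by a constant ratio, mirroring the roughly logarithmic frequency resolution of the cochlea.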

Comment 11:

Page 19:

- Does Figure 1 present just one example for music, IDS and machine? Could the authors propose a summary across items and show frequency ranges (in an additional figure?), aiming to provide information about the generalization of the observed pattern.

Response: We thank the reviewer for the pertinent comments. The methods used to generate Fig 1 require a representative piece of acoustic stimulus, not a summary. A summary across items is shown subsequently in Fig 2c and Fig 2d, and we also provide summaries for both S-AMPH and PAD in Table d in S4 Appendix. We cannot average items to generate Fig 1 because even sounds from the same category are in principle different stimuli, although their temporal hierarchies are similar; if they were averaged, the inherent characteristics of the temporal hierarchy would cancel each other out.

Comment 12:

Figure 1 proposes a comparative presentation of the four categories only for PAD. It would be informative for the reader to see the same type of presentation for S-AMPH.

Response: Thank you for the important comment. In this figure, we are showing how each sound statistically or acoustically includes a temporal hierarchy without the sensory/neural perspective imposed by the human cochlea and represented by S-AMPH. Because the S-AMPH models the cochlear filterbank, the frequency components at the boundaries of adjacent filterbanks (e.g., <0.9 Hz, 2.5 Hz, 7 Hz, 17 Hz, and 30 Hz) partially disappear, which would yield inappropriate scalograms. We now state this in the figure legend. On the other hand, we added both ADS and birdsong panels to Fig 1 so that readers can compare IDS with ADS, the other sounds, and birdsong.

Comment 13:

Page 27: “suggestive of shared physical stimulus characteristics to which the brain can entrain” – So this would concern only the evoked responses (entrainment based on the input), can the authors also address entrainment based on cognitive construct (such as metrical hierarchies) that are not necessarily implemented in the acoustic signal?

Response: We thank the reviewer for the pertinent comments. This issue is now addressed in the new section concerning neural resonance theory (page 50); please see our response to Comment 8.

Comment 14:

Figure 3: How many items were used for each category here?

Response: Thanks for the comments. The sample size and number of items in each category are described in detail in S1 Appendix (Corpora of Music, Speech, and Nature sound). We also state this in the Materials and Methods section as follows.

“The sample size and number of items in each category is provided in S1 Appendix.”

Comment 15:

Discussion

Page 46: Please clarify the extension to music intervention. How could the findings be linked? Which type of music or characteristics should be used to train speech (across development as well as across different languages)?

Response: Our study may suggest that interventions utilising “Western” music, which has a temporal hierarchy known to correspond to English IDS (particularly delta rhythm and synchronization between delta and theta rhythms), may be beneficial for children with disorders of English language learning. As indicated, we revised the sentence as follows.

“The modelling presented here is also relevant to the remediation of childhood language disorders. The possible utility of musical interventions for children with disorders of language learning such as developmental language disorder (DLD) and developmental dyslexia has long been recognized (Ladányi et al., 2020; Cumming et al., 2015; Kodály, 1974; Jacques-Dalcroze, 1980). Such interventions are likely to be most beneficial when the temporal hierarchy of the music corresponds to the temporal hierarchy underpinning speech rhythm (Goswami, 2019a; Goswami, 2019b). Careful consideration of the statistical rhythm structures characterizing speech in different languages may thus lead to better remedial outcomes. For example, our findings suggest that for children with disorders of English language learning, interventions using Western music should be beneficial via the shared temporal hierarchy with English IDS and CDS. Further, it is possible that such interventions could be beneficial for second language learners.”

Comment 16:

Abstract:

Please clarify the wording, notably to which part of the sentence “which matched IDS” is referring to.

Response: We thank the reviewer for the useful comment. As suggested, we clarified it in the Abstract as follows.

“Both models revealed an hierarchically-nested AM modulation structure for music and song, but not nature sounds. This AM modulation structure for music and song matched IDS.”

Comment 17:

Introduction:

Page 4: this cognitive capacity has been shown to go beyond verbal material, that is, it is not restricted to language, but extends to non-linguistic materials, such as tones (e.g., Saffran et al., 1999; Tillman & Poulin-Charronnat, 2010), timbres (e.g., Loui et al., 2022; Tillman & McAdams, 2003; Tillman & Hoch, 2010) as well as rhythm and timing (e.g., Prince et al., 2018; Brandon et al., 2012).

Response: We thank the reviewer for the pertinent comments. We have now cited these important studies in the Introduction section as follows.

“Although language acquisition by human infants was once thought to require specialized neural architecture, studies of infant statistical learning have revealed that basic acoustic processing mechanisms are sufficient for infants to learn phonology (speech sound structure at different linguistic levels such as words, syllables, rhymes and phonemes; e.g. Saffran, 2001). Further, the cognitive capacity of statistical learning is not restricted to verbal language, but extends to non-linguistic sounds such as tones (e.g., Saffran et al., 1999; Francois et al., 2011), timbres (e.g., Loui et al., 2022; Tsogli et al., 2019) as well as rhythm and timing (e.g., Prince et al., 2018; Brandon et al., 2012; Daikoku et al., 2020).”

Comment 18:

Page 5: Does this depend on the language? Here the authors refer to their own work in English (Leong et al). It would be interesting to comment on extensions to other languages (e.g., German, French, Spanish, Italian) and whether it would be affected by differences between American English, British English, Australian ..?

Response: As suggested, we now include some discussion about other languages, other species, and music relationships in the Introduction section as follows.

“These phase relations between peaks and troughs in AM bands centred on ~2 Hz and ~5 Hz have also been revealed by statistical modelling of other languages like Portuguese and Spanish (Araujo et al., 2018; Pérez-Navarro et al., 2022). For example, Pérez-Navarro et al. (2022) reported that CDS in Spanish was characterized by higher temporal regularity of the placement of stressed syllables (phase synchronization of ~2 Hz and ~5 Hz AM bands) compared to ADS in Spanish. Further, phase relations are statistical characteristics that describe music as well as language, and phase relations appear relatively uniform regarding music from different cultures (Mehr et al., 2020; McPherson et al., 2020), as well as songs of different species (Roeske et al., 2020). Even prior to the acquisition of culture-specific biases of musical rhythm, infants are affected by ratio complexity (Hannon et al., 2011). Thus, phase hierarchies may be a universal aspect across music and language.”

Comment 19:

Page 13. A cortical tracking approach has been applied to music (even though not specifically to rhythm) by Pelofi, Shamma, and collaborators (see https://clame.nyu.edu/scientists/claire-pelofi), and which might be of interest for the authors here too.

Response: We thank the reviewer for bringing this important research to our attention. We have described and cited their papers in the manuscript as follows.

“For music, oscillatory rhythms may align with rhythmic features of the acoustic input such as crotchets or musical beats (Doelling & Poeppel, 2015; Large et al., 2015; Di Liberto et al., 2020; Baltzell et al., 2019; Fujioka et al., 2015). However, possible correspondences between different oscillators and different musical units like crotchets and quavers have yet to be investigated.”

“Further, Di Liberto and colleagues revealed that musical expertise increases the accuracy of cortical tracking (Di Liberto, Pelofi, Shamma, and de Cheveigné, 2020).”

Comment 20:

Page 30 “the nightingale song” – this suggests as if only 1 item was used here. Please clarify (and if yes, justify) or, preferably, extend to more items per category.

Response: We thank the reviewer for the pertinent comments. We in fact analyzed 47 items of birdsong. However, as indicated, the wording read as if only one item was used. We have refined the wording as follows.

“at least for the corpus of the 47 nightingale songs (see, S1 Appendix) analyzed here.”

Comment 21:

Page 32: “When MI analyses were applied …..” Does this refer to the music material? Please clarify.

Response: Thanks for the helpful comment. We clarified it as follows.

“When MI analyses were applied for PAD in music, four peak frequencies were detected at ~2.4 Hz, ~4.8 Hz, ~9 Hz and 16 Hz.”

Comment 22:

Page 40 “similar to a harsh rhythmic whisper” For the interested reader, it would be great to add a sound file to the supplementary material section.

Response: We thank the reviewer for the comment. As suggested, we added the sound file at the following link: https://osf.io/6s8kp/, and state this in the manuscript as follows.

“The resulting percept was similar to a harsh rhythmic whisper. The sound is available at https://osf.io/6s8kp/.”

Comment 23:

Page 42: The text refers here to “across Western musical genres and instruments” while most figures do not separate across genres. It would be helpful to see how findings potentially change as well as regarding whether the musical excerpts include voice or not.

Response: We thank the reviewer for the helpful comment. We show each genre and instrument separately in S6 Appendix; to keep the presentation simple and clear for the reader, the main text shows summary figures. As stated, we modified the sentence as follows.

“Indeed, the multi-timescale synchronization found here was systematic across Western musical genres and instruments (see S6 Appendix)”

Comment 24:

Appendix 1 lists the used materials - would the sound files be available upon (reasonable) request?

Response: All original infant-directed speech files are publicly available from the Figshare database: http://figshare.com/articles/SAMPH_CDS/1318572 DOI: 10.6084/m9.figshare.1318572. Please see the original article that used the speech data for more detailed information (Leong et al., 2015). All original birdsong and nature sound files are available from https://www.xeno-canto.org/, https://mixkit.co/free-sound-effects/nature/, and https://www.zapsplat.com. The music and human song data are under copyright, but detailed information is provided in S1 Appendix; these items are available for purchase. We describe this in the “Data Availability Statement” section.

Attachment

Submitted filename: Response_to_Reviewer_TD_UG_TD_UG_TD FINAL.docx

Decision Letter 1

Lorena Verduci

13 Jul 2022

PONE-D-21-38111R1
Hierarchical Amplitude Modulation Structures and Rhythm Patterns: Comparing Western Musical Genres, Song, and Nature Sounds to Babytalk
PLOS ONE

Dear Dr. Daikoku,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript has been evaluated by one reviewer, and his comments are available below.

The reviewer has raised a number of concerns. He requests improvements to the reporting of methodological aspects of the study, for example, regarding the average duration across the different musical pieces used.  The reviewer also requests revision to the introduction and discussion.

Could you please carefully revise the manuscript to address all comments raised?

Please submit your revised manuscript by Aug 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Lorena Verduci

Staff Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

********** 

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

********** 

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

********** 

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

********** 

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

********** 

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I thank the authors for their revision, which considerably improved the manuscript. I also welcome their “open-science” attitude. I just have the following clarification questions:

- In the introduction, the authors indicate that their modeling approach is “conceptually distinct from … models of pulse perception based on neural resonance (Large et al., 2019), oscillatory models of auditory attention based on dynamic attending”. In the discussion section, however, the authors discuss the convergence of their modelling approach with neural resonance theory. Please clarify. Also it is not clear why it is presented as being distinct from dynamic attending models as these also include multiple oscillators that entrain to the stimulus and influence processing, and also temporal sampling framework has been presented in link with dynamic attending (e.g., Goswami, 2011). Please clarify.

- Page 23: “The methodologies were based on a previous study by Leong and Goswami (2015).” Please clarify whether these were adapted (and “based on”/inspired by?) or whether the same implementations (i.e., same parameters, steps, etc.) were used here as in Leong and Goswami (2015). Otherwise, please indicate what was changed (and why).

- Thanks also for adding the various appendices for further information. Regarding appendix S1, could you clarify also the average duration across the different musical pieces used? I guess “Duration (minutes)” currently just indicates the total duration of all pieces together? Or are all pieces listed underneath played in their entirety ? (no excerpts chosen). Thanks.

********** 

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Oct 14;17(10):e0275631. doi: 10.1371/journal.pone.0275631.r004

Author response to Decision Letter 1


3 Aug 2022

RESPONSE TO REVIEWERS 2:

We are extremely grateful for the Reviewers’ insightful comments on our manuscript. We have studied the comments carefully and made the necessary corrections in our paper. We believe that they have helped us significantly improve and refine the manuscript. Our responses to the Reviewers’ comments are as follows.

Comment 1:

In the introduction, the authors indicate that their modeling approach is “conceptually distinct from … models of pulse perception based on neural resonance (Large et al., 2019), oscillatory models of auditory attention based on dynamic attending”. In the discussion section, however, the authors discuss the convergence of their modelling approach with neural resonance theory. Please clarify. Also it is not clear why it is presented as being distinct from dynamic attending models as these also include multiple oscillators that entrain to the stimulus and influence processing, and also temporal sampling framework has been presented in link with dynamic attending (e.g., Goswami, 2011). Please clarify.

Response: We thank the reviewers for the pertinent comments. They made us reflect that our modelling is theoretically, rather than conceptually, distinct from other theories, while sharing some conceptual similarities with other approaches. We have therefore reworded some of the sentences in the Introduction and the Discussion. The difference from dynamic attending theory is that DAT hypothesised multiple oscillators on the basis of behavioural findings from attention tasks, rather than identifying a hierarchy of oscillators related to physical variations in the stimulus. Here we specify the key oscillators in musical rhythm and the expected AM hierarchy a priori, based on our prior computational modelling of acoustic rhythm in language.

Introduction (pp. 15-16)

Note finally that our modelling approach is theoretically distinct from models that seek to identify the tactus or beat markers in singing (Coath et al., 2010), models of pulse perception based on neural resonance (Large et al., 2019), oscillatory models of auditory attention based on dynamic attending (Large & Jones, 1999), and models of temporal hierarchical structure based on the Allan Factor approach (Falk & Kello, 2017; Kello et al., 2017). Ours is the only modelling approach to analyze the modulation structure of the amplitude envelope and further to make specific a priori predictions concerning expected key temporal AM rates and key hierarchical AM phase relations related to the perception of musical rhythm structure and the parsing of musical units. We predict that the phase dependency between bands of AMs centred on ~2 Hz and ~ 5 Hz will relate to musical rhythm across different genres, and that music will show similar hierarchical AM structures in predictable spectral bandings to IDS, structures that can provide a perceptual basis for perceiving musical notes and musical phrasing. The amplitude envelope is recognized as core to speech processing by speech engineers (Greenberg, 2006). Our modelling decomposes the amplitude envelope of music instead of speech and then relates the resulting AM bands and their phase relationships to individual musical units. In principle, this approach provides a novel acoustic perspective on musical rhythm, motivated by our prior novel acoustic analyses of Babytalk.
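The phase-dependency prediction above can be illustrated with a minimal sketch. This is our own illustration, not the authors' S-AMPH implementation: the band edges, filter order, and toy envelope are assumptions chosen for clarity. The idea is to band-pass an amplitude envelope into a ~2 Hz and a ~5 Hz AM band and compute a 1:2 phase synchronization index (PSI) between them.

```python
# Illustrative sketch only (not the S-AMPH pipeline): isolate two AM bands
# from an amplitude envelope and measure their n:m phase locking.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def am_band(envelope, fs, lo, hi, order=2):
    """Band-pass the envelope to isolate one AM band (zero-phase filtering)."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, envelope)

def psi(slow, fast, n=1, m=2):
    """n:m phase synchronization index between two AM bands, in [0, 1]."""
    phi_slow = np.angle(hilbert(slow))
    phi_fast = np.angle(hilbert(fast))
    return np.abs(np.mean(np.exp(1j * (m * phi_slow - n * phi_fast))))

fs = 100                          # Hz; envelope sampling rate (assumed)
t = np.arange(0, 30, 1 / fs)
# Toy envelope: a 2 Hz "stress-rate" AM plus a phase-locked 4 Hz component.
env = 1 + 0.5 * np.cos(2 * np.pi * 2 * t) + 0.3 * np.cos(2 * np.pi * 4 * t)

slow = am_band(env, fs, 1.0, 3.0)  # ~2 Hz band
fast = am_band(env, fs, 3.0, 7.0)  # ~5 Hz band (captures the 4 Hz component)
print(f"1:2 PSI = {psi(slow, fast):.2f}")  # high (near 1) when phase-locked
```

With the phase-locked toy envelope the PSI approaches 1; for unrelated AM bands it falls toward 0, which is the sense in which the 2 Hz / 5 Hz phase dependency can be quantified.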

Discussion (p. 51)

“The modelling presented here also converges conceptually with past studies designed to detect pulse based on neural resonance theory (Large et al., 2019).”

Comment 2:

- Page 23: “The methodologies were based on a previous study by Leong and Goswami (2015).” Please clarify whether these were adapted (and “based on”/inspired by?) or whether the same implementations (i.e., same parameters, steps, etc.) were used here as in Leong and Goswami (2015). Otherwise, please indicate what was changed (and why).

Response: Thank you for the comment. This study used the same methodologies as the previous study by Leong and Goswami (2015). We have now stated this in the Methods section as follows.

“This study used the same methodologies and parameters as a previous study based on CDS by Leong and Goswami (2015) (for wiki, please see https://www.cne.psychol.cam.ac.uk).” (p. 23).

Comment 3:

- Thanks also for adding the various appendices for further information. Regarding appendix S1, could you clarify also the average duration across the different musical pieces used? I guess “Duration (minutes)” currently just indicates the total duration of all pieces together? Or are all pieces listed underneath played in their entirety ? (no excerpts chosen). Thanks.

Response: We thank the reviewers for the pertinent comments. All pieces listed are played in their entirety (no excerpts were chosen). As suggested, we have now reported the average duration as well in S1 Appendix, as follows.

1) Single instrument materials

Instrument | Representative composers | Duration (minutes) | Average duration (minutes) | # pieces
Piano | Beethoven, Mozart | 230 | 3.97 | 58
Cello | Bach, Ysaÿe | 173.8 | 3.78 | 46
Bass | Bach | 60.9 | 3.38 | 18
Viola | Bach, Hindemith | 320.8 | 4.65 | 69
Violin | Bach, Ysaÿe | 206.5 | 4.39 | 47
Guitar | Bach, Sanz | 218.4 | 3.36 | 65

2) Ensemble recordings

Genre | Composers or performers | Duration (minutes) | Average duration (minutes) | # pieces
Symphony | Bach, Mozart, Beethoven | 546.3 | 7.19 | 76
Jazz | Miles Davis, Dave Brubeck | 212.2 | 5.89 | 36
Rock | The Beatles, U2 | 226.6 | 3.28 | 69
Children’s song (English) | various | 104.0 | 2.67 | 39
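The reported averages are per-piece means: dividing each corpus total by its number of pieces reproduces the "Average duration" column to two decimal places. A quick consistency check (our own sketch, using the figures above):

```python
# Verify that Average duration (minutes) == Duration (minutes) / # pieces
# for every corpus listed in S1 Appendix, rounded to 2 decimal places.
corpora = {
    "Piano":                     (230.0, 58, 3.97),
    "Cello":                     (173.8, 46, 3.78),
    "Bass":                      (60.9,  18, 3.38),
    "Viola":                     (320.8, 69, 4.65),
    "Violin":                    (206.5, 47, 4.39),
    "Guitar":                    (218.4, 65, 3.36),
    "Symphony":                  (546.3, 76, 7.19),
    "Jazz":                      (212.2, 36, 5.89),
    "Rock":                      (226.6, 69, 3.28),
    "Children's song (English)": (104.0, 39, 2.67),
}
for name, (total, n_pieces, avg) in corpora.items():
    assert round(total / n_pieces, 2) == avg, name
print("All averages consistent")
```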

Attachment

Submitted filename: Response_to_Reviewers.docx

Decision Letter 2

Yann Benetreau

22 Sep 2022

Hierarchical Amplitude Modulation Structures and Rhythm Patterns: Comparing Western Musical Genres, Song, and Nature Sounds to Babytalk

PONE-D-21-38111R2

Dear Dr. Daikoku,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yann Benetreau, PhD

Division Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I thank the authors for the changes and latest additions. The manuscript has been further improved and the changes will clarify the work for the readers.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Yann Benetreau

26 Sep 2022

PONE-D-21-38111R2

Hierarchical Amplitude Modulation Structures and Rhythm Patterns: Comparing Western Musical Genres, Song, and Nature Sounds to Babytalk

Dear Dr. Daikoku:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yann Benetreau

Staff Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Corpora of music, speech, and nature sounds.

    (DOCX)

    S2 Appendix. Signal processing steps in S-AMPH model.

    (DOCX)

    S3 Appendix. Signal processing steps in PAD model.

    (DOCX)

    S4 Appendix. Individual variation of PCA loadings in S-AMPH model and those of FFT in the PAD model.

    (DOCX)

    S5 Appendix. Individual variation of mutual information.

    (DOCX)

    S6 Appendix. Individual variation of PSI in each integer ratio.

    (DOCX)

    Attachment

    Submitted filename: Response_to_Reviewer_TD_UG_TD_UG_TD FINAL.docx

    Attachment

    Submitted filename: Response_to_Reviewers.docx

    Data Availability Statement

    All data files can be found at the following link: https://osf.io/6s8kp/. All original sound files are publicly available from the Figshare database: http://figshare.com/articles/SAMPH_CDS/1318572, DOI: 10.6084/m9.figshare.1318572. Please see the original article that used the speech data for more detailed information (Leong et al., 2015); for the related wiki, please see https://www.cne.psychol.cam.ac.uk. All original birdsong and nature sound files are available from https://www.xeno-canto.org/, https://mixkit.co/free-sound-effects/nature/, and https://www.zapsplat.com. The music and human song data are under copyright, but detailed information is provided in S1 Appendix.


    Articles from PLoS ONE are provided here courtesy of PLOS
