Abstract
Music has a pervasive tendency to rhythmically engage our body. In contrast, synchronization with speech is rare. Music’s superiority over speech in driving movement probably results from the isochrony of musical beats, as opposed to irregular speech stresses. Moreover, the presence of regular patterns of embedded periodicities (i.e., meter) may be critical in making music particularly conducive to movement. We investigated these possibilities by asking participants to synchronize with isochronous auditory stimuli (target) while music and speech distractors were presented at one of various phase relationships with respect to the target. In Exp. 1, familiar musical excerpts and fragments of children’s poetry were used as distractors. The stimuli were manipulated in terms of beat/stress isochrony and average pitch to achieve maximum comparability. In Exp. 2, the distractors were well-known songs performed with lyrics, performed on a reiterated syllable, and spoken lyrics, all having the same meter. Music perturbed synchronization with the target stimuli more than the speech fragments did. However, music’s superiority over speech disappeared when the distractors shared isochrony and the same meter. Music’s peculiar and regular temporal structure is likely to be the main factor fostering tight coupling between sound and movement.
Introduction
Music compels us to move. When spontaneously tapping our feet or swaying our body along with our preferred song, while dancing, or while performing synchronized sports (e.g., swimming), we entrain to the regular pulse and rhythm of music. This propensity to coordinate movement with music, mostly during group activities, transcends places and cultures [1,2]. Not surprisingly, music is perfectly suited to act as a coordinating device at the group level. Because of its communal character, synchronization with music is thought to foster social bonding [3–6], thus favoring socially rooted behaviors that are very distinctive of our species. Admittedly, some species such as crickets and fireflies provide spectacular examples of synchronization with external rhythmic stimulation in nature [7–9], and bird species that are vocal learners can, to a certain extent, couple their movements to a musical beat [10,11]. Nevertheless, humans exhibit unique flexibility in their ability to achieve synchrony with an external timekeeper [12–14].
From early infancy, humans show sensitivity to the rhythmic properties of auditory stimuli. They react to violations of repetitive timing patterns (i.e., meter [15–17]) and can code meter in auditory patterns via body movement [18]. Building on this precocious ability to extract regular temporal patterns (e.g., the underlying pulse), 2.5-year-old children start adjusting their movements to the beat of an auditory stimulus, in particular when interacting with a social partner [19,20]. This tie between movement and musical rhythm probably originates in early infant–mother interaction [21]. Coupling movement to an external auditory rhythm is supported by a dedicated neuronal network involving both subcortical areas (e.g., the basal ganglia and the cerebellum) and cortical regions (e.g., temporal cortex, premotor regions, and the Supplementary Motor Area) [22–24]. In sum, the pervasive tendency to couple movement to musical beats is a human trait with a defined neuronal substrate, and one which may have played an important role in the origin of music [1,25].
The ubiquity of synchronization with music contrasts with the lack of spontaneous motor synchronization with other complex auditory stimuli, such as spoken utterances. Speech, albeit featuring a rich rhythmic organization [26–29] and serving as an interpersonal communication device [30], is, unlike music, typically not well suited for synchronized movement. Thus, it is not surprising that there is a paucity of studies on synchronization with speech material. In the only study to date devoted to this question, high inter-tap variability (a coefficient of variation around 30% of the average inter-tap interval) was reported when participants synchronized with French and English spoken sentences [31]. This performance contrasts with the markedly lower coefficients of variation observed when synchronizing with music (around 4% of the inter-tap interval) [32]. One of the reasons for this difference may lie in the regularity of musical beats (i.e., isochrony) as opposed to speech stresses. Regular isochronous beats are a universal property of music, defining its rhythm [33,34] and affording a synchronized motor response [35–37]. Beat perception is supported and reinforced by the properties of the musical structure, characterized by temporal patterns with multiple embedded periodicities [37,38]. These patterns result in the perception of predictable sequences of strong (accented) and weak beats. For example, different meters distinguish marches (characterized by a strong-weak binary pattern) from waltzes (with a strong-weak-weak ternary pattern).
Stresses in speech similarly evoke a subjective impression of isochrony [39]. Yet the notion of periodicity in speech rhythm (e.g., in stress-timed and syllable-timed languages, like English and French, respectively [40,41]) has not found empirical support, at least in the case of conversational speech [28,42–44]. Inter-stress-intervals are typically highly variable in speech, with coefficients of variation greater than 30% of the average inter-stress-interval [31,43]. These values are larger than the variability of inter-beat-intervals observed in expressively performed music, which shows coefficients of variation between 10% and 30% [45]. Moreover, metrical phonology, in analogy with metrical approaches to rhythm in Western music [46], has proposed a hierarchical metrical structure in speech, based on the rhythmic prominence of linguistic units (i.e., syllables, words, and phrases) [26,47]. Nevertheless, meter in conversational speech is clearly less strict and regular (i.e., weaker) than musical meter (see [28] for a discussion). Higher regularity is found in poetry [48–50] and in group speech production such as prayers and chanting (i.e., choral speaking [51]). These manifestations of speech can be generally referred to as “metrical speech”. In sum, apart from metrical speech, speech mostly lacks fundamental rhythmic properties that are commonly present in music, such as a predictable regular beat and metrical structure, and that are needed to drive synchronized movement.
Why does music typically have a stronger pull than speech on motor synchronization? Music, because of its greater temporal regularity, may be better suited than speech to recruit domain-general mechanisms responsible for extracting beat/stress patterns from a complex acoustic signal. Potential candidates for such mechanisms are cognitive processes supporting the entrainment of attention to the temporal properties of auditory sequences (e.g., musical beats or stress patterns in speech) [35], or lower-level mechanisms processing acoustic features relevant for rhythm perception (e.g., amplitude envelope rise time) [52]. This possibility entails that speech utterances displaying music-like temporal patterns (i.e., with a regular beat and metrical structure; for example, metrical speech) should attract movement as well as music does. Yet temporal regularity alone may not be sufficient to account for this effect. The alternative hypothesis is that beat extraction and synchronization to music require dedicated, music-specific processes. Indeed, additional cues inherent in the musical structure, such as pitch relationships, engage domain-specific processes and may also favor synchronization; melodic accents, for instance, are another source of our perception of meter. This possibility is in keeping with the joint accent structure hypothesis [53–57], which implies that musical rhythm results from a multilayered structure of relationships among features such as durations and pitch. A direct consequence is that music may still foster motor synchronization more than metrical speech, even though both share a regular temporal structure. These possibilities have not been examined so far. Moreover, evidence comparing speech and music with regard to synchronization is generally scant, in spite of its potential interest for clarifying whether beat/meter processing is supported by domain-specific or rather by general-purpose mechanisms.
To examine the role played by temporal regularity (i.e., beat isochrony and meter) in sensorimotor synchronization with music and speech, in the present study we conducted two experiments using a synchronized tapping task. Sensorimotor synchronization has mainly been examined by asking participants to tap their index finger in correspondence with isochronous stimuli, which lack the temporal complexity of music and natural speech (for reviews, see [24,58,59]). The tapping paradigm has been quite extensively applied to synchronization with music (e.g., [32,57,60]) and more recently to speech stimuli [31,61]. In our experiments we adopted a distractor paradigm [62,63]. Participants are asked to tap their finger along with an isochronous sequence (i.e., a metronome) while periodic distractors (e.g., another isochronous sequence, in the same or in a different modality) are presented at one of various temporal offsets [64–66]. Movement attraction by the distractor is reflected by a systematic modulation of the asynchronies (i.e., the relative phase) between the taps and the target sounds, and of the variability of these asynchronies. The magnitude of the systematic change in relative phase and its variability are indicative of the distractor’s degree of interference. The distractor paradigm was successfully adopted to show that rhythmic movement is attracted more strongly to auditory than to visual rhythms ([63], but see [64]). Moreover, it was shown that asynchrony is typically more negative in the presence of leading distractors and less negative (or more positive) in the presence of lagging distractors [63,65]. Since in this paradigm the distractors are to be ignored, their disrupting effect on synchronization indicates an irresistible tendency of the distractors to capture participants’ movement.
In the two experiments, the effects of music and metrical speech distractors on synchronization with a metronome were compared. The temporal structure of the spoken utterances (i.e., examples of metrical speech) was manipulated. These manipulations pertained to duration and were meant to enhance the temporal regularity of speech, so that the utterances progressively matched the music material in terms of beat/stress isochrony (Exp. 1) and the associated metrical structure (Exp. 2). In Exp. 1 we examined whether beat isochrony embedded in a speech stimulus is less effective in attracting movement than in a musical context. Music distractors were computer-generated fragments of familiar music. Speech distractors were spoken fragments of familiar children’s poetry, chosen for their regular stress pattern and regular metrical structure, and thereby their natural conduciveness to synchronized movement. Additional stimulus manipulations were carried out to attain maximum comparability between speech and music. Since in the original stimuli the pitch separation between the target sounds and the distractors was smaller for music than for speech, the distractors were equalized in terms of average fundamental frequency (pitch height). Average pitch height was controlled insofar as the phase correction mechanisms underlying distractor effects are not completely insensitive to pitch differences [66]. In another condition, even though the speech distractors already displayed very regular beats with minor deviations from isochrony, inter-stress-intervals were additionally manipulated to achieve perfect isochrony, as in the music distractors. If music has a greater pull than speech on motor synchronization exclusively because of beat isochrony, we predict that equalizing beat/stress isochrony should make the differences between the two domains disappear. In contrast, if domain-specific musical features play a role, music should still interfere more than metrical speech with synchronization to a target stimulus.
In Exp. 2, new participants were asked to synchronize with isochronous target stimuli while one of three types of distractors was presented in a distractor paradigm. Both music and speech distractors were derived from well-known songs, and were performed by a professional singer without accompaniment. Renditions of the songs performed with lyrics, sung on the repeated syllable /la/, or metrically spoken (lyrics only) were used as distractors. The stimuli were manipulated so that inter-beat-intervals and inter-stress-intervals were equally isochronous. Moreover, the duration of corresponding events (i.e., syllables and notes) occurring in between musical beats and linguistic stresses was equalized. Hence, speech and music distractors shared beat/accent isochrony as well as the same metrical structure. If factors beyond temporal regularity (e.g., pitch relationships) contribute to explaining music’s greater tendency to favor synchronized movement, music should still attract movement more than metrical speech.
EXPERIMENT 1
Materials and Methods
Participants
Three groups of native Polish-speaking students without formal musical training from the University of Finance and Management in Warsaw took part in the study in exchange for course credits: Group 1 (n = 38, 29 females, mean age = 24.8 years, range = 19-52 years, 36 right-handed and 2 left-handed), Group 2 (n = 30, 27 females, mean age = 22.6 years, range = 20-31, 29 right-handed and one left-handed), and Group 3 (n = 30, 25 females, mean age = 22.4 years, range = 19-39, 26 right-handed, 4 left-handed). None of the participants reported hearing disorders or motor dysfunction.
Material
We used a Target sequence formed by thirty-five 30-ms computer-generated tones of constant pitch (880-Hz sinusoids with a linear 17-ms down-ramp) and constant intensity, presented with an inter-onset interval (IOI) of 600 ms. The Music distractors were three computer-generated well-formed musical fragments of familiar music written in binary meter (i.e., circus music, “Sleighride”, and the Bee Gees’ “Stayin’ Alive”), including 29 to 33 musical beats (inter-beat-interval = 600 ms). Speech distractors were well-formed fragments with 28 to 32 stresses from three well-known excerpts of Polish children’s poetry („Pstryk” and „Lokomotywa” [67]; „Na straganie” [68]). „Pstryk” and „Na straganie” were written in a binary meter (i.e., every second syllable was stressed), „Lokomotywa” in a ternary meter (i.e., every third syllable was stressed). Note that Polish is usually described as a stress-timed language (with lexical stress occurring on the penultimate syllable). Yet the classification of Polish in terms of rhythm is still quite controversial, suggesting that it should be placed in between stress-timed and syllable-timed languages [93,94]. Speech fragments were recorded by an actor who was instructed to utter the sentences using adult-directed speech while synchronizing the speech stresses with the sounds of a metronome (IOI = 600 ms). The mean inter-stress-interval of the recorded speech fragments was 598 ms (SD = 66 ms), indicating that the actor was able to maintain the instructed speech rate. Distractor stimuli were all normalized to the same maximum intensity level. This condition is referred to as Original. Distractors’ familiarity was assessed by asking 32 additional students (28 females, mean age = 20.6 years, range = 20-28 years) to rate the distractors on a 10-point scale (1 = not familiar; 10 = very familiar). Music and speech distractors did not differ in terms of familiarity (mean rating = 6.7 for music and 7.0 for speech; t < 1).
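Although not part of the original materials, a minimal sketch of how such a target sequence can be generated is given below (Python with numpy/scipy; the placement of the 17-ms down-ramp at the end of each tone is an assumption):

```python
import numpy as np
from scipy.io import wavfile

SR = 44100        # sampling rate (Hz)
IOI = 0.600       # inter-onset interval (s)
TONE_DUR = 0.030  # tone duration (s)
RAMP_DUR = 0.017  # linear down-ramp (s); assumed to cover the tone's final 17 ms
N_TONES = 35
F0 = 880.0        # tone frequency (Hz)

def make_tone(freq, dur, ramp, sr):
    t = np.arange(int(dur * sr)) / sr
    tone = np.sin(2 * np.pi * freq * t)
    env = np.ones_like(tone)
    n_ramp = int(ramp * sr)
    env[-n_ramp:] = np.linspace(1.0, 0.0, n_ramp)   # linear down-ramp
    return tone * env

def make_target_sequence():
    seq = np.zeros(int(N_TONES * IOI * SR))
    tone = make_tone(F0, TONE_DUR, RAMP_DUR, SR)
    for i in range(N_TONES):
        onset = int(i * IOI * SR)
        seq[onset:onset + len(tone)] += tone
    return seq

wavfile.write("target.wav", SR, (0.8 * make_target_sequence()).astype(np.float32))
```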
In the Pitch condition, the music and speech stimuli were equalized in terms of average pitch, to ensure that this variable did not bias the tendency of the two distractors to pull synchronization. This manipulation was motivated by the observation in previous studies that the pitch of the distractor (e.g., in isochronous sequences) can affect synchronization with a target sound. For example, low distractors tend to exert stronger attraction than high distractors ([66], but see Exp. 2 of the same study for the opposite effect). Music distractors were manipulated so that their average fundamental frequency (130.6 Hz, SD = 17.6 Hz) was comparable to the average fundamental frequency of the speech distractors (130.0 Hz, SD = 30.0 Hz). Fundamental frequency was computed with Praat software [69] using autocorrelation [70]. This manipulation was achieved by transposing each musical excerpt as a whole so that its average fundamental frequency matched that of the speech distractors. Pitch range and intervals were not manipulated. In the Pitch+Timing condition, the speech distractors were additionally manipulated with Audition 1.5 software (Adobe, Inc.) to obtain exact 600-ms inter-stress-intervals. The times of occurrence of speech stresses were estimated by listening to the stimuli and by concurrent visual inspection of the sounds’ waveform and spectrogram using Praat. This manipulation (8.7% of the inter-stress-interval, on average), obtained by linearly stretching or compressing the waveform between two subsequent speech stresses, did not engender unnatural or abrupt tempo changes (as judged by the experimenters), attesting that the inter-stress-intervals were already quite isochronous in the original stimuli. In a few cases, stretching segments of the waveform led to acoustic artifacts (e.g., clicks), which were manually removed. In addition, 6 nonmusicians were asked to rate all the stimuli in terms of naturalness on a scale from 1 (not artificial) to 6 (very artificial). After manipulation, music and speech distractors did not sound more artificial (mean ratings = 2.66 and 2.86, respectively) than in the Original condition (mean ratings = 2.21 and 2.51), as attested by non-parametric Wilcoxon tests.
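For illustration, the average-pitch matching described above can be approximated with Praat’s Python interface (parselmouth); the file names are placeholders and the pitch-extraction settings are Praat defaults rather than necessarily those used here:

```python
import numpy as np
import parselmouth  # Python interface to Praat

def mean_f0(path):
    """Average fundamental frequency (Hz) over voiced frames (Praat autocorrelation)."""
    pitch = parselmouth.Sound(path).to_pitch()     # default autocorrelation method
    f0 = pitch.selected_array['frequency']
    return f0[f0 > 0].mean()                       # ignore unvoiced (0 Hz) frames

# Placeholder file names for the three speech and one music distractor
speech_f0 = np.mean([mean_f0(f) for f in ["speech1.wav", "speech2.wav", "speech3.wav"]])
music_f0 = mean_f0("music1.wav")

# Transposition (in semitones) needed for the music excerpt to match the speech average
shift_semitones = 12 * np.log2(speech_f0 / music_f0)
print(f"transpose music by {shift_semitones:+.2f} semitones")
```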
In all conditions, after the first five sounds of the target sequence, the distractor was presented at a particular temporal separation from the sixth target sound. The first musical beat/speech stress of each distractor stimulus was determined by the experimenters by visual inspection of the waveform and by listening to the stimuli. Twenty relative temporal separations (i.e., “relative phases”, hereafter simply “phases”) were used. At phase 0, the sixth sound of the target sequence and the first musical beat or speech stress of the distractor occurred at the same time (for an example, see Figure 1; sound examples can be found at http://www.mpblab.vizja.pl/dallabella_et_al_plos1_stimuli.html). Target-distractor alignment at phase 0 led to comparable synchronization performance with musical and speech stimuli, as ascertained in a pilot experiment in which 17 nonmusicians, who did not take part in the main experiments, were asked to synchronize the movement of their index finger with the target sequence while music or speech distractors were presented at phase 0. The remaining 19 phases ranged from -50% of the IOI (-300 ms) to +45% of the IOI (+270 ms) in steps of 5% of the IOI (30 ms). Negative and positive phases indicate that the musical beats and speech stresses occurred before and after the target sounds, respectively (see Figure 1). Musical stimuli were generated with a Yamaha MidiRack synthesizer. Speech sequences were recorded with a Shure SM58 microphone onto a hard disk through a Fostex D2424 LV 24-Track Digital Recorder (sampling rate = 44.1 kHz). Stimulus manipulations were carried out on a PC-compatible computer.
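As an illustration of the stimulus timing just described (a sketch under the stated design; times in ms, with the first target tone at t = 0):

```python
IOI = 600  # ms

# Relative phases: -50% to +45% of the IOI in 5% steps (20 values, including 0)
phases = [p / 100 for p in range(-50, 50, 5)]   # -0.50, -0.45, ..., 0.45
offsets_ms = [p * IOI for p in phases]          # -300, -270, ..., +270

def distractor_start(first_beat_time_ms, phase):
    """Time at which the distractor file must start so that its first beat/stress
    falls at `phase` * IOI relative to the sixth target tone (at 5 * IOI)."""
    sixth_target = 5 * IOI   # sixth tone onset, given the first tone at t = 0
    return sixth_target + phase * IOI - first_beat_time_ms

# e.g., a distractor whose first beat occurs 120 ms into its file, at phase -25%:
print(distractor_start(120, -0.25))   # -> 2730 ms after the first target tone
```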
Procedure
Each group was assigned to one condition (i.e., Group 1 to the Original condition, Group 2 to the Pitch condition, and Group 3 to the Pitch+Timing condition). Participants, sitting in a quiet room in front of a computer monitor, were asked to tap the index finger of their dominant hand along with the sounds of the target sequence alone (Target only). In further tasks, they synchronized with the same target sequence while music distractors or speech distractors were presented. Participants were explicitly instructed to try to ignore the distractor. Targets and distractors were presented binaurally over Sennheiser eH2270 headphones at the same comfortable intensity level. Motor responses were recorded with 1-ms accuracy using a tapping pad built for the purpose of this experiment. The tapping pad provided auditory feedback at the time of the tap, due to the contact of the pad with the tabletop. The experiment was run with Presentation software (Neurobehavioral Systems, Inc.) on a PC-compatible computer. For each distractor type (i.e., music or speech) there were three blocks of trials, one for each of the three stimuli. Within each block, the target+distractor sequences at all phases were presented in random order. The Target only condition was performed twice, before the conditions with distractors. The order of the distractors (i.e., music or speech) and the order of the blocks were counterbalanced across subjects. The experiment lasted approximately one and a half hours.
Ethics statement
The study was approved by the Ethics Committee of the University of Finance and Management in Warsaw. Written informed consent was obtained from all participants.
Results and Discussion
Data were first analyzed to ensure that the perturbation of tapping due to music and speech distractors followed a pattern across phases comparable to the one observed in previous studies with simpler distractors (i.e., isochronous sequences [63]). To this aim, for each tapping trial, the signed time differences between the target sounds and the taps were computed (as in [63]). These differences are referred to as “signed asynchronies”. By convention, signed asynchrony is negative when the tap precedes the target sound, and positive when the tap is delayed. Mean signed asynchrony for each tapping trial was computed for data in the Target only and in the Original conditions, and submitted to the following analyses. The SD of the time differences between the target sounds and the taps was also calculated (SD of asynchrony), as a measure of synchronization variability.
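A minimal sketch of these two measures is given below (assuming one tap per target tone and nearest-target matching, which the text does not specify):

```python
import numpy as np

def signed_asynchronies(tap_times, target_times):
    """Signed asynchrony (ms) of each tap relative to its nearest target tone
    (negative = tap precedes the tone). Nearest-target matching is an assumption."""
    taps = np.asarray(tap_times, dtype=float)
    targets = np.asarray(target_times, dtype=float)
    nearest = targets[np.argmin(np.abs(taps[:, None] - targets[None, :]), axis=1)]
    return taps - nearest

def trial_summary(tap_times, target_times, skip=7):
    """Mean and SD of asynchrony for one trial, discarding taps to the first
    `skip` target tones (assumes taps are ordered, one per target tone)."""
    asyn = signed_asynchronies(tap_times, target_times)
    kept = asyn[skip:]
    return kept.mean(), kept.std(ddof=1)
```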
Five out of 98 participants were discarded based on the results obtained in the Target only condition: they produced fewer than 23 consecutive synchronized taps (80% of the maximum number of taps) and exhibited high variability (the SD of the asynchrony between target sounds and taps was larger than 10% of the IOI). Taps corresponding to the first 7 target sounds were not analyzed, as in [63]. Signed asynchrony in the Target only condition did not significantly differ across groups, indicating comparable synchronization accuracy (Group 1, signed asynchrony = -46.3 ms, SD of asynchrony = 34.8 ms; Group 2, asynchrony = -50.0 ms, SD = 32.9 ms; Group 3, asynchrony = -51.4 ms, SD = 30.8 ms).
Mean signed asynchronies were computed at each of the 20 phases. At phase 0 (baseline), where no interference was expected, signed asynchrony was negative and comparable across music and speech distractors (-46.9 ms with music distractors and -47.7 ms with speech distractors; t < 1). This confirms the anticipation tendency (i.e., mean negative asynchrony) typically observed in sensorimotor synchronization [71]. Data were aligned with respect to the baseline by subtracting the asynchrony at phase 0 (averaged separately for each distractor type) from the mean signed asynchronies obtained at all relative phases for the same distractor. Mean signed asynchrony with music and speech distractors in the Original condition is illustrated in Figure 2 as a function of the phase of the distractor. Zero signed asynchrony thus corresponds to the asynchrony obtained at phase 0. Negative signed asynchrony indicates that the distractor increased the negative asynchrony from the target stimulus with respect to the baseline; positive signed asynchrony indicates that the distractor reduced the negative asynchrony as compared to phase 0, and sometimes (i.e., at larger deviations) led to positive asynchrony. Both music and speech distractors affected synchronization beyond normal tapping variability, as attested by several points in the asynchrony curves falling outside the confidence interval (i.e., 0 ± standard error of asynchrony obtained in the Target only condition), represented in Figure 2 by horizontal dotted lines. This pattern of responses, showing the highest perturbation of synchronization around 20-30% of the IOI, is consistent with previous studies using isochronous sequences as distractors [63]. Hence, the distractor paradigm can be extended to more complex sequences, such as speech and music. Moreover, since both music and speech showed a similar perturbation profile across phases, the direction of asynchrony was not further considered in the following analyses, and data from different phases were merged before comparing the degree of perturbation caused by the two distractors (see below).
To measure the degree of perturbation induced by music and by speech, irrespective of the direction of the asynchrony (i.e., whether it was positive or negative), absolute asynchrony was computed. This measure, more parsimonious than relative asynchrony and more appropriate for quantifying synchronization error, was obtained by taking the absolute values of signed asynchrony at all phases (except phase 0, where the deviation was by definition 0) and computing their average. Mean absolute asynchrony for the Original, Pitch, and Pitch+Timing conditions and for music and speech distractors is reported in Figure 3. Absolute asynchronies were entered in a 3 (Condition) x 2 (Distractor) mixed-design ANOVA, considering subjects as the random variable. Condition (Original vs. Pitch vs. Pitch+Timing) was the between-subjects factor, and Distractor (music vs. speech) was the within-subjects factor. Data from phase 0 were not entered in the ANOVA, because absolute asynchrony in this case was always 0. Music distractors interfered with synchronization more than speech distractors, but this effect was not observed in all conditions, as attested by a significant Condition x Distractor interaction (F(2,90) = 4.67, p < .05). Music interfered more than speech in the Original condition (F(1,90) = 29.10, p < .001); in the Pitch condition, the difference between distractors was only marginally significant (F(1,90) = 2.99, p = .09). The two distractors did not differ in the Pitch+Timing condition. In addition, to obtain a measure of interference when the distractor preceded the target stimulus (i.e., leading) vs. when it followed the target stimulus (i.e., lagging), absolute asynchrony was averaged separately over all negative phases and over all positive phases. Leading distractors (asynchrony = 32.9 ms, SD = 19.4 ms) were more disruptive than lagging distractors (asynchrony = 28.3 ms, SD = 17.7 ms) only in the Pitch condition (t(29) = 2.26, p < .05).
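The 3 x 2 mixed-design ANOVA on absolute asynchrony could be reproduced, for instance, with the pingouin package (an assumption about tooling; the synthetic data below merely illustrate the data layout, not the actual results):

```python
import numpy as np
import pandas as pd
import pingouin as pg   # assumed statistics package; any mixed-ANOVA routine would do

# Build an illustrative long-format table: one row per participant x distractor,
# with the mean absolute asynchrony (ms) over phases -50%..+45% (phase 0 excluded).
rng = np.random.default_rng(0)
rows = []
for cond in ['original', 'pitch', 'pitch+timing']:
    for subj in range(30):                                   # illustrative group size
        sid = f'{cond}_{subj}'
        for distractor, mu in [('music', 34.0), ('speech', 27.0)]:   # illustrative means
            rows.append({'subject': sid, 'condition': cond, 'distractor': distractor,
                         'abs_asynchrony': rng.normal(mu, 8.0)})
df = pd.DataFrame(rows)

# 3 (Condition, between) x 2 (Distractor, within) mixed-design ANOVA
aov = pg.mixed_anova(data=df, dv='abs_asynchrony', within='distractor',
                     between='condition', subject='subject')
print(aov)
```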
SD of asynchrony was considered to assess whether music and speech distractors differentially affected synchronization variability. At phase 0 (baseline), this measure was larger with speech distractors (SD = 35.8 ms) than with music distractors (SD = 31.3 ms) (t(92) = 3.36, p < .01). Mean SD of asynchrony with music and speech distractors in the Original condition is reported in Figure 4 as a function of the relative phase of the distractor. The distractors induced more variability than observed when participants synchronized with the targets alone, as several points in the SD of asynchrony curves fell outside the confidence interval (i.e., mean ± SE of the SD of asynchrony in the Target only condition, indicated by the horizontal dotted lines). Mean SD of asynchrony for the Original, Pitch, and Pitch+Timing conditions and for music and speech distractors is reported in Figure 5. Data were entered in a 3 (Condition) x 2 (Distractor) mixed-design ANOVA. Greater variability of asynchrony was found with music distractors than with speech distractors across all conditions, as indicated by a main effect of Distractor (F(1,90) = 8.74, p < .01). Moreover, SD of asynchrony was progressively smaller in the Pitch and Pitch+Timing conditions as compared to the Original condition (main effect of Condition, F(2,90) = 3.13, p < .05). The Condition x Distractor interaction did not reach significance. The findings with absolute asynchrony and SD of asynchrony were replicated when the ANOVAs were run on the distractors with a binary meter only, namely the three music distractors and two of the speech distractors. Hence, meter differences across domains cannot account for the smaller perturbation effect of speech distractors. Finally, leading distractors (SD of asynchrony = 48.9 ms, SE = 2.0) were more disruptive than lagging distractors (SD of asynchrony = 44.4 ms, SE = 1.6) in all three conditions (t(92) = 5.94, p < .001). These differences between music and speech distractors were confirmed in two additional control experiments, in which 1) the original target sequence (tones) was replaced by a non-musical target sequence (n = 30), and 2) the intensity at each beat/speech stress was normalized to the same maximum intensity level for all distractors (n = 34). In these experiments music distractors still led to higher variability of asynchrony than speech distractors.
In sum, both music and speech distractors perturbed sensorimotor synchronization with the target sequence. Although the taps were attracted to both leading and lagging distractors, the effect was often larger when the distractor preceded the target, consistent with previous evidence [63,65]. Music disturbed synchronization with the target more than speech did in the Original condition and when average pitch was controlled (i.e., the Pitch condition). With the music distractor, participants were less accurate (i.e., they tapped farther from the target sounds) and more variable. Interestingly, when the beats/stresses of the two distractors were equally isochronous (i.e., the Pitch+Timing condition), music still led to increased variability. Yet, the discrepancy between speech and music in terms of absolute asynchrony was no longer visible. These findings generally indicate that beat isochrony affects synchronization differently depending on the context in which it is embedded. The fact that music kept disturbing synchronization more than speech even when all distractors shared beat/stress isochrony suggests that other factors may contribute to this difference between the two domains. In Exp. 1, the recurrent patterns of durations supporting an isochronous beat/stress (i.e., the metrical structure) were not fully comparable in music and speech distractors. In the music distractors, the events occurring in between isochronous beats were precisely timed, thus conferring a regular metrical structure on these stimuli. This was not true for the speech distractors, which exhibited a less regular metrical structure in spite of isochrony, even in the Pitch+Timing condition. Moreover, another potential confound is that the music was computer-generated whereas the speech stimuli were read by an actor. Thus, natural stimulus variability may have partly reduced the effectiveness of the speech distractor. In Exp. 2, speech and music distractors were manipulated so that they were comparable not only in terms of beat/accent regularity but also in terms of their metrical structure. As before, their distracting effect on sensorimotor synchronization with a target sequence was examined.
EXPERIMENT 2
Materials and Methods
Participants
Twenty-nine native Polish-speaking students (25 females) without formal musical training from the University of Finance and Management in Warsaw volunteered to participate in the experiment in exchange for course credits. They were 24.9 years old on average (range = 19-42 years); 28 were right-handed and one was left-handed. No participants reported hearing disorders or motor dysfunctions.
Material and procedure
The procedure and the apparatus were the same as in Exp. 1, and the same Target sequence was used. New distractor sequences were created based on two well-known Polish songs („Prząśniczka” and „Sto lat” [72]), both written in a binary meter (i.e., with beats occurring every second syllable). Two well-formed excerpts of the songs (see Figure 6), including 30 stresses for „Prząśniczka” and 28 stresses for „Sto lat”, were used to prepare three novel types of distractors (i.e., music-lyrics, music-syllable, and lyrics only distractors).
The lyrics corresponding to the two song fragments were read by a professional singer with 4 years of formal vocal training and 21 years of experience as a singer in a professional choir. The singer was instructed to produce speech stresses (i.e., every second syllable) every 600 ms, as indicated by a metronome. The metronome sounded through headphones prior to the performance and was turned off during the recording. The recorded speech fragments were additionally manipulated with Audition software (Adobe, Inc.) so that the onset intervals for syllables corresponding to musical notes in the song were the same as prescribed by the notation (with inter-stress-interval = 600 ms). As in Exp. 1, speech stresses were identified by listening to the stimuli and by visual inspection of the sounds’ waveform and spectrogram. Finally, the loudness of the spoken syllables was equalized by defining the segment around each syllable corresponding to a note in the score (i.e., from -50% of the IOI between the target note and the preceding one to +50% of the IOI between the target note and the following one), and normalizing the intensity of each segment to the same maximum intensity level. Whenever acoustic artifacts resulted from this procedure, they were manually removed. As in Exp. 1, the stimuli were rated in terms of naturalness by 6 nonmusicians on the same scale used in Exp. 1. In general, the speech distractors were rated as more artificial (mean rating = 4.91) than the music distractors (mean rating = 3.08). The stimuli thus prepared are referred to as “lyrics only” distractors.
Music distractors were prepared based on two MIDI piano versions of the fragments of „Prząśniczka” and „Sto lat” (inter-beat-interval = 600 ms), with average pitch height corresponding to the average fundamental frequency computed for the speech distractors (with Praat software [69]). The average fundamental frequency of the MIDI piano version was 278.3 Hz for „Prząśniczka” (compared to 271.7 Hz for the lyrics only distractor) and 313.6 Hz for „Sto lat” (306.1 Hz for the lyrics only distractor). The same professional singer was asked to sing the two fragments of „Prząśniczka” and „Sto lat” at the tempo provided by a metronome (IOI = 600 ms) through headphones, and at the pitch height indicated by the MIDI piano version of the fragments. The MIDI file and the metronome indicating musical beats were presented through headphones prior to the performance and were turned off during the recording. The singer performed the melody with lyrics (i.e., for music-lyrics distractors) and on the repeated syllable /la/ (i.e., for music-syllable distractors). The sung renditions were manipulated so that the note IOIs were the same as prescribed by the notation (with inter-beat-interval = 600 ms). The loudness of the notes was equalized to the same maximum intensity level used for the speech distractors. The distractors were further analyzed in terms of amplitude envelope rise times at the moment of the beat/stress, since this dimension is an important cue to speech rhythm [52]. Rise times were longer in the music distractors (on average, 289 ms for music-lyrics and 279 ms for music-syllable distractors) than in the lyrics only distractors (on average, 195 ms). The resulting more abrupt amplitude changes in the lyrics only distractors may have conveyed a stronger sense of rhythm.
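The rise-time analysis could be implemented along these lines (a sketch: the operationalization of rise time as the interval from the local envelope minimum preceding the beat to the following peak, and the 10-Hz envelope smoothing, are our assumptions; "distractor.wav" is a placeholder for a mono file):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, butter, filtfilt

def amplitude_envelope(x, sr, cutoff=10.0):
    """Low-pass-filtered Hilbert amplitude envelope (cutoff in Hz)."""
    env = np.abs(hilbert(x))
    b, a = butter(2, cutoff / (sr / 2), btype='low')
    return filtfilt(b, a, env)

def rise_time(env, sr, beat_time, window=0.4):
    """Time (s) from the local envelope minimum preceding the beat to the
    envelope peak following it, within +/- `window` s of the beat
    (one possible operationalization of rise time)."""
    i0 = max(int((beat_time - window) * sr), 0)
    i1 = min(int((beat_time + window) * sr), len(env))
    seg = env[i0:i1]
    peak = np.argmax(seg)
    trough = np.argmin(seg[:peak + 1])
    return (peak - trough) / sr

sr, x = wavfile.read("distractor.wav")          # placeholder file; assumed mono
env = amplitude_envelope(x.astype(float), sr)
print(rise_time(env, sr, beat_time=0.6))        # rise time around a beat at 600 ms
```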
As in Exp. 1, the target sequence was combined with the distractors to obtain 20 sequences with various phases, ranging from -50% to +45% of the IOI in 5% steps. There were three blocks, one for each distractor type. Each participant was first asked to synchronize with the target sequence alone (Target only). The target sequence was then presented with the music-lyrics, the music-syllable, and the lyrics only distractors (for sound examples, see http://www.mpblab.vizja.pl/dallabella_et_al_plos1_stimuli.html). Participants were instructed to tap the index finger of their dominant hand along with the sounds of the target sequence while trying to ignore the distractor. The order of distractors was counterbalanced across subjects.
Ethics statement
The study was approved by the Ethics Committee of the University of Finance and Management in Warsaw. Written informed consent was obtained from all participants.
Results and Discussion
Four out of 29 participants were discarded based on the same criteria adopted in Exp. 1. As observed in Exp. 1, signed asynchronies at phase 0 (baseline) were negative and comparable across the three types of distractor (-51.6 ms, with music-lyrics, -51.9 ms with music-syllable, and -55.4 ms with lyrics only; F < 1). Hence, data were aligned with respect to the baseline by subtracting average asynchrony obtained at phase 0 for each distractor type from signed asynchronies at all phases for the same distractor.
As in Exp. 1, the degree of perturbation caused by the different distractors was assessed by computing absolute asynchrony for the three distractor types (see Figure 7). These data were entered in a repeated-measures ANOVA, taking subjects as the random variable and Distractor (music-lyrics vs. music-syllable vs. lyrics only) as the within-subjects factor. Distractors had different effects on synchronization (F(2,46) = 5.55, p < .01). Bonferroni post-hoc comparisons indicated that music-lyrics distractors were less disturbing than lyrics only distractors (p < .05; see asterisk in Figure 7); performance with music-syllable and lyrics only distractors did not significantly differ. In addition, leading distractors (mean absolute asynchrony = 41.0 ms) were more disruptive than lagging distractors (asynchrony = 30.7 ms) (t(23) = 2.72, p < .05). SD of asynchrony was computed to compare the variability of synchronization accuracy with the different distractors. SD of asynchrony at phase 0 (baseline) was similar with the three distractors (30.8 ms with music-lyrics, 37.7 ms with music-syllable, and 32.5 ms with lyrics only). Mean SDs of asynchrony did not differ as a function of distractor type (47.0 ms, SE = 3.0, with music-lyrics; 48.3 ms, SE = 3.2, with music-syllable; 50.4 ms, SE = 3.0, with lyrics only). Moreover, no difference was observed between leading and lagging distractors.
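The repeated-measures ANOVA and the Bonferroni-corrected post-hoc comparisons reported above could be computed, for example, as follows (synthetic data for illustration only; the numbers do not reproduce the actual results):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from scipy.stats import ttest_rel

# Illustrative long-format data: 24 subjects x 3 distractor types
rng = np.random.default_rng(1)
distractors = ['music-lyrics', 'music-syllable', 'lyrics-only']
rows = [{'subject': s, 'distractor': d, 'abs_asynchrony': rng.normal(mu, 8.0)}
        for s in range(24)
        for d, mu in zip(distractors, [31.0, 35.0, 41.0])]   # ms, illustrative means
df = pd.DataFrame(rows)

# One-way repeated-measures ANOVA on absolute asynchrony
print(AnovaRM(df, depvar='abs_asynchrony', subject='subject',
              within=['distractor']).fit())

# Bonferroni-corrected post-hoc paired comparisons (3 comparisons)
wide = df.pivot(index='subject', columns='distractor', values='abs_asynchrony')
pairs = [('music-lyrics', 'lyrics-only'), ('music-syllable', 'lyrics-only'),
         ('music-lyrics', 'music-syllable')]
for a, b in pairs:
    t, p = ttest_rel(wide[a], wide[b])
    print(a, 'vs', b, f't = {t:.2f}, p(Bonferroni) = {min(p * len(pairs), 1.0):.3f}')
```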
Music (i.e., the music-lyrics and music-syllable distractors) did not cause greater disruption of synchronization accuracy (i.e., of the asynchrony between taps and pacing stimuli and its variability) than speech (i.e., the lyrics only distractor). Indeed, in one isolated case (when comparing mean absolute asynchrony with music-lyrics vs. lyrics only distractors), speech was even more disruptive than music. It is noteworthy that the lack of music superiority over speech in this experiment does not stem from reduced effectiveness of the music distractors in Exp. 2 as compared to Exp. 1. Music distractors perturbed synchronization similarly in the two experiments (in the music-lyrics and music-syllable conditions of Exp. 2, absolute asynchrony = 33.0 ms and SD of asynchrony = 47.7 ms; in Exp. 1, across conditions, asynchrony = 34.3 ms and SD of asynchrony = 50.3 ms). Instead, the observed difference between music and speech distractors results from the greater interference of the speech distractors in Exp. 2 (absolute asynchrony = 41.0 ms; SD of asynchrony = 50.4 ms) than in Exp. 1 (across conditions, asynchrony = 26.3 ms; SD of asynchrony = 46.8 ms).
To summarize, when isochrony and the underlying metrical structure were embedded in music and speech contexts alike, the two domains produced comparable distractor effects; this abolished music’s superiority over speech in capturing taps. Contrary to our expectations, this finding indicates that auditory stimuli displaying temporal regularity (i.e., an isochronous beat/stress supported by a regular metrical structure) attract movement similarly regardless of their domain (music or speech).
Conclusions
In this study we sought to examine whether temporal regularity, in terms of beat isochrony and meter, is uniquely responsible for the superiority of music over speech in favoring synchronized motor responses. Using a distractor paradigm, we observed that rhythmic movement is more strongly perturbed by musical beats than by speech stresses (Exp. 1). Participants had more difficulty tapping in time with a metronome, and were more variable, when music acted as a distractor than when speech was presented. Making speech more similar to music, by equalizing average pitch and beat/stress isochrony, reduced the discrepancy between the two domains. When both average pitch and isochrony were controlled, speech attracted movement as much as music did (i.e., participants exhibited similar tapping accuracy in the two conditions). Yet, tapping was still more variable with music than with speech in that condition. Hence, both temporal regularity (i.e., the variability of the interval between musical beats/speech stresses) and pitch contributed to explaining music’s superiority in fostering synchronized movement. This finding is in keeping with some evidence that the pitch separation between target and distractor sequences may affect the tendency of the distractor to capture movement. Unfortunately, results on the role of pitch separation in the perturbation observed with distractor paradigms are quite inconsistent (see [24] for a review). Yet, in some cases, capture of movement by the distractor has been found more often when the pitch separation between target and distractors is smaller ([65], Experiments 3 and 4; but see [73] for negative results). Note that in Exp. 1, by equalizing music and speech in terms of average pitch, we reduced the pitch separation between the target and the music distractor. This manipulation led to a smaller difference between music and speech distractors. In sum, this finding points to a more important role of pitch separation in explaining distractor effects, at least with more complex stimulus material, than suggested by previous studies.
This discrepancy between the two domains completely disappeared when music and metrical speech shared an isochronous beat/stress structure supported by the same meter (Exp. 2). Speech, in some situations, can be as metrical as music, and can thus similarly favor synchronized movement. In addition, a particularly puzzling finding is that spoken lyrics alone perturbed synchronized tapping more than sung lyrics. A possible explanation for this finding lies in basic acoustical differences between music and highly metrical speech, such as the stimuli used in Exp. 2. The spoken sentences were uttered in a declamatory style, reminiscent of solfege, with the purpose of conveying a clear rhythmical structure. This speaking style is likely to have enhanced acoustical features of the speech signal which are particularly critical for extracting its rhythmical structure. One such feature is amplitude envelope rise time [52,74,75]. Indeed, the speech distractors exhibited shorter rise times at speech stresses (i.e., more abrupt changes of the amplitude envelope) than the music distractors. As a result, participants may have been primed to pay particular attention to the rhythmical properties of the spoken material, leading to greater interference. Finally, the lowered stimulus naturalness resulting from the manipulation of the speech material may also have contributed to enhancing the distracting effect of the speech stimuli.
Which mechanisms are responsible for the observed effects? The ubiquitous tendency of music to favor movement may suggest, prima facie, that music engages domain-specific mechanisms subserving beat entrainment and motor synchronization. This view implies that musical beats should attract taps more than any other kind of auditory stimulus having the same metrical complexity and regularity. This was not the case in the present study. Our findings rather point toward an account in which music taps domain-general mechanisms for beat extraction from a complex acoustic signal. These mechanisms would be similarly engaged by metrical speech, because of its regular stress pattern and hierarchy of embedded periodicities. This possibility is in line with previous suggestions that similar processes support meter perception in speech and music [29,76,77]. Note, however, that conversational speech does not usually share these rhythmical features with metrical speech. Thus, timing in music and conversational speech may still not be governed by the same shared mechanism. Rhythm-based prediction may be supported by quite different processes in music and conversational speech, with only the former relying on isochrony and embedded periodicities [28].
Different general-purpose mechanisms can account for movement attraction by distractors. Successful sensorimotor synchronization requires error correction (i.e., phase and period correction [59]). In the absence of error correction, the taps would at some point drift away from the target events, due to error accumulation over time [73]. Error correction has been modeled as a linear process, in which the time of occurrence of the tap is corrected by a constant proportion of the asynchrony between the previous tap and the previous target stimulus [78–80]. The attraction of taps to periodic distractors (i.e., phase attraction) is mediated by such correction mechanisms (e.g., phase correction [65]). Studies with the distractor paradigm indicate that the attraction of taps to distractors is mostly due to temporal integration phenomena [66]. This hypothesis implies that interference depends on the absolute temporal separation between the distractors and the target sounds. Distractors occurring in the vicinity of target stimuli (within a fixed temporal window of around 120 ms) would tend to be perceptually integrated with the target, thus affecting error correction and eventually disrupting synchronization [24,63]. This account is not incompatible with cross-modal distractor effects [61] and is in agreement with other findings of auditory dominance in cross-modal perception of timing [81,82]. The results obtained in the present study, showing a maximum effect of the distractor around 100 ms (e.g., see Figure 2), are consistent with the previously discussed perceptual integration interval. Nonetheless, the differences between speech and music distractors observed in Exp. 1 are difficult to reconcile with the perceptual integration hypothesis, unless we postulate different windows for temporal integration depending on the stimulus domain or, more generally, for stimuli with different degrees of metrical regularity.
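A minimal simulation of the linear (first-order) phase-correction process described above is sketched below (parameter values are arbitrary and for illustration only):

```python
import numpy as np

def simulate_tapping(n_taps=30, ioi=600.0, alpha=0.5, motor_sd=10.0, seed=0):
    """Linear phase-correction model: each inter-tap interval equals the target
    IOI minus alpha times the preceding tap-target asynchrony, plus timing noise.
    Returns the asynchronies (ms); a sketch of the first-order model in the text."""
    rng = np.random.default_rng(seed)
    targets = np.arange(n_taps) * ioi
    taps = np.empty(n_taps)
    taps[0] = targets[0] + rng.normal(0, motor_sd)
    for n in range(1, n_taps):
        asyn = taps[n - 1] - targets[n - 1]
        taps[n] = taps[n - 1] + ioi - alpha * asyn + rng.normal(0, motor_sd)
    return taps - targets

print(np.std(simulate_tapping(alpha=0.5)))   # with correction: asynchronies stay bounded
print(np.std(simulate_tapping(alpha=0.0)))   # without correction: asynchronies drift
```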
An alternative hypothesis is that targets and distractors independently attract the taps (i.e., in the absence of temporal integration). According to this view, the attraction of taps to distractors depends on relative phase (i.e., with maximum interference occurring at a variable temporal separation between target and distractors, as a function of stimulus rate), rather than on absolute temporal separation. Explanations compatible with this general hypothesis come from dynamical systems accounts. For example, these accounts have been successful in modeling bimanual coordination with different sequences at varying relative phases [83,84]. In these studies relative phase, more than a fixed temporal integration window, predicts the amount of phase attraction between two sequences performed bimanually. The role of relative phase was not corroborated in unimanual distractor studies, though, at least for in-phase synchronization (see [66] for a discussion). Another possibility, also stemming from the dynamical systems approach, relies on the idea that tapping along with an isochronous pacing stimulus requires the synchronization of an internal attentional rhythm (e.g., an internal oscillation) with the target stimuli (i.e., by entrainment; Dynamic Attending Theory [85,86]). After the presentation of a few isochronous stimuli, the internal attentional oscillation adapts to the temporal structure of the pacing sequence so that attentional pulses (i.e., maximum attentional energy) soon coincide with the times of occurrence of the target stimuli. Distractors are likely to compete with the target pacing stimuli in attracting listeners’ attention (i.e., the internal oscillation), in particular when the asynchrony between the two is small, thus leading to greater perturbation. It is relevant to this discussion that models including more than one attentional oscillator [35,36,85] can track metrical temporal structures, like those observed in music. Metrical stimuli (i.e., having multiple periodicities) excite a set of coupled oscillators with embedded periodicities, which, owing to the coupling, are drawn into a stable relationship with each other. This results in lower variability of the oscillator at the beat period than with simple isochronous sequences; moreover, the stronger (and the more regular) the metrical structure, the lower the variability of the oscillator at the beat period [35,87]. This theory can account for the differences between the effects of music and metrical speech observed in Exp. 1. Indeed, a stimulus with a stronger and more regular metrical structure (e.g., music) is likely to excite listeners’ internal oscillations more than a stimulus with an isochronous beat but without a regular metrical structure (e.g., speech), eventually leading to stronger phase attraction. One problem with this account lies in its reliance on relative phase, though. Indeed, as mentioned above, the interference observed in in-phase unimanual synchronization depends on temporal integration mechanisms occurring within a fixed temporal window around the target stimulus. Yet, the possibility that relative phase plays a role in phase attraction has been suggested by results with anti-phase synchronization ([66], Exp. 3). Moreover, pacing stimuli and distractors are likely to attract the taps independently, but within a restricted range of attraction [66]. An examination of these possibilities awaits further research.
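As a rough illustration of the entrainment idea, the toy sketch below implements a single phase oscillator pulled by both target and distractor onsets; this is not the Large and Jones model, and the coupling strengths are arbitrary:

```python
import numpy as np

def entrain(target_onsets, distractor_onsets, period=600.0,
            k_target=0.3, k_distractor=0.15, n_steps=20000, dt=1.0):
    """Toy phase oscillator (period in ms) nudged toward zero phase at every
    target and distractor onset, in proportion to the coupling strengths."""
    phase = 0.0                  # oscillator phase in [0, 1)
    expected = []                # times at which the oscillator predicts a beat
    for step in range(n_steps):
        t = step * dt
        phase = (phase + dt / period) % 1.0
        for onsets, k in ((target_onsets, k_target), (distractor_onsets, k_distractor)):
            if np.any(np.abs(onsets - t) < dt / 2):
                err = ((phase + 0.5) % 1.0) - 0.5   # phase error, wrapped to [-0.5, 0.5)
                phase = (phase - k * err) % 1.0
        if phase < dt / period:                     # zero crossing -> expected beat
            expected.append(t)
    return np.array(expected)

targets = np.arange(35) * 600.0
distractor = targets[5:] + 150.0          # distractor lagging by 25% of the IOI
beats = entrain(targets, distractor)
offsets = [b - targets[np.argmin(np.abs(targets - b))] for b in beats]
print(np.mean(offsets[5:]))               # expected beats drawn toward the lagging distractor
```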
In sum, music is intimately tied to movement. Humans appear to be naturally endowed to move to the beat of music more than to the rhythm of speech. This propensity to entrain to the beat of music mainly results from isochronous beats supported by a regular temporal structure characterized by multiple periodicities. This is what makes music (sometimes irresistibly) conducive to synchronized movement. Metrical speech manipulated so as to achieve similar temporal regularity can attract movement just as well. In contrast, conversational speech, lacking this degree of temporal regularity, is not well suited for driving synchronized movement. The extraction of metrical properties from the auditory signal during sensorimotor synchronization engages both auditory and dorsal premotor areas of the brain [88,89]. In particular, the dorsal premotor cortex is likely to be crucial for auditory-motor integration in the analysis of complex temporal organizations [23]. Music, because of its peculiar and regular beat and metrical structure, is likely to uniquely engage the brain circuitry underlying sensorimotor integration, thus favoring a tight coupling between sound and movement.
Further studies are needed to examine whether the observed differences between music and speech extend across a variety of music and speech stimuli, and whether this effect covaries with musical expertise. For example, some musical genres (e.g., pop or rock music), given their prominent metrical structure, are likely to differ from speech more than others (e.g., instrumental Renaissance music) in their tendency to foster synchronized movement. Moreover, speech stimuli typically associated with choral speech (e.g., prayers), because of their temporal regularity, should be akin to music in attracting movement. Finally, the distractor paradigm adopted here is likely to be useful for testing rhythm processing, in particular in individuals exhibiting poor synchronization [90–92].
Acknowledgments
We are grateful to Bruno Repp and two anonymous reviewers, who offered many valuable comments on a previous version of the manuscript.
Funding Statement
Funding was provided by an International Reintegration Grant (n. 14847) from the European Commission. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. McNeill WH (1995) Keeping together in time: Dance and drill in human history. Cambridge, MA: Harvard University Press. [Google Scholar]
- 2. Nettl B (2000) An ethnomusicologist contemplates universals in musical sound and musical culture. In: Wallin NL, Merker B, Brown S. The origins of music. Cambridge, MA: MIT Press; pp. 463–472. [Google Scholar]
- 3. Benzon WL (2001) Beethoven’s anvil: Music in mind and culture. New York: Basic Books. [Google Scholar]
- 4. Hove MJ, Risen JL (2009) It’s all in the timing: Interpersonal synchrony increases affiliation. Soc Cogn 27(6): 949-961. doi:10.1521/soco.2009.27.6.949. [Google Scholar]
- 5. Phillips-Silver J, Aktipis CA, Bryant GA (2010) The ecology of entrainment: Foundations of coordinated rhythmic movement. Mus Percept 28(1): 3-14. doi:10.1525/mp.2010.28.1.3. PubMed: 21776183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Wallin NL, Merker B, Brown S (2000) The origins of music. Cambridge, MA: MIT Press. [Google Scholar]
- 7. Buck J, Buck E (1968) Mechanism of rhythmic synchronous flashing of fireflies. Fireflies of Southeast Asia may use anticipatory time-measuring in synchronizing their flashing. Science 159: 1319-1327. doi:10.1126/science.159.3821.1319. PubMed: 5644256. [DOI] [PubMed] [Google Scholar]
- 8. Buck J (1988) Synchronous rhythmic flashing of fireflies. II. Q Rev Biol 63: 265-289. doi:10.1086/415929. PubMed: 3059390. [DOI] [PubMed] [Google Scholar]
- 9. Merker B (2000) Synchronous chorusing and human origins. In: Wallin NL, Merker B, Brown S. The origins of music. Cambridge, MA: MIT Press; pp. 315-327. [Google Scholar]
- 10. Patel AD, Iversen JR, Bregman MR, Schulz I (2009) Experimental evidence for synchronization to a musical beat in a nonhuman animal. Curr Biol 19(10): 827-830. doi:10.1016/j.cub.2009.03.038. PubMed: 19409790. [DOI] [PubMed] [Google Scholar]
- 11. Schachner A, Brady TF, Pepperberg IM, Hauser MD (2009) Spontaneous motor entrainment to music in multiple vocal mimicking species. Curr Biol 19(10): 831-836. doi:10.1016/j.cub.2009.03.061. PubMed: 19409786. [DOI] [PubMed] [Google Scholar]
- 12. McDermott J, Hauser MD (2005) The origins of music: Innateness, uniqueness, and evolution. Mus Percept 23(1): 29-59
- 13. Patel AD (2006) Musical rhythm, linguistic rhythm, and human evolution. Mus Percept 24: 99-104. doi:10.1525/mp.2006.24.1.99. [Google Scholar]
- 14. Merker BJ, Madison GS, Eckerdal P (2009) On the role and origin of isochrony in human rhythmic entrainment. Cortex 45: 4-17. doi:10.1016/j.cortex.2008.06.011. PubMed: 19046745. [DOI] [PubMed] [Google Scholar]
- 15. Bergeson TR, Trehub SE (2006) Infants’ perception of rhythmic patterns. Music Percept 23: 345-360. doi:10.1525/mp.2006.23.4.345. [Google Scholar]
- 16. Hannon EE, Trehub SE (2005) Metrical categories in infancy and adulthood. Psychol Sci 16: 48-55. doi:10.1111/j.0956-7976.2005.00779.x. PubMed: 15660851. [DOI] [PubMed] [Google Scholar]
- 17. Winkler I, Háden GP, Ladinig O, Sziller I, Honing H (2009) Newborn infants detect the beat of music. Proc Natl Acad Sci USA 106(7): 2468-2471. doi:10.1073/pnas.0809035106. PubMed: 19171894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Phillips-Silver J, Trainor LJ (2005) Feeling the beat in music: Movement influences rhythm perception in infants. Science 308: 1430. doi:10.1126/science.1110922. PubMed: 15933193. [DOI] [PubMed] [Google Scholar]
- 19. Kirschner S, Tomasello M (2009) Joing drumming: Social context facilitates synchronization in preschool children. J Exp Child Psychol 102(3): 299-314. doi:10.1016/j.jecp.2008.07.005. PubMed: 18789454. [DOI] [PubMed] [Google Scholar]
- 20. Provasi J, Bobin-Bègue A (2003) Spontaneous motor tempo and rhythmical synchronisation in 2½- and 4-year-old children. Int J Behav Dev 27: 220-231. doi:10.1080/01650250244000290.
- 21. Dissanayake E (2000) Antecedents of the temporal arts in early mother–infant interaction. In: Wallin NL, Merker B, Brown S. The origins of music. Cambridge, MA: MIT Press; pp. 389-410.
- 22. Wing AM (2002) Voluntary timing and brain function: An information processing approach. Brain Cogn 48: 7-30. doi:10.1006/brcg.2001.1301.
- 23. Zatorre RJ, Chen JL, Penhune VB (2007) When the brain plays music: Auditory-motor interactions in music perception and production. Nat Rev Neurosci 8: 547-558. doi:10.1038/nrn2152.
- 24. Repp BH (2005) Sensorimotor synchronization: A review of the tapping literature. Psychon Bull Rev 12(6): 969-992. doi:10.3758/BF03206433.
- 25. Mithen S (2006) The singing Neanderthals. Cambridge, MA: Harvard University Press.
- 26. Liberman M, Prince A (1977) On stress and linguistic rhythm. Ling Inq 8: 249-336.
- 27. Patel AD, Daniele JR (2003) An empirical comparison of rhythm in language and music. Cognition 87: B35-B45. doi:10.1016/S0010-0277(02)00187-7.
- 28. Patel AD (2008) Music, language, and the brain. New York: Oxford University Press.
- 29. Port RF (2003) Meter and speech. J Phon 31: 599-611. doi:10.1016/j.wocn.2003.08.001.
- 30. Auer P, Couper-Kuhlen E, Müller F (1999) Language in time: The rhythm and tempo of spoken interaction. New York: Oxford University Press.
- 31. Lidji P, Palmer C, Peretz I, Morningstar M (2011) Listeners feel the beat: Entrainment to English and French speech rhythms. Psychon Bull Rev 18(6): 1035-1041. doi:10.3758/s13423-011-0163-0.
- 32. Snyder JS, Krumhansl CL (2001) Tapping to ragtime: Cues to pulse finding. Mus Percept 18: 455-490. doi:10.1525/mp.2001.18.4.455.
- 33. Drake C, Bertrand D (2001) The quest for universals in temporal processes in music. Ann N Y Acad Sci 930: 17-27.
- 34. Stevens C, Byron T (2009) Universals in musical processing. In: Hallam S, Cross I, Thaut M. Oxford Handbook of Music Psychology. New York: Oxford University Press; pp. 14-23.
- 35. Large EW, Jones MR (1999) The dynamics of attending: How people track time-varying events. Psychol Rev 106(1): 119-159. doi:10.1037/0033-295X.106.1.119.
- 36. Large EW, Palmer C (2002) Perceiving temporal regularity in music. Cogn Sci 26: 1-37. doi:10.1207/s15516709cog2601_1.
- 37. London J (2004) Hearing in time: Psychological aspects of musical meter. New York: Oxford University Press.
- 38. Lerdahl F, Jackendoff R (1983) A generative theory of tonal music. Cambridge, MA: MIT Press.
- 39. Lehiste I (1977) Isochrony reconsidered. J Phon 5: 253-263.
- 40. Abercrombie D (1967) Elements of general phonetics. Chicago: Aldine.
- 41. Pike KL (1945) The intonation of American English. Ann Arbor: University of Michigan Press.
- 42. Arvaniti A (2009) Rhythm, timing and the timing of rhythm. Phonetica 66: 46-63. doi:10.1159/000208930.
- 43. Dauer RM (1983) Stress-timing and syllable-timing reanalyzed. J Phon 11: 51-62.
- 44. Roach P (1982) On the distinction between “stress-timed” and “syllable-timed” languages. In: Crystal D. Linguistic controversies: Essays in linguistic theory and practice in honour of FR Palmer. London: Edward Arnold; pp. 73-79.
- 45. Repp BH (1998) A microcosm of musical expression. I. Quantitative analysis of pianists’ timing in the initial measures of Chopin’s Etude in E major. J Acoust Soc Am 104: 1085-1100. doi:10.1121/1.423325.
- 46. Cooper GW, Meyer LB (1960) The rhythmic structure of music. Chicago: University of Chicago Press.
- 47. Selkirk EO (1984) Phonology and syntax: The relation between sound and structure. Cambridge, MA: MIT Press.
- 48. Lerdahl F (2001) The sounds of poetry viewed as music. Ann N Y Acad Sci 930: 337-354.
- 49. Lerdahl F (2003) The sounds of poetry viewed as music. In: Peretz I, Zatorre RJ. The cognitive neuroscience of music. Oxford: Oxford University Press; pp. 413-429.
- 50. Tillmann B, Dowling WJ (2007) Memory decreases for prose, but not for poetry. Mem Cogn 35(4): 628-639. doi:10.3758/BF03193301.
- 51. Cummins F (2009) Rhythm as an affordance for the entrainment of movement. Phonetica 66: 15-28. doi:10.1159/000208928.
- 52. Peelle JE, Davis MH (2012) Neural oscillations carry speech rhythm through to comprehension. Front Psychol 3: 320.
- 53. Boltz MG, Jones MR (1986) Does rule recursion make melodies easier to reproduce? If not, what does? Cogn Psychol 18: 389-431. doi:10.1016/0010-0285(86)90005-8.
- 54. Ellis RJ, Jones MR (2009) The role of accent salience and joint accent structure in meter perception. J Exp Psychol Hum Percept Perform 35(1): 264-280. doi:10.1037/a0013482.
- 55. Jones MR (1987) Dynamic pattern structure in music: Recent theory and research. Percept Psychophys 41: 621-634. doi:10.3758/BF03210494.
- 56. Jones MR (1993) Dynamics of musical patterns: How do melody and rhythm fit together? In: Tighe TJ, Dowling WJ. Psychology and music: The understanding of melody and rhythm. Hillsdale, NJ: Erlbaum; pp. 67-92.
- 57. Jones MR, Pfordresher PQ (1997) Tracking melodic events using joint accent structure. Can J Exp Psychol 51: 271-291. doi:10.1037/1196-1961.51.4.271.
- 58. Aschersleben G, Stenneken P, Cole J, Prinz W (2002) Timing mechanisms in sensorimotor synchronization. In: Prinz W, Hommel B. Common mechanisms in perception and action: Attention and Performance XIX. New York: Oxford University Press; pp. 227-244.
- 59. Repp BH (2006) Musical synchronization. In: Altenmüller E, Kesselring J, Wiesendanger M. Music, motor control, and the brain. Oxford: Oxford University Press; pp. 55-76.
- 60. Drake C, Jones MR, Baruch C (2000) The development of rhythmic attending in auditory sequences: Attunement, reference period, focal attending. Cognition 77: 251-288. doi:10.1016/S0010-0277(00)00106-2.
- 61. Villing RC, Repp BH, Ward TE, Timoney JM (2011) Measuring perceptual centers using the phase correction response. Atten Percept Psychophys 73(5): 1614-1629. doi:10.3758/s13414-011-0110-1.
- 62. Repp BH, Penel A (2002) Auditory dominance in temporal processing: New evidence from synchronization with simultaneous visual and auditory sequences. J Exp Psychol Hum Percept Perform 28: 1085-1099. doi:10.1037/0096-1523.28.5.1085.
- 63. Repp BH, Penel A (2004) Rhythmic movement is attracted more strongly to auditory than to visual rhythms. Psychol Res 68: 252-270.
- 64. Hove MJ, Iversen JR, Zhang A, Repp BH (2013) Synchronization with competing visual and auditory rhythms: Bouncing ball meets metronome. Psychol Res 77(4): 388-398. doi:10.1007/s00426-012-0441-0.
- 65. Repp BH (2003) Phase attraction in sensorimotor synchronization with auditory sequences: Effects of single and periodic distractors on synchronization accuracy. J Exp Psychol Hum Percept Perform 29: 290-309. doi:10.1037/0096-1523.29.2.290.
- 66. Repp BH (2004) On the nature of phase attraction in sensorimotor synchronization with interleaved auditory sequences. Hum Mov Sci 23: 389-413. doi:10.1016/j.humov.2004.08.014.
- 67. Tuwim J (1980) Wiersze dla dzieci [Poems for children]. Warszawa: Nasza Księgarnia.
- 68. Brzechwa J (1980) Brzechwa dzieciom [Brzechwa for children]. Warszawa: Nasza Księgarnia.
- 69. Boersma P, Weenink D (2008) Praat: Doing phonetics by computer [Computer program; accessed 15 July 2013]. Institute of Phonetic Sciences, University of Amsterdam. http://www.praat.org/.
- 70. Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, 17: 97-110.
- 71. Aschersleben G (2002) Temporal control of movements in sensorimotor synchronization. Brain Cogn 48(1): 66-79. doi:10.1006/brcg.2001.1304.
- 72. Biskupska M, Bruce D (1998) Łatwe piosenki na fortepian [Easy songs for piano]. Podkowa Leśna: Crescendo.
- 73. Repp BH (2006) Does an auditory distractor sequence affect self-paced tapping? Acta Psychol 121: 81-107. doi:10.1016/j.actpsy.2005.06.006.
- 74. Corriveau KH, Pasquini ES, Goswami U (2007) Basic auditory processing skills and specific language impairment: A new look at an old hypothesis. J Speech Lang Hear Res 50: 1-20.
- 75. Huss M, Verney JP, Fosker T, Mead N, Goswami U (2011) Music, rhythm, rise time perception and developmental dyslexia: Perception of musical meter predicts reading and phonology. Cortex 47(6): 674-689. doi:10.1016/j.cortex.2010.07.010.
- 76. Patel AD, Iversen JR, Rosenberg JC (2006) Comparing the rhythm and melody of speech and music: The case of British English and French. J Acoust Soc Am 119: 3034-3047. doi:10.1121/1.2179657.
- 77. Marie C, Magne C, Besson M (2011) Musicians and the metric structure of words. J Cogn Neurosci 23: 294-305. doi:10.1162/jocn.2010.21413.
- 78. Vorberg D, Wing A (1996) Modeling variability and dependence in timing. In: Heuer H, Keele SW. Handbook of perception and action (Vol. 2). London: Academic Press; pp. 181-262.
- 79. Pressing J (1999) The referential dynamics of cognition and action. Psychol Rev 106: 714-747. doi:10.1037/0033-295X.106.4.714.
- 80. Vorberg D, Schulze H-H (2002) A two-level timing model for synchronization. J Math Psychol 46: 56-87. doi:10.1006/jmps.2001.1375.
- 81. Aschersleben G, Bertelson P (2003) Temporal ventriloquism: Crossmodal interaction on the time dimension. 2. Evidence from sensorimotor synchronization. Int J Psychophysiol 50: 157-163. doi:10.1016/S0167-8760(03)00131-4.
- 82. Morein-Zamir S, Soto-Faraco S, Kingstone A (2003) Auditory capture of vision: Examining temporal ventriloquism. Cogn Brain Res 17: 154-163. doi:10.1016/S0926-6410(03)00089-2.
- 83. Kelso JAS, Zanone PG (2002) Coordination dynamics of learning and transfer across different effector systems. J Exp Psychol Hum Percept Perform 28: 776-797. doi:10.1037/0096-1523.28.4.776.
- 84. Zanone PG, Kelso JAS (1997) Coordination dynamics of learning and transfer: Collective and component levels. J Exp Psychol Hum Percept Perform 23: 1454-1480. doi:10.1037/0096-1523.23.5.1454.
- 85. Jones MR (2009) Musical time. In: Hallam S, Cross I, Thaut M. The Oxford Handbook of Music Psychology. Oxford: Oxford University Press; pp. 81-92.
- 86. Jones MR (2010) Attending to sound patterns and the role of entrainment. In: Nobre C, Coull JT. Attention and time. New York: Oxford University Press; pp. 317-330.
- 87. Patel AD, Iversen JR, Chen Y, Repp BH (2005) The influence of metricality and modality on synchronization with a beat. Exp Brain Res 163: 226-238. doi:10.1007/s00221-004-2159-8.
- 88. Chen JL, Zatorre RJ, Penhune VB (2006) Interactions between auditory and dorsal premotor cortex during synchronization to musical rhythms. NeuroImage 32: 1771-1781. doi:10.1016/j.neuroimage.2006.04.207.
- 89. Chen JL, Penhune VB, Zatorre RJ (2008) Moving on time: Brain network for auditory-motor synchronization is modulated by rhythm complexity and musical training. J Cogn Neurosci 20(2): 226-239. doi:10.1162/jocn.2008.20018.
- 90. Dalla Bella S, Peretz I (2003) Congenital amusia interferes with the ability to synchronize with music. Ann N Y Acad Sci 999: 166-169. doi:10.1196/annals.1284.021.
- 91. Phillips-Silver J, Toiviainen P, Gosselin N, Piché O, Nozaradan S, et al. (2011) Born to dance but beat-deaf: A new form of congenital amusia. Neuropsychologia 49: 961-969. doi:10.1016/j.neuropsychologia.2011.02.002.
- 92. Sowiński J, Dalla Bella S (2013) Poor synchronization to the beat may result from deficient auditory-motor mapping. Neuropsychologia. doi:10.1016/j.neuropsychologia.2013.06.027.
- 93. Ramus F, Nespor M, Mehler J (1999) Correlates of linguistic rhythm in the speech signal. Cognition 73(3): 265-292. doi:10.1016/S0010-0277(99)00058-X.
- 94. Hayes B, Puppel S (1985) On the rhythm rule in Polish. In: van der Hulst H, Smith N. Advances in nonlinear phonology. Dordrecht: Foris Publications; pp. 59-81.