PLOS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

Nina Heins 1,2,#, Jennifer Pomp 1,2,#, Daniel S Kluger 2,3, Stefan Vinbrüx 4, Ima Trempler 1,2, Axel Kohler 2, Katja Kornysheva 5, Karen Zentgraf 6, Markus Raab 7,8, Ricarda I Schubotz 1,2,*
Editor: Alice Mado Proverbio
PMCID: PMC8298114  PMID: 34293800

Abstract

Auditory and visual percepts are integrated even when they are not perfectly temporally aligned with each other, especially when the visual signal precedes the auditory signal. This window of temporal integration for asynchronous audiovisual stimuli is relatively well examined in the case of speech, while other natural action-induced sounds have been widely neglected. Here, we studied the detection of audiovisual asynchrony in three different whole-body actions with natural action-induced sounds–hurdling, tap dancing and drumming. In Study 1, we examined whether audiovisual asynchrony detection, assessed by a simultaneity judgment task, differs as a function of sound production intentionality. Based on previous findings, we expected that auditory and visual signals should be integrated over a wider temporal window for actions creating sounds intentionally (tap dancing), compared to actions creating sounds incidentally (hurdling). While percentages of perceived synchrony differed in the expected way, we identified two further factors, namely high event density and low rhythmicity, that induced higher synchrony ratings as well. Therefore, we systematically varied event density and rhythmicity in Study 2, this time using drumming stimuli to exert full control over these variables, and the same simultaneity judgment task. Results suggest that high event density leads to a bias to integrate rather than segregate auditory and visual signals, even at relatively large asynchronies. Rhythmicity had a similar, albeit weaker effect, when event density was low. Our findings demonstrate that shorter asynchronies and visual-first asynchronies lead to higher synchrony ratings of whole-body action, pointing to clear parallels with audiovisual integration in speech perception. Overconfidence in the naturally expected, that is, synchrony of sound and sight, was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchronicity judgments simply because it makes the detection of audiovisual asynchrony more difficult. More studies using real-life audiovisual stimuli with varying event densities and rhythmicities are needed to fully uncover the general mechanisms of audiovisual integration.

Introduction

From simple percepts like the ticking of a clock to complex stimuli like a song played on a guitar–in our physical world we usually perceive visual and auditory components alongside each other. The multisensory nature of our world has many advantages–it increases the reliability of sensory signals [1] and helps us navigate noisy environments, e.g. when one of the senses is compromised [2]. On the other hand, multimodality poses a challenge to our brains. Percepts from different senses have to be monitored to decide whether they belong to the same event and have to be integrated or segregated.

The impression of unity, i.e. the feeling that percepts belong to the same event, depends on many factors [3], one of them being the temporal coincidence of stimuli. For instance, we usually perceive visual and auditory speech as occurring at the same time, although these signals differ both in their neural processing time (10 ms for auditory signals vs. 50 ms for visual signals) and their physical “travel time” (about 330 m/s for auditory signals vs. 300,000,000 m/s for visual signals). Indeed, there seems to be a temporal window for the integration of audiovisual signals (temporal binding window; e.g. [2, 4]). Although the much cited notion that visual speech naturally leads auditory speech [5] has been recently revised [6], the temporal binding window seems to favor the visual channel leading the auditory channel. This is reflected in audio-first asynchronies (where the auditory signal leads the visual signal) being detected at smaller delays than visual-first asynchronies (where the visual signal leads the auditory signal; e.g. [7]). Also, the so-called McGurk effect—an illusion where the perception of a visual speech component and a different auditory speech component leads to the perception of a third auditory component [8]—is prevalent for larger visual-first than audio-first asynchronies [2]. This effect is suggested to show that visual speech acts as a predictor for auditory speech [9]. Visual speech aids auditory speech recognition even when visual and auditory signals are asynchronous, up to the boundaries of the temporal binding window [10]. Consequently, a coherent perception can be maintained for relatively large temporal asynchronies [7].

Although generally asymmetric, the width of temporal binding windows depends on different stimulus properties. For instance, this width seems to be up to five times wider for speech signals compared to simple flash and beep stimuli [11], more symmetrical for speech [12] and generally wider for more complex stimuli [4]. Experience seems to shape the width of the temporal binding window as well: Musicians have narrower temporal binding windows [13] and the window can be widened when participants are continuously presented with asynchronous stimuli [11, 14].

Notably, research on audiovisual perception has so far focused on speech [2, 9, 15, 16], whereas other types of stimuli have been largely neglected [17]. There are only a few studies looking at the audiovisual perception of musical stimuli [18–20] and object-directed actions, e.g. a hammer hitting a peg [21], a soda can being crushed [22], hand claps and spoon taps [23], and a chess playing sequence [24]. These studies mostly find that non-speech sounds have a narrower temporal binding window than speech, i.e. asynchronies are detected at smaller temporal delays. This is explained by the more predictable moments of impact [24], which is also in line with better asynchrony detection for the more visually salient bilabial speech syllables [25].

Although audiovisual integration of our own and other people’s actions is omnipresent in our everyday life the same way speech is, it is not nearly as well explored. It is an open issue whether effects that have been observed for audiovisual integration in language and music generalize to the breadth of self-generated sounds we are familiar with [17]. Also, aberrant audiovisual integration in psychiatric diseases [26] and neurological impairments [27] may well apply beyond speech and music, and thus affect the perception and control of one’s own actions. To fully understand audiovisual integration, we need to consider this phenomenon in its entire range, from human-specific speech and music to sounds that we, like all animals, generate just by moving and contacting the environment.

The two studies we present here were motivated by the observation that speech and music are both actions that generate sounds intentionally. Moreover, both speech and musical sounds score particularly high on two further properties: event density and rhythmicity. Therefore, in order to examine the potential generalizability of audiovisual integration from these domains to other natural sound-inducing actions, we sought to determine whether incidentally action-induced sounds show comparable patterns of audiovisual integration to intentionally action-induced sounds (Study 1), and whether audiovisual integration is modulated by varying event density and rhythmicity (Study 2). In two recent fMRI studies, we observed that brain networks for processing intentionally produced sounds differ from those for incidentally produced action sounds. Interestingly, rather than triggering higher auditory attention, intentional sound production, more so than incidental sound production, encouraged predictive processes leading to the typical attenuation pattern in primary auditory cortex [28, 29].

Study 1

In the first study, we used two types of non-speech auditory stimuli created by whole-body actions, namely hurdling and tap dancing. We decided to use two different types of sporting action that allowed us to study the processing of natural movement sounds in an ecologically valid context. This also had the particular advantage that the subjects’ attention was not directed toward either modality, since we created a completely natural perceptual situation. We applied a total of eight different asynchronies (ranging from a 400 ms lead of the auditory track to a 400 ms lead of the visual track) plus a synchronous condition in a simultaneity judgment task. In addition to using new, more ecologically valid stimuli, we examined the influence of the intentionality of sound generation. We intend to generate sounds by a tap dancing action (just like speaking, or playing a musical instrument, etc.), whereas sounds generated by a hurdling action (or by placing a chess piece on the board, for instance) are rather an incidental by-product of the action. Based on a previous study [28] demonstrating that the cerebral and behavioral processing of action-induced sounds significantly differs for intentionally and incidentally generated sounds, perceived audiovisual synchrony of tap dancing stimuli may yield similar effects as speech stimuli, and hurdling similar effects as object-directed actions.

Accordingly, we set out to test the following specific hypotheses: We expected shorter asynchronies to be generally perceived as synchronous more often than longer asynchronies (Hypothesis 1). Additionally, we expected visual-first asynchronies to be perceived as synchronous more often than corresponding audio-first asynchronies in both types of action (Hypothesis 2). Moreover, assuming that tap dancing is comparable to speech production in having a larger temporal binding window than incidentally produced action sounds, we expected that this synchrony bias vanishes for the longer delays in hurdling but persists for tap dancing (Hypothesis 3). This should manifest in significant differences between visual-first and audio-first delays in the larger delay types (i.e. 320–400 ms) in tap dancing, but not in hurdling.

Materials and methods–Study 1

Participants

The sample consisted of 22 participants (12 males, 10 females) with an age range from 20 to 32 years (M = 23.9, SD = 2.9), including only right-handers. We recruited only participants who had never trained in tap dancing or hurdling. Participants signed an informed consent explaining the procedure of the experiment and the anonymity of the collected data. Participants studying psychology received course credit for their participation. The study was approved by the Local Ethics Committee at the University of Münster, Germany, in accordance with the Declaration of Helsinki.

Stimuli

The stimuli used in this study stem from a previous fMRI study [28] and consisted of point-light displays (PLDs) of hurdling and tap dancing with their matching sounds (Fig 1A; see also Supplementary Material for exemplary videos). Note that tap dancing and hurdling share a basic property: all sounds generated by these actions are caused by foot-ground contact. Fourteen passive (retroreflective) markers were placed symmetrically on the left and right shoulders, elbows, wrists, hip bones, knees, ankles, and toes (over the second metatarsal head). Nine optical motion capture cameras (Qualisys Oqus 400 series) of the Qualisys Motion Capture System (https://www.qualisys.com; Qualisys, Gothenburg, Sweden) were used for kinematic measurements. Hurdling sounds were recorded using in-ear microphones (Soundman OKM Classic II); tap dancing sounds were recorded with a sound recording app on a mobile phone. The mobile phone was hand-held by a student assistant sitting about one meter behind the tap dancing participant.

Fig 1. Stimuli and task.

Fig 1

Screenshots of the stimuli used in (A.) Study 1 and (B.) Study 2. The lower panel (C.) shows a schema of the trial and required responses. Participants were presented with videos showing PLDs of hurdling, tap dancing, or drumming. Subsequently, they were asked to judge, in a dual forced choice setting, whether the audiovisual presentation was synchronous or not. In case of a negative decision, participants had to furthermore judge whether sound was leading video or vice versa.

After recording, PLDs were processed using the Qualisys Track Manager software (QTM 2.14), ensuring visibility of all 14 recorded point-light markers during the entire recording time. Sound data were processed using Reaper v5.28 (Cockos Inc., New York, United States). In a first step, stimulus intensities of hurdling and tap dancing recordings were normalized separately. In order to equalize the spectral distributions of both types of recordings, the frequency profiles of hurdling and tap dancing sounds were then captured using the Reaper plugin Ozone 5 (iZotope Inc., Cambridge, United States). Finally, the difference curve (hurdling minus tap dancing) was used by the plugin’s match function to adjust the tap dancing spectrum to the hurdling reference. PLDs and sound were synchronized, and the resulting videos were cut using Adobe Premiere Pro CC (Adobe Systems Software, Dublin, Ireland). All videos had a final duration of 5.12 seconds. Note that we employed the 0 ms lag condition as an experimental anchor point, being aware that an observer watching the actions from the camera’s distance would have experienced a very slight positive audio lag of about 14 ms. This time lag was the same for both the hurdling and tap dancing stimuli, so no experimental confound was induced. The final videos had a size of 640 x 400 pixels, a frame rate of 25 frames per second and an audio sampling rate of 44,100 Hz. Due to the initial distance between the hurdling participant and the camera system, the hurdling sounds were audible before the corresponding PLDs were fully visible. To offset this marked difference between hurdling and tap dancing stimuli in the visual domain, we employed a visual fade-in and fade-out of 1000 ms (25 frames) using Adobe Premiere, while the auditory track was presented without fading.
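The ~14 ms audio lag follows directly from the speed of sound. As a back-of-the-envelope check (a sketch in Python, not part of the authors' pipeline; the 4.8 m distance is our back-calculation from the reported lag, not a figure from the recording setup):

```python
# Sound travels at roughly 343 m/s (at ~20 degrees C), so a recording
# distance of a few meters delays the audio relative to the video.

SPEED_OF_SOUND_M_PER_S = 343.0  # approximate

def audio_lag_ms(distance_m: float) -> float:
    """Milliseconds sound needs to travel distance_m to the observer."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

# A lag of about 14 ms corresponds to a camera distance of roughly 4.8 m:
print(round(audio_lag_ms(4.8), 1))  # 14.0
```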

The stimulus set used here consisted of four hurdling and four tap dancing videos, each of which was presented at nine different asynchronies of the sound relative to the PLD (± 400 / 320 / 200 / 120 ms, and 0 ms), with negative values indicating that the audio track was leading the visual track (audio-first) and positive values indicating that the visual track was leading the audio track (visual-first), resulting in a total of 72 different stimuli (exemplary videos are provided in the Supplementary Material). Asynchrony sizes were chosen based on similar values used in previous studies (e.g. [22, 24]). Finally, prepared videos had an average length of 6 s.
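The stimulus crossing described above (eight videos x nine audio offsets = 72 stimuli) can be enumerated as follows; this is an illustrative sketch, and the video names are hypothetical placeholders rather than the authors' file names:

```python
from itertools import product

# Eight videos (four hurdling, four tap dancing) crossed with nine audio
# offsets: negative = audio-first, positive = visual-first, 0 = synchronous.
videos = ([f"hurdling_{i}" for i in range(1, 5)]
          + [f"tap_{i}" for i in range(1, 5)])
offsets_ms = [-400, -320, -200, -120, 0, 120, 200, 320, 400]

stimuli = list(product(videos, offsets_ms))
print(len(stimuli))  # 72
```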

A separate set of 40 hurdling and 40 tap dancing videos with a lag of 0 ms (synchronous) was used to familiarize participants with the synchronous PLDs. All stimuli had a duration of 4000 ms. Videos showed three hurdling transitions for the hurdling stimuli and a short tap dancing sequence for the tap dancing stimuli.

Acoustic feature extraction: Event density and rhythmicity

Core acoustic features of the 16 newly recorded drumming videos as well as the 8 original videos from Study 1 were extracted using the MIRtoolbox (version 1.7.2) for Matlab [30]. The toolbox first computes a detection curve (amplitude over time) from the audio track of each video. From this detection curve, a peak detection algorithm then determines the occurrence of distinct acoustic events (such as the sound of a single step). The number of distinct events per second quantifies the event density of a particular recording.

Acoustic events vary in amplitude, with accentuated events being louder than less accentuated ones. Therefore, we computed within-recording variance of the detection curve (normalized by the total number of events) to quantify to what extent each recording contained both accentuated and less accentuated events (see Fig 2): A recording with equally spaced, clearly accentuated events was defined as more rhythmic than a recording whose events are more or less equal in loudness (i.e., with low variation between events). An illustrative example of this approach is shown in S1 Fig. To allow comparison of rhythmicity across videos (independently of mean loudness), amplitude variability was computed as the coefficient of amplitude variation, i.e. the standard deviation of amplitude divided by its mean.
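The two measures can be approximated as follows. This is a simplified Python sketch of the pipeline described above, not the authors' Matlab/MIRtoolbox code: naive peak picking on an amplitude envelope stands in for the toolbox's onset detection, and the coefficient of amplitude variation quantifies rhythmicity; the envelope, sampling rate, and threshold are hypothetical.

```python
from statistics import mean, pstdev

def event_density(envelope, sr_hz, threshold):
    """Distinct acoustic events per second: count local maxima of the
    amplitude envelope that exceed `threshold` (a crude stand-in for
    MIRtoolbox's peak detection)."""
    peaks = [i for i in range(1, len(envelope) - 1)
             if envelope[i] > threshold
             and envelope[i] >= envelope[i - 1]
             and envelope[i] > envelope[i + 1]]
    return len(peaks) / (len(envelope) / sr_hz)

def rhythmicity(envelope):
    """Coefficient of amplitude variation (SD / mean): higher values mean
    clearly accentuated events, i.e. a more 'rhythmic' recording."""
    return pstdev(envelope) / mean(envelope)

# Synthetic 1 s envelope sampled at 100 Hz with four accentuated events:
env = [0.1] * 100
for i in (10, 35, 60, 85):
    env[i] = 1.0
print(event_density(env, 100, threshold=0.5))  # 4.0 events per second
```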

Fig 2. Auditory stimulus features, Study 1 and 2.

Fig 2

Left panel shows the event density measured in the videos showing hurdling (H) and tap dancing (T) (Study 1) and in the four sub-conditions of the drumming videos implementing combinations of high and low event density (D-, D+) and high and low rhythmicity (R+, R-) (Study 2). Each dot represents one recording. Right panel shows a measure of rhythmicity for the same set of recordings, operationalized as the variability of each recording’s amplitude envelope. Amplitude variation is shown as the coefficient of variation, i.e. the standard deviation of amplitude normalized by mean amplitude.

Assessment of motion energy (ME)

The overall motion energy for hurdling and tap dancing videos was quantified using Matlab (Version R2019b). For each video, the total amount of motion was quantified using frame-to-frame difference images for all consecutive frames. Difference images were binarized, classifying pixels with more than 10 units of luminance change as moving and all other pixels as not moving. Above-threshold (“moving”) pixels were then summed up for each video, providing its motion energy [31]. This approach yielded comparable levels for our experimental conditions, with a mean motion energy of 1189 for hurdling and 1220 for tap dancing (S2 Fig).
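The frame-differencing procedure can be sketched like this (a Python stand-in for the authors' Matlab routine; frames are represented as plain 2-D luminance arrays, and the threshold of 10 units matches the text):

```python
def motion_energy(frames, threshold=10):
    """Sum of 'moving' pixels across a video: a pixel counts as moving when
    its luminance changes by more than `threshold` between consecutive frames."""
    moving = 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                if abs(b - a) > threshold:
                    moving += 1
    return moving

# Two tiny 2x2 frames: only the luminance changes of 50 and 20 exceed 10.
frames = [[[0, 0], [0, 0]],
          [[50, 5], [0, 20]]]
print(motion_energy(frames))  # 2
```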

Procedure

The experiment was conducted in a noise-shielded and light-dimmed laboratory. Participants received a short instruction about the procedure of the experiment and signed the informed written consent before the experiment started. Participants were seated with a distance of approximately 75 cm to the computer screen. All stimuli were presented using the Presentation software (Neurobehavioral Systems Inc., CA). Headphones were used for the presentation of the auditory stimuli.

The experiment consisted of four blocks. The first block contained synchronous videos (0 ms lag) to familiarize participants with the PLD. To ensure their attention, participants were engaged in a cover task during this first block: They were asked to rate, by a dual forced-choice button press (male/female), the assumed gender of the person performing the hurdling or tap dancing action. There were no hypotheses concerning the gender judgment task and this part of the study was not analyzed any further.

Three blocks with the experimental task were presented thereafter. Within each of these blocks, all 72 stimuli (four hurdling and four tap dancing videos, each with nine different audiovisual asynchronies) were presented twice, resulting in 144 trials per block and 432 trials in total. A pseudo-randomization guaranteed that no more than three videos of the same delay type (audio-first vs. visual-first) were presented in a row, to prevent adaptation to one or the other. Additionally, we ensured that no more than two videos of the same asynchrony were presented in direct succession.
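One way to implement such a constrained pseudo-randomization is rejection sampling: reshuffle until both run-length constraints hold. The sketch below is an illustration under that assumption, not the authors' Presentation script; for simplicity it represents each trial as a (delay_type, asynchrony_ms) tuple and ignores the 0 ms trials, which belong to neither delay type.

```python
import random

def constrained_shuffle(trials, max_type_run=3, max_async_run=2, seed=None):
    """Shuffle (delay_type, asynchrony_ms) trials until no more than
    max_type_run trials of the same delay type and no more than
    max_async_run trials of the same asynchrony occur in a row."""
    rng = random.Random(seed)

    def longest_run(seq, key):
        best = run = 1
        for a, b in zip(seq, seq[1:]):
            run = run + 1 if key(a) == key(b) else 1
            best = max(best, run)
        return best

    order = list(trials)
    while True:  # rejection sampling; rarely needs many attempts
        rng.shuffle(order)
        if (longest_run(order, lambda t: t[0]) <= max_type_run
                and longest_run(order, lambda t: t[1]) <= max_async_run):
            return order
```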

A trial schema of the experimental task is given in Fig 1C. After presentation of each video (4000 ms) participants had to indicate whether they perceived the visual and auditory input as “synchronous” or “not synchronous”, pressing either the left key (for synchronous) or the right key (for not synchronous) on the response panel with their left and right index finger. If they decided that picture and sound were “not synchronous”, there was a follow-up question concerning the assumed order of the asynchrony (“sound first” or “picture first”, corresponding to the delay types audio-first and visual-first, respectively). We opted for a simultaneity judgment task rather than a temporal order judgment task, because simultaneity judgment tasks are easier to perform for participants and have a higher ecological validity [32]. Responses were self-paced, but participants were instructed to decide intuitively and as fast as possible. A 1000 ms fixation cross was presented in the middle of the screen before the next video started.

Experimental design

The study employed a three-factorial within-subjects design. The dependent variable was the percentage of trials perceived as synchronous. Trials with a reaction time above 3000 ms were discarded from the analyses. The first factor was action type with the factor levels hurdling and tap dancing. The different delays were generated by combinations of the factors asynchrony size (120 ms, 200 ms, 320 ms, 400 ms) and asynchrony type (audio-first, visual-first). Note that all delays where the auditory track was leading the visual track were labeled audio-first, while all delays where the visual track was leading the auditory track were labeled visual-first. For this analysis, we did not include the 0 ms lag (synchronous) condition, as it could not be assigned to either the audio-first or the visual-first condition. A 2 x 4 x 2 ANOVA was calculated.
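Computing the dependent variable from raw trial data can be sketched as follows (an illustrative Python aggregation with hypothetical field names; the 3000 ms cutoff and the exclusion of the 0 ms anchor condition follow the design described above):

```python
def percent_synchronous(trials, rt_cutoff_ms=3000):
    """Percentage of trials judged synchronous per
    (action, asynchrony_type, asynchrony_size) design cell.
    Each trial is a dict with keys action, async_type, async_size_ms,
    judged_sync (bool), and rt_ms."""
    cells = {}
    for t in trials:
        if t["rt_ms"] > rt_cutoff_ms or t["async_size_ms"] == 0:
            continue  # discard slow responses and the synchronous anchor
        key = (t["action"], t["async_type"], t["async_size_ms"])
        n_sync, n_total = cells.get(key, (0, 0))
        cells[key] = (n_sync + int(t["judged_sync"]), n_total + 1)
    return {k: 100.0 * s / n for k, (s, n) in cells.items()}
```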

Results—Study 1

Trials with response times that exceeded 3000 ms were excluded from the analyses (470 out of 9504). Mauchly’s test indicated that the assumption of sphericity was violated for asynchrony size (χ2(5) = 14.89, p = .011). Therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (ε = .72). Behavioral results are depicted in Figs 3 and 4.
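As a worked example of the correction applied here: Greenhouse-Geisser scales both degrees of freedom by the sphericity estimate ε. With the rounded ε = .72 this closely reproduces the reported F(2.2, 45.3); the small discrepancy in the error df reflects rounding of ε.

```python
# Greenhouse-Geisser correction: multiply both dfs by epsilon.
epsilon = 0.72                   # rounded estimate reported in the text
df_effect = 4 - 1                # asynchrony size has 4 levels
df_error = df_effect * (22 - 1)  # (levels - 1) * (N - 1), N = 22 participants

print(round(epsilon * df_effect, 2))  # 2.16  -> reported as 2.2
print(round(epsilon * df_error, 2))   # 45.36 -> 45.3 from the unrounded epsilon
```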

Fig 3. Main effects of audiovisual (a)synchrony ratings, Study 1.

Fig 3

Displayed are the mean percentages of trials perceived as synchronous, aggregated for the factors asynchrony size, asynchrony type, and action type. Error bars show standard deviations. Statistically significant differences (p < .001) are marked with asterisks.

Fig 4. Mean percentages of trials perceived as synchronous, Study 1.

Fig 4

Asynchronies (in ms) are displayed on the x-axis, with negative values indicating that the auditory channel preceded the visual channel (audio-first) and positive values indicating that the visual channel preceded the auditory channel (visual-first). Error bars show standard error of the mean. The upper panel shows all scores, fanned out for the level combinations of the factors asynchrony size, asynchrony type, and action type. The lower panel illustrates the significant action x asynchrony size x asynchrony type interaction.

The ANOVA revealed a main effect of asynchrony size (F(2.2,45.3) = 197.96, p < .001). As expected (Hypothesis 1), trials with the 120 ms asynchrony were rated as synchronous significantly more often (M = 68.8%, SD = 11.9%) than trials with the 200 ms asynchrony (M = 53.4%, SD = 14.0%, t(21) = 8.8, p < .001), which were in turn rated as synchronous more often than trials with the 320 ms asynchrony (M = 34.0%, SD = 11.8%, t(21) = 13.2, p < .001), and those were rated as synchronous more often than trials with the 400 ms asynchrony (M = 29.5%, SD = 8.7%, t(21) = 3.7, p = .001).

The main effect of asynchrony type was significant as well (F(1,21) = 198.87, p < .001), with visual-first asynchronies (M = 59.2%, SD = 11.7%) being rated as synchronous significantly more often than audio-first asynchronies (M = 33.7%, SD = 10.9%), as expected (Hypothesis 2).

Unexpectedly, the main effect of action type was also significant with F(1, 21) = 64.55, p < .001, driven by overall more synchronous ratings in the tap dancing condition (M = 58.9%, SD = 14.9%) compared to the hurdling condition (M = 34.0%, SD = 10.1%). Note that this finding motivated Study 2, as outlined below.

In line with Hypothesis 3, the interaction of asynchrony size, asynchrony type, and action type was significant (F(3,63) = 10.51, p < .001). Bonferroni-corrected pairwise post-hoc t-tests comparing the respective audio-first and visual-first conditions revealed that visual-first conditions in tap dancing were perceived as synchronous more often for the 120 ms asynchrony (visual-first: M = 88.0%, SD = 11.3%; audio-first: M = 49.9%, SD = 19.0%; t(21) = 11.8, p < .001), the 200 ms asynchrony (visual-first: M = 75.5%, SD = 22.5%; audio-first: M = 49.4%, SD = 19.8%; t(21) = 6.0, p < .001), the 320 ms asynchrony (visual-first: M = 59.5%, SD = 19.8%; audio-first: M = 44.5%, SD = 18.9%; t(21) = 5.2, p < .001) and the 400 ms asynchrony (visual-first: M = 60.1%, SD = 16.2%; audio-first: M = 44.0%, SD = 16.0%; t(21) = 4.4, p = .001). In hurdling, visual-first conditions were perceived as synchronous more often than their respective audio-first conditions for the 120 ms asynchrony (visual-first: M = 91.4%, SD = 10.5%; audio-first: M = 45.9%, SD = 23.5%; t(21) = 10.4, p < .001), the 200 ms asynchrony (visual-first: M = 68.2%, SD = 18.0%; audio-first: M = 20.6%, SD = 17.1%; t(21) = 12.6, p < .001), and the 320 ms asynchrony (visual-first: M = 23.1%, SD = 16.7%; audio-first: M = 9.1%, SD = 8.7%; t(21) = 4.0, p < .001), but not for the 400 ms asynchrony (visual-first: M = 7.7%, SD = 9.7%; audio-first: M = 6.4%, SD = 8.9%; t(21) = 0.6, p = .588). This was in accordance with our assumption that the visual-first bias is observed even at very long asynchronies for tap dancing but vanishes for hurdling.

Furthermore, the interaction of action type and asynchrony size (F(3,63) = 88.71, p < .001) and the interaction of asynchrony size and asynchrony type (F(3,63) = 51.31, p < .001) were both significant, whereas the interaction of action type and asynchrony type was not (F(1,21) = 1.75, p = .20).

Interim discussion—Study 1

A consistent finding over all studies examining audiovisual asynchrony processing is that perceived synchrony rates of visual-first conditions are higher compared to audio-first conditions [11]. Study 1 corroborated this finding for both hurdling and tap dancing stimuli, suggesting that asynchrony perception does not fundamentally differ for whole-body movements. However, higher perceived synchrony of visual-first compared to audio-first asynchronies was found for larger asynchrony sizes in the tap dancing condition compared to the hurdling condition, as we expected. That is, in tap dancing, audio-first and visual-first perceived synchrony ratings were not only significantly different from each other in the smaller delay types (120 ms, 200 ms), but also in the larger ones (320 ms, 400 ms), whereas in the hurdling conditions, the same difference was found for the 120 ms, 200 ms and 320 ms conditions, but not for the 400 ms condition. This aligns with our assumption that intentionality of sound production leads to differences in the perception of tap dancing and hurdling [28]. We suggest this finding to reflect a wider temporal integration window for our tap dancing condition compared to our hurdling condition. The same mechanism might be at work whenever the temporal integration window for language or music is compared to that for object-related action-induced sounds. For instance, Eg and Behne [24] found a wider temporal integration window for language and music than for chess playing.

Our findings suggest that whole-body movement synchrony perception does not principally differ from other previously examined types of synchrony perception. At the same time, they also point to differences in synchrony perception depending on the intentionality of the produced sounds, with intentional sounds generally being perceived as more synchronous with their visual actions, or having a higher acceptance range, compared to action-induced sound occurring only incidentally.

These results also suggest diverging effects of audiovisual asynchrony on action perception and action execution. In the case of action execution, visual-first asynchronies, i.e. temporal delays of sound, have a disruptive effect on the execution of speaking [33], singing [34] and playing a musical instrument [35], but not on the execution of hurdling [36]. In the case of action perception, on the other hand, those same phase shifts are accepted as synchronous more often in language and music compared to simple object actions [22, 24]. Thus, while asynchronies seem to disrupt action execution for actions intentionally creating sounds, asynchronies for these actions are usually integrated even for relatively large temporal offsets in action perception. Considering that self-initiated sounds during action execution are usually attenuated when compared to externally generated sounds (e.g. [37–40]), most likely due to the fact that they are expected [41], the disruption of action execution through experimentally induced audiovisual asynchronies might reflect a heightened sensitivity for unexpectedly delayed sounds in self-performed vs. only observed action.

In sum, Study 1 suggests that characteristics of audiovisual integration in the perception of speech and music may generalize to other types of intentionally sound-generating actions but not to those which create sounds rather incidentally.

Unexpectedly, asynchrony was generally more accurately judged for hurdling than for tap dancing, as reflected by a significant main effect. While this finding does not relativize the reported evidence for a widened temporal window of integration in tap dancing, as suggested by the 400 ms lag condition, it motivates the assumption that it was also more difficult to detect audiovisual asynchrony in tap dancing than in hurdling. Building on these findings, in Study 2 we turned to event density and rhythmicity as factors potentially modulating audiovisual integration; specifically, we sought to test whether they confounded our experimental conditions in Study 1. As outlined in the Methods section, we performed a Matlab-based acoustic feature extraction to objectively quantify event density and rhythmicity, based on which we conducted the following two post-hoc analyses.

Firstly, tap dancing trials had a higher event density (ranging from 3.19 to 4.18; M = 3.74 Hz, SD = .41) compared to hurdling trials (ranging from 2.19 to 2.99; M = 2.69 Hz, SD = .35; Mann-Whitney-U-test: U = 0.00, exact p = .029). Event density, i.e. the number of distinguishable auditory sounds occurring per second, or the frequency of distinct events has an influence on the detection performance of audiovisual asynchrony. Visual speech, for example, is integrated roughly at the syllable rate of 4–5 Hz [2, 42, 43]. Temporal frequencies above 4 Hz seem to be difficult to rate in terms of their (a)synchrony [44]. In light of these findings, we post-hoc investigated the effect of event density on synchronicity judgments. To this end, we included event density as ordinal variable to replace action type in our original ANOVA. This analysis showed a main effect of event density (F(2.9,61.8) = 71.64, p < .001, Greenhouse-Geisser corrected; χ2(14) = 40.60, p < .001, ε = .59), which could mirror our reported main effect of action type. Bonferroni-corrected post-hoc pairwise comparisons of the event density levels showed that differences in performance levels, however, did not mirror the separation point between actions. Instead, no difference in performance was found between the four hurdling videos and one tap dancing video (all p ≥ .52), which were all lower in performance than the three tap dancing videos with the highest event densities (all p < .001), while the video with the highest event density again significantly differed from all others (all p < .001). To see whether event density fully explains the original effect of action type, we calculated another ANOVA including action type and event density (as ordinal variable within an action) as well as asynchrony type and asynchrony size. 
Here, we found significant main effects of both event density (F(2,42) = 71.09, p < .001) and action type (F(1,21) = 74.26, p < .001) as well as their interaction (F(2,42) = 68.69, p < .001). These findings suggest that the higher synchrony ratings for tap dancing (vs. hurdling) cannot be explained by the higher event density of this action type alone. Moreover, event density did not have the same effect on judging audiovisual synchrony of intentionally and incidentally generated action sounds. As these were post-hoc analyses, we do not further elucidate the other main and interaction effects here. All in all, these data patterns motivated a direct experimental manipulation and investigation of event density.
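As a hedged illustration of this post-hoc comparison, the following sketch computes event density and reproduces the exact Mann-Whitney U result from per-video densities. Note that the original analysis used MIRtoolbox in Matlab, and the individual per-video values below (other than the reported range endpoints) are hypothetical; since the two groups of four do not overlap, U = 0 and the exact two-sided p equals 2/70 ≈ .029 regardless of the intermediate values.

```python
from itertools import combinations

def event_density(n_events, duration_s):
    """Event density in Hz: distinguishable sound onsets per second."""
    return n_events / duration_s

def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj (0.5 per tie)."""
    return sum(1.0 if xi > yj else (0.5 if xi == yj else 0.0)
               for xi in x for yj in y)

def exact_two_sided_p(x, y):
    """Exact two-sided p by enumerating all relabelings of the pooled sample."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    u_obs = mann_whitney_u(x, y)
    extreme_obs = min(u_obs, n * m - u_obs)
    hits = total = 0
    for idx in combinations(range(n + m), n):
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(n + m) if i not in idx]
        u = mann_whitney_u(xs, ys)
        if min(u, n * m - u) <= extreme_obs:
            hits += 1
        total += 1
    return hits / total

# Hypothetical per-video densities consistent with the reported ranges (Hz)
hurdling = [2.19, 2.55, 2.90, 2.99]
tap_dancing = [3.19, 3.60, 3.95, 4.18]

u = mann_whitney_u(hurdling, tap_dancing)
p = exact_two_sided_p(hurdling, tap_dancing)
```

With four stimuli per action type, full enumeration over all C(8, 4) = 70 relabelings is cheap and matches the "exact p" reported in the text.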

Secondly, tap dancing trials were less rhythmically structured than hurdling trials. Although the overall amplitude of the soundtrack was balanced (i.e. adjusted) between the tap dancing and the hurdling condition, the loudness of steps was less variable within tap dancing than within hurdling. As the latter was accentuated by three heavy landing steps after hurdle clearance embedded in a sequence of lighter running steps, hurdling may also have produced a more structured percept than tap dancing sounds. A post-hoc Mann-Whitney U test showed that the measure of rhythmicity that we explored, the mean amplitude variation coefficient, was lower in tap dancing (M = .47, SD = .10) than in hurdling (M = .91, SD = .14; U = 0.00, exact p = .029). That is, the four lowest mean amplitude variation coefficients belonged to the tap dancing stimuli and the four highest to the hurdling stimuli. In line with the post-hoc analyses for event density reported above, we investigated the effect of rhythmicity, operationalized as the mean amplitude variation coefficient, on synchrony judgments in Study 1. The ANOVA including rhythmicity (8), asynchrony type (2) and asynchrony size (4) revealed a significant main effect of rhythmicity (F(3.5,73.1) = 43.62, p < .001, Greenhouse-Geisser corrected; χ2(27) = 74.27, p < .001, ε = .50). Bonferroni-corrected post-hoc pairwise comparisons showed that the stimuli with the five highest mean amplitude variation coefficients (all hurdling stimuli and one tap dancing stimulus) did not differ in their synchrony judgments (all p ≥ .968) but received significantly lower ratings than the three stimuli with the lowest mean amplitude variation coefficients. Within those three stimuli, the second lowest differed significantly from the first and the third (all p ≤ .042). Here, again, the main effect does not mirror the separation point between actions.
To test whether rhythmicity fully explains the original effect of action type, we calculated an ANOVA including action type (2), rhythmicity (4), asynchrony type (2) and asynchrony size (4). Just as for event density, this analysis showed a main effect of action type (F(1,21) = 66.18, p < .001), a main effect of rhythmicity (F(3,63) = 28.47, p < .001) and an interaction of both (F(3,63) = 33.79, p < .001). These findings suggest that rhythmicity, too, did not have the same effect on intentionally and incidentally generated action sounds. As these were post-hoc analyses, we do not further elucidate the other main and interaction effects here. The results of Study 1 thus called for a direct experimental manipulation and investigation of rhythmicity, further motivated by the fact that, to our knowledge, no study to date has examined the impact of rhythmicity on the perception of audiovisual synchrony.
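A minimal sketch of this operationalization, assuming the mean amplitude variation coefficient is the coefficient of variation (SD/mean) of per-event peak amplitudes; the exact MIRtoolbox feature computation may differ, and the amplitude profiles below are hypothetical:

```python
from statistics import mean, pstdev

def amplitude_variation_coefficient(peak_amplitudes):
    """Coefficient of variation (SD / mean) of per-event peak amplitudes.
    Higher values indicate stronger accentuation contrasts between events."""
    return pstdev(peak_amplitudes) / mean(peak_amplitudes)

# Hypothetical amplitude profiles (arbitrary units):
# hurdling-like: heavy landing steps embedded in lighter running steps
hurdling_like = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0]
# tap-dancing-like: nearly uniform loudness across steps
tap_like = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]

cv_hurdling = amplitude_variation_coefficient(hurdling_like)
cv_tap = amplitude_variation_coefficient(tap_like)
```

Under this assumption, a sequence with a few strongly accentuated events among uniform ones yields a higher coefficient than a sequence of near-identical loudness, reproducing the direction of the hurdling vs. tap dancing difference.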

To summarize these considerations, in Study 1 tap dancing stimuli received generally higher audiovisual synchrony ratings than hurdling stimuli. Since tap dancing videos also differed from hurdling videos with regard to higher event density and lower rhythmicity, both factors were potential sources of confound. To address their impact on audiovisual integration, we conducted Study 2, in which we employed PLDs of drumming sequences with variable event density and rhythmicity. Drumming PLDs enabled direct control of event density and rhythmicity in an otherwise natural human motion stimulus. Note that by using drumming actions, we kept the intentionality of sound production constant while varying event density and rhythmicity as independent experimental factors. Since PLD markers were restricted to the upper body of the drummer, and since sounds were produced by handheld drumsticks in Study 2, in contrast to the foot-produced sounds in Study 1, we refrained from directly comparing conditions between the two studies.

Study 2

We recorded PLDs of drumming actions which matched and re-combined parameters of the event density and rhythmicity of the stimuli used in Study 1. Four conditions were generated by instructing the drummer to generate one sequence matching the original hurdling condition in Study 1 (low event density, high rhythmicity, labelled D-R+ hereafter), another matching the original tap dancing stimuli (high event density, low rhythmicity, D+R-), and two sequences with new level combinations of these factors (low event density, low rhythmicity, D-R-, and high event density, high rhythmicity, D+R+).

To investigate whether high event density and low rhythmicity are relevant factors for the temporal binding of multisensory percepts, we applied the same simultaneity judgment task to our four classes of drumming stimuli. Based on the results of Study 1, we expected that for all stimuli, synchrony ratings would be higher for short asynchronies than for longer ones (120 ms > 200 ms > 320 ms > 400 ms, Hypothesis 1) and that visual-first asynchronies would be perceived as synchronous more often than their respective audio-first asynchronies (Hypothesis 2). Regarding the newly introduced factors of event density and rhythmicity, we tested whether higher synchrony ratings are observed for higher event density (Hypothesis 3) and lower rhythmicity (Hypothesis 4).

Materials and methods–Study 2

Many details regarding participants, the stimulus material and the procedure were the same as in Study 1. Therefore, we here only report aspects that were different between the two studies.

Participants

The sample consisted of 31 participants (2 male, 29 female) aged 19 to 29 years (M = 24.0, SD = 2.7); all were right-handed, as assessed by self-report. We recruited only participants who had never received training in drumming. Participants signed an informed consent form explaining the procedure of the experiment and the anonymity of the collected data. Participants studying psychology received course credit for their participation. The study was approved by the Local Ethics Committee at the University of Münster, Germany, in accordance with the Declaration of Helsinki.

Stimuli

The stimuli used in this study were PLDs of drumming actions with matching sound, performed by a professional drum teacher. As in Study 1, PLDs were recorded using the Qualisys Motion Capture System and in-ear microphones. Fifteen markers were placed symmetrically on the left and right shoulders, elbows, and wrists, as well as on three points of each drumstick and three points of the drum (Fig 1B; exemplary videos can be found in the Supplementary Material). Further processing steps of the video material matched those of Study 1. The final videos had an average length of about 6 s for each of the four factor level combinations (i.e., D-R+, D+R-, D-R-, D+R+), with lengths varying from 4.9 s to 6.8 s (M = 5.9 s). For the conditions replicating our previous hurdling and tap dancing stimuli in event density and rhythmicity (D-R+, D+R-), the drummer was familiarized with these stimuli and asked to replicate them on the drums. For the two new conditions (D-R- and D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated. For each of these four sub-conditions, four separate videos were selected, each of which was presented at nine different levels of asynchrony of the sound relative to the visual channel (± 400 / 320 / 200 / 120 ms, and 0 ms). Again, negative values indicated that the audio track was leading the visual track (audio-first) and positive values indicated that the visual track was leading the audio track (visual-first), resulting in 144 different stimuli. All videos included a 1000 ms visual fade-in and fade-out.

To ensure that the 16 newly recorded drumming videos implemented the four different factor level combinations (D-R+, D+R-, D-R-, D+R+), we used the same MIRtoolbox as in Study 1 to extract core acoustic features. Fig 2 shows that drumming videos successfully implemented the two experimental factors of mean event density (Hz) and rhythmicity (mean amplitude variation coefficient), resulting in the following combinations: D-R+ (D 2.192, R 0.694), D+R- (D 3.264, R 0.215), D-R- (D 2.538, R 0.162) and D+R+ (D 3.191, R 0.772). Thus, videos with a high event density (D+) had an event frequency of 3.23 Hz, those with low density (D-) 2.37 Hz on average. Videos with a high rhythmicity (R+) had a coefficient of amplitude variation of 0.733, whereas videos with a low rhythmicity (R-) had a coefficient of amplitude variation of 0.189.
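The reported factor-level averages can be checked for consistency against the four per-condition values; a small sketch, with all numbers taken directly from the text:

```python
# (event density in Hz, mean amplitude variation coefficient) per condition,
# as reported above
features = {
    "D-R+": (2.192, 0.694),
    "D+R-": (3.264, 0.215),
    "D-R-": (2.538, 0.162),
    "D+R+": (3.191, 0.772),
}

def level_mean(flag, index):
    """Average a feature over the two conditions whose label contains `flag`
    (e.g. 'D+' selects D+R- and D+R+)."""
    values = [v[index] for key, v in features.items() if flag in key]
    return sum(values) / len(values)

high_density = level_mean("D+", 0)       # ≈ 3.23 Hz, as reported
low_density = level_mean("D-", 0)        # ≈ 2.37 Hz
high_rhythmicity = level_mean("R+", 1)   # ≈ 0.733
low_rhythmicity = level_mean("R-", 1)    # ≈ 0.189
```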

As in Study 1, we assessed the mean motion energy (ME) score for all drumming videos (see Methods section of Study 1). This yielded a mean ME of 1052 for drumming videos, slightly lower than the ME for hurdling (1189) and tap dancing (1220) in Study 1 (S2 Fig). A Kruskal-Wallis test by ranks showed no significant difference in motion energy between hurdling, tap dancing and drumming (χ2(2) = 4.2, p = .12).
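A hedged sketch of such a motion energy measure, assuming it counts pixels whose intensity changes between consecutive frames beyond a threshold (the exact computation underlying S2 Fig may differ):

```python
def motion_energy(frames, threshold=10):
    """Mean number of pixels per frame transition whose grayscale intensity
    changes by more than `threshold` between consecutive frames.
    `frames` is a list of equally sized flat lists of pixel intensities."""
    changed_counts = [
        sum(1 for p, c in zip(prev, cur) if abs(p - c) > threshold)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(changed_counts) / len(changed_counts)

# Toy example: a 2x2 'video' in which two pixels light up once
frames = [
    [0, 0, 0, 0],
    [0, 0, 255, 255],  # two pixels change here
    [0, 0, 255, 255],  # no further change
]
me = motion_energy(frames)
```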

Procedure

The experiment consisted of four experimental blocks. Within each block, each of the 144 stimuli (four D-R+, four D+R-, four D-R-, and four D+R+ videos, each presented at nine levels of audiovisual asynchrony) was presented once, resulting in 576 trials in total. A pseudo-randomization guaranteed that no more than three videos of the same asynchrony type (audio-first vs. visual-first) were presented in a row, to prevent adaptation to one or the other. Additionally, no more than two videos with the exact same level of asynchrony were presented in direct succession. We employed the same task as in Study 1.
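A minimal sketch of such a constrained pseudo-randomization via rejection sampling; the authors' actual randomization procedure is not specified, and treating the 0 ms stimuli as their own "type" is an assumption made here:

```python
import random

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if b == a else 1
        best = max(best, cur)
    return best

def valid(order, max_type_run=3, max_level_run=2):
    """Check both constraints: runs of asynchrony type and of exact level."""
    types = [t for t, _ in order]
    levels = [lvl for _, lvl in order]
    return (longest_run(types) <= max_type_run
            and longest_run(levels) <= max_level_run)

def constrained_shuffle(trials, seed=None, max_tries=100_000):
    """Reshuffle until no more than three consecutive trials share an
    asynchrony type and no more than two share the exact same level."""
    rng = random.Random(seed)
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if valid(order):
            return order
    raise RuntimeError("no valid order found; relax constraints or add swaps")

# One sub-block: 4 videos x 9 asynchrony levels (ms); negative = audio-first
levels = [-400, -320, -200, -120, 0, 120, 200, 320, 400]
trials = [("audio" if l < 0 else "visual" if l > 0 else "sync", l)
          for l in levels for _ in range(4)]
order = constrained_shuffle(trials, seed=1)
```

Plain rejection sampling suffices here because a random order of this size already satisfies both constraints a reasonable fraction of the time; for harder constraints, a swap-repair pass would be the usual fallback.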

Experimental design

The study used a four-factorial within-subject design with the two-level factors event density (low, high) and rhythmicity (low, high), the four-level factor asynchrony size (120 ms, 200 ms, 320 ms, 400 ms) and the two-level factor asynchrony type (audio-first, visual-first). The dependent variable was the percentage of trials perceived as synchronous. Correspondingly, a 2 x 2 x 4 x 2 ANOVA was calculated.

Results–Study 2

Behavioral results are depicted in Figs 5 and 6. Mauchly’s test indicated that the assumption of sphericity was violated for asynchrony size (χ2(5) = 17.93, p = .003, ε = .71), event density x asynchrony size (χ2(5) = 22.53, p < .001, ε = .65), rhythmicity x asynchrony size (χ2(5) = 14.63, p = .012, ε = .78), event density x rhythmicity x asynchrony size (χ2(5) = 12.52, p = .028, ε = .78), and asynchrony size x asynchrony type (χ2(5) = 22.01, p = .001, ε = .68). Therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. As expected, we replicated the main effects of asynchrony size (F(2.1,63.9) = 186.86, p < .001) and asynchrony type (F(1,30) = 149.59, p < .001). Synchrony ratings were highest for the 120 ms asynchronies (M = 69.7%, SD = 14.6%) and decreased with increasing asynchrony (200 ms asynchronies, M = 61.8%, SD = 16.9%; 320 ms asynchronies, M = 48.0%, SD = 19.0%; 400 ms asynchronies, M = 39.8%, SD = 18.5%), with significant differences between all adjacent asynchrony sizes (all p < .001, Hypothesis 1). Synchrony ratings were also higher for visual-first asynchronies (M = 64.1%, SD = 14.8%) than for audio-first asynchronies (M = 45.6%, SD = 19.3%, Hypothesis 2).

Fig 5. Main effects of the audiovisual (a)synchrony ratings, Study 2.


Mean percentages of trials perceived as synchronous, aggregated for the factors asynchrony size, asynchrony type, Event density (D- standing for low, D+ for high density) and Rhythmicity (R- and R+ for low and high rhythmicity, respectively). Error bars represent the standard deviation. Significant differences are marked with asterisks.

Fig 6. Mean percentages of trials perceived as synchronous, Study 2.


On the left-hand side, all scores are fanned out for the level combinations of the factors asynchrony size, asynchrony type, event density and rhythmicity. The right-hand chart illustrates the significant event density x rhythmicity interaction.

We found a main effect for event density (F(1,30) = 122.30, p < .001), with higher event density resulting in higher synchrony ratings (M = 69.8%, SD = 20.4%) compared to lower event density (M = 39.9%, SD = 15.8%, Hypothesis 3). We found a main effect for rhythmicity as well (F(1,30) = 5.48, p = .026), but contrary to our hypothesis (Hypothesis 4), synchrony ratings for lower rhythmicity were lower (M = 52.3%, SD = 15.7%) than those for higher rhythmicity (M = 57.4%, SD = 19.6%).

Interaction effects were significant for event density x rhythmicity (F(1,30) = 22.59, p < .001), event density x asynchrony size (F(1.9,58.0) = 42.86, p < .001), rhythmicity x asynchrony size (F(2.3,70.1) = 6.26, p = .002), event density x rhythmicity x asynchrony size (F(2.3,70.1) = 34.63, p < .001), event density x asynchrony type (F(1,30) = 87.11, p < .001), rhythmicity x asynchrony type (F(1,30) = 4.58, p = .041), event density x rhythmicity x asynchrony type (F(1,30) = 4.85, p = .036), asynchrony size x asynchrony type (F(2.0,61.3) = 65.22, p < .001), event density x asynchrony size x asynchrony type (F(3,90) = 99.52, p < .001), rhythmicity x asynchrony size x asynchrony type (F(3,90) = 4.82, p = .004), and event density x rhythmicity x asynchrony size x asynchrony type (F(3,90) = 9.76, p < .001).

Bonferroni-corrected post-hoc pairwise comparisons inspecting the interaction of event density and rhythmicity showed significant increases in synchrony ratings from low to high event density at both low (p < .001) and high (p < .001) rhythmicity. Synchrony ratings increased significantly from low to high rhythmicity only under low event density (p < .001), but not under high event density (p = .32).

General discussion

Visual and auditory signals often occur concurrently and aid a more reliable perception of events that cause these signals. Audiovisual integration depends on several factors which have been thoroughly investigated using the example of spoken language or music, but remain largely unexplored regarding their generalizability beyond these domains. In two behavioral studies, we examined the impact of audiovisual stimulus properties that are characteristic for both speech and music, and hence are particularly suited to address the issue of generalizability. In Study 1, we compared audiovisual signals from PLDs which were created intentionally, via tap dancing, and those which were created incidentally, via hurdling. In Study 2, we examined event density and rhythmicity as two properties describing drumming actions and their corresponding sounds.

In line with previous research [11, 22, 24], we found in both Study 1 and Study 2 that smaller asynchronies tended to be perceived as synchronous more often than larger asynchronies, and that visual-first asynchronies received higher synchrony ratings than their respective audio-first asynchronies. These effects were consistently observed for actions that create sounds intentionally as well as incidentally. Interestingly, the average synchrony ratings for the drumming stimuli (55%) were comparable to those recorded in Study 1 for tap dancing (59%) rather than hurdling (34%), corroborating the interpretation that intentionally action-induced sounds are perceived differently from merely incidentally action-induced sounds. In two previous fMRI studies addressing this issue [28, 29], behavioral and brain activity data pointed towards stronger auditory expectations (but, importantly, not towards enhanced auditory attention) when observing intentional as compared to incidental sound production. Thus, particularly strong auditory expectations may tend to overrule actual perceptual evidence (e.g. of asynchrony), leading to a pronounced audiovisual integration bias for intentional sound production such as spoken language. It is well known that strong prior expectations, while often helpful for perception, can lead to misperception of degraded sensory signals, causing for instance so-called “slips of the ear” in speech perception and visual illusions [45]. In terms of the predictive coding account, such misperceptions reflect a failure to adjust prior expectations to the current stimulus, either because these prior beliefs are too strong or because the prediction error is not strong enough [46]. Importantly, part of the generative model is the precision of incoming sensory input and hence of potential prediction errors: this expected precision modulates how much the prediction error is weighted in updating predictions.
Normally, when we see and hear an actor performing sound-generating movements, sound and sight are synchronous (allowing for a slight visual lead, because light travels faster than sound). Under normal conditions, when the environment is not particularly noisy and movement patterns are familiar, the internal models are weighted highly and the prediction error is relatively low. Temporal regularities, prominent in speech and music as well as in skilled human movement sequences such as tap dancing and hurdling, are particularly powerful cues for cross-sensory prediction [47]. Hence, in line with the aforementioned fMRI findings on tap dancing and hurdling [28, 29], we suggest that while predictive processes favor the (mis)perception of synchrony for both hurdling and tap dancing, a more elaborate generative model of to-be-produced sounds may amplify this bias even further. We come back to this assumption below when discussing the effect of rhythmicity.

Study 2 was conducted to examine whether the effect of intentionality could be partly explained by event density and/or rhythmicity, keeping the intentionality of sound production (in drumming) constant. Here we found that synchrony ratings increased significantly for stimuli with a high event density. That is, the more events occurred per second, the more strongly participants were biased towards audiovisual integration, even at very large asynchronies (400 ms). This observation fits well with reports of asynchrony detection collapsing for high event density stimuli, both for speech [2] and for audiovisual flash-beep pairings [44]. Petrini and co-workers [18, 48] found that this bias is reduced by practice, as expert drummers outperform novices in audiovisual asynchrony perception for both slow and fast tempi. Importantly, the data of Study 1 suggested that both intentionality and event density significantly increased the proportion of synchrony judgments for actually asynchronous audiovisual stimuli, meaning that the effect of intentionality could not be reduced to differences in event density. Since these findings were derived only from a post-hoc comparison treating event density as an ordinal variable, a more reliable test was provided by Study 2, which corroborated a main effect of event density on perceived audiovisual synchrony.

Why does increasing event density disrupt audiovisual asynchrony detection so effectively? A straightforward explanation may be that increasing the event density narrows the empty intervals between filled intervals, be they clicks, tones, or sounds. At an event density of 2.5 Hz, the average onset-to-onset interval between events amounts to 400 ms, meaning that a 400 ms asynchrony manipulation shifts the delayed auditory event to coincide with the next visual event (or vice versa). Consequently, a 400 ms asynchrony in an audiovisual signal with an event density of 2.5 Hz can only be detected in either of two cases: either (a) other stimulus features such as amplitude, pitch or spectral frequency are variable enough to indicate a mismatch between the phase-shifted auditory event and the visual event it coincides with (or vice versa), or (b) the temporal variance of the onset-to-onset intervals is high enough to include longer intervals, bringing a phase-shifted auditory event onto a visual event gap (or vice versa). In the current Study 2, (b) was met by all conditions, and both (a) and (b) in the case of increasing rhythmicity, which entails more variable beat accentuation. For the high event density condition (3.23 Hz), the effect of higher rhythmicity was negligible. Thus, high event density effectively disturbed asynchrony judgments, independent of the level of rhythmicity. Focusing on this part of the results, one would expect rhythmicity to have an effect only if event density is not too high, as the probability of a phase-shifted auditory event falling into a visual event gap is higher when event density is low. And indeed, increasing rhythmicity had a comparably clear effect when event density was low (2.37 Hz).
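The coincidence argument can be made concrete with a small sketch, assuming perfectly isochronous onsets for illustration (real stimuli vary around the mean interval):

```python
def onsets(density_hz, n_events):
    """Isochronous event onsets in ms at a given event density."""
    step_ms = 1000.0 / density_hz
    return [i * step_ms for i in range(n_events)]

def mean_misalignment(visual, audio, shift_ms):
    """Mean distance (ms) of each shifted audio onset to its nearest visual
    onset, over shifted onsets still within the visual sequence."""
    shifted = [a + shift_ms for a in audio if a + shift_ms <= visual[-1]]
    return sum(min(abs(s - v) for v in visual) for s in shifted) / len(shifted)

vis = onsets(2.5, 10)  # 2.5 Hz -> 400 ms onset-to-onset interval
aud = onsets(2.5, 10)

# A 400 ms shift moves every audio event exactly onto the next visual event,
# so the streams appear perfectly aligned again:
aligned = mean_misalignment(vis, aud, 400)
# whereas a half-interval (200 ms) shift leaves every event maximally off-beat:
off_beat = mean_misalignment(vis, aud, 200)
```

At exactly 2.5 Hz the full-interval shift produces zero residual misalignment, which is why only amplitude/pitch mismatches (case a) or interval variability (case b) can reveal the manipulation.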

Contrary to our expectation, higher rhythmicity did not enhance asynchrony detection; synchrony ratings were actually higher for more rhythmic trials in Study 2. Note that, while both main effects, density and rhythmicity, were statistically significant, the overall impact on perceived synchrony was far stronger for density (low: 40% vs. high: 70%) than for rhythmicity (low: 52% vs. high: 57%). Still, the significant interaction of these factors revealed that at the level of low event density, rhythmicity noticeably increased perceived synchrony from 34% (low rhythmicity) to 47% (high rhythmicity). Two conclusions can be drawn from this finding: first, rhythmicity had a significant effect, but only under the condition that event density did not exceed a certain level. Second, rhythmicity could be ruled out as a confounding factor in Study 1, as tap dancing was less rhythmic than hurdling but led to higher synchrony ratings. In other words, the participants’ bias to misjudge asynchronous audiovisual PLDs as synchronous was not explained by the lower rhythmic structure of tap dancing as compared to hurdling.

To our knowledge, no study to date has explicitly examined the influence of temporal structure or rhythmicity on audiovisual asynchrony detection. Thus, it is hard to pinpoint why our more rhythmic stimuli, featuring more distinct events with more discernible moments of impact, led to higher synchrony ratings than the less rhythmic stimuli. At first sight, the opposite would have been plausible, given that, for instance, rhythmicity enhances the detection of auditory stimuli in noisy environments [49]. Since less accurate asynchrony detection has been reported for more complex stimuli in previous studies [4], it is possible that both rhythmicity and density increased overall stimulus complexity, explaining increased synchrony ratings for both highly dense and highly rhythmic stimuli. However, a worthwhile alternative hypothesis is that increasing rhythmicity promotes general predictability of the stimulus, based on chunking and patterns of accented and unaccented events [50]. Regularity in the stimulus stream, and especially a rhythmical event structure, is among the most effective sources of temporal predictions [51, 52]. Metrical congruency between the visual and the auditory stream is known to make a slight temporal deviation less noticeable, for instance when we observe dancing to music [53]. Factors promoting predictability may therefore increase our proneness to neglect smaller audiovisual asynchronies [54]. We hence propose that both intentionally produced sounds and more rhythmically structured sounds increased (undue) confidence in synchrony. Both effects may be related to increased reliance on top-down predictive models, entailing that asynchronous trials more often go undetected.

Limitations

Tap dancing, hurdling, and drumming are quite different types of action and may introduce further sources of variance beyond those we focused on in the current studies.

Firstly, the amount of motion may differ between these types of action. The experiments reported here focused on how natural motion sounds are processed. As reported, we assessed this factor in terms of motion energy, and our statistical analysis suggested no significant differences of motion energy in the three tested types of action. Still, the impact of further and more fine-grained parameters describing motion on audiovisual integration, including for instance movement velocity, acceleration, smoothness or entropy (hence predictability) [55] as well as dynamic features related to rhythmic structure in movements [56], remains to be further examined using sound-generating whole-body movements.

Secondly, tap dancing and hurdling may differ with regard to the arousal or emotional responses they trigger. In a previous study [28], in which the stimulus set included the videos used in Study 1, we asked participants to indicate whether they found hurdling or tap dancing more difficult, and to rate the quality of the performance in each single trial. These ratings did not reveal any significant differences between tap dancing and hurdling videos. While hurdling and tap dancing are comparable sports in many respects, they differ in terms of expressive or aesthetic appeal. The studies reported here cannot rule out the influence of this factor, which should be the subject of future investigation.

Finally, one may expect tap dancing videos to attract more auditory attention than hurdling videos. To be sure, in the current study participants were instructed to deliver an explicit judgment on the synchrony of the audiovisual stimuli, which obviously entails attention to both modalities. However, one may speculate that auditory attention is still higher when we observe tap dancing, simply because sounds are produced intentionally in this condition. Two previous fMRI studies including the videos used in Study 1 did not support such an attentional bias for tap dancing [28, 29]. Attention has been found to reverse the typical BOLD attenuation effects observed for predicted stimuli, leading to enhanced responses in primary sensory cortices [57]. Contrary to an attentional bias hypothesis, primary auditory cortex was significantly and replicably attenuated for tap dancing compared to hurdling in both fMRI studies [28, 29]. These findings are difficult to reconcile with an attentional interpretation of the stronger synchrony bias in tap dancing. Rather, and in line with the fMRI effects, we assume that predictability of the auditory signal plays a crucial role, making intentionally produced sounds more prone to be integrated with their respective visual motion patterns than incidentally produced sounds.

Although it would have been possible to investigate audiovisual integration in the perception of intentionally and incidentally produced sounds using artificially generated stimuli, for the current series of experiments, the focus was on investigating natural movement sounds in an ecologically valid context. This also had the particular advantage of not biasing subjects’ attention in any direction, since we were generating a quite natural perceptual situation. Another approach would have been to combine identical natural actions with intentional and incidental sounds. Here, however, we expected a confound in the sense that subjects would have expected the intentionally produced sound and not the incidental one. Thus, surprise or even irritation effects would have occurred in the incidental condition and would probably have strongly biased the comparison.

Conclusions

While almost all of our physical actions produce sounds, existing research on audiovisual perception is largely restricted to language and music, and only a handful of studies consider sounds created by object manipulations. However, since speech and music stand out as intentionally produced sounds, it is unclear whether they can be considered representative of action-induced sounds and their audiovisual integration in general. Our present studies contribute to the still very limited number of studies examining audiovisual integration of natural non-speech stimuli (e.g. [19, 21–24, 48]). Study 1 showed that typical effects reported for audiovisual speech integration extend to the perception of audiovisual asynchrony in whole-body actions, with shorter asynchronies leading to higher synchrony ratings and an asymmetric temporal integration window favoring integration at visual-first asynchronies. As expected, these effects were even stronger for intentionally as compared to incidentally generated action sounds. Study 2 suggested that high event density effectively disturbs the discrimination of audiovisual asynchronies. As the auditory event density of speech exceeds that of most other types of action-induced sounds, it remains to be investigated whether the considerable bias for integrating asynchronous audiovisual speech stimuli is (at least partly) due to its exceptionally high event density. At low event densities, stronger rhythmicity also increased the overall audiovisual integration bias. We suggest that rhythmicity and intentionality of sound production promote (undue) trust in synchrony because both foster reliance on a predictive mode of processing. It remains to be tested whether event density and/or rhythmicity have the same effect on incidentally generated action sounds.
To clarify this question, and to further our understanding of common principles of audiovisual integration beyond speech and music [17], more research is needed addressing audiovisual integration in incidentally generated action sounds and more real-life audiovisual stimuli, considering the full range of sound features contributing to the variance of audiovisual integration biases.

Supporting information

S1 Fig. Example of drumming sequence with high and low rhythmicity (Study 2).

Rhythmicity was operationalized as the variation of the amplitude envelope, shown here for two exemplary drumming sequences. While the event density of both recordings is virtually identical (3.42 and 3.39, respectively), the auditory events in the left recording are highly similar in loudness, resulting in low rhythmicity overall (v = 0.25). In contrast, the auditory events within the right recording vary more strongly in loudness, with almost equidistant duplets of loud (i.e. accentuated) events interspersed with less accentuated events, resulting in high rhythmicity overall (v = 0.78).

(TIFF)

S2 Fig. Motion energy, Study 1 and 2.

The amount of motion, quantified as the number of moving pixels per video, for all PLD videos employed to generate the audiovisual asynchronous stimuli in Study 1 and Study 2. Each black marker depicts the motion energy of one video (see Methods of Study 1 for details).

(TIFF)

S1 Video. Sample video Study 1.

Hurdling, auditory first, 120 ms asynchrony.

(MP4)

S2 Video. Sample video Study 1.

Hurdling, auditory first, 400 ms asynchrony.

(MP4)

S3 Video. Sample video Study 1.

Hurdling, visual first, 120 ms asynchrony.

(MP4)

S4 Video. Sample video Study 1.

Hurdling, visual first, 400 ms asynchrony.

(MP4)

S5 Video. Sample video Study 1.

Tap dancing, auditory first, 120 ms asynchrony.

(MP4)

S6 Video. Sample video Study 1.

Tap dancing, auditory first, 400 ms asynchrony.

(MP4)

S7 Video. Sample video Study 1.

Tap dancing, visual first, 120 ms asynchrony.

(MP4)

S8 Video. Sample video Study 1.

Tap dancing, visual first, 400 ms asynchrony.

(MP4)

S9 Video. Sample video Study 2.

Drumming, high event density, high rhythmicity.

(MP4)

S10 Video. Sample video Study 2.

Drumming, high event density, low rhythmicity.

(MP4)

S11 Video. Sample video Study 2.

Drumming, low event density, high rhythmicity.

(MP4)

S12 Video. Sample video Study 2.

Drumming, low event density, low rhythmicity.

(MP4)

Acknowledgments

We would like to thank Monika Mertens and Marie Kleinbielen for their help during data collection, Theresa Eckes, Alina Eisele, Marie Kleinbielen, and Katharina Thiel for their assistance during filming and with creating the stimulus material, and Niklas Petersen for calculating the motion energy scores. Finally, we would like to thank Nadiya El-Sourani, Amelie Huebner, Klara Hagelweide, Laura Quante, Marlen Roehe and Lena Schliephake for rewarding discussions.

Data Availability

All files are available from the OSF database: Schubotz, Ricarda. 2020. “AVIA - Audiovisual Integration in Hurdling, Tap Dancing and Drumming.” OSF. October 2. osf.io/ksma6. DOI: 10.17605/OSF.IO/KSMA6.

Funding Statement

Author: RIS; Grant number: SCHU1439/4-2; Name of funder: German Research Foundation (DFG, Deutsche Forschungsgemeinschaft); URL: https://www.dfg.de/en/index.jsp. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Ernst MO, Bülthoff HH. Merging the senses into a robust percept. Trends in Cognitive Sciences. 2004;8(4):162–9. doi: 10.1016/j.tics.2004.02.002 [DOI] [PubMed] [Google Scholar]
  • 2.van Wassenhove V, Grant KW, Poeppel D. Temporal window of integration in auditory-visual speech perception. Neuropsychologia. 2007;45(3):598–607. doi: 10.1016/j.neuropsychologia.2006.01.001 [DOI] [PubMed] [Google Scholar]
  • 3.Chen YC, Spence C. Assessing the role of the “unity assumption” on multisensory integration: A review. Frontiers in Psychology. 2017;8:445. doi: 10.3389/fpsyg.2017.00445 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Zhou HY, Cheung EFC, Chan RCK. Audiovisual temporal integration: Cognitive processing, neural mechanisms, developmental trajectory and potential interventions. Neuropsychologia. 2020;140:107396. doi: 10.1016/j.neuropsychologia.2020.107396 [DOI] [PubMed] [Google Scholar]
  • 5.Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA. The natural statistics of audiovisual speech. PLoS Computational Biology. 2009;5(7). doi: 10.1371/journal.pcbi.1000436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Schwartz JL, Savariaux C. No, There Is No 150 ms Lead of Visual Speech on Auditory Speech, but a Range of Audiovisual Asynchronies Varying from Small Audio Lead to Large Audio Lag. PLoS Computational Biology. 2014;10(7). doi: 10.1371/journal.pcbi.1003743 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Grant KW, Van Wassenhove V, Poeppel D. Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication. 2004;44(1–4 SPEC. ISS.):43–53. [Google Scholar]
  • 8.McGurk H, MacDonald J. Hearing lips and seeing voices. Nature. 1976;264(5588):746–8. doi: 10.1038/264746a0 [DOI] [PubMed] [Google Scholar]
  • 9.Venezia JH, Thurman SM, Matchin W, George SE, Hickok G. Timing in audiovisual speech perception: A mini review and new psychophysical data. Attention, Perception, and Psychophysics. 2016;78(2):583–601. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Grant KW, Greenberg S. Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information. Proceedings of the Conference on Auditory-Visual Speech Processing (AVSP). 2001;(1):132–7. [Google Scholar]
  • 11.Vroomen J, Keetels M. Perception of intersensory synchrony: A tutorial review. Attention Perception & Psychophysics. 2010;72(4):871–84. doi: 10.3758/APP.72.4.871 [DOI] [PubMed] [Google Scholar]
  • 12.Stevenson RA, Wallace MT. Multisensory temporal integration: Task and stimulus dependencies. Experimental Brain Research. 2013;227(2):249–61. doi: 10.1007/s00221-013-3507-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Jicol C, Proulx MJ, Pollick FE, Petrini K. Long-term music training modulates the recalibration of audiovisual simultaneity. Experimental Brain Research. 2018;236(7):1869–80. doi: 10.1007/s00221-018-5269-4 [DOI] [PubMed] [Google Scholar]
  • 14.Fujisaki W, Shimojo S, Kashino M, Nishida S. Recalibration of audiovisual simultaneity. Nature Neuroscience. 2004;7(7):773–8. doi: 10.1038/nn1268 [DOI] [PubMed] [Google Scholar]
  • 15.Massaro DW, Cohen MM, Smeele PMT. Perception of asynchronous and conflicting visual and auditory speech. The Journal of the Acoustical Society of America. 1996. Sep;100(3):1777–86. doi: 10.1121/1.417342 [DOI] [PubMed] [Google Scholar]
  • 16.Navarra J, Vatakis A, Zampini M, Soto-Faraco S, Humphreys W, Spence C. Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research. 2005;25(2):499–507. doi: 10.1016/j.cogbrainres.2005.07.009 [DOI] [PubMed] [Google Scholar]
  • 17.Schutz M, Gillard J. On the generalization of tones: A detailed exploration of non-speech auditory perception stimuli. Scientific Reports. 2020;10(1):1–14. doi: 10.1038/s41598-019-56847-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Petrini K, Pollick FE, Dahl S, McAleer P, McKay L, Rocchesso D, et al. Action expertise reduces brain activity for audiovisual matching actions: An fMRI study with expert drummers. NeuroImage. 2011;56(3):1480–92. doi: 10.1016/j.neuroimage.2011.03.009 [DOI] [PubMed] [Google Scholar]
  • 19.Petrini K, Dahl S, Rocchesso D, Waadeland CH, Avanzini F, Puce A, et al. Multisensory integration of drumming actions: Musical expertise affects perceived audiovisual asynchrony. Experimental Brain Research. 2009;198(2–3):339–52. doi: 10.1007/s00221-009-1817-2 [DOI] [PubMed] [Google Scholar]
  • 20.Love SA, Petrini K, Cheng A, Pollick FE. A Psychophysical Investigation of Differences between Synchrony and Temporal Order Judgments. PLoS ONE. 2013;8(1):e54798. doi: 10.1371/journal.pone.0054798 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Dixon NF, Spitz L. The detection of auditory visual desynchrony. Perception. 1980;9(6):719–21. doi: 10.1068/p090719 [DOI] [PubMed] [Google Scholar]
  • 22.Vatakis A, Spence C. Audiovisual synchrony perception for speech and music assessed using a temporal order judgment task. Neuroscience Letters. 2006;393(1):40–4. doi: 10.1016/j.neulet.2005.09.032 [DOI] [PubMed] [Google Scholar]
  • 23.Stekelenburg JJ, Vroomen J. Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of cognitive neuroscience. 2007. Dec;19(12):1964–73. doi: 10.1162/jocn.2007.19.12.1964 [DOI] [PubMed] [Google Scholar]
  • 24.Eg R, Behne DM. Perceived synchrony for realistic and dynamic audiovisual events. Frontiers in Psychology. 2015;6:736. doi: 10.3389/fpsyg.2015.00736 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Vatakis A, Maragos P, Rodomagoulakis I, Spence C. Assessing the effect of physical differences in the articulation of consonants and vowels on audiovisual temporal perception. Frontiers in Integrative Neuroscience. 2012;6:1–18. doi: 10.3389/fnint.2012.00001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Noel J-P, Stevenson RA, Wallace MT. Atypical Audiovisual Temporal Function in Autism and Schizophrenia: Similar Phenotype, Different Cause. European Journal of Neuroscience. 2018;47(10):1230–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Van der Stoep N, Van der Stigchel S, Van Engelen RC, Biesbroek J, Nijboer TC. Impairments in Multisensory Integration after Stroke. Journal of Cognitive Neuroscience. 2019;31(6):885–99. doi: 10.1162/jocn_a_01389 [DOI] [PubMed] [Google Scholar]
  • 28.Heins N, Pomp J, Kluger DS, Trempler I, Zentgraf K, Raab M, et al. Incidental or Intentional? Different Brain Responses to One’ s Own Action Sounds in Hurdling vs. Tap Dancing. Frontiers in neuroscience. 2020;14:483. doi: 10.3389/fnins.2020.00483 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Heins N, Trempler I, Zentgraf K, Raab M, Schubotz RI. Too Late! Influence of Temporal Delay on the Neural Processing of One’s Own Incidental and Intentional Action-Induced Sounds. Frontiers in Neuroscience. 2020;14:573970. doi: 10.3389/fnins.2020.573970 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Lartillot O, Toiviainen P. A Matlab toolbox for musical feature extraction from audio. Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007. 2007;September. [Google Scholar]
  • 31.Nishimoto S, VU AT, Naselaris T, Benjamin Y, Yu B, Gallant JL. Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol. 2011;21(19):1641–6. doi: 10.1016/j.cub.2011.08.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Eg R, Griwodz C, Halvorsen P, Behne D. Audiovisual robustness: exploring perceptual tolerance to asynchrony and quality distortion. Multimedia Tools and Applications. 2014;74(2):345–65. [Google Scholar]
  • 33.Tourville JA, Reilly KJ, Guenther FH. Neural mechanisms underlying auditory feedback control of speech. NeuroImage. 2008;39(3):1429–43. doi: 10.1016/j.neuroimage.2007.09.054 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Keough D, Jones JA. The sensitivity of auditory-motor representations to subtle changes in auditory feedback while singing. The Journal of the Acoustical Society of America. 2009;126(2):837–46. doi: 10.1121/1.3158600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Pfordresher PQ, Beasley RTE. Making and monitoring errors based on altered auditory feedback. Frontiers in Psychology. 2014;5(August):914. doi: 10.3389/fpsyg.2014.00914 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kennel C, Streese L, Pizzera A, Justen C, Hohmann T, Raab M. Auditory reafferences: the influence of real-time feedback on movement control. Frontiers in Psychology. 2015;6(January):1–6. doi: 10.3389/fpsyg.2015.00001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Horváth J, Maess B, Baess P. Action–Sound Coincidences Suppress Evoked Responses of the Human Auditory Cortex in EEG and MEG. 2012;1919–31. [DOI] [PubMed] [Google Scholar]
  • 38.Baess P, Horváth J, Jacobsen T, Schröger E. Selective suppression of self-initiated sounds in an auditory stream: An ERP study. Psychophysiology. 2011;48(9):1276–83. doi: 10.1111/j.1469-8986.2011.01196.x [DOI] [PubMed] [Google Scholar]
  • 39.Aliu SO, Houde JF, Nagarajan SS. Motor-induced suppression of the auditory cortex. Journal of Cognitive Neuroscience. 2009;21(4):791–802. doi: 10.1162/jocn.2009.21055 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Straube B, Van Kemenade BM, Arikan BE, Fiehler K, Leube DT, Harris LR, et al. Predicting the multisensory consequences of one’s own action: Bold suppression in auditory and visual cortices. PLoS ONE. 2017;12(1):1–25. doi: 10.1371/journal.pone.0169131 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Kaiser J, Schütz-Bosbach S. Sensory attenuation of self-produced signals does not rely on self-specific motor predictions. European Journal of Neuroscience. 2018;47(11):1303–10. [DOI] [PubMed] [Google Scholar]
  • 42.Arai T, Greenberg S. The temporal properties of spoken Japanese are similar to those of English. Proc Eurospeech’97. 1997;2(February):1011–4. [Google Scholar]
  • 43.Greenberg S. A Multi-Tier Framework for Understanding Spoken Language. In: Listening to speech: An auditory perspective. Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers; 2006. p. 411–33. [Google Scholar]
  • 44.Fujisaki W, Nishida S. Temporal frequency characteristics of synchrony-asynchrony discrimination of audio-visual signals. Experimental Brain Research. 2005;166(3–4):455–64. doi: 10.1007/s00221-005-2385-8 [DOI] [PubMed] [Google Scholar]
  • 45.Nour MM, Nour JM. Perception, illusions and Bayesian inference. Psychopathology. 2015;48(4):217–21. doi: 10.1159/000437271 [DOI] [PubMed] [Google Scholar]
  • 46.Blank H, Spangenberg M, Davis MH. Neural prediction errors distinguish perception and misperception of speech. Journal of Neuroscience. 2018;38(27):6076–89. doi: 10.1523/JNEUROSCI.3258-17.2018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Noppeney U, Lee HL. Causal inference and temporal predictions in audiovisual perception of speech and music. Annals of the New York Academy of Sciences. 2018;1423(1):102–16. [DOI] [PubMed] [Google Scholar]
  • 48.Petrini K, Dahl S, Rocchesso D, Waadeland CH, Avanzini F, Puce A, et al. Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony. Experimental Brain Research. 2009;198(2–3):339–52. doi: 10.1007/s00221-009-1817-2 [DOI] [PubMed] [Google Scholar]
  • 49.ten Oever S, Schroeder CE, Poeppel D, van Atteveldt N, Zion-Golumbic E. Rhythmicity and cross-modal temporal cues facilitate detection. Neuropsychologia. 2014;63:43–50. doi: 10.1016/j.neuropsychologia.2014.08.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Hardesty J. Building Blocks of Rhythmic Expectation. MUME 2016—The Fourth International Workshop on Musical Metacreation. 2016;(September). [Google Scholar]
  • 51.Henry MJ, Obleser J. Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(49):20095–100. doi: 10.1073/pnas.1213390109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Cravo AM, Rohenkohl G, Wyart V, Nobre AC. Temporal expectation enhances contrast sensitivity by phase entrainment of low-frequency oscillations in visual cortex. Journal of Neuroscience. 2013; doi: 10.1523/JNEUROSCI.4675-12.2013 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Su YH. Metrical congruency and kinematic familiarity facilitate temporal binding between musical and dance rhythms. Psychonomic Bulletin and Review. 2018;25(4):1416–22. doi: 10.3758/s13423-018-1480-3 [DOI] [PubMed] [Google Scholar]
  • 54.De Lange FP, Heilbron M, Kok P. How Do Expectations Shape Perception? Perceptual Consequences of Expectation. Trends in Cognitive Sciences. 2018;1–16. [DOI] [PubMed] [Google Scholar]
  • 55.Orlandi A, Cross ES, Orgs G. Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition. 2020;205(September):104446. doi: 10.1016/j.cognition.2020.104446 [DOI] [PubMed] [Google Scholar]
  • 56.Su YH. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition. 2014;131(3):330–44. doi: 10.1016/j.cognition.2014.02.004 [DOI] [PubMed] [Google Scholar]
  • 57.Kok P, Rahnev D, Jehee JFM, Lau HC, de Lange FP. Attention reverses the effect of prediction in silencing sensory signals. Cerebral Cortex. 2012. Sep;22(9):2197–206. doi: 10.1093/cercor/bhr310 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Alice Mado Proverbio

8 Dec 2020

PONE-D-20-30978

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PLOS ONE

Dear Dr. Schubotz,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, I urge the authors to reconsider the presentation of the rationale and the discussion section, including the interpretation of the results. Please submit your revised manuscript by Jan 22 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors presented two attractive behavioural studies that aimed to investigate audiovisual integration in whole-body actions. They presented the participants with short videos depicting point-light displays of hurdling vs. tap-dancing actions (Study 1) and drumming actions (Study 2). The movements could be in synch or out of synch with the produced sounds, and in the latter case, the authors systematically varied the delay between visual and auditory stimulation and their order (video-first vs. audio-first). The participants were instructed to explicitly judge the synchronicity of the stimuli. The results indicated higher synchrony ratings for shorter (vs. longer) asynchrony intervals, for visual stimulation preceding (vs. following) the auditory consequences, and for higher (vs. lower) event density. The manuscript is well-organized, the language is appropriate, and the investigation of audiovisual integration in the context of whole-body actions is certainly interesting. That said, I encourage the authors to take into consideration a series of points that could strengthen the work if integrated with their current proposal. My concerns are specifically focused on the methodological aspects of the studies.

Major points:

- I appreciated that the authors reported possible confounding factors from study 1 and used them to motivate and create Study 2. However, I invite the authors to evaluate further factors that could have impacted or modulated their results.

(1) Did the authors check for a possible confound effect of familiarity of the stimuli? Did the participants have similar familiarity and/or expertise with tap dancing, drumming, and hurdling? The effects of visuomotor expertise (e.g., music, dance, sports) on multisensory processing of whole-body actions are well-documented in the literature.

(2) Did the authors validate their stimuli using a different group of participants before using the videos in their experiment? Were the stimuli comparable in perceived arousal, activation level, or emotional response (e.g., perceived action effort impacts movement feasibility and appreciation judgments)? Did the authors consider possible attention effects (e.g., tap dance movements could be more engaging/rewarding than hurdling movements)?

- My second point concerns the quantification and definition of motion differences between stimulus categories. The authors seem to focus on the acoustic consequences of the actions when creating their categories of stimuli, leaving out the actual movement features. For instance, they based the definition of "density" on the frequency of the produced sounds (e.g., 2.4 Hz vs. 3.4 Hz). There is no information concerning the sequences of movement that produced those sounds. Did the average amount of motion differ between stimulus categories? Did the authors take into consideration the different body parts involved in the movements? Did the stimuli differ in kinematic parameters such as movement speed and acceleration? There is a documented preference for complex movements characterized by a faster and more complex (vs. slower and uniform) temporal profile (Orlandi et al., 2020).

The authors indicated (lines 386-387) that "For the two new conditions (D-R-, D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated". What was the difference in rhythmicity from the kinematic perspective (e.g., variation in the muscular effort or movement acceleration)?

Carrying on with this argument, drumming and hurdling actions involve different body parts (e.g., whole body vs. upper body), with sounds produced by feet vs. hands, which could introduce a further possible confounding factor. (This comment is consistent with my previous point on observers' expertise and their arousal/emotional response to actions.)

I suggest the authors provide a more comprehensive rationale and objective quantification of movement density and rhythmicity. It would be good to have a statistical analysis of the kinematic and acoustic features of the different categories of stimuli (objective quantification). Additionally, the authors could validate their stimuli by asking participants to explicitly rate perceived density and rhythmicity (subjective evaluation).

As an example, a recent paper on Cognition introduced motion smoothness/fluency and entropy measures as indices of action timing complexity/rhythmicity and predictability (Orlandi, A., Cross, E. S., & Orgs, G. (2020). Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition, 205, 104446). Please consider including a reference to the aforementioned work that appears quite crucial in this context.

- My third point concerns the number of trials used in both studies and the corresponding analyses. First, the authors report 144 trials for each study, indicating 8 trials per category in Study 1 and 4 trials per category in Study 2. Secondly, the sample size considered is quite small, especially in Study 1. Hence, on the one hand, I suggest the authors provide a rationale for choosing the two sample sizes (e.g., power analysis). On the other hand, ANOVA may not be the most appropriate statistical method for data analysis. Did the authors check the ANOVA assumptions? Perhaps non-parametric methods would be more sensitive considering the small number of stimuli per condition and the sample size. Alternatively, mixed-effects models (e.g., logistic regression) based on single trials (instead of means) may be more effective.

- I suggest the authors include a limitations paragraph at the end of the discussion section where it is not possible to offer a complete explanation or adjustment for the points raised. Furthermore, in light of the above comments, I suggest the authors take into consideration the role of attentional processes, prior expectation, and anticipation (e.g., the predictive coding framework) when discussing their results.

Minor points:

- In Figures 3 and 5, please consider reporting statistical significance as p-values or an "*" (with related significance level, e.g., 0.05 reported in the captions).

- Please consider including a figure reporting the kinematic and auditory features of the stimuli categories.

Reviewer #2: This is an interesting study investigating auditory-visual integration in the context of action sounds. The guiding hypothesis is that actions that intentionally produced sound (i.e., where sound is the target of action) should produce greater auditory-visual integration such that sound-action pairs should be perceived as synchronous over a wider range of intervals. This is an interesting idea, but there are many features that differ between both the movements and the sounds that participants were being asked to judge. First of all, as the authors themselves report, sound density differed between the hurdling and tap-dancing stimuli in Study 1. The results of Study 2 which uses different stimuli, appear to confirm the importance of sound density, rather than intentionality.

In addition to sound density, the qualities of the intentionally and unintentionally produced sounds differ acoustically, and the movements involved differ in terms of the number of effectors engaged and the perceived trajectory of actions. All of these features are known to influence auditory-visual integration. (See Chuen and Schutz, Atten Percept Psychophys (2016) 78:1512–1528; Su, Y.-H. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition 131, 330–344 (2014); Su, YH. Visual tuning and metrical perception of realistic point-light dance movements. Sci Rep 6, 22774 (2016). https://doi.org/10.1038/srep22774; and Vroomen et al., cited in the manuscript.)

Finally, the intentional and unintentional sounds differ in another important sense: for intentional sounds the sound is the target of action, whereas for the unintentional sounds the movement sequence (or clearing the hurdle) is the target. Thus the focus of learning and attention in the one case is to form an auditory-motor temporal prediction, whereas in the other it is not. The same is true of the comparison case of speech, which is discussed at length in the rationale for the experiment.

Together, there seem to be many features that might contribute to perceived auditory-visual synchrony for these stimuli that are not directly aligned with intentionality. While it may not be possible for the authors to address all of these issues with the current data, they need to be considered more carefully in the presentation of the rationale and the interpretation of the results.

Study 1 Questions:

• The authors interpret their results as consistent with the hypothesis that perceived synchrony will be greater for visual-first stimuli for the tap-dancing compared to the hurdling, and that this is because tap-dancing intentionally produces sound. While it is true that the visual-first perceived synchrony is higher, it is also the case that the auditory-first synchrony is higher as well. So, this seems more like an overall effect of task, and doesn’t really seem to fit with the initial hypothesis.

• In the same vein, performance at 0-asynchrony was better for hurdling than tap-dancing, showing that the hurdling stimuli are more accurately judged than the tap-dancing stimuli. As with the above, the authors consider this as evidence for the “widened temporal window of integration”; I’m not sure how this can be distinguished from a task difficulty effect.

• The authors raise the issue of event density, and examine it in Study 2, but the hurdling stimuli include other possible cues to synchrony such as enhanced visual movement trajectories. Movement trajectory is known to influence perceived timing and auditory-visual integration (See the work of Su, cited above and Vroomens, cited in the manuscript). This issue is addressed in the drumming study, where trajectories across the conditions are more equivalent, and there are no differences at the 0-delay condition.

• The authors hypothesized that event density might affect performance. Was an analysis done to look at the results of Study 1 controlling for event density? If event density does not affect the pattern of results this would be better evidence supporting their hypothesis.

Study 2 Questions:

• The results for the high-density drumming stimuli, which are at a similar rate to the tap-dancing show a very similar pattern of performance. Putting the results of Study 1 and 2 together suggests that the main factor differentiating the audiovisual simultaneity judgements is event density, rather than intentionality. The authors themselves say that the results are consistent with previous findings that simultaneity judgements “collapse” at high event densities.

• The interaction between density and rhythmicity is not described in the Results, only in the Discussion. The authors focus on the main effect of rhythmicity, but this is really driven by the interaction with density.

• The authors try to interpret this finding as indicating that rhythmicity does not affect synchrony judgements for hurdling, but this may not really be true. Running steps are not strictly rhythmic in the way music rhythms are, so it may be hard to compare. I agree that it is not immediately obvious why the more metrically simple stimuli are perceived as more simultaneous, but there may be some sort of “attractor” effect of the beat point. It would be worthwhile to review the literature on beat.

The Discussion is relatively underdeveloped and consists largely of a rehash of the findings. Better integration of the findings with the literature is needed. Also, the authors have two previous brain imaging papers using the tap-dancing and hurdling stimuli. It seems like integration of the goals and findings of the current studies with that previous work would make this paper much more substantial.

Minor points:

This sentence is unclear (lines 89-93): “Along these lines, Eg & Behne (24) employed long running and eventful stimuli in their study and concluded that these more natural stimuli can and should be used more in audiovisual asynchrony studies. On the other hand, aberrant audiovisual integration in psychiatric diseases (26) and neurological impairments (27) may well apply beyond speech and music, and thus affect the perception and control of own action.” In the first sentence a more concrete description of the stimuli would be helpful. In the second, it is hard to understand what is meant.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130.r002

Author response to Decision Letter 0


10 Feb 2021

Dear Professor Proverbio,

We are very grateful to you and the reviewers for the constructive feedback regarding our manuscript. We have addressed the reviewers’ valuable comments and suggestions accordingly and wish to submit a revised version of the manuscript for further consideration. Changes to the manuscript have been highlighted and a point-by-point response to the reviewers’ comments can be found below.

Thank you for your time and consideration. We look forward to hearing from you.

Yours sincerely,

On behalf of the co-authors

Ricarda Schubotz

Reviewer #1:

The authors presented two attractive behavioural studies that aimed to investigate audiovisual integration in whole-body actions. They presented the participants with short videos depicting point-light displays from hurdling vs. tap-dancing actions (Study 1) and drumming actions (Study 2). The movements could be in synch or out of synch with the produced sounds, and in the latter case, the authors systematically varied the delay between visual and auditory stimulations and their order (video-first vs. audio-first). The participants were instructed to judge the synchronicity of the stimuli explicitly. The results indicated higher synchrony ratings for shorter (vs. longer) asynchrony intervals, for visual stimulation preceding (vs. following) auditory consequences, and for higher (vs. lower) event density. The manuscript is well-organized, the language is appropriate, and the investigation of audiovisual integration in the context of whole-body actions is certainly interesting. That said, I encourage the authors to take into consideration a series of points that could strengthen the work if integrated with their current proposal. My concerns are specifically focused on the methodological aspects of the studies.

Major points:

- I appreciated that the authors reported possible confounding factors from study 1 and used them to motivate and create Study 2. However, I invite the authors to evaluate further factors that could have impacted or modulated their results.

We appreciate the valuable comments and suggestions made by the Reviewer.

(1) Did the authors check for a possible confound effect of familiarity of the stimuli? Did the participants have similar familiarity and/or expertise with tap dancing, drumming, and hurdling? The effects of visuomotor expertise (e.g., music, dance, sports) on multisensory processing of whole-body actions are well-documented in the literature.

We agree with the Reviewer that familiarity and motor expertise would have been confounding factors. We therefore excluded participants who had received training in tap dancing, drumming or hurdling. Since tap dancing and hurdling are fairly uncommon types of sports in Germany, and drumming classes are also not very common, we had no problems recruiting naïve participants. We have incorporated this exclusion criterion in the Methods sections.

(2) Did the authors validate their stimuli using a different group of participants before using the videos in their experiment? Were the stimuli comparable in perceived arousal, activation level, or emotional response (e.g., perceived action effort impacts movement feasibility and appreciation judgments)? Did the author consider possible attention effects (e.g., tap dance movements could be more engaging/rewording than hurdling movements)?

We did not validate the stimulus material in the strict sense according to perceived arousal or activation level. We had some experience with the stimulus material from the recording sessions (in a different set of participants) and from findings in two previous fMRI studies including a test-retest quality of performance rating (Heins et al., 2020a; 2020b). Based on these experiences and findings we were (and are) confident that there were no differences between hurdling and tap dancing regarding either perceived action effort, attention, or feelings of reward, for the following reasons. In these previous studies, we presented participants with videos showing their own training performance in hurdling and tap dancing and asked them to rate the quality of their own performance on a Likert scale. We suggest that if there were effects of emotional responses or arousal that yielded differences between tap dancing and hurdling, this group of participants would have been particularly prone to show them, possibly even more than our naïve participants in the present studies. However, participants rated their own performance in tap dancing and in hurdling equally positively, suggesting no differences regarding estimates of effort, rewarding feelings or appreciation of performance. Moreover, we asked this former group whether they found hurdling or tap dancing more difficult during training, and these ratings also yielded no significant differences between hurdling and tap dancing. Based on these previous findings, we did not expect confounding effects of perceived ease of movements in the current study.

Moreover, regarding attentional effects, the fMRI effects in the mentioned precursor studies using the same type of stimulus material did not suggest increased attention when participants perceived and judged tap dancing, as compared to seeing and judging hurdling trials. Attention has been found to reverse the typical BOLD attenuation effects observed in primary sensory cortices for predicted vs. non-predicted stimuli, leading to enhanced rather than attenuated responses (Reznik et al., 2015; Schröger et al., 2015; Wollman and Morillon, 2018). Interestingly, primary auditory cortex was attenuated in tap dancing compared to hurdling, favoring a prediction-caused attenuation over the attention-caused enhancement explanation of our findings, as hypothesized in these previous fMRI studies.

We include these observations and considerations from our previous studies in the new limitations section.

- My second point concerns the quantification and definition of motion differences between stimulus categories. The authors seem to focus on the acoustic consequences of the actions when creating their categories of stimuli, leaving out the actual movement features. For instance, they based the definitions of "density" on the produced sounds' frequency (e.g., 2.4 Hz vs. 3.4 Hz). There is no information concerning the sequences of movement that produced those sounds. Did the average amount of motion differ between stimulus categories? Did the authors take into consideration the different body parts involved in the movements? Did the stimuli differ in kinematic parameters such as movement speed and acceleration? There is a documented preference for complex movements characterized by a faster and more complex (vs. slower and uniform) temporal profile (Orlandi et al., 2020).

We agree with the Reviewer’s thoughtful point. While we cannot differentiate between more fine-grained movement features such as movement speed and acceleration, we have now added an analysis of the amount of motion in hurdling, tap dancing, and drumming using a motion energy (ME) calculation implemented in MATLAB (please see changes made in the Methods section of Study 1 and the new Supplementary Figure S2). The mean ME scores were 1052 for drumming, 1220 for tap dancing, and 1189 for hurdling. A Kruskal-Wallis test by ranks showed no significant difference between motion energy in hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12). We report this finding in the revised manuscript as well.
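For transparency, the Kruskal-Wallis comparison can be sketched in a few lines of Python with SciPy; note that the per-video ME values below are illustrative placeholders, not our actual stimulus data, so the resulting statistics will not match those reported above:

```python
import numpy as np
from scipy import stats

# Hypothetical per-video motion energy (ME) scores for each action type
# (illustrative placeholder values, not the study data).
me_drumming = np.array([960.0, 1010.0, 1060.0, 1110.0, 1160.0, 1010.0])
me_tapdancing = np.array([1130.0, 1180.0, 1230.0, 1280.0, 1330.0, 1170.0])
me_hurdling = np.array([1100.0, 1150.0, 1200.0, 1250.0, 1300.0, 1130.0])

# Kruskal-Wallis test by ranks: does ME differ across the three action types?
h_stat, p_value = stats.kruskal(me_drumming, me_tapdancing, me_hurdling)
print(f"chi2(2) = {h_stat:.2f}, p = {p_value:.3f}")
```

The test is non-parametric (rank-based), so it is appropriate for the small number of videos per category.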

Regarding the body parts involved in the movements, this factor was balanced between the two stimulus categories in Study 1, where point light walkers wore the same number of markers, and hence, the entire body was shown in motion in these videos. For Study 2, of course, there were only markers on the upper part of the drummer’s body. In this respect, stimuli used in Study 1 and Study 2 cannot be directly compared. Please note that we did not intend to compare Study 1 and Study 2 directly, but rather to provide a report on two consecutive experiments, each with its own questions and hypotheses. We rephrased some parts of the original manuscript to make this point clearer than we did before.

We also thank the Reviewer for drawing our attention to the recent study of Orlandi and co-workers which we now cite in the revised manuscript. It would be very interesting, indeed, to examine the effects of further and more fine-grained motion properties on perceived synchronicity in sound-generating whole-body movements.

The authors indicated (lines 386-387) that "For the two new conditions (D-R-, D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated". What was the difference in rhythmicity from the kinematic perspective (e.g., variation in the muscular effort or movement acceleration)?

As mentioned above, we are not able to assess acceleration in our stimulus material, but we included the drumming videos in the analysis of motion energy, as described above.

Carrying on with this argument, drumming and hurdling actions involve different body parts (e.g., whole-body vs. upper body), with sounds produced by feet vs. hands, which could introduce a further possible confounding factor. (This comment is consistent with my previous point on observers' expertise and their arousal/emotional response to actions.)

We agree with the Reviewer’s point that directly comparing whole-body and upper-body movements would entail a confound. Note that we did not mean to directly compare drumming and hurdling. We compared hurdling with tap dancing (Study 1), and we compared the parameters of event density and rhythmicity within drumming (Study 2). Thus, conclusions were not drawn from a direct comparison of Studies 1 and 2. Study 2 was motivated by the fact that hurdling and tap dancing come at different event densities and rhythmicities, and Study 2 was just meant to examine these factors in more detail. We tried to make this point clearer at the end of the Interim Discussion motivating Study 2.

I suggest the authors provide a more comprehensive rationale and objective quantification of movement density and rhythmicity. It would be good to have a statistical analysis of the kinematic and acoustic features of the different categories of stimuli (objective quantification). Additionally, the authors could validate their stimuli by asking participants to explicitly rate perceived density and rhythmicity (subjective evaluation). As an example, a recent paper in Cognition introduced motion smoothness/fluency and entropy measures as indices of action timing complexity/rhythmicity and predictability (Orlandi, A., Cross, E. S., & Orgs, G. (2020). Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition, 205, 104446). Please consider including a reference to the aforementioned work, which appears quite crucial in this context.

While we cannot provide as sophisticated and deep an analysis of rhythmicity, complexity and predictability as presented in the study of Orlandi and co-workers, we have now included an objective and more detailed analysis of rhythmicity (on the auditory side) and of the amount of motion (on the visual side). Since the corresponding MATLAB tools also provide an algorithm to calculate event density, we also re-analyzed event density using the same MATLAB tool for reasons of consistency. Please note in this context that using different tools slightly changed the event density scores but did not change the overall data pattern. We added new paragraphs in the Methods sections (“Acoustic feature extraction: Event Density and Rhythmicity” and “Assessment of motion energy”) and new Figures (Fig. 2 and Supplementary Figures S1 and S2) to document these stimulus features.
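Conceptually, event density reduces to counting sound onsets per second in an amplitude envelope. The following minimal Python sketch illustrates the idea on a synthetic click train; the function name, threshold, and minimum inter-onset gap are our illustrative choices, not the defaults of the MATLAB toolbox we actually used:

```python
import numpy as np
from scipy.signal import find_peaks

def event_density(envelope, sr, min_height=0.5, min_gap_s=0.1):
    """Estimate events per second by counting peaks in an amplitude envelope.

    envelope : 1-D array of non-negative amplitude values
    sr       : sampling rate of the envelope in Hz
    The height threshold and minimum inter-onset gap are illustrative
    parameters, not toolbox defaults.
    """
    peaks, _ = find_peaks(envelope, height=min_height,
                          distance=max(1, int(min_gap_s * sr)))
    return len(peaks) / (len(envelope) / sr)

# Synthetic check: an envelope with clicks at 2.4 Hz over 10 s (sr = 100 Hz),
# matching the density reported for the hurdling stimuli.
sr = 100
env = np.zeros(10 * sr)
onsets = (np.arange(24) * sr / 2.4).astype(int) + 5  # 24 events in 10 s
env[onsets] = 1.0
print(event_density(env, sr))  # 2.4 events per second
```

Real action sounds would of course require an envelope extraction step (e.g., rectification and low-pass filtering of the audio) before onset counting.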

Since the number of videos based on which we generated the differently synchronized stimuli was very small (8 for Study 1 and 16 for Study 2), event density and rhythmicity entered as ordinal variables into a post-hoc statistical approach which we now present at the end of the Interim Discussion leading to Study 2. That is, we now objectively motivate the testing of these two factors in Study 2. At the same time, these (albeit only post-hoc) tests show that neither event density nor rhythmicity can fully account for the differences between hurdling and tap dancing, since we found a main effect of action type in both statistical approaches. In the end, the Reviewer’s thoughtful comments have made our paper clearer in this regard, hopefully also in the eyes of this Reviewer. We discuss the consequences of these additional insights and include potential limitations that our studies have in comparison to Orlandi’s work.

- My third point concerns the number of trials used in both studies and the corresponding analysis. First, the authors report 144 trials for each study, indicating 8 trials per category in Study 1 and 4 trials per category in Study 2. Secondly, the sample size considered is quite small, especially in Study 1. Hence, on the one hand, I suggest the authors provide a rationale for choosing the two sample sizes (e.g., power analysis). On the other hand, ANOVA may not be the most appropriate statistical method for data analysis. Did the authors check the ANOVA assumptions? Non-parametric methods may be more sensitive considering the small number of stimuli per condition and sample size. Alternatively, mixed-effects models (e.g., logistic regression) based on single trials (instead of means) may prove more effective.

Regarding the number of trials we presented, there is a misunderstanding. We apologize for the obviously unclear description and modified the respective sentences in the Methods sections.

We stated in the original manuscript for Study 1 (lines 199+): “Three blocks with the experimental task were presented thereafter. Within each of these blocks, all the 72 stimuli (four hurdling and four tap dancing videos, each with nine different audiovisual asynchronies) were presented twice, resulting in a total of 144 trials.” Thus, a total of 432 trials were presented in Study 1.

For Study 2, we said “The experiment consisted of four experimental blocks. Within each of these blocks, each of the 144 stimuli (four D-R+, four D+R-, four D-R-, and four D+R+ videos, each with nine different levels of audiovisual asynchrony) were presented once.” (lines 396++). Thus, a total of 576 trials were presented in Study 2.

We have rephrased corresponding paragraphs in the revised manuscript.

Regarding the ANOVA assumptions, we re-checked them for both Study 1 and Study 2, and for factors where Mauchly's test indicated that the assumption of sphericity was violated, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. We wish to thank the Reviewer for this comment. Note that the Greenhouse-Geisser correction did not change any of the reported significant effects.
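For readers unfamiliar with the correction: the Greenhouse-Geisser epsilon can be computed directly from the covariance matrix of the repeated-measures conditions. The sketch below is a generic implementation of the standard formula, not the exact routine of the statistics software we used; corrected degrees of freedom are obtained by multiplying the uncorrected dfs by epsilon:

```python
import numpy as np

def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix of
    k repeated-measures conditions. epsilon = 1 means sphericity holds;
    the lower bound is 1/(k-1). Corrected dfs: eps*(k-1), eps*(k-1)*(n-1)."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    centering = np.eye(k) - np.ones((k, k)) / k
    s = centering @ cov @ centering          # double-centered covariance
    return np.trace(s) ** 2 / ((k - 1) * np.sum(s ** 2))

# Under sphericity (e.g., equal variances, zero covariances) epsilon is 1:
print(greenhouse_geisser_epsilon(np.eye(4)))  # 1.0
```

When Mauchly's test rejects sphericity, multiplying the degrees of freedom by this epsilon before looking up the F distribution yields the corrected p-value.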

- I suggest the authors include a limitations paragraph at the end of the discussion section, whether it is not possible to offer a complete explanation or adjustment for the points raised. Furthermore, in light of the above comments, I suggest the authors take into consideration the role of attentional processes, prior expectation, and anticipation (e.g., predictive coding framework) when discussing their results.

We included a new limitations paragraph at the end of the General Discussion to clarify aspects that we cannot exclude based on our studies or findings. Moreover, we extended our discussions considering the role of attention and also refer to predictive coding.

Minor points:

- In Figures 3 and 5, please consider reporting statistical significance as p-values or an "*" (with related significance level, e.g., 0.05 reported in the captions).

We modified Fig. 3 and Fig. 5 accordingly; please note that due to an additional Figure (Fig. 2), Fig. 3 and 5 are now labeled Fig. 4 and 6, respectively.

- Please consider including a figure reporting the kinematic and auditory features of the stimuli categories.

We have included two new figures reporting auditory features of the stimuli (Fig. 2 and Supplementary Figure S1) and a new figure reporting motion energy scores for both Studies (Supplementary Figure S2).

Reviewer #2:

This is an interesting study investigating auditory-visual integration in the context of action sounds. The guiding hypothesis is that actions that intentionally produced sound (i.e., where sound is the target of action) should produce greater auditory-visual integration such that sound-action pairs should be perceived as synchronous over a wider range of intervals. This is an interesting idea, but there are many features that differ between both the movements and the sounds that participants were being asked to judge.

We greatly appreciate the generally positive evaluation by this Reviewer and the constructive and helpful comments.

First of all, as the authors themselves report, sound density differed between the hurdling and tap-dancing stimuli in Study 1. The results of Study 2, which uses different stimuli, appear to confirm the importance of sound density, rather than intentionality.

Actually, after having calculated further post-hoc statistics as motivated by the Reviewers, we now find even more objectively and clearly that the effect of intentionality of sound production stands on its own, in addition to the effects of event density and rhythmicity. Please see our more detailed replies to this point below (question number 4 concerning Study 1).

In addition to sound density, the qualities of the intentionally and unintentionally produced sounds differ acoustically, and the movements involved differ in terms of the number of effectors engaged and the perceived trajectory of actions. All of these features are known to influence auditory-visual integration. (See Chuen and Schutz, Atten Percept Psychophys (2016) 78:1512–1528; Su, Y.-H. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition 131, 330–344 (2014); Su, Y.-H. Visual tuning and metrical perception of realistic point-light dance movements. Sci Rep 6, 22774 (2016). https://doi.org/10.1038/srep22774; and Vroomen et al., cited in the manuscript.)

We agree with the Reviewer that sound quality as well as movement patterns are relevant factors that have to be considered when comparing intentionally and incidentally sound-generating actions.

First, regarding sound quality differences, we explained in the original manuscript (please see “Stimuli” in Methods, Study 1) that we took measures to render the sound qualities of tap dancing and hurdling as comparable as possible, using the following approach: “In a first step, stimulus intensities of hurdling and tap dancing recordings were normalized separately. In order to equalize the spectral distributions of both types of recordings, the frequency profiles of hurdling and tap dancing sounds were then captured using the Reaper plugin Ozone 5 (iZotope Inc, Cambridge, United States). Finally, the difference curve (hurdling – tap dancing) was used by the plugin’s match function to adjust the tap dancing spectrum to the hurdling reference.” As a result, the sound quality of hurdling and tap dancing was highly similar. Please also listen to the sound in the Supplementary Video Material that we provided for the stimuli.

Regarding the second point, the number of effectors engaged in tap dancing exactly matched those engaged in hurdling. Of course, we did not directly compare the drumming condition (Study 2) with tap dancing or hurdling (Study 1), as here, the number of effectors differed, and markers were restricted to the upper body. We include a statement in the revised manuscript to make this point more prominent.

Finally, we agree that the perceived trajectories of the actions were subjectively different. We now included an additional analysis of the videos quantifying the amount of motion by the index of motion energy (please see the Methods section of Study 1 and the new Supplementary Figure S2). The mean motion energy score was 1220 for tap dancing, 1189 for hurdling, and 1052 for drumming. In order to check whether the amount of motion had to be considered a confounding factor in our studies, we calculated a Kruskal-Wallis test by ranks. This test did not show significant differences between motion energy in hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12).

Finally, the intentional and unintentional sounds differ in another important sense: for intentional sounds the sound is the target of action, whereas for the unintentional sounds the movement sequence (or clearing the hurdle) is the target. Thus, the focus of learning and attention in the one case is to form an auditory-motor temporal prediction, whereas the other does not. The same is true of the comparison case of speech, which is discussed at length in the rationale for the experiment.

Our interest in potential differences between processing incidental and intentional types of sound-generating actions stems from exactly these considerations. According to both psychological theories on action (e.g. common coding) and predictive coding accounts, sound is in either case an effect of the action. But is it in the same sense part of the action goal when we compare incidentally vs. intentionally sound-generating actions? In three previous fMRI studies, two of which we have published so far (Heins et al., 2020a and 2020b), we examined the brain activity for incidental and intentional sound production using “normal” stimuli as well as sound-delayed and sound-deprived video recordings. We found that in tap dancing, primary auditory cortex activity is reduced as compared to hurdling, favoring a stronger predictive “cancelling” of the expected sound. Importantly, attention is known to reverse this effect, leading to enhanced activity in primary sensory cortices. Since we found the opposite, we could (in each of these previous studies) clearly rule out that differences between perceptual processing of tap dancing and hurdling are driven by auditory attention. We now more explicitly consider these previous insights in the revised manuscript and in the limitations section.

While in these previous fMRI studies, we asked participants to judge the quality of the performance in hurdling and tap dancing, participants in the present studies were required to explicitly judge whether the soundtrack and the visual video were synchronous. Thus, task instruction should have ensured that attention was necessarily on both the auditory and the visual channel in either type of action.

Together, there seem to be many features that might contribute to perceived auditory-visual synchrony for these stimuli that are not directly aligned with intentionality. While it may not be possible for the authors to address all of these issues with the current data, they need to be considered more carefully in the presentation of the rationale and the interpretation of the results.

We thank the Reviewer for this suggestion and hope that he/she will find that we have addressed these points in a convincing manner, as detailed in the following.

Study 1 Questions:

• The authors interpret their results as consistent with the hypothesis that perceived synchrony will be greater for visual-first stimuli for tap-dancing compared to hurdling, and that this is because tap-dancing intentionally produces sound. While it is true that the visual-first perceived synchrony is higher, it is also the case that the auditory-first synchrony is higher as well. So this seems more like an overall effect of task, and doesn’t really seem to fit with the initial hypothesis.

Reconsidering the way we introduced Hypothesis 3 in the Introduction of Study 1, we think that there was a misunderstanding, caused by unclarity of our own phrasing.

Based on the literature, we expected a visual-first bias (i.e. a general bias to judge visual-first stimuli more often as synchronous compared to audio-first stimuli) for both action types (Hypothesis 2). Moreover, we expected that this bias vanishes for the longer delays in hurdling but still persists for tap dancing, if it is true that tap dancing is comparable to speech production in having a larger temporal binding window than incidentally produced action sounds (Hypothesis 3). Our data confirmed this assumption, as we found (i) an interaction of ASYNCHRONY SIZE, ASYNCHRONY TYPE, and ACTION TYPE, and more specifically, (ii) a still significant bias towards synchronous judgments for visual-first as compared to audio-first stimuli at the longest asynchrony delay of 400 ms for tap dancing but not for hurdling. Note that Hypothesis 3 did not necessarily imply that the visual-first effect (Hypothesis 2) was generally larger for tap dancing as compared to hurdling.

It is true that we found a main effect of action type, and this is also what we reported in the Results section of Study 1. Importantly, this finding motivated Study 2, as also reported in the original manuscript. Please note that this main effect does not negate the hypothesized and observed “extended” visual-first bias for the largest asynchronies in tap dancing. Note also that the post-hoc analysis of action type and event density, which we provided according to the Reviewers’ suggestion (see below), corroborated a significant main effect of action type.

We have now rephrased parts of the Introduction and checked the entire manuscript to make this point clearer. We also noted that we unnecessarily repeated the explanation of the three hypotheses of Study 1 in a part of the Methods section (“Design and statistical hypotheses”) that was also partly redundant with the Results section. We therefore modified this paragraph to avoid confusion. The same applied to the corresponding paragraph in Study 2. We hope that in doing so, we have improved the overall clarity of the hypotheses.

• In the same vein, performance at 0-asynchrony was better for hurdling than tap-dancing, showing that the hurdling stimuli are more accurately judged than the tap-dancing stimuli. As with the above, the authors consider this as evidence for the “widened temporal window of integration,” but I’m not sure how this can be distinguished from a task difficulty effect.

This is an absolutely valid suggestion and indeed, finding a main effect for task, reflecting that participants were less accurate in judging synchronicity in tap dancing versus hurdling, motivated our Study 2, where we tested potential confounds by different levels of event density and rhythmicity. We make this point more clearly in the Interim Discussion of Study 1. Please note, however, that based on a now more objective quantification of rhythmicity and additional post hoc tests on both factors, the data show even more clearly that, while event density and rhythmicity explain a part of the observed synchronicity bias in tap dancing, they still do not fully explain the effect of intentionality. For instance, event density had different effects on the synchronicity judgment in tap dancing and in hurdling, as shown by an interaction of these two factors in the post hoc analysis.

Based on the original and the additional statistical effects, we suggest, as we now also summarize in the abstract and consider more clearly in the revised Discussion, that overconfidence in the naturally expected, that is, synchrony of sound and sight, was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchronicity judgments simply because it makes the detection of audiovisual asynchrony more difficult. We finally also re-arranged some paragraphs of the Discussion to make it more readable after these changes and additions.

Just to be sure, please note that our interpretation of a widened temporal window of integration for tap dancing was not based on the observation that performance at 0-asynchrony was better for hurdling than tap-dancing, but on the observation that the visual-first bias was still significant at a 400 ms time lag for tap dancing but not for hurdling. Moreover, for the 120 ms visual-first condition, the performance for hurdling was as bad as for tap dancing (cf. Fig. 4, formerly labeled Fig. 3), so we did not (in the original manuscript) and will not (in the revised manuscript) base any argument on the 0-asynchrony level or the smallest levels of audio-visual lag.

• The authors raise the issue of event density, and examine it in Study 2, but the hurdling stimuli include other possible cues to synchrony such as enhanced visual movement trajectories. Movement trajectory is known to influence perceived timing and auditory-visual integration (see the work of Su, cited above, and Vroomen, cited in the manuscript). This issue is addressed in the drumming study, where trajectories across the conditions are more equivalent, and there are no differences at the 0-delay condition.

As described above, we now include an additional analysis of the videos quantifying the amount of motion by a "motion energy" (ME) index. The mean ME values were 1052 for drumming, 1220 for tap dancing, and 1189 for hurdling. A Kruskal-Wallis test by ranks showed no significant difference in motion energy between hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12). However, since motion energy certainly does not capture all dynamic features describing movement, we mention this restriction in the new limitations section.
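The Kruskal-Wallis test mentioned above can be sketched in a few lines. Note that the per-video values below are hypothetical stand-ins (only the three group means, 1052, 1220 and 1189, are reported in the text), so the resulting statistic will not match the reported χ²(2) = 4.2:

```python
import numpy as np
from scipy import stats

# Hypothetical per-video motion-energy values for the three action types;
# the real per-video data are not given in the response letter.
drumming    = np.array([1010.0, 1045.0, 1080.0, 1073.0])
tap_dancing = np.array([1190.0, 1235.0, 1215.0, 1240.0])
hurdling    = np.array([1150.0, 1200.0, 1180.0, 1226.0])

# Kruskal-Wallis H-test by ranks: a non-parametric alternative to a
# one-way ANOVA, appropriate when normality cannot be assumed.
h_stat, p_value = stats.kruskal(drumming, tap_dancing, hurdling)
print(f"H(2) = {h_stat:.2f}, p = {p_value:.3f}")
```

With the real per-video values substituted in, `stats.kruskal` would reproduce the reported test.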

• The authors hypothesized that event density might affect performance. Was an analysis done to look at the results of Study 1 controlling for event density? If event density does not affect the pattern of results, this would be better evidence supporting their hypothesis.

We agree with the Reviewer that a statistical approach to event density would be more informative. To investigate post hoc the effect of event density on synchronicity judgments in Study 1, we included event density as an ordinal variable replacing action type in our original ANOVA (please see the revised Interim Discussion of Study 1, leading to Study 2). This analysis showed a main effect of event density (F(2.9,61.8) = 71.64, p < .001, Greenhouse-Geisser corrected; χ²(14) = 40.60, p < .001, W = .59), which could mirror our reported main effect of action type. Bonferroni-corrected post hoc pairwise comparisons of the event density levels showed, however, that differences in performance levels did not mirror the separation point between actions. Instead, no difference in performance was found between the four hurdling videos and one tap dancing video (all p ≥ .52), which were all lower in performance than the three tap dancing videos with the highest event densities (all p < .001), while the video with the highest event density again differed significantly from all others (all p < .001). To see whether event density fully explains the original effect of action type, we calculated another ANOVA including action type and event density (as an ordinal variable within an action) as well as asynchrony type and asynchrony size. Here, we found significant main effects of both event density (F(2,42) = 71.09, p < .001) and action type (F(1,21) = 74.26, p < .001), as well as their interaction (F(2,42) = 68.69, p < .001). These findings suggest that event density did not have the same effect on intentionally and incidentally generated action sounds. All in all, these data patterns motivated a direct experimental manipulation of event density, which was then implemented in Study 2.
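The rank-based statistic reported alongside the ANOVA corresponds to a Friedman test across repeated conditions. A minimal sketch of that test follows; the ratings are synthetic stand-ins, and the condition count is reduced for brevity (the reported test had df = 14, i.e. 15 conditions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic synchrony ratings: 22 participants x 8 videos ordered by
# event density, with a built-in monotonic trend across conditions.
n_subjects, n_conditions = 22, 8
ratings = rng.random((n_subjects, n_conditions)) + np.linspace(0, 1, n_conditions)

# Friedman rank test: non-parametric repeated-measures comparison,
# one column of ratings per condition.
chi2, p = stats.friedmanchisquare(*(ratings[:, j] for j in range(n_conditions)))
print(f"chi2({n_conditions - 1}) = {chi2:.2f}, p = {p:.4f}")
```

Ordered post hoc pairwise comparisons (here, Bonferroni-corrected Wilcoxon tests on pairs of columns) would then locate where along the event-density ordering performance actually changes.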

Study 2 Questions:

• The results for the high-density drumming stimuli, which are at a similar rate to the tap dancing, show a very similar pattern of performance. Putting the results of Study 1 and 2 together suggests that the main factor differentiating the audiovisual simultaneity judgements is event density, rather than intentionality. The authors themselves say that the results are consistent with previous findings that simultaneity judgements "collapse" at high event densities.

It is true that event density had a profound effect on audiovisual simultaneity judgments, as we also discussed in our original manuscript. Inspired by the comments of both Reviewers, we additionally conducted a post hoc statistical analysis of event density as a categorical variable varying within both action types. We found significant main effects of both event density and action type, and a significant interaction of these factors. Hence, event density did not have the same effect on intentionally and incidentally generated action sounds, and the type of action, differing in intentionality, had an effect on its own. Obviously, this post hoc analysis cannot provide the same level of evidence regarding the effect of event density as Study 2, since the variances of event density in tap dancing and in hurdling were not really comparable. However, finding main effects for intentionality and event density, as well as their interaction, makes it all the more valuable to consider both Study 1 and Study 2 in a common paper and to discuss both factors as modulating the perception of audiovisual (a)synchrony.

• The interaction between density and rhythmicity is not described in the Results, only in the Discussion. The authors focus on the main effect of rhythmicity, but this is really driven by the interaction with density.

We have now added the statistical effect of the interaction to the Results section of Study 2. Due to an editing error before submission, we had deleted this paragraph of the Results section, which also included all other significant interaction effects. This missing paragraph has now been reinserted. We cordially thank the Reviewer for pointing out this shortcoming.

Albeit small, the main effect of rhythmicity was significant. More interestingly, the effect of rhythmicity was clear when looking only at low event densities. We think that high event densities obscured the specific effects that rhythmicity had on audiovisual (a)synchrony perception. We have modified the General Discussion to make this point more clearly.

• The authors try to interpret this finding as indicating that rhythmicity does not affect synchrony judgements for hurdling, but this may not really be true. Running steps are not strictly rhythmic in the way music rhythms are, so it may be hard to compare. I agree that it is not immediately obvious why the more metrically simple stimuli are perceived as more simultaneous, but there may be some sort of “attractor” effect of the beat point. It would be worthwhile to review the literature on beat.

We are really sorry, but we have to admit that we could not figure out which statement in our manuscript the Reviewer refers to when saying that we interpret this finding as indicating that rhythmicity does not affect synchrony judgments for hurdling. Regarding the fairly strong effect of rhythmicity for (drumming) stimuli with low event density, with synchronous judgments rising from 34% (low rhythmicity) to 47% (high rhythmicity), we now extend the Discussion to consider the effect of enhanced predictability through increased rhythmicity.

As a side note, it is important to note that hurdling is highly rhythmical (one could describe its classical three-step structure as indeed beat-based, consisting of a dotted crotchet followed by a quaver and a quaver triplet, resulting in a two-two meter), and learning and training this rhythm is crucial for achieving good performance. So we would say that rhythm is no less important for hurdling than for tap dancing performance, but the rhythmic structure and periodicity were subjectively, and, as shown by the new analysis of rhythmicity provided in the revised manuscript, also objectively, much more evident in the hurdling videos.

The Discussion is relatively underdeveloped and consists largely of a rehash of the findings. Better integration of the findings with the literature is needed. Also, the authors have two previous brain imaging papers using the tap dancing and hurdling stimuli. It seems that integrating the goals and findings of the current studies with that previous work would make this paper much more substantial.

Following the Reviewer’s suggestion, we have tried to improve the integration of our findings with the literature, including references to our own previous fMRI studies on tap dancing and hurdling. Note that we also slightly re-arranged some paragraphs of the General Discussion to improve the readability and clarity of the interpretations.

Minor points:

This sentence is unclear (lines 89-93): “Along these lines, Eg & Behne (24) employed long running and eventful stimuli in their study and concluded that these more natural stimuli can and should be used more in audiovisual asynchrony studies. On the other hand, aberrant audiovisual integration in psychiatric diseases (26) and neurological impairments (27) may well apply beyond speech and music, and thus affect the perception and control of own action.” In the first sentence a more concrete description of the stimuli would be helpful. In the second, it is hard to understand what is meant.

We have modified this entire paragraph which was indeed somewhat ill-structured.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Alice Mado Proverbio

21 Apr 2021

PONE-D-20-30978R1

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PLOS ONE

Dear Dr. Schubotz,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

While Reviewer 1 is satisfied with the way you responded to previous queries, you will see that Reviewer 2 has noticed serious methodological problems inherent to the experimental paradigm, which I ask you to please address seriously in the revised version of the paper.

Please submit your revised manuscript by Jun 05 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: No

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I much appreciated the authors' effort in considering all points raised during the revision process. The new sections and clarifications have increased the reliability and value of the entire manuscript that is now suitable for publication. I wish the authors the best of luck with their future studies.

Reviewer #3: The study addresses an interesting research question. Does perceived audio-visual synchrony depend on the intentionality of producing sounds? Two experiments are reported. In the first experiment tap dancing and hurdling point light videos are presented to participants at different audio-visual asynchronies. The authors observe a smaller window of perceived synchrony for the hurdling than the tap dancing actions. In order to assess the influence of event density and rhythm in their study, the authors then conduct a second experiment manipulating event density and rhythmicity using drumming stimuli and show that both factors significantly influence simultaneity judgements. Overall, the two experiments appear rather unrelated to each other and I am not convinced that the pattern of results is mainly driven by differences in task difficulty (simultaneity judgements are easier to make for hurdling than for drumming or tap dancing). I have two major concerns related to the choice of experimental design and stimuli on the one hand and the lack of quantification of visual features of the actions on the other hand.

1) It does not become clear to me why the authors chose to compare tap dancing and hurdling in the first place, especially if the idea was to compare intentionally vs. accidentally produced action sounds. In order to show that intentionality matters, it would have been necessary to look at identical actions, and combining these with intentional or accidental action sounds. In the present experimental design intentionality is always confounded with the type of action being performed. Any observed effects can therefore be due to the fact that hurdling is different from tap dancing in many ways other than the intentionality of the action sounds. The second experiment addresses two auditory confounding factors (sound rhythm and event density), by introducing a third new action which is drumming. So the study leaves open many other ways in which these three actions differ both conceptually (artistic vs. competitive) and visually (with respect to their movement kinematics for example)

2) The rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli, although such data appears to be available given that actions were recorded using motion capture. The authors only report overall motion energy for the videos computed from the video. What about event density and rhythm in the visual domain? Could it be that synchrony during tap dancing and drumming are harder to detect because the movement amplitude of drumming and tap dancing is much smaller than that of hurdling? What about the saliency of cyclical motion or the influence of visual perspective? The difficulty of audiovisual integration does not only depend on saliency of auditory features but also of the visual stimuli, yet an assessment of how tap dancing, drumming and hurdling are different from each other with respect to their visual aspects and their movement kinematics is missing entirely.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130.r004

Author response to Decision Letter 1


5 May 2021

Dear Professor Proverbio

We are very grateful to you and the reviewers for the constructive feedback regarding our manuscript. We are glad to hear that Reviewer 1 was satisfied with the way we responded to previous queries, and since we did not hear of any further points raised by Reviewer 2, we assume that they were also satisfied with the revision. We have addressed the new points from Reviewer 3 in this revision.

Thank you for your time and consideration. We look forward to hearing from you.

Yours sincerely,

On behalf of the co-authors

Ricarda Schubotz

Reviewer #3

The study addresses an interesting research question. Does perceived audio-visual synchrony depend on the intentionality of producing sounds? Two experiments are reported. In the first experiment tap dancing and hurdling point light videos are presented to participants at different audio-visual asynchronies. The authors observe a smaller window of perceived synchrony for the hurdling than the tap dancing actions. In order to assess the influence of event density and rhythm in their study, the authors then conduct a second experiment manipulating event density and rhythmicity using drumming stimuli and show that both factors significantly influence simultaneity judgements. Overall, the two experiments appear rather unrelated to each other and I am not convinced that the pattern of results is mainly driven by differences in task difficulty (simultaneity judgements are easier to make for hurdling than for drumming or tap dancing).

I have two major concerns related to the choice of experimental design and stimuli on the one hand and the lack of quantification of visual features of the actions on the other hand.

1) It does not become clear to me why the authors chose to compare tap dancing and hurdling in the first place, especially if the idea was to compare intentionally vs. accidentally produced action sounds. In order to show that intentionality matters, it would have been necessary to look at identical actions, and combining these with intentional or accidental action sounds. In the present experimental design intentionality is always confounded with the type of action being performed.

Our reply:

We thank the Reviewer for this important pointer to the apparently still inadequate account of our motivation behind the choice of the movement types studied. The current experiments built directly on a series of fMRI studies in which we investigated hurdling and tap dancing with respect to the perceptual processing of the sounds they produce. We have added new passages to the Introduction and Discussion of the revised manuscript, which we hope motivate the experimental approach more convincingly:

Introduction:

We decided to use two different types of sporting action that allowed us to study the processing of natural movement sounds in an ecologically valid context. This also had the particular advantage that the subjects' attention was not biased in any particular direction, since we created a completely natural perceptual situation.

Limitations:

Although it would have been possible to investigate audiovisual integration in the perception of intentionally and incidentally produced sounds using artificially generated stimuli, for the current series of experiments, the focus was on investigating natural movement sounds in an ecologically valid context. This also had the particular advantage of not biasing subjects' attention in any direction, since we were generating a quite natural perceptual situation. Another approach would have been to combine identical natural actions with intentional and incidental sounds. Here, however, we expected a confound in the sense that subjects would have expected the intentionally produced sound and not the incidental one. Thus, surprise or even irritation effects would have occurred in the incidental condition and would probably have strongly biased the comparison.

Reviewer #3

Any observed effects can therefore be due to the fact that hurdling is different from tap dancing in many ways other than the intentionality of the action sounds. The second experiment addresses two auditory confounding factors (sound rhythm and event density), by introducing a third new action which is drumming. So the study leaves open many other ways in which these three actions differ both conceptually (artistic vs. competitive) and visually (with respect to their movement kinematics for example)

Our reply:

This reference is also helpful for us because it gives us the opportunity to better motivate the choice of movements studied and to address possible limitations of our studies.

With regard to visual differences between the compared sports, we would like to point out here that our focus was on movement sounds. In our first revision, following the Reviewer's suggestion, we added an analysis of motion energy and found no significant differences. In addition, we would like to point out that we used only point-light videos in both experiments and thus controlled for much of the irrelevant visual information. Please also consider our reply to point 2) raised by the Reviewer.

With regard to the conceptual differences raised by the Reviewer, we agree that drumming as well as tap dancing are expressive and music-related actions, whereas hurdling is not. Please note that we did not mean to directly compare drumming (Exp. 2) with hurdling or tap dancing (Exp. 1), as pointed out in the previous version of the manuscript: “Employing drumming PLDs enabled a direct control of event density and rhythmicity in an otherwise natural human motion stimulus. Note that using drumming actions, we kept intentionality of sound production constant while varying event density and rhythmicity as independent experimental factors. Since PLD markers were restricted to the upper body of the drummer, and since sounds were produced by handheld drumsticks in Study 2, in contrast to sounds produced by the feet in Study 1, we refrained from directly comparing conditions from Study 1 with Study 2.”

However, to more strongly acknowledge this point we now add another passage in the Discussion as follows:

While hurdling and tap dancing are comparable sports in many respects, they differ in terms of expressive or aesthetic appeal. Our studies reported here cannot rule out the influence of this factor, which should be the subject of future investigation.

Reviewer #3

2) The rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli, although such data appears to be available given that actions were recorded using motion capture.

The authors only report overall motion energy for the videos computed from the video. What about event density and rhythm in the visual domain? Could it be that synchrony during tap dancing and drumming are harder to detect because the movement amplitude of drumming and tap dancing is much smaller than that of hurdling? What about the saliency of cyclical motion or the influence of visual perspective? The difficulty of audiovisual integration does not only depend on saliency of auditory features but also of the visual stimuli, yet an assessment of how tap dancing, drumming and hurdling are different from each other with respect to their visual aspects and their movement kinematics is missing entirely.

Our Reply:

It is correct that the rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli. While research on motion perception, not only in sports, has for many years focused heavily on and been limited to the visual domain, the focus of our research is clearly on how natural motion sounds are processed. We do not mean to deny the relevance of the visual domain, but to point out that the experiments we present relate specifically to motion sounds. In the first revision of our manuscript, in response to reviewer comments, we also included and added a descriptive and statistical analysis of visual stimulation. We point out this specific focus again in the revised discussion.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Alice Mado Proverbio

31 May 2021

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PONE-D-20-30978R2

Dear Dr. Schubotz,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The reviewers and I found that all previous comments were successfully addressed by your last revision.

We particularly appreciated your further clarifications relative to the stimulus choice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: I appreciate that the authors further clarified their choice of stimuli and further address these issues in the newly revised manuscript.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Acceptance letter

Alice Mado Proverbio

14 Jul 2021

PONE-D-20-30978R2

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

Dear Dr. Schubotz:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Alice Mado Proverbio

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Example of drumming sequence with high and low rhythmicity (Study 2).

    Rhythmicity was operationalized as variation of the amplitude envelope, shown here for two exemplary drumming sequences. While the event density of both recordings is virtually identical (3.42 and 3.39, respectively), the auditory events in the left recording are highly similar in loudness, resulting in low rhythmicity overall (v = 0.25). In contrast, the auditory events within the right recording vary more strongly in loudness, with almost equidistant duplets of loud (i.e. accentuated) events interspersed with less accentuated events. This resulted in high rhythmicity overall (v = 0.78).

    (TIFF)
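    The two caption measures (event density and the rhythmicity index v) can be illustrated with a short sketch. Note that the exact formulas are not spelled out in the caption; the sketch below assumes that event density is the number of auditory events per second and that v is the coefficient of variation of the event peak amplitudes, so a recording with uniformly loud events yields a low v and one with alternating loud and soft events yields a higher v. The function name and inputs are illustrative, not from the article.

    ```python
    import numpy as np

    def audio_event_stats(peak_times, peak_amps):
        """Sketch: event density and a rhythmicity index for one recording.

        peak_times: onset times of detected auditory events (seconds)
        peak_amps:  peak amplitudes (loudness) of those events

        Assumed definitions (not given verbatim in the article):
        density = events per second across the recording span,
        v = coefficient of variation (std / mean) of peak amplitudes.
        """
        peak_times = np.asarray(peak_times, dtype=float)
        peak_amps = np.asarray(peak_amps, dtype=float)
        duration = peak_times[-1] - peak_times[0]
        density = (len(peak_times) - 1) / duration   # events per second
        v = peak_amps.std() / peak_amps.mean()       # loudness variation
        return density, v

    # Same density, different rhythmicity:
    # uniform loudness -> low v; loud/soft duplets -> higher v
    _, v_low = audio_event_stats(np.arange(0, 3, 0.3), np.full(10, 1.0))
    _, v_high = audio_event_stats(np.arange(0, 3, 0.3), np.tile([1.0, 0.4], 5))
    ```

    With identical onset times, both sequences have the same event density, while v separates the flat from the accentuated loudness profile, mirroring the left and right panels of S1 Fig.
    
    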

    S2 Fig. Motion energy, Study 1 and 2.

    The amount of motion, quantified as the number of moving pixels per video, for all PLD videos employed to generate the different audiovisual asynchronous stimuli in Study 1 and Study 2. Each black marker depicts the motion energy of one video (see Methods of Study 1 for details).

    (TIFF)

    S1 Video. Sample video Study 1.

    Hurdling, auditory first, 120 ms asynchrony.

    (MP4)

    S2 Video. Sample video Study 1.

    Hurdling, auditory first, 400 ms asynchrony.

    (MP4)

    S3 Video. Sample video Study 1.

    Hurdling, visual first, 120 ms asynchrony.

    (MP4)

    S4 Video. Sample video Study 1.

    Hurdling, visual first, 400 ms asynchrony.

    (MP4)

    S5 Video. Sample video Study 1.

    Tap dancing, auditory first, 120 ms asynchrony.

    (MP4)

    S6 Video. Sample video Study 1.

    Tap dancing, auditory first, 400 ms asynchrony.

    (MP4)

    S7 Video. Sample video Study 1.

    Tap dancing, visual first, 120 ms asynchrony.

    (MP4)

    S8 Video. Sample video Study 1.

    Tap dancing, visual first, 400 ms asynchrony.

    (MP4)

    S9 Video. Sample video Study 2.

    Drumming, high event density, high rhythmicity.

    (MP4)

    S10 Video. Sample video Study 2.

    Drumming, high event density, low rhythmicity.

    (MP4)

    S11 Video. Sample video Study 2.

    Drumming, low event density, high rhythmicity.

    (MP4)

    S12 Video. Sample video Study 2.

    Drumming, low event density, low rhythmicity.

    (MP4)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All files are available from the OSF database: Schubotz, Ricarda. 2020. “AVIA - Audiovisual Integration in Hurdling, Tap Dancing and Drumming.” OSF. October 2. osf.io/ksma6. The DOI is 10.17605/OSF.IO/KSMA6.


    Articles from PLoS ONE are provided here courtesy of PLOS
