PLOS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

Nina Heins 1,2,#, Jennifer Pomp 1,2,#, Daniel S Kluger 2,3, Stefan Vinbrüx 4, Ima Trempler 1,2, Axel Kohler 2, Katja Kornysheva 5, Karen Zentgraf 6, Markus Raab 7,8, Ricarda I Schubotz 1,2,*
Editor: Alice Mado Proverbio
PMCID: PMC8298114  PMID: 34293800

Abstract

Auditory and visual percepts are integrated even when they are not perfectly temporally aligned with each other, especially when the visual signal precedes the auditory signal. This window of temporal integration for asynchronous audiovisual stimuli is relatively well examined in the case of speech, while other natural action-induced sounds have been widely neglected. Here, we studied the detection of audiovisual asynchrony in three different whole-body actions with natural action-induced sounds–hurdling, tap dancing and drumming. In Study 1, we examined whether audiovisual asynchrony detection, assessed by a simultaneity judgment task, differs as a function of sound production intentionality. Based on previous findings, we expected that auditory and visual signals should be integrated over a wider temporal window for actions creating sounds intentionally (tap dancing), compared to actions creating sounds incidentally (hurdling). While percentages of perceived synchrony differed in the expected way, we identified two further factors, namely high event density and low rhythmicity, that induced higher synchrony ratings as well. Therefore, we systematically varied event density and rhythmicity in Study 2, this time using drumming stimuli to exert full control over these variables, and the same simultaneity judgment task. Results suggest that high event density leads to a bias to integrate rather than segregate auditory and visual signals, even at relatively large asynchronies. Rhythmicity had a similar, albeit weaker effect, when event density was low. Our findings demonstrate that shorter asynchronies and visual-first asynchronies lead to higher synchrony ratings of whole-body action, pointing to clear parallels with audiovisual integration in speech perception. Overconfidence in the naturally expected, that is, synchrony of sound and sight, was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchronicity judgments simply because it makes the detection of audiovisual asynchrony more difficult. More studies using real-life audiovisual stimuli with varying event densities and rhythmicities are needed to fully uncover the general mechanisms of audiovisual integration.

Introduction

From simple percepts like the ticking of a clock to complex stimuli like a song played on a guitar–in our physical world we usually perceive visual and auditory components alongside each other. The multisensory nature of our world has many advantages–it increases the reliability of sensory signals [1] and helps us navigate noisy environments, e.g. when one of the senses is compromised [2]. On the other hand, multimodality poses a challenge to our brains. Percepts from different senses have to be monitored to decide whether they belong to the same event and have to be integrated or segregated.

The impression of unity, i.e. the feeling that percepts belong to the same event, depends on many factors [3], one of them being the temporal coincidence of stimuli. For instance, we usually perceive visual and auditory speech as occurring at the same time, although these signals differ both in their neural processing time (10 ms for auditory signals vs. 50 ms for visual signals) and their physical “travel time” (about 330 m/s for auditory signals vs. 300,000,000 m/s for visual signals). Indeed, there seems to be a temporal window for the integration of audiovisual signals (temporal binding window; e.g. [2, 4]). Although the much cited notion that visual speech naturally leads auditory speech [5] has been recently revised [6], the temporal binding window seems to favor the visual channel leading the auditory channel. This is reflected in audio-first asynchronies (where the auditory signal leads the visual signal) being detected at smaller delays than visual-first asynchronies (where the visual signal leads the auditory signal; e.g. [7]). Also, the so-called McGurk effect—an illusion where the perception of a visual speech component and a different auditory speech component leads to the perception of a third auditory component [8]—is prevalent for larger visual-first than audio-first asynchronies [2]. This effect is suggested to show that visual speech acts as a predictor for auditory speech [9]. Visual speech aids auditory speech recognition even when visual and auditory signals are asynchronous, up to the boundaries of the temporal binding window [10]. Consequently, a coherent perception can be maintained for relatively large temporal asynchronies [7].

Although generally asymmetric, the width of temporal binding windows depends on different stimulus properties. For instance, this width seems to be up to five times wider for speech signals compared to simple flash and beep stimuli [11], more symmetrical for speech [12] and generally wider for more complex stimuli [4]. Experience seems to shape the width of the temporal binding window as well: Musicians have narrower temporal binding windows [13] and the window can be widened when participants are continuously presented with asynchronous stimuli [11, 14].

Notably, research on audiovisual perception has so far focused on speech [2, 9, 15, 16], whereas other types of stimuli have been largely neglected [17]. There are only a few studies looking at the audiovisual perception of musical stimuli [18–20] and object-directed actions, e.g. a hammer hitting a peg [21], a soda can being crushed [22], hand claps and spoon taps [23], and a chess playing sequence [24]. These studies mostly find that non-speech sounds have a narrower temporal binding window than speech, i.e. asynchronies are detected at smaller temporal delays. This is explained by the more predictable moments of impact [24], which is also in line with better asynchrony detection for the more visually salient bilabial speech syllables [25].

Although audiovisual integration of our own and other people’s actions is omnipresent in our everyday life the same way speech is, it is not nearly as well explored. It is an open issue whether effects that have been observed for audiovisual integration in language and music generalize to the breadth of self-generated sounds we are familiar with [17]. Also, aberrant audiovisual integration in psychiatric diseases [26] and neurological impairments [27] may well apply beyond speech and music, and thus affect the perception and control of one’s own actions. To fully understand audiovisual integration, we need to consider this phenomenon in its entire range, from human-specific speech and music to sounds that we, like all animals, generate just by moving and contacting the environment.

The two studies we present here were motivated by the observation that speech and music are both actions that generate sounds intentionally. Moreover, both speech and musical sounds score particularly high on two further properties: event density and rhythmicity. Therefore, in order to examine the potential generalizability of audiovisual integration from these domains to other natural sound-inducing actions, we sought to determine whether incidentally action-induced sounds show comparable patterns of audiovisual integration to intentionally action-induced sounds (Study 1), and whether audiovisual integration is modulated by varying event density and rhythmicity (Study 2). In two recent fMRI studies, we observed that brain networks for processing intentionally produced sounds differ from those for incidentally produced action sounds. Interestingly, rather than triggering higher auditory attention, intentional sound production, more so than incidental sound production, encouraged predictive processes leading to the typical attenuation pattern in primary auditory cortex [28, 29].

Study 1

In the first study, we used two types of non-speech auditory stimuli created by whole-body actions, namely hurdling and tap dancing. We decided to use two different types of sporting action that allowed us to study the processing of natural movement sounds in an ecologically valid context. This also had the particular advantage that the subjects’ attention was not directed toward either modality, since we created a completely natural perceptual situation. We applied a total of eight different asynchronies (ranging from a 400 ms lead of the auditory track to a 400 ms lead of the visual track) plus a synchronous condition in a simultaneity judgment task. In addition to using new, more ecologically valid stimuli, we examined the influence of the intentionality of sound generation. We intend to generate sounds by a tap dancing action (just like speaking, or playing a musical instrument, etc.), whereas sounds generated by a hurdling action (or by placing a chess piece on the board, for instance) are rather an incidental by-product of the action. Based on a previous study [28] demonstrating that the cerebral and behavioral processing of action-induced sounds significantly differs for intentionally and incidentally generated sounds, perceived audiovisual synchrony of tap dancing stimuli may yield similar effects as speech stimuli, and hurdling similar effects as object-directed actions.

Accordingly, we set out to test the following specific hypotheses: We expected shorter asynchronies to be generally perceived as synchronous more often than longer asynchronies (Hypothesis 1). Additionally, we expected visual-first asynchronies to be perceived as synchronous more often than corresponding audio-first asynchronies in both types of action (Hypothesis 2). Moreover, assuming that tap dancing is comparable to speech production in having a larger temporal binding window than incidentally produced action sounds, we expected that this synchrony bias vanishes for the longer delays in hurdling but persists for tap dancing (Hypothesis 3). This should manifest in significant differences between visual-first and audio-first delays in the larger delay types (i.e. 320–400 ms) in tap dancing, but not in hurdling.

Materials and methods–Study 1

Participants

The sample consisted of 22 participants (12 males, 10 females) with an age range from 20 to 32 years (M = 23.9, SD = 2.9), including only right-handers. We recruited only participants who had never trained in tap dancing or hurdling. Participants signed an informed consent explaining the procedure of the experiment and the anonymity of the collected data. Participants studying psychology received course credit for their participation. The study was approved by the Local Ethics Committee at the University of Münster, Germany, in accordance with the Declaration of Helsinki.

Stimuli

The stimuli used in this study stem from a previous fMRI study [28] and consisted of point-light displays (PLDs) of hurdling and tap dancing with their matching sounds (Fig 1A; see also Supplementary Material for exemplary videos). Note that tap dancing and hurdling share a basic property: all sounds generated by these actions are caused by foot-ground contact. Fourteen passive (retroreflective) markers were placed symmetrically on the left and right shoulders, elbows, wrists, hip bones, knees, ankles, and toes (over the second metatarsal head). Nine optical motion capture cameras (Qualisys Oqus 400 series) of the Qualisys Motion Capture System (https://www.qualisys.com; Qualisys, Gothenburg, Sweden) were used for kinematic measurements. Hurdling sounds were recorded using in-ear microphones (Soundman OKM Classic II); tap dancing sounds were recorded with a sound recording app on a mobile phone. The mobile phone was hand-held by a student assistant sitting about one meter behind the tap dancing participant.

Fig 1. Stimuli and task.

Fig 1

Screenshots of the stimuli used in (A.) Study 1 and (B.) Study 2. The lower panel (C.) shows a schema of the trial and required responses. Participants were presented with videos showing PLDs of hurdling, tap dancing, or drumming. Subsequently, they were asked to judge, in a dual forced choice setting, whether the audiovisual presentation was synchronous or not. In case of a negative decision, participants had to furthermore judge whether sound was leading video or vice versa.

After recording, PLDs were processed using the Qualisys Track Manager software (QTM 2.14), ensuring visibility of all 14 recorded point-light markers during the entire recording time. Sound data were processed using Reaper v5.28 (Cockos Inc., New York, United States). In a first step, stimulus intensities of hurdling and tap dancing recordings were normalized separately. In order to equalize the spectral distributions of both types of recordings, the frequency profiles of hurdling and tap dancing sounds were then captured using the Reaper plugin Ozone 5 (iZotope Inc., Cambridge, United States). Finally, the difference curve (hurdling minus tap dancing) was used by the plugin’s match function to adjust the tap dancing spectrum to the hurdling reference. PLDs and sound were synchronized, and the resulting videos were cut using Adobe Premiere Pro CC (Adobe Systems Software, Dublin, Ireland). All videos had a final duration of 5.12 seconds. Note that we employed the 0 ms lag condition as an experimental anchor point, being aware that an observer watching the actions from the camera’s distance would have experienced a very slight positive audio lag of about 14 ms. This time lag was the same for both the hurdling and tap dancing stimuli, so no experimental confound was induced. The final videos had a size of 640 x 400 pixels, a frame rate of 25 frames per second and an audio sampling rate of 44,100 Hz. Due to the initial distance between the hurdling participant and the camera system, the hurdling sounds were audible before the corresponding PLDs were fully visible. To offset this marked difference between hurdling and tap dancing stimuli in the visual domain, we employed a visual fade-in and fade-out of 1000 ms (25 frames) using Adobe Premiere, while the auditory track was presented without fading.
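The ~14 ms audio lag follows directly from the speed of sound. As a back-of-the-envelope check (a sketch in Python, not part of the authors' pipeline; the 4.8 m distance is our back-calculation from the reported lag, not a figure from the recording setup):

```python
# Sound travels at roughly 343 m/s (at ~20 degrees C), so a recording
# distance of a few meters delays the audio relative to the video.

SPEED_OF_SOUND_M_PER_S = 343.0  # approximate

def audio_lag_ms(distance_m: float) -> float:
    """Milliseconds sound needs to travel distance_m to the observer."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

# A lag of about 14 ms corresponds to a camera distance of roughly 4.8 m:
print(round(audio_lag_ms(4.8), 1))  # 14.0
```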

The stimulus set used here consisted of four hurdling and four tap dancing videos, each of which was presented at nine different asynchronies of the sound relative to the PLD (± 400 / 320 / 200 / 120 ms, and 0 ms), with negative values indicating that the audio track was leading the visual track (audio-first) and positive values indicating that the visual track was leading the audio track (visual-first), resulting in a total of 72 different stimuli (exemplary videos are provided in the Supplementary Material). Asynchrony sizes were chosen based on similar values used in previous studies (e.g. [22, 24]). Finally, prepared videos had an average length of 6 s.
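The stimulus crossing described above (eight videos x nine audio offsets = 72 stimuli) can be enumerated as follows; this is an illustrative sketch, and the video names are hypothetical placeholders rather than the authors' file names:

```python
from itertools import product

# Eight videos (four hurdling, four tap dancing) crossed with nine audio
# offsets: negative = audio-first, positive = visual-first, 0 = synchronous.
videos = ([f"hurdling_{i}" for i in range(1, 5)]
          + [f"tap_{i}" for i in range(1, 5)])
offsets_ms = [-400, -320, -200, -120, 0, 120, 200, 320, 400]

stimuli = list(product(videos, offsets_ms))
print(len(stimuli))  # 72
```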

A separate set of 40 hurdling and 40 tap dancing videos with a lag of 0 ms (synchronous) was used to familiarize participants with the synchronous PLDs. All stimuli had a duration of 4000 ms. Videos showed three hurdling transitions for the hurdling stimuli and a short tap dancing sequence for the tap dancing stimuli.

Acoustic feature extraction: Event density and rhythmicity

Core acoustic features of the 16 newly recorded drumming videos as well as the 8 original videos from Study 1 were extracted using the MIRtoolbox (version 1.7.2) for Matlab [30]. The toolbox first computes a detection curve (amplitude over time) from the audio track of each video. From this detection curve, a peak detection algorithm then determines the occurrence of distinct acoustic events (such as the sound of a single step). The number of distinct events per second quantifies the event density of a particular recording.

Acoustic events vary in amplitude, with accentuated events being louder than less accentuated ones. Therefore, we computed within-recording variance of the detection curve (normalized by the total number of events) to quantify to what extent each recording contained both accentuated and less accentuated events (see Fig 2): A recording with equally spaced, clearly accentuated events was defined as more rhythmic than a recording whose events are more or less equal in loudness (i.e., with low variation between events). An illustrative example of this approach is shown in S1 Fig. To allow comparison of rhythmicity across videos (independently of mean loudness), amplitude variability was computed as the coefficient of amplitude variation, i.e. the standard deviation of amplitude divided by its mean.
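The two measures can be approximated as follows. This is a simplified Python sketch of the pipeline described above, not the authors' Matlab/MIRtoolbox code: naive peak picking on an amplitude envelope stands in for the toolbox's onset detection, and the coefficient of amplitude variation quantifies rhythmicity; the envelope, sampling rate, and threshold are hypothetical.

```python
from statistics import mean, pstdev

def event_density(envelope, sr_hz, threshold):
    """Distinct acoustic events per second: count local maxima of the
    amplitude envelope that exceed `threshold` (a crude stand-in for
    MIRtoolbox's peak detection)."""
    peaks = [i for i in range(1, len(envelope) - 1)
             if envelope[i] > threshold
             and envelope[i] >= envelope[i - 1]
             and envelope[i] > envelope[i + 1]]
    return len(peaks) / (len(envelope) / sr_hz)

def rhythmicity(envelope):
    """Coefficient of amplitude variation (SD / mean): higher values mean
    clearly accentuated events, i.e. a more 'rhythmic' recording."""
    return pstdev(envelope) / mean(envelope)

# Synthetic 1 s envelope sampled at 100 Hz with four accentuated events:
env = [0.1] * 100
for i in (10, 35, 60, 85):
    env[i] = 1.0
print(event_density(env, 100, threshold=0.5))  # 4.0 events per second
```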

Fig 2. Auditory stimulus features, Study 1 and 2.

Fig 2

Left panel shows the event density measured in the videos showing hurdling (H) and tap dancing (T) (Study 1) and in the four sub-conditions of the drumming videos implementing combinations of high and low event density (D-, D+) and high and low rhythmicity (R+, R-) (Study 2). Each dot represents one recording. Right panel shows a measure of rhythmicity for the same set of recordings, operationalized as the variability of each recording’s amplitude envelope. Amplitude variation is shown as the coefficient of variation, i.e. the standard deviation of amplitude normalized by mean amplitude.

Assessment of motion energy (ME)

The overall motion energy for hurdling and tap dancing videos was quantified using Matlab (Version R2019b). For each video, the total amount of motion was quantified using frame-to-frame difference images for all consecutive frames. Difference images were binarized, classifying pixels with more than 10 units of luminance change as moving and all other pixels as not moving. Above-threshold (“moving”) pixels were then summed up for each video, providing its motion energy [31]. This approach yielded comparable levels for our experimental conditions, with a mean motion energy of 1189 for hurdling and 1220 for tap dancing (S2 Fig).
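The frame-differencing procedure can be sketched like this (a Python stand-in for the authors' Matlab routine; frames are represented as plain 2-D luminance arrays, and the threshold of 10 units matches the text):

```python
def motion_energy(frames, threshold=10):
    """Sum of 'moving' pixels across a video: a pixel counts as moving when
    its luminance changes by more than `threshold` between consecutive frames."""
    moving = 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                if abs(b - a) > threshold:
                    moving += 1
    return moving

# Two tiny 2x2 frames: only the luminance changes of 50 and 20 exceed 10.
frames = [[[0, 0], [0, 0]],
          [[50, 5], [0, 20]]]
print(motion_energy(frames))  # 2
```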

Procedure

The experiment was conducted in a noise-shielded and light-dimmed laboratory. Participants received a short instruction about the procedure of the experiment and signed the informed written consent before the experiment started. Participants were seated with a distance of approximately 75 cm to the computer screen. All stimuli were presented using the Presentation software (Neurobehavioral Systems Inc., CA). Headphones were used for the presentation of the auditory stimuli.

The experiment consisted of four blocks. The first block contained synchronous videos (0 ms lag) to familiarize participants with the PLD. To ensure their attention, participants were engaged in a cover task during this first block: They were asked to rate, by a dual forced-choice button press (male/female), the assumed gender of the person performing the hurdling or tap dancing action. There were no hypotheses concerning the gender judgment task and this part of the study was not analyzed any further.

Three blocks with the experimental task were presented thereafter. Within each of these blocks, all 72 stimuli (four hurdling and four tap dancing videos, each with nine different audiovisual asynchronies) were presented twice, resulting in 144 trials per block and 432 trials in total. A pseudo-randomization guaranteed that no more than three videos of the same delay type (audio-first vs. visual-first) were presented in a row, to prevent adaptation to one or the other. Additionally, we ensured that no more than two videos of the same asynchrony were presented in direct succession.
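One way to implement such a constrained pseudo-randomization is rejection sampling: reshuffle until both run-length constraints hold. The sketch below is an illustration under that assumption, not the authors' Presentation script; for simplicity it represents each trial as a (delay_type, asynchrony_ms) tuple and ignores the 0 ms trials, which belong to neither delay type.

```python
import random

def constrained_shuffle(trials, max_type_run=3, max_async_run=2, seed=None):
    """Shuffle (delay_type, asynchrony_ms) trials until no more than
    max_type_run trials of the same delay type and no more than
    max_async_run trials of the same asynchrony occur in a row."""
    rng = random.Random(seed)

    def longest_run(seq, key):
        best = run = 1
        for a, b in zip(seq, seq[1:]):
            run = run + 1 if key(a) == key(b) else 1
            best = max(best, run)
        return best

    order = list(trials)
    while True:  # rejection sampling; rarely needs many attempts
        rng.shuffle(order)
        if (longest_run(order, lambda t: t[0]) <= max_type_run
                and longest_run(order, lambda t: t[1]) <= max_async_run):
            return order
```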

A trial schema of the experimental task is given in Fig 1C. After presentation of each video (4000 ms) participants had to indicate whether they perceived the visual and auditory input as “synchronous” or “not synchronous”, pressing either the left key (for synchronous) or the right key (for not synchronous) on the response panel with their left and right index finger. If they decided that picture and sound were “not synchronous”, there was a follow-up question concerning the assumed order of the asynchrony (“sound first” or “picture first”, corresponding to the delay types audio-first and visual-first, respectively). We opted for a simultaneity judgment task rather than a temporal order judgment task, because simultaneity judgment tasks are easier to perform for participants and have a higher ecological validity [32]. Responses were self-paced, but participants were instructed to decide intuitively and as fast as possible. A 1000 ms fixation cross was presented in the middle of the screen before the next video started.

Experimental design

The study employed a three-factorial within-subjects design. The dependent variable was the percentage of trials perceived as synchronous. Trials with a reaction time above 3000 ms were discarded from the analyses. The first factor was action type with the factor levels hurdling and tap dancing. The different delays were generated by combinations of the factors asynchrony size (120 ms, 200 ms, 320 ms, 400 ms) and asynchrony type (audio-first, visual-first). Note that all delays where the auditory track was leading the visual track were labeled audio-first, while all delays where the visual track was leading the auditory track were labeled visual-first. For this analysis, we did not include the 0 ms lag (synchronous) condition, as it could not be assigned to either the audio-first or the visual-first condition. A 2 x 4 x 2 ANOVA was calculated.
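Computing the dependent variable from raw trial data can be sketched as follows (an illustrative Python aggregation with hypothetical field names; the 3000 ms cutoff and the exclusion of the 0 ms anchor condition follow the design described above):

```python
def percent_synchronous(trials, rt_cutoff_ms=3000):
    """Percentage of trials judged synchronous per
    (action, asynchrony_type, asynchrony_size) design cell.
    Each trial is a dict with keys action, async_type, async_size_ms,
    judged_sync (bool), and rt_ms."""
    cells = {}
    for t in trials:
        if t["rt_ms"] > rt_cutoff_ms or t["async_size_ms"] == 0:
            continue  # discard slow responses and the synchronous anchor
        key = (t["action"], t["async_type"], t["async_size_ms"])
        n_sync, n_total = cells.get(key, (0, 0))
        cells[key] = (n_sync + int(t["judged_sync"]), n_total + 1)
    return {k: 100.0 * s / n for k, (s, n) in cells.items()}
```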

Results—Study 1

Trials with response times that exceeded 3000 ms were excluded from the analyses (470 out of 9504). Mauchly’s test indicated that the assumption of sphericity was violated for asynchrony size (χ2(5) = 14.89, p = .011). Therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (ε = .72). Behavioral results are depicted in Figs 3 and 4.
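As a worked example of the correction applied here: Greenhouse-Geisser scales both degrees of freedom by the sphericity estimate ε. With the rounded ε = .72 this closely reproduces the reported F(2.2, 45.3); the small discrepancy in the error df reflects rounding of ε.

```python
# Greenhouse-Geisser correction: multiply both dfs by epsilon.
epsilon = 0.72                   # rounded estimate reported in the text
df_effect = 4 - 1                # asynchrony size has 4 levels
df_error = df_effect * (22 - 1)  # (levels - 1) * (N - 1), N = 22 participants

print(round(epsilon * df_effect, 2))  # 2.16  -> reported as 2.2
print(round(epsilon * df_error, 2))   # 45.36 -> 45.3 from the unrounded epsilon
```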

Fig 3. Main effects of audiovisual (a)synchrony ratings, Study 1.

Fig 3

Displayed are the mean percentages of trials perceived as synchronous, aggregated for the factors asynchrony size, asynchrony type, and action type. Error bars show standard deviations. Statistically significant differences (p < .001) are marked with asterisks.

Fig 4. Mean percentages of trials perceived as synchronous, Study 1.

Fig 4

Asynchronies (in ms) are displayed on the x-axis, with negative values indicating that the auditory channel preceded the visual channel (audio-first) and positive values indicating that the visual channel preceded the auditory channel (visual-first). Error bars show standard error of the mean. The upper panel shows all scores, fanned out for the level combinations of the factors asynchrony size, asynchrony type, and action type. The lower panel illustrates the significant action x asynchrony size x asynchrony type interaction.

The ANOVA revealed a main effect of asynchrony size (F(2.2,45.3) = 197.96, p < .001). As expected (Hypothesis 1), trials with the 120 ms asynchrony were rated as synchronous significantly more often (M = 68.8%, SD = 11.9%) than trials with the 200 ms asynchrony (M = 53.4%, SD = 14.0%, t(21) = 8.8, p < .001), which were in turn rated as synchronous more often than trials with the 320 ms asynchrony (M = 34.0%, SD = 11.8%, t(21) = 13.2, p < .001), and those were rated as synchronous more often than trials with the 400 ms asynchrony (M = 29.5%, SD = 8.7%, t(21) = 3.7, p = .001).

The main effect of asynchrony type was significant as well (F(1,21) = 198.87, p < .001), with visual-first asynchronies (M = 59.2%, SD = 11.7%) being rated as synchronous significantly more often than audio-first asynchronies (M = 33.7%, SD = 10.9%), as expected (Hypothesis 2).

Unexpectedly, the main effect of action type was also significant with F(1, 21) = 64.55, p < .001, driven by overall more synchronous ratings in the tap dancing condition (M = 58.9%, SD = 14.9%) compared to the hurdling condition (M = 34.0%, SD = 10.1%). Note that this finding motivated Study 2, as outlined below.

In line with Hypothesis 3, the interaction of asynchrony size, asynchrony type, and action type was significant (F(3,63) = 10.51, p < .001). Bonferroni-corrected pairwise post-hoc t-tests comparing the respective audio-first and visual-first conditions revealed that visual-first conditions in tap dancing were perceived as synchronous more often for the 120 ms asynchrony (visual-first: M = 88.0%, SD = 11.3%; audio-first: M = 49.9%, SD = 19.0%; t(21) = 11.8, p < .001), the 200 ms asynchrony (visual-first: M = 75.5%, SD = 22.5%; audio-first: M = 49.4%, SD = 19.8%; t(21) = 6.0, p < .001), the 320 ms asynchrony (visual-first: M = 59.5%, SD = 19.8%; audio-first: M = 44.5%, SD = 18.9%; t(21) = 5.2, p < .001) and the 400 ms asynchrony (visual-first: M = 60.1%, SD = 16.2%; audio-first: M = 44.0%, SD = 16.0%; t(21) = 4.4, p = .001). In hurdling, visual-first conditions were perceived as synchronous more often than their respective audio-first conditions for the 120 ms asynchrony (visual-first: M = 91.4%, SD = 10.5%; audio-first: M = 45.9%, SD = 23.5%; t(21) = 10.4, p < .001), the 200 ms asynchrony (visual-first: M = 68.2%, SD = 18.0%; audio-first: M = 20.6%, SD = 17.1%; t(21) = 12.6, p < .001), and the 320 ms asynchrony (visual-first: M = 23.1%, SD = 16.7%; audio-first: M = 9.1%, SD = 8.7%; t(21) = 4.0, p < .001), but not for the 400 ms asynchrony (visual-first: M = 7.7%, SD = 9.7%; audio-first: M = 6.4%, SD = 8.9%; t(21) = 0.6, p = .588). This was in accordance with our assumption that the visual-first bias is observed even at very long asynchronies for tap dancing but vanishes for hurdling.

Furthermore, the interaction of action type and asynchrony size (F(3,63) = 88.71, p < .001) and the interaction of asynchrony size and asynchrony type (F(3,63) = 51.31, p < .001) were both significant, whereas the interaction of action type and asynchrony type was not (F(1,21) = 1.75, p = .20).

Interim discussion—Study 1

A consistent finding over all studies examining audiovisual asynchrony processing is that perceived synchrony rates of visual-first conditions are higher compared to audio-first conditions [11]. Study 1 corroborated this finding for both hurdling and tap dancing stimuli, suggesting that asynchrony perception does not fundamentally differ for whole-body movements. However, higher perceived synchrony of visual-first compared to audio-first asynchronies was found for larger asynchrony sizes in the tap dancing condition compared to the hurdling condition, as we expected. That is, in tap dancing, audio-first and visual-first perceived synchrony ratings were not only significantly different from each other in the smaller delay types (120 ms, 200 ms), but also in the larger ones (320 ms, 400 ms), whereas in the hurdling conditions, the same difference was found for the 120 ms, 200 ms and 320 ms conditions, but not for the 400 ms condition. This aligns with our assumption that intentionality of sound production leads to differences in the perception of tap dancing and hurdling [28]. We suggest this finding to reflect a wider temporal integration window for our tap dancing condition compared to our hurdling condition. The same mechanism might be at work whenever the temporal integration window for language or music is compared to that for object-related action-induced sounds. For instance, Eg and Behne [24] found a wider temporal integration window for language and music than for chess playing.

Our findings suggest that whole-body movement synchrony perception does not principally differ from other previously examined types of synchrony perception. At the same time, they also point to differences in synchrony perception depending on the intentionality of the produced sounds, with intentional sounds generally being perceived as more synchronous with their visual actions, or having a higher acceptance range, compared to action-induced sound occurring only incidentally.

These results also suggest diverging effects of audiovisual asynchrony on action perception and action execution. In the case of action execution, visual-first asynchronies, i.e. temporal delays of sound, have a disruptive effect on the execution of speaking [33], singing [34] and playing a musical instrument [35], but not on the execution of hurdling [36]. In the case of action perception, on the other hand, those same phase shifts are accepted as synchronous more often in language and music compared to simple object actions [22, 24]. Thus, while asynchronies seem to disrupt action execution for actions intentionally creating sounds, asynchronies for these actions are usually integrated even for relatively large temporal offsets in action perception. Considering that self-initiated sounds during action execution are usually attenuated when compared to externally generated sounds (e.g. [37–40]), most likely due to the fact that they are expected [41], the disruption of action execution through experimentally induced audiovisual asynchronies might reflect a heightened sensitivity for unexpectedly delayed sounds in self-performed vs. only observed action.

In sum, Study 1 suggests that characteristics of audiovisual integration in the perception of speech and music may generalize to other types of intentionally sound-generating actions but not to those which create sounds rather incidentally.

Unexpectedly, asynchrony was generally more accurately judged for hurdling than for tap dancing, as reflected by a significant main effect. While this finding does not relativize the reported evidence for a widened temporal window of integration in tap dancing, as suggested by the 400 ms lag condition, it motivates the assumption that it was also more difficult to detect audiovisual asynchrony in tap dancing than in hurdling. Building on these findings, in Study 2 we turned to event density and rhythmicity as factors potentially modulating audiovisual integration; specifically, we sought to test whether they confounded our experimental conditions in Study 1. As outlined in the Methods section, we performed a Matlab-based acoustic feature extraction to objectively quantify event density and rhythmicity, based on which we conducted the following two post-hoc analyses.

Firstly, tap dancing trials had a higher event density (ranging from 3.19 to 4.18; M = 3.74 Hz, SD = .41) compared to hurdling trials (ranging from 2.19 to 2.99; M = 2.69 Hz, SD = .35; Mann-Whitney-U-test: U = 0.00, exact p = .029). Event density, i.e. the number of distinguishable auditory sounds occurring per second, or the frequency of distinct events has an influence on the detection performance of audiovisual asynchrony. Visual speech, for example, is integrated roughly at the syllable rate of 4–5 Hz [2, 42, 43]. Temporal frequencies above 4 Hz seem to be difficult to rate in terms of their (a)synchrony [44]. In light of these findings, we post-hoc investigated the effect of event density on synchronicity judgments. To this end, we included event density as ordinal variable to replace action type in our original ANOVA. This analysis showed a main effect of event density (F(2.9,61.8) = 71.64, p < .001, Greenhouse-Geisser corrected; χ2(14) = 40.60, p < .001, ε = .59), which could mirror our reported main effect of action type. Bonferroni-corrected post-hoc pairwise comparisons of the event density levels showed that differences in performance levels, however, did not mirror the separation point between actions. Instead, no difference in performance was found between the four hurdling videos and one tap dancing video (all p ≥ .52), which were all lower in performance than the three tap dancing videos with the highest event densities (all p < .001), while the video with the highest event density again significantly differed from all others (all p < .001). To see whether event density fully explains the original effect of action type, we calculated another ANOVA including action type and event density (as ordinal variable within an action) as well as asynchrony type and asynchrony size. 
Here, we found significant main effects of both event density (F(2,42) = 71.09, p < .001) and action type (F(1,21) = 74.26, p < .001) as well as their interaction (F(2,42) = 68.69, p < .001). These findings suggest that the higher synchrony ratings for tap dancing (vs. hurdling) cannot be explained by the higher event density of this action type alone. Moreover, event density did not have the same effect on judging audiovisual synchrony of intentionally and incidentally generated action sounds. As these were post-hoc analyses, we do not further elucidate the other main and interaction effects here. All in all, these data patterns motivated a direct experimental manipulation and investigation of event density.
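As a hedged illustration of this post-hoc comparison, the following sketch computes event density and reproduces the exact Mann-Whitney U result from per-video densities. Note that the original analysis used MIRtoolbox in Matlab, and the individual per-video values below (other than the reported range endpoints) are hypothetical; since the two groups of four do not overlap, U = 0 and the exact two-sided p equals 2/70 ≈ .029 regardless of the intermediate values.

```python
from itertools import combinations

def event_density(n_events, duration_s):
    """Event density in Hz: distinguishable sound onsets per second."""
    return n_events / duration_s

def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj (0.5 per tie)."""
    return sum(1.0 if xi > yj else (0.5 if xi == yj else 0.0)
               for xi in x for yj in y)

def exact_two_sided_p(x, y):
    """Exact two-sided p by enumerating all relabelings of the pooled sample."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    u_obs = mann_whitney_u(x, y)
    extreme_obs = min(u_obs, n * m - u_obs)
    hits = total = 0
    for idx in combinations(range(n + m), n):
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(n + m) if i not in idx]
        u = mann_whitney_u(xs, ys)
        if min(u, n * m - u) <= extreme_obs:
            hits += 1
        total += 1
    return hits / total

# Hypothetical per-video densities consistent with the reported ranges (Hz)
hurdling = [2.19, 2.55, 2.90, 2.99]
tap_dancing = [3.19, 3.60, 3.95, 4.18]

u = mann_whitney_u(hurdling, tap_dancing)
p = exact_two_sided_p(hurdling, tap_dancing)
```

With four stimuli per action type, full enumeration over all C(8, 4) = 70 relabelings is cheap and matches the "exact p" reported in the text.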

Secondly, tap dancing trials were less rhythmically structured than hurdling trials. Although the overall amplitude of the soundtrack was balanced (i.e. adjusted) between the tap dancing and the hurdling condition, the loudness of steps was less variable within tap dancing than within hurdling. As the latter was accentuated by three heavy landing steps after hurdle clearance embedded in a sequence of lighter running steps, hurdling may also have produced a more structured percept than tap dancing sounds. A post-hoc Mann-Whitney U test showed that the measure of rhythmicity that we explored, the mean amplitude variation coefficient, was lower in tap dancing (M = .47, SD = .10) than in hurdling (M = .91, SD = .14; U = 0.00, exact p = .029). That is, the four lowest mean amplitude variation coefficients belonged to the tap dancing stimuli and the four highest to the hurdling stimuli. In line with the post-hoc analyses for event density reported above, we investigated the effect of rhythmicity, operationalized as the mean amplitude variation coefficient, on synchrony judgments in Study 1. The ANOVA including rhythmicity (8), asynchrony type (2) and asynchrony size (4) revealed a significant main effect of rhythmicity (F(3.5,73.1) = 43.62, p < .001, Greenhouse-Geisser corrected; χ2(27) = 74.27, p < .001, ε = .50). Bonferroni-corrected post-hoc pairwise comparisons showed that the stimuli with the five highest mean amplitude variation coefficients (all hurdling stimuli and one tap dancing stimulus) did not differ in their synchrony judgments (all p ≥ .968) but received significantly lower ratings than the three stimuli with the lowest mean amplitude variation coefficients. Within those three stimuli, the second lowest differed significantly from the first and the third (all p ≤ .042). Here, again, the main effect does not mirror the separation point between actions.
To test whether rhythmicity fully explains the original effect of action type, we calculated an ANOVA including action type (2), rhythmicity (4), asynchrony type (2) and asynchrony size (4). Just as for event density, this analysis showed a main effect of action type (F(1,21) = 66.18, p < .001), a main effect of rhythmicity (F(3,63) = 28.47, p < .001) and an interaction of both (F(3,63) = 33.79, p < .001). These findings suggest that rhythmicity, too, did not have the same effect on intentionally and incidentally generated action sounds. As these were post-hoc analyses, we do not further elucidate the other main and interaction effects here. The results of Study 1 thus called for a direct experimental manipulation and investigation of rhythmicity, further motivated by the fact that, to our knowledge, no study to date has examined the impact of rhythmicity on the perception of audiovisual synchrony.
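A minimal sketch of this operationalization, assuming the mean amplitude variation coefficient is the coefficient of variation (SD/mean) of per-event peak amplitudes; the exact MIRtoolbox feature computation may differ, and the amplitude profiles below are hypothetical:

```python
from statistics import mean, pstdev

def amplitude_variation_coefficient(peak_amplitudes):
    """Coefficient of variation (SD / mean) of per-event peak amplitudes.
    Higher values indicate stronger accentuation contrasts between events."""
    return pstdev(peak_amplitudes) / mean(peak_amplitudes)

# Hypothetical amplitude profiles (arbitrary units):
# hurdling-like: heavy landing steps embedded in lighter running steps
hurdling_like = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0]
# tap-dancing-like: nearly uniform loudness across steps
tap_like = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]

cv_hurdling = amplitude_variation_coefficient(hurdling_like)
cv_tap = amplitude_variation_coefficient(tap_like)
```

Under this assumption, a sequence with a few strongly accentuated events among uniform ones yields a higher coefficient than a sequence of near-identical loudness, reproducing the direction of the hurdling vs. tap dancing difference.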

To summarize these considerations, in Study 1 tap dancing stimuli received generally higher audiovisual synchrony ratings than hurdling stimuli. Since tap dancing videos also differed from hurdling videos with regard to higher event density and lower rhythmicity, both factors were potential sources of confound. To address their impact on audiovisual integration, we conducted Study 2, in which we employed PLDs of drumming sequences with variable event density and rhythmicity. Drumming PLDs enabled direct control of event density and rhythmicity in an otherwise natural human motion stimulus. Note that by using drumming actions, we kept the intentionality of sound production constant while varying event density and rhythmicity as independent experimental factors. Since PLD markers were restricted to the upper body of the drummer, and since sounds were produced by handheld drumsticks in Study 2, in contrast to the foot-produced sounds in Study 1, we refrained from directly comparing conditions between the two studies.

Study 2

We recorded PLDs of drumming actions which matched and re-combined parameters of the event density and rhythmicity of the stimuli used in Study 1. Four conditions were generated by instructing the drummer to generate one sequence matching the original hurdling condition in Study 1 (low event density, high rhythmicity, labelled D-R+ hereafter), another matching the original tap dancing stimuli (high event density, low rhythmicity, D+R-), and two sequences with new level combinations of these factors (low event density, low rhythmicity, D-R-, and high event density, high rhythmicity, D+R+).

To investigate whether high event density and low rhythmicity are relevant factors for the temporal binding of multisensory percepts, we applied the same simultaneity judgment task to our four classes of drumming stimuli. Based on the results of Study 1, we expected that for all stimuli, synchrony ratings would be higher for short asynchronies than for longer ones (120 ms > 200 ms > 320 ms > 400 ms, Hypothesis 1) and that visual-first asynchronies would be perceived as synchronous more often than their respective audio-first asynchronies (Hypothesis 2). Regarding the newly introduced factors of event density and rhythmicity, we tested whether higher synchrony ratings are observed for higher event density (Hypothesis 3) and lower rhythmicity (Hypothesis 4).

Materials and methods–Study 2

Many details regarding participants, the stimulus material and the procedure were the same as in Study 1. Therefore, we here only report aspects that were different between the two studies.

Participants

The sample consisted of 31 participants (2 male, 29 female) aged 19 to 29 years (M = 24.0, SD = 2.7); all were right-handed, as assessed by self-report. We recruited only participants who had never received training in drumming. Participants signed an informed consent form explaining the procedure of the experiment and the anonymity of the collected data. Participants studying psychology received course credit for their participation. The study was approved by the Local Ethics Committee at the University of Münster, Germany, in accordance with the Declaration of Helsinki.

Stimuli

The stimuli used in this study were PLDs of drumming actions with matching sound, performed by a professional drum teacher. As in Study 1, PLDs were recorded using the Qualisys Motion Capture System and in-ear microphones. Fifteen markers were placed symmetrically on the left and right shoulders, elbows, and wrists, as well as on three points of each drumstick and three points of the drum (Fig 1B; exemplary videos can be found in the Supplementary Material). Further processing steps of the video material matched those of Study 1. The final videos had an average length of about 6 s for each of the four factor level combinations (i.e., D-R+, D+R-, D-R-, D+R+), with lengths varying from 4.9 s to 6.8 s (M = 5.9 s). For the conditions replicating our previous hurdling and tap dancing stimuli in event density and rhythmicity (D-R+, D+R-), the drummer was familiarized with these stimuli and asked to replicate them on the drums. For the two new conditions (D-R- and D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated. For each of these four sub-conditions, four separate videos were selected, each of which was presented at nine different levels of asynchrony of the sound relative to the visual channel (± 400 / 320 / 200 / 120 ms, and 0 ms). Again, negative values indicated that the audio track was leading the visual track (audio-first) and positive values indicated that the visual track was leading the audio track (visual-first), resulting in 144 different stimuli. All videos included a 1000 ms visual fade-in and fade-out.

To ensure that the 16 newly recorded drumming videos implemented the four different factor level combinations (D-R+, D+R-, D-R-, D+R+), we used the same MIRtoolbox as in Study 1 to extract core acoustic features. Fig 2 shows that drumming videos successfully implemented the two experimental factors of mean event density (Hz) and rhythmicity (mean amplitude variation coefficient), resulting in the following combinations: D-R+ (D 2.192, R 0.694), D+R- (D 3.264, R 0.215), D-R- (D 2.538, R 0.162) and D+R+ (D 3.191, R 0.772). Thus, videos with a high event density (D+) had an event frequency of 3.23 Hz, those with low density (D-) 2.37 Hz on average. Videos with a high rhythmicity (R+) had a coefficient of amplitude variation of 0.733, whereas videos with a low rhythmicity (R-) had a coefficient of amplitude variation of 0.189.
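The reported factor-level averages can be checked for consistency against the four per-condition values; a small sketch, with all numbers taken directly from the text:

```python
# (event density in Hz, mean amplitude variation coefficient) per condition,
# as reported above
features = {
    "D-R+": (2.192, 0.694),
    "D+R-": (3.264, 0.215),
    "D-R-": (2.538, 0.162),
    "D+R+": (3.191, 0.772),
}

def level_mean(flag, index):
    """Average a feature over the two conditions whose label contains `flag`
    (e.g. 'D+' selects D+R- and D+R+)."""
    values = [v[index] for key, v in features.items() if flag in key]
    return sum(values) / len(values)

high_density = level_mean("D+", 0)       # ≈ 3.23 Hz, as reported
low_density = level_mean("D-", 0)        # ≈ 2.37 Hz
high_rhythmicity = level_mean("R+", 1)   # ≈ 0.733
low_rhythmicity = level_mean("R-", 1)    # ≈ 0.189
```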

As in Study 1, we assessed the mean motion energy (ME) score for all drumming videos (see Methods section of Study 1). This yielded a mean ME of 1052 for drumming videos, slightly lower than the ME for hurdling (1189) and tap dancing (1220) in Study 1 (S2 Fig). A Kruskal-Wallis test by ranks showed no significant difference in motion energy between hurdling, tap dancing and drumming (χ2(2) = 4.2, p = .12).
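A hedged sketch of such a motion energy measure, assuming it counts pixels whose intensity changes between consecutive frames beyond a threshold (the exact computation underlying S2 Fig may differ):

```python
def motion_energy(frames, threshold=10):
    """Mean number of pixels per frame transition whose grayscale intensity
    changes by more than `threshold` between consecutive frames.
    `frames` is a list of equally sized flat lists of pixel intensities."""
    changed_counts = [
        sum(1 for p, c in zip(prev, cur) if abs(p - c) > threshold)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(changed_counts) / len(changed_counts)

# Toy example: a 2x2 'video' in which two pixels light up once
frames = [
    [0, 0, 0, 0],
    [0, 0, 255, 255],  # two pixels change here
    [0, 0, 255, 255],  # no further change
]
me = motion_energy(frames)
```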

Procedure

The experiment consisted of four experimental blocks. Within each block, each of the 144 stimuli (four D-R+, four D+R-, four D-R-, and four D+R+ videos, each presented at nine levels of audiovisual asynchrony) was presented once, resulting in 576 trials in total. A pseudo-randomization guaranteed that no more than three videos of the same asynchrony type (audio-first vs. visual-first) were presented in a row, to prevent adaptation to one or the other. Additionally, no more than two videos with the exact same level of asynchrony were presented in direct succession. We employed the same task as in Study 1.
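A minimal sketch of such a constrained pseudo-randomization via rejection sampling; the authors' actual randomization procedure is not specified, and treating the 0 ms stimuli as their own "type" is an assumption made here:

```python
import random

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if b == a else 1
        best = max(best, cur)
    return best

def valid(order, max_type_run=3, max_level_run=2):
    """Check both constraints: runs of asynchrony type and of exact level."""
    types = [t for t, _ in order]
    levels = [lvl for _, lvl in order]
    return (longest_run(types) <= max_type_run
            and longest_run(levels) <= max_level_run)

def constrained_shuffle(trials, seed=None, max_tries=100_000):
    """Reshuffle until no more than three consecutive trials share an
    asynchrony type and no more than two share the exact same level."""
    rng = random.Random(seed)
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if valid(order):
            return order
    raise RuntimeError("no valid order found; relax constraints or add swaps")

# One sub-block: 4 videos x 9 asynchrony levels (ms); negative = audio-first
levels = [-400, -320, -200, -120, 0, 120, 200, 320, 400]
trials = [("audio" if l < 0 else "visual" if l > 0 else "sync", l)
          for l in levels for _ in range(4)]
order = constrained_shuffle(trials, seed=1)
```

Plain rejection sampling suffices here because a random order of this size already satisfies both constraints a reasonable fraction of the time; for harder constraints, a swap-repair pass would be the usual fallback.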

Experimental design

The study used a four-factorial within-subject design with the two-level factors event density (low, high) and rhythmicity (low, high), the four-level factor asynchrony size (120 ms, 200 ms, 320 ms, 400 ms) and the two-level factor asynchrony type (audio-first, visual-first). The dependent variable was the percentage of trials perceived as synchronous. Correspondingly, a 2 x 2 x 4 x 2 ANOVA was calculated.

Results–Study 2

Behavioral results are depicted in Figs 5 and 6. Mauchly’s test indicated that the assumption of sphericity was violated for asynchrony size (χ2(5) = 17.93, p = .003, ε = .71), event density x asynchrony size (χ2(5) = 22.53, p < .001, ε = .65), rhythmicity x asynchrony size (χ2(5) = 14.63, p = .012, ε = .78), event density x rhythmicity x asynchrony size (χ2(5) = 12.52, p = .028, ε = .78), and asynchrony size x asynchrony type (χ2(5) = 22.01, p = .001, ε = .68). Therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. As expected, we replicated the main effects of asynchrony size (F(2.1,63.9) = 186.86, p < .001) and asynchrony type (F(1,30) = 149.59, p < .001). Synchrony ratings were highest for the 120 ms asynchronies (M = 69.7%, SD = 14.6%) and decreased with increasing asynchrony (200 ms asynchronies, M = 61.8%, SD = 16.9%; 320 ms asynchronies, M = 48.0%, SD = 19.0%; 400 ms asynchronies, M = 39.8%, SD = 18.5%), with significant differences between all adjacent asynchrony sizes (all p < .001, Hypothesis 1). Synchrony ratings were also higher for visual-first asynchronies (M = 64.1%, SD = 14.8%) than for audio-first asynchronies (M = 45.6%, SD = 19.3%, Hypothesis 2).

Fig 5. Main effects of the audiovisual (a)synchrony ratings, Study 2.


Mean percentages of trials perceived as synchronous, aggregated for the factors asynchrony size, asynchrony type, Event density (D- standing for low, D+ for high density) and Rhythmicity (R- and R+ for low and high rhythmicity, respectively). Error bars represent the standard deviation. Significant differences are marked with asterisks.

Fig 6. Mean percentages of trials perceived as synchronous, Study 2.


On the left-hand side, all scores are fanned out for the level combinations of the factors asynchrony size, asynchrony type, event density and rhythmicity. The right-hand chart illustrates the significant event density x rhythmicity interaction.

We found a main effect for event density (F(1,30) = 122.30, p < .001), with higher event density resulting in higher synchrony ratings (M = 69.8%, SD = 20.4%) compared to lower event density (M = 39.9%, SD = 15.8%, Hypothesis 3). We found a main effect for rhythmicity as well (F(1,30) = 5.48, p = .026), but contrary to our hypothesis (Hypothesis 4), synchrony ratings for lower rhythmicity were lower (M = 52.3%, SD = 15.7%) than those for higher rhythmicity (M = 57.4%, SD = 19.6%).

Interaction effects were significant for event density x rhythmicity (F(1,30) = 22.59, p < .001), event density x asynchrony size (F(1.9,58.0) = 42.86, p < .001), rhythmicity x asynchrony size (F(2.3,70.1) = 6.26, p = .002), event density x rhythmicity x asynchrony size (F(2.3,70.1) = 34.63, p < .001), event density x asynchrony type (F(1,30) = 87.11, p < .001), rhythmicity x asynchrony type (F(1,30) = 4.58, p = .041), event density x rhythmicity x asynchrony type (F(1,30) = 4.85, p = .036), asynchrony size x asynchrony type (F(2.0,61.3) = 65.22, p < .001), event density x asynchrony size x asynchrony type (F(3,90) = 99.52, p < .001), rhythmicity x asynchrony size x asynchrony type (F(3,90) = 4.82, p = .004), and event density x rhythmicity x asynchrony size x asynchrony type (F(3,90) = 9.76, p < .001).

Bonferroni-corrected post-hoc pairwise comparisons inspecting the interaction of event density and rhythmicity showed significant increases in synchrony ratings from low to high event density at both low (p < .001) and high (p < .001) rhythmicity. Synchrony ratings increased significantly from low to high rhythmicity only under low event density (p < .001), but not under high event density (p = .32).

General discussion

Visual and auditory signals often occur concurrently and aid a more reliable perception of events that cause these signals. Audiovisual integration depends on several factors which have been thoroughly investigated using the example of spoken language or music, but remain largely unexplored regarding their generalizability beyond these domains. In two behavioral studies, we examined the impact of audiovisual stimulus properties that are characteristic for both speech and music, and hence are particularly suited to address the issue of generalizability. In Study 1, we compared audiovisual signals from PLDs which were created intentionally, via tap dancing, and those which were created incidentally, via hurdling. In Study 2, we examined event density and rhythmicity as two properties describing drumming actions and their corresponding sounds.

In line with previous research [11, 22, 24], we found in both Study 1 and Study 2 that smaller asynchronies tended to be perceived as synchronous more often than larger asynchronies, and that visual-first asynchronies received higher synchrony ratings than their respective audio-first asynchronies. These effects were consistently observed for actions that create sounds intentionally as well as incidentally. Interestingly, the average synchrony ratings for the drumming stimuli (55%) were comparable to those recorded in Study 1 for tap dancing (59%) rather than hurdling (34%), corroborating the interpretation that intentionally action-induced sounds are perceived differently from merely incidentally action-induced sounds. In two previous fMRI studies addressing this issue [28, 29], behavioral and brain activity data pointed towards stronger auditory expectations (but, importantly, not towards enhanced auditory attention) when observing intentional as compared to incidental sound production. Thus, particularly strong auditory expectations may tend to overrule actual perceptual evidence (e.g. of asynchrony), leading to a pronounced audiovisual integration bias for intentional sound production such as spoken language. It is well known that strong prior expectations, while often helpful for perception, can lead to misperception of degraded sensory signals, causing for instance so-called “slips of the ear” in speech perception and visual illusions [45]. In terms of the predictive coding account, such misperceptions reflect a failure to adjust prior expectations to the current stimulus, either because these prior beliefs are too strong or because the prediction error is not strong enough [46]. Importantly, part of the generative model is the precision of incoming sensory input and hence of potential prediction errors: this expected precision modulates how much the prediction error is weighted in updating predictions.
Normally, when we see and hear an actor performing sound-generating movements, sound and sight are synchronous (allowing for a slight visual lead, because light travels faster than sound). Under normal conditions, when the environment is not particularly noisy and movement patterns are familiar, the internal models are weighted highly and the prediction error is relatively low. Temporal regularities, prominent in speech and music as well as in skilled human movement sequences such as tap dancing and hurdling, are particularly powerful cues for cross-sensory prediction [47]. Hence, in line with the aforementioned fMRI findings on tap dancing and hurdling [28, 29], we suggest that while predictive processes favor the (mis)perception of synchrony for both hurdling and tap dancing, a more elaborate generative model of to-be-produced sounds may amplify this bias even further. We come back to this assumption below when discussing the effect of rhythmicity.

Study 2 was conducted to examine whether the effect of intentionality could be partly explained by event density and/or rhythmicity, keeping the intentionality of sound production (in drumming) constant. Here we found that synchrony ratings increased significantly for stimuli with a high event density. That is, the more events occurred per second, the more strongly participants were biased towards audiovisual integration, even at very large asynchronies (400 ms). This observation fits well with reports of asynchrony detection collapsing for high event density stimuli, both for speech [2] and for audiovisual flash-beep pairings [44]. Petrini and co-workers [18, 48] found that this bias is reduced by practice, as expert drummers outperform novices in audiovisual asynchrony perception for both slow and fast tempi. Importantly, the data of Study 1 suggested that both intentionality and event density significantly increased the proportion of synchrony judgments for actually asynchronous audiovisual stimuli, meaning that the effect of intentionality could not be reduced to differences in event density. Since these findings were derived only from a post-hoc comparison treating event density as an ordinal variable, a more reliable test was provided by Study 2, which corroborated a main effect of event density on perceived audiovisual synchrony.

Why does increasing event density disrupt audiovisual asynchrony detection so effectively? A straightforward explanation may be that increasing the event density narrows the empty intervals between filled intervals, be they clicks, tones, or sounds. At an event density of 2.5 Hz, the average onset-to-onset interval between events amounts to 400 ms, meaning that a 400 ms asynchrony manipulation shifts the delayed auditory event to coincide with the next visual event (or vice versa). Consequently, a 400 ms asynchrony in an audiovisual signal with an event density of 2.5 Hz can only be detected in either of two cases: either (a) other stimulus features such as amplitude, pitch or spectral frequency are variable enough to indicate a mismatch between the phase-shifted auditory event and the visual event it coincides with (or vice versa), or (b) the temporal variance of the onset-to-onset intervals is high enough to include longer intervals, bringing a phase-shifted auditory event onto a visual event gap (or vice versa). In the current Study 2, (b) was met by all conditions, and both (a) and (b) in the case of increasing rhythmicity, which entails more variable beat accentuation. For the high event density condition (3.23 Hz), the effect of higher rhythmicity was negligible. Thus, high event density effectively disturbed asynchrony judgments, independent of the level of rhythmicity. Focusing on this part of the results, one would expect rhythmicity to have an effect only if event density is not too high, as the probability of a phase-shifted auditory event falling into a visual event gap is higher when event density is low. And indeed, increasing rhythmicity had a comparably clear effect when event density was low (2.37 Hz).
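The coincidence argument can be made concrete with a small sketch, assuming perfectly isochronous onsets for illustration (real stimuli vary around the mean interval):

```python
def onsets(density_hz, n_events):
    """Isochronous event onsets in ms at a given event density."""
    step_ms = 1000.0 / density_hz
    return [i * step_ms for i in range(n_events)]

def mean_misalignment(visual, audio, shift_ms):
    """Mean distance (ms) of each shifted audio onset to its nearest visual
    onset, over shifted onsets still within the visual sequence."""
    shifted = [a + shift_ms for a in audio if a + shift_ms <= visual[-1]]
    return sum(min(abs(s - v) for v in visual) for s in shifted) / len(shifted)

vis = onsets(2.5, 10)  # 2.5 Hz -> 400 ms onset-to-onset interval
aud = onsets(2.5, 10)

# A 400 ms shift moves every audio event exactly onto the next visual event,
# so the streams appear perfectly aligned again:
aligned = mean_misalignment(vis, aud, 400)
# whereas a half-interval (200 ms) shift leaves every event maximally off-beat:
off_beat = mean_misalignment(vis, aud, 200)
```

At exactly 2.5 Hz the full-interval shift produces zero residual misalignment, which is why only amplitude/pitch mismatches (case a) or interval variability (case b) can reveal the manipulation.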

Contrary to our expectation, higher rhythmicity did not enhance asynchrony detection; synchrony ratings were actually higher for more rhythmic trials in Study 2. Note that, while both main effects, density and rhythmicity, were statistically significant, the overall impact on perceived synchrony was far stronger for density (low: 40% vs. high: 70%) than for rhythmicity (low: 52% vs. high: 57%). Still, the significant interaction of these factors revealed that at the level of low event density, rhythmicity noticeably increased perceived synchrony from 34% (low rhythmicity) to 47% (high rhythmicity). Two conclusions can be drawn from this finding: first, rhythmicity had a significant effect, but only under the condition that event density did not exceed a certain level. Second, rhythmicity could be ruled out as a confounding factor in Study 1, as tap dancing was less rhythmic than hurdling but led to higher synchrony ratings. In other words, the participants’ bias to misjudge asynchronous audiovisual PLDs as synchronous was not explained by the lower rhythmic structure of tap dancing as compared to hurdling.

To our knowledge, no study to date has explicitly examined the influence of temporal structure or rhythmicity on audiovisual asynchrony detection. Thus, it is hard to pinpoint why our more rhythmic stimuli, featuring more distinct events with more discernible moments of impact, led to higher synchrony ratings than the less rhythmic stimuli. At first sight, the opposite would have been plausible, given that, for instance, rhythmicity enhances the detection of auditory stimuli in noisy environments [49]. Since less accurate asynchrony detection has been reported for more complex stimuli in previous studies [4], it is possible that both rhythmicity and density increased overall stimulus complexity, explaining increased synchrony ratings for both highly dense and highly rhythmic stimuli. However, a worthwhile alternative hypothesis is that increasing rhythmicity promotes general predictability of the stimulus, based on chunking and patterns of accented and unaccented events [50]. Regularity in the stimulus stream, and especially a rhythmical event structure, is among the most effective sources of temporal predictions [51, 52]. Metrical congruency between the visual and the auditory stream is known to make a slight temporal deviation less noticeable, for instance when we observe dancing to music [53]. Factors promoting predictability may therefore increase our proneness to neglect smaller audiovisual asynchronies [54]. We hence propose that both intentionally produced sounds and more rhythmically structured sounds increased (undue) confidence in synchrony. Both effects may be related to increased reliance on top-down predictive models, entailing that asynchronous trials more often go undetected.

Limitations

Tap dancing, hurdling, and drumming are quite different types of action and may introduce further sources of variance beyond those we focused on in the current studies.

Firstly, the amount of motion may differ between these types of action. The experiments reported here focused on how natural motion sounds are processed. As reported, we assessed this factor in terms of motion energy, and our statistical analysis suggested no significant differences of motion energy in the three tested types of action. Still, the impact of further and more fine-grained parameters describing motion on audiovisual integration, including for instance movement velocity, acceleration, smoothness or entropy (hence predictability) [55] as well as dynamic features related to rhythmic structure in movements [56], remains to be further examined using sound-generating whole-body movements.

Secondly, tap dancing and hurdling may differ with regard to the arousal or emotional responses they trigger. In a previous study [28], in which the stimulus set included the videos used in Study 1, we asked participants to indicate whether they found hurdling or tap dancing more difficult, and to rate the quality of the performance in each single trial. These ratings did not reveal any significant differences between tap dancing and hurdling videos. While hurdling and tap dancing are comparable sports in many respects, they differ in terms of expressive or aesthetic appeal. The studies reported here cannot rule out the influence of this factor, which should be the subject of future investigation.

Finally, one may expect tap dancing videos to attract more auditory attention than hurdling videos. To be sure, in the current study participants were instructed to deliver an explicit judgment on the synchrony of the audiovisual stimuli, which obviously entails attention to both modalities. However, one may speculate that auditory attention is still higher when we observe tap dancing, simply because sounds are produced intentionally in this condition. Two previous fMRI studies including the videos used in Study 1 did not support such an attentional bias for tap dancing [28, 29]. Attention has been found to reverse the typical BOLD attenuation effects observed for predicted stimuli, leading to enhanced responses in primary sensory cortices [57]. Contrary to an attentional bias hypothesis, primary auditory cortex was significantly and replicably attenuated for tap dancing compared to hurdling in both fMRI studies [28, 29]. These findings are difficult to reconcile with an attentional interpretation of the stronger synchrony bias in tap dancing. Rather, and in line with the fMRI effects, we assume that predictability of the auditory signal plays a crucial role, making intentionally produced sounds more prone to be integrated with their respective visual motion patterns than incidentally produced sounds.

Although it would have been possible to investigate audiovisual integration in the perception of intentionally and incidentally produced sounds using artificially generated stimuli, for the current series of experiments, the focus was on investigating natural movement sounds in an ecologically valid context. This also had the particular advantage of not biasing subjects’ attention in any direction, since we were generating a quite natural perceptual situation. Another approach would have been to combine identical natural actions with intentional and incidental sounds. Here, however, we expected a confound in the sense that subjects would have expected the intentionally produced sound and not the incidental one. Thus, surprise or even irritation effects would have occurred in the incidental condition and would probably have strongly biased the comparison.

Conclusions

While almost all of our physical actions produce sounds, existing research on audiovisual perception is largely restricted to language and music, and only a handful of studies consider sounds created by object manipulations. However, since speech and music stand out as intentionally produced sounds, it is unclear whether they can be considered representative of action-induced sounds and their audiovisual integration in general. Our present studies contribute to the still very limited number of studies examining audiovisual integration of natural non-speech stimuli (e.g. [19, 21–24, 48]). Study 1 showed that typical effects reported for audiovisual speech integration extend to the perception of audiovisual asynchrony in whole-body actions, with shorter asynchronies leading to higher synchrony ratings and an asymmetric temporal integration window favoring integration at visual-first asynchronies. As expected, these effects were even stronger for intentionally as compared to incidentally generated action sounds. Study 2 suggested that high event density effectively disturbs the discrimination of audiovisual asynchronies. As the auditory event density of speech exceeds that of most other types of action-induced sounds, it remains to be investigated whether the considerable bias for integrating asynchronous audiovisual speech stimuli is (at least partly) due to its exceptionally high event density. At low event densities, stronger rhythmicity also increased the overall audiovisual integration bias. We suggest that rhythmicity and intentionality of sound production promote (undue) trust in synchrony because both foster reliance on a predictive mode of processing. It remains to be tested whether event density and/or rhythmicity have the same effect on incidentally generated action sounds.
To clarify this question, and to further our understanding of common principles of audiovisual integration beyond speech and music [17], more research is needed addressing audiovisual integration in incidentally generated action sounds and more real-life audiovisual stimuli, considering the full range of sound features contributing to the variance of audiovisual integration biases.

Supporting information

S1 Fig. Example of drumming sequence with high and low rhythmicity (Study 2).

Rhythmicity was operationalized as the variation of the amplitude envelope, shown here for two exemplary drumming sequences. While the event density of both recordings is virtually identical (3.42 and 3.39, respectively), the auditory events in the left recording are highly similar in loudness, resulting in low rhythmicity overall (v = 0.25). In contrast, the auditory events within the right recording vary more strongly in loudness, with almost equidistant duplets of loud (i.e. accentuated) events interspersed with less accentuated events, resulting in high rhythmicity overall (v = 0.78).

(TIFF)

S2 Fig. Motion energy, Study 1 and 2.

The amount of motion, quantified as the number of moving pixels per video, for all PLD videos employed to generate the audiovisual asynchronous stimuli in Study 1 and Study 2. Each black marker depicts the motion energy of one video (see Methods of Study 1 for details).

(TIFF)

S1 Video. Sample video Study 1.

Hurdling, auditory first, 120 ms asynchrony.

(MP4)

S2 Video. Sample video Study 1.

Hurdling, auditory first, 400 ms asynchrony.

(MP4)

S3 Video. Sample video Study 1.

Hurdling, visual first, 120 ms asynchrony.

(MP4)

S4 Video. Sample video Study 1.

Hurdling, visual first, 400 ms asynchrony.

(MP4)

S5 Video. Sample video Study 1.

Tap dancing, auditory first, 120 ms asynchrony.

(MP4)

S6 Video. Sample video Study 1.

Tap dancing, auditory first, 400 ms asynchrony.

(MP4)

S7 Video. Sample video Study 1.

Tap dancing, visual first, 120 ms asynchrony.

(MP4)

S8 Video. Sample video Study 1.

Tap dancing, visual first, 400 ms asynchrony.

(MP4)

S9 Video. Sample video Study 2.

Drumming, high event density, high rhythmicity.

(MP4)

S10 Video. Sample video Study 2.

Drumming, high event density, low rhythmicity.

(MP4)

S11 Video. Sample video Study 2.

Drumming, low event density, high rhythmicity.

(MP4)

S12 Video. Sample video Study 2.

Drumming, low event density, low rhythmicity.

(MP4)

Acknowledgments

We would like to thank Monika Mertens and Marie Kleinbielen for their help during data collection, Theresa Eckes, Alina Eisele, Marie Kleinbielen, and Katharina Thiel for their assistance during filming and with creating the stimulus material, and Niklas Petersen for calculating the motion energy scores. Finally, we would like to thank Nadiya El-Sourani, Amelie Huebner, Klara Hagelweide, Laura Quante, Marlen Roehe and Lena Schliephake for rewarding discussions.

Data Availability

All files are available from the OSF database: Schubotz, Ricarda. 2020. “AVIA - Audiovisual Integration in Hurdling, Tap Dancing and Drumming.” OSF. October 2. osf.io/ksma6. DOI: 10.17605/OSF.IO/KSMA6.

Funding Statement

Author: RIS; Grant number: SCHU1439/4-2; Name of funder: German Research Foundation (DFG, Deutsche Forschungsgemeinschaft); URL: https://www.dfg.de/en/index.jsp. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Ernst MO, Bülthoff HH. Merging the senses into a robust percept. Trends in Cognitive Sciences. 2004;8(4):162–9. doi: 10.1016/j.tics.2004.02.002 [DOI] [PubMed] [Google Scholar]
  • 2.van Wassenhove V, Grant KW, Poeppel D. Temporal window of integration in auditory-visual speech perception. Neuropsychologia. 2007;45(3):598–607. doi: 10.1016/j.neuropsychologia.2006.01.001 [DOI] [PubMed] [Google Scholar]
  • 3.Chen YC, Spence C. Assessing the role of the “unity assumption” on multisensory integration: A review. Frontiers in Psychology. 2017;8:445. doi: 10.3389/fpsyg.2017.00445 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Zhou HY, Cheung EFC, Chan RCK. Audiovisual temporal integration: Cognitive processing, neural mechanisms, developmental trajectory and potential interventions. Neuropsychologia. 2020;140:107396. doi: 10.1016/j.neuropsychologia.2020.107396 [DOI] [PubMed] [Google Scholar]
  • 5.Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA. The natural statistics of audiovisual speech. PLoS Computational Biology. 2009;5(7). doi: 10.1371/journal.pcbi.1000436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Schwartz JL, Savariaux C. No, There Is No 150 ms Lead of Visual Speech on Auditory Speech, but a Range of Audiovisual Asynchronies Varying from Small Audio Lead to Large Audio Lag. PLoS Computational Biology. 2014;10(7). doi: 10.1371/journal.pcbi.1003743 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Grant KW, Van Wassenhove V, Poeppel D. Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication. 2004;44(1–4 SPEC. ISS.):43–53. [Google Scholar]
  • 8.McGurk H, MacDonald J. Hearing lips and seeing voices. Nature. 1976;264(5588):746–8. doi: 10.1038/264746a0 [DOI] [PubMed] [Google Scholar]
  • 9.Venezia JH, Thurman SM, Matchin W, George SE, Hickok G. Timing in audiovisual speech perception: A mini review and new psychophysical data. Attention, Perception, and Psychophysics. 2016;78(2):583–601. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Grant KW, Greenberg S. Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information. Proceedings of the Conference on Auditory-Visual Speech Processing (AVSP). 2001;(1):132–7. [Google Scholar]
  • 11.Vroomen J, Keetels M. Perception of intersensory synchrony: A tutorial review. Attention Perception & Psychophysics. 2010;72(4):871–84. doi: 10.3758/APP.72.4.871 [DOI] [PubMed] [Google Scholar]
  • 12.Stevenson RA, Wallace MT. Multisensory temporal integration: Task and stimulus dependencies. Experimental Brain Research. 2013;227(2):249–61. doi: 10.1007/s00221-013-3507-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Jicol C, Proulx MJ, Pollick FE, Petrini K. Long-term music training modulates the recalibration of audiovisual simultaneity. Experimental Brain Research. 2018;236(7):1869–80. doi: 10.1007/s00221-018-5269-4 [DOI] [PubMed] [Google Scholar]
  • 14.Fujisaki W, Shimojo S, Kashino M, Nishida S. Recalibration of audiovisual simultaneity. Nature Neuroscience. 2004;7(7):773–8. doi: 10.1038/nn1268 [DOI] [PubMed] [Google Scholar]
  • 15.Massaro DW, Cohen MM, Smeele PMT. Perception of asynchronous and conflicting visual and auditory speech. The Journal of the Acoustical Society of America. 1996. Sep;100(3):1777–86. doi: 10.1121/1.417342 [DOI] [PubMed] [Google Scholar]
  • 16.Navarra J, Vatakis A, Zampini M, Soto-Faraco S, Humphreys W, Spence C. Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research. 2005;25(2):499–507. doi: 10.1016/j.cogbrainres.2005.07.009 [DOI] [PubMed] [Google Scholar]
  • 17.Schutz M, Gillard J. On the generalization of tones: A detailed exploration of non-speech auditory perception stimuli. Scientific Reports. 2020;10(1):1–14. doi: 10.1038/s41598-019-56847-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Petrini K, Pollick FE, Dahl S, McAleer P, McKay L, Rocchesso D, et al. Action expertise reduces brain activity for audiovisual matching actions: An fMRI study with expert drummers. NeuroImage. 2011;56(3):1480–92. doi: 10.1016/j.neuroimage.2011.03.009 [DOI] [PubMed] [Google Scholar]
  • 19.Petrini K, Dahl S, Rocchesso D, Waadeland CH, Avanzini F, Puce A, et al. Multisensory integration of drumming actions: Musical expertise affects perceived audiovisual asynchrony. Experimental Brain Research. 2009;198(2–3):339–52. doi: 10.1007/s00221-009-1817-2 [DOI] [PubMed] [Google Scholar]
  • 20.Love SA, Petrini K, Cheng A, Pollick FE. A Psychophysical Investigation of Differences between Synchrony and Temporal Order Judgments. PLoS ONE. 2013;8(1):e54798. doi: 10.1371/journal.pone.0054798 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Dixon NF, Spitz L. The detection of auditory visual desynchrony. Perception. 1980;9(6):719–21. doi: 10.1068/p090719 [DOI] [PubMed] [Google Scholar]
  • 22.Vatakis A, Spence C. Audiovisual synchrony perception for speech and music assessed using a temporal order judgment task. Neuroscience Letters. 2006;393(1):40–4. doi: 10.1016/j.neulet.2005.09.032 [DOI] [PubMed] [Google Scholar]
  • 23.Stekelenburg JJ, Vroomen J. Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of cognitive neuroscience. 2007. Dec;19(12):1964–73. doi: 10.1162/jocn.2007.19.12.1964 [DOI] [PubMed] [Google Scholar]
  • 24.Eg R, Behne DM. Perceived synchrony for realistic and dynamic audiovisual events. Frontiers in Psychology. 2015;6:736. doi: 10.3389/fpsyg.2015.00736 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Vatakis A, Maragos P, Rodomagoulakis I, Spence C. Assessing the effect of physical differences in the articulation of consonants and vowels on audiovisual temporal perception. Frontiers in Integrative Neuroscience. 2012;6:1–18. doi: 10.3389/fnint.2012.00001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Noel J-P, Stevenson RA, Wallace MT. Atypical Audiovisual Temporal Function in Autism and Schizophrenia: Similar Phenotype, Different Cause. European Journal of Neuroscience. 2018;47(10):1230–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Van der Stoep N, Van der Stigchel S, Van Engelen RC, Biesbroek J, Nijboer TC. Impairments in Multisensory Integration after Stroke. Journal of Cognitive Neuroscience. 2019;31(6):885–99. doi: 10.1162/jocn_a_01389 [DOI] [PubMed] [Google Scholar]
  • 28.Heins N, Pomp J, Kluger DS, Trempler I, Zentgraf K, Raab M, et al. Incidental or Intentional? Different Brain Responses to One’ s Own Action Sounds in Hurdling vs. Tap Dancing. Frontiers in neuroscience. 2020;14:483. doi: 10.3389/fnins.2020.00483 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Heins N, Trempler I, Zentgraf K, Raab M, Schubotz RI. Too Late! Influence of Temporal Delay on the Neural Processing of One’s Own Incidental and Intentional Action-Induced Sounds. Frontiers in Neuroscience. 2020;14:573970. doi: 10.3389/fnins.2020.573970 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Lartillot O, Toiviainen P. A Matlab toolbox for musical feature extraction from audio. Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007. 2007;September. [Google Scholar]
  • 31.Nishimoto S, VU AT, Naselaris T, Benjamin Y, Yu B, Gallant JL. Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol. 2011;21(19):1641–6. doi: 10.1016/j.cub.2011.08.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Eg R, Griwodz C, Halvorsen P, Behne D. Audiovisual robustness: exploring perceptual tolerance to asynchrony and quality distortion. Multimedia Tools and Applications. 2014;74(2):345–65. [Google Scholar]
  • 33.Tourville JA, Reilly KJ, Guenther FH. Neural mechanisms underlying auditory feedback control of speech. NeuroImage. 2008;39(3):1429–43. doi: 10.1016/j.neuroimage.2007.09.054 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Keough D, Jones JA. The sensitivity of auditory-motor representations to subtle changes in auditory feedback while singing. The Journal of the Acoustical Society of America. 2009;126(2):837–46. doi: 10.1121/1.3158600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Pfordresher PQ, Beasley RTE. Making and monitoring errors based on altered auditory feedback. Frontiers in Psychology. 2014;5(August):914. doi: 10.3389/fpsyg.2014.00914 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kennel C, Streese L, Pizzera A, Justen C, Hohmann T, Raab M. Auditory reafferences: the influence of real-time feedback on movement control. Frontiers in Psychology. 2015;6(January):1–6. doi: 10.3389/fpsyg.2015.00001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Horváth J, Maess B, Baess P. Action–Sound Coincidences Suppress Evoked Responses of the Human Auditory Cortex in EEG and MEG. 2012;1919–31. [DOI] [PubMed] [Google Scholar]
  • 38.Baess P, Horváth J, Jacobsen T, Schröger E. Selective suppression of self-initiated sounds in an auditory stream: An ERP study. Psychophysiology. 2011;48(9):1276–83. doi: 10.1111/j.1469-8986.2011.01196.x [DOI] [PubMed] [Google Scholar]
  • 39.Aliu SO, Houde JF, Nagarajan SS. Motor-induced suppression of the auditory cortex. Journal of Cognitive Neuroscience. 2009;21(4):791–802. doi: 10.1162/jocn.2009.21055 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Straube B, Van Kemenade BM, Arikan BE, Fiehler K, Leube DT, Harris LR, et al. Predicting the multisensory consequences of one’s own action: Bold suppression in auditory and visual cortices. PLoS ONE. 2017;12(1):1–25. doi: 10.1371/journal.pone.0169131 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Kaiser J, Schütz-Bosbach S. Sensory attenuation of self-produced signals does not rely on self-specific motor predictions. European Journal of Neuroscience. 2018;47(11):1303–10. [DOI] [PubMed] [Google Scholar]
  • 42.Arai T, Greenberg S. The temporal properties of spoken Japanese are similar to those of English. Proc Eurospeech’97. 1997;2(February):1011–4. [Google Scholar]
  • 43.Greenberg S. A Multi-Tier Framework for Understanding Spoken Language. In: Listening to speech: An auditory perspective. Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers; 2006. p. 411–33. [Google Scholar]
  • 44.Fujisaki W, Nishida S. Temporal frequency characteristics of synchrony-asynchrony discrimination of audio-visual signals. Experimental Brain Research. 2005;166(3–4):455–64. doi: 10.1007/s00221-005-2385-8 [DOI] [PubMed] [Google Scholar]
  • 45.Nour MM, Nour JM. Perception, illusions and Bayesian inference. Psychopathology. 2015;48(4):217–21. doi: 10.1159/000437271 [DOI] [PubMed] [Google Scholar]
  • 46.Blank H, Spangenberg M, Davis MH. Neural prediction errors distinguish perception and misperception of speech. Journal of Neuroscience. 2018;38(27):6076–89. doi: 10.1523/JNEUROSCI.3258-17.2018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Noppeney U, Lee HL. Causal inference and temporal predictions in audiovisual perception of speech and music. Annals of the New York Academy of Sciences. 2018;1423(1):102–16. [DOI] [PubMed] [Google Scholar]
  • 48.Petrini K, Dahl S, Rocchesso D, Waadeland CH, Avanzini F, Puce A, et al. Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony. Experimental Brain Research. 2009;198(2–3):339–52. doi: 10.1007/s00221-009-1817-2 [DOI] [PubMed] [Google Scholar]
  • 49.ten Oever S, Schroeder CE, Poeppel D, van Atteveldt N, Zion-Golumbic E. Rhythmicity and cross-modal temporal cues facilitate detection. Neuropsychologia. 2014;63:43–50. doi: 10.1016/j.neuropsychologia.2014.08.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Hardesty J. Building Blocks of Rhythmic Expectation. MUME 2016—The Fourth International Workshop on Musical Metacreation. 2016;(September). [Google Scholar]
  • 51.Henry MJ, Obleser J. Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(49):20095–100. doi: 10.1073/pnas.1213390109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Cravo AM, Rohenkohl G, Wyart V, Nobre AC. Temporal expectation enhances contrast sensitivity by phase entrainment of low-frequency oscillations in visual cortex. Journal of Neuroscience. 2013; doi: 10.1523/JNEUROSCI.4675-12.2013 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Su YH. Metrical congruency and kinematic familiarity facilitate temporal binding between musical and dance rhythms. Psychonomic Bulletin and Review. 2018;25(4):1416–22. doi: 10.3758/s13423-018-1480-3 [DOI] [PubMed] [Google Scholar]
  • 54.De Lange FP, Heilbron M, Kok P. How Do Expectations Shape Perception? Perceptual Consequences of Expectation. Trends in Cognitive Sciences. 2018;1–16. [DOI] [PubMed] [Google Scholar]
  • 55.Orlandi A, Cross ES, Orgs G. Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition. 2020;205(September):104446. doi: 10.1016/j.cognition.2020.104446 [DOI] [PubMed] [Google Scholar]
  • 56.Su YH. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition. 2014;131(3):330–44. doi: 10.1016/j.cognition.2014.02.004 [DOI] [PubMed] [Google Scholar]
  • 57.Kok P, Rahnev D, Jehee JFM, Lau HC, de Lange FP. Attention reverses the effect of prediction in silencing sensory signals. Cerebral Cortex. 2012. Sep;22(9):2197–206. doi: 10.1093/cercor/bhr310 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Alice Mado Proverbio

8 Dec 2020

PONE-D-20-30978

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PLOS ONE

Dear Dr. Schubotz,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, I urge the authors to reconsider the presentation of the rationale and the discussion section, including the interpretation of the results. Please submit your revised manuscript by Jan 22 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors presented two attractive behavioural studies that aimed to investigate audiovisual integration in whole-body actions. They presented the participants with short videos depicting point-light displays of hurdling vs. tap-dancing actions (Study 1) and drumming actions (Study 2). The movements could be in synch or out of synch with the produced sounds, and in the latter case, the authors systematically varied the delay between visual and auditory stimulation and their order (video-first vs. audio-first). The participants were instructed to explicitly judge the synchronicity of the stimuli. The results indicated higher synchrony ratings for shorter (vs. longer) asynchrony intervals, for visual stimulation preceding (vs. following) the auditory consequences, and for higher (vs. lower) event density. The manuscript is well-organized, the language is appropriate, and the investigation of audiovisual integration in the context of whole-body actions is certainly interesting. That said, I encourage the authors to take into consideration a series of points that could strengthen the work if integrated with their current proposal. My concerns are specifically focused on the methodological aspects of the studies.

Major points:

- I appreciated that the authors reported possible confounding factors from study 1 and used them to motivate and create Study 2. However, I invite the authors to evaluate further factors that could have impacted or modulated their results.

(1) Did the authors check for a possible confound effect of familiarity of the stimuli? Did the participants have similar familiarity and/or expertise with tap dancing, drumming, and hurdling? The effects of visuomotor expertise (e.g., music, dance, sports) on multisensory processing of whole-body actions are well-documented in the literature.

(2) Did the authors validate their stimuli using a different group of participants before using the videos in their experiment? Were the stimuli comparable in perceived arousal, activation level, or emotional response (e.g., perceived action effort impacts movement feasibility and appreciation judgments)? Did the authors consider possible attention effects (e.g., tap dance movements could be more engaging/rewarding than hurdling movements)?

- My second point concerns the quantification and definition of motion differences between stimulus categories. The authors seem to focus on the acoustic consequences of the actions when creating their categories of stimuli, leaving out the actual movement features. For instance, they based the definition of "density" on the frequency of the produced sounds (e.g., 2.4 Hz vs. 3.4 Hz). There is no information concerning the sequences of movement that produced those sounds. Did the average amount of motion differ between stimulus categories? Did the authors take into consideration the different body parts involved in the movements? Did the stimuli differ in kinematic parameters such as movement speed and acceleration? There is a documented preference for complex movements characterized by a faster and more complex (vs. slower and uniform) temporal profile (Orlandi et al., 2020).

The authors indicated (lines 386-387) that "For the two new conditions (D-R-, D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated". What was the difference in rhythmicity from the kinematic perspective (e.g., variation in the muscular effort or movement acceleration)?

Carrying on with this argument, drumming and hurdling actions involve different body parts (e.g., whole body vs. upper body), with sounds produced by feet vs. hands, which could introduce a further possible confounding factor. (This comment is consistent with my previous point on observers' expertise and their arousal/emotional response to actions.)

I suggest the authors provide a more comprehensive rationale and objective quantification of movement density and rhythmicity. It would be good to have a statistical analysis of the kinematic and acoustic features of the different categories of stimuli (objective quantification). Additionally, the authors could validate their stimuli by asking participants to explicitly rate perceived density and rhythmicity (subjective evaluation).

As an example, a recent paper on Cognition introduced motion smoothness/fluency and entropy measures as indices of action timing complexity/rhythmicity and predictability (Orlandi, A., Cross, E. S., & Orgs, G. (2020). Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition, 205, 104446). Please consider including a reference to the aforementioned work that appears quite crucial in this context.

- My third point concerns the number of trials used in both studies and the corresponding analyses. First, the authors report 144 trials for each study, indicating 8 trials per category in Study 1 and 4 trials per category in Study 2. Secondly, the sample size considered is quite small, especially in Study 1. Hence, on the one hand, I suggest the authors provide a rationale for choosing the two sample sizes (e.g., power analysis). On the other hand, ANOVA may not be the most appropriate statistical method for data analysis. Did the authors check the ANOVA assumptions? Perhaps non-parametric methods would be more sensitive considering the small number of stimuli per condition and the sample size. Alternatively, mixed-effects models (e.g., logistic regression) based on single trials (instead of means) may be more effective.

- I suggest the authors include a limitations paragraph at the end of the discussion section where it is not possible to offer a complete explanation or adjustment for the points raised. Furthermore, in light of the above comments, I suggest the authors take into consideration the role of attentional processes, prior expectation, and anticipation (e.g., the predictive coding framework) when discussing their results.

Minor points:

- In Figures 3 and 5, please consider reporting statistical significance as p-values or an "*" (with related significance level, e.g., 0.05 reported in the captions).

- Please consider including a figure reporting the kinematic and auditory features of the stimuli categories.

Reviewer #2: This is an interesting study investigating auditory-visual integration in the context of action sounds. The guiding hypothesis is that actions that intentionally produced sound (i.e., where sound is the target of action) should produce greater auditory-visual integration such that sound-action pairs should be perceived as synchronous over a wider range of intervals. This is an interesting idea, but there are many features that differ between both the movements and the sounds that participants were being asked to judge. First of all, as the authors themselves report, sound density differed between the hurdling and tap-dancing stimuli in Study 1. The results of Study 2 which uses different stimuli, appear to confirm the importance of sound density, rather than intentionality.

In addition to sound density, the qualities of the intentionally and unintentionally produced sounds differ acoustically, and the movements involved differ in terms of the number of effectors engaged and the perceived trajectory of actions. All of these features are known to influence auditory-visual integration. (See Chuen and Schutz, Atten Percept Psychophys (2016) 78:1512–1528; Su, Y.-H. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition 131, 330–344 (2014); Su, YH. Visual tuning and metrical perception of realistic point-light dance movements. Sci Rep 6, 22774 (2016). https://doi.org/10.1038/srep22774; and Vroomen et al., cited in the manuscript.)

Finally, the intentional and unintentional sounds differ in another important sense: for intentional sounds the sound is the target of action, whereas for the unintentional sounds the movement sequence (or clearing the hurdle) is the target. Thus the focus of learning and attention in the one case is to form an auditory-motor temporal prediction, whereas in the other it is not. The same is true of the comparison case of speech, which is discussed at length in the rationale for the experiment.

Together, there seem to be many features that might contribute to perceived auditory-visual synchrony for these stimuli that are not directly aligned with intentionality. While it may not be possible for the authors to address all of these issues with the current data, they need to be considered more carefully in the presentation of the rationale and the interpretation of the results.

Study 1 Questions:

• The authors interpret their results as consistent with the hypothesis that perceived synchrony will be greater for visual-first stimuli for the tap-dancing compared to the hurdling, and that this is because tap-dancing intentionally produces sound. While it is true that the visual-first perceived synchrony is higher, it is also the case that the auditory-first synchrony is higher as well. So, this seems more like an overall effect of task, and doesn’t really seem to fit with the initial hypothesis.

• In the same vein, performance at 0-asynchrony was better for hurdling than tap-dancing, showing that the hurdling stimuli are more accurately judged than the tap-dancing stimuli. As with the above, the authors consider this as evidence for the “widened temporal window of integration”; I’m not sure how this can be distinguished from a task difficulty effect.

• The authors raise the issue of event density, and examine it in Study 2, but the hurdling stimuli include other possible cues to synchrony such as enhanced visual movement trajectories. Movement trajectory is known to influence perceived timing and auditory-visual integration (See the work of Su, cited above and Vroomens, cited in the manuscript). This issue is addressed in the drumming study, where trajectories across the conditions are more equivalent, and there are no differences at the 0-delay condition.

• The authors hypothesized that event density might affect performance. Was an analysis done to look at the results of Study 1 controlling for event density? If event density does not affect the pattern of results this would be better evidence supporting their hypothesis.

Study 2 Questions:

• The results for the high-density drumming stimuli, which are at a similar rate to the tap-dancing show a very similar pattern of performance. Putting the results of Study 1 and 2 together suggests that the main factor differentiating the audiovisual simultaneity judgements is event density, rather than intentionality. The authors themselves say that the results are consistent with previous findings that simultaneity judgements “collapse” at high event densities.

• The interaction between density and rhythmicity is not described in the Results, only in the Discussion. The authors focus on the main effect of rhythmicity, but this is really driven by the interaction with density.

• The authors try to interpret this finding as indicating that rhythmicity does not affect synchrony judgements for hurdling, but this may not really be true. Running steps are not strictly rhythmic in the way music rhythms are, so it may be hard to compare. I agree that it is not immediately obvious why the more metrically simple stimuli are perceived as more simultaneous, but there may be some sort of “attractor” effect of the beat point. It would be worthwhile to review the literature on beat.

The Discussion is relatively underdeveloped and consists largely of a rehash of the findings. Better integration of the findings with the literature is needed. Also, the authors have two previous brain imaging papers using the tap-dancing and hurdling stimuli. It seems like integration of the goals and findings of the current studies with that previous work would make this paper much more substantial.

Minor points:

This sentence is unclear (lines 89-93): “Along these lines, Eg & Behne (24) employed long running and eventful stimuli in their study and concluded that these more natural stimuli can and should be used more in audiovisual asynchrony studies. On the other hand, aberrant audiovisual integration in psychiatric diseases (26) and neurological impairments (27) may well apply beyond speech and music, and thus affect the perception and control of own action.” In the first sentence a more concrete description of the stimuli would be helpful. In the second, it is hard to understand what is meant.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130.r002

Author response to Decision Letter 0


10 Feb 2021

Dear Professor Proverbio,

We are very grateful to you and the reviewers for the constructive feedback regarding our manuscript. We have addressed the reviewers’ valuable comments and suggestions accordingly and wish to submit a revised version of the manuscript for further consideration. Changes to the manuscript have been highlighted and a point-by-point response to the reviewers’ comments can be found below.

Thank you for your time and consideration. We look forward to hearing from you.

Yours sincerely,

On behalf of the co-authors

Ricarda Schubotz

Reviewer #1:

The authors presented two attractive behavioural studies that aimed to investigate audiovisual integration in whole-body actions. They presented the participants with short videos depicting point-light displays from hurdling vs. tap-dancing actions (Study 1) and drumming actions (Study 2). The movements could be in synch or out of synch with the produced sounds, and in the latter case, the authors systematically varied the delay between visual and auditory stimulations and their order (video-first vs. audio-first). The participants were instructed to judge the synchronicity of the stimuli explicitly. The results indicated higher synchrony ratings for shorter (vs. longer) asynchrony intervals, for visual stimulation preceding (vs. following) auditory consequences, and for higher (vs. lower) event density. The manuscript is well-organized, the language is appropriate, and the investigation of audiovisual integration in the context of whole-body actions is certainly interesting. That said, I encourage the authors to take into consideration a series of points that could strengthen the work if integrated with their current proposal. My concerns are specifically focused on the methodological aspects of the studies.

Major points:

- I appreciated that the authors reported possible confounding factors from study 1 and used them to motivate and create Study 2. However, I invite the authors to evaluate further factors that could have impacted or modulated their results.

We appreciate the valuable comments and suggestions made by the Reviewer.

(1) Did the authors check for a possible confound effect of familiarity of the stimuli? Did the participants have similar familiarity and/or expertise with tap dancing, drumming, and hurdling? The effects of visuomotor expertise (e.g., music, dance, sports) on multisensory processing of whole-body actions are well-documented in the literature.

We agree with the Reviewer that familiarity and motor expertise would have been confounding factors. We therefore excluded participants who had received training in tap dancing, drumming or hurdling. Since tap dancing and hurdling are fairly uncommon types of sports in Germany, and drumming classes are also not very common, we had no problems recruiting naïve participants. We have incorporated this exclusion criterion in the Methods sections.

(2) Did the authors validate their stimuli using a different group of participants before using the videos in their experiment? Were the stimuli comparable in perceived arousal, activation level, or emotional response (e.g., perceived action effort impacts movement feasibility and appreciation judgments)? Did the author consider possible attention effects (e.g., tap dance movements could be more engaging/rewording than hurdling movements)?

We did not validate the stimulus material in the strict sense according to perceived arousal or activation level. We had some experience with the stimulus material from the recording sessions (in a different set of participants) and from findings in two previous fMRI studies including a test-retest quality of performance rating (Heins et al., 2020a; 2020b). Based on these experiences and findings we were (and are) confident that there were no differences between hurdling and tap dancing regarding either perceived action effort, attention, or feelings of reward, for the following reasons. In these previous studies, we presented participants with videos showing their own training performance in hurdling and tap dancing and asked them to rate the quality of their own performance on a Likert scale. We suggest that if there were effects of emotional responses or arousal that yielded differences between tap dancing and hurdling, this group of participants would have been particularly prone to show them, possibly even more than our naïve participants in the present studies. However, participants rated their own performance in tap dancing and in hurdling equally positively, suggesting no differences regarding estimates of effort, rewarding feelings or appreciation of performance. Moreover, we asked this former group whether they found hurdling or tap dancing more difficult during training, and these ratings also yielded no significant differences between hurdling and tap dancing. Based on these previous findings, we did not expect confounding effects of perceived ease of movements in the current study.

Moreover, regarding attentional effects, the fMRI effects in the mentioned precursor studies using the same type of stimulus material did not suggest increased attention when participants perceived and judged tap dancing, as compared to seeing and judging hurdling trials. Attention has been found to reverse the typical BOLD attenuation effects observed in primary sensory cortices for predicted vs. non-predicted stimuli, leading to enhanced rather than attenuated responses (Reznik et al., 2015; Schröger et al., 2015; Wollman and Morillon, 2018). Interestingly, primary auditory cortex was attenuated in tap dancing compared to hurdling, favoring a prediction-caused attenuation over the attention-caused enhancement explanation of our findings, as hypothesized in these previous fMRI studies.

We include these observations and considerations from our previous studies in the new limitations section.

- My second point concerns the quantification and definition of motion differences between stimulus categories. The authors seem to focus on the acoustic consequences of the actions when creating their categories of stimuli, leaving out the actual movement features. For instance, they based the definitions of "density" on the produced sounds' frequency (e.g., 2.4 Hz vs. 3.4 Hz). There is no information concerning the sequences of movement that produced those sounds. Did the average amount of motion differ between stimulus categories? Did the authors take into consideration the different body parts involved in the movements? Did the stimuli differ in kinematic parameters such as movement speed and acceleration? There is a documented preference for complex movements characterized by a faster and more complex (vs. slower and uniform) temporal profile (Orlandi et al., 2020).

We agree with the Reviewer’s thoughtful point. While we cannot differentiate between more fine-grained movement features such as movement speed and acceleration, we have now added an analysis of the amount of motion in hurdling, tap dancing, and drumming using a motion energy (ME) calculation implemented in MATLAB (please see changes made in the Methods section of Study 1 and the new Supplementary Figure S2). The mean ME scores were 1052 for drumming, 1220 for tap dancing, and 1189 for hurdling. A Kruskal-Wallis test by ranks showed no significant difference between motion energy in hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12). We report this finding in the revised manuscript as well.
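For transparency, the Kruskal-Wallis comparison can be sketched in a few lines of Python with SciPy; note that the per-video ME values below are illustrative placeholders, not our actual stimulus data, so the resulting statistics will not match those reported above:

```python
import numpy as np
from scipy import stats

# Hypothetical per-video motion energy (ME) scores for each action type
# (illustrative placeholder values, not the study data).
me_drumming = np.array([960.0, 1010.0, 1060.0, 1110.0, 1160.0, 1010.0])
me_tapdancing = np.array([1130.0, 1180.0, 1230.0, 1280.0, 1330.0, 1170.0])
me_hurdling = np.array([1100.0, 1150.0, 1200.0, 1250.0, 1300.0, 1130.0])

# Kruskal-Wallis test by ranks: does ME differ across the three action types?
h_stat, p_value = stats.kruskal(me_drumming, me_tapdancing, me_hurdling)
print(f"chi2(2) = {h_stat:.2f}, p = {p_value:.3f}")
```

The test is non-parametric (rank-based), so it is appropriate for the small number of videos per category.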

Regarding the body parts involved in the movements, this factor was balanced between the two stimulus categories in Study 1, where point light walkers wore the same number of markers, and hence, the entire body was shown in motion in these videos. For Study 2, of course, there were only markers on the upper part of the drummer’s body. In this respect, stimuli used in Study 1 and Study 2 cannot be directly compared. Please note that we did not intend to compare Study 1 and Study 2 directly, but rather to provide a report on two consecutive experiments, each with its own questions and hypotheses. We rephrased some parts of the original manuscript to make this point clearer than we did before.

We also thank the Reviewer for drawing our attention to the recent study of Orlandi and co-workers which we now cite in the revised manuscript. It would be very interesting, indeed, to examine the effects of further and more fine-grained motion properties on perceived synchronicity in sound-generating whole-body movements.

The authors indicated (lines 386-387) that "For the two new conditions (D-R-, D+R+), he was asked to play the previously played sequences either less (D-R-) or more (D+R+) accentuated". What was the difference in rhythmicity from the kinematic perspective (e.g., variation in the muscular effort or movement acceleration)?

As mentioned above, we are not able to assess acceleration in our stimulus material, but we included the drumming videos in the analysis of motion energy, as described above.

Carrying on with this argument, drumming and hurdling actions involve different body parts (e.g., whole-body vs. upper body), with sounds produced by feet vs. hands, which could introduce a further possible confounding factor. (This comment is consistent with my previous point on observers' expertise and their arousal/emotional response to actions.)

We agree with the Reviewer’s point that directly comparing whole-body and upper-body movements would entail a confound. Note that we did not mean to directly compare drumming and hurdling. We compared hurdling with tap dancing (Study 1), and we compared the parameters of event density and rhythmicity within drumming (Study 2). Thus, conclusions were not drawn from a direct comparison of Studies 1 and 2. Study 2 was motivated by the fact that hurdling and tap dancing come at different event densities and rhythmicities, and Study 2 was just meant to examine these factors in more detail. We tried to make this point clearer at the end of the Interim Discussion motivating Study 2.

I suggest the authors provide a more comprehensive rationale and objective quantification of movement density and rhythmicity. It would be good to have a statistical analysis of the kinematic and acoustic features of the different categories of stimuli (objective quantification). Additionally, the authors could validate their stimuli by asking participants to explicitly rate perceived density and rhythmicity (subjective evaluation). As an example, a recent paper in Cognition introduced motion smoothness/fluency and entropy measures as indices of action timing complexity/rhythmicity and predictability (Orlandi, A., Cross, E. S., & Orgs, G. (2020). Timing is everything: Dance aesthetics depend on the complexity of movement kinematics. Cognition, 205, 104446). Please consider including a reference to the aforementioned work, which appears quite crucial in this context.

While we cannot provide as sophisticated and deep an analysis of rhythmicity, complexity and predictability as presented in the study of Orlandi and co-workers, we have now included an objective and more detailed analysis of rhythmicity (on the auditory side) and of the amount of motion (on the visual side). Since the corresponding MATLAB tools also provide an algorithm to calculate event density, we also re-analyzed event density using the same MATLAB tool for reasons of consistency. Please note in this context that using different tools slightly changed the event density scores but did not change the overall data pattern. We added new paragraphs in the Methods sections (“Acoustic feature extraction: Event Density and Rhythmicity” and “Assessment of motion energy”) and new Figures (Fig. 2 and Supplementary Figures S1 and S2) to document these stimulus features.
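Conceptually, event density reduces to counting sound onsets per second in an amplitude envelope. The following minimal Python sketch illustrates the idea on a synthetic click train; the function name, threshold, and minimum inter-onset gap are our illustrative choices, not the defaults of the MATLAB toolbox we actually used:

```python
import numpy as np
from scipy.signal import find_peaks

def event_density(envelope, sr, min_height=0.5, min_gap_s=0.1):
    """Estimate events per second by counting peaks in an amplitude envelope.

    envelope : 1-D array of non-negative amplitude values
    sr       : sampling rate of the envelope in Hz
    The height threshold and minimum inter-onset gap are illustrative
    parameters, not toolbox defaults.
    """
    peaks, _ = find_peaks(envelope, height=min_height,
                          distance=max(1, int(min_gap_s * sr)))
    return len(peaks) / (len(envelope) / sr)

# Synthetic check: an envelope with clicks at 2.4 Hz over 10 s (sr = 100 Hz),
# matching the density reported for the hurdling stimuli.
sr = 100
env = np.zeros(10 * sr)
onsets = (np.arange(24) * sr / 2.4).astype(int) + 5  # 24 events in 10 s
env[onsets] = 1.0
print(event_density(env, sr))  # 2.4 events per second
```

Real action sounds would of course require an envelope extraction step (e.g., rectification and low-pass filtering of the audio) before onset counting.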

Since the number of videos based on which we generated the differently synchronized stimuli was very small (8 for Study 1 and 16 for Study 2), event density and rhythmicity entered as ordinal variables into a post-hoc statistical approach which we now present at the end of the Interim Discussion leading to Study 2. That is, we now objectively motivate the testing of these two factors in Study 2. At the same time, these (albeit only post-hoc) tests show that neither event density nor rhythmicity can fully account for the differences between hurdling and tap dancing, since we found a main effect of action type in both statistical approaches. In the end, the Reviewer’s thoughtful comments have made our paper clearer in this regard, hopefully also in the eyes of this Reviewer. We discuss the consequences of these additional insights and include potential limitations that our studies have in comparison to Orlandi’s work.

- My third point concerns the number of trials used in both studies and the corresponding analysis. First, the authors report 144 trials for each study, indicating 8 trials per category in Study 1 and 4 trials per category in Study 2. Secondly, the sample size considered is quite small, especially in Study 1. Hence, on the one hand, I suggest the authors provide a rationale for choosing the two sample sizes (e.g., power analysis). On the other hand, ANOVA may not be the most appropriate statistical method for data analysis. Did the authors check the ANOVA assumptions? Non-parametric methods may be more sensitive considering the small number of stimuli per condition and sample size. Alternatively, mixed-effects models (e.g., logistic regression) based on single trials (instead of means) may prove more effective.

Regarding the number of trials we presented, there is a misunderstanding. We apologize for the obviously unclear description and modified the respective sentences in the Methods sections.

We stated in the original manuscript for Study 1 (lines 199+): “Three blocks with the experimental task were presented thereafter. Within each of these blocks, all the 72 stimuli (four hurdling and four tap dancing videos, each with nine different audiovisual asynchronies) were presented twice, resulting in a total of 144 trials.” Thus, a total of 432 trials were presented in Study 1.

For Study 2, we said “The experiment consisted of four experimental blocks. Within each of these blocks, each of the 144 stimuli (four D-R+, four D+R-, four D-R-, and four D+R+ videos, each with nine different levels of audiovisual asynchrony) were presented once.” (lines 396++). Thus, a total of 576 trials were presented in Study 2.

We have rephrased corresponding paragraphs in the revised manuscript.

Regarding the ANOVA assumptions, we re-checked them for both Study 1 and Study 2, and for factors where Mauchly's test indicated that the assumption of sphericity was violated, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. We wish to thank the Reviewer for this comment. Note that the Greenhouse-Geisser correction did not change any of the reported significant effects.
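For readers unfamiliar with the correction: the Greenhouse-Geisser epsilon can be computed directly from the covariance matrix of the repeated-measures conditions. The sketch below is a generic implementation of the standard formula, not the exact routine of the statistics software we used; corrected degrees of freedom are obtained by multiplying the uncorrected dfs by epsilon:

```python
import numpy as np

def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix of
    k repeated-measures conditions. epsilon = 1 means sphericity holds;
    the lower bound is 1/(k-1). Corrected dfs: eps*(k-1), eps*(k-1)*(n-1)."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    centering = np.eye(k) - np.ones((k, k)) / k
    s = centering @ cov @ centering          # double-centered covariance
    return np.trace(s) ** 2 / ((k - 1) * np.sum(s ** 2))

# Under sphericity (e.g., equal variances, zero covariances) epsilon is 1:
print(greenhouse_geisser_epsilon(np.eye(4)))  # 1.0
```

When Mauchly's test rejects sphericity, multiplying the degrees of freedom by this epsilon before looking up the F distribution yields the corrected p-value.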

- I suggest the authors include a limitations paragraph at the end of the discussion section, whether it is not possible to offer a complete explanation or adjustment for the points raised. Furthermore, in light of the above comments, I suggest the authors take into consideration the role of attentional processes, prior expectation, and anticipation (e.g., predictive coding framework) when discussing their results.

We included a new limitations paragraph at the end of the General Discussion to clarify aspects that we cannot exclude based on our studies or findings. Moreover, we extended our discussions considering the role of attention and also refer to predictive coding.

Minor points:

- In Figures 3 and 5, please consider reporting statistical significance as p-values or an "*" (with related significance level, e.g., 0.05 reported in the captions).

We modified Fig. 3 and Fig. 5 accordingly; please note that due to an additional Figure (Fig. 2), Fig. 3 and 5 are now labeled Fig. 4 and 6, respectively.

- Please consider including a figure reporting the kinematic and auditory features of the stimuli categories.

We have included two new figures reporting auditory features of the stimuli (Fig. 2 and Supplementary Figure S1) and a new figure reporting motion energy scores for both Studies (Supplementary Figure S2).

Reviewer #2:

This is an interesting study investigating auditory-visual integration in the context of action sounds. The guiding hypothesis is that actions that intentionally produced sound (i.e., where sound is the target of action) should produce greater auditory-visual integration such that sound-action pairs should be perceived as synchronous over a wider range of intervals. This is an interesting idea, but there are many features that differ between both the movements and the sounds that participants were being asked to judge.

We greatly appreciate the generally positive evaluation by this Reviewer and the constructive and helpful comments.

First of all, as the authors themselves report, sound density differed between the hurdling and tap-dancing stimuli in Study 1. The results of Study 2, which uses different stimuli, appear to confirm the importance of sound density, rather than intentionality.

Actually, after having calculated further post-hoc statistics as motivated by the Reviewers, we now find even more objectively and clearly that the effect of intentionality of sound production stands on its own, in addition to the effects of event density and rhythmicity. Please see our more detailed replies to this point below (question number 4 concerning Study 1).

In addition to sound density, the qualities of the intentionally and unintentionally produced sounds differ acoustically, and the movements involved differ in terms of the number of effectors engaged and the perceived trajectory of actions. All of these features are known to influence auditory-visual integration. (See Chuen and Schutz, Atten Percept Psychophys (2016) 78:1512–1528; Su, Y.-H. Peak velocity as a cue in audiovisual synchrony perception of rhythmic stimuli. Cognition 131, 330–344 (2014); Su, Y.-H. Visual tuning and metrical perception of realistic point-light dance movements. Sci Rep 6, 22774 (2016). https://doi.org/10.1038/srep22774; and Vroomen et al., cited in the manuscript.)

We agree with the Reviewer that sound quality as well as movement patterns are relevant factors that have to be considered when comparing intentionally and incidentally sound-generating actions.

First, regarding sound quality differences, we explained in the original manuscript (please see “Stimuli” in Methods, Study 1) that we took measures to render the sound qualities of tap dancing and hurdling as comparable as possible, using the following approach: “In a first step, stimulus intensities of hurdling and tap dancing recordings were normalized separately. In order to equalize the spectral distributions of both types of recordings, the frequency profiles of hurdling and tap dancing sounds were then captured using the Reaper plugin Ozone 5 (iZotope Inc, Cambridge, United States). Finally, the difference curve (hurdling – tap dancing) was used by the plugin’s match function to adjust the tap dancing spectrum to the hurdling reference.” As a result, the sound quality of hurdling and tap dancing was highly similar. Please also listen to the sound in the Supplementary Video Material that we provided for the stimuli.

Regarding the second point, the number of effectors engaged in tap dancing exactly matched those engaged in hurdling. Of course, we did not directly compare the drumming condition (Study 2) with tap dancing or hurdling (Study 1), as here, the number of effectors differed, and markers were restricted to the upper body. We include a statement in the revised manuscript to make this point more prominent.

Finally, we agree that the perceived trajectories of the actions were subjectively different. We now included an additional analysis of the videos quantifying the amount of motion by the index of motion energy (please see the Methods section of Study 1 and the new Supplementary Figure S2). The mean motion energy score was 1220 for tap dancing, 1189 for hurdling, and 1052 for drumming. In order to check whether the amount of motion had to be considered a confounding factor in our studies, we calculated a Kruskal-Wallis test by ranks. This test did not show significant differences between motion energy in hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12).

Finally, the intentional and unintentional sounds differ in another important sense: for intentional sounds the sound is the target of action, whereas for the unintentional sounds the movement sequence (or clearing the hurdle) is the target. Thus, the focus of learning and attention in the one case is to form an auditory-motor temporal prediction, whereas the other does not. The same is true of the comparison case of speech, which is discussed at length in the rationale for the experiment.

Our interest in potential differences between processing incidental and intentional types of sound-generating actions stems from exactly these considerations. According to both psychological theories on action (e.g. common coding) and predictive coding accounts, sound is in either case an effect of the action. But is it in the same sense part of the action goal when we compare incidentally vs. intentionally sound-generating actions? In three previous fMRI studies, two of which we have published so far (Heins et al., 2020a and 2020b), we examined the brain activity for incidental and intentional sound production using “normal” stimuli as well as sound-delayed and sound-deprived video recordings. We found that in tap dancing, primary auditory cortex activity is reduced as compared to hurdling, favoring a stronger predictive “cancelling” of the expected sound. Importantly, attention is known to reverse this effect, leading to enhanced activity in primary sensory cortices. Since we found the opposite, we could (in each of these previous studies) clearly rule out that differences between perceptual processing of tap dancing and hurdling are driven by auditory attention. We now more explicitly consider these previous insights in the revised manuscript and in the limitations section.

While in these previous fMRI studies, we asked participants to judge the quality of the performance in hurdling and tap dancing, participants in the present studies were required to explicitly judge whether the soundtrack and the visual video were synchronous. Thus, task instruction should have ensured that attention was necessarily on both the auditory and the visual channel in either type of action.

Together, there seem to be many features that might contribute to perceived auditory-visual synchrony for these stimuli that are not directly aligned with intentionality. While it may not be possible for the authors to address all of these issues with the current data, they need to be considered more carefully in the presentation of the rationale and the interpretation of the results.

We thank the Reviewer for this suggestion and hope that he/she will find that we have addressed these points in a convincing manner, as detailed in the following.

Study 1 Questions:

• The authors interpret their results as consistent with the hypothesis that perceived synchrony will be greater for visual-first stimuli for tap-dancing compared to hurdling, and that this is because tap-dancing intentionally produces sound. While it is true that the visual-first perceived synchrony is higher, it is also the case that the auditory-first synchrony is higher as well. So this seems more like an overall effect of task, and doesn’t really seem to fit with the initial hypothesis.

Reconsidering the way we introduced Hypothesis 3 in the Introduction of Study 1, we think that there was a misunderstanding, caused by unclarity of our own phrasing.

Based on the literature, we expected a visual-first bias (i.e. a general bias to judge visual-first stimuli more often as synchronous compared to audio-first stimuli) for both action types (Hypothesis 2). Moreover, we expected that this bias vanishes for the longer delays in hurdling but still persists for tap dancing, if it is true that tap dancing is comparable to speech production in having a larger temporal binding window than incidentally produced action sounds (Hypothesis 3). Our data confirmed this assumption, as we found (i) an interaction of ASYNCHRONY SIZE, ASYNCHRONY TYPE, and ACTION TYPE, and more specifically, (ii) a still significant bias towards synchronous judgments for visual-first as compared to audio-first stimuli at the longest asynchrony delay of 400 ms for tap dancing but not for hurdling. Note that Hypothesis 3 did not necessarily imply that the visual-first effect (Hypothesis 2) was generally larger for tap dancing as compared to hurdling.

It is true that we found a main effect of action type, and this is also what we reported in the Results section of Study 1. Importantly, this finding motivated Study 2, as also reported in the original manuscript. Please note that this main effect does not negate the hypothesized and observed “extended” visual-first bias for the largest asynchronies in tap dancing. Note also that the post-hoc analysis of action type and event density, which we provided according to the Reviewers’ suggestion (see below), corroborated a significant main effect of action type.

We have now rephrased parts of the Introduction and checked the entire manuscript to make this point clearer. We also noted that we unnecessarily repeated the explanation of the three hypotheses of Study 1 in a part of the Methods section (“Design and statistical hypotheses”) that was also partly redundant with the Results section. We therefore modified this paragraph to avoid confusion. The same applied to the corresponding paragraph in Study 2. We hope that in doing so, we have improved the overall clarity of the hypotheses.

• In the same vein, performance at 0-asynchrony was better for hurdling than tap-dancing, showing that the hurdling stimuli are more accurately judged than the tap-dancing stimuli. As with the above, the authors consider this as evidence for the “widened temporal window of integration,” but I’m not sure how this can be distinguished from a task difficulty effect.

This is an absolutely valid suggestion and indeed, finding a main effect for task, reflecting that participants were less accurate in judging synchronicity in tap dancing versus hurdling, motivated our Study 2, where we tested potential confounds by different levels of event density and rhythmicity. We make this point more clearly in the Interim Discussion of Study 1. Please note, however, that based on a now more objective quantification of rhythmicity and additional post hoc tests on both factors, the data show even more clearly that, while event density and rhythmicity explain a part of the observed synchronicity bias in tap dancing, they still do not fully explain the effect of intentionality. For instance, event density had different effects on the synchronicity judgment in tap dancing and in hurdling, as shown by an interaction of these two factors in the post hoc analysis.

Based on the original and the additional statistical effects, we suggest, as we now also summarize in the abstract and consider more clearly in the revised Discussion, that overconfidence in the naturally expected, that is, synchrony of sound and sight, was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchronicity judgments simply because it makes the detection of audiovisual asynchrony more difficult. We finally also re-arranged some paragraphs of the Discussion to make it more readable after these changes and additions.

Just to be sure, please note that our interpretation of a widened temporal window of integration for tap dancing was not based on the observation that performance at 0-asynchrony was better for hurdling than tap-dancing, but on the observation that the visual-first bias was still significant at a 400 ms time lag for tap dancing but not for hurdling. Moreover, for the 120 ms visual-first condition, the performance for hurdling was as bad as for tap dancing (cf. Fig. 4, formerly labeled Fig. 3), so we did not (in the original manuscript) and will not (in the revised manuscript) base any argument on the 0-asynchrony level or the smallest levels of audio-visual lag.

• The authors raise the issue of event density, and examine it in Study 2, but the hurdling stimuli include other possible cues to synchrony such as enhanced visual movement trajectories. Movement trajectory is known to influence perceived timing and auditory-visual integration (see the work of Su, cited above, and Vroomen, cited in the manuscript). This issue is addressed in the drumming study, where trajectories across the conditions are more equivalent, and there are no differences at the 0-delay condition.

As described above, we now include an additional analysis of the videos quantifying the amount of motion by a "motion energy" (ME) index. The mean ME values were 1052 for drumming, 1220 for tap dancing, and 1189 for hurdling. A Kruskal-Wallis test by ranks showed no significant difference in motion energy between hurdling, tap dancing and drumming (χ²(2) = 4.2, p = .12). However, since motion energy certainly does not capture all dynamic features describing movement, we mention this restriction in the new limitations section.
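The Kruskal-Wallis test mentioned above can be sketched in a few lines. Note that the per-video values below are hypothetical stand-ins (only the three group means, 1052, 1220 and 1189, are reported in the text), so the resulting statistic will not match the reported χ²(2) = 4.2:

```python
import numpy as np
from scipy import stats

# Hypothetical per-video motion-energy values for the three action types;
# the real per-video data are not given in the response letter.
drumming    = np.array([1010.0, 1045.0, 1080.0, 1073.0])
tap_dancing = np.array([1190.0, 1235.0, 1215.0, 1240.0])
hurdling    = np.array([1150.0, 1200.0, 1180.0, 1226.0])

# Kruskal-Wallis H-test by ranks: a non-parametric alternative to a
# one-way ANOVA, appropriate when normality cannot be assumed.
h_stat, p_value = stats.kruskal(drumming, tap_dancing, hurdling)
print(f"H(2) = {h_stat:.2f}, p = {p_value:.3f}")
```

With the real per-video values substituted in, `stats.kruskal` would reproduce the reported test.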

• The authors hypothesized that event density might affect performance. Was an analysis done to look at the results of Study 1 controlling for event density? If event density does not affect the pattern of results, this would be better evidence supporting their hypothesis.

We agree with the Reviewer that a statistical approach to event density would be more informative. To investigate post hoc the effect of event density on synchronicity judgments in Study 1, we included event density as an ordinal variable replacing action type in our original ANOVA (please see the revised Interim Discussion of Study 1, leading to Study 2). This analysis showed a main effect of event density (F(2.9,61.8) = 71.64, p < .001, Greenhouse-Geisser corrected; χ²(14) = 40.60, p < .001, W = .59), which could mirror our reported main effect of action type. Bonferroni-corrected post hoc pairwise comparisons of the event density levels showed, however, that differences in performance levels did not mirror the separation point between actions. Instead, no difference in performance was found between the four hurdling videos and one tap dancing video (all p ≥ .52), which were all lower in performance than the three tap dancing videos with the highest event densities (all p < .001), while the video with the highest event density again differed significantly from all others (all p < .001). To see whether event density fully explains the original effect of action type, we calculated another ANOVA including action type and event density (as an ordinal variable within an action) as well as asynchrony type and asynchrony size. Here, we found significant main effects of both event density (F(2,42) = 71.09, p < .001) and action type (F(1,21) = 74.26, p < .001), as well as their interaction (F(2,42) = 68.69, p < .001). These findings suggest that event density did not have the same effect on intentionally and incidentally generated action sounds. All in all, these data patterns motivated a direct experimental manipulation of event density, which was then implemented in Study 2.
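The rank-based statistic reported alongside the ANOVA corresponds to a Friedman test across repeated conditions. A minimal sketch of that test follows; the ratings are synthetic stand-ins, and the condition count is reduced for brevity (the reported test had df = 14, i.e. 15 conditions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic synchrony ratings: 22 participants x 8 videos ordered by
# event density, with a built-in monotonic trend across conditions.
n_subjects, n_conditions = 22, 8
ratings = rng.random((n_subjects, n_conditions)) + np.linspace(0, 1, n_conditions)

# Friedman rank test: non-parametric repeated-measures comparison,
# one column of ratings per condition.
chi2, p = stats.friedmanchisquare(*(ratings[:, j] for j in range(n_conditions)))
print(f"chi2({n_conditions - 1}) = {chi2:.2f}, p = {p:.4f}")
```

Ordered post hoc pairwise comparisons (here, Bonferroni-corrected Wilcoxon tests on pairs of columns) would then locate where along the event-density ordering performance actually changes.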

Study 2 Questions:

• The results for the high-density drumming stimuli, which are at a similar rate to the tap dancing, show a very similar pattern of performance. Putting the results of Study 1 and 2 together suggests that the main factor differentiating the audiovisual simultaneity judgements is event density, rather than intentionality. The authors themselves say that the results are consistent with previous findings that simultaneity judgements "collapse" at high event densities.

It is true that event density had a profound effect on audiovisual simultaneity judgments, as we also discussed in our original manuscript. Inspired by the comments of both Reviewers, we additionally conducted a post hoc statistical analysis of event density as a categorical variable varying within both action types. We found significant main effects of both event density and action type, and a significant interaction of these factors. Hence, event density did not have the same effect on intentionally and incidentally generated action sounds, and the type of action, differing in intentionality, had an effect on its own. Obviously, this post hoc analysis cannot provide the same level of evidence regarding the effect of event density as Study 2, since the variances of event density in tap dancing and in hurdling were not really comparable. However, finding main effects for intentionality and event density, as well as their interaction, makes it all the more valuable to consider both Study 1 and Study 2 in a common paper and to discuss both factors as modulating the perception of audiovisual (a)synchrony.

• The interaction between density and rhythmicity is not described in the Results, only in the Discussion. The authors focus on the main effect of rhythmicity, but this is really driven by the interaction with density.

We have now added the statistical effect of the interaction to the Results section of Study 2. Due to an editing error before submission, we had deleted this paragraph of the Results section, which also included all other significant interaction effects. This missing paragraph has now been reinserted. We cordially thank the Reviewer for pointing out this shortcoming.

Albeit small, the main effect of rhythmicity was significant. More interestingly, the effect of rhythmicity was clear when looking only at low event densities. We think that high event densities obscured the specific effects that rhythmicity had on audiovisual (a)synchrony perception. We have modified the General Discussion to make this point more clearly.

• The authors try to interpret this finding as indicating that rhythmicity does not affect synchrony judgements for hurdling, but this may not really be true. Running steps are not strictly rhythmic in the way music rhythms are, so it may be hard to compare. I agree that it is not immediately obvious why the more metrically simple stimuli are perceived as more simultaneous, but there may be some sort of “attractor” effect of the beat point. It would be worthwhile to review the literature on beat.

We are really sorry, but we have to admit that we could not figure out which statement in our manuscript the Reviewer refers to when saying that we interpret this finding as indicating that rhythmicity does not affect synchrony judgments for hurdling. Regarding the fairly strong effect of rhythmicity for (drumming) stimuli with low event density, with synchronous judgments rising from 34% (low rhythmicity) to 47% (high rhythmicity), we now extend the Discussion to consider the effect of enhanced predictability through increased rhythmicity.

As a side note, it is important to note that hurdling is highly rhythmical (one could describe its classical three-step structure as indeed beat-based, consisting of a dotted crotchet followed by a quaver and a quaver triplet, resulting in a two-two meter), and learning and training this rhythm is crucial for achieving good performance. So we would say that rhythm is no less important for hurdling than for tap dancing performance, but the rhythmic structure and periodicity were subjectively, and, as shown by the new analysis of rhythmicity provided in the revised manuscript, also objectively, much more evident in the hurdling videos.

The Discussion is relatively underdeveloped and consists largely of a rehash of the findings. Better integration of the findings with the literature is needed. Also, the authors have two previous brain imaging papers using the tap dancing and hurdling stimuli. It seems that integrating the goals and findings of the current studies with that previous work would make this paper much more substantial.

Following the Reviewer’s suggestion, we have tried to improve the integration of our findings with the literature, including references to our own previous fMRI studies on tap dancing and hurdling. Note that we also slightly re-arranged some paragraphs of the General Discussion to improve the readability and clarity of the interpretations.

Minor points:

This sentence is unclear (lines 89-93): “Along these lines, Eg & Behne (24) employed long running and eventful stimuli in their study and concluded that these more natural stimuli can and should be used more in audiovisual asynchrony studies. On the other hand, aberrant audiovisual integration in psychiatric diseases (26) and neurological impairments (27) may well apply beyond speech and music, and thus affect the perception and control of own action.” In the first sentence a more concrete description of the stimuli would be helpful. In the second, it is hard to understand what is meant.

We have modified this entire paragraph which was indeed somewhat ill-structured.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Alice Mado Proverbio

21 Apr 2021

PONE-D-20-30978R1

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PLOS ONE

Dear Dr. Schubotz,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

While Reviewer 1 is satisfied with the way you responded to previous queries, you will see that Reviewer 2 has noticed serious methodological problems inherent to the experimental paradigm, which I ask you to please address seriously in the revised version of the paper.

Please submit your revised manuscript by Jun 05 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: No

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I much appreciated the authors' effort in considering all points raised during the revision process. The new sections and clarifications have increased the reliability and value of the entire manuscript that is now suitable for publication. I wish the authors the best of luck with their future studies.

Reviewer #3: The study addresses an interesting research question. Does perceived audio-visual synchrony depend on the intentionality of producing sounds? Two experiments are reported. In the first experiment tap dancing and hurdling point light videos are presented to participants at different audio-visual asynchronies. The authors observe a smaller window of perceived synchrony for the hurdling than the tap dancing actions. In order to assess the influence of event density and rhythm in their study, the authors then conduct a second experiment manipulating event density and rhythmicity using drumming stimuli and show that both factors significantly influence simultaneity judgements. Overall, the two experiments appear rather unrelated to each other and I am not convinced that the pattern of results is mainly driven by differences in task difficulty (simultaneity judgements are easier to make for hurdling than for drumming or tap dancing). I have two major concerns related to the choice of experimental design and stimuli on the one hand and the lack of quantification of visual features of the actions on the other hand.

1) It does not become clear to me why the authors chose to compare tap dancing and hurdling in the first place, especially if the idea was to compare intentionally vs. accidentally produced action sounds. In order to show that intentionality matters, it would have been necessary to look at identical actions, and combining these with intentional or accidental action sounds. In the present experimental design intentionality is always confounded with the type of action being performed. Any observed effects can therefore be due to the fact that hurdling is different from tap dancing in many ways other than the intentionality of the action sounds. The second experiment addresses two auditory confounding factors (sound rhythm and event density), by introducing a third new action which is drumming. So the study leaves open many other ways in which these three actions differ both conceptually (artistic vs. competitive) and visually (with respect to their movement kinematics for example)

2) The rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli, although such data appears to be available given that actions were recorded using motion capture. The authors only report overall motion energy for the videos computed from the video. What about event density and rhythm in the visual domain? Could it be that synchrony during tap dancing and drumming are harder to detect because the movement amplitude of drumming and tap dancing is much smaller than that of hurdling? What about the saliency of cyclical motion or the influence of visual perspective? The difficulty of audiovisual integration does not only depend on saliency of auditory features but also of the visual stimuli, yet an assessment of how tap dancing, drumming and hurdling are different from each other with respect to their visual aspects and their movement kinematics is missing entirely.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 22;16(7):e0253130. doi: 10.1371/journal.pone.0253130.r004

Author response to Decision Letter 1


5 May 2021

Dear Professor Proverbio

We are very grateful to you and the reviewers for the constructive feedback regarding our manuscript. We are glad to hear that Reviewer 1 was satisfied with the way we responded to previous queries, and since we did not hear of any further points raised by Reviewer 2, we assume that they were also satisfied with the revision. We have addressed the new points from Reviewer 3 in this revision.

Thank you for your time and consideration. We look forward to hearing from you.

Yours sincerely,

On behalf of the co-authors

Ricarda Schubotz

Reviewer #3

The study addresses an interesting research question. Does perceived audio-visual synchrony depend on the intentionality of producing sounds? Two experiments are reported. In the first experiment tap dancing and hurdling point light videos are presented to participants at different audio-visual asynchronies. The authors observe a smaller window of perceived synchrony for the hurdling than the tap dancing actions. In order to assess the influence of event density and rhythm in their study, the authors then conduct a second experiment manipulating event density and rhythmicity using drumming stimuli and show that both factors significantly influence simultaneity judgements. Overall, the two experiments appear rather unrelated to each other and I am not convinced that the pattern of results is mainly driven by differences in task difficulty (simultaneity judgements are easier to make for hurdling than for drumming or tap dancing).

I have two major concerns related to the choice of experimental design and stimuli on the one hand and the lack of quantification of visual features of the actions on the other hand.

1) It does not become clear to me why the authors chose to compare tap dancing and hurdling in the first place, especially if the idea was to compare intentionally vs. accidentally produced action sounds. In order to show that intentionality matters, it would have been necessary to look at identical actions, and combining these with intentional or accidental action sounds. In the present experimental design intentionality is always confounded with the type of action being performed.

Our reply:

We thank the Reviewer for this important pointer to the apparently still inadequate account of our motivation behind the choice of the movement types studied. The current experiments built directly on a series of fMRI studies in which we investigated hurdling and tap dancing with respect to the perceptual processing of the sounds they produce. We have added new passages to the Introduction and Discussion of the revised manuscript, which we hope motivate the experimental approach more convincingly:

Introduction:

We decided to use two different types of sporting action that allowed us to study the processing of natural movement sounds in an ecologically valid context. This also had the particular advantage that the subjects' attention was not biased in any particular direction, since we created a completely natural perceptual situation.

Limitations:

Although it would have been possible to investigate audiovisual integration in the perception of intentionally and incidentally produced sounds using artificially generated stimuli, for the current series of experiments, the focus was on investigating natural movement sounds in an ecologically valid context. This also had the particular advantage of not biasing subjects' attention in any direction, since we were generating a quite natural perceptual situation. Another approach would have been to combine identical natural actions with intentional and incidental sounds. Here, however, we expected a confound in the sense that subjects would have expected the intentionally produced sound and not the incidental one. Thus, surprise or even irritation effects would have occurred in the incidental condition and would probably have strongly biased the comparison.

Reviewer #3

Any observed effects can therefore be due to the fact that hurdling is different from tap dancing in many ways other than the intentionality of the action sounds. The second experiment addresses two auditory confounding factors (sound rhythm and event density), by introducing a third new action which is drumming. So the study leaves open many other ways in which these three actions differ both conceptually (artistic vs. competitive) and visually (with respect to their movement kinematics for example)

Our reply:

This reference is also helpful for us because it gives us the opportunity to better motivate the choice of movements studied and to address possible limitations of our studies.

With regard to visual differences between the compared sports, we would like to point out here that our focus was on movement sounds. In our first revision, following the Reviewer's suggestion, we added an analysis of motion energy and found no significant differences. In addition, we would like to point out that we used only point-light videos in both experiments and thus controlled for much of the irrelevant visual information. Please also consider our reply to point 2) raised by the Reviewer.

With regard to the conceptual differences raised by the Reviewer, we agree that drumming as well as tap dancing are expressive and music-related actions, whereas hurdling is not. Please note that we did not mean to directly compare drumming (Exp. 2) with hurdling or tap dancing (Exp. 1), as pointed out in the previous version of the manuscript: “Employing drumming PLDs enabled a direct control of event density and rhythmicity in an otherwise natural human motion stimulus. Note that using drumming actions, we kept intentionality of sound production constant while varying event density and rhythmicity as independent experimental factors. Since PLD markers were restricted to the upper body of the drummer, and since sounds were produced by handheld drumsticks in Study 2, in contrast to sounds produced by the feet in Study 1, we refrained from directly comparing conditions from Study 1 with Study 2.”

However, to more strongly acknowledge this point we now add another passage in the Discussion as follows:

While hurdling and tap dancing are comparable sports in many respects, they differ in terms of expressive or aesthetic appeal. Our studies reported here cannot rule out the influence of this factor, which should be the subject of future investigation.

Reviewer #3

2) The rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli, although such data appears to be available given that actions were recorded using motion capture.

The authors only report overall motion energy for the videos computed from the video. What about event density and rhythm in the visual domain? Could it be that synchrony during tap dancing and drumming are harder to detect because the movement amplitude of drumming and tap dancing is much smaller than that of hurdling? What about the saliency of cyclical motion or the influence of visual perspective? The difficulty of audiovisual integration does not only depend on saliency of auditory features but also of the visual stimuli, yet an assessment of how tap dancing, drumming and hurdling are different from each other with respect to their visual aspects and their movement kinematics is missing entirely.

Our Reply:

It is correct that the rigorous assessment of auditory features is not matched by an equally rigorous assessment of the visual features of the stimuli. While research on motion perception, not only in sports, has for many years focused heavily on and been limited to the visual domain, the focus of our research is clearly on how natural motion sounds are processed. We do not mean to deny the relevance of the visual domain, but to point out that the experiments we present relate specifically to motion sounds. In the first revision of our manuscript, in response to reviewer comments, we also included and added a descriptive and statistical analysis of visual stimulation. We point out this specific focus again in the revised discussion.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Alice Mado Proverbio

31 May 2021

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

PONE-D-20-30978R2

Dear Dr. Schubotz,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alice Mado Proverbio

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The reviewers and I found that all previous comments were successfully addressed by your last revision.

We particularly appreciated your further clarifications relative to the stimulus choice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: I appreciate that the authors further clarified their choice of stimuli and further address these issues in the newly revised manuscript.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Acceptance letter

Alice Mado Proverbio

14 Jul 2021

PONE-D-20-30978R2

Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming

Dear Dr. Schubotz:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Alice Mado Proverbio

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Example of drumming sequence with high and low rhythmicity (Study 2).

    Rhythmicity was operationalized as variation of the amplitude envelope, shown here for two exemplary drumming sequences. While the event density of both recordings is virtually identical (3.42 and 3.39, respectively), the auditory events in the left recording are highly similar in loudness, resulting in low rhythmicity overall (v = 0.25). In contrast, the auditory events within the right recording vary more strongly in loudness, with almost equidistant duplets of loud (i.e. accentuated) events interspersed with less accentuated events. This resulted in high rhythmicity overall (v = 0.78).

    (TIFF)
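    The two caption measures (event density and the rhythmicity index v) can be illustrated with a short sketch. Note that the exact formulas are not spelled out in the caption; the sketch below assumes that event density is the number of auditory events per second and that v is the coefficient of variation of the event peak amplitudes, so a recording with uniformly loud events yields a low v and one with alternating loud and soft events yields a higher v. The function name and inputs are illustrative, not from the article.

    ```python
    import numpy as np

    def audio_event_stats(peak_times, peak_amps):
        """Sketch: event density and a rhythmicity index for one recording.

        peak_times: onset times of detected auditory events (seconds)
        peak_amps:  peak amplitudes (loudness) of those events

        Assumed definitions (not given verbatim in the article):
        density = events per second across the recording span,
        v = coefficient of variation (std / mean) of peak amplitudes.
        """
        peak_times = np.asarray(peak_times, dtype=float)
        peak_amps = np.asarray(peak_amps, dtype=float)
        duration = peak_times[-1] - peak_times[0]
        density = (len(peak_times) - 1) / duration   # events per second
        v = peak_amps.std() / peak_amps.mean()       # loudness variation
        return density, v

    # Same density, different rhythmicity:
    # uniform loudness -> low v; loud/soft duplets -> higher v
    _, v_low = audio_event_stats(np.arange(0, 3, 0.3), np.full(10, 1.0))
    _, v_high = audio_event_stats(np.arange(0, 3, 0.3), np.tile([1.0, 0.4], 5))
    ```

    With identical onset times, both sequences have the same event density, while v separates the flat from the accentuated loudness profile, mirroring the left and right panels of S1 Fig.
    
    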

    S2 Fig. Motion energy, Study 1 and 2.

    The amount of motion, quantified as the number of moving pixels per video, for all PLD videos employed to generate the different audiovisual asynchronous stimuli in Study 1 and Study 2. Each black marker depicts the motion energy of one video (see Methods of Study 1 for details).

    (TIFF)

    S1 Video. Sample video Study 1.

    Hurdling, auditory first, 120 ms asynchrony.

    (MP4)

    S2 Video. Sample video Study 1.

    Hurdling, auditory first, 400 ms asynchrony.

    (MP4)

    S3 Video. Sample video Study 1.

    Hurdling, visual first, 120 ms asynchrony.

    (MP4)

    S4 Video. Sample video Study 1.

    Hurdling, visual first, 400 ms asynchrony.

    (MP4)

    S5 Video. Sample video Study 1.

    Tap dancing, auditory first, 120 ms asynchrony.

    (MP4)

    S6 Video. Sample video Study 1.

    Tap dancing, auditory first, 400 ms asynchrony.

    (MP4)

    S7 Video. Sample video Study 1.

    Tap dancing, visual first, 120 ms asynchrony.

    (MP4)

    S8 Video. Sample video Study 1.

    Tap dancing, visual first, 400 ms asynchrony.

    (MP4)

    S9 Video. Sample video Study 2.

    Drumming, high event density, high rhythmicity.

    (MP4)

    S10 Video. Sample video Study 2.

    Drumming, high event density, low rhythmicity.

    (MP4)

    S11 Video. Sample video Study 2.

    Drumming, low event density, high rhythmicity.

    (MP4)

    S12 Video. Sample video Study 2.

    Drumming, low event density, low rhythmicity.

    (MP4)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All files are available from the OSF database: Schubotz, Ricarda. 2020. “AVIA - Audiovisual Integration in Hurdling, Tap Dancing and Drumming.” OSF. October 2. osf.io/ksma6. The DOI is 10.17605/OSF.IO/KSMA6.


    Articles from PLoS ONE are provided here courtesy of PLOS
