PLOS One. 2023 Apr 6;18(4):e0282691. doi: 10.1371/journal.pone.0282691

The relationship between musical training and the processing of audiovisual correspondences: Evidence from a reaction time task

Riku Ihalainen 1,2,*, Georgios Kotsaridis 3, Ana B Vivas 3, Evangelos Paraskevopoulos 2,4
Editor: Deborah Apthorp 5
PMCID: PMC10079049  PMID: 37023061

Abstract

Numerous studies have reported both cortical and functional changes in visual, tactile, and auditory brain areas in musicians, which have been attributed to long-term training-induced neuroplasticity. Previous investigations have reported advantages for musicians in multisensory processing at the behavioural level; however, multisensory integration in tasks requiring higher-level cognitive processing has not yet been extensively studied. Here, we investigated the association between musical expertise and the processing of audiovisual crossmodal correspondences in a decision reaction-time task. The visual display varied in three dimensions (elevation, symbolic and non-symbolic magnitude), while the auditory stimulus varied in pitch. Congruency was based on a set of newly learned abstract rules: “The higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”. Both accuracy and reaction times were recorded. Musicians were significantly more accurate in their responses than non-musicians, suggesting an association between long-term musical training and audiovisual integration. Contrary to what was hypothesized, no differences in reaction times were found. The musicians’ advantage in accuracy also generalised to rule-based congruency in seemingly unrelated stimuli (pitch-magnitude pairs), suggesting an advantage in processes requiring higher-order cognitive functions. These results point to an interaction between implicit and explicit processing, as reflected in reaction times and accuracy, respectively, and support the notion that accuracy and latency measures may reflect different processes.

Introduction

Human beings are equipped with multiple sensory channels allowing us to produce a unified and coherent representation of the outside world. Recent evidence suggests that multimodal sensory processing occurs at an early processing stage, and is cortically widespread (for a review, see [1]). Studies also suggest that specific life experiences may lead to enhanced multimodal sensory processing. For instance, professional musicians appear to have extensive experience in multisensory processing, since reading, interpreting, and acting on musical notation combines at least visual, auditory, and motor information. Similarly, playing an instrument requires simultaneous processing of stimuli from at least visual, auditory, and tactile sensory modalities [2, 3]. At the cortical level, research has shown that musicians and non-musicians exhibit both cortical and functional differences, often attributed to long-term training-induced neuroplasticity [e.g. 4–10].

Crossmodal correspondence refers to the mapping that the observer expects between two or more seemingly arbitrary stimuli from different modalities, inducing congruency effects in performance [11–13]. For example, people tend to naturally associate higher pitch with smaller objects, and with objects that are higher in spatial elevation [14, 15]. Similarly, lower pitch is typically associated with larger objects and with objects lower in spatial elevation [16]. Recent evidence suggests that our brain automatically integrates stimuli based on parameters such as temporal or spatial proximity [2, 15, 17], previous experience [18], innate crossmodal correspondences, and the statistics of natural scenes [19, 20].

Furthermore, a number of studies have successfully used newly learned abstract rules in order to induce audiovisual congruency effects in musicians and non-musicians. Paraskevopoulos et al. [21] investigated the cortical responses (EEG/MEG) of musicians and non-musicians in relation to congruent and incongruent audiovisual magnitude comparisons of a symbolic nature. The judgements were made based on a newly learned abstract rule: ‘the higher the pitch, the larger the number presented’. Their results indicated two distinct neural networks for congruent and incongruent comparisons: frontotemporal and occipital areas in the congruent condition, and temporal and parietal regions in the incongruent condition. Musicians further performed better, that is, more accurately, at discriminating whether the audiovisual stimuli were congruent with the rule, suggesting that musical expertise may be associated with enhanced processing of audiovisual stimuli. Similar results were found with audiovisual stimuli varying in spatial elevation in conjunction with pitch [22] (see also [23]), and later, with stimuli whose visual dimension represented different magnitudes in conjunction with varying pitch [24].

The evidence thus suggests that long-term training-induced neuroplasticity in musicians is associated with advantages that can be measured with both neuroimaging methods and behavioural measures. However, these studies have employed only precision (e.g. accuracy of discrimination judgements) as the behavioural measure. As two commonly applied behavioural measures, accuracy and reaction time have been argued to tap into different underlying processes, with accuracy reflecting more explicit and reaction time more implicit processing [25–28], see also [29]. Hence, the two measures provide complementary information, which we aim to capture by incorporating both.

There is a relatively large number of reaction time studies with non-musicians in the context of multimodal sensory integration. These investigations have consistently reported better performance with multimodal stimuli: faster detection of the target in the multimodal condition relative to the unimodal condition (see for example [14]; for reviews see [12, 13]). Moreover, some previous studies have suggested a processing speed advantage with multimodal stimuli for musicians over non-musicians. For instance, Landry and Champoux [30] examined auditory, tactile, and audio-tactile processing with a detection reaction time task, in which participants were instructed to click a mouse button immediately upon perceiving an auditory, tactile, or simultaneous audio-tactile stimulus, and reported that long-term musical training was associated with faster response times for all three stimulus types. Bidelman [31] investigated the effects of music training on the temporal binding of audiovisual multisensory stimuli using the double-flash illusion [32, 33]. In the illusion, a single flash of a visual stimulus is perceived as two separate visual stimuli, an effect caused by two consecutive auditory signals (beeps) presented concurrently with the visual stimulus. Their results indicated that musicians, in comparison to non-musicians, were more accurate and faster in indicating whether a single or double flash was presented. Further, the temporal window in which the illusory effect was successfully induced was 2–3 times shorter for musicians than for non-musicians. Taken together, experience-induced plasticity effects in musicians seem to extend beyond simple listening skills, leading to a more refined, improved integration of multiple sensory systems in a domain-general manner.

Earlier studies have also investigated the associations between magnitudes, pitch, and spatial elevation in the context of spatial mapping. The Spatial Numerical Association of Response Codes (SNARC; [34]) and the Spatial Musical Association of Response Codes (SMARC, also known as SPARC, Spatial Pitch Association of Response Codes; [35]) refer to faster left-hand responses to smaller numbers and lower pitches, and faster right-hand responses to larger numbers and higher pitches. This phenomenon seems to affect both the speed and the accuracy of responses, with higher pitches preferentially paired with upper and/or right response locations. Such associations between perception and action/response are believed to be the result of an automatism [36].

Such audiovisual correspondences may also occur at an early processing level; they are thought to represent pre-attentive processing that is largely correlated with sub-cortical structures such as the superior colliculi ([37]; for a review see [38]). In contrast, magnitude judgements are thought to require higher cognitive processes, and seem to be associated with cortical sources such as the posterior parietal cortex or the intra-parietal sulcus [21, 39]. By exploring the effects of multimodal musical training on multimodal integration in both magnitude judgements and pitch-elevation correspondence, we aim to investigate the potential link between implicit and explicit cognitive processes. In doing so, we analyse both the response latencies (thought to reflect more implicit processing) and accuracy (thought to reflect more explicit processing; [26]). To correct for possible speed-accuracy trade-off effects, we furthermore analyse the combined linear integrated speed-accuracy scores (LISAS; [40–42]).

Hence, we employ a decision reaction time task, based on abstract rules related to audiovisual stimuli, in a group of musicians and non-musicians. Grounded in previous research [12–14, 21, 22, 30, 31], we hypothesize that overall performance will be better (faster and more accurate responses) in the congruent condition relative to the incongruent condition for both groups, reflecting integration of the auditory and visual stimuli. Crucially, we expect that musicians will show overall better performance (faster and more accurate) than non-musicians due to training-induced enhancements in multisensory integration.

Materials and methods

Participants

Following the power analysis (reported below in the Analysis section), data were collected from 27 musicians and 23 non-musicians. One musician, who was left-handed, and one control participant, who had weekly piano lessons between the ages of 7 and 12, were excluded from the study. Thus, the dataset consisted of 48 right-handed participants (26 musicians and 22 non-musicians). Data were further trimmed based on reaction time outliers (see Analysis section below), after which we were left with a final sample of 44 participants (25 musicians; 24 females).

Musicians (mean age = 31.40 years, SD = 12.26, range: 20–64 years, 7 females) were recruited through social media sites and via email. Musicians were required to have at least 3 years of musical education (including possible careers/music teaching) in addition to the compulsory music lessons in elementary and junior high school, to be currently active, and to be able to read musical notes. The mean formal musical education amongst the musicians was 15.4 years (SD = 9.4) with a range of 3–41 years. Only 2 musicians reported having less than 6 years of training. The instruments played by the musicians were considerably heterogeneous; the most commonly reported were guitar (N = 6) and piano (N = 4).

Non-musicians (mean age = 28.47 years, SD = 7.19, range: 19–47 years, 11 females; mean ages did not significantly differ between musicians and non-musicians, p > .05) were recruited mainly via social media sites and email, and were defined as having no formal musical training in addition to the music lessons compulsory in elementary and junior high school. All of the non-musician participants also self-identified as having no musical expertise. Any continuous lessons or practicing with an instrument resulted in an exclusion from the study.

None of the participants had a history of brain trauma or mental health issues, they all had normal or corrected-to-normal vision and normal auditory thresholds/hearing, and identified as right-handed. The study was approved by the University of Sheffield Ethics committee, and was conducted in accordance with the Declaration of Helsinki (1973).

Stimuli

Each pair of audiovisual stimuli consisted of a visual and an auditory part. The visual part of the stimulus took one of three forms: spatial elevation, symbolic magnitude, or non-symbolic magnitude. The spatial elevation stimuli consisted of five white horizontal lines against a black background, similar to the staff lines in musical notation. A blue dot (about 2 cm in diameter, RGB colour code: red, 86; green, 126; blue, 214) was placed into one of the 4 spaces between the lines, thus forming 4 different images in which the blue dot varied in spatial elevation. The colour blue was chosen for all stimuli as it has been suggested not to have any natural association with common sensory attributes [11].

The symbolic magnitude condition consisted of a number written in blue Calibri font against a black background, with numbers varying from 1 to 4. The non-symbolic magnitude condition consisted of blue dots (varying in number from one to four, and of the same size and colour as in the spatial elevation category) projected against a black background (see Fig 1). Each visual stimulus type was then combined with a sinusoidal tone (44,100 Hz sampling rate, 16 bit) lasting 400 ms, including a 10-ms rise and decay time, thus forming stimulus displays consisting of both an auditory and a visual modality. The tones were adopted from Paraskevopoulos et al. [21]: F5, 698.46 Hz; A5, 880.46 Hz; C6, 1046.50 Hz; or E6, 1318.51 Hz. The tones were generated as sine waveforms at 0.8 amplitude using Audacity (version 2.1.2–1, Carnegie Mellon University) [43].
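The tones are fully specified by the parameters above, so their construction can be sketched as follows. This is a hypothetical re-creation in NumPy, not the authors' Audacity procedure: a 400 ms sine at the stated frequency and 0.8 amplitude, with linear 10 ms onset and offset ramps.

```python
import numpy as np

def make_tone(freq_hz, dur_s=0.4, ramp_s=0.01, amp=0.8, sr=44100):
    """Sine tone with linear onset/offset ramps (illustrative re-creation)."""
    t = np.arange(int(dur_s * sr)) / sr
    tone = amp * np.sin(2 * np.pi * freq_hz * t)
    n_ramp = int(ramp_s * sr)
    ramp = np.linspace(0.0, 1.0, n_ramp)
    tone[:n_ramp] *= ramp          # 10 ms rise
    tone[-n_ramp:] *= ramp[::-1]   # 10 ms decay
    return tone

tones = {name: make_tone(f) for name, f in
         [("F5", 698.46), ("A5", 880.46), ("C6", 1046.50), ("E6", 1318.51)]}
```

Each array is 17,640 samples long (0.4 s at 44,100 Hz) and could be written to a 16-bit file with any audio library.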

Fig 1. Examples of each of the three stimulus categories.

Fig 1

These images and tones were combined to form the experimental stimuli: within each stimulus category, two different images were pseudo-randomly paired with two different tones and shown consecutively within one audiovisual video clip (see Procedure and design for a detailed description). Each clip (i.e., one trial) consisted of the first image presented together with the first tone for 400 ms, a 60 ms break, and the second image presented together with the second tone for 400 ms, yielding a clip of 860 ms in which the visual element varied in one of the three dimensions and each image was paired with a different tone. Examples of such pairs are shown in Fig 1. The clips were either congruent or incongruent according to the following category-specific rules: “The higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”. A total of 180 videos were prepared: 30 congruent and 30 incongruent for each of the three stimulus categories (spatial elevation, symbolic magnitude, and non-symbolic magnitude).
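All three rules reduce to the same test: the visual attribute (elevation slot, numeral, or dot count, each coded 1–4 here) and the pitch must change in the same direction across the two stimuli of a trial. A minimal sketch, with the function name and coding our own rather than taken from the original stimulus script:

```python
def is_congruent(level1, level2, pitch1, pitch2):
    """True when the visual change (elevation/number/dots, coded 1-4)
    goes in the same direction as the pitch change (Hz)."""
    return (level2 - level1) * (pitch2 - pitch1) > 0

# elevation rises from slot 1 to slot 3 while pitch rises F5 -> C6: congruent
assert is_congruent(1, 3, 698.46, 1046.50)
# the same visual change with falling pitch violates the rule
assert not is_congruent(1, 3, 1046.50, 698.46)
```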

The stimuli were presented using Presentation® software (Version 18.0, Neurobehavioral Systems) on a laptop with a 13.3-inch screen, running Windows 10 [44].

Procedure and design

In the present study, a 2x2x3 mixed factorial design was implemented with group (musical training, no musical training) as the between-subjects factor, and congruency (congruent, incongruent) and stimulus dimension (spatial elevation, symbolic magnitude, and non-symbolic magnitude) as the within-subject factors. The dependent variables were response latencies (in milliseconds) and accuracy (number of correct responses).

The participants were seated comfortably in a quiet, well-lit room approximately 60 cm away from the screen, with their right hand on the keyboard. All participants received written instructions, and provided written informed consent prior to the experiment. The auditory stimuli were delivered via Shike QHP-660 headphones. First, example trials of both congruent and incongruent stimuli from each of the categories were shown to the participants (6 example trials in total). It was then verbally confirmed that the instructions were understood. Next, the experiment began. The experimental procedure is shown in Fig 2.

Fig 2. Examples of the experimental procedure in both congruent and incongruent conditions for each of the three audiovisual stimulus categories.

Fig 2

Panels A–B illustrate congruent and incongruent trials in the pitch-elevation category, respectively. Each panel/trial consisted of a single video with a total length of 860 ms. In this video clip, the participant saw two visual stimuli, each associated with a sound at a particular pitch. The sound varied in pitch and the visual stimulus varied in elevation. Congruency was estimated according to an explicitly learned rule, “The higher the spatial elevation, the higher the tone”. Panels C–D illustrate congruent and incongruent trials in the symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the value of the number shown and congruency was estimated according to the explicitly learned rule, “the higher the number presented, the higher the tone”. Panels E–F illustrate congruent and incongruent trials in the non-symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the number of circles shown and congruency was estimated according to the explicitly learned rule, “the more dots presented, the higher the tone”. Note that in each category congruency/incongruency could be induced either with the visual or with the auditory part of the stimuli.

Each trial consisted of a single video clip (a single panel in Fig 2). The sequence of events in each trial was as follows: the first audiovisual stimulus was presented in the middle of the screen for 400 ms, followed by a short break of 60 ms before the second audiovisual stimulus (400 ms). Each audiovisual stimulus consisted of a visual image (see Fig 1) paired with an auditory stimulus that varied in pitch. The onset of the auditory stimulus was synchronized with the onset of the visual image. After both audiovisual stimuli had been presented (i.e., after a single trial consisting of a pair of audiovisual stimuli), participants judged the congruency of the pair based on the explicitly learned abstract rules. These rules were “The higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”, depending on the category. Participants were instructed to respond as quickly and accurately as possible with their right hand only, pressing ‘K’ on the keyboard when the corresponding rule was followed, and ‘L’ when it was not. After the response, a 1000 ms blank screen appeared before the next trial. The researcher further instructed the participants verbally, and made sure the instructions were understood properly.

There were two experimental blocks, and each block consisted of 180 trials. In each block, there were three audiovisual stimulus categories with 60 trials each (30 congruent and 30 incongruent). The order of the trials was pseudo-randomized across participants so that no two consecutive trials had the same stimulus type. This randomization also eliminated any potential bias caused by varying intervals between the auditory tones and the visual stimuli.
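The no-repeat constraint (60 trials each of three categories, never two of the same category in a row) cannot realistically be met by naive reshuffling, but a greedy pass that always draws from the most plentiful eligible category produces a valid order. The sketch below is our own assumption about how such a sequence could be generated; the paper does not describe the algorithm used:

```python
import random

def order_without_repeats(counts):
    """Order trial types so no two consecutive trials share a type.
    counts: dict mapping type -> number of trials of that type."""
    counts = dict(counts)
    order, prev = [], None
    for _ in range(sum(counts.values())):
        eligible = [t for t in counts if t != prev and counts[t] > 0]
        if not eligible:
            raise ValueError("no valid ordering exists for these counts")
        # draw from the most plentiful eligible type, breaking ties at random,
        # so one type never accumulates an unplaceable surplus
        top = max(counts[t] for t in eligible)
        t = random.choice([t for t in eligible if counts[t] == top])
        order.append(t)
        counts[t] -= 1
        prev = t
    return order

block = order_without_repeats({"elevation": 60, "symbolic": 60, "non-symbolic": 60})
```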

Analysis

The required number of participants was estimated by conducting a statistical power analysis based on the raw data of Paraskevopoulos et al. [24], which investigated differences in cortical responses between musicians and non-musicians utilizing a method similar to the one in the present study. The effect size (d) in that study was 1.76, considered large by Cohen’s [45] criteria. With an alpha of .05, both the power table and the sample size table were calculated using fpower (version 1.2–1, [46]), suggesting power > .99 with N = 10 per experimental group.
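As a rough cross-check of such a power table, a two-group t-test power calculation can be sketched with SciPy's noncentral t distribution. This is our own simplification, not the fpower (R) computation used in the paper, so the exact values will differ:

```python
from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided independent-samples t-test
    for standardized effect size d with n_per_group per group."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # P(|T| > t_crit) under the alternative hypothesis
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

power = ttest_power(d=1.76, n_per_group=10)     # well above .9 for d = 1.76
```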

RTs below 100 ms or above 1000 ms, as well as RTs more than two standard deviations from each condition’s mean, were considered outliers and eliminated from the data [47]. In addition, trials in which the participant did not provide a response, or made a mistake, were excluded from the reaction time analysis. As our reaction times were measured from the onset of the first picture within the video clip, response latencies below 560 ms (first picture + break + 100 ms) and above 1460 ms (first picture + break + 1000 ms) were eliminated. Participants who did not reach a level of 50% accepted trials per condition were excluded from the analyses. This way we ensured at least 30 trials per condition for each participant in the subsequent analysis, and were left with a sample of 44 participants (25 musicians; 24 females). The maximum number of mistakes made per category was 28.
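The trimming pipeline can be sketched in NumPy as below. We assume the absolute bounds were applied before the two-SD criterion, which the paper does not state explicitly:

```python
import numpy as np

def trim_rts(rts, lo=560.0, hi=1460.0, n_sd=2.0):
    """Remove RT outliers: absolute bounds first (400 ms picture + 60 ms
    break + [100, 1000] ms response window), then +/- 2 SD of the mean."""
    rts = np.asarray(rts, dtype=float)
    kept = rts[(rts >= lo) & (rts <= hi)]
    m, s = kept.mean(), kept.std(ddof=1)
    return kept[np.abs(kept - m) <= n_sd * s]

clean = trim_rts([600, 700, 800, 900, 2000])   # 2000 ms exceeds the upper bound
```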

For accuracy in following the rules, the discriminability index d-prime was calculated. Hits were defined as congruent stimuli correctly identified as congruent, misses as congruent stimuli identified as incongruent, false alarms as incongruent stimuli identified as congruent, and correct rejections as incongruent stimuli identified as incongruent. Because our main interest was the potential difference in accuracy between musicians and non-musicians, and because several participants made zero mistakes in some of the stimulus categories, we combined the data from the three stimulus categories before calculating the hit and false-alarm rates, following the suggestions of Stanislaw & Todorov [48], instead of calculating d-prime for each audiovisual category individually. For further investigation of category-wise accuracy, we used the raw number of mistakes. As these data were not normally distributed, a Mann-Whitney U test was performed; when a significant effect was found, the effect size (r) was calculated. As the proportion of mistakes reached 7.5%, the mistakes across the visual stimulus categories were analysed using the Friedman test, and between the two congruency categories with the Wilcoxon signed-rank test.
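For reference, d-prime from pooled hit and false-alarm counts might be computed as below. The log-linear correction is one of the options discussed by Stanislaw & Todorov for perfect scores; the paper instead avoids the zero-mistake problem by pooling categories, so the correction here is our own addition:

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with 0.5 added to each
    cell (log-linear correction) so perfect scores stay finite."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)
```

With, say, 85 hits, 5 misses, 10 false alarms, and 80 correct rejections, this yields a d-prime of roughly 2.7; a participant with no mistakes still gets a finite score.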

For reaction times, a mixed 2x2x3 ANOVA was performed with musical training as the between-participants factor, and congruency and stimulus categories as within-participant factors. Bonferroni corrected estimated marginal means were calculated for the main effects when appropriate. Furthermore, as the experimental design subjected the results to a possible speed-accuracy trade-off (SAT), we calculated Linear Integrated Speed-Accuracy Scores (LISAS) over all the categories [4042]. LISAS were calculated by transforming the measurement scores to equal scales:

LISAS_ij = RT_ij                                  if PE_ij = 0
LISAS_ij = RT_ij + PE_ij × (S_RT,j / S_PE,j)      otherwise

where S_RT,j and S_PE,j are the standard deviations of participant j’s reaction times and proportion of errors, respectively. A 2x2x3 ANOVA was then performed on these transformed scores, with corresponding Bonferroni post-hoc comparisons.
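Per participant and condition, the score combines the mean correct-trial RT with the error proportion, rescaled into RT units by the ratio of the two standard deviations. A minimal sketch, with hypothetical example values:

```python
def lisas(rt_mean, pe, s_rt, s_pe):
    """Linear Integrated Speed-Accuracy Score: mean RT plus the error
    proportion rescaled into RT units via the ratio of the SDs."""
    if pe == 0:
        return rt_mean          # no errors: the score is the mean RT itself
    return rt_mean + pe * (s_rt / s_pe)

# hypothetical participant: mean RT 1500 ms, 10% errors,
# RT SD 200 ms, error-proportion SD 0.08
score = lisas(1500.0, 0.10, 200.0, 0.08)   # 1500 + 0.10 * (200 / 0.08) = 1750.0
```

An error-free condition leaves the score equal to the mean RT, matching the first branch of the formula above.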

Results

Reaction times

The results of the ANOVA (see Table 1) showed statistically significant main effects for congruency (F(1,42) = 80.78, p < .001, ηp2 = 0.658), and for audiovisual stimulus category (F(2,84) = 34.46, p < .001, ηp2 = 0.451), but not for musical training group (p = .902). That is, overall response times were faster for congruent trials (M = 1452.76 ms) than for incongruent trials (M = 1588.84 ms) but did not significantly differ between musicians and non-musicians. Bonferroni corrected estimated marginal means showed that response times were significantly faster for the spatial elevation category (M = 1463.50 ms, SE = 32.61) than for the symbolic (1546.72 ms; p < .001, SE = 37.21) and non-symbolic magnitude (1552.19 ms, p < .001, SE = 37.10) categories, which did not significantly differ from each other (p = 1.00). None of the interactions reached statistical significance (all ps > 0.077).

Table 1. Audiovisual category-wise mean response latencies (SD) in milliseconds for musicians and non-musicians in the spatial elevation, symbolic magnitude, and non-symbolic magnitude categories, in both congruent and incongruent conditions.

At bottom, overall means (SE) for the audiovisual categories, musicians and non-musicians, and for congruent and incongruent stimuli.

Musical Training Congruency Spatial Elev. Symbolic Mag. Non-symbolic Mag.
Musician Cong. 1389.90 (179.79) 1459.86 (221.87) 1480.34 (235.78)
Incong. 1530.09 (258.68) 1634.61 (293.08) 1656.11 (285.68)
Non-musician Cong. 1422.07 (216.33) 1506.05 (245.72) 1458.35 (213.98)
Incong. 1511.94 (230.49) 1586.34 (236.34) 1613.94 (261.20)
Overall mean (SE) 1463.50 (32.61) 1546.72 (37.21) 1552.19 (37.10)
Mean (SE) musician 1525.15 (46.03)
Mean (SE) non-musician 1516.45 (52.81)
Mean (SE) congruent 1452.76 (32.26)
Mean (SE) incongruent 1588.84 (39.09)

Accuracy

The assumption of normality was violated for the d-prime scores (Shapiro-Wilk test: W(44) = 0.94, p = 0.018). Therefore, the d-prime data were analysed with non-parametric tests.

Both musicians and non-musicians scored significantly higher than chance level, indicating that both groups made more correct responses than would be expected by chance alone (Wilcoxon signed-rank tests; musicians: Mdn = 4.44, Z = -4.37, p < .001; non-musicians: Mdn = 3.30, Z = -3.82, p < .001).

A Mann-Whitney U test was conducted to compare the two groups (musicians and non-musicians). The analysis revealed a statistically significant difference in the discriminability index: musicians had significantly higher d-prime scores than non-musicians (U = 85, p < .001, d = 1.299, large effect). Fig 3 shows the difference in mean d-prime between musicians and non-musicians discriminating between congruent and incongruent stimuli.

Fig 3. Mean d primes for musicians and non-musicians discriminating between congruent and incongruent trials.

Fig 3

Higher scores for musicians reflect higher accuracy. Error bars indicate the 95% confidence intervals. The between group difference is significant at p < .001.

Moreover, while musicians and non-musicians did not significantly differ in the number of mistakes in congruent categories (in a Mann-Whitney U test, p = 0.113), in incongruent trials musicians (M = 12.64) made significantly fewer mistakes than non-musicians (M = 47; U = 38, p < .001, d = 2.032, large effect).

In the analysis of the number of mistakes in each audiovisual category individually, musicians made significantly fewer mistakes than non-musicians with congruent non-symbolic stimuli (Mann-Whitney U test: U = 152.5, p = 0.038, d = 0.637, medium effect), incongruent elevation (U = 42, p < .001, d = 1.951, large effect), incongruent symbolic magnitude (U = 47, p < .001, d = 1.857, large effect), and incongruent non-symbolic magnitude (U = 37.5, p < .001, d = 2.042, large effect).

A Wilcoxon Signed-Rank test revealed a statistically significant, large difference between the overall number of mistakes made in congruent (M = 15.5, SD = 24.44) and incongruent (M = 27.48, SD = 24.74; Z = 3.30, p = 0.001, r = 0.50) trials, indicating that the participants were more prone to make mistakes in incongruent than congruent categories.

The analysis of the overall number of mistakes made in the elevation (M = 13.14, SD = 13.89), symbolic magnitude (M = 14.66, SD = 14.92), and non-symbolic magnitude (M = 15.18, SD = 15.10) categories indicated a statistically significant difference (Friedman test: χ2(2) = 13.87, p = 0.001). Post hoc comparisons (Bonferroni corrected α = 0.017) revealed that the elevation category differed significantly from non-symbolic magnitude (Z = -3.37, p = 0.001, r = .51). Comparisons between elevation and symbolic magnitude (p = 0.043) and between symbolic and non-symbolic magnitude (p = 0.656) did not reach statistical significance.

Linear Integrated Speed-Accuracy Scores (LISAS)

The mean scores (SD) for LISAS are shown in Table 2. The results of a mixed ANOVA indicated statistically significant main effects for congruency (F(1,42) = 80.09, p < .001, ηp2 = .656), and for stimulus category (F(2,84) = 35.99, p < .001, ηp2 = .461), but not for musical training group (p = .79). In particular, LISAS scores were smaller for congruent (M = 1457.49) than for incongruent trials (M = 1599.77). The Bonferroni corrected estimated marginal means indicated that elevation category (M = 1468.24) differed from both symbolic (M = 1556.33, p < .001) and non-symbolic magnitudes (M = 1561.32, p < .001), while there was no statistically significant difference observed between the two magnitude categories (p = 1.00).

Table 2. Mean linear integrated speed-accuracy scores (SD) for musicians and non-musicians in the spatial elevation, symbolic magnitude, and non-symbolic magnitude categories, in both congruent and incongruent conditions.

Musical Training Congruency Spatial Elev. Symbolic Mag. Non-symbolic Mag.
Musician Cong. 1394.54 (180.45) 1470.86 (224.30) 1488.05 (238.47)
Incong. 1539.39 (258.80) 1656.70 (297.02) 1678.90 (288.83)
Non-musician Cong. 1423.44 (216.81) 1507.82 (255.79) 1460.24 (214.82)
Incong. 1515.60 (231.27) 1589.95 (237.55) 1618.11 (261.68)

In addition, we observed a statistically significant interaction between congruency and stimulus category (F(2,84) = 3.34, p = 0.04, ηp2 = .074). Bonferroni corrected estimated marginal means showed that with both congruent and incongruent stimuli, the elevation category differed significantly from the symbolic and non-symbolic magnitude categories (ps < .001); in the congruent condition the difference was larger between elevation and symbolic magnitude, while in the incongruent condition the difference was larger between elevation and non-symbolic magnitude. All other interactions were statistically non-significant (all ps > .54).

Discussion

Our main objective was to explore the association between long-term musical training and bimodal sensory integration. In a decision reaction-time task, we presented crossmodally correspondent audiovisual stimuli to musicians and non-musicians and measured both accuracy and response latencies. Our main result indicated a large, significant advantage for musicians, relative to non-musicians, in accuracy, but not in reaction times. Importantly, this advantage was present not only in the pitch-elevation category but also extended over the two magnitude categories. Moreover, the advantage in accuracy was especially prominent with incongruent audiovisual stimulus categories; musicians had an advantage in all three incongruent stimulus categories, in addition to congruent non-symbolic magnitudes.

As expected, we observed significant main effects of congruency with all measures (response times, d-primes, number of mistakes, and linear integrated speed-accuracy scores; LISAS), suggesting audiovisual integration for all three stimulus categories, for both musicians and non-musicians. The reaction times (RTs) were overall faster for the spatial elevation category relative to the other two categories, with the number of mistakes made with elevation being significantly smaller than with non-symbolic magnitude. This speed advantage with the spatial elevation category could reflect more extensive experience with, and/or the innateness of, naturally occurring statistics between pitch and elevation [12, 20, 49]. These results were further confirmed with LISAS, combining response latencies and the number of errors [42]. However, we did not observe the hypothesized difference in reaction times between musicians and non-musicians. The present pattern of findings is best explained in terms of an interaction between explicit and implicit processing, and suggests a way by which the benefits of musical training on performance may generalize beyond the musical context.

Hence, contrary to what was hypothesized, no advantage in response latencies for musicians was established in the present study, whereas a significant advantage in accuracy was found. These results do not agree with previous studies in which musicians showed a processing-speed advantage over non-musicians with multisensory stimuli, in stimulus detection [30] and in perceptual illusion tasks [31]. One possible explanation for the discrepancy between previous studies and the present one may lie in task-specific cognitive processing demands. First, while in previous similar studies participants were instructed to attend and make a single judgment (e.g., detect an auditory stimulus, a tactile stimulus, or an audio-tactile stimulus), in the present study participants were asked to switch between three different judgments–based on explicitly learned abstract rules–within the same block (pitch + spatial elevation or magnitudes). Thus, we cannot rule out the possibility that switching-cost effects influenced performance in the task.

However, von Bastian & Druey [50], using a latent factor model, concluded that, of all types of switching (judgment, dimension, stimulus, mapping, and response), response-mapping shifting is the only one contributing directly to switching cost. In our study, the response mapping remained the same across the judgment conditions, and thus we consider it unlikely that switching-cost effects explain the lack of group differences in response latencies. Given that response latency is the most sensitive measure of switching cost, and that the group effects were not significant with response latencies, we also believe that a potential musician advantage in judgment-switching ability does not account for the group differences (better performance) found in the accuracy data.

Second, in both Landry & Champoux [30] and Bidelman [31], the task and the manipulations operated exclusively at the perceptual level. It remains possible that such stimulus detection or perceptual illusion effects do not require higher cognitive processing, whereas a task such as the present one–where participants are required to be simultaneously cognizant of both modalities (pitch + spatial elevation or magnitudes) for a relatively long time, after which a comparison is made according to a newly learned abstract rule–does. Accordingly, short-term improvements in temporal binding windows induced by short-term training have been found to be strikingly similar despite alterations in task structure [51]. Crucially, this has been interpreted as indicating that the effects induced by expertise reflect changes in perceptual rather than in higher cognitive systems.

We observed a significant advantage for musicians in accuracy, but not in reaction times, that extended over the magnitude categories. Importantly, previous research has suggested that comparative magnitude judgements require “higher” cognitive processing that encompasses more widely distributed cortical areas, including the parietal network [21, 39, 52]. On the other hand, response speed advantages for musicians have been observed with potentially cognitively less demanding tasks [30, 31]. Following this line of thought one may speculate that the present pattern of findings can be explained in terms of an interaction between top-down, higher order and pre-attentive processing–to the extent that distinguishing higher order and pre-attentive processing behaviourally is possible. That said, this potential interaction could be further broken down in future studies by, for example, including multiple tasks varying in cognitive demand and by analysing not only behavioural, but neuroimaging measures as well.

Furthermore, the observed lack of group effects in the latency data could be due to differences in strategies and sound processing between musicians and non-musicians [53–55], which affected reaction times for the former. For instance, Chartrand & Belin [56] investigated timbre processing in a discrimination task and found better performance for musicians, relative to non-musicians, on the discriminability index, but slower overall reaction times for musicians. They concluded that musicians might process sounds at a deeper level, needing more time to encode the stimuli, which may result in a different strategic approach to the task (see also [57]). Therefore, it is possible that the observed reaction times for musicians in comparison to non-musicians reflect a trade-off between accuracy and speed. Motivated by this, we calculated LISAS scores to account simultaneously for both accuracy and response speed (see Methods). Crucially, the group factor did not yield statistically significant effects, and thus further research is needed on the dissociation between latency and accuracy measures when investigating the relationship between musical training and multimodal sensory integration. These results further support the association between implicit processing–as reflected on RTs–and explicit processing–as reflected on accuracy of judgements.

Interestingly, musicians–in comparison to non-musicians–had greater discriminability (d prime), and they made significantly fewer mistakes in all three incongruent audiovisual stimulus categories, whereas they had an advantage in only one congruent category, non-symbolic magnitude, and only on one measure (raw number of mistakes). As far as we are aware, no previous studies indicate whether the musicians’ advantage in multisensory processing is grounded predominantly in the processing of congruent or of incongruent stimuli. Studies in the field typically use d prime as an index of accuracy, which indexes signal detection but cannot attribute the difference to either of the two conditions. Nonetheless, congruent processing relies mostly on predictive coding principles [58]. As such, the audiovisual stimuli in the congruent condition do not violate the underlying predictions and hence engage predominantly top-down mechanisms, which are not explicitly trained in musicians. As crossmodal correspondences seem to be innate [12, 20, 49, 59–61], we can expect to see similar behaviour reflecting congruent processing in both musicians and non-musicians. On the other hand, stimuli in the incongruent condition violate the predictions of congruency, engaging bottom-up processing, which benefits from the perceptual learning effects that musicians gain throughout their training, advancing the perceptual processing of multisensory stimuli. In other words, it is possible that, owing to more refined multisensory integration and sharpened perception of such stimuli, musicians can better identify the violated prediction of congruency than non-musicians. Such an advantage has previously been observed in relation to the temporal window of integration, where musicians show an enhanced ability to detect audiovisual asynchrony [7]. These results, conjointly with the RT results, support the possible interaction between top-down and bottom-up processes.
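As a concrete illustration of the d prime measure discussed above, the standard signal-detection computation following Stanislaw & Todorov [48] can be sketched as follows. The edge-rate correction shown is one common choice, not necessarily the one used in this study, and the function name is ours:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Rates of exactly 0 or 1 are pulled in by 1/(2N) so the inverse
    normal CDF stays finite (a common convention; Stanislaw &
    Todorov, 1999, discuss several alternatives).
    """
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    h = hits / n_signal
    fa = false_alarms / n_noise
    # clamp extreme rates away from 0 and 1
    h = min(max(h, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
    fa = min(max(fa, 1 / (2 * n_noise)), 1 - 1 / (2 * n_noise))
    z = NormalDist().inv_cdf
    return z(h) - z(fa)
```

Because d' collapses hits and false alarms into a single sensitivity value, it cannot by itself say whether a group difference arises in the congruent or the incongruent condition, which is why the raw number of mistakes per condition is analysed separately above.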

Although we had, based on previous findings, reasonable grounds for hypothesising a musical-training advantage in overall performance, it could be that, if musical training does enhance multisensory integration, its measurable effect lies not in overall performance but in the size of the congruency effect. It would be worthwhile for a future study to explore whether such a larger congruency effect exists in musicians when compared to non-musicians. This is particularly true in the light of the observed lack of an overall-performance advantage when measured with RTs. We leave further investigation in this direction to future studies.

In addition, and as expected, we found significant main effects of congruency with all measures (response times, d primes, number of mistakes, LISAS), suggesting audiovisual integration for all three stimulus categories. With regard to stimulus category, reaction times were faster for multisensory stimuli in which the visual dimension varied in spatial elevation, relative to the two magnitude categories. This observation was further complemented by the accuracy data; fewer mistakes were made with pitch-elevation in relation to non-symbolic magnitude category. Similarly, with LISAS, a main effect of visual stimulus category was found with advantage for spatial elevation category, suggesting that even when accuracy scores are taken into account, the trials in the spatial elevation category were processed more efficiently.

Taken together, this suggests a significant difference in the processing of the three types of multisensory stimuli. The faster responses in the pitch-elevation category may indicate an over-learned and partly automated association between pitch and elevation, in contrast to correspondences based solely on newly learned abstract rules (the magnitude categories). Indeed, several previous studies have established the pitch-elevation correspondence [14, 15, 20, 62, 63], relating it to naturally occurring statistics [49] and suggesting its innateness [59–61]. Interestingly, in relation to magnitudes–assuming no naturally occurring correspondence between pitch and symbolic magnitude exists (however, see [36] and [64])–both musicians and non-musicians were able to internalize and automatize the abstract rule within a relatively short period, as both groups performed significantly better than could be expected by chance.
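Above-chance performance of the kind reported here is typically checked with a test against chance-level responding. A minimal sketch using an exact one-sided binomial test; the trial counts and the 50% chance level (plausible for a two-choice congruent/incongruent judgement) are illustrative assumptions, not the study's reported analysis:

```python
from math import comb

def binomial_p_above_chance(n_correct, n_trials, p_chance=0.5):
    """One-sided exact binomial test.

    Returns P(X >= n_correct) when X ~ Binomial(n_trials, p_chance),
    i.e. the probability of doing at least this well by guessing.
    """
    return sum(
        comb(n_trials, k) * p_chance**k * (1 - p_chance)**(n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )
```

A small resulting p-value indicates performance unlikely under pure guessing, i.e. that the abstract congruency rule was learned.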

We also observed that more mistakes were made overall in incongruent trials than in congruent trials. These results are consistent with previous findings on congruency effects with crossmodal correspondences (see, [11, 13] for reviews) and support the notion that the benefits of musical training on performance may generalize beyond the musical context. This advantage was reflected also on congruency based on newly learned abstract rules in otherwise unrelated stimuli (pitch-magnitude pairs).

Finally, we also observed a small, statistically significant interaction between congruency and stimulus category with LISAS. However, the estimated effect size was relatively small, and it has been suggested that in highly powered studies, such as the present one, p-values around 0.04 are more likely under the null hypothesis (for a discussion, see [65]).

It is worth explicitly noting that similarly to most of the studies in this field, the methodology adopted here limits the conclusions we can draw such that a causal relationship cannot be established between musical training and advantages in multisensory integration; we can, at best, only show a relationship between the two factors. This is particularly true as the general cognitive abilities of the participants were not directly controlled, and hence other factors may play a role in explaining the observed differences between the two groups. Future studies should collect more extensive background knowledge regarding participants’ cognitive abilities (e.g. level of education in general) and introduce control tasks to have a stronger basis for suggesting causality.

Another potential limitation of this study is that the musical instruments in our sample were considerably heterogeneous, with most musicians reporting multiple instruments as their main skill. It is plausible that different instruments elicit different effects on multisensory integration, or affect some modalities more than others. Therefore, future studies should investigate the specific effects of training on particular instruments.

Conclusions

This study investigated the association between musical expertise and the processing of audiovisual crossmodal correspondences varying in three visual stimulus dimensions (elevation, symbolic magnitude, non-symbolic magnitude) and in pitch. Congruency was assessed based on explicitly learned abstract rules in pitch-elevation and in otherwise unrelated pitch-magnitude stimuli. We provided novel evidence supporting an interaction between implicit processing–as reflected on RTs–and explicit processing–as reflected on accuracy of judgements. In particular, we observed a large effect on accuracy, supporting the hypothesis that neuroplasticity effects induced by long-term musical training on audiovisual integration generalize beyond musical stimuli. This advantage was reflected in congruency based on newly learned abstract rules in otherwise unrelated stimuli (pitch-magnitude pairs), suggesting an advantage in processes requiring higher-order cognitive functions. Moreover, musicians performed better than non-musicians predominantly in the incongruent condition. These results are discussed in relation to the predictive coding framework. However, this advantage was not reflected in reaction-time measures, or in scores accounting for speed-accuracy trade-offs. Thus, our study supports the notion that accuracy and latency measures reflect different cognitive processes. Further research is needed to understand why musical training may differentially affect these performance measures, and what this means for the corresponding cognitive processes.

Data Availability

The data are publicly accessible and can be retrieved from https://gin.g-node.org/rihalai/Crossmodal_Correspondences.

Funding Statement

This project has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No [2089]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Ghazanfar A. A., & Schroeder C. E. (2006). Is neocortex essentially multisensory? Trends in cognitive sciences, 10(6), 278–285. doi: 10.1016/j.tics.2006.04.008
  • 2. Lee H., & Noppeney U. (2011). Long-term music training tunes how the brain temporally binds signals from multiple senses. Proceedings of the National Academy of Sciences, 108(51), E1441–E1450. doi: 10.1073/pnas.1115267108
  • 3. Zatorre R. J., Chen J. L., & Penhune V. B. (2007). When the brain plays music: auditory-motor interactions in music perception and production. Nature reviews. Neuroscience, 8(7), 547.
  • 4. Gaser C., & Schlaug G. (2003). Brain structures differ between musicians and non-musicians. Journal of Neuroscience, 23(27), 9240–9245. doi: 10.1523/JNEUROSCI.23-27-09240.2003
  • 5. Herholz S. C., & Zatorre R. J. (2012). Musical training as a framework for brain plasticity: behavior, function, and structure. Neuron, 76(3), 486–502. doi: 10.1016/j.neuron.2012.10.011
  • 6. Kuchenbuch A., Paraskevopoulos E., Herholz S. C., & Pantev C. (2014). Audio-tactile integration and the influence of musical training. PloS one, 9(1), e85743. doi: 10.1371/journal.pone.0085743
  • 7. Petrini K., Dahl S., Rocchesso D., Waadeland C. H., Avanzini F., Puce A., et al. (2009). Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony. Experimental brain research, 198(2–3), 339. doi: 10.1007/s00221-009-1817-2
  • 8. Schneider P., Scherg M., Dosch H. G., Specht H. J., Gutschalk A., & Rupp A. (2002). Morphology of Heschl’s gyrus reflects enhanced activation in the auditory cortex of musicians. Nature neuroscience, 5(7), 688. doi: 10.1038/nn871
  • 9. Schneider P., Sluming V., Roberts N., Scherg M., Goebel R., Specht H. J., et al. (2005). Structural and functional asymmetry of lateral Heschl’s gyrus reflects pitch perception preference. Nature neuroscience, 8(9), 1241–1247. doi: 10.1038/nn1530
  • 10. Stewart L. (2008). Do musicians have different brains? Clinical medicine, 8(3), 304–308. doi: 10.7861/clinmedicine.8-3-304
  • 11. Parise C. V. (2016). Crossmodal correspondences: standing issues and experimental guidelines. Multisensory Research, 29(1–3), 7–28. doi: 10.1163/22134808-00002502
  • 12. Parise C., & Spence C. (2013). Audiovisual Cross-Modal Correspondences in the General Population. In J. Simner & E. M. Hubbard (Eds.), Oxford library of psychology. The Oxford handbook of synesthesia (pp. 790–815). Oxford, UK: Oxford University Press.
  • 13. Spence C. (2011). Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4), 971–995.
  • 14. Miller J. (1991). Channel interaction and the redundant-targets effect in bimodal divided attention. Journal of Experimental Psychology: Human Perception and Performance, 17, 160–169. doi: 10.1037//0096-1523.17.1.160
  • 15. Parise C. V., & Spence C. (2009). ‘When birds of a feather flock together’: synesthetic correspondences modulate audiovisual integration in non-synesthetes. PLoS One, 4(5), e5664. doi: 10.1371/journal.pone.0005664
  • 16. Gallace A., & Spence C. (2006). Multisensory synesthetic interactions in the speeded classification of visual size. Perception & Psychophysics, 68(7), 1191–1203. doi: 10.3758/bf03193720
  • 17. Innes-Brown H., & Crewther D. (2009). The impact of spatial incongruence on an auditory-visual illusion. PLoS One, 4(7), e6450. doi: 10.1371/journal.pone.0006450
  • 18. Naumer M. J., Doehrmann O., Müller N. G., Muckli L., Kaiser J., & Hein G. (2008). Cortical plasticity of audio–visual object representations. Cerebral Cortex, 19(7), 1641–1653. doi: 10.1093/cercor/bhn200
  • 19. Besle J., Hussain Z., Giard M. H., & Bertrand O. (2013). The representation of audiovisual regularities in the human brain. Journal of cognitive neuroscience, 25(3), 365–373. doi: 10.1162/jocn_a_00334
  • 20. Evans K. K., & Treisman A. (2010). Natural cross-modal mappings between visual and auditory features. Journal of vision, 10(1), 6–6. doi: 10.1167/10.1.6
  • 21. Paraskevopoulos E., Kuchenbuch A., Herholz S. C., Foroglou N., Bamidis P., & Pantev C. (2014). Tones and numbers: A combined EEG–MEG study on the effects of musical expertise in magnitude comparisons of audiovisual stimuli. Human brain mapping, 35(11), 5389–5400. doi: 10.1002/hbm.22558
  • 22. Paraskevopoulos E., Kuchenbuch A., Herholz S. C., & Pantev C. (2012). Musical expertise induces audiovisual integration of abstract congruency rules. Journal of Neuroscience, 32(50), 18196–18203. doi: 10.1523/JNEUROSCI.1947-12.2012
  • 23. Paraskevopoulos E., Kraneburg A., Herholz S. C., Bamidis P. D., & Pantev C. (2015). Musical expertise is related to altered functional connectivity during audiovisual integration. Proceedings of the National Academy of Sciences, 112(40), 12522–12527. doi: 10.1073/pnas.1510662112
  • 24. Paraskevopoulos E., Chalas N., Foroglou N., & Bamidis P. D. (2016). Musical training enhances multisensory magnitude judgements. Front. Hum. Neurosci. Conference Abstract: SAN2016 Meeting. doi: 10.3389/conf.fnhum.2016.220
  • 25. van Ede F., de Lange F. P., & Maris E. (2012). Attentional cues affect accuracy and reaction time via different cognitive and neural processes. Journal of Neuroscience, 32(30), 10408–10412. doi: 10.1523/JNEUROSCI.1337-12.2012
  • 26. Kirsner K. (2013). Implicit and explicit mental processes. Psychology Press.
  • 27. Norman D. A., & Bobrow D. G. (1975). On data-limited and resource-limited processes. Cognitive psychology, 7(1), 44–64.
  • 28. Santee J. L., & Egeth H. E. (1982). Do reaction time and accuracy measure the same aspects of letter recognition? Journal of Experimental Psychology: Human Perception and Performance, 8(4), 489.
  • 29. Wong A. L., Goldsmith J., Forrence A. D., Haith A. M., & Krakauer J. W. (2017). Reaction times can reflect habits rather than computations. Elife, 6, e28075. doi: 10.7554/eLife.28075
  • 30. Landry S. P., & Champoux F. (2017). Musicians react faster and are better multisensory integrators. Brain and Cognition, 111, 156–162. doi: 10.1016/j.bandc.2016.12.001
  • 31. Bidelman G. M. (2016). Musicians have enhanced audiovisual multisensory binding: experience-dependent effects in the double-flash illusion. Experimental brain research, 234(10), 3037–3047. doi: 10.1007/s00221-016-4705-6
  • 32. Shams L., Kamitani Y., & Shimojo S. (2000). Illusions: What you see is what you hear. Nature, 408(6814), 788–788.
  • 33. Shams L., Kamitani Y., & Shimojo S. (2002). Visual illusion induced by sound. Cognitive Brain Research, 14(1), 147–152. doi: 10.1016/s0926-6410(02)00069-1
  • 34. Dehaene S., Bossini S., & Giraux P. (1993). The mental representation of parity and number magnitude. Journal of Experimental Psychology: General, 122(3), 371.
  • 35. Rusconi E., Kwan B., Giordano B. L., Umilta C., & Butterworth B. (2006). Spatial representation of pitch height: the SMARC effect. Cognition, 99(2), 113–129. doi: 10.1016/j.cognition.2005.01.004
  • 36. Weis T., Estner B., van Leeuwen C., & Lachmann T. (2016). SNARC meets SPARC: Automaticity and Interdependency in Compatibility Effects. The Quarterly Journal of Experimental Psychology, 69(7), 1366–1383.
  • 37. Stein B. E., & Meredith M. A. (1993). The merging of the senses. The MIT Press.
  • 38. Alais D., Newell F. N., & Mamassian P. (2010). Multisensory processing in review: from physiology to behaviour. Seeing and perceiving, 23(1), 3–38. doi: 10.1163/187847510X488603
  • 39. Dehaene S., Piazza M., Pinel P., & Cohen L. (2003). Three parietal circuits for number processing. Cognitive neuropsychology, 20(3–6), 487–506. doi: 10.1080/02643290244000239
  • 40. Vandierendonck A. (2017). A comparison of methods to combine speed and accuracy measures of performance: A rejoinder on the binning procedure. Behavior research methods, 49(2), 653–673. doi: 10.3758/s13428-016-0721-5
  • 41. Vandierendonck A. (2018). Further tests of the utility of integrated speed-accuracy measures in task switching. Journal of Cognition, 1(1), 1–16. doi: 10.5334/joc.6
  • 42. Vandierendonck A. (2021). On the Utility of Integrated Speed-Accuracy Measures when Speed-Accuracy Trade-off is Present. Journal of cognition, 4(1), 1–26.
  • 43. Audacity (version 2.1.2–1) [computer software]. Carnegie Mellon University. Available from http://www.audacityteam.org/home/.
  • 44. Presentation® software (Version 18.0). Neurobehavioral Systems, Inc., Berkeley, CA. Available from https://www.neurobs.com/.
  • 45. Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd Ed.). New Jersey: Lawrence Earlbaum Associates.
  • 46. Friendly, M. (2013). fPower [computer software]. York University. Available from http://www.datavis.ca/sasmac/fpower.html.
  • 47. Whelan R. (2008). Effective analysis of reaction time data. The Psychological Record, 58(3), 475.
  • 48. Stanislaw H., & Todorov N. (1999). Calculation of signal detection theory measures. Behavior research methods, instruments, & computers, 31(1), 137–149. doi: 10.3758/bf03207704
  • 49. Parise C. V., Knorre K., & Ernst M. O. (2014). Natural auditory scene statistics shapes human spatial hearing. Proceedings of the National Academy of Sciences, 111(16), 6104–6108. doi: 10.1073/pnas.1322705111
  • 50. von Bastian C. C., & Druey M. D. (2017). Shifting between mental sets: An individual differences approach to commonalities and differences of task switching components. Journal of Experimental Psychology: General, 146(9), 1266. doi: 10.1037/xge0000333
  • 51. Powers A. R., Hillock A. R., & Wallace M. T. (2009). Perceptual training narrows temporal window of multisensory binding. Journal of Neuroscience, 29(39), 12265–12274.
  • 52. Pinel P., Piazza M., Le Bihan D., & Dehaene S. (2004). Distributed and overlapping cerebral representations of number, size, and luminance during comparative judgments. Neuron, 41(6), 983–993. doi: 10.1016/s0896-6273(04)00107-2
  • 53. Kraus N., Skoe E., Parbery‐Clark A., & Ashley R. (2009). Experience‐induced malleability in neural encoding of pitch, timbre, and timing. Annals of the New York Academy of Sciences, 1169(1), 543–557. doi: 10.1111/j.1749-6632.2009.04549.x
  • 54. Musacchia G., Sams M., Skoe E., & Kraus N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences, 104(40), 15894–15898. doi: 10.1073/pnas.0701498104
  • 55. Strait D. L., Kraus N., Parbery-Clark A., & Ashley R. (2010). Musical experience shapes top-down auditory mechanisms: evidence from masking and auditory attention performance. Hearing research, 261(1), 22–29. doi: 10.1016/j.heares.2009.12.021
  • 56. Chartrand J. P., & Belin P. (2006). Superior voice timbre processing in musicians. Neuroscience letters, 405(3), 164–167. doi: 10.1016/j.neulet.2006.06.053
  • 57. Munzer S., Berti S., & Pechmann T. (2002). Encoding of timbre, speech, and tones: Musicians vs. non-musicians. Psychological Test and Assessment Modeling, 44(2), 187.
  • 58. Dercksen T. T., Stuckenberg M. V., Schröger E., Wetzel N., & Widmann A. (2021). Cross-modal predictive processing depends on context rather than local contingencies. Psychophysiology, 58(6), e13811. doi: 10.1111/psyp.13811
  • 59. Braaten R. (1993). Synesthetic correspondence between visual location and auditory pitch in infants. In 34th Annual Meeting of the Psychonomic Society.
  • 60. Spence C., & Deroy O. (2013). How automatic are crossmodal correspondences? Consciousness and cognition, 22(1), 245–260. doi: 10.1016/j.concog.2012.12.006
  • 61. Walker P., Bremmer J. G., Mason U., Spring J., Mattock K., Slater A., et al. (2010). Preverbal infants’ sensitivity to synaesthetic cross-modality correspondences. Psychological Science, 21(1), 21–25. doi: 10.1177/0956797609354734
  • 62. Ben-Artzi E., & Marks L. E. (1995). Visual-auditory interaction in speeded classification: Role of stimulus difference. Perception & Psychophysics, 57(8), 1151–1162.
  • 63. Melara R. D., & O’Brien T. P. (1987). Interaction between synesthetically corresponding dimensions. Journal of Experimental Psychology: General, 116(4), 323.
  • 64. Walsh V. (2003). A theory of magnitude: common cortical metrics of time, space and quantity. Trends in cognitive sciences, 7(11), 483–488. doi: 10.1016/j.tics.2003.09.002
  • 65. Spanos A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.

Decision Letter 0

Deborah Apthorp

5 Apr 2022

PONE-D-22-01297 The effect of musical training on the processing of audiovisual correspondences: Evidence from a reaction time task

PLOS ONE

Dear Dr. Ihalainen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The reviewers have provided careful and constructive comments which I feel can be addressed in a thorough revision. Please pay particular attention to clarifying the procedure and reporting the statistics correctly. 

Please submit your revised manuscript by May 20 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Deborah Apthorp, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf  and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper investigates the effect of musicianship on three audiovisual crossmodal correspondences: pitch-elevation/symbolic & non-symbolic magnitude (PE, PSM, PNSM, respectively). Musicianship had no effect on reaction times; but musicians were more accurate than non-musicians overall, largely because they were more accurate on incongruent trials. Accuracy was highest on the PE task but there appear to be no differences between musicians and non-musicians on any task. Similarly, there is no group effect or group/task interaction for speed-accuracy scores. The authors attribute these results to top-down effects on pre-attentive processing.

While this is an interesting question, it’s quite hard to work out what participants were actually asked to do and, thus, how the question is being addressed. Each trial consists of two audiovisual pairs, let’s say 1 dot and 4 dots each accompanied by a different tone. What exactly makes a trial congruent or incongruent? If the tones are different in each pair, one has to be higher than the other. Is a congruent trial one in which the single dot is accompanied by a low tone and the 4 dots by a high tone, with an incongruent trial being 1-dot/high tone plus 4-dots/low tone? If so, what’s the point of the pairs? You could achieve (in)congruency with a single audiovisual stimulus, i.e. congruent = 1-dot/low tone, incongruent = 1-dot/high tone. Or do participants have to say whether the second pair breaks the rule?

It would be helpful to amend Fig 1 to indicate how the tones are paired with the visuals (e.g., insert text ‘high’/‘low’), and to show both congruent and incongruent examples.

PE, PSM and PNSM trials are randomly ordered within a block of 180 trials, so participants have to remember three different rules and switch between them at random. To what extent do the results simply reflect task demands and switching costs?

Incidentally, the paper constantly refers to ‘visual stimuli’ and ‘visual stimulus categories’ when they are, in fact, audiovisual. In any case, what the authors seem to mean by this is the three crossmodal correspondences/tasks; it would be clearer to refer to them as such.

There also seems to be some confusion over what constitutes congruency: better performance in a multisensory condition than a unisensory condition reflects integration, not congruency (p5); see also comments on the Discussion.

Results

3.1 Please report the t-tests for the post-hoc comparisons for the main effect of task (and the Bonferroni-corrected alpha). Was the group/task interaction not significant? It would be helpful to add an overall mean to the bottom of Table 1 so that the reader can connect back to the text.

3.2 What does Figure 2 show? Does it refer to performance against chance (p14) – in which it should show what value would represent chance and also show median values as in the text instead of changing to mean values – or does it reflect the Mann-Whitney test (p15)?

The legend suggests that it refers to “musicians and non-musicians discriminating between congruent and incongruent trials”. This would be more useful than either of the options above but, in that case, the data would change to number of mistakes and for each group should be broken down by trial type. This is the “large advantage for musicians” (p18/367) so let’s see that clearly.

The final paragraph of 3.2 describes the main effect of trial type and would be better placed after the paragraph reporting the main effect of group.

Were there any group differences on each task?

3.3 Please report the means, SDs, tests, p-values, and corrected alpha for the explanation of the task/trial type interaction.

Was the group/task interaction not significant?

Discussion

Paragraph 1 needs to make clear that the advantage for musicians is quite general, across all tasks, with no group/task interactions (I assume, see comments above).

The opening sentence of paragraph 2 is a bit misleading – the results show *overall* main effects of congruency (faster RTs, fewer mistakes, smaller LISAS for congruent compared to incongruent) but no congruency/task interactions are reported for RTs or accuracy: presumably these were not significant? There is such an interaction in the LISAS analysis but, instead of a difference between congruent and incongruent trials *within* a task (i.e., a congruency effect), this seems to reflect differences in the congruent/incongruent conditions *between* tasks (and only the magnitude tasks) which is not helpful. Just because overall RTs are faster for the PE task compared to the other two doesn’t mean there’s a congruency effect.

Overall, I feel the authors need to be clearer about how they’re defining key terms and make sure these reflect accepted definitions in the literature and then go from there.

Minor points

p6/138: after ‘sets’ insert ‘out’

p8: the first paragraph should be moved to the start of section 2.4 where it makes more sense; the title for section 2.1 then becomes just ‘Participants’.

p12/260: if there are 2 experimental blocks each with 180 trials then all stimuli are presented twice?

p15/324, 333, 334: please report the statistical test supporting the p-value.

p16/344-345: the p-values are redundant, they just reflect the main effect reported a few lines above.

p16/346-348: please report the tests underlying the p-values and the corrected alpha level.

p12/262: ‘trials’ not ‘trails’

p18/376, p19/400: ‘innateness’ would be better.

p19/391: I’m not sure I would refer to the RT data as ‘raw’ because they were considerably cleaned up – perhaps ‘absolute’ or just not qualify it at all.

p20/421: after ‘previous’ insert ‘studies’.

Figure 1: the labels for symbolic and non-symbolic magnitude need to be swapped so that they are under the correct task.

Reviewer #2: The current study examined the influences of musical training on crossmodal correspondences between vision and audition. Three rules of correspondence were tested: pitch-elevation, pitch-numerosity, and pitch-digit pairings. The results demonstrated that the only effect involving musical training was that musicians made fewer errors than non-musicians, especially on incongruent trials. In addition, responses were faster and contained fewer errors in the congruent than in the incongruent condition, and were faster for the pitch-elevation pairing than for the pitch-numerosity and pitch-digit pairings; a similar effect was demonstrated when response time and errors were considered jointly using the LISAS index.

In general, the rationale and the design of the study are confusing, so it is hard for me to reach any clear conclusion. Here are my main concerns:

1. The first concern is the rationale of using accuracy and response time measures. To my knowledge, accuracy is more suitable than response time when probing early processing of stimuli with time-limited presentation (Norman & Bobrow, 1975, Cognitive Psychology; Santee & Egeth, 1982, JEP:HPP). In contrast to the authors’ arguments, the response time measure often involves the accumulation process of decision making.

2. It is unclear how to separate the different types of crossmodal correspondences at the pre-attentive stage, associated with sub-cortical structures, from higher-order cognitive processes. Presumably, the sub-cortical structure mentioned by the authors (the superior colliculus) does not represent stimulus identity, and it is therefore not possible to reveal any crossmodal correspondences at this level of processing.

3. It is unclear why the three correspondence rules were defined as “newly learned”: did the participants truly learn the rules, or were they merely instructed to respond in such ways?

More specifically, the rule “The higher the spatial elevation, the higher the tone” is a natural correspondence and has been repeatedly reported in the literature (reviewed in the Discussion). However, the other two rules, “the more dots presented, the higher the tone” and “the higher the number presented, the higher the tone”, seem counter-intuitive given the vertical numerical line (Hung, Hung, Tzeng, & Wu, 2008, Cognition). It is therefore not surprising that response times for the first rule were faster than for the latter two.

4. The authors’ prediction is ambiguous: If musical training induces enhanced multisensory integration, should the prediction be larger congruency effect rather than overall better performance for musicians than non-musicians?

5. The experimental design is confusing and hard to follow:

(1) There were four types of stimuli in each visual and auditory stimulus domain. Would it be possible that some trials would be easier (such as using the tones F5 and E6) than other trials (such as using the tones A5 and C6)?

(2) There were two audiovisual stimulus pairs presented sequentially in each trial. Isn’t one pair of audiovisual stimuli sufficient for response?

(3) A figure of experimental procedure would be helpful.

(4) How were the hit and false alarm rates defined when calculating d prime?

6. In Figure 2, there should be 2x3x2 bars, corresponding to the experimental design.

7. Can the better performance (fewer errors) in musicians than non-musicians simply reflect better motor control acquired through instrumental training?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 6;18(4):e0282691. doi: 10.1371/journal.pone.0282691.r002

Author response to Decision Letter 0


4 Oct 2022

Response to Reviewers

Reviewer #1 comments to the authors

This paper investigates the effect of musicianship on three audiovisual crossmodal correspondences: pitch-elevation/symbolic & non-symbolic magnitude (PE, PSM, PNSM, respectively). Musicianship had no effect on reaction times; but musicians were more accurate than non-musicians overall, largely because they were more accurate on incongruent trials. Accuracy was highest on the PE task but there appear to be no differences between musicians and non-musicians on any task. Similarly, there is no group effect or group/task interaction for speed-accuracy scores. The authors attribute these results to top-down effects on pre-attentive processing.

While this is an interesting question, it’s quite hard to work out what participants were actually asked to do and, thus, how the question is being addressed.

Response to reviewer: The authors would like to sincerely thank the reviewer for the feedback, and especially for highlighting the weaknesses and raising the questions on issues that were not clearly communicated or considered in the original manuscript. We find the feedback very constructive, and have taken the utmost care to address your concerns in the revised manuscript.

As for making it clearer what the participants were asked to do, we have added an additional figure in the manuscript (figure 2) that shows the experimental procedure in detail in congruent and incongruent conditions for each of the three audiovisual stimulus categories. With the figure we have added the following legend: “Fig 2. Examples of the experimental procedure in both congruent and incongruent conditions for each of the three audiovisual stimulus categories. Panels A – B illustrate congruent and incongruent trials in the pitch-elevation category, respectively. Each panel/trial consisted of a single video with the total length of 860 ms. In this video clip the participant saw two visual stimuli each associated with a sound on a particular pitch. The pitch varied in frequency and the visual stimulus varied in elevation. Congruency was estimated according to an explicitly learned rule, “The higher the spatial elevation, the higher the tone”. Panels C – D illustrate congruent and incongruent trials in the symbolic magnitude-pitch category, respectively. Here, the visual stimulus varied in the value of the number shown and congruency was estimated according to the explicitly learned rule, “the higher the number presented, the higher the tone”. Panels E – F illustrate congruent and incongruent trials in the nonsymbolic magnitude-pitch category, respectively. Here, the visual stimulus varied in the number of circles shown and congruency was estimated according to the explicitly learned rule, “the more dots presented, the higher the tone”. Note that in each category congruency/incongruency could be induced either with the visual stimuli or with the auditory stimuli.”

1. Each trial consists of two audiovisual pairs, let’s say 1 dot and 4 dots each accompanied by a different tone. What exactly makes a trial congruent or incongruent? If the tones are different in each pair, one has to be higher than the other. Is a congruent trial one in which the single dot is accompanied by a low tone and the 4 dots by a high tone, with an incongruent trial being 1-dot/high tone plus 4-dots/low tone? If so, what’s the point of the pairs? You could achieve (in)congruency with a single audiovisual stimulus, i.e. congruent = 1-dot/low tone, incongruent = 1-dot/high tone. Or do participants have to say whether the second pair breaks the rule?

Response to reviewer: We do acknowledge that the task the participants were asked to do could have been communicated more clearly. Here, congruency/incongruency was defined by whether the abstract rule was followed or not. The reviewer is correct in that a congruent trial is one in which, e.g., a single dot is accompanied by a low tone followed by 4 dots accompanied by a high tone. An incongruent trial, on the other hand, could be one in which a single dot accompanied by a high tone is followed by 4 dots accompanied by a low tone. One such sequence formed a single trial and consisted of a pair of audiovisual stimuli (one dot with a tone and 4 dots with a tone). After such a pair was shown, the participants were instructed to indicate whether the rule was followed or not.

It is not clear how such a reflection of the congruency (e.g. whether the rules were followed or not) could be achieved with a single audiovisual stimulus (i.e. one dot with a low tone). We do feel that the confusion here is due to our poor communication of the task (and what is meant by ‘a pair of audiovisual stimuli’). Hence, we have made the following changes to the corresponding parts of the manuscript:

We have added an additional figure (figure 2) showing the detailed experimental procedure in congruent and incongruent conditions for each of the three audiovisual stimulus categories. With the figure, we have written a detailed legend explaining the figure (and hence the procedure) in what we believe to be clear terms (see also the reply above). Moreover, we have edited the methods section to explain the procedure more clearly (lines 250-290).

Lines 250-290 now say the following: “Next, the experiment began. The experimental procedure is shown in Fig 2.

Enter figure 2 here

Fig 2. Examples of the experimental procedure in both congruent and incongruent conditions for each of the three audiovisual stimulus categories. Panels A – B illustrate congruent and incongruent trials in the pitch-elevation category, respectively. Each panel/trial consisted of a single video with the total length of 860 ms. In this video clip, the participant saw two visual stimuli each associated with a sound on a particular pitch. The pitch varied in frequency and the visual stimulus varied in elevation. Congruency was estimated according to an explicitly learned rule, “The higher the spatial elevation, the higher the tone”. Panels C – D illustrate congruent and incongruent trials in the symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the value of the number shown and congruency was estimated according to the explicitly learned rule, “the higher the number presented, the higher the tone”. Panels E – F illustrate congruent and incongruent trials in the non-symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the number of circles shown and congruency was estimated according to the explicitly learned rule, “the more dots presented, the higher the tone”. Note that in each category congruency/incongruency could be induced either with the visual or with the auditory part of the stimuli.

Each trial consisted of a single video clip (a single panel on Fig 2). The sequence of events in each trial was as follows: the first audiovisual stimulus was presented in the middle of the screen for 400 ms, followed by a short break of 60 ms before the second audiovisual stimulus (400 ms). Each of the audiovisual stimuli consisted of a visual picture (see Fig 1) associated with an auditory stimulus that varied in pitch. The onset time for the auditory stimuli was synchronized with the onset of the visual images. After presenting both audiovisual stimuli (i.e. after presenting a single trial consisting of a pair of audiovisual stimuli), congruency was estimated; the pair of audiovisual stimuli allowed the participants to estimate congruency between the stimuli based on the explicitly learned abstract rules. These rules were “The higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”, depending on the category. The participants were instructed to respond as quickly and accurately as possible with their right hand only, pressing ‘K’ on the keyboard when the corresponding rules were followed and ‘L’ when they were not. After the response, a 1000 ms blank screen appeared before the next trial. The researcher further instructed the participants verbally, and made sure the instructions were understood properly.

There were two experimental blocks, and each block consisted of 180 trials. In each block, there were three audiovisual stimulus categories with 60 trials each (30 congruent and 30 incongruent). The order of the trials was pseudo-randomized across participants so that no two consecutive trials had the same stimulus type. This randomization process also enabled the elimination of any potential bias caused by varying intervals between the auditory tones and the visual stimuli.”
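As a side note, the no-consecutive-repeats constraint described in the quoted passage can be sketched with a simple greedy draw. This is a hypothetical illustration (the category labels, counts, and function name are ours), not the randomization script actually used in the study:

```python
import random

def pseudo_randomize(counts, seed=0):
    """Order trials so that no two consecutive trials share a category.

    Greedy draw: never repeat the previous category, and place the
    dominant category first whenever it holds more than half of the
    remaining trials (otherwise the sequence could become infeasible).
    """
    rng = random.Random(seed)
    remaining = dict(counts)
    order, prev = [], None
    while any(remaining.values()):
        # categories with trials left, excluding the one just used
        options = [c for c, n in remaining.items() if n > 0 and c != prev]
        total = sum(remaining.values())
        forced = [c for c in options if 2 * remaining[c] > total]
        choice = rng.choice(forced or options)
        order.append(choice)
        remaining[choice] -= 1
        prev = choice
    return order

# Three audiovisual categories, 60 trials each, as in one experimental block
seq = pseudo_randomize({"PE": 60, "PSM": 60, "PNSM": 60})
assert all(a != b for a, b in zip(seq, seq[1:]))  # no immediate repeats
```

With equal category counts the forced-placement rule is never violated at the start, so the greedy draw always completes; a plain shuffle-and-reject approach would almost never produce a repeat-free sequence of this length.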

2. It would be helpful to amend Fig 1 to indicate how the tones are paired with the visuals (e.g., insert text ‘high’/‘low’), and to show both congruent and incongruent examples.

Response to reviewer: Please, see the above response.

3. PE, PSM and PNSM trials are randomly ordered within a block of 180 trials, so participants have to remember three different rules and switch between them at random. To what extent do the results simply reflect task demands and switching costs?

Response to reviewer: We would like to thank the reviewer for this interesting suggestion. Indeed, the task employed involved a “judgment” switching component, but please notice that the task did not require shifting of the response mapping, which remained the same across the judgment conditions. According to von Bastian and Druey (2017), who systematically investigated all types of switching (judgment, dimension, stimulus, mapping and response), shifting of the response mapping is the component most central and relevant to switching ability and switching cost. In addition, given that RT is the most sensitive measure of switching cost, and that in the response time analyses the main effect of group was not significant (and this factor did not interact with the other factors), it is unlikely that better switching ability in musicians could explain our results.

We discuss the cognitive demands of the task and their possible effects on the results in lines 440-496. This section also includes a paragraph specifically discussing the possible effect of task switching and our response to it.

4. Incidentally, the paper constantly refers to ‘visual stimuli’ and ‘visual stimulus categories’ when they are, in fact, audiovisual. In any case, what the authors seem to mean by this is the three crossmodal correspondences/tasks, it would be clearer to refer to them as such.

Response to reviewer: Thank you for this comment. We edited the manuscript to increase clarity. For example, at the beginning of section 2.2 (Stimuli), instead of talking about “three types of visual stimuli” we now refer to the visual and auditory parts of the audiovisual stimuli. Similarly, in the legend of figure 1, we refer simply to the stimulus, instead of the “visual stimulus”. Similar changes have been made throughout the manuscript where necessary.

5. There also seems to be some confusion over what constitutes congruency: better performance in a multisensory condition than a unisensory condition reflects integration, not congruency (p5); see also comments on the Discussion.

Response to reviewer: We do acknowledge the confusion in the use of the terms here. This part has been edited to say (lines 105-109): “There is a relatively large number of reaction time studies done with non-musicians in the context of multimodal sensory integration. These investigations have consistently reported better performance with multimodal stimuli: faster detection of the target in the multimodal condition relative to the unimodal condition (see for example Miller, 1991; for reviews see Parise & Spence, 2013; Spence, 2011).”

RESULTS

6. 3.1 Please report the t-tests for the post-hoc comparisons for the main effect of task (and the Bonferroni-corrected alpha). Was the group/task interaction not significant? It would be helpful to add an overall mean to the bottom of Table 1 so that the reader can connect back to the text.

Response to reviewer: We have added more information to paragraph 3.1. We now explicitly state that the interactions were statistically non-significant and that the post hoc comparisons were made using estimated marginal means rather than t-tests. The reported p-values for the estimated marginal means are Bonferroni-corrected.

We have also added the overall means to Table 1, as suggested.

7. 3.2 What does Figure 2 show? Does it refer to performance against chance (p14) – in which it should show what value would represent chance and also show median values as in the text instead of changing to mean values – or does it reflect the Mann-Whitney test (p15)?

The legend suggests that it refers to “musicians and non-musicians discriminating between congruent and incongruent trials”. This would be more useful than either of the options above but, in that case, the data would change to number of mistakes and for each group should be broken down by trial type. This is the “large advantage for musicians” (p18/367) so let’s see that clearly.

Response to reviewer: We acknowledge that it was indeed unclear what figure 3 (figure 2 in the previous version) refers to. It shows the difference in d-primes between musicians and non-musicians discriminating between congruent and incongruent stimuli. To make this clearer, we have added the following sentence in paragraph 3.2 (Accuracy): “Fig 3 shows the difference in the mean d-prime between musicians and non-musicians discriminating between congruent and incongruent stimuli.”

As for why the d-prime was calculated by combining the audiovisual stimulus categories: as we responded to reviewer #2 (correction number 6), when recording the mistakes, we had several participants who made zero mistakes in some of the audiovisual categories, especially with congruent stimuli, causing issues in the d-prime calculations. As our main interest lay in the potential difference between musicians and non-musicians, rather than between the types of visual stimuli, to overcome the issue caused by participants with zero mistakes we followed the suggestions of Stanislaw & Todorov (1999) and combined the data from several categories before calculating the hit and false-alarm rates (https://link.springer.com/content/pdf/10.3758/BF03207704.pdf). Consequently, when we investigated the accuracy between the visual categories further, we used the raw number of mistakes in the analyses.

We have added this information also in paragraph 2.4 (Analysis), which now states: “As our main interest lay in the potential difference in accuracy between musicians and non-musicians, and as several participants made zero mistakes in some of the stimulus categories, instead of calculating the d-prime for each audiovisual category individually we followed the suggestions of Stanislaw & Todorov (1999) and combined the data from the three stimulus categories before calculating the hit and false-alarm rates. For further investigation of audiovisual category-wise accuracy, we used the raw number of mistakes.”
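The pooling step can be illustrated numerically. The counts below are invented (not the study’s data), and treating congruent trials as “signal” and incongruent trials as “noise” is our assumption for the sketch; the point is that a category with zero false alarms yields a rate of 0, for which z is undefined, whereas pooling first yields finite rates:

```python
from statistics import NormalDist

def d_prime(hits, signal_trials, false_alarms, noise_trials):
    """d' = z(hit rate) - z(false-alarm rate); rates of exactly 0 or 1
    make inv_cdf undefined, which is why categories are pooled first."""
    z = NormalDist().inv_cdf
    return z(hits / signal_trials) - z(false_alarms / noise_trials)

# Hypothetical per-category counts for one participant
# (30 congruent "signal" and 30 incongruent "noise" trials per category)
categories = [
    {"hits": 28, "signal": 30, "fa": 0, "noise": 30},  # zero false alarms
    {"hits": 27, "signal": 30, "fa": 3, "noise": 30},
    {"hits": 29, "signal": 30, "fa": 2, "noise": 30},
]

# Pool counts across the three categories before computing rates
hits = sum(c["hits"] for c in categories)
signal = sum(c["signal"] for c in categories)
fa = sum(c["fa"] for c in categories)
noise = sum(c["noise"] for c in categories)

print(round(d_prime(hits, signal, fa, noise), 2))  # → 3.09
```

Computing d' for the first category alone would require z(0/30), which is negative infinity; the pooled rates (84/90 hits, 5/90 false alarms) give a well-defined sensitivity estimate.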

8. The final paragraph of 3.2 describes the main effect of trial type and would be better placed after the paragraph reporting the main effect of group.

Were there any group differences on each task?

Response to reviewer: We have moved the paragraph below the results describing the difference in raw mistakes between musicians and non-musicians, as suggested. However, we kept the results with d-primes and raw mistakes separate.

We have also added a paragraph describing the difference between musicians and non-musicians in each of the audiovisual categories individually (lines 341-346). Correspondingly, we added these results also in the discussion section (lines 390-392 & 407-409).

9. 3.3 Please report the means, SDs, tests, p-values, and corrected alpha for the explanation of the task/trial type interaction.

Was the group/task interaction not significant?

Response to reviewer: For the interactions in reaction times, we have the results from the ANOVA with estimated marginal means; however, we feel that reporting the estimated marginal means (with SDs, p-values, and corrected p-values) is not crucial, since none of these comparisons could be distinguished from chance. Rather, we have explicitly stated that the interactions did not reach statistical significance, and provided the level above which all the p-values lay.

DISCUSSION

10. Paragraph 1 needs to make clear that the advantage for musicians is quite general, across all tasks, with no group/task interactions (I assume, see comments above).

Response to reviewer: See our response above (correction #8).

11. The opening sentence of paragraph 2 is a bit misleading – the results show *overall* main effects of congruency (faster RTs, fewer mistakes, smaller LISAS for congruent compared to incongruent) but no congruency/task interactions are reported for RTs or accuracy: presumably these were not significant? There is such an interaction in the LISAS analysis but, instead of a difference between congruent and incongruent trials *within* a task (i.e., a congruency effect), this seems to reflect differences in the congruent/incongruent conditions *between* tasks (and only the magnitude tasks) which is not helpful. Just because overall RTs are faster for the PE task compared to the other two doesn’t mean there’s a congruency effect.

Response to reviewer: This is true: reading the sentence did give a misleading impression. What we meant was that when measured with RTs, d-primes, number of mistakes and LISAS, we found a congruency effect. We have corrected this sentence to state “Specifically, our results showed significant main effects of congruency with all measures (response times, d primes, number of mistakes, and linear integrated speed-accuracy scores), suggesting audiovisual integration for all three stimulus categories.”

Moreover, we have re-structured and largely re-written the entire discussion section to better reflect the comments and suggestions by both of the reviewers. Hence, the entire discussion section should be re-visited.
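For reference, the linear integrated speed-accuracy score (LISAS) mentioned above adds each condition’s error proportion, rescaled into RT units, to its mean correct RT. The sketch below uses invented single-participant data and our own variable names; it is only meant to show the form of the measure, not the study’s computation:

```python
from statistics import mean, pstdev

# Hypothetical per-trial RTs (ms) and error flags (1 = error) for one participant
congruent_rts, congruent_errs = [480, 510, 495, 505], [0, 0, 0, 1]
incongruent_rts, incongruent_errs = [560, 590, 575, 610], [0, 1, 1, 0]

# Scaling factor: the participant's overall SDs of RT and of the binary errors
s_rt = pstdev(congruent_rts + incongruent_rts)
s_pe = pstdev(congruent_errs + incongruent_errs)

def lisas(rts, errs):
    """Mean correct RT plus the error proportion rescaled into RT units."""
    correct_rts = [r for r, e in zip(rts, errs) if e == 0]
    return mean(correct_rts) + (s_rt / s_pe) * (sum(errs) / len(errs))

print(round(lisas(congruent_rts, congruent_errs), 1))      # → 518.6
print(round(lisas(incongruent_rts, incongruent_errs), 1))  # → 632.3
```

Because errors are converted into an RT-equivalent penalty, a congruency effect in LISAS reflects both faster correct responses and fewer errors in the congruent condition.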

MINOR POINTS

12. p6/138: after ‘sets’ insert ‘out’ - corrected

p8: the first paragraph should be moved to the start of section 2.4 where it makes more sense; the title for section 2.1 then becomes just ‘Participants’. - corrected

p12/260: if there are 2 experimental blocks each with 180 trials then all stimuli are presented twice? – This was indeed a bit unclear. It now states “There were two experimental blocks, and each block consisted of 180 trials. In each block, there were three audiovisual stimulus categories with 60 trials in each (30 congruent and 30 incongruent).” (lines 255-257).

p15/324, 333, 334: please report the statistical test supporting the p-value. - corrected

p16/344-345: the p-values are redundant, they just reflect the main effect reported a few lines above. – removed

p16/346-348: please report the tests underlying the p-values and the corrected alpha level. - corrected

p12/262: ‘trials’ not ‘trails’ - corrected

p18/376, p19/400: ‘innateness’ would be better. - corrected

p19/391: I’m not sure I would refer to the RT data as ‘raw’ because they were considerably cleaned up – perhaps ‘absolute’ or just not qualify it at all. - corrected

p20/421: after ‘previous’ insert ‘studies’. - corrected

Figure 1: the labels for symbolic and non-symbolic magnitude need to be swapped so that they are under the correct task. - corrected

Reviewer #2 comments to the authors

The current study examined the influences of musical training on crossmodal correspondences between vision and audition. Three rules of correspondence were tested: pitch-elevation, pitch-numerosity, and pitch-digit pairings. The results demonstrated that the only effect involving musical training was that musicians made fewer errors than non-musicians, especially in the incongruent trials. In addition, responses were faster and contained fewer errors in the congruent than in the incongruent condition, and were faster for the pitch-elevation pairing than for the pitch-numerosity and pitch-digit pairings; a similar pattern emerged when response time and errors were considered jointly using the LISAS index.

In general, the rationale and the design of the study is confusing, so it is hard for me to reach any clear conclusion. Here are my main concerns:

Response to reviewer: The authors would like to sincerely thank the reviewer for the feedback. We find the feedback very constructive, especially the feedback concerning clarity and the rationale for parts of the manuscript. We have taken the utmost care to address your concerns in the revised manuscript.

1. The first concern is the rationale for using both accuracy and response time measures. To my knowledge, accuracy is more suitable than response time when probing early processing of stimuli with time-limited presentation (Norman & Bobrow, 1975, Cognitive Psychology; Santee & Egeth, 1982, JEP:HPP). In contrast to the authors’ arguments, response time measures often involve the accumulation process of decision making.

Response to the reviewer: We wish to thank the reviewer for pointing this out, and we fully agree that reaction time and accuracy measures tap into different underlying processes. Each type of response therefore yields different information, with reaction times reflecting implicit processing and accuracy reflecting explicit processing more closely. This is indeed what we argue in the manuscript, and it is our justification for incorporating both measures: we aim to capture these differences by using the two in combination.

We have edited the wording throughout the manuscript to better reflect this comment and removed the parts of text explicitly referring to ‘requirements of conscious processing’ (lines 99-110). Instead, we now highlight the point that the two measures reflect different cognitive processes (implicit/explicit), and hence, provide complementary information of the underlying processes.

Moreover, we have re-structured and largely re-written the entire discussion section to better reflect the comments and suggestions by both of the reviewers. The discussion section now includes parts discussing the results in relation to these two measures and the underlying cognitive processing. Hence, we recommend that the entire discussion section should be re-visited.

2. It is unclear how to separate different types of crossmodal correspondences at pre-attentive stage associated with sub-cortical structure versus higher-order cognitive process. Presumably, sub-cortical structures mention by the authors (superior colliculus) does not represent stimulus identity, and therefore it is not possible to reveal any crossmodal correspondences at this level of processing.

Response to the reviewer: We agree with the reviewer that, behaviourally, the distinction between higher order cortical processing and lower level sub-cortical processing is difficult, if not outright impossible. However, this was not the aim of the manuscript. Rather, in the revised manuscript we speculate on an interplay between top-down, higher order processing and pre-attentive processing, and provide some suggestions on how future studies could break down this interaction further.

That said, a substantial body of research has addressed multisensory correspondences (Bidelman, 2016; Landry & Champoux, 2017; Miller, 1991; Paraskevopoulos et al., 2014; Paraskevopoulos et al., 2012; Parise & Spence, 2013; Spence, 2011), and previous work has already identified the role of sub-cortical structures, including but not limited to the superior colliculi, in the processing of such stimuli (e.g. Stein & Meredith, 1993; for a review see Alais, Newell & Mamassian, 2010). The mention of these structures in the manuscript is hence grounded in the a-priori knowledge provided by this literature.

Moreover, we have re-structured and largely re-written the entire discussion section to better reflect the comments and suggestions by both of the reviewers. Hence, the entire discussion section should be re-visited.

3. It is unclear why the three correspondence rules were described as “newly learned”: did the participants truly learn the rules, or were they merely instructed to respond in such ways?

More specifically the rule “The higher the spatial elevation, the higher the tone” is a natural correspondence and has been repeatedly reported in literature (reviewed in Discussion). However, the other two rules “the more dots presented, the higher the tone” and “the higher the number presented, the higher the tone” seem to be counter-intuitive to the vertical numerical line (Hung, Hung, Tzeng, & Wu, 2008, Cognition). It is therefore not surprising that the response time for the first rule was faster than the latter two rules.

Response to the reviewer: We fully agree that the advantage of pitch-elevation over pitch-symbolic magnitude and pitch-non-symbolic magnitude is, in and of itself, not surprising. Indeed, in the manuscript we do not claim that it is; instead, we discuss it relatively briefly in relation to previous papers suggesting its innateness and the more extensive experience people have with it compared to the two other types of audiovisual stimuli used in the experiment (the two magnitude categories). Precisely because pitch-elevation has been suggested to be a naturally occurring statistic with which people in general – and not only musicians – have more extensive experience, we added the two other audiovisual categories. With pitch-symbolic magnitude and pitch-non-symbolic magnitude – as we discuss in the paper – such a natural linkage is not as clear (although see the manuscript discussion section, lines 544 – 558).

Hence, our task also requires the participants to estimate congruence based on an explicitly learned rule binding otherwise unrelated unisensory stimuli. Here, ‘newly/explicitly learned’ applies particularly to the two magnitude categories, as no natural linkage between magnitude and pitch has been demonstrated; by ‘learning’ we simply refer to the participants internalizing the rule, keeping it in mind, and discriminating the audiovisual stimuli accordingly.

In addition to not having a clear, naturally occurring linkage with pitch, magnitude processing in and of itself has been argued to reflect “higher” cognitive processes (Dehaene, Piazza, Pinel & Cohen, 2003; Paraskevopoulos et al., 2014). Hence, adding the two magnitude categories alongside pitch-elevation enabled us to investigate the hypothesised advantage for musicians in tasks varying in cognitive demand, which we hypothesised would be reflected in the two measures used (RT and accuracy).

Lastly, it is not clear to us how the vertical mental number line (particularly with Chinese number words mentally aligned top-to-bottom) is problematic here: the only change in spatial elevation occurred in the pitch-elevation category, and all our visual stimuli were centred horizontally. Moreover, in the manuscript we do briefly discuss the SNARC and SMARC effects. Perhaps the confusion relates to the word ‘higher’ in our abstract rules, which in this case refers only to a higher value being represented, rather than to spatial elevation.

4. The authors’ prediction is ambiguous: If musical training induces enhanced multisensory integration, should the prediction be larger congruency effect rather than overall better performance for musicians than non-musicians?

Response to the reviewer: We wish to thank the reviewer for raising this interesting question. We have added parts into the discussion section discussing this point (lines 503 – 534).

First, we point out that regarding the group difference between the congruent and incongruent conditions, as far as we know, there are no previous results to indicate whether the advantage is grounded in the processing of the congruent or the incongruent stimulus, as typical studies in the field use d prime as an index – which cannot attribute the difference to either of the two conditions, but only infer signal detection. We then relate this observation to, and further discuss it within, the predictive coding framework (Dercksen et al., 2021, Psychophysiology). In particular, we suggest that in the congruent condition the underlying predictions are not violated and, as the correspondence is suggested to be innate, the prediction should (behaviourally) work equally well in the two groups (engaging predominantly top-down processing, which is not explicitly trained in musicians). The incongruent condition, on the other hand, violates the prediction and hence engages bottom-up processing routes. This route, in turn, takes advantage of the perceptual learning effects that musicians develop throughout their training, sharpening the perceptual processing of multisensory stimuli. In other words, due to enhanced multisensory integration and sharpened perception of such stimuli, musicians can better identify violations of the predicted congruency. We also see this effect in relation to the temporal window of integration, where musicians can more easily discriminate out-of-sync stimuli because of their sharpened temporal window of integration curve (“Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony”).

We also point out that although our hypothesis was strongly grounded in previous studies, a worthwhile direction for future studies could be to not only look for the overall effects on congruency, but also to look at measures such as the size of the congruency effect.

5. The experimental design is confusing and hard to follow:

(1) There were four types of stimuli in each visual and auditory stimulus domain. Would it be possible that some trials would be easier (such as using the tones F5 and E6) than other trials (such as using the tones A5 and C6)?

Response to the reviewer: It is indeed true that some of the auditory stimulus pairs had a larger interval between them than others. The same is true to some extent with the visual stimuli (for example with the elevation category in which the physical difference in spatial distance between the two stimuli varied between the trials). The pairs of stimuli were chosen pseudorandomly and both the auditory tones and the visual stimuli were adopted from previous studies by Paraskevopoulos (2012; 2014; 2015) to establish comparability between those studies and the present study – to the extent it is possible despite differences in methodology.

Crucially however, the participants heard, saw, and responded to the same stimuli, but in pseudorandom order. Hence, any potential bias caused by varied difficulty across the trials was eliminated.

We have added this information explicitly to the manuscript (line 272-274): “This randomization process also enabled the elimination of any potential bias caused by varying intervals between the auditory tones and the visual stimuli.”

(2) There were two audiovisual stimulus pairs presented sequentially in each trial. Isn’t one pair of audiovisual stimuli sufficient for response?

Response to the reviewer: We acknowledge that the task the participants were asked to perform could have been communicated more clearly. Here, congruency/incongruency was defined by whether the abstract rule was followed or not. The reviewer is correct that a pair of audiovisual stimuli is sufficient for estimating congruency, and this is indeed what we tried to communicate in the manuscript.

To make the procedure clearer, we have added an additional figure (Fig 2) showing the detailed experimental procedure in the congruent and incongruent conditions for each of the three audiovisual stimulus categories. We have also written a detailed legend explaining the figure (and hence the procedure) in what we believe to be clear terms (see also the reply above). Moreover, we have edited the methods section to explain the procedure more clearly (lines 250-290).

Lines 250-290 now say the following: “Next, the experiment began. The experimental procedure is shown in Fig 2.

Enter figure 2 here

Fig 2. Examples of the experimental procedure in both congruent and incongruent conditions for each of the three audiovisual stimulus categories. Panels A – B illustrate congruent and incongruent trials in the pitch-elevation category, respectively. Each panel/trial consisted of a single video with a total length of 860 ms. In this video clip, the participant saw two visual stimuli, each associated with a sound at a particular pitch. The pitch varied in frequency and the visual stimulus varied in elevation. Congruency was estimated according to an explicitly learned rule, “The higher the spatial elevation, the higher the tone”. Panels C – D illustrate congruent and incongruent trials in the symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the value of the number shown and congruency was estimated according to the explicitly learned rule, “the higher the number presented, the higher the tone”. Panels E – F illustrate congruent and incongruent trials in the non-symbolic magnitude-pitch category, respectively. Here, the stimulus varied in the number of circles shown and congruency was estimated according to the explicitly learned rule, “the more dots presented, the higher the tone”. Note that in each category congruency/incongruency could be induced either with the visual or with the auditory part of the stimuli.

Each trial consisted of a single video clip (a single panel in Fig 2). The sequence of events in each trial was as follows: the first audiovisual stimulus was presented in the middle of the screen for 400 ms, followed by a short break of 60 ms before the second audiovisual stimulus (400 ms). Each of the audiovisual stimuli consisted of a visual picture (see Fig 1) associated with an auditory stimulus that varied in pitch. The onset of the auditory stimuli was synchronized with the onset of the visual images. After both audiovisual stimuli had been presented (i.e. after a single trial consisting of a pair of audiovisual stimuli), congruency was estimated; the pair of audiovisual stimuli allowed the participants to estimate congruency between the stimuli based on the explicitly learned abstract rules. These rules were “The higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”, depending on the category. The participants were instructed to respond as quickly and accurately as possible, with their right hand only, by pressing ‘K’ on the keyboard when the corresponding rule was followed and ‘L’ when it was not. After the response, a 1000 ms blank screen appeared before the next trial. The researcher further instructed the participants verbally and made sure the instructions were understood properly.

There were two experimental blocks, and each block consisted of 180 trials. In each block, there were three audiovisual stimulus categories with 60 trials in each (30 congruent and 30 incongruent). The order of the trials was pseudo-randomized across participants so that no consecutive trials had the same stimulus type. This randomization process also enabled the elimination of any potential bias caused by varying intervals between the auditory tones and the visual stimuli.”
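The pseudo-randomization constraint described above (no two consecutive trials of the same stimulus type) can be sketched as follows. This is one simple scheme under the assumption of equal trial counts per category, not the authors' actual code; the category labels are illustrative:

```python
import random

CATEGORIES = ("pitch-elevation", "pitch-symbolic", "pitch-non-symbolic")

def pseudo_random_order(n_rounds=60):
    """Build a 180-trial order in which no two consecutive trials share a
    stimulus category: shuffle the three categories within each 'round' and
    reshuffle whenever a round would start with the previous trial's category."""
    order = []
    for _ in range(n_rounds):
        round_ = list(CATEGORIES)
        random.shuffle(round_)
        while order and round_[0] == order[-1]:
            random.shuffle(round_)
        order.extend(round_)
    return order

order = pseudo_random_order()
assert len(order) == 180
assert all(a != b for a, b in zip(order, order[1:]))
```

A plain rejection-sampling shuffle of all 180 trials would almost never satisfy the constraint with only three categories; shuffling within rounds guarantees a valid order while keeping the sequence unpredictable to participants.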

(3) A figure of experimental procedure would be helpful.

Response to the reviewer: This again is an excellent suggestion. We have added a figure showing the experimental procedure (see fig.2 and the response above).

(4) How were the hit and false alarm rates defined when calculating d prime?

Response to the reviewer: We have added the following sentence in section 2.4 Analysis (lines 300-304): “For the calculations of d-prime, hits were defined as congruent stimuli correctly identified as congruent, misses as congruent stimuli identified as incongruent, false alarms as incongruent stimuli identified as congruent, and correct rejections as incongruent stimuli identified as incongruent.”
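Given those definitions, d-prime follows directly as z(hit rate) minus z(false-alarm rate). A minimal sketch with illustrative counts (not the study's data):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with congruent trials
    treated as 'signal' and incongruent trials as 'noise'."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate)

# e.g. 55 of 60 congruent trials identified correctly, and 6 of 60
# incongruent trials wrongly called congruent
print(round(d_prime(55, 5, 6, 54), 2))
```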

6. In Figure 2, there should be 2x3x2 bars, corresponding to the experimental design.

Response to the reviewer: When recording mistakes, we had several participants who made zero mistakes in some of the audiovisual categories, especially with congruent stimuli, which caused issues in the d-prime calculations. As our main interest lay in the potential difference between musicians and non-musicians, rather than between the types of visual stimuli, to overcome the issue caused by participants with zero mistakes we followed the suggestions of Stanislaw & Todorov (1999) and combined the data from the three categories before calculating the hit and false-alarm rates (https://link.springer.com/content/pdf/10.3758/BF03207704.pdf). Consequently, when we investigated accuracy between the visual categories further, we used the raw number of mistakes in the analyses.

We have added this information also in the Analysis section, which now states (lines 304-309): “As our main interest lied in the potential difference in accuracy between musicians and non-musicians and as several participants made zero mistakes in some of the stimulus categories, instead of calculating the d-prime for each audiovisual category individually, following the suggestions of Stanislaw & Todorov (1999), we combined the data from the three stimulus categories before calculating the hit and false-alarm rates. For the further investigation of audiovisual category-wise accuracy, we used the raw number of mistakes.”
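The pooling step can be illustrated like this (the per-category counts are hypothetical). A category with zero misses yields a hit rate of 1.0, for which the z-transform is undefined; pooling the counts across categories before computing the rates is exactly what avoids that problem:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

# Hypothetical per-category counts: (hits, misses, false alarms, correct rejections)
per_category = {
    "pitch-elevation":     (60, 0, 2, 58),  # zero misses: hit rate 1.0, z(1.0) undefined
    "pitch-symbolic":      (57, 3, 5, 55),
    "pitch-non-symbolic":  (55, 5, 8, 52),
}

# Pool the counts across the three categories, then compute the rates once
h, m, fa, cr = (sum(counts[i] for counts in per_category.values()) for i in range(4))
pooled_d = z(h / (h + m)) - z(fa / (fa + cr))
```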

7. Can the better performance (less errors) in musicians than non-musicians simply reflect a better motor control after musical training of instruments?

Response to the reviewer: It is our view that such a suggested advantage in motor control after persistent musical training would first and foremost affect reaction times, not accuracy. In other words, if the better performance were due to better motor control, we should have seen an advantage in reaction times first, which we did not observe here. It is especially counterintuitive for an advantage to appear in accuracy measures but not in reaction times if it were due to better motor control alone. Hence, it is unlikely that the better performance was due to better motor control in musicians.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Deborah Apthorp

15 Dec 2022

PONE-D-22-01297R1
The effect of musical training on the processing of audiovisual correspondences: Evidence from a reaction time task
PLOS ONE

Dear Dr. Ihalainen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 29 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Deborah Apthorp, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

I apologise for the delay - Reviewer 2 declined to review again, and we felt it was important to source another opinion. Reviewer 3 was aware that the paper had been previously reviewed, and suggests minor revisions, in harmony with Reviewer 1.

In particular, please ensure that all data and code for this study is publicly available, in line with the policies of the journal.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I thank the authors for their very detailed responses and revision. In particular, the von Bastian & Druey paper was interesting, thanks for bringing that to my attention. Please note, though, that it does not appear in the reference list…

All the minor points have been addressed except one (although it was marked as corrected). In the Participants section, the second paragraph, beginning “We applied a standard…”, should be moved to be either the first or second paragraph of the Analysis section, depending on which the authors feel flows better. This is because it is more about prepping the data for analysis than it is about who participated.

I noticed a few errors of English expression in the revision: for example, “our main interest lied” rather than “lay”; and also some typos, for example, “thank” instead of “than”. Please make sure these, and any others, are caught at the proof stage.

Apart from these small items, it looks great!

Reviewer #3: I had the pleasure of reading this interesting manuscript about crossmodal integration in musicians and nonmusicians. As I am reading this for the first time, I will give a general overview of the manuscript and then add specific comments. Note that I also read the previous reviews, and it seems to me that the authors addressed all the points raised well. My comments will be slightly different from those already mentioned, but will not require any substantial modification of the present manuscript.

First of all, I believe that the study itself is well designed, the manuscript reads well, and the theoretical background and discussion are sufficiently rich. I have a general piece of advice though: throughout the manuscript, we can often read "the effects of music training". Note that very few studies could really prove that music training causes improvements in various perceptual/cognitive skills. Most studies are based on the comparison of adult musicians and nonmusicians, which is not sufficient to talk about cause/effect relationships. I encourage the authors to talk about "association with music training" instead of "effects". Then, in the discussion, there is clearly room to mention why the authors believe it is reasonable to consider music training as the cause. But it is still an interpretation, and the authors should acknowledge, perhaps in the limitations section, that the present study cannot infer any cause-effect relationship.

This is particularly true (and here I suggest adding another limitation) as there was no general control of the cognitive abilities of the two groups. For example, the speed of processing subtests of the WAIS-IV could have been informative in explaining why there were no differences between groups in the RTs. Having no control tasks, one cannot exclude that the two groups differed in terms of general cognitive abilities. Also, I did not read any information about years of education. Was this variable collected? Do the groups differ in years of education?

Again on the group differences. I see that the inclusion criterion for nonmusicians was not having formal training apart from the mandatory classes at school. Did the authors check whether the nonmusicians could nevertheless have learnt to play an instrument as self-taught players? For instance, as they "self-identified as having no musical expertise", was musical expertise defined as having received formal lessons? This is important, because one could still play a bit as an amateur but consider him/herself not an expert. What were the exact questions asked to gather information about musical expertise? Similarly, I find the range of musical expertise in the musician group quite large. Studies including musicians usually have stricter criteria (e.g., 7-8 years of training at least); 3 years seems not enough. Did the authors at least check whether the musicians were active at the moment of testing? If an individual had 3 years of experience but stopped playing 10 years before, this wouldn't really qualify as being a musician in my opinion.

I encourage the authors to add more details about the two groups and, if these details are not available, to include the criteria used to create the two groups as a possible limitation (which might also explain different results from previous studies, perhaps). I know that the authors already provide many analyses, but perhaps it could be interesting to look at the correlations between years of musical expertise and RTs/accuracy, as the range is very wide?

Finally, I do not see in the manuscript any statement or link for data availability. My apologies if this will appear later on, in any case, I think it would be great to provide a link where readers can access the dataset.

Some specific details I noticed:

-Figures: Figure 1 and 2 seem very low resolution.

-page 4, line 85: "audiovisual congruency effects" I suggest adding a definition of what these effects are in practice (e.g., higher accuracy in identifying congruent pairs, shorter RTs, etc.?), as there might be different things to which the authors are referring.

-page 4, line 94: "at discriminating the stimuli" This reads a bit vague, I suggest clarifying what the task required.

-page 4, line 97,98: "pitch-elevation stimuli" and "non-symbolic magnitude" are not yet clear in the introduction; I suggest defining these types of stimuli, otherwise the reader will only understand them in the method section.

-page 5, line 114: here the "detection reaction time task" is also a bit vague; did the participants have to respond to congruency again?

-page 5, line 117: Apologies if this is my mistake, but from the description of the study by Bidelman et al., it seems that the musicians experienced the audiovisual illusion less frequently. Does this mean that they integrated the stimuli better? Intuitively, I would say that if they integrated them better, they would experience the illusion more, not less. Experiencing the illusion less frequently (or with shorter - less detectable - durations) might indicate, to me, that they could better segregate the two types of stimuli, not integrate them. But I might be misunderstanding what is written.

-Page 7, line 150: I'm not a native speaker, but starting a paragraph with "therefore" reads strangely.

- Page 7, line 162: I suggest writing that the power analysis is explained later on; otherwise, at first glance, one could wonder why no details about it are reported.

-Page 8, line 170: "made a mistake". How is the mistake defined here? Because later on, mistakes (wrong answers) are taken into account in the analyses, so I believe that this is a different type of mistake.

-Page 17. I find it a bit strange to read that there is a difference between musicians and nonmusicians in the number of mistakes in the incongruent trials, but then, in the last paragraph with this analysis, there is no difference between groups in mistakes in the congruent and incongruent conditions (see line 382, "irrespective of musicianship"). Are these two results a bit in contradiction? If the musicians make fewer mistakes in the incongruent conditions, I would expect an interaction, not an overall effect of congruency.

-page 19, lines 410-411: Can the authors report the statistics (at least the p-values, as before) for the post-hoc significant comparisons?

-page 20, line 416: I think that mentioning "effects of long-term training" here is quite tricky for the reasons mentioned before: (1) there is no way to establish any effect of musical training with the present study; (2) speaking about long-term training with an inclusion criterion of >3 years seems a bit optimistic.

I think that if these minor details are clarified, the manuscript will then be ready to be published.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 6;18(4):e0282691. doi: 10.1371/journal.pone.0282691.r004

Author response to Decision Letter 1


12 Feb 2023

Response to Reviewers

Reviewer #1 comments to the authors

I thank the authors for their very detailed responses and revision. In particular, the von Bastian & Druey paper was interesting; thanks for bringing that to my attention. Please note, though, that it does not appear in the reference list…

All the minor points have been addressed except one (although it was marked as corrected). In the Participants section, the second paragraph, beginning “We applied a standard…”, should be moved to either the first or second paragraph of the Analysis section, depending on which the authors feel flows better. This is because it is more about preparing the data for analysis than about who participated.

I noticed a few errors of English expression in the revision: for example, “our main interest lied” rather than “lay”; and also some typos, for example, “thank” instead of “than”. Please make sure these, and any others, are caught at the proof stage.

Apart from these small items, it looks great!

Response to reviewer: We wish to thank the reviewer for the overall encouraging feedback on both revision rounds, and we do want to apologize for these typos and mishaps that had slipped into the manuscript. We have proofread the manuscript again and corrected the mistakes accordingly. Due to issues in switching to a different reference manager, we had failed to include one of the in-text references in the bibliography. This has now been corrected (added under ‘B’). We have also followed the suggestion of moving the paragraph from the Participants section to a more proper location (under Analysis).

Reviewer #3 comments to the authors

I had the pleasure of reading this interesting manuscript about crossmodal integration in musicians and nonmusicians. As I read this for the first time, I will give a general overview of the manuscript and then add specific comments. Note that I also read the previous reviews, and it seems to me that the authors addressed all the points raised well. My comments will, however, be slightly different from those already mentioned, but will not require any substantial modification of the present manuscript.

First of all, I believe that the study itself is well designed, the manuscript reads well, and the theoretical background and discussion are sufficiently rich. I have a piece of general advice, though: throughout the manuscript, we can often read "the effects of music training". Note that very few studies could really prove that music training causes improvements in various perceptual/cognitive skills. Most studies are based on the comparison of adult musicians and nonmusicians, which is not sufficient to talk about cause/effect relationships. I encourage the authors to talk about "association with music training" instead of "effects".

Then, in the discussion, there is clearly space for mentioning why the authors believe it is reasonable to consider music training as the cause. But it is still an interpretation, and the authors should acknowledge, perhaps in the limitations section, that the present study cannot infer any cause-effect relationship.

This is particularly true (and here I suggest adding another limitation) as there was no general control of the cognitive abilities of the two groups. For example, the processing speed subtests of the WAIS-IV could have been informative in explaining why there were no differences between groups in the RTs. Having no control tasks, one cannot exclude that the two groups differed in terms of general cognitive abilities. Also, I did not read any information about years of education. Was this variable collected? Are the groups different in years of education?

Response to reviewer: The authors would like to sincerely thank the reviewer for the positive and very constructive and insightful feedback. We absolutely agree with the point regarding causality: it is indeed the case that we cannot demonstrate causality with a decision reaction-time task such as the one in the present manuscript. We do not wish to claim that such causality is demonstrated, although we do recognise that we have used unclear language and semantics that easily read as if we did wish to suggest such a causal relationship. We have changed the language throughout the manuscript to refer to a relationship/association/link between musical training and advantages in multisensory integration, as suggested. We have also added a paragraph in the discussion section stating this explicitly. In that paragraph, we also suggest that future studies should use control tasks and collect more extensive background information from participants to have stronger grounds for suggesting causality between musicianship and the observed advantages (lines 568-576).

Again on the group differences. I see that the inclusion criterion for nonmusicians was not having had formal training outside of the mandatory classes at school. Did the authors check whether the nonmusicians could have, anyway, learned to play an instrument as self-taught musicians? For instance, as they "self-identified as having no musical expertise", was musical expertise defined as having received formal lessons? This is important, because one could still play a bit as an amateur but consider him/herself not an expert. What were the exact questions asked to gather information about musical expertise?

Response to reviewer: This is again an important point. We did ask the participants about any musical experience, and, as we mention in the Participants section under Materials and Methods, we excluded one participant who had played the piano between the ages of 7 and 12. Self-identified musical experience here was based on asking whether the participants played any instruments, composed music on a computer, or had taken lessons for any instrument at any point in their life. Any continuous lessons or practice, excluding the compulsory music lessons in elementary/high school, resulted in exclusion from the experiment. To be more precise, we have added this last sentence to the manuscript (lines 181-182).

Similarly, I find the range of musical expertise in the musician group quite large. Studies including musicians usually have stricter criteria (e.g., at least 7-8 years of training). Three years does not seem enough. Did the authors at least check whether the musicians were active at the moment of testing? Because if an individual had 3 years of experience but had stopped playing 10 years before, well, this wouldn't really qualify as being a musician in my opinion. I encourage the authors to add more details about the two groups, and if these details are not available, to include the criteria used to create the two groups as a possible limitation (that might also explain different results from previous studies, perhaps).

Response to reviewer: We did indeed ask whether the musicians were currently active, and that was one of the inclusion criteria; we have added this detail to the manuscript as well. We also asked the participants their mean training time per week, which varied between 1 and 6 hours, but as we did not have this information for all of the musicians, we did not include it in the manuscript. Moreover, only 2 musicians reported having less than 6 years of training; hence, we do not believe the more flexible definition of musicianship played a significant role in the results (in comparison to other studies in the field). We have added more details to the manuscript regarding the definition of musicianship.

We also noticed a mistake in the mean number of years of musical education and updated the value to the correct mean after participant exclusion.

I know that the authors already provide many analyses, but perhaps it could be interesting to look at the correlations between years of musical expertise and RTs/accuracy, as the range is very wide?

Response to reviewer: As the manuscript already includes a number of analyses, we are hesitant to add any more into the mix. We also feel that our analyses already cover these questions to an extent: it was precisely our hypothesis that reaction times and the number of mistakes made would decrease as a function of musical expertise, which could be captured with the number of years of training.

However, based on this feedback, we did run the suggested correlation analyses. All correlations with reaction times were non-significant, while all correlations with the number of mistakes made in each category were statistically significant, with all correlations being negative (i.e., fewer mistakes made as musical education increased). This is not surprising given our previous results indicating that musicians made fewer mistakes than non-musicians (with training of 0 years), and that no differences were found in reaction times.

These results remained the same regardless of whether we correlated only musicians or included non-musicians (with training years of zero) in the sample. Hence, respectfully, we do not think including them would add much value to the manuscript.
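For readers who wish to reproduce this supplementary check on the released dataset, the analysis amounts to simple bivariate correlations between years of training and the two outcome measures. The sketch below uses simulated data; the variable names (training_years, mean_rt, n_mistakes) and sample size are illustrative, not the actual names or values in the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40

# Simulated per-participant data: years of musical training
# (0 for non-musicians), mean reaction time (ms), and mistake count.
# By construction, mistakes decrease with training while RT does not.
training_years = rng.integers(0, 20, size=n).astype(float)
mean_rt = 600 + rng.normal(0, 50, size=n)
n_mistakes = np.maximum(0.0, 30 - training_years + rng.normal(0, 3, size=n))

# Pearson correlations, mirroring the reviewer-requested analysis
r_rt, p_rt = stats.pearsonr(training_years, mean_rt)
r_err, p_err = stats.pearsonr(training_years, n_mistakes)
print(f"training vs RT:       r = {r_rt:+.2f}, p = {p_rt:.3f}")
print(f"training vs mistakes: r = {r_err:+.2f}, p = {p_err:.3f}")
```

With data patterned like the reported results, only the training-mistakes correlation comes out significant (and negative), while the training-RT correlation does not.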

Finally, I do not see in the manuscript any statement or link for data availability. My apologies if this will appear later on, in any case, I think it would be great to provide a link where readers can access the dataset.

Response to reviewer: During the submission process we mentioned that the data will be made available upon publication. It is now online and can be found at https://gin.g-node.org/rihalai/Crossmodal_Correspondences

Some specific details I noticed:

-Figures: Figures 1 and 2 seem very low resolution.

Response to reviewer: We agree that figure 1 had low resolution, and we have re-made it at a higher resolution. The resolution of figure 2 is better and is dictated by the resolution of the original stimulus images.

-page 4, line 85: "audiovisual congruency effects" I suggest adding a definition of what these effects are in practice (e.g., higher accuracy in identifying congruent pairs, shorter RTs, etc.), as there might be different things to which the authors are referring.

Response to reviewer: These studies are discussed in detail in the very next paragraph (starting from line 87). This was not very clear, though, and we have now edited the manuscript such that the new paragraph starts from the line first mentioning the “audiovisual congruency effects” and is followed by the more detailed discussion of those experiments.

-page 4, line 94: "at discriminating the stimuli" This reads a bit vague; I suggest clarifying what the task required.

Response to reviewer: Agreed. We changed this to refer to the task of judging whether the stimuli were congruent with the rule.

-page 4, lines 97-98: "pitch-elevation stimuli" and "non-symbolic magnitude" are not yet clear in the introduction; I suggest defining these types of stimuli, otherwise the reader will only understand them in the Method section.

Response to reviewer: We have edited the text to refer to spatial elevation, which is mentioned before, and generally to visual representations of magnitudes (in conjunction with pitch).

-page 5, line 114: here, the "detection reaction time task" is also a bit vague; did the participants have to respond to congruency again?

Response to reviewer: We have added more details of the task into the description. It now says “…with a detection reaction time task, in which the participants were instructed to click a mouse button immediately upon perception of auditory, tactile, or simultaneous audio-tactile stimuli, and reported that…” (lines 114-118).

-page 5, line 117: Apologies if it is my mistake, but reading the description of the study by Bidelman et al., it seems that the musicians experienced the audiovisual illusion less frequently. Does this mean that they integrated the stimuli better? Because intuitively, I would say that if they integrated them better, they would suffer the illusion more, not less. Experiencing the illusion less frequently (or with shorter, less detectable durations) might indicate, to me, that they could better segregate the two types of stimuli, not integrate them. But I might be misunderstanding what is written.

Response to reviewer: This is a fair question. We discuss the Bidelman et al. (2016) paper as an example of an observed processing speed advantage in musicians over non-musicians, and this was indeed one of the results of Bidelman et al. (i.e., musicians were faster at making their responses than nonmusicians, and were not only more accurate at processing concurrent audiovisual cues but considerably faster at judging the composition of audiovisual stimuli).

In addition, they did observe musicians to have a lower susceptibility to perceiving the illusory effect. In the paper, Bidelman et al. talk about “considerably more refined binding of auditory and visual cues” and “..musicians have enhanced multisensory integration and are better able to accurately parse audiovisual cues”. They conclude that musical experience “improves multimodal processing and integration of multiple sensory systems”. On the other hand, they cite other results showing age-related increases in multisensory integration “.. as evidenced by broader temporal binding window”. We take it to be their position that, taken together, these results indicate a more refined, improved multisensory integration/processing (whereas the narrower temporal binding window in musicians alone would probably be characterized differently if discussed in isolation).

In the manuscript, we were careful not to talk about “increased multisensory integration” in relation to these results, but rather “improved multisensory integration”. We have further edited this part to refer to a more refined, improved integration of multiple sensory systems in a domain-general manner.

-Page 7, line 150: I'm not a native speaker, but starting a paragraph with "therefore" reads strangely.

Response to reviewer: We have changed this to “Hence”.

- Page 7, line 162: I suggest writing that the power analysis is explained later on; otherwise, at first glance, one could wonder why no details about it are reported.

Response to reviewer: Corrected.

-Page 8, line 170: "made a mistake". How is the mistake defined here? Because later on, mistakes (wrong answers) are taken into account in the analyses, so I believe that this is a different type of mistake.

Response to reviewer: These mistakes were indeed wrong answers (analysed later on). Here, we refer only to the reaction time measure: for the calculation of reaction times, we used only the recorded data from the correct responses. Note that this paragraph has now been moved under the Analysis section, as per the suggestion of Reviewer #1.
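To make this preprocessing step concrete, filtering reaction times to correct responses before averaging can be sketched as below. The record layout and values are hypothetical, not the actual structure of the released data.

```python
from collections import defaultdict

# Hypothetical trial-level records: (participant, rt_ms, correct)
trials = [
    (1, 512, True), (1, 640, False), (1, 480, True),
    (2, 700, True), (2, 530, True), (2, 615, False),
]

# Collect reaction times from correct trials only,
# discarding trials where a mistake (wrong answer) was made
rts = defaultdict(list)
for participant, rt, correct in trials:
    if correct:
        rts[participant].append(rt)

# Per-participant mean RT over correct trials
mean_rt = {p: sum(v) / len(v) for p, v in rts.items()}
print(mean_rt)  # {1: 496.0, 2: 615.0}
```

Incorrect trials still enter the separate accuracy analysis; they are dropped only from the reaction-time summaries.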

-Page 17. I find it a bit strange to read that there is a difference between musicians and nonmusicians in the number of mistakes in the incongruent trials, but then, in the last paragraph with this analysis, there is no difference between groups in mistakes in the congruent and incongruent conditions (see line 382, "irrespective of musicianship"). Are these two results a bit in contradiction? If the musicians make fewer mistakes in the incongruent conditions, I would expect an interaction, not an overall effect of congruency.

Response to reviewer: Here, the last result did not look at group differences (between musicians and non-musicians) at all. Rather, we analysed the overall number of mistakes made (musicians and non-musicians combined) in all congruent trials and in all incongruent trials, and found a large difference indicating more mistakes in incongruent trials (as expected). This difference was present irrespective of musicianship (i.e., for both groups combined). We understand that mentioning musicianship here may be confusing, as we are looking at all participants together; hence, we have removed that part (“irrespective of musicianship”) from the paragraph.

-page 19, lines 410-411: Can the authors report the statistics (at least the p-values, as before) for the post-hoc significant comparisons?

Response to reviewer: We have added the p-values for this estimated marginal means interaction. We have also corrected the language in this paragraph and related the interaction more directly to the elevation category.

-page 20, line 416: I think that mentioning "effects of long-term training" here is quite tricky for the reasons mentioned before: (1) there is no way to establish any effect of musical training with the present study; (2) speaking about long-term training with an inclusion criterion of >3 years seems a bit optimistic.

Response to reviewer: This is now corrected to “.. explore the association between long-term musical training and bimodal sensory integration”.

I think that if these minor details are clarified, the manuscript will then be ready to be published.

Response to reviewer: We appreciate all the feedback and feel that the manuscript is much stronger after editing it accordingly.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Deborah Apthorp

22 Feb 2023

The relationship between musical training and the processing of audiovisual correspondences: Evidence from a reaction time task

PONE-D-22-01297R2

Dear Dr. Ihalainen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Deborah Apthorp, Ph.D

Academic Editor

PLOS ONE

Acceptance letter

Deborah Apthorp

27 Feb 2023

PONE-D-22-01297R2

The relationship between musical training and the processing of audiovisual correspondences: Evidence from a reaction time task

Dear Dr. Ihalainen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Deborah Apthorp

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    The data are publicly accessible and can be retrieved from https://gin.g-node.org/rihalai/Crossmodal_Correspondences.


    Articles from PLOS ONE are provided here courtesy of PLOS

    RESOURCES