Published in final edited form as: Multisens Res. 2017 Jan 1;30(3-5):287–306. doi: 10.1163/22134808-00002553

Interactions Between Auditory Elevation, Auditory Pitch and Visual Elevation during Multisensory Perception

Yaseen Jamal 1, Simon Lacey 1, Lynne Nygaard 3, K Sathian 1,2,3,4

Summary

Cross-modal correspondences refer to associations between apparently unrelated stimulus features in different senses. For example, high and low auditory pitches are associated with high and low visual elevations, respectively. Here we examined how this crossmodal correspondence between visual elevation and auditory pitch relates to auditory elevation. We used audiovisual combinations of high- or low-frequency bursts of white noise and a visual stimulus comprising a white circle. Auditory and visual stimuli could each occur at high or low elevations. These multisensory stimuli could be congruent or incongruent for three correspondence types: cross-modal featural (auditory pitch/visual elevation), within-modal featural (auditory pitch/auditory elevation) and cross-modal spatial (auditory and visual elevation). Participants performed a 2AFC speeded classification (high or low) task while attending to auditory pitch, auditory elevation, or visual elevation. We tested for modulatory interactions between the three correspondence types. Modulatory interactions were absent when discriminating visual elevation. However, the within-modal featural correspondence affected the cross-modal featural correspondence during discrimination of auditory elevation and pitch, while the reverse modulation was observed only during discrimination of auditory pitch. The cross-modal spatial correspondence modulated the other two correspondences only when auditory elevation was being attended, was modulated by the cross-modal featural correspondence only during attention to auditory pitch, and was modulated by the within-modal featural correspondence while performing discrimination of either auditory elevation or pitch. We conclude that the cross-modal correspondence between auditory pitch and visual elevation interacts strongly with auditory elevation.

Keywords: cross-modal correspondence, audiovisual, spatial, congruency effect

Introduction

Cross-modal correspondences refer to almost universally experienced associations between apparently arbitrary stimulus features in different senses (Spence, 2011). For example, people consistently associate large and small visual size with low- and high-pitched sounds, respectively (Gallace and Spence, 2006; Evans and Treisman, 2010); and auditorily presented pseudowords, e.g., ‘takete’ and ‘maluma’, with pointed and rounded visual shapes, respectively (Köhler, 1929, 1947). Cross-modal correspondences tend to occur between stimulus properties that are correlated in nature, and thus might serve to increase the efficiency of information processing, ultimately helping integrate sensory data into meaningful representations (Spence, 2011), although under some circumstances, cross-modal correspondences may act to compromise veridical perception. For instance, in many species, body size is inversely correlated with the formant frequencies of vocalizations, but many animals are able to emit atypically low sounds in competitive contexts, apparently a bluffing strategy to exaggerate their perceived size (Fitch, 2000).

In this study we focus on the well-known cross-modal correspondence in which high and low auditory pitch are associated with high and low visuospatial elevation, respectively (e.g., Bernstein and Edelstein, 1971; Ben-Artzi and Marks, 1995; Evans and Treisman, 2010; Lacey et al., 2016). During simultaneous presentation of auditory and visual stimuli, classification of auditory pitch as high or low is enhanced by congruent visuospatial cues (Ben-Artzi and Marks, 1995) and visuospatial congruency effects increase as the magnitude of the pitch difference between auditory tones increases (Ben-Artzi & Marks, 1995; Chiou and Rich, 2012). This commonly studied and robust correspondence, between very basic properties of sensory stimuli (Spence, 2011), may underlie the intuitive nature of Western musical notation, where higher pitch is represented by higher position on the musical score. The origins of cross-modal correspondences remain uncertain. In the case of auditory pitch and visual elevation, one suggestion is that the correspondence arises from learnt statistical regularities of the natural environment, i.e., high-pitched sounds tend to emanate from high locations and low-pitched sounds from low locations (Parise et al., 2014). Alternatively, it may be due to semantically mediated associations arising because the same word – high or low – can refer to either pitch or elevation (Spence, 2011; Walker et al., 2012).

The association between pitch and elevation is not only cross-modal, i.e. auditory pitch being linked to visual elevation (see above), but also applies to unisensory auditory stimuli, for which pitch is associated with the elevation of the sound (Pratt, 1930; Roffler and Butler, 1968; Parise et al., 2014). Here we address the question of whether and how the correspondence between auditory pitch and visual elevation relates to auditory elevation. In other words, is the cross-modal correspondence enhanced if high- and low-pitched sounds emanate from high and low locations, respectively? Given that auditory pitch interacts with spatial localization both within-modally and cross-modally, we asked whether the cross-modal featural correspondence between auditory pitch and visual elevation (AP-VE) is influenced by the spatial elevation of the sound. If it is, auditory elevation, and its within-modal featural correspondence with auditory pitch (AP-AE), would be a contributor to the cross-modal featural (AP-VE) correspondence. We also asked the complementary question: does the elevation of the visual cue affect the within-modal featural (AP-AE) correspondence? An affirmative answer to this question would imply that the cross-modal featural (AP-VE) correspondence has an effect independent of the within-modal featural (AP-AE) correspondence.

Another potential consideration is the spatial alignment between visual and auditory stimuli, i.e., cross-modal spatial (AE-VE) correspondence. Cross-modal spatial congruency is well known to be an important determinant of neuronal responses to multisensory stimuli in the cat superior colliculus as well as in certain spatially organized cortical regions in various species, and can facilitate behavioral orientation in cats (Stein, 1998). In humans, cross-modal spatial congruency similarly facilitates performance on tasks involving spatial orienting or where spatial attention is relevant, but not necessarily on non-spatial discrimination tasks (Spence, 2013). Further, perceptually bound sounds and lights tend to be co-localized (Wallace et al., 2004). Hence, cross-modal spatial (AE-VE) congruency may also affect the cross-modal featural (AP-VE) correspondence. Thus, we additionally asked whether the cross-modal spatial (AE-VE) correspondence between visual and auditory elevation modulates, or is modulated by, the cross-modal (AP-VE) and within-modal (AP-AE) featural correspondences.

To address these questions, we tested whether various combinations of correspondences between the three stimulus attributes of interest (auditory pitch, auditory elevation, and visual elevation) interact with single feature discrimination when selectively attending to that feature during presentation of an audiovisual stimulus.

Materials and methods

Participants

Thirty-one people (18 females) volunteered for the present study. Two volunteers were excluded due to poor performance on the initial test of unisensory auditory elevation discrimination (see below); 2 others were excluded due to procedural errors during the main experiment; and 3 additional participants were excluded due to poor task comprehension. Therefore, twenty-four participants (12 male, 12 female; mean age 26 years) successfully completed the experiment. All reported normal or corrected-to-normal vision and none reported impaired hearing. The experiment was conducted in a single session, lasting approximately 2 hours, for each participant and participants were compensated $20 for their time. All participants provided written informed consent and all procedures were approved by the Emory University Institutional Review Board.

Stimuli

The visual stimulus was a white circle (RGB = 240, 240, 240; diameter = 4cm) subtending 4° of visual angle at a viewing distance of approximately 60cm and displayed on a black background. Discrimination of auditory location in the vertical plane is well known to be significantly more difficult than in the horizontal plane (Roffler and Butler, 1968). Inter-aural differences in timing and intensity, which are important cues for sound localization along the horizontal axis, are absent when pure tones are presented at varying vertical locations; vertical localization depends on spectral filtering by the shape of the head and pinnae and hence requires complex sounds (Roffler and Butler, 1968; Oertel and Doupe, 2013). Thus, we used complex sounds (i.e. white noise) rather than simple tones. The two auditory stimuli were created in Audacity v2.0.1 (Audacity Team, 2012) and consisted of 500ms bursts of high-pass (> 8 kHz) and low-pass (< 1 kHz) filtered white noise. These frequency ranges were selected based on prior work that demonstrated vertical sound localization to be more accurate for sounds above 8 kHz or below 1.4 kHz than for intermediate frequencies, for which the perceived locations of sounds differed substantially from the actual locations (Parise et al., 2014). Auditory localization bias occurs because the pinnae of the human ear alter waveforms in a frequency-dependent manner, which is why certain frequencies are more accurately localized than others (Batteau, 1967). Because the pinna also modifies auditory waveforms in an elevation-dependent manner (Parise et al., 2014; Batteau, 1967), sounds emanating from high and low locations might be perceived as higher or lower in pitch, respectively, and higher or lower in intensity, respectively. For our experiment, however, this type of bias is not of major concern because of the very large frequency difference between the two sounds we used.
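For illustration, the sketch below generates comparable 500 ms noise bursts. This is not the authors' stimulus-generation code: the 44.1 kHz sample rate, fourth-order Butterworth filters and peak normalization are assumptions rather than reported parameters.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 44100   # sample rate in Hz (assumed; not reported in the paper)
DUR = 0.5    # burst duration in seconds, as in the study

def noise_burst(kind, fs=FS, dur=DUR, seed=0):
    """Return a filtered white-noise burst: 'high' (> 8 kHz) or 'low' (< 1 kHz)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(fs * dur))
    if kind == 'high':
        sos = butter(4, 8000, btype='highpass', fs=fs, output='sos')
    else:
        sos = butter(4, 1000, btype='lowpass', fs=fs, output='sos')
    burst = sosfilt(sos, noise)
    return burst / np.max(np.abs(burst))  # peak-normalize before amplitude scaling

high_burst = noise_burst('high')  # > 8 kHz white noise, 500 ms
low_burst = noise_burst('low')    # < 1 kHz white noise, 500 ms
```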

Stimulus presentation was controlled by Presentation software (Neurobehavioral Systems Inc., Berkeley CA). The visual stimuli were displayed at the top and bottom of a 61cm LCD computer monitor and the auditory stimuli were played through speakers attached to the top and bottom of the monitor. The monitor itself was oriented with the long axis vertical in order to achieve maximal auditory and visual spatial separation. The centers of the visual stimuli were 48cm apart and the centers of the speakers were 54cm apart; the visual stimuli and the speakers were adjacent without overlapping and their respective centers were 4cm apart. Thus, the distance between the high and low positions was nearly equal in the two modalities. Participants were instructed to keep their head placed on a chin-rest located 60cm from the center of the monitor, and to focus their eyes on a white fixation cross at the center of the monitor during all tasks. Responses were collected through a two-button wireless computer mouse and response times (RTs) were recorded via Presentation software.

Unisensory elevation classification

Before the main experiment started, each participant listened to the high-pitched white noise burst at a range of amplitudes and selected the loudest that was still comfortable. This high-pitched sound was then compared to a range of low-pitched sounds similarly varying in amplitude; each participant selected the low-pitched sound that was perceived as matching the high-pitched sound in loudness.

Participants then heard the two auditory stimuli played from the top and bottom speakers three times each, in order to familiarize them with the auditory elevations. In order to verify that auditory and visual elevations could be discriminated with equal accuracy, participants were asked to report the locations of the high- and low-frequency auditory stimuli and of the visual stimulus in unisensory conditions. The three stimulus types were presented in separate unisensory blocks of a single run with a fixed order (low pitch, high pitch, visual). Each block consisted of 12 trials, randomly interleaving 6 trials at the high and 6 at the low elevation. ‘High’ and ‘low’ responses were mapped onto the left and right mouse buttons respectively; this is similar to the keyboard mapping used by Evans and Treisman (2010: ‘high’ = s and ‘low’ = k, on the left and right of the keyboard respectively). Since this is opposite to the natural low/left, high/right pitch-space mapping (Rusconi et al., 2006), this avoided spatial priming of high/low responses (Evans and Treisman, 2010). Although RTs were recorded, they were not analyzed for the unisensory data since the main purpose of unisensory testing was to ensure that auditory and visual elevations could be classified with equal accuracy.
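The unisensory block structure described above can be restated algorithmically as in the following sketch (an interpretation of the reported procedure, not the authors' code); the block labels and response-map wording are illustrative.

```python
import random

BLOCK_ORDER = ['low-pitch noise', 'high-pitch noise', 'visual circle']   # fixed order
RESPONSE_MAP = {'high': 'left button', 'low': 'right button'}            # as in the study

def make_unisensory_run(seed=None):
    """One run: three blocks, each with 6 high- and 6 low-elevation trials interleaved."""
    rng = random.Random(seed)
    run = []
    for stimulus in BLOCK_ORDER:
        elevations = ['high'] * 6 + ['low'] * 6
        rng.shuffle(elevations)   # randomly interleave high and low trials
        run.append({'stimulus': stimulus, 'trials': elevations})
    return run

for block in make_unisensory_run(seed=1):
    print(block['stimulus'], block['trials'])
```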

Multisensory experiment

The main experiment consisted of three tasks in which participants made a 2AFC speeded classification during audiovisual stimulus presentation. During each task, participants attended one stimulus attribute for classification while ignoring the other two: auditory pitch (high or low), auditory elevation (high or low, i.e. the top or bottom speaker) or visual elevation (high or low, i.e. the top or bottom of the monitor).

On each trial, an auditory and a visual stimulus were presented simultaneously for 500ms, combined according to the type of correspondence and whether or not there was congruency or incongruency for each correspondence (Table 1). Featural correspondence could be within-modal (AP-AE) or cross-modal (AP-VE); either congruent: high pitch coupled with high elevation or low pitch coupled with low elevation; or incongruent: high pitch paired with low elevation or low pitch paired with high elevation. In addition, the combination of within-modal and cross-modal featural correspondences could be characterized according to cross-modal spatial correspondence (AE-VE); either congruent: high elevation for both auditory and visual stimuli or low elevation for both auditory and visual stimuli; or incongruent: high auditory with low visual elevation or low auditory with high visual elevation. The various correspondences and congruencies were combined to give 8 audiovisual stimulus types as shown in Table 1. These 8 stimulus types were grouped into four congruency combinations, labeled A-D in Table 1. Combination A was fully congruent for all three correspondences while the other three combinations each contained one congruent correspondence and two incongruent ones: combination B was congruent only for the within-modal featural correspondence (AP-AE); combination C was congruent only for the cross-modal featural correspondence (AP-VE); and combination D was congruent only for the cross-modal spatial correspondence (AE-VE).

Table 1:

Auditory and visual stimuli were combined to give four different congruency combinations (A-D) of correspondence types: within-modal featural (auditory pitch and elevation: AP-AE), cross-modal featural (auditory pitch and visual elevation: AP-VE), and cross-modal spatial (auditory and visual elevation: AE-VE); c: congruent; i: incongruent.

Congruencies                      Auditory pitch   Auditory elevation   Visual elevation
A) AP-AEc, AP-VEc, AE-VEc   (i)   High             High                 High
                            (ii)  Low              Low                  Low
B) AP-AEc, AP-VEi, AE-VEi   (i)   High             High                 Low
                            (ii)  Low              Low                  High
C) AP-AEi, AP-VEc, AE-VEi   (i)   High             Low                  High
                            (ii)  Low              High                 Low
D) AP-AEi, AP-VEi, AE-VEc   (i)   Low              High                 High
                            (ii)  High             Low                  Low
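As an algorithmic restatement of Table 1 (a sketch, not code from the study), the eight audiovisual stimulus types and their congruency labels can be derived from the three binary attributes as follows:

```python
from itertools import product

stimuli = []
for pitch, a_elev, v_elev in product(['High', 'Low'], repeat=3):
    stimuli.append({
        'auditory_pitch': pitch,
        'auditory_elevation': a_elev,
        'visual_elevation': v_elev,
        'AP-AE': 'c' if pitch == a_elev else 'i',   # within-modal featural
        'AP-VE': 'c' if pitch == v_elev else 'i',   # cross-modal featural
        'AE-VE': 'c' if a_elev == v_elev else 'i',  # cross-modal spatial
    })

for s in stimuli:
    print(s)
# Combination A = all 'c'; B = only AP-AE 'c'; C = only AP-VE 'c'; D = only AE-VE 'c'.
```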

Each classification task involved six runs, each run having 8 trials of each stimulus type, for a total of 64 interleaved trials in each run and, across all six runs, a total of 48 trials of each stimulus type. Each trial lasted 500ms, with an inter-trial interval of 3s. Each run began with a 3s blank interval, so that the total run length was 227s. The order of the classification tasks was fully counterbalanced across subjects and genders; the order of runs for each task was individually randomized for each participant. As in the unisensory conditions (see above), ‘high’ and ‘low’ responses were mapped onto the left and right mouse buttons respectively. Participants were instructed to respond as quickly and as accurately as possible.
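As an arithmetic check on the stated timing: 64 trials × (0.5 s stimulus + 3.0 s inter-trial interval) = 224 s, which together with the initial 3 s blank interval gives the reported 227 s run length; across the three tasks, each participant therefore completed 18 runs of 64 trials.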

Data analyses

All analyses of RTs were based on correct responses only (93% of all responses). Within each run of 64 trials, RTs greater than three standard deviations away from the mean were removed (< 1% of all data). We also set a threshold of 70% accuracy for inclusion of data for each classification task: on this basis, we excluded the auditory elevation data for one participant (but retained the visual elevation and auditory pitch data of this person). Statistical testing used repeated-measures ANOVAs (RM-ANOVAs) and/or paired, two-tailed t tests, with the significance level adjusted for multiple comparisons using Bonferroni corrections.
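The following sketch illustrates the trial-exclusion rules just described, assuming single-trial data in a pandas DataFrame with illustrative column names ('subject', 'task', 'run', 'correct', 'rt'); it is an interpretation of the reported procedure, not the authors' analysis code.

```python
import pandas as pd

def clean_rts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep correct trials, then drop RTs more than 3 SD from the mean within each run."""
    correct = df[df['correct']].copy()

    def trim(run: pd.DataFrame) -> pd.DataFrame:
        m, s = run['rt'].mean(), run['rt'].std()
        return run[(run['rt'] - m).abs() <= 3 * s]

    return correct.groupby(['subject', 'task', 'run'], group_keys=False).apply(trim)

def tasks_meeting_accuracy(df: pd.DataFrame, threshold=0.70) -> pd.Series:
    """Per-subject, per-task accuracy; a task's data are retained only if accuracy >= 70%."""
    return df.groupby(['subject', 'task'])['correct'].mean() >= threshold
```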

Results

Unisensory elevation classification

For the initial unisensory elevation classification test, accuracy (mean ± sem) was 96.9 ± 1.1% for the low-pitched auditory stimuli; 96.9 ± 1.3% for the high-pitched auditory stimuli and 98.6 ± .8% for the visual stimulus. There were no significant accuracy differences between the three classification tests (all t23 < −1.5, all p > .1). We therefore concluded that auditory elevation could be classified equally well for the high- and low-pitched stimuli, and with no more difficulty than classification of visual elevation.

Accuracy of classification in the multisensory conditions

In the multisensory tasks, overall classification accuracy was significantly different (Bonferroni-corrected α = .017 for 3 paired comparisons) between discrimination of visual (97.5 ± .7%) and auditory (89.5 ± 1.5%) elevation (t22 = −5.5, p < .001), visual elevation and auditory pitch (93.2 ± 1.4%: t23 = −3.4, p = .003), and auditory elevation and pitch (t22 = −2.6, p = .016). Although the differences between tasks were significant, accuracy was acceptably high in all conditions. A detailed analysis of accuracy rates (see Supplementary Material) showed a close coupling between higher accuracy rates and faster RTs, making a speed/accuracy trade-off unlikely.

RTs: congruency effects

We focus here on the results most critical to our a priori question of interactions between the three correspondence types. For the sake of completeness, we also performed global RM-ANOVAs, as detailed in Supplementary Material. Briefly, these RM-ANOVAs showed that when the attended features were auditory elevation or pitch, congruent and incongruent trial RTs were significantly different for some correspondence types and not others, i.e., there was a significant interaction between correspondence type and trial type for these two attended features. When attending to visual elevation, RTs were faster than when attending to auditory stimuli and there was a main effect of trial type in which congruent trial RTs were slightly but significantly faster than incongruent trial RTs (503ms vs 506ms; F1,23 = 4.6, p = .04) when collapsed across all correspondence types. However, during classification of visual elevation, there was neither a main effect of correspondence type nor an interaction between correspondence and trial type (see Supplementary Material). Thus, we do not consider the results of the visual discrimination task further.

Figures 1 and 2 show RTs for each of the stimulus combinations A-D of Table 1 for the auditory pitch and auditory elevation classification tasks. To tease out the significant interactions in the auditory conditions, we first conducted pairwise comparisons (Bonferroni corrected α = .008 for 6 paired comparisons) of the RTs for congruent and incongruent trials for each of the six possible combinations of correspondences, separately for each attended auditory feature. As detailed in Table 2, most of these pairwise tests were significant, indicating that the corresponding congruency effects were significantly non-zero. Testing whether one kind of correspondence influences another, however, requires comparing the magnitudes of congruency effects across different conditions. We therefore calculated the magnitudes of the congruency effects:

Congruency magnitude = [(incongruent RT − congruent RT) / (incongruent RT + congruent RT)] × 100    (Equation 1)
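Expressed as code (a sketch, not the authors' script), Equation 1 is:

```python
def congruency_magnitude(incongruent_rt: float, congruent_rt: float) -> float:
    """Equation 1: normalized RT difference, expressed as a percentage."""
    return (incongruent_rt - congruent_rt) / (incongruent_rt + congruent_rt) * 100

# Example (illustrative values): incongruent 520 ms vs congruent 500 ms -> about 1.96
print(congruency_magnitude(520, 500))
```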

Figure 1: Mean response times for each stimulus combination (A-D of Table 1) during classification of auditory pitch (error bars = sem).

Figure 2: Mean response times for each stimulus combination (A-D of Table 1) during classification of auditory elevation (error bars = sem).

Table 2:

Pairwise comparisons of the stimulus combinations shown in Table 1 and Figures 1 & 2.

Pairwise comparison      Auditory pitch         Auditory elevation
                         t23       p            t22       p
A vs. B                  −4.4      <.001*       −5.7      <.001*
A vs. C                  −3.7      .001*        −6.6      <.001*
A vs. D                  −5.1      <.001*       −3.0      .007*
B vs. C                  −1.4      .2           −2.9      .009†
B vs. D                  −4.3      <.001*       4.5       <.001*
C vs. D                  −2.6      .017         5.8       <.001*

* significant at Bonferroni-corrected α = .008; † marginal

We conducted an RM-ANOVA on the congruency magnitudes, as detailed in Supplementary Material. As pointed out above, it is the direct comparison of congruency magnitudes between conditions that is most helpful to address the mutual influences of correspondence types on one another. In order to discern the presence of modulatory effects of one correspondence upon another (target) correspondence, we compared congruency magnitudes for the target correspondence when the potentially modulatory correspondence had congruencies that were either aligned or opposed to the target correspondence (Table 3). As an illustrative example, in order to determine the modulatory effect of the within-modal featural (AP-AE) correspondence on the cross-modal featural (AP-VE) correspondence (the first two data rows of Table 3), we compared the AP-VE congruency magnitude when the AP-VE and AP-AE congruencies were aligned (i.e., [D − A]/[D + A] × 100, where A and D represent mean RTs in conditions A and D, respectively) to when they were opposed (i.e., [B − C]/[B + C] × 100); this is the comparison of 1a vs. 1b in Table 3. Note that, in this example, the trials in A and D were always congruent with respect to the cross-modal spatial (AE-VE) correspondence, while, in B and C, the trials were always incongruent with respect to the cross-modal spatial (AE-VE) correspondence. Thus, the cross-modal spatial (AE-VE) correspondence was held constant across the congruent/incongruent trials being considered to compute each congruency magnitude for the target cross-modal featural (AP-VE) correspondence, whereas the cross-modal (AP-VE) and within-modal (AP-AE) featural congruencies opposed each other. (Note that, when any two congruencies are aligned, their congruency magnitudes are identical, being computed from the same set of congruent and incongruent trials, e.g., 1a and 2a in Table 3. On the other hand, when any two congruencies are opposed, the absolute values of the respective congruency magnitudes are the same but their signs are reversed, e.g., 1b and 2b in Table 3.)
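To make the aligned-versus-opposed contrast concrete, the sketch below computes the two AP-VE congruency magnitudes of the example (1a and 1b in Table 3) from per-participant condition means and compares them with a paired t test. The arrays hold purely illustrative values, and the use of scipy here is an assumption, not the authors' analysis pipeline.

```python
import numpy as np
from scipy import stats

def cong_mag(incongruent_rt, congruent_rt):
    """Equation 1, applied element-wise to per-participant mean RTs (ms)."""
    return (incongruent_rt - congruent_rt) / (incongruent_rt + congruent_rt) * 100

# Hypothetical per-participant mean RTs for conditions A-D of Table 1 (one value per participant)
A = np.array([480.0, 495.0, 510.0])
B = np.array([505.0, 520.0, 535.0])
C = np.array([500.0, 515.0, 525.0])
D = np.array([510.0, 530.0, 545.0])

aligned = cong_mag(D, A)   # 1a: AP-AE congruency aligned with AP-VE (AE-VE congruent throughout)
opposed = cong_mag(B, C)   # 1b: AP-AE congruency opposed to AP-VE (AE-VE incongruent throughout)

t, p = stats.ttest_rel(aligned, opposed)   # paired comparison (Bonferroni-corrected in the study)
print(t, p)
```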

Table 3:

Congruency magnitudes (Equation 1) while attending to auditory pitch or auditory elevation for each correspondence type when one of the other (potentially modulatory) correspondence types had either aligned or opposed congruencies. Congruency magnitudes derived from significant congruency effects in Table 2 are marked with an asterisk (*); the single effect that was marginal is marked with a dagger (†); c: congruent; i: incongruent. The second column shows the source conditions of Table 1 from which the corresponding congruency magnitudes are derived according to Equation 1.

Target and modulatory correspondences         Source conditions   Auditory pitch   Auditory elevation
1 Cross-modal featural (AP-VE)
  a) AE-VEc & AP-AE aligned with AP-VE        D − A               5.36 ± .87*      1.81 ± .58*
  b) AE-VEi & AP-AE opposite to AP-VE         B − C               −.53 ± .53       −1.86 ± .66†
  c) AP-AEc & AE-VE aligned with AP-VE        B − A               2.97 ± .57*      8.68 ± 1.23*
  d) AP-AEi & AE-VE opposite to AP-VE         D − C               1.87 ± .65       −8.73 ± 1.19*
2 Within-modal featural (AP-AE)
  a) AE-VEc & AP-VE aligned with AP-AE        D − A               5.36 ± .87*      1.81 ± .58*
  b) AE-VEi & AP-VE opposite to AP-AE         C − B               .53 ± .53        1.86 ± .66†
  c) AP-VEc & AE-VE aligned with AP-AE        C − A               3.50 ± .79*      10.52 ± 1.30*
  d) AP-VEi & AE-VE opposite to AP-AE         D − B               2.40 ± .48*      −6.89 ± 1.23*
3 Cross-modal spatial (AE-VE)
  a) AP-AEc & AP-VE aligned with AE-VE        B − A               2.97 ± .57*      8.68 ± 1.23*
  b) AP-AEi & AP-VE opposite to AE-VE         C − D               −1.87 ± .65      8.73 ± 1.19*
  c) AP-VEc & AP-AE aligned with AE-VE        C − A               3.50 ± .79*      10.52 ± 1.30*
  d) AP-VEi & AP-AE opposite to AE-VE         B − D               −2.40 ± .48*     6.89 ± 1.23*

Table 3 shows the congruency magnitudes for each target correspondence under the aligned and opposed conditions for each of the other two (potentially modulatory) correspondences. We now turn to comparing the congruency magnitudes between the various experimental conditions in the two auditory tasks (Bonferroni corrected α = .008 for 6 paired comparisons for each task). The effect sizes for the comparisons yielding significant differences are reported below in terms of Cohen’s d.

Attention to auditory pitch

The cross-modal featural (AP-VE) correspondence was significantly modulated by the within-modal featural (AP-AE) correspondence (1a vs. 1b in Table 3), such that the AP-VE congruency magnitude was significantly larger when the AP-AE congruency was aligned with it, compared to when it was in the opposite direction (t23 = 5.5, p < .001, d = 1.7). In fact, when the two congruencies were in opposite directions, the AP-VE congruency magnitude was not significantly different from zero (i.e., the corresponding pairwise comparison in Table 2 did not yield a significant difference), whereas when the two congruencies were aligned, the resulting AP-VE congruency magnitude was the largest observed during attention to auditory pitch. Another way of stating this is that the cross-modal influence of visual elevation on auditory pitch classification was modulated by the within-modal influence of auditory elevation. In contrast, the cross-modal featural (AP-VE) correspondence was not significantly modulated by the cross-modal spatial (AE-VE) correspondence (1c vs. 1d in Table 3: t23 = 1.4, p = .16). In this case, it should be noted that the AP-VE congruency magnitude, though not actually significant when the AE-VE congruency was opposed, did not differ significantly from the relatively small AP-VE congruency magnitude when the AE-VE congruency was aligned.

The within-modal featural (AP-AE) correspondence was significantly modulated by the cross-modal featural (AP-VE) correspondence (2a vs. 2b in Table 3), being significantly larger when the two congruencies were aligned, compared to when they were opposed (t23 = 5.0, p < .001, d = 1.4). Again, the AP-AE congruency magnitude could not be distinguished from zero when the two congruencies were opposed. Put another way, the within-modal effect of auditory elevation on pitch discrimination was modulated by the cross-modal effect of visual elevation. In contrast, the within-modal featural (AP-AE) correspondence was not significantly modulated by the cross-modal spatial (AE-VE) correspondence (2c vs. 2d in Table 3). There were moderate, significant AP-AE congruency effects whether the AE-VE congruency was aligned or opposed, but these were not significantly different from each other (t23 = 1.4, p = .16).

The cross-modal spatial (AE-VE) correspondence was significantly modulated by both the cross-modal (AP-VE) and within-modal (AP-AE) featural correspondences (in Table 3, respectively, 3a vs. 3b: t23 = 4.9, p < .001, d = 1.65; and 3c vs. 3d: t23 = 5.5, p < .001, d = 1.9), implying that the cross-modal correspondence between visual and auditory elevation was affected by the relationships of these spatial variables to auditory pitch (which was the attended stimulus attribute). When the AE-VE congruency was aligned with either of the other two congruencies, it produced significant, moderate, positive congruency effects; when it was opposed to either of them, the AE-VE congruency effects became negative. Of these negative AE-VE congruency effects, that when the AP-AE congruency was opposed was significant whereas that when the AP-VE was opposed was not significant.

Attention to auditory elevation

The cross-modal featural (AP-VE) correspondence was significantly modulated by both the within-modal featural (AP-AE, 1a vs. 1b in Table 3: t23 = 3.6, p = .002, d = 1.2) and the cross-modal spatial (AE-VE, 1c vs. 1d in Table 3: t22 = 7.5, p < .001, d = 3) correspondences, such that the AP-VE congruency magnitude was significantly larger when either of the other two congruencies was aligned as compared to opposed. The significant, positive AP-VE congruency magnitude when each of the other two congruencies was aligned reversed sign but retained similar magnitude when either of these two congruencies was in opposition. The AP-VE congruency magnitudes were small when considering the modulatory influence of the AP-AE congruency (1a vs. 1b in Table 3), but large when the alignment of the AE-VE congruency was being evaluated (1c vs. 1d in Table 3). Thus, the cross-modal featural relationship between the unattended stimulus attributes, auditory pitch and visual elevation, was strongly influenced by the cross-modal spatial relationship of visual elevation, and to a lesser extent by the within-modal featural relationship of auditory pitch, to the attended stimulus attribute of auditory elevation.

The within-modal featural (AP-AE) correspondence was not significantly modulated by the cross-modal featural (AP-VE) correspondence (2a vs. 2b in Table 3: t23 = −.06, p = .9); the AP-AE congruency magnitude was small, being significant when aligned with the AP-VE congruency and marginally significant when opposed to it. Thus, the cross-modal featural relationship between the unattended stimulus attributes of auditory pitch and visual elevation did not modify the small effect of the within-modal featural correspondence between auditory elevation and pitch, during discrimination of auditory elevation. However, the within-modal featural (AP-AE) correspondence was significantly influenced by the cross-modal spatial (AE-VE) correspondence (2c vs. 2d in Table 3: t23 = 7.5, p < .001, d = 2.9). Here, a large, significant, positive AP-AE congruency magnitude when the AP-AE and AE-VE congruencies were aligned (the largest congruency effect observed in the present study) became a substantial, significant, negative congruency magnitude when the two congruencies were in opposite directions. This means that the cross-modal spatial (AE-VE) relationship strongly modulated the large effect of the within-modal featural (AP-AE) correspondence when auditory elevation was being discriminated.

The cross-modal spatial (AE-VE) correspondence was unaffected by the cross-modal featural (AP-VE) correspondence (3a vs. 3b in Table 3: t23 = −.06, p = .9). The large, significantly positive AE-VE congruency effect was essentially unaltered by whether the AP-VE congruency was in alignment or opposition with it. Thus, when discrimination of auditory elevation was the task of interest, the cross-modal featural relationship between the unattended stimulus attributes of auditory pitch and visual elevation did not modify the large effect of the cross-modal spatial correspondence between visual and auditory elevation. However, the cross-modal spatial (AE-VE) correspondence was significantly affected by the within-modal featural (AP-AE) correspondence (3c vs. 3d in Table 3: t23 = 3.6, p = .002, d = .6) such that the AE-VE congruency magnitude was larger when the AE-VE and AP-AE congruencies were aligned (again the largest magnitude observed in the present study), relative to when they were opposed. However, the AE-VE congruency effect was significant, positive and of relatively large magnitude in both instances. Thus, the within-modal featural effect of auditory pitch only weakly modulated the spatial congruency effect of the auditory and visual stimuli, during attention to auditory elevation, as demonstrated by the relatively modest effect size (d = .6) compared to all other significant modulatory influences reported above (d ranging from 1.2 to 3).

Discussion

In this study, we examined modulatory interactions between cross-modal featural (auditory pitch and visual elevation: AP-VE), within-modal featural (auditory pitch and auditory elevation: AP-AE), and cross-modal spatial (auditory elevation and visual elevation: AE-VE) correspondences, while participants experienced simultaneous audiovisual stimuli during three separate tasks involving 2AFC speeded classification (high vs. low) of visual elevation, auditory elevation and auditory pitch. During classification of visual elevation, congruency effects were minimal and there were no modulatory interactions between correspondence types. However, the within-modal featural (AP-AE) correspondence influenced the cross-modal featural (AP-VE) correspondence while attending to either auditory elevation or pitch, while the reverse modulation occurred only during classification of auditory pitch. The cross-modal spatial (AE-VE) correspondence affected the other two correspondences only during classification of auditory elevation, was influenced by the cross-modal featural (AP-VE) correspondence only while attending to auditory pitch, and was modulated by the within-modal featural correspondence (AP-AE) while classifying either auditory elevation or pitch. We found similar main effects and interactions for both accuracy and RTs (see Supplementary Material), which suggests that our results are likely not due to a trade-off between speed and accuracy.

Our experiment comprised a full factorial design to explore mutual interactions between the three correspondence types. We chose to focus on performance differences between trials that varied in the three-way congruency between the audiovisual stimuli, and found a set of systematic interactions between congruency types. Since our experimental design followed an interleaved format in which the two unattended stimulus attributes randomly varied, it may be considered equivalent to the “filtering” blocks of a Garner interference paradigm (Ben-Artzi and Marks, 1995; Algom and Fitousi, 2016). An alternative design might have been a Garner paradigm including blocks where the irrelevant variables are held constant and blocks where variation on the irrelevant dimension correlates with that on the relevant dimension. Although such a design would be substantially more complicated than a typical Garner study, given three stimulus attributes of which two are irrelevant at any given time, there is precedent for such a three-dimension study (Ben-Artzi and Marks, 1999) and thus it may be worthwhile in the future to conduct such a study (essentially a block design replication of our study) to confirm the present findings.

When attending to visual elevation, accuracy was higher and responses were faster than when attending to either auditory elevation or pitch. Further, there were only tiny congruency effects attributable to each of the three correspondences during visual elevation classification, without significant modulatory interactions between them. This indicates that, while classifying visual elevation in the presence of a concomitant auditory stimulus, the nature of the auditory stimulus and its relationship to the visual stimulus is of little importance, at least under the experimental conditions of the present study. This stands in contrast to effects of the visual stimulus and its relationship to aspects of the auditory stimulus, as discussed below. Importantly, this asymmetry is despite ensuring that the accuracy of discriminating visual and auditory elevation did not differ when tested under unisensory conditions. Similar asymmetry, reflecting the dominance of visual over auditory inputs in location discrimination, has been reported previously (e.g., Ward, 1994; Bertelson and Aschersleben, 1998) and forms the basis of the well-known ventriloquism effect. Such visual dominance has also been specifically reported in the context of the cross-modal featural (AP-VE) correspondence: responses to visual stimuli were generally faster (as in the present study), and the visual stimulus exerted stronger Garner interference and congruency effects on the auditory stimulus than vice versa (Ben-Artzi and Marks, 1995). Thus, the rest of the Discussion concentrates on consideration of modulatory interactions between the three correspondences during attention to auditory elevation or pitch.

Does the within-modal featural (AP-AE) correspondence modulate the cross-modal featural (AP-VE) correspondence?

The primary motivation for our study was to test whether the cross-modal correspondence between auditory pitch and visual elevation could be influenced by auditory elevation. This question is important because in the natural environment, there is a frequency-elevation mapping such that high-pitched sounds tend to emanate from high locations and low-pitched sounds from low locations (Pratt, 1930; Roffler and Butler, 1968; Parise et al., 2014). If the cross-modal featural (AP-VE) correspondence is affected by auditory elevation, this would fit with the notion that the origin of this cross-modal correspondence lies in learnt statistical regularities of the natural environment. Although high- and low-pitched sounds need not in theory emanate from high and low elevations, respectively, in order to be associated with visually high and low elevations, our results show that indeed, the within-modal AP-AE relationship does exert a significant effect on the cross-modal AP-VE correspondence, but only when attention is directed to the auditory stimulus and not during visual discrimination of elevation. This interaction was found when attending to either auditory pitch or auditory elevation, being somewhat stronger for pitch than elevation.

The AP-VE congruency effect was only manifest in our study when the AP-VE and AP-AE congruencies were aligned, being larger during discrimination of auditory pitch than when auditory elevation was being discriminated. Thus, both the AP-VE congruency effect and the extent of its modulation by the AP-AE correspondence were larger during auditory attention to pitch compared to elevation. When the AP-AE congruency opposed the AP-VE congruency, the classic AP-VE congruency effect was no longer present. This, of course, does not negate the independent existence of the classic AP-VE correspondence when the position of the auditory stimulus is neutral (e.g., Ben-Artzi and Marks, 1995; Evans and Treisman, 2010; Chiou and Rich, 2012; Lacey et al., 2016). Our findings imply that auditory elevation could influence the cross-modal featural correspondence between auditory pitch and visual elevation, e.g., when examining how the perceptual upright might relate to the high versus low dichotomy (Carnevale and Harris, 2016).

Does the cross-modal featural (AP-VE) correspondence modulate the within-modal featural (AP-AE) correspondence?

The answer to this question is not as straightforward as to the first question. We found a modulatory effect of the cross-modal featural (AP-VE) correspondence on the within-modal featural (AP-AE) correspondence, but only when attention was directed to auditory pitch. This sizable modulation was of a similar order as the reverse modulation of the AP-VE correspondence by the AP-AE correspondence considered in the preceding section, and was absent when attending to either auditory or visual elevation. During the auditory pitch classification task, the AP-AE correspondence was only observed when the cross-modal AP-VE congruency was in alignment and not when it was in opposition. Again, this does not rule out an independent AP-AE correspondence, since our observations were made under the condition of concurrent visual and auditory stimuli, whereas the within-modal AP-AE correspondence was originally reported under unisensory auditory conditions (Pratt, 1930; Roffler and Butler, 1968; Parise et al., 2014). In fact, a small AP-AE congruency effect was found for the auditory elevation discrimination task, regardless of the direction of the AP-VE congruency, underscoring the independent existence of the AP-AE correspondence. During attention to auditory elevation, the size of the AP-AE congruency effect as well as the extent of its modulation by the AP-VE correspondence were each smaller than when auditory pitch was being attended – this is the converse of the findings for the AP-VE congruency effect noted above.

How does the cross-modal spatial (AE-VE) correspondence interact with the cross-modal (AP-VE) and within-modal (AP-AE) featural correspondences?

This question is more complex than the other two as it refers to a number of potential interactions. Again, the only modulatory interactions found were when attending to the auditory stimulus attributes. Not surprisingly, the sizes of the modulatory effects involving the AE-VE correspondence were larger during attention to auditory elevation than during attention to auditory pitch.

The within-modal featural (AP-AE) correspondence affected the cross-modal spatial (AE-VE) correspondence during attention to auditory pitch as well as to auditory elevation. For auditory pitch discrimination, a moderate AE-VE congruency effect was eliminated when AP-AE congruency alignment switched to opposition. However, during discrimination of auditory elevation, the largest AE-VE congruency magnitude observed in the present study was only attenuated by the switch from AP-AE congruency alignment to opposition.

The other modulatory interactions were limited to one or the other auditory classification task. The cross-modal spatial (AE-VE) correspondence modulated both the cross-modal (AP-VE) and within-modal (AP-AE) featural correspondences, only when auditory elevation was being discriminated: For both the target correspondences, modulation by the AE-VE correspondence involved a change from a large congruency magnitude when the AE-VE congruency was aligned to a large but reversed effect when the AE-VE congruency acted in opposition. In contrast, during attention to auditory pitch, small AP-VE and AP-AE featural congruency magnitudes were essentially unaffected by manipulating AE-VE congruency. The cross-modal featural (AP-VE) correspondence influenced the cross-modal spatial (AE-VE) correspondence during attention to auditory pitch; this reflected abolition of a moderate AE-VE congruency effect when AP-VE congruency alignment switched to opposition (the same modulatory effect that the AP-AE correspondence had – see preceding paragraph). During auditory elevation discrimination, a large AE-VE congruency effect was unaffected by manipulation of the AP-VE congruency.

Once again, the disappearance of the cross-modal spatial (AE-VE) congruency effects noted in the preceding paragraphs, when the AP-AE or AP-VE congruencies opposed the AE-VE congruency, does not falsify the existence of the well-known cross-modal spatial correspondence that occurs in the absence of variations of auditory pitch (Stein, 1998; Wallace et al., 2004; Spence, 2013). Yet, it is interesting that this robust cross-modal spatial correspondence is susceptible to modulation by the alignment status of AP-AE congruency during attention to either the pitch or elevation of an auditory stimulus, and by the alignment status of AP-VE congruency during attention to auditory pitch. Such modulatory effects reinforce the idea that spatial congruency effects are task-dependent (Spence, 2013).

Mechanisms underlying correspondences

Our findings are consistent with the idea that the statistics of the audiovisual environment underlie the development of the three correspondences studied here between auditory pitch and stimulus elevation in both the visual and auditory modalities. Because, at least in Western culture, the words ‘high’ and ‘low’ describe both elevation and pitch, we cannot completely discount the possibility that the cross-modal featural (AP-VE) correspondence is semantically mediated (Spence, 2011). For instance, in auditory experiments, Stroop interference was reported between the words “low” and “high” and tones of low and high pitch (Shor, 1975), and these words, in relation to pitch and elevation, generated Garner interference and congruence effects for all combinations of dimensions (Ben-Artzi and Marks, 1999). While such results have been used to argue for the involvement of semantic processes (Ben-Artzi and Marks, 1999), an alternative interpretation is offered by theories of grounded cognition, i.e. perceptual simulations evoked by the words (Barsalou, 2008). Arguing against the semantic mediation hypothesis, several studies show that pre-linguistic infants exhibit the cross-modal featural (AP-VE) correspondence (e.g., Walker et al., 2010; Dolscheid et al., 2014). Furthermore, at least some non-Western cultures show the AP-VE correspondence but do not use spatial language to describe auditory pitch (e.g., Parkinson et al., 2012). In each of these cases, it would seem that the AP-VE correspondence arises from natural auditory statistics rather than language. Regardless, our observation that the strength of the AP-VE correspondence varies with the elevation of the sound source indicates that the frequency-elevation correlation between sounds in the natural environment is at least an important contributor to this relationship.

Conclusion

We conclude that a variety of modulatory interactions characterize the relationships among the cross-modal featural (AP-VE) correspondence, the within-modal featural (AP-AE) correspondence, and the cross-modal spatial (AE-VE) correspondence. These interactions during audiovisual perception depend on the attended stimulus attribute, and are absent when discriminating visual elevation, even though this is no more accurate than auditory elevation discrimination when both are compared on unisensory tests. It is interesting to note that the cross-modal featural (AP-VE) correspondence was present even when attending to a stimulus attribute – auditory elevation – that was not itself part of this correspondence. Similarly, the cross-modal spatial (AE-VE) correspondence could be observed even though attention was directed to the non-spatial feature of auditory pitch.

Supplementary Material


Acknowledgments

This work was supported by grants to KS and LN from the National Eye Institute at the NIH and the Emory University Research Council. Support to KS from the Veterans Administration is also acknowledged.

References

  1. Algom D and Fitousi D (2016). Half a century of research on Garner interference and the separability-integrality distinction. Psychol. Bull. 142, 1352–1383.
  2. Audacity Team (2012). Audacity v2.0.1 [Computer program]. Retrieved from http://audacity.sourceforge.net/
  3. Barsalou LW (2008). Grounded cognition. Annu. Rev. Psychol. 59, 617–645.
  4. Batteau DW (1967). The role of the pinna in human localization. Proc. R. Soc. Lond. B Biol. Sci. 168, 158–180.
  5. Ben-Artzi E and Marks LE (1995). Visual-auditory interaction in speeded classification: Role of stimulus difference. Percept. Psychophys. 57, 1151–1162.
  6. Ben-Artzi E and Marks LE (1999). Processing linguistic and perceptual dimensions of speech: interactions in speeded classification. J. Exp. Psychol. Hum. Percept. Perform. 25, 579–595.
  7. Bernstein IH and Edelstein BA (1971). Effects of some variations in auditory input upon visual choice reaction time. J. Exp. Psychol. 87, 241–247.
  8. Bertelson P and Aschersleben G (1998). Automatic bias of perceived auditory location. Psychon. Bull. Rev. 5, 482–489.
  9. Carnevale MJ and Harris LR (2016). Which direction is up for a high pitch? Multisens. Res. 29, 113–132.
  10. Chiou R and Rich AN (2012). Cross-modality correspondence between pitch and spatial location modulates attentional orienting. Perception 41, 339–353.
  11. Dolscheid S, Hunnius S, Casasanto D and Majid A (2014). Prelinguistic infants are sensitive to space-pitch associations found across cultures. Psychol. Sci. 25, 1256–1261.
  12. Evans KK and Treisman A (2010). Natural cross-modal mappings between visual and auditory features. J. Vis. 7, 1–14.
  13. Fitch WT (2000). The evolution of speech: a comparative review. Trends Cogn. Sci. 4, 258–267.
  14. Gallace A and Spence C (2006). Multisensory synesthetic interactions in the speeded classification of visual size. Percept. Psychophys. 68, 1191–1203.
  15. Garner WR (1974). The processing of information and structure. Lawrence Erlbaum Associates: Potomac, MD.
  16. Köhler W (1929). Gestalt Psychology. Liveright: New York, NY.
  17. Köhler W (1947). Gestalt Psychology: An Introduction to New Concepts in Modern Psychology. Liveright: New York, NY.
  18. Lacey S, Martinez MO, McCormick K and Sathian K (2016). Synesthesia strengthens sound-symbolic cross-modal correspondences. Eur. J. Neurosci. 44, 2716–2721.
  19. Oertel D and Doupe AJ (2013). The auditory central nervous system. In: Principles of Neural Science, 5th edition (Eds Kandel ER, Schwartz JH, Jessell TM, Siegelbaum SA and Hudspeth AJ), pp. 682–711. McGraw Hill Medical: New York, NY.
  20. Parise CV, Knorre K and Ernst MO (2014). Natural auditory scene statistics shapes human spatial hearing. Proc. Natl. Acad. Sci. USA 111, 6104–6108.
  21. Parkinson C, Kohler PJ, Sievers B and Wheatley T (2012). Associations between auditory pitch and visual elevation do not depend on language: evidence from a remote population. Perception 41, 854–861.
  22. Pratt CC (1930). The spatial character of high and low tones. J. Exp. Psychol. 13, 278–285.
  23. Roffler SK and Butler RA (1968). Factors that influence the localization of sound in the vertical plane. J. Acoust. Soc. Am. 43, 1255–1259.
  24. Rusconi E, Kwan B, Giordano BL, Umilta C and Butterworth B (2006). Spatial representation of pitch height: The SMARC effect. Cognition 99, 113–129.
  25. Shor RE (1975). An auditory analog of the Stroop test. J. Gen. Psychol. 93, 281–288.
  26. Spence C (2011). Crossmodal correspondences: A tutorial review. Atten. Percept. Psychophys. 73, 971–995.
  27. Spence C (2013). Just how important is spatial coincidence to multisensory integration? Evaluating the spatial rule. Ann. N.Y. Acad. Sci. 1296, 31–49.
  28. Stein BE (1998). Neural mechanisms for synthesizing sensory information and producing adaptive behaviors. Exp. Brain Res. 123, 124–135.
  29. Wallace MT, Roberson GE, Hairston WD, Stein BE, Vaughan JW and Schirillo JA (2004). Unifying multisensory signals across time and space. Exp. Brain Res. 158, 252–258.
  30. Walker P, Bremner JG, Mason U, Spring J, Mattock K et al. (2010). Preverbal infants’ sensitivity to synaesthetic cross-modality correspondences. Psychol. Sci. 21, 21–25.
  31. Walker L, Walker P and Francis B (2012). A common scheme for cross-sensory correspondences across stimulus domains. Perception 41, 1186–1192.
  32. Ward LM (1994). Supramodal and modality-specific mechanisms for stimulus-driven shifts of auditory and visual attention. Can. J. Exp. Psychol. 48, 242–259.
