Effects of audiovisual temporal synchronization on visual experience of the non-dominant eye

Hikari Takebayashi; Yuji Wada

doi:10.1007/s10339-025-01296-3

. 2025 Aug 22;27(1):121–132. doi: 10.1007/s10339-025-01296-3

Effects of audiovisual temporal synchronization on visual experience of the non-dominant eye

Hikari Takebayashi ^1,^✉, Yuji Wada ²

PMCID: PMC12860879 PMID: 40844798

Abstract

Audiovisual integration occurs automatically and affects visual processing. This study aims to investigate whether temporally synchronized auditory signals enhance monocular signals during binocular observation. In Experiment 1, 16 participants performed a visual target localization task. A mirror stereoscope was used to present a rapid serial visual presentation (RSVP) stream of distractors to both eyes, with a visual target inserted in either both eyes, the dominant eye, or the non-dominant eye. Continuous low tones synchronized with distractors were paired with the target as either the same low tone (non-salience) or a high tone (salience). Detection facilitation rates by tone type were analyzed through multiple comparisons. Results showed a significant detection enhancement only when the target appeared in the non-dominant eye. In Experiment 2, involving 16 participants, a similar RSVP was presented, but with an orientation discrimination task for parafoveally presented texture stimuli comprising 17 vertical Gabor patches. The angle and proportion of tilted patches were manipulated simultaneously, and logistic regression was used to estimate orientation discrimination thresholds. Contrary to predictions, salient tones did not reduce the thresholds. These findings suggest that temporally synchronized auditory signals can selectively enhance the monocular processing of weaker visual signals (i.e., non-dominant eye signals) before binocular fusion, particularly for spatial localization. However, these effects did not extend to the identification of visual content (i.e., orientation) or stable visual signals (i.e., dominant or binocular). The results highlight the role of audiovisual integration in supporting unstable monocular signals and suggest potential applications in low vision training.

Keywords: Audiovisual interaction, Auditory cue, Dichoptic stimulation, Eye dominance

Introduction

Integrating information across multiple sensory modalities can enrich perceptual experiences. Temporal and spatial synchrony of audiovisual information helps navigate information sources, thereby increasing the reliability of source estimation and facilitating a coherent understanding of the visual environment. Previous research revealed that the advantages of audiovisual interaction are maximized when both temporal synchrony and spatial consistency are present (Meredith and Stein 1986; Meredith et al. 1987). For instance, the timing and location of auditory stimuli can enhance visual stimulus detectability (Spence and Driver 1997), shorten response times to visual stimuli at specific locations (Mclntire et al. 2010; Perrott et al. 1991; Simon and Craft 1970), and determine the visual direction of motion (Alink et al. 2012; reviewed in Chaplin et al. 2018; Hidaka et al. 2009; Hidaka et al. 2011; Maeda et al. 2004; McCourt and Leone 2016). Moreover, in patients with hemianopia due to stroke or cortical injury, audiovisual training lasting more than 10 weeks (two hours per session) can restore flash detection in the blind visual field (Rowland et al. 2023), possibly through plasticity from subcortical (e.g., superior colliculus) to cortical regions (e.g., auditory and visual cortices) (Meredith and Stein 1983; Wallace et al. 2004).

Nevertheless, visual experiences can be enhanced through temporal synchrony even without requiring spatial correspondence between audiovisual stimuli. For instance, in a visual search task where the colors of a target and some distractors switch at regular intervals, the detection time for a visual target can be shortened by synchronizing the color changes with an auditory stimulus (Van der Burg et al. 2008). This phenomenon, known as the “pip and pop effect,” occurs when the timing of color changes in some stimuli coincides with an auditory stimulus (a pip sound) presented binaurally, leading to a pop-out effect. Preceding auditory stimuli can function as warning signals. While warning sounds may reduce reaction time, they can sometimes increase detection errors (Han and Proctor 2022; Simon et al. 1975). However, according to the pip and pop effect, the smaller the temporal gap between audiovisual stimuli, the shorter the detection time for visual targets. Therefore, it is suggested that the combination of audiovisual information enhances the subjective salience of visual targets. Because the auditory stimuli in this case do not function as warning signals, audiovisual integration is probably automatic and feedforward (Salselas et al. 2024). Relatedly, in a rapid serial visual presentation (RSVP), if a salient auditory stimulus is temporally synchronized with a visual target inserted in the stream, the visual target can appear to “freeze,” thereby enhancing the accuracy of its spatial localization (i.e., the freezing phenomenon; Vroomen and De Gelder 2000). Previous studies suggest that, rather than merely directing attention to a specific visual stimulus, the auditory signal automatically enhances visual salience.

However, the pip-and-pop effect, freezing phenomenon, and audiovisual training for patients with hemianopia all assume that auditory stimuli enhance the salience of binocular visual representations. It remains unclear whether auditory signals can enhance monocular visual signals before binocular fusion. To address this, we focused on the difference between the dominant and non-dominant eye in individuals with normal vision (Porac and Coren 1976; Rice et al. 2008), where the non-dominant eye’s signals are often underweighted or ignored in perception. Thus, we aimed to investigate whether normally suppressed visual experiences from the non-dominant eye could be activated solely through temporal synchrony with auditory signals. This may have implications for stabilizing binocular vision in cases of monocular impairment prior to cortical processing.

We conducted two RSVP tasks, presenting visual targets to both eyes, the dominant eye only, or the non-dominant eye only, while presenting distractors to the other eye, during the audiovisual stream. Stimuli were paired with low-pitched tones, while targets were synchronized either with the same low tone or a high-pitched tone. We hypothesized that salient high-pitched tones would independently enhance monocular visual signals, predicting detection rates similar to binocular presentation. Alternatively, if auditory signals contribute only to binocular processing, detection rates would be lower for monocular presentations, particularly for the non-dominant eye, due to its inherently unstable representation. This study examined whether auditory signals influenced “where” information in a localization task (Experiment 1) and “what” information in an orientation discrimination task (Experiment 2).

Experiment 1

A four-alternative forced choice (4AFC) procedure was used for a localization task involving a visual target inserted during an RSVP. We examined whether localization was enhanced by audiovisual temporal synchronization. We compared the performance across the three presentation conditions: both eyes, the dominant eye, and the non-dominant eye.

Materials and methods

Participants

Twenty observers from Ritsumeikan University participated in this study (seven men and 13 women; age range: 19–35 years). All the participants had normal or corrected-to-normal vision and no relevant medical history. They first provided written informed consent. After the experiment, the participants received a gift certificate worth JPY 1,000 for their 1 h of participation. The study was approved by the Institutional Review Board of the Ethics for Research Involving Human Subjects at Ritsumeikan University.

Apparatus and stimuli

The experiments were conducted individually in a dark room. The participants sat at 47 cm from a 31.1-inch liquid crystal monitor (ColorEdge CG318, EIZO Corporation, JAPAN) with a 60-Hz refresh rate, 1920 × 1080 resolution, and 40 cd/m² luminance. An ophthalmic chin and forehead rest was used for head positioning. The participants viewed the left and right screens with their left and right eyes, respectively, through a mirror stereoscope, as the screen was divided into two sections (NAMOTO, Co. Ltd., JAPAN). The preparation and presentation of the visual stimuli were controlled using GNU Octave 7.3.0 (GNU General Public License) with the Psychtoolbox extension (Brainard 1997).

All the visual stimuli were presented within a white square subtending 4.5° in the visual angle at the center of a black screen. The white square was always presented as a frame to stabilize binocular fusion. Within this square, 4 × 4 matrix placeholders were virtually created. One distractor comprised four black dots randomly placed from the 16 placeholders (Fig. 1). One target comprised four dots forming a diamond shape and was presented at one of the four corners inside the square: top-left, top-right, bottom-left, or bottom-right. In addition, mask stimuli comprising dots drawn from all 16 placeholders were created. Each dot measured 4 × 4 pixels. Auditory stimuli consisted of pure tones at frequencies of 1000 Hz and 1259 Hz (four semitones higher) with a 44,100 Hz sampling rate. The intensity of the sound was approximately 60 dB SPL to clearly distinguish each tone. The sound was always presented to both ears via headphones (AKG Q701, Harman International Industries, Inc., Stamford, USA).

Fig. 1 — Experimental design and flow diagram of a single trial in Experiment 1

Four test displays, consisting of three distractors and one target followed by masks, were presented in a visual stream. Each distractor comprised four dots positioned randomly within 16 placeholders, whereas the target comprised four dots forming a diamond shape. This stream was looped up to 15 times in a single trial. Visual stimulation was performed dichoptically using a mirror stereoscope. The visual target in the third display was presented to both eyes, the dominant eye, or the non-dominant eye. The distractors were identical throughout a single trial, but they were randomized between trials. Auditory stimulation was synchronized with each test display for 50 ms, except for masks. The tone sequence was conducted under two conditions: salient and non-salient. DE: dominant eye, NDE: non-dominant eye.

Procedure of dominant eye test

Before entering the dark room, each participant performed three sighting-dominant eye tests in a well-lit room. The first test was the Hole-in-a-Card Test, where participants used a 21.6 × 30 cm board with a hole that was 3 cm in diameter in the center (Fig. 2). Holding the board at arm’s length, the participants peered at a green patch displayed on a monitor 110 cm away through the hole using both eyes. The patch, subtending 2.2° in the visual angle, fit exactly into the hole. The participants were instructed to slowly bring the board towards their face while maintaining fixation on the patch. If the position of the hole shifted horizontally towards one eye when the board was brought in front of their faces, the experimenter identified that eye as the dominant eye. Most participants were not conscious that the final board position had shifted to one eye because they believed that they had observed the patch using both eyes. Next, the participants formed a small triangle by overlapping their hands and repeated the same action as in the first test (Fig. 2). The eye aligned with the direction of the shifted hand was identified as the dominant eye. The last test was the Miles Test, which used the same board as in the first test. The participants held the board at arm’s length, peered at the patch through the hole, and alternately closed one eye. The dominant eye was identified based on the open eye that successfully captured the patch. Using these three tests, we determined the final dominant eye based on the majority score.

Fig. 2 — Sighting-dominant eye test procedure

The Hole-in-a-Card Test consists of two steps. First, participants stretch their arms and create a small triangle using both hands. While watching a green patch through the triangle, the participants slowly bring both hands to their faces. The eye that keeps fixating and induces a subtle horizontal shift of the triangle (i.e., both hands) is defined as the sighting-dominant eye. Again, participants use a board truncated at the center and perform the same procedure. In the Miles Test, participants use the same board and observe the patch through the hole with one eye closed alternately. The eye capturing the patch in the hole is the dominant one.

Procedure of location detection task

Following the sighting-dominant eye tests, participants received instructions for the main task through an oral explanation and slide presentation. Additionally, we presented four trials of an audiovisual stream at a much slower tempo than the actual task to ensure their comprehension. After confirmation, the participants entered a dark room to perform 10 practice trials at the actual tempo. Before starting the practice, the participants reconfirmed a brief text of instructions on the screen, which also helped with binocular fusion. If the participants experienced discomfort with the binocular fusion of the text, the experimenter instructed them to rotate the mirrors on either side until the images merged. A trial started with a blank screen featuring only a white square on a black background for 1000 ms. Subsequently, four displays comprising four dots were presented with immediately accompanying masks. A visual target (diamond formation) was inserted as the third display (see Fig. 1). Each of the four-dot displays and masks was presented for three frames (50 ms), resulting in a total duration of 400 ms for these eight displays. Because a blank screen was presented for six frames (100 ms) after the series of eight displays, the total duration was 500 ms. It looped for a maximum of 15 times, with a duration of 7500 ms. However, the participants were instructed to press the corresponding key immediately upon perceiving the target’s position during the loop; thus, the actual duration was often shorter than 15 loops. If no response was elicited within 15 loops, the program automatically transitioned to the next trial. Participants’ responses were recorded using two numeric keypads (BSTKH08, Buffalo Inc., JAPAN, and ST-U2NK, SATECHI, CA, USA) placed on either side of the mirror stereoscope. The “7” and “4” keys on the left keypad corresponded to the top-left and bottom-left responses, whereas the “-” and “+” keys on the right keypad corresponded to the top-right and bottom-right responses, respectively. To prevent erroneous input, all other keys were disabled during the task. The four-dot displays were synchronized with auditory stimuli consisting of four low (L) tones at 1000 Hz (LL“L”L) or four tones including a high (H) tone at 1259 Hz in the third display (LL“H”L).

Furthermore, another factor involved presenting visual targets to both eyes, the dominant eye alone, or the non-dominant eye only. For the latter two conditions, a blank screen was presented to the other eye on the third display, whereas the flow of the other displays remained the same for both eyes. This manipulation was based on the pilot results from two volunteers in our laboratory and the author (HT). The pilot experiment involved a parallel presentation of the target to one eye and a mask (16 dots) to the opposite eye on the third display. However, this composition was assumed to be difficult because the results showed generally low detection rates approaching chance levels. Conversely, the blank screen inserted in the third display in the opposite eye did not cause any discomfort to the observers during visual stimulation. The procedure in this task was similar to that used in a previous study (Vroomen and De Gelder 2000), but with a faster presentation time per display and no warm-up period. This was based on the three pilots’ data. The number of responses over 15 loops was 6.7% of all trials in the valid data, suggesting that participants became accustomed to the sequence within 4–8 loops without any special warm-up period.

Three within-subject factors were examined: two levels of the tone saliency (salient and non-salient), three levels of the target-presented eye (binocular, dominant eye, and non-dominant eye), and four levels of the target position (top-left, top-right, bottom-left, and bottom-right). This combination generated 24 subconditions with 16 repetitions. All conditions were randomized, with a short break inserted every 48 trials, for a total of 384 trials.

Results and discussion

Because data from four participants with overall low detection rates were excluded from the analysis, 16 datasets were analyzed. The criterion was defined as whether the average detection rate fell below the chance level (25%) across the three target-presented eye conditions in the non-salient tone condition, as it served as a reference for comparison between the tone saliency conditions.

Figure 3 presents plots of the average detection rates, loop numbers, and detection promotion rates. All descriptive statistics are listed in Table 1. First, a repeated-measures analysis of variance (ANOVA) was conducted for detection rates, considering factors of tone saliency and the target-presented eye, on R (version 4.3.1). The alpha level was set at 0.05. Because the main effect of the target-presented eye factor was significant, F(2, 30) = 34.324, p <.001, Inline graphic = 0.696, Bayes Factor BF₁₀ = 1.935e+8, Bonferroni-corrected post-hoc comparisons revealed that the binocular condition had higher detection rates than the dominant eye, t(15) = 7.819, p <.001, Cohen’s d = 1.955, BF₁₀ = 557644.011, and the non-dominant eye conditions, t(15) = 6.284, p <.001, Cohen’s d = 1.571, BF₁₀ = 104686.232. There was no significant difference between the monocular conditions, t(15) = 1.535, p =.406, Cohen’s d = 0.384, BF₁₀ = 0.612. Furthermore, neither the main effect of the tone saliency factor, F(1, 15) = 1.924e−4, p =.989, Inline graphic = 1.283e−5, BF₁₀ = 0.215, nor the interaction, F(2, 30) = 2.030, p =.149, = 0.119, BF₁₀ = 3.972e+7, was significant (Fig. 3A).

Fig. 3 — Results of Experiment 1. (A) The average detection rate for the target across 16 participants. The horizontal dotted line represents the chance level. Each point represents an individual participant’s data. (B) The average number of loops until the key press is used as an index of detection time. (C) Average detection facilitation rates were calculated using the formula: . Error bars represent the standard error of the mean.

Inline graphic — Results of Experiment 1. (A) The average detection rate for the target across 16 participants. The horizontal dotted line represents the chance level. Each point represents an individual participant’s data. (B) The average number of loops until the key press is used as an index of detection time. (C) Average detection facilitation rates were calculated using the formula: . Error bars represent the standard error of the mean.

Table 1.

Quantitative data of experiment 1

Target-presented eye	N	Tone	Detection rate	(SD)	Loop numbers	(SD)	Detection facilitation rates	(SD)
Binocular	16	Salient	0.45	(0.11)	8.81	(2.01)	−0.02	(0.21)
Binocular	16	Non-salient	0.47	(0.11)	8.72	(2.11)	−0.02	(0.21)
Dominant eye	16	Salient	0.34	(0.09)	9.15	(2.07)	0.00	(0.23)
Dominant eye	16	Non-salient	0.35	(0.11)	9.04	(2.14)	0.00	(0.23)
Non-dominant eye	16	Salient	0.38	(0.09)	9.18	(2.12)	0.12	(0.21)
Non-dominant eye	16	Non-salient	0.35	(0.09)	8.83	(2.15)	0.12	(0.21)

Open in a new tab

Detection rate, loop numbers, and detection facilitation rates represent mean value across 16 participants. N: the number of participants, SD: standard deviation.

However, anisotropy across the presentation conditions was observed in the detection rates as influenced by the tone salience. Specifically, the detection rate increased only under the non-dominant eye condition when the tone was salient. Therefore, we focused on the detection facilitation rate: Inline graphic in relation to tone salience. Multiple comparisons with an alpha of 0.017 revealed that the detection facilitation rate for the non-dominant eye condition was significantly higher than that for the binocular condition, t(15) = 3.040, p =.008, Cohen’s d = 0.76, BF₁₀ = 6.394. There were no significant differences between the non-dominant and dominant eye conditions, t(15) = 1.547, p =.143, Cohen’s d = 0.387, BF₁₀ = 0.688, nor between the binocular and dominant eye conditions, t(15) = 0.356, p =.727, Cohen’s d = 0.089, BF₁₀ = 0.271 (Fig. 3C).

Next, a repeated-measures analysis of variance (ANOVA) with an alpha of 0.05 revealed a significant main effect of the target-presented eye on the number of loops, F(2, 30) = 5.179, p =.012, Inline graphic = 0.257, BF₁₀ = 3.126. The number of loops indicated the reaction time for key presses (Fig. 3B). Bonferroni-corrected post-hoc comparisons revealed that the binocular condition had fewer loops than the dominant eye condition, t(15) = −3.122, p =.012, Cohen’s d = 0.780, BF₁₀ = 25.765, but did not significantly differ from the non-dominant eye condition, t(15) = −2.240, p =.098, Cohen’s d = 0.560, BF₁₀ = 3.014. Furthermore, there was no significant difference between the monocular conditions, t(15) = 0.882, p = 1.000, Cohen’s d = 0.22, BF₁₀ = 0.265. Additionally, neither the main effect of the tone saliency factor, F(1, 15) = 2.033, p =.174, Inline graphic = 0.119, BF₁₀ = 1.063, nor the interaction, F(2, 30) = 1.246, p =.302, = 0.077, BF₁₀ = 3.481, was significant. In summary, participants exhibited the best performance in terms of both detection rates and loop numbers under the binocular condition, indicating that there was no speed-accuracy trade-off.

The lack of contribution from the salient tone to the detection facilitation rate in the binocular condition contradicts the findings from previous research on the freezing phenomenon. This inconsistency might be due to ceiling effects arising from the absence of trade-off or extraneous factors introduced using the mirror stereoscope. Notably, the detection facilitation rate increased only in the non-dominant eye condition from the non-salient to salient tone conditions, suggesting that the temporal synchronization of audiovisual stimuli enhances the saliency of unstable monocular signals in the localizing task.

Experiment 2

To further investigate the conditions under which a salient auditory signal enhances visual processing in the non-dominant eye, we conducted Experiment 2, focusing on an orientation discrimination task involving a visual target in the parafovea.