Suboptimal human multisensory cue combination

Derek H Arnold; Kirstie Petrie; Cailem Murray; Alan Johnston

doi:10.1038/s41598-018-37888-7

. 2019 Mar 26;9:5155. doi: 10.1038/s41598-018-37888-7

Suboptimal human multisensory cue combination

Derek H Arnold ^1,^✉, Kirstie Petrie ¹, Cailem Murray ¹, Alan Johnston ²

PMCID: PMC6435731 PMID: 30914673

Abstract

Information from different sensory modalities can interact, shaping what we think we have seen, heard, or otherwise perceived. Such interactions can enhance the precision of perceptual decisions, relative to those based on information from a single sensory modality. Several computational processes could account for such improvements. Slight improvements could arise if decisions are based on multiple independent sensory estimates, as opposed to just one. Still greater improvements could arise if initially independent estimates are summed to form a single integrated code. This hypothetical process has often been described as optimal when it results in bimodal performance consistent with a summation of unimodal estimates weighted in proportion to the precision of each initially independent sensory code. Here we examine cross-modal cue combination for audio-visual temporal rate and spatial location cues. While suggestive of a cross-modal encoding advantage, the degree of facilitation falls short of that predicted by a precision weighted summation process. These data accord with other published observations, and suggest that precision weighted combination is not a general property of human cross-modal perception.

Introduction

Cues extracted from different sensory modalities can interact, bringing about striking changes in perception. In the spatial ventriloquist effect, for example, an auditory signal can seem to originate from the location of a visual signal¹. This has been used by puppeteers to entertain for centuries, and it is pertinent whenever you watch television, and fail to notice that the visible actors aren’t actually making any sound, which is instead originating from an offset source. There is a related temporal effect. In temporal ventriloquism the timing of a visual event tends to be drawn toward that of an auditory signal². As visual coding typically provides more precise information about the spatial environment, and auditory coding typically provides more precise timing information, these effects have historically been taken as evidence for the modality appropriateness hypothesis³ – the idea that a given sensory modality will dominate perception when that modality typically provides better information about the issue at hand.

In contemporary research, the modality appropriateness hypothesis has been overturned, due primarily to a seminal finding regarding the malleability of cross-modal perception. Instead of vision always dominating spatial perception, Alais & Burr⁴ showed that audition could dominate vision when the spatial cues provided by vision were sufficiently degraded. Instead of a certain type of sensory decision being dominated by a given sensory modality, researchers now argue that the brain estimates the instantaneous precision of each source of sensory evidence when multiple cues are available from different senses^4–8. Hypothetically, the brain uses these estimates when it sums the initially independent sensory codes together to form an integrated code. This process is often referred to as an optimally weighted summation⁵.

A process of optimally weighted summation does not just allow for perceptual decisions to be dominated by diverse sensory modalities, it allows for enhanced sensitivity, relative to when information is available from just one sensory modality. This is true even if the two sensory modalities provide equally precise sensory estimates. Levels of sensitivity resulting from an optimally weighted summation can be predicted by employing a Minkowsky metric^9,10 to evaluate the degree of summation. This can be calculated as follows:

AV s = {(A s^{k} + V s^{k})}^{1 / k}

where AVs denotes sensitivity to combined audio and visual signals, As sensitivity to audio signals, and Vs sensitivity to visual signals. The exponent, k, indicates the degree of summation, with an optimally weighted summation corresponding with a quadratic summation, or k = 2.

Note that in the cue combination literature the term “optimal” refers to an optimal combination rule under a given model. Since modeling efforts and experiments are not exhaustive, we can never say that a particular outcome is optimal in a general sense. Moreover, all such models are based on assumptions, so better or worse performance than model predictions can result if model assumptions are wrong¹¹. Refuting model predictions does not, however, discredit the broader conceptual framework in which that model resides. Here that framework is Bayesian, with perception presumably informed by past contexts.

To avoid confusion, we will largely refrain from referring to optimal and suboptimal processes. Rather, we will examine the predictions of a specific model, wherein initially independent sensory estimates are combined via a process of weighted summation, with weights determined by an instantaneous and accurate appraisal of the precision of unimodal estimates. We will refer to precision weighted summation, as this is the rule many researchers have assumed^4,5,12.

The provision of sensory cues in multiple, as opposed to just one, sensory modalities has repeatedly been found to result in more precise perceptual judgments^4,5,12–15, and the degree of improvement has often been said to be consistent with a precision weighted summation^4,5,12,14. But precision weighted summation is not the only combination rule that could result in bimodal improvements, and other schemes are seldom considered in detail, or pitted against precision weighted summation predictions in order to see which scheme best describes performance.

Sensitivity differences predicted by different decision schemes

The idea that unimodal sensory estimates are combined via a precision weighted summation is conceptually important, and prominent in contemporary literature. However, the sensitivity differences this scheme predicts, relative to simpler decision schemes, are small. To illustrate, in Fig. 1 we show simulated unimodal (audio and visual) sensitivity scores (dPrime values) that vary inversely. We have also plotted bimodal sensitivities predicted assuming these signals are combined via a precision weighted summation process. Finally, we have plotted sensitivities predicted by assuming that each modality makes an independent contribution, on a trial-by-trial basis, to a decision process – probability summation^16,17.

Plots of inversely varying simulated audio (blue) and visual (red) sensitivity scores, in addition to sensitivity scores predicted by weighted summations of unimodal signals (red) and via probability summations (grey). Note that the largest difference, between unimodal sensitivities and predicted bimodal sensitivities, occurs when unimodal sensitivities are *precisely* matched.

Probability Summation

Pirenne¹⁶ pointed out that the ability to detect a signal could be enhanced if there were two encoded signals, with each having an independent probability (on a trial-by-trial basis) of exceeding an encoded level of intensity necessary for detection – a concept known as probability summation. The ability to detect audiovisual signals, at minimal component intensities, reportedly conforms to probability summation predictions^14,18. The same logic applies to multisensory tasks – like deciding which of two presentations is associated with a greater value (further to the right, or higher in temporal modulation rate). If we assume the decision process weights information equally, a bimodal encoding advantage could still ensue, relative to even the most reliable single sensory modality, because either modality could provide a more intense correct signal on a trial-by-trial basis¹⁷. In terms of the Minkowsky metric we described earlier^9,10, the difference between the predictions of a precision weighted summation, and a probability summation process, are encapsulated within the assumed exponent, k = 2 for precision weighted summations, 3 for probability summation.

Comparing predicted bimodal sensitivities

The first thing to note about Fig. 1 is that the largest difference, between unimodal sensitivities and predicted bimodal sensitivities, happens when the two unimodal sensitivities are precisely matched. This is why multisensory cue combination studies often attempt to equate unimodal performances^4,5,14. Note also that even these differences are slight. Unimodal dPrime values in this case are ~1.32, whereas bimodal sensitivity predicted by a precision weighted summation process is ~1.86. In a forced choice task, this could coincide with proportion correct scores of ~0.75 and 0.83 respectively. Any deviance from precisely matched unimodal sensitivities lessens the predicted advantage. For instance, if we assume a 5% difference in unimodal task performance, the better of the two modalities could coincide with ~0.77 correct task performance, whereas the predicted bimodal performance from precision weighted summation would still be ~0.83 – a 6% difference. Our point is that with any degree of measurement error, it might be difficult to distinguish between precision weighted summation predictions and decisions based on the best available unimodal evidence, and this constitutes a minimal condition before a multisensory decision process should be assumed.

Differences between bimodal sensitivities predicted by a precision weighted summation process, and those predicted by a probability summation process, are even smaller. If unimodal sensitivity dPrime scores are precisely matched (at 1.32), predicted dPrime values are 1.86 and 1.66 respectively, corresponding with proportion correct scores of ~0.83 and 0.80. Again, assuming any degree of measurement error, these slight predicted differences could be difficult to discern. Before any multimodal encoding advantage can confidently be attributed to a precision weighted summation process, probability summation should be dismissed as a viable interpretation, as this decision process predicts a bimodal advantage relative to averaged unimodal sensitivities, but it does not assume an analysis of the brains’ intrinsic signal to noise – an additional computation that is necessary for precision weighted summation calculations^4,5,12,14.

A minimal precision weighted summation prediction

Precision weighted summation predicts that sensitivity to a consistent combination of bimodal signals should be greater than sensitivity to the most precisely encoded unimodal signal, especially when unimodal sensitivities are matched (see Fig. 1). In our experiments we will use this as a metric to test if precision weighted summation predictions have been met, as this constitutes a minimal requirement before a multisensory process should be considered. We will also pit precision weighted summation predictions against probability summation predictions, to see which better describes performances on trials involving congruent bimodal signals.

To investigate these matters, we chose to examine audio, visual and audio-visual spatial origin (Experiments 1) and temporal rate (Experiment 2) judgments. To preface our results, we find evidence for enhanced audio-visual sensitivity in each context, relative to average unimodal performances. We do not, however, find that cross modal sensitivities are enhanced relative to the most precise unimodal sensitivity displayed by each participant. Also, in both experiments we find that probability summation better describes performance on congruent bimodal trials, relative to precision weighted summation predictions, although neither account accurately describes group-level performances. We also find that when audio and visual cues are placed in conflict, participant judgments are not biased in favor of the information that they had encoded with greater precision. All these results argue against bimodal decisions being informed by a weighted summation process.

Experiment 1 – Audio, Visual and Audio-Visual Spatial Origin Judgments

Methods

Ethical approval for both experiments was obtained from the University of Queensland Ethics Committee, and were in accordance with the Declaration of Helsinki. Consent to participate in the study was fully informed. Before each experimental session began, participants read an instruction screen that informed them that they could withdraw from the experiment at any time without penalty. They indicated consent to participate by clicking on a statement, that they had read and understood these instructions. The participant depicted in Fig. 2 consented to these images being used in an online open-access publication.

Graphics depicting the visual display for Experiment 1. (A) Position of speakers, mounted to the rear of the display, used to present audio stimuli, (B) Height of display area (86 cm), (C) Width of display area (112 cm) (D) Viewing distance (137 cm).

There were 20 participants in Experiment 1, including 2 of the authors and an additional 18 people who were naïve regarding the purpose of the experiment. Stimuli were generated using custom Matlab software in conjunction with the Psychophysics Toolbox^19,20. Visual stimuli were presented via a NEC VT660 data projector. The data projector was positioned above and behind the participants’ head, 210 cm from the display screen (112 cm wide, 86 cm tall; see Fig. 2). Participants were seated in a chair, centered relative to the display screen, at a viewing distance of ~137 cm. Auditory stimuli were delivered via two speakers positioned 27 cm to the left and the right of the center of the display screen. These were mounted behind, centered on the vertical mid-point, and facing into the rear of the screen. Responses were reported by pressing one of two buttons on a mouse held in the participant’s lap.

Visual stimuli consisted of vertical Gabors, subtending ~15 degrees of visual angle (dva) at the retina, with a spatial constant of ~2.5dva, a spatial frequency of ~1 cycle per dva, and a Michelson contrast of 75%, shown against a grey background. These were flashed for 200 ms, followed by a full screen of visual white noise.

Audio calibration

Auditory stimuli consisted of white noise bursts, lasting for 25 ms with 5 ms linear onset/offset ramps. Eccentricity was signaled via inter-aural intensity differences. Before the experiment, participants completed a calibration task to determine what right speaker peak signal intensity was subjectively matched to a Standard left speaker peak intensity of ~60 dB SPL. In the subsequent experiment, apparent stimulus eccentricity was manipulated by presenting stimuli at multiples of these pre-determined peak stimulus intensities, such that a signal apparently originating 27 cm to the left of the display center would have a peak left channel intensity of ~60 dB SPL while the right channel was silent, and a signal apparently originating from the display center was signaled by both channels being set to half their pre-determined peak intensities.

Performance calibration

A second calibration phase was used to identify signaled eccentricities for unimodal stimulus presentations resulting in ~67% correct task performance. Subjectively, Standard signals seemed to originate from the display center, whereas Comparison signals were manipulated on a trial-by-trial basis to identify signaled eccentricities that could be accurately identified, as originating from the left or right relative to Standard presentations on ~67% of trials. During this procedure signaled Comparison eccentricities were adjusted according to 1-Up, 2-Down staircase procedures, wherein an incorrect judgment resulted in an increased Comparison eccentricity, and two consecutive correct judgments in a decrease in Comparison eccentricity being signaled. Audio and Visual Comparison stimuli for the subsequent experiment were set to these values, to ensure performance would be closely matched across unimodal stimulus conditions. This level of task performance was chosen to avoid floor and ceiling effects in the subsequent experiment.

Test Phase

Each trial consisted of two sequential presentations, of a Standard then a Comparison, or of a Comparison then a Standard stimulus (order determined at random on a trial-by-trial basis). Stimulus presentations were separated by a 333 ms inter-stimulus interval (ISI). On each trial the participant was required to indicate whether the second stimulus presentation was offset to the left or right relative to the first. The direction of the Comparison offset (left/right) was determined at random on a trial-by-trial basis.

Four types of trial were interspersed during trial blocks. Stimulus presentations were either auditory, visual, congruent audiovisual, or incongruent audiovisual. In congruent audiovisual trials, the Comparison offset direction (left or right) signaled by audition and vision was matched, whereas in incongruent audiovisual trials, audition and vision signaled opposite offset directions. Each trial block consisted of 80 individual trials for each type of trial, 320 individual trials in total, all interspersed in random order. Each participant completed three blocks of trials, and data were collated across the three blocks prior to analysis, for a total of 960 individual trials for each participant.

Results

Uunimodal performance levels were well matched, both across individuals (visual 65%, audio = 64%) and within individuals (mean individual difference score of 0.7% SD = 4.8).

Of our participants, 15 of 20 showed some evidence of a bimodal encoding advantage, with performance on congruent audiovisual trials superior to performances averaged across each participants’ two unimodal conditions (see Fig. 3a). Averaged across participants, congruent audiovisual performance (69% SD = 6) was better than averaged unimodal performances (65% SD 4; paired t₁₉ = 2.96, p = 0.008, 95% CIs 0.012 to 0.068). Congruent audiovisual performance was not, however, improved relative to the best unimodal performance achieved by each participant (67% SD 4; paired t₁₉ = 1.48, p = 0.155, 95% CIs −0.008 to 0.049; see Fig. 3b).

(a) X/Y scatterplot of individual proportion correct task performance scores on Congruent audiovisual trials (X axis) and averaged across unimodal visual and auditory trials (Y axis). (b) As for (a), but for the best of each individuals’ unimodal condition performance (Y axis). (c) X/Y scatterplot of individual d’ scores for Congruent audiovisual trials (X axis) and d’ scores predicted by precision weighted summation, from performances on unimodal trials. (d) As for (c), but for d’ scores predicted by probability summation. In all plots author data points are coloured red. Data points in grey regions of (a,b) indicate *better* performance on Congruent AV trials, relative to the other dataset. Data points in white regions in (c,d) indicate *worse* performance on congruent AV trials than predicted.

Proportion correct scores for audio, visual, and congruent AV trials were converted into hit rates and false alarm rates, by respectively treating left and right offset cues as ‘signal’ and ‘noise’. This allowed us to calculate d’ scores for congruent AV trials, and predicted d’ scores based on unimodal trials (following the formula outlined in the introduction). Predictions for precision weighted summation and probability summation are plotted against actual congruent AV d’ scores in Fig. 3c,d respectively. Of the two schemes, probability summation predictions were closer to actual congruent AV d’ scores. Each set of predictions were subjected to Bayes factor analysis for a paired samples t-test, using JASP software (2015) with a Cauchy prior width of 0.707. These compared model predictions to actual performances on congruent bimodal trials. For precision weighted summation predictions, this only revealed anecdotal evidence in favour of the null hypothesis (that model predictions would be equivalent to actual bimodal performance; BF₁₀ = 0.418, error 0.015%). For probability summation predictions there was moderate evidence in favour of the null hypothesis (BF₁₀ = 0.233, error 0.022%).

Incongruent audio-visual trial bias

As unimodal performance levels were well matched on average, if people had accurate insight into the precision of information provided by audition and vision, there should have been no systematic group-level bias on Incongruent audiovisual trials (when unimodal spatial cues were placed in conflict). This was tested by calculating individual bias scores, by subtracting the proportion of incongruent audiovisual trials on which a participants’ response was consistent with the visual stimulus, from the proportion of such trials on which their response had been consistent with the audio stimulus. A positive bias score is indicative of a visual bias, a negative score an audio bias. Averaged across participants, we found evidence for a visual bias (15% SD = 18, single sample t₁₉ = 3.81, p = 0.001, 95% CIs 0.069 to 0.236, see Fig. 4a). This provides further evidence that bimodal decisions lack insight into how well information in either uni-sensory modality has been encoded. If they had, bias scores should have reflected the modality that had prompted better unimodal performance, but there was no robust correlation between individual performance levels on unimodal trials and the degree to which people displayed a visual bias on incongrunet bimodal trials (Pearson’s R = 0.2, p = 0.388, 95% CIs −0.26 to 0.59).

(a) X/Y scatterplot of individual visual bias scores (X axis) and proportional differences in correct performances on unimodal visual and audio trials (Y axis) from Experiment 1. Data points falling within the grey region are indicative of a greater visual bias than is justified by differences in unimodal trial performances. (b) Details are as for (a) but data are from Experiment 2.

Discussion

We found evidence for a bimodal encoding advantage, in that individual performance tended to be enhanced on Congruent audiovisual trials relative to each individuals’ performance averaged across unimodal (audio and visual) trials. Congruent audiovisual trial performance was, however, inferior on average to each individuals’ performance on the best of their two unimodal conditions. Moreover, performance on Incongruent audiovisual trials suggested that participants had placed an undue emphasis on visual cues in those trials, as participants had displayed a visual bias despite the two unimodal cues being well-matched in terms of task performance. This speaks against bimodal decisions being guided by an accurate sensory-level appraisal of the precision of unimodal cue encodings, as predicted by a precision weighted summation process^4,5,12.

Of the two bimodal decision schemes we have outlined (probability summation and precision weighted summation), probability summation predictions were more accurate (see Fig. 2c,d). Neither scheme, however, predicted the data well. This is not surprising in light of the superiority of each participants’ best unimodal performance, relative to performances on congruent bimodal trials. This implies that participants based decisions predominantly on one type of unimodal information on congruent bimodal trials, and that this was not reliably the best source of information available to them (so they did not benefit from an integrative process). In Experiment 2 we decided to see if a similar pattern of results would emerge when people make temporal judgments.