Proceedings of the National Academy of Sciences of the United States of America. 2013 Nov 11;110(48):E4668–E4677. doi: 10.1073/pnas.1312518110

Dynamic faces speed up the onset of auditory cortical spiking responses during vocal detection

Chandramouli Chandrasekaran, Luis Lemus, Asif A. Ghazanfar
PMCID: PMC3845123  PMID: 24218574

Significance

We combine facial motion with voices to help us hear better, but the role that low-level sensory areas such as the auditory cortex may play in this process is unclear. We combined a vocalization detection task with auditory cortical physiology in monkeys to bridge this epistemic gap. Surprisingly, and contrary to previous assumptions and hypotheses, changes in firing rate had no clear relationship to the detection advantage that dynamic faces provided when listening for vocalizations. Instead, dynamic faces uniformly sped up the onset of spiking activity in the auditory cortex, and this faster onset partially explains the behavioral benefits of combining faces and voices.

Keywords: multisensory integration, crossmodal, face processing, monkey vocalization

Abstract

How low-level sensory areas help mediate the detection and discrimination advantages of integrating faces and voices is the subject of intense debate. To gain insights, we investigated the role of the auditory cortex in face/voice integration in macaque monkeys performing a vocal-detection task. Behaviorally, subjects were slower to detect vocalizations as the signal-to-noise ratio decreased, but seeing mouth movements associated with vocalizations sped up detection. Paralleling this behavioral relationship, as the signal-to-noise ratio decreased, the onsets of spiking responses were delayed and their magnitudes decreased. However, when mouth motion accompanied the vocalization, these responses were uniformly faster. Conversely, and at odds with previous assumptions regarding the neural basis of face/voice integration, changes in the magnitude of neural responses were not related consistently to audiovisual behavior. Taken together, our data reveal that facilitation of spike latency is a means by which the auditory cortex partially mediates the reaction time benefits of combining faces and voices.


In noisy environments, the audiovisual nature of speech is a tremendous benefit to sensory processing. While holding a conversation in a large social setting, your brain must deftly detect when a person is saying something, who is saying it, and discriminate what she is saying. To make the task easier, our brains do not rely entirely on the person’s voice but also take advantage of the speaker’s mouth movements. This visual motion provides spatial and temporal cues (1, 2) that readily integrate with the voice, enhancing both detection (3–10) and discrimination (11–15). How the brain mediates the behavioral benefits achieved by integrating signals from different modalities is the subject of intense debate and investigation (16). For face/voice integration, traditional models emphasize the role of association areas embedded in the temporal, frontal, and parietal lobes (17). Although these regions certainly play important roles, numerous recent studies demonstrate that they are not the sole regions for multisensory convergence (18, 19). The auditory cortex, in particular, has many sources of visual input, and an increasing number of studies in both humans and nonhuman primates demonstrate that dynamic faces influence auditory cortical activity (20).

However, the relationship between multisensory behavioral performance and neural activity in the auditory cortex remains unknown for two reasons. First, methodologies typically used to study the auditory cortex in humans are unable to resolve neural activity at the level of action potentials. Second, regardless of the areas explored, none of the face/voice neurophysiological studies in monkeys to date, including auditory cortical studies (21–24) and studies of association areas (25–27), have required monkeys to perform a multisensory task. All these physiological studies demonstrated that neural activity in response to faces combined with voices is integrative, exhibiting both enhanced and suppressed changes in the magnitude of response when multisensory conditions are compared with unisensory ones. It is presumed that such changes in firing rate mediate behavioral benefits (e.g., faster reaction times, better accuracy) of multisensory signals, but it is possible that integrative neural responses—particularly in the auditory cortex—are epiphenomenal.

In this study, we combined an audiovisual vocal-detection task with auditory cortical physiology in macaque monkeys. When detecting voices alone, our data show that the signal-to-noise ratio (SNR) systematically influences behavioral performance; the same systematic effects are observed in the magnitude and latency of spiking activity. The addition of a dynamic face leads to audiovisual neural responses that are faster than auditory-only responses—dynamic faces speed up the latency of auditory cortical spiking activity. Surprisingly, the addition of dynamic faces does not systematically change the magnitude or variability of the firing rate. These data suggest that visual influences have a role in facilitating response latency in the auditory cortex during audiovisual vocal detection. Facial motion speeds up the spiking responses of the auditory cortex but has no systematic influence on firing rate magnitudes.

Results

We trained two monkeys to detect visual-only (V), auditory-only (A), and audiovisual (AV) presentations of vocalizations in a combined free-response/redundant-signals task (3). A free-response task is one in which there are no explicit trial markers (28). In a redundant-signals task, all of the sensory components of each stimulus arise from the same event (3, 29–32). This latter task has been used repeatedly to study multisensory integration in humans, and the expected behaviors in this task, as well as models to explain it, have been reported extensively (3, 29–32). We combined these tasks in a paradigm that approximates natural face-to-face communication, in which the timing of vocalizations is not entirely predictable. Moreover, the acoustic components of the vocalizations were degraded by noise, but the face and its motion were perceived clearly—again, conditions that are typical of macaque monkey and human social interactions. In the task, monkeys responded to two natural “coo” calls, each produced by a different individual and of the same duration (Fig. 1A). Coo calls were presented at three levels of SNR (high, medium, and low relative to a constant background noise). The dynamic faces were computer-generated avatars of monkeys (Fig. 1 B and C) (3, 33, 34).

Fig. 1.

Stimuli and task for monkeys. (A) Waveform and spectrogram of coo vocalizations detected by the monkeys. The x-axes of both plots depict time in seconds; the y-axis of the spectrogram depicts spectral frequency in kHz. (B) Images of the two monkey avatars at the point of maximal mouth opening for the largest SNR. (C) Images of one of the monkey avatars showing maximal mouth opening for three different SNRs: high, medium, and low. (D) Task structure for monkeys. An avatar face was always on the screen. AV, V, and A stimuli were presented randomly with an interstimulus interval of 1–3 s drawn from a uniform distribution.

The task was structured so that an avatar face appeared continuously on the screen (Fig. 1D). We used two monkey avatars and two coos to ensure that behavioral and neural effects of integration generalized across exemplars. In the V condition the avatar silently produced a coo facial expression. In the A condition the vocalization normally paired with the other avatar (which is not on the screen) was presented with the static face of the current avatar. Finally, in the AV condition the avatar moved its mouth 85 ms before the corresponding vocalization [consistent with natural intervals between seeing lip motion and hearing a voice (22, 27)] and with an aperture in accordance with the intensity of the vocalization (Fig. 1C). Each condition (V, A, or AV) was presented after a variable interval of 1–3 s. Subjects indicated detection of an event by pressing a lever within 2 s following the event onset. Lever presses outside this window were classified as false alarms and led to time-outs of 3–5 s. False alarms comprised ∼15% of all trials. To switch between the identity of the avatar and the coo call used for the A condition, we used a block design. At the end of every block (typically 60 trials: 7 trials for each SNR of the AV and A conditions and 6 trials for each SNR of the V condition), a brief pause (∼12 s) was imposed, followed by the start of a new block in which the avatar face and the identity of the coo used for the A condition were switched.

While the monkeys performed this task, we recorded the spiking activity from 249 lateral-belt auditory cortical sites over 67 experimental sessions. Spiking activity typically was measured from single neurons and very occasionally from clusters of two or three neurons; in an abundance of caution, we refer to all spiking activity as coming from “cortical sites.” Of these 249 auditory-responsive cortical sites, 149 (60%) responded to vocalizations with robust spiking activity (109 and 40 sites from monkeys 1 and 2, respectively). Behavioral and neural results were qualitatively similar in the two monkeys, and the different coo calls did not elicit any meaningful differences. As such, the data shown below are pooled across the two monkeys and across the two coo call exemplars.

Monkeys Detect AV Vocalizations Better and Faster than A or V Vocalizations.

Like humans, monkeys were faster and more accurate when detecting vocalizations accompanied by visible mouth motion. These multisensory benefits varied as a function of the SNR (Fig. 2). Behavioral performance using the same task with monkey and human subjects was characterized extensively and modeled in a recent study (3); we will be brief here. Fig. 2 shows the average accuracy in performance and reaction times (RT) of the monkeys during vocal detection. We used an ANOVA to examine effects of stimulus modality and SNR. For both accuracy and RT data, the Mauchly test of sphericity, which tests whether the variances of the differences between conditions are homogeneous, was significant (P < 0.05 for both accuracy and RT); all degrees of freedom were therefore Greenhouse–Geisser corrected. Post hoc pairwise t tests were Bonferroni corrected for multiple comparisons.
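
As an illustration of the correction, the Greenhouse–Geisser epsilon can be computed directly from the covariance of the repeated measures. The following is a minimal Python sketch (our analyses were run in MATLAB; the session matrix below is simulated, not the actual data):

```python
import numpy as np

def greenhouse_geisser_epsilon(data):
    """Greenhouse-Geisser epsilon for a subjects x conditions matrix.

    data: (n_subjects, k) array of repeated measures. Multiply both ANOVA
    degrees of freedom by the returned epsilon.
    """
    n, k = data.shape
    S = np.cov(data, rowvar=False)          # k x k covariance of conditions
    C = np.eye(k) - np.ones((k, k)) / k     # centering matrix
    S_dc = C @ S @ C                        # double-centered covariance
    return np.trace(S_dc) ** 2 / ((k - 1) * np.sum(S_dc ** 2))

# Hypothetical example: 67 sessions x 3 SNR levels of mean RT (ms).
rng = np.random.default_rng(0)
rt = 400 + rng.normal(0, 30, size=(67, 3)) + np.array([0.0, 40.0, 120.0])
eps = greenhouse_geisser_epsilon(rt)
print(f"epsilon = {eps:.3f}, corrected dfs = ({2 * eps:.2f}, {132 * eps:.2f})")
```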

Fig. 2.

Monkeys detect AV vocalizations better and faster than A or V vocalizations. Average accuracy (A) and RT (B) across all sessions (n = 67) pooled over both monkeys for the three different SNRs for the unisensory and multisensory conditions. Error bars denote SE of mean across sessions. The x-axes in both A and B denote the SNR. The y-axis in A denotes percent accuracy. The y-axis in B denotes RT in milliseconds.

For accuracy of detection, we observed main effects of modality [F(1.64, 108) = 141.58, P < 0.001] and SNR [F(1.07, 70) = 43.04, P < 0.001] and a significant interaction [F(2.52, 166.15) = 106.32, P < 0.001] (Fig. 2A). For SNR, both linear [F(1, 66) = 51.37, P < 0.001] and quadratic [F(1, 66) = 20.23, P < 0.001] trends were significant. Post hoc comparisons revealed that AV detection was more accurate than A or V detection (AV vs. V, P < 0.05; AV vs. A, P < 0.05). The significant interaction between modality and SNR suggested that effects of modality were different for different SNR levels. For example, at the highest SNR, detection in the AV condition was more accurate than in the V condition (P < 0.05) but was not more accurate than detection in the A condition (P > 0.05). For the medium SNR, detection was more accurate in the AV condition than in either the V (P < 0.05) or the A (P < 0.05) conditions. Finally, for the lowest SNR, the pattern of results was opposite that observed for the highest SNR: detection was more accurate in the AV condition than in the A condition (P < 0.05) but was not more accurate than in the V condition (P > 0.05).

We observed a similar pattern for RT (Fig. 2B): significant main effects of modality [F(1.56, 103.25) = 198.39, P < 0.001] and SNR [F(1.69, 111.83) = 468.59, P < 0.001] and a significant interaction [F(2.77, 182.80) = 234.95, P < 0.001]. For RT as a function of SNR, both linear [F(1, 66) = 666.8, P < 0.001] and quadratic [F(1, 66) = 82.28, P < 0.001] trends were significant. Post hoc comparisons revealed that RTs were faster in the AV condition than in either the A condition (P < 0.05) or the V condition (P < 0.05). This result was true for all SNRs (all AV vs. V and A comparisons were P < 0.05). These results suggest that the benefit of integration is to speed up RTs to AV stimuli compared with RTs to either V or A stimuli. Importantly, the benefit of AV integration is apparent even at the highest SNR—RTs are nearly 29 ms faster for AV vocalizations than for A vocalizations.

In summary, monkeys show two behavioral patterns during the detection of vocalizations in noise. First, as the SNR decreases, monkeys are poorer and slower at detecting vocalizations. Second, under these conditions, monkeys take advantage of the dynamic face to speed up and enhance their detection of vocalizations. Our previous study, focusing purely on the behavioral performance during this task, revealed that the monkeys truly integrate the two modalities (3). Specifically, we showed that the results are not explained by adopting a race model, at least for the high and medium SNRs, and we replicated this result in the current study (Fig. S1). Instead, both modalities are used together to facilitate behavior. Importantly, as both accuracy and RT data show, our data are not easily explained by adopting the principle of inverse effectiveness. Instead, our benefit profiles are nonmonotonic, suggesting that AV behavior at different SNRs is a relative combination of the strengths of the constituent auditory and visual signals (3). At high SNRs, the auditory signal is dominant but still is influenced by vision. In contrast, at the lowest SNRs, the RTs are driven predominantly by vision, with audition providing a weak influence. Thus, the monkeys’ performance suggests that any integrative processes in the auditory cortex will likely be more readily apparent at the highest SNR. To investigate the neural correlates of these behavioral patterns in the auditory cortex, we began by first examining neural responses in the A condition.

Vocalization-Induced Firing Rate Responses and Latencies Covary with RTs.

Even for the lowest SNR, our detection accuracies were high (mean = 67.44%, SD = 14.14%) (Fig. 2A). Therefore, we did not have enough error trials to perform traditional analyses relating spiking activity to accuracy. Our analyses here focus on relating firing rates and neural response latencies to RT as a function of changing SNRs. Traditionally, changes in sensory cortical firing rates, representing sensory evidence, are thought to be integrated by downstream structures, where they are summed to a threshold and trigger behavior (35, 36). Changes in neural response latency also have been suggested as being important for mediating the encoding and detection of sensory stimuli (37–40). Both are putative mechanisms for RTs that change as a function of the SNR.

Fig. 3A shows spiking activity from three typical cortical sites in response to the A condition. In each plot, the three different SNRs are shown. For these sites, decreases in SNR led to slower and smaller spiking responses. Across the population, a majority of sites showed this pattern. Fig. 3B shows the median latency across cortical sites as a function of SNR. A decrease in SNR increased the response latency [F(1.61, 238.31) = 92.98, P < 0.001], mimicking the effect of SNR on RTs. Fig. 3C shows a log–log scatter plot of neural response latency vs. RT across all sites. In general, longer latencies are accompanied by slower RTs (Spearman’s r = 0.39, P < 0.001). A regression between log(RT) and log(latency) explained more variance (15%) than a regression between RT and latency (12% of variance). Overall, in monkeys actively detecting vocalizations, decreases in the SNR induce both slower RTs and slower responses in the auditory cortex.
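
A minimal sketch of this latency–RT analysis, in Python with simulated stand-in arrays (the real per-site latencies and RTs are not reproduced here):

```python
import numpy as np
from scipy import stats

# Hypothetical per-site values standing in for the measured latencies and RTs.
rng = np.random.default_rng(1)
latency = rng.uniform(40, 150, size=149)                # neural latency (ms)
rt = 300 + 2.0 * latency + rng.normal(0, 60, size=149)  # reaction time (ms)

rho, p = stats.spearmanr(latency, rt)
print(f"Spearman r = {rho:.2f}, p = {p:.3g}")

# Compare variance explained by a log-log fit vs. a linear fit.
for name, x, y in [("log-log", np.log(latency), np.log(rt)),
                   ("linear ", latency, rt)]:
    res = stats.linregress(x, y)
    print(f"{name}: R^2 = {res.rvalue ** 2:.3f}")
```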

Fig. 3.

Spiking responses in the auditory cortex decrease in magnitude and increase in latency as SNR decreases. (A) (Upper) PSTHs from three different cortical sites in the belt auditory cortex in response to the three different SNRs of coo 1. (Lower) Spike rasters for the corresponding PSTHs. The x-axes in all panels denote time in milliseconds; zero is vocalization onset. The y-axes depict the firing rate in spikes/s. PSTHs were obtained by convolving spike trains with an exponential kernel with a 1-ms growth time and a 20-ms decay time. Shaded regions denote SEM. (B) Median latency across all cortical sites for the different SNRs. The x-axis depicts SNR categories; the y-axis depicts latency in milliseconds. (C) Log–log scatter plot of RT (in milliseconds) vs. response latency (in milliseconds) across all cortical sites. (D) Response magnitude across all cortical sites for the different SNRs. The x-axis depicts SNR categories; the y-axis depicts response magnitude in SD units. (E) Log–log scatter plot of RT (in milliseconds) vs. peak response magnitude (in SD units) across all cortical sites. To normalize variance across sites in this scatter plot, the peak response magnitude for each SNR at a site is expressed as the deviation from the mean of the peak response across all SNRs for that site. (F) Proportion of cortical sites showing different patterns of level tuning. The x-axis depicts the different categories of tuning possible with three SNRs; the y-axis depicts the proportion of sites. (G) Response latency for monotonic and nonmonotonic response groups as a function of SNR. (H) Average correlation over all cortical sites at the highest SNR between binned firing rates and RT as a function of time after stimulus onset. The x-axis depicts time in milliseconds; the y-axis depicts the Spearman correlation.

Fig. 3D shows the normalized peak firing rate in the period 0–300 ms after stimulus onset as a function of SNR. On average, as the SNR decreased, the peak firing rate decreased [F(1.64, 242.94) = 89.25, P < 0.001]. Fig. 3E shows the log–log scatter plot of the peak firing rate versus RT for the population of cortical sites. Increases in the RT are accompanied by decreases in the peak magnitude of responses in the auditory cortex (Spearman’s r = −0.63, P < 0.001). Nevertheless, although the average peak firing rate decreased with decreasing SNR (a monotonic relationship), there was considerable heterogeneity in the firing rate response profiles across both poststimulus time and cortical sites (Fig. S2 A–C). Fig. 3F shows that, across the population, the majority of sites (58%) had monotonically decreasing responses; that is, decreases in the SNR led to decreases in the magnitude of the peak response. A very small fraction (5%) showed the opposite pattern, monotonically increasing their peak responses with decreasing SNR. Finally, an intermediate proportion of sites (37%) showed nonmonotonic responses. These sites had no rank ordering of their peak response relative to SNR.
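
The classification of sites reduces to a rank-ordering check on the three peak responses; a minimal sketch, with the reported population percentages noted in comments:

```python
def classify_tuning(peaks):
    """Classify one site's peak responses at (high, medium, low) SNR."""
    high, med, low = peaks
    if high > med > low:
        return "monotonic decreasing"  # 58% of sites in this data set
    if high < med < low:
        return "monotonic increasing"  # 5% of sites
    return "nonmonotonic"              # 37% of sites
```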

Although the relationship (i.e., monotonic versus nonmonotonic) between SNR and peak firing rate varied across cortical sites, we observed that trends in response latency were largely preserved. Fig. 3G shows the neural response latency of cortical sites pooled into monotonic and nonmonotonic categories. In both cases, a decrease in the SNR led to longer neural response latencies. Notably, these data reveal that, although changes in the SNR could give rise to differential peak firing rates and temporal modulations of firing rates, changes in neural response latency were consistent and thus may be an important mediator of RT during vocal detection. If so, there should be trial-by-trial relationships between the neural responses and RT. To address this possibility, we binned firing rates and correlated them to RT. Fig. 3H shows 50-ms bins stepped by 10 ms; other bin sizes gave similar results, as shown in Fig. S2 D and E. To avoid spurious correlations between firing rates and RT caused by the relationship between the SNR and the firing rate, we performed this analysis at each level of SNR and then averaged them across all SNRs. Fig. 3H shows the average correlation across cortical sites between spiking activity and RT for the highest SNR. The baseline correlation between spiking activity and RT was subtracted for each site. There are two negative peaks of correlation. The first, smaller peak occurred immediately after auditory onset (∼70 ms, r = −0.054); the second, larger peak appeared at 210 ms (r = −0.078). Both correlations were modest but statistically significant [70 ms: t(148) = −4.0, P < 0.05; 210 ms: t(148) = −5.25, P < 0.05]. When averaged across all SNRs, the correlation was slightly lower (70 ms: r = −0.03; 210 ms: r = −0.042) but still was significant [70 ms: t(148) = −3.13, P < 0.05; 210 ms: t(148) = −4.20, P < 0.05].

In light of this pattern with firing rates, we also tested whether trial-by-trial variations in neural response latency covaried with trial-by-trial variations in the RT. To do so, we first split the trials for each SNR and for each cortical site according to the median RT. We then asked whether the median latency across sites differed between the fast and slow RTs. For both the high and medium (but not low) SNRs, fast RTs were accompanied by significantly faster neural response latencies than slow RTs (sign test for fast vs. slow RTs: high: P = 0.005; medium: P = 0.027; low: P = 0.677). Together with the analysis of firing rate, our data suggest that trial-by-trial fluctuations in auditory cortical neural responses covary with the trial-by-trial fluctuations in behavior as indexed by the RT.
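
A sketch of the median-split analysis, using scipy's exact binomial test as the sign test (the per-site latency estimates are assumed to have been computed as described in Methods):

```python
import numpy as np
from scipy import stats

def latency_sign_test(latency_fast, latency_slow):
    """Sign test: are per-site latencies shorter on fast-RT trials?

    latency_fast / latency_slow: per-site latency estimates (ms) from the
    trials below / above each site's median RT.
    """
    diff = np.asarray(latency_fast) - np.asarray(latency_slow)
    diff = diff[diff != 0]          # ties are dropped in a sign test
    n_neg = int(np.sum(diff < 0))   # sites with faster latency on fast trials
    res = stats.binomtest(n_neg, n=diff.size, p=0.5)
    return n_neg, diff.size, res.pvalue
```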

Dynamic Faces Uniformly Speed Up the Latency of the Neural Responses.

Behaviorally, when a vocalization was accompanied by a corresponding dynamic face (i.e., in the AV condition), RTs were faster than in the A condition (Fig. 2B). We investigated if there were any auditory cortical correlates of this behavioral effect. Based on the trends observed from relating A responses to RT, the effects of AV integration in the auditory cortex could manifest themselves as (i) changes in neural response latency, (ii) increases or decreases in peak firing rate, or (iii) some combination of both.

Fig. 4A shows three examples of AV responses relative to unisensory ones in the high SNR context (in the same cortical sites as in Fig. 3A). Spiking responses are faster (leftward shift in the red trace) in the AV condition. In many cases, responses were >10 ms faster. These plots also show that differences in firing rate between AV and A responses follow a complex pattern of suppression and enhancement (arrows in Fig. 4A). In some cortical sites responses to AV stimuli were greater than responses to A stimuli; in other sites responses to AV stimuli were smaller than responses to A stimuli. Fig. 4B shows the response profile of a single site across the different SNRs. As in a previous study (21), the V condition did not elicit spiking responses in auditory cortical neurons: The average response in the 200-ms period after the onset of mouth motion was not significantly different from baseline (paired t test, P > 0.05). Thus, for clarity, V conditions are not shown. For this cortical site (and many others), visual speedup of response latencies was apparent at all SNRs.

Fig. 4.

The auditory cortex responds faster with the addition of mouth motion. (A) PSTHs (Upper) and rasters (Lower) of three cortical sites in the auditory cortex to AV, A, and V components of coo 1 at the highest SNR. The x-axes depict time in milliseconds; the y-axes indicate the firing rate in spikes/s. The solid line indicates the onset of the auditory stimulus. The dashed line indicates the onset of the visual stimulus. Arrows highlight a region showing enhancement or suppression of the response to the AV stimulus. Blue shading denotes the time period when only visual input was present. (B) Responses of the cortical site shown in the left panel of A for the AV and A conditions for coo 1 at every SNR. The x-axis depicts the time in milliseconds; the y-axis of the upper panel shows response magnitude in spikes/s. The lower panel shows the spike rasters at the three different SNRs. (C) Probability density of spiking response latencies for the spiking responses to the AV and A stimuli at the highest SNR. The x-axis depicts the latency of the response in milliseconds; the y-axis depicts the probability of observing that latency. (D) Median latency of response to AV and A stimuli as a function of SNR for all cortical sites. Error bars denote bootstrap SE of the median. (E) Log–log scatter plot of RT (in milliseconds) vs. response latency (in milliseconds) across all cortical sites. (F) Average correlation over all cortical sites at the highest SNR between binned firing rates and RT as a function of time after stimulus onset for A and AV conditions. The x-axis depicts the time in milliseconds; the y-axis depicts the Spearman correlation.

This pattern in which responses to AV stimuli were faster than responses to A stimuli was consistent in the population of cortical sites and for all three SNRs. Fig. 4C shows the probability density for A and AV spiking response latencies for the highest SNR. The median latencies for AV and A vocalizations as a function of SNR are shown in Fig. 4D. All differences in median latency for AV vs. A responses were negative (sign test, all SNRs, P < 0.001); speedups for AV vs. A were on the order of 10 ms. Interestingly, the AV benefit did not increase systematically according to the SNR (Fig. 4D). That is, even though neural response latency generally varied as a function of the SNR (becoming slower when the SNR decreased) in the A condition (Fig. 3 A and B), the latency benefit added by the dynamic face was approximately uniform. This result implies that any downstream structure receiving the spiking activity from the auditory cortex would, in principle, be able to execute its computations faster. Thus, if latency is involved in mediating behavioral RTs during detection, then a speedup of response latency will lead to faster responses for AV stimuli than for A stimuli. We also performed a second analysis using cross-correlations of the peristimulus time histograms (PSTHs) to identify any significant differences in response onset between the responses to AV and A stimuli. Again, for all three SNRs, responses to AV stimuli were faster than responses to A stimuli (median speedup was on the order of 7 ms for the highest SNRs; sign test vs. 0, P < 0.05).

Although we did not expect activity in the auditory cortex to account for the entirety of the RT benefits mediated by the dynamic face, we wanted to see if auditory cortical activity could account for at least a portion of the benefit. For the AV condition, there remained a correlation between response latency and RT across SNRs (Spearman’s r = 0.31, P < 0.001) (Fig. 4E). However, this correlation is not as robust as the correlation seen in the A condition (Fig. 3C). Much of this discrepancy may arise from the much faster AV RTs for the lowest SNRs, conditions in which the contribution of visual cues to task performance is much larger than that of auditory cues (Fig. 2). In the AV condition (as in the A condition reported above), we again found a significant trial-by-trial correlation with firing rates [∼70 ms: r = −0.056, t(148) = −3.51, P < 0.05; ∼210 ms: r = −0.031, t(148) = −3.00, P < 0.05] (Fig. 4F).

With regard to neural response latencies, again, for the highest SNR of the AV condition, there was a significant speedup for fast vs. slow RTs (sign test, P = 3.65 × 10−4). For the medium SNR, there was a trend toward a difference between fast and slow RTs (sign test, P = 0.07). Finally, for the lowest SNR, just as in the A condition, there was no difference between the fast and slow RTs (sign test, P = 0.28). Thus, at least for the highest SNR, the speedups in neural latency mediated by dynamic faces correspond to the faster RTs in the AV condition.

It is possible that these speedups in neural response latency that we observed are influenced by motor or somatosensory processing effects (41, 42). However, such effects would manifest themselves much later in poststimulus time, around the lever-pressing event. Thus, late timing is unlikely to influence our observed effects on neural response latency that occur quite soon (∼50 ms for the highest SNR) after the onset of the vocalization and well before the lever press. To make certain, we repeated the latency analysis including only trials with RTs >400 ms after stimulus onset. Again, we observed similar latency speedups in responses to AV stimuli as compared with responses to A stimuli (sign test, P < 0.05).

Dynamic Faces Do Not Systematically Enhance or Suppress Firing Rates.

It has been assumed, for auditory as well as for association cortices, that differences in firing rates recorded in passively listening monkeys reflected processes that could ultimately be linked to multisensory behavioral performance (21–27, 43, 44). The assumption behind these studies is that if, for example, visual input onto an auditory cortical neuron increases firing rates, a downstream structure summing spikes from the auditory cortex would reach threshold faster. We tested the validity of such an assumption regarding multisensory detection.

We began by examining whether there were differences in the baseline-normalized peak firing rate for the responses to AV vs. A stimuli. We chose the period 0–300 ms after stimulus onset across the population of cortical sites. For each site, we estimated the peak firing rate in this period for the AV and A conditions. (Again, the V condition does not elicit changes in spiking activity and so is not included.) Fig. 5A shows the probability density of the baseline-normalized peak firing rate in the 0- to 300-ms period for responses to AV and A stimuli at the highest SNR. Fig. 5B shows the difference in peak firing rate between AV and A stimuli at the highest SNR. Fig. 5C shows the average peak firing rates for AV and A conditions as a function of SNR. There is no significant change in the magnitude of the peak firing rate across the population [F(1,148) = 1.429, P > 0.05]. For no SNR condition were there significant differences between peak responses to AV and A stimuli. Thus, there were no systematic increases or decreases in the peak firing rate during the 0- to 300-ms period for AV compared with A stimuli during vocal detection. This result does not mean that the AV and A spiking responses were identical to one another on an individual site-by-site basis. Responses to AV stimuli are enhanced relative to responses to A stimuli in some sites and are suppressed in others. However, in contrast to latency, there is no consistent population-level enhancement or suppression of firing rates in response to AV stimuli as compared with A stimuli.

Fig. 5.

No systematic changes in the magnitude or variability of the firing rate were observed with the addition of mouth motion. (A) Probability density of peak magnitudes for the spiking responses in the AV and A conditions. The x-axis depicts the change in response magnitude in SD units; the y-axis depicts the probability of observing that response magnitude. (B) Distribution of directional changes in response magnitudes with the addition of mouth motion. The x-axis depicts the difference in magnitude between the AV and A conditions in SD units; the y-axis depicts the number of cortical sites. (C) Magnitudes for AV and A responses as a function of SNR for all cortical sites. The x-axis depicts the SNR; the y-axis depicts the response magnitudes in SD units. (D) Fano factor analysis of response firing rates in the AV and A conditions. The x-axis depicts the different conditions as a function of SNR; the y-axis depicts the magnitude of the Fano factor. For C and D, red bars denote AV, green bars denote A, and error bars denote bootstrapped SEs of the median.

We next tested for three other potential firing-rate effects. First, we examined average (as opposed to peak) firing rates in the period 0–300 ms after auditory onset. Again, we observed no significant differences between responses to AV and A stimuli across the population of cortical sites [F(1,148) = 0.69, P > 0.05]. Thus, no differences between AV and A vocalizations exist for either the average or the peak response. Second, we examined whether there is an initial period during which responses to AV stimuli are consistently enhanced relative to the responses to A stimuli, as reported for the superior colliculus (45). To do so, we tested whether there was a 100-ms period after auditory onset during which the responses to AV stimuli were consistently larger than responses to A stimuli. One problem in this evaluation is that, because of the differences in latency, averaging over a time epoch spuriously attributes different numbers of spikes to the different conditions. Simulations comparing Poisson processes with the same fixed rate but different onset latencies illustrate this problem (Fig. S3 A and B). To account for this latency confound, we first used cross-correlation to align the spiking activity from the AV and A conditions. We then averaged the activity in different epochs after the onset of the auditory stimulus and observed no difference in spiking responses between the AV and A conditions across different 100-ms epochs after onset of the stimulus [0–100 ms: F(1,148) = 1.68, P > 0.05; 100–200 ms: F(1,148) = 0.38, P > 0.05; 200–300 ms: F(1,148) = 1.16, P > 0.05].

The process of simulating firing rates revealed an interesting result (Fig. S3 A and B). Just by chance (i.e., Poisson noise), some sites had higher firing rates for the AV than for the A stimulus, while the opposite was true for other sites. As a result, the distribution of differences between AV and A peak firing rates was centered around zero (Fig. S3C). This distribution is strikingly similar to our real firing-rate data (compare Fig. 5 A and B with Fig. S3C), suggesting that differences in firing rate between conditions potentially could arise by chance under our task conditions.

Finally, we tested whether there was less variability in the firing rate in the AV condition, as reported for the auditory cortex in a passive multisensory paradigm (23). Fig. 5D shows the Fano factor, a measure of neuronal variability, as a function of the SNR for the AV and A conditions. As the SNR decreases, the Fano factor increases. However, there is no evidence that the Fano factor for the AV condition is significantly different from that for the A condition at any SNR (P > 0.05, bootstrap test).

Discussion

In both human and nonhuman primates, combining the motion of the face with the voice improves detection and speeds up RTs (3, 5, 10). We investigated spiking activity in the auditory cortex during this process. In the current study, monkeys were slower to detect vocalizations as the SNR decreased, but the addition of mouth movement resulted in faster detection. Paralleling this behavior, responses in the auditory cortex were slower and smaller in magnitude as the SNR decreased. Both response latency and the peak magnitude of the responses were correlated with RT. When a dynamic face accompanied the vocalization, auditory cortical responses were faster and were sped up in a manner that was uniform across the different SNR conditions; the magnitude and variability of the firing rate did not change systematically.

We found that dynamic faces sped up the onset of auditory spiking responses. Faster spiking responses imply that downstream structures that receive inputs from the auditory cortex, such as the superior temporal sulcus (21, 26, 27, 46) and the prefrontal cortex (25, 43, 44), potentially could trigger behavioral responses faster. Thus, the faster responses to AV stimuli than to A stimuli could mediate faster detection of events. The same or similar processes may occur for faster discrimination. Event-related potential (ERP) and magnetoencephalography studies of presumptive auditory cortex during audiovisual speech discrimination consistently report that visual speech speeds up neural responses to auditory speech (4, 47, 48). The similarities between neural responses for AV detection and discrimination suggest that common mechanisms may mediate face/voice integration in humans and monkeys (3).

Faster latencies for responses to AV stimuli than for A stimuli could benefit behavior via two principal mechanisms. Naturally, any downstream structure involved in a motor output will respond faster principally because sensory inputs are arriving faster. Besides speeding up the responses of a downstream structure, latency itself also may be an important mode of coding sensory stimuli (37–39). Studies across different sensory systems have suggested that response latency may indicate the intensity of sensory stimuli. In particular, first-spike timing and other temporal patterns, when referenced to a neuronal population reference frame, theoretically provide additional information about stimuli (49). Furthermore, in the primary visual cortex, neural response latency correlates well with RT in a detection task (40). Thus, if response latency is an additional channel by which behavior is modulated, our demonstration of faster responses in the auditory cortex to AV stimuli than to A stimuli suggests that these latency shifts are likely important for the behavioral benefits observed during face/voice integration.

We observed that speedups in neural response latency for responses to AV vs. A stimuli were correlated with the behavior for the high and medium SNR conditions, which also were the SNR conditions in which we observed violations of the race model predictions (Fig. S1). This similarity suggests that behavioral benefits observed for AV integration result in part from the speedups we observed in the onsets of neural response in the auditory cortex. This observation is consistent with ERP studies showing that race model violations are more likely to occur at fast RTs than at slow RTs and, correspondingly, that ERP differences in latency are significant only for the faster RTs (50, 51). However, it is unlikely that the auditory cortex is the sole mediator of behavior during this task. The profile of multisensory behavioral benefits in our task was not monotonic; instead it has a maximum at intermediate levels of SNR (3). Moreover, the neural latency benefit in the auditory cortex was uniform across all SNRs. Thus, AV detection of voices is, no doubt, the result of interactions between several different brain areas mediating the behavioral output, with the auditory cortex as an important node. Although robust interactions between areas such as the auditory cortex and the upper bank of the superior temporal sulcus were reported in passive viewing/listening paradigms (21, 46), how these areas interact to mediate an audiovisual behavior remains unexplored.

Other than latency, we observed no systematic differences in the firing rates. We measured variability and peak and average firing rates. In no case did we observe a systematic difference in these measures between responses to AV and A stimuli. The lack of changes in firing rate magnitude was surprising but has at least two possible explanations. First, the effects of response magnitude in the auditory cortex may be sensitive to the stimulus onset asynchrony (SOA) of the visual and auditory components of the sensory event (52). In our experiment, we needed to limit the number of trials per condition and used only one SOA, with facial motion preceding vocal output by 85 ms. However, although facial motion always leads vocal output by several tens of milliseconds in both speech (2) and monkey vocalizations (22, 27), there is a large amount of utterance-to-utterance variability in this temporal parameter. Thus, the lack of consistency in this parameter in primate vocal signals precludes it from being a viable cue for face/voice integration. A second possibility is that differences in stimulus presentation may account for the absence of firing rate-related effects. Prior studies adopted a design involving a transition from fixation to the onset of a video (21, 22, 24–27, 43, 44). This design results in visual transients that could have a strong (but still subthreshold) influence on the auditory cortex. In the current study, a face always was on the screen; thus any visual transients and their effects on the auditory cortex were eliminated.

Spiking activity was not modulated by facial motion alone, implying a subthreshold mechanism for latency facilitation in the auditory cortex. It has been reported that the frequency selectivity of auditory response latencies arises from a combination of sharply tuned intracortical excitation at preferred frequencies and broad inhibition at nonpreferred frequencies (the E–I network) (53). In our task, the vocalization is heard 85 ms after the visual input (facial motion) is seen; thus any synaptic input resulting from the auditory component would have a lag of 20–40 ms and would arrive when changes in the auditory cortical network induced by mouth motion already have occurred (54). In this scheme, the resulting auditory cortical response would combine with the E–I network reorganized by the visual input to speed up spiking activity. One form of reorganization may be a phase-reset of the ongoing cortical activity (52). Thus, during AV conditions, the onset of mouth motion before the vocalization could lead to a phase reset of ongoing oscillations in the auditory cortex (55).

In summary, during vocal detection, the effects of visual input on the auditory cortex are manifested by the speeding up of neural responses and not by changes in the magnitude or variability of firing rates. Although auditory cortical firing rates certainly play important roles in other behaviors (56), the effects on neural response latency have two related consequences. First, any downstream structure performing a transformation on the auditory cortical responses will be able to do so faster for responses to AV stimuli than for responses to A stimuli. Second, if latency is a means of encoding intensity, thereby producing slower or faster RTs in our task, then faster latencies could speed up behavioral responses.

Methods

Subjects and Surgery.

Two adult male long-tailed macaques (Macaca fascicularis) were used in the experiments. For each monkey, we used preoperative whole-head MRI (3-T magnet, 500-μm slices) to identify the stereotaxic coordinates of the auditory cortex. The monkeys underwent sterile surgery for the implantation of posts for head fixation and a recording chamber (19-mm diameter; Crist Instruments). The chamber was vertically oriented to allow an approach to the superior surface of the superior temporal gyrus (57, 58). All experiments were performed in compliance with the guidelines of the Princeton University Institutional Animal Care and Use Committee.

Behavioral Apparatus, Stimuli, and Task.

Experiments were conducted in a sound-attenuating RF enclosure. The monkey sat in a primate chair fixed 74 cm opposite a 19-inch CRT color monitor with a 1,280 × 1,024 (25° × 20°) screen resolution and a 75-Hz refresh rate. All stimuli were centrally located on the screen and occupied a total area (including blank regions) of 640 × 653 pixels. Both monkeys used the left hand to press a lever (ENV-610M; Med Associates) placed at the center of the chair. Stimulus presentation and data collection were performed using the Presentation program (Neurobehavioral Systems).

Stimuli.

We used coo calls from two macaques unknown to the monkey subjects as the auditory components of vocalizations. The auditory vocalizations were 400 ms long and were normalized in amplitude. The visual components of the vocalizations were 400-ms-long videos of synthetic monkey agents articulating a coo vocalization. The entire monkey avatar was 12.25° wide and 8.55° high. The face itself was 5.25° wide between the eyes and 5.72° tall from the top of the head to the bottom of the chin. The AV stimuli were generated by presenting both V and A components with an 85-ms lag between the onset of mouth opening and the vocalization; this time lag is within the natural range for macaque monkey vocalizations (27). We used two coo calls to avoid idiosyncratic effects that might arise from a single stimulus exemplar and because monkeys are sensitive to individual identity embedded in coo calls. The two coos were paired with two different monkey avatars, as it would be odd to see an individual face not move when hearing its associated coo call.

We chose to use computer-generated avatars for the following reasons. First, the use of avatars allowed us to restrict facial motion to the mouth region. Second, the avatars allowed us to keep a static face constantly on the screen, ensuring we did not induce sudden visual-onset responses relative to a blank background. Third, they ensured constant lighting and background. Finally, as we modulated the SNR of the auditory component of the vocalization, the avatars allowed us to parameterize the size of the mouth opening while keeping eye and head positions constant (Fig. 1C). Greater sound intensity is coupled to larger mouth openings by the dynamic face, consistent with natural vocal signals.

Task Structure.

Monkeys were trained to detect two coo vocalizations in a free-response/redundant-signals task (3). Detection was indicated by a lever press. In our task, the redundant targets were the motion of the mouth and the sound of the coo vocalization. Coo vocalizations were presented at three different loudness levels (53, 68, and 85 dB) and at random intervals in ∼63 dB spectrally pink background noise. Each vocalization was paired with a synthetic monkey face in which the mouth opened in a manner concordant with the loudness of the vocalizations. During every block, a face was always visible but moved only for the matched vocalization. The identity of the avatar was counterbalanced across blocks (typically 60 trials: 7 trials for each SNR of the AV and A conditions and 6 trials for each SNR of the V condition) with an interblock interval of 10–12 s. In 11 of the 67 sessions, we chose longer trial blocks (132 trials). The stimulus events were presented at an interstimulus interval drawn from a uniform distribution between 1 and 3 s.

In a free-response paradigm, one can define hits, misses, and false alarms. A response within a window starting 135 ms and ending 2,000 ms after the onset of the stimulus was defined as a hit and led to a juice reward. A failure to respond within this 2-s window was classified as a miss (28). Responses outside this window led to time-outs of 3–5.5 s and were defined as false alarms. The monkeys had to wait the entire duration of this time-out period before a new stimulus was presented. The hit rate was defined as the ratio of hits to hits plus misses.
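
The classification logic amounts to a simple window test on press times; a minimal sketch (timing constants taken from the text):

```python
def classify_response(press_time_ms):
    """Classify a lever press relative to the most recent stimulus onset.

    press_time_ms: press time in ms relative to stimulus onset, or None if
    no press occurred within the response window.
    """
    if press_time_ms is None:
        return "miss"
    if 135 <= press_time_ms <= 2000:
        return "hit"          # rewarded with juice
    return "false alarm"      # leads to a 3- to 5.5-s time-out
```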

Task design justification.

In the A condition, a vocalization from one monkey is heard while the static avatar face of another is visible. The incongruence between the auditory stimulus and the motionless face presumably could slow down RT and neural responses in the A condition. However, this concern is mitigated by the fact that this condition is what monkeys typically deal with in group settings. That is, our task mimics audiovisual communication: Faces in noisy social settings do not appear and disappear, and monkeys, like humans, recognize identity cues in vocalizations and match them to faces (59–61). Thus, it is not odd to hear one individual’s voice while seeing another individual’s static face. Importantly, our design incorporating a static face in the A condition also helps us avoid two other problems. First, if there was a blank screen instead of a static face in the A condition, then AV benefits could be attributed to differences in overall attention or arousal upon seeing the face only in the AV condition. Second, the abrupt appearance of a face in the AV condition also would contaminate interpretations of RT benefits for AV compared with A stimuli. We could not be sure if the RT benefits resulted from the integration of facial motion with the sound or from the integration of the sound with the sudden onset of the face.

Eye movements.

For three reasons, eye fixation was not required for monkeys to perform this task. The first is that, under natural communicative settings, both monkeys and human subjects use eye movements to pick up important cues (62, 63). Second, controlling eye movements was precluded by the design of our task, which involves continuous presentation of the face and randomization of the time at which facial movements or sound occur. Third, studies have demonstrated that enforcing fixation tends to suppress responses to both auditory (64) and multisensory (65) stimuli. Thus, although several studies have demonstrated modest effects of eye position on multiple levels of the auditory pathway, ranging from the inferior colliculus (66) to the primary auditory cortex (67, 68), we did not (indeed, could not) take such effects into account in our study.

Reaction Times and Accuracy.

RTs were measured as the time from the onset of the stimulus to the first depression of the lever. For each SNR and condition, the accuracy was defined as the ratio of hits to hits plus misses expressed as a percentage. The false-alarm rate was defined as the number of false alarms divided by the sum of hits and false alarms. False-alarm percentages were low (∼15%).

Race Model.

It is quite common to observe that RTs to simultaneously presented multisensory targets are faster than RTs to unisensory targets. This effect usually is termed the “redundant signals effect.” One important class of explanations for the redundant signals effect is the “race model” (29). According to the race model (or a parallel first-terminating model), redundancy benefits may result not from an actual integration of visual and auditory cues but rather may be determined by the stimulus modality that is processed faster. A way to test whether this principle can explain RT data is to use the well-known race model inequality (29), stating that the cumulative RT distribution for the redundant stimuli never exceeds the sum of the RT distributions for the unisensory stimuli. That is, if $F_{AV}(t)$, $F_V(t)$, and $F_A(t)$ are the estimated cumulative distribution functions (CDFs) of the RTs for the three different modalities and

$$F_{AV}(t) \leq F_A(t) + F_V(t) \quad \text{for all } t,$$

then one cannot rule out race models as an explanation for the facilitation of RT. On the other hand, if this inequality is violated in a given dataset, then parallel processing cannot completely account for the benefits observed for multisensory stimuli; an explanation based on integration would be required. We computed the CDFs of our conditions and then computed the difference between the actual CDF of the AV condition and the CDF predicted by the race model (Fig. S1). The maximum positive point of this difference was taken as the index of violation, R. A positive value of R means that the race model is rejected. If this value is 0, then the race model is upheld. To test the violations of the race model on a single-subject basis, we compared the true value of R with one computed by a bootstrap method that performs artificial versions of our multisensory experiments (for details, see ref. 3).
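
A minimal sketch of this computation (the bootstrap test of ref. 3 is omitted; only the violation index R is shown):

```python
import numpy as np

def race_violation_index(rt_av, rt_a, rt_v, n_grid=200):
    """Miller's race-model violation index R.

    Compares the empirical CDF of AV RTs against the race bound
    F_A(t) + F_V(t); R > 0 rejects the race model.
    """
    rt_av, rt_a, rt_v = map(np.asarray, (rt_av, rt_a, rt_v))
    all_rts = np.concatenate([rt_av, rt_a, rt_v])
    t_grid = np.linspace(all_rts.min(), all_rts.max(), n_grid)

    def ecdf(x, t):
        return np.searchsorted(np.sort(x), t, side="right") / x.size

    f_av = ecdf(rt_av, t_grid)
    bound = np.minimum(ecdf(rt_a, t_grid) + ecdf(rt_v, t_grid), 1.0)
    return float(np.max(f_av - bound))
```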

Collection of Physiology Data.

Recordings were made from the left auditory cortex using standard electrophysiological techniques. We used an eight-channel microdrive (NAN Instruments) that allowed us to move multiple electrodes independently. Electrodes (FHC) were glass-coated 125-μm-thick tungsten wire with impedances between 1 and 3 MΩ (measured at 1 kHz). Grounding was achieved by connecting the microdrive to the ground provided by the head stage and by using an additional grounding wire connected to the circuitry in the preamplifier. This grounding eliminated or minimized 60-Hz line noise. All signals were acquired using the data acquisition system provided by Plexon (MAP; Plexon Instruments). Electrodes were lowered until multiunit cortical responses could be driven reliably by auditory stimuli. Search stimuli included pure tones, FM sweeps, noise bursts, clicks, and vocalizations.

Tonotopy and location in the auditory cortex.

At the end of every recording block, for every cortical site, we estimated the frequency tuning using 30 frequencies between 100 Hz and 20 kHz presented at three different intensity levels (53, 68, and 85 dB). The spectral selectivity of a cortical site was determined by measuring the maximal multiunit spike discharge rate as a function of spectral frequency. In both monkey subjects, we discerned a rough high-to-low transition of frequency selectivity in the caudal-to-rostral direction (69). We also observed a transition from simple to complex frequency selectivity when moving in the medial-to-lateral direction. Based on our stereotaxic coordinates, the spectral selectivity of our cortical sites (Fig. S4), and prior reports of spectral tuning in the macaque auditory cortex (58), our recordings were localized to the lateral belt of the auditory cortex. Specifically, based on the frequency selectivity profile (Fig. S4), we estimate that our recording sites straddled the mediolateral and anterolateral belt regions.

Spiking activity.

To extract spiking activity, we adapted previously published methods (70). For the extraction of spike times, the raw neural signal was filtered in the high-frequency range of 500–5,000 Hz using a four-pole Butterworth filter. We then used a spike detection threshold of 3.5 SDs followed by cluster selection. A spike was recognized as such only if the previous spike occurred more than 1 ms earlier. This method is excellent for detecting spiking events reflecting the activity of typically one but very occasionally two or three neurons around the tip of the electrode.
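
A simplified Python analogue of this pipeline (the cluster-selection step is omitted, and the sampling rate and the positive threshold polarity are assumptions):

```python
import numpy as np
from scipy import signal

def detect_spikes(raw, fs, threshold_sd=3.5, refractory_ms=1.0):
    """Threshold-based spike detection on a raw voltage trace.

    raw: 1-D voltage trace; fs: sampling rate in Hz.
    """
    # Band-pass 500-5,000 Hz with a four-pole Butterworth filter.
    sos = signal.butter(4, [500, 5000], btype="bandpass", fs=fs, output="sos")
    filtered = signal.sosfiltfilt(sos, raw)
    thresh = threshold_sd * np.std(filtered)
    # Upward threshold crossings.
    crossings = np.flatnonzero((filtered[1:] > thresh) & (filtered[:-1] <= thresh))
    # Keep a crossing only if the previous accepted spike is > 1 ms earlier.
    refractory = int(refractory_ms * fs / 1000)
    spikes, last = [], -np.inf
    for c in crossings:
        if c - last > refractory:
            spikes.append(c)
            last = c
    return np.asarray(spikes) / fs  # spike times in seconds
```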

Data Analysis and Statistics.

All analyses were carried out in MATLAB. Many details have been published elsewhere (21, 71). For all analyses, only RTs >250 ms were considered. We analyzed only the time period from 0–300 ms after stimulus onset to ensure that our results were not contaminated by the sound of the lever press, reward, somatosensory motor effects, or other factors, all of which have been proposed to affect activity in the auditory cortex but typically are processed many tens of milliseconds later in a trial (41, 42).

PSTHs.

We computed the PSTH by convolving the spike train with an exponential kernel typically used in studies relating RT to neural activity (72). We used a decay time of 20 ms. Error bars were estimated by bootstrapping over trials. We ensured that we recorded a sufficient number of trials to estimate the mean and SE of firing rates reliably.
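
A sketch of this computation, assuming spike trains stored as binary matrices at 1-ms resolution (the 1-ms growth time is taken from the Fig. 3 legend):

```python
import numpy as np

def psth(spike_trains, growth_ms=1.0, decay_ms=20.0, dt_ms=1.0):
    """Trial-averaged PSTH smoothed with a growth/decay exponential kernel.

    spike_trains: (n_trials, n_bins) array of 0/1 spike indicators sampled
    at dt_ms resolution. Returns the rate in spikes/s.
    """
    t = np.arange(0.0, 10.0 * decay_ms, dt_ms)
    kernel = (1.0 - np.exp(-t / growth_ms)) * np.exp(-t / decay_ms)
    kernel /= kernel.sum() * dt_ms / 1000.0  # unit area, so output is spikes/s
    mean_train = spike_trains.mean(axis=0)   # spikes per bin, trial-averaged
    # Causal convolution: each spike contributes a bump starting at the spike.
    return np.convolve(mean_train, kernel, mode="full")[:mean_train.size]
```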

Normalization of firing rates.

Before performing comparisons across cortical sites, the firing rate on each trial for each site was normalized into SD units relative to baseline by subtracting the mean firing rate across time in the baseline for that trial and dividing by the standard deviation of the baseline firing rate for that trial. This normalized firing rate estimate for each trial was then averaged to get an estimate of the normalized PSTH for that site and that condition. These normalized firing rates were used for all peak and average response comparisons.
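
A minimal sketch of this normalization:

```python
import numpy as np

def normalize_to_baseline(rates, baseline):
    """Normalize single-trial firing rates into SD units of the baseline.

    rates: (n_trials, n_bins) firing-rate matrix for one site and condition;
    baseline: slice selecting the pre-stimulus bins.
    """
    base = rates[:, baseline]
    mu = base.mean(axis=1, keepdims=True)  # per-trial baseline mean
    sd = base.std(axis=1, keepdims=True)   # per-trial baseline SD
    sd[sd == 0] = 1.0                      # guard against silent baselines
    z = (rates - mu) / sd
    return z.mean(axis=0)                  # normalized PSTH for this site
```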

Difference between peak AV and A firing rates.

For every site we first estimated the peak firing rate in the interval 0–300 ms after auditory onset. Then we performed a repeated-measures ANOVA across the population of cortical sites with modality and SNR as factors to identify any significant enhancement or suppression of the peak response to AV stimuli relative to the response to A stimuli. The advantage of comparing peak firing rates is that differences in latency across cortical sites will not affect the estimate of the firing rate (see Simulated firing rate responses below).
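
Assuming the peak rates are assembled into a long-format table, the repeated-measures ANOVA could be run with, for example, the pingouin package (an assumption; any repeated-measures ANOVA routine would serve, and the data below are simulated):

```python
import numpy as np
import pandas as pd
import pingouin as pg  # assumption: the pingouin package is available

# Hypothetical long-format table: one row per site x modality x SNR with the
# peak normalized firing rate (SD units) in the 0-300 ms window.
rng = np.random.default_rng(0)
offsets = {"high": 1.0, "med": 0.5, "low": 0.0}
rows = [{"site": s, "modality": m, "snr": n,
         "peak": 2.0 + offsets[n] + rng.normal(0, 0.3)}
        for s in range(20) for m in ("A", "AV") for n in offsets]
df = pd.DataFrame(rows)

# Two-way repeated-measures ANOVA with modality and SNR as within factors.
aov = pg.rm_anova(dv="peak", within=["modality", "snr"], subject="site", data=df)
print(aov)  # with these simulated data, no main effect of modality is expected
```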

Difference between average firing rate for AV and A responses.

We also tested whether the average number of spikes in the peristimulus epoch differed between the AV and A conditions, using a repeated-measures ANOVA with modality and SNR as factors to determine whether the conditions differed significantly in each epoch.

Simulated firing-rate responses.

We generated simulated firing rates using a Poisson process, drawing interspike intervals with the built-in exponential random number generator in MATLAB. As a first-order approximation of our auditory cortical responses, we assumed that the spike rate decayed as a function of time after stimulus onset. To generate responses to AV stimuli, the simulated rates from the A condition were shifted by 10 ms.
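A sketch of this simulation. The rate constants are illustrative, and drawing each interspike interval at the instantaneous rate is itself only a first-order approximation of an inhomogeneous Poisson process.

r0 = 100; rInf = 20; tauR = 0.05;     % peak rate, floor, and decay constant (assumed)
T  = 0.3;                             % simulate 300 ms after stimulus onset

t = 0; spkA = [];
while t < T
    r   = rInf + (r0 - rInf) * exp(-t / tauR);  % decaying instantaneous rate (spikes/s)
    isi = exprnd(1 / r);              % exponential ISI at the current rate
    t   = t + isi;
    if t < T, spkA(end+1) = t; end    %#ok<AGROW>
end

spkAV = spkA - 0.010;                 % "AV" train: the A train shifted by 10 ms (AV leads)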

Difference in firing rates in stimulus epochs.

To test for differences in specific stimulus epochs, we compared the PSTHs for the AV and A vocalizations. Because differences in latency can confound this comparison (Fig. S2), we first used cross-correlation to align the firing rates for responses to AV and A stimuli. We then averaged the firing rates in epochs from 0–100 ms, 100–200 ms, and 200–300 ms after stimulus onset and used a repeated-measures ANOVA with modality (AV, A) and SNR as factors, testing for main effects of modality.
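A sketch of the alignment-then-average procedure; the +/-50-ms lag bound and the placeholder PSTHs are assumptions:

tAxis  = -200:399;                    % time axis (ms), 1-ms resolution
psthA  = randn(1, numel(tAxis)) + 10; % placeholder A PSTH
psthAV = circshift(psthA, -10) + 0.5 * randn(1, numel(tAxis));  % AV leading by ~10 ms

maxLag    = 50;                       % search +/- 50 ms (assumed bound)
[c, lags] = xcorr(psthAV - mean(psthAV), psthA - mean(psthA), maxLag);
[~, iPk]  = max(c);
shift     = lags(iPk);                % lag of AV relative to A (negative = AV leads)
psthAVal  = circshift(psthAV, -shift);  % bring the AV response into register with A

epochs = [0 100; 100 200; 200 300];   % ms after stimulus onset
mA = zeros(1, 3); mAV = zeros(1, 3);
for e = 1:3
    idx    = tAxis >= epochs(e, 1) & tAxis < epochs(e, 2);
    mA(e)  = mean(psthA(idx));        % A response in this epoch
    mAV(e) = mean(psthAVal(idx));     % latency-aligned AV response in this epoch
end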

Relationship between spiking activity and RT.

To compute the relationship between spiking activity and RT on a trial-by-trial basis, we took all trials at each SNR and binned the firing rate into 50-ms bins stepped by 5 ms. For each time bin, we computed the Spearman rank correlation between the binned firing rates and the RT. The baseline level of correlation in the period from −500 to 0 ms was subtracted from the correlation time series for each site, and the corrected time series were averaged across all cortical sites. Finally, we identified the epoch in which this average differed significantly from zero and tested whether the peak of this response was significantly below zero.
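A sketch of this correlation analysis, assuming spike times are stored per trial in a cell array (the input format and placeholder data are assumptions):

nTrials = 200;
RT = 300 + 200 * rand(nTrials, 1);    % placeholder RTs (ms)
spikeTimes = arrayfun(@(k) -500 + 800 * rand(30, 1), 1:nTrials, ...
                      'UniformOutput', false);  % spike times (ms re: stimulus onset)

binW = 50; step = 5;                  % 50-ms bins stepped by 5 ms
starts = -500:step:(300 - binW);
rho = zeros(1, numel(starts));
for b = 1:numel(starts)
    cnt = cellfun(@(s) sum(s >= starts(b) & s < starts(b) + binW), spikeTimes);
    rho(b) = corr(cnt(:), RT, 'Type', 'Spearman');  % rate-RT correlation in this bin
end

base = mean(rho(starts + binW <= 0)); % baseline correlation, -500 to 0 ms
rhoCorrected = rho - base;            % corrected series, later averaged across sites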

Latency estimation.

The latency of response onset was calculated using a method described previously (72). The onset of the response was defined as the first time point at which spiking activity exceeded baseline (defined as the 1,100-ms period before the onset of the stimulus) plus 3 SDs. To be classified as a valid response and accepted into the latency estimate, the activity after this time point had to remain above this level for a minimum of 15 ms. We also used cross-correlation to estimate how much responses to AV stimuli were sped up relative to responses to A stimuli. We convolved the single-trial spiking response with a 5-ms exponential kernel (smoothing will underestimate the latency speedup) and then performed cross-correlation on a site-by-site basis to estimate the delay between the responses to AV and A stimuli. We then used a sign test to determine whether this delay was significantly different from zero for each of the SNRs.
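A sketch of the onset-latency rule described above; the input layout is an assumption, and the final comment indicates the sign-test step:

tAxis = -1100:299;                    % baseline = 1,100 ms before stimulus onset
psth  = [randn(1, 1100), randn(1, 300) + 4];  % placeholder PSTH (1-ms bins)

baseBins = tAxis < 0;
thr = mean(psth(baseBins)) + 3 * std(psth(baseBins));  % baseline + 3 SDs

minDur  = 15;                         % must stay above threshold for >= 15 ms
latency = NaN;
for k = find(tAxis >= 0 & psth > thr)
    if k + minDur - 1 <= numel(psth) && all(psth(k:k + minDur - 1) > thr)
        latency = tAxis(k);           % response-onset latency (ms)
        break
    end
end
% Across sites, the per-site AV-A delays from cross-correlation can be
% tested against zero with signtest(delays).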

We also performed a second analysis to estimate whether the responses we observed in the auditory cortex were behaviorally relevant. For each SNR and each site, we split trials into slow- and fast-RT groups. We then estimated the latencies for each group separately and examined whether the median latency across sites was shorter for the fast trials than for the slow trials.

Fano factor.

We used the variance toolbox for computing Fano factors (73) (http://churchlandlab.neuroscience.columbia.edu/links.html). Differences were examined using 95% confidence intervals.
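For reference, the raw Fano factor is the variance of the spike count across trials divided by its mean, computed in a sliding window. The sketch below illustrates that quantity only; it is not a substitute for the mean-matched procedure implemented in the toolbox of ref. 73, and the window parameters are assumptions.

spk = rand(120, 800) < 0.03;          % placeholder spike trains: trials x time (1-ms bins)

winW = 50; step = 10;                 % window width and step (ms; assumed)
starts = 1:step:(size(spk, 2) - winW + 1);
fano = zeros(1, numel(starts));
for w = 1:numel(starts)
    cnt = sum(spk(:, starts(w):starts(w) + winW - 1), 2);  % spike count per trial
    fano(w) = var(cnt) / mean(cnt);   % raw Fano factor in this window
end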

Supplementary Material

Supporting Information

Acknowledgments

We thank Shawn Steckenfinger for carefully preparing avatars of vocalizing monkeys; Lauren Kelly for the care of our monkeys; and Matthias Gondan, Sepideh Sadaghiani, and Daniel Takahashi for their critical comments on this manuscript. This work was supported by National Institutes of Health/National Institute of Neurological Disorders and Stroke Grant R01NS054898 (to A.A.G.), a James S. McDonnell Foundation Scholar Award (to A.A.G.), and a Charlotte Elizabeth Procter Fellowship from Princeton University (to C.C.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312518110/-/DCSupplemental.

References

1. Rosen S. Temporal information in speech: Acoustic, auditory and linguistic aspects. Philos Trans R Soc Lond B Biol Sci. 1992;336(1278):367–373. doi: 10.1098/rstb.1992.0070.
2. Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA. The natural statistics of audiovisual speech. PLOS Comput Biol. 2009;5(7):e1000436. doi: 10.1371/journal.pcbi.1000436.
3. Chandrasekaran C, Lemus L, Trubanova A, Gondan M, Ghazanfar AA. Monkeys and humans share a common computation for face/voice integration. PLOS Comput Biol. 2011;7(9):e1002165. doi: 10.1371/journal.pcbi.1002165.
4. van Wassenhove V, Grant KW, Poeppel D. Visual speech speeds up the neural processing of auditory speech. Proc Natl Acad Sci USA. 2005;102(4):1181–1186. doi: 10.1073/pnas.0408949102.
5. Bernstein LE, Auer ET Jr, Takayanagi S. Auditory speech detection in noise enhanced by lipreading. Speech Commun. 2004;44(1–4):5–18.
6. Grant KW, Seitz P-F. The use of visible speech cues for improving auditory detection of spoken sentences. J Acoust Soc Am. 2000;108(3 Pt 1):1197–1208. doi: 10.1121/1.1288668.
7. Kim J, Davis C. Investigating the audio-visual speech detection advantage. Speech Commun. 2004;44(1–4):19–30.
8. Besle J, Fort A, Delpuech C, Giard M-H. Bimodal speech: Early suppressive visual effects in human auditory cortex. Eur J Neurosci. 2004;20(8):2225–2234. doi: 10.1111/j.1460-9568.2004.03670.x.
9. Klucharev V, Möttönen R, Sams M. Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Brain Res Cogn Brain Res. 2003;18(1):65–75. doi: 10.1016/j.cogbrainres.2003.09.004.
10. Murase M, et al. Cross-modal integration during vowel identification in audiovisual speech: A functional magnetic resonance imaging study. Neurosci Lett. 2008;434(1):71–76. doi: 10.1016/j.neulet.2008.01.044.
11. Sumby WH, Pollack I. Visual contribution to speech intelligibility in noise. J Acoust Soc Am. 1954;26(2):212–215.
12. Rosenblum LD, Johnson JA, Saldaña HM. Point-light facial displays enhance comprehension of speech in noise. J Speech Hear Res. 1996;39(6):1159–1170. doi: 10.1044/jshr.3906.1159.
13. Summerfield Q. Use of visual information for phonetic perception. Phonetica. 1979;36(4-5):314–331. doi: 10.1159/000259969.
14. Summerfield Q. Lipreading and audio-visual speech perception. Philos Trans R Soc Lond B Biol Sci. 1992;335(1273):71–78. doi: 10.1098/rstb.1992.0009.
15. Schwartz J-L, Berthommier F, Savariaux C. Seeing to hear better: Evidence for early audio-visual interactions in speech identification. Cognition. 2004;93(2):B69–B78. doi: 10.1016/j.cognition.2004.01.006.
16. Stein BE. The New Handbook of Multisensory Processing. Cambridge, MA: MIT Press; 2012. p. 776.
17. Calvert GA. Crossmodal processing in the human brain: Insights from functional neuroimaging studies. Cereb Cortex. 2001;11(12):1110–1123. doi: 10.1093/cercor/11.12.1110.
18. Ghazanfar AA, Schroeder CE. Is neocortex essentially multisensory? Trends Cogn Sci. 2006;10(6):278–285. doi: 10.1016/j.tics.2006.04.008.
19. Driver J, Noesselt T. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’ brain regions, neural responses, and judgments. Neuron. 2008;57(1):11–23. doi: 10.1016/j.neuron.2007.12.013.
20. Ghazanfar AA. The multisensory roles for auditory cortex in primate vocal communication. Hear Res. 2009;258(1-2):113–120. doi: 10.1016/j.heares.2009.04.003.
21. Ghazanfar AA, Chandrasekaran C, Logothetis NK. Interactions between the superior temporal sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. J Neurosci. 2008;28(17):4457–4469. doi: 10.1523/JNEUROSCI.0541-08.2008.
22. Ghazanfar AA, Maier JX, Hoffman KL, Logothetis NK. Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J Neurosci. 2005;25(20):5004–5012. doi: 10.1523/JNEUROSCI.0799-05.2005.
23. Kayser C, Logothetis NK, Panzeri S. Visual enhancement of the information representation in auditory cortex. Curr Biol. 2010;20(1):19–24. doi: 10.1016/j.cub.2009.10.068.
24. Kayser C, Petkov CI, Logothetis NK. Visual modulation of neurons in auditory cortex. Cereb Cortex. 2008;18(7):1560–1574. doi: 10.1093/cercor/bhm187.
25. Sugihara T, Diltz MD, Averbeck BB, Romanski LM. Integration of auditory and visual communication information in the primate ventrolateral prefrontal cortex. J Neurosci. 2006;26(43):11138–11147. doi: 10.1523/JNEUROSCI.3550-06.2006.
26. Barraclough NE, Xiao D, Baker CI, Oram MW, Perrett DI. Integration of visual and auditory information by superior temporal sulcus neurons responsive to the sight of actions. J Cogn Neurosci. 2005;17(3):377–391. doi: 10.1162/0898929053279586.
27. Chandrasekaran C, Ghazanfar AA. Different neural frequency bands integrate faces and voices differently in the superior temporal sulcus. J Neurophysiol. 2009;101(2):773–788. doi: 10.1152/jn.90843.2008.
28. Egan JP, Greenberg GZ, Schulman AI. Operating characteristics, signal detectability, and the method of free response. J Acoust Soc Am. 1961;33(8):993–1007.
29. Miller J. Divided attention: Evidence for coactivation with redundant signals. Cognit Psychol. 1982;14(2):247–279. doi: 10.1016/0010-0285(82)90010-x.
30. Schwarz W. A new model to explain the redundant-signals effect. Percept Psychophys. 1989;46(5):498–500. doi: 10.3758/bf03210867.
31. Schwarz W. Diffusion, superposition and the redundant-targets effect. J Math Psychol. 1994;38(4):504–520.
32. Diederich A. Intersensory facilitation of reaction time: Evaluation of counter and diffusion coactivation models. J Math Psychol. 1995;39(2):197–215.
33. Steckenfinger SA, Ghazanfar AA. Monkey visual behavior falls into the uncanny valley. Proc Natl Acad Sci USA. 2009;106(43):18362–18366. doi: 10.1073/pnas.0910063106.
34. Ghazanfar AA, Morrill RJ, Kayser C. Monkeys are perceptually tuned to facial expressions that exhibit a theta-like speech rhythm. Proc Natl Acad Sci USA. 2013;110(5):1959–1963. doi: 10.1073/pnas.1214956110.
35. Roitman JD, Shadlen MN. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J Neurosci. 2002;22(21):9475–9489. doi: 10.1523/JNEUROSCI.22-21-09475.2002.
36. Cook EP, Maunsell JH. Dynamics of neuronal responses in macaque MT and VIP during motion detection. Nat Neurosci. 2002;5(10):985–994. doi: 10.1038/nn924.
37. Reich DS, Mechler F, Victor JD. Temporal coding of contrast in primary visual cortex: When, what, and why. J Neurophysiol. 2001;85(3):1039–1050. doi: 10.1152/jn.2001.85.3.1039.
38. Heil P. First-spike latency of auditory neurons revisited. Curr Opin Neurobiol. 2004;14(4):461–467. doi: 10.1016/j.conb.2004.07.002.
39. Panzeri S, Petersen RS, Schultz SR, Lebedev M, Diamond ME. The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron. 2001;29(3):769–777. doi: 10.1016/s0896-6273(01)00251-3.
40. Lee J, Kim HR, Lee C. Trial-to-trial variability of spike response of V1 and saccadic response time. J Neurophysiol. 2010;104(5):2556–2572. doi: 10.1152/jn.01040.2009.
41. Brosch M, Selezneva E, Scheich H. Nonauditory events of a behavioral procedure activate auditory cortex of highly trained monkeys. J Neurosci. 2005;25(29):6797–6806. doi: 10.1523/JNEUROSCI.1571-05.2005.
42. Niwa M, Johnson JS, O’Connor KN, Sutter ML. Activity related to perceptual judgment and action in primary auditory cortex. J Neurosci. 2012;32(9):3193–3210. doi: 10.1523/JNEUROSCI.0767-11.2012.
43. Romanski LM, Diehl MM. Neurons responsive to face-view in the primate ventrolateral prefrontal cortex. Neuroscience. 2011;189:223–235. doi: 10.1016/j.neuroscience.2011.05.014.
44. Romanski LM, Hwang J. Timing of audiovisual inputs to the prefrontal cortex and multisensory integration. Neuroscience. 2012;214:36–48. doi: 10.1016/j.neuroscience.2012.03.025.
45. Rowland BA, Quessy S, Stanford TR, Stein BE. Multisensory integration shortens physiological response latencies. J Neurosci. 2007;27(22):5879–5884. doi: 10.1523/JNEUROSCI.4986-06.2007.
46. Kayser C, Logothetis NK. Directed interactions between auditory and superior temporal cortices and their role in sensory integration. Front Integr Neurosci. 2009. doi: 10.3389/neuro.07.007.2009.
47. Sams M, et al. Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neurosci Lett. 1991;127(1):141–145. doi: 10.1016/0304-3940(91)90914-f.
48. Arnal LH, Morillon B, Kell CA, Giraud AL. Dual neural routing of visual facilitation in speech processing. J Neurosci. 2009;29(43):13445–13453. doi: 10.1523/JNEUROSCI.3194-09.2009.
49. Chase SM, Young ED. First-spike latency information in single neurons increases when referenced to population onset. Proc Natl Acad Sci USA. 2007;104(12):5175–5180. doi: 10.1073/pnas.0610368104.
50. Sperdin HF, Cappe C, Foxe JJ, Murray MM. Early, low-level auditory-somatosensory multisensory interactions impact reaction time speed. Front Integr Neurosci. 2009;3:2.
51. Senkowski D, Saint-Amour D, Höfle M, Foxe JJ. Multisensory interactions in early evoked brain activity follow the principle of inverse effectiveness. Neuroimage. 2011;56(4):2200–2208. doi: 10.1016/j.neuroimage.2011.03.075.
52. Lakatos P, Chen C-M, O’Connell MN, Mills A, Schroeder CE. Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron. 2007;53(2):279–292. doi: 10.1016/j.neuron.2006.12.011.
53. Zhou Y, et al. Generation of spike latency tuning by thalamocortical circuits in auditory cortex. J Neurosci. 2012;32(29):9969–9980. doi: 10.1523/JNEUROSCI.1384-12.2012.
54. Hoshino O. Neuronal responses below firing threshold for subthreshold cross-modal enhancement. Neural Comput. 2011;23(4):958–983. doi: 10.1162/NECO_a_00096.
55. Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A. Neuronal oscillations and visual amplification of speech. Trends Cogn Sci. 2008;12(3):106–113. doi: 10.1016/j.tics.2008.01.002.
56. Niwa M, Johnson JS, O’Connor KN, Sutter ML. Active engagement improves primary auditory cortical neurons’ ability to discriminate temporal modulation. J Neurosci. 2012;32(27):9323–9334. doi: 10.1523/JNEUROSCI.5832-11.2012.
57. Pfingst BE, O’Connor TA. A vertical stereotaxic approach to auditory cortex in the unanesthetized monkey. J Neurosci Methods. 1980;2(1):33–45. doi: 10.1016/0165-0270(80)90043-6.
58. Recanzone GH, Guard DC, Phan ML. Frequency and intensity response properties of single neurons in the auditory cortex of the behaving macaque monkey. J Neurophysiol. 2000;83(4):2315–2331. doi: 10.1152/jn.2000.83.4.2315.
59. Sliwa J, Duhamel JR, Pascalis O, Wirth S. Spontaneous voice-face identity matching by rhesus monkeys for familiar conspecifics and humans. Proc Natl Acad Sci USA. 2011;108(4):1735–1740. doi: 10.1073/pnas.1008169108.
60. Ghazanfar AA, Logothetis NK. Facial expressions linked to monkey calls. Nature. 2003;423(6943):937–938. doi: 10.1038/423937a.
61. Ghazanfar AA, et al. Vocal-tract resonances as indexical cues in rhesus monkeys. Curr Biol. 2007;17(5):425–430. doi: 10.1016/j.cub.2007.01.029.
62. Ghazanfar AA, Nielsen K, Logothetis NK. Eye movements of monkey observers viewing vocalizing conspecifics. Cognition. 2006;101(3):515–529. doi: 10.1016/j.cognition.2005.12.007.
63. Lansing CR, McConkie GW. Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Percept Psychophys. 2003;65(4):536–552. doi: 10.3758/bf03194581.
64. Linden JF, Grunewald A, Andersen RA. Responses to auditory stimuli in macaque lateral intraparietal area. II. Behavioral modulation. J Neurophysiol. 1999;82(1):343–358. doi: 10.1152/jn.1999.82.1.343.
65. Bell AH, Corneil BD, Munoz DP, Meredith MA. Engagement of visual fixation suppresses sensory responsiveness and multisensory integration in the primate superior colliculus. Eur J Neurosci. 2003;18(10):2867–2873. doi: 10.1111/j.1460-9568.2003.02976.x.
66. Groh JM, Trause AS, Underhill AM, Clark KR, Inati S. Eye position influences auditory responses in primate inferior colliculus. Neuron. 2001;29(2):509–518. doi: 10.1016/s0896-6273(01)00222-7.
67. Werner-Reiss U, Kelly KA, Trause AS, Underhill AM, Groh JM. Eye position affects activity in primary auditory cortex of primates. Curr Biol. 2003;13(7):554–562. doi: 10.1016/s0960-9822(03)00168-4.
68. Fu KM, et al. Timing and laminar profile of eye-position effects on auditory responses in primate auditory cortex. J Neurophysiol. 2004;92(6):3522–3531. doi: 10.1152/jn.01228.2003.
69. Hackett TA, Stepniewska I, Kaas JH. Subdivisions of auditory cortex and ipsilateral cortical connections of the parabelt auditory cortex in macaque monkeys. J Comp Neurol. 1998;394(4):475–495. doi: 10.1002/(sici)1096-9861(19980518)394:4<475::aid-cne6>3.0.co;2-z.
70. Quiroga RQ, Nadasdy Z, Ben-Shaul Y. Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Comput. 2004;16(8):1661–1687. doi: 10.1162/089976604774201631.
71. Chandrasekaran C, Turesson HK, Brown CH, Ghazanfar AA. The influence of natural scene dynamics on auditory cortical activity. J Neurosci. 2010;30(42):13919–13931. doi: 10.1523/JNEUROSCI.3174-10.2010.
72. Thompson KG, Hanes DP, Bichot NP, Schall JD. Perceptual and motor processing stages identified in the activity of macaque frontal eye field neurons during visual search. J Neurophysiol. 1996;76(6):4040–4055. doi: 10.1152/jn.1996.76.6.4040.
73. Churchland MM, et al. Stimulus onset quenches neural variability: A widespread cortical phenomenon. Nat Neurosci. 2010;13(3):369–378. doi: 10.1038/nn.2501.
