Journal of Speech, Language, and Hearing Research (JSLHR). 2020 Sep 16;63(10):3539–3559. doi: 10.1044/2020_JSLHR-20-00063

Electrophysiological Evidence of Early Cortical Sensitivity to Human Conspecific Mimic Voice as a Distinct Category of Natural Sound

William J Talkington a, Jeremy Donai b, Alexandra S Kadner a, Molly L Layne a, Andrew Forino a, Sijin Wen c, Si Gao c, Margeaux M Gray d, Alexandria J Ashraf d, Gabriela N Valencia a, Brandon D Smith d, Stephanie K Khoo d, Stephen J Gray a, Norman Lass b, Julie A Brefczynski-Lewis a, Susannah Engdahl a, David Graham e, Chris A Frum a, James W Lewis a,
PMCID: PMC8060013  PMID: 32936717

Abstract

Purpose

From an anthropological perspective of hominin communication, the human auditory system likely evolved to enable special sensitivity to sounds produced by the vocal tracts of human conspecifics whether attended or passively heard. While numerous electrophysiological studies have used stereotypical human-produced verbal (speech voice and singing voice) and nonverbal vocalizations to identify human voice–sensitive responses, controversy remains as to when (and where) processing of acoustic signal attributes characteristic of “human voiceness” per se initiate in the brain.

Method

To explore this, we used animal vocalizations and human-mimicked versions of those calls (“mimic voice”) to examine late auditory evoked potential responses in humans.

Results

Here, we report an N1b component (96–120 ms poststimulus) during a nonattending listening condition that showed significantly greater magnitude in response to mimics, beginning as early as primary auditory cortices and preceding the time window reported in previous studies, which placed species-specific vocalization processing in the range of 147–219 ms. During a sound discrimination task, a P600 (500–700 ms poststimulus) component showed specificity for accurate discrimination of human mimic voice. Distinct acoustic signal attributes and features of the stimuli were used in a classifier model, which could distinguish most human from animal voices with accuracy comparable to the behavioral data, though no single feature could adequately distinguish human voiceness.

Conclusions

These results provide novel ideas for algorithms used in neuromimetic hearing aids, as well as direct electrophysiological support for a neurocognitive model of natural sound processing that informs both neurodevelopmental and anthropological models regarding the establishment of auditory communication systems in humans.

Supplemental Material

https://doi.org/10.23641/asha.12903839


The ability to categorize and recognize different sources of natural sounds, including conspecific vocalizations as one category, is crucial to survival, and inadequate vocalization and paralinguistic processing in early childhood can lead to a wide variety of communication disorders (Abrams et al., 2009; Imai & Kita, 2014; Rosch, 1973). A human listener can often distinguish human (conspecific) vocal mimicry from actual nonhuman animal (nonconspecific) vocalizations (Lass et al., 1983), such as animal calls that are portrayed by human actors in animated cartoons. This study used electrophysiological measures to determine the earliest stage at which a distinction between human and nonhuman vocalization processing manifests in the human brain. In contrast to earlier electrophysiological studies of human voice, and motivated by an earlier functional magnetic resonance imaging (fMRI) study using human "mimic voice," the present auditory evoked potential (AEP) study examined the temporal dynamics of differential processing of sounds produced by humans mimicking animal vocalizations relative to the corresponding animal vocalizations themselves, which served as a critical control.

Voice-selective regions have been identified along the superior temporal sulci (STS) using fMRI in humans (Belin et al., 2000; Belizaire et al., 2007; Pernet et al., 2015; Uppenkamp et al., 2006) and using fMRI or neurophysiology recording methods in macaque monkeys (Perrodin et al., 2011; Petkov et al., 2008; Recanzone, 2008; Russ et al., 2008; Tian et al., 2001), suggesting a long evolutionary history of preferential processing for conspecific voice in primates. From the perspective of hearing perception, our group recently formalized a neurobiological taxonomic model of real-world natural sound categories that the brain appears to utilize to encode meaningfulness behind the behaviorally relevant everyday sounds we hear (Brefczynski-Lewis & Lewis, 2017), which in principle should apply to all social mammals with hearing and vocal communication ability. This semantic processing model of auditory object processing (see Figure 1A) included three basic acoustic–semantic categories of sound source: (a) action sounds (nonvocalizations) produced by “living things”; (b) action sounds produced by “nonliving things”; and (c) vocalizations (“living things”), with human (conspecific) versus nonhuman animal vocalizations as two subcategories therein. Moreover, responses to attending and categorizing human mimic voice, in contrast to the corresponding animal vocalizations being mimicked, were reported (using fMRI) to involve superior temporal plane regions (see Figure 1B, yellow colored cortex) in the vicinity of primary auditory cortices (Talkington et al., 2012), which provided much of the rationale for the present electrophysiological study using human mimic voice as stimuli.

Figure 1.

The theoretical rationale for exploring the temporal dynamic processing of human mimic voice. (A) A neurobiological taxonomic model of sound categories representing distinct acoustic–semantic (meaningful) classes of natural sounds based largely on hemodynamic neuroimaging. Bold text depicts basic sound classes proposed to represent ethologically relevant categories. Plain text entries depict human speech and singing, tool use sounds, and human-made machinery sounds, which are represented as extensions of the three fundamental categories. This study is testing the putative functional boundary (double-headed yellow arrow) for processing nonverbal vocalizations produced by human (conspecific) mimics (“mimic voice”) versus the corresponding nonhuman animal sounds as a critical control. (B) A vocalization processing hierarchy in human auditory cortex revealed with hemodynamic imaging (adapted from Talkington et al., 2012, reprinted with permission by the publishers). Group-averaged (n = 22) functional activation maps displayed on surface model reconstructions derived from all subjects. The averaged spatial locations of tonotopic gradients (black-to-white gradients) were located along Heschl's gyrus (HG). Cortex is colored if significantly sensitive to human mimic voice versus corresponding animal calls (M > A; yellow), to foreign speech versus mimic vocalizations (F > M; red), to native English speech versus mimic vocalizations (E > M; dark blue), or to mimic vocalizations versus English speech (M > E; cyan). Corresponding colors indicating functional overlaps are shown in the figure key. All data are threshold-free cluster enhancement and permutation-corrected for multiple comparisons to p < .05. IFG = inferior frontal gyrus; STG = superior temporal gyrus; STS = superior temporal sulci. Reprinted from Neuropsychologia, Vol. 105, Julie A. Brefczynski-Lewis & James W. Lewis, “Auditory Object Perception: A Neurobiological Model and Prospective Review,” pp. 223–242, Copyright © 2017, with permission from Elsevier.

Previous electroencephalography (EEG) investigations have revealed aspects of the temporal dynamics of voice-sensitive responses by examining late AEPs, which suggested that processing preferential to human voice initiates at stages hierarchically prior to or different from the STS. In particular, studies that compared AEPs between singing voice and instrument-produced sounds described a positive-going “voice-specific response” occurring approximately 320 ms after stimulus onset at anterior sites of the scalp (Levy et al., 2001, 2003). Other studies that investigated responses to vocal adaptation effects and the processing of paralinguistic acoustic features of vocalizations described responses occurring earlier than the voice-specific response (Beauchemin et al., 2006; Lattner et al., 2003; Schweinberger, 2001; Zaske et al., 2009), including the report of a “frontotemporal positivity to voices” (FTPV) occurring approximately 164 ms after stimulus onset, which was based on responses to human vocalizations including speech voice relative to those of bird vocalizations and environmental sounds (Charest et al., 2009). A similar study using magnetoencephalography reported an FTPVm (the magnetic counterpart of the FTPV) peaking around 220 ms (initiating as early as 147 ms poststimulus onset) that was elicited by human vocal (speech voice) and nonspeech sounds (e.g., yawn) relative to a wide variety of nonhuman sound sources but localized along bilateral midanterior portions of the STS plus the planum temporale in the right hemisphere (Capilla et al., 2013).

Additionally, a study utilizing mean global field potential (GFP) measures further suggested a time frame for species-specific vocalization processing (human vs. other animals) beginning approximately 169–219 ms after stimulus presentation within regions of the right STS and superior temporal gyrus region (De Lucia et al., 2010). They further proposed a four-tiered temporal cortical processing hierarchy for human audition: (a) “general” sound processing (low-level spectrotemporal processing) occurring around 70 ms, (b) the differentiation between man-made and living sound sources occurring in a window near 70–119 ms, (c) human versus animal vocalization discrimination occurring between approximately 169 and 219 ms, and (d) music versus nonmusic discrimination occurring around 291–357 ms. However, the stereotypical nature of the human vocalizations used in the aforementioned studies and the disparate nature of control sound categories used may have overshadowed processing of some of the subtler acoustic signals pertinent to behaviorally relevant “acoustic–semantic” features that characterize conspecific vocalizations as such. Thus, the acoustic signal attributes and signal processing in cortex that allow us to distinguish and hear human voices as a conspecific per se remain unclear.

This study incorporated short (180 ms) animal vocalizations (as the critical control category) and human-mimicked versions of those same stimuli to investigate the temporal processing dynamics of sensitivity to human mimic voice. We first hypothesized that naïve listeners who passively heard unattended human mimics, in contrast to corresponding animal vocalizations, would show greater amplitude and/or shorter latencies in processing as revealed by various late AEP responses, notably involving an N1 component that is maximal near the FCz electrode, which is thought to be generated along primary auditory cortices (Luck, 2005; Näätänen & Picton, 1987; Shahin et al., 2003). This would establish whether acoustic signal attributes characteristic of human voice were being differentially processed along any of these early stages in the cortical processing hierarchy. A second hypothesis was that attentively listening to discriminate the two sound categories would additionally reveal differential responses involving a later “P3 family” AEP component, consistent with the posterior STS among other generators classically thought to be involved in voice-specific processing and voice perception (Chapman & Bragdon, 1964; De Lucia et al., 2009; Luck, 2005; Pernet et al., 2015; Polich, 2007). Support for either of the above hypotheses using electrophysiological measures (i.e., EEG) of human mimic voice processing would help to establish or refine a time course for the processing of conspecific vocalizations. They would also serve to complement neurophysiological fMRI evidence suggestive of a category-level organization in human cortex (e.g., see Figure 1A) for processing human voice as a distinct acoustic–semantic category. Such processing mechanisms presumably are, or develop to become, optimized to process the acoustic signal attributes inherent to conspecific vocalizations, the understanding of which could have significant clinical implications for implementation of intelligent hearing aid algorithm designs, models of typical auditory communication neurodevelopment in toddlers, and anthropological models regarding the evolution of oral communication systems.

Method

Participants

We recruited and recorded EEG signals from 34 adult native English-speaking participants (M age = 24.4 years, SD = 4.5; 15 women, 19 men; 32 right-handed, one left-handed, one ambidextrous). All participants were free of neurological, audiological, or medical illness; self-reported a normal range of hearing; and had no self-reported auditory or vocal production impairments. Auditory thresholds were assessed in a subset of participants, especially those who indicated that their health care provider had not previously administered hearing testing, using a staircase procedure (Digital Audiometer Pro v.7.0, Digital Recordings) with the same sound booth and sound transmission chain as for the EEG recordings (described below). A normal range (< 25 dB at 0.25-, 0.5-, 1-, 2-, 4-, and 8-kHz pure tones) was confirmed for all but three subjects, who showed thresholds < 35 dB at 250 Hz and/or at 4 or 8 kHz. To assess whether mild hearing loss might affect the results, we compared four participants with verified low thresholds against four with higher thresholds in one or both ears, using a two-factor analysis of variance with replication on the passive condition data (addressed below). This analysis revealed no significant difference in the N1b response at 114 ms from the FCz electrode (F = 0.52, p > .48), nor any significant interaction effects (F = 0.01, p > .91). All participants indicated that they had no problems hearing the test stimuli, and all were able to perform the task condition paradigm (addressed below). Thus, no participants were excluded based on a suspicion of mild hearing loss. EEG recording participants were paid a stipend for the session. Informed consent was obtained following guidelines approved by the West Virginia University Institutional Review Board, which is in accordance with the 1964 Declaration of Helsinki and its later amendments.

Sound Stimuli

Animal vocalization stimuli were sourced from various professionally recorded CD collections (sampled at 44.1 kHz; Sound Ideas, Inc., and The Hollywood Edge) and only included sounds from single, isolated animals without noticeable background environmental sounds (e.g., insects). Five listeners screened the original animal sounds, including samples from stimuli used in an earlier fMRI study (Talkington et al., 2012), to ensure that each included an acoustic event that sounded like a full epoch, had reasonable onsets (attacks) and offsets, and was on the order of 200- to 400-ms duration. Sounds were excluded if there was any indication of echoes (such as from cement walls or zoo cages) or of breathy respiratory elements. Only land mammals were included, excluding common pets such as dogs or cats. We retained vocalizations that humans could mimic by voicing (using their vocal folds) and that fell within the pitch range of at least some of our human actors, yielding an initial selection of approximately 150 sounds to assess for "mimicability."

We prepared audio recordings from two female and six male actors (nonprofessional) who attempted to mimic the collected stimuli (see online featured image). Audio recordings were made while the actors were seated in a sound isolation booth (Model 800A-RF shielded, Industrial Acoustics Co.) using a Sony PCM-D1 recorder (sampled at 44.1 kHz, 16-bit), with the microphone placed approximately 30 cm directly in front of the mouth and positioned below head level at the height of a table surface to retain high-frequency energy (HFE; Monson et al., 2012; Shoji et al., 1991a). Actors were recruited if their voices were in good health (no cold or allergy at the time of recording) and were instructed not to overstrain their voices during mimicry attempts. They were not to use their hands, props, echoes, lips-only sounds, cheeks-only sounds, or whistling. They were free to have a beverage to soothe their voice but were instructed to have nothing in their mouths during audio recordings. The actors were instructed to watch the intensity meter to determine that they were in a good loudness range while mimicking the animal vocalizations. Actors listened to most of the initial 150 animal vocalization events and attempted to mimic them with multiple takes, ending when they felt they had achieved good mimic quality in terms of timing and loudness level (without clipping, as indicated by a red LED during the recording session). They were instructed to "pass" on vocalizations they felt they would not be able to mimic well. Six other listeners screened the human mimicry attempts to identify the "best" mimics, both in terms of call duration and acoustic quality.

A selection of 81 animal vocalizations and 81 corresponding human mimic sound stimuli (see Table 1) were aligned to match the original animal vocalization onsets as closely as possible (44.1 kHz, 16-bit; Adobe Audition 3.0, Adobe Systems, Inc.). Sound stimuli were then restricted to one channel (right ear); trimmed to 180.0 ms; high-pass filtered above 70 Hz (20 dB/octave); low-pass filtered at 10 kHz, retaining HFE to maintain clarity (Fullgrabe et al., 2010; Monson et al., 2014; Shoji et al., 1991b) and naturalness (Moore & Tan, 2003; Shoji et al., 1991a); and then equated for root-mean-square power (Avg = 72.71 ± 0.03 dB). The attacks of the recorded and sourced stimuli were left in their original states to preserve acoustic attributes that may be important for categorical processing, but 1-ms cos2 onset/offset ramps were applied to each stimulus to avoid onset/offset clicks during playback. While all human mimic recordings and most animal recordings were made with a stereo microphone, all sounds were converted to mono (the right channel) and played to both ears to avoid any binaural spatial cue processing.
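For illustration, the trimming, band-limiting, level-equating, and ramping steps can be sketched in a few lines of Python with SciPy. This is a minimal sketch rather than the authors' Adobe Audition workflow; the filter orders, the digital target RMS value, and the soundfile I/O library are assumptions.

```python
import numpy as np
import soundfile as sf                      # assumed I/O library; any WAV reader works
from scipy.signal import butter, sosfiltfilt

DUR_MS, RAMP_MS = 180.0, 1.0                # stimulus duration and cos^2 ramp length

def preprocess(path, target_rms=0.05):
    """Trim, band-limit, RMS-equate, and ramp one stimulus (illustrative values)."""
    x, fs = sf.read(path)
    if x.ndim > 1:
        x = x[:, 1]                                   # keep the right channel only
    x = x[: int(fs * DUR_MS / 1000)]                  # trim to 180 ms

    # High-pass above 70 Hz and low-pass at 10 kHz (Butterworth, zero-phase);
    # the orders here only approximate the reported 20 dB/octave slope.
    x = sosfiltfilt(butter(3, 70, btype="highpass", fs=fs, output="sos"), x)
    x = sosfiltfilt(butter(4, 10000, btype="lowpass", fs=fs, output="sos"), x)

    # Equate root-mean-square power across stimuli (arbitrary digital target level).
    x = x * (target_rms / np.sqrt(np.mean(x ** 2)))

    # Apply 1-ms cos^2 onset/offset ramps to avoid playback clicks.
    n = int(fs * RAMP_MS / 1000)
    ramp = np.sin(np.linspace(0, np.pi / 2, n)) ** 2
    x[:n] *= ramp
    x[-n:] *= ramp[::-1]
    return x
```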

Table 1.

List of the 81 animal sound stimuli and 81 corresponding human mimics (in alphabetical order).

Name | Animal vocalizations: Intensity (dB), Mean F0 (Hz), No. of formants, HNR, WE, SSV, % n33, % n50, CA | Human mimics: Intensity (dB), F0 (Hz), No. of formants, HNR, WE, SSV, Gender, % n33, % n50, CA
Baboon Grunts 1 72.69 205 5 5.72 −11.36 0.94 63 66 1 72.72 637 5 9.97 −11.02 0.45 M 97 98 1
Badger Grunts 2 72.73 837 5 1.01 −8.15 1.47 86 82 1 72.72 892 5 −0.19 −6.40 0.58 M 80 70 1
Bear 72.71 4 0.86 −9.41 1.42 71 84 1 72.64 488 5 3.01 −8.77 1.17 M 94 64 1
Bear Cub Moans 1 72.71 567 5 9.07 −12.54 0.86 71 92 1 72.73 606 5 20.91 −10.77 0.28 M 86 68 1
Bear Cub Moans 2 72.69 602 5 5.18 −12.65 0.18 66 70 1 72.72 683 5 16.79 −10.06 0.12 M 74 56 1
Bear Roar 72.72 354 5 4.99 −11.39 0.48 97 92 1 72.72 123 5 5.12 −8.24 0.41 M 89 80 1
Beaver Call 72.70 630 5 17.47 −10.84 0.34 31 62 0 72.66 561 5 21.01 −9.82 0.65 M 91 62 1
Bobcat Growl 1 72.72 209 3 2.00 −10.58 3.02 83 92 1 72.69 254 5 1.71 −9.20 0.77 M 29 24 0
Bobcat Growl 2 72.73 167 5 −1.00 −10.74 2.56 91 86 1 72.72 199 5 8.98 −8.18 0.49 M 20 10 1
Chimp Baby 1 72.73 671 5 15.17 −11.31 0.13 74 82 1 72.72 670 5 22.98 −12.01 0.59 M 66 30 1
Chimp Baby 2 72.73 644 5 18.59 −11.21 1.43 86 62 1 72.72 721 5 25.94 −10.33 1.04 M 74 58 1
Chimp Screams 72.68 1238 5 23.16 −10.33 0.55 77 76 0 72.71 1224 5 23.47 −8.59 0.03 F 46 38 1
Chimpanzee 1 72.73 868 4 11.51 −10.49 1.11 100 96 1 72.73 671 5 24.80 −10.23 3.95 M 51 46 1
Chimpanzee 2 72.73 432 4 14.24 −11.43 1.92 83 74 1 72.73 680 4 22.42 −10.81 0.26 M 57 42 1
Chimpanzee 3 72.72 210 4 13.43 −11.89 0.09 91 90 1 72.71 773 5 26.10 −10.78 0.32 F 49 46 1
Chimpanzee 4 72.73 627 4 18.16 −10.35 2.65 91 88 1 72.73 690 5 26.24 −10.68 0.23 F 57 54 1
Chipmunk Chatter 72.70 1453 5 2.64 −7.34 2.86 100 90 1 72.73 707 5 14.53 −8.80 0.86 M 57 54 1
Coyote Howls and Barks 1 72.73 475 5 3.76 −12.16 2.69 97 98 1 72.73 575 4 13.66 −10.67 0.39 M 91 56 1
Coyote Howls and Barks 2 72.73 590 5 2.32 −11.64 4.24 89 98 1 72.72 494 5 6.92 −10.87 1.03 M 46 52 0
Donkey 72.72 818 4 5.99 −7.50 0.50 94 92 0 72.70 548 5 13.51 −9.59 0.30 M 86 72 1
Gibbon Call 72.73 645 5 10.92 −10.15 2.24 89 90 1 72.73 679 5 28.96 −12.12 0.30 F 23 52 1
Gibbon Call 1 72.68 755 4 27.76 −13.06 0.53 63 50 0 72.72 295 5 29.32 −12.47 1.24 F 66 80 1
Gibbon Call 2 72.69 767 5 13.34 −11.29 3.18 97 78 1 72.68 861 5 26.20 −11.31 1.20 F 43 48 1
Gibbon Call 3 72.73 671 5 27.17 −12.59 1.94 74 62 0 72.70 670 4 32.52 −12.43 1.44 M 46 42 1
Gibbon Call 4 72.69 664 5 34.80 −13.13 0.51 69 54 0 72.71 631 5 34.47 −12.37 1.28 M 86 82 1
Goat Bleat 1 72.68 852 5 6.96 −10.86 0.61 97 90 1 72.72 513 5 4.97 −7.92 0.48 M 71 48 0
Goat Bleat 2 72.72 888 5 −2.06 −8.80 0.42 94 86 1 72.73 744 5 4.93 −10.20 0.68 M 63 26 0
Goat Bleat 3 72.73 953 5 −0.16 −9.71 0.93 54 82 1 72.70 1141 5 5.45 −8.11 0.05 M 77 48 1
Goat Bleat 4 72.54 363 4 10.85 −10.73 1.14 57 68 1 72.67 136 5 19.34 −9.25 0.12 M 94 92 1
Goat Bleat 5 72.70 445 5 5.38 −9.50 0.86 91 88 1 72.71 370 5 4.89 −7.17 0.89 M 69 46 1
Gorilla 1 72.73 424 5 9.35 −11.04 2.24 89 62 1 72.72 546 4 7.23 −11.26 0.75 M 86 96 0
Gorilla 2 72.73 482 4 4.79 −10.65 0.74 89 76 1 72.72 541 5 17.41 −11.85 0.79 M 54 78 0
Gorilla Yells 72.73 424 5 1.63 −11.44 1.89 86 98 1 72.73 513 5 3.89 −9.34 0.35 M 66 48 1
Hyena Laugh 72.73 446 4 6.91 −8.76 3.81 40 66 1 72.70 371 5 10.87 −7.70 3.15 M 97 100 1
Hyena Whine 72.73 883 5 15.17 −10.47 2.28 83 88 1 72.72 685 5 18.10 −9.35 3.55 F 51 76 1
Leopard 72.70 473 5 −0.61 −9.72 0.79 91 94 1 72.62 630 5 −1.76 −5.63 0.50 M 29 50 1
Leopard Snarl 1 72.69 199 4 −0.10 −8.69 1.17 89 96 1 72.71 515 5 4.72 −9.62 1.12 M 80 62 1
Leopard Snarl 2 72.73 584 4 0.22 −8.88 0.75 80 94 1 72.73 658 5 0.74 −7.92 2.15 M 94 100 1
Leopard Snarl 3 72.63 431 4 0.27 −8.76 0.77 91 100 1 72.71 463 4 5.60 −6.58 0.18 M 37 16 1
Lion Roar 72.72 271 5 6.74 −13.44 1.50 91 86 1 72.69 377 4 4.50 −8.52 1.29 M 43 22 1
Monkey Calls 1 72.73 621 5 8.57 −11.00 1.02 83 96 1 72.73 527 4 25.52 −9.02 1.91 F 83 88 1
Monkey Calls 2 72.72 1951 5 31.90 −13.32 0.87 77 70 1 72.72 1150 5 30.07 −10.71 0.52 F 31 28 1
Monkey Calls 3 72.73 788 5 16.17 −10.68 1.12 97 94 1 72.72 838 5 23.68 −10.33 0.21 F 40 36 1
Monkey Chirps 1 72.73 563 5 3.35 −9.06 1.48 83 66 1 72.70 572 5 10.62 −8.55 1.36 F 91 88 1
Monkey Chirps 2 72.73 709 5 4.84 −7.92 0.65 89 94 0 72.72 668 5 18.40 −9.01 0.07 F 54 70 1
Monkey Chirps 3 72.73 1000 5 4.16 −7.39 0.66 89 92 0 72.72 686 5 14.99 −7.93 0.36 F 69 48 1
Monkey Chitter 72.73 1739 5 3.48 −8.13 4.28 100 100 1 72.72 486 4 3.55 −7.21 0.37 M 94 80 0
Monkey Growls 1 72.71 843 4 3.26 −11.94 0.27 80 80 1 72.69 665 5 2.90 −9.48 0.60 M 86 86 0
Monkey Growls 2 72.73 996 5 1.31 −11.01 2.02 89 90 1 72.72 606 4 7.07 −9.03 0.46 M 94 98 1
Monkey Panting with Vocal 1 72.73 588 4 6.18 −9.00 1.35 77 90 1 72.73 723 5 29.42 −11.67 0.11 F 40 54 1
Monkey Panting with Vocal 2 72.73 756 4 5.28 −9.25 1.51 94 90 1 72.73 523 5 23.60 −10.81 0.18 F 57 76 1
Monkey Panting with Vocal 3 72.73 773 5 2.39 −9.62 1.15 74 82 1 72.72 447 4 9.67 −7.68 0.45 M 89 92 1
Monkey Panting with Vocal 4 72.73 931 5 0.11 −9.03 1.36 89 68 1 72.73 205 3 7.96 −6.98 2.06 M 83 92 1
Otter Grunts 1 72.72 348 5 3.95 −11.67 0.53 60 50 1 72.73 313 5 15.78 −8.54 0.80 F 94 100 1
Otter Grunts 2 72.73 529 5 3.10 −10.52 1.25 94 96 1 72.71 560 4 11.38 −10.02 0.76 M 77 66 1
Panda Bark 72.73 612 5 13.45 −9.69 0.74 74 72 0 72.71 973 4 24.41 −10.13 0.12 F 37 28 1
Panda Bleat 1 72.72 774 4 16.06 −9.49 0.31 49 68 0 72.64 747 5 21.66 −10.41 1.61 M 43 38 1
Panda Bleat 2 72.72 655 4 21.04 −10.33 0.13 51 62 0 72.69 722 5 21.87 −10.24 0.56 F 49 36 1
Panda Bleat 3 72.73 653 5 14.23 −9.59 1.27 54 56 0 72.72 586 5 27.80 −10.77 0.06 M 94 74 1
Pig Squeals 1 72.70 903 4 0.69 −8.06 1.06 94 84 1 72.71 783 5 10.30 −8.44 0.22 F 46 20 1
Pig Squeals 2 72.71 1387 4 2.10 −8.53 0.39 89 88 1 72.66 931 5 1.19 −9.05 0.44 M 23 16 0
Pig Squeals 3 72.73 618 4 0.69 −8.17 0.91 91 94 1 72.73 508 4 2.87 −7.80 2.79 M 89 94 1
Pig Squeals 4 72.73 1081 5 1.17 −9.25 0.74 97 98 1 72.71 642 5 1.72 −7.84 0.26 M 14 8 1
Pig Squeals 5 72.73 890 4 1.90 −9.34 0.63 89 94 1 72.71 867 5 5.29 −9.95 0.43 M 69 50 0
Pig Squeals 6 72.72 884 5 0.12 −8.99 0.65 91 98 1 72.73 679 5 1.00 −8.91 0.61 M 89 60 1
Pig Squeals 7 72.73 862 5 1.91 −8.57 1.59 83 80 1 72.72 629 5 5.64 −7.86 0.41 M 63 54 1
Puma Scream 72.70 170 5 19.65 −11.13 0.31 83 76 1 72.71 1082 4 20.75 −9.28 0.29 M 34 32 1
Raccoon Chitter 1 72.72 1210 5 2.44 −9.45 1.71 97 96 1 72.72 1106 5 14.35 −9.04 0.58 M 31 42 1
Raccoon Chitter 2 72.71 1432 5 7.58 −11.02 2.48 97 92 1 72.69 494 5 8.07 −9.80 0.85 M 83 76 0
Rhino Calls 72.72 328 5 9.97 −12.52 0.26 83 86 1 72.72 728 4 7.25 −8.82 0.21 F 80 62 1
Sea Lion Growls 1 72.69 763 4 6.74 −11.67 0.40 74 90 1 72.66 887 4 13.21 −9.28 2.64 M 46 34 1
Sea Lion Growls 2 72.71 728 4 1.18 −10.66 1.27 91 78 1 72.73 510 4 9.50 −10.52 0.30 M 94 94 1
Snow Leopard 72.72 442 5 1.61 −7.66 0.06 69 68 0 72.69 590 5 2.36 −8.43 0.37 M 71 38 1
Squirrel Chirping 1 72.72 582 5 1.38 −11.26 0.32 80 58 1 72.71 257 5 5.91 −7.13 0.17 M 91 86 1
Squirrel Chirping 2 72.72 624 5 2.61 −9.83 0.46 83 76 1 72.72 165 5 19.18 −10.62 0.10 M 77 98 1
Squirrel Chirping 3 72.73 557 5 −0.43 −10.28 0.66 71 82 1 72.72 910 4 20.54 −9.02 3.33 F 51 66 1
Tiger 1 72.71 262 5 3.07 −9.65 0.68 100 96 1 72.72 352 4 9.40 −6.94 0.19 M 51 22 1
Tiger 2 72.72 83 5 3.23 −10.98 1.21 97 96 1 72.71 578 5 1.83 −7.39 0.57 M 43 14 1
Tiger 3 72.70 528 5 2.45 −8.49 0.86 86 84 0 72.73 437 4 2.36 −8.39 0.93 M 97 68 0
Wildcat Hiss 1 72.73 859 4 3.31 −10.63 0.50 89 96 1 72.72 703 5 4.48 −7.35 0.26 M 83 76 1
Wildcat Hiss 2 72.71 424 4 −1.32 −9.64 0.94 71 58 1 72.72 727 5 2.06 −7.10 0.28 M 83 78 1
Average 72.71 683.30 4.65 7.27 −10.24 1.21 82.3 82.3 72.71 618.47 4.73 13.21 −9.31 0.79 65.8 58.7
SD 0.03 343.07 0.50 8.02 1.48 0.94 14.5 13.5 0.02 229.12 0.47 9.62 1.55 0.84 23.0 25.2

Note. All stimuli were 180-ms duration and matched for average root-mean-square intensity. Various acoustic signal attributes for the original animal vocalizations and corresponding human mimic are indicated, including fundamental frequency (F0), estimated number of formants, harmonics-to-noise ratio (HNR; dBHNR), Wiener entropy (WE), and spectral structure variation (SSV). The gender of the human mimicker is also indicated. Categorization accuracy is also shown for the n = 33 electroencephalography (EEG) participants (n33), the n = 50 non-EEG participants (n50), and the classifier algorithm (CA). Bold entries indicate sounds with spectrograms illustrated in Figure 2A. M = male; F = female.

The animal vocalizations and corresponding human mimic sounds (see Figure 2A, spectrograms) were assessed for estimated fundamental frequency (F0) and estimated number of formants (see Table 1; Praat software, http://www.fon.hum.uva.nl/praat/), neither of which showed a significant difference overall on average: F0, paired two-tailed t(159) = 1.41, p > .16, and number of formants, paired two-tailed t(160) = 0.96, p > .34. Averaged spectrograms by category (see Figure 2B) showed a significantly greater peak of power for human mimics in roughly the 650- to 800-Hz range (peak at 700 Hz), two-tailed paired t(81) = 1.99, p < .00002. Additionally, a spectral minimum (trough) was evident around 5.5 kHz (centered at 5,388 Hz), two-tailed paired t(81) = 1.99, p < .032, in the human spectrogram trace (see Figure 2B, arrow), consistent with characteristic acoustic contributions by the human piriform fossa reported previously (Dang & Honda, 1997; Shoji et al., 1991a).
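The category-averaged spectral comparison can be approximated as follows: average each stimulus's power spectrum within a category and run a paired t test at each frequency bin. This minimal sketch assumes two onset-aligned lists of 81 preprocessed waveforms and illustrative FFT settings, not the authors' exact spectrogram parameters.

```python
import numpy as np
from scipy.signal import welch
from scipy.stats import ttest_rel

def category_spectra(waveforms, fs=44100, nperseg=1024):
    """Stack per-stimulus power spectra (dB) for one sound category."""
    spectra = []
    for x in waveforms:
        f, pxx = welch(x, fs=fs, nperseg=min(len(x), nperseg))
        spectra.append(10 * np.log10(pxx + 1e-12))
    return f, np.vstack(spectra)              # shape: (n_stimuli, n_freq_bins)

# animal_waves and mimic_waves: matched lists of 81 preprocessed waveforms each.
# f, animal_spec = category_spectra(animal_waves)
# _, mimic_spec = category_spectra(mimic_waves)
# t_vals, p_vals = ttest_rel(mimic_spec, animal_spec, axis=0)   # paired test per bin
# print(f[p_vals < .05])    # frequency bins with category differences (e.g., ~700 Hz)
```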

Figure 2.

Spectral analyses of the animal vocalizations and their human mimics. (A) Example spectrograms of several pairs of sound stimuli (see bolded entries from Table 1). (B) Average spectrogram for each category of sound (n = 81 per category). * indicates the range of significant difference, peak two-sample paired t(80), p < .00002, between animal and human mimic sounds around the 700-Hz peak; black arrow shows location of a 5.5-kHz frequency trough, two-sample paired t(80), p < .03, associated with the piriform fossa of the human vocal tract.

The human mimic sounds and corresponding animal vocalizations were also assessed for several higher order acoustic signal attributes or features (see Table 1). These included a measure of harmonics-to-noise ratio (HNR; Praat software), on which the animal vocalizations (Avg ± SD: 7.27 ± 8.02 dBHNR) and human mimics (13.21 ± 9.62 dBHNR) differed significantly, two-tailed paired t(160) = 4.27, p < 3 × 10−5. A measure of spectral flatness was also examined using Wiener entropy (WE; Tchernichovski et al., 2001), with animal vocalizations (Avg ± SD: −10.24 ± 1.48) and human mimics (−9.31 ± 1.55) showing significant differences, two-tailed paired t(160) = 3.89, p < .0001. Using spectral structure variation (SSV) as a measure of change in entropy over time, the animal vocalizations (Avg ± SD: 1.21 ± 0.94) and human mimics (0.79 ± 0.84) also showed significant differences, two-tailed paired t(160) = 2.96, p < .003. Each of the above signal attributes was assessed and reported here because they have been implicated in auditory stream segregation and auditory object perception in earlier studies in humans (Engel et al., 2009; Fitch & Fritz, 2006; Lewis et al., 2012; Medvedev et al., 2002; Medvedev & Kanwal, 2004; Reddy et al., 2009) and/or in other vocal communicating species (Beckers et al., 2003; Dooling et al., 1992; Kikuchi et al., 2014; Phan & Vicario, 2010); they were investigated in this study (see Supplemental Material S1) and may be useful for systematic examination in future studies.
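F0 and HNR above were obtained from Praat, whereas Wiener entropy and SSV can be computed directly from a spectrogram. The sketch below uses a common operationalization (WE as the log ratio of the geometric to the arithmetic mean of spectral power per frame; SSV as the variability of frame-wise WE over time); the windowing parameters and the exact SSV definition are assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.signal import spectrogram

def wiener_entropy_ssv(x, fs=44100, nperseg=512, noverlap=256):
    """Return per-stimulus Wiener entropy (WE) and spectral structure variation (SSV).

    WE is the log ratio of the geometric to the arithmetic mean of spectral power,
    averaged over frames (0 = white noise, increasingly negative = more tonal).
    SSV is taken here as the standard deviation of frame-wise WE over time.
    """
    _, _, sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    sxx = sxx + 1e-12                                    # guard against log(0)
    we_frames = np.mean(np.log(sxx), axis=0) - np.log(np.mean(sxx, axis=0))
    return float(np.mean(we_frames)), float(np.std(we_frames))
```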

Psychophysical Testing of Sound Stimuli

To facilitate data interpretation, a group of 50 non-EEG participants (M age = 23 years, balanced gender) were recruited to determine how well the animal and corresponding human mimic sound stimuli could be categorized, using PsychoPy software (Version 3.0; Peirce, 2009). They were instructed that they would hear an equal number of animal- and human-produced sounds and were to indicate by two-alternative forced choice (keyboard keys) the category to which they thought each sound source belonged, with an emphasis on accuracy rather than speed (which differed from the EEG task study, wherein responses needed to occur before the next sound event). A brief practice session was used to accustom participants to the process. Pressing the space bar advanced to the next trial, ensuring that all sounds, presented in different random orders, were rated by every participant. Psychophysical testing participants were paid a stipend for the session.
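A minimal PsychoPy sketch of this two-alternative forced-choice procedure is shown below; the stimulus directories, key assignments, and prompt text are illustrative assumptions rather than the authors' actual script.

```python
import glob
import random
from psychopy import core, event, sound, visual

# Hypothetical stimulus locations; each list holds the 81 .wav files of one category.
animal_wavs = sorted(glob.glob("stimuli/animal/*.wav"))
mimic_wavs = sorted(glob.glob("stimuli/human_mimic/*.wav"))
trials = [("animal", f) for f in animal_wavs] + [("human", f) for f in mimic_wavs]
random.shuffle(trials)                       # different random order per participant

win = visual.Window(fullscr=False, color="grey")
prompt = visual.TextStim(win, text="a = animal    h = human\n(then press space for next)")

responses = []
for category, wav_path in trials:
    sound.Sound(wav_path).play()
    prompt.draw()
    win.flip()
    keys = event.waitKeys(keyList=["a", "h"])        # accuracy emphasized, no deadline
    responses.append((wav_path, category, keys[0]))
    event.waitKeys(keyList=["space"])                # space bar advances to the next trial
    core.wait(0.2)

win.close()
```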

Electrophysiology Procedures

All EEG recording (and audiometry) occurred in a sound-attenuated and electrically shielded sound isolation booth to minimize acoustic and electrical interference (Model 800A-RF shielded, Industrial Acoustics Co.). EEG recordings (64 channels) were collected with Neuroscan SynAmps hardware (Neuroscan), Scan 4.3 Acquire software, and Quik-Caps (Ag/Ag-Cl sintered electrodes; 10–20 system). Impedances were generally kept below 10 kΩ at all electrodes for all but six participants, for whom impedance was under 50 kΩ. Scalp electrodes were referenced during data collection to a separate electrode placed on the right mastoid, connected directly to ground. A 1-kHz sampling rate (alternating current coupling on the SynAmps) was applied to all channels, and signals were filtered online from 0.05 to 200 Hz.

Stimulus Presentation Procedures

Stimuli were presented to subjects through electrostatic ear buds (Stax SRS-005 Earspeaker system, Stax Ltd.; frequency responses of 20–1000 Hz ± 2 dB and 1–20 kHz ± 4 dB) via Presentation software (Version 11.0, Neurobehavioral Systems, Inc.) running on a Windows PC. Prior to the recording session, stimulus intensity was set between 65 and 75 dB SPL Leq using a 1-kHz pure tone (fast A-weighted; Brüel & Kjær 2239A sound meter) presented via the same stimulus computer and sound delivery chain. The intensity level was initially set at 75 dB SPL, but for some participants, lowered intensities were necessary to avoid EEG artifacts associated with sound events (e.g., subconscious startle reflex signals in EEG traces). Each EEG recording session consisted of six recordings in total (three recordings in a passive [naïve] listening condition followed by three recordings in an active task condition; see below). Recordings in each condition lasted approximately 6 min; the overall duration of an EEG recording session was approximately 45 min. Each EEG recording included a randomized presentation of the 81 animal vocalizations and 81 corresponding human-mimicked versions, with interstimulus intervals randomly and uniformly distributed between 2,300 and 2,700 ms, which served to minimize habituation. This timing selection also allowed enough time for subjects to respond by button press during the sound discrimination task paradigm (herein termed "task condition") that was conducted in the latter half of the EEG recording session.

The nonattentive passive listening paradigm (herein termed "passive condition") was conducted first, wherein participants were intentionally unaware that they were hearing two different categories of vocalization sounds, and they simply watched a muted subtitled movie of their choice. The passive condition was always conducted prior to the task condition to avoid biasing the participants as to the nature of the study, which was exploring two subcategories of vocalizations (human vs. nonhuman mammals). We adopted the passive homogeneous listening paradigm design because it had been shown previously to elicit stable N1 components (i.e., a stable P1–N1–P2 complex) in response to signal-in-noise encoding after removing eye movement artifact trials (Billings et al., 2011). After these three passive condition EEG recordings, the subjects were newly informed that they would hear the sounds again but would be performing a discrimination task while maintaining fixation on a crosshair on the booth wall. Because this second portion of the experimental session required that participants hear the sound stimuli again in another set of recordings with randomized presentation, the results across the two experimental paradigms could be directly compared for effects of passive versus attentive task listening. Using a Neuroscan response pad, subjects responded after the presentation of each stimulus, as accurately as possible, to indicate whether it was produced by an animal or a human. Either the far left or right buttons (by left and right thumb press) corresponded to the respective vocalization categories, with button designations (human or animal source) counterbalanced across the entire subject group and across gender. Subjects were encouraged to respond before the next stimulus presentation, with emphasis on accuracy. Reaction times were recorded for subsequent analyses.

Data Analysis

All analyses of EEG and AEP data were performed using the MATLAB-based open-source software packages EEGLAB (Version 13.6.5b; Delorme & Makeig, 2004) and ERPLAB (Version 5.1.1.0, http://www.erpinfo.org). The three continuous EEG recordings from each segment of the experiment (passive or task conditions) were concatenated for each subject. The Brain Electrical Source Analysis system (Gunji et al., 2000) was used to enable group averaging for generating topographic maps of current source density of the AEP components. The continuous concatenated EEG recordings for each participant were digitally filtered at 0.1–30 Hz using an infinite impulse response Butterworth filter with 12 dB/octave roll-off (zero phase shift) and then epoched according to stimulus type. For the passive condition, epochs were trimmed to 700 ms, including a 200-ms prestimulus period used for baseline correction. For the task condition, epochs were 2,200 ms in duration, including a 200-ms prestimulus baseline; only those data with button responses between 200 and 2,000 ms after stimulus onset, whether correct or not, were retained for analyses (amounting to 98.3% of all trials/epochs). Epochs were carefully screened for artifacts related to eye blinks, high-frequency muscle artifact, and excessive alpha waves (drowsiness). Automated detection of any epochs that exceeded ±100 μV at any time point between −200 and 2,000 ms postonset facilitated this process (ERPLAB software), and epochs rejected as artifact trials were excluded from all subsequent averaging analyses. Noisy channels were replaced by interpolation of nearby channel data. Data were re-referenced to the average to facilitate comparisons with earlier studies (De Lucia et al., 2009; Levy et al., 2003; Murray et al., 2006). The mean acceptance rate for the passive condition was 80.8% of trials overall (n = 33 of 34 participants; one participant removed due to excessive motion artifact), and for the task condition it was 62.8% (n = 29 of 34 participants).
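The filtering, epoching, artifact rejection, interpolation, and re-referencing steps were performed in EEGLAB/ERPLAB; an analogous pipeline can be sketched in Python with MNE, as below. The file name, event codes, and availability of channel positions are assumptions, and this is an analogue of the workflow, not the authors' scripts.

```python
import mne

# Hypothetical file name and event codes (1 = animal vocalization, 2 = human mimic).
raw = mne.io.read_raw_cnt("subject01_passive.cnt", preload=True)
raw.filter(l_freq=0.1, h_freq=30.0, phase="zero")       # zero-phase 0.1-30 Hz band-pass

events, _ = mne.events_from_annotations(raw)
epochs = mne.Epochs(
    raw, events, event_id={"animal": 1, "human_mimic": 2},
    tmin=-0.2, tmax=0.5,                 # 700-ms epochs including 200-ms baseline
    baseline=(-0.2, 0.0),
    reject=dict(eeg=100e-6),             # drop epochs exceeding +/-100 microvolts
    preload=True,
)
epochs.interpolate_bads()                # replace noisy channels (needs channel positions)
epochs.set_eeg_reference("average")      # re-reference to the common average

evoked_human = epochs["human_mimic"].average()
evoked_animal = epochs["animal"].average()
```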

For the task condition, mean GFP measures were initially calculated (Delorme et al., 2015) to reveal activation of anticipated P3 family components, as described earlier using similar experimental parameters (De Lucia et al., 2009). The GFP measures were derived from the average standard deviation across 62 (of 65) scalp electrodes (all except the vertical and horizontal eye gaze channels and the mastoid channel), representing a reference-independent field strength measurement across the entire electrode montage (Koenig & Melie-Garcia, 2010; Lehmann & Skrandies, 1980; Murray et al., 2008). Some participants had substantial numbers of "error trials," leading to four types of response conditions (with the following nomenclature used in Supplemental Material S1, Part A, Figure S1): human vocalizations correctly categorized as being produced by humans (H–H); animal vocalizations correctly categorized as being produced by animals (A–A); plus error trials, wherein animal vocalizations were incorrectly categorized as being produced by humans (A–H) and human vocalizations were incorrectly categorized as being produced by animals (H–A). However, because there were relatively few error trials overall, those analyses, while informative, are presented as Supplemental Material S1 (Part A). For each participant (n = 29), the epochs with correctly categorized responses (H–H and A–A) were independently baseline-corrected (−200 to 0 ms) prior to group averaging and statistical analyses.
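Because GFP at each time point is the standard deviation of voltage across the retained electrodes, it can be computed in one line; a minimal sketch, assuming a channels-by-time array of voltages, is shown below.

```python
import numpy as np

def global_field_power(data_uv):
    """GFP per time point from a (n_channels, n_times) voltage array.

    Defined as the standard deviation across channels at each sample, i.e., a
    reference-independent measure of field strength over the montage.
    """
    return np.std(data_uv, axis=0)

# Example (hypothetical arrays from the two condition averages):
# gfp_human = global_field_power(evoked_human_data)     # shape: (n_times,)
# gfp_animal = global_field_power(evoked_animal_data)
```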

Statistical Methods and Considerations

The objective of this study was to use animal vocalizations and human-mimicked versions of those calls to examine late AEP responses in humans. Electrophysiological measures were treated as continuous outcome variables, with the stimuli controlled for the complex acoustic signals present in vocalizations. Analysis of variance (for more than two groups) or two-sample t tests were initially used to assess the outcome variables between different groups, while paired two-sample t tests were used to assess the difference in an outcome variable at the same location or the same time point between two different conditions.

To address the issues of multiple comparisons and repeated measurements over time for both passive and task conditions, a linear mixed model, with participant as a random effect, was used to test for significant differences between the two vocalization conditions (as fixed effect), which can address the multiple measurements from each participant over time and assess the difference between two conditions based on a single p value. The mixed-effects model was fitted using statistical software R (R Core Team, 2013) with packages “lme4” and “lmerTest.” This provided a ratio of fixed effect divided by standard error (asymptotic z value or t value) and its p value. A sensitivity analysis on the choice of time intervals was performed, which ensured that the results were stable and robust. For consistency with some earlier AEP studies, we also performed nonparametric cluster permutation tests (Maris & Oostenveld, 2007) in which Wilcoxon's statistic was calculated on each random permutation.
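The mixed model was fitted in R with lme4/lmerTest; an analogous single-window fit can be sketched in Python with statsmodels' MixedLM, as below. The long-format data layout, file name, and column names are assumptions for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Long-format single-trial data for one time window (e.g., mean 96- to 120-ms
# amplitude at FCz); the file and column names are illustrative assumptions.
df = pd.read_csv("n1b_window_trials.csv")     # columns: subject, condition, amplitude

# Condition (animal vs. human mimic) as fixed effect; random intercept per participant.
model = smf.mixedlm("amplitude ~ condition", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())                       # fixed-effect coefficient, z value, p value
```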

Machine Learning Classification

To provide further characterization as to what acoustic signal attributes might distinguish human mimic voice from animal calls and might be driving any differential AEP component activations (and possibly the fMRI data depicted in Figure 1B), a statistical classification was conducted with the passive condition data. Specifically, linear discriminant analysis techniques using spectral and spectrotemporal features extracted from the animal sound and human mimic stimuli were utilized. The features included F0, HNR, WE, SSV, and four spectral peaks extracted from 0 to 8 kHz using Praat's formant detection feature, all of which have been used previously to characterize vocalization stimuli (see above).

A stepwise discriminant function analysis was also conducted to determine the extent to which the abovementioned features, together with spectral peak information, could correctly classify a given sound stimulus as produced by an animal or a human mimic, and to characterize the relative importance of the features. This technique has been shown to be effective at classifying normal versus dysphonic speech (which vary in perceived sound quality) using spectral and cepstral features (Gaskill et al., 2017; Lowell et al., 2013). To assess prediction and classification performance, leave-one-out cross-validation was used, wherein one signal is removed from the data set, the remaining signals are used as training data, and the removed signal is then entered and classified. This process continued iteratively until all signals had been removed and classified. Five stimulus features were found to be significant predictors of classification, including (in order of importance) Peak 4 (a spectral prominence centered around 6.7 kHz for humans and 6.2 kHz for animals), HNR, WE, Peak 1 (a spectral prominence centered around 1.2 kHz for humans and 1.3 kHz for animals), and SSV. This approach yielded informative though complex results, which, for clarity, are provided only as Supplemental Material S1 (Part B).
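The leave-one-out classification scheme can be sketched with scikit-learn's linear discriminant analysis, as below (the stepwise feature-selection step is omitted); the feature table layout and file name are assumptions.

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# One row per stimulus (81 animal + 81 human mimic); file and column names assumed.
features = ["F0", "HNR", "WE", "SSV", "peak1", "peak2", "peak3", "peak4"]
df = pd.read_csv("stimulus_features.csv")
X = df[features].to_numpy()
y = (df["source"] == "human_mimic").astype(int)      # 1 = human mimic, 0 = animal

# Leave-one-out: train on all stimuli but one, then classify the held-out stimulus.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(f"LOOCV classification accuracy: {scores.mean():.2%}")
```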

Results

Previous studies of AEP responses to voice and the N1 component have focused on the FCz channel (De Lucia et al., 2009; Levy et al., 2003; Murray et al., 2006), and thus, this channel served as an a priori channel to explore initially in this study. However, reference-independent field strength measurements across the entire electrode montage, termed mean "GFP measures" (De Lucia et al., 2010), have also been used as a more objective method for exploring electrophysiological differences. Thus, for clarity, both GFP and then individual channel data are presented, for both the passive condition and the subsequently conducted task condition paradigms.

Passive Condition Results

For the passive condition, EEG participants heard a random presentation of animal vocalizations versus human mimics of those vocalizations while they watched a subtitled movie and were naïve to the nature of the study. EEG traces were derived from data free of artifacts, with one participant excluded due to excessive motion artifacts. Approximate time windows showing differences in waveforms evoked by human mimic sounds and animal vocalization were initially derived for each individual from recordings of the FCz channel, revealing three time periods of interest showing differences in amplitude (similar to the averaged FCz waveforms in Figure 3B). All participants meeting inclusion criteria (n = 33 of 34) revealed two prominent dipolar responses in the FCz channel (data not shown) characteristic of N1 and P2 components of the “P1–N1–P2 complex” (Billings et al., 2011; Clynes, 1969; De Lucia et al., 2009; Näätänen & Picton, 1987; Tremblay et al., 2001). Qualitatively, 24 of the 33 participants revealed a readily evident N1 component for human mimic sounds that was greater in amplitude than for the animal vocalizations. In eight of the nine participants who showed the opposite or overlapping pattern of N1 profile amplitude, the P2 component amplitude was significantly greater for human mimic vocalizations, yielding greater N1–P2 complexes overall for the human mimic versus animal sound stimuli. There were no observed factors that could otherwise distinguish these participants, and thus, the n = 33 participants were retained in the final analyses.

Figure 3.

Group-averaged electroencephalography responses to human mimic versus animal vocalizations during the passive condition. (A) Group-averaged global field potential (GFP) waveforms (62 channels, n = 33 participants). (B) Averaged auditory evoked potential waveforms (64 channels, n = 33 participants) for electrode FCz in response to hearing human-mimicked animal vocalizations (blue trace) versus the corresponding animal vocalizations (black dotted trace) during the passive condition. (C) Scalp topography (current source density) on the average brain of Brain Electrical Source Analysis for the periods 96–120 ms (N1b), 198–209 ms (P2), and 286–313 ms (N2) for processing the human mimic category of sound. (D) Area under the curve measures at FCz of the N1b, P2, and N2 components (M ± SE)—all rectified so positive is up in charts. **p uncorr < .0001, *p uncorr < .0002.

Using mean GFP measures, the passive condition AEP data revealed three waveform peaks characteristic of the N1b, P2, and N2 (see Figure 3A); the naming convention is detailed below with reference to the FCz data and previous literature. Linear mixed modeling was used to compare the waveforms, including all individual trials from all participants and taking into account the multiple measurements from each participant. The first major component, identified below as the N1b, peaked at 114 ms and showed a significant difference between human (blue trace) and animal sounds (black dotted trace) ranging from 96 to 120 ms, as estimated from a linear mixed model to account for multiple comparisons (fixed effect coefficient = .229, z = 5.725, p < .0001). The asymptotic z distribution was used to evaluate the significance of the fixed effects. As an additional statistical measure, added for comparison to earlier AEP studies, a nonparametric Wilcoxon test was also conducted, which similarly indicated a significant N1b difference in a similar range between 96 and 123 ms (at p < .001, and p = .018 from a nonparametric cluster permutation test adjusted for multiple comparisons). The GFP analysis revealed a P2 peak around 203 ms, but the difference between the two conditions did not reach statistical significance (fixed effect coefficient = .104, z = 1.677, p < .105). The GFP analysis also revealed an N2 peak, showing significantly greater activation to human mimics (fixed effect coefficient = .106, z = 2.255, p < .03).

The passive condition AEP data were further examined using the FCz channel, which further characterized the three main GFP waveforms (cf. Figures 3A and 3B). On average, the first negative-going component (the N1b) peaked at roughly 114 ms for both human and animal vocalizations (no significant latency difference), with negativities centered about the FCz electrode location (see Figure 3C, left panels, dark blue in scalp topography maps). This scalp topography and latency were consistent with the previously reported characteristics of the N1b component (Näätänen & Picton, 1987; Shahin et al., 2003; Wioland et al., 2001), which is generated from different sources in intermediate and lateral parts of Heschl's gyrus and the planum temporale (Godey et al., 2001; Liegeois-Chauvel et al., 1994); thus, this component is referred to as the N1b herein. The N1b amplitude was significantly larger for human mimic voice than for the animal vocalizations in the 96- to 120-ms poststimulus period (fixed effect coefficient = −.373, z = 9.325, p < .0001). The peak N1b responses to human-mimicked vocalizations (peak ± SD: −2.69 ± 1.30 μV) and to the animal vocalizations (−2.30 ± 1.31 μV) were plotted and both were found to be normally distributed. Additionally, for comparison with earlier reported statistical designs, a nonparametric post hoc cluster permutation test (using Wilcoxon's statistic) to adjust for multiple comparisons revealed significant differences in the N1b component between 93 and 122 ms (at p < .002, and p = .027 after adjusting for multiple comparisons). Measures of the area under the curve (see Figure 3D) in the range from 96 to 120 ms similarly revealed a significant difference between human (M ± SD: −2.42 ± 1.19 μV ms) and animal (−2.05 ± 1.19 μV ms) responses, two-tailed paired t(32) = 4.42, p < .0001.
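The area-under-the-curve measure amounts to integrating the FCz waveform over the component's time window; a minimal sketch, assuming a millisecond time vector and a microvolt waveform per participant, is shown below (the exact normalization behind the reported μV ms values is not specified here).

```python
import numpy as np

def component_area(times_ms, waveform_uv, t_start, t_end):
    """Signed area (microvolt-milliseconds) under a waveform within a time window."""
    mask = (times_ms >= t_start) & (times_ms <= t_end)
    return np.trapz(waveform_uv[mask], times_ms[mask])

# Example for the N1b window at FCz (hypothetical per-participant arrays):
# n1b_human = component_area(times_ms, fcz_human_uv, 96, 120)
# n1b_animal = component_area(times_ms, fcz_animal_uv, 96, 120)
```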

A positive-going P2 component for both categories of vocalizations showed a latency that peaked around 205 ms (see Figure 3B) and centered around the FCz and nearby electrodes (see Figure 3C, middle panels, red hues in scalp topography map). The group-averaged response magnitude of the P2 component in the 198- to 209-ms poststimulus range (a range of interest initially estimated by t tests; see Statistical Methods and Considerations section) for the human mimics did not show a significant difference from the response to animal vocalizations using a linear mixed model (fixed effect coefficient = .156, z = 1.380, p = .176), nor by area under the curve measures in this range (see Figure 3D).

A second negative-going peak (the N2 component) in the 286- to 313-ms poststimulus range from the FCz electrode also showed a significant difference (fixed effect coefficient = −.426, z = 4.260, p < .0002), with a maximal peak difference at 301 ms (see Figure 3B) centered around the FCz electrode region (see Figure 3C, right panel, dark blue scalp topography). Measures of the area under the curve (see Figure 3D) in the range from 286 to 313 ms similarly revealed a significant difference between human (M ± SD: −1.10 ± 1.23 μV ms) and animal (−0.067 ± 1.25 μV ms) responses, two-tailed paired t(32) = 4.24, p < .0002. This component resembled a previously reported N2 component (Crowley & Colrain, 2004; Luck & Hillyard, 1994; Näätänen & Picton, 1986) and is thus referred to as such herein.

Task Condition Results

For the second portion of the AEP recording session, the task condition, participants were newly informed of the nature of the sound stimuli, that is, they included a mix of animal vocalizations and human vocalizations. They were now instructed to indicate whether they heard a vocalization produced by a human or an animal, responding by two-alternative forced choice after the presentation of each stimulus event (refer to Method section). EEG traces were derived from data free of artifacts, with 29 of the abovementioned 33 participants retained after further excluding four who exhibited excessive head motion artifacts during the task condition. The percentage of trials with button press category responses (human or animal, counterbalanced), whether correct or not, on average was 62.7%: The distribution of retained trials by condition on average was human mimics perceived as human (H–H) = 101 (Avg ± SD: 20.8% ± 8.1%) and animal vocalizations perceived as animal (A–A) = 127 (26.2% ± 9.6%), and incorrect (error trial) categorization was animal vocalizations perceived as human voice (A–H) = 24 (5.0% ± 3.0%) and human mimics perceived as animal vocalizations (H–A) = 51 (10.6% ± 6.7%). The mean reaction times to those retained trials for the H–H condition (0.96 ± 0.10 s) and the A–A condition (0.91 ± 0.02 s) showed no statistical difference, paired t(27) = 0.49, p < .63. As expected, the error trial responses in the A–H condition (1.02 ± 0.18 s) and the H–A condition (0.99 ± 0.17 s) were longer in duration than the two accurate perception trial conditions H–H and A–A, two-tailed paired t(53) = 8.56, p < 1 × 10−11. While there were relatively few error trials overall, a thorough investigation of the error trial data is presented in Supplemental Material S1 (Part A).

Figure 4 illustrates the group-averaged mean GFP responses to the two categories of correctly discriminated sounds during the two-alternative forced choice task condition. A main effect of vocalization category was revealed at multiple time periods, including the time periods of the N1b, P2, and N2 components, similar to the passive condition (cf. Figures 3A and 4A), plus a component that peaked around 600 ms poststimulus onset. Specifically, an N1b component in the range of 94–121 ms was significantly greater in magnitude for the human mimic sound (blue trace) condition (GFP Avg ± SD: 1.02 ± 0.42 μV) relative to the animal vocalization (black dotted trace) condition (0.88 ± 0.29 μV; fixed effect coefficient = .145, z = 3.222, p < .0034), evident when comparing both the waveforms and scalp topographies of the N1b. Measures of the area under the curve (see Figure 4C) in the range from 94 to 121 ms similarly revealed a significant difference between human (M ± SD: 1.02 ± 0.65 μV ms) and animal (0.87 ± 0.54 μV ms) responses, two-tailed paired t(32) = 3.20, p < .003.

Figure 4.

Group-averaged electroencephalography responses to correctly categorized human mimics versus animal vocalizations during the task condition. (A) Averaged global field potential (GFP) waveforms (62 channels, n = 29 participants) corresponding to responses to the two main conditions. (B) Scalp topographies for the N1b, P2, N2, and P600 components for the correctly categorized human mimic sounds and for the correctly categorized animal vocalizations. (C) Area under the GFP curve of the N1b, P2, and N2 components (M ± SE)—all rectified so positive is up in charts. **Two-tailed p uncorr < 3 × 10−7. *Two-tailed p uncorr < .006. (D) Averaged AEP waveforms of the P600 component at the Pz electrode.

In the task condition, the P2 component showed greater GFP response magnitude to the human mimic sounds relative to the animal vocalizations in the range of 198–209 ms (fixed effect coefficient = .244, z = 3.210, p < .0033). The P2 component response profile and associated scalp topography were similar across the two listening paradigms but reached a significant difference only during the task condition. Measures of the area under the curve in the range from 198 to 209 ms revealed a significant difference between human mimics (M ± SD: 1.65 ± 1.06 μV ms) and animal vocalizations (1.43 ± 0.96 μV ms), two-tailed paired t(32) = 2.98, p < .006. An N2 peak was also evident for both vocalization categories during the task condition, but the two categories did not differ significantly in the range of 286–313 ms (fixed effect coefficient = .087, z = 1.526, p < .14).

Unique to the task condition results was the presence of a P3 family component, evident as a sustained peak at 602 ms (see Figure 4A), with a range between 450 and 750 ms poststimulus that was significantly different between the two sound categories (fixed effect coefficient = .638, z = 6.380, p < .0001). This P600 component showed increased positivity with a scalp distribution centered over the posterior midline region (see Figure 4B, rightmost panels, dark red hues), which was evident for both conditions. Measures of the area under the curve (see Figure 4C) in the range from 500 to 700 ms revealed a significant difference between human (M ± SD: 3.79 ± 1.59 μV ms) and animal (3.08 ± 1.35 μV ms) responses, two-tailed paired t(32) = 6.75, p < 3 × 10−7. These characteristics were consistent with a P600 component described earlier (De Lucia et al., 2009, 2010; Kalaiah & Shastri, 2016), and it is referred to as such herein.

A P600 component was present in an earlier study examining responses to falling tone complexes, with a peak near the Pz electrode (Kalaiah & Shastri, 2016). Thus, we further charted the P600 component waveform from the Pz electrode (see Figure 4D). This waveform revealed a significantly greater positive-going P600 component amplitude for the correctly categorized human mimic sounds relative to animal vocalizations with a peak around 580 ms, including significant differences in the range of 550–650 ms (fixed effect coefficient = 1.558, z = 3.847, p < .0006) and also in the more liberal range of 450–750 ms (fixed effect coefficient = 1.367, z = 3.607, p < .0012).

Machine Learning Classification of Animal Vocals and Their Human Mimics

In an attempt to determine how human listeners might classify vocalizations, a classifier learning algorithm was implemented as a model, using the four data-derived spectral peaks (refer to the Method section), plus the F0, HNR, entropy (WE), and SSV measures for each sound as quantitative signal attributes. While the algorithm was able to classify human versus animal vocalizations with a high degree of accuracy (see Supplemental Material S1, Part B), the errors it made did not match those that humans made during the task condition. Though informative, those results therefore did not provide direct insight into how humans classify vocalizations and are presented in Supplemental Material S1 (Part B) to help inspire future studies.
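
As a hedged sketch of this type of analysis (the specific algorithm and settings used for Supplemental Material S1, Part B, are not reproduced here), the example below trains a generic classifier on the per-stimulus feature set named above and reports cross-validated accuracy and feature importances; the feature values are random placeholders standing in for the measured attributes.

```python
# Hedged sketch (the specific algorithm used for Supplemental Material S1, Part B, is not
# reproduced here): train a generic classifier on per-stimulus acoustic features and report
# cross-validated accuracy and feature importances. Feature values are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
feature_names = ["peak1", "peak2", "peak3", "peak4", "F0", "HNR", "WE", "SSV"]
X = rng.normal(size=(162, len(feature_names)))  # 81 animal + 81 human mimic stimuli
y = np.array([0] * 81 + [1] * 81)               # 0 = animal, 1 = human mimic

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(f"Mean cross-validated accuracy: {cross_val_score(clf, X, y, cv=5).mean():.2f}")

clf.fit(X, y)
for name, importance in sorted(zip(feature_names, clf.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: importance = {importance:.3f}")
```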

Parametric Charting of Acoustic Signals Versus AEP Components

We next sought to identify acoustic signal attribute differences between the animal and human mimic stimuli that might have been driving the early AEPs, using data from the FCz electrode during the passive condition. (There were insufficient numbers of error trials from the task condition results to address this question.) Peak and average response magnitudes were derived for the N1b, P2, and N2 components in response to each of the 81 animal and 81 human mimic stimuli. These, in turn, were plotted parametrically against HNR, F0, WE, and SSV (see Supplemental Material S1, Figure S2, Part C). While these results did not clearly reveal specific bottom-up acoustic signal attributes that could distinguish conspecific human from nonconspecific voice, they nonetheless suggested novel avenues for future research regarding which signal attributes may ultimately drive the N1b, P2, and N2 AEP signals associated with perception of human (conspecific) mimic voice.
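
A minimal sketch of this kind of parametric charting is shown below, assuming a per-stimulus table of component magnitudes and acoustic attributes; here the comparison of regression slopes between the two categories is implemented as a category-by-attribute interaction in an ordinary least squares model, which is one plausible approach (not necessarily the authors') and uses placeholder values throughout.

```python
# Minimal sketch (not the authors' code): regress per-stimulus N1b magnitude at FCz on one
# acoustic attribute (here HNR), with a category-by-attribute interaction testing whether the
# regression slopes differ between animal and mimic stimuli. All values are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "category": ["animal"] * 81 + ["mimic"] * 81,
    "HNR": rng.normal(10, 5, 162),  # hypothetical harmonics-to-noise ratios (dB)
})
# Placeholder N1b peak magnitudes (negative-going, in microvolts)
df["n1b_peak"] = -1.0 - 0.02 * df["HNR"] + rng.normal(0, 0.3, 162)

fit = smf.ols("n1b_peak ~ HNR * category", data=df).fit()
print(fit.params)    # 'HNR:category[T.mimic]' is the slope difference between categories
print(fit.pvalues)
```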

Discussion

This study revealed several AEP components whose response magnitudes, on average, were greater when hearing human mimics of animal vocalizations ("mimic voice") than when hearing the corresponding animal vocalizations themselves, during both the passive and the task listening conditions. There were two main findings. The first was evidence of an early temporal stage, involving the frontocentral N1b component (peak 114 ms, range 96–120 ms), at which differential processing of human conspecific nonlinguistic vocalizations is initiated, which was significantly earlier than previously reported using other forms of human voice. This early N1b sensitivity to human mimic voice suggested that it reflects exogenous mechanisms involving the primary auditory cortices (Näätänen & Picton, 1987; Wioland et al., 2001), which was consistent with our earlier fMRI study (see Figure 1B) using mimic voice (Talkington et al., 2012).

The second main finding was that attention to differentiating mimic voice from animal vocalizations alters processing, as evident in the elicitation of a late P600 positivity (with a generally posterior distribution) and in changes to the earlier positivity (P2) and second negativity (N2) components. Together, these AEP findings lent novel support to a neurobiological taxonomic model (see Figure 1A) of natural sound categories as an overarching processing organization by the brain (Brefczynski-Lewis & Lewis, 2017): Moreover, they contribute to the refinement of a temporal hierarchical model for the processing of natural sounds, as addressed in the immediately following section. That section is followed by a discussion of the N1b component finding and its potential clinical significance for advancing designs of neuromimetic algorithms for intelligent hearing aids and for models of acoustic communication during human neurodevelopment. The final sections discuss the effects of the listening task condition on AEP components and the significance these findings may have for anthropological models of acoustic communication systems.

The N1b AEP Responses to Human Mimic Voice

In contrast to earlier studies, the present results indicated that sensory discrimination between human conspecific and corresponding nonconspecific vocalizations occurs as early as the auditory N1b component (114-ms peak), which is reported to be an obligatory AEP response to acoustic events generated near primary auditory cortices (Clynes, 1969; Luck, 2005; Näätänen & Picton, 1987; Shahin et al., 2003). Earlier studies examining stereotypical human vocalizations (e.g., speech voice, singing voice, and incidental nonverbal voice), in contrast to a wide variety of other control sounds, revealed differential processing in the range of 169–219 ms (De Lucia et al., 2010; Murray et al., 2006), at a peak around 164 ms (Charest et al., 2009), or at an N1m peak around 220 ms (initiating as early as 147 ms) using magnetoencephalography (Capilla et al., 2013). Because the N1b deflection in this study showed a differential response to mimic voice in both the passive and task conditions (cf. GFP waveforms in Figures 3A and 4A), it lent strong support to the existing notion that it represents an obligatory sensory, or exogenous, potential.

The significantly earlier discrimination of human voice observed in this study was almost certainly due to stimulus selection, wherein animal vocalizations served as the critical control relative to corresponding human mimics (mimic voice) of those same vocalizations observed in nature. The human actors we recruited to generate the mimic voice stimuli intentionally tried to mimic the animal vocalizations as closely as they could. This mimicry process matched many complex spectral and temporal attributes of the already acoustically complex natural vocal signals. However, sound shaping by the human vocal tract and oral cavity likely produced a host of acoustic attributes and vocal cues that retained subtle nuances the human auditory system uses to distinguish conspecific from nonconspecific vocalizations (though also see the Limitations and Future Directions section below). While the auditory N1 (and N1b) response is triggered by stimulus onsets (Clynes, 1969; Näätänen & Picton, 1987; Seither-Preisler et al., 2007), it is also sensitive to a number of general low-level attributes such as stimulus pitch (Heinks-Maldonado et al., 2005; Winkler et al., 1997). The average F0 of the human mimics was not significantly different from that of the animal vocalizations (see Table 1), and the classifier algorithm did not reveal F0 as a significant attribute for discriminating the two vocalization categories (see Supplemental Material S1, Part C).

The human mimics were, however, on average characterized by a prominent frequency peak with greater power in the 650- to 800-Hz range (see Figure 2B), which may have been a low-level spectral attribute contributing to the N1b response magnitude difference. Additionally, the prominence of a spectral peak around 6.6 kHz, together with a frequency trough for the human vocals around 5.5 kHz, likely due to the piriform fossa anatomy of humans (Dang & Honda, 1997; Shoji et al., 1991a), was suggestive of HFE acoustic signal information that may also have been driving some of the conspecific versus nonconspecific discrimination. A variety of higher order acoustic signal attributes may have further weighed in on this differential processing, as suggested by the intersecting versus parallel slopes from the parametric regression analyses (see Figure S2). While our parametric regression results did not identify any single acoustic signal attribute (or combination of attributes) that could readily discriminate the two categories of sound stimuli, nor the main drivers of the N1b (or P2 or N2) component amplitude differences, they did provide clues for future studies that may specifically address which quantifiable features (vs. idiosyncrasies) of natural sound are probabilistically important for human voice discrimination.
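
To illustrate how the spectral regions discussed here could be quantified for a given recording, the sketch below estimates the fraction of total power falling in the 650- to 800-Hz region and near the approximately 5.5-kHz trough and 6.6-kHz peak using Welch's method; the file name and exact band edges are illustrative assumptions rather than values from this study.

```python
# Illustrative sketch (file name and band edges are assumptions, not values from this study):
# estimate the fraction of total power falling near the 650- to 800-Hz peak, the ~5.5-kHz
# trough, and the ~6.6-kHz peak for a single recording, using Welch's method.
import numpy as np
from scipy.integrate import trapezoid
from scipy.io import wavfile
from scipy.signal import welch

fs, x = wavfile.read("mimic_example.wav")  # hypothetical recording
x = x.astype(float)
if x.ndim > 1:                             # average channels if the file is not mono
    x = x.mean(axis=1)

freqs, psd = welch(x, fs=fs, nperseg=4096)

def band_fraction(f_lo, f_hi):
    """Fraction of total spectral power within [f_lo, f_hi] Hz."""
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return trapezoid(psd[mask], freqs[mask]) / trapezoid(psd, freqs)

for label, (lo, hi) in {"650-800 Hz peak": (650, 800),
                        "~5.5 kHz trough": (5300, 5700),
                        "~6.6 kHz peak": (6400, 6800)}.items():
    print(f"{label}: {100 * band_fraction(lo, hi):.2f}% of total power")
```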

The four-tiered temporal cortical processing hierarchy for human audition proposed by De Lucia et al. (2010) included a third tier wherein human versus animal vocalization discrimination initially occurred between approximately 169 and 219 ms, which received general support from the findings of Charest et al. (2009) and Levy et al. (2001, 2003). However, the findings of this study using human mimic voice suggest that the brain's ability to discriminate human (conspecific) from nonconspecific vocal tract sounds occurs considerably earlier. This differentiation, initiating with the N1b component (96- to 120-ms poststimulus onset period), was closer to their proposed second tier, in which differentiation of man-made versus living sound sources begins. The earlier human vocalization studies mentioned above did not elicit a significant differential N1b response to human vocalizations, presumably because their selected stimuli and/or control stimuli were more varied in the nature of their exemplars, which may have masked the earlier N1b processing component. Regardless, this study's findings suggest a need for refinements to hierarchical processing models of natural sounds and, furthermore, have potential clinical relevance in terms of possible hearing aid algorithm designs and neurodevelopmental models, as addressed next.

Designs for Intelligent Hearing Aids

Since the differential N1b response to human mimic voice likely reflects an exogenous signal, identifying the relevant acoustic signal attributes may prove instrumental for guiding intelligent hearing aid algorithm designs based on neuromimetic principles (Rumberg et al., 2008): For instance, isolating pertinent signal attributes of close-range voice from the rest of an acoustic scene (and potentially from other voice sources at more distant ranges) could be used to enhance those signals and to adapt the resulting output to the dynamic hearing range of an individual with hearing impairment. In this regard, it seems likely that different sets of signal attributes that physically distinguish conspecific from nonconspecific vocalizations are resolved at different temporal stages (e.g., N1b vs. P2 vs. N2), reflecting a processing hierarchy that could be probed using acoustically well-controlled nonnatural (or quasinatural) vocal-like stimuli. Future hearing aid algorithms could conceivably preprocess acoustic signals to facilitate the perception of an individual's vocalizations against background noise (e.g., ambient background vocalizations).
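
As a toy illustration only (not a proposed hearing aid algorithm, and not derived from this study's data), the sketch below shows the kind of simple band-emphasis preprocessing that such a neuromimetic pipeline might start from: boosting a hypothetical voice-associated band (here, the 650- to 800-Hz region noted above) before mixing it back with the original signal. A practical design would of course require adaptive band selection, compression to the listener's dynamic range, and validation against perceptual outcomes.

```python
# Toy illustration only (not a proposed hearing aid algorithm, and not derived from this
# study's data): boost a hypothetical voice-associated band before presentation, as a crude
# stand-in for the kind of neuromimetic band-emphasis preprocessing envisioned in the text.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def emphasize_band(x, fs, f_lo=650.0, f_hi=800.0, gain_db=6.0):
    """Add a gain_db boost of the [f_lo, f_hi] Hz band back onto the input signal."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)
    return x + (10 ** (gain_db / 20) - 1) * band

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
# Synthetic test signal: a 700-Hz "voice band" tone plus a 3-kHz distractor tone
x = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
y = emphasize_band(x, fs)
```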

Neurodevelopment and Human Voice Sensitivity

The finding of an exogenous N1b response to human mimic voice may also have implications for objectively and noninvasively measuring sensitivity to human voice when assessing neurodevelopmental markers in normal development. For instance, speech sounds have been used with human infants to assess neurodevelopmental milestones in normal development (Friederici, 2005). Moreover, brain imaging studies indicate that, by 7 months of age (but not by 4 months), infants show sensitivity to human voice along the superior temporal gyrus region (Blasi et al., 2011; Grossmann et al., 2010). EEG evidence further indicates that 7-month-old infants show sensitivity to human action sounds relative to other acoustic–semantic categories of sound (Geangu et al., 2015), which had been incorporated into the neurobiological taxonomic model (see Figure 1A). These and related findings support the idea that the human brain is optimized (intrinsically, through development, or a combination of both) for processing human conspecific vocalizations (including self-produced vocalizations) as a distinct acoustic–semantic category of sound. Thus, the taxonomic model being tested in this study may serve as a basis for further identifying neurophysiological markers for neurotypical development of auditory communication, especially in infants who are at risk of neurodevelopmental deficits, such as those with a familial history of autism or specific language impairment.

Effects of a Sound Discrimination Task on Late AEPs

Because the same stimuli were used in both the passive and task conditions and with the same participants, a direct comparison between results was feasible for assessing the effects of auditory attention. In both listening conditions, an N1b, P2, and N2 profile was present in the GFP measures (cf. Figures 3 and 4 waveforms). The attention-demanding discrimination task condition, however, further led to an expected endogenous P3 family profile, which manifested here as a robust P600 component over posterior scalp electrodes (notably Pz). This P600 was significantly greater in amplitude in response to the correctly categorized human mimic sounds than to the correctly categorized animal vocalizations, and also greater than the responses to the miscategorized human-mimicked and animal vocalization sounds (see Supplemental Material S1, Figure S1). These later P2, N2, and P600 AEP components are addressed in turn below.

The auditory P2 component of the AEP is a positive peak occurring with a latency of approximately 200 ms, and it is maximally recorded at the vertex (Picton et al., 1999). Like the N1b component, the P2 component is considered an onset response (Lightfoot, 2016). The P2 component has multiple neural generators in Heschl's gyrus in the region of the secondary auditory cortex (Shahin et al., 2007), with latencies of 160–270 ms depending on stimulus intensity (Lightfoot, 2016). The P2 component is modulated by spectral complexity and, specifically, by the number of formant frequencies contained within the stimulus: Stimuli with greater spectral complexity increase the root-mean-square amplitude of the P2. The amplitude of the P2 component can also be increased in nonmusicians by training them to discriminate complex tones. These known properties of the auditory P2 component are in good agreement with the findings of this study, since the animal and mimic vocalizations were spectrally complex stimuli that could be expected to engage the mechanisms underlying the P2 component in both the task and nontask conditions. The differential enhancement of the P2 component in response to mimic stimuli might be driven by neural plasticity occurring during the discrimination task, thereby representing a "mesogenous" component (Cacioppo et al., 2007). Interestingly, the P2 response showed greater sensitivity to human mimics, whether correctly or incorrectly categorized, relative to animal vocalizations, whether correctly or incorrectly categorized (see Figure S1), which differed from the P600 response profile. However, this preliminary finding, together with the interpretation of the P2 signal, will need to await validation with paradigms that can more rigorously address this effect with greater numbers of trials.

The N2 component of the AEP is elicited by novel, task-relevant stimuli (Folstein & Van Petten, 2008). It is a negative-going wave that peaks 200–350 ms after stimulus onset and has an anterior scalp distribution (Folstein & Van Petten, 2008). The N2 component is commonly observed in combination with the task- or attention-related P3 component and is often referred to as part of the "N2–P3 complex." The auditory N2 component is thought to be generated by sources in the supratemporal plane, specifically the posterior superior temporal gyrus (O'Donnell et al., 1993; Simson et al., 1976, 1977). In contrast to the P3, the auditory N2 component has a modality-specific scalp distribution that is distinct from that of the visually evoked N2 component. Nevertheless, the auditory N2 component shows strong temporal coupling to the P3 component, but not to the auditory N1 component (Michalewski et al., 1986). This temporal coupling suggests that the N2 component reflects cognitive stimulus evaluation processes in common with the P3 component. The N2 component is specifically associated with the classification of discordant or novel stimuli (Folstein & Van Petten, 2008; O'Donnell et al., 1993). As with the P2 component, our findings are in good agreement with the known properties of the N2 component, since we observed the N2 component when the participants were engaged in a classification task. Unlike the most typical findings described to date, the N2 component in our study did not form part of an N2–P3 complex but was instead followed by a P600 component.

Because the N2 component was also influenced by the effects of attention, it too, in addition to the P2 component, may be considered a mesogenous component, though with the opposite effect, in that the task condition abolished the differential response magnitude for mimic voice that was observed in the passive condition. One possibility is that the N2 waveform may simply have been distorted by the much larger P600 waveform and thus was beyond the temporal resolution of our paradigm to dissociate. Alternatively, the N2 (or N200) family of negative components has also been reported to reflect the detection of some type of mismatch between a stimulus and a previously formed "standard template" as a comparison process (Cacioppo et al., 2007; Gehring et al., 1992; Squires et al., 1975). In the passive condition, some human mimic voice stimuli may have effectively popped out as different, reflecting the slightly rarer occurrence of "obviously human voice" stimuli (depending on the listener and their experience with mimic voice sounds). Just prior to the task condition AEP recording sessions, participants were instructed that two categories of sound would be heard and discriminated, which may have subsequently influenced an individual's use of an otherwise default automatic or preattentive standard template that would detect deviant features among the trains of stimuli being heard. Instead, attention was focused on comparing another set, or sets, of templates to discriminate the sounds, thereby eliminating any perception of "deviant" sounds and a corresponding N2 effect. Such distinctions will need to await more detailed study for further interpretation.

With regard to the P600 component, a P600-like component was also reported in response to vocalizations in a study using an oddball detection task paradigm and analysis techniques similar to those of the present study (De Lucia et al., 2010). In that study, human nonverbal sounds (e.g., coughing, sneezing, laughing) were contrasted with stereotypical animal vocalizations (e.g., pig, chicken, dog), which revealed a larger P600 component amplitude for the human-produced sounds. They further reported that the P600 was likely localized to several generators, including the posterior STS and temporal–parietal junction, plus prefrontal and inferior frontal regions. This processing was presumed to reflect stages where voice selectivity emerges at a perceptual level, and the present study's results were consistent with this interpretation. Another study, examining frequency-changing dynamic tones in an oddball paradigm, also revealed a component with a 610-ms peak latency at posterior scalp locations (Kalaiah & Shastri, 2016). Their P600 showed larger amplitudes to dynamic tone complexes with larger changes in frequency and a bias toward rising versus falling tone complex sounds. This was interpreted as representing cognitive and decision-making processes with a perceptual bias for approaching rather than receding sound sources.

The P600 has been associated with the P3 family of components, and unexpected stimuli are reported to evoke larger P3 amplitudes (Verleger & Smigasiewicz, 2016). However, hearing the human mimic voice stimuli during the task condition was not "unexpected," in that we informed the participants of what they would be hearing, and there were equal numbers of human and animal vocalization stimuli. Another consideration is that sounds that are more familiar to a listener have been reported to show an enhancement of the N1 and P2 components in addition to effects 300–500 ms after stimulus onset (Kirmse et al., 2009). However, it was unclear whether our selection of animal vocalizations or the corresponding human mimics might have been deemed less familiar to most listeners. Nonetheless, human mimic voice is arguably heard less commonly by adult listeners than other forms of human voice and, in this regard, may have represented a less expected form of vocal utterance. Regardless, the combined findings of preferential processing for human mimic voice may have implications for anthropological questions such as the evolution of oral communication, as addressed next.

Processing of Mimic Voice and Oral Communication Evolution

In several contexts, human listeners can accurately identify human-imitated animal sounds (Lass et al., 1983). This study's findings of a processing preference for human mimic voice (nonverbal human vocalizations) relative to nonconspecific animal vocalizations may reflect more rudimentary processing underlying acoustic communication, oral mimicry, and oral mimesis, which harks back to theories of the origins of vocal language, or glottogenesis. Some anthropological theories regarding oral communication propose a major change in hominin evolution that included a transition from episodic to mimetic culture (Darwin et al., 1981; Donald, 1991; Hewes, 1973). Such theories espouse a survival advantage for primate species with ever-increasing mimetic skill, including oral mimesis, which is the production of conscious, self-initiated representational acts that are intentional (but not linguistic). Such utterances could be used to orally mimic the sounds of nature for hunting, to convey declarative communications among troop members, and for social-bonding events (Falk, 2004; Larsson, 2012, 2015; Lewis et al., 2018). Thus, the increased magnitude of late AEP components even during passive, nonattentive listening to conspecific vocalizations may reflect vestiges of the sensory processing needed to pick out the sounds and subtle nuances of other conspecifics' vocalizations. The rationale for examining late AEP responses to human conspecific versus nonconspecific vocalizations in this study was to test a general neurobiological taxonomic model of natural sound processing (see Figure 1A), which may well pertain to all primate species (or even all social mammalian species) with hearing and vocal communication ability. Thus, this general model may further serve as a template for future exploration of anthropological models of protonetworks that developed to subserve spoken communication in hominins.

Limitations and Future Directions

One limitation to consider when interpreting the contributions of the early AEP components (i.e., N1b and P2) to category membership (human vs. nonhuman animal) is the technical issue of sound source recording. In this study, all human mimic recordings capitalized on the same room acoustics (a sound booth), with the microphone located directly in front of the mouth to retain high-frequency energy, and thus clarity and naturalness. In contrast, the wide range of commercially available animal vocalizations selected may have varied considerably in their recording environment(s) and in the fidelity of the field recording equipment used. We were not able to discern any systematic bias in the human-mimicked recordings relative to the original animal calls that was obviously due to recording conditions. Nonetheless, echoic features, spectral sampling biases, and distance cues, among other subtle acoustic attributes, may have contributed systematically to the early AEP component amplitudes. Additionally, not all participants were tested with audiometry, so unreported mild hearing loss may have been another variable to consider. Future studies of natural sound processing should benefit from using well-controlled natural or seminatural vocal sounds with emphasis on particular spectral peaks and troughs, or from examining a given acoustic signal attribute (e.g., proximity cues) that can be parametrically manipulated.

Regarding language-related processing, a P600 component had previously been implicated in the processing of syntactic anomalies (Osterhout & Holcomb, 1992, 1995) when using serial visual presentation of words (rather than spoken words and phrases, to avoid limitations of current EEG methodologies). One limitation here regards the controversy over whether grouping event-related potential responses by domain is misleading, including the questions of whether the P600 component described herein truly belongs to the P3 (P3b or P300) family of components and whether different P600 variants account for syntactic, semantic, or other effects (Leckey & Federmeier, 2020). Though speculative, one intriguing possibility is that auditory attention to conspecific vocalization discrimination may prove to relate to the timing of a linguistic P600 effect. This may reflect, for instance, the temporal dynamics of sensory discriminations (auditory or visual) as vestiges of communicative signal processing that antedated spoken language reception systems. Thus, the present findings potentially tap into the neuroimaging field of language research and may be amenable to study in human toddlers during the co-emergence of acoustic nonlanguage sensory and spoken linguistic abilities.

Finally, humans are not the only species to use mimic voice as an overt behavior with survival advantages. For instance, lyrebirds, catbirds, and various parrot species have excellent mimicry abilities, as do several mammalian species such as dolphins, elephants, and wild cats, which may also serve as model systems for studying elements of conspecific sound discrimination and perception. A simple 2 × 2 cross-species design comparing, for instance, macaques and humans, in which monkey vocalizations and human mimics of those vocalizations are presented to both species, may help elucidate the neurobiology underlying conspecific voice-processing advantages.

In conclusion, this study places earlier limits on when human voice begins to be discriminated from nonconspecific voice, as early as the N1b component. This finding was perhaps only evident because corresponding animal vocalizations were used as control stimuli, which equated a wide variety of psychoacoustic and low-level acoustic features across categories. Additionally, we were unable to identify any particular attribute, or obvious set of attributes, that could predict a participant's performance on the discrimination task, suggesting that perception of conspecific vocalizations depends on either a constellation of different features or perhaps more holistic processing of all the features jointly. Further elucidating the various aspects of human vocal sound processing mechanisms, including mimic voice, should inform strategies germane to the advancement of intelligent hearing aid designs that use neuromimetic principles. Additionally, the identification of specific processing stages sensitive to human conspecific vocalizations should have implications for clinically relevant models regarding the timeline of neurodevelopmental markers for voice-processing sensitivity in human toddlers who are at risk of developing auditory communication deficits. Finally, this study's results provide support for a subset of anthropological models of how and why spoken protolanguage systems may have evolved.

Author Contributions

William J. Talkington and James W. Lewis designed and performed experiments, analyzed data, and wrote the article; all authors contributed to data analyses and assisted with data collection. William J. Talkington: Conceptualization (Supporting), Data curation (Lead), Formal analysis (Supporting), Methodology (Lead), Writing – original draft (Supporting). Jeremy Donai: Formal analysis (Supporting), Investigation (Supporting), Methodology (Supporting), Writing – review & editing (Supporting). Molly L. Layne: Formal analysis (Supporting), Software (Supporting). Andrew Forino: Data curation (Equal), Formal analysis (Supporting), Validation (Supporting). Sijin Wen: Formal analysis (Supporting). Si Gao: Formal analysis (Supporting). Margeaux M. Gray: Formal analysis (Supporting), Investigation (Supporting), Software (Supporting), Visualization (Supporting). Alexandra J. Ashraf: Data curation (Supporting), Formal analysis (Supporting). Gabriela N. Valencia: Data curation (Supporting), Formal analysis (Supporting). Brandon D. Smith: Data curation (Supporting). Stephanie K. Khoo: Data curation (Supporting). Stephen J. Gray: Data curation (Supporting), Formal analysis (Supporting). Norman Lass: Data curation (Supporting), Investigation (Supporting), Methodology (Supporting). Julie A. Brefczynski-Lewis: Conceptualization (Supporting), Investigation (Supporting), Validation (Supporting). Susannah Engdahl: Data curation (Supporting), Methodology (Supporting), Software (Supporting). David Graham: Conceptualization (Supporting), Investigation (Supporting). Chris A. Frum: Data curation (Supporting), Investigation (Supporting).

Supplementary Material

Supplemental Material S1. Part A: Error trial AEP data from Task condition. Part B: Machine learning classification of animal vocals and their human mimics.

Acknowledgments

This work was supported by National Institute of General Medical Sciences, Centers of Biomedical Research Excellence Grant GM103503 and National Center for Research Resources, Centers of Biomedical Research Excellence Grants E15524 and RR10007935 to the Sensory Neuroscience Research Center of West Virginia University (WVU) and to affiliated WVU Summer Undergraduate Research Internships; to the West Virginia IDeA Network of Biomedical Research Excellence program supported by National Institute of General Medical Sciences Award P20GM103434; plus an individual predoctoral award to W. J. T. funded by the Air Force Office of Scientific Research (American Society for Engineering Education, National Defense Science and Engineering Graduate Fellowship). The authors thank Hannah Ludwick of the Department of Biostatistics at West Virginia University (WVU) for helping with statistical analyses and Addie Boyd-Pratt and Haley Hixon of the Department of Communication Sciences and Disorders for assistance with signal analysis.


References

  1. Abrams, D. A. , Nicol, T. , Zecker, S. , & Kraus, N. (2009). Abnormal cortical processing of the syllable rate of speech in poor readers. The Journal of Neuroscience, 29(24), 7686–7693. https://doi.org/10.1523/JNEUROSCI.5242-08.2009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Beauchemin, M. , De Beaumont, L. , Vannasing, P. , Turcotte, A. , Arcand, C. , Belin, P. , & Lassonde, M. (2006). Electrophysiological markers of voice familiarity. European Journal of Neuroscience, 23(11), 3081–3086. https://doi.org/10.1111/j.1460-9568.2006.04856.x [DOI] [PubMed] [Google Scholar]
  3. Beckers, G. J. L. , Goossens, B. M. A. , & Cate, C. T. (2003). Perceptual salience of acoustic differences between conspecific and allospecific vocalizations in African collared-doves. Animal Behaviour, 65(3), 605–614. https://doi.org/10.1006/anbe.2003.2080 [Google Scholar]
  4. Belin, P. , Zatorre, R. J. , Lafaille, P. , Ahad, P. , & Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403(6767), 309–312. https://doi.org/10.1038/35002078 [DOI] [PubMed] [Google Scholar]
  5. Belizaire, G. , Fillion-Bilodeau, S. , Chartrand, J. P. , Bertrand-Gauvin, C. , & Belin, P. (2007). Cerebral response to ‘voiceness’: A functional magnetic resonance imaging study. NeuroReport, 18(1), 29–33. https://doi.org/10.1097/WNR.0b013e3280122718 [DOI] [PubMed] [Google Scholar]
  6. Billings, C. J. , Bennett, K. O. , Molis, M. R. , & Leek, M. R. (2011). Cortical encoding of signals in noise: Effects of stimulus type and recording paradigm. Ear and Hearing, 32(1), 53–60. https://doi.org/10.1097/AUD.0b013e3181ec5c46 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Blasi, A. , Mercure, E. , Lloyd-Fox, S. , Thomson, A. , Brammer, M. , Sauter, D. , Deeley, Q. , Barker, G. J. , Renvall, V. , Deoni, S. , Gasston, D. , Williams, S. C. , Johnson, M. H. , Simmons, A. , & Murphy, D. G. (2011). Early specialization for voice and emotion processing in the infant brain. Current Biology, 21(14), 1220–1224. https://doi.org/10.1016/j.cub.2011.06.009 [DOI] [PubMed] [Google Scholar]
  8. Brefczynski-Lewis, J. A. , & Lewis, J. W. (2017). Auditory object perception: A neurobiological model and prospective review. Neuropsychologia, 105, 223–242. https://doi.org/10.1016/j.neuropsychologia.2017.04.034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Cacioppo, J. T. , Tassinary, L. G. , & Berntson, G. (2007). Handbook of psychophysiology. Cambridge University Press. https://doi.org/10.1017/9781107415782 [Google Scholar]
  10. Capilla, A. , Belin, P. , & Gross, J. (2013). The early spatio-temporal correlates and task independence of cerebral voice processing studied with MEG. Cerebral Cortex, 23(6), 1388–1395. https://doi.org/10.1093/cercor/bhs119 [DOI] [PubMed] [Google Scholar]
  11. Chapman, R. M. , & Bragdon, H. R. (1964). Evoked responses to numerical and non-numerical visual stimuli while problem solving. Nature, 203, 1155–1157. https://doi.org/10.1038/2031155a0 [DOI] [PubMed] [Google Scholar]
  12. Charest, I. , Pernet, C. R. , Rousselet, G. A. , Quinones, I. , Latinus, M. , Fillion-Bilodeau, S. , Chartrand, J. P. , & Belin, P. (2009). Electrophysiological evidence for an early processing of human voices. BMC Neuroscience, 10, 127. https://doi.org/10.1186/1471-2202-10-127 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Clynes, M. (1969). Dynamics of vertex evoked potentials: The R-M brain function. In Donchin E. & Lindsley D. B. (Eds.), Averaged evoked potentials: Methods, results, and evaluations (pp. 363–374). U.S. National Aeronautics and Space Administration. https://doi.org/10.1037/13016-013 [Google Scholar]
  14. Crowley, K. E. , & Colrain, I. M. (2004). A review of the evidence for P2 being an independent component process: Age, sleep and modality. Clinical Neurophysiology, 115(4), 732–744. https://doi.org/10.1016/j.clinph.2003.11.021 [DOI] [PubMed] [Google Scholar]
  15. Dang, J. , & Honda, K. (1997). Acoustic characteristics of the piriform fossa in models and humans. The Journal of the Acoustical Society of America, 101(1), 456–465. https://doi.org/10.1121/1.417990 [DOI] [PubMed] [Google Scholar]
  16. Darwin, C. , Bonner, J. , & May, R. (1981). The descent of man, and selection in relation to sex. Princeton University Press. https://doi.org/10.5962/bhl.title.2092 [Google Scholar]
  17. De Lucia, M. , Camen, C. , Clarke, S. , & Murray, M. M. (2009). The role of actions in auditory object discrimination. NeuroImage, 48(2), 475–485. https://doi.org/10.1016/j.neuroimage.2009.06.041 [DOI] [PubMed] [Google Scholar]
  18. De Lucia, M. , Clarke, S. , & Murray, M. M. (2010). A temporal hierarchy for conspecific vocalization discrimination in humans. The Journal of Neuroscience, 30(33), 11210–11221. https://doi.org/10.1523/JNEUROSCI.2239-10.2010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Delorme, A. , & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(1), 9–21. https://doi.org/10.1016/j.jneumeth.2003.10.009 [DOI] [PubMed] [Google Scholar]
  20. Delorme, A. , Miyakoshi, M. , Jung, T. P. , & Makeig, S. (2015). Grand average ERP-image plotting and statistics: A method for comparing variability in event-related single-trial EEG activities across subjects and conditions. Journal of Neuroscience Methods, 250, 3–6. https://doi.org/10.1016/j.jneumeth.2014.10.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Donald, M. (1991). Origins of the modern mind: Three stages in the evolution of culture and cognition. Harvard University Press. [Google Scholar]
  22. Dooling, R. J. , Brown, S. D. , Klump, G. M. , & Okanoya, K. (1992). Auditory perception of conspecific and heterospecific vocalizations in birds: Evidence for special processes. Journal of Comparative Psychology, 106(1), 20–28. https://doi.org/10.1037/0735-7036.106.1.20 [DOI] [PubMed] [Google Scholar]
  23. Engel, L. R. , Frum, C. , Puce, A. , Walker, N. A. , & Lewis, J. W. (2009). Different categories of living and non-living sound sources activate distinct cortical networks. NeuroImage, 47(4), 1778–1791. https://doi.org/10.1016/j.neuroimage.2009.05.041 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Falk, D. (2004). The roles of infant crying and motherese during prelinguistic evolution in early hominins. American Journal of Physical Anthropology, 93–93. [Google Scholar]
  25. Fitch, W. T. , & Fritz, J. B. (2006). Rhesus macaques spontaneously perceive formants in conspecific vocalizations. The Journal of the Acoustical Society of America, 120(4), 2132–2141. https://doi.org/10.1121/1.2258499 [DOI] [PubMed] [Google Scholar]
  26. Folstein, J. R. , & Van Petten, C. (2008). Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology, 45(1), 152–170. https://doi.org/10.1111/j.1469-8986.2007.00602.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Friederici, A. D. (2005). Neurophysiological markers of early language acquisition: From syllables to sentences. Trends in Cognitive Science, 9(10), 481–488. https://doi.org/10.1016/j.tics.2005.08.008 [DOI] [PubMed] [Google Scholar]
  28. Fullgrabe, C. , Baer, T. , Stone, M. A. , & Moore, B. C. (2010). Preliminary evaluation of a method for fitting hearing aids with extended bandwidth. International Journal of Audiology, 49(10), 741–753. https://doi.org/10.3109/14992027.2010.495084 [DOI] [PubMed] [Google Scholar]
  29. Gaskill, C. S. , Awan, J. A. , Watts, C. R. , & Awan, S. N. (2017). Acoustic and perceptual classification of within-sample normal, intermittently dysphonic, and consistently dysphonic voice types. Journal of Voice, 31(2), 218–228. https://doi.org/10.1016/j.jvoice.2016.04.016 [DOI] [PubMed] [Google Scholar]
  30. Geangu, E. , Quadrelli, E. , Lewis, J. W. , Macchi Cassia, V. , & Turati, C. (2015). By the sound of it. An ERP investigation of human action sound processing in 7-month-old infants. Developmental Cognitive Neuroscience, 12, 134–144. https://doi.org/10.1016/j.dcn.2015.01.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Gehring, W. J. , Gratton, G. , Coles, M. G. H. , & Donchin, E. (1992). Probability effects on stimulus evaluation and response processes. Journal of Experimental Psychology: Human Perception and Performance, 18(1), 198–216. https://doi.org/10.1037/0096-1523.18.1.198 [DOI] [PubMed] [Google Scholar]
  32. Godey, B. , Schwartz, D. , de Graaf, J. B. , Chauvel, P. , & Liegeois-Chauvel, C. (2001). Neuromagnetic source localization of auditory evoked fields and intracerebral evoked potentials: A comparison of data in the same patients. Clinical Neurophysiology, 112(10), 1850–1859. https://doi.org/10.1016/s1388-2457(01)00636-8 [DOI] [PubMed] [Google Scholar]
  33. Grossmann, T. , Oberecker, R. , Koch, S. P. , & Friederici, A. D. (2010). The developmental origins of voice processing in the human brain. Neuron, 65(6), 852–858. https://doi.org/10.1016/j.neuron.2010.03.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Gunji, A. , Kakigi, R. , & Hoshiyama, M. (2000). Spatiotemporal source analysis of vocalization-associated magnetic fields. Cognitive Brain Research, 9(2), 157–163. https://doi.org/10.1016/s0926-6410(99)00054-3 [DOI] [PubMed] [Google Scholar]
  35. Heinks-Maldonado, T. H. , Mathalon, D. H. , Gray, M. , & Ford, J. M. (2005). Fine-tuning of auditory cortex during speech production. Psychophysiology, 42(2), 180–190. https://doi.org/10.1111/j.1469-8986.2005.00272.x [DOI] [PubMed] [Google Scholar]
  36. Hewes, G. W. (1973). Primate communication and the gestural origin of language. Current Anthropology, 14, 5–24. https://doi.org/10.1086/204019 [Google Scholar]
  37. Imai, M. , & Kita, S. (2014). The sound symbolism bootstrapping hypothesis for language acquisition and language evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1651), 20130298. https://doi.org/10.1098/rstb.2013.0298 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Kalaiah, M. K. , & Shastri, U. (2016). Cortical auditory event related potentials (P300) for frequency changing dynamic tones. Journal of Audiology & Otology, 20(1), 22–30. https://doi.org/10.7874/jao.2016.20.1.22 [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Kikuchi, Y. , Horwitz, B. , Mishkin, M. , & Rauschecker, J. P. (2014). Processing of harmonics in the lateral belt of macaque auditory cortex. Frontiers in Neuroscience, 8, 204. https://doi.org/10.3389/fnins.2014.00204 [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Kirmse, U. , Jacobsen, T. , & Schroger, E. (2009). Familiarity affects environmental sound processing outside the focus of attention: An event-related potential study. Clinical Neurophysiology, 120(5), 887–896. https://doi.org/10.1016/j.clinph.2009.02.159 [DOI] [PubMed] [Google Scholar]
  41. Koenig, T. , & Melie-Garcia, L. (2010). A method to determine the presence of averaged event-related fields using randomization tests. Brain Topography, 23(3), 233–242. https://doi.org/10.1007/s10548-010-0142-1 [DOI] [PubMed] [Google Scholar]
  42. Larsson, M. (2012). Incidental sounds of locomotion in animal cognition. Animal Cognition, 15(1), 1–13. https://doi.org/10.1007/s10071-011-0433-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Larsson, M. (2015). Tool-use-associated sound in the evolution of language. Animal Cognition, 18(5), 993–1005. https://doi.org/10.1007/s10071-015-0885-x [DOI] [PubMed] [Google Scholar]
  44. Lass, N. J. , Eastham, S. K. , Wright, T. L. , Hinzman, A. R. , Mills, K. J. , & Hefferin, A. L. (1983). Listeners' identification of human-imitated animal sounds. Perceptual and Motor Skills, 57(3, Pt. 1), 995–998. https://doi.org/10.2466/pms.1983.57.3.995 [DOI] [PubMed] [Google Scholar]
  45. Lattner, S. , Maess, B. , Wang, Y. , Schauer, M. , Alter, K. , & Friederici, A. D. (2003). Dissociation of human and computer voices in the brain: Evidence for a preattentive gestalt-like perception. Human Brain Mapping, 20(1), 13–21. https://doi.org/10.1002/hbm.10118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Leckey, M. , & Federmeier, K. D. (2020). The P3b and P600(s): Positive contributions to language comprehension. Psychophysiology, 57(7), Article e13351. https://doi.org/10.1111/psyp.13351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Lehmann, D. , & Skrandies, W. (1980). Reference-free identification of components of checkerboard-evoked multichannel potential fields. Electroencephalography and Clinical Neurophysiology, 48(6), 609–621. https://doi.org/10.1016/0013-4694(80)90419-8 [DOI] [PubMed] [Google Scholar]
  48. Levy, D. A. , Granot, R. , & Bentin, S. (2001). Processing specificity for human voice stimuli: Electrophysiological evidence. NeuroReport, 12(12), 2653–2657. https://doi.org/10.1097/00001756-200108280-00013 [DOI] [PubMed] [Google Scholar]
  49. Levy, D. A. , Granot, R. , & Bentin, S. (2003). Neural sensitivity to human voices: ERP evidence of task and attentional influences. Psychophysiology, 40(2), 291–305. https://doi.org/10.1111/1469-8986.00031 [DOI] [PubMed] [Google Scholar]
  50. Lewis, J. W. , Silberman, M. J. , Donai, J. J. , Frum, C. A. , & Brefczynski-Lewis, J. A. (2018). Hearing and orally mimicking different acoustic–semantic categories of natural sound engage distinct left hemisphere cortical regions. Brain and Language, 183, 64–78. https://doi.org/10.1016/j.bandl.2018.05.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Lewis, J. W. , Talkington, W. J. , Tallaksen, K. C. , & Frum, C. A. (2012). Auditory object salience: Human cortical processing of non-biological action sounds and their acoustic signal attributes. Frontiers in Systems Neuroscience, 6(27), 1–16. https://doi.org/10.3389/fnsys.2012.00027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Liegeois-Chauvel, C. , Musolino, A. , Badier, J. M. , Marquis, P. , & Chauvel, P. (1994). Evoked-potentials recorded from the auditory-cortex in man—Evaluation and topography of the middle latency components. Electroencephalography and Clinical Neurophysiology, 92(3), 204–214. https://doi.org/10.1016/0168-5597(94)90064-7 [DOI] [PubMed] [Google Scholar]
  53. Lightfoot, G. (2016). Summary of the N1–P2 cortical auditory evoked potential to estimate the auditory threshold in adults. Seminars in Hearing, 37(1), 1–8. https://doi.org/10.1055/s-0035-1570334 [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Lowell, S. Y. , Colton, R. H. , Kelley, R. T. , & Mizia, S. A. (2013). Predictive value and discriminant capacity of cepstral- and spectral-based measures during continuous speech. Journal of Voice, 27(4), 393–400. https://doi.org/10.1016/j.jvoice.2013.02.005 [DOI] [PubMed] [Google Scholar]
  55. Luck, S. J. (2005). An introduction to the event-related potential technique. MIT Press. [Google Scholar]
  56. Luck, S. J. , & Hillyard, S. A. (1994). Electrophysiological correlates of feature analysis during visual search. Psychophysiology, 31(3), 291–308. https://doi.org/10.1111/j.1469-8986.1994.tb02218.x [DOI] [PubMed] [Google Scholar]
  57. Maris, E. , & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177–190. https://doi.org/10.1016/j.jneumeth.2007.03.024 [DOI] [PubMed] [Google Scholar]
  58. Medvedev, A. V. , Chiao, F. , & Kanwal, J. S. (2002). Modeling complex tone perception: Grouping harmonics with combination-sensitive neurons. Biological Cybernetics, 86(6), 497–505. https://doi.org/10.1007/s00422-002-0316-3 [DOI] [PubMed] [Google Scholar]
  59. Medvedev, A. V. , & Kanwal, J. S. (2004). Local field potentials and spiking activity in the primary auditory cortex in response to social calls. Journal of Neurophysiology, 92(1), 52–65. https://doi.org/10.1152/jn.01253.2003 [DOI] [PubMed] [Google Scholar]
  60. Michalewski, H. J. , Prasher, D. K. , & Starr, A. (1986). Latency variability and temporal interrelationships of the auditory event-related potentials (N1, P2, N2, and P3) in normal subjects. Electroencephalography and Clinical Neurophysiology, 65(1), 59–71. https://doi.org/10.1016/0168-5597(86)90037-7 [DOI] [PubMed] [Google Scholar]
  61. Monson, B. B. , Hunter, E. J. , Lotto, A. J. , & Story, B. H. (2014). The perceptual significance of high-frequency energy in the human voice. Frontiers in Psychology, 5, 587. https://doi.org/10.3389/fpsyg.2014.00587 [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Monson, B. B. , Hunter, E. J. , & Story, B. H. (2012). Horizontal directivity of low- and high-frequency energy in speech and singing. The Journal of the Acoustical Society of America, 132(1), 433–441. https://doi.org/10.1121/1.4725963 [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Moore, B. C. , & Tan, C. T. (2003). Perceived naturalness of spectrally distorted speech and music. The Journal of the Acoustical Society of America, 114(1), 408–419. https://doi.org/10.1121/1.1577552 [DOI] [PubMed] [Google Scholar]
  64. Murray, M. M. , Brunet, D. , & Michel, C. M. (2008). Topographic ERP analyses: A step-by-step tutorial review. Brain Topography, 20(4), 249–264. https://doi.org/10.1007/s10548-008-0054-5 [DOI] [PubMed] [Google Scholar]
  65. Murray, M. M. , Camen, C. , Gonzalez Andino, S. L. , Bovet, P. , & Clarke, S. (2006). Rapid brain discrimination of sounds of objects. The Journal of Neuroscience, 26(4), 1293–1302. https://doi.org/10.1523/jneurosci.4511-05.2006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Näätänen, R. , & Picton, T. W. (1986). N2 and automatic versus controlled processes. Electroencephalography and Clinical Neurophysiology Supplement, 38, 169–186. [PubMed] [Google Scholar]
  67. Näätänen, R. , & Picton, T. W. (1987). The N1 wave of the human electric and magnetic response to sound: A review and an analysis of the component structure. Psychophysiology, 24(4), 375–425. https://doi.org/10.1111/j.1469-8986.1987.tb00311.x [DOI] [PubMed] [Google Scholar]
  68. O'Donnell, B. F. , Shenton, M. E. , McCarley, R. W. , Faux, S. F. , Smith, R. S. , Salisbury, D. F. , Nestor, P. G. , Pollak, S. D. , Kikinis, R. , & Jolesz, F. A. (1993). The auditory N2 component in schizophrenia: Relationship to MRI temporal lobe gray matter and to other ERP abnormalities. Biological Psychiatry, 34(1–2), 26–40. https://doi.org/10.1016/0006-3223(93)90253-a [DOI] [PubMed] [Google Scholar]
  69. Osterhout, L. , & Holcomb, P. J. (1992). Event-related brain potentials elicited by syntactic anomaly. Journal of Memory and Language, 31(6), 785–806. https://doi.org/10.1016/0749-596X(92)90039-Z [Google Scholar]
  70. Osterhout, L. , & Holcomb, P. J. (1995). Event-related potentials and language comprehension. In Rugg M. D. & Coles M. G. H. (Eds.), Electrophysiology of mind: Event-related brain potentials and cognition (Chap. 6). Oxford University Press. [Google Scholar]
  71. Peirce, J. (2009). Generating stimuli for neuroscience using PsychoPy. Frontiers in Neuroinformatics, 2, 10. https://doi.org/10.3389/neuro.11.010.2008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Pernet, C. R. , McAleer, P. , Latinus, M. , Gorgolewski, K. J. , Charest, I. , Bestelmeyer, P. E. , Watson, R. H. , Fleming, D. , Crabbe, F. , Valdes-Sosa, M. , & Belin, P. (2015). The human voice areas: Spatial organization and inter-individual variability in temporal and extra-temporal cortices. NeuroImage, 119, 164–174. https://doi.org/10.1016/j.neuroimage.2015.06.050 [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Perrodin, C. , Kayser, C. , Logothetis, N. K. , & Petkov, C. I. (2011). Voice cells in the primate temporal lobe. Current Biology, 21(16), 1408–1415. https://doi.org/10.1016/j.cub.2011.07.028 [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Petkov, C. I. , Kayser, C. , Steudel, T. , Whittingstall, K. , Augath, M. , & Logothetis, N. K. (2008). A voice region in the monkey brain. Nature Neuroscience, 11(3), 367–374. https://doi.org/10.1038/nn2043 [DOI] [PubMed] [Google Scholar]
  75. Phan, M. L. , & Vicario, D. S. (2010). Hemispheric differences in processing of vocalizations depend on early experience. Proceedings of the National Academy of Sciences of the United States of America, 107(5), 2301–2306. https://doi.org/10.1073/pnas.0900091107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Picton, T. W. , Alain, C. , Woods, D. L. , John, M. S. , Scherg, M. , Valdes-Sosa, P. , Bosch-Bayard, J. , & Trujillo, N. J. (1999). Intracerebral sources of human auditory-evoked potentials. Audiology and Neurotology, 4(2), 64–79. https://doi.org/10.1159/000013823 [DOI] [PubMed] [Google Scholar]
  77. Polich, J. (2007). Updating P300: An integrative theory of P3a and P3b. Clinical Neurophysiology, 118(10), 2128–2148. https://doi.org/10.1016/j.clinph.2007.04.019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. R Core Team. (2013). R: A language and environment for statistical computing. http://www.R-project.org/
  79. Recanzone, G. H. (2008). Representation of con-specific vocalizations in the core and belt areas of the auditory cortex in the alert macaque monkey. The Journal of Neuroscience, 28(49), 13184–13193. https://doi.org/10.1523/JNEUROSCI.3619-08.2008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Reddy, R. K. , Ramachandra, V. , Kumar, N. , & Singh, N. C. (2009). Categorization of environmental sounds. Biological Cybernetics, 100(4), 299–306. https://doi.org/10.1007/s00422-009-0299-4 [DOI] [PubMed] [Google Scholar]
  81. Rosch, E. H. (1973). Natural categories. Cognitive Psychology, 4(3), 328–350. https://doi.org/10.1016/0010-0285(73)90017-0 [Google Scholar]
  82. Rumberg, B. , McMillan, K. , Rea, C. , & Graham, D. W. (2008). Lateral coupling in silicon cochlear models. 51st Midwest Symposium on Circuits and Systems (Vols. 1 and 2, pp. 25–28). https://doi.org/10.1109/Mwscas.2008.4616727 [Google Scholar]
  83. Russ, B. E. , Ackelson, A. L. , Baker, A. E. , & Cohen, Y. E. (2008). Coding of auditory-stimulus identity in the auditory non-spatial processing stream. Journal of Neurophysiology, 99(1), 87–95. https://doi.org/10.1152/jn.01069.2007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Schweinberger, S. R. (2001). Human brain potential correlates of voice priming and voice recognition. Neuropsychologia, 39(9), 921–936. https://doi.org/10.1016/S0028-3932(01)00023-9 [DOI] [PubMed] [Google Scholar]
  85. Seither-Preisler, A. , Johnson, L. , Krumbholz, K. , Nobbe, A. , Patterson, R. , Seither, S. , & Lutkenhoner, B. (2007). Tone sequences with conflicting fundamental pitch and timbre changes are heard differently by musicians and nonmusicians. Journal of Experimental Psychology: Human Perception and Performance, 33(3), 743–751. https://doi.org/10.1037/0096-1523.33.3.743 [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Shahin, A. J. , Bosnyak, D. J. , Trainor, L. J. , & Roberts, L. E. (2003). Enhancement of neuroplastic P2 and N1c auditory evoked potentials in musicians. The Journal of Neuroscience, 23(13), 5545–5552. https://doi.org/10.1523/jneurosci.23-13-05545.2003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Shahin, A. J. , Roberts, L. E. , Miller, L. M. , McDonald, K. L. , & Alain, C. (2007). Sensitivity of EEG and MEG to the N1 and P2 auditory evoked responses modulated by spectral complexity of sounds. Brain Topography, 20(2), 55–61. https://doi.org/10.1007/s10548-007-0031-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Shoji, K. , Regenbogen, E. , Yu, J. , & Blaugrund, S. (1991a). High-frequency components of normal voice. Journal of Voice, 5(1), 29–35. https://doi.org/10.1016/S0892-1997(05)80160-2 [Google Scholar]
  89. Shoji, K. , Regenbogen, E. , Yu, J. , & Blaugrund, S. (1991b). High-frequency power ratio of breathy voice. The Laryngoscope, 102(3). https://doi.org/10.1288/00005537-199203000-00007 [DOI] [PubMed] [Google Scholar]
  90. Simson, R. , Vaughan, H. G., Jr. , & Ritter, W. (1976). The scalp topography of potentials associated with missing visual or auditory stimuli. Electroencephalography and Clinical Neurophysiology, 40(1), 33–42. https://doi.org/10.1016/0013-4694(76)90177-2 [DOI] [PubMed] [Google Scholar]
  91. Simson, R. , Vaughan, H. G., Jr. , & Ritter, W. (1977). The scalp topography of potentials in auditory and visual discrimination tasks. Electroencephalography and Clinical Neurophysiology, 42(4), 528–535. https://doi.org/10.1016/0013-4694(77)90216-4 [DOI] [PubMed] [Google Scholar]
  92. Squires, K. C. , Squires, N. K. , & Hillyard, S. A. (1975). Decision-related cortical potentials during an auditory signal detection task with cued observation intervals. Journal of Experimental Psychology: Human Perception and Performance, 1(3), 268–279. https://doi.org/10.1037//0096-1523.1.3.268 [DOI] [PubMed] [Google Scholar]
  93. Talkington, W. J. , Rapuano, K. M. , Hitt, L. A. , Frum, C. A. , & Lewis, J. W. (2012). Humans mimicking animals: A cortical hierarchy for human vocal communication sounds. The Journal of Neuroscience, 32(23), 8084–8093. https://doi.org/10.1523/JNEUROSCI.1118-12.2012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Tchernichovski, O. , Mitra, P. P. , Lints, T. , & Nottebohm, F. (2001). Dynamics of the vocal imitation process: How a zebra finch learns its song. Science, 291(5513), 2564–2569. https://doi.org/10.1126/science.1058522 [DOI] [PubMed] [Google Scholar]
  95. Tian, B. , Reser, D. , Durham, A. , Kustov, A. , & Rauschecker, J. P. (2001). Functional specialization in rhesus monkey auditory cortex. Science, 292(5515), 290–293. https://doi.org/10.1126/science.1058911 [DOI] [PubMed] [Google Scholar]
  96. Tremblay, K. , Kraus, N. , McGee, T. , Ponton, C. , & Otis, B. (2001). Central auditory plasticity: Changes in the N1–P2 complex after speech-sound training. Ear and Hearing, 22(2), 79–90. https://doi.org/10.1097/00003446-200104000-00001 [DOI] [PubMed] [Google Scholar]
  97. Uppenkamp, S. , Johnsrude, I. S. , Norris, D. , Marslen-Wilson, W. , & Patterson, R. D. (2006). Locating the initial stages of speech-sound processing in human temporal cortex. NeuroImage, 31(3), 1284–1296. https://doi.org/10.1016/j.neuroimage.2006.01.004 [DOI] [PubMed] [Google Scholar]
  98. Verleger, R. , & Smigasiewicz, K. (2016). Do rare stimuli evoke large P3s by being unexpected? A comparison of oddball effects between standard-oddball and prediction-oddball tasks. Advances in Cognitive Psychology, 12(2), 88–104. https://doi.org/10.5709/acp-0189-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Winkler, I. , Tervaniemi, M. , & Näätänen, R. (1997). Two separate codes for missing-fundamental pitch in the human auditory cortex. The Journal of the Acoustical Society of America, 102(2, Pt. 1), 1072–1082. https://doi.org/10.1121/1.419860 [DOI] [PubMed] [Google Scholar]
  100. Wioland, N. , Rudolf, G. , & Metz-Lutz, M. N. (2001). Electrophysiological evidence of persisting unilateral auditory cortex dysfunction in the late outcome of Landau and Kleffner syndrome. Clinical Neurophysiology, 112(2), 319–323. https://doi.org/10.1016/s1388-2457(00)00528-9 [DOI] [PubMed] [Google Scholar]
  101. Zaske, R. , Schweinberger, S. R. , Kaufmann, J. M. , & Kawahara, H. (2009). In the ear of the beholder: Neural correlates of adaptation to voice gender. European Journal of Neuroscience, 30(3), 527–534. https://doi.org/10.1111/j.1460-9568.2009.06839.x [DOI] [PubMed] [Google Scholar]
