Significance
The human voice provides a wealth of social information, including who is speaking. A salient voice in a child’s life is mother's voice, which guides social function during development. Here we identify brain circuits that are selectively engaged in children by their mother’s voice and show that this brain activity predicts social communication abilities. Nonsense words produced by mother activate multiple brain systems, including reward, emotion, and face-processing centers, reflecting how widely mother’s voice is broadcast throughout a child’s brain. Importantly, this activity provides a neural fingerprint of children’s social communication abilities. This approach provides a template for investigating social function in clinical disorders, such as autism, in which perception of biologically salient voices may be impaired.
Keywords: auditory, voice, reward, brain, children
Abstract
The human voice is a critical social cue, and listeners are extremely sensitive to the voices in their environment. One of the most salient voices in a child’s life is mother's voice: Infants discriminate their mother’s voice from the first days of life, and this stimulus is associated with guiding emotional and social function during development. Little is known regarding the functional circuits that are selectively engaged in children by biologically salient voices such as mother’s voice or whether this brain activity is related to children’s social communication abilities. We used functional MRI to measure brain activity in 24 healthy children (mean age, 10.2 y) while they attended to brief (<1 s) nonsense words produced by their biological mother and two female control voices and explored relationships between speech-evoked neural activity and social function. Compared to female control voices, mother’s voice elicited greater activity in primary auditory regions in the midbrain and cortex; voice-selective superior temporal sulcus (STS); the amygdala, which is crucial for processing of affect; nucleus accumbens and orbitofrontal cortex of the reward circuit; anterior insula and cingulate of the salience network; and a subregion of fusiform gyrus associated with face perception. The strength of brain connectivity between voice-selective STS and reward, affective, salience, memory, and face-processing regions during mother’s voice perception predicted social communication skills. Our findings provide a novel neurobiological template for investigation of typical social development as well as clinical disorders, such as autism, in which perception of biologically and socially salient voices may be impaired.
The human voice is a critical social cue for children. Beyond the semantic information contained in speech, this acoustical signal provides a wealth of socially important information. For example, the human voice provides information regarding who is speaking, a highly salient perceptual feature that has been described as an “auditory face” (1). From the earliest stages of development, human listeners are extremely sensitive to the different voices in their environment (2), reflecting the importance of this social cue to human interaction and communication.
Listeners are particularly sensitive to the familiar voices encountered in their everyday environment, and arguably the most salient vocal source in a child’s life is mother’s voice. Mother’s voice is a constant and familiar presence in a child’s environment, beginning at a time when these vocal sounds and vibrations are conducted through the intrauterine environment to the fetus’ developing auditory pathways (3). Early exposure to mother’s voice facilitates recognition of this sound source and establishes it as a preferred stimulus: From the first days of life, children can identify their mother’s voice and will actively work to hear this sound source in preference to unfamiliar female voices (2). Throughout development, communicative cues in mother’s voice convey critical information to guide behavior (4–6) and learning (7). For example, hearing a recording of one’s own mother’s voice is a source of emotional comfort for preschoolers during stressful situations, even when the content of the speech is meaningless (5). Furthermore, when school-age girls experience a stressful situation, hearing their mother’s voice reduces their cortisol levels, a biomarker of stress, and increases their oxytocin levels, a hormone associated with social bonding (4). These studies have highlighted the profound influence that mother’s voice has on children’s cognitive, emotional, and social function.
Despite the behavioral importance of mother’s voice for critical aspects of emotional and social development, little is known about the mechanisms by which socially salient vocal sources shape the developing brain. Near-infrared spectroscopy (8) and EEG (9) studies examining responses to mother’s voice have focused on young children (≤6 mo old) and have found increased neural activity for mother’s voice compared to female control voices; however, the methods used in these studies are unable to provide detailed information about the brain areas and functional circuits underlying the perception of mother’s voice. Therefore, a critical question remains: What are the neural representations of a biologically salient vocal source in a child’s brain?
To investigate this question, we used functional MRI (fMRI) and measured brain activity in 24 typically developing children (7–12 y old; see Tables S1 and S2) in response to their mother’s voice, an example of a highly socially salient vocal source in a child’s life. An important component of our experimental protocol included vocal recording sessions of each participant’s mother and two female control voices, both of whom are also mothers and were not known to the study participants, for subsequent presentation during functional brain imaging (Fig. 1A; see Methods and Audio Files S1–S6 for audio examples). During the recording sessions, mothers produced three four-syllable nonsense words, which were used to avoid activating semantic systems in the brain (10), thereby enabling a focus on the neural responses to each speaker’s vocal characteristics.
Table S1.
Demographic and IQ measures
| Measure | TD, n = 24 | Population mean (SD) |
| Gender ratio, M:F | 17:7 | |
| Age, y | 10.22 ± 1.44 | |
| IQ, WASI scale | | |
| Full-scale IQ | 119.08 ± 12.00 | 100 (15) |
| Verbal IQ | 120.42 ± 13.87 | 100 (15) |
| Performance IQ | 113.54 ± 13.80 | 100 (15) |
| Achievement, WIAT | ||
| Word reading | 114.04 ± 9.18 | 100 (15) |
| Reading comprehension | 114.79 ± 9.02 | 100 (15) |
Demographic and mean IQ scores are shown for the full sample. F, female; M, male; TD, typically developing children; WASI, Wechsler Abbreviated Scale of Intelligence; WIAT, Wechsler Individual Achievement Test.
Table S2.
Neuropsychological measures of social and language abilities
| Measure | TD, n = 24 | Population mean (SD) |
| SRS-2 | | |
| Total Standard Score | 43.13 ± 5.24 | 30.9 (25.3) |
| Social Communication Standard Score* | 42.29 ± 4.63 | |
| Social Motivation Standard Score* | 44.75 ± 6.22 | |
| CELF-4 | | |
| Core Language Standard Score | 109.92 ± 15.24† | 100 (15) |
| Receptive Language Standard Score | 104.29 ± 17.06† | 100 (15) |
| Expressive Language Standard Score | 113.67 ± 15.91† | 100 (15) |
| Concepts and following directions | 9.71 ± 3.37 | 10 (3) |
| Recalling sentences | 11.54 ± 3.48 | 10 (3) |
| Formulated sentences | 12.21 ± 3.13 | 10 (3) |
| Word classes – total | 13.04 ± 2.94 | 10 (3) |
| Receptive | 13.00 ± 2.77 | 10 (3) |
| Expressive | 13.08 ± 3.16 | 10 (3) |
| Spoken paragraphs | 10.17 ± 2.37 | 10 (3) |
| Familiar sequences | 10.75 ± 3.40 | 10 (3) |
| CTOPP | | |
| Phonological Awareness Standard Score | 107.50 ± 10.73 | 100 (15) |
| Alternate Rapid Naming Standard Score | 99.25 ± 17.27 | 100 (15) |
| Elision | 11.58 ± 2.02 | 10 (3) |
| Blending words | 10.92 ± 2.43 | 10 (3) |
| Nonword repetition | 9.46 ± 2.02 | 10 (3) |
| Rapid color naming | 9.92 ± 3.67 | 10 (3) |
| Rapid object naming | 9.83 ± 3.12 | 10 (3) |
Neuropsychological measures of social, language, and phonological abilities are shown for the full sample. CELF-4, Clinical Evaluation of Language Fundamentals, 4th edition (23); CTOPP, Comprehensive Test of Phonological Processing; SRS-2, Social Responsiveness Scale, Second Edition (22); TD, typically developing children.
*Population means and SDs are not reported for Social Communication and Social Motivation Standard Scores (22).
†Word Classes 1 was used instead of Word Classes 2 to compute standardized scores for two subjects who were below the age of 7 y.
Fig. 1.
fMRI experimental design, acoustical analysis, and behavioral results. (A) Randomized, rapid event-related design: During fMRI data collection, three auditory nonsense words, produced by three different speakers, were presented to the child participants at a comfortable listening level. The three speakers consisted of the child’s mother and two female control voices. Nonspeech environmental sounds were also presented to enable baseline comparisons for the speech contrasts of interest. All auditory stimuli were 956 ms in duration and were equated for rms amplitude. (B) Acoustical analyses show that vocal samples produced by the participants’ mothers were similar to the female control voice samples for individual acoustical measures. (C) Results from behavioral ratings, collected in an independent cohort of children who did not participate in the fMRI study, show that female control voice samples were rated equally as pleasant as, and more exciting than, the mother’s voice samples. *P < 0.05; NS, not significant. (D) Children who participated in the fMRI study were able to identify their mother’s voice with high levels of accuracy, supporting the sensitivity of these young listeners to their mother’s voice. The horizontal line represents chance level for the mother’s voice identification task.
We had two primary goals for the data analysis. First, we wanted to probe neural representations and circuits elicited by mother’s voice across all participants. We hypothesized that the critical role of mother’s voice in social and emotional learning and its function as a rewarding stimulus would facilitate a distinct representation of this sound source in the minds of children, reflected by neural activity and connectivity patterns in auditory, voice-selective (11), reward (12), and social cognition (13) systems in the brain. The second goal of the analysis was to explore individual differences in brain responses to mother’s voice among children. We reasoned that children’s social communication and language function could potentially account for individual differences in brain responses to mother’s voice. Although it is established that children show a range of cognitive and language abilities, it also has been shown that they demonstrate a range of social abilities (14). Given the important contribution of mother’s voice to social communication (4–6), we hypothesized that the strength of functional connectivity between voice-selective cortex and reward and affective processing regions would predict social function in neurotypical children.
Results
Acoustical and Behavioral Analysis of Mother’s Voice and Control Voices.
We conducted acoustical analyses and behavioral experiments to characterize the physical and perceptual attributes of mother’s voice and female control voice samples. The goal of these analyses was to determine if there were differences between mother’s voice and female control voice samples that could account for differences in fMRI activity beyond the biological salience of mother’s voice. Human voices are differentiated according to a number of acoustical characteristics, including features that reflect the anatomy of the speaker’s vocal tract, such as the pitch and harmonics of speech, and learned aspects of speech production, which include speech rhythm, rate, and emphasis (15, 16). Acoustical analysis of the vocal samples used in the fMRI scan showed that control voice samples were qualitatively similar to mother’s voice samples across multiple spectrotemporal acoustical features (Fig. 1B).
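As an illustration of this type of per-sample acoustical comparison, the sketch below computes a few simple spectrotemporal measures (duration, rms amplitude, and spectral centroid) for two sets of recordings. It is a minimal example only: the file names are placeholders, and this is not the full feature set or software used in the study.

```python
# Minimal sketch of a per-sample acoustical comparison (illustrative only;
# not the feature set used in the published analysis). File names are placeholders.
import glob
import numpy as np
import soundfile as sf  # pip install soundfile

def describe(path):
    y, sr = sf.read(path)
    if y.ndim > 1:                       # collapse stereo to mono if needed
        y = y.mean(axis=1)
    duration = len(y) / sr               # seconds
    rms = np.sqrt(np.mean(y ** 2))       # overall rms amplitude
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid, Hz
    return duration, rms, centroid

for group in ("mother", "control"):      # hypothetical file-naming scheme
    stats = np.array([describe(p) for p in sorted(glob.glob(f"{group}_*.wav"))])
    print(group, "mean duration / rms / centroid:", stats.mean(axis=0))
```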
We next examined perceptual attributes of the stimuli. Of particular interest are the attributes associated with the pleasantness and excitement (a child-friendly proxy for “engagingness”) of the vocal samples: If the vocal characteristics of the mother’s voice samples are more rewarding and exciting than those of the female control voices, this difference could potentially account for brain effects associated with hearing mother’s voice. We administered a separate behavioral experiment to an independent cohort of 27 elementary school children (mean age, 11.1 y) who did not participate in the fMRI study. In this experiment, participants rated the 24 mother’s voice stimuli used in the fMRI experiment and the two female control stimuli based on how pleasant and exciting these voices sounded (SI Methods). We found no statistical difference between pleasantness ratings for the control voices and the mean pleasantness ratings for the mother’s voice samples (Fig. 1C, Left); however, female control voices received higher excitement ratings than the mother’s voice samples (P = 0.023) (Fig. 1C, Right). Importantly, these behavioral results show that the vocal qualities of the two female control voices used in the fMRI experiment were equally as pleasant as, and were not less exciting than, the mother’s voice stimuli.
Identification of Mother’s Voice.
To examine whether children who participated in the fMRI study could identify their mother’s voice accurately in the brief vocal samples used in the fMRI experiment, participants performed a mother’s voice identification task (SI Methods). We found that children identified their mother’s voice with a high degree of accuracy (mean accuracy >97%) (Fig. 1D), indicating that brief (<1 s) pseudoword speech samples are sufficient for the consistent and accurate identification of mother’s voice.
Brain Responses to Mother’s Voice Compared to Female Control Voices.
In the fMRI analysis, we first identified brain regions that showed greater activation in response to mother’s voice compared to female control voices. By subtracting out brain activation associated with hearing female control voices producing the same nonsense words (i.e., controlling for low-level acoustical features, phoneme and word-level analysis, auditory attention, and other factors), we estimated brain responses unique to hearing the maternal voice. We found that mother’s voice elicited greater activity in a number of brain systems, encompassing regions important for auditory, voice-selective, reward, social, and visual functions. First, mother’s voice elicited greater activation in primary auditory regions, including bilateral inferior colliculus (IC), the primary midbrain nucleus of the ascending auditory system, and bilateral posteromedial Heschl’s gyrus (HG), which contains the primary auditory cortex (Fig. 2). The auditory association cortex of the superior temporal plane, including bilateral planum temporale and planum polare, also showed significantly greater activation in response to mother’s voice, with slightly greater activation in the right hemisphere. Next, mother’s voice elicited enhanced bilateral activation in voice-selective superior temporal gyrus (STG) and superior temporal sulcus (STS), extending from posterior (y = −48) to anterior (y = 14) aspects of the lateral temporal cortex. Mother’s voice also elicited greater activity in the medial temporal lobe, including the left-hemisphere amygdala, a key node of the affective processing system. Structures of the mesolimbic reward pathway also showed greater activation in response to mother’s voice than to female control voices, including the bilateral nucleus accumbens (NAc) and the ventral putamen of the ventral striatum, orbitofrontal cortex (OFC), and ventromedial prefrontal cortex (vmPFC). Mother’s voice also elicited greater activation in posterior medial cortex bilaterally encompassing the precuneus and posterior cingulate cortex, a key node of the default mode network (17), which is a system involved in processing self-referential information (18). Additionally, mother’s voice elicited increased activity in multiple regions of the occipital cortex, including right-hemisphere intracalcarine, lingual, and fusiform cortex, overlapping the FG2 subregion of the fusiform, which is associated with visual face processing (19). Greater activation also was evident in the anterior insula (AI) and the dorsal anterior cingulate cortex (dACC), two key structures of the salience network (20). Finally, preference for mother’s voice was evident in frontoparietal regions, including right-hemisphere pars opercularis [Brodmann area (BA) 44] and triangularis (BA 45), and in bilateral angular, supramarginal, and precentral gyri. Signal levels in the majority of these brain regions reflected increased activity relative to baseline in response to mother’s voice (see SI Methods and Figs. S1–S4 for results from signal-level analysis). No brain regions showed significantly greater activation for female control voices compared to mother’s voice.
Fig. 2.
Brain activity in response to mother’s voice. Compared to female control voices, mother’s voice elicits greater activity in auditory brain structures in the midbrain and superior temporal cortex (Upper Left), including the bilateral IC and primary auditory cortex (mHG) and a wide extent of voice-selective STG (Upper Center) and STS. Mother’s voice also elicited greater activity in occipital cortex, including fusiform gyrus (FG) (Lower Left), and in heteromodal brain regions serving affective functions, anchored in the amygdala (Upper Right), core structures of the mesolimbic reward system, including NAc, OFC, and vmPFC (Lower Center), and structures of the salience network, including the AI and dACC (Lower Right). No voxels showed greater activity in response to female control voices compared to mother’s voice.
Fig. S1.
Signal levels in primary auditory regions (Upper) and voice-selective cortex (Lower) in response to mother’s voice and female control voices. Primary auditory regions were identified a priori from previous auditory studies (IC ROIs) (56) and from cytoarchitectonic maps (Te ROIs) (55), and voice-selective cortical regions were selected for signal-level analysis based on previous investigations of voice-selective cortex (bilateral pSTS; refs. 11, 37) or their identification in the [mother’s voice > female control voices] contrast (bilateral mSTS and aSTS; see Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. The signal-level analysis was performed because stimulus-based differences in fMRI activity can result from a number of different factors. Significant differences were inherent to this ROI analysis, because they are based on results from the whole-brain GLM analysis (52); however, results provide important information regarding the magnitude and sign of fMRI activity. **P < 0.01; *P < 0.05.
Fig. S4.
Signal levels in frontoparietal regions in response to mother’s voice and female control voices. Regions were selected for signal-level analysis based on their identification in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. All ROIs are 5-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). **P < 0.01.
Fig. S2.
Signal levels in mesolimbic reward regions (Upper) and the amygdala and salience network (Lower) in response to mother’s voice and female control voices. Regions were selected for signal-level analysis based on their identification in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. NAc and amygdala ROIs are 2-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast; all other ROIs in these bar graphs are 5-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). **P < 0.01; *P < 0.05.
Fig. S3.
Signal levels in default mode (Upper) and occipital (Lower) regions in response to mother’s voice and female control voices. Regions were selected for signal-level analysis based on their identification in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. All ROIs in these bar graphs are 5-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). **P < 0.01.
We explored sources of variance in participants’ voxelwise responses by performing whole-brain covariate analyses with social and language scores as covariates. These analyses showed no significant correlations between standardized measures of social or language abilities and brain activity levels in reward, affective, or salience-processing regions.
Brain Responses to Female Control Voices Compared to Nonvocal Environmental Sounds.
We next examined whether the extensive brain activation in response to mother’s voice (Fig. 2) is specific to this stimulus or, alternatively, if a similar extent of activation is elicited by female control voices when compared to nonvocal environmental sounds. This particular comparison was used in a seminal study examining the cortical basis of vocal processing in adult listeners (11), and results from the current child sample are consistent with this previous work, showing strong activation in bilateral voice-selective STG and STS (Fig. S5) for this contrast. Moreover, female control voices elicited activity in bilateral amygdala and supramarginal gyri and in left-hemisphere medial HG (mHG). Importantly, this analysis comparing female control voices and environmental sounds failed to identify reward, salience, and face-processing regions or the IC. Together, these results not only demonstrate that responses to mother’s voice are highly distributed throughout a number of brain systems but also show that activity in many of these regions, encompassing reward, salience, and face-processing systems, is specific to mother’s voice.
Fig. S5.
Brain activity in response to female control voices compared to environmental sounds. Compared to environmental sounds, female control voices elicit greater activity throughout a wide extent of voice-selective STG and STS (Upper Center Left), bilateral amygdala (Upper Center Right) and supramarginal gyrus (Upper Right), and a small extent of left hemisphere mHG (Upper Left). In contrast to brain activity in response to mother’s voice, univariate results comparing female control voices to environmental sounds failed to identify brain regions in reward, salience, and face-processing regions or in the IC.
Analysis of Control Voices.
We next examined whether the presence of pleasant vocal features in the control voices could elicit increased activity in brain systems activated by mother’s voice (Fig. 2). This analysis was based on independent behavioral ratings of the vocal stimuli, which revealed that vocal pleasantness ratings were significantly greater for one of the female control voices compared to the other control voice (P < 0.001). Both whole-brain and region of interest (ROI) analyses showed no differences in brain response between the two control voices in auditory, voice-selective, face-processing, reward, salience, or default mode brain regions (see SI Methods, Control voice analysis). These results indicate that more intrinsically pleasant vocal characteristics alone are not sufficient to drive brain activity in the wide range of brain systems engaged by mother’s voice.
Functional Connectivity During Mother’s Voice Processing.
The voxelwise analysis of mother’s voice identified multiple functional systems, encompassing primary auditory and voice-selective temporal cortex, cortical structures of the visual ventral stream, and heteromodal regions associated with affective and reward function and salience detection. A prominent hypothesis states that the STS is a key node of the speech perception network that connects low-level auditory regions with heteromodal regions important for reward and affective processing of these sounds (21). Therefore, our next analysis examined the functional connectivity of the STS, using the generalized psychophysiological interaction (gPPI) model, with the goal of identifying the brain network that shows greater connectivity during mother’s voice compared to female control voice perception.
Given the broad anterior–posterior expanse of STS/STG that showed greater activity for mother’s voice compared to female control voices (Fig. 2), we placed gPPI seeds bilaterally in posterior, mid, and anterior STG/STS (see Table S3 for seed coordinates). Surprisingly, group results did not reveal significantly greater connectivity during mother’s voice perception, relative to female control voices, between any of the STS/STG seeds and affective and reward processing regions or structures of the salience network and visual ventral stream.
Table S3.
Superior temporal cortex seed coordinates for gPPI analysis
| Brain region | Coordinates |
| Left-hemisphere pSTS | −63, −42, 9 |
| Left-hemisphere mSTS | −60, −16, −8 |
| Left-hemisphere aSTS | −54, −6, −12 |
| Right-hemisphere pSTS | 57, −31, 5 |
| Right-hemisphere mSTS | 56, −20, 4 |
| Right-hemisphere aSTS | 52, −2, −14 |
Individual Differences in Functional Connectivity During Mother’s Voice Processing.
We then investigated individual differences in children’s brain connectivity by performing a regression analysis between the strength of STS connectivity and social and language measures. Results from whole-brain regression analyses showed a striking relationship: Children’s social communication scores, assessed using the Social Responsiveness Scale (SRS-2) (22), covaried with the strength of functional connectivity between multiple STS gPPI seeds and the brain systems identified in the univariate analysis (Fig. 3). Specifically, standardized scores of social communication were correlated with the strength of brain connectivity for the [mother’s voice > female control voices] gPPI contrast between left-hemisphere anterior STS (aSTS) and left-hemisphere NAc of the mesolimbic reward pathway, right-hemisphere amygdala, hippocampus, and fusiform gyrus (FG), which overlapped with the FG2 subregion (19). Moreover, social communication scores were correlated with the strength of brain connectivity between right-hemisphere posterior STS (pSTS) and OFC of the reward system and the AI and dACC of the salience network (Fig. 4). Scatterplots show that both brain connectivity and social communication abilities vary across a range of values and that greater social function, reflected by lower social communication scores, is associated with greater brain connectivity between the STS and these reward, affective, salience, and face-processing regions. In contrast, language abilities, assessed using the Core Language Score from the Clinical Evaluation of Language Fundamentals, 4th edition (CELF-4) (23), correlated only with connectivity between left-hemisphere medial STS (mSTS) and right-hemisphere HG and inferior frontal gyrus (Fig. S6).
Fig. 3.
Connectivity of left-hemisphere voice-selective cortex and social communication abilities. The whole-brain connectivity map shows that children’s social communication scores covaried with the strength of functional coupling between the left-hemisphere aSTS (Top) and left-hemisphere NAc (Center Left), right-hemisphere amygdala (Center Right), right-hemisphere hippocampus (Bottom Left), and FG, which overlapped with the FG2 subregion (Bottom Right). Scatterplots show the distributions and covariation of aSTS connectivity strength in response to mother’s voice and standardized scores of social communication abilities. Greater social communication abilities, reflected by smaller social communication scores, are associated with greater brain connectivity between the STS and these brain regions. a.u., arbitrary units.
Fig. 4.
Connectivity of right-hemisphere voice-selective cortex and social communication abilities. The whole-brain connectivity map shows that children’s social communication scores covaried with the strength of functional coupling between the right-hemisphere pSTS (Upper Left) and OFC of the reward pathway (Upper Right) and between the AI and dACC of the salience network (Lower). Scatterplots show the distributions and covariation of STS connectivity strength in response to mother’s voice and standardized scores of social function. Greater social communication abilities, reflected by smaller social communication scores, are associated with greater brain connectivity between the STS and these brain regions.
Fig. S6.
Connectivity of left-hemisphere voice-selective cortex and language abilities. The whole-brain connectivity map shows that children’s CELF-4 Core Language scores covaried with the strength of functional coupling between the left-hemisphere mSTS (Upper) and right-hemisphere HG (Lower Left) and BA 47 of inferior frontal gyrus (IFG) (Lower Right). Scatterplots show the distributions and covariation of mSTS connectivity strength in response to mother’s voice and standardized scores of language abilities in these children. Higher language scores are associated with greater brain connectivity between the STS and these brain regions.
To examine the robustness and reliability of these particular brain connections for predicting social communication scores, we performed a support vector regression (SVR) analysis (24–26). Results showed that the strength of each of these brain connections was a reliable predictor of social communication function (left aSTS gPPI seed to left NAc: r = 0.62, P < 0.001; to right amygdala: r = 0.49, P = 0.004; to right hippocampus: r = 0.59, P < 0.001; to right fusiform: r = 0.54, P = 0.002; right pSTS gPPI seed to right OFC: r = 0.58, P < 0.001; to right AI: r = 0.66, P < 0.001; to right dACC: r = 0.66, P < 0.001).
SI Methods
Participants.
The Stanford University Institutional Review Board approved the study protocol. Parental consent and children's assent were obtained for all evaluation procedures, and children were paid for their participation in the study.
A total of 32 children were recruited from around the San Francisco Bay Area for this study. Six participants were excluded because of excessive movement, one was excluded because of infrequent contact with their biological mother, who also was unavailable for a vocal recording, and another participant was excluded because of scores in the “severe” range on standardized measures of social function. Parent reports from the final sample of 24 participants showed that these children were raised in families with a wide range of socioeconomic backgrounds, with 25% of participants coming from households earning ≤$100K/y. Socioeconomic status was not correlated with children’s social communication skills as assessed using the SRS-2 (P > 0.50), the key behavioral measure described in the analysis. All children were required to have a full-scale IQ >80, as measured by the Wechsler Abbreviated Scale of Intelligence (WASI) (41). All children were right-handed and had no history of neurological, psychiatric, or learning disorders; no personal or family (first-degree) history of developmental cognitive disorders or heritable neuropsychiatric disorders; no evidence of significant difficulty during pregnancy, labor, delivery, or the immediate neonatal period; and no abnormal developmental milestones as determined by neurologic history and examination. Participants were the biological offspring of the mothers whose voices were used in this study (i.e., none of our participants were adopted, and therefore none of the mothers’ voices were from an adoptive mother), and all participants were raised in homes that included their mother. Participants’ neuropsychological and language characteristics are provided in Tables S1 and S2, respectively.
Data Acquisition Parameters.
All fMRI data were acquired in a single session at the Richard M. Lucas Center for Imaging at Stanford University. Functional images were acquired on a 3-T Signa scanner (General Electric) using a custom-built head coil. Participants were instructed to stay as still as possible during scanning, and head movement was minimized further by placing memory-foam pillows around the participant’s head. A total of 29 axial slices (4.0-mm thickness, 0.5-mm skip) parallel to the anterior/posterior commissure line and covering the whole brain were imaged by using a T2*-weighted gradient-echo spiral in-out pulse sequence (43) with the following parameters: repetition time (TR) = 3,576 ms; echo time = 30 ms; flip angle = 80°; one interleaf. The 3,576-ms TR is the sum of (i) the stimulus duration of 956 ms; (ii) a 300-ms silent interval buffering the beginning and end of each stimulus presentation (600 ms total of silent buffers) to avoid backward and forward masking effects; (iii) the 2,000-ms volume acquisition time; and (iv) an additional 20-ms silent interval that helped the stimulus computer maintain precise and accurate timing during stimulus presentation. The field of view was 20 cm, and the matrix size was 64 × 64, providing an in-plane spatial resolution of 3.125 mm. Blurring and signal loss arising from field inhomogeneities were reduced by applying an automated high-order shimming method before data acquisition.
fMRI Task.
Auditory stimuli were presented in 10 separate runs, each lasting 4 min. One run consisted of 56 trials of mother’s voice, female control voices, environmental sounds, and catch trials, which were pseudorandomly ordered within each run. Stimulus presentation order was the same for each subject. Each stimulus lasted 956 ms. Before each run, child participants were instructed to play the “kitty cat game” during the fMRI scan. While lying down in the scanner, children were first shown a brief video of a cat and were told that the goal of the cat game was to listen to a variety of sounds, including “voices that may be familiar,” and to push a button on a button box only when they heard kitty cat meows (catch trials). The function of the catch trials was to keep the children alert and engaged during stimulus presentation. During each run, four or five exemplars of each stimulus type (i.e., nonsense words produced by that child's mother, female control voices, and environmental sounds) and three catch trials were presented. At the end of each run, the children were shown another engaging video of a cat. Across the 10 runs, a total of 48 exemplars of each stimulus condition were presented to each subject. Speech stimuli were presented to participants in the scanner using E-Prime v1.0 (Psychological Software Tools, 2002). Participants wore custom-built headphones designed to reduce the background scanner noise to ∼70 dB (A-weighted; dBA) (44, 45). Headphone sound levels were calibrated before each data-collection session, and all stimuli were presented at a sound level of 75 dBA. Participants were scanned using an event-related design. Auditory stimuli were presented during silent intervals between volume acquisitions to eliminate the effects of scanner noise on auditory discrimination. One stimulus was presented every 3,576 ms, and the interstimulus interval was not jittered. The total interval between stimulus presentations was 2,620 ms and consisted of a 300-ms silent period, 2,000 ms for volume acquisition, another 300 ms of silence, and a 20-ms interval that helped the stimulus computer maintain precise and accurate timing during stimulus presentation.
fMRI Preprocessing.
fMRI data collected in each of the 10 functional runs were subject to the following preprocessing procedures. The first five volumes were not analyzed to allow for signal equilibration. A linear shim correction was applied separately for each slice during reconstruction by using a magnetic field map acquired automatically by the pulse sequence at the beginning of the scan. Functional images were first realigned to their first volume using SPM8 analysis software (www.fil.ion.ucl.ac.uk/spm) and then corrected for deviant volumes resulting from spikes in movement. Translational movement in millimeters (x, y, z) was calculated based on the SPM8 parameters from the realignment procedure for each subject. We used a despiking procedure (46) similar to those implemented in the Analysis of Functional NeuroImages (AFNI) toolkit maintained by the National Institute of Mental Health (Bethesda, MD) (47). Volumes with movement exceeding 0.5 voxels (1.562 mm) or spikes in global signal exceeding 5% were interpolated using adjacent scans. The majority of volumes repaired occurred in isolation. After the interpolation procedure, images were further corrected for slice-timing errors. To normalize the functional images to standard Montreal Neurological Institute (MNI) space, each individual's functional images were first coregistered to their structural T1 images, and their T1 images were then transformed to MNI space. The transformation parameters were subsequently applied to the functional images, which were then resampled to 2-mm isotropic voxels. Finally, functional images were smoothed with a 6-mm full-width half-maximum Gaussian kernel to decrease spatial noise prior to statistical analysis.
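The volume-repair (despiking) logic described above can be illustrated as follows. This is a simplified stand-in for the cited procedure, assuming the realignment parameters and a global-signal time course are already available as arrays; thresholds follow the text, but the interpolation is deliberately basic.

```python
# Simplified sketch of the volume-repair step: flag volumes with large
# scan-to-scan motion or global-signal spikes and replace them by averaging
# neighboring good volumes. Illustrative only, not the cited implementation.
import numpy as np

def repair_volumes(data, motion, global_signal, voxel_size=3.125,
                   motion_thresh_voxels=0.5, signal_thresh=0.05):
    """data: (T, X, Y, Z) array; motion: (T, 3) translations in mm;
    global_signal: (T,) mean signal per volume."""
    d_motion = np.abs(np.diff(motion, axis=0)).max(axis=1)        # mm, scan-to-scan
    d_signal = np.abs(np.diff(global_signal) / global_signal[:-1])
    bad = np.zeros(len(global_signal), dtype=bool)
    bad[1:] = (d_motion > motion_thresh_voxels * voxel_size) | (d_signal > signal_thresh)

    repaired = data.copy()
    for t in np.where(bad)[0]:
        prev_good = t - 1
        while prev_good > 0 and bad[prev_good]:
            prev_good -= 1
        next_good = t + 1
        while next_good < len(bad) and bad[next_good]:
            next_good += 1
        if next_good < len(bad):
            repaired[t] = 0.5 * (data[prev_good] + data[next_good])  # interpolate
        else:
            repaired[t] = data[prev_good]                            # edge case
    return repaired, bad
```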
Movement Criteria for Inclusion in fMRI Analysis.
For inclusion in the fMRI analysis, we required that each functional run have a maximum scan-to-scan movement of <6 mm and that no more than 15% of volumes were corrected in the despiking procedure. Moreover, we required that all individual subject data included in the analysis consist of at least seven functional runs that met our criteria for scan-to-scan movement and percentage of volumes corrected; subjects who had fewer than seven functional runs that met our movement criteria were not included in the data analysis. All 24 participants included in the analysis had at least seven functional runs that met our movement criteria. Fifteen of the participants had 10 runs of data that met these movement criteria; two subjects had nine runs of data that met movement criteria; five subjects had eight runs of data; and two subjects had seven runs that met criteria.
Voxelwise Analysis of fMRI Activation.
The goal of the voxelwise analysis of fMRI activation was to identify brain regions that showed differential activity levels in response to mother’s voice, female control voices, and environmental sounds. Brain activation related to each speech task condition was first modeled at the individual subject level using boxcar functions with a canonical hemodynamic response function and a temporal derivative to account for voxelwise latency differences in hemodynamic response. Environmental sounds were not modeled to avoid collinearity, and this stimulus served as the baseline condition. Low-frequency drifts at each voxel were removed using a high-pass filter (0.5 cycles/min), and serial correlations were accounted for by modeling the fMRI time series as a first-degree autoregressive process (48). Voxelwise t-statistic maps for each condition were generated for each participant using the general linear model (GLM) along with the respective contrast images. Group-level activation was determined using individual-subject contrast images and second-level one-sample t tests. The main contrasts of interest were [mother’s voice vs. female control voices], [female control voices vs. mother’s voice], [female control voices vs. environmental sounds], [female control voice 1 vs. female control voice 2], and [female control voice 2 vs. female control voice 1]. Significant clusters of activation were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined by Monte Carlo simulations (49, 50) implemented in a custom Matlab script. To examine GLM results in the IC, a small subcortical brain structure, we used a small-volume correction at P < 0.01. To define specific cortical regions, we used the Harvard–Oxford probabilistic structural atlas (51) with a probability threshold of 25%.
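A compact way to set up this kind of first-level model outside SPM8 is sketched below using nilearn. The file path, onsets, and condition labels are placeholders, and the settings only approximate the analysis described above (canonical HRF plus temporal derivative, high-pass filter, AR(1) noise model, 6-mm smoothing).

```python
# Sketch of a first-level GLM analogous to the one described above, using
# nilearn in place of SPM8. Paths, onsets, and labels are placeholders.
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

events = pd.DataFrame({
    "onset":      [0.0, 3.576, 7.152],       # s; placeholder onsets
    "duration":   [0.956, 0.956, 0.956],     # 956-ms stimuli
    "trial_type": ["mother", "control", "mother"],
})

model = FirstLevelModel(
    t_r=3.576,                      # repetition time, s
    hrf_model="spm + derivative",   # canonical HRF + temporal derivative
    noise_model="ar1",              # first-order autoregressive model
    high_pass=0.5 / 60.0,           # 0.5 cycles/min expressed in Hz
    smoothing_fwhm=6,               # mm
)
model = model.fit("run01_preprocessed.nii.gz", events=events)

# Contrast of interest: [mother's voice > female control voices]
z_map = model.compute_contrast("mother - control", output_type="z_score")
z_map.to_filename("mother_gt_control_zmap.nii.gz")
```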
Signal-Level Analysis.
Group mean activation differences for key brain regions identified in the whole-brain univariate analysis were calculated to examine the basis for the [mother’s voice > female control voices] group differences (Fig. 2). This analysis was performed because stimulus differences can result from a number of different factors. For example, both mother’s voice and female control voices could elicit reduced activity relative to baseline, and significant stimulus differences could be driven by greater negative activation in response to female control voices. Significant stimulus differences were inherent to this ROI analysis, because they are based on results from the whole-brain GLM analysis (52); however, the results provide important information regarding the magnitude and sign of results in response to both stimulus conditions. The baseline for this analysis was calculated as the brain response to environmental sounds. A number of ROIs were constructed using coordinates reported in previous studies: IC ROIs were 5-mm spheres centered at ±6, −33, −11 (53, 54); primary auditory cortical ROIs (Te1.0, Te1.1, and Te1.2) were identified a priori from cytoarchitectonic maps (55), and bilateral pSTS coordinates were identified from previous investigations of voice-selective cortex (11, 37). All other ROI coordinates used in the signal-level analysis were based on peaks identified in the [mother’s voice > female control voices] group map. This analysis included 13 ROIs in bilateral STC, nine ROIs in bilateral frontal cortex, seven ROIs in bilateral parietal cortex, three ROIs in bilateral occipital cortex, one ROI in the anterior cingulate, and five subcortical structures. Cortical ROIs were defined as 5-mm spheres, and subcortical ROIs were 2-mm spheres, centered at the peaks in the [mother’s voice > female control voices] group map. Signal level was calculated by extracting the β-value from individual subjects’ contrast maps for the [mother’s voice > environmental sounds] and [female control voices > environmental sounds] comparisons. The mean β-value within each ROI was computed for both contrasts in all subjects. The group mean β and its SE for each ROI are plotted in Figs. S1–S4.
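The β-extraction step can be illustrated with a spherical ROI masker. The IC coordinates and 5-mm radius follow the text; the subject list and contrast-image paths are placeholders, and this sketch is not the exact pipeline used in the study.

```python
# Sketch of the signal-level ROI analysis: extract mean beta values from
# individual-subject contrast images within spherical ROIs.
import numpy as np
from nilearn.maskers import NiftiSpheresMasker

ic_seeds = [(-6, -33, -11), (6, -33, -11)]            # bilateral IC, MNI mm
masker = NiftiSpheresMasker(seeds=ic_seeds, radius=5)  # 5-mm spheres

betas = []
for subject in ["sub-01", "sub-02"]:                  # placeholder subject list
    img = f"{subject}_mother_gt_env_beta.nii.gz"      # [mother's voice > env. sounds]
    betas.append(masker.fit_transform(img).ravel())   # one value per ROI

betas = np.array(betas)                               # shape: (n_subjects, n_ROIs)
print("group mean beta per ROI:", betas.mean(axis=0),
      "SEM:", betas.std(axis=0, ddof=1) / np.sqrt(len(betas)))
```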
Effective Connectivity Analysis.
Effective connectivity analysis was performed using gPPI (42), a method more sensitive than PPI to context-dependent differences in connectivity. At the individual subject level, the time series from the seed region is first deconvolved to uncover neuronal activity and then multiplied with the task design waveforms to form an interaction term. This interaction term is then convolved with the hemodynamic response function (HRF) to form the gPPI regressor, and the resulting time series is regressed against all other voxels in the brain. The goal of this analysis was to examine connectivity patterns of the voice-selective network, with a focus on voice-selective temporal cortex, which is hypothesized to be a hub of the network linking auditory regions with heteromodal regions important for reward and affective processing of these sounds (1, 21). Therefore, we constructed four STS/STG ROIs that were identified from the univariate analysis [mother’s voice > female control voices] group t map (Fig. 2; mSTS and aSTS/STG) and two ROIs that were identified from previous investigations of voice-selective cortex (pSTS) (11, 37). For the STS regions identified from the univariate analysis, we identified peaks in this t map for mSTS and aSTS/STG regions bilaterally and constructed nonoverlapping 5-mm spherical ROIs centered at these peaks. These six ROIs then were used as seeds in six separate whole-brain gPPI models. Significant clusters of activation were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined using Monte Carlo simulations (49, 50).
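To make the construction of the interaction regressor concrete, the sketch below builds a PPI-style term for a single condition. For simplicity it follows the original PPI formulation (seed BOLD multiplied by the HRF-convolved task regressor) rather than the deconvolution-based gPPI used in the study; the onsets and seed time series are placeholders.

```python
# Sketch of a PPI-style interaction regressor for one condition (illustrative;
# the study used the deconvolution-based gPPI implementation cited in the text).
import numpy as np
from nilearn.glm.first_level import spm_hrf

t_r, n_scans = 3.576, 67
frame_times = np.arange(n_scans) * t_r

# Psychological regressor: indicator of mother's-voice trials, convolved with HRF
onsets = np.array([3.576, 10.728, 21.456])            # placeholder onsets, s
psych = np.zeros(n_scans)
psych[np.searchsorted(frame_times, onsets)] = 1.0
hrf = spm_hrf(t_r, oversampling=1)
psych_conv = np.convolve(psych, hrf)[:n_scans]

# Physiological regressor: mean time series from an STS seed (placeholder data)
seed_ts = np.random.randn(n_scans)
seed_ts -= seed_ts.mean()

# Psychophysiological interaction term
interaction = seed_ts * psych_conv

# In the full gPPI model, one interaction term per condition, the seed time
# series, and the task regressors all enter a whole-brain GLM, and the
# [mother's voice > female control voices] contrast is taken on the
# interaction regressors.
```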
Brain-Behavior Analysis.
Regression analysis was used to examine the relationship between brain signatures of mother’s voice perception and social and language skills. Social function was assessed using the Social Communication subscale of the SRS-2 (22). For our measure of language function, we used the CELF-4 (23), a standard instrument for measuring language function in neurotypical children. Regression analyses were conducted using the Core Language score of the CELF, a measure of general language ability. Brain-behavior relationships were examined using analysis of both activation levels and effective connectivity. We first performed a voxelwise regression analysis in which the relation between fMRI activity and social and language measures was examined using images contrasting mother’s voice vs. female control voices. We then performed a voxelwise regression analysis between STC connectivity and standardized social and language measures using gPPI images generated for each participant by contrasting responses to mother's voice vs. responses to female control voices. Significant clusters were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined using Monte Carlo simulations (49, 50).
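The group-level covariate regression can be sketched as follows, with the behavioral score entering the design matrix alongside an intercept. The gPPI map paths and SRS-2 scores below are placeholders, and the thresholding described above is not reproduced.

```python
# Sketch of the voxelwise brain-behavior regression: social communication
# scores enter a second-level design matrix as a covariate, and the covariate
# contrast identifies voxels whose connectivity tracks behavior.
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel

gppi_maps = ["sub-01_gppi.nii.gz", "sub-02_gppi.nii.gz", "sub-03_gppi.nii.gz"]
srs_scores = [38, 45, 41]                            # placeholder SRS-2 scores

design = pd.DataFrame({
    "intercept": [1] * len(gppi_maps),
    "social_communication": srs_scores,
})

model = SecondLevelModel().fit(gppi_maps, design_matrix=design)
z_map = model.compute_contrast("social_communication", output_type="z_score")
z_map.to_filename("sts_connectivity_vs_srs_zmap.nii.gz")
```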
Functional Brain Connectivity and Prediction of Social Function.
To examine the robustness and reliability of brain connectivity between STS and reward, affective, salience detection, and face-processing brain regions for predicting social communication scores, we performed a confirmatory cross-validation (CV) analysis that employs a machine-learning approach with balanced fourfold CV combined with linear regression (25). In this analysis, we extracted individual subject connectivity beta values, taken from the [mother’s voice > female control voices] gPPI contrast, in left-hemisphere NAc, right-hemisphere amygdala, fusiform cortex, and hippocampus (i.e., the left-hemisphere aSTS gPPI seed) and in right-hemisphere OFC, AI, and dACC (the right-hemisphere pSTS gPPI seed). Mean gPPI beta values for each brain connection (e.g., left-hemisphere aSTS seed to left-hemisphere NAc) were separately entered as the independent variable in a linear regression analysis with SRS-2 social communication standard scores as the dependent variable. First, r(predicted, observed), a measure of how well the independent variable predicts the dependent variable, was estimated using a balanced fourfold CV procedure. Data were randomly assigned to four folds, and the distributions of the independent and dependent variables were compared across folds with one-way ANOVAs; assignment was repeated as necessary until both ANOVAs were nonsignificant, ensuring balance across the folds. A linear regression model was built using three folds, leaving out the fourth, and this model was used to predict the data in the omitted fold. This procedure was repeated four times to compute a final r(predicted, observed) representing the correlation between the data predicted by the regression model and the observed data. Finally, the statistical significance of the model was assessed using a nonparametric testing approach. The empirical null distribution of r(predicted, observed) was estimated by generating 1,000 surrogate datasets under the null hypothesis that there was no association between social communication subscores and brain connectivity.
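A minimal sketch of this cross-validated prediction and permutation test is given below. The connectivity and SRS-2 arrays are synthetic placeholders, and the ANOVA-based fold balancing described above is replaced by simple random folds for brevity.

```python
# Sketch of fourfold cross-validated regression plus a permutation null,
# asking whether one connectivity value predicts social communication scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
connectivity = rng.normal(size=24)                   # gPPI beta per child (placeholder)
srs = 45 - 5 * connectivity + rng.normal(size=24)    # placeholder SRS-2 scores

def cv_prediction_r(x, y, n_splits=4, seed=0):
    preds = np.empty_like(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(x):
        model = LinearRegression().fit(x[train].reshape(-1, 1), y[train])
        preds[test] = model.predict(x[test].reshape(-1, 1))
    return np.corrcoef(preds, y)[0, 1]               # r(predicted, observed)

r_obs = cv_prediction_r(connectivity, srs)

# Permutation null: shuffle the behavioral scores 1,000 times
null = np.array([cv_prediction_r(connectivity, rng.permutation(srs))
                 for _ in range(1000)])
p_value = (np.sum(null >= r_obs) + 1) / (null.size + 1)
print(f"r(predicted, observed) = {r_obs:.2f}, permutation p = {p_value:.3f}")
```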
Stimulus Design Considerations.
Previous studies investigating the perception (2, 5) and neural bases (8, 9) of mother’s voice processing have used a design in which one mother’s voice serves as a control voice for another participant. However, for a number of reasons, in this study we used a design in which all participants heard the same two control voices. First, we wanted to be able to perform analyses comparing brain responses between the two control voices (see Results, Analysis of Control Voices in the main text), which would not have been possible had the participants heard different control voices. There also was an important practical limitation with using mothers’ voices as control voices for other children: Although we made every effort to recruit children from a variety of communities in the San Francisco Bay Area, some recruitment occurred through contact with specific schools, and in other instances participants referred their friends to our laboratory for inclusion in our studies. In these cases, it is a reasonable possibility that our participants may know other mothers involved in the study and therefore may be familiar with these mothers’ voices; that familiarity would limit the control we were seeking in our control voices. Importantly, the Health Insurance Portability and Accountability Act (HIPAA) guidelines are explicit that participant information is confidential, and therefore there would be no way to probe whether a child knew any of the other families involved in the study. Given these analytic and practical considerations, we concluded that it would be best to use the same two control voices, which we knew were unfamiliar to the participants, for all participants’ data collection.
Stimulus Recording.
Recordings of each mother were made individually while her child was undergoing neuropsychological testing. Mother’s voice stimuli and control voices were recorded in a quiet conference room using a Shure PG27-USB condenser microphone connected to a MacBook Air laptop computer. The audio signal was digitized at a sampling rate of 44.1 kHz with 16-bit resolution. Mothers were positioned in the conference room to prevent early sound-wave reflections from contaminating the recordings. To provide a natural speech context for the recording of each nonsense word, mothers were instructed to repeat three sentences, each of which contained one of the nonsense words, during the recording. The first word of each of these sentences was their child’s name, which was followed by the words “that is a,” followed by one of the three nonsense words. A hypothetical example of a sentence spoken by a mother for the recording was “Johnny, that is a keebudishawlt.” Before beginning the recording, mothers were instructed on how to produce these nonsense words by repeating them to the experimenter until the mothers had reached proficiency. Importantly, mothers were instructed to say these sentences using the tone of voice they would use when speaking with their child during an engaging and enjoyable shared learning experience (e.g., if their child asked them to identify an item at a museum). The vocal recording session resulted in digitized recordings of the mothers repeating each of the three sentences ∼30 times to ensure multiple high-quality samples of each nonsense word for each mother. A second class of stimuli included in the study was nonspeech environmental sounds. These sounds, which included brief recordings of laundry machines, dishwashers, and other household sounds, were taken from a professional sound effects library.
Stimulus Postprocessing.
The goal of stimulus postprocessing was to isolate the three nonsense words from the sentences that each mother spoke during the recording session and to normalize them for duration and rms amplitude for inclusion in the fMRI stimulus presentation protocol, the pleasantness and excitement ratings experiment, and the mother’s voice identification task. First, a digital sound editor (Audacity: https://sourceforge.net/projects/audacity/) was used to isolate each utterance of the three nonsense words from the sentences spoken by each mother. The three best versions of each nonsense word were selected based on the audio and vocal quality of the utterances (i.e., eliminating versions that were mispronounced, included vocal creak, or were otherwise not ideal exemplars of the nonsense words). These nine nonsense words then were normalized to 956 ms in duration, the mean duration of the nonsense words produced by the female control voices, using Praat software similar to previous studies (56). A 10-ms fade (ramp and damp) was performed on each stimulus to prevent click-like sounds at the beginning and end of the stimulus, and then stimuli were equated for rms amplitude. These final stimuli were evaluated for audibility and clarity to ensure that postprocessing manipulations had not introduced any artifacts into the samples. The same process was performed on the control voices and environmental sounds to ensure that all stimuli presented in the fMRI experiment were the same duration and rms amplitude.
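The fade and rms-equalization steps can be illustrated with the short sketch below. The duration normalization (time-scaling each token to 956 ms) was performed in Praat and is not reproduced here; the file names and target rms level are placeholders.

```python
# Sketch of the amplitude-normalization and fade steps described above:
# apply 10-ms onset/offset ramps and scale each stimulus to a common rms level.
import numpy as np
import soundfile as sf

def fade_and_normalize(path, out_path, target_rms=0.05, fade_ms=10):
    y, sr = sf.read(path)
    if y.ndim > 1:                           # collapse stereo to mono if needed
        y = y.mean(axis=1)
    n_fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, n_fade)
    y[:n_fade] *= ramp                       # 10-ms onset ramp
    y[-n_fade:] *= ramp[::-1]                # 10-ms offset damp
    rms = np.sqrt(np.mean(y ** 2))
    y *= target_rms / rms                    # equate rms amplitude across stimuli
    sf.write(out_path, y, sr)

fade_and_normalize("keebudishawlt_raw.wav", "keebudishawlt_norm.wav")  # placeholder names
```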
Pleasantness and Excitement Ratings for Vocal Stimuli.
To examine the relative pleasantness and engagingness of the vocal stimuli used in the fMRI experiment, we performed two behavioral experiments in an independent cohort of 27 children (mean age ± SD: 11.1 ± 1.2 y; sex: 10 female, 17 male). Participants were seated in a quiet room in front of a laptop computer, and headphones were placed over their ears. In one experiment, participants were presented with trials of either a mother’s voice sample or a control voice sample of the nonsense word “teebudishawlt.” After each stimulus presentation, the participant rated the vocal sample for pleasantness on a four-point scale as “very unpleasant,” “unpleasant,” “pleasant,” or “very pleasant.” In the second experiment the same procedures were used, but participants rated each vocal sample on a four-point scale for engagingness. Because there was concern that 8- to 10-y-old children might not understand the meaning of the word “engaging” in the context of this experiment, consistent with a previous study (57), we used the following four-point scale for this experiment: “totally boring,” “a little boring,” “a little exciting,” or “totally exciting.” Each vocal stimulus (i.e., 24 mother’s voice samples plus the two control voice samples) was presented once to each child in both the pleasantness and engagingness ratings tasks. The order of stimulus presentation was randomized for each participant and experiment; half of the participants performed the pleasantness ratings task first, and the other half of the participants performed the engagingness ratings task first. The vocal samples used in these behavioral experiments are the same as those used in the fMRI experiment. To examine statistical differences between ratings for mother’s voice and female control voices, we performed independent samples t tests comparing the mean ratings for mother’s voice samples and participant ratings for both control voices.
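The statistical comparison of ratings can be sketched as a single independent-samples t test: one mean rating per mother's-voice sample against the participant ratings for the two control-voice samples. The rating arrays below are synthetic placeholders.

```python
# Sketch of the ratings comparison described above (placeholder data).
import numpy as np
from scipy import stats

# One mean pleasantness rating (1-4 scale) per mother's-voice sample (24 samples)
mother_means = np.random.default_rng(1).uniform(2.5, 3.5, size=24)
# Individual participant ratings for the two control-voice samples (27 x 2 = 54)
control_ratings = np.random.default_rng(2).uniform(2.5, 3.5, size=54)

t, p = stats.ttest_ind(mother_means, control_ratings)
print(f"t = {t:.2f}, p = {p:.3f}")
```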
Postscan Speaker Identity Recognition Task.
All children who participated in the fMRI experiment completed an auditory behavioral test following the scan. The goal of the Speaker Identity Recognition Task was to determine if the participants could reliably discriminate their mother’s voice from female control voices. Participants were seated in a quiet room in front of a laptop computer, and headphones were placed over their ears. In each trial, participants were presented with a recording of a multisyllabic nonsense word spoken either by the participant’s mother or by a control mother, and the task was to indicate whether or not the participant’s mother spoke the word. The multisyllabic nonsense words used in this behavioral task were the same samples used in the fMRI task. Each participant was presented with 54 randomly ordered nonsense words: 18 produced by the participant's mother and the remaining 36 produced by female control voices.
Discussion
Mother’s voice is a foundational stimulus and is one of the most salient vocal sources in a child’s life. Here we have identified the brain structures and network that are sensitive to brief (<1 s) samples of pseudoword speech sounds produced by each child’s mother compared to female control voices. We observed distinct representations of mother’s voice in a wide range of brain structures, encompassing not only auditory and voice-selective structures in the temporal cortex but also structures of the reward circuit including the NAc, OFC, and vmPFC, structures implicated in affective processes, including the amygdala, and regions associated with visual face processing, including fusiform cortex. Importantly, connectivity analyses revealed that coordinated neural activity between voice-selective regions and structures serving reward, affective, face processing, salience detection, and mnemonic functions predicts social communication abilities. Our results suggest that hearing mother’s voice, a critical source of emotional comfort and social learning in a child’s life, is represented in a wide range of brain systems that encompass auditory, speech, reward, and affective processing and that children’s social abilities are tightly linked to the function of this network. Surprisingly, brain signatures of mother’s voice can be detected even ∼10 y into childhood and provide a neural fingerprint of children’s social communication abilities.
A major finding here is the breadth of brain systems that are preferentially activated by brief samples of mother’s voice, a result that demonstrates the highly distributed nature of neural representations for this salient sound source. Importantly, these brain systems are thought to support discrete aspects of stimulus processing. The superior temporal cortex (STC) contains both primary auditory cortex, which is selective for processing rudimentary sound features (27), and STS regions known to be selective for human vocal sounds (11), and our results show strong effects for mother’s voice throughout these cortical areas. Why might auditory sensory and voice-selective cortex show enhanced responses for mother’s voice? Sensory representations are sharpened and strengthened for behaviorally salient stimuli (28, 29), ostensibly to facilitate their rapid identification, and it is plausible that the behavioral importance of mother contributes to the strengthening of the sensory representation of her voice in auditory regions of her child’s brain. A potential mechanism for the enhancement of auditory cortical responses is the coincident activity of auditory and reward circuitry: Previous work has shown that stimulation of dopaminergic neurons in reward circuitry during auditory stimulus presentation selectively enhances auditory cortical representations for the presented sounds (30). We hypothesize that the identification of mother’s voice as a rewarding stimulus drives synchronous activity in auditory and reward circuitry and facilitates the strengthening of mother’s voice representations throughout auditory cortex.
Our results also show, for the first time to our knowledge, that mother’s voice drives neural activity in a number of key nodes of the reward circuit (12), including the NAc, OFC, and vmPFC. Activity in this circuit reflects both the anticipation and the experience of preferred stimuli, including music (31, 32), whose rewarding nature has received considerable attention (33). Vocal sounds, on the other hand, are not typically considered a “rewarding” category of sounds, possibly because of their ubiquity in everyday life, and structures of the reward circuit are not considered part of the canonical speech-perception network (27). During development, however, mother’s voice is thought to constitute a rewarding stimulus to young children (34), and this initial attraction to the sounds of speech is thought to guide early language acquisition (35, 36). Our findings suggest that the rewarding nature of mother’s voice can be detected even in late childhood and demonstrate that brief samples of salient speech stimuli have preferred access to the distributed reward circuit. More generally, we propose that the reward circuit plays an active role in multiple aspects of speech perception, including identifying preferred speech sources and positively valenced emotional cues provided by personally relevant voices.
Our findings further identify a strong link between children’s social communication abilities—their ability to interact and relate with others—and speech-based brain connectivity. Specifically, our results show that functional connectivity between voice-selective STS and the NAc of the reward circuit, the amygdala, salience network, FG, and hippocampus, a key structure for memory function, predicts social communication abilities. This result is consistent with previous findings that intrinsic connectivity between voice-selective STS and reward structures and the amygdala predicts social communication abilities in children with autism spectrum disorders (37). Results from the current study advance our understanding of individual differences in neurotypical children through the use of distinct and biologically salient speech stimuli. Surprisingly, despite prominent individual differences related to social communication, voice-selective STS did not show significantly greater connectivity for mother’s voice than for female control voices at the group-averaged level. These results suggest that tightly coordinated neural activity between voice-selective STS and brain regions serving reward and affective processes is specific to children with greater social communication abilities.
An important question is whether brain responses to mother’s voice simply reflect the intrinsic pleasantness of this vocal source compared to the control voices. We addressed this question using several additional analyses. First, we behaviorally characterized all vocal stimuli and found that the female control voice samples were rated as being just as pleasant as the mother’s voice samples. Second, we found that, despite the control samples being equally pleasant, mother’s voice elicited greater activity and connectivity compared to control voices in auditory, voice-selective, face-processing, reward, salience, and default mode network regions; in contrast, no brain areas showed greater engagement for control voices than for mother’s voice. Third, analysis of the two control voices, which had received significantly different pleasantness ratings, revealed comparable brain responses across these key brain systems. Together, these results indicate that vocal pleasantness alone is not sufficient to drive brain activity in the wide range of brain systems engaged by mother’s voice.
Another question is whether brain responses to mother’s voice simply reflect a familiarity response to a recognizable vocal source (38, 39). A number of distinguishing features of the current results suggest that mother’s voice elicits a more specialized form of response than the response identified in these previous findings. For example, familiarity effects in previous studies have failed to identify primary auditory cortex, structures of the reward network, including the NAc, OFC, and vmPFC, or key nodes of the salience network, including the dACC and AI (20). Moreover, if familiarity were the only variable driving responses to mother’s voice, one would not expect to see a strong relation between children’s social skills and brain connectivity during mother’s voice processing. Based on these findings, we hypothesize that brain responses to mother’s voice reflect specialized representations of a salient source for social learning in a child’s life.
In conclusion, we have identified key functional systems and circuits underlying the perception of a foundational sound source for social communication in a child: mother’s voice. Critically, the degree of engagement of these functional systems represents a biological signature of individual differences in social communication abilities. Our findings provide a novel neurobiological template for the investigation of normal social development as well as clinical disorders such as autism (37), in which perception of biologically salient voices may be impaired (40).
Methods
Participants.
The Stanford University Institutional Review Board approved the study protocol. Parental consent and children's assent were obtained for all evaluation procedures, and children were paid for their participation in the study. All children were required to have a full-scale intelligence quotient (IQ) >80, as measured by the Wechsler Abbreviated Scale of Intelligence (WASI) (41). Participants were the biological offspring of the mothers whose voices were used in this study (i.e., none of our participants were adopted, and therefore none of the mothers’ voices were from an adoptive mother), and all participants were raised in homes that included their mothers. Participants’ neuropsychological and language characteristics are provided in Tables S1 and S2, respectively. Details are provided in SI Methods.
Stimuli.
Stimuli consisted of the three nonsense words “teebudishawlt,” “keebudishawlt,” and “peebudishawlt,” produced by the participant’s mother and by two female control speakers, both of whom were also mothers (Fig. 1; see Audio Files S1–S6 for audio examples). A second class of stimuli included in the study was nonspeech environmental sounds. Details are provided in SI Methods.
Data Acquisition Parameters.
All fMRI data were acquired in a single session at the Richard M. Lucas Center for Imaging at Stanford University. Functional images were acquired on a 3-T Signa scanner (General Electric) using a custom-built head coil. Details are provided in SI Methods.
fMRI Task.
Auditory stimuli were presented in 10 separate runs, each lasting 4 min. The order of stimulus presentation was the same for each subject. Details are provided in SI Methods.
fMRI Preprocessing.
Details of fMRI preprocessing are provided in SI Methods.
Voxelwise Analysis of fMRI Activation.
The goal of the voxelwise analysis of fMRI activation was to identify brain regions that showed differential activity levels in response to mother’s voice, female control voices, and environmental sounds. Details are provided in SI Methods.
Effective Connectivity Analysis.
Effective connectivity analysis was performed using generalized psychophysiological interaction (gPPI) analysis (42), which is more sensitive than standard PPI to context-dependent differences in connectivity. Details are provided in SI Methods.
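For illustration only, the sketch below assembles the core ingredient of a gPPI-style model under simplifying assumptions: one interaction regressor per condition, formed by multiplying the seed region's time series with that condition's task regressor, with all terms entering a single GLM. A full gPPI analysis (42) additionally involves deconvolution and HRF convolution, which are omitted here; all time series and regressors below are placeholders.

```python
import numpy as np

# Hypothetical inputs: seed (e.g., STS) time series and one boxcar regressor
# per condition (mother's voice, control voices), all of length n_timepoints.
n_timepoints = 240
rng = np.random.default_rng(0)
seed_ts = rng.standard_normal(n_timepoints)
mother_reg = (np.arange(n_timepoints) % 40 < 10).astype(float)
control_reg = ((np.arange(n_timepoints) % 40 >= 20)
               & (np.arange(n_timepoints) % 40 < 30)).astype(float)

# gPPI-style design matrix: intercept, task regressors, seed time series,
# and one seed-by-condition interaction term per condition.
design = np.column_stack([
    np.ones(n_timepoints),
    mother_reg,
    control_reg,
    seed_ts,
    seed_ts * mother_reg,    # interaction term for mother's voice
    seed_ts * control_reg,   # interaction term for control voices
])

# Fit the GLM to a target region's time series by least squares;
# the interaction betas index condition-specific connectivity with the seed.
target_ts = rng.standard_normal(n_timepoints)
betas, *_ = np.linalg.lstsq(design, target_ts, rcond=None)
print("interaction betas (mother, control):", betas[4], betas[5])
```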
Brain-Behavior Analysis.
Regression analysis was used to examine the relationship between brain signatures of mother’s voice perception and social and language skills. Social function was assessed using the Social Communication subscale of the SRS-2 (22). Language function was assessed using the CELF-4 (23), a standard instrument for measuring language abilities in neurotypical children; regression analyses were conducted using the Core Language Score of the CELF, a measure of general language ability. Brain-behavior relationships were examined using analyses of both activation levels and effective connectivity. Details are provided in SI Methods.
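As a rough sketch of such a brain-behavior regression (hypothetical values; the actual analysis is detailed in SI Methods), each child's connectivity or activation estimate for the mother's voice condition could be regressed on their SRS-2 Social Communication score:

```python
import numpy as np
from scipy import stats

# Hypothetical per-child values: a connectivity (or activation) estimate for
# the mother's voice condition and an SRS-2 Social Communication score.
connectivity = np.array([0.12, 0.30, 0.25, 0.05, 0.41, 0.18, 0.27, 0.33])
srs_social_comm = np.array([48, 55, 52, 45, 60, 50, 54, 57])

# Simple linear regression relating brain measure to behavioral score.
result = stats.linregress(connectivity, srs_social_comm)
print(f"slope = {result.slope:.2f}, r = {result.rvalue:.2f}, p = {result.pvalue:.3f}")
```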
Functional Brain Connectivity and Prediction of Social Function.
To examine the robustness and reliability of brain connectivity between STS and reward, affective, salience detection, and face-processing brain regions for predicting social communication scores, we performed a confirmatory machine-learning analysis employing balanced fourfold cross-validation (CV) combined with linear regression (25). Details are provided in SI Methods.
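A minimal sketch of a fourfold CV prediction scheme of the kind cited above (25), written with scikit-learn conventions; the feature matrix, scores, and fold handling are placeholders, and the published analysis may differ in its balancing procedure and statistics.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical data: per-child connectivity features and social communication scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 5))  # e.g., STS connectivity with 5 target regions
y = X @ np.array([2.0, 1.0, 0.5, 0.0, -1.0]) + rng.normal(scale=0.5, size=24)

# Fourfold CV: fit on three quarters of the children, predict the held-out fourth.
predicted = np.zeros_like(y)
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predicted[test_idx] = model.predict(X[test_idx])

# Agreement between observed and cross-validated predicted scores.
r, p = pearsonr(y, predicted)
print(f"r(observed, predicted) = {r:.2f}, p = {p:.4f}")
```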
Please see SI Methods for (i) Movement Criteria for Inclusion in fMRI Analysis, (ii) Signal-Level Analysis, (iii) Stimulus Design Considerations, (iv) Stimulus Recording, (v) Stimulus Postprocessing, (vi) Pleasantness and Excitement Ratings for Vocal Stimuli, and (vii) Postscan Speaker Identity Recognition Task, and SI Results for (i) fMRI Sex Difference Analysis and (ii) Control Voice Analysis.
SI Results
fMRI Sex Difference Analysis.
We examined whether male and female participants showed different univariate activity for the [mother’s voice > female control voices] contrast. We performed an ROI analysis within all 38 brain regions identified for the [mother’s voice > female control voices] contrast in the group results (Fig. 2). This analysis included 13 ROIs in bilateral STC, nine ROIs in bilateral frontal cortex, seven ROIs in bilateral parietal cortex, three ROIs in bilateral occipital cortex, one ROI in the anterior cingulate, and five subcortical structures. Signal level was calculated by extracting the β-value from individual subjects’ contrast maps for the [mother’s voice > female control voices] comparison, and the mean β-value within each ROI was computed for this contrast in all subjects. Because of the small number of participants in each group (7 females; 17 males), we performed Wilcoxon rank sum tests comparing mean β-values within each ROI between the female and male groups. None of the 38 regions identified in the combined group analysis showed a significant sex difference (P < 0.01) for the [mother’s voice > female control voices] contrast.
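For a single ROI, this sex comparison amounts to a rank-sum test on the two groups' mean β-values; a sketch with placeholder values (scipy.stats.ranksums implements the Wilcoxon rank-sum test):

```python
import numpy as np
from scipy import stats

# Hypothetical mean beta-values within one ROI for the
# [mother's voice > female control voices] contrast.
female_betas = np.array([0.21, 0.35, 0.10, 0.28, 0.19, 0.31, 0.25])   # n = 7
male_betas = np.array([0.18, 0.22, 0.30, 0.15, 0.27, 0.24, 0.20,
                       0.33, 0.12, 0.26, 0.29, 0.17, 0.23, 0.21,
                       0.19, 0.28, 0.25])                              # n = 17

# Wilcoxon rank-sum test comparing the female and male groups.
stat, p_value = stats.ranksums(female_betas, male_betas)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```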
Control Voice Analysis.
Behavioral ratings acquired from an independent cohort of children (Pleasantness and Excitement Ratings for Vocal Stimuli, above) showed that one of the female control voices received significantly greater pleasantness ratings than the other (P < 0.001); the mean (± SD) rating was 3.29 (± 0.78) for control voice 1 and 2.26 (± 0.90) for control voice 2. We therefore performed whole-brain and ROI analyses to examine whether more pleasant vocal features alone were sufficient to elicit increased activity in the brain systems highlighted by the [mother’s voice > female control voices] contrast (Fig. 2). Whole-brain analyses were performed for the [female control 1 > female control 2] and [female control 2 > female control 1] contrasts, and results were thresholded at P < 0.01 and 128 voxels. An ROI analysis was also performed for these same two contrasts using the same 38 ROIs identified for the signal-level analysis. Signal level was calculated by extracting the β-value from individual subjects’ contrast maps for the [female control 1 > female control 2] and [female control 2 > female control 1] comparisons. The mean β-value within each ROI was computed for each contrast in all subjects, and one-sample t tests were performed on the mean β-values within each ROI for both contrasts.
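Likewise, for a single ROI the control-voice comparison reduces to a one-sample t test of the per-subject mean β-values against zero; a sketch with placeholder values:

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject mean beta-values within one ROI for the
# [female control 1 > female control 2] contrast.
roi_betas = np.array([0.05, -0.02, 0.08, 0.01, -0.04, 0.03, 0.06, -0.01,
                      0.02, 0.00, -0.03, 0.04])

# One-sample t test against zero: is this contrast reliably non-zero in this ROI?
t_stat, p_value = stats.ttest_1samp(roi_betas, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```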
Acknowledgments
We thank all the children and their parents who participated in our study; E. Adair and the staff at the Lucas Center for Imaging for assistance with data collection; and H. Abrams and C. Anderson for help with stimulus production. This work was supported by NIH Grants K01 MH102428 (to D.A.A.), K25 HD074652 (to S.R.), and DC011095 and MH084164 (to V.M.) and by the Singer Foundation and the Simons Foundation (V.M.).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1602948113/-/DCSupplemental.
References
- 1. Belin P, Fecteau S, Bédard C. Thinking the voice: Neural correlates of voice perception. Trends Cogn Sci. 2004;8(3):129–135. doi: 10.1016/j.tics.2004.01.008.
- 2. DeCasper AJ, Fifer WP. Of human bonding: Newborns prefer their mothers’ voices. Science. 1980;208(4448):1174–1176. doi: 10.1126/science.7375928.
- 3. Kisilevsky BS, Hains SM. Onset and maturation of fetal heart rate response to the mother’s voice over late gestation. Dev Sci. 2011;14(2):214–223. doi: 10.1111/j.1467-7687.2010.00970.x.
- 4. Seltzer LJ, Prososki AR, Ziegler TE, Pollak SD. Instant messages vs. speech: Hormones and why we still need to hear each other. Evol Hum Behav. 2012;33(1):42–45. doi: 10.1016/j.evolhumbehav.2011.05.004.
- 5. Adams RE, Passman RH. Effects of visual and auditory aspects of mothers and strangers on the play and exploration of children. Dev Psychol. 1979;15(3):269–274.
- 6. Mumme DL, Fernald A, Herrera C. Infants’ responses to facial and vocal emotional signals in a social referencing paradigm. Child Dev. 1996;67(6):3219–3237.
- 7. Liu HM, Kuhl PK, Tsao FM. An association between mothers’ speech clarity and infants’ speech discrimination skills. Dev Sci. 2003;6(3):F1–F10.
- 8. Imafuku M, Hakuno Y, Uchida-Ota M, Yamamoto J, Minagawa Y. “Mom called me!” Behavioral and prefrontal responses of infants to self-names spoken by their mothers. Neuroimage. 2014;103:476–484. doi: 10.1016/j.neuroimage.2014.08.034.
- 9. Purhonen M, Kilpeläinen-Lees R, Valkonen-Korhonen M, Karhu J, Lehtonen J. Cerebral processing of mother’s voice compared to unfamiliar voice in 4-month-old infants. Int J Psychophysiol. 2004;52(3):257–266. doi: 10.1016/j.ijpsycho.2003.11.003.
- 10. Binder JR, Desai RH, Graves WW, Conant LL. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb Cortex. 2009;19(12):2767–2796. doi: 10.1093/cercor/bhp055.
- 11. Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B. Voice-selective areas in human auditory cortex. Nature. 2000;403(6767):309–312. doi: 10.1038/35002078.
- 12. Haber SN, Knutson B. The reward circuit: Linking primate anatomy and human imaging. Neuropsychopharmacology. 2010;35(1):4–26. doi: 10.1038/npp.2009.129.
- 13. Adolphs R, Tranel D, Damasio AR. The human amygdala in social judgment. Nature. 1998;393(6684):470–474. doi: 10.1038/30982.
- 14. Constantino JN, Todd RD. Autistic traits in the general population: A twin study. Arch Gen Psychiatry. 2003;60(5):524–530. doi: 10.1001/archpsyc.60.5.524.
- 15. Bricker PD, Pruzansky S. Speaker recognition. In: Lass NJ, editor. Contemporary Issues in Experimental Phonetics. Academic; New York: 1976. pp. 295–326.
- 16. Hecker MH. Speaker recognition. An interpretive survey of the literature. ASHA Monogr. 1971;16:1–103.
- 17. Greicius MD, Krasnow B, Reiss AL, Menon V. Functional connectivity in the resting brain: A network analysis of the default mode hypothesis. Proc Natl Acad Sci USA. 2003;100(1):253–258. doi: 10.1073/pnas.0135058100.
- 18. Gusnard DA, Akbudak E, Shulman GL, Raichle ME. Medial prefrontal cortex and self-referential mental activity: Relation to a default mode of brain function. Proc Natl Acad Sci USA. 2001;98(7):4259–4264. doi: 10.1073/pnas.071043098.
- 19. Caspers J, et al. Functional characterization and differential coactivation patterns of two cytoarchitectonic visual areas on the human posterior fusiform gyrus. Hum Brain Mapp. 2014;35(6):2754–2767. doi: 10.1002/hbm.22364.
- 20. Menon V, Uddin LQ. Saliency, switching, attention and control: A network model of insula function. Brain Struct Funct. 2010;214(5-6):655–667. doi: 10.1007/s00429-010-0262-0.
- 21. Belin P, Bestelmeyer PE, Latinus M, Watson R. Understanding voice perception. Br J Psychol. 2011;102(4):711–725. doi: 10.1111/j.2044-8295.2011.02041.x.
- 22. Constantino JN, Gruber CP. Social Responsiveness Scale, Second Edition (SRS-2). Western Psychological Services; Torrance, CA: 2012.
- 23. Semel E, Wiig EH, Secord WH. Clinical Evaluation of Language Fundamentals. 4th Ed. Psychological Corporation; San Antonio, TX: 2003.
- 24. Evans TM, et al. Brain structural integrity and intrinsic functional connectivity forecast 6 year longitudinal growth in children’s numerical abilities. J Neurosci. 2015;35(33):11743–11750. doi: 10.1523/JNEUROSCI.0216-15.2015.
- 25. Cohen JR, et al. Decoding developmental differences and individual variability in response inhibition through predictive analyses across individuals. Front Hum Neurosci. 2010;4:47. doi: 10.3389/fnhum.2010.00047.
- 26. Supekar K, et al. Neural predictors of individual differences in response to math tutoring in primary-grade school children. Proc Natl Acad Sci USA. 2013;110(20):8230–8235. doi: 10.1073/pnas.1222154110.
- 27. Hickok G, Poeppel D. The cortical organization of speech processing. Nat Rev Neurosci. 2007;8(5):393–402. doi: 10.1038/nrn2113.
- 28. Wang X, Merzenich MM, Sameshima K, Jenkins WM. Remodelling of hand representation in adult cortex determined by timing of tactile stimulation. Nature. 1995;378(6552):71–75. doi: 10.1038/378071a0.
- 29. Recanzone GH, Schreiner CE, Merzenich MM. Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys. J Neurosci. 1993;13(1):87–103. doi: 10.1523/JNEUROSCI.13-01-00087.1993.
- 30. Bao S, Chan VT, Merzenich MM. Cortical remodelling induced by activity of ventral tegmental dopamine neurons. Nature. 2001;412(6842):79–83. doi: 10.1038/35083586.
- 31. Salimpoor VN, et al. Interactions between the nucleus accumbens and auditory cortices predict music reward value. Science. 2013;340(6129):216–219. doi: 10.1126/science.1231059.
- 32. Menon V, Levitin DJ. The rewards of music listening: Response and physiological connectivity of the mesolimbic system. Neuroimage. 2005;28(1):175–184. doi: 10.1016/j.neuroimage.2005.05.053.
- 33. Huron D. Sweet Anticipation: Music and the Psychology of Expectation. MIT Press; Cambridge, MA: 2006.
- 34. Lamb ME. Developing trust and perceived effectance in infancy. In: Lipsitt LP, editor. Advances in Infancy Research. Vol 1. Ablex; Norwood, NJ: 1981. pp. 101–127.
- 35. Curtin S, Vouloumanos A. Speech preference is associated with autistic-like behavior in 18-month-olds at risk for autism spectrum disorder. J Autism Dev Disord. 2013;43(9):2114–2120. doi: 10.1007/s10803-013-1759-1.
- 36. Vouloumanos A, Curtin S. Foundational tuning: How infants’ attention to speech predicts language development. Cogn Sci. 2014;38(8):1675–1686. doi: 10.1111/cogs.12128.
- 37. Abrams DA, et al. Underconnectivity between voice-selective cortex and reward circuitry in children with autism. Proc Natl Acad Sci USA. 2013;110(29):12060–12065. doi: 10.1073/pnas.1302982110.
- 38. Shah NJ, et al. The neural correlates of person familiarity. A functional magnetic resonance imaging study with clinical implications. Brain. 2001;124(Pt 4):804–815. doi: 10.1093/brain/124.4.804.
- 39. von Kriegstein K, Kleinschmidt A, Sterzer P, Giraud AL. Interaction of face and voice areas during speaker recognition. J Cogn Neurosci. 2005;17(3):367–376. doi: 10.1162/0898929053279577.
- 40. Uddin LQ, et al. Salience network-based classification and prediction of symptom severity in children with autism. JAMA Psychiatry. 2013;70(8):869–879. doi: 10.1001/jamapsychiatry.2013.104.
- 41. Wechsler D. Wechsler Abbreviated Scale of Intelligence. Harcourt; San Antonio, TX: 1999.
- 42. McLaren DG, Ries ML, Xu G, Johnson SC. A generalized form of context-dependent psychophysiological interactions (gPPI): A comparison to standard approaches. Neuroimage. 2012;61(4):1277–1286. doi: 10.1016/j.neuroimage.2012.03.068.
- 43. Glover GH, Law CS. Spiral-in/out BOLD fMRI for increased SNR and reduced susceptibility artifacts. Magn Reson Med. 2001;46(3):515–522. doi: 10.1002/mrm.1222.
- 44. Abrams DA, et al. Decoding temporal structure in music and speech relies on shared brain resources but elicits different fine-scale spatial patterns. Cereb Cortex. 2011;21(7):1507–1518. doi: 10.1093/cercor/bhq198.
- 45. Abrams DA, et al. Multivariate activation and connectivity patterns discriminate speech intelligibility in Wernicke’s, Broca’s, and Geschwind’s areas. Cereb Cortex. 2013;23(7):1703–1714. doi: 10.1093/cercor/bhs165.
- 46. Iuculano T, et al. Brain organization underlying superior mathematical abilities in children with autism. Biol Psychiatry. 2014;75(3):223–230. doi: 10.1016/j.biopsych.2013.06.018.
- 47. Cox RW. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Comput Biomed Res. 1996;29(3):162–173. doi: 10.1006/cbmr.1996.0014.
- 48. Bullmore E, et al. Statistical methods of estimation and inference for functional MR image analysis. Magn Reson Med. 1996;35(2):261–277. doi: 10.1002/mrm.1910350219.
- 49. Forman SD, et al. Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): Use of a cluster-size threshold. Magn Reson Med. 1995;33(5):636–647. doi: 10.1002/mrm.1910330508.
- 50. Ward BD. Simultaneous Inference for fMRI Data. AFNI 3dDeconvolve Documentation. Medical College of Wisconsin; Milwaukee, WI: 2000.
- 51. Smith SM, et al. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage. 2004;23(Suppl 1):S208–S219. doi: 10.1016/j.neuroimage.2004.07.051.
- 52. Vul E, Harris C, Winkielman P, Pashler H. Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspect Psychol Sci. 2009;4(3):274–290. doi: 10.1111/j.1745-6924.2009.01125.x.
- 53. Mühlau M, et al. Structural brain changes in tinnitus. Cereb Cortex. 2006;16(9):1283–1288. doi: 10.1093/cercor/bhj070.
- 54. Abrams DA, et al. Inter-subject synchronization of brain responses during natural music listening. Eur J Neurosci. 2013;37(9):1458–1469. doi: 10.1111/ejn.12173.
- 55. Morosan P, et al. Human primary auditory cortex: Cytoarchitectonic subdivisions and mapping into a spatial reference system. Neuroimage. 2001;13(4):684–701. doi: 10.1006/nimg.2000.0715.
- 56. Abrams DA, Nicol T, Zecker S, Kraus N. Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J Neurosci. 2008;28(15):3958–3965. doi: 10.1523/JNEUROSCI.0187-08.2008.
- 57. Gustafson K, House D. Fun or boring? A web-based evaluation of expressive synthesis for children. Eurospeech; Aalborg, Denmark: 2001.