Significance
In the adult brain, it is increasingly accepted that multimodal speech perception interfaces with a bidirectional interaction between the speech perception and production systems. Speech perception in infancy is already highly multisensory, suggesting an early emerging representation of speech across sensory modalities. We provide electrophysiological evidence for sensorimotor influences on auditory speech discrimination responses in 3-mo-old infants, who are several months away from producing canonical babbling. Auditorily, infants discriminated both contrasts tested. However, a tongue-tip articulatory inhibition diminished the /ɗa/-/ɖa/ discrimination (both phones involve tongue-tip movement at the place of articulation) while accentuating the /ba/-/ɗa/ discrimination (different articulators are involved in producing the two phones). These findings suggest that prebabbling infants’ speech perception is more robustly multisensory than previously considered.
Keywords: infancy, sensorimotor, speech perception, EEG
Abstract
While there is increasing acceptance that even young infants detect correspondences between heard and seen speech, the common view is that oral-motor movements related to speech production cannot influence speech perception until infants begin to babble or speak. We investigated the extent of multimodal speech influences on auditory speech perception in prebabbling infants who have limited speech-like oral-motor repertoires. We used event-related potentials (ERPs) to examine how sensorimotor influences from the infant’s own articulatory movements impact auditory speech perception in 3-mo-old infants. In experiment 1, there were ERP discriminative responses to phonetic category changes across two phonetic contrasts (bilabial–dental /ba/-/ɗa/; dental–retroflex /ɗa/-/ɖa/) in a mismatch paradigm, indicating that infants auditorily discriminated both contrasts. In experiment 2, inhibiting infants’ own tongue-tip movements had a disruptive influence on the early ERP discriminative response to the /ɗa/-/ɖa/ contrast only. The same articulatory inhibition had contrasting effects on the perception of the /ba/-/ɗa/ contrast, which requires different articulators (the lips vs. the tongue) during production, and the /ɗa/-/ɖa/ contrast, in which both phones require tongue-tip movement at the place of articulation. This articulatory distinction between the two contrasts plausibly accounts for the distinct influence of tongue-tip suppression on the neural responses to phonetic category change in definitively prebabbling 3-mo-old infants. The results, showing specificity in the relation between oral-motor inhibition and phonetic speech discrimination, suggest a surprisingly early mapping between auditory and motor speech representations already in prebabbling infants.
Infants rapidly acquire robust representations of the native phonetic repertoire from the natural multisensory speech input of their environment. Multimodal speech signals are generated by a common underlying source—the vocal tract and the articulatory movements used during production (1, 2). Adult speech perception is influenced by synchronously occurring multimodal speech cues, including auditory, visual, motor, and sensorimotor signals (3). Recent advances reveal not only that speech production relies on both auditory and sensorimotor signals (4, 5), but also that sensorimotor input can affect the perception of auditory (6) and visual (7) speech. Indeed, neural evidence indicates bidirectional interaction between the speech perception and production systems in the adult brain (8). It has been widely assumed that interactions between articulator-specific sensorimotor information and acoustic phonetic perception appear later in development, after infants begin to babble and to produce speech themselves. This assumption is not surprising given that motor coordination is immature early in life and appears to have a protracted development. However, to fully understand how infants acquire their native speech sound repertoire, it is critical to examine whether sensorimotor/motoric dimensions of speech are relevant for auditory speech perception even in infants who are prebabbling. If so, then sensorimotor influences on speech perception may be part of the foundation that sets the stage for language acquisition in general and babbling in particular, rather than production experience driving the eventual auditory-sensorimotor/motor speech interaction.
While the speech signal that infants experience and learn from is multimodal, speech perception research during the acquisition period has focused mainly on auditory speech perception, and to a modest extent, on audiovisual speech perception. Infants reliably match heard and seen speech at 2 mo of age by looking longer at the face that is articulating the syllable being played (9, 10). Remarkably, infants are also able to match audio and visual speech even for nonnative consonants and vowels, which they have not encountered in their linguistic environment (11). While some have suggested that audiovisual speech perception abilities in infants reflect a domain-general preference for synchronously occurring stimuli (12), there is neural evidence of multimodal phonetic representation already at 2 mo of age (13). In ref. 13, a phonetic mismatch response (MMR) was observed to the category change of an auditory vowel, both when the preceding stimuli were repetitions of visemes (a face articulating the same or a different vowel) and when they were speech sounds. The consistency of the MMRs to the phonetic category change regardless of modality suggests that infants have access to an integrated intermodal representation (13).
There is less experimental work investigating sensorimotor interactions with speech perception; however, several recent behavioral studies have addressed this question by experimentally manipulating infants’ own oral-motor movements. In the first such study, 4-mo-old infants’ labial configuration was manipulated (by gently holding an appropriately shaped object in their mouth) to resemble the shape made for producing either /i/ or /u/ vowels while they were tested in an audiovisual matching task. Results showed that infants’ matching of these same vowels was changed by the manipulation (14). The influence of sensorimotor cues on auditory-only speech perception was more recently tested, this time with infants aged 6 mo, who do not typically produce well-formed consonant–vowel (CV) syllables. Replicating previous work (15), English-learning infants at this age discriminated a dental /ɗa/–retroflex /ɖa/ phonetic contrast that is nonnative to English speakers but native to Hindi speakers. These two consonants differ, in adult Hindi production, only in the placement of the tongue tip during articulation: The dental involves placement of the tongue tip behind the upper front teeth, whereas retroflex production involves curling the tongue tip back and placing it against the roof of the mouth. However, when an infant’s tongue-tip movement was inhibited by having a caregiver gently hold a teether on the tongue, discrimination of this nonnative /ɗa/-/ɖa/ contrast was disrupted (16, 17). A control experiment showed that discrimination of this contrast was maintained when a different teether that does not interfere with tongue-tip movement was used, indicating that it was not the mere presence of a teething toy but rather the inhibition of the relevant articulator that accounted for the disruption of discrimination (16).
These specific sensorimotor influences on auditory and audiovisual speech perception provide evidence that the relation between sensorimotor information and auditory speech perception is present in infants who have not had extensive speech production or babbling experience. Although preverbal infants this young have yet to gain the full articulatory control required to generate speech-like sounds, behavioral studies reviewed above suggest that a sensorimotor mapping of the articulators may be available to infants before babbling begins, possibly through spontaneously generated movement patterns during prenatal development (18). These patterns may be progressively refined through orofacial movements (e.g., sucking movements and nonspeech vocalizations) that help to shape the motor articulatory space that must be aligned with the phonetic perceptual space to ensure correct productions.
Anatomically, the core neural pathways for speech, including the cortical connections between the frontal (productive) and temporal (receptive) speech areas, are in place before term birth (19). While the ventral pathway is more mature at birth, the dorsal pathway (i.e., the arcuate fasciculus), which functionally transforms auditory and motor speech codes, reaches comparable maturity by 10 wk (20, 21). In ref. 21, the authors concluded that functional connectivity, or cross-talk, among the suprasylvian part of the arcuate fasciculus, the posterior part of the superior temporal sulcus, and area 44 in the left inferior frontal region is established within the first few postnatal months, based on a unique correlational pattern in the maturational indices across these regions, which also collectively form key nodes of the adult phonological loop. The early maturation and functional engagement of the arcuate fasciculus, which is a bidirectional tract between the productive and receptive areas, suggest that the connectivity necessary to subserve sensorimotor influence on auditory perception is in place within several months after birth.
Current Study
The aim of the current study was to examine whether auditory speech discrimination is affected by sensorimotor influences at an age when the productive and receptive regions of the brain are functionally connected but when infants are still several months away from beginning to babble CV syllables. CV syllable production begins around 7 to 9 mo of age (22); thus, testing infants at 3 mo of age ensures that babbling would not have begun. We used electroencephalography (EEG) to investigate how sensorimotor input could influence 3-mo-old infants’ ability to auditorily discriminate phonetic contrasts that minimally differ in the place of articulation, and examined the neural dynamics underlying auditory-sensorimotor integration in preverbal infants. We measured infants’ event-related potential (ERP) responses during the speech perception task—without (experiment 1) and with (experiment 2) sensorimotor influences—using a mismatch paradigm designed to assess phonetic category discrimination. This ERP paradigm has been validated by previous studies showing that young infants, and even prematurely born newborns, detect phonetic category change, as evidenced by a phonetic MMR (23, 24).
In experiment 1 (N = 22), English-learning infants passively listened to speech syllables presented in sequences of four syllables with an isochronous onset. In experiment 2 (N = 22), infants’ tongue-tip movements were inhibited using a teething toy (Tomy Learning Curve Fruity Teethers) that was gently held in the infant’s mouth by the caregiver while infants passively listened to the syllables (Fig. 1). In each experiment, we measured the ERP discriminative responses to two phonetic contrasts: an English bilabial /ba/ vs. dental /ɗa/ contrast and a non-English (Hindi) dental /ɗa/ vs. retroflex /ɖa/ contrast. Previous behavioral and EEG studies demonstrate that prelingual infants auditorily discriminate both the /ba/-/ɗa/ and the /ɗa/-/ɖa/ phonetic contrasts (25, 26); therefore, in experiment 1 (auditory discrimination), we hypothesized that 3-mo-old infants would discriminate both contrasts. Behaviorally, tongue-tip movement suppression disrupted 6-mo-old infants’ discrimination of the /ɗa/-/ɖa/ contrast (16), but no prior studies examined its effects on the /ba/-/ɗa/ discrimination. Furthermore, sensorimotor influences on auditory discrimination had not previously been examined in infants as young as 3 mo of age. In experiment 2, we hypothesized that if sensorimotor-auditory speech relations are present and functional even in 3-mo-old infants, then a similar disruption in the /ɗa/-/ɖa/ contrast discrimination may be expected. While both the dental /ɗa/ and the retroflex /ɖa/ require tongue-tip movement during articulation, /ba/ production requires bilabial movement during articulation. If this articulator distinction is a salient feature of discrimination, then we may expect tongue-tip inhibition to differentially influence the ERP responses to the /ba/-/ɗa/ and the /ɗa/-/ɖa/ contrast. Alternatively, because /ɗa/ is present in both contrasts, the tongue-tip inhibition may result in similar disruption across both contrasts. 
The goal of the current study was to examine the specificity of the auditory-sensorimotor relation that reflects the underlying articulatory code in prebabbling infants.
Fig. 1.
Midsagittal views of the infant vocal tract. (A) The tongue is in its natural state. (B) The teething toy held by the caregiver depresses the tongue tip. The tongue is a muscular hydrostat, modeled as a solid muscular cylinder that maintains a constant volume under pressure (50); thus, a change in tongue height produces a compensatory force at the tongue root to maintain constant volume.
Results
Cluster Analyses.
Experiment 1.
We examined the main effect of Condition (i.e., the difference between standard and deviant trials; see Materials and Methods) to assess the ERP discriminative responses to the phonetic category changes. We observed a significant difference in a cluster of left-frontal electrodes between 450 and 710 ms following the onset of the fourth syllable (P = 0.023, Monte Carlo cluster-corrected; Cohen’s d = 0.74; Fig. 2). We then tested for a Condition (Standards vs. Deviants) by Phonetic Contrast (/ba/-/ɗa/ contrast vs. /ɗa/-/ɖa/ contrast) interaction, which was not significant (the smallest cluster P value was 0.63). The sensors and time windows identified from the first main analysis were extracted and averaged per subject and experimental condition.
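The cluster-based permutation logic behind these Monte Carlo cluster-corrected P values can be illustrated with a deliberately simplified sketch. The code below is a hypothetical single-channel (temporal-only) version in Python, not the actual analysis pipeline, which used multichannel spatiotemporal clustering: pointwise paired t values are thresholded, contiguous above-threshold runs form clusters, and the largest observed cluster mass is compared against a null distribution built by randomly sign-flipping each subject's standard-minus-deviant difference.

```python
import numpy as np

def paired_t(diff):
    # diff: (n_subjects, n_times) array of standard-minus-deviant differences
    n = diff.shape[0]
    return diff.mean(0) / (diff.std(0, ddof=1) / np.sqrt(n))

def clusters(mask):
    # Contiguous runs of True -> list of time-index arrays
    runs, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            runs.append(np.arange(start, i))
            start = None
    if start is not None:
        runs.append(np.arange(start, len(mask)))
    return runs

def cluster_perm_test(diff, thresh=2.08, n_perm=1000, seed=0):
    """Monte Carlo cluster-corrected P for the largest temporal cluster.

    thresh=2.08 is the two-tailed .05 critical t for df=21 (N=22 subjects).
    """
    rng = np.random.default_rng(seed)
    t_obs = paired_t(diff)
    obs = [np.abs(t_obs[c]).sum() for c in clusters(np.abs(t_obs) > thresh)]
    if not obs:
        return None, 1.0
    obs_max = max(obs)
    null = np.empty(n_perm)
    for p in range(n_perm):
        # Sign-flip each subject's whole difference wave (exchangeability under H0)
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        t_p = paired_t(diff * signs)
        cl = [np.abs(t_p[c]).sum() for c in clusters(np.abs(t_p) > thresh)]
        null[p] = max(cl) if cl else 0.0
    return obs_max, (null >= obs_max).mean()
```

Because only the maximum cluster mass per permutation enters the null distribution, the resulting P value is corrected for the multiple comparisons across time points.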
Fig. 2.
Experiment 1: ERP responses. (A, Left) The grand averaged ERP time course of the deviant (red line) and standard (blue line) trials to the /ba/-/ɗa/ contrast. The mean voltage and the SEs for each condition are plotted for the left-anterior cluster of sensors. The vertical dotted lines indicate syllable onset (1 to 4). The gray bar indicates the time window of the significant spatiotemporal cluster (2.25–2.51 s; i.e., 450–710 ms post fourth-syllable onset) over the electrodes 9, 11, 12, and 13. (A, Right) Voltage topographies for the deviant and standard trials, and the difference (deviant − standard) averaged across the time window of the cluster. (B, Left) The grand averaged ERP time courses to the /ɗa/-/ɖa/ contrast. (B, Right) Voltage topographies for the standard and deviant trials, and the difference (deviant − standard).
Experiment 2.
When infants’ tongue-tip movement was inhibited, we did not observe a main effect of Condition for a comparison of the ERPs between the standard and deviant trials across both phonetic contrasts (the smallest cluster P value was 0.16). However, there was a significant Condition by Phonetic Contrast interaction (P = 0.0052, Monte Carlo cluster-corrected); thus, we conducted additional cluster-based permutation paired t tests comparing standard vs. deviant trials for the /ba/-/ɗa/ contrast and the /ɗa/-/ɖa/ contrast separately. We observed a significant cluster for the /ba/-/ɗa/ contrast, over a cluster of central-posterior electrodes, 290–490 ms following the onset of the fourth syllable (P = 0.020, Monte Carlo cluster-corrected; Cohen’s d = 0.93; Fig. 3). However, no significant cluster was observed for the /ɗa/-/ɖa/ contrast (the smallest cluster P value was 0.36). The sensors and time windows from the test of Condition for the /ba/-/ɗa/ contrast were extracted and averaged per subject and experimental condition.
Fig. 3.
Experiment 2: ERP responses. (A, Left) The grand averaged ERP time course of the deviant (red line) and the standard (blue line) trials to the /ba/-/ɗa/ contrast. The mean voltage and the SEs for each condition are plotted for the left-anterior cluster of sensors. The vertical dotted lines indicate each syllable onset (1 to 4). The gray bar indicates the time window of the significant spatiotemporal cluster (2.09–2.29 s; i.e., 290–490 ms post fourth-syllable onset) over the electrodes 21, 33, 34, 36, and 38. (A, Right) Voltage topographies for the deviant and standard trials, and the difference (deviant − standard) averaged across the time window of the cluster. (B, Left) The grand averaged ERP time courses to the /ɗa/-/ɖa/ contrast from the same spatiotemporal cluster as in A (the /ba/-/ɗa/ contrast). An ERP discriminative response was not observed to the /ɗa/-/ɖa/ contrast in experiment 2. The gray bar indicates the time window of the significant cluster observed to the /ba/-/ɗa/ contrast in experiment 2. (B, Right) Voltage topographies for the standard and deviant trials, and the difference (deviant − standard).
Control Analyses.
Experiment 1.
To ensure that the difference revealed above is a response to the phonetic category change, rather than an experimental artifact, we compared the ERP responses in the same spatiotemporal cluster (450–710 ms) after each syllable in a three-way ANOVA with Condition (Standard and Deviant), Syllable Position (Repetitions [1 to 3] and Fourth), and Phonetic Contrast (/ba/-/ɗa/ and /ɗa/-/ɖa/) as factors. There were no main effects of Phonetic Contrast [F(1,21) < 1, ηp² = 0.039] or Syllable Position [F(1,21) < 1, ηp² = 0.012], indicating that, overall, the responses varied with neither the Phonetic Contrast nor the Syllable Position. We observed a significant main effect of Condition [F(1,21) = 13.678, P = 0.001, ηp² = 0.39] and a significant Condition by Syllable Position interaction [F(1,21) = 9.461, P = 0.006, ηp² = 0.311]. No other interaction effects were significant (all values of P > 0.29). Simple main effects showed that the ERP response following the Fourth syllable significantly differed between the standard and the deviant trials for both phonetic contrasts [/ba/-/ɗa/: F(1,21) = 6.937, P = 0.016; /ɗa/-/ɖa/: F(1,21) = 20.520, P < 0.001]; however, this difference was not detected for the preceding repeated syllables in either phonetic contrast [/ba/-/ɗa/: F(1,21) < 1, and /ɗa/-/ɖa/: F(1,21) < 1].
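The partial eta squared values reported above follow directly from the F statistics and degrees of freedom via ηp² = F·df1 / (F·df1 + df2); as a quick arithmetic check with the values reported for experiment 1:

```python
def partial_eta_sq(F, df_num, df_den):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (F * df_num) / (F * df_num + df_den)

# Reported main effect of Condition: F(1,21) = 13.678 -> eta_p^2 ~ 0.39
cond = partial_eta_sq(13.678, 1, 21)

# Reported Condition x Syllable Position interaction: F(1,21) = 9.461 -> eta_p^2 ~ 0.311
interact = partial_eta_sq(9.461, 1, 21)
```

Both values reproduce the reported effect sizes to the precision given in the text.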
Bayesian multilevel regression modeling further supported these results: There was an effect of Condition on the amplitude response to the /ba/-/ɗa/ (γ = 0.55, CI95% = [0.01, 1.08]) and to the /ɗa/-/ɖa/ (γ = 0.94, CI95% = [0.39, 1.48]) contrasts, indicated by a standard–deviant difference whose interval did not overlap with zero. The Bayesian credible intervals indicated that the effect of Condition was stronger for the /ɗa/-/ɖa/ contrast than for the /ba/-/ɗa/ contrast (Fig. 4A).
Fig. 4.
(A) Experiment 1. (Left) A comparison across Syllable Position (Repetitions and Fourth), Phonetic Contrast (/ba/-/ɗa/ and /ɗa/-/ɖa/), and Condition (Standard and Deviant). The mean voltage of the Fourth syllable (450–710 ms following syllable onset) and Repetitions (average of 450–600 ms following the onset of the first, second, and third syllables) are plotted separately for the /ba/-/ɗa/ and the /ɗa/-/ɖa/ contrasts. Means and SEs are plotted. (Right) Experiment 1: Bayesian multilevel regression model. Posterior distributions of the value of the Condition parameter, which indicates the degree of change from the Deviant to the Standard condition. The central dot indicates the highest density posterior mean, and the line indicates the 95% HPDI. (B) Experiment 2. (Left) The mean voltage of the Fourth syllable (290–490 ms following syllable onset) and Repetitions (average of 290–490 ms following the onset of the first, second, and third syllables) are plotted separately for the /ba/-/ɗa/ and the /ɗa/-/ɖa/ contrasts. (Right) Experiment 2: Bayesian multilevel regression model. Posterior distributions of the value of the Condition parameter, indicating the change in slope from the Deviants to the Standards across subjects. The central dot indicates the highest density posterior mean, and the line indicates the 95% HPDI.
Experiment 2.
A three-way ANOVA (Condition by Syllable Position by Phonetic Contrast) was conducted on the identified spatiotemporal cluster (290–490 ms) from experiment 2. The main effects of Phonetic Contrast [F(1,21) < 1, ηp² = 0.008] and Syllable Position [F(1,21) < 1, ηp² = 0.018] were not significant, but there was a significant main effect of Condition [F(1,21) = 14.844, P = 0.001, ηp² = 0.414]. There were multiple significant interaction effects, including a Condition by Syllable Position interaction [F(1,21) = 4.389, P = 0.048, ηp² = 0.173], a Condition by Phonetic Contrast interaction [F(1,21) = 10.681, P = 0.004, ηp² = 0.337], and a significant three-way interaction [F(1,21) = 6.284, P = 0.020, ηp² = 0.230]; only the Phonetic Contrast by Syllable Position interaction was not significant [F(1,21) < 1, ηp² = 0.006]. Follow-up analyses indicated a significant difference between Conditions (Deviant and Standard) only at the level of the Fourth syllable for the /ba/-/ɗa/ contrast [F(1,21) = 29.606, P < 0.001]. There were no significant effects of Condition at any other level [F(1,21) < 1 for the repeated syllables in both contrasts and for the response to the Fourth syllable in the /ɗa/-/ɖa/ contrast].
The Bayesian multilevel regression model further supported these results. There was an effect of Condition on the amplitude response to the /ba/-/ɗa/ contrast (γ = 1.16, CI95% = [0.62, 1.70]). However, there was no effect of Condition for the /ɗa/-/ɖa/ contrast (γ = 0.003, CI95% = [−0.55, 0.55]); the credible interval overlapped with zero and the posterior was centered on zero, supporting no difference between the standard and the deviant trials (Fig. 4B).
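The credible-interval criterion used in these Bayesian analyses (an effect of Condition is supported when the 95% interval excludes zero) can be sketched generically. The code below is an illustrative percentile-interval check on synthetic posterior draws, not the actual multilevel model; the means and spreads are chosen only to mimic the two reported γ estimates:

```python
import numpy as np

def interval_excludes_zero(samples, level=0.95):
    """Equal-tailed credible interval from posterior samples, plus a zero-exclusion flag."""
    lo, hi = np.percentile(samples, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return (lo, hi), not (lo <= 0.0 <= hi)

rng = np.random.default_rng(0)

# Synthetic posterior resembling the /ba/-/da/ Condition effect (gamma ~ 1.16):
# the interval should exclude zero, supporting an effect
(ba_lo, ba_hi), ba_effect = interval_excludes_zero(rng.normal(1.16, 0.27, 4000))

# Synthetic posterior resembling the /da/-/Da/ Condition effect (gamma ~ 0):
# the interval should overlap zero, supporting no effect
(da_lo, da_hi), da_effect = interval_excludes_zero(rng.normal(0.0, 0.28, 4000))
```

With real posterior draws from the fitted model, the same function reproduces the decision rule applied in the text.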
Discussion
The current study investigated infants’ neural responses to phonetic category changes with and without oral-motor influence across two experiments. Each experiment targeted two distinct phonetic contrasts: a /ba/-/ɗa/ contrast and a /ɗa/-/ɖa/ contrast. In experiment 1, we observed a significant effect of Condition (i.e., differences in the neural response between standard and deviant trials) for both contrasts in a cluster of left-anterior electrodes. In experiment 2, when infants’ tongue-tip movements were suppressed, the data-driven approach revealed no overall effect of Condition, but a significant Condition by Phonetic Contrast interaction. For the /ba/-/ɗa/ contrast, the standard and deviant trials showed distinct responses in a cluster of posterior electrodes, but no differences were observed for the /ɗa/-/ɖa/ contrast. Follow-up analyses on the spatiotemporal cluster identified for the /ba/-/ɗa/ contrast in experiment 2 showed that the difference between the standard and the deviant trials overlapped with zero for the /ɗa/-/ɖa/ contrast (CI95% = [−0.55, 0.55]). In other words, when infants’ tongue-tip movement was restricted, the neural response to the /ba/-/ɗa/ discrimination was clearly observed, but we did not find evidence for /ɗa/-/ɖa/ discrimination in this same spatiotemporal cluster. These findings suggest that sensorimotor influences on the speech articulator (i.e., the tongue) modulated phonetic processing, but with a degree of specificity rather than broadly disrupting speech processing.
Our finding that neural responses to phonetic category discrimination are modulated by articulatory sensorimotor influences in prebabbling infants is illuminating, because at 3 mo of age, speech experience is limited and infants have not yet attuned to the native consonants (27, 15). Moreover, infants at this age are not producing the well-formed CV syllables characteristic of canonical babbling (22). Although the motor repertoire at this age is relatively immature, the tongue-tip inhibition in experiment 2 had a different impact on phonetic perception depending on whether the phones in the contrast require tongue-tip movement during adult production, indicating that sensorimotor input is simultaneously integrated with the auditory speech signal during phonetic processing. Although the articulator inhibition was present (by the parent holding a teething toy) throughout the testing of both phonetic distinctions and the trial conditions were randomly presented, an early ERP discrimination response was present to the /ba/-/ɗa/ contrast, whereas there was no ERP discrimination response to the /ɗa/-/ɖa/ contrast. Thus, sensorimotor articulatory dimensions relevant to the heard speech sounds interact with auditory speech perception in prebabbling infants.
The MMR was earlier and larger (290–490 ms) for the /ba/-/ɗa/ contrast in experiment 2 than for both phonetic contrasts in experiment 1 (450–710 ms). It is possible that the latency and magnitude of the ERP response were biased by a shift in the probabilistic distribution of the phonetic categories perceived. MMRs are elicited following a violation of auditory regularity and represent prediction errors based on a probabilistic model of the environment (28, 29). Thus, the magnitude of the difference between the standard and deviant stimuli, but also the statistical distributions of the trials themselves, can either accelerate (30) or amplify the MMN response in adults (28), and the MMR in infants (31). Therefore, one potential explanation for the across-experiment differences in the MMR is that the articulatory inhibition biased the statistical regularities in the input in experiment 2 relative to experiment 1. In our design, infants heard an equal distribution of the syllables (/ba/, /ɗa/, and /ɖa/), equal numbers of standard and deviant trials, and both directions of category change during the deviant changes (Materials and Methods). In experiment 1, because the probabilistic distribution across these three dimensions was constant, no given syllable had greater predictive value than the others. In experiment 2, however, if sensorimotor input informs auditory speech perception and the /ɗa/-/ɖa/ discrimination is impaired by the tongue-tip inhibition, then /ba/ becomes singularized as the only bilabial sound. /ba/ is then the less frequent category within this perceptual space, which means that it is also less predicted. This change in the probability distribution could account for the faster and larger MMR to the /ba/-/ɗa/ contrast observed in experiment 2.
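The probability argument above can be made concrete with a small worked example. If the tongue-tip inhibition collapses /ɗa/ and /ɖa/ into a single perceived category, the three equiprobable syllables map onto a two-category perceptual space in which /ba/ carries more surprisal, and hence a larger prediction error. This is a sketch with illustrative numbers following the design's equal presentation probabilities, not a model fitted to the data:

```python
import math

# Equal presentation probabilities of the three syllables, as in the design
p_syllable = {"ba": 1 / 3, "da_dental": 1 / 3, "Da_retroflex": 1 / 3}

# If the dental/retroflex distinction is not perceived, the two coronal
# syllables collapse into one perceptual category
p_percept = {"ba": 1 / 3, "coronal": 2 / 3}

# Surprisal in bits: less probable categories generate larger prediction errors
surprisal = {k: -math.log2(p) for k, p in p_percept.items()}
```

Here /ba/ carries about 1.58 bits of surprisal versus about 0.58 bits for the collapsed coronal category, consistent with /ba/ being the less predicted, and thus more strongly signaled, category in experiment 2.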
Although we cannot rule out the possibility that infants may be attempting to imitate the auditory stimuli as they listen during the experiment, the randomized presentation of the trials and the short interstimulus intervals likely prevent any imitative efforts. In experiment 2, the tongue-tip restriction might prevent overt imitation, in particular, for /ɗa/ and /ɖa/, but much less so for /ba/. Thus, if imitative attempts were to have taken place in experiment 2, it would only accentuate the difference between /ba/ and the other two syllables and further reinforce any auditory-motor loop. Another potential alternative is that infants were more attentive or alert when the teething toy was held in their mouth (experiment 2), compared to when they were passively listening to the sounds (experiment 1); however, such an account does not explain the dissociative effects observed between the two phonetic contrasts heard within experiment 2. Lastly, it is possible that movement-related artifacts generated by the infant’s interaction with the teething toy may not have been sufficiently accounted for. Yet, this account also fails to explain the significant difference between the two contrasts within experiment 2. A comparison of the standard trials across the two experiments showed that the standard trials in experiments 1 and 2 were not significantly different from each other (SI Appendix, Fig. S3), suggesting that the data were comparable despite the articulator inhibition in experiment 2.
Young infants process auditory speech in a highly sophisticated manner. Infants only a few months old, and even newborn preterm infants, show an MMR specific to phonetic category change, normalizing across voice-quality changes within and across genders (32, 24). Furthermore, infants as young as 2 mo of age detect phonetic invariance across coarticulation (33, 34), revealing stable phonetic representations despite acoustic variability. It has been proposed that the infant brain achieves invariant representations of speech sounds through a vectorization of the acoustic input along orthogonal dimensions corresponding to phonetic features, which are subsequently integrated into a phonetic representation (35). The finding from ref. 35 that phonetic features defined relative to articulatory dimensions are pertinent to describing both infants’ and adults’ speech perception spaces converges with the current study, which demonstrates that direct manipulation of sensorimotor information from the articulators can modulate perception of the auditory speech signal.
The current evidence suggests that, at the earliest stages of language acquisition, the sensorimotor system may be relevant for the perception of auditory speech signals. Critically, we do not find evidence that motor programs are necessarily referenced for speech perception, as predicted by the standard motor theory of speech perception (36). Rather, we find only that perception can be modulated by relevant sensorimotor input in a way that reflects an underlying multimodal speech representation that is shared across the sensory signals in the infant brain. In adults, articulatory suppression has only a modest effect on speech perception (37), and phonetic perception impairments are observed after strokes involving the left superior temporal region and the left parietal sulcus (38), consistent with brain imaging studies in healthy adults (39). However, recent work with direct cortical recordings using electrocorticography shows that during auditory speech perception, the superior and inferior regions of the ventral motor cortex are activated and follow an organization along acoustic features similar to that of the auditory cortex (40). Thus, even if the motor system is not required for speech perception, speech representations might be coded along similar dimensions in the auditory and the motor cortices. In adults, the predicted sensory consequences of a speech motor program are conveyed to the auditory cortex from the vSMC (ventral sensorimotor cortex); likewise, auditory and somatosensory error signals are conveyed to the vSMC so that corrective motor movements can take place (4). The pathways in the adult brain that communicate predicted auditory and sensorimotor patterns, and that alter the motor program based on feedback, could already be present in the preverbal infant brain. These same circuits could modulate the audio-motor circuits within the motor cortex, which in turn could mediate auditory perception.
Across sensory systems, substantial initial organization is established before postnatal experience through spontaneously generated patterns of neural activity in early and prenatal development (41). While the bulk of the available empirical evidence is based on animal models, these principles plausibly extend to the development of sensory systems in humans (42). Here, we propose that activity-dependent processes may critically shape the initial motor and sensorimotor foundations for speech production, and that the sensorimotor system calibrated in this way interacts with the early emerging speech network and with experience. Rhythmic stereotypies, such as the tongue protrusion and retraction observed until about 3 mo of age, have been suggested to be a form of self-generated rhythmic activation that induces activity-dependent development of the aerodigestive system (43). Thus, movement-induced sensory feedback in the earliest days of development could lead to the initial formation of sensorimotor maps of the speech articulators (e.g., lips and tongue). To the extent that the motor primitives for speaking are shared with other functions of the articulators, such as aerodigestion, the initial refinement of these motor primitives may be shared early in development. A critical link between this early sensorimotor mapping of the articulatory space for speech and the human speech and language network could be established starting prenatally, during the third trimester, when both sensorimotor organization and the emergence of the cortical language network are underway.
What evidence is there to suggest that these sensorimotor mappings are linguistically meaningful? The cortical language network canonical in the adult brain is present from the third trimester of gestation (20). Neuroimaging evidence implies substantial prenatal and early postnatal development and organization of the human language network, which forms the basis for functioning speech perception and production systems (19, 44). Interaction between the motor and sensorimotor systems, calibrated through the activity-dependent and spontaneous processes described above, and the phonetic system supported by the early emerging language network could abet acquisition of the correspondence between the articulatory and acoustic dimensions of speech. This presents a more efficient and general learning mechanism than the alternative, which is to learn a precise combination of articulatory actions for each phoneme individually.
Implications and Conclusions.
In summary, the current results, showing that sensorimotor speech information is integrated with auditory speech processing in prebabbling infants, provide insight into sensorimotor–auditory speech interactions prior to production or extensive perceptual experience. In turn, this reveals that the human language system is robustly multisensory not only following full acquisition but already early in development, during the acquisition period. These results have implications for congenital oral-motor dysmorphologies and disorders. Contrary to the view that interventions will be impactful after babbling begins (7 to 9 mo) but not before, if a bidirectional perception–oral-motor link is present already at a younger age, and speech representation in infants is already multisensory, then a disrupted motor system could impact speech acquisition from early on. It remains to be examined whether there are long-term influences from conditions that more fully limit oral-motor movements in young infants.
Materials and Methods
Participants.
Thirty-two English-learning infants (19 males, 13 females; mean age, 112 d; SD, 7.83 d) recruited from the greater Vancouver area, Canada, were included in the study. An additional 25 infants were tested but excluded due to excessive movement artifacts, technical issues, or insufficient data (SI Appendix). Of the 32 infants in the sample, 12 infants completed both the passive listening (experiment 1) and the oral-motor inhibition during passive listening (experiment 2) experiments; the testing order was counterbalanced across infants such that six infants first completed experiment 1. Ten additional infants completed experiment 1 and an additional 10 infants completed experiment 2, such that 22 infants were included in each experiment (SI Appendix, Sample size estimation). The infants’ primary caregivers provided informed consent prior to the experiment. The research was approved by the University of British Columbia Behavioral Research Ethics Board (Certificate H95-80023).
Stimuli.
Three sound tokens were selected from a synthesized 16-step continuum (26): a voiced bilabial stop /ba/ (step 3), a voiced dental stop /ɗa/ (step 9), and a voiced retroflex stop /ɖa/ (step 15). The stimuli were equal in duration (275 ms) and precisely matched for low-level acoustic features (SI Appendix, Fig. S1).
Experimental Paradigm.
We used an auditory mismatch design similar to previous infant speech perception studies (13, 34). Each trial consisted of four consecutive syllables; the first three syllables were repetitions of the same syllable, and the fourth syllable was either a repetition of the preceding three (standard trial) or a different syllable that crossed the phonetic category boundary (deviant trial). The syllable-to-syllable stimulus-onset asynchrony was 600 ms, and the intertrial interval was 4 s. Infants were exposed to a maximum of 120 trials (60 standard and 60 deviant) per experiment. Standard trials were repetitions of syllables from a single phonetic category (/ba/, /ɗa/, or /ɖa/), and deviant trials consisted of a phonetic category change on the fourth syllable in both directions for each phonetic contrast (/ba/ to /ɗa/, /ɗa/ to /ba/, /ɗa/ to /ɖa/, and /ɖa/ to /ɗa/). The number of standard /ɗa/ trials was doubled to balance the cumulative number of times each of the three syllables was heard across the experiment while maintaining an equal number of standard and deviant trials, since even 3- to 4-mo-old infants are sensitive to the probabilistic regularities of global and local changes (31). Trial presentation was randomized across all possibilities, ensuring that each infant was exposed to all seven trial types in a randomized and balanced manner.
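As a concrete illustration, the trial structure above can be sketched as follows. The counts of 15 per type are an assumption derived from the stated 60/60 split with /ɗa/ standards doubled; syllables are written here in ASCII, with "da" standing for dental /ɗa/ and "DA" for retroflex /ɖa/:

```python
import random

# Timing parameters from the text (in ms).
SOA_MS, ITI_MS = 600, 4000

def build_trials(n_per_type=15, seed=0):
    """Return a randomized list of 4-syllable trials (sketch, not the actual script)."""
    trials = []
    # Standard trials: four repetitions of one syllable; /da/ standards doubled.
    for syl, reps in [("ba", 1), ("da", 2), ("DA", 1)]:
        trials += [[syl] * 4] * (n_per_type * reps)
    # Deviant trials: phonetic category change on the fourth syllable, both directions.
    for std, dev in [("ba", "da"), ("da", "ba"), ("da", "DA"), ("DA", "da")]:
        trials += [[std, std, std, dev]] * n_per_type
    random.Random(seed).shuffle(trials)
    return trials

trials = build_trials()
standards = [t for t in trials if len(set(t)) == 1]
deviants = [t for t in trials if len(set(t)) == 2]
print(len(trials), len(standards), len(deviants))  # 120 60 60
```

With these assumed counts, each syllable sequence begins with three identical tokens, so standard and deviant trials are acoustically identical until the fourth-syllable onset.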
Procedure.
The infant was seated on the caregiver’s lap, wearing an EEG cap and facing a computer monitor, in an acoustically shielded room. The screen was placed ∼60 cm from the seated infant and displayed a dynamic visual animation for the infant to watch. Speech sounds were presented at 70 dB from an audio speaker (Fostex 6301NX) placed behind the screen. The experimenter monitored the infant from outside the acoustically shielded room through a camera mounted inside the room and presented the stimuli using a custom-written program built on the Psychophysics Toolbox (45) in Matlab (2016b). If the infant began to show discomfort or the caregiver signaled to stop, the experimenter terminated the session.
EEG Acquisition.
EEG data were collected at a sampling rate of 1,000 Hz with a 64-electrode geodesic sensor net (EGI; N400 amplifier) referenced to the vertex (Cz). The net was placed on the infant’s head relative to the anatomical markers while the infant sat on the caregiver’s lap. The maximal impedance was kept under 40 kΩ.
EEG Preprocessing.
EEG preprocessing was conducted using functions from EEGLAB (46). First, the continuous EEG data were bandpass filtered from 0.5 to 20 Hz. The filtered data were segmented into 4-s epochs spanning −0.2 to 3.8 s relative to the onset of the first syllable; each epoch thus included a 200-ms prestimulus period before the first syllable and ended 2 s after the onset of the fourth syllable. Following artifact rejection (SI Appendix), the data were re-referenced to the mean voltage across electrodes. Trials were then collapsed by Phonetic Contrast (/ba/-/ɗa/ and /ɗa/-/ɖa/) and Condition (Standard and Deviant) (SI Appendix). To minimize potential effects of slow drifts on the fourth-syllable analyses, we applied a baseline correction using as baseline the mean voltage of the entire window preceding the fourth-syllable onset (−0.2 to 1.8 s from trial onset); during this window, the stimuli are identical between standard and deviant trials.
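A minimal sketch of this pipeline, for illustration only (the actual analysis used EEGLAB functions; the fourth-order Butterworth filter and the NumPy/SciPy implementation are assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000  # sampling rate in Hz, as in the text

def preprocess(raw, trial_onsets):
    """Filter continuous EEG, cut epochs from -0.2 to 3.8 s, re-reference, baseline.

    raw: (n_channels, n_samples) continuous data
    trial_onsets: sample indices of the first-syllable onsets
    """
    # 0.5-20 Hz zero-phase bandpass (filter order is an assumption).
    b, a = butter(4, [0.5, 20], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, raw, axis=-1)
    pre, post = int(0.2 * FS), int(3.8 * FS)
    epochs = np.stack([filtered[:, t - pre:t + post] for t in trial_onsets])
    # Re-reference to the mean voltage across electrodes.
    epochs -= epochs.mean(axis=1, keepdims=True)
    # Baseline: mean of the entire window before the fourth-syllable onset
    # (-0.2 to 1.8 s from trial onset, i.e., the first 2 s of each epoch).
    baseline = epochs[:, :, : int(2.0 * FS)].mean(axis=-1, keepdims=True)
    return epochs - baseline
```

Using the whole prestimulus-plus-first-three-syllables window as baseline, rather than only the 200-ms prestimulus period, is what guards the fourth-syllable comparison against slow drifts.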
EEG Data Analysis.
Data-driven analyses.
ERP differences between standard and deviant trials were examined for each experiment using a cluster-based nonparametric statistic. The nonparametric cluster-based permutation test combines clustering and randomization procedures to identify the spatiotemporal clusters (electrodes and time points) showing statistically distinct responses (47). We first examined whether there was a main effect of Condition (Standard vs. Deviant) by collapsing across the two phonetic contrasts; a cluster-based permutation paired t test was conducted between standard and deviant trials with a cluster-alpha threshold of 0.1, a minimal cluster size of two electrodes, and 5,000 permutations over the 800-ms period following the fourth syllable (1.8 to 2.6 s from trial onset) (48). We also tested for a Condition-by-Phonetic-Contrast interaction; to conduct this analysis in FieldTrip, we calculated the difference between standard and deviant trials and compared this difference in a paired t test between the /ba/-/ɗa/ and /ɗa/-/ɖa/ contrasts. If the interaction test was significant but the main-effect test was not, follow-up tests were conducted on the two phonetic contrasts separately. We averaged the voltage values from the sensors and time window selected by this procedure, per subject and per experimental condition, for further analyses.
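For illustration, a simplified single-channel version of this permutation scheme might look as follows. This is a sketch, not the actual FieldTrip analysis: here clusters are formed over time only (the real analysis clusters over electrodes and time), and the sign-flip randomization and cluster-mass statistic follow the general procedure of Maris and Oostenveld (47):

```python
import numpy as np
from scipy import stats

def cluster_perm_test(x, y, n_perm=5000, alpha=0.1, seed=0):
    """Paired cluster-based permutation test over time for one channel.

    x, y: (n_subjects, n_times) per-subject condition averages.
    Returns the observed maximal cluster mass and its permutation p value.
    """
    rng = np.random.default_rng(seed)
    diff = x - y

    def max_cluster_mass(d):
        # Pointwise paired t values, thresholded at the cluster-alpha level.
        t = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(len(d)))
        thresh = stats.t.ppf(1 - alpha / 2, df=len(d) - 1)
        mass, best = 0.0, 0.0
        for tv in t:
            # Accumulate |t| over contiguous suprathreshold runs.
            mass = mass + abs(tv) if abs(tv) > thresh else 0.0
            best = max(best, mass)
        return best

    observed = max_cluster_mass(diff)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Randomly swap condition labels within each subject.
        flips = rng.choice([-1, 1], size=(len(diff), 1))
        null[i] = max_cluster_mass(diff * flips)
    return observed, (null >= observed).mean()
```

Note that only the cluster-level p value is interpretable under this procedure; individual time points within a significant cluster carry no separate significance (47).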
Control analyses.
To ensure that the response was specific to the last syllable and not due to systematic noise, we also averaged the voltage over the same cluster of sensors and the same time window following each of the first three syllables (i.e., Repetitions) and compared these against the responses following the fourth syllable. If the distinct ERP responses between standard and deviant trials reflected a response to phonetic category change, then a difference is expected following the fourth syllable but not following the Repetitions.
ANOVA.
For each experiment, we conducted a three-way repeated-measures ANOVA with Condition (Standard and Deviant), Syllable Position (Repetitions [1 to 3] and Fourth), and Phonetic Contrast (/ba/-/ɗa/ and /ɗa/-/ɖa/) as factors. Because none of the factors had more than two levels, Mauchly’s test of sphericity was not required.
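This is also why sphericity is not a concern: when every within-subject factor has only two levels, each ANOVA effect reduces to a one-sample t test on a per-subject contrast, with F = t². A hypothetical illustration for a Condition-by-Phonetic-Contrast interaction (the data below are simulated, not the study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 22  # infants per experiment
# Hypothetical per-subject mean amplitudes, indexed [subject, Condition, Contrast]
# with Condition 0/1 = Standard/Deviant and Contrast 0/1 = /ba/-/da/ and /da/-/Da/.
data = rng.normal(size=(n, 2, 2))
data[:, 1, 0] += 0.8  # inject a deviance effect for the first contrast only

# The interaction contrast: difference of the two condition differences.
interaction = (data[:, 1, 0] - data[:, 0, 0]) - (data[:, 1, 1] - data[:, 0, 1])
t, p = stats.ttest_1samp(interaction, 0.0)
print(f"F = {t**2:.2f}, p = {p:.4f}")
```

The equivalent F statistic for the interaction term of the repeated-measures ANOVA is simply the square of this paired-contrast t value.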
Bayesian regression analysis.
We used Bayesian multilevel regression models to quantify the strength of evidence, using Bayesian confidence intervals, for the main effect of interest (Condition by Phonetic Contrast). Models were fit with the package brms (49) v2.12 in the R computing environment. Standardized (z-scored) data were fit to a varying-intercept model with Condition (Standard and Deviant), Phonetic Contrast (/ba/-/ɗa/ and /ɗa/-/ɖa/), and their interaction specified as fixed effects; individual participants were modeled as random intercepts. Weakly informative priors were selected for each parameter. The mean and the highest posterior density interval (HPDI) of the fixed effects and the interaction effect were estimated. Four independent chains, each with 1,000 warmup samples and 2,000 iterations, were run, resulting in a total of 4,000 draws from the posterior. Good convergence was confirmed by Gelman–Rubin statistics < 1.01. To examine whether βCondition is modulated by the levels of Phonetic Contrast, a linear model of the slope, γ = βCondition + βCondition×Phonetic Contrast, is specified and reported (SI Appendix).
Supplementary Material
Acknowledgments
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03967) and the Canada Foundation for Innovation John R. Evans Leaders Fund (33096) awarded to J.F.W., from Global Research Alliance in Language–Pontificia Universidad Católica de Chile awarded to M.P., and European Research Council under the European Union’s Horizon 2020 Research and Innovation Program (Grant Agreement 695710) awarded to G.D.-L.
Footnotes
The authors declare no competing interest.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2025043118/-/DCSupplemental.
Data Availability
Anonymized EEG data have been deposited in the Open Science Framework (https://osf.io/my496/).
References
- 1.Scholes C., Skipper J. I., Johnston A., The interrelationship between the face and vocal tract configuration during audiovisual speech. Proc. Natl. Acad. Sci. U.S.A. 117, 32791–32798 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Yehia H. C., Kuratate T., Vatikiotis-Bateson E., Linking facial animation, head motion and speech acoustics. J. Phonetics 30, 555–568 (2002). [Google Scholar]
- 3.Keough M., Derrick D., Gick B., Cross-modal effects in speech perception. Annu. Rev. Linguist. 5, 49–66 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Guenther F. H., Neural Control of Speech (MIT Press, 2016). [Google Scholar]
- 5.Hickok G., Houde J., Rong F., Sensorimotor integration in speech processing: Computational basis and neural organization. Neuron 69, 407–422 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ito T., Tiede M., Ostry D. J., Somatosensory function in speech perception. Proc. Natl. Acad. Sci. U.S.A. 106, 1245–1248 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Masapollo M., Guenther F. H., Engaging the articulators enhances perception of concordant visible speech movements. J. Speech Lang. Hear. Res. 62, 3679–3688 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Poeppel D., Assaneo M. F., Speech rhythms and their neural foundations. Nat. Rev. Neurosci. 21, 322–334 (2020). [DOI] [PubMed] [Google Scholar]
- 9.Kuhl P. K., Meltzoff A. N., The bimodal perception of speech in infancy. Science 218, 1138–1141 (1982). [DOI] [PubMed] [Google Scholar]
- 10.Patterson M. L., Werker J. F., Infants’ ability to match dynamic phonetic and gender information in the face and voice. J. Exp. Child Psychol. 81, 93–115 (2002). [DOI] [PubMed] [Google Scholar]
- 11.Pons F., Lewkowicz D. J., Soto-Faraco S., Sebastián-Gallés N., Narrowing of intersensory speech perception in infancy. Proc. Natl. Acad. Sci. U.S.A. 106, 10598–10602 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Hollich G., Newman R. S., Jusczyk P. W., Infants’ use of synchronized visual information to separate streams of speech. Child Dev. 76, 598–613 (2005). [DOI] [PubMed] [Google Scholar]
- 13.Bristow D., et al., Hearing faces: How the infant brain matches the face it sees with the speech it hears. J. Cogn. Neurosci. 21, 905–921 (2009). [DOI] [PubMed] [Google Scholar]
- 14.Yeung H. H., Werker J. F., Lip movements affect infants’ audiovisual speech perception. Psychol. Sci. 24, 603–612 (2013). [DOI] [PubMed] [Google Scholar]
- 15.Werker J. F., Tees R. C., Phonemic and phonetic factors in adult cross-language speech perception. J. Acoust. Soc. Am. 75, 1866–1878 (1984). [DOI] [PubMed] [Google Scholar]
- 16.Bruderer A. G., Danielson D. K., Kandhadai P., Werker J. F., Sensorimotor influences on speech perception in infancy. Proc. Natl. Acad. Sci. U.S.A. 112, 13531–13536 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Choi D., Bruderer A. G., Werker J. F., Sensorimotor influences on speech perception in pre-babbling infants: Replication and extension of Bruderer et al. (2015). Psychon. Bull. Rev. 26, 1388–1399 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Blumberg M. S., Marques H. G., Iida F., Twitching in sensorimotor development from sleeping rats to robots. Curr. Biol. 23, R532–R537 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Dehaene-Lambertz G., The human infant brain: A neural architecture able to learn language. Psychon. Bull. Rev. 24, 48–55 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Dubois J., et al., Exploring the early organization and maturation of linguistic pathways in the human infant brain. Cereb. Cortex 26, 2283–2298 (2016). [DOI] [PubMed] [Google Scholar]
- 21.Leroy F., et al., Early maturation of the linguistic dorsal pathway in human infants. J. Neurosci. 31, 1500–1506 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Oller D. K., The Emergence of the Speech Capacity (Lawrence Erlbaum Associates Publishers, 2000). [Google Scholar]
- 23.Dehaene-Lambertz G., Dehaene S., Speed and cerebral correlates of syllable discrimination in infants. Nature 370, 292–295 (1994). [DOI] [PubMed] [Google Scholar]
- 24.Mahmoudzadeh M., Wallois F., Kongolo G., Goudjil S., Dehaene-Lambertz G., Functional maps at the onset of auditory inputs in very early preterm human neonates. Cereb. Cortex 27, 2500–2512 (2017). [DOI] [PubMed] [Google Scholar]
- 25.Peña M., Werker J. F., Dehaene-Lambertz G., Earlier speech exposure does not accelerate speech acquisition. J. Neurosci. 32, 11159–11163 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Werker J. F., Lalonde C. E., Cross-language speech perception: Initial capabilities and developmental change. Dev. Psychol. 24, 672–683 (1988). [Google Scholar]
- 27.Kuhl P. K., Williams K. A., Lacerda F., Stevens K. N., Lindblom B., Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255, 606–608 (1992). [DOI] [PubMed] [Google Scholar]
- 28.Garrido M. I., Sahani M., Dolan R. J., Outlier responses reflect sensitivity to statistical structure in the human brain. PLoS Comput. Biol. 9, e1002999 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Wacongne C., et al., Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl. Acad. Sci. U.S.A. 108, 20754–20759 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Tiitinen H., May P., Reinikainen K., Attentive novelty detection in humans is governed by pre-attentive sensory memory. Nature 372, 90–92 (1994). [DOI] [PubMed] [Google Scholar]
- 31.Basirat A., Dehaene S., Dehaene-Lambertz G., A hierarchy of cortical responses to sequence violations in three-month-old infants. Cognition 132, 137–150 (2014). [DOI] [PubMed] [Google Scholar]
- 32.Dehaene-Lambertz G., Peña M., Electrophysiological evidence for automatic phonetic processing in neonates. Neuroreport 12, 3155–3158 (2001). [DOI] [PubMed] [Google Scholar]
- 33.Bertoncini J., Bijeljac-Babic R., Jusczyk P. W., Kennedy L. J., Mehler J., An investigation of young infants’ perceptual representations of speech sounds. J. Exp. Psychol. Gen. 117, 21–33 (1988). [DOI] [PubMed] [Google Scholar]
- 34.Mersad K., Dehaene-Lambertz G., Electrophysiological evidence of phonetic normalization across coarticulation in infants. Dev. Sci. 19, 710–722 (2016). [DOI] [PubMed] [Google Scholar]
- 35.Gennari G., Marti S., Palu M., Fló A., Dehaene-Lambertz G., Orthogonal neural codes for phonetic features in the infant brain. bioRxiv [Preprint] (2021). 10.1101/2021.03.28.437156. Accessed 28 March 2021. [DOI] [PMC free article] [PubMed]
- 36.Liberman A. M., Cooper F. S., Shankweiler D. P., Studdert-Kennedy M., Perception of the speech code. Psychol. Rev. 74, 431–461 (1967). [DOI] [PubMed] [Google Scholar]
- 37.Stokes R. C., Venezia J. H., Hickok G., The motor system’s [modest] contribution to speech perception. Psychon. Bull. Rev. 26, 1354–1366 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Kim K., et al., Neural processing critical for distinguishing between speech sounds. Brain Lang. 197, 104677 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Dehaene-Lambertz G., et al., Neural correlates of switching from auditory to speech perception. Neuroimage 24, 21–33 (2005). [DOI] [PubMed] [Google Scholar]
- 40.Cheung C., Hamiton L. S., Johnson K., Chang E. F., The auditory representation of speech sounds in human motor cortex. eLife 5, e12577 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Khazipov R., Luhmann H. J., Early patterns of electrical activity in the developing cerebral cortex of humans and rodents. Trends Neurosci. 29, 414–418 (2006). [DOI] [PubMed] [Google Scholar]
- 42.Molnár Z., Luhmann H. J., Kanold P. O., Transient cortical circuits match spontaneous and sensory-driven activity during development. Science 370, eabb2153 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Keven N., Akins K. A., Neonatal imitation in context: Sensorimotor development in the perinatal period. Behav. Brain Sci. 40, e381 (2017). [DOI] [PubMed] [Google Scholar]
- 44.Skeide M. A., Friederici A. D., The ontogeny of the cortical language network. Nat. Rev. Neurosci. 17, 323–332 (2016). [DOI] [PubMed] [Google Scholar]
- 45.Brainard D. H., The Psychophysics Toolbox. Spat. Vis. 10, 433–436 (1997). [PubMed] [Google Scholar]
- 46.Delorme A., Makeig S., EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21 (2004). [DOI] [PubMed] [Google Scholar]
- 47.Maris E., Oostenveld R., Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods 164, 177–190 (2007). [DOI] [PubMed] [Google Scholar]
- 48.Oostenveld R., Fries P., Maris E., Schoffelen J. M., FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput. Intell. Neurosci. 2011, 156869 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Bürkner P.-C., brms: An R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80, 10.18637/jss.v080.i01 (2017). [DOI] [Google Scholar]
- 50.Kier W. M., Smith K. K., Tongues, tentacles and trunks: The biomechanics of movement in muscular-hydrostats. Zool. J. Linn. Soc. 83, 307–324 (1985). [Google Scholar]