Abstract
The ability to use visual speech cues does not fully develop until late adolescence. The cognitive and neural processes underlying this slow maturation are not yet understood. We examined electrophysiological responses of younger (8–9 years of age) and older (11–12 years of age) children as well as adults elicited by visually-perceived articulations in an audiovisual word matching task and related them to the amount of benefit gained during a speech-in-noise (SIN) perception task when seeing the talker's face. On each trial, participants first heard a word and, after a short pause, saw a speaker silently articulate a word. In half of the trials, the articulated word matched the auditory word (congruent trials), while in the other half, it did not (incongruent trials). In all three groups, incongruent articulations elicited the N400 component while congruent articulations elicited the late positive complex (LPC). Groups did not differ in the mean amplitude of N400. The mean amplitude of LPC was larger in younger children compared to older children and adults. Importantly, the relationship between ERP measures and SIN performance varied by group. In 8–9-year-olds, neither component was predictive of SIN gain. The LPC amplitude predicted the SIN gain in older children but not in adults. Conversely, the N400 amplitude predicted the SIN gain in adults. We argue that while all groups were able to detect correspondences between auditory and visual word onsets at the phonemic/syllabic level, only adults could use this information for lexical access.
Keywords: multisensory development, audiovisual word perception, N400, Late Positive Complex, speech-in-noise perception, lexical access, audiovisual matching
1. Introduction
Seeing the oro-facial movements of a talker significantly facilitates auditory speech perception. This facilitation may occur at different levels of auditory processing (Peelle & Sommers, 2015; Stevenson, Wallace, & Altieri, 2014). At the basic level, visible speech enhances the salience of the auditory signal by providing information about its onset and the timing of the most informative parts corresponding to mouth openings (e.g., van Wassenhove, Grant, & Poeppel, 2005). It also allows listeners to focus their attention in time to coincide with speech onset (e.g., ten Oever, Schroeder, Poeppel, van Atteveldt, & Zion-Golumbic, 2014).
Visible speech may also influence auditory speech perception at the phonemic level. One of the most studied audiovisual phenomena – the McGurk Illusion (Alsius, Paré, & Munhall, 2018; McGurk & MacDonald, 1976) – is typically elicited when a bilabial sound (such as in "pa") is dubbed onto a velar sound (such as in "ga"), resulting in the perception of an alveolar or interdental sound (such as in "ta" or "tha"). Furthermore, research in cognitive neuroscience suggests that listeners form phoneme/articulation correspondences, which may help them form predictions about incoming speech sounds (van Wassenhove et al., 2005). For example, areas of the posterior superior temporal sulcus (pSTS), the neural region that is strongly engaged by speech and multisensory stimuli, appear to be particularly sensitive to lip movements compared to other types of biological motion (Zhu & Beauchamp, 2017). Importantly, these mouth-preferring areas are also responsive to human voices, suggesting that they may contain neurons that encode both vocal sounds and the mouth movements that produce them. In addition, auditory-only speech produced by talkers whom listeners know by face is significantly easier to understand than speech produced by talkers whom listeners know by voice only. Auditory speech spoken by familiar talkers also activates face-movement sensitive areas within the pSTS (Riedel, Ragert, Schelinski, Kiebel, & Von Kriegstein, 2015; Schelinski, Riedel, & Von Kriegstein, 2014). Furthermore, listening to speech has been shown to activate the listener's own motor representations for speech production (e.g., Skipper, Nusbaum, & Small, 2005; Skipper, van Wassenhove, Nusbaum, & Small, 2007; Wilson, Saygin, Sereno, & Iacoboni, 2004), even when such speech is clear and easy to understand (Panouilleres, Boyles, Chesters, Watkins, & Mottonen, 2018).
Finally, seeing the speaker’s facial movements can also influence lexical processing. For example, Brancazio (2004) reported that the McGurk-like perception of audiovisually mismatched stimuli is significantly more likely if the illusory perception forms a real word compared to when it results in a non-word. Barutchu and colleagues (Barutchu, Crewther, Kiely, Murphy, & Crewther, 2008) also used McGurk stimuli and showed that the likelihood of the audiovisual merger was higher when the audiovisual mismatch occurred at the word onset rather than at the word offset. This difference was absent for nonsense words. The authors suggested that word-offset audiovisual mismatches are less effective at eliciting the illusion because the audiovisually congruent word onsets activate specific lexical representations and thus reduce the chance of the illusory perception. Several studies focused specifically on whether visible speech can activate lexical representations. Buchwald and colleagues (Buchwald, Winters, & Pisoni, 2009) used a cross-modal repetition priming paradigm and showed that seeing a silent articulation of a word facilitates the subsequent auditory perception of the same word when its auditory quality is degraded. This result remained robust even when the auditory and visual components of words were taken from different speakers. Finally, Fort and colleagues (Fort, Spinelli, Savariaux, & Kandel, 2010) used a phoneme monitoring task and reported that in the audiovisual condition consonant phonemes were recognized more accurately if they occurred in words rather than in non-words. In a follow-up study (Fort et al., 2013), these researchers also demonstrated that seeing the articulatory movements corresponding to just the first two phonemes of a word as a prime significantly improved participants’ performance on a lexical decision task. Taken together, these findings suggest that visible speech, either alone or in combination with auditory speech, can influence lexical access and word recognition.
The vast majority of research on audiovisual speech perception has been conducted in either healthy young adults or infants. As a result, how audiovisual processing of phonemic and lexical information develops in older children is much less understood. Behavioral and neuroimaging studies suggest that multisensory associations between visually-perceived articulatory movements and specific phoneme(s) continue to mature into adolescence. For example, compared to adults, school-age children show reduced susceptibility to the McGurk illusion (e.g., McGurk & MacDonald, 1976; Tremblay et al., 2007) and benefit less from seeing a talker when listening to speech-in-noise (SIN) (e.g., Ross et al., 2011). Several recent studies by Jerger and colleagues have demonstrated that children's phonological audiovisual skills change from the pre-school to mid-teen years; however, the trajectory of this change may be task- and stimulus-dependent, with some tasks exhibiting an inverted-U trajectory and others a relatively linear improvement. Additionally, and not surprisingly, easy-to-observe speech sounds (such as "b") begin to exert influence over auditory perception in childhood prior to difficult-to-observe sounds (such as "g") (Jerger, Damian, McAlpine, & Abdi, 2018; Jerger, Damian, Spence, Tye-Murray, & Abdi, 2009; Jerger, Damian, Tye-Murray, & Abdi, 2014, in press).
Even fewer studies have focused on audiovisual lexical processing in school-age children. Fort and colleagues (Fort, Spinelli, Savariaux, & Kandel, 2012) used a vowel monitoring task with words and non-words that were presented either auditorily or audiovisually. They tested children ranging in age from approximately 5 to 10 years as well as adults. Although children as young as 6 years of age benefited from seeing the speaker's face, only adults showed a lexicality effect, with better performance on words than on non-words in the audiovisual condition. The authors concluded that visible speech does not activate lexical representations in children of the tested ages but influences only phonemic processing. The study by Havy and colleagues (Havy, Foroud, Fais, & Werker, 2017) tested whether 18-month-old toddlers and adults can first learn a new word/object pairing during either an auditory-only or visual-only (i.e., silent video of articulation) exposure and whether they can then recognize such words when they are presented in the opposite modality (e.g., exposed to an auditory-only word but tested on a visual-only word, or the reverse). Toddlers were able to learn word-object pairings in the auditory modality and then recognized the auditorily-learned word in the visual modality. However, only adults were able to learn new words from the visual modality alone. Although the study did not test older children, the results suggest that either the ability to access lexical representations from the visual modality or the nature of lexical representations themselves changes significantly between toddlerhood and adulthood.
The functional immaturity of audiovisual speech perception in pre-teen and teenaged children is likely rooted in the prolonged development of higher-order cortical regions involved in multisensory processing. In a longitudinal study of developmental cortical changes in 4–21-year-old children and young adults, Gogtay and colleagues observed two major changes in the grey matter volume of older children – namely, a pre-adolescent increase followed by a post-adolescent thinning. The timing of these changes varied by brain region, with the posterior portions of the superior and middle temporal gyri (the areas closest to the multisensory pSTS among all evaluated in the study) being among the last to exhibit post-puberty grey matter thinning (Gogtay et al., 2004). In fact, the grey matter volume in the temporal lobe continues to increase until approximately 16 years of age (Giedd et al., 1999). It is, therefore, not surprising that those aspects of multisensory language processing that require fine-grained coordination across modalities and are based on learned associations continue to develop during the school years.
The goal of this study was to examine changes in electrophysiological responses elicited by visible articulations of words in 8–9-year-olds, 11–12-year-olds, and young adults. The temporal resolution of the ERP technique allows one to focus on neural signatures of phonological and lexical processing and to evaluate the relationship between these stages of processing and the ability to benefit from seeing the talker's face while listening to speech embedded in noise. The first few years of formal schooling are a time of tremendous vocabulary growth, from approximately 7,000–10,000 words in first graders to 39,000–46,000 words in fifth graders (Anglin, 1989). This growth is thought to play a significant role in the development of auditory phonological awareness, which in turn may support the formation of correspondences between auditory phonemes and articulatory movements and the development of connections between visible oro-facial movements and lexical representations. We selected groups of children that represent two different time points in this developmental process.
In earlier studies in our laboratory, we examined neural processes engaged by matching auditory and visual components of words. More specifically, participants first heard an auditory word and then were shown a silent video of a speaker articulating either the same or a different word, with the difference, when present, always occurring at the word onset. In both adults and children, visually-perceived articulations consistently elicited a larger phonological N400 event-related potential (ERP) component when they did not match the earlier heard words and a larger late positive complex (LPC) when they did. Importantly, only the N400 amplitude elicited in this paradigm predicted the degree to which participants benefited from seeing the talker's face while listening to SIN in a separate task. As we describe below, the N400 and LPC ERP components provide a useful tool for tracking the development of children's sensitivity to visual speech at the phonological and lexical levels, respectively, and for examining the relationship between neural maturation and behavioral performance.
Although the N400 ERP component is most commonly used in research on semantic processing, a number of studies have shown that it is also sensitive to violations of phonological expectancy. Most work on this “phonological N400” comes from paradigms in which participants are presented with pairs of stimuli (words or non-words) that either do or do not rhyme, with non-rhyming items eliciting a larger N400 (e.g., Praamstra, Meyer, & Levelt, 1994; Praamstra & Stegeman, 1993; Rugg, 1984a). The presence of N400 to non-rhyming non-words suggests that its elicitation does not depend on semantic processing of stimuli, with a caveat that such non-words should contain only phonotactically legal sound/letter sequences (Bentin, Mouchetant-Rostaing, Giard, Echallier, & Pernier, 1999). Indeed, a study by Coch and colleagues (Coch, George, & Berger, 2008) showed that even individual letters whose names do not rhyme elicit N400. By comparing ERP responses to either rhyming or alliterating (i.e., sharing onsets) words, Praamstra and colleagues (Praamstra et al., 1994) showed that the latency of the phonological N400 is modulated by the position of the phonological mismatch within a word. The phonological N400 has been elicited by both auditory and visual stimuli and is not affected by the stimuli’s physical properties, such as a change in the speaker’s voice (Praamstra & Stegeman, 1993) or the letters’ font size (Coch et al., 2008), strongly suggesting that it reflects phonological rather than low-level sensory processing. The amplitude of this component tends to be larger at midline and right hemisphere sites over parietal and temporal areas of the scalp (e.g., Grossi, Coch, Coffey-Corina, Holcomb, & Neville, 2001; MacSweeney, Goswami, & Neville, 2013). This scalp distribution overlaps with the semantic N400, which is typically a centro-parietal component, with a slightly larger amplitude over the right hemisphere, at least to visually-presented words (Kutas & Federmeier, 2011).
It is important to note that earlier ERP studies employing a variety of stimuli and experimental designs have described several negative ERP components elicited by phonological processing. These include the Phonological Mapping (or Matching) Negativity (PMN), the phonological N400, and the rhyming effect (RE). The PMN component typically peaks earlier (approximately 250 ms post-stimulus onset) than the phonological N400 and RE and has a slightly more anterior scalp distribution (e.g., D'Arcy, Connolly, Service, Hawco, & Houlihan, 2004). However, the phonological N400 and RE have very similar characteristics (e.g., Rugg, 1984b) and may form a family of negative ERP components sensitive to comparisons of phonological forms. Because the task in our study does not involve a rhyming judgment, we use the term "phonological N400" throughout the manuscript.
The LPC ERP component is typically elicited by repeated stimuli and has been used extensively in studies of the neural correlates of recognition memory (for reviews, see Rugg & Allan, 2000; Rugg & Curran, 2007). However, an overt judgment about whether or not an item occurred before is not necessary for its elicitation. The larger LPC to a repeated stimulus as compared to a novel stimulus is often called the old/new effect. Its latency may vary significantly from study to study but typically exceeds 400 ms post-stimulus onset. A change in modality between the original presentation and the repetition reduces but does not eliminate LPC (Rugg & Nieto-Vegas, 1999). Importantly, a study by Rugg and Nagy (1987) compared the mean amplitude of LPC elicited by non-words that were composed of either legal or illegal sequences of letters. They reported that when the task demands are carefully controlled, only legal non-words elicit LPC upon their repetition, suggesting that, at least when linguistic material is used as stimuli, this component reflects processes associated with access to lexical representations, even when, as in the case of legal non-words, this access is only partial.
In this study, we have capitalized on the properties of the N400 and LPC components described above in order to understand how neural processes associated with visual speech perception develop in school-aged children at the phonological and lexical levels and how they relate to behavioral measures of audiovisual speech perception, such as listening to SIN. On each trial, participants first heard an auditory word and then saw a silent video of a speaker pronouncing either the same or a different word. Based on our earlier work, we expected that visual articulations that did not match preceding auditory words would elicit the phonological N400 component while visual articulations that matched preceding auditory words would elicit the LPC component. In a previous study (reference), we showed that in children and adults, only the amplitude of the phonological N400 component, and not of the LPC component, was predictive of the degree to which individuals benefited from visual speech cues. However, in that earlier study we tested children with a broad age range (from about 7.5 to 13.5 years of age), which might have obscured age-specific patterns of relationship between neural and behavioral measures. Additionally, the N400 amplitude in children was predictive of the SIN gain in the audiovisual condition only in combination with measures of sensory visual encoding. The question of at what age the N400 amplitude can be predictive of the SIN gain by itself has, therefore, remained unanswered. We hypothesized that if neural mechanisms underlying visual speech perception change significantly from early to mid-childhood, we should see a change in the relationship between ERP measures of visual speech perception and the audiovisual SIN gain. In the auditory domain, children need to hear a larger portion of the auditory word compared to adults to correctly identify it during gating tasks (Metsala, 1997). We thought that a similar relationship may also exist in the visual modality, with children needing to see a larger portion of the visible word articulation to accurately match it onto a word. If so, the neural processes encoded by the later LPC component should play a bigger role in children’s behavioral improvement on the audiovisual SIN task compared to the processes encoded by the earlier N400. At the same time, given significant phonological development during early school years, we also expected that by 11–12 years of age, phonemic processing of visual speech, as indexed by N400, will at least begin to exert influence over lexical processing during a SIN task.
2. Method
2.1. Participants
Fourteen 8–9-year old children (8 females, mean age 9;0 (years; months), range 8;0–9;11), fourteen 11–12-year old children (5 females, mean age 11;9, range 11;1–12;9), and fourteen adults (8 female, mean age 23, range 18–37) participated in the study. All participants gave their written assent/consent to participate in the experiment. Additionally, at least one parent of each child provided written consent to enroll their child in the study. The experimental protocol was approved by the Institutional Review Board of Purdue University, and all study procedures conformed to The Code of Ethics of the World Medical Association (“WMA Declaration of Helsinki - Ethical Principles for Medical Research Involving Human Subjects,” 1964).
Participants passed a hearing screening at a level of 20 dB HL at 500, 1000, 2000, 3000, and 4000 Hz and reported normal or corrected-to-normal vision. None had neurological or language disorders. In all participants, the presence of intellectual impairment was ruled out with the Test of Non-Verbal Intelligence (TONI-4; Brown, Sherbenou, & Johnsen, 2010). Additionally, in children only, the Clinical Evaluation of Language Fundamentals test (CELF-4; Semel, Wiig, & Secord, 2004) was used to rule out any undiagnosed language impairment. Four sub-tests of the CELF-4, which together yield the Core Language Index, were administered. These included Concepts & Following Directions, Recalling Sentences, and Formulated Sentences for both age groups, together with Word Classes-2 for 11–12-year-olds and Word Structure for 8–9-year-olds. The Childhood Autism Rating Scale (Schopler, Van Bourgondien, Wellman, & Love, 2010) was used to evaluate the presence of Autism Spectrum Disorders (ASD), and the ADHD Index of the Conners' Rating Scales (Conners, 1997) was used to evaluate the presence of attention-deficit/hyperactivity disorder. Because socio-economic status has been shown to influence language development (e.g., Fernald, Marchman, & Weisleder, 2013), information on mothers' and fathers' levels of education was collected for all children. Based on the augmented version of the Edinburgh Handedness Questionnaire (Cohen, 2008; Oldfield, 1971), one participant was left-handed (11–12-year-old group) and three were ambidextrous (one 11–12-year-old and two adults). All others were right-handed. In U.S. public schools, children are expected to become fluent readers by the end of the 3rd grade. However, the exact number of years of formal reading instruction and its type varies greatly depending on whether or not a child attended kindergarten, the economic status of the district in which the school is located, parents' socio-economic status, race, and other factors (Department of Education, 1996).
2.2. Experimental Design
The study consisted of two experiments. In one, which was combined with ERP recordings, we examined the ability of younger children, older children, and adults to match auditory words with silent visual articulations. Henceforth, we call it the audiovisual matching task. In the other, which contained only behavioral measures, we examined the degree to which the same groups of children and adults benefited from seeing the talking face when they listened to auditory words embedded in a two-talker babble masker. Henceforth, we call it the speech-in-noise (SIN) perception task. Finally, we examined the relationship between ERP indices of visual speech encoding and SIN perception ability in each group through multiple regressions. Each of these study components is described in greater detail below.
2.2.1. Audiovisual Matching Task
2.2.1.1. Stimuli
Stimuli for experiment 1 consisted of auditory words and silent videos of their articulations. We used 96 words from the MacArthur Bates Communicative Developmental Inventories (Words and Sentences) (Fenson et al., 2007) as stimuli. All words contained 1–2 morphemes and were 1 to 2 syllables in length with two exceptions – “elephant” and “teddy bear.” Words contained between 1 and 8 phonemes, with diphthongs counted as 1 phoneme. Words were produced by a female speaker and recorded with a Marantz digital recorder and an external microphone at a sampling rate of 44,100 Hz. Sound files were edited in the Praat software (Boersma & Weenink, 2011) so that the onset and offset of sound were preceded by 50 ms of silence. Final sound files were root-mean-square normalized to 70 dB SPL.
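As a rough illustration (not the authors' actual pipeline, which relied on Praat for editing), the RMS equalization of the word recordings could be scripted as follows; the file name and target RMS value are hypothetical, since the mapping from digital RMS to dB SPL depends on how the playback chain is calibrated.

```python
# Minimal sketch of RMS normalization of a word recording.
# The file name and target RMS (0.05) are illustrative only; the actual
# 70 dB SPL level depends on playback calibration.
import numpy as np
import soundfile as sf

def rms_normalize(samples, target_rms=0.05):
    """Scale a waveform so that its root-mean-square amplitude equals target_rms."""
    current_rms = np.sqrt(np.mean(samples.astype(float) ** 2))
    return samples * (target_rms / current_rms)

word, fs = sf.read("toys.wav")                          # hypothetical word recording
sf.write("toys_normalized.wav", rms_normalize(word), fs)
```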
Videos showed a female talker dressed in a piglet costume articulating one word at a time. Only her head and shoulders were visible. The costume made it easier to turn the instructions into a story and to maintain children's attention. Instructions were video-recorded and kept identical across all participant groups. An actor dressed as a wolf told children that his friend piglet helped researchers with the study. The piglet was asked to repeat words exactly as she heard them. Sometimes she did as she was told, but other times she goofed off and said something completely different. The wolf character also told children that the researchers' recordings had lost sound. The children's goal was to help researchers figure out when the piglet repeated the words she heard and when she did not. The videos' frame rate was 29.97 frames per second. The audio track of the video recording was removed in Adobe Premiere Pro CS5 (Adobe Systems Incorporated, USA). Articulation portions of videos ranged from 1133 ms (for "car") to 1700 ms (for "sandbox").
2.2.1.2. Procedure
The procedure was identical to that described in two earlier studies from our laboratory. The sequence of events in each trial is shown in Figure 1. First, participants heard an auditory word. To maintain participants' attention on the task and to provide a fixation point, a picture matching each word was shown on the computer's screen while participants listened. The length of auditory words varied, but the fixation picture always stayed on the screen for 1000 ms following the word offset to allow for auditory evoked potentials to resolve. A blank screen followed for another 1000 ms, after which a video of a female talker silently articulating a word was presented. In half of all trials, the talker's articulation matched the previously heard word (congruent trials; for example, participants saw the talker articulate "toys" after hearing the word "toys"), while in the other half, the talker's articulation clearly mismatched the previously heard word (incongruent trials; for example, participants saw the talker say "bus" after hearing the word "toys"). The appearance of a screen with "Same?" written across it signaled the start of the response window. Participants had to determine whether the silently articulated word was the same as the auditory word they had heard at the beginning of the trial. Trials were separated by an interval randomly varying between 1000 and 1500 ms. Responses were collected via a response pad (RB-530, Cedrus Corporation), with the response hand counterbalanced across participants. Stimulus presentation and response recording were controlled by the Presentation program (www.neurobs.com).
Figure 1.
Schematic representation of a trial in the matching task
Note that separate timelines are shown for the video and audio tracks. The video of articulation was congruent in half of all trials (e.g., participants saw the piglet silently articulate “toys” after hearing “toys” at the start of the trial) and incongruent in the other half of trials (e.g., participants saw the piglet silently articulate “bus” after hearing “toys” at the start of the trial). The onset of articulation was used as time 0 for the N400 and LPC ERP averages.
Each participant completed 96 trials (48 congruent and 48 incongruent). For incongruent trials, 48 pairs of auditory and silently articulated words were created such that their visual articulation differed significantly during the word onset. In most cases (35 out of 48 pairs), this was achieved by pairing words in which the first consonants differed visibly in the place of articulation (e.g., belt vs. truck). In 6 pairs, the first vowels of the words differed in the shape and the degree of mouth opening (e.g., donkey vs. candy). In the remaining 7 pairs, the first sounds were a labial consonant in one word (i.e., required a mouth closure (e.g., pumpkin)) and a vowel (i.e., required a mouth opening (e.g., airplane)) in another word. Heard and articulated words in incongruent pairs had no obvious semantic relationship. Two lists containing 48 congruent and 48 incongruent pairings were created such that the articulations that were congruent in list A were incongruent in list B. As a result, across participants, we collected responses to the same articulations, which were perceived as either congruent or incongruent. Such counterbalancing also allowed for the control of word frequency, length, and complexity in congruent and incongruent trials. Lastly, 10 different versions of list A and 10 different versions of list B were created by randomizing the order of 96 trials. Each participant completed only one version of one list (e.g., participant 1 did list A version 1; participant 2 did list B version 1; participant 3 did list A version 2, participant 4 did list B version 2, etc.) Version 1 of lists A and B is shown in the Appendix.
2.2.2. Speech-In-Noise Perception Task
2.2.2.1. Stimuli
In the second experiment, participants listened to the same 96 words used in the audiovisual matching task. However, this time words were embedded in a two-talker babble masker. The masker consisted of two female voices reading popular children's stories. One sample was 3 minutes and 8 seconds long (by talker 1), and the other was 3 minutes and 28 seconds long (by talker 2). Both samples were manually edited in Praat to remove silent pauses greater than 300 ms and then repeated without discontinuity. The streams from the two talkers were root-mean-square normalized to 75 dB SPL, mixed, and digitized using a resolution of 32 bits and a sampling rate of 24.414 kHz. Because the 96 target words were root-mean-square normalized to 70 dB SPL, the final stimuli had a −5 dB signal-to-noise ratio.
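The resulting signal-to-noise ratio is simply the level difference between target and masker:

$$\mathrm{SNR} = L_{\text{target}} - L_{\text{masker}} = 70\ \text{dB SPL} - 75\ \text{dB SPL} = -5\ \text{dB}$$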
2.2.2.2. Experimental Design
A schematic representation of the SIN trial is shown in Figure 2. This task had 2 conditions – auditory only (A) and audiovisual (AV) – which were administered on two separate days. The order of A and AV conditions was counterbalanced across participants, but each participant completed both. The babble masker started 3 seconds prior to the first trial and was presented continuously until the end of the experiment. In the AV condition, participants saw videos of a talker producing each of 96 words. Each video was preceded and followed by a static image of a talker with a closed mouth, which lasted for 1,000 ms. In the A condition, the same static images of the talker were present; however, the video portion was replaced with an image of the talker with her mouth open (see Figure 2). The appearance of the open-mouth picture in the A condition cued participants to the onset of the target auditory word, without providing any visual cues to its identity. Previous research shows that visual cues that reliably predict the onset of the auditory signal significantly improve the latter’s detection threshold (ten Oever et al., 2014). The inclusion of the cue to the target word onset in the A condition aimed to make the attentional demands of the A and AV conditions more similar. Word presentations in both conditions were separated by 3 seconds, during which participants provided their verbal response about what they had heard. When unsure, participants were encouraged to give their best guess or to say “I don’t know.”
Figure 2.
Schematic representation of a trial in the speech-in-noise (SIN) task
The SIN task had two conditions – the audiovisual (AV, top panel) and the auditory only (A, bottom panel). Note that separate timelines are shown for the video and audio tracks in each condition. The only difference between the conditions was that while in the AV condition participants saw a video of the piglet articulating target words, in the A condition the video portion was replaced with a static image of the piglet’s face with her mouth open. The appearance of the open mouth picture in the A condition cued participants to the fact that the onset of the auditory word is imminent, but provided no visual speech cues to its identity.
2.3. Sequence of Testing Sessions
All testing occurred over 3 sessions administered on 3 different days. One of the SIN conditions (either A or AV) was administered during the first session, the audiovisual matching task – during the second session, and the second SIN condition – during the third session. Because the same words were used in the audiovisual matching task and in the SIN task, most participants’ sessions were separated by at least 7 days to minimize the possible effect of stimulus repetition.
2.4. EEG Recordings
During the audiovisual matching task, the electroencephalographic (EEG) data were recorded from the scalp at a sampling rate of 512 Hz using 32 active Ag-AgCl electrodes secured in an elastic cap (Electro-Cap International Inc., USA). Electrodes were positioned over homologous locations across the two hemispheres according to the criteria of the International 10–10 system (American Electroencephalographic Society, 1994). The specific locations were as follows: midline sites Fz, Cz, Pz, and Oz; mid-lateral sites FP1/FP2, AF3/AF4, F3/F4, FC1/FC2, C3/C4, CP1/CP2, P3/P4, PO3/PO4, and O1/O2; and lateral sites F7/F8, FC5/FC6, T7/T8, CP5/CP6, and P7/P8; and left and right mastoids. EEG recordings were made with the Active-Two System (BioSemi Instrumentation, Netherlands), in which the Common Mode Sense (CMS) active electrode and the Driven Right Leg (DRL) passive electrode replace the traditional “ground” electrode (Metting van Rijn, Peper, & Grimbergen, 1990). Data were referenced offline to the average of the left and right mastoids. The Active-Two System allows EEG recording with high impedances by amplifying the signal directly at the electrode (BioSemi, 2013; Metting van Rijn, Kuiper, Dankers, & Grimbergen, 1996). In order to monitor for eye movement, additional electrodes were placed over the right and left outer canthi (horizontal eye movement) and below the left eye (vertical eye movement). Prior to data analysis, EEG recordings were filtered between 0.1 and 30 Hz. Individual EEG records were visually inspected to exclude trials containing excessive muscular and other non-ocular artifacts. Ocular artifacts were corrected by applying a spatial filter (EMSE Data Editor, Source Signal Imaging Inc., USA) (Pflieger, 2001). ERPs were epoched starting at 200 ms pre-stimulus and ending at 1800 ms post-stimulus onset. The 200 ms prior to the stimulus onset served as a baseline.
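For illustration only, the main preprocessing steps described above could be scripted in MNE-Python as shown below; ocular artifact correction in the study was performed in EMSE, and the file name, reference channel labels, and event codes in this sketch are assumptions rather than the actual recording parameters.

```python
# Illustrative EEG preprocessing sketch (MNE-Python); not the software used
# for the reported analyses. File name, mastoid labels, and trigger codes are
# hypothetical.
import mne

raw = mne.io.read_raw_bdf("sub01_matching.bdf", preload=True)  # BioSemi recording
raw.set_eeg_reference(["M1", "M2"])        # offline re-reference to averaged mastoids
raw.filter(l_freq=0.1, h_freq=30.0)        # 0.1-30 Hz band-pass filter

events = mne.find_events(raw)              # assumes a trigger/status channel
event_id = {"congruent": 1, "incongruent": 2}   # hypothetical trigger codes
epochs = mne.Epochs(raw, events, event_id,
                    tmin=-0.2, tmax=1.8,   # -200 to 1800 ms around articulation onset
                    baseline=(-0.2, 0.0),  # 200-ms pre-stimulus baseline
                    preload=True)

# Condition averages (trial rejection and ocular correction omitted here)
evoked_congruent = epochs["congruent"].average()
evoked_incongruent = epochs["incongruent"].average()
```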
The onset of the visually-observed articulation elicited clear N400 and LPC components. These components' mean amplitudes were measured over the following windows post-stimulus onset: 430–750 ms for N400 and 930–1540 ms for LPC. Fifty percent area latencies of these components were also calculated (i.e., the point that divides the area under the specified portion of an ERP waveform in half) (Luck, 2014). The N400 measurement window was motivated by previous studies with younger children (e.g., Malins et al., 2013; Mohan & Weber, 2015; Weber-Fox, Spruill, Spencer, & Smith, 2008). The LPC measurement window was identical to the one used in an earlier study with children in our lab (Kaganovich, Schumaker, & Rowland, 2016). Although these windows may appear relatively late compared to other studies, it is important to note that time zero for ERP averaging was the onset of visually detectable articulation, which may precede the onset of an auditory phoneme by 100–200 ms (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009).
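A minimal sketch of these two measurements, assuming a hypothetical single-channel difference waveform sampled at 512 Hz and time-locked to articulation onset, is given below; the array names are illustrative.

```python
# Sketch of mean amplitude and 50% fractional-area latency for an ERP window.
# `diff_wave` is assumed to be a 1-D NumPy array spanning -200 to 1800 ms
# relative to articulation onset, sampled at 512 Hz.
import numpy as np

fs = 512.0
times = np.arange(-0.2, 1.8, 1.0 / fs)               # seconds relative to onset

def mean_amplitude(wave, t_start, t_end):
    mask = (times >= t_start) & (times <= t_end)
    return wave[mask].mean()

def fractional_area_latency(wave, t_start, t_end, fraction=0.5):
    mask = (times >= t_start) & (times <= t_end)
    area = np.cumsum(np.abs(wave[mask]))              # rectified cumulative area
    idx = np.searchsorted(area, fraction * area[-1])  # time point splitting the area
    return times[mask][idx]

# Example usage with the windows reported above (values in seconds):
# n400_amp = mean_amplitude(diff_wave, 0.43, 0.75)
# lpc_lat  = fractional_area_latency(diff_wave, 0.93, 1.54)
```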
2.5. Statistical Analyses
2.5.1. Behavioral and ERP measures
One-way ANOVA tests were used to compare group means on all screening tests. The homogeneity of variances across groups was evaluated with the Levene statistic. When variances differed, the Brown-Forsythe correction was applied. In all such cases, the corrected degrees of freedom and p-value are reported.
Repeated-measures ANOVAs were used to determine whether groups differed in the number of correct responses, incorrect responses, misses, and in response time during the audiovisual matching task. For each of these tests, congruency (congruent vs. incongruent) was used as a within-subject variable and age (8–9-year-olds vs. 11–12-year-olds vs. adults) as a between-group variable. To evaluate whether SIN accuracy was higher in the AV compared to the A condition, we used condition (auditory vs. audiovisual) as a within-subject variable and age (3 levels) as a between-group variable.
Repeated-measures ANOVAs were also used to evaluate ERP components, with separate analyses conducted on the N400 and LPC mean amplitude and 50% area latency. Two separate analyses were performed on each mean amplitude measure. The first was conducted on original ERP waveforms and included the within-subject variables of condition (congruent vs. incongruent) and site (see below) as well as the between-group variable of age (8–9-year-olds, 11–12-year-olds, and adults). The main goal of this analysis was to ascertain that the effect of condition was present in each group and for each component. In other words, this analysis showed that N400 and LPC were present in the waveforms of each group. The second analysis was conducted on difference waveforms computed between the two conditions (incongruent-minus-congruent for the N400 and congruent-minus-incongruent for the LPC, as detailed below). Therefore, this analysis only had the factors of site and age, as described below.
We used previous studies in the field to choose the electrode sites included in the analyses (Luck, 2014, Chapter 10). The N400 component has been studied extensively and typically has a centro-parietal distribution (for a review, see Kutas & Federmeier, 2011). Accordingly, we included centro-parietal (CP1, CP2), parietal (P3, PZ, P4), and parieto-occipital (PO3, PO4) sites in our analyses. The N400 amplitude is typically broadly distributed over the centro-parietal sites. However, some studies have reported a slightly greater N400 amplitude over the right scalp. We, therefore, included hemisphere (left vs. right) as a factor in initial analyses to make sure that our groups did not differ in the hemispheric distribution of this component. Since the effect of hemisphere was not significant and did not interact with group (as we report in the Results section), we reduced the number of variables in the N400 analyses to just two: site (7 levels) and group (3 levels).
The LPC component also tends to have a centro-parietal distribution (e.g., Rugg, Brovedani, & Doyle, 1992). In our data, the effect spread over the occipital sites as well, likely due to the visual nature of the stimuli. Therefore, centro-parietal (CP1, CP2), parietal (P3, PZ, P4), parieto-occipital (PO3, PO4), and occipital (O1, OZ, O2) sites were included in analyses. To make the N400 and LPC analyses as parallel as possible, we included hemisphere as a factor in the initial analyses of LPC. The effect of hemisphere was significant, but did not interact with either age or congruency. We, therefore, removed hemisphere as a variable from further analyses and kept only site (10 levels) and age (3 levels) as factors. The results of initial analyses with hemisphere as a factor are reported for both components for completeness. However, the main findings did not change regardless of whether or not hemisphere was included as a factor in ERP analyses.
When ANOVA analyses produced a significant interaction, it was further analyzed with step-down ANOVAs, with included factors specified for each follow-up analysis in the Results section. When the assumption of sphericity was violated, we used the Greenhouse-Geisser adjusted p-values to determine significance. Effect sizes, indexed by the partial eta squared statistic (ηp2), are reported for all significant repeated-measures ANOVA results.
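As an illustration of the mixed design described in this section, a congruency-by-age ANOVA could be run in Python with the pingouin package; this is not the software used for the reported analyses, and the data frame and column names below are hypothetical (one row per participant per congruency level).

```python
# Illustrative mixed ANOVA sketch (pingouin); hypothetical CSV with columns
# subject, group, congruency, accuracy.
import pandas as pd
import pingouin as pg

df = pd.read_csv("matching_accuracy.csv")
aov = pg.mixed_anova(data=df, dv="accuracy",
                     within="congruency", subject="subject",
                     between="group")
print(aov[["Source", "F", "p-unc", "np2"]])   # np2 = partial eta squared

# For the ERP analyses, the multi-level site factor would be handled the same
# way; sphericity corrections (e.g., Greenhouse-Geisser) become relevant there
# because that factor has more than two levels.
```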
2.5.2. Regressions
Multiple regression analyses aimed to determine how the relationship between ERP measures of visual articulatory processing (i.e., N400 and LPC) and gains during audiovisual as compared to auditory SIN perception changes with age. A hierarchical forced-entry regression model was constructed for each group of participants, in which ERP measures were entered as predictors and the SIN gain in the audiovisual SIN task as an outcome. Based on our earlier studies, we knew that at least in adults, N400 was related to the SIN improvement (reference). Therefore, the N400 mean amplitude was entered in the first step of the model construction, with the LPC entered in the second step. The N400 and LPC measures entered into regressions were their mean differences between congruent and incongruent trials averaged across all sites used for statistical analyses. To screen for outliers, we used the standardized DFBeta function in the SPSS Statistics program. Cases with standardized DFBeta values over 1 have a significant influence on the regression model and are considered outliers (Field, 2013). No such cases were detected. The variance inflation factor (VIF) was used to screen for multicollinearity. None of the VIF values exceeded 10, with the average VIF being 1.6, suggesting that multicollinearity was not a significant factor (Field, 2013).
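A compact sketch of this two-step regression, including the DFBETA and VIF screens, is shown below using statsmodels; the reported screening was done in SPSS, and the file and column names (n400, lpc, sin_gain) are hypothetical.

```python
# Sketch of the hierarchical forced-entry regression (statsmodels), with
# standardized DFBETA and VIF screening. Hypothetical columns:
# n400/lpc = difference-wave amplitudes averaged over the analysis sites,
# sin_gain = AV-minus-A improvement in percent correct.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("group_data.csv")

# Step 1: N400 only
X1 = sm.add_constant(df[["n400"]])
step1 = sm.OLS(df["sin_gain"], X1).fit()

# Step 2: N400 + LPC
X2 = sm.add_constant(df[["n400", "lpc"]])
step2 = sm.OLS(df["sin_gain"], X2).fit()

print(step1.rsquared, step2.rsquared)     # R^2 at each step
print(step2.params, step2.pvalues)        # coefficients and p-values

# Outlier screen: flag cases whose standardized DFBETA exceeds 1 in magnitude
dfbetas = step2.get_influence().dfbetas
print((np.abs(dfbetas) > 1).any(axis=1))

# Multicollinearity screen: VIF for each predictor (skipping the constant)
vifs = [variance_inflation_factor(X2.values, i) for i in range(1, X2.shape[1])]
print(dict(zip(["n400", "lpc"], vifs)))
```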
3. Results
3.1. Group Characteristics
Table 1 contains group means, standard deviations, and outcomes of group comparisons for younger and older children on measures of non-verbal intelligence, SES, linguistic ability (CELF-4 Core Language Score and Expressive Language Index), presence of ASD (CARS-2), and presence of ADHD symptoms (Conners' Rating Scales). Note that the tests of non-verbal intelligence and of linguistic ability use standardized measures. Therefore, no group difference due to age was expected if none of the children had significant impairments. All group comparisons were non-significant.
Table 1.
Group means for non-verbal intelligence (TONI-4), parents' education levels (SES), linguistic ability (CELF-4), presence of autism (CARS-2), and presence of ADHD symptoms (Conners' Rating Scales)
| | 8–9-year-olds | 11–12-year-olds | Group Effect F (df1, df2) | p |
|---|---|---|---|---|
| Non-verbal intelligence | 106.6 (2.6) | 110.1 (3) | 1.8 (2,41) | .179 |
| Mother’s Education, years | 14.86 (2.3) | 15.43 (1.8) | 0.53 (1, 27) | .475 |
| Father’s Education, years | 15.83 (3.5) | 17.23 (3.6) | 0.97 (1,24) | .334 |
| CELF-4 | ||||
| CF&D | 11.07 (1.6) | 11.50 (1.5) | 0.54 (1, 27) | .471 |
| RS | 11.07 (2.1) | 10.43 (2) | 0.69 (1, 27) | .412 |
| FS | 12.71 (1.8) | 12.14 (2) | 0.63 (1,27) | .435 |
| CLS | 111.07 (8.9) | 110.21 (9.1) | 0.06 (1,27) | .802 |
| ELI | 111.93 (9.7) | 108.21 (11.2) | 0.88 (1,27) | .357 |
| CARS-2 | 15.21 (.58) | 15.00 (.00) | 1.92 (1, 27) | .178 |
| Conners' ADHD Index | 50.79 (5.77) | 47.71 (6.29) | 1.81 (1, 27) | .19 |
Note. Numbers for TONI-4, Conners' ADHD Index, CLS, ELI, and CARS-2 represent standard scores. Numbers in parentheses are standard errors of the mean. F and p values reflect a group comparison. TONI = Test of Nonverbal Intelligence; CF&D = Concepts and Following Directions; RS = Recalling Sentences; FS = Formulated Sentences; CLS = Core Language Score; ELI = Expressive Language Index. Data on paternal level of education were not available for one 11–12-year-old child and two 8–9-year-old children.
3.2. Experiment 1 – Audiovisual Matching Task
3.2.1. Behavioral Results
Behavioral performance on the audiovisual matching task is summarized in Table 2. Groups were compared on the number of correct responses, misses, and reaction time (RT). While overall all groups scored above 90% on this task, the group effect was nonetheless significant, F(2,39)=9.34, p<.001, ηp2 =.32. Pair-wise comparisons with Bonferroni correction showed that the 8–9-year old group performed significantly worse than adults, p<.001, but did not differ from 11–12-year-olds, p=.098. Nor did the 11–12-year old group differ from adults, p=.125. There was no congruency effect, F(1,39)=.454, p=.504, ηp2 =.01, and no group by congruency interaction, F(2,39)=.58, p=.564, ηp2 =.029.
Table 2.
Performance on the audiovisual matching task
| | 8–9-year-olds | 11–12-year-olds | Adults |
|---|---|---|---|
| Percent Correct | |||
| Congruent | 92.41 (5) | 94.05 (3.4) | 97.03 (3) |
| Incongruent | 90.17 (7) | 94.35 (5.7) | 96.89 (2.6) |
| Percent Missed | |||
| Congruent | 2.98 (2.8) | 1.04 (1.6) | 0.88 (1.5) |
| Incongruent | 2.83 (3.4) | 1.79 (2.6) | 1.34 (3) |
| Reaction Time (ms) | |||
| Congruent | 766.11 (142.9) | 762.85 (97.9) | 606.64 (217) |
| Incongruent | 773.41 (147.5) | 785.77 (78.5) | 651.86 (253) |
Note. Numbers in parentheses are standard deviations.
Groups did not differ in the number of misses, F(2,39)=2.27, p=.117, ηp2=.1, nor was there an effect of congruency, F(1,39)=1.25, p=.269, ηp2=0.031, or a group by congruency interaction, F(2,39)=0.706, p=0.5, ηp2=0.035.
Finally, participants were faster at responding to congruent than to incongruent articulations, F(1,39)=4.83, p=.034, ηp2=.11, indicative of a priming effect. The main effect of group on RT was significant, F(2,39)=3.55, p=.038, ηp2=.15; however, none of the follow-up group comparisons reached significance: adults vs. 8–9 year-olds, p=.086; adults vs. 11–12 year-olds, p=.073; 8–9 year-olds vs. 11–12 year-olds, p=1. There was no group by congruency interaction, F(2,39)<1, ηp2=.045.
3.2.2. ERP Results
3.2.2.1. Analysis of Standard Waveforms
Figure 3 shows ERP waveforms elicited by congruent and incongruent articulations in each group. In agreement with our earlier studies, incongruent articulations elicited a larger N400 compared to congruent ones, F(1,39)=53.3, p<.001, ηp2=.578, while congruent articulations elicited a larger LPC compared to incongruent ones, F(1,39)=45.9, p<.001, ηp2=.541. The congruency by group interaction approached significance for N400, F(2,39)=3.182, p=.052, ηp2=.14, and was significant for LPC, F(2,39)=6.01, p=.005, ηp2=.235. Importantly, follow-up tests showed that ERPs elicited by congruent and incongruent articulations differed significantly during the N400 and LPC time windows in each group: N400, 8–9-year-olds, F(1,13)=20.68, p=.001, ηp2=.614; 11–12-year-olds, F(1,13)=29.45, p<.001, ηp2=.694; adults, F(1,13)=7.18, p=.019, ηp2=.356; LPC, 8–9-year-olds, F(1,13)=29.3, p<.001, ηp2=.693; 11–12-year-olds, F(1,13)=13.37, p=.003, ηp2=.507; adults, F(1,13)=5.49, p=.036, ηp2=.297.
Figure 3.
ERPs elicited by congruent and incongruent articulations
Grand average ERPs to the onset of silent articulation on congruent and incongruent trials are overlaid for each group. Negative is plotted up.
Analyses of both components yielded a significant effect of hemisphere: N400, F(1,39)=6.82, p=0.013, ηp2=.149; LPC, F(1,39)=5.3, p=0.027, ηp2=.12. The N400 was larger over the left hemisphere sites, while the LPC was larger over the right hemisphere sites. However, this hemisphere effect did not interact with condition: N400, F(1,39)<1, ηp2=.001; LPC, F(1,39)=1.76, p=0.192, ηp2=.043.
3.2.2.2. Analysis of Difference Waveforms
Figure 4 shows incongruent-minus-congruent difference waveforms overlaid for the three groups of participants (Panel A) as well as the scalp distribution of the two components in each group (Panel B). Additionally, mean values and standard errors for the N400 and LPC mean amplitudes in each group are detailed in Figure 5.
Figure 4.
ERP difference waves: Incongruent minus congruent
Panel A: Difference waveforms between ERPs elicited by incongruent and congruent articulations in each group are overlaid on top of each other. N400 is marked on the PO3 site, and LPC is marked on the O2 site. Negative is plotted up.
Panel B: A distribution of the N400 (430–750 ms) and LPC (930–1,540 ms) components over scalp as measured from difference waveforms is shown for each group.
Figure 5.
N400 and LPC mean amplitude across groups
Mean amplitudes of N400 and LPC measured from difference waves are shown for each group. Error bars are standard errors of the mean. P values reflect pair-wise group comparisons following the Bonferroni correction.
The N400 statistical analyses were performed on the incongruent-minus-congruent difference waveforms. In initial assessments, the effect of hemisphere was not significant, F(1,39)=.041, p=.841, ηp2=.001, nor was there a group by hemisphere interaction, F(2,39)=.49, p=.616, ηp2=.025. Therefore, to reduce the number of variables, data from electrodes over both hemispheres as well as from the midline Pz site were entered as 7 levels of the site variable for further analyses. The main effect of group was not significant either in the analysis of the N400 mean amplitude, F(2,39)=3.22, p=.051, ηp2=.14, or in the analysis of the N400’s 50% area latency, F(2,39)=1.67, p=.202, ηp2=.08.
Group effects in the LPC mean amplitude and latency were evaluated based on the congruent-minus-incongruent difference waveforms. Similar to the N400 analyses, initial evaluations of the LPC amplitude yielded no main effect of hemisphere, F(1,39)= 1.41, p=.243, ηp2=.035, and no hemisphere by group interaction, F(2,39)=.725, p=.491, ηp2=.036. Hence, data from electrodes over both hemispheres as well as from the midline Pz and Oz sites were entered as 10 levels of the site variable for further analyses. The main effect of group was significant, F(2,39)=6.6, p=.003, ηp2=.253. Bonferroni-corrected pairwise comparisons revealed that the 8–9-year old group had a significantly larger mean amplitude than both the 11–12-year old group, p=.025, and the adult group, p=.004. The 11–12-year old group and the adult group did not differ, p=1. Analysis of the LPC’s 50% area latency showed no effect of group, F(2,39)=.219, p=.805, ηp2=.01.
3.3. Speech-in-Noise Perception Task
Table 3 shows means and standard deviations for word identification during the SIN task in each group. Across the board, accuracy was significantly higher in the AV compared to the A condition, F(1,39)=495.9, p<.001, ηp2=.927. This improvement was present in every participant and ranged from 15 to 52%. The main effect of group was also significant, F(2,39)=64.635, p<.001, ηp2=.77. Bonferroni-corrected pairwise comparisons revealed that all groups differed significantly from each other, with adults being most accurate overall, followed by 11–12-year-olds, followed, in turn, by 8–9-year-olds, all ps<.001. This group effect did not interact with condition, F(2,39)=1.46, p=.245, ηp2=.07, suggesting that the amount of gain from audiovisual speech did not differ between groups.
Table 3.
Group performance on the speech-in-noise task
| Condition\Group | 8–9-year-olds | 11–12-year-olds | Adults |
|---|---|---|---|
| A | 38.38 (9.5) | 47.7 (9.9) | 61.78 (5.8) |
| AV | 72.01 (3.7) | 82.2 (5.1) | 90.61 (4) |
| AV-A | 33.64 (10.9) | 34.44 (9.3) | 28.83 (7.7) |
Note. A = auditory only, AV = audiovisual, AV-A shows gains in accuracy from seeing the talker’s face. Numbers for A and AV are the percent of words that were correctly identified. Standard deviations are shown next to each mean.
3.4. Regressions
Figure 6 shows relationships between the N400 and LPC ERP components and SIN improvement in the presence of the talker’s face. To better visualize age-related changes in these relationships, Figure 7 also compares regression results for the 3 groups, separately for N400 and LPC. Table 4 summarizes statistical outcomes.
Figure 6.
Regression results
Relationships between the N400 and LPC amplitudes and the SIN gain in the audiovisual condition are plotted separately for each group. Note that the ERP amplitudes plotted on the y-axis differ across groups.
Figure 7.
Age-related changes in relationship between ERP components and SIN
Relationship between the N400 and LPC amplitudes and the SIN gain in the audiovisual condition in each group are plotted separately for each ERP component to better visualize developmental changes.
Table 4.
Regression results
| | B (95% CI) | SE of B | Std Beta | p | R2 |
|---|---|---|---|---|---|
| 8–9-year-olds | |||||
| Step 1 | |||||
| Constant | 36.5 (25.9, 47.2) | 4.89 | | <.001 | .04 |
| N400 | .58 (−1.12, 2.28) | 0.78 | .211 | .47 | |
| Step 2 | |||||
| Constant | 41.4 (30, 52.8) | 5.2 | | <.001 | .27 |
| N400 | −.28 (−2.15, 1.6) | 0.85 | −.1 | .751 | |
| LPC | −1.02 (−2.23, 0.19) | 0.55 | −.57 | .091 | |
| 11–12-year-olds | |||||
| Step 1 | |||||
| Constant | 32.9 (23.1, 42.7) | 4.5 | | <.001 | .01 |
| N400 | −.37 (−2.32, 1.58) | 0.9 | −.12 | .684 | |
| Step 2 | |||||
| Constant | 33.38 (25.1, 41.67) | 3.76 | | <.001 | .37 |
| N400 | −1.98(−4.14, 0.19) | 0.98 | −.63 | .07 | |
| LPC | −1.83 (−3.44, −0.22) | 0.73 | −.79 | .03 | |
| Adults | |||||
| Step 1 | |||||
| Constant | 25.2 (20.6, 29.8) | 2.1 | | <.001 | .4 |
| N400 | −1.84 (−3.3, −.4) | .65 | −.63 | .016 | |
| Step 2 | |||||
| Constant | 25.64 (20.92, 30.37) | 2.14 | | <.001 | .45 |
| N400 | −2.43 (−4.34, −.51) | 0.87 | −.83 | .018 | |
| LPC | −0.58 (−1.82, 0.67) | 0.56 | −.3 | .331 | |
Note. For all analyses, ERP measures were entered as predictor variables, and the SIN improvement in the presence of the talker's face was entered as the outcome variable. Both N400 and LPC measures are based on the difference waveforms: incongruent-congruent for N400 and congruent-incongruent for LPC.
In the younger group of children, neither ERP component was a strong predictor of audiovisual SIN improvement. In the older group of children, the N400 component entered in step 1 had a negligible influence on the model. However, the addition of the LPC component in step 2 had a significant influence (p=0.03), accounting for 37% of variance in the SIN performance. Importantly, the LPC’s correlation coefficient was negative, suggesting that larger LPC was associated with smaller improvement on the audiovisual version of SIN. Finally, in adults, the amplitude of N400 was a strong predictor of audiovisual SIN gain (p=0.016). It accounted for 40% of variance in the SIN improvement. Although the addition of LPC in step 2 increased the R2 value from .4 to .45, the LPC’s influence on the model was not significant (p=0.331).
To determine the degree to which the mean amplitude of N400 and LPC were related in our participants, we conducted a Pearson correlation between the two measures. The correlation was significant, r=−.635, p<0.001, with larger LPC being associated with larger N400.
4. Discussion
We measured behavioral and neural responses to observed articulations of words during an audiovisual matching task in 8–9-year-olds, 11–12-year-olds, and adults and examined in each group the relationship between the elicited neural responses and the degree of improvement on the SIN task afforded by the presence of the talker’s face. Both children and adults were very accurate at identifying correspondences between the auditory and visual presentations of words and exhibited cross-modal priming, with faster responses to observed articulations on congruent as compared to incongruent trials. In agreement with our earlier studies, in all groups of participants visible articulations that did not match the preceding auditory word elicited the N400 component while those that did match the preceding auditory word elicited the LPC component. Groups did not differ in the mean amplitude of the phonological N400. However, the LPC mean amplitude of the 8–9-year-old group was significantly larger than that of both 11–12-year-olds and adults. Importantly, the relationship between these brain responses and a behaviorally measurable benefit of audiovisual as compared to auditory only speech (i.e., the SIN gain in the presence of the talker’s face) changed with age. While in 8–9-year-olds neither ERP component was a good predictor of the SIN gain, in 11–12-year-olds, the LPC amplitude accounted for about 37% of the SIN gain variance, with negligible influence of the N400 amplitude. In contrast, in adults a similar amount of variance — 40% — could be accounted for by the amplitude of N400, with LPC having no real effect on the regression model. The direction of the relationship was component-specific: in older children, smaller LPC was associated with higher SIN gains; in adults, larger N400 was associated with higher SIN gains. These results emphasize the fact that although visually perceived articulations elicited measurable N400 and LPC components in all age groups, the maturity of the neural processes modulating these components and their relationship to behavioral measures of audiovisual speech perception change significantly during the pre-teen years.
The obtained pattern of results suggests that the matching of auditory and visual modalities of words operates on different scales in children and adults. Adults appear to not only have established correspondences between oro-facial movements and individual phonemes but also to use this fine-grained audiovisual information as the primary way of taking advantage of visible speech. Indeed, the LPC component that indexes the repetition of the entire word form, although clearly present in adults, had a minuscule effect on the regression model with the SIN perception as an outcome. This interpretation is in agreement with earlier work by Fort and colleagues (Fort et al., 2013; Fort et al., 2010) showing that seeing the talker articulate just the first syllable of a word affords enough information to access the word's lexical representation. Within the context of our paradigm, it means that adults could likely make a decision on whether the articulation matched or did not match the heard word after seeing just the word onset.
The 11–12-year-old children did not differ significantly from adults in the mean amplitudes of the N400 and the LPC. However, in this group it was the amplitude of the LPC component that was a strong predictor of the SIN perception gain when seeing the talker’s face, suggesting that older children need to see a larger portion of a word’s articulation (if not all of it) than adults do in order to benefit from visible speech. Somewhat surprisingly, neither the N400 nor the LPC had a strong relationship with SIN gains in 8–9-year-old children, despite being present in their ERP waveforms and despite the fact that even these youngest children did better on the SIN task when it was audiovisual rather than auditory-only. The neural mechanisms underlying this gain require further study. One possibility is that these younger children used more basic properties of the videos for audiovisual matching, such as the length of articulation or the openness or roundedness of the lips, without necessarily mapping these properties onto specific phonemes or words. Notably, the study by Jerger and colleagues (2009) suggested that children between the ages of 5 and 9 may exhibit reduced sensitivity to visible speech compared to both older and younger children due to a re-organization of phonological representations. In our study, the 8–9-year-old group performed worse on the SIN task than either older children or adults, but they were nonetheless quite accurate in their judgments during the audiovisual matching task. It is likely, therefore, that the electrophysiological measures used in our study were not sensitive to the neural processes underlying audiovisual lexical processing in this group.
The above interpretation of our findings assumes a fair degree of independence between the N400 and LPC components. However, because the amplitudes of the N400 and the LPC are calculated as a difference between the congruent and the incongruent conditions, it is possible that a developmental change in only the N400 or only the LPC could influence the relative amplitude of both. For example, a larger LPC component in children could also result in a larger N400 measurement if the former spreads into the earlier time window. If this were the case, then, given that our measurement windows for the N400 and the LPC were non-overlapping and well separated (430–750 ms for the N400, 930–1540 ms for the LPC), it would follow that different portions of the same underlying component are associated with better SIN perception in older children and adults: in adults, the amplitude of the component’s onset would be more predictive of better SIN perception, while in 11–12-year-olds, the amplitude of the center of the component would be more predictive. Although different from the original interpretation regarding the specific components bringing about developmental change, this analysis still suggests that children need to see a larger portion of the articulation than adults do before detecting an audiovisual match/mismatch. Our articulation stimuli varied in length from 1133 ms to 1700 ms, meaning that adults could detect a match during the first half of the articulation and, therefore, likely used phonemic or syllabic information for the task, while children needed to see most, if not all, of the word before they could detect a match.
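For readers interested in the measurement itself, the sketch below shows one way mean component amplitudes over the two non-overlapping windows could be extracted from per-condition ERP averages. Only the window boundaries (430–750 ms and 930–1540 ms) come from the text; the sampling rate, epoch length, subtraction directions, and waveform values are illustrative assumptions rather than details of our recording or analysis.

```python
# Minimal sketch, under assumed data shapes, of extracting mean N400 and LPC
# amplitudes from congruent and incongruent ERP averages at a single electrode.
# Only the two measurement windows are taken from the text; everything else
# (sampling rate, epoch span, subtraction direction, waveforms) is hypothetical.
import numpy as np

SRATE = 512                      # assumed sampling rate (Hz)
EPOCH_START = -0.2               # assumed epoch onset relative to video onset (s)
N_SAMPLES = int(2.0 * SRATE)     # assumed 2-s epoch
times = EPOCH_START + np.arange(N_SAMPLES) / SRATE

def mean_window_amplitude(erp, t_start, t_end):
    """Average an ERP trace (in µV) over the window [t_start, t_end] in seconds."""
    mask = (times >= t_start) & (times <= t_end)
    return erp[mask].mean()

# Hypothetical per-condition averages for one participant.
rng = np.random.default_rng(1)
congruent = rng.normal(0.0, 1.0, N_SAMPLES)
incongruent = rng.normal(0.0, 1.0, N_SAMPLES)

# The N400 is elicited by incongruent articulations and the LPC by congruent
# ones, so opposite condition differences are taken here; both directions are
# illustrative assumptions, not the study's stated convention.
n400_amp = mean_window_amplitude(incongruent - congruent, 0.430, 0.750)
lpc_amp = mean_window_amplitude(congruent - incongruent, 0.930, 1.540)
print(f"N400 = {n400_amp:.2f} µV, LPC = {lpc_amp:.2f} µV")
```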
It may seem counterintuitive that, on the one hand, both groups of children showed a clear N400 to visible audiovisual mismatches, while, on the other hand, unlike in adults, the modulation of this component in children was unrelated to the SIN gain. If children are able to detect an audiovisual mismatch at the phonological level, why can they not use phoneme-level audiovisual matching during word recognition in a SIN perception task? This finding, however, fits well with earlier reports in the developmental and neuropsychological literatures. For example, even though infants can discriminate between word forms that differ in just one phoneme (e.g., bih vs. dih) as early as 8 months of age, they fail to learn to associate these minimally different word forms with two separate objects at 14 months of age (Stager & Werker, 1997; Werker, Cohen, Lloyd, Casasola, & Stager, 1998). Bringing together the seemingly contradictory results from these and other studies of speech perception in infancy, Werker and Curtin (2005) proposed a developmental framework for Processing Rich Information from Multidimensional Interactive Representations (PRIMIR). According to this framework, the same type of phonetic information may be available to younger and older infants, but the way they use this information changes with development and task. Younger infants simply detect a phonetic change and do not use it for word learning, while older infants are at the stage of development when they are trying to associate sequences of sounds with specific objects. However, for 14-month-olds, word forms that differ in only one phonetic feature of the onset consonant (like the place of articulation distinguishing “bih” and “dih”) but overlap in all other phonetic properties are not sufficiently different to be associated with two different objects. Put another way, at 14 months of age, infants have not yet figured out that the change in the place of articulation differentiating “bih” and “dih” is phonemic. By 17 months of age, however, infants appear to have overcome this difficulty (Werker, Fennell, Corcoran, & Stager, 2002).
In adults, too, neuropsychological and neuroimaging studies show that the ability to detect phonemic changes dissociates from the ability to access lexical and semantic representations. For example, Miceli and colleagues (1980) reported data from 69 patients with aphasia on a phoneme discrimination task and a single-word comprehension task. Within this group, a double dissociation of skills was observed, with some patients performing normally on phoneme discrimination but not on word comprehension and others showing the reverse pattern. In a slightly later study, Milberg and colleagues (1988) showed that while phonemic priming was impaired in some patients with aphasia, these patients were nonetheless accurate when, in a separate task, they made lexical decisions about the same word forms that had served as primes. Tasks involving sub-lexical units (e.g., phoneme and syllable monitoring and discrimination) also seem to engage a different neural network than tasks involving lexical-level processing (e.g., listening to speech for comprehension). In fact, this dissociation laid the foundation for Hickok and Poeppel’s influential dual-stream theory of speech perception (Hickok & Poeppel, 2004).
Within this broader context, the outcome of our study fits well with the idea that the same type of information (i.e., visible articulatory movements) may be used differently at different stages of development and in different tasks. Children 8–12 years of age can clearly detect a difference between the expected and the observed oro-facial movements, as evidenced by the presence of the N400 in their ERPs and their high accuracy on the audiovisual matching task. However, even at the later point of this age range, seeing the speaker articulate word onsets does not appear to activate the associated lexical representations, which is reflected in the lack of a relationship between the N400 and the SIN gain. In older children, the activation of lexical representations may be triggered by observing most or all of the word’s articulation, leading to the relationship between the LPC component (often thought to reflect the recognition of a repeated event) and SIN perception. Generally speaking, the developmental pattern observed in our study may reflect the slow establishment of connections between visually perceived phonological and lexical representations.
Interestingly, although the SIN gain was the same across groups, overall accuracy on both the auditory and the audiovisual SIN tasks increased with age. At 90% accuracy, the adult group had probably approached ceiling on the audiovisual SIN task, especially given that the stimuli were single words, which, unlike words appearing in sentences, could not be predicted from context. However, 8–9-year-olds and 11–12-year-olds, at 72% and 82% accuracy respectively, clearly had more room to improve in the audiovisual condition, yet failed to do so. This finding fits well with the ERP data and underscores the immaturity of audiovisual speech perception during the early school years.
It is also noteworthy that, although the audiovisual matching and SIN tasks used identical visual stimuli, group differences in behavioral responses during audiovisual matching were very small, unlike in the SIN task. On both congruent and incongruent trials, children in both groups were able to correctly detect a correspondence, or the lack thereof, between auditory words and observed articulations. Although seemingly counterintuitive, this finding fits well with a growing neuroimaging and developmental literature showing that different aspects of audiovisual processing, such as audiovisual matching, learning, or integration, may rely on disparate brain regions (e.g., Calvert, 2001; Erickson et al., 2014) and have different developmental timeframes (e.g., Hillock-Dunn & Wallace, 2012; Jerger et al., 2009; Lewkowicz, 2012). Within the context of our study, this finding suggests that the ability to match auditory and visual components of articulated words in a sequential manner is not by itself sufficient for effective audiovisual integration of speech under naturalistic conditions.
The study has its limitations and raises a number of questions that will require future experimental work. One of them concerns the development of the N400 component. Because the N400 amplitude did not differ statistically between children and adults, its functional maturation requires further study with a larger number of participants representing a broader age range. In this study, an alpha of .05 was adopted as a hard threshold for statistical analyses. This precluded potential over-interpretation of the results; however, it also raised the possibility of a Type II error, especially in the analysis of the N400 mean amplitude. In an earlier study from our laboratory that used the same paradigm as the current study (reference), we examined the relationship between the N400 mean amplitude and SIN perception in children with typical development and in children with developmental language disorder (DLD, also known as specific language impairment or SLI). In a combined group of children (n=38, mean age = 10;0, range 7;7–13;8), the N400 mean amplitude was predictive of children’s SIN perception in the audiovisual condition, but only in combination with the visual P1 peak amplitude. Differences between the results of the current study and our earlier report are likely due to differences in the age range of the children, their number, and the nature of the regression analyses. Overall, however, to fully rule out the effect of age on the N400 component and its relationship to SIN perception during mid-childhood, a replication study is needed.
Finally, our stimuli were designed so that all audiovisual mismatches on incongruent trials occurred at word onsets. In part, this design feature was motivated by the desire to make mismatches as salient as possible, given that word onsets typically attract selective attention during speech perception tasks (Astheimer & Sanders, 2009, 2011). This choice was also based on the fact that word onsets are gates to lexical access and, therefore, audiovisual processing of word onsets may confer significant benefits during speech perception. However, this leaves open the question of how the processing of visually perceived word rimes develops during the same age range. In the auditory domain, the development of sensitivity to onsets and codas depends, at least to some extent, on the language-specific complexity of each component within a syllable. In languages like English, which have more rime neighbors than body neighbors, children learn to segment a CVC sequence into an onset and a rime (C-VC) before they learn to segment it into a body and a coda (CV-C) (Ziegler & Goswami, 2005). One might predict, therefore, that the ability to map observed articulations during word rimes onto specific phonemes or groups of phonemes takes even more time to mature than the ability to do such mapping for word onsets. A confirmation of this hypothesis awaits future studies.
In sum, our results suggest that the ability to access lexical representations from the visual modality is not yet mature during the pre-teen years. While children as young as 8–9 years of age are able to detect audiovisual correspondences at the phonemic/syllabic level, even by 11–12 years of age children cannot reliably use such correspondences for lexical access.
Highlights.
Matching silent articulations with auditory words elicits phonological N400 and LPC
N400 amplitude does not differ between school-age children and adults
LPC amplitude is larger in 8–9-year-olds compared to 11–12-year-olds and adults
LPC to seen articulations predicts SIN gain in 11–12-year-old children only
N400 to seen articulations predicts SIN gain in adults only
Only adults can use visible speech at the phonemic level for lexical access
Acknowledgments
This research was supported in part by grant R03DC013151 from the National Institute on Deafness and Other Communication Disorders, National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Deafness and Other Communication Disorders or the National Institutes of Health. We are grateful to Kevin Barlow for creating stimulus presentation programs, to Steven Hnath and Samantha Hoover for help with video materials, and to Jennifer Schumaker for help with various stages of data collection and analysis.
Appendix
The pairing of auditory words and silent articulations
Note that articulations that are congruent (i.e., match the preceding auditory word) in List A are incongruent (i.e., do not match the preceding auditory word) in List B.
| # | Auditory Word (List A) | Silent Articulation (List A) | Auditory Word (List B) | Silent Articulation (List B) | |
|---|---|---|---|---|---|
| 1 | shower | shower | candy | donkey | |
| 2 | tree | lamb | cat | cat | |
| 3 | jello | jello | jello | monkey | |
| 4 | cat | girl | egg | egg | |
| 5 | egg | pool | donut | bottle | |
| 6 | donut | donut | zipper | present | |
| 7 | zipper | zipper | donkey | candy | |
| 8 | donkey | donkey | shirt | shirt | |
| 9 | grapes | grapes | grapes | farm | |
| 10 | police | apple | police | police | |
| 11 | truck | belt | apple | apple | |
| 12 | apple | police | truck | truck | |
| 13 | monkey | monkey | monkey | jello | |
| 14 | sandwich | mailman | sandwich | sandwich | |
| 15 | car | car | car | fish | |
| 16 | turtle | turtle | turtle | popcorn | |
| 17 | squirrel | squirrel | squirrel | pretzel | |
| 18 | window | window | window | sandbox | |
| 19 | sled | bird | sled | sled | |
| 20 | necklace | necklace | bread | duck | |
| 21 | water | water | water | carrot | |
| 22 | sink | sink | sink | mop | |
| 23 | paint | paint | paint | woods | |
| 24 | pretzel | pretzel | pretzel | squirrel | |
| 25 | nail | peas | nail | nail | |
| 26 | bird | sled | bird | bird | |
| 27 | corn | corn | corn | frog | |
| 28 | couch | couch | couch | moose | |
| 29 | farm | farm | farm | grapes | |
| 30 | airplane | pumpkin | airplane | airplane | |
| 31 | popcorn | popcorn | popcorn | turtle | |
| 32 | penguin | doctor | penguin | penguin | |
| 33 | knife | mouth | mouth | mouth | |
| 34 | arm | horse | arm | arm | |
| 35 | bed | ear | bed | bed | |
| 36 | present | present | present | zipper | |
| 37 | sandbox | sandbox | sandbox | window | |
| 38 | mop | mop | mop | sink | |
| 39 | mailman | sandwich | mailman | mailman | |
| 40 | lamb | tree | shower | necklace | |
| 41 | candy | candy | duck | bread | |
| 42 | scissors | balloon | scissors | scissors | |
| 43 | pool | egg | pool | pool | |
| 44 | bee | bee | bee | eye | |
| 45 | chair | boat | chair | chair | |
| 46 | cake | ball | cake | cake | |
| 47 | boy | boy | boy | dog | |
| 48 | sprinkler | muffin | sprinkler | sprinkler | |
| 49 | elephant | elephant | elephant | teddy bear | |
| 50 | comb | beach | comb | comb | |
| 51 | jar | purse | jar | jar | |
| 52 | horse | arm | horse | horse | |
| 53 | sweater | sweater | sweater | picture | |
| 54 | moose | moose | moose | couch | |
| 55 | muffin | sprinkler | muffin | muffin | |
| 56 | ear | bed | ear | ear | |
| 57 | toys | toys | toys | bus | |
| 58 | bus | bus | carrot | water | |
| 59 | carrot | carrot | teacher | buttons | |
| 60 | teacher | teacher | hammer | hammer | |
| 61 | hammer | pizza | bus | toys | |
| 62 | frog | frog | frog | corn | |
| 63 | shirt | foot | necklace | shower | |
| 64 | buttons | buttons | buttons | teacher | |
| 65 | ball | cake | ball | ball | |
| 66 | beach | comb | beach | beach | |
| 67 | girl | cat | girl | girl | |
| 68 | mouth | knife | knife | knife | |
| 69 | peas | nail | peas | peas | |
| 70 | woods | woods | woods | paint | |
| 71 | picture | picture | picture | sweater | |
| 72 | purse | jar | purse | purse | |
| 73 | belt | truck | belt | belt | |
| 74 | wolf | wolf | wolf | house | |
| 75 | scarf | scarf | scarf | broom | |
| 76 | teddy bear | teddy bear | teddy bear | elephant | |
| 77 | house | house | house | wolf | |
| 78 | eye | eye | eye | bee | |
| 79 | dog | dog | dog | boy | |
| 80 | flower | orange | flower | flower | |
| 81 | doctor | penguin | doctor | doctor | |
| 82 | foot | shirt | foot | foot | |
| 83 | broom | broom | broom | scarf | |
| 84 | tractor | pencil | tractor | tractor | |
| 85 | circus | money | circus | circus | |
| 86 | balloon | scissors | balloon | balloon | |
| 87 | orange | flower | orange | orange | |
| 88 | pencil | tractor | pencil | pencil | |
| 89 | pumpkin | airplane | pumpkin | pumpkin | |
| 90 | bread | bread | lamb | lamb | |
| 91 | money | circus | money | money | |
| 92 | bottle | bottle | bottle | donut | |
| 93 | boat | chair | boat | boat | |
| 94 | pizza | hammer | pizza | pizza | |
| 95 | fish | fish | fish | car | |
| 96 | duck | duck | tree | tree | |
Footnotes
Pictures were taken from the Peabody Picture Vocabulary Test and used with the publisher’s permission (Dunn & Dunn, 2007).
References
- Alsius A, Paré M, & Munhall KG (2018). Forty years after Hearing Lips and Seeing Voices: the McGurk effect revisited. Multisensory Research, 31, 111–144.
- American Electroencephalographic Society. (1994). Guideline thirteen: Guidelines for standard electrode placement nomenclature. Journal of Clinical Neurophysiology, 11, 111–113.
- Anglin JM (1989). Vocabulary growth and the knowing-learning distinction. Reading Canada, 7, 142–146.
- Astheimer LB, & Sanders LD (2009). Listeners modulate temporally selective attention during natural speech processing. Biological Psychology, 80, 23–34.
- Astheimer LB, & Sanders LD (2011). Temporally selective attention supports speech processing in 3- to 5-year-old children. Developmental Cognitive Neuroscience, 2(1), 120–128.
- Barutchu A, Crewther SG, Kiely P, Murphy MJ, & Crewther DP (2008). When /b/ill with /g/ill becomes /d/ill: Evidence for a lexical effect in audiovisual speech perception. European Journal of Cognitive Psychology, 20(1), 1–11.
- Bentin S, Mouchetant-Rostaing Y, Giard M-H, Echallier JF, & Pernier J (1999). ERP manifestations of processing printed words at different psycholinguistic levels: time course and scalp distribution. Journal of Cognitive Neuroscience, 11(3), 235–260.
- BioSemi. (2013). Active Electrodes. Retrieved from http://www.biosemi.com/active_electrode.htm
- Boersma P, & Weenink D (2011). Praat: doing phonetics by computer (Version 5.3) [Computer program]. Retrieved from http://www.praat.org
- Brancazio L (2004). Lexical influences in audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 30(3), 445–463.
- Brown L, Sherbenou RJ, & Johnsen SK (2010). Test of Nonverbal Intelligence (4th ed.). Austin, TX: Pro-Ed: An International Publisher.
- Buchwald AB, Winters SJ, & Pisoni DB (2009). Visual speech primes open-set recognition of spoken words. Language and Cognitive Processes, 24(4), 580–610.
- Calvert GA (2001). Crossmodal processing in the human brain: Insights from functional neuroimaging studies. Cerebral Cortex, 11, 1110–1123.
- Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, & Ghazanfar AA (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7).
- Coch D, George E, & Berger N (2008). The case of letter rhyming: An ERP study. Psychophysiology, 45, 949–956.
- Cohen MS (2008). Handedness Questionnaire. Retrieved from http://www.brainmapping.org/shared/Edinburgh.php#
- Conners KC (1997). Conners’ Rating Scales - Revised. North Tonawanda, NY: MHS.
- D’Arcy RCN, Connolly JF, Service E, Hawco CS, & Houlihan ME (2004). Separating phonological and semantic processing in auditory sentence processing: A high-resolution event-related brain potential study. Human Brain Mapping, 22, 40–51.
- Department of Education, U.S. (1996). Reading Literacy in the United States: Findings from the IEA Reading Literacy Study. Washington, D.C.: U.S. Government Printing Office.
- Dunn LM, & Dunn DM (2007). Peabody Picture Vocabulary Test (4th ed.). Pearson.
- Erickson LC, Zielinski BA, Zielinski JEV, Liu G, Turkeltaub PE, Leaver AM, & Rauschecker JP (2014). Distinct cortical locations for integration of audiovisual speech and the McGurk effect. Frontiers in Psychology, 5.
- Fenson L, Marchman V, Thal DJ, Dale PS, Reznick JS, & Bates E (2007). MacArthur-Bates Communicative Development Inventories (CDI) Words and Sentences. Brookes Publishing Co.
- Fernald A, Marchman V, & Weisleder A (2013). SES differences in language processing skill and vocabulary are evident at 18 months. Developmental Science, 16(2), 234–248.
- Field A (2013). Discovering statistics using SPSS (4th ed.). Washington, DC: Sage.
- Fort M, Kandel S, Chipot J, Savariaux C, Granjon L, & Spinelli E (2013). Seeing the initial articulatory gestures of a word triggers lexical access. Language and Cognitive Processes, 28(8), 1207–1223.
- Fort M, Spinelli E, Savariaux C, & Kandel S (2010). The word superiority effect in audiovisual speech perception. Speech Communication, 52, 525–532.
- Fort M, Spinelli E, Savariaux C, & Kandel S (2012). Audiovisual vowel monitoring and the word superiority effect in children. International Journal of Behavioral Development, 36(6), 457–467.
- Giedd JN, Blumenthal J, Jeffries NO, Castellanos FX, Liu H, Zijdenbos A, … Rapoport JL (1999). Brain development during childhood and adolescence: a longitudinal MRI study. Nature Neuroscience, 2(10), 861–863.
- Gogtay N, Giedd JN, Lusk L, Hayashi KM, Greenstein D, Vaituzis AC, … Thompson PM (2004). Dynamic mapping of human cortical development during childhood through early adulthood. Proceedings of the National Academy of Sciences, 101(21), 8174–8179.
- Grossi G, Coch D, Coffey-Corina S, Holcomb PJ, & Neville HJ (2001). Phonological processing in visual rhyming: A developmental ERP study. Journal of Cognitive Neuroscience, 13(5), 610–625.
- Havy M, Foroud A, Fais L, & Werker JF (2017). The role of auditory and visual speech in word learning at 18 months and in adulthood. Child Development, 88(6), 2043–2059.
- Hickok G, & Poeppel D (2004). Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition, 92, 67–99.
- Hillock-Dunn A, & Wallace MT (2012). Developmental changes in the multisensory temporal binding window persist into adolescence. Developmental Science, 15(5), 688–696. doi: 10.1111/j.1467-7687.2012.01171.x
- Jerger S, Damian MF, McAlpine RP, & Abdi H (2018). Visual speech fills in both discrimination and identification of non-intact auditory speech in children. Journal of Child Language, 45(2), 392–414.
- Jerger S, Damian MF, Spence MJ, Tye-Murray N, & Abdi H (2009). Developmental shifts in children’s sensitivity to visual speech: A new multimodal picture-word task. Journal of Experimental Child Psychology, 102, 40–59.
- Jerger S, Damian MF, Tye-Murray N, & Abdi H (2014). Children use visual speech to compensate for non-intact auditory speech. Journal of Experimental Child Psychology, 126, 295–312.
- Jerger S, Damian MF, Tye-Murray N, & Abdi H (in press). Children perceive speech onsets by ear and eye. Journal of Child Language. doi: 10.1017/S030500091500077X
- Kaganovich N, Schumaker J, & Rowland C (2016). Atypical audiovisual word processing in school-age children with a history of specific language impairment: an event-related potential study. Journal of Neurodevelopmental Disorders, 8(33). doi: 10.1186/s11689-016-9168-3
- Kutas M, & Federmeier KD (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62, 621–647.
- Lewkowicz DJ (2012). Development of multisensory temporal perception. In Murray MM & Wallace MT (Eds.), The Neural Bases of Multisensory Processes (pp. 325–344). New York: CRC Press.
- Luck SJ (2014). An Introduction to the Event-Related Potential Technique (2nd ed.). Cambridge, MA: The MIT Press.
- MacSweeney M, Goswami U, & Neville HJ (2013). The neurobiology of rhyme judgment by deaf and hearing adults: An ERP study. Journal of Cognitive Neuroscience, 25(7), 1037–1048.
- Malins JG, Desroches AS, Robertson EK, Newman RL, Archibald LMD, & Joanisse MF (2013). ERPs reveal the temporal dynamics of auditory word recognition in specific language impairment. Developmental Cognitive Neuroscience, 5, 134–148.
- McGurk H, & MacDonald J (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
- Metsala JL (1997). An examination of word frequency and neighborhood density in the development of spoken-word recognition. Memory and Cognition, 25(1), 47–56.
- Metting van Rijn AC, Kuiper AP, Dankers TE, & Grimbergen CA (1996). Low-cost active electrode improves the resolution in biopotential recordings. Paper presented at the 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam, The Netherlands.
- Metting van Rijn AC, Peper A, & Grimbergen CA (1990). High-quality recording of bioelectric events. Part 1: Interference reduction, theory and practice. Medical and Biological Engineering and Computing, 28, 389–397.
- Miceli G, Gainotti G, Caltagirone C, & Masullo C (1980). Some aspects of phonological impairment in aphasia. Brain and Language, 11, 159–169.
- Milberg W, Blumstein S, & Dworetzky B (1988). Phonological processing and lexical access in aphasia. Brain and Language, 34, 279–293.
- Mohan R, & Weber C (2015). Neural systems mediating processing of sound units of language distinguish recovery versus persistence in stuttering. Journal of Neurodevelopmental Disorders, 7(1). doi: 10.1186/s11689-015-9124-7
- Oldfield RC (1971). The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia, 9, 97–113.
- Panouilleres MTN, Boyles R, Chesters J, Watkins KE, & Mottonen R (2018). Facilitation of motor excitability during listening to spoken sentences is not modulated by noise or semantic coherence. Cortex, 103, 44–54.
- Peelle JE, & Sommers MS (2015). Prediction and constraint in audiovisual speech perception. Cortex, 68, 169–181. doi: 10.1016/j.cortex.2015.03.006
- Pflieger ME (2001). Theory of a spatial filter for removing ocular artifacts with preservation of EEG. Paper presented at the EMSE Workshop, Princeton University. http://www.sourcesignal.com/SpFilt_Ocular_Artifact.pdf
- Praamstra P, Meyer AS, & Levelt WJM (1994). Neurophysiological manifestations of phonological processing: Latency variation of a negative ERP component timelocked to phonological mismatch. Journal of Cognitive Neuroscience, 6(3), 204–219.
- Praamstra P, & Stegeman DF (1993). Phonological effects on the auditory N400 event-related brain potential. Cognitive Brain Research, 1, 73–86.
- Riedel P, Ragert P, Schelinski S, Kiebel SJ, & Von Kriegstein K (2015). Visual face-movement sensitive cortex is relevant for auditory-only speech recognition. Cortex, 68, 86–99.
- Ross LA, Molholm S, Blanco D, Gomez-Ramirez M, Saint-Amour D, & Foxe JJ (2011). The development of multisensory speech perception continues into the late childhood years. European Journal of Neuroscience, 33(12), 2329–2337.
- Rugg MD (1984a). Event-related potentials and the phonological processing of words and non-words. Neuropsychologia, 22(4), 435–443.
- Rugg MD (1984b). Event-related potentials in phonological matching tasks. Brain and Language, 23, 225–240.
- Rugg MD, & Allan K (2000). Event-related potential studies of memory. In Tulving E & Craik FIM (Eds.), The Oxford Handbook of Memory (pp. 521–537). New York, NY: Oxford University Press.
- Rugg MD, Brovedani P, & Doyle MC (1992). Modulation of event-related potentials (ERPs) by word repetition in a task with inconsistent mapping between repetition and response. Electroencephalography and Clinical Neurophysiology, 84, 521–531.
- Rugg MD, & Curran T (2007). Event-related potentials and recognition memory. Trends in Cognitive Sciences, 11(6), 251–257.
- Rugg MD, & Nagy ME (1987). Lexical contribution to nonword-repetition effects: Evidence from event-related potentials. Memory and Cognition, 15(6), 473–481.
- Rugg MD, & Nieto-Vegas M (1999). Modality-specific effects of immediate word repetition: Electrophysiological evidence. NeuroReport, 10, 2661–2664.
- Schelinski S, Riedel P, & Von Kriegstein K (2014). Visual abilities are important for auditory-only speech recognition: Evidence from autism spectrum disorder. Neuropsychologia, 65, 1–11.
- Schopler E, Van Bourgondien ME, Wellman GJ, & Love SR (2010). Childhood Autism Rating Scale (2nd ed.). Western Psychological Services.
- Semel E, Wiig EH, & Secord WA (2004). Clinical Evaluation of Language Fundamentals - Preschool-2 (2nd ed.). San Antonio, TX: Pearson Clinical Assessment.
- Skipper JI, Nusbaum HC, & Small SL (2005). Listening to talking faces: Motor cortical activation during speech perception. NeuroImage, 25, 76–89.
- Skipper JI, van Wassenhove V, Nusbaum HC, & Small SL (2007). Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17, 2387–2399.
- Stager CL, & Werker JF (1997). Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388, 381–382.
- Stevenson RA, Wallace MT, & Altieri N (2014). The interaction between stimulus factors and cognitive factors during multisensory integration of audiovisual speech. Frontiers in Psychology, 5. doi: 10.3389/fpsyg.2014.00352
- ten Oever S, Schroeder CE, Poeppel D, van Atteveldt N, & Zion-Golumbic E (2014). Rhythmicity and cross-modal temporal cues facilitate detection. Neuropsychologia, 63, 43–50.
- Tremblay C, Champoux F, Voss P, Bacon BA, Lepore F, & Théoret H (2007). Speech and non-speech audio-visual illusions: A developmental study. PLOS ONE, 2(8), e742. doi: 10.1371/journal.pone.0000742
- van Wassenhove V, Grant KW, & Poeppel D (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences, 102(4), 1181–1186.
- Weber-Fox C, Spruill JE, Spencer R, & Smith A (2008). Atypical neural functions underlying phonological processing and silent rehearsal in children who stutter. Developmental Science, 11(2), 321–337.
- Werker JF, Cohen LB, Lloyd VL, Casasola M, & Stager CL (1998). Acquisition of word-object associations by 14-month-old infants. Developmental Psychology, 34(6), 1289–1309.
- Werker JF, & Curtin S (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2), 197–234.
- Werker JF, Fennell CT, Corcoran KM, & Stager CL (2002). Infants’ ability to learn phonetically similar words: Effects of age and vocabulary size. Infancy, 3(1), 1–30.
- Wilson SM, Saygin AP, Sereno MI, & Iacoboni M (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7, 701–702.
- WMA Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. (1964).
- Zhu LL, & Beauchamp MS (2017). Mouth and voice: A relationship between visual and auditory preference in the human superior temporal sulcus. The Journal of Neuroscience, 37(10), 2697–2708.
- Ziegler JC, & Goswami UC (2005). Reading acquisition, developmental dyslexia and skilled reading across languages. Psychological Bulletin, 131, 3–29.