. Author manuscript; available in PMC: 2023 Dec 30.
Published in final edited form as: Multisens Res. 2022 Dec 30;36(1):57–74. doi: 10.1163/22134808-bja10087

The Impact of Singing on Visual and Multisensory Speech Perception in Children on the Autism Spectrum

Jacob I Feldman 1,2,*, Alexander Tu 3,4, Julie G Conrad 3,5, Wayne Kuang 3,6, Pooja Santapuram 3,7, Tiffany G Woynaroski 1,2,8,9
PMCID: PMC9924934  NIHMSID: NIHMS1863275  PMID: 36731528

Abstract

Autistic children show reduced multisensory integration of audiovisual speech stimuli in response to the McGurk illusion. Previously, it has been shown that adults can integrate sung McGurk tokens. These sung speech tokens offer more salient visual and auditory cues than spoken tokens, which may increase the identification and integration of visual speech cues in autistic children. Forty participants (20 autistic, 20 non-autistic peers) aged 7–14 completed the study. Participants were presented with speech tokens in four modalities: auditory-only, visual-only, congruent audiovisual, and incongruent audiovisual (i.e., McGurk; auditory ‘ba’ and visual ‘ga’). Tokens were also presented in two formats: spoken and sung. Participants indicated what they perceived via a four-button response box (i.e., ‘ba’, ‘ga’, ‘da’, or ‘tha’). Accuracies and perception of the McGurk illusion were calculated for each modality and format. Analysis of visual-only identification indicated a significant main effect of format, whereby participants were more accurate on sung than on spoken trials, but no significant main effect of group or interaction effect. Analysis of the McGurk trials indicated no significant main effect of format or group and no significant interaction effect. Sung speech tokens improved identification of visual speech cues but did not boost the integration of visual cues with heard speech across groups. Additional work is needed to determine which properties of sung speech contributed to the observed improvement in visual accuracy and to evaluate whether more prolonged exposure to sung speech may yield effects on multisensory integration.

Keywords: Multisensory integration, autism spectrum disorder, speech perception, McGurk, lip-reading, speech-reading, music

1. Introduction

Many autistic children present with differences in sensory perception; the DSM-5 (American Psychiatric Association, 2013) now includes these sensory differences as part of the diagnostic criteria for autism. Multisensory integration (i.e., the ability to process, combine, and respond to convergent input from multiple sensory modalities, such as vision and audition; Murray et al., 2016) is particularly important for speech perception, which has an auditory component (i.e., the voice) and a visual component (i.e., the face; Stevenson et al., 2014a).

Although differences in multisensory integration are the focus of much research discussed in this population (Feldman et al., 2018), particularly deficits with multisensory speech stimuli (Baum et al., 2015), there is also substantial evidence to suggest that individuals on the autism spectrum have difficulty with visual-only speech perception (i.e., lip-reading or speech-reading). Compared to non-autistic comparisons, autistic children have been shown to demonstrate worse visual-only speech comprehension at the sentence-level (Smith & Bennetto, 2007), the whole-word level (Foxe et al., 2015; Stevenson et al., 2018), the syllable level (de Gelder et al., 1991; Irwin et al., 2011; Woynaroski et al., 2013), and the phoneme-level (Stevenson et al., 2017). Smith and Bennetto (2007) found that individuals’ visual-only speech perception strongly predicted their multisensory speech perception, even when controlling for auditory-only speech perception and diagnostic group (i.e., autism or non-autism). Similarly, Iarocci et al. (2010) found that, after controlling for speech-reading accuracy, there were no between-group differences in audiovisual speech perception. These results suggest that the impairment in visual-only speech recognition contributes to or perhaps even accounts for difficulties with multisensory speech in individuals on the autism spectrum.

Researchers have commonly assessed visual-only speech recognition and the visual influence on audiovisual speech processing in tasks that employ the McGurk illusion. The McGurk illusion occurs when the pairing of incongruent visual and auditory speech tokens (e.g., auditory ‘ba’ and visual ‘ga’) induces a ‘fused’ percept (i.e., ‘da’ or ‘tha’; McGurk & MacDonald, 1976). During such tasks, researchers will often present unisensory tokens (i.e., auditory- and visual-only speech), congruent audiovisual tokens, and incongruent audiovisual (i.e., illusory) tokens. There is a great deal of variability in the literature regarding whether autistic individuals perceive fewer McGurk illusions than their non-autistic peers. Some researchers have found that autistic children perceive significantly fewer illusory percepts compared to non-autistic peers (Bebko et al., 2014; Feldman et al., 2022; Irwin et al., 2011; Stevenson et al., 2014b; Taylor et al., 2010), though that is not universally observed (Saalasti et al., 2011, 2012; Stevenson et al., 2018; Woynaroski et al., 2013). However, one consistent finding from this body of literature is that autistic individuals correctly identify significantly fewer visual-only speech tokens compared to their non-autistic peers, even when group differences in perception of the McGurk illusion are not observed (Irwin et al., 2011; Saalasti et al., 2012; Stevenson et al., 2018; Taylor et al., 2010; Woynaroski et al., 2013).

As the McGurk effect indexes the impact of vision on speech perception, properties of speech that are known to influence visual aspects of speech, such as the speaking rate of the speaker (Mefferd & Green, 2010; Smith & Kleinow, 2000) or the degree to which the speaker opens his/her mouth (Munhall et al., 1996), are of great interest, as they may influence perception of the McGurk illusion. For example, it has been found that ‘fast’ speech results in decreased fusion while ‘normal’ or ‘clear’ speech results in greater fusion, possibly due to the changes in articulatory movement in those formats (Fixmer & Hawkins, 1998; Munhall et al., 1996).

Previous research has indicated that articulatory movement differs between sung and spoken speech, such that sung speech often results in larger articulator movements (see Massaro & Jesse, 2009 for a review). Quinto et al. (2010) were able to induce the McGurk effect in college students using incongruent sung audiovisual tokens (i.e., sung auditory ‘ba’ and sung visual ‘ga’). Although the integration of visual and auditory aspects of musical performances is believed to be automatic (Thompson et al., 2008), which may facilitate audiovisual integration, fusion rates in their sample did not differ according to format, perhaps because participants reported very high rates of fusion on the McGurk task in response to both the sung tokens and the spoken tokens. It may therefore be most appropriate to evaluate the hypothesis that sung McGurk stimuli facilitate increased audiovisual integration compared to spoken McGurk stimuli in children: because children tend to perceive the McGurk illusion in spoken stimuli at a lower rate than adults (e.g., Hillock-Dunn & Wallace, 2012; McGurk & MacDonald, 1976; Tremblay et al., 2007), a ceiling effect for McGurk fusion would be unlikely. The facilitatory effect of singing may be even more apparent in individuals on the autism spectrum, who have long been noted as having a particular fascination with or inclination toward music (Kanner, 1943) and can present with advanced musical skills (e.g., musical savants; Heaton et al., 1998). However, to our knowledge no studies have compared sung versus spoken McGurk stimuli in autistic or non-autistic children. Additionally, no study has assessed differences in identifying natural visual-only sung versus spoken tokens (but see Hidalgo-Barnes & Massaro, 2007).

The purpose of the present study was to explore differences in visual influence on audiovisual speech perception for sung versus spoken stimuli in autistic children and well-matched non-autistic comparisons. We addressed the following research questions:

  1. Do autistic children differ from non-autistic peers in their identification of visual-only speech syllables for (a) spoken versus (b) sung stimuli?

  2. Do autistic children differ from non-autistic peers in their perception of the McGurk effect (i.e., fusion due to visual influence on audiovisual speech) for (a) spoken versus (b) sung stimuli?

2. Methods

2.1. Participants

Forty participants (20 autism, 20 non-autism) were included in the final sample. Participants were recruited from a larger study of multisensory integration in autistic children (i.e., sample partially overlaps with that of Feldman et al., 2020) and were well matched on age and sex (p values = 0.71 and 1, respectively; see Table 1). Though groups did not significantly differ on the basis of race or ethnicity (p values = 0.15 and 0.34, respectively), they did differ on nonverbal IQ (p < 0.001), as assessed by the Leiter International Performance Scale, third edition (Leiter-3; Roid et al., 2013).

Table 1.

Selected demographic variables by group.

              Autism (n = 20)               Non-autism (n = 20)
              M (SD)                        M (SD)
Age (years)   9.71 (2.37)                   9.98 (2.25)
Nonverbal IQ  103.5 (12.8)                  119.2 (9.3)
              n                             n
Sex           13 male                       13 male
              7 female                      7 female
Race          4 Black or African American   1 Black or African American
              16 White                      17 White
                                            2 Multiple
Ethnicity     1 Hispanic or Latino          4 Hispanic or Latino
              19 Not Hispanic or Latino     16 Not Hispanic or Latino

Note. Groups did not differ in age (p = 0.71), sex (p = 1), race (p = 0.15), or ethnicity (p = 0.34). Groups did differ in nonverbal IQ (as measured by the Leiter-3; Roid et al., 2013), p < 0.001.

General eligibility criteria for inclusion in the study were as follows: (a) chronological age between 7 and 14 years; (b) normal or corrected-to-normal vision and normal hearing, as confirmed by screening at entry to the study; and (c) susceptibility to the McGurk effect, as indexed by at least one reported illusory percept on stimuli in this study (see subsection 2.2. Stimuli) or in other psychophysical tasks collected as part of the larger study. Two additional participants (one autistic, one non-autistic) were recruited but subsequently excluded because they did not report any fusion in response to McGurk stimuli. Additional eligibility criteria for autistic children included: (a) diagnosis of autism spectrum disorder as confirmed by research-reliable administrations of the Autism Diagnostic Observation Schedule, second edition (ADOS-2; Lord et al., 2012) and clinical judgment of a licensed clinician; (b) no history of seizure disorders; and (c) no diagnosed genetic disorders, such as Fragile X or tuberous sclerosis. Additional eligibility criteria for non-autistic participants included: (a) parent report of autistic features below the suggested screening threshold on the Lifetime Version of the Social Communication Questionnaire (SCQ; Rutter et al., 2003); (b) no immediate family members with diagnoses of autism; (c) average nonverbal intelligence, as assessed by the Leiter-3 (Roid et al., 2013); (d) no history of neurological conditions or seizure disorders; and (e) no prior history or present indicators of psychiatric conditions and/or learning disorders.

Recruitment and study procedures were conducted with the approval of the Vanderbilt University Institutional Review Board. Parents provided written or verbal informed consent, and participants provided written or verbal assent prior to participation in the study. All participants were compensated for their participation.

2.2. Stimuli

The sung and spoken audiovisual stimuli were obtained from Quinto et al. (2010). Similar to the original stimuli reported by McGurk and MacDonald (1976), these stimuli featured three repetitions of the target syllable spoken or sung by a female speaker with neutral affect. Some previous studies in the literature (e.g., Stevenson et al., 2014b; Stevenson et al., 2018) have assessed susceptibility to the McGurk effect using an abbreviated version of the spoken audiovisual stimuli that featured only one repetition of each syllable.

The selected syllables were presented in four modalities: auditory-only (i.e., listening to ‘ba’ or ‘ga’ with a blank screen), visual-only (i.e., lip-reading the syllables ‘ba’ or ‘ga’), congruent audiovisual (i.e., watching matched visual and auditory ‘ba’ or ‘ga’), and incongruent audiovisual (i.e., visual ‘ga’ dubbed with auditory ‘ba’). For auditory- and visual-only stimuli, the original audiovisual stimuli were edited to remove the video feed and the audio feed, respectively, using Sony Vegas 13 (Sony Creative Software Inc., Middleton, WI, USA).

The selected syllables were also presented in two presentation formats (i.e., sung or spoken syllables). The spoken syllables were recorded with the speaker directed to talk in a normal voice (Quinto et al., 2010). The sung syllables were sung in an ascending major chord progression to add musical qualities such as pitch and tonality.

A comparison of the visual properties of the two stimulus sets in Adobe Photoshop CC 2017 (Adobe Inc., San Jose, CA, USA) revealed that the speaker opened her mouth significantly more in the sung presentation format, F(1,4) = 157.9, p < 0.001, indicating that the sung stimuli had greater visual salience (see Supplementary Material for more information on stimulus property extraction). A comparison of the auditory properties of the stimuli in Praat (Boersma, 2002) revealed that the speaker’s loudness did not significantly differ between presentation formats, F(1,4) = 0.01, p = 0.91. The speaker did, however, articulate significantly faster in the spoken format than in the sung format, F(1,4) = 60.14, p = 0.001, ΔM = 74 ms.

2.3. Procedure

Experimental tasks were completed in a sound- and light-attenuated room (WhisperRoom Inc., Morristown, TN, USA). Visual stimuli were presented on a Samsung SyncMaster 2233RZ 22-inch PC monitor (Samsung Electronics America, Inc., Ridgefield Park, NJ, USA). Auditory stimuli were presented binaurally via Sennheiser HD550 series supra-aural headphones (Sennheiser Electronic GmbH & Co. KG, Wedemark, Germany). Stimulus presentation was managed by E-Prime software, and responses were recorded via a response box providing a forced choice between four CV syllables (i.e., ‘ba’, ‘ga’, ‘da’, ‘tha’).

Prior to the experiment, participants were instructed to look at the woman’s face and listen to her voice. Participants then completed a brief practice session wherein they were prompted to push each button in random order in response to written prompts on the computer monitor that were read aloud by a trained research assistant; the practice session was terminated once participants had correctly pressed all four buttons. During the experiment, the research assistant monitored the participant to ensure compliance with and attention to the task. Planned breaks (i.e., nine breaks presented at equal intervals) and verbal praise from the research assistant were used to maintain participants’ compliance. If necessary, verbal and/or gestural prompts were used to redirect participants back to the task.

Trials were presented in random order. Each trial began with a fixation cross whose duration was randomized between 600 ms and 1200 ms. After each trial, participants were prompted with a response screen asking them to report their perception from four possible options (i.e., ‘ba’, ‘ga’, ‘da’, ‘tha’). Two trial types, incongruent audiovisual (McGurk) and visual-only ‘ga’, were presented twenty times in each presentation format (i.e., sung and spoken); all other trial types (i.e., congruent audiovisual ‘ba’, congruent audiovisual ‘ga’, auditory-only ‘ba’, auditory-only ‘ga’, visual-only ‘ba’) were presented ten times in each presentation format in the interest of reducing participant fatigue. Thus, the experiment comprised 180 trials and took participants approximately 20 minutes to complete.
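As a quick check on the trial arithmetic above, the counts can be tallied in a few lines (a sketch; the trial-type labels are ours):

```python
# Trials per presentation format for each trial type: the McGurk and
# visual-only 'ga' types appear 20 times per format; the rest, 10 times.
trials_per_format = {
    "incongruent AV (McGurk)": 20,
    "visual-only 'ga'": 20,
    "congruent AV 'ba'": 10,
    "congruent AV 'ga'": 10,
    "auditory-only 'ba'": 10,
    "auditory-only 'ga'": 10,
    "visual-only 'ba'": 10,
}
n_formats = 2  # sung, spoken
total_trials = sum(trials_per_format.values()) * n_formats
print(total_trials)  # 180
```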

2.4. Analytic Plan

The task output was processed in MATLAB 2017a (The MathWorks Inc., Natick, MA, USA) to obtain metrics of interest. For the unisensory (i.e., auditory- and visual-only) and matched audiovisual modalities, percent identification accuracy was calculated for each participant in each format. The rate of reported McGurk illusion was calculated in each presentation format as the proportion of ‘da’ and ‘tha’ responses to incongruent audiovisual stimuli.
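A minimal sketch of these two computations (in Python rather than the original MATLAB; the response lists are hypothetical):

```python
def percent_accuracy(responses, target):
    """Percent of responses that match the target syllable."""
    return 100.0 * sum(r == target for r in responses) / len(responses)

def fusion_rate(responses):
    """Proportion of fused ('da' or 'tha') responses to McGurk trials."""
    return sum(r in ("da", "tha") for r in responses) / len(responses)

# Hypothetical responses to four visual-only 'ga' trials
print(percent_accuracy(["ga", "ba", "ga", "da"], target="ga"))  # 50.0
# Hypothetical responses to four incongruent (McGurk) trials
print(fusion_rate(["da", "ba", "tha", "da"]))  # 0.75
```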

All analyses were conducted in SPSS 28 (IBM Corp, Armonk, NY, USA). To provide an initial exploration of the accuracy data, a 3 (modality: auditory-only accuracy, visual-only accuracy, matched audiovisual accuracy) × 2 (presentation format: sung, spoken) × 2 (group: autism, non-autism) mixed-model analysis of variance (ANOVA) was conducted. To evaluate the presence of McGurk fusion across and within groups, we conducted one-sample t-tests to determine whether the rate of McGurk fusion significantly differed from 0. To answer the two research questions, separate 2 (format) × 2 (group) mixed-model ANOVAs were conducted on visual-only accuracy and on the rate of reported McGurk illusion. Significant main effects and interactions were probed via pairwise comparisons of estimated marginal means, with Bonferroni-corrected p values to adjust for multiple comparisons.
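For a balanced design with one two-level within-subject factor and one two-level between-subjects factor, a 2 (format) × 2 (group) mixed-model ANOVA decomposes into three t-tests, which is a useful way to see what each F statistic tests. The sketch below uses synthetic data, not the study data or the authors' SPSS syntax:

```python
import numpy as np
from scipy import stats

# Synthetic visual-only accuracies (%), 20 subjects per group;
# columns are (sung, spoken). All values are made up.
rng = np.random.default_rng(1)
aut = rng.normal(loc=[50, 46], scale=15, size=(20, 2))
non = rng.normal(loc=[57, 48], scale=15, size=(20, 2))
n1, n2 = len(aut), len(non)

# Between-subjects effect (group): independent t-test on subject means.
t_group, p_group = stats.ttest_ind(aut.mean(axis=1), non.mean(axis=1))

# Within-subjects effect (format): t-test on sung-minus-spoken difference
# scores, using the within-group pooled variance as the error term.
d1, d2 = aut[:, 0] - aut[:, 1], non[:, 0] - non[:, 1]
pooled_var = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / (n1 + n2 - 2)
grand_d = (d1.mean() + d2.mean()) / 2  # unweighted mean (groups balanced)
t_format = grand_d / np.sqrt(pooled_var / (n1 + n2))

# Group x format interaction: independent t-test on the difference scores.
t_int, p_int = stats.ttest_ind(d1, d2)

# Each F (1 numerator df) is the corresponding squared t value.
print(f"F_group = {t_group**2:.2f}, F_format = {t_format**2:.2f}, "
      f"F_interaction = {t_int**2:.2f}")
```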

3. Results

Means and standard deviations for all trial types by group are reported in Table 2.

Table 2.

Response rates by group.

                  Autism (n = 20)                Non-autism (n = 20)
Metric            Msung (SD)      Mspoken (SD)   Msung (SD)      Mspoken (SD)
1. A Accuracy     93.8% (7.9%)    91.5% (15.3%)  93.8% (11.6%)   94.3% (10.7%)
2. V Accuracy     50.2% (14.6%)   45.8% (21.1%)  56.7% (17.1%)   47.7% (13.2%)
3. AV Accuracy    96.8% (4.1%)    90.8% (14.1%)  93.8% (10.6%)   95.3% (9.9%)
4. McGurk Fusion  27.0% (34.0%)   29.3% (39.1%)  44.8% (38.8%)   41.5% (35.0%)

Note. A = auditory-only stimuli, V = visual-only stimuli, AV = matching audiovisual stimuli, McGurk Fusion = rate of reported McGurk illusion (mismatched audiovisual stimuli; i.e., proportion of “da” and “tha” responses).

3.1. Accuracy Results

Prior to running the 3 (modality) × 2 (format) × 2 (group) mixed-model ANOVA, Mauchly’s Test of Sphericity indicated that the within-subjects factor modality violated the assumption of sphericity, W = 0.301, p < 0.001; thus, the degrees of freedom for the main effect of modality and any interaction effects involving modality were adjusted using a Greenhouse–Geisser correction.
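For readers unfamiliar with the correction, the Greenhouse–Geisser epsilon can be estimated from the double-centered covariance matrix of the repeated measures and then multiplied into the degrees of freedom. A minimal numpy sketch on synthetic data (not the SPSS implementation):

```python
import numpy as np

def gg_epsilon(data):
    """Greenhouse-Geisser epsilon for an (n subjects x k conditions) array:
    eps = tr(S)^2 / ((k - 1) * sum(S**2)), with S the double-centered
    covariance matrix of the k repeated measures."""
    k = data.shape[1]
    S = np.cov(data, rowvar=False)
    S = S - S.mean(axis=0, keepdims=True) - S.mean(axis=1, keepdims=True) + S.mean()
    return np.trace(S) ** 2 / ((k - 1) * np.sum(S ** 2))

# Synthetic accuracies for 40 subjects x 3 modalities (A, V, AV)
rng = np.random.default_rng(0)
scores = rng.normal(loc=[93, 50, 94], scale=[10, 15, 10], size=(40, 3))
eps = gg_epsilon(scores)
# eps is bounded by 1/(k-1) <= eps <= 1; the corrected dfs are
# eps*(k-1) and eps*(k-1)*(n-1) for the numerator and denominator.
print(round(float(eps), 3))
```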

There was a significant main effect of modality, F(1,44) = 251.0, p < 0.001, ηp² = 0.87 (see Fig. 1). Across groups and formats, participants were less accurate in the visual-only modality than in the audiovisual and auditory-only modalities (Mdifference = 42.9% and 44.0%, respectively, p values < 0.001). There was no difference between auditory-only and audiovisual accuracies (Mdifference = 0.1%, p = 0.76). There was a significant main effect of presentation format, F(1,38) = 5.83, p = 0.021, ηp² = 0.13: across groups and modalities, participants performed significantly better in the sung presentation format (M = 80.8%) than in the spoken presentation format (M = 77.5%). There was no main effect of group, F(1,38) = 0.15, p = 0.70, ηp² < 0.01.

Figure 1. Boxplot of percent accuracy by modality and group in the spoken and sung presentation formats. A, auditory-only stimuli; V, visual-only stimuli; AV, matched audiovisual stimuli.

There was a marginal presentation format by modality interaction, F(1,57) = 2.78, p = 0.084, ηp² = 0.07. The difference between the sung and spoken formats was only significant for visual-only accuracy, Mdifference = 7.4%, p = 0.001. Neither of the interactions with group was significant (F(1,38) = 0.49, p = 0.83, ηp² < 0.01 for presentation format by group; F(1,44) = 0.41, p = 0.59, ηp² = 0.01 for modality by group). There was a marginally significant format by modality by group interaction, F(1,57) = 2.77, p = 0.085, ηp² = 0.07. The previously reported difference between the sung and spoken presentation formats in the visual-only modality was only significant in the non-autistic group, Mdifference = 10.4%, p = 0.001; in the autistic group, there were no significant differences in accuracy between the sung and spoken presentation formats in any modality.

3.1.1. Visual-only accuracy

The 2 (format) × 2 (group) mixed-model ANOVA for the visual-only accuracy data revealed a significant main effect of presentation format, F(1,38) = 12.7, p = 0.001, ηp² = 0.25. Across groups, participants were more accurate during sung visual-only trials (M = 53.5%, SD = 16.0%) than during spoken visual-only trials (M = 46.1%, SD = 18.1%). The main effect of group, F(1,38) = 0.52, p = 0.47, ηp² = 0.01, and the interaction, F(1,38) = 2.40, p = 0.24, ηp² = 0.05, were not significant.

3.2. Rate of Reported McGurk Illusion

To evaluate whether McGurk fusion was present, on average, across and within groups, we conducted one-sample t-tests comparing the rate of McGurk fusion to 0. Across groups, McGurk fusion significantly differed from 0 in the spoken (t = 6.03, p < 0.001) and sung (t = 6.19, p < 0.001) presentation formats. Similar results were observed within groups (ts = 3.35 and 3.55 for the autistic group and 5.31 and 5.29 for the non-autistic group in the spoken and sung presentation formats, respectively).
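These one-sample tests are straightforward to reproduce; a sketch with hypothetical fusion rates (not the study data):

```python
import numpy as np
from scipy import stats

# Hypothetical McGurk fusion rates for 20 participants in one format
# (proportion of 'da'/'tha' responses to incongruent trials)
fusion = np.array([0.00, 0.05, 0.10, 0.40, 0.75, 0.20, 0.00, 0.90,
                   0.35, 0.15, 0.60, 0.00, 0.25, 0.50, 0.10, 0.05,
                   0.80, 0.30, 0.45, 0.20])
# One-sample t-test of whether mean fusion differs from 0
t, p = stats.ttest_1samp(fusion, popmean=0.0)
print(f"t = {t:.2f}, p = {p:.5f}")
```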

The 2 (format) × 2 (group) mixed-model ANOVA for the rate of reported McGurk illusion indicated no significant main effect of presentation format, F(1,38) = 0.11, p = 0.75, ηp² < 0.01 (see Fig. 2). There was also no main effect of group, F(1,38) = 1.80, p = 0.19, ηp² = 0.05. Although non-autistic participants did not perceive the McGurk illusion at a significantly higher rate than autistic participants across formats, it is notable that the difference constitutes a small-magnitude effect (Cohen, 1992), and we were underpowered to detect an effect of this magnitude. The interaction term in the ANOVA was also not significant, F(1,38) = 1.71, p = 0.20, ηp² = 0.04.

Figure 2. Boxplot of rate of McGurk fusion by group in the spoken and sung presentation formats. McGurk fusion, rate of reported McGurk illusion (mismatched audiovisual stimuli; i.e., proportion of “da” and “tha” responses).

3.3. Post-Hoc Tests Covarying for Nonverbal IQ

Because groups differed on nonverbal IQ, a series of post-hoc analyses of covariance (ANCOVAs) were run with nonverbal IQ included as a covariate. The marginally significant format by modality by group interaction in the 3 (modality) × 2 (format) × 2 (group) mixed-model ANOVA was not robust to controlling for nonverbal IQ. None of the aforementioned null effects of group became significant when covarying for nonverbal IQ.

4. Discussion

The purpose of this study was to explore the impact of sung versus spoken speech on visual-only and multisensory speech perception in autistic children and non-autistic peers. Results indicated that, across groups and modalities, sung speech significantly aided speech perception relative to spoken speech, particularly visual-only speech perception. However, marginal interactions suggested that sung tokens aided only visual-only speech accuracy in the non-autistic group and audiovisual speech accuracy in the autistic group. Consistent with the extant literature, the enhanced visual cues provided by the sung stimuli likely aided speech perception in this sample. However, further research is needed to determine whether this effect is exclusive to singing and/or music or whether increased articulatory movement alone can improve speech perception.

However, unlike previous reports in the literature, there was no difference between autistic children and non-autistic peers in their visual-only speech perception. It is unclear whether this discrepancy is due to methodological factors (i.e., closed versus open response sets; nonverbal versus verbal responding), the age of the sample, or some other factor. Meta-analytic research on lip-/speech-reading abilities in autistic children is needed to quantify the overall group difference in visual speech perception reported in the literature and to elucidate factors that may explain variance between studies.

This study is the first to demonstrate that autistic and non-autistic children can perceive the McGurk illusion in response to sung tokens, extending the work of Quinto and colleagues (2010) on the McGurk effect. This result also adds to a growing body of literature on audiovisual integration in autistic children. Although there is considerable variability in this literature, the effect size for the between-group difference we observed in perception of the McGurk illusion is within the prediction interval of recent meta-analyses on the subject (Feldman et al., 2018; Zhang et al., 2019), indicating that these results are consistent with the extant literature base despite the absence of a statistically significant group difference.

Although sung speech tokens did improve performance when averaging across all trial types and groups, it is notable that, contrary to our hypothesis, sung McGurk tokens did not improve perception of the McGurk effect in children on the autism spectrum. However, children in both groups perceived the McGurk effect at very low rates. Though the rate of McGurk fusion across and within groups was significantly greater than 0, we may have observed floor effects in our sample, thus limiting our ability to investigate whether sung stimuli can facilitate increased audiovisual integration compared to spoken stimuli. Only one other study in the extant literature comparing autistic children to non-autistic comparisons reported that autistic children perceived the McGurk effect less than 30% of the time (i.e., Iarocci et al., 2010), and only one other such study has reported a mean rate of fusion across groups of less than 35% (i.e., Woynaroski et al., 2013).

It is unclear why differences in the McGurk effect are not consistently observed in individuals on the autism spectrum, though there is significant variability within the general population (Basu Mallick et al., 2015). Meta-analytic research has found that much of the heterogeneity between studies can be explained by the procedures used in each study, such as how studies operationalize susceptibility to the McGurk illusion (e.g., as percent of illusory responses versus the percent of non-auditory responses) and the directions given to participants (Zhang et al., 2019). Other factors that could partially explain variance in the literature include participant factors (e.g., age; Feldman et al., 2018; McGurk & MacDonald, 1976; Sekiyama et al., 2014; Tremblay et al., 2007), stimulus properties (e.g., the speaker; Basu Mallick et al., 2015; Jiang & Bernstein, 2011; Magnotti et al., 2018a), and instructions provided to the participant (e.g., open versus closed-response sets; de Gelder et al., 1991; Iarocci et al., 2010). One potential explanation for our sample’s low rate of fusion is that our stimuli comprised three repetitions of each syllable per trial. It is possible that children looked away from the screen after the first presentation of the syllable (e.g., to select a response) and only heard the auditory token on subsequent repetitions. Thus, future studies should incorporate eye-tracking technology, which would allow researchers to only use the trials wherein the child looked at the screen, to assess this possibility. It is also possible that the long experimental protocol may have influenced perception of the McGurk illusion due to fatigue or due to repeated exposure to the same speaker (see Magnotti et al., 2018b).

It is also possible that our McGurk results were limited by the syllables used in the stimuli. Although many studies have used the auditory ‘ba’ and visual ‘ga’ combination in the past, results of a recent stability study indicate that the McGurk effect using ‘ba’/’ga’ syllables in autistic children is less stable and requires several observations to achieve a stable estimate (Dunham et al., 2020). Although the stability of the McGurk effect in response to this specific speaker is not known, unstable estimates increase the likelihood of Type II error (Thompson & Vacha-Haase, 2000; Yoder et al., 2018). Results from Dunham et al. (2020) indicate that McGurk fusion may be more reliably measured in children on the autism spectrum when the task uses auditory ‘pa’ and visual ‘ka’ syllables. Future work assessing sung versus spoken McGurk syllables should consider whether different syllables may differently influence the validity and/or stability of results.

To our knowledge, this is the first study to demonstrate that sung speech tokens enhance visual-only speech perception compared to spoken speech tokens. Again, it is unclear whether this enhancement was due to stimulus properties, such as the greater mouth opening or longer duration of the sung stimuli, or to some factor related to the musical qualities of the sung speech, so future research should assess differences in visual-only perception of sung versus overly articulated speech (e.g., ‘clear’ speech; Fixmer & Hawkins, 1998). It is also unclear why this finding did not hold when controlling for nonverbal IQ; future studies should utilize larger samples in order to evaluate how nonverbal IQ impacts speech perception in the context of sung versus spoken formats. Additionally, future research should consider whether visual-only speech perception is also augmented by sung speech at the whole-word and sentence levels, where children on the autism spectrum have also been found to differ from non-autistic peers (Foxe et al., 2015; Smith & Bennetto, 2007; Stevenson et al., 2018). This finding should also be explored in adults and children who are deaf and hard of hearing, who often use visual speech information to access spoken language and to supplement degraded auditory signals from cochlear implants and hearing aids (Arnold, 1997).

In addition to our finding that sung speech tokens increase identification accuracy, a large body of literature has shown that musical expertise enhances audiovisual multisensory integration in typically developed adults (e.g., Jicol et al., 2018; Lee & Noppeney, 2011; Petrini et al., 2009, 2011). Though multisensory interventions for autistic children have been frequently proposed, as of yet there is very limited evidence that they can improve multisensory integration abilities (Feldman et al., 2020b; Irwin et al., 2015; Williams et al., 2004). Musical interventions have also been frequently proposed and explored with autistic children (Simpson & Keen, 2011); both types of intervention share a common goal of improving long-term outcomes, such as language and communication abilities, in children on the autism spectrum (Cascio et al., 2016; Kaplan & Steele, 2005; Sandbank et al., 2020). Given that the musical stimuli alone improved identification accuracy across modalities, it is possible that gaining musical expertise may further facilitate multisensory and unisensory speech perception in this population. However, to our knowledge no study has assessed whether musical training or interventions may improve language or communication outcomes via improved sensory or multisensory function in this clinical population. Thus, this represents an interesting avenue for future research.

4.1. Limitations

There are several limitations to this work that must be considered when interpreting our results. First, our sample size was relatively small; future studies should attempt to replicate these findings with larger samples. Second, we did not collect any information about musical experience, such as exposure to or participation in music therapy, proficiency with musical instruments, or participation in music classes or extracurricular activities. Given that musical proficiency influences audiovisual integration (e.g., Jicol et al., 2018; Lee & Noppeney, 2011; Petrini et al., 2009, 2011), data on musical knowledge or proficiency may explain some of the variance in perception of the McGurk effect.

Supplementary Material

Supplemental Material

Acknowledgements

This work was supported by NIH/NIDCD R21 DC016144, NIH/CTSA award No. KL2TR000446, NIH/NCATS TL1TR002244, the Vanderbilt Kennedy Center (NIH/NICHD U54HD083211/P50HD103537), the Vanderbilt Undergraduate Summer Research Program, and Vanderbilt Institute for Clinical and Translational Research award V19409. Results from this manuscript were previously presented at the 2017 International Multisensory Research Forum and the 2017 Gatlinburg Conference on Intellectual and Developmental Disabilities. The authors would like to thank David Simon for his assistance coding the experiment.

Footnotes

1.

There is currently some debate as to whether researchers should use person-first language (e.g., children with autism) or identity-first language (e.g., autistic children; see Robison, 2019). In this paper, we will use the terms ‘children on the autism spectrum’ and ‘autistic children’ interchangeably to align with the preferences of the community and current recommendations for researchers (Bottema-Beutel et al., 2021; Bury et al., 2020).

References

  1. American Psychiatric Association (2013). Diagnostic and Statistical Manual of Mental Disorders, 5th edn.
  2. Arnold P (1997). The structure and optimization of speechreading, J. Deaf Stud. Deaf Educ 2, 199–211. 10.1093/oxfordjournals.deafed.a014326
  3. Basu Mallick D, Magnotti JF and Beauchamp MS (2015). Variability and stability in the McGurk effect: contributions of participants, stimuli, time, and response type, Psychon. Bull. Rev 22, 1299–1307. 10.3758/s13423-015-0817-4
  4. Baum SH, Stevenson RA and Wallace MT (2015). Behavioral, perceptual, and neural alterations in sensory and multisensory function in autism spectrum disorder, Prog. Neurobiol 134, 140–160. 10.1016/j.pneurobio.2015.09.007
  5. Bebko JM, Schroeder JH and Weiss JA (2014). The McGurk effect in children with autism and Asperger syndrome, Autism Res. 7, 50–59. 10.1002/aur.1343
  6. Boersma P (2002). Praat, a system for doing phonetics by computer, Glot Int. 5, 341–345.
  7. Bottema-Beutel K, Kapp SK, Lester JN, Sasson NJ and Hand BN (2021). Avoiding ableist language: suggestions for autism researchers, Autism Adulthood 3, 18–29. 10.1089/aut.2020.0014
  8. Bury SM, Jellett R, Spoor JR and Hedley D (2020). “It defines who I am” or “It’s something I have”: What language do [autistic] Australian adults [on the autism spectrum] prefer? J. Autism Dev. Disord 10.1007/s10803-020-04425-3
  9. Cascio CJ, Woynaroski T, Baranek GT and Wallace MT (2016). Toward an interdisciplinary approach to understanding sensory function in autism spectrum disorder, Autism Res. 9, 920–925. 10.1002/aur.1612
  10. Cohen J (1992). A power primer, Psychol. Bull 112, 155–159. 10.1037/0033-2909.112.1.155
  11. de Gelder B, Vroomen J and van der Heide L (1991). Face recognition and lip-reading in autism, Eur. J. Cogn. Psychol 3, 69–86. 10.1080/09541449108406220
  12. Dunham K, Feldman JI, Liu Y, Cassidy M, Conrad JG, Santapuram P, Suzman E, Tu A, Butera I, Simon DM, Broderick N, Wallace MT, Lewkowicz D and Woynaroski TG (2020). Stability of variables derived from measures of multisensory function in children with autism spectrum disorder, Am. J. Intellect. Dev. Disabil 125, 287–303. 10.1352/1944-7558-125.4.287
  13. Feldman JI, Dunham K, Cassidy M, Wallace MT, Liu Y and Woynaroski TG (2018). Audiovisual multisensory integration in individuals with autism spectrum disorder: A systematic review and meta-analysis, Neurosci. Biobehav. Rev 95, 220–234. 10.1016/j.neubiorev.2018.09.020
  14. Feldman JI, Cassidy M, Liu Y, Kirby AV, Wallace MT and Woynaroski TG (2020a). Relations between sensory responsiveness and features of autism in children, Brain Sci. 10, 775. 10.3390/brainsci10110775
  15. Feldman JI, Dunham K, Conrad JG, Simon DM, Cassidy M, Liu Y, Tu A, Broderick N, Wallace MT and Woynaroski TG (2020b). Plasticity of temporal binding in children with autism spectrum disorder: A single case experimental design perceptual training study, Res. Autism Spectrum Disord 74, 101555. 10.1016/j.rasd.2020.101555
  16. Feldman JI, Conrad JG, Kuang W, Tu A, Liu Y, Simon DM, Wallace MT and Woynaroski TG (2022). Relations between the McGurk effect, social and communication skill, and autistic features in children with and without autism, J. Autism Dev. Disord 52, 1920–1928. 10.1007/s10803-021-05074-w
  17. Fixmer E and Hawkins S (1998). The influence of quality of information on the McGurk effect, in Proc. AVSP’98, Terrigal, Sydney, Australia, pp. 27–32.
  18. Foxe JJ, Molholm S, Del Bene VA, Frey H-P, Russo NN, Blanco D, Saint-Amour D and Ross LA (2015). Severe multisensory speech integration deficits in high-functioning school-aged children with autism spectrum disorder (ASD) and their resolution during early adolescence, Cereb. Cortex 25, 298–312. 10.1093/cercor/bht213
  19. Heaton P, Hermelin B and Pring L (1998). Autism and pitch processing: a precursor for savant musical ability? Music Percept. 15, 291–305. 10.2307/40285769
  20. Hidalgo-Barnes M and Massaro DW (2007). Read my lips: An animated face helps communicate musical lyrics, Psychomusicology 19, 3–12. 10.1037/h0094037
  21. Hillock-Dunn A and Wallace MT (2012). Developmental changes in the multisensory temporal binding window persist into adolescence, Dev. Sci 15, 688–696. 10.1111/j.1467-7687.2012.01171.x
  22. Iarocci G, Rombough A, Yager J, Weeks DJ and Chua R (2010). Visual influences on speech perception in children with autism. Autism 14, 305–320. 10.1177/1362361309353615
  23. Irwin JR, Tornatore LA, Brancazio L and Whalen D (2011). Can children with autism spectrum disorders “hear” a speaking face? Child Dev. 82, 1397–1403. 10.1111/j.1467-8624.2011.01619.x
  24. Irwin J, Preston J, Brancazio L, D’angelo M and Turcios J (2015). Development of an audiovisual speech perception app for children with autism spectrum disorders, Clin. Linguist. Phon 29, 76–83. 10.3109/02699206.2014.966395
  25. Jiang J and Bernstein LE (2011). Psychophysics of the McGurk and other audiovisual speech integration effects, J. Exp. Psychol. Hum. Percept. Perform 37, 1193–1209. 10.1037/a0023100
  26. Jicol C, Proulx MJ, Pollick FE and Petrini K (2018). Long-term music training modulates the recalibration of audiovisual simultaneity, Exp. Brain Res 236, 1869–1880. 10.1007/s00221-018-5269-4
  27. Kanner L (1943). Autistic disturbances of affective contact, Nerv. Child 2, 217–250.
  28. Kaplan RS and Steele AL (2005). An analysis of music therapy program goals and outcomes for clients with diagnoses on the autism spectrum, J. Music Ther 42, 2–19. 10.1093/jmt/42.1.2
  29. Lee H and Noppeney U (2011). Long-term music training tunes how the brain temporally binds signals from multiple senses, Proc. Natl. Acad. Sci. USA 108, E1441–E1450. 10.1073/pnas.1115267108
  30. Lord C, Rutter M, DiLavore PC, Risi S, Gotham K and Bishop SL (2012). Autism Diagnostic Observation Schedule, second edition (ADOS-2) manual (Part I): Modules 1–4. Western Psychological Services, Los Angeles, CA, USA.
  31. Magnotti JF, Basu Mallick D and Beauchamp MS (2018a). Reducing playback rate of audiovisual speech leads to a surprising decrease in the McGurk effect. Multisens. Res 31, 19–38. 10.1163/22134808-00002586
  32. Magnotti JF, Smith KB, Salinas M, Mays J, Zhu LL and Beauchamp MS (2018b). A causal inference explanation for enhancement of multisensory integration by co-articulation, Sci. Rep 8, 18032. 10.1038/s41598-018-36772-8
  33. Massaro DW and Jesse A (2009). Read my lips: speech distortions in musical lyrics can be overcome (slightly) by facial information, Speech Commun. 51, 604–621. 10.1016/j.specom.2008.05.013
  34. McGurk H and MacDonald J (1976). Hearing lips and seeing voices, Nature 264, 746–748. 10.1038/264746a0
  35. Mefferd AS and Green JR (2010). Articulatory-to-acoustic relations in response to speaking rate and loudness manipulations, J. Speech Lang. Hear. Res 53, 1206–1219. 10.1044/1092-4388(2010/09-0083)
  36. Munhall KG, Gribble P, Sacco L and Ward M (1996). Temporal constraints on the McGurk effect, Percept. Psychophys 58, 351–362. 10.3758/BF03206811
  37. Murray MM, Lewkowicz DJ, Amedi A and Wallace MT (2016). Multisensory processes: a balancing act across the lifespan, Trends Neurosci. 39, 567–579. 10.1016/j.tins.2016.05.003
  38. Petrini K, Dahl S, Rocchesso D, Waadeland CH, Avanzini F, Puce A and Pollick FE (2009). Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony, Exp. Brain Res 198, 339. 10.1007/s00221-009-1817-2
  39. Petrini K, Pollick FE, Dahl S, McAleer P, McKay L, Rocchesso D, Waadeland CH, Love S, Avanzini F and Puce A (2011). Action expertise reduces brain activity for audiovisual matching actions: An fMRI study with expert drummers, NeuroImage 56, 1480–1492. 10.1016/j.neuroimage.2011.03.009
  40. Quinto L, Thompson WF, Russo FA and Trehub SE (2010). A comparison of the McGurk effect for spoken and sung syllables, Atten. Percept. Psychophys 72, 1450–1454. 10.3758/APP.72.6.1450
  41. Robison JE (2019). Talking about autism — thoughts for researchers, Autism Res. 12, 1004–1006. 10.1002/aur.2119
  42. Roid GH, Miller LJ, Pomplun M and Koch C (2013). Leiter International Performance Scale, third edition. Western Psychological Services, Los Angeles, CA, USA.
  43. Rutter M, Bailey A and Lord C (2003). The Social Communication Questionnaire. Western Psychological Services, Los Angeles, CA, USA.
  44. Saalasti S, Tiippana K, Kätsyri J and Sams M (2011). The effect of visual spatial attention on audiovisual speech perception in adults with Asperger syndrome, Exp. Brain Res 213, 283–290. 10.1007/s00221-011-2751-7
  45. Saalasti S, Kätsyri J, Tiippana K, Laine-Hernandez M, von Wendt L and Sams M (2012). Audiovisual speech perception and eye gaze behavior of adults with Asperger syndrome, J. Autism Dev. Disord 42, 1606–1615. 10.1007/s10803-011-1400-0
  46. Sandbank M, Bottema-Beutel K, Crowley S, Cassidy M, Dunham K, Feldman JI, Crank J, Albarran SA, Raj S, Mahbub P and Woynaroski TG (2020). Project AIM: Autism intervention meta-analysis for studies of young children, Psychol. Bull 146, 1–29. 10.1037/bul0000215
  47. Sekiyama K, Soshi T and Sakamoto S (2014). Enhanced audiovisual integration with aging in speech perception: a heightened McGurk effect in older adults, Front. Psychol 5, 323. 10.3389/fpsyg.2014.00323
  48. Simpson K and Keen D (2011). Music interventions for children with autism: narrative review of the literature. J. Autism Dev. Disord 41, 1507–1514. 10.1007/s10803-010-1172-y
  49. Smith EG and Bennetto L (2007). Audiovisual speech integration and lipreading in autism, J. Child Psychol. Psychiatry 48, 813–821. 10.1111/j.1469-7610.2007.01766.x
  50. Smith A and Kleinow J (2000). Kinematic correlates of speaking rate changes in stuttering and normally fluent adults. J. Speech Lang. Hear. Res 43, 521–536. 10.1044/jslhr.4302.521
  51. Stevenson RA, Segers M, Ferber S, Barense MD and Wallace MT (2014a). The impact of multisensory integration deficits on speech perception in children with autism spectrum disorders, Front. Psychol 5, 379. 10.3389/fpsyg.2014.00379
  52. Stevenson RA, Siemann JK, Schneider BC, Eberly HE, Woynaroski TG, Camarata SM and Wallace MT (2014b). Multisensory temporal integration in autism spectrum disorders, J. Neurosci 34, 691–697. 10.1523/JNEUROSCI.3615-13.2014
  53. Stevenson RA, Baum SH, Segers M, Ferber S, Barense MD and Wallace MT (2017). Multisensory speech perception in autism spectrum disorder: From phoneme to whole-word perception, Autism Res. 10, 1280–1290. 10.1002/aur.1776
  54. Stevenson RA, Segers M, Ncube BL, Black KR, Bebko JM, Ferber S and Barense MD (2018). The cascading influence of multisensory processing on speech perception in autism, Autism 22, 609–624. 10.1177/1362361317704413
  55. Taylor N, Isaac C and Milne E (2010). A comparison of the development of audiovisual integration in children with autism spectrum disorders and typically developing children, J. Autism Dev. Disord 40, 1403–1411. 10.1007/s10803-010-1000-4
  56. Thompson B and Vacha-Haase T (2000). Psychometrics is datametrics: the test is not reliable, Educ. Psychol. Meas 60, 174–195. 10.1177/0013164400602002
  57. Thompson WF, Russo FA and Quinto L (2008). Audio-visual integration of emotional cues in song, Cogn. Emot 22, 1457–1470. 10.1080/02699930701813974
  58. Tremblay C, Champoux F, Voss P, Bacon BA, Lepore F and Théoret H (2007). Speech and non-speech audio-visual illusions: a developmental study, PLoS ONE 2, e742. 10.1371/journal.pone.0000742
  59. Williams JHG, Massaro DW, Peel NJ, Bosseler A and Suddendorf T (2004). Visual–auditory integration during speech imitation in autism, Res. Dev. Disabil 25, 559–575. 10.1016/j.ridd.2004.01.008
  60. Woynaroski TG, Kwakye LD, Foss-Feig JH, Stevenson RA, Stone WL and Wallace MT (2013). Multisensory speech perception in children with autism spectrum disorders, J. Autism Dev. Disord 43, 2891–2902. 10.1007/s10803-013-1836-5
  61. Yoder PJ, Lloyd BP and Symons FJ (2018). Observational Measurement of Behavior, second edition. Brookes Publishing, Baltimore, MD, USA.
  62. Zhang J, Meng Y, He J, Xiang Y, Wu C, Wang S and Yuan Z (2019). McGurk effect by individuals with autism spectrum disorder and typically developing controls: a systematic review and meta-analysis, J. Autism Dev. Disord 49, 34–43. 10.1007/s10803-018-3680-0
