Abstract
Objectives
Children with cochlear implants (CIs) vary widely in their ability to identify emotions in speech. The causes of this variability are unknown, but this knowledge will be crucial if we are to design improvements in technological or rehabilitative interventions that are effective for individual patients. The objective of this study was to investigate how well factors such as age at implantation, duration of device experience (hearing age), nonverbal cognition, vocabulary, and socioeconomic status (SES) predict prosody-based emotion identification in children with cochlear implants, and how the key predictors in this population compare to children with normal hearing who are listening to either normal emotional speech or to degraded speech.
Design
We measured vocal emotion identification in 47 school-age CI recipients aged 7 – 19 years in a single-interval, five-alternative forced-choice task. None of the participants had usable residual hearing based on parent/caregiver report. Stimuli consisted of a set of semantically emotion-neutral sentences that were recorded by four talkers in child-directed and adult-directed prosody corresponding to five emotions: neutral, angry, happy, sad, and scared. Twenty-one children with normal hearing were also tested in the same tasks; they listened to both original speech and to versions that had been noise-vocoded to simulate CI information processing.
Results
Group comparison confirmed the expected deficit in CI participants’ emotion identification relative to participants with normal hearing. Within the CI group, increasing hearing age (correlated with developmental age) and nonverbal cognition outcomes predicted emotion recognition scores. Stimulus-related factors such as talker and emotional category also influenced performance and were involved in interactions with hearing age and cognition. Age at implantation was not predictive of emotion identification. Unlike in the CI participants, neither cognitive status nor vocabulary predicted outcomes in participants with normal hearing, whether listening to original speech or CI-simulated speech. Age-related improvements in outcomes were similar in the two groups. Participants with normal hearing listening to original speech showed the greatest differences in their scores for different talkers and emotions. Participants with normal hearing listening to CI-simulated speech showed significant deficits compared to their performance with original speech materials, and their scores also showed the least effect of talker- and emotion-based variability. CI participants showed more variation in their scores with different talkers and emotions than participants with normal hearing listening to CI-simulated speech, but less so than participants with normal hearing listening to original speech.
Conclusions
Taken together, these results confirm previous findings that pediatric CI recipients have deficits in emotion identification based on prosodic cues, but they improve with age and experience at a rate that is similar to peers with normal hearing. Unlike in participants with normal hearing, nonverbal cognition played a significant role in CI listeners’ emotion identification. Specifically, nonverbal cognition predicted the extent to which individual CI users could benefit from some talkers being more expressive of emotions than others, and this effect was greater in CI users who had less experience with their device (or were younger) than in CI users who had more experience with their device (or were older). Thus, in young prelingually deaf children with CIs performing an emotional prosody identification task, cognitive resources may be harnessed to a greater degree than in older prelingually deaf children with CIs or in children with normal hearing.
INTRODUCTION
Emotional communication is a fundamental aspect of social cognition. Humans show sensitivity to vocal emotions as early as in infancy (Mastropierri & Turkewitz, 1999; Grossman, 2010; Oller et al., 2013; Palama et al., 2018), well before speech communication develops. While we convey emotions both with our words and our tone/manner of speech (emotional prosody), listeners with normal hearing rely dominantly on prosody, rather than on the meanings of the spoken words, to decipher the emotion the talker intends to convey (Ben David et al., 2016; Richter and Chatterjee, 2021). Emotional prosody comprises a number of covarying acoustic cues, of which voice pitch and its associated changes are primary (i.e., they most strongly differentiate the emotions from one another) and cues such as duration (the inverse of speaking rate) and intensity are secondary cues (Banse & Scherer, 1996; Chatterjee et al., 2015). For those with severe to profound hearing loss who use cochlear implants (CIs) to hear speech, understanding others’ emotional prosody presents significant challenges, at least in part because CIs do not transmit voice pitch cues with sufficient fidelity (Luo et al., 2007; Chatterjee & Peng, 2008; Deroche et al., 2014, Deroche et al., 2016; Van De Velde et al., 2019; Everhardt et al., 2020; Nagels et al., 2020). Although the secondary cues of intensity and duration are still represented in CIs, they are not sufficient to convey the full richness of emotional communication achieved by voice pitch modulations (Chatterjee et al., 2015). Vocal tract length estimates have also been shown to change with different emotions (e.g., Kim et al., 2020); although the degree to which CI users can utilize these cues for emotion identification is not known, their sensitivity to vocal tract length is significantly impaired compared to listeners with normal hearing (Gaudrain & Baskent, 2018).
Depending on the communicative environment, facial emotional cues may compensate for a deficit in spoken emotion information. Studies of facial emotion in CI patients are sparse, but the findings indicate that adult CI users are either comparable to their counterparts with normal hearing, or at a deficit, in perceiving facial emotions (Stevenson et al., 2017; Fengler et al., 2017). Thus, CI users may have a net deficit in emotional communication. Children who are born deaf and receive a cochlear implant early in life may show some deficits in facial emotion recognition in the preschool years, but appear to catch up to their peers with normal hearing by the time they are school-aged (Wiefferink et al., 2013; Wang et al., 2016; Hopyan-Misakyan et al., 2009). This suggests that concepts of different emotions (termed emotion understanding) are developed by the time children with CIs are school age. However, school-age children with CIs still show significant deficits relative to children with normal hearing in vocal emotion recognition tasks (Lin et al., 2022; Chatterjee et al., 2015; Barrett et al., 2020; Gilbers et al., 2015; Luo et al., 2007). Strong intersubject variability is also evident in performance by children with CIs in vocal emotion identification accuracy, with some of them achieving excellent performance while others’ scores are at chance level (Chatterjee et al., 2015; Barrett et al., 2020). Depending on the stimulus materials (e.g., adult-directed or child-directed speech), beneficial factors predicting this variability may include their cognitive status, duration of device experience, or their access to acoustic cues to emotional prosody (Barrett et al., 2020). However, these factors may also interact with one another, in ways that have not as yet been addressed in the literature. 
For instance, the availability of cognitive resources may be differentially predictive of performance in children with cochlear implants who are younger or older, or depending on the degree to which prosodic cues are present in the stimuli. In the present study, such interactions were investigated to obtain a fuller picture of how the predictors link to outcomes.
Parallel data in listeners with normal hearing who are attending to degraded stimuli provide an informative reference framework within which to study data obtained in CI users (Shannon et al., 1995). With respect to vocal emotion recognition, normative data in children are generally sparse, even with “clean”, undistorted stimuli. School-aged children with normal hearing also show strong intersubject variability in their vocal emotion identification accuracy when the speech has been processed to simulate the degradation experienced by CI listeners (Chatterjee et al., 2015; Tinnemore et al., 2018). Some factors explaining this variability in children with normal hearing have been studied for speech materials with exaggerated prosody (child-directed speech); the results suggest beneficial effects of nonverbal cognition and developmental age (Chatterjee et al., 2015; Tinnemore et al., 2018), but interactions between these predictors have not been studied. School-age children beyond the age of 10, however, are not usually addressed with the exaggerated prosody of child-directed speech. No information is currently available in the literature about how school-age children with normal hearing discern emotions in adult- vs. child-directed speech/speech with different degrees of prosody, or how they compare to children with CIs. This normative information is needed to understand whether or not children with CIs are showing emotion identification as expected in typically developing children of comparable age.
The goals of the present study were to discover how children with normal hearing and children with CIs compare in their ability to identify spoken emotions in speech with both normal and exaggerated prosody (i.e., adult-directed and child-directed speech prosody), and to understand the sources of inter-subject variation in their performance. To this end, we obtained vocal emotion recognition scores in school-age children with normal hearing and children with CIs attending to the same adult- and child-directed emotional speech stimuli. The children with normal hearing also listened to CI-processed (using noise vocoding, as in Shannon et al., 1995) versions of the speech materials. The noise-vocoded stimuli differed from those used in our previous studies in the degree to which the temporal envelope periodicity conveyed pitch information. While previous studies (e.g., Chatterjee et al., 2015; Tinnemore et al., 2018) used noise vocoding preserving voice pitch information up to 400 Hz, the present design limited the temporal envelope filter cutoff frequency to 160 Hz. This provides for a closer approximation to the degree of temporal envelope pitch information transmitted through modern-day CI processors. Of the children with CIs, 27 were tested as part of the Barrett et al. (2020) study; 20 children with CIs tested at BTNRH provided additional data for the current study. Predictive factors of interest in both participants with CIs and participants with normal hearing included vocabulary, nonverbal cognition, age, and socioeconomic status. The test of vocabulary was included to obtain a measure of receptive language, which may be important in developing the ability to categorize and label emotions from the continuum of variable affective inputs and experiences (Theory of Constructed Emotion: Barrett, 2017). Nonverbal cognition has been shown to be a significant predictor of language outcomes in children with CIs (e.g., Geers et al., 2003).
We hypothesized that top-down repair of the degraded auditory input in CIs was necessary to complete our emotion identification task. Therefore, Block Design and Matrix Reasoning subtests were selected from the WASI-II battery of tests of nonverbal intelligence to obtain a measure of problem solving, attention, and concentration. Factors related to age were included to probe effects of development and plasticity: for children with CIs, earlier age at implantation may be beneficial, as the child started receiving electric input with a more adaptive brain than might be possible with a later age at implantation. As children with normal hearing and children with CIs advance in age, developmental effects/experience-driven plasticity are likely to benefit both populations. For children with CIs, we considered “hearing age” (duration of device experience) as well as their age at testing.
While previous studies have shown developmental effects as well as sensitivity to talkers and emotions in children with CIs’ emotional prosody perception (Lin et al., 2022; Barrett et al., 2020; Volkova et al., 2013; Wang et al., 2019), the interactive effects between age, cognitive status, emotion categories, and talker have not been studied. The current study builds on previous findings by investigating these interactions. For instance, if children with CIs do not have as much access to the primary acoustic cues that distinguish spoken emotions from one another, then they might need more cognitive resources to complete the emotion identification task than children with normal hearing. If so, then a measure of cognitive function might predict performance in emotion identification. How this involvement of cognitive resources might change with increasing age or experience with the device, however, is a question that can only be answered by examining the interaction between cognitive function and age in predicting emotion identification outcomes. The acoustic features of emotional prosody vary across talkers and emotions (e.g., Luo et al., 2007; Chatterjee et al., 2015; Barrett et al., 2020; Lin et al., 2022). These variations imply that some talkers are better at conveying emotions vocally than others. As children with CIs receive the co-varying acoustic cues to emotions differently from their peers with normal hearing, their sensitivity to variations in talker and emotions is likely to be different as well. Along with age and device experience, cognitive resources may help children with CIs to benefit more from the acoustic information provided by one talker over another. A previous study on adult normally hearing listeners’ sensitivity to talker variability in emotional speech showed reduced talker-variability effects in CI-simulated emotional speech compared to original emotional speech (Luo, 2016).
We expected, therefore, that children with CIs attending to unprocessed speech and children with normal hearing attending to CI-simulated speech would both show reduced talker-variability effects overall. The extent to which children with CIs are sensitive to talker-variations in emotional prosody compared to children with normal hearing, and how this sensitivity may change with age and/or cognitive function, is unknown. The present study examined these interactions in both children with CIs and children with normal hearing, with the goal of obtaining a more nuanced understanding of the underlying processes in these two populations.
We expected that the children with CIs would show deficits in emotion identification compared to children with normal hearing, but that they would also show large intersubject variability. Our hypotheses were that this intersubject variability would be accounted for by the CI recipients’ age at implantation, their duration of device experience (we expected this to be correlated with their age at testing) and their nonverbal cognition. Based on previous work (Barrett et al., 2020), we expected the SES to be correlated with the CI recipients’ age at implantation, and CI recipients’ nonverbal cognition to be correlated with their vocabulary scores, but we were unsure of the extent to which SES and vocabulary would account for emotion identification scores. We expected the children with normal hearing to show poorer performance with CI-simulated speech than the CI recipients’ performance with full-spectrum speech (as in Chatterjee et al., 2015), and to show age-related improvements in their ability to identify emotions in CI-simulated speech (Chatterjee et al., 2015; Tinnemore & Chatterjee, 2018). Based on our previous work with children with normal hearing (Tinnemore & Chatterjee, 2018), we expected performance by children with normal hearing attending to CI-simulated speech to show a link with their nonverbal cognition.
Our stimuli consisted of emotion-neutral sentences that were recorded with different emotional affect by different talkers. These stimuli have been described in previous publications (e.g., Chatterjee et al., 2015; Barrett et al., 2020). Four talkers recorded the stimuli with distinct degrees of prosody in their productions. Two of them were asked to read the sentences with exaggerated, child-directed prosody (stimuli described in Chatterjee et al., 2015), while the other two talkers were asked to read the sentences with normal, adult-directed prosody (stimuli described in Barrett et al., 2020; Cannon et al., 2022; Christensen et al., 2019). Including stimuli with different strengths of prosodic cues allowed us to measure participants’ emotion recognition in different listening conditions. As CI users have reduced access to the acoustic cues that differentiate talkers’ tone and manner of speaking, we hypothesized that children with CIs would show less sensitivity to talker-variability and to the degree of prosody in the stimuli than the children with normal hearing. Parallel to our hypothesis regarding children with CIs, we hypothesized that children with normal hearing would show less sensitivity to differences between talkers and emotions when listening to CI-simulated speech than when listening to full-spectrum speech.
METHODS
Human Subjects Review
All protocols were approved by Institutional Review Boards at the two respective institutions.
Participants
The present study was conducted with a group of 21 children with normal hearing and 47 children with CIs. All participants spoke English as their native language. The children with normal hearing comprised 15 females and 6 males (minimum age = 7 years, maximum age = 18 years, mean age = 12.68 years, s.d. = 3.52 years). The children with CIs (20 females, 27 males, minimum age = 6.5 years, maximum age = 19 years, mean age = 11.68 years, s.d. = 3.79 years) were fully oral communicators (i.e., American Sign Language was not their primary language) and were tested at two sites: 20 children with CIs participated at Boys Town National Research Hospital (BTNRH) and the remaining 27 were tested at the University of California at San Francisco (UCSF). Of the children with CIs, 29 used Advanced Bionics Corporation devices, 14 used Cochlear Corporation devices, and 4 used MED-EL Corporation devices. None of the children with CIs had usable residual hearing at birth, based on parent/caregiver report. Their ages at implantation varied from within the first year of life up to 12 years of age. The age at implantation was provided by their caregivers. In one case, the caregiver was unable to provide the exact age, but indicated that the child had been implanted within the first year of life. In this case, we assigned the age at implantation as 0.5 years. Table 1 provides relevant information about individual children with CIs.
Table 1.
CI participant information. Participants tested at UCSF are coded as UCSF-1, UCSF-2, etc., and participants tested at BTNRH are coded similarly. Participants BTNRH-N1 – BTNRH-N6 were tested in a new laboratory space with a different loudspeaker (see text).
| Subject | Site | Age (years) | Gender | Age at Implantation (years) | Device Manufacturer |
|---|---|---|---|---|---|
| UCSF-1 | UCSF | 11.00 | Female | 1.00 | Advanced Bionics |
| UCSF-2 | UCSF | 13.00 | Female | 4.00 | Advanced Bionics |
| UCSF-3 | UCSF | 17.00 | Female | 5.00 | Cochlear Corporation |
| UCSF-4 | UCSF | 13.00 | Female | 1.00 | Cochlear Corporation |
| UCSF-5 | UCSF | 7.50 | Female | 0.50 | Cochlear Corporation |
| UCSF-6 | UCSF | 9.50 | Male | 1.50 | Cochlear Corporation |
| UCSF-7 | UCSF | 11.25 | Male | 1.25 | Advanced Bionics |
| UCSF-8 | UCSF | 13.00 | Female | 2.00 | Cochlear Corporation |
| UCSF-10 | UCSF | 15.92 | Male | 0.92 | Advanced Bionics |
| UCSF-11 | UCSF | 14.00 | Female | 1.00 | Advanced Bionics |
| UCSF-12 | UCSF | 10.00 | Female | 2.00 | MED-EL |
| UCSF-13 | UCSF | 8.00 | Male | 0.50 | Cochlear Corporation |
| UCSF-14 | UCSF | 8.00 | Male | 2.00 | Cochlear Corporation |
| UCSF-15 | UCSF | 16.00 | Male | 7.00 | Advanced Bionics |
| UCSF-16 | UCSF | 7.92 | Male | 0.92 | Advanced Bionics |
| UCSF-17 | UCSF | 13.00 | Male | 1.00 | Advanced Bionics |
| UCSF-18 | UCSF | 9.90 | Male | 1.90 | Advanced Bionics |
| UCSF-19 | UCSF | 13.50 | Male | 1.50 | Cochlear Corporation |
| UCSF-20 | UCSF | 8.00 | Male | 2.00 | MED-EL |
| UCSF-21 | UCSF | 9.50 | Male | 2.50 | Cochlear Corporation |
| UCSF-22 | UCSF | 18.00 | Female | 4.00 | Advanced Bionics |
| UCSF-23 | UCSF | 7.00 | Male | 5.00 | Advanced Bionics |
| UCSF-24 | UCSF | 19.00 | Male | 12.00 | Cochlear Corporation |
| UCSF-25 | UCSF | 14.50 | Male | 1.50 | Advanced Bionics |
| UCSF-28 | UCSF | 17.00 | Male | 2.00 | Cochlear Corporation |
| UCSF-29 | UCSF | 10.00 | Male | 7.00 | Advanced Bionics |
| UCSF-30 | UCSF | 7.00 | Female | 4.00 | Advanced Bionics |
| BTNRH-33 | BTNRH | 17.92 | Male | 1.20 | Advanced Bionics |
| BTNRH-35 | BTNRH | 16.77 | Male | 1.47 | Advanced Bionics |
| BTNRH-38 | BTNRH | 11.88 | Female | 1.21 | Cochlear Corporation |
| BTNRH-41 | BTNRH | 16.60 | Female | 2.80 | Advanced Bionics |
| BTNRH-42 | BTNRH | 7.90 | Male | 1.01 | Advanced Bionics |
| BTNRH-44 | BTNRH | 13.20 | Female | 2.41 | Cochlear Corporation |
| BTNRH-45 | BTNRH | 17.30 | Male | 1.29 | Advanced Bionics |
| BTNRH-46 | BTNRH | 14.40 | Male | 1.28 | Cochlear Corporation |
| BTNRH-47 | BTNRH | 10.78 | Female | 1.13 | Advanced Bionics |
| BTNRH-48 | BTNRH | 6.50 | Female | 0.82 | MED-EL |
| BTNRH-49 | BTNRH | 7.83 | Female | 0.88 | Advanced Bionics |
| BTNRH-50 | BTNRH | 9.06 | Male | 1.65 | Advanced Bionics |
| BTNRH-51 | BTNRH | 8.15 | Female | 1.26 | MED-EL |
| BTNRH-52 | BTNRH | 10.24 | Male | 1.12 | Advanced Bionics |
| BTNRH-N1 | BTNRH | 7.19 | Male | 1.92 | Advanced Bionics |
| BTNRH-N2 | BTNRH | 9.96 | Male | 1.46 | Advanced Bionics |
| BTNRH-N3 | BTNRH | 7.96 | Female | 0.96 | Advanced Bionics |
| BTNRH-N4 | BTNRH | 7.96 | Female | 1.45 | Advanced Bionics |
| BTNRH-N5 | BTNRH | 18.21 | Male | 2.94 | Advanced Bionics |
| BTNRH-N6 | BTNRH | 7.94 | Female | 1.02 | Advanced Bionics |
Each participant with normal hearing was tested for normal hearing prior to the experiment. Thresholds were obtained using an Interacoustics Diagnostic Audiometer AD226 (Interacoustics, Middlefart, Denmark). Normal hearing was defined as thresholds better than or equal to 20 dB HL at audiometric frequencies between 500 and 4000 Hz.
Stimuli
The stimuli were 12 sentences (6 syllables each; Chatterjee et al., 2015) selected to be emotionally neutral in their semantic content (e.g., The cup is on the table; The mailman shut the gate). The sentences were selected from the Hearing in Noise Test (HINT) database (Nilsson et al., 1994). They were recorded for our previous studies (Chatterjee et al., 2015; Cannon & Chatterjee, 2018) by a male and a female talker, both native speakers of American English, in a child-directed manner (CDS_M1 and CDS_F1, respectively) and by a different male and female talker in an adult-directed manner (ADS_M2 and ADS_F2, respectively). Each sentence was recorded in five different emotions: Happy, Sad, Angry, Scared, and Neutral. In addition, each type of speech (child-directed or adult-directed, CDS or ADS) was presented in two different conditions: full-spectrum and 8-channel noise-band vocoding. The vocoding was done using the AngelSim tool (TigerSpeech Technologies), with standard settings (200–7000 Hz bandwidth overall, 24 dB/octave bandpass filters for the eight channels, and 160 Hz cutoff frequency for the temporal envelope lowpass filter). The participants with normal hearing completed the emotion recognition task with child-directed and adult-directed speech in both full-spectrum and 8-channel noise-band-vocoded versions, while the participants with CIs completed the task with child-directed and adult-directed speech in the full-spectrum version only.
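To make the vocoder settings concrete, the following minimal sketch computes illustrative band edges for eight analysis channels spanning 200–7000 Hz. The text gives only the overall bandwidth, the number of channels, and the filter slopes; the spacing of the channels is an assumption here (equal distances along the cochlea according to Greenwood's frequency-place function for the human cochlea), and the edges actually produced by AngelSim's standard settings may differ. The function names are hypothetical.

```python
import math

# Greenwood frequency-place constants for the human cochlea.
# ASSUMPTION: the source does not state how the eight channels were spaced;
# equal spacing in cochlear place is used here purely for illustration.
A, ALPHA, K = 165.4, 2.1, 0.88


def freq_to_place(freq_hz):
    """Map frequency (Hz) to relative cochlear place (0 = apex, 1 = base)."""
    return math.log10(freq_hz / A + K) / ALPHA


def place_to_freq(place):
    """Map relative cochlear place (0 = apex, 1 = base) to frequency (Hz)."""
    return A * (10 ** (ALPHA * place) - K)


def band_edges(low_hz=200.0, high_hz=7000.0, n_channels=8):
    """Return n_channels + 1 band edges equally spaced in cochlear place."""
    lo, hi = freq_to_place(low_hz), freq_to_place(high_hz)
    step = (hi - lo) / n_channels
    return [place_to_freq(lo + step * i) for i in range(n_channels + 1)]


if __name__ == "__main__":
    for i, edge in enumerate(band_edges()):
        print(f"edge {i}: {edge:.1f} Hz")
```

In a full noise vocoder, each band's temporal envelope (lowpass filtered at 160 Hz in the present study) would modulate a noise carrier filtered into the same band; that signal-processing stage is not shown here.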
By ensuring that the stimuli comprised the same sentences spoken in the five emotions, we ensured that prosodic cues, and not lexical-semantic cues, were used by listeners to perform the task. The prosody of the child-directed and adult-directed stimuli differed in a number of ways. Acoustic features of the primary and secondary prosodic cues (voice pitch, intensity, and duration) of the stimuli are shown in Fig. 1, which demonstrates the exaggerated prosody of the child-directed stimuli relative to the adult-directed stimuli. For instance, for the Happy emotion, the mean F0 of child-directed talkers CDS_F1 and CDS_M1 rises to 2.01 and 2.43 times the mean F0 for the Neutral emotion, respectively, while the stimuli of adult-directed talkers ADS_F2 and ADS_M2 show corresponding increases to only 1.43 and 1.30 times, respectively. Visual inspection of Fig. 1 shows that CDS_F1 and CDS_M1 distinguish the five emotions more strongly than ADS_F2 and ADS_M2 by virtue of mean F0 (the top row shows larger differences between the boxplots across the emotions in their stimuli than in the adult-directed speech). In the middle row, it is evident that CDS_F1, CDS_M1, and ADS_F2 all distinguish the emotions better via intensity cues than ADS_M2. In the lowest row, it is evident that the duration cue is similarly strong in distinguishing the emotions, except for ADS_F2, who shows smaller differences in duration across the emotions than the other three talkers. Thus, overall, all three cues are better used to distinguish the emotions by the child-directed talkers than by the adult-directed talkers. In our previous studies with the same stimuli, we have found that CDS_F1’s emotions are best identified (perhaps related to her greater ability to distinguish emotions via the F0 cue than the other talkers), and that child-directed stimuli are overall better identified than adult-directed stimuli (Barrett et al., 2020; Cannon & Chatterjee, 2019).
Confusion matrices were analyzed to obtain hit rates and false alarm rates. The high d’ values based on these hit rates and false alarm rates with the unprocessed (full-spectrum) stimuli, in both previous and present data from listeners with normal hearing, confirm the validity of the stimuli.
Figure 1.

Examples of primary and secondary acoustic cues to emotional prosody in stimuli recorded by the four talkers (left to right panels: CDS_F1, CDS_M1, ADS_F2, ADS_M2). Within each panel, the abscissa shows the five emotions, and the ordinate shows the acoustic feature of interest. Top to bottom, the rows show the mean pitch (F0), the mean intensity, and the mean duration computed across the 12 sentences recorded by each talker.
Stimuli were presented via a single loudspeaker facing the listener. Prior to the start of each block, the research assistant calibrated the speaker output to ensure a level of 65 dB SPL, based on the output for a calibration tone of 1 kHz presented at an RMS level equal to the average RMS calculated across all stimuli within the block. Each participant was asked to sit one meter away from the loudspeaker while performing the task.
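The level-matching step of this calibration can be sketched as follows: compute the RMS of each stimulus in a block, average those values, and synthesize a 1 kHz sine tone whose RMS equals that average. This is an illustrative sketch rather than the laboratory's actual software; the function names, the 44.1 kHz sampling rate, and the 1 s tone duration are assumptions, as is the reading of "average RMS" as the mean of the per-stimulus RMS values.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mean_rms(stimuli):
    """Average of the per-stimulus RMS values within a block."""
    return sum(rms(stim) for stim in stimuli) / len(stimuli)


def calibration_tone(target_rms, freq_hz=1000.0, fs=44100, duration_s=1.0):
    """Sine tone whose RMS equals target_rms.

    A sine of peak amplitude a has RMS a / sqrt(2), so the peak is scaled up.
    ASSUMPTION: the sampling rate and duration are illustrative values only.
    """
    peak = target_rms * math.sqrt(2.0)
    n_samples = int(fs * duration_s)
    return [peak * math.sin(2.0 * math.pi * freq_hz * i / fs)
            for i in range(n_samples)]


if __name__ == "__main__":
    # Two toy "stimuli" stand in for the recorded sentences.
    block = [[0.1, -0.2, 0.3, -0.1], [0.05, 0.05, -0.05, -0.05]]
    tone = calibration_tone(mean_rms(block))
    print(f"block mean RMS = {mean_rms(block):.4f}, tone RMS = {rms(tone):.4f}")
```

The tone would then be played through the loudspeaker and the output adjusted until a sound level meter at the listener's position reads 65 dB SPL.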
Protocol at BTNRH
All but one of the participants completed cognitive and vocabulary tests prior to completing the emotion recognition task. To assess the subjects’ nonverbal cognition, they were tested using the matrix reasoning and block design subtests from the Wechsler Abbreviated Scale of Intelligence, Second Edition (WASI-II) (Wechsler, 2011). In the block design subtest, children recreated two-dimensional patterns from the test booklet using two-colored blocks within a specified amount of time. In the matrix reasoning subtest, participants were asked to select a picture/pattern that correctly completed a series of pictures/patterns. The summed T-scores from the block design and matrix reasoning subtests were used to calculate the Perceptual Reasoning Index (PRI) composite score. Receptive vocabulary levels were tested using the Peabody Picture Vocabulary Test, Fourth Edition (PPVT-4), Form B (Dunn and Dunn, 2007). Children listened to a word spoken by an experienced test administrator and pointed to a picture best representing the meaning of the word. Care was taken to ensure that the words were clearly understood, and the testing was done in a quiet room with no interference. The raw score was converted into an age-normed standard score, which was used in analyses.
Socioeconomic status was indirectly obtained from primary and secondary caregivers’ highest education level. These were assigned numerical values as follows: high school =1, community college/vocational school = 2, 4-year college/university degree = 3, professional degree/graduate school = 4. The mean of the two caregivers’ values, or the primary caregiver’s value if the participant had only one primary caregiver, was used in analysis. This followed the protocol used in the original Hollingshead scale of socioeconomic status (Hollingshead, 1957).
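The SES coding described above can be expressed as a small lookup and average. The sketch below is illustrative only; the dictionary keys and the function name are hypothetical labels for the four education categories named in the text.

```python
# Numerical values assigned to caregivers' highest education level,
# following the coding described in the text. Key strings are hypothetical.
EDUCATION_LEVELS = {
    "high school": 1,
    "community college/vocational school": 2,
    "4-year college/university degree": 3,
    "professional degree/graduate school": 4,
}


def ses_score(primary, secondary=None):
    """Mean of the caregivers' education values.

    If only a primary caregiver is reported, that caregiver's value is used.
    """
    values = [EDUCATION_LEVELS[primary]]
    if secondary is not None:
        values.append(EDUCATION_LEVELS[secondary])
    return sum(values) / len(values)


if __name__ == "__main__":
    print(ses_score("high school", "professional degree/graduate school"))
    print(ses_score("4-year college/university degree"))
```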
For the emotion recognition test, participants with CIs used their self-reported better ear or earlier-implanted ear while the contralateral ear was plugged (or for bilateral participants, the later-implanted processor was removed). This was done to prevent the use of any residual hearing on the other side, and to compare all CI participants on their earlier-implanted or better-performing side. Participants with normal hearing listened normally. The experiment was controlled by a custom Matlab-based software program. Two rounds of passive training, each with 10 practice sentences (these sentences were not used in the test set), were provided. In the passive training, after each sentence was presented during the training sessions, the correct emotion (picture and text) would light up on the computer screen in green to indicate the emotion associated with that sentence. The purpose of the passive training rounds was to familiarize the participant with the task and how the talkers produce the five different emotions. In the test, emotion recognition was assessed using a single-interval, five alternative, forced-choice task. Stimuli were blocked by condition and talker (e.g., full-channel, male ADS talker).
Participants were tested in the sound field, seated 1 m away from a single loudspeaker (Grason-Stadler, Inc. for participants BTNRH-33 through BTNRH-52; Genelec Smart Active Studio Monitor for participants BTNRH-N1 through BTNRH-N6) and listened to a total of 60 sentences in each block. Stimuli were presented at a mean level of 65 dB SPL. A calibration tone at 1 kHz was presented at an RMS level corresponding to the mean RMS level of the stimuli in the test block. Testing was done in a single-interval, five-alternative forced-choice paradigm. After each sentence was presented, the participant indicated (by clicking on the appropriate button on the screen) which of the five emotions they perceived the sentence as being associated with. No feedback was provided, and the participant was not allowed to replay the individual stimuli. Each block was repeated twice. Within a block, stimuli were presented in random order, and blocks were presented in random order as well. Confusion matrices were obtained from each test set and d’ values were derived from these confusion matrices (d’ = z-score corresponding to the hit rate – z-score corresponding to the false alarm rate; Macmillan & Creelman, 2005). Eight participants were unable to finish the full test procedure due to time constraints.
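The d’ formula above can be illustrated with a short sketch that takes a confusion matrix (rows = presented emotion, columns = chosen emotion) and returns one d’ value per emotion. Two details are assumptions, not taken from the text: d’ is computed separately for each emotion category, and hit or false alarm rates of exactly 0 or 1 are clamped to 1/(2N) and 1 − 1/(2N) so that the z-transform stays finite. The function names are hypothetical.

```python
from statistics import NormalDist


def _clamp(rate, n_trials):
    """Keep a proportion away from 0 and 1 so its z-score is finite.

    ASSUMPTION: the 1/(2N) correction is a common convention; the source
    does not say how extreme rates were handled.
    """
    lo = 1.0 / (2 * n_trials)
    return min(max(rate, lo), 1.0 - lo)


def dprime_from_confusion(confusion):
    """Per-category d' = z(hit rate) - z(false alarm rate).

    confusion[i][j] is the number of trials on which category i was
    presented and category j was chosen.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    n = len(confusion)
    row_totals = [sum(row) for row in confusion]
    total = sum(row_totals)
    dprimes = []
    for target in range(n):
        signal_trials = row_totals[target]
        noise_trials = total - signal_trials
        hits = confusion[target][target]
        false_alarms = sum(confusion[r][target] for r in range(n) if r != target)
        hit_rate = _clamp(hits / signal_trials, signal_trials)
        fa_rate = _clamp(false_alarms / noise_trials, noise_trials)
        dprimes.append(z(hit_rate) - z(fa_rate))
    return dprimes


if __name__ == "__main__":
    emotions = ["neutral", "angry", "happy", "sad", "scared"]
    # Toy matrix: 12 trials per emotion, as in a 60-sentence block.
    toy = [
        [9, 1, 0, 2, 0],
        [1, 8, 2, 0, 1],
        [0, 3, 8, 0, 1],
        [3, 0, 0, 8, 1],
        [1, 1, 2, 2, 6],
    ]
    for name, d in zip(emotions, dprime_from_confusion(toy)):
        print(f"{name}: d' = {d:.2f}")
```

Responding at chance (equal counts in every cell) yields d’ = 0 for every emotion, since the hit rate and false alarm rate are then identical.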
Protocol at UCSF
Only CI participants were tested at the second test site, UCSF, following the same protocol and with an experimental setup similar to that at BTNRH. Following informed consent protocols, participants completed the same cognitive and vocabulary tests as at BTNRH and then proceeded to complete the emotion recognition task. As at BTNRH, participants used their self-reported better ear or earlier-implanted ear, with the contralateral ear plugged, while completing the study. The Emognition program (described above) was used to present the stimuli via a Microsoft Surface tablet (10.6-inch ClearType HD Display) located inside a sound booth. The sentences were presented to the participants from a single loudspeaker (Sony SS-MB150H) located approximately 1 meter away at a mean level of 65 dB SPL. As in the BTNRH protocol, two rounds of passive training were provided to all participants. Also as in the BTNRH protocol, during the test conditions, stimuli in the five emotions were presented in random order and blocked by condition and talker, and the same procedure was used, except that the participant indicated the emotion that was expressed in the sentence by touching the emotion image on the screen. As at BTNRH, replays were not allowed, and no feedback was provided.
Statistical Analyses
Linear mixed effects (LME) models were constructed to investigate the effects of predictors of interest, primarily for three reasons: i) the data were not normally distributed; ii) there were missing data; and iii) in the context of this study, understanding within- and across-participant variability was valuable. LMEs afford advantages in all three scenarios over other analytical approaches such as repeated measures analyses of variance (e.g., Magezi, 2015). As the number of predictor variables was large, preliminary exploratory analyses were completed 1) to investigate correlations between predictors and the effects of specific predictors, and 2) to reduce the complexity of the final models by eliminating unnecessary variables. Histograms of model residuals were visually inspected to check for normality of their distribution. In addition, the Akaike Information Criterion and estimated R2 values were examined to compare models with one another before deciding on the final model.
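As a rough illustration of how the Akaike Information Criterion ranks candidate models, the sketch below applies the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood. The function names and the idea of passing log-likelihoods in a dictionary are assumptions for illustration; the analyses reported here were run in R, not with this code.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: smaller values indicate a better trade-off
    between goodness of fit and model complexity."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(models):
    """Pick the model with the smallest AIC.

    models maps a (hypothetical) model name to a tuple of
    (maximized log-likelihood, number of estimated parameters).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

A model with a slightly higher log-likelihood can still lose this comparison if it needs many more parameters, which is why AIC is useful when deciding whether to retain a predictor.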
Statistical analyses were conducted in R v. 4.0.4 (R Core Team, 2021), using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages for the linear mixed effects models and model summaries, respectively. Plots were rendered in ggplot2 (Wickham, 2016). The p values were considered significant at a criterion level of 0.05. The MuMIn package (Barton, 2022) was used to obtain indices of model fit.
In the remainder of the text, the factor duration of device experience (difference between age at testing and age at implantation) is referred to as “hearing age”, the logarithm of the nonverbal cognition score is referred to as “cognition”, the logarithm of the PPVT score is referred to as “vocabulary”, and the socioeconomic status score is referred to as “SES”; “hearing status” refers to whether children had normal hearing or used CIs. The logarithms of the cognition and vocabulary scores were used in analyses because the transformation resulted in a more normally distributed set of residuals in statistical models.
RESULTS
Participant characteristics
Table 2 shows the means and standard deviations of the participant-based predictors.
Table 2.
Means and standard deviations (s.d.s) of relevant predictor variables in participants with normal hearing (NH) and CIs
| Group | Mean (s.d.) Nonverbal Cognition Score | Mean (s.d.) Vocabulary Score | Mean (s.d.) SES score | Mean age (s.d.) | Mean age at implantation (s.d.) | Mean hearing age (s.d.) |
|---|---|---|---|---|---|---|
| NH | 108.38 (14.79) | 117.29 (16.35) | 2.86 (0.84) | 12.68 (3.53) years | NA | NA |
| CI | 98.87 (15.42) | 98.66 (16.16) | 2.74 (1.00) | 11.68 (3.79) years | 2.22 (2.10) years | 9.47 (3.64) years |
Age
The distribution of participant ages failed the Shapiro-Wilk test for normality, so a nonparametric test was conducted to determine whether the ages of the participants with normal hearing differed from those of the participants with CIs. A Kruskal-Wallis test showed no significant difference in age between the two groups (H(1)=1.44, p=0.230).
Socio-economic status
The distribution of participants’ parental education status also failed the Shapiro-Wilk test for normality. The Kruskal-Wallis test showed no significant difference in SES between the two groups (H(1)=0.092, p=0.762).
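For readers who wish to check the Kruskal-Wallis results above, the following is a minimal sketch of the H statistic (with the usual tie correction) and of the p value for the two-group case, where H is referred to a chi-square distribution with 1 degree of freedom. The function names are hypothetical, and the closed-form p value below is valid only for 1 degree of freedom.

```python
from collections import Counter
from math import erfc, sqrt


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction, using average ranks."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    counts = Counter(pooled)
    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    position = 0
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = position + (c + 1) / 2
        position += c
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    tie_correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / tie_correction


def p_value_df1(h):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(h / 2))
```

Applying `p_value_df1` to the reported statistics gives values that round to the reported p values (about 0.230 for H = 1.44 and about 0.762 for H = 0.092).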
Vocabulary and Nonverbal Cognition scores
The F-test showed no significant difference in variance between the group with normal hearing and the group with CIs in Vocabulary scores (PPVT score) and Nonverbal Cognition scores (Block Design and Matrix Reasoning). Therefore, separate one-way analyses of variance (ANOVAs) were conducted to investigate differences in these scores between participants with normal hearing and those with CIs. Results showed significant group differences for both Vocabulary and Cognition scores, and in both cases, the residuals were normally distributed (Shapiro-Wilk test). Participants with normal hearing had higher Vocabulary scores than those with CIs (F(1)=19.16, p<0.001). Participants with normal hearing also had higher Nonverbal Cognition scores than those with CIs (F(1)=5.657, p=0.020).
Exploratory analyses
1. Effects of Test Site, SES, Manufacturer
For the CI participants’ data, a simple exploratory LME analysis was completed with d’ score as the dependent variable, and age, site of testing (categorical variable with 2 levels), device manufacturer (3 levels), talker (4 levels), emotion (5 levels), and SES as fixed effects. No interactions were included. The analysis indicated significant effects of age, emotion (performance for all emotions better than for Neutral), and talker (performance poorer for all talkers than for CDS_F1), but no significant effects of device manufacturer, test site, or SES (Table 3). Device manufacturer, SES, and test site were not considered as predictors in the remaining analyses of the data obtained with the CI group.
Table 3.
Results of preliminary exploratory analysis on predictor effects in CI participants. Significant effects shown in bold text.
| Effect | β (s.e.) | t (df) | p |
|---|---|---|---|
| Age | 0.089 (0.037) | 2.411 (46.225) | 0.020 |
| Test Site: Site 2 (ref: Site 1) | −0.177 (0.294) | −0.603 (47.070) | 0.550 |
| Talker (ref: CDS_F1) | | | |
| CDS_M1 | −1.066 (0.103) | −10.363 (853.092) | <0.001 |
| ADS_F2 | −1.760 (0.105) | −16.828 (854.991) | <0.001 |
| ADS_M2 | −2.245 (0.105) | −21.300 (855.124) | <0.001 |
| Emotion (ref: Neutral) | | | |
| Angry | 0.984 (0.117) | 8.439 (852.521) | <0.001 |
| Happy | 0.886 (0.117) | 7.591 (852.521) | <0.001 |
| Sad | 0.418 (0.117) | 3.576 (852.521) | <0.001 |
| Scared | 0.246 (0.117) | 2.105 (852.521) | 0.036 |
| Device Manufacturer (ref: Advanced Bionics) | | | |
| Cochlear Corporation | 0.146 (0.306) | 0.479 (46.135) | 0.634 |
| MED-EL | −0.221 (0.502) | −0.439 (47.953) | 0.662 |
| SES | 0.207 (0.143) | 1.448 (47.203) | 0.154 |
2. Inter-relationships between specific factors
In exploratory analyses, simple Pearson’s correlations were conducted to investigate the relationships between predictors of interest.
In the participants with normal hearing, SES was significantly and positively related to cognition (r=0.531, p=0.013) and to vocabulary (r=0.563, p=0.008), both surviving the Bonferroni correction for multiple comparisons. The correlation between cognition and vocabulary did not reach significance (r=0.425, p=0.055). Age was not significantly related to these other predictors.
In the CI participants, the correlation between the age at testing (hereafter referred to as age) and the age at implantation was significant (r=0.346, p=0.017). Hearing age was not significantly correlated with age at implantation. Hearing age was significantly related to age at testing (r=0.841, p<0.001) – i.e., older children also have more device experience; both correlations with age at testing were significant after the Bonferroni correction. SES was not significantly related to cognition, vocabulary, or age at implantation. Cognition and vocabulary scores were significantly related to one another (r=0.509, p<0.001), a finding that survived the Bonferroni correction.
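The correlation analyses above can be sketched as follows. The helper names are hypothetical, and the number of comparisons entering the Bonferroni correction is not stated in the text, so `n_tests` is left as a parameter that the reader must supply. The t statistic is the standard transformation of Pearson's r with n − 2 degrees of freedom.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """Convert r to a t statistic; returns (t, degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r * r)), n - 2


def survives_bonferroni(p, n_tests, alpha=0.05):
    """True if p stays below the Bonferroni-adjusted criterion alpha / n_tests.

    n_tests is an assumption supplied by the caller; the source does not
    report how many comparisons were corrected for.
    """
    return p <= alpha / n_tests
```

For example, assuming all 21 participants with normal hearing contributed to the SES-cognition correlation, r = 0.531 corresponds to t(19) of about 2.73.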
Because of the differences between the groups in vocabulary and cognition, and because of the different effects of SES on the two groups, vocabulary, cognition, and SES were not considered in the next section, in which we compared emotion recognition performance between them.
Group differences between participants with normal hearing and participants with CI in the Full-Spectrum Condition
We analyzed d’ values obtained in the Full-Spectrum condition by the participants with normal hearing and the participants with CIs. Figure 2 shows the d’ values (ordinate) obtained for the two groups in this condition, for each emotion category (rows) and for each talker (columns). The data appear to follow relatively consistent patterns, in which listeners with normal hearing showed better performance than CI users, and child-directed stimuli (CDS_F1, CDS_M1) produced better performance than adult-directed stimuli (ADS_F2, ADS_M2) overall.
Figure 2.

Group differences between participants with normal hearing (red) and CI (blue) in emotion identification (d’), shown in boxplots for each emotion (rows) and plotted against each of the four talkers (F1, M1, F2, M2) and the two speaking styles (ADS, CDS) (columns).
A linear mixed effects model was constructed with d’ as the dependent variable; age, talker, emotion, and hearing status as fixed effects; and subject-based random intercepts. The model included all interactions. It accounted for 76.95% of the variance (conditional R2, i.e., considering both fixed and random effects). Results (Table 4) showed significant effects of age, hearing status (normal hearing > CI), and emotion (all emotions were better identified than Neutral), as well as two-way interactions between talker and emotion in which the difference in sensitivity to Neutral vs. Happy emotions was greater (i.e., Happy was better identified than Neutral) for CDS_F1 than for all other talkers. No other significant interactions or effects were observed.
Table 4.
Results of LME analysis on normally hearing (NH) and CI participants’ data (Full-Spectrum condition only). Only significant effects are reported for brevity.
| Effect | β (s.e.) | t (df) | p |
|---|---|---|---|
| Age | 0.164 (0.054) | 3.035 (273.1) | 0.003 |
| Hearing Status (ref: CI) | |||
| NH | 4.056 (1.36) | 2.983 (292.700) | <0.003 |
| Emotion (ref: Neutral) | |||
| Angry | 2.461 (0.703) | 3.499 (1146) | <0.001 |
| Happy | 3.005 (0.703) | 4.273 (1146) | <0.001 |
| Sad | 1.441 (0.703) | 2.049 (1146) | 0.041 |
| Scared | 1.632 (0.703) | 2.320 (1146) | 0.020 |
| Talker (ref: CDS_F1) x Emotion (ref: Neutral) | |||
| CDS_M1:Happy | −3.101 (1.004) | −3.089 (1146) | 0.002 |
| ADS_F2:Happy | −2.221 (1.004) | −2.157 (1146) | 0.031 |
| ADS_M2:Happy | −2.022 (1.004) | −4.953 (1146) | 0.052 |
These results show that the group with normal hearing was significantly better than the CI group at identifying emotions in the Full-Spectrum condition, that age was a significant predictor in general (performance was better in older children than in younger children in both groups), and that there were effects of emotion and talker. Effects of cognition and vocabulary, which were not included in this part of the analyses, are considered separately for the two groups in the following sections.
Factors determining CI participants’ performance
As indicated previously, exploratory models showed that site of testing and SES did not predict outcomes; these variables were therefore not included in the LME analysis. An LME model was constructed with hearing age, talker, emotion, age at implantation, vocabulary, and cognition as fixed effects, and subject-based random intercepts. Inclusion of vocabulary and age at implantation did not improve model fit. The final, best-fitting model (smallest value of Akaike’s Information Criterion) included hearing age, talker, emotion, and cognition as fixed effects, and subject-based random intercepts. The model accounted for 73.23% of the variance (conditional R2, based on both fixed and random effects). The results (Table 5) showed significant positive effects of hearing age and cognition, effects of emotion and talker, and interactions between all of the predictors.
Table 5.
Results of LME analysis on CI participants’ data. Only significant effects are reported for brevity.
| Effect | β (s.e.) | t (df) | p |
|---|---|---|---|
| Hearing Age | 3.368 (1.671) | 2.016 (179.808) | 0.045 |
| Cognition | 24.401 (8.588) | 2.841 (179.809) | 0.005 |
| Talker (ref: CDS_F1) | | | |
| ADS_M2 | 43.007 (17.939) | 2.397 (603.005) | 0.017 |
| Emotion (ref: Neutral) | | | |
| Happy | −44.925 (17.745) | −2.532 (852.660) | 0.012 |
| Scared | −49.747 (17.745) | −2.803 (852.660) | 0.005 |
| Hearing Age x Cognition | −1.625 (0.833) | −1.950 (179.808) | 0.053 |
| Talker (ref: CDS_F1) x Cognition | | | |
| ADS_M2:Cognition | −21.916 (8.953) | −2.448 (852.941) | 0.015 |
| Hearing Age x Emotion (ref: Neutral) | | | |
| Hearing Age:Happy | 3.783 (1.725) | 2.193 (852.660) | 0.029 |
| Hearing Age:Scared | 4.523 (1.725) | 2.623 (852.660) | 0.009 |
| Talker (ref: CDS_F1) x Emotion (ref: Neutral) | | | |
| ADS_M2:Happy | 51.076 (25.348) | 2.015 (852.657) | 0.044 |
| ADS_M2:Scared | 50.861 (25.348) | 2.007 (852.657) | 0.045 |
| ADS_F2:Scared | 61.822 (25.321) | 2.441 (852.659) | 0.015 |
| Cognition x Emotion (ref: Neutral) | | | |
| Cognition:Happy | 23.839 (8.866) | 2.689 (852.660) | 0.007 |
| Cognition:Scared | 25.719 (8.866) | 2.901 (852.660) | 0.004 |
| Hearing Age x Talker (ref: CDS_F1) x Emotion (ref: Neutral) | | | |
| Hearing Age:ADS_F2:Scared | −5.361 (2.455) | −2.184 (852.659) | 0.029 |
| Hearing Age:ADS_M2:Scared | −5.691 (2.457) | −2.317 (852.657) | 0.021 |
| Hearing Age x Cognition x Emotion (ref: Neutral) | | | |
| Hearing Age:Cognition:Happy | −1.890 (0.860) | −2.197 (852.660) | 0.028 |
| Hearing Age:Cognition:Scared | −2.282 (0.860) | −2.653 (852.660) | 0.008 |
| Talker (ref: CDS_F1) x Cognition x Emotion (ref: Neutral) | | | |
| ADS_M2:Cognition:Happy | −26.541 (12.652) | −2.098 (852.657) | 0.036 |
| ADS_M2:Cognition:Scared | −25.901 (12.642) | −2.047 (852.657) | 0.041 |
| ADS_F2:Cognition:Happy | −31.835 (12.641) | −2.518 (852.659) | 0.012 |
| Hearing Age x Talker (ref: CDS_F1) x Cognition x Emotion (ref: Neutral) | | | |
| Hearing Age:ADS_F2:Cognition:Scared | 2.673 (1.223) | 2.184 (852.659) | 0.029 |
| Hearing Age:ADS_M2:Cognition:Scared | 2.857 (1.225) | 2.857 (852.657) | 0.020 |
A four-way interaction was observed between hearing age, cognition, emotion, and talker. Overall, higher cognition was linked to a greater ability to take advantage of the exaggerated emotional prosody offered by CDS_F1 for some emotional contrasts, an effect observed significantly more in children with less device experience (who were also younger) than in children with longer device experience. Specifically, the Neutral-Scared difference changed differentially with cognition for children with younger vs. older hearing age; this effect was greatest for the CDS_F1 talker, with the ADS_F2 and ADS_M2 talkers’ materials showing significantly smaller effects. These effects are seen in Table 5 and are visually apparent in Figure 3, which shows the link between nonverbal cognition and d’ scores for each talker (columns) and emotion (colors), separated out by participants whose hearing ages were less than (upper row) or greater than (lower row) the median age. The interaction between nonverbal cognition and hearing age is evident in the greater effect of nonverbal cognition on d’ scores in the upper row vs. the lower row. Further, among the participants of younger hearing age, those with higher nonverbal cognition show a greater benefit from the prosodic cues provided by CDS_F1 than from those provided by other talkers, and this increased benefit varies with emotion.
Figure 3.

Sensitivity to emotions (d’) in participants with CIs (ordinate), plotted against their cognition scores (abscissa), for each emotion (colors) and separated out by talker (columns) and by their hearing age (upper and lower rows correspond to younger and older hearing ages respectively). Lines show linear regressions through the data in each case.
Factors determining the performance of participants with normal hearing
The effects of, and interactions between, predictor variables in the data obtained in participants with normal hearing can be seen in Fig. 4A, which plots the d’ values obtained in the participants with normal hearing against their age, for the two levels of spectral degradation (colors), for each talker (rows) and emotion (columns). The 8-channel condition elicited poor identification scores, but one-sample t-tests showed that for each talker, the d’ values (averaged across emotions) were significantly greater than 0.0 (i.e., better than chance): for CDS_F1, t(19)=6.358, p<0.001; for CDS_M1, t(19)=6.764, p<0.001; for ADS_F2, t(19)=8.635, p<0.001; and for ADS_M2, t(19)=5.309, p<0.001. This suggests that floor effects were not reached in the overall data. However, we note that for some emotions, some talkers, and younger child listeners, floor effects were reached in the 8-channel condition.
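The one-sample t-tests against chance (d’ = 0) reported above follow the usual formula t = (mean − μ0) / (s / √n) with n − 1 degrees of freedom. The sketch below is a minimal illustration with a hypothetical function name; it returns only the statistic and degrees of freedom and does not compute a p value.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0=0.0):
    """One-sample t statistic for the null hypothesis that the mean equals mu0.

    values would be one d' score per listener (averaged across emotions)
    for a given talker; returns (t, degrees of freedom).
    """
    n = len(values)
    standard_error = stdev(values) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(values) - mu0) / standard_error, n - 1
```

The reported t(19) values imply that 20 listeners contributed to each of these tests.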

Figure 4A. Sensitivity to emotions (d’) in participants with normal hearing, plotted against their age (abscissa), separated by emotion (columns) and talker/speaking style (rows), and degree of spectral degradation (color). Lines show linear regression through the data.
Figure 4B. The same data as in Fig. 4A, but specifically comparing outcomes for Neutral and Scared emotions (yellow and blue respectively), and separated out by degree of spectral degradation (squares show 8 channel noise vocoded speech, circles show full spectrum speech). Lines show linear regression through the data.
An LME model was constructed with the data obtained in participants with normal hearing, with d’ as the dependent variable; age, talker, emotion, and degree of spectral degradation as fixed effects; and subject-based random intercepts. Cognition, Vocabulary, and SES did not contribute significantly to the model fit and were therefore excluded. Emotion categories were coded with the Neutral emotion as the reference. Talker categories were coded with CDS_F1 as the reference. The model with full interactions accounted for 83.87% of the variance (marginal R2) for fixed effects, and for 86.58% of the variance (conditional R2) for the entire model. This was a better fit than the model without interactions, for which the Akaike Information Criterion was larger (i.e., poorer model fit) and the marginal proportion of the variance accounted for was also substantially smaller.
Results (Table 6) showed a significant main effect of degree of spectral degradation (better performance in the Full-Spectrum condition relative to the 8-channel condition). This is clearly observed in Fig. 4A, which shows the 8-channel data falling well below the Full-Spectrum data across emotions and talkers. A four-way interaction between age, emotion (difference between Neutral and Scared), talker (difference between CDS_F1 and CDS_M1), and degree of spectral degradation was observed. This interaction can be seen in Fig. 4B. It is evident that the primary differences between the emotions occur in the Full-Spectrum condition, while the d’ values overlap much more in the 8-channel case. The age-related changes in identification of the emotions are also more evident in the Full-Spectrum condition, and the slopes vary more across talkers in this condition than in the 8-channel condition (evident in both Figs. 4A and 4B). Comparing the regression lines for the Neutral and Scared emotions for the CDS_F1 and CDS_M1 talkers in the Full-Spectrum and 8-channel conditions, it is apparent that i) there is not much difference between talkers and emotions in the 8-channel condition, but there are differences in the Full-Spectrum condition; and ii) the regression line for the Neutral emotion in the Full-Spectrum condition is steeper for CDS_M1 than for CDS_F1, and the reverse is true for the Scared emotion. Part of the reason for the lack of variation with talker and emotion in the 8-channel condition could be the low mean scores in this condition.
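The distinction between marginal and conditional R2 can be made concrete with a short sketch based on the commonly used variance-partitioning definitions for mixed models (as implemented, for example, by `r.squaredGLMM` in MuMIn; whether that specific function was used here is an assumption). The function name and the variance-component arguments are hypothetical.

```python
def r2_marginal_conditional(var_fixed, var_random, var_residual):
    """Return (marginal R2, conditional R2) from variance components.

    var_fixed:    variance of the fixed-effects predictions
    var_random:   variance attributed to the random intercepts
    var_residual: residual variance
    """
    total = var_fixed + var_random + var_residual
    marginal = var_fixed / total                    # fixed effects only
    conditional = (var_fixed + var_random) / total  # fixed plus random effects
    return marginal, conditional
```

Under these definitions, the reported values (83.87% marginal, 86.58% conditional) would imply that the subject-based random intercepts account for roughly 2.7% of the variance, with about 13.4% left as residual.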
Table 6.
Results of LME analyses on normally hearing participants’ data. Only significant effects are reported for brevity.
| Effect | β (s.e.) | t (df) | p |
|---|---|---|---|
| Degree of Spectral Degradation (ref: 8-channel) | 5.436 (1.078) | 5.043 (769.616) | <0.001 |
| Emotion (ref: Angry) x Degree of Spectral Degradation (ref: 8-channel) | | | |
| Scared | −4.010 (1.523) | −2.634 (768.863) | 0.009 |
| Sad | −3.016 (1.523) | −1.981 (768.863) | 0.047 |
| Age x Emotion (ref: Angry) x Degree of Spectral Degradation (ref: 8-channel) | | | |
| Age:Scared:Full-Spectrum | 0.248 (0.116) | 2.133 (768.863) | 0.033 |
| Talker (ref: CDS_F1) x Emotion (ref: Angry) x Degree of Spectral Degradation (ref: 8-channel) | | | |
| CDS_M1:Scared:Full-Spectrum | 4.478 (2.154) | 2.079 (768.863) | 0.038 |
| Age x Talker (ref: CDS_F1) x Emotion (ref: Angry) x Degree of Spectral Degradation (ref: 8-channel) | | | |
| Age:CDS_M1:Scared:Full-Spectrum | −0.4721 (0.165) | −2.868 (768.863) | 0.004 |
Degradation reduces sensitivity (d’) to talker- and emotion-based variability
The analyses thus far have used a categorical scheme in which CDS_F1 was treated as the reference talker and performance with the other talkers’ materials was compared to performance with this reference. Similarly, performance with all emotions was referenced to performance with the Neutral emotion. The results suggested a pattern in which the normally hearing listeners’ d’ scores varied more with how talkers communicate emotions in the Full-Spectrum condition than in the 8-channel condition, and more than those of CI listeners. This was investigated in greater depth in a second analytical approach in which we considered only variability based on talker and emotion for each of the two populations.
The d’ scores obtained in children with normal hearing failed the Shapiro-Wilk test for normality, so multiple nonparametric comparisons were completed using Dunn’s test (Holm’s adjustment) on the normally hearing participants’ data for differences between talkers in the Full-Spectrum condition. Results showed significant differences in d’ for four of the six pairwise talker comparisons (p<0.001 in all cases); the two exceptions were that d’ values for CDS_F1 and CDS_M1 were not significantly different from one another, and d’ values for ADS_F2 and ADS_M2 were not significantly different from one another. Similar tests on the normal hearing data for talker differences in the 8-channel condition showed significant differences in d’ between all talker pairs (p<0.05 or better in all cases) except for CDS_M1 vs. ADS_M2 and CDS_M1 vs. ADS_F2.
Parallel tests on the CI listeners’ d’ scores (Full-spectrum condition) showed significant differences between all talker pairs (p<0.001 or better in all cases). These analyses show that listeners in the 8-channel condition were least sensitive to differences between talkers, while CI listeners were surprisingly sensitive to talker-based differences.
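Holm’s adjustment, used with Dunn’s test in the talker comparisons above, is a step-down procedure: the smallest of m p values is multiplied by m, the next by m − 1, and so on, with the adjusted values forced to be non-decreasing and capped at 1. The sketch below illustrates only this adjustment step; the function name is hypothetical, and the raw p values from Dunn’s test are assumed to be available already.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

With six pairwise talker comparisons, for instance, the smallest raw p value is multiplied by 6 and the largest is left unchanged unless a smaller comparison has already produced a larger adjusted value.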
Within each talker and degree of spectral degradation, Dunn tests were conducted for differences in d’ scores obtained between emotions. For normally hearing listeners in the Full-Spectrum condition, the CDS_F1 talker’s materials showed different performance for Angry vs. Neutral and Happy vs. Neutral emotions (p=0.010 for both). No differences between emotions were found in d’s for the CDS_M1 talker’s materials. The ADS_F2 talker’s materials showed a significant difference between Scared and all other emotions (p=0.007 or better in all cases). The ADS_M2 talker’s materials showed significant differences in d’ between Angry and Neutral, Angry and Sad, Happy and Neutral, Happy and Sad, Scared and Neutral, and Scared and Sad emotions (p=0.046 or better in all cases). Thus, a total of 12 instances of significant differences in d’ between pairs of emotions were observed across all talkers.
For normally hearing listeners in the 8-channel condition, the CDS_F1 materials produced no significant differences between emotions. The CDS_M1 materials produced a significant difference between Angry and Sad (p=0.032) and between Angry and Scared (p=0.015) emotions. The ADS_F2 materials produced significant differences between Angry and Neutral (p=0.011), Angry and Scared (p<0.001), Happy and Scared (p=0.020) and Sad and Scared (p<0.001) emotions. No significant differences in d’s were observed between emotions for the ADS_M2 talker. Thus, a total of 6 instances of significant differences in d’ were observed between pairs of emotions across all talkers.
The CI listeners showed significant differences between emotions for all talkers except for CDS_M1. For the CDS_F1 talker, significant differences were observed between Angry and Neutral, Happy and Neutral, Angry and Sad, Happy and Sad, Angry and Scared, Happy and Scared, and Neutral and Scared emotions (p=0.028 or better in all cases). For the CDS_M1 talker, no significant differences were observed across emotions. For the ADS_F2 talker, significant differences were observed between Scared and all other emotions (p<0.009 in all cases) and between Angry and Neutral emotions (p=0.03). For the ADS_M2 talker, significant differences were observed between the Angry and Neutral, Happy and Neutral, and Neutral and Scared emotions (p=0.03 or better). Thus, a total of 15 instances of significant differences were observed in pairwise comparisons of emotions across all talkers.
This analysis shows that the performance of both normally hearing listeners and CI listeners varied strongly with talker- and emotion-based variability in the full-spectrum condition. In contrast, the performance of normally hearing listeners in the 8-channel condition showed minimal changes with talker- and emotion-based variations. We note that the 8-channel condition produced low scores in general, and this may have contributed to the small effects of talker and emotion in this condition.
DISCUSSION
The primary goals of this study were to compare the vocal emotion sensitivity of children with CIs with that of children with normal hearing, and to investigate predictive factors explaining intersubject variation in these two populations. We found that both groups showed sensitivity to talkers and poorer sensitivity to emotions spoken in an adult-directed manner than in a child-directed manner. Older children generally showed better performance than younger children in both groups. Children with normal hearing attending to CI-simulated speech showed significantly poorer performance than with full-spectrum speech. Children with CIs attending to full-spectrum speech showed significant deficits compared to peers with normal hearing, and their d’ scores broadly fell between those of children with normal hearing listening to 8-channel CI simulations and those of children with normal hearing listening to full-spectrum speech. The performance of children with normal hearing was not predicted by their SES, cognitive status, or vocabulary. On the other hand, CI participants’ data showed a different pattern in which the children with less experience with the device benefited more from specific talkers and emotions when their nonverbal cognition was higher, while their peers with lower cognition scores did not show as much benefit. CI participants’ data were not predicted by their SES or their vocabulary. Test site (BTNRH or UCSF) was not a predictor of performance.
These findings indicate that children with CIs are at a considerable disadvantage relative to their counterparts with normal hearing where vocal emotion identification is concerned. The CI participants also showed considerable sensitivity to talker and emotion variations compared to counterparts with normal hearing attending to noise-vocoded speech. This suggests that experience with the device might be increasing CI participants’ sensitivity to differences between the prosodic cues associated with different emotions and talkers. On the other hand, participants with normal hearing had no prior experience with CI-degradation and did not show the same sensitivity to talkers and emotions with the noise-vocoded speech. Additionally, the eight-channel noise vocoding may result in greater degradation than actual CI recipients experience, which would account for the reduced sensitivity in the participants with normal hearing. Improvements with increasing device experience (and age) were strongly observed in CI participants’ performance, underscoring the benefits of experience with the device. Nonverbal cognition played a role, interacting with age/hearing age, such that children with CIs who had less experience with the device (younger hearing age) benefited more from higher nonverbal cognition, while children with more experience showed little effect of cognition.
Participants with normal hearing also showed strong age-related improvements in performance in our task, with older participants showing better performance than younger participants. We were interested in comparing the present data in the participants with normal hearing with our previous findings in similar groups, also obtained with 8-channel noise-vocoded speech (Chatterjee et al., 2015; Tinnemore et al., 2018). The previous data were obtained with the child-directed stimuli only, and also differed in another way: the temporal envelope was low-pass filtered with a corner frequency of 400 Hz, rather than 160 Hz as in the present study. This change, although bringing the simulation closer to real-world CI processors, is expected to impair performance by limiting voice pitch cues even further. Comparing the present data with those presented in previous studies, we note two differences. First, as expected, the overall performance of the children with normal hearing was poorer in the noise-vocoded condition than in our previous studies; Fig. 5A shows a comparison of the 8-channel noise-vocoded data obtained in the Tinnemore et al. (2018) study and the present study. Second, both our previous studies found significant positive effects of age in the 8-channel noise-vocoded condition, but effects of age were not evident in the present study (Fig. 5B).

Figure 5A. Boxplots showing the differences in emotion recognition accuracy (percent correct scores) obtained in participants with normal hearing in the present study and those in Tinnemore et al. (2018). Scores are averaged across talkers (CDS_F1 and CDS_M1 only) and emotions.
Figure 5B. Age effects in the two studies compared.
Comparing the present data with the findings of Tinnemore et al. (2018), we observe a third difference: they found a significant beneficial effect of nonverbal cognition in their study, while we do not. Tinnemore et al.’s analysis was performed on accuracy scores, averaged across emotions, and for the noise-vocoded conditions only. When we repeated analyses on accuracy scores obtained in the current 8-channel data, no effects of age and no effects of cognition were found. Similar findings were obtained with d’ as the dependent variable. This suggests that when the pitch cues are severely restricted, the beneficial effects of cognition and age are lost.
The data obtained in the normally hearing children in the present study and in the study by Tinnemore et al. (2018) provide an informative comparison against the present data obtained in children with CIs. The noise-vocoded condition in the present study cannot be directly compared with the children with CIs, as the children with normal hearing had no prior experience with noise-vocoding or similar degradation. The children with CIs showed significant benefit of increasing age and cognition, while the children with normal hearing showed no effect of either in the vocoded condition. On the other hand, the children with normal hearing participating in the Tinnemore et al. study were more similar to the CI children in the present study, showing a beneficial effect of both age and cognition in the 8-channel vocoded condition. This suggests that, for children with CIs, at least one factor contributing to age-related improvements in vocal emotion recognition may be the degree of degradation in the signal received from their device. If neural degeneration or excessive channel interaction result in poor coding of the temporal envelope periodicity cue (as in the difference between the Tinnemore et al. study and the present study), performance may not only be poorer at the outset, but the child may benefit less from increasing experience and from how their general intelligence benefits their experience-driven learning. The comparison between the two studies also underscores the importance of the temporal-envelope periodicity in vocal emotion perception under conditions of CI processing. 
The importance of this F0-related cue was also demonstrated in a study of sex identification by Fu et al. (2005), who showed poorer performance as the temporal envelope cutoff frequency was reduced in spectrally degraded conditions, and in a study of complex pitch discrimination by Deroche et al. (2014), who showed a significant correlation between temporal envelope sensitivity and harmonic-pitch discrimination in children with CIs, but not in children with normal hearing.
Patterns of performance by the children with normal hearing and the children with CIs were fairly consistent in the full-spectrum condition, despite some differences. Thus, the CDS_F1 talker tended to be best identified by both groups. The children with normal hearing showed greater sensitivity to differences between talkers and emotions than their peers with CIs in the full-spectrum condition. The two groups showed similar rates of improvement in performance with age and with hearing age, suggesting a similar trajectory of developmental benefits. However, general cognition played a significant role in the CI group, but not in the children with normal hearing. Although the group with normal hearing showed better cognitive outcomes than the CI group, this suggests that differences in cognition between the two groups do not mediate differences in their emotion recognition outcomes.
A number of participants in the CI group were also included in the Barrett et al. (2020) study. Findings are generally consistent, as expected: although analysis methods differed and the pool of participants was much larger in the present study, both studies showed better performance with the child-directed than the adult-directed stimuli, improvement with duration of device experience, and performance differences based on differences in talkers and emotions. Neither study found a significant effect of age at implantation. We note that our measure of age at implantation was based on parental/caregiver report, and although we attempted to only include participants with no usable residual hearing at birth, this was also dependent on parent/caregiver reports. These limitations should be addressed in future studies. A further limitation is the lack of a measure of sensitivity to the acoustic cues involved in emotional prosody. Psychoacoustic measures that may be relevant (e.g., static and dynamic fundamental frequency discrimination, duration sensitivity, and intensity resolution) are time-consuming and challenging to obtain in pediatric populations. Currently available and validated measures of spectral modulation sensitivity, which are relevant to speech perception, are not as relevant to fundamental frequency processing with cochlear implants. Temporal envelope sensitivity is likely to be more relevant, but a rapid, validated test that is feasible for use with pediatric populations is yet to be developed. Unfortunately, simple measures such as audibility-related indices are not relevant to prosody perception in the CI population, as hearing thresholds are not a primary limitation in CI listeners. Rather, it is the suprathreshold sensitivity to the acoustic cues of interest that is likely to be more predictive of their performance in various speech-related tasks.
To summarize, the present results confirm previous findings that school-age children with CIs are able to benefit from increasing experience with their device (and concurrent increase in their age), and that their nonverbal cognition is linked to their performance in identifying vocal emotions. We extend previous findings by showing interactions between hearing age/device experience and nonverbal cognition, such that children with younger hearing age showed stronger protective effects of cognition than children with older hearing age, specifically in their ability to benefit from exaggerated prosody provided by some talkers and for some emotional contrasts. We found that children with CIs were sensitive to variations among talkers and emotions in a way that was generally consistent with their peers with normal hearing. Children with normal hearing showed excellent performance with our stimuli in the full-spectrum condition, but their performance in the 8-channel noise-vocoded condition was quite poor. Further, they showed no benefit from increasing age or cognitive status in their ability to identify emotions from the noise-vocoded materials. Comparison with previous work by Tinnemore et al. (2018) (which showed significant effects of both age and cognition in children with normal hearing attending to degraded speech which provided more temporal-envelope pitch information than the present study) suggests that the temporal envelope periodicity cue is crucial for children’s vocal emotion identification in degraded stimuli.
Funding Sources:
This work was supported in part by the following NIH grants: R01 DC014233 (PI: MC), P20 GM109023 (PI: Dr. Lori Leibold), and R01 DC019943 (PI: MC, cofunded by the National Institute on Deafness and Other Communication Disorders and The Office of Behavioral and Social Sciences Research).
Footnotes
Conflicts of Interest: None
REFERENCES
- Banse R & Scherer KR (1996). Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol 70(3), 614–636.
- Barrett KC, Chatterjee M, Caldwell MT, Deroche ML, Jiradejvong P, Kulkarni AM, & Limb CJ (2020). Perception of child-directed versus adult-directed emotional speech in pediatric cochlear implant users. Ear Hear, 41(5), 1372–1382.
- Barton K (2022) MuMIn: Multi-Model Inference. R package version 1.46.0. https://CRAN.R-project.org/package=MuMIn
- Bates D, Mächler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/jss.v067.i01.
- Ben-David BM, Multani N, Shakuf V, Rudzicz F, & van Lieshout PH (2016). Prosody and semantics are separate but not separable channels in the perception of emotional speech: Test for rating of emotions in speech. J. Sp. Lang. Hear Res 59(1), 72–89.
- Cannon SA & Chatterjee M (2019). Voice emotion recognition by children with mild-to-moderate hearing loss. Ear Hear. 40(3), 477–492.
- Chatterjee M, & Peng SC (2008). Processing F0 with cochlear implants: Modulation frequency discrimination and speech intonation recognition. Hearing Research, 235(1–2), 143–156.
- Chatterjee M, Zion DJ, Deroche ML, Burianek BA, Limb CJ, Goren AP, Kulkarni AM, & Christensen JA (2015). Voice emotion recognition by cochlear-implanted children and their normally-hearing peers. Hear. Res 322, 151–162.
- Deroche ML, Kulkarni AM, Christensen JA, Limb CJ, & Chatterjee M (2016). Deficits in the sensitivity to pitch sweeps by school-aged children wearing cochlear implants. Front. Neurosc 10, 73.
- Deroche ML, Lu HP, Limb CJ, Lin YS, & Chatterjee M (2014). Deficits in the pitch sensitivity of cochlear-implanted children speaking English or Mandarin. Front. Neurosc 8, 282.
- Dunn LM, Dunn DM, & Pearson Assessments. (2007). PPVT-4: Peabody picture vocabulary test. Minneapolis, MN: Pearson Assessments.
- Everhardt MK, Sarampalis A, Coler M, Başkent D, & Lowie W (2020). Meta-analysis on the identification of linguistic and emotional prosody in cochlear implant users and vocoder simulations. Ear Hear 41(5), 1092–1102.
- Fengler I, Nava E, Villwock AK, Büchner A, Lenarz T, & Röder B (2017). Multisensory emotion perception in congenitally, early, and late deaf CI users. PloS one, 12(10), e0185821.
- Fu QJ, Chinchilla S, Nogaki G, & Galvin III JJ (2005). Voice gender identification by cochlear implant users: The role of spectral and temporal resolution. J. Acoust. Soc. Am, 118(3), 1711–1718.
- Gaudrain E, Başkent D. (2018) Discrimination of Voice Pitch and Vocal-Tract Length in Cochlear Implant Users. Ear Hear. 39(2):226–237.
- Geers AE, Nicholas JG, & Sedey AL (2003). Language skills of children with early cochlear implantation. Ear Hear 24(1), 46S–58S.
- Gilbers S, Fuller C, Gilbers D, Broersma M, Goudbeek M, Free R, & Başkent D (2015). Normal-hearing listeners’ and cochlear implant users’ perception of pitch cues in emotional speech. i-Perception, 6(5), 0301006615599139.
- Grossmann T The development of emotion perception in face and voice during infancy. Restor. Neurol. & Neurosc, 28(2), 219–236.
- Hollingshead AB (1957). Two Factor Index of Social Position. Mimeo. New Haven, Connecticut: Yale University.
- Hopyan-Misakyan TM, Gordon KA, Dennis M, & Papsin BC (2009). Recognition of affective speech prosody and facial affect in deaf children with unilateral right cochlear implants. Child Neuropsychology, 15(2), 136–146.
- Kim J, Toutios A, Lee S, & Narayanan SS (2020). Vocal tract shaping of emotional speech. Computer Speech & Language, 64, 101100.
- Kuznetsova A, Brockhoff PB, Christensen RHB (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 82(13), 1–26.
- Lin YS, Wu CM, Limb CJ, Lu HP, Feng IJ, Peng SC, Deroche MLD, Chatterjee M. (2022) Voice emotion recognition by Mandarin-speaking pediatric cochlear implant users in Taiwan. Laryngoscope Investig Otolaryngol. 7(1):250–258.
- Luo X (2016) Talker variability effects on vocal emotion recognition in acoustic and simulated electric hearing. J Acoust Soc Am. 140(6):EL497. doi: 10.1121/1.4971758.
- Luo X, Fu QJ, & Galvin JJ 3rd. (2007). Cochlear implants special issue article: Vocal emotion recognition by normal-hearing listeners and cochlear implant users. Trends Amplif. 11(4), 301–315.
- Macmillan NA & Creelman CD (2004) Detection theory: A user’s guide. 2nd Edition. Hillsdale, NJ: Erlbaum.
- Magezi DA (2015). Linear mixed-effects models for within-participant psychology experiments: an introductory tutorial and free, graphical user interface (LMMgui). Frontiers in Psychology, 6, 2.
- Mastropieri D, & Turkewitz G (1999). Prenatal experience and neonatal responsiveness to vocal expressions of emotion. Developmental Psychobiology: J. Intl. Soc. Dev. Psychobiol, 35(3), 204–214.
- Nagels L, Gaudrain E, Vickers D, Lopes MM, Hendriks P, and Başkent D (2020). Development of vocal emotion recognition in school-age children: The EmoHI test for hearing-impaired populations. PeerJ 8, e8773.
- Nilsson M, Soli SD, & Sullivan JA (1994). Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Am 95(2), 1085–1099.
- Oller DK, Buder EH, Ramsdell HL, Warlaumont AS, Chorna L, & Bakeman R (2013). Functional flexibility of infant vocalization and the emergence of language. Proc. Nat. Acad. Sci, 110(16), 6318–6323.
- Palama A, Malsert J, & Gentaz E (2018). Are 6-month-old human infants able to transfer emotional information (happy or angry) from voices to faces? An eye-tracking study. PloS one, 13(4), e0194579.
- R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
- Richter ME, & Chatterjee M (2021). Weighting of prosodic and lexical-semantic cues for emotion identification in spectrally-degraded speech and with cochlear implants. Ear Hear, 42(6), 1727.
- Shannon RV, Zeng FG, Kamath V, Wygonski J, & Ekelid M (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
- Stevenson R, Sheffield SW, Butera IM, Gifford RH, & Wallace M (2017). Multisensory integration in cochlear implant recipients. Ear Hear., 38(5), 521.
- Tinnemore AR, Zion DJ, Kulkarni AM, & Chatterjee M (2018). Children’s recognition of emotional prosody in spectrally degraded speech is predicted by their age and cognitive status. Ear Hear. 39(5), 874–880.
- Van De Velde DJ, Schiller NO, Levelt CC, Van Heuven VJ, Beers M, Briaire JJ, & Frijns JH (2019). Prosody perception and production by children with cochlear implants. J.Child Lang, 46(1), 111–141.
- Volkova A, Trehub SE, Schellenberg EG, Papsin BC, & Gordon KA (2013). Children with bilateral cochlear implants identify emotion in speech and music. Cochlear Implants International, 14(2), 80–91.
- Wang H, Wang Y, & Hu Y (2019). Emotional understanding in children with a cochlear implant. The Journal of Deaf Studies and Deaf Education, 24(2), 65–73.
- Wang Y, Su Y, & Yan S (2016). Facial expression recognition in children with cochlear implants and hearing aids. Frontiers in psychology, 7, 1989.
- Wechsler D (2011). Wechsler Abbreviated Scale of Intelligence – Second Edition (WASI-II). San Antonio, TX: NCS Pearson.
- Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag; New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
- Wiefferink CH, Rieffe C, Ketelaar L, De Raeve L, & Frijns JH (2013). Emotion understanding in deaf children with a cochlear implant. J. Deaf Studies & Deaf Ed, 18(2), 175–186.
