PLOS One
. 2023 Apr 5;18(4):e0283635. doi: 10.1371/journal.pone.0283635

Emotional tones of voice affect the acoustics and perception of Mandarin tones

Hui-Shan Chang 1,2,3, Chao-Yang Lee 2,4, Xianhui Wang 4, Shuenn-Tsong Young 5, Cheng-Hsuan Li 3, Woei-Chyn Chu 1,*
Editor: Yiu-Kei Tsang
PMCID: PMC10075469  PMID: 37018230

Abstract

Lexical tones and emotions are conveyed by a similar set of acoustic parameters; therefore, listeners of tonal languages face the challenge of processing lexical tones and emotions in the acoustic signal concurrently. This study examined how emotions affect the acoustics and perception of Mandarin tones. In Experiment 1, Mandarin tones were produced by professional actors with angry, fear, happy, sad, and neutral tones of voice. Acoustic analyses on mean F0, F0 range, mean amplitude, and duration were conducted on syllables excised from a carrier phrase. The results showed that emotions affect Mandarin tone acoustics to different degrees depending on specific Mandarin tones and specific emotions. In Experiment 2, selected syllables from Experiment 1 were presented in isolation or in context. Listeners were asked to identify the Mandarin tones and emotions of the syllables. The results showed that emotions affect Mandarin tone identification to a greater extent than Mandarin tones affect emotion recognition. Both Mandarin tones and emotions were identified more accurately in syllables presented with the carrier phrase, but the carrier phrase affected Mandarin tone identification and emotion recognition to different degrees. These findings suggest that lexical tones and emotions interact in complex but systematic ways.

Introduction

Speech conveys more than the linguistic message intended by a speaker. It provides information about the speaker such as physical characteristics, regional accent, and emotional state. Since multiple sources of information often converge on the same acoustic parameters, the two fundamental questions are how the different sources of information contribute to speech acoustics, and how listeners disentangle these sources of information during speech perception. In this study, we investigated the relationship between emotional tones of voice (emotions hereafter) and lexical tones by examining how four common emotions shape the acoustic characteristics of Mandarin tones, and how the emotions affect the perception of Mandarin tones.

Emotional tone is defined as the vocal expression of emotion, which conveys a speaker’s affective states, motivational states, or intended emotions [1–5]. The primary acoustic correlates of vocal emotions include fundamental frequency (F0), mean amplitude, and duration [1, 2, 6]. Previous research showed that F0 is the primary acoustic correlate of emotions [7–10], whereas amplitude and duration serve as secondary cues [11, 12]. Importantly, F0 and amplitude are highly correlated with each other [8].

Lexical tones are used to distinguish words in tonal languages. In Mandarin, segmentally identical words can be distinguished on the basis of F0 height or contour. For example, the syllable /ba/ means “eight”, “uproot”, “grip”, or “father” with Tone 1 (a high-flat tone), Tone 2 (mid-rising), Tone 3 (mid-falling-rising), or Tone 4 (high-falling), respectively. The primary acoustic correlate of lexical tones is F0 [13]. Amplitude and duration also vary systematically among Mandarin tones [13–15], and both contribute to Mandarin tone perception as secondary cues [16–19]. However, F0 remains the most powerful cue for the perception of Mandarin tones [19–21].

Since the acoustic characteristics most relevant for lexical tones coincide with those for emotions, the convergence raises the question of how emotions affect the acoustics and perception of lexical tones. A tonal language like Mandarin offers a unique opportunity to examine this question.

Theories of emotion

The two approaches to the analysis of emotion are the dimensional theory of emotion and the theory of basic emotions [22]. They differ in whether emotions are described as positions along independent dimensions [23] or as discrete entities [24]. In the dimensional approach, Russell (1980) [23] proposed a circumplex model of emotion, in which each emotion can be arranged on a circle defined by two orthogonal dimensions: valence and arousal [25–28]. The position of each emotion within the quadrants reflects its degree of valence and arousal [27, 29]. The valence dimension is associated with a person’s subjective feeling, ranging from displeasure to pleasure. The arousal dimension is associated with the energy of a person’s subjective feeling, ranging from sleep to excitement [28].

The theory of basic emotions suggests that human emotions are composed of a limited number of basic emotions [30]. Each basic emotion has its own proprietary neural circuits, which are structurally distinct [24, 25, 31, 32]. Although the idea of basic emotions is commonly accepted, there is no consensus on their exact number. Plutchik (1962) [33] proposed eight primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust, and joy). Ekman (1992) [24, 34] proposed seven basic emotions (fear, anger, joy, sadness, contempt, disgust, and surprise), but later reduced the list to six (happiness, anger, sadness, fear, disgust, and surprise). Izard [35] proposed seven basic emotions (fear, anger, happiness, sadness, disgust, interest, and contempt). Recent studies examining facial expressions, neural mechanisms, and brain imaging suggest that the number of basic emotions could be further reduced to four (fear, anger, joy, and sadness) [36–40]. As an exploratory study of the tone-emotion relationship, the current study adopts the framework of four basic emotions (anger, fear, happiness, and sadness).

How emotions affect speech acoustics

There is ample evidence that different emotions result in distinct acoustic characteristics [2, 5–7, 41–54]. Physiologically, the sympathetic nervous system is aroused by emotions such as anger, fear, or happiness, resulting in a higher heart rate and blood pressure, a dry mouth, and occasionally muscle tremors [55, 56]. Consequently, speech is loud, fast, and has intense high-frequency energy. On the other hand, sadness arouses the parasympathetic nervous system: heart rate and blood pressure decrease, salivation increases, and speech is produced slowly and with little high-frequency energy. These physiological changes are reflected in amplitude, energy distribution across the frequency spectrum, frequency of pauses, and duration. For example, the higher arousal associated with excitement, fear, and anger has been shown to generate a higher mean F0 [7, 9, 57], a higher mean amplitude [58–61], and a shorter duration [56].

The influence of specific emotions on speech acoustics varies across studies. When compared with a neutral tone of voice, an angry voice has a higher mean F0, a wider or similar F0 range, a higher mean amplitude, and a shorter duration [2, 5–7, 41–51, 54]. A fearful voice shows a higher mean F0; a narrower, wider, or similar F0 range; a higher or lower mean amplitude; and a shorter duration [2, 5–7, 41–51, 54]. A happy voice has a higher mean F0, a wider or similar F0 range, a higher or equal mean amplitude, and a shorter or longer duration [2, 5–7, 41–54]. A sad voice has a lower or similar mean F0; a narrower, wider, or similar F0 range; a lower mean amplitude; and a longer duration [2, 5–7, 41–54]. In sum, some emotions have fairly consistent acoustic features, whereas others are more variable. The variability, however, is consistent with the idea that emotion is sociocultural in nature, i.e., there are cross-linguistic and cross-cultural differences in the acoustic manifestation of emotions [62]. It is also consistent with the observation that features of emotions vary across speakers, sexes, and contexts [63].

Several studies compared the acoustic characteristics of emotions between tonal and non-tonal languages. Ross, Edmondson, and Seibert (1986) [64] examined the acoustic characteristics of neutral, happy, sad, angry, and surprised tones of voice in Thai, Taiwanese, and Mandarin (all tonal languages) and English (a non-tonal language). They found greater F0 variations in English than in the tonal languages, suggesting that non-tonal languages have a greater degree of freedom in using F0 to convey emotions. In contrast, no significant difference was found in duration or amplitude between the tonal and non-tonal languages.

Anolli, Wang, Mantovani, and De Toni (2008) [65] investigated acoustic differences among happy, sad, angry, fearful, scornful, prideful, guilty, and shameful emotions in Mandarin and Italian (a non-tonal language). They found that emotions were characterized by significant variations in F0 and amplitude for Italian but not for Mandarin. In contrast, duration varied significantly among the emotions for Mandarin but not for Italian. Since Italian is a syllable-timed language, the smaller variation in syllable duration in Italian may also reflect its language-specific prosodic structure.

Wang, Lee, and Ma (2016, 2018) [46, 66] examined acoustic correlates of angry, fear, happy, sad, and neutral emotions in Mandarin and English. Semantically-neutral declarative sentences were embedded in different contexts to elicit angry, fear, happy, sad, and neutral emotions. Comparable English sentences were constructed with a direct translation of the Mandarin sentences. Acoustic analysis showed that F0 variations among the emotions were significantly greater in English than in Mandarin. In contrast, duration variations were significantly greater in Mandarin than in English.

Studies using other tonal languages also show more restricted F0 variations for emotions in a tonal language (Chong, Kim, and Davis, 2015 [67] for Cantonese; Luksaneeyanawin, 1998 [68] for Thai). To our knowledge, the only exception to this pattern is Li, Jia, Fang, and Dang (2013) [69], who showed greater F0 variations associated with emotions in Mandarin than in Japanese, a non-tonal language that uses lexical pitch accent extensively.

Regarding the effect of emotions on specific lexical tones, Chao (1933) [70] noted that Mandarin uses successive additional tones and edge tones to implement the intonation for emotions (see Liang and Chen, 2019 [71], for an illustration). Li, Fang, and Dang (2011) [44] examined how emotions affect the F0 and duration of Mandarin utterances ranging from one to fourteen syllables. The results showed that anger and disgust were associated with an additional falling tone, and happiness and surprise were associated with an additional rising tone. Non-neutral emotions resulted in a different F0 range, register, contour, or duration. For example, happiness and surprise were associated with a higher F0 range and higher register, whereas sadness and disgust were associated with a reduced F0 range and lower register.

In sum, most studies comparing tonal and non-tonal languages show that F0 variations associated with emotions are greater in non-tonal languages. This suggests that lexical tones constrain the availability of F0 for emotions in tonal languages. In contrast, amplitude or duration variations associated with emotions appear to be greater in tonal languages [14, 64, 65], suggesting that amplitude or duration may be used to compensate for the restricted use of F0 in conveying emotions. Studies on Mandarin further showed that emotions shape F0 and duration characteristics of Mandarin tones.

How emotions affect speech perception

The speech perception literature shows that emotions affect speech perception at various levels of processing. Mullennix, Bihon, Bricklemyer, Gaston, and Keener (2002) [1] examined how variations in emotion and talker voice affect spoken word recognition in English. They presented pairs of names (e.g., Todd-Tom) produced by either the same or different talkers, and with the same or different emotions. The participants’ task was to judge whether the names in a pair were the same or different. The results showed that variations in emotion slowed down judgments of both the names and the talker voices, indicating that emotion affected the perception of consonants and talker characteristics. Kitayama and Ishii (2002) [72] and Ishii et al. (2003) [73] presented words spoken in a pleasant or unpleasant tone of voice. While ignoring the emotional tone, listeners were asked to judge whether the word meaning was pleasant (e.g., grateful, satisfaction) or unpleasant (e.g., complaint, dislike). The results showed that emotion variations slowed down judgments of word meaning. Nygaard and Lunders (2002) [52] examined how emotions (happy, neutral, and sad) affect the perception of homophonic words (e.g., die/dye). They found that selection of word meaning was compromised by the emotion of the words. Nygaard and Queen (2008) [53] presented happy (e.g., cheer), sad (e.g., upset), or neutral (e.g., chair) words spoken with consistent, inconsistent, or neutral emotions. Listeners were asked to repeat the words they heard. The results showed that listeners responded more quickly when the meaning of the words matched the emotions.

Similarly, research on tonal languages shows the impact of emotions on speech perception, and the effect is further modulated by tonal language experience. Singh, Lee, and Goh (2016) [74] examined how changes in emotion and Mandarin tone affect consonant recognition, and how consonant changes affect emotion recognition and Mandarin tone identification. For Mandarin-speaking listeners, variations in Mandarin tone and emotion made consonant recognition less accurate. Consonant variations also made Mandarin tone identification and emotion recognition less accurate. Consonants and prosody (Mandarin tone and emotion) affected each other to the same extent. In contrast, for English-speaking listeners, consonant recognition was affected by prosodic variations to a different degree than prosody recognition was affected by consonant variations. That is, the effects of emotions and lexical tones on segmental perception depend on tonal language experience.

Liang and Chen (2019) [71] examined how emotions and tonal language experience affect Mandarin tone perception. Four Chinese pseudo words (i.e., mong, ging, ra, ) were created, each with four lexical tone variations. Each syllable-tone combination was embedded in the middle of a carrier phrase. The syllable immediately before the pseudo word was manipulated to create four tonal contexts (i.e., chu1, du2, xie3, lian4). For instance, mong1 was embedded in the carrier phrases (1) zhi3 chu1 [mong1] zhe4 ge4 zi4 “Please point out the word [mong1]”; (2) wo3 hui4 du2 [mong1] zhe4 ge4 zi4 “I can read the word [mong1]”; (3) wo3 hui4 xie3 [mong1] zhe4 ge4 zi4 “I can write the word [mong1]”; and (4) wo3 xiang3 lian4 [mong1] zhe4 ge4 zi4 “I want to practice the word [mong1].” All stimuli were produced with an angry, happy, sad, or neutral emotion. Mandarin listeners and Dutch-speaking learners of Mandarin were asked to identify the Mandarin tone of the pseudo words. The results showed that stimuli produced with the neutral emotion resulted in higher accuracy than those produced with non-neutral emotions. However, only the angry voice resulted in significantly lower accuracy relative to the neutral voice for both groups of listeners. In addition, Tone 4 was identified more accurately than Tone 1 in the angry voice.

In sum, research on speech perception shows that emotions affect the perception of segmental phonemes, talker voices, and lexical tones. The effect of emotion on lexical tone perception depends on both stimulus characteristics and tonal language experience. Particularly relevant to the current study, Liang and Chen’s (2019) [71] findings further demonstrated that emotions affect lexical tone perception to different degrees depending on specific Mandarin tones.

How emotions are perceived in speech

In addition to understanding emotion’s effect on speech perception, we also explore how emotions themselves are perceived in speech. A common approach to studying emotion recognition is to recruit professional actors to produce speech materials with different emotions. Listeners are then asked to identify the emotions of the stimuli [2, 4, 7, 50, 75–77]. Studies using non-tonal languages showed that sad and angry voices are easier to identify than fearful and happy voices [2, 4, 7, 50, 75–77]. Studies using Mandarin have reported similar findings: negative emotions such as sadness [41, 46, 66] and fear [43] are easier to identify than positive emotions such as happiness [41, 43]. It has been suggested that negative emotions are prioritized in vocal communication because they convey warnings in situations of attack, loss, and danger. Consequently, negative emotions need to be communicated more effectively to ensure human survival [43, 78, 79]. In contrast, positive emotions such as happiness are usually expressed through additional communication channels (e.g., facial expression), which may explain why a happy voice is identified with lower accuracy when only the vocal channel is used [7, 43, 50].

There is limited evidence on how lexical tones affect the perception of emotions. Wang, Ding, and Gu (2012) [80] investigated emotion recognition from Mandarin sentences by native and non-native speakers. Mandarin words with various tones (qi4che1 “car”, zhao4pian4 “picture”, xin1fang2 “new house”, dian4nao3 “computer”, and xue2xiao3 “school”) were embedded in a semantically neutral carrier phrase (zhe4 shi4 ta1 de0 [target word] “This is his [target word]”). The sentences were recorded with six emotions (happiness, fear, anger, sadness, boredom, and neutral) and presented to listeners for emotion recognition. The results showed that native listeners had an overall higher accuracy than non-native listeners, but both groups recognized sadness with the highest accuracy and boredom with the lowest accuracy. Since tones were not systematically manipulated, it is not clear whether these results can speak to the effect of lexical tone on emotion recognition.

Wang and Lee (2015) [41] and Wang and Qian (2018) [47] constructed sentences composed exclusively of a particular Mandarin tone (e.g., wang1 bin1 xing1qi1tian1 xiu1 fei1ji1 “Wang Bin fixed the airplane on Sunday”) or with a mixture of different tones (e.g., wo3 bu4gan3 xiang1xin4 zhe4 shi4 zhen1de0 “I cannot believe this is true”). The sentences were recorded with various emotions. It was hypothesized that emotions would be recognized less accurately in the Tone 1-only sentences because of the restricted F0 variation imposed by the (level) tone. An alternative hypothesis was that the restricted F0 variation in the Tone 1 sentences would allow emotions to surface more easily, thus facilitating emotion recognition. However, the results showed that emotions were recognized equally well regardless of tonal composition; that is, the restricted F0 variation imposed by the level tone neither compromised nor facilitated emotion recognition. Emotion recognition seems quite robust irrespective of specific lexical tones.

Benefit of context

The presence of context can alter a listener’s interpretation of a speech sound [81]. Such perceptual adaptation forms the basis of speaker normalization [82, 83]. In speech audiometry, a carrier phrase is typically included in a word recognition task to provide a cue for the listeners to focus their attention on the target words [8486]. Previous studies showed that word recognition accuracy is typically higher when embedded in a carrier phrase [84, 85, 87, 88]. The presence of a carrier phrase is particularly helpful under challenging listening conditions. For example, Lynn and Brotman (1981) [85] found that word identification in a carrier phrase was 10% more accurate than in isolation in the presence of speech-shaped noise. Since lexical tones produced with emotions are likely to deviate from the citation form, the presence of a context is likely to help listeners retrieve the intended tones more effectively.

In the current study, we examine Mandarin tone identification and emotion recognition in two contexts: when the target syllables are embedded in a carrier phrase (in context), and when the target syllables are extracted from the carrier phrase (in isolation). Note that the syllables in the “isolation” condition were not produced in isolation; rather, they were excised from the carrier phrase. We predict that Mandarin tone identification would be less accurate when the target syllables were extracted from the carrier phrase. This is because the citation form of a tone is likely to be altered due to the influence of neighboring tones [89]. Without the carrier phrase, it would be challenging for listeners to recover the tone. Furthermore, the carrier phrase provides information about talker characteristics such as speaking F0 range, which has been shown to facilitate tone identification from the multi-talker input [90]. We also predict that emotion recognition would be less accurate when the target syllables are presented in isolation. Since the talkers who recorded the stimuli were instructed to produce emotions for the entire utterance, the presence of emotions when preceded by a carrier phrase should facilitate emotion recognition from the target syllables.

The present study

The above review shows that both acoustics and perception of speech are shaped by emotions. Emotions affect Mandarin tone identification to different degrees depending on specific tones [71], but specific Mandarin tones do not seem to affect emotion recognition differently [41, 47]. To further clarify the interaction between lexical tones and emotions in speech, this study examines the acoustics and perception of Mandarin tones produced with various emotions, and the perception of emotions embedded in Mandarin tones. Following Liang and Chen (2019) [71], we use syllables produced with four Mandarin tones in a sentence-medial position. Extending Liang and Chen (2019) [71], we use multiple speakers of both sexes to record the stimuli. We also examine both the acoustics and perception of Mandarin tones and emotions. Finally, we examine the perception of Mandarin tones and emotions in two contexts: when the target syllables are presented with the carrier phrase, and when the target syllables are extracted from the carrier phrase.

Based on prior research, we predict that the acoustics of emotions would be affected by specific Mandarin tones. Liang and Chen’s (2019) [71] findings lead us to predict that the accuracy of Mandarin tone identification would be affected by emotions to different degrees depending on specific emotions and Mandarin tones in the stimuli. Following findings from Wang and Lee (2015) [41] and Wang and Qian (2018) [47], we predict that emotion recognition would remain robust irrespective of the specific Mandarin tones in the stimuli. Finally, we predict that Mandarin tone identification and emotion recognition would be less accurate when the target words are extracted from the carrier phrase.

Experiment 1

In this experiment, we investigated the acoustic characteristics of Mandarin tones produced with emotions by multiple talkers of both sexes. Anger (ANGRY hereafter), fear (FEAR hereafter), happiness (HAPPY hereafter), and sadness (SAD hereafter) were selected because they are considered the four basic emotions [36–40] (see the introduction for opposing views). The acoustic effects of these four emotions were evaluated relative to the neutral tone of voice (NEUTRAL hereafter). Four acoustic measures (mean F0, F0 range, mean amplitude, and duration) were chosen because they are the most relevant to both lexical tone and emotional tone distinctions.

Based on the literature reviewed, we expect that ANGRY would result in a higher mean F0, a wider or similar F0 range, a higher mean amplitude, and a shorter duration when compared to the NEUTRAL baseline. FEAR would result in a higher mean F0, a narrower, wider, or similar F0 range, a higher or lower mean amplitude, and a shorter duration. HAPPY would result in a higher mean F0, a wider or similar F0 range, a higher or equal mean amplitude, and a shorter or longer duration. SAD would result in a lower or similar mean F0, a narrower or wider or similar F0 range, a lower mean amplitude, and a longer duration. These predictions are summarized in Table 1. We also expect that the acoustic differences among the emotions would be modulated by specific Mandarin tones.

Table 1. Summary of predicted acoustic characteristics of the four emotions relative to the neutral emotion in Experiment 1.

The symbols >, <, and = indicate that an emotion is associated with a higher, lower, or comparable value, respectively, compared to the neutral emotion.

        Mean F0    F0 range         Mean amplitude    Duration
Angry   >          > or =           >                 <
Fear    >          > or = or <      > or <            <
Happy   >          > or =           > or =            > or <
Sad     < or =     > or = or <      <                 >

Method

Talkers

The use of human subjects in this study was reviewed and approved by the Institutional Review Board of National Yang Ming Chiao Tung University (IRB No. 1000063). Written informed consent was obtained from all talkers. No minors participated in this study. No medical records or archived samples were used in this study.

Eight professional actors (4 women and 4 men; mean age 32.4 ± 7.1 years) were recruited to record the speech materials. All were native speakers of Taiwan Mandarin with no reported history of speech, hearing, or language disorders. Each talker was compensated NT$1,600 (approximately US$54) per hour for their participation.

Speech materials

Three syllables /fa/, /ɕi/, and /pʰu/ with the four Mandarin tones were selected as target syllables, resulting in 12 syllable-tone combinations: [發/fa1/], [筏/fa2/], [髮/fa3/], [法/fa4/], [西/ɕi1/], [錫/ɕi2/], [洗/ɕi3/], [夕/ɕi4/], [鋪/pʰu1/], [葡/pʰu2/], [譜/pʰu3/], and [瀑/pʰu4/]. These syllables were chosen because (1) they included the three most common vowels in Taiwan Mandarin [91, 92], (2) all began with a voiceless or aspirated consonant to facilitate identification of the syllable onset, (3) all syllable-tone combinations were real words in Mandarin, and (4) the meanings of all syllable-tone combinations were emotionally neutral.

The 12 syllable-tone combinations were paired with five emotions (ANGRY, FEAR, HAPPY, SAD, and NEUTRAL) and embedded in a semantically neutral carrier phrase /ni3 ʂuo1 [target word] ts5/ “You say the word [target word]”. In the carrier phrase, the syllable following the target syllable began with a voiceless consonant /ts5/ to facilitate identification of syllable boundaries. The 60 word-emotion combinations were produced twice by eight talkers for a total of 960 stimuli.
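The stimulus arithmetic above (12 syllable-tone combinations × 5 emotions = 60; × 2 repetitions × 8 talkers = 960) can be checked with a short enumeration. This is an illustrative Python sketch; the pinyin labels stand in for the IPA syllables.

```python
# Enumerate the stimulus design: 3 syllables x 4 tones x 5 emotions.
from itertools import product

syllables = ["fa", "xi", "pu"]          # pinyin stand-ins for /fa/, /ɕi/, /pʰu/
tones = [1, 2, 3, 4]
emotions = ["ANGRY", "FEAR", "HAPPY", "SAD", "NEUTRAL"]

combinations = list(product(syllables, tones, emotions))
print(len(combinations))                 # 60 word-emotion combinations
print(len(combinations) * 2 * 8)         # 2 repetitions x 8 talkers = 960 stimuli
```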

Procedure

Speech recordings took place in a sound-treated booth in the Department of Biomedical Engineering of National Yang Ming Chiao Tung University with a GRAS Type 40AC microphone at 0-degree azimuth. The microphone was placed 30 centimeters from the participant’s mouth. The sampling rate was 44,100 Hz with 16-bit quantization. Before the recording, the first author discussed with the actors the emotional tones that they should aim for. The actors then completed the recording in the booth while being monitored by the first author.

The 120 stimuli were recorded in five blocks separated by emotion. The order of the blocks and the order of the stimuli within a block were randomized for each participant. Before the recording started, the participants were given 10 minutes to familiarize themselves with the stimuli. Breaks were given between blocks for the participants to adjust their emotions. Our goal was to elicit broad-focus productions, i.e., to distribute the prosodic change over the whole sentence instead of focusing narrowly on the target syllable. To that end, the participants were instructed to avoid pausing before and after the target syllables, and to avoid placing excessive emphasis on the target syllables.

Acoustic and statistical analysis

The recordings (except for the NEUTRAL stimuli) were rated by 30 native speakers of Taiwan Mandarin to evaluate how well the intended emotions were conveyed in the speech materials. The raters included 19 women and 11 men with ages ranging from 21 to 49 years (mean age 33.5 ± 7.8 years). For each stimulus, the raters were asked in a four-alternative forced-choice task to choose the emotion (ANGRY, FEAR, HAPPY, or SAD) that best represented the speech sample they heard. The raters also provided a score on a 5-point Likert scale indicating the degree of the match, with 1 being the worst match and 5 being the best match. Stimuli were chosen for acoustic analyses only if they were correctly identified by all 30 participants and if they received an average rating of 3.0 or above. All stimuli met both criteria and were included in the acoustic analyses.
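The two selection criteria can be expressed as a small filter. The following is a hypothetical sketch; the function and argument names are invented, not taken from the study's materials.

```python
# Stimulus-selection rule: keep a stimulus only if all 30 raters chose the
# intended emotion AND its mean rating on the 5-point scale is at least 3.0.
def keep_stimulus(choices, ratings, intended, n_raters=30):
    """choices: emotion labels from each rater; ratings: 1-5 match scores."""
    all_correct = len(choices) == n_raters and all(c == intended for c in choices)
    mean_rating = sum(ratings) / len(ratings)
    return all_correct and mean_rating >= 3.0

print(keep_stimulus(["ANGRY"] * 30, [4, 5, 3] * 10, "ANGRY"))           # True
print(keep_stimulus(["ANGRY"] * 29 + ["FEAR"], [5] * 30, "ANGRY"))      # False
```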

The acoustic analyses were performed with the Praat program [93]. Two landmarks were identified from the waveform: (1) the last glottal pulse of the syllable immediately before the target syllable, and (2) the last glottal pulse of the target syllable. The target syllable was then extracted based on these two landmarks. Four acoustic measures were taken from each target syllable: mean F0, F0 range, mean amplitude, and duration.
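Once a syllable is excised between the two landmarks, the four measures reduce to simple statistics over its F0 and amplitude tracks. The following is a minimal Python sketch under the assumption that per-frame tracks have already been extracted; the names and frame step are illustrative, not Praat's API.

```python
# Compute the four per-syllable measures from frame-level tracks.
# f0: F0 in Hz over voiced frames; amp: amplitude in dB per frame;
# frame_step: analysis step in seconds (hypothetical value).
import numpy as np

def syllable_measures(f0, amp, frame_step=0.01):
    f0 = np.asarray(f0, dtype=float)
    amp = np.asarray(amp, dtype=float)
    return {
        "mean_f0": f0.mean(),               # mean F0 (Hz)
        "f0_range": f0.max() - f0.min(),    # F0 range (Hz)
        "mean_amplitude": amp.mean(),       # mean amplitude (dB)
        "duration": len(amp) * frame_step,  # duration (s)
    }

m = syllable_measures([220, 230, 250, 240], [60, 62, 64, 62])
print(m)  # mean_f0=235.0, f0_range=30.0, mean_amplitude=62.0, duration=0.04
```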

Results

Fig 1 shows the F0 contours of the four Mandarin tones produced with the five emotions. The F0 contours were time-normalized and averaged over speakers of the same sex. Specifically, for each token, F0 was measured every 10% from the beginning (0%) to the end (100%) to obtain 11 data points. Each of these 11 points was then averaged over all speakers of the same sex.
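The time-normalization procedure above can be sketched as follows: resample each token's F0 track to 11 equally spaced points (0% to 100% of its duration), then average across talkers of the same sex. The F0 tracks below are synthetic placeholders for illustration.

```python
# Time-normalize F0 tracks of different lengths to 11 points, then average.
import numpy as np

def normalize_track(f0_track):
    """Resample an F0 track (Hz, arbitrary length) to 11 equally spaced points."""
    f0_track = np.asarray(f0_track, dtype=float)
    positions = np.linspace(0, len(f0_track) - 1, 11)
    return np.interp(positions, np.arange(len(f0_track)), f0_track)

# Example: rising contours of different lengths from two hypothetical talkers.
talker_a = 120 + 40 * np.linspace(0, 1, 50) ** 2   # 50-frame track
talker_b = 180 + 60 * np.linspace(0, 1, 73) ** 2   # 73-frame track
mean_contour = np.mean([normalize_track(talker_a), normalize_track(talker_b)], axis=0)
print(mean_contour.round(1))  # 11 points from onset (0%) to offset (100%)
```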

Fig 1. F0 contours of target syllables as a function of emotion, Mandarin tone, and talker sex.

Fig 1

The F0 contours of the Mandarin tones appear to be consistent with traditional descriptions of the Mandarin tones in citation form: Tone 1 is flat, Tone 2 is falling then rising, Tone 3 (which is in a non-final position in the carrier phrase) is falling, and Tone 4 is falling but in a higher register. The F0 plot is meant to show that the Mandarin tones were produced as intended. For quantitative analysis of the F0 contours, a more rigorous approach such as Functional Data Analysis should be taken [94, 95].

To evaluate the acoustic differences between the emotions, a separate linear mixed-effects model was built for each of the four acoustic measures (mean F0, F0 range, mean amplitude, and duration) using R 3.6.3 (R Core Team, 2021) [96]. Mandarin tone (T1, T2, T3, and T4), emotion (ANGRY, FEAR, HAPPY, SAD, and NEUTRAL), and the tone-emotion interaction were entered as fixed effects. Talker, talker sex, syllable type, and repetition were entered as random effects.
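The structure of one such per-measure model can be illustrated on synthetic data. This is a hedged Python approximation (statsmodels) rather than the authors' R code: the column names and effect sizes are invented, and only a per-talker random intercept is included, whereas the original models also had talker sex, syllable type, and repetition as random effects.

```python
# Sketch of one per-measure model: fixed effects of tone, emotion, and their
# interaction; random intercept per talker. Data are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
tones = ["T1", "T2", "T3", "T4"]
emotions = ["ANGRY", "FEAR", "HAPPY", "SAD", "NEUTRAL"]

rows = []
for talker in range(8):
    base = rng.normal(200, 30)                      # per-talker F0 offset
    for ti, tone in enumerate(tones):
        for emo in emotions:
            for rep in range(2):                    # 2 repetitions per cell
                f0 = base + 10 * ti + 15 * (emo == "ANGRY") + rng.normal(0, 5)
                rows.append({"talker": talker, "tone": tone,
                             "emotion": emo, "mean_f0": f0})
df = pd.DataFrame(rows)

model = smf.mixedlm("mean_f0 ~ tone * emotion", df, groups=df["talker"])
result = model.fit()
# With ANGRY as the (alphabetical) reference level, the NEUTRAL coefficient
# estimates the NEUTRAL-vs-ANGRY difference at Tone 1.
print(result.params["emotion[T.NEUTRAL]"])
```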

Mean F0

Fig 2 shows the mean F0 of the target syllables as a function of emotion and Mandarin tone. Fig 1 suggests that the overall F0 contours of the Mandarin tones produced with the four emotions are similar to those of NEUTRAL; therefore, we calculated mean F0 as a summary measure for quantitative comparisons. The linear mixed-effects model revealed significant main effects of Mandarin tone, χ2(3, N = 8) = 973.95, p < .001, emotion, χ2(4, N = 8) = 1749.66, p < .001, and a tone-emotion interaction, χ2(12, N = 8) = 70.18, p < .001. Post hoc pairwise comparisons (Tukey adjusted) were conducted to disentangle the interaction (Fig 2). ANGRY had the highest mean F0 and NEUTRAL had the lowest. The ranking of FEAR, HAPPY, and SAD varied with the specific Mandarin tone. Full output of the model is available in S1 Table.

Fig 2. Boxplot showing the mean F0 of target syllables as a function of Mandarin tone and emotion.


(*p < .05).

F0 range

Fig 3 shows the F0 range of the target syllables as a function of emotion and Mandarin tone. The linear mixed-effects model revealed significant main effects of Mandarin tone, χ2(3, N = 8) = 779.23, p < .001, emotion, χ2(4, N = 8) = 123.02, p < .001, and their two-way interaction, χ2(12, N = 8) = 114.64, p < .001. The results of post hoc pairwise comparisons are also shown in the figure. There does not appear to be a consistent pattern in the ranking of the emotions. Full output of the model is available in S1 Table.
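F0 range in Hz depends on a talker's register; cross-talker comparisons are sometimes made in semitones instead. A small sketch of the standard conversion (this is a general unit conversion for illustration, not part of the study's analysis):

```python
import math

# F0 range expressed in semitones: 12 semitones per octave,
# i.e., per doubling of F0.  Register-independent by construction.
def range_in_semitones(min_hz, max_hz):
    assert 0.0 < min_hz <= max_hz
    return 12.0 * math.log2(max_hz / min_hz)

# A 100 -> 200 Hz range (low register) and a 200 -> 400 Hz range
# (high register) are both exactly one octave: 12.0 semitones.
```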

Fig 3. Boxplot showing the F0 range of target syllables as a function of Mandarin tone and emotion.


(*p < .05).

Mean amplitude

Fig 4 shows the mean amplitude of the target syllables as a function of emotion and Mandarin tone. The main effects of Mandarin tone, χ2(3, N = 8) = 121.1, p < .001, and emotion, χ2(4, N = 8) = 1801.44, p < .001, were significant, but their interaction, χ2(12, N = 8) = 3.92, p = .98, was not. Post hoc pairwise comparisons showed that ANGRY had the highest mean amplitude and NEUTRAL had the lowest. No difference was observed among SAD, HAPPY, and FEAR. Full output of the model is available in S1 Table.

Fig 4. Boxplot showing the mean amplitude of target syllables as a function of Mandarin tone and emotion.


Duration

Fig 5 shows the duration of the target syllables as a function of emotion and Mandarin tone. As with mean amplitude, the main effects of Mandarin tone, χ2(3, N = 8) = 32.5, p < .001, and emotion, χ2(4, N = 8) = 639.74, p < .001, were significant, but their interaction, χ2(12, N = 8) = 14.2, p = .29, was not. Post hoc pairwise comparisons indicated that SAD had a longer duration than all other emotions. Full output of the model is available in S1 Table.

Fig 5. Boxplot showing the duration of the target syllables as a function of Mandarin tone and emotion.


Summary and discussion

Emotions leave a mark on the acoustic characteristics of Mandarin tones. Consistent with previous studies [2, 5–7, 41–53], findings from our acoustic analyses support the observation that emotions shape the acoustic characteristics of speech for both tonal and non-tonal languages [63–65, 97]. Our inclusion of all four Mandarin tones produced by talkers of both sexes further revealed that the impact of emotions varies depending on specific Mandarin tones. ANGRY has the highest mean F0 and mean amplitude. SAD has the longest duration. In contrast, we did not observe a systematic difference in F0 range.

Table 2 summarizes our findings compared to previous research. There are similarities but also discrepancies. Methodological differences such as talkers (amateurs in previous studies vs. professional actors in the current study) and materials (sentences in previous studies vs. syllables extracted from a carrier phrase in the current study) are likely to have contributed to the discrepancies. We used professional actors in the current study because they are typically more proficient in producing the desired emotions [7, 63]. The presence of the intended emotions was verified in the current study with an independent emotion judgment task. As for materials, since the syllable is the tone-bearing unit in Mandarin, our choice to analyze syllables instead of sentences allowed us to examine the effect of emotions on specific Mandarin tones systematically.

Table 2. Summary of findings from Experiment 1 regarding the effect of emotion relative to the neutral emotion.

            Mean F0               F0 range                     Mean amplitude        Duration
Emotion     Predicted  Actual     Predicted     Actual        Predicted  Actual     Predicted  Actual
ANGRY       >          >          > or =        > or =        >          >          <          =
FEAR        >          >          > or = or <   > or = or <   > or <     >          <          =
HAPPY       >          > or =     > or =        > or =        > or =     >          > or <     > or =
SAD         < or =     >          > or = or <   > or =        <          >          >          >

Among the acoustic measures, mean F0 and F0 range appear to yield the most consistent results between previous research and the current study. ANGRY, FEAR, and HAPPY consistently resulted in a higher F0. However, the utility of this measure is difficult to evaluate because of the lack of specificity in the predictions. For example, previous research showed that FEAR and SAD could result in a wider, comparable, or narrower F0 range in Mandarin tones compared to the neutral emotion. Although the current study showed the same results, no consistent patterns could be extracted without taking into consideration specific Mandarin tones.

Among the emotions, ANGRY consistently results in a higher mean F0, a greater F0 range, and a higher mean amplitude. This finding appears to support the proposal that negative emotions are conveyed more effectively in vocal communication to ensure human survival [43, 78, 79]. If an emotion results in relatively stable acoustic changes in lexical tones, recognition of that emotion is likely to be more robust. However, the other two negative emotions FEAR and SAD did not result in a similar pattern of consistent acoustic changes. It has yet to be determined if ANGRY has a special status among the negative emotions.

Experiment 2

Findings from Experiment 1 show that emotions shape the acoustic characteristics of Mandarin tones. The extent of the effect also depends on specific Mandarin tones. In Experiment 2, we ask how the acoustic changes induced by emotions would affect the perception of Mandarin tones. Liang and Chen (2019) [71] found that ANGRY resulted in lower accuracy of Mandarin tone identification relative to the neutral emotion. This difference was driven by more accurate identification of Tone 4 compared to Tone 1. We expect to find a similar interaction between Mandarin tones and emotions.

More generally, we evaluate two possibilities regarding how listeners interpret the acoustic signal in terms of Mandarin tones and emotions. On the one hand, Mandarin tone identification may be compromised by emotions because emotions make tones more variable acoustically, and thus more challenging to identify. In this scenario, greater acoustic changes (e.g., those associated with ANGRY) should lead to less accurate tone identification. On the other hand, Mandarin tone identification may not be compromised by emotions if listeners are able to attribute the acoustic changes to the Mandarin tones and the emotions, respectively. Assuming that more predictable acoustic changes facilitate identification, more consistent acoustic changes (e.g., those associated with ANGRY) should lead to more accurate tone identification. It should be noted that neither scenario assumes that lexical tones are defined by absolute pitch, amplitude, or duration. Rather, the two scenarios represent two possible ways listeners parse the acoustic signal into distinct sources of acoustic variability.

In addition to Mandarin tone identification, we examine how well the emotions themselves could be recognized from the stimuli. Previous research has shown that negative emotions tend to be identified with higher accuracy than positive emotions. Furthermore, specific Mandarin tones do not seem to affect emotional tone recognition disproportionately [41, 47]. The inclusion of both positive and negative emotions and all four Mandarin tones in the current study would allow us to evaluate these observations.

Method

Participants

This study was reviewed and approved by the Institutional Review Board of National Yang Ming Chiao Tung University (IRB No. 1000063). All participants signed a written informed consent. No minors participated in this study. No medical records or archived samples were used in this study.

Thirty-six adults (23 women and 13 men) with ages ranging from 19 to 23 years (mean age 20.08 ± 0.91 years) participated in Experiment 2. All participants were native speakers of Taiwan Mandarin and reported no known history of speech and hearing disorders. Each participant was paid $1,000 NTD ($34 USD) for their participation. All participants passed a screening of Mandarin tone identification in the neutral emotion presented in isolation and in context. Identification accuracy of Mandarin Tone 1, Tone 2, Tone 3, and Tone 4 was 97%, 93%, 93%, and 98% in isolation, and 100%, 100%, 99%, and 100% in context.

Stimuli

The stimuli used in this experiment were selected from one female and one male talker among the eight talkers who recorded the stimuli for Experiment 1. The talkers who received the highest rating in the emotion judgment task (reported in Experiment 1) in their respective sex group were chosen. For tone identification, all five emotions were included, resulting in 240 stimuli (3 syllables, 4 tones, 5 emotions, 2 repetitions, and 2 talkers). The 240 stimuli were presented in isolation or embedded in the carrier phrase, resulting in a total of 480 trials. For emotion recognition, the NEUTRAL stimuli and response option were excluded to prevent participants from using NEUTRAL as a default response for stimuli that they were uncertain about. As a result, there were 192 stimuli (3 syllables, 4 tones, 4 emotions, 2 repetitions, and 2 talkers). The 192 stimuli were presented in isolation or embedded in the carrier phrase, resulting in a total of 384 trials.
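The stimulus counts above follow from fully crossing the design factors. A short sketch verifying the arithmetic (the syllable and talker labels are placeholders, not the actual materials):

```python
from itertools import product

# Placeholders for the design factors described in the text.
syllables   = ["syl1", "syl2", "syl3"]
tones       = ["T1", "T2", "T3", "T4"]
emotions    = ["ANGRY", "FEAR", "HAPPY", "SAD", "NEUTRAL"]
repetitions = [1, 2]
talkers     = ["F", "M"]
contexts    = ["isolation", "carrier"]

# Tone identification: full crossing, each stimulus heard in both contexts.
tone_stimuli = list(product(syllables, tones, emotions, repetitions, talkers))
tone_trials  = list(product(tone_stimuli, contexts))

# Emotion recognition: NEUTRAL stimuli are excluded.
emo_stimuli = [s for s in tone_stimuli if s[2] != "NEUTRAL"]
emo_trials  = list(product(emo_stimuli, contexts))

# len(tone_stimuli) == 240, len(tone_trials) == 480
# len(emo_stimuli)  == 192, len(emo_trials)  == 384
```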

Procedure

This experiment took place in a sound-treated booth in the Department of Biomedical Engineering at National Yang Ming Chiao Tung University. The LabVIEW program (National Instruments) on a Windows 10 laptop computer was used for stimulus delivery and response acquisition. Stimuli were presented at each participant’s preferred hearing level over a pair of Beyerdynamic DT 990 PRO headphones.

The participant’s task was to listen to each stimulus and identify the Mandarin tone and the emotion of the target syllable. The stimuli were presented in four blocks in the following order: (1) isolated syllables for Mandarin tone identification; (2) isolated syllables for emotion recognition; (3) target syllables embedded in the carrier phrase for Mandarin tone identification; and (4) target syllables embedded in the carrier phrase for emotion recognition. The order of stimuli within each block was randomized with the Random Number Generator in LabVIEW, which generated a unique presentation order for each participant. Brief breaks were provided between the blocks.
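The per-participant randomization described above can be mirrored in a short sketch. The study used LabVIEW's Random Number Generator; Python's seeded generator stands in for it here, and the trial labels are placeholders:

```python
import random

# Sketch of per-participant trial randomization: seeding the generator
# per participant makes each order reproducible and (almost surely)
# different across participants.
def presentation_order(trials, participant_id):
    order = list(trials)               # leave the master list untouched
    rng = random.Random(participant_id)  # seed unique to this participant
    rng.shuffle(order)
    return order

trials = [f"trial_{i:03d}" for i in range(240)]
order_p1 = presentation_order(trials, 1)
order_p2 = presentation_order(trials, 2)
# Each order is a permutation of the same 240 trials.
```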

For Mandarin tone identification, four response buttons marked with “一聲 (Tone 1)”, “二聲 (Tone 2)”, “三聲 (Tone 3)”, and “四聲 (Tone 4)” were displayed at the four corners of the computer screen, equidistant from the center. For emotion recognition, four response buttons marked with “生氣 (ANGRY)”, “害怕 (FEAR)”, “快樂 (HAPPY)”, and “傷心 (SAD)” were displayed in the same arrangement. As noted above, the NEUTRAL response option was not included, to prevent listeners from using NEUTRAL as a default response for stimuli they were unsure about. At the beginning of each trial, a cursor appeared briefly at the center of the screen, followed by the auditory stimulus. Listeners responded by clicking one of the four buttons on the screen with a computer mouse; if they were unsure of the answer, they were told to make their best guess. The next trial was presented 500 ms after the response. Each experimental session took approximately 60–90 minutes to complete.

Results

Mandarin tone identification

Fig 6 shows the accuracy of Mandarin tone identification as a function of emotion, Mandarin tone, and context. Overall, Mandarin tones were identified more accurately in context, and the effect of emotion appeared to be much more variable when the target syllable was presented in isolation. For example, all four Mandarin tones were identified quite well in NEUTRAL regardless of context, but the presence of other emotions compromised the identification of isolated Mandarin tones disproportionately compared to those presented in context.

Fig 6. Boxplot showing the Mandarin tone identification accuracy as a function of Mandarin tone, emotion, and context.


To evaluate these observations statistically, a mixed-effects logistic regression model was fitted to the Mandarin tone identification data. Mandarin tone (T1, T2, T3, and T4), emotion (ANGRY, FEAR, HAPPY, SAD, and NEUTRAL), context (in isolation and in context), and the tone-emotion interaction were entered into the model as fixed effects. Talker sex, syllable type, and repetition were entered as random effects. The dependent variable was the binary Mandarin tone identification response (i.e., correct or incorrect). Full output of the model is available in S2 Table.

All main effects were significant: Mandarin tone, χ2(3, N = 36) = 83.12, p < .001; emotion, χ2(4, N = 36) = 486.38, p < .001; and context, χ2(1, N = 36) = 1534.71, p < .001. The tone-emotion interaction was also significant: Mandarin tone-emotion, χ2(12, N = 36) = 492.89, p < .001. Post hoc pairwise comparisons indicate that tone identification accuracy varied across different emotions (Table 3).

Table 3. Summary of pairwise comparisons for Mandarin tone identification accuracy as a function of emotion.

The symbol > indicates a significant difference (p < .05) and = indicates no significant difference.

Emotion Mandarin tone
ANGRY T4 > T2 > T1 = T3
FEAR T1 > T4 > T2 > T3
HAPPY T2 > T3 > T4 = T1
NEUTRAL T4 = T1, T4 > T3, T4 > T2, T1 > T2
SAD T1 > T2 = T3 = T4

To examine the specific types of Mandarin tone identification errors, Tables 4 and 5 show confusion matrices of Mandarin tone identification responses for syllables presented in isolation (Table 4) and in context (Table 5). To facilitate interpretation of the confusion patterns, the most common error for each Mandarin tone that exceeds 20% is highlighted in gray. For syllables presented in isolation (Table 4), Mandarin tones produced with NEUTRAL were rarely misidentified as other tones. There appears to be a response bias whereby Mandarin tones, particularly Tones 1 and 3, were most often misidentified as Tone 4 when presented in an ANGRY tone of voice. Similarly, there appears to be a response bias for tones to be identified as Tone 1 when presented in a FEAR tone of voice. In contrast, the errors for Mandarin tones produced with HAPPY or SAD did not show distinct bias patterns.

Table 4. Confusion matrices of Mandarin tone identification responses for syllables presented in isolation across emotions.

The most common error that exceeds 20% for each emotion is highlighted in gray.

Mandarin tone response
Emotion Mandarin tone stimulus Tone 1 Tone 2 Tone 3 Tone 4
Neutral Tone 1 97% 2% 0% 1%
Tone 2 0% 93% 7% 0%
Tone 3 2% 2% 93% 3%
Tone 4 1% 1% 0% 98%
Angry Tone 1 68% 12% 2% 18%
Tone 2 11% 71% 16% 2%
Tone 3 18% 6% 40% 36%
Tone 4 9% 1% 3% 87%
Fear Tone 1 86% 10% 2% 2%
Tone 2 26% 58% 15% 1%
Tone 3 25% 13% 46% 16%
Tone 4 20% 2% 8% 70%
Happy Tone 1 52% 41% 3% 4%
Tone 2 5% 83% 11% 1%
Tone 3 4% 5% 66% 25%
Tone 4 32% 7% 5% 56%
Sad Tone 1 83% 13% 3% 1%
Tone 2 13% 61% 25% 1%
Tone 3 13% 13% 60% 14%
Tone 4 24% 4% 17% 55%

Table 5. Confusion matrices of Mandarin tone identification responses for syllables presented in context across emotions.

Mandarin tone response
Emotion Mandarin tone stimulus Tone 1 Tone 2 Tone 3 Tone 4
Neutral Tone 1 100% 0% 0% 0%
Tone 2 0% 100% 0% 0%
Tone 3 0% 1% 99% 0%
Tone 4 0% 0% 0% 100%
Angry Tone 1 78% 1% 2% 19%
Tone 2 0% 98% 2% 0%
Tone 3 0% 1% 98% 1%
Tone 4 3% 0% 0% 97%
Fear Tone 1 96% 2% 1% 1%
Tone 2 4% 93% 2% 1%
Tone 3 3% 8% 88% 1%
Tone 4 7% 1% 1% 91%
Happy Tone 1 92% 1% 1% 6%
Tone 2 2% 96% 1% 1%
Tone 3 1% 2% 97% 1%
Tone 4 11% 2% 0% 87%
Sad Tone 1 95% 3% 0% 2%
Tone 2 1% 97% 2% 0%
Tone 3 1% 4% 95% 0%
Tone 4 1% 2% 1% 96%

For syllables presented in context (Table 5), Mandarin tone identification was remarkably accurate. There were only three instances where the accuracy fell below 90%, and only one of those was below 80%. In terms of confusion patterns, there was only one error that approached 20%, where an ANGRY Tone 1 was misidentified as Tone 4. There were no other dominant errors. In sum, the presence of the carrier phrase effectively neutralized the negative impact of emotions on Mandarin tone identification observed with isolated syllables.
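Confusion matrices like Tables 4 and 5 can be derived from trial-level (stimulus, response) pairs. A minimal sketch with hypothetical data (not the study's responses) shows the row-normalized computation:

```python
from collections import Counter

# Sketch: build a row-normalized confusion matrix, i.e., the proportion
# of each response category for each stimulus category.
def confusion(pairs, categories):
    counts = {c: Counter() for c in categories}
    for stimulus, response in pairs:
        counts[stimulus][response] += 1
    matrix = {}
    for c in categories:
        total = sum(counts[c].values())
        matrix[c] = {r: (counts[c][r] / total if total else 0.0)
                     for r in categories}
    return matrix

# Hypothetical trial data: four T1 trials, four T4 trials.
pairs = [("T1", "T1"), ("T1", "T4"), ("T1", "T1"), ("T1", "T1"),
         ("T4", "T4"), ("T4", "T4"), ("T4", "T1"), ("T4", "T4")]
m = confusion(pairs, ["T1", "T2", "T3", "T4"])
# m["T1"]["T1"] == 0.75 (accuracy); m["T1"]["T4"] == 0.25 (most common error)
```

The diagonal of the matrix gives per-category accuracy; the off-diagonal maxima give the dominant errors highlighted in the tables.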

Emotion recognition

Fig 7 shows the accuracy of emotion recognition as a function of emotion, Mandarin tone, and context. Overall, accuracy appears higher for target syllables presented in context than in isolation. Importantly, the four emotions appear to be affected by context to different degrees. For example, FEAR appears to be recognized disproportionately poorly in isolation.

Fig 7. Boxplot showing the emotion recognition accuracy as a function of Mandarin tone and context.


A mixed-effects logistic regression model was constructed with Mandarin tone (T1, T2, T3, and T4), emotion (ANGRY, FEAR, HAPPY, and SAD), context (in isolation and in context), and the tone-emotion interaction as fixed effects, and talker sex, syllable type, and repetition as random effects. The dependent variable was the binary emotion recognition response (correct or incorrect). Full output of the model is available in S3 Table.

All main effects were significant: Mandarin tone, χ2(3, N = 36) = 92.92, p < .001; emotion, χ2(3, N = 36) = 775.08, p < .001; and context, χ2(1, N = 36) = 1843.42, p < .001. The tone-emotion interaction was also significant: χ2(9, N = 36) = 327.95, p < .001. Post hoc pairwise comparisons indicate that emotion recognition accuracy varied across different Mandarin tones (Table 6). For example, ANGRY was recognized most accurately in all but Mandarin Tone 3, whereas FEAR was recognized least accurately in all Mandarin tones.

Table 6. Summary of pairwise comparisons for emotion recognition accuracy as a function of Mandarin tone.

The symbol > indicates a significant difference (p < .05) and = indicates no significant difference.

Mandarin tone Emotion
Tone 1 ANGRY = HAPPY > SAD > FEAR
Tone 2 ANGRY = HAPPY > SAD > FEAR
Tone 3 SAD > ANGRY = HAPPY > FEAR
Tone 4 ANGRY > HAPPY > SAD > FEAR

To examine the types of emotion recognition errors made by listeners, Tables 7 and 8 show confusion matrices of emotion recognition responses for syllables presented in isolation (Table 7) and in context (Table 8). To facilitate the interpretation of the confusion patterns, the most common error for each emotion that exceeds 20% is highlighted in gray.

Table 7. Confusion matrices of emotion recognition responses for syllables presented in isolation.

The most common error that exceeds 20% for each emotion is highlighted in gray.

Emotion response
Mandarin tone Emotional tone stimulus ANGRY FEAR HAPPY SAD
Tone 1 ANGRY 77% 9% 13% 1%
FEAR 11% 21% 56% 12%
HAPPY 8% 6% 82% 4%
SAD 4% 29% 25% 42%
Tone 2 ANGRY 76% 5% 17% 2%
FEAR 14% 36% 29% 21%
HAPPY 5% 8% 76% 11%
SAD 2% 21% 5% 72%
Tone 3 ANGRY 53% 15% 9% 23%
FEAR 7% 34% 13% 46%
HAPPY 12% 12% 49% 27%
SAD 2% 22% 3% 73%
Tone 4 ANGRY 93% 1% 3% 3%
FEAR 11% 37% 24% 28%
HAPPY 8% 11% 71% 10%
SAD 6% 29% 4% 61%

Table 8. Confusion matrices of emotion recognition responses for syllables presented in context.

Emotional tone response
Mandarin tone Emotional tone stimulus ANGRY FEAR HAPPY SAD
Tone 1 ANGRY 97% 1% 1% 1%
FEAR 1% 87% 6% 6%
HAPPY 0% 2% 96% 2%
SAD 0% 15% 0% 85%
Tone 2 ANGRY 97% 2% 0% 1%
FEAR 1% 88% 4% 7%
HAPPY 0% 1% 97% 2%
SAD 0% 8% 1% 91%
Tone 3 ANGRY 96% 2% 0% 1%
FEAR 1% 85% 7% 7%
HAPPY 1% 2% 96% 1%
SAD 0% 6% 0% 94%
Tone 4 ANGRY 99% 1% 0% 0%
FEAR 1% 87% 3% 9%
HAPPY 1% 2% 95% 2%
SAD 1% 7% 0% 92%

For isolated syllables (Table 7), emotion recognition accuracy ranged from 21% to 93%. There appears to be a response bias whereby SAD stimuli were most commonly identified as FEAR in Tones 1 and 4. Similarly, there appears to be a response bias whereby FEAR stimuli were commonly identified as HAPPY in Tones 1 and 2, but as SAD in Tones 3 and 4. In contrast, ANGRY and HAPPY stimuli did not result in any dominant bias patterns.

For syllables presented in context (Table 8), emotions were recognized more accurately, ranging from 85% to 99%. In terms of confusion patterns, the only error that exceeded 10% was the SAD response to the FEAR stimulus in Tone 1 (15%). Because of the high accuracy, none of the emotions resulted in any dominant bias patterns.

Summary and discussion

As predicted, Mandarin tone identification accuracy varied across emotions; i.e., not all tones were identified equally well in different emotions. We replicated Liang and Chen’s (2019) [71] finding that Tone 4 is identified more accurately than Tone 1 in ANGRY. Our results further revealed additional contingencies on emotion (Table 3). Tone identification accuracy also varied across contexts. All tones were identified more accurately when the target syllables were embedded in the carrier phrase, but not all tones were identified equally well across the two contexts (Table 3). For syllables presented in isolation, confusion analyses showed that Tone 4 was the most common error for the ANGRY stimuli, and Tone 1 was the most common error for the FEAR stimuli (Table 4). For syllables presented in the carrier phrase, confusion analyses did not reveal any notable patterns except that Tone 4 was a common error for ANGRY Tone 1 stimuli (Table 5).

The hypothesis that ANGRY stimuli would result in less or more accurate tone identification, depending on how listeners interpret the acoustic signal, was not supported by our data. Although ANGRY resulted in the greatest acoustic changes (as shown in Experiment 1), Mandarin tones produced with ANGRY were not identified disproportionately worse. Similarly, although ANGRY was associated with the most consistent acoustic changes (as shown in Experiment 1), Mandarin tones produced with ANGRY were not identified disproportionately better. Given the extensive interactions among tone, emotion, and context, the current data are inconclusive as to how listeners parsed the acoustic signal into distinct sources.

Regarding emotion recognition, accuracy varied across Mandarin tones; i.e., not all emotions were recognized equally well in different Mandarin tones (Table 6). However, ANGRY was recognized most accurately in three of the four Mandarin tones. Like Mandarin tones, emotions were recognized more accurately when the target syllables were embedded in the carrier phrase. Unlike Mandarin tones, the ordering of emotion recognition accuracy was more consistent between the isolation and context conditions (Table 6); i.e., ANGRY was recognized better than HAPPY, followed by SAD, and FEAR was recognized least accurately. For syllables presented in isolation, confusion analyses showed that the most common error for SAD stimuli was a FEAR response (Table 7). For syllables presented in context, confusion analyses did not reveal any notable error patterns (Table 8).

Taken together, data from Mandarin tone identification and emotion recognition revealed several similarities. Mandarin tone identification accuracy varied across emotions, just as emotion recognition accuracy varied across Mandarin tones. In addition, Mandarin tone identification and emotion recognition were both more accurate with syllables presented in the carrier phrase. The presence of the carrier phrase facilitated both tasks, especially for those tones and emotions that were identified poorly in isolated syllables. Finally, for both Mandarin tone identification and emotion recognition, there were distinct error patterns in isolated syllables, but barely any notable error patterns in context.

There are also differences between Mandarin tone identification and emotion recognition. First, the accuracy of identifying specific Mandarin tones varied greatly across different emotions (Table 3), but the accuracy of recognizing specific emotions was more consistent across different Mandarin tones. For example, ANGRY was consistently identified better than most other emotions, and FEAR was consistently identified worse than all other emotions (Table 6). In other words, how well a Mandarin tone is identified depends heavily on specific emotions, but how well an emotion is identified does not depend as much on specific Mandarin tones. At first sight, this finding appears to suggest that emotion recognition is more robust than Mandarin tone identification. However, the accuracy of emotion recognition (ranging from 21% to 93%) was not higher than Mandarin tone identification (40% to 98%) in isolated syllables. The accuracy in the carrier phrase also seemed comparable: emotion recognition (85% to 99%); Mandarin tone identification (78% to 100%). Emotions do not seem to be inherently easier to identify compared to Mandarin tones. Rather, our data suggest that their mutual influence is asymmetrical.

Furthermore, Mandarin tone identification and emotion recognition differed in their interaction with context. The accuracy of identifying specific Mandarin tones varied substantially between the two contexts (Table 3), but the accuracy of recognizing specific emotions was quite consistent between the two contexts (Table 6).

Conclusions, limitations, and future directions

Three conclusions can be drawn from our findings. First, acoustic characteristics of Mandarin tones are shaped by emotions in complex but systematic ways, depending on specific Mandarin tones and specific emotions. Second, emotions affect Mandarin tone identification to a greater extent than Mandarin tones affect emotion recognition. Finally, the presence of the carrier phrase facilitates both Mandarin tone identification and emotion recognition.

There are a number of limitations to this study. First, this study was conceptualized with four-emotion models; therefore, the experimental materials were recorded with only four basic emotions. As noted in the introduction, there are different opinions as to the number of basic emotions, and basic-emotion models do not fully capture valence and arousal in the dimensional approach to emotion. For example, of the four emotions used in this study, only happiness is a positive emotion, and only sadness is considered low in arousal. Inclusion of a wider range of emotions would have allowed a more nuanced examination of the interaction between lexical and emotional tones. Second, only one carrier phrase was used; i.e., we did not systematically manipulate the tone that precedes the target syllable as Liang and Chen (2019) [71] did. It is not known whether our results would generalize to other tonal contexts. Third, the stimuli were produced by speakers of Taiwan Mandarin, and the participants in the perception experiment were also Taiwan Mandarin speakers. It is not known whether our results would generalize to other variants of Mandarin. Fourth, we did not include the NEUTRAL emotion in the stimuli for the emotion recognition task. Since the NEUTRAL emotion was included in the acoustic analysis, it would have been informative to include it in the perception experiment. Fifth, the syllables presented in isolation in the current study were excised from a carrier phrase; therefore, it is not known whether the findings would generalize to syllables produced in isolation. Sixth, since syllables in isolation were presented before syllables in context, familiarization from the isolation blocks could have boosted response accuracy in the context blocks. Finally, we used the neutral condition as a baseline for emotion-related comparisons. Comparisons among the other emotions could generate further insights.

In addition to addressing these limitations, future studies could consider the following extensions. First, although the acoustic measures used in this study were commonly used in research on lexical tones and emotions, additional acoustic measures on voice quality and quantitative measures of F0 contour (e.g., Functional Data Analysis [94, 95]) would be useful. Second, although our use of naturally produced stimuli in the perception experiment preserved all the acoustic cues available to listeners, it was not possible to isolate the contributions of specific acoustic cues. Future studies could systematically manipulate those cues to identify their individual contributions to lexical tone identification and emotion recognition. Third, although our data revealed how the acoustics and perception of lexical tones depend on factors including emotions and context, it is not clear why these factors affect lexical tone perception in such complex ways. Finally, the asymmetry in the mutual influence between lexical tones and emotions highlights the need for further research on how tonal language experience shapes lexical tone identification [71] and emotion recognition.

Supporting information

S1 Table. Full output of the linear mixed-effects models for the acoustic analysis of four Mandarin tones with five emotions.

(DOCX)

S2 Table. Full output of the mixed-effects logistic regression model for Mandarin tone identification.

(DOCX)

S3 Table. Full output of the mixed-effects logistic regression model for emotion recognition.

(DOCX)

S4 Table. All data of two experiments.

(RTF)

Acknowledgments

We thank Dr. Liquan Liu, Dr. Chong Chee Seng, Dr. Martijn Goudbeek, and anonymous reviewers for their constructive feedback. We also thank Faith Fedele for editorial assistance.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This study was supported by grants MOST 109-2218-E-010-003 and MOST 110-2622-B-A49-001 from the Ministry of Science and Technology, Taiwan ROC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Mullennix JW, Bihon T, Bricklemyer J, Gaston J, Keener JM. Effects of variation in emotional tone of voice on speech perception. Language and Speech. 2002;45(3):255–83. doi: 10.1177/00238309020450030301
2. Leinonen L, Hiltunen T, Linnankoski I, Laakso MJ. Expression of emotional-motivational connotations with a one-word utterance. The Journal of the Acoustical Society of America. 1997;102(3):1853–63. doi: 10.1121/1.420109
3. Coutinho E, Dibben N. Psychoacoustic cues to emotion in speech prosody and music. Cognition & Emotion. 2013;27(4):658–84. doi: 10.1080/02699931.2012.732559
4. Banse R, Scherer KR. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology. 1996;70(3):614–36. doi: 10.1037//0022-3514.70.3.614
5. Nwe TL, Foo SW, De Silva LC. Speech emotion recognition using hidden Markov models. Speech Communication. 2003;41(4):603–23. doi: 10.1016/s0167-6393(03)00099-2
6. Murray IR, Arnott JL. Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. The Journal of the Acoustical Society of America. 1993;93(2):1097–108. doi: 10.1121/1.405558
7. Scherer KR. Vocal communication of emotion: a review of research paradigms. Speech Communication. 2003;40:227–56.
8. Hammerschmidt K, Jürgens U. Acoustical correlates of affective prosody. Journal of Voice. 2007;21(5):531–40. doi: 10.1016/j.jvoice.2006.03.002
9. Mozziconacci S. Speech Variability and Emotion: Production and Perception. Eindhoven: Technische Universiteit Eindhoven; 1998.
10. Rodero E. Intonation and emotion: influence of pitch levels and contour type on creating emotions. Journal of Voice. 2011;25(1):e25–e34. doi: 10.1016/j.jvoice.2010.02.002
11. Ladd DR, Silverman KEA, Tolkmitt F, Bergmann G, Scherer KR. Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect. The Journal of the Acoustical Society of America. 1985;78(2):435–44. doi: 10.1121/1.392466
12. Tolkmitt F, Bergmann G, Goldbeck T, Scherer KR. Experimental studies on vocal communication. In: Scherer KR, editor. Facets of Emotion: Recent Research. Hillsdale, NJ: Erlbaum; 1988. p. 119–38.
13. Xu L, Tsai Y, Pfingst BE. Features of stimulation affecting tonal-speech perception: implications for cochlear prostheses. The Journal of the Acoustical Society of America. 2002;112(1):247–58. doi: 10.1121/1.1487843
14. Jongman A, Wang Y, Moore C, Sereno J. Perception and production of Mandarin tone. In: Li P, Tan LH, Bates E, Tzeng OJL, editors. Handbook of East Asian Psycholinguistics (Vol 1: Chinese). Cambridge, UK: Cambridge University Press; 2006.
15. Chuang CK, Hiki S. Acoustical features and perceptual cues of the four tones of standard colloquial Chinese. The Journal of the Acoustical Society of America. 1972;52:146. doi: 10.1121/1.1981919
16. Blicher DL, Diehl RL, Cohen LB. Effects of syllable duration on the perception of the Mandarin Tone 2/Tone 3 distinction: evidence of auditory enhancement. Journal of Phonetics. 1990;18(1):37–49. doi: 10.1016/s0095-4470(19)30357-2
17. Whalen DH, Xu Y. Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica. 1992;49(1):25–47. doi: 10.1159/000261901
18. Liu S, Samuel AG. Perception of Mandarin lexical tones when F0 information is neutralized. Language and Speech. 2004;47:109–38. doi: 10.1177/00238309040470020101
19. Lee CY, Tao L, Bond ZS. Identification of acoustically modified Mandarin tones by non-native listeners. Language and Speech. 2010;53(2):217–43. doi: 10.1177/0023830909357160
20. Gandour J. Tone perception in Far Eastern languages. Journal of Phonetics. 1983;11(2):149–75. doi: 10.1016/s0095-4470(19)30813-7
21. Massaro DW, Cohen MM, Tseng CY. The evaluation and integration of pitch height and pitch contour in lexical tone perception in Mandarin Chinese. Journal of Chinese Linguistics. 1985;13:267–89.
22. Bestelmeyer PEG, Kotz SA, Belin P. Effects of emotional valence and arousal on the voice perception network. Social Cognitive and Affective Neuroscience. 2017;12(8):1351–8. doi: 10.1093/scan/nsx059
  • 13.Xu L, Tsai Y, Pfingst BE. Features of stimulation affecting tonal-speech perception: implications for cochlear prostheses. The Journal of the Acoustical Society of America. 2002;112(1):247–58. doi: 10.1121/1.1487843 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Jongman A, Wang Y, Moore C, Sereno J. Perception and Production of Mandarin Tone. In: Li P, Tan LH, Bates E, Tzeng OJL, editors. Handbook of East Asian Psycholinguistics (Vol 1: Chinese). UK: Cambridge University Press; 2006. [Google Scholar]
  • 15.Chuang CK, Hiki S. Acoustical Features and Perceptual Cues of the Four Tones of Standard Colloquial Chinese. The Journal of the Acoustical Society of America. 1972;52:146. doi: 10.1121/1.1981919 [DOI] [Google Scholar]
  • 16.Blicher DL, Diehl RL, Cohen LB. Effects of syllable duration on the perception of the Mandarin Tone 2/Tone 3 distinction: evidence of auditory enhancement. Journal of Phonetics. 1990;18(1):37–49. doi: 10.1016/s0095-4470(19)30357-2 [DOI] [Google Scholar]
  • 17.Whalen DH, Xu Y. Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica. 1992;49(1):25–47. doi: 10.1159/000261901 [DOI] [PubMed] [Google Scholar]
  • 18.Liu S, Samuel AG. Perception of Mandarin lexical tones when F0 information is neutralized. Language and Speech. 2004;47:109–38. doi: 10.1177/00238309040470020101 [DOI] [PubMed] [Google Scholar]
  • 19.Lee CY, Tao L, Bond ZS. Identification of Acoustically Modified Mandarin Tones by Non-native Listeners. Language and Speech. 2010;53(2):217–43. doi: 10.1177/0023830909357160 [DOI] [PubMed] [Google Scholar]
  • 20.Gandour J. Tone perception in Far Eastern languages. Journal of Phonetics. 1983;11(2):149–75. doi: 10.1016/s0095-4470(19)30813-7 [DOI] [Google Scholar]
  • 21.Massaro DW, Cohen MM, Tseng CY. The Evaluation and Integration of Pitch Height and Pitch Contour in Lexical Tone Perception in Mandarin Chinese. Journal of Chinese Linguistics. 1985;13:267–89. [Google Scholar]
  • 22.Bestelmeyer PEG, Kotz SA, Belin P. Effects of emotional valence and arousal on the voice perception network. Social Cognitive and Affective Neuroscience. 2017;12(8):1351–8. doi: 10.1093/scan/nsx059 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Russell JA. A circumplex model of affect. Journal of Personality and Social Psychology. 1980;39(6):1161–78. doi: 10.1037/h0077714 [DOI] [Google Scholar]
  • 24.Ekman P. An argument for basic emotions. Cognition & Emotion. 1992;6(3/4):169–200. doi: 10.1080/02699939208411068 [DOI] [Google Scholar]
  • 25.Barrett LF, Russell JA. The Psychological Construction of Emotion. New York: Guilford Press; 2015. [Google Scholar]
  • 26.Bradley MM, Lang PJ. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry. 1994;25(1):49–59. doi: 10.1016/0005-7916(94)90063-9 [DOI] [PubMed] [Google Scholar]
  • 27.Posner J, Russell JA, Peterson BS. The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology 2005;17(3):715–34. doi: 10.1017/S0954579405050340 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Russell JA, Barrett LF. Core affect, prototypical emotional episodes, and other things called emotions: dissecting the elephant. Journal of Personality and Social Psychology. 1999;76(5):805–19. doi: 10.1037/0022-3514.76.5.805 [DOI] [PubMed] [Google Scholar]
  • 29.Colibazzi T, Posner J, Wang Z, Gorman D, Gerber A, Yu S, et al. Neural systems subserving valence and arousal during the experience of induced emotions. Emotion. 2010;10(3):377–89. doi: 10.1037/a0018484 [DOI] [PubMed] [Google Scholar]
  • 30.Wilson-Mendenhall CD, Barrett LF, Barsalou LW. Neural evidence that human emotions share core affective properties. Psychological Science. 2013;24(6):947–56. doi: 10.1177/0956797612464242 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Russell JA. Emotions are not modules. Canadian Journal of Philosophy. 2006;32:53–71. doi: 10.1353/cjp.2007.0037 [DOI] [Google Scholar]
  • 32.Scarantino A. Basic emotions, psychological construction, and the problem of variability. In: Barrett LF, Russell JA, editors. The psychological construction of emotion. New York: Guilford Press; 2015. p. 334–76. [Google Scholar]
  • 33.Plutchik R. The Emotions: Facts, Theories, and a New Model. New York: Random House; 1962. [Google Scholar]
  • 34.Ekman P. Are there basic emotions? Psychological Review. 1992;99(3):550–3. doi: 10.1037/0033-295x.99.3.550 [DOI] [PubMed] [Google Scholar]
  • 35.Izard CE. Basic emotions, natural kinds, emotion schemas, and a new paradigm. Perspectives on Psychological Science. 2007;2(3):260–80. doi: 10.1111/j.1745-6916.2007.00044.x [DOI] [PubMed] [Google Scholar]
  • 36.Gu S, Wang F, Cao C, Wu E, Tang YY, Huang JH. An integrative way for studying neural basis of basic emotions with fMRI. Frontiers in Neuroscience. 2019;13:628. doi: 10.3389/fnins.2019.00628 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Gu S, Gao M, Yan Y, Wang F, Tang YY, Huang JH. The neural mechanism underlying cognitive and emotional processes in creativity. Frontiers in Psychology. 2018;9:1924. doi: 10.3389/fpsyg.2018.01924 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Gu S, Wang W, Wang F, Huang JH. Neuromodulator and emotion biomarker for stress induced mental disorders. Neural Plasticity. 2016;2016:2609128. doi: 10.1155/2016/2609128 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Jack RE, Garrod OGB, Schyns PG. Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Current Biology. 2014;24(2):187–92. doi: 10.1016/j.cub.2013.11.064 [DOI] [PubMed] [Google Scholar]
  • 40.Zheng Z, Gu S, Lei Y, Lu S, Wang W, Li Y, et al. Safety needs mediate stressful events induced mental disorders. Neural Plasticity. 2016;2016:8058093. doi: 10.1155/2016/8058093 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Wang T, Lee YC. Does Restriction of Pitch Variation Affect the Perception of Vocal Emotions in Mandarin Chinese? The Journal of the Acoustical Society of America. 2015;137:EL117–EL123. doi: 10.1121/1.4904916 [DOI] [PubMed] [Google Scholar]
  • 42.Yuan J, Shen L, Chen F, editors. The acoustic realization of anger, fear, joy and sadness in Chinese. 7th International Conference on Spoken Language Processing (ICSLP2002); 2002; Denver, Colorado, USA.
  • 43.Liu P, Pell MD. Recognizing Vocal Emotions in Mandarin Chinese: A Validated Database of Chinese Vocal Emotional Stimuli. Behavior Research Methods. 2012;44:1042–1051. [DOI] [PubMed] [Google Scholar]
  • 44.Li A, Fang Q, Dang J, editors. Emotional Intonation in a Tone Language: Experimental Evidence From Chinese. ICPhS XVII; 2011; Hong Kong. [Google Scholar]
  • 45.Lin HY, Fon J, editors. Prosodic and acoustic features of emotional speech in Taiwan Mandarin. 6th International Conference on Speech Prosody; 2012.
  • 46.Wang T, Lee YC, Ma Q, editors. An Experimental Study of Emotional Speech in Mandarin and English. Speech Prosody; 2016; Boston, USA. [Google Scholar]
  • 47.Wang T, Qian Y, editors. Are Pitch Variation Cues Indispensable to Distinguish Vocal Emotions? The 9th International Conference on Speech Prosody 2018; 2018; Poznan, Poland.
  • 48.Chang HS, Young ST, Yuen K, editors. Effects of the Acoustic Characteristics on the Emotional Tones of Voice of Mandarin Tones. 20th International Congress on Acoustics, ICA 2010; 2010; Sydney, Australia.
  • 49.Scherer KR, Johnstone T, Klasmeyer G. Vocal expression of emotion. In: Davidson RJ, Scherer KR, Goldsmith HH, editors. Handbook of Affective Sciences. New York: Oxford University Press; 2003. p. 433–56. [Google Scholar]
  • 50.Pell MD, Paulmann S, Dara C, Alasseri A, Kotz SA. Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics. 2009;37(4):417–35. doi: 10.1016/j.wocn.2009.07.005 [DOI] [Google Scholar]
  • 51.Zhang S, Ching PC, Kong F, editors. Acoustic analysis of emotional speech in Mandarin Chinese. International Symposium on Chinese Spoken Language Processing; 2006. [Google Scholar]
  • 52.Nygaard LC, Lunders ER. Resolution of lexical ambiguity by emotional tone of voice. Memory & Cognition. 2002;30(4):583–93. doi: 10.3758/bf03194959 [DOI] [PubMed] [Google Scholar]
  • 53.Nygaard LC, Queen JS. Communicating emotion: linking affective prosody and word meaning. Journal of Experimental Psychology: Human Perception and Performance. 2008;34(4):1017–30. doi: 10.1037/0096-1523.34.4.1017 [DOI] [PubMed] [Google Scholar]
  • 54.Chang HS, Young ST, Li PC, Chu WC, Ho CY. Effects of emotional tones of voice on the acoustic and perceptual characteristics of Mandarin tones. The Journal of the Acoustical Society of America. 2018;144(3). doi: 10.1121/1.5067961 [Google Scholar]
  • 55.Williams CE, Stevens KN. Emotion and speech: Some acoustical correlates. Journal of the Acoustical Society of America. 1972;52:1238–50. doi: 10.1121/1.1913238 [DOI] [PubMed] [Google Scholar]
  • 56.Scherer KR. Vocal affect expression: a review and a model for future research. Psychological Bulletin. 1986;99(2):143–65. doi: 10.1037/0033-2909.99.2.143 [DOI] [PubMed] [Google Scholar]
  • 57.Schröder M. Expressing degree of activation in synthetic speech. IEEE Transactions on audio, speech, and language processing. 2006;14(4):1128–36. doi: 10.1109/TASL.2006.876118 [DOI] [Google Scholar]
  • 58.Goudbeek M, Scherer KR. Beyond arousal: Valence and potency/control cues in the vocal expression of emotion. The Journal of the Acoustical Society of America. 2010;128(3):1322–36. doi: 10.1121/1.3466853 [DOI] [PubMed] [Google Scholar]
  • 59.Ilie G, Thompson WF. A comparison of acoustic cues in music and speech for three dimensions of affect. Music Perception. 2006;23 (4):319–30. doi: 10.1525/mp.2006.23.4.319 [DOI] [Google Scholar]
  • 60.Laukka P, Juslin P, Bresin R. A dimensional approach to vocal expression of emotion. Cognition & Emotion. 2005;19(5):633–53. doi: 10.1080/02699930441000445 [DOI] [Google Scholar]
  • 61.Schmidt J, Janse E, Scharenborg O. Perception of emotion in conversational speech by younger and older listeners. Frontiers in Psychology. 2016. doi: 10.3389/fpsyg.2016.00781 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Elfenbein HA, Ambady N. On the universality and cultural specificity of emotion recognition: a meta-analysis. Psychological Bulletin. 2002;128(2):203–35. doi: 10.1037/0033-2909.128.2.203 [DOI] [PubMed] [Google Scholar]
  • 63.Douglas-Cowie E, Campbell N, Cowie R, Roach P. Emotional speech: Towards a new generation of databases. Speech Communication. 2003;40(1–2):33–60. doi: 10.1016/s0167-6393(02)00070-5 [DOI] [Google Scholar]
  • 64.Ross ED, Edmondson JA, Seibert GB. The effect of affect on various acoustic measures of prosody in tone and non-tone languages: A comparison based on computer analysis of voice. Journal of Phonetics. 1986;14(2):283–302. doi: 10.1016/s0095-4470(19)30669-2 [DOI] [Google Scholar]
  • 65.Anolli L, Lei W, Mantovani F, De Toni A. The Voice of Emotion in Chinese and Italian Young Adults. Journal of Cross-Cultural Psychology. 2008;39(5):565–98. doi: 10.1177/0022022108321178 [DOI] [Google Scholar]
  • 66.Wang T, Lee YC, Ma Q. Within and Across-Language Comparison of Vocal Emotions in Mandarin and English. Applied Science. 2018;8(12):2629. doi: 10.3390/app8122629 [DOI] [Google Scholar]
  • 67.Chong CS, Kim J, Davis C, editors. Exploring acoustic differences between Cantonese (tonal) and English (non-tonal) spoken expressions of emotions. INTERSPEECH 2015; 2015; Dresden, Germany. [Google Scholar]
  • 68.Luksaneeyanawin S. Intonation in Thai. In: Hirst D, DiCristo A, editors. Intonation Systems: A Survey of twenty languages. Cambridge: Cambridge University Press; 1998. [Google Scholar]
  • 69.Li AJ, Jia Y, Fang Q, Dang JW, editors. Emotional intonation modeling: A cross-language study on Chinese and Japanese. 2013 APSIPA Annual Summit and Conference; 2013; Kaohsiung, Taiwan.
  • 70.Chao YR. Tone and intonation in Chinese. Bulletin of the Institute of History and Philology. 1933;4(2):121–34. [Google Scholar]
  • 71.Liang Y, Chen A, editors. The perception of lexical tones in emotional speech by Dutch learners of Mandarin. The 19th International Congress of Phonetic Sciences; 2019; Melbourne, Australia.
  • 72.Kitayama S, Ishii K. Word and voice: Spontaneous attention to emotional utterances in two languages. Cognition & Emotion. 2002;16(1):29–59. doi: 10.1080/0269993943000121 [DOI] [Google Scholar]
  • 73.Ishii K, Reyes JA, Kitayama S. Spontaneous attention to word content versus emotional tone: differences among three cultures. Psychological Science. 2003;14(1):39–46. doi: 10.1111/1467-9280.01416 [DOI] [PubMed] [Google Scholar]
  • 74.Singh L, Lee Q, Goh WD. Processing dependencies of segmental and suprasegmental information: effects of emotion, lexical tone, and consonant variation. Language, Cognition and Neuroscience. 2016;31(8):989–99. doi: 10.1080/23273798.2016.1190850 [DOI] [Google Scholar]
  • 75.Fenster CA, Blake LK, Goldstein AM. Accuracy of vocal emotional communications among children and adults and the power of negative emotions. Journal of Communication Disorders. 1977;10(4):301–14. doi: 10.1016/0021-9924(77)90028-4 [DOI] [PubMed] [Google Scholar]
  • 76.Johnson WF, Emde RN, Scherer KR, Klinnert MD. Recognition of emotion from vocal cues. Archives of General Psychiatry. 1986;43(3):280–3. doi: 10.1001/archpsyc.1986.01800030098011 . [DOI] [PubMed] [Google Scholar]
  • 77.Murray IR, Arnott JL. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication. 1995;16(4):369–90. doi: 10.1016/0167-6393(95)00005-9 [DOI] [Google Scholar]
  • 78.Öhman A, Flykt A, Esteves F. Emotion drives attention: Detecting the snake in the grass. Journal of Experimental Psychology: General. 2001;130(3):466–78. doi: 10.1037//0096-3445.130.3.466 [DOI] [PubMed] [Google Scholar]
  • 79.Tooby J, Cosmides L. The past explains the present. Ethology and Sociobiology. 1990;11(4–5):375–424. doi: 10.1016/0162-3095(90)90017-z [DOI] [Google Scholar]
  • 80.Wang T, Ding H, Gu W, editors. Perceptual Study for Emotional Speech of Mandarin Chinese. Speech Prosody 2012; 2012; Shanghai, China. [Google Scholar]
  • 81.Ladefoged P, Broadbent DE. Information conveyed by vowels. The Journal of the Acoustical Society of America. 1957;29(1):98–104. doi: 10.1121/1.1908694 [DOI] [PubMed] [Google Scholar]
  • 82.Mitterer H. Is vowel normalization independent of lexical processing? Phonetica. 2006;63(4):209–29. doi: 10.1159/000097306 [DOI] [PubMed] [Google Scholar]
  • 83.Sjerps M, Fox NP, Johnson K, Chang EF. Speaker-normalized sound representations in the human auditory cortex. Nature Communications. 2019;10:2465. doi: 10.1038/s41467-019-10365-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Gladstone VS, Siegenthaler BM. Carrier phrase and speech intelligibility test score. Journal of Auditory Research. 1971;11(1):101–3. [Google Scholar]
  • 85.Lynn JM, Brotman SR. Perceptual significance of the CID W-22 carrier phrase. Ear and Hearing. 1981;2(3):95–9. doi: 10.1097/00003446-198105000-00001 [DOI] [PubMed] [Google Scholar]
  • 86.Egan JP. Articulation testing methods. Laryngoscope. 1948;58:955–91. doi: 10.1288/00005537-194809000-00002 [DOI] [PubMed] [Google Scholar]
  • 87.Wilson RH, Sanchez VA. Effects of the Carrier Phrase on Word Recognition Performances by Younger and Older Listeners Using Two Stimulus Paradigms. Journal of the American Academy of Audiology. 2020;31(6):412–41. doi: 10.3766/jaaa.19061 [DOI] [PubMed] [Google Scholar]
  • 88.Martin FN, Hawkins R, Bailey H. The nonessentiality of the carrier phrase in phonetically balanced (PB) word testing. The Journal of Auditory Research. 1962;2:319–22. [Google Scholar]
  • 89.Xu Y. Contextual tonal variations in Mandarin. Journal of Phonetics. 1997;25(1):61–83. doi: 10.1006/jpho.1996.0034 [DOI] [Google Scholar]
  • 90.Moore CB, Jongman A. Speaker normalization in the perception of Mandarin Chinese tones. The Journal of the Acoustical Society of America. 1997;102(3):1864–77. Epub 1997/09/25. doi: 10.1121/1.420092 . [DOI] [PubMed] [Google Scholar]
  • 91.Wu CL. Developing the Instruction Corpus and Computer Assistive System of Language Intervention for Children with Hearing and Language Disorders. NSC; 2004. [Google Scholar]
  • 92.Tsai KS, Tseng LH, Wu CJ, Young ST. Development of a Mandarin monosyllable recognition test. Ear and Hearing. 2009;30(1):10. doi: 10.1097/AUD.0b013e31818f28a6 [DOI] [PubMed] [Google Scholar]
  • 93.Boersma P, Weenink D. Praat: doing phonetics by computer [Computer program]. Version 6.1.48 [cited 2019 February 17]. Available from: http://www.praat.org/.
  • 94.Chen A, Boves L. What’s in a word: Sounding sarcastic in British English. Journal of the International Phonetic Association. 2018;48(1):57–76. doi: 10.1017/S0025100318000038 [DOI] [Google Scholar]
  • 95.Gubian M, Torreira F, Boves L. Using Functional Data Analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics. 2015;40(9):16–40. doi: 10.1016/j.wocn.2014.10.001 [DOI] [Google Scholar]
  • 96.R Core Team. R: A Language and Environment for Statistical Computing (Version 3.6.3). R Foundation for Statistical Computing; 2021. Available from: https://www.R-project.org/. [Google Scholar]
  • 97.Scherer KR, Oshinsky JS. Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion. 1977;1(4):331–46. doi: 10.1007/bf00992539 [DOI] [Google Scholar]

Decision Letter 0

Vera Kempe

24 Nov 2021

PONE-D-21-23777
Emotional tones of voice affect the acoustics and perception of Mandarin tones
PLOS ONE

Dear Dr. Chu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Specifically, the reviewers acknowledge the merits of this study but see major problems with the integration with the current literature as well as with the justification of the design and the appropriateness of the analyses. Because PLOS ONE emphasises methodological soundness over novelty and impact I am minded to give you the opportunity to address the reviewers' comments very carefully. Reviewer 2 makes particularly pertinent and detailed comments that I encourage you to address. I particularly see the need to address the comments regarding ANOVA as an unsuitable statistical approach to these data. In addition, please make sure the data are being made available to the reviewers - you state that they are but the reviewers had trouble finding them. Finally, I would like to point out that request of revision does not constitute guarantee of acceptance of the revised version.

Please submit your revised manuscript by Jan 08 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please change "female” or "male" to "woman” or "man" as appropriate, when used as a noun (see for instance https://apastyle.apa.org/style-grammar-guidelines/bias-free-language/gender).

3. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

4. We note that you have referenced (ChangLiao I [31]) which has not yet been accepted for publication. Please remove this from your References and amend this to state in the body of your manuscript: (“ChangLiao I. [Unpublished]”) as detailed online in our guide for authors http://journals.plos.org/plosone/s/submission-guidelines#loc-reference-style


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study is interesting and relevant. I enjoyed reading it very much.

The separation between acoustics and perception is wonderful.

Please find the points of improvement/discussion below.

Overall review questions:

1. Is the manuscript technically sound, and do the data support the conclusions?

Narratives can appear more concisely throughout the manuscript.

If possible, a bit more discussion on the other way (lexical affecting emotional tone) would be appreciated.

There is a potential design issue in the “context” condition, which can be resolved by careful discussion since the experiments have been done.

Based on the previous comment, please double-check the research questions. Make sure the questions asked are the variables being examined.

2. Has the statistical analysis been performed appropriately and rigorously?

The effect size is omitted in the report.

3. Have the authors made all data underlying the findings in their manuscript fully available?

Maybe I missed this: it is unclear where (anonymous) data are stored.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

The manuscript is understandable but is encouraged to be double-checked on its English writing style. There is space for further improvement.

Detailed comments:

Abstract

1. Polish writing, including style, words and grammar

A. The starting sentence can be polished to enhance the significance of the study

B. “the effect of emotional tones on Mandarin tone acoustics varied depending on specific Mandarin tones, specific emotional tones, and talker gender.” Please rephrase.

C. “select materials”: selected? (same in Ln283)

D. “Emotional tones also affected the four acoustic measures to different degrees.” Please integrate with the narratives prior.

2. Reflecting on the narratives of Exp.2, clarify whether Exp.1 is presented in isolation or in context.

Introduction

Ln48

I understand that the authors adopted a four-basic emotion model. I wonder why the more commonly accepted six-basic emotion model, adding surprise and disgust in the study, is not considered.

Ln51

Please double check "amplitude" or "intensity"

Ln70

1. I have the feeling that these scenarios are restricted to certain context/isolation, or when only key cues are involved. This is later discussed in Ln575.

2. I wonder if these are more like logical possibilities unless the authors argue that different scenarios may occur with different emotional-lexical tone combos.

Ln85

I wonder whether the other way around is worth highlighting, namely, how lexical tone affects emotion (e.g., emotional tone) perception? This is analysed and discussed after all (e.g., Ln805, later asymmetry)

Ln192

Does falling tone affect anger perception?

Ln198-212

Some information here seems to have been mentioned earlier.

Ln232, 234

Wang et al. - repetitive presentation

Ln261

It is unclear why lexical tones lacking F0 variation would be predicted to make the emotional tone less perceptible in the first place. Can it be the other way around? That is, since lexical tone has no variation, would variations of emotion prevail?

Experiment/Analyses

Ln310

Some tables are a little bit difficult to process although the meaning is clear. I do not have suggestions here because the authors have already given careful thoughts on the presentation of these tables.

Ln318

Why is 4 a good number of participants as the baseline for the acoustic characteristics of emotional tone?

Ln537-540

Please rephrase to reduce repetitiveness.

Ln566

Please rephrase "In addition, we also"

Discussion

Regarding "Context":

There can be several possibilities to understand this word in the “context” of the current paper. It may be worthwhile specifying which context the authors are referring to (carrier sentences).

Also, what the context adds to/change the picture needs to be spelt out more overtly (e.g., ease of semantic parsing and/or tonal coarticulation? If one, why not the other?)

One methodological issue is to what extent the discrepancy between context emotion (no emotion if I understand correctly) and target word emotion would affect the current findings in the context condition. This needs to be discussed.

Reviewer #2: 1. Is the manuscript technically sound, and do the data support the conclusions?

a. The current study vs. directly relevant past studies: This study is a potentially very valuable addition to the line of research on the influence of emotion on the production and perception of lexical tones in Mandarin. Different from past studies, the present study looks at the influence of emotion on lexical tones in both production and perception. However, this does not mean that past work with a focus on either the production side of the coin or the perception should be disconnected from the current study. In other words, the authors did not seem to make a sufficient use of existent findings to motivate their study and develop the hypotheses and predictions. Specially, the authors focused on how the realisation of each of the five basic emotions impacts the f0 mean, f0 range, intensity and duration of the target syllable. The literature review 'how emotional tones affect speech acoustics' only discussed the differences in the degree to which an acoustic parameter is exploited to express emotions between tone and non-tone languages. The authors referred to 'the literature reviewed' when presenting their predictions for changes in f0, duration and intensity (lines 300-302). But past studies on how emotion expression influences the f0 and duration of each lexical tone in Mandarin was not reviewed. Furthermore, although research on the effect of emotion expression on the perception of Mandarin tones by native speakers of Mandarin and second language learners of Mandarin is small in numbers, there has been some published work on this topic, e.g. Liang & Chen (2019), which was published in the ICPhS 2019 proceedings. This work was mistakenly cited as Liang and Chen's MA thesis in the references section, summarised in the section 'How emotional tones affect speech perception' and subsequently dismissed. 
This is a pity because the authors could have built their work on the findings from Liang and Chen, especially considering the fact that they had the same research question on the production side as Liang and Chen, and used methods in both Experiment 1 and Experiment 2 similar to Liang and Chen's methodology. The main differences concerning production elicitation seem to be that Liang and Chen used phonotactically legal non-words as target words and systematically varied the tonal contexts preceding the monosyllabic target words, whereas the current study used real words and had only one tonal context, i.e. the monosyllabic target words were preceded by tone 1 in the carrier sentence. Liang and Chen's findings can thus provide a very useful starting point for the current study in terms of both hypothesis formulation and methodology.

b. Taiwan Mandarin is prosodically not the same as Beijing Mandarin (e.g. in prosodic realisation of focus), even though they both have the same four lexical tones. This difference is not sufficiently acknowledged throughout the study. Notably, the similarities in findings between this study and those from Liang and Chen (2019) can have interesting implications for the prosodic similarities and differences regarding the interaction between emotion and lexical tones between Taiwan Mandarin and Beijing Mandarin.

c. Experiment 1: The participants were specifically instructed not to place 'excessive stress on the target syllables'. This seems to suggest that the participants had no uniform way to determine the information structure of each target sentence. They could have treated each target sentence as a response to a 'what-happens'-like question, rendering the whole sentence focal (broad focus) or treated the target word as the focus (narrow focus) in spite of the instructions. What the participants did regarding the information structure of the target sentences may influence how they used prosody to realise emotion and tones. For example, they might have distributed the prosodic changes over the whole sentence if they had a broad-focus analysis, or the prosodic changes might have concentrated in the target syllables if they had a narrow-focus analysis.

d. Experiment 1: The recordings from the participants were rated by 30 native speakers of Taiwan Mandarin. No information was given on who these participants were and how they were recruited. Were they the same participants who did Experiment 2?

e. Experiment 2: Experiment 2 hinged on two opposing assumptions. Neither is well motivated. One of the assumptions was that ‘emotional tones alter the canonical acoustic characteristics of Mandarin tones, making it difficult for listeners to retrieve the Mandarin tones’ (Lines 554-556). The authors did not specify anywhere in the manuscript what the ‘canonical acoustic characteristics of Mandarin tones’ should be. I do not think that past studies have provided evidence for such a claim either. Instead, it has been shown that the changes in the acoustic realisation of lexical tones related to the expression of emotions are restricted to gradient changes in terms of higher or lower mean pitch, smaller or wider pitch range, or longer or shorter duration. Such changes are phonetic by nature because lexical tones are not defined by absolute pitch, duration and intensity. The other assumption is that ‘listeners are able to disaggregate the acoustic changes into two sources – Mandarin tones and emotional tones’. This assumption is itself based on the assumptions that lexical tones have absolute f0, duration and intensity values, which is not the case, and that the perception of emotional prosody is categorical, for which evidence has emerged (e.g. Laukka 2005).

Laukka, P. (2005). Categorical perception of vocal emotion expressions. Emotion, 5(3), 277–295.

f. The factors ‘context’ vs. ‘isolation’ in Experiment 2: The target words were presented to the participants in both their original carrier sentence and out of their original carrier sentence (e.g. cut out of the carrier sentences) to study ‘how the presence of context would affect the identification of Mandarin tones and emotion tones’. The latter was referred to as the isolation condition. The authors did not offer any hypotheses or predictions for the effect of context based on current understanding of the effect of tonal coarticulation on tonal perception and of the fact that emotion expression can affect the prosody of the whole sentence.

It should also be pointed out that the isolation condition cannot be representative of how a lexical tone is produced in isolation, i.e. with no neighbouring sounds, or how an emotion is expressed in a monosyllabic word with no neighbouring sounds. When a monosyllabic word is produced in isolation, it is produced as a one-word utterance, prosodically speaking. It has no tonal coarticulation and carries all the prosodic adjustments that need to be made to express a certain emotion.

Besides, in the ‘context’ condition, the target words were preceded by a Tone-1 word and followed by a neutral-tone word in all sentences. Can the current results be generalised to other tonal contexts? For example, what happens when the target word is preceded by a Tone-2, Tone-3 or Tone-4 word?

g. Experiment 2, lines 609-615: The authors stated that ‘the order of the stimuli within each block was randomized for each participant’. How was this done exactly? Take the block of isolated syllables for tonal identification for example. What kind of principles were used in the randomization?

h. Experiment 2: I don’t understand why stimuli produced in the neutral emotion were not included in the task on perception of emotion. Excluding this option might potentially increase the chance of choosing the right emotion from 20% to 25%.

i. The discussion on the differences in the context and isolation conditions was somewhat confusing. The authors seemed to acknowledge that there were prosodic cues to emotion outside the target word in a sentence in some places (e.g. lines 829-830). But in the conclusion section, the authors attributed the better performance in the context condition to listeners’ knowledge of tonal co-articulation (lines 843-844). This is true for the perception of tones but not true for the perception of emotion.

2. Has the statistical analysis been performed appropriately and rigorously?

a. The data from Experiment 2 were categorical by nature: 4 categories for identification of lexical tones; 4 categories for identification of emotions. These data should thus be analysed using methods that can deal with categorical data. Repeated measures ANOVA, which was used in the paper, is not the appropriate analysis. The authors can consider using the mixed-model multinomial regression analysis in R.

b. The effect of emotion on the shape of the f0 contours of the lexical tones was discussed based on visual observations (lines 380-390). A more rigorous and more appropriate approach would be to use Functional Data Analysis to mathematically quantify the shapes of each contour in terms of functional principal components and then use the fPCs as dependent variables.

Gubian, M., Torreira, F., & Boves, L. (2015). Using Functional Data Analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics, 49, 16–40. https://doi.org/10.1016/j.wocn.2014.10.001

Chen, A., & Boves, L. (2018). What’s in a word: Sounding sarcastic in British English. Journal of the International Phonetic Association, 48(1). https://doi.org/10.1017/S0025100318000038
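To illustrate the suggested approach, a minimal discretised fPCA sketch on synthetic, time-normalised f0 contours follows (all data are fabricated for illustration): centring the curves and applying an SVD yields the principal modes of contour shape, and the per-contour projections (fPC scores) are the scalar dependent variables the reviewer has in mind.

```python
import numpy as np

rng = np.random.default_rng(2)
n_contours, n_points = 60, 30
t = np.linspace(0.0, 1.0, n_points)

# Hypothetical time-normalised f0 contours (semitones): a shared falling
# trend plus contour-specific variation in slope and curvature, plus noise
slopes = rng.normal(0.0, 1.0, n_contours)
curves = rng.normal(0.0, 0.5, n_contours)
contours = (-2.0 * t
            + slopes[:, None] * t
            + curves[:, None] * np.sin(np.pi * t)
            + rng.normal(0.0, 0.1, (n_contours, n_points)))

# Discretised fPCA: centre the curves, then SVD. Rows of Vt are the
# principal component curves; projections give one score per contour.
mean_curve = contours.mean(axis=0)
centred = contours - mean_curve
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ Vt[:2].T               # first two fPC scores per contour
var_explained = (S ** 2) / (S ** 2).sum()  # proportion of variance per fPC
```

The fPC scores can then enter a regression with emotion and tone as predictors, replacing visual comparison of contour plots with a quantitative test.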

c. Crucial details are missing on the f0 contours presented in Figure 1: How were the f0 contours of the lexical tones in Figure 1 determined? Were they the mean f0 contours of all speakers? If so, how were they extracted? Were they time-normalised?
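The time-normalisation step the reviewer asks about can be sketched as follows: each raw contour, sampled at a different number of frames, is resampled to a fixed number of proportional time points by linear interpolation, after which averaging across speakers is well defined. The function and data here are illustrative, not the authors' procedure.

```python
import numpy as np

def time_normalise(f0, n_points=20):
    """Resample one f0 contour (values at uniformly spaced frames) to a
    fixed number of proportional time points via linear interpolation."""
    f0 = np.asarray(f0, dtype=float)
    src = np.linspace(0.0, 1.0, len(f0))   # original proportional times
    dst = np.linspace(0.0, 1.0, n_points)  # common time grid
    return np.interp(dst, src, f0)

# Hypothetical contours of different raw lengths mapped to a common grid
a = time_normalise([220, 230, 245, 250, 248, 240], n_points=5)
b = time_normalise([180, 210, 260], n_points=5)
mean_contour = (a + b) / 2.0  # averaging is now well defined
```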

d. Statistics on all the pairwise comparisons should be provided for each analysis.

3. Have the authors made all data underlying the findings in their manuscript fully available?

The authors stated that ‘All relevant data are within the manuscript and its Supporting Information files.’ I did not find out how to access Supporting Information in PLOS One. It seems that I could only download the manuscript. I ticked the ‘Yes’ box for this question but I’d like to acknowledge that I do not have the information to judge the availability of the data in Supporting Information that has not been presented in the manuscript.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

I fear that the clarity of the manuscript has been affected by the presentation and exposition, even though there are very few language errors. The major issue I have with the writing is that the authors regularly made claims without substantiating them. A few examples:

a. The authors sketched four possible scenarios for how emotion may affect the production and perception of tones on pp. 4-5 (lines 70-86). These scenarios appear poorly informed given what has already been known.

b. Lines 98-99: ‘Despite the universal appearance of emotion tones, their acoustic correlates vary across studies.’ I do not understand claims such as this one that seem to contain contradictory information. In this particular case, the review of the literature in the preceding lines clearly pointed to cross-linguistic differences in how f0, duration, intensity are varied to express the same emotion.

c. Lines 167-169: These lines were supposed to summarise the main findings reviewed in the same paragraph. However, the preceding review did not talk about discrimination of consonants and recognition of spoken words. The findings reviewed were mostly concerned with recognizing word meaning.

d. Lines 175-184: These lines were supposed to give more details on Singh et al.’s work after the summarising sentence on what their work was about in lines 173-175. However, what was presented in lines 175-184 was hard to follow and did not really match the summarising sentence.

e. Lines 195-197 on the limitation of Liang and Chen’s study: “Since no acoustic analysis was reported, it is not clear to what extent these findings could be explained by the actual acoustic difference among the emotional tones.” I am not sure I can follow the logic here. Liang and Chen had the stimuli rated by a native speaker of Mandarin in terms of emotion before presenting them to their participants. The emotions were correctly recognised in all cases. This would have allowed Liang and Chen to conclude that the emotions were accurately produced in the stimuli and that differences found in the perception of tones in different emotions could be attributed to differences in the prosodic realisation of the emotions.

f. Lines 231-232: “Although emotional tones appear to be perceived in a similar way irrespective of specific languages, it is not clear whether the use of lexical tones in tonal languages affects the perception of emotional tones.” Immediately following these lines, the authors reviewed Wang et al.’s study on exactly this topic. Their findings seem quite straightforward to me. What is still not clear and should therefore be investigated in the current study then?

g. Lines 271-273: I have the same concern as with lines 231-232.

h. Lines 256-259: “Evidence from L1 and L2 studies of segmental and tonal perception further suggests that tonal language experience affects how lexical tones and segmental structure are processed when they are produced with emotional tones.” Such studies were not reviewed in the preceding sections.

i. Lines 673-676: The description of the results (Tables 8, 9) was not accurate.

j. Lines 746-747: The description of the result was not accurate.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Liquan Liu

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 5;18(4):e0283635. doi: 10.1371/journal.pone.0283635.r002

Author response to Decision Letter 0


28 Feb 2022

Editor comments:

Specifically, the reviewers acknowledge the merits of this study but see major problems with the integration with the current literature as well as with the justification of the design and the appropriateness of the analyses. Because PLOS ONE emphasises methodological soundness over novelty and impact, I am minded to give you the opportunity to address the reviewers' comments very carefully. Reviewer 2 makes particularly pertinent and detailed comments that I encourage you to address. I particularly see the need to address the comments regarding ANOVA as an unsuitable statistical approach to these data. In addition, please make sure the data are being made available to the reviewers - you state that they are but the reviewers had trouble finding them. Finally, I would like to point out that a request for revision does not constitute a guarantee of acceptance of the revised version.

Response: We thank Dr. Kempe for the reminder regarding methodological soundness. We have followed the reviewers’ advice to revamp the conceptual foundation of the study. We have replaced the ANOVA with a mixed-model multinomial regression analysis for Experiment 2. We have made sure the data are available to the reviewers.

Reviewer #1 comments:

The study is interesting and relevant. I enjoyed reading it very much. The separation between acoustics and perception is wonderful. Please find the points of improvement/discussion below.

Response: We thank Dr. Liu for recognizing the merit of this study.

1. Is the manuscript technically sound, and do the data support the conclusions?

The narrative could be more concise throughout the manuscript.

Response: We have revised the manuscript to make it more concise.

If possible, a bit more discussion on the other direction (lexical tones affecting emotional tones) would be appreciated.

Response: We have added more information about how lexical tones affect the acoustics and perception of emotions:

• In the section “How Emotions Affect Speech Acoustics”, we added two studies showing restricted F0 variations of emotions in other tonal languages (Chong, Kim, & Davis, 2015, for Cantonese; and Luksaneeyanawin, 1998, for Thai). We also added a Mandarin-Japanese study that showed the opposite pattern (Li, Jia, Fang, & Dang, 2013).

• In the section “How Emotions Are Perceived in Speech”, we added discussion of a study on emotion recognition in Mandarin (Wang, Ding, & Gu, 2012).

There is a potential design issue in the “context” condition, which can be resolved by careful discussion since the experiments have been done.

Response: We have clarified our use of the term “context” throughout the manuscript.

Based on the previous comment, please double-check the research questions. Make sure the questions asked are the variables being examined.

Response: We have revised the introduction to make sure the research questions are aligned with the variables being examined.

2. Has the statistical analysis been performed appropriately and rigorously?

Effect sizes are omitted from the report.

Response: We have included effect size as instructed.

3. Have the authors made all data underlying the findings in their manuscript fully available?

Maybe I missed this: it is unclear where (anonymous) data are stored.

Response: We apologize for missing the data in the original submission. As instructed, we have included the minimal data set.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

The manuscript is understandable, but the English writing style should be double-checked. There is room for further improvement.

Response: We have asked a native speaker of English to check our word choice, grammar, and writing style.

Detailed comments:

Abstract

1. Polish writing, including style, words and grammar

A. The starting sentence can be polished to enhance the significance of the study

Response: We have revised the starting sentence to enhance the significance of the study.

B. “the effect of emotional tones on Mandarin tone acoustics varied depending on specific Mandarin tones, specific emotional tones, and talker gender.” Please rephrase.

Response: We have rephrased this sentence to improve its clarity.

C. “select materials”: selected? (same in Ln283)

Response: We have changed the word here and elsewhere.

D. “Emotional tones also affected the four acoustic measures to different degrees.” Please integrate with the narratives prior.

Response: We have integrated this sentence into a previous sentence.

2. Reflecting on the narratives of Exp.2, clarify whether Exp.1 is presented in isolation or in context.

Response: We have clarified that acoustic analyses in Experiment 1 were conducted on target syllables extracted from a carrier phrase.

Introduction

Ln48: I understand that the authors adopted a four-basic-emotion model. I wonder why the more commonly accepted six-basic-emotion model, which adds surprise and disgust, was not considered in the study.

Response: We thank Dr. Liu for pointing this out. The four basic emotion model was proposed by Ortony & Turner (1990) and the six basic emotion model was proposed by Ekman (1999). Both were cited extensively: Ortony & Turner (1990) was cited 3107 times and Ekman (1999) 4399 times. We decided to adopt the four emotion model because the four emotions are included in the six-emotion model, and most of the studies on Mandarin have used the four-emotion model (Yuan, Shen, & Chen, 2002; Zhang, Ching, & Kong, 2006; Lin & Fon, 2012; Wang & Qian, 2018; Chang, Young, Li, Chu, & Ho, 2018; Chang, Young, & Yuan, 2010; Wang, Lee, & Ma, 2016). Adopting the four-emotion model would make comparisons with previous studies more straightforward.

Ekman, P. Basic Emotions. In Dalgleish T, Power T, editors, The Handbook of Cognition and Emotion. Sussex: John Wiley & Sons, Ltd.; 1999. p. 45‐60

Ln51: Please double check "amplitude" or "intensity"

Response: We have replaced “intensity” with “amplitude” so that “amplitude” is used consistently throughout the manuscript.

Ln70: 1. I have the feeling that these scenarios are restricted to certain context/isolation, or when only key cues are involved. This is later discussed in Ln575. 2. I wonder if these are more like logical possibilities unless the authors argue that different scenarios may occur with different emotional-lexical tone combos.

Response: We agree that the scenarios are more like logical possibilities, which did not contribute much to motivating the research questions. This point is also made by Reviewer 2. We have removed this paragraph and subsequent discussions that were based on this taxonomy.

Ln85: I wonder whether the other way around is worth highlighting, namely, how lexical tone affects emotion (e.g., emotional tone) perception? This is analysed and discussed after all (e.g., Ln805, later asymmetry)

Response: As noted in our earlier response, we have added more information regarding the effect of lexical tone on emotion perception to the section “how emotions are perceived in speech”.

Ln192: Does falling tone affect anger perception?

Response: Liang and Chen (2019) examined Mandarin tone perception but not emotional perception. Results from our Experiment 2 showed that anger perception in isolated syllables was more accurate in the falling tone (T4: 93%) compared to other tones (T1: 77%; T2: 76%; T3: 53%). Anger perception in syllables embedded in the carrier phrase was comparable among the four tones (T1: 97%; T2: 97%; T3: 96%, T4: 99%).

Ln198-212: Some information here seems to have been mentioned earlier.

Response: This paragraph was intended to offer a summary of the studies reviewed in this section. In light of the reviewer’s comment, we have deleted this summary.

Ln232, 234: Wang et al. - repetitive presentation

Response: We have rewritten this section to remove the repetition.

Ln261: It is unclear why lexical tones lacking F0 variation would be predicted to make the emotional tone less perceptible in the first place. Can it be the other way around? That is, since lexical tone has no variation, would variations of emotion prevail?

Response: We thank Dr. Liu for this insight. This statement was intended to be a summary of Wang et al. (2015, 2018) reviewed earlier. We have rewritten this section to include both the authors’ hypothesis and the reviewer’s interpretation.

Experiment/Analyses

Ln310: Some tables are a little bit difficult to process although the meaning is clear. I do not have suggestions here because the authors have already given careful thought to the presentation of these tables.

Response: We thank Dr. Liu for understanding our effort to summarize the wide range of findings in the literature. We have added more information to the caption of this table and others to clarify the meaning of the symbols and notations.

Ln318: Why is 4 a good number of participants as the baseline for the acoustic characteristics of emotional tone?

Response: Ideally, the more participants, the better. However, our funding only allowed us to recruit a maximum of eight (4 female and 4 male) professional actors. To our knowledge, this number is comparable to those used in similar studies.

Ln537-540: Please rephrase to reduce repetitiveness.

Response: We have rewritten this section to remove the repetition.

Ln566: Please rephrase "In addition, we also"

Response: We removed “also” from this sentence as instructed.

Discussion

Regarding "Context": There can be several possibilities to understand this word in the “context” of the current paper. It may be worthwhile specifying which context the authors are referring to (carrier sentences).

Also, what the context adds to/change the picture needs to be spelt out more overtly (e.g., ease of semantic parsing and/or tonal coarticulation? If one, why not the other?)

One methodological issue is to what extent the discrepancy between context emotion (no emotion if I understand correctly) and target word emotion would affect the current findings in the context condition. This needs to be discussed.

Response: We thank Dr. Liu for this important observation. Reviewer 2 also made a similar comment. We have clarified what we mean by “context” throughout the manuscript.

Reviewer #2 comments:

1. Is the manuscript technically sound, and do the data support the conclusions?

a. The current study vs. directly relevant past studies: This study is a potentially very valuable addition to the line of research on the influence of emotion on the production and perception of lexical tones in Mandarin. Different from past studies, the present study looks at the influence of emotion on lexical tones in both production and perception. However, this does not mean that past work with a focus on either the production side of the coin or the perception should be disconnected from the current study. In other words, the authors did not seem to make sufficient use of existing findings to motivate their study and develop the hypotheses and predictions. Specifically, the authors focused on how the realisation of each of the five basic emotions impacts the f0 mean, f0 range, intensity and duration of the target syllable. The literature review 'how emotional tones affect speech acoustics' only discussed the differences in the degree to which an acoustic parameter is exploited to express emotions between tone and non-tone languages. The authors referred to 'the literature reviewed' when presenting their predictions for changes in f0, duration and intensity (lines 300-302). But past studies on how emotion expression influences the f0 and duration of each lexical tone in Mandarin were not reviewed. Furthermore, although research on the effect of emotion expression on the perception of Mandarin tones by native speakers of Mandarin and second language learners of Mandarin is small in number, there has been some published work on this topic, e.g. Liang & Chen (2019), which was published in the ICPhS 2019 proceedings. This work was mistakenly cited as Liang and Chen's MA thesis in the references section, summarised in the section 'How emotional tones affect speech perception' and subsequently dismissed.
This is a pity because the authors could have built their work on the findings from Liang and Chen, especially considering the fact that they had the same research question on the production side as Liang and Chen, and used a similar method in both Experiment 1 and Experiment 2 to Liang and Chen’s methodology. The main differences concerning production elicitation seem to be that Liang and Chen used phonotactically legal non-words as target words and systematically varied the tonal contexts preceding the monosyllabic target words, whereas the current study used real words and had only one tonal context, i.e. the monosyllabic target words were preceded by tone 1 in the carrier sentence. Liang and Chen's findings can thus provide a very useful starting point for the current study in terms of both hypothesis formulation and methodology.

Response: We thank the reviewer for this important advice on grounding the current study in the literature. We agree that the study by Liang and Chen (2019) offers an excellent foundation for this study. We have revised the introduction, results, and discussion to highlight how Liang and Chen’s (2019) study has been used to develop our research questions and hypotheses, and how our results compared to Liang and Chen’s (2019) findings.

We also agree that a more meaningful connection should be made with previous studies focusing on either production or perception. We have rewritten the introduction and discussion to enhance the connection.

Re: “But past studies on how emotion expression influences the f0 and duration of each lexical tone in Mandarin was not reviewed.” Thank you for the comment. We have added a summary of Chao’s (1933) description of how emotions are implemented in Mandarin tones. We have also included a discussion of Li, Fang, and Dang’s (2011) acoustic study.

Finally, we apologize for the confusion between Liang and Chen’s ICPhS proceedings and Liang’s MA thesis. We had referenced the thesis in the original version because the thesis seemed to provide more details of the study. In light of the reviewer’s feedback, we have used the ICPhS citation throughout the manuscript.

b. Taiwan Mandarin is prosodically not the same as Beijing Mandarin (e.g. in prosodic realisation of focus), even though they both have the same four lexical tones. This difference is not sufficiently acknowledged throughout the study. Notably, the similarities in findings between this study and those from Liang and Chen (2019) can have interesting implications for the prosodic similarities and differences regarding the interaction between emotion and lexical tones between Taiwan Mandarin and Beijing Mandarin.

Response: We thank the reviewer for bringing to our attention this comparison. We have acknowledged this difference as a limitation of the current study.

c. Experiment 1: The participants were specifically instructed not to place 'excessive stress on the target syllables'. This seems to suggest that the participants had no uniform way to determine the information structure of each target sentence. They could have treated each target sentence as a response to a 'what-happens'-like question, rendering the whole sentence focal (broad focus) or treated the target word as the focus (narrow focus) in spite of the instructions. What the participants did regarding the information structure of the target sentences may influence how they used prosody to realise emotion and tones. For example, they might have distributed the prosodic changes over the whole sentence if they had a broad-focus analysis, or the prosodic changes might have concentrated in the target syllables if they had a narrow-focus analysis.

Response: We thank the reviewer for this insight. Our intention was to elicit a broad focus, i.e., distributing the prosodic changes over the whole sentence, instead of narrowly focusing on the target word. We instructed the participants to avoid placing excessive stress on the target word because the carrier phrase was the same for different target words, and we worried there would be a tendency for the speakers to emphasize the target word. We have revised the description to clarify that the instruction was intended to elicit a broad-focus analysis.

d. Experiment 1: The recordings from the participants were rated by 30 native speakers of Taiwan Mandarin. No information was given on who these participants were and how they were recruited. Were they the same participants who did Experiment 2?

Response: Yes, the same participants also participated in Experiment 2. We apologize for not specifying this information in the original manuscript. We have added this information in the revision. Also, there were actually 36 participants for the rating task. We apologize for the error. We have made corrections throughout the manuscript.

e. Experiment 2: Experiment 2 hinged on two opposing assumptions. Neither is well motivated. One of the assumptions was that ‘emotional tones alter the canonical acoustic characteristics of Mandarin tones, making it difficult for listeners to retrieve the Mandarin tones’ (Lines 554-556). The authors did not specify anywhere in the manuscript what the ‘canonical acoustic characteristics of Mandarin tones’ should be. I do not think that past studies have provided evidence for such a claim either. Instead, it has been shown that the changes in the acoustic realisation of lexical tones related to the expression of emotions are restricted to gradient changes in terms of higher or lower mean pitch, smaller or wider pitch range, or longer or shorter duration. Such changes are phonetic by nature because lexical tones are not defined by absolute pitch, duration and intensity. The other assumption is that ‘listeners are able to disaggregate the acoustic changes into two sources – Mandarin tones and emotional tones’. This assumption is itself based on the assumptions that lexical tones have absolute f0, duration and intensity values, which is not the case, and that the perception of emotional prosody is categorical, for which evidence has emerged (e.g. Laukka 2005).

Laukka, P. (2005). Categorical perception of vocal emotion expressions. Emotion, 5(3), 277–295.

Response: We thank the reviewer for these observations. We certainly agree that lexical tones are not defined by absolute pitch, intensity, or duration. The acoustic characteristics of tones vary across talkers, phonetic contexts, and emotions. Our use of the term “canonical” was meant to highlight that emotions, like talkers (e.g., Moore and Jongman, 1997), tonal contexts (e.g., Xu, 1997), and other sources of acoustic variability, may change the citation form of lexical tones. The gradient acoustic changes noted by the reviewer illustrate exactly the idea. It was not our intention to claim that a lexical tone is associated with an ideal set of acoustic characteristics. In light of the reviewer’s comment, we have removed the term “canonical” and rephrased this statement to avoid that impression.

Our use of the term “disaggregate” was intended to highlight that the listener’s task is to interpret the acoustic signal by taking into consideration possible sources of acoustic variability. That is, listeners need to analyze the acoustic signal in terms of lexical tones, emotional tones, and other sources that contribute to the acoustic variability. This is analogous to interpreting lexical tones by considering talker F0 range (e.g., Moore and Jongman, 1997). It was not our intention to assume a lexical tone is associated with an ideal set of acoustic characteristics. We have rephrased this statement to avoid that impression.

We thank the reviewer for bringing Laukka (2005) to our attention. Since our study was not designed to test whether emotions are perceived categorically or not, we did not include the study in the references.

f. The factors ‘context’ vs. ‘isolation’ in Experiment 2: The target words were presented to the participants in both their original carrier sentence and out of their original carrier sentence (e.g. cut out of the carrier sentences) to study ‘how the presence of context would affect the identification of Mandarin tones and emotion tones’. The latter was referred to as the isolation condition. The authors did not offer any hypotheses or predictions for the effect of context based on current understanding of the effect of tonal coarticulation on tonal perception and that emotion expression can affect the prosody of the whole sentence.

It should also be pointed out that the isolation condition cannot be representative of how a lexical tone is produced in isolation, i.e. with no neighbouring sounds, or how an emotion is expressed in a monosyllabic word with no neighbouring sounds. When a monosyllabic word is produced in isolation, it is produced as a one-word utterance, prosodically speaking. It has no tonal coarticulation and carries all the prosodic adjustments that need to be made to express a certain emotion.

Besides, in the ‘context’ condition, the target words were preceded by a Tone-1 word and followed by a neutral-tone word in all sentences. Can the current results be generalised to other tonal contexts? For example, what happens when the target word is preceded by a Tone-2, Tone-3 or Tone-4 word?

Response: We thank the reviewer for this important observation. Reviewer 1 also made a similar comment. We apologize for not specifying the hypotheses regarding context in the original version. We have elaborated on our characterization of context throughout the manuscript. We have also added specific hypotheses to the section “the present study”.

We agree that the “isolation” condition does not represent how a lexical tone is produced in isolation. As the reviewer correctly pointed out, the “isolated” syllables were not produced in isolation; rather, they were excised from a sentence with the carrier phrase. We have revised the manuscript to clarify this and the nature of the isolation vs. context comparison.

We agree with the reviewer’s observation about generalization. Since the carrier phrase was the same for all target syllables, our results cannot generalize to other tonal contexts. On the other hand, Liang and Chen (2019) systematically manipulated the preceding tone. We have revised the manuscript to acknowledge this limitation.

g. Experiment 2, lines 609-615: The authors stated that ‘the order of the stimuli within each block was randomized for each participant’. How was this done exactly? Take the block of isolated syllables for tonal identification for example. What kind of principles were used in the randomization?

Response: We used the Random Number Generator in LabVIEW (National Instruments) for randomization. Each of the 120 syllables was assigned a number ranging from 1 to 120. The Random Number Generator was then used to create a unique presentation order for each participant. We have added this description to specify the randomization procedure.
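The per-participant shuffling described in this response can be illustrated with a short sketch, here in Python rather than LabVIEW. This is a hypothetical illustration of the same idea (index 120 stimuli, generate a fresh random order per participant), not the authors’ actual code; the function name and the participant count of 40 are assumptions for the example.

```python
import random

def presentation_order(n_stimuli=120, seed=None):
    """Return one random presentation order over stimuli indexed 1..n_stimuli.

    Mirrors the procedure described above: each syllable is assigned a number,
    then the sequence is shuffled. `seed` is only for reproducibility.
    """
    rng = random.Random(seed)
    order = list(range(1, n_stimuli + 1))
    rng.shuffle(order)  # Fisher-Yates shuffle: every permutation equally likely
    return order

# One unique order per participant, e.g. for a hypothetical 40 participants:
orders = [presentation_order(seed=p) for p in range(40)]
```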

h. Experiment 2: I don’t understand why stimuli produced in the neutral emotion were not included in the task on perception of emotion. Excluding this option might potentially increase the chance of choosing the right emotion from 20% to 25%.

Response: Thank you for this observation. We had intended to have 4 response options for both Mandarin tone identification (4 tones) and emotion recognition (4 emotions), but we agree we should have included the neutral emotion in the emotion recognition task. We have acknowledged this limitation in the manuscript.

i. The discussion on the differences in the context and isolation conditions was somewhat confusing. The authors seemed to acknowledge that there were prosodic cues to emotion outside the target word in a sentence in some places (e.g. lines 829-830). But in the conclusion section, the authors attributed the better performance in the context condition to listeners’ knowledge of tonal co-articulation (lines 843-844). This is true for the perception of tones but not true for the perception of emotion.

Response: We thank the reviewer for this observation. As acknowledged earlier, context was indeed not explained sufficiently in the original version. We have revised the manuscript to clarify the interpretation of context.

2. Has the statistical analysis been performed appropriately and rigorously?

a. The data from Experiment 2 were categorical by nature: 4 categories for identification of lexical tones; 4 categories for identification of emotions. These data should thus be analysed using methods that can deal with categorical data. Repeated measures ANOVA, which was used in the paper, is not the appropriate analysis. The authors can consider using the mixed-model multinomial regression analysis in R.

Response: We thank the reviewer for pointing this out. As instructed, we have redone the analysis by using a mixed-model multinomial regression analysis in R. We have rewritten the results and discussion based on the updated analysis.

b. The effect of emotion on the shape of the f0 contours of the lexical tones was discussed based on visual observations (lines 380-390). A more rigorous and more appropriate approach would be to use Functional Data Analysis to mathematically quantify the shape of each contour in terms of functional principal components and then use the fPCs as dependent variables.

Gubian, M., Torreira, F., & Boves, L. (2015). Using Functional Data Analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics, 49, 16–40. https://doi.org/10.1016/j.wocn.2014.10.001

Chen, A., & Boves, L. (2018). What’s in a word: Sounding sarcastic in British English. Journal of the International Phonetic Association, 48(1). https://doi.org/10.1017/S0025100318000038

Response: We thank the reviewer for pointing this out and offering the references. We agree that Functional Data Analysis would be the most rigorous approach for a quantitative analysis of dynamic F0 contours. However, the focus of our acoustic analysis was the four measures (mean F0, F0 range, mean amplitude, and duration), not the F0 contours. Our purpose in including the figure and qualitative description was to show that the four tones were produced as intended. We certainly agree that analyzing the dynamic F0 contours would be quite informative. However, we feel it is best to leave that analysis to a separate study because including it would make our paper even longer than it already is.

Finally, as the reviewer pointed out earlier, the acoustic realization of lexical tones in the expression of emotions is typically restricted to gradient changes in terms of mean pitch, pitch range, or duration. We agree with the reviewer that these acoustic measures would offer more insights for the purpose of this study. We have revised this section to acknowledge that our observations are qualitative in nature. We have also included the reference offered by the reviewer.
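As a rough illustration of the reviewer’s suggestion: once F0 contours are time-normalized to a common set of sample points, functional PCA reduces to ordinary PCA on the contour matrix, and the per-contour component scores can then serve as dependent variables. The sketch below uses synthetic contours with hypothetical dimensions (60 contours, 30 time points); it is a minimal sketch of the discretized-fPCA idea, not the analysis pipeline of Gubian et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 time-normalized F0 contours sampled at 30 points,
# built from a rising basis shape and a convexity basis shape plus noise.
t = np.linspace(0.0, 1.0, 30)
true_scores = rng.normal(size=(60, 2))
contours = (true_scores[:, :1] * t                # rise/fall shape
            + true_scores[:, 1:] * (t * (1 - t))  # convexity shape
            + 0.05 * rng.normal(size=(60, 30)))   # measurement noise

# Discretized fPCA = PCA of the centered contour matrix, here via SVD.
centered = contours - contours.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
fpc_curves = Vt[:2]                # first two functional principal components
fpc_scores = centered @ Vt[:2].T   # per-contour scores, usable as DVs

explained = (s**2 / np.sum(s**2))[:2]  # variance explained by fPC1, fPC2
```

Because the synthetic contours live in a two-dimensional signal subspace, the first two components capture nearly all of the variance; with real contours, the number of retained fPCs would be chosen from the explained-variance profile.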

c. Crucial details are missing on the f0 contours presented in Figure 1: How were the f0 contours of the lexical tones in Figure 1 determined? Were they the mean f0 contours of all speakers? If so, how were they extracted? Were they time-normalised?

Response: We apologize for missing these details. We have added a description of how the F0 contours were determined. Each F0 contour was indeed generated by averaging over all speakers of the same sex. The contours were also time-normalized as the reviewer pointed out.

d. Statistics on all the pairwise comparisons should be provided for each analysis.

Response: We have made sure that all pairwise comparisons are provided where applicable.

3. Have the authors made all data underlying the findings in their manuscript fully available?

The authors stated that ‘All relevant data are within the manuscript and its Supporting Information files.’ I did not find out how to access Supporting Information in PLOS One. It seems that I could only download the manuscript. I ticked the ‘Yes’ box for this question, but I’d like to acknowledge that I do not have the information to judge the availability of the data in Supporting Information that has not been presented in the manuscript.

Response: We apologize for missing the data in the original submission. We have supplied a minimal data set as instructed.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

I fear that the clarity of the manuscript has been affected by the presentation and exposition, even though there are very few language errors. The major issue I have with the writing is that the authors regularly made claims without substantiating them. A few examples:

a. The authors sketched four possible scenarios for how emotion may affect the production and perception of tones on pp. 4-5 (lines 70-86). These scenarios appear poorly informed given what has already been known.

Response: We agree that the four scenarios were not grounded in the literature and did not contribute to motivating the research questions. Reviewer 1 also made a similar comment. We have removed this paragraph and subsequent discussions that were based on this taxonomy. Instead, we have followed the reviewer’s suggestion to use Liang and Chen (2019) as the foundation of this study.

b. Lines 98-99: ‘Despite the universal appearance of emotion tones, their acoustic correlates vary across studies.’ I do not understand claims such as this one that seem to contain contradictory information. In this particular case, the review of the literature in the preceding lines clearly pointed to cross-linguistic differences in how f0, duration, intensity are varied to express the same emotion.

Response: We apologize for the confusion. We have rephrased this statement to resolve the contradiction.

c. Lines 167-169: These lines were supposed to summarise the main findings reviewed in the same paragraph. However, the preceding review did not talk about discrimination of consonants and recognition of spoken words. The findings reviewed were mostly concerned with recognizing word meaning.

Response: By “discrimination of consonants” we were referring to Mullenix et al. (2002), who manipulated the final consonant of the names (e.g., Todd-Tom). By “recognition of spoken words”, we were referring to Mullenix et al. (2002), Kitayama and Ishii (2002), Ishii et al. (2003), Nygaard and Lunders (2002), Nygaard and Queen (2008): they all used some form of a word recognition task. This summary is no longer in the revised manuscript due to extensive rewriting.

d. Lines 175-184: These lines were supposed to give more details on Singh et al.’s work after the summarising sentence on what their work was about in lines 173-175. However, what was presented in lines 175-184 was hard to follow and did not really match the summarising sentence.

Response: We apologize for the lack of clarity. We have rewritten the description to clarify what was done in this study.

e. Lines 195-197 on the limitation of Liang and Chen’s study: “Since no acoustic analysis was reported, it is not clear to what extent these findings could be explained by the actual acoustic difference among the emotional tones.” I am not sure I can follow the logic here. Liang and Chen had the stimuli rated by a native speaker of Mandarin in terms of emotion before presenting them to their participants. The emotions were correctly recognised in all cases. This would allow the authors to conclude that emotions were accurately produced in the stimuli and differences found in the perception of tones in different emotions could be attributed to the differences in prosodic realisation of emotions.

Response: We thank the reviewer for this observation. We agree that ratings by native speakers constitute valid evidence that emotions are produced as intended. We were simply pointing out that no acoustic analysis was reported in the study. We have removed this statement to avoid the impression that the stimuli were not valid.

f. Lines 231-232: “Although emotional tones appear to be perceived in a similar way irrespective of specific languages, it is not clear whether the use of lexical tones in tonal languages affects the perception of emotional tones.” Immediately following these lines, the authors reviewed Wang et al.’s study on exactly this topic. Their findings seem quite straightforward to me. What is still not clear and should therefore be investigated in the current study then?

Response: We apologize for the confusion. We have revised this statement to resolve the contradiction.

g. Lines 271-273: I have the same concern as with lines 231-232.

Response: We apologize for the confusion. We have removed this sentence from the revision.

h. Lines 256-259: “Evidence from L1 and L2 studies of segmental and tonal perception further suggests that tonal language experience affects how lexical tones and segmental structure are processed when they are produced with emotional tones.” Such studies were not reviewed in the preceding sections.

Response: This statement was a summary of Singh, Lee, and Goh (2016), and Liang and Chen (2019). Both studies showed that tonal language experience affected perception of speech produced with emotions. This summary is no longer in the revised manuscript because of extensive rewriting.

i. Lines 673-676: The description of the results (Tables 8, 9) was not accurate.

Response: We have revised these statements to make sure the description is accurate.

j. Lines 746-747: The description of the result was not accurate.

Response: We have revised these statements to make sure the description is accurate.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Vera Kempe

6 Jun 2022

PONE-D-21-23777R1: Emotional tones of voice affect the acoustics and perception of Mandarin tones. PLOS ONE

Dear Dr. Chu,

Thank you again for submitting your revised manuscript to PLOS ONE. Let me start by reiterating my apologies for the delay, which was due to some difficulties with recruiting reviewers for the revised version. As you will see, both reviewers acknowledge the improvements that you have carried out in response to the first round of reviews and have taken the previous reviews into account in appraising your current revision. It is also clear that there is considerable overlap in the reviewers' remarks, especially pertaining to better justification of the four-emotion model and potential repercussions of this choice, and to improvements in the statistical analyses. In addition, the reviewers have made a number of further insightful suggestions that I urge you to consider. Even though one reviewer categorises the required work as 'Minor Revision', I feel that there are still some more substantial improvements required, which is why I decided to return it as a 'Major Revision'. I hope that the reviewers' comments will aid you in a subsequent revision of this submission.

Please submit your revised manuscript by Jul 21 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: (No Response)

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Partly

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: No

Reviewer #4: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: No

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: In line with the previous two reviewers’ general comments, I agree that the current work is relevant and can be a valuable addition to our current knowledge base.

While the authors have either acknowledged as limitations or addressed the majority of the points raised, I feel that there are still a number of issues that have not been satisfactorily resolved.

In the course of this review, I have noted a number of additional issues beyond what was raised in the previous review. Consequently, I feel that the current manuscript still lacks a strong rationale and justification for some of the decisions made regarding the study design, and this has implications for the interpretability of the results.

1. Is the manuscript technically sound, and do the data support the conclusions?

A) Reviewers 1 and 2 both made similar remarks about the disconnect between the broader motivations of the study under review and how it is positioned within the context of similar research. Reviewer 2 recommended that the authors use the work of Liang and Chen (2019) as an anchor point for the current manuscript.

The authors have done well to incorporate the reviewers’ comments, and the Liang and Chen (2019) study is given prominence in the introduction.

However, given the importance of the study as the cornerstone of the current work, this manuscript will benefit from a revision to provide greater clarity and details of the Liang and Chen study. In its current form, I feel that a reader will need to first review Liang and Chen’s thesis in order to follow this manuscript.

B) Reviewer 1 asked why the authors adopted a four basic emotion model (angry, happy, fear and sad) instead of a six emotion model (which on top of the four aforementioned emotion types, also include surprise and disgust).

The authors replied that the decision was made on the basis that most studies on Mandarin have adopted the four emotion model, hence using the four emotion model would make comparisons with previous studies more straightforward.

I am not sure if I quite follow the logic of this reasoning. As the four emotion model is a subset of the six emotion model, straightforward comparisons can still be made if the current study adopted the six emotion model.

Moreover, using the six emotion model has a number of other benefits. One, it would allow comparisons against a much larger literature base (studies using either four or six emotion models) and potentially against the wider literature of other tonal and non-tonal languages.

The second benefit pertains to Reviewer 2’s comment about the inclusion of Neutral as a possible response option for Experiment 2 to reduce chance level accuracy (from 25% to 20%). A six emotion model would achieve a similar effect by having two additional emotion types as response options.

A third advantage, which I feel is more critical, relates to the aim of study. Fundamental Frequency (F0) is considered to be one of the most salient carriers of emotion information as it is argued that F0 may be an acoustic correlate of arousal (Scherer, 2003). For example, emotions with high arousal such as anger may be expressed with a higher mean F0 and conversely, emotions with low arousal such as sadness may be expressed with a lower mean F0. In this regard, of the four emotions examined in this study, only one (Sad) is considered to be of low arousal and the study may therefore benefit from the inclusion of a larger range of emotion types of varying arousal levels to allow a more nuanced study of the interaction between emotions and lexical tone production.

C) Reviewer 2 raised a question regarding the randomisation procedure that was undertaken within each experimental block in Experiment 2. Similarly, I have a concern regarding randomisation but of the experimental blocks. From my reading, I understand that the order of block presentation was fixed for all participants such that they always viewed stimuli in the isolation condition before the context condition.

I am curious about the authors’ reasoning for this. I feel that this may be a weakness, as confounds such as training or exposure may have an unwanted impact on the observed results (e.g., participants have familiarised themselves with the stimuli by the time they get to the context condition, hence resulting in higher identification accuracy). This warrants greater scrutiny, especially since the authors are making claims based on the comparison of the isolation to the carrier condition.

D) Reviewer 2 asked why Neutral stimuli were not included in the emotion perception task as the exclusion of this option had raised chance level accuracy rates from 20% to 25%.

The authors responded on page 33, third sentence from the top of the page (pardon the clunky referencing but there doesn’t seem to be line numbering on the revised version), that ‘Neutral was regrettably not included’. I think it would be more meaningful to justify why Neutral was not included, rather than an admission that it was not there.

Nevertheless, in my opinion, the exclusion of Neutral has the benefit of making this a forced choice task that prevented the potential abuse of Neutral as a bin for stimuli that participants were uncertain about.

However, my concern is about why Neutral as a response option was excluded when Neutral stimuli were presented to the participants in the emotion identification task. From my reading, it appears that Neutral stimuli were presented in every experimental block (each experimental block consists of 120 stimuli: 3 syllables × 4 tones × 5 emotions × 2 repetitions – bottom of pg. 31). Does this also mean that the participants would be giving an incorrect response to the Neutral stimuli by default simply because the Neutral option was not available?

E) Both reviewers had raised issues regarding the unclear rationale behind the isolation vs. context comparisons, and the lack of clarity in the discussion of the findings.

Unfortunately, despite the revision, I am unsure if I follow the logic and significance of the comparison between the isolation and context conditions. There were no predictions or references given that would contradict the rather intuitive hypothesis, and critical discussions regarding these findings still seem a little thin.

This section could potentially benefit from another rewrite to clearly state the significance of the isolation vs. context comparison, the underlying or competing mechanisms at work, and how this comparison adds to our current knowledge base. In particular, I find the second paragraph of page 29 hard to follow.

F) Another point that is unclear to me is the motivation behind the authors’ aim to examine sex differences. From my reading, there is only one justification given, which is to extend Liang and Chen’s (2019) study where all stimuli were produced by a single female speaker (this information was not given in the current manuscript).

Despite the focus on sex differences, it is unclear whether the analyses here were meant to be exploratory or confirmatory. The manuscript does not provide adequate references or predictions on precisely how sex may have an impact on lexical and emotion tone production and why it is important that we study this.

I note that this issue was not raised by the previous reviewers. My reason for bringing this up is that the manuscript will be easier to read if clearer objectives and deeper insights into the authors’ thought process are given. Moreover, this has implications regarding the appropriateness of the statistical tests undertaken by the authors.

From my understanding, at the broadest level, this study aims to address how emotion (not emotion and sex) affects the production and perception of Mandarin lexical tones. While it is interesting to tease apart and comment on sex differences, I feel that the logical next step, which is missing, is how these effects may be controlled for in order to draw inferences regarding the generalisability of the findings and its robustness against sex and other potential confounds arising from idiosyncratic individual differences.

In this regard, I highly recommend the use of mixed effects models where speaker, sex and perhaps even syllable type may be entered as random intercepts and slopes. This is aligned with Reviewer 2’s suggestion that mixed effects models be applied on data from Experiment 2. I am of the opinion that this should also be applied to Experiment 1 and with more rigour than what is currently done (more on this in the next section).

2. Has the statistical analysis been performed appropriately and rigorously?

A. Reviewer 2 recommended the use of Functional Data Analysis for the analysis of F0 contours.

The authors agreed that the use of Functional Data Analysis was appropriate but unwarranted as the current analysis of contours was intended to be a manipulation check of sorts to show that the four tones were produced as intended.

I agree that the use of Functional Data Analysis may not be necessary if the analysis was intended to be a simple visual sanity check. However, in its current form, explorations of the contours appear to go beyond the level of a simple sanity check, as contrasts between sex and emotions were made. It is therefore inaccurate for the authors to claim that the qualitative observations ‘were meant to corroborate’ that the tones were ‘produced as intended’ (second last sentence of first paragraph on page 19), as it is not clear what the ‘intended’ shape of the contour is for the different sexes and emotion types.

My recommendation is for the authors to acknowledge that this is an exploratory examination to visualise and better understand the stimuli produced in the current study, rather than framing this as a comparison against a ‘standard’.

B. Reviewer 2 suggested that mixed-model multinomial regression analysis in R may be a more appropriate analysis than the repeated measures ANOVA performed by the authors.

The authors have now applied mixed model logistic regression in their analysis of the data from Experiment 2.

More critically, I am concerned about how the variables were entered into the model in the revised manuscript. I am not sure if I understand why participants were entered as a random effect instead of variables such as sex, talker or syllable. It is also unclear why the model was needlessly overcomplicated by the inclusion of the interaction effect between emotion and context when this is not the aim of the study (and it was also subsequently ignored by the authors – pg. 39). It would also be instructive for the authors to cite the R package that was used.

The significant interaction effects were further illustrated by the authors through the use of confusion matrices. While the findings presented are sound, I feel that an edit would provide better clarity. For example, the authors wrote ‘The most common error for Mandarin tones produced with ANGRY was Tone 4’. My suggestion would be to rephrase this to something along the lines of ‘There appears to be a response bias where Mandarin lexical tones, Tones 1 and 3 in particular, were most commonly misidentified as Tone 4 when presented in an Angry tone of voice’.
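The kind of confusion matrix discussed here tabulates intended categories against perceived categories, so row-wise patterns reveal exactly the response bias the reviewer describes. A minimal sketch with hypothetical responses (the data below are invented for illustration and are not the study’s results):

```python
import numpy as np

TONES = ["Tone 1", "Tone 2", "Tone 3", "Tone 4"]

def confusion_matrix(intended, perceived, labels=TONES):
    """Rows = intended category, columns = perceived category, cells = counts."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for i, p in zip(intended, perceived):
        matrix[index[i], index[p]] += 1
    return matrix

# Hypothetical responses showing a bias toward Tone 4:
intended  = ["Tone 1", "Tone 1", "Tone 3", "Tone 3", "Tone 4", "Tone 2"]
perceived = ["Tone 4", "Tone 1", "Tone 4", "Tone 4", "Tone 4", "Tone 2"]
cm = confusion_matrix(intended, perceived)
accuracy = np.trace(cm) / cm.sum()  # diagonal = correct identifications
```

Reading along the Tone 1 and Tone 3 rows, the off-diagonal mass in the Tone 4 column is what the suggested rephrasing (“most commonly misidentified as Tone 4”) describes.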

C. I would also like the authors to review how the confidence interval error bars of Figures 3 and 6 were generated. Some of the CI bars extend into the negative range, which, although possible as an artifact of calculation, cannot occur in the actual measure (e.g., F0 range). This is likely indicative of a rather unusual or problematic distribution in the data, which calls into question whether sufficient data preparation and cleaning (outliers?) has been conducted prior to analysis.
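The reviewer’s point can be made concrete with a small numeric example: a symmetric normal-approximation interval (mean ± 1.96 SE) around a right-skewed, non-negative measure such as F0 range can dip below zero when an extreme value inflates the standard deviation. The values below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

# Hypothetical F0-range values (semitones): non-negative, right-skewed.
values = [0.5, 0.8, 1.0, 1.2, 1.5, 9.0]
n = len(values)
mean = sum(values) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample SD
lower = mean - 1.96 * sd / math.sqrt(n)  # normal-approximation 95% CI bound

# The single extreme value (9.0) inflates the SD enough that `lower` is
# negative, even though an F0 range can never be negative. A bootstrap
# percentile interval or a log-scale interval would respect the bound.
```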

D. When conducting regression analysis, it is good practice to include within the body of the results section the statistics of the analysis such as the Beta estimates, standard errors, and Z-ratio. These are currently only available as downloadable supplementary information.

3. Have the authors made all data underlying the findings in their manuscript fully available?

A. Following the provided link, I am able to view only a simplified output of the mixed effects model. No other data was found.

4. Is the manuscript presented in intelligible fashion and written in standard English?

A. In general, both reviewers commented to the effect that edits were required to improve the quality of the manuscript. Reviewer 2 further noted that the authors frequently made unsubstantiated claims.

The authors have made extensive rewrites with assistance from a native speaker of English who provided feedback on word choice, grammar and writing style.

It is clear that the quality of the writing has improved in the revised manuscript on account of the substantial effort put in by the authors and their openness to suggestions. However, it can definitely benefit from another round of polishing and editing, as there are areas within the manuscript (e.g., the section on confusion matrices) that, although grammatically sound, lack clarity.

Reviewer #4: Review of "Emotional tones of voice affect the acoustics and perception of Mandarin tones"

Martijn Goudbeek, Tilburg University

This manuscript addresses an interesting question, namely how linguistic and paralinguistic cues mutually influence each other in communication. Specifically, it does this by investigating how the expression of emotion in Mandarin affects the production and perception of tones 1 to 4 and, conversely, how the production of tones 1 to 4 affects the production and perception of emotion. A tonal language such as Mandarin provides an excellent testbed for the linguistics/paralinguistics interface, because emotional expression often modulates pitch, and pitch is one of the defining features (if not the defining feature) of Mandarin tones.

Since I did not review the original manuscript, I tried to pay close attention to the original comments and the replies from the authors. However, even though I did take the original reviews as a basis, there are some new remarks and recommendations in my review.

Theoretical points

------------------

One of the points of discussion in the original reviews was how to defend the use of four emotion categories. The authors have explained that these four are the four basic emotions in Ortony and Turner's 1990 paper. This is somewhat unfortunate, since this paper is a highly cited /critique/ of the whole idea of basic emotion theory (a critique deemed relevant enough for Paul Ekman to directly engage the paper in his 1992 paper). That said, and although I would appreciate a better motivation for the -small number of- emotions included in the corpus, I do not particularly mind that there are only four. The more fundamental problem is that the low number of emotions (and the ones chosen) are never discussed in terms of the consequences for recognition in the vocal domain. For example, the fact that there is only one positive emotion (happiness) and only one low-arousal emotion (sadness) strongly influences the decision problem that participants face. Likewise, it also strongly influences the results of the analysis, since (for example) the acoustic profile of sadness is so different from the others in intensity that significant results are bound to emerge. Contrast this with a situation where other low-arousal emotions such as disgust or tenderness are in the dataset, and you get very different results for, say, F0 or intensity. For a balanced and transparent discussion of the meaning of the results, factors like these need to be taken into account, both in introducing the study and in the discussion section.

Along these lines I think the exclusive focus on basic emotions is needlessly limiting. Especially considering the fact that results and conclusions are often cast in terms of positive versus negative emotions, I think the dimensional approach to emotions (e.g., Russell, 1980; Russell and Feldman Barrett, 1999 and further) should be given proper attention (for example, when describing the emotions in the study, but also when interpreting the results, which is, as mentioned, often already done in terms of valence - which is notably not a property of basic emotion theory, which considers all emotions orthogonal categories). A reference to Laukka 2005 (in the rebuttal letter) is somewhat misleading, since although it provides evidence for categorical (which is not the same) perception of emotions, a paper from the same author in the same year using the same dataset uses the dimensional approach (Laukka, Juslin, & Bresin, 2005).

Methodological / Statistical points

---------------------------------

In addition to these theoretical considerations, there are some methodological issues that need to be addressed.

For the analysis of variance in Experiment 1, the (statistical) design is somewhat unclear. The design is introduced with "For each of the four acoustic measures, a three-way mixed-design analysis of variance (ANOVA) was conducted [...] with emotion [...] and Mandarin tones [...] as within-subject factors and talker sex [...] as a between-subject factor." What is not clear is why tone and emotion are within-subject factors, while other stimulus characteristics that were deemed relevant in the construction of the corpus (syllable, repetition) were not. Statistically, this variance is unexplained variance, but in a more sophisticated analysis, these could be random effects over which the study could generalize (see Judd, Westfall, and Kenny, 2012, but also Winter and Grice, 2021). In any case, the precise number of items in the analysis should be clarified, if only because otherwise the degrees of freedom in the analysis become difficult to interpret.

As a follow-up to the same analysis (the mixed ANOVA of Experiment 1), the authors use LSD as a post hoc test. This test does not correct in any way for multiple comparisons, thus increasing the probability of a Type I error. While there might be arguments against correcting every multiple comparison with a Bonferroni test, the choice not to correct at all (certainly with such a large number of comparisons) is difficult to justify. This is particularly relevant in light of Figure 1, where almost all CIs overlap substantially (indicating the lack of a significant difference), which is in stark contrast to the findings of the post hoc analysis.
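The scale of the problem is easy to quantify: with m independent comparisons each tested at a per-test alpha, the family-wise Type I error rate is 1 - (1 - alpha)^m. A minimal sketch (the number of comparisons is merely illustrative, not the actual count in the manuscript):

```python
def fwer(alpha: float, m: int) -> float:
    """Family-wise Type I error rate for m independent tests at per-test alpha."""
    return 1 - (1 - alpha) ** m

m = 10  # e.g., all pairwise contrasts among five conditions
uncorrected = fwer(0.05, m)      # LSD-style testing, no correction
bonferroni = fwer(0.05 / m, m)   # Bonferroni-adjusted per-test alpha

print(round(uncorrected, 3))  # 0.401
print(round(bonferroni, 3))   # 0.049
```

With just ten uncorrected comparisons, the chance of at least one spurious significant result already exceeds 40%, whereas the Bonferroni adjustment keeps it near the nominal 5%.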

For the second analysis, multinomial mixed effects models are indeed the correct way to analyze this kind of data. However, mixed models also enable the inclusion of items (in addition to participants) as a random effect (again, see Judd, Westfall, and Kenny, 2012, but many others). This appears to not have been done, severely limiting the generalizability of the findings. For the revision, items should be entered as a random effect in the analysis.

It should be made clearer what participants did in the two experiments. If the same group of participants was used throughout, does that mean that some judgments were made more than once, or were some judgments reused? E.g., both Experiment 1 and Experiment 2 contain a rating task. Were the selected data in Experiment 1 rated again or not?

Minor issues / typos

P17 (Abstract): Lexical tones and emotions are conveyed by a similar set of acoustic parameters; -> partly similar set (because both tones and emotions are also conveyed by parameters beyond F0)

P18 conveys not only -> not only conveys

P18 Emotional tone is defined as the vocal expression of emotion, which conveys a speaker’s affective states; -> this is a very strong statement given the discussion in the field about the status of emotion as a category and what it exactly is that vocal expression expresses. So, either more than one reference is needed, or a more nuanced statement (preferably both)

P18 When discussing Scherer's (2003) review, the high correlation between F0 and, particularly, amplitude needs to be mentioned. This is important, because using both F0 and amplitude simultaneously as variables needs to take this interdependence into account.

P19 A lexical tone language -> a tonal language (?)

P19 When compared to a neutral tone of voice ... a longer duration [3-12, 24-31]: this may be a succinct summary of the literature, but in order for it to be useful, especially given the many mutually exclusive effects (e.g., a sad voice has a narrower or wider or similar F0 range), some summary conclusion or integration is necessary (over and above "there is some consistency, but also not"). In addition: all these are comparisons of the emotion to neutral, which severely limits the findings, right? That should be acknowledged.

P19 Similarly, when the authors say that "The variability, however, is consistent with the idea that emotion is sociocultural in nature", they are not wrong, but this statement does not connect that well with using basic emotions as a starting point. It would -to some extent- be in line with the dialect theory of emotion (e.g., Elfenbein and Ambady, 2002), which is based on basic emotion theory. So, some more clarity about the theoretical background of the authors and this paper is needed. What is the conceptual background here?

P21 In contrast, duration varied significantly among the emotions for Mandarin but not for Italian. -> Not a big issue, but this is most likely also partly due to the fact that Italian is a syllable-timed language, where there is much less room for variation in syllable duration.

P22 -> In which language was the study by Mullennix et al.?

P23 -> I have a hard time seeing why the study of Nygaard and Queen is important here: the effects seemed to be semantic, but how does this connect to the tone/emotion tradeoff that is expected by the authors?

P23 / P24 -> the use of the word "compromised" is somewhat ambiguous; explain in what way emotion compromised perception. Similarly for the word "asymmetrical": asymmetrical in what way?

P24 Dutch speaking learners -> of Mandarin

P24/25 -> The finding about intermediate learners being better seems not so relevant, unless there is a link with emotion, no?

P25 "Depending on specific Mandarin tones": this is crucial for this paper (I think), and the authors could explain more about why they think different tones are affected differently by (different?) emotions (and how this plays out). If this is not possible given the available information, that should be explained, too.

P25 "The way emotions are perceived appears to be language-universal": this contradicts earlier statements about the sociocultural nature of emotions.

P27 -> the juxtaposition of acoustics and perception is a bit odd, I'd use production and perception

P28 -> extracted from the carrier phrase -> in isolation

P29 "Considered the four most common basic emotions" -> perhaps rephrase in light of the comments made above

P29 "Based on the literature review" -> It is important to realize / flag that most (if not all) of the review concerns empirical "just so" findings, without much theoretical underpinning as to why the expected effects are predicted. There is not necessarily something wrong with that, but it is important to consider.

P29/30 -"We also expect emotions to be modulated by specific Mandarin tones and talker sex" -> How? And if it is impossible to say how, explain why that is.

P30 Were participants compensated in any way for their participation?

P32 Were the recordings managed by a (stage) director (see, for example, Banse and Scherer 1996) or were the actors working alone?

P32 Acoustic analysis: indicate how many recordings there were (960, I think)

P32 State the aim of the rating procedure: why was it done?

P32 Explain why NEUTRAL was not rated

P33 name the four acoustic measures analyzed

P34 the Functional Data Analysis -> Functional Data Analysis (drop the determiner)

P35 error bars indicate 95% interval -> the 95% interval

P38 and talker -> there appears to be an extra space (twice)

P45 The different results for stimuli with and without a carrier phrase are reminiscent of work done on (speaker) normalization. See work by Holger Mitterer and Mathias Sjerps, for example, as well as early work by Donald Broadbent (the filter theory).

P48 The NEUTRAL emotion -> NEUTRAL category

P49 in the carrier phrase -> when preceded by a carrier phrase

P49 As indicated, (random) item effects should be incorporated in the analysis

P50 "we focus on two interactions" -> but the second one (tone-context) is not really statistically analyzed or introduced (while the other one is).

References

Elfenbein, H. A., & Ambady, N. (2002). On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin, 128(2), 203.

Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54.

Laukka, P., Juslin, P., & Bresin, R. (2005). A dimensional approach to vocal expression of emotion. Cognition & Emotion, 19(5), 633-653.

Winter, B., & Grice, M. (2021). Independence and generalizability in linguistics. Linguistics, 59(5), 1251-1277.

**********


Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: Chong Chee Seng

Reviewer #4: Yes: Martijn Goudbeek

**********




Author response to Decision Letter 1


5 Dec 2022

Reviewer #3: In line with the previous two reviewers’ general comments, I agree that the current work is relevant and can be a valuable addition to our current knowledge base.

A: We thank Dr. Chong for recognizing the contribution of this study.

1. Is the manuscript technically sound, and do the data support the conclusions?

A) Reviewers 1 and 2 both made similar remarks about the disconnect between the broader motivations of the study under review and how it is positioned within the context of similar research. Reviewer 2 recommended that the authors use the work of Liang and Chen (2019) as an anchor point for the current manuscript.

The authors have done well to incorporate the reviewers’ comments, and the Liang and Chen (2019) study is given prominence in the introduction.

However, given the importance of the study as the cornerstone of the current work, this manuscript will benefit from a revision to provide greater clarity and details of the Liang and Chen study. In its current form, I feel that a reader will need to first review Liang and Chen’s thesis in order to follow this manuscript.

A: As instructed, we have provided more details of Liang and Chen’s study in the introduction (lines 201-216).

B) Reviewer 1 asked why the authors adopted a four basic emotion model (angry, happy, fear and sad) instead of a six emotion model (which, on top of the four aforementioned emotion types, also includes surprise and disgust).

The authors replied that the decision was made on the basis that most studies on Mandarin have adopted the four emotion model, hence using the four emotion model would make comparisons with previous studies more straightforward.

I am not sure if I quite follow the logic of this reasoning. As the four emotion model is a subset of the six emotion model, straightforward comparisons can still be made if the current study adopted the six emotion model.

Moreover, using the six emotion model has a number of other benefits. One, it would allow comparisons against a much larger literature base (studies using either four or six emotion models) and potentially against the wider literature of other tonal and non-tonal languages.

The second benefit pertains to Reviewer 2’s comment about the inclusion of Neutral as a possible response option for Experiment 2 to reduce chance level accuracy (from 25% to 20%). A six emotion model would achieve a similar effect by having two additional emotion types as response options.

A third advantage, which I feel is more critical, relates to the aim of study. Fundamental Frequency (F0) is considered to be one of the most salient carriers of emotion information as it is argued that F0 may be an acoustic correlate of arousal (Scherer, 2003). For example, emotions with high arousal such as anger may be expressed with a higher mean F0 and conversely, emotions with low arousal such as sadness may be expressed with a lower mean F0. In this regard, of the four emotions examined in this study, only one (Sad) is considered to be of low arousal and the study may therefore benefit from the inclusion of a larger range of emotion types of varying arousal levels to allow a more nuanced study of the interaction between emotions and lexical tone production.

A: We thank Dr. Chong for pointing out the benefits of adopting the six-emotion model. Since the study has been implemented with four-emotion speech materials, we believe the best we can do at this point, without re-running the entire study, is to offer a broader theoretical foundation and acknowledge the limitation of the four-emotion model. To these ends, we have added a description of theories of emotion in the introduction (lines 70-93). We have also added a description in the conclusion to acknowledge the limitation of the four-emotion model (lines 783-791).

C) Reviewer 2 raised a question regarding the randomisation procedure that was undertaken within each experimental block in Experiment 2. Similarly, I have a concern regarding randomisation but of the experimental blocks. From my reading, I understand that the order of block presentation was fixed for all participants such that they always viewed stimuli in the isolation condition before the context condition.

I am curious about the authors’ reasoning for this. I feel that this may be a weakness, as confounds such as training or exposure may have an unwanted impact on the observed results (e.g., participants have familiarised themselves with the stimuli by the time they get to the context condition, hence resulting in higher identification accuracy). This deserves greater scrutiny, especially since the authors are making claims based on the comparison of the isolation condition to the carrier condition.

A: Dr. Chong is correct that the isolation condition was always presented before the context condition. We agree that familiarization from the isolation blocks could have boosted accuracy in the context blocks. In hindsight, we could have randomized the order of the blocks to mitigate this concern. We have acknowledged this limitation in the conclusion (lines 803-805). We have also added a paragraph in the introduction regarding the benefit of context (lines 266-292).

D) Reviewer 2 asked why Neutral stimuli were not included in the emotion perception task as the exclusion of this option had raised chance level accuracy rates from 20% to 25%.

The authors responded on page 33, third sentence from the top of the page (pardon the clunky referencing but there doesn’t seem to be line numbering on the revised version), that ‘Neutral was regrettably not included’. I think it would be more meaningful to justify why Neutral was not included, rather than an admission that it was not there.

Nevertheless, in my opinion, the exclusion of Neutral has the benefit of making this a forced choice task that prevented the potential abuse of Neutral as a bin for stimuli that participants were uncertain about.

However, my concern is about why Neutral as a response option was excluded when Neutral stimuli were presented to the participants in the emotion identification task. From my reading, it appears that Neutral stimuli were presented in every experimental block (each experimental block consists of 120 stimuli, 3 syllables, 4 tones, 5 emotions, and 2 repetitions – bottom of pg. 31). Does this also mean that the participants would be giving an incorrect response to the Neutral stimuli by default simply because the Neutral option was not available?

A: We apologize for not including line numbers in the previous revision. We have added line numbers to this revision.

We thank Dr. Chong for pointing out the benefit of excluding the NEUTRAL response option. In fact, NEUTRAL stimuli were also excluded from the emotion recognition task—we apologize for not making this explicit. That is, in the emotion recognition task, the NEUTRAL stimuli were excluded, resulting in 192 stimuli (3 syllables, 4 tones, 4 emotions, 2 repetitions, and 2 talkers) per block (one isolation block and one context block). In the tone identification task, all five emotions were included, resulting in 240 stimuli (3 syllables, 4 tones, 5 emotions, 2 repetitions, and 2 talkers) per block. We have revised the description to clarify the design (lines 569-574).
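For transparency, the block sizes follow directly from crossing the design factors; a quick check of the arithmetic (illustration only, not analysis code):

```python
# Design factors described above.
syllables, tones, repetitions, talkers = 3, 4, 2, 2

# Tone identification task: all five emotions, NEUTRAL included.
tone_task_stimuli = syllables * tones * 5 * repetitions * talkers

# Emotion recognition task: NEUTRAL stimuli (and response option) excluded.
emotion_task_stimuli = syllables * tones * 4 * repetitions * talkers

print(tone_task_stimuli, emotion_task_stimuli)  # 240 192
```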

E) Both reviewers had raised issues regarding the unclear rationale behind the isolation vs. context comparisons, and the lack of clarity in the discussion of the findings.

Unfortunately, despite the revision, I am unsure if I follow the logic and significance of the comparison between the isolation and context conditions. There were no predictions or references given that would contradict the rather intuitive hypothesis, and critical discussions regarding these findings still seem a little thin.

This section could potentially benefit from another rewrite to clearly state the significance of the isolation vs. context comparison, the underlying or competing mechanisms at work, and how this comparison adds to our current knowledge base. In particular, I find the second paragraph of page 29 hard to follow.

A: We agree that the benefit of context is rather intuitive, therefore we did not elaborate on the rationale behind the comparison in the introduction. We did present predictions regarding the comparison in Experiment 2 immediately before the method (lines 520-546). To highlight the rationale for this comparison, we have added a section entitled “Benefit of context” to the introduction. We have also moved the predictions currently in Experiment 2 to this new section.

F) Another point that is unclear to me is the motivation behind the authors’ aim to examine sex differences. From my reading, there is only one justification given, which is to extend Liang and Chen’s (2019) study where all stimuli were produced by a single female speaker (this information was not given in the current manuscript).

Despite the focus on sex differences, it is unclear whether the analyses here were meant to be exploratory or confirmatory. The manuscript does not provide adequate references or predictions on precisely how sex may have an impact on lexical and emotion tone production and why it is important that we study this.

I note that this issue was not raised by the previous reviewers. My reason for bringing this up is that the manuscript will be easier to read if clearer objectives and deeper insights into the authors’ thought process are given. Moreover, this has implications regarding the appropriateness of the statistical tests undertaken by the authors.

From my understanding, at the broadest level, this study aims to address how emotion (not emotion and sex) affects the production and perception of Mandarin lexical tones. While it is interesting to tease apart and comment on sex differences, I feel that the logical next step, which is missing, is how these effects may be controlled for in order to draw inferences regarding the generalisability of the findings and its robustness against sex and other potential confounds arising from idiosyncratic individual differences.

In this regard, I highly recommend the use of mixed effects models where speaker, sex and perhaps even syllable type may be entered as random intercepts and slopes. This is aligned with Reviewer 2’s suggestion that mixed effects models be applied on data from Experiment 2. I am of the opinion that this should also be applied to Experiment 1 and with more rigour than what is currently done (more on this in the next section).

A: We agree that this study focused on the effect of emotion (not emotion and talker sex) on tone production and perception. As suggested, we have rerun the statistical analysis for Experiment 1 (acoustic analysis) using linear mixed-effects models with talker, talker sex, syllable type, and repetition as random effects.

2. Has the statistical analysis been performed appropriately and rigorously?

A. Reviewer 2 recommended the use of Functional Data Analysis for the analysis of F0 contours.

The authors agreed that the use of Functional Data Analysis was appropriate but unwarranted as the current analysis of contours was intended to be a manipulation check of sorts to show that the four tones were produced as intended.

I agree that the use of Functional Data Analysis may not be necessary if the analysis was intended to be a simple visual sanity check. However, in its current form, explorations of the contours appear to be beyond the level of a simple sanity check as contrasts between sex and emotions were made. It is therefore inaccurate for the authors to claim that the ‘qualitative observations were meant to corroborate that the tones were ‘produced as intended’ (second last sentence of first paragraph on page 19) as it is not clear what the ‘intended’ shape of the contour is for the different sexes and emotion types.

My recommendation is for the authors to acknowledge that this is an exploratory examination to visualise and better understand the stimuli produced in the current study, rather than framing this as a comparison against a ‘standard’.

A: We agree that the F0 plot was intended to show how the Mandarin tones were produced by talkers in this study. As suggested, we have added a note about this point and deleted statements about potential sex and emotion contrasts.

B. Reviewer 2 suggested that mixed-model multinomial regression analysis in R may be a more appropriate analysis than the repeated-measures ANOVA performed by the authors.

The authors have now applied mixed model logistic regression in their analysis of the data from Experiment 2.

More critically, I am concerned about how the variables were entered into the model in the revised manuscript. I am not sure I understand why participants were entered as a random effect instead of variables such as sex, talker, or syllable. It is also unclear why the model was needlessly overcomplicated by the inclusion of the interaction effect between emotion and context when this is not the aim of the study (and the interaction is also subsequently ignored by the authors – pg. 39). It would also be instructive for the authors to cite the R package that was used.

The significant interaction effects were further illustrated by the authors through the use of confusion matrices. While the findings presented are sound, I feel that an edit would provide better clarity. For example, the authors wrote ‘The most common error for Mandarin tones produced with ANGRY was Tone 4’. My suggestion would be to rephrase this to something along the lines of ‘There appears to be a response bias where Mandarin lexical tones, Tones 1 and 3 in particular, were most commonly misidentified as Tone 4 when presented in an Angry tone of voice’.

A: We thank the reviewer for the suggestion to clarify and simplify the regression model. As suggested, we have rerun the analysis by focusing on factors most relevant to our research questions. In particular, fixed effects now include tone, emotion, context, and tone-emotion interaction. Random effects now include talker, syllable type, and repetition.

The R packages we used included lme4 and car. As instructed, we have included the following citations:

Bates D, Mächler M, Bolker B, Walker S (2015). “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software, 67(1), 1–48. doi:10.18637/jss.v067.i01.

Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.

As for the confusion matrices, we have also revised the descriptions as the reviewer suggested to improve the clarity of our interpretation.

C. I would also like the authors to review how the confidence interval error bars of Figures 3 and 6 were generated. Some of the CI bars extend into the negative range, which can arise from the calculation but is impossible for the actual measure (e.g., an F0 range cannot be negative). This likely indicates a rather unusual or problematic distribution in the data, which calls into question whether sufficient data preparation and cleaning (outliers?) was conducted prior to analysis.

A: Thank you for the observation. We have replaced the original figures with boxplots, which show the data distribution more clearly.

D. When conducting regression analysis, it is good practice to include within the body of the results section the statistics of the analysis such as the Beta estimates, standard errors, and Z-ratio. These are currently only available as downloadable supplementary information.

A: We thank Dr. Chong for this suggestion. Since the manuscript is already quite long, we feel that including detailed statistics in the main text will make reading difficult. Therefore we have chosen to keep the details in the supplementary information.

3. Have the authors made all data underlying the findings in their manuscript fully available?

A. Following the provided link, I am able to view only a simplified output of the mixed effects model. No other data was found.

A: Thank you for this observation. We have included the full output of the linear mixed-effects logistic regression model in the supporting information.

4. Is the manuscript presented in intelligible fashion and written in standard English?

A. In general, both reviewers commented to the effect that edits were required to improve the quality of the manuscript. Reviewer 2 further noted that the authors frequently made unsubstantiated claims.

The authors have made extensive rewrites with assistance from a native speaker of English who provided feedback on word choice, grammar and writing style.

It is clear that the quality of the writing has improved in the revised manuscript on account of the substantial effort put in by the authors and their openness to suggestions. However, it can definitely benefit from another round of polishing and editing, as there are areas within the manuscript (e.g., the section on confusion matrices) that, although grammatically sound, lack clarity.

A: As suggested, we have revised the section on confusion matrices to improve clarity.

Reviewer #4:

This manuscript addresses an interesting question, namely how linguistic and paralinguistic cues mutually influence each other in communication. Specifically, it does this by investigating how the expression of emotion in Mandarin affects the production and perception of tones 1 to 4 and, conversely, how the production of tones 1 to 4 affects the production and perception of emotion. A tonal language such as Mandarin provides an excellent testbed for the linguistics/paralinguistics interface, because emotional expression often modulates pitch, and pitch is one of the defining features (if not the defining feature) of Mandarin tones.

Since I did not review the original manuscript, I tried to pay close attention to the original comments and the replies from the authors. However, even though I did take the original reviews as a basis, there are some new remarks and recommendations in my review.

Theoretical points

------------------

One of the points of discussion in the original reviews was how to defend the use of four emotion categories. The authors have explained that these four are the four basic emotions in Ortony and Turner's 1990 paper. This is somewhat unfortunate, since this paper is a highly cited /critique/ of the whole idea of basic emotion theory (a critique deemed relevant enough for Paul Ekman to directly engage the paper in his 1992 paper). That said, and although I would appreciate a better motivation for the -small number of- emotions included in the corpus, I do not particularly mind that there are only four. The more fundamental problem is that the low number of emotions (and the ones chosen) is never discussed in terms of the consequences for recognition in the vocal domain. For example, the fact that there is only one positive emotion (happiness) and only one low-aroused emotion (sadness) strongly influences the decision problem that participants face. Likewise, it also strongly influences the results of the analysis, since (for example) the acoustic profile of sadness is so different from the others in intensity that significant results are bound to emerge. Contrast this with a situation where other low-aroused emotions such as disgust or tenderness are in the dataset and you get very different results for, say, F0 or intensity. For a balanced and transparent discussion of the meaning of the results, factors like these need to be taken into account, both in introducing the study and in the discussion section.

Along these lines I think the exclusive focus on basic emotions is needlessly limiting. Especially considering the fact that results and conclusions are often cast in terms of positive versus negative emotions, I think the dimensional approach to emotions (e.g., Russell, 1980; Russell and Feldman Barrett, 1999, and further) should be given proper attention (for example, when describing the emotions in the study, but also when interpreting the results, which is, as mentioned, often already done in terms of valence, notably not a property of basic emotion theory, which considers all emotions orthogonal categories). A reference to Laukka 2005 (in the rebuttal letter) is somewhat misleading, since although it provides evidence for categorical (which is not the same) perception of emotions, a paper from the same author in the same year using the same dataset uses the dimensional approach (Laukka, Juslin, & Bresin, 2005).

A: We thank Dr. Goudbeek for this comment. Dr. Chong also raised a similar comment. As noted in our earlier response, we have added a description of theories of emotion in the introduction (lines 70-93). We have also added a description in the conclusion to acknowledge the limitation of the four-emotion model (lines 783-791).

Methodological / Statistical points

---------------------------------

In addition to these theoretical considerations, there are some methodological issues that need to be addressed.

For the analysis of variance in experiment 1, the (statistical) design is somewhat unclear. The design is introduced with "For each of the four acoustic measures, a three-way mixed-design analysis of variance (ANOVA) was conducted [...] with emotion [] and Mandarin tones [...] as within-subject factors and talker sex [...] as a between-subject factor." What is not clear is why tone and emotion are within factors, but other stimulus characteristics that were deemed relevant in the construction of the corpus (syllable, repetition) were not. Statistically, this variance is unexplained variance, but in a more sophisticated analysis, these could be random effects over which the study could generalize (see Judd, Westfall, and Kenny, 2012, but also Winter and Grice, 2021). In any case, the precise number of items in the analysis should be clarified, if only because otherwise the degrees of freedom in the analysis become difficult to interpret.

A: We thank Dr. Goudbeek for this observation. As noted in our response to Dr. Chong, we have rerun the statistical analysis using linear mixed-effects models with talker, talker sex, syllable type, and repetition as random effects.
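As an aside for readers, the kind of analysis described in this exchange can be sketched in Python with statsmodels. The data below are simulated and all labels are illustrative, not the study's measurements; note also that statsmodels takes a single grouping factor for the random intercept (here, talker), whereas fully crossed random effects as in lme4 would require variance components.

```python
# Hedged sketch: linear mixed-effects model for a simulated acoustic
# measure (mean F0), with tone-by-emotion fixed effects and a random
# intercept per talker. All values and labels are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
tones = ["T1", "T2", "T3", "T4"]
emotions = ["angry", "fear", "happy", "neutral", "sad"]

rows = []
for talker in range(8):
    offset = rng.normal(0, 10)  # talker-specific random intercept
    for i, tone in enumerate(tones):
        for emotion in emotions:
            for rep in range(6):  # 3 syllables x 2 repetitions, pooled
                f0 = (200 + 15 * i - 20 * (emotion == "sad")
                      + offset + rng.normal(0, 5))
                rows.append({"talker": talker, "tone": tone,
                             "emotion": emotion, "f0": f0})
data = pd.DataFrame(rows)  # 8 x 4 x 5 x 6 = 960 rows

# Fixed effects: tone, emotion, and their interaction;
# random effect: intercept grouped by talker.
model = smf.mixedlm("f0 ~ tone * emotion", data, groups=data["talker"])
result = model.fit()
print(result.summary())
```

With simulated data like this, the model recovers the baseline F0 and the tone/emotion effects up to sampling noise; the estimated group variance reflects the talker offsets.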

As a follow-up to the same analysis (the mixed ANOVA of experiment 1), the authors use LSD as a post hoc test. This test does not correct in any way for multiple comparisons, thus increasing the possibility of a Type I error. While there might be arguments against correcting every multiple comparison with a Bonferroni test, the choice not to correct at all is problematic, certainly with such a large number of comparisons. This is particularly relevant in light of Figure 1, where almost all CIs overlap substantially (indicating the lack of a significant difference), which is in stark contrast to the findings of the post hoc analysis.

A: We agree with Dr. Goudbeek about the need to control for type I error. As suggested, we have used the Tukey test in the revised statistical analysis to correct for multiple comparisons.
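For readers, a minimal sketch of what Tukey-corrected pairwise comparisons look like, using SciPy's `tukey_hsd` (available from SciPy 1.8) on made-up per-emotion samples; the numbers are invented for illustration and do not come from the study.

```python
# Hedged sketch: Tukey's HSD adjusts all pairwise comparisons jointly,
# unlike LSD. The samples below are simulated, not the study's data.
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(1)
angry = rng.normal(220, 10, 40)  # hypothetical mean-F0 values per token
happy = rng.normal(225, 10, 40)
sad = rng.normal(190, 10, 40)

res = tukey_hsd(angry, happy, sad)
# res.pvalue is a 3x3 matrix of corrected p-values; entry [i, j] is the
# adjusted p-value for the comparison between groups i and j.
print(res.pvalue)
```

In this simulation the large angry-sad difference survives correction, while the small angry-happy difference may not, which is the point of controlling the familywise error rate.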

For the second analysis, multinomial mixed effects models are indeed the correct way to analyze this kind of data. However, mixed models also enable the inclusion of items (in addition to participants) as a random effect (again, see Judd, Westfall, and Kenny, 2012, but many others). This appears to not have been done, severely limiting the generalizability of the findings. For the revision, items should be entered as a random effect in the analysis.

A: Thank you for this observation. Dr. Chong also made a similar point. As suggested, we have included item characteristics (talker, syllable type, and repetition) as random effects.

It should be made clearer what participants did in the two experiments. If the same group of participants was used throughout, does that then mean that some judgments were made more than once, or were some judgments reused? E.g., both experiment 1 and experiment 2 contain a rating task. Were the selected data in experiment 1 rated again or not?

A: The raters for Experiment 1 were different from the listeners for Experiment 2. Therefore no judgments were reused. We apologize for the confusion.

Minor issues / typos

P17 (Abstract): Lexical tones and emotions are conveyed by a similar set of acoustic parameters; -> partly similar set (because both tones and emotions are also conveyed by parameters beyond F0)

A: Corrected.

P18 conveys not only -> not only conveys

A: Corrected.

P18 Emotional tone is defined as the vocal expression of emotion, which conveys a speaker’s affective states; -> this is a very strong statement given the discussion in the field about the status of emotion as a category and what it exactly is that vocal expression expresses. So, either more than one reference is needed, or a more nuanced statement (preferably both)

A: We have revised this statement and added more references (lines 49-50).

P18 When discussing Scherer's (2003) review, the high correlation between F0 and, particularly, amplitude needs to be mentioned. This is important, because using both F0 and Amplitude simultaneously as variables needs to take this interdependence into account.

A: We have added information about the correlation as suggested (line 54-55).

P19 A lexical tone language -> a tonal language (?)

A: Revised.

LN71 When compared to a neutral tone of voice … a longer duration [3-12, 24-31]: this may be a succinct summary of the literature, but in order for it to be useful, especially given the many mutually exclusive effects (e.g., a sad voice has a narrower or wider or similar F0 range), some summary conclusion or integration is necessary (over and above “there is some consistency, but also not”). In addition: all these are comparisons of the emotion to neutral, which severely limits the findings, right? That should be acknowledged.

A: We suspect the conflicting findings arise from methodological differences between studies. We have revised the summary to elaborate on this point (lines 96-106 & lines 115-120).

We understand that comparing emotions to each other would be quite informative. Our intention was to use neutral as a baseline to keep the number of comparisons manageable. We have acknowledged this limitation in the revised manuscript (lines 803-805).

P19 Similarly, when the authors say that "The variability, however, is consistent with the idea that emotion is sociocultural in nature", they are not wrong, but this statement does not connect that well with using basic emotions as a starting point. It would -to some extent- be in line with the dialect theory of emotion (e.g., Elfenbein and Ambady, 2002), which is based on basic emotion theory. So, some more clarity about the theoretical background of the authors and this paper is needed. What is the conceptual background here?

A: We thank Dr. Goudbeek for this observation and the reference. We have included the reference to support the statement. Since the focus of this study is on how emotions affect the acoustics and perception of Mandarin tones, rather than the effect of culture on emotion expression, we believe this fascinating topic can be addressed in a different study.

P21 In contrast, duration varied significantly among the emotions for Mandarin but not for Italian. > Not a big issue, but this is most likely also partly due to the fact that Italian is a syllable timed language, where there is much less room for variation in syllable duration.

A: Thank you for this observation. We have added this point to the text.

P22 -> In which language was the study by Mullennix et al.?

A: It was English. We have clarified this.

P23 -> I have a hard time seeing why the study of Nygaard and Queen. is important here: the effects seemed to be semantic, but how does this connect to the tone/emotion tradeoff that is expected by the authors?

A: We agree that this study is not directly relevant to the issue of the tone-emotion relationship. However, the point of this section is to show that emotion can affect spoken word recognition, which necessarily involves processing meaning. If we removed this reference, then all studies of word recognition reviewed in this section would have to be removed. With these considerations, we would like to keep this reference.

P23 / P24 -> the use of the word "compromised" is somewhat ambiguous; explain how emotion compromised. Similarly for the word asymmetrical: asymmetrical in what way?

A: We have replaced these words with a more elaborated description.

P24 Dutch speaking learners -> of Mandarin

A: Corrected.

P24/25 -> The finding about intermediate learners being better seems not so relevant, unless there is a link with emotion, no?

A: This finding is based on proficiency-tone interaction and the effect does not involve emotion. We agree that this does not seem relevant, so we have removed this statement.

P25 "Depending on specific Mandarin tones"; this is crucial for this paper (I think) and that authors could explain more why they think different tones are affected differently by (different?) emotions (and how this plays out). If this is not possible given the available information, that should be explained, too, then.

A: Thank you for the observation. The tone-emotion interaction was indeed a major finding of the study. We did not have sufficient information from the literature to make predictions about which tones would be affected more by emotions. Our data also did not reveal consistent tone-specific patterns to allow meaningful speculations beyond the presence of the interaction itself.

P25 "The way emotions are perceived appears to be language-universal": this contradicts earlier statements about the sociocultural nature of emotions.

A: Thank you for pointing this out. This statement has been deleted.

P27 -> the juxtaposition of acoustics and perception is a bit odd, I'd use production and perception

A: We still think acoustics is a more precise term that describes what we did in this study, i.e., acoustic analysis of speech produced by the talkers. Speech production could be examined with physiological measures, but that is not what we did in this study.

P28 -> extracted from the carrier phrase -> in isolation

A: We had used “in isolation” in the original version, but one of the original reviewers asked us to clarify that the isolated syllables were extracted from the carrier phrase, not produced in isolation. We agree that “in isolation” is a more succinct term and have used it in this revision.

P29 "Considered the four most common basic emotions" -> perhaps rephrase in light of the comments made above

A: We have rephrased this statement to highlight different perspectives on characterizing emotions.

P29 "Based on the literature review" -> It is important to realize / flag that most (if not all) of the review concerns empirical "just so" findings, without much theoretical underpinning as to why the expected effects are predicted. There is not necessarily something wrong with that, but it is important to consider.

A: Thank you for the observation. We acknowledge that we did not have sufficient information to make predictions about specific tones or to interpret the tone-emotion interactions fully.

P29/30 -"We also expect emotions to be modulated by specific Mandarin tones and talker sex" -> How? And if it is impossible to say how, explain why that is.

A: At Dr. Chong’s suggestion, we have removed talker sex as a fixed effect.

P30 Were participants compensated in any way for their participation?

A: Yes, all participants were compensated. We have included payment information in the revised manuscript.

P32 Were the recordings managed by a (stage) director (see, for example, Banse and Scherer 1996) or were the actors working alone?

A: Before the recording, the first author discussed with the actors the emotional tones that they should aim for. The actors completed the recording in a sound-treated booth while the first author monitored the recording. We have added this information to the revised manuscript.

P32 Acoustic analysis: indicate how many recordings there were (960, I think)

A: Yes, there were 960 (3 syllables * 4 tones * 5 emotions * 2 repetitions * 8 speakers) speech samples in Experiment 1. We have added this information to the revised manuscript.
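The count follows from the fully crossed design stated in the response; as a quick check (syllable labels invented for illustration), enumerating the design cells reproduces the figure.

```python
# Hedged sketch: enumerate the fully crossed recording design to verify
# the stimulus count. Labels are placeholders, not the actual syllables.
from itertools import product

syllables = ["syl1", "syl2", "syl3"]
tones = ["Tone1", "Tone2", "Tone3", "Tone4"]
emotions = ["angry", "fear", "happy", "sad", "neutral"]
repetitions = [1, 2]
talkers = list(range(1, 9))

stimuli = list(product(syllables, tones, emotions, repetitions, talkers))
print(len(stimuli))  # 960
```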

P32 State the aim of the rating procedure: why was it done?

A: The rating was done to make sure the intended emotions were actually present in the speech materials. This was explained in the section on acoustic analysis.

P32 Explain why NEUTRAL was not rated

A: We did not think neutral stimuli needed to be rated because they were emotionally neutral. In hindsight, we agree that it would have been a good idea to include neutral stimuli in the rating task too.

P33 name the four acoustic measures analyzed

A: The names of the four acoustic measures are specified in lines 402-404.

P34 the Functional Data Analysis -> Functional Data Analysis (drop the determiner)

A: Corrected.

P35 error bars indicate 95% interval -> the 95% interval

A: Corrected.

P38 and talker -> there appears to be an extra space (twice)

A: Corrected.

P45 The different results for stimuli with and without carrier phrase is reminiscent of work done on (speaker) normalization. Work by Holger Mitterer and Mathias Sjerps, for example, as well early work by Donald Broadbent (the filter theory)

A: Thank you for making the connection with talker normalization. We have included this idea in the section on the benefit of context in the introduction.

P48 The NEUTRAL emotion -> NEUTRAL category

A: Thank you for the suggestion. By “NEUTRAL emotion” we meant NEUTRAL as an emotional tone of voice (defined in line 46). Since Mandarin has a neutral (lexical) tone, we feel that labeling NEUTRAL as an emotion avoids the potential confusion with the lexical tone.

P49 in the carrier phrase -> when preceded by a carrier phrase

A: We have changed this phrase when it first appears in lines 291-292.

P49 As indicated, (random) item effects should be incorporated in the analysis

A: Yes, as noted earlier, we have included item characteristics as random effects.

P50 "we focus on two interactions" -> but the second one (tone-context) is not really statistically analyzed or introduced (while the other one is).

A: In the revised modeling, we examined only the tone-emotion interaction in light of Dr. Chong’s suggestion. Therefore we are no longer discussing the tone-context interaction.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Yiu-Kei Tsang

16 Feb 2023

PONE-D-21-23777R2Emotional tones of voice affect the acoustics and perception of Mandarin tonesPLOS ONE

Dear Dr. Chu,

Thank you for submitting your manuscript to PLOS ONE. First, let me apologize for the lengthy review process. I notice that the manuscript was submitted more than 500 days ago and has undergone 2 rounds of major revision. When I took up the responsibility of handling this manuscript, my goal was to ensure that you do not need to undergo another round of major review. Therefore, I tried my best to engage previous reviewers. Unfortunately, they were unavailable. Given that in the second round of review, only one reviewer requested Major Revision, I decided to invite only one new reviewer to speed up the process. I have also reviewed your manuscript to give you additional comments.

Both the new reviewer and I agree that the manuscript is potentially publishable. However, there are some minor flaws that need to be corrected before I can accept it for publication. Therefore, I am making the decision of Minor Review. Please check the manuscript carefully and consider seeking help from copyediting services.

You can find my comments below (under Additional Editor Comments). You can also find the comments of the additional reviewer in this decision letter. Please address them in your revision.

Please submit your revised manuscript by Apr 02 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Yiu-Kei Tsang

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Personally, I think how emotional tones and lexical tones interacted in the perception of Mandarin is a very interesting topic. While I agree with the previous reviewers that the experimental design is not perfect (e.g., not having all six emotions and not having a neutral response option), I believe the authors have discussed their results in an unbiased manner by stating the limitations explicitly. Given that no experiment is perfect, I think the most important point is to provide enough information so that readers can make informed judgement about the study and be inspired to conduct better experiments to clarify uncertainties. With this in mind, I am willing to support the manuscript for publication.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #5: (No Response)

********** 

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #5: Yes

********** 

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #5: Yes

********** 

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #5: No

********** 

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #5: Yes

********** 

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #5: I generally agree with previous review comments regarding the rationale behind and the design of the study (e.g., the four-emotion model, lack of counterbalancing on the conditions). I think the authors have adequately addressed these concerns in the last round of revisions, by providing additional justification or acknowledging the limitations. But I still noticed quite a few inaccuracies in both the results and the writing, and I hope the authors can do another round of proofreading and editing. I think this request has been raised repeatedly in previous rounds of review, and the authors should take the suggestion more seriously. Below I list some of these inaccuracies together with other minor comments. (I say “some” because I might not have spotted all of them.)

Line 352, 557: How much is the compensation in USD?

Line 368: “a total of 960 stimuli for each participant”, delete “for each participant”?

Line 663: “accuracy appears higher for target syllables presented than in isolation”, presented in context?

Line 673: “and the tone-emotion interaction fixed effects”, as fixed effects?

Line 679: “comparisons were indicates that”?

Line 702: Not all common errors are highlighted, e.g., Tone 1 SAD identified as HAPPY for 25%.

Line 705: “ranging from 86% to 99%”, ranging from 85% to 99%?

Line 781: “but it affects the identification of specific Mandarin tones to a greater extent than it affects the recognition of specific emotions”, this is not fully correct considering that the context seemed to boost emotion recognition accuracy to a larger extent (from 21%-93% to 85%-99%; tone identification from 40%-98% to 78%-100%)?

Figure 5: “Duration (ms)”, Duration (s)?

Figures 6, 7: Better specify tone identification /emotion recognition accuracy rather than “Accuracy”, in the figures.

********** 

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #5: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 5;18(4):e0283635. doi: 10.1371/journal.pone.0283635.r006

Author response to Decision Letter 2


13 Mar 2023

Additional Editor Comments

Personally, I think how emotional tones and lexical tones interacted in the perception of Mandarin is a very interesting topic. While I agree with the previous reviewers that the experimental design is not perfect (e.g., not having all six emotions and not having a neutral response option), I believe the authors have discussed their results in an unbiased manner by stating the limitations explicitly. Given that no experiment is perfect, I think the most important point is to provide enough information so that readers can make informed judgement about the study and be inspired to conduct better experiments to clarify uncertainties. With this in mind, I am willing to support the manuscript for publication.

Thank you for recognizing the potential contribution of this study. We agree that there is always room for improvement. Thank you for supporting this manuscript for publication.

Comments to the Author

Reviewer #5: I generally agree with previous review comments regarding the rationale behind and the design of the study (e.g., the four-emotion model, lack of counterbalancing on the conditions). I think the authors have adequately addressed these concerns in the last round of revisions, by providing additional justification or acknowledging the limitations. But I still noticed quite a few inaccuracies in both the results and the writing, and I hope the authors can do another round of proofreading and editing. I think this request has been raised repeatedly in previous rounds of review, and the authors should take the suggestion more seriously. Below I list some of these inaccuracies together with other minor comments. (I say “some” because I might not have spotted all of them.)

We appreciate the reviewer’s detailed and thoughtful comments. We have made corrections based on the reviewer’s suggestions. We have also invited a colleague (a native speaker of English) to proofread and edit the manuscript.

1. Line 352, 557: How much is the compensation in USD?

We have converted the currency to USD: $54 (LN352) and $34 (LN557).

2. Line 368: “a total of 960 stimuli for each participant”, delete “for each participant”?

Thank you for noting the redundancy. We have deleted it.

3. Line 663: “accuracy appears higher for target syllables presented than in isolation”, presented in context?

Yes, we have revised it to “accuracy appears higher for target syllables presented in context than in isolation.” Thank you for catching the missing phrase.

4. Line 673: “and the tone-emotion interaction fixed effects”, as fixed effects?

Yes, we have added “as” before “fixed effects”. Thank you.

5. Line 679: “comparisons were indicates that”?

We have revised it to “comparisons indicate that.”

6. Line 702: Not all common errors are highlighted, e.g., Tone 1 SAD identified as HAPPY for 25%.

Our intention was to highlight the most common error exceeding 20%, not all errors exceeding 20%. Please see the caption (LN 700-701). In this case, it is FEAR at 29%.

7. Line 705: “ranging from 86% to 99%”, ranging from 85% to 99%?

Thank you. We have updated the percentage.

8. Line 781: “but it affects the identification of specific Mandarin tones to a greater extent than it affects the recognition of specific emotions”, this is not fully correct considering that the context seemed to boost emotion recognition accuracy to a larger extent (from 21%-93% to 85%-99%; tone identification from 40%-98% to 78%-100%)?

Thank you for the observation. We have deleted this clause.

9. Figure 5: “Duration (ms)”, Duration (s)?

Thank you. We have corrected the label.

10. Figures 6, 7: Better specify tone identification /emotion recognition accuracy rather than “Accuracy”, in the figures.

Thank you. We made the changes as suggested.

Attachment

Submitted filename: Response to Reviewers_20230312.docx

Decision Letter 3

Yiu-Kei Tsang

14 Mar 2023

Emotional tones of voice affect the acoustics and perception of Mandarin tones

PONE-D-21-23777R3

Dear Dr. Chu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yiu-Kei Tsang

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Yiu-Kei Tsang

27 Mar 2023

PONE-D-21-23777R3

Emotional tones of voice affect the acoustics and perception of Mandarin tones

Dear Dr. Chu:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yiu-Kei Tsang

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Full output of a linear mixed-effect logistic regression model for acoustic analysis of four Mandarin tones with five emotions.

    (DOCX)

    S2 Table. Full output of a linear mixed-effect logistic regression model for Mandarin tone identification.

    (DOCX)

    S3 Table. Full output of a linear mixed-effect logistic regression model for emotion recognition.

    (DOCX)

    S4 Table. All data of two experiments.

    (RTF)

    Attachment

    Submitted filename: Response to Reviewers.docx


    Attachment

    Submitted filename: Response to Reviewers_20230312.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.

