Scientific Data. 2026 Mar 5;13:602. doi: 10.1038/s41597-026-06976-z

A validated Mandarin Chinese Auditory Emotion Database of Subject-Personal-Pronoun Sentences (MCAE-SPPS)

Mengyuan Li 1,2, Anqi Zhou 2, Huiru Yan 2, Qiuhong Li 2, Chifen Ma 2, Chao Wu 2
PMCID: PMC13079751  PMID: 41786802

Abstract

Emotional expression in speech varies with grammatical subjects, including personal pronouns. This study reports the development and validation of a novel Mandarin Chinese auditory emotional speech dataset comprising sentences with first-, second-, and third-person pronouns. Six professionally trained actors recorded 200 semantically meaningful sentences in a neutral tone and in six basic emotions: happiness, sadness, anger, fear, disgust, and surprise. Emotional labels and intensity ratings were provided by 720 native Chinese-speaking college students. The final dataset comprises 6,675 validated recordings: neutral (1,169), sadness (1,187), anger (671), surprise (969), disgust (738), happiness (785), and fear (671). Of these, 2,729 recordings contain first-person pronouns, 2,608 contain second-person pronouns, and 1,338 contain third-person pronouns. The dataset demonstrated acceptable inter-rater reliability and robust associations between acoustic features and emotion recognition performance. Each recording includes the raw waveform file, emotion recognition rates, perceived intensity ratings, and a comprehensive set of extracted acoustic features. This validated emotional speech corpus offers a unique and valuable resource for research in linguistics, psychological science, neuroscience, and clinical rehabilitation.

Subject terms: Psychology, Research data

Background & Summary

Subject-pronoun sentences are frequently used, either consciously or unconsciously, in everyday interpersonal communication. Subject pronouns such as “I,” “we,” and “you” often carry substantial emotional salience, as they directly reference the speaker or the listener, thereby occupying a central role in both emotion expression and perception in spoken language1,2. Investigating how emotions are conveyed through pronoun-based sentences provides insight into subtle variations in prosodic features, including pitch, intonation, and temporal dynamics that characterize different emotional states, particularly in populations with mental health disorders3–5. Such knowledge is essential for the development of effective strategies for diagnosis, monitoring, and intervention of emotional and communicative impairments6,7. Moreover, the performance of automatic emotion recognition systems relies on a comprehensive understanding of emotional phonetics, including the modulatory role of subject pronouns in emotional expression8. Integrating pronoun-specific emotional cues into these systems may substantially enhance their sensitivity to human emotions and altered mental states, thereby improving their applicability in clinical assessment and human–machine interaction contexts9,10.

Emotional speech databases (ESDs) constitute essential resources for advancing research in emotion recognition11–13, linguistic analysis14,15, emotion-language processing16,17, emotionally intelligent systems18,19, and mental health monitoring and cognitive training20,21. To date, numerous countries have developed emotional speech databases in their native languages, as comprehensively summarized in previous reviews22,23. Recently established large-scale ESDs include the Italian Database of Elicited Mood in Speech (DEMoS)24, an Urdu emotional speech corpus comprising 2,500 utterances based on emotionally neutral coherent sentences25, the SUST Bangla Emotional Speech Corpus (SUBESCO)26, and a Quechua Collao corpus containing 12,420 stimuli27. Language-specific resources for Mandarin and Cantonese have also been developed, including the Mandarin Chinese Auditory Emotions Nonsense Sentences (MCAE-NS) database14, the Mandarin Chinese Auditory Emotions monosyllables (MCAE-MS) database28, and the Cantonese Audio-Visual Emotional Speech (CAVES) dataset29.

Although several existing databases include some sentences containing subject personal pronouns, none are specifically designed to examine pronoun-related emotional expression in a systematic manner. For instance, the widely used Berlin Emotional Speech Database (EmoDB; https://www.tu.berlin/en/kw/research/projects/emotional-speech) comprises a set of short German sentences, some of which begin with subject pronouns, produced by 10 professional speakers (five male and five female, aged 25–35) across seven emotional categories: happiness, boredom, sadness, fear, disgust, surprise, and neutrality. However, the inclusion of pronoun-based sentences in EmoDB is incidental rather than theoretically motivated. Similarly, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP)30 contains naturalistic conversational speech that includes subject pronouns (e.g., “I”, “We”, “You”, “He”) embedded within dialogue exchanges performed by five actors (two female, three male) and annotated across ten emotional categories, including: happy, sad, angry, neutral, frustrated, excited, surprised, fearful, disgusted, and others. Although IEMOCAP offers rich contextual and multimodal information, pronoun usage is not explicitly controlled or annotated as a variable of interest. The Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D; https://github.com/CheyneyComputerScience/CREMA-D) contains five English sentences beginning with the first-person singular “I”, spoken by 91 actors (48 male and 43 female) in six emotional states: happiness, sadness, anger, fear, disgust, and neutrality. Despite the consistent use of a subject pronoun, the dataset was not designed to systematically investigate the emotional modulation associated with different pronoun types or grammatical perspectives. In Mandarin Chinese, existing emotional speech resources such as the CASIA Emotional Speech Dataset consist of recordings from four professional actors (two male and two female) who produced 50 sentences across six emotional categories: neutrality, anger, fear, sadness, surprise, and happiness31. However, the dataset is broad in scope and does not specifically target sentences containing subject personal pronouns. Likewise, the Chinese Expressive Audio-Visual Database (CHEAVD)32 includes audio-visual recordings of multiple speakers expressing emotions such as neutrality, happiness, sadness, anger, fear, and surprise, and contains some utterances with subject pronouns. Nonetheless, these pronoun-containing sentences are neither systematically selected nor explicitly annotated for analyses focusing on pronoun-related emotional expression. Overall, while subject pronouns appear sporadically across several widely used emotional speech databases, there remains a clear lack of dedicated resources that explicitly control, annotate, and analyze the emotional prosody associated with different subject pronoun categories. This gap highlights the need for a specialized emotional speech corpus designed to investigate pronoun-specific emotional encoding and perception.

Pronouns are subtle yet powerful linguistic elements that substantially shape the perception and interpretation of emotion in spoken language. Their influence stems from their capacity to direct attentional focus, evoke social dynamics, and modulate prosodic features. Consequently, pronouns constitute a critical dimension of emotional speech analysis, with important implications for computational emotion recognition systems as well as clinical and therapeutic applications. Different classes of pronouns are associated with distinct emotional and interpersonal functions. First-person plural pronouns (e.g., “we”) tend to emphasize shared experience and social affiliation, fostering a sense of intimacy and connectedness between speakers and listeners. In contrast, first-person singular pronouns (e.g., “I,” “me”) have been linked to increased self-focus, interpersonal distress, and more intrusive communication styles, particularly in clinical populations33–35. Second-person pronouns (e.g., “you”) directly address the listener and often heighten emotional salience and perceived intensity by increasing personal relevance and engagement36–38. Third-person pronouns (e.g., “he,” “she,” “they”), by contrast, introduce psychological distance between the speaker and the referent, potentially attenuating emotional immediacy while promoting a more detached or reflective tone39–41. Neuroimaging evidence further underscores the relevance of pronouns in emotional speech processing. A meta-analysis identified functional convergence in the left posterior middle and superior temporal gyri during pronoun processing42. Additionally, Schirmer (2018) highlighted the involvement of bilateral primary and secondary temporal cortices in the processing of auditory emotional speech, suggesting that these regions may play integrative roles in decoding and expressing emotional prosody, particularly when subject pronouns are present.

Existing emotional speech databases have provided valuable insights into emotion recognition and speech processing. However, they rarely account for the role of subject personal pronouns—linguistic elements that are central to both emotional and social communication. Pronouns such as “I,” “you,” and “we” convey distinct self-referential and interpersonal meanings, which can substantially influence emotional salience and cognitive processing during speech perception. To address this gap, we developed a specialized corpus featuring six types of subject personal pronoun sentences, each annotated with detailed emotional labels. The corpus includes parallel utterances produced by six professional Mandarin actors, facilitating both speaker voice conversion (i.e., the same sentence expressed by different speakers) and emotional voice conversion (i.e., the same speaker expressing different emotions)23. This design enhances the corpus’s utility for human–machine interaction research, including emotionally responsive AI and personalized therapeutic applications43–45. Moreover, the resource enables interdisciplinary investigations across linguistics, psychology, neuroscience, and computational science by allowing exploration of pronoun-specific emotional cues, testing theoretical models in social cognition and appraisal theory, and informing emotion-based interventions aimed at improving empathy, self-awareness, and social communication.

This manuscript represents the first formal, peer-reviewed report detailing the design, validation, and application of the MCAE-SPPS database. To increase dataset visibility and encourage academic sharing, we previously created a descriptive, non-peer-reviewed metadata page on IEEE DataPort (https://ieee-dataport.org/documents/mandarin-chinese-auditory-emotions-stimulus-database-validated-set-sentences-subjective). This page does not constitute a publication, and no version of this manuscript has been published by IEEE. The IEEE DataPort entry simply redirects users to the official dataset repository hosted on OSF: 10.17605/OSF.IO/9JYZC. (View-only link: https://osf.io/9jyzc/overview?view_only=088ce4e15a914b939c8bb6bd119c7226)

Methods

Script creation

The recording script comprises 200 emotionally neutral Chinese sentences, all declarative and following a consistent subject–predicate–object structure (e.g., “I have a plan”). Each sentence contains 4–7 high-frequency Chinese characters, chosen to minimize unintended affective connotations.

The script construction included two phases. In the first phase, 40 base sentences were created using the first-person singular pronoun “I” (see the sentence corpus file available in the OSF repository: “sentence corpus information.xlsx”46). Twenty independent raters evaluated the semantic valence of each sentence on a 9-point Likert scale (1 = very sad, 5 = neutral, 9 = very happy). The sentence order was randomized for each participant, and informed consent was obtained electronically. The average semantic valence across sentences was 5.32 ± 0.95, and no sentence exceeded the neutral midpoint of 5 plus two standard deviations; therefore, all 40 sentences were retained. In the second phase, these base sentences were adapted by replacing the first-person singular pronoun with five additional subject personal pronouns: “We,” “You” (singular), “You” (plural), “He (他)”, and “They.” Only the subject pronoun was modified; verb and object components remained identical to preserve semantic equivalence. This process resulted in a total of 200 semantically neutral sentences: 160 sentences with pronouns “I,” “We,” “You” (singular), and “You” (plural) (40 sentences per pronoun) and 40 sentences with third-person pronouns (“He”(他) and “They”, 20 sentences each). Note that in Mandarin, the spoken forms of “He” (他) and “She” (她) are phonetically identical; thus, only the orthographic form “他” was used for the third-person singular condition.

Recording sessions

Recording actors

Six professional actors (Actor 1 to Actor 6; three males and three females; mean age = 32.83 ± 5.98 years), all native Mandarin speakers and graduates of the Central Academy of Drama, participated in this study. Basic information of the actors is shown in Table 1. Before the recording sessions, all actors provided written informed consent and received financial compensation for their participation.

Table 1.

Voice Actors’ Information.

Actor Age Gender Voice Acting/Performance experience
Actor 1 39 Male 21 years of voice acting and performance
Actor 2 36 Male 18 years of voice acting
Actor 3 24 Male 4 years of voice acting and performance
Actor 4 34 Female 26 years of voice acting and performance
Actor 5 37 Female 16 years of voice acting and performance
Actor 6 27 Female 4 years of voice acting

Recording procedures

Prior to the recording sessions, each actor was provided with the full set of 200 sentences to familiarize themselves with the text. The researcher briefly explained the general procedures and allowed the actors to practice expressing the target emotions using these sentences. During the recording sessions, actors were instructed to convey the six target emotions (happiness, sadness, anger, fear, disgust, and surprise) and the neutral state as authentically as possible. Each actor recorded all 200 sentences in each of the seven emotional categories, resulting in a total of 1,400 recordings per actor. The coding system comprised actor codes (1–6), emotion category codes (1–7; 1 = neutral, 2 = happiness, 3 = anger, 4 = fear, 5 = sadness, 6 = disgust, and 7 = surprise; see Table 2), and sentence codes (001–200). After excluding 21 recordings due to mispronunciation or technical errors, 8,379 recordings remained for validation (actor 1: 1,398; actor 2: 1,398; actor 3: 1,394; actor 4: 1,395; actor 5: 1,398; actor 6: 1,396). The distribution of recordings across emotional categories was as follows: neutral: 1,197; sadness: 1,197; fear: 1,196; anger: 1,199; disgust: 1,198; surprise: 1,195; and happiness: 1,197.

Recording environment and equipment

All recordings were conducted in a professional soundproof studio. Speech signals were captured using a DJI Mic-AST01 wireless microphone and digitized at a sampling rate of 48 kHz with 64-bit resolution across two channels.

Segmentation and preprocessing

The recordings were manually segmented and coded using Adobe Audition (version 13.0.5.36). All 8,379 speech samples were peak-normalized and saved individually in WAV format using MATLAB R2023a.
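
For users who wish to reproduce this preprocessing step outside MATLAB, the following is a minimal Python sketch of peak normalization for a single WAV file. It assumes the numpy and soundfile packages and hypothetical file paths; it is not the authors' original MATLAB script.

    import numpy as np
    import soundfile as sf

    def peak_normalize(in_path, out_path, target_peak=0.99):
        """Scale a waveform so its maximum absolute amplitude equals target_peak."""
        audio, sr = sf.read(in_path)          # audio: float array, sr: sampling rate (Hz)
        peak = np.max(np.abs(audio))
        if peak > 0:
            audio = audio * (target_peak / peak)
        sf.write(out_path, audio, sr)

    # Hypothetical usage:
    # peak_normalize("raw/13171_raw.wav", "normalized/13171.wav")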

Annotation procedure

Participants

A total of 818 Chinese college students were recruited through online advertisements. Ninety-seven individuals were excluded based on screening scores ≥ 10 on either the Generalized Anxiety Disorder Scale (GAD-7) or the 9-item Patient Health Questionnaire Depression Scale (PHQ-9), consistent with prior evidence indicating that anxiety and depressive symptoms may compromise emotion recognition accuracy47,48. One additional participant was excluded due to a self-reported history of a neurological disorder. The final sample included 720 participants (412 females and 308 males; mean age = 21.60 ± 2.96 years), who were included in the validation analysis.

Procedure

The validation procedure was conducted via a custom-designed website developed in Java (version JDK 1.8) using the IntelliJ IDEA integrated development environment (version 2019.3.5). Participants received remote instructions from the experimenter through WeChat or a digital instruction manual. The experimenter remained available online throughout the initial phase of the experiment to guide participants, ensure task comprehension, and resolve any questions or technical issues.

Participants completed the experiment on personal computers in a quiet environment, using a self-selected comfortable listening level. Access to the experimental platform was granted via a unique username and password. Upon clicking the “Begin” button, each vocal utterance was presented automatically. After listening to the audio stimuli, participants were instructed to identify the perceived emotion category (neutral, happiness, anger, sadness, fear, surprise, or disgust) and to rate its emotional intensity on a 9-point scale. Participants were allowed to replay each utterance as needed by clicking the play button. Once an evaluation was submitted, the subsequent utterance was presented automatically, and participants were not allowed to revise previous responses.

Given the substantial time required to evaluate all 8,379 utterances—estimated at approximately 10 hours, based on an average of 1.3 seconds for listening and 3 seconds for rating per utterance— we implemented a strategy to minimize participant fatigue and maintain attentional engagement14. The 720 participants were divided into 18 groups, each consisting of 40 participants. Each group was assigned to evaluate one-third of the recordings from a single speaker, encompassing all seven emotional categories. Consequently, each participant evaluated a representative subset of emotional utterances from one actor rather than being restricted to a single emotion category. On average, each participant rated approximately 465 utterances, which were presented in a randomized order within each group. On average, participants spent 35–40 minutes evaluating approximately one-third of the utterances from one of the six speakers. In addition to the sentence-based evaluations, participants also completed assessments of approximately 1.5 hours of monosyllabic emotional speech, as reported in our previous work28. Consequently, the total evaluation time per participant ranged from 2 to 2.5 hours. To reduce cognitive load and maintain data quality, participants were encouraged to take two to three breaks as needed throughout the experiment.

Ethics statement

All study procedures were reviewed and approved by the Peking University Biomedical Ethics Committee (IRB00001052_23144). Participants were recruited via Chinese social media platforms, including WeChat, Rednote, and Weibo. Electronic informed consent was obtained prior to participation through the SoJump platform. The consent form detailed the study objectives, data collection procedures, and assessment tasks. Raters were informed that the collected data would be shared in anonymized form for non-commercial use. All personally identifiable information was removed to ensure anonymity. Participants were also informed of the potential risks and benefits of participation and received appropriate compensation. Voice actors were specifically informed that their recordings would be anonymous and made publicly available for non-commercial use.

Data Records

The dataset is publicly available on OSF46. It contains 6,675 audio segments. An overview of the corpus is provided in Table 2. Audio files are organized into separate folders by emotion category to facilitate preview, and compressed versions of these folders are available in the directory entitled audio_zips_by_different_emotion. Each audio file is named according to a standardized convention that includes the actor code (1–6), emotion category code (1–7), and a five-digit sentence code (001–200). Detailed naming rules are described in Table 2. Comprehensive metadata for each audio file are provided in the file sentence_corpus_information.xlsx, which includes information on emotion category, perceived emotional intensity, actor code, validation results, sentence content, and acoustic features. Additional documentation is available in the Files_description.txt.

Table 2.

Summary of the Database Information.

Utterance number in each step
Number of sentences 200 sentences in total: 160 sentences using first- and second-person pronouns (“I”, “We”, “You [singular]”, and “You [plural]”; 40 sentences per pronoun) and 40 sentences using third-person pronouns (20 with “He” and 20 with “They”).
Number of utterances recorded 8,400 utterances: 200 sentences × 7 emotional categories (neutrality, happiness, anger, fear, sadness, disgust, surprise) × 6 actors (3 males, 3 females).
Number of utterances assessed 8,379 utterances (actor 1: 1,398; actor 2: 1,398; actor 3: 1,394; actor 4:1,395; actor 5: 1,398; actor 6: 1,396). Twenty-one utterances were excluded due to mispronunciation or technical errors.
Number of utterances included in the final corpus

6,675 utterances (1,704 were excluded because their recognition rate did not meet the inclusion criteria). Emotion distribution: neutral (1,169), sadness (1,187), anger (671), surprise (969), disgust (738), happiness (785), and fear (671).

Pronoun distribution: first-person: 2,729; second-person: 2,608; third-person: 1,338.

Audio annotation
Number of annotators per utterance 40
Number of utterances assessed per participant Approximately 465 utterances, randomly distributed across seven emotion categories and six subject personal pronoun types.
Participant grouping 18 groups (6 actors × 3 randomly assigned subsets)
Total number of participants 720 (18 groups × 40 participants).
Average audio duration 1.24 seconds.
Annotation tasks Emotion category identification (single-choice among seven emotions); emotion intensity rating (9-point scale)
Audio files coding rule (five digits)
1st digit Actor code (1–6); 1–3 are male, 4–6 are female
2nd digit Emotion category code (original recording emotion type, 1–7); 1: neutral, 2: happiness, 3: anger, 4: fear, 5: sadness, 6: disgust, 7: surprise
3rd–5th digits Sentence code (001–200). Sentences with “I”: 001–040; “We”: 041–080; “You” (singular): 081–120; “You” (plural): 121–160; “He”: 161–180; “They”: 181–200
Examples

13171.wav: actor 1’s (male) anger voice; speech content is “他有个计划” (He has a plan).

45021.wav: actor 4’s (female) sadness voice; speech content is “我拿着杯子” (I hold the cup).
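
As an illustration of the coding rule above, the following Python sketch decodes a file name into actor, emotion, and pronoun information and optionally loads the waveform. The emotion and pronoun mappings follow Table 2; the folder name in the commented path is hypothetical.

    import soundfile as sf

    EMOTIONS = {1: "neutral", 2: "happiness", 3: "anger", 4: "fear",
                5: "sadness", 6: "disgust", 7: "surprise"}

    def pronoun_from_sentence(code):
        """Map the 3-digit sentence code (1-200) to its subject pronoun (Table 2)."""
        if code <= 40:
            return "I"
        if code <= 80:
            return "We"
        if code <= 120:
            return "You (singular)"
        if code <= 160:
            return "You (plural)"
        if code <= 180:
            return "He"
        return "They"

    def decode(filename):
        """Decode a 5-digit file name such as '13171.wav'."""
        stem = filename.split(".")[0]
        actor = int(stem[0])                 # 1-3 male, 4-6 female
        emotion = EMOTIONS[int(stem[1])]
        sentence = int(stem[2:5])
        return {"actor": actor,
                "gender": "male" if actor <= 3 else "female",
                "emotion": emotion,
                "sentence": sentence,
                "pronoun": pronoun_from_sentence(sentence)}

    info = decode("13171.wav")  # actor 1, male, anger, sentence 171, pronoun "He"
    # audio, sr = sf.read("audio/13171.wav")  # hypothetical path to an extracted audio folder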

Technical Validation

Accuracy without speech inclusion criteria

Figure 1 presents a confusion matrix illustrating the correspondence between the emotions intended by the actors and the emotions perceived by the raters. Each cell indicates the number of times a specific emotion was selected; diagonal cells represent correct classifications, whereas off-diagonal cells reflect misclassifications. The x-axis denotes the intended emotion, and the y-axis denotes the perceived emotion. Recognition accuracy varied across emotion categories: neutral utterances showed the highest recognition accuracy (88.6%), followed by sadness (82.1%), surprise (66.0%), anger (65.8%), happiness (55.1%), fear (51.2%), and disgust (50.7%). Confusion matrices stratified by actor are presented in Table 3.

Fig. 1.

Fig. 1

Confusion matrix for emotion recognition. Numbers represent the count of trials in which each emotional stimulus was assigned to a given emotion category.

Table 3.

Confusion matrix (counts) of target emotion categories and listener-based emotion recognition across six speakers.

Speaker (gender) Target emotion category N (utterances) Neutral Happiness Anger Fear Sadness Disgust Surprise Total (responses)
1 (male) Neutral 200 6914 267 123 109 241 177 169 8000
Happiness 200 2967 3146 234 254 285 124 990 8000
Anger 200 511 124 6825 77 160 198 105 8000
Fear 200 1564 173 101 3059 2251 120 732 8000
Sadness 199 419 110 81 564 6628 95 63 7960
Disgust 199 1901 105 269 93 231 5214 147 7960
Surprise 200 829 360 186 96 142 115 6272 8000
Total 1398 15105 4285 7819 4252 9938 6043 8478 55920
2 (male) Neutral 200 6914 267 123 109 241 177 169 8000
Happiness 200 2967 3146 234 254 285 124 990 8000
Anger 200 511 124 6825 77 160 198 105 8000
Fear 200 1564 173 101 3059 2251 120 732 8000
Sadness 199 419 110 81 564 6628 95 63 7960
Disgust 199 1901 105 269 93 231 5214 147 7960
Surprise 200 829 360 186 96 142 115 6272 8000
Total 1398 15105 4285 7819 4252 9938 6043 8478 55920
3 (male) Neutral 198 6945 115 146 132 173 209 200 7920
Happiness 200 2608 2605 585 145 152 188 1717 8000
Anger 200 756 268 5807 119 94 374 582 8000
Fear 197 1446 117 95 3340 2440 111 331 7880
Sadness 200 650 33 73 430 6658 81 75 8000
Disgust 199 3623 112 778 124 325 2582 416 7960
Surprise 200 2022 1008 332 129 117 176 4216 8000
Total 1394 18050 4258 7816 4419 9959 3721 7537 55760
4 (female) Neutral 199 6477 102 145 182 576 270 208 7960
Happiness 200 2029 4613 140 105 133 90 890 8000
Anger 200 1548 553 4838 118 113 407 423 8000
Fear 200 1632 93 100 3716 1925 122 412 8000
Sadness 198 553 47 80 428 6684 62 66 7920
Disgust 200 2968 64 1267 133 250 3116 202 8000
Surprise 198 1275 842 336 321 118 144 4884 7920
Total 1395 16482 6314 6906 5003 9799 4211 7085 55800
5 (female) Neutral 200 7324 131 49 90 160 149 97 8000
Happiness 199 1842 5558 51 84 96 88 241 7960
Anger 200 453 93 6362 95 104 548 345 8000
Fear 199 512 25 53 5222 1888 101 159 7960
Sadness 200 379 41 26 1089 6335 89 41 8000
Disgust 200 1890 65 884 92 133 4814 122 8000
Surprise 200 754 341 1296 84 111 475 4939 8000
Total 1398 13154 6254 8721 6756 8827 6264 5944 55920
6 (female) Neutral 200 7245 90 90 79 145 240 111 8000
Happiness 200 1661 5739 50 67 105 52 326 8000
Anger 199 793 144 6398 66 95 286 178 7960
Fear 200 885 48 90 5797 690 72 418 8000
Sadness 200 491 44 64 513 6783 58 47 8000
Disgust 200 2518 46 374 71 105 4683 203 8000
Surprise 197 1021 895 358 148 109 154 5195 7880
Total 1396 14614 7006 7424 6741 8032 5545 6478 55840

Values represent the number of listener responses. Correct classifications are shown in bold. Underlined values indicate instances in which an emotion category was selected by more listeners than the intended (target) emotion for the corresponding vocalization.

Accuracy with speech inclusion criteria

Based on evaluations from 40 participants, we calculated the mean percentage of correct identifications for each target emotion. Recordings were considered valid if they met the following criteria: (1) a recognition accuracy of at least 43% for the target emotion, corresponding to three times the chance level in a seven-alternative forced-choice task, and (2) fewer than 43% of responses attributed to any non-target emotion category14,49.
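
To make the criterion concrete, the sketch below applies it to one utterance's response counts from its raters. It is a hedged illustration: the function name and the example counts are invented for demonstration and are not taken from the dataset.

    def is_valid(response_counts, target, threshold=0.43):
        """Inclusion rule: the target emotion is chosen by at least 43% of raters
        (three times chance in a 7-alternative task), and no non-target emotion
        is chosen by 43% or more of raters."""
        total = sum(response_counts.values())
        rates = {emo: n / total for emo, n in response_counts.items()}
        if rates.get(target, 0.0) < threshold:
            return False
        return all(rate < threshold for emo, rate in rates.items() if emo != target)

    # Hypothetical example: 40 raters judging an utterance intended as "anger"
    counts = {"neutral": 3, "happiness": 1, "anger": 28, "fear": 2,
              "sadness": 2, "disgust": 3, "surprise": 1}
    print(is_valid(counts, "anger"))  # True: 28/40 = 70% >= 43%, all other categories < 43%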

According to these inclusion criteria, Table 4 summarizes the number of perceptually valid items for each emotion category. Following item selection and reclassification, 80% of the original utterances (6,675 out of 8,379) were retained across the seven target emotion categories: neutrality (1,169), sadness (1,187), anger (671), surprise (969), disgust (738), happiness (785), and fear (671). The proportion of valid recordings across the six actors was high, with validation rates of 78%, 80%, 62%, 73%, 92%, and 92%, respectively. Recognition accuracy and intensity ratings for each emotion are presented in Table 5.

Table 4.

Number of perceptually valid utterances for each emotion category.

Speaker Neutrality Happiness Anger Fear Sadness Disgust Surprise Total Valid rate
1 (male) Original 200 200 200 200 199 199 200 1398
Removed 1 135 0 154 0 16 2 308
Valid (N) 200 65 200 46 199 183 198 1091 0.78
2 (male) Original 200 198 200 200 200 200 200 1398
Removed 5 45 6 128 1 89 10 284
Valid (N) 200 153 195 72 199 111 190 1120 0.80
3 (male) Original 198 200 200 197 200 199 200 1394
Removed 3 159 14 120 2 157 73 528
Valid (N) 197 41 186 77 198 42 127 868 0.62
4 (female) Original 199 200 200 200 198 200 198 1395
Removed 1 51 45 98 1 134 51 381
Valid (N) 199 149 155 102 197 66 147 1015 0.73
5 (female) Original 200 199 200 199 200 200 200 1398
Removed 16 8 7 17 6 23 48 125
Valid (N) 200 191 194 182 196 177 152 1292 0.92
6 (female) Original 200 200 199 200 200 200 197 1396
Removed 6 5 41 8 3 42 14 119
Valid (N) 200 186 199 192 198 159 155 1289 0.92
Valid (total) 1196 785 1129 671 1187 738 969 6675 0.80

Table 5.

Mean Recognition Accuracy and Emotional Intensity Ratings for Perceptually Valid Utterances.

Emotion Accuracy, Mean (SD) Intensity, Mean (SD)
Neutral 0.89 (0.08) 4.52 (0.33)
Happiness 0.69 (0.12) 5.38 (0.50)
Anger 0.79 (0.11) 6.27 (0.66)
Fear 0.65 (0.10) 5.56 (0.52)
Sadness 0.82 (0.09) 6.39 (0.45)
Disgust 0.63 (0.10) 4.85 (0.37)
Surprise 0.74 (0.12) 5.48 (0.46)
Average 0.76 (0.14) 5.53 (0.84)

Effect of subject pronoun type on emotion recognition

To examine the effect of subject pronoun type on recognition accuracy, we performed a 6 (Subject Pronoun Type: I, We, singular You, plural You, third-person singular, They) × 7 (Emotion Category: neutrality, happiness, sadness, fear, anger, disgust, and surprise) two-way non-parametric mixed-effects ANOVA on recognition rates, which exhibited a non-normal distribution. The analysis was conducted using the Aligned Rank Transform (ART) method implemented in the R package ARTool (https://cran.r-project.org/web/packages/ARTool/readme/README.html). Actor and rater group were included as random effects. The main effects of emotion category (F6,6620 = 814.99, p < 0.001, η2p = 0.425) and subject pronoun type (F5,3522 = 3.33, p = 0.005, η2p = 0.005) and the interaction between emotion category and subject pronoun type (F30,6622 = 2.04, p < 0.001, η2p = 0.02) were all significant (upper panel of Fig. 2). Although the interaction was significant, post hoc simple-effects analyses did not yield statistically significant contrasts after correction for multiple testing. This pattern suggests that the interaction reflects subtle, distributed differences across pronoun categories rather than large effects driven by specific contrasts.
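
The reported analysis was run with ARTool in R; for readers working in Python, the sketch below illustrates the core aligned-rank-transform idea for a two-factor design. It is a simplified, fixed-effects illustration only: it omits the random effects for actor and rater group that the reported model includes, and the column names in the usage comment are assumptions about how a per-utterance ratings table might be organized.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    def art_anova(df, dv, f1, f2):
        """Aligned Rank Transform for a 2-factor design (fixed effects only):
        for each effect, subtract all other estimated effects, rank the aligned
        scores, then run a factorial ANOVA on the ranks and read off that effect."""
        y = df[dv]
        grand = y.mean()
        cell = df.groupby([f1, f2])[dv].transform("mean")
        m1 = df.groupby(f1)[dv].transform("mean")
        m2 = df.groupby(f2)[dv].transform("mean")
        residual = y - cell
        estimated = {f1: m1 - grand,
                     f2: m2 - grand,
                     "interaction": cell - m1 - m2 + grand}
        tables = {}
        for term, effect in estimated.items():
            aligned = (residual + effect).rank()
            tmp = df.assign(aligned_rank=aligned)
            fit = ols(f"aligned_rank ~ C({f1}) * C({f2})", data=tmp).fit()
            tables[term] = sm.stats.anova_lm(fit, typ=2)  # interpret only the row for `term`
        return tables

    # Hypothetical usage, assuming one recognition rate per utterance with
    # columns "pronoun", "emotion", and "accuracy":
    # results = art_anova(ratings, dv="accuracy", f1="pronoun", f2="emotion")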

Fig. 2.

Fig. 2

Interaction effect of sentence subject pronoun and emotion category on recognition accuracy (upper panel) and perceived emotional intensity (lower panel).

To further probe the significant interaction between emotion and subject pronouns, we conducted theory-driven planned contrasts50,51 using estimated marginal means derived from the ART-based mixed-effects model. Three contrasts were specified: first-person versus second-person, first-person versus third-person, and second-person versus third-person. Joint Wald-type F tests were performed separately within each emotion category. The planned contrasts revealed a selective and robust effect of person reference. Across all seven emotion categories, the contrast between second-person and third-person utterances was consistently significant (all F values > 118, all p values < 0.0001), suggesting that person reference exerts a stable, emotion-independent influence on auditory speech emotion processing. In contrast, no significant differences were observed between first-person and second-person utterances, nor between first-person and third-person utterances, across any emotion category (all p values > 0.39). These findings suggest that the observed emotion-by-pronoun interaction was primarily driven by a robust contrast between second-person and third-person references rather than by differences involving first-person expressions. This pattern indicates that the influence of person reference on emotion recognition reflects a qualitative distinction between listener-directed and non-listener-directed speech. Second-person utterances directly address the listener, establishing immediate interpersonal engagement that may enhance social relevance and emotional salience, thereby facilitating distinct perceptual or cognitive processing. In contrast, third-person utterances describe external agents and are more narrative, potentially eliciting weaker interpersonal involvement.

To assess the impact of subject pronoun type on emotional intensity ratings, we conducted a 6 (Subject Pronoun Type: I, We, singular You, plural You, third-person singular, They) × 7 (Emotion Category: neutrality, sadness, fear, anger, disgust, surprise, happiness) mixed-effects ANOVA on the perceived intensity ratings, which exhibited a normal distribution. The rater group was included as a random factor. The main effect of emotion category (F6,6622 = 2301.54, p < 0.001, η2p = 0.676) and the interaction effect (F30,6617 = 3.27, p < 0.001, η2p = 0.01) were significant; the main effect of subject pronoun type was not (F5,6619 = 1.40, p = 0.10, η2p = 0.001) (lower panel of Fig. 2). Planned comparisons between first-person, second-person, and third-person utterances (1st vs. 2nd, 1st vs. 3rd, and 2nd vs. 3rd) were non-significant for the neutral, sadness, and surprise categories (all p values > 0.19). In contrast, comparisons between first-person and second-person utterances (1st vs. 2nd) showed significant differences for the other four emotions (anger: estimate = −0.163, SE = 0.031, t = −5.18, p < 0.001; disgust: estimate = 0.11, SE = 0.04, t = 2.90, p = 0.004; fear: estimate = −0.09, SE = 0.04, t = −2.20, p = 0.028; happiness: estimate = 0.16, SE = 0.04, t = 4.12, p = 0.0001), and comparisons between first-person and third-person utterances (1st vs. 3rd) showed significant differences for fear (estimate = −0.16, SE = 0.04, t = −3.18, p = 0.001) and happiness (estimate = 0.15, SE = 0.04, t = 3.48, p = 0.0005). All remaining comparisons were non-significant (all p values > 0.08). These findings suggest that pronoun effects were primarily driven by contrasts involving first-person utterances. Second-person expressions were perceived as more intense for anger and fear, whereas first-person expressions were rated as more intense for disgust and happiness. This pattern may reflect differences between listener-directed and self-referential framing, indicating that pronoun type modulates perceived emotional intensity in an emotion-specific manner.

Effect of gender on emotion recognition

We conducted a 2 (speaker gender: female vs. male) × 2 (rater gender: female vs. male) two-way non-parametric mixed-effects ANOVA to investigate whether speaker and rater gender affected recognition accuracy. Figure 3 presents the interaction effects for the overall corpus, neutral utterances, and each of the six basic emotions. The main effect of rater gender (F1 = 356.48, p < 0.001; η2p = 0.027) and the interaction effect (F1 = 5.66, p = 0.017; η2p = 0.0004) were significant; the main effect of speaker gender was not (F1 = 1.37, p = 0.307). Overall, female listeners outperformed male listeners in speech emotion identification. These findings are consistent with previous research showing that women recognize52,53 and express54 emotion more accurately than men.

Fig. 3.

Fig. 3

Interaction effect of the gender of speaker and rater on the recognition rate.

Although the interaction between speaker gender and rater gender reached statistical significance, the associated effect size was extremely small, and follow-up simple effects analyses did not reveal significant differences within individual levels of either factor. This pattern suggests that the interaction reflects a subtle shift in the relative pattern of effects rather than robust differences in any single comparison: the advantage of female listeners over male listeners was slightly larger for speech produced by female speakers than for speech produced by male speakers. To further clarify the source of this interaction, separate non-parametric mixed-effects ANOVAs were conducted for each emotion category. These analyses revealed that the overall interaction was primarily driven by happiness (F1,1552 = 15.02, p = 0.0001, η2p = 0.001), anger (F1,2238 = 6.04, p = 0.014, η2p = 0.003), and disgust (F1,1472 = 11.25, p = 0.0008, η2p = 0.008). Although statistically significant, the corresponding effect sizes were small, suggesting that the interaction reflects subtle modulation rather than substantial differences in recognition performance.

Annotation consensus

Inter-rater reliability was assessed using Fleiss’ kappa (κ), which quantifies agreement on categorical emotion labels beyond chance. As shown in Table 6, the Fleiss’ κ values computed across independent listener groups indicated moderate to substantial agreement in categorical emotion judgments (κ = 0.44–0.66), reflecting overall satisfactory reliability of the labelling process55.
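
A minimal sketch of how such a kappa can be computed in Python with statsmodels, assuming a ratings matrix in which each row is one utterance and each column holds one rater's categorical label (1–7). The input layout and the random stand-in data are assumptions for illustration, not the authors' pipeline.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # ratings: shape (n_utterances, n_raters), integer emotion labels 1-7 per rater.
    # A small random example stands in for one listener group's data.
    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 8, size=(100, 40))

    # aggregate_raters converts rater-wise labels into per-item category counts.
    counts, categories = aggregate_raters(ratings)
    kappa = fleiss_kappa(counts, method="fleiss")
    print(f"Fleiss' kappa = {kappa:.2f}")  # near 0 here because the example labels are random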

Table 6.

Summary of Inter-Rater Reliability Indices for Emotion Recognition Performance.

Actor Fleiss Kappa for Rating Groups
Group 1 Group 2 Group 3
Actor 1, Male 0.66 0.58 0.48
Actor 2, Male 0.58 0.60 0.54
Actor 3, Male 0.56 0.51 0.50
Actor 4, Female 0.50 0.47 0.47
Actor 5, Female 0.50 0.44 0.62
Actor 6, Female 0.57 0.63 0.58

Acoustic validation

Following a previous emotional corpus study14, we employed the Parselmouth (Praat in Python) package56 to extract twelve acoustic features: duration (seconds); F0 mean (Hz), F0 standard deviation (SD), F0 minimum, and F0 maximum; harmonics-to-noise ratio (HNR); local jitter; local shimmer; sound intensity (dB); root-mean-square (RMS) amplitude; spectral center of gravity (COG, Hz); and spectral spread (Hz). These features were used to assess the predictive power of acoustic parameters for listeners’ recognition of each emotion category. The mean acoustic values for each emotion category, averaged across speakers, are presented in Table 7.
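
The following Python sketch shows how the listed features can be extracted with Parselmouth using standard Praat commands. The analysis parameters (pitch floor and ceiling, jitter and shimmer windows) are common defaults and may differ from the settings used to produce Table 7; the example path is hypothetical.

    import parselmouth
    from parselmouth.praat import call

    def extract_features(path, f0_floor=75, f0_ceiling=600):
        """Extract the twelve acoustic features listed above from one WAV file."""
        snd = parselmouth.Sound(path)
        pitch = call(snd, "To Pitch", 0.0, f0_floor, f0_ceiling)
        point_process = call(snd, "To PointProcess (periodic, cc)", f0_floor, f0_ceiling)
        harmonicity = call(snd, "To Harmonicity (cc)", 0.01, f0_floor, 0.1, 1.0)
        intensity = call(snd, "To Intensity", f0_floor, 0.0, "yes")
        spectrum = call(snd, "To Spectrum", "yes")
        return {
            "duration_s": call(snd, "Get total duration"),
            "f0_mean_hz": call(pitch, "Get mean", 0, 0, "Hertz"),
            "f0_sd_hz": call(pitch, "Get standard deviation", 0, 0, "Hertz"),
            "f0_min_hz": call(pitch, "Get minimum", 0, 0, "Hertz", "Parabolic"),
            "f0_max_hz": call(pitch, "Get maximum", 0, 0, "Hertz", "Parabolic"),
            "hnr_db": call(harmonicity, "Get mean", 0, 0),
            "jitter_local": call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3),
            "shimmer_local": call([snd, point_process], "Get shimmer (local)",
                                  0, 0, 0.0001, 0.02, 1.3, 1.6),
            "intensity_db": call(intensity, "Get mean", 0, 0, "energy"),
            "rms_amplitude": call(snd, "Get root-mean-square", 0, 0),
            "spectral_cog_hz": call(spectrum, "Get centre of gravity", 2),
            "spectral_spread_hz": call(spectrum, "Get standard deviation", 2),
        }

    # features = extract_features("audio/13171.wav")  # hypothetical path to one corpus file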

Table 7.

Acoustic Features of Valid Emotional Speech Across Emotion Categories.

Emotion Duration (s) F0 mean (Hz) F0 SD (Hz) F0 max (Hz) F0 min (Hz) HNR (dB) Jitter (local) Shimmer (local) Intensity (dB) RMS amplitude Spectral COG (Hz) Spectral spread (Hz)
Neutrality 1.17 166.79 31.04 261.71 110.82 10.02 0.03 0.10 72.94 0.15 3601.55 4258.58
Happiness 1.13 272.06 61.09 420.45 161.23 11.45 0.02 0.10 73.84 0.16 3299.57 3773.32
Anger 1.07 269.11 63.38 441.58 157.89 9.91 0.02 0.12 72.16 0.14 3251.42 3615.67
Fear 1.23 215.97 28.71 297.34 160.23 9.73 0.03 0.11 74.15 0.17 4277.70 4635.36
Sadness 1.65 233.99 45.42 404.06 142.43 12.12 0.02 0.11 73.08 0.16 3433.03 4035.92
Disgust 1.20 157.87 28.79 252.81 105.19 10.37 0.02 0.11 73.15 0.14 3067.38 3930.48
Surprise 1.14 244.92 63.88 421.68 148.53 11.18 0.02 0.11 74.48 0.17 3397.82 4002.25

COG: center of gravity; HNR: harmonics-to-noise ratio; RMS: root mean square.

We conducted simultaneous multiple regression analyses for each emotion category (Gong et al., 2023; Lima et al., 2013) to examine the direct associations between acoustic features and listeners’ recognition accuracy of emotional expressions. The dependent variable was the average recognition rate for each vocalization, and the independent variables were the extracted acoustic features. Table 8 summarizes the main findings, including standardized regression coefficients (β) and adjusted variance explained. All regression models were significant, indicating that recognition of each emotion category was influenced by multiple acoustic attributes. Specifically, recognition of neutral utterances was predicted by longer duration, lower spectral COG, and smaller spectral spread. Sadness recognition was associated with shorter duration, higher F0 variability, higher HNR, greater local shimmer, and lower F0 minimum and spectral spread. Fear recognition was predicted by longer duration, higher F0 mean, greater spectral spread, and lower HNR and spectral COG. Anger recognition was linked to longer duration, higher F0 mean, and increased local jitter. Disgust recognition was predicted by lower local shimmer. Surprise recognition was associated with longer duration, lower F0 mean, higher HNR and RMS amplitude, and wider spectral spread. Happiness recognition was predicted by higher F0 maximum and lower local jitter.
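
A hedged sketch of this analysis in Python: z-score the acoustic predictors and the per-utterance recognition rate, fit an ordinary least-squares model per emotion, and apply FDR correction across predictors. The column names are assumptions about how the metadata file might be organized, and the authors' exact statistical software is not specified in the text.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.multitest import multipletests

    FEATURES = ["duration", "f0_mean", "f0_sd", "f0_min", "f0_max", "hnr",
                "jitter_local", "shimmer_local", "rms_amplitude",
                "spectral_cog", "spectral_spread"]

    def standardized_regression(df, dv="recognition_rate"):
        """OLS of recognition rate on z-scored acoustic features; the coefficients are
        standardized betas because both predictors and outcome are z-scored."""
        cols = FEATURES + [dv]
        z = (df[cols] - df[cols].mean()) / df[cols].std()
        X = sm.add_constant(z[FEATURES])
        fit = sm.OLS(z[dv], X).fit()
        reject, p_fdr, _, _ = multipletests(fit.pvalues[FEATURES], method="fdr_bh")
        return pd.DataFrame({"beta": fit.params[FEATURES],
                             "p_fdr": p_fdr,
                             "significant": reject})

    # Hypothetical usage: one model per target emotion
    # for emotion, sub in metadata.groupby("emotion"):
    #     print(emotion, standardized_regression(sub), sep="\n")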

Table 8.

Multiple Regression Results Predicting Speech Emotion Recognition Rates from Acoustic Features.

Emotion Acoustic feature
Duration F0 Mean F0 SD F0 Min F0 Max HNR (dB) Jitter (local) Shimmer (local) RMS Energy SpectralCOG (Hz) SpectralSpread (Hz)
Neutral 0.17*** −0.01 −0.04 −0.06 0.03 −0.10 0.04 0.03 0.07 −0.19*** −0.58***
Happiness 0.06 0.06 0.05 0.05 0.17*** −0.12 −0.15*** 0.07 −0.02 −0.02 −0.13
Anger 0.14*** 0.16** −0.02 −0.03 0.01 −0.04 0.15*** −0.05 −0.04 0.06 0.04
Fear 0.23*** 0.65*** −0.12 0.04 0.07 −0.27*** −0.03 −0.01 −0.03 −0.24*** 0.23***
Sadness −0.14*** 0.01 0.13** −0.17** 0.07 0.33*** −0.02 0.19*** 0.02 0.27*** −0.32***
Disgust 0.12 −0.06 −0.12 −0.05 −0.02 0.13 0.02 −0.13* −0.01 −0.08 −0.23
Surprise 0.12* −0.12 0.03 0.01 0.01 0.30*** 0.01 0.08 0.13* −0.01 0.15*

Values represent beta weights (Standardized Coefficients); COG: center of gravity, HNR: harmonics-to-noise ratio, RMS: root mean square.

*p < 0.05; **p < 0.01; ***p < 0.001 (FDR corrected); Bold typeface indicates a significant association after the false discovery rate (FDR) correction.

Usage Notes

The MCAE-SPPS dataset provides a comprehensive resource for investigating the interplay between language, emotion, and cognition in auditory speech. It enables researchers to examine how subject personal pronouns (e.g., “I,” “You,” “He/They”) modulate the perception and recognition of emotional expressions. Potential applications include:

  1. Theoretical research: Studying pronoun-driven effects on emotional salience, self- vs. other-relevance, and listener perspective in emotion perception.

  2. Social cognition and neuroscience: Exploring how pronouns influence empathy, perspective-taking, social bonding, or neural activation patterns during emotional speech processing.

  3. Computational modeling: Improving emotion recognition algorithms by integrating pronoun-specific acoustic cues.

  4. Clinical and practical applications: Designing interventions to enhance emotional awareness or empathy, and informing AI systems such as virtual assistants or chatbots.

Users should note that the dataset reflects recordings from a limited number of speakers and emotions; pronoun and emotion distributions may influence recognition patterns. Care should be taken when generalizing findings to broader populations or different languages.

Uncertainties and limitations

The corpus focuses on six basic emotions (sadness, anger, fear, disgust, surprise, and happiness) together with neutral expressions, and therefore does not cover more complex or socially nuanced emotions such as pride, shyness, or shame. As with many acted auditory emotional databases, the recordings may contain exaggerated or less naturalistic expressions, which could limit ecological validity. In addition, the forced-choice recognition paradigm requires listeners to select a single emotion category even when they are uncertain, potentially inflating recognition accuracy and distorting confusion patterns. Future work may address this issue by incorporating alternative response options (e.g., “Cannot recognize the emotion”) to better capture perceptual uncertainty. Moreover, each listener evaluated only a subset of utterances from a single speaker, which may introduce inter-speaker variability. To address these sources of heterogeneity, we employed nonparametric mixed-effects ANOVA models that treated both listener and actor as random effects.

Acknowledgements

We would like to thank the actors and raters for their participation in the study. This work was supported by The National Natural Science Foundation of China General Project (grant number: 32271138), the General Program of the National Social Science Foundation (grant number: 21BGL229), and the Beijing Natural Science Foundation (grant number: 7202086).

Author contributions

M. Li conducted data collection, preprocessing, data analysis, visualization, results interpretation, and first-draft writing; A. Zhou conducted data collection and preprocessing; H. Yan, Q. Li, and C. Ma performed data preprocessing. C. Wu: formulated research ideas, project design and supervision, data analysis and visualization, results interpretation, and first draft writing and editing. All authors approved the final version of the manuscript.

Data availability

The corpus audio files, metadata, and corresponding rating data are publicly available on the Open Science Framework (OSF) at 10.17605/OSF.IO/9JYZC (View-only link: https://osf.io/9jyzc/overview?view_only=088ce4e15a914b939c8bb6bd119c7226). The repository includes the full set of validated audio recordings, rater demographic information, and comprehensive metadata files describing emotion labels, recognition accuracy, perceived intensity ratings, sentence content, and extracted acoustic features.

Code availability

The codes are publicly available on OSF at 10.17605/OSF.IO/9JYZC46. Please refer to the “Codes_and_RawData” folder for the data organization and analysis scripts.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Landman, L. L. & van Steenbergen, H. Emotion and conflict adaptation: the role of phasic arousal and self-relevance. Cogn Emot 34, 1083–1096, 10.1080/02699931.2020.1722615 (2020).
  2. Hert, R., Järvikivi, J. & Arnhold, A. The Importance of Linguistic Factors: He Likes Subject Referents. Cogn Sci 48, e13436, 10.1111/cogs.13436 (2024).
  3. Liebenthal, E., Silbersweig, D. A. & Stern, E. The Language, Tone and Prosody of Emotions: Neural Substrates and Dynamics of Spoken-Word Emotion Perception. Front Neurosci 10, 506, 10.3389/fnins.2016.00506 (2016).
  4. Yang, C. et al. Emotion-dependent language featuring depression. J Behav Ther Exp Psychiatry 81, 101883, 10.1016/j.jbtep.2023.101883 (2023).
  5. Homan, S. et al. Linguistic features of suicidal thoughts and behaviors: A systematic review. Clin Psychol Rev 95, 102161, 10.1016/j.cpr.2022.102161 (2022).
  6. Iyer, R., Nedeljkovic, M. & Meyer, D. Using Voice Biomarkers to Classify Suicide Risk in Adult Telehealth Callers: Retrospective Observational Study. JMIR Ment Health 9, e39807, 10.2196/39807 (2022).
  7. Kappen, M., Vanderhasselt, M. A. & Slavich, G. M. Speech as a promising biosignal in precision psychiatry. Neurosci Biobehav Rev 148, 105121, 10.1016/j.neubiorev.2023.105121 (2023).
  8. Abedin, E. et al. Exploring intellectual humility through the lens of artificial intelligence: Top terms, features and a predictive model. Acta Psychol (Amst) 238, 103979, 10.1016/j.actpsy.2023.103979 (2023).
  9. Bae, Y. J., Shim, M. & Lee, W. H. Schizophrenia Detection Using Machine Learning Approach from Social Media Content. Sensors (Basel) 21, 10.3390/s21175924 (2021).
  10. Ryu, J. et al. A natural language processing approach reveals first-person pronoun usage and non-fluency as markers of therapeutic alliance in psychotherapy. iScience 26, 106860, 10.1016/j.isci.2023.106860 (2023).
  11. Wang, Q., Wang, M., Yang, Y. & Zhang, X. Multi-modal emotion recognition using EEG and speech signals. Comput Biol Med 149, 105907, 10.1016/j.compbiomed.2022.105907 (2022).
  12. Keshtiari, N., Kuhlmann, M., Eslami, M. & Klann-Delius, G. Recognizing emotional speech in Persian: a validated database of Persian emotional speech (Persian ESD). Behav Res Methods 47, 275–294, 10.3758/s13428-014-0467-x (2015).
  13. Bustamin, A., Rizky, A. M., Warni, E., Areni, I. S. & Indrabayu. IndoWaveSentiment: Indonesian audio dataset for emotion classification. Data in Brief 57, 111138, 10.1016/j.dib.2024.111138 (2024).
  14. Gong, B. et al. The Mandarin Chinese auditory emotions stimulus database: A validated set of Chinese pseudo-sentences. Behav Res Methods 55, 1441–1459, 10.3758/s13428-022-01868-7 (2023).
  15. Costantini, G., Parada-Cabaleiro, E., Casali, D. & Cesarini, V. The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning. Sensors (Basel) 22, 10.3390/s22072461 (2022).
  16. Movaghar, A., Page, D., Saha, K., Rynn, M. & Greenberg, J. Machine learning approach to measurement of criticism: The core dimension of expressed emotion. J Fam Psychol 35, 1007–1015, 10.1037/fam0000906 (2021).
  17. Habets, B., Ye, Z., Jansma, B. M., Heldmann, M. & Münte, T. F. Brain imaging and electrophysiological markers of anaphoric reference during speech production. Neuroscience Research 213, 110–120, 10.1016/j.neures.2025.01.001 (2025).
  18. Lei, J., Zhu, X. & Wang, Y. BAT: Block and token self-attention for speech emotion recognition. Neural Netw 156, 67–80, 10.1016/j.neunet.2022.09.022 (2022).
  19. Kingeski, R., Henning, E. & Paterno, A. S. Fusion of PCA and ICA in Statistical Subset Analysis for Speech Emotion Recognition. Sensors (Basel) 24, 10.3390/s24175704 (2024).
  20. Riad, R. et al. Automated Speech Analysis for Risk Detection of Depression, Anxiety, Insomnia, and Fatigue: Algorithm Development and Validation Study. J Med Internet Res 26, e58572, 10.2196/58572 (2024).
  21. Mobram, S. & Vali, M. Depression detection based on linear and nonlinear speech features in I-vector/SVDA framework. Comput Biol Med 149, 105926, 10.1016/j.compbiomed.2022.105926 (2022).
  22. Darcy, I. & Fontaine, N. M. G. The Hoosier Vocal Emotions Corpus: A validated set of North American English pseudo-words for evaluating emotion processing. Behavior Research Methods 52, 901–917, 10.3758/s13428-019-01288-0 (2020).
  23. Zhou, K., Sisman, B., Liu, R. & Li, H. Emotional voice conversion: Theory, databases and ESD. Speech Communication 137, 1–18, 10.1016/j.specom.2021.11.006 (2022).
  24. Parada-Cabaleiro, E., Costantini, G., Batliner, A., Schmitt, M. & Schuller, B. W. DEMoS: an Italian emotional speech corpus. Language Resources and Evaluation 54, 341–383, 10.1007/s10579-019-09450-y (2020).
  25. Asghar, A., Sohaib, S., Iftikhar, S., Shafi, M. & Fatima, K. An Urdu speech corpus for emotion recognition. PeerJ Comput Sci 8, e954, 10.7717/peerj-cs.954 (2022).
  26. Sultana, S., Rahman, M. S., Selim, M. R. & Iqbal, M. Z. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLoS One 16, e0250173, 10.1371/journal.pone.0250173 (2021).
  27. Paccotacya-Yanque, R. Y. G., Huanca-Anquise, C. A., Escalante-Calcina, J., Ramos-Lovón, W. R. & Cuno-Parari, Á. E. A speech corpus of Quechua Collao for automatic dimensional emotion recognition. Scientific Data 9, 778, 10.1038/s41597-022-01855-9 (2022).
  28. Li, M. et al. The Mandarin Chinese auditory emotions stimulus database: A validated corpus of monosyllabic Chinese characters. Behav Res Methods 57, 89, 10.3758/s13428-025-02607-4 (2025).
  29. Chong, C. S., Davis, C. & Kim, J. A Cantonese Audio-Visual Emotional Speech (CAVES) dataset. Behavior Research Methods 56, 5264–5278, 10.3758/s13428-023-02270-7 (2024).
  30. Busso, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 335–359, 10.1007/s10579-008-9076-6 (2008).
  31. Zhang J., T F., Liu M., Jia H. Design of speech corpus for mandarin text to speech. The Blizzard Challenge 2008 Workshop (2008).
  32. Li, Y., Tao, J., Chao, L., Bao, W. & Liu, Y. CHEAVD: a Chinese natural emotional audio–visual database. Journal of Ambient Intelligence and Humanized Computing 8, 913–924, 10.1007/s12652-016-0406-z (2017).
  33. Zimmermann, J., Wolf, M., Bock, A., Peham, D. & Benecke, C. The way we refer to ourselves reflects how we relate to others: Associations between first-person pronoun use and interpersonal problems. Journal of Research in Personality 47, 218–225, 10.1016/j.jrp.2013.01.008 (2013).
  34. Wang, F. & Karimi, S. This product works well (for me): The impact of first-person singular pronouns on online review helpfulness. Journal of Business Research 104, 283–294, 10.1016/j.jbusres.2019.07.028 (2019).
  35. Stade, E. C., Ungar, L., Eichstaedt, J. C., Sherman, G. & Ruscio, A. M. Depression and anxiety have distinct and overlapping language patterns: Results from a clinical interview. J Psychopathol Clin Sci 132, 972–983, 10.1037/abn0000850 (2023).
  36. Sun, Z., Cao, C. C., Liu, S., Li, Y. & Ma, C. Behavioral consequences of second-person pronouns in written communications between authors and reviewers of scientific papers. Nature Communications 15, 152, 10.1038/s41467-023-44515-1 (2024).
  37. Cruz, R. E., Leonhardt, J. M. & Pezzuti, T. Second Person Pronouns Enhance Consumer Involvement and Brand Attitude. Journal of Interactive Marketing 39, 104–116, 10.1016/j.intmar.2017.05.001 (2017).
  38. Qu, J., Zhou, R., Zou, L., Sun, Y. & Zhao, M. in Human-Computer Interaction. Multimodal and Natural Interaction (ed Kurosu, M.) 234–243 (Springer International Publishing).
  39. Moser, J. S. et al. Third-person self-talk facilitates emotion regulation without engaging cognitive control: Converging evidence from ERP and fMRI. Scientific Reports 7, 4519, 10.1038/s41598-017-04047-3 (2017).
  40. Wallace-Hadrill, S. M. & Kamboj, S. K. The Impact of Perspective Change As a Cognitive Reappraisal Strategy on Affect: A Systematic Review. Front Psychol 7, 1715, 10.3389/fpsyg.2016.01715 (2016).
  41. Orvell, A. et al. Does Distanced Self-Talk Facilitate Emotion Regulation Across a Range of Emotionally Intense Experiences? Clinical Psychological Science 9, 68–78, 10.1177/2167702620951539 (2020).
  42. El Ouardi, L., Yeou, M. & Faroqi-Shah, Y. Neural correlates of pronoun processing: An activation likelihood estimation meta-analysis. Brain Lang 246, 105347, 10.1016/j.bandl.2023.105347 (2023).
  43. Massaeli, F., Bagheri, M. & Power, S. D. EEG-based detection of modality-specific visual and auditory sensory processing. J Neural Eng 20, 10.1088/1741-2552/acb9be (2023).
  44. Devillers, L., Vidrascu, L. & Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Netw 18, 407–422, 10.1016/j.neunet.2005.03.007 (2005).
  45. Lee, M.-H. et al. EAV: EEG-Audio-Video Dataset for Emotion Recognition in Conversational Contexts. Scientific Data 11, 1026, 10.1038/s41597-024-03838-4 (2024).
  46. Li, M. et al. The Mandarin Chinese Auditory Emotions Stimulus Database: A Validated Set of Sentences with A Personal Pronoun as the Subject. OSF, 10.17605/OSF.IO/9JYZC (2026).
  47. De Prisco, M. et al. Differences in facial emotion recognition between bipolar disorder and other clinical populations: A systematic review and meta-analysis. Prog Neuropsychopharmacol Biol Psychiatry 127, 110847, 10.1016/j.pnpbp.2023.110847 (2023).
  48. Tseng, H. H. et al. Facial and prosodic emotion recognition in social anxiety disorder. Cogn Neuropsychiatry 22, 331–345, 10.1080/13546805.2017.1330190 (2017).
  49. Liu, P. & Pell, M. D. Recognizing vocal emotions in Mandarin Chinese: a validated database of Chinese vocal emotional stimuli. Behav Res Methods 44, 1042–1051, 10.3758/s13428-012-0203-3 (2012).
  50. Rosenthal, R. & Rosnow, R. L. Contrast analysis: Focused comparisons in the analysis of variance (1985).
  51. Kuehne, C. C. The Advantages of Using Planned Comparisons over Post Hoc Tests (1993).
  52. Lambrecht, L., Kreifelts, B. & Wildgruber, D. Gender differences in emotion recognition: Impact of sensory modality and emotional category. Cogn Emot 28, 452–469, 10.1080/02699931.2013.837378 (2014).
  53. Collignon, O. et al. Women process multisensory emotion expressions more efficiently than men. Neuropsychologia 48, 220–225, 10.1016/j.neuropsychologia.2009.09.007 (2010).
  54. Filipe, M. G., Branco, P., Frota, S., Castro, S. L. & Vicente, S. G. Affective prosody in European Portuguese: Perceptual and acoustic characterization of one-word utterances. Speech Communication 67, 58–64, 10.1016/j.specom.2014.09.007 (2015).
  55. Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).
  56. Jadoul, Y., Thompson, B. & de Boer, B. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics 71, 1–15, 10.1016/j.wocn.2018.07.001 (2018).
