Proceedings of the National Academy of Sciences of the United States of America
2016 Jan 11;113(4):948–953. doi: 10.1073/pnas.1506552113

Covert digital manipulation of vocal emotion alters speakers’ emotional states in a congruent direction

Jean-Julien Aucouturier a,1, Petter Johansson b,c, Lars Hall b, Rodrigo Segnini d,e, Lolita Mercadié f, Katsumi Watanabe g,h
PMCID: PMC4743803  PMID: 26755584

Significance

We created a digital audio platform to covertly modify the emotional tone of participants’ voices toward happiness, sadness, or fear while they talked. Independent listeners perceived the transformations as natural examples of emotional speech, but the participants remained unaware of the manipulation, indicating that we are not continuously monitoring our own emotional signals. Instead, as a consequence of listening to their altered voices, the emotional state of the participants changed in congruence with the emotion portrayed. This result is the first evidence, to our knowledge, of peripheral feedback on emotional experience in the auditory domain. This finding is of great significance, because the mechanisms behind the production of vocal emotion are virtually unknown.

Keywords: emotion monitoring, vocal feedback, self-perception, digital audio effects, voice emotion

Abstract

Research has shown that people often exert control over their emotions. By modulating expressions, reappraising feelings, and redirecting attention, they can regulate their emotional experience. These findings have contributed to a blurring of the traditional boundaries between cognitive and emotional processes, and it has been suggested that emotional signals are produced in a goal-directed way and monitored for errors like other intentional actions. However, this interesting possibility has never been experimentally tested. To this end, we created a digital audio platform to covertly modify the emotional tone of participants’ voices toward happiness, sadness, or fear while they talked. The results showed that the audio transformations were perceived as natural examples of the intended emotions, but the great majority of the participants, nevertheless, remained unaware that their own voices were being manipulated. This finding indicates that people are not continuously monitoring their own voice to make sure that it meets a predetermined emotional target. Instead, as a consequence of listening to their altered voices, the emotional state of the participants changed in congruence with the emotion portrayed, as measured by both self-report and skin conductance level. This change is the first evidence, to our knowledge, of peripheral feedback effects on emotional experience in the auditory domain. As such, our result reinforces the wider framework of self-perception theory: that we often use the same inferential strategies to understand ourselves as those that we use to understand others.


Over the last few years, tens of thousands of research articles have been published on the topic of emotion regulation, detailing how people try to manage and control emotion and how they labor to suppress expressions, reappraise feelings, and redirect attention in the face of tempting stimuli (1, 2). This kind of blurring of the traditional (antagonistic) boundaries between emotional and cognitive processes has gained more and more influence in the behavioral and neural sciences (3, 4). For example, a recent overview of neuroimaging and electrophysiological studies shows a substantial overlap of error monitoring and emotional processes in the dorsal mediofrontal cortex, lateral prefrontal areas, and anterior insula (5, 6). A consequence of this emerging integrative view is that emotional states and signals should be monitored in the same way as other intentional actions. That is, we ought to be able to commit emotional errors, detect them, and correct them. This assumption is particularly clear in the emotion as interoceptive inference view by Seth (7), which posits a central role for the anterior insular cortex as a comparator that matches top-down predictions against bottom-up prediction errors. However, there is a great need for novel empirical evidence to evaluate the idea of emotional error control, and we are not aware of any experimental tests in this domain.

The best candidate domain for experimentally inducing emotional errors is vocal expression. Vocal signals differ from other types of emotional display in that, after leaving the vocal apparatus and before reentering the auditory system, they exist for a brief moment outside of the body’s sensory circuits. In principle, it should be possible to “catch” a vocal signal in the air, alter its emotional tone, and feed it back to the speaker as if it had been originally spoken this way. Such a manipulation would resemble the paradigm of speech perturbation, in which acoustic properties, like fundamental frequency (F0), are altered in real time and relayed back to the speakers, who are often found to monitor and compensate for the manipulation in their subsequent speech production (8, 9). Thus, would participants detect and correct feedback of a different emotional tone than they actually produced? If so, this behavior would provide novel experimental evidence in support of a dissolution of the cognition–emotion divide. If not, it would provide a unique opportunity to study the effects of peripheral emotional feedback. As hypothesized by James–Lange-type theories of emotion (10–12), participants might then come to believe that the emotional tone was their own and align their feelings with the manipulation.

To this end, we aimed to construct three different audio manipulations that, in real time, make natural-sounding changes to a speaker’s voice in the direction of happiness, sadness, or fear. The manipulations use digital audio processing algorithms to simulate acoustic characteristics that are known components of emotional vocalizations (13, 14).

The happy manipulation modifies the pitch of a speaker’s voice using upshifting and inflection to make it sound more positive (Audio Files S1–S8); it modifies its dynamic range using compression to make it sound more confident and its spectral content using high-pass filtering to make it sound more aroused (Fig. 1 and compare Audio File S1 with Audio File S2 and compare Audio File S5 with Audio File S6). Similarly, the sad manipulation operates on pitch using downshifting and spectral energy using a low-pass filter and a formant shifter (compare Audio File S1 with Audio File S3 and compare Audio File S5 with Audio File S7). The afraid manipulation operates on pitch using both vibrato and inflection (compare Audio File S1 with Audio File S4 and compare Audio File S5 with Audio File S8). The manipulations were implemented using a programmable hardware platform, allowing a latency of only 15 ms. (A low-latency, open-source software version of the voice manipulation is made available with this work at cream.ircam.fr.)

Fig. 1.

Participants listened to themselves while reading, and the emotional tones of their voices were surreptitiously altered in the direction of happiness, sadness, or fear. In the happy condition (shown here), the speaker’s voice is made to sound energized and positive using subtle variations of pitch (pitch-shifting and inflection), dynamic range (compression), and spectral energy (high-pass filter). The changes are introduced gradually from t = 2 to t = 7 min, and the feedback latency is kept constant across conditions at 15 ms. Example audio clips recorded in the experiment are available in Audio Files S5–S8.

First, using three independent groups of Japanese speakers, we determined in a forced choice test that the manipulations were indistinguishable from natural samples of emotional speech (n = 18). Second, we verified that the manipulated samples were correctly associated with the intended emotions, whether these were described with valence–arousal scales (n = 20) or free verbal descriptions (n = 39) (SI Text). Third, to ensure that the manipulations were similarly perceived by these experiments’ target population, we used the free verbal descriptions to construct a set of French adjective scales and let 10 French speakers rate the emotional quality of processed vs. nonprocessed sample voices. The six adjectives used were found to factor into two principal components, best labeled along the dimensions of positivity (happy/optimistic/sad) and tension (unsettled/anxious/relaxed). The three manipulations were perceived as intended: happy increased positivity and decreased tension, sad decreased positivity but did not affect tension, and afraid decreased positivity and increased tension (Fig. 2).

Fig. 2.

Perceived difference in positivity and tension between processed and nonprocessed speech as judged by independent listeners and the post- and prereading changes in positivity and tension in the feedback experiment. Participants reading under manipulated feedback reported emotional changes consistent with the emotional characteristics of the voices that they heard. Error bars represent 95% confidence intervals on the mean. The continuous scale is transformed to increments of 1 from −10 to +10. *Significant differences from zero at P < 0.05.

To determine whether participants would detect the induced emotional errors and measure possible emotional feedback effects of voice, we let participants (n = 112; female: 92) read an excerpt from a short story by Haruki Murakami while hearing their own amplified voice through a noise-cancelling headset. In the neutral control condition, the participants simply read the story from beginning to end. In three experimental conditions, the audio manipulations were gradually applied to the speaker’s voice after 2 min of reading; after 7 min, the participants were hearing their own voice with maximum manipulation strength (Fig. 1). In total, the excerpt took about 12 min to read. The participants were asked to evaluate their emotional state both before and after reading using the same two-factor adjective scales previously used to classify the effects. In addition, we monitored the participants’ autonomic nervous system responses while reading with their tonic skin conductance level (SCL). The participants were then asked a series of increasingly specific questions about their impression of the experiment to determine whether they had consciously detected the manipulation of their voice.

SI Text

Audio Manipulations (Definitions).

A cent is the frequency interval corresponding to 1/100th of a semitone (i.e., the interval between two neighboring keys on a piano is 100 cents). It corresponds to a frequency ratio of 2^(1/1,200).
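As a quick numeric check of this definition, the conversion from cents to a frequency ratio can be written in a few lines of Python (the function name is ours):

```python
def cents_to_ratio(cents: float) -> float:
    """Convert a pitch interval in cents to a frequency ratio (1,200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)

# Shifts used in the manipulations described below:
print(cents_to_ratio(+25))   # happy pitch shift,  ~1.0145
print(cents_to_ratio(-30))   # sad pitch shift,    ~0.9828
print(cents_to_ratio(+100))  # tensed pitch shift, ~1.0595 (one semitone)
```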

Pitch-shifting is the process of multiplying the pitch of the original voice by a constant factor. It was used in emotion-monitoring experiment 1 for the happy manipulation (with a positive shift of +25 cents) and the sad manipulation (with a negative shift of −30 cents) and emotion-monitoring experiment 2 for the tensed manipulation (with a positive shift of +100 cents). It was implemented with the pitch synchronous overlap and add technique (PSOLA).
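For readers who want to approximate this effect offline, the sketch below applies a constant pitch shift specified in cents using librosa. Note that librosa's pitch shifter is phase-vocoder based, not the PSOLA used by the authors, so voice quality will differ slightly; the file names are placeholders.

```python
import librosa
import soundfile as sf

def pitch_shift_cents(path_in: str, path_out: str, cents: float) -> None:
    """Shift the pitch of a recording by a constant offset given in cents.

    librosa uses a phase-vocoder/resampling method rather than PSOLA,
    so this is only an approximation of the effect described in the paper.
    """
    y, sr = librosa.load(path_in, sr=None)                      # keep original sample rate
    y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=cents,
                                          bins_per_octave=1200)  # 1 step = 1 cent
    sf.write(path_out, y_shift, sr)

# e.g. the 'happy' shift of +25 cents (file names are placeholders):
# pitch_shift_cents("neutral.wav", "happy_shifted.wav", +25)
```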

Inflection, a nonstandard term, is defined here as the process of rapidly modifying the pitch at the start of each utterance. It was used in emotion-monitoring experiment 1 for the happy manipulation (with an initial pitch shift of −50 cents decreasing linearly over 400 ms) and the afraid manipulation (with an initial pitch shift of +120 cents and a duration of 150 ms) and emotion-monitoring experiment 2 for the tensed manipulation (with an initial pitch shift of +150 cents and a duration of 150 ms). It was implemented with the PSOLA.
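The inflection contour itself is easy to sketch: a per-sample pitch offset that starts at the stated value at the utterance onset and decays linearly to zero over the stated duration. The sketch below only builds that control envelope; the onset detector and the time-varying pitch shifter that would consume it are assumed and not shown.

```python
import numpy as np

def inflection_envelope(n_samples: int, sr: int, onset_sample: int,
                        initial_cents: float, duration_s: float) -> np.ndarray:
    """Per-sample pitch offset (in cents): starts at `initial_cents` at the
    utterance onset and decays linearly to 0 over `duration_s` seconds."""
    env = np.zeros(n_samples)
    end = min(onset_sample + int(duration_s * sr), n_samples)
    ramp_len = end - onset_sample
    if ramp_len > 0:
        env[onset_sample:end] = np.linspace(initial_cents, 0.0, ramp_len)
    return env

# 'happy' inflection: -50 cents decaying over 400 ms (the onset index is assumed
# to come from a separate voice-activity/onset detector, not shown here)
# env = inflection_envelope(len(y), sr, onset_idx, -50.0, 0.400)
```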

Compression is the process of narrowing a signal’s dynamic range by reducing the level of loud sounds over a certain threshold to make it sound more intense without changing its mean energy. It was used in emotion-monitoring experiment 1 for the happy manipulation and implemented with a standard feedforward architecture.
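A minimal hard-knee feedforward compressor is sketched below for illustration; the published effect used a soft knee and specified its attack/release as a rate (10 dB/s), whereas this sketch uses simple time constants, so those parameter values are assumptions.

```python
import numpy as np

def compress(x: np.ndarray, sr: int, threshold_db: float = -26.0,
             ratio: float = 4.0, attack_s: float = 0.01,
             release_s: float = 0.1) -> np.ndarray:
    """Hard-knee feedforward compressor: follow the signal level in dB with an
    attack/release envelope, then attenuate everything above the threshold by
    (1 - 1/ratio). The paper's effect additionally used a soft knee and
    level-matched output gain, omitted here."""
    eps = 1e-10
    level_db = 20.0 * np.log10(np.abs(x) + eps)

    # one-pole envelope follower on the dB level
    a_att = np.exp(-1.0 / (attack_s * sr))
    a_rel = np.exp(-1.0 / (release_s * sr))
    env = np.empty_like(level_db)
    env[0] = level_db[0]
    for n in range(1, len(x)):
        a = a_att if level_db[n] > env[n - 1] else a_rel
        env[n] = a * env[n - 1] + (1.0 - a) * level_db[n]

    # gain reduction above threshold
    over = np.maximum(env - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)
    return x * 10.0 ** (gain_db / 20.0)
```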

High shelf-filtering is the process of increasing the high-frequency portion of the signal’s distribution of spectral energy, whereas low shelf-filtering is the process of increasing its low-frequency portion. The former was used in emotion-monitoring experiment 1 for the happy manipulation and emotion-monitoring experiment 2 for the tensed manipulation; the latter was used in emotion-monitoring experiment 1 for the sad manipulation. Both were implemented as second-order Butterworth filters.
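A crude way to emulate a high-shelf boost is to add a scaled, Butterworth high-passed copy of the signal back onto itself; this is only an approximation of the second-order shelving filters used in the paper, and the default parameter values are ours.

```python
import numpy as np
from scipy.signal import butter, lfilter

def high_shelf_boost(x: np.ndarray, sr: int, shelf_hz: float = 8000.0,
                     gain_db: float = 10.0) -> np.ndarray:
    """Approximate high-shelf boost: add a scaled high-passed copy of the
    signal above `shelf_hz`. A proper second-order shelving biquad (as used
    in the paper) shapes the transition band differently."""
    b, a = butter(2, shelf_hz / (sr / 2), btype="highpass")
    high = lfilter(b, a, x)
    gain = 10.0 ** (gain_db / 20.0)
    return x + (gain - 1.0) * high
```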

Formant shifting is the process of narrowing or broadening the distance between the peaks in the spectral envelope of vowels. It was used in emotion-monitoring experiment 1 for the sad manipulation and implemented by combining the PSOLA and resampling.

Vibrato is the process of applying a periodic variation (here, with frequency of 8.5 Hz) to the pitch of a voice. It was used in emotion-monitoring experiment 1 for the afraid manipulation and implemented with the PSOLA.
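Vibrato can also be sketched as a sinusoidally modulated fractional delay line, a standard textbook technique; the paper implemented it with PSOLA instead, so the code below is an approximation with our own parameter defaults.

```python
import numpy as np

def vibrato(x: np.ndarray, sr: int, rate_hz: float = 8.5,
            depth_cents: float = 15.0) -> np.ndarray:
    """Pitch vibrato via a sinusoidally modulated delay line (a common
    textbook approach; the paper used PSOLA-based pitch modulation)."""
    n = len(x)
    t = np.arange(n)
    # peak pitch deviation expressed as a frequency-ratio offset
    dev = 2.0 ** (depth_cents / 1200.0) - 1.0
    # delay amplitude (in samples) that yields that deviation at rate_hz
    amp = dev * sr / (2 * np.pi * rate_hz)
    base = amp + 1.0                           # keep the read position causal
    delay = base + amp * np.sin(2 * np.pi * rate_hz * t / sr)
    read_pos = np.clip(t - delay, 0, n - 1)
    return np.interp(read_pos, t, x)           # linear-interpolated fractional delay
```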

Discrimination of Naturalness of Audio Manipulations (Pilot Experiment).

We recorded utterances of the same sentence spoken by 10 different speakers (male: 5) who were all relatively young (M = 25.2) native Japanese speakers. The sentence was an extract in the Japanese language from the short story “The Second Bakery Attack” by writer Haruki Murakami (the same source as the text used for emotion-monitoring experiments 1 and 2): “She was probably trying to do both at the same time. I thought I had some idea how she felt.” Per speaker, we collected four utterances spoken in an emotionally neutral tone and three utterances spoken with emotion (primed with a selection of pictures representing anger, happiness, fear, etc.). All recordings were sampled at 44.1 kHz and normalized at zero mean and a maximum amplitude of −3 dB. We then selected at random three of four neutral utterances for each speaker and processed each by one of three audio effects of emotion-monitoring experiment 1 (happy, sad, or afraid). This procedure resulted in seven stimuli per speaker: one neutral, three natural emotional variations, and three artificial emotional variations obtained by processing three different neutral utterances. We then asked n = 18 independent listeners (male: 15) from the same population as the speakers to judge whether these stimuli were natural or artificial (similar to a Turing test). In each trial, participants were presented with a reference recording (naturally neutral) and six target recordings (three naturally emotional and three artificially emotional variants of the reference recording from the same speaker). Hit rates were calculated over all trials for each participant; 52% (SD = 14%) of the natural nonmodified recordings were rated as artificial. The hit rate was only 42.1% (SD = 22) for happy-, 42.1% (SD = 16) for sad-, and 46.3% (SD = 21) for afraid-modified voices. These distributions did not differ from chance level (50%) for neutral [t(17) = −0.56, P = 0.57], afraid [t(17) = −0.72, P = 0.53], and happy [t(17) = −1.52, P = 0.14] and even approached significance (less detection than chance) for sad [t(17) = −2.09, P = 0.06].
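The chance-level comparisons above are one-sample t tests of per-listener hit rates against 50%; a minimal sketch follows, with placeholder hit rates rather than the study's actual data.

```python
import numpy as np
from scipy import stats

# Per-listener hit rates for one condition (proportion of manipulated trials
# judged "artificial"). The values below are illustrative placeholders only;
# the real per-participant data are not reproduced in the paper.
hit_rates = np.array([0.33, 0.50, 0.17, 0.50, 0.67, 0.33, 0.50, 0.33,
                      0.50, 0.17, 0.50, 0.67, 0.33, 0.50, 0.33, 0.50,
                      0.67, 0.33])

t, p = stats.ttest_1samp(hit_rates, popmean=0.5)   # test against chance (50%)
print(f"t({len(hit_rates) - 1}) = {t:.2f}, p = {p:.3f}")
```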

Valence, Arousal, and Dominance Ratings of Audio Manipulations (Pilot Experiment).

We recorded utterances of the same sentence as above spoken in a neutral tone by 16 (male: 8) relatively young (M = 20.1) native Japanese speakers. All recordings were sampled at 44.1 kHz, were normalized at zero mean and a maximum amplitude of −3 dB, and had a similar duration (M = 6.89). Each recording was then processed by each of three audio effects of emotion-monitoring experiment 1 (happy, sad, or afraid), resulting in 16 trials of 4 stimuli per speaker (1 neutral and 3 processed): 64 audio stimuli in total. We then asked n = 20 independent listeners (male: 20) from the same population to judge the emotional content of the voices using the SAM test methodology (47). In each trial, participants were tasked with placing all four stimuli on each of three continuous scales (from one to five) coding for valence, arousal, and dominance (hence, assessing not only their absolute position but also, their relative positions to one another). Participants were specifically instructed to report their perception of how they thought the speaker feels (“John really sounds depressed today”) and not how they felt (“I’m worried that John should sound so depressed”). They were also instructed to disregard the lexical content of the sentence with which they were presented (“If a sad sentence is spoken with a happy voice, please report about the happy emotion, not the sad one”).

Valence, arousal, and dominance scores for processed sentences in each trial were normalized by subtracting the scores of the corresponding neutral version. Normalized scores were then averaged over all trials within each participant (with 19 missing trials because of technical problems). Normalized scores of manipulated speech differed from zero (i.e., from nonmanipulated speech) for happy voices [multivariate T2 = 20.15, F(3,17) = 6.0, P < 0.005], which scored with less arousal [M = −0.27, t(19) = −2.89, P = 0.009] and less dominance [M = −0.4, t(19) = −4.35, P = 0.0003] but no less valence [M = −0.02, t(19) = −0.28, P = 0.77] than neutral; sad voices [multivariate T2 = 41.8, F(3,17) = 12.48, P < 0.00015], which scored with less valence [M = −0.78, t(19) = −6.44, P = 0.000004], less arousal [M = −0.71, t(19) = −4.97, P = 0.00008], and less dominance [M = −0.85, t(19) = −5.16, P = 0.00005] than neutral; and afraid voices [multivariate T2 = 42.59, F(3,17) = 12.70, P < 0.00013], with less valence [M = −1.6, t(19) = −6.2, P = 0.000005] and less dominance [M = −1.14, t(19) = −4.68, P = 0.0001] but no less arousal [M = −0.45, t(19) = −1.39, P = 0.17] than neutral. In addition, the manipulations were perceived to be distinct from one another [multivariate F(6,14) = 10.0, Wilk’s Λ = 0.189, P = 0.00022] on both normalized valence (sad < afraid < happy; Fisher LSD, P < 0.05 for all pairs) and normalized dominance (afraid < sad < happy; Fisher LSD, P < 0.05 for afraid–happy and sad–happy). On normalized arousal, effects also ranked in the predicted direction (sad < afraid < happy) but not significantly so (Fisher LSD, P > 0.05 for all pairs).

Verbal Descriptions of Audio Manipulations (Pilot Experiment).

We used a subset of the stimuli used above: the same Japanese sentence spoken by 13 different speakers (seven female) who were all relatively young (M = 19.6) undergraduate students. Each recording was processed by each of three audio effects of emotion-monitoring experiment 1 (happy, sad, or afraid), resulting in four voices per speaker. Voices were then organized in pairs of stimuli consisting of one neutral and one processed voice from each speaker, leaving 3 pairs per speaker and a total of 39 different pairs. We then asked n = 39 (male: 32) independent listeners from the same population to produce verbal descriptions of how the modified voices sounded compared with the neutral corresponding voice. Each participant listened to 13 of 39 constructed pairs, and therefore, they rated each speaker exactly once (in one of three effects randomly). For each pair, participants were tasked to use between zero and three adjectives to describe the emotional difference between the modified and neutral voices. After discarding trials with no answer, in total, 380 words were collected in 143 happy trials (M = 2.6 words per trial), 395 words were collected in 144 sad trials (M = 2.7 words per trial), and 375 words were collected in 148 afraid trials (M = 2.5 words per trial). Voices modified with the happy effect were most commonly described as calmer (25% of trials), more cheerful (16%), childish (16%), happier (12%), darker (10%), and more friendly (8%) compared with neutral. Voices modified with the sad effect were most commonly described as darker (28%), calmer (19%), sadder (17%), more dispirited (12%), tired (9%), and adult-like (8%) compared with neutral. Voices modified with the afraid effect were most commonly described as sadder (42%), about to cry (31%), more tensed (16%), darker (14%), shivering (11%), and frightened (8%) compared with neutral.

Acoustical Analysis of Speech Produced Under Emotional Feedback (Emotion-Monitoring Experiments 1 and 2).

Emotion-monitoring experiment 1.

To test for acoustical compensation, we analyzed the nonmanipulated (as said) and manipulated (as heard) speech of the nondetecting female participants of experiment 1 (happy: 17, sad: 20, afraid: 15, and control: 18). Participants were restricted to females because of the strong sexual dimorphism in human voice pitch and because females formed the larger subgroup of participants (female: 92 and male: 20). The Praat software (PRAAT, v.5.4.07; www.praat.org/) was used to extract estimates of mean F0, jitter (local, absolute), shimmer (local), and breathiness [harmonic-to-noise ratio (HNR)] on all successive 1-s windows. All estimates were then averaged over 1-min windows and normalized with respect to t = 2 min (before the effect ramp).
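A rough Python equivalent of this windowing and normalization step, using the parselmouth interface to Praat for the F0 track, is sketched below; the exact Praat analysis settings, the additional jitter/shimmer/HNR calls, and the baseline indexing are our assumptions.

```python
import numpy as np
import parselmouth  # praat-parselmouth, a Python interface to Praat

def f0_per_minute_normalized(wav_path: str) -> np.ndarray:
    """Mean F0 per successive 1-min window, expressed relative to the window
    ending at t = 2 min (just before the effect ramp). Jitter, shimmer, and
    HNR would follow the same windowing logic but require extra Praat calls,
    omitted here for brevity."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=1.0)          # roughly one estimate per second
    f0 = pitch.selected_array["frequency"]
    f0 = np.where(f0 == 0, np.nan, f0)           # unvoiced frames -> NaN
    n_minutes = len(f0) // 60
    per_min = np.array([np.nanmean(f0[i * 60:(i + 1) * 60])
                        for i in range(n_minutes)])
    return per_min - per_min[1]                  # window 1 ends at t = 2 min (assumed)
```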

To test whether these measures were sensitive to the manipulation, we computed the instantaneous difference between the characteristics of manipulated and nonmanipulated speech. Successive pitch measures were not in significant interaction with condition [F(18,360) = 0.40, P = 0.98]. In other words, pitch alterations were so small (a 3-Hz increase in happy and a 3.5-Hz decrease in sad) that they were not detectable given the participants’ natural variabilities in pitch over the course of reading (M = 207, SD = 56 Hz in the control condition). In contrast to pitch, jitter, shimmer, and HNR were all in significant interaction with the condition, indicating that these three measures were sensitive to all or part of the acoustic alterations made by the manipulations [jitter: F(18,366) = 10.0, P = 0.00000, +10-Hz increase in happy, +30-Hz increase in sad, and +40-Hz increase in afraid; shimmer: F(18,360) = 4.68, P = 0.00000, −0.3% decrease in happy and +0.5% increase in sad; HNR: F(18,390) = 6.77, P = 0.00000, −0.5-dB decrease in happy, −0.4-dB decrease in sad, and −0.9-dB decrease in afraid].

To test whether participants compensated for these alterations, we conducted the same analysis on the participants’ nonmanipulated speech. None of the measures showed significant interaction with the three manipulations: mean pitch [F(12,282) = 0.76, P = 0.69], jitter [F(12,288) = 0.69, P = 0.76], shimmer [F(12,276) = 0.62, P = 0.82], and HNR [F(12,288) = 0.83, P = 0.61] of speech produced under manipulated feedback did not show any alteration, suggesting that participants did not compensate for or adapt to the manipulations.

Emotion-monitoring experiment 2.

The same methodology was used to test for possible compensation for the tensed manipulation used in experiment 2.

To test whether pitch, jitter, shimmer, and HNR were sensitive to the manipulation, we computed the instantaneous difference between the characteristics of manipulated and nonmanipulated speech of the all-female group of nondetecting participants (tensed: 27 and control: 39). This time, successive measures were in significant interaction with condition for mean pitch [F(6,354) = 5.88, P = 0.0000; a 43-cent observed increase in tensed], mean jitter [F(6,354) = 8.9, P = 0.0000; a 15-Hz decrease in tensed], mean shimmer [F(6,354) = 4.4, P = 0.0003; a 0.6% increase in tensed], and mean HNR [F(6,354) = 8.3, P = 0.0000; a −0.3-dB decrease in tensed]. To test whether participants compensated for these alterations, we conducted the same analysis on the participants’ nonmanipulated speech. As before, the successive measures showed no interaction with condition: mean pitch [F(6,354) = 0.459, P = 0.84], mean jitter [F(6,354) = 0.42, P = 0.86], mean shimmer [F(6,354) = 0.57, P = 0.75], and mean HNR [F(6,354) = 0.62, P = 0.71] of speech produced under manipulated feedback showed no alterations compared with those produced in the control condition (Fig. S1).

Fig. S1.

Evolution of vocal pitch in manipulated and nonmanipulated speech in the second emotion-monitoring experiment. Participants’ voices were recorded during feedback in two separate versions: nonmanipulated produced voices (what they said) and manipulated feedback voices (what they heard). Vocal pitch was extracted using Praat and averaged over successive 1-min windows in both versions. (A) The manipulation was associated with a progressively increasing difference in pitch between what is said and what is heard (red), whereas there was no such difference in the control condition (blue). (B) There was no difference in the evolution of produced pitch between the experimental (red) and control (blue) groups.

Note in the above that the discrepancy between the manipulated increase (+100 cents) and the increase captured by the analysis (+43 cents) is caused by imprecisions of the pitch estimation algorithms at the analysis stage. The same imprecisions also account for the nonzero differences between manipulated and nonmanipulated pitches in the control condition, which are seen in Fig. S1.

Emotional Changes in Detecting Participants (Emotion-Monitoring Experiments 1 and 2).

Emotion-monitoring experiment 1.

The n = 16 detecting participants of experiment 1 did not start the experiment in a significantly different emotional state than the n = 93 nondetecting participants [multivariate T2 = 1.00, F(2,106) = 0.49, P = 0.60].

The n = 16 detecting participants in conditions happy, sad, and afraid reported no significant change of emotion [repeated-measure multivariate analysis of variance (rMANOVA): F(4,24) = 0.80, Wilk’s Λ = 0.77, P = 0.53, αBonferroni,1|4 = 0.0125]. Change of positivity was M = +0.99 in happy, M = −0.6 in sad, and M = −0.7 in afraid. Change of tension was M = −1.3 in happy, M = +4.7 in sad, and M = +0.04 in afraid.

The evolution of the SCL of detecting participants from t = 3 to t = 9 was not in significant interaction with the condition [repeated-measure analysis of variance (rANOVA): F(12,78) = 0.45, P = 0.93, αBonferroni,2|4 = 0.016]. SCL in the detecting group increased more for happy (+3%) than that in the nondetecting group (+0.6%), decreased comparably for sad (detected: −1.8% and nondetected: −1.6%), and decreased less for afraid (detected: −0.02% and nondetected: −2.4%). The interaction of SCL evolution with detection (yes or no) and condition was not significant: F(10,360) = 0.41, P = 0.93.

Emotion-monitoring experiment 2.

The n = 8 detecting participants of experiment 2 did not start the experiment in a significantly different emotional state than the nondetecting participants [multivariate T2 = 3.47, F(2,50) = 1.70, P = 0.19].

The n = 8 detecting participants in condition tensed reported a significant change of emotion [multivariate T2 = 12.8, F(2,6) = 5.4, P = 0.044, αBonferroni,1|1 = 0.05], with more tension [t(7) = 3.54, P = 0.009] but similar positivity [t(7) = −1.21, P = 0.26] compared with control.

Because there was no detection in the control group, we could not test for interaction of SCL with condition. The detecting participants’ SCLs in the tensed group increased by 0.4% from t = 3 to t = 9 min, whereas they stayed constant for nondetecting participants (−0.01%). The interaction of SCL evolution with detection (yes or no) was significant [F(6,306) = 4.96, P = 0.00007].

Additional Measures Not Discussed in the Text (Emotion-Monitoring Experiment 1).

POMS.

In addition to pre- and postreading emotional self-rating with adjective scales as described in the text, participants completed a single postreading POMS questionnaire [French-translated version of the abridged 65-item POMS (48)]. For analysis, responses were aggregated into six mood constructs originally hypothesized by the model (tension–anxiety, depression–dejection, anger–hostility, vigor–activity, fatigue–inertia, and confusion–bewilderment). The POMS measure behaved consistently with the main adjective scale measure. All six POMS constructs regressed significantly on the postreading scores of positivity and tension obtained from the scales [tension on tension: F(1,90) = 35.8, P = 0.000000; depression on positivity: F(1,90) = 18.5, P = 0.00004; anger on positivity: F(1,90) = 6.4, P = 0.013; vigor on positivity: F(1,90) = 54.8, P = 0.000000; fatigue on positivity: F(1,90) = 4.6, P = 0.03; and confusion on tension: F(1,90) = 15.4, P = 0.0001], thus reinforcing their intermethod validity. However, being a posttest measure only, the POMS is less sensitive than the scales and thus, failed to capture the emotional feedback effect [multivariate analysis of variance (MANOVA): F(18,238.07) = 1.02, Wilk’s Λ = 0.81, P = 0.43, αBonferroni,3|4 = 0.025].

Emotional Stroop.

Before and after the reading, participants completed a version of the emotional Stroop paradigm (49). Eight French emotional words congruent with each of four emotional conditions were selected using affective norms (50) while controlling for word size and subjective frequency, and they were presented in two blocks (pre- and postreading) of a four-color, four-repeat, 128-trial mixed trial design. Participants were instructed to indicate the color used in each trial by pressing one of four keyboard keys with their dominant hand. For analysis, reaction times were computed and grouped for congruent and incongruent trials pre- and postreading. There was no significant interaction of congruency with the effect of manipulation [rANOVA: F(1,63) = 0.000, P = 0.98, αBonferroni,4|4 = 0.05], indicating no emotional Stroop effect. In addition, after the two Stroop blocks, we tested participants for their capacity to freely recall 32 words presented in the Stroop test to investigate whether participants would recall more emotional words congruent with the manipulation. However, with 65% (n = 61) of the participants recalling fewer than one word per emotional category (M = 1.2), recall performance was too low to allow subsequent analysis. Overall, because this implementation of the emotional Stroop with four different emotions in a single block with novel French word material and keyboard rather than vocal responses was not previously validated, this result is difficult to interpret. It could mean that either the emotional manipulation did not activate the selective attention mechanisms classically associated with the Stroop effect or the measure was simply not reliable.

Emotional rating of the text.

After reading, participants completed a brief assessment of the emotional content of the text using the valence and arousal parts of the SAM (47) noted on five-point Likert scales. The assessment was done after the scales, the POMS, and the second Stroop block and before debriefing. The text was judged emotionally neutral in all conditions (afraid: M = 3.09, sad: M = 3.4, happy: M = 3.55, and control: M = 3.40), and there was no main effect of condition [MANOVA: F(6,176) = 0.69, P = 0.65, Wilk’s Λ = 0.95].

Note.

Both the POMS and Stroop are members of the same family of measures as the scales and SCL results reported in the text for emotion-monitoring experiment 1. Although these two measures are not discussed in the text, they were integrated in the statistical analyses for the measures of experiment 1 using Holm’s sequential Bonferroni correction for multiple measures. None of these measures were used in experiment 2.

Results

Emotion Monitoring.

Participant responses to posttest detection interviews were recorded by audio and written notes and then analyzed by the experimenters to categorize each participant into different detection levels. Only 1 participant (female; condition: afraid) reported complete detection (“you manipulated my voice to make it sound emotional”), and only 15 (female: 14; happy: 7, sad: 3, and afraid: 5) reported partial detection (“you did something to my voice; it sounded strange and it wasn’t just the microphone”). The remaining 93 participants (female: 74; happy: 20, sad: 25, afraid: 21, and control: 27) reported no detection. To not bias any potential feedback results, the detecting participants were removed from all additional analyses. Three participants were also excluded because of technical problems with the feedback. The subsequent results, therefore, concern a total of 93 participants.

Feedback Effect.

For the emotional self-rating task, the participants’ scores on the previously described dimensions of positivity and tension were compared pre- and postreading. The scores (two levels: pre and post) were in significant interaction with the type of manipulation (three levels): repeated-measure multivariate analysis of variance (rMANOVA) F(4,124) = 3.30, Wilk’s Λ = 0.81, P = 0.013, αBonferroni,2|4 = 0.016. In the happy and sad conditions, the general pattern of the emotional changes matched how the manipulations were perceived in the pretest: happy feedback led to more positivity [M = 7.4 > 6.9; Fisher least significant difference (LSD), P = 0.037; Cohen’s d = 0.75] but not significantly less tension (M = 3.0 < 3.6; Fisher LSD, P = 0.14); sad feedback led to less positivity (M = 7.0 < 7.5; Fisher LSD, P = 0.017; Cohen’s d = 0.70) and as predicted, no significant change in tension (M = 3.2 < 3.5; Fisher LSD, P = 0.29). Although the afraid manipulation was the most salient of the three, we did not see significant emotional changes in the afraid condition for either positivity (M = 6.5 < 6.8; Fisher LSD, P = 0.11) or tension (M = 3.8 < 4.0; Fisher LSD, P = 0.53) (Fig. 2).
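For reference, the Cohen's d values above quantify the size of the pre/post change; one common way to compute d for within-participant changes is from the difference scores, as sketched below with placeholder data (the paper does not state which d variant was used).

```python
import numpy as np

def cohens_d_prepost(pre: np.ndarray, post: np.ndarray) -> float:
    """Cohen's d for a within-participant pre/post change, computed on the
    difference scores (one common convention; the paper does not state which
    variant it used)."""
    diff = post - pre
    return diff.mean() / diff.std(ddof=1)

# Illustrative placeholder data (not the study's data):
pre = np.array([6.8, 7.1, 6.5, 7.0, 6.9, 7.2])
post = np.array([7.4, 7.5, 7.0, 7.6, 7.1, 7.8])
print(round(cohens_d_prepost(pre, post), 2))
```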

The evolution of participants’ tonic SCL from minutes 3–8 was in significant interaction with the experimental condition [repeated-measure analysis of variance (rANOVA): F(15,425) = 2.29, P = 0.0037, αBonferroni,1|4 = 0.0125] (Fig. 3). SCL decreased the most in the control condition (M = −5.9% at t = 8) and less so in the sad (M = −1.6%) and afraid conditions (M = −2.4%), and it increased moderately in the happy condition (M = +0.6%). SCLs reached at t = 8 were different from control in all three conditions (Fisher LSD, P < 0.05; Cohen’s d: happy = 0.66, sad = 0.56, and afraid = 0.47). The steady decrease of tonic SCL seen in the control condition is the expected autonomic response associated with predictable and low-arousal control tasks, such as reading aloud (15). Although reports of systematic SCL dissociation between fear, sadness, and happiness are inconsistent (16), tonic SCL increase is typically associated with activated emotional states (17) as well as the appraisal of emotional speech or images (18, 19).

Fig. 3.

Percentage increase of SCL over time measured relative to the level at the outset of manipulation (minute 3). Manipulation strength was gradually increased from 3 to 7 min and then, held at the highest level until the end of the task. Error bars represent 95% confidence intervals on the mean. *Time steps at which the SCL distribution is significantly different from the control condition (Fisher LSD, P < 0.05).

Audio Compensation.

It is known that speakers reading under any kind of manipulated feedback may remain unaware of the audio alterations but compensate for them by, for example, adapting the pitch of their vocal production (8, 9). If such compensation occurred here, participants could be said to be monitoring their own expressive output, despite their lack of conscious awareness of the manipulation. Testing for this eventuality, we found no evidence of acoustical compensation in the participants’ produced speech: the temporal evolution of the voices’ fundamental frequencies, amplitudes, or voice qualities was not in significant statistical interaction with the experimental condition (SI Text). However, because participants read a varied text continuously, rather than the controlled phonemes typically used in pitch-altered feedback research, their natural variability in pitch over the course of speaking would make it difficult to detect compensation for the small pitch shifts used here (a 3-Hz increase in happy and a 3.5-Hz decrease in sad).

To further examine whether participants compensated for the emotional errors, even if they did not consciously detect them, we, therefore, replicated the first emotion-monitoring experiment with an additional emotional manipulation designed to feature drastically more pitch upshifting than before (+100 cents, a fourfold increase from happy) along with inflection and high-pass filtering. Applied to neutral speech, the resulting manipulation gave rise to a stressed, hurried impression (compare Audio File S9 with Audio File S10 and compare Audio File S11 with Audio File S12). Using the same adjective scales as above, we let 14 French speakers rate the emotional quality of processed vs. nonprocessed sample voices and found that this tensed manipulation differed significantly from neutral [multivariate T2 = 23.7, F(2,11) = 10.9, P = 0.0024], with increased tension but no change of positivity.

Using this new manipulation, we then let n = 90 (all female) participants take part in a second emotion-monitoring experiment (neutral: 39, tensed: 38, and technical problems: 13). Results replicated both the low level of conscious detection and the emotional feedback found in experiment 1. First, only 2 of 38 tensed participants reported complete detection (5.6%), and 9 (23.6%) reported partial detection, proportions that did not differ from those in experiment 1. Second, scores of the nondetecting participants (tensed: 27 and control: 39) on the previously described dimensions of positivity and tension were compared pre- and postreading. The scores (two levels: pre and post) were in significant interaction with the condition [two levels; rMANOVA F(2,42) = 4.10, Wilk’s Λ = 0.83, P = 0.023, αBonferroni,1|2 = 0.025] in a direction congruent with the new manipulation: more tension [t(43) = 2.43, P = 0.019; Cohen’s d = 0.70] and no change of positivity [t(43) = −1.94, P = 0.06]. There was no interaction of the evolution of SCL with condition [rANOVA: F(6,258) = 1.17, P = 0.32, αBonferroni,2|2 = 0.05].

We extracted phonetical characteristics (mean F0, jitter, shimmer, and breathiness) from the manipulated (what’s heard) and nonmanipulated (what’s said) speech of nondetecting participants over successive 1-min windows from t = 3 to t = 9. First, we compared the manipulated and nonmanipulated speech of the tensed group and found that all four characteristics differed in line with the manipulation made, with increased pitch [F(6,354) = 5.88, P = 0.0000; +43 cents] and shimmer [F(6,354) = 4.4, P = 0.0003; +0.6%] and decreased jitter [F(6,354) = 8.9, P = 0.0000; −15 Hz] and breathiness [F(6,354) = 8.3, P = 0.0000; −0.3 dB]. This result shows that our method of analysis is sensitive enough to detect possible compensatory changes in voice production at least at a magnitude similar to that of the perturbation applied here. Second, we compared the nonmanipulated speech in the tensed group with the speech in the control group and found that the evolution of all four characteristics did not differ with condition. Thus, we again found no evidence that the participants compensated or otherwise adapted to the alterations (SI Text and Fig. S1).

Discussion

In this study, we created a digital audio platform for real-time manipulation of the emotional tone of participants’ voices in the direction of happiness, sadness, or fear. Classification results from both Japanese and French speakers revealed that the alterations were perceived as natural examples of emotional speech, corresponding to the intended emotions. This result was robust across several different forms of rating formats. In experiment 1, the great majority of the participants remained unaware that their own voices were being transformed. As a consequence of listening to their altered voices, they came to react in congruence with the emotion portrayed as reflected in both self-report and skin conductance responses across the experiment. In experiment 2, we replicated key findings from experiment 1 and again, found no evidence that our participants vocally compensated for the altered audio feedback.

The low level of conscious detection of the manipulation as well as the absence of evidence of any compensation in the participants’ production provide no support for the hypothesis that we continuously monitor our own voice to make sure that it meets a predetermined emotional target. This finding is significant because the neural processes underlying the production of emotional speech remain poorly understood (20, 21), and recent commentaries have suggested a central role of forward error-monitoring models in prosodic control (2224). Our findings instead give support to dual-pathway models of vocal expression, where an older primate communication system responsible for affective vocalizations, like laughter and crying, penetrates the neocortex-based motor system of spoken language production, offering less opportunity for volitional control and monitoring than its cortical verbal counterpart (ref. 21, p. 542).

These results do not rule out the possibility that mismatch was registered below the threshold for conscious detection (25) and that the manipulated feedback overpowered any potential error signals (ref. 26 has a related discussion in the semantic domain). However, this suggestion would not explain why the nonconscious alarm was not acted on and especially, not compensated for in the participants’ vocal productions. Similarly, it is interesting to speculate about the small minority of participants who actually detected the manipulation. If we assume a matrix of conflicting evidence in the task (from interoceptive signals and exteroceptive feedback), it is possible that their performance can be explained by individual differences in emotional sensitivity and awareness (27, 28).

When participants did not detect the manipulation, they instead accepted the vocal emotion as their own. This feedback result is as striking as the concomitant evidence for nondetection. The relationship between the expression and experience of emotions is a long-standing topic of heated disagreement in the field of psychology (10, 29, 30). Central to this debate, studies on facial feedback have shown that forced induction of a smile or a frown or temporary paralysis of facial muscles by botulinum injection leads to congruent changes in the participants’ emotional reactions (11, 31–33). Although these experiments support the general notion that emotional expression influences experience, they all suffer from problems of experimental peculiarity and demand. Participants can never be unaware of the fact that they are asked to bite on a pencil to produce a smile or injected with a paralyzing neurotoxin in the eyebrows. In addition, these studies leave the causality of the feedback process largely unresolved: to what extent is it the (involuntary) production of an emotional expression or the afference from the expression itself that is responsible for feedback effects (33)? In contrast to all previous studies of feedback effects, we have created a situation where the participants produce a different signal than the feedback that they are receiving (in this case, neutral vs. happy, sad, afraid, or tensed). These conditions allow us to conclude that the feedback is the cause of the directional emotional change observed in our study. As such, our result reinforces the wider framework of self-perception theory: that we use our own actions to help infer our beliefs, preferences, and emotions (34, 35). Although we do not necessarily react the same way to emotion observed in ourselves and that observed in others, in both cases, we often use the same inferential strategies to arrive at our attributions (12, 36, 37).

In experiment 1, the happy and sad manipulations registered a feedback effect on the self-report measure but the afraid voice did not, whereas all three manipulations differed from neutral on the SCL measure. It is unlikely that this outcome stemmed from different qualities of the manipulations, because all of them previously had been classified as the intended emotion (indeed, as can be seen in Fig. 2, the transformations to afraid separated most clearly from neutral in the discrimination test). Instead, we suggest explaining this unpredicted outcome by looking at the appraisal context of the experiment (38). Unlike previous studies, where the intensity of emotions was modulated by feedback, in our experiment, emotions were induced from scratch in relation to the same neutral backdrop in all conditions. However, most likely, the atmosphere of the short story that we used was more conducive to an emotional appraisal in terms of general mood changes, such as happy and sad (and later, tensed), compared with a more directional emotion, such as fear. In future studies, our aim will be to manipulate both context and feedback to determine the relative importance of each influence.

Alternatively, it should be noted that, although concordance between different measures, such as self-report and psychophysiology, is often posited by emotion theories, the empirical support for this position is not particularly strong (39, 40). Thus, a dual-systems view of emotion could, instead, interpret an effect on the SCL profile but not on self-report as unconscious emotional processing (25). This interpretation might be particularly fitting for an emotion like fear, where evidence indicates the existence of an unconscious subcortical route through which emotional stimuli quickly reach the amygdala (41).

In summary, this result gives novel support for modular accounts of emotion production and self-perception theory and argues against emotional output monitoring. In future experiments, we will tie our paradigm closer to particular models of speech production (42, 43) and explore the interesting discrepancies between our results and the compensation typically found in pitch perturbation studies. In addition, real-time emotional voice manipulation allows for a number of further paths of inquiry. For example, in the field of decision-making, emotion is often seen as integral to both rapid and deliberate choices (44), and it seems likely that stating preferences and choosing between options using emotionally altered speech might function as somatic markers (45) and influence future choices. More speculatively, emotion transformation might have remedial uses. It has been estimated that 40–75% of all psychiatric disorders are characterized by problems with emotion regulation (46). Thus, it is possible that positive attitude change can be induced from retelling of affective memories or by redescribing emotionally laden stimuli and events in a modified tone of voice. Finally, outside academia, we envisage that our paradigm could be used to enhance the emotionality of live singing performances as well as increase immersion and atmosphere in online gaming, where vocal interactions between players often lack an appropriate emotional edge.

Materials and Methods

Experiment 1: Audio Manipulations.

The happy effect processed the voice with pitch-shifting, inflection, compression, and a high shelf filter (definitions are in SI Text). Pitch-shifting was set to a positive shift of +25 cents. Inflection had an initial pitch shift of −50 cents and a duration of 400 ms. Compression had a −26-dB threshold, 4:1 soft-knee ratio, and 10 dB/s attack and release. High shelf-filtering had a shelf frequency of 8,000 Hz and a high-band gain of 10 dB per octave. The sad effect processed the voice with pitch-shifting, a low shelf filter, and a formant shifter. Pitch-shifting had a negative shift of −30 cents. Low shelf-filtering had a cutoff frequency of 8,000 Hz and a high-band roll off of 10 dB per octave. Formant shifting used a tract ratio of 0.9. Finally, the afraid effect processed the voice with vibrato and inflection. Vibrato was sinusoidal with a depth of 15 cents and frequency of 8.5 Hz. Inflection had an initial pitch shift of +120 cents and a duration of 150 ms. The effects were implemented with a programmable hardware platform (VoicePro, TC-Helicon; TC Group Americas) with an in/out latency of exactly 15 ms.
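Purely for readability, the parameter values listed above can be collected into a single configuration structure; the dictionary below is our own summary and not the native parameter format of the VoicePro hardware.

```python
# Parameter summary of the experiment 1 effects (values taken from the text
# above; the dictionary structure itself is ours, not the device's format).
EFFECTS = {
    "happy": {
        "pitch_shift_cents": +25,
        "inflection": {"initial_cents": -50, "duration_ms": 400},
        "compressor": {"threshold_db": -26, "ratio": "4:1 soft knee",
                       "attack_release": "10 dB/s"},
        "high_shelf": {"freq_hz": 8000, "gain_db_per_octave": +10},
    },
    "sad": {
        "pitch_shift_cents": -30,
        "low_shelf": {"cutoff_hz": 8000, "rolloff_db_per_octave": 10},
        "formant_shift_tract_ratio": 0.9,
    },
    "afraid": {
        "vibrato": {"depth_cents": 15, "rate_hz": 8.5},
        "inflection": {"initial_cents": +120, "duration_ms": 150},
    },
}
```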

Pilot experiment.

A sentence from the French translation of the short story collection The Elephant Vanishes by Haruki Murakami was recorded in a neutral tone by eight (male: four) relatively young (M = 20.1) native French speakers. Recordings were processed by each of the audio effects (happy, sad, and afraid), resulting in 24 different pairs of one neutral reference and one processed variant thereof (eight trials per effect). We then asked n = 10 independent listeners (male: five) from the same population to judge the emotional content of the processed voices compared with their neutral reference using six continuous scales anchored with emotional adjectives (happy, optimistic, relaxed, sad, anxious, and unsettled). For analysis, response data were factored into two principal components (with varimax rotation; 91% total variance explained), with factors suggesting labels of positivity (happy, optimistic, and sad: 80% variance explained) and tension (unsettled, anxious, and relaxed: 11% variance explained). The manipulations were perceived to be distinct from one another on both dimensions [multivariate F(4,6) = 8.33, P = 0.013]. Emotional ratings of manipulated speech differed from nonmanipulated speech for happy [multivariate T2 = 28.6, F(2,8) = 12.7, P = 0.003] with increased positivity [t(9) = 2.51, P = 0.03; Cohen’s d = 1.67] and decreased tension [t(9) = −4.98, P = 0.0008; Cohen’s d = 3.32], sad [multivariate T2 = 11.3, F(2,8) = 5.0, P = 0.038] with decreased positivity [t(9) = −3.34, P = 0.008; Cohen’s d = 2.22] and unchanged tension [t(9) = 0.30, P = 0.77], and afraid [multivariate T2 = 54.3, F(2,8) = 24.1, P = 0.0004] with decreased positivity [t(9) = −5.7, P = 0.0003; Cohen’s d = 3.8] and increased tension [t(9) = 7.34, P = 0.00004; Cohen’s d = 4.8] (Fig. 2).

Feedback procedure.

Participants were recruited to perform two successive Stroop tasks separated by the main reading task, which was presented as a filler task. At the beginning of the experiment, participants were fitted with two finger electrodes (Biopac MP150) on their nondominant hands, from which their continuous SCLs were measured throughout the session. After the first Stroop task, participants were asked to evaluate their emotional state using six continuous adjective scales. For the reading task, participants were fitted with noise-cancelling headsets (Sennheiser HME-100) with attached microphones, in which they could hear their own amplified voices while they read out loud. They were tasked to read an excerpt from a short story collection by Haruki Murakami (“The Second Bakery Attack” from The Elephant Vanishes), and text was comfortably presented on a board facing them. In the neutral control condition, the participants simply read the story from beginning to end. In three experimental conditions, the emotional effects were gradually applied to the speaker’s voice after 2 min of reading. The strength of the effects increased by successive increments of their parameter values triggered every 2 min by messages sent to the audio processor from an audio sequencer (Steinberg Cubase 7.4). Levels were calibrated using a Brüel & Kjær 2238 Mediator Sound-Level Meter (Brüel & Kjær Sound & Vibration), and overall effect gain was automatically adjusted so that gradual increases in effect strength did not result in gradual increases of sound level. After 7 min, the participants were hearing their own voices with the maximum of the effect added until the end of the reading task. After the reading task and before the second Stroop task, participants were again asked to evaluate their emotional state using adjective scales. In addition, they were also asked to fill in the Profile of Mood States (POMS) questionnaire and evaluate the emotional content of the text using a brief Self-Assessment Manikin (SAM) test (results for the POMS, the SAM, and Stroop are not discussed in the text) (SI Text). After the second Stroop task, participants were then asked a series of increasingly specific questions about their impressions of the experiment to determine whether they had consciously detected the manipulations of their voices. Finally, participants were debriefed and informed of the true purpose of the experiment.

Participants.

In total, n = 112 (female: 92) participants took part in the study, and all were relatively young (M = 20.1, SD = 1.9) French psychology undergraduates at the University of Burgundy in Dijon, France. The students were rewarded for their participation with course credits. Three participants who could not complete the feedback part of the experiment because of technical problems were excluded, leaving n = 109 (female: 89) for subsequent analysis.

Detection questionnaire.

At the end of the experiment, participants were asked a series of increasingly specific questions to determine whether they had consciously detected the manipulations of their voices. Participants were asked (i) what they had thought about the experiment, (ii) whether they had noticed anything strange or unusual about the reading task, (iii) whether they had noticed anything strange or unusual about the sound of their voices during the reading task, and (iv) because a lot of people do not like to hear their own voices in a microphone, whether that was what they meant by unusual in this case. Answers to all questions were recorded by audio and written notes and then, analyzed by the experimenters to categorize each participant into four detection levels: (i) “you manipulated my voice to make it sound emotional” (complete detection), (ii) “you did something to my voice; it sounded strange and it was not just the microphone or headphones” (partial detection), (iii) “my voice sounded unusual, and I am confident that it was because I was hearing myself through headphones” (no detection), and (iv) “there was nothing unusual about my voice” (no detection).

Skin conductance.

The participants’ SCLs were continuously recorded during the complete duration of the reading. Data were acquired with a gain of 5 µmho/V (i.e., 5 µS/V), sampled at 200 Hz, and low pass-filtered with a 1-Hz cutoff frequency. SCLs were averaged over nonoverlapping 1-min windows from t = 3 min to t = 8 min and normalized relative to the level at t = 3.
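A sketch of this SCL preprocessing pipeline is given below; the filter order, the use of zero-phase filtering, and the exact percent-change formula (matching the "percentage increase" axis of Fig. 3) are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def scl_percent_change(scl: np.ndarray, fs: float = 200.0) -> np.ndarray:
    """Process a raw skin-conductance trace as described above: 1-Hz low-pass,
    average over non-overlapping 1-min windows covering minutes 3-8, and
    express each window relative to the t = 3 baseline."""
    b, a = butter(2, 1.0 / (fs / 2), btype="lowpass")
    smooth = filtfilt(b, a, scl)                  # zero-phase filtering (assumed)

    win = int(60 * fs)                            # samples per 1-min window
    # windows covering minutes 3-4, 4-5, ..., 7-8
    means = np.array([smooth[m * win:(m + 1) * win].mean() for m in range(3, 8)])
    baseline = means[0]
    return 100.0 * (means - baseline) / baseline
```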

Mood scales.

Feedback participants reported their emotional states both before and after the reading task using the same six adjective scales used in pilot data. Responses were combined into positivity (happy, optimistic, and sad) and tension (unsettled, anxious, and relaxed) averaged scores, and their differences were computed pre- and postreading.
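How the six adjective ratings are collapsed into the two composite scores is sketched below; note that the reverse-coding of the negatively keyed items (sad, relaxed) and the scale range are our assumptions, because the text only names the items belonging to each composite.

```python
import numpy as np

SCALE_MAX = 10  # assumed upper end of the continuous adjective scales

def composite_scores(ratings: dict) -> tuple:
    """Combine the six adjective ratings into positivity and tension scores.
    Negatively keyed items (sad for positivity, relaxed for tension) are
    reverse-coded here, which is an assumption on our part."""
    positivity = np.mean([ratings["happy"],
                          ratings["optimistic"],
                          SCALE_MAX - ratings["sad"]])
    tension = np.mean([ratings["unsettled"],
                       ratings["anxious"],
                       SCALE_MAX - ratings["relaxed"]])
    return positivity, tension

# Pre/post difference for one (fictional) participant:
pre = composite_scores({"happy": 6, "optimistic": 7, "sad": 3,
                        "unsettled": 2, "anxious": 3, "relaxed": 7})
post = composite_scores({"happy": 8, "optimistic": 7, "sad": 2,
                         "unsettled": 2, "anxious": 2, "relaxed": 8})
delta_positivity, delta_tension = post[0] - pre[0], post[1] - pre[1]
```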

Correction for multiple measures.

Results for scales and SCLs were corrected for multiple measures (four measures of emotional feedback: scales, SCL, the POMS, and Stroop) using Holm’s sequential Bonferroni procedure. The two measures of error detection (detection rate and audio compensation) were not corrected, because detection rate is a descriptive measure.
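For clarity, Holm's sequential Bonferroni procedure used here (and reflected in the αBonferroni,k|m thresholds quoted throughout) can be sketched as follows; the p values in the example are illustrative only.

```python
def holm_bonferroni(p_values: list, alpha: float = 0.05) -> list:
    """Holm's sequential Bonferroni procedure: sort the p values, compare the
    k-th smallest against alpha / (m - k), and stop at the first failure.
    Returns a rejection decision for each test in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):    # thresholds a/m, a/(m-1), ..., a
            reject[i] = True
        else:
            break
    return reject

# Four feedback measures were corrected in experiment 1 (scales, SCL, POMS,
# Stroop); the p values below are illustrative placeholders only.
print(holm_bonferroni([0.0037, 0.013, 0.43, 0.98]))
```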

Experiment 2: Audio Manipulation.

The tensed manipulation consisted of pitch-shifting (+100 cents; a fourfold increase from happy), inflection (initial pitch shift of +150 cents and a duration of 150 ms), and high shelf-filtering (shelf frequency of 8,000 Hz; +10 dB per octave). The effect was implemented with a software platform based on the Max/MSP language designed to reproduce the capacities of the hardware used in experiment 1, and it is available at cream.ircam.fr.

Pilot experiment.

Eight recordings of the same sentence spoken in a neutral tone by eight young female native French speakers were processed with the tensed effect and presented paired with their nonmanipulated neutral reference to n = 14 French speakers (male: 5), who rated their emotional quality using the same adjective scales used in the main experiment. Participants found that tensed manipulated speech differed from nonmanipulated speech [multivariate T2 = 23.7, F(2,11) = 10.9, P = 0.002], with increased tension [M = +4.4, t(13) = 3.39, P = 0.005; Cohen’s d = 1.88] but no change of positivity [M = +0.1, t(13) = 0.08, P = 0.93].

Feedback procedure.

The same procedure as in experiment 1 was used, with the same text read under one manipulated (tensed) condition and one control condition. Measures were the same, with the exception of the POMS, the SAM, and Stroop tasks, which were not used in experiment 2.

Participants.

Ninety participants (all female) took part in the study; all were undergraduate students at Sorbonne University (Paris, France) with a mean age of 21.0 y (SD = 2.3). Participants were compensated in cash. Thirteen participants who could not complete the feedback task because of technical problems were excluded, leaving 77 participants (neutral: 39; tensed: 38).

Correction for multiple measures.

Results for scales and SCLs were corrected for multiple measures (two measures of emotional feedback: scales and SCL) using Holm’s sequential Bonferroni procedure. The two measures of error detection (detection rate and audio compensation) were not corrected for multiple measures, because one of them, the detection rate, is not used in any statistical test.

The procedures used in this work were approved by the Institutional Review Boards of the University of Tokyo, INSERM, and the Institut Européen d’Administration des Affaires (INSEAD). In accordance with the American Psychological Association Ethical Guidelines, all participants gave their informed consent and were debriefed and informed about the true purpose of the research immediately after the experiment.

Supplementary Material

Twenty supplementary audio files (WAV format, approximately 1.3–1.7 MB each) accompany this article and are available online via the supporting-information link given in the Footnotes.

Acknowledgments

J.-J.A. acknowledges the assistance of M. Liuni and L. Rachman [Institut de Recherche et Coordination en Acoustique et Musique (IRCAM)], who developed the software used in experiment 2, and H. Trad (IRCAM), who helped with data collection. Data in experiment 2 were collected at the Centre Multidisciplinaire des Sciences Comportementales Sorbonne Universités–Institut Européen d’Administration des Affaires (INSEAD). All data reported in the paper are available on request. The work was funded, in Japan, by two Postdoctoral Fellowships for Foreign Researchers from the Japan Society for the Promotion of Science (JSPS; to J.-J.A. and P.J.), the Japan Science and Technology Agency (JST) ERATO Implicit Brain Function Project (R.S. and K.W.), and a JST CREST Project (K.W.). Work in France was partly funded by European Research Council Grant StG-335536 CREAM (to J.-J.A.) and the Foundation of the Association de Prévoyance Interprofessionnelle des Cadres et Ingénieurs de la région Lyonnaise (APICIL; L.M.). In Sweden, P.J. was supported by the Bank of Sweden Tercentenary Foundation and Swedish Research Council Grant 2014-1371, and L.H. was supported by Bank of Sweden Tercentenary Foundation Grant P13-1059:1 and Swedish Research Council Grant 2011-1795.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1506552113/-/DCSupplemental.

References

1. Gross JJ. Emotion regulation: Current status and future prospects. Psychol Inq. 2015;26(1):1–26.
2. Moyal N, Henik A, Anholt GE. Cognitive strategies to regulate emotions - current evidence and future directions. Front Psychol. 2014;4:1019. doi: 10.3389/fpsyg.2013.01019.
3. Keltner D, Horberg EJ. Emotion cognition interactions. In: Mikulincer M, Shaver PR, editors. APA Handbook of Personality and Social Psychology, APA Handbooks in Psychology. Vol 1. American Psychological Association; Washington, DC: 2015. pp. 623–664.
4. Inzlicht M, Bartholow BD, Hirsh JB. Emotional foundations of cognitive control. Trends Cogn Sci. 2015;19(3):126–132. doi: 10.1016/j.tics.2015.01.004.
5. Koban L, Pourtois G. Brain systems underlying the affective and social monitoring of actions: An integrative review. Neurosci Biobehav Rev. 2014;46(Pt 1):71–84. doi: 10.1016/j.neubiorev.2014.02.014.
6. Cromheeke S, Mueller SC. Probing emotional influences on cognitive control: An ALE meta-analysis of cognition emotion interactions. Brain Struct Funct. 2014;219(3):995–1008. doi: 10.1007/s00429-013-0549-z.
7. Seth AK. Interoceptive inference, emotion, and the embodied self. Trends Cogn Sci. 2013;17(11):565–573. doi: 10.1016/j.tics.2013.09.007.
8. Burnett TA, Freedland MB, Larson CR, Hain TC. Voice F0 responses to manipulations in pitch feedback. J Acoust Soc Am. 1998;103(6):3153–3161. doi: 10.1121/1.423073.
9. Jones JA, Munhall KG. Perceptual calibration of F0 production: Evidence from feedback perturbation. J Acoust Soc Am. 2000;108(3 Pt 1):1246–1251. doi: 10.1121/1.1288414.
10. James W. Principles of Psychology. Vol 2. Holt; New York: 1890.
11. Flack W. Peripheral feedback effects of facial expressions, bodily postures, and vocal expressions on emotional feelings. Cogn Emotion. 2006;20(2):177–195.
12. Laird JD, Lacasse K. Bodily influences on emotional feelings: Accumulating evidence and extensions of William James’s theory of emotion. Emot Rev. 2013;6(1):27–34.
13. Briefer EF. Vocal expression of emotions in mammals: Mechanisms of production and evidence. J Zool. 2012;288(1):1–20.
14. Juslin P, Scherer K. Vocal expression of affect. In: Harrigan J, Rosenthal R, Scherer K, editors. The New Handbook of Methods in Nonverbal Behavior Research. Oxford Univ Press; Oxford: 2005. pp. 65–135.
15. Nagai Y, Critchley HD, Featherstone E, Trimble MR, Dolan RJ. Activity in ventromedial prefrontal cortex covaries with sympathetic skin conductance level: A physiological account of a “default mode” of brain function. Neuroimage. 2004;22(1):243–251. doi: 10.1016/j.neuroimage.2004.01.019.
16. Kreibig SD. Autonomic nervous system activity in emotion: A review. Biol Psychol. 2010;84(3):394–421. doi: 10.1016/j.biopsycho.2010.03.010.
17. Silvestrini N, Gendolla GH. Mood effects on autonomic activity in mood regulation. Psychophysiology. 2007;44(4):650–659. doi: 10.1111/j.1469-8986.2007.00532.x.
18. Aue T, Cuny C, Sander D, Grandjean D. Peripheral responses to attended and unattended angry prosody: A dichotic listening paradigm. Psychophysiology. 2011;48(3):385–392. doi: 10.1111/j.1469-8986.2010.01064.x.
19. Lang PJ, Greenwald MK, Bradley MM, Hamm AO. Looking at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology. 1993;30(3):261–273. doi: 10.1111/j.1469-8986.1993.tb03352.x.
20. Pichon S, Kell CA. Affective and sensorimotor components of emotional prosody generation. J Neurosci. 2013;33(4):1640–1650. doi: 10.1523/JNEUROSCI.3530-12.2013.
21. Ackermann H, Hage SR, Ziegler W. Brain mechanisms of acoustic communication in humans and nonhuman primates: An evolutionary perspective. Behav Brain Sci. 2014;37(6):529–546. doi: 10.1017/S0140525X13003099.
22. Frühholz S, Sander D, Grandjean D. Functional neuroimaging of human vocalizations and affective speech. Behav Brain Sci. 2014;37(6):554–555. doi: 10.1017/S0140525X13004020.
23. Hasson U, Llano DA, Miceli G, Dick AS. Does it talk the talk? On the role of basal ganglia in emotive speech processing. Behav Brain Sci. 2014;37(6):556–557. doi: 10.1017/S0140525X13004044.
24. Pezzulo G, Barca L, D’Ausilio A. The sensorimotor and social sides of the architecture of speech. Behav Brain Sci. 2014;37(6):569–570. doi: 10.1017/S0140525X13004172.
25. Gainotti G. Unconscious processing of emotions and the right hemisphere. Neuropsychologia. 2012;50(2):205–218. doi: 10.1016/j.neuropsychologia.2011.12.005.
26. Lind A, Hall L, Breidegard B, Balkenius C, Johansson P. Speakers’ acceptance of real-time speech exchange indicates that we use auditory feedback to specify the meaning of what we say. Psychol Sci. 2014;25(6):1198–1205. doi: 10.1177/0956797614529797.
27. Garfinkel SN, Seth AK, Barrett AB, Suzuki K, Critchley HD. Knowing your own heart: Distinguishing interoceptive accuracy from interoceptive awareness. Biol Psychol. 2015;104:65–74. doi: 10.1016/j.biopsycho.2014.11.004.
28. Kuehn E, Mueller K, Lohmann G, Schuetz-Bosbach S. Interoceptive awareness changes the posterior insula functional connectivity profile. Brain Struct Funct. January 23, 2015. doi: 10.1007/s00429-015-0989-8.
29. Darwin C. The Expression of Emotions in Man and Animals. Philosophical Library; New York: 1872.
30. Schachter S, Singer JE. Cognitive, social, and physiological determinants of emotional state. Psychol Rev. 1962;69:379–399. doi: 10.1037/h0046234.
31. Strack F, Martin LL, Stepper S. Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. J Pers Soc Psychol. 1988;54(5):768–777. doi: 10.1037//0022-3514.54.5.768.
32. Havas DA, Glenberg AM, Gutowski KA, Lucarelli MJ, Davidson RJ. Cosmetic use of botulinum toxin-a affects processing of emotional language. Psychol Sci. 2010;21(7):895–900. doi: 10.1177/0956797610374742.
33. Hennenlotter A, et al. The link between facial feedback and neural activity within central circuitries of emotion - new insights from botulinum toxin-induced denervation of frown muscles. Cereb Cortex. 2009;19(3):537–542. doi: 10.1093/cercor/bhn104.
34. Dennett DC. The Intentional Stance. MIT Press; Cambridge, MA: 1987.
35. Johansson P, Hall L, Tarning B, Sikstrom S, Chater N. Choice blindness and preference change: You will like this paper better if you (believe you) chose to read it! J Behav Decis Making. 2014;27(3):281–289.
36. Bem DJ. Self-perception theory. In: Berkowitz L, editor. Advances in Experimental Social Psychology. Vol 6. Academic; New York: 1972. pp. 1–62.
37. Laird JD. Self-attribution of emotion: The effects of expressive behavior on the quality of emotional experience. J Pers Soc Psychol. 1974;29(4):475–486. doi: 10.1037/h0036125.
38. Gray MA, Harrison NA, Wiens S, Critchley HD. Modulation of emotional appraisal by false physiological feedback during fMRI. PLoS One. 2007;2(6):e546. doi: 10.1371/journal.pone.0000546.
39. Hollenstein T, Lanteigne D. Models and methods of emotional concordance. Biol Psychol. 2014;98:1–5. doi: 10.1016/j.biopsycho.2013.12.012.
40. Evers C, et al. Emotion response coherence: A dual-process perspective. Biol Psychol. 2014;98:43–49. doi: 10.1016/j.biopsycho.2013.11.003.
41. Adolphs R. The biology of fear. Curr Biol. 2013;23(2):R79–R93. doi: 10.1016/j.cub.2012.11.055.
42. Hickok G. Computational neuroanatomy of speech production. Nat Rev Neurosci. 2012;13(2):135–145. doi: 10.1038/nrn3158.
43. Pickering MJ, Garrod S. An integrated theory of language production and comprehension. Behav Brain Sci. 2013;36(4):329–347. doi: 10.1017/S0140525X12001495.
44. Lerner JS, Li Y, Valdesolo P, Kassam KS. Emotion and decision making. Annu Rev Psychol. 2015;66:799–823. doi: 10.1146/annurev-psych-010213-115043.
45. Batson CD, Engel CL, Fridell SR. Value judgments: Testing the somatic-marker hypothesis using false physiological feedback. Pers Soc Psychol Bull. 1999;25(8):1021–1032.
46. Gross JJ, Jazaieri H. Emotion, emotion regulation, and psychopathology: An affective science perspective. Clin Psychol Sci. 2014;2(4):387–401.
47. Lang P. Behavioural Treatment and Bio-Behavioural Assessment: Computer Applications. Ablex; Norwood, NJ: 1980. pp. 119–137.
48. McNair D, Lorr M, Droppleman L. Profile of Mood States (POMS) Manual. Educational and Industrial Testing Service; San Diego: 1971.
49. Williams JMG, Mathews A, MacLeod C. The emotional Stroop task and psychopathology. Psychol Bull. 1996;120(1):3–24. doi: 10.1037/0033-2909.120.1.3.
50. Bonin P, et al. Normes de concrétude, de valeur d’imagerie, de fréquence subjective et de valence émotionnelle pour 866 mots [Concreteness, imagery value, subjective frequency, and emotional valence norms for 866 words]. Annee Psychol. 2003;103(4):655–694.
