Auditory traits of "own voice"

Marino Kimura; Yuko Yotsumoto

doi:10.1371/journal.pone.0199443

2018 Jun 26;13(6):e0199443

Editor: Manabu Sakakibara

PMCID: PMC6019673 PMID: 29944698

Abstract

People perceive their recorded voice differently from their actively spoken voice. The uncanny valley theory proposes that as an object approaches humanlike characteristics, there is an increase in the sense of familiarity; however, eventually a point is reached where the object becomes strangely similar and makes us feel uneasy. The feeling of discomfort experienced when people hear their recorded voice may correspond to the floor of the proposed uncanny valley. To overcome the feeling of eeriness of own-voice recordings, previous studies have suggested equalization of the recorded voice with various types of filters, such as step, bandpass, and low-pass, yet the effectiveness of these filters has not been evaluated. To address this, the aim of experiment 1 was to identify what type of voice recording was the most representative of one’s own voice. The voice recordings were presented in five different conditions: unadjusted recorded voice, step filtered voice, bandpass filtered voice, low-pass filtered voice, and a voice for which the participants freely adjusted the parameters. We found large individual differences in the most representative own-voice filter. To consider the role of the sense of agency, experiment 2 investigated whether lip-synching would influence the rating of own voice. The results suggested that lip-synching did not affect own-voice ratings. In experiment 3, based on the assumption that the voices used in the previous experiments corresponded to a continuum from non-own voice to own voice, the existence of an uncanny valley was examined. Familiarity, eeriness, and the sense of own voice were rated. The results did not support the existence of an uncanny valley. Taken together, the experiments led us to the following conclusions: there is no general filter that can represent own voice for everyone, sense of agency has no effect on own-voice rating, and the uncanny valley does not exist for own voice, specifically.

Introduction

“Who am I?” This question, which is at the heart of the sense of self, has been asked and challenged for a long time by artists, philosophers, and scientists [1–4]. To measure the conceptual “self” scientifically, the sense of “self” has been represented using several modalities as stimuli. The self-face is the most frequently used experimental stimulus due to its representativeness and convenience. Although most self-focused psychological experiments have used the self-face, one’s voice is also an important component of “self.” Indeed, one does not see one’s own face except as a horizontally flipped image in a mirror. However, humans are frequently exposed to their own voice, suggesting that it may be a better, more representative example of real-world self-representation.

Speech sounds are produced in the vocal folds and delivered to the vocal cavity. They then travel to the ear and auditory nerve via an air-conducted pathway from the mouth and a bone-conducted pathway via the cranial bones [5]. The bone conduction pathway also includes soft tissues. These different forms of sound conduction result in different sounds and manifestations of hearing. Even though people can recognize whether a presented voice is their own, the recorded voice is found to be very unlike the voice that they hear when they are speaking. This is because the voice that one hears (own voice) includes both bone conduction and air conduction, while the recorded voice only includes air conduction [6,7]. In addition, air conduction may also be distorted in the recorded voice, because the recorded voice is recorded close to the mouth, while own voice is “played” in the mouth. Further, depending on the audio setup, recorded voice may originate closer to or farther from the ear than spoken voice. This difference may also contribute to the difference between own voice and recorded voice.

Over decades, researchers of the transfer function in own voice have employed various experimental methods. For example, the resonance frequencies of the human skull of patients with skin penetrating titanium implants were measured [8]. Bone transfer functions have been estimated using distortion product otoacoustic emissions [9]. Finally, the frequency characteristics of four different bone conduction actuators have been investigated [10]. Based on bone conduction characteristics described in previous research, the equalization filter is considered a suitable method to reproduce own-voice from recorded voice. Although filtered voice was rated as own-voice rather than recorded voice, the filter types varied across studies [11–13]. Moreover, differences in the experimental settings, e.g. the words used as stimuli, impede the direct comparison of experimental results.

As previous studies were only concerned with frequency cut-off filters, the possible contributions of other sound characteristics, such as vibrato and pitch, as components of own voice have not yet been examined. As some people tremble when they speak, instability of the voice may affect own-voice perception. Voice instability corresponds to vibrato, as they share characteristics [14]. Pitch may be another specific trait of own voice. Poor-pitch (i.e. tone deaf) singers have difficulty in mapping pitch onto action, but perceptual, motor, or memory problems have not been found in these individuals [15]. When speakers try to reproduce a required pitch, they may take their bone-conducted own voice as the correct pitch, resulting in “poor-pitch” singing.

Other than sound characteristics, sense of agency is said to be an important component of self-ness. Sense of agency refers to the online sense of performing an action (“I am the one who is causing action”), which distinguishes one’s own actions from those performed by someone else [16]. Sense of agency concerns not only body movement but also speech monitoring in auditory perception. It is known that mouth movement during sound presentation induces a higher sense of agency than images or hearing alone [17]. Whether the presence of a sense of agency strengthens or changes the own-voice representation within one’s self has not been investigated.

There are strong links between speech acoustics and emotions [18,19]. Listeners are able to perceive the intended emotions from spoken voices, indicating that listeners associate particular patterns of acoustic cues with various discrete emotional states, and that the ability to infer emotion from speech is a fundamental component of human vocal communication [20]. Besides the profound relationship between emotion and voice [21], perception of voices is also critical in various situations. For example, newborn infants clearly prefer their mother’s voice [22,23], and voice-only communication elicits greater empathy [24]. Furthermore, recent technological developments have increased the demand for human-like voices in vocal assistance robots. A number of studies have examined how synthesized robotic voices are perceived by humans [25,26], and explored the best form of user-friendly acoustic interfaces. Despite the importance of voice perception in human interactions, as well as in human-machine interfaces, we are yet to fully understand how we perceive our own voices. Hence, it is critical to precisely evaluate the perception and representation of own voice.

In addition to own-voice reproduction, we also focused on differences in discomfort between own voice and recorded voice. Even though most people may judge the presented voice as their own, the non-modified recorded voice is found to be unpleasant. This phenomenon may be due to the recorded voice creating a so-called uncanny valley (Fig 1). The uncanny valley is a widely used concept first proposed in the field of robotics [27]. The idea claims that familiarity with and empathy toward humanlike robots increase as the appearance of the robot becomes more similar to that of human beings. However, for robots that very closely approximate but fail to attain human appearance, the human response turns into revulsion. As an explanation, the original theory stated, “eeriness can be represented by negative familiarity.” Previous studies investigating the existence of the uncanny valley have used eeriness, familiarity, and humanlike-ness as measurements [28,29].

Fig 1 — **Adapted from “The Uncanny Valley,” by M. Mori, 1970.** Conceptual diagram of the theoretical graph presented in the original uncanny valley theory. X-axis corresponds to similarity between robots and humans and y-axis corresponds to familiarity of the robots. Recorded voice may represent the valley part and own voice the highest point after the valley. Sense of one’s self instead of similarity was used in the present study.

Our first experiment investigated the consistency of own-voice rating and queried which equalization filter among those employed in previous studies best represents one’s own voice in a controlled experimental setting. The filters compared were: one that attenuated and amplified a certain range of frequency, one that cut off frequency at a strict threshold, and one that omitted a certain range of frequency. In addition to the filter comparison, the possibility of contributions from other sound characteristics, such as pitch and vibrato, to one’s own voice representation was examined. In experiment 2, we examined the effect of sense of agency on own-voice representation by activating the motor system. Finally, in experiment 3, we measured familiarity, eeriness, and sense of one’s self to investigate the existence of the uncanny valley in the acoustic field, focusing on each individual's voice features.

Experiment 1

Introduction

In experiment 1, the sound profile that best represents own voice was examined. We used filters described in previous studies, as follows: +3 dB for a signal higher than 1 kHz and -3 dB for a signal lower than 1 kHz as a step filter [11]; a trapezoid-like filter as a lowpass filter [12]; and a filter passing 300 to 1200 Hz as a bandpass filter [13]. In addition to these three types of filters, an adjusted voice protocol, in which the participants adjusted all or part of the pitch, vibrato, and frequency cut-off filters of the recorded voice to reproduce own voice, was added for comparison. The participants chose the stimulus that best represented own voice by comparing recorded voice, step filtered voice, lowpass filtered voice, bandpass filtered voice, and adjusted voice. To examine the consistency of the own-voice rating, the participants rated own-voiceness twice on two different days.
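As a rough illustration of the step and bandpass settings quoted above, the following minimal sketch expresses each filter as a per-frequency amplitude gain. The function names are hypothetical, the bandpass is idealized as a brick-wall filter, and the handling of exactly 1 kHz is an assumption, since the text does not specify it. The trapezoid-like lowpass filter is omitted because its parameters are not given here. This is not the implementation used in the study.

```python
def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)


def step_filter_gain(freq_hz, boundary_hz=1000.0, shift_db=3.0):
    """Step filter: +3 dB above 1 kHz, -3 dB below 1 kHz.

    Assumption: a component at exactly the boundary is left unchanged.
    """
    if freq_hz > boundary_hz:
        return db_to_gain(shift_db)
    if freq_hz < boundary_hz:
        return db_to_gain(-shift_db)
    return 1.0


def bandpass_gain(freq_hz, low_hz=300.0, high_hz=1200.0):
    """Idealized bandpass passing 300 to 1200 Hz.

    Assumption: brick-wall edges; a real filter has finite slopes.
    """
    return 1.0 if low_hz <= freq_hz <= high_hz else 0.0


if __name__ == "__main__":
    for f in (200, 500, 1000, 2000):
        print(f, round(step_filter_gain(f), 3), bandpass_gain(f))
```

Under these assumptions, the step filter raises components above 1 kHz by a factor of about 1.41 in amplitude and lowers components below 1 kHz by the same factor, so the two regions differ by 6 dB in total.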

Methods

Participants

Ten Japanese students (four females and six males, 18–22 years old) who reported no hearing disorders were paid to participate in the experiment. All participants gave written informed consent in accordance with the Declaration of Helsinki for their participation in the experimental protocol, which was approved by the institutional review board at The University of Tokyo.

Apparatus

Each participant’s voice was recorded in a soundproof room using a Sennheiser ME62 microphone (Sennheiser electronic GmbH & Co. KG, Germany) and a Focusrite audio interface (Scarlett 2i4, first generation model; Focusrite, UK). Audacity, downloaded from www.audacityteam.org, was used to save a digital recording of the voice. All recorded voice was digitized at 16 bit with a 44.1 kHz sampling rate. The auditory stimuli were presented at 60 dB through the same first generation Focusrite Scarlett 2i4 audio interface, used as a USB digital-to-analog converter, and MDR-XB500 headphones (SONY, Japan). The visual stimuli were presented on an LCD monitor (BenQ, China) using MATLAB R2015b (The MathWorks, Inc., USA) and the Psychtoolbox (www.psychtoolbox.org). The open-source patch DAVID (Da Amazing Voice Inflection Device) [21] for the closed-source audio processing platform Max (Cycling ‘74, USA) was used to allow participants to control auditory features of the voice in real time.

Stimuli and procedure

The experiment consisted of three sessions conducted on three separate days. In session 1, the voice was recorded and the parameters of the voice were modified. Twenty-six three-syllable Japanese words categorized as neutral were selected [30] and recorded as the stimuli. The participants pronounced the stimuli in their usual manner and were instructed not to correct their dialects. After the recording of all 26 words, the participants freely modified filters for pitch, vibrato, and frequency features of the original voice (recorded voice) such that the recording sounded like the voice that they hear when speaking (own voice). The participants were instructed in how to use the graphical user interface (GUI) for modification. The experimenter sat beside each participant and explained the use of the GUI step by step until the participant fully understood the procedure. After this training period, the participants underwent the actual experimental trials. They were allowed to take as much time as they needed until they were convinced that the adjusted voice was their own. Vocalization was neither restricted nor encouraged while the participant modified the parameters of the voice. To control for familiarity with the stimuli, six of the recorded words were used in this voice adjustment phase, and the remaining 20 recorded words were used later in the rating phase; i.e., words used in the voice adjustment phase were not used in the rating phase.

In sessions 2 and 3, the participants performed the voice rating task, and the exact same procedure was repeated in both sessions. The participants performed a two-alternative forced-choice task that involved listening to two different voice conditions and answering which voice sounded more like their own voice (Fig 2). The voice conditions that were judged were: recorded voice, step filtered voice, bandpass filtered voice, lowpass filtered voice, and adjusted-by-will voice (adjusted voice). To control for individual differences in own-voice perception and to prevent individual variability in the rating procedure, stimuli were presented as a pair to force participants to decide which of the presented stimuli sounded more like own voice. Each of the five voice conditions was paired with another condition in each trial. Combinations of the five conditions with counterbalanced presentation orders resulted in 20 pairs. Each pair was tested with the 20 words prepared for the rating phase. As a result, 400 trials were conducted in the rating phase. All 400 trials were randomized within the session. The inter-stimulus interval was fixed at 400 ms and each stimulus lasted 800 ms. Within a trial, each stimulus was presented only once without repetition. There were 10 blocks in one session, each block containing 40 trials. Participants were able to take a break between the blocks.
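The trial counts above follow from simple combinatorics: five conditions yield 5 × 4 = 20 ordered pairs, and 20 pairs × 20 words give 400 trials, split into 10 blocks of 40. The minimal sketch below reproduces that arithmetic. The names and the seeded shuffle are illustrative assumptions, not the original MATLAB and Psychtoolbox code.

```python
from itertools import permutations
import random

CONDITIONS = ["recorded", "step", "bandpass", "lowpass", "adjusted"]
N_WORDS = 20        # words reserved for the rating phase
N_BLOCKS = 10
STIMULUS_MS = 800   # duration of each stimulus
ISI_MS = 400        # inter-stimulus interval


def build_trials(seed=0):
    """Return the ordered condition pairs, the shuffled trial list, and the blocks."""
    # Ordered pairs counterbalance presentation order: (A, B) and (B, A) both occur.
    ordered_pairs = list(permutations(CONDITIONS, 2))
    trials = [(first, second, word)
              for first, second in ordered_pairs
              for word in range(N_WORDS)]
    # Assumption: a seeded shuffle stands in for the within-session randomization.
    random.Random(seed).shuffle(trials)
    block_size = len(trials) // N_BLOCKS
    blocks = [trials[i * block_size:(i + 1) * block_size] for i in range(N_BLOCKS)]
    return ordered_pairs, trials, blocks


def presentation_ms():
    """Stimulus presentation time per trial, excluding the response period."""
    return STIMULUS_MS + ISI_MS + STIMULUS_MS


if __name__ == "__main__":
    pairs, trials, blocks = build_trials()
    print(len(pairs), len(trials), len(blocks), len(blocks[0]), presentation_ms())
```

Because both presentation orders are included, each unordered pair of conditions is judged 40 times per session (2 orders × 20 words).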

Fig 2 — **Schematic of the task.** After the presentation of stimuli, participants chose which of the stimuli sounded more like own-voice by button press.

Analysis

The own-voice ratings were analyzed by a pairwise comparison method [31], which enables plotting of the scores of each condition on the same scale, so that each participant’s relative preference could be evaluated. Thurstone’s pairwise comparison method ranks the responses based on the z values calculated from the proportion of trials in which each item was chosen. For all pairs of recorded, step-filtered, lowpass-filtered, bandpass-filtered, and adjusted voices, the proportion of trials in which each stimulus was chosen as the more own-voice-like sound was calculated. The inverse function of the standard normal distribution was applied to these proportions, and the resulting z values were averaged for each stimulus. Then, each participant’s own-voice rating was schematized into a scale bar. To evaluate the consistency of the own-voice rating for a participant, Spearman’s rank correlation coefficient was also calculated across the two sessions carried out on two independent days.
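The scaling step described above can be sketched as follows, assuming a Thurstone Case V style computation: convert each choice proportion to a z value with the inverse standard normal distribution, then average the z values for each condition. The function name, the clipping of proportions of exactly 0 or 1, and the example counts are assumptions for illustration only; the text does not state how extreme proportions were handled.

```python
from statistics import NormalDist


def thurstone_scale(wins, clip=0.01):
    """Scale values from a matrix of pairwise choice counts.

    wins[i][j] is the number of trials in which condition i was chosen
    over condition j. Returns one mean z value per condition.
    """
    inv_cdf = NormalDist().inv_cdf
    n = len(wins)
    scores = []
    for i in range(n):
        z_values = []
        for j in range(n):
            if i == j:
                continue
            total = wins[i][j] + wins[j][i]
            p = wins[i][j] / total
            # Assumption: clip extreme proportions so the z value stays finite.
            p = min(max(p, clip), 1 - clip)
            z_values.append(inv_cdf(p))
        scores.append(sum(z_values) / len(z_values))
    return scores


if __name__ == "__main__":
    # Hypothetical counts for three conditions, 40 judgments per pair.
    wins = [
        [0, 30, 36],
        [10, 0, 28],
        [4, 12, 0],
    ]
    print([round(s, 3) for s in thurstone_scale(wins)])
```

Because the z value of a proportion p is the negative of the z value of 1 − p, the scale values sum to zero, so only the differences between conditions are meaningful.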

Results and discussions

We verified that voice transformation with DAVID worked as the participants intended, by analyzing the pitch of modified and non-modified speech samples using the SWIPE algorithm [32], and confirmed that actual pitch differences matched the parameter settings saved by the participants (see S1 Fig, S2 and S3 Tables).

Individual results of pairwise comparisons are shown in Fig 3 and S1 Table. The voice rated as most similar to own voice differed across participants. Two participants chose the recorded voice as most representative of own voice, and eight participants rated a modified voice as most like own voice. These individual differences in the own-voice rating indicate that there was no general filter that represented own voice. Even though each participant adjusted part or all of the pitch, vibrato, and frequency cut-off filter to sound like own voice (see S2 Table for details), only Sub 01 and 09 rated the adjusted voice as the most own-voice-like. The large number of modifiable parameters may have confused participants, resulting in prolonged adjustment times that made participants tired. It is also possible that participants, while checking their own voice, unknowingly vocalized in a way that brought it closer to the recorded voice.

Fig 4 represents the consistency of the own-voice similarity ratings across days. Six participants were consistent in both the least and the most own-voice-like conditions, two participants were consistent only in the least own-voice-representative condition, one participant was consistent only in the most own-voice-representative condition, and one participant showed no congruence. Spearman’s rank correlation coefficients calculated across participants revealed high rank correlations for the most (ρ = .899) and least (ρ = .900) own-voice ratings between the two sessions conducted on different days. The result suggests that the perception of own voice was stable to a certain extent across experimental days.
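For readers unfamiliar with the statistic, the sketch below computes Spearman’s rank correlation as the Pearson correlation of average ranks. The session values are hypothetical scale scores for five conditions from one participant and are not data from the study; the exact arrangement of the values that entered the reported coefficients is not specified here, so this only illustrates the computation.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scale values for five voice conditions in two sessions.
    session_1 = [0.8, 0.1, -0.3, -0.2, -0.4]
    session_2 = [0.7, 0.2, -0.1, -0.3, -0.5]
    print(spearman_rho(session_1, session_2))
```

In this made-up example only two adjacent conditions swap ranks between sessions, so the coefficient stays close to 1.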