PLOS One. 2021 Oct 21;16(10):e0258747. doi: 10.1371/journal.pone.0258747

Convergence in voice fundamental frequency during synchronous speech

Abigail R Bradshaw*, Carolyn McGettigan
Editor: Vera Kempe
PMCID: PMC8530294  PMID: 34673811

Abstract

Joint speech behaviours where speakers produce speech in unison are found in a variety of everyday settings, and have clinical relevance as a temporary fluency-enhancing technique for people who stutter. It is currently unknown whether such synchronisation of speech timing among two speakers is also accompanied by alignment in their vocal characteristics, for example in acoustic measures such as pitch. The current study investigated this by testing whether convergence in voice fundamental frequency (F0) between speakers could be demonstrated during synchronous speech. Sixty participants across two online experiments were audio recorded whilst reading a series of sentences, first on their own, and then in synchrony with another speaker (the accompanist) in a number of between-subject conditions. Experiment 1 demonstrated significant convergence in participants’ F0 to a pre-recorded accompanist voice, in the form of both upward (high F0 accompanist condition) and downward (low and extra-low F0 accompanist conditions) changes in F0. Experiment 2 demonstrated that such convergence was not seen during a visual synchronous speech condition, in which participants spoke in synchrony with silent video recordings of the accompanist. An audiovisual condition in which participants were able to both see and hear the accompanist in pre-recorded videos did not result in greater convergence in F0 compared to synchronisation with the pre-recorded voice alone. These findings suggest the need for models of speech motor control to incorporate interactions between self- and other-speech feedback during speech production, and suggest a novel hypothesis for the mechanisms underlying the fluency-enhancing effects of synchronous speech in people who stutter.

Introduction

Synchronised vocal behaviours are ubiquitous across a variety of everyday settings and can be found in every culture [1]. The production of speech in unison by a group of speakers can be observed in as diverse settings as places of worship, schools, sports stadiums, protest marches, military parades and political rallies. Often such behaviours serve to promote social cohesion and bonding amongst their participants [2]. However, as well as studying joint speech behaviours at the level of the goals and intentions of the collective, we can also consider their impact on lower-level processes within the individual; specifically, for the focus of this paper, how might synchronised speech affect the engagement of different systems for speech motor control?

A clue that synchronised speech behaviours may radically change the nature of the mechanisms underlying motor control of speech is found in the observation of their powerful efficacy as a temporary fluency enhancer for people who stutter. The speech of people who stutter is characterised by frequent dysfluencies that interrupt the smooth flow of speech, such as syllable repetitions, prolongations, and blocks (tense pauses in between speech sounds). Strikingly, reading a prepared text in synchrony with another speaker (here termed synchronous speech; also referred to as “choral speech” or “choral reading” in this literature) can temporarily reduce the occurrence of such dysfluent events by 90–100% in most individuals who stutter [3]. Disruption to the processing of sensory feedback from the self-voice has long been theorised to play a role in stuttering [for a recent review, see 4]. However, in synchronous speech, the brain simultaneously receives auditory feedback from the self-voice and the voice of the synchronisation partner (known as the accompanist). It is of interest to consider how this concurrent speech feedback might affect sensorimotor integration processes during speech motor control, in both typically fluent individuals and people who stutter.

It is already known that when engaged in a dialogue, interlocutors show a tendency to align with each other across multiple linguistic levels. For example, speakers may converge in their choice of syntactic structures and lexical wordforms during conversational interactions [5]. As well as this alignment in the content of speech, convergence can also be seen in the voices of the interlocutors; that is, they can start to sound more similar in terms of the way in which they produce speech sounds [for a review, see 6]. This vocal convergence has been demonstrated either by direct measurement of acoustic and phonetic voice parameters [7], such as voice fundamental frequency (F0, the acoustic correlate of pitch), formant frequencies such as F1 and F2 (spectral information in the speech signal which determines perceived vowel identity) and speech rate; or through the use of perceptual judgements by independent listeners in AXB designs [8]. In the latter, an external set of listeners are asked to judge whether a participant’s utterance during an interaction (stimulus A) sounds more similar to the model utterance from their interlocutor (stimulus X), compared with that participant’s baseline utterance when speaking alone (stimulus B). If listeners tend to choose A more often than B as the stimulus most similar to X, this suggests that the participant’s speech has moved perceptibly toward that of the interlocutor when speaking with them. This vocal convergence has been studied across a range of interactive contexts, including conversations [9], shared-reading tasks [10], and interactive verbal games [11]. Such behaviour has traditionally been interpreted within the conceptual framework of Communication Accommodation Theory [12], which views such responses as a socially motivated strategy for establishing social affiliation with an interlocutor.
Accordingly, there is evidence that social factors can influence the extent to which interlocutors converge or even diverge during interactions. For example, greater vocal convergence has been reported when speakers perceive their partner as more attractive [13], or of a higher social status than themselves [14]; conversely, greater divergence was reported when speakers interacted with an insulting interviewer [15].

An alternative theoretical framework for understanding vocal convergence effects is the Interactive Alignment model [5, 16]. This views such convergence as a subconscious process, driven by an automatic priming mechanism. The latest version of the theory [16] proposes that while listening to the speech of an interlocutor, a listener engages in covert imitation of that speech; the listener then uses a ‘simulation route’ to derive forward models that predict their partner’s upcoming speech utterances based on experience of their own speech actions. This covert imitation tends to also co-activate representations within the production system, which can then spill over into the speaker’s own productions resulting in overt imitation. This automatic priming is proposed to occur across all levels of linguistic representation, including semantics, syntax and phonology. The resulting convergence in the speech of the interlocutors is thought to then facilitate mutual understanding between them.

In support of this view of convergence as a more low-level and unconscious process, vocal convergence effects have been reported across a range of ‘non-social’ contexts, such as vowel repetition or speech shadowing tasks, in which participants listen to words produced by a model speaker and are asked to produce each word they hear as quickly as possible [8, 17–20]. Sato et al. [18] asked participants to produce vowel sounds, first cued by orthographic targets, then acoustic targets, and finally by orthographic targets again. Participants were found to show convergence in F0 and F1 towards the acoustic targets, which was sustained during the second presentation of the orthographic targets. These ‘after-effects’ were interpreted as evidence of offline recalibration of the sensory-motor targets that drive speech production. The authors proposed that these convergence effects should be viewed within the same framework as sensorimotor accounts of speech motor control, in which forward models of the intended or predicted sensory outcomes of speech movements are compared to actual sensory feedback [21–23]. Specifically, they argued that these models should additionally incorporate the influence of external speech inputs on these processes; that is, the external speech environment leads to adaptive changes in the sensory goals for speech, which in turn can result in imitative changes in speech productions.

Further evidence in support of these ideas is found in studies that demonstrated effects of external speech inputs on responses to perturbations of self-produced auditory feedback. Such perturbations can be made in real-time to the F0 or formants of the auditory feedback a speaker receives from their own voice during speech; these typically trigger opposing changes in the speech productions of the speaker in order to compensate for the perceived sensory error [24, 25]. The extent of such speech motor learning has been shown to be affected by both explicit perceptual training with another voice that aimed to shift perception of a phoneme category boundary [26], and by implicit perceptual learning processes triggered by mere exposure to another voice that cued a participant’s own speech production [27]. Therefore, other voices can affect both the vocal characteristics of a speaker, and the responsiveness of their speech motor control system to experimentally induced ‘sensory errors’ in their speech feedback.

This evidence highlights the need for speech motor control research to move beyond studies in which participants speak on their own, to paradigms that try to capture the true dynamics of the interactive contexts in which speech is typically used in everyday life. This challenge to the prevailing approach in cognitive psychology of studying individual minds in isolation was made more broadly by the ‘Joint Action’ framework, which argues that a comprehensive understanding of mind and behaviour requires the study of the coordinated actions that are so prevalent in everyday interactions among individuals [28, 29]. In order to extend models of speech motor control to account for sensorimotor processes in such interactive contexts, we need more evidence on how control of an individual’s speech is affected by the speech and voices of talkers with whom they are interacting.

It is thus of interest to consider the effects on speech motor control of a synchronous external speech input. Specifically, does speaking in synchrony with an accompanist have any influence on the voice of a speaker, beyond the obvious changes in speech timing? If so, do these changes in the voice of the speaker relate systematically to those of the accompanist voice? In samples of people who stutter, it has been shown that the fluency enhancing effects of synchronous speech rely on the presence of spectral information within the accompanist speech signal. A study by Rami, Kalinowski, Rastatter, Holbert and Allen found that synchronous speech with a speech signal that was low-pass filtered (at 100Hz) so as to remove formants but retain glottal source information (including F0) was not effective at reducing stuttering frequency; conversely, a speech signal low-pass filtered to include the source and only partial spectral cues (either F1 only or F1 and F2) was sufficient to induce fluency during synchronous speech [30]. This suggests that, at least in people who stutter, spectral information in the accompanist voice has an effect on speech motor control and the sound of the speaker’s voice. Other research with samples of typically fluent speakers has reported that synchronous speech reduces variability in the pitch, intonation, amplitude and vowel duration of produced speech [31–34]. However, crucially, it is unknown to what extent these changes in speaking style represent convergence with the acoustic characteristics of the accompanist voice. That is, it is possible that any changes in the acoustic features of the speaker’s voice during synchronous speech may simply be a by-product of the process of synchronising the timing of one’s speech with any external stimulus, be that another voice or a non-speech stimulus (e.g. a metronome).

Overall, this study aims to bring together these different strands of findings in the literature, in order to provide insight into the influence of external speech inputs on speech motor control. Firstly, as outlined above, we know that synchronous speech can result in changes to the speech of a talker, be that a reduction in dysfluency in people who stutter, or reduced variability in speaking style in typically fluent speakers. However, we currently have no insight into the acoustic specificity of these changes in relation to the acoustic/phonetic properties of the other voice. Secondly, it is known that pairs of talkers tend to converge in speaking style during speech tasks, showing changes in their speech that are specific to the acoustic/phonetic properties of their interlocutor’s voice. So far, however, this has not been investigated for synchronous speech, and so it is yet unknown whether convergence effects would generalise to this task. Thirdly, it is known that perturbations of simultaneous self-voice feedback induce compensatory and adaptive changes in the acoustics of a talker’s speech productions that are specific to the precise perturbation applied. These responses have been interpreted within theoretical frameworks as evidence for the prioritisation of auditory feedback for speech sensorimotor control. However, such studies, and the models they inform, have never incorporated the effects of simultaneous other-voice feedback on speech sensorimotor behaviours. It is thus currently unknown whether these responses and underlying processes are specific to self-voice feedback, or whether they might also be employed for responding to other-voice feedback during speech motor control.

In order for us to advance our understanding of these effects, the current study aimed to test whether 1) speakers given simultaneous other-voice feedback during speech (by speaking in synchrony with another talker) show changes in their vocal acoustics, and 2) whether these changes are specific to the external acoustic feedback, rather than being driven by some other aspect of speech synchronisation. In order to do this, experiments are needed which test for acoustic convergence during synchronous speech that is specifically dependent on the auditory properties of the accompanist talker. In Experiment 1, we test this by measuring changes in F0 during synchronous speech, specifically testing for upward changes in F0 during synchronisation with a higher-pitched accompanist, but downward changes in F0 in participants who synchronised with a lower-pitched accompanist. Furthermore, in order to test theoretical frameworks that prioritise auditory feedback for speech motor control, it is important to confirm that these convergent changes are indeed primarily driven by acoustic properties of the other talker’s voice, and are not equally generated by other types of input like visual speech information. We test this in Experiment 2 through the inclusion of a visual-only synchronous speech condition (where participants could see but not hear the accompanist), as well as an audiovisual condition (where participants could both see and hear the accompanist). Across both experiments, we predicted that participants would show changes in F0 that were dependent on the F0 of the accompanist voice they experienced (i.e. they should increase to a high F0 accompanist, decrease to a low F0 accompanist, and remain unchanged during visual synchronous speech). Conversely, if changes in F0 are simply a by-product of synchronisation of speech timing with an external stimulus, we would expect uniform changes in F0 across these different conditions.

General methods

The Gorilla Experiment Builder (www.gorilla.sc) was used to create and host all experiments reported in this paper [35]. The online recruitment platform Prolific (www.prolific.ac) was used for participant recruitment. All participants were compensated for their time by payment of £3.75, administered via Prolific. This study received ethical approval from the local ethics officer at the Department of Speech, Hearing and Phonetic Sciences at University College London (approval no. SHaPS-2019-CM-030). All participants gave informed consent prior to taking part in the study.

Data, analysis scripts and stimuli for each of the two experiments reported here are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

General procedure

All experiments began with a headphone screening task, to ascertain that participants were wearing headphones and listening in a quiet environment [36]. This task makes use of a perceptual phenomenon called ‘Huggins pitch’, an illusory pitch percept which relies on dichotic presentation of stimuli. Specifically, each ear receives the same white noise stimulus, but with a phase shift of 180° in a narrow frequency band in one channel. When wearing headphones this results in perception of a faint pitch amongst noise; conversely this percept is not generated when the stimuli are played over loudspeakers. On each trial, participants are asked to detect which of three white noise stimuli contains the hidden tone. Participants who failed to reach criterion performance on this task were not permitted to proceed to the main study (see Data Exclusion). For Experiment 1 this criterion was set to a score of 6/6; for Experiment 2 this criterion was slightly relaxed to a score of at least 5/6, in an effort to reduce the high level of rejection of participants who reported they were in fact wearing headphones.
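For readers unfamiliar with the technique, the dichotic construction described above can be sketched in Python/NumPy. This is an illustrative sketch only; the band limits, duration, and function name are our assumptions, and the actual screening stimuli are those of the published test [36]:

```python
import numpy as np

def huggins_stimulus(fs=44100, dur=1.0, f_lo=580, f_hi=620, seed=0):
    """Stereo Huggins-pitch stimulus: both ears receive the same white
    noise, except for a 180-degree phase shift in a narrow frequency
    band in one channel (heard as a faint pitch only over headphones)."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    noise = rng.standard_normal(n)
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    shifted = spec.copy()
    shifted[band] *= -1          # multiply by -1 = 180-degree phase inversion
    left = noise
    right = np.fft.irfft(shifted, n)
    return np.stack([left, right])
```

Because the phase inversion leaves the magnitude spectrum untouched, each channel heard alone is indistinguishable from plain noise; the pitch percept arises only from the interaural phase difference, which is why it cannot survive the acoustic mixing of loudspeaker playback.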

The first experimental task for all experiments was a solo reading task, which required participants to read aloud a series of sentences presented on the screen while their speech was audio recorded. This was treated as a ‘baseline’, in order to measure the participant’s F0 when not exposed to another voice/engaging in synchronous speech. This was followed by a synchronous speech task, in which participants were again audio recorded while speaking the same set of sentences, this time in synchrony with another person (the accompanist). The experiment ended with a short debrief, in which participants were asked to rate how well they synchronised with the other voice (from 1-not at all to 5-perfectly), and to report whether they noticed anything about that voice. They were further asked what type of headphones they were using, and to rate how loudly they could hear both the other voice and their own voice on a scale of 1 (very quiet) to 5 (very loud). The data from this debrief questionnaire for each of the experiments can be found in the S1 and S2 Files.

Audio recording

Due to the online nature of this study, we had a limited amount of control over the recording set-ups of our participants, in terms of the type of device and microphone used, and the background environment (e.g. level of noise). This is a clear limitation of the current work, and reflects the fact that this research was conducted during the global COVID-19 pandemic, in which in-person testing was not possible.

Variability in the technological device and software application used for speech recording can affect the measured acoustic signal, via factors such as the type of compression used, the use of filters by different software programmes and differing sampling rates. A small number of studies have begun to provide evidence on the effects of such variation in remote recording set-ups on measurement of acoustic and phonetic parameters in speech recordings [37, 38]. In general, these indicate that identification of contrasts within speakers such as vowel arrangements tends to be fairly robust across different remote recording set-ups compared to gold-standard laboratory audio recordings; however, measurement of absolute raw acoustic/phonetic parameters such as frequency and duration measures could be systematically affected by the device or software used, sometimes in vowel-specific ways. This was particularly found to be the case for higher frequencies such as measurement of F2. The current study focused on measurement of F0 in speech recordings, which was reported by Sanker and colleagues to not significantly differ across different remote recording set-ups when averaging across vowels (although there were vowel-specific differences) [37]. Their study also found that variability associated with differences in the software application used for recording (e.g. Zoom, Skype, Facebook messenger) was greater than that associated with variability in the recording device (e.g. whether a Mac or PC was used). Based on the findings and recommendations of their study, we implemented a number of methodological decisions in order to mitigate the potential effects of participants’ idiosyncratic recording set-ups.

Firstly, the software application that was used to collect the audio recordings was held constant across participants. Recordings were collected via Gorilla audio recording software, which is powered by the WebRTC (Real Time Communication) API within the browser. This software uses the default settings within the browser it is run on, and does not implement any additional algorithms, compression or functions on the recorded data. Audio recordings were saved as MP3 files. Participants were constrained to the use of a laptop or desktop computer to complete the study; completion using mobile phones or tablets was not allowed. The operating system and browser used by each participant across both experiments can be found in S1 and S2 Files. Furthermore, most analyses were based on within-participant comparisons; that is, F0 measurements were compared across solo reading and synchronous speech tasks within participants. The effects of the recording set-up would therefore have been uniform across the conditions used in these contrasts; e.g. if F0 was overestimated in the recordings, this would be uniform across both solo reading and synchronous speech, and so relative comparisons between these conditions would be unaffected. Additionally, when between-group comparisons were made (e.g. between high F0 and low F0 accompanist conditions), a random effect of participant was included in analyses. All audio recordings were individually checked for each participant; persistent/excessive background noise across recordings resulted in exclusion of that participant’s data (see Data Exclusion).

Acoustic analysis

Audio recordings from both experiments were analysed using a custom-made script within the software package Praat [39]. This script first isolated the voiced segments of the acoustic signal (by removing pauses and unvoiced consonants) and then took the median F0 value (in Hz) of each sentence. This was run on each participant’s audio recordings from the solo reading and synchronous speech tasks.
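The logic of this analysis step (discard unvoiced frames, then take the median F0 of what remains) can be sketched in Python. This is an illustrative stand-in, not the authors' Praat script; it assumes a frame-wise pitch track has already been extracted, with unvoiced frames coded as 0 Hz (Praat's convention):

```python
from statistics import median

def sentence_median_f0(f0_track):
    """Median F0 (in Hz) of one sentence, given a frame-wise pitch track
    in which unvoiced frames (pauses, unvoiced consonants) are coded as 0."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in this recording")
    return median(voiced)
```

The median (rather than the mean) makes the per-sentence estimate robust to occasional pitch-tracking errors such as octave jumps.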

Experiment 1

The aim of Experiment 1 was to investigate whether participants showed convergent changes in the F0 of their produced speech towards the F0 of an accompanist’s voice during synchronous speech. In a between-subjects design, participants encountered an accompanist with either an unusually high F0 or an unusually low F0, in order to examine potential convergent changes in F0 in both upward and downward directions. Initially, two conditions were tested corresponding to these high and low F0 conditions. After this data was collected and analysed, a further ‘extra-low F0’ condition was tested, in which participants synchronised with an accompanist voice with an even lower F0. This was added after observing the distribution of solo reading baseline F0 in the initial sample tested with our a priori designed conditions (see Results). For simplicity, we report here on the results of a single data analysis incorporating all three conditions.

Participants

Twenty female participants (mean age = 28.65 years) took part in the initial high F0 and low F0 conditions, with an equal number of participants (10) in each condition. After data collection and analysis, a further 10 female participants (mean age = 28.2 years) were recruited to take part in an additional extra-low F0 condition. These numbers reflect the final participant samples included in analyses (see Data Exclusion). All participants in all three conditions were native speakers of English, with most of the sample being of British nationality (one participant was Canadian and one American).

Evidence on the effects of speaker gender on variability in the extent of vocal convergence observed is inconclusive [6, 9, 17, 40], but has led many studies to restrict their samples to female speakers only; we therefore similarly opted to only recruit female participants for our experiment so as to have same-sex pairs with our female accompanist voice (see Stimuli).

Design

The solo reading task (our baseline task) consisted of 50 trials, with 50 trial-unique sentences. Participants were instructed to start reading each sentence after a visual 3-2-1 countdown presented on the screen, with an interval of 1 second between each number. Participants had 9 seconds to read the sentence (including time taken by the countdown), followed by an inter-trial-interval of 2 seconds. The synchronous speech task asked participants to read the same sequence of sentences, this time in synchrony with audio recordings of another voice (the accompanist). A trial began with three metronome clicks (with an interstimulus interval of 1 second between clicks) followed by the accompanist voice speaking the sentence; participants were told to use the three-click countdown to help them to start speaking at the same time as the accompanist. Again, participants were given 9 seconds to read the sentence, followed by an inter-trial-interval of 2 seconds. Participants were first given 5 practice trials (with novel sentences) to practise speaking in synchrony with the other voice. After each practice trial, they were instructed to adjust their volume to a level at which they could hear both the other voice and their own voice at a loud and clear volume. Following this practice and volume calibration phase, participants were instructed not to make any further changes to their sound volume. A further 50 trials were then presented, consisting of the same set of sentences as in the solo reading task, presented in an identical order. One of these trials was designated as a vigilance trial; for this trial, when the written sentence appeared onscreen, instead of hearing the other voice read the sentence, participants unexpectedly heard the voice ask them to read the last word in the sentence. This vigilance trial was included to check that participants were attending to the audio through the headphones, and not simply reading the sentences without listening to the audio.
The audio recordings from this trial were checked for accuracy, but not included in further analyses of F0. This thus resulted in a total of 49 audio recordings of interest for each of the two tasks.

Stimuli

Audio stimuli in the synchronous speech task consisted of audio recordings of a female speaker of Standard Southern British English reading 49 sentences taken from the Harvard IEEE corpus of sentences [41]. The full set of sentences is given in S1 Table. These sentences had an average length of 8 words, and are designed to be phonetically balanced. The tokens were recorded using the internal microphone of a MacBook Air and the software programme Audacity [42]. These audio recordings were matched for sound intensity via RMS norming in Praat [39]. A custom-made script in Praat [39] was used to shift the F0 of the voice in these recordings either up or down to create separate stimulus sets for the high, low and extra-low F0 conditions. Stimuli for the high F0 accompanist condition were created by shifting F0 up by 2 semitones; stimuli for the low F0 accompanist condition were created by shifting F0 down by 5 semitones. Both these stimulus sets then underwent F0 norming to the average of the median values of sentences within that set; median F0 values of the sentences were normed to 265Hz for the high condition, and 170Hz for the low condition. To create the extra-low condition stimuli, the F0 of the recordings was first shifted down by 9 semitones. In order to preserve the perceived naturalness of the voice in the recordings, a small adjustment to formant spacing was also made to increase perceived vocal tract length using an open-source Praat script [43]; within this script, the ‘vocal tract lengthening factor’ parameter was set to 1.1, corresponding to a change of just under 2 semitones. The median F0 of these sentences was then normed to 140Hz. Audio stimuli for the five practice trials of the synchronous speech task (consisting of five additional unique sentences, see S1 Table) and the vigilance trial were recorded by the same speaker, and underwent the same processes of F0 normalisation so as to be in keeping with the F0 values of their respective conditions.
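For reference, the semitone manipulations above correspond to simple frequency ratios: a shift of n semitones multiplies F0 by 2^(n/12). A minimal sketch (the function name is ours):

```python
def shift_semitones(f0_hz, n_semitones):
    """Return the F0 (Hz) that results from shifting f0_hz by
    n_semitones (positive = upward, negative = downward shift)."""
    return f0_hz * 2.0 ** (n_semitones / 12.0)
```

For example, the 9-semitone downward shift used to create the extra-low stimuli corresponds to multiplying F0 by 2^(-9/12) ≈ 0.595, i.e. a reduction of roughly 40%.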

Choice of these F0 targets for the initial high and low F0 conditions was guided by data on the distribution of female voice F0s from the UCLA Speaker Variability Database [44, 45]. For the purposes of this study, we looked at data taken from this speech corpus on audio recordings from 50 female speakers of American-English reading a set of 5 of the Harvard IEEE sentences. These were repeated twice within each of three sessions, giving a total number of 30 tokens per participant. Average F0 values across these tokens were calculated for each participant, and upper and lower cut-offs obtained by taking the values two standard-deviations above and below the group average. This resulted in an upper cut-off of 240Hz and a lower cut-off of 175Hz; our values of 265Hz and 170Hz for the high and low conditions were chosen to be outside of these cut-offs, and therefore were expected to be substantially higher/lower than the baseline F0s of our female participants during solo reading. However, on observing the distribution of average F0 values found in our sample during the solo reading baseline task (see Results), we decided to test an additional sample of participants with the extra-low F0 condition outlined above, with an even lower accompanist median F0 of 140Hz.
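The cut-off computation described above (two standard deviations above and below the group mean of per-speaker average F0) can be sketched as follows; the function name is ours, and any input would be the per-speaker average F0 values in Hz rather than the UCLA database itself:

```python
from statistics import mean, stdev

def f0_cutoffs(speaker_means, n_sd=2.0):
    """Lower and upper cut-offs (Hz): n_sd sample standard deviations
    below and above the group mean of per-speaker average F0 values."""
    m = mean(speaker_means)
    sd = stdev(speaker_means)  # sample (n-1) standard deviation
    return m - n_sd * sd, m + n_sd * sd
```

Under a roughly normal distribution, targets placed outside these cut-offs would be expected to lie above or below the baseline F0 of about 97.5% of speakers, which is the rationale for choosing 265Hz and 170Hz beyond them.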

Data exclusion

During data collection, various checks on data quality and task performance were conducted to exclude problematic participants, who were then immediately replaced. Firstly, participants who failed the headphone check were not permitted to proceed to the main study. Across all three conditions, a total of 15 participants failed this headphone check and were immediately replaced. Of those who passed the headphone check, a further 13 participants were excluded prior to data analysis, either because they failed the vigilance trial (2 participants), had poor quality audio recordings (e.g. excessive background noise) that affected pitch tracking (6 participants), or because the accompanist voice was audible in their audio recordings (e.g. due to headphones with poor sound insulation) and so interfered with pitch tracking of the participant’s voice (5 participants). Again, all of these 13 participants were replaced, so as to achieve the target sample size of 10 participants per condition.

Data from individual trials within a participant were excluded if the participant failed to read the sentence correctly (e.g. made a large speech error or missed the sentence completely) or if there was excessive background noise on that trial that affected pitch tracking. Further, within each task, trials in which F0 values were more than 3 standard deviations from the mean were excluded. These criteria resulted in exclusion of 3.40% of trials across the whole sample.

Measures and hypotheses

A measure of F0 change from baseline was calculated for each participant by taking the difference in semitones between their F0 values during solo reading and during synchronous speech on a sentence-by-sentence basis, and then averaging across these sentence-wise difference values. This measure preserves the sign of the differences, and so indicates whether the change was negative (F0 decreased during synchronous speech) or positive (F0 increased during synchronous speech).
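As a concrete sketch, assuming the standard semitone conversion (12 times the base-2 logarithm of the frequency ratio), the measure could be computed as:

```python
import math

def f0_change_semitones(solo_f0s, sync_f0s):
    """Mean sentence-wise F0 change from solo reading to synchronous speech.

    solo_f0s, sync_f0s: matched per-sentence F0 values in Hz for one
    participant. Positive output means F0 rose during synchronous speech;
    negative means it fell. Illustrative code, not the original script.
    """
    diffs = [12 * math.log2(sync / solo)
             for solo, sync in zip(solo_f0s, sync_f0s)]
    return sum(diffs) / len(diffs)
```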

Our central hypothesis across all experiments was that F0 changes during synchronous speech would be specific to the accompanist voice experienced. For this experiment, we therefore predicted that participants in the high F0 accompanist condition would show significant increases in F0 from solo reading to synchronous speech, while participants in the low and extra-low F0 accompanist conditions would show significant decreases in F0. At the group level, we predicted that this difference in the direction of F0 change would produce a significant group difference, with the predicted negative F0 change in the low and extra-low conditions being significantly lower (i.e. more negative) than the predicted positive F0 change in the high condition.

Results

Within-participant analysis of convergence

Convergence patterns shown by individual participants in each of the three conditions are shown in Fig 1; this plots the difference (in semitones) between the accompanist F0 and (i) the participant’s average F0 at solo reading and (ii) the participant’s average F0 at synchronous speech. To determine if each participant showed significant convergence towards the accompanist voice in F0, a two-sided one-sample t-test was run for each participant to compare their sentence-wise F0 change values (synchronous speech minus solo reading) with zero. These tests were used to categorise participants into convergers (significant increase in F0 at synchronous speech in the high condition; or significant decrease in F0 in the low and extra-low conditions), divergers (significant decrease in F0 at synchronous speech in the high condition; or significant increase in F0 in the low and extra-low conditions) or non-convergers (no significant change in F0). The frequencies in each category across high, low and extra-low conditions are given in Table 1; colour coding in Fig 1 is also used to indicate convergence status. Across the whole sample, 16 participants converged and 14 did not converge (no change or diverged); however, these seem to be somewhat unevenly distributed across the conditions, with slightly fewer convergers in the extra-low F0 condition.
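The converger/diverger classification can be sketched as follows. This illustrative version replaces the exact two-sided t-test p-value with a fixed critical t value (2.0, roughly the two-tailed .05 cutoff for the degrees of freedom involved); the function name and threshold are ours:

```python
import math
import statistics

def classify_participant(f0_changes, direction, t_crit=2.0):
    """Label a participant 'converged', 'diverged' or 'no change'.

    f0_changes: sentence-wise F0 change values (semitones, sync minus solo).
    direction: +1 if convergence requires raising F0 (high F0 accompanist),
               -1 if it requires lowering F0 (low/extra-low accompanists).
    """
    n = len(f0_changes)
    m = statistics.mean(f0_changes)
    se = statistics.stdev(f0_changes) / math.sqrt(n)
    t = m / se  # one-sample t statistic against zero
    if abs(t) < t_crit:
        return "no change"
    return "converged" if (t > 0) == (direction > 0) else "diverged"
```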

Fig 1. Individual participant convergence patterns.

Each graph plots for each participant (one per row) the difference in semitones between the accompanist voice mean F0 (represented by the filled circles at zero) and (i) the participant’s mean F0 at solo reading (shown in the filled triangles) and (ii) the participant’s mean F0 at synchronous speech (shown in the empty triangles). Symbol colour represents whether each participant demonstrated significant convergence, divergence, or no change in their F0 between solo reading and synchronous speech, as determined using one-sample t-tests on their F0 change values (see Table 1). The three conditions (high F0, low F0 and extra-low F0) are plotted on separate graphs (A, B and C).

Table 1. Frequencies of convergers, divergers and non-convergers (no change in F0) in high, low and extra-low F0 groups from Experiment 1.

Group        High F0 condition    Low F0 condition    Extra-low F0 condition
Converged    6                    6                   4
Diverged     2                    2                   3
No change    2                    2                   3

Group analysis of convergence

To check the comparability of our three groups, we first checked that the average F0 of participants during the solo reading baseline task did not significantly differ between the groups (i.e. before they were exposed to our experimental manipulation in the synchronous speech task). A one-way ANOVA found that average F0 at solo reading baseline did not differ between groups (F(2, 27) = 0.087, p = .917). Average F0 at solo reading baseline ranged from 163.74Hz to 243.79Hz in the group that went on to experience the high F0 condition (M = 193.86, SD = 25.28) and from 170.01Hz to 262.94Hz in the group that went on to experience the low F0 condition (M = 198.21, SD = 26.07). From these descriptives, it can be seen that the distribution of F0 values measured during solo reading baseline in the low F0 group overlaps with the average F0 of the accompanist voice to which they were subsequently exposed in the synchronous speech task (170Hz). Since it was our aim to use accompanist voices whose average F0 was far away from that of our sample at solo reading, this motivated the design of our additional extra-low F0 accompanist condition. In this condition, the accompanist voice’s average median F0 (140Hz) was substantially lower than the distribution observed in our initial sample. In the third group tested with this extra-low condition, average F0 at baseline solo reading ranged from 176.11Hz to 222.51Hz (M = 195.52, SD = 18.54), and thus did not overlap with the average F0 of the accompanist voice they subsequently experienced.
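The baseline comparability check is a standard one-way between-subjects ANOVA on per-participant baseline F0 means. A minimal sketch of the F statistic (the paper presumably used standard R routines; this implementation is ours):

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for a one-way between-subjects ANOVA.

    groups: one list of per-participant baseline F0 means (Hz) per condition.
    Degrees of freedom as reported in the paper: between = k - 1,
    within = N - k (here F(2, 27) for 3 groups of 10).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```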

Trial-by-trial values for the change in F0 from baseline (synchronous speech minus solo reading) are shown for high, low and extra-low conditions in Fig 2; group averages across the whole experiment are shown in Fig 3. It can be seen in Fig 3 that one participant in the high condition demonstrated a noticeably greater change in F0 compared to the group mean (increase of 6.13 semitones); this was within three standard deviations of the group mean, and since no a priori criteria were set for outlier detection, this participant’s data were included in analyses. To test whether the direction of F0 change from solo reading to synchronous speech differed significantly between high and low/extra-low conditions, a linear mixed modelling (LMM) analysis was performed using the lmerTest package in R [46]. A random intercept model was fitted on F0 change values, with a fixed effect of condition (high, low and extra-low) and random intercepts of sentence and participant. A random slope of condition by sentence was not included due to model convergence issues. The model showed that F0 change was significantly lower in the low (β = -1.62, t(30.01) = -2.67, p = .012) and extra-low conditions (β = -1.25, t(30.01) = -2.06, p = .048) compared to the high condition (the reference condition in the model; intercept β = 0.99, t(30.14) = 2.32, p = .027). These significant effects reflect the difference in the direction of F0 change between conditions; that is, as predicted, F0 change from solo reading baseline was positive in the high condition, but negative in the low and extra-low conditions. This suggests that the acoustic properties of the accompanist’s voice (specifically, F0) did affect the nature of the change in the participants’ voices during synchronous speech.

Fig 2. F0 change across trials in high, low and extra-low conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) for each trial of the task, averaged across participants in high, low and extra-low conditions. Shaded areas around lines indicate standard error.

Fig 3. Average F0 change in high, low and extra-low conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) averaged across trials for participants in the three conditions of Experiment 1. Dots indicate individual participant averages, thick line indicates group means, boxes indicate standard errors.

Discussion: Experiment 1

Overall, Experiment 1 found evidence of significant convergence in F0 to an accompanist voice during synchronous speech, both when convergence required a raising of F0 (high F0 accompanist condition) and when it required a lowering of F0 (low and extra-low F0 accompanist conditions). That is, changes in F0 observed during synchronous speech were specific to the acoustic properties of the accompanist voice experienced by participants, arguing against the idea that these F0 changes might simply result from the act of synchronisation itself (which would have predicted a uniform increase or decrease in F0 across all conditions).

As previously explained, it was necessary to design and test post-hoc an ‘extra-low’ F0 accompanist condition in this experiment in addition to the originally planned high and low F0 conditions. This was because the average F0 of our initial low F0 accompanist voice overlapped with the distribution of ‘baseline’ F0 values within our sample as measured during solo reading. Previous research has suggested that the initial acoustic distance between two speakers can affect the degree of vocal convergence observed, with most studies reporting that convergence is facilitated by a greater distance at baseline [47–49], although the reverse pattern has also been reported [10, 50]. More recently, it has been argued that the former apparent relationship may in fact be an artefact of the way in which convergence is typically calculated, specifically in relation to the use of a ‘difference-in-distance’ measure (comparing the change in absolute distance between a participant and a model talker from before to after exposure) [51, 52]. In general, Priva and Sanker [51] argued that measurement of convergence can be unreliable when participants’ baseline values are close to the model talker, as the influence of random variability within an individual is likely to overshadow genuine convergent changes, leading to underestimation of convergence or even apparent divergence. This point thus supports our rationale for the inclusion of our extra-low F0 condition, to ensure there was sufficient room for downward convergent changes from solo reading baseline to synchronous speech to be observed. On the other hand, others have suggested that some degree of overlap between the distributions of productions from two interlocutors is necessary for facilitating convergence, so that randomly produced matches between talkers are likely to occur and be reinforced; this was supported by evidence from both model simulations and experimental data [53].

The rationale for including conditions designed to induce both upward and downward convergent shifts in F0 was to determine whether measured changes in F0 during synchronous speech might simply be driven by the act of speech synchronisation itself. That is, it is possible that the act of adjusting the timings of one’s speech so as to be in synchrony with an external stimulus could itself induce systematic global changes in F0. There is evidence that speaking conditions associated with increased effort can lead to global changes in F0. For example, when speaking in adverse listening conditions (e.g. background noise or babble), speakers adopt a ‘clear speech’ speaking style that is accompanied by global changes in F0; Hazan and colleagues [54] reported global increases in mean F0 that did not serve to enhance specific phonological contrasts, suggesting they could reflect a by-product of increased speaking effort. In the field of vocal convergence, a study by Kappes, Baumgaertner, Peschke and Ziegler [55] found reduced convergence in F0 in a speech shadowing task (where participants had to repeat non-words with as minimal a delay as possible, even during the ongoing stimulus presentation) compared to a delayed repetition task; they attributed this to an overall tendency to increase F0 due to increased speaking effort in the shadowing task. In a similar way, synchronous speech could have a global effect on F0 due to the increased effort associated with timing the pace of one’s speech with an external stimulus.

Our demonstration in Experiment 1 that F0 changes during synchronous speech appeared specific to the acoustic characteristics of the accompanist voice experienced goes some way to addressing this concern. However, a more direct test of this idea can be achieved through the use of a visual synchronous speech condition. In such a task, speakers are required to produce sentences in synchrony with silent videos of another talker; this thus preserves the synchronisation aspect of the standard synchronous speech condition, while removing the acoustic input from the accompanist voice. We therefore ran a second experiment that compared F0 changes between such a visual synchronous speech condition and an audio-only synchronous speech condition (identical to the high F0 condition of Experiment 1), in order to provide a stricter test of the hypothesis that changes in F0 during synchronous speech depend on the acoustics of the accompanist voice. By testing for significant changes in F0 during visual synchronous speech, this experiment will thus provide greater clarity on the interpretation of the F0 changes observed in Experiment 1. For example, if visual synchronous speech also results in significant increases in F0, these global F0 changes may have contributed to the significant upward shifts observed in the high F0 condition of Experiment 1.

In addition to comparing visual-only and auditory-only synchronous speech conditions, Experiment 2 further tested a group of participants with a combined audiovisual synchronous speech condition. This allowed us to ask whether being able to see as well as hear the person you are synchronising with would affect convergence to the accompanist voice. Previous research has demonstrated that in some circumstances, visual speech information can enhance vocal convergence to auditory speech over listening to that speech alone. Dias and Rosenblum [56] reported that being able to see as well as hear an interlocutor enhanced convergence during a live interactive search task. Conversely, a follow-up study by the same authors found that visual enhancement of convergence during a non-interactive speech shadowing task was only observed when auditory targets were presented in low-level noise [57]. Thus, enhancement of convergence by audiovisual cues may require either live interaction between speakers, or failing that, challenging auditory conditions. We aimed to investigate this in the context of synchronous speech, by comparing F0 changes across audiovisual and audio-only conditions. Since our speech synchronisation task involves neither live interaction between interlocutors, nor challenging auditory conditions, we might predict based on evidence from Dias and Rosenblum that we would not see enhancement of convergence in our audiovisual condition. Alternatively, synchronising one’s speech with a pre-recorded accompanist that can be seen as well as heard may nevertheless increase participants’ perception of a social interaction taking place, resulting in enhanced convergence.

Methods: Experiment 2

Participants

Thirty female participants (mean age = 27.77 years) took part in this experiment. All participants were native speakers of British English. For this experiment, we only recruited participants who had been born and currently resided in the South East of England, in order to recruit a sample whose accent would match that of the accompanist (see Stimuli). An equal number of participants took part in the three synchronous speech conditions (10 in the audio-only condition, 10 in the visual-only condition and 10 in the audiovisual condition). These numbers reflect the final participant samples used in analyses (see Data Exclusion).

Design

Three synchronous speech conditions were tested in this experiment: an audio-only condition in which participants synchronised their speech with a pre-recorded voice (as in Experiment 1); a visual-only condition in which participants synchronised their speech with silent videos of another person speaking; and an audiovisual condition in which participants synchronised their speech with the same videos including access to the audio channel (i.e. the accompanist voice).

Some modifications to the solo reading and synchronous speech tasks were necessary in order to create matched visual-only, audio-only and audiovisual conditions. Primarily, in order to speak sentences in synchrony with videos of a person speaking, the participant cannot read a written sentence during synchronous speech; in the visual-only condition with silent videos this would completely disrupt synchronisation, while in the audiovisual condition it would likely lead to participants synchronising with the audio only and ignoring the video. Instead, visual synchronous speech requires participants to produce sentences that have been previously memorised. To achieve this, the number of sentence tokens was reduced to three sentences from the set used in Experiment 1 (see S1 Table). In the solo reading task, participants produced 15 repetitions of each of these three sentences in a pseudo-random order (in which the same sentence could not appear more than three times in a row), giving a total of 45 trials. For this task, the written sentence remained on-screen throughout the trial. All timings were identical to those of the solo reading task in Experiment 1. Participants were instructed to read the sentences after hearing a three-click countdown. In the synchronous speech task, the number and order of sentences to be spoken was identical to the solo reading task; however, this time the sentence was presented onscreen for three seconds before disappearing. The participant then heard the three-click countdown and had 9 seconds to speak the sentence in synchrony with the accompanist. Additionally, after each synchronous speech trial the participant was asked to report whether the sentence they saw/heard the accompanist speak was the same as the sentence they had been cued to say. This was to assess performance on two randomly occurring vigilance trials, in which the accompanist spoke a different sentence to the one the participant had been cued to speak (see Stimuli). Participants had up to 5 seconds to report whether there was a mismatch (yes/no) before the next trial began with the presentation of the next sentence onscreen. Participants were given 5 practice trials with these three sentence tokens before completing 45 trials of synchronous speech.

Stimuli

Stimuli for the three synchronous speech conditions were adapted from three videos of a female speaker of Standard Southern British English. In each video, the speaker read one of three sentences taken from the larger set used in Experiment 1. Stimuli for the visual-only condition were created by removing the audio from these videos. Stimuli for the audio-only condition were created by extracting the audio from these videos, and applying the same pitch shifts as in the high F0 condition of Experiment 1, resulting in an average median F0 of 265Hz for the accompanist. Participants were presented with a fixation cross on the screen for the duration of the audio stimuli. Stimuli for the audiovisual condition were created by recombining the modified audio stimuli from the audio-only condition with the video stimuli. Additionally, vigilance trial stimuli were created for each condition. In these vigilance trial stimuli, the sentence spoken by the accompanist was different to the sentence the participant had been cued to speak. For the visual-only condition, two silent videos of the accompanist speaking two additional sentences selected from the larger set used in Experiment 1 were used as mismatching trials. For the audio-only condition, the corresponding audio from these additional videos was used (again with pitch modifications to norm median pitch to 265Hz). For the audiovisual condition, two types of vigilance stimuli were created: one token in which the video matched the cued sentence but the audio was mismatching; and one token in which the audio matched the cued sentence but the video was mismatching. This allowed us to identify any participants who were relying on one modality only for synchronisation (e.g. synchronising with the voice only while ignoring the concurrent videos). All stimuli in all three conditions began with the same three-click countdown before commencement of speech.

Data exclusion

Twenty-eight participants failed the headphone check and so were not permitted to proceed to the main study. Of those who passed the headphone check, a further 13 participants were excluded, either due to poor quality audio recordings (5 participants), the audible presence of the accompanist voice in the audio recordings (2 participants), incorrect responses on one or more of the vigilance trials (5 participants), or a failure to follow task instructions (1 participant, who spoke in time with the three metronome clicks in the solo reading task instead of waiting to speak after the countdown). All these participants were replaced, to ensure a final sample size of 30 participants, with 10 participants per synchronous speech condition.

Exclusion of individual trials within a participant’s data was again made if the participant made a large speech error or if the F0 value for that trial was more than three standard deviations away from the mean for that participant on that task. These criteria resulted in exclusion of 4.44% of trials across the whole sample.

Measures and hypotheses

The same measure of F0 change as used in Experiment 1 was calculated for this experiment. Again, our predictions for this experiment stem from our central hypothesis that F0 changes across conditions will be specific to the accompanist voice experienced. For this experiment, we therefore predicted that participants in the audio-only and audiovisual conditions would show significant increases in F0 from solo reading to synchronous speech (to converge to the high F0 accompanist). Conversely, we predicted that participants in the visual-only condition would show no significant change in F0 between these two tasks. This should therefore result in a significant difference in F0 change between the visual-only condition and the two conditions containing audio. Furthermore, to explore whether convergence to an accompanist voice during speech synchronisation is affected by the addition of visual cues, we also compared F0 change in the audio-only and audiovisual conditions. If being able to see as well as hear the accompanist during synchronous speech has a facilitatory effect on convergence, we would expect F0 change in the audiovisual condition to be significantly greater than that in the audio-only condition.

Results: Experiment 2

Within-participant analysis of convergence

Convergence patterns shown by individual participants in each of the three conditions are shown in Fig 4. The frequencies of convergers, divergers and non-convergers across the three synchronous speech conditions are given in Table 2; colour coding in Fig 4 is also used to indicate convergence status. As can be seen, while the majority of participants in the audio-only and audiovisual conditions showed significant convergence, most participants in the visual-only condition showed significant divergence. Average baseline F0 (at solo reading) did not differ between the three groups that went on to experience our three synchronous speech conditions (F(2,27) = 0.455, p = .639).

Fig 4. Individual participant convergence patterns.

Each graph plots for each participant (one per row) the difference in semitones between the accompanist voice mean F0 (represented by the filled circles at zero) and (i) the participant’s mean F0 at solo reading (shown in the filled triangles) and (ii) the participant’s mean F0 at synchronous speech (shown in the empty triangles). Symbol colour represents whether each participant demonstrated significant convergence, divergence, or no change in their F0 between solo reading and synchronous speech, as determined using one-sample t-tests on their F0 change values (see Table 2). The three conditions (audiovisual, audio-only and visual-only) are plotted on separate graphs (A, B and C).

Table 2. Frequencies of convergers, divergers and non-convergers (no change in F0) in the three synchronous speech conditions from Experiment 2.

Group        Visual-only condition    Audio-only condition    Audiovisual condition
Converged    1                        8                       8
Diverged     7                        2                       0
No change    2                        0                       2

Group analysis of convergence

Trial-by-trial values for change in F0 from baseline (synchronous speech minus solo reading) are shown for the three synchronous speech conditions in Fig 5; group averages across the whole experiment are shown in Fig 6. To test whether F0 change differed among the conditions, two linear mixed models were compared: a null model with random intercepts of participant and sentence and a random slope of condition by sentence, and a full model with the same random effects and a fixed effect of condition. A likelihood ratio test found that the full model provided a better fit to the data: χ2(2) = 15.39, p < .001. Pairwise comparisons between conditions using estimated marginal means (with Tukey’s HSD adjustment for multiple comparisons) were performed using the emmeans package in R [58]; these found significant differences between the visual-only and audio-only conditions (t(33.2) = -2.65, p = .032) and the visual-only and audiovisual conditions (t(34.3) = -4.19, p < .001). F0 change was thus significantly greater in the two conditions containing audio compared to the visual-only condition, supporting our first hypothesis. Conversely, these comparisons did not find a significant difference between the audio-only and audiovisual conditions (t(34.8) = -1.56, p = .277). This is consistent with the hypothesis based on prior literature that the enhancing effects of audiovisual cues on convergence may only be observed during tasks involving live interaction between speakers, or else during challenging auditory conditions.

Fig 5. F0 change across trials in audio, visual and audiovisual conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) for each trial of the task, averaged across participants in each of the three conditions of Experiment 2. Shaded areas around lines indicate standard error.

Fig 6. Average F0 change in visual, audio and audiovisual conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) averaged across trials for participants in the three conditions of Experiment 2. Dots indicate individual participant averages, thick line indicates group means, boxes indicate standard errors.

In addition to comparing F0 change across conditions, it is also of interest to consider whether F0 change in the visual-only condition was significantly different from zero; that is, does synchronising speech with an external stimulus in the absence of any acoustic input lead to any significant changes in F0? To test this, the same model as above was fitted (except with the random slope of condition by sentence removed due to model convergence issues), but this time with the intercept suppressed (set to zero). We then obtained 95% confidence intervals on the estimates for the three levels of condition using the confint function from the stats package in R [59]. Confidence intervals for the audio-only and audiovisual conditions did not contain zero (all lower bounds > .15 and upper bounds < 1.9), indicating significant upward changes in F0 from solo reading baseline to synchronous speech. Conversely, the confidence interval for the visual-only condition did contain zero (-0.87, 0.18), indicating that F0 change from solo reading to synchronous speech was not significantly different from zero. This pattern was also seen in the significance of the fixed effects in the zero-intercept model; the effect for the visual-only condition was not significant (β = -0.348, t(30.36) = -1.35, p = .188), whereas effects for the audio-only (β = 0.675, t(30.38) = 2.61, p = .014) and audiovisual conditions (β = 1.285, t(30.37) = 4.96, p < .001) were both significant.
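The zero-in-CI check can be sketched with a simplified Wald interval (estimate ± z·SE). Note this is only an approximation: R's confint on a fitted mixed model uses profile likelihood by default, and the SE value below is back-calculated from the reported β and t rather than taken from the paper:

```python
def wald_ci(estimate, se, z=1.96):
    """Approximate 95% confidence interval for a fixed-effect estimate."""
    return estimate - z * se, estimate + z * se

def contains_zero(ci):
    """True if the interval includes zero, i.e. the effect is not
    significant at the corresponding alpha level."""
    lo, hi = ci
    return lo <= 0.0 <= hi

# SE back-calculated as |beta| / |t| from the visual-only effect reported
# above (beta = -0.348, t = -1.35), giving se of roughly 0.258.
visual_ci = wald_ci(-0.348, 0.258)
```

With these values the Wald interval for the visual-only condition spans zero, consistent with the profile-likelihood interval reported in the text.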

Discussion: Experiment 2

In summary, the findings of Experiment 2 first of all replicate our finding from Experiment 1 that synchronisation with a high-pitched accompanist voice induces increases in the F0 of participants’ speech productions (audio-only condition). Further, they demonstrate that these convergent F0 changes are not seen when synchronising with a silent accompanist (visual-only condition). Lastly, being able to see as well as hear the accompanist (audiovisual condition) did not have a significant effect on the magnitude of F0 convergence observed, compared to our audio-only condition.

It should be noted that there was some evidence of changes in F0 in the visual-only synchronous speech condition; in contrast to the convergent upward changes in F0 seen in the audio-only and audiovisual conditions to the high F0 voice, a number of participants in the visual-only condition showed a tendency to slightly decrease their voice F0 (demonstrated in the high number of participants showing apparent ‘divergence’ in Table 2). This could be driven by a slowing of speech rate and production of flatter intonation patterns during synchronisation (which would affect our median measure of F0 across each sentence). These changes in F0 were however not significant at the group level in our zero-intercept model. In Experiment 1, the best convergence appeared to be seen in the high F0 accompanist condition; this demonstration that visual-only synchronous speech does not induce upward changes in F0 therefore suggests that these significant upward shifts are being driven by exposure to the high F0 of the accompanist voice, and thus reflect true convergence.

Interestingly, other work has reported evidence of convergence during speech shadowing of single words in silent visual speech; using an AXB procedure, Miller, Sanchez and Rosenblum [60] found that speakers’ productions of single words during lipreading and speech shadowing of a silent model talker were judged as sounding more similar to that model than their non-shadowing utterances. They suggested that visual speech cues can thus provide perceivers with information on the phonetic content and articulatory style of a speaker that can drive alignment in their speech productions. In contrast, our results suggested a tendency for some of our participants to effectively diverge in F0 from the accompanist during visual synchronous speech, due to a slight decrease in F0 (although these changes in F0 were not significant at the group level). This discrepancy from the previous study may lie in differences in the paradigms used, i.e. single-word shadowing versus synchronised production of sentences. Alternatively, this difference may lie in our focus on a single acoustic measure (F0); we focused on F0 convergence because this was explicitly manipulated in the accompanist voice, and was considered more likely to be robust when measured from audio recordings collected remotely online compared to formant frequencies [37]. It is entirely possible, however, that participants in our study did converge with the accompanist on other phonetic aspects such as formant frequencies that would be missed by our analysis but picked up in AXB judgements as in Miller et al. [60]. Our findings therefore simply suggest that convergent changes in F0 specifically during synchronous speech rely on access to acoustic input from the accompanist’s voice, and are not likely explained by the mere act of synchronising the timing of one’s speech with another (silent) talker.

Furthermore, in our study, the addition of visual speech cues to auditory speech information did not result in enhanced convergence relative to an audio-only condition. While some previous work has reported enhanced convergence for audiovisual conditions [56], this effect appears to be specific to tasks involving live interaction between interlocutors. Our findings with a non-interactive task involving pre-recorded stimuli instead accord with findings from Rosenblum and Dias, who similarly found no enhancement of convergence for an audiovisual condition over an audio-only condition in a speech shadowing task with pre-recorded stimuli [57]. In that study, an enhancing effect of audiovisual cues on convergence for this non-interactive speech task was only observed when model talker stimuli were presented in low-level noise. Together with our findings, this suggests that in a non-interactive context, visual speech information is only employed to support convergence when auditory speech information is rendered less informative. Interestingly, this appears to be driven by access to speech-relevant articulatory information, and not simply the social relevance of a face stimulus, with Dias and Rosenblum reporting no enhancement of convergence in low-level noise when the mouth area of the model talker was blurred in their video stimuli. Relating this to the Interactive Alignment Model [5, 16], access to the articulating mouth may have enabled more accurate or stronger covert imitation of the model talker’s speech via the simulation route, resulting in enhanced priming of representations in the production system and thus increased imitation.

Conversely, full visual access to a live interlocutor within an interactive speech task has been reported to enhance convergence compared to a condition where the interlocutor was fully occluded from view [56]. It is possible, therefore, that a similar enhancement of convergence would be seen when engaging in synchronous speech with a live accompanist who could be seen as well as heard. In this context, having visual access to a live accompanist may increase the potential for a two-way exchange of social cues, leading to increased motivation for vocal convergence in order to facilitate social bonding with the accompanist as an interaction partner. In this way, whether and how visual cues are used to support vocal convergence appears to differ across tasks and contexts, particularly with regard to the presence or absence of a live interlocutor.

General discussion

The experiments reported in this paper provide evidence of vocal convergence (in F0) during synchronous speech in female speakers. Experiment 1 demonstrates that speakers shift the F0 of their produced speech towards that of an accompanist voice during synchronous speech, both when this requires a lowering and when it requires a raising of F0. Experiment 2 demonstrates that these convergent F0 changes are not seen when synchronising speech with another speaker in the absence of acoustic input, suggesting that they are indeed driven by perception of the accompanist voice. For this task, being able to see as well as hear the accompanist did not have a significant effect on the magnitude of F0 convergence. Overall, therefore, as hypothesised, changes in F0 during synchronous speech were specific to the F0 of the accompanist voice. This suggests that these changes reflect true convergence, and are not simply a by-product of synchronising speech timing with an external stimulus.

Synchronous speech is an interesting case for models of speech motor control to consider, as these models typically place the processing of self-generated auditory feedback as central to guiding speech production [21–23]. Across many of these models, the brain is proposed to compare a prediction of the expected or intended auditory feedback from a speech utterance with the actual auditory feedback it receives; this is important for maintaining stability in speech and for enabling any sensory errors to be corrected online (as demonstrated in altered auditory feedback experiments, e.g. [24]). However, in synchronous speech, the brain simultaneously receives auditory feedback from the self-voice and the voice of the accompanist. The evidence from the current study suggests that this scenario results in sensorimotor processes that drive alignment of speech movements to an external target from the other voice, rather than to an internally defined target or prediction. Some evidence on what this might look like at the neural level was reported by Jasmin et al. [61], who found that synchronous speech with a live (but not a pre-recorded) accompanist resulted in a release from speech-induced suppression in the right anterior temporal lobe (the usual reduction in the auditory cortex response to self-produced speech compared to passive listening to that speech). Such speech-induced suppression is typically interpreted as reflecting a cancellation of the response to the auditory speech signal by subtraction of an auditory prediction of the expected/intended feedback for that utterance (also known as the efference copy or forward model). Interestingly, this release from suppression was not seen in a condition where the other talker spoke a different (i.e. non-matching) sentence simultaneously with the participant, suggesting that it is the synchronous, and not merely the simultaneous, nature of the other speech input that drives this enhanced response for synchronous speech. The authors suggested that this release from suppression reflects a blurring of the distinction between self- and other-sensory feedback during speech, such that self-generated speech feedback is processed as if it were externally generated.

Overall, the current study, along with multiple other lines of evidence, highlights the necessity for models of speech motor control to incorporate influences of external speech inputs on sensorimotor processes during speech (as also argued previously by [18]). In particular, these models will need to consider how the brain balances and prioritises competing pressures: maintaining stable articulatory targets on the one hand, versus flexibly changing those targets in response to external speech inputs on the other. The use of predictions during speech is widely assumed, but it is unknown how predictions of self- and other-speech feedback interact during speech production, or whether synchronous (rather than simply overlapping or simultaneous) speech among talkers represents a ‘special’ case for these predictive processes in speech motor control.

One interesting comparison to make between the sensorimotor adaptation responses seen in altered auditory feedback studies and vocal convergence responses concerns their time-scale. Adaptation responses to altered self-voice feedback typically ramp up over a series of trials [62–64]. Conversely, in our first experiment, F0 change did not appear to increase linearly across the task (see Fig 2). This is consistent with some previous studies of convergence; for example, in a study of F0 convergence during a shared reading task, Aubanel and Nguyen found that the maximal level of convergence was achieved at the beginning of the task and then remained relatively stable across the interaction [10]. However, in our second experiment, convergence did appear to show an upward trend across trials, particularly in the audiovisual condition (see Fig 5). One possible reason for this difference in the pattern of convergence between experiments could lie in the use of a restricted stimulus set in Experiment 2 (only three sentences were repeated across trials, in contrast to the trial-unique sentences used in Experiment 1). This is perhaps more similar to the design of altered auditory feedback experiments, in which the same small set of stimuli is typically repeated across multiple blocks. Some authors have argued that convergence in different cues can take on multiple different temporal patterns, either continuously increasing across an interaction or dynamically fluctuating according to a speaker’s motivational state and intentions [65]. It will be useful for future studies of vocal convergence to consider more closely the factors that might affect the dynamics of the convergence response across time.
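As a concrete illustration, trial-level convergence of this kind is commonly quantified with a difference-in-distance measure (an approach discussed, including its limitations, in [51, 52]), with F0 expressed in semitones to respect the logarithmic nature of pitch perception. The sketch below uses hypothetical F0 values and is not the study's actual analysis code:

```python
import math

def semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to a reference frequency."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def did_convergence(baseline_f0, sync_f0, model_f0):
    """Difference-in-distance: positive values mean the speaker ended up
    closer (in semitones) to the model during synchronisation than at baseline."""
    base_dist = abs(semitones(baseline_f0) - semitones(model_f0))
    sync_dist = abs(semitones(sync_f0) - semitones(model_f0))
    return base_dist - sync_dist

# Hypothetical speaker: 220 Hz at baseline, lowering to 200 Hz while
# synchronising with a 180 Hz accompanist -> positive convergence score.
score = did_convergence(220.0, 200.0, 180.0)
```

Because the measure is computed relative to each speaker's own baseline, both upward and downward F0 shifts towards the accompanist register as positive convergence, while movement away from the accompanist yields a negative score.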

The current results also have implications for interpretation of the fluency-enhancing effects of synchronous speech in people who stutter. Traditionally, these have been attributed to the provision of an external rhythm, which enables a switch from reliance on a faulty basal ganglia-cortical route for internally timed speech to reliance on an intact cerebellar-cortical route for externally timed speech [66, 67]. The present findings suggest that synchronous speech recruits additional sensorimotor processes that drive imitation-like changes in speech productions in typically fluent speakers. This provides a link between synchronous speech and other effective fluency-enhancers used with people who stutter that involve changes in the acoustic/phonetic properties of their speech feedback, either through active processes (e.g. speaking in a foreign accent) or through experimental perturbations of auditory speech feedback. Dysfunction in the processing and forward modelling of self-produced auditory speech feedback has been hypothesised in stuttering [68–70; for a review, see 4]. Indeed, there is recent evidence for an absence of speech-induced suppression effects in people who stutter [71, 72], suggesting faulty auditory prediction and internal modelling. Synchronous speech might circumvent these dysfunctional processes by biasing speech motor control away from reliance on faulty or unstable internally specified targets for speech and towards imitation of the external targets provided by the accompanist voice.

Alternatively, the beneficial effects of synchronous speech in people who stutter may relate to engagement of an entrainment process [73] that drives changes in the acoustics of speech productions. This interpretation of vocal convergence based on dynamical systems theory [74] was put forward by Pardo [75], who suggested that a pair of interacting talkers can be viewed as an informationally coupled dynamical system. Within such a system, a ‘magnet effect’ occurs in which externally derived information (i.e. the other speaker’s voice) acts as a forcing function on internal dynamics; over time, the more dominant talker pulls the less dominant talker into coordination so that their dynamics (i.e. vocal acoustics) more closely resemble each other, resulting in relative (but not absolute) coordination. This kind of entrainment process may provide stability or an alternative route for speech motor control in people who stutter. Further research on the effects of synchronous speech in people who stutter, both in terms of potential convergence of their speech productions with the accompanist and the nature of neural mechanisms of prediction during synchronous speech, would provide insights into these ideas.
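The 'magnet effect' described above can be illustrated with a toy dynamical-systems sketch (not a model from the study itself, and with arbitrary parameter values): two coupled phase oscillators, where the subordinate talker is strongly coupled to the dominant one but not vice versa, so the dominant talker pulls the subordinate towards its own dynamics.

```python
import math

def simulate(steps=2000, dt=0.01, w_dom=5.0, w_sub=4.0, k_dom=0.1, k_sub=2.0):
    """Two coupled phase oscillators (a minimal Kuramoto-style sketch).
    The subordinate is strongly coupled to the dominant (k_sub >> k_dom),
    so its effective frequency is pulled towards the dominant's."""
    p_dom, p_sub = 0.0, 1.0          # phases
    f_dom = f_sub = 0.0              # instantaneous frequencies
    for _ in range(steps):
        f_dom = w_dom + k_dom * math.sin(p_sub - p_dom)
        f_sub = w_sub + k_sub * math.sin(p_dom - p_sub)
        p_dom += f_dom * dt
        p_sub += f_sub * dt
    return f_dom, f_sub              # frequencies at the end of the run

f_dom, f_sub = simulate()
# The pair phase-locks near the dominant's natural frequency (5.0), far from
# the subordinate's (4.0): relative, not absolute, coordination.
```

The asymmetric coupling constants stand in for talker dominance: because the forcing from the dominant talker outweighs the reverse influence, coordination is achieved almost entirely through change in the subordinate's dynamics, mirroring the one-way pull described by the magnet effect.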

Some limitations of the current study must be acknowledged when interpreting the results. Firstly, relatively small sample sizes were used for our conditions (n = 10 per condition). Although these are in keeping with sample sizes used by others in the field of vocal convergence [e.g. 57, 75, 76], our findings should be viewed as preliminary and in need of replication. It is promising, however, that we were able to replicate significant convergence across multiple synchronous speech conditions in two separate experiments. The study is also limited by the use of an online design and the resulting variability in the recording set-ups used by different speakers, which could have affected our acoustic measurements of F0 [37]. Although several steps were taken to mitigate potential effects of this variability, the study would benefit from replication in a laboratory environment, where specialist equipment could be used to record participants’ voices in person. This would further allow a wider range of acoustic and phonetic measures to be examined, such as formant frequencies.

The online study design also limited us to the use of a pre-recorded accompanist voice for the synchronous speech task. As already discussed, this lack of live interaction with the accompanist may have resulted in our failure to observe an enhancing effect of audiovisual cues on convergence [56, 57]. Other studies have reported important differences between synchronous speech conditions involving pre-recorded versus live accompanists. Synchronous speech with a pre-recorded accompanist is more difficult and leads to reduced success in synchronisation [32, 77]. The decreased variability in vowel productions reported for synchronous speech has also been shown to be less pronounced for synchronisation with a pre-recorded versus a live accompanist [31, 32]. Furthermore, as discussed previously, the release from speech-induced suppression in the right anterior temporal lobe reported for synchronous speech was specific to a live accompanist condition [61]. From these findings, we might predict that even greater vocal convergence to an accompanist would be seen during live interaction. It should be noted, however, that the fluency-enhancing effects of synchronous speech in people who stutter do not appear to rely on a live accompanist [78]. Importantly, the unique aspect of live synchronisation is that both talkers can simultaneously align their speech productions with one another, rather than convergence being a one-way process. The absence of two-way interaction in our design thus limits the informativeness of this study for models of joint action, instead constraining our interpretation to individualistic representational models. It will be important for future research to follow up these preliminary findings with an in-person study of synchronous speech using live interacting pairs, in order to contribute to our understanding of joint speech production.

Conclusions

In conclusion, our study demonstrates vocal convergence in F0 during synchronous speech in typically fluent speakers; these changes in F0 were specific to the acoustic properties of the accompanist voice, and so were not simply driven by coordination of the timing of one’s speech with an external stimulus. The findings suggest the need for models of speech motor control to be extended to account for influences of external speech inputs on speech production. Further, they provide novel insights into the potential mechanisms behind the fluency-enhancing effects of synchronous speech in people who stutter; specifically, our findings suggest that synchronous speech may induce a shift in the balance with which the speech motor control system weights internally stored versus externally generated speech targets when guiding speech production.

Supporting information

S1 File. Debrief data Experiment 1.

Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 1.

(CSV)

S2 File. Debrief data Experiment 2.

Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 2.

(CSV)

S1 Table. Sentence stimuli.

Sentences taken from the Harvard IEEE corpus of sentences used in Experiments 1 and 2.

(PDF)

Data Availability

Data files for all experiments reported on in this manuscript are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

Funding Statement

This work was funded by a Research Leadership Award from The Leverhulme Trust (https://www.leverhulme.ac.uk), awarded to C.M. (RL-2016-013). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Cummins F. Joint speech as an object of empirical inquiry. Mater Relig 2018;14:417–9. 10.1080/17432200.2018.1485344. [DOI] [Google Scholar]
  • 2.McNeill WH. Keeping together in time. Cambridge, MA: Harvard University Press; 1995. [Google Scholar]
  • 3.Andrews G, Hoddinott S, Craig A, Howie P, Feyer A-M, Neilson M. Stuttering. J Speech Hear Disord 1983;48:226–46. 10.1044/jshd.4803.226 [DOI] [PubMed] [Google Scholar]
  • 4.Bradshaw AR, Lametti DR, McGettigan C. The Role of Sensory Feedback in Developmental Stuttering: A Review. Neurobiol Lang 2021;2:1–27. 10.1162/nol_a_00036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Pickering MJ, Garrod S. Toward a mechanistic psychology of dialogue. Behav Brain Sci 2004;27:169+. doi: 10.1017/s0140525x04000056 [DOI] [PubMed] [Google Scholar]
  • 6.Pardo JS, Urmanche A, Wilman S, Wiener J. Phonetic convergence across multiple measures and model talkers. Atten Percept Psychophys 2017;79:637–59. 10.3758/s13414-016-1226-0 [DOI] [PubMed] [Google Scholar]
  • 7.Gentilucci M, Bernardis P. Imitation during phoneme production. Neuropsychologia 2007;45:608–15. 10.1016/j.neuropsychologia.2006.04.004 [DOI] [PubMed] [Google Scholar]
  • 8.Goldinger SD. Echoes of echoes? An episodic theory of lexical access. Psychol Rev 1998;105:251–79. 10.1037/0033-295X.105.2.251 [DOI] [PubMed] [Google Scholar]
  • 9.Pardo JS, Jay IC, Hoshino R, Hasbun SM, Sowemimo-Coker C, Krauss RM. Influence of Role-Switching on Phonetic Convergence in Conversation. Discourse Process 2013;50:276–300. 10.1080/0163853X.2013.778168. [DOI] [Google Scholar]
  • 10.Aubanel V, Nguyen N. Speaking to a common tune: Between-speaker convergence in voice fundamental frequency in a joint speech production task. PLoS One 2020;15. 10.1371/journal.pone.0232209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Mukherjee S, D’Ausilio A, Nguyen N, Fadiga L, Badino L. The Relationship between F0 Synchrony and Speech Convergence in Dyadic Interaction. Proc. 18th Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH 2017), 2017, p. 2341–5. 10.21437/Interspeech.2017-795. [DOI]
  • 12.Giles H, Coupland N, Coupland J. 1. Accommodation theory: communication, context, and consequence. Context. Accommod. Dev. Appl. Socioling., New York: Cambridge University Press; 1991, p. 1–68. [Google Scholar]
  • 13.Michalsky J, Schoormann H. Pitch convergence as an effect of perceived attractiveness and likability. Proc. 18th Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH 2017), 2017, p. 2253–6. 10.21437/Interspeech.2017-1520. [DOI]
  • 14.Gregory SW, Webster S. A nonverbal signal in voices of interview partners effectively predicts communication accommodation and social status perceptions. J Pers Soc Psychol 1996;70:1231–40. 10.1037/0022-3514.70.6.1231 [DOI] [PubMed] [Google Scholar]
  • 15.Bourhis RY, Giles H. The language of intergroup distinctiveness. Lang. Ethn. Intergr. relations, London: Academic Press; 1977, p. 119–35. [Google Scholar]
  • 16.Pickering MJ, Garrod S. An integrated theory of language production and comprehension. Behav Brain Sci 2013;36:329–47. 10.1017/S0140525X12001495 [DOI] [PubMed] [Google Scholar]
  • 17.Pardo JS, Urmanche A, Wilman S, Wiener J, Mason N, Francis K, et al. A comparison of phonetic convergence in conversational interaction and speech shadowing. J Phon 2018;69:1–11. 10.1016/j.wocn.2018.04.001. [DOI] [Google Scholar]
  • 18.Sato M, Grabski K, Garnier M, Granjon L, Schwartz J-L, Nguyen N. Converging toward a common speech code: imitative and perceptuo-motor recalibration processes in speech production. Front Psychol 2013;4:422. 10.3389/fpsyg.2013.00422 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Garnier M, Lamalle L, Sato M. Neural correlates of phonetic convergence and speech imitation. Front Psychol 2013;4:600. 10.3389/fpsyg.2013.00600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Shockley K, Sabadini L, Fowler CA. Imitation in shadowing words. Percept Psychophys 2004;66:422–9. 10.3758/BF03194890 [DOI] [PubMed] [Google Scholar]
  • 21.Guenther FH, Ghosh SS, Tourville JA. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang 2006;96:280–301. 10.1016/j.bandl.2005.06.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Parrell B, Houde J. Modeling the Role of Sensory Feedback in Speech Motor Control and Learning. J SPEECH Lang Hear Res 2019;62:2963–85. 10.1044/2019_JSLHR-S-CSMC7-18-0127 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Parrell B, Lammert AC, Ciccarelli G, Quatieri TF. Current models of speech motor control: A control-theoretic overview of architectures and properties. J Acoust Soc Am 2019;145:1456–81. 10.1121/1.5092807 [DOI] [PubMed] [Google Scholar]
  • 24.Houde JF, Jordan MI. Sensorimotor adaptation in speech production. Science 1998;279:1213–6. 10.1126/science.279.5354.1213 [DOI] [PubMed] [Google Scholar]
  • 25.Burnett TA, Freedland MB, Larson CR, Hain TC. Voice F0 responses to manipulations in pitch feedback. J Acoust Soc Am 1998;103:3153–61. 10.1121/1.423073 [DOI] [PubMed] [Google Scholar]
  • 26.Lametti DR, Krol SA, Shiller DM, Ostry DJ. Brief Periods of Auditory Perceptual Training Can Determine the Sensory Targets of Speech Motor Learning. Psychol Sci 2014. 10.1177/0956797614529978 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Bourguignon NJ, Baum SR, Shiller DM. Please Say What This Word Is-Vowel-Extrinsic Normalization in the Sensorimotor Control of Speech. J Exp Psychol Percept Perform 2016;42:1039–47. 10.1037/xhp0000209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Sebanz N, Bekkering H, Knoblich G. Joint action: bodies and minds moving together. TRENDS Cogn Sci 2006;10:70–6. 10.1016/j.tics.2005.12.009 [DOI] [PubMed] [Google Scholar]
  • 29.Knoblich G, Butterfill S, Sebanz N. Psychological research on joint action: theory and data. In: Ross B, editor. Psychol. Learn. Motiv. Adv. Res. Theory, Vol 54, vol. 54, 2011, p. 59–101. 10.1016/B978-0-12-385527-5.00003-6. [DOI] [Google Scholar]
  • 30.Rami MK, Kalinowski J, Rastatter MP, Holbert D, Allen M. Choral Reading with Filtered Speech: Effect on Stuttering. Percept Mot Skills 2005;100:421–31. 10.2466/pms.100.2.421-431 [DOI] [PubMed] [Google Scholar]
  • 31.Poore MA, Ferguson SH. Methodological variables in choral reading. Clin Linguist Phon 2008;22:13–24. 10.1080/02699200701601971 [DOI] [PubMed] [Google Scholar]
  • 32.Cummins F. Prosodic characteristics of synchronous speech. In: Puppel S, Demenko G, editors. Prosody 2000 Speech Recognit. Synth., Krakow: Adam Mickiewicz University: 2000, p. 45–49. [Google Scholar]
  • 33.Cummins F. Synchronization Among Speakers Reduces Macroscopic Temporal Variability. Proc Twenty-Sixth Annu Conf Cogn Sci Soc 2004:256–61. [Google Scholar]
  • 34.Wang B, Cummins F. Intonation contour in synchronous speech. J. Acoust. Soc. Am., 2003, p. 2397. 10.1121/1.4778142. [DOI] [Google Scholar]
  • 35.Anwyl-Irvine AL, Massonnié J, Flitton A, Kirkham N, Evershed JK. Gorilla in our midst: An online behavioral experiment builder. Behav Res Methods 2020;52:388–407. 10.3758/s13428-019-01237-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Milne AE, Bianco R, Poole KC, Zhao S, Oxenham AJ, Billig AJ, et al. An online headphone screening test based on dichotic pitch. Behav Res Methods 2020. 10.3758/s13428-020-01514-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Sanker C, Babinski S, Burns R, Evans M, Kim J, Smith S, et al. (Don’t) try this at home! The effects of recording devices and software on phonetic analysis. LingBuzz; 2021. [Google Scholar]
  • 38.Freeman V, De Decker P, Landers M. Suitability of self-recordings and video calls: Vowel formants and nasal spectra. J Acoust Soc Am 2020;148:2714–5. 10.1121/1.5147526. [DOI] [Google Scholar]
  • 39.Boersma P, Weenink D. Praat: doing phonetics by computer 2021. [Google Scholar]
  • 40.Namy LL, Nygaard LC, Sauerteig D. Gender differences in vocal accommodation: The role of perception. J Lang Soc Psychol 2002;21:422–32. 10.1177/026192702237958. [DOI] [Google Scholar]
  • 41.IEEE Subcommittee on Subjective Measurements. IEEE Recommended Practice for Speech Quality Measurements. IEEE Trans Audio Electroacoust 1969;17:227–46. [Google Scholar]
  • 42.Audacity Team. Audacity(R): Free Audio Editor and Recorder 2021.
  • 43.Darwin C. Praat script: VTchange-dynamic 2005.
  • 44.Lee Y, Keating P, Kreiman J. Acoustic voice variation within and between speakers. J Acoust Soc Am 2019;146:1568–79. 10.1121/1.5125134 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Keating P, Kreiman J, Alwan A. A new speech database for within- and between-speaker variability. In: Calhoun S, Escudero P, Tabain M, Warren P, editors. Proc. 19th Int. Congr. Phonetic Sci., Melbourne, Australia; 2019.
  • 46.Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest Package: Tests in Linear Mixed Effects Models. J Stat Softw 2017;82:1–26. 10.18637/jss.v082.i13. [DOI] [Google Scholar]
  • 47.Walker A, Campbell-Kibler K. Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task. Front Psychol 2015;6:546. 10.3389/fpsyg.2015.00546 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Babel M. Evidence for phonetic and social selectivity in spontaneous phonetic imitation. J Phon 2012;40:177–89. 10.1016/j.wocn.2011.09.001. [DOI] [Google Scholar]
  • 49.Babel M. Dialect divergence and convergence in New Zealand English. Lang Soc 2010;39:437–56. 10.1017/S0047404510000400. [DOI] [Google Scholar]
  • 50.Kim M, Horton WS, Bradlow AR. Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Lab Phonol 2011;2:125–56. 10.1515/labphon.2011.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Priva UC, Sanker C. Limitations of difference-in-difference for measuring convergence. Lab Phonol 2019;10. 10.5334/labphon.200. [DOI] [Google Scholar]
  • 52.MacLeod B. Problems in the Difference-in-Distance measure of phonetic imitation. J Phon 2021;87. 10.1016/j.wocn.2021.101058. [DOI] [Google Scholar]
  • 53.Lee Y, Goldstein L, Parrell B, Byrd D. Who converges? Variation reveals individual speaker adaptability. SPEECH Commun 2021;131:23–34. 10.1016/j.specom.2021.05.001. [DOI] [Google Scholar]
  • 54.Hazan V, Grynpas J, Baker R. Is clear speech tailored to counter the effect of specific adverse listening conditions? J Acoust Soc Am 2012;132:EL371–7. 10.1121/1.4757698 [DOI] [PubMed] [Google Scholar]
  • 55.Kappes J, Baumgaertner A, Peschke C, Ziegler W. Unintended imitation in nonword repetition. Brain Lang 2009;111:140–51. 10.1016/j.bandl.2009.08.008 [DOI] [PubMed] [Google Scholar]
  • 56.Dias JW, Rosenblum LD. Visual influences on interactive speech alignment. Perception 2011;40:1457–66. 10.1068/p7071 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Dias JW, Rosenblum LD. Visibility of speech articulation enhances auditory phonetic convergence. Atten Percept Psychophys 2016;78:317–33. 10.3758/s13414-015-0982-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Lenth R, Buerkner P, Herve M, Love J, Riebl H, Singmann H. emmeans: Estimated Marginal Means, aka Least-Squares Means 2021. [Google Scholar]
  • 59.R Core Team. R: A language and environment for statistical computing 2019. [Google Scholar]
  • 60.Miller RM, Sanchez K, Rosenblum LD. Alignment to visual speech information. Atten Percept Psychophys 2010;72:1614–25. 10.3758/APP.72.6.1614 [DOI] [PubMed] [Google Scholar]
  • 61.Jasmin KM, McGettigan C, Agnew ZK, Lavan N, Josephs O, Cummins F, et al. Cohesion and Joint Speech: Right Hemisphere Contributions to Synchronized Vocal Production. J Neurosci 2016;36:4669–80. 10.1523/JNEUROSCI.4075-15.2016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Villacorta VM, Perkell JS, Guenther FH. Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am 2007. 10.1121/1.2773966. [DOI] [PubMed] [Google Scholar]
  • 63.Lametti DR, Smith HJ, Watkins KE, Shiller DM. Robust Sensorimotor Learning during Variable Sentence-Level Speech. Curr Biol 2018;28:3106–3113.e2. 10.1016/j.cub.2018.07.030 [DOI] [PubMed] [Google Scholar]
  • 64.Guenther FH. Neural Control of Speech. Cambridge, MA: The MIT Press; 2016. [Google Scholar]
  • 65.De Looze C, Scherer S, Vaughan B, Campbell N. Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Commun 2014;58:11–34. 10.1016/j.specom.2013.10.002. [DOI] [Google Scholar]
  • 66.Alm PA. Stuttering and the basal ganglia circuits: a critical review of possible relations. J Commun Disord 2004;37:325–69. 10.1016/j.jcomdis.2004.03.001 [DOI] [PubMed] [Google Scholar]
  • 67.Giraud A-L, Neumann K, Bachoud-Levi A-C, von Gudenberg AW, Euler HA, Lanfermann H, et al. Severity of dysfluency correlates with basal ganglia activity in persistent developmental stuttering. BRAIN Lang 2008;104:190–9. 10.1016/j.bandl.2007.04.005 [DOI] [PubMed] [Google Scholar]
  • 68.Civier O, Tasko SM, Guenther FH. Overreliance on auditory feedback may lead to sound/syllable repetitions: Simulations of stuttering and fluency-inducing conditions with a neural model of speech production. J Fluency Disord 2010;35:246–79. 10.1016/j.jfludis.2010.05.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Max L, Daliri A. Limited Pre-Speech Auditory Modulation in Individuals Who Stutter: Data and Hypotheses. J Speech Lang Hear Res 2019;62:3071–84. 10.1044/2019_JSLHR-S-CSMC7-18-0358 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Max L, Guenther FH, Gracco VL, Ghosh SS, Wallace ME. Unstable or Insufficiently Activated Internal Models and Feedback-Biased Motor Control as Sources of Dysfluency: A Theoretical Model of Stuttering. Contemp Issues Commun Sci Disord 2004;31:105–22. 10.1044/cicsd_31_S_105. [DOI] [Google Scholar]
  • 71.Daliri A, Max L. Modulation of auditory processing during speech movement planning is limited in adults who stutter. Brain Lang 2015;143:59–68. 10.1016/j.bandl.2015.03.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Meekings S, Jasmin K, Lima C, Scott S. Does over-reliance on auditory feedback cause disfluency? An fMRI study of induced fluency in people who stutter. BioRxiv 2020. 10.1101/2020.11.18.378265. [DOI] [Google Scholar]
  • 73.von Holst E. On the nature of order in the central nervous system. In: Martin R, editor. Behav. Physiol. Anim. man Collect. Pap. Erich von Holst, London: Methuen; 1973, p. 133–55. [Google Scholar]
  • 74.Beek PJ, Turvey MT, Schmidt RC. Autonomous and Nonautonomous Dynamics of Coordinated Rhythmic Movements. Ecol Psychol 1992;4:65–95. 10.1207/s15326969eco0402_1. [DOI] [Google Scholar]
  • 75.Pardo JS. On phonetic convergence during conversational interaction. J Acoust Soc Am 2006;119:2382–93. 10.1121/1.2178720 [DOI] [PubMed] [Google Scholar]
  • 76.Pardo JS, Gibbons R, Suppes A, Krauss RM. Phonetic convergence in college roommates. J Phon 2012;40:190–7. 10.1016/j.wocn.2011.10.001. [DOI] [Google Scholar]
  • 77.Cummins F. Practice and performance in speech produced synchronously. J Phon 2003;31:139–48. 10.1016/S0095-4470(02)00082-7. [DOI] [Google Scholar]
  • 78.Kiefte M, Armson J. Dissecting choral speech: Properties of the accompanist critical to stuttering reduction. J Commun Disord 2008;41:33–48. 10.1016/j.jcomdis.2007.03.002 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Vera Kempe

6 Jul 2021

PONE-D-21-16330

Synchronised speech and speech motor control: convergence in voice fundamental frequency during choral speech

PLOS ONE

Dear Dr. Bradshaw,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I was fortunate to obtain reviews from three experts. Although two reviewers are the standard at PLOS ONE, the fact that this manuscript received three reviews is not an indication that I thought it deserved extra scrutiny; rather, anticipating many reviewer declines based on past experience, I initially invited a larger number of reviewers and was simply lucky that three agreed to provide a review. I hope this will give you a wide range of helpful suggestions for how to improve the paper.

You will see that the reviewers raise important issues, which I encourage you to take on board. In addition, I have read the submission myself and have added some comments of my own, which largely overlap with the reviewers’. Below I summarise the crucial points, but I urge you to go through the reviews carefully to address and clarify all issues raised.

1.Reviewer 1 suggests streamlining the analyses and I concur with this suggestion. In fact, it seems to me that there is some redundancy that leads to conflicting results, especially in Experiment 3: While the difference between audio-only and audio-visual was not significant for difference scores in the first analysis it was so when both phases were considered separately in the second analysis. This may be due to spurious effects to do with the F0 values in the SR phase. In my view, there should only be one set of analyses for each experiment performed on adjusted (more on this below) differences between SR and CS with full random effect structure (i.e. not just random intercepts of participants and sentences but also random slope of Condition by sentences).

2. The theoretical limitations associated with the use of pre-recorded voices with respect to how informative the study can be for joint action models should be clearly acknowledged.

3. Aim for terminological coherence with the literature in the use of ‘choral speech’ and acknowledge other terms and their definitions.

4. If you streamline the F0 analysis as suggested, you may want to consider analysing F0 range. Indeed, far-reaching theoretical conclusions about convergence in general based on just one parameter may not be warranted.

5. I agree with Reviewer 1 that the Hypotheses do not flow directly from the Introduction and need to be better motivated early on. This may then remove the need to restate them for each experiment.

6. The alternative – that F0 convergence is a by-product of trying to converge in speech rate, presumably due to greater effort – should be theoretically better motivated.

7. Clarify the baseline comparison between groups prior to the experimental manipulation so that the condition label terminology does not become confusing. Also, please aim for consistency in labelling the solo reading phase either as that or as baseline.

8. Clarify what distances you are reporting. In addition to Reviewer 1’s comments on defining the differences, I would caution that absolute differences in Hz obscure the non-linear nature of pitch perception, so in my view differences should be reported using a log-transformed measure such as semitones.
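For concreteness, the Hz-to-semitone conversion the editor has in mind can be sketched as follows (a minimal illustration using the conventional 12·log2 definition; the function name and example values are illustrative, not taken from the manuscript):

```python
import math

def semitone_difference(f_hz: float, ref_hz: float) -> float:
    """Distance between two frequencies in semitones (12 semitones per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)

# An octave up (220 Hz -> 440 Hz) is exactly 12 semitones,
# whereas the raw Hz difference (220 Hz) would vary with register.
print(semitone_difference(440.0, 220.0))  # -> 12.0
```

Because the measure is logarithmic, the same semitone distance corresponds to larger Hz differences at higher pitches, which is why it better reflects perceived pitch distance than raw Hz.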

9. Introduce clear labels for all factors early on. It took the Reviewers and myself a while to figure out what ‘phase’ refers to – this should have been clearly stated when the design is explained.

10. Consider a direct comparison between XLO (Exp 2), LO (Exp 1) and HI (Exp 1) conditions, essentially merging Experiments 1 and 2 into a joint analysis, while maintaining transparency in reporting the order in which they were carried out.

11. In the figure captions, please state clearly what the abbreviations mean and what the units of the differences are (semitones, I would hope, in a revised version).

Finally, a small issue I noticed – in line 103, shouldn’t it say ‘if listeners tend to choose B more often…’?

I hope that these and the Reviewers’ comments will be helpful for you when preparing a revision of this submission.

Please submit your revised manuscript by Aug 20 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: See attachment. These extra words are necessary because the badly designed form insists on a minimum character count. Here are some more characters: zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz

Reviewer #2: This is an excellent paper that I am happy to recommend for publication. It provides a thorough and relevant positioning of its original research within the context of previous work on convergence, and in discussion makes connections to possible clinical applications associated with stuttering disorders. Moreover, as a paradigm for effective online data collection, with careful consideration of potential problems, validation methods and norming it has interest beyond its immediate research questions.

Specific Comments (by reviewer question)

1. Is the manuscript technically sound, and do the data support the conclusions?

Yes, with alternative explanations appropriately considered. With 10 participants per condition the contrasts are arguably underpowered in the initial experiment, however, this is mitigated by converging results from the additional experiments.

2. Has the statistical analysis been performed appropriately and rigorously?

Yes. In the LMMs it would be helpful to reorder the levels of the condition where appropriate from default alphabetic ordering such that one level ("high") is consistently baseline for comparison (see below).

3. Have the authors made all data underlying the findings in their manuscript fully available?

Yes, using the OSF platform.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Very well written and thoroughly intelligible.

Specific Comments (by line)

211 mention subject compensation, presumably accomplished through the Gorilla platform

455 "Therefore, as predicted, F0 change from baseline was positive in the high condition, but negative in the low condition." While this is clearly correct as visualized in Fig 2, to support this assertion from the LMM results you should also provide the (positive) intercept for the baseline "high" condition.

474 "phase" – I had to go back to see what was meant by "phase" here and suggest including something like "phase (solo vs. choral)" to help other readers with that reminder

592 In the earlier LMM the "high" level is baseline whereas here "extra-low" is used as baseline; similarly Fig 2 shows a high:low ordering while Fig 5 shows low:high. Suggest reordering levels away from the default alphabetic ordering to make this consistent.

804 "with Tukey['s HSD] adjustment"

899 "In contrast, our results suggested a tendency for speakers to diverge in F0 from the accompanist during visual choral speech." How can they diverge if they have no access to that cue? A more plausible reason would seem to be something associated with a repetition effect, which you might consider assessing across trials.

913 This paper is relevant here: Dias, J. W., & Rosenblum, L. D. (2016). Visibility of speech articulation enhances auditory phonetic convergence. Attention, Perception, & Psychophysics, 78(1), 317-333.

917 "perhaps via motor-based speech gesture representations" – the implication here is that F0 can be recovered from facial cues. While there is work suggesting that oral tract constriction gestures can be recovered in this way [e.g. Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1-2), 23-43.], I don't see a plausible mechanism for inferring F0.

929 "The provision of a dynamic face stimulus in addition to the voice would have increased the salience of the accompanist as a social agent, potentially increasing participants’ motivation for vocal convergence in order to facilitate social bonding with them as an interaction partner." – except that no exchange of social cues takes place with canned stimuli

976 "recruits additional processes" – this may be so, but should also be discussed in the context of entrainment, which might be providing stability

988 "Further research..." It would be interesting and useful for you to speculate on the possible differences between synchronized speech with a single partner vs. true choral speech as part of a synchronized group.

Reviewer #3: The authors report three experiments in which participants spoke in synchrony with recorded speech. The primary issue was whether participants would adapt spoken f0 when synchronizing with a recording of unusually high or unusually low speech. Across experiments, there was a tendency to converge spoken f0 with the recording; however this tendency was stronger for high than for low pitched recordings. In Experiment 3 participants imitated video recordings as well as video+audio for the high-pitched recordings. This experiment demonstrated that the bias for high-pitch convergence was not simply a byproduct of the joint speech task (vocal f0 went down when synchronizing with video only), but also that there was more convergence for video + audio, suggesting that the results may reflect tendencies based on social interactions.

This was an interesting paper and clearly reported. Here are my main concerns.

(1) It would be valuable to correlate the difference between baseline f0 (from the initial solo reading task) and f0 for synchronization targets with the degree of convergence for individuals.

(2) Sample sizes within each condition were surprisingly small (n=10), and this small sample size was not justified. I know data collection during COVID is difficult, but I think a larger sample of online participants should be possible.

(3) Arguments about applications to stuttering should be limited to discussion because there are no persons who stutter in the experiments. The argument in the discussion could be more clearly described. Why is it beneficial to bring speech motor control away from “internally specified targets and towards external targets”?

(4) The shift of f0 was more extreme for low than for high targets. Do we know that both targets sounded similarly natural? Could lower tendency to converge to low targets reflect the greater loss of naturalness from the more extreme shift?

Minor comments

Lines 413-417

This is a somewhat questionable use of t-test as it assumes independent sampling across trials for a participant.

Lines 430-432

Technically correct, but the lowest participant in the low-voice group is for all practical purposes identical in pitch to the low-voiced target.

Line 455

Why are there 20 degrees of freedom for t? With n=10 per sample, a t-test on independent samples should have df=18, I would think.

Figure 1

Curious that there does not seem to be a strong trend across trials. Why?

Figure 3 (and similar figures)

State what acronyms mean in the caption.

Table 2

Not sure this is necessary. Also not clear why there are 20.4 df for each mean (shouldn’t this be 9 per condition?).

Lines 522-523

What cultural differences predict lower f0 for British-English female speech in contrast to American English? Why is this not an artifact of the small sample?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fred Cummins

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PLoSSynchSpeech.pdf

Decision Letter 1

Vera Kempe

21 Sep 2021

PONE-D-21-16330R1

Convergence in voice fundamental frequency during synchronous speech

PLOS ONE

Dear Dr. Bradshaw,

Your revision has now been scrutinized by the original three reviewers. While two of them are happy with your revision, one reviewer recommends rejection, based mainly on a perceived discrepancy between the small sample size and the stated generalizability of the theoretical conclusions. However, I would like to see this paper published, also because the sample size did not figure prominently in the initial round of reviews. To make sure that the paper accounts for these potential criticisms, and also to ensure that recent suggestions are incorporated, I am sending it back to you for one final round of very minor revisions. If you decide to submit these revisions (which I very much hope you will), I will not send the paper out for further review but will simply check completion and accept.

For the final version, I recommend the following minor edits:

1. In the General Discussion, provide an acknowledgement of the relatively small sample size and an explanation of how this compares with sample sizes of similar published work. Then please include a statement addressing generalizability given your sample size. The aim of this is to give readers who are not working in this area a sense of where the study sits in this respect.

2. Reviewer 1 argues that in their view ‘synchronized’ speech should be reserved for laboratory speech. While I am not in a position to judge whether such use of terminology is warranted and aligns with a common view, please include a sentence clarifying whether you intend synchronous speech to be reserved for laboratory studies or not, and how this aligns with the literature. Again, my aim is to achieve maximal terminological clarity to allow readers from different areas and theoretical persuasions to fully benefit from reading your work.

3. Reviewer 2 provided a clarification of their earlier statement about entrainment – please check how your interpretation aligns with their conceptualization of entrainment and perhaps add a mention of the stated interpretation in your Discussion.

4. Reviewer 3 points out interesting links to the literature on musical synchronization – I leave it up to you whether you see value in pointing out or even elaborating briefly on these links in the general Discussion.

Please submit your revised manuscript by Nov 05 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor. You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have undertaken a substantial revision and the result is in many ways improved. The notion of baseline, and the introduction of motivated hypotheses are welcome and improve the reading of the text.

However, my judgement is that this is a small experiment performed under highly constrained conditions, that provides some suggestive data that might inform future studies. The data do not, in my opinion, warrant the conceptual elaboration given here, because they are slight, and do not speak clearly. Nor could they, under the constrained circumstances.

The conduct of the experiment was difficult, and the Huggins pitch test is a valuable guide for work that might be performed under similar circumstances. It did mean that a lot of participants were excluded. In Experiment 2, for example, 41 were excluded, while only 30 made the cut, a sample diluted further by the between-subjects design, leading to n=10 in each condition. There is nothing intrinsically wrong with this. The exclusions are warranted, the trials are robustly conducted, and the variables measured are simple: median f0 is a simple measure. But these limitations must, to my mind, mean that this is a pilot or exploratory study. To represent it otherwise is to inflate it, and to suggest that it has powers it does not have to inform theory. Drawing conclusions related to the interactive alignment model, or to any neuroscientific interpretation, is simply unwarranted.

The analysis produces an aura of formality over slight data. Everything the data have to offer can be seen in the first figure for each experiment. The linear modelling is, to my cautious mind, not justified. Such modelling is justified if there is reason to believe that the data might reveal underlying regularities, structures or processes. Median F0 in sentences produced under these conditions with n=10 in each condition is not that kind of data. That is my view and the editor may freely diverge from it.

Finally, and I know I am being a bore about this, but I tried to suggest that work in the area of speech produced simultaneously be described using coherent terms. To that end, I have suggested joint speech as an all-encompassing term, with specializations for speech produced under specific experimental conditions. Choral speech has been the term used by Kalinowski and colleagues, and that can be taken as a useful marker that work belongs in that camp. I have tried to suggest synchronous speech for laboratory speech such as that of the present study. But the abstract begins by spoiling this attempt. I give up in the present instance.

The relevance of the present work to Kalinowski's agenda is tenuous.

I am unwilling to pronounce on the matter of suitability for publication. The review form insists that I make a choice, so I have to select "reject" but am happy to be overruled. The field is riddled with work that appears important, but masks simple observations conducted in conditions that constrain what can be said beyond the experimental context. Let the editors decide.

Reviewer #2: My comments have been addressed and I endorse publication of this paper.

Concerning the authors' request for clarification about the sense of "entrainment" I intended in a comment on the original form of this sentence: "The present findings suggest that [synchronous] speech recruits additional [sensorimotor] processes that drive imitation-like changes in speech productions in typically fluent speakers." I commented that "this may be so, but should also be discussed in the context of entrainment, which might be providing stability". The point I'd like the authors to consider hinges on the 'mechanism' by which convergence effects are induced. They provide the plausible suggestion that the additional feedback from externally timed speech can drive imitation (convergence) effects, though the lack of auditory cortex suppression facilitating this reported by the Jasmin et al. (2016) paper they cite was crucially found only in live partner interaction, which undercuts this argument. But the gradual onset of phonetic convergence, seen most clearly in this submission in the visual+audio condition (Fig 5) and reported elsewhere by Pardo, Babel, and many others, can also be interpreted as a strengthening entrainment process. This arises (following Beek et al. 1992) by way of external information derived from sensory cues. These serve as a forcing function on internal dynamics, leading to adjustments in existing patterns of behavior within their intrinsic range and resulting in relative coordination between two systems (as opposed to absolute coordination between physically coupled systems). Discussion of something along these lines as an alternative explanation would be useful in my opinion.

Beek PJ, Turvey MT & Schmidt RC (1992) Autonomous and nonautonomous dynamics of coordinated rhythmic movements. Ecological Psychology, 4, 65-95.

Reviewer #3: I am satisfied with the author responses to my review and the other reviews. I look forward to seeing this paper published. Though the authors need not make changes, there are interesting parallels with the musical synchronization literature that they may wish to explore (papers by Caroline Palmer and Andrew Chang).

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fred Cummins

Reviewer #2: No

Reviewer #3: Yes: Peter Pfordresher

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Oct 21;16(10):e0258747. doi: 10.1371/journal.pone.0258747.r004

Author response to Decision Letter 1


28 Sep 2021

Please see the attached 'Response to Reviewers' document for our point-by-point responses to the Editor's final comments.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Vera Kempe

5 Oct 2021

Convergence in voice fundamental frequency during synchronous speech

PONE-D-21-16330R2

Dear Dr. Bradshaw,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Vera Kempe

11 Oct 2021

PONE-D-21-16330R2

Convergence in voice fundamental frequency during synchronous speech

Dear Dr. Bradshaw:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof Vera Kempe

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Debrief data Experiment 1.

    Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 1.

    (CSV)

    S2 File. Debrief data Experiment 2.

    Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 2.

    (CSV)

    S1 Table. Sentence stimuli.

    Sentences taken from the Harvard IEEE corpus of sentences used in Experiments 1 and 2.

    (PDF)

    Attachment

    Submitted filename: PLoSSynchSpeech.pdf

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Data files for all experiments reported on in this manuscript are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

