PLOS One. 2021 Oct 21;16(10):e0258747. doi: 10.1371/journal.pone.0258747

Convergence in voice fundamental frequency during synchronous speech

Abigail R Bradshaw*, Carolyn McGettigan
Editor: Vera Kempe
PMCID: PMC8530294  PMID: 34673811

Abstract

Joint speech behaviours where speakers produce speech in unison are found in a variety of everyday settings, and have clinical relevance as a temporary fluency-enhancing technique for people who stutter. It is currently unknown whether such synchronisation of speech timing among two speakers is also accompanied by alignment in their vocal characteristics, for example in acoustic measures such as pitch. The current study investigated this by testing whether convergence in voice fundamental frequency (F0) between speakers could be demonstrated during synchronous speech. Sixty participants across two online experiments were audio recorded whilst reading a series of sentences, first on their own, and then in synchrony with another speaker (the accompanist) in a number of between-subject conditions. Experiment 1 demonstrated significant convergence in participants’ F0 to a pre-recorded accompanist voice, in the form of both upward (high F0 accompanist condition) and downward (low and extra-low F0 accompanist conditions) changes in F0. Experiment 2 demonstrated that such convergence was not seen during a visual synchronous speech condition, in which participants spoke in synchrony with silent video recordings of the accompanist. An audiovisual condition in which participants were able to both see and hear the accompanist in pre-recorded videos did not result in greater convergence in F0 compared to synchronisation with the pre-recorded voice alone. These findings suggest the need for models of speech motor control to incorporate interactions between self- and other-speech feedback during speech production, and suggest a novel hypothesis for the mechanisms underlying the fluency-enhancing effects of synchronous speech in people who stutter.

Introduction

Synchronised vocal behaviours are ubiquitous across a variety of everyday settings and can be found in every culture [1]. The production of speech in unison by a group of speakers can be observed in as diverse settings as places of worship, schools, sports stadiums, protest marches, military parades and political rallies. Often such behaviours serve to promote social cohesion and bonding amongst their participants [2]. However, as well as studying joint speech behaviours at the level of the goals and intentions of the collective, we can also consider their impact on lower-level processes within the individual; specifically, for the focus of this paper, how might synchronised speech affect the engagement of different systems for speech motor control?

A clue that synchronised speech behaviours may radically change the nature of the mechanisms underlying motor control of speech is found in the observation of their powerful efficacy as a temporary fluency enhancer for people who stutter. The speech of people who stutter is characterised by frequent dysfluencies that interrupt the smooth flow of speech, such as syllable repetitions, prolongations, and blocks (tense pauses in between speech sounds). Strikingly, reading a prepared text in synchrony with another speaker (here termed synchronous speech; also referred to as “choral speech” or “choral reading” in this literature) can temporarily reduce the occurrence of such dysfluent events by 90–100% in most individuals who stutter [3]. Disruption to the processing of sensory feedback from the self-voice has long been theorised to play a role in stuttering [for a recent review, see 4]. However, in synchronous speech, the brain simultaneously receives auditory feedback from the self-voice and the voice of the synchronisation partner (known as the accompanist). It is of interest to consider how this concurrent speech feedback might affect sensorimotor integration processes during speech motor control, in both typically fluent individuals and people who stutter.

It is already known that when engaged in a dialogue, interlocutors show a tendency to align with each other across multiple linguistic levels. For example, speakers may converge in their choice of syntactic structures and lexical wordforms during conversational interactions [5]. As well as this alignment in the content of speech, convergence can also be seen in the voices of the interlocutors; that is, they can start to sound more similar in terms of the way in which they produce speech sounds [for a review, see 6]. This vocal convergence has been demonstrated either by direct measurement of acoustic and phonetic voice parameters [7], such as voice fundamental frequency (F0, the acoustic correlate of pitch), formant frequencies such as F1 and F2 (spectral information in the speech signal which determines perceived vowel identity) and speech rate; or through the use of perceptual judgements by independent listeners in AXB designs [8]. In the latter, an external set of listeners are asked to judge whether a participant’s utterance during an interaction (stimulus A) sounds more similar to the model utterance from their interlocutor (stimulus X), compared with that participant’s baseline utterance when speaking alone (stimulus B). If listeners tend to choose A more often than B as the stimulus most similar to X, this suggests that the participant’s speech has moved perceptibly toward that of the interlocutor when speaking with them. This vocal convergence has been studied across a range of interactive contexts, including conversations [9], shared-reading tasks [10], and interactive verbal games [11]. Such behaviour has traditionally been interpreted within the conceptual framework of Communication Accommodation Theory [12], which views such responses as a socially motivated strategy for establishing social affiliation with an interlocutor.
Accordingly, there is evidence that social factors can influence the extent to which interlocutors converge or even diverge during interactions. For example, greater vocal convergence has been reported when speakers perceive their partner as more attractive [13], or of a higher social status than themselves [14]; conversely, greater divergence was reported when speakers interacted with an insulting interviewer [15].

An alternative theoretical framework for understanding vocal convergence effects is the Interactive Alignment model [5, 16]. This views such convergence as a subconscious process, driven by an automatic priming mechanism. The latest version of the theory [16] proposes that while listening to the speech of an interlocutor, a listener engages in covert imitation of that speech; the listener then uses a ‘simulation route’ to derive forward models that predict their partner’s upcoming speech utterances based on experience of their own speech actions. This covert imitation tends to also co-activate representations within the production system, which can then spill over into the speaker’s own productions resulting in overt imitation. This automatic priming is proposed to occur across all levels of linguistic representation, including semantics, syntax and phonology. The resulting convergence in the speech of the interlocutors is thought to then facilitate mutual understanding between them.

In support of this view of convergence as a more low-level and unconscious process, vocal convergence effects have been reported across a range of ‘non-social’ contexts, such as vowel repetition or speech shadowing tasks, in which participants listen to words produced by a model speaker and are asked to produce each word they hear as quickly as possible [8, 17–20]. Sato et al. [18] asked participants to produce vowel sounds, first cued by orthographic targets, then acoustic targets, and finally by orthographic targets again. Participants were found to show convergence in F0 and F1 towards the acoustic targets, which was sustained during the second presentation of the orthographic targets. These ‘after-effects’ were interpreted as evidence of offline recalibration of the sensory-motor targets that drive speech production. The authors proposed that these convergence effects should be viewed within the same framework as sensorimotor accounts of speech motor control, in which forward models of the intended or predicted sensory outcomes of speech movements are compared to actual sensory feedback [21–23]. Specifically, they argued that these models should additionally incorporate the influence of external speech inputs on these processes; that is, the external speech environment leads to adaptive changes in the sensory goals for speech, which in turn can result in imitative changes in speech productions.

Further evidence in support of these ideas is found in studies that demonstrated effects of external speech inputs on responses to perturbations of self-produced auditory feedback. Such perturbations can be made in real-time to the F0 or formants of the auditory feedback a speaker receives from their own voice during speech; these typically trigger opposing changes in the speech productions of the speaker in order to compensate for the perceived sensory error [24, 25]. The extent of such speech motor learning has been shown to be affected by both explicit perceptual training with another voice that aimed to shift perception of a phoneme category boundary [26], and by implicit perceptual learning processes triggered by mere exposure to another voice that cued a participant’s own speech production [27]. Therefore, other voices can affect both the vocal characteristics of a speaker, and the responsiveness of their speech motor control system to experimentally induced ‘sensory errors’ in their speech feedback.

This evidence highlights the need for speech motor control research to move beyond studies in which participants speak on their own, to paradigms that try to capture the true dynamics of the interactive contexts in which speech is typically used in everyday life. This challenge to the prevailing approach in cognitive psychology of studying individual minds in isolation was made more broadly by the ‘Joint Action’ framework, which argues that a comprehensive understanding of mind and behaviour requires the study of the coordinated actions that are so prevalent in everyday interactions among individuals [28, 29]. In order to extend models of speech motor control to account for sensorimotor processes in such interactive contexts, we need more evidence on how control of an individual’s speech is affected by the speech and voices of talkers with whom they are interacting.

It is thus of interest to consider the effects on speech motor control of a synchronous external speech input. Specifically, does speaking in synchrony with an accompanist have any influence on the voice of a speaker, beyond the obvious changes in speech timing? If so, do these changes in the voice of the speaker relate systematically to those of the accompanist voice? In samples of people who stutter, it has been shown that the fluency enhancing effects of synchronous speech rely on the presence of spectral information within the accompanist speech signal. A study by Rami, Kalinowski, Rastatter, Holbert and Allen found that synchronous speech with a speech signal that was low-pass filtered (at 100Hz) so as to remove formants but retain glottal source information (including F0) was not effective at reducing stuttering frequency; conversely, a speech signal low-pass filtered to include the source and only partial spectral cues (either F1 only or F1 and F2) was sufficient to induce fluency during synchronous speech [30]. This suggests that, at least in people who stutter, spectral information in the accompanist voice has an effect on speech motor control and the sound of the speaker’s voice. Other research with samples of typically fluent speakers has reported that synchronous speech reduces variability in the pitch, intonation, amplitude and vowel duration of produced speech [31–34]. However, crucially, it is unknown to what extent these changes in speaking style represent convergence with the acoustic characteristics of the accompanist voice. That is, it is possible that any changes in the acoustic features of the speaker’s voice during synchronous speech may simply be a by-product of the process of synchronising the timing of one’s speech with any external stimulus, be that another voice or a non-speech stimulus (e.g. a metronome).

Overall, this study aims to bring together these different strands of findings in the literature, in order to provide insight into the influence of external speech inputs on speech motor control. Firstly, as outlined above, we know that synchronous speech can result in changes to the speech of a talker, be that a reduction in dysfluency in people who stutter, or reduced variability in speaking style in typically fluent speakers. However, we currently have no insight into the acoustic specificity of these changes in relation to the acoustic/phonetic properties of the other voice. Secondly, it is known that pairs of talkers tend to converge in speaking style during speech tasks, showing changes in their speech that are specific to the acoustic/phonetic properties of their interlocutor’s voice. So far, however, this has not been investigated for synchronous speech, and so it is yet unknown whether convergence effects would generalise to this task. Thirdly, it is known that perturbations of simultaneous self-voice feedback induce compensatory and adaptive changes in the acoustics of a talker’s speech productions that are specific to the precise perturbation applied. These responses have been interpreted within theoretical frameworks as evidence for the prioritisation of auditory feedback for speech sensorimotor control. However, such studies, and the models they inform, have never incorporated the effects of simultaneous other-voice feedback on speech sensorimotor behaviours. It is thus currently unknown whether these responses and underlying processes are specific to self-voice feedback, or whether they might also be employed for responding to other-voice feedback during speech motor control.

In order for us to advance our understanding of these effects, the current study aimed to test whether 1) speakers given simultaneous other-voice feedback during speech (by speaking in synchrony with another talker) show changes in their vocal acoustics, and 2) whether these changes are specific to the external acoustic feedback, rather than being driven by some other aspect of speech synchronisation. In order to do this, experiments are needed which test for acoustic convergence during synchronous speech that is specifically dependent on the auditory properties of the accompanist talker. In Experiment 1, we test this by measuring changes in F0 during synchronous speech, specifically testing for upward changes in F0 during synchronisation with a higher-pitched accompanist, but downward changes in F0 in participants who synchronised with a lower-pitched accompanist. Furthermore, in order to test theoretical frameworks that prioritise auditory feedback for speech motor control, it is important to confirm that these convergent changes are indeed primarily driven by acoustic properties of the other talker’s voice, and are not equally generated by other types of input like visual speech information. We test this in Experiment 2 through the inclusion of a visual-only synchronous speech condition (where participants could see but not hear the accompanist), as well as an audiovisual condition (where participants could both see and hear the accompanist). Across both experiments, we predicted that participants would show changes in F0 that were dependent on the F0 of the accompanist voice they experienced (i.e. they should increase to a high F0 accompanist, decrease to a low F0 accompanist, and remain unchanged during visual synchronous speech). Conversely, if changes in F0 are simply a by-product of synchronisation of speech timing with an external stimulus, we would expect uniform changes in F0 across these different conditions.

General methods

The Gorilla Experiment Builder (www.gorilla.sc) was used to create and host all experiments reported in this paper [35]. The online recruitment platform Prolific (www.prolific.ac) was used for participant recruitment. All participants were compensated for their time by payment of £3.75, administered via Prolific. This study received ethical approval from the local ethics officer at the Department of Speech, Hearing and Phonetic Sciences at University College London (approval no. SHaPS-2019-CM-030). All participants gave informed consent prior to taking part in the study.

Data, analysis scripts and stimuli for each of the two experiments reported here are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

General procedure

All experiments began with a headphone screening task, to ascertain that participants were wearing headphones and listening in a quiet environment [36]. This task makes use of a perceptual phenomenon called ‘Huggins pitch’, an illusory pitch percept which relies on dichotic presentation of stimuli. Specifically, each ear receives the same white noise stimulus, but with a phase shift of 180° in a narrow frequency band in one channel. When wearing headphones this results in perception of a faint pitch amongst noise; conversely this percept is not generated when the stimuli are played over loudspeakers. On each trial, participants are asked to detect which of three white noise stimuli contains the hidden tone. Participants who failed to reach criterion performance on this task were not permitted to proceed to the main study (see Data Exclusion). For Experiment 1 this criterion was set to a score of 6/6; for Experiment 2 this criterion was slightly relaxed to a score of at least 5/6, in an effort to reduce the high level of rejection of participants who reported they were in fact wearing headphones.
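For readers unfamiliar with the technique, the dichotic construction described above can be sketched in Python/NumPy. This is an illustrative sketch only; the band limits, duration, and function name are our assumptions, and the actual screening stimuli are those of the published test [36]:

```python
import numpy as np

def huggins_stimulus(fs=44100, dur=1.0, f_lo=580, f_hi=620, seed=0):
    """Stereo Huggins-pitch stimulus: both ears receive the same white
    noise, except for a 180-degree phase shift in a narrow frequency
    band in one channel (heard as a faint pitch only over headphones)."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    noise = rng.standard_normal(n)
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    shifted = spec.copy()
    shifted[band] *= -1          # multiply by -1 = 180-degree phase inversion
    left = noise
    right = np.fft.irfft(shifted, n)
    return np.stack([left, right])
```

Because the phase inversion leaves the magnitude spectrum untouched, each channel heard alone is indistinguishable from plain noise; the pitch percept arises only from the interaural phase difference, which is why it cannot survive the acoustic mixing of loudspeaker playback.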

The first experimental task for all experiments was a solo reading task, which required participants to read aloud a series of sentences presented on the screen while their speech was audio recorded. This was treated as a ‘baseline’, in order to measure the participant’s F0 when not exposed to another voice/engaging in synchronous speech. This was followed by a synchronous speech task, in which participants were again audio recorded while speaking the same set of sentences, this time in synchrony with another person (the accompanist). The experiment ended with a short debrief, in which participants were asked to rate how well they synchronised with the other voice (from 1-not at all to 5-perfectly), and to report whether they noticed anything about that voice. They were further asked what type of headphones they were using, and to rate how loudly they could hear both the other voice and their own voice on a scale of 1 (very quiet) to 5 (very loud). The data from this debrief questionnaire for each of the experiments can be found in the S1 and S2 Files.

Audio recording

Due to the online nature of this study, we had a limited amount of control over the recording set-ups of our participants, in terms of the type of device and microphone used, and the background environment (e.g. level of noise). This is a clear limitation of the current work, and reflects the fact that this research was conducted during the global COVID-19 pandemic, in which in-person testing was not possible.

Variability in the technological device and software application used for speech recording can affect the measured acoustic signal, via factors such as the type of compression used, the use of filters by different software programmes and differing sampling rates. A small number of studies have begun to provide evidence on the effects of such variation in remote recording set-ups on measurement of acoustic and phonetic parameters in speech recordings [37, 38]. In general, these indicate that identification of contrasts within speakers such as vowel arrangements tends to be fairly robust across different remote recording set-ups compared to gold-standard laboratory audio recordings; however, measurement of absolute raw acoustic/phonetic parameters such as frequency and duration measures could be systematically affected by the device or software used, sometimes in vowel-specific ways. This was particularly found to be the case for higher frequencies such as measurement of F2. The current study focused on measurement of F0 in speech recordings, which was reported by Sanker and colleagues to not significantly differ across different remote recording set-ups when averaging across vowels (although there were vowel-specific differences) [37]. Their study also found that variability associated with differences in the software application used for recording (e.g. Zoom, Skype, Facebook messenger) was greater than that associated with variability in the recording device (e.g. whether a Mac or PC was used). Based on the findings and recommendations of their study, we implemented a number of methodological decisions in order to mitigate the potential effects of participants’ idiosyncratic recording set-ups.

Firstly, the software application that was used to collect the audio recordings was held constant across participants. Recordings were collected via Gorilla audio recording software, which is powered by the WebRTC (Real Time Communication) API within the browser. This software uses the default settings within the browser it is run on, and does not implement any additional algorithms, compression or functions on the recorded data. Audio recordings were saved as MP3 files. Participants were constrained to the use of a laptop or desktop computer to complete the study; completion using mobile phones or tablets was not allowed. The operating system and browser used by each participant across both experiments can be found in S1 and S2 Files. Furthermore, most analyses were based on within-participant comparisons; that is, F0 measurements were compared across solo reading and synchronous speech tasks within participants. The effects of the recording set-up would therefore have been uniform across the conditions used in these contrasts; e.g. if F0 was overestimated in the recordings, this would be uniform across both solo reading and synchronous speech, and so relative comparisons between these conditions would be unaffected. Additionally, when between-group comparisons were made (e.g. between high F0 and low F0 accompanist conditions), a random effect of participant was included in analyses. All audio recordings were individually checked for each participant; persistent/excessive background noise across recordings resulted in exclusion of that participant’s data (see Data Exclusion).

Acoustic analysis

Audio recordings from both experiments were analysed using a custom-made script within the software package Praat [39]. This script first isolated the voiced segments of the acoustic signal (by removing pauses and unvoiced consonants) and then took the median F0 value (in Hz) of each sentence. This was run on each participant’s audio recordings from the solo reading and synchronous speech tasks.
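The logic of this analysis step (discard unvoiced frames, then take the median F0 of what remains) can be sketched in Python. This is an illustrative stand-in, not the authors' Praat script; it assumes a frame-wise pitch track has already been extracted, with unvoiced frames coded as 0 Hz (Praat's convention):

```python
from statistics import median

def sentence_median_f0(f0_track):
    """Median F0 (in Hz) of one sentence, given a frame-wise pitch track
    in which unvoiced frames (pauses, unvoiced consonants) are coded as 0."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in this recording")
    return median(voiced)
```

The median (rather than the mean) makes the per-sentence estimate robust to occasional pitch-tracking errors such as octave jumps.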

Experiment 1

The aim of Experiment 1 was to investigate whether participants showed convergent changes in the F0 of their produced speech towards the F0 of an accompanist’s voice during synchronous speech. In a between-subjects design, participants encountered an accompanist with either an unusually high F0 or an unusually low F0, in order to examine potential convergent changes in F0 in both upward and downward directions. Initially, two conditions were tested corresponding to these high and low F0 conditions. After this data was collected and analysed, a further ‘extra-low F0’ condition was tested, in which participants synchronised with an accompanist voice with an even lower F0. This was added after observing the distribution of solo reading baseline F0 in the initial sample tested with our a priori designed conditions (see Results). For simplicity, we report here on the results of a single data analysis incorporating all three conditions.

Participants

Twenty female participants (mean age = 28.65 years) took part in the initial high F0 and low F0 conditions, with an equal number of participants (10) in each condition. After data collection and analysis, a further 10 female participants (mean age = 28.2 years) were recruited to take part in an additional extra-low F0 condition. These numbers reflect the final participant samples included in analyses (see Data Exclusion). All participants in all three conditions were native speakers of English, with most of the sample being of British nationality (one participant was Canadian and one American).

Evidence on the effects of speaker gender on variability in the extent of vocal convergence observed is inconclusive [6, 9, 17, 40], but has led many studies to restrict their samples to female speakers only; we therefore similarly opted to only recruit female participants for our experiment so as to have same-sex pairs with our female accompanist voice (see Stimuli).

Design

The solo reading task (our baseline task) consisted of 50 trials, with 50 trial-unique sentences. Participants were instructed to start reading each sentence after a visual 3-2-1 countdown presented on the screen, with an interval of 1 second between each number. Participants had 9 seconds to read the sentence (including time taken by the countdown), followed by an inter-trial-interval of 2 seconds. The synchronous speech task asked participants to read the same sequence of sentences, this time in synchrony with audio recordings of another voice (the accompanist). A trial began with three metronome clicks (with an interstimulus interval of 1 second between clicks) followed by the accompanist voice speaking the sentence; participants were told to use the three-click countdown to help them to start speaking at the same time as the accompanist. Again, participants were given 9 seconds to read the sentence, followed by an inter-trial-interval of 2 seconds. Participants were first given 5 practice trials (with novel sentences) to practise speaking in synchrony with the other voice. After each practice trial, they were instructed to adjust their volume to a level at which they could hear both the other voice and their own voice at a loud and clear volume. Following this practice and volume calibration phase, participants were instructed not to make any further changes to their sound volume. A further 50 trials were then presented, consisting of the same set of sentences as in the solo reading task, presented in an identical order. One of these trials was designated as a vigilance trial; for this trial, when the written sentence appeared onscreen, instead of hearing the other voice read the sentence, participants unexpectedly heard the voice ask them to read the last word in the sentence. This vigilance trial was included to check that participants were attending to the audio through the headphones, and not simply reading the sentences without listening to the audio.
The audio recordings from this trial were checked for accuracy, but not included in further analyses of F0. This thus resulted in a total of 49 audio recordings of interest for each of the two tasks.

Stimuli

Audio stimuli in the synchronous speech task consisted of audio recordings of a female speaker of Standard Southern British English reading 49 sentences taken from the Harvard IEEE corpus of sentences [41]. The full set of sentences is given in S1 Table. These sentences had an average length of 8 words, and are designed to be phonetically balanced. The tokens were recorded using the internal microphone of a MacBook Air and the software programme Audacity [42]. These audio recordings were matched for sound intensity via RMS norming in Praat [39]. A custom-made script in Praat [39] was used to shift the F0 of the voice in these recordings either up or down to create separate stimulus sets for the high, low and extra-low F0 conditions. Stimuli for the high F0 accompanist condition were created by shifting F0 up by 2 semitones; stimuli for the low F0 accompanist condition were created by shifting F0 down by 5 semitones. Both these stimulus sets then underwent F0 norming to the average of the median values of sentences within that set; median F0 values of the sentences were normed to 265Hz for the high condition, and 170Hz for the low condition. To create the extra-low condition stimuli, the F0 of the recordings was first shifted down by 9 semitones. In order to preserve the perceived naturalness of the voice in the recordings, a small adjustment to formant spacing was also made to increase perceived vocal tract length using an open-source Praat script [43]; within this script, the ‘vocal tract lengthening factor’ parameter was set to 1.1, corresponding to a change of just under 2 semitones. The median F0 of these sentences was then normed to 140Hz. Audio stimuli for the five practice trials of the synchronous speech task (consisting of five additional unique sentences, see S1 Table) and the vigilance trial were recorded by the same speaker, and underwent the same processes of F0 normalisation so as to be in keeping with the F0 values of their respective conditions.
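For reference, the semitone manipulations above correspond to simple frequency ratios: a shift of n semitones multiplies F0 by 2^(n/12). A minimal sketch (the function name is ours):

```python
def shift_semitones(f0_hz, n_semitones):
    """Return the F0 (Hz) that results from shifting f0_hz by
    n_semitones (positive = upward, negative = downward shift)."""
    return f0_hz * 2.0 ** (n_semitones / 12.0)
```

For example, the 9-semitone downward shift used to create the extra-low stimuli corresponds to multiplying F0 by 2^(-9/12) ≈ 0.595, i.e. a reduction of roughly 40%.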

Choice of these F0 targets for the initial high and low F0 conditions was guided by data on the distribution of female voice F0s from the UCLA Speaker Variability Database [44, 45]. For the purposes of this study, we looked at data taken from this speech corpus on audio recordings from 50 female speakers of American-English reading a set of 5 of the Harvard IEEE sentences. These were repeated twice within each of three sessions, giving a total number of 30 tokens per participant. Average F0 values across these tokens were calculated for each participant, and upper and lower cut-offs obtained by taking the values two standard-deviations above and below the group average. This resulted in an upper cut-off of 240Hz and a lower cut-off of 175Hz; our values of 265Hz and 170Hz for the high and low conditions were chosen to be outside of these cut-offs, and therefore were expected to be substantially higher/lower than the baseline F0s of our female participants during solo reading. However, on observing the distribution of average F0 values found in our sample during the solo reading baseline task (see Results), we decided to test an additional sample of participants with the extra-low F0 condition outlined above, with an even lower accompanist median F0 of 140Hz.
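The cut-off computation described above (two standard deviations above and below the group mean of per-speaker average F0) can be sketched as follows; the function name is ours, and any input would be the per-speaker average F0 values in Hz rather than the UCLA database itself:

```python
from statistics import mean, stdev

def f0_cutoffs(speaker_means, n_sd=2.0):
    """Lower and upper cut-offs (Hz): n_sd sample standard deviations
    below and above the group mean of per-speaker average F0 values."""
    m = mean(speaker_means)
    sd = stdev(speaker_means)  # sample (n-1) standard deviation
    return m - n_sd * sd, m + n_sd * sd
```

Under a roughly normal distribution, targets placed outside these cut-offs would be expected to lie above or below the baseline F0 of about 97.5% of speakers, which is the rationale for choosing 265Hz and 170Hz beyond them.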

Data exclusion

During data collection, various checks on data quality and task performance were conducted to exclude problematic participants, who were then immediately replaced. Firstly, participants who failed the headphone check were not permitted to proceed to the main study. Across all three conditions, a total of 15 participants failed this headphone check and were immediately replaced. Of those who passed the headphone check, a further 13 participants were excluded prior to data analysis, either because they failed the vigilance trial (2 participants), had poor quality audio recordings (e.g. excessive background noise) that affected pitch tracking (6 participants), or because the accompanist voice was audible in their audio recordings (e.g. due to headphones with poor sound insulation) and so interfered with pitch tracking of the participant’s voice (5 participants). Again, all of these 13 participants were replaced, so as to achieve the target sample size of 10 participants per condition.

Data from individual trials within a participant were excluded if the participant failed to read the sentence correctly (e.g. made a large speech error or missed the sentence completely) or if there was excessive background noise on that trial that affected pitch tracking. Further, within each task, trials in which F0 values were more than 3 standard deviations from the mean were excluded. These criteria resulted in exclusion of 3.40% of trials across the whole sample.

Measures and hypotheses

A measure of F0 change from baseline was calculated for each participant by taking the difference in semitones between their F0 values during solo reading and during synchronous speech on a sentence-by-sentence basis, and then averaging across these sentence-wise difference values. This measure preserves the sign of the differences, and so indicates whether the change was negative (F0 decreased during synchronous speech) or positive (F0 increased during synchronous speech).
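As a concrete sketch, assuming the standard semitone conversion (12 times the base-2 logarithm of the frequency ratio), the measure could be computed as:

```python
import math

def f0_change_semitones(solo_f0s, sync_f0s):
    """Mean sentence-wise F0 change from solo reading to synchronous speech.

    solo_f0s, sync_f0s: matched per-sentence F0 values in Hz for one
    participant. Positive output means F0 rose during synchronous speech;
    negative means it fell. Illustrative code, not the original script.
    """
    diffs = [12 * math.log2(sync / solo)
             for solo, sync in zip(solo_f0s, sync_f0s)]
    return sum(diffs) / len(diffs)
```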

Our central hypothesis across all experiments was that F0 changes during synchronous speech would be specific to the accompanist voice experienced. For this experiment, we therefore predicted that participants in the high F0 accompanist condition would show significant increases in F0 from solo reading to synchronous speech, while participants in the low and extra-low F0 accompanist conditions would show significant decreases in F0. At the group level, we predicted that this difference in the direction of F0 change would produce a significant group difference, with the predicted negative F0 change in the low and extra-low conditions being significantly lower (i.e. more negative) than the predicted positive F0 change in the high condition.

Results

Within-participant analysis of convergence

Convergence patterns shown by individual participants in each of the three conditions are shown in Fig 1; this plots the difference (in semitones) between the accompanist F0 and (i) the participant’s average F0 at solo reading and (ii) the participant’s average F0 at synchronous speech. To determine if each participant showed significant convergence towards the accompanist voice in F0, a two-sided one-sample t-test was run for each participant to compare their sentence-wise F0 change values (synchronous speech minus solo reading) with zero. These tests were used to categorise participants into convergers (significant increase in F0 at synchronous speech in the high condition; or significant decrease in F0 in the low and extra-low conditions), divergers (significant decrease in F0 at synchronous speech in the high condition; or significant increase in F0 in the low and extra-low conditions) or non-convergers (no significant change in F0). The frequencies in each category across high, low and extra-low conditions are given in Table 1; colour coding in Fig 1 is also used to indicate convergence status. Across the whole sample, 16 participants converged and 14 did not converge (no change or diverged); however, these seem to be somewhat unevenly distributed across the conditions, with slightly fewer convergers in the extra-low F0 condition.
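The converger/diverger classification can be sketched as follows. This illustrative version replaces the exact two-sided t-test p-value with a fixed critical t value (2.0, roughly the two-tailed .05 cutoff for the degrees of freedom involved); the function name and threshold are ours:

```python
import math
import statistics

def classify_participant(f0_changes, direction, t_crit=2.0):
    """Label a participant 'converged', 'diverged' or 'no change'.

    f0_changes: sentence-wise F0 change values (semitones, sync minus solo).
    direction: +1 if convergence requires raising F0 (high F0 accompanist),
               -1 if it requires lowering F0 (low/extra-low accompanists).
    """
    n = len(f0_changes)
    m = statistics.mean(f0_changes)
    se = statistics.stdev(f0_changes) / math.sqrt(n)
    t = m / se  # one-sample t statistic against zero
    if abs(t) < t_crit:
        return "no change"
    return "converged" if (t > 0) == (direction > 0) else "diverged"
```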

Fig 1. Individual participant convergence patterns.

Each graph plots for each participant (one per row) the difference in semitones between the accompanist voice mean F0 (represented by the filled circles at zero) and (i) the participant’s mean F0 at solo reading (shown in the filled triangles) and (ii) the participant’s mean F0 at synchronous speech (shown in the empty triangles). Symbol colour represents whether each participant demonstrated significant convergence, divergence, or no change in their F0 between solo reading and synchronous speech, as determined using one-sample t-tests on their F0 change values (see Table 1). The three conditions (high F0, low F0 and extra-low F0) are plotted on separate graphs (A, B and C).

Table 1. Frequencies of convergers, divergers and non-convergers (no change in F0) in high, low and extra-low F0 groups from Experiment 1.

Group        High F0 condition    Low F0 condition    Extra-low F0 condition
Converged    6                    6                   4
Diverged     2                    2                   3
No change    2                    2                   3

Group analysis of convergence

To check the comparability of our three groups, we first checked that the average F0 of participants during the solo reading baseline task did not significantly differ between the groups (i.e. before they were exposed to our experimental manipulation in the synchronous speech task). A one-way ANOVA found that average F0 at solo reading baseline did not differ between groups (F(2, 27) = 0.087, p = .917). Average F0 at solo reading baseline ranged from 163.74Hz to 243.79Hz in the group that went on to experience the high F0 condition (M = 193.86, SD = 25.28) and from 170.01Hz to 262.94Hz in the group that went on to experience the low F0 condition (M = 198.21, SD = 26.07). From these descriptives, it can be seen that the distribution of F0 values measured during solo reading baseline in the low F0 group overlaps with the average F0 of the accompanist voice to which they were subsequently exposed in the synchronous speech task (170Hz). Since it was our aim to use accompanist voices whose average F0 was far away from that of our sample at solo reading, this motivated the design of our additional extra-low F0 accompanist condition. In this condition, the accompanist voice’s average median F0 (140Hz) was substantially lower than the distribution observed in our initial sample. In the third group tested with this extra-low condition, average F0 at baseline solo reading ranged from 176.11Hz to 222.51Hz (M = 195.52, SD = 18.54), and thus did not overlap with the average F0 of the accompanist voice they subsequently experienced.
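The baseline comparability check is a standard one-way between-subjects ANOVA on per-participant baseline F0 means. A minimal sketch of the F statistic (the paper presumably used standard R routines; this implementation is ours):

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for a one-way between-subjects ANOVA.

    groups: one list of per-participant baseline F0 means (Hz) per condition.
    Degrees of freedom as reported in the paper: between = k - 1,
    within = N - k (here F(2, 27) for 3 groups of 10).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```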

Trial-by-trial values for the change in F0 from baseline (synchronous speech minus solo reading) are shown for high, low and extra-low conditions in Fig 2; group averages across the whole experiment are shown in Fig 3. It can be seen in Fig 3 that one participant in the high condition demonstrated a noticeably greater change in F0 compared to the group mean (increase of 6.13 semitones); this was within three standard deviations of the group mean, and since no a priori criteria were set for outlier detection, this participant’s data were included in analyses. To test whether the direction of F0 change from solo reading to synchronous speech differed significantly between high and low/extra-low conditions, a linear mixed modelling (LMM) analysis was performed using the lmerTest package in R [46]. A random intercept model was fitted on F0 change values, with a fixed effect of condition (high, low and extra-low) and random intercepts of sentence and participant. A random slope of condition by sentence was not included due to model convergence issues. The model showed that F0 change was significantly lower in the low (β = -1.62, t(30.01) = -2.67, p = .012) and extra-low conditions (β = -1.25, t(30.01) = -2.06, p = .048) compared to the high condition (the reference condition in the model; intercept β = 0.99, t(30.14) = 2.32, p = .027). These significant effects reflect the difference in the direction of F0 change between conditions; that is, as predicted, F0 change from solo reading baseline was positive in the high condition, but negative in the low and extra-low conditions. This suggests that the acoustic properties of the accompanist’s voice (specifically, F0) did affect the nature of the change in the participants’ voices during synchronous speech.

Fig 2. F0 change across trials in high, low and extra-low conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) for each trial of the task, averaged across participants in high, low and extra-low conditions. Shaded areas around lines indicate standard error.

Fig 3. Average F0 change in high, low and extra-low conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) averaged across trials for participants in the three conditions of Experiment 1. Dots indicate individual participant averages, thick line indicates group means, boxes indicate standard errors.

Discussion: Experiment 1

Overall, Experiment 1 found evidence of significant convergence in F0 to an accompanist voice during synchronous speech, both when convergence required a raising of F0 (high F0 accompanist condition) and when it required a lowering of F0 (low and extra-low F0 accompanist conditions). That is, changes in F0 observed during synchronous speech were specific to the acoustic properties of the accompanist voice experienced by participants, arguing against the idea that these F0 changes might simply result from the act of synchronisation itself (which would have predicted a uniform increase or decrease in F0 across all conditions).

As previously explained, it was necessary to design and test post-hoc an ‘extra-low’ F0 accompanist condition in this experiment in addition to the originally planned high and low F0 conditions. This was because the average F0 of our initial low F0 accompanist voice overlapped with the distribution of ‘baseline’ F0 values within our sample as measured during solo reading. Previous research has suggested that the initial acoustic distance between two speakers can affect the degree of vocal convergence observed, with most studies reporting that convergence is facilitated by a greater distance at baseline [47–49], although the reverse pattern has also been reported [10, 50]. More recently, it has been argued that the former apparent relationship may in fact be an artefact of the way in which convergence is typically calculated, specifically in relation to the use of a ‘difference-in-distance’ measure (comparing the change in absolute distance between a participant and a model talker from before to after exposure) [51, 52]. In general, Priva and Sanker [51] argued that measurement of convergence can be unreliable when participants’ baseline values are close to the model talker, as the influence of random variability within an individual is likely to overshadow genuine convergent changes, leading to underestimation of convergence or even apparent divergence. This point thus supports our rationale for the inclusion of our extra-low F0 condition, to ensure there was sufficient room for downward convergent changes from solo reading baseline to synchronous speech to be observed. On the other hand, others have suggested that some degree of overlap between the distributions of productions from two interlocutors is necessary for facilitating convergence, so that randomly produced matches between talkers are likely to occur and be reinforced; this was supported by evidence from both model simulations and experimental data [53].

The rationale for including conditions designed to induce both upward and downward convergent shifts in F0 was to determine whether measured changes in F0 during synchronous speech might simply be driven by the act of speech synchronisation itself. That is, it is possible that the act of adjusting the timings of one’s speech so as to be in synchrony with an external stimulus could itself induce systematic global changes in F0. There is evidence that speaking conditions associated with increased effort can lead to global changes in F0. For example, when speaking in adverse listening conditions (e.g. background noise or babble), speakers adopt a ‘clear speech’ speaking style that is accompanied by global changes in F0; Hazan and colleagues [54] reported global increases in mean F0 that did not serve to enhance specific phonological contrasts, suggesting they could reflect a by-product of increased speaking effort. In the field of vocal convergence, a study by Kappes, Baumgaertner, Peschke and Ziegler [55] found reduced convergence in F0 in a speech shadowing task (where participants had to repeat non-words with as minimal a delay as possible, even during the ongoing stimulus presentation) compared to a delayed repetition task; they attributed this to an overall tendency to increase F0 due to increased speaking effort in the shadowing task. In a similar way, synchronous speech could have a global effect on F0 due to the increased effort associated with timing the pace of one’s speech with an external stimulus.

Our demonstration in Experiment 1 that F0 changes during synchronous speech appeared specific to the acoustic characteristics of the accompanist voice experienced goes some way to addressing this concern. However, a more direct test of this idea can be achieved through the use of a visual synchronous speech condition. In such a task, speakers are required to produce sentences in synchrony with silent videos of another talker; this thus preserves the synchronisation aspect of the standard synchronous speech condition, while removing the acoustic input from the accompanist voice. We therefore ran a second experiment that compared F0 changes between such a visual synchronous speech condition and an audio-only synchronous speech condition (identical to the high F0 condition of Experiment 1), in order to provide a stricter test of the hypothesis that changes in F0 during synchronous speech depend on the acoustics of the accompanist voice. By testing for significant changes in F0 during visual synchronous speech, this experiment will thus provide greater clarity on the interpretation of the F0 changes observed in Experiment 1. For example, if visual synchronous speech also results in significant increases in F0, these global F0 changes may have contributed to the significant upward shifts observed in the high F0 condition of Experiment 1.

In addition to comparing visual-only and auditory-only synchronous speech conditions, Experiment 2 further tested a group of participants with a combined audiovisual synchronous speech condition. This allowed us to ask whether being able to see as well as hear the person you are synchronising with would affect convergence to the accompanist voice. Previous research has demonstrated that in some circumstances, visual speech information can enhance vocal convergence to auditory speech over listening to that speech alone. Dias and Rosenblum [56] reported that being able to see as well as hear an interlocutor enhanced convergence during a live interactive search task. Conversely, a follow-up study by the same authors found that visual enhancement of convergence during a non-interactive speech shadowing task was only observed when auditory targets were presented in low-level noise [57]. Thus, enhancement of convergence by audiovisual cues may require either live interaction between speakers, or failing that, challenging auditory conditions. We aimed to investigate this in the context of synchronous speech, by comparing F0 changes across audiovisual and audio-only conditions. Since our speech synchronisation task involves neither live interaction between interlocutors, nor challenging auditory conditions, we might predict based on evidence from Dias and Rosenblum that we would not see enhancement of convergence in our audiovisual condition. Alternatively, synchronising one’s speech with a pre-recorded accompanist that can be seen as well as heard may nevertheless increase participants’ perception of a social interaction taking place, resulting in enhanced convergence.

Methods: Experiment 2

Participants

Thirty female participants (mean age = 27.77 years) took part in this experiment. All participants were native speakers of British English. For this experiment, we only recruited participants who had been born and currently resided in the South East of England, in order to recruit a sample whose accent would match that of the accompanist (see Stimuli). An equal number of participants took part in the three synchronous speech conditions (10 in the audio-only condition, 10 in the visual-only condition and 10 in the audiovisual condition). These numbers reflect the final participant samples used in analyses (see Data Exclusion).

Design

Three synchronous speech conditions were tested in this experiment: an audio-only condition in which participants synchronised their speech with a pre-recorded voice (as in Experiment 1); a visual-only condition in which participants synchronised their speech with silent videos of another person speaking; and an audiovisual condition in which participants synchronised their speech with the same videos including access to the audio channel (i.e. the accompanist voice).

Some modifications to the solo reading and synchronous speech tasks were necessary in order to create matched visual-only, audio-only and audiovisual conditions. Primarily, in order to speak sentences in synchrony with videos of a person speaking, the participant cannot read a written sentence during synchronous speech; in the visual-only condition with silent videos this would completely disrupt synchronisation, while in the audiovisual condition it would likely lead to participants synchronising with the audio only and ignoring the video. Instead, visual synchronous speech requires participants to produce sentences that have been previously memorised. To achieve this, the number of sentence tokens was reduced to three sentences from the set used in Experiment 1 (see S1 Table). In the solo reading task, participants produced 15 repetitions of each of these three sentences in a pseudo-random order (in which the same sentence could not appear more than three times in a row), giving a total of 45 trials. For this task, the written sentence remained on-screen throughout the trial. All timings were identical to those of the solo reading task in Experiment 1. Participants were instructed to read the sentences after hearing a three-click countdown. In the synchronous speech task, the number and order of sentences to be spoken was identical to the solo reading task; however, this time the sentence was presented onscreen for three seconds before disappearing. The participant then heard the three-click countdown and had 9 seconds to speak the sentence in synchrony with the accompanist. Additionally, after each synchronous speech trial the participant was asked to report whether the sentence they saw/heard the accompanist speak was the same as the sentence they had been cued to say. This was to assess performance on two randomly occurring vigilance trials, in which the accompanist spoke a different sentence to the one the participant had been cued to speak (see Stimuli). Participants had up to 5 seconds to report whether there was a mismatch (yes/no) before the next trial began with the presentation of the next sentence onscreen. Participants were given 5 practice trials with these three sentence tokens before completing 45 trials of synchronous speech.

Stimuli

Stimuli for the three synchronous speech conditions were adapted from three videos of a female speaker of Standard Southern British English. In each video, the speaker read one of three sentences taken from the larger set used in Experiment 1. Stimuli for the visual-only condition were created by removing the audio from these videos. Stimuli for the audio-only condition were created by extracting the audio from these videos, and applying the same pitch shifts as in the high F0 condition of Experiment 1, resulting in an average median F0 of 265Hz for the accompanist. Participants were presented with a fixation cross on the screen for the duration of the audio stimuli. Stimuli for the audiovisual condition were created by recombining the modified audio stimuli from the audio-only condition with the video stimuli. Additionally, vigilance trial stimuli were created for each condition. In these vigilance trial stimuli, the sentence spoken by the accompanist was different to the sentence the participant had been cued to speak. For the visual-only condition, two silent videos of the accompanist speaking two additional sentences selected from the larger set used in Experiment 1 were used as mismatching trials. For the audio-only condition, the corresponding audio from these additional videos was used (again with pitch modifications to norm median pitch to 265Hz). For the audiovisual condition, two types of vigilance stimuli were created: one token in which the video matched the cued sentence but the audio was mismatching; and one token in which the audio matched the cued sentence but the video was mismatching. This allowed us to identify any participants who were relying on one modality only for synchronisation (e.g. synchronising with the voice only while ignoring the concurrent videos). All stimuli in all three conditions began with the same three-click countdown before commencement of speech.

Data exclusion

Twenty-eight participants failed the headphone check and so were not permitted to proceed to the main study. Of those who passed the headphone check, a further 13 participants were excluded, either due to poor quality audio recordings (5 participants), the audible presence of the accompanist voice in the audio recordings (2 participants), incorrect responses on one or more of the vigilance trials (5 participants), or a failure to follow task instructions (1 participant, who spoke in time with the three metronome clicks in the solo reading task instead of waiting to speak after the countdown). All these participants were replaced, to ensure a final sample size of 30 participants, with 10 participants per synchronous speech condition.

Exclusion of individual trials within a participant’s data was again made if the participant made a large speech error or if the F0 value for that trial was more than three standard deviations away from the mean for that participant on that task. These criteria resulted in exclusion of 4.44% of trials across the whole sample.

Measures and hypotheses

The same measure of F0 change as used in Experiment 1 was calculated for this experiment. Again, our predictions for this experiment stem from our central hypothesis that F0 changes across conditions will be specific to the accompanist voice experienced. For this experiment, we therefore predicted that participants in the audio-only and audiovisual conditions would show significant increases in F0 from solo reading to synchronous speech (to converge to the high F0 accompanist). Conversely, we predicted that participants in the visual-only condition would show no significant change in F0 between these two tasks. This should therefore result in a significant difference in F0 change between the visual-only condition and the two conditions containing audio. Furthermore, to explore whether convergence to an accompanist voice during speech synchronisation is affected by the addition of visual cues, we also compared F0 change in the audio-only and audiovisual conditions. If being able to see as well as hear the accompanist during synchronous speech has a facilitatory effect on convergence, we would expect F0 change in the audiovisual condition to be significantly greater than that in the audio-only condition.

Results: Experiment 2

Within-participant analysis of convergence

Convergence patterns shown by individual participants in each of the three conditions are shown in Fig 4. The frequencies of convergers, divergers and non-convergers across the three synchronous speech conditions are given in Table 2; colour coding in Fig 4 is also used to indicate convergence status. As can be seen, while the majority of participants in the audio-only and audiovisual conditions showed significant convergence, most participants in the visual-only condition showed significant divergence. Average baseline F0 (at solo reading) did not differ between the three groups that went on to experience our three synchronous speech conditions (F(2,27) = 0.455, p = .639).

Fig 4. Individual participant convergence patterns.

Each graph plots for each participant (one per row) the difference in semitones between the accompanist voice mean F0 (represented by the filled circles at zero) and (i) the participant’s mean F0 at solo reading (shown in the filled triangles) and (ii) the participant’s mean F0 at synchronous speech (shown in the empty triangles). Symbol colour represents whether each participant demonstrated significant convergence, divergence, or no change in their F0 between solo reading and synchronous speech, as determined using one-sample t-tests on their F0 change values (see Table 2). The three conditions (audiovisual, audio-only and visual-only) are plotted on separate graphs (A, B and C).

Table 2. Frequencies of convergers, divergers and non-convergers (no change in F0) in the three synchronous speech conditions from Experiment 2.

Group        Visual-only condition    Audio-only condition    Audiovisual condition
Converged    1                        8                       8
Diverged     7                        2                       0
No change    2                        0                       2

Group analysis of convergence

Trial-by-trial values for change in F0 from baseline (synchronous speech minus solo reading) are shown for the three synchronous speech conditions in Fig 5; group averages across the whole experiment are shown in Fig 6. To test whether F0 change differed among the conditions, two linear mixed models were compared: a null model with random intercepts of participant and sentence and a random slope of condition by sentence, and a full model with the same random effects and a fixed effect of condition. A likelihood ratio test found that the full model provided a better fit to the data: χ2(2) = 15.39, p < .001. Pairwise comparisons between conditions using estimated marginal means (with Tukey’s HSD adjustment for multiple comparisons) were performed using the emmeans package in R [58]; these found significant differences between the visual-only and audio-only conditions (t(33.2) = -2.65, p = .032) and the visual-only and audiovisual conditions (t(34.3) = -4.19, p < .001). F0 change was thus significantly greater in the two conditions containing audio compared to the visual-only condition, supporting our first hypothesis. Conversely, these comparisons did not find a significant difference between the audio-only and audiovisual conditions (t(34.8) = -1.56, p = .277). This is consistent with the hypothesis based on prior literature that the enhancing effects of audiovisual cues on convergence may only be observed during tasks involving live interaction between speakers, or else during challenging auditory conditions.

Fig 5. F0 change across trials in audio, visual and audiovisual conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) for each trial of the task, averaged across participants in each of the three conditions of Experiment 2. Shaded areas around lines indicate standard error.

Fig 6. Average F0 change in visual, audio and audiovisual conditions.

Graph shows average change in F0 from solo reading baseline to synchronous speech (in semitones) averaged across trials for participants in the three conditions of Experiment 2. Dots indicate individual participant averages, thick line indicates group means, boxes indicate standard errors.

In addition to comparing F0 change across conditions, it is also of interest to consider whether F0 change in the visual-only condition was significantly different from zero; that is, does synchronising speech with an external stimulus in the absence of any acoustic input lead to any significant changes in F0? To test this, the same model as above was fitted (except with the random slope of condition by sentence removed due to model convergence issues), but this time with the intercept suppressed (set to zero). We then obtained 95% confidence intervals on the estimates for the three levels of condition using the confint function from the stats package in R [59]. Confidence intervals for the audio-only and audiovisual conditions did not contain zero (all lower bounds > .15 and upper bounds < 1.9), indicating significant upward changes in F0 from solo reading baseline to synchronous speech. Conversely, the confidence interval for the visual-only condition did contain zero (-0.87, 0.18), indicating that F0 change from solo reading to synchronous speech was not significantly different from zero. This pattern was also seen in the significance of the fixed effects in the zero-intercept model; the effect for the visual-only condition was not significant (β = -0.348, t(30.36) = -1.35, p = .188), whereas effects for the audio-only (β = 0.675, t(30.38) = 2.61, p = .014) and audiovisual conditions (β = 1.285, t(30.37) = 4.96, p < .001) were both significant.
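The zero-in-CI check can be sketched with a simplified Wald interval (estimate ± z·SE). Note this is only an approximation: R's confint on a fitted mixed model uses profile likelihood by default, and the SE value below is back-calculated from the reported β and t rather than taken from the paper:

```python
def wald_ci(estimate, se, z=1.96):
    """Approximate 95% confidence interval for a fixed-effect estimate."""
    return estimate - z * se, estimate + z * se

def contains_zero(ci):
    """True if the interval includes zero, i.e. the effect is not
    significant at the corresponding alpha level."""
    lo, hi = ci
    return lo <= 0.0 <= hi

# SE back-calculated as |beta| / |t| from the visual-only effect reported
# above (beta = -0.348, t = -1.35), giving se of roughly 0.258.
visual_ci = wald_ci(-0.348, 0.258)
```

With these values the Wald interval for the visual-only condition spans zero, consistent with the profile-likelihood interval reported in the text.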

Discussion: Experiment 2

In summary, the findings of Experiment 2 first of all replicate our finding from Experiment 1 that synchronisation with a high-pitched accompanist voice induces increases in the F0 of participants’ speech productions (audio-only condition). Further, they demonstrate that these convergent F0 changes are not seen when synchronising with a silent accompanist (visual-only condition). Lastly, being able to see as well as hear the accompanist (audiovisual condition) did not have a significant effect on the magnitude of F0 convergence observed, compared to our audio-only condition.

It should be noted that there was some evidence of changes in F0 in the visual-only synchronous speech condition; in contrast to the convergent upward changes in F0 seen in the audio-only and audiovisual conditions to the high F0 voice, a number of participants in the visual-only condition showed a tendency to slightly decrease their voice F0 (demonstrated in the high number of participants showing apparent ‘divergence’ in Table 2). This could be driven by a slowing of speech rate and production of flatter intonation patterns during synchronisation (which would affect our median measure of F0 across each sentence). These changes in F0 were however not significant at the group level in our zero-intercept model. In Experiment 1, the best convergence appeared to be seen in the high F0 accompanist condition; this demonstration that visual-only synchronous speech does not induce upward changes in F0 therefore suggests that these significant upward shifts are being driven by exposure to the high F0 of the accompanist voice, and thus reflect true convergence.

Interestingly, other work has reported evidence of convergence during speech shadowing of single words in silent visual speech; using an AXB procedure, Miller, Sanchez and Rosenblum [60] found that speakers’ productions of single words during lipreading and speech shadowing of a silent model talker were judged as sounding more similar to that model than their non-shadowing utterances. They suggested that visual speech cues can thus provide perceivers with information on the phonetic content and articulatory style of a speaker that can drive alignment in their speech productions. In contrast, our results suggested a tendency for some of our participants to effectively diverge in F0 from the accompanist during visual synchronous speech, due to a slight decrease in F0 (although these changes in F0 were not significant at the group level). This discrepancy from the previous study may lie in differences in the paradigms used, i.e. single-word shadowing versus synchronised production of sentences. Alternatively, this difference may lie in our focus on a single acoustic measure (F0); we focused on F0 convergence because this was explicitly manipulated in the accompanist voice, and was considered more likely to be robust when measured from audio recordings collected remotely online compared to formant frequencies [37]. It is entirely possible, however, that participants in our study did converge with the accompanist on other phonetic aspects such as formant frequencies that would be missed by our analysis but picked up in AXB judgements as in Miller et al. [60]. Our findings therefore simply suggest that convergent changes in F0 specifically during synchronous speech rely on access to acoustic input from the accompanist’s voice, and are not likely explained by the mere act of synchronising the timing of one’s speech with another (silent) talker.

Furthermore, in our study, the addition of visual speech cues to auditory speech information did not result in enhanced convergence relative to an audio-only condition. While some previous work has reported enhanced convergence for audiovisual conditions [56], this effect appears to be specific to tasks involving live interaction between interlocutors. Our findings with a non-interactive task involving pre-recorded stimuli instead accord with findings from Rosenblum and Dias, who similarly found no enhancement of convergence for an audiovisual condition over an audio-only condition in a speech shadowing task with pre-recorded stimuli [57]. In that study, an enhancing effect of audiovisual cues on convergence for this non-interactive speech task was only observed when model talker stimuli were presented in low-level noise. Together with our findings, this suggests that in a non-interactive context, visual speech information is only employed to support convergence when auditory speech information is rendered less informative. Interestingly, this appears to be driven by access to speech-relevant articulatory information, and not simply the social relevance of a face stimulus, with Dias and Rosenblum reporting no enhancement of convergence in low-level noise when the mouth area of the model talker was blurred in their video stimuli. Relating this to the Interactive Alignment Model [5, 16], access to the articulating mouth may have enabled more accurate or stronger covert imitation of the model talker’s speech via the simulation route, resulting in enhanced priming of representations in the production system and thus increased imitation.

Conversely, full visual access to a live interlocutor within an interactive speech task has been reported to enhance convergence compared to a condition where the interlocutor was fully occluded from view [56]. It is possible, therefore, that a similar enhancement of convergence would be seen when engaging in synchronous speech with a live accompanist who could be seen as well as heard. In this context, having visual access to a live accompanist may increase the potential for a two-way exchange of social cues, leading to increased motivation for vocal convergence in order to facilitate social bonding with the accompanist as an interaction partner. In this way, whether and how visual cues are used to support vocal convergence appears to differ across tasks and contexts, particularly with regard to the presence or absence of a live interlocutor.

General discussion

The experiments reported in this paper provide evidence of vocal convergence (in F0) during synchronous speech in female speakers. Experiment 1 demonstrates that speakers shift the F0 of their produced speech towards that of an accompanist voice during synchronous speech, both when this requires a lowering and when it requires a raising of F0. Experiment 2 demonstrates that these convergent F0 changes are not seen when synchronising speech with another speaker in the absence of acoustic input, suggesting that they are indeed driven by perception of the accompanist voice. For this task, being able to see as well as hear the accompanist did not have a significant effect on the magnitude of F0 convergence. Overall, therefore, as hypothesised, changes in F0 during synchronous speech were specific to the F0 of the accompanist voice. This suggests that these changes reflect true convergence, and are not simply a by-product of synchronising speech timing with an external stimulus.

Synchronous speech is an interesting case for models of speech motor control to consider, as these models typically place the processing of self-generated auditory feedback as central to guiding speech production [21–23]. Across many of these models, the brain is proposed to compare a prediction of the expected or intended auditory feedback from a speech utterance with the actual auditory feedback it receives; this is important for maintaining stability in speech and for enabling any sensory errors to be corrected online (as demonstrated in altered auditory feedback experiments, e.g. [24]). However, in synchronous speech, the brain simultaneously receives auditory feedback from the self-voice and the voice of the accompanist. The evidence from the current study suggests that this scenario results in sensorimotor processes that drive alignment of speech movements to an external target from the other voice, rather than to an internally defined target or prediction. Some evidence on what this might look like at the neural level was reported by Jasmin et al. [61], who found that synchronous speech with a live (but not a pre-recorded) accompanist resulted in a release from speech-induced suppression in the right anterior temporal lobe (the usual reduction in the auditory cortex response to self-produced speech compared to passive listening to that speech). Such speech-induced suppression is typically interpreted as reflecting a cancellation of the response to the auditory speech signal by subtraction of an auditory prediction of the expected/intended feedback for that utterance (also known as the efference copy or forward model). Interestingly, this release from suppression was not seen in a condition where the other talker spoke a different (i.e. non-matching) sentence simultaneously with the participant, suggesting that it is the synchronous, and not merely the simultaneous, nature of the other speech input that drives this enhanced response for synchronous speech. The authors suggested that this release from suppression reflects a blurring of the distinction between self- and other-sensory feedback during speech, such that self-generated speech feedback is processed as if it were externally generated.

Overall, the current study, along with multiple other lines of evidence, highlights the necessity for models of speech motor control to incorporate influences of external speech inputs on sensorimotor processes during speech (as also argued previously by [18]). In particular, these models will need to consider how the brain balances and prioritises competing pressures: maintaining stable articulatory targets on the one hand, versus flexibly changing those targets in response to external speech inputs on the other. The use of predictions during speech is widely assumed, but it is unknown how predictions of self- and other-speech feedback interact during speech production, or whether synchronous (rather than simply overlapping or simultaneous) speech among talkers represents a ‘special’ case for these predictive processes in speech motor control.

One interesting comparison to make between the sensorimotor adaptation responses seen in altered auditory feedback studies and vocal convergence responses concerns their time-scale. Adaptation responses to altered self-voice feedback typically ramp up over a series of trials [62–64]. Conversely, in our first experiment, F0 change did not appear to increase linearly across the task (see Fig 2). This is consistent with some previous studies of convergence; for example, in a study of F0 convergence during a shared reading task, Aubanel and Nguyen found that the maximal level of convergence was achieved at the beginning of the task and then remained relatively stable across the interaction [10]. However, in our second experiment, convergence did appear to show an upward trend across trials, particularly in the audiovisual condition (see Fig 5). One possible reason for this difference in the pattern of convergence between experiments could lie in the use of a restricted stimulus set in Experiment 2 (only three sentences were repeated across trials, in contrast to the trial-unique sentences used in Experiment 1). This is perhaps more similar to the design of altered auditory feedback experiments, in which the same small set of stimuli is typically repeated across multiple blocks. Some authors have argued that convergence in different cues can take on multiple different temporal patterns, either continuously increasing across an interaction or dynamically fluctuating according to a speaker’s motivational state and intentions [65]. It will be useful for future studies of vocal convergence to consider more closely the factors that might affect the dynamics of the convergence response across time.
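As a concrete illustration, trial-level convergence of this kind is commonly quantified with a difference-in-distance measure (an approach discussed, including its limitations, in [51, 52]), with F0 expressed in semitones to respect the logarithmic nature of pitch perception. The sketch below uses hypothetical F0 values and is not the study's actual analysis code:

```python
import math

def semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to a reference frequency."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def did_convergence(baseline_f0, sync_f0, model_f0):
    """Difference-in-distance: positive values mean the speaker ended up
    closer (in semitones) to the model during synchronisation than at baseline."""
    base_dist = abs(semitones(baseline_f0) - semitones(model_f0))
    sync_dist = abs(semitones(sync_f0) - semitones(model_f0))
    return base_dist - sync_dist

# Hypothetical speaker: 220 Hz at baseline, lowering to 200 Hz while
# synchronising with a 180 Hz accompanist -> positive convergence score.
score = did_convergence(220.0, 200.0, 180.0)
```

Because the measure is computed relative to each speaker's own baseline, both upward and downward F0 shifts towards the accompanist register as positive convergence, while movement away from the accompanist yields a negative score.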

The current results also have implications for interpretation of the fluency-enhancing effects of synchronous speech in people who stutter. Traditionally, these have been attributed to the provision of an external rhythm, which enables a switch from reliance on a faulty basal ganglia-cortical route for internally timed speech to reliance on an intact cerebellar-cortical route for externally timed speech [66, 67]. The present findings suggest that synchronous speech recruits additional sensorimotor processes that drive imitation-like changes in speech productions in typically fluent speakers. This provides a link between synchronous speech and other effective fluency-enhancers used with people who stutter that involve changes in the acoustic/phonetic properties of their speech feedback, either through active processes (e.g. speaking in a foreign accent) or through experimental perturbations of auditory speech feedback. Dysfunction in the processing and forward modelling of self-produced auditory speech feedback has been hypothesised in stuttering [68–70; for a review, see 4]. Indeed, there is recent evidence for an absence of speech-induced suppression effects in people who stutter [71, 72], suggesting faulty auditory prediction and internal modelling. Synchronous speech might circumvent these dysfunctional processes by biasing speech motor control away from reliance on faulty or unstable internally specified targets for speech and towards imitation of the external targets provided by the accompanist voice.

Alternatively, the beneficial effects of synchronous speech in people who stutter may relate to engagement of an entrainment process [73] that drives changes in the acoustics of speech productions. This interpretation of vocal convergence based on dynamical systems theory [74] was put forward by Pardo [75], who suggested that a pair of interacting talkers can be viewed as an informationally coupled dynamical system. Within such a system, a ‘magnet effect’ occurs in which externally derived information (i.e. the other speaker’s voice) acts as a forcing function on internal dynamics; over time, the more dominant talker pulls the less dominant talker into coordination so that their dynamics (i.e. vocal acoustics) more closely resemble each other, resulting in relative (but not absolute) coordination. This kind of entrainment process may provide stability or an alternative route for speech motor control in people who stutter. Further research on the effects of synchronous speech in people who stutter, both in terms of potential convergence of their speech productions with the accompanist and the nature of neural mechanisms of prediction during synchronous speech, would provide insights into these ideas.
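The 'magnet effect' described above can be illustrated with a toy dynamical-systems sketch (not a model from the study itself, and with arbitrary parameter values): two coupled phase oscillators, where the subordinate talker is strongly coupled to the dominant one but not vice versa, so the dominant talker pulls the subordinate towards its own dynamics.

```python
import math

def simulate(steps=2000, dt=0.01, w_dom=5.0, w_sub=4.0, k_dom=0.1, k_sub=2.0):
    """Two coupled phase oscillators (a minimal Kuramoto-style sketch).
    The subordinate is strongly coupled to the dominant (k_sub >> k_dom),
    so its effective frequency is pulled towards the dominant's."""
    p_dom, p_sub = 0.0, 1.0          # phases
    f_dom = f_sub = 0.0              # instantaneous frequencies
    for _ in range(steps):
        f_dom = w_dom + k_dom * math.sin(p_sub - p_dom)
        f_sub = w_sub + k_sub * math.sin(p_dom - p_sub)
        p_dom += f_dom * dt
        p_sub += f_sub * dt
    return f_dom, f_sub              # frequencies at the end of the run

f_dom, f_sub = simulate()
# The pair phase-locks near the dominant's natural frequency (5.0), far from
# the subordinate's (4.0): relative, not absolute, coordination.
```

The asymmetric coupling constants stand in for talker dominance: because the forcing from the dominant talker outweighs the reverse influence, coordination is achieved almost entirely through change in the subordinate's dynamics, mirroring the one-way pull described by the magnet effect.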

Some limitations of the current study must be acknowledged when interpreting the results. Firstly, relatively small sample sizes were used for our conditions (n = 10 per condition). Although these are in keeping with sample sizes used by others in the field of vocal convergence [e.g. 57, 75, 76], our findings should be viewed as preliminary and in need of replication. It is promising, however, that we were able to replicate significant convergence across multiple synchronous speech conditions in two separate experiments. The study is also limited by the use of an online design and the resulting variability in the recording set-ups used by different speakers, which could have affected our acoustic measurements of F0 [37]. Although several steps were taken to mitigate potential effects of this variability, the study would benefit from replication in a laboratory environment, where specialist equipment could be used to record participants’ voices in person. This would further allow a wider range of acoustic and phonetic measures to be examined, such as formant frequencies.

The online study design also limited us to the use of a pre-recorded accompanist voice for the synchronous speech task. As already discussed, this lack of live interaction with the accompanist may have resulted in our failure to observe an enhancing effect of audiovisual cues on convergence [56, 57]. Other studies have reported important differences between synchronous speech conditions involving pre-recorded versus live accompanists. Synchronous speech with a pre-recorded accompanist is more difficult and leads to reduced success in synchronisation [32, 77]. The decreased variability in vowel productions reported for synchronous speech has also been shown to be less pronounced for synchronisation with a pre-recorded versus a live accompanist [31, 32]. Furthermore, as discussed previously, the release from speech-induced suppression in the right anterior temporal lobe reported for synchronous speech was specific to a live accompanist condition [61]. From these findings, we might predict that even greater vocal convergence to an accompanist would be seen during live interaction. It should be noted, however, that the fluency-enhancing effects of synchronous speech in people who stutter do not appear to rely on a live accompanist [78]. Importantly, the unique aspect of live synchronisation is that both talkers can simultaneously align their speech productions with one another, rather than convergence being a one-way process. The absence of two-way interaction in our design thus limits the informativeness of this study for models of joint action, instead constraining our interpretation to individualistic representational models. It will be important for future research to follow up these preliminary findings with an in-person study of synchronous speech using live interacting pairs, in order to contribute to our understanding of joint speech production.

Conclusions

In conclusion, our study demonstrates vocal convergence in F0 during synchronous speech in typically fluent speakers; these changes in F0 were specific to the acoustic properties of the accompanist voice, and so were not simply driven by coordination of the timing of one’s speech with an external stimulus. The findings suggest the need for models of speech motor control to be extended to account for influences of external speech inputs on speech production. Further, they provide novel insights into the potential mechanisms behind the fluency-enhancing effects of synchronous speech in people who stutter; specifically, our findings suggest that synchronous speech may induce a shift in the balance with which the speech motor control system weights internally stored versus externally generated speech targets when guiding speech production.

Supporting information

S1 File. Debrief data Experiment 1.

Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 1.

(CSV)

S2 File. Debrief data Experiment 2.

Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 2.

(CSV)

S1 Table. Sentence stimuli.

Sentences taken from the Harvard IEEE corpus of sentences used in Experiments 1 and 2.

(PDF)

Data Availability

Data files for all experiments reported on in this manuscript are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

Funding Statement

This work was funded by a Research Leadership Award from The Leverhulme Trust (https://www.leverhulme.ac.uk), awarded to C.M. (RL-2016-013). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Cummins F. Joint speech as an object of empirical inquiry. Mater Relig 2018;14:417–9. 10.1080/17432200.2018.1485344. [DOI] [Google Scholar]
  • 2.McNeill WH. Keeping together in time. Cambridge, MA: Harvard University Press; 1995. [Google Scholar]
  • 3.Andrews G, Hoddinott S, Craig A, Howie P, Feyer A-M, Neilson M. Stuttering. J Speech Hear Disord 1983;48:226–46. 10.1044/jshd.4803.226 [DOI] [PubMed] [Google Scholar]
  • 4.Bradshaw AR, Lametti DR, McGettigan C. The Role of Sensory Feedback in Developmental Stuttering: A Review. Neurobiol Lang 2021;2:1–27. 10.1162/nol_a_00036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Pickering MJ, Garrod S. Toward a mechanistic psychology of dialogue. Behav Brain Sci 2004;27:169+. doi: 10.1017/s0140525x04000056 [DOI] [PubMed] [Google Scholar]
  • 6.Pardo JS, Urmanche A, Wilman S, Wiener J. Phonetic convergence across multiple measures and model talkers. Atten Percept Psychophys 2017;79:637–59. 10.3758/s13414-016-1226-0 [DOI] [PubMed] [Google Scholar]
  • 7.Gentilucci M, Bernardis P. Imitation during phoneme production. Neuropsychologia 2007;45:608–15. 10.1016/j.neuropsychologia.2006.04.004 [DOI] [PubMed] [Google Scholar]
  • 8.Goldinger SD. Echoes of echoes? An episodic theory of lexical access. Psychol Rev 1998;105:251–79. 10.1037/0033-295X.105.2.251 [DOI] [PubMed] [Google Scholar]
  • 9.Pardo JS, Jay IC, Hoshino R, Hasbun SM, Sowemimo-Coker C, Krauss RM. Influence of Role-Switching on Phonetic Convergence in Conversation. Discourse Process 2013;50:276–300. 10.1080/0163853X.2013.778168. [DOI] [Google Scholar]
  • 10.Aubanel V, Nguyen N. Speaking to a common tune: Between-speaker convergence in voice fundamental frequency in a joint speech production task. PLoS One 2020;15. 10.1371/journal.pone.0232209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Mukherjee S, D’Ausilio A, Nguyen N, Fadiga L, Badino L. The Relationship between F0 Synchrony and Speech Convergence in Dyadic Interaction. Proc. 18th Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH 2017), 2017, p. 2341–5. 10.21437/Interspeech.2017-795. [DOI]
  • 12.Giles H, Coupland N, Coupland J. 1. Accommodation theory: communication, context, and consequence. Context. Accommod. Dev. Appl. Socioling., New York: Cambridge University Press; 1991, p. 1–68. [Google Scholar]
  • 13.Michalsky J, Schoormann H. Pitch convergence as an effect of perceived attractiveness and likability. Proc. 18th Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH 2017), 2017, p. 2253–6. 10.21437/Interspeech.2017-1520. [DOI]
  • 14.Gregory SW, Webster S. A nonverbal signal in voices of interview partners effectively predicts communication accommodation and social status perceptions. J Pers Soc Psychol 1996;70:1231–40. 10.1037/0022-3514.70.6.1231 [DOI] [PubMed] [Google Scholar]
  • 15.Bourhis RY, Giles H. The language of intergroup distinctiveness. Lang. Ethn. Intergr. relations, London: Academic Press; 1977, p. 119–35. [Google Scholar]
  • 16.Pickering MJ, Garrod S. An integrated theory of language production and comprehension. Behav Brain Sci 2013;36:329–47. 10.1017/S0140525X12001495 [DOI] [PubMed] [Google Scholar]
  • 17.Pardo JS, Urmanche A, Wilman S, Wiener J, Mason N, Francis K, et al. A comparison of phonetic convergence in conversational interaction and speech shadowing. J Phon 2018;69:1–11. 10.1016/j.wocn.2018.04.001. [DOI] [Google Scholar]
  • 18.Sato M, Grabski K, Garnier M, Granjon L, Schwartz J-L, Nguyen N. Converging toward a common speech code: imitative and perceptuo-motor recalibration processes in speech production. Front Psychol 2013;4:422. 10.3389/fpsyg.2013.00422 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Garnier M, Lamalle L, Sato M. Neural correlates of phonetic convergence and speech imitation. Front Psychol 2013;4:600. 10.3389/fpsyg.2013.00600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Shockley K, Sabadini L, Fowler CA. Imitation in shadowing words. Percept Psychophys 2004;66:422–9. 10.3758/BF03194890 [DOI] [PubMed] [Google Scholar]
  • 21.Guenther FH, Ghosh SS, Tourville JA. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang 2006;96:280–301. 10.1016/j.bandl.2005.06.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Parrell B, Houde J. Modeling the Role of Sensory Feedback in Speech Motor Control and Learning. J SPEECH Lang Hear Res 2019;62:2963–85. 10.1044/2019_JSLHR-S-CSMC7-18-0127 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Parrell B, Lammert AC, Ciccarelli G, Quatieri TF. Current models of speech motor control: A control-theoretic overview of architectures and properties. J Acoust Soc Am 2019;145:1456–81. 10.1121/1.5092807 [DOI] [PubMed] [Google Scholar]
  • 24.Houde JF, Jordan MI. Sensorimotor adaptation in speech production. Science 1998;279:1213–6. 10.1126/science.279.5354.1213 [DOI] [PubMed] [Google Scholar]
  • 25.Burnett TA, Freedland MB, Larson CR, Hain TC. Voice F0 responses to manipulations in pitch feedback. J Acoust Soc Am 1998;103:3153–61. 10.1121/1.423073 [DOI] [PubMed] [Google Scholar]
  • 26.Lametti DR, Krol SA, Shiller DM, Ostry DJ. Brief Periods of Auditory Perceptual Training Can Determine the Sensory Targets of Speech Motor Learning. Psychol Sci 2014. 10.1177/0956797614529978 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Bourguignon NJ, Baum SR, Shiller DM. Please Say What This Word Is-Vowel-Extrinsic Normalization in the Sensorimotor Control of Speech. J Exp Psychol Percept Perform 2016;42:1039–47. 10.1037/xhp0000209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Sebanz N, Bekkering H, Knoblich G. Joint action: bodies and minds moving together. TRENDS Cogn Sci 2006;10:70–6. 10.1016/j.tics.2005.12.009 [DOI] [PubMed] [Google Scholar]
  • 29.Knoblich G, Butterfill S, Sebanz N. Psychological research on joint action: theory and data. In: Ross B, editor. Psychol. Learn. Motiv. Adv. Res. Theory, Vol 54, vol. 54, 2011, p. 59–101. 10.1016/B978-0-12-385527-5.00003-6. [DOI] [Google Scholar]
  • 30.Rami MK, Kalinowski J, Rastatter MP, Holbert D, Allen M. Choral Reading with Filtered Speech: Effect on Stuttering. Percept Mot Skills 2005;100:421–31. 10.2466/pms.100.2.421-431 [DOI] [PubMed] [Google Scholar]
  • 31.Poore MA, Ferguson SH. Methodological variables in choral reading. Clin Linguist Phon 2008;22:13–24. 10.1080/02699200701601971 [DOI] [PubMed] [Google Scholar]
  • 32.Cummins F. Prosodic characteristics of synchronous speech. In: Puppel S, Demenko G, editors. Prosody 2000 Speech Recognit. Synth., Krakow: Adam Mickiewicz University: 2000, p. 45–49. [Google Scholar]
  • 33.Cummins F. Synchronization Among Speakers Reduces Macroscopic Temporal Variability. Proc Twenty-Sixth Annu Conf Cogn Sci Soc 2004:256–61. [Google Scholar]
  • 34.Wang B, Cummins F. Intonation contour in synchronous speech. J. Acoust. Soc. Am., 2003, p. 2397. 10.1121/1.4778142. [DOI] [Google Scholar]
  • 35.Anwyl-Irvine AL, Massonnié J, Flitton A, Kirkham N, Evershed JK. Gorilla in our midst: An online behavioral experiment builder. Behav Res Methods 2020;52:388–407. 10.3758/s13428-019-01237-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Milne AE, Bianco R, Poole KC, Zhao S, Oxenham AJ, Billig AJ, et al. An online headphone screening test based on dichotic pitch. Behav Res Methods 2020. 10.3758/s13428-020-01514-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Sanker C, Babinski S, Burns R, Evans M, Kim J, Smith S, et al. (Don’t) try this at home! The effects of recording devices and software on phonetic analysis. LingBuzz; 2021. [Google Scholar]
  • 38.Freeman V, De Decker P, Landers M. Suitability of self-recordings and video calls: Vowel formants and nasal spectra. J Acoust Soc Am 2020;148:2714–5. 10.1121/1.5147526. [DOI] [Google Scholar]
  • 39.Boersma P, Weenink D. Praat: doing phonetics by computer 2021. [Google Scholar]
  • 40.Namy LL, Nygaard LC, Sauerteig D. Gender differences in vocal accommodation: The role of perception. J Lang Soc Psychol 2002;21:422–32. 10.1177/026192702237958. [DOI] [Google Scholar]
  • 41.IEEE Subcommittee on Subjective Measurements. IEEE Recommended Practice for Speech Quality Measurements. IEEE Trans Audio Electroacoust 1969;17:227–46. [Google Scholar]
  • 42.Audacity Team. Audacity(R): Free Audio Editor and Recorder 2021.
  • 43.Darwin C. Praat script: VTchange-dynamic 2005.
  • 44.Lee Y, Keating P, Kreiman J. Acoustic voice variation within and between speakers. J Acoust Soc Am 2019;146:1568–79. 10.1121/1.5125134 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Keating P, Kreiman J, Alwan A. A new speech database for within- and between-speaker variability. In: Calhoun S, Escudero P, Tabain M, Warren P, editors. Proc. 19th Int. Congr. Phonetic Sci., Melbourne, Australia; 2019.
  • 46.Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest Package: Tests in Linear Mixed Effects Models. J Stat Softw 2017;82:1–26. 10.18637/jss.v082.i13. [DOI] [Google Scholar]
  • 47.Walker A, Campbell-Kibler K. Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task. Front Psychol 2015;6:546. 10.3389/fpsyg.2015.00546 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Babel M. Evidence for phonetic and social selectivity in spontaneous phonetic imitation. J Phon 2012;40:177–89. 10.1016/j.wocn.2011.09.001. [DOI] [Google Scholar]
  • 49.Babel M. Dialect divergence and convergence in New Zealand English. Lang Soc 2010;39:437–56. 10.1017/S0047404510000400. [DOI] [Google Scholar]
  • 50.Kim M, Horton WS, Bradlow AR. Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Lab Phonol 2011;2:125–56. 10.1515/labphon.2011.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Priva UC, Sanker C. Limitations of difference-in-difference for measuring convergence. Lab Phonol 2019;10. 10.5334/labphon.200. [DOI] [Google Scholar]
  • 52.MacLeod B. Problems in the Difference-in-Distance measure of phonetic imitation. J Phon 2021;87. 10.1016/j.wocn.2021.101058. [DOI] [Google Scholar]
  • 53.Lee Y, Goldstein L, Parrell B, Byrd D. Who converges? Variation reveals individual speaker adaptability. SPEECH Commun 2021;131:23–34. 10.1016/j.specom.2021.05.001. [DOI] [Google Scholar]
  • 54.Hazan V, Grynpas J, Baker R. Is clear speech tailored to counter the effect of specific adverse listening conditions? J Acoust Soc Am 2012;132:EL371–7. 10.1121/1.4757698 [DOI] [PubMed] [Google Scholar]
  • 55.Kappes J, Baumgaertner A, Peschke C, Ziegler W. Unintended imitation in nonword repetition. Brain Lang 2009;111:140–51. 10.1016/j.bandl.2009.08.008 [DOI] [PubMed] [Google Scholar]
  • 56.Dias JW, Rosenblum LD. Visual influences on interactive speech alignment. Perception 2011;40:1457–66. 10.1068/p7071 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Dias JW, Rosenblum LD. Visibility of speech articulation enhances auditory phonetic convergence. Atten Percept Psychophys 2016;78:317–33. 10.3758/s13414-015-0982-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Lenth R, Buerkner P, Herve M, Love J, Riebl H, Singmann H. emmeans: Estimated Marginal Means, aka Least-Squares Means 2021. [Google Scholar]
  • 59.R Core Team. R: A language and environment for statistical computing 2019. [Google Scholar]
  • 60.Miller RM, Sanchez K, Rosenblum LD. Alignment to visual speech information. Atten Percept Psychophys 2010;72:1614–25. 10.3758/APP.72.6.1614 [DOI] [PubMed] [Google Scholar]
  • 61.Jasmin KM, McGettigan C, Agnew ZK, Lavan N, Josephs O, Cummins F, et al. Cohesion and Joint Speech: Right Hemisphere Contributions to Synchronized Vocal Production. J Neurosci 2016;36:4669–80. 10.1523/JNEUROSCI.4075-15.2016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Villacorta VM, Perkell JS, Guenther FH. Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am 2007. 10.1121/1.2773966. [DOI] [PubMed] [Google Scholar]
  • 63.Lametti DR, Smith HJ, Watkins KE, Shiller DM. Robust Sensorimotor Learning during Variable Sentence-Level Speech. Curr Biol 2018;28:3106–3113.e2. 10.1016/j.cub.2018.07.030 [DOI] [PubMed] [Google Scholar]
  • 64.Guenther FH. Neural Control of Speech. Cambridge, MA: The MIT Press; 2016. [Google Scholar]
  • 65.De Looze C, Scherer S, Vaughan B, Campbell N. Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Commun 2014;58:11–34. 10.1016/j.specom.2013.10.002. [DOI] [Google Scholar]
  • 66.Alm PA. Stuttering and the basal ganglia circuits: a critical review of possible relations. J Commun Disord 2004;37:325–69. 10.1016/j.jcomdis.2004.03.001 [DOI] [PubMed] [Google Scholar]
  • 67.Giraud A-L, Neumann K, Bachoud-Levi A-C, von Gudenberg AW, Euler HA, Lanfermann H, et al. Severity of dysfluency correlates with basal ganglia activity in persistent developmental stuttering. BRAIN Lang 2008;104:190–9. 10.1016/j.bandl.2007.04.005 [DOI] [PubMed] [Google Scholar]
  • 68.Civier O, Tasko SM, Guenther FH. Overreliance on auditory feedback may lead to sound/syllable repetitions: Simulations of stuttering and fluency-inducing conditions with a neural model of speech production. J Fluency Disord 2010;35:246–79. 10.1016/j.jfludis.2010.05.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Max L, Daliri A. Limited Pre-Speech Auditory Modulation in Individuals Who Stutter: Data and Hypotheses. J Speech Lang Hear Res 2019;62:3071–84. 10.1044/2019_JSLHR-S-CSMC7-18-0358 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Max L, Guenther FH, Gracco VL, Ghosh SS, Wallace ME. Unstable or Insufficiently Activated Internal Models and Feedback-Biased Motor Control as Sources of Dysfluency: A Theoretical Model of Stuttering. Contemp Issues Commun Sci Disord 2004;31:105–22. 10.1044/cicsd_31_S_105. [DOI] [Google Scholar]
  • 71.Daliri A, Max L. Modulation of auditory processing during speech movement planning is limited in adults who stutter. Brain Lang 2015;143:59–68. 10.1016/j.bandl.2015.03.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Meekings S, Jasmin K, Lima C, Scott S. Does over-reliance on auditory feedback cause disfluency? An fMRI study of induced fluency in people who stutter. BioRxiv 2020. 10.1101/2020.11.18.378265. [DOI] [Google Scholar]
  • 73.von Holst E. On the nature of order in the central nervous system. In: Martin R, editor. Behav. Physiol. Anim. man Collect. Pap. Erich von Holst, London: Methuen; 1973, p. 133–55. [Google Scholar]
  • 74.Beek PJ, Turvey MT, Schmidt RC. Autonomous and Nonautonomous Dynamics of Coordinated Rhythmic Movements. Ecol Psychol 1992;4:65–95. 10.1207/s15326969eco0402_1. [DOI] [Google Scholar]
  • 75.Pardo JS. On phonetic convergence during conversational interaction. J Acoust Soc Am 2006;119:2382–93. 10.1121/1.2178720 [DOI] [PubMed] [Google Scholar]
  • 76.Pardo JS, Gibbons R, Suppes A, Krauss RM. Phonetic convergence in college roommates. J Phon 2012;40:190–7. 10.1016/j.wocn.2011.10.001. [DOI] [Google Scholar]
  • 77.Cummins F. Practice and performance in speech produced synchronously. J Phon 2003;31:139–48. 10.1016/S0095-4470(02)00082-7. [DOI] [Google Scholar]
  • 78.Kiefte M, Armson J. Dissecting choral speech: Properties of the accompanist critical to stuttering reduction. J Commun Disord 2008;41:33–48. 10.1016/j.jcomdis.2007.03.002 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Vera Kempe

6 Jul 2021

PONE-D-21-16330

Synchronised speech and speech motor control: convergence in voice fundamental frequency during choral speech

PLOS ONE

Dear Dr. Bradshaw,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I was fortunate to obtain reviews from three experts. Although two reviewers are the standard at PLOS ONE, the fact that this manuscript received three reviews is not an indication that I thought it deserved extra scrutiny; rather, anticipating many reviewer declines based on past experience, I initially invited a larger number of reviewers and was simply lucky that three agreed to provide a review. I hope this will give you a wide range of helpful suggestions for how to improve the paper.

You will see that the reviewers raise important issues, which I encourage you to take on board. In addition, I have read the submission myself and have added some comments of my own, which largely overlap with the reviewers’. Below I summarise the crucial points, but I urge you to go through the reviews carefully to address and clarify all issues raised.

1.Reviewer 1 suggests streamlining the analyses and I concur with this suggestion. In fact, it seems to me that there is some redundancy that leads to conflicting results, especially in Experiment 3: While the difference between audio-only and audio-visual was not significant for difference scores in the first analysis it was so when both phases were considered separately in the second analysis. This may be due to spurious effects to do with the F0 values in the SR phase. In my view, there should only be one set of analyses for each experiment performed on adjusted (more on this below) differences between SR and CS with full random effect structure (i.e. not just random intercepts of participants and sentences but also random slope of Condition by sentences).

2. The theoretical limitations associated with the use of pre-recorded voices with respect to how informative the study can be for joint action models should be clearly acknowledged.

3. Aim for terminological coherence with the literature in the use of ‘choral speech’ and acknowledge other terms and their definitions.

4. If you streamline the F0 analysis as suggested, you may want to consider analysing F0 range. Indeed, far-reaching theoretical conclusions about convergence in general based on just one parameter may not be warranted.

5. I agree with Reviewer 1 that the Hypotheses do not flow directly from the Introduction and need to be better motivated early on. This may then remove the need to restate them for each experiment.

6. The alternative – that F0 convergence is a by-product of trying to converge in speech rate, presumably due to greater effort – should be theoretically better motivated.

7. Clarify the baseline comparison between groups prior to the experimental manipulation so that the condition label terminology does not become confusing. Also, please aim for consistency in labelling the solo reading phase either as that or as baseline.

8. Clarify what distances you are reporting. In addition to Reviewer 1’s comments on defining the differences, I would caution that absolute differences in Hz obscure the non-linear nature of pitch perception, so in my view differences should be reported using a log-transformed measure such as semitones.
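For concreteness, the Hz-to-semitone conversion the editor has in mind can be sketched as follows (a minimal illustration using the conventional 12·log2 definition; the function name and example values are illustrative, not taken from the manuscript):

```python
import math

def semitone_difference(f_hz: float, ref_hz: float) -> float:
    """Distance between two frequencies in semitones (12 semitones per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)

# An octave up (220 Hz -> 440 Hz) is exactly 12 semitones,
# whereas the raw Hz difference (220 Hz) would vary with register.
print(semitone_difference(440.0, 220.0))  # -> 12.0
```

Because the measure is logarithmic, the same semitone distance corresponds to larger Hz differences at higher pitches, which is why it better reflects perceived pitch distance than raw Hz.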

9. Introduce clear labels for all factors early on. It took the Reviewers and myself a while to figure out what ‘phase’ refers to – this should have been clearly stated when the design is explained.

10. Consider a direct comparison between XLO (Exp 2), LO (Exp 1) and HI (Exp 1) conditions, essentially merging Experiments 1 and 2 into a joint analysis, while maintaining transparency in reporting the order in which they were carried out.

11. In the figure captions, please state clearly what the abbreviations mean and what the units of the differences are (semitones, I would hope, in a revised version).

Finally, a small issue I noticed – in line 103, shouldn’t it say ‘if listeners tend to choose B more often…’?

I hope that these and the Reviewers’ comments will be helpful for you when preparing a revision of this submission.

Please submit your revised manuscript by Aug 20 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: See attachment. These extra words are necessary because the badly designed form insists on a minimum character count. Here are some more characters: zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz

Reviewer #2: This is an excellent paper that I am happy to recommend for publication. It provides a thorough and relevant positioning of its original research within the context of previous work on convergence, and in discussion makes connections to possible clinical applications associated with stuttering disorders. Moreover, as a paradigm for effective online data collection, with careful consideration of potential problems, validation methods and norming it has interest beyond its immediate research questions.

Specific Comments (by reviewer question)

1. Is the manuscript technically sound, and do the data support the conclusions?

Yes, with alternative explanations appropriately considered. With 10 participants per condition the contrasts are arguably underpowered in the initial experiment, however, this is mitigated by converging results from the additional experiments.

2. Has the statistical analysis been performed appropriately and rigorously?

Yes. In the LMMs it would be helpful to reorder the levels of the condition where appropriate from default alphabetic ordering such that one level ("high") is consistently baseline for comparison (see below).

3. Have the authors made all data underlying the findings in their manuscript fully available?

Yes, using the OSF platform.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Very well written and thoroughly intelligible.

Specific Comments (by line)

211 mention subject compensation, presumably accomplished through the Gorilla platform

455 "Therefore, as predicted, F0 change from baseline was positive in the high condition, but negative in the low condition." While this is clearly correct as visualized in Fig 2, to support this assertion from the LMM results you should also provide the (positive) intercept for the baseline "high" condition.

474 "phase" – I had to go back to see what was meant by "phase" here and suggest including something like "phase (solo vs. choral)" to help other readers with that reminder

592 In the earlier LMM the "high" level is baseline whereas here "extra-low" is used as baseline; similarly Fig 2 shows a high:low ordering while Fig 5 shows low:high. Suggest reordering levels away from the default alphabetic ordering to make this consistent.

804 "with Tukey['s HSD] adjustment"

899 "In contrast, our results suggested a tendency for speakers to diverge in F0 from the accompanist during visual choral speech." How can they diverge if they have no access to that cue? A more plausible reason would seem to be something associated with a repetition effect, which you might consider assessing across trials.

913 This paper is relevant here: Dias, J. W., & Rosenblum, L. D. (2016). Visibility of speech articulation enhances auditory phonetic convergence. Attention, Perception, & Psychophysics, 78(1), 317-333.

917 "perhaps via motor-based speech gesture representations" – the implication here is that F0 can be recovered from facial cues. While there is work suggesting that oral tract constriction gestures can be recovered in this way [e.g. Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1-2), 23-43.], I don't see a plausible mechanism for inferring F0.

929 "The provision of a dynamic face stimulus in addition to the voice would have increased the salience of the accompanist as a social agent, potentially increasing participants’ motivation for vocal convergence in order to facilitate social bonding with them as an interaction partner." – except that no exchange of social cues takes place with canned stimuli

976 "recruits additional processes" – this may be so, but should also be discussed in the context of entrainment, which might be providing stability

988 "Further research..." It would be interesting and useful for you to speculate on the possible differences between synchronized speech with a single partner vs. true choral speech as part of a synchronized group.

Reviewer #3: The authors report three experiments in which participants spoke in synchrony with recorded speech. The primary issue was whether participants would adapt spoken f0 when synchronizing with a recording of unusually high or unusually low speech. Across experiments, there was a tendency to converge spoken f0 with the recording; however this tendency was stronger for high than for low pitched recordings. In Experiment 3 participants imitated video recordings as well as video+audio for the high-pitched recordings. This experiment demonstrated that the bias for high-pitch convergence was not simply a byproduct of the joint speech task (vocal f0 went down when synchronizing with video only), but also that there was more convergence for video + audio, suggesting that the results may reflect tendencies based on social interactions.

This was an interesting paper and clearly reported. Here are my main concerns.

(1) It would be valuable to correlate the difference between baseline f0 (from the initial solo reading task) and f0 for synchronization targets with the degree of convergence for individuals.

(2) Sample sizes within each condition were surprisingly small (n=10), and this small sample size was not justified. I know data collection during COVID is difficult, but I think a larger sample of online participants should be possible.

(3) Arguments about applications to stuttering should be limited to discussion because there are no persons who stutter in the experiments. The argument in the discussion could be more clearly described. Why is it beneficial to bring speech motor control away from “internally specified targets and towards external targets”?

(4) The shift of f0 was more extreme for low than for high targets. Do we know that both targets sounded similarly natural? Could lower tendency to converge to low targets reflect the greater loss of naturalness from the more extreme shift?

Minor comments

Lines 413-417

This is a somewhat questionable use of t-test as it assumes independent sampling across trials for a participant.

Lines 430-432

Technically correct, but the lowest participant in the low-voice group is for all practical purposes identical in pitch to the low-voiced target.

Line 455

Why are there 20 degrees of freedom for t? With n=10 per sample, a t-test on independent samples should have df=18, I would think.

Figure 1

Curious that there does not seem to be a strong trend across trials. Why?

Figure 3 (and similar figures)

State what acronyms mean in the caption.

Table 2

Not sure this is necessary. Also not clear why there are 20.4 df for each mean (shouldn’t this be 9 per condition?).

Lines 522-523

What cultural differences predict lower f0 for British-English female speech in contrast to American English? Why is this not an artifact of the small sample?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fred Cummins

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PLoSSynchSpeech.pdf

Decision Letter 1

Vera Kempe

21 Sep 2021

PONE-D-21-16330R1

Convergence in voice fundamental frequency during synchronous speech

PLOS ONE

Dear Dr. Bradshaw,

Your revision has now been scrutinized by the original three reviewers. While two of them are happy with your revision, one reviewer recommends rejection, based mainly on a perceived discrepancy between the small sample size and the stated generalizability of the theoretical conclusions. However, I would like to see this paper published, also because the sample size did not figure prominently in the initial round of reviews. To make sure that the paper accounts for these potential criticisms, and also to ensure that recent suggestions are incorporated, I am sending it back to you for one final round of very minor revisions. If you decide to submit these revisions (which I very much hope you will), I will not send the paper out for further review but will simply check completion and accept.

For the final version, I recommend the following minor edits:

1. In the General Discussion, provide an acknowledgement of the relatively small sample size and an explanation of how this compares with sample sizes of similar published work. Then please include a statement addressing generalizability given your sample size. The aim of this is to give readers who are not working in this area a sense of where the study sits in this respect.

2. Reviewer 1 argues that in their view ‘synchronized’ speech should be reserved for laboratory speech. While I am not in a position to judge whether such use of terminology is warranted and aligns with a common view, please include a sentence clarifying whether you intend synchronous speech to be reserved for laboratory studies or not, and how this aligns with the literature. Again, my aim is to achieve maximal terminological clarity to allow readers from different areas and theoretical persuasions to fully benefit from reading your work.

3. Reviewer 2 provided a clarification of their earlier statement about entrainment – please check how your interpretation aligns with their conceptualization of entrainment and perhaps add a mention of the stated interpretation in your Discussion.

4. Reviewer 3 points out interesting links to the literature on musical synchronization – I leave it up to you whether you see value in pointing out or even elaborating briefly on these links in the general Discussion.

Please submit your revised manuscript by Nov 05 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor. You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have undertaken a substantial revision and the result is in many ways improved. The notion of baseline, and the introduction of motivated hypotheses are welcome and improve the reading of the text.

However, my judgement is that this is a small experiment performed under highly constrained conditions, that provides some suggestive data that might inform future studies. The data do not, in my opinion, warrant the conceptual elaboration given here, because they are slight, and do not speak clearly. Nor could they, under the constrained circumstances.

The conduct of the experiment was difficult, and the Huggins pitch test is a valuable guide for work that might be performed under similar circumstances. It did mean that a lot of participants were excluded. In Experiment 2, for example, 41 were excluded, while only 30 made the cut, a sample diluted further by the between-subjects design, leading to n=10 in each condition. There is nothing intrinsically wrong with this. The exclusions are warranted, the trials are robustly conducted, and the variables measured are simple: median f0 is a simple measure. But these limitations must, to my mind, mean that this is a pilot or exploratory study. To represent it otherwise is to inflate it, and to suggest that it has powers it does not have to inform theory. Drawing conclusions related to the interactive alignment model, or to any neuroscientific interpretation, is simply unwarranted.

The analysis produces an aura of formality over slight data. Everything the data have to offer can be seen in the first figure for each experiment. The linear modelling is, to my cautious mind, not justified. Such modelling is justified if there is reason to believe that the data might reveal underlying regularities, structures or processes. Median F0 in sentences produced under these conditions with n=10 in each condition is not that kind of data. That is my view and the editor may freely diverge from it.

Finally, and I know I am being a bore about this, but I tried to suggest that work in the area of speech produced simultaneously be described using coherent terms. To that end, I have suggested joint speech as an all-encompassing term, with specializations for speech produced under specific experimental conditions. Choral speech has been the term used by Kalinowski and colleagues, and that can be taken as a useful marker that work belongs in that camp. I have tried to suggest synchronous speech for laboratory speech such as that of the present study. But the abstract begins by spoiling this attempt. I give up in the present instance.

The relevance of the present work to Kalinowski's agenda is tenuous.

I am unwilling to pronounce on the matter of suitability for publication. The review form insists that I make a choice, so I have to select "reject" but am happy to be overruled. The field is riddled with work that appears important, but masks simple observations conducted in conditions that constrain what can be said beyond the experimental context. Let the editors decide.

Reviewer #2: My comments have been addressed and I endorse publication of this paper.

Concerning the authors' request for clarification about the sense of "entrainment" I intended in a comment on the original form of this sentence: "The present findings suggest that [synchronous] speech recruits additional [sensorimotor] processes that drive imitation-like changes in speech productions in typically fluent speakers." I commented that "this may be so, but should also be discussed in the context of entrainment, which might be providing stability". The point I'd like the authors to consider hinges on the 'mechanism' by which convergence effects are induced. They provide the plausible suggestion that the additional feedback from externally timed speech can drive imitation (convergence) effects, though the lack of auditory cortex suppression facilitating this reported by the Jasmin et al. (2016) paper they cite was crucially found only in live partner interaction, which undercuts this argument. But the gradual onset of phonetic convergence, seen most clearly in this submission in the visual+audio condition (Fig 5) and reported elsewhere by Pardo, Babel, and many others, can also be interpreted as a strengthening entrainment process. This arises (following Beek et al. 1992) by way of external information derived from sensory cues. These serve as a forcing function on internal dynamics, leading to adjustments in existing patterns of behavior within their intrinsic range and resulting in relative coordination between two systems (as opposed to absolute coordination between physically coupled systems). Discussion of something along these lines as an alternative explanation would be useful in my opinion.

Beek PJ, Turvey MT & Schmidt RC (1992) Autonomous and nonautonomous dynamics of coordinated rhythmic movements. Ecological Psychology, 4, 65-95.

Reviewer #3: I am satisfied with the author responses to my review and the other reviews. I look forward to seeing this paper published. Though the authors need not make changes, there are interesting parallels with the musical synchronization literature that they may wish to explore (papers by Caroline Palmer and Andrew Chang).

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fred Cummins

Reviewer #2: No

Reviewer #3: Yes: Peter Pfordresher

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Oct 21;16(10):e0258747. doi: 10.1371/journal.pone.0258747.r004

Author response to Decision Letter 1


28 Sep 2021

Please see the attached 'Response to Reviewers' document for our point-by-point responses to the Editor's final comments.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Vera Kempe

5 Oct 2021

Convergence in voice fundamental frequency during synchronous speech

PONE-D-21-16330R2

Dear Dr. Bradshaw,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vera Kempe

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Vera Kempe

11 Oct 2021

PONE-D-21-16330R2

Convergence in voice fundamental frequency during synchronous speech

Dear Dr. Bradshaw:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof Vera Kempe

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Debrief data Experiment 1.

    Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 1.

    (CSV)

    S2 File. Debrief data Experiment 2.

    Self-report data from debrief questionnaire and data on browser/operating system of participants from Experiment 2.

    (CSV)

    S1 Table. Sentence stimuli.

    Sentences taken from the Harvard IEEE corpus of sentences used in Experiments 1 and 2.

    (PDF)

    Attachment

    Submitted filename: PLoSSynchSpeech.pdf

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Data files for all experiments reported on in this manuscript are openly available on the Open Science Framework (https://osf.io/rs7gk/ DOI: 10.17605/OSF.IO/RS7GK).

