Vocal communication is tied to interpersonal arousal coupling in caregiver-infant dyads

Sam Wass; Emily Phillips; Celia Smith; Elizabeth OOB Fatimehin; Louise Goupil

doi:10.7554/eLife.77399

2022 Dec 20;11:e77399.

Editors: Ruth de Diego-Balaguer, Christian Rutz

PMCID: PMC9833822 PMID: 36537657

Abstract

It has been argued that a necessary condition for the emergence of speech in humans is the ability to vocalise irrespective of underlying affective states, but when and how this happens during development remains unclear. To examine this, we used wearable microphones and autonomic sensors to collect multimodal naturalistic datasets from 12-month-olds and their caregivers. We observed that, across the day, clusters of vocalisations occur during elevated infant and caregiver arousal. This relationship is stronger in infants than caregivers: caregivers’ vocalisations show greater decoupling from their own states of arousal, and their vocal production is more influenced by the infant’s arousal than their own. Different types of vocalisation elicit different patterns of change across the dyad. Cries occur following reduced infant arousal stability and lead to increased child-caregiver arousal coupling, and decreased infant arousal. Speech-like vocalisations also occur at elevated arousal, but lead to longer-lasting increases in arousal, and elicit more parental verbal responses. Our results suggest that: 12-month-old infants’ vocalisations are strongly contingent on their arousal state (for both cries and speech-like vocalisations), whereas adults’ vocalisations are more flexibly tied to their own arousal; that cries and speech-like vocalisations alter the intra-dyadic dynamics of arousal in different ways, which may be an important factor driving speech development; and that this selection mechanism which drives vocal development is anchored in our stress physiology.

Research organism: Human

Introduction

Infants explore their vocal possibilities from birth, producing vocalisations that are on a continuum from cries (rough sounds with a high amplitude and fundamental frequency) to speech-like vocalisations, or protophones (sounds whose morphological and spectro-temporal features resemble speech sounds) (Kent et al., 1987; Nathani et al., 2006; Oller et al., 2019). This pre-linguistic phase of vocal exploration is thought to be crucial for the emergence of speech: it could serve as a base for a selection mechanism whereby caregivers’ differential responses to their infants’ vocal outputs (e.g. different responses to cries versus protophones) progressively lead to the prioritisation of speech signals for communication, both at developmental and evolutionary scales (Ghazanfar and Zhang, 2016; Locke, 2006; Oller and Griebel, 2020). Temporal contingencies (Yoo et al., 2018) could be especially important for infants, allowing them to realise through repeated interactions that some sounds are privileged communicative signals that are particularly efficient to engage their social partners in conversation.

But what determines when infant vocalisations occur initially, and what their acoustic characteristics are? Are infants’ early vocal explorations constrained, and if so, how? One possibility is that vocal explorations follow a stochastic regime early on, and that infants’ explorations of their vocal tract possibilities produce a wide and unconstrained repertoire of outputs that is then narrowed down through the parental selection mechanism described above. Consistent with this idea, Oller and colleagues have proposed that a fundamental ability that supports the emergence of speech is functional flexibility (Oller and Griebel, 2020; Oller et al., 2013). An individual has functional flexibility when at least some of their vocalisations can occur alongside variable underlying affective states, and are not tied to specific communicative functions (e.g. expressing distress). This ability is necessary for the establishment of a language system: it is because we can produce specific sounds to convey different meanings that arbitrary, symbolic systems can emerge (Oller et al., 2013). In short, functional flexibility is a necessary condition for arbitrariness, a key feature of words that supports the emergence of conventional symbolic systems. By contrast, non-human primate vocalisations remain largely inflexible with respect to arousal even in adulthood (Borjon et al., 2016) (although see Taylor et al., 2022).

Infants would be said to have functional flexibility if specific vocalisations that they produce (e.g. protophones) were not tied to specific communicative functions, instead occurring alongside variable affective states. Consistent with this idea, by 3 months, infants can produce speech-like vocalisations in conjunction with both positive and negative facial displays, which suggests that their vocal explorations are functionally flexible in terms of valence (Oller et al., 2013). It remains possible, however, that their vocalisations remain tied to other affective dimensions, in particular autonomic arousal, the fast-acting neural substrate of the body’s stress response mediated by the Autonomic Nervous System (ANS) (Cacioppo et al., 2001; Wass, 2020; Pfaff, 2018; Porges, 2007). Arousal and valence vary in an orthogonal fashion (Kreibig, 2010), so infants’ vocalisations could be tied to arousal while remaining relatively flexible with respect to valence (e.g., vocalisations produced with both positive and negative affect could be monotonically linked to arousal).

Consistent with this hypothesis, one factor that does appear to influence vocalisation likelihood early on in development is the presence of an interactive social partner (Baumwell et al., 1997; Gros-Louis et al., 2006; Goldstein et al., 2003): although infants also vocalise when they are alone (Oller et al., 2019; Long et al., 2020), from the first few days of life, most infants’ vocalisations cluster with parental speech in time when infants are awake and actively engaged with a partner (Caskey et al., 2011; Dominguez et al., 2016). This might suggest that infants mostly vocalise when they are aroused, in the context of social interactions in particular, and thus that their vocalisations might remain relatively inflexible, at least with respect to states of arousal, early on in development. This assumption has not been formally tested in humans, but research with marmoset monkeys has shown that vocalisation likelihood and the acoustic properties of vocalisations are both driven by rhythmic fluctuations in the autonomic nervous system (ANS) across multiple temporal scales, in infants and in adults (Ghazanfar and Zhang, 2016; Zhang and Ghazanfar, 2020; McFarland et al., 2020). These influences range from the temporally fine-grained (spectral features of vocalisations, in the kHz range) through to context-dependent vocalisation likelihood on the scale of minutes to hours (Zhang and Ghazanfar, 2020).

Relatively little research has investigated whether vocal behaviours in human infants are influenced in a similar way by fluctuations in autonomic arousal. Although a number of authors have discussed the relationship between physiological arousal and vocal behaviour, particularly in the context of infant cries (Wolff, 1967; Zeskind et al., 1985; Wilder, 1974), no research to our knowledge (other than work from McFarland and colleagues, discussed below) has directly measured it. Studying this is important for two reasons.

First, as mentioned above, research on functional flexibility has so far focused only on valence and on correspondences between visible facial affect and vocalisations, which limits our understanding of how and when functional flexibility emerges across development (Oller and Griebel, 2020). Thus, here, we investigate whether each infant’s and adult’s vocalisation likelihood is overall more contingent on arousal (either their own arousal level or their partner’s). We also subdivide infant vocalisations into two types: cries (which our supplementary analyses show tend to be of negative affective valence) and speech-like vocalisations (which our analyses show tend to be of neutral affective valence). The functional flexibility hypothesis predicts that speech-like vocalisations should be relatively independent from arousal, while cries (alarm signals) should largely co-vary with arousal (Altenmüller et al., 2013). Another important aspect is that functional inflexibility of vocalisations with regard to arousal in early infancy could be indicative of a specific learning mechanism that underlies the development of speech over time. That is, it might be that, instead of a stochastic regime, what determines the acoustic features of infants’ vocalisations early on is precisely fluctuations in arousal, which affect the tension of the vocal folds and thus the roughness and loudness of vocalisations (Fitch et al., 2002). The parental selection mechanism would then operate on a repertoire of vocalisations that is not stochastic, but grounded in physiology (Ghazanfar and Zhang, 2016).

Second, studying how vocalisations relate to arousal changes and autonomic arousal coupling across the dyad would deepen our understanding of how caregivers identify and respond to various types of vocalisations, leading to selective reinforcement (Locke, 2006; Oller and Griebel, 2020; Zhang and Ghazanfar, 2016; Goldstein and Schwade, 2008; Albert et al., 2018). Many authors have described how mimicry and vocal turn-taking behaviours play roles in socio-communicative development (Condon and Sander, 1974; Schneirla, 1946; Lester et al., 1985; Wilson and Wilson, 2005; Fogel, 2017), and how empathy and physiological synchrony play roles in the development of self-regulation and caregiver-child affiliative bonding (Feldman, 2007; Feldman, 2006; Fogel, 1993; Ham and Tronick, 2009); but the relationship between these two areas remains relatively under-explored. McFarland and colleagues have shown that contingent vocalisations (mother to infant and infant to mother) are more common during periods of respiratory-marked synchrony (McFarland et al., 2020; McFarland, 2001). And our own previous research has shown transient increases in caregiver-child physiological synchrony following negative affect vocalisations. Traditionally, the co-regulation of arousal (i.e. management of arousal across the caregiver-child dyad) has been considered important for the early development of self-regulation (Fogel, 2017; Feldman, 2007; Kopp, 1982; Beebe et al., 2016), but it has rarely been linked to the development of communicative skills. The limited previous research in this area suggests that negative affect vocalisations are more common at high arousal states, and are more likely to elicit contingent caregiver responding (Wass et al., 2019; Tronick, 2007). But, if this is the case, and cries are more likely to elicit parental responses, then how might speech-like vocalisations (i.e. non-cries) become progressively prioritised? To answer this question, it seems crucial to examine whether non-cry vocalisations also elicit changes in arousal, arousal stability and arousal coupling across the caregiver-child dyad, and whether this is related to changes in caregivers’ responses to these vocalisations; but to our knowledge no previous research has examined this.

To investigate these questions, we designed new miniaturised wearable autonomic monitors (electrocardiograms and actigraphs) and miniaturised microphones and video cameras that could be worn by infant and caregivers to obtain day-long recordings in home settings. For technical reasons our microphones recorded a 5-s sample every minute (i.e. 8% of each minute) and therefore our analyses examined only large-scale arousal changes during the 20 min before and after each vocalisation (see Methods for further discussion).

We had two main research questions. First, are caregiver and infant vocalisations as inflexible with regard to arousal as those documented in non-human primates? That is, do different types of vocalisation, such as cries and speech-like sounds, show different patterns of association with arousal across the infant-caregiver dyad? Our hypothesis was that even speech-like vocalisations remain relatively tied to fluctuations in arousal during infancy, in contrast with adulthood. Second, do spontaneously occurring vocalisations during the day co-occur with specific patterns of arousal synchrony and co-regulation? Here, our hypothesis, in line with the parental selection mechanism, was that caregivers would track infants’ arousal fluctuations, and that as a consequence their vocalisations and arousal would be largely tied to their infants’ arousal rather than their own.

Results

Our results section is in two parts. In part 1, we analyse individual and cross-dyadic arousal changes relative to all vocalisations obtained in our data. In part 2, we subdivide vocalisations into cries and speech-like vocalisations; in Appendix 1, we additionally subdivide vocalisations by vocal intensity and affective valence, as manually rated by trained coders (Appendix 1 sections 2.5–2.6).

Part 1 – All vocalisations

Our first research question was: are caregiver and infant vocalisations as inflexible with regard to arousal as those documented in non-human primates? To examine this, we conducted three analyses. First, as a preliminary analysis, we examine how vocalisations are clustered together in time. Second, we examine caregivers’ and infants’ arousal levels around vocalisations, using three approaches: (1) average arousal levels around vocalisations; (2) vocalisation likelihood around arousal peaks; and (3) Receiver Operator Characteristic (ROC) curves. Third, we examine arousal around vocalisations subdivided by the partner’s arousal at the time of the vocalisation.

Temporal clustering

To examine whether infants and caregivers produce clusters of vocalisations simultaneously, we performed the following analysis. For each vocalisation, we estimated the likelihood of another vocalisation occurring both before and after that vocalisation. See Figure 1a, which shows as an example the likelihood of a subsequent infant vocalisation occurring 1–3 min after an initial infant vocalisation. To compare the observed probabilities with chance, we performed a control analysis in which we inserted random ‘non-vocalisation’ events into the data and repeated the analysis relative to these ‘non-vocalisations’, and compared the ‘real’ and ‘control’ datasets using a Mann-Whitney U test. We then repeated this analysis across multiple time windows from 20 min before the vocalisation to 20 min after. We also repeated it across multiple contrasts, looking both within an individual (e.g. infant vocalisations relative to infant vocalisations) and across a dyad (e.g. infant vocalisations relative to adult vocalisations). Multiple comparisons were corrected for using permutation-based clustering analysis (described in Appendix 1 section 1.9).
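The windowed-likelihood comparison described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' implementation: the per-minute binary series, the function names `window_rate` and `random_control_times`, and the choice to draw control events from minutes without a vocalisation are all assumptions made for the example. The across-participant Mann-Whitney U comparison and the permutation-based cluster correction are not shown.

```python
import random

def window_rate(voc, ref_times, lo, hi):
    """Fraction of reference minutes followed by at least one vocalisation
    in the window [t + lo, t + hi] (inclusive, in minutes)."""
    hits, usable = 0, 0
    for t in ref_times:
        if t + hi >= len(voc):      # skip windows that run past the recording
            continue
        usable += 1
        if any(voc[t + lo : t + hi + 1]):
            hits += 1
    return hits / usable if usable else float("nan")

def random_control_times(voc, n, rng):
    """Assumption: control events are drawn from minutes without a vocalisation."""
    silent = [t for t, v in enumerate(voc) if not v]
    return rng.sample(silent, min(n, len(silent)))

# Toy per-minute series (1 = vocalisation detected in that minute's sample).
voc = [0] * 60
for t in (10, 11, 12, 30, 32, 33):
    voc[t] = 1

real_times = [t for t, v in enumerate(voc) if v]
ctrl_times = random_control_times(voc, len(real_times), random.Random(0))

real_rate = window_rate(voc, real_times, lo=1, hi=3)  # 1-3 min after a vocalisation
ctrl_rate = window_rate(voc, ctrl_times, lo=1, hi=3)  # same window after control events
```

In a full analysis, one such pair of rates would be computed per participant and per time window, and the real and control distributions would then be compared across participants.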

Figure 1. — (a) Sample violin plot showing the analysis for one time interval that was then repeated iteratively across multiple time intervals in b. The plot shows the likelihood of a subsequent infant vocalisation in the time window 1–3 min following an infant vocalisation, comparing real with control data. (b) Same analysis repeated across multiple time windows, and across different categories. Coloured rectangles indicate time bins in which real > control after correction for multiple comparisons using a permutation-based temporal clustering procedure. Y-axis shows the Hodges-Lehmann effect size of the Mann-Whitney test comparing observed and control data.

After correcting for multiple comparisons, infant vocalisations were more likely to occur relative to a caregiver vocalisation across all time windows examined from 12 min prior to a vocalisation to 8 min after (all ps <0.05). This indicates that, when a caregiver vocalisation had occurred, there was an elevated likelihood of an infant vocalisation occurring for all time windows from 12 min prior to that caregiver vocalisation to 8 min after it. Similarly, caregiver vocalisations were significantly more likely to occur relative to an infant vocalisation from 20 min before to 20 min after; infant vocalisations were significantly more likely to occur relative to another infant vocalisation from 16 min prior to a vocalisation to 20 min after (all ps <0.05); and caregiver vocalisations were significantly more likely to occur relative to another caregiver vocalisation from 16 min prior to a vocalisation to 16 min after. Overall, these findings indicate that vocalisations occurred in clusters both within an individual and across the dyad, which confirms previous reports that infant vocalisations tend to cluster with those of their social partners (Slone et al., 2023).

Arousal around vocalisations. We conducted three analyses to examine the relation between arousal levels and vocalisations. First, we examined how average arousal levels change before and after vocalisations. Second, we examined how the likelihood of vocalisation changes around peak moments in arousal. Third, we calculated ROC curves to examine whether arousal levels alone can predict vocalisation likelihood.

Analysis 1 - average arousal levels around vocalisations. Figure 2a shows histograms of arousal levels at the time of a vocalisation, including both within-individual (e.g. infant arousal to infant vocalisations) and across the dyad (e.g. infant arousal to adult vocalisations). Figure 2b shows the same information, but also examines change in arousal during the period from 20 min before to 20 min after each vocalisation. Figure 2a is identical to the Time 0 values in Figure 2b. Significance testing was performed by comparing the observed arousal levels around vocalisations with a chance value of 0 (which, for z-scored data, is that individual’s average arousal level across the entire day). Multiple comparisons were corrected for using a permutation-based cluster test (described in Appendix 1 section 1.9). For infant arousal to infant vocalisations, significant increases in arousal were observed from 16 min before to 16 min after the vocalisation (all ps <0.05, after correction). Note that these findings are not affected by autocorrelation in the arousal data as this was removed (see Appendix 1 section 1.6). For infant arousal to caregiver vocalisations, significant increases were observed from 16 min before to 18 min after each vocalisation; for caregiver arousal to infant vocalisations, significant increases were observed from 4 min before to 6 min after; for caregiver arousal to caregiver vocalisations, no significant difference from 0 was observed at Time 0, but a significant difference was observed for the period from 4 min to 2 min before each vocalisation. Overall, these results suggest that infant arousal levels are elevated around both infant and caregiver vocalisations, and that caregivers also show smaller but significant increases in arousal around infant vocalisations. By contrast, caregivers’ vocalisations show little association with their own arousal, congruent with the hypothesis of heightened vocal flexibility in human adults as compared to infants.
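To make the event-locked averaging concrete, the sketch below z-scores a toy per-minute arousal series and averages it at fixed offsets around vocalisation times, so that a value of 0 corresponds to that individual's mean arousal across the recording. The function names and data are hypothetical, and the removal of autocorrelation and the cluster-based significance testing mentioned above are not reproduced here.

```python
from statistics import mean, pstdev

def zscore(x):
    """Standardise a series so that 0 is the individual's average level."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]

def event_locked_mean(arousal_z, event_times, offsets):
    """Mean z-scored arousal at each offset (in minutes) relative to events."""
    out = {}
    for k in offsets:
        vals = [arousal_z[t + k] for t in event_times
                if 0 <= t + k < len(arousal_z)]
        out[k] = mean(vals) if vals else float("nan")
    return out

# Toy data: arousal rises around minute 4, where a vocalisation occurs.
arousal = [1, 1, 1, 5, 9, 5, 1, 1, 1, 1]
arousal_z = zscore(arousal)
profile = event_locked_mean(arousal_z, event_times=[4], offsets=[-3, 0, 3])
```

Here `profile[0]` is above zero while `profile[-3]` and `profile[3]` are below it, mirroring the kind of comparison against a chance value of 0 described in the text.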

Analysis 2 - vocalisation likelihood around arousal peaks. Conversely, we also examined the likelihood of vocalisations occurring around peaks in arousal. We identified the moments when the infants’ and the caregivers’ z-scored arousal levels fell within the top 10% most elevated values observed for that participant that day (Figure 2c and e). Appendix 1 section 2.4 shows the same analysis repeated with different threshold levels (5% and 20%). We then examined the likelihood of vocalisations occurring during the time windows around arousal peaks, and compared this with control data generated in the same way as described in section 1.1. Significance was calculated by performing Mann-Whitney U tests and correcting for multiple comparisons using a permutation-based clustering analysis (see Appendix 1 section 1.9). Significant increases in vocalisation likelihood were observed only when we examined the likelihood of infant vocalisations around infant arousal peaks, and when we examined the likelihood of adult vocalisations around infant arousal peaks. Note that this latter finding was not significant when other threshold values were used instead (see Appendix 1 section 2.4). No significant increases in vocalisation likelihood were observed relative to adult arousal peaks. Overall, these results confirm that infant arousal peaks are associated with an increased vocalisation likelihood in both infants and (to a lesser extent) adults, but that peaks in adult arousal are not associated with increased vocalisation likelihood (a marker of greater vocal functional flexibility).
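The peak-identification step can be illustrated with a short sketch. It assumes a per-minute arousal series and a per-minute binary vocalisation series; the names `peak_minutes` and `rate_around`, the handling of ties at the cutoff, and the clipping of windows at the recording edges are choices made for this example rather than details taken from the study.

```python
def peak_minutes(arousal, top_fraction=0.10):
    """Minutes whose arousal falls in the top `top_fraction` of the day's values."""
    n_top = max(1, round(len(arousal) * top_fraction))
    cutoff = sorted(arousal)[-n_top]
    # Ties at the cutoff are all kept (an assumption of this sketch).
    return [t for t, a in enumerate(arousal) if a >= cutoff]

def rate_around(voc, times, lo, hi):
    """Fraction of reference minutes with a vocalisation in [t + lo, t + hi],
    clipping the window to the recording."""
    hits = 0
    for t in times:
        start, stop = max(0, t + lo), min(len(voc), t + hi + 1)
        hits += any(voc[start:stop])
    return hits / len(times) if times else float("nan")

# Toy data: arousal climbs steadily, so the last two minutes are the top 10%.
arousal = [float(v) for v in range(20)]
voc = [0] * 20
voc[17] = 1
voc[19] = 1

peaks = peak_minutes(arousal)                 # top 10% threshold
rate = rate_around(voc, peaks, lo=-2, hi=2)   # vocalisation within 2 min of a peak
```

Changing `top_fraction` to 0.05 or 0.20 would correspond to the alternative thresholds reported in the Appendix.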

Analysis 3 - ROC curves

As a further test of whether arousal levels predict vocalisation likelihood differently in infants and adults, we employed a signal detection framework based on the ROC (see Figure 2d). Each dataset was systematically thresholded at all possible values from its minimum to maximum value. At each threshold, each above-threshold epoch was classified either as a True Positive (above-threshold arousal, vocalisation present) or a False Positive (above-threshold arousal, vocalisation absent). If the systematic thresholding produced as many false alarms as hits, then the feature dimension could not be said to aid in predicting vocalisation likelihood. Following calculation of the ROC curves, the Area Under the Curve (AUC) was calculated: a higher AUC indicates that the feature dimension is more predictive. AUC values were calculated per participant and compared with a chance value of 0.5 using the non-parametric Mann-Whitney U test. Results indicated that infant arousal was significantly predictive of infant vocalisation likelihood (p<0.001), but that other relationships were not. This is again consistent with the idea that infants’ vocalisations are inflexibly related to their arousal.
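The thresholding procedure can be sketched as a threshold sweep followed by trapezoidal integration. This is a generic ROC/AUC computation on toy data, given under the assumption that each epoch has one arousal value and one binary vocalisation label; the function name `roc_auc` and the use of `>=` at the threshold are illustrative choices, not details confirmed by the text.

```python
def roc_auc(arousal, voc):
    """Area under the ROC curve for predicting vocalisation from arousal."""
    pos = [a for a, v in zip(arousal, voc) if v]       # epochs with a vocalisation
    neg = [a for a, v in zip(arousal, voc) if not v]   # epochs without one
    points = [(0.0, 0.0)]
    # Sweep the threshold from the maximum observed value down to the minimum.
    for thr in sorted(set(arousal), reverse=True):
        tpr = sum(a >= thr for a in pos) / len(pos)    # hit rate
        fpr = sum(a >= thr for a in neg) / len(neg)    # false-alarm rate
        points.append((fpr, tpr))
    # Trapezoidal integration over the (FPR, TPR) curve.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

# Arousal separates the two classes perfectly.
auc_perfect = roc_auc([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1])
# Arousal carries no information about vocalisation.
auc_chance = roc_auc([1, 2, 3, 4], [1, 0, 0, 1])
```

An AUC near 0.5 corresponds to the case described above in which thresholding yields as many false alarms as hits.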

Arousal around vocalisations subdivided by partner arousal at the time of vocalisation

Our final method for examining how contingent infant and caregivers’ vocalisations are on arousal levels across the dyad was to subdivide all vocalisations by the partner’s arousal at the time of the vocalisation. Figure 3a shows caregiver arousal relative to infant vocalisations (i.e. the same as the purple line from Figure 2b), but subdivided using a quartile split by infant arousal at the time of the vocalisation. Figure 3b shows infant arousal relative to caregiver vocalisation.

Figure 3. — (a) Caregiver arousal subdivided by infant arousal at the time of the vocalisation. (b) Infant arousal subdivided by caregiver arousal at the time of the vocalisation. For all plots, shaded areas indicate standard error based on an N of 82, and red highlights indicate areas of significant difference after correction for multiple comparisons using a permutation-based temporal clustering procedure.

To estimate whether caregivers showed larger arousal changes to high arousal infant vocalisations, we performed a one-way ANOVA repeatedly for each time bin and used a permutation-based temporal clustering analysis to correct for multiple comparisons (see Appendix 1 section 1.9). Significant effects were found (p<0.01) such that increased caregiver arousal was observed during the time periods 2–6 min and 10–14 min after high arousal infant vocalisations (Figure 3a). For infant arousal, the opposite finding was observed: high arousal caregiver vocalisations were accompanied by increased infant arousal during the period 10–6 min before the caregiver vocalisation (Figure 3b). Overall, these results suggest that high arousal infant vocalisations are followed by subsequent increases in caregiver arousal, and that high arousal caregiver vocalisations are preceded by increases in infant arousal.
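The quartile split and per-bin ANOVA can be sketched as below. The example computes only the F statistic for each time bin on toy data; obtaining p-values and applying the permutation-based cluster correction are left out. All names (`quartile_labels`, `f_statistic`, `f_by_time_bin`) and the rank-based handling of ties are assumptions of this illustration. In practice a library routine such as `scipy.stats.f_oneway` could replace the hand-written F computation.

```python
from statistics import mean

def quartile_labels(values):
    """Label each value 0-3 by rank quartile (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(3, rank * 4 // len(values))
    return labels

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def f_by_time_bin(arousal_by_bin, labels):
    """arousal_by_bin maps a time bin to one partner-arousal value per vocalisation."""
    out = {}
    for k, vals in arousal_by_bin.items():
        groups = [[v for v, q in zip(vals, labels) if q == g] for g in range(4)]
        out[k] = f_statistic(groups)
    return out

# Toy data: infant arousal at the time of eight vocalisations ...
infant_at_voc = [5, 1, 7, 3, 2, 8, 4, 6]
labels = quartile_labels(infant_at_voc)
# ... and caregiver arousal 2 min after each of the same vocalisations.
caregiver = {2: [0.5, 0.0, 1.0, 0.1, 0.2, 1.2, 0.3, 0.7]}
f_values = f_by_time_bin(caregiver, labels)
```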

Control analyses

Overall, results thus far suggest that infants’ vocalisations are contingent on their arousal state, whereas adults’ vocalisations are independent of arousal. However, we also considered two possible alternative explanations for this finding. The first is that vocalisations may be more likely to occur while the participants are in physical positions associated with increased arousal. To examine this possibility, we conducted an additional analysis in which we performed video coding to examine infants’ physical position while vocalising (Appendix 1 section 2.2). In brief, this analysis suggested that 49% of infant vocalisations occurred while the infant was freely moving; 33% occurred while they were free but stationary; 7% while strapped sitting; 11% while carried. For adult vocalisations, 44% occurred while the infant was freely moving; 33% while free stationary; 10% while strapped sitting; 12% while carried. Overall, when we examined how arousal levels differed by physical position, we found no evidence that arousal increases around vocalisations are attributable to changes in physical position.

The second possibility is that arousal increases around vocalisations may be attributable to the physical act of vocalising itself. This may seem unlikely given that we also observed increases in infant arousal relative to caregiver vocalisations (Figure 2b). Yet, because we also observed that caregiver and infant vocalisations occur in clusters (Figure 1b), it remained possible that vocalising itself increased infant arousal in these periods. To address this, we conducted a more fine-grained analysis on a different dataset in which we continuously recorded vocalisations and arousal in 11-month-old infants and their caregivers during two 5-min tabletop interactions (see Appendix 1 section 2.3). The timings and durations of vocalisations were coded to an accuracy of 20 Hz (i.e. 50 ms), and our findings examine heart rate changes on a much finer time-scale (1 sample per second compared with 1 sample per minute for the main analyses). Overall, our results suggested that, in a seated tabletop interaction, caregivers showed no change in arousal relative to vocalisations either from themselves or from their partner (the infant). Infants showed non-significant increases in arousal relative to their own vocalisations, which started to increase 5 s before a vocalisation and returned to baseline 20 s after. No changes in infant arousal were observed relative to caregiver vocalisations. The fact that arousal levels start to increase before a vocalisation suggests, consistent with animal research (Borjon et al., 2016), that it is unlikely that arousal changes around vocalisations are purely attributable to the physical act of vocalising itself. The fact that no changes were observed in caregiver arousal around caregiver vocalisations is also consistent with this conclusion.

Arousal stability and arousal coupling around vocalisations

Our second research question was: do spontaneously occurring vocalisations during the day co-occur with specific patterns of arousal, arousal synchrony and arousal co-regulation? To address this we performed the calculation described in the Methods and illustrated in Figure 6.

Arousal stability

Arousal stability was indexed by calculating the auto-correlation in infant and caregiver arousal. No significant changes in infant and caregiver arousal stability were observed relative to adult vocalisations (Figure 4a and e). By contrast, infant vocalisations were associated with decreased arousal stability in infants (Figure 4b), and increased arousal stability in adults (Figure 4f), in the time windows prior to the event. These findings differ markedly, however, when we subdivide infant vocalisations into cries and speech-like vocalisations, as shown in part 2.
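As a rough illustration of an autocorrelation-based stability index, the sketch below computes the lag-1 autocorrelation of a short arousal window as the Pearson correlation between the series and a copy of itself shifted by one sample. The lag, the window length and the function names are assumptions of this example; the exact calculation used in the study is the one described in the Methods.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else float("nan")

def lag_autocorr(x, lag=1):
    """Correlation between x[t] and x[t + lag]; higher values mean a more stable signal."""
    return pearson(x[:-lag], x[lag:])

# A smoothly changing window is highly stable ...
stable = lag_autocorr([1, 2, 3, 4, 5, 6])
# ... whereas a rapidly fluctuating window is not.
unstable = lag_autocorr([1, -1, 1, -1, 1, -1])
```

Computing this index in windows before and after each vocalisation, for real and control events, would give curves of the kind summarised in Figure 4a-h.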

Figure 4. — (a) Infant arousal stability relative to caregiver vocalisations; (b) infant arousal stability relative to infant vocalisations; (c) infant arousal stability relative to infant cries; (d) infant arousal stability relative to infant speech-like vocalisations; (e) caregiver arousal stability relative to caregiver vocalisations; (f) caregiver arousal stability relative to infant vocalisations; (g) caregiver arousal stability relative to infant cries; (h) caregiver arousal stability relative to infant speech-like vocalisations; (i) infant-caregiver arousal coupling relative to caregiver vocalisations; (j) infant-caregiver arousal coupling relative to infant vocalisations; (k) infant-caregiver arousal coupling relative to infant cries; (l) infant-caregiver arousal coupling relative to infant speech-like vocalisations. Black shows the real data; grey shows the control data. Error bars show the standard errors based on an N of 82 for a-h and 74 for i-l. Sections highlighted in red indicate areas of significant difference between real and control data after correction for multiple comparisons using a permutation-based temporal clustering procedure.

Arousal coupling

To measure arousal coupling we calculated the cross-correlation in infant-caregiver arousal, as described in the Methods and illustrated in Figure 8. Results suggested that significantly increased infant-caregiver arousal coupling was observed in the time windows following an adult vocalisation. For infant vocalisations, the same directional effect was observed but results were not significant. These findings again differ markedly when we subdivide infant vocalisations into cries and speech-like vocalisations, as shown in part 2.
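A cross-correlation measure of coupling can be sketched in the same spirit: the infant series is correlated with the caregiver series at a chosen lag. The toy data, the lag convention (positive lag means the caregiver follows the infant) and the function names are hypothetical, and the windowing used in the study is not reproduced.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else float("nan")

def cross_corr(infant, caregiver, lag=0):
    """Correlate infant[t] with caregiver[t + lag]."""
    if lag >= 0:
        a, b = infant[:len(infant) - lag], caregiver[lag:]
    else:
        a, b = infant[-lag:], caregiver[:len(caregiver) + lag]
    return pearson(a, b)

# Toy data: the caregiver's arousal mirrors the infant's one minute later.
infant = [0, 1, 0, 2, 0, 3, 0, 1]
caregiver = [0, 0, 1, 0, 2, 0, 3, 0]

coupling_lag0 = cross_corr(infant, caregiver, lag=0)
coupling_lag1 = cross_corr(infant, caregiver, lag=1)
```

With these series the coupling is perfect at a lag of one minute and weaker at zero lag, which shows why the choice of lag matters when interpreting a coupling estimate.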

Part 2 – Infant vocalisations subdivided by vocalisation type

The findings described in part 1 indicate that 12-month-old infants’ vocalisations are contingent on their arousal state, whereas adults’ vocalisations are independent of arousal. However, there may be important differences between cries and speech-like vocalisations, or protophones, which have been argued to be used flexibly already during infancy (Oller et al., 2013). To further test our first research question, therefore, we examined whether different types of vocalisation, such as cries and speech-like sounds, show different patterns of association with arousal. To do so, we recorded arousal changes relative to vocalisations subdivided by infant vocalisation type, differentiating between cries and speech-like vocalisations (see Methods).