Published in final edited form as: Proc ACM Int Conf Intell Virtual Agents. 2024 Dec 26;2024:11. doi: 10.1145/3652988.3673917

GeSTICS: A Multimodal Corpus for Studying Gesture Synthesis in Two-party Interactions with Contextualized Speech

Gaoussou Youssouf Kebe 1, Mehmet Deniz Birlikci 2, Auriane Boudin 3, Ryo Ishii 4, Jeffrey M Girard 5, Louis-Philippe Morency 6

Abstract

Generating natural co-speech gestures and facial expressions for effective human-agent interactions requires modeling the intricate interplay between verbal, non-verbal, and contextual cues observed in dyadic human communication. Two types of contextual cues are of particular interest: (1) individual factors of the interlocutors, such as their demographic attributes, and (2) situational factors, like the outcome of a preceding event. To facilitate their study, we introduce the GeSTICS Dataset, a novel multimodal corpus comprising 9,853 questions and 10,460 answers from audiovisual recordings of post-game sports interviews by 147 interviewees. The dataset contains speech data, including textual transcriptions, lexical descriptors, and acoustic features, as well as visual data encompassing the interviewee’s body pose and facial expressions, with an emphasis on capturing these modalities during both the question-listening and answering phases of the interview. Furthermore, GeSTICS incorporates metadata about individual factors, such as the age and cultural background of the interviewees, and situational factors, like the results of the games, which are often overlooked in existing multimodal datasets. Our preliminary analysis of GeSTICS reveals that the effects of speech features, such as loudness and lexical choice, on the production of co-speech gestures in both speaking and listening phases are moderated by situational factors and the interviewee’s individual factors. GeSTICS is designed to enhance the generation of realistic nonverbal behaviors in virtual agents, animated characters, and human-robot interaction systems, thus contributing to more engaging and effective human-agent communication. The analysis code and the dataset are available at https://gkebe.github.io/gestics/.

Keywords: Multimodal communication, co-speech gestures, nonverbal behavior, dyadic interaction, audiovisual dataset, sports interviews, social context, virtual agents

1. INTRODUCTION

Recent advancements in deep learning have significantly improved the generation of natural and semantically relevant co-speech gestures. Most approaches focus on direct speech-gesture mapping [33, 53]. While these approaches have shown promising results, they do not fully capture the complexity of human gesticulation.

In this paper, we use the term contextual factors to refer to the various elements beyond speech content that influence an individual’s gestures, including their inherent traits, background, and the specific circumstances they encounter. We further categorize these contextual factors into individual factors and situational factors. Individual factors, such as personality traits [28], disorders [32], and gesturing style [2, 35, 52, 54], are more stable over time and have been extensively studied, serving as the focus of numerous computational approaches and datasets.

In contrast, situational factors, including social context, mood, and previous events, are more transient and subject to change between different interactions. Despite their well-documented impact on speech and gestures in human communication [4, 14, 25], situational factors have received less attention in the literature, particularly in terms of their specific role in shaping co-speech gestures. This gap in research highlights the need for further investigation, as situational factors play a crucial role in natural human interaction. A major cause of this gap is the inherent challenge of collecting a well-controlled dataset that effectively accounts for these variables during spontaneous interactions. By addressing this gap and developing a more comprehensive understanding of the complex interplay between speech, gestures, and the contextual factors that shape them, we can enhance the naturalness and contextual appropriateness of generated gestures, leading to more engaging and effective human-agent interactions.

To contribute to this research trajectory, we present the GeSTICS Dataset, a novel multimodal corpus that captures the complex interactions between speech, gestures, and contextual factors, encompassing both individual factors and situational factors, in dyadic human communication. GeSTICS consists of 9,852 questions and 10,460 answers extracted from audiovisual recordings of post-game basketball interviews featuring 147 individuals. The dataset includes a wide range of speech data (textual transcriptions, lexical descriptors, acoustic features) and visual data (interviewee’s body pose, facial expressions), captured during both the question listening and answering phases.

We hypothesize that post-game sports interviews provide a suitable setting for studying the influence of situational factors on co-speech gestures. The interviews occur immediately after games and involve a range of contextual factors that may shape the interviewees’ verbal and non-verbal expressions, such as the game outcome, game importance, public opinion of the team, the interviewee’s experience level, and their role on the team (player or coach). Additionally, the interview format allows for an analysis of gesture production in both speaking and listening phases.

The GeSTICS Dataset aims to advance the development of more naturalistic and context-aware co-speech gesture generation models for virtual agents, animated characters, and human-robot interaction systems. By providing a rich, multimodal corpus that captures the complex interplay between verbal, non-verbal, and contextual factors, with an emphasis on both individual factors and situational factors, GeSTICS could support the training of personalized models capable of generating affectively appropriate gestures based on situational factors, enhancing the realism and contextual sensitivity of virtual agents.

Finally, we employ mixed-effects regression models to examine two key relationships: the influence of the interviewer’s speech features on the interviewee’s gestures during the listening phase, and the connection between the interviewee’s own speech features and gestures in the answering phase. Additionally, we investigate the impact of situational factors, such as game score difference, and individual factors, including age and team role, on these relationships.

2. RELATED WORK

2.1. Datasets and Approaches for Co-speech Gesture Generation and Listener Feedback

Previous research on co-speech gesture generation has focused mainly on the direct relationship between speech and gesticulation. This work aims to enhance the naturalness of virtual agents and animated characters [24, 33, 45]. Datasets like the TED Gesture Dataset [13, 53] have been instrumental in supporting these efforts. Similarly, researchers have explored modeling listener feedback during conversations [5, 8, 29, 38–41, 50], with datasets like ALICO [11] and ViCo [55] focusing on this aspect of dyadic interactions. Moreover, some studies have pursued the joint modeling of gestures and behaviors considering both the speaker’s and listener’s perspectives within such interactions [22, 26], particularly in contexts involving human interactions with virtual agents.

2.2. Individual Factors in Co-speech Gestures and Listener Feedback

As research in nonverbal behavior generation progressed, the importance of individual factors such as styles, cultural differences, and personality traits became increasingly apparent [13, 19–21, 34, 52, 54]. Studies have shown that incorporating personality information can lead to more diverse and realistic co-speech gestures [28] and that cultural differences influence facial expressions during co-speech gestures [37]. Datasets like BEAT [35] have enabled the generation of better stylized gestures by including emotional and semantic annotations. Similarly, individual characteristics have been found to influence listener feedback, with personality traits and conflict handling styles affecting the use of non-verbal cues in phone conversations [49] and person-specific factors influencing feedback expressiveness during backchanneling opportunity points [6].

2.3. Other Contextual Factors

While individual factors have received considerable attention in recent works, the impact of other contextual factors on co-speech gestures and listener feedback remains relatively underexplored. Some researchers have investigated the role of contextual factors related to affect, with the JESTKOD database [9] including emotion annotations, agreement/disagreement labels, and other affective factors, and Yang and Narayanan [51] showing that emotion modulates the relationship between speech and gesture. Interlocutor context has also been explored [11, 42, 48], with Buschmeier et al. [11] investigating the impact of listening distraction on speaker behavior.

In contrast to these transient contextual factors, our work highlights the importance of considering broader situational influences that may shape an individual’s communication patterns over extended periods. Situational factors like social settings, interpersonal relationships, and previous events can induce more prolonged emotional/mental states and behaviors. These overarching circumstances likely have longer-lasting effects on verbal and nonverbal expression compared to momentary contextual variables within a single interaction. Additionally, the situational factors in GeSTICS arise organically from spontaneous real-life interactions.

3. GESTICS DATASET

The dataset is curated from publicly available post-game interviews from the 2022–2023 NBA season, meticulously selected to encompass a diverse array of interviewees and situational factors. The videos were specifically chosen due to their consistent recording conditions, comprising stable camera angles, high-fidelity audio and minimal ambient noise. To preserve the anonymity of the participants, all identifiable information, including the names of players, coaches, and teams, was removed from the dataset. Consequently, the original video and audio recordings are not included. Instead, the dataset contains anonymized interview transcripts and interpretable audio-visual features, such as pose vectors, facilitating comprehensive analyses of verbal and nonverbal communication dynamics. The cumulative duration of the answer segments is approximately 104.53 hours (M=29.41 minutes per answer), while the total duration of question segments is approximately 32.29 hours (M=9.59 seconds per question).

3.1. Audio-Visual Data

The GeSTICS dataset employs several models to extract rich and interpretable multimodal features from the raw interview data.

Speech Features: Automatic Speech Recognition (ASR) is performed using Whisper [44] to generate transcripts of the interview audio. Speaker diarization is then applied using pyannote.audio [10] to label speaker turns in the interviews. The diarization segments and Whisper transcriptions are used to identify question and answer segments in a semi-automatic way, based on the fact that the longest speaking track should belong to the interviewee. Sentiment analysis is performed on the transcripts using the VADER (Valence Aware Dictionary and sEntiment Reasoner) model [27] to quantify the positive, negative, and neutral sentiment expressed. Additionally, acoustic features are extracted using the Opensmile toolkit [16], and lexical features are computed using LIWC (Linguistic Inquiry and Word Count) [43].

Visual Features: To track the interviewee’s face throughout the videos, face detection is first performed using RetinaFace [15]. The detected faces are then recognized and clustered using FaceNet [46]. Once the interviewee’s face is identified, the corresponding body region is cropped from the selected frames. Mediapipe [36] is then applied to the cropped frames to extract body poses, hand keypoints, facial landmarks, and facial action units (blendshapes). All pose frames are centered and normalized based on the distance from the left shoulder to the right hip to ensure smooth frame transitions and body pose size consistency. Table 1 summarizes the full set of features included in the GeSTICS dataset.

Table 1:

Features included in the GeSTICS dataset (number of features in parentheses)

Speech
• Whisper Transcriptions
• eGeMAPS Acoustic Features (88)
• LIWC (93) & VADER Lexical Features

Visual
• Body Pose Landmarks (33)
• Hand Keypoints (21 per hand)
• Face Landmarks (468) & Action Units (52)
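To make the semi-automatic question/answer segmentation described above concrete, the sketch below shows one plausible way to combine Whisper transcripts with pyannote.audio speaker turns and to treat the speaker with the most total talk time as the interviewee. This is an illustrative reconstruction rather than the released pipeline; the model identifiers, file paths, and the segment_interview helper are assumptions.

```python
# Hypothetical sketch of the semi-automatic question/answer segmentation:
# Whisper provides timed transcript segments, pyannote.audio provides speaker
# turns, and the speaker with the most total talk time is taken as the interviewee.
from collections import defaultdict

import whisper
from pyannote.audio import Pipeline

ASR_MODEL = whisper.load_model("medium")
DIARIZER = Pipeline.from_pretrained("pyannote/speaker-diarization")  # may require an access token

def segment_interview(audio_path: str):
    transcript = ASR_MODEL.transcribe(audio_path)      # dict with timed "segments"
    diarization = DIARIZER(audio_path)                 # speaker turns

    # Total speaking time per diarized speaker label.
    talk_time = defaultdict(float)
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        talk_time[speaker] += turn.end - turn.start
    interviewee = max(talk_time, key=talk_time.get)    # longest speaking track -> interviewee

    questions, answers = [], []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # Attach Whisper segments whose midpoint falls inside this speaker turn.
        text = " ".join(
            seg["text"].strip()
            for seg in transcript["segments"]
            if turn.start <= (seg["start"] + seg["end"]) / 2 <= turn.end
        )
        target = answers if speaker == interviewee else questions
        target.append({"start": turn.start, "end": turn.end, "text": text})
    return questions, answers
```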

3.2. Contextual Metadata

The GeSTICS dataset incorporates a rich array of contextual metadata, enabling researchers to study how various factors may influence the interaction dynamics in post-game interviews. These metadata can be broadly categorized into two types: situational factors and individual factors. Situational Factors: These factors provide a measure of the competitive context surrounding each post-game interview question and answer segment. The data was collected from the reputable sports analytics site FiveThirtyEight [17]. The key metrics, summarized in Table 2, include game score differentials, game quality indices that capture team strengths via the harmonic mean of Elo ratings [31], game importance ratings that quantify a result’s impact on postseason projections on a 0–100 scale, and team strength disparities calculated using FiveThirtyEight’s RAPTOR metric [18].

Table 2:

Summary statistics of situational factors

Attribute Win Loss

Game Score Difference 12.15 ± 9.36 −12.22 ± 8.6
Game Quality Index 28.74 ± 68.51 26.36 ± 70.86
Game Importance 39.5 ± 61.46 40.52 ± 58.21
Team Strength Difference 39.33 ± 126.21 −24.1 ± 119.29
Games 308 338

Total Games 505
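As a small illustration of the game quality index described above, the snippet below computes the harmonic mean of two teams’ Elo ratings; FiveThirtyEight applies its own rescaling to produce the published index, so only the harmonic-mean step is shown, and the ratings are made-up values.

```python
def elo_harmonic_mean(elo_a: float, elo_b: float) -> float:
    """Harmonic mean of two Elo ratings, the core quantity behind the game quality index."""
    return 2.0 * elo_a * elo_b / (elo_a + elo_b)

# Example with hypothetical ratings: a strong team (1650) against an average one (1500).
print(elo_harmonic_mean(1650, 1500))  # ~1571.4, before any rescaling
```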

Individual Factors: These factors encompass essential demographic attributes, including race, nationality, team role, and age. We employ a rigorous anonymization process to protect the identity of the interviewees while preserving the richness of the demographic and professional data. Each interviewee is assigned a unique identifier unlinked to their name or other directly identifying information. Table 3 reports the mean and standard deviation of actual ages, but the dataset itself provides only an age percentile, representing each interviewee’s relative age within the entire pool. Race, origin, and team role are recorded as categorical variables, providing insights into cultural, demographic, and professional factors without compromising anonymity.

Table 3:

Summary statistics of individual factors

Attribute Type of Data Statistics Summary
Race Categorical Black: 75.0%
White: 22.3%
Asian: 1.4%
Latino: 0.7%
Origin Categorical Local: 79.7%
International: 20.3%
Team Role Categorical Player: 82.9%
Coach: 17.1%
Age Continuous Mean: 31.89
Std Dev: 9.78
Total Interviewees 147
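The age anonymization described above can be sketched as follows, assuming a simple mapping from anonymized IDs to actual ages; the IDs and ages shown are made up for illustration only.

```python
from scipy.stats import percentileofscore

# Hypothetical anonymized IDs with actual ages (not real data).
ages = {"interviewee_001": 24, "interviewee_002": 36, "interviewee_003": 58}

# Replace each actual age with its percentile rank within the whole pool,
# which is the only age information released in the dataset.
pool = list(ages.values())
age_percentiles = {pid: percentileofscore(pool, age) for pid, age in ages.items()}
```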

4. EXPERIMENTAL APPROACH

We explore three main research questions:

  1. What relationships, if any, exist between the interviewer’s speech features and the interviewee’s gestures during questions, and between the interviewee’s own speech features and their gestures while answering?

  2. Do situational factors, specifically the score difference between teams, influence the relationships between speech features and gestures?

  3. Do individual factors, such as the interviewee’s role on the team (coach vs. player), affect the relationships between speech features and nonverbal behaviors?

To investigate these research questions, we first extract relevant multimodal features as described in Section 4.1, and then conduct the experimental analyses explained in Sections 4.2, 4.3, and 4.4.

4.1. Speech and Gesture Features

Gesture Features.

Gestural rate, a measure of the frequency and amplitude of movements, was calculated using the pose keypoints provided by MediaPipe, considering only keypoints with a confidence score above 0.5 [13]. The gestural rate was computed as the average Euclidean distance between the same keypoint across adjacent frames. Mean frowning, smiling, and downward gaze scores were obtained using the blendshapes extracted from MediaPipe. These scores represent the intensity of the respective facial expressions, ranging from 0 to 1.
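A minimal sketch of this gestural-rate computation is given below, assuming the MediaPipe keypoints are stored as a (frames, keypoints, 3) array of (x, y, confidence) values that have already been centered and scale-normalized as in Section 3.1; the array layout and function name are illustrative assumptions rather than the released code.

```python
import numpy as np

def gestural_rate(poses: np.ndarray, conf_threshold: float = 0.5) -> float:
    """Average Euclidean displacement of keypoints between adjacent frames."""
    xy, conf = poses[..., :2], poses[..., 2]
    # Per-keypoint displacement between adjacent frames.
    disp = np.linalg.norm(np.diff(xy, axis=0), axis=-1)            # (frames-1, keypoints)
    # Keep only keypoints confidently detected in both adjacent frames.
    valid = (conf[:-1] > conf_threshold) & (conf[1:] > conf_threshold)
    return float(disp[valid].mean()) if valid.any() else 0.0
```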

Speech Features.

Linguistic features, specifically the count of positive emotion words and negative emotion words, were extracted using the Linguistic Inquiry and Word Count (LIWC) tool [43]. LIWC compares the words in the transcribed speech to predefined dictionaries and provides the frequency of words associated with positive emotions and negative emotions. Prosodic features, namely voice pitch (fundamental frequency, F0) and loudness (intensity), were extracted using the openSMILE toolkit [16]. The mean values of these features were calculated over the duration of each speech segment to capture the overall prosodic characteristics.
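The prosodic side of this extraction can be sketched with the openSMILE Python bindings and the eGeMAPS functionals, as below; the column names follow the eGeMAPS naming scheme and may differ slightly across toolkit versions, and the wav path is a placeholder.

```python
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def mean_prosody(wav_path: str) -> dict:
    """Segment-level mean pitch (F0) and loudness for one speech segment."""
    feats = smile.process_file(wav_path)
    return {
        "mean_f0": float(feats["F0semitoneFrom27.5Hz_sma3nz_amean"].iloc[0]),
        "mean_loudness": float(feats["loudness_sma3_amean"].iloc[0]),
    }
```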

4.2. Associations of Speech Features and Contextual Factors with Co-occurring Gestures

We investigate how various aspects of interviewer and interviewee speech features, contextual factors (game score), and individual factors (team role) relate to specific nonverbal cues in question-listening and answer-speaking phases. To achieve this, we used mixed-effects regression models for each feature and outcome pair. Speech feature variables and contextual factor variables are standardized and mean centered within each interviewee, allowing us to separate effects within and between interviewees. Individual characteristic variables are included without standardization or mean centering. The models account for individual variation among interviewees by including, as fixed effects, the mean values of each interviewee’s speech features and contextual factors, as well as random intercepts. We control for the interviewee’s role in all models, except when role itself is the feature being investigated. The mathematical formulation is as follows:

$$
\begin{aligned}
y_{ij} &= \beta_{0i} + \beta_1\,(x_{ij} - \bar{x}_i) + e_{ij} \\
\beta_{0i} &= \gamma_{00} + \gamma_{01}\,\bar{x}_i + \gamma_{02}\,\mathrm{ROLE}_i + u_{0i} \\
e_{ij} &\sim \mathrm{Normal}(0, \sigma_e^2), \qquad u_{0i} \sim \mathrm{Normal}(0, \sigma_{u_0}^2)
\end{aligned}
\tag{1}
$$

where:

  • yij is the outcome variable for instance j of interviewee i,

  • β0i is the interviewee-specific intercept,

  • β1 is the fixed slope for the within-interviewee speech feature component,

  • xij is the speech feature variable for instance j of interviewee i, which is decomposed into within- and between-interviewee components,

  • x̄i is the interviewee-specific mean of the speech feature variable, capturing the between-interviewee component,

  • xij − x̄i is the within-interviewee component (i.e., the deviation from the interviewee’s own mean in a given observation),

  • eij is the residual error for instance j of interviewee i,

  • γ00 is the fixed intercept,

  • γ01 is the fixed slope for the between-interviewee speech feature component,

  • γ02 is the fixed slope for the interviewee role,

  • ROLEi is a categorical variable (i.e., dummy code) representing the team role of interviewee i,

  • u0i represents the random deviation of interviewee i’s intercept from the fixed/average intercept γ00.
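A minimal sketch of how such a model could be fit is shown below, using Python’s statsmodels; the data frame, column names, and helper function are hypothetical stand-ins, and the released analysis code may use a different specification.

```python
import statsmodels.formula.api as smf

def fit_direct_model(df, feature: str, outcome: str):
    """Mixed-effects model of Eq. (1): within/between decomposition plus random intercepts."""
    d = df.copy()
    # Between-interviewee component: each interviewee's mean of the (standardized) feature.
    d["x_between"] = d.groupby("interviewee")[feature].transform("mean")
    # Within-interviewee component: deviation from that mean (x_ij - x̄_i).
    d["x_within"] = d[feature] - d["x_between"]
    model = smf.mixedlm(
        f"{outcome} ~ x_within + x_between + C(role)",  # beta1, gamma01, gamma02
        data=d,
        groups=d["interviewee"],                        # random intercepts u_0i
    )
    return model.fit()
```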

4.3. Influence of Game Score Difference

We investigate how the game outcome, quantified by the score difference between the interviewee’s team and their opponent, may moderate the relationships between speech features and interviewee gestures during the post-game interview. To explore this, we extend the mixed-effects models that examine the associations between speech features and gestures by incorporating the score difference variable and its multiplicative interactions with the within-interviewee speech features. We also control for the interviewee’s role to account for its potential impact on the relationships. This approach enables us to determine whether the game outcome modulates the connections between speech and gestures while considering the possible effect of the interviewee’s role within the team. The presence of a significant β3 coefficient in Equation (2) would indicate that the game outcome moderates the relationship between speech features and gestures.

$$
\begin{aligned}
y_{ij} &= \beta_{0i} + \beta_1\,x^{c}_{ij} + \beta_2\,\mathrm{SCORE}_{ij} + \beta_3\,x^{c}_{ij}\,\mathrm{SCORE}_{ij} + e_{ij} \\
x^{c}_{ij} &= x_{ij} - \bar{x}_i \\
\beta_{0i} &= \gamma_{00} + \gamma_{01}\,\bar{x}_i + \gamma_{02}\,\mathrm{ROLE}_i + u_{0i} \\
e_{ij} &\sim \mathrm{Normal}(0, \sigma_e^2), \qquad u_{0i} \sim \mathrm{Normal}(0, \sigma_{u_0}^2)
\end{aligned}
\tag{2}
$$

where:

  • SCOREij represents the game score difference for instance j of interviewee i,

  • β2 and β3 are the fixed slopes for score difference and its interaction with the within-interviewee speech feature.
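Continuing the sketch from Section 4.2 (same hypothetical data frame and column names), the moderation model of Eq. (2) can be expressed by adding the score difference and its interaction with the within-interviewee feature; the interaction coefficient plays the role of β3.

```python
import statsmodels.formula.api as smf

def fit_score_moderation_model(d, outcome: str):
    """Eq. (2): adds score_diff and its interaction with the within-interviewee feature."""
    # `d` is assumed to already contain x_within and x_between (as in the Eq. (1) sketch)
    # plus a score_diff column holding the game score difference.
    model = smf.mixedlm(
        f"{outcome} ~ x_within * score_diff + x_between + C(role)",
        data=d,
        groups=d["interviewee"],
    )
    return model.fit()
```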

4.4. Influence of Interviewee Team Role

We explore how the interviewee’s role within the team may moderate the relationships between speech features and interviewee gestures during the post-game interview. To investigate this, we employ a mixed-effects model that includes the interviewee’s role as a predictor variable and examines its interaction with the speech features. The model allows the intercept and the slope of the speech feature to vary by individual, and it also incorporates the interviewee’s role as a fixed effect on both the intercept and the slope. By including the interaction term between the interviewee’s role and the speech features, we can assess whether the associations between speech features and gestures differ depending on the interviewee’s role (coach vs. player) while accounting for individual variability in the relationships. In Equation (3), γ11 represents the fixed effect coefficient for the interaction between the interviewee’s role and the within-interviewee speech feature variable. The presence of a significant γ11 coefficient would indicate that the interviewee’s role moderates the relationship between speech features and gestures.

$$
\begin{aligned}
y_{ij} &= \beta_{0i} + \beta_{1i}\,(x_{ij} - \bar{x}_i) + e_{ij} \\
\beta_{0i} &= \gamma_{00} + \gamma_{01}\,\bar{x}_i + \gamma_{02}\,\mathrm{ROLE}_i + u_{0i} \\
\beta_{1i} &= \gamma_{10} + \gamma_{11}\,\mathrm{ROLE}_i + u_{1i} \\
e_{ij} &\sim \mathrm{Normal}(0, \sigma_e^2), \qquad
\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim \mathrm{MVNormal}\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{u_0}^2 & \sigma_{u_{01}} \\ \sigma_{u_{01}} & \sigma_{u_1}^2 \end{pmatrix}\right)
\end{aligned}
\tag{3}
$$

where:

  • γ02 and γ11 are the fixed-effect coefficients for the interviewee’s role and its interaction with the within-interviewee speech feature variable,

  • u1i represents the random slope for the within-interviewee speech feature variable, allowing the effect of the speech feature to vary across interviewees.
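Under the same hypothetical setup as the earlier sketches, the random-slope model of Eq. (3) can be approximated in statsmodels by letting the within-interviewee slope vary by interviewee via re_formula and interacting the feature with role; this is an illustrative specification, not the authors’ exact code.

```python
import statsmodels.formula.api as smf

def fit_role_moderation_model(d, outcome: str):
    """Eq. (3): role-by-feature interaction with a random intercept and random slope."""
    model = smf.mixedlm(
        f"{outcome} ~ x_within * C(role) + x_between",  # gamma11 = role x feature interaction
        data=d,
        groups=d["interviewee"],
        re_formula="~x_within",  # random intercept u_0i and random slope u_1i
    )
    return model.fit()
```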

5. RESULTS

In this section, we present the findings from our mixed-effects regression models examining the associations between speech features, co-speech gestures, and contextual factors in post-game sports interviews. The coefficients are presented in the tables, with significant associations (p < 0.05) marked with asterisks.

5.1. During Questions: Interviewer Speech Features Influence Interviewee Gestures

The mixed-effects regression models revealed several significant associations between interviewer speech features and interviewee gestures during the question-listening phase (see Table 4a). Specifically, higher interviewer loudness was associated with increased interviewee gestural rate (β1=0.026,p=.041). In contrast, higher interviewer voice pitch was related to decreased interviewee gestural rate (β1=-0.031,p=.003). Furthermore, a higher frequency of negative words in interviewer questions was associated with reduced interviewee smiling (β1=-0.031,p=.001). Interviewee smiling was also positively related to the score difference between the teams, with more smiling when their team won by a larger margin (β1=0.058,p<.001). Interviewee frowning showed a significant positive association with their team role (β1=0.198,p<.001), indicating that coaches were more likely to frown compared to players. Additionally, interviewer questions with higher voice pitch were linked to increased downward gaze by the interviewee (β1=0.051,p<.001). These findings suggest that interviewer speech features and contextual factors can influence the nonverbal behaviors of interviewees during question-listening phases in post-game sports interviews.

Table 4:

Within-interviewee associations (β1) between question speech features (green), answer speech features (purple), situational factors, and individual characteristics with co-speech gestures.

Question Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch Score Difference Team Role 1
Gestural Rate −0.010 0.002 0.026 * −0.031 ** 0.002 −0.100
Smile −0.004 −0.031 ** 0.006 0.011 0.058 *** −0.019
Frown 0.015 0.001 −0.011 0.011 −0.011 0.198 ***
Downward Gaze 0.004 −0.012 0.003 0.051 *** 0.007 0.038
(a) Question features
Answer Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch Score Difference Team Role 1
Gestural Rate 0.019 * −0.004 0.059 *** 0.016 0.004 −0.066
Smile 0.022 ** −0.021 ** 0.016 −0.007 0.035 *** −0.020
Frown 0.013 0.009 0.014 −0.030 *** −0.012 0.340 ***
Downward Gaze 0.016 −0.017 * −0.041 *** −0.019 −0.013 0.039
(b) Answer features

Cells display direct-effect estimates from the mixed-effects models in Eq. (1), with significant associations indicated by asterisks (*). Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05.

5.2. While Answering: Interviewee Gestures Correlate with Own Speech Features

The mixed-effects regression models also revealed significant associations between interviewee speech features and their own gestures during the answer-speaking phase (see Table 4b). Interviewee gestural rate was positively related to their own positive word frequency (β1=0.019,p=.045) and loudness (β1=0.059,p<.001). Interviewee smiling showed a positive association with their own positive word frequency (β1=0.022,p=.005) and a negative association with their negative word frequency (β1=-0.021,p=.009). Additionally, interviewee smiling was positively related to the score difference between the teams, with more smiling when their team won by a larger margin (β1=0.035,p<.001). Interviewee frowning was negatively associated with their own voice pitch (β1=-0.030,p<.001) and positively associated with their team role (β1=0.340,p<.001), indicating that coaches were more likely to frown compared to players. Interviewee downward gaze was negatively related to their own negative word frequency (β1=-0.017,p=.035) and loudness (β1=-0.041,p<.001). These findings suggest that interviewee speech features and contextual factors are associated with their own nonverbal behaviors during answer-speaking phases in post-game sports interviews.

Our experiments also yielded intriguing results regarding the relationship between an interviewee’s general speaking style and their gestures during the answering phase. By examining the interviewee’s mean speech features (x̄i), we could directly assess how their typical speaking patterns relate to their own gestures and expressions, as both the speech and nonverbal behaviors originate from the same individual. We discovered that interviewees who generally spoke with higher vocal intensity tended to exhibit a higher rate of gesturing (γ01=0.228,p=.002), while those who habitually used more negative language displayed more downward gaze (γ01=0.432,p=.027). Finally, interviewees whose teams won by larger margins on average tended to smile more during their responses (γ01=0.197,p=.032).

5.3. Game Score Difference Moderates Speech and Gesture Correlations

Our analysis also investigated how the game outcome, quantified by the score difference between the interviewee’s team and their opponent, moderated the relationships between speech features and interviewee gestures during post-game interviews (see Table 5a and Table 5c). The impact of the game outcome on these relationships is also depicted in Figure 2. For instance, Figure 2a shows that the negative correlation between the interviewer’s use of positive emotional language and the interviewee’s downward gaze diminishes as the victory margin increases (β3=-0.019,p=.027). Conversely, Figure 2b shows that the positive correlation between positive emotional language use by the interviewee and their smiling intensifies with larger victory margins (β3=0.021,p=.007). Additionally, we found that a larger victory margin also weakens the association between the pitch of the interviewer’s voice and the smile of the interviewee (β3=-0.025,p=.021) when listening to questions. These findings suggest that when the interviewee’s team won by a larger margin, the interviewer’s speech features had less influence on the interviewee’s nonverbal behaviors. When answering, however, a larger margin notably strengthens the positive correlations of interviewee loudness (β3=0.026,p=.010) and pitch (β3=0.032,p=.005) with their gestural rate. Consistent with the listening phase, the relationship between interviewee voice pitch and smiling was again attenuated when the victory margin was larger (β3=-0.025,p=.014). Furthermore, our analysis revealed that a larger win amplified the positive relationships of interviewee negative language use (β3=0.020,p=.007) and loudness (β3=0.018,p=.015) with frowning behavior. These findings indicate that a situational factor such as the magnitude of the victory differentially moderated the associations between the interviewee’s speech features and their gestures, with some relationships amplified and others attenuated.

Table 5:

Associations between question speech features (a,b) and answer speech features (c,d) with co-occurring interviewee gestures, moderated by score difference and team role (coach or player).

(a) Question features moderated by game score difference

Score Difference Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch
Gestural Rate 0.002 0.012 0.006 0.001
Smile −0.014 −0.006 0.018 −0.025 *
Frown −0.004 −0.002 0.003 0.014
Downward Gaze −0.019 * 0.005 0.018 −0.014

(b) Question features moderated by interviewee role (coach or player)

Team Role Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch
Gestural Rate 0.001 0.005 −0.028 -
Smile −0.016 - −0.040 * −0.005
Frown 0.023 - 0.032 −0.013
Downward Gaze −0.013 0.005 −0.006 0.026

(c) Answer features moderated by game score difference

Score Difference Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch
Gestural Rate −0.016 0.006 0.026 ** 0.032 **
Smile 0.021 ** 0.010 0.003 −0.025 *
Frown −0.002 0.020 ** 0.018 * 0.000
Downward Gaze −0.002 0.011 0.006 −0.001

(d) Answer features moderated by interviewee role (coach or player)

Team Role Pos. Word Freq. Neg. Word Freq. Loudness Voice Pitch
Gestural Rate −0.011 −0.001 −0.100 * −0.005
Smile −0.014 0.002 −0.007 −0.018
Frown 0.029 ** −0.021 0.050 −0.025
Downward Gaze −0.012 −0.002 −0.020 0.019

Cells display fixed-effect estimates, with significant associations indicated by asterisks (*). Significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05.

These results indicate that the relationship between speech and gesture is contingent upon situational and individual factors.

Figure 2:

Spotlight analyses showing how score difference moderates associations between (a) interviewer’s positive words and interviewee’s downward gaze, and (b) interviewee’s positive words and smiling. Plots show estimated effects at −1 standard deviation (SD), mean, and +1 SD of score difference. Crosses (x) represent observed data points; shaded regions indicate ±1 standard error.

5.4. Interviewee’s Role in their Team Moderates Speech and Gesture Correlations

We also examined how the interviewee’s role within their team (coach vs. player) moderated the correlations between speech features and gestures during post-game interviews (see Table 5b and Table 5d). Figure 3a shows a negative moderation of the correlation between the loudness of the question and the smiling of the interviewee for coaches compared to players (γ11=-0.040,p=.017), indicating that coaches showed a more pronounced decrease in smiling than players when their interlocutor increased vocal intensity. Similarly, Figure 3b shows that the relationship between interviewee loudness and gestural rate was more negative for coaches compared to players (γ11=-0.100,p=.040), implying that coaches are less likely than players to pair loud speech with expressive gestures. Furthermore, the results showed that coaches exhibited a stronger propensity to frown when using positive language than players (γ11=0.029,p=.005). These moderation effects suggest that associations between speech and gestures can differ depending on whether the interviewee is a coach or a player, with some relationships stronger for one group than the other. This highlights that individual factors, such as the interviewee’s role within their team, significantly influence the dynamics between speech features and gestures during post-game interviews.

Figure 3:

Spotlight analyses showing how interviewee team role moderates the associations between (a) interviewer question pitch and interviewee frowning, and (b) interviewee positive words and interviewee gestural rate.

6. DISCUSSION

In this section, we analyze our experiments and results from the GeSTICS dataset to explore how various factors influence the interaction between speech and gestures.

6.1. Key Findings

Our results indicate that situational factors, such as the outcome of the game, can modify the way speech features and gestures are associated. Specifically, we observed that a larger victory margin diminishes the impact of interviewer vocal and linguistic characteristics on the nonverbal behaviors of interviewees during the question-listening phase. This reduction in influence might be due to the interviewee’s positive emotional state or an increased sense of confidence and control after a positive experience [23, 47]. In contrast, during the response phase, a larger victory margin appeared to strengthen the connections between the vocal characteristics of the interviewees and their gestural rate. This observation is consistent with previous findings that emotional states can influence the interaction between facial gestures and speech [12].

Furthermore, our analysis suggests that individual characteristics, such as the interviewee’s role on the team, influence the interaction between speech features and gestures. Specifically, coaches demonstrated a more pronounced negative speech-gesture coupling compared to players, especially in response to louder questions. This may reflect a greater need for coaches to control their nonverbal expressions to maintain a leadership posture. This distinction could also be attributed to variations in age and experience.

Notably, our analysis of the relationship between an interviewee’s general speaking style and their nonverbal expressions during the answer-speaking phase yielded intriguing findings. We found that interviewees who spoke with a higher average volume tended to also exhibit a higher rate of gestures (γ01=0.228,p=.002), possibly reflecting individual differences in personality traits such as extraversion or confidence [7]. Furthermore, interviewees who habitually used more negative language showed more frequent downward gaze (γ01=0.432,p=.027), which could be linked to individual differences in emotional regulation strategies or self-consciousness [30]. These findings highlight the value of the GeSTICS dataset in examining how individual differences in speaking style relate to non-verbal expressive patterns, offering a more comprehensive understanding of the factors that shape multimodal communication.

6.2. Limitations and Future Directions

While our study provides valuable insights into the context-dependent nature of speech-gesture relationships, it is important to recognize its limitations and identify avenues for future research. As an exploratory analysis, we focused on a subset of speech features and nonverbal behaviors, but there may be additional dimensions of vocal expression (e.g., intonation contours, spectral features) and bodily communication (e.g., posture shifts, self-adaptors) that play important roles in shaping the dynamics of interpersonal interaction. Future research could leverage the GeSTICS dataset to explore a broader range of features, nonlinear relationships, and temporal dynamics in multimodal communication. Time series analyses could reveal how speech-gesture coordination evolves over the course of an interaction, accounting for complex interactions among different communication channels. While post-game sports interviews provide a rich setting for studying situational factors, it is important to note that the observed multimodal patterns may be partially shaped by the specific domain. Future work should collect and analyze data from other interaction domains to assess the generalizability of these findings. Finally, the GeSTICS dataset could be used to develop and test computational models for predicting nonverbal behaviors from speech features, or for jointly analyzing and generating realistic multimodal behavior.

7. CONCLUSION

In this paper, we introduced GeSTICS, a new large-scale dataset of multimodal communication behaviors in post-game sports interviews. Our exploratory analysis of GeSTICS revealed significant associations between speech features, gestures, and contextual factors, highlighting the importance of considering the interaction of contextual factors in co-speech gestures. Specifically, our findings suggest that the context in which interviews take place can affect the interactions between speech and gesture in both phases of a dyadic interaction. By making GeSTICS publicly available and sharing our initial findings, we aim to stimulate interdisciplinary research on the dynamics of natural human communication.

Figure 1:

Example of a Q&A pair in the GeSTICS dataset. Left: The interviewee, depicted as a stick figure, assumes a listening pose while reacting to the interviewer’s question shown in a speech bubble. Right: The interviewee is portrayed in a speaking pose as they provide their response, with their speech represented in a corresponding speech bubble. The dataset also includes metadata on the interviewee’s age, cultural background, and the game outcome.

CCS CONCEPTS

• Human-centered computing → Human computer interaction (HCI); Gestural input; Natural language interfaces; Collaborative interaction.

ACKNOWLEDGMENTS

This material is based upon work partially supported by National Institutes of Health awards U01MH116923, R01HD081362, R01MH125740, R01MH096951, R21MH130767, and R01MH132225. The work of Auriane Boudin was supported by the Institute of Convergence ILCB (ANR-16-CONV-0002). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors, and no official endorsement should be inferred.

Contributor Information

Gaoussou Youssouf Kebe, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

Mehmet Deniz Birlikci, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

Auriane Boudin, Aix Marseille Univ, CNRS, LPL, LIS, Aix-en-Provence, France.

Ryo Ishii, NTT Corporation, Kanagawa, Japan.

Jeffrey M. Girard, University of Kansas, Lawrence, Kansas, USA

Louis-Philippe Morency, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

REFERENCES

  • [1].Ahuja Chaitanya, Joshi Pratik, Ishii Ryo, and Morency Louis-Philippe. 2023. Continual Learning for Personalized Co-Speech Gesture Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, Vancouver, BC, Canada, 20893–20903. [Google Scholar]
  • [2].Ahuja Chaitanya, Lee Dong Won, Nakano Yukiko I., and Morency Louis-Philippe. 2020. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach. In European Conference on Computer Vision (ECCV). Springer, Glasgow, UK, 248–265. https://arxiv.org/abs/2007.12553 [Google Scholar]
  • [3].Ahuja Chaitanya, Ma Shugao, Morency Louis-Philippe, and Sheikh Yaser. 2019. To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations. In 2019 International Conference on Multimodal Interaction. ACM, New York, NY, USA, 74–84. [Google Scholar]
  • [4].Beattie Geoffrey and Aboudan Rima. 1994. Gestures, pauses and speech: An experimental investigation of the effects of changing social context on their precise temporal relationships. Semiotica 99, 3–4 (1994), 239–272. [Google Scholar]
  • [5].Blache Philippe, Abderrahmane Massina, Rauzy Stéphane, and Bertrand Roxane. 2020. An integrated model for predicting backchannel feedbacks. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (Virtual Event, Scotland, UK) (IVA ‘20). Association for Computing Machinery, New York, NY, USA, Article 6, 3 pages. [Google Scholar]
  • [6].Blomsma Peter, Vaitonytė Julija, Skantze Gabriel, and Swerts Marc. 2024. Backchannel behavior is idiosyncratic. Language and Cognition 1 (2024), 1–24. 10.1017/langcog.2024.1 [DOI] [Google Scholar]
  • [7].Borkenau Peter and Liebler Anette. 1992. Trait inferences: Sources of validity at zero acquaintance. Journal of personality and social psychology 62, 4 (1992), 645. [Google Scholar]
  • [8].Boudin Auriane, Bertrand Roxane, Rauzy Stéphane, Ochs Magalie, and Blache Philippe. 2024. A Multimodal Model for Predicting Feedback Position and Type During Conversation. Speech Communication 159 (2024), 103066. 10.1016/j.specom.2024.103066 [DOI] [Google Scholar]
  • [9].Bozkurt Elif, Khaki Hossein, Keçeci Sinan, Türker B. Berker, Yemez Yücel, and Erzin Engin. 2017. The JESTKOD database: an affective multimodal database of dyadic interactions. Language Resources and Evaluation 51 (2017), 857–872. [Google Scholar]
  • [10].Bredin Hervé, Yin Ruiqing, Coria Juan Manuel, Gelly Gregory, Laurens Pascal, Guinaudeau Cyril, Alam Faizan, et al. 2020. pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2020-May. IEEE, Barcelona, Spain, 7124–7128. [Google Scholar]
  • [11].Buschmeier Hendrik, Malisz Zofia, Skubisz Joanna, Wlodarczak Marcin, Wachsmuth Ipke, Kopp Stefan, and Wagner Petra. 2014. ALICO: A multimodal corpus for the study of active listening. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland, 3638–3643. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1017_Paper.pdf [Google Scholar]
  • [12].Busso Carlos and Shrikanth S Narayanan. 2007. Interrelation between speech and facial gestures in emotional utterances: a single subject study. IEEE Transactions on Audio, Speech, and Language Processing 15, 8 (2007), 2331–2247. [Google Scholar]
  • [13].Cao Zhe, Simon Tomas, Wei Shih-En, and Sheikh Yaser. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, Honolulu, HI, USA, 7291–7299. [Google Scholar]
  • [14].Cutica Ilaria and Bucciarelli Monica. 2011. “The More You Gesture, the Less I Gesture”: Co-speech gestures as a measure of mental model quality. Journal of Nonverbal Behavior 35 (2011), 173–187. [Google Scholar]
  • [15].Deng Jiankang, Guo Jia, Yuxiang Zhou, Yu Jinke, Kotsia Irene, and Zafeiriou Stefanos. 2019. RetinaFace: Single-stage Dense Face Localisation in the Wild. [Google Scholar]
  • [16].Eyben Florian, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia (Firenze, Italy) (MM ‘10). Association for Computing Machinery, New York, NY, USA, 1459–1462. 10.1145/1873951.1874246 [DOI] [Google Scholar]
  • [17].FiveThirtyEight. 2019. Comprehensive Game Analysis and Statistics. FiveThirtyEight. https://fivethirtyeight.com/features/comprehensive-game-analysis-and-statistics/ Accessed: 2023-09-28. [Google Scholar]
  • [18].FiveThirtyEight. 2019. NBA Raptor Data. https://github.com/fivethirtyeight/data/tree/master/nba-raptor. Accessed: 2024-04-16. [Google Scholar]
  • [19].Ghorbani Saeed, Ferstl Ylva, and Carbonneau Marc-André. 2022. Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022. In ICMI ‘22 (Bengaluru, India). Association for Computing Machinery, New York, NY, USA, 778–783. 10.1145/3536221.3558068 [DOI] [Google Scholar]
  • [20].Ghorbani Saeed, Ferstl Ylva, Holden Daniel, Troje Nikolaus F., and Carbonneau Marc-André. 2023. ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. Computer Graphics Forum 42, 1 (2023), 206–216. 10.1111/cgf.14734 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.14734 [DOI] [Google Scholar]
  • [21].Ginosar Shiry, Bar Amir, Kohavi Gefen, Chan Caroline, Owens Andrew, and Malik Jitendra. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, CA, USA, 3497–3506. [Google Scholar]
  • [22].Gratch Jonathan, Okhmatovskaia Anna, Lamothe Francois, Marsella Stacy, Morales Mathieu, van der Werf Rick J, and Morency Louis-Philippe. 2006. Virtual rapport. In Intelligent Virtual Agents: 6th International Conference, IVA 2006, Marina Del Rey, CA, USA, August 21–23, 2006. Proceedings 6. Springer, Berlin, Heidelberg, 14–27. [Google Scholar]
  • [23].Judith A Hall, Erik J Coats, and Lavonia Smith LeBeau. 2005. Nonverbal behavior and the vertical dimension of social relations: A meta-analysis. In Psychological bulletin. Vol. 131. American Psychological Association, Washington, DC, USA, 898. [DOI] [PubMed] [Google Scholar]
  • [24].Hasegawa Dai, Kaneko Naoshi, Shirakawa Shinichi, Sakuta Hiroshi, and Sumi Kazuhiko. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. ACM, New York, NY, USA, 79–86. [Google Scholar]
  • [25].Hostetter Autumn B. and Potthoff Andrea L.. 2012. Effects of personality and social situation on representational gesture production. Gesture 12 (2012), 62–83. 10.1075/GEST.12.1.04HOS [DOI] [Google Scholar]
  • [26].Huang Lixing, Morency Louis-Philippe, and Gratch Jonathan. 2011. Virtual Rapport 2.0. In Intelligent Virtual Agents: 10th International Conference, IVA 2011, Reykjavik, Iceland, September 15–17, 2011. Proceedings 11. Springer, Berlin, Heidelberg, 68–79. [Google Scholar]
  • [27].Clayton J Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media. AAAI Press, Palo Alto, CA, USA, 216–225. [Google Scholar]
  • [28].Ishii Ryo, Ahuja Chaitanya, Nakano Yukiko I., and Morency Louis-Philippe. 2020. Impact of Personality on Nonverbal Behavior Generation. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA ‘20). Association for Computing Machinery, New York, NY, USA, 1–8. 10.1145/3383652.3423908 [DOI] [Google Scholar]
  • [29].Jonell Patrik, Kucherenko Taras, Gustav Eje Henter, and Jonas Beskow. 2020. Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. ACM, New York, NY, USA, 1–8. [Google Scholar]
  • [30].Keltner Dacher. 1995. Signs of appeasement: Evidence for the distinct displays of embarrassment, amusement, and shame. Journal of personality and social psychology 68, 3 (1995), 441. [Google Scholar]
  • [31].Young Hwan Kim. 1986. ELOGIC: A relaxation-based switch-level simulation technique. Electronics Research Laboratory, College of Engineering, University of California, Berkeley, Berkeley, CA, USA. [Google Scholar]
  • [32].Kong A, Law S, Wat W, and Lai C. 2015. Co-verbal gestures among speakers with aphasia: Influence of aphasia severity, linguistic and semantic skills, and hemiplegia on gesture employment in oral discourse. Journal of communication disorders 56 (2015), 88–102. 10.1016/j.jcomdis.2015.06.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Kucherenko Taras, Jonell Patrik, van Waveren Sanne, Henter Gustav Eje, Alexanderson Simon, Leite Iolanda, and Kjellström Hedvig. 2020. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the 2020 International Conference on Multimodal Interaction. ACM, New York, NY, USA, 242–250. [Google Scholar]
  • [34].Lee Gilwoo, Deng Zhiwei, Ma Shugao, Shiratori Takaaki, Srinivasa Siddhartha S., and Sheikh Yaser. 2019. Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Seoul, Korea, 763–772. [Google Scholar]
  • [35].Liu Haiyang, Zhu Zihao, Iwamoto Naoya, Peng Yichen, Li Zhengqing, Zhou You, Bozkurt Elif, and Zheng Bo. 2022. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In European Conference on Computer Vision. Springer, Cham, Switzerland, 612–630. [Google Scholar]
  • [36].Lugaresi Camillo, Tang Jiuqiang, Nash Hadon, McClanahan Chris, Uboweja Esha, Hays Michael, Zhang Fan, Chang Chuo-Ling, Yong Ming Guang, Lee Juhyun, et al. 2019. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1903.01101 (2019). [Google Scholar]
  • [37].Daniel McDuff, Jeffrey M Girard, and Rana el Kaliouby. 2017. Large-scale observational evidence of cross-cultural differences in facial behavior. Journal of Nonverbal Behavior 41 (2017), 1–19. [Google Scholar]
  • [38].Morency Louis-Philippe, Iwan De Kok, and Jonathan Gratch. 2008. Predicting listener backchannels: A probabilistic multimodal approach. In International workshop on intelligent virtual agents. Springer, Berlin, Heidelberg, 176–190. [Google Scholar]
  • [39].Morency Louis-Philippe, Kok Iwan, and Gratch Jonathan. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20, 1 (jan 2010), 70–84. 10.1007/s10458-009-9092-y [DOI] [Google Scholar]
  • [40].Ng Evonne, Joo Hanbyul, Hu Liwen, Li Hao, Darrell Trevor, Kanazawa Angjoo, and Ginosar Shiry. 2022. Learning To Listen: Modeling Non-Deterministic Dyadic Facial Motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New Orleans, LA, USA, 20395–20405. [Google Scholar]
  • [41].Ng Evonne, Subramanian Sanjay, Klein Dan, Kanazawa Angjoo, Darrell Trevor, and Ginosar Shiry. 2023. Can Language Models Learn to Listen?. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, Vancouver, BC, Canada, 10083–10093. [Google Scholar]
  • [42].Tan Viet Tuyen Nguyen and Oya Celiktutan. 2022. Context-Aware Body Gesture Generation for Social Robots. In ICRA 2022 Workshop on Prediction and Anticipation Reasoning for Human-Robot Interaction. IEEE, Philadelphia, PA, USA, 1–6. [Google Scholar]
  • [43].Pennebaker James W., Boyd Ryan, Jordan Kayla, and Blackburn Kate. 2015. The development and psychometric properties of LIWC2015. University of Texas at Austin, Austin, USA. 10.15781/T29G6Z [DOI] [Google Scholar]
  • [44].Radford Alec, Kim Jong Wook, Xu Tao, Brockman Greg, McLeavey Christine, and Sutskever Ilya. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356 (2022). [Google Scholar]
  • [45].Sadoughi Najmeh and Busso Carlos. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Calgary, AB, Canada, 6169–6173. [Google Scholar]
  • [46].Schroff Florian, Kalenichenko Dmitry, and Philbin James. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, Boston, MA, USA, 815–823. [Google Scholar]
  • [47].Schwarz Norbert. 2002. Situated cognition and the wisdom of feelings: Cognitive tuning. The wisdom in feelings 1 (2002), 144–166. [Google Scholar]
  • [48].Song Siyang, Spitale Micol, Luo Cheng, Barquero Germán, Palmero Cristina, Escalera Sergio, Valstar Michel, Baur Tobias, Ringeval Fabien, André Elisabeth, and Gunes Hatice. 2023. REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa, ON, Canada) (MM ‘23). Association for Computing Machinery, New York, NY, USA, 9620–9624. 10.1145/3581783.3612832 [DOI] [Google Scholar]
  • [49].Vinciarelli Alessandro, Chatziioannou Paraskevi, and Esposito Anna. 2015. When the words are not everything: the use of laughter, fillers, back-channel, silence, and overlapping speech in phone calls. Frontiers in ICT 2 (2015), 4. [Google Scholar]
  • [50].Ward Nigel and Tsukahara Wataru. 2000. Prosodic features which cue backchannel responses in English and Japanese. Journal of Pragmatics 32, 8 (2000), 1177–1207. 10.1016/S0378-2166(99)00109-5 [DOI] [Google Scholar]
  • [51].Yang Zhaojun and Shrikanth S Narayanan. 2014. Analysis of emotional effect on speech-body gesture interplay. In Fifteenth Annual Conference of the International Speech Communication Association. ISCA, Singapore, 2176–2180. [Google Scholar]
  • [52].Yoon Youngwoo, Cha Bok, Lee Joo-Haeng, Jang Minsu, Lee Jaeyeon, Kim Jaehong, and Lee Geehyuk. 2020. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. 39, 6, Article 222 (nov 2020), 16 pages. 10.1145/3414685.3417838 [DOI] [Google Scholar]
  • [53].Yoon Youngwoo, Ko Woo-Ri, Jang Minsu, Lee Jaeyeon, Kim Jaehong, and Lee Geehyuk. 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, Montreal, Canada, 4303–4309. 10.1109/ICRA.2019.8793720 [DOI] [Google Scholar]
  • [54].Yoon Youngwoo, Wolfert Pieter, Kucherenko Taras, Viegas Carla, Nikolov Teodor, Tsakov Mihail, and Gustav Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the 2022 International Conference on Multimodal Interaction (Bengaluru, India) (ICMI ‘22). Association for Computing Machinery, New York, NY, USA, 736–747. 10.1145/3536221.3558058 [DOI] [Google Scholar]
  • [55].Zhou Mohan, Bai Yalong, Zhang Wei, Yao Ting, Zhao Tiejun, and Mei Tao. 2022. Responsive listening head generation: a benchmark dataset and baseline. In European Conference on Computer Vision. Springer, Cham, Switzerland, 124–142. [Google Scholar]
