Author manuscript; available in PMC 2024 Oct 1.
Published in final edited form as: IEEE Trans Affect Comput. 2023 Jan 12;14(4):3388–3395. doi: 10.1109/TAFFC.2023.3236265

Automated Classification of Dyadic Conversation Scenarios using Autonomic Nervous System Responses

Iman Chatterjee 1, Maja Goršič 2, Mohammad S Hossain 3, Joshua D Clapp 4, Vesna D Novak 5
PMCID: PMC10721131  NIHMSID: NIHMS1948015  PMID: 38107015

Abstract

Two people’s physiological responses become more similar as those people talk or cooperate, a phenomenon called physiological synchrony. The degree of synchrony correlates with conversation engagement and cooperation quality, and could thus be used to characterize interpersonal interaction. In this study, we used a combination of physiological synchrony metrics and pattern recognition algorithms to automatically classify four different dyadic conversation scenarios: two-sided positive conversation, two-sided negative conversation, and two one-sided scenarios. Heart rate, skin conductance, respiration and peripheral skin temperature were measured from 16 dyads in all four scenarios, and individual as well as synchrony features were extracted from them. A two-stage classifier based on stepwise feature selection and linear discriminant analysis achieved a four-class classification accuracy of 75.0% in leave-dyad-out crossvalidation. Removing synchrony features reduced accuracy to 65.6%, indicating that synchrony is informative. In the future, such classification algorithms may be used to, e.g., provide real-time feedback about conversation mood to participants, with applications in areas such as mental health counseling and education. The approach may also generalize to group scenarios and adjacent areas such as cooperation and competition.

Keywords: Classifier design and evaluation, Affect sensing and analysis, Peripheral measures, Recognition of group emotion

1. Introduction

When two people interact, they mirror and coordinate their behavior in diverse ways: for example, by converging in vocabulary choice and mirroring each other’s postures [1], [2]. Such interpersonal synchrony may involve multiple underlying mechanisms (e.g., mirror neurons) and can manifest at neural, perceptual, affective, and/or behavioral levels [1], [2]. On the physiological level, studies show that physiological synchrony (i.e., similarity between two interacting individuals’ physiological responses) provides valuable information about the interaction. For example, studies of cooperation indicate that the degree of synchrony between participants’ heart rates (HR) [3], skin conductances [4], [5] and electroencephalography (EEG) signals [6], [7] correlates with cooperation quality and predicts task outcome. Similarly, the degree of synchrony correlates with perceived therapist empathy in therapist-client interactions [8], [9] and overall engagement in teachers and students [10], [11].

Since physiological responses of interacting pairs and groups provide insight into the interaction dynamics, they could be presented to participants or observers, helping them obtain information about subconscious processes they may otherwise miss. For example, physiological information could be presented as visual biofeedback to help multiplayer game players [12] and therapist-client dyads [13] self-regulate their emotions and interpersonal engagement. As another example, presentation of physiological information could help groups of learners obtain better group awareness and consequently enhance collaborative learning [14]. Physiological synchrony has significant potential in this regard, as it may provide information not visible from self-report measures [6].

1.1. Pattern Recognition Applied to Physiology

As individual physiological signals and synchrony metrics can be noisy and hard to interpret, an alternative to presenting them directly is to combine physiological measurements with pattern recognition algorithms that use multiple signal modalities (e.g., EEG and HR) and/or multiple channels of one modality (e.g., multiple EEG channels) to automatically infer a metric such as interpersonal engagement or cooperation quality. Such inferred metrics might be more robust and more easily interpretable than raw physiological measurements.

The usage of pattern recognition with physiological data has been extensively studied in single-user affective computing, where psychological states of single participants (e.g., workload, enjoyment) are inferred from physiological data using (usually) supervised machine learning techniques [15]–[17]. These techniques can be divided into classification algorithms, which discriminate between discrete classes (e.g., happy vs. sad), and regression algorithms, which estimate a continuous quantity (e.g., enjoyment from 0 to 10). For reasons such as simplicity and interpretability, classification is much more common than regression in single-user affective computing [15]–[17].

Compared to extensive research in single-user scenarios, there has been much less work on applying supervised machine learning to dyadic and group situations. A few dyadic studies have used classification algorithms with a single physiological modality (e.g., only EEG) to discriminate between two [18]–[22], three [23] or four states [24]. To our knowledge, only one study has used classification with multiple physiological modalities in dyadic scenarios: our work on competitive gaming [25]. One study used regression to estimate arousal and valence on 1–9 scales using dyadic EEG during video watching [26], and a recent study by our team used regression to estimate conversation engagement on a 1–100 scale using multiple autonomic nervous system responses [27].

1.2. Gap in State of the Art and Study Contribution

In dyadic situations, pattern recognition applied to physiological signals could be used as a basis for intelligent feedback and automated adaptation, similarly to how it is used in single-user affective computing [15]–[17]. However, the lack of studies on such pattern recognition in dyadic scenarios represents a gap in the state of the art. Further development and evaluation of pattern recognition algorithms for dyads’ physiological signals is needed before such algorithms can be used for feedback and adaptation.

Our team is interested in physiological synchrony in conversation scenarios, as this would have applications to mental health counseling [8], [9] and education [10], [11]. Conversation is relatively unexplored with regard to physiological synchrony: a 2020 review of synchrony in brain signals found education and conversation to be uncommon among reviewed studies [28]. While conversation scenarios have been used in some dyadic classification studies, nearly all involved 2-class classification using a single physiological modality [19], [22]. The one conversation study with multiple modalities was a recent study by our team that used regression algorithms to estimate engagement during a freeflowing (unscripted) conversation [27]. However, this study had several issues: self-reported engagement tended to stay within a limited positive range, and regression performance was difficult to evaluate due to lack of established outcome metrics.

To avoid the weaknesses of our previous study, the current study design approximates classic study designs in affective computing. In perhaps the most famous single-user affective computing study, participants acted out different basic emotions (e.g., anger, disgust), and a classification algorithm was used to differentiate among these emotions using multiple physiological responses [29]. In the current study, dyads acted out different prescribed conversation scenarios, and a classification algorithm was used to differentiate among them using multiple physiological responses. By prescribing acted conversation scenarios, the study ensures that the scenarios are intuitively divisible into discrete classes. These classes are designed to exist at different points in valence-arousal-dominance space [30] and thus represent different conversation situations of interest [31]. As classification accuracy is a standard outcome metric [15]–[17], it can easily be used to compare different classification approaches.

The study contribution can be summarized as follows: it represents, to our knowledge, the first use of multiclass classification in dyadic conversation with multiple physiological modalities, including synchrony metrics. By demonstrating the accuracy of such classification and comparing it to, e.g., classification without synchrony metrics, it allows the identification of effective classification approaches and an estimation of their practical usefulness. This may pave the way for applied uses of dyadic physiological measurements in affective computing.

2. Materials and Methods

2.1. Participants

Eighteen dyads from the University of Wyoming student and staff community participated in the study. Two dyads’ signals were corrupted by noise and discarded, leaving 16 valid dyads. Of these dyads, 8 self-described as friends, 2 as roommates, 1 as coworkers, 4 as relationship partners, and 1 as strangers. Seven dyads were male-male, 4 female-female, 4 female-male, and 1 dyad had one agender and one nonbinary participant. Participants’ age was 23.3 ± 4.6 years (mean ± SD), with a range of 18–35 years. Participants were invited to volunteer for the study either alone or with a self-selected partner. The high percentage of acquainted pairs was due to the COVID-19 pandemic during data collection and likely biased the results, since physiological responses to conversation are modulated by the nature of the relationship between the participants [32].

Participants self-reported 4 personality traits that affect physiological synchrony [33]–[35]: cognitive and affective empathy with the Questionnaire of Cognitive and Affective Empathy [36], social anxiety with the Brief Fear of Negative Evaluation Scale [37], and depression with the Center for Epidemiologic Studies Depression Scale [38]. Their scores were 36.3 ± 7.9 for social anxiety (theoretical possible range 25–55), 15.9 ± 10.5 for depression (possible range 1–40), 55.9 ± 8.4 for cognitive empathy (range 30–75), and 32.6 ± 5.0 for affective empathy (range 20–45). High values denoted higher empathy, anxiety or depression.

2.2. Study Protocol

Data collection was done from January to May 2021 with approval of the Institutional Review Board of the University of Wyoming. Sessions took 60–75 min and began with an explanation of the study, after which participants gave written informed consent and reported demographic data (~10 min). They sat at opposite ends of a table about 1.5 m apart, separated by a transparent plexiglass barrier, and filled out the personality questionnaires (~10 min). They then removed their face masks and self-applied the physiological sensors with verbal guidance from a researcher; signal quality was checked and any needed corrections (e.g., electrode reattachment) were made (~10 min). Though sensor self-application likely increased signal variability, it allowed social distancing during the COVID-19 pandemic.

Participants next discussed possible conversation topics and identified topics that they agreed on as well as at least one topic that they disagreed on (~5 min). They were asked to find topics that were noncontroversial and as agreeable/disagreeable as possible. Once topics were selected, baseline physiological measurements were taken for 4 minutes. During this interval, participants were asked to relax silently with eyes closed. Afterwards, dyads went through four 4-minute conversation scenarios:

  • Two-sided positive conversation: Participants were told to discuss a topic they agreed on and emphasize their agreement: always agree with things the other person said and generally appear cheerful/enthusiastic. They were also told that both participants should talk about the same amount.

  • Two-sided negative conversation: Participants were told to discuss a topic they disagreed on and to emphasize their disagreement: always disagree with things the other person said and generally appear unhappy/unfriendly. They were also told that both participants should talk about the same amount.

  • One-sided conversation, person on left talks: The participant on the left side of the lab was told to talk about an agreeable topic while the other participant was told to listen and act interested. The listener could make sounds of acknowledgment and ask follow-up questions, but participants were told that the speaker should talk at least 90% of the time.

  • One-sided conversation, person on right talks: As above, but with participants’ roles reversed.

The four scenarios were done in random order. After each scenario, participants completed the Self-Assessment Manikin (SAM) [30] to report individual valence, arousal and dominance and the Interpersonal Interaction Questionnaire (IIQ) [31] to report the amount of conversation, its balance, and its valence (~1 min per scenario). They then removed the sensors and were paid $15 (~5 min).

2.3. Physiological Measurements

Five signals were measured from each participant using two g.USBamp amplifiers (g.tec Medical Engineering GmbH, Austria) and add-on sensors. The electrocardiogram (ECG) was measured using g.tec’s recommended approach: two signal electrodes on the right side of the abdomen and the left side of the chest, a reference electrode on the right side of the chest, and a ground electrode on the upper back. Respiration was measured both at the nose and the chest since we were unsure which approach would be more robust to noise. For nose measurements, the thermistor-based g.Sensor Respiration Airflow was placed below the nose and above the mouth. For chest measurements, the piezoelectric-crystal-based g.Sensor Respiration Effort was placed around the chest. Skin conductance was measured using the g.GSRsensor2 at the distal phalanges of the index and middle fingers of the nondominant hand. Peripheral skin temperature was measured using the g.Temp sensor at the distal phalanx of the little finger of the nondominant hand.

Signals were sampled at 600 Hz and passed through a 60-Hz analog notch filter. The ECG was additionally passed through a 0.1-Hz analog highpass filter while all others were filtered with an analog 30-Hz lowpass filter. The g.tec MATLAB/Simulink model and a synchronization cable were used to synchronize the two amplifiers.

2.4. Preprocessing and Feature Extraction

Signals from each dyad were divided into five intervals: a 4-minute baseline and the four 4-minute conversation intervals. Data outside these intervals were discarded.

For each interval, fourth-order Butterworth digital 5-Hz lowpass filters were applied to skin conductance, respiration, and skin temperature signals. ECG peaks corresponding to R-waves were identified using peak detection logic in MATLAB, then manually inspected to verify that detected peaks truly represented R-waves. If noise made it impossible to identify an R-wave, a peak was interpolated halfway between two identified R-waves. Peaks indicative of individual breaths were detected in respiration signals with a similar method. In the skin conductance signal, individual skin conductance responses (SCRs) were detected using a peak detection algorithm from our previous work [25]. Following a popular definition [39, p. 157], the algorithm considered peaks in the signal to be SCRs if they were at least 0.05 μS higher than the previous valley and occurred within 5 s of that valley. We acknowledge that this is a relatively simple approach that does not account for, e.g., possible SCR superposition.
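
For illustration, a minimal MATLAB sketch of the skin conductance preprocessing and valley-to-peak SCR rule is given below. Variable names are ours, and the use of zero-phase filtfilt filtering is an assumption; the authoritative algorithm is the one from our previous work [25].

    % Preprocessing and SCR detection sketch. scRaw: raw skin conductance
    % in microsiemens; fs = 600 Hz. filtfilt (zero-phase) is our assumption.
    [b, a] = butter(4, 5 / (fs/2));           % 4th-order Butterworth 5-Hz lowpass
    sc     = filtfilt(b, a, scRaw);
    [pks, pkLocs] = findpeaks(sc);            % candidate peaks
    [vls, vlLocs] = findpeaks(-sc);           % candidate valleys (inverted signal)
    vls = -vls;
    scrAmps = [];
    for i = 1:numel(pkLocs)
        j = find(vlLocs < pkLocs(i), 1, 'last');   % closest preceding valley
        if isempty(j), continue; end
        amp   = pks(i) - vls(j);                   % valley-to-peak rise
        delay = (pkLocs(i) - vlLocs(j)) / fs;      % rise time in seconds
        if amp >= 0.05 && delay <= 5               % criteria following [39]
            scrAmps(end+1) = amp; %#ok<AGROW>
        end
    end
    nSCR    = numel(scrAmps);      % number of SCRs
    meanAmp = mean(scrAmps);       % mean SCR amplitude
    sdAmp   = std(scrAmps);        % SD of SCR amplitudes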

Several features were extracted from the baseline interval and each of the 4 conversation intervals. These are divided into individual features (calculated from physiological data of a single participant) and synchrony features (calculated from physiological data of both participants). The individual features were as follows:

  • ECG: Mean HR, three time-domain measures of HR variability (standard deviation (SD) of interbeat intervals, root-mean-square value of consecutive differences between these intervals, percentage of consecutive interbeat intervals with a difference greater than 50 ms), and three frequency-domain measures of HR variability (power in the low-frequency band, power in the high-frequency band, and the ratio of the two). These features are standard in the literature and can be used with intervals that are at least 2 minutes long [40]; a computational sketch follows this list.

  • Skin conductance: Mean skin conductance, the difference between initial and final values of skin conductance, number of skin conductance responses (SCR), mean SCR amplitude, SD of SCR amplitudes.

  • Nose respiration: mean respiration rate and SD of respiratory periods.

  • Chest respiration: same as nose respiration.

  • Peripheral skin temperature: mean temperature, difference between initial and final temperature values.
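
To make the ECG features concrete, here is a hedged MATLAB sketch of the standard time- and frequency-domain HRV computations, assuming ibi is a vector of interbeat intervals in seconds obtained from the detected R-waves. The 4-Hz resampling rate and the 0.04–0.15 Hz / 0.15–0.4 Hz band edges follow common practice [40] and are our assumptions, not necessarily the exact settings used.

    % Time-domain HRV features from interbeat intervals (ibi, in seconds).
    meanHR = 60 / mean(ibi);                      % mean heart rate in bpm
    sdnn   = std(ibi);                            % SD of interbeat intervals
    d      = diff(ibi);                           % consecutive differences
    rmssd  = sqrt(mean(d.^2));                    % RMS of consecutive differences
    pnn50  = 100 * sum(abs(d) > 0.05) / numel(d); % % of differences > 50 ms

    % Frequency-domain features: resample the interbeat series to a uniform
    % 4-Hz grid, then integrate the Welch spectrum over the LF and HF bands.
    t  = cumsum(ibi);                             % beat times
    tu = t(1):0.25:t(end);                        % uniform 4-Hz time grid
    xu = interp1(t, ibi, tu, 'pchip');
    [pxx, f] = pwelch(xu - mean(xu), [], [], [], 4);
    lfPow = trapz(f(f >= 0.04 & f < 0.15),  pxx(f >= 0.04 & f < 0.15));
    hfPow = trapz(f(f >= 0.15 & f <= 0.40), pxx(f >= 0.15 & f <= 0.40));
    lfhf  = lfPow / hfPow;                        % LF/HF ratio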

Synchrony features were calculated from filtered skin conductance and skin temperature signals as well as instantaneous HR, nose respiration rate and chest respiration rate signals (computed as a function of time from raw measurements of ECG and nose/chest respiration [25]). Instantaneous heart/respiration rates were used instead of raw ECG and respiration since they exhibit higher synchrony between participants. Synchrony features were the same for all 5 signals and were as follows (a brief computational sketch follows the list):

  • Dynamic time warping distance, computed using the approach of Muszynski et al. [21]. This dynamic-programming-based technique quantifies the similarity between two signals and is robust to temporal delays between events in individual signals [22].

  • Nonlinear interdependence, computed using the approach of Muszynski et al. [21]. This feature measures the geometric similarity of the state-space trajectories of two dynamical systems; time-delay embedding was used to reconstruct the trajectories, analogously to the shape distribution distance [21].

  • Coherence, computed using the same approach as our previous study [25]. This method determines whether two signals oscillate together in one or multiple frequency bands. Coherence was computed in different frequency ranges for different signals since the signals have different power spectra. For example, HR coherence was calculated in 0.05–0.15 Hz and 0.15–0.4 Hz bands while nose/chest respiration coherences were calculated in the 0–2 Hz band [25].

  • Cross-correlation, computed using the same approach as our previous study [25]. This is a very simple, non-robust measure computed as the Pearson correlation between the two participants’ signals [22].
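
As an illustration, the simpler of these synchrony features can be computed in a few lines of MATLAB. The sketch below assumes x and y are the two participants' preprocessed signals (e.g., instantaneous HR) sampled at fs Hz; it is not a substitute for the full implementation on Zenodo [41].

    % Hedged sketch of three synchrony features for one signal pair.
    xc = corr(x(:), y(:));                      % cross-correlation (Pearson r)
    dd = dtw(x, y);                             % dynamic time warping distance
    [cxy, f] = mscohere(x, y, [], [], [], fs);  % magnitude-squared coherence
    band    = f >= 0.05 & f <= 0.15;            % e.g., the lower HR band [25]
    cohBand = mean(cxy(band));                  % mean coherence in that band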

The MATLAB code used to calculate these synchrony features is available on Zenodo [41].

For each dyad and feature, the baseline feature value was subtracted from all 4 conversation interval values to obtain normalized values, which were then used for classification. This technique is broadly used to reduce inter-subject variability in single-user affective computing [15], [16]. We later also tried omitting this normalization step but achieved slightly worse results.
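
In code, this normalization is a single subtraction per interval; a minimal sketch, assuming F is a 1-by-4 cell array of per-interval feature matrices and B is the matching matrix of baseline feature values (variable names are ours):

    % Baseline normalization: subtract each baseline feature value from the
    % corresponding feature in every conversation interval.
    Fnorm = cell(1, 4);
    for k = 1:4
        Fnorm{k} = F{k} - B;   % F{k} and B are both dyads-by-features
    end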

2.5. Classification

2.5.1. Problem Definition

The overall goal of the study is to discriminate between 4 conversation scenarios based on physiological features. We used two approaches to identify these scenarios:

  • Direct 4-class classification: A 4-class classifier was used to directly classify features as one of 4 scenarios.

  • Two-stage classification: Features were first classified as two-sided or one-sided conversation with a binary classifier. In the second stage, data classified as two-sided were further classified as two-sided positive or two-sided negative, and data classified as one-sided were classified as the left or right person talking (see the sketch after this list).
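
A minimal sketch of the two-stage scheme with LDA in both stages follows; variable names are ours. ySided codes one-sided (1) vs. two-sided (2), while XtrainTwo/yPosNeg and XtrainOne/yLeftRight hold the training samples and labels of the true two-sided and one-sided scenarios, respectively.

    % Stage 1: one-sided (1) vs. two-sided (2) conversation.
    sided = classify(Xtest, Xtrain, ySided);
    pred  = zeros(size(sided));
    i2    = (sided == 2);
    % Stage 2a: samples judged two-sided -> positive (1) vs. negative (2).
    pred(i2) = classify(Xtest(i2, :), XtrainTwo, yPosNeg);
    % Stage 2b: samples judged one-sided -> left talks (3) vs. right talks (4),
    % with yLeftRight coded 1/2 and offset into the 4-class label space.
    pred(~i2) = 2 + classify(Xtest(~i2, :), XtrainOne, yLeftRight);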

2.5.2. Classifiers and Feature Selection Algorithms

In both 4-class and 2-stage classification, three supervised classifiers were used: k-nearest neighbors (kNN), linear discriminant analysis (LDA), and random forest. They were implemented using standard functions in MATLAB 2021b (Mathworks, USA): fitcknn (kNN), classify (LDA), and TreeBagger (random forest). The number of neighbors for kNN and the number of trees for the random forest were chosen for each classification problem using the hyperparameter optimization procedures that are part of the fitcknn and TreeBagger functions. For full disclosure: we also tried other classifiers (neural networks, diaglinear LDA, naïve Bayes, gentle adaptive boosting), but their accuracies were low and are thus not reported.

As the sample size is relatively low, feature selection was done prior to classification to avoid overfitting. It was done using two standard techniques in MATLAB 2021b: bidirectional stepwise feature selection (using the stepwisefit function with p_enter = 0.2) and the chi-square test for univariate feature ranking (using the fscchi2 function). These selection approaches preceded each classification step.
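
For concreteness, the two selection steps might look as follows in MATLAB. This is a sketch under our assumptions: the class labels are treated as a numeric stepwisefit response (one plausible reading of the setup), 'premove' is raised because MATLAB requires penter <= premove (the paper reports only p_enter = 0.2), and k, the number of chi-square-ranked features kept, is not stated in the paper.

    % Bidirectional stepwise selection; inmodel is a logical mask of
    % retained features.
    [~, ~, ~, inmodel] = stepwisefit(Xtrain, yTrain, ...
        'penter', 0.2, 'premove', 0.25, 'display', 'off');
    XtrainSel = Xtrain(:, inmodel);

    % Chi-square univariate ranking; keep the k top-ranked features.
    idx = fscchi2(Xtrain, yTrain);
    XtrainChi = Xtrain(:, idx(1:k));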

In 2-stage classification, the ‘best’ classifier in the first stage (one-sided vs. two-sided) may not be the ‘best’ one in the second stage. Thus, in 2-stage classification, we also tested different combinations of classifiers in the different stages (e.g., kNN followed by LDA).

2.5.3. Classifier Evaluation

Classifiers were evaluated using leave-dyad-out crossvalidation. The feature selection, hyperparameter optimization, and classification algorithms were trained using data from 15 dyads, then tested on the remaining dyad. This was repeated 16 times, with each dyad serving as the test dyad once. Accuracy was calculated as the mean accuracy of the trained classifiers on the 16 test dyads. This approach is common in single-user affective computing [15], [16] and was used in our prior dyadic work [25], [27].
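
Schematically, the evaluation loop looks like this: a sketch assuming X is a 64-by-nFeatures matrix (16 dyads × 4 normalized conversation samples), y holds the scenario labels 1–4, and dyadID gives each row's dyad index. It is shown with stepwise selection and LDA; the hedges noted for the selection sketch above apply here too.

    % Leave-dyad-out crossvalidation: train on 15 dyads, test on the 16th.
    nDyads = 16;
    acc = zeros(nDyads, 1);
    for d = 1:nDyads
        test  = (dyadID == d);
        train = ~test;
        % Feature selection (and any hyperparameter tuning) must use only
        % the training dyads, so no information leaks into the test dyad.
        [~, ~, ~, inmodel] = stepwisefit(X(train, :), y(train), ...
            'penter', 0.2, 'premove', 0.25, 'display', 'off');
        pred   = classify(X(test, inmodel), X(train, inmodel), y(train));
        acc(d) = mean(pred == y(test));
    end
    overallAccuracy = mean(acc);   % mean accuracy over the 16 test dyads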

As the primary analysis, we calculated the accuracy of 4-class classification and the overall accuracy of 2-stage classification (i.e., percentage of samples correctly classified at end of second stage) using all physiological features (individual and synchrony) as inputs to the classifiers. For comparison, the accuracy of 4-class classification was also calculated using SAM and IIQ features as inputs.

In addition, four secondary analyses were done:

  1. In the first analysis, classifiers were trained and tested using only individual features (no synchrony features) as inputs. In our previous study, removing synchrony features reduced accuracy [27]; this analysis allows us to evaluate whether the accuracy gained from synchrony justifies the additional work required to calculate it.

  2. In the second analysis, 8 features corresponding to both participants’ self-reported personality traits (social anxiety, cognitive empathy, affective empathy, depression) were added to physiological features as additional classification inputs. In our previous work, adding personality features increased accuracy [27].

  3. In the third analysis, the five most important features were identified for algorithms that yielded the highest accuracy in direct 4-class classification (highest accuracy with random forest, as seen later) and 2-stage classification (highest accuracy with LDA, as seen later). This was done using DeltaPredictor (for LDA) and PermutedPredictorDeltaError (for random forest) functions in MATLAB 2021b.

  4. In the final analysis, 4-class and 2-stage classification were re-run using individual and synchrony features from each single modality (only ECG, only skin conductance, only respiration, only skin temperature).

3. Results

A table containing the values of all physiological features in all five intervals is available on Zenodo [41]. Participants’ SAM and IIQ responses are shown in Table 1. The two-sided negative scenario resulted in lower self-reported valence, and the one-sided scenarios were rated as more unbalanced.

TABLE 1.

Self-Assessment Manikin (SAM) and Interpersonal Interaction Questionnaire (IIQ) results for all conversation scenarios, shown separately for the participant on the left (L) and right (R).

                      SAM valence            SAM arousal            SAM dominance          IIQ balance            IIQ valence
                      L          R           L          R           L          R           L          R           L          R
Two-sided positive    7.1 ± 0.9  6.9 ± 1.1   5.1 ± 1.4  4.2 ± 1.8   5.2 ± 0.5  5.5 ± 0.9   4.3 ± 0.9  4.1 ± 0.9   4.8 ± 0.4  4.6 ± 0.5
Two-sided negative    3.9 ± 1.9  4.3 ± 1.5   5.8 ± 1.3  5.4 ± 1.3   5.3 ± 0.9  6.0 ± 1.2   4.3 ± 0.9  4.1 ± 0.9   2.6 ± 0.9  2.5 ± 1.0
One-sided, L talks    6.5 ± 1.2  5.6 ± 1.8   5.3 ± 1.8  2.8 ± 1.5   8.5 ± 1.2  2.4 ± 1.2   1.1 ± 0.3  1.1 ± 0.3   4.1 ± 0.9  4.2 ± 1.0
One-sided, R talks    6.3 ± 1.3  6.1 ± 1.7   2.8 ± 1.6  5.6 ± 1.4   2.5 ± 1.5  7.9 ± 1.2   1.1 ± 0.5  1.0 ± 0.0   4.4 ± 0.8  3.9 ± 0.9

All results presented as mean ± standard deviation. SAM scales have a range of 1–9, with 1 indicating low valence/arousal/dominance. IIQ results correspond to items 3 (“How balanced was the conversation?”) and 6 (“How would you rate the overall conversation?”) and have a range of 1–5, with 1 indicating unbalanced or low-valence conversation.

3.1. Primary Analysis

Results of direct 4-class classification are given in Table 2 for all feature selection and classification methods; the highest accuracy was 64.1%. Results of 2-stage classification using stepwise feature selection are given in Table 3 for all classification methods; the highest accuracy was 75.0%. Chi-square selection consistently resulted in lower 2-stage accuracies and is thus not presented in detail.

TABLE 2.

Classification accuracies for direct 4-class classification in the primary analysis

Classifier                      Feature selection   Accuracy
K-nearest neighbors             Stepwise            56.3%
K-nearest neighbors             Chi-square          42.2%
Linear discriminant analysis    Stepwise            59.4%
Linear discriminant analysis    Chi-square          40.6%
Random forest                   Stepwise            43.7%
Random forest                   Chi-square          64.1%

TABLE 3.

Classification accuracies for 2-stage classification in the primary analysis using stepwise feature selection

Stage 1 classifier              Stage 2 classifier     Accuracy
K-nearest neighbors             K-nearest neighbors    51.6%
K-nearest neighbors             Linear discriminant    59.4%
K-nearest neighbors             Random forest          56.3%
Linear discriminant analysis    K-nearest neighbors    64.1%
Linear discriminant analysis    Linear discriminant    75.0%
Linear discriminant analysis    Random forest          57.8%
Random forest                   K-nearest neighbors    37.5%
Random forest                   Linear discriminant    50.0%
Random forest                   Random forest          46.9%

For comparison, the 4-class classification accuracy obtained with SAM and IIQ features as inputs to a random forest and chi-square feature selection was 96.9%.

3.2. Secondary Analysis

In the first secondary analysis, synchrony features were removed from the dataset. For this analysis, Table 4 shows results of direct 4-class classification for all methods while Table 5 shows results of 2-stage classification using stepwise feature selection (as in the primary analysis, chi-square selection consistently resulted in lower 2-stage accuracies). The highest overall accuracy was 65.6%.

TABLE 4.

Classification accuracies for direct four-class classification when synchrony features are omitted

Classifier                      Feature selection   Accuracy
K-nearest neighbors             Stepwise            46.9%
K-nearest neighbors             Chi-square          45.3%
Linear discriminant analysis    Stepwise            51.6%
Linear discriminant analysis    Chi-square          43.8%
Random forest                   Stepwise            37.5%
Random forest                   Chi-square          62.5%

TABLE 5.

Classification accuracies for 2-stage classification using stepwise feature selection when synchrony features are omitted

Stage 1 classifier              Stage 2 classifier     Accuracy
K-nearest neighbors             K-nearest neighbors    42.2%
K-nearest neighbors             Linear discriminant    51.6%
K-nearest neighbors             Random forest          45.3%
Linear discriminant analysis    K-nearest neighbors    65.6%
Linear discriminant analysis    Linear discriminant    64.1%
Linear discriminant analysis    Random forest          53.1%
Random forest                   K-nearest neighbors    23.4%
Random forest                   Linear discriminant    29.7%
Random forest                   Random forest          21.8%

In the second analysis, personality features were added to the dataset. All obtained accuracies were the same or worse than those in the primary analysis, and results of this analysis are thus not shown in detail.

In the third analysis, the most important features for direct 4-class classification, the first stage of 2-stage classification (one-sided vs. two-sided), and two-sided positive vs. negative classification in the second stage of 2-stage classification were determined. In direct 4-class classification, the most accurate classifier was the random forest; for 2-stage classification, it was LDA. The five most important features for these classifiers are listed in Table 6.

TABLE 6.

The five most important features for 4-class classification using random forest, one-sided vs. two-sided classification using linear discriminant analysis, and two-sided positive vs. negative classification using linear discriminant analysis

Rank   Direct four-class                 One-sided vs. two-sided              Two-sided positive vs. negative
1      Mean nose RR of P-L               Coherence for SC                     Nonlinearity for SC
2      HR low-freq. power of P-R         DTW for HR                           Coherence for nose respiration
3      St. dev. of SCR ampl. for P-R     Mean chest RR of P-R                 Number of SCRs for P-L
4      DTW for nose respiration          DTW for nose respiration             Coherence for skin temperature
5      DTW for HR                        Nonlinearity for chest respiration   Nonlinearity for HR

P-L = participant on left, P-R = participant on right, HR = heart rate, RR = respiration rate, SC = skin conductance, SCR = skin conductance response, DTW = dynamic time warping.

In the final analysis, classification was done using each single data modality. Though not presented in detail, the best overall accuracy was 50.0% using ECG, 39.1% using skin conductance, 43.8% using respiration, and 40.6% using skin temperature (all with the 2-stage approach).

4. Discussion

The highest overall classification accuracy, obtained using the 2-stage approach, was 75.0%. This is in line with studies of single-user affective computing, which commonly report 4-class accuracies of 50–80% [15]–[17]. We know of only one other dyadic 4-class classification study, which achieved an accuracy of 35% using only EEG [24]. Our previous competitive gaming study performed 3-class classification using the same signals and achieved accuracies between 44.3% and 60.5% depending on the outcome variable [25]. Thus, the current study appears to achieve higher accuracies than previous dyadic multiclass work. However, this result should be taken with a grain of salt since the classification problem in this study is likely easier than in previous work (which discriminated among, e.g., different workload [24] and enjoyment [25] levels).

Accuracy decreased from 75.0% to 65.6% when synchrony features were omitted. Though synchrony features are less standardized and harder to calculate than individual features, the additional work appears justified by the higher accuracy. Lower accuracies without synchrony were also seen in our regression work [27]. Additionally, using only a single data modality resulted in lower accuracy (best result: 50% with ECG alone) than using all four modalities, suggesting that using multiple sensors is worthwhile. Further analyses could be done to determine, e.g., whether any modality could be omitted entirely.

Adding personality features did not improve accuracy even though it was beneficial in our previous regression work [27]. We believe that this is due to two reasons. First, classification of 4 scenarios in this study is easier than engagement regression in our previous work [27], so personality features are less useful. Second, the scenarios in this study are relatively “rigid” (with chosen topics and behavior instructions) compared to unscripted, freeflowing conversation in our previous work [27], so personality has a smaller impact on participant behavior.

4.1. Potential Applications

Our study represents, to our knowledge, the first attempt to classify among multiple conversation scenarios using multiple physiological responses. As a possible practical application, the classification output could be combined with visualizations to enhance interpersonal communication. For example, an animated smiling or frowning face could be placed in front of a participant to indicate conversation valence, and the size of the face could be varied to indicate one-sided vs. two-sided conversation, similarly to others’ work [12], [42]. These visualizations could be provided to, e.g., novice teachers or mental health counselors to help them better gauge the conversation and modify their behavior to improve the session outcome.

Similarly, since physiological synchrony is correlated with cooperation quality [3]–[7], our approaches could be adapted to classify whether cooperation is effective and whether one participant is contributing more than the other, similar to our previous competition work [25]. This could then be combined with, e.g., algorithms that distribute workload among participants to enhance cooperation. Finally, instead of providing real-time feedback, classification results could be used to rate conversation or cooperation after it has occurred: e.g., to rate the effectiveness of teachers or mental health counselors or to identify cooperating dyads who work well together.

4.2. Limitations and Next Steps

Using self-report data rather than physiology as inputs to classifiers resulted in an accuracy of 96.9%. This does call the practical relevance of physiology-based classification into question: while previous studies suggest that physiological synchrony can provide information not visible from self-report data [6], participants in our study were clearly aware of the conversation state. Thus, is it worth using multiple physiological signals to infer information that participants seem to observe on their own?

The high accuracy of self-report data is not surprising since participants were instructed in advance to act out different scenarios and were thus aware of the expected conversation state. For real-world use, it would be necessary to differentiate among subtler scenarios that evolve spontaneously. We tried to examine such subtler spontaneous scenarios in our previous work [27], but experienced issues with methodology and interpretation. We believe that an appropriate next step would be to compare physiology and self-report data in a more realistic classification scenario – e.g., one with 3–5 valence levels, as done in single-user affective computing [15]–[17]. Such future studies could also involve populations that may have different physiological responses to conversation – e.g., autistic children and adults [43].

Furthermore, comparing classification accuracies of physiological and self-report data is not the only metric of usefulness. We could instead conduct closed-loop studies where users experience classification results in real time: e.g., by providing visualizations of client engagement to novice mental health counselors during the session [13]. In this case, even imperfect classification accuracies may allow users to obtain insight into the conversation and adapt their behavior. We previously conducted similar evaluations of classification algorithms in single-user [44] and competitive [25] affect-aware gaming. Such evaluations could even be done with Wizard of Oz techniques, avoiding the need for full sensor setups [45], [46].

Finally, classification could be improved by including other data: e.g., EEG, speech, and gestures. In addition to increasing classification accuracy, this may also overcome the inherent slowness of autonomic nervous system responses. While the signals in our paper are commonly analyzed over 2–5 minute periods [15]–[17], analysis of speech, EEG and gestures can be done on the scale of seconds and could allow identification of, e.g., rapid mood shifts due to very inappropriate statements.

5. Conclusions

We used dyadic autonomic nervous system responses to differentiate among four conversation scenarios with an accuracy of 75.0%, similar to 4-class accuracies seen in single-user affective computing. The “best” single modality (ECG) achieved an accuracy of 50.0%, indicating that the use of multiple modalities is worthwhile. Furthermore, removing synchrony features reduced accuracy to 65.6%, indicating that synchrony is informative. However, adding personality features did not improve accuracy.

To the best of our knowledge, this study represents the first use of multiclass classification in dyadic conversation scenarios with multiple physiological signal modalities. It shows the importance of using multiple signal modalities and including synchrony features, and allows the accuracies of different classification approaches to be compared. In the future, such classification algorithms may be used to, e.g., provide real-time feedback about conversation mood to participants, with applications in areas such as mental health counseling and education. They may also be adapted for collaborative scenarios and expanded with additional measurements such as speech and gesture analysis. However, classification should also be evaluated with scenarios that are harder to differentiate so that its practical usefulness can be better estimated.

Acknowledgment

This work was supported in part by a Faculty Grant-in-Aid program from the University of Wyoming and in part by the National Institute of Mental Health of the National Institutes of Health, grant number R03MH128633. V. D. Novak is the corresponding author.

Biographies

I. Chatterjee received his MS degree in computer science from the University of California Davis in 2020. He is currently working toward a PhD at the University of Cincinnati.

M. Goršič received her diploma in electrical engineering from the University of Ljubljana in 2012 and her PhD from the University of Wyoming in 2020. She is currently a postdoctoral fellow at the University of Cincinnati. Her research interests include wearable robotics, serious games, and biomechanics.

M. S. Hossain received his MS degree in electrical engineering from the University of Wyoming in 2021. He is currently working toward a PhD at the University of Cincinnati.

J. D. Clapp received his MS and PhD degrees in psychology from the University at Buffalo in 2008 and 2012. He is currently an associate professor in psychology at the University of Wyoming. His research focuses on assessment and treatment of trauma- and anxiety-related difficulties.

V. D. Novak received her PhD in electrical engineering from the University of Ljubljana in 2011. She was a postdoc at ETH Zurich (2012–2014) and an assistant/associate professor at the University of Wyoming (2014–2021). She is currently an associate professor in electrical engineering at the University of Cincinnati. Her research interests include affective computing, wearable and rehabilitation robotics, and serious games. She is a senior member of the IEEE.

Contributor Information

Iman Chatterjee, University of Cincinnati, Cincinnati, OH 45221.

Maja Goršič, University of Cincinnati, Cincinnati, OH 45221.

Mohammad S. Hossain, University of Cincinnati, Cincinnati, OH 45221.

Joshua D. Clapp, University of Wyoming, Laramie, WY.

Vesna D. Novak, University of Cincinnati, Cincinnati, OH 45221.

References

  • [1]. Wheatley T, Kang O, Parkinson C, and Looser CE, “From mind perception to mental connection: synchrony as a mechanism for social understanding,” Soc. Personal. Psychol. Compass, vol. 6, no. 8, pp. 589–606, 2012.
  • [2]. Delaherche E, Chetouani M, Mahdhaoui A, Saint-Georges C, Viaux S, and Cohen D, “Interpersonal synchrony: a survey of evaluation methods across disciplines,” IEEE Trans. Affect. Comput., vol. 3, no. 3, pp. 349–365, 2012.
  • [3]. Ahonen L, Cowley B, Torniainen J, Ukkonen A, Vihavainen A, and Puolamäki K, “Cognitive collaboration found in cardiac physiology: study in classroom environment,” PLoS One, vol. 11, no. 7, p. e0159178, 2016.
  • [4]. Vanutelli ME, Gatti L, Angioletti L, and Balconi M, “Affective synchrony and autonomic coupling during cooperation: a hyperscanning study,” Biomed Res. Int., vol. 2017, p. 3104564, 2017.
  • [5]. Park J, Shin J, and Jeong J, “Inter-brain synchrony levels according to task execution modes and difficulty levels: an fNIRS/GSR study,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 30, pp. 194–204, 2022.
  • [6]. Reinero DA, Dikker S, and Van Bavel JJ, “Inter-brain synchrony in teams predicts collective performance,” Soc. Cogn. Affect. Neurosci., vol. 16, pp. 43–57, 2021.
  • [7]. Szymanski C et al., “Teams on the same wavelength perform better: inter-brain phase synchronization constitutes a neural substrate for social facilitation,” Neuroimage, vol. 152, pp. 425–436, 2017.
  • [8]. Bar-Kalifa E, Prinz JN, Atzil-Slonim D, Rubel JA, Lutz W, and Rafaeli E, “Physiological synchrony and therapeutic alliance in an imagery-based treatment,” J. Couns. Psychol., vol. 66, no. 4, pp. 508–517, 2019.
  • [9]. Tschacher W and Meier D, “Physiological synchrony in psychotherapy sessions,” Psychother. Res., vol. 30, no. 5, pp. 558–573, 2020.
  • [10]. Zheng L et al., “Affiliative bonding between teachers and students through interpersonal synchronisation in brain activity,” Soc. Cogn. Affect. Neurosci., vol. 15, pp. 97–109, 2020.
  • [11]. Sun B, Xiao W, Feng X, Shao Y, Zhang W, and Li W, “Behavioral and brain synchronization differences between expert and novice teachers when collaborating with students,” Brain Cogn., vol. 139, p. 105513, 2020.
  • [12]. Chen P et al., “Hybrid Harmony: a multi-person neurofeedback application for interpersonal synchrony,” Front. Neuroergonomics, vol. 2, p. 687108, 2021.
  • [13]. Saul MA, He X, Black S, and Charles F, “A two-person neuroscience approach for social anxiety: a paradigm with interbrain synchrony and neurofeedback,” Front. Psychol., vol. 12, p. 568921, 2022.
  • [14]. Järvenoja H et al., “A collaborative learning design for promoting and analyzing adaptive motivation and emotion regulation in the science classroom,” Front. Educ., vol. 5, p. 111, 2020.
  • [15]. Novak D, Mihelj M, and Munih M, “A survey of methods for data fusion and system adaptation using autonomic nervous system responses in physiological computing,” Interact. Comput., vol. 24, pp. 154–172, May 2012.
  • [16]. Aranha RV, Correa CG, and Nunes FLS, “Adapting software with affective computing: a systematic review,” IEEE Trans. Affect. Comput., vol. 12, no. 4, pp. 883–899, 2021.
  • [17]. D’Mello S, Kappas A, and Gratch J, “The affective computing approach to affect measurement,” Emot. Rev., vol. 10, no. 2, pp. 174–183, 2018.
  • [18]. Konvalinka I, Bauer M, Stahlhut C, Hansen LK, Roepstorff A, and Frith CD, “Frontal alpha oscillations distinguish leaders from followers: multivariate decoding of mutually interacting brains,” Neuroimage, vol. 94, pp. 79–88, 2014.
  • [19]. Pan Y, Dikker S, Goldstein P, Zhu Y, Yang C, and Hu Y, “Instructor-learner brain coupling discriminates between instructional approaches and predicts learning,” Neuroimage, vol. 211, p. 116657, 2020.
  • [20]. Brouwer AM, Stuldreher IV, and Thammasan N, “Shared attention reflected in EEG, electrodermal activity and heart rate,” in CEUR Workshop Proceedings, 2019 Socio-Affective Technologies: An Interdisciplinary Approach, 2019, pp. 27–31.
  • [21]. Muszynski M, Kostoulas T, Lombardo P, Pun T, and Chanel G, “Aesthetic highlight detection in movies based on synchronization of spectators’ reactions,” ACM Trans. Multimed. Comput. Commun. Appl., vol. 14, no. 3, pp. 1–23, 2018.
  • [22]. Hernandez J, Riobo I, Rozga A, Abowd GD, and Picard RW, “Using electrodermal activity to recognize ease of engagement in children during social interactions,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing - UbiComp ’14, 2014.
  • [23]. Simar C et al., “Hyperscanning EEG and classification based on Riemannian geometry for festive and violent mental state discrimination,” Front. Neurosci., vol. 14, p. 588357, 2020.
  • [24]. Verdiere KJ, Dehais F, and Roy RN, “Spectral EEG-based classification for operator dyads’ workload and cooperation level estimation,” in Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics, 2019.
  • [25]. Darzi A and Novak D, “Automated affect classification and task difficulty adaptation in a competitive scenario based on physiological linkage: an exploratory study,” Int. J. Hum. Comput. Stud., vol. 153, p. 102673, 2021.
  • [26]. Ding Y, Hu X, Xia Z, Liu YJ, and Zhang D, “Inter-brain EEG feature extraction and analysis for continuous implicit emotion tagging during video watching,” IEEE Trans. Affect. Comput., vol. 12, no. 1, pp. 92–102, 2021.
  • [27]. Chatterjee I, Goršič M, Clapp JD, and Novak D, “Automatic estimation of interpersonal engagement during naturalistic conversation using dyadic physiological measurements,” Front. Neurosci., vol. 15, p. 757381, 2021.
  • [28]. Nam CS, Choo S, Huang J, and Park J, “Brain-to-brain neural synchrony during social interactions: a systematic review on hyperscanning studies,” Appl. Sci., vol. 10, no. 19, p. 6669, 2020.
  • [29]. Picard RW, Vyzas E, and Healey J, “Toward machine emotional intelligence: analysis of affective physiological state,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175–1191, 2001.
  • [30]. Bradley MM and Lang PJ, “Measuring emotion: the self-assessment manikin and the semantic differential,” J. Behav. Ther. Exp. Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
  • [31]. Goršič M, Clapp JD, Darzi A, and Novak D, “A brief measure of interpersonal interaction for 2-player serious games: questionnaire validation,” JMIR Serious Games, vol. 7, no. 3, p. e12788, 2019.
  • [32]. Bizzego A et al., “Strangers, friends, and lovers show different physiological synchrony in different emotional states,” Behav. Sci. (Basel), vol. 10, p. 11, 2020.
  • [33]. Sachs ME, Habibi A, Damasio A, and Kaplan JT, “Dynamic intersubject neural synchronization reflects affective responses to sad music,” Neuroimage, vol. 218, p. 116512, 2020.
  • [34]. Steiger BK, Kegel LC, Spirig E, and Jokeit H, “Dynamics and diversity of heart rate responses to a disaster motion picture,” Int. J. Psychophysiol., vol. 143, pp. 64–79, 2019.
  • [35]. McKillop HN and Connell AM, “Physiological linkage and affective dynamics in dyadic interactions between adolescents and their mothers,” Dev. Psychobiol., vol. 60, pp. 582–594, 2018.
  • [36]. Reniers RLEP, Corcoran R, Drake R, Shryane NM, and Völlm BA, “The QCAE: a questionnaire of cognitive and affective empathy,” J. Pers. Assess., vol. 93, pp. 84–95, 2011.
  • [37]. Leary MR, “A brief version of the Fear of Negative Evaluation Scale,” Personal. Soc. Psychol. Bull., vol. 9, pp. 371–375, 1983.
  • [38]. Radloff LS, “The CES-D scale: a self-report depression scale for research in the general population,” Appl. Psychol. Meas., vol. 1, pp. 385–401, 1977.
  • [39]. Boucsein W, Electrodermal Activity, 2nd ed. Springer, 2012.
  • [40]. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, “Heart rate variability: standards of measurement, physiological interpretation, and clinical use,” Eur. Heart J., vol. 17, no. 3, pp. 354–381, 1996.
  • [41]. Chatterjee I, Goršič M, Hossain MS, Clapp JD, and Novak VD, “Automated classification of dyadic conversation scenarios using autonomic nervous system responses,” Zenodo, 2022. https://zenodo.org/record/7140829
  • [42]. Salminen M et al., “Evoking physiological synchrony and empathy using social VR with biofeedback,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 746–755, 2022.
  • [43]. Dunsmore JC et al., “Marching to the beat of your own drum?: A proof-of-concept study assessing physiological linkage in Autism Spectrum Disorder,” Biol. Psychol., vol. 144, pp. 37–45, 2019.
  • [44]. Darzi A, McCrea S, and Novak D, “User experience comparison between five dynamic difficulty adjustment methods for an affective computer game,” JMIR Serious Games, vol. 9, no. 2, p. e25771, 2021.
  • [45]. McCrea SM, Geršak G, and Novak D, “Absolute and relative user perception of classification accuracy in an affective videogame,” Interact. Comput., vol. 29, no. 2, pp. 271–286, 2017.
  • [46]. Novak D, Nagle A, and Riener R, “Linking recognition accuracy and user experience in an affective feedback loop,” IEEE Trans. Affect. Comput., vol. 5, no. 2, pp. 168–172, 2014.
