Frontiers in Robotics and AI
. 2020 Dec 21;7:532279. doi: 10.3389/frobt.2020.532279

Emotion Recognition for Human-Robot Interaction: Recent Advances and Future Perspectives

Matteo Spezialetti 1,2, Giuseppe Placidi 3, Silvia Rossi 1,*
PMCID: PMC7806093  PMID: 33501307

Abstract

A fascinating challenge in the field of human–robot interaction is the possibility to endow robots with emotional intelligence in order to make the interaction more intuitive, genuine, and natural. To achieve this, a critical point is the capability of the robot to infer and interpret human emotions. Emotion recognition has been widely explored in the broader fields of human–machine interaction and affective computing. Here, we report recent advances in emotion recognition, with particular regard to the human–robot interaction context. Our aim is to review the state of the art of currently adopted emotional models, interaction modalities, and classification strategies and offer our point of view on future developments and critical issues. We focus on facial expressions, body poses and kinematics, voice, brain activity, and peripheral physiological responses, also providing a list of available datasets containing data from these modalities.

Keywords: emotion recognition (ER), human-robot interaction, affective computing, machine learning, multimodal data

1. Introduction

Emotions are fundamental aspects of human beings and affect decisions and actions. They play an important role in communication, and emotional intelligence, i.e., the ability to understand, use, and manage emotions (Salovey and Mayer, 1990), is crucial for successful interactions. Affective computing aims to endow machines with emotional intelligence (Picard, 1999) to improve natural human-machine interaction (HMI). In the context of human-robot interaction (HRI), it is hoped that robots can be endowed with human-like capabilities of observation, interpretation, and emotion expression. Emotions have been considered from three main points of view:

  • Formalization of the robot's own emotional state: the inclusion of emotional traits into agents and robots can improve their effectiveness and adaptiveness and enhance their believability (Hudlicka, 2011). Therefore, robot design in recent years has focused on modeling emotions by defining neurocomputational models, formalizing them in existing cognitive architectures, adapting known cognitive models, or defining specialized affective architectures (Cañamero, 2005, 2019; Krasne et al., 2011; Navarro-Guerrero et al., 2012; Reisenzein et al., 2013; Tanevska et al., 2017; Sánchez-López and Cerezo, 2019);

  • Emotional expression of robots: in complex interaction scenarios, such as assistive, educational, and social robotics (Fong et al., 2003; Rossi et al., 2020), the ability of robots to exhibit recognizable emotional expressions strongly impacts the resulting social interaction (Mavridis, 2015). Several studies focused on exploring which modalities (e.g., face expression, body posture, movement, voice) can convey emotional information from robots to humans and how people perceive and recognize emotional states (Tsiourti et al., 2017; Marmpena et al., 2018; Rossi and Ruocco, 2019);

  • Ability of robots to infer the human emotional state: robots able to infer and interpret human emotions would be more effective in interacting with people. Recent works aim to design algorithms for classifying emotional states from different input modalities, such as facial expression, body language, voice, and physiological signals (McColl et al., 2016; Cavallo et al., 2018).

In the following, we focus on the third aspect, reporting recent advances in emotion recognition (ER), particularly in the HRI context. ER is a challenging task, especially when performed in actual HRI, where the scenario can differ greatly from the controlled environments in which most recognition experiments are performed. Moreover, the very presence of the robot introduces a bias, since its presence, embodiment, and behavior can affect empathy (Kwak et al., 2013), elicit emotions (Guo et al., 2019; Saunderson and Nejat, 2019; Shao et al., 2020), and impact the user experience (Cameron et al., 2015). For these reasons, we limit our study to articles that perform ER in actual HRI with physical robots. Our aim is to summarize the state of the art and existing resources for the design of emotion-aware robots, discuss the characteristics that are desirable in HRI, and offer a perspective on future developments.

The literature search was carried out by querying the Google Scholar, Scopus, and WebOfScience databases with basic keywords from the HRI and ER domains, limiting the search to recent years (≥2015). The submitted queries were as follows:

  • Google Scholar: allintitle: “human–robot interaction”| “human robot interaction” |hri emotion|affective, resulting in 80 documents.

  • Scopus: TITLE ((“human–robot interaction” OR “human robot interaction” OR hri) AND (emotion* OR affective)) OR KEY ((“human–robot interaction” OR “human robot interaction” OR hri) AND (emotion* OR affective)), resulting in 629 documents.

  • WebOfScience: TI=((“human robot interaction” OR “human–robot interaction” or hri) AND (emotion* or affective)) or AK=((“human robot interaction” or “human–robot interaction” OR hri) AND (emotion* or affective)), resulting in 201 documents.

By using a simple script based on the edit distance between article titles, we looked for items repeated across the search engine results. Note that 425 papers were found only on Scopus, 22 only on Google Scholar, 11 only on WebOfScience, and 38 documents were returned by all the search engines. Among the remaining articles, 150 appeared on both Scopus and WebOfScience, 16 on Scopus and Google Scholar, and 1 on WebOfScience and Google Scholar. We merged the results into a single list of 664 items. Since we used loose selection queries, the resulting articles were highly heterogeneous and most fell outside the scope of this review. We therefore selected published articles that addressed ER in HRI, reported significant results with respect to the recent literature, and that (a) performed emotion recognition in an actual HRI (i.e., at least a physical robot and a subject were included in the testing phase) and reported results; (b) focused on modalities that can be acquired, during HRI, by using either the robot's embedded sensors or external devices: facial expression, body pose and kinematics, voice, brain activity, and peripheral physiological responses; and (c) relied on either discrete or dimensional models of emotions (see section 2.1). This phase yielded 14 articles. During the process, we also examined the references of the selected papers in order to find other works fitting our inclusion criteria; in this way, 3 articles were added to this review. Finally, we organized the resulting articles by modality and emotional model.
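As an illustration of the deduplication step, the following minimal sketch computes a classic dynamic-programming edit (Levenshtein) distance and flags two titles as the same article when their normalized distance falls below a threshold; the normalization and the threshold value are our own assumptions, not details from the screening script.

```python
# Illustrative sketch of title deduplication via edit distance.
# The 0.1 threshold and length normalization are assumptions.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_duplicate(title_a: str, title_b: str, threshold: float = 0.1) -> bool:
    """Treat two titles as the same article if their edit distance,
    normalized by the longer title's length, is below the threshold."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    return edit_distance(a, b) / max(len(a), len(b), 1) <= threshold
```

In practice, comparing lowercased, whitespace-stripped titles this way catches near-identical entries that differ only in punctuation or casing across search engines.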

2. State of the Art

2.1. Emotional Models

A key concept in ER is the model used to represent emotions, since it affects the formalization of the problem and the definition and separation of classes. Several models have been proposed for describing emotions, as reported in Table 1. The main distinction is between categorical models, in which emotions consist of discrete entities associated with labels, and dimensional models, in which emotions are defined by continuous values of their describing features, usually represented on axes.

Table 1. Emotional models.

Model | Type | Description
Ekman (Ekman, 1992) | Categorical | Six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) characterized by distinctive universal signals and physiology
Tomkins (Tomkins, 2008) | Categorical | Seven affects organized in low/high intensity couples (interest-excitement, enjoyment-joy, surprise-startle, distress-anguish, anger-rage, fear-terror, shame-humiliation), plus disgust and dismell
Valence/Arousal (VA) (Russell, 1980) | Dimensional | Emotions represented over a two-dimensional circular space whose axes describe valence and arousal
Pleasure/Arousal/Dominance (PAD) (Mehrabian, 1996) | Dimensional | Emotions described with three dimensions: pleasure, arousal, and dominance
3D hypercube (Trnka et al., 2016) | Dimensional | Emotions described by the dimensions of valence, intensity, controllability, and utility
Plutchik wheel of emotions (Plutchik and Kellerman, 2013) | Hybrid | Eight primary emotions differentiated by levels of intensity

For each entry, the type of model and a brief description are reported.

It is hard to say which model best represents emotions; the debate is still open, is strictly related to the nature of emotions, and lacks consensus (Lench et al., 2011; Lindquist et al., 2013). Some scholars claim that emotions are discrete natural kinds associated with categorically distinct patterns of activation at the level of the autonomic nervous system (Kragel and LaBar, 2016; Saarimäki et al., 2018). Others point out intra-emotion differences and the overlap of different emotions with respect to observed behavior and autonomic activity (Siegel et al., 2018). Since the purpose of this work is not to support one thesis or the other, we step back from this discussion and focus on the usability of the models. From the point of view of the recognition process, identifying distinct emotions is, ideally, the simplest method. Unfortunately, as the number of emotions considered increases, it becomes more difficult to distinguish between classes. On the other hand, despite the usefulness of obtaining information about features of emotions (e.g., valence and arousal), a small number of dimensions could lead to over-simplification and to an "overlap" of different emotions that share similar feature values (Liberati et al., 2015). For this reason, the choice of informative and possibly uncorrelated dimensions is critical (Trnka et al., 2016). Datasets are often annotated with both categorical and dimensional models, but those employing discrete models rarely use the same labels: annotated emotions differ both in number and in name. Conversely, several dimensionally annotated datasets share a common valence-arousal (VA) representation (Russell, 1980), which allows data from different datasets to be compared and merged, in the worst case by ignoring additional axes. When annotating a dataset, a good practice is therefore to provide at least VA labels.
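To make the last point concrete, the following hypothetical sketch shows how samples from differently annotated datasets could be merged into a common VA representation; the label-to-coordinate mapping below is a rough, circumplex-inspired placement chosen for illustration, not a published mapping.

```python
# Hypothetical projection of discrete emotion labels onto the
# valence-arousal (VA) plane, so that discretely and dimensionally
# annotated samples can be pooled. Coordinates are illustrative only.

VA_MAP = {
    "happiness": ( 0.8,  0.5),
    "surprise":  ( 0.4,  0.8),
    "anger":     (-0.6,  0.7),
    "fear":      (-0.7,  0.6),
    "disgust":   (-0.7,  0.2),
    "sadness":   (-0.7, -0.4),
    "neutral":   ( 0.0,  0.0),
}

def to_va(sample: dict) -> dict:
    """Return a copy of a sample annotated with (valence, arousal),
    whether it came from a discrete or a dimensional dataset."""
    if "label" in sample:                      # discrete annotation
        v, a = VA_MAP[sample["label"]]
    else:                                      # dimensional annotation
        v, a = sample["valence"], sample["arousal"]
    return {**sample, "valence": v, "arousal": a}
```

With such a projection, a VA-annotated dataset and a discretely labeled one can be concatenated into a single training set at the cost of the coarseness of the mapping.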

Table A1 in Supplementary Material reports a non-comprehensive list of datasets that can be used to train and test ER approaches.

2.2. Facial Expressions

A natural way to observe emotions is the analysis of facial expressions (Ko, 2018). Conventional facial emotion recognition (FER) systems aim to detect the face region in images and to compute geometric and appearance features, which are used to train machine learning (ML) algorithms (Kumar and Gupta, 2015). Geometric features are obtained by identifying facial landmarks and by computing their reciprocal positions and action units (AUs) (Ghimire and Lee, 2013; Suk and Prabhakaran, 2014; Álvarez et al., 2018), while appearance-based features are based on texture information (Turan and Lam, 2018).

In recent years, deep learning (DL) approaches have emerged. DL aims to develop end-to-end systems that reduce the dependency on hand-crafted features, pre-processing, and feature extraction techniques (Ghayoumi, 2017). Notably, convolutional neural networks (CNNs) have proven particularly efficient at this task (Mollahosseini et al., 2017; Zhang, 2017; Refat and Azlan, 2019).

When dealing with video clips, the temporal component of the data can also be exploited. In traditional FER, this is usually accomplished by including in the feature vector information about landmark displacement between frames (Ghimire and Lee, 2013). In DL approaches, temporal information is handled by means of specific architectures and layers, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) (Ebrahimi Kahou et al., 2015).

In the context of HRI, FER has been performed through conventional and DL approaches.

2.2.1. Discrete Models

In Faria et al. (2017), an emotion classification approach based on simple geometrical features was proposed. Given the positions of 68 facial landmarks, their Euclidean distances and the angles of 91 triangles formed by landmarks were considered. A probabilistic ensemble of classifiers, namely a dynamic Bayesian mixture model (DBMM), was employed using linear regression (LR), a support vector machine (SVM), and a random forest (RF), combined in a probabilistic weighting strategy. The proposed approach was tested both on the Karolinska Directed Emotional Faces (KDEF) dataset (Lundqvist et al., 1998) and during HRI for recognizing 7 discrete emotions. In particular, for the HRI test, the humanoid robot NAO was programmed to react to the recognized emotion. The overall accuracy on the KDEF dataset was 85%, while in actual HRI it was 80.6%. Note that the test performed on KDEF was limited to images with frontal or ±45° orientation of the faces.
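As a rough illustration of this family of geometric features, the sketch below computes pairwise Euclidean distances between landmarks and the inner angles of a landmark triangle; the landmark coordinates and triangle choices are placeholders, not the configuration used by the authors.

```python
import numpy as np

# Illustrative geometric features in the spirit of Faria et al. (2017):
# pairwise distances between facial landmarks and the inner angles of
# triangles formed by landmark triplets. Landmark layout is a placeholder.

def pairwise_distances(landmarks: np.ndarray) -> np.ndarray:
    """All pairwise Euclidean distances between (N, 2) landmark points."""
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(landmarks), k=1)
    return d[iu]                       # upper triangle: one value per pair

def triangle_angles(p0, p1, p2):
    """Inner angles (radians) of the triangle p0-p1-p2."""
    def angle(a, b, c):                # angle at vertex a
        u, v = b - a, c - a
        cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cosang, -1.0, 1.0))
    return (angle(p0, p1, p2), angle(p1, p2, p0), angle(p2, p0, p1))
```

Concatenating such distances and angles over all landmark pairs and triangles yields the kind of feature vector that can be fed to an ensemble classifier.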

In Chen et al. (2017), the authors proposed a method for real-time dynamic emotion recognition based on facial expression and emotional intention understanding. In particular, emotion recognition was performed by using a dynamic model of AUs (Candide3-based dynamic feature point matching) and an algorithm implementing fuzzy rules for each of the 7 basic emotions considered. Experiments were conducted on 30 volunteers experiencing the scenario of drinking at a bar. One of the 2 employed mobile robots was used for emotion recognition, achieving 80.48% accuracy.

Candide3-based features were also adopted in Chen et al. (2019). Here, the authors proposed an adaptive feature selection strategy based on the plus-L minus-R selection (LRS) algorithm in order to reduce the model dimensionality. Classification was performed with a set of k-nearest neighbors (kNN) classifiers, integrated into an AdaBoost framework with direct optimization. The proposed approach was tested both on the JAFFE dataset (Lyons et al., 1998) and on data acquired by a mobile robot equipped with a Kinect. In the latter experiment, the proposed method achieved an average accuracy of 81.42% in the classification of 7 discrete emotions.

In Liu et al. (2019), FER was performed by combining local binary patterns (LBP) and the 2D Gabor wavelet transform for feature extraction and by training an extreme learning machine (ELM) for the classification of basic emotions. Experiments were conducted both on public datasets [JAFFE (Lyons et al., 1998), CK+ (Lucey et al., 2010)] and during actual HRI as part of a multimodal system setup (Liu et al., 2016). In the latter case, the method was able to distinguish between 7 emotions with an overall accuracy of 81.9%.

Histograms of oriented gradients (HOG) and LBP were used as feature descriptors in Reyes et al. (2020), where an SVM was trained to classify 7 discrete emotions. The system was initially fed with images from the extended Cohn-Kanade (CK+) dataset, but was fine-tuned by adding batches of local images (i.e., facial images of participants acquired during the test) of different sizes. Classification of data acquired during the interaction with the robot NAO achieved 87.7% accuracy.

The reported results suggest that FER systems designed for HRI can be built on a variety of features and decision strategies. However, HMI and HRI differ. In HMI scenarios, FER is generally easier: the position of the face with respect to the camera is more constrained, the user is close to the camera, and the environmental conditions do not change abruptly. Due to these differences, it is preferable to train FER systems on in-the-wild data or on data acquired during real interaction with robots. Moreover, it could be useful to endow robots with the capability of recognizing emotions not just from facial information, but also from contextual and environmental data (Lee et al., 2019). Future developments of FER will probably also depend on emerging technologies: the last decades have seen the rapid development of relatively cheap depth cameras (RGB-D sensors) and thermal cameras. The work by Corneanu et al. (2016) offers a comprehensive taxonomy of FER approaches based on RGB, 3D, and thermal sensors. For example, 3D information can improve face detection, landmark localization, and AU computation (Mao et al., 2015; Szwoch and Pieniażek, 2015; Zhang et al., 2015; Patil and Bailke, 2016).

2.3. Thermal Facial Images

Changes in the affective state produce a redistribution of blood in the vessels, due to vasodilatation/vasoconstriction and emotional sweating phenomena. Infra-red thermal cameras can detect these changes, since they cause variations in skin temperature (Ioannou et al., 2014). Therefore, thermal images can be used to perform ER (Liu and Wang, 2011; Wang et al., 2014). Usually, this is done by considering temperature variations in specific regions of interest (ROIs), e.g., the tip of the nose, forehead, orbicularis oculi, and cheeks.

2.3.1. Discrete Models

In Boccanfuso et al. (2016), 10 subjects played a trivia game with a MyKeepon robot behaving so as to induce happiness and anger, and watched emotional video clips selected to elicit the same emotions. RGB and thermal facial images were acquired, together with the galvanic skin response (GSR) signal. In particular, the thermal trends of 5 ROIs were analyzed by combining principal component analysis (PCA) and logistic regression, achieving a prediction success of 100%.
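A minimal sketch of this kind of pipeline, with synthetic data standing in for the real ROI temperature trends, might look as follows; the feature layout, class means, and PCA dimensionality are all illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative PCA + logistic-regression pipeline in the spirit of
# Boccanfuso et al. (2016). The synthetic "ROI temperature trends"
# below are random data, not measurements.

rng = np.random.default_rng(0)
n, n_features = 40, 5 * 10           # 5 ROIs x 10 time samples per trial
X_happy = rng.normal(0.3, 0.1, (n, n_features))    # warmer trend (synthetic)
X_angry = rng.normal(-0.3, 0.1, (n, n_features))   # cooler trend (synthetic)
X = np.vstack([X_happy, X_angry])
y = np.array([1] * n + [0] * n)      # 1 = happiness, 0 = anger

clf = make_pipeline(PCA(n_components=3), LogisticRegression())
clf.fit(X, y)
train_acc = clf.score(X, y)
```

On such cleanly separated synthetic classes the pipeline fits trivially; the point is only to show how dimensionality reduction and a linear classifier compose into one estimator.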

In Goulart et al. (2019b), a system for the classification of 5 emotions during the interaction between children and a robot was proposed. A mobile robot (N-MARIA) was equipped with RGB and thermal cameras, used respectively to locate the face and to acquire thermal information from 11 ROIs. Statistical features were computed for each ROI, and multiple combinations of feature reduction techniques and classification algorithms were tested over a database of 28 typically developing children (Goulart et al., 2019a). A PCA+linear discriminant analysis (LDA) system, trained and tested on the database (accuracy 85%), was used to infer the emotional responses of children during the interaction with N-MARIA, with results consistent with self-reported emotions.

Performing ER from thermal images in actual HRI remains a challenge due to constraints, beyond those affecting ordinary FER, that do not adapt to all real-life scenarios, above all the need to maintain a stable environmental temperature. However, recent results show that thermal images have the potential to facilitate HRI (Filippini et al., 2020).

2.4. Body Pose and Kinematics

Like facial expressions, body posture, movements, and gestures are natural and intuitive cues for inferring the affective state of a person. Emotional body gesture recognition (EBGR) has been widely explored (Noroozi et al., 2018). In order to take advantage of the information conveyed by static or dynamic cues, an EBGR system has to model the body position from input signals, usually RGB data, depth maps, or their combination. The first step of the canonical recognition pipeline is the detection of human bodies: the literature offers several approaches to this problem (Ioffe and Forsyth, 2001; Viola and Jones, 2001; Viola et al., 2005; Wang and Lien, 2007; Nguyen et al., 2016). Then, the pose of the body has to be estimated by fitting an a priori defined model, typically a skeleton, over the body region. This task can be performed either by solving an inverse kinematics problem (Barron and Kakadiaris, 2000) or by using DL, if a large amount of skeletonized data is available (Toshev and Szegedy, 2014). Features can include absolute or reciprocal positions and orientations of limbs, as well as movement information such as speed or acceleration (Glowinski et al., 2011; Saha et al., 2014). Classification can be performed either by traditional ML or by deep learning (Savva et al., 2012; Saha et al., 2014).
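The static and dynamic features mentioned above can be sketched as follows; the skeleton layout, root joint, and frame rate are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative EBGR features: joint positions relative to a root joint
# (static cue) and frame-to-frame joint velocities (dynamic cue).
# Skeleton layout and frame rate are assumptions.

def pose_features(skeleton: np.ndarray, root: int = 0) -> np.ndarray:
    """Joint positions (J, 3) expressed relative to the root joint, flattened."""
    return (skeleton - skeleton[root]).ravel()

def motion_features(frames: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Per-joint velocities, shape (T-1, J, 3), from a (T, J, 3) sequence."""
    return np.diff(frames, axis=0) * fps
```

Stacking pose features per frame with velocities (and, if desired, second differences for acceleration) gives the feature vectors typically passed to the classifiers cited above.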

2.4.1. Discrete Models

An interesting approach was proposed in Elfaramawy et al. (2017), where a neural network architecture was designed for classifying 6 emotions from body motion patterns. In particular, classification was performed by grow-when-required (GWR) networks, self-organizing architectures that grow nodes whenever the network does not sufficiently match the input. Two GWR networks learned samples of pose and motion and, subsequently, a recurrent variant of GWR, namely gamma-GWR, was used to take the temporal context into account. The dataset, which included 19 subjects, was collected by extending a NAO robot with a depth camera (Asus Xtion) and using a second NAO located on the side to allow acquisition from two points of view. Subjects performed body motions related to the emotions, elicited by descriptions of inspiring scenarios. Pose features (positions of joints) and motion features (the difference in pose features between consecutive frames) were considered. The system achieved an overall accuracy of 88.8%.

2.4.2. Dimensional Models

In Sun et al. (2019), the authors proposed local joint transformations for describing body poses, and a two-layered LSTM for estimating the emotional intensity of discrete emotions. The authors tested the system on the Emotional Body Motion Database (Volkova et al., 2014), considering the intensity as the percentage of correctly perceived segments for each scenario. The Pearson correlation coefficient (PCC) between the ground truth and the estimated intensity was 0.81. In real HRI experiments with a Pepper robot, the system enabled the robot to sense subjects' emotional intensities effectively.

Summarizing these results, we can say that body poses and movements convey excellent emotional cues and that EBGR systems can be successfully employed in HRI scenarios. Moreover, since FER and EBGR rely on the same sensors (RGB and depth cameras), the two modalities can be exploited together.

2.5. Brain Activity

Inferring the emotional state from brain activity is a challenging and fascinating possibility, since access to cerebral activity would bypass any filter, voluntary or not, that could interfere with ER (Kappas, 2010). Several measurement systems can be used to acquire brain activity. Among them, electroencephalography (EEG) is characterized by high temporal resolution and is portable, easy to use, and inexpensive. Moreover, it has proven suitable for brain monitoring in HMI applications (Lahane et al., 2019). Consumer-grade devices, although not accurate enough for neuroscience research and critical control tasks, have been reported to be a feasible choice for applications such as affective computing (Duvinage et al., 2013; Nijboer et al., 2015; Maskeliunas et al., 2016).

ER by EEG has been widely explored in the literature (Alarcao and Fonseca, 2017; Spezialetti et al., 2018). The most commonly used features can be roughly classified by the domain from which they are extracted (time, frequency, time-frequency). Time domain features include statistical values, Hjorth parameters, fractal dimension (FD), and high order crossing (HOC) (Jenke et al., 2014). Frequency analyses of EEG are very common, partly because of the known association between EEG frequency bands and specific mental tasks. The most intuitive and widely used frequency feature is band power, but other measures, such as high order spectra (HOS), have also been employed (Hosseini et al., 2010). Time-frequency analysis aims to observe the frequency content of the signal without losing information about its temporal evolution. Among others, the wavelet transform (WT) has proven particularly suited to analyzing non-stationary signals such as EEG (Akin, 2002). The features listed above are generally computed channel-wise, but the topography of EEG signals can also be taken into account. Since frontal EEG asymmetry has been shown to be involved in emotional activity (Palmiero and Piccardi, 2017), asymmetry indices have often been employed in emotion recognition.
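Two of the features mentioned above, band power and a frontal asymmetry index, can be sketched as follows; the simple periodogram estimate, band edges, and electrode pairing are conventional choices made for illustration rather than details from any cited system.

```python
import numpy as np

# Illustrative EEG features: band power from a plain periodogram and a
# frontal alpha asymmetry index between a left/right electrode pair
# (e.g., F3/F4). All parameter choices here are conventional defaults.

def band_power(signal: np.ndarray, fs: float, f_lo: float, f_hi: float) -> float:
    """Mean spectral power of `signal` in the [f_lo, f_hi] Hz band."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(psd[band].mean())

def frontal_asymmetry(left: np.ndarray, right: np.ndarray, fs: float) -> float:
    """ln(right alpha power) - ln(left alpha power), alpha = 8-13 Hz."""
    return float(np.log(band_power(right, fs, 8, 13))
                 - np.log(band_power(left, fs, 8, 13)))
```

In practice a windowed estimator (e.g., Welch's method) would replace the raw periodogram, but the feature definitions stay the same.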

Some interesting works in the traditional literature (Ansari-Asl et al., 2007; Petrantonakis and Hadjileontiadis, 2009; Valenzi et al., 2014) were devoted to testing which classification approaches, features, and channel configurations are best suited for EEG-based ER. Results from these studies suggest two significant points. First, the emotional state of subjects can be inferred with quite good accuracy from EEG. Second, using a reduced set of channels and commercial-grade devices still preserves an adequate level of accuracy. The latter point is critical, since in most HRI scenarios the EEG equipment would have to be worn continuously, for long periods and while moving: light, easy-to-mount devices would be preferable to research-grade hardware. Until 2017 (Al-Nafjan et al., 2017), the most adopted classification approach was SVM, often in conjunction with power spectral density (PSD)-based features. Deep learning approaches are emerging in this domain too, showing the potential to outperform traditional ML techniques (Zheng and Lu, 2015; Li et al., 2016; Wang et al., 2019). However, to the best of our knowledge, few works have addressed EEG-based ER in real HRI scenarios.

2.5.1. Dimensional Models

In Shao et al. (2019), the authors employed the Softbank Pepper robot as an autonomous exercise facilitator that encourages the user during physical activity and autonomously adapts its emotional behaviors to the user's affect. In particular, valence detection was performed by analyzing EEG from a commercial-grade device. The selected features were PSD and frontal asymmetry. Among 6 classifiers, a neural network (NN) achieved the highest accuracy on a dataset from 10 subjects obtained by inducing emotions with pictures and videos. When employed in a real HRI scenario, the robot was able to correctly recognize the valence (5 levels) for 14 of the 15 subjects.

In Shao et al. (2020), the authors proposed a novel paradigm for eliciting emotions by directly employing non-verbal communication of the robot (Pepper), in order to train a detection model with data from actual HRI. The elicitation methodology was based both on music and body movements of the robot and aimed to elicit two types of affects: positive valence and high arousal, and negative valence and low arousal. EEG was acquired by a 4-channel headset in order to extract PSD and frontal asymmetry features and feed an NN and an SVM. The affect detection approach was tested on 14 subjects for valence and 12 for arousal, obtaining an overall accuracy of 71.9 and 70.1% (valence), and 70.6 and 69.5% (arousal) using NN and SVM, respectively.

2.6. Voice

Everyday experience tells us that the voice, like facial expressions, is an informative channel about our interlocutor's emotions. We have a natural ability to infer the emotional state underlying the semantic content of what a speaker is saying. Changes in emotional state correspond to variations in the features of the vocal organs, such as larynx position and vocal fold tension, and thus to variations in the voice (Johnstone, 2001). In HRI, automatic acoustic emotion recognition (AER) has to be performed in order to allow robots to perceive human vocal affect. Usually, AER does not examine speech and words in the semantic sense; instead, it analyzes variations with respect to neutral speech of prosodic (e.g., pitch, energy, and formant information), voice quality (e.g., voice level and temporal structures), and spectral (e.g., cepstral-based coefficients) features. Features can be extracted either locally, segmenting the signal into frames, or globally, considering the whole utterance. In traditional ML approaches, feature extraction is followed by classification, performed mostly by hidden Markov models (HMMs), Gaussian mixture models (GMMs), and SVMs (El Ayadi et al., 2011; Gangamohan et al., 2016). In the AER field too, deep learning approaches have rapidly emerged, providing end-to-end mechanisms in contrast with those based on hand-crafted features and demonstrating that they can perform well compared with traditional techniques (Khalil et al., 2019). Employed models include the deep Boltzmann machine (DBM) (Poon-Feng et al., 2014), RNN (Lee and Tashev, 2015), deep belief network (DBN) (Wen et al., 2017), and CNN (Zheng et al., 2015).
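As a small illustration of frame-level prosodic descriptors, the sketch below computes short-time energy and zero-crossing rate over fixed-length frames; the frame length and hop size are typical values, not parameters from any cited system.

```python
import numpy as np

# Illustrative frame-level prosodic descriptors for AER:
# short-time energy and zero-crossing rate. Frame length (25 ms) and
# hop (10 ms) at 16 kHz are common defaults, chosen here as assumptions.

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Slice a 1-D signal into overlapping frames of `frame_len` samples."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(x: np.ndarray, **kw) -> np.ndarray:
    """Mean squared amplitude per frame (loudness cue)."""
    frames = frame_signal(x, **kw)
    return (frames ** 2).mean(axis=1)

def zero_crossing_rate(x: np.ndarray, **kw) -> np.ndarray:
    """Fraction of sign changes per frame (rough spectral/voicing cue)."""
    frames = frame_signal(x, **kw)
    return (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
```

Per-frame sequences like these are then summarized with statistics (mean, variance, range) or fed directly to sequence models such as the RNNs cited above.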

2.6.1. Discrete Models

In Chen et al. (2020), a two-layer fuzzy multiple random forest (TLFMRF) was proposed for speech emotion recognition. Statistical values of 32 features (16 basic features and their derivatives) were extracted from speech samples. Then, clustering by fuzzy C-means (FCM) was adopted to divide the feature data into subclasses in order to address differences in identification information such as gender and age. In the TLFMRF, a cascade of RFs was employed to improve the classification between emotions that are difficult to distinguish. The approach was tested in the classification of six basic emotions from short utterances spoken by 5 participants in front of a mobile robot. The average accuracy was 80.73%.

2.7. Peripheral Physiological Responses and Multimodal Approaches

Emotions affect body physiology, producing significant modifications of heart rate, blood volume pressure (BVP), respiration, skin conductivity, and temperature (Kreibig, 2010), which can contribute to predicting the emotional state of a person. A large effort has been made to develop datasets and techniques for ER, often by considering multiple signal sources together (Koelstra et al., 2011; Soleymani et al., 2011; Chen et al., 2015; Correa et al., 2018). Beyond the accuracy obtained with peripheral signals alone, these works show that the fusion of multiple modalities can outperform single-modality approaches; peripheral signals can therefore be useful sources of information for improving multimodal system performance. Moreover, the rapid development and mass production of consumer-grade devices, such as smartbands and smartwatches (Poongodi et al., 2020), will facilitate the integration of these signals into most HRI systems. For example, in Lazzeri et al. (2014) a multimodal acquisition platform, including a humanoid robot capable of expressing emotions, was tested in a social robot-based therapy scenario for children with autism. The system included audio and video sources, together with electrocardiogram (ECG), GSR, respiration, and accelerometer signals integrated into a sensorized t-shirt; the platform was designed to be flexible and reconfigurable in order to connect with various hardware devices. Another multimodal platform was described in Liu et al. (2016): a complex multimodal emotional communication based human–robot interaction (MEC-HRI) system. The whole platform comprised 3 NAO robots, two mobile robots, a workstation, and several devices, such as a Kinect and an EEG headset, and was designed to allow robots to recognize human emotions and respond accordingly, based on facial expression, speech, posture, and EEG. The article does not report numerical results of multimodal classification in the tested HRI scenarios, but the individual modules achieved promising results when tested on benchmark datasets (Liu et al., 2018a,b) or in single-modality HRI experiments (Liu et al., 2019).

At present, several studies have employed single peripheral measures for ER in HRI (McColl et al., 2016), but the majority focused on narrow aspects of ER (e.g., level of stress or fatigue) rather than referring to a broader emotional model.

2.7.1. Discrete Models

One of the highest accuracies we found in the literature was reported in Perez-Gaspar et al. (2016). Here, the authors developed a multimodal emotion recognition system that integrated facial expressions and voice patterns, based on the evolutionary optimization of NNs and HMMs. Genetic algorithms (GAs) were used to estimate the most suitable ANN and HMM structures for modeling speech and visual emotional features. Speech and visual information were handled separately by two distinct modules, and decision-level fusion was performed by averaging each class's output probabilities across modalities. The system was trained on a dataset of Mexican people containing pictures from 9 subjects and speech samples from 8 subjects. Four basic emotions were considered (anger, sadness, happiness, and neutral). Live tests were performed with 10 unseen subjects interacting with a graphical interface, and 97% accuracy was reported. Finally, in all HRI experiments (a dialogue with a Bioloid robot), the robot's speech was consistent with the emotion shown by the users.

In Filntisis et al. (2019), the authors proposed a system that hierarchically fuses body and facial features extracted from images, using a residual network (ResNet) and a deep neural network (DNN) to analyze face and body data, respectively. The system was not incorporated into a robot, but tested on a database containing images of children interacting with two different robots (Zeno and Furhat) in a game in which they had to express 6 basic emotions. Classification accuracy was 72%.

An interesting approach to multimodal emotion recognition from thermal facial images and gait analysis was proposed in Yu and Tapus (2019). Here, interactive robot learning (IRL) was proposed to take advantage of human feedback obtained by the robot during HRI. First, two RF models for thermal images and gait were trained separately on a dataset of 15 subjects labeled with 4 emotions. Computed features included the PSD of 4 joint angles and angular velocities for gait, and the mean and variance of 3 ROIs for thermal images. Decision-level fusion was performed, based on weights computed from the confusion matrices of the two RF classifiers. In the proposed IRL, during the interaction with the robot, if the predicted emotion did not correspond to the human feedback, the gait and thermal facial features were used to update the emotion recognition models. The online test, which involved a Pepper robot, included emotion elicitation by movie clips, followed by gait and thermal image acquisition. Results showed that IRL can improve the classification accuracy from 65.6 to 78.1%.
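One common way to derive fusion weights from confusion matrices, sketched below under our own assumptions (the paper does not specify the exact scheme), is to take each classifier's per-class recall as its reliability for that class and fuse the modality probabilities accordingly:

```python
import numpy as np

# Illustrative sketch: per-class reliability weights from validation
# confusion matrices (rows = true class), used in weighted decision-level
# fusion of a gait RF and a thermal RF. Matrices and probabilities are
# hypothetical, for 4 emotion classes.

def class_weights(conf_mat):
    """Per-class recall (diagonal / row sum) as a reliability weight."""
    cm = np.asarray(conf_mat, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def weighted_fusion(p_gait, cm_gait, p_thermal, cm_thermal):
    w_g = class_weights(cm_gait)        # how much to trust gait, per class
    w_t = class_weights(cm_thermal)     # how much to trust thermal, per class
    fused = w_g * np.asarray(p_gait, float) + w_t * np.asarray(p_thermal, float)
    return int(np.argmax(fused)), fused

cm_gait = [[8, 1, 1, 0], [2, 7, 0, 1], [1, 1, 7, 1], [0, 2, 1, 7]]
cm_thermal = [[6, 2, 1, 1], [1, 8, 1, 0], [2, 0, 7, 1], [1, 1, 1, 7]]
pred, fused = weighted_fusion([0.5, 0.2, 0.2, 0.1], cm_gait,
                              [0.2, 0.5, 0.2, 0.1], cm_thermal)
# fused = [0.52, 0.54, 0.28, 0.14] -> class 1 wins despite gait favoring class 0
```

The example shows why such weighting matters: the thermal classifier's higher reliability on class 1 lets it override the gait classifier's (less reliable) preference for class 0.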

2.7.2. Dimensional Models

Barros et al. (2015) presented a neural architecture, named Cross-Channel CNN, for multimodal feature extraction. This network is able to extract emotional features from facial expressions and body motion. Among different experiments, the approach was tested in a real HRI scenario: an iCub robot interacted with a subject who presented one emotional state (positive, negative, or neutral). The robot recognized the emotional state, with an average accuracy of 74%, and gave feedback by changing its mouth and eyebrow LEDs.

In Val-Calvo et al. (2020), an interesting analysis of the possibilities of ER in HRI was performed by using facial images, EEG, GSR, and blood pressure. In a realistic HRI scenario, a Pepper robot dynamically drove the subjects' emotional responses through story-telling and multimedia stimuli. Acquired data (from 16 participants) were labeled with the emotional scores that each subject self-reported using 3 levels of valence and arousal. Classification experiments were conducted, together with a population-based statistical analysis. Facial expression estimation was achieved with a CNN strategy: the model was trained on the FER2013 database (Goodfellow et al., 2013) to map facial images onto 7 emotions, grouped into 3 levels of valence. Three independent classifications were used for estimating valence from EEG and arousal from BVP and GSR. The classification process was carried out using a set of 8 standard classifiers and statistical features of the signals. The accuracy obtained on both emotional dimensions was higher than 80% on average.
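Statistical features of physiological signals, of the kind fed to the standard classifiers above, are typically simple window descriptors. The sketch below uses an assumed feature set (not the exact one from the paper) on a synthetic GSR-like window:

```python
import numpy as np

def statistical_features(signal):
    """Window-level statistical descriptors of a physiological signal
    (e.g., a GSR or BVP segment), as commonly used for ER."""
    x = np.asarray(signal, dtype=float)
    dx = np.diff(x)                          # sample-to-sample change
    return {
        "mean": x.mean(),
        "std": x.std(),
        "min": x.min(),
        "max": x.max(),
        "mean_abs_diff": np.abs(dx).mean(),  # short-term variability
    }

# e.g., a synthetic 4-second GSR window sampled at 32 Hz, baseline ~2 uS
rng = np.random.default_rng(0)
window = 2.0 + 0.1 * rng.standard_normal(128)
feats = statistical_features(window)
```

Such feature vectors, computed per window and per channel, can then be passed to any off-the-shelf classifier (SVM, RF, kNN, etc.) for valence or arousal estimation.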

3. Discussion

As our brief summary of the state of the art shows, ER is feasible by collecting different kinds of data. Some modalities have been widely explored, both in the broader HMI context and specifically for HRI (FER, EBGR); others deserve deeper investigation, either because ER has not yet been tested enough in HRI applications (EEG) or because existing HRI field tests focus on narrow aspects of emotions (peripheral responses). In our opinion, all the considered modalities represent promising information sources for future developments: innovative and accessible technologies, such as depth cameras, consumer-grade EEG, and smart devices, together with advances in ML, will lead to rapid development of emotion-aware robots.

Nevertheless, emotion recognition is currently still a challenge for robots, due to the need for reliable results, to provide a trustworthy interaction, and to the time constraints required to account for the recognized emotion in the adaptation of the robot behavior. Moreover, many of the datasets used come from general HMI research and are therefore not suited for emotion recognition in real settings; datasets from real HRI are still needed. Indeed, the visual field of view of the robot may not be aligned with the images stored in a dataset (i.e., from face-to-face interaction), the perceived sound may be affected by the noise of the robot's ego-motion, and robot movements may even occlude its field of view. In this perspective, multimodal systems will play a key role in improving the performance of ER with respect to single-modality approaches, and ML methods and DL architectures have to be developed to deal with heterogeneous data. Particular attention has to be paid to the data used to train and test ER systems: HRI presents some critical and challenging aspects that could make data collected in controlled environments, or from different contexts, unsuitable for real HRI applications.
However, recently published datasets have the advantage of containing data collected from a large number of sensors. This is a valuable feature, since it will allow the development of feature-level fusion approaches for multimodal ER.

Author Contributions

SR conceived the study. MS contributed to finding the relevant literature. All authors contributed to manuscript writing, revision, and read and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding. This work has been supported by PON I&C 2014-2020 within the BRILLO research project Bartending Robot for Interactive Long Lasting Operations Prog. n. F/190066/01-02/X44.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frobt.2020.532279/full#supplementary-material

References

  1. Akin M. (2002). Comparison of wavelet transform and FFT methods in the analysis of EEG signals. J. Med. Syst. 26, 241–247. 10.1023/A:1015075101937 [DOI] [PubMed] [Google Scholar]
  2. Alarcao S. M., Fonseca M. J. (2017). Emotions recognition using EEG signals: a survey. IEEE Trans. Affect. Comput. 10, 374–393. 10.1109/TAFFC.2017.2714671 [DOI] [Google Scholar]
  3. Al-Nafjan A., Hosny M., Al-Ohali Y., Al-Wabil A. (2017). Review and classification of emotion recognition based on EEG brain-computer interface system research: a systematic review. Appl. Sci. 7:1239 10.3390/app7121239 [DOI] [Google Scholar]
  4. Álvarez V. M., Sánchez C. N., Gutiérrez S., Domínguez-Soberanes J., Velázquez R. (2018). Facial emotion recognition: a comparison of different landmark-based classifiers, in 2018 International Conference on Research in Intelligent and Computing in Engineering (RICE) (New York, NY: IEEE; ), 1–4. [Google Scholar]
  5. Ansari-Asl K., Chanel G., Pun T. (2007). A channel selection method for EEG classification in emotion assessment based on synchronization likelihood, in Signal Processing Conference, 2007 15th European (New York, NY: ), 1241–1245. [Google Scholar]
  6. Barron C., Kakadiaris I. A. (2000). Estimating anthropometry and pose from a single image, in Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000 (New York, NY: IEEE; ), 669–676. [Google Scholar]
  7. Barros P., Weber C., Wermter S. (2015). Emotional expression recognition with a cross-channel convolutional neural network for human-robot interaction, in 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) (New York, NY: ), 582–587. [Google Scholar]
  8. Boccanfuso L., Wang Q., Leite I., Li B., Torres C., Chen L., et al. (2016). A thermal emotion classifier for improved human-robot interaction, in 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN) (New York, NY: IEEE; ), 718–723. [Google Scholar]
  9. Cameron D., Fernando S., Collins E., Millings A., Moore R., Sharkey A., et al. (2015). Presence of life-like robot expressions influences children's enjoyment of human-robot interactions in the field, in Proceedings of the AISB Convention 2015 (The Society for the Study of Artificial Intelligence and Simulation of Behaviour). [Google Scholar]
  10. Cañamero L. (2005). Emotion understanding from the perspective of autonomous robots research. Neural Netw. 18, 445–455. 10.1016/j.neunet.2005.03.003 [DOI] [PubMed] [Google Scholar]
  11. Cañamero L. (2019). Embodied robot models for interdisciplinary emotion research. IEEE Trans. Affect. Comput. 1. 10.1109/TAFFC.2019.2908162 [DOI] [Google Scholar]
  12. Cavallo F., Semeraro F., Fiorini L., Magyar G., Sinčák P., Dario P. (2018). Emotion modelling for social robotics applications: a review. J. Bionic Eng. 15, 185–203. 10.1007/s42235-018-0015-y [DOI] [Google Scholar]
  13. Chen J., Hu B., Xu L., Moore P., Su Y. (2015). Feature-level fusion of multimodal physiological signals for emotion recognition, in 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (New York, NY: IEEE; ), 395–399. 10.1109/BIBM.2015.7359713 [DOI] [Google Scholar]
  14. Chen L., Li M., Su W., Wu M., Hirota K., Pedrycz W. (2019). Adaptive feature selection-based AdaBoost-KNN with direct optimization for dynamic emotion recognition in human-robot interaction, in IEEE Transactions on Emerging Topics in Computational Intelligence (New York, NY: ). 10.1109/TETCI.2019.2909930 [DOI] [Google Scholar]
  15. Chen L., Su W., Feng Y., Wu M., She J., Hirota K. (2020). Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Inform. Sci. 509, 150–163. 10.1016/j.ins.2019.09.005 [DOI] [Google Scholar]
  16. Chen L., Wu M., Zhou M., Liu Z., She J., Hirota K. (2017). Dynamic emotion understanding in human-robot interaction based on two-layer fuzzy SVR-TS model. IEEE Trans. Syst. Man Cybernet. Syst. 50, 490–501. 10.1109/TSMC.2017.2756447 [DOI] [Google Scholar]
  17. Corneanu C. A., Simón M. O., Cohn J. F., Guerrero S. E. (2016). Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE Trans. Pattern Anal. Mach. Intell. 38, 1548–1568. 10.1109/TPAMI.2016.2515606 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Correa J. A. M., Abadi M. K., Sebe N., Patras I. (2018). Amigos: a dataset for affect, personality and mood research on individuals and groups. IEEE Trans. Affect. Comput. 1. 10.1109/TAFFC.2018.2884461 [DOI] [Google Scholar]
  19. Duvinage M., Castermans T., Petieau M., Hoellinger T., Cheron G., Dutoit T. (2013). Performance of the emotiv epoc headset for p300-based applications. Biomed. Eng. Online 12:56. 10.1186/1475-925X-12-56 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Ebrahimi Kahou S., Michalski V., Konda K., Memisevic R., Pal C. (2015). Recurrent neural networks for emotion recognition in video, in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (New York, NY: ACM; ), 467–474. 10.1145/2818346.2830596 [DOI] [Google Scholar]
  21. Ekman P. (1992). An argument for basic emotions. Cogn. Emot. 6, 169–200. 10.1080/02699939208411068 [DOI] [Google Scholar]
  22. El Ayadi M., Kamel M. S., Karray F. (2011). Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44, 572–587. 10.1016/j.patcog.2010.09.020 [DOI] [Google Scholar]
  23. Elfaramawy N., Barros P., Parisi G. I., Wermter S. (2017). Emotion recognition from body expressions with a neural network architecture, in Proceedings of the 5th International Conference on Human Agent Interaction (New York, NY: ), 143–149. 10.1145/3125739.3125772 [DOI] [Google Scholar]
  24. Faria D. R., Vieira M., Faria F. C. (2017). Towards the development of affective facial expression recognition for human-robot interaction, in Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments (New York, NY: ), 300–304. 10.1145/3056540.3076199 [DOI] [Google Scholar]
  25. Filippini C., Perpetuini D., Cardone D., Chiarelli A. M., Merla A. (2020). Thermal infrared imaging-based affective computing and its application to facilitate human robot interaction: a review. Appl. Sci. 10:2924 10.3390/app10082924 [DOI] [Google Scholar]
  26. Filntisis P. P., Efthymiou N., Koutras P., Potamianos G., Maragos P. (2019). Fusing body posture with facial expressions for joint recognition of affect in child-robot interaction. IEEE Robot. Automat. Lett. 4, 4011–4018. 10.1109/LRA.2019.2930434 [DOI] [Google Scholar]
  27. Fong T., Nourbakhsh I., Dautenhahn K. (2003). A survey of socially interactive robots. Robot. Auton. Syst. 42, 143–166. 10.1016/S0921-8890(02)00372-X [DOI] [Google Scholar]
  28. Gangamohan P., Kadiri S. R., Yegnanarayana B. (2016). Analysis of emotional speech–a review, in Toward Robotic Socially Believable Behaving Systems-Volume I, eds Esposito A., Jain L. (Cham: Springer; ), 205–238. 10.1007/978-3-319-31056-5_11 [DOI] [Google Scholar]
  29. Ghayoumi M. (2017). A quick review of deep learning in facial expression. J. Commun. Comput. 14, 34–38. 10.17265/1548-7709/2017.01.004 [DOI] [Google Scholar]
  30. Ghimire D., Lee J. (2013). Geometric feature-based facial expression recognition in image sequences using multi-class AdaBoost and support vector machines. Sensors 13, 7714–7734. 10.3390/s130607714 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Glowinski D., Dael N., Camurri A., Volpe G., Mortillaro M., Scherer K. (2011). Toward a minimal representation of affective gestures. IEEE Trans. Affect. Comput. 2, 106–118. 10.1109/T-AFFC.2011.7 [DOI] [Google Scholar]
  32. Goodfellow I. J., Erhan D., Carrier P. L., Courville A., Mirza M., Hamner B., et al. (2013). Challenges in representation learning: a report on three machine learning contests, in International Conference on Neural Information Processing (Berlin: Springer; ), 117–124. 10.1007/978-3-642-42051-1_16 [DOI] [PubMed] [Google Scholar]
  33. Goulart C., Valadão C., Delisle-Rodriguez D., Caldeira E., Bastos T. (2019a). Emotion analysis in children through facial emissivity of infrared thermal imaging. PLoS ONE 14:e0212928. 10.1371/journal.pone.0212928 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Goulart C., Valadão C., Delisle-Rodriguez D., Funayama D., Favarato A., Baldo G., et al. (2019b). Visual and thermal image processing for facial specific landmark detection to infer emotions in a child-robot interaction. Sensors 19:2844. 10.3390/s19132844 [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Guo F., Li M., Qu Q., Duffy V. G. (2019). The effect of a humanoid robot's emotional behaviors on users' emotional responses: evidence from pupillometry and electroencephalography measures. Int. J. Hum. Comput. Interact. 35, 1947–1959. 10.1080/10447318.2019.1587938 [DOI] [Google Scholar]
  36. Hosseini S. A., Khalilzadeh M. A., Naghibi-Sistani M. B., Niazmand V. (2010). Higher order spectra analysis of EEG signals in emotional stress states, in 2010 Second International Conference on Information Technology and Computer Science (New York, NY: IEEE; ), 60–63. 10.1109/ITCS.2010.21 [DOI] [Google Scholar]
  37. Hudlicka E. (2011). Guidelines for designing computational models of emotions. Int. J. Synthet. Emot. 2, 26–79. 10.4018/jse.2011010103 [DOI] [Google Scholar]
  38. Ioannou S., Gallese V., Merla A. (2014). Thermal infrared imaging in psychophysiology: potentialities and limits. Psychophysiology 51, 951–963. 10.1111/psyp.12243 [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Ioffe S., Forsyth D. A. (2001). Probabilistic methods for finding people. Int. J. Comput. Vis. 43, 45–68. 10.1023/A:1011179004708 [DOI] [Google Scholar]
  40. Jenke R., Peer A., Buss M. (2014). Feature extraction and selection for emotion recognition from EEG. IEEE Trans. Affect. Comput. 5, 327–339. 10.1109/TAFFC.2014.2339834 [DOI] [Google Scholar]
  41. Johnstone T. (2001). The effect of emotion on voice production and speech acoustics (Ph.D. thesis), Psychology Department, The University of Western Australia, Perth, WA, Australia. [Google Scholar]
  42. Kappas A. (2010). Smile when you read this, whether you like it or not: conceptual challenges to affect detection. IEEE Trans. Affect. Comput. 1, 38–41. 10.1109/T-AFFC.2010.6 [DOI] [Google Scholar]
  43. Khalil R. A., Jones E., Babar M. I., Jan T., Zafar M. H., Alhussain T. (2019). Speech emotion recognition using deep learning techniques: a review. IEEE Access 7, 117327–117345. 10.1109/ACCESS.2019.2936124 [DOI] [Google Scholar]
  44. Ko B. (2018). A brief review of facial emotion recognition based on visual information. Sensors 18:401. 10.3390/s18020401 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Koelstra S., Muhl C., Soleymani M., Lee J.-S., Yazdani A., Ebrahimi T., et al. (2011). DEAP: a database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 3, 18–31. 10.1109/T-AFFC.2011.15 [DOI] [Google Scholar]
  46. Kragel P. A., LaBar K. S. (2016). Decoding the nature of emotion in the brain. Trends Cogn. Sci. 20, 444–455. 10.1016/j.tics.2016.03.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Krasne F. B., Fanselow M., Zelikowsky M. (2011). Design of a neurally plausible model of fear learning. Front. Behav. Neurosci. 5:41. 10.3389/fnbeh.2011.00041 [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Kreibig S. D. (2010). Autonomic nervous system activity in emotion: a review. Biol. Psychol. 84, 394–421. 10.1016/j.biopsycho.2010.03.010 [DOI] [PubMed] [Google Scholar]
  49. Kumar S., Gupta A. (2015). Facial expression recognition: a review, in Proceedings of the National Conference on Cloud Computing and Big Data (Shanghai: ), 4–6. [Google Scholar]
  50. Kwak S. S., Kim Y., Kim E., Shin C., Cho K. (2013). What makes people empathize with an emotional robot? The impact of agency and physical embodiment on human empathy for a robot, in 2013 IEEE RO-MAN (New York, NY: IEEE; ), 180–185. 10.1109/ROMAN.2013.6628441 [DOI] [Google Scholar]
  51. Lahane P., Jagtap J., Inamdar A., Karne N., Dev R. (2019). A review of recent trends in EEG based brain-computer interface, in 2019 International Conference on Computational Intelligence in Data Science (ICCIDS) (New York, NY: IEEE; ), 1–6. 10.1109/ICCIDS.2019.8862054 [DOI] [Google Scholar]
  52. Lazzeri N., Mazzei D., De Rossi D. (2014). Development and testing of a multimodal acquisition platform for human-robot interaction affective studies. J. Hum. Robot Interact. 3, 1–24. 10.5898/JHRI.3.2.Lazzeri [DOI] [Google Scholar]
  53. Lee J., Kim S., Kim S., Park J., Sohn K. (2019). Context-aware emotion recognition networks, in Proceedings of the IEEE International Conference on Computer Vision (New York, NY: ), 10143–10152. 10.1109/ICCV.2019.01024 [DOI] [Google Scholar]
  54. Lee J., Tashev I. (2015). High-level feature representation using recurrent neural network for speech emotion recognition, in Sixteenth Annual Conference of the International Speech Communication Association (Baixas: ). [Google Scholar]
  55. Lench H. C., Flores S. A., Bench S. W. (2011). Discrete emotions predict changes in cognition, judgment, experience, behavior, and physiology: a meta-analysis of experimental emotion elicitations. Psychol. Bull. 137:834. 10.1037/a0024244 [DOI] [PubMed] [Google Scholar]
  56. Li X., Song D., Zhang P., Yu G., Hou Y., Hu B. (2016). Emotion recognition from multi-channel EEG data through convolutional recurrent neural network, in 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (Shenzhen: IEEE; ), 352–359. 10.1109/BIBM.2016.7822545 [DOI] [Google Scholar]
  57. Liberati G., Federici S., Pasqualotto E. (2015). Extracting neurophysiological signals reflecting users' emotional and affective responses to BCI use: a systematic literature review. Neurorehabilitation 37, 341–358. 10.3233/NRE-151266 [DOI] [PubMed] [Google Scholar]
  58. Lindquist K. A., Siegel E. H., Quigley K. S., Barrett L. F. (2013). The hundred-year emotion war: are emotions natural kinds or psychological constructions? Comment on Lench, Flores, and Bench (2011). Psychol. Bull. 139, 255–263. 10.1037/a0029038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Liu Z., Wang S. (2011). Emotion recognition using hidden Markov models from facial temperature sequence, in International Conference on Affective Computing and Intelligent Interaction (Berlin: Springer; ), 240–247. 10.1007/978-3-642-24571-8_26 [DOI] [Google Scholar]
  60. Liu Z.-T., Li S.-H., Cao W.-H., Li D.-Y., Hao M., Zhang R. (2019). Combining 2D Gabor and local binary pattern for facial expression recognition using extreme learning machine. J. Adv. Comput. Intell. Intell. Inform. 23, 444–455. 10.20965/jaciii.2019.p0444 [DOI] [Google Scholar]
  61. Liu Z.-T., Pan F.-F., Wu M., Cao W.-H., Chen L.-F., Xu J.-P., et al. (2016). A multimodal emotional communication based humans-robots interaction system, in 2016 35th Chinese Control Conference (CCC) (New York, NY: IEEE; ), 6363–6368. 10.1109/ChiCC.2016.7554357 [DOI] [Google Scholar]
  62. Liu Z.-T., Wu M., Cao W.-H., Mao J.-W., Xu J.-P., Tan G.-Z. (2018a). Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273, 271–280. 10.1016/j.neucom.2017.07.050 [DOI] [Google Scholar]
  63. Liu Z.-T., Xie Q., Wu M., Cao W.-H., Li D.-Y., Li S.-H. (2018b). Electroencephalogram emotion recognition based on empirical mode decomposition and optimal feature selection. IEEE Trans. Cogn. Dev. Syst. 11, 517–526. 10.1109/TCDS.2018.2868121 [DOI] [Google Scholar]
  64. Lucey P., Cohn J. F., Kanade T., Saragih J., Ambadar Z., Matthews I. (2010). The extended Cohn-Kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops (New York, NY: IEEE; ), 94–101. 10.1109/CVPRW.2010.5543262 [DOI] [Google Scholar]
  65. Lundqvist D., Flykt A., Öhman A. (1998). The Karolinska Directed Emotional Faces (KDEF). CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet. 10.1037/t27732-000 [DOI] [Google Scholar]
  66. Lyons M., Akamatsu S., Kamachi M., Gyoba J. (1998). Coding facial expressions with gabor wavelets, in Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition (New York, NY: IEEE; ), 200–205. 10.1109/AFGR.1998.670949 [DOI] [Google Scholar]
  67. Mao Q.-R., Pan X.-Y., Zhan Y.-Z., Shen X.-J. (2015). Using kinect for real-time emotion recognition via facial expressions. Front. Inform. Technol. Electron. Eng. 16, 272–282. 10.1631/FITEE.1400209 [DOI] [Google Scholar]
  68. Marmpena M., Lim A., Dahl T. S. (2018). How does the robot feel? Perception of valence and arousal in emotional body language. Paladyn J. Behav. Robot. 9, 168–182. 10.1515/pjbr-2018-0012 [DOI] [Google Scholar]
  69. Maskeliunas R., Damasevicius R., Martisius I., Vasiljevas M. (2016). Consumer-grade EEG devices: are they usable for control tasks? PeerJ 4:e1746. 10.7717/peerj.1746 [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Mavridis N. (2015). A review of verbal and non-verbal human-robot interactive communication. Robot. Auton. Syst. 63, 22–35. 10.1016/j.robot.2014.09.031 [DOI] [Google Scholar]
  71. McColl D., Hong A., Hatakeyama N., Nejat G., Benhabib B. (2016). A survey of autonomous human affect detection methods for social robots engaged in natural HRI. J. Intell. Robot. Syst. 82, 101–133. 10.1007/s10846-015-0259-2 [DOI] [Google Scholar]
  72. Mehrabian A. (1996). Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Curr. Psychol. 14, 261–292. 10.1007/BF02686918 [DOI] [Google Scholar]
  73. Mollahosseini A., Hasani B., Mahoor M. H. (2017). Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10, 18–31. 10.1109/TAFFC.2017.2740923 [DOI] [Google Scholar]
  74. Navarro-Guerrero N., Lowe R., Wermter S. (2012). A neurocomputational amygdala model of auditory fear conditioning: a hybrid system approach, in The 2012 International Joint Conference on Neural Networks (IJCNN) (New York, NY: IEEE; ), 1–8. 10.1109/IJCNN.2012.6252392 [DOI] [Google Scholar]
  75. Nguyen D. T., Li W., Ogunbona P. O. (2016). Human detection from images and videos: a survey. Pattern Recogn. 51, 148–175. 10.1016/j.patcog.2015.08.027 [DOI] [Google Scholar]
  76. Nijboer F., Van De Laar B., Gerritsen S., Nijholt A., Poel M. (2015). Usability of three electroencephalogram headsets for brain-computer interfaces: a within subject comparison. Interact. Comput. 27, 500–511. 10.1093/iwc/iwv023 [DOI] [Google Scholar]
  77. Noroozi F., Kaminska D., Corneanu C., Sapinski T., Escalera S., Anbarjafari G. (2018). Survey on emotional body gesture recognition. IEEE Trans. Affect. Comput. 10.1109/TAFFC.2018.2874986 [DOI] [Google Scholar]
  78. Palmiero M., Piccardi L. (2017). Frontal EEG asymmetry of mood: a mini-review. Front. Behav. Neurosci. 11:224 10.3389/fnbeh.2017.00224 [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Patil J. V., Bailke P. (2016). Real time facial expression recognition using realsense camera and ANN, in 2016 International Conference on Inventive Computation Technologies (ICICT), Vol. 2 (New York, NY: IEEE; ), 1–6. 10.1109/INVENTIVE.2016.7824820 [DOI] [Google Scholar]
  80. Perez-Gaspar L.-A., Caballero-Morales S.-O., Trujillo-Romero F. (2016). Multimodal emotion recognition with evolutionary computation for human-robot interaction. Expert Syst. Appl. 66, 42–61. 10.1016/j.eswa.2016.08.047 [DOI] [Google Scholar]
  81. Petrantonakis P. C., Hadjileontiadis L. J. (2009). Emotion recognition from EEG using higher order crossings. IEEE Trans. Inform. Technol. Biomed. 14, 186–197. 10.1109/TITB.2009.2034649 [DOI] [PubMed] [Google Scholar]
  82. Picard R. W. (1999). Affective computing for HCI, in Proceedings of HCI International (the 8th International Conference on Human-Computer Interaction) on Human-Computer Interaction: Ergonomics and User Interfaces-Volume I, eds Bullinger H. J., Ziegler J. (Mahwah, NJ: L. Erlbaum Associates Inc.), 829–833. [Google Scholar]
  83. Plutchik R., Kellerman H. (2013). Theories of Emotion, Vol. 1 Cambridge, MA: Academic Press. [Google Scholar]
  84. Poon-Feng K., Huang D.-Y., Dong M., Li H. (2014). Acoustic emotion recognition based on fusion of multiple feature-dependent deep Boltzmann machines, in The 9th International Symposium on Chinese Spoken Language Processing (New York, NY: IEEE; ), 584–588. 10.1109/ISCSLP.2014.6936696 [DOI] [Google Scholar]
  85. Poongodi T., Krishnamurthi R., Indrakumari R., Suresh P., Balusamy B. (2020). Wearable devices and IoT, in A Handbook of Internet of Things in Biomedical and Cyber Physical System (Berlin: Springer; ), 245–273. 10.1007/978-3-030-23983-1_10 [DOI] [Google Scholar]
  86. Refat C. M. M., Azlan N. Z. (2019). Deep learning methods for facial expression recognition, in 2019 7th International Conference on Mechatronics Engineering (ICOM) (New York, NY: IEEE; ), 1–6. 10.1109/ICOM47790.2019.8952056 [DOI] [Google Scholar]
  87. Reisenzein R., Hudlicka E., Dastani M., Gratch J., Hindriks K., Lorini E., et al. (2013). Computational modeling of emotion: Toward improving the inter-and intradisciplinary exchange. IEEE Trans. Affect. Comput. 4, 246–266. 10.1109/T-AFFC.2013.14 [DOI] [Google Scholar]
  88. Reyes S. R., Depano K. M., Velasco A. M. A., Kwong J. C. T., Oppus C. M. (2020). Face detection and recognition of the seven emotions via facial expression: Integration of machine learning algorithm into the NAO robot, in 2020 5th International Conference on Control and Robotics Engineering (ICCRE) (New York, NY: IEEE; ), 25–29. 10.1109/ICCRE49379.2020.9096267 [DOI] [Google Scholar]
  89. Rossi S., Larafa M., Ruocco M. (2020). Emotional and behavioural distraction by a social robot for children anxiety reduction during vaccination. Int. J. Soc. Robot. 12, 1–13. 10.1007/s12369-019-00616-w [DOI] [Google Scholar]
  90. Rossi S., Ruocco M. (2019). Better alone than in bad company: effects of incoherent non-verbal emotional cues for a humanoid robot. Interact. Stud. 20, 487–508. 10.1075/is.18066.ros [DOI] [Google Scholar]
  91. Russell J. A. (1980). A circumplex model of affect. J. Pers. Soc. Psychol. 39:1161. 10.1037/h0077714 [DOI] [Google Scholar]
  92. Saarimäki H., Ejtehadian L. F., Glerean E., Jääskeläinen I. P., Vuilleumier P., Sams M., et al. (2018). Distributed affective space represents multiple emotion categories across the human brain. Soc. Cogn. Affect. Neurosci. 13, 471–482. 10.1093/scan/nsy018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Saha S., Datta S., Konar A., Janarthanan R. (2014). A study on emotion recognition from body gestures using kinect sensor, in 2014 International Conference on Communication and Signal Processing (New York, NY: IEEE; ), 056–060. 10.1109/ICCSP.2014.6949798 [DOI] [Google Scholar]
  94. Salovey P., Mayer J. D. (1990). Emotional intelligence. Imaginat. Cogn. Pers. 9, 185–211. 10.2190/DUGG-P24E-52WK-6CDG [DOI] [Google Scholar]
  95. Sánchez-López Y., Cerezo E. (2019). Designing emotional BDI agents: good practices and open questions. Knowledge Eng. Rev. 34:e26 10.1017/S0269888919000122 [DOI] [Google Scholar]
  96. Saunderson S., Nejat G. (2019). How robots influence humans: a survey of nonverbal communication in social human-robot interaction. Int. J. Soc. Robot. 11, 575–608. 10.1007/s12369-019-00523-0 [DOI] [Google Scholar]
  97. Savva N., Scarinzi A., Bianchi-Berthouze N. (2012). Continuous recognition of player's affective body expression as dynamic quality of aesthetic experience. IEEE Trans. Comput. Intell. AI Games 4, 199–212. 10.1109/TCIAIG.2012.2202663 [DOI] [Google Scholar]
  98. Shao M., Alves S. F. R., Ismail O., Zhang X., Nejat G., Benhabib B. (2019). You are doing great! only one rep left: an affect-aware social robot for exercising, in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) (Bari: Institute of Electrical and Electronics Engineers; ), 3811–3817. 10.1109/SMC.2019.8914198 [DOI] [Google Scholar]
  99. Shao M., Snyder M., Nejat G., Benhabib B. (2020). User affect elicitation with a socially emotional robot. Robotics 9:44 10.3390/robotics9020044 [DOI] [Google Scholar]
  100. Siegel E. H., Sands M. K., Van den Noortgate W., Condon P., Chang Y., Dy J., et al. (2018). Emotion fingerprints or emotion populations? A meta-analytic investigation of autonomic features of emotion categories. Psychol. Bull. 144:343. 10.1037/bul0000128 [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Soleymani M., Lichtenauer J., Pun T., Pantic M. (2011). A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3, 42–55. 10.1109/T-AFFC.2011.25 [DOI] [Google Scholar]
  102. Spezialetti M., Cinque L., Tavares J. M. R., Placidi G. (2018). Towards EEG-based bci driven by emotions for addressing BCI-illiteracy: a meta-analytic review. Behav. Inform. Technol. 37, 855–871. 10.1080/0144929X.2018.1485745 [DOI] [Google Scholar]
  103. Suk M., Prabhakaran B. (2014). Real-time mobile facial expression recognition system-a case study, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (New York, NY: ), 132–137. 10.1109/CVPRW.2014.25 [DOI] [Google Scholar]
  104. Sun M., Mou Y., Xie H., Xia M., Wong M., Ma X. (2019). Estimating emotional intensity from body poses for human-robot interaction, in 2018 IEEE International Conference on Systems, Man and Cybernetics (SMC) (New York, NY: IEEE; ), 3811–3817. [Google Scholar]
  105. Szwoch M., Pieniażek P. (2015). Facial emotion recognition using depth data, in 2015 8th International Conference on Human System Interaction (HSI) (New York, NY: IEEE; ), 271–277. 10.1109/HSI.2015.7170679 [DOI] [Google Scholar]
  106. Tanevska A., Rea F., Sandini G., Sciutti A. (2017). Can emotions enhance the robot's cognitive abilities: a study in autonomous HRI with an emotional robot, in Proceedings of AISB Convention (Bath: ). [Google Scholar]
  107. Tomkins S. S. (2008). Affect Imagery Consciousness: The Complete Edition: Two Volumes. Berlin: Springer Publishing Company. [Google Scholar]
  108. Toshev A., Szegedy C. (2014). Deeppose: human pose estimation via deep neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (New York, NY: ), 1653–1660. 10.1109/CVPR.2014.214 [DOI] [Google Scholar]
  109. Trnka R., Lačev A., Balcar K., Kuška M., Tavel P. (2016). Modeling semantic emotion space using a 3d hypercube-projection: an innovative analytical approach for the psychology of emotions. Front. Psychol. 7:522. 10.3389/fpsyg.2016.00522 [DOI] [PMC free article] [PubMed] [Google Scholar]
  110. Tsiourti C., Weiss A., Wac K., Vincze M. (2017). Designing emotionally expressive robots: a comparative study on the perception of communication modalities, in Proceedings of the 5th International Conference on Human Agent Interaction (New York, NY: ACM; ), 213–222. 10.1145/3125739.3125744 [DOI] [Google Scholar]
  111. Turan C., Lam K.-M. (2018). Histogram-based local descriptors for facial expression recognition (FER): a comprehensive study. J. Vis. Commun. Image Represent. 55, 331–341. 10.1016/j.jvcir.2018.05.024 [DOI] [Google Scholar]
  112. Val-Calvo M., Álvarez-Sánchez J. R., Ferrández-Vicente J. M., Fernández E. (2020). Affective robot story-telling human-robot interaction: exploratory real-time emotion estimation analysis using facial expressions and physiological signals. IEEE Access 8, 134051–134066. 10.1109/ACCESS.2020.3007109 [DOI] [Google Scholar]
  113. Valenzi S., Islam T., Jurica P., Cichocki A. (2014). Individual classification of emotions using EEG. J. Biomed. Sci. Eng. 7:604. 10.4236/jbise.2014.78061 [DOI] [Google Scholar]
  114. Viola P., Jones M. (2001). Rapid object detection using a boosted cascade of simple features, in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Vol. 1 (New York, NY: IEEE; ). 10.1109/CVPR.2001.990517 [DOI] [Google Scholar]
  115. Viola P., Jones M. J., Snow D. (2005). Detecting pedestrians using patterns of motion and appearance. Int. J. Comput. Vis. 63, 153–161. 10.1007/s11263-005-6644-8 [DOI] [Google Scholar]
  116. Volkova E., De La Rosa S., Bülthoff H. H., Mohler B. (2014). The MPI emotional body expressions database for narrative scenarios. PLoS ONE 9:e113647. 10.1371/journal.pone.0113647 [DOI] [PMC free article] [PubMed] [Google Scholar]
  117. Wang C.-C. R., Lien J.-J. J. (2007). AdaBoost learning for human detection based on histograms of oriented gradients, in Asian Conference on Computer Vision (Berlin: Springer; ), 885–895. 10.1007/978-3-540-76386-4_84 [DOI] [Google Scholar]
  118. Wang K.-Y., Ho Y.-L., Huang Y.-D., Fang W.-C. (2019). Design of intelligent EEG system for human emotion recognition with convolutional neural network, in 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (New York, NY: IEEE; ), 142–145. 10.1109/AICAS.2019.8771581 [DOI] [Google Scholar]
  119. Wang S., He M., Gao Z., He S., Ji Q. (2014). Emotion recognition from thermal infrared images using deep Boltzmann machine. Front. Comput. Sci. 8, 609–618. 10.1007/s11704-014-3295-3 [DOI] [Google Scholar]
  120. Wen G., Li H., Huang J., Li D., Xun E. (2017). Random deep belief networks for recognizing emotions from speech signals. Comput. Intell. Neurosci. 2017:1945630. 10.1155/2017/1945630 [DOI] [PMC free article] [PubMed] [Google Scholar]
  121. Yu C., Tapus A. (2019). Interactive robot learning for multimodal emotion recognition, in International Conference on Social Robotics (Berlin: Springer; ), 633–642. 10.1007/978-3-030-35888-4_59 [DOI] [Google Scholar]
  122. Zhang T. (2017). Facial expression recognition based on deep learning: a survey, in International Conference on Intelligent and Interactive Systems and Applications (Cham: Springer; ), 345–352. 10.1007/978-3-319-69096-4_48 [DOI] [Google Scholar]
  123. Zhang Y., Zhang L., Hossain M. A. (2015). Adaptive 3d facial action intensity estimation and emotion recognition. Expert Syst. Appl. 42, 1446–1464. 10.1016/j.eswa.2014.08.042 [DOI] [Google Scholar]
  124. Zheng W.-L., Lu B.-L. (2015). Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Trans. Auton. Mental Dev. 7, 162–175. 10.1109/TAMD.2015.2431497 [DOI] [Google Scholar]
  125. Zheng W. Q., Yu J. S., Zou Y. X. (2015). An experimental study of speech emotion recognition based on deep convolutional neural networks, in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (New York, NY: IEEE; ), 827–831. 10.1109/ACII.2015.7344669 [DOI] [Google Scholar]

Articles from Frontiers in Robotics and AI are provided here courtesy of Frontiers Media SA