Abstract
The emergence of artificial emotional intelligence technology is revolutionizing the fields of computers and robotics, allowing for a new level of communication and understanding of human behavior that was once thought impossible. While recent advancements in deep learning have transformed the field of computer vision, automated understanding of evoked or expressed emotions in visual media remains in its infancy. This foundering stems from the absence of a universally accepted definition of “emotion,” coupled with the inherently subjective nature of emotions and their intricate nuances. In this article, we provide a comprehensive, multidisciplinary overview of the field of emotion analysis in visual media, drawing on insights from psychology, engineering, and the arts. We begin by exploring the psychological foundations of emotion and the computational principles that underpin the understanding of emotions from images and videos. We then review the latest research and systems within the field, accentuating the most promising approaches. We also discuss the current technological challenges and limitations of emotion analysis, underscoring the necessity for continued investigation and innovation. We contend that this represents a “Holy Grail” research problem in computing and delineate pivotal directions for future inquiry. Finally, we examine the ethical ramifications of emotion-understanding technologies and contemplate their potential societal impacts. Overall, this article endeavors to equip readers with a deeper understanding of the domain of emotion analysis in visual media and to inspire further research and development in this captivating and rapidly evolving field.
Keywords: Artificial emotional intelligence (AEI), bodily expressed emotion understanding (BEEU), deep learning, ethics, evoked emotion, expressed emotion, human behavior, intelligent robots, movement analysis, psychology
I. INTRODUCTION
As artificial intelligence (AI) technology becomes more prevalent and capable of performing a wide range of tasks, the need for effective communication between humans and AI systems is becoming increasingly important. The adoption of smart home products and services is projected to reach 400 million worldwide, with smart devices, such as Alexa and Astro, becoming increasingly common in households [1]. However, these devices are currently limited to executing specific commands and do not possess the capability to understand or respond to human emotions [2]. This lack of emotional intelligence (EQ) limits their potential applications, and this constraint is particularly relevant for future robotic applications, such as personal assistant robots, social robots, service robots, factory/warehouse robots, and police robots, which require close collaboration and a comprehensive understanding of human behavior and emotions.
The ability to impart EQ to AI when dealing with visual information is a topic of growing interest. This article aims to address the fundamental question of how to “teach” AI to understand and respond to human emotions based on images and videos. The potential technical solutions to these questions have far-reaching implications for various application domains, including human–AI interaction, autonomous driving, social media, entertainment, information management and retrieval, design, industrial safety, and education.
To provide a comprehensive and well-balanced view of this complex subject, it is essential to draw on the expertise of various fields, including computer and information science and engineering, psychology, data science, movement analysis, and performing arts. The interdisciplinary nature of this topic highlights the need for collaboration and cooperation among researchers from different fields in order to achieve a deeper understanding of the subject.
In this article, we focus on the topic of affective visual information analysis as it represents a highly nuanced and complex area of study with strong connections to well-established scholarly fields, such as computer vision, multimedia, and image and video processing. However, it is important to note that the techniques presented here can be integrated with other data modalities, such as speech, sensor-generated streaming data, and text, in order to enhance the performance of real-world applications.
The primary objective of this article is to introduce the technical communities to the emerging field of affective visual information analysis. Recognizing the breadth and dynamic nature of this field, we do not aim to provide a comprehensive survey of all subareas. Instead, our discussion focuses on the fundamental psychological and computational principles (see Sections II and III), recent advancements and developments (see Section IV), core challenges and open issues (see Section V), connections to other areas of research and development (see Section VI), and ethical considerations related to this new technology (see Section VII). We apologize in advance for any important publications that may have been omitted in our discussion.
Recently, there have been some other surveys and reviews on artificial emotional intelligence (AEI), such as facial expression recognition (FER) [3], [4], [5], [6], microexpression recognition (MER) [7], [8], [9], textual sentiment classification [10], [11], [12], music and speech emotion recognition [13], [14], [15], affective image content analysis [16], emotional body gesture recognition [17], bodily expressed emotion recognition [18], emotion recognition from physiological signals [19], [20], multimodal emotion recognition [21], [22], and affective theory use [23]. These articles mainly focus on emotion and sentiment analysis for a specific modality from the perspective of machine learning and pattern recognition or focus on the psychological emotion theories. Cambria [24] summarized the common tasks of affective computing and sentiment analysis, and classified existing methods into three main categories: knowledge-based, statistical, and hybrid approaches. Poria et al. [25] and Wang et al. [26] reviewed both unimodal and multimodal emotion recognition before 2017 and between 2017 and 2020, respectively. As opposed to those reviews, this article aims to provide a comprehensive overview of emotion analysis from visual media (i.e., both images and videos) with insights drawn from multiple disciplines.
II. EMOTION: THE PSYCHOLOGICAL FOUNDATION
How we define emotion largely descends from the theoretical framework used to study it. In this section, we provide an overview of the most prominent emotion theories, beginning with Darwin, and underscore how contemporary dimensional approaches to understanding emotion align with both the processing of emotion by the human brain and current computer vision approaches for modeling emotion to make predictions about human perception (see Section II-A). In addition, we examine the intrinsic link between emotion and adaptive behavior, a contention that is largely shared across different emotion theories (see Section II-B).
A. Definitions and Models of Emotion
One of the first emotion theories put forth was Charles Darwin’s in his seminal book “On the Expression of the Emotions in Man and Animals” [32]. This book proposed that humans possess a finite set of biologically privileged emotions that evolved to confer upon us survival-related behavior. William James [33] later added to this arguing that the experience of emotion is ultimately our experience of distinct patterns of physiological arousal and physical behaviors associated with each emotion [34]. Building upon these assumptions, Ekman’s Neurocultural Theory of Emotion [35] further perpetuated the notion that there exists a “universal affect program” that underlies the experience and expression of several discrete emotions, such as anger, fear, sadness, and happiness. According to this theory, basic emotional experiences and emotional displays evolved as adaptive responses to specific environmental contingencies and, thus, felt, expressed, and recognized emotions are uniform across all people and cultures, and are marked by specific patterns of physiological and neural responsivity.
Considerable research has since questioned the utility of this approach. This includes findings that people: 1) are often ill-equipped to describe their own emotions in discrete emotion terms, both in research and clinical settings [36]; 2) show low consensus in their ability to categorize both facial and vocal expressions of emotions in discrete emotion terms [37]; and 3) show high intercorrelations across the emotional experiences they do report [38]. Such findings have prompted many researchers to explore alternative approaches to conceptualizing and measuring emotional experience that necessarily involve a cognitive component.
Magda Arnold’s cognitive appraisal theory of emotion was the first to introduce the necessity of cognition in emotion elicitation [39]. Although she did not disagree with Darwin and James that emotions are adaptive states spurring on survival-related behavior, she, nonetheless, took them to task for not considering vast individual variation in emotional experiences. Arnold rightly underscored the capacity for the same emotion-evoking events to lead to different emotional experiences in different people. This becomes readily apparent when considering one’s own emotional experiences. For example, diving off a cliff may generate an aversive fear state in one person but an enjoyable thrill state in another. The difference in how an event is evaluated, therefore, shapes the emotion that results. Central to her theory was the importance of cognitive appraisal in initially eliciting an emotion. Once elicited, she largely agreed with Darwin’s functional assumptions regarding the survival benefits of emotion-related behavior. Later, Schachter and Singer [40] drew on these insights to help resolve ongoing debates regarding James’ theory of embodied emotional experience. Their research demonstrated that the emotion we experience when adrenaline is released in our body depends on our cognitive framing and context. Those injected with adrenaline reported feeling happier when in a fun context and more irritated when in an angering context. The only difference was the cognitive appraisal that framed the experience of that arousal.
Building upon these ideas further, Mehrabian and Russell [41] proposed the Pleasure, Arousal, and Dominance (PAD) Emotional State Model, which suggests a dimensional account of emotion, one in which PAD constitutes the fundamental dimensions that represent all emotions. Later, Russell dropped dominance as a key dimension and focused on what he refers to as “core affect,” suggesting that all emotions can be reduced to the fundamental psychological and biological dimensions of pleasantness and arousal [42]. Dimensional approaches offer a way of conceptualizing and assessing emotion that closely approximates how the human brain processes emotion (see Fig. 1 for a comparison of different dimensional/circumplex models).
Notably, some early attempts to use computational methods to predict human emotions elicited by visual scenes employed the discrete emotion approach described at the outset of this section. For example, Mikels et al. [43] and Machajdik and Hanbury [44] used categorical approaches to assess the visual properties of stimuli taken from the International Affective Picture System (IAPS) [45], a widely used set of emotionally evocative photographs in the emotion literature. However, such approaches resulted in high levels of multicollinearity between emotions, making it difficult to disentangle emotions using traditional regression models. In contrast, adopting a dimensional approach not only aligns well with emerging theoretical accounts of emotion but has been validated by James Wang and his colleagues in the successful assessment of human aesthetics and emotions evoked by visual scenes, as well as bodily expressed emotion [46], [47], [48], [49], [50], [51], [52]. This offers a methodological approach that is consistent with dimensional theories of emotion.
B. Interplay Between Emotion and Behavior
Fridlund’s behavioral ecology perspective of emotion argues that emotional expression evolved primarily as a means of signaling behavioral intent [53]. The view that facial expression evolved specifically as a way to forecast behavioral intentions and consequences to others drew from Darwin’s seminal writings on expression [32] even though Darwin himself argued that expressions did not evolve for social communication per se. Fridlund’s argument is based on the idea that perceiving behavioral intentions is adaptive. From this perspective, anger may primarily convey to an observer a readiness to attack, whereas fear may primarily convey a readiness to submit or retreat (see [54]). From this perspective, behavioral intentions are considered “syndromes of correlated components” [53, p. 151]. Fridlund is not alone in these assumptions. Some researchers have gone so far as to suggest that feeling states associated with emotions are merely conscious perceptions of underlying behavioral intentions or action tendencies, which implies that emotional feeling is simply the experience of behavioral intention, similar to William James’s theory [55].
It is worth noting that empirical research has provided support for the idea that behavioral intention is conveyed through emotional expression. For example, one study demonstrated that action tendencies and emotion labels are attributed to faces at comparable levels of consistency [55]. Similarly, in forced-choice paradigms [54], cross-cultural evidence indicates that participants assign behavioral intention descriptors with about equal consistency as they do with emotion descriptors.
A focus on approach-avoidance tendencies has been highlighted in most of the research conducted to date. The ability to detect another’s intention to approach or avoid us is thought of as a principal factor governing social exchange. However, much of the work on approach-avoidance behavioral motivations has also tended to concentrate on the experience or response of an observer to a stimulus event [56]. One common method of operationalizing approach and avoidance then stems from traditional behavioral learning paradigms that link behavioral motivation and emotion through reward versus punishment contingencies [57]. Approach motivation is defined by appetitive, reward-related behavior, while avoidance motivation is defined by aversive, punishment-related behavior, where appetitive behavior is movement toward a reward, and aversive behavior is movement away from a punishment.
Much research has focused on the relationship between approach and avoidance tendencies and emotional experience [58]. However, there has been less attention paid to whether approach and avoidance tendencies are fundamentally signaled by the external expression of emotion. It stands to reason that, if the experience of emotion is associated with approach and avoidance tendencies, these tendencies should be signaled to others when expressed. This distinction is important as the approach-avoidance tendencies attributed to expressive faces may not always match the approach-avoidance reactions elicited by them. For example, the expression of joy arguably conveys a heightened likelihood of approach by the expressor and a reaction of approach from the observer [56]. In contrast, anger expressions signal approach by the expressor but tend to elicit avoidance by the observer.
Recent insights from the embodiment literature also provide evidence that emotional experiences are grounded in specific action tendencies [34]. This means that emotional experiences can be expressed through stored action tendencies in the body, rather than through semantic cues. For example, studies have examined the coherence between emotional experience (positive or negative) and approach (arm flexion, i.e., pulling toward) versus avoidance behavior (arm extension, i.e., pushing away) [59]. In one study, participants were randomly assigned to an arm flexion (approach behavior) or arm extension (avoidance behavior) condition, either during the reading of a story about a fictional character or during a positive versus negative semantic priming task before reading the story. Participants in the congruent conditions (happy prime and arm flexion, and sad prime and arm extension) were able to remember more items from the story.
Although important for explaining behavioral responding, these studies did not address whether basic tendencies to approach or avoid were also fundamentally signaled by emotional expressions. If they were, expressions coupled with approach and avoidant behaviors should impact the efficiency of emotion recognition. In one set of studies, anger expressions were found to facilitate judgments of approach versus withdrawing faces compared with fear expressions [60]. Similarly, perceived movement of a face toward or away from an observer likewise facilitated angry or fearful expression perception [61]. Thus, approach and avoidance movement are associated in a fundamental way with the recognition of anger and fear displays, respectively, supporting the conclusion that basic action tendencies are inherently associated with the perception of emotion.
In sum, despite widely debatable assumptions about the nature of emotion and emotional expression across various theories, most tend to agree that emotion expression conveys fundamental information regarding basic behavioral tendencies [60].
III. EMOTION: COMPUTATIONAL PRINCIPLES AND FOUNDATIONS
In this section, we aim to establish computational foundations for analyzing and recognizing emotions from visual media. Emotion recognition systems typically involve several fundamental data-related stages, including data collection (see Section III-A), data reliability assessment (see Section III-B), and data representation (see Sections III-D–III-F for general computer vision-based representation, movement coding, and context and functional purpose detection, respectively). As we present specific examples at each stage, we will emphasize the underlying principles that they adhere to. We provide a list of representative datasets in Section III-C. We will also introduce the factors of acted portrayals (see Section III-G), cultural and gender dialects (see Section III-H), structure (see Section III-I), personality (see Section III-J), and affective style (see Section III-K) in inferring emotion, based on prior research.
A. Data Collection
Because the categories of emotions are not well-defined, it is not possible to program a computer to recognize all emotion categories based on a set of predefined logic rules, computational instructions, or procedures. Thus, researchers must take a data-driven approach in which computers learn from a large quantity of labeled, partially labeled, and/or unlabeled examples. To enable such research and subsequent real-world applications, it is essential to collect large-scale, high-quality, ecologically valid datasets. To highlight the complexity of the data collection problem and to introduce best practices, we describe a few data collection approaches that incorporate psychological principles in their design.
1). Evoked Emotion—Immediate Response:
In the field of modeling evoked emotion, earlier researchers utilized the IAPS dataset, which consisted of only 1082 images rated for evoked emotional response [46]. In 2017, Lu et al. [48] introduced one of the first large-scale datasets, the EmoSet, utilizing a human subject study. The EmoSet dataset is much larger, and all images are complex scenes that humans regularly encounter in daily life.
To create a diverse image collection, the researchers employed a data-crawling approach to gather nearly 44 000 images from social media and obtained emotion labels (both dimensional and categorical) using crowdsourcing via the Amazon Mechanical Turk (AMT) platform. They used the valence, arousal, and dominance (VAD) dimensional model [27], which is similar to the PAD model. The researchers followed strict psychological subject study procedures and validation approaches. The images were collected from more than 1000 users’ Web albums on Flickr using 558 emotional words as search terms. These words were summarized by Averill [62].
The researchers carefully designed their online crowdsourcing human subject study to ensure the quality of the data. For example, each image was presented to a subject for exactly 6 s. This design differed from conventional object recognition data annotation tasks, where the subject was often given no restrictions on the amount of time to view an image. The fixed viewing time followed psychological convention, as the intention was to collect the subject’s immediate affective response to the visual stimuli. If subjects were given varying amounts of time to view an image before rating it, the data would not be a reliable capture of their immediate affective response. To accommodate this, subjects were given the option to click a “Reshow Image” button if they needed to refer back to the image. In addition, recognizing that categorical emotions may not cover all feelings, this method allowed the subject to enter other feelings that they may have had.
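As a concrete illustration of such a timed-rating protocol, the following Python sketch shows how one annotation record might be structured. The field names, rating scales, and the review threshold are illustrative assumptions, not EmoSet's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DISPLAY_SECONDS = 6  # fixed exposure time used to capture the immediate response

@dataclass
class ImmediateAffectRating:
    """One worker's rating of one image under a fixed-exposure protocol (hypothetical schema)."""
    worker_id: str
    image_id: str
    valence: int                                   # e.g., 1-9 on a self-assessment-manikin-style scale
    arousal: int
    dominance: int
    categorical: List[str] = field(default_factory=list)  # chosen emotion words
    free_text: Optional[str] = None                # "other feelings" the worker typed in
    reshow_count: int = 0                          # times the worker clicked "Reshow Image"

def needs_review(r: ImmediateAffectRating, max_reshows: int = 3) -> bool:
    """Flag ratings where repeated reshowing suggests the response is no longer immediate."""
    return r.reshow_count > max_reshows
```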
2). Evoked Emotion—Test–Retest Reliability:
The data collection method proposed by Lu et al. [48] aimed to understand immediate affective responses to visual content, but it did not ensure retest reliability of affective picture stimuli over time and across a population. Many psychological studies, from behavioral to neuroimaging studies, have used visual stimuli that consistently elicited specific emotions in human subjects. While the IAPS and other pictorial datasets have validated their data, they have not examined the retest reliability or agreement over time of their picture stimuli.
To address this issue, Kim et al. [50] developed the Image Stimuli for Emotion Elicitation (ISEE) as the first set of stimuli for which there was an unbiased initial selection method and with images specifically selected for high retest correlation coefficients and high within-person agreement across time. The ISEE dataset used a subset of 10 696 images from the Flickr-crawled EmoSet. In the initial screening study, study participants rated stimuli twice for emotion elicitation across a one-week interval, resulting in the selection of 1620 images based on the number of ratings and retest reliability of each picture. Using this set of stimuli, a second phase of the study was conducted, again having participants rate images twice with a one-week interval, in which the researchers found a total of 158 unique images that elicited various levels of emotionality with both good reliability and good agreement over time. Fig. 2 shows 18 example images in the ISEE dataset.
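The selection logic described above can be sketched as follows. This is a minimal illustration, assuming per-image arrays of aligned ratings from two sessions a week apart; the thresholds are hypothetical, not the criteria actually used for ISEE.

```python
import numpy as np
from scipy.stats import pearsonr

def select_reliable_images(session1, session2, r_min=0.6, agree_min=0.7, tol=1.0):
    """
    session1, session2: dicts mapping image_id -> np.array of per-rater valence scores,
    with index k in both arrays referring to the same rater one week apart.
    Returns image_ids whose ratings are stable across the two sessions.
    """
    selected = []
    for image_id, v1 in session1.items():
        v2 = session2.get(image_id)
        if v2 is None or len(v1) < 3:
            continue
        # Retest reliability: correlation of the same raters' scores across the two sessions.
        r, _ = pearsonr(v1, v2)
        # Within-person agreement: fraction of raters whose two scores differ by at most `tol`.
        agreement = np.mean(np.abs(v1 - v2) <= tol)
        if r >= r_min and agreement >= agree_min:
            selected.append(image_id)
    return selected
```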
3). Expressed Emotion—Body:
In the field of expressed emotion recognition, the collection of data on bodily expressed emotions has received less attention compared to the more widely studied areas of facial expression and microexpression data collection. In addition, whereas earlier studies often relied on data collected in controlled laboratory environments, recent advancements in technology have made it possible to collect data in more naturalistic, real-world settings. These “in-the-wild” datasets are more challenging to collect, but they offer the opportunity to capture a more diverse range of emotions and expressions. Laboratory environments provide the advantage of advanced sensors for collecting data, such as Motion Capture (MoCap), body temperature, and brain electroencephalogram (EEG), and make it possible to capture self-identified rather than perceived emotional expression; however, it is impossible to accurately replicate the vast array of diverse real-world scenarios within a controlled laboratory setting.
Using video clips from movies, TV shows, and sporting and wedding events as a source of data for emotion recognition has several advantages. These videos provide a wide range of scenarios, environments, and situations that can be used to train computer systems to understand human behavior, expression, and movement. For instance, these videos have recorded scenes during natural and man-made disasters, providing valuable information for understanding human emotions under extreme conditions. In addition, a large proportion of video shots in movies are of outdoor human activities, providing a diverse range of contexts for training.
However, it is important to note that using publicly available video clips as a source of data has its limitations. One such limitation is that this approach can only capture perceived emotions, as opposed to self-identified emotions. In many applications, perceived emotions are a sufficient proxy for actual emotions, particularly when the goal is for robots to “perceive” or “express” emotions in a way that is similar to humans for efficient communication with humans. A further constraint is that the videos mainly feature staged or user-selected scenes, rather than depicting natural everyday interactions. This topic will be further explored in Section III-G.
Luo et al. [51] developed the first dataset for bodily expressed emotion understanding (BEEU), named the Body Language Dataset (BoLD), using this approach. The data collection pipeline is illustrated in Fig. 3. The researchers collected hundreds of movies from the Internet and cut them into short clips. An identified character with landmark tracking in a single clip is called an instance. They used the AMT platform for crowdsourcing emotion annotations of a total of over 48 000 instances. The emotion annotation included the VAD dimensional model [27] and 26 emotion categories [63].
B. Data Quality Assurance
Quality control is a crucial aspect of crowdsourcing, particularly for affect annotations. Different individuals may have varying perceptions of affect, and their understanding can be influenced by factors such as cultural background, current mood, gender, and personal experiences. Even an honest participant may provide uninformative affect annotations, leading to poor-quality data. In this case, the variance in the acquired affect annotations usually comes from two kinds of participants, i.e., dishonest ones, who give useless annotations out of economic motivation, and exotic ones, who give inconsistent annotations compared with others. The existence of exotic participants is inherent in emotion studies. The annotations provided by an exotic participant could be valuable when aggregating the final ground truth or investigating cultural or gender effects of affect. However, we typically want to reduce the risk of high variance caused by dishonest and exotic participants in order to collect generalizable annotations.
In the case of the BoLD dataset [51], five complementary mechanisms were used, including three online approaches (i.e., analyzing while collecting the data) and two offline (i.e., postcollection analysis), based on a recent technological breakthrough for crowdsourced affective data collection [49]. These mechanisms included participant EQ screening [64], an annotation sanity/consistency check [51], a gold-standard test based on control instances [51], and probabilistic multigraph modeling for reliability analysis [49].
Particularly critical is the probabilistic graphical model Ye et al. developed to jointly model subjective reliability, which is independent of supplied questions, and regularity [49]. For brevity of discussion, we focus on using the mode(s) of the posterior as point estimates. We assumed that each subject $j$ had a reliability parameter $\tau_j \in [0, 1]$ and regularity parameters $\alpha_j, \beta_j > 0$, characterizing their agreement behavior with the population, for $j = 1, \ldots, M$. We also used the parameter $\epsilon$ for the rate of agreement between subjects by pure chance. Let $\theta = \{\epsilon, \tau_j, \alpha_j, \beta_j : j = 1, \ldots, M\}$ be the set of parameters. Let $J_i \subseteq \{1, \ldots, M\}$ be a random subsample of the subjects who labeled the stimulus $i$. We also assumed that the sets $J_i$ were created independently of each other. For each image $i$, every subject pair from $J_i$, i.e., $(j, k)$ with $j \neq k$, had a binary indicator $z_{i,j,k}$ coding whether their opinions agreed on the respective stimulus. We assumed that $z_{i,j,k}$ was generated from a probabilistic process involving two latent variables. The first latent variable $T_j$ indicated whether subject $j$ was reliable or not. Given that it was binary, a natural choice of model was the Bernoulli distribution, $T_j \sim \mathrm{Bernoulli}(\tau_j)$. The second latent variable $R_j$, lying between 0 and 1, measured the extent to which subject $j$ agreed with other reliable responses. We used the beta distribution parameterized by $\alpha_j$ and $\beta_j$ to model $R_j$, i.e., $R_j \sim \mathrm{Beta}(\alpha_j, \beta_j)$, because it was a widely used and flexible parametric distribution for quantities on the interval [0, 1].
In a nutshell, $T_j$ is a latent switch (a.k.a. gate) that controls whether $z_{i,j,k}$ can be used for the posterior inference of the latent variable $R_j$. Hence, the researchers referred to the model as the gated latent beta allocation (GLBA). A graphical illustration of the model is shown in Fig. 4. If an uninformative annotator was in the subject pool, their reliability parameter was zero, though others could still agree with their answers by chance at a rate of $\epsilon$. On the other hand, if an individual was very reliable yet often provided controversial answers, their reliability could be one, while they typically disagreed with others, as indicated by their high irregularity.
We were interested in finding both types of participants. Most participants were between these two extremes. The quantitative characterization of participants by GLBA will assist in selecting subsets of the data collection for quality control or gaining a comprehensive understanding of subjectivity. For more details, please refer to [49], [65].
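To make the generative story above concrete, the following Python sketch simulates the gating idea using the notation introduced above ($\tau_j$, $\alpha_j$, $\beta_j$, $\epsilon$, $T_j$, $R_j$). It is a simplified illustration of the mechanism, assuming a particular way of sampling the agreement indicators; it is not the exact GLBA likelihood or the posterior inference procedure of [49].

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_agreements(tau, alpha, beta, epsilon, assignments):
    """
    Simplified generative sketch of the gating idea behind GLBA (not the exact likelihood).
    tau[j]            : reliability of subject j
    alpha[j], beta[j] : Beta parameters of j's regularity (agreement rate with reliable others)
    epsilon           : chance agreement rate
    assignments       : list of lists; assignments[i] holds subject indices who rated stimulus i
    Returns a dict mapping (i, j, k) -> binary agreement indicator z.
    """
    T = rng.binomial(1, tau)          # latent switch: is subject j reliable?
    R = rng.beta(alpha, beta)         # latent regularity of each subject
    z = {}
    for i, subjects in enumerate(assignments):
        for j in subjects:
            for k in subjects:
                if j == k:
                    continue
                # If the gate T[j] is off, j agrees with k only by chance (rate epsilon);
                # if the gate is on, j agrees at its regularity rate R[j].
                p_agree = R[j] if T[j] == 1 else epsilon
                z[(i, j, k)] = rng.binomial(1, p_agree)
    return z

# Example: four subjects, two stimuli each rated by three subjects.
z = simulate_agreements(
    tau=np.array([0.9, 0.8, 0.1, 0.95]),
    alpha=np.array([8.0, 6.0, 2.0, 9.0]),
    beta=np.array([2.0, 3.0, 6.0, 1.0]),
    epsilon=0.25,
    assignments=[[0, 1, 2], [1, 2, 3]],
)
```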
A recent study [66] presented a Python-based software program called MuSe-Toolbox, which combines emotion annotations from multiple individuals. The software includes several existing annotation fusion methods, such as estimator weighted evaluator (EWE) [67] and generic-canonical time warping (GCTW) [68]. In addition, the authors have developed a new fusion method based on EWE, named rater-aligned annotation weighting (RAAW), which is also included in the software. Furthermore, MuSe-Toolbox includes the capability to convert continuous emotion annotations into categorical labels.
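A minimal sketch of EWE-style fusion is shown below. It follows the commonly used formulation of weighting each rater by the correlation of their trace with the mean of the other raters; it is not the MuSe-Toolbox implementation, which additionally offers RAAW, GCTW, and categorical conversion.

```python
import numpy as np

def ewe_fuse(annotations: np.ndarray) -> np.ndarray:
    """
    Evaluator-weighted-estimator (EWE)-style fusion of continuous annotations.
    annotations: array of shape (n_raters, n_timesteps), one valence/arousal trace per rater.
    Each rater is weighted by the correlation of their trace with the mean of the others;
    negative weights are clipped to zero so disagreeing raters are ignored rather than inverted.
    """
    n_raters = annotations.shape[0]
    weights = np.zeros(n_raters)
    for k in range(n_raters):
        others_mean = annotations[np.arange(n_raters) != k].mean(axis=0)
        weights[k] = max(np.corrcoef(annotations[k], others_mean)[0, 1], 0.0)
    if weights.sum() == 0:            # fall back to a plain mean if no rater correlates
        return annotations.mean(axis=0)
    return (weights[:, None] * annotations).sum(axis=0) / weights.sum()

# Example: three raters annotating ten timesteps of valence.
traces = np.vstack([np.linspace(0, 1, 10),
                    np.linspace(0, 1, 10) + 0.05,
                    np.random.default_rng(1).uniform(-1, 1, 10)])
fused = ewe_fuse(traces)
```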
C. Existing Datasets
Several recent literature surveys have provided an overview of existing datasets for emotions in visual media. In order to avoid duplication of effort, readers are directed to these papers for further information, which includes surveys on evoked emotion [16], BEEU [17], FER [4], MER [7], and multimodal emotion [22], [69], [70]. Table 1 presents a comparison of the properties of some representative datasets. Researchers are advised to thoroughly review the data collection protocol used before utilizing a dataset to ensure that the data have been collected in accordance with appropriate psychological guidelines. In addition, when crowdsourcing is utilized, effective mechanisms are essential for filtering out uninformative annotations.
Table 1. Comparison of the Properties of Some Representative Datasets

| Dataset Name | Labeled Samples | Data Type | Categorical Emotions | Continuous Emotions | Lab Controlled | Year | Primary Application |
|---|---|---|---|---|---|---|---|
| IAPS [45] | 1.2k | I | - | VAD | | 2005 | Evoked |
| FI [71] | 23.3k | I | 8§ | - | | 2016 | Evoked |
| VideoEmotion-8 [72] | 1.2k | V | 8† | - | | 2014 | Evoked |
| Ekman-6 [73] | 1.6k | V | 6† | - | | 2018 | Evoked |
| E-Walk [74] | 1k | V | 4‡ | - | | 2019 | BEEU |
| BoLD [51] | 13k | V | 26† | VAD | | 2020 | BEEU |
| iMiGUE [75] | 0.4k | V | 2 | - | | 2021 | BEEU★ |
| CK+ [76] | 0.6k | V | 7‡ | - | ✓ | 2010 | FER |
| Aff-Wild [77] | 0.3k | V | - | VA | | 2017 | FER |
| AffectNet [78] | 450k | I | 7‡ | - | | 2017 | FER |
| EMOTIC [79] | 34k | I | 26† | VAD | | 2017 | FER* |
| AFEW 8.0 [80] | 1.8k | V | 7‡ | - | | 2018 | FER |
| CAER [81] | 13k | V | 7‡ | - | | 2019 | FER* |
| DFEW [82] | 16k | V | 7‡ | - | | 2020 | FER |
| FERV39k [83] | 39k | V | 7‡ | - | | 2022 | FER |
| SAMM [84] | 0.2k | V | 7‡ | - | ✓ | 2016 | MER |
| CAS(ME)2 [85] | 0.06k | V | 4‡ | - | ✓ | 2017 | MER |
| ICT-MMMO [86] | 0.4k | V,A,T | - | Sentiment | | 2013 | Multi-Modal |
| MOSEI [87] | 23.5k | V,A,T | 6† | Sentiment | | 2018 | Multi-Modal |

† A superset of Ekman’s basic emotions. ‡ Ekman’s basic emotions + neutral. § Mikels’ emotions. ★ Micro-gesture understanding and emotion analysis dataset. * Context-aware emotion dataset. Data Type Key: (I)mage, (V)ideo, (A)udio, (T)ext.
D. Data Representations
After the data collection and quality assurance stages, a significant technological challenge is to represent the emotion-relevant information present in the raw data in a concise form. While current deep neural network (DNN) approaches often utilize raw data, such as matrices of pixels, as input in the modeling process, utilizing a compact data representation can potentially improve the efficiency of the learning process, allowing for larger-scale experiments to be conducted with limited computational resources. In addition, a semantically meaningful data representation can facilitate interpretability, which is crucial for certain applications. There are numerous methods for compactly representing raw visual data, and we discuss several intriguing or widely used data representations for emotion modeling in the following.
1). Roundness, Angularity, Simplicity, and Complexity:
Colors and textures are commonly used in image analysis tasks to represent the content of an image, but research has shown that shape can also be an effective representation when analyzing evoked emotions. In both visual art and psychology, the characteristics of shapes, such as roundness, angularity, simplicity, and complexity, have been linked to specific emotional responses in humans. For example, round and simple shapes tend to evoke positive emotions, while angular and complex shapes evoke negative emotions. Leveraging this understanding, Lu et al. [46] developed a system that predicted evoked emotion based on line segments, curves, and angles extracted from an image. They used ellipse fitting to implicitly estimate roundness and angularity, and used features from the visual elements to estimate complexity. Later, they developed algorithms to explicitly estimate these representations [48]. Fig. 5 shows some example images with different levels of roundness, angularity, and simplicity. The researchers found that these three physically interpretable visual constructs achieved comparable classification accuracy to the hundreds of shape, texture, composition, and facial feature characteristics previously examined. This result was thought-provoking because just a few numerical-value representations could effectively predict evoked emotions.
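As a loose illustration of how a handful of interpretable shape statistics can be computed, the following OpenCV-based Python sketch estimates rough proxies for roundness, angularity, and simplicity from detected contours. It only mimics the spirit of the approach and is not the algorithm of Lu et al. [46], [48]; the thresholds and formulas are assumptions.

```python
import cv2
import numpy as np

def shape_descriptors(gray: np.ndarray) -> dict:
    """Rough, illustrative proxies for roundness, angularity, and simplicity of visual elements."""
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if len(c) >= 5]        # fitEllipse needs at least 5 points
    if not contours:
        return {"roundness": 0.0, "angularity": 0.0, "simplicity": 1.0}
    axis_ratios, corner_counts = [], []
    for c in contours:
        _, (w, h), _ = cv2.fitEllipse(c)                   # implicit roundness via ellipse fitting
        if max(w, h) > 0:
            axis_ratios.append(min(w, h) / max(w, h))      # 1.0 = circular, near 0 = elongated
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        corner_counts.append(len(approx))                  # many corners = more angular
    return {
        "roundness": float(np.mean(axis_ratios)) if axis_ratios else 0.0,
        "angularity": float(np.mean(corner_counts)),
        "simplicity": 1.0 / (1.0 + len(contours)),         # fewer visual elements = simpler scene
    }

# Example on a synthetic image containing one circle.
img = np.zeros((200, 200), dtype=np.uint8)
cv2.circle(img, (100, 100), 60, 255, 2)
print(shape_descriptors(img))
```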
2). Facial Action Coding System (FACS) and Facial Landmarks:
People use particular facial muscles to express certain facial expressions. For instance, people can express anger by frowning and pursing their lips. Consequently, each facial expression can be viewed as a combination of some facial muscle movements. Ekman and Friesen [88] developed the FACS in 1976, which encodes all movements of facial muscles. FACS defines a total of 32 atomic facial muscle actions, called Action Units (AUs), including Lids Tight (AU7), Cheek Raise (AU6), and so on. By detecting all AUs of a person and linking them to specific expressions, we can identify the individual’s facial expressions.
The problem of AU detection can be approached as a multilabel binary classification problem for each AU. Early work on AU detection used facial landmarks to identify regions of corresponding muscles and then applied neural networks [89] or support vector machines (SVMs) [90] for classification. More recent work has developed end-to-end AU detection networks [91]. Survey papers provide detailed introductions to facial AU detection [92], [93] and face landmark detection [94], [95]. Some researchers also used facial landmarks directly as a representation of facial information in their recognition work.
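The multilabel framing can be sketched as follows in PyTorch: one sigmoid output per AU on top of any face feature extractor, trained with an independent binary loss per AU. The feature size, AU count, and decision threshold are placeholders, not a published architecture.

```python
import torch
import torch.nn as nn

class AUHead(nn.Module):
    """Minimal sketch of AU detection as multilabel binary classification."""
    def __init__(self, feature_dim: int = 512, num_aus: int = 32):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_aus)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.classifier(features)              # raw logits, one per AU

head = AUHead()
features = torch.randn(8, 512)                        # backbone features for 8 face crops
logits = head(features)
targets = torch.randint(0, 2, (8, 32)).float()        # ground-truth AU activations
loss = nn.BCEWithLogitsLoss()(logits, targets)        # independent binary loss per AU
present = torch.sigmoid(logits) > 0.5                 # predicted active AUs
```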
3). Body Pose and Body Mesh:
People can express emotions through body posture and movement. By manipulating the positioning of body parts (e.g., the shoulders and arms), people produce various postures and movements. The coordinates of human joints can serve as a representation of body language, reflecting the individual’s bodily expression. In the field of computer vision, 2-D pose estimation is a well-studied task for detecting the 2-D position of human joints in an image. Leveraging large-scale 2-D pose datasets (e.g., COCO [96]), researchers have proposed several high-performing pose networks [97], [98]. Even with challenging scenes, such as crowded or occluded scenes, these networks are able to provide comprehensive joint detection and linking.
However, 2-D pose estimation does not fully capture the 3-D nature of human posture and movement. 3-D human pose estimation, on the other hand, aims to predict the 3-D coordinates of human joints in space. Single-person 3-D pose estimation methods determine the 3-D joint coordinates relative to the person’s root joint (i.e., the torso) [99], [100]. In addition, some multiperson 3-D pose estimation approaches comprehensively estimate the absolute distance between the camera and the individuals in the image [101].
3-D human mesh estimation, which provides the 3-D coordinates of each point on the human mesh, is a further extension of the 3-D pose estimation. Researchers often utilize SMPL [103], [104] or other human graph models to represent the mesh. Fig. 6 illustrates an example of 2-D pose, 3-D pose, and 3-D mesh automatically generated from a scene with multiple people [102].
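As a small illustration of how joint coordinates can serve as a representation for downstream emotion models, the sketch below converts raw 3-D joints into a root-relative, scale-normalized feature vector. The joint count, root index, and reference pair are hypothetical, not a standard skeleton definition.

```python
import numpy as np

def normalize_pose(joints: np.ndarray, root: int = 0, ref_pair=(0, 1)) -> np.ndarray:
    """
    Turn raw 3-D joint coordinates into a root-relative, scale-normalized feature vector.
    joints: (num_joints, 3) array; `root` and `ref_pair` are illustrative indices.
    """
    centered = joints - joints[root]                          # remove global translation
    scale = np.linalg.norm(joints[ref_pair[0]] - joints[ref_pair[1]])
    if scale > 0:
        centered = centered / scale                           # remove body-size/camera-distance effects
    return centered.reshape(-1)                               # flat feature for a downstream model

pose = np.random.default_rng(0).normal(size=(17, 3))          # e.g., a 17-joint skeleton
feature = normalize_pose(pose)
```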
While human pose and human mesh representations provide a higher level of abstraction compared to low-level representations, such as raw video or optical flow, there is still a significant gap between these intermediate-level representations and high-level emotion representations. To bridge this gap, researchers have proposed intermediate-level representations that effectively describe human movements. Specifically, Laban movement analysis (LMA) is a movement coding system initially developed by the dance community, similar to sheet music for describing music. Fig. 7 illustrates the layers of data representation for BEEU, from low pixel-level representations to the ultimate high-level emotion representation. Elevating each layer higher in this information pyramid requires considerable advancements in abstraction technology.
E. Human Movement Coding and LMA
Expressive human movement is a complex phenomenon. The human body has 360 joints, many of which can move various distances at different velocities and accelerations, and in two (and depending on the joint, sometimes more) different directions, resulting in an astronomical number of possible combinations. These variables create an infinite number of movements and postures that can convey different emotions. Beyond this array of body parts moving in space, the expressive movement also involves multiple qualitative motor components that appear in unique patterns in different individuals and situations. The complexity of human movement, thus, raises the question: how do we determine which of the numerous components present in the expressive movement are significant to code for emotional expression in movement? Thus, when choosing a coding system, the early stages of each research project can benefit from deeply considering which aspects of movements are central to the expression being studied. A multistage methodology, such as first identifying what is potentially relevant and then using preliminary analyses to refine the selection of movements most promising to code, can be helpful before selecting a method to code or quantify the multitude of variables present in unscripted movement (e.g., [105] and [106]).
After deliberating about which movement variables are relevant and meaningful, we must then consider the three main types of coding systems that have been used in various fields, such as psychology, computer vision, animation, robotics, and AI.
1) Lists of specific motor behaviors that have been found in scientific studies to be typical to the expressions of specific emotions, such as head down and moving slowly as characterizing sadness; moving backward and bringing the arms in front of the body as characterizing fear; jumping, expanding, and upward movements as characterizing happiness; and so on (for review of these studies and lists of these behaviors, see [107] and [108]).
2) Kinematic description of the human body models, such as skeleton-based models, contour-based models, or volume-based models. Most work in the field of emotion recognition is based on skeleton-based models [109]. This type of model uses 3-D coordinates of markers that were placed on (using various MoCap systems) or were mapped (using pose estimation techniques) to the main joints of the body to create a moving “skeleton,” which enables researchers to quantitatively capture the movement kinematics (e.g., [110] and [111]).
3) LMA, a comprehensive movement analysis system that depicts qualitative aspects of movement and, theoretically [112], [113], as well as through scientific research [114], [115], relates the different LMA motor elements (movement qualities) to cognitive and emotional aspects of the moving individual.
The first coding system, based on lists of motor behaviors, was primarily used in earlier studies in the field of psychology, where the encoding and decoding of motor behaviors into emotions and vice versa were done manually by human coders in a labor-intensive process. Other limitations of this coding system include the following.
1) It is based on a limited number of behaviors that have been used in prior scientific studies. However, people can physically express emotions in many different ways, so the list of previously observed and studied whole-body expressions may not be exhaustive or inclusive of cultural variations.
2) Likewise, because each study used different lists of behaviors, this method makes it difficult to compare results or to review them additively to arrive at larger verification. Thus, this coding system may miss parts of the range of bodily emotional expression, such as those never observed and coded before. This limitation is especially pronounced because many of these previous studies examined emotional expressions performed by actors, whose movement tended to rely upon more stereotypical bodily emotional expressions that were widely recognized by the audience, rather than naturally occurring motor expressions.
When using the second coding system, kinematic description, in particular the skeleton-based models, researchers usually employ a set of markers similar to, or smaller than, that provided by the Kinect Software Development Kit (SDK), and transform a large amount of 3-D data into various low- and/or high-level kinematic and dynamic motion features, which are easier to interpret than the raw data. Researchers using this method have studied specific features, such as head and elbow angles [111], maximum acceleration of the hand and elbow relative to the spine [116], and distance between joints [117], among others (a minimal sketch of computing such features appears after the list of limitations below). For a review of such studies, please refer to [17]. In recent years, instead of computing handcrafted features on the sequence of body joints, researchers have employed various deep learning methods to generate representations of the dynamics of the human body embedded within the joint sequence, such as the spatiotemporal graph convolutional network (ST-GCN) [51], [118], [119]. Although the 3-D data from joint markers can provide a relatively detailed, objective description of whole-body movement, this coding system has two main limitations.
1) Movement is often captured by a camera from a single view (usually the frontal), which can result in long segments of missing data from markers that are hidden by other body parts, people, or objects in a video frame. Automatic imputations of such missing data are often impractical as they tend to create incorrect and unrealistic movement trajectories.
2) The SDK system has only three markers along the torso, which are insufficient for capturing subtle movements in the chest area, movements that are usually observed during authentic emotional expressions, as opposed to acted (and often exaggerated) bodily emotional expressions. Another disadvantage is that these systems have not yet been able to successfully and reliably detect many qualitative changes in movement that are significant for perceiving emotional expression.
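The sketch referenced above illustrates the kind of handcrafted kinematic features discussed here (speeds, accelerations, and inter-joint distances), computed from a joint sequence. The specific feature set, frame rate, and skeleton size are illustrative assumptions, not a published configuration.

```python
import numpy as np

def kinematic_features(seq: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """
    Handcrafted kinematic features computed from a joint sequence of shape (T, J, 3).
    The feature choices here are illustrative, not a published feature set.
    """
    dt = 1.0 / fps
    vel = np.diff(seq, axis=0) / dt                          # per-joint velocity, (T-1, J, 3)
    acc = np.diff(vel, axis=0) / dt                          # per-joint acceleration, (T-2, J, 3)
    speed = np.linalg.norm(vel, axis=-1)                     # (T-1, J)
    acc_mag = np.linalg.norm(acc, axis=-1)                   # (T-2, J)
    # Pairwise joint distances averaged over time capture how expanded or contracted the body is.
    diffs = seq[:, :, None, :] - seq[:, None, :, :]          # (T, J, J, 3)
    pair_dist = np.linalg.norm(diffs, axis=-1).mean(axis=0)  # (J, J)
    iu = np.triu_indices(seq.shape[1], k=1)
    return np.concatenate([
        speed.mean(axis=0), speed.max(axis=0),               # average and peak speed per joint
        acc_mag.max(axis=0),                                 # peak acceleration per joint
        pair_dist[iu],                                       # mean inter-joint distances
    ])

clip = np.random.default_rng(0).normal(size=(90, 17, 3))     # 3 s of a 17-joint skeleton at 30 fps
feats = kinematic_features(clip)
```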
In contrast to the quantitative data from joint markers, which enable the capture of detailed movement of every body part, the third coding system mentioned above, LMA, describes qualitative aspects of movement and can relate to a general impression from movements of the entire body or to the movement of specific body parts. By identifying the association between LMA motor components and emotions, and characterizing the typical kinematics of these components using high-level features (e.g., [51], [120], [121], [122], [123], and [124]), researchers can overcome the limitations of other coding systems. If people express their emotions with movements that have never been observed in previous studies, we can still decode their emotions based on the quality of their movement. Similarly, if parts of the body, which are usually used to express a certain emotion, are not visible, it is possible that the emotion could still be decoded by identifying the motor elements composing the visible part of the movement. Moreover, by slightly changing the kinematics of a movement of a robot or animation (i.e., adding to a gesture or a functional movement the kinematics of certain LMA motor elements associated with a specific emotion), we can “color” this functional movement with an expression of that emotion, even when the movement is not the typical movement for expressing that emotion (e.g., [125] and [126]). Similarly, identifying the quality of a movement can enable decoding the expressed emotion even from functional actions, such as walking [127] or reaching and grasping. These advantages and the fact that LMA features have been found to be positively correlated with emotional expressions are why LMA coding is becoming popular in studies that encode or decode bodily emotion expressions (e.g., [51] and [123]). In addition, LMA offers the option to link our coding systems to diverse ways in which humans talk about and describe expressive movement—it is a comprehensive movement-theory system that is and can be used across disciplines for application in acting [128], therapy [129], education [130], and animation [131], among others. The last advantage to consider is that LMA is a comprehensive theory of body movement, much like art theory or music theory, including theories of harmony, and, thus, has been used by artists to attune to aesthetics including movement-perception of visual art (such as that discussed in Section VI-A) and visual, auditory, and movement elements of film and theater (as discussed in Section III-G). Like music theory, LMA is capable of attending to rhythm and phrasing as elements shift and unfold over time, aspects that may be crucial to communicating and interpreting emotional expression.
LMA identifies four major categories of movement: Body, Effort, Shape, and Space. Each category encompasses several subsets of motor components (LMA terms are spelled with capital letters to differentiate them from regular usage of these words). Fig. 8 illustrates some basic components of LMA, which are often used in coding.
The Body category describes what is moving, and it is composed of the elements of Body Segments (e.g., arms, legs, and head), their coordination, and basic Body Actions, such as locomotion, jump, rotation, change of support, and so on.
The Effort category describes the qualitative aspect of movement, or how we move. It expresses a person’s inner attitude toward movement, and it has four main factors, each describing the continuum between two extremes: indulging in the motor quality of that factor and fighting against that quality. The four Effort factors are as follows.
1) Weight effort, meaning the amount of force or pressure exerted by the body. Activated Weight Effort can be Strong or Light. Alternatively, there may be a lack of weight activation when we give in to the pull of gravity, which we describe as Passive or Heavy Weight.
2) Space effort describes attention to space, denoting the focus or attitude toward a chosen pathway, i.e., is the movement Direct or flexibly Indirect.
3) Time effort, describing the mover’s degree of urgency or acceleration/deceleration involved in a movement, i.e., is the movement Sudden or Sustained.
4) Flow effort, reflecting the element of control or the degree to which a movement is Bound, i.e., restrained or controlled by muscle contraction (usually cocontraction of agonist and antagonist’s muscles), versus Free, i.e., being released and liberated.
The Shape category reflects why we move: Shape describes how the body adapts its shape as we respond to our needs or the environment: do I want to connect with or avoid something, dominate, or cower under? The way the body sculpts itself in space reflects a relationship to self, others, or to the environment. This component includes Shape Flow, which describes how the body changes to relate to oneself, and Shape Change, which describes changes in the form or shape of the body and comprises the motor components of Expanding or Condensing the body in all directions, Rising or Sinking in the vertical dimension, Spreading and Enclosing in the horizontal dimension, and Advancing or Retreating in the sagittal dimension. Another Shape component is Shaping and Carving, which describes how a person shapes their body to shape or affect the environment or other people. For example, when we hug somebody, we might shape and carve the shape of our body, adjusting it to the shape of the other person’s body, or we might shape our ideas by carving or manipulating them through posture and gesture.
The Space category describes where the movement goes in the environment. It describes many spatial factors, such as the Direction, where the movement goes in space, such as Up and Down in the vertical dimension, Side open and Side across in the horizontal dimension, and Forward and Backward in the sagittal dimension; the Level of the movement in space relative to the entire body or parts of the body, such as Low level (movement toward the ground), Middle level (movement maintaining level, without lowering or elevating), or High level (moving upward in space); Paths or how we travel through space by locomoting; and Pathways through the Kinesphere (the personal bubble of reach-space around each mover that can be moved in and through without locomoting or traveling). Movement in the Kinesphere might take Central pathways, crossing the space close to the mover’s body center, Peripheral pathways along the periphery of the mover’s reach space, or Transverse pathways cutting across the reach space.
In addition, another important aspect of LMA that is particularly helpful and meaningful to expression is the Phrasing of movements. Phrasing describes changes over time, such as changes in the intensity of the movement over time, similar to musical phrases, which can be Increasing, Decreasing, or Rhythmic, among others. It can also depict how movement components shift during the same action or a series of actions occurring over time, for example, beginning emphatically with strength and then ending by making a light direct point conclusion.
Previous research has highlighted the lack of a notation system that directly encodes the correspondence between bodily expression and body movements in a way similar to FACS for face [51], [105]. LMA, by its nature, has the potential to serve as such a coding system for emotional expressions through body movement. Shafir et al. [115] identified the LMA motor components (qualities), whose existence in a movement could evoke each of the four basic emotions: anger, fear, sadness, and happiness. Melzer et al. [114] identified the LMA components whose existence in a movement caused that movement to be identified as expressing one of those four basic emotions. In an additional experiment by the Shafir group, Gilor studied, for her Master’s thesis, the motor elements used for expressing sadness and happiness. In this series of studies, the LMA motor components found to evoke these emotions through movement were, for the most part, the same as those used to identify or recognize each emotion from movement, or to express each emotion through movement. For example, anger was associated with Strong Weight Effort, Sudden Time Effort, Direct Space Effort, and Advancing Shape during both emotion elicitation and emotion recognition. Fear was associated with Retreating Shape and moving backward in Space for both emotion elicitation and recognition. Enclosing and Condensing Shape and Bound Flow Effort were additionally found for emotion elicitation through movement. Sadness was associated with Passive Weight Effort, Sinking Shape, and Head drop for emotion elicitation, emotional expression, and emotion recognition, and arms touching the upper body were also significant indicators for emotion elicitation and expression. Sadness expression was also associated with using Near-Reach Space and Stillness. In contrast, Happiness was associated with jumping, rhythmic movement, Free and Light Efforts, Spreading and Rising Shape, and moving upward in Space for emotion elicitation, emotion expression, and emotion recognition. Happiness expression was also associated with Sudden Time Effort and Rotation. These findings for Happiness were also validated by van Geest et al. [132]. While these studies represent a promising start, further research is needed to create a comprehensive LMA coding system for bodily emotional expressions.
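To illustrate how such findings could eventually feed a coding system, the Python sketch below maps detected LMA components to coarse emotion scores. The component names, the groupings, and the scoring rule are hypothetical simplifications loosely based on the associations summarized above; they are not a validated coding scheme.

```python
# Illustrative mapping from detected LMA components to coarse emotion scores, loosely based
# on the associations summarized above (Shafir et al. [115], Melzer et al. [114]).
EMOTION_COMPONENTS = {
    "anger":     {"Strong Weight", "Sudden Time", "Direct Space", "Advancing"},
    "fear":      {"Retreating", "Condensing", "Enclosing", "Bound Flow", "Moving Backward"},
    "sadness":   {"Passive Weight", "Sinking", "Head Drop", "Arms To Upper Body"},
    "happiness": {"Jumping", "Rhythmic", "Free Flow", "Light Weight", "Spreading", "Rising"},
}

def score_emotions(detected_components: set) -> dict:
    """Score each emotion by the fraction of its associated LMA components that were detected."""
    return {
        emotion: len(components & detected_components) / len(components)
        for emotion, components in EMOTION_COMPONENTS.items()
    }

print(score_emotions({"Sudden Time", "Direct Space", "Strong Weight"}))  # anger scores highest
```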
F. Context and Functional Purpose
While human emotions are shown through the face and body, they are closely connected to context and purpose. Thus, the image context surrounding a person can also be used to further identify their emotions. The context information includes the semantic information of the background and what the person is holding, which can assist in identifying the person’s activities, thereby allowing for more accurate prediction of their emotion. For instance, people are more likely to feel happy than sad during a birthday party. Context information also includes interactions among people, which can help infer emotion. For example, people are more likely to be angry when they are engaged in a heated argument with others.
Since the early 2000s, researchers in the field of image retrieval have been developing context recognition systems using machine learning and statistical modeling [133], [134], [135]. With the advent of deep learning and the widespread use of modern graphics processing units (GPUs) or AI processors, accurate object annotation from images has become more feasible. Several deep learning-based approaches have been proposed to leverage context information to enhance basic emotion recognition networks [81], [136], [137]. Sections IV-B3 and IV-E provide more details.
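As one way to picture such context-aware designs, the following PyTorch sketch fuses a person-crop stream with a whole-scene stream before classification. The toy backbones, feature dimensions, and 26-category head are placeholder assumptions, not any of the published architectures [81], [136], [137].

```python
import torch
import torch.nn as nn

class BodyContextFusion(nn.Module):
    """Minimal sketch of a two-stream, context-aware emotion recognition model."""
    def __init__(self, feat_dim: int = 256, num_emotions: int = 26):
        super().__init__()
        def small_cnn():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            )
        self.body_stream = small_cnn()       # encodes the cropped person
        self.context_stream = small_cnn()    # encodes the whole scene (person optionally masked out)
        self.head = nn.Linear(2 * feat_dim, num_emotions)

    def forward(self, body_crop: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.body_stream(body_crop), self.context_stream(scene)], dim=1)
        return self.head(fused)              # logits over emotion categories

model = BodyContextFusion()
logits = model(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128))
```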
In addition to contextual information, the functional purpose of a person’s movement can also provide valuable insights when inferring emotions. Movement is a combination of both functional and emotional expression, and thus, emotion recognition systems must be able to differentiate between movement that serves a functional purpose and movement that expresses emotions. Action recognition [138], an actively studied topic in computer vision, has the potential to provide information on the function of a person’s movement and assist in disentangling functional and emotional components of movement.
G. Acted Portrayals of Emotion
Research on emotions has often turned to acted portrayals of emotion to overcome some of the challenges inherent in accessing adequate datasets, particularly because evoking intense, authentic emotions in a laboratory can be problematic both ethically [139] and practically. This is because, as noted in Section II, emotional responses vary. Datasets relying upon actors have been useful in overcoming challenges of obtaining ample, adequately labeled emotion-expression data because, compared to unscripted behavior (“in the wild”), where emotion expression often appears in blended or complex ways, actors are typically serving a narrative, in which emotion is part of the story. In addition, sampling emotional expression in the wild encounters cultural distinctions for the expressions themselves, as well as social norms for emotion expression or its repression [140], which may also be culturally scripted for gender. Thus, researchers interested in emotional expressivity and nonverbal communication of emotion often turn to trained actors [139] both to generate new datasets of emotionally expressive movement (e.g., [141]) and for the sampling of emotion expression (e.g., [51]). Such datasets are useful because actors coordinate all three channels of emotion expression, namely, vocal, facial, and body, to produce natural or authentic-seeming expressions of emotions. Some researchers have validated the perception of actor-based and expert movement-based datasets in the lab by showing them to lay observers [114], [141]. This approach also has drawbacks: while it may capture norms of emotion expression and its clear communication, it may miss distinctions related to demographics, such as gender [142], ethnolinguistic heritage, and individual or generational norms [143], [144]. According to cultural dialect theory, we all have nonverbal emotion accents [144], meaning that emotion is expressed differently by different people in different regions, from different cultures. Only some of those cultural dialects appear when sampling films. Such films have often been edited so that viewers beyond that cultural dialect can “read” the emotional expression central to the narrative. Nuance in nonverbal dialects may be excluded in favor of general appeal.
Yet, an advantage of generating datasets from actors or other trained expressive movers is that ground truth can be better established. The intention of the emotion expressed, the emotion felt, and later the emotion perceived from viewing can all be assessed when generating the dataset. Likewise, because actors coordinate image, voice, and movement in the service of storytelling, the context and purpose are clarified, and thus, the multiple expressive modes can be organized into a coherent individual emotion expression [145], [146]. Moreover, because performing arts productions, such as movies and filmed theater, music, and dance performances, integrate multiple communication modes, the creative team collaborates frequently about the emotional tone or intent of each work, coordinating lighting, scenery/background, objects, camerawork, and sound with the performers (actors, dancers, and musicians). While the team articulates their intentions during the creative process, the resulting produced art often resonates differently with different audiences, according to their perceptions and tastes.
For researchers relying upon acted emotions, it may be helpful to understand some ways in which actor training considers the role of emotions in theater and film [147]. The role of emotion in narrative arts may reflect some of the theories about the role of emotion itself [148]: to roughly inform the character (organism) about their needs, to drive them to take action to meet those needs, or to provide feedback on how recent actions meet or do not meet those needs. As actors prepare, they identify a character’s needs (called their objective) and are moved by emotion to drive the character’s action to overcome obstacles as they pursue that objective. Thus, when collecting data from acted examples, emotion expression can often be found preceding action in the narrative or in response to it. Actors are also trained to listen with their whole being to their scene partner and to “hold space” during dialog for emotion expression to fully unfold and complete its purpose of moving either the speaker or the listener to response or action. This expectation is particularly true in opera and other musical theater genres, which often extend the time available for emotion expression.
In terms of 3-D modeling, actors trained for theater not only highly develop the body’s specific emotion-related action tendencies but also consider how viewers perceive them in three dimensions, for example, when performing on a thrust stage or in theater-in-the-round. Thus, their emotion expression may be more easily picked up by systems that mark parts of the body during movement. Because bodily emotion expression is so crucial to a narrative, an important application of this field might be to automate audio description of emotion expression in film for the visually impaired, or audio description of movement features salient to emotion expression.
We often do not recognize subtle emotion expression in strangers, though we can in those we know. Similarly, when actors portray characters within the circumstances of a play, the audience gradually, over the arc of the play, comes to perceive the emotional expression of each character as it is revealed over time. Understanding how this works in art can help us develop systems that take the time to learn and better understand emotion expression in diverse contexts and individuals, similar to how current voice recognition learns individual accents.
H. Cultural and Gender Dialects
In a meta-analysis examining cultural dialects in the nonverbal expression of emotion, measurable variations were found in responses to facial expressions across cultures [149]. Moreover, it has long been acknowledged that learned display rules [150] and decoding rules [151] vary across different cultures. For example, in Japanese culture, overt displays of negative emotional expression are discouraged, and smiling in this context is often considered an attempt to conceal negative emotions. In contrast, in American culture, negative expressions are considered more appropriate to display [151].
Previous research has similarly demonstrated a powerful influence of gender-related emotion stereotypes in driving differential expectations for the type of emotions that men and women are likely to experience and express. Men are expected to experience and express more power-related emotions, such as anger and contempt, whereas women are expected to experience and express more powerless emotions, such as sadness and fear [152]. These findings match self-reported emotional experiences and expressions with strong cultural consistency across 37 different countries worldwide [153]. Furthermore, there are even differences in the extent to which neutral facial appearance resembles different emotions, with male faces physically resembling anger expressions more (low, hooded brows, thin lips, and angular features) and female faces resembling fear expressions more [154].
While cultural norms affect how and whether emotions are displayed, such norms also influence how displays of emotion are perceived. For example, when a culture’s norm is to express emotion, less intensity is read into such displays, whereas, in cultures where the norm is not to express emotion intensely, the same displays are read as more intense [155]. In this way, visual percepts derived from objective stimulus characteristics can generate different subjective experiences based on culture and other individual factors. For instance, there is notable cultural variation in the extent to which basic visual information, such as background versus foreground, is integrated into observers’ perceptual experiences [156].
Culture and gender add complexity to human emotional expression, yet little research to date has examined individual variation in responses to visual scenes, either in terms of basic aesthetics or the emotional responses that people have. Future work assessing simple demographic details (e.g., gender and age) will begin to explore this important source of variation.
I. Structure
Some basic visual properties have been found that characterize positive versus negative experiences and preferences. Most notably, round features—whether represented in faces or objects—elicit feelings of positivity and warmth, and tend to be preferred over angular visual features. This preference has been used to explain the roundness of smiling faces and the angularity of anger displays [157]. Such visual curvature has also been found to influence attitudes toward and preference for even meaningless patterns represented in abstract visual designs [158]. The connection between affective responses and these basic visual forms has helped computer vision predict emotions evoked from pictorial scenes, as mentioned earlier [46].
Importantly, the dimensional approach to assessing visual properties underlying emotional experience can be used to examine both visual scenes and faces found within those scenes. Indeed, the dimensional approach adequately captures both “pure” and mixed emotional facial expressions [159], as well as affective responses to visual scenes, as demonstrated by the IAPS. Critically, even neutral displays have been found to elicit strong spontaneous impressions [160], ones that are effortless, nonreflective, and highly consensual. Recent research utilizing computer-based models suggests that these inferences are largely driven by the same physical properties found in emotional expressions. For instance, Said et al. [161] employed a Bayesian network trained to detect expressions in faces and then applied this analysis to images of neutral faces that had been rated on a number of common personality traits. The results showed that the trait ratings of faces were meaningfully associated with the perceptual resemblances that these “neutral” faces had with emotional expressions. Thus, these results speak to a mechanism of perceptual overlap, whereby expression and identity cues can both trigger similar emotional responses.
A reverse engineering model of emotional inferences has suggested that perceptions of stable personality traits can be derived from emotional expressions as well [162]. This work implicates appraisal theory as a primary mechanism by which observing facial expressions can inform stable personality inferences made of others. This account suggests that people use appraisals that are associated with specific emotions to reconstruct inferences of others’ underlying motives, intents, and personal dispositions, which they then use to derive stable impressions. It has likewise been shown that emotion-resembling features in an otherwise neutral face can drive person perception [152]. Finally, research has also suggested that facial expressions actually take on the form that they do to resemble static facial appearance cues associated with certain functional affordance, such as facial maturity [163] and gender-related appearance [152].
J. Personality
Personality describes an individual’s relatively stable patterns of thinking, feeling, and behaving. These patterns are characterized by personality traits that represent a person’s disposition toward the world. Emotions, on the other hand, are the consequence of the individual’s response to external or internal stimuli. They may change when the stimuli change and are, therefore, considered states (as opposed to traits). Just as a full-blown emotion represents an integration of feeling, action, appraisal, and wants at a particular time and location, personality represents the integration of these components over time and space. Researchers have tried to characterize the relationships between personality and emotions. Several studies found correlations between certain personality traits and specific emotions. For instance, the trait neuroticism was found to be correlated with negative emotions, while extroverted people were found to experience higher levels of positive emotions than introverted people [164]. These correlations have been explained by relating different personality traits to specific emotion regulation strategies (e.g., [165]) or by demonstrating that evaluation mediates between certain personality traits and negative or positive affect [166]. Whatever the underlying reason, the existence of these correlations might help in making state-to-trait inferences. Therefore, if emotional states can be mapped to personality, the ability to automatically recognize emotions could provide tools for the automatic detection of personality.
Advances in computational, data-driven methods offer promising strides toward predicting personality traits from datasets of individuals’ behavior, self-reported personality traits, or physiological measures. It is also possible to use factor analysis to identify underlying dimensions of personality, such as the Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism (OCEAN) [167], [168] (see Fig. 9). Personality can likewise be inferred from facial expressions of emotion [162] and even from emotion-resembling facial appearance [169]. Machine learning algorithms have recently been employed successfully to predict human perceptions of personality traits based on such facial emotion cues [170]. Furthermore, network analysis, such as social network analysis, can be incorporated to identify patterns of connectivity between different personality traits or behavioral measures. Finally, interpreting a person’s emotional expression might also depend on the perception of their personality when additional data are available for inferring personality [171]. For further information, readers can refer to recent research in this area, e.g., [172] and [173].
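As a minimal illustration of the factor-analytic route to trait dimensions mentioned above, the following sketch recovers latent factors from questionnaire items with scikit-learn; the data are randomly generated placeholders, not a real personality dataset.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder data: 200 respondents, 50 questionnaire item scores each.
responses = np.random.rand(200, 50)

# Fit a five-factor model in the spirit of the Big Five/OCEAN structure.
fa = FactorAnalysis(n_components=5, random_state=0).fit(responses)
trait_scores = fa.transform(responses)   # five latent factor scores per respondent
loadings = fa.components_                # how each item loads on each factor
```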
K. Affective Style
Affective styles driven by the tendencies to approach reward versus avoid punishments found their way into early conflict theories [174] and remain a mainstay of contemporary psychology in theories such as Carver and Scheier’s Control Theory [175] and Higgins’s Self-discrepancy Theory [176].
Evidence for the specific association between emotion and approach-avoidance motivation has largely involved the examination of differential hemispheric asymmetries in cortical activation. Greater right frontal activation has been associated with avoidance motivation, as well as with flattened positive affect and increased negative affect. Greater left frontal activation has been associated with approach motivation and positive affect [177]. Supporting the meaningfulness of these findings, Davidson [177] argued that projections from the mesolimbic reward system, including basal ganglia and ventral striatum, which are associated with dopamine release, give rise to greater left frontal activation. Projections from the amygdala associated with the release of the primary vigilance-related transmitter norepinephrine give rise to greater right frontal activation.
Further evidence supporting the emotion/behavior orientation link stems from evidence accumulated in studies using measures of behavioral motivation based on Gray’s [178] proposed emotion systems. The most widely studied of these are the behavioral activation system (BAS) and the behavioral inhibition system (BIS). The BAS is argued to be highly related to appetitive or approach-oriented behavior in response to reward, whereas the BIS is argued to be related to inhibited or avoidance-oriented behavior in response to punishment. Carver and White [179] developed a BIS/BAS self-report rating measure that is thought to tap into these fundamental behavioral dispositions. They found that extreme scores on BIS/BAS scales were linked to behavioral sensitivity toward reward versus punishment contingencies, respectively [179]. BIS/BAS measures have been shown to be related to emotional predisposition, with positive emotionality being related to the dominance of BAS over BIS, and depressiveness and fearful anxiety being related to the dominance of BIS over BAS [180].
Notably, for many years, there existed a valence (positive/negative) versus motivational (approach/avoidance) confound in all work conducted in the emotion/behavior domain. Negative emotions were associated with avoidance-oriented and positive emotions with approach-oriented behavior, a contention supported by much of the work reviewed above. The valence/motivation confound led researchers Harmon-Jones and Allen [181] to test for hemispheric asymmetries in activation associated with anger, a negative emotion with an approach orientation (aggression). They argued that, if left hemispheric lateralization was associated with anger, this would indicate that the hemispheric lateralization in activation previously found was in fact due to behavioral motivation. However, right hemispheric lateralization would indicate that they were due to valence. In these studies, they found that dispositional anger [181] was associated with left lateralized EEG activation, consistent with the first interpretation and with that previously reported only for positive emotion. They supported this conclusion by showing that the dominance of BAS over BIS was associated with anger [58].
IV. EMOTION RECOGNITION: KEY IDEAS AND SYSTEMS
The field of computer-based emotion recognition from visual media is nascent but has seen encouraging developments in the past two decades. Interest in this area has also grown sharply (see Section IV-A). We will highlight some existing research ideas and systems related to modeling evoked emotion (see Section IV-B), facial expression (see Section IV-C), and bodily expression (see Section IV-D). In addition, we will discuss integrated approaches to emotion recognition (see Sections IV-E and IV-F). Because of the breadth of the field, it is not possible to cover all exciting developments, so we will focus our review on the evolution of the field and some of the most current, cutting-edge results.
A. Exponential Growth of the Field
To gain insight into the growing interest of the IEEE and computing communities in emotion recognition research, we conducted a survey of publications in the IEEE Xplore and ACM Digital Library (DL) databases. Results revealed an exponential increase in the number of publications related to emotion or sentiment in images and videos over the last two decades (see Fig. 10).
As of February 2023, a total of 48 154 publications were found in IEEE Xplore, with the majority (35 247 or 73.2%) being in conference proceedings, followed by journals (8214 or 17.1%), books (2400 or 5.0%), magazines (1413 or 2.9%), and early access journal articles (831 or 1.7%). The field has experienced substantial growth, with a 25-fold increase during the period from the early 2000s to 2022, rising from an average of 275 publications per year to about 7000 per year in 2022.
In the ACM DL, a total of 30 185 publications were found, with conference proceedings making up the majority (24 817 or 82.2%), followed by journals (3851 or 12.8%), magazines (780 or 2.6%), newsletters (505 or 1.7%), and books (264 or 0.9%). In the ACM community, the field has seen a 22-fold growth during the same period, with an average of 170 publications per year in the early 2000s rising to about 3800 per year in 2022.
A baseline search indicated that the field related to images and videos had a roughly linear, sixfold growth during the same period. Emotion-related research accounted for 14.2% of image- and video-related publications. The growth in emotion-related research outpaced the baseline growth by a significant margin, suggesting that it has a higher future potential. The annual growth rate for the field of emotion in images and videos is 15%–16%. If this growth continues, the annual number of publications in this field is expected to double every five years.
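The doubling-time estimate follows directly from compound growth at the reported annual rate; a quick arithmetic check (a minimal sketch, with the 15%–16% rate taken from the counts above):

```python
import math

# Doubling time under steady compound growth: t = ln(2) / ln(1 + rate).
for rate in (0.15, 0.16):
    doubling_years = math.log(2) / math.log(1 + rate)
    print(f"{rate:.0%} annual growth -> doubling in {doubling_years:.1f} years")
# 15% -> ~5.0 years, 16% -> ~4.7 years, consistent with "about every five years".
```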
B. Modeling Evoked Emotion
Evoked emotions can be inferred when viewers see an image or a video, either from changes in their physical body or from the stimuli being viewed, i.e., explicit affective cues or implicit affective stimuli [22], [69], [114]. In this section, we focus on implicit affective stimuli, particularly images and videos. Similar to other machine learning and computer vision tasks [16], [182], an evoked emotion recognition system for images or videos typically consists of three components: emotion feature extraction, feature selection and fusion, and feature-to-evoked-emotion mapping. The generalized frameworks for this process are illustrated in Figs. 11 and 12. The first step is to extract emotion features from the original images and videos, typically after preprocessing and converting them to numerical representations that are easier to process. Feature selection and fusion aim to select discriminative features that are more relevant to emotions, reduce the dimensionality of features, and combine different types of features into a unified representation. Finally, feature-to-evoked-emotion mapping learns a classifier that projects each unified representation to specific evoked emotions.
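To make the three-stage structure concrete, the following is a minimal sketch of such a pipeline; the extractors, selector, and classifier are placeholders (assumptions), not the components of any specific cited system.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

@dataclass
class EvokedEmotionPipeline:
    extractors: List[Callable]   # each maps an image/video to a feature vector
    selector: Callable           # selection/reduction, e.g., a fitted PCA's transform
    classifier: object           # any fitted model exposing predict()

    def predict(self, media):
        # 1) Emotion feature extraction from the (preprocessed) image or video.
        features = [np.asarray(f(media)).ravel() for f in self.extractors]
        # 2) Feature selection and fusion into a unified representation
        #    (early fusion by concatenation in this sketch).
        unified = self.selector(np.concatenate(features)[None, :])
        # 3) Feature-to-evoked-emotion mapping.
        return self.classifier.predict(unified)[0]
```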
1). Visual Emotion Feature Representation:
Prior to the emergence of deep learning, visual emotion features were primarily developed manually, drawing inspiration from various interrelated disciplines, including computer vision, art theory, psychology, and aesthetics.
One straightforward approach employs existing low-level vision features as representations of emotion. Yanulevskaya et al. [183] extracted holistic Wiccest and Gabor features. Other researchers have employed scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG), self-similarities, GIST, and local binary patterns (LBPs) to extract low-level vision representations of emotion for video keyframes [72], [184]. Yang et al. [185] compared their method with multiple low-level vision features, including SIFT, HOG, and Gabor. Rao et al. [186] first employed bag of visual words of SIFT features for each image block from multiple scales and then represented each block by extracting latent topics based on probabilistic latent semantic analysis. While such low-level vision representations are still used, such as LBP and optical flows [187], they do not effectively bridge the semantic gap and cannot bridge the more challenging affective gap [16].
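As an illustration of such low-level descriptors, the following sketch computes HOG and uniform-LBP features with scikit-image; the image size and parameter settings are illustrative assumptions rather than the configurations of the cited methods.

```python
import numpy as np
from skimage import color
from skimage.feature import hog, local_binary_pattern
from skimage.transform import resize

def low_level_emotion_features(rgb_image):
    """Generic low-level descriptor: HOG concatenated with a uniform-LBP histogram."""
    gray = resize(color.rgb2gray(rgb_image), (128, 128))
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)  # 10 bins
    return np.concatenate([hog_vec, lbp_hist])
```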
More robust approaches to exploring visual emotion features employ theories of art and aesthetics to design and develop features. Art is often created with the intention of evoking emotional reactions from the audience. As Pablo Picasso claimed, “Art washes away from the soul the dust of everyday life,” and as Paul Cézanne asserted, “A work of art which did not begin in emotion is not art” [44], [188], [189]. Generally, art theory includes elements of art and principles of art. The elements of art, such as color, line, texture, shape, and form, serve as building blocks in creating artwork. The principles of art correspond to the rules, tools, or guidelines for arranging and orchestrating the elements of art, such as balance, emphasis, harmony, variety, and gradation. Among these elements, color is the most commonly used artistic feature [44], [187], [190], [191], [192], [193], [194], [195], followed by texture. Lu et al. [46] investigated the relationship between shape and emotions through an in-depth statistical analysis. Zhao et al. [188] systematically formulated and implemented six artistic principles (all except rhythm and proportion) and combined them into a unified representation.
Aesthetics is widely acknowledged to have a strong correlation with emotion [182], [196]. Artwork that is aesthetically designed tends to attract viewers’ attention and create a sense of immersion. As early as 2006, Datta et al. [52] developed computer algorithms to predict the aesthetic quality of images. Among the various aesthetic features, composition, such as the rule of thirds, has been the most popular [44], [52], [190], [191], [196], [197]. Features capturing the figure-ground relationship, i.e., how easily the foreground can be cognitively distinguished from the background, were designed alongside several other aesthetic features [196]. Sartori et al. [198] analyzed the influence of different color combinations on evoking binary sentiment from abstract paintings. Artistic principles and aesthetics have been used to organize artistic elements from different but correlated perspectives, which were sometimes not clearly differentiated and were extracted together [196], [197]. By considering relationships among different elements, artistic principles and aesthetics have been demonstrated to be more interpretable, robust, and accurate than artistic elements and low-level vision representations for recognizing evoked emotions [188], [196].
To bridge the gap between low-level visual features and high-level semantics, an intermediate level of attributes and characteristics was designed. These intermediate attributes and characteristics were then applied to the prediction of evoked emotions [192], [193], [194], [199]. For example, in addition to generic scene attributes, eigenface-based facial expressions were also considered [199]. These attributes performed better than low-level vision representations, but the interpretability was still limited.
High-level content and concepts play an essential role in evoking emotions for images with obvious semantics, such as natural images. The number and size of faces and skin regions contained in an image were used as an early and simple content representation [44]. Facial expressions in images containing faces are a direct cue for viewers’ emotional reactions and are, therefore, often employed as a content representation [192], [193]. Jiang et al. [72] developed a method that involved detecting objects and scenes from keyframes. Based on the observation that general nouns such as “baby” were detectable but weakly linked to emotion, whereas adjectives such as “crying” carried strong emotional information but were difficult to detect, Borth et al. [200] introduced adjective–noun pairs (ANPs), formed by adding an adjective before a noun, such as “crying baby.” The combination maps strongly to emotions while remaining detectable. A large visual sentiment ontology named SentiBank was proposed to detect the probabilities of 1200 ANPs. As a milestone, SentiBank remains a standard baseline in performance evaluation, even in the current deep learning era. A multilingual visual sentiment ontology (MVSO) [201] was later developed to deal with different cultures and languages; about 16k sentiment-biased visual concepts across 12 languages were constructed and detected. ANP representations are widely used as high-level semantic representations [72], [192], [193], [194], [202]. These content and concept representations achieve the best performance for images containing such semantics but fail for abstract paintings.
Fortunately, the rise of deep and very deep convolutional neural networks (CNNs) in image classification has led to deep learning becoming the primary learning strategy in various fields, including computer vision and natural language processing (NLP). This is also the case for evoked emotion modeling [16]. Given sufficient annotated data, deep learning models can be trained to achieve superior performance on various types of images, including natural and social photographs, as well as abstract and artistic paintings. In many cases, features are automatically learned without the need for manual crafting.
Recently, global representation at the image level has been demonstrated to hold promise for evoked emotion analysis. One approach to extracting deep features is to directly apply pretrained CNNs, such as AlexNet [203], VGGNet [204], and ResNet [205], to the given images and take the responses of the last (few) fully connected (FC) layers [71], [190], [194], [197], [202], [206], [207], [208], [209], [210], [211]. Other methods have begun to use the output of a transformer’s encoder as the visual representation [212], [213]. Xu et al. [206] demonstrated that deep features from the penultimate FC layer outperformed those from the last FC layer. The extracted CNN features were transformed to another kernelized space via discrete Fourier transform and sparse representation to suppress the impact of noise in videos [207]. Chen et al. [214] first obtained event, object, and scene scores with state-of-the-art pretrained DNN-based detectors and then integrated these into a context fusion network to generate a unified representation. More recently, Song et al. [195] used the pretrained MS Azure Cognitive Service API to extract objects contained in an image and their corresponding confidence scores as visual hints, which were then transformed into a TF-IDF representation. Such pretrained deep features can essentially be viewed as another form of handcrafted feature, since the classifier used for the final emotion prediction must be trained separately. To enable end-to-end training, fine-tuning is widely employed to make the deep features more correlated with evoked emotions [71], [190], [215], [216]. Other inspiring improvements include multitask learning [185], [217] and emotion correlation mining [218], [219], [220] to better learn emotion-specific deep features. Image-level global representations extract deep features from the whole image taken as input. To deal with temporal correlations among successive frames in videos, a 3-D CNN (C3D) [221] that takes a series of frames as input has been adopted [222]. Although contextual information is considered, the global representation treats local regions equally without considering their importance.
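A common starting point for such global deep representations is to take penultimate-layer activations from a pretrained backbone, as in the following sketch (assuming a recent torchvision; the choice of ResNet-50 and the preprocessing values are illustrative, not those of any specific cited method).

```python
import torch
from torchvision import models, transforms

# Pretrained backbone with the final FC layer removed -> 2048-D global features.
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def global_deep_feature(pil_image):
    return backbone(preprocess(pil_image).unsqueeze(0)).squeeze(0)  # shape (2048,)

# For end-to-end fine-tuning, replace backbone.fc with a small classification
# head (e.g., torch.nn.Linear(2048, num_emotions)) and train with cross-entropy.
```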
Strategies such as attention [223], [224], [225], [226] and sentiment maps [227], [228] have been widely used to extract region-level representations that emphasize the importance of local regions in evoked emotion prediction. Spatial attention is used to determine the correspondence between local image regions and detected textual visual attributes [223]. Besides operating on the global representation to obtain attended local representations, the spatial attention map can be enforced to be consistent with prior knowledge contained in a detected saliency map [224]. Both spatial and channelwise attention are considered to reflect the importance of spatial local regions and the interdependency between different channels [226]. In contrast, Fan et al. [225] investigated how image emotion could be used to predict human attention and found that emotional content could strongly and briefly attract human visual attention. Yang et al. [227] designed a weakly supervised coupled CNN to explore local information by detecting a sentiment-specific map using a cross-spatial pooling strategy, which only requires image-level labels. The holistic and local representations were combined by coupling with the sentiment map. To address the overemphasis on local information and the neglect of some discriminative sentiment regions, a discriminative enhancement map was recently constructed by spatial and channel weighting [228]. Both the discriminative enhancement map and the sentiment map were coupled with the global representation.
In recent years, a growing body of research has focused on developing effective multilevel representations [219], [229], [230], [231], [232], [233], [234], [235], [236], [237], [238]. One strategy has been to view different CNN layers as different levels [229], [231], [234]. Multiple levels of features were extracted at different branches (four [231] and four plus one main [229] branches). The features from different levels were then integrated by a bidirectional gated recurrent unit (GRU) [229] or a fusion layer with basic operations, such as the mean. These two methods both claim that features from global to local levels can be extracted at different layers, but the correspondence between them is still unclear. This issue is partially addressed in the multilevel dependent attention network (MDAN) [234]. Based on the assumption that different semantic levels and affective levels are correlated, affective semantic mapping disentangles the affective gap by one-to-one mapping between semantic and affective levels. Besides the global-level learning at the highest affective level with emotion hierarchy preserved, local learning is incorporated at each semantic level to differentiate among emotions at the corresponding affective level [234]. Furthermore, multihead cross channel attention and level-dependent class activation maps are designed to model levelwise channel dependencies and spatial attention within each semantic level, respectively. Further research is needed to explore more effective ways to map semantic and affective levels, particularly when the number of levels varies significantly.
Another strategy for extracting multilevel representations involves utilizing object, saliency, and affective region detection [219], [230], [232], [235], [236], [237], [238]. Inspired by the Stimuli-Organism-Response (S-O-R) emotion model in psychology, Yang et al. [219] selected specific emotional stimuli, including image-level color and region-level objects and faces, using off-the-shelf detection models. Corresponding to the selected stimuli, three specific networks were designed to extract features: CNN-based color and other global representations, long short-term memory (LSTM)-based semantic correlations between different objects, and CNN-based facial expressions. Besides the correlations between different objects based on a GCN, Yang et al. [232] also mined the correlations between scenes and objects using a scene-based attention mechanism, motivated by the assumption that scenes guide objects to evoke distinct emotions. A similar approach was applied by Cheng et al. [238], but with the input to the object detector being the temporal–spatial features extracted by C3D. These methods treat objects, typically detected by Faster R-CNN, as regions and reflect the importance of different regions through attention [219] or graph reasoning [232], [238]. Objects and faces were also detected in [236] and [237], respectively, and corresponding CNN features were extracted. Rao et al. [230] employed an emotional region proposal method to select emotional regions and remove nonemotional ones. A region’s emotion score was a combination of the probability of the region containing an object and that of the region evoking emotions. A similar approach for selecting emotionally charged regions was used by Zhang et al. [235], but the emotion score was a weighted combination of the two probabilities.
Efforts have been made to combine handcrafted and deep representations in order to take advantage of their complementary information. For example, Liu et al. [197] combined them after feature dimension reduction, whereas Chen et al. [239] used AlexNet-style deep CNNs [203] to train ANP detectors and achieved improved ANP classification accuracy. The resulting DeepSentiBank has also been used to extract keyframe features in videos [210], [240]. More recently, the correlations between content features from a pretrained CNN and color features from color moments have been taken into consideration [241]. Specifically, the proposed cross-correlation model consists of two modules: an attention module to enhance the color representation using content vectors and a convolution module to enrich the content representation using pooled color embedding matrices. Further research is needed to investigate the potential of such interaction between handcrafted and deep representations.
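As a concrete example of a handcrafted feature that is often paired with deep content features, the color moments referenced above can be computed in a few lines; this is a standard per-channel formulation, offered as a minimal sketch rather than the exact feature of [241].

```python
import numpy as np

def color_moments(rgb_image):
    """Handcrafted color-moment descriptor: mean, std, and skewness per channel."""
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])   # 9-D vector
```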
2). Audio Emotion Feature Representation:
The audio features used in video-based evoked emotion analysis are often sourced from the fields of acoustics, speech, and music signal processing [69]. Similar to visual features, audio features for emotion analysis can be divided into two categories: handcrafted and deep learning-based.
Among handcrafted feature representations, the Mel-frequency cepstral coefficients (MFCCs) are commonly adopted features [72], [184], [191]. Energy entropy, signal energy, zero crossing rate, spectral rolloff, spectral centroid, and spectral flux are also employed [72], [184]. The openSMILE toolkit is a popular option for extracting handcrafted audio features [242]. El Ayadi et al. [243] classified speech features into four groups: continuous, qualitative, spectral, and Teager energy operator-based. Panda et al. [14] summarized music features into eight dimensions, including melody, harmony, rhythm, dynamics, tone color, expressivity, texture, and form. For more information on handcrafted audio feature representations, please refer to these papers.
Learning-based deep audio feature representations mirror the learning-based strategy used in the visual stream in videos, and some studies have explored the use of CNN-based deep features for audio streams. One approach is to send the raw audio signal directly to a 1-D CNN. However, a more popular method involves transforming the raw audio signal into MFCC, which can be viewed as an image, before feeding it into a 2-D CNN [202], [208], [211], [213], [222], [237], [238]. When simultaneously considering the MFCC from multiple video segments, C3D or LSTM can be utilized. For further information on modeling the temporal information in videos, see Section IV-B4.
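The following sketch shows the common MFCC-as-image preparation step described above, using librosa; the sampling rate, number of coefficients, and normalization are assumptions.

```python
import librosa
import numpy as np

def mfcc_image(audio_path, sr=22050, n_mfcc=40):
    """Turn an audio track into a normalized MFCC map suitable for a 2-D CNN."""
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Normalize per coefficient so the CNN sees a roughly zero-mean input.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc[np.newaxis, ...]   # add a channel axis: (1, n_mfcc, frames)
```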
3). Contextual Feature Representation:
The context information that is important in evoked emotion prediction can be divided into two categories: within visual media and across modalities. For context within visual media, one common approach is to extract features from different levels that correspond to semantics at different scales, such as the global and local levels. For more details, see Section IV-B1. When considering context across modalities, the social context of users, including their common interest groups, contact lists, and similarity of comments to the same images, is taken into account in personalized image emotion recognition [193]. In addition to the visual and audio representations, motion representation is also considered [213].
4). Feature Selection and Fusion:
As previously mentioned, features can be extracted within a single modality (e.g., image) or across multiple modalities. On the one hand, different types of features can have varying discriminative abilities; high-dimensional features may suppress the contribution of low-dimensional ones when fused together and may also cause the “curse of dimensionality” and the corresponding overfitting. On the other hand, fusing different features can improve the performance of emotion analysis by jointly exploiting their representational ability and complementary information. Therefore, feature selection and fusion techniques are often used, particularly for datasets with a small number of training samples.
Before being processed by the feature extractor, an image or video often undergoes preprocessing. Resizing an image to a fixed spatial resolution is straightforward, but videos pose a greater challenge due to differences in both spatial and temporal structures. It is important to determine how to combine frame- or segment-level features into a unified video-level representation. In the following, we first explore commonly used temporal information modeling in videos, followed by a summary of feature selection and fusion techniques.
In the area of temporal information modeling, there is a need to divide the input video into segments and extract keyframes to be used for feature extraction. A straightforward method is to use all frames of the entire video [207], [236], which ensures the least information loss but also results in high computational complexity. Common methods for segment sampling include fixed time interval [72], [184], [202], [211], [237], fixed number of segments [208], [222], and video shot detection [210], [238]. Sampling keyframes within a segment can be done through fixed frame sampling (uniform sampling) [72], [184], [211], [237], random sampling [208], middle frame selection [202], and mean histogram sampling (i.e., the frame with the closest histogram to the mean histogram) [191]. Recently, researchers have developed dedicated keyframe selection algorithms [210], [240]. One such method uses low-rank and sparse representation with Laplacian matrix regularization in an unsupervised manner, considering both the global and local structures [240]. Another method trains an image emotion recognition model on an additional image dataset to estimate the affective saliency of video frames [210]. Keyframes are selected based on the largest interframe difference of color histograms, after first sorting the segments (shots) based on affective saliency. Despite progress, effective and efficient selection of emotion-aware keyframes to enhance accuracy and speed remains an open challenge.
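For illustration, a fixed-number segment sampler that keeps each segment's middle frame, one of the simpler strategies listed above, can be written as follows; this is a minimal sketch, not the dedicated emotion-aware keyframe selectors of [210] and [240].

```python
import numpy as np

def uniform_keyframe_indices(num_frames, num_segments=8):
    """Split a video into equal segments and return each segment's middle frame index."""
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [(lo + hi) // 2 for lo, hi in zip(bounds[:-1], bounds[1:]) if hi > lo]

# e.g., uniform_keyframe_indices(240, 8) -> [15, 45, 75, 105, 135, 165, 195, 225]
```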
Another challenge is combining frame- or segment-level features into a unified video-level representation. Some straightforward methods include average pooling [72], [184], [207], [236], average pooling with temporal attention [208], max pooling [210], bag-of-words quantization [72], [184], LSTM [222], [238], and GRUs [237]. Another approach is to perform emotion prediction at the segment level and then combine the results using methods such as majority voting [202], [211].
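The attention-weighted pooling mentioned above can be sketched as a small module that scores each segment and takes a weighted average; this is a plausible generic formulation, not the exact module of any cited system.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Average pooling with learned temporal attention over segment features.

    Input: (batch, segments, dim); output: (batch, dim).
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, segment_features):
        weights = torch.softmax(self.score(segment_features), dim=1)  # (B, S, 1)
        return (weights * segment_features).sum(dim=1)
```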
Feature selection, especially for handcrafted features, is often employed when the extracted features are high-dimensional and contain redundant, noisy, or irrelevant information, when the dimensions of the different features to be fused vary significantly, or when the number of training samples is small. Some commonly used feature selection methods in evoked emotion modeling include cross-validation on the training set [183], wrapper-based selection [44], forward selection [46], and principal component analysis (PCA)-based selection [44], [46], [71], [188]. Feature selection can be combined with feature fusion, particularly in the case of early fusion [188], [197]. If the dimensions of the features being fused are similar, feature selection is carried out after fusion [188]. However, if the dimensions differ significantly, feature selection is usually performed first [197] to prevent an overemphasis on high-dimensional features.
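A minimal example of reducing a high-dimensional feature before early fusion, so that it does not drown out a low-dimensional one; the dimensions below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

high_dim = np.random.rand(500, 4096)   # e.g., deep features for 500 images
low_dim = np.random.rand(500, 9)       # e.g., 9-D color moments for the same images

# PCA-based selection on the high-dimensional block, then early fusion by concatenation.
pca = PCA(n_components=32).fit(high_dim)
fused = np.hstack([pca.transform(high_dim), low_dim])   # shape (500, 41)
```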
The integration of multiple features within a single modality or different kinds of representations across different modalities through feature fusion plays a crucial role in emotion recognition [22]. Model-free fusion, which operates independently of learning algorithms, is widely used and encompasses early fusion [44], [188], [190], [195], [196], [197], [198], [208], [219], [225], [229], [230], [232], [233], [237], late fusion [194], [199], [202], [211], [241], and hybrid fusion [210]. A hierarchical fusion that incorporates various feature sets at different levels of the hierarchy is also employed [187].
Model-based fusion, however, is performed explicitly during the learning process. Kernel-based fusion [72], [236] and graph-based fusion are often used for shallow learning models, whereas attention- [213], [222], [235], neural network- [184], [238], and tensor-based fusion strategies have been recently employed for deep learning models.
A transformer encoder with multihead self-attention layers is used in sentiment region correlation analysis to exploit dependencies between regions [235]. The self-attention transforms original regional features into a higher order representation between implied regions. Ou et al. [222] used a local–global attention mechanism to explore the intrinsic relationship among different modalities and for different segments. Local attention evaluates the importance of different modalities in each segment, whereas global attention captures the weight distribution of different segments. In contrast, Pang et al. [184] employed the deep Boltzmann machine (DBM) to infer nonlinear relationships among different features within and across modalities by learning a jointly shared embedding space. Cheng et al. [238] proposed an adaptive gated multimodal fusion model, which first mapped features from different modalities to the same dimension and then employed a gated multimodal unit (GMU) to find an intermediate representation. For a comprehensive survey on feature fusion strategies, please refer to [22].
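For concreteness, a two-modality gated fusion module in the spirit of the GMU mentioned above can be sketched as follows; it follows the commonly described gating formulation (tanh-projected modalities mixed by a learned sigmoid gate) and may differ in detail from the model in [238].

```python
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Gated fusion of visual and audio features into one intermediate representation."""
    def __init__(self, visual_dim, audio_dim, hidden_dim):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.gate = nn.Linear(visual_dim + audio_dim, hidden_dim)

    def forward(self, visual, audio):
        h_v = torch.tanh(self.visual_proj(visual))
        h_a = torch.tanh(self.audio_proj(audio))
        # The gate decides, per dimension, how much to trust each modality.
        z = torch.sigmoid(self.gate(torch.cat([visual, audio], dim=-1)))
        return z * h_v + (1 - z) * h_a
```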
5). Feature-to-Evoked-Emotion Mapping:
Based on the emotion representation models discussed in Section II-A, various evoked emotion analysis tasks can be undertaken, including classification, regression, retrieval, detection, and label distribution learning (LDL) [22]. The former three can be further divided into visual media-centric dominant emotion analyses and viewer-centric personalized emotion analyses, while the latter two are typically visual media-centric. After obtaining a unified representation, shallow or deep learning-based methods can be employed to map the features to evoked emotions.
A variety of machine learning algorithms have been utilized to learn the mapping between the unified representation after feature fusion and evoked emotions. These algorithms include naïve Bayes [44], [196], [200], logistic regression [184], [195], [199], [200], SVM and support vector regression with linear and radial basis function (RBF) kernels [46], [72], [183], [187], [188], [191], [195], [197], [199], [200], [207], [210], [211], [236], [240], linear discriminant analysis (LDA) [211], multiple instance learning [186], random forest [195], [210], decision tree [195], ensemble learning [202], a mixture of experts [237], sparse learning [192], [194], and graph/hypergraph learning [193]. There is still room for innovation in designing emotion-sensitive mapping algorithms.
The most common deep learning-based mapping is a multilayer perceptron based on one or more FC layers. The difference mainly lies in objective loss functions, such as cross-entropy loss for classification [71], [185], [210], [212], [215], [216], [217], [222], [227], [228], [229], [230], [231], [232], [233], [234], [235], [238], [240], [241], Euclidean and mean square error loss for regression [190], [195], contrastive loss and triplet loss for retrieval, and Kullback–Leibler (KL) divergence loss for LDL [189], [217], [218]. Unlike shallow learning methods, these deep learning techniques can typically be trained in an end-to-end manner. Some improvements have also been made to better explore the interactions between different tasks, and the characteristics and relationships of different emotion categories, such as through the use of polarity-consistent cross-entropy loss and regression loss [208], [226] and hierarchical cross-entropy loss [219]. More details are given in the following.
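The two most common deep mapping objectives named above, cross-entropy for classification and KL divergence for LDL, can be written compactly in PyTorch; the feature dimension, number of categories, and batch contents below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

head = nn.Linear(2048, 8)                     # unified feature -> 8 emotion scores

features = torch.randn(16, 2048)              # a batch of fused representations
logits = head(features)

# Classification: single-label ground truth with cross-entropy loss.
labels = torch.randint(0, 8, (16,))
ce_loss = F.cross_entropy(logits, labels)

# LDL: annotated emotion distributions with KL divergence loss.
target_dist = torch.softmax(torch.randn(16, 8), dim=1)
kl_loss = F.kl_div(F.log_softmax(logits, dim=1), target_dist, reduction="batchmean")
```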
The aforementioned evoked emotion analysis tasks are interconnected. For instance, the detection of emotional regions can inform the emotion classification task, and the emotion category with the largest probability in LDL often aligns with the emotion classification results. However, a notable difference between evoked emotion modeling and traditional computer vision or machine learning lies in the existence of specific correlations among emotions. We summarize recent advancements in multitask learning and the exploration of emotion correlations in evoked emotion modeling.
Multitask learning has been shown to significantly improve performance compared to single-task learning by leveraging appropriate shared information and imposing reasonable constraints across multiple related tasks [244]. It has become popular in evoked emotion modeling, especially when training data are limited for each task. Based on the relations explored, multitask learning can be categorized into task relation learning-oriented [185], [217], [227], [228], [230], [233], testing data relation learning-oriented [192], feature relation learning-oriented [194], and viewer relation learning-oriented approaches [193]. Different evoked emotion analysis tasks are often jointly performed, such as classification and distribution learning [217], [230], classification and retrieval [185], and classification and detection [227], [228], [233]. Joint optimization of different objective losses allows models to extract more discriminative feature representations. For example, multitask shared sparse regression was proposed to predict continuous emotion distributions of multiple testing images with sparsity constraints, which takes advantage of feature group structures [192]. Different constraints across features are considered [194] to reflect their importance by a weighting strategy, which can be viewed as a special late fusion category. Rolling multitask hypergraph learning is proposed to simultaneously predict personalized emotion perceptions of different viewers where social connections among viewers are considered [193].
Exploring correlations among different emotion categories or continuous emotion values can improve evoked emotion analysis. Commonly considered emotion correlations include the following.
Emotion hierarchy: As the emotion categories that researchers attempt to model become more diverse and nuanced, the level of granularity increases [234]. As shown in Fig. 13(a), emotions can be organized into a hierarchy, which has been exploited in evoked emotion analysis. By considering the polarity-emotion hierarchy, i.e., whether two emotion categories belong to the same polarity, a polarity-consistent cross-entropy loss [208] and regression loss [226] were designed to increase the penalty on predictions that have the opposite polarity to the ground truth (see the loss sketch after this list). A hierarchical cross-entropy loss has been proposed to jointly consider both emotion and polarity loss [219]. For each level in the emotion hierarchy, one specific semantic level is mapped with local learning to acquire the corresponding discrimination [234].
Emotion similarity: Similarities or distances between emotions vary, with some emotions being closer than others. For example, sadness is closer to disgust than it is to contentment. To account for these similarities, Mikels’ emotion wheel was introduced [193] [see Fig. 13(b) (left)]. Pairwise emotion similarity is defined as the reciprocal of one plus the number of steps separating the two emotions on the wheel (see the sketch after this list). Using Mikels’ wheel, a single emotion label can be transformed into an emotion distribution [217], [230]. Chain center loss is derived from the triplet loss, with anchor-related-negative triplets selected based on emotion similarity [220]. A more accurate method of measuring emotion similarity, based on a well-grounded circular-structured representation called the emotion circle, has also been designed [218] [see Fig. 13(b) (right)]. Each emotion category can be represented as an emotion vector with three attributes (i.e., polarity, type, and intensity) and two properties (i.e., similarity and additivity), allowing for vector addition operations.
Emotion boundedness: Not every combination of VAD values makes sense in an emotion space. The transformation between discrete emotion states and their rough continuous values is often possible [16], [46] [see Fig. 13(c)]. For example, positive valence is linked to happiness, whereas negative valence is linked to sadness or anger even though exact boundaries may not be clear. When performing multitask learning involving both classification and regression, it is helpful to consider the constraints on these values. For example, it would not be valid to predict happiness with a negative valence. The BoLD dataset (see Section III-A3) leverages this concept to check the validity of crowdsourced annotations [51].
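To illustrate the polarity-emotion hierarchy idea referenced in the list above, the following is a plausible sketch of a polarity-consistent cross-entropy loss that penalizes probability mass assigned to the opposite polarity; it is not the exact formulation of [208] or [219].

```python
import torch
import torch.nn.functional as F

def polarity_consistent_ce(logits, labels, polarity_of, weight=1.0):
    """Cross-entropy plus a penalty on probability mass given to the wrong polarity.

    `polarity_of` is a 0/1 LongTensor over categories (negative/positive).
    """
    ce = F.cross_entropy(logits, labels)
    probs = torch.softmax(logits, dim=1)
    # Mask of categories whose polarity differs from each sample's ground-truth polarity.
    wrong_polarity = (polarity_of.unsqueeze(0) != polarity_of[labels].unsqueeze(1)).float()
    penalty = (probs * wrong_polarity).sum(dim=1).mean()
    return ce + weight * penalty
```

Likewise, the pairwise similarity defined on Mikels' wheel, as stated in the list above, is simply the reciprocal of one plus the number of steps between two emotions:

```python
def mikels_similarity(steps_apart):
    """Pairwise emotion similarity on Mikels' wheel: 1 / (1 + steps between emotions)."""
    return 1.0 / (1 + steps_apart)

# e.g., adjacent emotions (1 step) -> 0.5; emotions 3 steps apart -> 0.25
```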
6). Preliminary Benchmark Analysis:
We provide a summary of the image- and video-based evoked emotion classification accuracies of representative methods, reported on the FI [71] and VideoEmotion-8 [72] datasets, respectively. The compared methods for image emotion prediction include SentiBank [200], artistic principles [188], DeepSentiBank [239], fine-tuned AlexNet [203], fine-tuned VGG-16 [204], fine-tuned ResNet-101 [205], progressive CNN [215], the multilevel deep representation network (MldrNet) [231], the weakly supervised coupled network (WSCNet) [227], the polarity-consistent deep attention network (PDANet) [226], the multilevel region-based CNN (MlrCNN) [230], the stimuli-aware visual emotion analysis network (SAVEAN) [219], the scene-object interrelated visual emotion reasoning network (SOLVER) [232], MDAN [234], and SimEmotion [212]. Results in Table 2 indicate that: 1) fine-tuning DNNs, particularly those with more layers, outperforms handcrafted features (e.g., fine-tuned ResNet-101 [205] versus SentiBank [200]); 2) exploring local feature representations and combining representations from different levels can increase accuracy to around 75% (e.g., MDAN [234]); and 3) SimEmotion, which uses large-scale language-supervised pretraining, achieves an overall accuracy of around 80%, which remains lower than the accuracies typical of traditional computer vision tasks. These results highlight the challenges in evoked emotion prediction due to large intraclass variance and the need for further progress toward human-level emotion understanding.
Table 2. Evoked emotion classification accuracy of representative methods on the FI dataset [71].

| Method | Venue | Accuracy (%) |
|---|---|---|
| SentiBank [200] | ACM MM 2013 | 44.5 |
| Artistic principles [188] | ACM MM 2014 | 46.1 |
| DeepSentiBank [239] | arXiv 2014 | 53.2 |
| Fine-tuned AlexNet [203] | NeurIPS 2012 | 58.3 |
| Fine-tuned VGG-16 [204] | ICLR 2015 | 65.5 |
| Fine-tuned ResNet-101 [205] | CVPR 2016 | 66.2 |
| Progressive CNN [215] | AAAI 2015 | 56.2 |
| MldrNet [231] | NPL 2020 | 65.2 |
| WSCNet [227] | CVPR 2018 | 70.1 |
| PDANet [226] | ACM MM 2019 | 72.1 |
| MlrCNN [230] | NEUCOM 2019 | 75.5 |
| SAVEAN [219] | TIP 2021 | 72.4 |
| SOLVER [232] | TIP 2021 | 72.3 |
| MDAN [234] | CVPR 2022 | 76.4 |
| SimEmotion [212] | TAFFC 2022 | 80.3 |
The compared methods for video emotion prediction include SentiBank [200], the enhanced multimodal deep Boltzmann machine (E-MDBM) [184], image transfer encoding (ITE) [73], visual + audio + attribute (V. + Au. + At.) [72], CFN [214], V. + Au. + At. + E-MDBM [184], kernelized features and Kernelized + SentiBank [207], the visual–audio attention network (VAANet) [208], KeyFrame [210], frame-level adaptation and emotion intensity learning (FAEIL) [207], and temporal-aware multimodal (TAM) methods [213]. Results in Table 3 indicate that: 1) fusing information from multiple modalities is more effective than using a single modality (e.g., VAANet [208] versus SentiBank [200]); 2) the current best accuracy is less than 60%; and 3) effectively fusing information from different modalities and selecting key segments or frames remain two open challenges. Evoked emotion prediction from videos is even more challenging than image-based prediction and requires further research efforts.
Table 3. Evoked emotion classification accuracy of representative methods on the VideoEmotion-8 dataset [72].

| Method | Venue | Accuracy (%) |
|---|---|---|
| SentiBank [200] | ACM MM 2013 | 35.5 |
| E-MDBM [184] | TMM 2015 | 40.4 |
| ITE [73] | TAFFC 2018 | 44.7 |
| V.+Au.+At. [72] | AAAI 2014 | 46.1 |
| CFN [214] | ACM MM 2016 | 50.4 |
| V.+Au.+At.+E-MDBM [184] | TMM 2015 | 51.1 |
| Kernelized [207] | TMM 2018 | 49.7 |
| Kernelized+SentiBank [207] | TMM 2018 | 52.5 |
| VAANet [208] | AAAI 2020 | 54.5 |
| KeyFrame [210] | MTAP 2021 | 52.9 |
| FAEIL [207] | TMM 2021 | 57.6 |
| TAM [213] | ACM MM 2022 | 57.5 |
C. Facial Expression and Microexpression Recognition
Facial expressions play a crucial role in natural human communication and emotional perception. FER involves the automatic identification of a person’s emotional state through the analysis of images or video clips and has been a long-standing research topic in computer vision and affective computing.
1). Earlier Approaches:
Before 2012, traditional handcrafted features and pipelines were commonly used. The process generally included the following steps: detecting facial regions, extracting handcrafted features (e.g., LBP [245], nonnegative matrix factorization (NMF) [246], and HOG [247]) from the facial regions, and employing a statistical classifier (e.g., SVM [248]) to recognize emotions.
Readers interested in traditional methodologies are advised to refer to survey articles [249], [250], [251], [252].
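For readers unfamiliar with that traditional pipeline, the following sketch strings together face detection, HOG extraction, and an SVM using OpenCV, scikit-image, and scikit-learn; the cascade file, crop size, and HOG settings are illustrative assumptions, not the configuration of any cited method.

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

# Classic FER pipeline: detect the face, extract a handcrafted descriptor, classify.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_hog(gray_image):
    """Return a HOG descriptor of the first detected face, or None if no face is found."""
    boxes = face_detector.detectMultiScale(gray_image, 1.3, 5)
    if len(boxes) == 0:
        return None
    x, y, w, h = boxes[0]
    face = cv2.resize(gray_image[y:y + h, x:x + w], (64, 64))
    return hog(face, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Training step (labeled face images assumed available):
# classifier = SVC(kernel="rbf").fit(np.stack(train_features), train_labels)
```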
Since the creation of the ImageNet dataset and the deep CNN AlexNet in 2012, DNNs have demonstrated remarkable image representation capabilities. FER researchers have established large-scale datasets (e.g., EmotioNet [253] and AffectNet [78]) that provide ample training data for DNNs. Consequently, deep learning has become the dominant paradigm in FER. A survey by Li and Deng [4] provides a comprehensive summary of deep learning-based FER methods from 2012 to 2019. During this period, researchers proposed several DNN techniques to improve FER performance, which the authors categorize as follows.
1) Adding auxiliary blocks to a typical backbone network (e.g., ResNet [205] and VGG [204]). For example, the supervised scoring ensemble (SSE) [254] proposed three auxiliary blocks to extract features from the shallow, intermediate, and deep layers, respectively.
2) Ensembling different models to achieve outstanding performance. For example, Bargal et al. [255] concatenated features from three different networks: VGG13 [204], VGG16 [204], and ResNet [205].
3) Designing specialized loss functions (such as center loss [256], island loss [257], and (N + M)-tuple cluster loss [258]) to learn facial features.
4) Leveraging multitask learning to learn various features from facial images. For instance, Zhang et al. [259] proposed MSCNN to jointly learn the face verification and FER tasks.
Moreover, some studies concentrated on FER from video clips using spatiotemporal networks to capture temporal correlations among video frames. Some researchers [260] also used recurrent neural networks (RNNs) (e.g., LSTM), whereas others [261], [262] used 3-D convolutional networks, such as C3D [221].
2). Recent Approaches:
Although deep learning-based methods have achieved remarkable success, FER remains a difficult task due to the following three challenges.
Annotation: Annotations of FER datasets contain much ambiguity. Each annotator subjectively evaluates facial expressions in images, leading to different annotations of the same image. Some images also have inherent ambiguity, making it difficult to assign a clear emotion label. Low image quality (such as occlusion or poor lighting) and ambiguous facial expressions exacerbate this problem. For instance, the images in Fig. 14(a) are labeled as neutral, but this label is uncertain due to annotator subjectivity and/or image quality.
Robustness: FER contains several sources of disturbance. Datasets are made up of individuals of varying ages, genders, and cultures. In addition, visual variations (such as human facial pose, illumination, and occlusions) commonly exist in facial images, which cause distinct appearance differences. These identity and visual variations can make it difficult for FER models to extract useful features. As illustrated in Fig. 14(b), image appearances can vary greatly due to differences in age, gender, race, facial pose, and lighting. Moreover, individuals with different identities and cultures might display their facial expressions differently, adding another challenge for FER models.
Subtlety: Some facial expressions are delicate, and some emotions can also be conveyed through subtle facial actions. Such fine distinctions can make it challenging to distinguish between emotions. For instance, the difference between “fear” and “disgust” images in Fig. 14(c) is nuanced. Thus, FER models must efficiently extract discriminative features to differentiate emotions.
Most deep learning methods developed after 2019 have attempted to address these problems. Regarding the annotation problem, Zeng et al. [263] pointed out the inconsistency of emotion annotation across different FER datasets due to annotators’ subjective evaluations of emotion. To address this issue, they proposed a framework called Inconsistent Pseudo Annotations to Latent Truth (IPA2LT), which trains multiple independent models on different datasets separately. These models may assign inconsistent pseudolabels to the same image because each model reflects the subjectivity of the corresponding dataset annotators. By comparing an image’s inconsistent labels, IPA2LT estimates the latent true label. Another factor that can contribute to annotation ambiguity is uncertain facial images, such as blurry images or those with ambiguous emotions. To mitigate the effect of uncertain images, researchers have proposed various approaches. Wang et al. [264] developed a self-cure network (SCN) to identify and then suppress uncertain images. SCN uses a self-attention mechanism to estimate the uncertainty of facial images and a relabeling mechanism to adjust the labels of those images. Chen et al. [265] argued that annotating uncertain images with multiple labels and intensities, rather than the one-hot labels commonly used in current FER datasets, is more suitable. Label distribution learning (LDL) allows models to learn such a label distribution (i.e., multiple labels with intensities). Chen et al. [265] introduced an approach called LDL on auxiliary label space graphs (LDL-ALSG). Given one image, LDL-ALSG first leverages models of related tasks (such as AU detection) to find its neighbor images. Then, LDL-ALSG employs a task-guided loss that encourages each image to learn a label distribution similar to that of its neighbors. She et al. [266] combined LDL and uncertain image estimation; the proposed DMUE network mines the label distribution and estimates the uncertainty of images together.
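As an illustration of training with label distributions rather than one-hot labels, the following is a minimal sketch of an LDL-style loss, assuming soft labels (e.g., averaged annotations from several annotators) are available; it is not the exact loss of LDL-ALSG [265] or DMUE [266].

```python
import torch
import torch.nn.functional as F

def label_distribution_loss(logits: torch.Tensor,
                            soft_labels: torch.Tensor) -> torch.Tensor:
    """Train against a per-image emotion distribution instead of a one-hot label.
    logits: (B, C) network outputs; soft_labels: (B, C) rows summing to 1,
    e.g., averaged annotations from multiple annotators."""
    log_probs = F.log_softmax(logits, dim=1)
    # KL(soft_labels || predicted distribution), averaged over the batch.
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")
```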
To address robustness challenges in FER, researchers have explored mitigating the impact of identity variations on recognition [267], [268]. Chen and Joo [267] investigated the influence of gender and revealed that models tend to recognize women’s faces as happy more often than men’s, even when smile intensities are the same. To overcome this issue, the authors proposed a method that first detected facial AUs and then applied a triplet loss to ensure that people with similar AUs exhibited similar expressions, regardless of gender [267]. Zeng et al. [268] demonstrated that emotion categories can introduce bias into a dataset. In some datasets, certain emotions (such as disgust) occur far less frequently than prevalent emotions, such as happiness and sadness, and networks trained on these datasets consequently perform poorly on minority emotion classes. The authors utilized a million-image-level facial recognition dataset (much larger than typical FER datasets) and a metalearning framework to address the issue [268]. Li et al. [269] recognized the category imbalance challenge and addressed it by proposing the AdaReg loss to dynamically adjust the importance of each category.
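A simple, widely used way to counter such category imbalance, not the specific AdaReg loss [269], is to weight the cross-entropy loss inversely to class frequency, as sketched below (the class counts shown are hypothetical).

```python
import torch
import torch.nn as nn

def make_balanced_criterion(class_counts: list[int]) -> nn.CrossEntropyLoss:
    """One simple way to counter category imbalance:
    weight each emotion class inversely to its frequency in the training set."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights
    return nn.CrossEntropyLoss(weight=weights)

# Example with hypothetical counts for a happiness-dominated dataset
# containing rare classes such as disgust.
criterion = make_balanced_criterion([12000, 9000, 800, 600, 3000, 2500, 400])
```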
Other researchers have focused on visual disturbance variations, which also contribute to the robustness issue. Wang et al. [270] designed a region attention network to capture important facial regions, thus obtaining occlusion-robust and pose-invariant image features. Zhang et al. [271] used a deviation learning network (DLN) to learn identity-invariant features. Wang et al. [272] considered identity and pose variations together: an encoder was followed by two discriminators that classified pose and identity independently, while the encoder extracted features that were invariant to both. An expression classifier was then applied to the invariant features to produce predictions.
To address the subtlety problem in FER, Ruan et al. [273] decomposed facial expression features into shared features that represent expression similarities and unique features that represent expression-specific variations, using a feature decomposition network (FDN) and a feature reconstruction network (FRN), respectively. The authors further addressed the robustness and subtlety problems by proposing a disturbance feature extraction model (DFEM) to identify disturbance features (such as pose and identity) and an adaptive disturbance-disentangled model (ADDM) to remove the disturbance features extracted by the DFEM and extract features that discriminate among different facial expressions. Farzaneh and Qi [274] enhanced center loss [256]. Although center loss can learn discriminative features, it can also include some irrelevant features; the proposed deep attentive center loss adopts an attention mechanism to adaptively select the important discriminative features. Xue et al. [275] leveraged transformers to detect discriminative features and showed that the original vision transformer (ViT) [276] captures only the most discriminative features while neglecting others. Thus, this work proposes the multiattention dropping (MAD) technique to randomly drop some attention maps so that the network learns comprehensive features rather than relying only on the most discriminative ones. Furthermore, Savchenko et al. [277] achieved remarkable FER performance using the well-performing EfficientNet [278] to extract discriminative features.
3). Preliminary Benchmark Analysis:
Table 4 provides a concise overview of the performance of recent FER methods on the AffectNet dataset, one of the most comprehensive FER benchmarks. The table shows the emotion classification accuracy of selected representative methods discussed in the text. As seen in the table, TranFER, ADDL, and EfficientNet-B2 are among the top-performing methods on the AffectNet benchmark. The utilization of advanced network structures from general image recognition tasks in TranFER and EfficientNet-B2 highlights the importance of drawing knowledge and expertise from the image recognition field into FER. However, it is important to note that the highest reported accuracy remains below 70%. Given the over 90% top-1 accuracy achieved by state-of-the-art image recognition methods on the challenging ImageNet benchmark, this deficit suggests significant potential for improvement in FER.
Table 4. Emotion classification accuracy of representative FER methods on the AffectNet benchmark.

Method | Venue | Accuracy (%)
---|---|---
IPA2LT [263] | ECCV 2018 | 57.31
IPFR [272] | ACM MM 2019 | 57.40
RAN [270] | TIP 2020 | 52.97
SCN [264] | CVPR 2020 | 60.23
DACL [274] | WACV 2021 | 65.20
DMUE [266] | CVPR 2021 | 63.11
KTN [269] | TIP 2021 | 63.97
TranFER [275] | CVPR 2021 | 66.23
Face2Exp [268] | CVPR 2022 | 64.23
ADDL [279] | IJCV 2022 | 66.20
EfficientNet-B2 [277] | TAFFC 2022 | 66.34
4). Microexpression Recognition:
The methods described above are for recognizing facial expressions. However, individuals may consciously exhibit certain facial expressions to conceal their authentic emotions. In contrast to conventional facial expressions, which can be deliberately controlled, microexpressions are fleeting and spontaneous and can uncover an individual’s genuine emotions. A microexpression is brief in duration and can be imperceptible to the naked eye. Microexpression recognition (MER) often requires high frame-rate videos as input and the development of spotting algorithms to temporally isolate microexpressions within videos. A recent survey gives a more in-depth introduction to developments in MER [7].
D. Bodily Expressed Emotion Understanding
In everyday life, people express their emotions through various means, including via their facial expressions and body movements. Recognizing emotions from body movements has some distinct advantages over recognizing emotions from facial images for many computer and robotic applications.
In crowded environments where facial images may be obscured or lack sufficient resolution, body movements and postures can still be reasonably estimated. This context is particularly important in robotic applications where the robot may not be close to all individuals in its environment.
Due to privacy concerns, facial information may be inaccessible. For example, in some medical applications, sharing of facial images or videos is restricted to protect sensitive patient identity information.
Incorporating body expressions as an additional modality can result in more accurate emotion recognition compared to using facial images alone. For example, when a person is not facing the camera, a frontal view of the face is unavailable, yet body movements and posture can still convey emotional cues.
Psychologists have conducted extensive studies to examine the relationship between body movements and emotions. Research suggests that body movements and postures are crucial for understanding emotion, encoding rich information about an individual’s status, including awareness, intention, and emotional state [114], [280], [281], [282], [283]. Several studies, including one published in Science, found that the human body may be more diagnostic than the face for emotion recognition [283], [284], [285].
However, the field of BEEU in visual media has progressed relatively slowly. Unlike FER, which has seen significant progress with deep learning methods since 2013, most BEEU studies relied on traditional, handcrafted features until 2018. The bottleneck for BEEU is the scarcity of large-scale, high-quality datasets. As mentioned in Section III-A3, collecting and annotating a dataset of bodily expressions with high-quality labels are extremely challenging and costly. Understanding and perception of emotions from concrete observations are heavily influenced by context, interpretation, ethnicity, and culture. There is often no gold-standard label for emotions, especially for bodily expressions. Prior to 2018, research on bodily expression was limited to small, acted, and constrained lab-setting video data [286], [287], [288], [289]. These datasets were insufficient for deep learning-based models that require a large amount of data. The recent BoLD dataset by Luo et al. [51], introduced in Section III-A3, is the largest BEEU dataset to date. It contains over 10 000 video clips of body movements with high-quality emotion labels. In addition, Randhavane et al. [74] proposed the E-Walk dataset, which includes 1136 3-D pose sequences extracted from raw video clips. As a result, computer vision research is increasingly focused on emotion recognition from body movement recordings.
Because static body images alone can hardly convey a person’s emotions, BEEU often considers human video clips. According to the input modality of the models, BEEU methods can be classified as pixel-based or skeleton-based. Pixel-based methods use entire video clips, whereas skeleton-based methods first extract 2-D/3-D pose information and then feed it into the models. Some BEEU works, which focus on movement during walking, are known as gait-based; herein, we refer to them as skeleton-based as well because gait is also represented as a 2-D/3-D pose. Figs. 15 and 16 illustrate the two kinds of methods.
1). Skeleton-Based Methods:
Our review of past publications shows a greater adoption of skeleton-based approaches compared to pixel-based ones. This is due to two reasons. First, skeleton data, consisting of sequences of 2-D/3-D joint coordinates, require less engineering effort to process than video clips. A straightforward method is to feed the coordinates into a machine learning classifier (e.g., SVM) for direct emotion prediction. Second, improved MoCap systems enable researchers to easily collect accurate 3-D poses from individuals walking in laboratory settings.
In the early stages of skeleton-based BEEU research, a conventional approach was followed: low-level features were extracted from 2-D/3-D pose sequences, and a machine learning classifier was then used to predict emotion. These features fall into the following categories.
Frequency-domain features, which transform temporal information into the frequency domain, are obtained through the Fourier transform. For instance, Li et al. [290] used the Fourier transform to convert 3-D pose sequences into frequency-domain features, which were then classified using LDA, naïve Bayes, decision tree, and SVM algorithms.
Motion features, which characterize the movement of body joints, include joint velocity and acceleration [51].
Geometry features, which describe the configuration of the body relative to itself, encompass angles between specific limbs and distances between certain joints, among others. In particular, Crenn et al. [291] combined motion, geometry, and frequency-domain features and fed them into an SVM classifier (a minimal sketch of this kind of pipeline appears below).
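As sketched below under simplifying assumptions (hypothetical joint indices and a fixed-length statistical summary of each sequence), motion and geometry features can be computed directly from a 2-D pose sequence and fed to an SVM, in the spirit of the traditional approaches above.

```python
import numpy as np
from sklearn.svm import SVC

def pose_features(pose_seq: np.ndarray) -> np.ndarray:
    """pose_seq: (T, J, 2) sequence of 2-D joint coordinates.
    Returns a fixed-length vector of simple motion and geometry statistics."""
    velocity = np.diff(pose_seq, axis=0)            # (T-1, J, 2) frame-to-frame motion
    acceleration = np.diff(velocity, axis=0)        # (T-2, J, 2)
    speed = np.linalg.norm(velocity, axis=-1)       # (T-1, J)
    accel_mag = np.linalg.norm(acceleration, axis=-1)
    # Geometry: distance between two hypothetical joints, e.g., the wrists.
    LEFT_WRIST, RIGHT_WRIST = 7, 4                  # indices depend on the pose format
    hand_dist = np.linalg.norm(pose_seq[:, LEFT_WRIST] - pose_seq[:, RIGHT_WRIST], axis=-1)
    stats = lambda x: [x.mean(), x.std(), x.max()]
    return np.array(stats(speed) + stats(accel_mag) + stats(hand_dist))

def train_skeleton_classifier(sequences, labels):
    X = np.stack([pose_features(s) for s in sequences])
    return SVC(kernel="rbf").fit(X, labels)
```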
Luo et al. [51] compared traditional and deep learning-based methods on the BoLD dataset and developed the Automated Recognition of Bodily Expression of Emotion (ARBEE) system. A traditional machine learning approach was designed, in which motion and geometry features were extracted from 2-D pose sequences and a random forest classifier was employed to categorize emotions. In addition, a spatial-temporal graph convolutional network (ST-GCN) was trained and evaluated on the BoLD dataset. The results indicated that the carefully designed traditional machine learning method outperformed the ST-GCN model trained from scratch.
Since the development of ARBEE, various deep learning-based methods have emerged. These methods can be broadly categorized into three groups based on the type of neural network used: RNN, graph neural network (GNN), and CNN.
Randhavane et al. [74] used an LSTM network, which is a type of RNN, to extract temporal features from a 3-D pose sequence, which were then concatenated with handcrafted features for classification. Bhattacharya et al. [292] leveraged a semisupervised technique to improve the performance of an RNN model. The work consisted of a GRU (a kind of RNN model) for feature extraction from a 3-D pose sequence, followed by an autoencoder with both encoder and decoder components. During training, when the input data were labeled with emotions, the classifier after the encoder produced the emotion prediction, and the decoder reconstructed the 3-D pose. If the input lacked emotion labels, only the decoder was used for reconstruction.
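The following is a minimal PyTorch sketch of this family of RNN-based models, assuming 25 body joints and the 26 BoLD emotion categories as hypothetical dimensions; it flattens each frame’s 3-D joints, summarizes the sequence with an LSTM, and classifies from the final hidden state. It sketches the general idea rather than the specific architectures in [74] or [292].

```python
import torch
import torch.nn as nn

class PoseLSTMClassifier(nn.Module):
    """Sketch of an RNN-based skeleton BEEU model: an LSTM summarizes a
    3-D pose sequence and a linear head predicts emotion categories."""
    def __init__(self, num_joints=25, num_classes=26, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3,
                            hidden_size=hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, pose_seq: torch.Tensor) -> torch.Tensor:
        # pose_seq: (B, T, J, 3) -> flatten joints per frame to (B, T, J*3).
        b, t, j, c = pose_seq.shape
        x = pose_seq.reshape(b, t, j * c)
        _, (h_n, _) = self.lstm(x)        # h_n: (num_layers, B, hidden)
        return self.head(h_n[-1])         # logits over emotion categories
```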
Bhattacharya et al. [293] adopted ST-GCN [294] to classify emotion categories from 3-D pose sequences. To increase the size of the training set, the authors used a conditional variational autoencoder (CVAE) to generate some synthetic data. Banerjee et al. [295] combined GCN and NLP techniques to achieve zero-shot emotion recognition, which entailed recognizing novel emotion categories not seen during training. The authors used ST-GCN to extract visual features from the 3-D pose sequences and used the word2vec method to obtain word embeddings from emotion labels. An adversarial autoencoder was used to align visual features with word embeddings. During inference, the system searched for the emotion label that best matched the output visual feature.
Inspired by the success of image recognition, some studies have attempted to convert skeleton sequences into images. Narayanan et al. [296] embedded 3-D pose sequences into images and then utilized a CNN for classification. Hu et al. [297] employed a two-stream (TS) CNN, where one stream directly embedded the 3-D pose into an image, and the other stream converted handcrafted features from the 3-D pose into another image. The two CNNs were integrated using transformer layers.
2). Pixel-Based Methods:
Pixel-based networks for BEEU require video clips as input. Due to an increased redundancy in video clips compared to 2-D/3-D pose sequences, these networks necessitate larger training datasets to extract distinctive features. With the availability of the large-scale BEEU dataset, researchers have focused on pixel-based methods. Because the well-established field of action recognition, which also uses videos to analyze human behavior, has many similarities to BEEU, current BEEU research often incorporates networks from action recognition.
The ARBEE study benchmarked the performance of various action recognition networks on the BoLD dataset [51]. Human body regions were cropped from the raw videos and fed into different methods, including a traditional handcrafted feature method and three deep learning methods (TS [298], I3D [138], and TSN [299]). Both TS and TSN are two-stream architectures, in which one stream processes the RGB frame sequence and the other processes the optical flow sequence. TS and TSN used 2-D convolutional networks, whereas I3D employed a 3-D convolutional network. Results indicated that all deep learning methods significantly outperformed the traditional handcrafted feature method. Among the three deep learning methods, I3D performed worse than TS and TSN, potentially because it requires more training data to reach a comparable level of performance. The BoLD dataset is not as extensive as action recognition datasets, such as Kinetics-400 [300], and thus, I3D trained on BoLD could not fully exhibit the capability it shows on Kinetics-400. Meanwhile, TS and TSN produced similar results.
The ARBEE study also evaluated the impact of a person’s face on BEEU model performance [51]. The authors designed an ablation study with three different video inputs: the whole body (the whole human body cropped from the raw video clips), just the face (only the facial region cropped), and the body without the face (the whole body cropped with the facial region masked) [51]. Results showed that using either the face or the body alone was comparable to using the whole body, demonstrating that both the face and the body contribute significantly to the final prediction. The whole-body setting of the TSN model nonetheless outperformed the separate models because it combined facial and bodily cues.
Most BEEU research has followed ARBEE in pursuing pixel-based approaches. Because ARBEE cropped human body regions as input, a direct way to improve upon ARBEE is to explicitly utilize facial images and background images as input as well. Recent studies [301], [302] used an extra network to extract context information from whole images and then fused the context features with features extracted from body images. Other studies adopted an additional network with facial images as input [303]. Moreover, inspired by the cutting-edge vision-language research of CLIP [304], Zhang et al. [305] developed EmotionCLIP, a contrastive vision-language pretraining paradigm that extracts comprehensive visual emotion representations from whole images, encompassing both context information and human body information simultaneously. Because EmotionCLIP uses only uncurated data, it addresses the challenge of data scarcity in emotion understanding. EmotionCLIP outperforms state-of-the-art supervised visual emotion recognition methods and competes with many multimodal approaches across various benchmarks, demonstrating its effectiveness and transferability.
Certain studies delve deeply into the analysis of body gestures and movement in the context of BEEU. ARBEE currently provides only a generalized emotion label for entire video clips, lacking specific descriptions of human body gestures or movements. In contrast, Liu et al. [75] and Chen et al. [306] introduced datasets for microgesture understanding and emotion analysis, featuring detailed body gesture labels, such as crossed fingers, for each human movement clip. These enriched datasets have the potential to improve machines’ understanding of emotions conveyed through gestures. As mentioned in Section III-E, LMA is a comprehensive method for describing human movement. Wu et al. [307] further advanced BEEU by presenting a dataset that provides accurate LMA labels for human movements and by incorporating a novel dual-task model structure that simultaneously predicts emotions and LMA labels, achieving remarkable performance on the BoLD dataset.
3). Preliminary Benchmark Analysis and Current Directions:
Table 5 presents the results of various BEEU methods on the BoLD dataset, with performance measured by mean average precision (mAP) across 26 emotional categories. The results indicate that pixel-based methods outperform skeleton-based ones, which is unsurprising given that RGB images contain more information. Despite significant progress in BEEU, performance remains relatively low, with mAP scores below 25%.
Table 5. Mean average precision (mAP) of BEEU methods on the BoLD dataset across 26 emotional categories.

Method | Venue | mAP (%)
---|---|---
Skeleton-based: | |
ST-GCN [294] | AAAI 2018 | 12.63
Random Forest [51] | IJCV 2020 | 13.59
Pixel-based: | |
TS [298] | NeurIPS 2014 | 17.04
TSN [299] | ECCV 2016 | 17.02
I3D [138] | CVPR 2017 | 15.34
Filntisis et al. [302] | ECCVW 2020 | 17.96
Pikoulis et al. [303] | FG 2021 | 21.87
EmotionCLIP [305] | CVPR 2023 | 22.51
Wu et al. [307] | arXiv 2023 | 23.09
As BEEU is a relatively new area in computer vision, we explore its potential future directions inspired by the trajectory of related areas, such as action recognition and FER. First, existing bodily expression datasets are not sufficiently large. BoLD contains only thousands of instances, far fewer than are contained in action recognition datasets such as Kinetics-400. The BoLD team is developing larger and more comprehensive datasets to satisfy the data requirements of deep learning methods, but those expansions are not yet complete. Second, existing approaches are largely based on action recognition methods and do not leverage deep affective features; previous work applied only low-level features, neglecting characteristics of concurrent body movement. Third, annotation ambiguity, as in FER, remains challenging. Fourth, any method that segments body regions must account for changing relationships among body sections (i.e., Shape Change in LMA), which is crucial for bodily expressed emotion and may be particularly relevant to the dimensional emotion model because Shape Change reveals approach and retreat. Section V discusses technological barriers more broadly and in greater depth.
E. Integrating Multiple Visual Inputs to Model Expression
We discussed earlier how facial and body images have been used as separate inputs to identify human emotions. As was already established, the context of people in a scene also contributes to inferring their emotions. Humans synthesize all visual information to produce emotional determinations. Naturally, computers can also make such determinations by combining all visual inputs, including facial images, body images, and context information.
As illustrated in Fig. 17, recent work used multistream networks to fuse different visual inputs. Kosti et al. [136] and Lee et al. [81] developed initial approaches by adopting a TS network. Specifically, Kosti et al. used body images and entire images as input to extract context information and body features separately. Lee et al. adopted faceless images (i.e., entire images with the human face cropped out) and facial images as input, together with a feature-fusing network that dynamically fused context and facial features based on their significance.
Subsequent research has taken two paths. Some researchers have explored more kinds of visual inputs. For example, Mittal et al. [137] demonstrated that a depth map can indicate how people interact with each other. The depth map estimated from the raw image served as one input. In addition, this work used three other inputs, namely, the facial image, the 2-D pose, and the bodiless image (i.e., the entire image with the human body cropped out), to extract facial information, body posture information, and context information separately, forming a four-stream network overall. Studies [301], [302], [303] have also used a three-stream network with the entire, facial, and body images as input.
Other researchers have focused on more effective fusion of visual features. Existing multistream approaches in typical emotion recognition adopt a simple fusion strategy, i.e., ensembling the predictions from each stream [301], [302], [303]. To improve upon this, Le et al. [308] proposed the global–local attention (GLA) module to enhance the interaction between facial and context features.
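The sketch below illustrates the general two-stream pattern (not any specific model above): one backbone encodes the body crop, another encodes the full image for context, and the concatenated features feed an emotion classifier. The ResNet-18 backbones and the 26-category output are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamEmotionNet(nn.Module):
    """Sketch of a two-stream model: one stream encodes the body crop,
    the other encodes the full image for context; features are fused by
    concatenation before the emotion classifier."""
    def __init__(self, num_classes=26):
        super().__init__()
        self.body_stream = resnet18(weights=None)
        self.context_stream = resnet18(weights=None)
        feat_dim = self.body_stream.fc.in_features        # 512 for ResNet-18
        self.body_stream.fc = nn.Identity()
        self.context_stream.fc = nn.Identity()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, body_crop: torch.Tensor, full_image: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.body_stream(body_crop),
                           self.context_stream(full_image)], dim=1)
        return self.classifier(fused)
```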
F. Multimodal Modeling of Expressions
Emotions can be conveyed and perceived not just through visual signals but also through text and audio. Effectively processing multimodal signals, however, is nontrivial. With advancements in deep learning, new technical approaches to this problem have emerged, particularly the vision-and-language model [304], [309]. Similar to the methods discussed in Section IV-E, multimodal approaches also follow a multistream pipeline, as illustrated in Fig. 17. These techniques use independent networks to extract features from inputs of different modalities and fuse these features with a fusion network.
Multimodal approaches typically utilize backbone networks, such as BERT [310] and ResNet, to extract text and audio features. The visual feature extraction process, however, differs and is typically performed using one of the following three methods.
Region features: A detection network is employed to identify regions of interest (ROIs), from which features are extracted.
Grid features: A backbone network, such as ResNet-101, is used to extract features from the entire image.
Patch projection: The image is split into patches, and a linear layer is used to generate a linear embedding, as described in ViT [276].
For the fusion process, a simple approach is to directly ensemble the final predictions of the different networks. Another method involves concatenating feature maps from the different modality networks and using a single network to fuse the features. Yet another approach is to use multiple networks to process different features, with interactions between the networks.
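The following sketch contrasts the first two fusion strategies under simplifying assumptions (pre-extracted feature vectors with hypothetical dimensions): late fusion averages per-modality predictions, whereas feature-level fusion concatenates modality features and learns cross-modal interactions with a small fusion network.

```python
import torch
import torch.nn as nn

def late_fusion(logits_per_modality: list[torch.Tensor]) -> torch.Tensor:
    """Ensemble strategy: average the predictions of independent
    text, audio, and visual networks."""
    return torch.stack(logits_per_modality, dim=0).mean(dim=0)

class FeatureFusion(nn.Module):
    """Feature-level strategy: concatenate modality features and let a
    single fusion network learn cross-modal interactions."""
    def __init__(self, dims=(768, 128, 512), num_classes=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, text_feat, audio_feat, visual_feat):
        return self.fuse(torch.cat([text_feat, audio_feat, visual_feat], dim=1))
```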
Some research in emotion recognition focuses on integrating audio and visual inputs. For example, Tzirakis et al. [311] attempted to fuse audio signals with facial images. They incorporated an audio stream network to extract audio features, followed by an LSTM network to fuse facial and audio features. Antoniadis et al. [312] combined audio signals with multiple visual modalities, including facial image, body image, and context image. In contrast to using CNN, Shirian et al. [313] utilized a GCN structure to process audio, facial, and body inputs, and then used a pooling layer to process the fused features.
With the rapid advancement of NLP, there has been a surge in research on multimodal sentiment analysis, which incorporates text, audio, and visual input. Survey papers by Zhao et al. [22] and Gandhi et al. [70] provide comprehensive overviews of advancements in multimodal sentiment analysis techniques. Noteworthy methods from recent years include the multitask network by Akhtar et al. [314], which used text, acoustic, and visual frames of videos as input with intermodal attention modules to adjust the contribution of each modality. Most existing methods adopt multimodal emotion labels as the supervision, ignoring unimodal labels. Yu et al. [315] proposed a label generation module that generated unimodal labels for each modality in a self-supervised manner, enabling a multitask network to train all unimodal and multimodal tasks simultaneously. Jiang et al. [316] adopted several baseline models for each modality input and used PCA to find the optimal feature for each modality; an early fusion strategy then combined all features. Yang et al. [317] considered the consistency and differences among various modalities in a unified manner, using common and private encoders to learn modality-invariant and modality-unique features, respectively, across all modalities. With a similar motivation, Zhang et al. [318] proposed a cascade and specific scoring model to represent the interrelationships and intrarelationships across different modalities. Zhang et al. [319] used reinforcement learning and domain knowledge to process fused features from multiple modalities in conversational videos. Through reinforcement learning, a dueling DQN predicted the sentiment of the target sentence based on features of previous sentences, and information in the first several sentences was used as domain knowledge for subsequent predictions. Mittal et al. [320] incorporated multiple visual inputs, resulting in five distinct types of input: face, body, context, audio, and text.
To show recent advancements in multimodal sentiment analysis, Table 6 presents the performance of selected representative methods on the widely used benchmark, the MOSEI dataset. Performance is evaluated through mean absolute error (MAE) and binary accuracy (Acc-2) metrics. The top-performing methods were Self-MM [315] and FDMER [317]. Self-MM utilizes a multitask network to train all modalities simultaneously, whereas FDMER employs modules to capture shared and individual features across modalities. The essence of multimodal sentiment analysis remains the optimization of relationships between different modalities.
Table 6. Performance of representative multimodal sentiment analysis methods on the MOSEI benchmark, evaluated by mean absolute error (MAE) and binary accuracy (Acc-2).
In addition, some research centers on fusing multimodal features. For example, Cambria et al. [321] presented sentic blending to fuse scalable multimodal input. In a multidimensional space, it constructed a continuous stream for each modality, which depicted the semantic and cognitive development of humans. The streams of different modalities were then fused over time.
V. MODELING EMOTION: SIGNIFICANT TECHNOLOGICAL BARRIERS
Despite advancements propelled by deep learning and big data, solving the problem of emotion modeling is nowhere in sight. In this section, we share insights on some of the most significant technological barriers in computer vision (see Section V-A), statistical modeling and machine learning (see Section V-B), AI (see Section V-C), and emotion modeling (see Sections V-D–V-G) that hinder progress in the field. At present, there are no straightforward solutions to these issues. Some fundamental technologies in multiple related fields must be further developed before substantial progress can be made in emotion understanding. EQ plays a significant role in our cognitive functions, such as decision-making, information retrievability, and attention allocation. AEI will likely become an integral part of future-generation AI and robotics. To put it succinctly, AEI based on visual information is a “Holy Grail” research problem in computing.
A. Fundamental Computer Vision Methods
1). Pretraining Techniques:
The task of annotating emotion is a time-consuming process. Acquiring a large-scale dataset for FER or BEEU on the level of ImageNet is difficult. To completely overcome the challenge of emotion recognition, DNNs must acquire sufficient representation capability from extensive datasets. This gap can be bridged by using pretraining techniques, in which a model is initially trained on a massive dataset for an upstream task and then fine-tuned for a downstream task. During pretraining, the model is expected to learn emotion recognition capabilities from the upstream task. The choice of upstream task is critical because it determines the amount of emotion-related capabilities that the model can acquire.
The widely adopted pretraining strategy is to train the model on ImageNet for image classification tasks. However, the ARBEE team has demonstrated that this approach does not significantly enhance the performance of the BEEU task [51].
Our hypothesis is that image classification and emotion recognition require different types of discriminative features to be extracted. This observation reflects the fact that, while most individuals can recognize common objects, a portion of the population is unable to discern subtle emotions. In light of the current state of computer vision, two upstream tasks have shown promise for pretraining: self-supervised learning (SSL) [327], [328] and image–text retrieval [304]. Regarding SSL, our concern is that, despite pretraining with state-of-the-art SSL techniques, the model still requires large downstream datasets in the fine-tuning phase. For instance, even after pretraining with SSL, action recognition tasks required fine-tuning on the large-scale Kinetics-400 for several epochs to achieve remarkable performance [329]. As for image–text retrieval, it is crucial to carefully design the text prompts to effectively bridge an emotion recognition model with a pretrained image–text model. Further research is needed to determine how pretraining techniques can substantially enhance emotion recognition.
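As an illustration of the image–text retrieval route, the sketch below probes a pretrained CLIP model with hand-designed emotion prompts in a zero-shot manner; the prompt templates, category list, and image path are assumptions, and, as noted above, prompt design remains an open question.

```python
import torch
import clip                    # OpenAI's open-source CLIP package (assumed installed)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

emotions = ["happiness", "sadness", "anger", "fear", "surprise", "disgust", "neutral"]
# Prompt design is the critical (and open) question; these templates are illustrative only.
prompts = clip.tokenize([f"a photo of a person feeling {e}" for e in emotions]).to(device)

image = preprocess(Image.open("example_face.jpg")).unsqueeze(0).to(device)  # hypothetical path
with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
print(dict(zip(emotions, probs.squeeze(0).tolist())))
```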
2). Comprehensive and Robust Body Landmarks:
The effectiveness of current 2-D pose estimation methods has led to their widespread usage in downstream tasks, such as action recognition [97], [294]. However, for emotion recognition, these methods face significant limitations. First, the body landmarks used in pose estimation are insufficient for the analysis of subtle emotions because they do not account for all relevant movements. For instance, many methods provide only one landmark on the chest, which prevents skeleton-based emotion recognition algorithms from leveraging information about chest expansion or contraction, a common indicator of a person’s confidence. Similarly, a person can express emotions through finger movements, which most pose-estimation algorithms do not capture because they are too fine-grained. Developing an emotion-specific pose estimation method would require constructing a new, large-scale annotated dataset with additional landmarks, which can be a time-consuming and costly process. Second, 2-D pose estimation results can be noisy due to jitter errors [330]. Although these errors may have a minimal effect on the metrics of pose estimation benchmarks, they can significantly impact the understanding of subtle bodily expressions, which demands substantially higher precision of landmark locations [51]. Given that pose estimation serves as the starting point for skeleton-based BEEU methods, any errors in human pose can have a ripple effect on the final emotion prediction.
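One pragmatic mitigation for jitter, sketched below under simple assumptions, is to temporally smooth the detected landmark sequence before feature extraction; more principled filters could be substituted for the moving average used here.

```python
import numpy as np

def smooth_landmarks(pose_seq: np.ndarray, window: int = 5) -> np.ndarray:
    """pose_seq: (T, J, 2) detected 2-D landmarks over T frames.
    Returns a temporally smoothed copy using a centered moving average,
    a crude way to suppress frame-to-frame jitter before feature extraction."""
    t = pose_seq.shape[0]
    half = window // 2
    smoothed = np.empty_like(pose_seq, dtype=float)
    for i in range(t):
        lo, hi = max(0, i - half), min(t, i + half + 1)
        smoothed[i] = pose_seq[lo:hi].mean(axis=0)
    return smoothed
```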
3). Accurate 3-D Pose or Mesh in-the-Wild:
The integration of 3-D pose information has the potential to substantially enhance BEEU algorithms because the extra dimension allows for a better understanding of movements in 3-D space. In other words, BEEU algorithms could depend on precise 3-D pose or mesh as input, which should enhance their overall accuracy. While accurate 3-D poses can be obtained through MoCap systems for laboratory-collected data, obtaining accurate 3-D poses for in-the-wild data is challenging. It is often impractical to set up a MoCap system to capture a person’s 3-D pose in the wild because the area of movement is typically too large for the placement of MoCap cameras. Furthermore, existing 3-D pose or mesh estimation approaches perform poorly on in-the-wild images. The difficulty of collecting 3-D annotations for in-the-wild images has resulted in a lack of large-scale, high-quality 3-D pose datasets. Current 3-D pose models are heavily reliant on lab-collected 3-D datasets for training, which exhibit a distinct domain shift from in-the-wild images. In a laboratory setting, the lighting and environment are fixed, and the appearance and posture of individuals are monotonous. Conversely, in-the-wild images exhibit significant variations in these factors, making it difficult for current models to generalize well to them.
B. Fundamental Statistical Modeling and Learning Methods
The field of emotion modeling has seen a shift from conventional computer vision and machine learning techniques (e.g., [46]) to deep learning-based methods (e.g., [51] and [331]). For many other computer vision problems, deep learning has shown its power in substantially advancing the state of the art compared to other machine learning methods. However, some intrinsic limitations of deep learning continue to limit progress in the field of emotion modeling. In this context, we aim to highlight some fundamental data-driven AI capabilities that, if developed, could drive the field forward.
1). Modeling a Complex Space With Scarce Data:
A person’s emotions and behaviors are influenced by various factors, such as personality, gender, age, race/ethnicity, cultural background, and situational context. Modeling this multidimensional, complex space requires a massive amount of data to sample it sufficiently and properly. Despite the potential to collect data from the Internet, manually annotating such a large dataset for emotion can be prohibitively expensive. This conundrum challenges machine learning researchers to find ways to learn meaningful information with limited data collection.
A relatively more specific challenge facing researchers developing datasets for emotion research is determining the appropriate amount of data needed to obtain meaningful results through machine learning. Emotion data, when collected from public sources, tend to be naturally imbalanced, with some emotion categories (e.g., happiness and sadness) having a much higher number of samples than others (e.g., yearning and pain) [51]. Unlike in typical object recognition where metadata (e.g., keywords and image file names) can aid in crawling a reasonably balanced dataset, we usually cannot determine or estimate the emotion label for a piece of visual data from just the available metadata. Making a balanced, representative dataset depends on crawling or collecting a very large dataset, in the hope of obtaining a sufficient number of samples for the less prevalent categories. Such a laborious process greatly increases the cost of data collection. A potential solution to this challenge is to use AI to guide a more efficient data collection process so that limited annotation resources can be used to achieve maximum benefit for AI training.
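One way AI could guide data collection, sketched below under simple assumptions (an unlabeled data loader that yields inputs together with integer sample identifiers), is to rank unlabeled clips by the entropy of a current model’s predictions and route the most uncertain ones to human annotators first.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_for_annotation(model, unlabeled_loader, budget: int, device="cuda"):
    """Active-learning-style selection: score each unlabeled sample by the
    entropy of the current model's predicted emotion distribution and return
    the ids of the `budget` most uncertain samples for human annotation."""
    model.eval()
    scores, indices = [], []
    for x, idx in unlabeled_loader:              # loader yields (inputs, id tensors)
        probs = F.softmax(model(x.to(device)), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        scores.append(entropy.cpu())
        indices.append(idx)
    scores = torch.cat(scores)
    indices = torch.cat(indices)
    top = scores.argsort(descending=True)[:budget]
    return indices[top].tolist()
```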
2). Explainability, Interpretability, and Trust:
The ability of AI systems to provide clear and interpretable explanations of their reasoning processes is particularly crucial in critical applications, such as healthcare, because end-users, who may not have a data science background, need to have confidence in the AI system’s quantitative findings. There is ongoing research in this area within the broader field of AI, with some progress being made [332], [333], [334]. However, it is widely considered an open challenge.
The task of emotional understanding presents unique challenges in terms of explainability, interpretability, and trust. First, emotions are highly subjective, and AI models typically learn a general population’s responses from data collected from numerous human subjects. The AI’s ability to explain or interpret results is thus limited to a general perspective, which any particular individual may find unconvincing to some extent, especially if their views on emotions frequently deviate from the norm.
Second, causal relationships are often more significant in emotion-related applications. For example, in determining why a sunset scene evokes positive emotions in many viewers, it is important to understand whether it is the sunset scenario, the orange hues, the gradient of colors, the horizontal composition, or some other properties that cause viewers to feel positive emotions. Without this knowledge, algorithms developed may behave unpredictably in many situations, acting as black boxes. Currently, methods for causal discovery are often not scalable to high-dimensional or complex data and are sensitive to sparse data. Therefore, there is a need to develop appropriate frameworks and methods for high-dimensional and complex causal discovery specifically tailored to the understanding of emotions.
Third, overlapping semantics among emotion labels add complexity to interpreting results. The BoLD dataset has shown, using video clip annotations, that several pairs of emotion labels are highly correlated [51]. Examples are pleasure and happiness (correlation = 0.57), happiness and excitement (0.40), sadness and suffering (0.39), annoyance and disapproval (0.37), sensitivity and sadness (0.37), and affection and happiness (0.35). Even in the dimensional VAD model, researchers have detected correlations between valence and dominance (0.359) and between arousal and dominance (0.356). Current data-driven approaches can often provide a probability score for each emotion label in the classification system. Although it remains common practice to sort these scores to determine the most likely emotion, it is not always clear what a mixture of scores represents. For example, what does it mean to have 80% happiness, 60% excitement, and 50% sadness? Is it reasonable to simply classify the data point as happiness? Should it instead be considered a mixture of happiness, excitement, and sadness, perhaps representative of humans’ often complex emotional states? Or could it be some partial combination of these categories? More fundamental statistical learning methods will likely be needed to address this issue in a principled way.
One potential strategy in automated emotion recognition is to uncover useful patterns from a large amount of data, with the aim of gaining a deeper understanding of emotions beyond simple classification. Such findings may either support or challenge existing psychological theories or inspire new hypotheses. However, black-box machine learning models can provide little insight into why a decision is made, which restricts our ability to gain knowledge through automated learning. As a result, there has been a recent and growing focus on developing interpretable machine learning.
There are two primary approaches to interpretable machine learning. The first, known as the model-agnostic approach [335], [336], uses relatively simple models, such as decision trees and linear models, to locally approximate the output of a black-box model, such as a DNN. A fundamental issue with this approach is that the explanation is local, usually covering a neighborhood around each input point, which raises the question of whether such limited explanations are truly useful. Because the power of explaining a phenomenon implies the capability to reveal underlying mechanisms coherently for a wide range of cases, a severe lack of generality in interpretation undermines this goal.
In the second approach, the emphasis is on developing models that are inherently interpretable. Classic statistical models, due to their simple structure, are often inherently interpretable. However, their accuracy is often significantly lower compared to top-performing black-box models. This drawback of classic models has motivated researchers to develop models with enhanced accuracy without losing interpretability. For example, Seo et al. [337] proposed the concept of cosupervision by DNN to train a mixture of linear models (MLM), aimed at filling the gap between transparent and black-box models. The idea is to treat a DNN model as an approximation to the optimal prediction function based on which augmented data are generated. Clustering methods are used to partition the feature space based on the augmented data, and a linear regression model (or logistic regression for classification) is fit in each region. Although MLMs have existed in various contexts, they have not been widely adopted for high-dimensional data because of the difficulties in generating a good partition. The authors overcame this challenge by exploiting DNN. They also developed methods to help interpret models either by visualization or simple description. Advances in the direction of developing accurate models that are directly interpretable are valuable for emotion recognition.
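The following is a rough sketch of the cosupervision idea under strong simplifications: a trained black-box model pseudolabels augmented points, k-means partitions the feature space, and one interpretable linear model is fit per region. It is not the algorithm of [337], only an illustration of the overall recipe.

```python
# Rough sketch of DNN-cosupervised mixture of linear models, under simplifying
# assumptions: `black_box_predict` is a trained model's prediction function and
# X_augmented are extra feature points generated around the training data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fit_mixture_of_linear_models(black_box_predict, X_augmented, n_regions=8):
    y_pseudo = black_box_predict(X_augmented)          # pseudo-labels from the black box
    partition = KMeans(n_clusters=n_regions, n_init=10).fit(X_augmented)
    region_models = {}
    for r in range(n_regions):
        mask = partition.labels_ == r
        if mask.sum() > 1 and len(np.unique(y_pseudo[mask])) > 1:
            region_models[r] = LogisticRegression(max_iter=1000).fit(
                X_augmented[mask], y_pseudo[mask])     # transparent model per region
    return partition, region_models

def predict(partition, region_models, X):
    regions = partition.predict(X)
    # -1 marks regions where no linear model could be fit.
    return np.array([region_models[r].predict(x[None])[0] if r in region_models else -1
                     for r, x in zip(regions, X)])
```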
3). Modeling Under Uncertainty:
Data available for modeling human emotions and behaviors are often suboptimal, leading to various challenges in accurate modeling. For instance, in the case of BEEU, the available data are often limited to partial body movements (e.g., only the upper body is visible in the video) and may include occlusions (i.e., certain body parts are blocked from view). In addition, the automated detection of human body landmarks in video frames is not always precise. An interesting research direction will be to establish accurate models based on incomplete and inaccurate data. Furthermore, it is important to quantify uncertainty throughout the machine learning process, given that such uncertainties are present at each step. This is necessary in order to effectively communicate results to users and help them make informed interpretations.
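One common, lightweight way to expose such uncertainty, sketched below, is Monte Carlo dropout: keep dropout active at test time and treat the spread of several stochastic predictions as a rough uncertainty estimate (assuming the model contains dropout layers and any batch-normalization statistics are frozen).

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x: torch.Tensor, passes: int = 20):
    """Monte Carlo dropout sketch: run several stochastic forward passes with
    dropout enabled; the spread of the predictions is a rough measure of the
    model's uncertainty for this input."""
    model.train()            # enables dropout layers (assumes BatchNorm is absent or frozen)
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(passes)])
    mean = probs.mean(dim=0)             # averaged emotion distribution
    std = probs.std(dim=0)               # per-class predictive uncertainty
    return mean, std
```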
4). Learning Paradigms for Ambiguous Concepts:
The utilization of traditional learning frameworks, such as DNN, SVM, and classification and regression trees (CARTs), may prove limited in tackling complex problems, such as modeling not-so-well-defined concepts (e.g., emotions and movements). Unlike more concrete objects, such as cars and apples, there may not always be a clear ground truth for the expression of emotions in a video clip. Establishing a ground truth is complicated by the subjective interpretations of annotators, leading to differing viewpoints with no single correct interpretation.
Although these traditional learning frameworks are optimized for well-defined concepts, their straightforward application to emotion recognition may lead to incorrect assumptions, such as assuming that the majority annotation constitutes the ground truth. In the case of BEEU, the creation of a mid-layer of movement concepts between pixel-level information and high-level emotions brings forth another challenge. Many movement classes are qualitative in nature (e.g., dropping weight, rhythmicity, strong, light, or smooth movement, and space harmony), further complicating the development of accurate models. The need for advanced learning paradigms that can effectively tackle the challenges of modeling complex, ambiguous concepts, such as emotions and movements, has become apparent.
5). AI Fairness and Imbalanced Datasets:
In the collection of data regarding human emotions and behaviors, it is not uncommon for certain demographic or emotion/behavior groups to have smaller sample sizes compared to others. For instance, data about white people are often more abundant than data about black people, and data about happiness and sadness are typically more abundant than data about esteem, fatigue, and annoyance. Similarly, data across cultural groups vary significantly. Despite being a current area of research in AI [338], tackling such inadequacies is particularly complex in emotion understanding due to its unique characteristics.
6). Efficiency of AI Models:
For some applications, emotion recognition algorithms must be fit onto robots or mobile devices that are limited by their onboard computing hardware and battery power. Complex problems that require multiple AI algorithms and models to work in concert can be difficult to address in real time without high-performance GPU/CPU computing hardware. Therefore, it is crucial to simplify the mathematical models while maintaining their accuracy.
Machine learning researchers have developed techniques to simplify models, including channel pruning. In an earlier attempt, Ye et al. [339] proposed a method for accelerating the computations of deep CNNs that directly simplifies the channel-to-channel computation graph without performing the computationally difficult and not-always-useful task of sparsifying the high-dimensional tensors of the CNN structure. There have been numerous recent studies on pruning neural networks [340], [341].
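As a small illustration of channel pruning in practice (not the method of [339]), the sketch below uses PyTorch’s built-in structured pruning utility to zero out a fraction of convolution output channels by L1 norm; physically removing the zeroed channels to realize speedups would require an additional surgery step.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

# Structured pruning sketch: zero out 30% of output channels of every Conv2d
# layer by L1 norm. The zeroed channels remain in the graph; removing them to
# obtain actual speedups requires an extra model-surgery step.
model = resnet18(weights=None)
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=1, dim=0)
        prune.remove(module, "weight")       # make the pruning permanent

zeroed = sum((m.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
             for m in model.modules() if isinstance(m, nn.Conv2d))
print(f"convolution output channels zeroed: {zeroed}")
```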
C. Fundamental AI Methods
Several fundamental AI components are needed to create an effective human–AI interaction and collaboration system that can have a significant impact on relevant domains. Simply increasing the amount of data may not be sufficient to counterbalance the deficiencies in these fundamental areas of AI. In the following, we explore some of these challenges in greater detail.
1). Decision-Making Based on Complex Information:
In real-world applications, such as mental healthcare, we often need to incorporate multiple sources of information (e.g., nonverbal cues, verbal cues, health record information, and observations over time), and different people may have different sets of information inputs (e.g., some patients may not have observations over time, whereas others may not have detailed health records). For example, when a patient with depression visits a clinic, a combination of behaviors, speech, and health record information can be used to make informed decisions. To effectively connect all relevant information, research is required to develop a comprehensive framework. This fundamental area of AI research has the potential to impact many AI applications.
2). Integrative Computational Models and Simulations for Understanding Human Communication and Collaboration:
Current research primarily concentrates on analyzing individuals, such as deducing an individual’s emotional state from their behavior. However, emotional expressions play a significant role in interpersonal communication and collaboration. The emotional expression of one individual can have a significant impact on the emotions and behavior of those around them. Thus, there is a need for the development of integrative computational models and simulations to investigate the emotional interactions among individuals. As the problem of understanding individual emotions remains unresolved, a comprehensive approach to study interpersonal emotional interactions must consider the uncertainties present in individual-level emotion recognition.
In addition, when robots are integrated into the interaction, the distinction between human–human and human–robot interactions (HRIs) must be weighed. The external design and behavior of robots can vary greatly and offer a much wider range of possibilities than humans, making it challenging to sample comprehensively the potential space of robot interactions. For example, a robot can take on various forms, such as a human, animal, or a unique entity with its own distinct personality. Similarly, robots can move in ways that go beyond human or animal-like motions. While research can be conducted with limitations on specific types of robots, the results may not be applicable to other forms of robots.
3). Incorporation of Knowledge and Understanding of Cognitive Processes:
When humans interpret emotions through visual media, they rely on their accumulated knowledge and experience of the world around them. However, the same behavior can be interpreted differently depending on the context or situation. Data-driven emotion recognition approaches require vast amounts of training data to be effective, but the countless possible scenarios and contexts can make obtaining such data challenging. To address this issue, AI researchers have been exploring the development of common-sense knowledge representations [342]. Integrating these advances into an emotion recognition framework is an important area of research. In addition, cognitive scientists have gained valuable insights into human cognitive processes through experiments, and incorporating these findings into the design of a next-generation emotion recognition system could be a key to its success.
The expression and interpretation of emotions by humans involve various levels and scales, ranging from basic physiological processes impacting a person’s behavior to sociocultural structures that shape their knowledge and actions. Currently, multilevel and multiscale analyses of emotions are rare in AI due to the complexity they entail.
4). Prediction of Actions:
Most of the current research in the field focuses on emotion recognition using visual information. However, for certain human–AI interaction applications, it is necessary not only to recognize emotions based on past data but also to proactively gather information in real time and make predictions about future events. For example, in the event of a heated argument between two human workers, a robot may need to move closer to better understand the situation and, based on changes in the workers’ behavior, predict any potential danger and take action to resolve the issue. This might entail alerting others or attempting to redirect the attention of the involved parties. Research is required to map emotion recognition to appropriate actions, even while acknowledging the inherent uncertainty involved in the process of emotion understanding. Such a process is more nuanced than typical scene understanding.
D. Demographics
Unlike many computer vision problems, such as object detection, when it comes to emotion, we simply cannot ignore the effect of demographics. Emotional responses can vary greatly among different demographic groups, including gender, age, race, ethnicity, education, socioeconomic status, religion, marital status, sexual orientation, health and disability status, and psychiatric diagnosis. Existing machine learning-based recognition technologies are not equipped to effectively handle such a vast array of demographic factors. In the absence of sound methods for evaluation, our brains tend to resort to shortcuts that may not be dependable in order to conserve energy and navigate situations where solid judgment is lacking. To fill the gap, we often rely on stereotypes, heuristics, experience, and limited understanding to gauge emotions in other demographic groups, which may be unreliable. However, when we design AI systems, such shortcuts are not acceptable because mistakes made by machines can have disproportionately negative consequences for individuals and society as a whole. Addressing the issue of demographics in automated emotion understanding will likely remain a persistent challenge in the field.
E. Disentangling Personality, Function, Emotion, and Style
A person’s behavior, captured by imaging or movement sensors, is a combination of several elements, including personality, function, emotion, and style. Even if we can find solutions to the problems mentioned earlier, separating these elements so that emotional expression can be properly analyzed will remain challenging. For example, the same punching motion would convey different emotions in a volleyball game versus during an argument between two individuals. This single example highlights the need for AI to first understand the purpose of a movement. For the same function, with the exact same sequence of movements, two persons with very different personalities and contexts would likely be expressing different emotions, or at least different levels of the same emotion. Without knowing people’s personality traits, it will be impossible to pinpoint their emotional state. There is a need to advance technology that differentiates between these factors in movements.
While fine-tuning the learned model to a specific person is possible, it usually requires collecting a substantial amount of annotated data from that person, which may not be feasible in practical applications requiring personalization. Further research is necessary to develop methods for personalizing emotion-related models with minimal additional data collection.
F. Partitioning the Space of Emotion
Thus far, technology developers have mostly relied on psychological theories of emotion, including the various models that we discussed. However, these models have limitations that make them not ideal for AI applications. For example, if a model used in an AI program has too many components, the program may struggle to differentiate among them. At the same time, if the model is too simple with too few components, the AI may not be able to fully grasp the human emotion for the intended application. The VAD model offers a solution to this issue, but it is not suitable for AI applications for which specific emotions need to be identified. A deeper understanding of the emotional spectrum in AI will lead to more effective applications.
In a recent study, Wortman and Wang [343] articulated that the strongest models need robust coverage, which means defining the minimal core set of emotions from which all others can be derived. Using techniques from NLP and statistical clustering, these researchers showed that a set of 15 discrete emotion categories could achieve maximum coverage. This finding applies to the six major languages that they tested: Arabic, Chinese, English, French, Spanish, and Russian. The categories were identified as affable, affection, afraid, anger, apathetic, confused, happiness, honest, playful, rejected, sadness, spiteful, strange, surprised, and unhealthy. A more refined model with 25 categories was also proposed, which added accepted, despondent, enthusiasm, exuberant, fearless, frustration, loathed, reluctant, sarcastic, terrific, and yearning, and removed rejected. Through the analysis of two large-scale emotion recognition datasets, including BoLD, the researchers confirmed the superiority of their models compared to existing models [343].
G. Benchmarking Emotion Recognition
Effective benchmarking has been instrumental in driving advancements in various AI research areas. However, benchmarking for emotion recognition is a challenging task due to its unique nature and the obstacles discussed earlier. In the following, we offer insights on how to establish meaningful benchmarks for the field of emotion recognition, with a specific emphasis on the relatively new area of BEEU, where benchmarking is currently lacking.
1). Benchmark Task Types:
A suite of tasks should be devised, including basic tasks, such as single-data-type recognition of emotion (based on video only, images only, audio only, skeleton only, and human mesh only), as well as multimodal recognition (a combination of video, audio, and text). Emotion localization, which involves determining the range of frames in a video that depicts a targeted label, as well as movement recognition or LMA recognition using video, skeleton, or human mesh, should also be considered. Furthermore, tasks related to predicting emotion from movement coding and video or based on interaction can be developed. With the rich data across various contexts, natural environments, or situations (e.g., celebration, disaster, and learning), data mining tasks focused on social interactions, including the comparison between age groups or the impact of assistive animals on mood, can be explored. In addition, real-world use-case challenges targeting specific applications can be utilized to assess algorithms’ broad applicability and robustness.
2). Testing and Evaluation:
In a benchmarking competition, the performance of participating teams’ algorithms or systems can be evaluated using various criteria. Along with standard prediction accuracy based on shared training and testing datasets and the extent of emotion coverage, a system’s performance with limited training data and the equity of its accuracy across demographic subgroups, such as gender or ethnicity, can also be considered. The competition host can supply training data of varying sizes and the required metadata, allowing participating teams to focus on a specific evaluation criterion and compete against others using the same standard. This competition format promotes diverse scientific exploration; collectively, teams focusing on different standards broaden the scope of models being investigated, effectively fostering a form of free-style community collaboration.
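As a minimal sketch of one such criterion, the snippet below (the group labels, threshold-free accuracy measure, and toy data are illustrative assumptions, not part of any existing benchmark) computes per-subgroup accuracy and the largest accuracy gap across demographic groups:

```python
import numpy as np

def subgroup_accuracy_report(y_true, y_pred, groups):
    """Per-subgroup accuracy and the largest accuracy gap, one possible
    equity-oriented evaluation criterion for an emotion benchmark."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {g: float(np.mean(y_true[groups == g] == y_pred[groups == g]))
            for g in np.unique(groups)}
    gap = max(accs.values()) - min(accs.values())
    return accs, gap

# Toy example with two hypothetical demographic groups ("A" and "B")
y_true = [0, 1, 1, 0, 2, 2, 1, 0]
y_pred = [0, 1, 0, 0, 2, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(subgroup_accuracy_report(y_true, y_pred, groups))
```

In practice, a host could report both the per-group accuracies and the gap, so that a small gap achieved by uniformly poor accuracy is not mistaken for equitable performance.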
3). Verification and Validation:
To validate software packages developed by participating teams in the competition, winning teams should be required to deposit their packages on repositories such as GitHub. The competition host should provide guidelines for verification, including compatibility with common computing environments, comprehensive documentation, and clear feedback on the execution status and reasons for any unexpected termination. To maintain fairness, true labels for test cases should not be disclosed prior to the completion of a competition.
Software packages should undergo thorough verification and validation throughout the entire training and testing pipeline. The competition host should replicate the training and testing process provided by each winning team, and the results should be compared to the claimed results. To streamline the validation process, subsampling of test cases may be employed.
4). Risk Management:
To gauge the robustness of winning teams’ software packages, the teams should be asked to provide results from a set of robustness tests, even though, during the competition itself, the comparison standard should be based on a single, focused criterion within a relatively straightforward test framework. Specifically, the impact of variations in factors such as batch randomization, bootstrapped sampling of training images, and training data size should be evaluated numerically.
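The following sketch illustrates how such a sweep over training-set size and sampling randomness might be scripted; `train_fn` and `eval_fn` are placeholder interfaces (assumed here, not defined in this article) that a participating team would supply for its own package:

```python
import numpy as np

def robustness_sweep(train_fn, eval_fn, X, y, X_test, y_test,
                     fractions=(0.25, 0.5, 1.0), seeds=(0, 1, 2)):
    """Numerically gauge sensitivity of a training pipeline to training-set
    size and sampling randomness. train_fn(X, y, seed) -> model and
    eval_fn(model, X_test, y_test) -> scalar score are placeholders; X and y
    are NumPy arrays."""
    n = len(X)
    results = {}
    for frac in fractions:
        scores = []
        for seed in seeds:
            rng = np.random.default_rng(seed)
            # Bootstrapped subsample of the training set at this size.
            idx = rng.choice(n, size=int(frac * n), replace=True)
            model = train_fn(X[idx], y[idx], seed)
            scores.append(eval_fn(model, X_test, y_test))
        # Mean and spread of scores summarize sensitivity at this data size.
        results[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```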
5). Evaluation Metric:
As a person’s emotional state does not fall exclusively into a single type, to provide a fair evaluation of algorithms developed by the competition participants, emotions can be characterized by a distribution over a given set of types, allowing each dataset to be described by different (and multiple) types and users to select the set that works best for their methods. Both the ground truth and the output of emotion recognition algorithms are formatted as distributions over these types. Suppose that there are a total of $K$ emotion types denoted by $e_1, \ldots, e_K$. Different from a typical classification problem, there is a more subtle relationship between these types. Each pair of emotion types has a specified distance (or similarity) instead of just being different. For example, the emotions “happy” and “sad” are farther apart than “happy” and “excited.” As a result, when we compare two emotion distributions over these types, we want to account for the underlying distances between emotion types. These pairwise distances can be estimated from the data based on how two emotion types co-occur. We can then use the Wasserstein distance [344] to compute the overall distance between the ground truth and the computer prediction. Other commonly used distances between distributions, such as the $L_p$ norm or the KL divergence, cannot factor in the underlying distances between emotion types. Let the distance between $e_i$ and $e_j$ be $d_{i,j} \geq 0$, $i, j = 1, \ldots, K$. Consider two probability mass functions $p = (p_1, \ldots, p_K)$ and $q = (q_1, \ldots, q_K)$ over $e_1, \ldots, e_K$. The Wasserstein distance is defined by an optimal transport problem [344]. Let $\Gamma = (\gamma_{i,j})$ be a nonnegative matching matrix between $p$ and $q$. The Wasserstein distance is
$$W(p, q) = \min_{\Gamma \geq 0} \sum_{i=1}^{K} \sum_{j=1}^{K} \gamma_{i,j}\, d_{i,j} \quad \text{subject to} \quad \sum_{j=1}^{K} \gamma_{i,j} = p_i \;\; \forall i, \qquad \sum_{i=1}^{K} \gamma_{i,j} = q_j \;\; \forall j.$$
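As an illustration, the optimal transport problem above can be solved directly as a small linear program. The snippet below is a minimal sketch assuming NumPy/SciPy; the ground-distance matrix `D` and the toy distributions are assumed values, in practice estimated from emotion co-occurrence as described above:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_emotion_distance(p, q, D):
    """Wasserstein (earth mover's) distance between two emotion distributions
    p and q over K emotion types, given a K x K ground-distance matrix D
    whose entry D[i, j] is the distance between emotion types i and j."""
    K = len(p)
    c = D.reshape(-1)  # objective: sum over i, j of Gamma[i, j] * D[i, j]
    A_eq = np.zeros((2 * K, K * K))
    for i in range(K):
        A_eq[i, i * K:(i + 1) * K] = 1.0  # row marginal: sum_j Gamma[i, j] = p[i]
    for j in range(K):
        A_eq[K + j, j::K] = 1.0           # column marginal: sum_i Gamma[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy example over four emotion types; D is an assumed symmetric distance matrix.
p = np.array([0.6, 0.3, 0.1, 0.0])   # ground-truth distribution
q = np.array([0.4, 0.4, 0.1, 0.1])   # predicted distribution
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])
print(wasserstein_emotion_distance(p, q, D))
```

Because the cost matrix encodes which emotion types are semantically close, a prediction that confuses “happy” with “excited” is penalized less than one that confuses “happy” with “sad,” which is exactly the behavior the benchmark metric should reward.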
VI. BEYOND EMOTION: INTERACTION WITH OTHER DOMAINS
Emotion, as one of the core components of human-to-human communication, can play an essential role in an array of future technological advancements impacting different parts of society. In this section, we provide an overview of how visual emotional understanding can be connected with other research problems, domains, or application areas, including art and design (see Sections VI-A and VI-B), mental health (see Section VI-C), robotics, AI agents, autonomous vehicles, animation, and gaming (see Section VI-D), information systems (see Section VI-E), industrial safety (see Section VI-F), and education (see Section VI-G). Instead of attempting to provide exhaustive coverage, we aim to highlight key intersections. Because some areas are in their early stages of development, we provide only a brief discussion of their potential.
A. Emotion and Visual Art
Art often depicts human emotional expressions, conveys the artist’s feelings, or evokes emotional responses in viewers. Except for certain genres in visual art, e.g., realism, achieving lifelikeness is usually not the primary goal. Dutch post-impressionist painter Vincent van Gogh wrote, “I want to paint what I feel, and feel what I paint.” Similarly, fine-art photographer Ansel Adams stated, “A great photograph is one that fully expresses what one feels, in the deepest sense, about what is being photographed.” It is evident that artists intentionally link visual elements in their works with emotions. However, the relationship between visual elements in art and the emotion they evoke is still largely an enigma.
Because visual artworks are almost always handcrafted and artists often develop unique styles, artworks are often abstract and difficult to analyze. In 2016, Lu et al. extended this research on evoked emotion in photographs [46] to paintings [47]. They acknowledged that applying models developed for photographs to paintings would not be accurate because of the different visual characteristics of the two types of images. To address this, they created an adaptive learning algorithm that leveraged labeled photographs and unlabeled paintings to infer the emotions evoked by paintings.
To convey emotion effectively, artists often create and incorporate certain visual elements that are not commonly seen in real-world objects or scenes. An example is van Gogh’s highly rhythmic brushstroke style, which computer vision researchers Li et al. showed to be one of the key characteristic differences between him and his contemporaries [345] (Fig. 18). In fact, van Gogh took piano lessons in the period 1883–1885, in the middle of his painting career. He wrote to his younger brother Theo toward the latter part of his career in 1888, “...this bloody mistral is a real nuisance for doing brushstrokes that hold together and intertwine well, with feeling, like a piece of music played with emotion.” The study by Li et al. [345] highlights the importance of designing algorithms specifically to answer the art-historical question at hand rather than using existing computer vision algorithms meant for analyzing real-world scenes.
Because artwork is the crystallization of the creativity and imagination of artists, studying artwork using computers and modern AI has the potential to reveal new perspectives on the connection between visual characteristics and emotion. Artists often incorporate exaggerated visual expressions, such as carefully designed color palettes, tonal contrasts, brushstroke texture, and elegant curves. These features have inspired computer scientists to create new algorithms for analyzing visual content. For instance, Yao et al. [346] developed a color triplet analysis algorithm to predict the aesthetic quality of photographs, drawing inspiration from artists’ use of limited color palettes. Li et al. [347] created an algorithm for tonal adjustments based on the visual art concept of “Notan” that captures the dark and light arrangement of masses. Motivated by the use of explicit and implicit triangles in artworks, He et al. [348] and Zhou et al. [349] developed algorithms to identify triangles in images, which can assist portrait and landscape photographers with composition.
Current techniques for understanding emotion are not yet capable of analyzing certain aspects of emotion expressed in artwork, particularly at the level of composition or abstraction. For example, when emotions are conveyed through subtle interactions between people, the correspondence between low-level features that we can extract and the emotions that they represent cannot be easily established. American Impressionist painter Mary Cassatt’s work, for example, depicts the love bond between a mother and child through composition, pose, and brushstrokes, rather than through clear facial expressions. Similarly, American Modernist artist Georgia O’Keeffe used dramatic colors, elegant curves, and creative, abstract composition in her paintings of enlarged flowers and landscapes to convey feelings. She stated, “I had to create an equivalent for what I felt about what I was looking at—not copy it.” There is still much to be discovered by technology researchers in terms of the systematic connection between visual elements in abstract artwork and the emotions they convey.
B. Emotion and Design
Emotion plays a key role in product design, whether it is for physical or virtual products. Cognitive scientist Norman [350] was a pioneer in the study of emotional design. A successful design should evoke positive emotions in users/customers, such as excitement and a sense of pride and identity. In physical products, from a bottle of water to a house, designers carefully select visual elements, such as round corners, simple shapes, and elegant curves, to evoke positive emotions in customers. Similarly, designers of websites, mobile apps, and other digital products and services use harmonious color schemes, simple and clean layouts, and emotion-provoking photographs to create a positive emotional impact on viewers or users.
By advancing evoked emotion prediction, future designers can be assisted by computers in multiple ways. First, computers can assess the evoked emotion of a draft design, based on models learned from large, annotated datasets. For example, a website designer can ask the computer to rate a sample screenshot, identify areas for improvement, and provide advice on how to improve it. To develop this capability, however, researchers need to gain a better understanding of how demographics affect emotion. Certain design elements, e.g., color, may evoke different feelings in different cultures. A system trained with a general population may perform poorly for certain demographic groups. Second, computers can provide designers with design options that not only meet customer needs but also evoke a specific emotion. For example, deep learning and generative adversarial networks (GANs) can already do the former task. If an additional emotion understanding component can be used to assess the design options generated and provide feedback to the system, the resulting designs can then evoke a desired emotion.
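A simple generate-and-filter loop conveys this idea. In the sketch below, `generator` and `emotion_scorer` are hypothetical stand-ins for a generative design model (e.g., a GAN or diffusion sampler) and an evoked-emotion predictor, respectively; neither refers to a specific existing library:

```python
def propose_emotion_aware_designs(generator, emotion_scorer, target_emotion,
                                  n_candidates=32, n_keep=4):
    """Generate candidate designs and keep those predicted to best evoke the
    target emotion. generator() returns one candidate design, and
    emotion_scorer(design) returns a dict mapping emotion labels to predicted
    evoked-emotion probabilities; both are placeholders."""
    candidates = [generator() for _ in range(n_candidates)]
    scored = [(emotion_scorer(d).get(target_emotion, 0.0), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [design for _, design in scored[:n_keep]]
```

The same scores could also be fed back into the generator as a training signal rather than used only as a post hoc filter.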
C. Emotion and Mental Health
Many mental health disorders can be considered disorders of emotion and emotion regulation [351]. This is because mental health disorders often entail extremes of chronic self-reported distress, sadness, anxiety, or lack of emotions, such as flat affect and numbness, as well as extremes in fluctuation of emotions [352]. For example, anxiety disorders are instances of being in an ongoing state of the fight or flight response, often viewing danger where none exists and, therefore, reacting as though the danger is constantly present and/or overreacting to ambiguous information. Emotion-related symptoms of anxiety disorders include panic attacks, fear, feelings of impending danger, agitation, excessive anxiety, feeling on edge, irritability, and restlessness [352]. Major depressive disorder has been conceptualized as a disorder of sustained negative affect, particularly sadness, and low levels of positive affect [353], [354]. Emotion-related symptoms of major depression include feeling sad or down most of the day nearly every day for at least two weeks and can also include an abundance of guilt, agitation, excessive crying, irritability, anxiety, apathy, hopelessness, loss of interest or pleasure in activities, mood swings, and feelings of restlessness. Similarly, bipolar disorder can include mood swings, elevated sadness, anger, anxiety, apathy, apprehension, euphoria, general discontent, guilt, hopelessness, loss of interest or pleasure, irritability, aggression, and agitation [352]. Extreme mood swings are also a prominent feature of some personality disorders, such as borderline personality disorder, which also entails intense depressed mood, irritability or anxiety lasting a few hours to a few days, chronic feelings of emptiness, intense or uncontrollable anger, shame, and guilt [352]. Schizophrenia is likewise associated with mood-related symptoms. Positive symptoms can include delusions or paranoia, feeling anxious, agitated, or tense, and being jumpy or catatonic. Negative symptoms can include lack of interest or enthusiasm, lack of drive, and being emotionally flat [352]. Because extreme emotions are associated with these disorders, researchers have examined ways to identify important distinctive features of them from videos.
Such studies have examined ways to use machine learning to code videos for nonverbal behaviors from facial expressions, body posture, gestures, voice analysis, and motoric functioning to diagnose mental health problems. In terms of facial expressions, studies have found that people with major depression, bipolar disorder, and schizophrenia demonstrated less facial expressivity compared to individuals without these disorders [355], [356], [357]. Depressed compared to nondepressed individuals also evidenced shorter durations and lower frequency of smiling behavior, less looking at an interviewer, and less eyebrow movement [358], [359], [360]. Such differences have been used to discriminate depressed from nondepressed individuals [361], [362], [363], [364], [365], [366]. Studies have similarly diagnosed differential facial movement features of people with disorders such as autism spectrum disorder [367], posttraumatic stress disorder, and generalized anxiety disorder from healthy controls [368]. Similar to facial emotion, studies have used linguistic and voice emotion analysis to detect disorders such as depression, schizophrenia, bipolar disorder, posttraumatic stress disorder, and anxiety disorders [369], [370], [371].
As with facial actions, gestures and body movements have also been examined. In terms of gestures, those with depression showed more self-touching than those without depression [358], [360], [372]. Compared to schizophrenic individuals, those with depression tended to make fewer hand gestures [358]. Bipolar and depressed people also showed less gross motor activity than those without these disorders [360]. However, depressed and bipolar individuals showed more gross motor activity than people with schizophrenia [360]. At the same time, patients with schizophrenia demonstrated fewer hand gestures [373], fewer small and large head movements, and shorter duration of eye contact compared to those with depression [358], [360], [374]. Additional studies have detected attention deficit hyperactivity disorder from gestures and body movements [367]. Such differences have been used to diagnose mental health problems [375], [376]. See Table 7 for more details about differentiating clinical disorders from healthy controls.
Table 7. Features Differentiating Clinical Disorders From Healthy Controls
Major Depressive Disorder | |
• Reduced facial expressivity [355, 356] | • Reduced variability of head movements [377, 378] |
• Less eyebrow movements [379, 358, 380, 381] | • More nonspecific gaze patterns [382] |
• Looking-down behaviors [382] | • Less eye contact with another person [358] |
• Reduced hand gestures [379, 381] | • Less smiling [379, 358, 380, 383, 381, 360, 359] |
• More self-touching [358, 383, 381, 360] | • Slower voice [384] |
• Reduced rate of speech [384] | • More monotonic voice [385, 386, 369, 387] |
• Reduced speech [384] | • Reduced pitch range [384] |
• Slower movements or abnormal postures [388, 372] | • Reduced gross motor activity [379, 389, 360] |
• Reduced stride length and upward lifting motion of legs [390, 391] | • Slower gait speed [392, 390, 393] |
• Arm swing and vertical head movements while walking [393] | • Lateral upper body sway while walking [393] |
• Slumped posture [394, 393, 395] | • Forward inclination of head and shoulders [396, 397] |
• Balance difficulties during motor and cognitive tasks [398, 399, 393, 400, 401] | |
• Impaired balance and lower gait velocity [390, 394, 393, 402] | • Difficulty recognizing emotions [403, 404] |
| |
Bipolar Disorder | |
• Reduced levels of facial expressivity [405] | • Greater speech tonality [406, 407, 408] |
• Less gross motor activity [360] | |
• More frequent and longer speech pauses when in depressive states [406, 409] | |
• More postural sway [410] | • Difficulty recognizing emotions [411, 412] |
| |
Schizophrenia | |
• Reduced facial expressivity [413, 355, 356, 414] | • Less upper facial movement expressing positive emotion [356, 415, 416] |
• Less smiling [358, 417, 418] | |
• Reduced smiling eye gaze and head tilting associated with negative symptoms [419, 420] | |
• Fewer hand gestures when speaking [373, 421, 419, 422, 420] | • Fewer gestures and poses [421] |
• Less head nodding [373, 419] | • Less head movement [374, 358] |
• Lower total time talking [423, 424, 371] | • Slower rate of speech [425, 426] |
• Longer speech pauses [423, 424, 371, 427] | • More pauses [371] |
• Flat affect [371] | • Forward head posture and lower spine curvature [428] |
• Balance difficulties and increased postural sway [429, 430, 431, 432] | • Difficulty walking in a straight line [433, 434] |
• Slower velocity of walking and shorter strides [435] | • Difficulty recognizing emotions [411, 412, 436] |
| |
Anxiety Disorders | |
• Less eye contact [437, 438, 439] | • Instability of gaze direction [440] |
• Grimacing [437] | • Nonsymmetrical lip deformations [441] |
• Strained face [442] | • Eyelid twitching [442] |
• Less smiling [438] | • More frequent and faster head movements [443, 444, 445] |
• More and faster blinking [443, 446, 447, 448] | • Less nodding [438] |
• Small rapid head movements [446] | • Fewer gestures [438] |
• More physical movements indicative of nervousness (e.g., bouncing knees, fidgeting, and reposturing) [449, 450, 451, 452, 453, 438] |
• Self-touching [449] | • Speech errors [454] |
• Speech dysfluency [437] | • More jittery voice [455, 456] |
• Slow gait velocity associated with fear of falling [457, 458, 459] | • Balance dysfunction [460, 461, 462, 463, 464] |
• Slower speed walking [463] | • Shorter steps [463] |
• Enhanced recognition of anxiety [465, 466] | |
| |
Posttraumatic Stress Disorder | |
• Monotonous, slower, flatter speech [369, 467, 468, 469] | • Reduced facial emotion [470] |
• More anger, aggression, hostility, less joy [471, 470] | |
| |
Autism Spectrum Disorder | |
• Distinctions in gait [472] | • Difficulty recognizing emotions [473] |
Gait, balance, and posture have also been used to identify mental health problems. For example, one meta-analysis summarized 33 studies of gait and balance in depression [474]. Depressed individuals had more slumped posture [393], [395], [397], [475] and greater postural instability and poorer postural control [399]. In terms of gait, compared to healthy controls, those with depression took shorter strides, lifted their legs upward rather than swinging them forward [390], [391], had more body sway, and walked more slowly, possibly to maintain their balance [390], [392], [393], [401]. These results are consistent with psychomotor retardation, a common symptom of depression. In terms of anxiety disorders, a study showed that these individuals walked more slowly, took shorter steps, and demonstrated problems with balance and mobility [463], [476]. Studies have also used gait to identify bipolar disorder [477], autism spectrum disorders [367], and attention deficit hyperactivity disorder [367].
In addition to mental health problems being disorders of emotional expression and experience, mental health problems can also be considered disorders of emotion recognition and understanding, leading to social deficits. Understanding one’s own and others’ emotions has been termed the theory of mind. Tests for the theory of mind can include either identifying emotions by looking at images of faces with various emotional expressions (sometimes with parts of the faces obstructed) or watching a video of interpersonal interaction and answering questions about various people’s emotions and intentions. Difficulties with theory of mind have been associated with depression [403], [404], social anxiety disorder [478], obsessive-compulsive disorder [479], schizophrenia [404], [480], bipolar disorder [404], [481], and autism spectrum disorders [473]. For example, both schizophrenic and bipolar patients showed emotional reactivity that was discordant with the content of emotional videos [357]. Studies have also examined videotapes of facial emotional reactions to emotionally evocative videos as a means of diagnosing mental health problems. For example, using this technique, those with autism spectrum disorders demonstrated impairment in their ability to recognize emotions from body gestures [482], [483]. Thus, emotion regulation, emotional understanding, and emotional reactivity can be impaired in those with mental health problems. Such impairment, however, can be used to create systems to automatically detect the presence of these emotional disorders. See Table 7 for more details.
D. Emotion and Robotics, AI Agents, Autonomous Vehicles, Animation, and Gaming
A natural application domain for emotion understanding is robotics and AI. In science fiction films, robots and AI agents are often depicted as having a high level of EQ, such as R2-D2, T-800, and Wall-E. They are able to understand human emotion, effectively communicate their own emotional feelings, engage in emotional exchanges with other robots or humans, take appropriate actions in challenging situations or conflicts, and so on. The idea of empowering robots and AI with this level of EQ is widely seen as a desirable and ultimate goal or a “Holy Grail.”
Some recent surveys studied the field of robotics and emotion [484], [485], [486], covering topics such as advanced sensors, the latest modeling methods, and techniques for performing emotional actions. Research in the fields of HRI, human–machine interaction (HMI), and human–AI interaction is highly relevant. However, because BEEU is in its infancy and is considered a bottleneck technology, we have yet to see its applications in robotics.
If we can effectively model human emotions through both facial and bodily expressions, robots can work more effectively with human counterparts. Humans would be able to communicate with robots in a way similar to how emotions are used in human-to-human communication. For example, when human workers in a warehouse want to stop a fast-moving robot, they could wave their hands swiftly to signal distress. Similarly, pedestrians could wave their hands to signal to a self-driving vehicle on a highway that there is an accident ahead, and cars should slow down to avoid a collision. In such situations, traditional forms of communication such as speech and facial expressions may not effectively convey a sense of urgency.
Effective emotional communication can help us understand the intention of robots or AI. For instance, emotionless robots can be perceived as unfriendly or unsympathetic. In certain robotic or AI applications, such as companion robots or assistive robots, it is desirable to project a compassionate and supportive image to establish trust and cooperation between the device and humans interacting with it. Researchers have begun to investigate the relationship among robotics, personality, and motion [487].
In animated films, robots can display emotional behaviors, but these are often created by recording the movements of human actors through MoCap. That is, the animated characters mimic the movements of the actors behind the scenes. However, the capacity for computers to comprehend emotions akin to human perception could enable animated characters to use an emotion synthesis engine to autonomously generate authentic emotional behaviors. Advancements in computer graphics, virtual reality, and deep learning techniques, including GANs, transformers, diffusion models, and contrastive learning, have facilitated the creation of increasingly realistic and dynamic visual content. These technologies potentially enable the synthesis of complex and nuanced emotional displays.
Emotion understanding can substantially enhance the gaming experience by making games more emotionally responsive and immersive, as well as by providing personalized feedback to players. Game designers can make design decisions that enhance a player’s experience based on the player’s frustration level. If players are feeling sad, the game could offer them a story-based scenario that is more emotionally uplifting. By providing meaningful feedback, players are more likely to stay engaged with the game, improving their overall experience.
E. Emotion and Information Systems
Emotion understanding can play a pivotal role in advancing information systems. Currently, when searching for visual content in online collections, we primarily rely on keyword-based metadata. Whereas recent developments in deep learning have enabled information systems to search using machine-generated annotations, these annotations are typically limited to identifying objects and relationships (e.g., a boy in a yellow shirt playing soccer).
IBM scientists demonstrated that computers with the ability to understand emotions could aid in sorting through a large amount of visual content and composing emotion-stimulating visual summaries [488]. In 2016, they created the first computer system for generating movie trailers. The trailer it produced for the 20th Century Fox sci-fi thriller “Morgan” was released as the official trailer. The system identified the top ten moments for inclusion in the trailer (see Fig. 19). This work represents a significant milestone in understanding evoked emotions.
IBM’s program is likely to be the starting point for a surge of emotion-based information systems. We can expect to see new applications, such as evoked emotion assessment systems, emotion-based recommender systems, emotion-driven photo/video editing software, and emotion-based media summarization/synopsizing services.
F. Emotion and Industrial Safety
Emotion understanding can be useful in promoting safety in workplaces such as factories and warehouses. It can provide early warnings of potential safety risks, such as worker fatigue or stress, allowing managers to take proactive measures to address the situation. Such capabilities can also provide personalized support and resources to workers who are experiencing emotional distress, improving overall emotional well-being and contributing to a safer work environment. The National Safety Council estimates that fatigue costs employers over $130 billion annually in lost productivity, and over 70 million Americans have sleep disorders [489]. Existing research on fatigue detection typically involves specialized sensors or vision systems that monitor the face or eyes [490], [491], [492], [493], [494], [495], [496]. However, sensor-based approaches have limitations related to wearability, size, cost, and reliability. Thus, there is a need to develop recognition approaches that use body movement to complement such systems [490], [497].
G. Emotion and Education
Emotion recognition technology can help create a more engaging and effective learning experience for online education. Many universities have been offering online courses for years, but the COVID-19 pandemic led to the widespread adoption of online teaching using video conferencing platforms in 2020 and 2021. Even as in-person instruction has resumed, many educational institutions continue to conduct some of their teaching activities online. For example, instructors may be allowed to teach a portion of their classes online for pedagogical or emergency reasons or to hold office hours online. In a traditional classroom setting, instructors can gauge students’ attentiveness and emotional states by observing their facial and bodily expressions. Such feedback can help instructors better understand and respond to students’ needs, e.g., by adjusting the pace of instruction or covering alternative materials. However, in an online teaching environment, instructors often can only see the faces of a small number of students, which does not provide real-time feedback on the instruction. Potentially, if an online teaching platform could dynamically monitor the students and provide aggregated feedback (e.g., the percentage of students with high attentiveness and the overall mood of the class) to the instructor, the quality of online learning could be improved. To protect students’ privacy, the monitoring process should only produce overall statistics.
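A privacy-preserving aggregation step of this kind could be as simple as the following sketch, in which per-student estimates (produced by a hypothetical upstream attentiveness/mood model, and the 0.5 threshold is likewise an assumed value) are reduced to class-level statistics only:

```python
from collections import Counter

def aggregate_class_feedback(per_student):
    """Reduce per-student estimates to class-level statistics so that no
    individual student's state is reported. Each item in per_student is a
    (attentiveness_score, mood_label) pair from an upstream model."""
    if not per_student:
        return {"pct_attentive": 0.0, "overall_mood": None}
    attentive = sum(1 for score, _ in per_student if score >= 0.5)
    moods = Counter(mood for _, mood in per_student)
    return {
        "pct_attentive": 100.0 * attentive / len(per_student),
        "overall_mood": moods.most_common(1)[0][0],
    }

# Example with three (hypothetical) students
print(aggregate_class_feedback([(0.8, "engaged"), (0.3, "bored"), (0.9, "engaged")]))
```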
VII. EMOTION AND SOCIETY: ETHICS
New technologies often bring new ethical concerns. In the field of emotion understanding, we have begun to witness the potential misuse of these technologies. In this section, we will discuss some of the general ethical issues surrounding the development and deployment of these technologies.
1). Generalizability:
Because the emotion space is complex, it is important to recognize that there will always be outliers or unusual situations (e.g., an otherworldly scene or an eccentric behavior) that may not be captured by our models. Without proper consideration of demographic differences and individual variations, these technologies may only provide a broad overview of the general population. To be truly beneficial, the system must be carefully tailored to the specific needs of an individual. Likewise, diversity of representation in sample datasets is critical to ensure that algorithms emerging from them are inclusive.
2). Verification of Accuracy or Performance:
It is important for researchers to keep in mind that there is almost always a lack of ground truth in emotion understanding. We have discussed the impact on data collection, modeling, and evaluation/benchmarking earlier (see Sections III-B, V-B, and V-G). For AI models, it is imperative that the output, design, and training processes are transparent and auditable. Black-box models can become uncontrollable if not properly monitored.
3). Privacy—Data Collection:
The collection of human behavior data, including facial, body, and vocal information, raises privacy concerns. Research involving sensitive populations, such as patients in psychological clinics, must be conducted with utmost care. Furthermore, almost all emotion-related annotations must be collected from humans. To protect human subjects, research protocols must be carefully designed to collect only necessary information, deidentify before distribution, and protect the data with proper access control and encryption. All protocols must be reviewed by an Institutional Review Board.
4). Privacy—Use of Technology:
In today’s automated world, people are losing their privacy to whoever controls data: governments and companies are collecting data about where we are at any given moment; our financial transactions are followed and verified; companies are collecting data about our purchases, preferences, and social networks; most public places are constantly videotaped; and so on. People must sacrifice privacy to live a normal life because everything is computerized. We are being followed, and “Big Brother” knows all about us. The only things we can still keep to ourselves are our thoughts and emotions.
Once AI reaches high-accuracy automatic emotion recognition, our emotions will no longer be private, and videos of our movements could be used against us by authorities or by whoever possesses such footage. This situation could become very frightening. Moreover, if we want to hide our emotions, we will have to move in a way that will not reveal them, like using a “poker face” to hide facial expressions. However, because specific movements not only express associated emotions but also enhance those emotions [115], moving in ways that flatten emotional expressions can also flatten the felt emotions, and living in such a way can lead to depression or other mental health problems.
5). Synthesized Affective Behavior:
Alongside its potential uses in entertainment, success in emotion modeling could inevitably lead to even more lifelike deepfakes and similar abuses. As a society, instead of being fearful of the negative impact of new, beneficial technologies, we need to take on the challenge of detecting fakes, much as we learn to recognize scammers, and mitigating the harm.
6). Lower the Risks:
To mitigate the risks of misuse, proactive measures must be taken. It is essential that laws and regulations are established to keep pace with the rapid development of AEI technologies. As researchers, we have a responsibility to involve affected communities, particularly those that are traditionally marginalized, such as minority groups, elderly individuals, and mental health patients, in the design, development, and deployment processes to ensure that these individuals’ perspectives and needs are recognized and valued.
7). Performance Criteria:
To promote a responsible and ethical expansion of the field, it is crucial to establish reliable mechanisms for comparing the predictions of different algorithms and learning procedures. As discussed in Section V-G, a thorough evaluation of algorithms should consider not only accuracy and speed but also factors such as interpretability, demographic representation, context coverage, emotion space coverage, and personalization capabilities.
VIII. CONCLUSION
We provided an overview of the stimulating and exponentially growing field of visual emotion understanding. Adopting a multidisciplinary approach, we discussed the foundational principles guiding technological progress, reviewed recent innovations and system development, identified open challenges, and highlighted potential intersections with other fields. Our objective was to provide a comprehensive introduction to this vast field sufficient to intrigue researchers across related IEEE subcommunities and to inspire continued research and development toward realizing the field’s immense potential.
Given the multidisciplinary nature of this field, which encompasses multiple technical fields, psychology, and art, the barrier to entry can be considerable. Our aim is to provide researchers and developers with the essential knowledge required to tackle the numerous attractive open problems in the field. Interested readers are encouraged to delve deeper into the cited references for a more profound understanding of the topics discussed.
As active researchers in this domain, we strongly recommend that those interested in pursuing this research topic collaborate with others possessing complementary expertise. Although we anticipate the development and sharing of more large-scale datasets and continuous incremental progress, transformative solutions will not arise solely from the straightforward application of data-driven approaches. Instead, sustained collaboration among researchers in computational fields, the social and behavioral sciences, and machine learning and statistical modeling will likely lead to lasting contributions to this intricate research field.
Acknowledgment
The authors would like to acknowledge the valuable contributions of several colleagues and advisees to this work. In particular, they would like to thank Jia Li for contributing valuable ideas and some writing related to explainable machine learning and benchmarking. They also extend appreciation to Hanjoo Kim, Amy LaViers, Xin Lu, Yu Luo, Yimu Pan, Nora Weibin Wang, Benjamin Wortman, Jianbo Ye, Sitao Zhang, and Lizhen Zhu for their research collaboration or discussions. They also thank Björn W. Schuller and Matti Pietikäinen for organizing this special issue on a timely and impactful topic. In addition, they would like to express their gratitude to the anonymous reviewers for their valuable insights and constructive feedback, which contributed to the improvement of the manuscript. J. Z. Wang and C. Wu utilized the Extreme Science and Engineering Discovery Environment, which was supported by NSF under Grant ACI-1548562, and the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) Program supported by NSF under Grant OAC-2138259, Grant OAC-2138286, Grant OAC-2138307, Grant OAC-2137603, and Grant OAC-2138296. James Z. Wang is grateful for the support and encouragement received from Adam Fineberg, Donald Geman, Robert M. Gray, Dennis A. Hejhal, Yelin Kim, Tatiana D. Korelsky, Edward H. Shortliffe, Juan P. Wachs, Gio Wiederhold, and Jie Yang throughout the years.
The work of James Z. Wang and Chenyan Wu was supported in part by generous gifts from the Amazon Research Awards Program. The work of James Z. Wang, Reginald B. Adams Jr., and Michelle G. Newman was supported in part by the National Science Foundation (NSF) under Grant IIS-1110970 and Grant CNS-1921783. The work of James Z. Wang, Reginald B. Adams Jr., Michelle G. Newman, Tal Shafir, and Rachelle Tsachor was supported in part by the NSF under Grant CNS-2234195 and Grant CNS-2234197. The work of James Z. Wang and Michelle G. Newman was supported in part by the NSF under Grant CIF-2205004.
Biographies
James Z. Wang (Senior Member, IEEE) received the bachelor’s degree (summa cum laude) in mathematics from the University of Minnesota, Minneapolis, MN, USA, in 1994, and the M.S. degree in mathematics, the M.S. degree in computer science, and the Ph.D. degree in medical information sciences from Stanford University, Stanford, CA, USA, in 1997, 1997, and 2000, respectively.
He was a Visiting Professor with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, from 2007 to 2008. He is currently a Distinguished Professor of the data science and artificial intelligence area and the human–computer interaction area with the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA. His research interests include affective computing, image analysis, image modeling, image retrieval, and their applications.
Dr. Wang was a Lead Special Section Guest Editor of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE in 2008 and a Program Manager with the Office of the Director of the National Science Foundation from 2011 to 2012.
Sicheng Zhao (Senior Member, IEEE) received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2016.
He was a Visiting Scholar with the National University of Singapore, Singapore, from 2013 to 2014, a Research Fellow with Tsinghua University, Beijing, China, from 2016 to 2017, a Postdoctoral Research Fellow with the University of California at Berkeley, Berkeley, CA, USA, from 2017 to 2020, and a Postdoctoral Research Scientist with Columbia University, New York, NY, USA, from 2020 to 2022. He is currently a Research Associate Professor with Tsinghua University. His research interests include affective computing, multimedia, and computer vision.
Chenyan Wu received the B.E. degree in electronic information engineering from the School of the Gifted Young, University of Science and Technology of China, Hefei, China, in 2018. He is currently working toward the Ph.D. degree in the Informatics Program of the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA.
He worked as an Intern at Amazon Lab126, Bellevue, WA, USA, Microsoft Research Asia, Beijing, China, and SenseTime Research, Shenzhen, China. His research interests are affective computing, computer vision, and machine learning.
Reginald B. Adams Jr. received the Ph.D. degree in social psychology from Dartmouth College, Hanover, NH, USA, in 2002.
He is currently a Professor of psychology with The Pennsylvania State University, University Park, PA, USA. He is interested in how we extract social and emotional meaning from nonverbal cues, particularly via the face. His work addresses how multiple social messages (e.g., emotion, gender, race, and age) combine across multiple modalities and interact to form unified representations that guide our impressions of and responses to others. Although his questions are social psychological in origin, his research draws upon visual cognition and affective neuroscience to address social perception at the functional and neuroanatomical levels. His continuing research efforts have been funded through the NSF and the National Institute on Aging and the National Institute of Mental Health of the National Institutes of Health.
Dr. Adams Jr., before joining Penn State, was awarded the National Research Service Award (NRSA) from the NIMH to train as a Postdoctoral Fellow at Harvard University and Tufts University.
Michelle G. Newman received the Ph.D. degree in clinical psychology from Stony Brook University, Stony Brook, NY, USA, in 1992.
She completed a postdoctoral fellowship at Stanford University, Stanford, CA, USA, in 1994. She is currently a Professor of psychology and psychiatry, and the Director of the Center for the Treatment of Anxiety and Depression, The Pennsylvania State University, University Park, PA, USA. She has conducted basic and applied research on anxiety disorders and depression, and has published over 200 papers on these topics.
Dr. Newman is also a Fellow of the American Psychological Association Divisions 29 and 12, the Association for Behavioral and Cognitive Therapies, and the American Psychological Society. She was a recipient of the APA Division 12 Turner Award for distinguished contribution to clinical research, the APA Division 29 Award for Distinguished Publication of Psychotherapy Research, the ABCT Outstanding Service Award, the APA Division 12 Toy Caldwell-Colbert Award for Distinguished Educator in Clinical Psychology, and the Raymond Lombra Award for Distinction in the Social or Life Sciences. She is a Past Editor of Behavior Therapy. She is an Associate Editor of Journal of Anxiety Disorders.
Tal Shafir graduated from the Law School, The Hebrew University of Jerusalem, Jerusalem, Israel. She received the Ph.D. degree in neurophysiology of motor control from the University of Michigan, Ann Arbor, MI, USA, in 2003.
She then studied dance-movement therapy at the University of Haifa, Haifa, Israel, and completed two postdoctoral fellowships in brain–behavior interactions in infants, and in affective neuroscience at the University of Michigan. She developed research on movement–emotion interaction and its underlying brain mechanisms, behavioral expressions, and therapeutic applications.
Dr. Shafir was a recipient of the ADTA 2020 Innovation Award. She is also certified in Laban movement analysis. She was the Main Editor of The Academic Journal of Creative Arts Therapies and of the Frontiers in Psychology research topic on the state of the art in creative arts therapies. She has been serving on The American Dance Therapy Association (ADTA) Research Committee since 2016.
Rachelle Tsachor is currently an Associate Professor of movement with the University of Illinois at Chicago, Chicago, IL, USA. She is certified in mind-body medicine (CMBM), Laban Movement Analysis (CMA), and somatic movement therapy (RSMT/ISMETA). Her research investigates body movement to bring a human, experiential understanding to how movement affects our lives. She analyzes patterns in moving bodies in diverse projects, researching movement’s effects on our brains, emotions, health, and learning. She is a Co-PI on the NSF-funded project STAGE and on the UI Presidential Initiative for the Young People’s Science Theater: CPS and UIC Students Creating Performances for Social Change. Both initiatives bring mind/body methods into Chicago Public Schools that educate primarily students of color to support learning in embodied ways.
Contributor Information
JAMES Z. WANG, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802 USA.
SICHENG ZHAO, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, China.
CHENYAN WU, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802 USA.
REGINALD B. ADAMS, JR., Department of Psychology, The Pennsylvania State University, University Park, PA 16802 USA.
MICHELLE G. NEWMAN, Department of Psychology, The Pennsylvania State University, University Park, PA 16802 USA.
TAL SHAFIR, Emily Sagol Creative Arts Therapies Research Center, University of Haifa, Haifa 3498838, Israel.
RACHELLE TSACHOR, School of Theatre and Music, University of Illinois at Chicago, Chicago, IL 60607 USA.
REFERENCES
- [1] Laricchia F., Smart Speakers—Statistics & Facts. Accessed: Jul. 6, 2022. [Online]. Available: https://www.statista.com/topics/4748/smart-speakers/#dossierKeyfigures
- [2] Krakovsky M., “Artificial (emotional) intelligence,” Commun. ACM, vol. 61, no. 4, pp. 18–19, 2018.
- [3] Hassan T. et al., “Automatic detection of pain from facial expressions: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 6, pp. 1815–1831, Jun. 2021.
- [4] Li S. and Deng W., “Deep facial expression recognition: A survey,” IEEE Trans. Affect. Comput., vol. 13, no. 3, pp. 1195–1215, Jul. 2022.
- [5] Jampour M. and Javidi M., “Multiview facial expression recognition, a survey,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 2086–2105, Oct. 2022.
- [6] Liu Y., Zhang X., Li Y., Zhou J., Li X., and Zhao G., “Graph-based facial affect analysis: A review,” IEEE Trans. Affect. Comput., early access, Oct. 19, 2022, doi: 10.1109/TAFFC.2022.3215918.
- [7] Ben X. et al., “Video-based facial micro-expression analysis: A survey of datasets, features and algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5826–5846, Sep. 2022.
- [8] Li X. et al., “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Trans. Affect. Comput., vol. 9, no. 4, pp. 563–577, Oct. 2018.
- [9] Li Y., Wei J., Liu Y., Kauttonen J., and Zhao G., “Deep learning for micro-expression recognition: A survey,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 2028–2046, Oct. 2022.
- [10] Brauwers G. and Frasincar F., “A survey on aspect-based sentiment classification,” ACM Comput. Surveys, vol. 55, no. 4, pp. 1–37, Apr. 2023.
- [11] Nazir A., Rao Y., Wu L., and Sun L., “Issues and challenges of aspect-based sentiment analysis: A comprehensive survey,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 845–863, Apr. 2022.
- [12] Deng J. and Ren F., “A survey of textual emotion recognition and its challenges,” IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 49–67, Jan. 2023.
- [13] Akcay MB and Oğuz K., “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Commun., vol. 116, pp. 56–76, Jan. 2020.
- [14] Panda R., Malheiro RM, and Paiva RP, “Audio features for music emotion recognition: A survey,” IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 68–88, Jan. 2023.
- [15] Latif S., Rana R., Khalifa S., Jurdak R., Qadir J., and Schuller BW, “Survey of deep representation learning for speech emotion recognition,” IEEE Trans. Affect. Comput., early access, Sep. 21, 2021, doi: 10.1109/TAFFC.2021.3114365.
- [16] Zhao S. et al., “Affective image content analysis: Two decades review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6729–6751, Oct. 2022.
- [17] Noroozi F., Corneanu CA, Kaminska D., Sapinski T., Escalera S., and Anbarjafari G., “Survey on emotional body gesture recognition,” IEEE Trans. Affect. Comput., vol. 12, no. 2, pp. 505–523, Apr. 2021.
- [18] Mahfoudi M-A, Meyer A., Gaudin T., Buendia A., and Bouakaz S., “Emotion expression in human body posture and movement: A survey on intelligible motion factors, quantification and validation,” IEEE Trans. Affect. Comput., early access, Dec. 2, 2022, doi: 10.1109/TAFFC.2022.3226252.
- [19] Li X. et al., “EEG based emotion recognition: A tutorial and review,” ACM Comput. Surveys, vol. 55, no. 4, pp. 1–57, Apr. 2023.
- [20] Saganowski S., Perz B., Polak A., and Kazienko P., “Emotion recognition for everyday life using physiological signals from wearables: A systematic literature review,” IEEE Trans. Affect. Comput., early access, May 20, 2022, doi: 10.1109/TAFFC.2022.3176135.
- [21] Zhang J., Yin Z., Chen P., and Nichele S., “Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review,” Inf. Fusion, vol. 59, pp. 103–126, Jul. 2020.
- [22] Zhao S., Jia G., Yang J., Ding G., and Keutzer K., “Emotion recognition from multiple modalities: Fundamentals and methodologies,” IEEE Signal Process. Mag., vol. 38, no. 6, pp. 59–73, Nov. 2021.
- [23] Smith G. and Carette J., “What lies beneath—A survey of affective theory use in computational models of emotion,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 1793–1812, Jun. 2022.
- [24] Cambria E., “Affective computing and sentiment analysis,” IEEE Intell. Syst., vol. 31, no. 2, pp. 102–107, Mar. 2016.
- [25] Poria S., Cambria E., Bajpai R., and Hussain A., “A review of affective computing: From unimodal analysis to multimodal fusion,” Inf. Fusion, vol. 37, pp. 98–125, Sep. 2017.
- [26] Wang Y. et al., “A systematic review on affective computing: Emotion models, databases, and recent advances,” Inf. Fusion, vols. 83–84, pp. 19–52, Jul. 2022.
- [27] Mehrabian A., Basic Dimensions for a General Psychological Theory: Implications for Personality, Social, Environmental, and Developmental Studies. Cambridge, MA, USA: MIT Press, 1980.
- [28] Watson D. and Tellegen A., “Toward a consensual structure of mood,” Psychol. Bull., vol. 98, no. 2, pp. 219–235, 1985.
- [29] Thayer R., The Biopsychology of Mood and Arousal. New York, NY, USA: Oxford Univ. Press, 1989.
- [30] Larsen RJ and Diener E., “Promises and problems with the circumplex model of emotion,” in Emotion. Newcastle upon Tyne, U.K.: Sage, 1992.
- [31] Knutson B., “Facial expressions of emotion influence interpersonal trait inferences,” J. Nonverbal Behav., vol. 20, no. 3, pp. 165–182, Sep. 1996.
- [32] Darwin C. and Prodger P., The Expression of the Emotions in Man and Animals. Oxford, U.K.: Oxford Univ. Press, 1998.
- [33] James W., What Is an Emotion? New York, NY, USA: Simon and Schuster, 2013.
- [34] Niedenthal PM, “Embodying emotion,” Science, vol. 316, no. 5827, pp. 1002–1005, May 2007.
- [35] Ekman P., “Universals and cultural differences in facial expressions of emotion,” in Proc. Nebraska Symp. Motivat., Lincoln, NE, USA: University of Nebraska Press, 1971, pp. 1–76.
- [36] Barrett LF, “Solving the emotion paradox: Categorization and the experience of emotion,” Personality Social Psychol. Rev., vol. 10, no. 1, pp. 20–46, Feb. 2006.
- [37] Posner J., Russell JA, and Peterson BS, “The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology,” Develop. Psychopathology, vol. 17, no. 3, pp. 715–734, Sep. 2005.
- [38] Russell JA and Carroll JM, “On the bipolarity of positive and negative affect,” Psychol. Bull., vol. 125, no. 1, pp. 3–30, 1999.
- [39] Arnold MB, Emotion and Personality, Volume I: Psychological Aspects. New York, NY, USA: Columbia Univ. Press, 1960.
- [40] Schachter S. and Singer J., “Cognitive, social, and physiological determinants of emotional state,” Psychol. Rev., vol. 69, no. 5, p. 379, 1962.
- [41] Mehrabian A. and Russell JA, An Approach to Environmental Psychology. Cambridge, MA, USA: MIT Press, 1974.
- [42] Russell JA, “Core affect and the psychological construction of emotion,” Psychol. Rev., vol. 110, no. 1, p. 145, 2003.
- [43] Mikels JA, Fredrickson BL, Larkin GR, Lindberg CM, Maglio SJ, and Reuter-Lorenz PA, “Emotional category data on images from the international affective picture system,” Behav. Res. Methods, vol. 37, no. 4, pp. 626–630, Nov. 2005.
- [44] Machajdik J. and Hanbury A., “Affective image classification using features inspired by psychology and art theory,” in Proc. ACM Int. Conf. Multimedia, 2010, pp. 83–92.
- [45] Lang PJ et al., “International affective picture system (IAPS): Affective ratings of pictures and instruction manual,” Center Study Emotion Attention, NIMH, Gainesville, FL, USA, Tech. Rep., 2005.
- [46] Lu X., Suryanarayan P., Adams RB, Li J., Newman MG, and Wang JZ, “On shape and the computability of emotions,” in Proc. ACM Int. Conf. Multimedia, 2012, pp. 229–238.
- [47] Lu X., Sawant N., Newman MG, Adams RB Jr., Wang JZ, and Li J., “Identifying emotions aroused from paintings,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 48–63.
- [48] Lu X., Adams RB, Li J., Newman MG, and Wang JZ, “An investigation into three visual characteristics of complex scenes that evoke human emotion,” in Proc. 7th Int. Conf. Affect. Comput. Intell. Interact. (ACII), Oct. 2017, pp. 440–447.
- [49] Ye J., Li J., Newman MG, Adams RB, and Wang JZ, “Probabilistic multigraph modeling for improving the quality of crowdsourced affective data,” IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 115–128, Jan. 2019.
- [50] Kim H. et al., “Development and validation of image stimuli for emotion elicitation (ISEE): A novel affective pictorial system with test-retest repeatability,” Psychiatry Res., vol. 261, pp. 414–420, Mar. 2018.
- [51] Luo Y., Ye J., Adams RB, Li J., Newman MG, and Wang JZ, “ARBEE: Towards automated recognition of bodily expression of emotion in the wild,” Int. J. Comput. Vis., vol. 128, no. 1, pp. 1–25, Jan. 2020.
- [52] Datta R., Joshi D., Li J., and Wang JZ, “Studying aesthetics in photographic images using a computational approach,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 288–301.
- [53] Fridlund AJ, Human Facial Expression: An Evolutionary View. New York, NY, USA: Academic, 2014.
- [54] Yik MSM, “Interpretation of faces: A cross-cultural study of a prediction from Fridlund’s theory,” Cognition Emotion, vol. 13, no. 1, pp. 93–104, Jan. 1999.
- [55] Frijda NH and Tcherkassof A., “Facial expressions as modes of action readiness,” in The Psychology of Facial Expression, Russell JA and Fernández-Dols JM, Eds. Paris, France: Editions de la Maison des Sciences de l’Homme, 1997.
- [56] Davidson RJ, Emotion and Affective Style: Hemispheric Substrates. Los Angeles, CA, USA: SAGE, 1992.
- [57] Miller NE, “Analysis of the form of conflict reactions,” Psychol. Bull., vol. 34, no. 1, pp. 720–731, 1937.
- [58] Harmon-Jones E., “Clarifying the emotive functions of asymmetrical frontal cortical activity,” Psychophysiology, vol. 40, no. 6, pp. 838–848, Nov. 2003.
- [59] Centerbar DB, Schnall S., Clore GL, and Garvin ED, “Affective incoherence: When affective concepts and embodied reactions clash,” J. Personality Social Psychol., vol. 94, no. 4, pp. 560–578, 2008.
- [60] Adams RB, Ambady N., Macrae CN, and Kleck RE, “Emotional expressions forecast approach-avoidance behavior,” Motivat. Emotion, vol. 30, no. 2, pp. 177–186, Jun. 2006.
- [61] Nelson AJ, Adams RB, Stevenson MT, Weisbuch M., and Norton MI, “Approach-avoidance movement influences the decoding of anger and fear expressions,” Social Cognition, vol. 31, no. 6, pp. 745–757, Dec. 2013.
- [62] Averill JR, “A semantic atlas of emotional concepts,” JSAS, Catalog Sel. Documents Psychol., vol. 5, no. 330, pp. 1–64, 1975.
- [63] Kosti R., Alvarez JM, Recasens A., and Lapedriza A., “Emotion recognition in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1667–1675.
- [64] Wakabayashi A. et al., “Development of short forms of the empathy quotient (EQ-Short) and the systemizing quotient (SQ-Short),” Personality Individual Differences, vol. 41, no. 5, pp. 929–940, Oct. 2006.
- [65] Kim M. and Leskovec J., “Latent multi-group membership graph model,” 2012, arXiv:1205.4546.
- [66] Stappen L. et al., “MuSe-toolbox: The multimodal sentiment analysis continuous annotation fusion and discrete class transformation toolbox,” in Proc. 2nd Multimodal Sentiment Anal. Challenge, Oct. 2021, pp. 75–82.
- [67] Grimm M. and Kroschel K., “Evaluation of natural emotions using self assessment manikins,” in Proc. IEEE Workshop Autom. Speech Recognit. Understand., Dec. 2005, pp. 381–385.
- [68] Zhou F. and Torre F., “Canonical time warping for alignment of human behavior,” in Proc. Adv. Neural Inf. Process. Syst., vol. 22, 2009, pp. 1–12.
- [69] Wang S. and Ji Q., “Video affective content analysis: A survey of state-of-the-art methods,” IEEE Trans. Affect. Comput., vol. 6, no. 4, pp. 410–430, Oct. 2015.
- [70] Gandhi A., Adhvaryu K., Poria S., Cambria E., and Hussain A., “Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions,” Inf. Fusion, vol. 91, pp. 424–444, Mar. 2023.
- [71].You Q, Luo J, Jin H, and Yang J, “Building a large scale dataset for image emotion recognition: The fine print and the benchmark,” Proc. AAAI Conf. Artif. Intell, Feb. 2016, vol. 30, no. 1, pp. 308–314. [Google Scholar]
- [72].Jiang Y-G, Xu B, and Xue X, “Predicting emotions in user-generated videos,” in Proc. AAAI Conf. Artificial Intell., 2014, pp. 73–79. [Google Scholar]
- [73].Xu B, Fu Y, Jiang Y-G, Li B, and Sigal L, “Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization,” IEEE Trans. Affect. Comput, vol. 9, no. 2, pp. 255–270, Apr. 2018. [Google Scholar]
- [74].Randhavane T, Bhattacharya U, Kapsaskis K, Gray K, Bera A, and Manocha D, “Identifying emotions from walking using affective and deep features,” 2019, arXiv:1906.11884. [Google Scholar]
- [75].Liu X, Shi H, Chen H, Yu Z, Li X, and Zhao G, “IMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10631–10642. [Google Scholar]
- [76].Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, and Matthews I, “The extended Cohn–Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 94–101. [Google Scholar]
- [77].Kollias D. et al. , “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” Int. J. Comput. Vis, vol. 127, nos. 6–7, pp. 907–929, Jun. 2019. [Google Scholar]
- [78].Mollahosseini A, Hasani B, and Mahoor MH, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Trans. Affect. Comput, vol. 10, no. 1, pp. 18–31, Jan. 2019. [Google Scholar]
- [79].Kosti R, Alvarez JM, Recasens A, and Lapedriza A, “EMOTIC: Emotions in context dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW; ), Jul. 2017, pp. 61–69. [Google Scholar]
- [80].Dhall A, Goecke R, Lucey S, and Gedeon T, “Collecting large, richly annotated facial-expression databases from movies,” IEEE Multimedia Mag, vol. 19, no. 3, pp. 34–41, Jul. 2012. [Google Scholar]
- [81].Lee J, Kim S, Kim S, Park J, and Sohn K, “Context-aware emotion recognition networks,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10143–10152. [Google Scholar]
- [82].Jiang X. et al. , “DFEW: A large-scale database for recognizing dynamic facial expressions in the wild,” in Proc. ACM Int. Conf. Multimedia, 2020, pp. 2881–2889. [Google Scholar]
- [83].Wang Y. et al. , “FERV39k: A large-scale multi-scene dataset for facial expression recognition in videos,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2022, pp. 20922–20931. [Google Scholar]
- [84].Davison AK, Lansley C, Costen N, Tan K, and Yap MH, “SAMM: A spontaneous micro-facial movement dataset,” IEEE Trans. Affect. Comput, vol. 9, no. 1, pp. 116–129, Jan. 2018. [Google Scholar]
- [85].Qu F, Wang S-J, Yan W-J, Li H, Wu S, and Fu X, “CAS(ME): A database for spontaneous macro-expression and micro-expression spotting and recognition,” IEEE Trans. Affect. Comput, vol. 9, no. 4, pp. 424–436, Oct. 2018. [Google Scholar]
- [86].Wollmer M. et al. , “YouTube movie reviews: Sentiment analysis in an audio-visual context,” IEEE Intell. Syst, vol. 28, no. 3, pp. 46–53, May 2013. [Google Scholar]
- [87].Bagher Zadeh A, Liang PP, Poria S, Cambria E, and Morency L-P, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 2236–2246. [Google Scholar]
- [88].Ekman P. and Friesen WV, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Stanford, CA, USA: Stanford University, 1977. [Google Scholar]
- [89].Tian Y-I, Kanade T, and Cohn JF, “Recognizing action units for facial expression analysis,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 23, no. 2, pp. 97–115, Feb. 2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [90].Valstar M. and Pantic M, “Fully automatic facial action unit detection and temporal analysis,” in Proc. Conf. Comput. Vis. Pattern Recognit. Workshop (CVPRW; ), 2006, p. 149. [Google Scholar]
- [91].Miriam Jacob G. and Stenger B, “Facial action unit detection with transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2021, pp. 7680–7689. [Google Scholar]
- [92].Martinez B, Valstar MF, Jiang B, and Pantic M, “Automatic analysis of facial actions: A survey,” IEEE Trans. Affect. Comput, vol. 10, no. 3, pp. 325–347, Jul. 2019. [Google Scholar]
- [93].Zhi R, Liu M, and Zhang D, “A comprehensive survey on automatic facial action unit analysis,” Vis. Comput, vol. 36, no. 5, pp. 1067–1093, May 2020. [Google Scholar]
- [94].Wang N, Gao X, Tao D, Yang H, and Li X, “Facial feature point detection: A comprehensive survey,” Neurocomputing, vol. 275, pp. 50–65, Jan. 2018. [Google Scholar]
- [95].Wu Y. and Ji Q, “Facial landmark detection: A literature survey,” Int. J. Comput. Vis, vol. 127, no. 2, pp. 115–142, Feb. 2019. [Google Scholar]
- [96].Lin T-Y et al. , “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755. [Google Scholar]
- [97].Sun K, Xiao B, Liu D, and Wang J, “Deep high-resolution representation learning for human pose estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2019, pp. 5693–5703. [Google Scholar]
- [98].Xiao B, Wu H, and Wei Y, “Simple baselines for human pose estimation and tracking,” in Proc. Eur. Conf. Comput. Vis, 2018, pp. 466–481. [Google Scholar]
- [99].Sun X, Xiao B, Wei F, Liang S, and Wei Y, “Integral human pose regression,” in Proc. Eur. Conf. Comput. Vis, 2018, pp. 529–545. [Google Scholar]
- [100].Wu C. et al. , “MEBOW: Monocular estimation of body orientation in the wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2020, pp. 3451–3461. [Google Scholar]
- [101].Moon G, Chang JY, and Lee KM, “Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV; ), Oct. 2019, pp. 10133–10142. [Google Scholar]
- [102].Wu C, Li Y, Tang X, and Wang J, “MUG: Multi-human graph network for 3D mesh reconstruction from 2D pose,” 2022, arXiv:2205.12583. [Google Scholar]
- [103].Loper M, Mahmood N, Romero J, Pons-Moll G, and Black MJ, “SMPL: A skinned multi-person linear model,” ACM Trans. Graph, vol. 34, no. 6, p. 248, Nov. 2015. [Google Scholar]
- [104].Bogo F, Kanazawa A, Lassner C, Gehler P, Romero J, and Black MJ, “Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 561–578. [Google Scholar]
- [105].Ahmed F, Bari ASMH, and Gavrilova ML, “Emotion recognition from body movement,” IEEE Access, vol. 8, pp. 11761–11781, 2020. [Google Scholar]
- [106].Tsachor RP and Shafir T, “How shall i count the ways? A method for quantifying the qualitative aspects of unscripted movement with Laban movement analysis,” Frontiers Psychol, vol. 10, p. 572, Mar. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [107].Witkower Z. and Tracy JL, “Bodily communication of emotion: Evidence for extrafacial behavioral expressions and available coding systems,” Emotion Rev, vol. 11, no. 2, pp. 184–193, Apr. 2019. [Google Scholar]
- [108].Kleinsmith A. and Bianchi-Berthouze N, “Affective body expression perception and recognition: A survey,” IEEE Trans. Affect. Comput, vol. 4, no. 1, pp. 15–33, Jan. 2013. [Google Scholar]
- [109].Ebdali Takalloo L, Li KF, and Takano K, “An overview of emotion recognition from body movement,” in Proc. Comput. Intell. Secur. Inf. Syst. Conf., 2022, pp. 105–117. [Google Scholar]
- [110].Kleinsmith A, Bianchi-Berthouze N, and Steed A, “Automatic recognition of non-acted affective postures,” IEEE Trans. Syst., Man, Cybern., B, Cybernetics, vol. 41, no. 4, pp. 1027–1038, Aug. 2011. [DOI] [PubMed] [Google Scholar]
- [111].Roether CL, Omlor L, Christensen A, and Giese MA, “Critical features for the perception of emotion from gait,” J. Vis, vol. 9, no. 6, p. 15, Jun. 2009. [DOI] [PubMed] [Google Scholar]
- [112].Bartenieff I. and Lewis D, Body Movement: Coping With the Environment. Philadelphia, PA, USA: Gordon & Breach Science, 1980. [Google Scholar]
- [113].Studd K. and Cox L, Everybody Is a Body. Indianapolis, IN, USA: Dog Ear, 2013. [Google Scholar]
- [114].Melzer A, Shafir T, and Tsachor RP, “How do we recognize emotion from movement? Specific motor components contribute to the recognition of each emotion,” Frontiers Psychol, vol. 10, p. 1389, Jul. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [115].Shafir T, Tsachor RP, and Welch KB, “Emotion regulation through movement: Unique sets of movement characteristics are associated with and enhance basic emotions,” Frontiers Psychol, vol. 6, p. 2030, Jan. 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [116].Saha S, Datta S, Konar A, and Janarthanan R, “A study on emotion recognition from body gestures using Kinect sensor,” in Proc. Int. Conf. Commun. Signal Process., Apr. 2014, pp. 56–60. [Google Scholar]
- [117].Zacharatos H, Gatzoulis C, Charalambous P, and Chrysanthou Y, “Emotion recognition from 3D motion capture data using deep CNNs,” in Proc. IEEE Conf. Games (CoG), Aug. 2021, pp. 1–5. [Google Scholar]
- [118].Shi J, Liu C, Ishi CT, and Ishiguro H, “Skeleton-based emotion recognition based on two-stream self-attention enhanced spatial–temporal graph convolutional network,” Sensors, vol. 21, no. 1, p. 205, Dec. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [119].Ghaleb E, Mertens A, Asteriadis S, and Weiss G, “Skeleton-based explainable bodily expressed emotion recognition through graph convolutional networks,” in Proc. 16th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), Dec. 2021, pp. 1–8. [Google Scholar]
- [120].Ajili I, Mallem M, and Didier J-Y, “Human motions and emotions recognition inspired by LMA qualities,” Vis. Comput, vol. 35, no. 10, pp. 1411–1426, Oct. 2019. [Google Scholar]
- [121].Aristidou A, Charalambous P, and Chrysanthou Y, “Emotion analysis and classification: Understanding the performers’ emotions using the LMA entities,” Comput. Graph. Forum, vol. 34, no. 6, pp. 262–276, Sep. 2015. [Google Scholar]
- [122].Dewan S, Agarwal S, and Singh N, “Laban movement analysis to classify emotions from motion,” in Proc. 10th Int. Conf. Mach. Vis. (ICMV; ), Apr. 2018, pp. 717–724. [Google Scholar]
- [123].Senecal S, Cuel L, Aristidou A, and Magnenat-Thalmann N, “Continuous body emotion recognition system during theater performances,” Comput. Animation Virtual Worlds, vol. 27, nos. 3–4, pp. 311–320, May 2016. [Google Scholar]
- [124].Wang S, Li J, Cao T, Wang H, Tu P, and Li Y, “Dance emotion recognition based on Laban motion analysis using convolutional neural network and long short-term memory,” IEEE Access, vol. 8, pp. 124928–124938, 2020. [Google Scholar]
- [125].Cui H, Maguire C, and LaViers A, “Laban-inspired task-constrained variable motion generation on expressive aerial robots,” Robotics, vol. 8, no. 2, p. 24, Mar. 2019. [Google Scholar]
- [126].Inthiam J, Hayashi E, Jitviriya W, and Mowshowitz A, “Development of an emotional expression platform based on LMA-shape and interactive evolution computation,” in Proc. 4th Int. Conf. Control, Autom. Robot. (ICCAR; ), Apr. 2018, pp. 11–16. [Google Scholar]
- [127].Gross MM, Crane EA, and Fredrickson BL, “Effort-shape and kinematic assessment of bodily expression of emotion during gait,” Human Movement Sci, vol. 31, no. 1, pp. 202–221, Feb. 2012. [DOI] [PubMed] [Google Scholar]
- [128].Adrian B, Actor Training the Laban Way: An Integrated Approach to Voice, Speech, and Movement. New York, NY, USA: Simon & Schuster, 2010. [Google Scholar]
- [129].Tsachor RP and Shafir T, “A somatic movement approach to fostering emotional resiliency through Laban movement analysis,” Frontiers Human Neurosci, vol. 11, p. 410, Sep. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [130].Fernandes C, The Moving Researcher: Laban/Bartenieff Movement Analysis in Performing Arts Education and Creative Arts Therapies. London, U.K.: Jessica Kingsley, 2014. [Google Scholar]
- [131].Bishko L, “Animation principles and Laban Movement Analysis: Movement frameworks for creating empathic character performances,” in Nonverbal Communication in Virtual Worlds: Understanding and Designing Expressive Characters, Tanenbaum TJ, El-Nasr MS, and Nixon M, Eds. Pittsburgh, PA, USA: ETC Press, 2014, ch. 11, pp. 177–203. [Google Scholar]
- [132].van Geest J, Samaritter R, and van Hooren S, “Move and be moved: The effect of moving specific movement elements on the experience of happiness,” Frontiers Psychol, vol. 11, p. 3974, Jan. 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [133].Wang JZ and Li J, “Learning-based linguistic indexing of pictures with 2-D MHMMs,” in Proc. 10th ACM Int. Conf. Multimedia, Dec. 2002, pp. 436–445. [Google Scholar]
- [134].Chen Y, Li J, and Wang JZ, Machine Learning and Statistical Modeling Approaches to Image Retrieval, vol. 14. Berlin, Germany: Springer, 2006. [Google Scholar]
- [135].Datta R, Joshi D, Li J, and Wang JZ, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surveys, vol. 40, no. 2, pp. 1–60, Apr. 2008. [Google Scholar]
- [136].Kosti R, Alvarez JM, Recasens A, and Lapedriza A, “Context based emotion recognition using EMOTIC dataset,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 42, no. 11, pp. 2755–2766, Nov. 2020. [DOI] [PubMed] [Google Scholar]
- [137].Mittal T, Guhan P, Bhattacharya U, Chandra R, Bera A, and Manocha D, “EmotiCon: Context-aware multimodal emotion recognition using Frege’s principle,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2020, pp. 14234–14243. [Google Scholar]
- [138].Carreira J. and Zisserman A, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jul. 2017, pp. 6299–6308. [Google Scholar]
- [139].Bänziger T. and Scherer KR, “Using actor portrayals to systematically study multimodal emotion expression: The gemep corpus,” in Proc. Int. Conf. Affect. Comput. Intell. Interact. Cham, Switzerland: Springer, 2007, pp. 476–487. [Google Scholar]
- [140].Mesquita B, Boiger M, and De Leersnyder J, “The cultural construction of emotions,” Current Opinion Psychol, vol. 8, pp. 31–36, Apr. 2016. [DOI] [PubMed] [Google Scholar]
- [141].Atkinson AP, Dittrich WH, Gemmell AJ, and Young AW, “Emotion perception from dynamic and static body expressions in point-light and full-light displays,” Perception, vol. 33, no. 6, pp. 717–746, Jun. 2004. [DOI] [PubMed] [Google Scholar]
- [142].Kelly JR and Hutson-Comeaux SL, “Gender stereotypes of emotional reactions: How we judge an emotion as valid,” Sex Roles, vol. 47, nos. 1–2, pp. 1–10, Jul. 2002. [Google Scholar]
- [143].Chaplin TM, “Gender and emotion expression: A developmental contextual perspective,” Emotion Rev, vol. 7, no. 1, pp. 14–21, Jan. 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [144].Cordaro DT, Sun R, Keltner D, Kamble S, Huddar N, and McNeil G, “Universals and cultural variations in 22 emotional expressions across five cultures,” Emotion, vol. 18, no. 1, pp. 75–93, Feb. 2018. [DOI] [PubMed] [Google Scholar]
- [145].Jürgens R, Grass A, Drolet M, and Fischer J, “Effect of acting experience on emotion expression and recognition in voice: Non-actors provide better stimuli than expected,” J. Nonverbal Behav, vol. 39, no. 3, pp. 195–214, Sep. 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [146].Keltner D, Sauter D, Tracy J, and Cowen A, “Emotional expression: Advances in basic emotion theory,” J. Nonverbal Behav, vol. 43, no. 2, pp. 133–160, Jun. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [147].Hetzler ET, “Actors and emotion in performance,” Stud. Theatre Perform, vol. 28, no. 1, pp. 59–78, Dec. 2007. [Google Scholar]
- [148].Semmer NK, Messerli L, and Tschan F, “Disentangling the components of surface acting in emotion work: Experiencing emotions may be as important as regulating them,” J. Appl. Social Psychol, vol. 46, no. 1, pp. 46–64, Jan. 2016. [Google Scholar]
- [149].Elfenbein HA and Ambady N, “On the universality and cultural specificity of emotion recognition: A meta-analysis,” Psychol. Bull, vol. 128, no. 2, pp. 203–235, 2002. [DOI] [PubMed] [Google Scholar]
- [150].Ekman R, What the Face Reveals: Basi and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford, U.K.: Oxford Univ. Press, 1997. [Google Scholar]
- [151].Matsumoto D, “Cultural similarities and differences in display rules,” Motivat. Emotion, vol. 14, no. 3, pp. 195–214, Sep. 1990. [Google Scholar]
- [152].Adams RB, Hess U, and Kleck RE, “The intersection of gender-related facial appearance and facial displays of emotion,” Emotion Rev, vol. 7, no. 1, pp. 5–13, Jan. 2015. [Google Scholar]
- [153].Fischer AH, Rodriguez Mosquera PM, van Vianen AEM, and Manstead ASR, “Gender and culture differences in emotion,” Emotion, vol. 4, no. 1, pp. 87–94, Mar. 2004. [DOI] [PubMed] [Google Scholar]
- [154].Adams RB, Albohn DN, Hedgecoth N, Garrido CO, and Adams KD, “Angry white faces: A contradiction of racial stereotypes and emotion-resembling appearance,” Affect. Sci, vol. 3, no. 1, pp. 46–61, Mar. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [155].Matsumoto D, “American-Japanese cultural differences in judgements of expression intensity and subjective experience,” Cognition Emotion, vol. 13, no. 2, pp. 201–218, Mar. 1999. [Google Scholar]
- [156].Park H. and Kitayama S, “Perceiving through culture: The socialized attention hypothesis,” Sci. Social Vis, vol. 7, pp. 75–89, Nov. 2010. [Google Scholar]
- [157].Aronoff J, Woike BA, and Hyman LM, “Which are the stimuli in facial displays of anger and happiness? Configurational bases of emotion recognition,” J. Personality Social Psychol, vol. 62, no. 6, pp. 1050–1066, Jun. 1992. [Google Scholar]
- [158].Bar M, Neta M, and Linz H, “Very first impressions,” Emotion, vol. 6, no. 2, pp. 269–278, 2006. [DOI] [PubMed] [Google Scholar]
- [159].Arya A, DiPaola S, and Parush A, “Perceptually valid facial expressions for character-based applications,” Int. J. Comput. Games Technol, vol. 2009, pp. 1–13, Mar. 2009. [Google Scholar]
- [160].Van Overwalle F, Drenth T, and Marsman G, “Spontaneous trait inferences: Are they linked to the actor or to the action?” Personality Social Psychol. Bull, vol. 25, no. 4, pp. 450–462, Apr. 1999. [Google Scholar]
- [161].Said CP, Sebe N, and Todorov A, “Structural resemblance to emotional expressions predicts evaluation of emotionally neutral faces,” Emotion, vol. 9, no. 2, pp. 260–264, 2009. [DOI] [PubMed] [Google Scholar]
- [162].Hareli S. and Hess U, “What emotional reactions can tell us about the nature of others: An appraisal perspective on person perception,” Cognition Emotion, vol. 24, no. 1, pp. 128–140, Jan. 2010. [Google Scholar]
- [163].Marsh AA, Adams RB, and Kleck RE, “Why do fear and anger look the way they do? Form and social function in facial expressions,” Personality Social Psychol. Bull, vol. 31, no. 1, pp. 73–86, Jan. 2005. [DOI] [PubMed] [Google Scholar]
- [164].Donovan R, Johnson A, deRoiste A, and O’Reilly R, “Quantifying the links between personality sub-traits and the basic emotions,” in Computational Science and Its Applications—ICCSA. Cham, Switzerland: Springer, 2020, pp. 521–537. [Google Scholar]
- [165].Hughes DJ, Kratsiotis IK, Niven K, and Holman D, “Personality traits and emotion regulation: A targeted review and recommendations,” Emotion, vol. 20, no. 1, p. 63, 2020. [DOI] [PubMed] [Google Scholar]
- [166].Fossum TA and Barrett LF, “Distinguishing evaluation from description in the personality-emotion relationship,” Personality Social Psychol. Bull, vol. 26, no. 6, pp. 669–678, Aug. 2000. [Google Scholar]
- [167].Digman JM, “Personality structure: Emergence of the five-factor model,” Annu. Rev. Psychol, vol. 41, no. 1, pp. 417–440, Jan. 1990. [Google Scholar]
- [168].Goldberg LR, “The structure of phenotypic personality traits,” Amer. Psychologist, vol. 48, no. 1, pp. 26–34, 1993. [DOI] [PubMed] [Google Scholar]
- [169].Adams RB, Nelson AJ, Soto JA, Hess U, and Kleck RE, “Emotion in the neutral face: A mechanism for impression formation?” Cognition Emotion, vol. 26, no. 3, pp. 431–441, Apr. 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [170].Albohn DN and Adams RB, “The expressive triad: Structure, color, and texture similarity of emotion expressions predict impressions of neutral faces,” Frontiers Psychol., vol. 12, Feb. 2021, Art. no. 612923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [171].Friedman HS, Riggio RE, and Segall DO, “Personality and the enactment of emotion,” J. Nonverbal Behav, vol. 5, no. 1, pp. 35–48, 1980. [Google Scholar]
- [172].Reisenzein R, Hildebrandt A, and Weber H, “Personality and emotion,” in The Cambridge Handbook of Personality Psychology, Corr PJ and Matthews G, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2020, pp. 81–99. [Google Scholar]
- [173].Azucar D, Marengo D, and Settanni M, “Predicting the big 5 personality traits from digital footprints on social media: A meta-analysis,” Personality Individual Differences, vol. 124, pp. 150–159, Apr. 2018. [Google Scholar]
- [174].Lewin K, The Conceptual Representation and the Measurement of Psychological Forces. Durham, NC, USA: Duke Univ. Press, 1938. [Google Scholar]
- [175].Carver CS and Scheier MF, “Control theory: A useful conceptual framework for personality–social, clinical, and health psychology,” Psychol. Bull, vol. 92, no. 1, pp. 111–135, 1982. [PubMed] [Google Scholar]
- [176].Higgins ET, “Self-discrepancy theory: What patterns of self-beliefs cause people to suffer?” in Advances in Experimental Social Psychology, vol. 22. Amsterdam, The Netherlands: Elsevier, 1989, pp. 93–136. [Google Scholar]
- [177].Davidson RJ, “Affective style and affective disorders: Perspectives from affective neuroscience,” Cognition Emotion, vol. 12, no. 3, pp. 307–330, May 1998. [Google Scholar]
- [178].Gray JA, The Neuropsychology of Emotion and Personality. Oxford, U.K.: Oxford Univ. Press, 1987. [Google Scholar]
- [179].Carver CS and White TL, “Behavioral inhibition, behavioral activation, and affective responses to impending reward and punishment: The BIS/BAS scales,” J. Personality Social Psychol, vol. 67, no. 2, pp. 319–333, Aug. 1994. [Google Scholar]
- [180].Davidson RJ and Hugdahl K, The Asymmetrical Brain. Cambridge, MA, USA: MIT Press, 2003. [Google Scholar]
- [181].Harmon-Jones E. and Allen JJ, “Anger and frontal brain activity: EEG asymmetry consistent with approach motivation despite negative affective valence,” J. Personality Social Psychol, vol. 74, no. 5, p. 1310, 1998. [DOI] [PubMed] [Google Scholar]
- [182].Joshi D. et al. , “Aesthetics and emotions in images,” IEEE Signal Process. Mag, vol. 28, no. 5, pp. 94–115, Sep. 2011. [Google Scholar]
- [183].Yanulevskaya V, van Gemert JC, Roth K, Herbold AK, Sebe N, and Geusebroek JM, “Emotional valence categorization using holistic image features,” in Proc. 15th IEEE Int. Conf. Image Process., Dec. 2008, pp. 101–104. [Google Scholar]
- [184].Pang L, Zhu S, and Ngo C-W, “Deep multimodal learning for affective analysis and retrieval,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 2008–2020, Nov. 2015. [Google Scholar]
- [185].Yang J, She D, Lai Y-K, and Yang M-H, “Retrieving and classifying affective images via deep metric learning,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 491–498. [Google Scholar]
- [186].Rao T, Xu M, Liu H, Wang J, and Burnett I, “Multi-scale blocks based image emotion classification using multiple instance learning,” in Proc. IEEE Int. Conf. Image Process., Aug. 2016, pp. 634–638. [Google Scholar]
- [187].Muszynski M. et al. , “Recognizing induced emotions of movie audiences from multimodal information,” IEEE Trans. Affect. Comput, vol. 12, no. 1, pp. 36–52, Jan. 2021. [Google Scholar]
- [188].Zhao S, Gao Y, Jiang X, Yao H, Chua T-S, and Sun X, “Exploring principles-of-art features for image emotion recognition,” in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 47–56. [Google Scholar]
- [189].Achlioptas P, Ovsjanikov M, Haydarov K, Elhoseiny M, and Guibas LJ, “ArtEmis: Affective language for visual art,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jan. 2021, pp. 11569–11579. [Google Scholar]
- [190].Peng K-C, Chen T, Sadovnik A, and Gallagher A, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2015, pp. 860–868. [Google Scholar]
- [191].Baveye Y, Dellandréa E, Chamaret C, and Chen L, “LIRIS-ACCEDE: A video database for affective content analysis,” IEEE Trans. Affect. Comput, vol. 6, no. 1, pp. 43–55, Jan. 2015. [Google Scholar]
- [192].Zhao S, Yao H, Gao Y, Ji R, and Ding G, “Continuous probability distribution prediction of image emotions via multitask shared sparse regression,” IEEE Trans. Multimedia, vol. 19, no. 3, pp. 632–645, Mar. 2017. [Google Scholar]
- [193].Zhao S, Yao H, Gao Y, Ding G, and Chua T-S, “Predicting personalized image emotion perceptions in social networks,” IEEE Trans. Affect. Comput, vol. 9, no. 4, pp. 526–540, Oct. 2018. [Google Scholar]
- [194].Zhao S. et al. , “Discrete probability distribution prediction of image emotions with shared sparse learning,” IEEE Trans. Affect. Comput, vol. 11, no. 4, pp. 574–587, Oct. 2020. [Google Scholar]
- [195].Song J, Han K, and Kim S-W, “I have no text in my post’: Using visual hints to model user emotions in social media,” in Proc. ACM Web Conf, Apr. 2022, pp. 2888–2896. [Google Scholar]
- [196].Wang X, Jia J, Yin J, and Cai L, “Interpretable aesthetic features for affective image classification,” in Proc. IEEE Int. Conf. Image Process., Feb. 2013, pp. 3230–3234. [Google Scholar]
- [197].Liu X, Li N, and Xia Y, “Affective image classification by jointly using interpretable art features and semantic annotations,” J. Vis. Commun. Image Represent, vol. 58, pp. 576–588, Jan. 2019. [Google Scholar]
- [198].Sartori A, Culibrk D, Yan Y, and Sebe N, “Who’s afraid of itten: Using the art theory of color combination to analyze emotions in abstract paintings,” in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, pp. 311–320. [Google Scholar]
- [199].Yuan J, Mcdonough S, You Q, and Luo J, “Sentribute: Image sentiment analysis from a mid-level perspective,” in Proc. 2nd Int. Workshop Issues Sentiment Discovery Opinion Mining, Aug. 2013, pp. 1–8. [Google Scholar]
- [200].Borth D, Ji R, Chen T, Breuel T, and Chang S-F, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proc. 21st ACM Int. Conf. Multimedia, Oct. 2013, pp. 223–232. [Google Scholar]
- [201].Jou B, Chen T, Pappas N, Redi M, Topkara M, and Chang S-F, “Visual affect around the world: A large-scale multilingual visual sentiment ontology,” in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, pp. 159–168. [Google Scholar]
- [202].Acar E, Hopfgartner F, and Albayrak S, “A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material,” Multimedia Tools Appl, vol. 76, no. 9, pp. 11809–11837, May 2017. [Google Scholar]
- [203].Krizhevsky A, Sutskever I, and Hinton GE, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114. [Google Scholar]
- [204].Simonyan K. and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–9. [Google Scholar]
- [205].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2016, pp. 770–778. [Google Scholar]
- [206].Xu C, Cetintas S, Lee K-C, and Li L-J, “Visual sentiment prediction with deep convolutional neural networks,” 2014, arXiv:1411.5731. [Google Scholar]
- [207].Zhang H. and Xu M, “Recognition of emotions in user-generated videos with kernelized features,” IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2824–2835, Oct. 2018. [Google Scholar]
- [208].Zhao S. et al. , “An end-to-end visual-audio attention network for emotion recognition in user-generated videos,” in Proc. AAAI Conf. Artif. Intell, 2020, pp. 303–311. [Google Scholar]
- [209].Jin X, Jing P, Wu J, Xu J, and Su Y, “Visual sentiment classification via low-rank regularization and label relaxation,” IEEE Trans. Cognit. Develop. Syst, vol. 14, no. 4, pp. 1678–1690, Dec. 2022. [Google Scholar]
- [210].Wei J, Yang X, and Dong Y, “User-generated video emotion recognition based on key frames,” Multimedia Tools Appl, vol. 80, no. 9, pp. 14343–14361, Apr. 2021. [Google Scholar]
- [211].Shukla A, Gullapuram SS, Katti H, Kankanhalli M, Winkler S, and Subramanian R, “Recognition of advertisement emotions with application to computational advertising,” IEEE Trans. Affect. Comput, vol. 13, no. 2, pp. 781–792, Apr. 2022. [Google Scholar]
- [212].Deng S. et al. , “Simple but powerful, a language-supervised method for image emotion classification,” IEEE Trans. Affect. Comput, early access, Nov. 2022, doi: 10.1109/TAFFC.2022.3225049. [DOI] [Google Scholar]
- [213].Pan J, Wang S, and Fang L, “Representation learning through multimodal attention and time-sync comments for affective video content analysis,” in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 42–50. [Google Scholar]
- [214].Chen C, Wu Z, and Jiang Y-G, “Emotion in context: Deep semantic feature fusion for video emotion recognition,” in Proc. 24th ACM Int. Conf. Multimedia, Oct. 2016, pp. 127–131. [Google Scholar]
- [215].You Q, Luo J, Jin H, and Yang J, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in Proc. AAAI Conf. Artif. Intell., 2015, pp. 381–388. [Google Scholar]
- [216].Campos V, Jou B, and Giró-i-Nieto X, “From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction,” Image Vis. Comput, vol. 65, pp. 15–22, Sep. 2017. [Google Scholar]
- [217].Yang J, She D, and Sun M, “Joint image emotion classification and distribution learning via deep convolutional neural network,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 3266–3272. [Google Scholar]
- [218].Yang J, Li J, Li L, Wang X, and Gao X, “A circular-structured representation for visual emotion distribution learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2021, pp. 4237–4246. [Google Scholar]
- [219].Yang J, Li J, Wang X, Ding Y, and Gao X, “Stimuli-aware visual emotion analysis,” IEEE Trans. Image Process, vol. 30, pp. 7432–7445, 2021. [DOI] [PubMed] [Google Scholar]
- [220].Liang Y, Maeda K, Ogawa T, and Haseyama M, “Chain centre loss: A psychology inspired loss function for image sentiment analysis,” Neurocomputing, vol. 495, pp. 118–128, Jul. 2022. [Google Scholar]
- [221].Tran D, Bourdev L, Fergus R, Torresani L, and Paluri M, “Learning spatiotemporal features with3D convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV; ), Dec. 2015, pp. 4489–4497. [Google Scholar]
- [222].Ou Y, Chen Z, and Wu F, “Multimodal local-global attention network for affective video content analysis,” IEEE Trans. Circuits Syst. Video Technol, vol. 31, no. 5, pp. 1901–1914, May 2021. [Google Scholar]
- [223].You Q, Jin H, and Luo J, “Visual sentiment analysis by attending on local image regions,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 231–237. [Google Scholar]
- [224].Song K, Yao T, Ling Q, and Mei T, “Boosting image sentiment analysis with visual attention,” Neurocomputing, vol. 312, pp. 218–228, Oct. 2018. [Google Scholar]
- [225].Fan S. et al. , “Emotional attention: A study of image sentiment and visual attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7521–7531. [Google Scholar]
- [226].Zhao S, Jia Z, Chen H, Li L, Ding G, and Keutzer K, “PDANet: Polarity-consistent deep attention network for fine-grained visual emotion regression,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 192–201. [Google Scholar]
- [227].Yang J, She D, Lai Y, Rosin PL, and Yang M, “Weakly supervised coupled networks for visual sentiment analysis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7584–7592. [Google Scholar]
- [228].Li Z, Lu H, Zhao C, Feng L, Gu G, and Chen W, “Weakly supervised discriminate enhancement network for visual sentiment analysis,” Artif. Intell. Rev, vol. 56, no. 2, pp. 1763–1785, Feb. 2023. [Google Scholar]
- [229].Zhu X. et al. , “Dependency exploitation: A unified CNN-RNN approach for visual emotion recognition,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 3595–3601. [Google Scholar]
- [230].Rao T, Li X, Zhang H, and Xu M, “Multi-level region-based convolutional neural network for image emotion classification,” Neurocomputing, vol. 333, pp. 429–439, Mar. 2019. [Google Scholar]
- [231].Rao T, Li X, and Xu M, “Learning multi-level deep representations for image emotion classification,” Neural Process. Lett, vol. 51, no. 3, pp. 2043–2061, Jun. 2020. [Google Scholar]
- [232].Yang J, Gao X, Li L, Wang X, and Ding J, “SOLVER: Scene-object interrelated visual emotion reasoning network,” IEEE Trans. Image Process, vol. 30, pp. 8686–8701, 2021. [DOI] [PubMed] [Google Scholar]
- [233].Zhang H. and Xu M, “Multiscale emotion representation learning for affective image recognition,” IEEE Trans. Multimedia, early access, Jan. 25, 2022, doi: 10.1109/TMM.2022.3144804. [DOI] [Google Scholar]
- [234].Xu L, Wang Z, Wu B, and Lui S, “MDAN: Multi-level dependent attention network for visual emotion analysis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2022, pp. 9479–9488. [Google Scholar]
- [235].Zhang J, Liu X, Chen M, Ye Q, and Wang Z, “Image sentiment classification via multi-level sentiment region correlation analysis,” Neurocomputing, vol. 469, pp. 221–233, Jan. 2022. [Google Scholar]
- [236].Xu B, Zheng Y, Ye H, Wu C, Wang H, and Sun G, “Video emotion recognition with concept selection,” in Proc. IEEE Int. Conf. Multimedia Expo. (ICME; ), Jul. 2019, pp. 406–411. [Google Scholar]
- [237].Sun JJ, Liu T, and Prasad G, “GLA in MediaEval 2018 emotional impact of movies task,” 2019, arXiv:1911.12361. [Google Scholar]
- [238].Cheng H, Tie Y, Qi L, and Jin C, “Context-aware based visual-audio feature fusion for emotion recognition,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN; ), Jul. 2021, pp. 1–8. [Google Scholar]
- [239].Chen T, Borth D, Darrell T, and Chang S-F, “DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks,” 2014, arXiv:1410.8586. [Google Scholar]
- [240].Gu X, Lu L, Qiu S, Zou Q, and Yang Z, “Sentiment key frame extraction in user-generated micro-videos via low-rank and sparse representation,” Neurocomputing, vol. 410, pp. 441–453, Oct. 2020. [Google Scholar]
- [241].Ruan S, Zhang K, Wu L, Xu T, Liu Q, and Chen E, “Color enhanced cross correlation net for image sentiment analysis,” IEEE Trans. Multimedia, early access, Oct. 11, 2021, doi: 10.1109/TMM.2021.3118208. [DOI] [Google Scholar]
- [242].Eyben F, Wöllmer M, and Schuller B, “Opensmile: The Munich versatile and fast open-source audio feature extractor,” in Proc. 18th ACM Int. Conf. Multimedia, Oct. 2010, pp. 1459–1462. [Google Scholar]
- [243].El Ayadi M, Kamel MS, and Karray F, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognit, vol. 44, no. 3, pp. 572–587, Mar. 2011. [Google Scholar]
- [244].Zhang Y. and Yang Q, “A survey on multi-task learning,” IEEE Trans. Knowl. Data Eng, vol. 34, no. 12, pp. 5586–5609, Dec. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [245].Shan C, Gong S, and McOwan PW, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image Vis. Comput, vol. 27, no. 6, pp. 803–816, May 2009. [Google Scholar]
- [246].Zhi R, Flierl M, Ruan Q, and Kleijn WB, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Trans. Syst., Man, Cybern., B, Cybernetics, vol. 41, no. 1, pp. 38–52, Feb. 2011. [DOI] [PubMed] [Google Scholar]
- [247].Dalal N. and Triggs B, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jul. 2005, pp. 886–893. [Google Scholar]
- [248].Michel P. and El Kaliouby R, “Real time facial expression recognition in video using support vector machines,” in Proc. 5th Int. Conf. Multimodal Interfaces, Nov. 2003, pp. 258–264. [Google Scholar]
- [249].Zeng Z, Pantic M, Roisman GI, and Huang TS, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 31, no. 1, pp. 39–58, Jan. 2009. [DOI] [PubMed] [Google Scholar]
- [250].Sariyanidi E, Gunes H, and Cavallaro A, “Automatic analysis of facial affect: A survey of registration, representation, and recognition,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 37, no. 6, pp. 1113–1133, Jun. 2015. [DOI] [PubMed] [Google Scholar]
- [251].Pantie M. and Rothkrantz LJM, “Automatic analysis of facial expressions: The state of the art,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 22, no. 12, pp. 1424–1445, 2000. [Google Scholar]
- [252].Fasel B. and Luettin J, “Automatic facial expression analysis: A survey,” Pattern Recognit, vol. 36, no. 1, pp. 259–275, Jan. 2003. [Google Scholar]
- [253].Benitez-Quiroz CF, Srinivasan R, and Martinez AM, “EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jan. 2016, pp. 5562–5570. [Google Scholar]
- [254].Hu P, Cai D, Wang S, Yao A, and Chen Y, “Learning supervised scoring ensemble for emotion recognition in the wild,” in Proc. 19th ACM Int. Conf. Multimodal Interact., Nov. 2017, pp. 553–560. [Google Scholar]
- [255].Bargal SA, Barsoum E, Ferrer CC, and Zhang C, “Emotion recognition in the wild from videos using images,” in Proc. 18th ACM Int. Conf. Multimodal Interact., Oct. 2016, pp. 433–436. [Google Scholar]
- [256].Wen Y, Zhang K, Li Z, and Qiao Y, “A discriminative feature learning approach for deep face recognition,” in Proc. Eur. Conf. Comput. Vis, 2016, pp. 499–515. [Google Scholar]
- [257].Cai J, Meng Z, Khan AS, Li Z, O’Reilly J, and Tong Y, “Island loss for learning discriminative features in facial expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 302–309. [Google Scholar]
- [258].Liu X, Kumar BVKV, You J, and Jia P, “Adaptive deep metric learning for identity-aware facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW; ), Jul. 2017, pp. 20–29. [Google Scholar]
- [259].Zhang K, Huang Y, Du Y, and Wang L, “Facial expression recognition based on deep evolutional spatial–temporal networks,” IEEE Trans. Image Process, vol. 26, no. 9, pp. 4193–4203, Sep. 2017. [DOI] [PubMed] [Google Scholar]
- [260].Yan J, Zheng W, Cui Z, Tang C, Zhang T, and Zong Y, “Multi-cue fusion for emotion recognition in the wild,” Neurocomputing, vol. 309, pp. 27–35, Oct. 2018. [Google Scholar]
- [261].Ouyang X. et al. , “Audio-visual emotion recognition using deep transfer learning and multiple temporal models,” in Proc. 19th ACM Int. Conf. Multimodal Interact., Nov. 2017, pp. 577–582. [Google Scholar]
- [262].Abbasnejad I, Sridharan S, Nguyen D, Denman S, Fookes C, and Lucey S, “Using synthetic data to improve facial expression analysis with 3D convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW; ), Oct. 2017, pp. 1609–1618. [Google Scholar]
- [263].Zeng J, Shan S, and Chen X, “Facial expression recognition with inconsistently annotated datasets,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 222–237. [Google Scholar]
- [264].Wang K, Peng X, Yang J, Lu S, and Qiao Y, “Suppressing uncertainties for large-scale facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2020, pp. 6897–6906. [Google Scholar]
- [265].Chen S, Wang J, Chen Y, Shi Z, Geng X, and Rui Y, “Label distribution learning on auxiliary label space graphs for facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2020, pp. 13984–13993. [Google Scholar]
- [266].She J, Hu Y, Shi H, Wang J, Shen Q, and Mei T, “Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2021, pp. 6248–6257. [Google Scholar]
- [267].Chen Y. and Joo J, “Understanding and mitigating annotation bias in facial expression recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV; ), Oct. 2021, pp. 14980–14991. [Google Scholar]
- [268].Zeng D, Lin Z, Yan X, Liu Y, Wang F, and Tang B, “Face2Exp: Combating data biases for facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2022, pp. 20291–20300. [Google Scholar]
- [269].Li H, Wang N, Ding X, Yang X, and Gao X, “Adaptively learning facial expression representation via C-F labels and distillation,” IEEE Trans. Image Process, vol. 30, pp. 2016–2028, 2021. [DOI] [PubMed] [Google Scholar]
- [270].Wang K, Peng X, Yang J, Meng D, and Qiao Y, “Region attention networks for pose and occlusion robust facial expression recognition,” IEEE Trans. Image Process, vol. 29, pp. 4057–4069, 2020. [DOI] [PubMed] [Google Scholar]
- [271].Zhang W, Ji X, Chen K, Ding Y, and Fan C, “Learning a facial expression embedding disentangled from identity,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2021, pp. 6759–6768. [Google Scholar]
- [272].Wang C, Wang S, and Liang G, “Identity- and pose-robust facial expression recognition through adversarial feature learning,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 238–246. [Google Scholar]
- [273].Ruan D, Yan Y, Lai S, Chai Z, Shen C, and Wang H, “Feature decomposition and reconstruction learning for effective facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR; ), Jun. 2021, pp. 7660–7669. [Google Scholar]
- [274].Farzaneh AH and Qi X, “Facial expression recognition in the wild via deep attentive center loss,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV; ), Jan. 2021, pp. 2402–2411. [Google Scholar]
- [275].Xue F, Wang Q, and Guo G, “TransFER: Learning relation-aware facial expression representations with transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3601–3610. [Google Scholar]
- [276].Dosovitskiy A. et al. , “An image is worth 16 × 16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–21. [Google Scholar]
- [277].Savchenko AV, Savchenko LV, and Makarov I, “Classifying emotions and engagement in online learning based on a single facial expression recognition neural network,” IEEE Trans. Affect. Comput, vol. 13, no. 4, pp. 2132–2143, Oct. 2022. [Google Scholar]
- [278].Tan M. and Le Q, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114. [Google Scholar]
- [279].Ruan D, Mo R, Yan Y, Chen S, Xue J-H, and Wang H, “Adaptive deep disturbance-disentangled learning for facial expression recognition,” Int. J. Comput. Vis, vol. 130, no. 2, pp. 455–477, Feb. 2022. [Google Scholar]
- [280].Wallbott HG, “Bodily expression of emotion,” Eur. J. Social Psychol, vol. 28, no. 6, pp. 879–896, Nov. 1998. [Google Scholar]
- [281].Meeren HKM, van Heijnsbergen CCRJ, and de Gelder B, “Rapid perceptual integration of facial expression and emotional body language,” Proc. Nat. Acad. Sci. USA, vol. 102, no. 45, pp. 16518–16523, Nov. 2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [282].de Gelder B, “Towards the neurobiology of emotional body language,” Nature Rev. Neurosci, vol. 7, no. 3, pp. 242–249, Mar. 2006. [DOI] [PubMed] [Google Scholar]
- [283].Aviezer H, Trope Y, and Todorov A, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, Nov. 2012. [DOI] [PubMed] [Google Scholar]
- [284].Nelson NL and Mondloch CJ, “Adults’ and children’s perception of facial expressions is influenced by body postures even for dynamic stimuli,” Vis. Cognition, vol. 25, nos. 4–6, pp. 563–574, Jul. 2017. [Google Scholar]
- [285].Karaaslan A, Durmus B, and Amado S, “Does body context affect facial emotion perception and eliminate emotional ambiguity without visual awareness?” Vis. Cognition, vol. 28, no. 10, pp. 605–620, Nov. 2020. [Google Scholar]
- [286].Gunes H. and Piccardi M, “Bi-modal emotion recognition from expressive face and body gestures,” J. Netw. Comput. Appl, vol. 30, no. 4, pp. 1334–1345, Nov. 2007. [Google Scholar]
- [287].Kleinsmith A, De Silva PR, and Bianchi-Berthouze N, “Cross-cultural differences in recognizing affect from body posture,” Interacting Comput, vol. 18, no. 6, pp. 1371–1389, Dec. 2006. [Google Scholar]
- [288].Schindler K, Van Gool L, and de Gelder B, “Recognizing emotions expressed by body pose: A biologically inspired neural model,” Neural Netw, vol. 21, no. 9, pp. 1238–1246, Nov. 2008. [DOI] [PubMed] [Google Scholar]
- [289].Dael N, Mortillaro M, and Scherer KR, “Emotion expression in body action and posture,” Emotion, vol. 12, no. 5, p. 1085, 2012. [DOI] [PubMed] [Google Scholar]
- [290].Li B, Zhu C, Li S, and Zhu T, “Identifying emotions from non-contact gaits information based on Microsoft kinects,” IEEE Trans. Affect. Comput, vol. 9, no. 4, pp. 585–591, Oct. 2018. [Google Scholar]
- [291].Crenn A, Khan RA, Meyer A, and Bouakaz S, “Body expression recognition from animated 3D skeleton,” in Proc. Int. Conf. 3D Imag. (IC3D), Dec. 2016, pp. 1–7. [Google Scholar]
- [292].Bhattacharya U. et al. , “Take an emotion walk: Perceiving emotions from gaits using hierarchical attention pooling and affective mapping,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 145–163. [Google Scholar]
- [293].Bhattacharya U, Mittal T, Chandra R, Randhavane T, Bera A, and Manocha D, “STEP: Spatial temporal graph convolutional networks for emotion perception from gaits,” in Proc. AAAI Conf. Artif. Intell, vol. 34, no. 2, Apr. 2020, pp. 1342–1350. [Google Scholar]
- [294].Yan S, Xiong Y, and Lin D, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 1–9. [Google Scholar]
- [295].Banerjee A, Bhattacharya U, and Bera A, “Learning unseen emotions from gestures via semantically-conditioned zero-shot perception with adversarial autoencoders,” in Proc. AAAI Conf. Artif. Intell, vol. 36, no. 1, 2022, pp. 3–10. [Google Scholar]
- [296].Narayanan V, Manoghar BM, Sashank Dorbala V, Manocha D, and Bera A, “ProxEmo: Gait-based emotion learning and multi-view proxemic fusion for socially-aware robot navigation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 8200–8207. [Google Scholar]
- [297].Hu C, Sheng W, Dong B, and Li X, “TNTC: Two-stream network with transformer-based complementarity for gait-based emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP; ), May 2022, pp. 3229–3233. [Google Scholar]
- [298].Simonyan K. and Zisserman A, “Two-stream convolutional networks for action recognition in videos,” in Proc. Adv. Neural Inf. Process. Syst, vol. 27, 2014, pp. 1–9. [Google Scholar]
- [299].Wang L. et al. , “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 20–36. [Google Scholar]
- [300].Kay W. et al. , “The kinetics human action video dataset,” 2017, arXiv:1705.06950. [Google Scholar]
- [301].Huang Y, Wen H, Qing L, Jin R, and Xiao L, “Emotion recognition based on body and context fusion in the wild,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW; ), Oct. 2021, pp. 3609–3617. [Google Scholar]
- [302].Filntisis PP, Efthymiou N, Potamianos G, and Maragos P, “Emotion understanding in videos through body, context, and visual-semantic embedding loss,” in Proc. 1st Int. Workshop Bodily Expressed Emotion Understand, Conjunct Eur. Comput. Vis. Conf., 2020, pp. 747–755. [Google Scholar]
- [303].Pikoulis I, Filntisis PP, and Maragos P, “Leveraging semantic scene characteristics and multi-stream convolutional architectures in a contextual approach for video-based visual emotion recognition in the wild,” in Proc. 16th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG ), Dec. 2021, pp. 01–08. [Google Scholar]
- [304].Radford A. et al. , “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763. [Google Scholar]
- [305].Zhang S, Pan Y, and Wang JZ, “Learning emotion representations from verbal and nonverbal communication,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2023. [Google Scholar]
- [306].Chen H, Shi H, Liu X, Li X, and Zhao G, “SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis,” Int. J. Comput. Vis, vol. 131, pp. 1346–1366, Feb. 2023. [Google Scholar]
- [307] Wu C, Davaasuren D, Shafir T, Tsachor R, and Wang JZ, “Bodily expressed emotion understanding through integrating Laban movement analysis,” 2023, arXiv:2304.02187.
- [308] Le N, Nguyen K, Nguyen A, and Le B, “Global-local attention for emotion recognition,” Neural Comput. Appl, vol. 34, pp. 21625–21639, Dec. 2021.
- [309] Kim W, Son B, and Kim I, “ViLT: Vision-and-language transformer without convolution or region supervision,” in Proc. Int. Conf. Mach. Learn, 2021, pp. 5583–5594.
- [310] Devlin J, Chang M-W, Lee K, and Toutanova K, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Assoc. Comput. Linguistics, Hum. Lang. Technol. Minneapolis, MN, USA: Association for Computational Linguistics, vol. 1, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
- [311] Tzirakis P, Trigeorgis G, Nicolaou MA, Schuller BW, and Zafeiriou S, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE J. Sel. Topics Signal Process, vol. 11, no. 8, pp. 1301–1309, Dec. 2017.
- [312] Antoniadis P, Pikoulis I, Filntisis PP, and Maragos P, “An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 3645–3651.
- [313] Shirian A, Tripathi S, and Guha T, “Dynamic emotion modeling with learnable graphs and graph inception network,” IEEE Trans. Multimedia, vol. 24, pp. 780–790, 2022.
- [314] Shad Akhtar M, Singh Chauhan D, Ghosal D, Poria S, Ekbal A, and Bhattacharyya P, “Multi-task learning for multi-modal emotion recognition and sentiment analysis,” 2019, arXiv:1905.05812.
- [315] Yu W, Xu H, Yuan Z, and Wu J, “Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis,” in Proc. AAAI Conf. Artif. Intell, vol. 35, no. 12, 2021, pp. 10790–10797.
- [316] Jiang D et al., “A multitask learning framework for multimodal sentiment analysis,” in Proc. Int. Conf. Data Mining Workshops (ICDMW), Dec. 2021, pp. 151–157.
- [317] Yang D, Huang S, Kuang H, Du Y, and Zhang L, “Disentangled representation learning for multimodal emotion recognition,” in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 1642–1651.
- [318] Zhang S, Yin C, and Yin Z, “Multimodal sentiment recognition with multi-task learning,” IEEE Trans. Emerg. Topics Comput. Intell, vol. 7, no. 1, pp. 200–209, Feb. 2023.
- [319] Zhang K, Li Y, Wang J, Cambria E, and Li X, “Real-time video emotion recognition based on reinforcement learning and domain knowledge,” IEEE Trans. Circuits Syst. Video Technol, vol. 32, no. 3, pp. 1034–1047, Mar. 2022.
- [320] Mittal T, Bera A, and Manocha D, “Multimodal and context-aware emotion perception model with multiplicative fusion,” IEEE Multimedia Mag, vol. 28, no. 2, pp. 67–75, Apr. 2021.
- [321] Cambria E, Howard N, Hsu J, and Hussain A, “Sentic blending: Scalable multimodal fusion for the continuous interpretation of semantics and sentics,” in Proc. IEEE Symp. Comput. Intell. Human-like Intell. (CIHLI), Apr. 2013, pp. 108–117.
- [322] Zadeh A, Chen M, Poria S, Cambria E, and Morency L-P, “Tensor fusion network for multimodal sentiment analysis,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 1103–1114.
- [323] Liu Z, Shen Y, Lakshminarasimhan VB, Liang PP, Zadeh AB, and Morency L-P, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1–10.
- [324] Tsai Y-HH, Liang PP, Zadeh A, Morency L-P, and Salakhutdinov R, “Learning factorized multimodal representations,” in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–20. [Online]. Available: https://openreview.net/forum?id=rygqqsA9KX
- [325] Sun Z, Sarma P, Sethares W, and Liang Y, “Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis,” in Proc. AAAI Conf. Artif. Intell, vol. 34, no. 5, 2020, pp. 8992–8999.
- [326] Hazarika D, Zimmermann R, and Poria S, “MISA: Modality-invariant and -specific representations for multimodal sentiment analysis,” in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1122–1131.
- [327] He K, Fan H, Wu Y, Xie S, and Girshick R, “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9729–9738.
- [328] He K, Chen X, Xie S, Li Y, Dollar P, and Girshick R, “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16000–16009.
- [329] Feichtenhofer C, Fan H, Li Y, and He K, “Masked autoencoders as spatiotemporal learners,” 2022, arXiv:2205.09113.
- [330] Ronchi MR and Perona P, “Benchmarking and error diagnosis in multi-instance pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 369–378.
- [331] Lu X, Lin Z, Shen X, Mech R, and Wang JZ, “Deep multi-patch aggregation network for image style, aesthetics, and quality estimation,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 990–998.
- [332] Shin D, “The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI,” Int. J. Hum.-Comput. Stud, vol. 146, Feb. 2021, Art. no. 102551.
- [333] Arya V, Bellamy RKE, Chen PY, and Dhurandhar A, “AI explainability 360: An extensible toolkit for understanding data and machine learning models,” J. Mach. Learn. Res, vol. 21, no. 130, pp. 1–6, 2020.
- [334] Gilpin LH, Bau D, Yuan BZ, Bajwa A, Specter M, and Kagal L, “Explaining explanations: An overview of interpretability of machine learning,” in Proc. IEEE 5th Int. Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2018, pp. 80–89.
- [335] Ribeiro MT, Singh S, and Guestrin C, “Model-agnostic interpretability of machine learning,” 2016, arXiv:1606.05386.
- [336] Ribeiro MT, Singh S, and Guestrin C, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 1135–1144.
- [337] Seo B, Lin L, and Li J, “Mixture of linear models co-supervised by deep neural networks,” J. Comput. Graph. Statist, vol. 31, no. 4, pp. 1–38, 2022.
- [338] Mehrabi N, Morstatter F, Saxena N, Lerman K, and Galstyan A, “A survey on bias and fairness in machine learning,” ACM Comput. Surveys, vol. 54, no. 6, pp. 1–35, Jul. 2022.
- [339] Ye J, Lu X, Lin Z, and Wang JZ, “Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers,” 2018, arXiv:1802.00124.
- [340] Frankle J and Carbin M, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” 2018, arXiv:1803.03635.
- [341] Hoefler T, Alistarh D, Ben-Nun T, Dryden N, and Peste A, “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,” J. Mach. Learn. Res, vol. 22, pp. 1–124, Sep. 2021.
- [342] Ji S, Pan S, Cambria E, Marttinen P, and Yu PS, “A survey on knowledge graphs: Representation, acquisition, and applications,” IEEE Trans. Neural Netw. Learn. Syst, vol. 33, no. 2, pp. 494–514, Feb. 2022.
- [343] Wortman B and Wang JZ, “HICEM: A high-coverage emotion model for artificial emotional intelligence,” 2022, arXiv:2206.07593.
- [344] Zhang Y, Wang JZ, and Li J, “Parallel massive clustering of discrete distributions,” ACM Trans. Multimedia Comput., Commun., Appl, vol. 11, no. 4, pp. 1–24, Jun. 2015.
- [345] Li J, Yao L, Hendriks E, and Wang JZ, “Rhythmic brushstrokes distinguish van Gogh from his contemporaries: Findings via automated brushstroke extraction,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 34, no. 6, pp. 1159–1176, Jun. 2012.
- [346] Yao L, Suryanarayan P, Qiao M, Wang JZ, and Li J, “OSCAR: On-site composition and aesthetics feedback through exemplars for photographers,” Int. J. Comput. Vis, vol. 96, no. 3, pp. 353–383, Feb. 2012.
- [347] Li J, Yao L, and Wang JZ, “Photo composition feedback and enhancement,” in Mobile Cloud Visual Media Computing. Cham, Switzerland: Springer, 2015, pp. 113–144.
- [348] He S, Zhou Z, Farhat F, and Wang JZ, “Discovering triangles in portraits for supporting photographic creation,” IEEE Trans. Multimedia, vol. 20, no. 2, pp. 496–508, Feb. 2018.
- [349] Zhou Z, Farhat F, and Wang JZ, “Detecting dominant vanishing points in natural scenes with application to composition-sensitive image retrieval,” IEEE Trans. Multimedia, vol. 19, no. 12, pp. 2651–2665, Dec. 2017.
- [350] Norman DA, Emotional Design: Why We Love (or Hate) Everyday Things. London, U.K.: Civitas Books, 2004.
- [351] Sheppes G, Suri G, and Gross JJ, “Emotion regulation and psychopathology: The role of gender,” Annu. Rev. Clin. Psychol, vol. 11, pp. 379–405, Apr. 2015.
- [352] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-5). Richmond, VA, USA: American Psychiatric Association, 2013.
- [353] Joormann J and Stanton CH, “Examining emotion regulation in depression: A review and future directions,” Behaviour Res. Therapy, vol. 86, pp. 35–49, Nov. 2016.
- [354] Vanderlind WM, Millgram Y, Baskin-Sommers AR, Clark MS, and Joormann J, “Understanding positive emotion deficits in depression: From emotion preferences to emotion regulation,” Clin. Psychol. Rev, vol. 76, Mar. 2020, Art. no. 101826.
- [355] Gaebel W and Wölwer W, “Facial expression and emotional face recognition in schizophrenia and depression,” Eur. Arch. Psychiatry Clin. Neurosci, vol. 242, no. 1, pp. 46–52, Sep. 1992.
- [356] Gaebel W and Wölwer W, “Facial expressivity in the course of schizophrenia and depression,” Eur. Arch. Psychiatry Clin. Neurosci, vol. 254, no. 5, pp. 335–342, Oct. 2004.
- [357] Bersani FS et al., “Facial expression in patients with bipolar disorder and schizophrenia in response to emotional stimuli: A partially shared cognitive and social deficit of the two disorders,” Neuropsychiatric Disease Treatment, vol. 9, p. 1137, Aug. 2013.
- [358] Jones IH and Pansa M, “Some nonverbal aspects of depression and schizophrenia occurring during the interview,” J. Nervous Mental Disease, vol. 167, no. 7, pp. 402–409, Jul. 1979.
- [359] Troisi A and Moles A, “Gender differences in depression: An ethological study of nonverbal behavior during interviews,” J. Psychiatric Res, vol. 33, no. 3, pp. 243–250, 1999.
- [360] Sobin C and Sackeim HA, “Psychomotor symptoms of depression,” Amer. J. Psychiatry, vol. 154, pp. 4–17, Jan. 1997.
- [361] Jan A, Meng H, Gaus YFBA, and Zhang F, “Artificial intelligent system for automatic depression level analysis through visual and vocal expressions,” IEEE Trans. Cognit. Develop. Syst, vol. 10, no. 3, pp. 668–680, Sep. 2018.
- [362] Zhu Y, Shang Y, Shao Z, and Guo G, “Automated depression diagnosis based on deep networks to encode facial appearance and dynamics,” IEEE Trans. Affect. Comput, vol. 9, no. 4, pp. 578–584, Oct. 2018.
- [363] Kulkarni PB and Patil MM, “Clinical depression detection in adolescent by face,” in Proc. Int. Conf. Smart City Emerg. Technol. (ICSCET), Jan. 2018, pp. 1–4.
- [364] He L, Jiang D, and Sahli H, “Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding,” IEEE Trans. Multimedia, vol. 21, no. 6, pp. 1476–1486, Jun. 2019.
- [365] Song S, Shen L, and Valstar M, “Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features,” in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 158–165.
- [366] Nasser SA, Hashim IA, and Ali WH, “A review on depression detection and diagnoses based on visual facial cues,” in Proc. 3rd Int. Conf. Eng. Technol. Appl. (IICETA), Sep. 2020, pp. 35–40.
- [367] Mengi M and Malhotra D, “Artificial intelligence based techniques for the detection of socio-behavioral disorders: A systematic review,” Arch. Comput. Methods Eng, vol. 29, pp. 2811–2855, Nov. 2021.
- [368] Gavrilescu M and Vizireanu N, “Predicting depression, anxiety, and stress levels from videos using the facial action coding system,” Sensors, vol. 19, no. 17, p. 3693, Aug. 2019.
- [369] Low DM, Bentley KH, and Ghosh SS, “Automated assessment of psychiatric disorders using speech: A systematic review,” Laryngoscope Investigative Otolaryngology, vol. 5, no. 1, pp. 96–116, Feb. 2020.
- [370] Morales M, Scherer S, and Levitan R, “A cross-modal review of indicators for depression detection systems,” in Proc. 4th Workshop Comput. Linguistics Clin. Psychol. From Linguistic Signal Clin. Reality, 2017, pp. 1–12.
- [371] Parola A, Simonsen A, Bliksted V, and Fusaroli R, “Voice patterns in schizophrenia: A systematic review and Bayesian meta-analysis,” Schizophrenia Res, vol. 216, pp. 24–40, Feb. 2020.
- [372] Schrijvers D, Hulstijn W, and Sabbe BGC, “Psychomotor symptoms in depression: A diagnostic, pathophysiological and therapeutic tool,” J. Affect. Disorders, vol. 109, nos. 1–2, pp. 1–20, Jul. 2008.
- [373] Annen S, Roser P, and Brüne M, “Nonverbal behavior during clinical interviews: Similarities and dissimilarities among schizophrenia, mania, and depression,” J. Nervous Mental Disease, vol. 200, no. 1, pp. 26–32, 2012.
- [374] Davison PS, Frith CD, Harrison-Read PE, and Johnstone EC, “Facial and other non-verbal communicative behaviour in chronic schizophrenia,” Psychol. Med, vol. 26, no. 4, pp. 707–713, Jul. 1996.
- [375] Pampouchidou A et al., “Quantitative comparison of motion history image variants for video-based depression assessment,” EURASIP J. Image Video Process, vol. 2017, no. 1, pp. 1–11, Dec. 2017.
- [376] Kacem A, Hammal Z, Daoudi M, and Cohn J, “Detecting depression severity by interpretable representations of motion dynamics,” in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 739–745.
- [377] Girard JM, Cohn JF, Mahoor MH, Mavadati SM, Hammal Z, and Rosenwald DP, “Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses,” Image Vis. Comput, vol. 32, no. 10, pp. 641–647, Oct. 2014.
- [378] Scherer S, Stratou G, and Morency L-P, “Audiovisual behavior descriptors for depression assessment,” in Proc. 15th ACM Int. Conf. Multimodal Interact., Dec. 2013, pp. 135–140.
- [379] Balsters MJH, Krahmer EJ, Swerts MGJ, and Vingerhoets AJJM, “Verbal and nonverbal correlates for depression: A review,” Current Psychiatry Rev, vol. 8, no. 3, pp. 227–234, Jun. 2012.
- [380] Schelde JTM, “Major depression: Behavioral markers of depression and recovery,” J. Nervous Mental Disease, vol. 186, no. 3, pp. 133–140, Mar. 1998.
- [381] Segrin C, “Social skills deficits associated with depression,” Clin. Psychol. Rev, vol. 20, no. 3, pp. 379–403, Apr. 2000.
- [382] Schelde T and Hertz M, “Ethology and psychotherapy,” Ethology Sociobiology, vol. 15, nos. 5–6, pp. 383–392, Sep. 1994.
- [383] Scherer S et al., “Automatic behavior descriptors for psychological disorder analysis,” in Proc. 10th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit. (FG), Apr. 2013, pp. 1–8.
- [384] Cummins N, Scherer S, Krajewski J, Schnieder S, Epps J, and Quatieri TF, “A review of depression and suicide risk assessment using speech analysis,” Speech Commun, vol. 71, pp. 10–49, Jul. 2015.
- [385] Horwitz R, Quatieri TF, Helfer BS, Yu B, Williamson JR, and Mundt J, “On the relative importance of vocal source, system, and prosody in human depression,” in Proc. IEEE Int. Conf. Body Sensor Netw., May 2013, pp. 1–6.
- [386] Kiss G and Vicsi K, “Mono- and multi-lingual depression prediction based on speech processing,” Int. J. Speech Technol, vol. 20, no. 4, pp. 919–935, Dec. 2017.
- [387] Quatieri TF and Malyska N, “Vocal-source biomarkers for depression: A link to psychomotor activity,” in Proc. Interspeech, Sep. 2012, pp. 1059–1062.
- [388] Buyukdura JS, McClintock SM, and Croarkin PE, “Psychomotor retardation in depression: Biological underpinnings, measurement, and treatment,” Prog. Neuro-Psychopharmacology Biol. Psychiatry, vol. 35, no. 2, pp. 395–409, Mar. 2011.
- [389] Parker G et al., “Classifying depression by mental state signs,” Brit. J. Psychiatry, vol. 157, no. 1, pp. 55–65, Jul. 1990.
- [390] Lemke MR, Wendorff T, Mieth B, Buhl K, and Linnemann M, “Spatiotemporal gait patterns during over ground locomotion in major depression compared with healthy controls,” J. Psychiatric Res, vol. 34, nos. 4–5, pp. 277–283, Jul. 2000.
- [391] Sloman L, Berridge M, Homatidis S, Hunter D, and Duck T, “Gait patterns of depressed patients and normal subjects,” Amer. J. Psychiatry, vol. 139, pp. 94–97, Jan. 1982.
- [392] Hausdorff JM, Peng C-K, Goldberger AL, and Stoll AL, “Gait unsteadiness and fall risk in two affective disorders: A preliminary study,” BMC Psychiatry, vol. 4, no. 1, pp. 1–7, Dec. 2004.
- [393] Michalak J, Troje NF, Fischer J, Vollmar P, Heidenreich T, and Schulte D, “Embodiment of sadness and depression—Gait patterns associated with dysphoric mood,” Psychosomatic Med, vol. 71, no. 5, pp. 580–587, 2009.
- [394] Michalak J, Mischnat J, and Teismann T, “Sitting posture makes a difference–embodiment effects on depressive memory bias,” Clin. Psychol. Psychotherapy, vol. 21, no. 6, pp. 519–524, 2014.
- [395] Wilkes C, Kydd R, Sagar M, and Broadbent E, “Upright posture improves affect and fatigue in people with depressive symptoms,” J. Behav. Therapy Experim. Psychiatry, vol. 54, pp. 143–149, Mar. 2017.
- [396] Canales JZ, Fiquer JT, Campos RN, Soeiro-de-Souza MG, and Moreno RA, “Investigation of associations between recurrence of major depressive disorder and spinal posture alignment: A quantitative cross-sectional study,” Gait Posture, vol. 52, pp. 258–264, Feb. 2017.
- [397] Rosario JL, Diógenes MSB, Mattei R, and Leite JR, “Differences and similarities in postural alterations caused by sadness and depression,” J. Bodywork Movement Therapies, vol. 18, no. 4, pp. 540–544, Oct. 2014.
- [398] Deschamps T, Thomas-Ollivier V, Sauvaget A, Bulteau S, Fortes-Bourbousson M, and Vachon H, “Balance characteristics in patients with major depression after a two-month walking exercise program: A pilot study,” Gait Posture, vol. 42, no. 4, pp. 590–593, Oct. 2015.
- [399] Doumas M, Smolders C, Brunfaut E, Bouckaert F, and Krampe RT, “Dual task performance of working memory and postural control in major depressive disorder,” Neuropsychology, vol. 26, no. 1, pp. 110–118, 2012.
- [400] Nakano MM, Otonari TS, Takara KS, Carmo CM, and Tanaka C, “Physical performance, balance, mobility, and muscle strength decline at different rates in elderly people,” J. Phys. Therapy Sci, vol. 26, no. 4, pp. 583–586, 2014.
- [401] Radovanović S, Jovičić M, Marić NP, and Kostić V, “Gait characteristics in patients with major depression performing cognitive and motor tasks while walking,” Psychiatry Res, vol. 217, nos. 1–2, pp. 39–46, Jun. 2014.
- [402] Sanders RD and Gillig PM, “Gait and its assessment in psychiatry,” Psychiatry, vol. 7, no. 7, pp. 38–43, 2010.
- [403] Bora E and Berk M, “Theory of mind in major depressive disorder: A meta-analysis,” J. Affect. Disorders, vol. 191, pp. 49–55, Feb. 2016.
- [404] van Neerven T, Bos DJ, and van Haren NE, “Deficiencies in theory of mind in patients with schizophrenia, bipolar disorder, and major depressive disorder: A systematic review of secondary literature,” Neurosci. Biobehavioral Rev, vol. 120, pp. 249–261, Jan. 2021.
- [405] Aghevli MA, Blanchard JJ, and Horan WP, “The expression and experience of emotion in schizophrenia: A study of social interactions,” Psychiatry Res, vol. 119, no. 3, pp. 261–270, Aug. 2003.
- [406] Guidi A, Schoentgen J, Bertschy G, Gentili C, Scilingo EP, and Vanello N, “Features of vocal frequency contour and speech rhythm in bipolar disorder,” Biomed. Signal Process. Control, vol. 37, pp. 23–31, Aug. 2017.
- [407] Guidi A, Scilingo EP, Gentili C, Bertschy G, Landini L, and Vanello N, “Analysis of running speech for the characterization of mood state in bipolar patients,” in Proc. AEIT Int. Annu. Conf. (AEIT), Oct. 2015, pp. 1–6.
- [408] Zhang J et al., “Analysis on speech signal features of manic patients,” J. Psychiatric Res, vol. 98, pp. 59–63, Mar. 2018.
- [409] Maxhuni A, Muñoz-Meléndez A, Osmani V, Perez H, Mayora O, and Morales EF, “Classification of bipolar disorder episodes based on analysis of voice and motor activity of patients,” Pervas. Mobile Comput, vol. 31, pp. 50–66, Sep. 2016.
- [410] Bolbecker AR, Hong SL, Kent JS, Klaunig MJ, O’Donnell BF, and Hetrick WP, “Postural control in bipolar disorder: Increased sway area and decreased dynamical complexity,” PLoS ONE, vol. 6, no. 5, May 2011, Art. no. e19824.
- [411] Baez S et al., “Contextual social cognition impairments in schizophrenia and bipolar disorder,” PLoS ONE, vol. 8, no. 3, Mar. 2013, Art. no. e57664.
- [412] Donohoe G et al., “Social cognition in bipolar disorder versus schizophrenia: Comparability in mental state decoding deficits,” Bipolar Disorders, vol. 14, no. 7, pp. 743–748, Nov. 2012.
- [413] Berenbaum H and Oltmanns TF, “Emotional experience and expression in schizophrenia and depression,” J. Abnormal Psychol, vol. 101, no. 1, pp. 37–44, 1992.
- [414] Kring AM and Earnst KS, “Stability of emotional responding in schizophrenia,” Behav. Therapy, vol. 30, no. 3, pp. 373–388, 1999.
- [415] Juckel G and Polzer U, “Fine analysis of abnormal facial expressions in chronic schizophrenic patients—A pilot study,” German J. Psychiatry, vol. 1, pp. 6–9, 1998.
- [416] Krause R, Steimer E, Sänger-Alt C, and Wagner G, “Facial expression of schizophrenic patients and their interaction partners,” Psychiatry, vol. 52, no. 1, pp. 1–12, Feb. 1989.
- [417] Steimer-Krause E, Krause R, and Wagner G, “Interaction regulations used by schizophrenic and psychosomatic patients: Studies on facial behavior in dyadic interactions,” Psychiatry, vol. 53, no. 3, pp. 209–228, Aug. 1990.
- [418] Troisi A, Pompili E, Binello L, and Sterpone A, “Facial expressivity during the clinical interview as a predictor of functional disability in schizophrenia. A pilot study,” Prog. Neuro-Psychopharmacology Biol. Psychiatry, vol. 31, no. 2, pp. 475–481, Mar. 2007.
- [419] Lavelle M, Healey PGT, and McCabe R, “Is nonverbal communication disrupted in interactions involving patients with schizophrenia?” Schizophrenia Bull, vol. 39, no. 5, pp. 1150–1158, Sep. 2013.
- [420] Troisi A, “Ethological research in clinical psychiatry: The study of nonverbal behavior during interviews,” Neurosci. Biobehavioral Rev, vol. 23, no. 7, pp. 905–913, Nov. 1999.
- [421] Brüne M, Sonntag C, Abdel-Hamid M, Lehmkämper C, Juckel G, and Troisi A, “Nonverbal behavior during standardized interviews in patients with schizophrenia spectrum disorders,” J. Nervous Mental Disease, vol. 196, no. 4, pp. 282–288, 2008.
- [422] Troisi A, Spalletta G, and Pasini A, “Non-verbal behaviour deficits in schizophrenia: An ethological study of drug-free patients,” Acta Psychiatrica Scandinavica, vol. 97, no. 2, pp. 109–115, Feb. 1998.
- [423] Kliper R, Portuguese S, and Weinshall D, “Prosodic analysis of speech and the underlying mental state,” in Proc. Int. Symp. Pervasive Comput. Paradigms Mental Health, 2015, pp. 52–62.
- [424] Kliper R, Vaizman Y, Weinshall D, and Portuguese S, “Evidence for depression and schizophrenia in speech prosody,” in Proc. 3rd Tutorial Res. Workshop Experim. Linguistics, Nov. 2019, pp. 85–88.
- [425] Perlini C et al., “Linguistic production and syntactic comprehension in schizophrenia and bipolar disorder,” Acta Psychiatrica Scandinavica, vol. 126, no. 5, pp. 363–376, Nov. 2012.
- [426] Tahir Y et al., “Non-verbal speech cues as objective measures for negative symptoms in patients with schizophrenia,” PLoS ONE, vol. 14, no. 4, Apr. 2019, Art. no. e0214314.
- [427] Rapcan V, D’Arcy S, Yeap S, Afzal N, Thakore J, and Reilly RB, “Acoustic and temporal analysis of speech: A potential biomarker for schizophrenia,” Med. Eng. Phys, vol. 32, no. 9, pp. 1074–1079, Nov. 2010.
- [428] Cristiano VB, Vieira Szortyka MF, Lobato MI, Ceresér KM, and Belmonte-de-Abreu P, “Postural changes in different stages of schizophrenia is associated with inflammation and pain: A cross-sectional observational study,” Int. J. Psychiatry Clin. Pract, vol. 21, no. 2, pp. 104–111, Apr. 2017.
- [429] Kent JS et al., “Motor deficits in schizophrenia quantified by nonlinear analysis of postural sway,” PLoS ONE, vol. 7, no. 8, pp. 1–10, 2012.
- [430] Marvel C, “A quantitative measure of postural sway deficits in schizophrenia,” Schizophrenia Res., vol. 68, nos. 2–3, pp. 363–372, Jun. 2004.
- [431] Matsuura Y et al., “Standing postural instability in patients with schizophrenia: Relationships with psychiatric symptoms, anxiety, and the use of neuroleptic medications,” Gait Posture, vol. 41, no. 3, pp. 847–851, Mar. 2015.
- [432] Teng Y-L et al., “Postural stability of patients with schizophrenia during challenging sensory conditions: Implication of sensory integration for postural control,” PLoS ONE, vol. 11, no. 6, Jun. 2016, Art. no. e0158219.
- [433] Jeon HJ et al., “Quantitative analysis of ataxic gait in patients with schizophrenia: The influence of age and visual control,” Psychiatry Res, vol. 152, nos. 2–3, pp. 155–164, Aug. 2007.
- [434] Lallart E et al., “Gait control and executive dysfunction in early schizophrenia,” J. Neural Transmiss, vol. 121, no. 4, pp. 443–450, Apr. 2014.
- [435] Putzhammer A, Perfahl M, Pfeiff L, and Hajak G, “Gait disturbances in patients with schizophrenia and adaptation to treadmill walking,” Psychiatry Clin. Neurosci, vol. 59, no. 3, pp. 303–310, Jun. 2005.
- [436] Sparks A, McDonald S, Lino B, O’Donnell M, and Green MJ, “Social cognition, empathy and functional outcome in schizophrenia,” Schizophrenia Res, vol. 122, nos. 1–3, pp. 172–178, Sep. 2010.
- [437] Gilbert BO, “Physiological and nonverbal correlates of extraversion, neuroticism, and psychoticism during active and passive coping,” Personality Individual Differences, vol. 12, no. 12, pp. 1325–1331, Jan. 1991.
- [438] Wenzel A, Graff-Dolezal J, Macho M, and Brendle JR, “Communication and social skills in socially anxious and nonanxious individuals in the context of romantic relationships,” Behaviour Res. Therapy, vol. 43, no. 4, pp. 505–519, Apr. 2005.
- [439] Wiens AN, Harper RG, and Matarazzo JD, “Personality correlates of nonverbal interview behavior,” J. Clin. Psychol, vol. 36, no. 1, pp. 205–215, Jan. 1980.
- [440] Laretzaki G, Plainis S, Vrettos I, Chrisoulakis A, Pallikaris I, and Bitsios P, “Threat and trait anxiety affect stability of gaze fixation,” Biol. Psychol, vol. 86, no. 3, pp. 330–336, Mar. 2011.
- [441] Metaxas D, Venkataraman S, and Vogler C, “Image-based stress recognition using a model-based dynamic face tracking system,” in Proc. Int. Conf. Comput. Sci., 2004, pp. 813–821.
- [442] Hamilton M, “The assessment of anxiety states by rating,” Brit. J. Med. Psychol, vol. 32, no. 1, pp. 50–55, Mar. 1959.
- [443] Dinges DF et al., “Optical computer recognition of facial expressions associated with stress induced by performance demands,” Aviation, Space, Environ. Med, vol. 76, no. 6, pp. B172–B182, 2005.
- [444] Hadar U, Steiner TJ, Grant EC, and Clifford Rose F, “Head movement correlates of juncture and stress at sentence level,” Lang. Speech, vol. 26, no. 2, pp. 117–129, Apr. 1983.
- [445] Liao W, Zhang W, Zhu Z, and Ji Q, “A real-time human stress monitoring system using dynamic Bayesian network,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Dec. 2005, p. 70.
- [446] Giannakakis G et al., “Stress and anxiety detection using facial cues from videos,” Biomed. Signal Process. Control, vol. 31, pp. 89–101, Jan. 2017.
- [447] Harrigan JA, Harrigan KM, Sale BA, and Rosenthal R, “Detecting anxiety and defensiveness from visual and auditory cues,” J. Personality, vol. 64, no. 3, pp. 675–709, Sep. 1996.
- [448] Harris CS, Thackray RI, and Shoenberger RW, “Blink rate as a function of induced muscular tension and manifest anxiety,” Perceptual Motor Skills, vol. 22, no. 1, pp. 155–160, Feb. 1966.
- [449] Ekman P and Friesen WV, “Detecting deception from the body or face,” J. Personality Social Psychol, vol. 29, no. 3, pp. 288–298, Mar. 1974.
- [450] Heerey EA and Kring AM, “Interpersonal consequences of social anxiety,” J. Abnormal Psychol, vol. 116, no. 1, pp. 125–134, 2007.
- [451] Jurich AP and Jurich JA, “Correlations among nonverbal expressions of anxiety,” Psychol. Rep, vol. 34, no. 1, pp. 199–204, Feb. 1974.
- [452] LeCompte WA, “The ecology of anxiety: Situational stress and rate of self-stimulation in Turkey,” J. Personality Social Psychol, vol. 40, no. 4, pp. 712–721, 1981.
- [453] Shechter T, Asher M, and Aderka IM, “Man vs. machine: A comparison of human and computer assessment of nonverbal behavior in social anxiety disorder,” J. Anxiety Disorders, vol. 89, Jun. 2022, Art. no. 102587.
- [454] Harrigan JA, Larson MA, and Pflum CJ, “The role of auditory cues in the detection of state anxiety,” J. Appl. Social Psychol, vol. 24, no. 22, pp. 1965–1983, Nov. 1994.
- [455] Özseven T, Dügenci M, Doruk A, and Kahraman HI, “Voice traces of anxiety: Acoustic parameters affected by anxiety disorder,” Arch. Acoust, vol. 43, no. 4, pp. 625–636, 2018.
- [456] Silber-Varod V, Kreiner H, Lovett R, Levi-Belz Y, and Amir N, “Do social anxiety individuals hesitate more? The prosodic profile of hesitation disfluencies in social anxiety disorder individuals,” in Proc. Speech Prosody, May 2016, pp. 1211–1215.
- [457] Reelick MF, van Iersel MB, Kessels RPC, and Rikkert MGMO, “The influence of fear of falling on gait and balance in older people,” Age Ageing, vol. 38, no. 4, pp. 435–440, Jul. 2009.
- [458] Balaban C, Furman J, and Staab J, “Threat assessment and locomotion: Clinical applications of an integrated model of anxiety and postural control,” Seminars Neurol, vol. 33, no. 3, pp. 297–306, Sep. 2013.
- [459] Wynaden D, Tohotoa J, Heslop K, and Al Omari O, “Recognising falls risk in older adult mental health patients and acknowledging the difference from the general older adult population,” Collegian, vol. 23, no. 1, pp. 97–102, Mar. 2016.
- [460] Balaban C, “Neural substrates linking balance control and anxiety,” Physiol. Behav, vol. 77, nos. 4–5, pp. 469–475, Dec. 2002.
- [461] Bart O, Bar-Haim Y, Weizman E, Levin M, Sadeh A, and Mintz M, “Balance treatment ameliorates anxiety and increases self-esteem in children with comorbid anxiety and balance disorder,” Res. Develop. Disabilities, vol. 30, no. 3, pp. 486–495, May 2009.
- [462] Bolmont B, Gangloff P, Vouriot A, and Perrin PP, “Mood states and anxiety influence abilities to maintain balance control in healthy human subjects,” Neurosci. Lett, vol. 329, no. 1, pp. 96–100, 2002.
- [463] Feldman R, Schreiber S, Pick CG, and Been E, “Gait, balance, mobility and muscle strength in people with anxiety compared to healthy individuals,” Human Movement Sci, vol. 67, Oct. 2019, Art. no. 102513.
- [464] Hainaut J-P, Caillet G, Lestienne FG, and Bolmont B, “The role of trait anxiety on static balance performance in control and anxiogenic situations,” Gait Posture, vol. 33, no. 4, pp. 604–608, Apr. 2011.
- [465] Surcinelli P, Codispoti M, Montebarocci O, Rossi N, and Baldaro B, “Facial emotion recognition in trait anxiety,” J. Anxiety Disorders, vol. 20, no. 1, pp. 110–117, Jan. 2006.
- [466] Zainal NH and Newman MG, “Worry amplifies theory-of-mind reasoning for negatively valenced social stimuli in generalized anxiety disorder,” J. Affect. Disorders, vol. 227, pp. 824–833, Feb. 2018.
- [467] Marmar CR et al., “Speech-based markers for posttraumatic stress disorder in U.S. veterans,” Depression Anxiety, vol. 36, no. 7, pp. 607–616, Jul. 2019.
- [468] Scherer S, Lucas GM, Gratch J, Rizzo AS, and Morency L, “Self-reported symptoms of depression and PTSD are associated with reduced vowel space in screening interviews,” IEEE Trans. Affect. Comput, vol. 7, no. 1, pp. 59–73, Jan. 2016.
- [469] Xu R et al., “A voice-based automated system for PTSD screening and monitoring,” Stud. Health Technol. Inform, vol. 173, pp. 552–558, Mar. 2012.
- [470] Stratou G, Scherer S, Gratch J, and Morency L-P, “Automatic nonverbal behavior indicators of depression and PTSD: Exploring gender differences,” in Proc. Humaine Assoc. Conf. Affect. Comput. Intell. Interact. (ACII), 2013, pp. 147–152.
- [471] Kirsch A and Brunnhuber S, “Facial expression and experience of emotions in psychodynamic interviews with patients with PTSD in comparison to healthy subjects,” Psychopathology, vol. 40, no. 5, pp. 296–302, 2007.
- [472] Biffi E et al., “Gait pattern and motor performance during discrete gait perturbation in children with autism spectrum disorders,” Frontiers Psychol, vol. 9, p. 2530, Dec. 2018.
- [473] Baron-Cohen S, “Theory of mind and autism: A review,” Int. Rev. Res. Mental Retardation, vol. 23, pp. 169–184, Jan. 2000.
- [474] Belvederi Murri M et al., “Instrumental assessment of balance and gait in depression: A systematic review,” Psychiatry Res., vol. 284, Feb. 2020, Art. no. 112687.
- [475] Canales JZ, Cordas TA, Fiquer JT, Cavalcante AF, and Moreno RA, “Posture and body image in individuals with major depressive disorder: A controlled study,” Revista Brasileira de Psiquiatria, vol. 32, no. 4, pp. 375–380, Dec. 2010.
- [476] Feldman R, Schreiber S, Pick C, and Been E, “Gait, balance and posture in major mental illnesses: Depression, anxiety and schizophrenia,” Austin Med. Sci, vol. 5, no. 1, pp. 1–6, 2020.
- [477] Kang GE, Mickey BJ, McInnis MG, Krembs BS, and Gross MM, “Motor behavior characteristics in various phases of bipolar disorder revealed through biomechanical analysis: Quantitative measures of activity and energy variables during gait and sit-to-walk,” Psychiatry Res, vol. 269, pp. 93–101, Nov. 2018.
- [478] Hezel DM and McNally RJ, “Theory of mind impairments in social anxiety disorder,” Behav. Therapy, vol. 45, no. 4, pp. 530–540, Jul. 2014.
- [479] Yazici KU and Yazici IP, “Decreased theory of mind skills, increased emotion dysregulation and insight levels in adolescents diagnosed with obsessive compulsive disorder,” Nordic J. Psychiatry, vol. 73, no. 7, pp. 462–469, Oct. 2019.
- [480] Brune M, “‘Theory of mind’ in schizophrenia: A review of the literature,” Schizophrenia Bull, vol. 31, no. 1, pp. 21–42, 2005.
- [481] Montag C et al., “Theory of mind impairments in euthymic bipolar patients,” J. Affect. Disorders, vol. 123, nos. 1–3, pp. 264–269, Jun. 2010.
- [482] Atkinson AP, “Impaired recognition of emotions from body movements is associated with elevated motion coherence thresholds in autism spectrum disorders,” Neuropsychologia, vol. 47, no. 13, pp. 3023–3029, Nov. 2009.
- [483] Jarraya SK, Masmoudi M, and Hammami M, “A comparative study of autistic children emotion recognition based on spatio-temporal and deep analysis of facial expressions features during a meltdown crisis,” Multimedia Tools Appl, vol. 80, no. 1, pp. 83–125, Jan. 2021.
- [484] Savery R and Weinberg G, “Robots and emotion: A survey of trends, classifications, and forms of interaction,” Adv. Robot, vol. 35, no. 17, pp. 1030–1042, Sep. 2021.
- [485] Cavallo F, Semeraro F, Fiorini L, Magyar G, Sinčak P, and Dario P, “Emotion modelling for social robotics applications: A review,” J. Bionic Eng, vol. 15, no. 2, pp. 185–203, Mar. 2018.
- [486] Marcos-Pablos S and Garcia-Penalvo FJ, “Emotional intelligence in robotics: A scoping review,” in Proc. Int. Conf. Disruptive Technologies, Tech Ethics Artificial Intelligence, 2021, pp. 66–75.
- [487] Agnihotri A, Chan A, Hedaoo S, and Knight H, “Distinguishing robot personality from motion,” in Proc. Companion ACM/IEEE Int. Conf. Human-Robot Interact., Mar. 2020, pp. 87–89.
- [488] Smith JR, Joshi D, Huet B, Hsu W, and Cota J, “Harnessing AI for augmenting creativity: Application to movie trailer creation,” in Proc. ACM Int. Conf. Multimedia, 2017, pp. 1799–1808.
- [489] National Safety Council. (2020). Cost of Fatigue at Work. [Online]. Available: https://www.nsc.org/work-safety/safetytopics/fatigue/calculator/cost
- [490] Baghdadi A, Cavuoto LA, Jones-Farmer A, Rigdon SE, Esfahani ET, and Megahed FM, “Monitoring worker fatigue using wearable devices: A case study to detect changes in gait parameters,” J. Quality Technology, vol. 53, pp. 1–25, Aug. 2019.
- [491] Sigari M-H, Fathy M, and Soryani M, “A driver face monitoring system for fatigue and distraction detection,” Int. J. Veh. Technol, vol. 2013, pp. 1–11, Jan. 2013.
- [492] Karvekar S, Abdollahi M, and Rashedi E, “A data-driven model to identify fatigue level based on the motion data from a smartphone,” in Proc. IEEE Western New York Image Signal Process. Workshop (WNYISPW), Oct. 2019, pp. 1–5.
- [493] Sikander G and Anwar S, “Driver fatigue detection systems: A review,” IEEE Trans. Intell. Transp. Syst, vol. 20, no. 6, pp. 2339–2352, Jun. 2019.
- [494] Vargas-Cuentas NI and Roman-Gonzalez A, “Facial image processing for sleepiness estimation,” in Proc. 2nd Int. Conf. Bio-Engineering Smart Technol. (BioSMART), Aug. 2017, pp. 1–3.
- [495] Yamada Y and Kobayashi M, “Fatigue detection model for older adults using eye-tracking data gathered while watching video: Evaluation against diverse fatiguing tasks,” in Proc. IEEE Int. Conf. Healthcare Informat. (ICHI), Aug. 2017, pp. 275–284.
- [496] Yamada Y and Kobayashi M, “Detecting mental fatigue from eye-tracking data gathered while watching video: Evaluation in younger and older adults,” Artif. Intell. Med, vol. 91, pp. 39–48, Sep. 2018.
- [497] Noakes TD, “Fatigue is a brain-derived emotion that regulates the exercise behavior to ensure the protection of whole body homeostasis,” Frontiers Physiol, vol. 3, p. 82, 2012.