Author manuscript; available in PMC: 2020 Dec 1.
Published in final edited form as: IEEE Trans Pattern Anal Mach Intell. 2018 Sep 5;41(12):2835–2845. doi: 10.1109/TPAMI.2018.2868952

Discriminant Functional Learning of Color Features for the Recognition of Facial Action Units and their Intensities

C Fabian Benitez-Quiroz 1, Ramprakash Srinivasan 1, Aleix M Martinez 1
PMCID: PMC6880652  NIHMSID: NIHMS1541867  PMID: 30188814

Abstract

Color is a fundamental image feature of facial expressions. For example, when we furrow our eyebrows in anger, blood rushes in and some face areas turn red; when we go white in fear, blood drains from the face. Surprisingly, these image properties have not been exploited to recognize the facial action units (AUs) associated with these expressions. Herein, we present the first system to recognize AUs and their intensities using these functional color changes. These color features are shown to be robust to changes in identity, gender, race, ethnicity and skin color. Specifically, we identify the chromaticity changes that define the transition of an AU from inactive to active and use an innovative Gabor transform-based algorithm to gain invariance to the timing of these changes. Because these image changes are given by functions rather than vectors, we use functional classifiers to identify the most discriminant color features of an AU and its intensities. We demonstrate that, using these discriminant color features, one can achieve results superior to those of the state of the art. Finally, we define an algorithm that allows us to use the learned functional color representation in still images. This is done by learning the mapping between images and the identified functional color features in videos. Our algorithm works in real time, i.e., >30 frames/second/CPU thread.

Index Terms—: Facial expressions of emotion, face recognition, face perception, facial color, compound emotions, Gabor transform, color vision, time invariant, recognition in video, recognition in still images

1. Introduction

The automatic recognition of facial Action Units (AUs) [1], [2] is a major problem in computer vision [3] with applications in engineering (e.g., advertising, robotics, artificial intelligence) [4], [5], [6], education [7], linguistics [8], psychology [9], [10], psychiatry [11], [12], cognitive science and neuroscience [13], [14], to name but a few.

Most past and current computer vision systems use spatio-temporal features (e.g., Gabor filters, high- and low-spatial filtering) [15], [16], shape [17], [18], shading [12], [19] and motion [20], [21] to identify AUs in images and video sequences.

Although color is clearly another important feature of facial expressions [22], it is yet to be used as a feature for the recognition of AU activation.

To clarify the importance of color, consider the example in Figure 1. As seen in this figure, when we contract and relax our facial muscles, the shading and color of our faces change locally. For example, during a smile, the shading and color of the cheeks change due to the use of AU 12 (lip corner puller). This contraction of facial muscles is known to change the brdf (bidirectional reflectance distribution function) of the face [23], yielding clearly visible image changes [24].

Fig. 1.

Top: A few frames of a video sequence showing a facial expression of happily surprised. Note that we have demarcated a local region on this individual’s right cheek with red lines. The mean and standard deviation of the color of the pixels in this local region change over time. The value changes of the red, green and blue channels of the pixels in this local region are given in the bottom plot, with $f_j^R(t)$ showing the functional change in the red channel, $f_j^G(t)$ in the green, and $f_j^B(t)$ in the blue. Our contribution is to derive a method that can learn to identify when a facial action unit is active by exclusively using these color changes.

In this paper, we will exploit these color changes to detect AUs and their intensities. We will also demonstrate that these changes are consistent across identities, gender, race, ethnicity and skin color.

Note that we define color changes locally using a function $\mathbf{f}_j(t) \in \mathbb{R}^6$, where $\mathbf{f}_j(t) = \left(\mathbf{f}_j^R(t), \mathbf{f}_j^G(t), \mathbf{f}_j^B(t)\right)^T$ describes the color changes in each of the three channels (R, G, B), $\mathbf{f}_j^R(t), \mathbf{f}_j^G(t), \mathbf{f}_j^B(t) \in \mathbb{R}^2$, and $j$ designates the $j$th local region. Specifically, we use the local regions given by a set of automatically detected fiducial points [25], Figure 2. Aggregating these local functions, we obtain the global color function $\mathbf{f}(\cdot) = \left(\mathbf{f}_1(t)^T, \ldots, \mathbf{f}_{107}(t)^T\right)^T \in \mathbb{R}^{642}$, i.e., the functions $\mathbf{f}_j^c(t) \in \mathbb{R}^2$, $c \in \{R, G, B\}$, of the 3 color channels in each of the 107 local regions, $j = 1, \ldots, 107$. The three channels are the red, green and blue (R, G, B) of the camera.

Fig. 2.

The local regions of the face (left image) used by the derived algorithm. These local regions are obtained by Delaunay triangulation of the automatically detected fiducial points shown on the right image. These fiducial points, $\mathbf{s}_{ij}$ ($j = 1, \ldots, 66$), correspond to 15 anatomical landmarks (e.g., corners of the eyes, mouth and brows, tip of the nose, and chin) plus 51 pseudo-landmarks defined about the edge of the eyelids, brows, nose, lips and jaw line. The number of pseudo-landmarks defining the contour of each facial component (e.g., the brows) is constant, as is their inter-landmark distance. This guarantees equivalence of landmark positions across people. This triangulation yields 107 regions (patches).

The color representation described in the preceding paragraph differs from previous shape and shading image descriptors in that its samples are given by functions defining color properties only, fi(t), i = 1, …,n, n the number of samples. (Note we have added a subscript i to our notation to identify the ith sample feature function fi(t).) This calls for the use of a discriminant approach that works with functions.

In order to work with these functional color changes, we derive an approach to represent them in DCT (Discrete Cosine Transform) space and use the Gabor transform to gain invariance to time. The use of the Gabor transform in our derivations is key, yielding a compact mathematical formulation for detecting the color changes of an AU regardless of when they occur during a facial expression. That is, the resulting algorithm is invariant to the duration, start and finish of the AU activation within a video of a facial expression.

Since these functions are defined in time, learning must be done over video sequences. But testing can be done in videos and still images. To use the learned functions in still images, we need to first find the color functional changes of an image. To do this, we use regression to learn the mapping between an image of a facial expression and the functional representation of the video of that same expression.

In summary, the present paper demonstrates, for the first time, that the use of these color descriptors yields classification accuracies superior to those reported in the literature. This shows that the contribution of at least some of these color features must be uncorrelated with that of the shading and shape features used previously.

The paper is organized as follows. Section 2 derives the color space used by our algorithm. Section 3 derives functional classifiers to identify where in the video sequence an AU is active/present. Section 4 defines a mapping from still images to the derived functional representation of color to allow recognition of AUs in images. Section 5 provides extensive experimental validation of the derived algorithm. We conclude in Section 6.

1.1. Related work

Despite many advances in object recognition, the automatic coding of AUs in videos and images remains an open problem in computer vision [3]. A 2017 challenge of AU annotations in images collected “in the wild” demonstrated that the problem is far from solved [26]. Even when large amounts of training data are available, deep nets have a hard time annotating AUs with good precision and recall [27], [28], [29]. In this paper, we tackle this problem by exploiting intrinsic functional changes of the color signal in facial expressions of emotion in video sequences. We then show how this can be readily extended to still images as well.

The Gabor transform is specifically suited for our problem, given its ability to identify a template in a function [30]. Or, alternatively, one could employ the Wavelet transform [31]. Here, we show how the Gabor transform can be used to find a template color function in a functional description of a video sequence without the need of a grid search. Similarly, color images have been used to identify optical flow [32] and other image features [33], but not AUs and the dynamic changes that define facial configurations. Nevertheless, color is known to play a major role in human vision [22]. Herein, we identify the discriminant color templates that specify AUs.

2. Color Space

This section details the computations needed to construct the color feature space used by the proposed algorithm.

2.1. Local regions

We start with the $i$th sample video sequence $V_i = \{\mathbf{I}_{i1}, \ldots, \mathbf{I}_{ir_i}\}$, where $r_i$ is the number of frames and $\mathbf{I}_{ik} \in \mathbb{R}^{3qw}$ is the vectorized $k$th color image of $q \times w$ RGB pixels. We now need to describe $V_i$ as the sample function $\mathbf{f}_i(t)$ defined above, Figure 1.

To do this, we first identify a set of physical facial landmarks on the face and obtain the local regions using the algorithm of [25]. Formally, we define these landmark points in vector form as $\mathbf{s}_{ik} = (\mathbf{s}_{ik1}, \ldots, \mathbf{s}_{ik66})$, where $i$ is the sample video index, $k$ the frame number, and $\mathbf{s}_{ikl} \in \mathbb{R}^2$ are the 2D image coordinates of the $l$th landmark, $l = 1, \ldots, 66$, Figure 2.

Next, let $D_{ik} = \{\mathbf{d}_{i1k}, \ldots, \mathbf{d}_{iPk}\}$ be the set of $P = 107$ image patches $\mathbf{d}_{ijk}$ obtained with the Delaunay triangulation shown in Figure 2 (left image), where $\mathbf{d}_{ijk} \in \mathbb{R}^{3q_{ij}}$ is the vector describing the $j$th triangular local region of $q_{ij}$ RGB pixels and, as above, $i$ specifies the sample video number ($i = 1, \ldots, n$) and $k$ the frame ($k = 1, \ldots, r_i$).

Note that the size (i.e., number of pixels, $q_{ij}$) of these local (triangular) regions varies not only across individuals but also within a video sequence of the same person. This is a result of the movement of the facial landmark points, a movement necessary to produce a facial expression, and is evident in the images shown in Figure 1. Hence, we need to define a feature space that is invariant to the number of pixels in each of these local regions. We do this by computing statistics on the color of the pixels in each local region as follows.

We compute the first and second (central) moments of the color of each local region,

$$\boldsymbol{\mu}_{ijk} = q_{ij}^{-1} \sum_{p=1}^{q_{ij}} \mathbf{d}_{ijkp}, \qquad \boldsymbol{\sigma}_{ijk} = q_{ij}^{-1} \sum_{p=1}^{q_{ij}} \left( \mathbf{d}_{ijkp} - \boldsymbol{\mu}_{ijk} \right)^2, \tag{1}$$

with $\mathbf{d}_{ijk} = (\mathbf{d}_{ijk1}^T, \ldots, \mathbf{d}_{ijkq_{ij}}^T)^T$, $\mathbf{d}_{ijkp} \in \mathbb{R}^3$ the RGB values of the $p$th pixel, and $\boldsymbol{\mu}_{ijk}, \boldsymbol{\sigma}_{ijk} \in \mathbb{R}^3$. The elements of $\boldsymbol{\mu}_{ijk}$ and $\boldsymbol{\sigma}_{ijk}$ are the means and standard deviations of each individual color channel. We could compute additional moments, but this did not result in better classification accuracies in the experiments described below.

We can now construct the color feature vector of each local patch,

$$\mathbf{x}_{ij} = \left( \boldsymbol{\mu}_{ij1}^T, \ldots, \boldsymbol{\mu}_{ijr_i}^T, \boldsymbol{\sigma}_{ij1}^T, \ldots, \boldsymbol{\sigma}_{ijr_i}^T \right)^T \tag{2}$$

where, recall, i is the sample video index (Vi), j the local patch number and ri the number of frames in this video sequence.
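
As an illustration only, the following Python/NumPy sketch shows one way the per-patch moments of (1) and the feature vector of (2) could be computed; the function names and data layout are our own assumptions, not part of the original implementation.

```python
import numpy as np

def patch_color_moments(patch_pixels):
    """First and second central color moments of Eq. (1) for one patch in one
    frame. patch_pixels: array of shape (q_ij, 3), the RGB values of the
    q_ij pixels of the triangular region."""
    mu = patch_pixels.mean(axis=0)                    # mean per channel
    sigma = ((patch_pixels - mu) ** 2).mean(axis=0)   # second central moment per channel
    return mu, sigma

def patch_feature_vector(patch_over_frames):
    """Color feature vector x_ij of Eq. (2) for one patch of one video.
    patch_over_frames: list of length r_i; element k is the (q_ijk, 3) array
    of RGB pixels of patch j in frame k (the pixel count may vary per frame)."""
    moments = [patch_color_moments(p) for p in patch_over_frames]
    mus = np.concatenate([m for m, _ in moments])
    sigmas = np.concatenate([s for _, s in moments])
    return np.concatenate([mus, sigmas])              # length 6 * r_i
```

Stacking the 107 patch vectors of a video then gives the global color representation used in the remainder of the paper.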

This feature representation defines the contribution of color in patch $j$. One can also include other proven features, e.g., filter responses or shape features, to increase the richness of this representation. If these previously used features do not yield results superior to those of the representation in (2), then color must provide discriminant information beyond what has already been tried. In the experimental results, we show that these color features do indeed yield superior results to those of the state of the art. This demonstrates that color does provide supplementary discriminant features.

2.2. Invariant functional representation of color

We now derive an approach to define the above computed color information of equation (2) as a function invariant to time, i.e., our functional representation needs to be consistent regardless of where in the video sequence an AU becomes active.

This problem is illustrated in Figure 3. As made clear in this figure, we have the color function f (.) that defines color variations of a video sequence V, and a template function fT(.) that models the color changes associated with the activation of an AU (i.e., from AU inactive to active). Our goal is to determine if fT(.) is in f(.).

Fig. 3.

The top left plot shows the function f(.) of a video V. The bottom left plot is a template function fT(.) representing the color changes observed when people activate AU 1. Identifying this template fT(.) in f(.) requires testing all possible locations in time. This matching process is computationally expensive. The Gabor transform solves this complexity issue by identifying the location where the template function matches the color function without resorting to a sliding-window approach (right-most plot).

This problem can be readily solved by placing the template function fT(.) at each possible location in the time domain of f(.). This is typically called a sliding-window approach, because it involves sliding the window left and right until all possible positions of fT(.) have been checked. Unfortunately, this is extremely time consuming.

To solve the problem of computational complexity defined in the preceding paragraph, we derive a matching method using the Gabor transform instead. The Gabor transform is specifically designed to determine the frequency and phase content of a local section of a function. This allows us to derive an algorithm to find the matching of fT(.) in f(.) without having to resort to a sliding-window search. Let us define this process formally.

Without loss of generality let f(t) be a function describing one of our color descriptors, e.g., the mean of the red channel in the jth triangle of sample video i. Then, the Gabor transform of this function is,

$$G(t, \nu) = \int f(\tau)\, g(\tau - t)\, e^{-2\pi J \nu \tau}\, d\tau, \tag{3}$$

where $g(t)$ is a concave function [34] and $J = \sqrt{-1}$. Herein, we use the pulse function,

$$g(t) = \begin{cases} 1, & 0 \le t \le L \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where L is a fixed time length.

Using (4) in (3) yields

$$G(t, \nu) = \int_{t-L}^{t} f(\tau)\, e^{-2\pi J \nu \tau}\, d\tau = e^{-2\pi J \nu (t - L)} \int_{0}^{L} f(\tau + t - L)\, e^{-2\pi J \nu \tau}\, d\tau. \tag{5}$$

Note that (5) is the definition of a functional inner product in the span [0, L] and, thus, G(.,.) can also be written as follows,

$$G(t, \nu) = e^{-2\pi J \nu (t - L)} \left\langle f(\tau + t - L),\, e^{2\pi J \nu \tau} \right\rangle, \tag{6}$$

where 〈.,.〉 is the functional inner product. It is important to point out that our definition of the Gabor transform in (6) is both continuous in time and frequency, in the noise-free case.

To compute the color descriptor of the $i_1$th video, $f_{i_1}(t)$, we define all functions in a color space spanned by a set of $b$ basis functions $\boldsymbol{\phi}(t) = \{\phi_0(t), \ldots, \phi_{b-1}(t)\}$, with $f_{i_1}(t) = \sum_{z=0}^{b-1} c_{i_1 z}\, \phi_z(t)$ and $\mathbf{c}_{i_1} = (c_{i_1 0}, \ldots, c_{i_1 b-1})^T$ the vector of coefficients. This allows us to compute the functional inner product of two color descriptors as,

$$\left\langle f_{i_1}(t), f_{i_2}(t) \right\rangle = \sum_{r, q} \int_{0}^{L} c_{i_1 r}\, \phi_r(t)\, c_{i_2 q}\, \phi_q(t)\, dt = \mathbf{c}_{i_1}^T \boldsymbol{\Phi}\, \mathbf{c}_{i_2}, \tag{7}$$

where $\boldsymbol{\Phi}$ is a $b \times b$ matrix with elements $\Phi_{ij} = \langle \phi_i(t), \phi_j(t) \rangle$.

Our model assumes that the statistical color properties change smoothly over time and that the effect of a muscle activation has a maximum time span of $L$ seconds. The basis functions that fit this description are the first several components of the real part of the Fourier series, i.e., the normalized cosine basis.

Let the cosine bases be $\psi_z(t) = \cos(2\pi z t)$, $z = 0, \ldots, b-1$. The corresponding normalized bases are

$$\hat{\psi}_z(t) = \frac{\psi_z(t)}{\sqrt{\left\langle \psi_z(t), \psi_z(t) \right\rangle}}. \tag{8}$$

We use this normalized basis set because it allows us to have $\boldsymbol{\Phi} = \mathrm{Id}_b$, where $\mathrm{Id}_b$ denotes the $b \times b$ identity matrix, rather than an arbitrary positive definite matrix.

Importantly, the above derivations with the cosine bases make the frequency space implicitly discrete. This allows us to write the Gabor transform $\tilde{G}(\cdot,\cdot)$ of the color functions given in (6) as

$$\tilde{G}(t, z) = \left\langle \tilde{f}_{i_1}(t), \hat{\psi}_z(t) \right\rangle = c_{i_1 z}, \qquad z = 0, \ldots, b - 1, \tag{9}$$

where $\tilde{f}_{i_1}(t)$ is our computed function $f_{i_1}(t)$ in the interval $[t - L, t]$ and $c_{i_1 z}$ is the $z$th coefficient.

The number of cosine basis functions $b$ is determined by performing a grid search between a minimum of 5 and a maximum of 20 basis functions. We pick the $b$ that yields the best performance (as measured in Section 5). It is crucial to note that, since the above-derived approach does not include the time domain, we can always find these coefficients. This allows us to solve the matching of functions without resorting to the use of sliding windows.
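
The following sketch (Python/NumPy; all names are ours) illustrates the computation of (8)-(9): the windowed color descriptor, sampled at the frames falling in $[t - L, t]$, is projected onto the sampled and normalized cosine bases, which approximates the functional inner product by a discrete sum.

```python
import numpy as np

def normalized_cosine_basis(b, num_samples):
    """Sampled cosine bases psi_z(t) = cos(2*pi*z*t) on t in [0, 1],
    normalized as in Eq. (8) so each basis has unit (discrete) norm."""
    t = np.linspace(0.0, 1.0, num_samples)
    psi = np.cos(2.0 * np.pi * np.arange(b)[:, None] * t[None, :])   # (b, T)
    return psi / np.sqrt((psi ** 2).sum(axis=1, keepdims=True))

def gabor_coefficients(f_window, b):
    """Coefficients c_z of Eq. (9) for one color descriptor.
    f_window: array of shape (T,), the descriptor sampled over the last
    L seconds (the T frames of the window)."""
    basis = normalized_cosine_basis(b, f_window.shape[0])
    return basis @ f_window                                           # shape (b,)
```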

In the next section we derive a functional classifier that exploits the advantages of this functional representation.

3. Functional Classifier of Action Units

The key to our algorithm is to use the Gabor transform derived above to define a feature space invariant to the timing and duration of an AU. In the resulting space, we can employ any linear or non-linear classifier. Here, we report results on Support Vector Machines (SVM) and a Deep multilayer perceptron Network (DN).

3.1. Functional color space

As stated earlier, our feature representation is the collection of functions describing the mean and standard deviation of color information from distinct local patches, which requires simultaneous modeling of multiple functions. This is readily achieved in our formulation as follows.

We define a multidimensional function $\boldsymbol{\Gamma}_i(t) = (\gamma_{i1}(t), \ldots, \gamma_{ig}(t))^T$, with each function $\gamma_{ie}(t)$ the mean or standard deviation of a color channel in a given patch. Using the basis expansion approach described in Section 2.2, each $\gamma_{ie}(t)$ is defined by a set of coefficients $\mathbf{c}_i^e$ and, thus, $\boldsymbol{\Gamma}_i(t)$ is given by:

$$\mathbf{c}_i^T = \left[ (\mathbf{c}_i^1)^T, \ldots, (\mathbf{c}_i^g)^T \right]. \tag{10}$$

Using this notation, we can redefine the inner product for multidimensional functions. With our normalized Fourier cosine bases we get,

$$\left\langle \boldsymbol{\Gamma}_i(t), \boldsymbol{\Gamma}_j(t) \right\rangle = \sum_{e=1}^{g} \left\langle \gamma_{ie}(t), \gamma_{je}(t) \right\rangle = \sum_{e=1}^{g} (\mathbf{c}_i^e)^T \mathbf{c}_j^e = \mathbf{c}_i^T \mathbf{c}_j. \tag{11}$$
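
A short illustration of (10)-(11), with hypothetical dimensions of our own choosing: stacking the per-function coefficient vectors reduces the functional inner product of two multidimensional color functions to an ordinary dot product (with $b = 5$, the stacked vector has $642 \times 5 = 3{,}210$ entries, which coincides with the first-layer input size listed in Table 1).

```python
import numpy as np

g, b = 642, 5   # assumed: 107 patches x 3 channels x 2 moments, b coefficients each
rng = np.random.default_rng(0)
coeffs_i = [rng.standard_normal(b) for _ in range(g)]   # c_i^1, ..., c_i^g
coeffs_j = [rng.standard_normal(b) for _ in range(g)]

c_i = np.concatenate(coeffs_i)   # Eq. (10): stacked coefficient vector
c_j = np.concatenate(coeffs_j)

# Eq. (11): the sum of per-function inner products equals one dot product.
assert np.isclose(c_i @ c_j, sum(a @ bb for a, bb in zip(coeffs_i, coeffs_j)))
```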

We use a training set of video sequences to optimize each classifier. It is important to note that our approach is invariant to the length (i.e., number of frames) of a video, Figure 4. Hence, we do not require any alignment or cropping of the videos in our training or testing sets.

Fig. 4.

Top: A schematic representation of positive samples (blue squares) and negative samples (orange triangles). Positive feature vectors correspond to videos with the activation of a specific AU. Negative sample videos do not have that AU present. Note that the sample videos need not be of the same length. Bottom: An example of a color functional space obtained with an SVM classifier for video sequences of facial expressions with AU 12 active/inactive.

The approach derived above can be readily extended to identify AU intensity. This is done using a multi-class classifier. In our experimental results, we trained our AU classifiers to detect each of the five intensities, a, b, c, d and e [2], plus AU inactive (not present), for a total of six classes.

Testing in videos is directly given by the equations derived above. But we can also use these learned functions to identify AUs in still images. The algorithm used to achieve this is presented in Section 4.

3.2. Support Vector Machines

The training set is $\{(\gamma_1(t), y_1), \ldots, (\gamma_n(t), y_n)\}$, where $\gamma_i(t) \in H^v$, $H^v$ is a Hilbert space of continuous functions with bounded derivatives up to order $v$, and $y_i \in \{-1, 1\}$ are their class labels, with $+1$ indicating that the AU is active and $-1$ inactive.

When the samples of distinct classes are linearly separable, the function w(t) that maximizes class separability is given by

$$\begin{aligned} J(w(t), v, \boldsymbol{\xi}) = \min_{w(t), v, \boldsymbol{\xi}} \; & \left\{ \frac{1}{2} \left\langle w(t), w(t) \right\rangle + C \sum_{i=1}^{n} \xi_i \right\} \\ \text{subject to} \quad & y_i \left( \left\langle w(t), \gamma_i(t) \right\rangle - v \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \end{aligned} \tag{12}$$

where v is the bias and, as above, 〈γi(t), γj(t)〉 = ∫ γi(t) γj(t) dt denotes the functional inner product, Figure 4, ξ = (ξ1, …, ξn)T are the slack variables, and C > 0 is a penalty value found using cross-validation [35].

Modeling $\boldsymbol{\Gamma}_i$ with the normalized cosine coefficients derived above and using (11) transforms (12) into the following criterion,

$$J(\mathbf{w}, v, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \min_{\mathbf{w}, v, \boldsymbol{\xi}} \left\{ \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left( y_i \left( \mathbf{w}^T \mathbf{c}_i - v \right) - 1 + \xi_i \right) - \sum_{i=1}^{n} \theta_i \xi_i \right\}, \tag{13}$$

where $\alpha_i \ge 0$ and $\theta_i \ge 0$ are the Lagrange multipliers and, as above, $C > 0$ is a penalty value found using cross-validation.

The bottom plot in Figure 4 shows the functional feature space of an actual AU classification – AU 12. Since one can only plot two-dimensional feature spaces, we projected the original color space onto the first two principal components of the data. This was done with Principal Components Analysis (PCA). The two resulting dimensions are labeled $\phi_{\mathrm{PCA}}^k$, $k = 1, 2$.

Once trained, this system can detect AU activation in video in real time, > 30 frames/second/CPU thread.
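
Since (13) is a standard linear SVM on the stacked coefficient vectors $\mathbf{c}_i$, an off-the-shelf implementation can be used. The sketch below (scikit-learn; the data are random stand-ins and the dimensions are our assumptions) shows the training step with the penalty $C$ chosen by cross-validation, as described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Stand-in data: one stacked coefficient vector per training video and a
# binary label (+1: AU active somewhere in the video, -1: AU inactive).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3210))        # assumed dimensionality
y = rng.choice([-1, 1], size=200)

# Cross-validate the penalty C of Eqs. (12)-(13).
clf = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
clf.fit(X, y)
decision_values = clf.decision_function(X)  # signed distances to the hyperplane
```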

3.2.1. Deep network approach using multilayer perceptron

In the previous section, we used a SVM to define a linear classifier in Gabor-transform space. This formulation yields a linear classifier on the feature space of the ci. We will now use a deep network to identify non-linear classifiers in this color feature space.

We train a multilayer perceptron network (MPN) using the coefficients $\mathbf{c}_i$. This deep neural network is composed of 5 blocks of fully connected layers with batch normalization [36] and rectified linear units (ReLU) [37]. To effectively train the network, we used data augmentation by super-sampling the minority class (active), class weights and weight decay. A summary of the proposed architecture for each AU is given in Table 1.

Table 1.

Description of the deep network architecture used in this paper.

Layer type                                      Input size
Fully connected + batch normalization + ReLU    3,210
Fully connected + batch normalization + ReLU    1,056
Fully connected + batch normalization + ReLU    528
Fully connected + batch normalization + ReLU    132
Fully connected + batch normalization + ReLU    64
Fully connected + sigmoid                       64

We train this neural network using gradient descent. The resulting algorithm works in real time, > 30 frames/second/CPU thread.
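
A sketch of the architecture of Table 1 in PyTorch is given below. Only the input size of each layer is reported in the paper; the output widths (taken here as the next layer's input size) and the single sigmoid output unit are our assumptions.

```python
import torch
import torch.nn as nn

def block(n_in, n_out):
    """One row of Table 1: fully connected + batch normalization + ReLU."""
    return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU())

model = nn.Sequential(
    block(3210, 1056),
    block(1056, 528),
    block(528, 132),
    block(132, 64),
    block(64, 64),
    nn.Linear(64, 1),   # final fully connected + sigmoid layer
    nn.Sigmoid(),
)

x = torch.randn(8, 3210)   # a mini-batch of stacked coefficient vectors
p_active = model(x)        # probability that the AU is active, shape (8, 1)
```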

4. AU detection in still images

People generally first observe facial expressions dynamically; nonetheless, we later have no problem recognizing facial expressions in still images. We therefore derive an approach that allows our algorithm to recognize AUs in still images [38].

To be able to apply our algorithm to still images, we need a procedure that specifies the color functions $\mathbf{f}_i$ of an image $\mathbf{I}_i$. That is, we need to define the mapping $h(\mathbf{I}_i) = \mathbf{f}_i$, Figure 5. Recall that $\mathbf{f}_i$ is defined by its coefficients $\mathbf{c}_i^T$. These coefficients can be learned from training data using non-linear regression.

Fig. 5.

Each video segment $W_{ik}$ (shown on the bottom left) yields a feature representation $\mathbf{f}_{ijk}$ (top left), $j = 1, \ldots, 107$. We regress a function $h(\cdot)$ that maps the last image of $W_{ik}$ to $\mathbf{f}_{ijk}$, $j = 1, \ldots, 107$ (right image).

We start with a training set of $m$ videos, $\{V_1, \ldots, V_m\}$. As above, $V_i = \{\mathbf{I}_{i1}, \ldots, \mathbf{I}_{ir_i}\}$. We consider every subset of consecutive frames of length $L$ (with $L \le r_i$), i.e., $W_{i1} = \{\mathbf{I}_{i1}, \ldots, \mathbf{I}_{iL}\}$, $W_{i2} = \{\mathbf{I}_{i2}, \ldots, \mathbf{I}_{i(L+1)}\}$, $\ldots$, $W_{i(r_i-L)} = \{\mathbf{I}_{i(r_i-L)}, \ldots, \mathbf{I}_{ir_i}\}$. This allows us to compute the color representations of all $W_{ik}$ as described in Section 2.1. This yields $\mathbf{x}_{ik} = (\mathbf{x}_{i1k}^T, \ldots, \mathbf{x}_{i107k}^T)^T$ for each $W_{ik}$, $k = 1, \ldots, r_i - L$. Following (2),

$$\mathbf{x}_{ijk} = \left( \boldsymbol{\mu}_{ij1}^T, \ldots, \boldsymbol{\mu}_{ijL}^T, \boldsymbol{\sigma}_{ij1}^T, \ldots, \boldsymbol{\sigma}_{ijL}^T \right)^T, \tag{14}$$

where i and k specify the video Wik, and j the patch, j = 1, …, 107 (Figure 2).

Next, we compute the functional color representations $\mathbf{f}_{ijk}$ of each $W_{ik}$ for each of the patches, $j = 1, \ldots, 107$. This is done using the approach detailed in Section 2.2. That yields $\mathbf{f}_{ijk} = (c_{ijk1}, \ldots, c_{ijkQ})^T$, where $c_{ijkq}$ is the $q$th coefficient of the $j$th patch in video segment $W_{ik}$.

Our training set is then given by the pairs {xijk, fijk}. This training set is used to regress the function fijk = h(xijk), Figure 5.

Specifically, let $\hat{\mathbf{I}}$ be a test image and $\hat{\mathbf{x}}_j$ its color representation in patch $j$. We use Kernel Ridge Regression to estimate the $q$th coefficient of this test image as follows,

$$\hat{c}_{jq} = \mathbf{C}^T \left( \mathbf{K} + \lambda\, \mathrm{Id} \right)^{-1} \boldsymbol{\kappa}(\hat{\mathbf{x}}_j), \tag{15}$$

where $\hat{\mathbf{x}}_j$ is the color feature vector of the $j$th patch of the test image, $\mathbf{C} = (c_{1j1q}, \ldots, c_{mj(r_m - L)q})^T$ is the vector of the $q$th coefficients of the $j$th patch in all training samples, $\mathbf{K}$ is the kernel matrix with entries $k(\mathbf{x}_{ijk}, \mathbf{x}_{\hat{i}j\hat{k}})$, $i, \hat{i} = 1, \ldots, m$ and $k, \hat{k} = 1, \ldots, r_i - L$, and $\boldsymbol{\kappa}(\hat{\mathbf{x}}_j)$ is the vector of kernel values between $\hat{\mathbf{x}}_j$ and the training samples. We use the radial basis function kernel, $k(\mathbf{a}, \mathbf{b}; \eta) = \exp(-\eta \|\mathbf{a} - \mathbf{b}\|^2)$.

The parameters η and λ are selected to maximize accuracy and minimize model complexity. This is the same as optimizing the bias-variance tradeoff. We use the solution to the bias-variance problem presented in [39].
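
A minimal sketch of the kernel ridge regressor of (15), written in NumPy under our own naming; in practice one such regressor is fit per patch and per coefficient (or, equivalently, with a matrix of targets, as below). The selection of $\eta$ and $\lambda$ with the bias-variance criterion of [39] is outside the scope of this sketch.

```python
import numpy as np

def rbf_kernel(A, B, eta):
    """k(a, b; eta) = exp(-eta * ||a - b||^2) for every pair of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-eta * d2)

def fit_kernel_ridge(X_train, C_train, eta, lam):
    """Precompute (K + lambda * Id)^{-1} C of Eq. (15).
    X_train: (N, d) color feature vectors x_ijk of the training windows.
    C_train: (N, Q) target functional coefficients of the same windows."""
    K = rbf_kernel(X_train, X_train, eta)
    return np.linalg.solve(K + lam * np.eye(X_train.shape[0]), C_train)

def predict_kernel_ridge(alpha, X_train, x_test, eta):
    """c_hat = C^T (K + lambda * Id)^{-1} kappa(x_test), Eq. (15)."""
    kappa = rbf_kernel(x_test[None, :], X_train, eta)   # (1, N)
    return (kappa @ alpha).ravel()                      # (Q,) estimated coefficients
```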

We are now ready to use the regressor on previously unseen test images. If $\hat{\mathbf{I}}$ is a previously unseen test image, its functional representation is readily obtained as $\hat{\mathbf{c}} = h(\hat{\mathbf{x}})$, with $\hat{\mathbf{c}} = (\hat{c}_{11}, \ldots, \hat{c}_{107Q})^T$. This functional color representation can be directly used in the functional classifier derived above.

5. Experimental Results

The goal of this paper is to introduce a color feature space that can be efficiently and robustly used for the recognition of AUs. This section details experimental results of the theoretical work introduced above. We show that the proposed algorithm, which uses color features, performs better than state-of-the-art algorithms.

5.1. Comparative results

We provide comparative results against state-of-the-art algorithms on five publicly available datasets: Denver Intensity of Spontaneous Facial Action (DISFA) [40], Shoulder Pain (SP) [41], Binghamton-Pittsburgh 4D Spontaneous Expression Database (BP4D) [42], Affectiva-MIT Facial Expression Dataset (AM-FED) [43], and Compound Facial Expressions of Emotion (CFEE) [19]. AM-FED is a database of videos of facial expressions “in the wild,” while DISFA, SP and BP4D are videos of spontaneous expressions collected in the lab. CFEE is a database of still images rather than video sequences.

In each database, we use subject-independent 10-fold cross-validation, where all frames from a few subjects are held out from the training set and used only for testing. This ensures that subject-specific patterns cannot be learned by the classifier. The results of the proposed algorithm are compared to the available ground truth (manually annotated AUs). To compare our results with state-of-the-art algorithms, we compute the F1 score, defined as F1 = 2(Precision · Recall)/(Precision + Recall), where Precision (also called positive predictive value) is the fraction of the automatic annotations of AU i that are correct (i.e., the number of correct detections of AU i over the number of images with detected AU i), and Recall (also called sensitivity) is the number of correct detections of AU i over the actual number of images with AU i.
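
For reference, the F1 computation used throughout this section reduces to the following few lines (a generic sketch, not the evaluation code of the paper):

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * Precision * Recall / (Precision + Recall)."""
    precision = tp / (tp + fp)   # correct detections / all detections
    recall = tp / (tp + fn)      # correct detections / all true occurrences
    return 2 * precision * recall / (precision + recall)

# Example: 80 correct detections of an AU, 20 false alarms, 10 misses.
print(round(f1_score(80, 20, 10), 3))   # 0.842
```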

Comparative results on the first four datasets are given in Figure 6. Our results are compared against the methods of Emotionet [8], Hierarchical-Restricted Boltzmann Machine (HRBM) [44], Transferring Latent Task Structures (TLTS) [45], lp-norm [46], Discriminant Label Embedding (DLE) [47], Cross-dataset Learning Support Vector Machine (CLM-SVM) [48], and Multi-Conditional Latent Variable Model (MC-LVM) [49]. We also make comparisons with our system annotating frames of the video sequences as still images, using the method described in section 4, and a version of our system that only computes these features in a single intensity channel (i.e., uses only grey-scale information). These results are labelled “This paper (still images)” and “This paper (grey-scale)”.

Fig. 6.

F1 scores for the proposed approach and a variety of published algorithms. Note that not all published methods provide results for all AUs, which is why some columns in the plots are empty. Average (Avg) is computed using the results of the AUs available for each algorithm. First plot: BP4D. Second plot: DISFA. Third plot: SP. Fourth plot: AM-FED.

As we can see in this figure, the color features derived herein achieve results superior to the other feature representations previously used in the literature. These results also demonstrate that the proposed Gabor transform approach and functional classifier are efficient algorithms for the recognition of AUs in video. Further, it is apparent that both estimating the functional representation using the Gabor transform and using color-based features are crucial to this system, as classifying using only still images or only grey-scale information yields inferior results.

It is important to note that the results of the non-linear classifier (given by a deep network) are not significantly superior to those of a simple linear classifier. This is important because it further demonstrates that the color features used herein efficiently separate the feature vectors of the different classes (i.e., AU active vs. inactive). This was previously illustrated in the bottom plot of Figure 4.

It is also important to note that our algorithm works faster than real-time, >30 frames/second/CPU thread.

5.2. ROC curves

To further study the results of the proposed feature representation, we provide ROC (Receiver Operating Characteristic) curves in Figures 7–10. ROC plots display the true positive rate against the false positive rate. The true positive rate is the sensitivity of the classifier. The false positive rate is the number of negative test samples classified as positive over the total number of false positives plus true negatives.

Fig. 7.

ROC curves of the results on the BP4D dataset. The left plot shows the ROC of all AUs combined, illustrating the small variation between different AUs. The other plots show the ROC of each AU. The areas under the curve for these AUs are given in Table 3.

Fig. 10.

ROC curves for the AM-FED dataset.

ROC curves are computed as follows. Our derivations assume equal priors; that is, the probability of an AU being active is the same as that of it being inactive. We can, however, vary the value of these priors. Reducing the prior of AU active decreases the false positive rate, i.e., the classifier is less likely to misclassify a face that does not have this AU active. Increasing the prior of AU active increases the true positive rate. This simple extension of our algorithm allows us to compute ROC curves.
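
Varying the class priors is equivalent to sweeping a bias on the classifier's decision value. The sketch below (our own, generic code) shows how the (false positive rate, true positive rate) pairs of Figures 7–10 could be produced from such a sweep.

```python
import numpy as np

def roc_points(decision_values, labels, num_thresholds=100):
    """ROC points from a sweep over the decision threshold (equivalently,
    over the class priors). labels are in {-1, +1}."""
    thresholds = np.linspace(decision_values.min(), decision_values.max(),
                             num_thresholds)
    pos, neg = labels == 1, labels == -1
    curve = []
    for th in thresholds:
        pred_pos = decision_values >= th
        tpr = (pred_pos & pos).sum() / pos.sum()   # true positive rate
        fpr = (pred_pos & neg).sum() / neg.sum()   # false positive rate
        curve.append((fpr, tpr))
    return np.array(curve)
```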

The plots in Figures 7–10 allow us to compute the area under the curve of our algorithm’s results. We do this for all four datasets – BP4D, SP, DISFA and AM-FED. The results are in Table 3.

Table 3.

Area under the curve of the ROC curves shown in Figures 7–10.

AU 1 2 4 5 6 7 9 10 11 12 14
BP4D .9656 .9736 .9808 .973 .9656 .9813 .9636 .9983 .9842 .9847 .9447
SP .9935 .9931 .993 .9911 .9961 .9936
DISFA .9862 .9921 .9886 .9891 .9914 .9959 .9866
AM-FED .9913 .9858 .9927 .9873 .9873 .9947
AU 14 15 16 17 20 23 25 26 28 32
BP4D .9447 .9535 .9802 .9405 .9753 .9446 .9914 .9973
SP .9856 .9861 .9936
DISFA .9895 .985 .9884 .9956 .9859
AM-FED .9845 .9887 .9923

5.3. Invariance to skin color

One may wonder if the results reported above vary as a function of skin color/tone. Close analysis of our results shows that this is not the case, i.e., our feature representation is invariant to skin tone.

To demonstrate this, we divided our training samples into four groups as a function of skin color – from lighter to darker skin. We call these skin tonalities: levels 1, 2, 3 and 4. Level 1 represents the lightest tone and level 4 the darkest. The 10-fold cross-validation results using each of these four groups are shown in Figure 11.

Fig. 11.

Average and standard error bars of the F1-scores in each of the four skin tone levels across datasets. These results show no statistical difference between these four groups.

A t-test showed no statistically significant difference across skin tones in the results of Figure 11 (p > .1 in DISFA, p > .8 in BP4D, p > .3 in SP); that is, we could not reject the null hypothesis that performance is the same across the four groups.

Figure 12 shows qualitative results on two videos of people of different ethnicity and skin color.

Fig. 12.

Qualitative results. These correspond to the automatic annotation of AUs and their intensities. Intensities are in parentheses.

5.4. AU recognition in still images

In Section 4, we derived an approach that allows us to use our algorithm to detect the presence of AUs in still images, even though the training was done using video sequences.

To achieve this, we trained the proposed regressor h(.) using the three datasets of videos used in the preceding sections. Then, we tested the trained system on the still images of the CFEE dataset of [19]. This means that the functions of every test image $\hat{\mathbf{I}}$ are estimated as $\hat{\mathbf{f}} = h(\hat{\mathbf{x}})$.

Figure 13 provides comparative results against the EmotioNet algorithm of [25]. To our knowledge, this is the only other algorithm that has been applied to this dataset to date. The figure also provides comparative results of using only static color features for each image (i.e., without the proposed regressor).

Fig. 13.

Comparative F1 scores on CFEE. Our algorithm was trained using the videos of BP4D, DISFA and Shoulder Pain and tested on the still images of CFEE.

We see that the results of the proposed algorithm are superior to those of previous approaches for AUs 1, 2, 4, 25 and 26. It is also clear that the regressor is a crucial component of the algorithm, as the results are inferior without it.

5.5. AU intensities

The recognition of intensity of activation of each AU is of high importance in most applications. This section demonstrates that our approach achieves intensity estimation errors that are smaller than those given by state-of-the-art algorithms.

Mean Absolute Error (MAE) is used to calculate the accuracy of the estimated intensities of AU activation. To do this, each of the six levels of activation is given a numerical value. Specifically, AU not present (inactive) takes the value 0, intensity a the value 1, intensity b the value 2, intensity c the value 3, intensity d the value 4, and intensity e the value 5. The intensity estimate of the $a$th sample with AU $i$ active is $u_{ia}$. These estimates are compared to the ground truth $\hat{u}_{ia}$,

$$\mathrm{MAE}_i = \frac{1}{n_i} \sum_{a=1}^{n_i} \left| u_{ia} - \hat{u}_{ia} \right| \tag{16}$$

where ni is the number of samples with AU i active.
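
Equation (16) amounts to the following computation, with intensities coded 0 (inactive) through 5 (intensity e); the example values are illustrative only:

```python
import numpy as np

def mae_au(estimated, ground_truth):
    """Mean absolute error of Eq. (16) for one AU."""
    estimated = np.asarray(estimated, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.abs(estimated - ground_truth).mean()

print(mae_au([0, 2, 3, 5], [0, 1, 3, 4]))   # 0.5
```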

Figure 14 provides comparative results for the recognition of AU intensity. This plot shows the results of the algorithm derived in this paper and those of Multi-Kernel Support Vector Machine (MK-SVM) [50], Context Sensitive Dynamic Ordinal Regression (CS-DOR) [51], Rotation Invariant Feature Regression (RIFR) [52], and EmotioNet [25].

Fig. 14.

Mean Absolute Error (MAE) for the recognition of AU intensity for a variety of algorithms. Top plot: BP4D. Middle plot: DISFA. Bottom plot: Shoulder Pain.

6. Conclusions

The automatic recognition of facial action units and their intensities is a fundamental problem in computer vision with a large number of applications in the physical and biological sciences [3], [4], [38]. Recent computational models of the human visual system suggest that the recognition of facial expressions is based on the visual identification of these AUs, and a recent cognitive neuroscience experiment has identified a small brain region where the computations associated with this visual recognition likely take place [13].

Previous approaches to this automatic recognition have exploited shading, shape, motion and spatio-temporal features [3], [12], [15], [16], [18], [19], [20], [21], [53], [54]. Remarkably absent from this list of features is color.

Nonetheless, faces are colorful. To produce a facial expression, we need to move the facial muscles under our skin. These movements vary the brdf of the face and either increase or decrease the amount of blood or oxygenation in that local area. This yields clearly visible color changes that, to the authors’ knowledge, have not been exploited before.

The present work has derived the first comprehensive computer vision algorithm for the identification of AUs using color features.

We derived a functional representation of color and an innovative Gabor transform-based algorithm that are invariant to the timing and duration of AU activations. We also defined an approach that allows us to apply our trained functional color classifier to still test images. This was done by learning the mapping between still images and the functional representation of color in video. Finally, we showed how these color changes can also be used to detect the intensity of AU activation.

In summary, the present work reveals how facial color changes can be exploited to identify the presence of AUs in videos and still images. Skin color tone is shown to not have an effect on the efficacy of the derived algorithm.

Fig. 8.

ROC curves of our results on the Shoulder Pain dataset.

Fig. 9.

ROC curves of our results on the DISFA dataset.

Table 2.

Average F1 score for different classifiers

Dataset   DN - 5 layer   DN - 10 layer   SVM - RBF   SVM - polynomial   SVM - tanh
BP4D      0.912          0.911           0.859       0.853              0.847
DISFA     0.966          0.958           0.964       0.943              0.951
SP        0.928          0.927           0.925       0.928              0.925
AM-FED    0.844          0.851           0.822       0.819              0.821

Acknowledgments

This research was supported in part by the National Institutes of Health, grant R01-DC-014498, and the Human Frontier Science Program, grant RGP0036/2016. RS was partially supported by OSU’s Center for Cognitive and Brain Sciences summer fellowship. We thank the reviewers for constructive feedback.

Biographies


C. Fabian Benitez-Quiroz (S’02-M’03-S’06) received the B.S. degree in electrical engineering from Pontificia Universidad Javeriana, Cali, Colombia, and the M.S. degree in electrical engineering from the University of Puerto Rico, Mayaguez, Puerto Rico, in 2004 and 2008, and a Ph.D. in Electrical and Computer Engineering from The Ohio State University (OSU) in 2015. He is currently a Postdoctoral researcher in the Computational Biology and Cognitive Science Lab at OSU. His current research interests include the analysis of facial expressions in the wild, functional data analysis, deformable shape detection, face perception and deep learning.


Ramprakash Srinivasan received the B.S. degree with honors in Electrical and Electronics Engineering from Anna University, Chennai, India, in 2013. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at The Ohio State University. His research interests include computer vision, machine learning and cognitive science. He is a student member of IEEE.


Aleix M. Martinez is a Professor in the Department of Electrical and Computer Engineering, The Ohio State University (OSU), where he is the founder and director of the Computational Biology and Cognitive Science Lab. He is also affiliated with the Department of Biomedical Engineering and with the Center for Cognitive and Brain Sciences, where he is a member of the executive committee. He has served as an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Affective Computing, Image and Vision Computing, and Computer Vision and Image Understanding. He has been an area chair for many top conferences and was Program Chair for CVPR 2014. He is also a member of NIH’s Cognition and Perception study section. He is best known for being the first to define many problems and solutions in face recognition (e.g., recognition under occlusions, expression, imprecise landmark detection), discriminant analysis (e.g., Bayes optimal solutions, subclass approaches, optimal kernels), structure from motion (e.g., using kernel mappings to better model non-rigid deformations, noise invariance), and, most recently, demonstrating the existence of a much larger set of cross-cultural facial expressions of emotion than previously known (i.e., compound expressions of emotion) and the transmission of emotion through changes in facial color.

References

[1] Ekman P and Friesen WV, Facial Action Coding System. Consulting Psychologists Press, Stanford University, Palo Alto, 1977.
[2] Ekman P and Rosenberg EL, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS), 2nd Edition. Oxford University Press, 2015.
[3] Corneanu CA, Oliu M, Cohn JF, and Escalera S, “Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[4] Cohn JF and De la Torre F, “Automated face analysis for affective computing,” in The Oxford Handbook of Affective Computing, Calvo R and D’Mello S, Eds. Oxford University Press, USA, 2014, p. 131.
[5] Chang Y, Vieira M, Turk M, and Velho L, “Automatic 3d facial expression analysis in videos,” in International Workshop on Analysis and Modeling of Faces and Gestures. Springer, 2005, pp. 293–307.
[6] Huang X, Dhall A, Liu X, Zhao G, Shi J, Goecke R, and Pietikainen M, “Analyzing the affect of a group of people using multi-modal framework,” arXiv preprint arXiv:1610.03640, 2016.
[7] Bellocchi A, “Methods for sociological inquiry on emotion in educational settings,” Emotion Review, vol. 7, pp. 151–156, 2015.
[8] Benitez-Quiroz CF, Wilbur RB, and Martinez AM, “The not face: A grammaticalization of facial expressions of emotion,” Cognition, vol. 150, pp. 77–84, 2016.
[9] Todorov A, Olivola CY, Dotsch R, and Mende-Siedlecki P, “Social attributions from faces: Determinants, consequences, accuracy, and functional significance,” Psychology, vol. 66, no. 1, p. 519, 2015.
[10] Girard JM, Cohn JF, Jeni LA, Sayette MA, and De la Torre F, “Spontaneous facial expression in unscripted social interactions can be measured automatically,” Behavior Research Methods, vol. 47, no. 4, pp. 1136–1147, 2015.
[11] Cassidy S, Mitchell P, Chapman P, and Ropar D, “Processing of spontaneous emotional responses in adolescents and adults with autism spectrum disorders: Effect of stimulus type,” Autism Research, vol. 8, no. 5, pp. 534–544, 2015.
[12] Cohn JF, Kruez TS, Matthews I, Yang Y, Nguyen MH, Padilla MT, Zhou F, and De la Torre F, “Detecting depression from facial actions and vocal prosody,” in Affective Computing and Intelligent Interaction and Workshops (ACII 2009), 3rd International Conference on. IEEE, 2009, pp. 1–7.
[13] Srinivasan R, Golomb J, and Martinez AM, “A neural basis of facial action recognition in humans,” The Journal of Neuroscience, 2016.
[14] Skerry AE and Saxe R, “Neural representations of emotion are organized around abstract event features,” Current Biology, vol. 25, no. 15, pp. 1945–1954, 2015.
[15] Lyons M, Akamatsu S, Kamachi M, and Gyoba J, “Coding facial expressions with gabor wavelets,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 200–205.
[16] Tian Y-L, Kanade T, and Cohn JF, “Evaluation of gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 2002, pp. 229–234.
[17] Martinez AM and Du S, “A model of the perception of facial expressions of emotion by humans: Research overview and perspectives,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1589–1608, 2012.
[18] Jaiswal S and Valstar MF, “Deep learning the dynamic appearance and shape of facial action units,” in Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2016.
[19] Du S, Tao Y, and Martinez AM, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
[20] Tong Y, Liao W, and Ji Q, “Facial action unit recognition by exploiting their dynamic and semantic relationships,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 10, pp. 1683–1699, 2007.
[21] Chen S, Tian Y, Liu Q, and Metaxas DN, “Recognizing expressions from face and body gesture by temporal normalized motion and appearance features,” Image and Vision Computing, vol. 31, no. 2, pp. 175–185, 2013.
[22] Benitez-Quiroz CF, Srinivasan R, and Martinez AM, “Facial color is an efficient mechanism to visually transmit emotion,” Proceedings of the National Academy of Sciences, vol. 115, no. 14, pp. 3581–3586, 2018.
[23] Angelopoulo E, Molana R, and Daniilidis K, “Multispectral skin color modeling,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2001, pp. II-635.
[24] Changizi MA, Zhang Q, and Shimojo S, “Bare skin, blood and the evolution of primate colour vision,” Biology Letters, vol. 2, no. 2, pp. 217–221, 2006.
[25] Benitez-Quiroz CF, Srinivasan R, and Martinez AM, “Emotionet: An accurate, real-time algorithm for the automatic annotation of half a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[26] Benitez-Quiroz CF, Srinivasan R, Feng Q, Wang Y, and Martinez AM, “Emotionet challenge: Recognition of facial expressions of emotion in the wild,” arXiv preprint arXiv:1703.01210, 2017.
[27] Benitez-Quiroz CF, Wang Y, and Martinez AM, “Recognition of action units in the wild with deep nets and a new global-local loss,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 3990–3999.
[28] Corneanu CA, Madadi M, and Escalera S, “Deep structure inference network for facial action unit recognition,” arXiv preprint arXiv:1803.05873, 2018.
[29] Batista JC, Albiero V, Bellon OR, and Silva L, “Aumpnet: Simultaneous action units detection and intensity estimation on multipose facial images using a single convolutional neural network,” in Automatic Face & Gesture Recognition (FG 2017), 12th IEEE International Conference on. IEEE, 2017, pp. 866–871.
[30] Blanco S, D’Attellis C, Isaacson S, Rosso O, and Sirne R, “Time-frequency analysis of electroencephalogram series. II. Gabor and wavelet transforms,” Physical Review E, vol. 54, no. 6, p. 6661, 1996.
[31] Antonini M, Barlaud M, Mathieu P, and Daubechies I, “Image coding using wavelet transform,” IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 205–220, 1992.
[32] Niese R, Al-Hamadi A, Farag A, Neumann H, and Michaelis B, “Facial expression recognition based on geometric and optical flow features in colour image sequences,” IET Computer Vision, vol. 6, no. 2, pp. 79–89, 2012.
[33] Lajevardi SM and Wu HR, “Facial expression recognition in perceptual color space,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3721–3733, 2012.
[34] Carmona R, Hwang W-L, and Torresani B, Practical Time-Frequency Analysis: Gabor and Wavelet Transforms, with an Implementation in S. Academic Press, 1998, vol. 9.
[35] Vapnik V, The Nature of Statistical Learning Theory (2nd edition). Springer Science & Business Media, 2000.
[36] Ioffe S and Szegedy C, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, ser. JMLR Workshop and Conference Proceedings, Bach FR and Blei DM, Eds., vol. 37. JMLR.org, 2015, pp. 448–456.
[37] Nair V and Hinton GE, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[38] Martinez AM, “Computational models of face perception,” Current Directions in Psychological Science, 2017.
[39] You D, Benitez-Quiroz CF, and Martinez AM, “Multiobjective optimization for model selection in kernel methods in regression,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1879–1893, 2014.
[40] Mavadati SM, Mahoor MH, Bartlett K, Trinh P, and Cohn J, “Disfa: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, April 2013.
[41] Lucey P, Cohn JF, Prkachin KM, Solomon PE, and Matthews I, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” in Automatic Face & Gesture Recognition and Workshops (FG 2011), IEEE International Conference on. IEEE, 2011, pp. 57–64.
[42] Zhang X, Yin L, Cohn JF, Canavan S, Reale M, Horowitz A, Liu P, and Girard JM, “Bp4d-spontaneous: A high-resolution spontaneous 3d dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
[43] McDuff D, Kaliouby R, Senechal T, Amr M, Cohn J, and Picard R, “Affectiva-mit facial expression dataset (am-fed): Naturalistic and spontaneous facial expressions collected,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 881–888.
[44] Wang Z, Li Y, Wang S, and Ji Q, “Capturing global semantic relationships for facial action unit recognition,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 3304–3311.
[45] Almaev T, Martinez B, and Valstar M, “Learning to transfer: Transferring latent task structures and its application to person-specific facial action unit detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3774–3782.
[46] Zhang X, Mahoor MH, Mavadati SM, and Cohn JF, “A lp-norm mtmkl framework for simultaneous detection of multiple facial action units,” in Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on. IEEE, 2014, pp. 1104–1111.
[47] Yüce A, Gao H, and Thiran J-P, “Discriminant multi-label manifold embedding for facial action unit detection,” in Automatic Face and Gesture Recognition (FG), 11th IEEE International Conference and Workshops on, vol. 6. IEEE, 2015, pp. 1–6.
[48] Baltrušaitis T, Mahmoud M, and Robinson P, “Cross-dataset learning and person-specific normalisation for automatic action unit detection,” in Automatic Face and Gesture Recognition (FG), 11th IEEE International Conference and Workshops on, vol. 6. IEEE, 2015, pp. 1–6.
[49] Eleftheriadis S, Rudovic O, and Pantic M, “Multi-conditional latent variable model for joint facial action unit detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3792–3800.
[50] Nicolle J, Bailly K, and Chetouani M, “Facial action unit intensity prediction via hard multi-task metric learning for kernel regression,” in Automatic Face and Gesture Recognition (FG), 11th IEEE International Conference and Workshops on, vol. 6. IEEE, 2015, pp. 1–6.
[51] Rudovic O, Pavlovic V, and Pantic M, “Context-sensitive dynamic ordinal regression for intensity estimation of facial action units,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 5, pp. 944–958, 2015.
[52] Bingöl D, Celik T, Omlin CW, and Vadapalli HB, “Facial action unit intensity estimation using rotation invariant features and regression analysis,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 1381–1385.
[53] Martinez AM, “Matching expression variant faces,” Vision Research, vol. 43, no. 9, pp. 1047–1060, 2003.
[54] Zen G, Porzi L, Sangineto E, Ricci E, and Sebe N, “Learning personalized models for facial expression analysis and gesture recognition,” IEEE Transactions on Multimedia, vol. 18, no. 4, pp. 775–788, 2016.
