Abstract
Automatic facial action unit (AU) and expression detection from videos is a long-standing problem. The problem is challenging in part because classifiers must generalize to previously unknown subjects that differ markedly in behavior and facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) from those on which the classifiers are trained. While some progress has been achieved through improvements in choices of features and classifiers, the challenge occasioned by individual differences among people remains. Person-specific classifiers would be a possible solution, but sufficient training data for them typically is unavailable. This paper addresses the problem of how to personalize a generic classifier without additional labels from the test subject. We propose a transductive learning method, which we refer to as a Selective Transfer Machine (STM), that personalizes a generic classifier by attenuating person-specific mismatches. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. We compared STM to both generic classifiers and cross-domain learning methods on four benchmarks: CK+ [44], GEMEP-FERA [67], RU-FACS [4] and GFT [57]. STM outperformed generic classifiers on all of them.
Index Terms: Facial expression analysis, personalization, domain adaptation, transfer learning, Support Vector Machine (SVM)
1 Introduction
Automatic facial AU detection confronts a number of challenges. These include changes in pose, scale, illumination, occlusion, and individual differences in face shape, texture, and behavior. Face shape and texture differ between and within sexes; they vary with ethnic and racial background, age or developmental level, and exposure to the elements. Facial actions also differ in the base rates with which they occur. For example, some people smile broadly and frequently; others rarely or only with smile controls, which counteract the upward pull of the zygomatic major on the lip corners. These and other sources of variation represent considerable challenges for computer vision. Added to these is the challenge of automatically detecting facial actions whose manual coding requires significant training and expertise [67].
To address these challenges, previous work has focused on identifying optimal feature representations and classifiers. Interested readers may refer to [20], [46], [49], [56] for comprehensive reviews. While improvements have been achieved, a persistent shortcoming of existing systems is that they fail to generalize well to previously unseen, or new, subjects. One way to cope with this problem is to train and test separate classifiers on each subject (i.e., person-specific classifier). Fig. 1(a) shows a real example of how a simple linear person-specific classifier can separate the positive samples of AU12 (lip corner puller, seen in smiling) from the negative ones. When ample training data are available, a person-specific classifier approaches an ideal classifier, one that best separates actions for the test subject.
Fig. 1.
An illustration of the proposed Selective Transfer Machine (STM): (a) 2D PCA projection of positive (squares) and negative (circles) samples for a given AU (in this case AU 12, or lip corner puller) for 3 subjects. An ideal classifier separates AU 12 nearly perfectly for each subject. (b) A generic classifier trained on all 3 subjects generalizes poorly to a new person (i.e., test subject) due to individual differences between the 3-subject training set and the new person. STM personalizes a generic classifier and reliably separates an AU for a new subject.
A problem with person-specific classifiers is that a sufficient quantity of training data is usually unavailable. In part for this reason, most approaches use training data from multiple subjects in the hope of compensating for subject biases. However, as shown in Fig. 1(b), when a classifier is trained on all training subjects and tested on an unknown subject, its generalizability may disappoint. When a classifier is trained and tested in this manner, we refer to it as a generic classifier. Because person-specific classifiers typically are not feasible, generic classifiers are most commonly used.
We propose that impaired generalizability occurs in part because of individual differences among subjects. Fig. 2 illustrates this phenomenon on real data in a 3-D eigenspace. One can observe that when the data are interpreted as positive and negative classes in Fig. 2(a), they are very difficult to separate without overfitting. When the data are interpreted as subjects in Fig. 2(b), the grouping effect becomes clear and echoes our conjecture about individual differences. In this example, these differences include sex, skin color and texture, illumination, and other ways in which people vary. Our guiding hypothesis is that such person-specific bias causes standard generic classifiers to perform worse on some subjects than on others [28].
Fig. 2.
Visualization of samples from the RU-FACS dataset [4] in 3D eigenspace: colors/markers indicate different (a) positive/negative classes, and (b) subjects (best viewed in color).
To mitigate the influence of individual biases, this paper explores the idea of personalizing a generic classifier for facial expression analysis. Given the common observation that a test video usually comes from a single subject, we assume the test distribution can be approximated by certain frames from the training subjects. The problem of personalizing a generic classifier is then formulated as training a classifier on selected training samples while reducing the discrepancy between the distributions of the selected training samples and the test samples. In this way, generic classifiers can adapt to an unseen test subject without test labels. We term this transductive approach the Selective Transfer Machine (STM). The major contributions of this work include:
Based on both qualitative observations and empirical findings, we show that individual differences attenuate AU detection. To address this problem, we introduce the Selective Transfer Machine (STM). STM is a personalization approach that reduces the mismatch between the feature distributions of training and test subjects. We propose an effective and robust procedure to optimize STM in its primal form.
Considering that many applications afford labeled test data, we introduce a useful extension of STM, termed L-STM, to make use of labeled target data. This extension shows considerable performance improvement in situations for which some labeled test data exist.
To evaluate STM, we conduct comprehensive experiments using within-subject, cross-subject, and cross-dataset scenarios on four benchmark datasets. We test STM for both AU detection and detection of holistic expressions.
For a given test subject, some training samples are more instrumental than others, and STM identifies them. The effectiveness of STM scales as the number of training subjects increases.
This paper is organized as follows. Sec. 2 reviews related work. Secs. 3–5 describe the STM model, optimization algorithm, and theoretical rationale. Sec. 6 introduces L-STM, an STM extension that utilizes labeled test data. Sec. 7 considers similarities and differences between STM and related methods. Sec. 8 evaluates STM and alternatives for AU and holistic expression detection. Sec. 9 concludes the paper with remarks and future work.
2 Related Work
Our approach lies at the intersection between facial expression analysis and cross-domain adaptation. Below we briefly discuss each in turn.
2.1 Facial expression analysis
Automatic facial expression analysis entails at least three steps: Face tracking and registration, feature extraction, and learning classifiers. This section reviews recent advances in each.
Tracking and registration
Tracking and registration of non-rigid facial features is a long-standing problem in computer vision. The goal of tracking is to detect facial landmarks (e.g., eyes) in each frame. For facial landmark detection, Parameterized Appearance Models (PAMs) are among the most popular methods. PAMs include the Lucas-Kanade method [43], Active Appearance Models (AAM) [18], [47], Constrained Local Models (CLM) [15], and, more recently, Zface [34] and the Supervised Descent Method [74]. Once facial landmarks are located, the registration step aligns the face image to remove 3D rigid head motion, so features can be geometrically normalized. A similarity transformation [20], [61], [86] registers faces with respect to an average face. A Delaunay triangulation can also be applied with a backward piecewise affine warping to extract features in areas not explicitly tracked. This two-step registration has been shown to preserve shape variation in appearance better than geometric normalization alone.
Feature extraction
With advances in tracking and registration, there has been renewed emphasis on biologically inspired features and temporal variation. As summarized in Table 1, current approaches to feature extraction may be broadly divided into four types: geometric, appearance, dynamic, and fusion. Geometric features contain information about shape and locations of permanent facial features, such as eyes or nose. Standard approaches rely on detecting fiducial facial points [45], a connected face [61], landmark coordinates [15], or face component shape parameterization [45]. Geometric features have performed well for many but not all AU detection tasks. They have difficulty detecting subtle expressions and are highly vulnerable to registration error [16].
TABLE 1.
Representative feature extraction methods
Type | Feature | Year | Reference
---|---|---|---
Geometric | Shape model parametrization | 2012 | [45]
| Geometry of facial components | 2010 | [85]
| Landmark locations | 2006 | [45]
Appearance | Active facial patches | 2012 | [84]
| SIFT/DAISY | 2011 | [86]
| Discrete Cosine Transform (DCT) | 2011 | [27]
| Local Phase Quantization (LPQ) | 2011 | [35]
| Local Binary Patterns (LBP) | 2009 | [59], [67]
| Hist. of Oriented Gradient (HOG) | 2009 | [48]
| Gabor | 2006 | [4], [41]
| Raw pixels | 2000 | [37]
Dynamic | Longitudinal expression atlases | 2012 | [33]
| Gabor motion energy | 2010 | [73]
| Bag of Temporal Words (BoTW) | 2010 | [61]
| Volume LBP (LBP-TOP) | 2007 | [82]
| Optical flow | 2005 | [32]
Fusion | Multiple feature kernels | 2012 | [58]
Appearance features, which often are biologically inspired, afford increased robustness to tracking and registration error. Appearance features represent skin texture and its changes and have been widely applied to facial expression analysis. Representative methods include SIFT [86], DAISY [86], Gabor jets [4], LBP [35], [84], Bag-of-Words models [60], [61], compositional features [77], and others [72]. Dynamic features, a more recently popular family, encode temporal information during the feature extraction stage. Examples include optical flow [32], bag of temporal words [62], volume LBP/LPQ [82], Gabor motion energy [73], and others. Fusion approaches incorporate multiple features, e.g., Multiple Kernel Learning (MKL) [58], and have yet to prove superior to other approaches [67].
Classifiers
Two main trends have been pursued when designing classifiers for facial expression analysis, as summarized in Table 2. One trend, static modeling, typically tackles the problem as discriminative classification and evaluates each frame independently. Representative approaches include Neural Network [38], Adaboost [4], SVMs [45], [61], [83], and Deep Networks [42]. Due to lack of temporal consistency, static models tend to produce non-smooth results. To address this issue, temporal modeling, the other trend, captures temporal transition between contiguous frames. For instance, Dynamic Bayesian Network (DBN) with appearance features [65] was proposed to model AU co-occurrence. Other variants of DBN include Hidden Markov Models [59] and Conditional Random Fields (CRF) [9], [68]. As an alternative, Simon et al. [61] proposed a structural-output SVM that detects AUs as temporal segments. To model relations between segments, Rudovic et al. [52] considered ordinal information in CRF. More recently, Ding et al. [21] proposed a hybrid approach that integrates frame-based, segment-based, and transition-based tasks in a sequential order. Interested readers are referred to [20], [46], [49], [56], [67] for more complete surveys.
TABLE 2.
Representative classifiers
Type | Classifier | Year | Reference
---|---|---|---
Static | Deep Networks | 2013 | [42]
| Support Vector Machine (SVM) | 2007 | [45]
| AdaBoost | 2005 | [4]
| Neural Network (NN) | 2005 | [38]
Temporal | Conditional Random Field (CRF) | 2009 | [9]
| Gaussian process | 2009 | [13]
| Dynamic Bayesian Network (DBN) | 2007 | [65], [70]
| Isomap embedding | 2006 | [10]
Hybrid | Cascade of Tasks (CoT) | 2013 | [21]
Common to all these approaches is the assumption that training and test data are drawn from the same distribution. However, as Fig. 2 shows, they can suffer from individual differences, causing poor generalizability to an unseen subject. STM makes no such assumption. Instead, it seeks a personalized classifier by re-weighting training samples according to their distribution mismatch with the test samples. Several studies along this direction can be found in [55], [78]–[80].
2.2 Cross-domain adaptation
Our approach is motivated by an increasing concern about dataset shift in the object detection literature. In real-world data, labels of interest often occur infrequently and features vary markedly between and within datasets. These factors contribute to significant biases in object categorization [66]. Saenko et al. [40], [54] proposed to reduce the discrepancy between features by learning a metric transformation. Aytar and Zisserman [2] regularized the training of a new object class by transferring pre-learned models. Chattopadhyay et al. [12] proposed to learn a combination of source classifiers that matches the target labels. Because these techniques are supervised and require one or more labeled instances from the target domain, they are ill-suited to new domains or subjects for which no prior knowledge is available. In contrast, our approach is unsupervised and thus better geared to generalizing to new domains or subjects.
Closer to our approach is a special case of unsupervised domain adaptation known as covariate shift [63]. Under covariate shift, the training and test domains follow different feature distributions, but the conditional distribution of labels given features remains the same. Given a source and a target domain, Domain Invariant Projection (DIP) [3] finds a domain-invariant space in which training and test data share a similar distribution. Similarly, Subspace Alignment (SA) [25] represents each domain as a subspace and then learns a mapping function that aligns the source subspaces to the target one. However, learning the projection or mapping is unsupervised, and thus it is unclear how source labels can be incorporated.
On the other hand, Dudík et al. [24] infer the re-sampling weights through maximum entropy density estimation without target labels. Maximum Mean Discrepancy (MMD) [5] measures the discrepancy between two different distributions in terms of expectations of empirical samples. Without estimating densities, Transductive SVM (T-SVM) [36] simultaneously learns a decision boundary and maximizes the margin in the presence of unlabeled patterns. Domain adaptation SVM [6] extends T-SVM by progressively adjusting the discriminant function toward the target domain. SVM-KNN [81] labels a single query using an SVM trained on its k neighborhood of the training data. Each of these methods uses either all or a portion of the training data. STM learns to re-weight training instances, which reduces the influence of irrelevant data.
Considering distribution mismatch, Kernel Mean Matching (KMM) [31] directly infers re-sampling weights by matching training and test distributions. Following this idea, Yamada et al. [75] estimated relative importance weights and learned from re-weighted training samples for 3D human pose estimation. See [50] for further review. These methods take a two-step approach that first estimates the sampling weights and then trains a re-weighted classifier or regressor. In contrast, STM jointly optimizes both the sampling weights and the classifier parameters and hence preserves the discriminant property of the new decision boundary.
3 Selective Transfer Machine (STM)
This section describes the proposed Selective Transfer Machine (STM) for personalizing a generic classifier. Unlike previous cross-domain methods [2], [22], [39], [76], STM requires no labels from a test subject. We use the Support Vector Machine (SVM) as the classifier due to its popularity for AU detection [15], [35], [61].
Problem formulation
Recent research and applications in automatic facial expression analysis involve video, which provides a wide sampling of facial appearance change. We assume the distribution of a subject’s appearance can be estimated by certain video frames. Based on this assumption, the main idea of STM is to re-weight training samples (i.e., frames) to form a distribution closer to the test distribution. Classifiers trained on the re-weighted training samples are likely to generalize to the test subject.
Let us denote the training set as 𝒟tr = {(xi, yi)}, i = 1, . . . , ntr, with yi ∈ {+1,−1} (see notation1). For notational simplicity, we append a 1 to each data vector xi to absorb the offset, i.e., xi ∈ ℝd+1. We formulate STM as minimizing the objective:
min_{f, s}  Rf(𝒟tr, s) + λ Ωs(Xtr, Xte),    (1)
where Rf (𝒟tr, s) is the SVM empirical risk defined on the decision function f, and training set 𝒟tr with each instance weighted by s ∈ ℝntr . Each entry si corresponds to a positive weight for a training sample xi. Ωs(Xtr, Xte) measures training and test distribution mismatch as a function of s. The lower the value of Ωs, the more similar the training and the test distributions are. λ > 0 is a tradeoff between the risk and the distribution mismatch. The goal of the STM is to jointly optimize the decision function f as well as the selective coefficient s, such that the resulting classifier can alleviate person-specific biases.
Penalized SVM
The first term in STM, Rf (𝒟tr, s), is the empirical risk of a penalized SVM, where each training instance is weighted by its relevance to the test data. In the following, we denote X ≡ Xtr for notational simplicity unless otherwise stated. The linear penalized SVM has the target decision function in the form f(x) = w⊤x and minimizes:
min_w  (1/2)‖w‖² + C Σi si Lp(yi, w⊤xi),    (2)
where Lp(y, ·) = max(0, 1 – y·)p (p = 1 stands for hinge loss and p = 2 for quadratic loss). In general, L could be any loss function. The unconstrained linear SVM in (2) can be extended to a nonlinear version by introducing a kernel matrix Kij := k(xi, xj) corresponding to a kernel function k induced from some nonlinear feature mapping φ(·). Using the representer theorem [11], the nonlinear decision function can be represented as f(x) = Σi βi k(xi, x), yielding the nonlinear penalized SVM:
min_β  (1/2) β⊤Kβ + C Σi si Lp(yi, ki⊤β),    (3)
where β ∈ ℝntr is the expansion coefficient and ki is the ith column of K. Unlike most standard solvers, we train the penalized SVM in the primal for simplicity and efficiency. For the unconstrained primal problems, we apply Newton’s method, which converges quadratically [11]. Details are given in Sec. 4.
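For illustration, the instance-weighted risk in (2) can be approximated with an off-the-shelf SVM that accepts per-sample weights; the sketch below uses scikit-learn's LinearSVC rather than the primal Newton solver described in Sec. 4, and the data and weights are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: rows of X_tr are training frames, s holds the selection
# weights s_i produced by the distribution-matching step.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 16))
y_tr = rng.choice([-1, 1], size=200)
s = rng.uniform(0.0, 2.0, size=200)

# Penalized (instance-weighted) linear SVM: frame i contributes to the loss
# in proportion to its weight s_i.
clf = LinearSVC(C=1.0)
clf.fit(X_tr, y_tr, sample_weight=s)
scores = clf.decision_function(X_tr)   # signed values of f(x)
```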
Distribution mismatch
The second term in STM, Ωs(Xtr, Xte), measures the domain mismatch and aims to find a re-weighting function that minimizes the discrepancy between the training and the test distributions. In previous cross-domain learning methods, the re-weighting function may be computed by separately estimating the densities and then the weights (e.g., [64]). However, this strategy is prone to error because it takes the ratio of estimated densities [64].
Here we adopt the Kernel Mean Matching (KMM) [31] method to reduce the difference between the means of the training and the test distributions in the Reproducing Kernel Hilbert Space ℋ. KMM computes the instance re-weighting si that minimizes:
Ωs(Xtr, Xte) = ‖ (1/ntr) Σi si φ(xi) − (1/nte) Σj φ(xjte) ‖²ℋ,    (4)
Introducing κi := (ntr/nte) Σj k(xi, xjte), i = 1, . . . , ntr, which captures the closeness between the ith training sample and the test samples, solving for s in (4) can be rewritten as a quadratic program (QP):
min_s  (1/2) s⊤Ks − κ⊤s,   s.t.  si ∈ [0, B]  and  |Σi si − ntr| ≤ ntr ε,    (5)
where B bounds the discrepancy between the probability distributions Ptr and Pte (B = 1000 in our case). For B→1, one obtains an unweighted solution in which all si = 1. The second constraint ensures that the re-weighted samples remain close to a probability distribution [31]. Observe in (5) that a larger κi leads to a larger si when the objective is minimized. This matches our intuition of putting higher selection weights on the training samples that more closely resemble the test distribution.
A major benefit of KMM is direct importance estimation without estimating the training and test densities. Compared to existing approaches, KMM, with proper tuning of the kernel bandwidth, shows the lowest importance estimation error and is robust to the input dimension and the number of training samples, as suggested in [64]. Fig. 3 illustrates its effect on synthetic data. As shown, KMM estimates the ideal fit well, while standard Ordinary Least Squares (OLS) and Weighted OLS (WOLS) with training/test ratio weights lead to suboptimal predictions.
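A minimal sketch of the KMM weight estimation in (4)–(5), assuming an RBF kernel and a general-purpose constrained solver in place of a dedicated QP solver; the bandwidth, B, and the default ε (the common KMM heuristic) are illustrative choices, and the loss_offset argument anticipates the weight-refinement step of Sec. 4.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(X_tr, X_te, gamma=1.0, B=1000.0, eps=None, loss_offset=None):
    """Estimate selection weights s by matching kernel means (Eqs. 4-5).

    loss_offset, if given, is subtracted from kappa; this is how the STM
    weight-refinement step (Eq. 6) can reuse the same QP.
    """
    n_tr, n_te = len(X_tr), len(X_te)
    K = rbf_kernel(X_tr, X_tr, gamma=gamma)                 # K_ij = k(x_i, x_j)
    kappa = (n_tr / n_te) * rbf_kernel(X_tr, X_te, gamma=gamma).sum(axis=1)
    if loss_offset is not None:
        kappa = kappa - loss_offset
    if eps is None:
        eps = (np.sqrt(n_tr) - 1) / np.sqrt(n_tr)

    obj = lambda s: 0.5 * s @ K @ s - kappa @ s
    grad = lambda s: K @ s - kappa
    # |sum(s) - n_tr| <= n_tr * eps written as two smooth inequalities.
    cons = [{"type": "ineq", "fun": lambda s: n_tr * eps - (s.sum() - n_tr)},
            {"type": "ineq", "fun": lambda s: n_tr * eps + (s.sum() - n_tr)}]
    res = minimize(obj, np.ones(n_tr), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * n_tr, constraints=cons)
    return res.x
```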
Fig. 3.
Fitting a line to a quadratic function using KMM and other re-weighting methods. The larger (redder) a training sample, the higher the weight KMM assigned to it. As can be observed, KMM puts higher weights on the training samples closer to the test ones. Compared to standard OLS or WOLS, KMM yields a better approximation to the test data.
4 Optimization for STM
To solve Eq. (1), we adopt Alternate Convex Search [26], which alternates between solving for the decision function f and the selective coefficient s. Note that the objective in (1) is biconvex: convex in f when s is fixed (the regularizer is quadratic in f and Lp is convex), and convex in s when f is fixed (since K ⪰ 0). Under these conditions, the alternate optimization approach is guaranteed to monotonically decrease the objective function. Because the function is bounded below, it converges to a critical point. Algorithm 1 summarizes the STM algorithm. Once the optimization is done, f is applied to infer labels for the test images. Below we detail the two steps of the alternate algorithm.
Algorithm 1.
Selective Transfer Machine
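The alternation in Algorithm 1 can be summarized by the sketch below, which reuses the kmm_weights helper from Sec. 3 and scikit-learn's instance-weighted LinearSVC as stand-ins; the iteration count, tolerance, and C, λ values are illustrative rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.svm import LinearSVC

def stm_fit(X_tr, y_tr, X_te, C=1.0, lam=1.0, gamma=1.0, n_iter=20, tol=1e-3):
    """Alternate between the decision function f and the selection weights s."""
    s = np.ones(len(X_tr))                      # start unweighted (all s_i = 1)
    clf, prev_s = None, None
    for _ in range(n_iter):
        # Step 1: fix s, train the penalized (instance-weighted) SVM.
        clf = LinearSVC(C=C).fit(X_tr, y_tr, sample_weight=s)
        margins = y_tr * clf.decision_function(X_tr)     # y in {-1, +1}
        loss = np.maximum(0.0, 1.0 - margins) ** 2       # per-frame loss l_i
        # Step 2: fix f, refine s by distribution matching with the
        # loss-adjusted linear term kappa - (C / lam) * loss (Eq. 6).
        s = kmm_weights(X_tr, X_te, gamma=gamma, loss_offset=(C / lam) * loss)
        if prev_s is not None and np.linalg.norm(s - prev_s) <= tol * np.linalg.norm(prev_s):
            break
        prev_s = s.copy()
    return clf, s
```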
Minimizing over s
Denote the training losses as ℓpi := Lp(yi, f(xi)), i = 1, . . . , ntr. The optimization over s can be rewritten as the following QP:
min_s  (1/2) s⊤Ks − (κ − (C/λ) ℓp)⊤ s,   s.t.  si ∈ [0, B]  and  |Σi si − ntr| ≤ ntr ε,    (6)
Since K ⪰ 0 by definition, (6) is convex, so any local optimum is a global optimum. To make the algorithm numerically stable, we add a ridge σ to the diagonal so that K ⪰ σI (σ = 10^−8 in our case).
Note that the procedure here differs from the original KMM in terms of weight refinement: in each iteration, s is refined through the training loss ℓp from the penalized SVM. This effect can be observed by minimizing the second term in (6): a larger ℓpi leads to a smaller si to keep the objective small. This effectively reduces the selection weights of incorrectly classified training samples. By contrast, KMM uses no label information and thus cannot refine the importance weights. Introducing the training losses helps preserve the discriminant property of the new decision boundary and hence leads to a more robust personalized classifier. From this perspective, KMM can be treated as a special case of STM, corresponding to its first iteration.
Fig. 5 illustrates the iterative effect on a synthetic example of learning a personalized classifier. In it#1, the hyperplane estimated by KMM is unreliable because KMM is completely unsupervised. STM, on the other hand, simultaneously considers the training loss and the weightings, and thus encourages the associated training samples to be well classified. As can be observed, as the iterations proceed, the STM separating hyperplane approaches the ideal one for the target data.
Fig. 5.
Comparisons of a generic SVM, personalized STM, and ideal classifier on synthetic data. The left-most figure shows the convergence curve of the objective value; STM converges in 12 iterations. Panels it#1, 4, 8, and 12, annotated with training/test accuracy (Tr% and Te%), show the corresponding hyperplanes at each iteration. Grey (shaded) dots denote training data, and white (unshaded) dots denote test data. Circles and squares denote positive and negative classes, respectively. Note that it#1 corresponds to the result of KMM [31]. STM improves separation relative to the generic SVM as early as the first iteration and converges toward the ideal hyperplane by the 12th iteration.
Minimizing over f
Let sv indicate the index set of support vectors, and nsv the number of support vectors. In the case of the quadratic loss (p = 2), the gradient and Hessian of the linear penalized SVM in (2) can be written as:
∇w = w + 2C X S I0 (X⊤w − y),    (7)
Hw = I + 2C X S I0 X⊤,    (8)
where S = diag(s) ∈ ℝntr×ntr denotes the re-weighting matrix, y∈ℝntr the label vector, and I0∈ℝntr×ntr the proximity identity matrix with the first nsv diagonal elements being 1 and the rest being 0. Similarly, the gradient with respect to the expansion coefficient β in (3) can be derived as:
∇β = Kβ + 2C K S I0 (Kβ − y),    (9)
Hβ = K + 2C K S I0 K,    (10)
Given the gradients and Hessians, the penalized SVM can be optimized by standard Newton’s method or conjugate gradient.
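As a concrete sketch of one such Newton update for the quadratic loss, following the gradient and Hessian in (7)–(8) but with samples stored as rows; the variables are placeholders and the offset is assumed to be absorbed into x as described in Sec. 3.

```python
import numpy as np

def newton_step(w, X, y, s, C):
    """One Newton update for the L2 penalized linear SVM (cf. Eqs. 7-8).

    X has shape (n_tr, d+1) with a trailing 1 absorbing the offset,
    y is in {-1, +1}, and s holds the per-frame selection weights.
    """
    margins = y * (X @ w)
    sv = margins < 1.0                              # support vectors (nonzero loss)
    Xs, ys, ss = X[sv], y[sv], s[sv]
    grad = w + 2.0 * C * Xs.T @ (ss * (Xs @ w - ys))
    hess = np.eye(X.shape[1]) + 2.0 * C * (Xs.T * ss) @ Xs
    return w - np.linalg.solve(hess, grad)
```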
Differentiable Huber loss
The L1 (hinge) loss in standard SVMs is not differentiable, which prevents its gradient and Hessian from being expressed and computed explicitly. Instead, we use the Huber loss [11] as a differentiable surrogate, i.e., L1(yi, f(xi)) ≈ LH(yi f(xi)). Note that any differentiable convex loss, e.g., the logistic loss or the exponential loss, could be used instead. The Huber loss can be defined as follows:
LH(r) = 0 if r > 1 + h;  (1 + h − r)²/(4h) if |1 − r| ≤ h;  1 − r if r < 1 − h,    (11)
where h is a parameter of choice. Fig. 4 shows the influence of h in comparison to the L1 and L2 losses. As can be observed, LH approaches the hinge loss as h→0. As indicated in [11], there is no clear reason to prefer the hinge loss, because replacing it with the Huber loss does not materially change the results. With the differentiable Huber loss, the gradient and Hessian of the penalized linear SVM become:
∇w = w + (C/(2h)) X S I0 (X⊤w − (1 + h)y) − C X S I1 y,    (12)
Hw = I + (C/(2h)) X S I0 X⊤,    (13)
and for the penalized nonlinear SVM:
∇β = Kβ + (C/(2h)) K S I0 (Kβ − (1 + h)y) − C K S I1 y,    (14)
Hβ = K + (C/(2h)) K S I0 K,    (15)
where I1 ∈ ℝntr×ntr denotes the proximity identity matrix with the first nsv diagonal elements being 0, followed by nℓ (the number of points in the linear part of the Huber loss) ones, and the rest being 0. With the derived gradient and Hessian, we are able to optimize for f with quadratic convergence using the standard Newton method.
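For reference, a small sketch of the Huber surrogate in (11), assuming the piecewise form given above; h is a free parameter.

```python
import numpy as np

def huber_loss(margin, h=0.5):
    """Differentiable Huber surrogate for the hinge loss (cf. Eq. 11).

    margin = y * f(x); zero above 1 + h, quadratic within h of 1,
    and linear (hinge-like) below 1 - h.
    """
    margin = np.asarray(margin, dtype=float)
    quad = (1.0 + h - margin) ** 2 / (4.0 * h)
    lin = 1.0 - margin
    return np.where(margin >= 1.0 + h, 0.0,
                    np.where(margin <= 1.0 - h, lin, quad))
```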
Fig. 4.
Loss functions: (a) L1 and L2 loss, and (b) Huber loss.
5 Theoretical Rationale
This section analyzes two important properties of STM, biconvexity and boundedness, based on the techniques developed for biconvex optimization [30]. Then we justify the convergence of the Alternate Convex Search algorithm, which we used for solving STM, in terms of both objective value and optimization variables.
5.1 Properties of STM
We start by showing that STM is a biconvex problem.
Property 1. (Bi-convexity)
Selective Transfer Machine (STM) in (1) is a biconvex optimization problem.
Proof
Denote the decision variable of f as w ∈ W ⊆ ℝd and the selection coefficient s ∈ S ⊆ ℝntr, where W and S are two nonempty convex sets. Let Z ⊆ W × S be the solution set on W × S, and let Zw and Zs be the subsets obtained when w and s are fixed, respectively. Zw is convex for every w ∈ W (the constraints si ∈ [0, B] and the sum constraint are linear) and Zs is convex for every s ∈ S, so the solution set Z is a biconvex set. Moreover, the objective is convex in w for fixed s (the regularizer is quadratic, Lp is convex, and the si are non-negative) and convex in s for fixed w (a QP with K ⪰ 0). Hence STM can be rewritten in the standard form of a biconvex optimization problem [1]: minw,s{g(w, s) : (w, s) ∈ Z}.
Property 2. (Boundedness)
The STM optimization problem in Problem (1) is bounded from below.
Proof
The boundedness follows from two observations: (1) Rf is bounded from below because of the quadratic term in f and the non-negativity of s and Lp. (2) Ωs is bounded from below: as defined in (4), it is a squared RKHS distance and hence non-negative (equivalently, K is positive semi-definite).
Following the same proof line, the above properties can be also shown for nonlinear STM defined with Eq. (3).
5.2 Algorithm
The following analysis closely follows Sec. 4 in [30]. We present the key steps of the convergence proof and refer the reader to [30] for further details on this style of proof.
Alternate Convex Search
To solve the biconvex STM problem, a standard and popular approach is to exploit its convex substructure. We use the Alternate Convex Search (ACS) algorithm [71], a special case of Block-Relaxation Methods, which alternately solves the convex subproblems. For convenience of exposition, we recall the ACS algorithm in Algorithm 2.
Denote z = (w, s) as the solution variable. As mentioned in Sec. 4, STM can be seen as initializing s0 using KMM, or simply as a vector of ones, and then solving for the classifier w1 as an unweighted SVM. As will be discussed below and in Sec. 8.5, the order of the two updates does not affect convergence. For Step 4, there are several ways to determine the stopping criterion. Here, we use the relative decrease of z compared to the previous iteration. Below we discuss the convergence properties in terms of the objective value (i.e., the difference between g(zt) and g(zt−1) at two consecutive iterations t and t − 1) and the variables (i.e., the difference between zt and zt−1).
Algorithm 2.
Alternate Convex Search Algorithm
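In the abstract, the ACS loop of Algorithm 2 with the relative-change stopping criterion mentioned above might look like the sketch below; solve_w and solve_s stand for the two convex subproblem solvers and are placeholders.

```python
import numpy as np

def acs(z0, solve_w, solve_s, max_iter=50, tol=1e-3):
    """Alternate Convex Search: alternately solve the two convex subproblems."""
    w, s = z0
    for _ in range(max_iter):
        w_new = solve_w(s)          # argmin_w g(w, s) with s fixed
        s_new = solve_s(w_new)      # argmin_s g(w, s) with w fixed
        # Stop when the relative change of z = (w, s) is small.
        delta = np.linalg.norm(w_new - w) + np.linalg.norm(s_new - s)
        scale = np.linalg.norm(w) + np.linalg.norm(s) + 1e-12
        w, s = w_new, s_new
        if delta <= tol * scale:
            break
    return w, s
```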
Convergence
Recall that W and S are two non-empty sets, and Z ⊆ W ×S is a biconvex set on W ×S. We firstly show the convergence of the sequence of objective value {g(zt)}t∈ℕ, and then convergence of the sequence of the variables {zt}t∈ℕ.
Theorem 1
Let the STM objective function be g : Z → ℝ. Then the sequence of objective values {g(zt)}t∈ℕ generated by ACS converges monotonically.
Proof
The sequence {g(zt)}t∈ℕ generated by Algorithm 2 decreases monotonically, since g(w*, s*) ≤ g(w, s*), ∀w ∈ Zs* and g(w*, s*) ≤ g(w*, s), ∀s ∈ Zw*. In addition, Property 2 shows that g is bounded from below. According to Theorem 4.5 in [30], the sequence {g(zt)}t∈ℕ therefore converges to a finite limit.
Theorem 1 establishes convergence only of the sequence {g(zt)}t∈ℕ, not of the sequence {zt}t∈ℕ. See Example 4.3 in [30], where {g(zt)}t∈ℕ converges but {zt}t∈ℕ diverges. The following states the conditions for convergence of {zt}t∈ℕ.
Theorem 2
Let W and S be closed sets, and zt = (wt, st), t ∈ ℕ, where wt ∈ W and st ∈ S. The sequence of variables {zt}t∈ℕ generated by ACS converges to some z* ∈ W × S.
Proof
This can be proved using Theorem 4.7 in [30].
6 STM with Labeled Target Data (L-STM)
As discussed above, STM requires no labels from the target subject to obtain the personalized classifier. Nevertheless, in many problems one may collect partially labeled data from the target domain, or acquire additional guidance through a few manual labels. Such labels provide a direct reference to the target subject and aid in determining the personalized classifier. This section describes an inductive extension of STM, termed L-STM, that exploits target labels when personalizing a classifier.
Given the target data and their labels as 𝒟L = {(xjL, yjL)}, j = 1, . . . , nL, with 0 ≤ nL ≤ nte, we formulate L-STM by introducing an additional regularization term ΩL(𝒟L) into (1):
min_{f, s}  Rf(𝒟tr, s) + λ Ωs(Xtr, Xte) + λL ΩL(𝒟L),    (16)
where λL > 0 is a tradeoff parameter. A large λL encourages the labeled target data to be correctly classified. The role of ΩL(𝒟L) is to regulate the classification quality on the labeled target data. In this paper, we define ΩL(𝒟L) = Σj L2(yjL, f(xjL)). Note that the L2 loss here is analogous to the regularization in Least Squares SVM [69], which performs comparably with an SVM using the hinge loss and has been shown to relate to a ridge regression approach for binary classification, such as our task at hand. Because ΩL(𝒟L) is convex in f, problem (16) is still a biconvex optimization problem, and thus the ACS algorithm can be directly applied.
We show that solving problem (16) is equivalent to solving the original STM on a training set augmented with weighted labeled target data. We demonstrate this with the L2 loss and a linear SVM, although different loss functions (e.g., L1) and classifier types (e.g., nonlinear SVM) can be used. Specifically, the update for s remains the same. For updating w, one can again use Newton’s method with the associated gradient and Hessian:
∇w = w + 2C X̂ Ŝ I0 (X̂⊤w − ŷ),    (17)
Hw = I + 2C X̂ Ŝ I0 X̂⊤,    (18)
where X̂ = [Xtr|XL] is the feature set augmented with the labeled target data, Ŝ = diag([s⊤, (λL/C)1⊤nL]) is the augmented re-weighting matrix, and ŷ = [y⊤, yL⊤]⊤ is the augmented label vector.
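The equivalence can be sketched by simply appending the labeled target frames, each carrying a weight proportional to λL, to the weighted training set; the LinearSVC stand-in and the λL value below are illustrative, and the exact weight placed on the target frames is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def lstm_update_f(X_tr, y_tr, s, X_L, y_L, C=1.0, lam_L=10.0):
    """Update f on the augmented, re-weighted set (cf. Eqs. 17-18)."""
    X_hat = np.vstack([X_tr, X_L])                            # augmented features
    y_hat = np.concatenate([y_tr, y_L])                       # augmented labels
    s_hat = np.concatenate([s, lam_L * np.ones(len(X_L))])    # augmented weights
    return LinearSVC(C=C).fit(X_hat, y_hat, sample_weight=s_hat)
```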
The above equivalence is particularly useful in the AU detection scenario, where unlabeled video is usually abundant but laborious FACS coding is limited. L-STM allows users to label just a few frames to significantly reduce false detections. Fig. 6 illustrates the benefits of L-STM over different methods. Light yellow (dark green) indicates positive (negative) frames for AU 12 on Subject 12 of the RU-FACS dataset. The top two rows show the ground truth and the detection result of the ideal classifier, respectively. The numbers in parentheses indicate the F1 score. The third and fourth rows illustrate the detections of the generic SVM and KMM. Both approaches produced many false detections due to person-specific biases and the lack of weight refinement. STM, on the fifth row, greatly reduced false positives and produced a better F1 score. The last row shows the detection using L-STM when two frames misclassified by STM are given their correct labels. Using label information from only these two frames of the target subject, L-STM boosted the F1 score by ~10%. As we observed empirically, the more labeled target data are introduced, the more closely L-STM approaches the ideal classifier.
Fig. 6.
Comparison of different methods on the RU-FACS dataset. Light yellow (dark green) indicates AU 12 presence (absence) for Subject 12. The numbers in parentheses are F1 scores. Two frames misclassified by STM were chosen and fed into L-STM with their correct labels.
7 Discussion of Related Work
A few related efforts use personalized modeling for facial expression analysis, e.g., AU intensity estimation [53]. STM differs from them in how it accomplishes personalization. Chang and Huang [8] introduced an additional face recognition module and trained a neural network on the combination of face identities and facial features. Romera-Paredes et al. [51] applied multi-task learning to learn a group of linear models and then calibrated the models toward the target subject using target labels. By contrast, STM requires neither a face recognition module nor target labels. Motivated by covariate shift [63], Chen et al. [14] proposed transductive and inductive transfer algorithms for learning person-specific models. In their transductive setting, KL-divergence was used to estimate sample importance. STM instead models the domain mismatch using KMM [31], which, with proper tuning, yields better estimates, as suggested in [64].
The most related work is transductive transfer learning, which seeks to address domain shift problems without target labels. Table 3 summarizes the comparison. DT-MKL [22] simultaneously minimizes the MMD criterion [5] and a multi-kernel SVM. DAM [23] leverages a set of pre-trained base classifiers and solves for a test classifier that shares similar predictions with the base classifiers on unlabeled data. However, similar to T-SVM [36] and SVM-KNN [81], these methods treat training data uniformly. By contrast, KMM [31] and STM consider importance re-weighting, properly adjusting the importance for each training instance to move the decision function toward test data. KMM performs re-weighting only once while STM does so in an iterative manner. From this perspective, KMM can be viewed as an initialization of STM (see Sec. 4). In addition, STM uses training loss to refine instance weights in successive steps, thus being able to correct sub-optimal weights. DA-SVM [6] refines instance weights as a quadratic function decaying with iterations. However, DA-SVM may fail to converge due to its non-convexity, while STM is formulated as a bi-convex problem and thus assures convergence. Moreover, STM can be extended to tackle labeled target data, which greatly improves the performance.
TABLE 3.
Compare STM with related transductive transfer learning methods
Methods | Importance re-weight | Weight refine | Convexity | Labeled target data |
---|---|---|---|---|
SVM-KNN [81] | × | × | NA | × |
T-SVM [17] | × | × | non-convex | × |
KMM [31] | ✓ | × | convex | × |
DA-SVM [6] | × | ✓ | non-convex | × |
DT-MKL [22] | × | × | jointly convex | optional |
DAM [23] | × | × | convex | optional |
STM (proposed) | ✓ | ✓ | bi-convex | optional |
✓ : included, ×: omitted, NA: not applicable
8 Experiments
STM was evaluated on datasets that include both posed and unposed facial expressions, frontal versus variable pose, varying complexity (e.g., interview versus 3-person interaction), and differences in the number of subjects, the amount of video per subject, and men and women of diverse ethnicity. These factors are among the individual differences that adversely affected classifier performance in previous work [28]. To evaluate STM with respect to alternative approaches and scenarios, it was compared with a generic classifier, person-specific classifiers, and cross-domain classifiers using within-subject, cross-subject, and cross-dataset scenarios. Operational parameters for STM included initialization order, parameter choice, and domain size.
8.1 Dataset Description
We tested the algorithms on four diverse datasets that involve posed, acted, or spontaneous expressions, and vary in video quality, length, annotation, the number of subjects, and context, as summarized in Table 4 and illustrated in Fig. 7.
Fig. 7.
Example images from (a) CK+ [44], (b) GEMEP-FERA [67], and (c) RU-FACS [4] datasets.
The extended Cohn-Kanade (CK+) dataset [44] contains brief (approximately 20 frames on average) videos of posed and un-posed facial expressions of men and women of various ethnic backgrounds. Videos begin with a neutral expression and finish at the apex, or peak, which is annotated for AUs and for holistic expression. Changes in pose and illumination are relatively small. Posed expressions from 123 subjects and 593 videos were used. Because STM requires a number of frames to estimate the test distribution, it was necessary to modify the coding in CK+. Specifically, we assume the last one-third of the frames in each video share the AU labels of the annotated peak frame. We note that this may introduce some errors compared to related methods that use only the peak frame for classification.
The GEMEP-FERA dataset [67] consists of 7 portrayed emotion expressions by 10 trained actors. Actors were instructed to utter pseudo-linguistic phoneme sequences or a sustained vowel and display pre-selected facial expressions. Head pose is primarily frontal with some fast movements. Each video is annotated with AUs and holistic expressions. We used the GEMEP-FERA training set, which comprises 7 subjects (three of them men) and 87 videos.
The RU-FACS dataset [4] consists of video-recorded interviews of 100 young adults of varying ethnicity. Interviews are approximately 2.5 minutes in duration. Head pose is frontal with small to moderate out-of-plane rotation. AUs are coded if the intensity is greater than ‘A’, i.e., the lowest intensity on a 5-point scale. We had access to 34 of the interviews, of which video from 5 subjects could not be processed for technical reasons. Thus, the experiments reported here were conducted with data from 29 participants with more than 180,000 frames in total.
GFT [57] consists of social interaction between 720 previously unacquainted young adults that were assembled into groups of three persons each and observed over the course of a 30-minute group formation task. Two minutes of AU-annotated video from 14 groups (i.e., 42 subjects) was used in the experiments for a total of approximately 302,000 frames. Head pose varies over a range of about plus/minus 15–20 degrees [28]. For comparability with RU-FACS, we included AU 6, 9, 12, 14, 15, 20, 23 and 24.
Out of these datasets, CK+ is the most controlled, followed by GEMEP-FERA. Both include annotation for holistic expression and AU. GEMEP-FERA introduces variations in spontaneous expressions and large head movements but contains only 7 subjects. RU-FACS and GFT are both unposed and vary in complexity. RU-FACS is an interview context; GFT is a social interaction over a longer duration with greater variability. The first sets of experiments focus on CK+, GEMEP, and RU-FACS. GFT figures primarily in experiments on domain transfer between datasets and on the influence of numbers of subjects on performance.
8.2 Settings
Face tracking & registration
For CK+, FERA, and GFT, 49 landmarks were detected and tracked using the Supervised Descent Method (SDM) [74]. For RU-FACS, we used available AAM detection and tracking of 68 landmarks. Tracked landmarks were registered to a 200×200 template shape.
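For illustration, a least-squares similarity transform (scale, rotation, translation) mapping tracked landmarks onto a template can be estimated as below; this Umeyama-style fit is one standard way to do such registration and is not necessarily the exact procedure used here.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform mapping src landmarks onto dst.

    src, dst: (n_landmarks, 2) arrays. Returns scale c, rotation R, shift t
    such that dst ~ c * src @ R.T + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[1, 1] = -1.0                       # avoid a reflection
    R = U @ S @ Vt
    var_s = (src_c ** 2).sum() / len(src)
    c = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - c * (R @ mu_s)
    return c, R, t
```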
Feature extraction
Given a registered facial image, SIFT descriptors were extracted using 36×36 patches centered at selected landmarks (9 on the upper face and 7 on the lower face), because AUs occur only in local facial regions. The dimensionality of the descriptors was reduced by preserving 98% PCA energy.
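The 98% energy criterion can be sketched with a standard PCA that keeps enough components to explain 98% of the variance; the feature matrix below is a random placeholder, not actual SIFT output.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: one 128-D SIFT descriptor per selected landmark, concatenated
# per frame (16 landmarks -> 2048-D before reduction).
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 16 * 128))

pca = PCA(n_components=0.98)          # keep components explaining 98% of variance
reduced = pca.fit_transform(features)
print(reduced.shape[1], "dimensions retained")
```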
AU selection & evaluation
Positive samples were frames in which an AU was present; negative samples were frames in which it was absent. We selected the 8 most commonly observed AUs across all datasets. To provide a comprehensive evaluation, we report both the Area Under the ROC Curve (AUC) and the F1 score. Because AUC can be misleading for highly imbalanced binary classification tasks, the F1 score, as the harmonic mean of precision and recall, can be more meaningful for imbalanced data such as AUs.
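Both metrics can be computed per AU from frame-level labels and classifier scores, for example as below; the labels and scores are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.85, 0.15])   # imbalanced, AU-like
scores = rng.normal(size=1000) + 1.5 * y_true            # classifier outputs f(x)

auc = roc_auc_score(y_true, scores)                      # threshold-free ranking quality
f1 = f1_score(y_true, (scores > 0).astype(int))          # precision/recall harmonic mean
print(f"AUC={100 * auc:.1f}  F1={100 * f1:.1f}")
```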
Dataset split & validation
A leave-one-subject-out protocol was used. For each AU, we iteratively chose one subject for testing and the remaining subjects for training and validation. For all iterations, we first identified the values of λ ∈ {2^−10, . . . , 2^10} and C ∈ {2^−10, . . . , 2^10} for which the F1 score on the validation set was greatest. Among these, we then chose the pair for which C was smallest. That is, we sought the parameters that maximize the F1 score while preserving a large margin of the decision boundary.
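The selection rule can be sketched as follows: among the (C, λ) pairs achieving the best validation F1, keep the smallest C (largest margin); validate_f1 is a hypothetical user-supplied scoring function and the grid is illustrative.

```python
def select_params(validate_f1, grid=None):
    """Pick (C, lam): maximize validation F1, break ties with the smallest C.

    validate_f1(C, lam) -> F1 on the validation subjects (user-supplied).
    """
    if grid is None:
        grid = [2.0 ** k for k in range(-10, 11)]
    results = [(validate_f1(C, lam), -C, C, lam) for C in grid for lam in grid]
    best_f1 = max(r[0] for r in results)
    # Among the best-scoring configurations, prefer a small C (large margin).
    candidates = [r for r in results if r[0] >= best_f1 - 1e-6]
    _, _, C_best, lam_best = max(candidates, key=lambda r: r[1])   # r[1] = -C
    return C_best, lam_best
```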
8.3 Action Unit (AU) Detection
We evaluated STM with generic and alternative approaches using three scenarios for AU detection: within-subject, cross-subject, and cross-dataset. We report results separately for each scenario.
8.3.1 Within-subject AU detection
A natural comparison with STM is a classifier trained on a single subject, also known as a Person-Specific (PS) classifier. A PS classifier can be defined in at least two ways. One, the more common definition, is a classifier trained and tested on the same subject. We refer to this usage as PS1. The other definition, referred to as PS2 or quasi-PS, is a classifier that is tested on a subject included in the training set. The GEMEP-FERA competition [67] defined PS in this way. An SVM trained with PS2 (PS2-SVM) is sometimes considered to be a generic classifier (e.g., [45]). In our usage, we reserve the term “generic classifier” for the case in which training and test subjects are independent.
Here we compared STM with both PS1-SVM and PS2-SVM, and summarize the results in Table 5. Overall, PS1-SVM shows the lowest AUC and F1. This outcome likely occurred because of the relatively small number of samples for individual subjects. Lack of sufficient training data for individual subjects is a common problem for person-specific classifiers. PS1-SVM likely would have performed best had enough training data from the same subject been available. PS2-SVM achieved better AUC and F1 because it saw more training subjects. Overall, STM consistently outperformed both PS classifiers.
TABLE 5.
Within-subject AU detection with STM and PS classifiers
AUC | F1 Score | |||||
---|---|---|---|---|---|---|
AU | PS1-SVM | PS2-SVM | STM | PS1-SVM | PS2-SVM | STM |
1 | 48.0 | 72.4 | 79.2 | 45.0 | 54.8 | 61.9 |
2 | 46.5 | 71.1 | 80.2 | 45.9 | 55.7 | 64.3 |
4 | 62.6 | 61.9 | 66.5 | 46.6 | 40.7 | 60.4 |
6 | 70.3 | 80.0 | 86.4 | 60.2 | 69.7 | 78.5 |
7 | 47.5 | 54.3 | 72.4 | 49.4 | 55.3 | 58.4 |
12 | 65.7 | 74.0 | 72.3 | 69.5 | 70.4 | 72.6 |
15 | 41.4 | 64.0 | 70.5 | 44.5 | 49.0 | 56.0 |
17 | 32.6 | 70.3 | 61.7 | 25.0 | 40.3 | 36.3 |
Av. | 51.8 | 68.5 | 73.6 | 48.3 | 54.5 | 61.0 |
Selection ability of STM
Recall that PS2 includes samples of the test subject in both the training and test sets. Could STM improve PS2 performance by selecting proper training samples? To answer this question, we employed PS2 to investigate STM’s ability to select training samples relevant to the test subject. Table 6 shows the selection percentages of STM at initialization and at convergence. Each row sums to 1 and represents a test subject; each entry within a row denotes the percentage of selected samples drawn from each training subject. For example, (a) shows that at initialization, when testing on Subject 2, 26% of the selected training samples came from Subject 1. Upon convergence, as (b) shows, STM selected most training samples belonging to the target subject (higher diagonal values). Note that the diagonal entries do not reach 100% because the target subject alone provides insufficient training samples. However, STM was able to select relevant training samples, even from different subjects, to alleviate the mismatch between the training and test distributions.
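The per-subject selection percentages reported in Table 6 can be computed from the converged weights s and the training frames' subject labels, for example as below; the function and variable names are placeholders.

```python
import numpy as np

def selection_percentage(s, subject_ids):
    """Fraction of total selection weight attributed to each training subject."""
    s = np.asarray(s, dtype=float)
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    totals = np.array([s[subject_ids == sub].sum() for sub in subjects])
    return dict(zip(subjects, totals / totals.sum()))
```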
TABLE 6.
Selection percentage of STM for different subjects
(a) Initialization
 | sub1 | sub2 | sub3 | sub4 | sub5 | sub6 | sub7
---|---|---|---|---|---|---|---
sub1 | 0.38 | 0.00 | 0.00 | 0.14 | 0.06 | 0.04 | 0.38
sub2 | 0.26 | 0.00 | 0.00 | 0.43 | 0.05 | 0.00 | 0.26
sub3 | 0.40 | 0.00 | 0.02 | 0.14 | 0.13 | 0.00 | 0.30
sub4 | 0.51 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.49
sub5 | 0.54 | 0.00 | 0.00 | 0.14 | 0.07 | 0.00 | 0.25
sub6 | 0.43 | 0.00 | 0.00 | 0.07 | 0.00 | 0.15 | 0.35
sub7 | 0.56 | 0.00 | 0.00 | 0.00 | 0.02 | 0.01 | 0.41

(b) Convergence
 | sub1 | sub2 | sub3 | sub4 | sub5 | sub6 | sub7
---|---|---|---|---|---|---|---
sub1 | 0.27 | 0.11 | 0.13 | 0.06 | 0.17 | 0.18 | 0.08
sub2 | 0.40 | 0.41 | 0.07 | 0.00 | 0.04 | 0.01 | 0.08
sub3 | 0.07 | 0.07 | 0.47 | 0.07 | 0.13 | 0.14 | 0.06
sub4 | 0.06 | 0.11 | 0.09 | 0.47 | 0.10 | 0.10 | 0.08
sub5 | 0.19 | 0.05 | 0.13 | 0.00 | 0.42 | 0.02 | 0.18
sub6 | 0.08 | 0.09 | 0.14 | 0.12 | 0.06 | 0.43 | 0.08
sub7 | 0.11 | 0.08 | 0.11 | 0.14 | 0.14 | 0.17 | 0.24
8.3.2 Cross-subject AU detection
Using a cross-subject scenario, i.e., training and test subjects are independent in all iterations (a.k.a. leave-one-subject-out), we compared STM against various types of methods. Unsupervised domain adaptation methods are closest to STM. For comparison we included Kernel Mean Matching (KMM) [31], Domain Adaptation SVM (DA-SVM) [6], and Subspace Alignment (SA) [25]. Multiple-source domain adaptation methods serve as another natural comparison by treating each training subject as one source domain; we compared to the state-of-the-art DAM [23]. As baselines, we compared with linear SVMs and the semi-supervised Transductive SVM (T-SVM) [17]. T-SVM, KMM, DAM, and SA were implemented using code from the respective authors’ webpages. Because STM requires no target labels, methods that use target labels for adaptation (e.g., [19], [40], [54]) were not included.
All methods were compared on CK+ and RU-FACS, with a few exceptions in CK+. In CK+, SA was ruled out because too few frames were available per subject to compute meaningful subspaces. DAM was also omitted in CK+ because it would be problematic to choose negative samples given the structure of the data (i.e., pre-segmented positive examples). In training, a Gaussian kernel was used with the bandwidth set as the median distance between pairwise samples. For KMM and STM we set B = 1000, so that none of the si reached the upper bound, and ε = (√ntr − 1)/√ntr. As reported in [31], when B was reduced to the point where a small percentage of the si reached B, performance empirically either did not change or worsened. For T-SVM we used [17], since the original T-SVM [36] solves an integer program and thus does not scale to our problem, which contains hundreds of thousands of frames. For fairness, we used linear SVMs in all cases. In DA-SVM, we used LibSVM [7] as discussed in Sec. 4, with τ = 0.5 and β = 0.03. For SA, we obtained the dimension of the subspaces dmax using their theoretical bound with γ = 10^6 and δ = 0.1; SA with both NN and SVM classifiers is reported. Following [23], we tuned DAM using C = 1, λL = λD1 = λD2 = 1; β was set as the median of the computed MMD values [5]; the threshold for virtual labels was cross-validated in {0.01, 0.1, 0.5, 1}. Linear SVMs were used as base classifiers. Note that, because these alternative methods are not optimized for our task, their performance might be improved by searching over a wider range of parameters.
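The median-distance bandwidth can be computed as in the sketch below, assuming the Gaussian kernel is parameterized as k(x, x′) = exp(−γ‖x − x′‖²) with γ = 1/(2σ²); the subsampling and parameterization details are assumptions and may differ from the actual implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X, max_samples=2000, seed=0):
    """Median heuristic: set the kernel bandwidth to the median pairwise distance."""
    rng = np.random.default_rng(seed)
    if len(X) > max_samples:                        # subsample for tractability
        X = X[rng.choice(len(X), max_samples, replace=False)]
    sigma = np.median(pdist(X))                     # median Euclidean distance
    gamma = 1.0 / (2.0 * sigma ** 2)                # for exp(-gamma * ||x - x'||^2)
    return sigma, gamma
```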
Discussion
Tables 7 and 8 show the results in terms of AUC and F1 scores. A linear SVM served as the generic classifier. For semi-supervised learning, T-SVM performed similarly to SVM on RU-FACS, but worse than SVM on CK+. One explanation is that in CK+ the negative (neutral) and positive (peak) frames are easier to separate than the consecutive frames in RU-FACS. For transductive transfer learning, KMM performed worse than the generic classifier, because KMM estimates sample weights without label information. SA combined with both Nearest Neighbor (NN) and LibSVM led to unsatisfactory performance compared to the above methods. This is because SA obtains an optimal transformation through a linear subspace representation, which can be inappropriate given the non-linearity of our data. In addition, SA weights all training samples equally, and thus suffers from biases caused by individual differences (as illustrated in Fig. 2). Although SA+SVM performed better in AUC, its low F1 score suggests overfitting (low precision or recall). The proposed STM outperformed the alternative approaches in general. For AUC on RU-FACS, STM had the highest average score, about 6% higher than the second highest, and the highest score on all but 2 AUs. For F1, STM had the highest average score, about 12 points higher than the nearest alternative, and the highest F1 score on all AUs except AU 4. For CK+, STM achieved 91% AUC on average, slightly better than the best published result of 90.5% [41], although the results may not be directly comparable due to different choices of features and registration. It is also noteworthy that we tested on the last one-third of each video, which may contain low intensities, while [41] tested only on peak frames with the highest intensity. On the other hand, STM may have benefited from the additional information in these extra frames.
Table 7.
Cross-subject AU detection on RU-FACS dataset. “SA (NN|SVM)” indicates SA with NN and SVM, respectively.
AUC | F1 Score | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AU | SVM | KMM | T-SVM | DA-SVM | SA (NN|SVM) | DAM | STM | SVM | KMM | T-SVM | DA-SVM | SA (NN|SVM) | DAM | STM |
1 | 72.0 | 74.0 | 72.0 | 77.0 | 41.2|82.0 | 82.6 | 83.9 | 40.8 | 37.7 | 37.4 | 35.5 | 20.9|24.2 | 11.3 | 55.3 |
2 | 66.6 | 58.6 | 71.1 | 76.5 | 38.2|81.4 | 81.2 | 82.4 | 35.7 | 32.2 | 36.2 | 34.1 | 18.6|21.8 | 17.0 | 52.6 |
4 | 74.8 | 62.2 | 50.0 | 76.4 | 24.5|71.1 | 51.3 | 82.4 | 25.2 | 14.5 | 11.2 | 35.3 | 5.7| 5.8 | 2.9 | 30.4 |
6 | 89.1 | 88.8 | 61.6 | 60.3 | 46.2|78.3 | 81.2 | 93.1 | 58.3 | 39.2 | 33.1 | 42.9 | 23.2|19.2 | 20.9 | 72.4 |
12 | 86.7 | 87.0 | 86.7 | 84.4 | 55.9|86.1 | 93.1 | 92.3 | 61.9 | 63.0 | 62.6 | 71.4 | 37.5|38.6 | 36.6 | 72.3 |
14 | 71.8 | 67.8 | 74.4 | 70.4 | 38.0|78.5 | 79.5 | 87.4 | 31.3 | 25.8 | 25.8 | 40.9 | 16.5|15.7 | 5.7 | 51.0 |
15 | 72.5 | 68.8 | 73.5 | 58.1 | 37.7|79.2 | 71.8 | 86.1 | 32.3 | 29.5 | 32.3 | 34.9 | 10.1| 8.8 | 3.2 | 45.4 |
17 | 78.5 | 76.7 | 79.5 | 75.7 | 55.8|89.9 | 93.9 | 89.6 | 39.5 | 35.6 | 44.0 | 46.5 | 21.9|17.2 | 22.9 | 55.3 |
Av. | 76.5 | 72.3 | 71.1 | 72.3 | 42.2|80.8 | 79.3 | 86.3 | 40.6 | 37.3 | 40.6 | 42.7 | 19.3|18.9 | 15.1 | 54.3 |
Table 8.
Cross-subject AU detection on CK+ dataset
AUC | F1 Score | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
AU | SVM | KMM | T-SVM | DA-SVM | STM | SVM | KMM | T-SVM | DA-SVM | STM |
1 | 79.8 | 68.9 | 69.9 | 72.6 | 88.9 | 61.1 | 44.9 | 56.8 | 57.7 | 62.2 |
2 | 90.8 | 73.5 | 69.3 | 71.0 | 87.5 | 73.5 | 50.8 | 59.8 | 64.3 | 76.2 |
4 | 74.8 | 62.2 | 63.4 | 69.9 | 81.1 | 62.7 | 52.3 | 51.9 | 57.7 | 69.1 |
6 | 89.7 | 87.7 | 60.5 | 94.7 | 94.0 | 75.5 | 70.1 | 47.8 | 68.2 | 79.6 |
7 | 82.1 | 68.2 | 55.7 | 61.4 | 91.6 | 59.6 | 47.0 | 43.8 | 53.1 | 79.1 |
12 | 88.1 | 89.5 | 76.0 | 95.5 | 92.8 | 76.7 | 74.5 | 59.6 | 59.0 | 77.2 |
15 | 93.5 | 66.8 | 49.9 | 94.1 | 98.2 | 75.3 | 44.4 | 40.4 | 76.9 | 84.8 |
17 | 90.3 | 66.6 | 73.1 | 94.7 | 96.0 | 76.0 | 53.2 | 61.7 | 81.4 | 84.3 |
Av. | 86.1 | 72.9 | 64.7 | 81.7 | 91.3 | 70.0 | 54.7 | 52.7 | 64.8 | 76.6 |
Unlike STM, which uses a penalized SVM, T-SVM and SA consider neither re-weighting of training instances nor weight refinement for irrelevant samples, such as noise or outliers. DA-SVM, on the other hand, extends T-SVM by progressively labeling test patterns and removing labeled training patterns. Not surprisingly, DA-SVM showed better performance than KMM and T-SVM, because it selected relevant samples for training and thus obtained a better classifier. However, similar to T-SVM, DA-SVM does not update the re-weightings using label information. Moreover, it is not always guaranteed to converge to a correct solution. In our experiments, we encountered situations where DA-SVM failed to converge because a large number of samples lay within the margin bounds. In contrast, STM is a biconvex formulation and is therefore guaranteed to converge to a critical point (details in Sec. 4), and it outperformed existing approaches.
As for multi-source domain adaptation, DAM overall performed comparably in AUC, but significantly worse than STM in F1. There are at least three explanations. First, AUs are by nature imbalanced: a classifier biased toward the negative class can yield a high AUC for infrequent AUs (such as AU 4) yet near-zero precision and recall, and hence a low F1 score. Second, similar to person-specific classifiers, the training samples for each subject are typically insufficient to estimate the true distribution (as discussed in Sec. 8.3.1). Using such limited training samples for each subject therefore limits the power of the base classifiers and of the final prediction in DAM. Finally, DAM uses MMD to estimate inter-subject distance, which can be inaccurate due to insufficient samples or sampling bias (e.g., some subjects show more expressions than others).
Although STM achieved slightly lower AUC in Table 7(a), it showed a larger improvement in the F1 metric, which better suits our imbalanced detection task. A major reason limiting STM’s improvement is that GEMEP-FERA comprises a limited number of subjects and training samples, which hinders STM from selecting and receiving proper support from the training samples. This can also be explained by the findings on selection ability in Sec. 8.3.1. When the number of subjects and training samples increases, as illustrated by the CK+ and RU-FACS datasets in Tables 8 and 7, STM gains contributions from the selected data, and the improvement becomes clearer. Overall, STM achieves the most competitive performance due to the properties of instance re-weighting, weight refinement, and convergence.
8.3.3 Cross-dataset AU detection
Detecting AUs across datasets is challenging because of differences in acquisition and in participant characteristics and behavior. As shown in Fig. 7, participant characteristics, context, background, illumination, camera parameters, and compression schemes are among the differences that may bias features. Generic SVMs fail to address such differences. Secs. 8.3.1 and 8.3.2 showed the effectiveness of STM in within-dataset experiments involving within-subject and cross-subject scenarios. This section shows that STM attains not only subject adaptation but also extends naturally to cross-dataset adaptation. Specifically, we performed two experiments, RU-FACS→GEMEP-FERA and GFT→RU-FACS, using the same settings described above.
Table 9 shows the results. One can observe that the cross-domain approaches outperformed a generic SVM in most cases. This is not surprising, because a generic SVM does not model the biases between datasets. That is, in the cross-dataset scenario the training and test distributions are more likely to differ than in the within-dataset scenario, causing an SVM to fail to transfer knowledge from one dataset to another. Among the cross-domain methods, STM consistently outperforms the others. Observe that STM improved over SVM in Table 7 by 12.8% in AUC (76.5→86.3) and 33.7% in F1 (40.6→54.3), and in Table 9(b) by 37.9% in AUC (55.8→77.0) and 46.1% in F1 (28.6→41.8). The advantages of STM over SVM become clearer in the cross-dataset experiments.
Table 9.
Cross-dataset AU detection: (a) RU-FACS→GEMEP-FERA, and (b) GFT→RU-FACS ("A→B" stands for training on dataset A and testing on B).
(a) RU-FACS→GEMEP-FERA

AU | AUC: SVM | AUC: KMM | AUC: T-SVM | AUC: DA-SVM | AUC: STM | F1: SVM | F1: KMM | F1: T-SVM | F1: DA-SVM | F1: STM |
---|---|---|---|---|---|---|---|---|---|---|
1 | 44.7 | 48.8 | 43.7 | 56.9 | 63.2 | 46.3 | 46.4 | 41.8 | 46.1 | 50.4 |
2 | 52.8 | 70.5 | 52.1 | 52.3 | 74.0 | 47.4 | 54.2 | 38.6 | 45.4 | 54.6 |
4 | 52.7 | 55.4 | 54.2 | 52.7 | 58.6 | 57.1 | 57.1 | 40.2 | 42.9 | 57.4 |
6 | 73.5 | 55.2 | 77.1 | 79.9 | 83.4 | 60.7 | 55.2 | 52.8 | 56.3 | 72.7 |
12 | 56.8 | 60.1 | 70.9 | 76.1 | 78.1 | 67.7 | 67.7 | 63.5 | 62.6 | 71.5 |
15 | 55.1 | 52.1 | 59.3 | 60.2 | 58.6 | 31.5 | 32.8 | 29.7 | 26.4 | 41.1 |
17 | 44.3 | 41.1 | 39.1 | 46.2 | 52.7 | 27.3 | 27.1 | 24.3 | 24.6 | 31.4 |
Av. | 54.3 | 54.8 | 56.6 | 60.6 | 66.9 | 48.3 | 48.6 | 41.6 | 43.5 | 54.2 |
(b) GFT→RU-FACS

AU | AUC: SVM | AUC: KMM | AUC: T-SVM | AUC: DA-SVM | AUC: STM | F1: SVM | F1: KMM | F1: T-SVM | F1: DA-SVM | F1: STM |
---|---|---|---|---|---|---|---|---|---|---|
1 | 45.8 | 63.6 | 70.3 | 71.2 | 73.7 | 23.7 | 29.8 | 26.6 | 31.8 | 38.6 |
2 | 46.4 | 62.8 | 68.5 | 68.2 | 71.7 | 21.3 | 25.4 | 19.4 | 32.1 | 30.2 |
4 | 56.9 | 60.1 | 59.1 | 47.2 | 61.7 | 18.3 | 24.5 | 20.7 | 19.4 | 28.5 |
6 | 65.5 | 73.9 | 81.5 | 74.1 | 93.3 | 42.2 | 46.8 | 30.4 | 38.7 | 61.4 |
12 | 65.3 | 72.1 | 76.3 | 80.9 | 90.3 | 43.2 | 47.6 | 45.8 | 56.8 | 62.2 |
14 | 57.2 | 54.8 | 53.7 | 70.2 | 72.2 | 25.8 | 23.8 | 25.9 | 29.7 | 36.2 |
15 | 56.9 | 61.8 | 64.2 | 65.5 | 80.4 | 23.7 | 30.3 | 28.2 | 29.9 | 37.8 |
17 | 52.4 | 54.5 | 64.8 | 72.6 | 72.6 | 30.8 | 31.5 | 32.3 | 38.9 | 39.5 |
Av. | 55.8 | 62.9 | 67.3 | 68.7 | 77.0 | 28.6 | 32.5 | 28.7 | 34.7 | 41.8 |
8.4 Holistic Expression Detection
Taking individual differences into account, STM showed improvement for AU detection. In this experiment, we ask whether the same holds for holistic expression detection. We used the major benchmarks CK+ [44] and the FERA emotion sub-challenge [67], with the same settings as in Sec. 8.2, except that labels were holistic expressions instead of AUs. Similar to [67], we used every frame of a video for training and testing. Because each video carries only a single expression label rather than frame-by-frame labels, the F1 score is not meaningful in this experiment. For CK+, 327 of the original 593 videos were given a nominal expression label from the 7 basic and discrete expressions: Anger, Contempt, Disgust, Fear, Happy, Sadness, and Surprise. For GEMEP-FERA, 289 portrayals were retained, each labeled with one of five expression states: Anger, Fear, Joy, Sadness, and Relief. The training set included 7 actors with 3~5 instances of each expression per actor. We evaluated on the training set, which contains a total of 155 videos. STM was also compared to the alternative approaches discussed in Sec. 8.3.2.
Table 10(a) shows the results on CK+. Note that DA-SVM is unavailable in this experiment because it failed to converge to a final classifier due to insufficient test data (recall that we used the last one-third of the frames of each video for testing). A generic SVM performed fairly well because positive samples (peak expressions) and negative samples (neutral faces) are relatively easy to separate in CK+. KMM and T-SVM produced suboptimal results because they lack a weight-refinement step, and thus were unable to rectify badly estimated weights when learning the final classifier (see the discussion in Sec. 7). This effect becomes pronounced when test data are scarce, as in this experiment. In contrast, STM uses the labels for weight refinement and performed comparably to a generic SVM.
Table 10.
Expression detection with AUC on (a) CK+ and (b) GEMEP-FERA
(a) CK+

Expression | SVM | KMM | T-SVM | DA-SVM | STM |
---|---|---|---|---|---|
Anger | 95.1 | 85.3 | 76.1 | – | 96.4 |
Contempt | 96.9 | 94.5 | 88.8 | – | 96.9 |
Disgust | 94.5 | 81.6 | 84.2 | – | 96.0 |
Fear | 96.6 | 92.7 | 84.9 | – | 95.5 |
Happy | 99.4 | 93.9 | 86.7 | – | 98.9 |
Sadness | 94.5 | 76.0 | 78.7 | – | 93.3 |
Surprise | 97.3 | 64.5 | 81.8 | – | 97.6 |
Av. | 96.3 | 84.1 | 83.0 | – | 96.4 |
(b) GEMEP-FERA

Expression | SVM | KMM | T-SVM | DA-SVM | STM |
---|---|---|---|---|---|
Anger | 31.1 | 66.5 | 70.4 | 78.8 | 78.6 |
Fear | 31.9 | 81.4 | 64.5 | 83.9 | 85.5 |
Joy | 90.2 | 33.5 | 78.9 | 71.1 | 95.0 |
Relief | 20.4 | 74.8 | 76.8 | 87.9 | 88.4 |
Sadness | 73.4 | 80.2 | 77.1 | 74.7 | 84.8 |
Av. | 49.4 | 67.3 | 73.5 | 79.3 | 86.5 |
Table 10(b) presents the results on GEMEP-FERA, which serves as a larger and more challenging benchmark for holistic expression detection. In this experiment, each test video consists of tens of frames, enabling DA-SVM to converge in most cases. The generic SVM performed poorly due to large variations in this dataset, such as head movements and spontaneous expressions; without the ability to select relevant training samples, it suffered from individual differences. The other cross-domain methods alleviated person-specific biases and produced better results. Overall, STM achieved the best performance. This is further evidence that, as training data grow larger and more complex, the improvement from STM becomes clearer.
8.5 Analysis
8.5.1 Initialization order
A potential concern with STM is that the initialization order could affect its convergence and performance. To evaluate this, we examined initialization with w_0 (STM_w) and with s_0 (STM_s). The standard two-stage approach, i.e., solving for the selection coefficients first and then the penalized SVM (e.g., [31]), can be interpreted as STM_w, as discussed in Sec. 4. To validate the convergence property of STM, we randomized 10 initialization sets for STM_w and STM_s, respectively. Upon convergence, we computed the objective differences between consecutive iterations, g(z^{t+1}) − g(z^t), and the absolute sum of variable differences, ||z^{t+1} − z^t||_1. For cases where STM took fewer iterations to converge, we set the differences of the later iterations to 0.
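A minimal sketch of how these per-run diagnostics could be assembled is given below, assuming each run logs its objective values and stacked variable iterates z^t = [w^t; s^t]; all names are illustrative.

```python
import numpy as np


def convergence_diagnostics(obj_vals, z_iterates, max_iters):
    """Successive objective differences g(z^{t+1}) - g(z^t) and L1 variable
    differences ||z^{t+1} - z^t||_1 for one run, zero-padded at the tail when
    the run converges in fewer than max_iters iterations."""
    obj_diff = np.diff(np.asarray(obj_vals, dtype=float))
    var_diff = np.array([np.abs(z_iterates[t + 1] - z_iterates[t]).sum()
                         for t in range(len(z_iterates) - 1)])

    def pad(a):
        return np.pad(a, (0, max(0, max_iters - len(a))))

    return pad(obj_diff), pad(var_diff)
```

The curves in Fig. 8(a) would then correspond to the column-wise mean and standard deviation of these padded arrays across the 10 random initializations.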
Fig. 8(a) shows the mean and standard deviation of these differences across iterations for STM_w and STM_s. Note that the differences were scaled for ease of visualization. The random initial value is reflected in the first iteration and produces a large difference relative to the second iteration. In both STM_w and STM_s, the objective value and the difference between consecutive variables decreased at each step toward convergence, as detailed theoretically in Sec. 5. Although the resulting solutions differed slightly across initializations, performance remained essentially the same, as both converge to a critical point; we verified this by comparing the confusion matrices during the experiments.
Fig. 8.
Analysis experiments: (a)–(b) Objective and variable differences between iterations with initialization w_0 (STM_w) and s_0 (STM_s), respectively. (c) Performance versus parameter choices. (d) Per-subject F1 score vs. number of training subjects. (e) Exemplar images from the GFT dataset [57].
8.5.2 Parameter choice
Recall that training STM involves two parameters: C, which trades off the maximal margin against the training loss, and λ, which trades off the SVM empirical risk against the domain mismatch. This section examines the sensitivity of performance to these parameter choices. Specifically, we ran the AU12 detection experiment on the CK+ dataset with C ∈ {2^{−10}, ..., 2^{10}} and λ ∈ {2^{−10}, ..., 2^{10}}. Following the experimental settings in Sec. 8.2, we used the leave-one-subject-out protocol and computed an averaged F1 score. We used a Gaussian kernel with the bandwidth fixed to the median distance between sample points.
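As a concrete reading of this setup, the median-heuristic bandwidth and the (C, λ) grid could be formed as sketched below; median_bandwidth and param_grid are illustrative names, and each grid point would be evaluated with the leave-one-subject-out protocol described above.

```python
import numpy as np
from itertools import product
from scipy.spatial.distance import pdist


def median_bandwidth(X):
    """Median heuristic: set the Gaussian-kernel bandwidth to the median
    pairwise Euclidean distance between sample points."""
    return np.median(pdist(X))


# Illustrative grid over the parameter ranges used in this experiment.
param_grid = [(2.0 ** c, 2.0 ** l)
              for c, l in product(range(-10, 11), range(-10, 11))]
```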
Fig. 8(c) illustrates the contour plot of F1 score vs. parameter pairs (log_2 C, log_2 λ). Performance is relatively uniform over most of the plot, showing that STM is robust to parameter choices within a reasonable range. Performance decays when both C and λ become extremely small (< 2^{−6}), as shown in the bottom left of the plot. This is not surprising, because smaller values of C and λ imply less emphasis on the training loss and on personalization, respectively. Note that, with a sufficiently large λ, STM does not need a large C to achieve comparable F1, suggesting that personalization reduces the need for a large C and thereby helps avoid overfitting. As a general guideline, we suggest a small value of C with a moderate λ (thus encouraging a large-margin decision boundary while accounting for the distribution mismatch).
We note that cross-validation (CV) for domain adaptation methods is difficult and remains an open research issue. As mentioned in [64], the issue becomes critical in the conventional scenario where the number of training samples is much smaller than the number of test samples. In our case, however, we always have many more training samples than test samples, so the CV process is less biased under covariate shift. In addition, as shown in Fig. 2 of [64], with a proper σ (kernel bandwidth) and standard CV, KMM consistently reaches lower error than KL-divergence-based CV [64]. This supports KMM's ability to estimate importance weights.
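For completeness, the idea behind covariate-shift-aware validation is simply to weight each held-out loss by the estimated density ratio of its sample. The snippet below is a generic sketch of that idea (a normalized importance-weighted average), not the exact procedure evaluated in [64].

```python
import numpy as np


def importance_weighted_error(losses, weights):
    """Normalized importance-weighted average of held-out losses, where each
    weight is the estimated density ratio (test over training) of its sample."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((weights * losses).sum() / weights.sum())
```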
8.5.3 Domain size
The intuition behind STM's advantage in facial expression analysis is its judicious selection of training samples. A more diverse source domain gives STM a broader pool from which to select candidates that match the test distribution. This experiment examines how performance changes with the diversity of the source domain, which we measured by the domain size, i.e., the number of training subjects. Intuitively, the more training subjects, the more diverse the training domain, and the more likely STM is to perform well. We compared STM to a generic SVM (with cross-validation) as a point of contrast.
This experiment was performed on AU 12 using the RU-FACS dataset. Subsets of 3 to 27 training subjects were randomly sampled to form reduced domains. The leave-one-subject-out protocol and F1 score were used, following Sec. 8.2. Fig. 9(a) illustrates the effect of the number of training subjects on averaged F1 scores. For each domain size, the mean and standard deviation were computed over the F1 scores of all test subjects. Test subjects without true positives were excluded because their precision and F1 scores are not computable. As the number of training subjects grew, STM achieved higher F1 scores and performed more consistently, with lower standard deviation. This echoes Sec. 8.3.2, where a source domain with poor diversity was shown to limit STM's performance. The generic classifier, on the other hand, improved until 12 training subjects; as more training subjects were introduced, its performance declined slightly due to biases caused by individual differences. Note that, because training subjects were subsampled at random, STM may occasionally achieve better performance on a domain with fewer training subjects.
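The protocol just described can be summarized in a short, hypothetical sketch; fit_predict stands for either the generic SVM or STM trained on the sampled subjects, and all names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score


def domain_size_eval(data, fit_predict, n_train_subjects, rng):
    """For each held-out subject, sample n_train_subjects from the remaining
    subjects, train via fit_predict, and collect F1; subjects without true
    positives are skipped because their precision (and hence F1) is undefined.
    data: dict mapping subject id -> (X, y) with y in {0, 1}."""
    scores = []
    subjects = sorted(data)
    for test_subj in subjects:
        X_te, y_te = data[test_subj]
        if (np.asarray(y_te) == 1).sum() == 0:
            continue
        pool = [s for s in subjects if s != test_subj]
        picked = rng.choice(pool, size=n_train_subjects, replace=False)
        X_tr = np.vstack([data[s][0] for s in picked])
        y_tr = np.concatenate([data[s][1] for s in picked])
        scores.append(f1_score(y_te, fit_predict(X_tr, y_tr, X_te)))
    return float(np.mean(scores)), float(np.std(scores))
```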
Fig. 9.
Performance versus domain size: (a) Average and standard deviation of F1 scores on RU-FACS. (b) and (c) show the F1 scores on the GFT dataset before and after removing the outlier subjects.
As further evidence, we examined the effect of domain size on the GFT dataset [57], which contains more subjects and more intense facial expressions than RU-FACS. GFT records real-life social interactions among three-person groups in less constrained contexts. Videos were recorded with separate wall-mounted cameras facing each subject; Fig. 8(e) shows exemplar frames. The videos include moderate-to-large head rotations and frequent occlusions; facial movements are spontaneous and unscripted. We selected 50 videos of around 3 minutes each (5,400 frames).
Following the same procedure, we randomly picked subsets of 4 to 49 subjects as reduced domains. Fig. 9(b) shows the F1 scores with respect to the number of training subjects. The averaged F1 score increases with the number of training subjects, although the standard deviation fluctuates. To study this fluctuation, we broke the averaged F1 down into individual subjects for each training size, as shown in Fig. 8(d). Each row represents a test video; each column represents one number of training subjects (ranging from 4 to 49). Note that subject 4 (the 4th row) has no F1 score because AU 12 was absent. For 6 outlier subjects (rows 19, 20, 39, 40, 47, 48), F1 scores remained low even as the number of training subjects increased, suggesting that these subjects share few or no similar instances with the training pool in feature space. Visual inspection of their data was consistent with this hypothesis: the outliers had darker skin color, asymmetric smiles, or relatively large head pose variations. For these subjects, STM could offer no benefit. This finding suggests the need for greater heterogeneity among training subjects. When these subjects were omitted, as shown in Fig. 9(c), the F1 scores were markedly higher; the influence of domain size becomes clear and replicates Fig. 9(a). Interestingly, the performance of the generic classifier increased until 24 training subjects and then dropped abruptly, which is further evidence that individual differences (introduced by an increasing number of training subjects) can bias generic classifiers.
Comparing the two experiments, the averaged F1 score on GFT is generally higher than on RU-FACS. At least two factors may account for this difference. One is that participants in GFT may have been less inhibited and more expressive. In RU-FACS, subjects were motivated to convince an examiner of their truthfulness and knew they would be penalized if they were not believed. In the three-person social interaction of GFT, there were no such negative contingencies; subjects may have felt more relaxed and become more expressive, and more intense AUs are more easily detected. The other factor is that inter-observer reliability of the ground-truth FACS labels was likely much higher for GFT than for RU-FACS. Kappa coefficients for GFT were exceptionally good. While reliability figures for RU-FACS are not available, we know from past confirmation coding that inter-observer agreement was not as high. Less error in the GFT ground truth would contribute to more accurate classifier performance.
8.6 Discussion
In the above experiments, we evaluated STM against alternative methods in several scenarios: within-subject (Sec. 8.3.1), across-subject (Sec. 8.3.2), across-dataset (Sec. 8.3.3), and holistic expression detection (Sec. 8.4). We also analyzed STM's initialization order and its sensitivity to parameters and domain size (Sec. 8.5). STM consistently outperformed a generic SVM and most transfer learning methods. The advantage of STM is clearest in GFT, where the variety of subjects is most extensive, and somewhat less so in RU-FACS. The results indicate clearer improvements in F1 than in AUC, in large, complex datasets than in posed ones, in the cross-dataset scenario than in the within-dataset scenario, and with more training subjects than with fewer.
STM has some limitations. For example, it suffers when training subjects are scarce or when there is a severe mismatch between the training and test distributions, which are common drawbacks of unsupervised domain adaptation methods. Regarding a theoretical analysis of performance vs. the number of samples, Corollary 1.9 for KMM [31] gives a transductive bound on the estimated risk of a re-weighted task, under the assumptions of a linear loss and iid data. However, it remains unclear how to analyze STM's performance theoretically in terms of the number of test samples, because STM involves nonlinear loss functions and the data come from real-world videos (non-iid).
9 Conclusion and Future Work
Motivated by the observation of individual differences, this paper proposed the Selective Transfer Machine (STM) for personalized facial expression analysis. We showed that STM leads to a biconvex problem and proposed an alternating algorithm with a primal solution. In addition, we introduced L-STM, an extension of STM that exhibits significant improvement when labeled test data are available. Our results on both AU and holistic expression detection suggest that STM improves test performance by selecting training samples whose distribution is close to that of the test data. Experiments in within-subject, cross-subject, and cross-dataset scenarios revealed two insights: (1) training samples matter unevenly for a given test subject, and (2) increasing the variety of training subjects brings additional value in improving performance.
It is worth noting that STM can be extended to other classifiers with convex decision functions and losses, such as logistic regression; this is a direct outcome of Property 1 in Sec. 5.1. For non-convex cases, such as random forests, local minima could degrade performance. We leave extensions to non-convex classifiers for future work. Improving STM's training speed is another direction, given the QP required to solve for s. Finally, while this study focuses its evaluation on facial expressions, STM could be applied to other fields where object-specific issues arise, e.g., object or activity recognition.
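As an illustration of the first point, the instance-weighted SVM step could be swapped for an instance-weighted logistic regression within the same alternating scheme, since the logistic loss is also convex. The following is a sketch under that assumption, not part of the proposed method; the function name is ours.

```python
from sklearn.linear_model import LogisticRegression


def weighted_logreg_step(X_tr, y_tr, s, C=1.0):
    """One inner classifier step with the logistic loss instead of the hinge
    loss: an instance-weighted logistic regression fit with the current
    selection coefficients s."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_tr, y_tr, sample_weight=s)
    return clf
```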
Table 4.
Details of the datasets
Datasets | #Subjects | #Videos | #Frames/video | Content | AU annotation | Expression annotation |
---|---|---|---|---|---|---|
CK+ [44] | 123 | 593 | ~20 | Neutral→peak | Per video | Per video |
GEMEP-FERA [67] | 7 | 87 | 20~60 | Acting | Frame-by-frame | Per video |
RU-FACS [4] | 34 | 34 | 5000~8000 | Interview | Frame-by-frame | – |
GFT [57] | 720 | 720 | ~60,000 | Multi-person social interaction | Frame-by-frame | – |
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive feedback. Research reported in this paper was supported in part by the National Institutes of Health (NIH) under award number R01MH096951, the National Science Foundation (NSF) under grant RI-1116583, and the Army Research Laboratory Collaborative Technology Alliance Program under cooperative agreement W911NF-10-2-0016. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF.
Biographies
Wen-Sheng Chu received his B.S. and M.S. degrees in computer science from National Cheng Kung University, Tainan, Taiwan. Currently, he is working toward the Ph.D. degree at the Robotics Institute, Carnegie Mellon University. His research interests lie in the development and use of machine learning techniques for computer vision problems. He is a student member of the IEEE and a member of the Phi Tau Phi Scholastic Honor Society.
Fernando De la Torre is an Associate Research Professor in the Robotics Institute at Carnegie Mellon University. He received his B.Sc. degree in Telecommunications, and his M.Sc. and Ph.D. degrees in Electronic Engineering, from La Salle School of Engineering at Ramon Llull University, Barcelona, Spain, in 1994, 1996, and 2002, respectively. His research interests are in the fields of Computer Vision and Machine Learning. Currently, he directs the Component Analysis Laboratory (http://ca.cs.cmu.edu) and the Human Sensing Laboratory (http://humansensing.cs.cmu.edu) at Carnegie Mellon University. He has over 150 publications in refereed journals and conferences and is an Associate Editor of IEEE TPAMI. He has organized and co-organized several workshops and has given tutorials at international conferences on component analysis.
Jeffrey F. Cohn is Professor of Psychology and Psychiatry at the University of Pittsburgh and Adjunct Professor of Computer Science at the Robotics Institute at Carnegie Mellon University. He leads interdisciplinary and inter-institutional efforts to develop advanced methods for the automatic analysis and synthesis of facial expression and prosody, and applies those tools to research in human emotion, social development, non-verbal communication, psychopathology, and biomedicine. He has served as Co-Chair of the 2008 IEEE International Conference on Automatic Face and Gesture Recognition (FG2008), the 2009 International Conference on Affective Computing and Intelligent Interaction (ACII2009), and the 2014 International Conference on Multi-modal Interfaces (ACM 2014), and serves on the Steering Committee of the IEEE International Conference on Automatic Face and Gesture Recognition. He has co-edited special issues of the Journal of Image and Vision Computing and is a Co-Editor of the IEEE Transactions on Affective Computing (TAC).
Footnotes
Bold capital letters denote a matrix X; bold lower-case letters denote a column vector x. x_i represents the i-th column of the matrix X. All non-bold letters represent scalars. x_j denotes the scalar in the j-th element of x. I_n ∈ ℝ^{n×n} is the identity matrix.
References
- 1.Aumann R, Hart S. Bi-convexity and bi-martingales. Israel Journal of Mathematics. 1986;54(2):159–180. [Google Scholar]
- 2.Aytar Y, Zisserman A. Tabula rasa: Model transfer for object category detection. ICCV. 2011 [Google Scholar]
- 3.Baktashmotlagh M, Harandi MT, Lovell BC, Salzmann M. Unsupervised domain adaptation by domain invariant projection. ICCV. 2013 [Google Scholar]
- 4.Bartlett M, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J. Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia. 2006;1(6):22–35. [Google Scholar]
- 5.Borgwardt KM, Gretton A, Rasch MJ, Kriegel HP, Schölkopf B, Smola AJ. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics. 2006;22(14):49–57. doi: 10.1093/bioinformatics/btl242. [DOI] [PubMed] [Google Scholar]
- 6.Bruzzone L, Marconcini M. Domain adaptation problems: A dasvm classification technique and a circular validation strategy. PAMI. 2010;32(5):770–787. doi: 10.1109/TPAMI.2009.57. [DOI] [PubMed] [Google Scholar]
- 7.Chang C-C, Lin C-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2:27:1–27:27. software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. [Google Scholar]
- 8.Chang C-Y, Huang V-C. Personalized facial expression recognition in indoor environments. International Joint Conference on Neural Networks; 2010. [Google Scholar]
- 9.Chang K-Y, Liu T-L, Lai S-H. Learning partially-observed hidden conditional random fields for facial expression recognition. CVPR. 2009 [Google Scholar]
- 10.Chang Y, Hu C, Feris R, Turk M. Manifold based analysis of facial expression. Image and Vision Computing. 2006;24(6):605–614. [Google Scholar]
- 11.Chapelle O. Training a support vector machine in the primal. Neural Computation. 2007;19(5):1155–1178. doi: 10.1162/neco.2007.19.5.1155. [DOI] [PubMed] [Google Scholar]
- 12.Chattopadhyay R, Sun Q, Fan W, Davidson I, Panchanathan S, Ye J. Multi-Source domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data. 2012;6(4):18. [Google Scholar]
- 13.Chen J, Kim M, Wang Y, Ji Q. Switching gaussian process dynamic models for simultaneous composite motion tracking and recognition. CVPR. 2009 [Google Scholar]
- 14.Chen J, Liu X, Tu P, Aragones A. Learning person-specific models for facial expression and action unit recognition. Pattern Recognition Letters. 2013;34(15):1964–1970. [Google Scholar]
- 15.Chew SW, Lucey P, Lucey S, Saragih J, Cohn JF, Sridharan S. Person-independent facial expression detection using constrained local models. Automatic Face & Gesture Recognition. 2011 [Google Scholar]
- 16.Chew SW, Lucey P, Lucey S, Saragih J, Cohn JF, Matthews I, Sridharan S. In the pursuit of effective affective computing: The relationship between features and registration. IEEE Trans on Systems Man and Cybernetics Part B: Cybernetics. 2012;42(4):1006–1016. doi: 10.1109/TSMCB.2012.2194485. [DOI] [PubMed] [Google Scholar]
- 17.Collobert R, Sinz F, Weston J, Bottou L. Large scale transductive svms. Journal of Machine Learning Research. 2006;7:1687–1712. [Google Scholar]
- 18.Cootes TF, Edwards GJ, Taylor CJ. Active appearance models. PAMI. 2001;23(6):681–685. [Google Scholar]
- 19.Daumé H., III Frustratingly easy domain adaptation. Conference of the Association for Computational Linguistics (ACL); 2007. [Google Scholar]
- 20.De la Torre F, Cohn JF. Facial expression analysis. Visual Analysis of Humans: Looking at People. 2011:377. [Google Scholar]
- 21.Ding X, Chu W-S, De la Torre F, Cohn JF. Facial action unit event detection by cascade of tasks. ICCV. 2013 doi: 10.1109/ICCV.2013.298. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Duan L, Tsang IW, Xu D. Domain transfer multiple kernel learning. PAMI. 2012;34(3):465–479. doi: 10.1109/TPAMI.2011.114. [DOI] [PubMed] [Google Scholar]
- 23.Duan L, Xu D, Tsang IW. Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Trans on Neural Networks and Learning Systems. 2012;23(3):504–518. doi: 10.1109/TNNLS.2011.2178556. [DOI] [PubMed] [Google Scholar]
- 24.Dudık M, Schapire RE, Phillips SJ. Correcting sample selection bias in maximum entropy density estimation. NIPS. 2005 [Google Scholar]
- 25.Fernando B, Habrard A, Sebban M, Tuytelaars T. Unsupervised visual domain adaptation using subspace alignment. ICCV. 2013 [Google Scholar]
- 26.Floudas C, Visweswaran V. A global optimization algorithm (GOP) for certain classes of nonconvex NLPs—I. Theory. Computers & Chemical Engineering. 1990;14(12):1397–1417. [Google Scholar]
- 27.Gehrig T, Ekenel HK. A common framework for real-time emotion recognition and facial action unit detection. CVPR Workshop. 2011 [Google Scholar]
- 28.Girard JM, Cohn JF, Jeni LA, Sayette MA, De la Torre F. Spontaneous facial expression in unscripted social interactions can be measured automatically. Behavior research methods. 2014 doi: 10.3758/s13428-014-0536-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Goldberg DA, Goldberg MB, Goldberg MD, Goldberg BM. Obtaining person-specific images in a public venue. US Patent 7,561,723, 2009.
- 30.Gorski J, Pfeuffer F, Klamroth K. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research. 2007;66(3):373–407. [Google Scholar]
- 31.Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B. Covariate shift by kernel mean matching. Dataset shift in machine learning. 2009:131–160. [Google Scholar]
- 32.Gunes H, Piccardi M. Affect recognition from face and body: early fusion vs. late fusion. International Conference on Systems, Man and Cybernetics; 2005. pp. 3437–3443. [Google Scholar]
- 33.Guo Y, Zhao G, Pietikäinen M. Dynamic facial expression recognition using longitudinal facial expression atlases. ECCV. 2012 doi: 10.1109/TIP.2016.2537215. [DOI] [PubMed] [Google Scholar]
- 34.Jeni LA, Cohn JF, Kanade T. Dense 3d face alignment from 2d videos in real-time. IEEE International Conference on Automatic Face and Gesture Recognition; 2015; [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Jiang B, Valstar MF, Pantic M. Action unit detection using sparse appearance descriptors in space-time video volumes. Automatic Face & Gesture Recognition. 2011 [Google Scholar]
- 36.Joachims T. Transductive inference for text classification using support vector machines. ICML. 1999 [Google Scholar]
- 37.Kanade T, Cohn JF, Tian Y. Comprehensive database for facial expression analysis. Automatic Face & Gesture Recognition. 2000 [Google Scholar]
- 38.Kapoor A, Picard RW. Multimodal affect recognition in learning environments. ACMMM. 2005 [Google Scholar]
- 39.Khosla A, Zhou T, Malisiewicz T, Efros A, Torralba A. Undoing the damage of dataset bias. ECCV. 2012 [Google Scholar]
- 40.Kulis B, Saenko K, Darrell T. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. CVPR. 2011 [Google Scholar]
- 41.Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, Bartlett M. The computer expression recognition toolbox (CERT) Automatic Face & Gesture Recognition. 2011 [Google Scholar]
- 42.Liu M, Li S, Shan S, Chen X. AU-aware deep networks for facial expression recognition. IEEE Conference on Automatic Face and Gesture Recognition; 2013. [Google Scholar]
- 43.Lucas BD, Kanade T. An iterative image registration technique with an application to stereo vision. International Joint Conference on Artificial Intelligence; 1981. [Google Scholar]
- 44.Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I. The extended cohn-kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. CVPR Workshops. 2010 [Google Scholar]
- 45.Lucey S, Ashraf AB, Cohn J. Investigating spontaneous facial action recognition through aam representations of the face. Face recognition. 2007:275–286. [Google Scholar]
- 46.Martinez A, Du S. A model of the perception of facial expressions of emotion by humans: Research overview and perspectives. Journal of Machine Learning Research. 2012;13:1589–1608. [PMC free article] [PubMed] [Google Scholar]
- 47.Matthews I, Baker S. Active appearance models revisited. IJCV. 2004;60(2):135–164. [Google Scholar]
- 48.Orrite C, Gañán A, Rogez G. Hog-based decision tree for facial expression classification. Pattern Recognition and Image Analysis. 2009 [Google Scholar]
- 49.Pantic M, Bartlett MS. Machine analysis of facial expressions. Face recognition. 2007:377–416. [Google Scholar]
- 50.Quionero-Candela J, Sugiyama M, Schwaighofer A, Lawrence N. Dataset shift in machine learning. The MIT Press; 2009. [Google Scholar]
- 51.Romera-Paredes B, Aung MS, Pontil M, Bianchi-Berthouze N, de Williams AC, Watson P. Transfer learning to account for idiosyncrasy in face and body expressions. AFGR. 2013 [Google Scholar]
- 52.Rudovic O, Pavlovic V, Pantic M. Kernel conditional ordinal random fields for temporal segmentation of facial action units. ECCV Workshop. 2012 [Google Scholar]
- 53.Rudovic O, Pavlovic V, Pantic M. Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. TPAMI. 2015;37(5):944–958. doi: 10.1109/TPAMI.2014.2356192. [DOI] [PubMed] [Google Scholar]
- 54.Saenko K, Kulis B, Fritz M, Darrell T. Adapting visual category models to new domains. ECCV. 2010 [Google Scholar]
- 55.Sangineto E, Zen G, Ricci E, Sebe N. We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer. ACM International Conf on Multimedia. 2014 [Google Scholar]
- 56.Sariyanidi E, Gunes H, Cavallaro A. Automatic analysis of facial affect: A survey of registration, representation, and recognition. TPAMI. 2015;37(6):1113–1133. doi: 10.1109/TPAMI.2014.2366127. [DOI] [PubMed] [Google Scholar]
- 57.Sayette MA, Creswell KG, Dimoff JD, Fairbairn CE, Cohn JF, Heckman BW, Kirchner TR, Levine JM, Moreland RL. Alcohol and group formation a multimodal investigation of the effects of alcohol on emotion and social bonding. Psychological Science. 2012;23(8):869–878. doi: 10.1177/0956797611435134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Senechal T, Rapp V, Salam H, Seguier R, Bailly K, Prevost L. Facial action recognition combining heterogeneous features via multikernel learning. IEEE Trans on Systems Man and Cybernetics Part B: Cybernetics. 2012;42(4):993–1005. doi: 10.1109/TSMCB.2012.2193567. [DOI] [PubMed] [Google Scholar]
- 59.Shang L, Chan K. Nonparametric discriminant HMM and application to facial expression recognition. CVPR. 2009 [Google Scholar]
- 60.Sikka K, Wu T, Susskind J, Bartlett M. Exploring bag of words architectures in the facial expression domain. ECCV Workshops. 2012 [Google Scholar]
- 61.Simon T, Nguyen MH, De La Torre F, Cohn JF. Action unit detection with segment-based svms. CVPR. 2010 [Google Scholar]
- 62.Sivic J, Zisserman A. Video google: A text retrieval approach to object matching in videos. ICCV. 2003 [Google Scholar]
- 63.Sugiyama M, Krauledat M, Müller K. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research. 2007;8:985–1005. [Google Scholar]
- 64.Sugiyama M, Nakajima S, Kashima H, Buenau PV, Kawanabe M. Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS. 2007 [Google Scholar]
- 65.Tong Y, Liao W, Ji Q. Facial action unit recognition by exploiting their dynamic and semantic relationships. PAMI. 2007;29(10):1683–1699. doi: 10.1109/TPAMI.2007.1094. [DOI] [PubMed] [Google Scholar]
- 66.Torralba A, Efros A. Unbiased look at dataset bias. CVPR. 2011 [Google Scholar]
- 67.Valstar MF, Mehu M, Jiang B, Pantic M, Scherer K. Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems Man and Cybernetics Part B. 2012;42(4):966–979. doi: 10.1109/TSMCB.2012.2200675. [DOI] [PubMed] [Google Scholar]
- 68.Valstar MF, Pantic M. Fully automatic recognition of the temporal phases of facial actions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 2012;42(1):28–43. doi: 10.1109/TSMCB.2011.2163710. [DOI] [PubMed] [Google Scholar]
- 69.Van Gestel T, Suykens JA, Baesens B, Viaene S, Vanthienen J, Dedene G, De Moor B, Vandewalle J. Benchmarking least squares support vector machine classifiers. Machine Learning. 2004;54(1):5–32. [Google Scholar]
- 70.Wang Z, Wang S, Ji Q. Capturing complex spatio-temporal relations among facial muscles for facial expression recognition. CVPR. 2013 [Google Scholar]
- 71.Wendell RE, Hurter AP. Minimization of a non-separable objective function subject to disjoint constraints. Operations Research. 1976;24(4):643–657. [Google Scholar]
- 72.Whitehill J, Bartlett MS, Movellan JR. Automatic facial expression recognition. Social Emotions in Nature and Artifact. 2013;88 [Google Scholar]
- 73.Wu T, Bartlett MS, Movellan J. Facial expression recognition using gabor motion energy filters. CVPR Workshop. 2010 [Google Scholar]
- 74.Xiong X, De la Torre F. Supervised descent method and its applications to face alignment. CVPR. 2013 [Google Scholar]
- 75.Yamada M, Sigal L, Raptis M. No bias left behind: Covariate shift adaptation for discriminative 3d pose estimation. ECCV. 2012 doi: 10.1109/TPAMI.2013.123. [DOI] [PubMed] [Google Scholar]
- 76.Yang J, Yan R, Hauptmann A. Cross-domain video concept detection using adaptive svms. International Conference on Multimedia; 2007. [Google Scholar]
- 77.Yang P, Liu Q, Metaxas DN. Exploring facial expressions with compositional features. CVPR. 2010 [Google Scholar]
- 78.Yang S, Rudovic O, Pavlovic V, Pantic M. Personalized modeling of facial action unit intensity. Advances in Visual Computing. 2014:269–281. [Google Scholar]
- 79.Zen G, Sangineto E, Ricci E, Sebe N. Unsupervised domain adaptation for personalized facial emotion recognition. International Conference on Multimodal Interaction. [Google Scholar]
- 80.Zeng J, Chu W-S, De la Torre F, Cohn JF, Xiong Z. Confidence preserving machine for facial action unit detection. ICCV. 2015 doi: 10.1109/TIP.2016.2594486. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Zhang H, Berg AC, Maire M, Malik J. Svm-knn: Discriminative nearest neighbor classification for visual category recognition. CVPR. 2006 [Google Scholar]
- 82.Zhao G, Pietikainen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. PAMI. 2007;29(6):915–928. doi: 10.1109/TPAMI.2007.1110. [DOI] [PubMed] [Google Scholar]
- 83.Zhao K, Chu W-S, De la Torre F, Cohn JF, Zhang H. Joint patch and multi-label learning for facial action unit detection. CVPR. 2015 doi: 10.1109/CVPR.2015.7298833. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Zhong L, Liu Q, Yang P, Liu B, Huang J, Metaxas D. Learning active facial patches for expression analysis. CVPR. 2012 doi: 10.1109/TCYB.2014.2354351. [DOI] [PubMed] [Google Scholar]
- 85.Zhou F, De la Torre F, Cohn JF. Unsupervised discovery of facial events. CVPR. 2010 [Google Scholar]
- 86.Zhu Y, De la Torre F, Cohn JF, Zhang YJ. Dynamic cascades with bidirectional bootstrapping for action unit detection in spontaneous facial behavior. IEEE Transactions on Affective Computing. 2011;2(2):79–91. doi: 10.1109/T-AFFC.2011.10. [DOI] [PMC free article] [PubMed] [Google Scholar]