Published in final edited form as: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 604–614. doi: 10.1109/CVPRW63382.2024.00065

Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation

Swati Jindal 1, Mohit Yadav 2, Roberto Manduchi 1

Abstract

Gaze is an essential cue for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to account for individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by 2.5° without personalization. Further, by personalizing the model with just three samples, we achieve an additional improvement of 0.8°. The code and pre-trained models are available at https://github.com/jswati31/stage.

1. Introduction

The human gaze is an essential cue for conveying people’s intent, making it promising for real-world applications such as human-robot interaction [37, 41], AR/VR [40, 49], and saliency detection [48, 54]. In addition, gaze plays a vital role in several computer vision tasks, including but not limited to object detection [62], visual attention [9] and action recognition [35]. Despite the primary research emphasis on gaze estimation from images, the potential benefits of understanding the temporal dynamics of eye movements for video gaze estimation have been relatively overlooked. Constructing an accurate video-based gaze estimation model requires addressing the unique challenges inherent to videos. These include the evolution of eye movements throughout the video, correlations between gaze directions in successive frames, the predominance of a static background in most pixels, and variations due to individual-specific traits [31, 32, 46]. This work responds to these challenges by aiming to develop an accurate gaze estimation technique for videos using deep networks.

Realizing the potential of spatial and motion cues in videos, prior research has utilized residual frames and optical flow for several other vision tasks [13, 56, 66]. Specifically, these methods integrate RGB and residual frames as separate input streams, which leads to larger models with higher inference time and memory requirements [17, 26, 65]. Similarly, 3D convolutional neural networks (CNNs) can also capture spatiotemporal information from videos, but they require many model parameters [3, 14, 23, 29, 59, 67]. In addition, it is non-trivial to transfer knowledge from pre-trained 3D CNNs to new video tasks, as most pre-trained models rely on large 2D image datasets such as ImageNet [11]. Despite the critical role of detecting spatial and motion cues in videos, there is a strong need for efficient attention-based approaches to video-related tasks, including video gaze estimation.

In this work, we draw inspiration from the change captioning task to develop an approach for video gaze estimation. The change captioning task requires describing the changes between a pair of before and after images, expressed through a natural language sentence [43, 50, 60]. Both change captioning and gaze estimation tasks require differentiating irrelevant distractors, such as background movement and facial expression changes, from the relevant ones. Specifically, change captioning focuses on recognizing object movements, whereas gaze estimation concentrates on detecting eye movements. Similar to prior works [43, 50], our approach utilizes a spatial attention mechanism to focus on gaze-relevant information while minimizing the impact of distractors. For example, Figure 1 illustrates various distractors that may obfuscate gaze information in videos.

Figure 1.

The figure illustrates a range of irrelevant factors for video gaze estimation, also referred to as distractors: (a) and (b) depict alterations in facial expression, (c) highlights background movement, and (d) represents a scenario without any distractors. These examples show the importance of accurately distinguishing spatial changes due to eye movements from irrelevant distractors for the video gaze estimation task.

We introduce Spatio-Temporal Attention for Gaze Estimation (STAGE), a deep learning model for video gaze estimation. STAGE utilizes spatial changes in consecutive frames to integrate motion cues via a Spatial Attention Module (SAM) and captures global dynamics with a Temporal Sequence Model (TSM). The SAM module focuses on gaze-relevant information by applying local spatial attention between consecutive frames and effectively suppresses irrelevant distractors. Meanwhile, the TSM considers global dynamic movements across the temporal dimension, enabling enhanced prediction of gaze direction sequences. STAGE adeptly encodes motion information through the attention modules with fewer parameters than existing approaches like 3D CNNs [23] or two-branch networks [26], thus offering a more feasible solution for real-world applications.

To enhance the accuracy of gaze estimation models, previous studies have suggested personalization to address significant variability in individual-specific traits, such as eye geometry and appearance [6, 32, 46]. Concretely, this is done by training a person-agnostic gaze model on a large labeled dataset and then fine-tuning it for individual users with a small set of labeled data. Consistent with this approach, we integrate Gaussian processes (GPs) [52], known for their effectiveness in low-data scenarios, to personalize the STAGE model for individual users.

We use GPs to learn an additive bias correction and personalize the gaze estimate of the general STAGE model with just a few labeled samples. GPs enable the estimation of personalized 3D gaze directions and provide uncertainty measurements in interval form. These intervals represent a range of possible gaze directions instead of a single vector, making our approach more suitable for practical applications, such as monitoring attention on screens [2, 72]. To evaluate the efficacy of the proposed STAGE model and personalization using GPs, we use three public video gaze datasets: EYEDIAP [16], Gaze360 [27] and EVE [47].

Our primary contributions are as follows:

  • We introduce STAGE, a novel model for video gaze estimation. STAGE leverages an attention mechanism that is sensitive to spatial changes in sequential frames, effectively extracting gaze-relevant details from videos and facilitating gaze prediction along the temporal axis.

  • We propose a sample-efficient approach to personalize STAGE, aiming to learn a bias correction model for gaze prediction using pre-trained Gaussian processes [68].

  • Our approach either surpasses or matches the state-of-the-art performance on three publicly available datasets for video gaze estimation. In particular, we obtain state-of-the-art results on the Gaze360 dataset in both cross-dataset and within-dataset experimental settings.

2. Related Work

Traditional methods of gaze estimation use an eye geometry model and a regression function to map eye or face images to the gaze vector [19, 20, 25, 34, 38, 61]. While these methods perform well in controlled settings with consistent subject features, head positions, and lighting, their precision tends to drop in more varied and less controlled environments [71].

Recently, with the emergence of deep learning, researchers have employed CNNs to predict gaze direction directly from eye or face images [8, 15, 22, 28, 58, 69]. Image-based gaze estimation methods primarily use eye images to predict gaze directions [30, 44, 45, 69]. Additionally, several approaches consider facial cues such as head pose and facial appearance for estimating gaze [18, 28, 53, 70]. Generally, using full-face information yields more accurate results than relying solely on eye images [70]. Similarly, our work relies on full-face images for extracting gaze information.

Following the release of video gaze datasets [16, 27], several temporal gaze estimation models have emerged. These models are designed to predict gaze direction from a sequence of images. The initial work of Palmero et al. [42] employs a recurrent CNN that concatenates the static features of each frame and feeds them into a recurrent module, which predicts the 3D gaze direction of the final frame in the sequence. Kellnhofer et al. [27] proposed a bidirectional LSTM that utilizes both past and future frames, indirectly incorporating spatial information. Wang et al. [64] released a dataset that captures eye images and ground-truth gaze positions on a screen while subjects engage in activities like browsing websites or watching videos; they proposed a dynamic gaze transition network to detect the transitions of eye movements over time and refine static gaze predictions using the dynamics learned from these transitions. Similarly, Park et al. [47] collected a large video gaze dataset and proposed a recurrent model to refine Point of Gaze (PoG) estimates for video input. Our work aims to develop a video gaze estimation method by capturing the nuanced spatial and temporal dynamics.

As stated earlier, the performance of gaze estimators can be notably influenced by individual-specific traits, particularly when adapting these models to new subjects [19]. However, in practical scenarios, typically only a few labeled samples are available per subject, which is insufficient for fine-tuning contemporary deep learning models that tend to be over-parameterized [46]. Previously, Liu et al. [32] utilized a Siamese network to estimate gaze differences, employing a small number of calibration samples for personalization. Similarly, Park et al. [46] employed meta-learning techniques to achieve few-shot personalization, leveraging learned gaze embeddings. Chen and Shi [6] introduced a method to model person-specific biases during the training phase, enabling personalization during testing with just a few samples. Our personalization approach is motivated by the efficacy of Gaussian processes in scenarios with limited data [52]. Unlike Chen and Shi [6], our personalization approach outputs a different bias for each video frame and is designed to be compatible with any existing gaze estimation technique without necessitating alterations to the training objective.

3. Proposed Method

The main goal of video gaze estimation is to learn a deep network $f: V \mapsto G$ that maps a sequence of video frames $V \in \mathbb{R}^{n \times h_0 \times w_0 \times 3}$ to a sequence of gaze directions $G \in \mathbb{R}^{n \times 2}$, where $n$ is the number of frames and $h_0$ and $w_0$ are the height and width of each frame, respectively. The output gaze sequence $G$ consists of pitch and yaw angles, one pair per frame of $V$.

The proposed STAGE model consists of three modules that define the deep network $f$. First, a ResNet-based CNN receives the input video and extracts feature maps for all frames. Next, these feature maps are processed by a Spatial Attention Module (SAM), which focuses on the spatial motion information between consecutive frames, followed by a Temporal Sequence Model (TSM), which learns temporal dynamics from past frame embeddings. Finally, the gaze prediction layer (GPL) maps the TSM outputs to a sequence of gaze directions expressed as yaw and pitch angles. Figure 2 shows the schematic of STAGE.
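The following is a minimal PyTorch sketch of this three-stage forward pass; `backbone`, `sam`, `tsm`, and `gpl` stand in for the modules described below, and pairing the first frame with itself (so that the output keeps length $n$) is an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class STAGE(nn.Module):
    """Sketch of the STAGE pipeline: ResNet features -> shared SAM -> TSM -> GPL."""
    def __init__(self, backbone: nn.Module, sam: nn.Module,
                 tsm: nn.Module, gpl: nn.Module):
        super().__init__()
        self.backbone, self.sam, self.tsm, self.gpl = backbone, sam, tsm, gpl

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, n, 3, h0, w0) -> per-frame feature maps X_t: (B, n, k, h, w)
        b, n = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))        # (B*n, k, h, w)
        feats = feats.view(b, n, *feats.shape[1:])
        # Shared SAM on consecutive pairs (X_{t-1}, X_t); the first frame is
        # paired with itself here (an assumption) so the sequence keeps length n.
        z = [self.sam(feats[:, 0], feats[:, 0])]
        for t in range(1, n):
            z.append(self.sam(feats[:, t - 1], feats[:, t]))
        z = torch.stack(z, dim=1)                          # (B, n, 3d)
        h = self.tsm(z)                                    # (B, n, d_tsm)
        return self.gpl(h)                                 # (B, n, 2) pitch/yaw per frame
```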

Figure 2. A schematic overview of the proposed (person-agnostic) STAGE model.

The proposed model has three modules: spatial attention module (SAM), temporal sequence model (TSM), and gaze prediction layer (GPL). The SAM extracts gaze-relevant information by concentrating on the spatial differences between consecutive frames. In the figure, $X_i$ represents features from ResNet, $z_i$ denotes the motion-informed output of the SAM, and $g_i$ corresponds to the predicted gaze direction.

3.1. Spatial Attention Module (SAM)

Recall that SAM is designed to distinguish gaze-relevant motion by analyzing the differences between consecutive frames, focusing on crucial cues such as eye or head movements while filtering out irrelevant distractors such as facial expressions or background movements. In short, it prioritizes relevant changes in the video, particularly eye movements, and disregards non-essential ones.

First, we convert each frame in the video sequence $V$ to features $X = [X_1, X_2, \ldots, X_n] \in \mathbb{R}^{n \times h \times w \times k}$ using the ResNet-based CNN, where $w$, $h$, and $k$ are the width, height, and number of channels of the feature maps extracted by ResNet. Each consecutive feature pair $(X_{t-1}, X_t)$ is then passed through a shared SAM. Concretely, the SAM aggregates information from the RGB features $X_{t-1}$ and $X_t$ and the feature difference $(X_t - X_{t-1})$ through a fusion strategy. Figure 3 provides an overview of the three SAM variants considered in this work. All SAM variants are optimized during model training and output $z_t$, a feature representation with spatial motion information for the $t$th frame of the video.

Figure 3. Block diagram of SAM variants.

The input to each SAM variant is a pair of consecutive frame features $X_{t-1}$ and $X_t$, and the output is a 1-D feature vector encoding both RGB and motion information. $P_{2d}$ denotes 2D positional embeddings with the same height and width as the input feature map. The cross-attention block in Cross-SAM and Hybrid-SAM is a standard transformer operation. The sum-pooling block pools features by summing over the height and width dimensions. More details are in Section 3.1.

Dual-SAM predicts separate spatial attention maps for the current frame $X_t$ and the past frame $X_{t-1}$. It compares the two attention maps and identifies the region most relevant to the observed motion changes. If the spatial attention maps are very similar, SAM infers that there is no substantial change between consecutive frames and encodes these minimal differences in the output vector $z_t \in \mathbb{R}^{3k}$. Conversely, if there is a difference, SAM incorporates this change into the output vector $z_t$. This SAM variant is inspired by Park et al. [43] in the change captioning task and is shown in Figure 3a.

Cross-SAM, unlike Dual-SAM, utilizes cross-attention from transformer models [63] to encapsulate dense correlations between each pair of image patches in the past and current frames. This allows Cross-SAM to identify multiple changes between two frames, as opposed to Dual-SAM, which can only capture a single change. Practically, detecting multiple changes and subsequently filtering out irrelevant distractors is more useful for video gaze estimation tasks. Similar to Dual-SAM, this variant utilizes both RGB and transformed motion signals at the output. Cross-SAM is motivated by [50] and is shown in Figure 3b.

Hybrid-SAM combines the strengths of both Dual-SAM and Cross-SAM variants. Dual-SAM focuses on one local change, while Cross-SAM focuses on global context and captures multiple changes. Similar to Cross-SAM, Hybrid-SAM encapsulates multiple changes by applying a cross-attention mechanism using global context through position embeddings. However, unlike the Cross-SAM variant, it uses the difference between current and past frames as a key and value, emphasizing regions with the most significant motion differences. The Dual-SAM is utilized as a pooling operator to selectively focus on the most relevant changes, like eye or head movements, which are crucial for the task of gaze estimation.

The Hybrid-SAM is given in Algorithm 1; Dual-SAM and Cross-SAM are deferred to the supplementary material. The inputs are the features of the past frame $X_{t-1}$ and the current frame $X_t$, respectively. Both input features are projected to higher-dimensional feature maps using a convolution operation, and 2-D position embeddings $P_{2d} \in \mathbb{R}^{h \times w}$ are added (Line 1). Line 2 computes the difference features $X_{\text{diff}}$ for the video's $t$th frame, and cross-attention is applied in Line 3. Lines 5–8 correspond to the same operations as in Dual-SAM. In Algorithm 1, $\sigma$ denotes the sigmoid function, $\mathbf{1}_{h,w}$ is a one-hot vector spanning the spatial dimensions, and $\odot$ is an element-wise product.
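Below is a minimal PyTorch sketch of Algorithm 1, assuming a 1×1 convolutional projection, single-head cross-attention, and a small convolutional attention head; the layer sizes are illustrative rather than the authors' exact configuration. Note that after the concatenation in Line 5 the pooled vectors here have dimension $2d$, so the output is $6d$-dimensional unless an additional projection (not reproduced here) restores the $3d$ size stated in the algorithm.

```python
import torch
import torch.nn as nn

class HybridSAM(nn.Module):
    """Sketch of Hybrid-SAM following Algorithm 1 (illustrative layer choices)."""
    def __init__(self, k: int = 256, d: int = 256, h: int = 8, w: int = 8):
        super().__init__()
        self.proj = nn.Conv2d(k, d, kernel_size=1)              # conv() in Line 1
        self.pos = nn.Parameter(torch.zeros(1, d, h, w))        # P_2d, broadcast over the batch
        self.cross = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.attn_head = nn.Sequential(                         # conv(ReLU(conv(.))) in Line 6
            nn.Conv2d(2 * d, d, 1), nn.ReLU(), nn.Conv2d(d, 1, 1), nn.Sigmoid())

    @staticmethod
    def _flat(x):                      # (B, d, h, w) -> (B, h*w, d)
        return x.flatten(2).transpose(1, 2)

    def forward(self, x_prev, x_cur):  # x_prev, x_cur: (B, k, h, w)
        b, _, h, w = x_cur.shape
        p = self._flat(self.proj(x_prev) + self.pos)
        c = self._flat(self.proj(x_cur) + self.pos)
        diff = c - p                                            # X_diff (Line 2)
        p, _ = self.cross(p, diff, diff)                        # cross-attention (Line 3)
        c, _ = self.cross(c, diff, diff)
        p = p.transpose(1, 2).reshape(b, -1, h, w)              # unflat (Line 4)
        c = c.transpose(1, 2).reshape(b, -1, h, w)
        p2 = torch.cat([p, c - p], dim=1)                       # Line 5
        c2 = torch.cat([c, c - p], dim=1)
        a_prev, a_cur = self.attn_head(p2), self.attn_head(c2)  # attention maps (Line 6)
        v_prev = (a_prev * p2).sum(dim=(2, 3))                  # sum-pooling (Line 7)
        v_cur = (a_cur * c2).sum(dim=(2, 3))
        return torch.cat([v_prev, v_cur - v_prev, v_cur], dim=1)  # z_t (Line 8)
```

The module is shared across time: a caller would apply it to every consecutive feature pair, as in the forward-pass sketch above.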

3.2. Temporal Sequence Model (TSM)

The temporal sequence model takes as input the spatially enhanced representations $z_t$ produced by the SAM module and is intended to capture the temporal dynamics of eye movements in the video. In particular, we consider two variants of the TSM: a recurrent neural network (RNN) [57] and a transformer network [63]. The RNN consists of unidirectional LSTM layers [21], and the transformer variant is a causal transformer decoder, as is prevalent in generative language modeling, e.g., the GPT-2 model [51]. We provide more details in the supplementary material.
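For concreteness, the sketch below gives hedged PyTorch versions of both TSM variants; hidden sizes and layer counts are placeholders, and the causal transformer is approximated by an encoder stack with a subsequent (causal) attention mask, which behaves like a decoder-only model.

```python
import torch
import torch.nn as nn

class LSTMTSM(nn.Module):
    """Unidirectional LSTM variant of the TSM (illustrative sizes)."""
    def __init__(self, in_dim: int, hidden: int = 256, layers: int = 2):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, z):              # z: (B, n, in_dim)
        out, _ = self.rnn(z)
        return out                     # (B, n, hidden)

class CausalTransformerTSM(nn.Module):
    """Causal transformer variant of the TSM (illustrative sizes)."""
    def __init__(self, in_dim: int, d_model: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        self.inp = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, z):              # z: (B, n, in_dim)
        n = z.size(1)
        # Causal mask: frame t attends only to frames <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(z.device)
        return self.enc(self.inp(z), mask=mask)
```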

3.3. Gaze Prediction Layer and Training Objective

The gaze prediction layer is shared across all timestamps and uses an MLP to predict the gaze direction from the frame embeddings generated by the TSM module. For the $i$th sample and $t$th frame, let $\{g_t^i\}$ and $\{\hat{g}_t^i\}$ denote the sequences of true and predicted gaze directions, respectively. Similarly, $\{p_t^i\}$ and $\{\hat{p}_t^i\}$ represent the sequences of true and predicted 2D Point-of-Gaze (PoG). We use the following objective function for training the STAGE model parameters (similar to Park et al. [47]):

Algorithm 1.

Hybrid-Spatial Attention Module

Input: $X_{t-1}, X_t \in \mathbb{R}^{h \times w \times k}$
Output: $z_t \in \mathbb{R}^{3 \cdot d}$
 1: $X_{t-1} = \mathrm{flat}(\mathrm{conv}(X_{t-1}) + \mathbf{1}_{h,w} P_{2d})$,  $X_t = \mathrm{flat}(\mathrm{conv}(X_t) + \mathbf{1}_{h,w} P_{2d}) \in \mathbb{R}^{hw \times d}$
 2: $X_{\text{diff}} = X_t - X_{t-1}$
 3: $X_{t-1} = \mathrm{crossatten}(X_{t-1}, X_{\text{diff}}, X_{\text{diff}})$,  $X_t = \mathrm{crossatten}(X_t, X_{\text{diff}}, X_{\text{diff}}) \in \mathbb{R}^{hw \times d}$
 4: $X_{t-1} = \mathrm{unflat}(X_{t-1}, h \times w)$,  $X_t = \mathrm{unflat}(X_t, h \times w) \in \mathbb{R}^{h \times w \times d}$
 5: $X_{t-1} = [X_{t-1};\, X_t - X_{t-1}]$,  $X_t = [X_t;\, X_t - X_{t-1}] \in \mathbb{R}^{h \times w \times 2d}$
 6: $A_{t-1} = \sigma(\mathrm{conv}(\mathrm{ReLU}(\mathrm{conv}(X_{t-1}))))$,  $A_t = \sigma(\mathrm{conv}(\mathrm{ReLU}(\mathrm{conv}(X_t)))) \in \mathbb{R}^{h \times w \times 1}$
 7: $v_{t-1} = \sum_{h,w} A_{t-1} \odot X_{t-1}$,  $v_t = \sum_{h,w} A_t \odot X_t \in \mathbb{R}^{d}$
 8: $z_t = [v_{t-1};\, v_t - v_{t-1};\, v_t] \in \mathbb{R}^{3d}$
 9: return $z_t$
$$\mathcal{L}_{\text{final}} = \frac{1}{bn} \sum_{i=1}^{b} \sum_{t=0}^{n-1} \left[ \frac{180}{\pi} \arccos\!\left( \frac{{g_t^i}^{T} \hat{g}_t^i}{\lvert g_t^i \rvert \, \lvert \hat{g}_t^i \rvert} \right) + \lambda \,\lVert p_t^i - \hat{p}_t^i \rVert \right] \qquad (1)$$

Here, $\lambda$ controls the trade-off between the 3D gaze angular error and the 2D PoG mean absolute error.
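As an illustration, one possible PyTorch rendering of Eq. (1) is given below, assuming gaze is supplied as (pitch, yaw) angles in radians that are first converted to 3D unit vectors using one common convention; both the conversion convention and the L1 form of the PoG term are assumptions rather than details taken from the paper.

```python
import torch

def pitchyaw_to_vector(py: torch.Tensor) -> torch.Tensor:
    # py: (..., 2) with (pitch, yaw) in radians -> (..., 3) unit gaze vectors
    # (one common convention; the paper's exact convention is not specified here).
    pitch, yaw = py[..., 0], py[..., 1]
    return torch.stack([torch.cos(pitch) * torch.sin(yaw),
                        torch.sin(pitch),
                        torch.cos(pitch) * torch.cos(yaw)], dim=-1)

def stage_loss(gaze_pred, gaze_true, pog_pred=None, pog_true=None, lam=0.001):
    """Angular (degree) loss plus optional lambda-weighted 2D PoG L1 term."""
    g_hat = pitchyaw_to_vector(gaze_pred)                # (B, n, 3)
    g = pitchyaw_to_vector(gaze_true)
    cos = (g * g_hat).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7)  # guard arccos numerically
    loss = torch.rad2deg(torch.arccos(cos)).mean()         # mean 3D angular error (degrees)
    if pog_pred is not None and lam > 0:
        loss = loss + lam * (pog_pred - pog_true).abs().mean()  # 2D PoG term
    return loss
```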

3.4. Personalizing STAGE using Gaussian Processes

As previously mentioned, we propose person-specific Gaussian processes to model a bias-correction term for each user, which operates on top of the proposed (person-agnostic) STAGE model. Specifically, if $f: V \mapsto G$ is the STAGE model, then the final prediction for person $p$ is $\hat{f}_p(V) = f(V) + r_p(V)$, where $r_p$ is the GP-based bias-correction model for person $p$; that is, it predicts the residual to be added to the person-agnostic prediction. The GP $r_p$ models the components of the gaze direction (i.e., yaw and pitch) independently at the frame level, using two one-dimensional independent GPs. Concretely, $r_p(V) = [(r_{p,\theta}(V_1), r_{p,\phi}(V_1)), (r_{p,\theta}(V_2), r_{p,\phi}(V_2)), \ldots, (r_{p,\theta}(V_n), r_{p,\phi}(V_n))]$, where $r_{p,\theta}$ and $r_{p,\phi}$ are the one-dimensional GP predictions for the pitch and yaw components, respectively.

For GP hyper-parameter tuning and inference, we collect a set of training frames $\mathcal{D} = \{h_i, y_i\}_{i=1}^{\ell}$ available for person $p$, where $h_i \in \mathbb{R}^d$ are the flattened ResNet output features from the STAGE model, and $y_i$ is either the pitch or the yaw of the residual gaze angle, i.e., $g_i - \hat{g}_i$, where $g_i$ and $\hat{g}_i$ are the true gaze direction and STAGE's predicted direction, respectively. To represent the dataset $\mathcal{D}$ in matrix form, we let $y \in \mathbb{R}^{\ell}$ be the vector of residual angles, whose $i$th entry equals $y_i$, and $H \in \mathbb{R}^{\ell \times d}$ have its $i$th row equal to the ResNet features $h_i$. For brevity, we omit the person index $p$ in the following discussion of GPs.

A Gaussian process associated with a kernel (covariance) function $k(h, h'): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a distribution over functions mapping features to residual angles such that, for any $h_1, \ldots, h_\ell \in \mathbb{R}^d$:

$$r = [r(h_1), \ldots, r(h_\ell)] \sim \mathcal{N}(\mu_0, K_H),$$

where $K_H = [k(h_i, h_j)]_{i,j=1}^{\ell} \in \mathbb{R}^{\ell \times \ell}$ is the kernel (covariance) matrix on the data points $H$, and $r$ has a constant mean function with value $\mu_0$. The observed residual angle $y_i$ is modeled with i.i.d. Gaussian noise, i.e., $y_i \sim \mathcal{N}(r(h_i), \sigma^2)$. In particular, we use the (squared-exponential) automatic-relevance-determination (ARD) kernel, given by $k(h, h') = \tau \, e^{-\sum_{s=1}^{d} (h^{(s)} - h'^{(s)})^2 / \theta^{(s)2}}$, where $\tau$ and $\theta \in \mathbb{R}^d$ are kernel hyperparameters. The ARD kernel's per-dimension length scales make it more expressive than an RBF kernel with a single length scale, which often leads to superior practical performance [39]. Intuitively, this flexibility allows the model to adapt to varying feature relevance and noise levels, potentially improving accuracy and generalization [10]. Conditioning the GP model on the collected training dataset yields the following predictive posterior mean and covariance functions:

$$\text{mean:}\quad \mu_{r \mid \mathcal{D}}(h) = k_h^{T} (K_H + \sigma^2 I)^{-1} y$$
$$\text{variance:}\quad \sigma_{r \mid \mathcal{D}}(h) = k(h, h) - k_h^{T} (K_H + \sigma^2 I)^{-1} k_h$$

where the vector $k_h \in \mathbb{R}^{\ell}$ has $i$th entry $k(h, h_i)$, i.e., the kernel value between the feature vector $h$ and the $i$th data point. The posterior mean function predicts the residual gaze angles and is utilized for correction. The posterior covariance function determines the uncertainty in this prediction, as illustrated in Figure 6.
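To make the correction concrete, here is a compact NumPy sketch of this per-person residual GP: the ARD kernel, the posterior mean used as the additive correction, and the posterior variance as the uncertainty. The hyperparameters `mu0`, `sigma`, `tau`, and `theta` are assumed to be given (e.g., from the pre-trained GP discussed next), and the explicit `mu0` term reduces to the displayed mean formula when `mu0 = 0`.

```python
import numpy as np

def ard_kernel(A, B, tau, theta):
    # A: (m, d), B: (l, d), theta: (d,) per-dimension length scales.
    diff = A[:, None, :] - B[None, :, :]
    return tau * np.exp(-np.sum(diff ** 2 / theta ** 2, axis=-1))

def gp_residual(h_query, H, y, mu0, sigma, tau, theta):
    # H: (l, d) calibration features, y: (l,) residual angles (true - predicted).
    K = ard_kernel(H, H, tau, theta) + sigma ** 2 * np.eye(len(H))
    k_q = ard_kernel(h_query[None, :], H, tau, theta)[0]       # (l,)
    mean = mu0 + k_q @ np.linalg.solve(K, y - mu0)             # posterior mean correction
    var = tau - k_q @ np.linalg.solve(K, k_q)                  # posterior variance (k(h,h)=tau)
    return mean, var
```

In use, the correction would be computed independently for the pitch and yaw residuals of every frame and added to the STAGE prediction, with the variance providing the uncertainty interval.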

Figure 6.

The figure depicts a few certain (a) and uncertain (b) predictions of gaze direction after GP personalization on the EyeDiap dataset. The blue and pink arrows show the ground-truth and predicted gaze directions, respectively. The green region indicates the uncertainty of the predictions shown by the pink arrows.

Optimizing GP hyper-parameters using a few labeled samples.

GPs are non-parametric models and thus do not require tuning many parameters [52]. However, they still require optimizing hyperparameters, which in our case are $\mu_0$, $\sigma$, $\tau$, and $\theta$, totaling $d + 3$ hyperparameters since $|\theta| = d$. The ARD kernel adds flexibility to the GP model but also increases the number of hyperparameters to be tuned. Specifically, since $d = 16384$ when using features from the ResNet model, directly tuning the hyperparameters on the log-likelihood of the data $\mathcal{D}$ is prone to overfitting, particularly when $\mathcal{D}$ contains as few as three samples. To overcome this challenge, we propose the application of pre-trained GPs, similar to the concurrent work [68]. Pre-trained GPs first optimize the hyperparameters on the data used for training the STAGE model, and then apply early stopping while maximizing the log-likelihood of the dataset $\mathcal{D}$ for each individual. This grants the GPs the flexibility of an expressive ARD kernel while ensuring a robust starting point due to pre-training.
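The following is an illustrative PyTorch sketch, not the authors' exact recipe, of fitting these $d+3$ hyperparameters by gradient ascent on the GP log marginal likelihood; the same loop would first be run on the STAGE training data and then continued briefly, with early stopping, on a person's few calibration samples. The log-space parameterization and the Adam optimizer are assumptions made here.

```python
import math
import torch

def log_marginal_likelihood(H, y, mu0, log_sigma, log_tau, log_theta):
    # H: (l, d), y: (l,); sigma, tau, theta kept in log space for positivity.
    sigma2, tau, theta = torch.exp(2 * log_sigma), torch.exp(log_tau), torch.exp(log_theta)
    diff = H[:, None, :] - H[None, :, :]
    K = tau * torch.exp(-(diff ** 2 / theta ** 2).sum(-1)) + sigma2 * torch.eye(len(H))
    L = torch.linalg.cholesky(K)
    r = y - mu0
    alpha = torch.cholesky_solve(r[:, None], L)[:, 0]
    # -0.5 r^T K^{-1} r - 0.5 log|K| - 0.5 l log(2 pi)
    return -0.5 * r @ alpha - torch.log(torch.diagonal(L)).sum() - 0.5 * len(H) * math.log(2 * math.pi)

def fit_gp(H, y, steps=200, lr=1e-2):
    # Hypothetical fitting loop; for personalization, a short, early-stopped
    # continuation of this loop would be run on each person's samples.
    d = H.shape[1]
    params = {name: torch.nn.Parameter(t) for name, t in
              [("mu0", torch.zeros(())), ("log_sigma", torch.zeros(())),
               ("log_tau", torch.zeros(())), ("log_theta", torch.zeros(d))]}
    opt = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -log_marginal_likelihood(H, y, **params)
        loss.backward()
        opt.step()
    return {k: v.detach() for k, v in params.items()}
```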

4. Experiments and Results

Datasets.

EVE [47] is a large video gaze dataset comprising over 12M frames collected from 54 participants in a controlled indoor setting with four synchronized and calibrated camera views. Following the splits in [47], there are 40 subjects in training and 6 in the validation set. We discard the test subjects due to the unavailability of labels and evaluate our models on the validation set. Gaze360 [27] is a large-scale, physically unconstrained gaze dataset collected from 238 subjects in indoor and outdoor settings. The dataset includes a wide range of head poses, with 129K training, 17K validation, and 26K test images. We evaluate our models on all three subsets of the dataset: the full Gaze360 dataset, the front 180° subset, and the front 20° subset, as done in [27]. EyeDiap [16] consists of 94 videos totaling 237 minutes, collected from 16 subjects in a laboratory environment. It includes videos for both screen and floating targets, and we select VGA videos of screen targets.

Implementation Details.

The input video sequence $V$ consists of 30 frames, each containing a full-face image of 128 × 128 pixels. We use a ResNet-18 [55] backbone, initialized with GazeCLR [24] weights and shared across all timestamps, to extract visual features from the image sequence. The third convolutional block of ResNet-18 outputs features of dimension 256 × 8 × 8. We pass these features through the SAM module, followed by the TSM and the gaze prediction layer. We train STAGE end-to-end for 50K iterations using the SGD optimizer with an initial learning rate of 0.016 and momentum of 0.9. The learning rate is decayed using cosine annealing [33], and the batch size is set to 16. We discuss the implementation in more detail in the supplementary material.
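A minimal sketch of this training schedule is shown below, assuming the hypothetical `stage_loss` helper from Section 3.3 and a `train_loader` that yields 30-frame clips with per-frame gaze labels in batches of 16; it is illustrative only.

```python
import torch

def train(model, train_loader, total_iters=50_000):
    # SGD with lr 0.016, momentum 0.9, and cosine annealing over 50K iterations.
    opt = torch.optim.SGD(model.parameters(), lr=0.016, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_iters)
    it = 0
    while it < total_iters:
        for video, gaze in train_loader:     # (B, 30, 3, 128, 128), (B, 30, 2)
            pred = model(video)              # per-frame pitch/yaw predictions
            loss = stage_loss(pred, gaze)    # angular term only (PoG term added when labels exist)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            it += 1
            if it >= total_iters:
                break
```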

4.1. Evaluating the STAGE Model

Baselines.

We benchmarked our framework against EyeNet [47], which consists of ResNet-18 and RNN layers and takes both eye image patches as input. We adapted EyeNet to our setting and trained it on full-face images using $\mathcal{L}_{\text{final}}$ with $\lambda = 0.001$. We also train another variant of EyeNet by replacing the RNN module with a TSM similar to the one used in our framework. For a fair comparison, we additionally implement EyeNet with our version of ResNet-18 initialized with GazeCLR [24] weights, denoted EyeNet (GazeCLR). Further, we adapt [4], which introduces a motion-aware unit (MAU) for video prediction, to gaze estimation. We also compare with a simple baseline, termed Concat-Residual, that removes the SAM modules and concatenates $X_t$ and $X_{\text{diff}} = (X_t - X_{t-1})$ before passing them through the TSM. Finally, we compare the three SAM variants combined with the two TSM variants in cross-dataset and within-dataset experiments. For completeness, we also evaluate the Hybrid-SAM method without the Dual-SAM module at the output, denoted Hybrid-SAM (w/o Dual-SAM).

4.1.1. Qualitative Evaluation

We conducted a qualitative analysis primarily centered on assessing the ability of Hybrid-SAM to distinguish gaze-irrelevant distractors from gaze-relevant eye movements, which, as stated earlier, is crucial for video gaze estimation. Specifically, we examined the attention maps $A_{t-1}$ and $A_t$ overlaid on sequential video frames $V_{t-1}$ and $V_t$, as depicted in Figure 4. We analyzed several frames showcasing scenarios ranging from background activity to facial movements, all concurrent with dominant eye movements.

Figure 4.

Illustration of the attention maps $A_{t-1}$ and $A_t$ generated by the Hybrid-SAM, superimposed on sequential video frames $V_{t-1}$ and $V_t$. The SAM module proficiently highlights the ocular area, key for analyzing eye movements, while simultaneously diminishing irrelevant distractions such as background motion (a), tongue movement (b), and changes in emotional expression (c and d).

In Figure 4(a), the network adeptly focuses on eye movements in frame $V_t$ (red pixels) and on changes from the prior frame (blue pixels) despite significant background pixel shifts caused by a walking person. This underscores the effectiveness of spatial attention in filtering out irrelevant distractors to accurately identify subtle eye movements and gaze direction; as a result, it eases temporal modeling in video gaze estimation. Additionally, as illustrated in Figure 4(b), although tongue movement presents a potential distraction, it is efficiently disregarded. Moreover, changes in facial expressions, depicted in Figure 4(c, d), are effectively overlooked by the Hybrid-SAM. These qualitative findings affirm that the spatio-temporal attention strategy minimizes significant distractions while concentrating on the eye region, which is essential for accurately tracking eye movements in video gaze estimation tasks.

4.1.2. Within-dataset Evaluation

Table 1 shows the results of the within-dataset evaluation, where we train and evaluate our model on the same dataset. We train our framework on the training subset of Gaze360 with $\lambda = 0$ and evaluate it over three test subsets, as done in [27]. Our model demonstrates superior performance compared to the baselines, including Concat-Residual, across all three subsets. Specifically, it achieves absolute improvements of 2.5°, 2.2° and 2.5° on the full Gaze360, front 180° and front 20° subsets, respectively. Furthermore, it is noteworthy that the full Hybrid-SAM outperforms Hybrid-SAM (w/o Dual-SAM), illustrating the advantage of incorporating Dual-SAM as a pooling operator.

Table 1. Within-dataset Evaluation.

Comparison of mean angular errors (in degrees) between the proposed STAGE model, SAM and TSM variants, and other baseline approaches. Tx is the transformer TSM model. The first and second best results are bolded and underlined, respectively.

Method Full Front 180° Front 20°

EyeNet [47](GazeCLR) 12.53 12.08 9.45
EyeNet + Tx 13.00 12.55 9.73

Concat-Residual + LSTM 10.35 10.16 7.45
Concat-Residual + Tx 12.22 11.78 9.09

Dual-SAM + LSTM 10.12 9.92 7.08
Dual-SAM + Tx 10.13 9.93 7.23

Cross-SAM + LSTM 12.00 11.59 9.51
Cross-SAM + Tx 10.12 9.91 7.34

Hybrid-SAM (w/o Dual-SAM) + LSTM 12.69 12.26 9.66
Hybrid-SAM (w/o Dual-SAM) + Tx 12.33 11.90 9.53
Hybrid-SAM + LSTM 10.05 9.84 6.92
Hybrid-SAM + Tx 10.10 9.90 7.33

4.1.3. Cross-dataset Evaluation

We performed a cross-dataset evaluation, where the model was trained on the EVE dataset and evaluated on two different datasets, EyeDiap and Gaze360. Table 2 compares the baselines and our proposed method. We observe a significant improvement on both datasets even with a simple concatenation of $X_t$ and $X_{\text{diff}} = (X_t - X_{t-1})$: the Concat-Residual approach outperforms the EyeNet variants and the MAU approach, which demonstrates that residual frames are an effective cue for video gaze estimation. Dual-SAM and Cross-SAM show improvements over Concat-Residual, indicating that the adapted methods are more accurate than naively using residual frames. Notably, Hybrid-SAM improves over the baselines by 1.2° absolute and 14.28% relative on the EyeDiap dataset. It also outperforms Dual-SAM and Cross-SAM on all three evaluation sets. The last two columns of Table 2 show results on the full and front 180° Gaze360 subsets, where Hybrid-SAM improves by up to 3.6° on both subsets, further emphasizing the effectiveness of SAM. It is also worth noting that the performance improvements from SAM hold for both LSTM and transformer-based TSMs in both Tables 1 and 2, showing that SAM is helpful irrespective of the choice of TSM.

Table 2. Cross-dataset Evaluation.

Comparison of mean angular gaze error (in degrees) between the proposed STAGE model, SAM and TSM variants, and other baseline approaches. Tx is the transformer TSM model. For each column, the best result is bolded and the second-best result is underlined.

Method EyeDiap Gaze360 (Full) Gaze360 (Front 180°)

MAU 21.30 34.18 33.57
EyeNet [47] 16.07 31.37 30.77
EyeNet (GazeCLR) 7.74 26.57 25.95
EyeNet + Tx 8.40 26.25 25.64

Concat-Residual+ LSTM 7.12 24.12 23.52
Concat-Residual+ Tx 7.27 24.26 23.64

Dual-SAM + LSTM 7.04 24.18 23.58
Dual-SAM + Tx 6.77 23.99 23.38

Cross-SAM + LSTM 8.42 23.19 22.61
Cross-SAM + Tx 8.75 22.57 22.01

Hybrid-SAM (w/o Dual-SAM) + LSTM 8.48 23.31 22.72
Hybrid-SAM (w/o Dual-SAM) + Tx 7.79 22.66 22.09
Hybrid-SAM + LSTM 6.70 23.73 23.13
Hybrid-SAM + Tx 6.54 23.77 23.17

4.1.4. Comparison with State-of-the-art Methods

Table 3 compares the proposed STAGE method with state-of-the-art approaches in the within-dataset setting. Video gaze estimation methods such as the original Gaze360 work [27] and MSA+Seq [36] employ an LSTM model trained with the Pinball loss. We also compare our approach with image-based methods, namely L2CS-Net [1], both variants of GazeTR [7], and the self-supervised method SwAT [12]. We report the performance of these methods as stated in their original papers. Our best results outperform these methods by 1.5°, 0.5° and 2.1° on the full Gaze360, front 180° and front 20° subsets, respectively. The superior performance of our method demonstrates the effectiveness of SAM and of our choices for the other components of the overall STAGE model.

Table 3. STAGE vs. State-of-the-art.

Comparison with state-of-the-art methods on Gaze360 data subsets under the within-dataset setting (Tx = transformer-based TSM). The metric is the mean angular error (in degrees). The first and second best results are bolded and underlined, respectively.

Method Full Front 180° Front 20°

Gaze360 [27] 13.50 11.40 11.10
MSA+Seq [36] 12.50 10.70 -

SwAT [12] 11.60 - -
L2CS-Net [1] - 10.41 9.02
GazeTR-Pure [7] - 13.58 -
GazeTR-Hybrid [7] - 10.62 -

Hybrid-SAM + LSTM 10.05 9.84 6.92
Hybrid-SAM + Tx 10.10 9.90 7.33

4.2. Evaluating GPs for Personalization

As stated earlier, we first optimize the hyper-parameters of the GP model $r_p$ for residual gaze direction prediction using the training subset of the EVE dataset. Then, we adapt $r_p$ for personalization to the EyeDiap participants. We randomly sample $\ell$ video frames for each participant 10 times and report the performance in Figure 5. We perform GP personalization on two SAM variants, Dual-SAM and Hybrid-SAM, using a transformer TSM. The baseline method proposed by Chen and Shi [5] learns a single person-specific bias during training and uses a few labeled samples to estimate this bias during inference.

Figure 5.

The figure compares $\ell$-shot GP personalization of the STAGE model with Chen and Shi [5] on the EyeDiap dataset. The bars indicate the mean angular error (in degrees) and standard error over 10 iterations.

We obtain an absolute improvement of around 0.8° with Hybrid-SAM over the baseline with as few as 3 samples. Applying GPs on top of the baseline objective, i.e., "Chen et al. + GPs", yields consistent improvements over both GPs alone and the method proposed by Chen and Shi [5]. These results demonstrate that GPs are a valuable tool and provide strengths complementary to Chen and Shi [5]. Unlike Chen and Shi [5], GPs do not require altering the objective used for training the deep network; they can be used for adaptation with any existing pre-trained model, such as STAGE.

We provide a qualitative evaluation of the GP model's uncertainty estimates in Figure 6. The figure contrasts confident and uncertain gaze predictions after personalization on the EyeDiap dataset. Notably, the uncertainty region typically includes the ground truth, as illustrated by the pink arrows falling within the green area. It is also worth noting that gaze predictions with higher uncertainty often correspond to situations that are challenging even for human interpretation, such as extreme head poses or closed eyes.

5. Conclusion

In this paper, we presented STAGE, a novel model for video gaze estimation that uses an attention mechanism to encode spatial motion cues together with temporal modeling. The method employs a spatial attention module that implicitly focuses on the differences between consecutive frames, thereby highlighting relevant changes. We demonstrated that the performance of the STAGE model can be further enhanced with a few labeled samples using Gaussian processes. Future research could explore expanding the receptive field of the attention modules and integrating long-term spatial and temporal dynamics for further improvements.


References

  • [1] Ahmed A. Abdelrahman, Thorsten Hempel, Aly Khalifa, and Ayoub Al-Hamadi. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. arXiv preprint arXiv:2203.03339, 2022.
  • [2] Julius Albiz, Olga Viberg, and Andrii Matviienko. Guiding visual attention on 2D screens: Effects of gaze cues from avatars and humans. In Proceedings of the 2023 ACM Symposium on Spatial User Interaction, pages 1–9, 2023.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [4] Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Yan Ye, Xinguang Xiang, and Wen Gao. MAU: A motion-aware unit for video prediction and beyond. Advances in Neural Information Processing Systems, 34:26950–26962, 2021.
  • [5] Zhaokang Chen and Bertram Shi. Offset calibration for appearance-based gaze estimation via gaze decomposition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 270–279, 2020.
  • [6] Zhaokang Chen and Bertram E. Shi. Offset calibration for appearance-based gaze estimation via gaze decomposition. In 2020 IEEE Winter Conference on Applications of Computer Vision, pages 259–268, 2019.
  • [7] Yihua Cheng and Feng Lu. Gaze estimation using transformer. In 2022 26th International Conference on Pattern Recognition, pages 3341–3347. IEEE, 2022.
  • [8] Yihua Cheng, Xucong Zhang, Feng Lu, and Yoichi Sato. Gaze estimation by exploring two-eye asymmetry. IEEE Transactions on Image Processing, 29:5259–5272, 2020.
  • [9] Eunji Chong, Nataniel Ruiz, Yongxin Wang, Yun Zhang, Agata Rozga, and James M. Rehg. Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency. In Proceedings of the European Conference on Computer Vision, pages 383–398, 2018.
  • [10] Ian A. Delbridge, David S. Bindel, and Andrew Gordon Wilson. Randomly projected additive Gaussian processes for regression. In International Conference on Machine Learning, 2019.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [12] Arya Farkhondeh, Cristina Palmero, Simone Scardapane, and Sergio Escalera. Towards self-supervised gaze estimation. In British Machine Vision Conference, 2022.
  • [13] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
  • [14] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
  • [15] Tobias Fischer, Hyung Jin Chang, and Yiannis Demiris. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision, pages 334–352, 2018.
  • [16] Kenneth Alberto Funes Mora, Florent Monay, and Jean-Marc Odobez. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, pages 255–258, 2014.
  • [17] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 971–980, 2017.
  • [18] Song Gu, Lihui Wang, Long He, Xianding He, and Jian Wang. Gaze estimation via a differential eyes' appearances network with a reference grid. Engineering, 7(6):777–786, 2021.
  • [19] Elias Daniel Guestrin and Moshe Eizenman. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133, 2006.
  • [20] Dan Witzner Hansen and Qiang Ji. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):478–500, 2009.
  • [21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [22] Qiong Huang, Ashok Veeraraghavan, and Ashutosh Sabharwal. TabletGaze: Dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications, 28:445–461, 2017.
  • [23] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2012.
  • [24] Swati Jindal and Roberto Manduchi. Contrastive representation learning for gaze estimation. In Annual Conference on Neural Information Processing Systems, pages 37–49. PMLR, 2023.
  • [25] Anuradha Kar and Peter Corcoran. A review and analysis of eye-gaze estimation systems, algorithms and performance evaluation methods in consumer platforms. IEEE Access, 5:16495–16519, 2017.
  • [26] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [27] Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6912–6921, 2019.
  • [28] Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2176–2184, 2016.
  • [29] Jun Li, Xianglong Liu, Mingyuan Zhang, and Deqing Wang. Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognition, 98:107037, 2020.
  • [30] Dongze Lian, Lina Hu, Weixin Luo, Yanyu Xu, Lixin Duan, Jingyi Yu, and Shenghua Gao. Multiview multitask gaze estimation with deep convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems, 30(10):3010–3023, 2018.
  • [31] Erik Linden, Jonas Sjostrand, and Alexandre Proutiere. Learning to personalize in appearance-based gaze tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
  • [32] Gang Liu, Yuechen Yu, Kenneth Alberto Funes Mora, and Jean-Marc Odobez. A differential approach for gaze estimation with calibration. In BMVC, 2018.
  • [33] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • [34] Feng Lu, Yue Gao, and Xiaowu Chen. Estimating 3D gaze directions using unlabeled eye images via synthetic iris appearance fitting. IEEE Transactions on Multimedia, 18(9):1772–1782, 2016.
  • [35] Kyle Min and Jason J. Corso. Integrating human gaze into attention for egocentric activity recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1069–1078, 2021.
  • [36] Ashesh Mishra and Hsuan-Tien Lin. 360-degree gaze estimation in the wild using multiple zoom scales. In British Machine Vision Conference, 2020.
  • [37] AJung Moon, Daniel M. Troniak, Brian Gleeson, Matthew K. X. J. Pan, Minhua Zheng, Benjamin A. Blumer, Karon MacLean, and Elizabeth A. Croft. Meet me where I'm gazing: How shared attention gaze affects human-robot handover timing. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 334–341, 2014.
  • [38] Atsushi Nakazawa and Christian Nitschke. Point of gaze estimation through corneal surface reflection in an active illumination environment. In Computer Vision–ECCV 2012, Proceedings, Part II, pages 159–172. Springer, 2012.
  • [39] Radford M. Neal. Assessing relevance determination methods using DELVE. 1998.
  • [40] Nitish Padmanaban, Robert Konrad, Tal Stramer, Emily A. Cooper, and Gordon Wetzstein. Optimizing virtual reality for all users through gaze-contingent and adaptive focus displays. Proceedings of the National Academy of Sciences, 114(9):2183–2188, 2017.
  • [41] Oskar Palinko, Francesco Rea, Giulio Sandini, and Alessandra Sciutti. Robot reading human gaze: Why eye tracking is better than head tracking for human-robot collaboration. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5048–5054. IEEE, 2016.
  • [42] Cristina Palmero, Javier Selva, Mohammad Ali Bagheri, and Sergio Escalera. Recurrent CNN for 3D gaze estimation using appearance and shape cues. In British Machine Vision Conference, 2018.
  • [43] Dong Huk Park, Trevor Darrell, and Anna Rohrbach. Robust change captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4624–4633, 2019.
  • [44] Seonwook Park, Adrian Spurr, and Otmar Hilliges. Deep pictorial gaze estimation. In European Conference on Computer Vision, LNCS volume 11217, 2018.
  • [45] Seonwook Park, Xucong Zhang, Andreas Bulling, and Otmar Hilliges. Learning to find eye region landmarks for remote gaze estimation in unconstrained settings. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, pages 1–10, 2018.
  • [46] Seonwook Park, Shalini De Mello, Pavlo Molchanov, Umar Iqbal, Otmar Hilliges, and Jan Kautz. Few-shot adaptive gaze estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9368–9377, 2019.
  • [47] Seonwook Park, Emre Aksan, Xucong Zhang, and Otmar Hilliges. Towards end-to-end video-based eye-tracking. In Computer Vision–ECCV 2020, Proceedings, Part 16, pages 747–763. Springer, 2020.
  • [48] Daniel Parks, Ali Borji, and Laurent Itti. Augmented saliency model using automatic 3D head pose detection and learned gaze following in natural scenes. Vision Research, 116:113–126, 2015.
  • [49] Anjul Patney, Marco Salvi, Joohwan Kim, Anton Kaplanyan, Chris Wyman, Nir Benty, David Luebke, and Aaron Lefohn. Towards foveated rendering for gaze-tracked virtual reality. ACM Transactions on Graphics (TOG), 35(6):1–12, 2016.
  • [50] Yue Qiu, Shintaro Yamamoto, Kodai Nakashima, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka, and Yutaka Satoh. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1971–1980, 2021.
  • [51] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [52] Carl Edward Rasmussen. Gaussian processes in machine learning. Pages 63–71. Springer Berlin Heidelberg, 2004.
  • [53] Dakai Ren, Jiazhong Chen, Jian Zhong, Zhaoming Lu, Tao Jia, and Zongyi Li. Gaze estimation via bilinear pooling-based attention networks. Journal of Visual Communication and Image Representation, 81:103369, 2021.
  • [54] Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, and Lihi Zelnik-Manor. Learning video saliency from human gaze using candidate selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1147–1154, 2013.
  • [55] Muhammad Shafiq and Zhaoquan Gu. Deep residual learning for image recognition: A survey. Applied Sciences, 12(18):8972, 2022.
  • [56] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014.
  • [57] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
  • [58] Kar-Han Tan, David J. Kriegman, and Narendra Ahuja. Appearance-based eye gaze estimation. In Sixth IEEE Workshop on Applications of Computer Vision, pages 191–195. IEEE, 2002.
  • [59] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
  • [60] Yunbin Tu, Tingting Yao, Liang Li, Jiedong Lou, Shengxiang Gao, Zhengtao Yu, and Chenggang Yan. Semantic relation-aware difference representation learning for change captioning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 63–73, 2021.
  • [61] Roberto Valenti, Nicu Sebe, and Theo Gevers. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing, 21(2):802–815, 2011.
  • [62] Arun Balajee Vasudevan, Dengxin Dai, and Luc Van Gool. Object referring in videos with language and human gaze. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4129–4138, 2018.
  • [63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [64] Kang Wang, Hui Su, and Qiang Ji. Neuro-inspired eye tracking with eye movement dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9831–9840, 2019.
  • [65] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
  • [66] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2740–2755, 2018.
  • [67] Xuanhan Wang, Lianli Gao, Peng Wang, Xiaoshuai Sun, and Xianglong Liu. Two-stream 3-D ConvNet fusion for action recognition in videos with arbitrary size and length. IEEE Transactions on Multimedia, 20(3):634–644, 2017.
  • [68] Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zelda Mariet, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. Pre-trained Gaussian processes for Bayesian optimization. arXiv preprint arXiv:2109.08215, 2021.
  • [69] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4511–4520, 2015.
  • [70] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. It's written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–60, 2017.
  • [71] Xucong Zhang, Yusuke Sugano, and Andreas Bulling. Evaluation of appearance-based methods and implications for gaze-based applications. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2019.
  • [72] Yanxia Zhang, Ken Pfeuffer, Ming Ki Chong, Jason Alexander, Andreas Bulling, and Hans Gellersen. Look together: Using gaze for assisting co-located collaborative search. Personal and Ubiquitous Computing, 21:173–186, 2017.
