Author manuscript; available in PMC: 2021 Mar 19.
Published in final edited form as: Computer (Long Beach Calif). 2019 May 14;52(5):48–57. doi: 10.1109/MC.2019.2903246

Human Eye Movements Reveal Video Frame Importance

Zheng Ma 1, Jiaxin Wu 2, Sheng-hua Zhong 2, Jianmin Jiang 2, Stephen J Heinen 3
PMCID: PMC7975628  NIHMSID: NIHMS1529492  PMID: 33746238

Abstract

Human eye movements indicate important spatial information in static images as well as videos. Yet videos contain additional temporal information and convey a storyline. Video summarization is a technique that reduces video size, but maintains the essence of the storyline. Here, the authors explore whether eye movement patterns reflect frame importance during video viewing and facilitate video summarization. Eye movements were recorded while subjects watched videos from the SumMe video summarization dataset. The authors find more gaze consistency for selected than unselected frames. They further introduce a novel multi-stream deep learning model for video summarization that incorporates subjects’ eye movement information. Gaze data improved the model’s performance over that observed when only the frames’ physical attributes were used. The results suggest that eye movement patterns reflect cognitive processing of sequential information that helps select important video frames, and provide an innovative algorithm that uses gaze information in video summarization.


Eye movement information is useful for revealing observers’ interests1. This is likely because visual acuity is highest in a small foveal region, and eye movements reorient the fovea to different scene elements that require high resolution viewing2. Therefore, eye movements often reliably indicate informative regions in static images or scenes.

Several studies explored eye movement dynamics during video watching, and found that subjects make similar eye movements while watching commercial films, TV shows or interviews3,4. Furthermore, gaze locations tend to cluster at the center of a screen3, as well as on biologically relevant social stimuli5. Eye movement data has been integrated into computational models to help them locate salient information and objects in video frames6.

Unlike static images, videos contain both spatial and temporal information. Therefore, to understand gaze behavior in videos, it is not only important to characterize where people look, but also which video frames are the most important and informative to the observers. Besides gleaning insights into neural mechanisms for directing gaze, knowing when and where people look has practical implications for research on video summarization. In video summarization, short videos are created from long ones using only those frames that are important for conveying the essence of the long video. Understanding gaze patterns for video summarization is especially important given the recent development of new video technologies, uploading platforms, and the explosive growth of video data. Approximately 400 hours of videos are uploaded to YouTube every minute (https://www.statista.com/topics/2019/youtube/), many of which are poorly edited and have redundant or tediously long segments. Viewers would have a much more efficient viewing experience if video summarization algorithms7 could provide an accurate summary of the original long videos. If human eye movement patterns can help inform the importance of individual video frames, then we can use them to better predict which frames are critically important for understanding a video, and thus improve video summarization algorithms.

In the current study, we investigate whether human eye movements reflect the importance of frames within a video, and whether they can improve computational models that perform video summarization. Our work makes the following contributions:

  • We demonstrate that eye movement patterns of subjects are similar to each other while viewing videos, even when no instructions are given about how to view them.

  • We show that eye movements are more consistent while viewing frames with higher importance scores than those with lower scores, suggesting that human eye movements reliably predict importance judgements.

  • In a computational experiment, we demonstrate that a model that uses gaze information performs better video summarization than a model that uses only frame-based physical information.

Together, the results demonstrate that human eye movement consistency indicates whether a video frame is important, and suggest that eye movement data facilitates computer vision models of video summarization.

BEHAVIORAL EXPERIMENT

SumMe Video Summarization Dataset

Our experiments were conducted using raw videos from the SumMe video summarization benchmark dataset8. The SumMe dataset contains 25 raw videos together with video summaries that are generated by human observers. The videos depict various events such as cooking, plane landing, and base jumping. The length of the videos varies from one to six minutes. In the manuscript, we refer to the people who created the personal video summaries for the SumMe dataset as “users”, and those whose eye movements we recorded as “subjects.”

A total of 41 users (19 males and 22 females) participated in the video summarization for the SumMe dataset8. The users were instructed to generate video summaries that contain the most important content of the original raw video, but with only 5% to 15% of the original length. The audio was muted to ensure only visual stimuli were used for summarization. Frame-level importance scores were calculated based on the probability that a frame was included in the user summaries. Finally, a group-level summary was generated based on the frames with the highest 15% of the importance scores8.
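To make this scoring scheme concrete, the following is a minimal Python sketch, assuming the user selections are available as a boolean users-by-frames matrix; the function names and the input format are our own illustrative choices, not part of the SumMe toolkit.

```python
import numpy as np

def frame_importance(user_selections):
    """Frame-level importance: fraction of users whose summary includes each frame.

    user_selections: boolean array of shape (n_users, n_frames),
    True where a user included the frame in his or her summary.
    """
    return user_selections.mean(axis=0)

def group_summary(importance, keep_fraction=0.15):
    """Group-level summary: frames whose importance falls in the top 15%."""
    threshold = np.quantile(importance, 1.0 - keep_fraction)
    return importance >= threshold
```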

Human Eye Movement Data Collection

We collected human eye movement data from a separate group of six subjects (two males and four females) while they viewed the raw videos of the SumMe dataset. All subjects had normal or corrected to normal vision. The study was approved by the Institutional Review Board at the Smith-Kettlewell Eye Research Institute, and also adhered to the Declaration of Helsinki. Informed consent was obtained for experimentation with human subjects.

Videos were presented on a Samsung screen (resolution: 2560×1440, refresh rate: 60 Hz) and generated with Psychtoolbox-3 for MATLAB on a MacBook Pro computer. Observers’ heads were stabilized by a chin and forehead rest. The viewing distance was 57 cm, and the display subtended 58.2° × 33.3°. Eye movements were recorded with an SR Research EyeLink 1000 video-based eye tracker at 1000 Hz.

All of the original videos from the SumMe dataset were resized to the same width (1920 pixels; 43.6 degrees of visual angle). The audio was muted. Subjects were asked to watch the entire video without any additional instructions. The experiment was divided into six blocks, each of which lasted approximately 10 minutes. Before each video, observers fixated a blank screen with a red central dot for 2 s.

Since the temporal resolution of the eye movement data (1000 Hz) is higher than the frame rate of the videos (15 to 30 Hz), we first down-sampled the eye movement data by averaging the gaze positions over the samples corresponding to each frame.
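A minimal sketch of this down-sampling step is shown below, assuming the raw gaze samples and frame onset times are available as NumPy arrays; the function name and argument layout are illustrative rather than taken from our analysis code.

```python
import numpy as np

def downsample_gaze(gaze_xy, sample_times, frame_times):
    """Average 1000 Hz gaze samples within each video frame's time window.

    gaze_xy: (n_samples, 2) gaze positions in pixels
    sample_times: (n_samples,) eye-tracker timestamps in seconds
    frame_times: (n_frames + 1,) frame onset times; last entry marks the end of the video
    """
    per_frame = np.full((len(frame_times) - 1, 2), np.nan)
    for f in range(len(frame_times) - 1):
        in_frame = (sample_times >= frame_times[f]) & (sample_times < frame_times[f + 1])
        if in_frame.any():
            per_frame[f] = gaze_xy[in_frame].mean(axis=0)
    return per_frame
```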

Results

Eye movement consistency across subjects

We first tested whether the eye movements were consistent across subjects within the same video. For each subject, we obtained gaze location on each video frame. We then calculated gaze velocity from the gaze location difference between every two contiguous video frames. Figure 1a and b show two example frames with gaze locations and velocities from six subjects superimposed on them.
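Under the same assumptions, per-frame gaze velocity is simply the frame-to-frame displacement of the down-sampled gaze, as in this sketch; the pixels_per_degree conversion factor is a placeholder for the display geometry described above.

```python
import numpy as np

def gaze_velocity(per_frame_gaze, pixels_per_degree):
    """Gaze velocity between contiguous frames, in degrees of visual angle per frame.

    per_frame_gaze: (n_frames, 2) down-sampled gaze positions in pixels.
    Returns an (n_frames - 1, 2) array: velocity from frame t to frame t+1.
    """
    return np.diff(per_frame_gaze, axis=0) / pixels_per_degree
```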

FIGURE 1.

Example frames from two videos of the SumMe dataset, ‘Jumps’ and ‘Saving dolphins’. The ‘Jumps’ video shows a person traveling down a slide (a) and being launched into the air (b). In (a) and (b), blue dots indicate the gaze locations of six subjects, and arrows indicate their gaze velocities computed from eye positions in the current and next frames. Arrow length denotes eye speed. Note that eye position and velocity are inconsistent in (a), but consistent in (b). (c) A sample frame of the ‘Saving dolphins’ video, which shows people on a beach saving several stranded dolphins. The mixture-of-Gaussians gaze distribution from five subjects (in blue) and the gaze location of the remaining, predicted subject (red dot) are superimposed on the original video frame.

Consistency of gaze location.

Guided by previous studies that computed gaze consistency9,10, we used a leave-one-out technique to test, for each video frame, how well the gaze locations of five subjects predicted the gaze location of the remaining one. For a video frame t and each subject i, we first created a mixture of Gaussian distributions based on the gaze locations of the other five subjects, defined as

$$p_t^i(x, y) = \frac{1}{N-1} \sum_{n \neq i}^{N} \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{(x - x_t^n)^2 + (y - y_t^n)^2}{2\sigma^2}\right) \qquad (1)$$

where the sum runs over all N subjects n except subject i, and σ equals 1 degree of visual angle. Figure 1c shows an example of the Gaussian mixture gaze distribution and the gaze location of the predicted subject. If the probability density at subject i’s gaze location $(x_t^i, y_t^i)$ fell within the top 20% of the probability density values of the whole distribution, it was counted as a correct ‘detection’, indicating that the gaze location of subject i was successfully predicted by that of the other subjects. We used the average detection rate across all six subjects to indicate inter-subject consistency. We also established a random baseline by using the gaze locations of the five subjects at another, randomly chosen frame r to predict the gaze location of the remaining subject at frame t. We applied this procedure to 1000 randomly selected video frames. The first 20 frames of each video were excluded from this analysis to avoid confounds from the central dot presented before each video.
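The sketch below illustrates this leave-one-out detection test for a single frame, following Equation (1). It assumes gaze locations are expressed in degrees of visual angle and approximates the “top 20% of the probability density” by evaluating the mixture on a coarse grid over the frame; it is an illustration of the procedure, not our original analysis code.

```python
import numpy as np

def loo_detection_rate(gaze_deg, frame_size_deg, sigma=1.0, top_frac=0.20, grid_step=0.5):
    """Leave-one-out gaze-location consistency for one frame (cf. Equation 1).

    gaze_deg: (n_subjects, 2) gaze locations in degrees of visual angle.
    frame_size_deg: (width_deg, height_deg) of the displayed frame.
    Returns the fraction of subjects whose gaze falls within the top `top_frac`
    density region of the mixture built from the remaining subjects.
    """
    xs = np.arange(0, frame_size_deg[0], grid_step)
    ys = np.arange(0, frame_size_deg[1], grid_step)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)

    def mixture_density(points, centers):
        # Isotropic Gaussian mixture with one component per remaining subject.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma**2)).sum(axis=1) / (len(centers) * 2 * np.pi * sigma**2)

    hits = 0
    n = len(gaze_deg)
    for i in range(n):
        others = np.delete(gaze_deg, i, axis=0)
        density_grid = mixture_density(grid, others)
        threshold = np.quantile(density_grid, 1.0 - top_frac)
        density_at_gaze = mixture_density(gaze_deg[i:i + 1], others)[0]
        hits += density_at_gaze >= threshold
    return hits / n
```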

Figure 2a shows the average detection rate from the inter-subject consistency and random baseline analyses. A paired t-test across the 1000 frames shows that the detection rate in the inter-subject consistency analysis is significantly higher than that in the cross-frame control analysis (t(999)=23.5, p<0.001). Note that the detection rate of the random baseline is also above chance. This might reflect a bias of observers to look at the center of the screen in some of the video frames, since this can occur regardless of image content3,10.

FIGURE 2.

Consistency of human eye movements. (a) Average gaze location consistency across 1000 selected frames, and the corresponding random cross-frame control. (b) Average difference in gaze velocity direction across contiguous frames. (c) Average difference in gaze speed across contiguous frames. (d) Consistency of gaze locations for selected and unselected frames. (e) Direction and (f) speed differences of eye velocities across contiguous frames for selected and unselected frames. Error bars show the standard error of the mean for within-subject designs (a, d, e, f), calculated following the suggestion of Cousineau (2005)11, and the 95% confidence interval of the mean (b, c).

Consistency of gaze velocities.

In this analysis we determined if gaze velocity at each video frame was consistent across each pair of subjects. For each of the 15 pairs of subjects, we calculated their gaze velocity difference at each video frame. We also generated a random baseline to compare with the pairwise data by shuffling the temporal order of subjects’ eye movement sequences within a video. This was done 1000 times to generate baseline distributions of gaze velocity differences.

For each subject pair, we calculated the differences between the velocity directions using the following formula:

$$g = \arccos\!\left(\frac{\mathbf{a}_i \cdot \mathbf{a}_j}{\|\mathbf{a}_i\|\,\|\mathbf{a}_j\|}\right) \qquad (2)$$

where $\mathbf{a}_i$ indicates the eye velocity of subject i from frame t to the next frame t+1. For the speed difference, we computed the absolute difference between the lengths of the two vectors:

$$b = \big|\,\|\mathbf{a}_i\| - \|\mathbf{a}_j\|\,\big| \qquad (3)$$
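A small sketch of Equations (2) and (3) for one pair of subjects at one frame; the velocity vectors are assumed to be NumPy arrays, and the clipping guards against floating-point values just outside [−1, 1].

```python
import numpy as np

def velocity_direction_difference(a_i, a_j):
    """Angle between two gaze-velocity vectors (Equation 2), in radians."""
    cos_angle = np.dot(a_i, a_j) / (np.linalg.norm(a_i) * np.linalg.norm(a_j))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def velocity_speed_difference(a_i, a_j):
    """Absolute difference between the two gaze speeds (Equation 3)."""
    return abs(np.linalg.norm(a_i) - np.linalg.norm(a_j))
```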

The results are shown in Figure 2b and c. One-sample t-tests show that the mean differences in gaze velocity direction and speed across subject pairs are significantly lower than the means of the random baselines (direction: t(14)=31.55, p<0.001; speed: t(14)=18.97, p<0.001). In summary, consistent with previous studies3, eye movement patterns were similar across subjects during video watching for both gaze location and gaze velocity.

Eye movement consistency reflects the importance of a frame

Although subjects’ eye movements were generally consistent while watching videos, there were frames in which eye movements were inconsistent (see Figure 1a). Could gaze consistency serve as an indicator of the importance of individual frames? To test this, we compared eye movement consistency between subjects for the frames that were selected for the user-summarized videos and those that were excluded. For consistency of gaze location, an independent t-test showed that the consistency level of the selected frames was significantly higher than that of the unselected frames (t(1998)=6.85, p<0.001, Figure 2d). For consistency of gaze velocity, a paired t-test across the 15 pairs of subjects showed that the difference in eye velocity direction in the selected frames was significantly lower than that in the unselected frames (t(14)=9.96, p<0.001, Figure 2e). Similarly, the difference in eye speed in the selected frames was also significantly lower than that in the unselected frames (t(14)=10.07, p<0.001, Figure 2f). Therefore, subjects’ eye movement patterns were more similar to each other while viewing the frames that were included in the user summaries than those that were excluded. The results suggest that human gaze patterns can indicate whether a video frame is important for summarizing a video, and that people tend to make similar eye movements while viewing the important video frames.

COMPUTATIONAL EXPERIMENT

Our behavioral results provide evidence that eye movement consistency within a given frame reflects whether a video frame is important for video summarization. These results suggest that eye movements indicate informative aspects of videos that are not directly related to their low level visual features and instead might be cognitively derived, e.g. object-based knowledge or prediction of future frames. If so, adding gaze information to a video-summarization model should improve its performance. To test this, we developed a gaze-aware deep learning model to perform video summarization, and compared its performance to a model that operated with only low-level physical attributes of the frames.

Gaze-Aware Deep Learning Model for Video Summarization

We employed a deep learning model similar to those used in many video-related tasks12. These models rely on multiple layers of nonlinear processing to extract useful features in a task (e.g. object detection, classification, etc.). Since deep learning networks are effective in processing images and videos, they might be well suited to learn the importance of video frames from low level visual features as well as gaze patterns.

We constructed a novel multi-stream deep learning model for video summarization that we call DLSum (Figure 3a). There are three streams in the model: 1) the spatial stream, 2) the temporal stream, and 3) the gaze stream.

FIGURE 3.

Illustration of the DLSum models and their performance on the SumMe dataset. (a) General model structure. Spatial and temporal information from the video, as well as human gaze data, are fed to convolutional neural networks. The fused features are then input to the SVR algorithm to predict the importance score for each frame, and the final summary is generated from these scores. (b) A schematic illustration of the Gaze representation in the model. Two sequential video frames t and t+1 with one subject’s gaze position (the red dot) are transformed into the horizontal ($g_t^x$) and vertical ($g_t^y$) components of the Gaze representation $g_t(x, y)$. The location of the dots corresponds to the subject’s gaze location at frame t. Darker colors correspond to leftward and upward movements; the darker dot in the $g_t^y$ image represents a large upward displacement of gaze location. In the actual model implementation, gaze information from all subjects was combined to form a single Gaze representation. (c) Performance of different versions of the DLSum models in terms of the AMF and (d) AHF scores.

The gaze stream takes the raw eye positions and velocities as inputs. The Gaze representation of an arbitrary frame t is denoted $g_t$, with $g_t^x$ and $g_t^y$ its horizontal and vertical components. Both $g_t^x$ and $g_t^y$ are matrices of the same size as the original video frame, with values initialized to zero. We then centered a 1°-diameter circle on each subject’s gaze location at frame t. The values within each circle reflect that subject’s gaze velocity from frame t to frame t+1. Positive values indicate rightward movements for the horizontal component $g_t^x$ and downward movements for the vertical component $g_t^y$, and the absolute values reflect horizontal and vertical gaze speeds. Finally, all values were normalized to the (0, 255) range to generate a gray-scale $g_t$ representation for each frame. A schematic illustration of the construction of one subject’s Gaze representation is shown in Figure 3b. We stacked multiple Gaze representations to represent the movement of gaze across frames: a 2L-channel input for the gaze stream was formed by stacking the Gaze representations from frame t to the next L−1 frames. The final gaze input $Y_t \in \mathbb{R}^{w \times h \times 2L}$ for frame t is defined as:

$$Y_t^{2i-1} = g_{t+i-1}^{x}, \qquad Y_t^{2i} = g_{t+i-1}^{y}, \qquad 1 \le i \le L \qquad (4)$$
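The sketch below illustrates how a per-frame Gaze representation and the stacked input $Y_t$ of Equation (4) could be constructed. The disc-drawing details, argument names, and the deferred 0–255 rescaling are our assumptions rather than the exact implementation.

```python
import numpy as np

def gaze_representation(frame_shape, gaze_px, velocity_px, px_per_deg):
    """Build the per-frame Gaze representation g_t = (g_t^x, g_t^y).

    frame_shape: (height, width) of the video frame
    gaze_px: (n_subjects, 2) gaze positions (x, y) at frame t, in pixels
    velocity_px: (n_subjects, 2) gaze displacement from frame t to t+1, in pixels
    Returns two float maps the size of the frame; rescaling to 0-255 happens later.
    """
    h, w = frame_shape
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    radius = 0.5 * px_per_deg  # 1-degree-diameter disc around each gaze point
    yy, xx = np.mgrid[0:h, 0:w]
    for (x, y), (vx, vy) in zip(gaze_px, velocity_px):
        disc = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
        gx[disc] = vx   # positive = rightward
        gy[disc] = vy   # positive = downward
    return gx, gy

def stack_gaze_input(gaze_maps, t, L=10):
    """Stack Gaze representations from frames t..t+L-1 into Y_t (Equation 4)."""
    channels = []
    for i in range(L):
        gx, gy = gaze_maps[t + i]
        channels.extend([gx, gy])
    return np.stack(channels, axis=-1)  # shape (h, w, 2L)
```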

The spatial stream takes the RGB values of each pixel as inputs to represent the visual characteristics of the video frames. The temporal stream takes multi-frame motion vectors as inputs to convey the video’s motion information to the model. The motion vectors of an arbitrary frame t are denoted $m_t$, which contains the displacement from frame t to frame t+1; $m_t^x$ and $m_t^y$ are the horizontal and vertical components of $m_t$. A 2L-channel input was stacked to convey the motion information between frame t and the next L−1 frames, where L is the stacking length. The multi-frame motion vectors $F_t \in \mathbb{R}^{w \times h \times 2L}$ for an arbitrary frame t were defined as:

$$F_t^{2i-1} = m_{t+i-1}^{x}, \qquad F_t^{2i} = m_{t+i-1}^{y}, \qquad 1 \le i \le L \qquad (5)$$

Following Simonyan and Zisserman (2014)12, we set the stacking length L of the Gaze representations and multi-frame motion vectors to 10. In the training stage, each stream is trained separately. In the spatial stream, RGB images with their corresponding labels are input to the convolutional neural network; the temporal and gaze streams are trained in the same way with their respective inputs (multi-frame motion vectors and Gaze representations). The outputs of each stream’s second fully-connected layer (4096 neurons) are then concatenated to form the features of each video frame. The combined features with their corresponding labels are input to the support vector regression (SVR) algorithm13 to train a regression model. In the test stage, we first use the trained ConvNets to extract features of the test frames. Then, the learned SVR model is used to predict an importance score for each frame. The final summary is composed of the video frames within the top 15% of the model-generated importance scores8,14.
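As a rough sketch of this fusion-and-regression stage (not our Caffe implementation), the following assumes the 4096-dimensional fully-connected features of each stream have already been extracted per frame, and uses scikit-learn’s SVR with a small grid search, mirroring the SVR grid search described in the next subsection; all parameter grids are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def fuse_features(spatial_fc7, temporal_fc7, gaze_fc7):
    """Concatenate each stream's 4096-d fully-connected features per frame."""
    return np.concatenate([spatial_fc7, temporal_fc7, gaze_fc7], axis=1)

def train_importance_regressor(train_features, train_scores):
    """Fit an SVR that maps fused frame features to importance scores."""
    grid = {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1], "gamma": ["scale"]}
    search = GridSearchCV(SVR(kernel="rbf"), grid, cv=3)
    search.fit(train_features, train_scores)
    return search.best_estimator_

def summarize(test_features, regressor, keep_fraction=0.15):
    """Predict per-frame importance and keep the top 15% of frames as the summary."""
    scores = regressor.predict(test_features)
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    return scores >= threshold
```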

Implementation and Evaluation Details

We compared the gaze-aware model (DLSum-RealGaze) with a variant that used only the video-based spatial and temporal streams as inputs (DLSum-NoGaze). Since collecting eye movement data during video watching requires additional human effort, the DLSum-RealGaze model might be impractical for some applications. Therefore, we also tested another variant in which the gaze stream used gaze locations predicted by the class activation mapping (CAM) technique15 as input (DLSum-PredictedGaze). This model uses predicted rather than collected gaze information.

For each variation of our model, we performed a 3-fold cross validation for training and testing. The VGG-16 deep convolutional neural networks were pre-trained on the ImageNet dataset to avoid over-fitting. We implemented our model on the Caffe deep learning framework with Tesla K80 GPUs. In addition, a grid search was run to optimize the parameters for SVR.

We compared the automatic summary (A) with the human-annotated summaries (H) and report the F-score to evaluate the performance of the different models8,14. The F-score is defined as:

$$F = \frac{2 \times p \times r}{p + r} \qquad (6)$$
$$r = \frac{\text{number of matched pairs}}{\text{number of frames in } H} \times 100\% \qquad (7)$$
$$p = \frac{\text{number of matched pairs}}{\text{number of frames in } A} \times 100\% \qquad (8)$$

where r is the recall and p is the precision. Following conventions in the video summarization literature8,14, we report the average mean F-score (AMF) and the average highest F-score (AHF), obtained by comparing the model-generated summaries with the summaries generated by all users. The highest F-score is the model’s score against its best-matching user summary. AMF reflects how well the model prediction matches average user preferences, and AHF reflects how well it matches its most similar user’s summary.
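A minimal sketch of these metrics for one video, assuming the automatic and user summaries are boolean frame-selection vectors and a “matched pair” is a frame selected by both; AMF and AHF are then obtained by averaging the per-video mean and maximum over all videos.

```python
import numpy as np

def f_score(auto_sel, user_sel):
    """Harmonic mean of precision and recall between frame selections (Equations 6-8)."""
    matched = np.logical_and(auto_sel, user_sel).sum()
    if matched == 0:
        return 0.0
    r = matched / user_sel.sum()   # recall: matched frames / frames in H
    p = matched / auto_sel.sum()   # precision: matched frames / frames in A
    return 2 * p * r / (p + r)

def mean_and_highest_f(auto_sel, all_user_sels):
    """Per-video mean and highest F-score against every user's summary."""
    scores = [f_score(auto_sel, user_sel) for user_sel in all_user_sels]
    return np.mean(scores), np.max(scores)
```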

Results

Figure 4 shows an example of how the DLSum-RealGaze model operates on a sample video, ‘Cooking’, from the SumMe dataset. ‘Cooking’ demonstrates how a chef constructs an ‘onion volcano’, which is composed of onion slices with a flammable cooking-oil core. The model identified important frames during the video that were similar to those a human user chose, as is evident from the alignment of the three peaks.

FIGURE 4.

A comparison of human selection and model outputs of the video “Cooking” in the SumMe dataset. Top row: average importance score for each frame provided by users that constructed the summarized video in the SumMe dataset. Second row: model-generated importance score of each frame from our DLSum-RealGaze model. Red regions represent frames selected for the summarized video. Third row: representative samples of the video frames selected by the model.

We then compared the performance of our three-stream gaze-aware model with that of the two-stream model that does not use gaze. The gaze-aware model achieved higher AMF and AHF scores than the two-stream model (Figure 3c and d). Across all videos, a paired t-test showed that the DLSum-RealGaze model has a marginally significantly higher AMF score than the DLSum-NoGaze model (t(24)=1.98, p=0.059), and performs significantly better than the DLSum-NoGaze model on the AHF measure (t(24)=2.34, p<0.05). We also tested the DLSum-PredictedGaze model, which uses model-predicted instead of real gaze information. Although the DLSum-PredictedGaze model achieved numerically lower AMF and AHF scores, its performance is not statistically different from that of the RealGaze model and is significantly higher than that of the NoGaze model (Figure 3c and d). These results suggest that gaze-pattern information produces better video summarization than video-based spatial and temporal information alone.

We also asked how well a model with only gaze information performs video summarization. We tested a DLSum-GazeOnly model that uses only the gaze stream, and compared its performance with a baseline that randomly selected 15% of all frames as the summary, as well as with our three-stream gaze-aware model. Across all videos, a paired t-test showed that the DLSum-GazeOnly model performed significantly better than the Random baseline in terms of both AMF (t(24)=6.96, p<0.001) and AHF scores (t(24)=9.25, p<0.001). However, its performance was significantly lower than that of the DLSum-RealGaze model (AMF: t(24)=5.36, p<0.001; AHF: t(24)=6.19, p<0.001), which has both visual and gaze information. These results suggest that although gaze information is valuable for video summarization, combining it with visual information maximizes its contribution.

We also compared our model to several state-of-the-art video summarization models; the results are shown in Table 1. Our model outperformed most existing models, including the Creating Summaries from User Videos (CSUV) model8, the Summarizing Web Videos Using Titles (SWVT) model14, and the Video Summarization with Long Short-Term Memory (dppLSTM) model19. Although the Unsupervised Video Summarization with Adversarial LSTM Networks (SUM-GANsup) model16 had a higher AMF score than ours, it has a more complex architecture and needs more data to train; in fact, our models achieve a higher AMF score than its simpler variant (SUM-GAN w/o GAN). These results suggest that our model is competitive in the video summarization task.

TABLE 1.

Performance comparison of our proposed methods and the state of the art on the SumMe dataset.

Category          Method                AMF       AHF
Baseline          Random                13.95%    16.79%
Existing method   CSUV                  23.40%    39.40%
                  SWVT                  26.60%    --
                  dppLSTM               17.70%    42.90%
                  SUM-GANsup            43.60%    --
Proposed method   DLSum-NoGaze          35.44%    56.29%
                  DLSum-PredictedGaze   35.99%    57.27%
                  DLSum-RealGaze        36.00%    57.33%
                  DLSum-GazeOnly        19.60%    30.81%

‘--’ indicates that the result is not reported in the published paper.

DISCUSSION and CONCLUSION

We collected eye movement data from subjects watching 25 videos from the SumMe dataset with no explicit instructions to guide their eye movements. We showed that the consistency of gaze location and velocity across subjects was greater for the frames that human users chose as important for summarizing the videos. We then constructed a gaze-aware multi-stream deep learning model that incorporates subjects’ eye movement information to determine whether gaze information can facilitate video summarization. When gaze was taken as an input, the model outperformed a version that took only low-level visual features as input. The results suggest that eye movements indicate important information in videos.

Previous studies using static images found that observers’ eye movements reflect the informativeness of different spatial regions in a scene1,17. In the current study, we found that eye movements also reflect the importance of video frames in the temporal domain. We first showed that our subjects’ eye movement patterns were more consistent for frames that were rated as important by the users who summarized the SumMe dataset. This similarity of eye movements may have arisen because the selected important frames contain objects that subjects found important, and thus enabled the eye movements to help predict video frame importance.

We tested a computational model that incorporated eye movement information and found that it facilitated frame selection for video summarization. Most previous video summarization models find important content using low-level visual features, such as color and shape information14, as well as faces and other landmarks8. Some previous methods do use gaze information to facilitate video summarization: Xu et al. used fixation counts as attention scores and used fixation regions to generate video summaries20, and Salehin et al. investigated the connection between video summarization and smooth pursuit to detect important video frames21. In contrast to our method, both of these approaches required the gaze patterns to be characterized first, before they could be incorporated into the summarization models.

As in the current study, our previous work18 also investigated whether incorporating gaze information into the spatial stream facilitates video summarization. In that study, however, gaze information was used to preprocess the raw video, so the gaze information and the spatial information were confounded. Here, we test whether gaze information by itself facilitates model performance by inputting it as an independent stream. The higher performance of our gaze-aware model suggests that gaze information on its own facilitates video summarization and summarizes videos in a fashion similar to human users. Since eye movements are guided by top-down, memory-based knowledge and the semantic meaning of a scene1, gaze data might reveal internal cognitive processes during video watching that are not captured by low-level visual features. For example, a subject may use higher-level, object-based information to follow the storyline of a long video. However, the spatial properties or semantic content that cause consistent eye movement patterns and higher importance judgements remain unknown. Future studies should determine which content in a video frame drives these eye movement patterns, in order to evaluate the specific cognitive aspects of a video that are important for summarization.

We also tested a model that used predicted instead of real gaze information. The specific gaze prediction method we used does not consider the temporal structure of videos, but the model still achieved performance comparable to that of the DLSum-RealGaze model. Since the temporal stream of our model already takes the temporal structure into account, the temporal information embedded in the gaze stream may be redundant. Future work is needed to determine the importance of the temporal structure of human eye movements without other confounding factors. Our results could also be used to develop models that predict eye movements in videos and are then integrated into video summarization models.

In sum, our study shows that eye movements are a reliable predictor of the importance of temporal information in videos, and other video summarization models may consider using eye movement patterns to obtain better video summaries.

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health grant 5T32EY025201–03 to Z.M., the Smith-Kettlewell Eye Research Institute grant to S.H., and the National Natural Science Foundation of China grant 61502311 to S.Z.

Biography

Zheng Ma is a postdoctoral research fellow at the Smith-Kettlewell Eye Research Institute. Her research interests include visual cognition and using computational models to understand the human mind. Dr. Ma received a PhD in psychology from Johns Hopkins University. Contact her at zma@ski.org.

Jiaxin Wu is a graduate student in College of Computer Science & Software Engineering at Shenzhen University. Her research interests include video content analysis and deep learning. Jiaxin received a MS in computer science from Shenzhen University. Contact her at jiaxin.wu@email.szu.edu.cn.

Sheng-hua Zhong is an assistant professor in College of Computer Science & Software Engineering at Shenzhen University. Her research interests include multimedia content analysis, neuroscience, and machine learning. Dr. Zhong received a PhD in computer science from the Hong Kong Polytechnic University. Contact her at csshzhong@szu.edu.cn.

Jianmin Jiang is a professor in College of Computer Science & Software Engineering at Shenzhen University. His research interests include computer vision and machine learning in media processing. Dr. Jiang received a PhD in computer science from the University of Nottingham. Contact him at jianmin.jiang@szu.edu.cn.

Stephen J. Heinen is a senior scientist at the Smith-Kettlewell Eye Research Institute. His research interests include eye movements and human visual cognition. Dr. Heinen received a PhD in experimental psychology from Northeastern University. Contact him at heinen@ski.org.

Contributor Information

Zheng Ma, The Smith-Kettlewell Eye Research Institute.

Stephen J. Heinen, The Smith-Kettlewell Eye Research Institute.

REFERENCES

  • 1. Henderson JM. Human gaze control during real-world scene perception. Trends in Cognitive Sciences. 2003;7(11):498–504.
  • 2. Najemnik J, Geisler WS. Optimal eye movement strategies in visual search. Nature. 2005;434(7031):387.
  • 3. Goldstein RB, Woods RL, Peli E. Where people look when watching movies: Do all viewers look at the same place? Computers in Biology and Medicine. 2007;37(7):957–64.
  • 4. Dorr M, Martinetz T, Gegenfurtner KR, Barth E. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision. 2010;10(10):28.
  • 5. Shepherd SV, Steckenfinger SA, Hasson U, Ghazanfar AA. Human-monkey gaze correlations reveal convergent and divergent patterns of movie viewing. Current Biology. 2010;20(7):649–56.
  • 6. Kienzle W, Schölkopf B, Wichmann FA, Franz MO. How to find interesting locations in video: A spatiotemporal interest point detector learned from human eye movements. In: Joint Pattern Recognition Symposium. Springer; 2007.
  • 7. Truong BT, Venkatesh S. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2007;3(1):3.
  • 8. Gygli M, Grabner H, Riemenschneider H, Van Gool L. Creating summaries from user videos. In: European Conference on Computer Vision. Springer; 2014.
  • 9. Mathe S, Sminchisescu C. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In: Computer Vision–ECCV 2012. Springer; 2012. p. 842–56.
  • 10. Torralba A, Oliva A, Castelhano MS, Henderson JM. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review. 2006;113(4):766.
  • 11. Cousineau D. Confidence intervals in within-subject designs: A simpler solution to Loftus and Masson’s method. Tutorials in Quantitative Methods for Psychology. 2005;1(1):42–5.
  • 12. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems; 2014.
  • 13. Drucker H, Burges CJ, Kaufman L, Smola AJ, Vapnik V. Support vector regression machines. In: Advances in Neural Information Processing Systems; 1997.
  • 14. Song Y, Vallmitjana J, Stent A, Jaimes A. TVSum: Summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.
  • 15. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition.
  • 16. Mahasseni B, Lam M, Todorovic S. Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition.
  • 17. Loftus GR, Mackworth NH. Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance. 1978;4(4):565.
  • 18. Wu J, Zhong S-h, Ma Z, Heinen SJ, Jiang J. Foveated convolutional neural networks for video summarization. Multimedia Tools and Applications. 2018:1–23.
  • 19. Zhang K, Chao WL, Sha F, Grauman K. Video summarization with long short-term memory. In: European Conference on Computer Vision; 2016.
  • 20. Xu J, Mukherjee L, Li Y, Warner J, Rehg JM, Singh V. Gaze-enabled ego-centric video summarization via constrained submodular maximization. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. p. 2235–2244.
  • 21. Salehin MM, Paul M. A novel framework for video summarization based on smooth pursuit information from eye tracker data. In: Proceedings of the 2017 International Conference on Multimedia Retrieval. p. 692–697.
