Abstract
Videos are often accompanied by narration delivered either by an audio stream or by captions, yet little is known about saccadic patterns while viewing narrated video displays. Eye movements were recorded while viewing video clips with (a) audio narration, (b) captions, (c) no narration, or (d) concurrent captions and audio. A surprisingly large proportion of time (>40%) was spent reading captions even in the presence of a redundant audio stream. Redundant audio did not affect the saccadic reading patterns but did lead to skipping of some portions of the captions and to delays of saccades made into the caption region. In the absence of captions, fixations were drawn to regions with a high density of information, such as the central region of the display, and to regions with high levels of temporal change (actions and events), regardless of the presence of narration. The strong attraction to captions, with or without redundant audio, raises the question of what determines how time is apportioned between captions and video regions so as to minimize information loss. The strategies of apportioning time may be based on several factors, including the inherent attraction of the line of sight to any available text, the moment-by-moment impressions of the relative importance of the information in the caption and the video, and the drive to integrate visual text accompanied by audio into a single narrative stream.
Keywords: eye movements, saccadic eye movements, saccades, cognition, videos, movies, narration, captions, salience models, event perception, multi-sensory integration, reading
Introduction
Saccadic eye movements take the line of sight to areas of interest in the visual scene in an effortless but purposeful way. They are indispensable for coping with the wealth of information that is distributed throughout the visual world. Decisions about how to plan saccades in space and time thus play a crucial role in apprehending the content of natural scenes.
A great deal of prior work has focused on identifying the factors that drive saccadic decisions while inspecting static scenes or performing visual or visuomotor tasks (e.g., Epelboim et al., 1995; Epelboim & Suppes, 2001; Hayhoe & Ballard, 2005; Johansson, Westling, Bäckström, & Flanagan, 2001; Kibbe & Kowler, 2011; Kowler, 2011; Land & Hayhoe, 2001; Land, Mennie, & Rusted, 1999; Malcolm & Henderson, 2010; Motter & Belky, 1998; Najemnik & Geisler, 2005; Pelz & Canosa, 2001; Steinman, Menezes, & Herst, 2006; Torralba, Oliva, Castelhano, & Henderson, 2006; Turano, Geruschat, & Baker, 2003; Wilder, Kowler, Schnitzer, Gersch, & Dosher, 2009; Yarbus, 1967). Much of the discussion has surrounded the relative role played by bottom-up versus top-down factors in controlling saccadic decisions. Bottom-up factors refer to the properties of the visual stimulus itself, typically, the contrast of visual features of the display (Koch & Ullman, 1985). Top-down factors encompass everything else, including voluntary attention, the judged importance or relevance of different locations, the constraints imposed by limitations of memory, and (in the case of visuomotor tasks) the coordination of eye and arm. Tatler, Hayhoe, Land, and Ballard (2011) concluded on the basis of a recent review that top-down factors are more important than bottom-up factors in driving saccadic decisions but that an understanding of the nature and operation of the relevant top-down factors is a complex endeavor that is still at a relatively early stage.
The debate about the factors that control saccadic decisions has been recently extended to the characteristics of eye movements made while watching movies or videos (Berg, Boehnke, Marino, Munoz, & Itti, 2009; Carmi & Itti, 2006; Dorr, Martinetz, Gegenfurtner, & Barth, 2010; Itti, 2005; Le Meur, Le Callet, & Barba, 2007; Tseng, Carmi, Cameron, Munoz, & Itti, 2009; Vig, Dorr, & Barth, 2009). Videos are interesting stimuli, more representative of natural visual arrays than static pictures. Their content changes over time and includes motion, as well as a top-down component that originates from the attempts to understand and interpret the depicted events (Itti, 2005; Pantelis et al., 2011; Zacks & Tversky, 2001). In contrast to studies in which the changes to the visual stimulus are produced by observers' actions (e.g., Epelboim et al., 1995; Johansson et al., 2001; Land & Hayhoe, 2001; Pelz & Canosa, 2001; Steinman et al., 2006), videos allow comparisons of performance when content remains the same across all observers. Thus, the study of eye movements while watching videos can provide a useful addition to the array of approaches being used to identify the factors that drive saccadic decisions.
A few previous studies have described eye movements while watching videos. One question dominating this prior work was the role of physical salience in predicting fixated locations. Analyses showed that measures of salience based on either flicker or motion were better predictors of fixated locations than measures based on either intensity or color (Carmi & Itti, 2006; Le Meur et al., 2007). The results also showed preferences to maintain gaze near the center of the display (Berg et al., 2009; Dorr et al., 2010; Le Meur et al., 2007; Tseng et al., 2009), analogous to what has been found for viewing static pictures (Tatler, 2007). Centering preferences may reflect strategies of looking at the most important or vivid objects, which are often placed near the center of the image (Dorr et al., 2010; Tseng et al., 2009), or strategies of positioning gaze at the location that may be best for resolving the greatest number of details across the screen (Tatler, 2007).
Studies of eye movements while watching videos have also examined the same global characteristics of saccadic patterns that have been traditionally studied in both simple and complex visual tasks (Findlay & Gilchrist, 2003; Rayner, 1998), namely, the distributions of sizes of saccades, the durations of fixation pauses, and the scatter of fixated locations. These characteristics are often used to define typical viewing patterns and provide measures that have been used to infer scanning strategies. For example, Dorr et al. (2010) found longer fixation durations, smaller saccades, and less scatter of landing positions while viewing videos than static pictures. They concluded that these differences reflect preferences to maintain gaze near the center of the video images or, in the case of what they called “natural” movies, preferences for occasional large shifts of gaze between clusters of interesting regions. Berg et al. (2009) compared eye movements of monkey and human subjects watching the same videos. They found that monkeys made larger and more frequent saccades and were less likely to confine gaze to the center of the screen than were humans. They attributed this species difference to top-down factors, in particular, to the inability of the monkeys to follow the events or to understand the importance of the main actors (who were often located in the center of the images). They assumed that the inability to fully interpret the sequence of depicted events encouraged the monkeys to explore over wider regions of the displays.
One limitation in the prior work on eye movements while viewing videos has been the absence of sound or narration. Narration is often found in videos and provides additional information that guides the interpretation of events (Carmi & Itti, 2006). Narration could change the scanning strategies or scanning characteristics due to the contribution of top-down factors. There have been prior studies of eye movements while viewing static pictures that incorporated narration in the form of spoken sentences or captions. These studies, however, were concerned with using eye movements to infer properties of real-time language processing and were not concerned with characterizing the saccadic strategies used to inspect visual scenes (Andersson, Ferreira, & Henderson, 2011; Spivey, Tanenhaus, Eberhard, & Sedivy, 2002; Trueswell, Sekerina, Hill, & Logrip, 1999). There also have been studies of multimedia learning that included auditory information with videos, but these studies were concerned with how the choice of fixated locations contributed to the understanding and retention of information (Hyönä, 2010; Schmidt-Weigand, Kohnert, & Glowalla, 2010).
There are two main goals of the present study. The first is to study the strategies of reading captioned videos. Captions present challenges to viewers because gaze has to shift continually between video and text. Strategies of saccadic guidance should, ideally, be configured so as to minimize loss of information from either the captions or the video portion of the display. However, prior results using static pictures suggest that observers have strong preferences to read text regardless of its utility. For example, viewers show preferences to read text present in static pictures even when the text is neither vivid nor important (Cerf, Frady, & Koch, 2009). Preferences to fixate text persist when text is scrambled into nonsense words, turned upside down, or presented in an unfamiliar language (Wang & Pomplun, 2012). The preferences to look at text even when text is uninformative or redundant are interesting because such preferences appear to lead to no important gain of information, in contrast to tasks such as search, where the optimal nature of saccadic strategies has been emphasized (Najemnik & Geisler, 2005). There are also some prior reports of preferences to read captions while watching videos with redundant audio (Bisson, van Heuven, Conklin, & Tunney, 2011; d'Ydewalle, Praet, Verfaillie, & Van Rensbergen, 1991). These prior studies were limited in that they used stimuli consisting of conversations or “talking heads,” in which the information conveyed by the narration was critical to interpretation. In the present study, no conversations or talking heads were present. The narration provided background information or explanations of the depicted visual events. We hypothesize that viewers will spend little time reading captions in the presence of redundant audio because this would take attention away from the video.
The second goal of the present study is to determine effects of audio narration on major characteristics of the eye movement patterns that have been studied in the past (see above) to infer scanning strategies. These characteristics are: (a) the distributions of saccade sizes and pause durations, (b) centering tendencies (i.e., scatter of landing positions), and (c) the influence of physical salience. Any differences between these characteristics when the video is viewed with and without narration would not be due to physical (visual) salience but rather would provide evidence for a role of top-down factors. On the basis of prior work on centering tendencies while viewing videos, we would hypothesize that the added information provided by narration should increase attention to the flow of events and that the increased attention to the events would be reflected by increased centering tendencies (Berg et al., 2009) and a reduced influence of physical salience. On the other hand, if centering is due to purely visual factors (Dorr et al., 2010; Tseng et al., 2009), no effect of narration on centering would be expected. In addition, finding an influence of narration on the size of saccades or on the intersaccadic pause durations, analogous to previous studies comparing these characteristics in videos and static pictures (see above), would point to effects of narration on global aspects of viewing strategies, including processes used to apprehend the events or the time allocated to processing fixated material.
Two experiments were conducted. Experiment 1 used long (2 min) videos, accompanied by audio narration, captions, both, or neither. Experiment 2 used shorter duration videos (15 s) with or without concurrent audio. The main findings were that audio narration had little effect on saccadic patterns, apart from a slight increase in the scatter of landing positions when no narration was available to guide interpretation of events, whereas captions had large effects: there were strong and surprising preferences to spend a substantial proportion of time reading captions even when redundant audio was available.
Experiment 1: Effect of captions and audio narration on eye movements while viewing videos
Methods
Eye-movement recording
Eye movements were recorded using the Eyelink 1000 (SR Research, Osgoode, Canada), tower-mounted version, sampling at 1000 Hz. Stimuli were presented on a Viewsonic G90fb CRT monitor, 1024 × 768 resolution, 60 Hz refresh rate, located at a viewing distance of 119 cm. The display area subtended 16.2° horizontally by 12.3° vertically. A chin rest was used to stabilize the head. Eye movements were recorded from the right eye. View of the left eye was occluded by a patch. Viewing was limited to the recorded eye because studies of binocular eye movements have shown that, when binocular view is permitted, the two eyes do not necessarily fixate the same location (Kowler et al., 1992; Steinman et al., 2006). Thus, recording from one eye during binocular viewing may not necessarily provide an accurate measure of the intended fixated location.
Stimuli
Sixteen 2-min video clips were tested. Four clips were cut from each of the following four source videos, all documentaries: Meerkat Manor: The Story Begins (Discovery Communications, LLC, 2008), March of the Penguins (Warner Home Video, 2005), Destiny in Space (Warner Home Video, 2005), and Earth the Biography: Volcanoes (BBC Worldwide Ltd. Program, 2008). The long duration of the clips (about as long as typical movie trailers) was used because it seemed to allow sufficient time for viewers to understand and follow the sequence of developing events. Clips were chosen with the constraint that the narrator was not depicted on screen, that is, there were no talking heads or conversations. Clips also contained enough information to allow a brief post-trial test of memory for the contents. Clips were edited to remove instances of long (>∼3 s) uneventful pauses. Captions, when present, contained an average of 7.67 words (SD = 3.15).
Procedure
Each subject was tested in a single experimental session. Before testing began, subjects were told that they would view 2-min video clips, each followed by six four-alternative multiple choice questions testing memory for the content. These questions were included to provide motivation for watching the videos. Subjects were also told that videos would contain captions, audio, captions and audio, or neither captions nor audio.
An experimental session consisted of 16 trials, organized in blocks of four. Within each block, each of the four viewing conditions was tested once: captions only, audio only, captions + audio, and no audio/no captions. The content of the captions was identical to the content of the audio.
In any given block, each viewing condition was paired with a clip from a different source video so that each of the four source videos was represented once in each block. As a result, by the end of the 16 trials, clips from each source video were seen an equal number of times in each of the four viewing conditions. No clip was seen more than once. The order of the conditions and clips within a block was haphazard with the constraint that the same viewing condition was never tested in consecutive trials across blocks. The memory test given after each trial consisted of six multiple choice questions. Questions were equally divided among those that tested memory for content presented in the narration only, the video only, or both. Performance was 74% correct over all subjects when some form of narration was provided (captions and/or audio) and 54% when no narration was given. The additional errors in the condition without narration were due to those questions that were primarily drawn from the content of the narration.
The calibration routine built into the Eyelink software was run before the start of the experiment and again before each trial. After the Eyelink calibration subjects fixated a central cross and started the trial by button press when ready. This was followed by a presentation of five crosses, one in the center and one in each corner of the display to serve as an additional check on calibration. Calibration scale factors were adjusted for each trial depending on the outcome of these additional calibrations. Adjustments in scale factors were typically <10%. After the video ended, subjects removed their head from the chin rest and answered six multiple choice questions with pencil and paper.
Subjects
Six subjects (paid volunteers and Rutgers University students) were tested. All had normal or corrected-to-normal (soft contact lenses) vision and were naïve to the experimental design and hypothesis. Results from the six individual subjects will be identified by an arbitrary two-letter code (SA, SC, SJ, ST, SL, SN). All subjects except one (SA) were native English speakers (SA learned English as a child). The project was approved by the Rutgers University IRB for the protection of human subjects.
Analysis
The beginning and ending positions of saccades were detected offline by means of a computer algorithm employing a velocity criterion to find saccade onset and offset. The value of the criterion was determined empirically for individual observers by examining a large sample of analog recordings of eye positions. Portions of data containing blinks or episodes where tracker lock was lost were eliminated (SA 26%, SC 27%, SJ 11%, ST 3%, SL 2%, SN 4%). These proportions were about the same across conditions (audio 11%; captions 11%; neither audio nor captions 13%; both audio and captions 13%). Data reported are based on the analysis of 1454 s for SA, 1434 s for SC, 1748 s for SJ, 1906 s for ST, 1925 s for SL, and 1886 s for SN. Note that the larger proportion of time in which lock was lost for two of the subjects (SA and SC) is not surprising given that the long durations of the trials made frequent blinks likely, as well as episodes in which lock was lost due to the eyelid obscuring portions of the pupil. The data available for all subjects (more than 24 min of recording for each) were sufficient to obtain a reliable estimate of performance.
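For readers who wish to reproduce this step, the sketch below shows a generic velocity-criterion saccade detector of the kind described above. It is an illustration under stated assumptions, not the authors' code: the 1000-Hz sampling rate matches the recording, but the default threshold and the velocity estimate are generic choices (in the actual analysis the criterion was tuned per observer, and the tuned values were not reported).

```python
import numpy as np

def detect_saccades(x, y, fs=1000.0, vel_threshold=30.0):
    """Find saccade onsets/offsets with a simple velocity criterion.

    x, y : eye position in degrees, sampled at fs Hz.
    vel_threshold : deg/s; a generic default, tuned per observer in
    the study (the tuned values were not reported).
    Returns a list of (onset_index, offset_index) pairs.
    """
    vx = np.gradient(x) * fs            # horizontal velocity (deg/s)
    vy = np.gradient(y) * fs            # vertical velocity (deg/s)
    speed = np.hypot(vx, vy)

    above = speed > vel_threshold
    edges = np.diff(above.astype(int))  # +1 at onsets, -1 at offsets
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    if above[0]:                        # trace starts mid-saccade
        onsets = np.insert(onsets, 0, 0)
    if above[-1]:                       # trace ends mid-saccade
        offsets = np.append(offsets, len(above) - 1)
    return list(zip(onsets, offsets))
```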
Video clips containing captions were examined to determine the frame numbers of the onset and offset of episodes in which captions were present. Caption onset and offset times for each trial were then adjusted for any frames that were dropped during the presentations using the record of dropped frames maintained in the Eyelink software. Any pair of captions that occurred consecutively with a gap of less than 210 ms between them was considered to be part of a single caption episode.
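The 210-ms rule amounts to a standard interval merge. A minimal sketch (ours, not the authors' code), assuming caption onset/offset times in seconds that have already been corrected for dropped frames:

```python
def merge_caption_episodes(intervals, max_gap=0.210):
    """Merge consecutive caption intervals separated by gaps < max_gap s.

    intervals : list of (onset_s, offset_s) tuples, sorted by onset.
    Returns the list of merged caption-episode intervals.
    """
    episodes = []
    for onset, offset in intervals:
        if episodes and onset - episodes[-1][1] < max_gap:
            episodes[-1] = (episodes[-1][0], offset)  # gap < 210 ms: extend episode
        else:
            episodes.append((onset, offset))
    return episodes
```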
Results
Spatial distributions of eye movements
Figure 1 illustrates a typical eye trace for a representative subject viewing several seconds of the clip Earth the Biography: Volcanoes. Saccades (about 1°–3°) can be seen occurring about once or twice per second. Brief episodes of smooth pursuit can be seen at about second 98 and again at second 101.
Figure 2 shows the spatial distribution of eye positions (resolution 1 ms, with samples during saccades omitted) for the same subject for each of the four narration conditions: audio only; captions only; audio + captions; and neither audio nor captions. The eye positions were pooled across the four clips tested. For the two conditions without captions, namely, audio only and neither captions nor audio (left panels), the line of sight almost always remained within the central 5° × 5° region of the 16° × 12° display, i.e., about 13% of the total display area (see also Berg et al., 2009; Dorr et al., 2010; Le Meur et al., 2007; Tseng et al., 2009). When captions were present, the patterns changed in that a large proportion of eye samples also fell in the lower portion of the display where the caption was located. Distributions of eye positions were similar for all subjects (see Supplemental Figures S1–S5).
Saccades made within the caption and video regions
Figure 3 compares the preferences to fixate within the caption region to preferences to fixate within the video region of the display. The functions show distributions of the position of the line of sight along the vertical meridian at the onset time of saccades for each condition.
Without captions (the audio only and no captions/no audio conditions), the distributions peaked near the center of the display and seldom fell outside the central 2° by 2° region, regardless of the presence of audio. When captions were present, and regardless of whether redundant audio narration was also available (the captions-only and captions + audio conditions), the distributions peaked in the lower region of the display containing the caption with a secondary peak near the display center. The presence of audio was influential in that more saccades shifted over from the caption to the video region in the captions + audio condition than in the captions-only condition. The effect of audio will be analyzed in greater detail in the following section.
A large proportion of time was spent reading captions
The analyses above suggest that there were strong preferences to read captions and that concurrent audio narration reduced these preferences. To examine the effects of the redundant audio narration on reading of the captions more closely, eye movements were examined during time intervals when a caption appeared on the screen. A caption episode was defined as the interval between the onset and offset of a caption. In the event that two or more captions were presented with intervening intervals shorter than 210 ms, the captions were considered to constitute a single caption episode. Caption episodes lasted 6 s on average (SD = 3.5 s) and there were an average of 12.6 caption episodes (SD = 3.5) per video. Given a video duration of 2 min, this works out to caption episodes taking up about 63% of the time the video was presented.
Figure 4 shows in detail how subjects apportioned their time during the caption episodes in both the captions-only and captions + audio conditions. Time was divided into the following categories: (a) time spent within the caption area, including the pause durations between successive saccades and the in-flight time of saccades; (b) time spent either within the video area or traveling between caption and video areas including the pause durations between successive saccades and the in-flight time of saccades; (c) the latency of the first saccade made into the caption area in response to the onset of a caption episode; and (d) intervals in which tracker lock was lost.
Consider first the captions-only condition. When captions were on screen, a large proportion of time—ranging from 48% to 78% (average 63%)—was spent in the caption area. In the captions + audio condition, the proportions were still high: 32% to 66% (average 44%) of the time was spent reading the captions. The subjects who spent the largest proportion of time reading in the captions-only condition also spent the largest proportion of time reading when audio was available (captions + audio condition).
These values are surely underestimates. This is because using long trials (2 min), while advantageous for presenting a coherent and developing narrative, restricted the opportunity for intertrial blinks and thus had the expected consequence of frequent intervals in which tracker lock was lost. The total amount of time in which lock was lost varied among the observers, but within an observer, the totals were the same for the two conditions compared in Figure 4 (captions only; captions + audio). If we recompute the percentage of time spent in the caption area, eliminating from consideration the intervals in which lock was lost, the percentage of time spent reading increases to an average of 76% in the captions-only condition and 55% in the captions + audio condition. Note that the magnitude of the difference between the two conditions is unchanged.
Each subject spent less time in the caption area when the audio was present. The difference between the time spent in the caption area in the captions-only condition and in the captions + audio condition (see Figure 4) was significant, paired t test: t(5) = 3.34, p = 0.02. Further statistical confirmation of the differences between these two conditions was provided by a repeated measures analysis of variance performed on the proportion of time during fixation pauses in which the line of sight was in the caption area while captions were present (arcsine-square-root transform was used on the proportions). This analysis confirmed the significant differences between the time spent processing the captions between the captions-only and captions + audio conditions, F(1,586) = 73.45, p < 0.0001.
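For reference, the arcsine-square-root transform replaces each proportion p with p′ = arcsin(√p), the standard variance-stabilizing transform that makes the variance of proportions approximately independent of their mean before the ANOVA is run.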
In summary, a large proportion of time was devoted to reading captions, even with concurrent audio, with less time reading captions when the audio was present.
It is not surprising that captions were read. The surprising finding is that so much time, indeed, any time, was spent reading captions in the presence of redundant audio when the alternative—watching the video while listening to the narration—would seem to ensure no information loss.
Audio narration reduced the duration of visits to the captions
Figure 4 showed that less time was spent on the captions when the audio was present. Was this because audio led to some captions being ignored entirely, read faster, or read only in part?
The mean frequency of visits to the caption area from the video area was the same with or without audio (1.8 visits per caption in the captions-only condition and 1.7 visits per caption in the captions + audio condition). Thus, audio did not lead to selected captions being ignored entirely. In addition, the pattern of saccades made while reading the captions was the same in both conditions. The mean size of forward (rightward) saccades and the mean duration of intersaccadic pauses were each nearly identical with or without audio (Table 1); thus, the information provided by the audio did not speed up the reading nor did reading slow down in an attempt to keep time with the audio stream. The proportion of leftward saccades during reading (that is, regressions and the resetting saccades made to bring the line of sight to a new line of text) was also the same across the two conditions (mean = 36% with audio and 38% without audio), i.e., there is no evidence that the absence of audio encouraged more rereading (Schnitzer & Kowler, 2006).
Table 1. Saccades within the caption area: mean (SD) N for sizes of saccades (min arc) and intersaccadic pause durations (ms), in the captions-only and captions + audio conditions.

| Subject | Saccade size: captions only | Saccade size: captions + audio | Pause duration: captions only | Pause duration: captions + audio |
| SA | 104 (58.3) 495 | 107 (51.9) 342 | 224 (84.4) 506 | 221 (103.7) 366 |
| SC | 114 (71.9) 599 | 110 (69.9) 367 | 238 (119.1) 625 | 265 (144.4) 415 |
| SJ | 116 (77.8) 749 | 118 (81.1) 505 | 245 (131.1) 808 | 243 (126.3) 557 |
| ST | 118 (71.7) 663 | 117 (81.8) 308 | 224 (137.5) 722 | 213 (110.4) 363 |
| SL | 115 (78.3) 595 | 112 (75.8) 586 | 257 (183.2) 624 | 283 (164.6) 624 |
| SN | 148 (85.0) 544 | 149 (81.3) 266 | 254 (103.7) 640 | 246 (137.8) 346 |
| Mean | 119 (14.8) 6 | 119 (15.1) 6 | 240 (14.3) 6 | 245 (26.1) 6 |
The major consequence of having concurrent audio was to reduce the duration of the visits to the caption area. The duration of each visit within a given caption episode was defined as the time between a saccade into the caption region and a subsequent saccade out of the caption region. This definition allows for multiple visits during the same caption episode. The average duration of visits was 2 s (SD = 1.5 s) when audio was present and significantly longer, 2.6 s (SD = 2.0 s), when audio was absent, t(709) = 4.48, p < 0.00001. The latency of the first saccade into the caption following its initial appearance also was longer when audio was present than when the captions were presented without audio, F(1,391) = 14.87, p = 0.0001 (see Figure 4).
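To make the visit definition concrete, the sketch below (ours, not the authors' code) computes visit durations from a time-ordered list of fixations labeled by display region; a visit is approximated as running from the landing of the first caption fixation to the start of the saccade that exits the caption region.

```python
def caption_visit_durations(fix_regions, fix_onsets, fix_offsets):
    """Durations (s) of visits to the caption area within a caption episode.

    fix_regions : per-fixation labels in time order, 'caption' or 'video'.
    fix_onsets, fix_offsets : fixation start/end times (s); a fixation's
    offset marks the start of the saccade that leaves it.
    """
    visits, entry, last_off = [], None, None
    for region, on, off in zip(fix_regions, fix_onsets, fix_offsets):
        if region == 'caption':
            if entry is None:
                entry = on       # the saccade into the captions landed here
            last_off = off       # candidate end: start of the saccade out
        elif entry is not None:
            visits.append(last_off - entry)
            entry = None
    if entry is not None:        # visit still open when the episode ends
        visits.append(last_off - entry)
    return visits
```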
In summary, the concurrent audio did not alter the reading pattern and did not prevent gaze from being attracted to the caption region. Concurrent audio instead led to a portion of the caption being skipped and increased the latency of saccades to the caption.
Differences between saccadic patterns in the caption and video areas
The presence of both captions and video within the same trial provided an opportunity to compare saccadic patterns when reading versus when examining the pictorial material in the video portion of the display. In general, saccades when reading (Table 1) were made at a faster rate than saccades when viewing the video (Table 2). The average pause durations for inspection of the video (∼420 ms; see Table 2) were longer than those reported by Dorr et al. (2010) for movie trailers (mean ∼ 340 ms). Average sizes and pause durations of saccades in the video area were not affected by the presence of narration (see Table 2).
Table 2. Saccades within the video area: mean (SD) N for each of the four narration conditions.

| Measure | Audio only | Captions only | Neither | Both |
| Vector size (min arc) | 96 (80) 2132 | 96 (75) 728 | 102 (89) 1972 | 100.8 (83) 1250 |
| Intersaccadic pause duration (ms) | 429 (310) 2511 | 424 (285) 1091 | 411 (297) 2278 | 419 (280) 1654 |
Effect of audio narration on inspection of the video
Eye fixations clustered near the center of the display (Figures 2 and S1–S5). To compare centering across the four conditions, the two-dimensional scatter of saccadic offset positions for saccades landing within the video area was determined. Two-dimensional scatter was summarized by the bivariate contour ellipse area (BCEA), which represents the size of the area in which saccadic landing positions were found 68% of the time (see Steinman, 1965; Vishwanath & Kowler, 2004; Wu, Kwon, & Kowler, 2010, for other examples of the use of the BCEA for describing the two-dimensional scatter of eye positions in different oculomotor tasks).
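The BCEA has a standard closed form (see Steinman, 1965): BCEA = 2πk·σx·σy·√(1 − ρ²), where σx and σy are the standard deviations of the horizontal and vertical landing positions, ρ is their correlation, and k satisfies P = 1 − e^(−k) (k ≈ 1.14 for the 68% criterion used here). A minimal sketch of the computation:

```python
import numpy as np

def bcea(x, y, p=0.68):
    """Bivariate contour ellipse area (deg^2) containing a proportion
    p of the landing positions (standard formula; Steinman, 1965).

    x, y : arrays of horizontal and vertical landing positions (deg).
    """
    k = -np.log(1.0 - p)            # P = 1 - exp(-k); k ~ 1.14 for p = 0.68
    sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
    rho = np.corrcoef(x, y)[0, 1]   # correlation of x and y positions
    return 2.0 * np.pi * k * sx * sy * np.sqrt(1.0 - rho ** 2)
```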
Figure 5 (top) shows that BCEAs (averaged across subjects) were about 20–25 deg², which is about 10%–13% of the total area of the display. BCEAs were largest in the absence of any narration, although the differences were not significant, F(3, 15) = 1.82, p = 0.19. Inspection of the scatter of landing positions for the different video clips suggested that the absence of any narration (no captions; no audio) may have encouraged somewhat larger scatter for one of the clips (Earth the Biography: Volcanoes), which contained frequent scene cuts (Figure 5, middle and bottom graphs). The possibility that narration affects the scatter of landing positions in videos containing many scene cuts will be examined in Experiment 2.
Discussion
The presence of captions had large effects on the eye movements made while viewing videos. The main finding was that a high proportion of the viewing time was devoted to reading the captions even when the captions were redundant with the audio. Characteristics of the saccades made to read the captions, namely, the number of visits to the caption area, the size and frequency of the saccades, and the frequency of leftward saccades, were unaffected by the presence of redundant audio. The redundant audio, however, was not ignored in that it reduced the average duration of visits to the captions, as well as increased the latency of the initial saccades to the captions.
It would seem that the best strategy to use to avoid information loss when both captions and audio are available would be to process both video and audio streams concurrently, that is, keep the line of sight within the video portion of the display and use the audio stream to listen to the narration, ignoring the captions. Instead, all subjects chose to read the captions, spending somewhat less time reading in the presence of audio. Interestingly, once the decision was made to read the captions, the reading saccades themselves were unaffected by the audio. Thus, the effect of narration on strategies was found at a higher level in the decision tree, involving the choice of which region to look at (captions or video), rather than how to look within these regions. A more detailed examination of the possible reasons for the preferences to read captions in the presence of redundant audio will be taken up in the General discussion.
The suggestion that the most prominent effects of narration are found at a higher level in the decision tree is consistent with the observations made about saccades within the video portion of the display. Just as the pattern of reading saccades was unaffected by narration, the pattern of saccades made while examining the video was largely unaffected by the presence of narration. The average size of saccades and average intersaccadic pause durations were the same across the four conditions (Table 2). There was a small tendency for the scatter of fixated positions to increase in the absence of any narration for some video clips, perhaps because of decisions to search across a larger region of the display in an attempt to better interpret the events. Experiment 2 further investigates the role of audio narration on saccades using a larger number of different video clips, shorter duration clips, and no captions.
Experiment 2: Effect of audio narration on eye movements
The purpose of the second experiment was to further investigate the role of audio narration when viewing videos. Given the extensive discussion of salience in prior work on eye movements while viewing videos (see Introduction), the normalized salience levels of fixated locations were also analyzed, in addition to saccade sizes, pause durations, and the scatter of saccadic landing positions.
Experiment 2 used shorter duration video clips (15 s). Three conditions were tested: audio narration during the video, audio narration prior to the video, and no narration. A condition with narration prior to the video was included to examine whether any influence of narration is different when information is drawn from memory.
Methods
Stimuli
Nine video clips were tested, each with a duration of 15 s. Clips were taken from the following source videos: NHL: Greatest Moments (Warner Home Video, 2005), Magnetic Storm (WGBH, 2003), Destiny in Space (Warner Home Video, 2005), Earth the Biography: Volcanoes (BBC Worldwide Ltd. Program, 2008), and Indiana Jones and the Kingdom of the Crystal Skull (Lucasfilm Ltd., 2008). These videos were chosen because they were more eventful (i.e., contained more scene cuts) than those used in the first experiment, allowing us to examine viewing behavior when scenes changed more frequently, the condition that seemed to encourage scanning over a wider region in Experiment 1. Clips were once again chosen such that no talking heads appeared in the frame and such that each clip contained enough information to test memory for content. Clips were also chosen such that none started in the middle of what seemed to be an ongoing event. All clips were edited to remove instances of long (>∼2 s) uneventful pauses. The text of the audio narration was written by the experimenters and consisted of simple sentences describing the events (no conversations or monologues), lasting as long as the video clip.
Procedure
Three conditions were tested: audio narration prior to the video, audio narration concurrent with the video, and no narration. Each subject completed three trials per condition for a total of nine trials. One multiple choice question was asked after each clip in order to provide motivation for paying attention to the video's content. Experiment 2 was run with the same procedures as Experiment 1 except that the multiple choice question in Experiment 2 was presented and answered on the computer, allowing the subject's head to remain in the chinrest between trials. A small proportion of data was lost due to blinks and/or loss of tracker lock (audio during: 0.6%; audio prior: 1.4%; no audio: 0.5%).
Subjects
Subjects were 18 undergraduate and graduate students at Rutgers University serving for either course credit or payment. All had normal hearing and normal or corrected-to-normal (soft contact lenses) vision. All subjects were naïve to the experimental design and hypothesis. The project was approved by the Rutgers University IRB for the protection of human subjects.
Results
Influence of narration on distributions of saccadic landing positions, pause durations, and saccade sizes
Figure 6 compares distributions of saccade sizes, pause durations, and two-dimensional scatter of saccadic landing positions (BCEA) for all three narration conditions: audio narration during the video, audio before the video, and no narration. These characteristics were largely the same across the narration conditions; saccade size: F(2, 34) = 0.42, p = 0.66; pause duration: F(1.44, 24.88) = 0.83, p = 0.41; scatter of landing positions: F(2, 108) = 1.27, p = 0.28. Note that small saccades (<0.5°), a category that includes microsaccades, were rare (<5%), similar to what has been found during active visual tasks (Collewijn & Kowler, 2008; Malinov, Epelboim, Herst, & Steinman, 2000).
The effects of narration on saccadic performance over time
Saccadic patterns might change over time as the events evolve or as viewers become more familiar with the overall structure and theme of the depicted events (see also Itti, 2005). To determine the influence of narration over time, the same three measures of saccadic performance shown in Figure 6 were analyzed over separate 5 s epochs of the 15 s video.
Time was quite influential. Sizes of saccades increased over time, F(2, 4566) = 5.32, p = 0.005, pause durations decreased over time, F(2, 4566) = 19.09, p < 0.0001, and scatter increased over time for all conditions, F(2, 477) = 8.92, p = 0.0002 (Figure 7). By contrast, there were no significant main effects of the presence of narration on either saccade size or pause duration and no significant interactions. The largest scatter was found with no narration and the smallest with concurrent narration; however, differences did not reach significance, F(2, 385) = 2.52, p = 0.08.
Taking into account all three measures of saccadic performance, Figure 7 reveals a pattern of scanning that changed over time in that saccades became larger and more frequent, and landing positions were scattered over a larger region of the display. There was a small tendency for larger scatter in the absence of narration.
Comparison of salience levels of fixated locations
Previous studies of eye movements during the viewing of videos found a correspondence between levels of physical salience and the choice of fixated locations (Berg et al., 2009; Carmi & Itti, 2006; Le Meur et al., 2007). These results showed that physical salience, motion and flicker in particular, can be of some value in predicting which locations are fixated, regardless of whether the correlations between salience and saccades are due to some intrinsic attraction to salient features of the image or because locations with higher salience often correspond to locations of important events (Carmi & Itti, 2006). We compared the salience levels of fixated locations (as in Berg et al., 2009) when viewing the videos with and without narration.
Salience levels in the video frame corresponding in time to the offset of each saccade were computed with the algorithms available at http://ilab.usc.edu/toolkit/, using default parameter values (center and surround scales, c = {2, 3, 4}, and center-surround scale differences, δ = {3, 4}) (Itti, 2005).
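As an illustration of how salience can be sampled at fixated locations, the sketch below assumes the toolkit's salience maps have been exported as 2-D arrays (the export step is not shown). The z-score normalization across the frame is one plausible reading of “normalized salience,” consistent with values above zero indicating above-average salience, but not necessarily the exact scheme of Berg et al. (2009).

```python
import numpy as np

def normalized_salience_at_fixation(sal_map, fix_xy, screen_deg, eps=1e-9):
    """Sample a salience map at a fixated location, normalized relative
    to the rest of the frame (z-score across the map; an assumption).

    sal_map    : 2-D array, salience map of the frame at saccade offset.
    fix_xy     : (x_deg, y_deg) landing position, origin at top left.
    screen_deg : (width_deg, height_deg) of the display, e.g., (16.2, 12.3).
    """
    h, w = sal_map.shape
    col = int(np.clip(fix_xy[0] / screen_deg[0] * w, 0, w - 1))
    row = int(np.clip(fix_xy[1] / screen_deg[1] * h, 0, h - 1))
    return (sal_map[row, col] - sal_map.mean()) / (sal_map.std() + eps)
```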
Figure 8 shows the average normalized salience values at fixated locations for five different dimensions (flicker, motion, intensity, orientation, and color), and for the combined salience across the dimensions. The results are similar to prior reports (Berg et al., 2009; Carmi & Itti, 2006; Le Meur et al., 2007) in two respects. First, the normalized salience levels were all above zero, which shows a preference to look at locations of higher than average salience. Second, the normalized salience levels at fixated locations were higher for the two dynamic dimensions, flicker and motion, than for intensity, color, and orientation.
Figure 8 also shows that salience levels at fixated locations were lowest when the narration preceded the video for motion, F(2, 135) = 4.05, p = 0.02, flicker, F(2, 135) = 4.59, p = 0.01, and composite salience, F(2, 135) = 3.93, p = 0.02. Given that the salience levels at fixated locations did not differ between the no-narration and concurrent-narration conditions, the results when narration preceded the video may have resulted from influences not directly related to narration, such as distractions due to the attempts to retrieve details of the narration from memory, or reduced levels of interest over time (Carmi & Itti, 2006). Further investigation will be needed to distinguish among these possibilities.
Salience and the scatter of saccadic landing positions
Informal inspection of the eye movements while viewing the videos suggested that the line of sight tended to remain near a selected object for a few seconds while the location of maximum salience shifted much more frequently. Sustained interest in a single object reflects the operation of a top-down strategy that overrides salience (see Koch & Ullman, 1985, for discussion of attempts to build the sustained interest in a single object or location into the predictions of fixation positions).
To test these informal observations, we compared the two-dimensional scatter (BCEAs) of the observers' saccades (Figure 7) to the BCEAs of the locations of highest salience (Figure 9). The locations of highest salience were taken only from the frames corresponding to the offset time of each saccade in each of the three narration conditions (narration during the video, prior to the video, or no narration). Thus, the same number of samples and the same portions of the videos contributed to the computed BCEAs of the subjects (Figure 7) and to the locations of highest salience (Figure 9). Based on the informal observations noted above, we expected the scatter of the observers' fixated locations to be smaller than the scatter of the locations of maximum salience.
Comparing Figures 7 and 9 shows that BCEAs were, as expected, larger for the salience model than for the saccadic landing positions for all feature dimensions, as well as for the composite salience. One unexpected finding was that the scatter of the locations of peak salience increased over time. Effects of time on the model's BCEA were significant (10⁻⁹ < p < 10⁻³).
The increase in scatter of the locations of peak salience over time (Figure 9) may explain in part the increase in scatter of the saccadic landing positions over time (Figure 7). Some of the increase in the scatter of the saccadic landing positions over time may have been connected to visual factors, perhaps originating from the progression of events in the video clip, rather than exclusively to an evolving cognitive understanding of the depicted events.
Influence of scene cuts on the probability of a saccade occurring
A final analysis examined effects of abrupt scene cuts on saccadic production. Abrupt transitions and abrupt onsets have been associated with brief inhibition in the production of saccades (Henderson & Smith, 2009; Reingold & Stampe, 2002, 2004; van Dam & van Ee, 2005). Abrupt transitions between static scenes or between different videos have also been associated with changes in saccadic patterns, such as increased centering (Dorr et al., 2010) and a greater influence of visual salience on landing positions (Carmi & Itti, 2006). We examined whether reports of brief disruptions to the production of saccades following transient events would also apply to the transients that were part of the flow of events in a professional video.
In order to examine the influence of scene cuts on saccades, the probability of a saccade was computed for the intervals (duration = 660 ms) preceding and following cuts. (Intervals containing blinks or loss of eye-tracker lock were omitted.) Analyses were restricted to situations where the video frame immediately preceding the cut was free of saccades and in which the time between any pair of cuts was at least 1 s. Analyses were also restricted to the first saccade after each cut in order to prevent any confounding influences from saccade-produced transients.
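The sketch below illustrates this computation. It is our illustration, not the authors' code: the 660-ms windows match the text, but the bin width (one video frame at an assumed 30 frames/s) and the helper's interface are assumptions.

```python
import numpy as np

def saccade_probability_around_cuts(saccade_onsets, cut_times,
                                    window=0.660, bin_width=1/30.0):
    """Per-bin probability of a saccade in the 660-ms intervals before
    and after scene cuts. Only the first saccade after each cut is
    counted, as in the analysis described above.

    saccade_onsets : sorted array of saccade onset times (s).
    cut_times      : scene-cut times (s), pre-screened as in the text
                     (no saccade in the frame before the cut; at least
                     1 s between cuts).
    bin_width      : one video frame at an assumed 30 frames/s.
    """
    onsets = np.asarray(saccade_onsets)
    bins = np.arange(-window, window + 1e-9, bin_width)
    counts = np.zeros(len(bins) - 1)
    for cut in cut_times:
        rel = onsets - cut
        pre = rel[(rel >= -window) & (rel < 0)]
        post = rel[(rel >= 0) & (rel < window)][:1]  # first saccade after the cut
        counts += np.histogram(np.concatenate([pre, post]), bins=bins)[0]
    centers = (bins[:-1] + bins[1:]) / 2.0
    return centers, counts / len(cut_times)
```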
Figure 10 shows that, consistent with prior results cited above, scene cuts tended to delay saccades. The probability of a saccade occurring during frames prior to a cut was about 0.09. Saccade rates decreased to about 0.07 following the cut and steadily increased to about 0.12 over the next several hundred milliseconds. The slope of the best fit line describing the probability of a saccade prior to a cut was not significantly different from zero, t(7) = −0.04, p = 0.97, whereas the slope of the best fit line describing the probability of a saccade following a cut was significantly greater than zero, t(8) = 5.38, p = 0.001. This pattern suggests that the scene cut, like other transient events, produced a brief inhibition or delay of saccades.
Discussion
Experiment 2 showed that over time saccades became larger, intersaccadic pause durations became shorter, and saccadic landing positions were scattered across a larger spatial region of the display. Narration did not have prominent effects on these global measures of saccadic performance. At most, there were small increases in scatter in the absence of narration that did not reach significance.
Physical salience, particularly motion and flicker, had some value in predicting landing positions under all narration conditions, but effects were not large. Interestingly, the scatter of locations of maximum salience increased over time along with the scatter of saccadic landing positions. This relationship suggests that visual factors within the video, and not solely changes in the cognitive understanding of the events, could have accounted for some of the increase in the scatter of the saccadic landing positions over time.
General discussion
The present study examined how the presence of narration, either audio or captions, influenced strategies of saccadic planning while viewing videos. The two main questions were: (a) How do people apportion viewing time when they have to decide between viewing the video and reading the caption, and (b) How does the presence of narration (audio or caption) influence the pattern of saccades?
How do people apportion viewing time when they have to decide between viewing the video and reading the caption?
A surprising and consistent trend in Experiment 1 was the tendency to spend a large proportion of the time reading the captions, regardless of whether the redundant audio was present.
On the face of it, this strategy makes little sense. The captions provided no new information that was not already in the audio. Moreover, reading captions took the line of sight away from the video, leading to the potential and completely unnecessary risk of losing information. Preferences to read captions rather than listening to audio narration may be optimal when there are only two sources of information to monitor (text and audio narration) because reading text leads to faster processing (Levy-Schoen, 1981). With three sources of information to monitor, text, audio narration, and video, the strategies would have to be more complicated. Assuming that saccadic behavior has some rational basis (an assumption that is questioned from time to time, e.g., Viviani, 1990), we suggest below three factors that could have contributed to preferences to read the captions when redundant audio was present.
One possible reason for the preference to read the captions when audio was present is an attempt to integrate visual (captions) and audio (narration) information into a single processing stream. People often have the subjective impression when viewing movies with subtitles that the two sources are integrated and that they are reading the text in synchrony with the audio stream. True temporal correspondence of captions and audio at all processing stages seems unlikely, however, given that processing of spoken language is slower than reading (Levy-Schoen, 1981), and reading of the captions did not slow down to keep time with the audio (Table 1). But true temporal correspondence at all processing stages is not necessary. Mechanisms of multisensory integration can create the impression of temporal correspondence even with asynchronous visual and auditory signals (for review, see Recanzone, 2009). The fact that the first saccade made into the caption area when the audio narration was present was delayed relative to the first saccade made without audio is consistent with the idea that saccades were drawn to the captions after auditory processing of the narration reached some criterion level, perhaps as part of an attempt to integrate the visual (text) with the auditory narration. Note we are not proposing a reason why people might want to integrate these signals but rather making the point that integration across modalities is an important characteristic of sensory systems (Ghazanfar & Schroeder, 2006), and eye movements may, in some cases, be called upon to facilitate such integration.
Another possible reason for why redundant captions were read is habit. Prior work has shown that the line of sight is drawn to text present in static pictures even when the text is not particularly useful (Cerf et al., 2009) and even when the text shows nonsense words or words in an unfamiliar language (Wang & Pomplun, 2012). These investigators suggested that people have a habit of reading any available text because our prior experiences have taught us that text often conveys important information (Cerf et al., 2009; Wang & Pomplun, 2012). Habits may be difficult to reverse or alter. Devoting effort or resources to altering a habitual strategy could be more of a hindrance to performance than allowing a proportion of relatively useless saccades (Araujo et al., 2001; Hooge & Erkelens, 1998).
Finally, captions may have been read as part of a strategy of directing the line of sight to the region (caption or video) judged to have the greatest momentary value or importance. This view implies that people are continually judging the value of information coming from all available sources, in this case, the video, the caption, and the audio stream. It may have been, for example, that the risk of losing some information from the video was judged to be minimal relative to the advantages of spending some time reading the informationally-dense and reliable visual text. However, the finding that visits to the captions were shorter in duration when the audio was present (Figure 4) suggests that the strategies used to apportion viewing time did not uniformly give preference to the captions but instead may have been influenced by the processing of the incoming information. For example, the appearance of new or potentially interesting visual details, gleaned from eccentric vision, could have prompted a shift of the line of sight out of the captions and into the video region. Alternatively, subjects may have strategically traded off reading and listening within a given caption episode, perhaps spending more time on the captions as part of selected attempts to confirm, accelerate, or supplement the information obtained from the audio narration. The suggestion that saccadic decisions are based on the judged momentary importance or value of different regions has been made previously in studies of the saccades made to guide motor behavior (Ballard, Hayhoe, & Pelz, 1995; Epelboim et al., 1995; Flanagan & Johansson, 2003; Hayhoe & Ballard, 2005; Johansson et al., 2001).
Any of the above processes, sensory integration, habit, or active selection, operating separately or together, may have contributed to decisions about when to read the captions and when to concentrate on the video.
How does the presence of narration influence the patterns of saccades?
The presence of narration had little influence on the sizes of saccades, intersaccadic pause durations, or the scatter of landing positions made to inspect the non-captioned portions of the video. The only suggestion of an effect of narration was a small increase in the scatter of landing positions in the absence of any narration (Figures 5 and 7), as if the absence of the cues about the depicted events prompted search over a wider region.
Saccadic patterns were dominated by visual factors. For example, except for reading the captions (Experiment 1), the line of sight rarely strayed outside the central 10%–13% of the display. In addition, saccades were attracted to regions with high levels of motion and flicker. Since such regions are generally associated with important or interesting events, including the motion of living things, the attraction to such regions could be due to intrinsic motivation to follow the events depicted in the video (Elazary & Itti, 2008). Saccades were also affected by the passage of time, with saccades over time becoming larger, more frequent, and more scattered over the display (Figure 7). At least some of these temporal changes could have been due to visual factors, namely, the increased scatter of regions of high levels of temporal change over time (Figure 9).
Under some conditions, narration might be expected to override centering tendencies or any attraction to regions of high temporal changes. Instructional videos (Daniel & Tversky, 2012) are a good example. Instructions could prompt a search for selected objects or details and thus would change saccadic patterns and strategies to some extent. Our results show that the presence of narration, by itself, did not induce large changes in the global patterns of saccades in contrast to the differences that have been observed between saccades made to inspect pictures versus videos (Dorr et al., 2010). Visual factors, including those that highlight important display regions, not narration, determined the global characteristics of saccades.
Conclusion
The present results showed that when watching videos with narration, the line of sight was drawn to captions, even in the presence of redundant audio narration. The presence of narration, by itself, did not alter global characteristics of saccades to inspect the video (sizes, pause duration, scatter of landing positions), except for small increases in scatter of saccadic landing positions in the absence of any narration. Saccades were drawn to the display center, to regions containing temporal change (motion or flicker), and to captions.
These results suggest that strategies of saccadic planning while viewing narrated videos are affected by several factors, including the motivation to look at regions with the highest information value (actions or events), an inherent attraction to text (a region of densely-packed important information content), and attempts to facilitate integration of visual text with auditory accompaniment. Some of the factors influencing saccadic decisions may reflect the active continual evaluation of incoming information in order to direct the line of sight to regions where the most interesting or useful details will be found at any given moment. Other factors, however, such as the attraction to text or to regions with high levels of temporal change, may reflect operating principles that are followed, not necessarily because of their judged immediate value but because of built-in preferences or learned habits. Although reliance on any built-in preferences or habits can lead to a suboptimal use of time, the advantage is that the momentary cognitive load attached to decision-making is reduced.
Acknowledgments
Supported by NSF DGE 0549115 and NIH EY015522. We thank Edinah Gnang, Jacob Feldman, Brian Schnitzer, Elizabeth Torres, Richard Knowles, Gaurav Kharkwal, and Karin Stromswold for valuable discussions and for comments on earlier drafts of the manuscript and Alexander Diaz for assistance with Experiment 1.
Commercial relationships: none.
Corresponding author: Nicholas M. Ross.
Email: nickross@rci.rutgers.edu
Address: Department of Psychology, Rutgers University, Piscataway, NJ, USA.
Contributor Information
Nicholas M. Ross, Email: nickross@rci.rutgers.edu.
Eileen Kowler, Email: kowler@rci.rutgers.edu.
References
Andersson, R., Ferreira, F., & Henderson, J. M. (2011). I see what you're saying: The integration of complex speech and scenes during language comprehension. Acta Psychologica, 137(2), 208–216.
Araujo, C., Kowler, E., & Pavel, M. (2001). Eye movements during visual search: The costs of choosing the optimal path. Vision Research, 41, 3613–3625.
Ballard, D., Hayhoe, M., & Pelz, J. (1995). Memory representations in natural tasks. Journal of Cognitive Neuroscience, 7, 66–80.
Berg, D. J., Boehnke, S. E., Marino, R. A., Munoz, D. P., & Itti, L. (2009). Free viewing of dynamic stimuli by humans and monkeys. Journal of Vision, 9(5):19, 1–15, http://www.journalofvision.org/content/9/5/19, doi:10.1167/9.5.19.
Bisson, M. J., van Heuven, W., Conklin, K., & Tunney, R. (2011, August). Processing of foreign language films with subtitles: An eye-tracking study. Poster session presented at the European Conference on Eye Movements, Marseille, France.
Carmi, R., & Itti, L. (2006). Visual causes versus correlates of attentional selection in dynamic scenes. Vision Research, 46, 4333–4345.
Cerf, M., Frady, E. P., & Koch, C. (2009). Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9(12):10, 1–15, http://www.journalofvision.org/content/9/12/10, doi:10.1167/9.12.10.
Collewijn, H., & Kowler, E. (2008). The significance of microsaccades for vision and oculomotor control. Journal of Vision, 8(14):20, 1–21, http://www.journalofvision.org/content/8/14/20, doi:10.1167/8.14.20.
Daniel, M. P., & Tversky, B. (2012). How to put things together. Cognitive Processing, 13(4), 303–319, doi:10.1007/s10339-012-0521-5.
Dorr, M., Martinetz, T., Gegenfurtner, K., & Barth, E. (2010). Variability of eye movements when viewing dynamic natural scenes. Journal of Vision, 10(10):28, 1–17, http://www.journalofvision.org/content/10/10/28, doi:10.1167/10.10.28.
d'Ydewalle, G., Praet, C., Verfaillie, K., & Van Rensbergen, J. (1991). Watching subtitled television: Automatic reading behavior. Communication Research, 18, 650–666.
Elazary, L., & Itti, L. (2008). Interesting objects are visually salient. Journal of Vision, 8(3):3, 1–15, http://www.journalofvision.org/content/8/3/3, doi:10.1167/8.3.3.
Epelboim, J. L., Steinman, R. M., Kowler, E., Edwards, M., Pizlo, Z., Erkelens, C. J., et al. (1995). The function of visual search and memory in sequential looking tasks. Vision Research, 35(23–24), 3401–3422.
Epelboim, J., & Suppes, P. (2001). A model of eye movements and visual working memory during problem solving in geometry. Vision Research, 41(12), 1561–1574.
Findlay, J. M., & Gilchrist, I. D. (2003). Active vision. New York: Oxford University Press.
Flanagan, J. R., & Johansson, R. S. (2003). Action plans used in action observation. Nature, 424(6950), 769–771.
Ghazanfar, A. A., & Schroeder, C. E. (2006). Is neocortex essentially multisensory? Trends in Cognitive Sciences, 10, 278–285.
Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4), 188–194.
Henderson, J. M., & Smith, T. J. (2009). How are eye fixation durations controlled during scene viewing? Further evidence from a scene onset delay paradigm. Visual Cognition, 17, 1055–1082.
- Hooge I., Erkelens C. J. (1998). Adjustment of fixation duration in visual search. Vision Research , 38, 1295–1302. [DOI] [PubMed] [Google Scholar]
- Hyönä J. (2010). The use of eye movements in the study of multimedia learning. Learning and Instruction , 20(2), 172–176. [Google Scholar]
- Itti L. (2005). Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition , 12(6), 1093–1123. [Google Scholar]
- Johansson R. S., Westling G., Bäckström A., Flanagan J. R. (2001). Eye-hand coordination in object manipulation. Journal of Neuroscience , 21(17), 6917–6932. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kibbe M. M., Kowler E. (2011). Visual search for category sets: Tradeoffs between exploration and memory. Journal of Vision , 11(3):1 1–21, http://www.journalofvision.org/content/11/3/14, doi:10.1167/11.3.14 [PubMed] [Article] [DOI] [PMC free article] [PubMed] [Google Scholar]
- Koch C., Ullman S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology , 4, 219–227. [PubMed] [Google Scholar]
- Kowler E. (2011). Eye movements: The past 25 years. Vision Research , 51(13), 1457–1483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kowler E., Pizlo Z., Zhu G.-L., Erkelens C. J., Steinman R. M., Collewijn H. (1992). Coordination of head and eyes during the performance of natural (and unnatural) visual tasks. In Berthoz A., Graz W., Vidal P. P.(Eds.), The head neck sensory motor system (pp 419–426). Oxford: Oxford University Press. [Google Scholar]
- Land M., Hayhoe M. (2001). In what ways do eye movements contribute to everyday activities? Vision Research , 41, 3559–3566. [DOI] [PubMed] [Google Scholar]
- Land M. F., Mennie N., Rusted J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception , 28, 1311–1328. [DOI] [PubMed] [Google Scholar]
- Le Meur O., Le Callet P., Barba D. (2007). Predicting visual fixations on video based on low-level visual features. Vision Research , 47(19), 2483–2498. [DOI] [PubMed] [Google Scholar]
- Levy-Schoen A. (1981). Flexible and/or rigid control of oculomotor scanning behavior. In Fisher D. F., Monty R. A., Senders J. W.(Eds.), Eye movements: Cognition and visual perception (pp 299–316). Hillsdale, NJ: Erlbaum. [Google Scholar]
- Malcolm G. L., Henderson J. M. (2010). Combining top-down processes to guide eye movements during real-world scene search. Journal of Vision , 10(2):1 1–11, http://www.journalofvision.org/content/10/2/4, doi:10.1167/10.2.4 [PubMed] [Article] [DOI] [PubMed] [Google Scholar]
- Malinov I. V., Epelboim J., Herst A. N., Steinman R. M. (2000). Characteristics of saccades and vergence in two kinds of sequential looking tasks. Vision Research , 40(16), 2083–2090. [DOI] [PubMed] [Google Scholar]
- Motter B. C., Belky E. J. (1998). The guidance of eye movements during active visual search. Vision Research , 38, 1805–1815. [DOI] [PubMed] [Google Scholar]
- Najemnik J., Geisler W. S. (2005). Optimal eye movement strategies in visual search. Nature , 434(7031), 387–391. [DOI] [PubMed] [Google Scholar]
- Pantelis J. B., Cholewiak S. A., Ringstad P., Sanik K., Weinstein A., Wu C., Feldman J. (2011). Perception of mental states in autonomous virtual agents. In Carlson L., Hölscher C., Shipley T.(Eds.), Proceedings of the 33rd Annual Conference of the Cognitive Science Society (pp. 1990–1195). Austin, TX: Cognitive Science Society. [Google Scholar]
- Pelz J. B., Canosa R. L. (2001). Oculomotor behavior and perceptual strategies in complex tasks. Vision Research , 41, 3587–3596. [DOI] [PubMed] [Google Scholar]
- Rayner K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 191–201. [DOI] [PubMed] [Google Scholar]
- Recanzone G. (2009). Interactions of auditory and visual stimuli in space and time. Hearing Research , 258, 89–99. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reingold E. M., Stampe D. M. (2002). Saccadic inhibition in voluntary and reflexive saccades. Journal of Cognitive Neuroscience , 14(3), 371–388. [DOI] [PubMed] [Google Scholar]
- Reingold E. M., Stampe D. M. (2004). Saccadic inhibition in reading. Journal of Experimental Psychology: Human Perception and Performance, 30(1), 194–211. [DOI] [PubMed] [Google Scholar]
- Schmidt-Weigand F., Kohnert A., Glowalla U.(2010). Explaining the modality and contiguity effects: New insights from investigating students' viewing behavior. Applied Cognitive Psychology , 24(2), 226–237. [Google Scholar]
- Schnitzer B. S., Kowler E. (2006). Eye movements during multiple readings of the same text. Vision Research , 46, 1611–1632. [DOI] [PubMed] [Google Scholar]
- Spivey M. J., Tanenhaus M. K., Eberhard K. M., Sedivy J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology , 45(4), 447–481. [DOI] [PubMed] [Google Scholar]
- Steinman R. M. (1965). Effect of target size, luminance, and color on monocular fixation. Journal of the Optical Society America , 55, 1158–1165. [Google Scholar]
- Steinman R. M., Menezes W., Herst A. N. (2006). Handling real forms in real life. In Jenkin M. R. M., Harris L. R.(Eds.), Seeing spatial form (pp 187–212). New York: Oxford University Press. [Google Scholar]
- Tatler B. W. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14):4, 1–17, http://www.journalofvision.org/content/7/14/4, doi:10.1167/7.14.4 [PubMed] [Article] [DOI] [PubMed] [Google Scholar]
- Tatler B. W., Hayhoe M. M., Land M. F., Ballard D. H. (2011). Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):5, 1–23, http://www.journalofvision.org/content/11/5/5, doi:10.1167/11.5.5 [PubMed] [Article] [DOI] [PMC free article] [PubMed] [Google Scholar]
- Torralba A., Oliva A., Castelhano M. S., Henderson J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review , 113(4), 766–786. [DOI] [PubMed] [Google Scholar]
- Trueswell J. C., Sekerina I., Hill N. M., Logrip M. L. (1999). The kindergarten-path effect: Studying on-line sentence processing in young children. Cognition , 73, 89–134. [DOI] [PubMed] [Google Scholar]
- Tseng P.-H., Carmi R., Cameron I. G. M., Munoz D. P., Itti L. (2009). Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision, 9(7):4, 1–16, http://www.journalofvision.org/content/9/7/4, doi:10.1167/9.7.4 [PubMed] [Article] [DOI] [PubMed] [Google Scholar]
- Turano K. A., Geruschat D. R., Baker F. H. (2003). Oculomotor strategies for the direction of gaze tested with a real-world activity. Vision Research , 43(3), 333–346. [DOI] [PubMed] [Google Scholar]
- van Dam L. C., van Ee R. (2005). The role of (micro)saccades and blinks in perceptual bi-stability from slant rivalry. Vision Research , 45(18), 2417–2435. [DOI] [PubMed] [Google Scholar]
- Vig E., Dorr M., Barth E. (2009). Efficient visual coding and the predictability of eye movements on natural movies. Spatial Vision , 22(5), 397–408. [DOI] [PubMed] [Google Scholar]
- Vishwanath D., Kowler E. (2004). Saccadic localization in the presence of cues to three-dimensional shape. Journal of Vision, 4(6):4, 445–458, http://www.journalofvision.org/content/4/6/4, doi:10.1167/4.6.4 [PubMed] [Article] [DOI] [PubMed] [Google Scholar]
- Viviani P. (1990). Eye movements in visual search: Cognitive, perceptual and motor control aspects. In Kowler E.(Ed.), Eye movements and their role in visual and cognitive processes (pp 353–393). Amsterdam: Elsevier. [PubMed] [Google Scholar]
- Wang H.-C., Pomplun M. (2012). The attraction of visual attention to texts in real-world scenes. Journal of Vision, 12(6):26, 1–17, http://www.journalofvision.org/content/12/6/26, doi:10.1167/12.6.26 [PubMed] [Article] [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wilder J. D., Kowler E., Schnitzer B. S., Gersch T. M., Dosher B. A. (2009). Attention during active visual tasks: Counting, pointing and simply looking. Vision Research , 49, 1017–1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu C.-C., Kwon O.-S., Kowler E. (2010). Fitts's Law and speed/accuracy trade-offs during sequences of saccades: Implications for strategies of saccadic planning. Vision Research , 50(21), 2142–2157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yarbus A. L. (1967). Eye movements and vision. New York: Plenum. [Google Scholar]
- Zacks J. M., Tversky B. (2001). Event structure in perception and conception. Psychological Bulletin , 127, 3–21. [DOI] [PubMed] [Google Scholar]