When scenes speak louder than words: Verbal encoding does not mediate the relationship between scene meaning and visual attention

Gwendolyn Rehrig; Taylor R Hayes; John M Henderson; Fernanda Ferreira

doi:10.3758/s13421-020-01050-4

. Author manuscript; available in PMC: 2022 Feb 14.

Published in final edited form as: Mem Cognit. 2020 Oct;48(7):1181–1195. doi: 10.3758/s13421-020-01050-4

When scenes speak louder than words: Verbal encoding does not mediate the relationship between scene meaning and visual attention

Gwendolyn Rehrig ¹, Taylor R Hayes ², John M Henderson ^1,², Fernanda Ferreira ¹

PMCID: PMC8843103 NIHMSID: NIHMS1773025 PMID: 32430889

Abstract

The complexity of the visual world requires that we constrain visual attention and prioritize some regions of the scene for attention over others. The current study investigated whether verbal encoding processes influence how attention is allocated in scenes. Specifically, we asked whether the advantage of scene meaning over image salience in attentional guidance is modulated by verbal encoding, given that we often use language to process information. In two experiments, 60 subjects studied scenes (N₁ = 30, N₂ = 60) for 12 seconds each in preparation for a scene recognition task. Half of the time, subjects engaged in a secondary articulatory suppression task concurrent with scene viewing. Meaning and saliency maps were quantified for each of the experimental scenes. In both experiments, we found that meaning explained more of the variance in visual attention than image salience did, particularly when we controlled for the overlap between meaning and salience, with and without the suppression task. Based on these results, verbal encoding processes do not appear to modulate the relationship between scene meaning and visual attention. Our findings suggest that semantic information in the scene steers the attentional ship, consistent with cognitive guidance theory.

Keywords: scene processing, visual attention, meaning, salience, language

Introduction

Because the visual world is information-rich, observers prioritize certain scene regions for attention over others to process scenes efficiently. While bottom-up information from the stimulus is clearly relevant, visual attention does not operate in a vacuum, but rather functions in concert with other cognitive processes to solve the problem at hand. What influence, if any, do extra-visual cognitive processes exert on visual attention?

Two opposing theoretical accounts of visual attention are relevant to the current study: saliency-based theories and cognitive guidance theory. According to saliency-based theories (Itti & Koch, 2001; Wolfe & Horowitz, 2017), salient scene regions—those that contrast with their surroundings based on low-level image features (e.g., luminance, color, orientation)—pull visual attention across a scene, from the most salient location to the least salient location in descending order (Itti & Koch, 2000; Parkhurst, Law, & Niebur, 2002). Saliency-based explanations cannot explain that physical salience does not determine which scene regions are fixated (Tatler, Baddeley, & Gilchrist, 2005) and that top-down task demands influence attention more than physical salience does (Einhäuser, Rutishauer, & Koch, 2008). Cognitive guidance theory can account for these findings: the cognitive system pushes visual attention to scene regions, incorporating stored knowledge about scenes to prioritize regions that are most relevant to the viewer’s goals (Henderson, 2007). Under this framework, cognitive systems—for example, long- and short-term memory, executive planning, etc.—operate together to guide visual attention. Coordination of cognitive systems helps to explain behavioral findings where saliency-based attentional theories fall short. For example, viewers look preferentially at meaningful regions of a scene (e.g., those containing task-relevant objects), even when they are not visually salient (e.g., under shadow), despite the presence of a salient distractor (Henderson, Malcolm, & Schandl, 2009).

Recent work has investigated attentional guidance by representing the spatial distribution of image salience and scene meaning comparably (see Henderson, Hayes, Peacock, & Rehrig, 2019 for review). Henderson and Hayes (2017) introduced meaning maps to quantify the distribution of meaning over a scene. Raters on mTurk saw small scene patches presented at two different scales and judged how meaningful or recognizable each patch was. Meaning maps were constructed by averaging the ratings across patch scales and smoothing the values. Image salience was quantified using Graph-Based Visual Salience (GBVS; Harel et al., 2006). The feature maps were correlated with attention maps that were empirically derived from viewer fixations in scene memorization and aesthetic judgement tasks. Meaning explained greater variance in attention maps than salience did, both for linear and semipartial correlations, suggesting that meaning plays a greater role in guiding visual attention than image salience does. This replicated when attention maps constructed from the same dataset were weighted on fixation duration (Henderson & Hayes, 2018), when viewers described scenes aloud (Henderson, Hayes, Rehrig, & Ferreira, 2018; Ferreira & Rehrig, 2019), during free-viewing of scenes (Peacock, Hayes, & Henderson, 2019), when meaning was not task-relevant (Hayes & Henderson, 2019a), and even when image salience was task-relevant (Peacock, Hayes, & Henderson, 2019). In sum, scene meaning explained variation in attention maps better than image salience did across experiments and tasks, supporting the cognitive guidance theory of attentional guidance.

One question that remains unexplored is whether other cognitive processes indirectly influence cognitive guidance of attention. For example, it is possible that verbal encoding may modulate the relationship between scene meaning and visual attention: Perhaps the use of language, whether vocalized or not, pushes attention to more meaningful regions. While only two of the past experiments were explicitly linguistic in nature (scene description; Henderson et al., 2018; Ferreira & Rehrig, 2019), the remaining tasks did not control for verbal encoding processes.

There is evidence that observers incidentally name objects silently during object viewing (Meyer, Belke, Telling, & Humphreys 2007; Meyer & Damian, 2007). Meyer et al. (2007) asked subjects to report whether a target object was present or not in an array of objects, which sometimes included competitors that were semantically related to the target or were semantically unrelated, but had a homophonous name (e.g., bat the tool vs. bat the animal). The presence of competitors interfered with search, which suggests information about the objects (name, semantic information) became active during viewing, even though that information was not task-relevant. In a picture-picture interference study, Meyer and Damian (2007) presented target objects that were paired with distractor objects with phonologically similar names, and instructed subjects to name the target objects. Naming latency was shorter when distractor names were phonologically similar to the name of the target object, suggesting that activation of the distractor object’s name occurred and facilitated retrieval of the target object’s name. Together, the two studies demonstrate a tendency for viewers to incidentally name objects they have seen.

Cross-linguistic studies on the topic of linguistic relativity employ verbal interference paradigms to demonstrate that performance on perceptual tasks can be mediated by language processes. For example, linguistic color categories vary across languages even though the visual spectrum of colors is the same across language communities (Majid et al., 2018). Winawer and colleagues (2007) showed that observers discriminated between colors faster when the colors belonged to different linguistic color categories, but the advantage disappeared with verbal interference. These findings indicate that language processes can mediate performance on perceptual tasks that are ostensibly not linguistic in nature, and a secondary verbal task that prevents task-incidental language use can disrupt the mediating influence of language. Similar influences of language on ostensibly non-linguistic processes, and the disruption thereof by verbal interference tasks, have been found for spatial memory (Hermer-Vazquez, Spelke, & Katsnelson, 1999), event perception (Trueswell & Papafragou, 2010), categorization (Lupyan, 2009), and numerical representations (Frank, Fedorenko, Lai, Saxe, & Gibson, 2012), to name a few (see Lupyan, 2012; Perry & Lupyan, 2013; Ünal & Papafragou, 2016 for discussion).

The above literature suggests we use internal language during visual processing, and in some cases those language processes may mediate perceptual processes. Could the relationship between meaning and visual attention observed previously (Henderson & Hayes, 2017; 2018, Henderson et al., 2018; Peacock et al., 2018) have been modulated by verbal encoding processes? To examine this possibility, we used an articulatory suppression manipulation to determine whether verbal encoding mediates attentional guidance in scenes.

In the current study, observers studied 30 scenes for 12 seconds each for a later recognition memory test. The scenes used in the study phase were mapped for meaning and salience. We conducted two experiments in which subjects performed a secondary articulatory suppression task half of the time in addition to memorizing scenes. In Experiment 1, the suppression manipulation was between-subjects, and the articulatory suppression task was to repeat a three digit sequence aloud during the scene viewing period. We chose this suppression task because we suspected subjects might adapt to and subvert simpler verbal interference such as a syllable repetition (e.g., Martin, Branzi, & Bar, 2017), and because digit sequence repetition imposes less cognitive load than n-back tasks (Allen, Baddeley, & Hitch, 2017). In Experiment 2, we implemented a within-subject design using two experimental blocks: one with the sole task of memorizing scenes, the other with an additional articulatory suppression task. Because numerical stimuli may be processed differently than other verbal stimuli (Maloney et al., 2019; van Dijck & Fias, 2011), we instead asked subjects to repeat the names of a sequence of three shapes aloud during the suppression condition. In the recognition phase of both experiments, subjects viewed 60 scenes—30 that were present in the study phase, 30 foils—and indicated whether or not they recognized the scene from the study phase.

We tested two competing hypotheses about the relationship between verbal encoding and attentional guidance in scenes. If verbal encoding indeed mediated the relationship between meaning and attentional guidance in our previous work, we would expect observers to direct attention to meaningful scene regions only when internal verbalization strategies are available to them. Specifically, meaning should explain greater variance in attention maps than saliency in the control condition, and meaning should explain less or equal variance in attention as salience when subjects suppressed internal language use. Conversely, if verbal encoding did not mediate attentional guidance in scenes, the availability of verbalization strategies should not affect attention, and so we would expect to find an advantage of meaning over salience whether or not subjects engaged in a suppression task.

Experiment 1: Methods

Subjects.

Sixty-eight undergraduates enrolled at the University of California, Davis participated for course credit. All subjects were native speakers of English, at least 18 years old, and had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and provided informed consent as approved by the University of California, Davis Institutional Review Board. Six subjects were excluded from analysis because their eyes could not be accurately tracked, 1 due to an equipment failure, and 1 due to experimenter error; data from the remaining 60 subjects were analyzed (30 subjects/condition).

Stimuli.

Scenes were 30 digitized (1024×768) and luminance-matched photographs of real-world scenes used in a previous experiment (Henderson et al., 2018). Of these, 10 depicted outdoor environments (5 street views), and 20 depicted indoor environments (3 kitchens, 5 living rooms, 2 desk areas, and 10 different room types). People were not present in any scenes.

Another set of 30 digitized images of comparable scenes (similar scene categories and time period, no people depicted) were selected from a Google image search and served as memory foils. Because we did not evaluate attentional guidance for the foils, meaning and salience were not quantified for these scenes, and the images were not luminance-matched.

Digit Sequences.

Digit sequences were selected randomly without replacement from all three digit numbers ranging from 100 to 999 (900 numbers total), then segmented into 30 groups of 30 sequences each such that each digit sequence in the articulatory suppression condition was unique.

Apparatus.

Eye movements were recorded with an SR Research EyeLink 1000+ tower mount eyetracker (spatial resolution 0.01) at a 1000 Hz sampling rate. Subjects sat 83 cm away from a 24.5” monitor such that scenes subtended approximately 26° × 19° visual angle at a resolution of 1024 × 768 pixels, presented in 4:3 aspect ratio. Head movements were minimized using a chin and forehead rest integrated with the eyetracker’s tower mount. Subjects were instructed to lean against the forehead rest to reduce head movement while allowing them to speak during the suppression task. Although viewing was binocular, eye movements were recorded from the right eye. The experiment was controlled using SR Research Experiment Builder software. Data were collected on two systems that were identical except that one subject computer operated using Windows 10, and the other used Windows 7.

Scene Memorization Procedure.

Subjects were told they would see a series of scenes to study for a later memory test. Subjects in the articulatory suppression condition were told each trial would begin with a sequence of 3 digits, and were instructed to repeat the sequence of digits aloud during the scene viewing period. After the instructions, a calibration procedure was conducted to map eye position to screen coordinates. Successful calibration required an average error of less than 0.49° and a maximum error below 0.99°.

Following successful calibration, there were 3 practice trials to familiarize subjects with the task prior to the experimental trials. In the suppression condition, during these practice trials participants studied three-digit sequences prior to viewing the scene. Practice digit sequences were 3 randomly sampled sequences from the range 1 to 99, in 3-digit format (e.g., “0 3 6” for 36). Subjects pressed any button on a button box to advance throughout the task.

Each subject received a unique pseudo-random trial order that prevented two scenes of the same type (e.g., kitchen) from occurring consecutively. A trial proceeded as follows. First, a five-point fixation array was displayed to check calibration (Figure 1a). The subject fixated the center cross and the experimenter pressed a key to begin the trial if the fixation was stable, or reran the calibration procedure if not. Before the scene, subjects in the articulatory suppression condition saw the instruction “Study the sequence of digits shown below. Your task is to repeat these digits over and over out loud for 12 seconds while viewing an image of the scene” along with a sequence of 3 digits separated by spaces (e.g., “8 0 9”), and pressed a button to proceed (Figure 1b). The scene was shown for 12 seconds, during which time eye-movements were recorded (Figure 1c). After 12 seconds elapsed, subjects pressed a button to proceed to the next trial (Figure 1d). The trial procedure repeated until all 30 trials were complete.

Figure 1. — Scene memorization trial procedure. (a) A five point fixation array was used to assess calibration quality. (b) In the articulatory suppression condition only, the digit repetition task instructions were reiterated to subjects along with a three digit sequence. (c) A real-world scene was shown for 12 seconds. (d) Subjects were instructed to press a button to initiate the next trial, at which point the trial procedure repeated (from a).

Memory Test Procedure.

A recognition memory test followed the experimental trials, in which subjects were shown the 30 experimental scenes and 30 foil scenes they had not seen previously. Presentation order was randomized without replacement. Subjects were informed that they would see one scene at a time and instructed to use the button box to indicate as quickly and accurately as possible whether they had seen the scene earlier in the experiment. After the instruction screen, subjects pressed any button to begin the memory test. In a recognition trial, subjects saw a scene that was either a scene from the study phase or a foil image. The scene persisted until a “Yes” or “No” button press occurred, after which the next trial began. Response time and accuracy were recorded. This procedure repeated 60 times, after which the experiment terminated.

Fixations and saccades were parsed with EyeLink’s standard algorithm using velocity and acceleration thresholds (30°/s and 9500°/s²; SR Research, 2017). Eye movement data were imported offline into Matlab using the Visual EDF2ASC tool packaged with SR Research DataViewer software. The first fixation was excluded from analysis, as were saccade amplitude (>20°) and fixation duration outliers (<50ms, >1500ms).

Attention maps.

Attention maps were generated by constructing a matrix of fixation counts with the same x,y dimensions as the scene, and counting the total fixations corresponding to each coordinate in the image. The fixation count matrix was smoothed with a Gaussian low pass filter with circular boundary conditions and a frequency cutoff of −6dB. For the scene-level analysis, all fixations recorded during the viewing period were counted. For the fixation analysis, separate attention maps were constructed for each ordinal fixation.

Meaning maps.

We generated meaning maps using the context-free rating method introduced in Henderson & Hayes (2017). Each 1024 × 768 pixel scene was decomposed into a series of partially overlapping circular patches at fine and coarse spatial scales (Figure 2b&c). The decomposition resulted in 12,000 unique fine-scale patches (87 pixel diameter) and 4,320 unique coarse-scale patches (205 pixel diameter), totaling 16,320 patches.

Figure 2. — (a-d). Meaning map generation schematic. (a) Real-world scene. (b-c) Fine scale (b) and coarse scale (c) spatial grids used to deconstruct the scene into patches. (d) Examples of scene patches that were rated as either low or high in meaning. (e-h) Examples of saliency (e), meaning (f), and attention (g-h) maps for the real-world scene shown in (a). Attention maps were empirically derived from viewer fixations in the control condition (g) and the articulatory suppression condition (h). For the purpose of visualization, all maps were normalized to the same attention map (g).

Raters were 165 subjects recruited from Amazon Mechanical Turk. All subjects were located in the United States, had a HIT approval rating of 99% or more, and participated once. Subjects provided informed consent and were paid $0.50.

All but one subject rated 300 random patches extracted from the 30 scenes. Subjects were instructed to rate how informative or recognizable each patch was using a 6-point Likert scale (‘very low’, ‘low’, ‘somewhat low’, ‘somewhat high’, ‘high’, ‘very high’). Prior to rating patches, subjects were given two examples each of low-meaning and high-meaning patches in the instructions to ensure they understood the task. Patches were presented in random order. Each patch was rated 3 times by 3 independent raters totaling 48,960 ratings per scene. Because there was high overlap across patches, each fine patch contained data from 27 independent raters and each coarse patch from 63 independent raters (see Figure 2d for patch examples).

Meaning maps were generated from the ratings for each scene by averaging, smoothing, and combining the fine and coarse scale maps from the corresponding patch ratings. The ratings for each pixel at each scale in each scene were averaged, producing an average fine and coarse rating map for each scene. The fine and coarse maps were then averaged [(fine map + coarse map)/2]. Because subjects in the eyetracking task showed a consistent center bias¹ in their fixations, we applied center bias to the maps using a multiplicative down-weighting of scores in the map periphery (Hayes & Henderson, 2019b). The final map was blurred using a Gaussian filter via the Matlab function ‘imgaussfilt’ with a sigma of 10 (see Figure 2f for an example meaning map).

Saliency maps.

Image-based saliency maps were constructed using the Graph-Based Visual Saliency (GBVS) toolbox in Matlab with default parameters (Harel et al., 2006). We used GBVS because it is a state-of-the-art model that uses only image-computable salience. While there are newer saliency models that predict attention better (e.g., DeepGaze II: Kümmerer, Wallis, & Bethge, 2016; ICF: Kümmerer, Wallis, Gatys, & Bethge, 2017), these models incorporate high-level image features through training on viewer fixations (DeepGaze II and ICF) and object features (DeepGaze II), which may index semantic information. We used GBVS to avoid incorporating semantic information in image-based saliency maps, which could confound the comparison with meaning (see Henderson et al., 2019 for discussion).

Map normalization.

Prior to analysis, feature maps were normalized to a common scale using image histogram matching via the Matlab function ‘imhistmatch’ in the Image Processing Toolbox. The corresponding attention map for each scene served as the reference image (see Henderson & Hayes, 2017). Map normalization was carried out within task conditions: for the map-based analysis of the control condition, feature maps were normalized to the attention map derived from fixations in the control condition only, and likewise for the suppression condition. Results did not differ between the current analysis and a second analysis using feature maps normalized to the same attention map (from fixations in the control condition).

We computed correlations (R²) across the maps of 30 scenes to determine the degree to which saliency and meaning overlap with one another. We excluded the peripheral 33% of the feature maps when determining overlap between the maps to control for the peripheral downweighting applied to both, which otherwise would inflate the correlation between them. On average, meaning and saliency were correlated (R² = 0.48), and this relationship differed from zero (meaning and saliency: t(29) = 17.24, p < 0.001, 95% CI = [.43 .54]).

Experiment 1: Results

To determine what role verbal encoding might play in extracting meaning from scenes, we asked whether the advantage of meaning over salience in explaining variance in attention would hold in each condition. To answer this question, we conducted two-tailed paired t-tests within task conditions.

Sensitivity analysis.

To determine whether we obtained adequate effect sizes for the primary comparison of interest, we conducted a sensitivity analysis using G*Power 3.1 (Faul, Erdfelder, Lang, & Buchner, 2007; Faul, Erdfelder, Buchner, & Lang, 2009). We computed the effect size index d_z—a standardized difference score (Cohen, 1988)—and the critical t statistic for a two-tailed paired t-test with 95% power and a sample size of 30 scenes. The analysis revealed a critical t value of 2.05 and a minimum d_z of 0.68.

Attention: Scene-level analysis.

We correlated meaning and saliency maps with attention maps to determine the degree to which meaning or salience guided visual attention (Figure 3). Squared linear and semipartial correlations (R²) were computed within each condition for each of the 30 scenes. The relationship between meaning and salience, respectively, and visual attention was analyzed using t-tests. Cohen’s d was computed to estimate effect size, interpreted as small (d = 0.2 – 0.49), medium (d = 0.5 – 0.79), or large (d = 0.8+) following Cohen (1988).

Figure 3. — a) Box plots showing linear correlations (left) and semipartial correlations (right) between feature maps (meaning, saliency) and attention maps. The scatter box plots show the corresponding grand mean (black horizontal line), 95% confidence intervals (colored box), and 1 standard deviation (black vertical line) for meaning (red box) and salience (blue box) across 30 scenes. b) Line graphs showing linear correlations (top) and semipartial correlations (bottom) between feature maps and attention maps for each fixation (1–38) when subjects engaged in a memorization task only (solid lines) or additionally an articulatory suppression task (dashed lines). Error bars indicate 95% confidence intervals.

Linear correlations.

In the control condition, when subjects were only instructed to memorize scenes, meaning accounted for 34% of the average variance in attention (M = 0.34, SD = 0.14) and salience accounted for 21% (M = 0.21, SD = 0.13). The advantage of meaning over salience was significant (t(29) = 6.07, p < .001, 95% CI = [0.09 0.17], d = 0.97, d 95% CI = [0.58 1.36], d_z = 1.10). In the articulatory suppression condition, when subjects additionally had to repeat a sequence of digits aloud, meaning accounted for 37% of the average variance in attention (M = 0.37, SD = 0.17) whereas salience accounted for 23% (M = 0.23, SD = 0.12). The advantage of meaning over salience was also significant when the task prevented verbal encoding (t(29) = 6.04, p < .001, 95% CI = [0.09 0.19], d = 0.88, d 95% CI = [0.53 1.22], d_z = 1.12).

Semipartial correlations.

Because meaning and salience are correlated, we partialed out the shared variance explained by both meaning and salience. In the control condition, when the shared variance explained by salience was accounted for, meaning explained 15% of the average variance in attention (M = 0.15, SD = 0.10), while salience explained only 2% of the average variance once the variance explained by meaning was accounted for (M = 0.02, SD = 0.02). The advantage of meaning over salience was significant (t(29) = 6.07, p < .001, 95% CI = [0.09 0.18], d = 1.98, d 95% CI = [0.86 3.10], d_z = 1.15). In the articulatory suppression condition, meaning explained 16% of the average unique variance after shared variance was partialed out (M = 0.16, SD = 0.11), while salience explained only 2% of the average variance after shared variance with meaning was accounted for (M = 0.02, SD = 0.03), and the advantage was significant (t(29) = 6.05, p < .001, 95% CI = [0.09 0.19], d = 1.95, d 95% CI = [0.85 3.04], d_z = 1.09).

To summarize, we found a large advantage of meaning over salience in explaining variance in attention in both conditions, for both linear and semipartial correlations. For all comparisons, the value of the t statistic and d_z exceeded the thresholds obtained in the sensitivity analysis.

Attention: Fixation analysis.

Following our previous work (Henderson & Hayes, 2017; Henderson et al., 2018), we examined early fixations to determine whether salience influences early scene viewing (Parkhurst et al., 2002; but see Tatler et al., 2005). We correlated each feature map (meaning, salience) with attention maps at each fixation (Figure 3b). Squared linear and semipartial correlations (R²) were computed for each fixation, and the relationship between meaning and salience with attention, respectively, was assessed for the first three fixations using paired t-tests.

Linear correlations.

In the control condition, meaning accounted for 37% of the average variance in attention during the first fixation, and 14% and 13% during the second and third fixations, respectively (1: M = 0.37, SD = 0.19; 2: M = .14, SD = .11; 3: M = .13, SD = .10). Salience accounted for 9% (1: M = .09, SD = .11), 8% (2: M = 0.08, SD = 0.09), and 7% of the average variance (3: M = 0.07, SD = 0.09) during the first, second, and third fixations, respectively. The advantage of meaning was significant for all three fixations (1: t(29) = 8.59, p < .001, 95% CI = [0.21 0.34], d = 1.70, d 95% CI = [1.08 2.31]; 2: t(29) = 3.40, p = .002, 95% CI = [0.03 0.11], d = 0.66, d 95% CI = [0.23 1.08]; 3: t(29) = 4.21, p < .001, 95% CI = [0.03 0.08], d = 0.60, d 95% CI = [0.29 0.90]). For subjects in the suppression condition, meaning accounted for 42% of the average variance during the first fixation (M = 0.42, SD = 0.18), 21% during the second (M = 0.21, SD = 0.15), and 17% during the third fixation (M = 0.17, SD = 0.13). Salience accounted for 10% of the average variance during the first fixation (M = 0.10, SD = 0.10) and 9% during the second and third fixations (2: M = 0.09, SD = 0.09; 3: M = 0.09, SD = 0.09). The advantage of meaning over salience was significant for all three fixations (1: t(29) = 10.27, p < .001, 95% CI = [0.26 0.38], d = 2.12, d 95% CI = [1.39 2.92]; 2: t(29) = 5.49, p < .001, 95% CI = [0.08 0.17], d = 0.90, d 95% CI = [0.51 1.29]; 3: t(29) = 4.49, p < .001, 95% CI = [0.04 0.12], d = 0.71, d 95% CI = [0.35 1.06]).

Semipartial correlations.

To account for the correlation between meaning and salience, we partialed out shared variance explained by both meaning and salience, then repeated the fixation analysis on the semipartial correlations. In the control condition, after the shared variance explained by both meaning and salience was partialed out, meaning accounted for 30% of the average variance at the first fixation (M = 0.30, SD = 0.16), 10% of the variance during the second fixation (M = 0.10, SD = 0.09), and 8% during the third fixation (M = 0.08, SD = 0.06). After shared variance with meaning was partialed out, salience accounted for only 2% of the average unique variance at the first and third fixations (1: M = 0.02, SD = 0.03; 3: M = 0.02, SD = 0.03) and 3% at the second fixation (M = 0.03, SD = 0.04). The advantage of meaning was significant for all three fixations (1: t(29) = 8.58, p < .001, 95% CI = [0.21 0.34], d = 2.66, d 95% CI = [1.34 3.97]; 2: t(29) = 3.40, p < .001, 95% CI = [0.03 0.11], d = 0.99, d 95% CI = [0.28 1.70]; 3: t(29) = 4.21, p < .001, 95% CI = [0.03 0.08], d = 1.10, d 95% CI = [0.44 1.76]). In the articulatory suppression condition, after the shared variance with salience was partialled out, meaning accounted for 34% of the average variance during the first fixation (M = 0.34, SD = 0.15), 14% at the second fixation (M = 0.14, SD = 0.12), and 10% during the third fixation (M = 0.10, SD = 0.09). After the shared variance with meaning was partialled out, on average salience accounted for 2% of the variance at all three fixations (1: M = 0.02, SD = 0.03; 2: M = 0.02, SD = 0.02; 3: M = 0.02, SD = 0.03). The advantage of meaning was significant for all three fixations (1: t(29) = 10.27, p < .001, 95% CI = [0.26 0.38], d = 3.25, d 95% CI = [1.67 4.85]; 2: t(29) = 5.49, p < .001, 95% CI = [0.08 0.17], d = 1.46, d 95% CI = [0.69 2.22]; 3: t(29) = 4.49, p < .001, 95% CI = [0.04 0.12], d = 1.25, d 95% CI = [0.51 1.99]).

In sum, early fixations revealed a consistent advantage of meaning over salience, counter to the claim that salience influences attention during early scene viewing (Parkhurst et al., 2002). The advantage was present for the first three fixations in both conditions, when we analyzed both linear and semipartial correlations, and all effect sizes were medium or large.

Memory: Recognition.

To confirm that subjects took the memorization task seriously, we totaled the number of hits, correct rejections, misses, and false alarms on the recognition task for each subject, each of which ranged from 0 to 30 (Figure 4a). Recognition performance was high in both conditions. On average, subjects in the control condition correctly recognized scenes shown in the memorization task 95% of the time (M_hits = 0.95, SD_hits = 0.06), while subjects who engaged in the suppression task during memorization correctly recognized scenes 90% of the time (M_hits = 0.90, SD_hits = 0.09). Subjects in the control conditions falsely reported that a foil scene had been present in the memorization scene set 3% of the time on average (M_{false alarms} = 0.03, SD_{false alarms} = 0.03), and those in the suppression condition false alarmed an average of 4% of the time (M_{false alarms} = 0.04, SD_{false alarms} = 0.07). Overall, subjects in the control condition had higher recognition accuracy, though the difference in performance was small.

Figure 4. — Violin plot showing the total number of recognition task responses for each subject (individual points), broken into hits, correct rejections, misses, and false alarms (left). Violin plot showing d’ values for each subject (right).

We then computed d’ with log-linear correction to handle extreme values (ceiling or floor performance) using the dprime function from the psycho package in R, resulting in 30 data points per condition (1 data point/subject; Figure 4b). On average, d’ scores were higher in the control condition (M = 3.30, SD = 0.55) than the articulatory suppression condition (M = 2.99, SD = 0.74). The difference in performance was not significant, and the effect size was small (t(58) = 1.83, p = 0.07, 95% CI = [−0.03 0.64], d = 0.47, d 95% CI = [−0.05 1.00]).

In sum, recognition was numerically better for subjects who were only instructed to study the scenes as opposed to those who additionally completed an articulatory suppression task, but the difference was not significant.

Experiment 1: Discussion

The results of Experiment 1 suggest that incidental verbalization does not modulate the relationship between scene meaning and visual attention during scene viewing. However, the experiment had several limitations. First, we implemented the suppression manipulation between-subjects rather than within-subjects out of concern that subjects might infer the hypothesis in a within-subject paradigm and skew the results. Second, because numerical cognition is unique (Maloney et al., 2019; van Dijck & Fias, 2011), it is possible that another type of verbal interference would affect the relationship between meaning and attention. Third, we tested relatively few scenes (N=30).

We conducted a second experiment to address these limitations and replicate the advantage of meaning over salience despite verbal interference. In Experiment 2, the verbal interference consisted of sequences of common shape names (e.g., square, heart, circle) rather than digits, and the interference paradigm was implemented within-subject using a blocked design. We added 30 scenes to the Experiment 1 stimulus set, yielding 60 experimental items total.

We tested the same two competing hypotheses in Experiments 1 and 2: If verbal encoding mediates the relationship between meaning and attentional guidance, and the use of numerical interference in Experiment 1 was insufficient to disrupt that mediation, then the relationship between meaning and attention should be weaker when incidental verbalization is not available, in which case meaning and salience may explain comparable variance in attention. If verbal encoding does not mediate attentional guidance in scenes and our Experiment 1 results cannot be explained by numerical interference specifically, then we expect meaning to explain greater variance in attention both when shape names are used as interference and when there is no verbal interference.