Abstract
Eye tracking provides insights into social processing deficits in autism spectrum disorder (ASD), especially in conjunction with dynamic, naturalistic free-viewing stimuli. However, the question remains whether gaze characteristics, such as preference for specific facial features, can be considered a stable individual trait, particularly in those with ASD. If so, how much data are needed for consistent estimations? To address these questions, we assessed the stability and robustness of gaze preference for facial features as incremental amounts of movie data were introduced for analysis. We trained an artificial neural network to create an object-based segmentation of naturalistic movie clips (14 s each, 7410 frames total). Thirty-three high-functioning individuals with ASD and 36 age- and IQ-equated typically developing individuals (age range: 12–30 years) viewed 22 Hollywood movie clips, each depicting a social interaction. As we evaluated combinations of one, three, five, eight, and 11 movie clips, gaze dwell times on core facial features became increasingly stable at within-subject, within-group, and between-group levels. Using a number of movie clips deemed sufficient by our analysis, we found that individuals with ASD displayed significantly less face-centered gaze (centralized on the nose; p < 0.001) but did not significantly differ from typically developing participants in eye or mouth looking times. Our findings validate gaze preference for specific facial features as a stable individual trait and highlight the possibility of misinterpretation with insufficient data. Additionally, we propose the use of a machine learning approach to stimuli segmentation to quickly and flexibly prepare dynamic stimuli for analysis.
Keywords: autism spectrum disorder, machine learning, social behavior
Lay Summary:
Using a data-driven approach to segmenting movie stimuli, we examined varying amounts of data to assess the stability of social gaze in individuals with autism spectrum disorder (ASD). We found a reduction in social fixations in participants with ASD, driven by decreased attention to the center of the face. Our findings further support the validity of gaze preference for face features as a stable individual trait when sufficient data are used.
INTRODUCTION
Eye tracking plays an integral role in understanding social processing deficits in neurodevelopmental disorders, including autism spectrum disorder (ASD). Data commonly indicate that individuals with ASD do not process social interactions or facial features the same way as their typically developing (TD) peers. However, the specifics of gaze differences in ASD are inconsistent. For example, some ASD studies report reduced eye region looking time as compared to TD individuals, while others report no differences (Freeth et al., 2010; Gillespie-Smith et al., 2014; Snow et al., 2011; Yi et al., 2013). Some studies report increased mouth looking time in ASD, and even suggest increased mouth fixations may compensate for reduced attention to eye regions (Jones et al., 2008). Still other studies report decreased mouth looking time in ASD, or no group differences, regardless of whether stimuli were dynamic or static, or depicted a single person or multiple people (Johnels et al., 2014; Speer et al., 2007; van der Geest et al., 2002).
Due to a wide heterogeneity among individuals with ASD, it is possible that eye tracking differences are a product of different samplings from this population. Previous literature points to gaze idiosyncrasies within ASD samples; studies find significantly more variable gaze preferences in ASD as compared to typical development (Avni et al., 2020; Ramot et al., 2020). Group- and subject-level inconsistencies broach several questions regarding the meaningfulness of studying gaze in ASD. Do individuals with ASD view stimuli consistently or do viewing preferences vary from scene to scene? Do gaze preferences for specific features of the visual scene converge as we analyze more data, or do they remain variable? Overall, can researchers consider gaze characteristics (e.g., preference for specific features) a stable individual trait in those with ASD?
Prior research has addressed the question of viewing preference stability in TD samples. Findings demonstrate that while gaze varies greatly between TD individuals, it is stable within an individual across tasks and even over long periods of time (Arizpe et al., 2017; Castelhano & Henderson, 2008; Mehoudar et al., 2014; Peterson & Eckstein, 2013; Poynter et al., 2013). While ASD studies show greater group- and subject-level gaze variability (based on the degree to which subjects deviated from other group members) compared to TD individuals, the nature of this variance has yet to be explored (Avni et al., 2020; Ramot et al., 2020). In particular, there are two main alternatives to consider. First, gaze preferences in ASD are simply not consistent within an individual. In this case, averaging across more data will not lead to convergence, as there is no stable preference. Alternately, it is possible that individual gaze preferences in ASD are equally stable, but the decreased consistency is a result of greater noise (i.e., increased variance around a stable mean). In this case, we would expect to see convergence around this mean as more data are considered. Distinguishing between these two alternatives is important for understanding the nature of the inconsistencies in ASD, and the reliability of gaze preference in ASD. The question of gaze preference stability for specific features of a scene as a product of data quantity has not previously been addressed in either TD or ASD populations.
In the present study, we seek to examine the stability of gaze preferences in TD participants and participants with ASD, and whether these preferences generalize to different viewing contexts. We investigate whether viewing preferences for specific facial features converge in both groups, and if so, how much data are necessary to consistently estimate group- and subject-level viewing preferences. To this end, we trained an artificial neural network (ANN) to segment naturalistic dynamic stimuli of 22 movie clips; this algorithm allows for the expeditious segmentation of large amounts of stimuli. Utilizing this approach in a free-viewing eye tracking paradigm, we investigated the stability and robustness of within-subject, within-group, and between-group analyses when incrementally increasing the amount of data used in the analysis. First, we examine each participant’s viewing of single movie clips and assess the consistency of looking time at core facial features (eyes, nose, and mouth). We repeat these analyses while averaging looking time behavior over increasing numbers of movie clips (three, five, eight, and 11) to examine how much data are needed to observe within-group stability and consistent between-group differences. If indeed fixation preferences for the different facial features are a stable individual trait in ASD, then both within-group and between-group results should become increasingly consistent and robust as we introduce more data. Subsequently, we apply the previous analysis’ findings in the context of examining how individuals with ASD view elements of social interactions relative to TD peers, based on a quantity of movie data shown to be sufficient for observing consistent differences.
METHODS
Participants
Fifty high-functioning males with ASD and 36 TD male participants were recruited for this study at the National Institute of Mental Health between May 2017 and January 2019 (ClinicalTrials.gov: NCT01031407). The NIH Combined Neuroscience Review Board granted ethics approval for this study under protocol 2010-M-0027. Prior to study inclusion, a trained research clinician administered the Autism Diagnostic Observation Schedule 2 (ADOS)—Module 4 to participants with ASD. Trained research assistants administered the Wechsler Abbreviated Scale of Intelligence-II (WASI-II) to participants. One participant with ASD was administered the Wechsler Adult Intelligence Scale (WAIS). We obtained one participant with ASD's scores on the Wechsler Intelligence Scale for Children-V (WISC-V) through a collaboration with the Children's National Health System Center for Autism Spectrum Disorder.
All participants with ASD met the cutoff for the category designated as "broad autism spectrum disorders" according to the criteria established by the National Institute of Child Health and Human Development/National Institute on Deafness and Other Communication Disorders Collaborative Programs for Excellence in Autism (Lainhart et al., 2006). The methods were performed in accordance with relevant guidelines and regulations and approved by the NIH Combined Neuroscience Review Board. All adult participants provided written consent, and we obtained written parental assent for minor participants. Seventeen participants with ASD were omitted from this analysis due to incomplete testing data (n = 5), poor-quality eye tracking data (defined as missing data on more than 10% of time points; n = 4), failure to meet criteria for an autism diagnosis (n = 3), scheduling conflicts (n = 2), failure to meet the IQ cutoff (Full Scale IQ > 70; n = 1), conflicting medical conditions (n = 1), and loss to follow-up (n = 1). TD participants were selected to create an age- and IQ-equated match for each participant with ASD. TD participants and participants with ASD did not differ on age, IQ, race, or ethnicity (Table 1). Age at evaluation was 20.74 ± 4.0 years and IQ at evaluation was 108.23 ± 13.11 (Table 1). Table 1 displays descriptive statistics on ADOS scores for the ASD sample.
TABLE 1.
Demographics chart
| | ASD (n = 33) | TD (n = 36) | Total (n = 69) |
|---|---|---|---|
| Age, mean (SD) | 20.25 (4.02) | 20.83 (3.82) | 20.74 (4.0) |
| Race, n (%) | |||
| White | 21 (63.6) | 20 (55.5) | 41 (59.4) |
| Black | 2 (6.1) | 6 (16.7) | 8 (11.6) |
| Asian | 1 (3.0) | 3 (8.3) | 4 (5.8) |
| Biracial | 3 (9.1) | 4 (11.1) | 7 (10.15) |
| Other | 1 (3.0) | 1 (2.8) | 2 (2.9) |
| Unknown | 5 (15.2) | 2 (5.6) | 7 (10.15) |
| Ethnicity, n (%) | |||
| Hispanic/Latino | 6 (18.2) | 5 (13.9) | 11 (15.9) |
| Not Hispanic/Latino | 22 (66.7) | 30 (83.3) | 52 (75.4) |
| Unknown | 5 (15.1) | 1 (2.8) | 6 (8.7) |
| IQ, mean (SD) | 107.81 (13.75) | 108.82 (12.98) | 108.23 (13.11) |
| ADOS, mean (SD) | |||
| Communication | 4.3 (1.42) | - | - |
| Social interaction | 7.63 (2.03) | - | - |
| Social affect | 11.93 (2.72) | - | - |
| Imagination/creativity | 0.76 (0.77) | - | - |
| Stereotyped behavior/restricted interests | 2.93 (1.44) | - | - |
Note: There are no statistically significant differences between TD participants and participants with ASD across age, IQ, race, or ethnicity.
Abbreviations: ADOS, Autism Diagnostic Observation Schedule; ASD, autism spectrum disorder; TD, typically developing.
Procedure
Participants' heads were stabilized using a forehead and chin rest, and eye gaze calibrations were performed on the right eye at the beginning of the experiment. Calibrations were all within 0.3° at study onset and were verified halfway through the study to ensure that the right eye was still correctly aligned and had not shifted. There were no differences between groups in calibration accuracy. Participants engaged in an 8-min free-viewing paradigm, with no explicit instructions other than to watch the presented movies. They viewed 24 movie clips (each 14 s in duration) depicting social interactions in which two or more characters engage in conversation. Movie clips were drawn from the following Hollywood movies: The Blind Side (six clips), The Goonies (four clips), How To Lose a Guy in Ten Days (four clips), The Italian Job (five clips), and The NeverEnding Story (five clips). Movies were viewed full screen on a digital monitor (1920 × 1080 resolution; 20.5 × 12 in. screen size). Gaze was recorded with an EyeLink 1000 Plus eye tracker sampling at 1000 Hz. A gray screen appeared for 6 s between presentations of the clips, with a fixation cross at its center to reset fixations before the successive clip.
For our analyses, we excluded two movie clips from The NeverEnding Story because they displayed a highly disproportionate face-to-background pixel ratio or scene darkness that distorted segmentations. Final analyses included 22 movie clips (7410 frames).
Image segmentation
We trained an ANN to predict a segmentation label for each pixel of each frame of each movie. We used the Pascal-Parts dataset to train a Bayesian SegNet with concrete dropout to produce a predicted segmentation for a given movie frame (Everingham et al., 2010; Gal et al., 2017; Kendall et al., 2015). When applying the ANN to new movie frames, 10 concrete dropout Monte-Carlo samples were used to produce predicted segmentation labels and uncertainty estimates. Figure 1 displays a comparison of segmented stimuli to the original frame. The code we used is publicly available at https://github.com/nih-fmrif/MLT_Body_Part_Segmentation, as are further details about the ANN (McClure et al., 2020).
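The full segmentation pipeline is described in McClure et al. (2020) and the linked repository. As a rough, hypothetical sketch of the Monte-Carlo dropout step only (PyTorch-style; the stand-in model and its parameters are placeholders, not the authors' implementation), dropout is kept active at inference, several stochastic forward passes are averaged, and a per-pixel label plus an entropy-based uncertainty are taken from the averaged probabilities:

```python
# Hypothetical sketch of Monte-Carlo dropout inference for per-pixel labels.
# This is NOT the authors' code; see the linked repository for the real model.
import torch
import torch.nn as nn

N_LABELS = 12          # 11 body-part labels + background (assumption)
N_MC_SAMPLES = 10      # the paper reports 10 concrete-dropout MC samples

# Stand-in for the trained Bayesian SegNet: any conv net with dropout works here.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.2),                     # dropout stays active at test time
    nn.Conv2d(16, N_LABELS, 1),
)

def mc_dropout_segment(frame: torch.Tensor):
    """frame: (1, 3, H, W) tensor -> (labels, uncertainty), each (H, W)."""
    model.train()                            # keep dropout stochastic at inference
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(frame), dim=1) for _ in range(N_MC_SAMPLES)
        ]).mean(dim=0)                       # average over MC samples
    labels = probs.argmax(dim=1)[0]          # predicted label per pixel
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1)[0]  # uncertainty
    return labels, entropy

labels, uncertainty = mc_dropout_segment(torch.rand(1, 3, 120, 160))
```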
FIGURE 1.
Artificial neural network (ANN) identified and segmented each frame of the dynamic stimuli. Examples comparing two original movie frames to the segmented images in black and white. Different shades of white/gray represent the labels generated by the ANN. Black segments represent pixels to which the ANN assigned no label; in our analysis these were treated as background.
The ANN segmented images into 11 body part labels: hair, head, ear, eye, eyebrow, leg, arm, mouth, neck, nose, and torso. Additionally, we created a 12th category for each pixel that the ANN did not place into one of these 11 labels. This label was treated as the background label and contained all other frame features, such as objects, landscapes, and noise, that were not associated with the 11 other labels. The performance of the ANN was tested on the test set portion of the Pascal-Parts dataset by calculating a Dice score for each label; this statistic evaluates the overlap between two segmentations. In this measure, a true positive (TP) is a correctly labeled pixel of that class, a true negative is a correctly labeled pixel not belonging to that class, and false positive (FP) and false negative (FN) are the two possible mislabelings; the Dice score is then 2TP / (2TP + FP + FN). Average Dice scores were eye = 0.62, nose = 0.57, mouth = 0.62, body part labels = 0.55, and background = 0.95 (McClure et al., 2020).
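For reference, a minimal sketch of the per-label Dice computation from these counts (using hypothetical stand-in label maps, not the evaluation code itself) is:

```python
# Minimal sketch of a per-label Dice score: 2*TP / (2*TP + FP + FN).
# Arrays here are hypothetical stand-ins for predicted and ground-truth label maps.
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, label: int) -> float:
    pred_mask = pred == label
    truth_mask = truth == label
    tp = np.logical_and(pred_mask, truth_mask).sum()   # correctly labeled pixels
    fp = np.logical_and(pred_mask, ~truth_mask).sum()  # predicted but not true
    fn = np.logical_and(~pred_mask, truth_mask).sum()  # true but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

pred = np.random.randint(0, 12, size=(120, 160))
truth = np.random.randint(0, 12, size=(120, 160))
print(dice_score(pred, truth, label=3))
```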
Eye tracking processing
Eye tracking data were extracted for each separate movie clip, removing the first and last 500 ms of each clip. Non-fixation data (e.g., blinks, missing data, or offscreen fixations) were ignored. Data were despiked and downsampled to the frame rate at which the clips were presented (29.97 frames per second). Each pixel received one of the aforementioned 12 labels based on ANN output. For each participant, gaze location in each frame was classified as belonging to one of these labels by examining the algorithm's label predictions within the 15-pixel radius surrounding the primary fixation point. The most frequently occurring pixel label was then selected, with a bias toward smaller features; for example, if the pixels within a 15-pixel radius of a particular fixation included both "eye" and "head" labels, that fixation would be labeled as "eye." The smallest regions of interest covered by this analysis, including the 15-pixel radius, covered at least 3° of visual angle. For the purposes of our analysis, we examine only the core face features (eyes, nose, and mouth). For comparison of face versus non-face looking times, we consider all face features together (including head) versus all labels outside the face.
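As an illustration of this labeling rule, the sketch below assigns a single label to a fixation from the pixels within a 15-pixel radius. It is hypothetical: the exact priority ordering of features and the way the small-feature bias overrides raw frequency are our assumptions; the text specifies only that smaller features such as the eyes take precedence over larger ones such as the head.

```python
# Sketch: assign one label to a fixation from pixels within a 15-pixel radius,
# preferring smaller facial features when several labels are present.
import numpy as np

# Hypothetical priority: smaller features listed first (assumed ordering).
PRIORITY = ["eye", "eyebrow", "nose", "mouth", "ear",
            "head", "hair", "neck", "arm", "leg", "torso", "background"]

def label_fixation(label_map: np.ndarray, x: int, y: int, radius: int = 15) -> str:
    """label_map: (H, W) array of label strings; (x, y) is the fixation point."""
    h, w = label_map.shape
    ys, xs = np.ogrid[:h, :w]
    within = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2   # circular neighborhood
    present = set(np.unique(label_map[within]).tolist())
    # Choose the highest-priority (smallest) feature present in the neighborhood.
    for feature in PRIORITY:
        if feature in present:
            return feature
    return "background"

# Toy example: a frame that is mostly "head" with a small "eye" patch.
frame = np.full((100, 100), "head", dtype=object)
frame[40:45, 50:55] = "eye"
print(label_fixation(frame, x=52, y=42))   # -> "eye"
```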
Saliency maps
We created saliency maps for the background (i.e., all non-face and non-body pixels) of each frame of each movie to provide additional information on non-socially relevant gaze. We generated saliency maps using an intensity contrast feature (ICF) model (Kümmerer et al., 2017). This model predicts fixations in images using low-level information such as intensity and intensity contrast. It is publicly available at https://deepgaze.bethgelab.org/.
Each pixel was assigned a saliency value from 0 to 1 based on the ICF algorithm output. These values were then converted to saliency percentage values by normalizing the saliency value of each pixel against the saliency value of all background pixels in all frames for each movie separately. The saliency percentage value for each pixel therefore represents its relative saliency compared to other background pixels in the same movie.
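One plausible reading of this normalization, sketched below with hypothetical stand-in arrays (the exact normalization formula is our assumption), converts each background pixel's raw ICF value to its percentile rank among all background pixels of the same movie:

```python
# Sketch: convert raw ICF saliency values to percentile values relative to all
# background pixels of a given movie. Inputs are hypothetical stand-in arrays.
import numpy as np

def saliency_percentiles(saliency_frames, background_masks):
    """saliency_frames, background_masks: lists of (H, W) arrays for one movie."""
    # Pool the saliency of every background pixel across all frames of the movie.
    pooled = np.concatenate([s[m] for s, m in zip(saliency_frames, background_masks)])
    pooled.sort()
    out = []
    for s, m in zip(saliency_frames, background_masks):
        pct = np.full(s.shape, np.nan)
        # Fraction of pooled background pixels with saliency <= this pixel's value.
        pct[m] = np.searchsorted(pooled, s[m], side="right") / pooled.size
        out.append(pct)
    return out

frames = [np.random.rand(60, 80) for _ in range(3)]
masks = [np.random.rand(60, 80) > 0.5 for _ in range(3)]
percentile_maps = saliency_percentiles(frames, masks)
```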
Data analysis
For the purposes of this analysis, we examined social attention to the core facial features including eyes, nose, and mouth. This analysis investigated the effects of varying amounts of movie data on the consistency of individual looking time to these core features as well as between-group differences. As a basis for evaluating consistency of looking time across movie clips with varying content, we normalized the gaze data for each movie clip. For each participant and each movie clip, we calculated the proportion of time spent looking at each individual face label (eyes, nose, and mouth) divided by total face looking time (time spent fixating on eyes, nose, and mouth labels together). For each participant, movie clip, and facial feature, we then calculated the distance from the average looking time of all other participants in their respective groups. This was normalized by total time spent looking at the face, as described above. This normalization serves to account for differences in raw looking time on the different facial features; these differences may arise from movie-specific variability (e.g., number of feature pixels per frame, action, or speech content) and draw attention to or from the different features. The resulting values for each participant (henceforth referred to as looking time proportion) represent the proportion of looking time they allocate to each of the facial features out of the time they spend looking at the face in general for that particular movie clip, compared to all other participants in their group. These values were then used to evaluate the internal consistency of looking time on each of the facial features across movie clips. This was done by correlating the looking time proportion for participants across different movie pairs/movie sets, for all possible combinations of single movie pairs, and for 10,000 randomly selected movie sets of three, five, eight, and 11 movies.
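The core quantities of this analysis can be sketched as follows (hypothetical array layout and variable names, not the authors' code): per-movie feature proportions are computed relative to total face-looking time, centered against the leave-one-out group mean, and then correlated across movie sets.

```python
# Sketch of the looking-time-proportion computation and its internal consistency.
# `dwell` is a hypothetical array: dwell[p, m, f] = seconds participant p spent on
# feature f (0=eyes, 1=nose, 2=mouth) in movie m.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_movies, n_feat = 30, 22, 3
dwell = rng.random((n_subj, n_movies, n_feat))

# Proportion of total face-looking time spent on each feature, per movie.
prop = dwell / dwell.sum(axis=2, keepdims=True)

# Center each participant against the mean of all *other* participants (leave-one-out).
group_sum = prop.sum(axis=0, keepdims=True)
others_mean = (group_sum - prop) / (n_subj - 1)
looking_time_prop = prop - others_mean

def internal_consistency(feature: int, set_a, set_b) -> float:
    """Correlate participants' feature preference averaged over two movie sets."""
    a = looking_time_prop[:, set_a, feature].mean(axis=1)
    b = looking_time_prop[:, set_b, feature].mean(axis=1)
    return np.corrcoef(a, b)[0, 1]

print(internal_consistency(0, set_a=[0, 1, 2], set_b=[3, 4, 5]))
```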
Statistical analysis
To evaluate differences in the distributions of the proportion of time spent looking at core features, we performed permutation-based statistical tests using movie sets consisting of one, three, five, eight, and 11 different movie clips; these analyses were repeated for each face label (eyes, nose, and mouth). To test whether the ASD and TD distributions significantly differed from each other, we first calculated the TD group's median looking time proportion minus the ASD group's median looking time proportion (henceforth, real median differences), as well as the TD looking time proportion variance minus the ASD looking time proportion variance (henceforth, real variance differences). We then generated, for each permutation, two sets of randomly selected looking time proportions from the combined ASD and TD eye tracking data by randomly permuting the TD and ASD labels. From this permuted dataset, we calculated the first set's median looking time proportion minus the second set's median looking time proportion (henceforth, permuted median differences), as well as the first set's looking time proportion variance minus the second set's looking time proportion variance (henceforth, permuted variance differences). This process was repeated 10,000 times. Across these iterations, we calculated the proportion of permuted median differences greater than the real median differences, as well as the proportion of permuted variance differences greater than the real variance differences. The resulting proportions represent two-tailed p-values.
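A compact sketch of this label-permutation test is below. It is illustrative only: the stand-in group values are random, and the use of absolute differences to obtain a two-tailed p-value is one common convention that we assume here.

```python
# Sketch of a permutation test on group differences in median and variance.
import numpy as np

rng = np.random.default_rng(1)

def permutation_test(td_vals, asd_vals, n_perm=10_000):
    """Two-tailed p-values for the TD-minus-ASD difference in median and variance."""
    real_med = np.median(td_vals) - np.median(asd_vals)
    real_var = np.var(td_vals) - np.var(asd_vals)
    pooled = np.concatenate([td_vals, asd_vals])
    n_td = len(td_vals)
    med_count = var_count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)          # randomly reassign group labels
        a, b = shuffled[:n_td], shuffled[n_td:]
        med_count += abs(np.median(a) - np.median(b)) >= abs(real_med)
        var_count += abs(np.var(a) - np.var(b)) >= abs(real_var)
    return med_count / n_perm, var_count / n_perm

td = rng.normal(0.45, 0.05, 36)    # stand-in looking time proportions
asd = rng.normal(0.40, 0.10, 33)
print(permutation_test(td, asd))
```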
For analysis of within-group looking time stability, we randomly selected two sets of three non-overlapping movie combinations, totaling 42 s of stimuli. Similar to the aforementioned analysis, we calculated the correlations of the within-group internal consistency of the looking time proportions across these two sets of movies; this was done for TD participants and participants with ASD separately. This process was repeated for 10,000 permutations, with different sets of three movies selected for each permutation; to assess incremental additions of data, the process was repeated with two sets of movies composed of random combinations of five (70 s), eight (112 s), and 11 (154 s) movies. Then, we sought to evaluate whether the correlation coefficient distributions for each level of movie data (three, five, eight, and 11 movies) significantly differed from each other. Differences between the movie data distributions refer to the following comparisons: three versus five movies, three versus eight movies, three versus 11 movies, five versus eight movies, five versus 11 movies, and eight versus 11 movies. For each of these pairwise combinations, we calculated the real median differences between the first movie level and the second movie level, as well as the real variance differences between the first movie level and the second movie level. We then generated, for each permutation, two sets of randomly selected looking time proportions from the combined movie-level eye tracking data. From this permuted dataset, we calculated the permuted median differences between the first dataset and the second dataset, as well as the permuted variance differences between the first dataset and the second dataset. This process was repeated 10,000 times, and we calculated the proportion of permuted median differences greater than the real median differences, as well as the proportion of permuted variance differences greater than the real variance differences. The resulting proportions represent two-tailed p-values.
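The split-half stability procedure can be sketched as follows (hypothetical data; the stand-in preference values mix a stable individual mean with movie-to-movie noise, so the median split-half correlation rises as more movies enter each set, which is the convergence pattern tested here):

```python
# Sketch of the split-half stability analysis: correlate participants' feature
# preference across two non-overlapping random sets of k movies, many times over.
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_movies = 33, 22
# Hypothetical stand-in looking-time proportions (subject x movie) for one feature:
# a stable per-subject preference plus per-movie noise.
prefs = rng.normal(0, 1, (n_subj, 1)) + rng.normal(0, 0.5, (n_subj, n_movies))

def split_half_correlations(prefs, k, n_perm=10_000):
    corrs = np.empty(n_perm)
    for i in range(n_perm):
        movies = rng.permutation(prefs.shape[1])
        set_a, set_b = movies[:k], movies[k:2 * k]   # non-overlapping sets of k movies
        corrs[i] = np.corrcoef(prefs[:, set_a].mean(axis=1),
                               prefs[:, set_b].mean(axis=1))[0, 1]
    return corrs

for k in (3, 5, 8, 11):
    print(k, np.median(split_half_correlations(prefs, k, n_perm=500)))
```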
For analysis of between-group looking time stability, we randomly selected two sets of non-overlapping movie combinations; this was done on sets of three, five, eight, and 11 movies as described above. For 10,000 permutations of these randomly selected sets, we calculated the correlations of the within-group internal consistency of the looking time proportions across these two sets of movies; this was done for TD participants and participants with ASD separately. Then, we sought to evaluate whether the ASD and TD correlation coefficient distributions significantly differed from each other at equivalent amounts of movie data. First, we calculated the real median differences between TD and ASD data, as well as the real variance differences between TD and ASD data. As before, for each permutation we generated two sets of randomly selected looking time proportions from the combined movie-level eye tracking data. From this permuted dataset, we calculated the permuted median differences between TD and ASD data, as well as the permuted variance differences between TD and ASD data. This process was repeated 10,000 times, and we calculated the proportion of permuted median differences greater than the real median differences, as well as the proportion of permuted variance differences greater than the real variance differences. The resulting proportions represent two-tailed p-values.
RESULTS
Internal consistency of looking time
First, we sought to analyze the consistency of each participant's looking time at different facial features across 22 movies. As outlined in the methods, the ANN assigned labels with a bias toward smaller features; for example, if a particular fixation included both "eye" and "head" labels, that fixation would be labeled as "eye." For the purposes of this analysis, we focused on core facial features (eyes, nose, and mouth). We analyzed individual variation in looking times by calculating the looking time proportion per participant, movie clip, and facial feature, compared to all other participants in their group (see Methods). We then compared these individual looking time proportions across all possible movie pairs (e.g., Movie 1 and Movie 2; Movie 1 and Movie 3). The scatterplots in Figure 2(a)-(c) each display an example of how well-correlated the participants are to themselves across two example movies; the correlation coefficient measures within-subject internal consistency across all participants in each group for that particular movie pair. The correlation coefficients for all possible movie pairs are then combined to create the histograms featured in Figure 2(a)-(c). These histograms showcase the individual variability of correlations between all single movie pairs for eye (TD median correlation = 0.63, ASD median correlation = 0.45), mouth (TD median correlation = 0.59, ASD median correlation = 0.46), and nose (TD median correlation = 0.42, ASD median correlation = 0.28) looking times for each group separately.
FIGURE 2.
((a)–(c)) Individual consistency of fixations of TD and ASD participants to the different facial features across 22 movies. The scatterplots display an example of how well-correlated participants are to themselves across two example movies for each of the face labels (Movie 1 vs. Movie 16); each dot represents the looking time proportion for that feature for a single participant in Movie 1 versus Movie 16, in relation to the average looking time proportion of everyone else in the group (positive values mean that participant spent more time than average looking at that feature, whereas negative values represent below-average looking time). The correlation coefficient is a measure of within-subject internal consistency across all participants in each group for that particular movie pair. The histograms display the individual variability of correlations between all single movie pairs for eye ((a) TD median = 0.63, ASD median = 0.45), mouth ((b) TD median = 0.59, ASD median = 0.46), and nose ((c) TD median = 0.42, ASD median = 0.28) fixations for ASD and TD groups separately. ASD, autism spectrum disorder; TD, typically developing
We then carried out a permutation test to assess whether the distributions of these correlation coefficients significantly differ between the TD and ASD groups (see Methods). Compared to TD counterparts, those with ASD showed significantly reduced within-subject internal consistency in facial feature viewing preferences across movie clips (p < 0.001), as well as significantly increased variability (p < 0.001) for eye, nose, and mouth labels. This is in line with previous literature, which has emphasized inter-subject variations among individuals with ASD (Hahamy et al., 2015; Hasson et al., 2009; Ramot et al., 2020). We found a similar result by analyzing the variance of overall looking time for the different facial features.
Stability of within-group looking time
We next investigated the consistency of within-group looking time across movies when incrementally increasing the amount of movie data used. This analysis served two purposes. First, it addressed whether adding more data improves the consistency of individual looking time proportions to the different features. Second, it examined whether both subject groups displayed convergence of individual preferences for the different facial features; if observed, this would justify the treatment of facial feature viewing preference as a stable trait. We examined this effect on each of the three individual face labels. First, from our 22 movie clips, we randomly selected two sets of three non-overlapping movie combinations, totaling 42 s of stimuli. Similar to the previous analysis, we calculated the correlations of the within-group internal consistency of the looking time proportions across these two sets of movies; this was done for TD participants and participants with ASD separately. This process was repeated for 10,000 permutations, with different sets of three movies selected for each permutation. As before, the histograms in Figure 3 display the correlation coefficients for all the different permutations. To assess incremental additions of data, we repeated this process by creating two sets of movies with random combinations of five (70 s), eight (112 s), and 11 (154 s) movies. Figure 3(a)-(c) displays the distribution of correlations as the number of movie clips increases for eye, mouth, and nose looking time proportions, respectively.
FIGURE 3.
((a)–(c)) Change in the consistency of within-group fixations across movies when introducing additional movie data for TD (left panel) and ASD (right panel) participants. Histograms reflect the distribution of correlation coefficients of all permutations as the number of movie clips increases for eye (a), mouth (b), and nose (c) fixations, respectively (see methods for details). ASD, autism spectrum disorder; TD, typically developing
Next, we tested whether consistency of looking time significantly increased when using more data. Using permutation tests, we examined the distributions of internal consistency using different numbers of movies (see Methods). Figure 3 shows the increasing consistency of correlations across each face label as data are added, with individual looking time proportions converging to an increasingly stable mean across both the TD and ASD groups. In both groups, the medians and the variance of the distributions were significantly different for different numbers of movies for each of the face labels (pmedian < 1 × 10−4; pvariance < 1 × 10−4 for all pairwise comparisons), with the median consistency increasing and variance decreasing as more movies were considered. For a given amount of movie data (i.e., three movies, five movies, etc.), there were also significant differences in the distributions across groups. Those with ASD were significantly less consistent (pmedian < 1 × 10−4) and more variable (pvariance < 1 × 10−4) than their TD peers across all data amount levels.
Stability of between-group results
Thus far, we have demonstrated that increasing amounts of movie data serve to stabilize individual looking time proportion variation within each group. On this basis, we then examined the effect of this increased stability on the consistency of ASD and TD between-group differences in facial feature preference. First, we examine the variability in between-group differences when using a single movie clip. We used two-sample t-tests to examine between-group differences per face label in each of the movies. Figure 4 shows the distribution of the p-values of the t-tests carried out on the individual movies. Results varied greatly across movies for all three features, but particularly for the eyes and mouth. Next, we analyzed the effects of additional movie data on between-group looking time differences. We randomly selected three movie clips from our 22 movies and performed a two-sample t-test on the average looking times on each of the face labels between the ASD and TD groups in this movie set. This process was repeated for 10,000 permutations. Similar to the previous analysis, we examined the effect of incremental additions of data by repeating this process with random combinations of five, eight, and 11 movies. Figure 5 shows the distribution of p-values for eye, mouth, and nose looking times as the number of movie clips increases. As demonstrated in the histograms, the distributions become increasingly narrow as more movies are added. For mouth labels, the percentage of results showing significant differences between ASD and TD looking time decreases as the number of movies increases from one (9%) to 11 movies (0.98%). Though p-values vary widely for between-group differences in eye looking times, all p-values point to non-significant differences between the groups when using 11 movies (p > 0.05 for all iterations). For nose labels, we observe an increase in the percentage of results showing significant differences between ASD and TD groups as the number of movies increases from one (53.9%) to 11 (100%) movies. This is further evidenced by permutation test results comparing differences in the effects of additional movie data on between-group distributions per face label. Findings reveal significant differences between each of the movie number distributions on both median and variance (pmedian < 1 × 10−4; pvariance < 1 × 10−4 for all).
FIGURE 4.
Distribution of the variability (t-test p-values) between TD individuals and individuals with ASD based on separate evaluation of each of the 22 movie clips. Results vary greatly across movies for all three features, but particularly for the eyes and mouth. ASD, autism spectrum disorder; TD, typically developing
FIGURE 5.
Consistency of fixations for TD individuals and those with ASD across movies when introducing additional movie data. Histograms reflect the distribution of t-test p-values of all permutations as the number of movie clips increases for eye, mouth, and nose fixations, respectively (see methods for details). ASD, autism spectrum disorder; TD, typically developing
Social interaction movie viewing
The use of 11 movie clips was shown by the previous analysis to yield consistent and stable results regarding facial feature looking time, with an average correlation >0.8 for both groups and across all features. Our results also clearly showed a convergence as more data were added. Therefore, we used all 22 movie clips (308 s of stimuli) to assess looking time differences between TD participants and those with ASD while they viewed facial features in naturalistic dynamic interactions. Analysis of looking time revealed that those with ASD fixated on the face overall significantly less than TD participants (t = 3.81; p = 3.08 × 10−4). Figure 6 displays the distribution of ASD and TD time spent fixating on each individual face label; for the purposes of this study, we focused on core facial features (eyes, nose, and mouth). With a two-by-three analysis of variance, we examined whether looking time was affected by diagnosis (ASD/TD) and individual core facial feature (eyes/nose/mouth). Main effect analyses revealed significant differences among face labels (F [1, 206] = 120.26, p = 4.49 × 10−35) and diagnosis (F [1, 206] = 5.22, p = 0.02). There was a statistically significant interaction between the effects of diagnosis and face label on looking time (F [1, 206] = 4.46, p = 0.01). TD participants attended more to the nose (t = 3.52; p = 7.73 × 10−4); however, there was no difference between ASD and TD eye (p = 0.50) or mouth (p = 0.14) looking time.
FIGURE 6.
Distribution of proportion of time spent fixating on eye, mouth, and nose label (out of total face fixation time) for TD individuals and those with ASD using all 22 movie clips. ASD, autism spectrum disorder; TD, typically developing
Additionally, individuals with ASD allotted more looking time to the background of social scenes (t = −3.24; p = 0.001). To assess whether this increased attention to background was driven by saliency effects, we created saliency maps of each movie background to examine looking time to low-level features. To this end, we examined the salience attributes of the attended pixels for each frame in which participants' eye position indicated that they were looking at the background. We calculated the saliency percentage value for each pixel, such that the value of each pixel represents its relative saliency compared to other background pixels in the same movie. We then carried out a group comparison between TD participants and participants with ASD to explore differences in low-level saliency viewing. TD individuals and individuals with ASD did not differ in saliency of background fixations (p = 0.65). We also calculated within-group variance of saliency of background fixations (averaged across all frames for all movies) for TD participants and those with ASD; variance did not significantly differ between these two groups (σTD = 0.06; σASD = 0.1163; p = 0.14).
DISCUSSION
Our research investigates the stability of social gaze during complex and dynamic interactions in both TD individuals and those with ASD using a machine learning approach to eye tracking. Overall, our findings demonstrate that gaze preference for specific facial features can be considered a stable individual trait in both TD and ASD populations, with individual looking time proportions converging to a stable mean as we add more data. Based on a number of movie clips shown by our analysis to yield stable results (22 movie clips), we then sought to examine social looking time differences between TD individuals and those with ASD. Our findings reveal that individuals with ASD attend less to the face in general than their TD counterparts. Particularly, individuals with ASD spend less time attending to the center of the face (the nose region; Figures 5 and 6). However, they do not significantly differ from TD individuals in eye and mouth looking time (Figure 5).
Eye tracking paradigms operate under the assumption that gaze is a stable trait. However, previous literature casts doubt on this assumption, showing that eye fixations vary depending on the type of stimuli (static vs. dynamic; Speer et al., 2007). Our results bear on this question by revealing poor internal consistency when few stimuli are used. Although previous studies show individuals' gaze consistency in static images, our stimuli feature rich content from several movies, and may reveal complexities that arise when using dynamic stimuli (Arizpe et al., 2017; Mehoudar et al., 2014). As seen in the relationships between single movie pairs, there is considerable individual variation from movie to movie (Figure 2(a)-(c)). Both subject groups display this variability, though individuals with ASD display greater instability between individual movies as compared to TD peers across core face labels (Figure 2(a)-(c)).
Nevertheless, our findings also depict the growing stability of individual gaze preferences when averaging across more data, implying a convergence around a stable mean (Figure 3). Thus, we assert that with sufficient data, our estimates of looking time proportions can be considered a stable trait in both ASD and TD groups. It is important to note that even with the addition of data, the ASD group again displays greater variance across all measures, both at the individual and at the group level (Figure 3(a)-(c)). This is in line with commonly observed idiosyncrasies within the ASD population (Byrge et al., 2015; Hahamy et al., 2015; Hasson et al., 2009). Additionally, this suggests that more data are needed to consistently estimate individual viewing preferences when studying participants with ASD.
Insufficient data may be a source of error. When using only a single short movie clip, possible between-group results vary widely (Figure 4), potentially yielding both false-positive and false-negative group differences. We observe increased stability in the t-test results between groups with additional data. Using 11 movie clips, the differences in mouth looking time between TD and ASD participants were overwhelmingly non-significant, such that 99% of p-values showed no significant group differences. In contrast, using single movie clips yielded significant group differences 9% of the time (Figure 5). Significant differences between ASD and TD nose looking time were found for all possible movie set combinations when examining 11 movies, but failed to reach significance in 46% of the single movies. As expected based on these results, using all 22 movies, group differences in nose looking times were significant, but group differences in eye and mouth looking times were not.
While the stability of the between-group differences increases across all facial features with the addition of more data, there are clear differences in the distributions across labels. Distributions for eye and mouth looking time differences remain quite broad throughout, spanning both significant and insignificant results. Distributions for nose looking time differences are much narrower (Figure 5). This likely indicates an interaction between the internal consistency of the individual data and the effect size of between-group differences.
As previously mentioned, the literature is widely inconsistent on the extent to which individuals with ASD avoid the eyes in favor of the mouth (Gillespie-Smith et al., 2014; Johnels et al., 2014; Jones et al., 2008; Snow et al., 2011; Yi et al., 2013). However, we found no evidence that TD individuals and high-functioning individuals with ASD differ on either eye or mouth looking time. Given the high degree of variance between different movies in our own dataset, it is possible that the differences in results from different studies reflect sample-dependent findings more than social processing in ASD.
Deviations in gaze are likely a reflection of ASD deficits in neural systems that modulate complex social behaviors. This is evidenced by our previous research in which we link aberrant gaze and atypical neural mechanisms in the “social brain” (Ramot et al., 2020). This is also supported by Avni et al. (2020), whose findings report reduced eye movement typicality in those with ASD, as well as a correlation between individual gaze idiosyncrasies and ASD severity. The present study elaborates on the typicality of ASD gaze by pinpointing the manner in which individual behavior varies. First, our participants with ASD display significantly greater within-group variance in time spent fixating on the face compared to TD individuals. Second, individuals with ASD display significantly reduced overall time spent looking at the face and in particular the central face region. Lastly, this group has significantly reduced internal consistency in the viewing of the different facial features.
As evidenced by previous work, TD gaze allocation toward the nose may reflect several visual tendencies that are typical in normative populations. Findings show that TD individuals initially fixate on the geometric center of the face (i.e., the eye–nose region) before exploring other features (Bindemann et al., 2009). Rogers et al. (2018) report the existence of an "eye–mouth gaze continuum" in which TD individuals experiencing real-world interactions distribute their gaze in the area between eye and mouth regions, with variation in specific feature preference. Face perception studies commonly report this scan path in TD participants (Bindemann et al., 2009; Hsiao & Cottrell, 2008), as well as preferential attention to the area around the center of the nose during face recognition tasks (Hsiao & Cottrell, 2008). Nose looking is shown to provide a central point where the viewer's periphery can take in information from the entire face (Hsiao & Cottrell, 2008). This optimizes face perception, in accordance with the holistic nature of face recognition in TD individuals.
Based on what is known about the centrality of nose fixations, the lack thereof in those with ASD may suggest a local processing style biased toward individual facial features. Prior ASD research not only reports evidence for local bias in visual perception, but also suggests that local processing tendencies in autism may contribute to the associated overall difficulty with integrating features to create a global representation (Nayar et al., 2017; Shah et al., 2016). Reduced nose-looking may reveal a developmental behavior that results from atypical social brain neural systems.
Furthermore, individuals with ASD show greater attention to background stimuli compared to TD peers, despite saliency analyses revealing that both groups attend to similar low-level features when viewing the background. The question remains what factors beyond pixel-level salience draw the sustained attention of those with ASD. Xu et al. (2014) discuss a multi-level architecture of salience beyond individual pixels; factors such as object- and semantic-level salience significantly capture human gaze as well. Our rich and complex stimuli feature various characteristics that are known to have heightened visual salience (e.g., faces, emotion, motion, and touched objects; de Haas et al., 2019). Thus, several levels of salience are possibly in effect. ASD background viewing may signify attentional impairments and/or a greater interest in competing non-social stimuli over socially relevant information and low-level salient features. Future investigations should examine multi-layered salience information for a comprehensive view of preferential attention to non-social features in ASD.
It is important to note that our study utilizes different, short movie clips. Using heterogeneous movies can support the generalizability of looking time proportion as a stable individual trait. However, it is possible that studies investigating gaze with consistent or homogeneous content may need fewer clips to reach gaze stability. Additionally, there are many other elements that could affect looking time proportion which we did not test in this study. These data were all collected in a single session and may not capture individual variation across days, though previous studies have shown stability in this regard (Mehoudar et al., 2014). Similarly, all the movies depict social interactions, and the task was a free-viewing task. Different task contexts or very different movie content may also affect gaze patterns (Speer et al., 2007). Future studies may seek to examine how much data would be necessary to robustly estimate individual fixation preferences when using a single, longer movie clip.
It should also be noted that many studies only include eye and mouth regions in facial feature coding. We expand on previous work by distinguishing the eye, mouth, and nose regions in our core facial feature analysis, which in turn revealed data-driven results that diverge from findings of exclusive eye- and mouth-directed analyses (Rice et al., 2012; Speer et al., 2007). The present study’s machine learning algorithm fulfills the need for a quantifiable and data-driven approach to eye tracking segmentation. Our ANN optimizes the use of ample and diverse stimuli, while eliminating some of the typical difficulties associated with manual techniques. We encourage future studies to adopt similar automatic stimuli segmentation techniques to enable the use of the large amounts of stimuli needed to test hypotheses about social gaze processing in populations.
ACKNOWLEDGMENTS
This work was supported by the National Institute of Mental Health (ClinicalTrials.gov: NCT01031407). This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1646737 and under Grant Nos. DGE-1650604 and DGE-2034835. These organizations did not have a role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript.
Footnotes
CONFLICT OF INTEREST
Authors declare no competing interests.
REFERENCES
- Arizpe J, Walsh V, Yovel G, & Baker CI (2017). The categories, frequencies, and stability of idiosyncratic eye-movement patterns to faces. Vision Research, 141, 191–203. 10.1016/j.visres.2016.10.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Avni I, Meiri G, Bar-Sinai A, Reboh D, Manelis L, Flusser H, Michaelovski A, Menashe I, & Dinstein I (2020). Children with autism observe social interactions in an idiosyncratic manner. Autism Research, 13(6), 935–946. 10.1002/aur.2234 [DOI] [PubMed] [Google Scholar]
- Bindemann M, Scheepers C, & Burton AM (2009). Viewpoint and center of gravity affect eye movements to human faces. Journal of Vision, 9(7), 1–16. 10.1167/9.2.7 [DOI] [PubMed] [Google Scholar]
- Byrge L, Dubois J, Tyszka JM, Adolphs R, & Kennedy DP (2015). Idiosyncratic brain activation patterns are associated with poor social comprehension in autism. Journal of Neuroscience., 35 (14), 5837–5850. 10.1523/JNEUROSCI.5182-14.2015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Castelhano M, & Henderson J (2008). Stable individual differences across images in human saccadic eye movements. Canadian Journal of Experimental Psychology., 62, 1–14. 10.1037/1196-1961.62.1.1 [DOI] [PubMed] [Google Scholar]
- de Haas B, Iakovidis AL, Schwarzkopf S, & Gegenfurtner KR (2019). Individual differences in visual salience vary along semantic dimensions. Proceedings of the National Academy of Sciences of the United States of America, 116(24), 11687–11692. 10.1073/pnas.1820553116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Everingham M, Van Gool L, Williams CK, Winn J, & Zisserman A (2010). The Pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338. [Google Scholar]
- Freeth M, Chapman P, Ropar D, & Mitchell P (2010). Do gaze cues in complex scenes capture and direct the attention of high functioning adolescents with ASD? Evidence from eye-tracking. Journal of Autism and Developmental Disorders, 40(5), 534–547. 10.1007/s10803-009-0893-2 [DOI] [PubMed] [Google Scholar]
- Gal Y, Hron J, & Kendall A (2017). Concrete dropout. arXiv pre-print arXiv:1705.07832, 2017. [Google Scholar]
- Gillespie-Smith K, Riby DM, Hancock PJ, & Doherty-Sneddon G (2014). Children with autism spectrum disorder (ASD) attend typically to faces and objects presented within their picture communication systems. Journal of Intellectual Disabilities Research, 58(5), 459–470. 10.1111/jir.12043 [DOI] [PubMed] [Google Scholar]
- Hahamy A, Behrmann M, & Malach R (2015). The idiosyncratic brain: Distortions of spontaneous connectivity patterns in autism spectrum disorder. Nature Neuroscience, 18, 302–309. 10.1038/nn.3919 [DOI] [PubMed] [Google Scholar]
- Hasson U, Avidan G, Gelbard H, Vallines I, Harel M, Minshew N, & Behrmann M (2009). Shared and idiosyncratic cortical activation patterns in autism revealed under continuous real-life viewing conditions. Autism Research, 2(4), 220–231. 10.1002/aur.89 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hsiao JH, & Cottrell G (2008). Two fixations suffice in face recognition. Psychological Science, 19(10), 998–1006. 10.1111/j.1467-9280.2008.02191.x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johnels J, Gillberg C, Falck-Ytter T, & Miniscalco C (2014). Face-viewing patterns in young children with autism spectrum disorders: Speaking up for the role of language comprehension. Journal of Speech, Language, and Hearing Research, 57(6), 2246–2252. 10.1044/2014_JSLHR-L-13-0268 [DOI] [PubMed] [Google Scholar]
- Jones W, Carr K, & Klin A (2008). Absence of preferential looking to the eyes of approaching adults predicts level of social disability in 2-year-old toddlers with autism spectrum disorder. Archives of General Psychiatry, 65(8), 946–954. 10.1001/archpsyc.65.8.946 [DOI] [PubMed] [Google Scholar]
- Kendall A, Badrinarayanan V, & Cipolla R (2015). Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv arXiv:1511.02680. [DOI] [PubMed] [Google Scholar]
- Kümmerer M, Wallis TSA, Gatys LA, & Bethge M (2017). Understanding Low- and High-Level Contributions to Fixation Prediction. 2017 IEEE International Conference on Computer Vision (ICCV), 4799–4808. 10.1109/ICCV.2017.513. [DOI] [Google Scholar]
- Lainhart JE, Bigler ED, Bocian M, Coon H, Dinh E, Dawson G, Deutsch CK, Dunn M, Estes A, Tager-Flusberg H, Folstein S, Hepburn S, Hyman S, McMahon W, Minshew N, Munson J, Osann K, Ozonoff S, Rodier P, … Volkmar F (2006). Head circumference and height in autism: A study by the collaborative program of excellence in autism. American Journal of Medical Genetics. Part A, 140(21), 2257–2274. 10.1002/ajmg.a.31465 [DOI] [PMC free article] [PubMed] [Google Scholar]
- McClure P, Reimann GE, Ramot M, & Pereria F (2020). A deep neural network tool for automatic segmentation of human body parts in natural scenes. arXiv. arXiv:2009.09900. [Google Scholar]
- Mehoudar E, Arizpe J, Baker CI, & Yovel G (2014). Faces in the eye of the beholder: Unique and stable eye scanning patterns of individual observers. Journal of Vision, 14(7), 1–11. 10.1167/14.7.6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nayar K, Voyles AC, Kiorpes L, & Di Martino A (2017). Global and local visual processing in autism: An objective assessment approach. Autism Research, 10(8), 1392–1404. 10.1002/aur.1782 [DOI] [PubMed] [Google Scholar]
- Peterson MF, & Eckstein MP (2013). Individual differences in eye movements during face identification reflect observer-specific optimal points of fixation. Psychological Science, 24(7), 1216–1225. 10.1177/0956797612471684 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Poynter W, Barber M, Inman J, & Wiggins C (2013). Individuals exhibit idiosyncratic eye-movement behavior profiles across tasks. Vision Research, 89, 32–38. 10.1016/j.visres.2013.07.002 [DOI] [PubMed] [Google Scholar]
- Ramot M, Walsh C, Reimann GE, & Martin A (2020). Distinct neural mechanisms of social orienting and mentalizing revealed by independent measures of neural and eye movement typicality. Communications Biology, 3, 48. 10.1038/s42003-020-0771-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rice K, Moriuchi JM, Jones W, & Klin A (2012). Parsing heterogeneity in autism spectrum disorders: Visual scanning of dynamic social scenes in school-aged children. Journal of the American Academy of Child & Adolescent Psychiatry, 51(3), 238–248. 10.1016/j.jaac.2011.12.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rogers SL, Speelman CP, Guidetti O, & Longmuir M (2018). Using dual eye tracking to uncover personal gaze patterns during social interaction. Scientific Reports, 8, 4271. 10.1038/s41598-018-22726-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shah P, Bird G, & Cook R (2016). Face processing in autism: Reduced integration of cross-feature dynamics. Cortex, 75, 113–119. 10.1016/j.cortex.2015.11.019 [DOI] [PubMed] [Google Scholar]
- Snow J, Ingeholm JE, Levy IF, Caravella RA, Case LK, Wallace GL, & Martin A (2011). Impaired visual scanning and memory for faces in high-functioning autism spectrum disorders: it’s not just the eyes. Journal of the International Neuropsychological Society, 17(6), 1021–1029. 10.1017/S1355617711000981 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Speer LL, Cook AE, McMahon WM, & Clark E (2007). Face processing in children with autism: Effects of stimulus contents and type. Autism, 11(3), 265–277. 10.1177/1362361307076925 [DOI] [PubMed] [Google Scholar]
- van der Geest JN, Kemner C, Verbaten MN, & van Engeland H (2002). Gaze behavior of children with pervasive developmental disorder toward human faces: A fixation time study. Journal of Child Psychology & Psychiatry, 43(5), 669–678. 10.1111/1469-7610.00055 [DOI] [PubMed] [Google Scholar]
- Xu J, Jiang M, Wang S, Kankanhalli MS, & Zhao Q (2014). Predicting human gaze beyond pixels. Journal of Vision, 14(28), 1–20. 10.1167/14.1.28 [DOI] [PubMed] [Google Scholar]
- Yi L, Fan Y, Quinn PC, Feng C, Huang D, Li J, Mao G, & Lee K (2013). Abnormality in face scanning by children with autism spectrum disorder is limited to the eye region: Evidence from multi-method analyses of eye tracking data. Journal of Vision, 13(10), 1–13. 10.1167/13.10.5 [DOI] [PMC free article] [PubMed] [Google Scholar]