Abstract
While eye movements have been used to analyze behaviors for many years, research studies that employ eye tracking technologies are often limited to basic physical features and fixations, which quickly accumulate into an abundance of data. Because visual behaviors are complex in nature, it can be difficult to draw comparisons and conclusions from subjects’ scanpaths. In this study, we analyze visual activity from 15 expert radiographers and 26 novices as they view a series of images, attempting to discover relationships among a large number of features, including fixation, region, and subject information. We expect that the techniques used in this study will be useful for finding common behaviors in eye tracking data for medical applications. These behaviors could be used to train novices and prevent potential medical errors that occur during visual analysis of medical images.
Introduction
Medical experts from many specialties, such as radiology, pathology, and dermatology, who rely on analysis of imagery develop a repertoire of techniques to make decisions quickly and accurately. An experienced practitioner has a noticeable advantage over a novice in terms of speed and skill; however, it is difficult to measure or explain this expertise, which is primarily non-verbal in nature. Describing the process to an inexperienced viewer, while possibly improving the accuracy of their decisions, does not automatically make them an expert, because there is a component that can only be gained through experience. Understanding how this visual knowledge is acquired by image analysts over time, and which visual information they find useful in their decisions, could dramatically reduce the time to train novice analysts and increase the efficiency of practicing experts.
It is difficult to uncover visual strategies from image analysts, because the application of these strategies is often so brief and effortless as to seem instinctive. Techniques such as the Think-Aloud Protocol are able to capture some of the methods used by experts, but these are limited to knowledge of which the expert is consciously aware. More detrimentally, the slowness of the information capture can severely interrupt the cognitive processes of image analysis1. This research, which is an extension of our previous work2 and part of a larger study, employs eye tracking methods to reveal visual strategies in action with minimal interference for the subjects. By studying the eye movements of expert radiologic technologists and comparing them with those of novices, we attempt to explain the differences between our subjects at a cognitive level through observed behaviors. We also include syntactic- and semantic-level tasks following an image representation hierarchy that addresses degree of expertise.
Eye tracking data accumulates quickly, over many subjects and images. It consequently becomes difficult to identify global trends in the data without carefully chosen tasks. For this reason, we have chosen association rule mining to discover behavioral patterns. By mining associations over a large set of fixations, it is possible to find events that occur with high frequency. This is expected to highlight common visual strategies employed by experts and/or novices, such as an affinity for high-contrast areas or particular semantic regions. Our long-term goal is to use these associations to explain the cognitive processes developed by experts over time. The objective of this study is to determine whether these behaviors are evident in rules mined from gaze data, and whether they might be leveraged for training or modeling.
Background
Eye tracking has been used for many years to study reading, image analysis, driving, and other tasks. However, many studies are limited to the analysis of basic fixation data. Grindinger et al. recently used temporal information from fixation data to successfully distinguish between novice and expert pilots while viewing weather images3. Other studies attempt to quantify scanpath similarity using a set of spatial measurements for fixations4,5.
Many studies in the medical domain use eye tracking to test and design user interfaces6,7, but there are also a number of studies attempting to identify visual behaviors through eye movements. A study conducted by Krupinski et al.8 used expert pathologists, pathology residents, and medical students to investigate behaviors during analysis of pathology images. While the study included a small number of participants, it illustrates some potentially interesting patterns between the different types of users. However, overall, it was difficult to find strong variations between levels of expertise.
While there has been a reasonable amount of success identifying characteristic spatial and temporal features in eye tracking, it is still difficult to describe subject behaviors using these metrics. In particular, in the medical domain where multiple anatomical regions in an image could provide the required information for assessment (such as a low-contrast image or the diagnosis of emphysema), it is important to include image and semantic data for each fixation, in addition to traditional temporal-spatial features such as location and duration. Many medical analysts systematically check images according to trained procedures, careful not to miss subtle indications of trouble. These methodical movements can be problematic for a strictly temporal feature analysis such as frame-by-frame comparison. For medical analyses, it is important to consider high-level, semantic data and the underlying visual characteristics in order to understand the nature of the recorded behaviors.
Methods
The eye tracking performed for this study was part of a larger study which aims to understand the differences between novices and experts at an information-oriented level. Based on the requirements of the larger study, the images chosen for analysis each represented a level of knowledge in an information hierarchy9. While the details of this hierarchy are not imperative for this experiment, a basic understanding helps to interpret the level-based rules. Levels at the top of this hierarchy (levels 1–4) are syntactic features, which require only basic knowledge, such as whether an image is color or grayscale, or a painting or a photograph. The foundation levels (levels 5–10) are semantic features, which require increasing levels of knowledge to identify.
The tasks given for each image attempted to elicit behaviors appropriate for the designated level. For example, the level 4 image was displayed with the prompt "Determine the orientation of this image", while the level 10 image had the task "What is wrong with this patient?". We expect that both novices and experts should be able to answer the level 4 question without difficulty; however, the level 10 question requires more experience. In fact, level 10, which pertains to diagnosis, is beyond the scope of a radiographic technologist's duties. However, experts can frequently provide at least a rudimentary answer, which may affect their analysis; for example, it may determine whether a technologist would consider a foreign object obstruction acceptable.
The data for this study was collected from 15 expert radiologic technologists and 26 students in radiologic sciences. Experts were subjects with at least 1 year working in a clinical setting, while novices had 1 or 2 years of classroom training without field experience. A radiologic technologist is responsible for assessing the quality of medical scans before they are sent to a radiologist, so many of their visual patterns are likely to be based on assessing quality (rather than diagnosis). Each subject was asked to sit in front of a 27” iMac with built-in iSight camera, running the open source software OpenGazer, which recorded eye movements throughout the presentation at approximately 30 Hz. Figure 1 shows a sample scanpath that was collected for this study.
Figure 1.
Raw eye tracking data (orange/red) and fixations (yellow) for a subject viewing an x-ray of the abdomen.
During the collection process, subjects were asked to view a series of 10 images and were given, for each image, a task from the corresponding level of the information hierarchy. They were also asked to view each image a second time and answer whether they noticed anything surprising in it. For the purposes of this study, this second viewing gives us an opportunity to see how behavior varies between the original presentation of an image and one with which the subject is already familiar.
Fixations were calculated based on a maximum radius of 40 pixels (approximately 1 degree of the subject's visual field) in which the gaze dwelled for at least 100 ms. Each image was viewed for 15 seconds after an 8-second calibration period, so a typical viewing generated between 50 and 60 fixation points (μ=55.09, σ=6.90). The entire dataset, excluding calibration data, consisted of 29,952 fixations. For each fixation we compute values such as duration, start and stop time, and distance traveled from the previous fixation.
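As a rough illustration of this fixation criterion, the sketch below groups raw gaze samples into fixations using a running-centroid dispersion test with the 40-pixel radius and 100 ms minimum duration described above. The data frame layout (columns x, y, and t in seconds) and the exact grouping logic are our assumptions for illustration; they are not taken from the OpenGazer output or the original analysis code.

```r
# Minimal dispersion-style fixation detector (sketch, not the study's code).
# 'gaze' is assumed to be a data frame with columns x, y (pixels) and t (seconds).
detect_fixations <- function(gaze, max_radius = 40, min_duration = 0.100) {
  fixations <- list()
  start <- 1
  for (i in seq_len(nrow(gaze))) {
    # centroid of the current candidate window
    cx <- mean(gaze$x[start:i])
    cy <- mean(gaze$y[start:i])
    if (sqrt((gaze$x[i] - cx)^2 + (gaze$y[i] - cy)^2) > max_radius) {
      # the new sample broke the 40-pixel radius: close the previous window
      dur <- gaze$t[i - 1] - gaze$t[start]
      if (i - 1 > start && dur >= min_duration) {
        fixations[[length(fixations) + 1]] <- data.frame(
          x = mean(gaze$x[start:(i - 1)]),
          y = mean(gaze$y[start:(i - 1)]),
          onset = gaze$t[start],
          duration = dur)
      }
      start <- i
    }
  }
  # note: this sketch ignores any unfinished window at the end of the recording
  do.call(rbind, fixations)
}

# usage (hypothetical): fixations <- detect_fixations(gaze_samples)
```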
The features calculated for each fixation are listed in Table 1. In addition to the expertise of the subject and the level of the image, we recorded whether the fixation occurred during the first presentation of the image (case 1) or during the second viewing (case 2). The onset feature represents the starting time of the fixation, while the duration captures the length of time that the subject remained fixated. We provide the distance from the previous fixation so that we can identify how far the subject traveled to view the particular area. These fixation features alone do not provide sufficient detail of the behaviors, so we also combined each fixation with a relevant region of the image (Figure 2).
Table 1.
Fixation and region features used to characterize individual fixations. Values with continuous ranges, such as onset, were binned into 5 equal-frequency bins (a binning sketch follows the table). For the region texture features, P is a co-occurrence matrix, and i and j represent two grayscale values sampled from points in the image.
| fixation features | description | values |
| expertise | level of experience | ”expert” or ”novice” |
| level | information hierarchy level | 1–10 |
| case | first or second viewing of image | 1–2 |
| onset | start time of fixation (seconds) | [0–3), [3–6), [6–9), [9–12), [12–15] |
| from_distance | euclidean distance from last fixation | [0, 56.1), [56.1, 89.5), [89.5, 148.7), [148.7, 259),[259, 1690.2] |
| duration | length of fixation (seconds) | [.1, .156), [.156, .198), [.198, .262), [.262, .368), [.368, 5.769] |
| relevant | fixation falls within an annotated region | TRUE or FALSE |
| region feature | description | values |
| region_name | name of annotated region (if any) | string |
| last_region | region of previous fixation | region id |
| next_region | region of following fixation | region id |
| mean | average grayscale value | [4, 94), [94, 121), [121, 144), [144, 170), [170, 251] |
| contrast | local grayscale variation computed from P | [8.64E-6, 1.46E-4), [1.46E-4, 3.88E-4), [3.88E-4, 8.55E-4), [8.55E-4, 1.94E-3), [1.94E-3, .098] |
| energy | uniformity of P (angular second moment) | [2.57E-4, 6.31E-4), [6.31E-4, 9.3E-4), [9.3E-4, 1.29E-3), [1.29E-3, 2.4E-3), [2.4E-3, .576] |
| entropy | randomness of P | [1.88, 6.49), [6.49, 7.04), [7.04, 7.41), [7.41, 7.76), [7.76, 8.58] |
| homogeneity | closeness of P to its diagonal | [.145, .256), [.256, .317), [.317, .379), [.379, .455), [.455, .863] |
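The equal-frequency binning described in the Table 1 caption can be reproduced with quantile breaks. The sketch below is a minimal base-R version, applied to a hypothetical fixation data frame `fix_data` (the name and columns are placeholders, not the study's code); `arules::discretize()` with its frequency method would be an equivalent route.

```r
# Equal-frequency binning of a continuous feature into 5 bins using quantile breaks.
bin5 <- function(x) {
  cut(x, breaks = unique(quantile(x, probs = seq(0, 1, 0.2), na.rm = TRUE)),
      include.lowest = TRUE)
}

# 'fix_data' and its column names are hypothetical placeholders.
fix_data$onset_bin         <- bin5(fix_data$onset)
fix_data$duration_bin      <- bin5(fix_data$duration)
fix_data$from_distance_bin <- bin5(fix_data$from_distance)
```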
Figure 2.
Expert-based annotations for an image are used to associate fixations with anatomical regions.
We developed a web interface to allow experts to mark and edit polygons atop an image, and give semantic features to the regions outlined by the polygons. With the assistance of an experienced radiographer, regions were selected from each image and given an annotation tag. At this point, we did not consider the extent to which the region was related to the task for a particular image, but defined names for areas of interest to a radiographer in general. These annotations were used to associate fixations with an anatomical region (if applicable). From this relationship, we can also compute previous and next regions visited (excluding fixations that fall outside of any annotated region). We also use the polygon to define a region for extracting common image features such as grayscale histogram values and texture information such as contrast, entropy, and homogeneity10.
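The texture values in Table 1 follow the co-occurrence matrix (GLCM) formulation of Haralick et al.10. The R sketch below computes a single-offset co-occurrence matrix and the four measures for the pixels of a rectangular grayscale crop; the study used polygonal regions, and its exact offsets, gray-level quantization, and logarithm base are not stated, so treat this purely as an illustration of the definitions.

```r
# GLCM texture features for a grayscale image crop (matrix of numeric values).
# Simplified sketch: one horizontal pixel offset, symmetric treatment omitted.
glcm_features <- function(img, levels = 32) {
  # quantize the crop to 'levels' gray bins
  q <- matrix(cut(as.vector(img), breaks = levels, labels = FALSE),
              nrow = nrow(img))
  a <- q[, -ncol(q)]   # left pixel of each horizontal pair
  b <- q[, -1]         # right pixel of each horizontal pair
  P <- table(factor(a, levels = 1:levels), factor(b, levels = 1:levels))
  P <- P / sum(P)      # normalize counts to a probability matrix
  i <- row(P); j <- col(P)
  list(
    contrast    = sum((i - j)^2 * P),
    energy      = sum(P^2),
    entropy     = -sum(ifelse(P > 0, P * log2(P), 0)),  # log base is a choice
    homogeneity = sum(P / (1 + abs(i - j)))
  )
}
```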
To identify behaviors, we construct feature sets from the subject data, the image-based and gaze-based features, and the image regions marked by our expert. The feature sets constitute a database of transactions for association mining. Each transaction represents a single fixation and is composed of a subset of features. With this database, we can discover co-occurrences of these items that describe interesting patterns in the data. To our knowledge, this is the first attempt to mine associations over fixation and region features of this kind.
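One plausible way to assemble such a transaction database with the arules package is to code every feature of a fixation as a factor and coerce the data frame to transactions, so that each row becomes an itemset of feature=value pairs (e.g., "expertise=expert"). For simplicity this sketch builds one transaction per fixation containing all of its features, whereas the study describes transactions built from feature subsets; `fix_data` and its columns remain hypothetical placeholders.

```r
library(arules)

# Each row becomes one transaction; arules encodes factor columns as
# feature=value items such as "expertise=expert" or "onset=[0,3)".
items <- data.frame(
  expertise     = factor(fix_data$expertise),
  level         = factor(fix_data$level),
  case          = factor(fix_data$case),
  onset         = fix_data$onset_bin,          # factors from the binning sketch
  duration      = fix_data$duration_bin,
  from_distance = fix_data$from_distance_bin,
  region_name   = factor(fix_data$region_name),
  relevant      = factor(fix_data$relevant)
)
trans <- as(items, "transactions")
summary(trans)
```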
Results
To mine the associations we built a set of association rules using the apriori algorithm11 implemented in arules12 in the statistical software package R13. Each transaction was constructed from a combination of feature values, or items, for a single fixation. These features are listed in Table 1. For example, given a fixation and its feature values, one potential transaction might contain the items expertise=expert, onset=1, contrast=5 where the expertise is the level of experience, onset (the time of the fixation relative to the start of the session) falls within bin 1 (early), and contrast is bin 5 (high). Features are combined to generate many such transactions for each fixation. All of these transactions are combined into a database.
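A minimal arules invocation consistent with this setup might look like the sketch below; the support threshold matches the value reported later in this section, while the confidence floor and rule-length limits are our own illustrative choices.

```r
# Mine association rules over the fixation transactions (sketch).
rules <- apriori(trans,
                 parameter = list(supp = 0.001,   # ~30 fixations, as used in the study
                                  conf = 0.1,     # illustrative floor
                                  minlen = 2, maxlen = 4))

# peek at the strongest rules by confidence
inspect(head(sort(rules, by = "confidence"), 10))
```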
With these transactions, we look for frequent item sets, in the form of rules. A rule {antecedent} ⇒ {consequent}, such as {expertise=expert,onset=1} ⇒ {contrast=5}, represents the frequency within the database in which the items of the antecedent, {expertise=expert,onset=1}, appear with the consequent, {contrast=5}. In this case, it is the instance where experts, during the first 3 seconds of the viewing, visit areas of high contrast.
From these rules, we can calculate meaningful statistics such as support and confidence. The support for a rule is the percentage of transactions that contain the specific item combination. Confidence, in contrast, is the conditional probability of the consequent given the antecedent. If 1% of our transactions concerned experts during the first 3 seconds of viewing, and half of those transactions were also fixations on areas of high contrast, the example rule would have a support of 0.005 and a confidence of 0.5.
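Stated compactly over a transaction database D (this notation is ours, not the paper's):

```latex
\operatorname{supp}(A \Rightarrow B) = \frac{\left|\{\,T \in D : A \cup B \subseteq T\,\}\right|}{|D|},
\qquad
\operatorname{conf}(A \Rightarrow B) = \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)}
```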
Because of the large feature sets and number of transactions generated, it is important to choose a proper support threshold to filter out infrequent rules. For this study, we chose a threshold value of 0.001, because this represents approximately 30 fixations that demonstrate a behavior. This resulted in a total of 89,498 rules, which were further filtered manually. The process of manual filtering involved an exhaustive review of the rules produced. There were two methods in particular which were applied to our list to make this process simpler.
Many of the rules describe trivial facts; for example, rules with a consequent of {expertise=novice} often had a confidence of approximately 60%, simply because novices accounted for roughly 60% of the transactions in our data set. It is therefore important to determine the expected support and confidence values for a rule. Only when the observed values were substantially different from expected were rules considered interesting.
Another method of selecting interesting rules was to compare a rule's confidence or support with those of the same rule for the complementary expertise group. A novice rule with low support relative to the corresponding expert rule indicated that the behavior appeared relatively more often among experts, while a higher confidence for one group indicated that the behavior was more consistent within that group. In this way, we identified rules whose behaviors differ significantly between experts and novices. A set of interesting rules and their interpretations are discussed in the next section.
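Both screening steps can be approximated with arules. The sketch below flags rules whose confidence deviates from the baseline frequency of their consequent (the "expected" confidence) and splits the rule set by expertise for side-by-side comparison; the 0.15 deviation cutoff is an arbitrary illustrative value, not one used in the study.

```r
# (1) deviation from expected confidence
base_freq <- itemFrequency(trans)               # support of each individual item
rhs_items <- unlist(as(rhs(rules), "list"))     # apriori rules have one-item consequents
expected  <- base_freq[rhs_items]               # expected confidence = consequent support
deviates  <- abs(quality(rules)$confidence - expected) > 0.15   # ad-hoc cutoff
interesting <- rules[deviates]

# (2) compare the same rule across expertise groups
expert_rules <- subset(rules, subset = lhs %in% "expertise=expert")
novice_rules <- subset(rules, subset = lhs %in% "expertise=novice")
```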
Discussion
We have selected a group of interesting rules to discuss as examples of subject behavior (Table 2). We will begin by comparing rules (e1) and (n1). These rules describe the event that a subject fixates on the R Label (indicating a right hand) for image 6 (Figure 3a). For experts, this occurs during the first viewing of the image 92.0% of the time. In contrast, when novices fixate on the label, it is during the first viewing only 61.2% of the time. This significant difference could be due to experts' trained checking of the label; however, this behavior is not consistent across all images. Further focused analysis of image-label behavior may uncover the nature of these tendencies.
Table 2.
Sample association rules mined from eye tracking data.
| # | expert association rules | support | confidence |
| (e1) | {expertise=expert, level=6, region_name=R Label} ⇒ {case=1} | 0.00101 | 0.920 |
| (e2) | {expertise=expert, level=5, region_name=Gradient} ⇒ {case=2} | 0.00166 | 0.884 |
| (e3) | {expertise=expert, region_name=Burnt Out Lung Region} ⇒ {onset=1} | 0.00109 | 0.305 |
| (e4) | {expertise=expert, region_name=R Label} ⇒ {from_distance=5} | 0.00105 | 0.190 |
| (e5) | {expertise=expert, region_mean=5} ⇒ {contrast=5} | 0.02952 | 0.372 |
| # | novice association rules | support | confidence |
| (n1) | {expertise=novice, level=6, region_name=R Label} ⇒ {case=1} | 0.00131 | 0.612 |
| (n2) | {expertise=novice, level=5, region_name=Gradient} ⇒ {case=2} | 0.00315 | 0.783 |
| (n3) | {expertise=novice, region_name=Erect Label} ⇒ {case=2} | 0.00162 | 0.822 |
| (n4) | {expertise=novice, region_name=Collimator Cut Off} ⇒ {onset=1} | 0.00140 | 0.364 |
| (n5) | {expertise=novice, region_name=R Label} ⇒ {from_distance=5} | 0.00254 | 0.326 |
| (n6) | {expertise=novice, region_mean=5} ⇒ {contrast=5} | 0.04286 | 0.364 |
Figure 3.
Areas identified by rules in this study. (a) an R Label spotted frequently by experts on the first viewing of the image. (b) a Gradient caused by x-ray scatter which distracted novices. (c) an Erect Label which was largely ignored by experts. (d) Collimator Cut Off along the edge of the screen, which caught novices' attention early in the viewing process. (e) this Burnt-Out Lung Region was highly relevant to the image and visited early on by experts.
A similar pair of rules, (e2) and (n2), describes the same situation for the Gradient region of image 5 (Figure 3b). This background region has an unusual texture caused by x-ray scatter. Experts fixate on the Gradient during case 2, the second exposure to image 5, 88.4% of the time, while novices do so only 78.3% of the time. This type of finding could indicate a more disciplined analysis of the image, in which study of the scatter pattern is deferred until the subject is asked to describe any surprising findings during the second exposure.
Our third result is rule (n3), in which 82.2% of novice fixations on the Erect Label occurred during case 2 (Figure 3c). There was no corresponding rule for experts, because the support was below the threshold of 0.001. This indicates that few experts fixated on the Erect Label during either case 1 or case 2.
Rule (e3) introduces a new consequent, the onset, which indicates the start time of the fixation. This rule shows that 30% of the fixations on Burnt Out Lung Region, a region of particular interest for the task (Figure 3e), occurred at the beginning of the image viewing (within 3 seconds). Novices, on the other hand, did not meet the threshold for this region, despite its relevance to the task.
The Collimator Cut Off region in rule (n4) was a distracting region on the edge of the image, which novices tended to fixate on within the first few seconds of the task (Figure 3d). As with the Gradient region from rules (e2) and (n2), it seems likely that experts were not adversely affected by the distracting region because the number of expert fixations did not meet the 0.001 threshold. Regions such as collimator cut-off are likely to be ignored by experts because they are more skilled at keying into relevant regions.
The consequent from_distance = 5 is of interest, because it represents a large travel distance from the previous fixation location. Comparing rules (e4) and (n5) shows that novices more often traveled far to visit the R Label region: 32.6% of novice visits to the label arrived from the largest distance bin, compared with 19.0% for experts, which is close to the one-fifth expected by chance.
Finally, we look at the regions of high contrast (contrast = 5) in rules (e5) and (n6). Both novices and experts have an unusually high tendency, around 37%, toward high-contrast regions when viewing bright areas (region_mean = 5). A large majority of the association rules generated from our study, like this example, had very similar confidences. In other words, the behaviors of experts and novices were surprisingly similar across a majority of dimensions. This was evident in the large percentage of rules that did not show a significant difference between the groups.
Our rules show significant trends; however, we are not necessarily interested in a behavior if it is exhibited primarily by a single individual. For example, if a single expert performed the behavior frequently, the rule cannot be generalized to a population. To ensure that our rules represent a significant portion of the population, we plotted the distribution of fixations among subjects for each of the rules listed in Table 2. An example of this distribution is shown in Figure 4. The relatively even distribution of occurrences indicates that these are population-wide behaviors, and not the tendency of a sole individual.
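A simple way to perform this single-subject check, assuming the hypothetical `fix_data` frame from the earlier sketches also carries a `subject_id` column, is to count matching fixations per subject for a given rule and plot the distribution:

```r
# Count, per expert, how many fixations match the items of rule (e1), then plot.
# 'fix_data' and 'subject_id' are hypothetical placeholders, not the study's code.
match_e1 <- with(fix_data,
                 expertise == "expert" & level == 6 &
                   region_name == "R Label" & case == 1)
counts <- table(fix_data$subject_id[match_e1])
barplot(counts, xlab = "expert subject", ylab = "matching fixations",
        main = "Fixations supporting rule (e1), by subject")
```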
Figure 4.
Distribution of fixation counts across experts, showing a well-spread contribution of fixations for rule (e1) from Table 2.
Rules such as (e1) and (n1) show a striking difference between novices and experts; however, they do not apply to feature or image properties as a whole, but rather to a single image region. Very few rules from the fixation set held in the most general cases. This means that our subjects showed little variation in behavior at a global level, across all images and viewings, and that these special cases should be explored in more detail to discover the subtle differences between novices and experts.
Just as Krupinski et al.8 concluded in their pathology study with a smaller number of subjects, there is little variation in global viewing behavior: all subjects maintained similar fixation rates and points of interest. Among the almost 90,000 association rules that met our support threshold, only a small subset provide insightful behavioral differences between experts and novices, which is a common difficulty with association mining methods. However, association mining shows potential as a viable method for distinguishing tacit visual strategies employed by experts. It seems likely that capturing the differences between novices and experts in medical image analysis will require higher-level features than the fixation features listed in Table 1. The rules produced by this analysis are specific to the domain of radiography and the tasks that were given. Further studies could be conducted in which specific tasks for a particular image are designed and interpreted more directly, with input from experts, using rules extracted by this method. It would also be intriguing to study radiologists performing diagnosis to explore the similarities and differences between these domains. We anticipate that the localized details of pathology-bearing regions will play an important role in visual reasoning for diagnosis.
Conclusion
While we are able to identify some interesting associations in the data set, a majority of the behaviors were strikingly similar between experts and novices. This may be due to a smaller-than-anticipated difference in skill level and knowledge between expert and novice subjects, or to the nature of the tasks performed by our subjects. Radiographers are not responsible for diagnosis, but are concerned with the overall quality of an image; this requires a systematic analysis of specific anatomy to verify that it is in view and easily discernible. A study designed to emphasize the skills of expert analysts would be necessary to test whether behavioral differences are more pronounced in other domains or between different tasks or types of imagery. Such differences in behavior could potentially be used to transfer expert behaviors to novices more effectively. This ongoing research may lead us to a better understanding of the visual reasoning process, which can be used for modeling, indexing, and searching expert visual knowledge in medical domains.
Acknowledgments
This project is supported by the National Science Foundation under grant number IIS-0812515. B.A. is funded under the NLM pre-doctoral training grant #5T15LM007089-20.
References
- [1]. Russo JE, Johnson EJ, Stephens DL. The validity of verbal protocols. Memory & Cognition. 1989;17(6):759. doi: 10.3758/bf03202637.
- [2]. Anderson B, Shyu CR. A preliminary study to understand tacit knowledge and visual routines of medical experts through gaze tracking. AMIA 2010 Annual Symposium; Washington, DC, USA; Nov 2010.
- [3]. Grindinger T, Duchowski AT, Sawyer M. Group-wise similarity and classification of aggregate scanpaths. Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA '10); New York, NY, USA: ACM; 2010. pp. 101–104.
- [4]. Jarodzka H, Holmqvist K, Nyström M. A vector-based, multidimensional scanpath similarity measure. Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA '10); New York, NY, USA: ACM; 2010. pp. 211–218.
- [5]. Goldberg JH, Helfman JI. Scanpath clustering and aggregation. Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA '10); New York, NY, USA: ACM; 2010. pp. 227–234.
- [6]. Ellis SM, Hu X, Dempere-Marco L, Yang GZ, Wells AU, Hansell DM. Thin-section CT of the lungs: eye-tracking analysis of the visual approach to reading tiled and stacked display formats. European Journal of Radiology. 2006;59(2):257–264. doi: 10.1016/j.ejrad.2006.05.006.
- [7]. Niimi R, Shimamoto K, Sawaki A, Ishigaki T, Takahashi Y, Sugiyama N, Nishihara E. Eye-tracking device comparisons of three methods of magnetic resonance image series displays. Journal of Digital Imaging. 1997;10:147–151. doi: 10.1007/BF03168836.
- [8]. Krupinski EA, Tillack AA, Richter L, Henderson JT, Bhattacharyya AK, Scott KM, Graham AR, Descour MR, Davis JR, Weinstein RS. Eye-movement study and human performance using telepathology virtual slides: implications for medical education and differences with experience. Human Pathology. 2006;37(12):1543–1556. doi: 10.1016/j.humpath.2006.08.024.
- [9]. Jaimes A, Chang SF. A conceptual framework for indexing visual information at multiple levels. Proceedings of SPIE Internet Imaging 2000; 2000. pp. 2–15.
- [10]. Haralick RM, Shanmugam K, Dinstein I. Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics. 1973 Nov;SMC-3(6):610–621.
- [11]. Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. SIGMOD Record. 1993 Jun;22:207–216.
- [12]. Hahsler M, Buchta C, Gruen B, Hornik K. arules: Mining association rules and frequent itemsets. 2011.
- [13]. R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2008.