Abstract
Deep learning algorithms can extract meaningful diagnostic features from biomedical images, promising improved patient care in digital pathology. Vision Transformer (ViT) models capture long-range spatial relationships and offer robust prediction power and better interpretability for image classification tasks than convolutional neural network models. However, limited annotated biomedical imaging datasets can cause ViT models to overfit, leading to false predictions due to random noise. To address this, we introduce Training Attention and Validation Attention Consistency (TAVAC), a metric for evaluating ViT model overfitting and quantifying interpretation reproducibility. By comparing high-attention regions between training and testing, we tested TAVAC on four public image classification datasets and two independent breast cancer histological image datasets. Overfitted models showed significantly lower TAVAC scores. TAVAC also distinguishes off-target from on-target attentions and measures interpretation generalization at a fine-grained cellular level. Beyond diagnostics, TAVAC enhances interpretative reproducibility in basic research, revealing critical spatial patterns and cellular structures of biomedical and other general nonbiomedical images.
TAVAC enables trustworthy and interpretable Vision Transformer models in biomedical imaging.
INTRODUCTION
Application of deep learning in the analysis of biomedical histological images is rapidly advancing the field of digital pathology by providing nuanced pattern recognition that surpasses traditional methods (1–4). Deep learning models, e.g., convolutional neural networks (CNNs), have been successfully applied to histopathology images for predictive purposes such as classifying tumor (5), predicting mutation (6), and identifying cancer subtypes (7). The clear advantage of deep learning algorithms—particularly when trained on vast and diverse datasets—is their potential to offer unprecedented accuracy in identifying and classifying cellular structures and abnormalities, which is critical for early and precise disease diagnosis. However, key challenges include the scarcity of high-quality, labeled histological datasets for training, high computational demands, and the necessity for transparent models that digital pathologists can trust. Nevertheless, the potential benefits of deep learning, such as automating time-consuming diagnostic tasks and unveiling histological features correlated with patient outcomes, are driving research and innovation forward.
Through training on large histological datasets, deep learning models acquire the ability to extract relevant features that give insight into tumor regions and biomarkers for cancer classification. Among deep learning algorithms, the Vision Transformer (ViT) model has exhibited strong prediction power in image classification tasks (8). Unlike CNNs, ViT models use self-attention techniques to robustly capture long-range relationships that CNNs cannot (9). ViT models process images by splitting them into patches, embedding them into a lower-dimensional space, and adding positional encodings and a class token. These sequences pass through a transformer encoder, and the class token’s values are used in a multilayer perceptron to classify the image (10). In the ViT architecture, the attention mechanism dynamically computes weights for input tokens based on their contextual relevance to a designated query. These computed weights serve to determine the influence of each individual token on the resultant representation, facilitating the model's ability to discern intricate relationships and contextual nuances among tokens.
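The attention weighting described above can be illustrated with a minimal NumPy sketch of scaled dot-product attention. This is not the authors' implementation: the token count, embedding size, and the use of raw tokens as both queries and keys (a real ViT first applies learned linear projections) are assumptions made only for illustration.

```python
import numpy as np

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products: one weight per (query, key) pair."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical input: 1 class token + 4 patch tokens, embedding length 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
weights = attention_weights(tokens, tokens)

# Row 0 holds the class token's weights over all tokens.
print(weights[0])
```

Each row of `weights` sums to 1, so a row can be read as how strongly one token draws on every other token when its representation is updated.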
Transparency, explainability, and interpretability are pivotal characteristics of trustworthy artificial intelligence systems (11). Like other deep learning models, ViT models tend to overfit when data size is limited, which is a major limitation and a common problem for annotated biomedical imaging datasets, e.g., datasets generated using histological methods, computed tomography, x-ray, ultrasound, and optical coherence tomography. Overfitting in machine learning models manifests as an excessive adaptation to the training data, incorporating noise and specific data idiosyncrasies, which impairs the model’s ability to generalize to data it has never been trained with (12–14). This overfitting is often attributable to the noise inherent in a nonrepresentative training dataset and an overly complex hypothesis that causes the model to prioritize variance over bias, leading to inconsistent predictive performance. A model that is optimized for generalization effectively discerns relevant patterns from noise in the training data and maintains a balance between hypothesis complexity and predictive consistency (14). The majority of prior studies on overfitting focused on the predictive performance of deep learning models, whereas the generalization of deep learning model interpretation through attention visualization has been understudied. Ramampiandra et al. (15) previously commented that overfitted models impede the interpretation of machine learning models, yet no solution was proposed to solve the problem.
Model interpretation for ViT models is carried out through analysis of the attention layers, whose weights on the input represent the focus of the model (16). Given the increasing interest in leveraging attention maps to explain predictive models, it is critical to assess how overfitting affects model interpretability. Ideally, the attention maps should highlight semantically meaningful features in the image, so that similar input images will have attention highlighting similar features. The generalization of model interpretation across different datasets therefore needs to be controlled.
Unlike common images, biomedical images (e.g., histological staining) are highly vulnerable to overfitting due to their obscure characteristics, as one cannot explicitly decipher whether a prediction is based on a true biologically meaningful pattern or random noise. Overfitting has particularly serious implications in the analysis of histological image data. Incorrect or misleading predictions can have detrimental effects on clinical diagnoses in the health care sector (1–4). Because of this, assessing the degree of self-consistency in ViT model interpretations is essential to ensure transparent models, accurate predictions, and robust conclusions. Outside of deep learning, this issue has been confronted before in the literature on functional brain imaging, where high-dimensional statistical models are fit to brain scans and scientific inferences are made on the basis of the model parameters (17). In that setting, it was found that model self-consistency and predictive performance do not always align exactly; the highest-performing models imply biased scientific inferences. The solution is to explicitly test for self-consistency by comparing fit model parameters on different splits of the data. Building on this idea and applying it to attention maps, we have developed Training Attention and Validation Attention Consistency (TAVAC), a metric that evaluates model interpretation by quantifying the stability of an image’s attention map between when the image is used for training and when it is used for validation during model fitting. While other evaluation metrics (accuracy, F-1 score, mean squared error, etc.) address whether accurate predictions are made, TAVAC evaluates the generalization of model interpretation.
RESULTS
Overview of TAVAC algorithm for delineating generalization of model interpretation
To address the impact of overfitting on ViT model interpretation of biomedical imaging data, we sought to develop a metric to evaluate model robustness in the context of attention-driven model behaviors, as depicted in Fig. 1. We took inspiration from the established K-fold cross-validation technique, with a particular focus on a twofold implementation (17, 18). The dataset D is split into two distinct folds. In stage 1, a ViT model is trained on the first fold and its performance is evaluated on the second fold. In stage 2, the roles of the folds are reversed: the ViT model is reinitialized, trained on the fold that served for validation in stage 1, and evaluated on the fold that served for training. The TAVAC algorithm examines the model’s internal consistency by comparing, for each image, the pixel-level attention map obtained when the image is used for training in one stage with the map obtained when it is used for validation in the other stage. Within each stage, attention maps for individual images are computed using the attention rollout method (16). When an image is designated as “training data,” the resulting attention map is defined as its training attention; when it is designated as “validation data,” the map is defined as its validation attention. The TAVAC score is then defined as the consistency between training attention and validation attention, measured using the Pearson correlation.
Fig. 1. General workflow of the TAVAC algorithm. The TAVAC algorithm quantifies the generality of ViT model interpretation.
(A) ViT model interpretation enabled by attention rollout. (B) The schematic outlines the two-stage process where training and validation subsets are used alternately to develop a ViT model and assess the internal consistency of model interpretation. Attention maps are generated via attention rollout for the same image (horse image from the training subset in this example) from ViT models trained with either training or validation subsets, with TAVAC scores computed based on the Pearson correlation between these pixel-level attention map distributions, reflecting model attention consistency across different ViT models over the same horse image.
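The workflow in Fig. 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the per-layer attention tensors are random stand-ins for what the two stage models would produce, the function names are hypothetical, and the upsampling of the patch-level map to pixel resolution is omitted.

```python
import numpy as np

def attention_rollout(layer_attentions):
    """Combine per-layer attention, given as a list of (heads, n, n) arrays."""
    n = layer_attentions[0].shape[-1]
    rollout = np.eye(n)
    for attn in layer_attentions:
        a = attn.mean(axis=0)                  # average over heads
        a = a + np.eye(n)                      # add identity for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # renormalize each row
        rollout = a @ rollout                  # propagate through this layer
    return rollout

def cls_attention_map(rollout, grid):
    """Attention from the class token (index 0) to each patch, as a grid."""
    return rollout[0, 1:].reshape(grid, grid)

def tavac_score(training_attention, validation_attention):
    """Pearson correlation between two attention maps of the same image."""
    x = np.ravel(training_attention)
    y = np.ravel(validation_attention)
    return float(np.corrcoef(x, y)[0, 1])

def random_attention(rng, layers=3, heads=2, n=5):
    """Stand-in for model output: row-normalized random attention."""
    raw = rng.random(size=(layers, heads, n, n))
    return list(raw / raw.sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
# Same image, two models: stage 1 (image in training fold), stage 2 (validation fold).
train_map = cls_attention_map(attention_rollout(random_attention(rng)), grid=2)
val_map = cls_attention_map(attention_rollout(random_attention(rng)), grid=2)
print(tavac_score(train_map, val_map))
```

Adding the identity before renormalizing is equivalent to averaging each attention matrix with the identity, which is how attention rollout accounts for residual connections.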
Evaluate TAVAC using four image benchmark datasets as proof of concept
To evaluate the TAVAC algorithm, we trained two types of pretrained ViT models by varying the epoch settings using transfer learning, a good-fit model and an overfitted model, and examined their attention maps and TAVAC score distributions (Table 1). The parameters for the good-fit model were obtained through a grid search. Subsequently, we increased the number of epochs to intentionally overfit the model and obtain the overfitted model. We evaluated the TAVAC algorithm using four well-known image classification benchmark datasets: Canadian Institute for Advanced Research, 10 classes (CIFAR-10); Modified National Institute of Standards and Technology (MNIST); Food-101; and Cats vs. Dogs (19–22). The variety of image types and class groupings among these datasets allows us to test TAVAC’s effectiveness and generality: (i) The CIFAR-10 dataset consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class. (ii) The MNIST dataset consists of 70,000 28 × 28 grayscale images of 10 classes of handwritten digits, i.e., numerals. (iii) The Food-101 dataset consists of 101,000 images of various foods in 101 classes with varying dimensions. (iv) Last, the Cats vs. Dogs dataset, a subset of the Animals Species Image Recognition for Restricting Access dataset, contains 23,442 cat and dog images of varying dimensions. As expected, excessively high epoch settings led to an overfitted model, while comparatively lower epoch settings generated good-fit models with comparable training and validation accuracies (Table 2), consistent with general machine learning knowledge (23). The settings corresponding to good-fit and overfitted models varied for each dataset, with overfitting on MNIST and Cats vs. Dogs requiring substantial epoch numbers, while overfitting on CIFAR-10 and Food-101 required much lower epoch numbers (Table 1).
We assessed TAVAC’s ability to distinguish the impact of overfitting on model interpretation using each of the datasets, as described below, to show that the overfitted and good-fit ViT models have different TAVAC score distributions. The overfitted models in this study were intentionally created by training for a high number of epochs to demonstrate the differences in TAVAC score distribution between overfitted and good-fit models. In real-life applications, one should avoid the excessive training applied here. Hyperparameters should be optimized to achieve high TAVAC scores.
Table 1. Hyperparameters for training good-fit and overfitted ViT models.
| Dataset | Model type | Number of epochs | Learning rate | Weight decay | Batch size |
|---|---|---|---|---|---|
| CIFAR-10 | Good-fit | 10 | 2 × 10⁻⁵ | 0.01 | 10 |
| CIFAR-10 | Overfitted | 100 | 2 × 10⁻⁵ | 0.01 | 10 |
| MNIST | Good-fit | 1500 | 1 × 10⁻⁴ | 0 | 4 |
| MNIST | Overfitted | 2000 | 1 × 10⁻⁴ | 0 | 4 |
| Food-101 | Good-fit | 10 | 1 × 10⁻⁴ | 0 | 64 |
| Food-101 | Overfitted | 100 | 1 × 10⁻⁴ | 0 | 64 |
| Cats vs. Dogs | Good-fit | 1000 | 1 × 10⁻⁴ | 0 | 32 |
| Cats vs. Dogs | Overfitted | 5000 | 1 × 10⁻⁴ | 0 | 32 |
Table 2. Training and validation accuracies of ViT models.
| Dataset | Model type | Stage 1 training accuracy | Stage 1 validation accuracy | Stage 2 training accuracy | Stage 2 validation accuracy |
|---|---|---|---|---|---|
| CIFAR-10 | Good-fit | 0.90 | 0.93 | 0.968 | 0.942 |
| CIFAR-10 | Overfitted | 0.944 | 0.74 | 0.982 | 0.752 |
| MNIST | Good-fit | 0.958 | 0.958 | 0.96 | 0.966 |
| MNIST | Overfitted | 0.962 | 0.942 | 0.964 | 0.946 |
| Food-101 | Good-fit | 0.982 | 0.946 | 0.952 | 0.939 |
| Food-101 | Overfitted | 0.987 | 0.915 | 0.988 | 0.914 |
| Cats vs. Dogs | Good-fit | 1.0 | 0.982 | 0.972 | 1.0 |
| Cats vs. Dogs | Overfitted | 1.0 | 0.962 | 0.998 | 0.958 |
Generalization of model interpretation by TAVAC on image benchmark datasets
We applied TAVAC to good-fit and overfitted ViT models trained using 1000 randomly selected images from the CIFAR-10 dataset (500 for each fold). We generated a good-fit model using 10 epochs based on a pretrained ViT model (10), which has comparable training and validation accuracies, in both stage 1 and stage 2 (Table 2). We then compared each image’s training attention from stage 1 and validation attention from stage 2 (and vice versa; Fig. 1) to determine the TAVAC score distribution for each model. As expected with the good-fit model, both training attention and validation attention have high attention regions (HARs) primarily focused on the horse object (Fig. 2A). Next, we examined the TAVAC of an overfitted model of 100 epochs for the same dataset. This variation of the model had a lower validation accuracy than training accuracy, in both stage 1 and stage 2 (Table 2). Moreover, the overfitted model’s HAR of training attention focused on the horse object, while the HAR of validation attention fixated on a nonmeaningful section of the image, away from the object of interest (Fig. 2B).
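The twofold split used here (1000 images, 500 per fold, with roles swapped between stages) can be sketched as follows. The function names and the fixed random seed are hypothetical; model training and attention extraction are out of scope for this sketch.

```python
import random

def two_fold_split(image_ids, seed=0):
    """Shuffle the image ids and cut them into two equal folds."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def tavac_stages(image_ids, seed=0):
    """Stage 1 trains on fold A; stage 2 swaps the roles of the folds."""
    fold_a, fold_b = two_fold_split(image_ids, seed)
    return [
        {"stage": 1, "train": fold_a, "validation": fold_b},
        {"stage": 2, "train": fold_b, "validation": fold_a},
    ]

def attention_roles(stages):
    """For each image, record which stage yields which kind of attention."""
    roles = {}
    for s in stages:
        for image_id in s["train"]:
            roles.setdefault(image_id, {})["training"] = s["stage"]
        for image_id in s["validation"]:
            roles.setdefault(image_id, {})["validation"] = s["stage"]
    return roles

stages = tavac_stages(range(1000))
roles = attention_roles(stages)
print(len(stages[0]["train"]), len(stages[0]["validation"]))
```

Every image ends up with exactly one training attention map and one validation attention map, taken from different stage models, which is what makes a per-image TAVAC score possible.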
Fig. 2. TAVAC evaluation on image datasets using good-fit versus overfitted ViT models.
(A and B) Attention maps use red for high attention, blue for low, and green for intermediate. The density scatterplots of TAVAC show pixel-resolution training attention (x axis) against validation attention (y axis); each data point represents a pixel of the image, and the color key indicates the density of data points. (C) TAVAC score distributions: Histograms for CIFAR-10 illustrate the frequency distribution of TAVAC scores, blue for well-fitted and yellow for overfitted, highlighting diminished attention consistency with increased overfitting. (D) Three manually selected example images and their matched validation attention maps from overfitted models: original (left) and validation attention (right). (E) TAVAC scores of off-target (n = 114) versus on-target (n = 886) attention maps from manual annotation. Off-target: manually annotated images with poor attention focus during validation in overfitted cases; on-target: manually annotated images where high attention is maintained on the object of interest during validation. (F) Box plots compare TAVAC scores for images with “correct” versus “wrong” predictions. Correct and wrong depict images that the ViT model predicted accurately or inaccurately, respectively. (G to I) Cross-dataset independent validation: TAVAC distributions for MNIST, Food-101, and Cats vs. Dogs validate the CIFAR-10 findings, with consistent attention reduction in overfitted models.
We then used a scatterplot to quantify the TAVAC at the pixel level. Using the horse image as an example (Fig. 2A), we plotted pixel-level training attention against validation attention for both the good-fit and overfitted models. The good-fit model’s scatterplot (Fig. 2A, third column) exhibited a strong positive correlation between training attention and validation attention, yielding a Pearson correlation coefficient (r value) of 0.95, which we denote as the TAVAC score. In contrast, the overfitted model’s scatterplot (Fig. 2B) exhibited a weak correlation between training attention and validation attention, with an r value of 0.37. As a holistic analysis, we computed TAVAC on every image and compared the distributions between good-fit and overfitted models (Fig. 2C). The TAVAC score of the overfitted model [median: 0.644, interquartile range (IQR): 0.327] is significantly lower and more variable than that of the good-fit model [median: 0.960, IQR: 0.045, P value < 0.001, Kolmogorov-Smirnov (KS) test]; thus, the TAVAC scores decline substantially when the ViT model is overfitted for the CIFAR-10 dataset.
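The distribution summaries reported above (median, IQR, and a two-sample KS comparison) can be computed as in the sketch below. The score arrays are synthetic stand-ins, not the study's data, and only the KS statistic is computed here; a library routine such as `scipy.stats.ks_2samp` would also return a P value.

```python
import numpy as np

def median_iqr(scores):
    """Median and interquartile range of a set of TAVAC scores."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return float(med), float(q3 - q1)

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic, made-up score distributions for illustration only.
rng = np.random.default_rng(0)
good_fit = np.clip(rng.normal(0.95, 0.03, size=500), -1.0, 1.0)
overfit = np.clip(rng.normal(0.65, 0.25, size=500), -1.0, 1.0)

print(median_iqr(good_fit), median_iqr(overfit))
print(ks_statistic(good_fit, overfit))
```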
To evaluate the robustness of TAVAC in identifying meaningful high-attention regions, we visually inspected all the 1000 images. We annotated 114 images, in which the HAR of the validation attention of the overfitted model does not overlap with the object of interest and is defined as the “off-target” group (Fig. 2D). The remaining images are annotated as the “on-target” group. The off-target group exhibits significantly lower TAVAC scores than the on-target group (P value = 0.00001, t test; Fig. 2E). Notably, the TAVAC scores of the off-target group are as low as −0.2, while the scores of the on-target group are all positive. This suggests that TAVAC can accurately identify instances where the ViT model does not use the meaningful image features for prediction by attributing a low score to them. Moreover, we compared the TAVAC scores of incorrectly predicted images, labeled as wrong, and correctly predicted images, labeled as correct. After our analysis, we conclude that incorrectly predicted images have lower TAVAC scores than correctly predicted images (P value = 0.025, t test; Fig. 2F). However, the two groups show sizeable overlap, indicating that the prediction performance is not identical to the TAVAC score, and TAVAC can serve as an independent evaluator for ViT model interpretation.
We further applied the TAVAC algorithm to three additional datasets: MNIST, Food-101, and Cats vs. Dogs. A summary of the models’ hyperparameters and performance is provided in Tables 1 and 2. In the MNIST dataset, the overfitted model (median: 0.979) exhibited modestly lower TAVAC scores than the good-fit model (median: 0.989) with statistical significance (P value = , KS test; Fig. 2G). Likewise, the overfitted model of the Food-101 dataset (median: 0.919) exhibited lower TAVAC scores than the good-fit model (median: 0.955) with statistical significance (P value = , KS test; Fig. 2H). Similarly, for the Cats vs. Dogs dataset, the overfitted model (median: 0.895) showed lower TAVAC scores than the good-fit model (median: 0.942) with statistical significance (P value = , KS test; Fig. 2I). As with the CIFAR-10 dataset, the manually annotated off-target group in the Cats vs. Dogs dataset exhibits significantly lower TAVAC scores than the on-target group (P value = 0.018, t test; fig. S1). Overall, the TAVAC algorithm is robust in distinguishing the inaccurate attention maps of overfitted models from the accurate attention maps of good-fit models across various types of images, e.g., the handwritten number “8,” a hamburger, and a dog (Fig. 3). As overfitting increased with the number of epochs, TAVAC scores decreased in all four image benchmark datasets. The results suggest that TAVAC robustly quantifies the impact of overfitting on ViT model interpretation and the attention maps, which contributes to the transparency of deep learning models. The TAVAC scores for all images can be considered as a numerical distribution. The frequency of TAVAC scores below 0.8 is less than 0.05, which suggests that low TAVAC scores are rare events. This threshold can serve as a guideline for filtering out poorly performing models.
Fig. 3. Pixel-level TAVAC analysis for MNIST, Food-101, and Cats vs. Dogs.
(A) MNIST 8 image: TAVAC analysis reveals high attention consistency for a well-fitted model (TAVAC score barely changes with slight overfitting). (B) Food-101 hamburger image: Well-fitted model attention maps closely align (high TAVAC score of 0.94), whereas the overfitted model shows substantial attention misalignment (low TAVAC score of 0.41). (C) Cats vs. Dogs dog image: Attention is consistent in well-fitted models (TAVAC score of 0.94), with a substantial drop in overfitted models (TAVAC score of 0.32). Density scatterplots accompany each analysis, with purple indicating high density (consistent attention) and yellow indicating low density (inconsistent attention). These comparisons demonstrate that TAVAC scores decrease as model overfitting increases.
Application of TAVAC to two independent breast cancer pathological image datasets
To examine the impact of overfitting on ViT model interpretation of biomedical images, we next applied the TAVAC method to the transfer learning models trained using two independent breast cancer pathological image datasets. The first breast cancer hematoxylin and eosin (H&E) dataset (24) consists of 23 patients, 68 tissue sections, and 30,612 spots (diameter: 100 μm). For each patient, we used a collection of three microscopic slide images of high-resolution H&E-stained tissue, all of which have tumor or nontumor annotation by pathologists. We subsequently applied the TAVAC algorithm by dividing the dataset into two random partitions, each composed of approximately 15,000 spots. We then trained the pretrained ViT model (10) to predict tumor and nontumor classes using H&E image spots (example in Fig. 4A) via transfer learning (Tables 3 and 4). The confusion matrix of the ViT model prediction is shown in Fig. 4B. We derived training and validation attention maps using 1000 randomly selected H&E image spots, similar to the procedure for the image benchmark datasets. The ViT model of the breast cancer H&E images exhibits high TAVAC scores for the majority of tissue spots with minimal spread (median: 0.933, IQR: 0.070; Fig. 4C). We further visualized the consistency of the training and validation attention of H&E images with various TAVAC scores. The high-TAVAC-score images (Fig. 4D and fig. S2) exhibit HARs of both training and validation attention enriched for tumor and cellular regions (dark color). In contrast, the low-TAVAC-score images have HARs of validation attention that are inconsistent with the training attention, suggesting that nonspecific image features were used for the prediction of tumors (Fig. 4E and fig. S3). All predictions for training and validation are correct, and all samples in Fig. 4 (D and E) are tumor samples.
This is consistent with expectation, as the ViT is trained to predict the presence of tumor tissue; thus the HARs are expected to overlap with the tumor regions in the H&E images. This suggests that the high-TAVAC score image HARs are more biologically meaningful than the low TAVAC score image HARs. Thus, the results showed that the TAVAC score can serve as a quality control metric of ViT model interpretation to rank H&E images for attention maps of biologically meaningful HAR.
Fig. 4. TAVAC quantifies the slide- and pixel-level generality of ViT model interpretation using breast cancer pathology images.
(A) Paired H&E images of breast tissue, with a full tissue cut on the left and a magnified spot image on the right. (B) Confusion matrices for validation data: the TAVAC stage 1 model at the top and the TAVAC stage 2 model below. (C) Histogram of TAVAC scores shows consistent training-validation attention across the majority of spots. (D) A high-TAVAC-score case (0.96) where training and validation attentions align on tumor spots, as shown in the density scatterplot. (E) Low-TAVAC-score cases (0.62 and 0.31) display a marked difference in attention between training (accurately focused on tumor regions) and validation (misaligned). Both (D) and (E) confirm the tumor presence with correct model predictions across training and validation.
Table 3. Hyperparameters for training breast cancer pathological image ViT models.
| Dataset | Number of epochs | Learning rate | Weight decay | Batch size |
|---|---|---|---|---|
| Breast cancer | 500 | 1 × 10⁻⁵ | 0 | 32 |
| HER2 | 500 | 1 × 10⁻⁵ | 0.001 | 10 |
Table 4. Training and validation accuracies of breast cancer ViT models.
| Dataset | Stage 1 training accuracy | Stage 1 validation accuracy | Stage 2 training accuracy | Stage 2 validation accuracy |
|---|---|---|---|---|
| Breast cancer | 0.994 | 0.867 | 0.993 | 0.880 |
| HER2 | 0.995 | 0.847 | 0.991 | 0.846 |
For independent verification, we performed the TAVAC analysis on another H&E image dataset: the human epidermal growth factor receptor 2 positive (HER2+) breast cancer dataset (25) from eight HER2+ patients. From each patient’s respective tumor specimen, three adjacent or six evenly spaced tissue sections were extracted (25). In addition, a pathologist annotated a representative tissue section from each distinct tumor according to the morphology exhibited by the H&E image. The annotated data have nine replicates of H&E images totaling around 3800 spots. The categorizations include in situ cancer, invasive cancer, adipose tissue, immune infiltrate, or connective tissue (25). We used invasive cancer or noninvasive cancer as class labels for each spot. We divided the spot H&E image dataset into two random partitions to train ViT models (Tables 3 and 4) for invasive cancer prediction (example in Fig. 5A). The confusion matrix of the classification is shown in Fig. 5B. After randomly selecting a subset of 500 spot images from each partition, we generated their respective attention maps. The ViT model of the HER2+ breast cancer H&E images exhibits high TAVAC scores for the majority of tissue spots with minimal spread (median: 0.971, IQR: 0.031; Fig. 5C). We then compared the training and validation attention of H&E images with various TAVAC scores. The high-TAVAC-score image (Fig. 5D) exhibits HARs of both training and validation attention enriched for tumor and cellular regions (dark color). Conversely, the low-TAVAC-score images have HARs of validation attention that are inconsistent with the training attention, where the HARs of validation attention are enriched for nontumor regions (Fig. 5E). All predictions for training and validation are correct. Figure 5 (D and E, bottom) represents invasive cancer samples, while Fig. 5E (top) represents a noninvasive cancer sample.
The results collectively confirmed the TAVAC score as a quality control metric of ViT model interpretation to rank H&E images with attention maps of biologically meaningful HAR. Thus, we visualized the high versus low TAVAC score for each spot on the H&E image (Fig. 6A and fig. S4) to identify the spots to filter out for interpretation study. The low TAVAC score spots that should be filtered out (<0.9) are shown as black dots (Fig. 6B) given the high discrepancy between training and validation attention maps. Moreover, the low TAVAC spots are spread across the tissue without association with specific pathological categories (Fig. 6C and fig. S4).
Fig. 5. Independent validation of TAVAC performance using HER2+ breast cancer pathology image ViT model interpretation.
(A) Sample H&E-stained full tissue cut and a corresponding detailed spot image of breast tissue. (B) Confusion matrices for HER2+ breast cancer two-stage models: stage 1 (top) and stage 2 (bottom) model validation. (C) Histogram of TAVAC scores, indicating consistent attention between training and validation for the majority of evaluated spots. (D) High TAVAC score (0.96) example in noncancer H&E images where training and validation attention converge on identical regions, supported by a density scatterplot. (E) Low TAVAC score (0.63) case in invasive cancer H&E images, exhibiting a clear visual discrepancy in attention focus between training and validation, especially on dark tumor spots, as shown in the density scatterplot. For both (D) and (E), the model predictions are accurate in both the training and validation phases.
Fig. 6. Application of TAVAC scores to filter low-quality fine-grained model interpretation.
(A) Spots to be excluded (TAVAC < 0.9) are highlighted as (B) black dots in the second column due to low TAVAC scores, i.e., their lack of consistency between training and validation attention maps. (C) Notably, low TAVAC spots are distributed throughout the tissue, without a distinct correlation to specific pathologist-annotated categories.
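The filtering step shown in Fig. 6 amounts to a threshold on per-spot TAVAC scores. The sketch below uses the 0.9 cutoff from Fig. 6 and the 0.8 guideline mentioned for the benchmark datasets; the spot identifiers and scores are made up, and the function names are hypothetical.

```python
def split_by_tavac(scores, threshold=0.9):
    """Separate spots to keep from spots to exclude from interpretation."""
    keep = {spot: s for spot, s in scores.items() if s >= threshold}
    drop = {spot: s for spot, s in scores.items() if s < threshold}
    return keep, drop

def fraction_below(scores, threshold):
    """Share of spots whose TAVAC score falls below the threshold."""
    values = list(scores.values())
    return sum(s < threshold for s in values) / len(values)

# Made-up per-spot scores keyed by hypothetical spot identifiers.
spot_scores = {"spot_01": 0.97, "spot_02": 0.62, "spot_03": 0.93, "spot_04": 0.88}

keep, drop = split_by_tavac(spot_scores, threshold=0.9)
print(sorted(keep), sorted(drop))
print(fraction_below(spot_scores, 0.8))
```

The cutoff is a parameter because the text uses different values in different contexts; the right threshold for a new dataset would presumably be chosen from its own score distribution.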
DISCUSSION
TAVAC quantifies the impact of overfitting on ViT model interpretation. It evaluates, at pixel resolution, how well the visual features that a model focuses on when predicting object types or tumor status generalize. TAVAC generalizes across four different types of image benchmark datasets. The algorithm also contributes to the quality assessment of model interpretation by assigning a lower TAVAC score when overfitted ViT models attend to regions of the image that are away from the object of interest. Moreover, TAVAC quantifies the level of generalization of the histological features that cause ViT models to predict tumor status in a breast cancer H&E image dataset, and it is robust on the independent HER2+ breast cancer data. It identifies fine-grained spatial variations in the generalization level of model interpretation within the tumor or the normal tissue, respectively, using the pathologist-defined tumor and normal annotations, e.g., connective tissues or immune infiltrate.
A computational complexity analysis of TAVAC demonstrates its feasibility for deployment across various settings, including those involving large datasets and complex models commonly used in digital pathology. Specifically, the time complexity for calculating one attention layer is O(n²d) (26), where n is the length of the input sequence (the number of tokens in ViT) and d is the length of the embeddings. Consequently, the total time complexity for all attention calculations is O(Ln²d), where L is the number of attention layers. In addition, the attention rollout algorithm aggregates the attention layers by matrix multiplication, taking O(Ln³) (16). In summary, the overall time complexity for TAVAC calculation per image is O(Ln²d + Ln³), which is polynomial in the sequence length, i.e., the number of tokens. This polynomial relationship underscores TAVAC’s scalability and practicality for real-world applications, particularly in the context of digital pathology where large and complex datasets are prevalent.
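A back-of-the-envelope sketch of the per-image cost follows. It assumes the standard n² × d operation count for one self-attention layer and a naive n³ count for each n × n matrix product in attention rollout; both are order-of-magnitude assumptions rather than measured costs. The example dimensions (197 tokens, embedding length 768, 12 layers) correspond to a common ViT-Base configuration and are also an assumption, since the text does not state the model dimensions.

```python
def attention_cost(n, d, layers):
    """Approximate operation count for self-attention across all layers."""
    return layers * n * n * d

def rollout_cost(n, layers):
    """Approximate operation count for multiplying one n x n matrix per layer."""
    return layers * n ** 3

def per_image_cost(n, d, layers, models=2):
    """TAVAC needs attention maps from two models (stage 1 and stage 2)."""
    return models * (attention_cost(n, d, layers) + rollout_cost(n, layers))

# Assumed ViT-Base-like dimensions: 196 patches + 1 class token.
n_tokens, embed_dim, n_layers = 197, 768, 12

print(attention_cost(n_tokens, embed_dim, n_layers))
print(rollout_cost(n_tokens, n_layers))
print(per_image_cost(n_tokens, embed_dim, n_layers))
```

The factor of two for the stage 1 and stage 2 models is a constant and does not change the polynomial order in the number of tokens.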
Recently, deep learning models focusing on more detailed labeling of smaller cellular clusters (multicell) to predict spatial transcriptome data using histological image data (24, 25) have started to uncover complex patterns in the relationship between tissue structure and gene expression, providing insights into biological processes and diseases. TAVAC measures the level of generalization of deep learning model interpretation, not only on the slide level but also on the fine-grained level for small groups of cells. The application of TAVAC will allow for robust model interpretation in deep learning models leveraging commonly available H&E-stained images, especially those digitized through modern digital pathology techniques, for high-resolution tumor status or subtype prediction. Theoretically, any association measure can be selected to replace Pearson correlation as the consistency measure. Alternatives such as Spearman’s rank correlation, Kendall’s Tau, Hoeffding’s D, distance correlation, and mutual information each have their own limitations. Spearman’s and Kendall’s measure only monotonic relationships and can be computationally intensive with large datasets. Hoeffding’s D, while capable of detecting nonlinear relationships, is complex and less intuitive. Distance correlation, although powerful, is computationally demanding and harder to interpret. Mutual information and the maximal information coefficient require substantial computational resources and can be sensitive to how data are binned, potentially overestimating dependencies (27). Overall, while these alternatives can detect a wider range of relationships and offer robustness to different conditions, they come with increased computational demands and complexity in interpretation.
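To make the choice of association measure concrete, the sketch below contrasts Pearson correlation with Spearman’s rank correlation on a pair of toy attention maps. The maps are made up so that one is a monotonic but nonlinear transform of the other, and the rank computation does not average ties, which a full Spearman implementation would do.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two flattened arrays."""
    return float(np.corrcoef(np.ravel(x), np.ravel(y))[0, 1])

def ranks(values):
    """Rank of each element (0 = smallest); ties are not averaged."""
    return np.argsort(np.argsort(np.ravel(values)))

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

# Toy maps: validation attention is a cubic (monotonic) transform of training attention.
train_attention = np.linspace(0.1, 1.0, 16).reshape(4, 4)
val_attention = train_attention ** 3

print(pearson(train_attention, val_attention))
print(spearman(train_attention, val_attention))
```

Because the transform preserves the ordering of pixels, the rank-based score treats the two maps as fully consistent, while the Pearson score is reduced by the nonlinearity. Which behavior is preferable depends on whether the relative magnitude of attention, and not only its ordering, should count toward consistency.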
TAVAC provides insights into model behavior that traditional metrics such as accuracy or validation loss do not. Traditional metrics focus solely on prediction performance, whereas TAVAC evaluates the unstructured image features that the model uses. Although TAVAC scores and prediction performance are correlated, they are not the same. Incorrectly predicted images have lower TAVAC scores than correctly predicted images (P = 0.025, t test; Fig. 2F). However, the two groups overlap considerably, indicating that prediction performance does not fully determine the TAVAC score. Thus, TAVAC serves as an independent evaluator of ViT model interpretation. Similar observations hold for the TAVAC applications on the two breast cancer datasets. In Figs. 4 (D and E) and 5 (D and E), although all predictions for these images in the validation and training sets are correct, the TAVAC score varies from image to image. This indicates that TAVAC detects inconsistencies in how the model arrives at its predictions on training and validation data, which is beneficial for interpreting results in biomedical image applications.
Overfitting affects model interpretability (15). TAVAC is the first algorithm focused on the generalization of ViT model interpretation. Using TAVAC to filter out low-quality data points, for which the prediction is based on random noise, increases confidence in the model interpretation. Future work includes a direct extension of TAVAC to general transformer models (for instance, natural language processing models), where attention can be extracted from the input sequence data. Alternative consistency metrics also remain a viable direction for future exploration. In addition, optimization and training strategies can be developed to obtain models with high TAVAC scores. Thus, the TAVAC score can also serve as an independent hyperparameter tuning metric to ensure the generalization of transformer models.
MATERIALS AND METHODS
Training Attention and Validation Attention Consistency
The TAVAC score is computed as follows. We first split dataset D into two equal subsets, as in twofold cross-validation, and designate one subset for training and the other for validation. In the first stage, we train a model on the training subset and evaluate it on the validation subset. In the second stage, we swap the roles of the two subsets and repeat the procedure. Each image therefore has two attention maps, both generated via attention rollout: a training attention map from the stage in which the image belonged to the training subset and a validation attention map from the stage in which it belonged to the validation subset. The TAVAC score is the Pearson correlation between these two attention maps for the same image, a metric that may be extended to alternative consistency measures in future research. The pseudocode of the process is given in Algorithm 1.
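The two-stage procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: `train_fn` and `attention_fn` are hypothetical callables standing in for ViT training and attention rollout, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
import numpy as np


def tavac_scores(images, labels, train_fn, attention_fn, seed=0):
    """Per-image TAVAC via a two-way split with the roles of the halves swapped."""
    n = len(images)
    order = np.random.default_rng(seed).permutation(n)
    half_a, half_b = order[: n // 2], order[n // 2:]

    # Stage 1 trains on half A; stage 2 swaps the roles and trains on half B.
    model_a = train_fn([images[i] for i in half_a], [labels[i] for i in half_a])
    model_b = train_fn([images[i] for i in half_b], [labels[i] for i in half_b])

    scores = {}
    for seen_model, unseen_model, idxs in ((model_a, model_b, half_a),
                                           (model_b, model_a, half_b)):
        for i in idxs:
            train_attn = attention_fn(seen_model, images[i])    # image was in training
            val_attn = attention_fn(unseen_model, images[i])    # image was held out
            r = np.corrcoef(np.ravel(train_attn), np.ravel(val_attn))[0, 1]
            scores[int(i)] = float(r)
    return scores


# Toy stand-ins (hypothetical): the "model" is the mean training image and
# the "attention map" is the image weighted by that mean.
def toy_train(train_images, train_labels):
    return np.mean(train_images, axis=0)


def toy_attention(model, image):
    return image * model


rng = np.random.default_rng(1)
toy_images = rng.random((8, 4, 4))
toy_labels = np.zeros(8, dtype=int)
scores = tavac_scores(toy_images, toy_labels, toy_train, toy_attention)
```

Passing the training and attention steps as callables keeps the split-and-swap logic separate from any particular model, so the same loop could wrap a real ViT and a real rollout routine.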
Attention rollout
Attention maps for the ViT models were generated with attention rollout, which was used to calculate the attention weight assigned to each pixel of every image. The core of the ViT architecture consists of self-attention layers, which allow the model to capture long-range spatial relationships within our images (28). As each token progresses through the layers of the ViT model, its attention weights are propagated and adjusted accordingly (16). This enabled us to see how much each pixel of an image contributed to the prediction of the ViT model.
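A minimal NumPy sketch of attention rollout, following the general recipe of (16), is given below: attention heads are averaged, an identity term is added to account for residual connections, rows are renormalized, and the per-layer matrices are multiplied from the first layer to the last. The equal 0.5/0.5 weighting, the random attention matrices, and the function names are assumptions for illustration; the resulting map is at patch resolution and would need to be upsampled to obtain per-pixel values.

```python
import numpy as np


def attention_rollout(attentions):
    """attentions: list of (heads, tokens, tokens) arrays, first layer first."""
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)                    # average over heads
        fused = 0.5 * fused + 0.5 * identity               # residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows summing to 1
        rollout = fused @ rollout                          # propagate through layers
    return rollout


def cls_attention_map(rollout, grid):
    """Attention from the class token (index 0) to each patch, as a grid."""
    return rollout[0, 1:].reshape(grid, grid)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Toy input: 2 layers, 3 heads, 1 class token + 4 patches (2 x 2 grid).
rng = np.random.default_rng(0)
toy_attentions = [softmax(rng.normal(size=(3, 5, 5))) for _ in range(2)]
rollout = attention_rollout(toy_attentions)
patch_map = cls_attention_map(rollout, grid=2)
```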
Evaluating TAVAC using scatterplots and histograms
We first used TAVAC to evaluate individual images selected from each of our datasets. To this end, the training and validation attention maps were converted to red, green, blue (RGB) images, in which each pixel is represented by red, green, and blue color values. The RGB image of each attention map was then converted into a NumPy array that stored the respective attention values. Using these arrays, we generated a scatterplot to visualize the correlation between training and validation attention for each image. Specifically, the training and validation attention at single-pixel resolution are visualized as a density scatterplot, in which the x axis shows the training attention for each pixel, the y axis shows the validation attention for the corresponding pixel, each data point represents a pixel, and the color key indicates the density of data points. The Pearson correlation coefficient is used as the TAVAC score. A TAVAC score close to 1 indicates a strong correlation between training and validation attention values for that image, whereas a score close to 0 indicates a weak correlation.
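The per-image evaluation can be sketched as below: the two attention maps are flattened, their Pearson correlation gives the TAVAC score, and a two-dimensional histogram provides the point density used to color a density scatterplot. The synthetic attention maps, the bin count, and the function name are assumptions for illustration; plotting itself is omitted.

```python
import numpy as np


def tavac_scatter_data(train_attn, val_attn, bins=50):
    """Return the TAVAC score and the binned pixel density for a density scatterplot."""
    x = np.asarray(train_attn, dtype=float).ravel()   # training attention per pixel
    y = np.asarray(val_attn, dtype=float).ravel()     # validation attention per pixel
    score = float(np.corrcoef(x, y)[0, 1])            # Pearson correlation
    density, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    return score, density, x_edges, y_edges


# Synthetic 224 x 224 attention maps: validation attention is a noisy copy
# of training attention (made-up data, for illustration only).
rng = np.random.default_rng(0)
train_map = rng.random((224, 224))
val_map = 0.8 * train_map + 0.2 * rng.random((224, 224))

score, density, x_edges, y_edges = tavac_scatter_data(train_map, val_map)
print(f"TAVAC score: {score:.3f}")
```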
The procedure above allowed us to evaluate TAVAC for individually selected images. A more holistic view of overall model interpretation is given by the distribution of TAVAC scores, represented as a histogram. We calculated the TAVAC score for every image in each dataset using the same procedure and then plotted these values as a histogram to visualize the distribution of attention consistency across the entire dataset. We used the mean, median, and variance of the TAVAC scores to compare good-fit and overfit models on all datasets.
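A minimal sketch of the dataset-level summary follows: given one TAVAC score per image, it computes the mean, median, and variance and bins the scores for a histogram. The two score sets are synthetic placeholders, not results from this study, and the function name and bin settings are assumptions.

```python
import numpy as np


def summarize_tavac(scores, bins=20):
    """Summary statistics and histogram counts for a set of per-image TAVAC scores."""
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=bins, range=(-1.0, 1.0))
    return {
        "mean": float(scores.mean()),
        "median": float(np.median(scores)),
        "variance": float(scores.var(ddof=1)),   # sample variance
        "counts": counts,
        "edges": edges,
    }


# Synthetic placeholder scores for two hypothetical models (not real results).
rng = np.random.default_rng(0)
good_fit_scores = np.clip(rng.normal(0.8, 0.05, size=200), -1.0, 1.0)
overfit_scores = np.clip(rng.normal(0.4, 0.20, size=200), -1.0, 1.0)

good_fit = summarize_tavac(good_fit_scores)
overfit = summarize_tavac(overfit_scores)
print(good_fit["mean"], overfit["mean"])
```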
Pretrained ViT on ImageNet
For this study, we used a ViT model pretrained on ImageNet-21K, an extensive database of 14 million images and 21,843 classes, at a resolution of 224 × 224 (10). Images are fed into the transformer encoder as a sequence of equally sized, linearly embedded patches (16 × 16 resolution) after the addition of a classification token and positional embeddings. Pretraining allows the model to learn to extract meaningful features from images, and this knowledge is then transferred to smaller downstream tasks. In our case, transfer learning enabled the ViT model to apply the knowledge learned on the large ImageNet dataset to our smaller spatial omics breast tissue data (29, 30).
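To make the input format concrete, the sketch below splits a 224 × 224 RGB image into non-overlapping 16 × 16 patches and counts the resulting tokens. It covers only the patch extraction; the learned linear embedding, classification token, and positional embeddings of the pretrained model are not reproduced, and the function name and the three-channel input are assumptions.

```python
import numpy as np


def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    grid_h, grid_w = h // patch, w // patch
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch, grid_w, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch * patch * c)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image)          # one row per patch
sequence_length = tokens.shape[0] + 1     # plus one classification token
print(tokens.shape, sequence_length)
```

With these settings each image yields a 14 × 14 grid of patches, which is also the native resolution of a patch-level attention map before upsampling.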
Acknowledgments
We would like to express our sincere gratitude to senior scientific writer C. Robinett, whose invaluable contributions enhanced the quality of this manuscript.
Funding: S.L. is supported by the following grants from the National Institutes of Health: R35GM133562 (2019-2024), U01HG013175 (2023-2028), U01CA271830 (2021-2026), and R56AG071766 (2022-2024). S.L. is a recipient of a Career Development Award (1398-25) from The Leukemia & Lymphoma Society (2024-2029). M.M. is supported by R01GM141309. M.M. and S.L. are supported by U54AG079753 (2022-2026). D.A.’s Summer Student Fellowship was supported by the National Cancer Institute (award R25CA233420) and the Jackson Laboratory’s Summer Student Program fund.
Author contributions: Conceptualization: Y.Z., M.M., D.A., S.L., and Y.L. Methodology: Y.Z., M.M., D.A., S.L., and Y.L. Investigation: Y.Z. and D.A. Resources: Y.Z., D.A., and S.L. Data curation: Y.Z., D.A., and S.L. Validation: Y.Z., D.A., S.L., and Y.L. Formal analysis: Y.Z., D.A., and S.L. Software: Y.Z., D.A., and Y.L. Visualization: Y.Z., D.A., Y.L., and S.L. Funding acquisition: S.L. Project administration: S.L. Supervision: Y.Z., M.M., and S.L. Writing (original draft): Y.Z., M.M., D.A., and S.L.
Competing interests: M.M. has unpaid adjunct faculty positions in the Department of Neurological Sciences at the University of Vermont and the University of Delaware Data Science Institute. All other authors declare that they have no competing interests.
Data and materials availability: The code is available on GitHub: https://github.com/LabShengLi/TAVAC. The repository is archived on Zenodo (DOI: 10.5281/zenodo.13274488). The CIFAR-10, MNIST, Food-101, and Cats vs. Dogs data are available on the Hugging Face data hub: https://huggingface.co/datasets. The data can be accessed using the load_dataset function, and all data extraction code is included in the GitHub repository. The ST-net data can be downloaded at https://data.mendeley.com/datasets/29ntw7sh4r/5. All other data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.
Supplementary Materials
This PDF file includes:
Supplementary Text
Figs. S1 to S4
REFERENCES AND NOTES
- 1. Lo C.-M., Hung P.-H., Computer-aided diagnosis of ischemic stroke using multi-dimensional image features in carotid color Doppler. Comput. Biol. Med. 147, 105779 (2022).
- 2. Hu W., Li C., Li X., Rahaman M. M., Ma J., Zhang Y., Chen H., Liu W., Sun C., Yao Y., Sun H., Grzegorzek M., GasHisSDB: A new gastric histopathology image dataset for computer aided diagnosis of gastric cancer. Comput. Biol. Med. 142, 105207 (2022).
- 3. Hu Q., Chen C., Kang S., Sun Z., Wang Y., Xiang M., Guan H., Xia L., Wang S., Application of computer-aided detection (CAD) software to automatically detect nodules under SDCT and LDCT scans with different parameters. Comput. Biol. Med. 146, 105538 (2022).
- 4. Yang X., Stamp M., Computer-aided diagnosis of low grade endometrial stromal sarcoma (LGESS). Comput. Biol. Med. 138, 104874 (2021).
- 5. Litjens G., Bandi P., Ehteshami Bejnordi B., Geessink O., Balkenhol M., Bult P., Halilovic A., Hermsen M., Van de Loo R., Vogels R., Manson Q. F., Stathonikos N., Baidoshvili A., van Diest P., Wauters C., van Dijk M., van der Laak J., 1399 H&E-stained sentinel lymph node sections of breast cancer patients: The CAMELYON dataset. Gigascience 7, giy065 (2018).
- 6. Coudray N., Ocampo P. S., Sakellaropoulos T., Narula N., Snuderl M., Fenyö D., Moreira A. L., Razavian N., Tsirigos A., Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
- 7. Khosravi P., Kazemi E., Imielinski M., Elemento O., Hajirasouliha I., Deep convolutional neural networks enable discrimination of heterogeneous digital pathology images. EBioMedicine 27, 317–328 (2018).
- 8. M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021), vol. 34, pp. 12116–12128.
- 9. X. Mao, G. Qi, Y. Chen, X. Li, R. Duan, S. Ye, Y. He, H. Xue, “Towards robust vision transformer” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022).
- 10. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Paper presented at the International Conference on Learning Representations, 2020.
- 11. Zhang Z., Chen P., Mcgough M., Xing F., Wang C., Bui M., Xie Y., Sapkota M., Cui L., Dhillon J., Ahmad N., Khalil F., Dickinson S., Shi X., Liu F., Su H., Cai J., Yang L., Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat. Mach. Intell. 1, 236–245 (2019).
- 12. Dietterich T., Overfitting and undercomputing in machine learning. ACM Comput. Surv. 27, 326–327 (1995).
- 13. Lever J., Krzywinski M., Altman N., Model selection and overfitting. Nat. Methods 13, 703–704 (2016).
- 14. Ying X., An overview of overfitting and its solutions. Journal of Physics: Conference Series 1168, 022022 (2019).
- 15. Ramampiandra E. C., Scheidegger A., Wydler J., Schuwirth N., A comparison of machine learning and statistical species distribution models: Quantifying overfitting supports model interpretation. Ecol. Model. 481, 110353 (2023).
- 16. S. Abnar, W. Zuidema, “Quantifying attention flow in transformers” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2020).
- 17. Strother S. C., Anderson J., Hansen L. K., Kjems U., Kustra R., Sidtis J., Frutiger S., Muley S., Laconte S., Rottenberg D., The quantitative evaluation of functional neuroimaging experiments: The NPAIRS data analysis framework. Neuroimage 15, 747–771 (2002).
- 18. S. Yadav, S. Shukla, “Analysis of k-fold cross-validation over hold-out validation on colossal datasets for quality classification” in 2016 IEEE 6th International Conference on Advanced Computing (IACC) (IEEE, 2016).
- 19. L. Bossard, M. Guillaumin, L. Van Gool, “Food-101–Mining discriminative components with random forests” in Computer Vision–ECCV 2014 (Springer International Publishing, 2014), pp. 446–461.
- 20. P. Golle, Machine learning attacks against the Asirra CAPTCHA. Paper presented at the Proceedings of the 5th Symposium on Usable Privacy and Security, 2008.
- 21. A. Krizhevsky, G. Hinton, “Learning multiple layers of features from tiny images,” thesis, Univ. of Toronto (2009).
- 22. LeCun Y., Bottou L., Bengio Y., Haffner P., Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
- 23. Y. Lecun, L. Bottou, G. Orr, K. R. Müller, “Efficient BackProp” in Lecture Notes in Computer Science, G. M. G. B. Orr, K. R. Müller, Eds. (Lecture Notes in Computer Science, Springer Berlin Heidelberg, 1998), vol. 1524, pp. 9–50.
- 24. He B., Bergenstråhle L., Stenbeck L., Abid A., Andersson A., Borg Å., Maaskola J., Zou J., Integrating spatial gene expression and breast tumor morphology via deep learning. Nat. Biomed. Eng. 4, 827–834 (2020).
- 25. Andersson A., Larsson L., Stenbeck L., Salmén F., Ehinger A., Wu S. Z., Al-Eryani G., Roden D., Swarbrick A., Borg Å., Frisén J., Engblom C., Lundeberg J., Spatial deconvolution of HER2-positive breast cancer delineates tumor-associated cell type interactions. Nat. Commun. 12, 6012 (2021).
- 26. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, “Attention is all you need” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2017), vol. 30.
- 27. M. Clark, A comparison of correlation measures. Center for Social Research, Univ. of Notre Dame 4 (2013).
- 28. H. Chefer, S. Gur, L. Wolf, “Transformer interpretability beyond attention visualization” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2021).
- 29. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, “ImageNet: A large-scale hierarchical image database” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009).
- 30. B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, Vajda, Visual transformers: Token-based image representation and processing for computer vision, arXiv:2006.03677 (2020).