Radiology: Artificial Intelligence. 2023 Nov 8;6(1):e220221. doi: 10.1148/ryai.220221

Revisiting the Trustworthiness of Saliency Methods in Radiology AI

Jiajin Zhang, Hanqing Chao, Giridhar Dasegowda, Ge Wang, Mannudeep K Kalra, Pingkun Yan
PMCID: PMC10831523  PMID: 38166328

Abstract

Purpose

To determine whether saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle perturbations of the input, which could lead to misleading interpretations, using prediction-saliency correlation (PSC) for evaluating the sensitivity and robustness of saliency methods.

Materials and Methods

In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vendor were systematically evaluated on 191 229 chest radiographs from the CheXpert dataset and 7022 MR images from a human brain tumor classification dataset. Two radiologists performed a reader study on 270 chest radiograph pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods.

Results

The saliency methods had low sensitivity (maximum PSC, 0.25; 95% CI: 0.12, 0.38) and weak robustness (maximum PSC, 0.12; 95% CI: 0.0, 0.25) on the CheXpert dataset, as demonstrated using locally trained models whose parameters were accessible for crafting the perturbations. Further evaluation showed that the saliency maps generated from a commercial prototype could become irrelevant to the model output even without knowledge of the model specifics (the area under the receiver operating characteristic curve decreased by 8.6% without affecting the saliency maps). The human observer studies confirmed that it is difficult for experts to identify the perturbed images; the experts correctly identified at most 44.8% of them.

Conclusion

Popular saliency methods scored low PSC values on the two datasets of perturbed chest radiographs, indicating weak sensitivity and robustness. The proposed PSC metric provides a valuable quantification tool for validating the trustworthiness of medical AI explainability.

Keywords: Saliency Maps, AI Trustworthiness, Dynamic Consistency, Sensitivity, Robustness

Supplemental material is available for this article.

© RSNA, 2023

See also the commentary by Yanagawa and Sato in this issue.





Summary

Systematic evaluation of saliency methods through subtle perturbations in chest radiographs and brain MRI scans demonstrated low sensitivity and robustness of those methods, warranting caution when using saliency methods that may misrepresent changes in artificial intelligence model prediction.

Key Points

■ A novel evaluation metric, prediction-saliency correlation (PSC), is proposed to systematically quantify the trustworthiness of saliency-based artificial intelligence (AI) explainability.

■ Evaluation of commonly used saliency methods revealed low sensitivity (PSC ≤ 0.25) and weak robustness (PSC ≤ 0.12).

■ The findings suggest that the popular saliency maps may misrepresent true model prediction; thus, AI researchers and users should be aware of the vulnerabilities of saliency maps in radiology AI.

Introduction

Explainability is a pillar in supporting applications of artificial intelligence (AI) and machine learning in medicine (1–4). Understanding how and why AI models make particular decisions is critical for building trust in AI-driven applications (1,2,5–7). The value an AI model predicts for a disease is meant to indicate the probability of disease presence; however, these values usually do not correlate well with the true disease probabilities and lack confidence and prediction intervals. Thus, a series of post hoc explanation approaches have been proposed. Previous research has illustrated how AI models work by visualizing the relative contribution of each feature to the overall model prediction (8–14). Although researchers and clinicians appreciate the development of explainable AI, it is unclear whether the resulting explanations can be trusted. Overlooking the trustworthiness of AI-based saliency methods poses potential risks to AI-based medical applications.

Saliency maps, also called heat maps, are the most commonly used method for AI explainability (9,13). They are especially important for AI algorithms that target image segmentation, quantification tasks, lesion detection, and characterization. When reviewing AI outputs, radiologists often consult these maps to accept or reject AI findings. Previous empirical studies (15,16) demonstrated the susceptibility of neural networks to small perturbations of normal inputs, resulting in wrong outputs. Along the same direction, Arun et al (17) attempted to assess saliency maps in medical imaging by quantifying their localization capability, variation between randomized networks, repeatability over separately trained models, and reproducibility across different models. However, their methods demonstrated only the generic properties of the saliency approaches and could not evaluate the correlation between the saliency maps and the model predictions, or the saliency visualization quality, for a given AI model. Theoretical research (18,19) suggested that such weakness is related to neural networks' lack of local Lipschitz smoothness with respect to the input image space. As a consequence, model outcomes may vary drastically for inputs close to the original data. Comprehensive evidence is lacking on how such vulnerabilities in each AI model can substantially limit its practical application. The potential risk associated with those vulnerabilities is underestimated in medical AI because model details, such as architecture and parameters, are usually safeguarded.

A trustworthy saliency approach should meet two general conditions: The first is sensitivity. A saliency map should change accordingly when the model’s prediction for that image substantially alters due to the input change. The second is robustness. A saliency map should stay consistent when the model’s prediction remains unchanged after an input image is randomly transformed without affecting the image content. In other words, a trustworthy saliency map should be consistent with the model prediction, not just for a specific example at a given state but dynamically consistent when the model prediction changes.

In this study, we sought to determine whether saliency maps in radiology AI are vulnerable to subtle perturbations in their input that can lead to misleading results. We present a novel systematic approach for quantifying the trustworthiness of saliency explanations of given medical AI models. Specifically, we propose a model-agnostic and generalizable measurement to quantitatively analyze saliency method robustness: prediction-saliency correlation (PSC). This measurement depicts the correlation between changes in model predictions and changes in the corresponding saliency maps to quantitatively analyze both robustness and sensitivity. We then illustrate the uses of this approach on commonly used AI models and saliency map methods.

Materials and Methods

Study Design

This is a retrospective study for quantifying the trustworthiness of the most popular explanation methods in radiologic AI. This study was exempt from institutional review board approval and was Health Insurance Portability and Accountability Act compliant because fully de-identified public datasets were used. Seven of the most commonly used saliency methods in medical AI applications were selected for the investigation, including vanilla back propagation (vanilla BP) (20), vanilla BP×image (9), gradient-weighted class activation mapping (GradCAM) (11), guided-GradCAM (11), integrated gradients (12), smoothed gradients (13), and explanation with ranked area integrals (XRAI) (14,20). A representative list of medical AI articles that use the above saliency methods is given in Table S0.

We trained two of the most widely used convolutional neural networks for medical image classification, ResNet-152 and DenseNet-121, on the CheXpert dataset (21,22) as baseline models tasked with identifying atelectasis, cardiomegaly, consolidation, edema, and pleural effusion on chest radiographs, and we quantitatively verified the sensitivity and robustness of the saliency methods on these models. In addition, we processed chest radiographs with and without perturbation using an AI-based chest radiograph research prototype for which we had no access to the model architecture or parameters, to test whether the proposed method generalizes to such "black box" situations. To demonstrate the efficacy of our proposed evaluation framework on other imaging modalities, analysis tasks, and deep learning model architectures, additional experiments were performed with a ResNet-50 model trained on a brain tumor multiclass classification MRI dataset (23).
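
The following is a minimal sketch (not the authors' released code) of how such a baseline could be instantiated in PyTorch, assuming torchvision backbones and a five-logit multilabel head; the pretrained-weight enums and loss choice are standard but illustrative here.

import torch.nn as nn
from torchvision import models

FINDINGS = ["atelectasis", "cardiomegaly", "consolidation", "edema", "pleural_effusion"]

def build_baseline(arch: str = "densenet121") -> nn.Module:
    """Return an ImageNet-pretrained backbone with one logit per finding."""
    if arch == "densenet121":
        net = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        net.classifier = nn.Linear(net.classifier.in_features, len(FINDINGS))
    elif arch == "resnet152":
        net = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        net.fc = nn.Linear(net.fc.in_features, len(FINDINGS))
    else:
        raise ValueError(f"unsupported architecture: {arch}")
    return net

model = build_baseline("densenet121")
criterion = nn.BCEWithLogitsLoss()  # independent sigmoid per finding (multilabel)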

A preliminary version of this work was presented at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2022 (24). Compared with our previous MICCAI presentation, the current report includes important extensions, such as additional technical innovations, numeric experiments, human observer studies, and mathematical analysis.

Dataset Preparation

We demonstrate the discovered issues of saliency maps on a multilabel classification task using a chest radiograph dataset, CheXpert (21,22). The original dataset consists of 223 648 publicly available chest radiographs (including frontal and lateral projections) from 64 740 patients (40.6% female; mean age, 59.6 years ± 16.8 [SD]; 59.4% male; mean age, 58.6 years ± 16.3). Only 191 229 frontal chest radiographs (191 027 from the original training set and 202 from the original validation set) were included in our study. Because the test set of the CheXpert dataset is not publicly available, we further randomly split the original training set including 191 027 frontal images into training and validation sets with a ratio of 6:1. The 202 frontal chest radiographs of the original validation set were used as our test set. We also included a human brain tumor MRI classification dataset (23), which consists of 7022 images. More detailed information for the brain tumor MRI dataset is provided in Appendix S1 (section V, part 1, Dataset and Model Preparation).
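
For reference, the following is a minimal sketch of the 6:1 split described above, assuming the frontal-view image paths have already been collected into a Python list; the function name and seed are illustrative.

import random

def split_6_to_1(image_paths, seed=0):
    """Randomly split a list of image paths into training and validation sets (6:1)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_val = len(paths) // 7               # one part for validation ...
    return paths[n_val:], paths[:n_val]   # ... six parts for training

# train_paths, val_paths = split_6_to_1(frontal_paths)  # 191 027 frontal images in total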

Quantitative Analysis of Trustworthiness

As shown in Figure 1, we evaluated the dynamic consistency from two aspects: sensitivity and robustness. For each image x_i in a test set of size N, we first obtained its prediction p_i and its saliency map m_i from an AI model. We then altered each image x_i to produce a new image, x′_i. The specific alteration depends on which property is being examined, as detailed below in this section. The model then calculates a new prediction p′_i and generates a corresponding saliency map m′_i for the new image.
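
For concreteness, the sketch below shows how p_i and a vanilla back-propagation saliency map m_i could be obtained for one image with a PyTorch classifier; this is an illustrative implementation, not the authors' code.

import torch

def predict_and_saliency(model, x, class_idx):
    """Return (probability, vanilla-gradient saliency map) for one finding.
    x is a (1, C, H, W) tensor; the map is |d logit / d input|, max over channels."""
    model.eval()
    x = x.clone().requires_grad_(True)
    logits = model(x)                              # (1, n_findings)
    prob = torch.sigmoid(logits[0, class_idx])
    model.zero_grad()
    logits[0, class_idx].backward()
    saliency = x.grad.detach().abs().amax(dim=1)   # (1, H, W)
    return prob.item(), saliency.squeeze(0)        # scalar, (H, W)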

Figure 1:

Overview of the proposed methods. The trustworthiness of saliency maps can be examined from two aspects: sensitivity and robustness. An adversarial image, x^p_i, generated by a prediction attack examines the sensitivity between the saliency map and the model output. Another adversarial image, x^s, generated by a saliency attack evaluates whether saliency maps remain robust when the model output does not change. In both cases, the adversarial images look no different from the original image. AI = artificial intelligence.


To evaluate the sensitivity of each saliency method, we observed whether changes in model predictions due to alteration of the input images resulted in corresponding changes to the saliency maps. Specifically, as shown in the left side of Figure 1, by adopting optimization techniques from adversarial attacks (15,25–29), we identified slightly perturbed radiographs that caused the AI model to predict a different result but whose saliency maps remained close to those of the original input. In our study, the word perturb means making very small changes to the pixel values of an image (ie, an original radiograph). Such small changes are usually imperceptible to humans but can change the output of an AI model. For each of the five observations, every radiograph was perturbed such that the model prediction would be flipped (ie, from "observation exists" to "no observation" and vice versa), while the optimization algorithm kept the saliency map unchanged. The technical details of the image perturbation algorithm are presented in Appendix S1 (section II, Sensitivity Examination).
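
A minimal sketch of this "prediction attack" is given below, assuming a PyTorch model and a differentiable vanilla-gradient saliency; the L-infinity budget eps, step size alpha, number of steps, and loss weight lam are illustrative assumptions, and the authors' exact algorithm is described in Appendix S1 (section II).

import torch
import torch.nn.functional as F

def diff_saliency(model, x_in, class_idx):
    """Differentiable vanilla-gradient saliency (create_graph=True lets the attack
    back-propagate through the saliency map itself)."""
    logit = model(x_in)[0, class_idx]
    grad, = torch.autograd.grad(logit, x_in, create_graph=True)
    return grad.abs().amax(dim=1)                            # (1, H, W)

def prediction_attack(model, x, class_idx, flipped_label,
                      eps=2 / 255, alpha=0.5 / 255, steps=40, lam=10.0):
    """Perturb x (in [0, 1]) so the prediction flips while the saliency map stays put."""
    m_orig = diff_saliency(model, x.clone().requires_grad_(True), class_idx).detach()
    target = torch.tensor([[float(flipped_label)]], device=x.device)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        logit = model(x_adv)[:, class_idx:class_idx + 1]
        loss_pred = F.binary_cross_entropy_with_logits(logit, target)          # flip the output
        loss_sal = F.mse_loss(diff_saliency(model, x_adv, class_idx), m_orig)  # keep the map
        (loss_pred + lam * loss_sal).backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()               # signed-gradient descent step
            delta.clamp_(-eps, eps)                          # stay within the L-inf budget
        delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()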

The robustness aspect of dynamic consistency requires the saliency maps to remain consistent when the predictions do not change; that is, it evaluates whether a saliency map stays consistent with the model's prediction output when an image is perturbed without affecting the model prediction. As shown in the right part of Figure 1, we used optimization techniques similar to those in the sensitivity experiments above to investigate whether it is possible to pull a saliency map toward an arbitrary pattern while keeping the model predictions unaffected. The target pattern was designed as a square at the top right corner of the saliency map. More specifically, for each of the five observations, we perturbed the input radiograph to draw the saliency map toward the predefined target square pattern, while the perturbation was optimized so that the model prediction for the radiograph remained unchanged. To quantify the findings, we evaluated changes in model performance and the similarities of the saliency maps generated on the perturbed images to the original saliency maps and the target saliency map, respectively. More detailed mathematical derivations are presented in Appendix S1 (section II, Robustness Examination).
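
For this robustness test ("saliency attack"), the same bounded-perturbation loop can be reused with a different objective, sketched below: the loss pulls the differentiable saliency map toward a square target in the top right corner while penalizing any drift of the prediction logit. The square size and loss weight are illustrative assumptions rather than the authors' settings.

import torch
import torch.nn.functional as F

def make_target_map(h, w, side=64):
    """All-zero saliency target with a bright square in the top right corner."""
    m = torch.zeros(1, h, w)
    m[:, :side, w - side:] = 1.0
    return m

def saliency_attack_loss(model, x_adv, class_idx, logit_orig, m_target, lam=10.0):
    """Objective to plug into the perturbation loop above; x_adv must require grad."""
    logit = model(x_adv)[0, class_idx]
    grad, = torch.autograd.grad(logit, x_adv, create_graph=True)
    m_adv = grad.abs().amax(dim=1)                  # differentiable saliency map
    loss_sal = F.mse_loss(m_adv, m_target)          # move the map onto the square
    loss_pred = (logit - logit_orig) ** 2           # keep the prediction unchanged
    return loss_sal + lam * loss_pred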

Statistical Analysis

The sensitivity and robustness of the saliency methods were uniformly quantified by the PSC coefficient proposed in this study. The coefficient is defined as the Pearson correlation between variations in model predictions and changes in their corresponding saliency maps, both of which are gauged using the Jensen–Shannon divergence. The PSC coefficient ranges from -1 to +1: a value of -1 signifies a perfectly negative correlation, 0 suggests no correlation, and +1 denotes a perfectly positive correlation. A PSC value above 0.5 is considered to indicate a high degree of correlation. We used changes in the area under the receiver operating characteristic curve (AUC) to evaluate changes in the model predictions, and the structural similarity index measure (SSIM) to quantify changes in the saliency maps. The mean PSC over the five findings was used to quantify the overall performance of each saliency method. Mathematical derivations of PSC are presented in Appendix S1 (section II, Prediction-Saliency Correlation). The significance tests for AUC comparison were performed using the z-test, as detailed by Zhou et al (30), and the CIs for the AUC values were computed on the basis of the method proposed by Hanley and McNeil (31). The significance of each individual finding was evaluated, and the P values (P < .05 indicated a statistically significant difference) are reported in Appendix S1 (section III). In the main report, we present the performance averaged over all findings, for which no P values are reported.
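
Under our reading of this definition, the PSC coefficient could be computed as in the sketch below, which treats each prediction vector and each flattened saliency map as a discrete distribution and uses SciPy for the Jensen–Shannon divergence and the Pearson correlation; the exact formulation is given in Appendix S1 (section II).

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def _as_distribution(a, eps=1e-12):
    """Flatten a non-negative array and normalize it to sum to 1."""
    a = np.abs(np.asarray(a, dtype=np.float64)).ravel() + eps
    return a / a.sum()

def psc(preds, preds_pert, maps, maps_pert):
    """Prediction-saliency correlation over a set of (original, perturbed) pairs.
    preds/preds_pert: per-image prediction vectors; maps/maps_pert: saliency maps."""
    d_pred = [jensenshannon(_as_distribution(p), _as_distribution(q)) ** 2   # JS divergence
              for p, q in zip(preds, preds_pert)]
    d_map = [jensenshannon(_as_distribution(m), _as_distribution(n)) ** 2
             for m, n in zip(maps, maps_pert)]
    r, _ = pearsonr(d_pred, d_map)
    return r  # in [-1, 1]; high values mean saliency changes track prediction changes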

Human Observer Study

Two radiologists (M.K.K., with 15 years of experience in thoracic imaging, and G.D., with 2 years of postdoctoral experience in thoracic imaging) were presented with 270 pairs of perturbed and original chest radiographs from the CheXpert dataset. These pairs consisted of 120 pairs from the sensitivity experiments in Table 1 and 150 pairs from the robustness experiments in Table 2. No abnormality or pathologic finding was inserted into or overlaid on the altered radiographs. The order of altered and original radiographs was randomly assigned for each pair. The radiologists were asked to identify the perturbed image in each pair. The two radiologists were first provided with another 150 pairs of images for training, in which the images were clearly labeled. Both radiologists assessed the radiographs separately and independently.

Table 1:

Quantification Results of Saliency Sensitivity on ResNet-152 and DenseNet-121


Table 2:

Quantification Results of Saliency Robustness on ResNet-152 and DenseNet-121


Data Availability

The chest radiograph datasets used in this study are publicly available in the Stanford CheXpert database (https://stanfordmlgroup.github.io/competitions/chexpert). All data needed to evaluate the findings in this study are presented in this report or the supplemental material. Additional data related to this article, such as the detailed reader test data, may be requested from the authors.

Results

Sensitivity Examination

Figure 2 shows a chest radiograph with atelectasis as an example to demonstrate how the predictions and the saliency maps may diverge from each other. Although the perturbed images look identical to the original image, the probabilities of atelectasis predicted by the model (DenseNet-121) dropped from 67.1% to 2%. The highlighted regions of saliency maps for the perturbed images were similar to those for the original images (SSIM > 0.76), suggesting that the saliency maps failed to reflect the changes of the model predictions (ie, the sensitivity may be low). More results of each individual class are included in Appendix S1 (section III, part 1).
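
For reference, the SSIM between an original and a perturbed saliency map can be computed as in the following illustrative sketch (not the authors' evaluation code), assuming both maps are 2D arrays.

import numpy as np
from skimage.metrics import structural_similarity

def saliency_ssim(m_orig, m_pert):
    """SSIM between two saliency maps, each min-max normalized to [0, 1]."""
    def norm(m):
        m = np.asarray(m, dtype=np.float64)
        return (m - m.min()) / (m.max() - m.min() + 1e-12)
    return structural_similarity(norm(m_orig), norm(m_pert), data_range=1.0)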

Figure 2:

Saliency maps lack sensitivity to predictions of DenseNet-121. The color bar indicates the intensity of the saliency maps. Probabilities of atelectasis are shown at the bottom of saliency maps for original and perturbed images. The images show that highly similar saliency maps of frontal chest radiographs may be associated with very different model predictions.


Table 1 shows the quantification results. All the reported numbers are means over the five classes. The overall averaged PSC was no greater than 0.26; because the PSC value lies in the range [-1, 1], this indicates a weak association. We further examined the details. On the perturbed images, the AUCs of both models degraded drastically, from 0.88 to 0.01. However, the corresponding saliency maps for the perturbed images were similar to the saliency maps for the original images, with a mean SSIM of 0.76 or greater. Such inconsistency between the prediction variation and the saliency preservation is reflected by the small PSC of 0.25 or less. We also performed a similar sensitivity evaluation on the MRI classification dataset and reached similar conclusions. These experimental results are reported in Appendix S1 (section V, Sensitivity Examination).

Furthermore, we examined the sensitivity of a research prototype model provided by a commercial vendor, henceforth referred to as the “commercial prototype” for brevity. Because we do not have access to the architecture and parameters of the model, we generated altered images based on three in-house models that we trained; we then fed these images to the model. The saliency maps were generated by the commercial prototype itself. To avoid conflict of interest, we visualized the saliency maps using our color map. The corresponding results are presented in Figure 3. Four cases of the four shared classes (atelectasis, cardiomegaly, consolidation, and pleural effusion) between our pretrained local proxy model trained on CheXpert and the commercial prototype are shown. The results indicate that the generated perturbed images do not substantially change the model saliency maps. The quantitative results also support this observation because the similarities (SSIM) between the saliency maps on the perturbed images and the original ones are greater than 0.88 on all four classes. However, the perturbed images caused the mean AUC to drop by 8.6% (P < .01) on the four classes.

Figure 3:

Saliency sensitivity evaluation of a commercially available artificial intelligence (AI) software. The color bar indicates the intensity of the saliency maps. By perturbing the original chest radiographs (top row), perturbed images (bottom row) were generated by attacking a proxy model. The perturbed images were then fed to a commercially available medical AI model. Note the large variations of the predicted probabilities (at the bottom of each image) from the original to the perturbed images on different findings, despite only minor changes to the saliency maps.


Robustness Examination

An example case is presented in Figure 4. Results of each individual class are included in Appendix S1 (section III, part 3). The DenseNet-121 predictions for probability of atelectasis, using perturbed images, were consistent with the original prediction (58.1%). However, the perturbed images successfully misled the saliency maps. For all the saliency methods, the perturbed images shifted the saliency areas toward the targeted top right corner of the image.

Figure 4:

Example saliency maps lack robustness to saliency tampering of ResNet-152. The color bar indicates the intensity of the saliency maps. The target region of the saliency maps on chest radiographs has been manipulated (third row), but the predicted probability (at the bottom of each image) of atelectasis remains similar to the original prediction.


Table 2 indicates that the deep neural networks performed consistently on the original and perturbed images, with the AUC remaining the same at 0.88. However, the saliency maps on the perturbed images were dramatically different from the original saliency maps (mean SSIM_org ≤ 0.51). Such inconsistency between the prediction variation and the saliency preservation is reflected by the low PSC (≤0.12). At the same time, the saliency maps from the perturbed images share strong similarities with the target saliency map, with a mean SSIM_tgt of 0.65 for GradCAM and a mean SSIM_tgt of 0.82 or greater for all other saliency methods. We also performed a similar robustness evaluation on the MRI classification dataset and reached similar conclusions. These experimental results are reported in Appendix S1 (section V, Robustness Examination).

Human Observer Study

At the time of testing, the two physicians correctly identified 63 of 270 (23.3% [G.D.]) and 121 of 270 (44.8% [M.K.K.]) altered radiographs, respectively, indicating that the image alterations are difficult to spot by human experts, even for a thoracic radiologist with 15 years of subspecialty experience.

Discussion

In this study, we introduce a novel assessment metric, the PSC coefficient, to provide an intuitive and quantitative evaluation of the trustworthiness of widely used saliency maps. The PSC coefficient can serve as an evaluator to quantify the sensitivity and robustness of explanation methods. The quantitative and qualitative results show that commonly adopted saliency methods in medical AI applications can produce misleading interpretations. The saliency methods demonstrated low sensitivity (PSC < 0.25) and robustness (PSC < 0.12) on multiple radiographs. Across all findings, the perturbations caused drastic changes in either the model predictions or the saliency maps, while the other remained essentially unchanged. In addition, the saliency maps generated by the commercial AI software may be neither relevant nor robust to perturbations added to the images, even without knowledge of the model specifics. The human observer studies verified that the perturbed images are difficult to identify even by a human expert with extensive experience in chest radiography. These results indicate that for deep learning models, the sensitivity and robustness of all seven saliency methods were weak (ie, the generated saliency maps may not be relevant to the model predictions). Notably, the radiologist with 15 years of clinical experience demonstrated a substantially stronger ability to correctly identify altered radiographs compared with the other physician with only 2 years of experience. This observation suggests a potential correlation between clinical expertise and the capacity to discern subtle perturbations.

To this end, we proposed a model-agnostic method for saliency map trustworthiness evaluation, which is generalizable to the commonly available saliency methods. Our method of induced perturbation can help establish the trustworthiness of the explainability of AI outputs and assess individual or multiple AI models for susceptibility to similar or different types of perturbation. This is a major clinical implication of our work. Our findings in Appendix S1 (section III, parts 2 and 4) indicate that even if multiple saliency methods were applied and we obtained consistent results, it is still possible that none of the saliency maps are faithful to the model predictions. These results on the commercial prototype suggest that the concerns about the trustworthiness are valid even for commercial AI systems trained with large amounts of data. Our future research will also focus on refining the extent, distribution, and patterns of perturbations to simulate variations in patients, diseases, and imaging parameters. Such work can help reduce the cost and time needed for thorough validation of AI models and uncover the implications of deploying nongeneralizable or nonexplainable AI models.

Many clinical end users of AI might not be aware of the explainability aspects of AI; therefore, these aspects are often not realized and/or applied in clinical practices. Few AI models have safety valves where certain AI outputs are not generated in the presence of issues related to AI explainability. For example, AI models should exercise caution or not describe cardiothoracic ratio or presence of enlarged cardiac silhouette on portable, supine radiographs as opposed to upright, posteroanterior radiographs. At CT, variations in reconstructed section thickness can profoundly influence AI-based estimation of nodule size, growth, and attenuation characteristics over serial CT examinations. A lack of explanation on whether and how the AI model accounts for such variations in acquisition technique and measured findings can mislead the clinical end users or cause them to discard all AI outputs. The addition and awareness of explainability aspects to the models can thus help improve adoption and proper use of AI algorithm outputs. Such explainability would be especially helpful given the profound variations in patient factors (supine vs upright radiographs or radiographs with low lung volumes), acquisition factors (radiation dose and image quality, including artifacts), and image reconstruction techniques (differences in section thickness and kernels).

Our study had limitations. This study mainly focused on the most popular attribution-based AI explanation methods (32,33). However, other explanation techniques, such as counterfactual explanations, exist (33,34). Our future research will extend the gradient-based evaluation to the counterfactual-based explanation methods.

In conclusion, we propose a model-agnostic method to dynamically evaluate the trustworthiness of saliency maps used for explaining the results of AI models. Our findings suggest that the commonly used saliency methods in medical AI can produce interpretations inconsistent with the model predictions. Thus, it is important to establish the trustworthiness of the saliency methods in clinical adoption of AI models. Furthermore, our future work will extend the evaluation from gradient-based methods to the counterfactual explanation methods to determine their trustworthiness.

Supported by National Science Foundation (2046708) (P.Y., principal investigator) and National Institutes of Health (R01EB032716) (G.W., contact principal investigator).

Disclosures of conflicts of interest: J.Z. No relevant relationships. H.C. No relevant relationships. G.D. No relevant relationships. G.W. No relevant relationships. M.K.K. Grants/contracts with Siemens Healthineers, Coreline, and Riverain Technologies to institution; associate editor of Radiology: Cardiothoracic Imaging. P.Y. National Science Foundation (NSF) under the CAREER award OAC (2046708).

Abbreviations:

AI = artificial intelligence
AUC = area under the receiver operating characteristic curve
PSC = prediction-saliency correlation
SSIM = structural similarity index measure

References

1. Shen Y, Shamout FE, Oliver JR, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nat Commun 2021;12(1):5645.
2. DeGrave AJ, Janizek JD, Lee SI. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 2021;3(7):610–619.
3. Arnaout R, Curran L, Zhao Y, Levine JC, Chinn E, Moon-Grady AJ. An ensemble of neural networks provides expert-level prenatal detection of complex congenital heart disease. Nat Med 2021;27(5):882–891.
4. Abitbol JL, Karsai M. Interpretable socioeconomic status inference from aerial imagery through urban patterns. Nat Mach Intell 2020;2(11):684–692.
5. Gonzalez-Gonzalo C, Liefers B, van Ginneken B, Sanchez CI. Iterative augmentation of visual evidence for weakly-supervised lesion localization in deep interpretability frameworks: application to color fundus images. IEEE Trans Med Imaging 2020;39(11):3499–3511.
6. Mitani A, Huang A, Venugopalan S, et al. Detection of anaemia from retinal fundus images via deep learning. Nat Biomed Eng 2020;4(1):18–27. [Published correction appears in Nat Biomed Eng 2020;4(2):242.]
7. Sayres R, Taly A, Rahimy E, et al. Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology 2019;126(4):552–564.
8. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?": explaining the predictions of any classifier. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016; 1135–1144.
9. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: ICML '17: Proceedings of the 34th International Conference on Machine Learning, 2017;70:3145–3153.
10. Shrikumar A, Greenside P, Shcherbina A, Kundaje A. Not just a black box: learning important features through propagating activation differences. arXiv 1605.01713 [preprint]. https://arxiv.org/abs/1605.01713. Published May 5, 2016. Accessed August 6, 2017.
11. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017; 618–626.
12. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: ICML '17: Proceedings of the 34th International Conference on Machine Learning, 2017;70:3319–3328.
13. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M. SmoothGrad: removing noise by adding noise. arXiv 1706.03825 [preprint]. https://arxiv.org/abs/1706.03825. Published June 12, 2017. Accessed June 12, 2017.
14. Kapishnikov A, Bolukbasi T, Viegas F, Terry M. XRAI: better attributions through regions. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019; 4947–4956.
15. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv 1412.6572 [preprint]. https://arxiv.org/abs/1412.6572. Published December 20, 2014. Accessed March 20, 2015.
16. Kurakin A, Goodfellow IJ, Bengio S. Adversarial examples in the physical world. In: Yampolskiy RV, ed. Artificial Intelligence Safety and Security. Chapman & Hall/CRC, 2018; 99–112.
17. Arun N, Gaw N, Singh P, et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol Artif Intell 2021;3(6):e200267.
18. Cohen J, Rosenfeld E, Kolter JZ. Certified adversarial robustness via randomized smoothing. Int Conf Mach Learn 2019;97:1310–1320.
19. Qin C, Martens J, Gowal S, et al. Adversarial robustness through local linearization. In: NIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019;1240:13842–13853.
20. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv 1312.6034 [preprint]. https://arxiv.org/abs/1312.6034. Published December 20, 2013. Accessed December 19, 2021.
21. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI Conf Artif Intell 2019;33(01):590–597.
22. Garbin C, Rajpurkar P, Irvin J, Lungren MP, Marques O. Structured dataset documentation: a datasheet for CheXpert. arXiv 2105.03020 [preprint]. https://arxiv.org/abs/2105.03020. Published May 7, 2021. Accessed May 7, 2021.
23. Nickparvar M. Brain tumor MRI dataset. https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset/. Published May 5, 2021. Accessed May 5, 2021.
24. Zhang J, Chao H, Dasegowda G, Wang G, Kalra MK, Yan P. Overlooked trustworthiness of saliency maps. In: Wang L, Dou Q, Fletcher PT, Speidel S, Li S, eds. Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13433. Cham, Switzerland: Springer, 2022; 451–461.
25. Bortsova G, González-Gonzalo C, Wetstein SC, et al. Adversarial attack vulnerability of medical image analysis systems: unexplored factors. Med Image Anal 2021;73:102141.
26. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science 2019;363(6433):1287–1289.
27. Xu M, Zhang T, Li Z, Liu M, Zhang D. Towards evaluating the robustness of deep diagnostic models by adversarial attack. Med Image Anal 2021;69:101977.
28. Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks. arXiv 1312.6199 [preprint]. https://arxiv.org/abs/1312.6199. Published December 21, 2013. Accessed May 27, 2021.
29. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. arXiv 1706.06083 [preprint]. https://arxiv.org/abs/1706.06083. Published June 19, 2017. Accessed May 27, 2021.
30. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. 2nd ed. Wiley, 2011.
31. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143(1):29–36.
32. Singh A, Sengupta S, Lakshminarayanan V. Explainable deep learning models in medical image analysis. J Imaging 2020;6(6):52.
33. Goyal Y, Wu Z, Ernst J, Batra D, Parikh D, Lee S. Counterfactual visual explanations. arXiv 1904.07451 [preprint]. https://arxiv.org/abs/1904.07451. Published April 16, 2019. Accessed February 3, 2023.
34. Atad M, Dmytrenko V, Li Y, et al. CheXplaining in style: counterfactual explanations for chest x-rays using StyleGAN. arXiv 2207.07553 [preprint]. https://arxiv.org/abs/2207.07553. Published July 15, 2022. Accessed February 3, 2023.



