Abstract
Deep learning convolutional neural networks have achieved remarkable performance in a variety of classification tasks. Because deep learning is data-driven, a model's behavior is shaped by the data used to train it, and the quality of a dataset can substantially affect the model's performance, especially when dealing with complicated clinical images. In this paper, we propose a simple and novel method to investigate and quantify a deep learning model's response to a given sample, allowing us to detect out-of-distribution samples based on a newly proposed metric, the Response Score. The key idea is that samples belonging to different classes may have different degrees of influence on a model. We quantify the consequence of a single sample on a trained model and use this quantitative measure (the Response Score) to detect out-of-distribution samples. The proposed method has multiple applications, such as 1) recognizing abnormal samples, 2) detecting mixed-domain data, and 3) identifying mislabeled data. We present extensive experiments on these three applications using four biomedical imaging datasets. Experimental results show that our method exhibits remarkable performance and outperforms the compared methods.
Keywords: Data quality, Out-of-distribution detection, Medical image analysis, Anomaly detection
1. Introduction
Convolutional neural networks (CNNs) have achieved strong performance in a variety of tasks, partly because of their powerful feature learning abilities [27, 38]. In general, the performance of CNN models relies on a large number of "good quality" samples that are balanced between classes in terms of sample number and distribution, as well as accurately annotated. However, in reality, these conditions are hard to meet, especially when dealing with medical imaging data. First, medical imaging samples of diseases/abnormalities are often rare due to prevalence or data collection issues, resulting in class imbalance (also called data imbalance) [9, 21, 34]. Second, a set of imaging data may entail a mixed distribution due to variations in imaging vendors/protocols and/or study populations, resulting in domain-mismatch issues [2, 20]. Third, inaccurate labels in training datasets are not unusual due to annotation errors by humans or natural language processing (NLP) software, reducing the robustness of model learning [42].
One way of addressing these data quality issues is through a general approach of detecting out-of-distribution (OOD) samples [7, 18], which assesses whether a given sample falls into the distribution of a given dataset. This task is also similar to anomaly detection [3, 36] and outlier detection [16, 17]. Several methods have been proposed for this task such as the maximum softmax probability method [18] and the influence function method [6, 24]. However, these methods mainly work on non-imaging data or natural view images.
In this paper, we propose a novel method to quantify the influence of a given sample on a task, allowing us to detect out-of-distribution samples. Specifically, we propose a novel measure, which we term the response score, to quantitatively measure the influence of a single sample on an already trained model. The intuition behind our idea is that different samples may exert different degrees of influence on a model trained with a given dataset. For example, at the concept level, the parameters of a model will change only slightly for a sample that is consistent with the model, but substantially for a sample that is not. To realize this scheme, we first train a model with a given training dataset and take it as the base model. We then adjust the model each time with a single sample, and quantify the resulting change of the model with the newly proposed response score, where a higher score means that the sample exerts a stronger influence on the model and is thus more likely to be out of distribution.
Our method is simple to implement and can be integrated into a variety of CNN models to address different applications. We present extensive experiments by integrating our method into two CNN models for three use cases: 1) recognizing anomalies for imbalanced data classification, 2) detecting domain mismatched samples, and 3) identifying mislabeled samples. These problems are common when dealing with biomedical data and clinical imaging applications. We test our method on four biomedical datasets with different imaging modalities and tasks, and show results outperforming other tested methods.
Our contributions are summarized as follows:
We propose a novel method to investigate the interactions between models and data samples towards the goal of measuring the influence of each sample to a deep learning model.
We propose a new measure, response score, to quantitatively evaluate the influence of a single sample on a CNN model.
We examine our method in three different use cases on four datasets to address real-world questions, each showing remarkable performance.
The structure of the paper is organized as follows. We first introduce the background of our study. Then, we discuss work related to our method. Next we elucidate the pipeline of the proposed method and demonstrate the effects of our method on three biomedical image OOD detection tasks. In the last section, we conclude by summarizing our work.
2. Related work
In this section, we define three issues our work attempts to address and summarize previous studies about OOD detection.
2.1. Data quality issues
There are many issues potentially related to data quality. Here we focus on reviewing three that are directly related to our method, namely, class imbalance, domain mismatch, and inaccurate annotation labels.
Class imbalance:
For a multi-class classification task, a simple way to handle imbalanced classes is to collect more balanced data or use data augmentation to generate synthetic data for the classes with fewer samples. Due to the low prevalence of diseases, it is sometimes hard to collect balanced data (for example, cancer cases are much rarer than normal cases in a screening population). Using generative adversarial networks (GANs) [15] to generate synthetic data has been shown to be helpful in addressing this issue [1, 11]. It is also possible to use under-sampling or over-sampling [12] strategies to control the number of samples presented in each round of the iterative training process. Another way to address class imbalance is one-class classification (OCC) [22], which focuses on learning features from samples belonging to only one class. Related OCC methods include traditional algorithms such as OCSVM [33] and SVDD [31], as well as deep learning algorithms such as DAGMM [43] and Deep SVDD [30].
Domain mismatch:
In a typical machine learning setting, it is usually assumed that the training and testing data are drawn from the same distribution. However, this assumption often does not hold for medical data due to variations in study populations, image acquisition protocols, image reconstruction algorithms, etc., creating the domain mismatch issue [2]. In the literature, this is usually addressed with out-of-distribution detection methods such as ODIN [26] and Deep Ensembles [25].
Inaccurate labels:
Some training samples may have inaccurate labels due to human error or NLP software errors. These mislabeled samples may confuse models and substantially degrade performance [35]. One way to address this is to identify the mislabeled samples and correct them before using them for training [13, 37].
2.2. OOD schemes
All three issues discussed in the previous subsection can be addressed by the OOD detection approach, which focuses on assessing the influence of a given sample on a model. Existing methods follow two schemes, depending on how samples interact with models.
The first scheme evaluates a sample by analyzing the output of a model while the model is kept fixed. A straightforward method is to calculate the reconstruction loss of a Convolutional Autoencoder (CAE) trained with in-distribution samples; some algorithms further enhance the CAE results by integrating other methods [41, 43]. [4] proposes a method to find a small subset of training data that is responsible for an incorrect test prediction by observing the results when the suspicious subset is removed. The Shapley value has also been employed to quantify the influence of training samples [13] and features [28], but its computation is expensive. [29] assumes that a high-quality sample should yield the same result under different models, and thus detects abnormal samples by observing the predictions of a series of models with different parameters.
The second scheme evaluates samples by observing the influence of a given sample on the model. A straightforward method is Leave-One-Out (LOO), which determines the influence of a sample by leaving it out and re-training the model with the remaining samples. In this way, LOO can compare models trained with and without the sample and thereby observe the influence of the given sample on the model. However, LOO becomes computationally prohibitive when the number of samples is large, making it unsuitable for deep learning applications. Another method [24] estimates the result of training the model with/without a given sample by integrating the Influence Function [6] into a CNN model.
Our method belongs to the second scheme. Instead of comparing the performance difference when training with/without the given sample, our method directly calculates the influence of a given sample on the model by fine-tuning the model with that sample.
3. The proposed method
To detect out-of-distribution samples, we propose a novel method to evaluate the influence of a given sample on the model. The motivation behind our method is the observation that a model trained on in-distribution samples is relatively robust when re-trained with an in-distribution testing sample, but more sensitive when re-trained with an out-of-distribution testing sample. Inspired by this behavior, we propose the response score to quantitatively evaluate the influence of a sample on the model/task. Specifically, we calculate the change of prediction values when adjusting a (trained) model with a given single sample. The adjustments of the model weights due to the given sample reflect the model's behavior when confronted with this single sample, and thus we can measure the amplified influence of the sample on the model.
Algorithms
Based on the above-described scheme, we now describe the procedure for quantitatively evaluating the influence of a given sample on a model, which we term the response score. A higher response score means that the sample induces a larger change in the model and is thus more likely to be out of distribution. Algorithm 1 outlines the key steps of our proposed method.
Algorithm 1.
Procedure of the proposed method.
| Input: training data Xr, testing samples Xe. |
| Parameter: Initialize model M0; batch size: 128. |
| Output: Scores of samples in Xe. |
| 1: Train M0 with Xr and obtain M. |
| 2: Obtain prediction values P of Xr with M. |
| 3: for each testing sample ti in Xe do |
| 4:  Mi ← M |
| 5:  Mi ← adjust last f layers of Mi with ti for E epochs. |
| 6:  Obtain prediction values Pi of Xr with Mi. |
| 7:  Calculate score si according to Formula 1. |
| 8: end for |
| 9: Obtain response scores of samples in Xe: S = {s1, …, sn}. |
Formally, for a model $M$ trained with a set of samples $X_r$, let $L(x_i, \theta)$ be the loss of sample $x_i \in X_r$, where $\theta$ denotes the parameters of $M$ and is defined as $\theta \triangleq \arg\min_{\theta} \sum_{x_i \in X_r} L(x_i, \theta)$. Given a testing sample $x_e$, if we train the model $M$ with $x_e$, the parameters will change to $\theta' \triangleq \arg\min_{\theta'} L(x_e, \theta')$. So the influence of $x_e$ can be reflected by $\Delta\theta = \theta - \theta'$. We propose an indirect way by calculating the change of the loss value for the training samples, defined as $\Delta L = L(X_r, \theta) - L(X_r, \theta')$. Specifically, it can be calculated by comparing the difference of prediction values, as shown in Formula 1:
| $s_i = \sum \left( P - P_i \right)^2$ | (1) |
where $s_i$ is the response score of a testing sample $t_i \in X_e$, $M$ is the model trained with the training data $X_r$, and $M_i$ is the model obtained by adjusting $M$ with the testing sample $t_i$. $P$ and $P_i$ are the prediction values of $M$ and $M_i$ for the training data, respectively. That is, we first calculate the difference of the outputs of $M$ and $M_i$, then sum the squares of the differences into a scalar, and take this scalar as the response score.
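To make Algorithm 1 and Formula 1 concrete, the following is a minimal Keras sketch of the response-score computation (Keras is the framework used in our experiments). The function name response_scores, the mean-squared fine-tuning loss, and the default hyper-parameter values are illustrative assumptions rather than our exact implementation; the sketch assumes an autoencoder-style base model whose prediction is a reconstruction of its input.

```python
import numpy as np
from tensorflow import keras

def response_scores(base_model, x_train, x_test, num_adjust_layers=2,
                    epochs=100, learning_rate=1e-3):
    """Compute a response score (Formula 1) for every testing sample.

    base_model : model M already trained on x_train (here an autoencoder).
    x_train    : in-distribution training data X_r.
    x_test     : samples to be scored.
    """
    # Predictions P of the base model M on the training data.
    p_base = base_model.predict(x_train, verbose=0)

    scores = []
    for t in x_test:
        # Work on a fresh copy M_i so the base model is never modified.
        m_i = keras.models.clone_model(base_model)
        m_i.set_weights(base_model.get_weights())

        # Adjust only the last few layers; all earlier layers are frozen.
        for layer in m_i.layers[:-num_adjust_layers]:
            layer.trainable = False
        m_i.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                    loss="mse")

        # Fine-tune the copy with the single sample t_i (reconstruction target
        # for an autoencoder; a classifier would use the sample's label instead).
        m_i.fit(t[np.newaxis], t[np.newaxis], epochs=epochs, verbose=0)

        # Predictions P_i of the adjusted model M_i on the training data.
        p_adjusted = m_i.predict(x_train, verbose=0)

        # Formula 1: sum of squared prediction differences.
        scores.append(float(np.sum((p_base - p_adjusted) ** 2)))
    return np.array(scores)
```

Cloning the base model for every testing sample keeps the scores independent of one another, at the cost of one short fine-tuning run per sample.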
Extension to applications
Here we discuss how the proposed method can be extended to address several applications. We apply our method to three tasks concerning the improvement of data quality. Specifically, we define an out-of-distribution sample as 1) a sample not belonging to the given class in anomaly detection; 2) a sample whose distribution differs from that of the testing data in the domain mismatch problem; and 3) a mislabeled sample in the mislabeled sample identification problem.
To validate the effectiveness of our method, two CNN models are employed for the three tasks. The first is an autoencoder-like model for the anomaly detection (one-class classification) task, whose prediction value is a matrix of reconstruction values. Here a Convolutional Autoencoder (CAE) is employed as the base model, and its structure is shown in Table 1. In this experiment, we show that the proposed method can obtain remarkable performance for the anomaly detection problem when integrated into a simple CAE model. The second is a binary classification model for detecting domain mismatched samples and identifying mislabeled samples, whose prediction value is a vector representing the probabilities of the sample belonging to the two classes. We use a Wide Residual Network (WRN) [39] as the base model, and its structure is shown in Figure 1. In this experiment, we show that our method can efficiently identify a majority of domain-mismatched/mislabeled samples.
Table 1.
The proposed structure for images of size 32×32 pixels. BN represents the Batch Normalization [19] layer, and s=2 means the stride is set to 2 pixels.
| Autoencoder | |
|---|---|
| Input: | Image x (32×32×1) |
| [Layer 1] | Conv(3×3,64) s=2, BN, ReLU |
| [Layer 2] | Conv(3×3,128) s=2, BN, ReLU |
| [Layer 3] | Conv(3×3,256) s=2, BN, ReLU |
| [Layer 4] | Flatten() |
| [Layer 5] | Dense(256) |
| [Layer 6] | Dense(4096), BN, ReLU |
| [Layer 7] | Reshape(4, 4, 256), BN, ReLU |
| [Layer 8] | DeConv(3×3,128) s=2, BN, ReLU |
| [Layer 9] | DeConv(3×3,128) s=2, BN, ReLU |
| [Layer 10] | DeConv(3×3,64) s=2, 'tanh', ReLU |
| Output: | y(32×32×1) |
Figure 1:
The structure of the WRN [39] for images of size 64×64 pixels employed in our experiment. Each dotted block contains three convolutional layers, batch normalization layers, and activation layers. A skip connection is also used in each block to learn more hierarchical features. The kernel size is set to 3×3 for all layers, and ReLU [14] is used as the activation function. We also use three blocks for images of size 32×32; in this case, the output of the third block is 4×4×256.
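As an illustration of the block layout described in the caption above, below is a minimal Keras sketch of one wide-residual block, assuming three 3×3 convolutions each followed by batch normalization and ReLU plus a skip connection; the widths, strides, and shortcut projection here are illustrative assumptions, while our actual WRN configuration follows [39] and Figure 1.

```python
from tensorflow.keras import layers

def wide_residual_block(x, filters, stride=1):
    """A sketch of one dotted block in Figure 1: three 3x3 convolutions,
    each followed by batch normalization and ReLU, plus a skip connection."""
    shortcut = x
    y = x
    for i in range(3):
        y = layers.Conv2D(filters, 3, strides=stride if i == 0 else 1,
                          padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
    # Project the shortcut with a 1x1 convolution when shapes differ.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 padding="same")(shortcut)
    return layers.Add()([y, shortcut])
```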
4. Experiments
We evaluate the performance of the proposed method by using two models on four biomedical datasets for three tasks: recognizing abnormal samples, detecting domain mismatched samples, and identifying mislabeled samples.
4.1. Recognizing anomaly by one-class classification
To address the data imbalance problem, OCC has been proposed, which focuses on learning class-relevant features from samples belonging to a single class. Based on how the features are learned, previous OCC algorithms fall into two schemes: feature mapping and feature fitting.
The first scheme focuses on mapping features of given samples into a feature space. A sample belonging to the given class can also be mapped into the same region of the feature space, while samples not belonging to the given class are mapped elsewhere, making them distinguishable. A typical method is the One-Class Support Vector Machine (OCSVM) [33], which maps features by kernel functions and optimizes the margin between the samples and the origin.
The second scheme focuses on learning a model that only responds to the given class. Thus, the model will work well for positive samples, yet it will collapse for negative samples. CAE is widely employed as the base model in this field. Some studies also combine CAE with other methods such as Gaussian Mixture Model [43] and energy-based loss [40].
Our method falls into the second scheme, which builds a model that only works for positive samples. To accomplish the OCC task, we apply our method by observing the influence of each testing sample on the previously trained model. Specifically, we first train a model with positive samples and then adjust it with a testing sample. If the model changes significantly, this sample probably does not belong to the given class.
Experimental setting
In this task, we employ a CAE as the base model; the structure used for images of size 32×32 is shown in Table 1. For images of size 64×64, we use four convolutional and deconvolutional layers. Keras [5] on an NVIDIA TITAN GPU is used to conduct our experiments. Adam [23] is adopted as the optimizer, with the learning rate set to 10⁻³ for the MRI and BCCD datasets and 10⁻⁵ for the FFDM dataset. The batch size is set to 128. In the testing stage, we adjust the last two convolutional layers of the model with each testing sample for E=100 epochs and then obtain the response score for each sample. All methods are run three times, and the mean performance is reported for each method.
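For reference, a small helper like the hypothetical unfreeze_last_conv_layers below could be used to restrict the per-sample adjustment to the last two (de)convolutional layers before compiling and fine-tuning the copied model; the helper name and the selection logic are illustrative assumptions, not part of our released code.

```python
from tensorflow.keras import layers

def unfreeze_last_conv_layers(model, n=2):
    """Freeze every layer, then re-enable training only for the last n
    (de)convolutional layers, mirroring the testing-stage adjustment."""
    for layer in model.layers:
        layer.trainable = False
    conv_like = [layer for layer in model.layers
                 if isinstance(layer, (layers.Conv2D, layers.Conv2DTranspose))]
    for layer in conv_like[-n:]:
        layer.trainable = True
    return model
```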
Datasets
Three different biomedical imaging datasets (see Table 2) of different modalities and types are employed for OCC experiments, as briefly described below.
Table 2.
Statistics of the three datasets.
| Dataset | Class | Dimension | Training (positive) | Testing | Positive in testing | Total |
|---|---|---|---|---|---|---|
| MRI | 2 | 32×32×1 | 1000 | 1872 | 946 | 2872 |
| FFDM | 1 | 64×64×1 | 200 | 104 | 52 | 304 |
| BCCD | 3 | 64×64×1 | 1000 | 2000 | 1271 | 3000 |
Breast Magnetic Resonance Imaging (MRI) dataset: This is a dataset collected at our institution from women diagnosed with breast cancer (see Figure 2). It contains 1000 tumor images (positive) for training, as well as 946 tumor images and 926 normal tissue images for testing. These images are patches automatically segmented/cropped from the full resolution images. The image size ranges from 8 pixels to 125 pixels, with a mean size of 31 pixels, so we resize all the images to 32×32 pixels for our experiments.
Breast Full-Field Digital Mammography (FFDM): This is an in-house X-ray breast imaging dataset for women diagnosed with breast cancer (see Figure 3). We select 200 tumor images (positive samples) for training, and 52 tumor images and 52 normal tissue images for testing. The image size ranges from 100 to 500 pixels, so we resize all the images to 64×64 pixels for consistency.
Blood Cell Classification Dataset (BCCD): The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important biomedical applications. BCCD (see Figure 4) is an extremely imbalanced dataset for blood cell detection from Kaggle1. It includes about 4400 red blood cells (RBC, class 1, normal), 359 white blood cells (WBC, class 2), and 370 platelets (P, class 3). Here we use 1000 RBC images for training, and 1271 RBC, 359 WBC, and 370 platelet images for testing.
Figure 2:
Positive (the first row) and negative (the second row) samples of the breast tumor MRI dataset.
Figure 3:
Positive (the first row) and negative (the second row) samples of the FFDM dataset.
Figure 4:
Positive (class 1) and negative (classes 2 and 3) samples of the BCCD dataset.
Comparing methods
We compare our method with five classic methods: OCSVM [33], CAE, COCSVM, DSEBM [40], and DAGMM [43]. All deep learning methods are trained for 200 epochs.
OCSVM: One-Class SVM [32] learns a mapping of the original samples into a new feature space via kernel functions. In our experiment, RBF is used as the kernel function. We grid search the parameters c ∈ {0.01, 0.02, …, 0.09} and g ∈ {2⁻⁷, 2⁻⁶, …, 2²}, and report the best performance of the model.
CAE: We also compare our method with CAE, as CAE is used as the base model in our experiment. CAE trains an autoencoder and then uses the reconstruction loss as the score of each sample. The structure is the same as that of our model shown in Table 1.
COCSVM: This method is a two-stage combination of CAE and OCSVM. We first train an autoencoder to reduce dimensions, and then use the bottleneck features to train the OCSVM. The parameter settings are the same as the previous two methods.
DSEBM: Deep Structured Energy-Based Model (DSEBM) [40] directly models the data distribution with deep architecture and uses an energy function as the output of the model. Here we use the encoder of CAE as the base model, and the other settings are the same as the original paper.
DAGMM: Deep Autoencoding Gaussian Mixture Model (DAGMM) [43] employs an autoencoder to compress data and obtains a reconstruction loss, which is further fed into a Gaussian Mixture Model (GMM). By jointly training with GMM, the autoencoder can escape from local optima and further reduce reconstruction errors, thus compressing normal samples while deconstructing abnormal samples better. Here we also use the same autoencoder as used in our method.
Evaluation metric
For imbalanced data, accuracy is not a suitable measure of model performance. Here the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPR) are employed as evaluation metrics. Since AUPR depends on the number of positive samples in the testing data, we calculate it in two ways: taking normal samples as positive (AUPR-NP) and taking anomalies as positive (AUPR-AP).
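As a sketch of how these metrics can be computed, the snippet below uses scikit-learn (an assumption about tooling; any equivalent implementation works) and assumes higher response scores indicate anomalies.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(scores, is_anomaly):
    """Evaluate response scores (higher = more anomalous) against binary
    ground truth (1 = anomaly, 0 = normal sample)."""
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=int)
    auc = roc_auc_score(is_anomaly, scores)
    aupr_ap = average_precision_score(is_anomaly, scores)       # anomalies as positive
    aupr_np = average_precision_score(1 - is_anomaly, -scores)  # normal samples as positive
    return auc, aupr_np, aupr_ap
```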
Result
As shown in Table 3 and Figures 5 and 6, CAE is a fairly stable method that performs better than DAGMM and DSEBM, probably because these datasets are complex with large variations. The proposed method improves over CAE by a large margin, which validates the effectiveness of our method when integrated into CAE. The strong performance of our method on the MRI and FFDM datasets is probably because both consist of grayscale images, making it easier for autoencoders to learn features. Compared to the other deep learning-based algorithms, our method obtains state-of-the-art performance on these three datasets.
Table 3.
Comparison of our method to five other previous methods. Note that the results of DSEBM on the FFDM dataset were obtained based on switched sample labels aiming to optimally use the DSEBM algorithm on the FFDM images.
| data | Metric | c | OCSVM | COCSVM | DAGMM | DSEBM | CAE | OURS |
|---|---|---|---|---|---|---|---|---|
| MRI | AUC | 1 | 0.802 | 0.844 | 0.587 | 0.617 | 0.782 | 0.901 |
| | AUPR-NP | 1 | 0.831 | 0.866 | 0.546 | 0.592 | 0.825 | 0.933 |
| | AUPR-AP | 1 | 0.784 | 0.826 | 0.640 | 0.682 | 0.713 | 0.839 |
| FFDM | AUC | 1 | 0.771 | 0.710 | 0.711 | 0.734 | 0.631 | 0.956 |
| | AUPR-NP | 1 | 0.830 | 0.744 | 0.729 | 0.725 | 0.632 | 0.947 |
| | AUPR-AP | 1 | 0.700 | 0.677 | 0.714 | 0.715 | 0.623 | 0.962 |
| BCCD | AUC | 1 | 0.990 | 0.943 | 0.565 | 0.974 | 0.956 | 0.982 |
| | AUPR-NP | 1 | 0.994 | 0.960 | 0.586 | 0.984 | 0.969 | 0.991 |
| | AUPR-AP | 1 | 0.985 | 0.931 | 0.547 | 0.960 | 0.946 | 0.965 |
Figure 5:
The ROC curves of different OCC methods on the breast MRI dataset. Our method obtains the highest performance on this dataset.
Figure 6:
The precision-recall curves of different OCC methods on the breast MRI dataset.
On the BCCD dataset, OCSVM and COCSVM perform better than the proposed method, which we attribute to two main reasons. First, OCSVM is generally more suitable for small-scale datasets than deep learning-based methods. Second, testing labels are used to select the best performance by grid searching parameters over a 10×10 hyper-parameter space.
To qualitatively evaluate the model's performance, several examples detected by our method on the three datasets are shown in Figure 7. Visual assessment of these samples indicates that they are truly out of distribution.
Figure 7:
Examples of out-of-distribution samples detected by the proposed method on the three datasets of MRI (a), FFDM (b), and BCCD (c). See Figures 2~4 for some in-distribution samples.
Robustness analysis
To evaluate the influence of hyper-parameters, we report the results of adjusting different numbers of layers, specifically the last 1, 2, and 3 convolutional layers. As shown in Table 4, the proposed method obtains stable performance across these settings.
Table 4.
Comparison of adjusting different layers on MRI dataset.
| adjusted layers | 1 | 2 | 3 |
|---|---|---|---|
| AUROC | 0.926 | 0.901 | 0.899 |
| AUPR-NP | 0.947 | 0.933 | 0.933 |
| AUPR-AP | 0.883 | 0.839 | 0.843 |
We also report the results of grid searching the learning rate lr in {10⁻², 10⁻³, 10⁻⁴, 10⁻⁵} and the epoch number E in {60, 80, 100, 120}. As shown in Table 5, the learning rate is a critical factor with a more significant influence than the number of fine-tuning epochs E. For the breast MRI dataset, the best learning rates are 10⁻² and 10⁻³; for the FFDM dataset, the best learning rate is 10⁻⁵. With an appropriate learning rate, the proposed method obtains better performance than CAE.
Table 5.
Comparison of AUC for different lr and E. The best result of each row is shown in bold.
| data | E | 10⁻² | 10⁻³ | 10⁻⁴ | 10⁻⁵ |
|---|---|---|---|---|---|
| MRI | CAE | 0.697 | 0.757 | 0.736 | 0.728 |
| | 60 | 0.937 | 0.920 | 0.819 | 0.570 |
| | 80 | 0.865 | 0.974 | 0.748 | 0.567 |
| | 100 | 0.907 | 0.901 | 0.798 | 0.399 |
| | 120 | 0.847 | 0.685 | 0.783 | 0.566 |
| FFDM | CAE | 0.690 | 0.580 | 0.558 | 0.427 |
| | 60 | 0.212 | 0.287 | 0.873 | 0.936 |
| | 80 | 0.269 | 0.481 | 0.637 | 0.964 |
| | 100 | 0.432 | 0.616 | 0.658 | 0.956 |
| | 120 | 0.222 | 0.623 | 0.593 | 0.984 |
4.2. Detecting domain mismatched samples
Domain mismatch is a common problem because the distributions of training and testing data may have discrepancies, producing a model that has high accuracy on the training set but poor testing performance [2]. In this task, we show that our model can detect training samples that have a different distribution than the testing data.
Experiment settings
In this task, we first train a classifier with the whole training dataset (including the domain mismatched samples). We then fine-tune the model with each training sample and calculate the change of the prediction values as the score. A higher score signifies that the sample is more likely to belong to another domain. Here WRN [39] is employed as the base model with two output nodes for our binary classification setting. The structure is shown in Figure 1, and the learning rate is set to 2×10⁻⁴. In the testing procedure, we adjust the whole network for 100 epochs. The other experimental settings are the same as in the OCC task.
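The scoring step is analogous to the sketch in Section 3, except that the whole network is adjusted with a labeled sample under a cross-entropy loss. A minimal illustrative variant is given below; the function name, the one-hot label assumption, and the default hyper-parameters are assumptions for the sketch only.

```python
import numpy as np
from tensorflow import keras

def response_score_classifier(base_model, x_train, sample, label,
                              epochs=100, learning_rate=2e-4):
    """Response score for the supervised (WRN) setting: adjust the whole
    network with one labeled training sample and measure how much the
    softmax predictions on the training set change."""
    p_base = base_model.predict(x_train, verbose=0)

    # Copy the trained classifier and fine-tune it with the single labeled sample.
    m_i = keras.models.clone_model(base_model)
    m_i.set_weights(base_model.get_weights())
    m_i.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                loss="categorical_crossentropy")
    m_i.fit(sample[np.newaxis], label[np.newaxis], epochs=epochs, verbose=0)

    p_adjusted = m_i.predict(x_train, verbose=0)
    return float(np.sum((p_base - p_adjusted) ** 2))
```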
Dataset
The HEp-2 Cell Image dataset2 is a publicly available dataset provided by the International Conference on Image Processing (ICIP) 2013 [8] for the Human Epithelial type 2 (HEp-2) cell image classification competition. It consists of 6 classes of images, and each class includes intermediate and positive sub-classes based on lightness. To simulate the domain mismatch situation, we conduct our experiment on a selected subset as shown in Table 6. In this experiment, the distribution of samples in domain 2 is different from that of domain 1, and our goal is to detect the samples of domain 2.
Table 6.
Statistics of the HEp-2 dataset.
| | Train | | Test | |
|---|---|---|---|---|
| class | 1 | 2 | 1 | 2 |
| domain 1 (intermediate) | 400 | 400 | 500 | 500 |
| domain 2 (positive) | 100 | 100 | 0 | 0 |
Result
The proposed method is compared with the Influence Function [24], using the code provided by the original paper. As shown in Table 7, the proposed method obtains fairly stable performance across the two classes.
Table 7.
Results of domain mismatch problem.
| | AUC | | AUPR-AP | |
|---|---|---|---|---|
| class | 1 | 2 | 1 | 2 |
| Influence Function | 0.518 | 0.967 | 0.459 | 0.969 |
| OURS | 0.716 | 0.817 | 0.536 | 0.623 |
4.3. Identifying mislabeled samples
Samples with correct labels are critical for data-driven models. However, collected datasets potentially contain mislabeled samples, which may severely affect CNN models [10]. To address this issue, we demonstrate that our method can also be used to identify mislabeled samples.
Experiment settings
The key idea of this task is to detect the samples that are most influential on the model, as such samples are potentially mislabeled. Specifically, we first train a model with the original dataset (potentially including mislabeled data), then adjust the model with each training sample for 100 epochs and calculate the response score of that sample. After this, we select the top 5%, …, 35% of samples ranked by response score and compare their labels with the ground truth; if a label is flipped, we flip it back. After checking all selected samples, we re-train the model with the checked training dataset and evaluate it on the testing dataset. As in Section 4.1, Adam [23] is employed as the optimizer with the learning rate set to 10⁻³ for the proposed method.
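A minimal sketch of this simulated check-and-repair step follows; the function name repair_top_ranked and its return values are illustrative assumptions, and in practice the ground-truth comparison would be replaced by a human check of the selected samples.

```python
import numpy as np

def repair_top_ranked(scores, noisy_labels, true_labels, fraction=0.05):
    """Simulated check-and-repair step: inspect the `fraction` of training
    samples with the highest response scores and flip a checked label back
    when it disagrees with the ground truth."""
    scores = np.asarray(scores)
    true_labels = np.asarray(true_labels)
    n_check = int(len(scores) * fraction)
    checked = np.argsort(scores)[::-1][:n_check]     # highest scores first
    repaired = np.array(noisy_labels, copy=True)
    found = int(np.sum(repaired[checked] != true_labels[checked]))
    repaired[checked] = true_labels[checked]          # simulate the correction
    return repaired, found  # checked-and-repaired labels, number of flips found
```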
Dataset
We conduct this experiment on the breast tumor MRI dataset with a different number of samples. The dataset used here includes 1000 positive and 1000 negative samples for training, and 946 positive and 926 negative samples for testing. Testing labels are not accessed by the model. 30% of the training labels are randomly flipped, with no label flipped more than once.
Comparing methods
We compare our method with a loss-based method, in which we first train the model with the mislabeled training dataset and then sort the samples based on the prediction loss with respect to the given labels of the training samples.
Result
From Figure 8 we can see that mislabeled samples mixed into the dataset severely influence the performance of CNN models: adding 30% mislabeled samples makes the performance drop from about 0.97 to 0.86. We can also observe that using our method to identify mislabeled samples allows us to repair the dataset, even when 30% of the samples are flipped. The proposed method (red), which uses the change of prediction values, outperforms the loss-based method (blue), validating that our method better captures the relationship between samples and models.
Figure 8:
The accuracy of the testing dataset (left) and the fraction of flipped data detected (right) when checking different proportions of training samples for identifying mislabeled data. The result obtained by the model trained with clean data (black) can be regarded as the upper bound of the performance, and the result obtained by the model trained with noisy data (pink) can be regarded as the lower bound. Our method obtains better performance especially when checking more training data.
5. Conclusion
In this paper, we present a simple and novel method to quantitatively evaluate the response of a deep learning model adjusted by a single given sample. The proposed method can be employed to detect out-of-distribution samples. We integrate our method into two CNN models for abnormality recognition, domain mismatch detection, and mislabel identification, and it achieves remarkable performance on four biomedical datasets. Our method provides new avenues for future research on the interactions between models and samples. It can be further extended to multi-class classification tasks, and it would also be interesting to study how to select the best model for different datasets and tasks.
Highlights.
A novel method to investigate the interactions between models and data.
A new measure to evaluate the influence of a single sample on a CNN model.
Experiments in three use cases on four datasets show remarkable performance.
6. Acknowledgment
This work was supported by National Institutes of Health (NIH)/National Cancer Institute (NCI) grants (1R01CA193603, 3R01CA193603-03S1, and 1R01CA218405), a Radiological Society of North America (RSNA) Research Scholar Grant (#RSCH1530), an Amazon Machine Learning Research Award, and a University of Pittsburgh Physicians (UPP) Academic Foundation Award. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU for our research. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The authors thank Ms Kadie Clancy for helping proofread the article.
Biographies
Long Gao received the B.S. degree from the College of Computer, National University of Defense Technology, Changsha, China, in 2016. He is working towards the Ph.D. degree in the College of Computer, National University of Defense Technology, Changsha, China. He is currently a visiting student in the Department of Radiology, University of Pittsburgh, PA, USA. His current research interests include machine learning, image inpainting, anomaly detection, and medical image analysis.
Shandong Wu received his PhD degree in Computer Vision from the City University of Hong Kong. He completed postdoctoral training at the University of Central Florida in computer vision and at the University of Pennsylvania in clinical radiology research. Dr. Wu is an Assistant Professor with joint appointments in Radiology (primary), Biomedical Informatics, Computer Science, Bioengineering, Intelligent Systems, and Clinical and Translational Science at the University of Pittsburgh. He leads the Intelligent Computing for Clinical Imaging (ICCI) lab and is the founding director of the Pittsburgh Center for Artificial Intelligence Innovation in Medical Imaging. His research interests include computational biomedical imaging analysis, big (health) data coupled with machine/deep learning, radiomics/radiogenomics, and artificial intelligence in clinical informatics/workflows.
References
- [1] Antoniou A, Storkey A, Edwards H, 2017. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
- [2] Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW, 2010. A theory of learning from different domains. Machine Learning 79, 151–175.
- [3] Bouadjenek MR, Verspoor K, Zobel J, 2017. Automated detection of records in biological sequence databases that are inconsistent with the literature. Journal of Biomedical Informatics 71, 229–240.
- [4] Cadamuro G, G.B.R., Zhu X, 2016. Debugging machine learning models, in: ICML Workshop on Reliable Machine Learning in the Wild.
- [5] Chollet F, 2015. Keras: https://github.com/fchollet/keras.
- [6] Cook RD, 1977. Detection of influential observation in linear regression. Technometrics 19, 15–18.
- [7] DeVries T, Taylor GW, 2018. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
- [8] Foggia P, Percannella G, Soda P, Vento M, 2013. Benchmarking HEp-2 cells classification methods. IEEE Transactions on Medical Imaging 32, 1878–1889.
- [9] Fotouhi S, Asadi S, Kattan MW, 2019. A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics.
- [10] Frénay B, Verleysen M, 2013. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems 25, 845–869.
- [11] Frid-Adar M, Klang E, Amitai M, Goldberger J, Greenspan H, 2018. Synthetic data augmentation using GAN for improved liver lesion classification, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, pp. 289–293.
- [12] Ganganwar V, 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2, 42–47.
- [13] Ghorbani A, Zou J, 2019. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868.
- [14] Glorot X, Bordes A, Bengio Y, 2011. Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
- [15] Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y, 2014. Generative adversarial networks. Advances in Neural Information Processing Systems 3, 2672–2680.
- [16] Hauskrecht M, Batal I, Hong C, Nguyen Q, Cooper GF, Visweswaran S, Clermont G, 2016. Outlier-based detection of unusual patient-management actions: an ICU study. Journal of Biomedical Informatics 64, 211–221.
- [17] Hauskrecht M, Batal I, Valko M, Visweswaran S, Cooper GF, Clermont G, 2013. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics 46, 47–55.
- [18] Hendrycks D, Gimpel K, 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- [19] Ioffe S, Szegedy C, 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- [20] Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, Kripalani S, 2011. Risk prediction models for hospital readmission: a systematic review. JAMA 306, 1688–1698.
- [21] Kaur H, Pannu HS, Malhi AK, 2019. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR) 52, 79.
- [22] Khan SS, Madden MG, 2014. One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review 29, 345–374.
- [23] Kingma DP, Ba J, 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- [24] Koh PW, Liang P, 2017. Understanding black-box predictions via influence functions, in: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, pp. 1885–1894.
- [25] Lakshminarayanan B, Pritzel A, Blundell C, 2017. Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, pp. 6402–6413.
- [26] Liang S, Li Y, Srikant R, 2017. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
- [27] Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE, 2017. A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26.
- [28] Lundberg SM, Lee SI, 2017. A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, pp. 4765–4774.
- [29] Patel K, Drucker SM, Fogarty J, Kapoor A, Tan DS, 2011. Using multiple models to understand data, in: Twenty-Second International Joint Conference on Artificial Intelligence.
- [30] Ruff L, Vandermeulen RA, Görnitz N, Deecke L, Siddiqui SA, Binder A, Müller E, Kloft M, 2018. Deep one-class classification, in: Proceedings of the 35th International Conference on Machine Learning, pp. 4393–4402.
- [31] Sanchez-Hernandez C, Boyd DS, Foody GM, 2007. One-class classification for mapping a specific land-cover class: SVDD classification of fenland. IEEE Transactions on Geoscience and Remote Sensing 45, 1061–1073.
- [32] Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC, 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13, 1443–1471.
- [33] Schölkopf B, Williamson RC, Smola AJ, Shawe-Taylor J, Platt JC, 2000. Support vector method for novelty detection, in: Advances in Neural Information Processing Systems, pp. 582–588.
- [34] Taft L, Evans RS, Shyu C, Egger M, Chawla N, Mitchell J, Thornton SN, Bray B, Varner M, 2009. Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery. Journal of Biomedical Informatics 42, 356–364.
- [35] Verikas A, Gelzinis A, Bacauskiene M, 2011. Mining data with random forests: A survey and results of new tests. Pattern Recognition 44, 330–349.
- [36] Xiaodan Xu HL, Yao M, 2019. Recent progress of anomaly detection. doi:10.1155/2019/2686378.
- [37] Yeh CK, Kim J, Yen IEH, Ravikumar PK, 2018. Representer point selection for explaining deep neural networks, in: Advances in Neural Information Processing Systems, pp. 9291–9301.
- [38] Yosinski J, Clune J, Bengio Y, Lipson H, 2014. How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems, pp. 3320–3328.
- [39] Zagoruyko S, Komodakis N, 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
- [40] Zhai S, Cheng Y, Lu W, Zhang Z, 2016. Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717.
- [41] Zhou C, Paffenroth RC, 2017. Anomaly detection with robust deep autoencoders, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 665–674.
- [42] Zhou ZH, 2017. A brief introduction to weakly supervised learning. National Science Review 5, 44–53.
- [43] Zong B, Song Q, Min MR, Cheng W, Lumezanu C, Cho D, Chen H, 2018. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection.