Abstract
In data-driven deep learning-based modeling, data quality can substantially influence classification performance, and correct data labeling is critical. In weakly-supervised learning, a key challenge lies in dealing with potentially inaccurate or mislabeled training data. In this paper, we propose an automated methodological framework to identify mislabeled data using two metric functions: the Cross-entropy Loss, which indicates the divergence between a prediction and the ground truth, and the Influence function, which reflects the dependence of a model on individual data points. After correcting the identified mislabels, we measured their impact on classification performance. We also compared the mislabeling effects in three experiments addressing two different real-world clinical questions. A total of 10,500 images were studied in the contexts of clinical breast density category classification and breast cancer malignancy diagnosis. We used intentionally flipped labels as mislabels to evaluate the proposed method at varying proportions of mislabeled data included in model training. We also compared our method with two previously published schemes for breast density category classification. Experiment results show that when the dataset contains 10% mislabeled data, our method can automatically identify up to 98% of these mislabels by examining only the top 30% of the ranked full dataset. Furthermore, we show that correcting the identified mislabels leads to an improvement in classification performance. Our method provides a feasible solution for weakly-supervised deep learning modeling in dealing with inaccurate labels.
Keywords: Deep learning, weakly-supervised learning, Inaccurate label, Metric function, Mammogram
I. INTRODUCTION
ARTIFICIAL intelligence (AI) algorithms, particularly deep learning, have demonstrated remarkable success in many areas including biomedical imaging analytics [1]. However, understanding and improving the data-driven training mechanism for robust deep learning modeling remains an active research area. From the data perspective, challenges for deep learning include learning from a small amount of labeled data, inexactly labeled data, or data with inaccurate labels [2]. Machine learning under these challenges is generally considered learning with weak supervision, in comparison to supervised learning with a large number of accurately labeled samples. In weakly-supervised machine learning, the amount and accuracy of data labels affect the performance of deep learning models. The potential influence of inaccurate labels on deep learning models’ performance particularly merits investigation.
When training a convolutional neural network (CNN) model, its potential performance is subject to the scale and quality of the training data. Big data applications are common thanks to the emergence of many large-scale databases such as ImageNet [3], ChestX-ray8 [4], etc. Correct data labeling for deep learning model training is critical. However, erroneous or mislabeled data may exist in a training dataset. This may be due to mistakes/biases of users in manually labeling data, errors from natural language processing software in automatically extracting information from reports, system errors, and/or noise. Deep learning models trained with stochastic gradient methods may be vulnerable to training data bias due to mislabeling issues [5]. Ensuring the training data are correctly labeled is therefore an important requirement in deep learning modeling. However, it is neither realistic nor affordable to manually examine the correctness of each label, especially when the training dataset is large. This is a particularly sensitive and challenging task when dealing with medical data, whose labels require professional expertise to examine.
In this paper, we investigated an approach for weakly-supervised deep learning on radiological images with inaccurate labels. More specifically, we aimed to automatically identify and correct potentially mislabeled clinical images in a training dataset and to measure the resulting effects on the model’s classification performance on the test dataset. We first presented a CNN learning framework for medical image classification, then developed a mechanism with two metric functions, i.e., the Cross-entropy Loss and the Influence function [6], to rank, identify, and correct potentially mislabeled data. Specifically, when ground truth labels exist, we manually flipped the labels for a portion of training samples to mimic mislabeled data. We calculated the percentage of the false labels identified automatically by the metric functions with respect to a varying range of flipped labels. By correcting the identified false/flipped labels and putting them back to retrain the logistic classifier, we assessed the response of the classification model by measuring its classification performance. In addition, we compared our method for identifying potentially mislabeled data against a published rule-based criterion and a published general-purpose algorithm. Experiments were performed on two large, distinct breast imaging datasets reflecting real clinical imaging classification scenarios.
The main contribution of this study is three-fold: 1) we designed a weakly supervised scheme coupled with two metric functions to automatically identify and correct potentially mislabeled data in deep learning-based modeling; 2) we examined the impact on classification performance of identifying and correcting the mislabels at different scales of mislabeled data; and 3) we tested our scheme on two relatively large medical imaging datasets (a total of 10,500 images) in real-world clinical contexts.
II. RELATED WORKS
In general, weakly supervised learning deals with limited/incomplete labels, inexact labels, and inaccurate labels [2], and readers are referred to a recent survey article [2] for a more comprehensive review of related works. Here we focus on identifying inaccurate labels for weakly supervised classification.
Inaccurate labels or mislabels may present as noise in the machine learning training data. One natural way to clean up the training data is to first identify/detect the noisy labels, such as inconsistencies, missing values, and outliers [7], [8]. Data can be cleaned by having domain experts examine the full dataset and vote on noisy data; however, bias and disagreement may exist among individuals, even among experienced experts [9]. Ensemble learning using majority vote by multiple labeling functions has been proposed to detect mislabeled training data [7], [10]; it is difficult, however, for this approach to do majority voting on cases that have near-identical labeling functions [11]. In addition, data programming has been proposed to “denoise” the training set by programmatically creating labels through generative models [12]. Differential testing, a software testing technique, has also been applied to identify erroneous labels [13]. This method assumes that correctly labeled inputs produce the same outputs from two different neural networks, but it fails when both networks produce wrong predictions. In the literature, another approach to identifying mislabels is to measure the impact of noisy data on a computational model, as shown in an example study of identifying mislabeled texts in emails [14]. To enable learning from massive noisy data, a probabilistic graphical model was introduced to model the relationship between images and class labels in the presence of label noise [15].
With regard to deep learning on medical image data with noisy labels, when a set of clean labels is available, one method is to reweight the noisy training samples based on their gradient directions on the clean dataset [16], as shown in detecting pancreatic tumor regions in whole-slide pathology images annotated with noisy labels [17]. When only noisy labels are available, the loss function may be modified (i.e., by incorporating regularization terms [18]) to recover the true labels from noisy labels, as tested in the classification of ultrasound images. Another way to make a classifier robust to inaccurate labels is to add a noise adaptation layer into a network model [19], [20]. It should be pointed out that although these methods [16]–[20] reported improved classification performance when learning directly on data with noisy labels, they are not able to identify the mislabels themselves and are thus incapable of examining the impact of noisy labels on model performance. Our approach is different because we follow the pipeline of first identifying/detecting inaccurate labels and then correcting them to improve the model’s classification performance, where the impact of mislabels on the model performance is also explicitly assessed.
III. MATERIAL AND METHODS
A. Imaging datasets and clinical scenarios
This retrospective study is compliant with the Health Insurance Portability and Accountability Act (HIPAA) and received approval from the institutional review board (IRB) at our institution. Two breast imaging datasets (Figure 1) from two different clinical scenarios are summarized in the following.
Fig. 1: Imaging samples for the two different datasets and classification tasks. Top: Breast Imaging Reporting and Data System (BI-RADS) breast density assessment categories. Bottom: diagnosis of Breast Cancer vs Normal cases on the Digital Database for Screening Mammography (DDSM) dataset.
a. Mammographic breast density assessment:
The first dataset consists of digital mammogram images collected from breast cancer screening. The goal of our experiments with this dataset is to automatically identify mammogram images that may have received a potentially inaccurate Breast Imaging Reporting and Data System (BI-RADS) label in the clinical assessment of breast density. Breast density is an established risk marker for developing breast cancer. It measures the amount of dense/fibroglandular tissue in the breast. Current clinical assessment of breast density is made visually by radiologists using the four qualitative BI-RADS categories: A) almost entirely fatty, B) scattered areas of fibroglandular density (or scattered density for short), C) heterogeneously dense, or D) extremely dense. It is well known that BI-RADS-based assessment is subjective and has substantial inter- and intra-reader variability [21], [22]. In particular, while the assessment of A and D is highly consistent and relatively easy, there is greater variability in distinguishing B from C [21], even among experienced breast imaging radiologists. Correct assessment of the BI-RADS density categories is critical in breast cancer screening. A BI-RADS B assessment is considered low risk while a BI-RADS C assessment is considered high risk for developing breast cancer. This may lead to different clinical actions for an individual patient. For example, it is common to involve supplemental screening using ultrasound or magnetic resonance imaging (MRI) for a BI-RADS C assessment, but no further screening interventions may be introduced for a BI-RADS B assessment. In the U.S., 38 states have enacted breast density notification legislation [23] for breast cancer screening, and most of those laws require women to receive information regarding their breast density assessment and risk indication. Because supplemental screening recommendations and risk management may vary by breast density, it is highly desirable to have a consistent assessment of breast density, minimizing the potential mislabels in assigning a BI-RADS density category (i.e., A, B, C, or D).
With the goal of identifying potential mislabels of BI-RADS breast density, we collected a cohort of 1,427 women who underwent standard digital mammography screening and assembled a relatively large dataset of a total of 8,900 digital mammogram images associated with this cohort. The 8,900 images were those reported to have either a BI-RADS A (450 images), B (4,000 images), C (4,000 images), or D (450 images) in their clinical mammography reports. As can be seen from these numbers and previous studies [24], the distributions of the four density categories are unbalanced. Considering clinical relevance and the data imbalance issue, we chose to conduct two binary classification experiments on this dataset (i.e., Category A vs D and Category B vs C). This way, any unexpected influence from unbalanced data may be reduced in building classification models and in identifying mislabeled data. The BI-RADS-based breast density categories that had been routinely assigned in the standard clinical workflow by radiologists were retrieved from mammography reports and used as ground truth (although, as noted earlier, some cases may have been wrongly assessed/labeled; fundamentally, there is no real ground truth for BI-RADS breast density assessment). Note that the BI-RADS density categories were clinically assessed by a mixture of many different board-certified breast imaging radiologists with a varying range of experience in breast imaging. All mammogram examinations were acquired on Hologic (Marlborough, MA) full-field digital mammography units. Both the mediolateral oblique (MLO) and craniocaudal (CC) views of the left and right breasts on the processed (i.e., FOR PRESENTATION1) images were used in this study.
b. Computer-aided breast cancer diagnosis:
The second imaging dataset is the publicly available Digital Database for Screening Mammography (DDSM). This external dataset is used in our study to model computer-aided breast cancer diagnosis using CNN models. The DDSM dataset includes 2,620 mammography screening exams [25], [26], and each exam has CC and MLO views. These exams were labeled into four categories: normal (negative), breast cancer (positive), biopsy benign (those biopsied but with benign results), and benign without callbacks (no additional films or biopsy). The goal is to identify potential false labels of the diagnosis (i.e., breast cancer or normal) using our method. Because film mammograms are no longer clinically used, the DDSM dataset was used to design secondary experiments to test our method. Also, for data balance considerations, we chose to perform a binary classification between the cancer cases (800 images) and normal cases (800 images).
B. Deep learning CNN modeling
The first step of our study is to build deep learning CNN models for two classification experiments on breast density categories and an additional experiment on computer-aided diagnosis of cancer vs normal cases. As shown in Figure 2, our CNN model is based on an Inception V3 model pre-trained on ImageNet [3]. The Inception V3 model [27] is a deep neural network built with inception modules, an architecture consisting of convolutional kernels of different sizes. Instead of using the original architecture of the Inception V3 model, we built our model with two parts: Part 1) the inception modules functioning as a fine-tuned feature extractor, and Part 2) the logistic classifier.
Fig. 2: The proposed pipeline for automated identification of potentially mislabeled data.
We employed transfer learning to enhance our CNN modeling for the classification tasks. We first applied two preprocessing steps to the images, similar to previous studies [28], [29]: the whole-breast regions were separated from non-breast regions in the mammogram images using the LIBRA software [22], [30], and then the intensity distribution of each image was adjusted using histogram equalization [31] (done separately on the breast density dataset and the DDSM dataset). We then fed our own imaging data (the breast density dataset and the DDSM dataset) to the ImageNet-pretrained Inception V3 model to fine-tune the CNN model shown in Figure 2. We froze the first 249 layers, which consist of the front eight inception modules, and then fine-tuned the remaining layers, including two inception modules, by forward and backward propagation. We kept the front layers with their original weights and modified the weights of the later layers to sustain the transferability of features [32], i.e., the model’s ability to capture general imaging features such as edges and angles. We saved the model at every training epoch and selected the one with the best performance on the validation subset. In the subsequent process of identifying mislabeled data, the classifier needs to be trained multiple times. In order to avoid potential overfitting caused by training the CNN model for too many epochs and by training a complex model on a relatively small dataset, similar to [33], [34], we used the fine-tuned Inception V3 model as an offline feature extractor. As shown in the CNN model of Figure 2, the extracted features are fed sequentially into the average pooling layer, the flatten layer, and the logistic layer. The final layer (logistic layer) is trained to classify images of one class against the other. The classification performance is measured by the receiver operating characteristic (ROC) curve and the AUC (area under the ROC curve).
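A minimal Keras sketch of this transfer-learning setup is shown below. The layer index 249 follows the description above; the input size, optimizer, and variable names are illustrative assumptions, since the paper does not specify those details.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Part 1: ImageNet-pretrained Inception V3 backbone (without its classification head).
backbone = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

# Freeze the first 249 layers (the front inception modules); leave the rest trainable.
for layer in backbone.layers[:249]:
    layer.trainable = False
for layer in backbone.layers[249:]:
    layer.trainable = True

# Part 2: average pooling -> flatten -> logistic (sigmoid) layer for binary classification.
inputs = tf.keras.Input(shape=(299, 299, 3))
x = backbone(inputs)
x = layers.GlobalAveragePooling2D()(x)
features = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(features)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])

# After fine-tuning (model.fit on the training images), the network up to the flatten
# layer can be reused as an offline feature extractor feeding the logistic classifier.
feature_extractor = models.Model(inputs, features)
```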
C. Metric functions for examining inaccurate labels
Once the basic deep learning classification models are established, we try to identify the mislabels using two metric functions: the Cross-entropy Loss [27] and the Influence function [6], [35].
The Cross-entropy Loss quantifies how much a predicted result deviates from the ground truth label by assigning a numeric score. We used this function to track the behavior of each labeled image. The Loss function for binary classification is defined as:
L(x, y; \Theta) = -\big[\, y \log p(x; \Theta) + (1 - y) \log\big(1 - p(x; \Theta)\big) \big]    (1)
where x is an imaging matrix and y denotes the true label (0 or 1) of the image; p(x; Θ) denotes the predicted probability (0 ≤ p(x; Θ) ≤ 1) of the image given the model with parameters Θ. The further the predicted probability deviates from the label’s numeric value, the higher the Loss value becomes.
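As a concrete illustration, the per-image loss of Eq. (1) can be computed as in the short sketch below (NumPy); the clipping constant is an implementation detail added here for numerical stability, not part of the paper’s definition.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one image: y_true in {0, 1}, p_pred = predicted probability of class 1."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# A confidently wrong prediction receives a much larger loss than a correct one.
print(binary_cross_entropy(1, 0.95))  # ~0.05
print(binary_cross_entropy(1, 0.05))  # ~3.0
```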
Our second metric function is the Influence function [6], [35], which measures the gradient of the empirical risk with respect to the disturbance that an individual data point receives. For a given test data point, when it receives an infinitesimal disturbance ϵ (i.e., an up-weighting of that point in the empirical risk), the Influence (I) on the model is defined as the derivative of the Loss with respect to the disturbance:
I(x, y) = \frac{\mathrm{d}\, L\big(X, Y; \hat{\Theta}_{\epsilon, x, y}\big)}{\mathrm{d}\epsilon}\bigg|_{\epsilon = 0}, \qquad \hat{\Theta}_{\epsilon, x, y} = \arg\min_{\Theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i; \Theta) + \epsilon\, L(x, y; \Theta)    (2)
where X = {x_i | i = 1, 2, …, n} and Y = {y_i | i = 1, 2, …, n} are the training dataset and the corresponding labels. Similar to [14], we approximated the Influence function by applying the chain rule and a Taylor expansion. Eq. (2) can be analytically approximated as Eq. (3):
I(x, y) \approx -\,\nabla_{\Theta} L\big(X, Y; \hat{\Theta}\big)^{\top} H_{\hat{\Theta}}^{-1}\, \nabla_{\Theta} L\big(x, y; \hat{\Theta}\big), \qquad H_{\hat{\Theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\Theta}^{2} L\big(x_i, y_i; \hat{\Theta}\big)    (3)
where H_{\hat{\Theta}} is the Hessian matrix consisting of the second derivatives of the Loss with respect to the model’s parameters; x_i and y_i represent the i-th training data point and its label in a dataset of size n; x and y denote the single data point and its label, drawn from the training dataset, to be tested with the model; ϵ is the weight assigned to the test point; and \hat{\Theta} denotes the parameters that minimize the empirical risk of the model.
The calculation of Eq. (3) is further simplified by reducing the number of parameters. Since the fine-tuned CNN model is fixed as a feature extractor, only the parameters in the logistic classifier are retrained. Thus, we remove the constant parameters and keep the principal submatrix consisting of the second-order derivatives of the Loss with respect to the parameters of the logistic layer.
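A NumPy sketch of how Eq. (3) can be evaluated when restricted to the logistic-layer weights, as described above; the variable names and the damping term added for invertibility are illustrative assumptions rather than the paper’s exact implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def influence_scores(Z, y, w, damping=1e-3):
    """Approximate I(x_i, y_i) for every training point.

    Z: CNN-extracted feature vectors, one row per training image (n, d)
    y: binary labels (n,); w: fitted logistic-layer weights (d,)
    """
    p = sigmoid(Z @ w)                               # predicted probabilities
    grads = (p - y)[:, None] * Z                     # per-sample gradients of the loss w.r.t. w
    # Hessian of the empirical risk for a logistic layer (damped so it is invertible).
    H = (Z * (p * (1 - p))[:, None]).T @ Z / len(y) + damping * np.eye(Z.shape[1])
    H_inv = np.linalg.inv(H)
    total_grad = grads.mean(axis=0)                  # gradient of the empirical risk over the full set
    # I(x_i, y_i) ~ -grad(full set)^T H^{-1} grad(x_i, y_i); larger magnitude = more influential.
    return -grads @ (H_inv @ total_grad)
```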
Based on the scores of these two metric functions, we rank all the training data. A higher Loss score indicates a larger discrepancy between a sample’s predicted result and its ground truth. A higher Influence score indicates the greater extent to which a sample is responsible for a given prediction. Thus, data with distinctly abnormal scores are more likely to be mislabeled among the predominantly normal data. We use this principle to examine all the data and thereby identify samples with abnormal scores. Note that the two metric functions are used separately, and we compared their effects in identifying the mislabeled data.
D. Identifying mislabeled data
We designed a scheme using the two metric functions to evaluate the likelihood of each data point being mislabeled. Both the Cross-entropy Loss and the Influence function are calculated from the weights of the logistic classifiers. In our scheme, the potentially mislabeled data were intentionally created by manually flipping the labels for a portion of the original full dataset. The portion ranges from 10% to 100% with an interval of 10%. For example, in classifying breast density Category A vs D, we first randomly pick 10% of the images along with their original labels from the full dataset; then we flip the label of each image to “D” if the original label is “A” and vice versa. Next, the CNN model fine-tuned with the partially mislabeled dataset extracts features from the full dataset (including the 10% with flipped labels) and passes the features to the logistic classifier for retraining. Then we compute the Loss function and Influence function scores separately for each image. By ranking the scores, we obtain a list of images (named the Inspection List) prioritized by each sample’s likelihood of being mislabeled. Once the data at the top of the Inspection List are confirmed to have wrong labels, we correct the labels (i.e., flipping them back) and update the training dataset with the corrected labels. As the percentage of data checked increases from 10% to 100% in intervals of 10%, we calculate the accumulated percentage of data that are successfully identified as mislabeled (see the sketch below).
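The ranking-and-checking loop just described can be sketched as follows. The helper names are hypothetical, and the Loss or Influence scores are assumed to have already been computed for every training image.

```python
import numpy as np

def inspection_list(scores):
    """Indices of training samples ordered from most to least suspicious."""
    return np.argsort(-np.abs(scores))

def recovered_fraction(scores, flipped_idx, checked_fraction):
    """Fraction of the intentionally flipped labels found in the top portion of the list."""
    order = inspection_list(scores)
    n_checked = int(np.ceil(checked_fraction * len(order)))
    checked = set(order[:n_checked].tolist())
    return len(checked & set(flipped_idx)) / len(flipped_idx)

# Example usage: report recovery when checking the top 10%, 20%, and 30% of the ranked dataset.
# for frac in (0.1, 0.2, 0.3):
#     print(frac, recovered_fraction(loss_scores, flipped_idx, frac))
```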
E. Method evaluation
We designed three experiments using a total of 10,500 images from the two imaging datasets (Section III.A) to evaluate the effects of our method. The three experiments represent two different classification tasks: 1) classification of breast density Category A vs D, 2) classification of breast density Category B vs C, and 3) classification/diagnosis of Breast Cancer vs Normal images. In the first experiment, we used a relatively simple classification task, distinguishing Category A from D, to test our method. Samples in the two categories are balanced, and the training, validation, and test datasets consist of 225, 113, and 112 images per category, respectively. The second experiment dealt with classification between Categories B and C, which is a much more difficult task. As mentioned earlier, to avoid potential issues from imbalanced data, we chose to use balanced samples in the two categories, and the training, validation, and test datasets consisted of 2,000, 1,000, and 1,000 images per category, respectively. The third experiment served as a secondary test by performing a classification task different from experiments 1) and 2); its training, validation, and test datasets consist of 400, 200, and 200 images per category, respectively.
We evaluated our method by examining its effects in identifying mislabeled data and its impact on the model’s classification performance on unseen test data after automatically correcting the identified mislabels. For each experiment, we randomly select a gradually increasing portion of x% (x ∈ {10, 20, 30, …, 100}) of the samples to flip their labels. This way, we have both the true labels and the mislabels for the flipped samples. We compare the overlap rate of the automatically identified mislabeled image set with the intentionally flipped set of images (used as ground truth) to measure the effects of identifying mislabels. Then the automatically identified mislabels are corrected (flipping the labels back) and put back into the training set to retrain the logistic classifier. The AUC is calculated to measure the impact on classification performance on the unseen test set.
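A minimal sketch of this retrain-and-test step, using scikit-learn for brevity rather than the paper’s Keras logistic layer; the function and argument names are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def retrain_and_score(Z_train, y_corrected, Z_test, y_test):
    """Refit the logistic classifier on corrected labels and report AUC on the untouched test set."""
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_corrected)
    return roc_auc_score(y_test, clf.predict_proba(Z_test)[:, 1])
```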
Furthermore, we performed an objective assessment of our method by comparing it to a published method [28]. Note that since this method is only valid for classification between Categories B and C, the comparison was performed only on the second experiment. For classification of Category B vs C, studies have shown notable inter- and intra-radiologist variability [21]. Thus, since the “ground truth” is not available, we devised an additional and independent criterion to test our method; in this experiment, we do not manually flip data labels but instead use the published quantitative criterion [28] to define the potentially mislabeled data. Specifically, we first compute a quantitative breast density (PD%) using a validated automated computer algorithm (i.e., LIBRA [22], [30]) for each image in the BI-RADS B and C categories. According to the definition of BI-RADS density, it is generally expected that the quantitative PD% of BI-RADS B images is statistically lower than that of BI-RADS C images. While this is not a very strict condition, most clinical assessments meet this assumption. Based on this assumption, we identify Category B images whose PD% is unexpectedly greater than the average PD% of the C images and, likewise, Category C images whose PD% is unexpectedly lower than the average PD% of the B images. According to the BI-RADS breast density definition, these identified images are very likely to have been mislabeled in the visual BI-RADS assessment. We run the published criterion and our method individually on the experiment for classifying Category B vs C and compute the agreement of identified mislabels between the two methods. The agreement A is defined as follows:
A_i = \frac{\lvert O_i \cap P \rvert}{\lvert P \rvert} \times 100\%    (4)
where O_i denotes the set of mislabels identified by our method, P denotes the set of mislabels identified by the published method, and i denotes the percentage of data being checked by our method.
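A small sketch of this comparison, assuming the agreement of Eq. (4) is the fraction of the published method’s flags that also appear in our Inspection List (our reading of the definition); pd_b and pd_c are hypothetical arrays of LIBRA PD% values for the B and C images.

```python
import numpy as np

def pd_criterion_flags(pd_b, pd_c):
    """Indices of suspicious B and C images under the published PD%-based criterion."""
    suspicious_b = np.where(pd_b > pd_c.mean())[0]   # B images denser than the average C image
    suspicious_c = np.where(pd_c < pd_b.mean())[0]   # C images less dense than the average B image
    return suspicious_b, suspicious_c

def agreement(ours, published):
    """Fraction of the published method's flags that our checked set also covers."""
    published = set(published)
    return len(set(ours) & published) / max(len(published), 1)
```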
In addition, we also compared our method to a general-purpose algorithm for mislabel detection [10]. Because the code for this algorithm [10] is not publicly available, we implemented the algorithm using an ensemble learning method composed of multiple majority filters. More specifically, three pre-trained CNN models, namely Inception V3 [27], ResNeXt [36], and ResNet18 [37], were employed in our implementation. We fine-tuned the three models with the respective datasets (see Section III.A) containing 20% flipped labels for the comparative experiments. We applied this general-purpose algorithm to the same three binary classification experiments, i.e., Category A vs D, Category B vs C, and Breast Cancer vs Normal. We calculated the percentages of mislabeled data identified by the majority voting of these three classifiers for each experiment. The parameters for implementing the algorithm [10] were empirically determined to achieve the highest performance.
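A sketch of the majority-filter baseline as we understand it from [10]: three fine-tuned CNNs vote on every training sample, and samples whose given label disagrees with the majority prediction are flagged as mislabeled. The model handles are placeholders assumed to expose a Keras-style predict returning class-1 probabilities.

```python
import numpy as np

def majority_vote_flags(models, images, given_labels):
    """Return indices flagged as mislabeled by the ensemble majority filter."""
    votes = np.stack([m.predict(images).round().ravel() for m in models])  # (3, n) binary votes
    majority = (votes.sum(axis=0) >= 2).astype(int)                        # majority of the 3 classifiers
    return np.where(majority != np.asarray(given_labels))[0]
```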
All experiments in this study were run on computing resources provided by the Pittsburgh Supercomputing Center [38]. The typical configuration is one P100 node containing two NVIDIA P100 GPUs plus Intel Xeon CPUs. The software framework for our experiments was built on top of PyTorch [39], TensorFlow [40], and Keras [41].
IV. RESULTS
This section presents the results of the three experiments. We first report the classification performance of the three CNN models. Then we show the results of identifying the flipped labels and the corresponding performance measures for each model at different scales of mislabels. Finally, we show the performance comparison of our method with the published schemes.
A. CNN Model’s classification performance
We first show the performance of the fine-tuned Inception V3 models on the test subsets without any flipped labels. As seen in Figure 3, the classification AUCs are 0.997, 0.932, and 0.886 for Category A vs D, Category B vs C, and Breast Cancer vs Normal cases, respectively. These results represent fairly good classification performance, and thus the models serve as a firm basis for the identification of mislabeled data.
Fig. 3: Classification performance (ROC curves) of the three CNN models for the three experiments without any flipped labels.
B. Identifying intentionally flipped data
Figure 4 shows the accumulated percentage of the identified data with flipped labels as we increase the fraction of the data inspected. Each curve represents the identification result using one of the two metric functions with respect to a different percentage of flipped labels. Note that results for flipping 50% to 100% of the labels were similar and thus are not shown, to reduce redundancy.
Fig. 4: Results of identifying mislabeled data for the three experiments with respect to different percentages of flipped labels. (A): Breast density Category A vs D. (B): Breast density Category B vs C. (C): Breast Cancer vs Normal cases. Results for flipping 50% to 100% of labels were similar and thus are not shown, to reduce redundancy.
As can be seen in all three experiments, the number of identified mislabels gradually increases as the fraction of examined data increases. For the first experiment (Figure 4A), when 10% of the labels in the full dataset are flipped, our method using either Influence or Loss as the ranking metric can identify 98% of the intentionally flipped data in classifying Categories A and D by automatically examining only the top 30% of the ranked full dataset. Even when checking only 10% of the data, approximately 80% of the flipped data can still be identified. For this relatively easy classification task, the Influence and Loss functions work almost equally well.
For the more difficult classification task, i.e., Category B vs C (Figure 4B), when 10% of the labels in the full dataset are flipped and the top 30% of the full dataset is checked, the Loss function can identify about 90% of the flipped data while the Influence function identifies about 85%. That is, the Loss function performs better than the Influence function in this experiment, but both show relatively lower performance than in the first experiment (Figure 4A).
The results for classifying Breast Cancer vs Normal cases, shown in Figure 4C, are similar to those of the second experiment (Figure 4B), except that the difference between Influence and Loss is even larger (Loss is much better than Influence at identifying mislabeled cases).
Figure 5 shows the model AUC gains on the test dataset with respect to different percentages of flipped labels that are identified and corrected by our method. The models at zero fraction of data examined are those trained with a percentage (i.e., 10% to 40%) of flipped data, without any attempt to identify and correct the flipped/mislabeled data. The test AUC values of these models serve as the performance baseline. As the percentage of flipped labels increases from 10% to 40%, the baseline AUC values range from 0.967 to 0.661 for breast density Category A vs D, from 0.929 to 0.908 for breast density Category B vs C, and from 0.871 to 0.764 for Breast Cancer vs Normal cases, respectively. The higher the percentage of flipped samples (i.e., more mislabels), the lower the baseline models’ AUC values. These baseline results demonstrate the quantitative impact of mislabels on the classification models. Also as shown in Figure 5, our method can bring the dropped AUC values back to the performance of models trained without any flipped labels (results indicated by the red arrow in Figure 5) by automatically identifying and correcting the flipped labels. With an increasing fraction of data being examined, more mislabels are identified and corrected, and thus the ratio of correct labels in the training dataset increases. As a result, the retrained models on the updated training dataset achieved increasingly higher AUC values on the test dataset. Note that flipping 50% to 100% of the labels led to dramatic decreases in AUC values, so those curves were intentionally omitted from the figure to avoid redundancy; the results shown already clearly depict the trend of improving model AUC by identifying and correcting mislabeled data.
Fig. 5: CNN classification models exhibit increased AUC values along with the identification and correction of an increasing number of mislabels. In each subfigure, the red arrow indicates the AUC obtained from unflipped data; an inset shows a magnified view of the boxed part for better readability. (A): Breast density Category A vs D. (B): Breast density Category B vs C. (C): Breast Cancer vs Normal cases. Note that flipping 50% to 100% of labels led to dramatic decreases in AUC, so their curves were omitted from the figure to avoid redundancy.
The whole process of identifying and correcting the flipped data in each experiment is automated and completed within ten hours under our computing conditions.
C. Comparison with related methods
Figure 6 shows the curve of agreement defined by Eq. (4), comparing the mislabeled data identified by our method (using the Loss function) and by the published scheme [28]. The data identified by the published method are those in Category C with quantitative PD% lower than the average PD% in Category B, and those in Category B with PD% greater than the average PD% in Category C. As can be seen, as more data are checked, the overlap increases quickly at the beginning and then slows down. The quick initial increase means our method can identify most mislabels by inspecting a small amount of data. When checking up to 50% of the full dataset, the overlap is approximately 78%. This indicates that the potentially mislabeled data ranked for inspection by our method agree well with those identified by another validated method. Note, however, that our method is more general and applicable to different imaging modalities and tasks, while the compared method works only on digital mammogram images, specifically for breast density Category B vs C classification.
Fig. 6: Agreement of the identified mislabeled data between our method (using the Loss function) and a previously published scheme [28] in the classification of breast density Category B vs C. The figure shows the percentage overlap of mislabeled data identified by our method and the published scheme as an increasing percentage of the data is checked.
Our method shows higher performance than the implementation of the general-purpose method [10]. By examining the whole (i.e., 100%) dataset, the highest performance of identifying mislabels achieved by the general-purpose method is 48.0%, 28.3%, and 53.8% for the Category A vs D, Category B vs C, and Breast Cancer vs Normal experiments, respectively. Our method achieved equal or comparable performance by examining only 15% of the whole dataset, and when more than 15% of the data are examined, Figure 4 shows that our method substantially outperforms the general-purpose method.
V. DISCUSSION
In this paper, we proposed an automated method for weakly-supervised classification with potentially mislabeled data. Our method relies on a metric function, and we tested both the Loss and Influence functions. We not only flipped data labels on purpose to generate test samples with ground truth, but also compared our method with two previously published methods. We used two different clinical datasets and real-world clinical scenarios to evaluate the proposed method, with a strong connection to clinical needs. Our experimental results are encouraging both in identifying mislabeled data and in improving the model’s classification performance by correcting the identified mislabels. Our work is novel in developing this framework for medical imaging data, whereas similar previous work focused only on processing text, email spam, and natural scene images.
In our method, two metrics, Influence and Loss, are used to guide the prioritization of data points. Under the same settings, the two metrics show similar effects in identifying mislabeled data and improving model classification AUC when classifying Category A vs D. Since distinguishing between Categories A and D is a relatively easy task, this may be why the two functions work equally well in this simple experiment. For the more difficult tasks (Category B vs C and Breast Cancer vs Normal cases), the difference in capability between the two functions is revealed more clearly. Interestingly, while the Loss function outperformed Influence in identifying mislabels (Figure 4), the Influence function generally performs better than Loss in actually improving the classification AUCs (Figure 5). Thus, the Influence function may be more effective with respect to the eventual goal of improving model classification performance by identifying/correcting some of the key mislabeled data. Of note, we observed that the shapes of the Influence function’s curves in classifying Breast Cancer vs Normal (Figure 4C) are slightly different from those of the breast density experiments (Figures 4A and 4B). This may have to do with the classification task itself and/or the characteristics (i.e., data heterogeneity due to different cancer subtypes) of the DDSM dataset. All the observations discussed here are subject to our specific datasets, and further studies are warranted in order to more comprehensively understand the effects of these two metric functions.
Our study results suggest that inaccurate labeling (here referring to intentionally flipping labels) does have an impact on classification performance. While the models may be able to tolerate a small percentage of mislabels, more mislabeled data resulted in considerable decreases in AUC values. As shown in Figure 5, we noticed that for the breast density Category B vs C experiment, the AUC decrease caused by flipped labels is not as pronounced as in the other two experiments when mislabeled data increase from 10% to 40%. Also, in the same experiment, we noticed that with 30% and 40% flipped labels, the AUC values slightly decrease when examining 5% of the dataset using the Loss method (this is not the case when using the Influence function). These observations may be partly due to the difference between the two metric functions and the nature of this specific classification task: the labels B and C are close, and in reality they may be interchangeable for certain cases. This may partially explain why the model is not that sensitive to the label-flipping and why the AUC values reach a minimum at 5% examination but then return to the increasing trend. Interestingly, we also noticed in Figures 5B and 5C that, when examining ~60% to ~80% of the dataset, the AUC values obtained by our method are slightly higher than those obtained from the unflipped dataset. This observation may reflect the possibility that our method identifies some mislabeled samples that are potentially “outlier samples” in the original data, but the exact reasons for the subtle AUC differences remain unknown. Again, these findings need further investigation in future work.
There are different ways to generate mislabels, such as assigning unseen/wrong labels or swapping existing labels. In this initial study on evaluating the influence of mislabels on a model’s performance, we focused on binary classification, swapping existing labels of the two classes, and generating an equal ratio of mislabels in each class. More complicated scenarios, such as multi-class classification and uncontrolled random mislabel generation/swapping, will be studied in future work, leveraging the findings in this study. One limitation of our study is that the data identified by our method may include both wrongly labeled data and possibly some correctly labeled outlier data. Outliers are usually a small portion of a full dataset, possibly caused by many reasons such as a heavy-tailed distribution. Our method at this stage is not able to distinguish true outlier data from mislabeled data, and more investigation is needed in this regard. A strength of our method, however, is that by automatically examining a relatively small set of data, we can still identify a substantial proportion of mislabeled data. This also provides a chance for users to review the algorithm-identified data and possibly separate outliers for further actions. In addition, in future work, our deep learning model structure can be enhanced by converting the feature extractor and logistic classifier into an end-to-end neural network, which will require additional derivation and experimental analyses to compare the effects of the model structures.
In machine learning with inaccurate supervision, the training labels are not always the ground truth [2]. The identification of inaccurate labels provides a novel way to support weakly-supervised learning. In our method, we first identify the inaccurate labels and then correct them, which leads to a gain in the model’s classification performance. Such an approach not only enables models to learn with noisy labels, but also reduces potential bias in crowdsourcing-based annotations. While the identification and correction of flipped data are done automatically in our method, the inspection of potentially mislabeled data in practice may still need to be done by domain experts whose time is expensive. Thus, it would be ideal to examine a minimal number of data samples to identify a maximal number of potentially mislabeled samples, which can then be further inspected with minimal effort from experts. In certain cases, experts may not be needed for the further inspection if other non-expert users (or validated smart AI agents) can also fix the mislabeling issues. In addition, because our method is automatic, it can be used to examine as much data as needed in specific scenarios aiming to identify all possible mislabels.
The code repository for the core algorithm implementation has been made publicly available on GitHub: http://bit.ly/weaklysupervisedlearning. The two imaging datasets are available for research, where the DDSM dataset [26], [42] can be downloaded from http://bit.ly/ddsmdataset. The breast density dataset is an internal dataset and may be available to interested users upon request, after an approval from our institution along with a signed data use agreement and/or a material transfer agreement.
VI. CONCLUSIONS
In this study, we proposed an automated approach to identify potentially mislabeled data in deep learning-based modeling and showed the impact on classification performance after correcting these mislabels. We compared the effects of two metric functions, i.e., the Cross-entropy Loss and the Influence function, in three different experiments on two real-world clinical datasets and tasks. Our experimental results are promising both when using intentionally flipped data for evaluation and when comparing with two previous methods. Both the Cross-entropy Loss-based and the Influence function-based methods provide a feasible solution for weakly-supervised deep learning modeling in dealing with inaccurate labels.
ACKNOWLEDGMENT
This work was supported by National Institutes of Health (NIH)/National Cancer Institute (NCI) grants (1R01CA193603, 3R01CA193603–03S1, and 1R01CA218405), a Radiological Society of North America (RSNA) Research Scholar Grant (#RSCH1530), an Amazon Machine Learning Research Award, and a University of Pittsburgh Physicians (UPP) Academic Foundation Award. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU for our research. This research was also partly supported by the NIH through resources provided by the National Resource for Biomedical Supercomputing (P41 RR06009), which is part of the Pittsburgh Supercomputing Center. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Biographies

Degan Hao Degan Hao received his BS degree in Applied Physics from Beijing Institute of Technology, China, and his MS degree in Information Science from the University of Pittsburgh, USA. He developed TIES/TCRN, a clinical document search engine, at the Department of Biomedical Informatics. He is currently a PhD student in the Intelligent Systems Program on the biomedical informatics track at the University of Pittsburgh. His research interests include artificial intelligence, computer vision, and medical image analysis.

Lei Zhang Lei Zhang received his PhD degree in Computer Science from Harbin Institute of Technology, China. He was previously a Postdoctoral Associate at the University of Pittsburgh and now works as a Research Associate at the University of Maryland. His current research interests include computer vision, visualization, and medical image analysis.

Jules Sumkin Jules Sumkin completed his Doctorate of Osteopathic Medicine degree at Ohio University in Athens. Dr. Sumkin currently serves as Chairman of the Department of Radiology and UPMC Endowed Chair of Women’s Imaging at the University of Pittsburgh School of Medicine. His research interests focus on developing new breast imaging techniques and optimizing clinical practice.

Aly Mohamed Aly Mohamed obtained his PhD degree in Computer Engineering from the University of Louisville, USA. He was a Postdoctoral Associate at the University of Pittsburgh, USA. He currently works as a scientist at IBM Corporation. His research interests lie in computer vision, image understanding, and pattern recognition.

Shandong Wu Shandong Wu received his PhD degree in Computer Vision from City University of Hong Kong. He completed postdoctoral training at the University of Central Florida in computer vision and at the University of Pennsylvania in clinical radiology research. Dr. Wu is an Assistant Professor with joint appointments in Radiology (primary), Biomedical Informatics, Computer Science, Bioengineering, Intelligent Systems, Computational Biology, and Clinical and Translational Science at the University of Pittsburgh. Dr. Wu leads the Intelligent Computing for Clinical Imaging (ICCI) lab. His research interests include computational biomedical imaging analysis, big (health) data coupled with machine/deep learning, radiomics/radiogenomics, and artificial intelligence in clinical informatics/workflows.
Footnotes
1 FOR PRESENTATION mammograms are imaging vendor post-processed mammographic images for visual assessment by radiologists.
REFERENCES
- [1] Hosny A, Parmar C, Quackenbush J, Schwartz LH, and Aerts HJWL, “Artificial intelligence in radiology,” Nature Reviews Cancer, p. 1, 2018.
- [2] Zhou Z-H, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2017.
- [3] Deng J, Dong W, Socher R, Li L-J, Li K, and Fei-Fei L, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
- [4] Wang X, Peng Y, Lu L, Lu Z, Bagheri M, and Summers RM, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3462–3471.
- [5] Zhang C, Bengio S, Hardt M, Recht B, and Vinyals O, “Understanding deep learning requires rethinking generalization,” ICLR, 2017.
- [6] Cook RD, “Detection of influential observation in linear regression,” Technometrics, vol. 19, no. 1, pp. 15–18, 1977.
- [7] Brodley CE and Friedl MA, “Identifying mislabeled training data,” Journal of Artificial Intelligence Research, vol. 11, pp. 131–167, 1999.
- [8] Pezoulas VC, Kourou KD, Kalatzis F, Exarchos TP, Venetsanopoulou A, Zampeli E, Gandolfo S, Skopouli F, De Vita S, Tzioufas AG et al., “Medical data quality assessment: On the development of an automated framework for medical data curation,” Computers in Biology and Medicine, vol. 107, pp. 270–283, 2019.
- [9] Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz CP et al., “Deep learning for chest radiograph diagnosis: A retrospective comparison of the chexnext algorithm to practicing radiologists,” PLoS Medicine, vol. 15, no. 11, p. e1002686, 2018.
- [10] Yuan W, Guan D, Zhu Q, and Ma T, “Novel mislabeled training data detection algorithm,” Neural Computing and Applications, vol. 29, no. 10, pp. 673–683, 2018.
- [11] Roh Y, Heo G, and Whang SE, “A survey on data collection for machine learning: A big data-AI integration perspective,” IEEE Transactions on Knowledge and Data Engineering, 2019.
- [12] Ratner AJ, De Sa CM, Wu S, Selsam D, and Ré C, “Data programming: Creating large training sets, quickly,” in Advances in Neural Information Processing Systems, 2016, pp. 3567–3575.
- [13] Pei K, Cao Y, Yang J, and Jana S, “Deepxplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 2017, pp. 1–18.
- [14] Koh PW and Liang P, “Understanding black-box predictions via influence functions,” arXiv preprint arXiv:1703.04730, 2017.
- [15] Xiao T, Xia T, Yang Y, Huang C, and Wang X, “Learning from massive noisy labeled data for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2691–2699.
- [16] Ren M, Zeng W, Yang B, and Urtasun R, “Learning to reweight examples for robust deep learning,” arXiv preprint arXiv:1803.09050, 2018.
- [17] Le H, Samaras D, Kurc T, Gupta R, Shroyer K, and Saltz J, “Pancreatic cancer detection in whole slide images using noisy label annotations,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 541–549.
- [18] Tanno R, Saeedi A, Sankaranarayanan S, Alexander DC, and Silberman N, “Learning from noisy labels by regularized estimation of annotator confusion,” arXiv preprint arXiv:1902.03680, 2019.
- [19] Goldberger J and Ben-Reuven E, “Training deep neural-networks using a noise adaptation layer,” 2016.
- [20] Dgani Y, Greenspan H, and Goldberger J, “Training a neural network based on unreliable human annotation of medical images,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018, pp. 39–42.
- [21] Redondo A, Comas M, Macia F, Ferrer F, Murta-Nascimento C, Maristany MT, Molins E, Sala M, and Castells X, “Inter- and intraradiologist variability in the BI-RADS assessment and breast density categories for screening mammograms,” The British Journal of Radiology, vol. 85, no. 1019, pp. 1465–1470, 2012.
- [22] Keller BM, Nathan DL, Wang Y, Zheng Y, Gee JC, Conant EF, and Kontos D, “Estimation of breast percent density in raw and processed full field digital mammography images via adaptive fuzzy c-means clustering and support vector machine segmentation,” Medical Physics, vol. 39, no. 8, pp. 4903–4917, 2012.
- [23] Keating NL and Pace LE, “New federal requirements to inform patients about breast density: Will they help patients?” JAMA, vol. 321, no. 23, pp. 2275–2276, 2019.
- [24] Irshad A, Leddy R, Ackerman S, Cluver A, Pavic D, Abid A, and Lewis MC, “Effects of changes in BI-RADS density assessment guidelines (fourth versus fifth edition) on breast density assessment: Intra- and interreader agreements and density distribution,” American Journal of Roentgenology, vol. 207, no. 6, pp. 1366–1371, 2016.
- [25] Heath M, Bowyer K, Kopans D, Moore R, and Kegelmeyer WP, “The digital database for screening mammography,” in Proceedings of the 5th International Workshop on Digital Mammography. Medical Physics Publishing, 2000, pp. 212–218.
- [26] Heath M, Bowyer K, Kopans D, Kegelmeyer P, Moore R, Chang K, and Munishkumaran S, “Current status of the digital database for screening mammography,” in Digital Mammography. Springer, 1998, pp. 457–460.
- [27] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, and Wojna Z, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
- [28] Mohamed AA, Berg WA, Peng H, Luo Y, Jankowitz RC, and Wu S, “A deep learning method for classifying mammographic breast density categories,” Medical Physics, vol. 45, no. 1, pp. 314–321, 2018.
- [29] Aboutalib SS, Mohamed AA, Berg WA, Zuley ML, Sumkin JH, and Wu S, “Deep learning to distinguish recalled but benign mammography images in breast cancer screening,” Clinical Cancer Research, vol. 24, no. 23, pp. 5902–5909, 2018.
- [30] Keller BM, Chen J, Daye D, Conant EF, and Kontos D, “Preliminary evaluation of the publicly available Laboratory for Breast Radiodensity Assessment (LIBRA) software tool: Comparison of fully automated area and volumetric density measures in a case–control study with digital mammography,” Breast Cancer Research, vol. 17, no. 1, p. 117, 2015.
- [31] Sun X, Shi L, Luo Y, Yang W, Li H, Liang P, Li K, Mok VC, Chu WC, and Wang D, “Histogram-based normalization technique on human brain magnetic resonance images from different acquisitions,” Biomedical Engineering Online, vol. 14, no. 1, p. 73, 2015.
- [32] Yosinski J, Clune J, Bengio Y, and Lipson H, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
- [33] Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, and Darrell T, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning, 2014, pp. 647–655.
- [34] Krizhevsky A, Sutskever I, and Hinton GE, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [35] Cook RD and Weisberg S, Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
- [36] Xie S, Girshick R, Dollár P, Tu Z, and He K, “Aggregated residual transformations for deep neural networks,” arXiv preprint arXiv:1611.05431, 2016.
- [37] He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [38] Nystrom NA, Levine MJ, Roskies RZ, and Scott JR, “Bridges: A uniquely flexible HPC resource for new communities and data analytics,” in Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, ser. XSEDE ’15. New York, NY, USA: ACM, 2015, pp. 30:1–30:8. [Online]. Available: http://doi.acm.org/10.1145/2792745.2792775
- [39] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, and Lerer A, “Automatic differentiation in PyTorch,” 2017.
- [40] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, and others, “Tensorflow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
- [41] Chollet F et al., “Keras,” https://keras.io, 2015.
- [42] Bowyer K, Kopans D, Kegelmeyer W, Moore R, Sallam M, Chang K, and Woods K, “The digital database for screening mammography,” in Third International Workshop on Digital Mammography, vol. 58, 1996, p. 27.
