Abstract.
Purpose
To validate the effectiveness of an approach called batch-balanced focal loss (BBFL) in enhancing convolutional neural network (CNN) classification performance on imbalanced datasets.
Materials and Methods
BBFL combines two strategies to tackle class imbalance: (1) batch-balancing to equalize model learning of class samples and (2) focal loss to add hard-sample importance to the learning gradient. BBFL was validated on two imbalanced fundus image datasets: a binary retinal nerve fiber layer defect (RNFLD) dataset (n = 7258) and a multiclass glaucoma dataset. BBFL was compared to several imbalanced learning techniques, including random oversampling (ROS), cost-sensitive learning, and thresholding, based on three state-of-the-art CNNs. Accuracy, F1-score, and the area under the receiver operating characteristic curve (AUC) were used as the performance metrics for binary classification. Mean accuracy and mean F1-score were used for multiclass classification. Confusion matrices, t-distributed stochastic neighbor embedding plots, and GradCAM were used for the visual assessment of performance.
Results
In binary classification of RNFLD, BBFL with InceptionV3 (93.0% accuracy, 84.7% F1, 0.971 AUC) outperformed ROS (92.6% accuracy, 83.7% F1, 0.964 AUC), cost-sensitive learning (92.5% accuracy, 83.8% F1, 0.962 AUC), thresholding (91.9% accuracy, 83.0% F1, 0.962 AUC), and other methods. In multiclass classification of glaucoma, BBFL with MobileNetV2 (79.7% accuracy, 69.6% average F1-score) outperformed ROS (76.8% accuracy, 64.7% F1), cost-sensitive learning (78.3% accuracy, 67.8% F1), and random undersampling (76.5% accuracy, 66.5% F1).
Conclusion
The BBFL-based learning method can improve the performance of a CNN model in both binary and multiclass disease classification when the data are imbalanced.
Keywords: deep learning, convolutional neural networks, retina fundus, class imbalance
1. Introduction
Datasets with imbalanced classes are challenging for deep learning algorithms to “learn” due to the biases toward the majority class.1 This is a pertinent problem in medical imaging analysis because of inherent data heterogeneity. This heterogeneity can be caused by individual anatomical variations, differences in the prevalence of comorbid disease, and variations in radiographic presentations of different disease states. Solutions to mitigate imbalanced learning bias are data-specific, making this issue challenging to address. There is no one-size-fits-all solution, so strategies must be tailored to the datasets being used. For this reason, continued investigation into new effective methods is pertinent.
Available methods to address data imbalance can be roughly categorized as data-level or algorithm-level class imbalance strategies. Data-level class imbalance strategies modify the way the dataset is sampled during training to account for the data discrepancy. Examples of data-level strategies include random oversampling (ROS),1,2 random undersampling (RUS),1,3 and dynamic sampling.1 Pham et al.4 employed a dynamic sampling strategy in multiclass skin disease classification that produced a mean recall score of 86%. In each mini-batch, four to six samples from each of the seven skin disease classes were selected to achieve a balanced learning gradient. While data-level methods have been shown to improve the performance of neural networks, these strategies have limitations. ROS has been associated with overfitting of models to the oversampled class and RUS reduces model generalizability due to data limitations.5 Dynamic sampling adds additional parameters to data sampling and the logic risks overfitting similarly to ROS. These problems also persist in many of the novel data-level imbalance workarounds. Algorithm-level strategies focus on increasing the penalty for misclassified minority class samples to focus the network gradient on the minority class. Examples include thresholding, cost-sensitive learning, and custom loss functions.1
The focal loss function used in this study was initially introduced by Lin et al.6 to improve dense object detection by focusing the training on harder samples instead of well-classified background samples. Since its inception, the focal loss function has proven to be a successful technique for addressing data imbalance in classification tasks.7 In general, there is no one-size-fits-all approach to mitigate imbalanced learning in the presence of class imbalance. The effectiveness of different strategies to tackle imbalance varies depending on the type of data, type of classification, and level of imbalance.1–3
We developed a hybrid learning paradigm termed Batch-Balanced Focal Loss (BBFL) that employs both data-level and algorithm-level strategies to address class imbalance. Batch-balancing is utilized as a form of dynamic sampling to balance the network’s learning gradient across all classes. To enhance a model’s ability to recognize hard samples, the traditional cross-entropy loss function is replaced with a focal loss function. The effectiveness of BBFL was evaluated in the context of retinal nerve fiber layer defect (RNFLD) detection and glaucoma classification depicted on color fundus photographs (CFPs).
2. Materials and Methods
2.1. Study Datasets
We collected two CFP datasets for this study. (1) Dataset 1: RNFLD dataset. We leveraged a public diabetic retinopathy detection dataset8 to collect the RNFLD data. An ophthalmologist reviewed 7258 cases in this dataset and graded 5582 images as negative and 1676 images as positive for RNFLDs (class imbalance of approximately 3:1). Dataset 1 contains CFPs of varying dimensions and resolutions. (2) Dataset 2: glaucoma dataset. The glaucoma dataset was obtained primarily from previous studies.9,10 The images in this dataset were independently rated by an ophthalmologist using a four-point scale: 0 for none, 1 for mild, 2 for moderate, or 3 for severe glaucoma. Dataset 2 also contains CFPs of varying dimensions and resolutions. All the images in the two datasets are available upon request.
2.2. Batch-Balanced Focal Loss Algorithm
The BBFL algorithm is a hybrid, end-to-end strategy with only one additional parameter to optimize balanced minority class learning (Fig. 1). It is formed by two key components: (1) a data sampling technique and (2) an algorithm-level method to improve model performance in imbalanced learning. Training data are passed through the model using a batch-balancing logic that equalizes the number of samples from each class in every batch. This aims to improve model sensitivity to the minority class samples. Random geometric and intensity augmentations are utilized to improve model generalization and prevent overfitting. These augmentations are applied per batch before batches are passed through the input layer of the convolutional neural network (CNN) model. Geometric augmentations include random horizontal and vertical flipping, 90-deg rotations, shearing, and axis transformations. Intensity augmentations include sharpening, blurring, Gaussian noise, and salt-and-pepper noise. Finally, a focal loss function is used to improve hard sample detection. This strategy can be deployed for datasets with any number of classes.
Fig. 1.
Workflow of the BBFL algorithm.
1. Data-level strategy - batch-balance: BBFL uses a simple batch-balancing logic to equalize the proportion of each class in each batch. In binary RNFLD detection, every iterative 16-sample batch that is passed through the network has 8 positive and 8 negative samples. In 4-class glaucoma classification, each iterative 16-sample batch has 4 samples of each glaucoma severity classification. This strategy artificially equalizes the class distribution in each batch in binary RNFLD and multiclass glaucoma classification. Images in each 16-sample batch are shuffled so that the model does not identify a class distribution per batch. ROS is a similar technique; however, it does not guarantee that each batch has an equal number of samples from each class. Due to the mild class imbalance in both datasets, balancing each batch is useful in optimizing model parameters for all class features equally, as previous studies1,4,11 have suggested. While this strategy alone may boost balanced learning, it does not directly improve the model’s performance on difficult samples. For this reason, an algorithm-level technique is also implemented in BBFL.
2. Algorithm-level strategy - focal loss: the focal loss function [Eq. (2)] adds a modulating factor $(1 - p_t)^{\gamma}$ to a standard cross-entropy loss function [Eq. (1)], where $p_t$ is the model’s predicted probability of the ground truth class. This factor adds emphasis to incorrectly classified examples when updating a model’s parameters via backpropagation. Consequently, samples that the model predicts correctly with a high probability score contribute less to updating the model’s parameters. The decreased contribution of well-classified samples is controlled by the modulating parameter $\gamma$ (Fig. 2). In theory, the focal loss focuses on hard-to-classify samples that may be missed due to the class imbalance. To determine the optimal $\gamma$ value, we conducted cross-validation testing using a range of values from 0.5 to 10. Our results indicated that the best performance was achieved when $\gamma$ was between 0.5 and 2.5, with minimal variance in model performance within this range. Therefore, a $\gamma$ value of 2.0 was used.

$$\mathrm{CE}(p_t) = -\log(p_t) \tag{1}$$

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \tag{2}$$
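As an illustration of the batch-balancing logic in item 1, the following is a minimal Python sketch and not the authors' implementation. The function name `balanced_batches`, the reshuffle-and-redraw handling of exhausted minority classes, and the choice to end one pass once the largest class has been covered are assumptions made for this example.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size=16, seed=0):
    """Yield lists of sample indices with the same number of samples per class.

    Assumption (not from the paper): a minority class is reshuffled and
    re-drawn once exhausted, and one pass ends when the largest class has
    been covered once.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    classes = sorted(by_class)
    if batch_size % len(classes) != 0:
        raise ValueError("batch_size must be divisible by the number of classes")
    per_class = batch_size // len(classes)

    # One shuffled queue of indices per class.
    queues = {c: rng.sample(by_class[c], len(by_class[c])) for c in classes}
    n_batches = max(len(members) for members in by_class.values()) // per_class

    for _ in range(n_batches):
        batch = []
        for c in classes:
            for _ in range(per_class):
                if not queues[c]:
                    # Class exhausted: start a fresh shuffled pass over it.
                    queues[c] = rng.sample(by_class[c], len(by_class[c]))
                batch.append(queues[c].pop())
        rng.shuffle(batch)  # hide any class ordering inside the batch
        yield batch
```

With two classes and `batch_size=16`, every yielded batch holds 8 indices per class; with four classes it holds 4 per class, matching the batch compositions described above.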
Fig. 2.

Focal loss has a modulating factor that decreases the loss contribution of well-classified samples. Increasing the modulating parameter γ places less weight on samples that were predicted correctly with high probability. Focal loss can be applied to both binary and categorical cross-entropy loss functions.
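To make the effect of the modulating factor concrete, the per-sample focal loss can be sketched in plain Python as follows. This is an illustrative sketch rather than the training code used in the study; the helper names `binary_p_t` and `categorical_p_t` and the clipping constant `eps` are assumptions introduced here.

```python
import math


def focal_loss(p_t, gamma=2.0, eps=1e-7):
    """Focal loss for one sample; p_t is the predicted probability of the true class.

    With gamma = 0 this reduces to the standard cross-entropy loss.
    """
    p_t = min(max(p_t, eps), 1.0 - eps)  # clip to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def binary_p_t(p_positive, label):
    """Probability assigned to the true class from a sigmoid output."""
    return p_positive if label == 1 else 1.0 - p_positive


def categorical_p_t(probabilities, label):
    """Probability assigned to the true class from a softmax output."""
    return probabilities[label]
```

With a modulating parameter of 2.0, a well-classified sample whose true-class probability is 0.9 keeps only (1 - 0.9)^2 = 0.01 of its cross-entropy loss, whereas a hard sample with a true-class probability of 0.1 keeps (1 - 0.1)^2 = 0.81 of it.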
In our implementation, InceptionV3, MobileNetV2, and ResNet50 were used for automatic feature extraction.12 The models take RGB color images as inputs. Two fully connected layers (FCL) are added to the top of each network for feature classification. FCL-1 (the first FCL after feature extraction) has 512 nodes and FCL-2 (the last FCL before prediction) has 256 nodes. A global average pooling (GAP) layer is inserted before FCL-1 to convert each feature map matrix into a single vector for input into FCL-1. Dropout layers are added after each FCL at a rate of 50% to minimize overfitting. Dropout rates higher than 50% negatively affected performance rather than reducing overfitting as intended. Dropout rates lower than 50% did not result in a significant change in model performance. In binary classification, a final sigmoid activation calculates a probability of RNFLD from the final layer logits. In multiclass glaucoma classification, the softmax function was applied over the four classes to predict the class.
3. Training
As a preprocessing step, all CFP images were cropped to center the retina and resized to a fixed input size. No advanced image alterations were used such as histogram equalization, polar transformation, or color channel extraction; the original 3-channel RGB images were used. During training, all image pixels were normalized between 0 and 1 using min-max normalization for model compatibility. To train the models, the RNFLD and glaucoma datasets were randomly split 80-10-10 into training, test, and validation sets, respectively. The same split was used for all experiments.
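The normalization and splitting steps can be sketched as below. This is a minimal illustration under stated assumptions: the normalization is applied per image, the split is a plain (unstratified) random shuffle, and the function names and the seed value are hypothetical. A fixed seed is used only to mirror the statement that the same split was reused across experiments.

```python
import random


def min_max_normalize(pixels):
    """Rescale a flat list of pixel values to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]  # constant image: avoid division by zero
    return [(p - lo) / (hi - lo) for p in pixels]


def split_indices(n_samples, fractions=(0.8, 0.1, 0.1), seed=42):
    """Randomly split sample indices into training, test, and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible split
    n_train = int(fractions[0] * n_samples)
    n_test = int(fractions[1] * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, val
```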
Hyperparameters used in training include a batch size of 16 images, 100 epochs, and a learning rate that was decayed on learning plateaus every 3 epochs. All models were initialized with ImageNet weights for faster learning convergence before fine-tuning on the retinal classification tasks. The Adam optimizer (beta = 0.9) was used to update the network weights during backpropagation. Random geometric and intensity augmentations were applied per batch to improve model generalization and mitigate overfitting. Training was stopped early if validation accuracy did not improve for 10 consecutive epochs, also to mitigate overfitting. Each model was trained five times for each learning-style experiment, and the results were averaged to account for stochastic variation during training. All training was run on an NVIDIA GeForce RTX 2070 GPU using TensorFlow 2.5.
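The plateau-based learning-rate decay and early stopping described above can be sketched as a small controller. This is a hypothetical illustration, not the TensorFlow callbacks used in the study: the class name `PlateauController` is invented, validation accuracy is assumed to be the monitored quantity for both rules, and the initial learning rate and decay factor are left as required arguments because their values are not assumed here.

```python
class PlateauController:
    """Tracks validation accuracy to decay the learning rate and stop early."""

    def __init__(self, learning_rate, decay_factor, lr_patience=3, stop_patience=10):
        self.learning_rate = learning_rate
        self.decay_factor = decay_factor
        self.lr_patience = lr_patience      # epochs without improvement before a decay
        self.stop_patience = stop_patience  # epochs without improvement before stopping
        self.best = float("-inf")
        self.epochs_since_best = 0

    def update(self, val_accuracy):
        """Record one epoch and return True when training should stop."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.epochs_since_best = 0
            return False
        self.epochs_since_best += 1
        if self.epochs_since_best % self.lr_patience == 0:
            self.learning_rate *= self.decay_factor
        return self.epochs_since_best >= self.stop_patience
```

A training loop would call `update` once per epoch with the current validation accuracy, read `learning_rate` for the next epoch, and break out of the loop when `update` returns `True`.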
4. Performance Evaluation
The performance of the BBFL algorithm was evaluated on the two datasets and compared to other imbalanced learning methods based on three state-of-the-art CNNs. The F1-score was specifically utilized as a performance metric because it combines sensitivity and precision via their harmonic mean to determine the network’s ability to correctly identify samples of the minority classes. Paired t-tests were used to compare model F1-scores between BBFL and the other techniques. A p-value less than 0.01 was considered statistically significant based on Bonferroni’s correction for multiple comparison testing (five tests).
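The quantities in this paragraph can be illustrated with a short standard-library sketch. The function names are hypothetical, and the family-wise significance level of 0.05 is an assumption inferred from the stated threshold of 0.01 over five tests. Only the paired t statistic is computed; converting it to a p-value requires the t distribution with n - 1 degrees of freedom, which is omitted here.

```python
import math
import statistics


def f1_score(true_positive, false_positive, false_negative):
    """F1-score as the harmonic mean of precision and sensitivity (recall)."""
    if true_positive == 0:
        return 0.0
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return 2 * precision * recall / (precision + recall)


def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under Bonferroni's correction."""
    return family_alpha / n_tests


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, e.g., F1-scores from matched runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error
```

Under the assumed family-wise level, `bonferroni_alpha(0.05, 5)` gives the per-test threshold of 0.01 used in this study.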
In binary classification, receiver operating characteristic (ROC) analyses were used to compare the tradeoff of the networks’ sensitivity and specificity. The mean area under the ROC curve (mean AUC) was included as the performance metric, which represents the average AUC achieved in five identical experiments. Like ROC curves, Precision-Recall (PRC) curves were generated to combine precision, sensitivity (recall), and F1-scores graphically. In multiclass classification, confusion matrices were generated to view interclass accuracy visually. Instead of the binary F1-score, the average F1-score [Eq. (4)] is generated by taking the mean of the individual class F1-scores and averaging over the five model results. Traditional ROC is not possible with multiclass classification, so confusion matrices were generated instead. The mean accuracy is defined as

$$\text{Mean accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Acc}_i \tag{3}$$

where $\mathrm{Acc}_i$ is the accuracy of the model predictions of images in class $i$ with $n$ classes in the total classification task, and the average F1-score is defined as

$$\text{Average F1-score} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{F1}_i \tag{4}$$

where $\mathrm{F1}_i$ is the F1-score of the model predictions of images in class $i$ with $n$ classes in the total classification task.
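As a worked illustration of the multiclass metrics, the sketch below derives per-class accuracy and per-class F1-score from a confusion matrix and averages them over the classes. It assumes that the accuracy of class i means the fraction of class-i images that were predicted correctly, which is one reading of the definition above; the function names are hypothetical.

```python
def per_class_metrics(confusion):
    """Return per-class accuracies and F1-scores.

    confusion[i][j] is the number of class-i images predicted as class j.
    """
    n = len(confusion)
    accuracies, f1_scores = [], []
    for i in range(n):
        true_positive = confusion[i][i]
        actual = sum(confusion[i])                          # images truly in class i
        predicted = sum(confusion[r][i] for r in range(n))  # images predicted as class i
        recall = true_positive / actual if actual else 0.0
        precision = true_positive / predicted if predicted else 0.0
        accuracies.append(recall)  # assumed meaning of per-class accuracy
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return accuracies, f1_scores


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)
```

Because every class contributes equally to the unweighted mean, a poorly classified minority class lowers these averages as much as a poorly classified majority class would.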
Class activation maps (CAM) and t-distributed stochastic neighbor embedding (t-SNE) plots13 were created to visually interpret model performance. The GradCAM14 algorithm was used to visualize the best-performing models. The t-SNE plots were generated to visualize a model’s decision boundary. These plots reduce high-dimensional feature maps into 2D data that groups samples based on feature similarity. Ten batches (160 samples) were chosen at random from the training dataset for the t-SNE plots. Each sample was passed through the model and the output from the final GAP layer was extracted for t-SNE visualization with a perplexity factor of 20.
5. Results
The BBFL algorithm outperformed all other CNN methods in all metrics to classify CFP images as positive or negative for RNFLD (Tables 1–3, Fig. 3). BBFL with InceptionV3 generated an average accuracy of 93.0 ± 0.8%, F1-score of 84.5 ± 0.6%, and AUC of 0.971 (Table 1, Fig. 3). BBFL with MobileNetV2 generated an average accuracy of 91.3 ± 0.3%, F1-score of 80.9 ± 0.6%, and AUC of 0.955 (Table 2, Fig. 3). BBFL with ResNet50 generated an average accuracy of 92.5 ± 0.5%, F1-score of 83.8 ± 0.9%, and AUC of 0.962 (Table 3, Fig. 3).
Table 1.
Imbalance strategies RNFLD classification metrics with InceptionV3.
| Method | Accuracy | F1-score | Mean AUC | F1 p-value |
|---|---|---|---|---|
| Baseline | 92.5 ± 0.8 | 83.3 ± 0.8 | 0.901 ± 0.024 | 0.230 |
| Thresholding | 91.9 ± 0.6 | 83.0 ± 1.2 | 0.888 ± 0.017 | 0.264 |
| ROS | 92.6 ± 0.7 | 83.7 ± 1.9 | 0.931 ± 0.011 | 0.688 |
| RUS | 87.5 ± 3.4 | 76.6 ± 4.7 | 0.864 ± 0.091 | 0.096 |
| Cost-sensitive | 92.5 ± 0.4 | 83.8 ± 0.9 | 0.919 ± 0.013 | 0.518 |
| Balanced-batch | 92.4 ± 0.5 | 83.7 ± 1.3 | 0.923 ± 0.043 | 0.576 |
| Focal loss | 92.3 ± 0.8 | 82.0 ± 2.0 | 0.911 ± 0.013 | 0.231 |
| BBFL | 93.0 ± 0.8 | 84.5 ± 0.6 | 0.941 ± 0.010 | — |
ROS, random oversampling; RUS, random undersampling; BBFL, batch-balanced focal loss; and AUC, area under ROC
Table 2.
Imbalance strategies RNFLD classification metrics with MobileNetV2.
| Method | Accuracy | F1-score | Mean AUC | F1 p-value |
|---|---|---|---|---|
| Baseline | 88.8 ± 1.4 | 74.8 ± 3.6 | 0.901 ± 0.024 | 0.095 |
| Thresholding | 87.9 ± 1.9 | 74.7 ± 3.8 | 0.891 ± 0.026 | 0.107 |
| ROS | 91.2 ± 0.7 | 80.2 ± 1.6 | 0.927 ± 0.011 | 0.682 |
| Cost-sensitive | 87.6 ± 1.7 | 74.7 ± 3.2 | 0.910 ± 0.015 | 0.057 |
| Balanced-batch | 91.0 ± 0.5 | 79.8 ± 1.2 | 0.919 ± 0.013 | 0.412 |
| Focal loss | 90.3 ± 1.3 | 77.7 ± 3.0 | 0.895 ± 0.036 | 0.296 |
| BBFL | 91.3 ± 0.3 | 80.9 ± 0.6 | 0.935 ± 0.011 | — |
ROS, random oversampling; RUS, random undersampling; BBFL, batch-balanced focal loss; and AUC, area under ROC
Table 3.
Imbalance strategies RNFLD classification metrics with ResNet50.
| Method | Accuracy | F1-score | Mean AUC | F1 p-value |
|---|---|---|---|---|
| Baseline | 90.7 ± 0.5 | 79.6 ± 0.9 | 0.896 ± 0.015 | |
| Thresholding | 90.4 ± 0.7 | 79.8 ± 1.3 | 0.902 ± 0.007 | 0.011 |
| ROS | 92.1 ± 0.8 | 82.1 ± 1.5 | 0.921 ± 0.011 | 0.331 |
| Cost-sensitive | 89.6 ± 0.2 | 78.8 ± 1.9 | 0.870 ± 0.016 | 0.017 |
| Balanced-batch | 92.0 ± 0.3 | 82.9 ± 0.9 | 0.920 ± 0.012 | 0.480 |
| Focal loss | 90.4 ± 0.2 | 78.7 ± 0.8 | 0.900 ± 0.008 | |
| BBFL | 92.5 ± 0.5 | 83.8 ± 0.9 | 0.938 ± 0.026 | — |
ROS, random oversampling; RUS, random undersampling; BBFL, batch-balanced focal loss; and AUC, area under ROC
Fig. 3.
ROC plots (left) and PRC plots (right) comparing the class imbalance techniques for InceptionV3 (top), MobileNetV2 (middle), and ResNet50 (bottom) to discriminate positive and negative RNFLD cases. Both plot types were generated using the raw RNFLD predictions from the highest performing model from the five experiments.
Our results show that BBFL, ROS, cost-sensitive learning, and batch-balancing could improve the overall performance for imbalanced binary RNFLD detection. BBFL exhibited superior performance compared to both batch-balancing and focal loss when implemented separately. The best-performing BBFL-trained model performed better than all other methods in terms of mean AUC and F1-score. RUS consistently underperformed the baseline due to the nature of RUS reducing the number of data samples. This dataset is likely too small for RUS to improve performance. Therefore, RUS is excluded from most of the visualizations.
Visual inspection of the GradCAM images demonstrated that BBFL and ROS models produced the most accurate CAMs due to the defined boundaries and location of the prediction heatmaps (Fig. 4). Interestingly, the baseline model did not have class activation for one of the “easy” classification images, misclassifying the image entirely. The decision boundary for RNFLD feature classification was most well defined using the BBFL approach compared to the other models based on the t-SNE plots (Fig. 5). Cost-Sensitive and ROS also improved feature learning over baseline training in the t-SNE plots with ROS only misclassifying a few positive features as negative RNFLD features. Both t-SNE and GradCAM provide qualitative information on feature discrimination abilities for each imbalance strategy.
Fig. 4.
GradCAM visualizations of four positive RNFLD samples. In each grid, the top two CFP images were selected as “easy” and the bottom two as “difficult” in RNFLD interpretability. ROS and BBFL algorithms visually produce the most accurate heatmaps with the least heat variance. ROS, random oversampling.
Fig. 5.
t-SNE plots using output from final GAP layer of InceptionV3 for RNFLD detection: (a) BBFL, (b) cost-sensitive, (c) random oversampling, (d) focal loss, (e) balanced-batches, and (f) baseline t-SNE.
Our experiments showed that BBFL models were more accurate and sensitive in classifying glaucoma when looking at accuracy and average F1-score (Table 4). MobileNetV2 was the most accurate and sensitive CNN on this dataset, with an accuracy and average F1-score of 79.7 ± 0.8 and 69.6 ± 1.5, respectively. In some cases, learning with no imbalance technique (baseline models) outperformed adding a learning strategy during training. This was not seen in RNFLD detection previously. RUS, like in binary detection, did not improve performance over the baseline in any case. BBFL again outperforms its constituent learning techniques (batch-balanced and focal loss learning), indicating that the two learning techniques combine to create an advanced learning style. Confusion matrices generated from the best-performing models (Figs. 6–8) in glaucoma classification show improved interclass performance in BBFL, ROS, and cost-sensitive learning.
Table 4.
Imbalance strategies glaucoma classification metrics.
| Method | ResNet50 accuracy | ResNet50 Avg. F1 | InceptionV3 accuracy | InceptionV3 Avg. F1 | MobileNetV2 accuracy | MobileNetV2 Avg. F1 |
|---|---|---|---|---|---|---|
| Baseline | 77.4 ± 0.4 | 67.2 ± 1.9a | 77.0 ± 0.9 | 66.1 ± 1.1 | 78.2 ± 0.8 | 67.3 ± 1.3 |
| ROS | 77.4 ± 0.5 | 66.9 ± 1.2a | 77.5 ± 1.2 | 66.9 ± 1.5 | 76.8 ± 0.4 | 66.7 ± 1.0 |
| RUS | 75.5 ± 0.7 | 64.8 ± 2.0a | 75.1 ± 0.7 | 65.5 ± 1.7 | 76.5 ± 0.9 | 64.5 ± 2.1a |
| Balanced-batch | 77.0 ± 0.4 | 67.4 ± 1.3a | 77.4 ± 0.6 | 67.4 ± 1.3 | 77.0 ± 1.3 | 67.5 ± 1.2 |
| Cost-sensitive | 78.0 ± 0.5 | 67.7 ± 0.9 | 78.3 ± 0.4 | 68.7 ± 1.2 | 78.3 ± 1.2 | 67.8 ± 1.1 |
| Focal loss | 78.0 ± 0.9 | 68.1 ± 1.4 | 78.2 ± 1.1 | 67.9 ± 1.8 | 77.8 ± 0.6 | 68.0 ± 1.9 |
| BBFL | 78.3 ± 0.4 | 70.4 ± 0.8 | 78.4 ± 0.8 | 69.1 ± 1.3 | 79.7 ± 0.8 | 69.6 ± 1.5 |
ROS, random oversampling; RUS, random undersampling; BBFL, batch-balanced focal loss; and Avg. F1, average F1-score over four classes.
a: Results that are statistically significant (p < 0.01) when compared to BBFL.
Fig. 6.
Confusion matrix comparing imbalance learning techniques to BBFL using the ResNet50 network. Results from multiclass glaucoma classification with fundus images graded as normal (0), mild (1), moderate (2), and severe (3).
Fig. 7.
Confusion matrix comparing imbalance learning techniques to BBFL using the MobileNetV2 network. Results from multiclass glaucoma classification with fundus images graded as normal (0), mild (1), moderate (2), and severe (3).
Fig. 8.
Confusion matrix comparing imbalance learning techniques to BBFL using the InceptionV3 network. Results from multiclass glaucoma classification with fundus images graded as normal (0), mild (1), moderate (2), and severe (3).
6. Discussion
We developed a novel hybrid learning paradigm termed BBFL to address the common pitfalls experienced in class-imbalanced deep learning. BBFL combines a strategy to balance feature learning and a strategy to add hard-sample importance. The effectiveness of BBFL was evaluated to detect RNFLDs and assess glaucoma severity in two relatively large but imbalanced datasets. We found that BBFL improved the discriminative performance of three CNN models compared to other imbalance techniques for both majority and minority class samples. BBFL is an end-to-end learning style that can easily be applied to any CNN framework by simply incorporating a batch-balancing logic in data sampling and adding a modulation factor to a traditional cross-entropy loss function. Only one extra parameter was introduced, which did not create a great variance in the performance of BBFL when changed in our in-house experiments.
BBFL improved performance over other data imbalance techniques. ROS and cost-sensitive learning improved performance over the baseline in all three CNN experiments in the detection of RNFLDs. However, BBFL outperformed both techniques in all involved CNN models. Similarly, in multiclass glaucoma detection, ROS and cost-sensitive learning improved the performance of the three CNNs compared to using no method (baseline). BBFL was better than ROS and cost-sensitive techniques in both mean accuracy and mean F1-score. The other learning techniques were model- and dataset-dependent in comparison to the baseline and other strategies. In glaucoma classification, there is evidence that the misclassifications in the mild and moderate classes were reduced using BBFL (Fig. 6). These classes were often misclassified using other learning techniques due to their similarity, highlighting the effectiveness of BBFL in addressing the class imbalance. The confusion matrices in Fig. 6 (ResNet50) show decreased ability to distinguish between class 0 and class 3 samples in glaucoma classification, with a trade-off for general improvements in class 1 and 2 delineations. The decreased delineation ability for class 0 healthy samples is due to penalizing minority class misclassification with focal loss. Naturally, the model will have higher sensitivity in correctly classifying minority class samples, which comes at the expense of decreased majority class (0) sensitivity. Furthermore, the slightly reduced class 3 delineation ability is due to the fact that class 3 samples represent severe cases of glaucoma, which have more noticeable defects than class 1 (mild) and class 2 (moderate). Class 1 and class 2 delineation benefits most from BBFL because mild and moderate samples appear visually similar and are therefore harder to discriminate. Class 3 samples may be less penalized since they are severe and easier to classify.
Two types of model visualizations were introduced for more evidence. The t-SNE graphs show a clear separation between class features with our proposed method, more so than with any other technique. This suggests that a model trained via BBFL on this task has better feature discriminative ability. Visually, ROS had the second-best feature separation based on the t-SNE graph. The CAMs showed model confidence in highlighting RNFL defects. With this visualization, BBFL and ROS generated promising results highlighting the correct areas of disease and limiting false positive detections. Our results suggest that BBFL is a strong framework for dealing with the class imbalance in classification and support the use of ROS and Cost-sensitive learning in similar tasks.
CNN-based deep learning models have been widely used for fundus image classification, due to their ability to automatically learn tens of thousands of unique features on a given dataset.15 These models often perform well enough not to require significant image preprocessing like polar transformations or blood vessel inpainting, which is a major benefit. For example, Muramatsu16 used a modified VGG DL network for the automatic classification and segmentation of RNFLDs, achieving an AUC of 0.92. In another study, Maheshwari et al.17 used LBP-based augmentation and a color channel decision fusion to produce a state-of-the-art 97.0% sensitivity in glaucoma detection. Similarly, Aamir et al. developed a multi-level CNN that detected and classified glaucoma concurrently, achieving a 97.0% sensitivity.18 While these studies have produced promising results, the CNN-based deep learning models are very data-hungry and prone to bias in the presence of imbalanced samples. Medical imaging data are intrinsically imbalanced as healthy patients make up the majority of samples in a dataset. We seek to improve classification in a setting that represents this natural imbalance using our novel paradigm, BBFL. Using our technique, deep learning techniques can be successfully used to process larger datasets that present an inherent imbalance.
This study has some limitations. First, the study used only one type (CFP) of data and was applied to only image classification tasks. The primary objective of this study was to verify the feasibility of the BBFL algorithm to address class-imbalanced learning. Second, CFPs were used because of the availability of large CFP datasets and their relevance in the ophthalmological imaging research field. Third, we only compared BBFL with other commonly used strategies for class imbalance. There may be other approaches developed to address data imbalance. However, we believe that our comparison experiments demonstrated the feasibility and unique characteristics of BBFL.
7. Conclusion
The BBFL is a promising learning paradigm that can improve the performance of CNN models on imbalanced fundus classification datasets. Our results demonstrated that BBFL-trained CNN models consistently outperformed other commonly used strategies, including random over-sampling, cost-sensitive learning, and thresholding, in both binary and multiclass classification tasks. This was evident through both quantitative metrics and visual analysis, suggesting that BBFL can effectively address the challenges posed by imbalanced datasets in medical image analysis.
Acknowledgment
This work is supported in part by research grants from the National Institutes of Health (NIH) (Grant Nos. R01CA237277, U01CA271888, and R61AT012282).
Biography
Biographies of the authors are not available.
Disclosure
The authors have no conflicts of interest to declare.
Contributor Information
Jatin Singh, Email: jps162@pitt.edu.
Cameron Beeche, Email: cab347@pitt.edu.
Zhiyi Shi, Email: zhiyis@andrew.cmu.edu.
Oliver Beale, Email: bealeo2@upmc.edu.
Boris Rosin, Email: brosin@pitt.edu.
Joseph Leader, Email: jklst3@pitt.edu.
Jiantao Pu, Email: puj@upmc.edu.
References
1. Johnson J. M., Khoshgoftaar T. M., “Survey on deep learning with class imbalance,” J. Big Data 6(1), 27 (2019). 10.1186/s40537-019-0192-5
2. Masko D., Hensman P., “The impact of imbalanced training data for convolutional neural networks,” Independent thesis basic level (bachelor’s degree) (2015). http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166451
3. Van Hulse J., Khoshgoftaar T., Napolitano A., “Experimental perspectives on learning from imbalanced data,” in ICML ’07: Proc. 24th Int. Conf. Mach. Learn., pp. 935–942 (2007). 10.1145/1273496.1273614
4. Pham T. C., et al., “Improving skin-disease classification based on customized loss function combined with balanced mini-batch logic and real-time image augmentation,” IEEE Access 8, 150725–150737 (2020). 10.1109/ACCESS.2020.3016653
5. Chawla N., Japkowicz N., Kołcz A., “Editorial: special issue on learning from imbalanced data sets,” SIGKDD Explorations 6, 1–6 (2004). 10.1145/1007730.1007733
6. Lin T. Y., et al., “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (2020). 10.1109/TPAMI.2018.2858826
7. Nemoto K., et al., “Classification of rare building change using CNN with multi-class focal loss,” in IEEE Int. Geosci. and Remote Sens. Symp., 22-27 July, pp. 4663–4666 (2018). 10.1109/IGARSS.2018.8517563
8. Dugas E., Jared J., Cukierski W., “Diabetic retinopathy detection,” https://www.kaggle.com/competitions/diabetic-retinopathy-detection/data (accessed 10 January 2022).
9. Zhen Y., et al., “Performance assessment of the deep learning technologies in grading glaucoma severity,” (2018).
10. Wang L., et al., “Computerized assessment of glaucoma severity based on color fundus images,” Proc. SPIE 10953, 1095322 (2019). 10.1117/12.2510446
11. Shimizu R., et al., “Balanced mini-batch training for imbalanced image data classification with neural network,” in First Int. Conf. Artif. Intell. for Ind., 26-28 September 2018, pp. 27–30 (2018). 10.1109/AI4I.2018.8665709
12. Szegedy C., et al., “Rethinking the inception architecture for computer vision,” in IEEE Conf. Comput. Vis. and Pattern Recognit., 27-30 June 2016, pp. 2818–2826 (2016). 10.1109/CVPR.2016.308
13. van der Maaten L., Hinton G., “Visualizing data using t-SNE,” J. Mach. Learn. Res. 9, 2579–2605 (2008).
14. Selvaraju R. R., et al., “Grad-CAM: visual explanations from deep networks via gradient-based localization,” in IEEE Int. Conf. Comput. Vis. (2017). 10.1109/iccv.2017.74
15. Schmidhuber J., “Deep learning in neural networks: an overview,” Neural Netw. 61, 85–117 (2015). 10.1016/j.neunet.2014.09.003
16. Muramatsu C., “Diagnosis of glaucoma on retinal fundus images using deep learning: detection of nerve fiber layer defect and optic disc analysis,” Adv. Exp. Med. Biol. 1213, 121–132 (2020). 10.1007/978-3-030-33128-3_8
17. Maheshwari S., Kanhangad V., Bilas Pachori R., “CNN-based approach for glaucoma diagnosis using transfer learning and LBP-based data augmentation,” arXiv:2002.08013.
18. Aamir M., et al., “An adoptive threshold-based multi-level deep convolutional neural network for glaucoma eye disease detection and classification,” Diagnostics 10(8), 602 (2020). 10.3390/diagnostics10080602







