Scientific Reports. 2024 Oct 27;14:25664. doi: 10.1038/s41598-024-76359-0

An analysis of decipherable red blood cell abnormality detection under federated environment leveraging XAI incorporated deep learning

Shakib Mahmud Dipto 1, Md Tanzim Reza 2, Nadia Tasnim Mim 2, Amel Ksibi 3, Shrooq Alsenan 3, Jia Uddin 4, Md Abdus Samad 5
PMCID: PMC11514213  PMID: 39463436

Abstract

In recent times, automated detection of diseases from pathological images leveraging Machine Learning (ML) models has become fairly common, where the ML models learn to detect a disease by identifying biomarkers in the images. However, such an approach requires the models to be trained on a vast amount of data, and healthcare organizations often limit access due to privacy concerns. Consequently, collecting data for traditional centralized training becomes challenging. These privacy concerns can be handled by Federated Learning (FL), which builds an unbiased global model from local models trained on client data while maintaining the confidentiality of that data. Using FL, this study solves the problem of centralized data collection by detecting deformations in images of Red Blood Cells (RBC) in a decentralized way. To achieve this, RBC data is used to train multiple Deep Learning (DL) models, and the most efficient of these is used as the global model inside the FL framework. The FL framework works by copying the global model’s weights to the clients’ local models, training the local models on client-specific devices, and averaging the local models’ weights back into the global model. Both direct averaging and weighted averaging are performed, the latter weighing each local model’s contribution according to its performance and keeping the FL framework resilient to bad clients and attacks. Throughout this process, client data remains confidential while the global model learns the necessary information. The experimental results indicate no significant difference between the performance of the FL method and the best-performing DL model: the best-performing DL model reaches an accuracy of 96%, while the FL environment reaches 94-95%.
This study shows that the FL technique, in comparison to the classic DL methodology, can accomplish confidentiality-secured RBC deformation classification from RBC images without substantially diminishing classification accuracy. Finally, the overall validity of the classification results has been verified by employing GradCAM-driven Explainable AI techniques.

Subject terms: Computational biology and bioinformatics, Mathematics and computing

Introduction

Morphological examination of peripheral blood cells is required to assess different types of blood-related disorders such as anemia, leukemia, lymphoma, and various infections. By examining the size, shape, and other morphological abnormalities, these diseases can be observed and treated1. In addition, an unusual blood cell count also signals various disorders2. Examination of peripheral blood often involves spreading a drop of blood thinly across a surface and inspecting the stained blood smear under a microscope to evaluate the morphology of blood cells, including Red Blood Cells (RBCs), White Blood Cells (WBCs), and platelets3. However, the assessment of blood cells from slide samples requires the specialized knowledge of healthcare professionals, which can often be difficult to come across, especially in suburban and rural areas. Hence, automated systems may come in handy, with which the general population can assess their hematological state and take faster preventive measures. DL and other Artificial Intelligence-based applications are well suited to building such a system.

Although DL-based applications are generally successful in medical image classification, the rate of success depends on the volume of data. DL-driven methods depend on automated feature extraction, which requires a lot of data to learn. Collecting such a large volume of data might be problematic in medical cases, as the data may entail sensitive patient information and hospitals may not want to share data publicly. As a result, it is essential to find a way to train models while keeping data private, which would solve both the privacy and the data-collection issues. Here, FL mechanisms come into play: local models learn from the private data of local devices while a single global model learns by averaging the weights of the local models. In this way, the local private data never leaves the device, keeping it secured and protected. Due to its privacy-preserving nature, FL has become fairly popular in AI fields that require learning from sensitive data, such as Medical Cyber-Physical Systems (MCPS) networks4, credit card fraud detection5, and various types of medical image analysis6. These existing research works inspired us to propose our FL framework.

In this study, the FL method is used to classify RBC anomalies from cell images. The client data remains hidden in the process, and no data is shared while training the FL model. The dataset is divided in a 7:2:1 ratio for training, validation, and evaluation of the DL models. The best-performing DL model is then selected as the global and local model in the FL framework. Finally, the accuracy is compared between the DL model and the FL framework. The contributions of the proposed method are as follows.

  • We propose a dedicated FL framework for RBC anomaly classification, which is the first of its kind in this field

  • We have provided extensive empirical analysis to prove the viability of the FL framework against the traditional approaches in both IID and non-IID distribution of data

  • Alongside the conventional averaging in the FL framework, weighted averaging is performed for improved performance and prevention of poisoning attacks

  • We have employed GradCam-driven Explainable AI (XAI) based analysis to further explain and solidify our findings

Background study

Literature review

The application of DL in assessing medical data has been explored by quite a few researchers. The use of DL has been fruitful in detecting diseases from different types of medical images such as MRIs, X-rays, CT scans, histopathological images, and so on7. The applications also extend to blood cell disease classification by examining the infected or distorted blood cells8. Blood primarily consists of three different cell types: RBCs, WBCs, and platelets, each of which helps to detect particular blood disorders. Among all the cell types, RBC-related abnormalities are common in many medical emergencies. Often they can be an early sign of clinical deterioration in serious conditions such as sepsis, anemia, or sickle-cell disorders9,10. Therefore, individual research works exist that focus solely on RBC segmentation and classification.

Tomari et al. extracted RBC cells from images using the global threshold method and then extracted geometric properties to classify them as normal or abnormal using an Artificial Neural Network (ANN)11. Qiu et al. utilized Region-based CNN to extract individual blood cell patches from microscopic images and then employed six different CNN architectures for abnormal cell classification12. Sickle cell anemia is a disease that changes the shape of RBCs, which was detected by Xu et al. through an automated system. To achieve this, the authors first extracted RBC cell patches from the background image and normalized them into a uniform shape. Afterward, the authors employed deep CNNs to classify sickle-shaped RBCs13. Meanwhile, Reza et al. focused on an optimized pipeline to detect RBC cell shapes using binarized DenseNet models14. In another study, the respective authors performed a rigorous investigation to detect anomalies in the blood cells of sickle cell anemia patients. They used the AlexNet deep learning model to detect 15 types of red blood cell shapes and achieved impressive results, which may help manage patients better by saving time15. Another study on classifying the red blood cells of sickle cell anemia patients also used a deep learning model. The authors applied several data augmentation techniques and transfer learning and introduced a model that produced noteworthy performance on microscopy images16. In yet another paper, a deep learning model was proposed to help detect blood cell diseases and achieved a noteworthy accuracy rate17. Many such studies exist, each demonstrating the potential of DL and AI for automated RBC abnormality diagnosis.

Description of the dataset

The dataset used in this study is available on Mendeley Data18 and is referred to as the ‘RBCdataset’. It consists of a total of 7,108 RBC images distributed across nine classes. Figure 1 illustrates a class-wise breakdown of the data contained in the RBCdataset. The Pencil class contains only 24 images; therefore, we skipped that class entirely and used the images from the remaining eight classes to train and evaluate our models.

Figure 1. Class distribution of the RBCdataset.

The eight classes are Elliptocytes, Dacrocytes, Acanthocytes, Stomatocytes, Spherocytes, Hypochromic, Codocytes, and Normal. Elliptocytes are known for their pencil-shaped structure; they form when fully grown RBCs encounter stress during circulation19,20.

Under normal circumstances, RBCs contain very few elliptocytes. Dacrocytes have a tear-drop shape; RBCs take this form when the bone marrow is affected by metastatic cancer. Acanthocytes are a sign of iron deficiency21. Stomatocytes are generated by severe swelling that replaces the central zone of the RBC with an incision-like cut22. Sphere-shaped RBCs are known as Spherocytes: hereditary spherocytosis and anemia cause RBCs to lose their biconcave, circular structure and take on the form of a sphere23. Hypochromic cells indicate iron deficiency and result from low hemoglobin in RBCs24. Codocytes are produced in large numbers when liver disease is likely; they are also known as target cells due to their bull’s-eye appearance23. Normal RBCs represent healthy cells with a typical round shape. A sample from each class is given in Figure 2.

Figure 2. Sample data of the RBCdataset. (a) Normal, (b) Acanthocytes, (c) Stomatocytes, (d) Spherocytes, (e) Hypochromic, (f) Dacrocytes, (g) Codocytes, (h) Elliptocytes.

Deep learning architecture

In the proposed system, we utilized the TensorFlow25 and Keras26 libraries to implement the VGG16, ResNet50, and Inception V3 architectures. We achieved faster training convergence by using weights pre-trained on the ImageNet database27. The VGG16 architecture consists of 5 blocks of convolution and max-pooling layers, resulting in 16 weighted layers, as the name suggests. The Inception v3 architecture specializes in decomposing larger convolution layers into smaller ones, effectively reducing the number of trainable parameters despite its complex architecture. Meanwhile, ResNet is a very deep CNN architecture in which the drawbacks of having many layers are mitigated by skip connections, which let the gradient flow properly by skipping some of the intermediate layers. The convolutional layers were kept as in the default architectures, while the fully connected layers were dropped. After the flatten layer of the pre-existing architecture, a dense layer of 1024 neurons received the output. A dropout layer with a rate of 50% was then included to mitigate overfitting by randomly deactivating half of the neuron connections during each training iteration. The output layer used the softmax activation function, while the dense layers used the ReLU activation function. Furthermore, we used the Adam optimizer with a learning rate of 0.00001. The architecture of the DL models that we used is shown in Fig. 3.
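As an illustration, the transfer-learning head described above can be sketched in Keras. This is a minimal sketch, not the authors' exact code; the function name `build_model` and the option of passing `weights=None` to skip the ImageNet download are our own assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # eight RBC classes used in this study

def build_model(weights="imagenet"):
    # VGG16 convolutional base with the fully connected layers dropped;
    # pass weights=None to skip downloading the ImageNet weights.
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=(128, 128, 3))
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),   # dense layer after flatten
        layers.Dropout(0.5),                     # drop half the connections
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

The same head (flatten, 1024-unit dense, 50% dropout, softmax) can be placed on the ResNet50 and Inception v3 bases analogously.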

Figure 3. Architecture of the deep learning models used.

Federated learning (FL)

Federated Learning (FL) is a machine learning method that addresses the complications of strict data-privacy regulations and the scarce availability of datasets28–30. A federated setup has two ends. The first is the global, or central, server, which holds the global model. The second is the client end, where the local data is stored on the clients’ terminal devices.

The FL method facilitates model training without any transfer of data. The model is trained on several segregated endpoint devices that hold the data locally. Each local model is trained on its client’s dataset and then sends the updated weights to the global model. Therefore, the security and privacy expected in medical data prognosis are maintained by the FL method31–33. However, despite the promising aspects of FL, some critical challenges must be addressed. Unlike traditional learning systems, where all data is kept in a single system leveraging a single model, an FL framework utilizes multiple data sources spread across multiple devices. Thus, an FL framework needs to withstand heterogeneity across different systems and data counts. In addition, it needs to ensure that the network overhead during client-server communication does not bottleneck the entire system34. To keep the communication overhead minimal and to ensure smooth participation of clients with low-end configurations, lightweight models with a limited number of parameters are often used35. In addition, keeping the FL framework secured from attacking and poorly performing clients is good practice. The architecture of the FL framework that we used is shown in Fig. 4.
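The client-server communication loop described above can be sketched as follows. This is a simplified simulation, not the study's implementation: weights are plain lists of floats, and `local_train` is a hypothetical stand-in for client-side training.

```python
def local_train(global_weights, client_data):
    # Stand-in for client-side SGD on private data: here we merely
    # perturb the broadcast weights to illustrate the data flow.
    return [w + 0.01 * len(client_data) for w in global_weights]

def federated_round(global_weights, clients_data):
    # One communication round: broadcast the global weights, train a
    # local copy on each client, then average the local weights back.
    local_weights = [local_train(list(global_weights), data)
                     for data in clients_data]
    k = len(local_weights)
    return [sum(ws) / k for ws in zip(*local_weights)]

# The raw client data never leaves the clients; only weights travel.
global_w = [0.0, 0.0]
for _ in range(50):  # the study runs 50 communication rounds
    global_w = federated_round(global_w, [[1], [1, 2], [1, 2, 3]])
```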

Figure 4. Architecture of the federated learning (FL) environment.

Explainable AI (GradCAM)

GradCAM36, which stands for Gradient-weighted Class Activation Mapping, is a computer vision technique that helps us to gain insight into the decision-making process of deep neural networks, specifically convolutional neural networks (CNNs). It achieves this by generating visual explanations and identifying the significant regions in an input image that influence the network’s predictions. To create these visual explanations, GradCAM generates a heat map that indicates the importance of each pixel in the image for the network prediction. It accomplishes this by calculating the gradients of the target class score in relation to the feature maps of the final convolutional layer in the network. These gradients are then averaged globally, resulting in weights that represent the importance of each feature map. By multiplying these weights with their corresponding feature maps, GradCAM combines the feature maps to highlight the most crucial regions in the image. The resulting heat map provides valuable insights into the decision-making process of the network by highlighting the key areas of the image that influenced the final prediction. This information is essential to understand the behavior of the network, diagnose problems or failures, and gain interpretability in deep learning models.
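The heat-map computation described above can be sketched with NumPy, assuming the feature maps and gradients of the last convolutional layer have already been extracted from the network (how they are extracted depends on the framework and is omitted here):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Sketch of the GradCAM computation.

    feature_maps: (H, W, C) activations of the last conv layer.
    gradients:    (H, W, C) gradients of the target class score
                  with respect to those activations (precomputed).
    """
    # Global-average the gradients: one importance weight per channel.
    weights = gradients.mean(axis=(0, 1))                 # shape (C,)
    # Weighted sum of feature maps, then ReLU keeps positive influence.
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # Normalize to [0, 1] so the map can be rendered as a heat map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```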

Proposed model

This study’s methodology consists of two primary parts, both dedicated to the detection of RBC abnormalities. The proposed methodology is depicted in Figs. 5, 6, and 7.

Figure 5. The steps taken to find the most effective DL model for identifying RBC abnormalities.

Figure 6. The steps taken to pre-process the dataset.

Figure 7. The proposed method for identifying RBC abnormalities under the FL environment by employing a DL model.

The steps toward identifying the most effective DL model for RBC abnormality classification are depicted in Fig. 5. The first step was to obtain the data from the original source and preprocess it. As part of the preprocessing, we normalized the pixel values of the RBC images and downsampled them to a resolution of 128×128 pixels. Next, we made a 7:2:1 split of the collected data: 70% of the images were picked randomly for training the models, 20% were used for validation, and the remaining 10% were used for evaluation. The basic pre-processing steps are illustrated in Fig. 6. Using the split data, we trained and compared the performance of three DL models, namely VGG16, ResNet50, and Inception v3.
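The 7:2:1 split described above can be sketched as follows; the function name and the fixed seed are our own assumptions, added for reproducibility:

```python
import random

def split_7_2_1(samples, seed=42):
    # Shuffle, then split into 70% train, 20% validation, 10% test.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```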

After identifying the DL model with the highest performance, we used it as the global model in the FL framework. Five local versions of the global model were then made. Next, we further split the training dataset into five parts and gave each of our five clients its own unique training set along with its local model. Each client received a random number of images from the training data split. The clients then trained the local models on the given data, and model validation was carried out using the validation dataset. Following the training and validation of the local models, the aggregated trained weights of the local models were calculated. The aggregated weights were then sent to the global model during each communication round. The training was carried out for a total of fifty communication rounds, after which we evaluated the performance of the global model on the test dataset.

The overall process in the FL framework is provided in Fig. 7. Meanwhile, the algorithm outlining the overall procedure, which includes the function named FedRedXAI, is presented in Algorithm 1.

Algorithm 1. Proposed methodology.

Model training

The labeled data were used to train DL models, and these data were classified into eight distinct groups. Keras and TensorFlow 2.6.0 were utilized in the construction of the model, and a local machine configured with an Intel Core i5 10400 3.1 GHz Processor, RTX 4070Ti Super 16 GB GPU, and 32GB of RAM was chosen for this experiment. To save resources while training the model, the image sizes were reduced to 128×128. The DL model functions more effectively with a balanced dataset, although the dataset that was used had certain imbalances.

Consequently, class-specific weights were assigned to the images during training to compensate for the imbalance. After the computation, the ‘Hypochromic’ class had the largest weight, 2.79, while the ‘Dacrocytes’ class had the lowest, 0.30. The formula for assigning weights is described in Equation (1), and the weight computed for each class is shown in Table 1. The batch size for model training was thirty-two, and a total of fifty epochs were used. Immediately after the model training was completed, each model was stored separately for use in further experiments.

Weight_class = N_T / (N_TC × N_C)    (1)

where N_T is the total number of training images, N_TC is the number of training images from the class, and N_C is the number of classes.
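Equation (1) can be implemented directly; the per-class counts used below are hypothetical stand-ins, as the exact counts follow the distribution in Figure 1:

```python
def class_weights(counts):
    # Weight_class = N_T / (N_TC * N_C), as in Equation (1):
    # rarer classes receive proportionally larger weights.
    n_total = sum(counts.values())      # N_T
    n_classes = len(counts)             # N_C
    return {cls: n_total / (n * n_classes) for cls, n in counts.items()}

# Hypothetical counts for illustration only.
weights = class_weights({"Dacrocytes": 2000, "Hypochromic": 250})
```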

Table 1. Weights for the classes of the RBCdataset.

Class of RBC dataset Calculated weights
Elliptocyte 0.5118
Codocytes 0.7283
Dacrocytes 0.2985
Hypochromic 2.7917
Spherocytes 1.1008
Stomatocyte 1.6224
Acanthocytes 1.7507
Normal 0.4346

Model validation

The validation set, which contains 20% of the data, was used to evaluate how well the models performed. While training the models, it was necessary to evaluate how well each model was progressing with every iteration. Accordingly, the accuracy of RBC abnormality detection was measured on data the models had not observed during training. The effectiveness of the models was evaluated with respect to their validation accuracy. As long as the validation results are consistent with the test results, we can ensure that the training results are not skewed.

Model testing

If hyperparameter optimization is performed based on observations made on the validation set, there is a chance that the model will be skewed in favor of the validation data. Therefore, after the training phase has been completed, it is essential to evaluate the model on a different set of data. For this reason, we set aside 10% of the RBCdataset as a test set. This 10% of the data was used to evaluate the predictions produced by the trained models on data they had never encountered.

Weight aggregation

In the federated environment, where a simulated client-server structure is built for empirical analysis, client models are trained while kept local on the device, and their weights are aggregated to update the server model. In this aggregation procedure, the simplest approach is to average the weights of all the models, which exposes the server model to erroneous and malicious clients. This vanilla averaging can be written as,

W_avg = (1/k) Σ_{n=1}^{k} W_n    (2)

where k is the number of clients. Each client’s model weights are divided by the number of clients and then summed across all clients. The final result, W_avg, is the average of all the client weights.

Meanwhile, the weighted averaging case can be written as,

W_avg = Σ_{n=1}^{k} ( φ_n(x) / Σ_{m=1}^{k} φ_m(x) ) W_n    (3)

where φ_n(x) represents the accuracy score of an individual model on the test set x, and the sum Σ_{m=1}^{k} φ_m(x) in the denominator adds up all the model accuracies. Each model’s accuracy score φ_n(x) is divided by this aggregated accuracy to scale the weights of the individual models. The scaled weights are then summed to obtain the weighted average.

Although weighted averaging exposes the clients’ accuracy scores to the server, it can be beneficial if there are many faulty clients. Due to the weighting mechanism, clients with lower performance get lower priority during the averaging process, and thus a performance improvement can be expected.
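A minimal sketch of the weighted aggregation in Equation (3), with client weights as flat lists and φ_n(x) as plain accuracy scores (both simplifications of ours):

```python
def weighted_average(client_weights, accuracies):
    # Equation (3): normalize each client's accuracy phi_n(x) by the
    # total accuracy, then use it to scale that client's weights.
    total = sum(accuracies)
    coeffs = [acc / total for acc in accuracies]
    n_params = len(client_weights[0])
    return [sum(c * w[i] for c, w in zip(coeffs, client_weights))
            for i in range(n_params)]

# A near-zero-accuracy client (e.g. a poisoning attempt) is almost
# entirely ignored in the aggregate.
```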

Experimental result analysis

The subsequent part contains the final findings of our implementation, as well as an evaluation of how well our models identify RBC abnormalities. For each model, the precision, recall, F1-score, confusion matrix, AUC score, ROC curve, accuracy, and loss values are shown. Eqs. (4)–(8) give the formulae used to compute accuracy, precision, recall, F1-score, and specificity. Our key objective was to achieve higher test accuracy while reducing model loss. Every model was run for a total of fifty epochs using Adam as the optimizer with a learning rate of 0.00001. Following the training, the best-performing DL model was selected as the global model based on the results and was then trained in an FL environment. Finally, we summarize our efforts by comparing against other state-of-the-art methods that have been previously published.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1-score = 2 × TP / (2 × TP + FP + FN)    (7)
Specificity = TN / (TN + FP)    (8)
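Eqs. (4)–(8) translate directly into code; a small sketch with hypothetical confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    # Direct implementations of Eqs. (4)-(8).
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "f1":          2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for one class: 8 true positives, 85 true
# negatives, 2 false positives, 5 false negatives.
scores = metrics(tp=8, tn=85, fp=2, fn=5)
```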

Performance evaluation of DL models

We trained the VGG16, Inception v3, and ResNet50 models for 50 epochs each. The accuracy and loss curves for the models are given in Figs. 8, 9, and 10:

Figure 8. VGG16 accuracy (a) and loss (b) curves.

Figure 9. Inception v3 accuracy (a) and loss (b) curves.

Figure 10. ResNet50 accuracy (a) and loss (b) curves.

If we follow the training pattern of the three models, we can see that all the models reach near their peak training scores at around 20 epochs. The training accuracy and loss patterns for Inception v3 and VGG16 showcase smooth patterns, representing proper learning without any issues. Meanwhile, the validation loss curve for ResNet50 architecture showcases a large spike at the beginning, which is likely caused by overstepping local minima due to a larger than necessary learning rate. However, the model gradually rectifies it as the training goes on. Except for ResNet, all other models’ training and validation accuracy-loss curves run smoothly side by side; no abnormality is seen.

In Fig. 11, the classification results for the three models are given. A common pattern of not being able to classify the hypochromic RBC images is present in the classification reports, as the precision, recall, and f1-score are low for that class. The failure to classify the hypochromic RBC images is also evident in the confusion matrices given in Fig. 12.

Figure 11. Classification reports of VGG16 (a), Inception v3 (b) and ResNet50 (c).

Figure 12. Confusion matrices of VGG16 (a), Inception v3 (b) and ResNet50 (c).

As visible in Fig. 12, the shortage of samples in the hypochromic RBC image class makes it difficult for the models to classify them properly. However, the ResNet model still manages to classify them fairly accurately. The prediction accuracy for all the other classes is satisfactory, as visible in Fig. 11. Afterward, we analyzed the ROC curve and the AUC score of the models. The analyzed ROC curves are shown in Fig. 13. The models achieved the perfect AUC score for the acanthocyte class and the lowest AUC score for the hypochromic class. In addition to the hypochromic class, a satisfactory AUC score was achieved for all the other classes across all the models.

Figure 13. ROC curves and AUC scores of VGG16 (a), Inception v3 (b) and ResNet50 (c).

Among the three models, the VGG16 model comes out on top with 96% overall accuracy across all the classes. In addition, VGG16 has the fewest parameters of the three, as visualized in Fig. 14. The smaller parameter count makes the model smaller in size and thus easier to transfer between client and server during federated learning communications. Consequently, VGG16 became our preferred model for the FL environment.

Figure 14. Number of trainable parameters in the models.

Performance evaluation under federated learning

In the process of creating the FL environment, we divided the data into five segments, each segment representing a single client. A separate test set was also kept to analyze the test performance of the federated global model. The FL simulation was run for 50 communication rounds, where each communication round represents one epoch on each client’s dataset.

Vanilla averaging

With vanilla averaging, the global accuracy and loss curves across the communication rounds show a healthy rate of change (Fig. 15). The accuracy increased quickly until the 15th communication round and then rose gradually, with occasional spikes, until the 50th round. Both the accuracy and loss curves had reached a plateau by the 50th round; therefore, we stopped recording at that point.

Figure 15. Federated learning global model accuracy (a) and loss (b) curves with vanilla averaging.

Surprisingly, as evident in Fig. 16, while the FL global model performs well on the non-hypochromic RBC image classes, like the centrally trained models, it also performs well in classifying the hypochromic RBC images, something the centrally trained models failed to achieve. In the centralized environment, the models failed to properly classify the hypochromic RBC images due to the low number of samples. Meanwhile, in the FL environment, the VGG model had to work on a non-IID dataset due to the nature of the environment, which likely led the model to perform well on classes with few samples. The ROC curve for the global model under the FL environment is given in Fig. 17. As mentioned previously, FL seems to solve the issue of classifying the hypochromic class accurately, with a 0.91 AUC score for that class. Meanwhile, a lower AUC score was achieved for the codocyte class compared to the centrally trained models. Apart from these two classes, a balanced AUC score was maintained across the centralized and decentralized models.

Figure 16. Federated learning classification report (a) and confusion matrix (b).

Figure 17. ROC curve for the FL global model.

Despite producing decent results with trusted clients, the basic averaging method is generally not sustainable in a more unpredictable environment where there might be problematic clients with poor data or ill intentions. Therefore, some basic safeguard against such threats is required, and a weighted averaging process can provide it.

Weighted averaging

In weighted averaging, the model weights are scaled by the scores the client models achieve on a separate test set at the server end. This weighting mechanism should theoretically perform better than vanilla averaging because it accommodates model quality in the averaging mechanism. The procedure gives greater emphasis to the better models and lower priority to the worse ones, so that poorly performing models do not drag down the performance of the averaged server model.

On the normal dataset and client distribution, we can see from Fig. 19 that the FL framework incorporating weighted averaging achieves 95% accuracy, a slight improvement over the framework with vanilla averaging. Comparing Figs. 15 and 18, we notice a smoother loss curve per epoch, showcasing the more stable learning behavior of the framework with the weighted averaging mechanism. In addition, we observe a more uniformly distributed ROC score per class, as visualized in Fig. 20. Although the performance improvement is rather minimal under the general client and data distribution, the main benefit of this averaging procedure should be visible when there are clients with very bad data or intentional data poisoning attacks from a few clients. The effects of such behaviour have been empirically shown and discussed in the ablation study section.

Figure 19. Federated learning classification report (a) and confusion matrix (b) for weighted averaging.

Figure 18. Federated learning global model accuracy (a) and loss (b) curves with weighted averaging.

Figure 20. ROC curve for the FL global model with weighted averaging.

The practicality of the averaging mechanism ultimately comes down to the properties of the clients in the FL environment. The privacy concern is the obvious major trade-off, as individual client performance is observed with a test set, and clients with bad data can easily be exposed. Such exposure may discourage clients from participating in the federated loop. Thus, if a federated environment mostly consists of trusted clients with good data, the sacrifice of privacy is likely not worth it, as the performance improvement should be minimal. However, an open-source FL environment exposed to clients with bad data might benefit significantly from such averaging, even with some privacy trade-off.

Comparison against literature

Overall, under the federated setting, the VGG16 model scored an accuracy of 94%, quite close to the 96% achieved by the centrally trained VGG16 model. This demonstrates that even under the FL setting, RBC deformation can be classified nearly as accurately as in the centralized setting. FL sacrificed only 2% in accuracy while achieving a better distribution of precision and recall scores across the classes. As the FL environment ensures data privacy and opens opportunities for open-source training integration, the added benefits are substantial; a 2% drop in accuracy is therefore a worthy trade-off from our point of view. This research thus demonstrates the effectiveness of FL for the classification of RBC image data. Finally, we compared our FL approach to the State Of The Art (SOTA) result from another paper37 on the same dataset. Results are given in Table 2.

Table 2.

Comparison against literature.

Classes        Sensitivity (%)           Specificity (%)
               Literature   Proposed     Literature   Proposed
Acanthocyte    97.84        100.00       99.71        100.00
Codocyte       88.38        99.52        98.73        68.24
Elliptocyte    98.10        99.66        99.57        97.52
Hypochromic    94.97        99.85        99.52        82.61
Normal         91.75        96.47        98.16        99.30
Spherocytes    94.83        99.69        98.12        98.21
Stomatocyte    88.94        97.76        99.12        84.21
Dacrocyte      100.00       99.80        99.78        99.04

Observing Table 2, the proposed FL architecture conclusively achieved better sensitivity across most of the classes, while the per-class specificity scores are generally lower than those in the literature. With better sensitivity scores, the proposed FL architecture should detect a particular RBC deformation type fairly well. Additionally, the overall accuracy for both the literature and the proposed FL architecture is 94%, showcasing the competitiveness of the FL system: even with a decentralized learning structure, FL performs on par with models from the literature.
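For concreteness, the per-class sensitivity and specificity reported in Table 2 can be derived from a multi-class confusion matrix in one-vs-rest fashion; the sketch below uses a toy 3-class matrix, not the paper's actual counts:

```python
import numpy as np

def sensitivity_specificity(cm):
    """Per-class one-vs-rest sensitivity (recall) and specificity
    from a square confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # missed positives per class
    fp = cm.sum(axis=0) - tp          # false alarms per class
    tn = cm.sum() - tp - fn - fp      # everything else
    return tp / (tp + fn), tn / (tn + fp)

# Toy confusion matrix (illustrative counts only)
cm = [[50, 2, 0],
      [3, 45, 2],
      [1, 1, 48]]
sens, spec = sensitivity_specificity(cm)
print(np.round(sens, 3), np.round(spec, 3))
```

The same routine applied to the 8-class RBC confusion matrices yields the percentage scores of the kind tabulated above.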

Despite performing competitively against traditional learning mechanisms, one important drawback of federated learning is communication overhead. VGG, Inception, and ResNet are large CNN architectures, and sending them back and forth between clients and the server introduces delay. It is difficult to quantify this overhead empirically, as it varies with factors such as the configuration of the client devices and their internet connectivity. However, the communication overhead is a less significant issue in our case, as it is only incurred during federated training. At inference time, the class is predicted directly by the global model, so communication overhead no longer matters.
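A rough back-of-envelope sketch of the training-time overhead is still possible, assuming float32 weights, the 23,112,520-parameter VGG16 (5 blocks) listed in Table 3, and a hypothetical five clients (the payload scales linearly with the client count):

```python
def round_payload_mb(num_params, num_clients, bytes_per_param=4):
    """Rough per-round FL traffic: the server broadcasts the global
    weights to every client and each client uploads its update."""
    one_copy = num_params * bytes_per_param   # bytes for one weight set
    total = one_copy * num_clients * 2        # download + upload per client
    return total / (1024 ** 2)

# VGG16 (5 blocks) from Table 3: 23,112,520 trainable float32 parameters
mb = round_payload_mb(23_112_520, num_clients=5)
print(f"{mb:.1f} MB per federated round")  # ~881.7 MB under these assumptions
```

Even under these conservative assumptions the traffic runs to hundreds of megabytes per round, which illustrates why lighter models are preferable in a federated loop.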

Ablation study

The ablation study provides insights into the performance of different variants of VGG16 architecture on both IID and Non-IID datasets in the Federated Learning environment. The details of the Non-IID distribution are illustrated in Fig. 21.

Figure 21.

Per client class distribution in the non-IID setting.

In the non-IID setup, we made sure that each class is unevenly distributed among the clients, with some clients having no samples of particular classes at all. Performance under this uneven class distribution should give a better insight into the robustness of the model within the federated framework.
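One common way to simulate such a skewed split is Dirichlet-based partitioning (a sketch under that assumption; the paper's actual per-client distribution is the one shown in Fig. 21):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.3, seed=0):
    """Split sample indices across clients with class proportions drawn
    from a Dirichlet distribution; a small alpha yields highly non-IID
    shards, and some clients may receive no samples of a class at all."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # proportions of this class handed to each client
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in zip(clients, np.split(idx, cuts)):
            client.extend(shard.tolist())
    return clients

labels = np.repeat([0, 1, 2], 100)  # toy dataset: 3 classes, 100 samples each
shards = dirichlet_partition(labels, num_clients=4)
print([len(s) for s in shards])  # uneven shard sizes, totalling 300
```

Lowering `alpha` makes the shards more extreme, approaching the case where entire classes are absent from some clients.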

From Table 3, we notice that the basic architecture of the VGG16 model (5 blocks) achieves the highest accuracy on both datasets with the minimum number of trainable parameters. Conversely, reducing the number of blocks increases the number of trainable parameters. This increase occurs because, when later convolution blocks are removed, the model feeds more filters into the classifier layers: the convolution layers in the 3rd and 4th blocks output far more filters than the fifth block. As the number of parameters in a Conv2D layer is proportional to the square of the number of filters, the increase in width (number of filters) leads to a quadratic increase in parameters. Besides, removing blocks also removes pooling layers; without enough pooling, the spatial dimensions remain larger, so the input to the remaining fully connected layers grows and their parameter count rises accordingly. This study highlights the importance of balancing model depth and width to achieve optimal performance without unnecessary complexity. The basic VGG16 architecture strikes this balance effectively, distributing parameters across multiple layers and leveraging pooling operations to control the parameter count. It is also important to use lightweight models in any distributed network. Therefore, the basic architecture of the VGG16 model is the best fit for red blood cell abnormality detection under a distributed learning framework.
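The quadratic effect of layer width on the Conv2D parameter count can be checked directly; the sketch below uses the standard convention of one k x k x c_in kernel plus a bias per output filter:

```python
def conv2d_params(k, c_in, c_out):
    """Parameters of a Conv2D layer: one k x k x c_in kernel plus a
    bias term per output filter."""
    return (k * k * c_in + 1) * c_out

# Doubling the width (input and output filters) roughly quadruples the
# parameters, which is why the wider 3rd/4th VGG blocks dominate the
# truncated variants in Table 3.
narrow = conv2d_params(3, 256, 256)   # a 5th-block-like 256-filter layer
wide = conv2d_params(3, 512, 512)     # a 512-filter layer
print(narrow, wide, round(wide / narrow, 2))  # ratio is about 4
```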

Table 3.

Ablation study with different variations of the VGG16 architecture on IID and non-IID datasets in the federated learning environment with vanilla and weighted averaging.

Variants           Trainable parameters   Data distribution   Accuracy (vanilla avg.)   Accuracy (weighted avg.)
VGG16 (5 blocks)   23,112,520             IID                 0.94                      0.95
                                          Non-IID             0.93                      0.94
VGG16 (4 blocks)   41,198,920             IID                 0.93                      0.94
                                          Non-IID             0.92                      0.93
VGG16 (3 blocks)   68,853,576             IID                 0.92                      0.94
                                          Non-IID             0.91                      0.93

One of the key challenges in decentralized learning is data poisoning. In federated learning, this type of adversarial attack occurs when malicious participants deliberately modify their local data with the aim of corrupting the global model being trained across distributed devices. To evaluate the impact of data poisoning on our baseline models in a federated learning setup, we manually created a poisoned dataset for training purposes. This involved flipping labels and adding different types of images (such as white blood cell images, lung X-ray images, etc.) to two clients’ data, thereby forming the poisoned training set. The performance comparison of baseline models trained on IID (Independent and Identically Distributed), Non-IID, and Poisoned datasets in a decentralized learning environment is illustrated in Table 4. The table showcases the results for both Vanilla Averaging and Weighted Averaging techniques. It is evident that models trained with weighted averaging in the federated learning setup consistently outperform those trained with vanilla averaging. Among the three baseline models, the VGG16 (5 blocks) model demonstrates superior performance across all scenarios, including the poisoned dataset. Therefore, weighted averaging improves model resilience against data poisoning in federated learning, with VGG16 (5 blocks) demonstrating the best overall performance among the tested models.
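A minimal sketch of the label-flipping part of such a poisoned client (illustrative only; the paper's poisoned set additionally mixes in out-of-domain images such as white blood cell and lung X-ray samples):

```python
import numpy as np

def flip_labels(labels, num_classes, fraction=0.5, seed=0):
    """Simulate a label-flipping poisoning attack: a fraction of a
    malicious client's labels is reassigned to a different class."""
    rng = np.random.default_rng(seed)
    poisoned = labels.copy()
    n_flip = int(len(labels) * fraction)
    victims = rng.choice(len(labels), size=n_flip, replace=False)
    # shift each victim label by a nonzero offset so it always changes
    offsets = rng.integers(1, num_classes, size=n_flip)
    poisoned[victims] = (poisoned[victims] + offsets) % num_classes
    return poisoned

clean = np.repeat(np.arange(8), 10)        # 8 RBC classes, 10 samples each
dirty = flip_labels(clean, num_classes=8)
print((clean != dirty).sum())  # half of the 80 labels are corrupted
```

Feeding shards corrupted this way to a minority of clients reproduces the attack scenario evaluated in Table 4, where weighted averaging down-weights the poisoned updates.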

Table 4.

Ablation study with different models on IID, non-IID, and poisoned datasets with vanilla and weighted averaging.

Models             Data distribution   Accuracy (vanilla avg.)   Accuracy (weighted avg.)
VGG16 (5 blocks)   IID                 0.94                      0.95
                   Non-IID             0.93                      0.94
                   Poisoned            0.71                      0.92
ResNet50           IID                 0.91                      0.93
                   Non-IID             0.89                      0.92
                   Poisoned            0.68                      0.89
Inception V3       IID                 0.92                      0.93
                   Non-IID             0.92                      0.93
                   Poisoned            0.74                      0.90

Interpretation of global model

As VGG16 is our best-performing global model in the federated environment, it was chosen for generating the Explainable AI (XAI) outputs. For generating these outputs, we used Grad-CAM applied to the final convolution layer of the global VGG16 model.

In Fig. 22, sample RBC images of each class and the corresponding Grad-CAM outputs are given. Grad-CAM highlights the global model's region of interest on each image with a gradient ranging from red to blue: bright red marks the most important regions and blue the least. For RBCs that are not round in structure (e.g., Acanthocyte, Elliptocyte, Stomatocyte, Dacrocyte), a large portion of the color mapping falls at the edge of the cell, and these image samples show little texture in the middle. Since non-round RBCs have distinguishable shapes, the model can rely on the shape itself for classification; to retrieve this shape information it must focus on the edges of the cell, which explains why the color mappings are often centered there. For RBCs that are relatively round (e.g., Codocyte, Hypochromic, Normal, Spherocyte), the shape becomes a hard-to-distinguish feature, so the model prioritizes the texture in the middle of the cell, resulting in a higher concentration of the color mapping in the middle.
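The heatmap computation follows the standard Grad-CAM recipe36; a minimal numpy sketch, with random arrays standing in for the real activations and gradients taken from the final convolution layer of the global VGG16 model:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a conv layer's activations (H, W, C) and
    the gradients of the class score w.r.t. them: channel weights are
    the spatially averaged gradients, the map is the ReLU of the
    weighted activation sum, scaled to [0, 1] for display."""
    weights = gradients.mean(axis=(0, 1))            # one weight per channel
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)  # ReLU
    if cam.max() > 0:
        cam /= cam.max()                             # normalize for display
    return cam

# Toy 4x4 feature map with 3 channels (stand-ins for real conv outputs)
rng = np.random.default_rng(0)
acts = rng.random((4, 4, 3))
grads = rng.random((4, 4, 3))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (4, 4); upsampled to image size before overlay
```

In practice the heatmap is resized to the input image resolution and overlaid with a red-to-blue colormap, producing visualizations like those in Fig. 22.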

Figure 22.

Interpretation of prediction by the global model using GradCAM. (a) Interpretation on Acanthocyte, (b) interpretation on Codocyte, (c) interpretation on Elliptocyte, (d) interpretation on Hypochromic, (e) interpretation on normal, (f) interpretation on Spherocyte, (g) interpretation on Stomatocyte, (h) interpretation on Dacrocyte.

Discussion

The accomplishment of FL primarily lies in protecting data confidentiality through privacy-preserving training. However, challenges such as the non-IID nature of data distribution and the non-centralized learning process must be addressed first. The proposed research investigates the efficacy of FL in detecting red blood cell abnormalities. In the process, centralized training and evaluation were used to determine the most potent model for the FL framework. We found VGG16 to be the most successful model in the centralized framework, and it is also the most lightweight model parameter-wise. As individual red blood cell images have relatively simple patterns and features, the VGG16 model is less prone to overfitting due to having fewer parameters, which likely explains its better performance. The best model being the most lightweight is also advantageous for the federated framework, as a smaller model creates less network-related overhead during the FL process. In the end, the VGG16 model achieved competitive performance in the FL framework with a very small accuracy trade-off under both averaging techniques, effectively proving its viability in the FL framework. The further ablation tests show that the model also performs well on non-IID distributions in the FL framework and remains resilient against client attacks, while changes to the model architecture do not meaningfully improve performance. Hence, the default model is retained as-is, and an explainable AI analysis is performed on it.

Conclusion

Restricted access to patient datasets due to privacy and security concerns makes it hard for researchers to obtain medical datasets and train ML models with sufficient data. Federated Learning helps preserve the confidentiality of clients' data while producing an unbiased global model. This research aims to solve the centralized data collection problem for red blood cell images using FL. Multiple Deep Learning models, namely VGG16, ResNet50, and Inception v3, were trained on the RBC data. The dataset was split into a 7:2:1 ratio: 70% of the data was used for training the DL models, 20% for validation, and the remaining 10% for evaluation. The best-performing DL model, VGG16, which achieved an accuracy of 96%, was selected as the global model for FL. The FL technique trained the global model using weights from the local models, and the clients' data was kept private throughout the training process. The trained FL model obtained 94% accuracy with vanilla averaging and 95% with weighted averaging while maintaining the confidentiality of clients. Moreover, the outcome of this experiment shows that the FL and DL methods provide similar performance in detecting RBC anomalies. It can therefore be concluded that the FL technique protects clients' data while detecting RBC abnormalities from RBC images, in contrast to the traditional deep learning approach, without meaningfully compromising classification accuracy.

Author contributions

S.M.D. and M.T.R. conceived the idea, developed visualization, performed the analysis, and wrote the initial manuscript. N.T.M. developed the methodology and participated in data curation. A.K. and S.A. contributed to validating the results and managed the funds. J.U. contributed to the investigation, supervised the study, and reviewing of the original manuscript and editing. M.A.S. contributed to the investigation, and reviewing of the original manuscript and editing. All authors reviewed the manuscript.

Funding

This study was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R506), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data availability

Data for this research is available at https://data.mendeley.com/datasets/rfdz6wfzn4/1 (Accessed on 20th January 2024).

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Rodgers, G. P. & Young, N. S. The Bethesda Handbook of Clinical Hematology (Lippincott Williams & Wilkins, 2013).
  • 2.Bain, B. J. Diagnosis from the blood smear. N. Engl. J. Med. 353, 498–507 (2005). [DOI] [PubMed] [Google Scholar]
  • 3.Dacie, J. V. Dacie and Lewis Practical Haematology (Elsevier Health Sciences, 2006).
  • 4.Siniosoglou, I. et al. Federated intrusion detection in NG-IOT healthcare systems: An adversarial approach. In ICC 2021-IEEE International Conference on Communications. 1–6 (IEEE, 2021).
  • 5.Yang, W., Zhang, Y., Ye, K., Li, L. & Xu, C.-Z. FFD: A federated learning based method for credit card fraud detection. In Big Data—BigData 2019: 8th International Congress, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings 8. 18–32 (Springer, 2019).
  • 6.Guan, H., Yap, P.-T., Bozoki, A. & Liu, M. Federated learning for medical image analysis: A survey. Pattern Recognit. 110424 (2024). [DOI] [PMC free article] [PubMed]
  • 7.Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017). [DOI] [PubMed] [Google Scholar]
  • 8.Varghese, N. Machine learning techniques for the classification of blood cells and prediction of diseases. Int. J. Comput. Sci. Eng. 9, 66–75 (2020). [Google Scholar]
  • 9.Ko, E. et al. Early red blood cell abnormalities as a clinical variable in sepsis diagnosis. Clin. Hemorheol. Microcirc. 70, 355–363 (2018). [DOI] [PubMed] [Google Scholar]
  • 10.Lippi, G. & Plebani, M. Recent developments and innovations in red blood cells diagnostics. J. Lab. Precis. Med. 3 (2018).
  • 11.Tomari, R., Zakaria, W. N. W., Jamil, M. M. A., Nor, F. M. & Fuad, N. F. N. Computer aided system for red blood cell classification in blood smear image. Proc. Comput. Sci. 42, 206–213 (2014). [Google Scholar]
  • 12.Qiu, W. et al. Multi-label detection and classification of red blood cells in microscopic images. In 2020 IEEE International Conference on Big Data (Big Data). 4257–4263 (IEEE, 2020).
  • 13.Xu, M. et al. A deep convolutional neural network for classification of red blood cells in sickle cell anemia. PLoS Comput. Biol. 13, e1005746 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Reza, M. T., Dipto, S. M., Parvez, M. Z., Barua, P. D. & Chakraborty, S. A power efficient solution to determine red blood cell deformation type using binarized densenet. In International Conference on Advances in Computing Research. 246–256 (Springer, 2023).
  • 15.Aliyu, H. A., Razak, M. A. A., Sudirman, R. & Ramli, N. A deep learning Alexnet model for classification of red blood cells in sickle cell anemia. Int. J. Artif. Intell. 9, 221–228 (2020). [Google Scholar]
  • 16.Alzubaidi, L., Fadhel, M. A., Al-Shamma, O., Zhang, J. & Duan, Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics 9, 427 (2020). [Google Scholar]
  • 17.Khalil, A. J. & Abu-Naser, S. S. Diagnosis of blood cells using deep learning. Int. J. Acad. Eng. Res. (IJAER) 6, 69–84 (2022). [Google Scholar]
  • 18.Tyas, D. A., Ratnaningsih, T., Harjoko, A. & Hartati, S. Erythrocyte (red blood cell) dataset in thalassemia case. Data Brief 41, 107886 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Landis-Piwowar, K., Landis, J. & Keila, P. Clinical Laboratory Hematology . 3rd ed. 154–177 (New Jersey Pearson, 2015).
  • 20.Manchanda, N. Anemias: Red Blood Morphology and Approach to Diagnosis. 284–296 (Saunders, 2015).
  • 21.Bosman, G. J. Disturbed red blood cell structure and function: An exploration of the role of red blood cells in neurodegeneration. Front. Med. 5, 198 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Andolfo, I., Russo, R., Gambale, A. & Iolascon, A. Hereditary stomatocytosis: An underdiagnosed condition. Am. J. Hematol. 93, 107–121 (2018). [DOI] [PubMed] [Google Scholar]
  • 23.Parab, M. A. & Mehendale, N. D. Red blood cell classification using image processing and CNN. SN Comput. Sci. 2, 70 (2021). [Google Scholar]
  • 24.Dinh, N. H., Cheanh Beaupha, S. M. & Tran, L. T. A. The validity of reticulocyte hemoglobin content and percentage of hypochromic red blood cells for screening iron-deficiency anemia among patients with end-stage renal disease: A retrospective analysis. BMC Nephrol. 21, 1–7 (2020). [DOI] [PMC free article] [PubMed]
  • 25.Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
  • 26.Keras Team. Keras: Deep Learning for Humans. https://github.com/keras-team/keras. Accessed 01 Feb 2024.
  • 27.Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255 (IEEE, 2009).
  • 28.Aledhari, M., Razzak, R., Parizi, R. M. & Saeed, F. Federated learning: A survey on enabling technologies, protocols, and applications. IEEE Access 8, 140699–140725 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Zhang, W. et al. Blockchain-based federated learning for device failure detection in industrial IOT. IEEE Internet Things J. 8, 5926–5937 (2020). [Google Scholar]
  • 30.Sarma, K. V. et al. Federated learning improves site performance in multicenter deep learning without data sharing. J. Am. Med. Inform. Assoc. 28, 1259–1264 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Zhang, W. et al. Dynamic-fusion-based federated learning for COVID-19 detection. IEEE Internet Things J. 8, 15884–15891 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Aich, S. et al. Protecting personal healthcare record using blockchain & federated learning technologies. In 2022 24th International Conference on Advanced Communication Technology (ICACT). 109–112 (IEEE, 2022).
  • 33.Stripelis, D., Ambite, J. L., Lam, P. & Thompson, P. Scaling neuroscience research using federated learning. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). 1191–1195 (IEEE, 2021).
  • 34.Duan, Q. et al. Combining federated learning and edge computing toward ubiquitous intelligence in 6g network: Challenges, recent advances, and future directions. In IEEE Communications Surveys & Tutorials (2023).
  • 35.Zhou, F., Hu, S., Du, X., Wan, X. & Wu, J. A lightweight neural network model for disease risk prediction in edge intelligent computing architecture. Future Internet 16, 75 (2024). [Google Scholar]
  • 36.Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626 (2017).
  • 37.Tyas, D. A., Hartati, S., Harjoko, A. & Ratnaningsih, T. Morphological, texture, and color feature analysis for erythrocyte classification in thalassemia cases. IEEE Access 8, 69849–69860 (2020). [Google Scholar]
