Scientific Reports. 2025 Jul 2;15:23006. doi: 10.1038/s41598-025-06807-y

Bank data protection and fraud identification based on improved adaptive federated learning and WGAN

Yunjiao Zheng
PMCID: PMC12219085  PMID: 40594570

Abstract

To enhance the protection of bank data privacy, build a more secure and efficient data privacy protection system, and effectively identify transaction fraud, this study first proposes an improved adaptive data protection architecture that combines federated learning and differential privacy technology. A fraud recognition model is then designed based on generative adversarial networks. The experiment outcomes show that the adaptive data protection architecture designed in this study notably improves the convergence of the loss function after introducing adaptive gradient clipping and noise differentiation strategies, with a minimum loss value of 0.09 and excellent learning ability. The accuracy of this method can reach 0.996, and its performance advantage is significant under a low privacy budget. Meanwhile, after attacks are introduced, the accuracy of this method decreases only slightly and the probability of privacy leakage remains low. Its mutual information and re-identification risk values are the smallest, converging to 0.02 and 0.194, respectively. The class imbalance processing method designed in this study effectively improves the usability of the dataset, and the fraud recognition model built on it achieves the best performance and recognition precision, with missed and false alarm rates fluctuating within the range of 0.15–0.35. In summary, the research improves the data privacy protection capabilities of banks, which helps to reduce financial risks and enhance customer trust.

Keywords: Federated learning, Bank, Machine learning, Generative adversarial networks, Differential privacy mechanism

Subject terms: Computer science, Scientific data, Software

Introduction

In recent years, through innovative financial products, services, and technology, mobile Internet finance has met the diversified financial needs of users and provided them with more convenient and efficient financial services. The implementation of technologies like AI, big data, and blockchain has enhanced the quality and efficiency of financial services, making them no longer limited by time and space. Compared to traditional finance, it offers greater convenience1,2. However, mobile Internet technology has also made security problems in the financial industry increasingly prominent. Bank data is the core asset of banks, including important sensitive information such as customer personal information, transaction records, and account balances. The large number of mobile clients has increased the number of vulnerabilities banks are exposed to. External threats such as hacker attacks, viruses, and ransomware can steal user data, implant malicious programs, and attack services, leading to an increased risk of personal information leakage3,4. On the other hand, the mobile Internet financial field faces various fraud methods, including forged credit cards, cheque fraud, and identity fraud, which are generally hidden, organized, and complex. Fraudulent behavior hidden in transaction data has caused serious economic losses and reputational damage to banks and customers5. In short, protecting user privacy while effectively identifying fraud has become an important challenge that banks must face in their risk management and fraud prevention systems. In order to cope with these challenges, banks need to continuously strengthen their efforts in technology investment, process optimization, and risk prevention to guarantee the safety of customer data and the stable development of their business6. However, existing privacy protection technologies such as data encryption and access control still risk being cracked or bypassed. How to protect data privacy while achieving efficient utilization and sharing of data has become a challenge for bank data privacy protection. In addition, financial fraud methods are constantly becoming more specialized and high-tech, and the scarcity of fraud samples prevents recognition models from fully learning the characteristics of fraudulent behavior, limiting their ability to recognize it.

Therefore, in order to effectively protect bank data privacy and efficiently identify fraudulent behavior, innovative adaptive improvement strategies were proposed for Federated Learning (FL) and Differential Privacy (DP), and the design of a bank data protection framework was completed. At the same time, based on Generative Adversarial Networks (GAN), an innovative technique for processing imbalanced transaction data was proposed, and fraud identification was completed using ensemble thinking. This study contributes to building a safer and more efficient financial environment, promoting the healthy development of the financial industry.

The paper is organized into four sections. The first section provides an overview and surveys the existing research landscape of data privacy protection and fraud identification in different fields. The second section designs an improved adaptive FL architecture for bank data protection and a fraud recognition model based on WGAN. The third section conducts performance testing and application analysis on the data protection architecture and fraud identification model. The fourth section outlines the main conclusions of the experiment and future work.

Related works

Data privacy protection is an important area for ensuring personal privacy and data security. As big data and AI technology continue to advance rapidly, research on data privacy protection has garnered the interest of many scholars, and extensive research has been conducted around privacy protection technology. Qashlan A et al. proposed a solution using DP to tackle the problem of data privacy protection in IoT communication in smart home systems. By adopting a machine learning scheme with Rényi DP and combining it with a variant of stochastic gradient descent, a data aggregation mechanism was implemented. The outcomes indicated that although the differentially private model sacrificed some model utility, it could effectively protect user privacy7. To address security and privacy issues in the industrial IoT, Selvarajan S et al. proposed a lightweight blockchain security model based on AI. By combining a lightweight blockchain with the COSNN mechanism, feature data is transformed using a real intrinsic analysis model. The experiment outcomes showed that the framework significantly improved execution efficiency, classification accuracy, and detection performance, reaching 0.6 s, 99.8%, and 99.7%, respectively, and performed excellently in anomaly detection8. Li M et al. proposed a robust feature detector based on the power normalized inverse spectrum to address the limited application of deep neural networks (DNNs) in data privacy protection, combined with cloud computing deep learning for data processing. By using the information hiding method of the dual-tree complex wavelet packet transform, embedding and independent extraction of data in orthogonal space can be achieved. The experiment outcomes confirmed that this method had superiority over existing technologies9. Gupta et al. proposed a data protection model based on DP and DNNs to address the issue of learning degradation caused by privacy protection in data outsourcing. This model utilizes the Laplace transform to inject noise, maintaining data privacy while improving accuracy. The experiment outcomes showed that the accuracy, precision, recall, and F1 score of the model significantly improved on multiple datasets, with an increase of 19% to 32% compared to existing methods10. Zhang X et al. proposed a unified FL framework aimed at coordinating horizontal and vertical FL, and quantifying the trade-off between privacy leakage, utility loss, and efficiency reduction. The experiment outcomes indicated that no algorithm could optimize all three aspects simultaneously, that is, the FL system had a “no free lunch” theorem. Simultaneously, the lower limits of various protection mechanisms were analyzed, providing guidance for selecting protection parameters to meet specific needs11. Alsuqaih et al. proposed a secure electronic health platform based on blockchain to address the privacy protection challenges of the Internet of Things in remote health data analysis. This platform allows data owners to independently manage medical data access permissions through an effective access control system. Experiment outcomes showed that the platform performed excellently in terms of computational efficiency, time consumption, and resistance to security attacks, making it suitable for intelligent healthcare systems. It effectively enhanced diagnostic accuracy and protected patient privacy12.

Fraudulent behavior is diverse, including insurance fraud, credit card fraud, and online credit fraud. It is strongly characterized by concealment, organization, and complexity, and a large number of researchers have studied it. Chatterjee et al. focused on credit card fraud detection and outlined the types and challenges of fraud. The research aimed to improve the accuracy of anomaly detection and behavior analysis by introducing digital twin technology, and to achieve secure collaborative learning through blockchain-supported FL. The results showed that the strategy could effectively identify known and new fraud patterns. The study emphasized the importance of technological innovation in strengthening financial security measures, aiming to build a more powerful fraud detection system13. Karthikeyan T et al. proposed a hybrid approach that optimizes LSTMs with the chimp optimization algorithm to address payment card fraud caused by the surge in online shopping. This method first uses the chimp optimization algorithm to screen key features, and then uses a long short-term memory network classifier to identify credit card fraud. The experiment outcomes showed that this method outperformed existing technologies in classification accuracy, error rate, and evaluation indicators, with an accuracy rate of up to 99.18%, confirming its effectiveness in improving the accuracy and dependability of financial fraud detection systems14. Putrada A G et al. investigated the utilization of autoencoders in credit card fraud detection. Using the Kaggle dataset, a five-layer autoencoder was applied after preprocessing techniques such as normalization and synthetic minority oversampling. They proposed an average reduction measure of absolute error and squared error for feature selection, and evaluated the impact of autoencoders on four classifiers. The outcomes indicated that the autoencoder significantly enhanced the performance of the classification model, with its AUC increasing from 0.558 to 0.999, verifying the effectiveness of the autoencoder in enhancing credit card fraud detection15. Roy N C et al. identified and classified fraud types within Indian banking institutions through literature review and focus group discussions, and used a random forest model to predict and rank the severity of fraud. A model aimed at reducing fraud was created utilizing a criminal victim-centered approach, and pre- and post-mitigation measures were proposed. The research results showed that this model helped banks and investigators quickly detect and control fraud16. Zhang J et al. proposed a two-stage graph neural network fraud detection method based on camouflage recognition and enhanced semantic aggregation. Firstly, camouflage is detected through the reconstruction of subgraphs, including restoring hidden edges between fraudsters and removing redundant connections between benign nodes and fraudsters. Secondly, the enhanced semantic aggregation module is utilized to embed node information, integrating high-order information and aggregated semantics. Experiment outcomes showed that this method outperformed existing baselines on real datasets and effectively improved fraud detection performance17.

In summary, significant progress has been made in current research on data protection and fraud identification. Data privacy protection technologies such as DP and blockchain security models have been widely explored to enhance personal privacy and data security. FL has garnered considerable interest because of its privacy protection advantages. In terms of fraud identification, scholars focus on fraudulent behaviors such as credit cards and online credit, and propose various detection methods. However, there are still challenges in terms of data protection and fraud identification in the banking industry, such as privacy protection adaptability and class imbalance. Given the application advantages of FL technology and the issue of class imbalance in transaction data, this study aims to introduce FL and WGAN technology into bank data protection and fraud identification to further enhance bank operational security.

Bank data protection and fraud identification based on improved FL and WGAN

In order to effectively protect bank data and complete fraud identification, the study first designed an adaptive bank data protection architecture based on FL and DP. Then, based on improved GAN, integrated models, and other technologies, transaction fraud identification was completed.

Bank data protection based on improved adaptive FL

High quality data is the foundation for building effective fraud identification models. In order to reduce the risk of data leakage, ensure the effectiveness of data, and achieve data sharing and utilization, research has first been conducted on the protection of bank data. The fundamental concept of FL involves multiple data holders collaborating to train a global model, coordinated by a central server, while ensuring that the training data remains decentralized and distributed. FL can achieve collaborative training of machine learning models by multiple data holders while protecting data privacy18,19. Given the superiority of FL in privacy protection, research has been conducted on data protection architecture design based on FL. FL is mainly composed of a central server and multiple devices, and the training process diagram is shown in Fig. 1.

Fig. 1. Schematic diagram of the FL training process.

In Fig. 1, FL’s distributed learning framework provides users with stronger privacy protection through the combination of data localization, distributed training, and communication efficiency20,21. Firstly, the central server initializes the global model and sets relevant training parameters such as the learning rate and the number of iterations. The central server then distributes the initial global model and its parameters $w$ to the $K$ client devices. Subsequently, in the local training phase, each client $k$ receives the global model and its parameters $w$ from the central server, performs data cleaning and feature extraction on its local dataset $D_k$, and trains the model on the local device to minimize the local risk and loss function $F_k(w)$. The process then enters the communication phase, where each client encrypts its local model update and uploads it to the central server. Finally, the central server receives all local model updates and uses the FedAvg algorithm to aggregate them, weighting each client $k$ by its local data size $n_k$ (with $n$ the total number of samples), to generate the new global model parameters. Equation (1) demonstrates the calculation procedure.

$$ w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k} \tag{1} $$
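To make the aggregation step concrete, here is a minimal sketch of sample-weighted FedAvg over client parameter vectors; the client updates, data sizes, and function name are illustrative rather than the authors' implementation.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate per-client parameter vectors into new global parameters,
    weighting each client k by n_k / n as in Eq. (1)."""
    n_total = sum(client_sizes)
    aggregated = np.zeros_like(client_weights[0])
    for w_k, n_k in zip(client_weights, client_sizes):
        aggregated += (n_k / n_total) * w_k
    return aggregated

# Example: three clients holding different amounts of local data
updates = [np.array([0.20, 1.00]), np.array([0.40, 0.80]), np.array([0.10, 1.20])]
sizes = [1000, 3000, 500]
print(fedavg(updates, sizes))
```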

The entire training process of FL is shown in Eq. (2).

$$ w^{*} = \arg\min_{w} \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w) \tag{2} $$

In Eq. (2), $w^{*}$ represents the optimal model parameters. The study uses DP technology to protect the privacy of the FL model parameters. DP introduces noise during the data query process, making it difficult for attackers to accurately infer other data from partial data. The study applies DP perturbation to the FL model updates, ensuring that sensitive information of individual participants is not leaked during aggregation. The expression that satisfies DP is shown in Eq. (3).

$$ \Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta \tag{3} $$

In Eq. (3), $M$ and $S$ represent the deep learning mechanism and its output set. $D$ and $D'$ represent adjacent datasets. $\varepsilon$ represents the privacy budget, which determines the level of privacy protection. $\delta$ represents a relaxation term, which is related to the allowed degree of privacy leakage. To facilitate the allocation of noise across computations, DP technology includes two composition principles: serial combination and parallel combination. The working principle is shown in Fig. 2.

Fig. 2. Schematic diagram of parallel and serial principles of DP technology.

As shown in Fig. 2, serial combination is the application of multiple DP algorithms on the same dataset. The overall privacy protection level of all algorithm combinations can be quantified by the accumulation of privacy protection levels of each algorithm, satisfying Eq. (4).

$$ \varepsilon_{\mathrm{total}} = \sum_{i=1}^{n} \varepsilon_{i} \tag{4} $$

Parallel combination is the application of multiple DP algorithms on different datasets, and the overall privacy protection level obtained by combining all algorithms is equal to the highest privacy protection level among all algorithms, satisfying Eq. (5).

$$ \varepsilon_{\mathrm{total}} = \max_{i} \varepsilon_{i} \tag{5} $$
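As a small numeric illustration of the two composition rules, assuming three mechanisms with arbitrary budgets:

```python
# Privacy budgets of three DP algorithms (illustrative values)
epsilons = [0.5, 1.0, 2.0]

# Serial combination: applied to the same dataset, the budgets accumulate (Eq. 4)
eps_serial = sum(epsilons)    # 3.5

# Parallel combination: applied to disjoint datasets, the overall level
# equals the largest individual budget (Eq. 5)
eps_parallel = max(epsilons)  # 2.0

print(eps_serial, eps_parallel)
```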

The implementation of DP technology is related to the global sensitivity $\Delta f$ of adjacent datasets, as expressed in Eq. (6).

$$ \Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_{2} \tag{6} $$

In Eq. (6), $f$ represents the query function and $\Delta f$ its range of variation; the sensitivity controls how much noise is needed for privacy protection. The noise mechanism used in the study is Gaussian noise, whose distribution follows a Gaussian distribution $\mathcal{N}(0, \sigma^{2})$, where $\sigma^{2}$ represents the variance. The calculation process is shown in Eq. (7).

$$ M(D) = f(D) + Y, \qquad Y \sim \mathcal{N}(0, \sigma^{2}) \tag{7} $$

From Eq. (7), random noise that follows a Gaussian distribution can be obtained. Subsequently, the study set the DP perturbation before local update aggregation, and the FL-DP data protection framework is shown in Fig. 3.
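A minimal sketch of this Gaussian perturbation step follows; the calibration of the noise scale from (ε, δ) and the sensitivity uses the classical Gaussian-mechanism formula and is an assumption here, as are the function and variable names.

```python
import numpy as np

def gaussian_perturb(value, sensitivity, epsilon, delta, rng=None):
    """Add Gaussian noise to a query result / model update.

    The noise scale follows the classical Gaussian-mechanism calibration
    sigma >= sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon
    (an assumption, not stated explicitly in the paper)."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: perturb an aggregated update with budget epsilon = 4, delta = 1e-5
update = np.array([0.12, -0.03, 0.27])
print(gaussian_perturb(update, sensitivity=1.0, epsilon=4.0, delta=1e-5))
```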

Fig. 3. Schematic diagram of FL-DP data protection framework.

In Fig. 3, the study sets Gaussian noise in the local training stage of the client, and the training strategy is stochastic gradient descent. The calculation process is shown in Eq. (8).

$$ \bar{g}_{i} = \frac{g_{i}}{\max\!\left(1, \frac{\lVert g_{i} \rVert_{2}}{C}\right)} \tag{8} $$

In Eq. (8), $g_{i}$ represents the gradient corresponding to the data sample $x_{i}$, $C$ is the clipping threshold, and $\bar{g}_{i}$ indicates the gradient after clipping. Then, the mean $\tilde{g}$ of all clipped gradients is calculated to update the model parameters, as shown in Eq. (9).

$$ \tilde{g} = \frac{1}{B} \sum_{i=1}^{B} \bar{g}_{i}, \qquad w_{t+1} = w_{t} - \eta\, \tilde{g} \tag{9} $$

In Eq. (9), $\eta$ represents the learning rate. Finally, on the basis of the updated parameters, Gaussian perturbation is added to produce the local parameters. However, traditional FL-DP data protection architectures create perturbations by adding random noise, and excessive noise directly affects the accuracy of the model, significantly reducing its usability. Meanwhile, noise accumulates over prolonged training, making it hard for the model to sustain high accuracy over time22,23. Therefore, the study introduces the idea of “adaptive” adjustment of the noise scheme to improve the FL-DP architecture.
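Before the adaptive improvements, the baseline noisy local update described above can be sketched as follows. The PyTorch model, loss function, fixed clipping threshold C, and noise scale are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def dp_local_step(model, loss_fn, xs, ys, lr=0.01, clip_C=5.0, sigma=1.0):
    """One noisy local update: clip each per-sample gradient to norm C (Eq. 8),
    average and take an SGD step (Eq. 9), with Gaussian noise perturbing the step."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_C / (norm + 1e-12), max=1.0)  # enforce ||g||_2 <= C
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noisy_mean = (s + sigma * clip_C * torch.randn_like(s)) / len(xs)
            p.add_(-lr * noisy_mean)  # SGD step with the noisy mean gradient
```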

Gradient clipping is a technique used to limit the size of gradients, which can avoid gradient explosion or vanishing during the training of deep learning models and stabilize and speed up training. In the FL-DP architecture, gradient clipping also limits the sensitivity of the gradients. Sensitivity measures the degree to which data changes affect the output results; the higher the sensitivity, the more noise needs to be added. Therefore, setting a reasonable clipping threshold can effectively control the amount of added noise and improve the usability of the model. The research adaptively adjusts the clipping threshold based on the gradient norm. Firstly, the L2 norm $\lVert g^{(l)} \rVert_{2}$ of each network layer, i.e., the square root of the sum of squared gradients, is calculated and appended to the network's gradient L2 norm sequence. The calculation process is in Eq. (10).

$$ \lVert g^{(l)} \rVert_{2} = \sqrt{\sum_{j} \left(g^{(l)}_{j}\right)^{2}} \tag{10} $$

The clipping threshold can be determined by calculating the sequence mean according to Eq. (10). Then, the sensitivity calculation of adjacent datasets is updated to Eq. (11).

[Equation (11): sensitivity of adjacent datasets under the adaptive clipping threshold]

In Eq. (11), $T$ is the number of local training iterations and $B$ indicates the number of batch samples. In addition, the FL-DP architecture often faces a trade-off between availability and privacy, so this research also differentiates the noise perturbations. The study identifies the role of different parameters in the global optimization process by analyzing gradient updates, weight parameters, and the update trends of the global and local model gradients, in order to determine the important parameters for noise differentiation. The global and local updates of the model parameters at iteration $t$ are defined as $\Delta W_{t}$ and $\Delta w_{t}$, where $t$ represents the number of iterations. The calculation of the importance coefficient $I$ of the local model parameters is shown in Eq. (12).

[Equation (12): importance coefficient $I$ of the local model parameters]

In Eq. (12), $n$ is the number of parameters, indexed $j = 1, 2, \ldots, n$, and $\alpha$ indicates the adjustment coefficient. The process of the noise differentiation operation is shown in Fig. 4.

Fig. 4. Schematic diagram of noise differentiation operation process.

As shown in Fig. 4, the differentiation operation calculates the importance coefficients of all layers of the model network based on Eq. (12), selects the model parameters corresponding to the top $K$ importance coefficients using the Top-K method, and likewise selects the $K$ largest of the generated random noise values. Finally, it adds the smaller noise perturbations to the parameters with higher importance coefficients and the larger noise perturbations to the parameters with lower importance coefficients.
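A minimal sketch combining the two adaptive ideas: the clipping threshold is taken as the mean of the accumulated layer-wise gradient L2 norms, and the Top-K matching assigns the smaller generated noise values to the more important parameter groups. The importance scores are placeholders here; the exact coefficient of Eq. (12) is not reproduced.

```python
import numpy as np

def adaptive_clip_threshold(layer_grads, norm_history):
    """Append each layer's gradient L2 norm to the running sequence (Eq. 10)
    and use the sequence mean as the current clipping threshold."""
    for g in layer_grads:
        norm_history.append(float(np.linalg.norm(g)))
    return float(np.mean(norm_history))

def differentiated_noise(importance, sigma=1.0, rng=None):
    """Top-K style matching: the parameter groups with the largest importance
    coefficients receive the smallest-magnitude noise values, and vice versa."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=len(importance))
    by_importance = np.argsort(importance)[::-1]   # most important first
    by_magnitude = np.argsort(np.abs(noise))       # smallest noise first
    assigned = np.empty_like(noise)
    assigned[by_importance] = noise[by_magnitude]
    return assigned

history = []
grads = [np.ones((64, 32)) * 0.01, np.ones((32, 10)) * 0.03]
print("adaptive clipping threshold:", adaptive_clip_threshold(grads, history))
print(differentiated_noise(importance=np.array([0.9, 0.1, 0.5])))
```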

Fraud identification based on WGAN

After completing the research on bank data protection, high-quality data with strong security and privacy was obtained, providing a data foundation for fraud identification research. Therefore, research then turned to fraud identification, in order to jointly build a bank risk management and fraud prevention system. Firstly, the collected bank transaction data are preprocessed. For missing values, different processing methods are selected according to the degree of missingness. When there are few missing values and their impact on the overall data is small, the affected records can be deleted. When missing values have a significant impact on the overall data, methods such as mean imputation and median imputation can be used to fill them in. For outliers in the data, the K-means clustering algorithm can be used to identify them while preserving the records. Finally, the study analyzed statistical measures such as the mean, median, and standard deviation of the data, using statistical analysis tools to display the distribution and trends of the data and facilitate the identification of outliers.
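A minimal sketch of this preprocessing flow with pandas and scikit-learn; the file name, column names, cluster count, and outlier cut-off are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("transactions.csv")  # illustrative file name

# Missing values: drop rows with many gaps, impute the rest with median/mean
df = df.dropna(thresh=int(0.9 * df.shape[1]))
df["amount"] = df["amount"].fillna(df["amount"].median())
df["age"] = df["age"].fillna(df["age"].mean())

# Outliers: flag records far from their K-means cluster centre, but keep them
features = df[["amount", "balance"]].to_numpy()
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
dist = ((features - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1) ** 0.5
df["is_outlier"] = dist > dist.mean() + 3 * dist.std()

print("records flagged as outliers:", int(df["is_outlier"].sum()))
print(df[["amount", "balance"]].describe())
```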

Due to significant differences in consumption habits and investment preferences among different customer groups, there are varying demands for financial products. At the same time, different financial products have different transaction patterns and frequencies, resulting in significant class imbalance in bank transaction data, which in turn leads to deficiencies in banks' fraud identification24,25. Therefore, to guarantee the precision of fraud identification and classification, the study first carried out balancing operations on the imbalanced data. The imbalanced data processing technique chosen for the study is the GAN, which consists of a generator $G$ and a discriminator $D$. Its working mechanism is shown in Fig. 5.

Fig. 5. Schematic diagram of GAN working mechanism.

As shown in Fig. 5, the generator is responsible for capturing the distribution of the training data and generating new data items that are similar to real samples. The discriminator's role is to differentiate between synthetic and authentic data items. Throughout the training of the GAN, the generator and discriminator engage in a competitive dynamic: the generator continuously minimizes the probability that its generated data is judged as fake, in order to deceive the discriminator, while the discriminator strives to improve its ability to distinguish real from fake samples26,27. The objective function for GAN training is shown in Eq. (13).

$$ \min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{13} $$

In Eq. (13), $P_{\mathrm{data}}(x)$ is the probability distribution function of the real dataset, $z$ represents random noise, and $x$ represents real data. The adversarial training process of GAN is shown in Fig. 6.

Fig. 6. Schematic diagram of GAN adversarial training process.

As shown in Fig. 6, during the game evolution of the GAN, the generator and discriminator continuously optimize their own performance through alternating training. The game iterates until the generator can generate high-quality samples and the discriminator loses its ability to precisely differentiate between genuine and synthetic samples. However, the training stability of traditional GANs is poor: when the generator degrades, the “Nash equilibrium” between the discriminator and the generator oscillates, increasing the instability of training. Therefore, the study employed the Wasserstein GAN (WGAN), a variant of GAN, for data balancing. WGAN introduces the Wasserstein distance to replace the divergence-based loss function of traditional GANs. The Wasserstein distance $W(P_r, P_g)$ can still provide meaningful gradient information even when the two probability distributions barely overlap, improving the stability of training28,29. The training objective of WGAN is shown in Eq. (14).

$$ W(P_{r}, P_{g}) = \frac{1}{K} \sup_{\lVert f \rVert_{L} \le K} \mathbb{E}_{x \sim P_{r}}\left[f(x)\right] - \mathbb{E}_{x \sim P_{g}}\left[f(x)\right] \tag{14} $$

In Eq. (14), $\lVert f \rVert_{L} \le K$ represents the K-Lipschitz constraint. The discriminator of WGAN satisfies the K-Lipschitz condition, which limits the output difference of the discriminator to a certain range. $P_{r}$ and $P_{g}$ are the true distribution and the generated distribution, respectively. The loss functions of the WGAN generator $L_{G}$ and discriminator $L_{D}$ are shown in Eq. (15).

$$ L_{D} = \frac{1}{m} \sum_{i=1}^{m} \left[ f_{w}(G(z_{i})) - f_{w}(x_{i}) \right], \qquad L_{G} = -\frac{1}{m} \sum_{i=1}^{m} f_{w}(G(z_{i})) \tag{15} $$
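A minimal PyTorch sketch of alternating updates with these two losses, using weight clipping to enforce the Lipschitz constraint; the network sizes, RMSprop learning rate, and clip value are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 30
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # critic, no sigmoid

opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def wgan_step(real_batch, clip=0.01):
    # Critic/discriminator loss L_D: mean f_w(G(z)) - mean f_w(x)
    z = torch.randn(real_batch.size(0), latent_dim)
    loss_D = D(G(z).detach()).mean() - D(real_batch).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    for p in D.parameters():          # weight clipping for the K-Lipschitz condition
        p.data.clamp_(-clip, clip)

    # Generator loss L_G: -mean f_w(G(z))
    z = torch.randn(real_batch.size(0), latent_dim)
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

print(wgan_step(torch.randn(16, data_dim)))
```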

In Eq. (15), $m$ represents the batch size and $w$ the discriminator (critic) parameters. The traditional WGAN solves the possible mode collapse and gradient vanishing of GANs, but the WGAN structure lacks an autoencoder and has shortcomings in learning the distribution of the data. Therefore, the study introduced a Variational Autoencoder (VAE) to improve WGAN's learning of data distribution features. A VAE is an autoencoder used to learn effective encodings of the input data, describing observations in a latent space in a probabilistic manner. The VAE consists of an encoder $q_{\phi}(z \mid x)$ and a decoder $p_{\theta}(x \mid z)$. The encoder converts the input data into the mean and logarithmic variance of the latent representation space, and a point randomly sampled from the latent normal distribution $\mathcal{N}(\mu, \sigma^{2})$ output by the encoder is taken as the latent representation30. The decoder maps the latent representation obtained from random sampling back to the space of the original input data, thereby generating new data samples. The training process is in Eq. (16).

$$ L_{\mathrm{VAE}} = \mathbb{E}_{q_{\phi}(z \mid x)}\left[-\log p_{\theta}(x \mid z)\right] + D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\Vert\, p(z)\right) \tag{16} $$
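A minimal PyTorch sketch of such an encoder/decoder with the reparameterization trick and the two loss terms of Eq. (16); layer sizes and the MSE reconstruction term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, data_dim=30, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(data_dim, 64)
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized latent sample
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                 # reconstruction loss
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularization loss
    return recon + kl

x = torch.randn(16, 30)
model = VAE()
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar).item())
```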

In Eq. (16), $x$ represents the input data. Training the VAE minimizes the sum of the reconstruction loss and the regularization (KL) loss to optimize the parameters. The VAE-WGAN model utilizes the VAE to learn the input data, reducing the uncertainty of the data reconstructed by the WGAN. The expression for the reconstructed data $\hat{x}$ is shown in Eq. (17).

$$ \hat{x} = G(z), \qquad z \sim q_{\phi}(z \mid x) \tag{17} $$

In Eq. (17), $q_{\phi}(z \mid x)$ represents the approximate posterior distribution of the input data extracted by the encoder. The calculation of the loss function $L_{D}$ of the final discriminator is shown in Eq. (18).

[Equation (18): loss function $L_{D}$ of the final discriminator]

In Eq. (18), the first term is the feature vector of the original unlabeled data after dimensionality reduction and feature extraction by the discriminator, and the second term is the output of the discriminator on the reconstructed data. After completing the class imbalance processing, the research adopts the ensemble idea of Extreme Gradient Boosting (XGBoost) to construct the fraud recognition model. XGBoost is a distributed gradient boosting machine learning algorithm that performs optimization under the gradient boosting framework and improves on the gradient boosting decision tree. The design of the fraud detection model is illustrated in Fig. 7.

Fig. 7. Schematic diagram of fraud recognition model architecture based on improved WGAN.

As shown in Fig. 7, during the model training phase, XGBoost gradually constructs a series of weak learners and combines them into a strong learner, improving its recognition and classification ability. The weak learners of XGBoost are different decision trees $f_{m}$, and the construction process is shown in Eq. (19).

$$ f_{m}(x; \Theta_{m}) = \sum_{j=1}^{J} c_{j}\, I\left(x \in R_{j}\right) \tag{19} $$

In Eq. (19), $x$ and $\Theta_{m}$ respectively represent the input data and the parameters of the decision tree $f_{m}$, $R_{j}$ represents the leaf node region, and $c_{j}$ is a constant. The decision tree is generated in the direction of XGBoost residual reduction, and the calculation of the residual $r_{mi}$ is in Eq. (20).

$$ r_{mi} = -\left[ \frac{\partial L\left(y_{i}, F(x_{i})\right)}{\partial F(x_{i})} \right]_{F = F_{m-1}} \tag{20} $$

In Eq. (20), $F$ and $L$ respectively represent the estimation function and the loss function. The residual $r_{mi}$ captures the discrepancy between the model's predicted value and the actual value, and the final classification result is the weighted combination of all decision trees. $M$ is the number of boosting iterations of XGBoost. Subsequently, based on the residual $r_{mi}$, the parameters of XGBoost are continuously updated to obtain the final regression tree. The calculation of the training objective function $Obj$ is shown in Eq. (21).

$$ Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_{i} f_{t}(x_{i}) + \tfrac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2} \tag{21} $$

In Eq. (21), $g_{i}$ and $h_{i}$ respectively represent the first and second derivatives of the loss function, $\lambda$ is the regularization penalty coefficient for the leaf weights $w$, and $\gamma$ is the regularization penalty coefficient for the number of leaves $T$.
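A minimal sketch of training such a gradient-boosted classifier with the xgboost library; the synthetic feature matrix and the hyperparameter values stand in for the balanced transaction data and the paper's settings.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Illustrative synthetic data standing in for the balanced transaction features
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = XGBClassifier(
    n_estimators=200,   # number of boosting rounds (weak learners)
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,     # L2 penalty on leaf weights (lambda in Eq. 21)
    gamma=0.0,          # penalty per additional leaf (gamma in Eq. 21)
    eval_metric="logloss",
)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```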

Performance testing and application effect analysis of bank data protection and fraud identification models

In order to verify the effectiveness of the data protection and fraud identification model designed in the research on bank security protection, performance testing and application analysis of the model were conducted, and the results were analyzed and discussed.

Performance testing of the improved adaptive FL bank data protection model

The study first conducted performance testing on the data protection model, using the Windows 10 operating system, an Intel (R) Core (TM) i9-11900 K processor with 16 GB of memory, an NVIDIA GTX 2080Ti graphics processor, the Python 3.8.8 programming language, and the PyTorch 1.12.1 deep learning framework. The study selected UNSW-NB15, NSL-KDD, and ToN-IoT as datasets for the data privacy protection tests. UNSW-NB15 contains 175,341 network connection records, involving 13 different network traffic attributes such as source IP address, destination IP address, and port. The NSL-KDD dataset contains various network traffic data. The ToN-IoT dataset includes heterogeneous data sources collected from telemetry datasets, Windows and Linux base datasets, and network traffic datasets. Each dataset is divided into training and testing sets in an 8:2 ratio. The comparative models include the traditional FL-DP data protection architecture, the Rényi DP protection scheme in reference7, and the lightweight blockchain security model based on AI in reference8. The privacy budget is set to 4, the clipping threshold to 5, the iteration count to 120, and the noise scale to 1.

The comparison results of loss function values for various models are in Fig. 8. In Fig. 8 (a), under the UNSW-NB15 dataset, the loss value of the adaptive FL-DP data protection architecture designed in the study decreased to a lower level at the 40th iteration. After 60 iterations, the loss value tended to stabilize. In contrast, the loss function curves of FL-DP architecture, Renyi DP protection scheme, and lightweight blockchain security model gradually converged after 100 iterations. In Fig. 8 (b), under the NSL-KDD dataset, the loss function curve of the adaptive FL-DP data protection architecture still converged the fastest, and the loss value converged to the minimum value of 0.09. In Fig. 8 (c), the loss function curves of different models in the ToN-IoT dataset had a relatively consistent variation pattern, but the adaptive FL-DP data protection architecture had the best convergence performance. This method had excellent learning ability in protecting data privacy, and the loss function value maintained a stable downward trend. The model was gradually learning and optimizing its performance.

Fig. 8. Comparison of loss function values on different datasets.

The comparison results of the accuracy of different data privacy protection schemes under different privacy budgets are in Fig. 9. In Fig. 9, with the increase of privacy budget, the accuracy of various methods on different datasets showed a fluctuating growth trend. In Fig. 9 (a), the adaptive FL-DP architecture designed for research had the highest accuracy, reaching 0.996. Compared to other methods, especially when the privacy budget was low, the advantages of adaptive FL-DP architecture were more pronounced. The traditional FL-DP architecture improved accuracy slowly when the privacy budget was low. In Fig. 9 (b), the adaptive FL-DP architecture performed the best under all privacy budgets, while the accuracy improvement speed of the lightweight blockchain security model was relatively slow. In Fig. 9 (c), as the privacy budget increased, the difference in accuracy improvement rate among different methods was relatively small. Overall, the adaptive FL-DP architecture had the best accuracy and significant advantages under low privacy budgets.

Fig. 9. Accuracy of different data privacy protection schemes.

Taking the deposit, withdrawal, transfer, loan, wealth management, and other financial transaction data of all accounts of a certain bank in China in the past year as an example, this study collects transaction information such as transaction date, transaction time, transaction type, and transaction channel, and collects basic customer information such as name, age, gender, contact information, etc. for data protection application analysis. The privacy leakage probability and model robustness analysis results of different methods are in Fig. 10. In Fig. 10 (a), during the attack process, the accuracy of different methods’ models showed a decreasing trend. Under the same experimental environment, the accuracy of the adaptive FL-DP architecture decreased less, with the lowest accuracy still above 0.80, demonstrating good robustness. The accuracy of the traditional FL-DP architecture rapidly decreased in the early stages of the attack and then tended to stabilize, but the overall accuracy was the lowest. The accuracy of the other two methods fluctuated greatly during the attack process, showing poor robustness. As shown in Fig. 10 (b), in most data samples, the privacy leakage probability of the adaptive FL-DP architecture was relatively low, fluctuating within the range of 0.05–0.25, demonstrating good privacy protection performance. The privacy leakage probability of other methods fluctuated under different data samples, with the traditional FL-DP architecture having the highest leakage probability, exceeding 0.50. Overall, the adaptive FL-DP architecture achieved a good balance between privacy protection and model robustness.

Fig. 10. Analysis of privacy leakage probability and model robustness for different methods.

The Mutual Information (MI) and re-identification risk analysis results of different methods are in Fig. 11. In Fig. 11 (a), as the number of iterations increased, the MI of the adaptive FL-DP architecture decreased the fastest, with a minimum value close to 0, indicating that the amount of information leaked to the participants during model training was the least. In contrast, the MI values of the other methods were all above 0.20, with a maximum value of 0.456; the dependence between different variables was strong, and the risk of information leakage was relatively high. As shown in Fig. 11 (b), the adaptive FL-DP architecture had the fastest reduction in re-identification risk, with the minimum value converging to 0.194, indicating the strongest privacy protection capability. The re-identification risk of the lightweight blockchain security model and the Rényi DP protection scheme decreased slowly, ultimately converging to 0.491 and 0.446, respectively, performing slightly worse than the research design.

Fig. 11. Comparison of MI and re-identification risks.

Performance testing and analysis of fraud recognition model based on improved WGAN

Firstly, the performance of the data class imbalance processing algorithm is analyzed, comparing the Matthews Correlation Coefficient (MCC), balanced accuracy, and clustering performance before and after data balancing; the experiment outcomes are shown in Fig. 12. In Fig. 12 (a), there was a significant difference in the clustering effect of the dataset before and after data balancing. The Xie–Beni (XB) value of the balanced dataset decreased from 0.604 to 0.439, and the Davies–Bouldin Index (DB) decreased from 0.423 to 0.269. As the number of iterations increased, the clustering effect improved. In Fig. 12 (b), the MCC of the dataset before balancing gradually increased from -0.927 to 0.575, and the MCC of the dataset after balancing increased from 0.040 to 0.723. MCC considers the model's true positives, true negatives, false positives, and false negatives; the closer the value is to 1, the greater the probability that the model correctly classifies all samples. Meanwhile, the maximum balanced accuracy of the dataset before balancing was only 0.799, and after balancing it could reach 0.926. Overall, handling imbalanced datasets is beneficial for improving the performance of the model on imbalanced data.
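For reference, both metrics can be computed directly with scikit-learn; the label arrays below are illustrative.

```python
from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print("MCC:", matthews_corrcoef(y_true, y_pred))                      # in [-1, 1], 1 = perfect
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # mean of per-class recall
```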

Fig. 12. Comparison of data class imbalance processing effects.

The research compared the VAE-WGAN-XGBoost fraud recognition model designed in this study with the WGAN-XGBoost, XGBoost, and logistic regression models. Firstly, the research compared the accuracy, recall, F1 score, Area Under the Curve (AUC), and Kolmogorov–Smirnov (KS) value of the different models. The experiment outcomes are shown in Fig. 13. As shown in Fig. 13 (a), the accuracy of the VAE-WGAN-XGBoost model was 0.969, the recall was 0.963, the F1 score was 0.983, and the AUC value was as high as 0.988, showing the best performance on all indicators. In fraud detection, it had better performance and could accurately identify fraudulent transactions. In the case of imbalanced data, compared with the XGBoost and Logistic models, VAE-WGAN processing significantly improved the performance of the fraud recognition model and achieved the best classification performance. In Fig. 13 (b), the VAE-WGAN-XGBoost model had the highest KS value at all sample sizes, reaching 0.920, demonstrating the strongest discriminative ability. The KS values of the WGAN-XGBoost, XGBoost, and Logistic models were 0.776, 0.628, and 0.691, indicating that the designed model achieved the best ability to distinguish between positive and negative samples. This is because the VAE-WGAN-XGBoost model proposed in this study combines the robust feature learning ability of VAE, the stable generation ability of WGAN, and the powerful classification ability of XGBoost. It not only solves the problem of class imbalance, but also enhances the model's ability to recognize complex fraud patterns, thus outperforming the comparison models on various performance indicators.
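For reference, the KS value can be computed as the maximum gap between the cumulative score distributions of positive and negative samples, i.e. max(TPR − FPR) from the ROC curve; the scores below are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.10, 0.30, 0.80, 0.20, 0.70, 0.90, 0.40, 0.60, 0.20, 0.10])

fpr, tpr, _ = roc_curve(y_true, scores)
ks = float(np.max(tpr - fpr))  # KS statistic: maximum separation between TPR and FPR
print("KS value:", ks)
```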

Fig. 13. Performance comparison of fraud recognition models.

To compare the recognition accuracy of the different fraud recognition models, the Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared were selected as evaluation metrics. The experiment outcomes are shown in Table 1. As shown in Table 1, the VAE-WGAN-XGBoost model designed in this study had the smallest values on the three error indicators, with MAE, MAPE, and RMSE values of 0.222, 0.233, and 0.236, respectively. The recognition errors of the other three models were relatively large, with a maximum error value of 0.526. Meanwhile, the R-squared value of the VAE-WGAN-XGBoost model was higher than 0.90, with a maximum value of 0.930. This model therefore had strong fitting ability for the data and could accurately identify fraudulent behavior. This is because the VAE-WGAN combination helps generate high-quality synthetic data to expand the training dataset, and XGBoost uses these data for classification, further improving classification accuracy and reducing errors.

Table 1.

Performance comparison of different fraud recognition models.

Model Data set MAPE RMSE MAE R-squared
VAE-WGAN-XGBoost Tr 0.222 0.233 0.236 0.912
Te 0.231 0.235 0.312 0.930
WGAN-XGBoost Tr 0.426 0.311 0.404 0.863
Te 0.327 0.376 0.334 0.736
XGBoost Tr 0.355 0.526 0.412 0.808
Te 0.370 0.481 0.489 0.760
Logistic Tr 0.357 0.333 0.390 0.687
Te 0.330 0.349 0.458 0.667

The False Negative Rate (FNR) and False Alarm Rate (FAR) results of different models in fraud recognition are shown in Fig. 14. As shown in Fig. 14 (a), the FAR value of the VAE-WGAN XGBoost model was relatively low, fluctuating within the range of 0.15–0.25, demonstrating good ability to control false alarms. The FAR of the other three methods was relatively high, and the accuracy of the model in identifying normal transactions was low. As shown in Fig. 14 (b), the FNR of the VAE-WGAN XGBoost model fluctuated greatly under different sample sizes, but in most cases it was lower than other models, demonstrating good fraud detection ability. This is because it combines the advantages of VAE and WGAN, which can better capture complex patterns in data. Overall, the research design had strong capabilities in identifying fraudulent transactions. This is because the VAE in the VAE-WGAN XGBoost model can learn the potential distribution of data, capture the feature differences between normal and fraudulent transactions, enhance the sensitivity of the model to fraudulent behavior, and thus reduce false positives and false negatives. In addition, the adversarial training of the generator and discriminator in WGAN also enables the model to better distinguish between different categories, reducing false positives and false negatives.

Fig. 14. Comparison of FNR and FAR of different models.

To further validate the financial fraud recognition performance of the VAE-WGAN-XGBoost model, a credit card fraud detection dataset was used, which contains approximately 284,000 credit card transaction records with 30 features per record, where 0 represents a normal transaction and 1 a fraudulent transaction. The accuracy, recall, MAE, FNR, and FAR of the VAE-WGAN-XGBoost model were compared with those of the other three models. The test results of the four models on the credit card fraud detection dataset are shown in Table 2. From Table 2, it can be seen that the VAE-WGAN-XGBoost model still achieves good financial fraud recognition performance on the credit card fraud detection dataset, with the highest accuracy and recall, at 0.966 and 0.963, respectively. Its MAE, FNR, and FAR values are the lowest, at 0.236, 0.037, and 0.045, respectively. The results indicate that the VAE-WGAN-XGBoost model performs well in identifying financial fraud and has good applicability in the field of bank fraud identification.

Table 2.

Test results of four models in credit card fraud detection dataset.

Index WGAN-XGBoost XGBoost Logistic VAE-WGAN-XGBoost
Accuracy 0.921 0.901 0.876 0.966
Recall 0.914 0.896 0.861 0.963
MAE 0.404 0.412 0.393 0.236
FNR 0.085 0.103 0.138 0.037
FAR 0.072 0.098 0.125 0.045

Conclusion

Bank data protection and fraud identification are important links in ensuring the stable operation of banks and the security of customer funds. To guarantee the privacy of bank data while improving the accuracy of fraud detection, an adaptive bank data protection architecture based on FL and DP was designed, and transaction fraud identification was completed based on VAE-WGAN processing of the imbalanced datasets. The experiment outcomes showed that the loss function curve of the adaptive FL-DP data protection architecture converged the fastest and had the smallest convergence value. Under a low privacy budget, its accuracy had a significant advantage. During attacks, this method demonstrated good robustness, and its probability of privacy leakage was low, demonstrating good privacy protection performance. In the process of data protection, the adaptive FL-DP architecture reduced the re-identification risk the fastest and lowered the risk of information leakage. The accuracy of the VAE-WGAN-XGBoost fraud recognition architecture was 0.969, the recall was 0.963, the F1 score was 0.983, and the AUC value was as high as 0.988, indicating the highest recognition accuracy. Its FNR and FAR were the lowest, so it can accurately identify fraudulent behavior. The research provides new ideas and methods for financial technology innovation, further improving the efficiency and accuracy of fraud detection. However, the integrated model for fraud identification can be further optimized to enhance the data and transaction security of banks.

Author contributions

Y.J.Z. processed the numerical attribute linear programming of communication big data, and the mutual information feature quantity of communication big data numerical attribute was extracted by the cloud extended distributed feature fitting method. Y.J.Z. combined with fuzzy C-means clustering and linear regression analysis, the statistical analysis of big data numerical attribute feature information was carried out, and the associated attribute sample set of communication big data numerical attribute cloud grid distribution was constructed. Y.J.Z. did the experiments, recorded data, and created manuscripts. All authors read and approved the final manuscript.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Rani, G. S., Jayan, S. & Alatas, B. Analysis of chaotic maps for global optimization and a hybrid chaotic pattern search algorithm for optimizing the reliability of a bank. IEEE Access 11(3), 24497–24510. 10.1109/ACCESS.2023.3253512 (2023).
  • 2. Zheng, T., Chen, J., Zhang, Z., Gong, Z. & Chen, Y. Bank credit score card selection and threshold determination based on quantum annealing algorithm and genetic algorithm. 2023 IEEE Int. Conf. Power Intell. Comput. Syst. (ICPICS). 10.1109/ICPICS58376.2023.10235447 (2023).
  • 3. Hajiabbasi, M., Akhtarkavan, E. & Majidi, B. Cyber-physical customer management for internet of robotic things-enabled banking. IEEE Access 11, 34062–34079. 10.1109/ACCESS.2023.3263859 (2023).
  • 4. Singhal, M. & Shinghal, K. Secure deep multimodal biometric authentication using online signature and face features fusion. Multimed. Tools Appl. 83(10), 30981–31000. 10.1007/s11042-023-16683-1 (2024).
  • 5. Maithili, K. et al. Development of an efficient machine learning algorithm for reliable credit card fraud identification and protection systems. MATEC Web Conf. 392(3), 1116–1123. 10.1051/matecconf/202439201116 (2024).
  • 6. Chiu, Y. C. et al. A CMOS-integrated spintronic compute-in-memory macro for secure AI edge devices. Nat. Electron. 6(7), 534–543. 10.1038/s41928-023-00994-0 (2023).
  • 7. Qashlan, A., Nanda, P. & Mohanty, M. Differential privacy model for blockchain based smart home architecture. Future Gener. Comp. Syst. 150(1), 49–63. 10.1016/j.future.2023.08.010 (2023).
  • 8. Selvarajan, S. et al. An artificial intelligence lightweight blockchain security model for security and privacy in IIoT systems. J. Cloud Comput. Adv. Syst. Appl. 12(1), 38–54. 10.1186/s13677-023-00412-y (2023).
  • 9. Li, M. et al. Power normalized cepstral robust features of deep neural networks in a cloud computing data privacy protection scheme. Neurocomputing 518(1), 165–173. 10.1016/j.neucom.2022.11.001 (2023).
  • 10. Gupta, R., Gupta, I., Saxena, D. & Singh, A. K. A differential approach and deep neural network based data privacy-preserving model in cloud environment. J. Amb. Intell. Hum. Comp. 14(5), 4659–4674. 10.1007/s12652-022-04367-x (2023).
  • 11. Zhang, X., Kang, Y., Chen, K., Fan, L. & Yang, Q. Trading off privacy, utility, and efficiency in federated learning. ACM Trans. Intell. Syst. Technol. 14(6), 1–32. 10.1145/3595185 (2023).
  • 12. Alsuqaih, H. N., Hamdan, W., Elmessiry, H. & Abulkasim, H. An efficient privacy-preserving control mechanism based on blockchain for E-health applications. Alex. Eng. J. 73(6), 159–172. 10.1016/j.aej.2023.04.037 (2023).
  • 13. Chatterjee, P., Das, D. & Rawat, D. B. Digital twin for credit card fraud detection: Opportunities, challenges, and fraud detection advancements. Future Gener. Comp. Syst. 73(8), 410–426. 10.1016/j.future.2024.04.057 (2024).
  • 14. Karthikeyan, T., Govindarajan, M. & Vijayakumar, V. Enhancing financial fraud detection through chimp-optimized long short-term memory networks. Trait. Signal. 41(2), 835–845. 10.18280/ts.410224 (2024).
  • 15. Putrada, A. G. & Ramadhan, N. G. MDIASE-autoencoder: A novel anomaly detection method for increasing the performance of credit card fraud detection models. IEEE 29, 1–6. 10.1109/ICT60153.2023.10374051 (2023).
  • 16. Roy, N. C. & Prabhakaran, S. Insider employee-led cyber fraud (IECF) in Indian banks: From identification to sustainable mitigation planning. Behav. Inform. Technol. 43(5), 876–906. 10.1080/0144929x.2023.2191748 (2024).
  • 17. Zhang, J., Lu, J. & Tang, X. Two-stage GNN-based fraud detection with camouflage identification and enhanced semantics aggregation. Neurocomputing 570(2), 127108–127118. 10.1016/j.neucom.2023.127108 (2024).
  • 18. Rodríguez, E., Otero, B. & Canal, R. A survey of machine and deep learning methods for privacy protection in the internet of things. Sensor Basel 23(3), 1252–1275. 10.3390/s23031252 (2023).
  • 19. Ratnayake, H., Chen, L. & Ding, X. A review of federated learning: Taxonomy, privacy and future directions. J. Intell. Inf. Syst. 61(3), 923–949. 10.1007/s10844-023-00797-x (2023).
  • 20. Zubaydi, H. D., Varga, P. & Molná, S. Leveraging blockchain technology for ensuring security and privacy aspects in internet of things: A systematic literature review. Sensor Basel 23(3), 788–830. 10.3390/s23020788 (2023).
  • 21. Wu, Z. et al. A confusion method for the protection of user topic privacy in Chinese keyword-based book retrieval. ACM Trans. Asian Low Resour. Lang. Inf. Process. 22(5), 1–19. 10.1145/3571731 (2023).
  • 22. Zhao, Y. & Chen, J. Vector-indistinguishability: Location dependency based privacy protection for successive location data. IEEE Trans. Comput. 73(4), 970–979. 10.1109/TC.2023.3236900 (2023).
  • 23. Wang, M., Liang, H., Pan, Y. & Xie, X. A new privacy preservation mechanism and a gain iterative disturbance observer for multiagent systems. IEEE Trans. Netw. Sci. Eng. 11(1), 392–403. 10.1109/TNSE.2023.3299614 (2024).
  • 24. Haque, M. E. & Tozal, M. E. Identification of fraudulent healthcare claims using fuzzy bipartite knowledge graphs. IEEE Trans. Comput. 16(6), 3931–3945. 10.1109/TSC.2023.3296782 (2023).
  • 25. Cheng, F., Yan, C., Liu, W. & Lin, X. Research on medical insurance anti-gang fraud model based on the knowledge graph. Eng. Appl. Artif. Intell. 134(8), 108627–108644. 10.1016/j.engappai.2024.108627 (2024).
  • 26. Luo, N. et al. Fuzzy logic and neural network-based risk assessment model for import and export enterprises: A review. J. Data Sci. Intell. Syst. 1(1), 2–11. 10.47852/bonviewJDSIS32021078 (2023).
  • 27. Dash, A., Ye, J. & Wang, G. A review of generative adversarial networks (GANs) and its applications in a wide variety of disciplines: From medical to remote sensing. IEEE Access 12, 18330–18357. 10.1109/ACCESS.2023.3346273 (2024).
  • 28. Lin, H., Liu, Y., Li, S. & Qu, X. How generative adversarial networks promote the development of intelligent transportation systems: A survey. CAA J. Automat. Sin. 10(9), 1781–1796. 10.1109/JAS.2023.123744 (2023).
  • 29. Ding, H., Sun, Y., Huang, N., Shen, Z. & Cui, X. TMG-GAN: Generative adversarial networks-based imbalanced learning for network intrusion detection. IEEE Trans. Inf. Foren. Sec. 19, 1156–1167. 10.1109/TIFS.2023.3331240 (2024).
  • 30. Guha, D., Chatterjee, R. & Sikdar, B. Anomaly detection using LSTM-based variational autoencoder in unsupervised data in power grid. IEEE Syst. J. 17(3), 4313–4323. 10.1109/JSYST.2023.3266554 (2023).
