Abstract
Healthcare insurance fraud imposes a significant financial burden on healthcare systems worldwide, with annual losses reaching billions of dollars. This study aims to improve fraud detection accuracy using machine learning techniques. Our approach consists of three key stages: data preprocessing, model training and integration, and result analysis with feature interpretation. Initially, we examined the dataset’s characteristics and employed embedded and permutation methods to test the performance and runtime of single models under different feature sets, selecting the minimal number of features that could still achieve high performance. We then applied ensemble techniques, including Voting, Weighted, and Stacking methods, to combine different models and compare their performances. Feature interpretation was achieved through partial dependence plots (PDP), SHAP, and LIME, allowing us to understand each feature’s impact on the predictions. Finally, we benchmarked our approach against existing studies to evaluate its advantages and limitations. The findings demonstrate improved fraud detection accuracy and offer insights into the interpretability of machine learning models in this context.
Keywords: Healthcare insurance fraud, Machine learning, Model ensemble, Model interpretability
Subject terms: Health care economics, Health services
Introduction
Healthcare insurance fraud represents a significant financial drain on healthcare systems worldwide, estimated to cost billions annually1. Healthcare insurance fraud encompasses a range of illicit activities, including billing for services not rendered, inflating the costs of actual services, prescribing unnecessary treatments, acquiring large quantities of drugs for resale, and hospitals falsifying claims to illicitly obtain insurance money. Detecting and preventing such fraud is crucial not only for the sustainability of healthcare financing but also for maintaining the integrity of patient care.
Machine learning encompasses a variety of techniques, such as supervised learning and reinforcement learning. A range of models can be applied, including CatBoost, XGBoost, and LightGBM, as well as random forests, support vector machines, and neural networks. Through feature engineering and feature selection, as well as the integration of multiple models, detection efficiency and accuracy can be improved.
However, implementing these technologies comes with its own set of challenges. For instance, the vast volume of healthcare data makes it difficult to identify the features most relevant to fraudulent activities. An excessive number of features can negatively impact model performance by increasing training time and potentially introducing noise into the output2. Additionally, many machine learning models function as “black boxes,” making it challenging to interpret how individual features influence the results. This lack of transparency poses a risk when making decisions in financially sensitive areas3.
Despite advancements in healthcare insurance fraud detection, several gaps persist in current research. Many existing approaches rely on data that involves patient-doctor relationships4 or personal identifiers, which are difficult to obtain due to privacy concerns. As a result, these approaches often struggle to access crucial features. Furthermore, some studies fail to explain how model features contribute to the final outputs, leading to poor interpretability and a lack of transparency. Additionally, model performance could be enhanced through more effective feature selection and ensemble techniques. These challenges highlight the need for solutions that balance performance, interpretability, and respect for data privacy, while also ensuring computational efficiency.
This paper aims to address these challenges by investigating the application of machine learning techniques. We focus on training effective models using CatBoost, XGBoost, LightGBM, and Random Forest, and propose the use of Sequential Forward Selection (SFS) to optimize feature selection. This approach aims to reduce computational resource consumption, lower feature dimensionality, and enhance model performance. Furthermore, we leverage interpretability tools such as Partial Dependence Plots (PDP), SHAP, and LIME to explain feature contributions to the model’s predictions, offering trustworthy insights into the data and providing interpretable explanations for the results.
The main contributions of this paper are listed as follows:
Achieve high accuracy and performance in healthcare insurance fraud detection.
Effectively select relevant features and reduce feature dimensionality.
Enhance model performance through model ensembling.
Increase model trustworthiness and interpretability by utilizing various feature interpretation tools.
The remainder of this paper is structured as follows: Section Related works reviews related work, while Section Material and method provides an in-depth analysis of the dataset and describes the model training process, including hyperparameter optimization, feature selection, and model ensemble. In Section Result and explanation, we compare the performance of several models and interpret the contribution of different features to the model’s predictive outcomes. In Section Evaluation of merits and limitations, we discuss the strengths and limitations of our work and compare it with several recent studies. Finally, Section Conclusion and future work summarizes the findings and outlines directions for future research.
Related works
Indeed, healthcare fraud detection is a broad topic. Due to the varying legal frameworks across different regions and countries, the focus of healthcare fraud identification projects also varies. For instance, I. Matloob et al.5 summarized that exploring solutions for healthcare fraud detection can start from various perspectives such as clinical medical processes or the correlation between diseases and medications, healthcare financial records, and services received by patients. They also mentioned that machine learning and statistical approaches are increasingly being applied to healthcare fraud detection, representing a major trend in current medical research.
In the domain of healthcare record information detection, machine learning and data mining methods like XGBoost, LightGBM, Random Forest (RF), Convolutional Neural Networks (CNNs), and models related to one-class learning are commonly used. Chen et al.6 utilized CNNs and Graph Convolutional Networks (GCNs) to model and integrate the patient-doctor relationship network for effective inference. Li et al.7 focused on healthcare insurance text records, utilizing various methods from natural language processing and information retrieval, such as Term Frequency-Inverse Document Frequency (TF-IDF). They employed the TextRank-based keyword extraction algorithm to achieve keyword extraction and automatic review of medical record texts. Zhou et al.8 addressed collective medical fraud cases by leveraging graph theory knowledge, constructing a visitation network to depict spatiotemporal relationships among patients, and employing Louvain, a community detection algorithm based on modularity optimization, to unearth suspicious groups. Similarly, Y. Yoo et al.9 also opted to construct graph structures and utilize GNN algorithms for model training. In addition, they considered using traditional machine learning methods to extract relationships from the dataset and create bipartite graphs as input features. Hancock et al.10 creatively utilized integrated supervised feature selection techniques to analyze medical records and generate feature sets, leveraging these feature sets to interpret model results. They examined the performance and improvements of typical models in past Medicare fraud detection within the two major classes of algorithms, Gradient Boosted Decision Trees (GBDT) and Bagging. They proposed, within the context of Medicare fraud detection, the adoption of an integrated ranking of feature selection importance functions across multiple machine learning algorithms and the utilization of median ranking, thereby achieving feature interpretability.
Most of the research perspectives on healthcare fraud detection discussed earlier are essentially in line with those explored by I. Matloob et al.5, and these studies utilize various methods from the field of machine learning, such as neural networks, natural language processing (NLP), and GBDT algorithms. Many studies focus on analyzing nodes in the healthcare process, such as diagnoses, medications, and doctor-patient relationships, using graph theory to construct graph relationships and training neural networks. The medical encounter behavior itself reflects the relationship between patients and doctors, which can naturally be represented using graph structures. However, medical data have dynamic characteristics, which means that graph structures also need to be dynamic. Currently, dynamic networks have not been applied to the fraud detection industry. Furthermore, in the research field of healthcare fraud detection, most supervised learning methods require a large amount of training data to achieve good predictive performance, thus being constrained by data costs. At the same time, the interpretability of machine learning models for fraud detection is insufficient, and the application of multi-class method ensemble supervision for feature selection is less common.
Machine learning models have been applied in various fields. Harikumar Pallathadka et al. found that in the field of education, machine learning has been used to analyze and predict students’ performance, showing that academic outcomes can be anticipated based on previous academic records11. Additionally, S.K. Towfek et al. conducted a student survey and analyzed the results using various machine learning algorithms. Their findings highlighted the factors influencing students’ choices of AI tools in academic learning12, and they also assessed the strengths and weaknesses of different machine learning models.
Some researchers have proposed highly efficient model optimization algorithms, designed not only for healthcare insurance fraud detection but also for other knowledge discovery projects13. For example, El-Sayed M. El-Kenawy et al. proposed an optimization algorithm called Greylag Goose Optimization (GGO), inspired by biological evolution14. Additionally, Benyamin Abdollahzadeh developed a metaheuristic algorithm known as the Puma Optimizer (PO), which draws inspiration from the hunting intelligence of pumas15. In the paper, it was noted that the Puma Optimizer (PO) outperformed other algorithms on most clustering datasets, demonstrating its superior effectiveness in solving such tasks. This indicates the robustness and adaptability of PO in handling complex clustering problems across various domains. Moreover, recent studies on algebraic and graphical structures16,17, such as bipartite graphs, Latin squares, and isotopes of inverse property quasigroups, have shown promise for enhancing problem-solving capabilities and data interaction modeling in machine learning18,19,20.
Material and method
The overall workflow of the project is illustrated in Fig. 1, which consists of five key steps. We begin with Data Exploration, where we thoroughly investigate the dataset to understand its characteristics. Following this, Feature Engineering is conducted to select and construct the most relevant features that will improve the model’s predictive performance and efficiency. The third step involves Model Building, where various models are trained and optimized. In the Model Evaluation and Comparison phase, we assess the performance of these models, comparing them based on accuracy, precision, and other relevant metrics. Finally, in the Feature Explanation step, we interpret the models’ predictions to understand the contribution of each feature, enhancing the transparency and trustworthiness of the model’s decisions.
Figure 1.
Workflow of this project.
Dataset
With the development of big data technology, healthcare insurance reimbursement data from the Chinese Healthcare Security Administration and well-established electronic medical record databases provide data for the analysis of healthcare insurance fraud. We used a publicly available dataset containing 16,000 records. In the dataset, patients’ privacy-sensitive data, such as personal identification data, has been anonymized. Each reimbursement record has been manually labeled to indicate whether it is fraudulent.
In this dataset, each record contains 81 features and a corresponding outcome. Records confirmed as fraudulent are marked as 1, while normal records are marked as 0. The fraud rate in this dataset is 5%, which indicates the dataset is highly imbalanced, with a significant disparity between positive and negative samples.
Data overview
Exploratory data analysis
Using graphical representations such as box plots, scatter plots, and bar charts can intuitively display the distribution of data, enhancing our understanding of the dataset. Below, we introduce some of the data visualization charts used in this study. Figure 2 shows the heatmap of feature correlation matrix, where darker colors indicate higher correlation values between features.
Figure 2.
Correlation matrix of all features.
Figure 3 shows the box plots of “MAX Monthly Visit Days” and “MAX Monthly Visit Times”. From the figure, we can observe that these two features have a significant number of outliers. These outliers need to be removed or binned in later stages to reduce data sensitivity and prevent overfitting.
Figure 3.
Box plots of features with a high number of outliers.
A Kernel Density Plot displays the distribution of data using a single continuous curve. Compared to a histogram, the Kernel Density Plot is not affected by the number of bins. Figure 4 shows the Kernel Density Plot for four features. This plot illustrates the distribution of values and highlights abnormal values. For instance, the feature “Total Drug Expense Amount” is generally distributed around 20,000 yuan per month; however, there are abnormal values reaching up to 100,000 yuan or more. These unusual values should be noted, as they have a higher potential for indicating fraudulent activity.
Figure 4.
Comparison between kernel density plots and histograms.
Figure 5 clearly shows the relationship between the number of visits and fraudulent activity. Within a specific time range, some patients visit hospitals one hundred times or more, indicating a high risk of involvement in fraudulent activities.
Figure 5.
Scatter plot of feature “Total Number of Visit”. The scatter plot illustrates the relationship between the “Total Number of Visits” feature and fraudulent activities. Each point represents a sample in the dataset, with colors indicating whether it is a normal case or a fraud case.
Ethical considerations
The dataset used in this study is publicly accessible, ensuring transparency and replicability of the research. However, to protect the privacy and confidentiality of individuals, all sensitive information has been anonymized. Specifically, personally identifiable details such as patient visit dates, names, genders, birth dates, identification numbers, disease names, diagnoses, and hospital names have been de-identified. This ensures that no personal data can be traced back to any individual, in compliance with applicable privacy regulations and ethical guidelines. The de-identification process aligns with ethical standards to prevent misuse of personal information while still allowing the dataset to be used for healthcare fraud detection research.
Data preprocessing
Missing values
The dataset was analyzed for missing values. It was found that the “Discharge Diagnosis” field contains 355 missing values. Due to privacy considerations, this field represents the length of the discharge diagnosis text rather than the raw text itself, as the original diagnostic descriptions are unavailable. For this field, we applied median imputation, as the median is a robust measure that minimizes the effect of outliers. Additionally, subsequent analyses found that the length of the “Discharge Diagnosis” text had a minimal correlation with fraudulent activity, suggesting limited predictive power. This finding informed our decision to minimize the emphasis on this feature in our model development. Notably, no other missing values were detected in the dataset.
Outlier analysis
During the outlier detection process, we identified several data points that deviated significantly from the typical range. However, after further examination, we concluded that these outliers were not random errors or noise but represented legitimate cases strongly associated with healthcare fraud. Removing these outliers would have led to a loss of valuable information relevant to fraud detection. Therefore, these outliers were retained in the dataset to maintain the integrity of the analysis and ensure that important patterns were captured by the model.
We performed consistency checks on the ALL_SUM variable in relation to Total Drug Expense Amount, Total Treatment Expense Amount, and five additional Expense Amount variables. As shown in Fig. 6, across all 16,000 entries, ALL_SUM precisely matched the sum of these expense components, indicating that the data collection process was highly accurate. This consistent relationship supports the reliability of the dataset and reinforces the validity of subsequent analyses.
Figure 6.
Relationship between ALL_SUM and other features.
Data binning
The features “total personal account amount” and “total traditional Chinese medicine expense amount” display high positive skewness (4.14 and 6.32, respectively), indicating that most values are clustered at the lower end of the distribution, with a few extreme high values. Such skewness suggests that while most patients have relatively low values in these categories, a small number of patients exhibit unusually high expenses. Additionally, considering the meaning of each feature, “total traditional Chinese medicine expense amount” captures specific healthcare expenses that are utilized by only a subset of patients, while “total personal account amount” varies significantly based on individual financial circumstances.
This skewed distribution and individual difference can introduce variability into the model, leading it to overemphasize rare, high-expense cases. We employ the supervised chi-square binning method to bin these two features. To analyze the influence of raw and binned data, the project incorporates binned data into the feature matrix.
Feature derivation
Feature derivation primarily refers to the creation of new features from existing data. Generally, there are two main methods for feature derivation. The first method involves manual feature synthesis through in-depth analysis of the data background and business context. Features created using this method often possess strong business relevance and interpretability, and can significantly enhance the model’s performance. However, this method is time-consuming and requires manual analysis and selection. The second method bypasses the business context and uses some straightforward engineering techniques to generate features in bulk. Useful features are then selected from this large pool of generated features for modeling. While this method is simple and efficient, it can result in the creation of too many derived features, some of which may not be effective. Common feature derivation methods include univariate, bivariate, group statistics, and polynomial features.
In this project, arithmetic operations among multiple variables are primarily used for feature derivation. For example, by calculating the sum of self-payment drug expense, self-payment treatment expense, and self-payment surgery expense, we derived a new feature called total self-payment expense.
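As an illustration, this kind of derivation can be expressed in a few lines of pandas; the column names below are hypothetical stand-ins for the dataset's actual fields, and `df` denotes the claims DataFrame.

```python
import pandas as pd

# Illustrative sketch: derive a "total self-payment expense" feature by summing
# three existing expense columns (placeholder names, not the real field names).
self_pay_cols = [
    "self_payment_drug_expense",
    "self_payment_treatment_expense",
    "self_payment_surgery_expense",
]
df["total_self_payment_expense"] = df[self_pay_cols].sum(axis=1)
```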
Model training
Baseline models
We utilize CatBoost, XGBoost, LightGBM, and Random Forest to train our base models. In these base models, all features are used as input. CatBoost employs ordered boosting and native support for categorical features, mitigating the need for extensive preprocessing and reducing the risk of target leakage. XGBoost, known for its efficiency and regularization techniques, builds trees sequentially to prevent overfitting while maintaining high performance. LightGBM improves computational efficiency by utilizing a leaf-wise growth strategy and Exclusive Feature Bundling, allowing it to handle large datasets with many sparse features effectively. Random Forest, an ensemble method, constructs multiple independent decision trees and averages their predictions, thus providing robustness to overfitting, especially when dealing with high-dimensional data. Together, these models enable flexible and effective learning from a wide range of features in complex datasets. Table 1 summarizes several ensemble models, highlighting their ensemble methods, prediction processing, and feature selection techniques. These models employ different strategies, such as stacking, weighted, and voting, along with feature selection methods like embedded and permutation-based approaches. Superscripts in the model names indicate specific strategies, such as selecting top N features and using unions or intersections of features. This table helps compare and analyze the characteristics and performance of each model across various scenarios.
Table 1.
Symbols and explanations for single models and ensemble models used in this paper.
| Model notation | Ensemble method | Prediction strategy | Feature selection method |
|---|---|---|---|
| ![]() | Stacking | Union Top5 | Embedded |
| ![]() | Weighted | Union Top5 | Embedded |
| ![]() | Voting | Union Top5 | Embedded |
| ![]() | None | All features | All features |
| ![]() | Stacking | Union Top10 | Embedded |
| ![]() | None | All features | All features |
| ![]() | Voting | Union Top10 | Embedded |
| ![]() | None | All features | All features |
| ![]() | Stacking | Union Top10 | Permutation |
| ![]() | Stacking | Union Top10 | Permutation |
| ![]() | Voting | Intersect Top15 | Embedded |
| ![]() | Weighted | Union Top10 | Permutation |
| ![]() | Voting | Union Top3 | Embedded |
| ![]() | None | Union Top5 | Embedded |
| ![]() | None | Union Top10 | Permutation |
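The sketch below shows how the four baseline learners might be trained and compared on a held-out split; it is a minimal illustration rather than the exact configuration used in this study, and `X`/`y` denote the prepared feature matrix and fraud label.

```python
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X (all features) and y (fraud label, 0/1) are assumed to be prepared earlier.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

baselines = {
    "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=42),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")
```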
Feature selection
This scheme primarily uses the variance threshold method and mutual information method for initial feature selection. To achieve more efficient feature selection, this scheme employs machine learning techniques to reduce the number of features, enhance model accuracy, and improve interpretability.
The variance threshold method is a fundamental technique for feature selection that aims to eliminate features with low variance, which are less likely to contribute meaningful information to the model. The process begins by calculating the variance of each feature in the dataset21. For a dataset $X \in \mathbb{R}^{n \times p}$, where n is the number of samples and p is the number of features, the variance of the j-th feature is calculated as:

$$\mathrm{Var}(X_j) = \frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2 \tag{1}$$
Once the variances are computed, a predetermined threshold $\tau$ is set, and a feature is selected if its variance exceeds $\tau$:

$$\mathrm{Var}(X_j) > \tau \tag{2}$$
In this project, when the threshold is set to 0, two features are removed, while setting the threshold to 0.05 filters out 11 features.
The mutual information method estimates the mutual information between each feature, denoted as $X$, and the outcome, $Y$. Mutual information quantifies the degree of dependence between two random variables; its value is zero if and only if the two variables are independent22,23. In our analysis, we identified 12 features that have a mutual information value of zero with respect to the outcome, $Y$. These features have virtually no impact on the result and should be removed to improve the model’s accuracy. For discrete distributions, mutual information is calculated as a double sum:

$$I(X;Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{(X,Y)}(x,y)\,\log\frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)} \tag{3}$$

In the case of continuous distributions, mutual information is calculated using a double integral:

$$I(X;Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p_{(X,Y)}(x,y)\,\log\frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)}\,dx\,dy \tag{4}$$
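A minimal sketch of these two filter steps with scikit-learn is shown below; the 0.05 threshold follows the setting described above, while the variable names (`X`, `y`) are assumptions.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

# Drop features whose variance falls below the chosen threshold.
selector = VarianceThreshold(threshold=0.05)
selector.fit(X)
low_variance = X.columns[~selector.get_support()]

# Drop features whose estimated mutual information with the label is zero.
mi = mutual_info_classif(X, y, random_state=42)
zero_mi = X.columns[np.isclose(mi, 0.0)]

X_filtered = X.drop(columns=set(low_variance) | set(zero_mi))
```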
To further refine feature selection, we employed an embedded method and the permutation method to assess feature importance. Subsequently, Sequential Forward Selection (SFS) was applied to enhance performance while minimizing the number of features. The dataset was split into training and validation sets in a 7:3 ratio, with the model trained on the training set and its performance evaluated on the validation set.
Embedded methods for feature selection are based on the principle of enhancing model performance within a predefined framework. Specifically, in embedded feature selection, the algorithm automatically identifies features most relevant to the target variable while penalizing those that contribute minimally to the model24,25. This approach effectively prevents overfitting and, to a certain extent, improves the interpretability of models. After filtering out 13 features, the CatBoost, XGBoost and LightGBM baseline models were trained using 71 features. After training the baseline models, we ranked the features by their importance. We applied Sequential Forward Selection (SFS) to select features, adding them to the feature set in order of their ranked importance, and monitoring the model’s performance metrics until no further improvement was observed. The model’s performance was evaluated using accuracy, average precision, AUC score, F1-score, and recall. Figure 7a shows how the model’s performance varies with the number of selected features. It is evident that the model’s overall performance increases as the number of selected features increases from one to five, after which the performance remains stable with slight fluctuations. The other three models exhibit a similar trend, with performance initially increasing and then remaining relatively unchanged. Overall, the training time increases as the number of features grows, although some irregular variations are observed.
Figure 7.
Variation of training time cost and model performance with the increase of feature number (embedded feature selection).
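A simplified sketch of the importance-ranked SFS loop is given below; it uses LightGBM for illustration and keeps a feature only if it improves validation AUC, which is a slight simplification of the stopping rule described above. The split variables are assumed from the 7:3 partition.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

# Rank features by the embedded (built-in) importance of a fitted model.
ranker = LGBMClassifier(random_state=42).fit(X_train, y_train)
ranked = X_train.columns[np.argsort(ranker.feature_importances_)[::-1]]

# Add features in ranked order; retrain and keep a feature only if AUC improves.
selected, best_auc = [], 0.0
for feat in ranked:
    candidate = selected + [feat]
    model = LGBMClassifier(random_state=42).fit(X_train[candidate], y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val[candidate])[:, 1])
    if auc > best_auc + 1e-4:
        selected, best_auc = candidate, auc
```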
Built-in feature importance methods are effective for selecting features, but they have certain limitations and issues. These models often rely on metrics such as the frequency of feature splits in the decision tree structure or the feature’s contribution to the overall model loss. This can result in overestimating the importance of features that are frequently used for splits but may not significantly contribute to the model’s predictive power26,27,28. Moreover, these methods can overlook feature interactions, leading to an underestimation of the importance of features that are significant only in the context of other features. In contrast, permutation importance provides a more robust alternative. This method assesses the importance of a feature by randomly shuffling its values and measuring the change in model performance29,30. The advantage of permutation importance is that it captures the actual contribution of a feature to the model’s predictive performance, taking into account interactions between features31,32. By disrupting the relationship between the feature and the target variable, we can observe how much the model’s performance deteriorates, providing a more accurate measure of the feature’s importance.
The implementation of permutation importance offers a different perspective on feature significance. Permutation importances were calculated for four individual models, and then performance was estimated as the number of selected features varied from one to 71. Figure 8 shows that in the CatBoost model, average precision significantly improved compared to models using embedded feature selection, while other performance scores did not change noticeably. Additionally, the LightGBM model showed improvements in recall and precision. Although these two models exhibited enhanced performance when features were ranked based on permutation importance, the XGBoost model’s performance remained unchanged.
Figure 8.
Variation of training time cost and model performance with the increase of feature number (permutation feature selection).
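For reference, permutation importance can be computed with scikit-learn as sketched below; the scoring metric and repeat count are illustrative choices, and `model`, `X_val`, and `y_val` are assumed from the earlier split.

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and measure the drop in average
# precision; larger drops indicate more important features.
result = permutation_importance(
    model, X_val, y_val,
    scoring="average_precision", n_repeats=10, random_state=42
)
perm_ranked = X_val.columns[result.importances_mean.argsort()[::-1]]
```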
Our project aims to accurately predict healthcare insurance fraud while using as few features as possible. When too few features are selected, the model cannot learn all the characteristics from the dataset, resulting in poor performance. On the other hand, selecting too many features consumes more computational resources and may potentially decrease the model’s accuracy. After evaluating two feature selection methods-the embedded method and the permutation importance method-we found that permutation importance generally performs better. Performance plots indicate that the increasing trend in performance scores mostly plateaus at the top 3, top 5, and top 10 features, so we selected four feature sets based on these thresholds using set operations.
Figure 9 demonstrates the process of feature selection using embedded methods. The feature selectors employed in this approach are CatBoost, XGBoost, LightGBM, and Random Forest. After training, each algorithm selects the top 3, 5, 10, and 15 features based on feature importance rankings. These feature sets are then combined through intersection or union operations to obtain the final feature set. We generated four feature sets: S1 represents the union of the top 3 features from each algorithm, S2 represents the union of the top 5 features, S3 represents the intersection of the top 15 features, and S4 represents the union of the top 10 features.
Figure 9.
Flowchart illustrating the steps of the feature selection process.
Model optimization
Bayesian optimization
Bayesian optimization is an efficient global optimization method primarily used for model tuning and parameter selection. It is based on Bayes’ theorem and leverages prior knowledge, along with observed data, to predict model performance. This prediction guides the search process, gradually narrowing the parameter space to identify the optimal parameters33. Bayesian optimization is particularly well-suited for optimizing expensive functions, such as model training and validation processes34.
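As an illustration, the sketch below tunes LightGBM with Optuna, whose TPE sampler is one practical Bayesian-style optimizer; the search space and trial budget are assumptions rather than the study's exact settings.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Illustrative hyperparameter ranges.
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 128),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    model = LGBMClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```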
Grid search
Grid search is a hyperparameter optimization method that exhaustively searches through a specified parameter grid to find the best parameter combinations. This method is straightforward and easy to implement, making it suitable for situations where the parameter space is not too large or when sufficient computational resources are available. This process begins by defining a range of values for each hyperparameter, forming a multi-dimensional grid space where each point represents a potential combination of parameters. Grid search then iterates through all combinations in the parameter grid, training a new model instance and evaluating its performance for each set of parameters35,36. Typically, cross-validation is used to improve the robustness and reliability of the evaluation by splitting the dataset into multiple subsets of training and validation sets37. Once all parameter combinations have been evaluated, the algorithm selects the combination that performs the best based on criteria such as accuracy, recall, or other relevant performance indicators on the validation set. Finally, the model is trained on the full dataset using the best parameter combination to build the final model.
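A minimal grid-search sketch with scikit-learn follows; the parameter grid is illustrative rather than the one used in this study.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid over a few XGBoost hyperparameters, scored by F1 with
# five-fold cross-validation.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid, scoring="f1", cv=5, n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```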
Model ensemble
We employ several methods to ensemble our models, including the use of set operations to construct feature sets aimed at enhancing model performance. In embedded feature selection, the “union top10” feature set comprises 21 features, while the “union top5” feature set includes 11. For permutation feature selection, the “union top10” contains 19 features, and the “union top5” includes 10. Due to model differences, the “inter top15” feature set comprises only 4 features, limiting the ensemble model’s ability to capture key information from the training data. These feature sets are utilized for training ensemble models and assessing performance.
To optimize model performance while maintaining interpretability, we selected voting, stacking, and weighted fusion as our primary ensemble methods. These approaches were chosen based on their ability to balance model accuracy with computational efficiency and interpretability, which is crucial for practical applications in healthcare fraud detection. While other methods, such as bagging, boosting, blending, and cascading, were considered, each presented certain limitations for our context. Bagging and boosting, for example, are known for high computational demands and complex hyperparameter tuning, which may hinder scalability. Blending and cascading, though potentially beneficial for predictive performance, introduce additional complexity that can obscure interpretability-an essential attribute for stakeholder trust. Therefore, by using voting, stacking, and weighted fusion, we were able to construct an ensemble that aligns with our study’s objectives of delivering both high performance and explainability, thus meeting the requirements of the healthcare domain.
Weighted method
The weighted method involves assigning different weights to the predictions of various models and then directly summing these weighted predictions for comparison with the true values. This can be described using the formula below, where $\hat{y}_i$ denotes the prediction of the i-th model and $w_i$ its weight:

$$\hat{y} = \sum_{i=1}^{k} w_i\,\hat{y}_i, \qquad \sum_{i=1}^{k} w_i = 1 \tag{5}$$
We selected CatBoost, XGBoost, and LightGBM as classifiers in the weighted ensemble model due to their unique strengths and complementary features. CatBoost excels in handling categorical variables efficiently, requiring minimal preprocessing, which complements the capabilities of XGBoost and LightGBM, as they typically need more preprocessing for such data. XGBoost provides stability and consistently high accuracy across diverse data distributions, making it a reliable foundational model for the ensemble. Meanwhile, LightGBM is optimized for computational efficiency, making it particularly effective for datasets with a large number of features or records38. The combination of these models ensures a balanced and synergistic ensemble, leveraging their individual advantages to achieve robust and accurate predictions, which validates the rationale behind this selection.
In the weighted ensemble model, we first combined three individual models (CatBoost, XGBoost, and LightGBM) and assigned them weights of 40%, 30%, and 30%, respectively, based on our previous experience. The feature set used for training was the union of the top five features. The resulting weighted model demonstrated improved performance compared to the individual models, achieving an AUC score of 0.9300.
To further refine the model weights, we determine them based on the performance metrics of the three models. The F1 score is a metric used to evaluate model performance by combining precision (the proportion of true positive predictions among all positive predictions) and recall (the proportion of true positive predictions among all actual positive cases). It is defined as the harmonic mean of precision and recall, balancing the trade-off between these two metrics39. Given the imbalance in our dataset, the F1 score is particularly suitable because it accounts for both false positives and false negatives, providing a comprehensive measure of performance. We use the F1 score to calculate model weights in the ensemble, ensuring the model’s contribution reflects its balanced effectiveness across precision and recall.
The weights were determined based on the F1-scores of each individual model. Specifically, the weight for each model, $w_i$, was calculated as follows:

$$w_i = \frac{F1_i}{\sum_{j=1}^{3} F1_j} \tag{6}$$

Here, $F1_i$ represents the F1-score of the i-th model, and $\sum_{j=1}^{3} F1_j$ is the sum of the F1-scores of all three models. This approach ensures that models with higher F1-scores contribute more significantly to the final ensemble, reflecting their relative performance. The resulting model outperforms the previously used model with a 3:3:4 weight ratio.
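The weighting scheme of Eq. (6) can be sketched as follows, assuming three fitted models and held-out validation data; the 0.5 decision threshold and the variable `X_new` (a batch of unseen claims) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

# Sketch of Eq. (6): weights proportional to each model's validation F1-score,
# followed by a weighted average of predicted fraud probabilities.
models = [cat_model, xgb_model, lgb_model]            # three fitted models (assumed)
f1s = np.array([f1_score(y_val, m.predict(X_val)) for m in models])
weights = f1s / f1s.sum()                             # Eq. (6)

# Weighted soft prediction for a batch of unseen claims X_new.
proba = sum(w * m.predict_proba(X_new)[:, 1] for w, m in zip(weights, models))
y_pred = (proba >= 0.5).astype(int)                   # illustrative 0.5 threshold
```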
Stacked generalization
Stacked generalization involves combining the outputs of multiple machine learning models to improve predictive performance. This method determines which model to trust for specific instances within a dataset, thereby leveraging the strengths of each model. The architecture of a stacking model consists of multiple base models and a meta-model40. The base models generate predictions, which are then used as input features for the meta-model. The meta-model, typically a higher-capacity learner, integrates these inputs to produce the final prediction41. This hierarchical structure allows the stacking model to capture complex relationships and interactions between the predictions of the base models, thereby enhancing the overall predictive capability and reducing the risk of overfitting.
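A minimal stacking sketch with scikit-learn is shown below; the logistic-regression meta-model and five-fold out-of-fold scheme are assumptions for illustration, and `selected_features` denotes one of the feature sets described earlier.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Three boosting models as base learners; their out-of-fold probabilities feed
# a logistic-regression meta-model.
stack = StackingClassifier(
    estimators=[
        ("cat", CatBoostClassifier(verbose=0, random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
        ("lgb", LGBMClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train[selected_features], y_train)
```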
Voting
Voting is a straightforward ensemble method in which multiple models are trained independently, and their predictions are aggregated to form a final decision. There are two main types of voting: hard voting and soft voting. In hard voting, each model votes for a class, and the class with the majority of votes is chosen as the final prediction42. In soft voting, the predicted probabilities for each class are averaged, and the class with the highest average probability is selected43. This approach leverages the diversity of the individual models.
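For comparison, a soft-voting sketch is given below; switching `voting="hard"` yields majority voting instead of probability averaging.

```python
from sklearn.ensemble import VotingClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Soft voting averages the predicted fraud probabilities of the base models.
vote = VotingClassifier(
    estimators=[
        ("cat", CatBoostClassifier(verbose=0, random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
        ("lgb", LGBMClassifier(random_state=42)),
    ],
    voting="soft",
)
vote.fit(X_train[selected_features], y_train)
```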
Result and explanation
Model performances
The dataset is highly imbalanced, meaning that a model could achieve 95% accuracy by always predicting the majority class (0). To objectively evaluate our model, we assess performance using average precision, recall, AUC score, and F1-score44. Table 2 presents the performance metrics of 10 models, including both base models and integrated models.
Table 2.
Performance metrics of the models.
As shown in Table 2, no single model dominates every metric: one model achieves the best AUC score, another has the highest recall, and a third exhibits the best average precision. Overall, the integrated models outperform the base models. In real-world applications, the selection of the optimal model depends on the specific requirements and priorities of the task. For instance, if the primary goal is to minimize false negatives-such as in scenarios where missing a fraudulent case could result in significant financial losses-recall should be prioritized, and the models that excel in recall would be preferred.
The Receiver Operating Characteristic (ROC) curve illustrates the trade-off between a model’s sensitivity and its specificity, and the area under it (AUC) summarizes overall performance; a higher AUC value indicates a better model. Figure 10 shows the two models that achieve the best AUC scores. Additionally, the PR curve is another useful method for visualizing model performance. The PR curve focuses on the relationship between a model’s precision and recall, which is particularly valuable for imbalanced datasets. The area under the PR curve is commonly referred to as average precision, calculated by integrating the precision-recall curve across different thresholds. Figure 11 shows that the best-performing model achieves an average precision score of 0.6115. As evidenced by the performance metrics in Table 2, our integrated models consistently achieve high scores across multiple evaluation metrics, further proving their effectiveness in detecting healthcare fraud.
Figure 10.
ROC curves for the nine models evaluated in this study.
Figure 11.
PR curve for the nine models evaluated in this study.
Feature interpretation
The interpretability of machine learning models is of paramount importance, particularly in medical insurance fraud detection, where decisions must be transparent and justifiable. Interpretability allows stakeholders to understand the reasoning behind model predictions, thereby facilitating trust and acceptance. Additionally, interpretability aids in diagnosing model errors and improving feature selection. Models lacking clear interpretability are often perceived as “black boxes,” limiting their applicability in sensitive and high-stakes domains.
In this particular approach, the base models incorporate 71 features, which significantly complicates the interpretability of the results. The high dimensionality and complexity of the feature space make it challenging to discern which features are driving the model’s decisions. After dimensionality reduction, we employ multiple toolkits to interpret the features of the optimized model. These toolkits help elucidate the contributions and interactions of the reduced set of features, enhancing the model’s interpretability and providing insights into the underlying data patterns.
PDP
A Partial Dependence Plot (PDP) is a graphical representation that illustrates the marginal effect of a feature on the predicted outcome of a machine learning model. It is used to assess whether the relationship between the feature and the target variable is linear, monotonic, or more complex45. The partial dependence function for regression is defined as follows:

$$\hat{f}_S(x_S) = E_{X_C}\!\left[\hat{f}(x_S, X_C)\right] = \int \hat{f}(x_S, x_C)\,d\mathbb{P}(x_C) \tag{7}$$
The variable $x_S$ represents the features for which the partial dependence function is plotted, while $X_C$ denotes the other features in the machine learning model $\hat{f}$. The set S typically contains only one or two features. The partial dependence function $\hat{f}_S$ is estimated using the Monte Carlo method, which involves calculating averages over the training data:

$$\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}\!\left(x_S, x_C^{(i)}\right) \tag{8}$$
The working principle of PDPs involves fixing the value of one or two features within a certain range while allowing the values of other features to vary naturally within the dataset. The model’s predictions are then averaged for each fixed value of the selected features. This process does not rely solely on the actual observed values but instead explores the effect of the feature by creating a grid of feature values over a specified range. Through this method, PDPs reveal how the model’s predictions change as the feature values change, graphically presenting these changes to intuitively display the specific impact of the features on the target variable.
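In practice, such plots can be produced directly from a fitted estimator, as in the hedged sketch below; the feature names are placeholders for the dataset's actual columns, and `model`/`X_val` are assumed from the earlier training step.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the fitted model on two illustrative features.
PartialDependenceDisplay.from_estimator(
    model, X_val,
    features=["max_reimbursement_amount_per_month", "max_visits_per_month"],
)
plt.show()
```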
Figures 12 and 13 respectively show the relationship between the features “max reimbursement amount per month” and “max number of visits per month” and the model output. The horizontal axis represents the feature value, while the vertical axis represents the marginal effect of the feature on the prediction. The shaded region around the blue line represents the confidence interval, providing an estimate of the uncertainty around the average effect. In Fig. 12, the positive slope of the blue line indicates a positive correlation between “max reimbursement amount per month” and the model output. Additionally, the confidence interval widens as the feature value increases, indicating greater uncertainty in the model’s predictions as drug expenses grow higher. This explanation aligns with common sense and experience, as there have been reports of scammers reimbursing large amounts of drugs and then selling them for profit. Additionally, some hospitals have been implicated in fraudulent activities by reimbursing false medical expenses and sharing the proceeds with healthcare insurance cardholders. Therefore, the total reimbursed amount is the most crucial feature for detecting fraud.
Figure 12.
PDP plot illustrating the effect of the feature “Max Reimbursement Amount per Month” on the predicted probability of fraud.
Figure 13.
PDP plot illustrating the effect of the feature “Max Number of Visits per Month” on the predicted probability of fraud.
Figure 13 shows a somewhat weak relationship between “max number of visits per month” and the model output, with a high degree of uncertainty. One possible explanation is that some patients may have valid reasons for frequent hospital visits, such as managing multiple diseases that require consultations in different hospital departments. This is particularly true for elderly individuals who are in poor health.
SHAP
SHAP is a model interpretation framework that explains the output of any machine learning model. Drawing inspiration from cooperative game theory, SHAP constructs an additive explanation model in which all features are treated as “contributors”46. For each prediction, the model generates a prediction value, and the SHAP value represents the contribution of each feature within that sample.
SHAP explains model predictions using Eq. 9, and adheres to three key properties (local accuracy, missingness, and consistency), as outlined in Eqs. 10, 11, and 12. In these equations, $\phi_j$ is the SHAP value attributed to feature j, $z' \in \{0,1\}^M$ is a simplified (coalition) feature vector, $x'$ is the simplified representation of the instance $x$, and M is the number of features.

$$g(z') = \phi_0 + \sum_{j=1}^{M}\phi_j z'_j \tag{9}$$

$$f(x) = g(x') = \phi_0 + \sum_{j=1}^{M}\phi_j x'_j \tag{10}$$

$$x'_j = 0 \;\Rightarrow\; \phi_j = 0 \tag{11}$$

$$f'_x(z') - f'_x(z' \setminus j) \ge f_x(z') - f_x(z' \setminus j)\;\;\forall z' \;\Rightarrow\; \phi_j(f', x) \ge \phi_j(f, x) \tag{12}$$
Assume that the i-th sample is $x_i$, the j-th feature of the i-th sample is $x_{ij}$, and the model’s prediction for this sample is $y_i$. The baseline for the entire model, typically the mean of the target variable across all samples, is denoted as $y_{\text{base}}$. The SHAP values satisfy the following equation:

$$y_i = y_{\text{base}} + f(x_{i1}) + f(x_{i2}) + \cdots + f(x_{ip}) \tag{13}$$
Here, $f(x_{ij})$ represents the SHAP value of $x_{ij}$. Intuitively, $f(x_{i1})$ reflects the contribution of the first feature in the i-th sample to the final prediction $y_i$. A feature has a positive effect on the prediction when $f(x_{ij})$ is greater than zero; conversely, if $f(x_{ij}) < 0$, it indicates a negative effect.
The SHAP force plot in Fig. 14 illustrates how specific features influence the model’s prediction for this instance, with an overall output of 2.14. Positive contributions from features such as “Total Reimbursement Amount” push the prediction towards a higher likelihood of fraud, as shown by the red bars. In contrast, a smaller set of features push the prediction downward, reducing the likelihood of fraud, as shown by the blue bars. This sample is classified as a fraudulent case, with the model predicting a probability of fraud at 0.8951 and a probability of no fraud at 0.1049.
Figure 14.
SHAP force plot illustrating the contributions of key features to the prediction of a fraudulent case.
The dot plot, also known as a beeswarm plot, illustrates the impact of each feature across individual samples. In this plot, each dot represents a sample, and the color bar indicates the level of impact. The horizontal axis displays whether the impact is positive or negative, along with the magnitude of impacts. As shown in Fig. 15, the feature “max reimbursement amount per month” mostly contributes positively to the outcome, while some blue dots indicate that this feature has a slight negative influence in certain samples. For another feature, “ALL_SUM,” only a few samples exhibit high impact, and this feature predominantly has a negative effect on the model’s output.
Figure 15.
Beeswarm plot of SHAP values for six important features.
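The SHAP plots discussed above can be reproduced along the following lines; `model` and `X_val` are assumed to be a fitted tree-based learner and the validation features.

```python
import shap

# TreeExplainer supports the gradient-boosting base learners used here.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
# Note: for some model types shap returns one array per class; if so, take the
# array for the positive (fraud) class before plotting.

shap.summary_plot(shap_values, X_val)  # beeswarm-style summary (cf. Fig. 15)
shap.force_plot(
    explainer.expected_value, shap_values[0, :], X_val.iloc[0, :], matplotlib=True
)  # single-sample force plot (cf. Fig. 14)
```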
LIME
LIME, which stands for Local Interpretable Model-agnostic Explanations, is a technique used to explain the predictions of machine learning models. It provides interpretable explanations for complex models where the decision-making process is not easily understood. Local explanations mean that LIME focuses on explaining individual predictions rather than the overall model47. LIME generates a simple, interpretable model, such as a linear model, around a specific instance of interest. This is valuable because it helps in understanding how the model behaves for a particular prediction.
LIME creates a new dataset that comprises perturbed samples along with the corresponding predictions generated by the black-box model. The resulting model is expected to provide a reliable local approximation of the predictions made by the machine learning model, although it is not required to serve as an accurate global approximation. This level of accuracy is often referred to as local fidelity48. Mathematically, local surrogate models with interpretability constraints can be expressed as follows:
$$\xi(x) = \operatorname*{arg\,min}_{g \in G}\; L\!\left(f, g, \pi_x\right) + \Omega(g) \tag{14}$$

Here, $f$ is the black-box model, $g$ is an interpretable surrogate drawn from the family $G$, $\pi_x$ is a proximity measure that weights perturbed samples by their closeness to the instance $x$, $L$ measures how poorly $g$ approximates $f$ in that neighborhood, and $\Omega(g)$ penalizes the complexity of the surrogate.
Figure 16 represents an instance where the case is not fraudulent; the model predicts a probability of 0.87 for the outcome being non-fraudulent (0) and 0.13 for the outcome being fraudulent (1). The feature “max reimbursement amount per month” contributes most significantly to this result, leading the model to classify the sample as non-fraudulent. In contrast, another instance illustrates how the model predicts a sample as fraudulent. In this case, the model predicts an 83% probability of fraud and a 17% probability of the case being normal. As shown in Fig. 17, features like “max reimbursement amount per month” and “max number of visits per month” indicate fraudulence. However, some features, such as “ALL_SUM,” have a negative impact on the model’s predictive outcome.
Figure 16.
LIME values of sample 1, a normal case.
Figure 17.
LIME values of sample 2, a fraudulent case.
LIME clearly demonstrates how each feature contributes to the result, and it can be used to investigate and analyze specific samples in cases of fraud.
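A brief sketch of generating such LIME explanations is shown below; the class names and the choice of six features are illustrative, and `model`, `X_train`, and `X_val` are assumed from earlier steps.

```python
from lime.lime_tabular import LimeTabularExplainer

# Build a tabular explainer on the training data and explain one validation row.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["normal", "fraud"],
    mode="classification",
)
exp = explainer.explain_instance(
    X_val.iloc[0].values, model.predict_proba, num_features=6
)
print(exp.as_list())  # feature-weight pairs for this single prediction
```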
Model application
For real-world applications in the healthcare industry, we propose deploying our model through a web-based user interface (UI). The model selected for implementation is our stacking-based ensemble, which consistently demonstrated the best performance compared to the other individual and ensemble models. Manually labeled data can be uploaded via the web UI to serve as training data, while the interface will display the total number of detected fraudulent cases. Users can click on specific cases to view detailed analysis, including suspicious features contributing to the fraud detection. Figure 18 illustrates our web UI. Additionally, integrating an application programming interface (API) is essential to ensure seamless integration with existing healthcare insurance systems for real-time fraud detection and decision support.
Figure 18.
Web user interface to apply our model.
The interpretability of our proposed model significantly enhances its practical application for healthcare insurers and auditors by providing clear and interpretable insights into the factors driving fraud detection. Specifically, the explanations generated by the model-such as the contribution of individual features to the final decision-are actionable, allowing stakeholders to focus on specific areas of concern (e.g., abnormal billing patterns or unusual service frequencies). These explanations are made comprehensible to non-technical users through intuitive visualizations within the user interface, which break down the influence of each feature in a manner that does not require technical expertise. By highlighting key risk factors in a transparent way, the model empowers insurers and auditors to make informed decisions, prioritize investigations, and implement preventative measures based on identified trends. This combination of interpretability and usability ensures that the model’s outputs are not only reliable but also easily integrated into everyday decision-making processes within the healthcare insurance industry.
Evaluation of merits and limitations
Merits
Our method achieves high accuracy and AUC scores in fraud detection while significantly reducing feature dimensionality. Additionally, we enhance model performance through ensemble techniques. Moreover, we thoroughly explain the contribution of each feature to the model’s predictive outcomes. This approach offers valuable insights into the patterns of healthcare insurance fraud, aiding in the investigation and analysis of fraudulent cases. Identifying the most relevant features to fraudulent activities also highlights which factors healthcare insurance authorities should prioritize in their monitoring efforts.
Potential limitation
Our feature selection method, based on Sequential Forward Selection (SFS), effectively reduces the dimensionality of the feature set but is computationally intensive. Each time a new feature is added, the model must be retrained and its performance reevaluated. For large-scale datasets, identifying the optimal feature set consumes the majority of the training time.
In real-world applications, the characteristics of fraudulent activities may evolve over time. Therefore, it is crucial to regularly evaluate the model’s accuracy and update it with new data. In our current approach, updating the model requires a complete retraining process, which can be time-consuming. In the future, we plan to explore the use of incremental learning techniques to enable dynamic updates without the need for full retraining.
Our model relies on manually labeled training data, which can be both time-consuming and costly to obtain. The quality of these labeled datasets has a significant impact on the model’s overall performance. Furthermore, the feature engineering process demands substantial domain expertise and industry-specific knowledge. Identifying an optimal subset of features that enhances model performance while minimizing redundancy remains a challenging task.
Comparison to related works
We replicated the healthcare insurance fraud detection approach proposed by Hancock et al.10 and compared it with our method. Hancock et al. utilized six machine learning models as rankers to determine the rank of features, using the median rank from these models to assign the final rank to each feature. While they employed an ensemble feature selection approach, it did not incorporate model ensemble techniques such as Voting or Stacking.
We applied Hancock et al.’s method to our dataset and compared their models with ours. To evaluate performance, we used accuracy, average precision, recall, F1-score, and AUC, applying five-fold cross-validation to minimize the randomness inherent in machine learning. Table 3 presents the performance comparison, where the first row corresponds to our model and the subsequent rows represent the models evaluated by Hancock et al.
Table 3.
Model comparison with other study.
To further evaluate the robustness of our proposed model, we conducted a comparative analysis with more complex models, including artificial neural networks (ANN). Nabrawi et al. introduced a neural network approach for healthcare insurance fraud detection, employing ANN to identify fraudulent activities49. Following a similar approach, we applied ANN on our dataset to benchmark performance. The results demonstrated that our Stacking-based machine learning model consistently outperformed ANN across key evaluation metrics. This suggests that, despite ANN’s capability for complex pattern recognition, our ensemble approach is better suited to the characteristics of healthcare fraud detection in this context.
Our ensemble model utilizes a total of 19 features. To ensure a fair comparison, we evaluated its performance against models in Hancock et al.’s study that also use 19 features. The results demonstrate that our model outperforms the approach proposed by Hancock et al. Specifically, our ensemble method, which integrates model stacking with an optimized feature selection process, consistently achieved higher scores across all metrics, including accuracy, precision, recall, F1-score, and AUC.
Conclusion and future work
We found that permutation importance is a more effective method for calculating feature importance scores compared to the embedded methods. Our project combined embedded methods, permutation importance, and ensemble techniques to select features efficiently. This comprehensive strategy not only improved the performance of our integrated models but also reduced feature dimensionality. Furthermore, reducing the number of features led to a decrease in the training time for these integrated models compared to the base models.
However, one limitation is that the precision of our models, while generally strong, could be further improved, particularly in detecting less apparent cases of fraud. We also observed variability in the training time as the number of features changed, which exhibited a degree of randomness that remains unexplained. This unpredictability in training time presents a challenge in achieving consistent model performance optimization.
In future research, extending our model beyond healthcare fraud detection to other domains of fraud prevention presents a valuable opportunity. Fraud detection models developed in healthcare settings can potentially be adapted to other sectors, such as financial services, insurance, and e-commerce, where the identification of anomalous behavior is equally critical. By generalizing the techniques applied in healthcare, such as feature selection and model tuning, these models could address fraud detection challenges in a wider variety of contexts.
Furthermore, the real-world deployment of such models introduces additional challenges that merit further exploration. Data privacy concerns, especially when dealing with sensitive healthcare information, must be addressed by integrating privacy-preserving techniques like differential privacy. Scalability issues also arise when implementing fraud detection in large-scale, real-time environments. Future research could explore the integration of machine learning models and real-time data streams to enhance real-time monitoring and fraud detection capabilities. Investigating methods to reduce computational demands, particularly in resource-constrained environments, could help make these models more accessible and efficient across different industries.
Implementing incremental learning would allow fraud detection systems to be updated continuously as new data emerges, facilitating real-time fraud detection and reducing the computational burden of retraining models from scratch50. This would be particularly beneficial in industries with rapidly changing fraud patterns, ensuring models remain effective over time.
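As a rough illustration of this direction, the sketch below uses scikit-learn's partial_fit interface with an SGDClassifier as a stand-in model. The `claim_batches()` generator is hypothetical, and incremental learning was not part of the experiments reported here.

```python
# Hedged sketch: incremental updates via partial_fit, a stand-in for the
# incremental-learning direction discussed above. The batch generator is a
# hypothetical source of newly arriving, already-encoded claims.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss", random_state=42)
classes = np.array([0, 1])                    # non-fraud / fraud labels

for X_batch, y_batch in claim_batches():      # hypothetical stream of new claims
    # Update the model on each batch without retraining from scratch.
    clf.partial_fit(X_batch, y_batch, classes=classes)
```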
Acknowledgements
This research was supported by the Natural Science Foundation of Xinjiang Uyghur Autonomous Region (2023D01C55), the Scientific Research Program of the Higher Education Institution of Xinjiang (XJEDU2023P127), the 2023 Teaching Research and Reform Program for Undergraduate Education in Autonomous Colleges and Universities (XJGXZHJG-202341), and the XMU Training Program of Innovation and Entrepreneurship for Undergraduates (2024X932). We would like to express our gratitude to the 15th China University Student Service Outsourcing Innovation and Entrepreneurship Competition for providing us with the research topic and datasets. We also extend our appreciation to the anonymous referees for their valuable comments and suggestions.
Author contributions
Shiming Lin and Gang Qiu contributed to the research technical route and methods for this paper, while Xiaofang Chen interpreted the medical insurance-related policies. Zeyu Wang was primarily responsible for conducting the experiments and writing the paper, and Yiwei Wu and Linke Jiang assisted with the relevant chart drawings. All authors reviewed the manuscript.
Data availability
The datasets analyzed in this study were provided by the 15th China University Student Service Outsourcing Innovation and Entrepreneurship Competition. They can be downloaded from https://github.com/ZeyuWang-cyber/ML-Healthcare-Fraud-Detection.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
The original online version of this Article was revised: The original version of the Article contained an error in the Author Contributions section. It now reads: “Shiming Lin and Gang Qiu contributed to the research technical route and methods for this paper, while Xiaofang Chen interpreted the medical insurance-related policies. Zeyu Wang was primarily responsible for conducting the experiments and writing the paper, and Yiwei Wu and Linke Jiang assisted with the relevant chart drawings. All authors reviewed the manuscript.”
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
5/13/2025
A Correction to this paper has been published: 10.1038/s41598-025-00687-y
Contributor Information
Shiming Lin, Email: xmulsm@xmu.edu.cn.
Gang Qiu, Email: 49041128@qq.com.
References
- 1.Al-Hashedi, K. G. & Magalingam, P. Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019. Comput. Sci. Rev.40, 100402. 10.1016/j.cosrev.2021.100402 (2021). [Google Scholar]
- 2.Htun, H. H., Biehl, M. & Petkov, N. Survey of feature selection and extraction techniques for stock market prediction. Financ. Innov.9(1), 26. 10.1186/s40854-022-00441-7 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Hu, T. et al. Crop yield prediction via explainable ai and interpretable machine learning: Dangers of black box models for evaluating climate change impacts on crop yield. Agric. For. Meteorol.336, 109458. 10.1016/j.agrformet.2023.109458 (2023). [Google Scholar]
- 4.Cui, H., Li, Q., Li, H., & Yan, Z. Healthcare fraud detection based on trustworthiness of doctors. In 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 74–81 (2016). 10.1109/TrustCom.2016.0048 . IEEE
- 5.Matloob, I., Khan, S. A., Rukaiya, R., Khattak, M. A. K. & Munir, A. A sequence mining-based novel architecture for detecting fraudulent transactions in healthcare systems. IEEE ACCESS10, 48447–48463. 10.1109/ACCESS.2022.3170888 (2022). [Google Scholar]
- 6.Chen, J., Hu, X., Yi, D., Alazab, M. & Li, J. A variational autoencoder-based relational model for cost-effective automatic medical fraud detection. IEEE Trans. Dependable Secure Comput.20(4), 3408–3420. 10.1109/TDSC.2022.3187973 (2023). [Google Scholar]
- 7.Li, W., Ye, P., Yu, K., Min, X. & Xie, W. An abnormal surgical record recognition model with keywords combination patterns based on TextRank for medical insurance fraud detection. Multimedia Tools Appl.82(20), 30949–30963. 10.1007/s11042-023-14529-4 (2023). [Google Scholar]
- 8.Hancock, J. T., Bauder, R. A., Wang, H. & Khoshgoftaar, T. M. Explainable machine learning models for medicare fraud detection. J. Big Data10(1), 154. 10.1186/s40537-023-00821-5 (2023). [Google Scholar]
- 9.Zhou, J. et al. FraudAuditor: A visual analytics approach for collusive fraud in health insurance. IEEE Trans. Visual. Comput. Gr.29(6), 2849–2861. 10.1109/TVCG.2023.3261910 (2023). [DOI] [PubMed] [Google Scholar]
- 10.Yoo, Y., Shin, J. & Kyeong, S. Medicare fraud detection using graph analysis: A comparative study of machine learning and graph neural networks. IEEE Access11, 88278–88294. 10.1109/ACCESS.2023.3305962 (2023). [Google Scholar]
- 11.Pallathadka, H., Wenda, A., Ramirez-Asís, E., Asís-López, M., Flores-Albornoz, J. & Phasinam, K. Classification and prediction of student performance data using various machine learning algorithms. Mater. Today Proc. 80, 3782–3785 (2023) 10.1016/j.matpr.2021.07.382
- 12.Towfek, S., Khodadadi, N., Abualigah, L. & Rizk, F. H. Ai in higher education: Insights from student surveys and predictive analytics using pso-guided woa and linear regression. J. Artif. Intell. Eng. Practice1(1), 1–17. 10.21608/jaiep.2024.354003 (2024). [Google Scholar]
- 13.El-Kenawy, E.-S.M., Rizk, F.H., Zaki, A.M., Mohamed, M.E., Ibrahim, A., Abdelhamid, A.A., Khodadadi, N., Almetwally, E.M. & Eid, M.M., et al. Football optimization algorithm (fboa): A novel metaheuristic inspired by team strategy dynamics. J. Artif. Intell. Metaheurist.1, 21–1 10.54216/JAIM.080103
- 14.El-Kenawy, E.-S.M. et al. Greylag goose optimization: nature-inspired optimization algorithm. Expert Syst. Appl.238, 122147. 10.1016/j.eswa.2023.122147 (2024). [Google Scholar]
- 15.Abdollahzadeh, B., Khodadadi, N., Barshandeh, S., Trojovskỳ, P., Gharehchopogh, F.S., El-kenawy, E.-S.M., Abualigah, L., & Mirjalili, S. Puma optimizer (po): A novel metaheuristic optimization algorithm and its application in machine learning. Clust. Comput., 1–49 (2024) 10.1007/s10586-023-04221-5
- 16.Nadeem, M., Siddique, I., Alam, M. A. & Ali, W. A new graphical representation of the old algebraic structure. J. Math.2023(1), 4333301. 10.1155/2023/4333301 (2023). [Google Scholar]
- 17.Nadeem, M. et al. A class of koszul algebra and some homological invariants through circulant matrices and cycles. J. Math.2022(1), 4450488. 10.1155/2022/4450488 (2022). [Google Scholar]
- 18.Zhang, X., Nadeem, M., Ahmad, S. & Siddiqui, M. K. On applications of bipartite graph associated with algebraic structures. Open Math.18(1), 57–66. 10.1515/math-2020-0003 (2020). [Google Scholar]
- 19.Hazzazi, M. M., Nadeem, M., Kamran, M., Naci Cangul, I. & Akhter, J. Holomorphism and edge labeling: An inner study of latin squares associated with antiautomorphic inverse property moufang quasigroups with applications. Complexity2024(1), 8575569. 10.1155/2024/8575569 (2024). [Google Scholar]
- 20.Nadeem, M., Ali, S. & Alam, M. A. Graphs connected to isotopes of inverse property quasigroups: A few applications. J. Appl. Math.2024(1), 6616243. 10.1155/2024/6616243 (2024). [Google Scholar]
- 21.Theng, D. & Bhoyar, K. K. Feature selection techniques for machine learning: a survey of more than two decades of research. Knowl. Inf. Syst.66(3), 1575–1637. 10.1007/s10115-023-02010-5 (2024). [Google Scholar]
- 22.Zhou, H., Wang, X. & Zhu, R. Feature selection based on mutual information with correlation coefficient. Appl. Intell.52(5), 5457–5474. 10.1007/s10489-021-02524-x (2022). [Google Scholar]
- 23.Gao, L. & Wu, W. Relevance assignation feature selection method based on mutual information for machine learning. Knowl.-Based Syst.209, 106439. 10.1016/j.knosys.2020.106439 (2020). [Google Scholar]
- 24.Li, J., Zhang, H., Zhao, J., Guo, X., Rihan, W., & Deng, G. Embedded feature selection and machine learning methods for flash flood susceptibility-mapping in the mainstream songhua river basin, china. Remote Sens.14(21) (2022) 10.3390/rs14215523
- 25.Hamla, H., & Ghanem, K. Comparative study of embedded feature selection methods on microarray data. In: Maglogiannis, I., Macintyre, J., Iliadis, L. (eds.) 17th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI). Artificial Intelligence Applications and Innovations, vol. AICT-627, pp. 69–77. Springer International Publishing, Hersonissos, Crete, Greece (2021). 10.1007/978-3-030-79150-6_6 . Part 2: AI in Biomedical Applications. https://inria.hal.science/hal-03287701
- 26.Saarela, M. & Jauhiainen, S. Comparison of feature importance measures as explanations for classification models. SN Appl. Sci.3(2), 272. 10.1007/s42452-021-04148-9 (2021). [Google Scholar]
- 27.Rengasamy, D. et al. Feature importance in machine learning models: A fuzzy information fusion approach. Neurocomputing511, 163–174. 10.1016/j.neucom.2022.09.053 (2022). [Google Scholar]
- 28.Muschalik, M., Fumagalli, F., Hammer, B., & Hüllermeier, E. Agnostic explanation of model change based on feature importance. KI - Künstliche Intelligenz 36 (2022) 10.1007/s13218-022-00766-6
- 29.Thakur, D. & Biswas, S. Permutation importance based modified guided regularized random forest in human activity recognition with smartphone. Eng. Appl. Artif. Intell.129, 107681. 10.1016/j.engappai.2023.107681 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Effrosynidis, D. & Arampatzis, A. An evaluation of feature selection methods for environmental data. Eco. Inform.61, 101224. 10.1016/j.ecoinf.2021.101224 (2021). [Google Scholar]
- 31.Rajbahadur, G. K., Wang, S., Oliva, G. A., Kamei, Y. & Hassan, A. E. The impact of feature importance methods on the interpretation of defect classifiers. IEEE Trans. Software Eng.48(7), 2245–2261. 10.1109/TSE.2021.3056941 (2022). [Google Scholar]
- 32.Qian, H., Wang, B., Yuan, M., Gao, S. & Song, Y. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Syst. Appl.190, 116202. 10.1016/j.eswa.2021.116202 (2022). [Google Scholar]
- 33.Victoria, A. H. & Maragatham, G. Automatic tuning of hyperparameters using Bayesian optimization. Evol. Syst.12(1), 217–223. 10.1007/s12530-020-09345-2 (2021). [Google Scholar]
- 34.Wang, X., Jin, Y., Schmitt, S., & Olhofer, M. Recent advances in Bayesian optimization. ACM Comput. Surv.55(13s) (2023) 10.1145/3582078
- 35.Belete, D. M. & Huchaiah, M. D. Grid search in hyperparameter optimization of machine learning models for prediction of hiv/aids test results. Int. J. Comput. Appl.44(9), 875–886. 10.1080/1206212X.2021.1974663 (2022). [Google Scholar]
- 36.Alibrahim, H., & Ludwig, S.A. Hyperparameter optimization: Comparing genetic algorithm against grid search and bayesian optimization. In 2021 IEEE Congress on Evolutionary Computation (CEC), pp. 1551–1559 (2021). 10.1109/CEC45853.2021.9504761
- 37.Prabu, S., Thiyaneswaran, B., Sujatha, M., Nalini, C., & Rajkumar, S. Grid search for predicting coronary heart disease by tuning hyper-parameters. Comput. Syst. Sci. Eng.43(2) (2022) 10.32604/csse.2022.022739
- 38.Imani, M., & Arabnia, H.R. Hyperparameter optimization and combined data sampling techniques in machine learning for customer churn prediction: A comparative analysis. Technologies11(6) (2023) 10.3390/technologies11060167
- 39.Louk, M.H.L., & Tama, B.A. Revisiting gradient boosting-based approaches for learning imbalanced data: A case of anomaly detection on power grids. Big Data and Cognit. Comput.6(2) (2022) 10.3390/bdcc6020041
- 40.Kshatri, S. S. et al. An empirical analysis of machine learning algorithms for crime prediction using stacked generalization: An ensemble approach. IEEE Access9, 67488–67500. 10.1109/ACCESS.2021.3075140 (2021). [Google Scholar]
- 41.Niyogisubizo, J., Liao, L., Nziyumva, E., Murwanashyaka, E. & Nshimyumukiza, P. C. Predicting student’s dropout in university classes using two-layer ensemble machine learning approach: A novel stacked generalization. Comput. Educ. Artif. Intell.3, 100066. 10.1016/j.caeai.2022.100066 (2022). [Google Scholar]
- 42.Bin Habib, A.-Z.S., & Tasnim, T. An ensemble hard voting model for cardiovascular disease prediction. In 2020 2nd International Conference on Sustainable Technologies for Industry 4.0 (STI), pp. 1–6 (2020). 10.1109/STI50764.2020.9350514
- 43.Kumari, S., Kumar, D. & Mittal, M. An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. Int. J. Cognit. Comput. Eng.2, 40–46. 10.1016/j.ijcce.2021.01.001 (2021). [Google Scholar]
- 44.Kandel, M.A., Rizk, F.H., Hongou, L., Zaki, A.M., Khan, H. & El-Kenawy, E.-S.M., et al. Evaluating the efficacy of deep learning architectures in predicting traffic patterns for smart city development. J. Artif. Intell. Metaheurist.6(2), 26–6 (2023) 10.54216/JAIM.060203
- 45.Molnar, C., Freiesleben, T., König, G., Herbinger, J., Reisinger, T., Casalicchio, G., Wright, M.N., & Bischl, B. Relating the partial dependence plot and permutation feature importance to the data generating process. In: Longo, L. (ed.) Explainable Artificial Intelligence, pp. 456–479. Springer, Cham (2023). 10.1007/978-3-031-44064-9_24
- 46.Lundberg, S.M., & Lee, S.-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. pp. 4768–4777. Curran Associates Inc., Red Hook, NY, USA (2017).10.48550/arXiv.1705.07874
- 47.Agarwal, N. & Das, S. Interpretable machine learning tools: A survey. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1528–1534 (2020). 10.1109/SSCI47803.2020.9308260
- 48.Ribeiro, M.T., Singh, S. & Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16, pp. 1135–1144. Association for Computing Machinery, New York, NY, USA (2016). 10.1145/2939672.2939778
- 49.Nabrawi, E. & Alanazi, A. Fraud detection in healthcare insurance claims using machine learning. Risks11(9), 160. 10.3390/risks11090160 (2023). [Google Scholar]
- 50.van de Ven, G. M., Tuytelaars, T. & Tolias, A. S. Three types of incremental learning. Nat. Mach. Intell.4(12), 1185–1197. 10.1038/s42256-022-00568-3 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]