Abstract
Credit scorecards are essential tools for banks to assess the creditworthiness of loan applicants. While advanced machine learning models like XGBoost and random forest often outperform traditional logistic regression in predictive accuracy, their lack of interpretability hinders their adoption in practice. This study bridges the gap between research and practice by developing a novel framework for constructing interpretable credit scorecards using Shapley values. We apply this framework to two credit datasets, discretizing numerical variables and utilizing one-hot encoding to facilitate model development. Shapley values are then employed to derive credit scores for each predictor variable group in XGBoost, random forest, LightGBM, and CatBoost models. Our results demonstrate that this approach yields credit scorecards with interpretability comparable to logistic regression while maintaining superior predictive accuracy. This framework offers a practical and effective solution for credit practitioners seeking to leverage the power of advanced models without sacrificing transparency and regulatory compliance.
Introduction
Banks play a crucial role in the economy, influencing the financial landscape while making critical lending decisions that balance risk and profitability for both individuals and businesses [1–3]. To mitigate losses and identify low-risk applicants, banks rely on credit scoring models, or credit scorecards, that use predictor variables to generate credit scores [4]. Accurate identification of high-risk applicants is essential for effective lending, and regulatory frameworks often mandate that credit decisions, especially loan rejections, be transparent and explainable [5–7]. Traditional credit scorecards achieve this transparency through interpretable models like logistic regression [8].
Despite the dominance of logistic regression in credit scoring due to its simplicity and interpretability, recent research has highlighted the superior accuracy of tree-based models such as eXtreme gradient boosting (XGBoost) and random forest [5, 9, 10]. However, their limited interpretability poses significant challenges in practical application, particularly in the banking sector [11]. The "black box" nature of these models makes it difficult for practitioners to understand the underlying reasons behind credit decisions, hindering regulatory compliance, model validation, and effective communication with customers [6, 7, 11].
Specific challenges with tree-based models include:
Regulatory Compliance: Banks are required to provide clear reasons for loan rejections. The opaque nature of tree-based models complicates this requirement [6, 7].
Model Validation: The lack of transparency makes it difficult for banks to validate and trust the models, which is crucial for deployment in a highly regulated industry [11].
Customer Communication: Banks need to explain credit decisions to customers in an understandable manner. The complexity of tree-based models hampers this communication [11].
While the SHapley Additive exPlanations (SHAP) framework, leveraging Shapley values, has been proposed to enhance the interpretability of these models, initially for XGBoost [12], the focus has primarily been on probabilities rather than the credit scores used by credit practitioners [4]. This discrepancy between research advancements and practical needs underscores the importance of developing methods that can harness the predictive power of advanced models while ensuring the transparency and interpretability required in the banking sector.
This research aims to bridge this gap by demonstrating how Shapley values derived from tree-based models like XGBoost and random forest can be used to generate credit scores that are comparable to those from logistic regression-based credit scorecards, using two credit datasets. Additionally, we explore how these Shapley value-derived scores align with current practices for explaining credit decisions. By combining accuracy with interpretability, this study aims to promote the adoption of transparent and high-performing models in practical credit scoring, empowering banks to make informed lending decisions.
The paper is organized as follows: We begin with an overview of key themes in credit risk modelling, examining logistic regression, advanced scoring models, and the imperative aspect of interpretability. Following this, we describe the methodological approach used in this research, present our findings and results, and conclude with insights into future research directions.
Literature review
This section offers a comprehensive overview of key themes in credit risk modelling. It examines logistic regression, advanced scoring models, and the imperative aspect of interpretability. This review not only traces the historical significance of logistic regression and advanced models but also underscores the evolving challenges and solutions tied to model interpretability. Through this exploration, the section lays the groundwork for a deeper understanding of credit risk assessment methodologies.
Credit scoring models
Logistic regression, a technique with roots in the 19th century [13], is the most common credit risk model in practice due to its simplicity and ability to produce interpretable predictions [4, 5, 8, 14]. Its prominence and scope of application expanded following research by [15, 16], with early examples of its use in credit risk seen in the work of [17]. A study by [18] comparing five base learners on a credit loan dataset found that logistic regression outperformed decision trees, naïve Bayes, and AdaBoost in terms of AUC and accuracy metrics, but was surpassed by random forest and XGBoost. This highlights the trade-offs between interpretability and performance in credit risk modelling.
The logistic regression model’s structure facilitates this interpretability by relating predictor variables to the probability of an event (such as default) through a logit transformation. The model consists of an additive component, the sum of the intercept and a product of model parameters and their respective predictor variables [19]. The intercept represents the average value of the natural log of the odds when the predictor variables equal zero [19]. The logistic regression model is expressed as follows [19]:
$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m \tag{1}$$
where p is the probability of the event (e.g., default), β0 is the intercept, and βi, i = 1, 2, …, m, are the parameters of the predictor variables xi.
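For illustration, a minimal sketch (ours, on synthetic data, not from the study) fits Eq (1) with scikit-learn and reads off β0 and βi:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for credit data: y = 1 indicates default
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

beta0 = model.intercept_[0]   # intercept beta_0 in Eq (1)
betas = model.coef_[0]        # beta_i, one parameter per predictor variable

# Log-odds for one applicant: beta_0 + sum_i beta_i * x_i
log_odds = beta0 + X[0] @ betas
```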
Research in credit risk modelling remains active, with a strong focus on improving the accuracy of models, particularly through tree-based methods [5]. Studies have shown that tree-based models, such as XGBoost, random forest, LightGBM, and CatBoost, often outperform traditional models like logistic regression in terms of accuracy [5]. These models construct numerous non-linear decision trees by iteratively selecting subsets of data, with XGBoost further employing a boosting technique to combine multiple weak learners and enhance predictive accuracy [20].
Bagged tree ensembles such as random forest make predictions through majority voting, where the final prediction is the most frequent outcome among the individual trees, while boosted ensembles aggregate the outputs of sequentially fitted weak learners [20, 21]. These approaches, as demonstrated in various studies [14, 20, 21], often lead to superior predictive performance. Notably, [14] indicated that XGBoost handles imbalanced datasets—a common characteristic of credit data due to the rarity of defaults compared to non-defaults [22]—better than other advanced scoring methods.
Despite their popularity in research [4, 5] and superior prediction accuracy compared to logistic regression [5], tree-based models remain less common in practical credit scoring [11]. This is largely due to their inherent complexity, which makes it difficult to interpret their predictions and explain the reasons behind credit decisions, a crucial requirement in many regulatory contexts.
A 2015 survey of machine learning models used in data science competitions found that XGBoost was the most popular choice, offering higher prediction accuracy in various domains, including credit risk [9]. A benchmarking study on credit data further demonstrated XGBoost’s superior accuracy compared to logistic regression, neural networks, support vector machines, and random forest, even outperforming FICO scores [10].
Similar to XGBoost, LightGBM is a gradient boosting model, but it differs in its leaf-wise (best-first) tree growth strategy, which often leads to faster training [23]. Studies [24, 25] have shown LightGBM’s superior predictive performance on credit data compared to XGBoost and CatBoost.
CatBoost, another member of the gradient boosting family, stands out for its handling of categorical variables, making it valuable for datasets where categorical data plays a crucial role in predictive modelling [26]. Research has shown that CatBoost can outperform both XGBoost and LightGBM models in terms of predictive performance on credit data [23].
In random forest models, multiple decision trees are built, and the final prediction is determined through majority voting, where the most common prediction among the trees is selected [27]. Research has shown that tuning hyperparameters, such as the number of trees and predictor variables, is crucial for optimizing random forest performance in credit scoring [28, 29].
Studies comparing the performance of different models in credit scoring have reported varying AUC values. For instance, [30] investigated logistic regression and a neural network, achieving AUC values of 0.711 and 0.731, respectively. The study in [31] obtained an AUC of 0.680 from a random forest model before implementing a data sampling methodology for balancing, and achieved a higher AUC after applying their proposed technique.
While tree-based models have showcased high predictive accuracy, their limited ability to offer human-understandable explanations has constrained their adoption. This challenge has been acknowledged and addressed by [32], yet it continues to hinder the widespread use of these advanced models in real-world credit scoring applications.
Interpretability
The lack of human-understandable explanations for predictions made by advanced machine learning models is the primary obstacle to their wider adoption in practice [14, 32]. This concern is echoed by credit regulators in the USA and South Africa, who require models used for credit decisions to provide human-understandable interpretations [6, 7, 33]. In addition to explaining loan rejections, interpretability is also crucial for communicating low and high credit scores to various stakeholders, including credit practitioners, auditors, regulators, senior management, and model validators [4].
To address this challenge, the SHAP framework, rooted in game theory, was introduced by [12] to enhance the interpretability of machine learning models. Originally developed by [34] to determine the fair distribution of payouts in cooperative games, the SHAP framework calculates Shapley values for each predictor variable in a model [35]. These values represent the marginal contribution of each variable to a prediction and can be used to provide human-understandable explanations for credit decisions, aligning with the requirements outlined by [4].
Researchers have adopted the SHAP framework to provide detailed explanations of complex machine learning models [36–38], with the motivation of increasing understanding and trust in these models [38]. In the context of credit risk scoring, the SHAP framework has been used to explain predictions made by tree-based gradient boosting models [37, 39, 40].
Studies such as [37, 40] utilized SHAP to compute and compare marginal probabilities of predictor variables in tree-based models, finding significant differences in predictions and highlighting the higher default risk predicted by tree-based models. Similarly, [39] used SHAP with counterfactuals to provide explanations for predictions made by a tree-based gradient boosting model, ultimately concluding that the methodology helps in understanding the model’s behaviour.
While previous studies have explored the use of SHAP for explaining credit scores, they have not addressed how these explanations align with the credit scores used by practitioners, nor how they can be used to identify specific predictor variable categories that lead to lower scores and potential rejections. Our research aims to fill this gap, particularly when using tree-based models, by demonstrating the practical application of Shapley values. Our goal is to empower credit professionals to identify predictor variable categories that substantially impact lower credit scores, potentially resulting in credit application denials. This will ultimately enhance the transparency and effectiveness of credit assessment processes.
Literature review summary
This section offers a synthesis of the preceding sections, encompassing credit scoring models in practical application and literature. Table 1 provides a condensed overview of how prior research leveraged the SHAP framework to enhance the interpretability of advanced credit scoring models.
Table 1. Overview of the literature.
| Research Focus | Key Findings |
|---|---|
| Credit scoring models | Logistic regression, acknowledged as the most common credit risk model in practice [4, 5, 8] and with roots tracing back to the 19th century [13], gained prominence through [15, 16]. Its valued attributes encompass simplicity and interpretability, particularly in banking contexts [14]. It comprises additive components, as elucidated in [19], which solidifies its enduring role in credit risk assessment. Advanced scoring models, including XGBoost and random forest, exhibit notably higher accuracy than logistic regression [5, 20]. Leveraging non-linear trees and boosting [20], they are prominent in research yet constrained in practice by limited interpretability [4, 5]. They employ multiple trees with final predictions made by majority voting [27] and, when optimized through hyperparameter tuning [28, 29], are favoured for precision in credit scoring [28, 29]. |
| Interpretability using the SHAP framework | The SHAP framework, applied to machine learning models, enhances understanding and trust [38]. In emerging credit risk scoring studies, SHAP reveals variable influences [37, 39, 40]. Notably, [37] and [39] extract log-odds and probabilities from SHAP for insights into predictor significance, while [40] demonstrates heightened default predictions by gradient boosting models and examines predictor marginal probabilities. SHAP’s efficacy in explaining complex models complements their predictive superiority [39]. |
Methodology
This section outlines the systematic approach employed in this study for credit scoring model development and evaluation. It covers data preprocessing, feature engineering, variable selection, Shapley values integration, credit score computation, encoding methods, data partitioning, hyperparameter tuning, and model performance metrics. This section provides a concise overview of the methodology used to construct and assess the credit scoring models.
Data
This research employs two datasets: the Taiwan Credit Card data from [41], comprising 30,000 loan accounts (6,636 in default, a 22.12% default rate) from April to September 2005, and the Home Credit data from [42], containing 356,255 customers (24,845 classified as "bad" due to default, a 6.97% default rate), released on Kaggle in June 2018. The Taiwan Credit Card dataset includes 23 predictor variables, encompassing demographics, credit history, payment behaviour, and financial characteristics, while the Home Credit dataset contains 217 variables, including credit bureau, alternative, and demographic data.
To develop the models, both datasets are split into 80% training data and 20% test data using probability-based sampling to ensure consistent results and maintain the independence of the test set [4, 43]. The reported model performance results are based on the test data. This approach has limitations: the fixed 80–20 split ratio recommended in [4] may not be optimal for all datasets and could affect the generalizability of the models.
Feature engineering
Feature engineering, the process of creating new predictor variables from existing data, can be used to enhance model performance and extract additional insights [44]. This can involve transforming or aggregating variables, as detailed in [44]. In this study, feature engineering was applied to three time-series predictor variables in the Taiwan Credit Card dataset, transforming the original 23 variables into 59. Specifically, we calculated the 3-month rolling average, standard deviation, and the ratio of the current month’s value to the 3-month average for each time-series variable, starting from June and progressing through September.
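A pandas sketch of these rolling features (ours; column names and values are illustrative, not the dataset's actual schema):

```python
import pandas as pd

# Illustrative long-format data: one row per account per month
df = pd.DataFrame({
    "id":      [1, 1, 1, 1, 2, 2, 2, 2],
    "month":   [6, 7, 8, 9, 6, 7, 8, 9],   # June .. September
    "pay_amt": [100, 120, 80, 90, 50, 55, 60, 45],
})

df = df.sort_values(["id", "month"])
grp = df.groupby("id")["pay_amt"]

# 3-month rolling mean and standard deviation per account
df["avg_pay_3m"] = grp.transform(lambda s: s.rolling(3).mean())
df["std_pay_3m"] = grp.transform(lambda s: s.rolling(3).std())

# Ratio of the current month's value to the 3-month average
df["current_over_3mavg_pay"] = df["pay_amt"] / df["avg_pay_3m"]
```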
Data aggregation techniques, including averages, counts, and sums on transactions grouped by client ID, were applied to the Home Credit dataset, expanding the predictor variables from 217 to 767. These aggregations, mirroring the approach in [45], focused on numeric application data, transaction patterns, and timely instalment payment behaviour.
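Similarly, the client-level aggregations can be sketched as follows (our illustration; SK_ID_CURR is the Home Credit client key, while the remaining names and values are ours):

```python
import pandas as pd

# Illustrative transaction-level records keyed by client ID
trans = pd.DataFrame({
    "SK_ID_CURR":  [1, 1, 2, 2, 2],
    "AMT_ANNUITY": [900.0, 1100.0, 400.0, 450.0, 500.0],
    "DPD":         [0, 3, 0, 0, 7],          # days past due per instalment
})

# Averages, counts, and sums per client, flattened into scalar features
agg = trans.groupby("SK_ID_CURR").agg(
    APPROVED_AMT_ANNUITY_MEAN=("AMT_ANNUITY", "mean"),
    INSTAL_DPD_MEAN=("DPD", "mean"),
    N_PREV_RECORDS=("AMT_ANNUITY", "count"),
).reset_index()
```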
Unlike [30, 31], which used the predictor variables in their raw form, our study leverages these feature-engineered variables, potentially providing a unique perspective on the dataset and its predictive power. This approach may reveal hidden patterns and relationships that could improve the accuracy and interpretability of our credit risk models.
Variable selection
Permutation importance [46] and the Wald test [47] were employed to reduce the predictor variable set, eliminating variables with minimal contribution to the AUC or lacking statistical significance. This resulted in 7 variables for the Taiwan Credit Card data and 11 variables for the Home Credit data, aligning with recommendations for typical scorecard complexity [48] and mitigating overfitting concerns [49].
Additionally, a correlation analysis following established guidelines [50, 51] assessed multicollinearity. No pairs of predictor variables exceeded the pre-defined 0.8 correlation coefficient threshold [52]. The highest observed correlations were 0.69545 (Home Credit) and 0.75263 (Taiwan Credit Card). These combined steps removed 52 predictor variables from the Taiwan Credit Card data and 756 from the Home Credit data.
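A sketch of this two-step screen (our illustration on synthetic data; the 0.001 AUC-contribution cut-off is an assumption, while the 0.8 correlation threshold matches the one used above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Keep variables whose permutation meaningfully degrades test-set AUC
imp = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=10, random_state=0)
keep = np.where(imp.importances_mean > 0.001)[0]

# Multicollinearity screen: flag pairs above the 0.8 correlation threshold
corr = np.corrcoef(X_tr, rowvar=False)
high = [(i, j) for i in range(corr.shape[0])
        for j in range(i + 1, corr.shape[1]) if abs(corr[i, j]) > 0.8]
```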
Ultimately, for the Home Credit data, this selection process resulted in the predictor variables that are statistically significant, as shown in Table 2.
Table 2. Predictor variables for the Home Credit data.
| Predictor variable | Description | p-value |
|---|---|---|
| AVG_EXT_SOURCE | Average of external scores (1, 2 & 3) | 0.04200 |
| EXT_SOURCE_3 | Normalized score from external data source | 0.00000 |
| EXT_SOURCE_2 | Normalized score from external data source | 0.00000 |
| EXT_SOURCE_1 | Normalized score from external data source | 0.00000 |
| BUREAU_DAYS_CREDIT_MAX | Maximum—How many days before current application did client apply for Credit Bureau credit | 0.00000 |
| INSTAL_DPD_MEAN | Average—Days past due of Instalments | 0.00000 |
| DAYS_EMPLOYED | How many days before the application the person started current employment | 0.00000 |
| APPS_ANNUITY_CREDIT_RATIO | Ratio of AMT_ANNUITY / AMT_CREDIT | 0.00000 |
| APPROVED_AMT_ANNUITY_MEAN | Average–Approved annuity amount | 0.00000 |
| INSTAL_DAYS_ENTRY_PAYMENT_MAX | Maximum number of days (relative to the application date) on which a payment was made for previous instalments | 0.00000 |
| APPROVED_AMT_CREDIT_MAX | Maximum–Approved credit amount | 0.00000 |
Similarly, for the Taiwan data, the final list of predictor variables is presented in Table 3.
Table 3. Predictor variables for the Taiwan credit data.
| Predictor variable | Description | p-value |
|---|---|---|
| AVG_PAY__SEP | Average repayment payment status for July, August and September | 0.00000 |
| CURRENT_OVER_3MAVG_PAY__SEP | The Ratio of September repayment payment status over Average repayment payment status for July, August and September | 0.00000 |
| STD_PAY__SEP | The standard deviation of July, August and September repayment payment status | 0.00000 |
| AVG_PAY__JUN | Average repayment payment status for April, May and June | 0.00100 |
| AVG_BILL_AMT_SEP | Average bill for July, August and September | 0.00000 |
| AVG_PAY_AMT_SEP | Average payment amount for July, August and September | 0.00000 |
| CURRENT_OVER_3MAVG_BILL_AMT_SEP | The Ratio of the September bill over Average bill for July, August and September | 0.00000 |
In conclusion, the number of predictor variables in Tables 2 and 3 has been intentionally limited to align with standard credit scorecard development practices, which typically utilize up to 12 variables [48], and to minimize the risk of model overfitting and complexity [49].
Calculating the credit score in a practice setting
A previous study [4] introduced the concept of a neutral score, the point at which the odds of good and bad outcomes are equal, as a key element in explaining loan application rejections. This score is calculated using parameters such as the intercept of a logistic regression model and the number of predictor variables in the scorecard. The formulas for calculating credit scores, including the neutral score and scores for categorical variables, are well-established and can be found in [4].
Score scaling parameters, offset and factor, are used to adjust the scorecard to achieve desired odds of good to bad outcomes at specific credit score levels. For example, in a logistic regression-based scorecard, a customer’s score falling below the neutral score on a predictor variable is considered a likely reason for credit application decline [4].
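To make the scaling arithmetic concrete, a minimal sketch follows (our illustration; the anchor points of 600 points at 50:1 good:bad odds and a PDO of 20 are assumed values, not those of the study):

```python
import math

# Illustrative scaling choices: 600 points at 50:1 good:bad odds,
# 20 points to double the odds (PDO)
pdo, anchor_score, anchor_odds = 20.0, 600.0, 50.0

factor = pdo / math.log(2)                        # points per unit of log-odds
offset = anchor_score - factor * math.log(anchor_odds)

def total_score(log_odds_good: float) -> float:
    """Map model log-odds (good vs bad) to a scaled credit score."""
    return offset + factor * log_odds_good

# At 1:1 odds (log-odds = 0) the total score equals the offset --
# the neutral level against which per-variable scores are compared
neutral_total = total_score(0.0)
```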
While the methodology in [4] provides a foundation for interpretability, our research proposes an alternative approach using Shapley values to further enhance the interpretability of credit scorecards, particularly for tree-based models.
Shapley values
As indicated earlier, the SHAP framework was proposed by [12] to provide detailed explanations of complex machine learning models through the use of Shapley values. These Shapley values offer three important properties crucial for determining the marginal contribution of each predictor variable in a model [12]:
Local Accuracy: Ensures that predictions for a specific instance can be attributed to the input values of each variable for that instance.
Missingness: A variable absent from the model does not influence the prediction, similar to how entities that make no contribution in a given context receive no payoffs [53].
Consistency: If a model changes so that a predictor variable’s contribution increases or stays the same, its attributed value does not decrease; relatedly, variables with equal contributions receive equal attributions (symmetry), ensuring fairness and unbiased model explanations.
The predictions are given by the following:
$$g(x') = \phi_0 + \sum_{i=1}^{m} \phi_i x'_i \tag{2}$$
where ϕ0 is the naive prediction, i.e., the prediction made without any predictor variables, ϕi, i = 1, 2, …, m, are the Shapley values (marginal contributions) of the predictor variables, and x′i ∈ {0, 1} indicates whether predictor variable xi is present in the explanation.
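The following minimal sketch (ours, on synthetic data; it assumes the shap and xgboost Python packages, and output conventions can vary across shap versions) computes ϕ0 and ϕi for an XGBoost model and checks the local accuracy property of Eq (2):

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

# TreeExplainer returns phi_0 (expected_value) and per-variable phi_i,
# expressed in log-odds for a binary XGBoost model
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)            # shape: (n_samples, n_features)
phi0 = explainer.expected_value           # naive prediction in Eq (2)

# Local accuracy: phi_0 + sum_i phi_i reproduces the raw log-odds output
raw = model.predict(X, output_margin=True)
assert np.allclose(phi0 + phi.sum(axis=1), raw, atol=1e-3)
```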
Data processing
Binning, the process of converting continuous variables into categorical ones, is a common practice in credit scoring [4]. It involves grouping values into distinct categories or "bins." This approach simplifies the understanding of relationships between predictor and target variables, streamlines the allocation of credit points, and systematically addresses outliers [4, 54]. It also enhances the ability of banking professionals to derive actionable insights from the data, such as identifying high-risk customer segments or optimal credit score thresholds. In a credit scorecard, each bin is associated with a specific credit score linked to the input values of a predictor variable, allowing for easy comparison with the neutral score and identification of bins where predictor variables fall below the standard [4].
Our binning approach aligns with the standard practice of maximizing the Weight of Evidence (WOE) [4], a measure of the strength of an input value in differentiating between good and bad customers. By discretizing continuous variables into categorical ones, we optimize the WOE metric, ensuring that the resulting bins enhance interpretability and facilitate precise allocation of credit points.
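As an illustration of the WOE computation (our sketch on synthetic data; [4]'s procedure additionally searches for bin edges that strengthen WOE, whereas quantile bins are used here for brevity):

```python
import numpy as np
import pandas as pd

# Illustrative data: a numeric predictor and a default flag (1 = bad)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=5000)})
df["bad"] = (rng.random(5000) < 0.2).astype(int)

# Discretize into bins (quantile bins here for simplicity)
df["bin"] = pd.qcut(df["x"], q=5)

# WOE per bin: ln(share of goods in bin / share of bads in bin)
tab = df.groupby("bin", observed=True)["bad"].agg(bads="sum", n="count")
tab["goods"] = tab["n"] - tab["bads"]
tab["woe"] = np.log((tab["goods"] / tab["goods"].sum()) /
                    (tab["bads"] / tab["bads"].sum()))
```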
Given that machine learning algorithms like XGBoost require numerical inputs [55], we binned numerical variables and then employed one-hot encoding, a popular and simple method for representing categorical variables [55, 56]. To address missing values in the numerical variables, imputation with the mean of non-missing values was employed for each variable [57]. Additionally, outliers were handled by setting the lower and upper bounds for all observations in each variable to the 2.5th and 97.5th percentiles, respectively [58].
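A compact sketch of this preprocessing chain (ours; column names and values are illustrative):

```python
import numpy as np
import pandas as pd

def preprocess(col: pd.Series) -> pd.Series:
    # Mean imputation of missing values
    col = col.fillna(col.mean())
    # Cap outliers at the 2.5th and 97.5th percentiles
    lo, hi = col.quantile([0.025, 0.975])
    return col.clip(lower=lo, upper=hi)

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 100.0, 3.0]})
df["x"] = preprocess(df["x"])

# Bin the cleaned variable, then one-hot encode the bins for the tree models
df["x_bin"] = pd.cut(df["x"], bins=3)
X = pd.get_dummies(df["x_bin"], prefix="x")
```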
Hyperparameter tuning
Hyperparameter tuning is essential for optimizing model performance, as it allows for fine-tuning the parameters of ensemble models to achieve superior outcomes [59]. In this study, we employed grid search, a well-established and effective method for finding optimal hyperparameters [60]. Other hyperparameter tuning methods include Bayesian optimization, which uses probabilistic models, random search, which randomly samples hyperparameter combinations, and manual search, guided by human expertise [61, 62]. The choice of method depends on computational resources and problem complexity, as each balances comprehensiveness and efficiency in finding optimal configurations [62].
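A grid search sketch follows (ours, on synthetic data; the grid shown is an assumption, as the study's actual search space is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Illustrative search space
grid = {"max_depth": [3, 5, 7],
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1]}

search = GridSearchCV(xgb.XGBClassifier(eval_metric="logloss"),
                      grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```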
Model validation
To validate the models and assess their generalizability, this study employs 5-fold cross-validation, a common technique for estimating machine learning model performance on unseen data [63]. This method involves partitioning the dataset into five subsets (folds), iteratively using each fold as the validation set while the remaining folds are used for training [63]. The process is repeated five times, and the resulting performance metrics are averaged to provide a robust estimate [63]. While effective, k-fold cross-validation can be computationally expensive, particularly for larger values of k [63]. This 5-fold approach aligns with previous studies [23, 25], offering a balance between computational efficiency and model validation rigor.
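In scikit-learn terms, the procedure amounts to the following (our sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 5-fold CV: each fold serves once as the validation set; AUCs are averaged
aucs = cross_val_score(RandomForestClassifier(random_state=0),
                       X, y, cv=5, scoring="roc_auc")
print(aucs.mean(), aucs.std())
```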
Model performance metrics
Most researchers assess the performance of credit scorecards using the AUC [5, 14, 64, 65], due to its ability to indicate a model’s capacity to differentiate between good and bad customers [5]. A higher AUC signifies better discrimination between these two groups [45]. However, AUC has limitations. It can be misleading for poorly fitted models [66] and lacks intuitive interpretation for practitioners [67]. Despite these shortcomings, AUC remains a popular metric in both research and practice [5].
The AUC is calculated as the area under the receiver operating characteristics (ROC) curve, which plots the true positive rate against the false positive rate at various classification thresholds [5]. To assess the statistical significance of differences in AUC between models, we employed the DeLong test [68, 69].
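The DeLong test itself requires the covariance structure of the paired AUC estimates [68]. As a simple alternative for readers without a DeLong implementation at hand, a bootstrap comparison of two models' AUCs on the same test set can be sketched as follows (our illustration, not the procedure used in the study):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y, p1, p2, n_boot=2000, seed=0):
    """95% bootstrap CI for AUC(model 1) - AUC(model 2) on one test set."""
    rng = np.random.default_rng(seed)
    diffs, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:   # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y[idx], p1[idx]) -
                     roc_auc_score(y[idx], p2[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```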
In addition to AUC, misclassification statistics, often presented in a confusion matrix (Table 4), offer a practical way to evaluate credit scorecard performance [4]. This matrix categorizes customers based on their probability of default and compares their actual classification to the scorecard’s prediction, resulting in four cells: true negative, false positive, false negative, and true positive. This comparison helps determine the accuracy of the scorecard’s predictions for good and bad customers.
Table 4. Confusion matrix.
|  |  | Predicted |  |
|---|---|---|---|
|  |  | Good | Bad |
| Actual | Good | True negative | False positive |
|  | Bad | False negative | True positive |
To evaluate a credit scorecard’s accuracy, the true negative rate (specificity) measures the model’s ability to predict non-defaulting (good) customers, while the true positive rate (sensitivity) measures its ability to predict defaulting (bad) customers. The aim is to use the scorecard’s probability of default to reduce false positives and false negatives by adjusting the probability cut-off [4].
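In code, these rates fall directly out of the confusion matrix (our sketch; labels follow Table 4 with 1 = bad and 0 = good, and the arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # true negative rate: goods correctly identified
sensitivity = tp / (tp + fn)   # true positive rate: bads correctly identified
```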
Proposed framework for calculating credit scores
This framework outlines a systematic approach for enhancing credit scoring models by integrating Shapley values [12] into the established methodology of [4]. It encompasses the entire process of deriving credit scores, from the initial predictor variable binning to the final credit score calculation. By incorporating Shapley values, this framework provides a comprehensive pathway to derive more transparent and insightful credit scores, ultimately aiding in informed credit decision-making and model refinement.
Our proposed methodology begins with the binning phase, a crucial step in scorecard development given its significant impact on the final scorecard’s structure [4]. As illustrated in Fig 1, our approach introduces additional stages where one-hot encoding is applied to the binned predictor variables before model fitting, and Shapley values are used in place of logistic regression parameters.
Fig 1. Credit scores calculation process flow—current vs proposed.
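A minimal sketch of the final scoring step (ours; the signs and scaling constants are assumptions rather than the study's exact implementation) averages each one-hot bin's Shapley values over the applicants in that bin and applies the scaling of [4], with the Shapley value taking the place of the logistic regression parameter:

```python
import numpy as np

def bin_credit_scores(phi, onehot, factor, offset, n_vars):
    """Per-bin credit scores with Shapley values replacing beta parameters.

    phi:    SHAP values for the one-hot bin columns, shape (n_applicants, n_bins)
    onehot: the matching 0/1 design matrix
    Signs depend on how good/bad and the model margin are coded (assumption).
    """
    mask = onehot.astype(bool)
    scores = np.empty(phi.shape[1])
    for j in range(phi.shape[1]):
        mean_phi = phi[mask[:, j], j].mean()   # average contribution of bin j
        scores[j] = -mean_phi * factor + offset / n_vars
    return scores

# Toy example: 4 applicants, 2 bins of one variable (values are illustrative)
onehot = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
phi = np.array([[0.4, 0.0], [0.2, 0.0], [0.0, -0.3], [0.0, -0.5]])
print(bin_credit_scores(phi, onehot, factor=28.85, offset=487.12, n_vars=7))
```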
Results and analysis
This section presents the outcomes of the credit scoring models and delves into their performance. This includes an in-depth examination of credit scorecards associated with each model, illustrating how individual predictor variables are practically represented. Through a detailed exploration of these outcomes, this section offers valuable insights into the effectiveness and real-world applicability of the developed models.
Performance of the models
Table 5 presents a comparison of the logistic regression, random forest, XGBoost, LightGBM, and CatBoost models in terms of AUC. The random forest model achieved the highest AUC, followed closely by XGBoost and LightGBM. However, the DeLong test [68] indicates that the differences in AUC among these three models are not statistically significant.
Table 5. AUC and p-values of the models–Taiwan data.
| Model | AUC | p-value vs XGBoost | p-value vs LightGBM | p-value vs Logistic Regression | p-value vs CatBoost |
|---|---|---|---|---|---|
| Random Forest | 0.75929 | 0.41580 | 0.15990 | 0.00021 | 0.00143 |
| XGBoost | 0.75766 |  | 0.47520 | 0.00315 | 0.00056 |
| LightGBM | 0.75690 |  |  | 0.00316 | 0.00310 |
| Logistic Regression | 0.74891 |  |  |  | 0.81190 |
| CatBoost | 0.74793 |  |  |  |  |
Similarly, the AUC values for logistic regression and CatBoost were not significantly different from each other. However, the p-values from the DeLong test show significant differences between the top-performing group (random forest, XGBoost, LightGBM) and the lower-performing group (logistic regression, CatBoost).
Notably, our models outperformed the benchmark AUC of 0.697 reported in previous research [30, 31] that used the same dataset without applying feature engineering. This suggests that feature engineering, which distinguished our study from previous work in terms of predictor variable utilization, contributed to the improved predictive performance.
Table 6 presents the confusion matrices for the Taiwan Credit Card data models, highlighting the superior predictive power of the random forest and XGBoost models. Both achieved the highest overall accuracy (75.717%) and lowest misclassification rate (24.283%), outperforming LightGBM, logistic regression, and CatBoost.
Table 6. Confusion matrices of the models–Taiwan data.
| Model | Actual | Predicted Good | Predicted Bad |
|---|---|---|---|
| Random Forest | Good | 3,765 (80.363%) | 920 |
| Random Forest | Bad | 537 | 778 (59.163%) |
| XGBoost | Good | 3,767 (80.406%) | 918 |
| XGBoost | Bad | 539 | 776 (59.011%) |
| LightGBM | Good | 3,727 (79.552%) | 958 |
| LightGBM | Bad | 522 | 793 (60.304%) |
| Logistic Regression | Good | 3,688 (78.719%) | 997 |
| Logistic Regression | Bad | 533 | 782 (59.468%) |
| CatBoost | Good | 3,657 (78.058%) | 1,028 |
| CatBoost | Bad | 515 | 800 (60.837%) |
Table 7 presents the AUC values of the different models on the Home Credit data. The XGBoost model achieved the highest AUC of 0.69766, and the DeLong test [68] confirmed that the differences in AUC between XGBoost and all other models were statistically significant (p-values < 0.05). The only comparison that did not reach statistical significance was between LightGBM and logistic regression (p = 0.87847), suggesting their AUC values are not significantly different.
Table 7. AUC and p-values of the models–Home Credit data.
| Model | AUC | p-value vs XGBoost | p-value vs LightGBM | p-value vs Logistic Regression | p-value vs CatBoost |
|---|---|---|---|---|---|
| Random Forest | 0.69280 | 0.00000 | 0.00044 | 0.00012 | 0.00002 |
| XGBoost | 0.69766 |  | 0.02081 | 0.00466 | 0.00000 |
| LightGBM | 0.69654 |  |  | 0.87847 | 0.00000 |
| Logistic Regression | 0.69644 |  |  |  | 0.00000 |
| CatBoost | 0.68450 |  |  |  |  |
Table 8 presents the confusion matrices of the Home Credit data models. The XGBoost model achieved the highest overall accuracy (70.335%) and the lowest misclassification rate (29.665%) compared to the other models.
Table 8. Confusion matrices of the models–Home Credit data.
| Model | Actual | Predicted Good | Predicted Bad |
|---|---|---|---|
| Random Forest | Good | 39,422 (69.775%) | 17,077 |
| Random Forest | Bad | 2,057 | 2,947 (58.893%) |
| XGBoost | Good | 40,299 (71.327%) | 16,200 |
| XGBoost | Bad | 2,045 | 2,959 (59.133%) |
| LightGBM | Good | 40,060 (70.904%) | 16,439 |
| LightGBM | Bad | 2,019 | 2,985 (59.652%) |
| Logistic Regression | Good | 39,545 (69.992%) | 16,954 |
| Logistic Regression | Bad | 2,006 | 2,998 (59.912%) |
| CatBoost | Good | 39,178 (69.343%) | 17,321 |
| CatBoost | Bad | 2,024 | 2,980 (59.552%) |
Overall, these results corroborate previous findings [5, 70] demonstrating the superior performance of tree-based models compared to classic techniques like logistic regression in credit risk assessment.
Interpretable credit models–Taiwan data
Previous research, such as [37, 39, 40], focused on providing marginal probability or log-odds contributions of each variable in a model, shedding light on their statistical significance.
Fig 2 illustrates the type of interpretability offered by previous studies, showcasing the log-odds contributions of each predictor variable for a specific customer in the dataset. While statistically informative, this type of output, focused on log-odds or probabilities, may not be readily interpretable or actionable for credit practitioners, who primarily rely on credit scores for decision-making [4]. This section aims to bridge this gap by drawing parallels between the parameters used in logistic regression-based models and those derived from the SHAP framework, and by proposing to replace logistic regression parameters with Shapley values when identifying the top reasons for model predictions. We compare the established method for determining top reasons for credit scorecard predictions [4] with our proposed approach using the SHAP framework [12].
Fig 2. Log-odds of the predictor variables.
In the tables that follow, credit scores falling below the neutral score indicate potential reasons for decline (shaded grey in the original scorecard representations). We provide side-by-side comparisons of credit scores based on both logistic regression parameters and Shapley values. All five models were developed using seven predictor variables with consistent binning.
Tables 9–15 illustrate the credit scores of the predictor variables on the Taiwan data. In most cases, the five models agree on the predictor variable bins that lie below the neutral credit score, thereby presenting potential explanations for customers receiving lower credit scores. The exception is the predictor variable "Average Bill Amount (July, August, September)" in Table 10, where the random forest model suggests that only the bin (-inf, 13.50) could be cited as a reason for an applicant receiving a lower credit score.
Table 9. Average payment indicator—July, August & September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Indicator (July, August, September) | [0.17, inf) | 0.0000 | 77.7914 | 0.0000 | 0.0000 | 0.0000 |
|  | (-inf, 0.17) | 227.9544 | 189.8177 | 203.2322 | 1,145.1521 | 208.4717 |
| Neutral Credit Score |  | 181.5809 | 167.0278 | 161.8880 | 912.1900 | 166.0616 |
Table 15. Ratio September bill amount over a 3-months average bill amount—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio September Bill Amount over a 3-months Average Bill Amount (July, August, September) | [0.83, 1.06) | 30.9308 | 91.7615 | 77.6660 | 94.3556 | 0.0000 |
|  | (-inf, 0.83) | 149.9491 | 145.6139 | 127.7371 | 142.1031 | 146.3869 |
|  | [1.06, inf) | 203.8937 | 191.2323 | 139.0207 | 180.9784 | 193.4898 |
| Neutral Credit Score |  | 109.3790 | 133.7848 | 107.1574 | 131.1313 | 90.5642 |
Table 10. Average bill amount—July, August & September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Bill Amount (July, August, September) | (-inf, 13.50) | 75.4879 | 84.5542 | 69.1873 | 0.0000 | 66.7293 |
|  | [13.50, 49794.83) | 119.9583 | 120.3305 | 119.6997 | 106.1236 | 119.5988 |
|  | [49794.83, inf) | 154.1544 | 126.4671 | 126.5869 | 129.8170 | 165.6828 |
| Neutral Credit Score |  | 128.1800 | 120.3961 | 119.2480 | 107.9997 | 131.0323 |
Table 11. Average payment indicator–April, May, and June.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Indicator (Apr, May, Jun) | [1.50, inf) | 0.0000 | 33.9752 | 0.0000 | 40.5293 | 0.0000 |
|  | [0.50, 1.50) | 55.7196 | 79.5345 | 0.0000 | 82.6911 | 17.5708 |
|  | (-inf, 0.50) | 172.5623 | 232.1287 | 263.2195 | 1,590.1881 | 237.1356 |
| Neutral Credit Score |  | 151.0169 | 205.6815 | 222.4819 | 1,354.2532 | 202.0626 |
Table 12. Ratio September payment indicator divided by a 3-months average payment indicator (July, August, and September).
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio September Payment Indicator over a 3-months Average Payment Indicator (July, August, September) | [1.23, inf) | 72.9460 | 18.1803 | 82.6155 | 0.0000 | 0.0000 |
|  | [0.20, 1.23) | 106.2807 | 88.8360 | 109.3608 | 74.6328 | 45.9130 |
|  | (-inf, 0.20) | 171.8272 | 299.5807 | 149.1770 | 264.4279 | 622.8268 |
| Neutral Credit Score |  | 140.2733 | 201.8103 | 129.3548 | 175.2677 | 370.0489 |
Table 13. Standard deviation of payment indicator—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Standard Deviation of Payment Indicator (July, August, September) | [0.79, inf) | 69.4371 | 0.0000 | 93.2310 | 38.0470 | 27.2959 |
|  | (-inf, 0.79) | 155.7639 | 204.9736 | 132.7333 | 219.7735 | 204.3044 |
| Neutral Credit Score |  | 137.4367 | 161.4577 | 124.3469 | 181.1929 | 166.7255 |
Table 14. Average payment amount—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Amount (July, August, September) | (-inf, 31.17) | 62.0784 | 0.0000 | 85.1439 | 0.0000 | 0.0000 |
|  | [31.17, 2001.83) | 91.5675 | 7.1893 | 103.2557 | 43.4340 | 0.0000 |
|  | [2001.83, 4312.17) | 131.3691 | 144.8357 | 127.4429 | 141.8376 | 130.2874 |
|  | [4312.17, inf) | 181.3795 | 265.6866 | 156.7996 | 246.9170 | 174.6079 |
| Neutral Credit Score |  | 127.3518 | 121.2622 | 124.6605 | 128.1473 | 86.2906 |
The consistency and similarity in predictor variable input values across models have yielded compelling results. The models largely agree on which input values fall below or above the neutral credit score, demonstrating consistency in identifying potential reasons for credit decline. A significant finding of this research is the successful substitution of logistic regression parameters with Shapley values to derive credit scores using the methodology outlined in [4], showcasing the practical applicability of Shapley values in credit scoring.
Interpretable credit models–Home Credit data
Across the Home Credit data, Tables 16–26 illustrate the credit scores of the eleven predictor variables. Notably, in all instances, the five models consistently agree on which predictor variable bins fall below the neutral credit score, thus providing potential explanations for why customers might receive lower scores.
Table 16. Average–Approved annuity amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average–Approved annuity amount | (-inf, 4160.83) | 23.6388 | 104.1387 | 121.8600 | 107.3289 | 0.0000 |
|  | [4160.83, 8934.75) | 91.5894 | 116.3997 | 121.8615 | 117.3830 | 26.1096 |
|  | [8934.75, inf) | 162.1021 | 128.6020 | 121.8626 | 129.5077 | 142.7540 |
| Neutral Credit Score |  | 135.5539 | 123.9826 | 121.8622 | 125.0247 | 103.1746 |
Table 26. Ratio of annuity amount / Credit amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio of Annuity amount / Credit Amount | [0.05, inf) | 121.8622 | 65.6892 | 109.3365 | 71.7789 | 108.1582 |
|  | (-inf, 0.05) | 141.7548 | 160.1158 | 145.3041 | 177.0741 | 143.2001 |
| Neutral Credit Score |  | 129.9096 | 103.8887 | 123.8869 | 114.3752 | 122.3341 |
Table 17. Maximum number of days (relative to the application date) on which a payment was made for previous instalments.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum number of days (relative to the application date) on which a payment was made for previous instalments | [-19.50, inf) | 121.8622 | 106.8595 | 121.8618 | 115.0803 | 114.3648 |
|  | (-inf, -19.50) | 124.4529 | 125.0546 | 121.8622 | 128.6087 | 123.2026 |
| Neutral Credit Score |  | 124.0384 | 122.1440 | 121.8623 | 126.4446 | 121.7888 |
Table 18. Maximum–Approved credit amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum–Approved credit amount | (-inf, 50954.04) | 0.0000 | 0.0000 | 121.8580 | 100.4851 | 88.3454 |
|  | [50954.04, 898398.00) | 123.4501 | 126.0522 | 121.8622 | 122.8748 | 123.8610 |
|  | [898398.00, inf) | 138.4285 | 165.5761 | 121.8623 | 132.4269 | 142.7153 |
| Neutral Credit Score |  | 112.5787 | 117.2354 | 121.8618 | 121.5474 | 122.1043 |
Table 19. Average of external scores (1, 2 & 3).
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average of external scores (1, 2 & 3) | (-inf, 0.42) | 121.8622 | 28.2460 | 83.5474 | 35.5794 | 0.0000 |
|  | [0.42, inf) | 190.7930 | 175.5838 | 138.3170 | 180.3757 | 288.2603 |
| Neutral Credit Score |  | 177.4769 | 147.1210 | 127.7366 | 152.4039 | 232.5741 |
Table 20. Normalized score from external data source 3.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 3 | (-inf, 0.31) | 121.8622 | 0.0000 | 92.8836 | 94.9131 | 0.0000 |
|  | [0.31, inf) | 162.9762 | 215.6522 | 129.7101 | 132.4732 | 160.9843 |
| Neutral Credit Score |  | 156.9643 | 184.1183 | 124.3251 | 126.9810 | 137.4443 |
Table 21. Normalized score from external data source 2.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 2 | (-inf, 0.35) | 121.8622 | 0.0000 | 46.7201 | 0.0000 | 0.0000 |
|  | [0.35, inf) | 172.4943 | 355.6024 | 153.1185 | 339.8045 | 215.9150 |
| Neutral Credit Score |  | 161.6762 | 279.6242 | 130.3853 | 267.2016 | 169.7825 |
Table 22. Normalized score from external data source 1.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 1 | (-inf, 0.27) | 121.8622 | 49.3349 | 121.8610 | 47.4972 | 117.9383 |
|  | [0.27, inf) | 129.7068 | 216.1394 | 121.8623 | 197.6447 | 138.7986 |
| Neutral Credit Score |  | 129.1451 | 204.1939 | 121.8622 | 186.8921 | 137.3047 |
Table 23. Maximum—How many days before current application did client apply for credit bureau credit.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum—How many days before current application did client apply for Credit Bureau credit | [-76.50, inf) | 121.8622 | 80.3987 | 107.6043 | 87.7062 | 11.3244 |
|  | (-inf, -76.50) | 125.7284 | 131.9663 | 123.8594 | 137.1557 | 136.2507 |
| Neutral Credit Score |  | 125.3388 | 126.7700 | 122.2214 | 132.1728 | 123.6623 |
Table 24. Average—Days past due of instalments.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average—Days past due of Instalments | [0.08, inf) | 121.8622 | 102.7606 | 115.5192 | 95.9194 | 92.8871 |
|  | (-inf, 0.08) | 138.9704 | 252.5404 | 127.8514 | 238.8954 | 150.6132 |
| Neutral Credit Score |  | 130.6599 | 179.7826 | 121.8609 | 169.4427 | 122.5719 |
Table 25. How many days before the application the person started current employment.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| How many days before the application the person started current employment | [6123.75, inf) | 121.8622 | 87.3832 | 121.8599 | 87.3148 | 112.2891 |
|  | (-inf, 6123.75) | 151.7564 | 168.3528 | 121.8692 | 157.0443 | 172.7415 |
| Neutral Credit Score |  | 130.4769 | 110.7163 | 121.8626 | 107.4089 | 129.7098 |
The consistent agreement across all models regarding which predictor variable input values fall below or above the neutral credit score demonstrates the robustness of our approach and reinforces the potential of Shapley values as a viable alternative to logistic regression parameters for deriving interpretable credit scores, as demonstrated in the Taiwan dataset. This finding further supports the applicability of the methodology outlined in [4] for a broader range of credit scoring models.
Conclusion and future work
As noted in the literature, the limited transparency of advanced machine learning models has been a barrier to their widespread adoption in credit scoring due to regulatory requirements [14, 71]. However, our findings demonstrate that transparency need not be a barrier, as credit scores derived from Shapley values align closely with those derived from logistic regression models.
Our research establishes that Shapley values can effectively identify reasons for unfavourable credit reports, aligning with industry practices and providing a valuable tool for interpreting complex machine learning models. Furthermore, our research confirms previous findings [5, 70] that tree-based models like XGBoost and random forest outperform logistic regression in terms of accuracy, solidifying their efficacy in credit scoring.
Building upon these findings, future research should focus on the practical implementation of the proposed interpretability methods within real-world credit scoring scenarios. Additionally, investigating the potential of these methods to enhance the interpretability of other ensemble models in various applications would be a valuable avenue for further exploration.
Data Availability
The Taiwan credit card default dataset is publicly available through the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients), and the Home Credit dataset is available on Kaggle (https://www.kaggle.com/competitions/home-credit-default-risk/data).
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Cierniak-Emerych A, Mazur-Wierzbicka E, Rojek-Nowosielska M. Corporate Social Responsibility in Poland. 2021. doi: 10.1007/978-3-030-68386-3_13
- 2. Crook JN, Edelman DB, Thomas LC. Recent developments in consumer credit risk assessment. Eur J Oper Res. 2007;183. doi: 10.1016/j.ejor.2006.09.100
- 3. Hand DJ, Henley WE. Statistical classification methods in consumer credit scoring: A review. J R Stat Soc Ser A Stat Soc. 1997;160. doi: 10.1111/j.1467-985X.1997.00078.x
- 4. Siddiqi N. Scorecard Development. Intelligent Credit Scoring. John Wiley & Sons, Ltd; 2016. doi: 10.1002/9781119282396.ch2
- 5. Lessmann S, Baesens B, Seow HV, Thomas LC. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur J Oper Res. 2015;247. doi: 10.1016/j.ejor.2015.05.030
- 6. Kelly-Louw M. Introduction to the National Credit Act. Juta's Business Law. 2007;15: 147–159.
- 7. McCorkell PL, Smith AM. Fair Credit Reporting Act update-2008. Business Lawyer. 2009;64.
- 8. Trueck S, Rachev ST. Rating Based Modeling of Credit Risk. 2009. doi: 10.1016/B978-0-12-373683-3.X0001-2
- 9. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. doi: 10.1145/2939672.2939785
- 10. Munkhdalai L, Munkhdalai T, Namsrai OE, Lee JY, Ryu KH. An empirical comparison of machine-learning methods on bank client credit assessments. Sustainability (Switzerland). 2019;11. doi: 10.3390/su11030699
- 11. Alonso-Robisco A, Carbó JM. Can machine learning models save capital for banks? Evidence from a Spanish credit portfolio. International Review of Financial Analysis. 2022;84. doi: 10.1016/j.irfa.2022.102372
- 12. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017.
- 13. Cramer JS. The Origins of Logistic Regression. SSRN Electronic Journal. 2005. doi: 10.2139/ssrn.360300
- 14. Wei S, Yang D, Zhang W, Zhang S. A novel noise-adapted two-layer ensemble model for credit scoring based on backflow learning. IEEE Access. 2019;7. doi: 10.1109/ACCESS.2019.2930332
- 15. Chambers EA, Cox DR. Discrimination between alternative binary response models. Biometrika. 1967;54.
- 16. McFadden D. Conditional logit analysis of qualitative choice behaviour. 1973.
- 17. Fischer ML, Moore K. An Improved Credit Scoring Function for the St. Paul Bank for Cooperatives. 1986.
- 18. Li Y, Chen W. A comparative performance assessment of ensemble learning for credit scoring. Mathematics. 2020;8. doi: 10.3390/math8101756
- 19. Osborne JW. Best Practices in Logistic Regression. 2017. doi: 10.4135/9781483399041
- 20. Tsai CF, Hsu YF, Yen DC. A comparative study of classifier ensembles for bankruptcy prediction. Applied Soft Computing Journal. 2014;24. doi: 10.1016/j.asoc.2014.08.047
- 21. Xia Y, Liu C, Da B, Xie F. A novel heterogeneous ensemble credit scoring model based on bstacking approach. Expert Syst Appl. 2018;93. doi: 10.1016/j.eswa.2017.10.022
- 22. Brown I, Mues C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst Appl. 2012;39. doi: 10.1016/j.eswa.2011.09.033
- 23. Tounsi Y, Anoun H, Hassouni L. CSMAS: Improving Multi-Agent Credit Scoring System by Integrating Big Data and the new generation of Gradient Boosting Algorithms. ACM International Conference Proceeding Series. 2020. doi: 10.1145/3386723.3387851
- 24. Coşkun SB, Turanli M. Credit risk analysis using boosting methods. Journal of Applied Mathematics, Statistics and Informatics. 2023;19: 5–18. doi: 10.2478/jamsi-2023-0001
- 25. Al Daoud E. Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset. International Journal of Computer and Information Engineering. 2019;13.
- 26. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems. 2018.
- 27. Kumar A. Ensemble Learning for AI Developers: Learn Bagging, Stacking, and Boosting Methods with Use Cases. 1st ed. Berkeley, CA: Apress; 2020.
- 28. Ala'raj M, Abbod MF. A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Syst Appl. 2016;64. doi: 10.1016/j.eswa.2016.07.017
- 29. Xia Y, Zhao J, He L, Li Y, Niu M. A novel tree-based dynamic heterogeneous ensemble method for credit scoring. Expert Syst Appl. 2020;159. doi: 10.1016/j.eswa.2020.113615
- 30. Chen D, Ye J, Ye W. Interpretable selective learning in credit risk. Res Int Bus Finance. 2023;65. doi: 10.1016/j.ribaf.2023.101940
- 31. Alam TM, Shaukat K, Hameed IA, Luo S, Sarwar MU, Shabbir S, et al. An investigation of credit card default prediction in the imbalanced datasets. IEEE Access. 2020;8. doi: 10.1109/ACCESS.2020.3033784
- 32. Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion. 2020;58. doi: 10.1016/j.inffus.2019.12.012
- 33. Hertza VA. Fighting unfair classifications in credit reporting: Should the United States adopt GDPR-inspired rights in regulating consumer credit? New York University Law Review. 2018;93.
- 34. Shapley LS. A Value for n-person Games. Contributions to the Theory of Games. Annals of Mathematics Studies. 1953;28.
- 35. Samek W, Montavon G, Vedaldi A, Hansen LK, Müller K-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science (LNCS). 2019;11700.
- 36. Bussmann N, Giudici P, Marinelli D, Papenbrock J. Explainable Machine Learning in Credit Risk Management. Comput Econ. 2021;57. doi: 10.1007/s10614-020-10042-0
- 37. Bracke P, Datta A, Jung C, Sen S. Machine Learning Explainability in Finance: An Application to Default Risk Analysis. SSRN Electronic Journal. 2019. doi: 10.2139/ssrn.3435104
- 38. Elshawi R, Al-Mallah MH, Sakr S. On the interpretability of machine learning-based model for predicting hypertension. BMC Med Inform Decis Mak. 2019;19. doi: 10.1186/s12911-019-0874-0
- 39. Bueff AC, Cytryński M, Calabrese R, Jones M, Roberts J, Moore J, et al. Machine learning interpretability for a stress scenario generation in credit scoring based on counterfactuals. Expert Syst Appl. 2022;202: 117271. doi: 10.1016/j.eswa.2022.117271
- 40. Bussmann N, Giudici P, Marinelli D, Papenbrock J. Explainable AI in Fintech Risk Management. Front Artif Intell. 2020;3. doi: 10.3389/frai.2020.00026
- 41. Yeh I-C. Default of credit card clients. UCI Machine Learning Repository. 2016. doi: 10.24432/C55S3H
- 42. Home Credit Group. Home Credit Default Risk DataSet. In: Kaggle [Internet]. 2018 [cited 3 Jan 2021]. Available: https://www.kaggle.com/c/home-credit-default-risk/data
- 43. Abowitz DA, Toole TM. Mixed Method Research: Fundamental Issues of Design, Validity, and Reliability in Construction Research. J Constr Eng Manag. 2010;136. doi: 10.1061/(asce)co.1943-7862.0000026
- 44. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 2012. doi: 10.1016/C2009-0-61819-5
- 45. Hlongwane R, Ramaboa KKKM, Mongwe W. Enhancing credit scoring accuracy with a comprehensive evaluation of alternative data. PLoS One. 2024;19. doi: 10.1371/journal.pone.0303566
- 46. Hooker G, Mentch L, Zhou S. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Stat Comput. 2021;31. doi: 10.1007/s11222-021-10057-z
- 47. Costa e Silva E, Lopes IC, Correia A, Faria S. A logistic regression model for consumer default risk. J Appl Stat. 2020;47. doi: 10.1080/02664763.2020.1759030
- 48. Mester LJ. What's the Point of Credit Scoring? Business Review. 1997;3.
- 49. Yu L, Yu L, Yu K. A high-dimensionality-trait-driven learning paradigm for high dimensional credit classification. Financial Innovation. 2021;7. doi: 10.1186/s40854-021-00249-x
- 50. Remeseiro B, Bolon-Canedo V. A review of feature selection methods in medical applications. Computers in Biology and Medicine. 2019. doi: 10.1016/j.compbiomed.2019.103375
- 51. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinformatics. 2015;2015. doi: 10.1155/2015/198363
- 52. Kalnins A. Multicollinearity: How common factors cause Type 1 errors in multivariate regression. Strategic Management Journal. 2018;39. doi: 10.1002/smj.2783
- 53. Winter E. Chapter 53: The Shapley value. Handbook of Game Theory with Economic Applications. 2002. doi: 10.1016/S1574-0005(02)03016-3
- 54. Kritzinger N, van Vuuren GW. An optimised credit scorecard to enhance cut-off score determination. South African Journal of Economic and Management Sciences. 2018;21. doi: 10.4102/sajems.v21i1.1571
- 55. Cerda P, Varoquaux G, Kégl B. Similarity encoding for learning with dirty categorical variables. Mach Learn. 2018;107. doi: 10.1007/s10994-018-5724-2
- 56. Yu L, Zhou R, Chen R, Lai KK. Missing Data Preprocessing in Credit Classification: One-Hot Encoding or Imputation? Emerging Markets Finance and Trade. 2022;58. doi: 10.1080/1540496X.2020.1825935
- 57. Jenghara MM, Ebrahimpour-Komleh H, Rezaie V, Nejatian S, Parvin H, Yusof SKS. Imputing missing value through ensemble concept based on statistical measures. Knowl Inf Syst. 2018;56. doi: 10.1007/s10115-017-1118-1
- 58. Aguinis H, Gottfredson RK, Joo H. Best-Practice Recommendations for Defining, Identifying, and Handling Outliers. Organizational Research Methods. 2013. doi: 10.1177/1094428112470848
- 59. Xia Y, Liu C, Li YY, Liu N. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Syst Appl. 2017;78. doi: 10.1016/j.eswa.2017.02.017
- 60. Pan S, Zheng Z, Guo Z, Luo H. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. J Pet Sci Eng. 2022;208. doi: 10.1016/j.petrol.2021.109520
- 61. Yang F, Qiao Y, Qi Y, Bo J, Wang X. BACS: blockchain and AutoML-based technology for efficient credit scoring classification. Ann Oper Res. 2022. doi: 10.1007/s10479-022-04531-8
- 62. Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing. 2020;415. doi: 10.1016/j.neucom.2020.07.061
- 63. Zhang X, Liu CA. Model averaging prediction by K-fold cross-validation. J Econom. 2023;235. doi: 10.1016/j.jeconom.2022.04.007
- 64. Barboza F, Kimura H, Altman E. Machine learning models and bankruptcy prediction. Expert Syst Appl. 2017;83. doi: 10.1016/j.eswa.2017.04.006
- 65. Gurný P, Gurný M. Comparison of credit scoring models on probability of default estimation for US banks. Prague Economic Papers. 2013. doi: 10.18267/j.pep.446
- 66. Lobo JM, Jiménez-Valverde A, Real R. AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography. 2008. doi: 10.1111/j.1466-8238.2007.00358.x
- 67. Halligan S, Altman DG, Mallett S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: A discussion and proposal for an alternative approach. Eur Radiol. 2015;25. doi: 10.1007/s00330-014-3487-0
- 68. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44. doi: 10.2307/2531595
- 69. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577. doi: 10.1038/s41586-019-1799-6
- 70. Ben Jabeur S, Gharib C, Mefteh-Wali S, Ben Arfi W. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technol Forecast Soc Change. 2021;166. doi: 10.1016/j.techfore.2021.120658
- 71. Hosaka T. Bankruptcy prediction using imaged financial ratios and convolutional neural networks. Expert Syst Appl. 2019;117. doi: 10.1016/j.eswa.2018.09.039