Abstract
Credit scorecards are essential tools for banks to assess the creditworthiness of loan applicants. While advanced machine learning models like XGBoost and random forest often outperform traditional logistic regression in predictive accuracy, their lack of interpretability hinders their adoption in practice. This study bridges the gap between research and practice by developing a novel framework for constructing interpretable credit scorecards using Shapley values. We apply this framework to two credit datasets, discretizing numerical variables and utilizing one-hot encoding to facilitate model development. Shapley values are then employed to derive credit scores for each predictor variable group in XGBoost, random forest, LightGBM, and CatBoost models. Our results demonstrate that this approach yields credit scorecards with interpretability comparable to logistic regression while maintaining superior predictive accuracy. This framework offers a practical and effective solution for credit practitioners seeking to leverage the power of advanced models without sacrificing transparency and regulatory compliance.
Introduction
Banks play a crucial role in the economy, influencing the financial landscape while making critical lending decisions that balance risk and profitability for both individuals and businesses [1–3]. To mitigate losses and identify low-risk applicants, banks rely on credit scoring models, or credit scorecards, that use predictor variables to generate credit scores [4]. Accurate identification of high-risk applicants is essential for effective lending, and regulatory frameworks often mandate that credit decisions, especially loan rejections, be transparent and explainable [5–7]. Traditional credit scorecards achieve this transparency through interpretable models like logistic regression [8].
Despite the dominance of logistic regression in credit scoring due to its simplicity and interpretability, recent research has highlighted the superior accuracy of tree-based models such as eXtreme gradient boosting (XGBoost) and random forest [5, 9, 10]. However, their limited interpretability poses significant challenges in practical application, particularly in the banking sector [11]. The "black box" nature of these models makes it difficult for practitioners to understand the underlying reasons behind credit decisions, hindering regulatory compliance, model validation, and effective communication with customers [6, 7, 11].
Specific challenges with tree-based models include:
Regulatory Compliance: Banks are required to provide clear reasons for loan rejections. The opaque nature of tree-based models complicates this requirement [6, 7].
Model Validation: The lack of transparency makes it difficult for banks to validate and trust the models, which is crucial for deployment in a highly regulated industry [11].
Customer Communication: Banks need to explain credit decisions to customers in an understandable manner. The complexity of tree-based models hampers this communication [11].
While the SHapley Additive exPlanations (SHAP) framework, leveraging Shapley values, has been proposed to enhance the interpretability of these models, initially for XGBoost [12], the focus has primarily been on probabilities rather than the credit scores used by credit practitioners [4]. This discrepancy between research advancements and practical needs underscores the importance of developing methods that can harness the predictive power of advanced models while ensuring the transparency and interpretability required in the banking sector.
This research aims to bridge this gap by demonstrating how Shapley values derived from tree-based models like XGBoost and random forest can be used to generate credit scores that are comparable to those from logistic regression-based credit scorecards, using two credit datasets. Additionally, we explore how these Shapley value-derived scores align with current practices for explaining credit decisions. By combining accuracy with interpretability, this study aims to promote the adoption of transparent and high-performing models in practical credit scoring, empowering banks to make informed lending decisions.
The paper is organized as follows: We begin with an overview of key themes in credit risk modelling, examining logistic regression, advanced scoring models, and the imperative aspect of interpretability. Following this, we describe the methodological approach used in this research, present our findings and results, and conclude with insights into future research directions.
Literature review
This section offers a comprehensive overview of key themes in credit risk modelling. It examines logistic regression, advanced scoring models, and the imperative aspect of interpretability. This review not only traces the historical significance of logistic regression and advanced models but also underscores the evolving challenges and solutions tied to model interpretability. Through this exploration, the section lays the groundwork for a deeper understanding of credit risk assessment methodologies.
Credit scoring models
Logistic regression, a technique with roots in the 19th century [13], is the most common credit risk model in practice due to its simplicity and ability to produce interpretable predictions [4, 5, 8, 14]. Its prominence and scope of application expanded following research by [15, 16], with early examples of its use in credit risk seen in the work of [17]. A study by [18] comparing five base learners on a credit loan dataset found that logistic regression outperformed decision trees, naïve Bayes, and AdaBoost in terms of AUC and accuracy metrics, but was surpassed by random forest and XGBoost. This highlights the trade-offs between interpretability and performance in credit risk modelling.
The logistic regression model’s structure facilitates this interpretability by relating predictor variables to the probability of an event (such as default) through a logit transformation. The model consists of an additive component, the sum of the intercept and a product of model parameters and their respective predictor variables [19]. The intercept represents the average value of the natural log of the odds when the predictor variables equal zero [19]. The logistic regression model is expressed as follows [19]:
$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m \tag{1}$$
where p is the probability of the event (e.g., default), β0 is the intercept, and βi, i = 1, 2, …, m, are the parameters of the predictor variables xi.
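For illustration, a minimal sketch (ours, on synthetic data, not from the study) fits Eq (1) with scikit-learn and reads off β0 and βi:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for credit data: y = 1 indicates default
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

beta0 = model.intercept_[0]   # intercept beta_0 in Eq (1)
betas = model.coef_[0]        # beta_i, one parameter per predictor variable

# Log-odds for one applicant: beta_0 + sum_i beta_i * x_i
log_odds = beta0 + X[0] @ betas
```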
Research in credit risk modelling remains active, with a strong focus on improving the accuracy of models, particularly through tree-based methods [5]. Studies have shown that tree-based models, such as XGBoost, random forest, LightGBM, and CatBoost, often outperform traditional models like logistic regression in terms of accuracy [5]. These models construct numerous non-linear decision trees by iteratively selecting subsets of data, with XGBoost further employing a boosting technique to combine multiple weak learners and enhance predictive accuracy [20].
Bagged tree ensembles such as random forest make predictions through majority voting, where the final prediction is the most frequent outcome among the individual trees, while boosted ensembles aggregate the outputs of sequentially fitted weak learners [20, 21]. These approaches, as demonstrated in various studies [14, 20, 21], often lead to superior predictive performance. Notably, [14] indicated that XGBoost handles imbalanced datasets—a common characteristic of credit data due to the rarity of defaults compared to non-defaults [22]—better than other advanced scoring methods.
Despite their popularity in research [4, 5] and superior prediction accuracy compared to logistic regression [5], tree-based models remain less common in practical credit scoring [11]. This is largely due to their inherent complexity, which makes it difficult to interpret their predictions and explain the reasons behind credit decisions, a crucial requirement in many regulatory contexts.
A 2015 survey of machine learning models used in data science competitions found that XGBoost was the most popular choice, offering higher prediction accuracy in various domains, including credit risk [9]. A benchmarking study on credit data further demonstrated XGBoost’s superior accuracy compared to logistic regression, neural networks, support vector machines, and random forest, even outperforming FICO scores [10].
Similar to XGBoost, LightGBM is a gradient boosting model, but it differs in its leaf-wise (best-first) tree growth strategy, which often leads to faster training [23]. Studies [24, 25] have shown LightGBM’s superior predictive performance on credit data compared to XGBoost and CatBoost.
CatBoost, another member of the gradient boosting family, stands out for its handling of categorical variables, making it valuable for datasets where categorical data plays a crucial role in predictive modelling [26]. Research has shown that CatBoost can outperform both XGBoost and LightGBM models in terms of predictive performance on credit data [23].
In random forest models, multiple decision trees are built, and the final prediction is determined through majority voting, where the most common prediction among the trees is selected [27]. Research has shown that tuning hyperparameters, such as the number of trees and predictor variables, is crucial for optimizing random forest performance in credit scoring [28, 29].
Studies comparing the performance of different models in credit scoring have reported varying AUC values. For instance, [30] investigated logistic regression and a neural network, achieving AUC values of 0.711 and 0.731, respectively. The study in [31] obtained an AUC of 0.680 from a random forest model before implementing a data sampling methodology for balancing, and achieved a higher AUC after applying their proposed technique.
While tree-based models have showcased high predictive accuracy, their limited ability to offer human-understandable explanations has constrained their adoption. This challenge has been acknowledged and addressed by [32], yet it continues to hinder the widespread use of these advanced models in real-world credit scoring applications.
Interpretability
The lack of human-understandable explanations for predictions made by advanced machine learning models is the primary obstacle to their wider adoption in practice [14, 32]. This concern is echoed by credit regulators in the USA and South Africa, who require models used for credit decisions to provide human-understandable interpretations [6, 7, 33]. In addition to explaining loan rejections, interpretability is also crucial for communicating low and high credit scores to various stakeholders, including credit practitioners, auditors, regulators, senior management, and model validators [4].
To address this challenge, the SHAP framework, rooted in game theory, was introduced by [12] to enhance the interpretability of machine learning models. Originally developed by [34] to determine the fair distribution of payouts in cooperative games, the SHAP framework calculates Shapley values for each predictor variable in a model [35]. These values represent the marginal contribution of each variable to a prediction and can be used to provide human-understandable explanations for credit decisions, aligning with the requirements outlined by [4].
Researchers have adopted the SHAP framework to provide detailed explanations of complex machine learning models [36–38], with the motivation of increasing understanding and trust in these models [38]. In the context of credit risk scoring, the SHAP framework has been used to explain predictions made by tree-based gradient boosting models [37, 39, 40].
Studies such as [37, 40] utilized SHAP to compute and compare marginal probabilities of predictor variables in tree-based models, finding significant differences in predictions and highlighting the higher default risk predicted by tree-based models. Similarly, [39] used SHAP with counterfactuals to provide explanations for predictions made by a tree-based gradient boosting model, ultimately concluding that the methodology helps in understanding the model’s behaviour.
While previous studies have explored the use of SHAP for explaining credit scores, they have not addressed how these explanations align with the credit scores used by practitioners, nor how they can be used to identify specific predictor variable categories that lead to lower scores and potential rejections. Our research aims to fill this gap, particularly when using tree-based models, by demonstrating the practical application of Shapley values. Our goal is to empower credit professionals to identify predictor variable categories that substantially impact lower credit scores, potentially resulting in credit application denials. This will ultimately enhance the transparency and effectiveness of credit assessment processes.
Literature review summary
This section offers a synthesis of the preceding sections, encompassing credit scoring models in practical application and literature. Table 1 provides a condensed overview of how prior research leveraged the SHAP framework to enhance the interpretability of advanced credit scoring models.
Table 1. Overview of the literature.
| Research Focus | Key Findings |
|---|---|
| Credit scoring models | Logistic regression, acknowledged as the most common credit risk model in practice [4, 5, 8] and with roots tracing back to the 19th century [13], gained prominence through [15, 16]. Its valued attributes encompass simplicity and interpretability, particularly in banking contexts [14]. It comprises additive components, as elucidated in [19], which solidifies its enduring role in credit risk assessment. Advanced scoring models, including XGBoost and random forest, exhibit notably higher accuracy than logistic regression [5, 20]. Leveraging non-linear trees and boosting [20], they are prominent in research yet constrained in practice by limited interpretability [4, 5]. They employ multiple trees with final predictions made by majority voting [27] and, when optimized through hyperparameter tuning [28, 29], are favoured for precision in credit scoring [28, 29]. |
| Interpretability using the SHAP framework | The SHAP framework, applied to machine learning models, enhances understanding and trust [38]. In emerging credit risk scoring studies, SHAP reveals variable influences [37, 39, 40]. Notably, [37] and [39] extract log-odds and probabilities from SHAP for insights into predictor significance, while [40] demonstrates heightened default predictions by gradient boosting models and examines predictor marginal probabilities. SHAP’s efficacy in explaining complex models complements their predictive superiority [39]. |
Methodology
This section outlines the systematic approach employed in this study for credit scoring model development and evaluation. It covers data preprocessing, feature engineering, variable selection, Shapley values integration, credit score computation, encoding methods, data partitioning, hyperparameter tuning, and model performance metrics. This section provides a concise overview of the methodology used to construct and assess the credit scoring models.
Data
This research employs two datasets: the Taiwan Credit Card data from [41], comprising 30,000 loan accounts (6,636 in default, a 22.12% default rate) from April to September 2005, and the Home Credit data from [42], containing 356,255 customers (24,845 classified as "bad" due to default, a 6.97% default rate), released on Kaggle in June 2018. The Taiwan Credit Card dataset includes 23 predictor variables, encompassing demographics, credit history, payment behaviour, and financial characteristics, while the Home Credit dataset contains 217 variables, including credit bureau, alternative, and demographic data.
To develop the models, both datasets are split into 80% training data and 20% test data using probability-based sampling to ensure consistent results and maintain the independence of the test set [4, 43]. The reported model performance results are based on the test data. This approach has limitations: the fixed 80–20 split ratio recommended in [4] may not be optimal for all datasets and could affect the generalizability of the models.
Feature engineering
Feature engineering, the process of creating new predictor variables from existing data, can be used to enhance model performance and extract additional insights [44]. This can involve transforming or aggregating variables, as detailed in [44]. In this study, feature engineering was applied to three time-series predictor variables in the Taiwan Credit Card dataset, transforming the original 23 variables into 59. Specifically, we calculated the 3-month rolling average, standard deviation, and the ratio of the current month’s value to the 3-month average for each time-series variable, starting from June and progressing through September.
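A pandas sketch of these rolling features (ours; column names and values are illustrative, not the dataset's actual schema):

```python
import pandas as pd

# Illustrative long-format data: one row per account per month
df = pd.DataFrame({
    "id":      [1, 1, 1, 1, 2, 2, 2, 2],
    "month":   [6, 7, 8, 9, 6, 7, 8, 9],   # June .. September
    "pay_amt": [100, 120, 80, 90, 50, 55, 60, 45],
})

df = df.sort_values(["id", "month"])
grp = df.groupby("id")["pay_amt"]

# 3-month rolling mean and standard deviation per account
df["avg_pay_3m"] = grp.transform(lambda s: s.rolling(3).mean())
df["std_pay_3m"] = grp.transform(lambda s: s.rolling(3).std())

# Ratio of the current month's value to the 3-month average
df["current_over_3mavg_pay"] = df["pay_amt"] / df["avg_pay_3m"]
```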
Data aggregation techniques, including averages, counts, and sums on transactions grouped by client ID, were applied to the Home Credit dataset, expanding the predictor variables from 217 to 767. These aggregations, mirroring the approach in [45], focused on numeric application data, transaction patterns, and timely instalment payment behaviour.
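Similarly, the client-level aggregations can be sketched as follows (our illustration; SK_ID_CURR is the Home Credit client key, while the remaining names and values are ours):

```python
import pandas as pd

# Illustrative transaction-level records keyed by client ID
trans = pd.DataFrame({
    "SK_ID_CURR":  [1, 1, 2, 2, 2],
    "AMT_ANNUITY": [900.0, 1100.0, 400.0, 450.0, 500.0],
    "DPD":         [0, 3, 0, 0, 7],          # days past due per instalment
})

# Averages, counts, and sums per client, flattened into scalar features
agg = trans.groupby("SK_ID_CURR").agg(
    APPROVED_AMT_ANNUITY_MEAN=("AMT_ANNUITY", "mean"),
    INSTAL_DPD_MEAN=("DPD", "mean"),
    N_PREV_RECORDS=("AMT_ANNUITY", "count"),
).reset_index()
```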
Unlike [30, 31], which used the predictor variables in their raw form, our study leverages these feature-engineered variables, potentially providing a unique perspective on the dataset and its predictive power. This approach may reveal hidden patterns and relationships that could improve the accuracy and interpretability of our credit risk models.
Variable selection
Permutation importance [46] and the Wald test [47] were employed to reduce the predictor variable set, eliminating variables with minimal contribution to the AUC or lacking statistical significance. This resulted in 7 variables for the Taiwan Credit Card data and 11 variables for the Home Credit data, aligning with recommendations for typical scorecard complexity [48] and mitigating overfitting concerns [49].
Additionally, a correlation analysis following established guidelines [50, 51] assessed multicollinearity. No pairs of predictor variables exceeded the pre-defined 0.8 correlation coefficient threshold [52]. The highest observed correlations were 0.69545 (Home Credit) and 0.75263 (Taiwan Credit Card). These combined steps removed 52 predictor variables from the Taiwan Credit Card data and 756 from the Home Credit data.
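A sketch of this two-step screen (our illustration on synthetic data; the 0.001 AUC-contribution cut-off is an assumption, while the 0.8 correlation threshold matches the one used above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Keep variables whose permutation meaningfully degrades test-set AUC
imp = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                             n_repeats=10, random_state=0)
keep = np.where(imp.importances_mean > 0.001)[0]

# Multicollinearity screen: flag pairs above the 0.8 correlation threshold
corr = np.corrcoef(X_tr, rowvar=False)
high = [(i, j) for i in range(corr.shape[0])
        for j in range(i + 1, corr.shape[1]) if abs(corr[i, j]) > 0.8]
```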
Ultimately, for the Home Credit data, this selection process resulted in the predictor variables that are statistically significant, as shown in Table 2.
Table 2. Predictor variables for the Home Credit data.
| Predictor variable | Description | p-value |
|---|---|---|
| AVG_EXT_SOURCE | Average of external scores (1, 2 & 3) | 0.04200 |
| EXT_SOURCE_3 | Normalized score from external data source | 0.00000 |
| EXT_SOURCE_2 | Normalized score from external data source | 0.00000 |
| EXT_SOURCE_1 | Normalized score from external data source | 0.00000 |
| BUREAU_DAYS_CREDIT_MAX | Maximum—How many days before current application did client apply for Credit Bureau credit | 0.00000 |
| INSTAL_DPD_MEAN | Average—Days past due of Instalments | 0.00000 |
| DAYS_EMPLOYED | How many days before the application the person started current employment | 0.00000 |
| APPS_ANNUITY_CREDIT_RATIO | Ratio of AMT_ANNUITY / AMT_CREDIT | 0.00000 |
| APPROVED_AMT_ANNUITY_MEAN | Average–Approved annuity amount | 0.00000 |
| INSTAL_DAYS_ENTRY_PAYMENT_MAX | Maximum number of days (relative to the application date) on which a payment was made for previous instalments | 0.00000 |
| APPROVED_AMT_CREDIT_MAX | Maximum–Approved credit amount | 0.00000 |
Similarly, for the Taiwan data, the final list of predictor variables is presented in Table 3.
Table 3. Predictor variables for the Taiwan credit data.
| Predictor variable | Description | p-value |
|---|---|---|
| AVG_PAY__SEP | Average repayment payment status for July, August and September | 0.00000 |
| CURRENT_OVER_3MAVG_PAY__SEP | The Ratio of September repayment payment status over Average repayment payment status for July, August and September | 0.00000 |
| STD_PAY__SEP | The standard deviation of July, August and September repayment payment status | 0.00000 |
| AVG_PAY__JUN | Average repayment payment status for April, May and June | 0.00100 |
| AVG_BILL_AMT_SEP | Average bill for July, August and September | 0.00000 |
| AVG_PAY_AMT_SEP | Average payment amount for July, August and September | 0.00000 |
| CURRENT_OVER_3MAVG_BILL_AMT_SEP | The Ratio of the September bill over Average bill for July, August and September | 0.00000 |
In conclusion, the number of predictor variables in Tables 2 and 3 has been intentionally limited to align with standard credit scorecard development practices, which typically utilize up to 12 variables [48], and to minimize the risk of model overfitting and complexity [49].
Calculating the credit score in a practice setting
A previous study [4] introduced the concept of a neutral score, the point at which the odds of good and bad outcomes are equal, as a key element in explaining loan application rejections. This score is calculated using parameters such as the intercept of a logistic regression model and the number of predictor variables in the scorecard. The formulas for calculating credit scores, including the neutral score and scores for categorical variables, are well-established and can be found in [4].
Score scaling parameters, offset and factor, are used to adjust the scorecard to achieve desired odds of good to bad outcomes at specific credit score levels. For example, in a logistic regression-based scorecard, a customer’s score falling below the neutral score on a predictor variable is considered a likely reason for credit application decline [4].
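To make the scaling arithmetic concrete, a minimal sketch follows (our illustration; the anchor points of 600 points at 50:1 good:bad odds and a PDO of 20 are assumed values, not those of the study):

```python
import math

# Illustrative scaling choices: 600 points at 50:1 good:bad odds,
# 20 points to double the odds (PDO)
pdo, anchor_score, anchor_odds = 20.0, 600.0, 50.0

factor = pdo / math.log(2)                        # points per unit of log-odds
offset = anchor_score - factor * math.log(anchor_odds)

def total_score(log_odds_good: float) -> float:
    """Map model log-odds (good vs bad) to a scaled credit score."""
    return offset + factor * log_odds_good

# At 1:1 odds (log-odds = 0) the total score equals the offset --
# the neutral level against which per-variable scores are compared
neutral_total = total_score(0.0)
```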
While the methodology in [4] provides a foundation for interpretability, our research proposes an alternative approach using Shapley values to further enhance the interpretability of credit scorecards, particularly for tree-based models.
Shapley values
As indicated earlier, the SHAP framework was proposed by [12] to provide detailed explanations of complex machine learning models through the use of Shapley values. These Shapley values offer three important properties crucial for determining the marginal contribution of each predictor variable in a model [12]:
Local Accuracy: Ensures that predictions for a specific instance can be attributed to the input values of each variable for that instance.
Missingness: A variable absent from the model does not influence the prediction, similar to how entities that make no contribution in a given context receive no payoffs [53].
Consistency: If a model changes so that a predictor variable’s contribution increases or stays the same, its attributed value does not decrease; relatedly, variables with equal contributions receive equal attributions (symmetry), ensuring fairness and unbiased model explanations.
The predictions are given by the following:
$$g(x') = \phi_0 + \sum_{i=1}^{m} \phi_i x'_i \tag{2}$$
where ϕ0 is the naive prediction, i.e., the prediction made without any predictor variables, ϕi, i = 1, 2, …, m, are the Shapley values (marginal contributions) of the predictor variables, and x′i ∈ {0, 1} indicates whether predictor variable xi is present in the explanation.
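The following minimal sketch (ours, on synthetic data; it assumes the shap and xgboost Python packages, and output conventions can vary across shap versions) computes ϕ0 and ϕi for an XGBoost model and checks the local accuracy property of Eq (2):

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

# TreeExplainer returns phi_0 (expected_value) and per-variable phi_i,
# expressed in log-odds for a binary XGBoost model
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)            # shape: (n_samples, n_features)
phi0 = explainer.expected_value           # naive prediction in Eq (2)

# Local accuracy: phi_0 + sum_i phi_i reproduces the raw log-odds output
raw = model.predict(X, output_margin=True)
assert np.allclose(phi0 + phi.sum(axis=1), raw, atol=1e-3)
```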
Data processing
Binning, the process of converting continuous variables into categorical ones, is a common practice in credit scoring [4]. It involves grouping values into distinct categories or "bins." This approach simplifies the understanding of relationships between predictor and target variables, streamlines the allocation of credit points, and systematically addresses outliers [4, 54]. It also enhances the ability of banking professionals to derive actionable insights from the data, such as identifying high-risk customer segments or optimal credit score thresholds. In a credit scorecard, each bin is associated with a specific credit score linked to the input values of a predictor variable, allowing for easy comparison with the neutral score and identification of bins where predictor variables fall below the standard [4].
Our binning approach aligns with the standard practice of maximizing the Weight of Evidence (WOE) [4], a measure of the strength of an input value in differentiating between good and bad customers. By discretizing continuous variables into categorical ones, we optimize the WOE metric, ensuring that the resulting bins enhance interpretability and facilitate precise allocation of credit points.
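As an illustration of the WOE computation (our sketch on synthetic data; [4]'s procedure additionally searches for bin edges that strengthen WOE, whereas quantile bins are used here for brevity):

```python
import numpy as np
import pandas as pd

# Illustrative data: a numeric predictor and a default flag (1 = bad)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=5000)})
df["bad"] = (rng.random(5000) < 0.2).astype(int)

# Discretize into bins (quantile bins here for simplicity)
df["bin"] = pd.qcut(df["x"], q=5)

# WOE per bin: ln(share of goods in bin / share of bads in bin)
tab = df.groupby("bin", observed=True)["bad"].agg(bads="sum", n="count")
tab["goods"] = tab["n"] - tab["bads"]
tab["woe"] = np.log((tab["goods"] / tab["goods"].sum()) /
                    (tab["bads"] / tab["bads"].sum()))
```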
Given that machine learning algorithms like XGBoost require numerical inputs [55], we binned numerical variables and then employed one-hot encoding, a popular and simple method for representing categorical variables [55, 56]. To address missing values in the numerical variables, imputation with the mean of non-missing values was employed for each variable [57]. Additionally, outliers were handled by setting the lower and upper bounds for all observations in each variable to the 2.5th and 97.5th percentiles, respectively [58].
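A compact sketch of this preprocessing chain (ours; column names and values are illustrative):

```python
import numpy as np
import pandas as pd

def preprocess(col: pd.Series) -> pd.Series:
    # Mean imputation of missing values
    col = col.fillna(col.mean())
    # Cap outliers at the 2.5th and 97.5th percentiles
    lo, hi = col.quantile([0.025, 0.975])
    return col.clip(lower=lo, upper=hi)

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 100.0, 3.0]})
df["x"] = preprocess(df["x"])

# Bin the cleaned variable, then one-hot encode the bins for the tree models
df["x_bin"] = pd.cut(df["x"], bins=3)
X = pd.get_dummies(df["x_bin"], prefix="x")
```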
Hyperparameter tuning
Hyperparameter tuning is essential for optimizing model performance, as it allows for fine-tuning the parameters of ensemble models to achieve superior outcomes [59]. In this study, we employed grid search, a well-established and effective method for finding optimal hyperparameters [60]. Other hyperparameter tuning methods include Bayesian optimization, which uses probabilistic models, random search, which randomly samples hyperparameter combinations, and manual search, guided by human expertise [61, 62]. The choice of method depends on computational resources and problem complexity, as each balances comprehensiveness and efficiency in finding optimal configurations [62].
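A grid search sketch follows (ours, on synthetic data; the grid shown is an assumption, as the study's actual search space is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Illustrative search space
grid = {"max_depth": [3, 5, 7],
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1]}

search = GridSearchCV(xgb.XGBClassifier(eval_metric="logloss"),
                      grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```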
Model validation
To validate the models and assess their generalizability, this study employs 5-fold cross-validation, a common technique for estimating machine learning model performance on unseen data [63]. This method involves partitioning the dataset into five subsets (folds), iteratively using each fold as the validation set while the remaining folds are used for training [63]. The process is repeated five times, and the resulting performance metrics are averaged to provide a robust estimate [63]. While effective, k-fold cross-validation can be computationally expensive, particularly for larger values of k [63]. This 5-fold approach aligns with previous studies [23, 25], offering a balance between computational efficiency and model validation rigor.
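In scikit-learn terms, the procedure amounts to the following (our sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 5-fold CV: each fold serves once as the validation set; AUCs are averaged
aucs = cross_val_score(RandomForestClassifier(random_state=0),
                       X, y, cv=5, scoring="roc_auc")
print(aucs.mean(), aucs.std())
```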
Model performance metrics
Most researchers assess the performance of credit scorecards using the AUC [5, 14, 64, 65], due to its ability to indicate a model’s capacity to differentiate between good and bad customers [5]. A higher AUC signifies better discrimination between these two groups [45]. However, AUC has limitations. It can be misleading for poorly fitted models [66] and lacks intuitive interpretation for practitioners [67]. Despite these shortcomings, AUC remains a popular metric in both research and practice [5].
The AUC is calculated as the area under the receiver operating characteristics (ROC) curve, which plots the true positive rate against the false positive rate at various classification thresholds [5]. To assess the statistical significance of differences in AUC between models, we employed the DeLong test [68, 69].
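The DeLong test itself requires the covariance structure of the paired AUC estimates [68]. As a simple alternative for readers without a DeLong implementation at hand, a bootstrap comparison of two models' AUCs on the same test set can be sketched as follows (our illustration, not the procedure used in the study):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y, p1, p2, n_boot=2000, seed=0):
    """95% bootstrap CI for AUC(model 1) - AUC(model 2) on one test set."""
    rng = np.random.default_rng(seed)
    diffs, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:   # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y[idx], p1[idx]) -
                     roc_auc_score(y[idx], p2[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```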
In addition to AUC, misclassification statistics, often presented in a confusion matrix (Table 4), offer a practical way to evaluate credit scorecard performance [4]. This matrix categorizes customers based on their probability of default and compares their actual classification to the scorecard’s prediction, resulting in four cells: true negative, false positive, false negative, and true positive. This comparison helps determine the accuracy of the scorecard’s predictions for good and bad customers.
Table 4. Confusion matrix.
|  |  | Predicted |  |
|---|---|---|---|
|  |  | Good | Bad |
| Actual | Good | True negative | False positive |
|  | Bad | False negative | True positive |
To evaluate a credit scorecard’s accuracy, the true negative rate (specificity) measures the model’s ability to predict non-defaulting (good) customers, while the true positive rate (sensitivity) measures its ability to predict defaulting (bad) customers. The aim is to use the scorecard’s probability of default to reduce false positives and false negatives by adjusting the probability cut-off [4].
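In code, these rates fall directly out of the confusion matrix (our sketch; labels follow Table 4 with 1 = bad and 0 = good, and the arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # true negative rate: goods correctly identified
sensitivity = tp / (tp + fn)   # true positive rate: bads correctly identified
```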
Proposed framework for calculating credit scores
This framework outlines a systematic approach for enhancing credit scoring models by integrating Shapley values [12] into the established methodology of [4]. It encompasses the entire process of deriving credit scores, from the initial predictor variable binning to the final credit score calculation. By incorporating Shapley values, this framework provides a comprehensive pathway to derive more transparent and insightful credit scores, ultimately aiding in informed credit decision-making and model refinement.
Our proposed methodology begins with the binning phase, a crucial step in scorecard development given its significant impact on the final scorecard’s structure [4]. As illustrated in Fig 1, our approach introduces additional stages where one-hot encoding is applied to the binned predictor variables before model fitting, and Shapley values are used in place of logistic regression parameters.
Fig 1. Credit scores calculation process flow—current vs proposed.
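A minimal sketch of the final scoring step (ours; the signs and scaling constants are assumptions rather than the study's exact implementation) averages each one-hot bin's Shapley values over the applicants in that bin and applies the scaling of [4], with the Shapley value taking the place of the logistic regression parameter:

```python
import numpy as np

def bin_credit_scores(phi, onehot, factor, offset, n_vars):
    """Per-bin credit scores with Shapley values replacing beta parameters.

    phi:    SHAP values for the one-hot bin columns, shape (n_applicants, n_bins)
    onehot: the matching 0/1 design matrix
    Signs depend on how good/bad and the model margin are coded (assumption).
    """
    mask = onehot.astype(bool)
    scores = np.empty(phi.shape[1])
    for j in range(phi.shape[1]):
        mean_phi = phi[mask[:, j], j].mean()   # average contribution of bin j
        scores[j] = -mean_phi * factor + offset / n_vars
    return scores

# Toy example: 4 applicants, 2 bins of one variable (values are illustrative)
onehot = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
phi = np.array([[0.4, 0.0], [0.2, 0.0], [0.0, -0.3], [0.0, -0.5]])
print(bin_credit_scores(phi, onehot, factor=28.85, offset=487.12, n_vars=7))
```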
Results and analysis
This section presents the outcomes of the credit scoring models and delves into their performance. This includes an in-depth examination of credit scorecards associated with each model, illustrating how individual predictor variables are practically represented. Through a detailed exploration of these outcomes, this section offers valuable insights into the effectiveness and real-world applicability of the developed models.
Performance of the models
Table 5 presents a comparison of the logistic regression, random forest, XGBoost, LightGBM, and CatBoost models in terms of AUC. The random forest model achieved the highest AUC, followed closely by XGBoost and LightGBM. However, the DeLong test [68] indicates that the differences in AUC among these three models are not statistically significant.
Table 5. AUC and p-values of the models–Taiwan data.
| Model | AUC | p-value vs XGBoost | p-value vs LightGBM | p-value vs Logistic Regression | p-value vs CatBoost |
|---|---|---|---|---|---|
| Random Forest | 0.75929 | 0.41580 | 0.15990 | 0.00021 | 0.00143 |
| XGBoost | 0.75766 |  | 0.47520 | 0.00315 | 0.00056 |
| LightGBM | 0.75690 |  |  | 0.00316 | 0.00310 |
| Logistic Regression | 0.74891 |  |  |  | 0.81190 |
| CatBoost | 0.74793 |  |  |  |  |
Similarly, the AUC values for logistic regression and CatBoost were not significantly different from each other. However, the p-values from the DeLong test show significant differences between the top-performing group (random forest, XGBoost, LightGBM) and the lower-performing group (logistic regression, CatBoost).
Notably, our models outperformed the benchmark AUC of 0.697 reported in previous research [30, 31] that used the same dataset without applying feature engineering. This suggests that feature engineering, which distinguished our study from previous work in terms of predictor variable utilization, contributed to the improved predictive performance.
Table 6 presents the confusion matrices for the Taiwan Credit Card data models, highlighting the superior predictive power of the random forest and XGBoost models. Both achieved the highest overall accuracy (75.717%) and lowest misclassification rate (24.283%), outperforming LightGBM, logistic regression, and CatBoost.
Table 6. Confusion matrices of the models–Taiwan data.
| Model | Actual | Predicted Good | Predicted Bad |
|---|---|---|---|
| Random Forest | Good | 3,765 (80.363%) | 920 |
| Random Forest | Bad | 537 | 778 (59.163%) |
| XGBoost | Good | 3,767 (80.406%) | 918 |
| XGBoost | Bad | 539 | 776 (59.011%) |
| LightGBM | Good | 3,727 (79.552%) | 958 |
| LightGBM | Bad | 522 | 793 (60.304%) |
| Logistic Regression | Good | 3,688 (78.719%) | 997 |
| Logistic Regression | Bad | 533 | 782 (59.468%) |
| CatBoost | Good | 3,657 (78.058%) | 1,028 |
| CatBoost | Bad | 515 | 800 (60.837%) |
Table 7 presents the AUC values of the different models on the Home Credit data. The XGBoost model achieved the highest AUC of 0.69766, and the DeLong test [68] confirmed that the differences in AUC between XGBoost and all other models were statistically significant (p-values < 0.05). The only comparison that did not reach statistical significance was between LightGBM and logistic regression (p = 0.87847), suggesting their AUC values are not significantly different.
Table 7. AUC and p-values of the models–Home Credit data.
| Model | AUC | p-value vs XGBoost | p-value vs LightGBM | p-value vs Logistic Regression | p-value vs CatBoost |
|---|---|---|---|---|---|
| Random Forest | 0.69280 | 0.00000 | 0.00044 | 0.00012 | 0.00002 |
| XGBoost | 0.69766 |  | 0.02081 | 0.00466 | 0.00000 |
| LightGBM | 0.69654 |  |  | 0.87847 | 0.00000 |
| Logistic Regression | 0.69644 |  |  |  | 0.00000 |
| CatBoost | 0.68450 |  |  |  |  |
Table 8 presents the confusion matrices of the Home Credit data models. The XGBoost model achieved the highest overall accuracy (70.335%) and the lowest misclassification rate (29.665%) compared to the other models.
Table 8. Confusion matrices of the models–Home Credit data.
| Model | Actual | Predicted Good | Predicted Bad |
|---|---|---|---|
| Random Forest | Good | 39,422 (69.775%) | 17,077 |
| Random Forest | Bad | 2,057 | 2,947 (58.893%) |
| XGBoost | Good | 40,299 (71.327%) | 16,200 |
| XGBoost | Bad | 2,045 | 2,959 (59.133%) |
| LightGBM | Good | 40,060 (70.904%) | 16,439 |
| LightGBM | Bad | 2,019 | 2,985 (59.652%) |
| Logistic Regression | Good | 39,545 (69.992%) | 16,954 |
| Logistic Regression | Bad | 2,006 | 2,998 (59.912%) |
| CatBoost | Good | 39,178 (69.343%) | 17,321 |
| CatBoost | Bad | 2,024 | 2,980 (59.552%) |
Overall, these results corroborate previous findings [5, 70] demonstrating the superior performance of tree-based models compared to classic techniques like logistic regression in credit risk assessment.
Interpretable credit models–Taiwan data
Previous research, such as [37, 39, 40], focused on providing marginal probability or log-odds contributions of each variable in a model, shedding light on their statistical significance.
Fig 2 illustrates the type of interpretability offered by previous studies, showcasing the log-odds contributions of each predictor variable for a specific customer in the dataset. While statistically informative, this type of output, focused on log-odds or probabilities, may not be readily interpretable or actionable for credit practitioners, who primarily rely on credit scores for decision-making [4]. This section aims to bridge this gap by drawing parallels between the parameters used in logistic regression-based models and those derived from the SHAP framework, and by proposing to replace logistic regression parameters with Shapley values when identifying the top reasons for model predictions. We compare the established method for determining top reasons for credit scorecard predictions [4] with our proposed approach using the SHAP framework [12].
Fig 2. Log-odds of the predictor variables.
In the tables that follow, credit scores falling below the neutral score indicate potential reasons for decline (shaded grey in the original scorecard representations). We provide side-by-side comparisons of credit scores based on both logistic regression parameters and Shapley values. All five models were developed using seven predictor variables with consistent binning.
Tables 9–15 illustrate the credit scores of the predictor variables on the Taiwan data. In most cases, the five models agree on the predictor variable bins that lie below the neutral credit score, thereby presenting potential explanations for customers receiving lower credit scores. The exception is the predictor variable "Average Bill Amount (July, August, September)" in Table 10, where the random forest model suggests that only the bin (-inf, 13.50) could be cited as a reason for an applicant receiving a lower credit score.
Table 9. Average payment indicator—July, August & September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Indicator (July, August, September) | [0.17, inf) | 0.0000 | 77.7914 | 0.0000 | 0.0000 | 0.0000 |
|  | (-inf, 0.17) | 227.9544 | 189.8177 | 203.2322 | 1,145.1521 | 208.4717 |
| Neutral Credit Score |  | 181.5809 | 167.0278 | 161.8880 | 912.1900 | 166.0616 |
Table 15. Ratio September bill amount over a 3-months average bill amount—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio September Bill Amount over a 3-months Average Bill Amount (July, August, September) | [0.83, 1.06) | 30.9308 | 91.7615 | 77.6660 | 94.3556 | 0.0000 |
|  | (-inf, 0.83) | 149.9491 | 145.6139 | 127.7371 | 142.1031 | 146.3869 |
|  | [1.06, inf) | 203.8937 | 191.2323 | 139.0207 | 180.9784 | 193.4898 |
| Neutral Credit Score |  | 109.3790 | 133.7848 | 107.1574 | 131.1313 | 90.5642 |
Table 10. Average bill amount—July, August & September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Bill Amount (July, August, September) | (-inf, 13.50) | 75.4879 | 84.5542 | 69.1873 | 0.0000 | 66.7293 |
|  | [13.50, 49794.83) | 119.9583 | 120.3305 | 119.6997 | 106.1236 | 119.5988 |
|  | [49794.83, inf) | 154.1544 | 126.4671 | 126.5869 | 129.8170 | 165.6828 |
| Neutral Credit Score |  | 128.1800 | 120.3961 | 119.2480 | 107.9997 | 131.0323 |
Table 11. Average payment indicator–April, May, and June.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Indicator (Apr, May, Jun) | [1.50, inf) | 0.0000 | 33.9752 | 0.0000 | 40.5293 | 0.0000 |
|  | [0.50, 1.50) | 55.7196 | 79.5345 | 0.0000 | 82.6911 | 17.5708 |
|  | (-inf, 0.50) | 172.5623 | 232.1287 | 263.2195 | 1,590.1881 | 237.1356 |
| Neutral Credit Score |  | 151.0169 | 205.6815 | 222.4819 | 1,354.2532 | 202.0626 |
Table 12. Ratio September payment indicator divided by a 3-months average payment indicator (July, August, and September).
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio September Payment Indicator over a 3-months Average Payment Indicator (July, August, September) | [1.23, inf) | 72.9460 | 18.1803 | 82.6155 | 0.0000 | 0.0000 |
|  | [0.20, 1.23) | 106.2807 | 88.8360 | 109.3608 | 74.6328 | 45.9130 |
|  | (-inf, 0.20) | 171.8272 | 299.5807 | 149.1770 | 264.4279 | 622.8268 |
| Neutral Credit Score |  | 140.2733 | 201.8103 | 129.3548 | 175.2677 | 370.0489 |
Table 13. Standard deviation of payment indicator—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Standard Deviation of Payment Indicator (July, August, September) | [0.79, inf) | 69.4371 | 0.0000 | 93.2310 | 38.0470 | 27.2959 |
|  | (-inf, 0.79) | 155.7639 | 204.9736 | 132.7333 | 219.7735 | 204.3044 |
| Neutral Credit Score |  | 137.4367 | 161.4577 | 124.3469 | 181.1929 | 166.7255 |
Table 14. Average payment amount—July, August, September.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average Payment Amount (July, August, September) | (-inf, 31.17) | 62.0784 | 0.0000 | 85.1439 | 0.0000 | 0.0000 |
|  | [31.17, 2001.83) | 91.5675 | 7.1893 | 103.2557 | 43.4340 | 0.0000 |
|  | [2001.83, 4312.17) | 131.3691 | 144.8357 | 127.4429 | 141.8376 | 130.2874 |
|  | [4312.17, inf) | 181.3795 | 265.6866 | 156.7996 | 246.9170 | 174.6079 |
| Neutral Credit Score |  | 127.3518 | 121.2622 | 124.6605 | 128.1473 | 86.2906 |
The consistency and similarity in predictor variable input values across models have yielded compelling results. The models largely agree on which input values fall below or above the neutral credit score, demonstrating consistency in identifying potential reasons for credit decline. A significant finding of this research is the successful substitution of logistic regression parameters with Shapley values to derive credit scores using the methodology outlined in [4], showcasing the practical applicability of Shapley values in credit scoring.
Interpretable credit models–Home Credit data
Across the Home Credit data, Tables 16–26 illustrate the credit scores of the eleven predictor variables. Notably, in all instances, the five models consistently agree on which predictor variable bins fall below the neutral credit score, thus providing potential explanations for why customers might receive lower scores.
Table 16. Average–Approved annuity amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average–Approved annuity amount | (-inf, 4160.83) | 23.6388 | 104.1387 | 121.8600 | 107.3289 | 0.0000 |
|  | [4160.83, 8934.75) | 91.5894 | 116.3997 | 121.8615 | 117.3830 | 26.1096 |
|  | [8934.75, inf) | 162.1021 | 128.6020 | 121.8626 | 129.5077 | 142.7540 |
| Neutral Credit Score |  | 135.5539 | 123.9826 | 121.8622 | 125.0247 | 103.1746 |
Table 26. Ratio of annuity amount / Credit amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Ratio of Annuity amount / Credit Amount | [0.05, inf) | 121.8622 | 65.6892 | 109.3365 | 71.7789 | 108.1582 |
|  | (-inf, 0.05) | 141.7548 | 160.1158 | 145.3041 | 177.0741 | 143.2001 |
| Neutral Credit Score |  | 129.9096 | 103.8887 | 123.8869 | 114.3752 | 122.3341 |
Table 17. Maximum number of days (relative to the application date) on which a payment was made for previous instalments.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum number of days (relative to the application date) on which a payment was made for previous instalments | [-19.50, inf) | 121.8622 | 106.8595 | 121.8618 | 115.0803 | 114.3648 |
|  | (-inf, -19.50) | 124.4529 | 125.0546 | 121.8622 | 128.6087 | 123.2026 |
| Neutral Credit Score |  | 124.0384 | 122.1440 | 121.8623 | 126.4446 | 121.7888 |
Table 18. Maximum–Approved credit amount.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum–Approved credit amount | (-inf, 50954.04) | 0.0000 | 0.0000 | 121.8580 | 100.4851 | 88.3454 |
|  | [50954.04, 898398.00) | 123.4501 | 126.0522 | 121.8622 | 122.8748 | 123.8610 |
|  | [898398.00, inf) | 138.4285 | 165.5761 | 121.8623 | 132.4269 | 142.7153 |
| Neutral Credit Score |  | 112.5787 | 117.2354 | 121.8618 | 121.5474 | 122.1043 |
Table 19. Average of external scores (1, 2 & 3).
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average of external scores (1, 2 & 3) | (-inf, 0.42) | 121.8622 | 28.2460 | 83.5474 | 35.5794 | 0.0000 |
|  | [0.42, inf) | 190.7930 | 175.5838 | 138.3170 | 180.3757 | 288.2603 |
| Neutral Credit Score |  | 177.4769 | 147.1210 | 127.7366 | 152.4039 | 232.5741 |
Table 20. Normalized score from external data source 3.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 3 | (-inf, 0.31) | 121.8622 | 0.0000 | 92.8836 | 94.9131 | 0.0000 |
|  | [0.31, inf) | 162.9762 | 215.6522 | 129.7101 | 132.4732 | 160.9843 |
| Neutral Credit Score |  | 156.9643 | 184.1183 | 124.3251 | 126.9810 | 137.4443 |
Table 21. Normalized score from external data source 2.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 2 | (-inf, 0.35) | 121.8622 | 0.0000 | 46.7201 | 0.0000 | 0.0000 |
|  | [0.35, inf) | 172.4943 | 355.6024 | 153.1185 | 339.8045 | 215.9150 |
| Neutral Credit Score |  | 161.6762 | 279.6242 | 130.3853 | 267.2016 | 169.7825 |
Table 22. Normalized score from external data source 1.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Normalized score from external data source 1 | (-inf, 0.27) | 121.8622 | 49.3349 | 121.8610 | 47.4972 | 117.9383 |
|  | [0.27, inf) | 129.7068 | 216.1394 | 121.8623 | 197.6447 | 138.7986 |
| Neutral Credit Score |  | 129.1451 | 204.1939 | 121.8622 | 186.8921 | 137.3047 |
Table 23. Maximum—How many days before current application did client apply for credit bureau credit.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Maximum—How many days before current application did client apply for Credit Bureau credit | [-76.50, inf) | 121.8622 | 80.3987 | 107.6043 | 87.7062 | 11.3244 |
|  | (-inf, -76.50) | 125.7284 | 131.9663 | 123.8594 | 137.1557 | 136.2507 |
| Neutral Credit Score |  | 125.3388 | 126.7700 | 122.2214 | 132.1728 | 123.6623 |
Table 24. Average—Days past due of instalments.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| Average—Days past due of Instalments | [0.08, inf) | 121.8622 | 102.7606 | 115.5192 | 95.9194 | 92.8871 |
|  | (-inf, 0.08) | 138.9704 | 252.5404 | 127.8514 | 238.8954 | 150.6132 |
| Neutral Credit Score |  | 130.6599 | 179.7826 | 121.8609 | 169.4427 | 122.5719 |
Table 25. How many days before the application the person started current employment.
| Predictor Variable | Bin | Logistic Regression | XGBoost | Random Forest | LightGBM | CatBoost |
|---|---|---|---|---|---|---|
| How many days before the application the person started current employment | [6123.75, inf) | 121.8622 | 87.3832 | 121.8599 | 87.3148 | 112.2891 |
|  | (-inf, 6123.75) | 151.7564 | 168.3528 | 121.8692 | 157.0443 | 172.7415 |
| Neutral Credit Score |  | 130.4769 | 110.7163 | 121.8626 | 107.4089 | 129.7098 |
The consistent agreement across all models regarding which predictor variable input values fall below or above the neutral credit score demonstrates the robustness of our approach and reinforces the potential of Shapley values as a viable alternative to logistic regression parameters for deriving interpretable credit scores, as demonstrated in the Taiwan dataset. This finding further supports the applicability of the methodology outlined in [4] for a broader range of credit scoring models.
Conclusion and future work
As noted in the literature, the limited transparency of advanced machine learning models has been a barrier to their widespread adoption in credit scoring due to regulatory requirements [14, 71]. However, our findings demonstrate that transparency need not be a barrier, as credit scores derived from Shapley values align closely with those derived from logistic regression models.
Our research establishes that Shapley values can effectively identify reasons for unfavourable credit reports, aligning with industry practices and providing a valuable tool for interpreting complex machine learning models. Furthermore, our research confirms previous findings [5, 70] that tree-based models like XGBoost and random forest outperform logistic regression in terms of accuracy, solidifying their efficacy in credit scoring.
Building upon these findings, future research should focus on the practical implementation of the proposed interpretability methods within real-world credit scoring scenarios. Additionally, investigating the potential of these methods to enhance the interpretability of other ensemble models in various applications would be a valuable avenue for further exploration.
Data Availability
The Taiwan credit card default dataset is publicly available through the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients), and the Home Credit dataset is available on Kaggle (https://www.kaggle.com/competitions/home-credit-default-risk/data).
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Cierniak-Emerych A, Mazur-Wierzbicka E, Rojek-Nowosielska M. Corporate Social Responsibility in Poland. 2021. doi: 10.1007/978-3-030-68386-3_13
- 2. Crook JN, Edelman DB, Thomas LC. Recent developments in consumer credit risk assessment. Eur J Oper Res. 2007;183. doi: 10.1016/j.ejor.2006.09.100
- 3. Hand DJ, Henley WE. Statistical classification methods in consumer credit scoring: A review. J R Stat Soc Ser A Stat Soc. 1997;160. doi: 10.1111/j.1467-985X.1997.00078.x
- 4. Siddiqi N. Scorecard Development. Intelligent Credit Scoring. John Wiley & Sons, Ltd; 2016. doi: 10.1002/9781119282396.ch2
- 5. Lessmann S, Baesens B, Seow HV, Thomas LC. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur J Oper Res. 2015;247. doi: 10.1016/j.ejor.2015.05.030
- 6. Kelly-Louw M. Introduction to the National Credit Act. Juta's Business Law. 2007;15: 147–159.
- 7. McCorkell PL, Smith AM. Fair Credit Reporting Act update-2008. Business Lawyer. 2009;64.
- 8. Trueck S, Rachev ST. Rating Based Modeling of Credit Risk. 2009. doi: 10.1016/B978-0-12-373683-3.X0001-2
- 9. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. doi: 10.1145/2939672.2939785
- 10. Munkhdalai L, Munkhdalai T, Namsrai OE, Lee JY, Ryu KH. An empirical comparison of machine-learning methods on bank client credit assessments. Sustainability (Switzerland). 2019;11. doi: 10.3390/su11030699
- 11. Alonso-Robisco A, Carbó JM. Can machine learning models save capital for banks? Evidence from a Spanish credit portfolio. International Review of Financial Analysis. 2022;84. doi: 10.1016/j.irfa.2022.102372
- 12. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017.
- 13. Cramer JS. The Origins of Logistic Regression. SSRN Electronic Journal. 2005. doi: 10.2139/ssrn.360300
- 14. Wei S, Yang D, Zhang W, Zhang S. A novel noise-adapted two-layer ensemble model for credit scoring based on backflow learning. IEEE Access. 2019;7. doi: 10.1109/ACCESS.2019.2930332
- 15. Chambers EA, Cox DR. Discrimination between alternative binary response models. Biometrika. 1967;54.
- 16. McFadden D. Conditional logit analysis of qualitative choice behaviour. 1973.
- 17. Fischer ML, Moore K. An Improved Credit Scoring Function for the St. Paul Bank for Cooperatives. 1986.
- 18. Li Y, Chen W. A comparative performance assessment of ensemble learning for credit scoring. Mathematics. 2020;8. doi: 10.3390/math8101756
- 19. Osborne JW. Best Practices in Logistic Regression. 2017. doi: 10.4135/9781483399041
- 20. Tsai CF, Hsu YF, Yen DC. A comparative study of classifier ensembles for bankruptcy prediction. Applied Soft Computing Journal. 2014;24. doi: 10.1016/j.asoc.2014.08.047
- 21. Xia Y, Liu C, Da B, Xie F. A novel heterogeneous ensemble credit scoring model based on bstacking approach. Expert Syst Appl. 2018;93. doi: 10.1016/j.eswa.2017.10.022
- 22. Brown I, Mues C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst Appl. 2012;39. doi: 10.1016/j.eswa.2011.09.033
- 23. Tounsi Y, Anoun H, Hassouni L. CSMAS: Improving Multi-Agent Credit Scoring System by Integrating Big Data and the new generation of Gradient Boosting Algorithms. ACM International Conference Proceeding Series. 2020. doi: 10.1145/3386723.3387851
- 24. Coşkun SB, Turanli M. Credit risk analysis using boosting methods. Journal of Applied Mathematics, Statistics and Informatics. 2023;19: 5–18. doi: 10.2478/jamsi-2023-0001
- 25. Al Daoud E. Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset. International Journal of Computer and Information Engineering. 2019;13.
- 26. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems. 2018.
- 27. Kumar A. Ensemble Learning for AI Developers: Learn Bagging, Stacking, and Boosting Methods with Use Cases. 1st ed. Berkeley, CA: Apress; 2020.
- 28. Ala'raj M, Abbod MF. A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Syst Appl. 2016;64. doi: 10.1016/j.eswa.2016.07.017
- 29. Xia Y, Zhao J, He L, Li Y, Niu M. A novel tree-based dynamic heterogeneous ensemble method for credit scoring. Expert Syst Appl. 2020;159. doi: 10.1016/j.eswa.2020.113615
- 30. Chen D, Ye J, Ye W. Interpretable selective learning in credit risk. Res Int Bus Finance. 2023;65. doi: 10.1016/j.ribaf.2023.101940
- 31. Alam TM, Shaukat K, Hameed IA, Luo S, Sarwar MU, Shabbir S, et al. An investigation of credit card default prediction in the imbalanced datasets. IEEE Access. 2020;8. doi: 10.1109/ACCESS.2020.3033784
- 32. Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion. 2020;58. doi: 10.1016/j.inffus.2019.12.012
- 33. Hertza VA. Fighting unfair classifications in credit reporting: Should the United States adopt GDPR-inspired rights in regulating consumer credit? New York University Law Review. 2018;93.
- 34. Shapley LS. A Value for n-person Games. Contributions to the Theory of Games. Annals of Mathematics Studies. 1953;28.
- 35. Samek W, Montavon G, Vedaldi A, Hansen LK, Müller K-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science (LNCS). 2019;11700.
- 36. Bussmann N, Giudici P, Marinelli D, Papenbrock J. Explainable Machine Learning in Credit Risk Management. Comput Econ. 2021;57. doi: 10.1007/s10614-020-10042-0
- 37. Bracke P, Datta A, Jung C, Sen S. Machine Learning Explainability in Finance: An Application to Default Risk Analysis. SSRN Electronic Journal. 2019. doi: 10.2139/ssrn.3435104
- 38. Elshawi R, Al-Mallah MH, Sakr S. On the interpretability of machine learning-based model for predicting hypertension. BMC Med Inform Decis Mak. 2019;19. doi: 10.1186/s12911-019-0874-0
- 39. Bueff AC, Cytryński M, Calabrese R, Jones M, Roberts J, Moore J, et al. Machine learning interpretability for a stress scenario generation in credit scoring based on counterfactuals. Expert Syst Appl. 2022;202: 117271. doi: 10.1016/j.eswa.2022.117271
- 40. Bussmann N, Giudici P, Marinelli D, Papenbrock J. Explainable AI in Fintech Risk Management. Front Artif Intell. 2020;3. doi: 10.3389/frai.2020.00026
- 41. Yeh I-C. Default of credit card clients. UCI Machine Learning Repository. 2016. doi: 10.24432/C55S3H
- 42. Home Credit Group. Home Credit Default Risk DataSet. In: Kaggle [Internet]. 2018 [cited 3 Jan 2021]. Available: https://www.kaggle.com/c/home-credit-default-risk/data
- 43. Abowitz DA, Toole TM. Mixed Method Research: Fundamental Issues of Design, Validity, and Reliability in Construction Research. J Constr Eng Manag. 2010;136. doi: 10.1061/(asce)co.1943-7862.0000026
- 44. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 2012. doi: 10.1016/C2009-0-61819-5
- 45. Hlongwane R, Ramaboa KKKM, Mongwe W. Enhancing credit scoring accuracy with a comprehensive evaluation of alternative data. PLoS One. 2024;19. doi: 10.1371/journal.pone.0303566
- 46. Hooker G, Mentch L, Zhou S. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Stat Comput. 2021;31. doi: 10.1007/s11222-021-10057-z
- 47. Costa e Silva E, Lopes IC, Correia A, Faria S. A logistic regression model for consumer default risk. J Appl Stat. 2020;47. doi: 10.1080/02664763.2020.1759030
- 48. Mester LJ. What's the Point of Credit Scoring? Business Review. 1997;3.
- 49. Yu L, Yu L, Yu K. A high-dimensionality-trait-driven learning paradigm for high dimensional credit classification. Financial Innovation. 2021;7. doi: 10.1186/s40854-021-00249-x
- 50. Remeseiro B, Bolon-Canedo V. A review of feature selection methods in medical applications. Computers in Biology and Medicine. 2019. doi: 10.1016/j.compbiomed.2019.103375
- 51. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinformatics. 2015;2015. doi: 10.1155/2015/198363
- 52. Kalnins A. Multicollinearity: How common factors cause Type 1 errors in multivariate regression. Strategic Management Journal. 2018;39. doi: 10.1002/smj.2783
- 53. Winter E. Chapter 53: The Shapley value. Handbook of Game Theory with Economic Applications. 2002. doi: 10.1016/S1574-0005(02)03016-3
- 54. Kritzinger N, van Vuuren GW. An optimised credit scorecard to enhance cut-off score determination. South African Journal of Economic and Management Sciences. 2018;21. doi: 10.4102/sajems.v21i1.1571
- 55. Cerda P, Varoquaux G, Kégl B. Similarity encoding for learning with dirty categorical variables. Mach Learn. 2018;107. doi: 10.1007/s10994-018-5724-2
- 56. Yu L, Zhou R, Chen R, Lai KK. Missing Data Preprocessing in Credit Classification: One-Hot Encoding or Imputation? Emerging Markets Finance and Trade. 2022;58. doi: 10.1080/1540496X.2020.1825935
- 57. Jenghara MM, Ebrahimpour-Komleh H, Rezaie V, Nejatian S, Parvin H, Yusof SKS. Imputing missing value through ensemble concept based on statistical measures. Knowl Inf Syst. 2018;56. doi: 10.1007/s10115-017-1118-1
- 58. Aguinis H, Gottfredson RK, Joo H. Best-Practice Recommendations for Defining, Identifying, and Handling Outliers. Organizational Research Methods. 2013. doi: 10.1177/1094428112470848
- 59. Xia Y, Liu C, Li YY, Liu N. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Syst Appl. 2017;78. doi: 10.1016/j.eswa.2017.02.017
- 60. Pan S, Zheng Z, Guo Z, Luo H. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. J Pet Sci Eng. 2022;208. doi: 10.1016/j.petrol.2021.109520
- 61. Yang F, Qiao Y, Qi Y, Bo J, Wang X. BACS: blockchain and AutoML-based technology for efficient credit scoring classification. Ann Oper Res. 2022. doi: 10.1007/s10479-022-04531-8
- 62. Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing. 2020;415. doi: 10.1016/j.neucom.2020.07.061
- 63. Zhang X, Liu CA. Model averaging prediction by K-fold cross-validation. J Econom. 2023;235. doi: 10.1016/j.jeconom.2022.04.007
- 64. Barboza F, Kimura H, Altman E. Machine learning models and bankruptcy prediction. Expert Syst Appl. 2017;83. doi: 10.1016/j.eswa.2017.04.006
- 65. Gurný P, Gurný M. Comparison of credit scoring models on probability of default estimation for US banks. Prague Economic Papers. 2013. doi: 10.18267/j.pep.446
- 66. Lobo JM, Jiménez-Valverde A, Real R. AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography. 2008. doi: 10.1111/j.1466-8238.2007.00358.x
- 67. Halligan S, Altman DG, Mallett S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: A discussion and proposal for an alternative approach. Eur Radiol. 2015;25. doi: 10.1007/s00330-014-3487-0
- 68. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44. doi: 10.2307/2531595
- 69. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577. doi: 10.1038/s41586-019-1799-6
- 70. Ben Jabeur S, Gharib C, Mefteh-Wali S, Ben Arfi W. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technol Forecast Soc Change. 2021;166. doi: 10.1016/j.techfore.2021.120658
- 71. Hosaka T. Bankruptcy prediction using imaged financial ratios and convolutional neural networks. Expert Syst Appl. 2019;117. doi: 10.1016/j.eswa.2018.09.039