Journal of Applied Statistics. 2024 Jan 9;51(12):2481–2488. doi: 10.1080/02664763.2024.2302058

Propensity score matching: a tool for consumer risk modeling and portfolio underwriting

Jennifer Lewis Priestley, Eric VonDohlen
PMCID: PMC11389638  PMID: 39267709

Abstract

Researchers and practitioners in financial services utilize a wide range of empirical techniques to assess risk and value. In cases where known performance is used to predict future performance of a new asset, the risk of bias is present when samples are uncontrolled by the analyst. Propensity score matching is a statistical methodology commonly used in medical and social science research to address issues related to experimental design when random assignment of cases is not possible. This common method has been almost absent from financial risk modeling and portfolio underwriting, primarily due to the different objectives for this sector relative to medicine and social sciences. In this application note, we demonstrate how propensity score matching can be considered as a practical tool to inform portfolio underwriting outside of experimental design. Using a portfolio of distressed consumer credit accounts, we demonstrate that propensity score matching can be used to predict both account-level and portfolio-level risk and argue that propensity score matching should be included in the methodological toolbox of researchers and practitioners engaged in risk modeling and valuation activities of portfolios of consumer assets, particularly in contexts with limited observations, a large number of potential modeling features, or highly imbalanced covariates.

KEYWORDS: Propensity score matching, risk modeling, portfolio underwriting, consumer credit, financial services

1. Introduction

Financial service organizations, including lenders, debt buyers, portfolio servicers, and investors, consider a wide range of available information for the purposes of assessing portfolio-level credit risk and the probability of debt repayment. Where possible, the modeling applications will refer to historical attributes from similar types of accounts for the purposes of predicting repayment using machine learning algorithms [4,13,14] or network-based scoring models [10].

Propensity score matching was developed to mitigate violations of statistical assumptions such as independence and to strengthen the validity of causal claims in observational and quasi-experimental studies [1,21]. The technique is commonly utilized either as an experimental design methodology, where large-scale randomized controlled studies are not possible and researchers are restricted to the data to which they have access, which commonly present statistical violations related to selection bias, or as a tool for data pre-processing in advance of machine learning applications. In addition, propensity score matching addresses at least two common data challenges: covariate imbalance and high dimensionality. First, the method achieves covariate balance between the treatment (treated) and control (evaluation) groups through matching based on Euclidean distance, resulting in improved control of confounding variables relative to traditional supervised modeling, where the emphasis is on the estimation of coefficients. Second, because PSM focuses on likelihood (propensity) scores rather than on coefficients, there is an inherent reduction in dimensionality; traditional supervised modeling approaches with a large number of covariates can lead to overfitting and unstable estimates [6].

Examples of propensity score matching in application are common for the purposes of randomizing observations into treatment and control assignment in medical research [17], educational research [18], and evaluation research [20]. Practitioners of consumer risk modeling and portfolio underwriting are typically less concerned with traditional experimental design assumptions of randomized control and treatment assignment of subjects, which has been the primary application of propensity score matching. However, there is significant value for practitioners engaged in risk modeling and portfolio underwriting in matching the known performance or behavior of existing assets with a portfolio of accounts under consideration, where future performance needs to be predicted and the risk associated with accounts needs to be quantified. While there is a small extant literature related to using propensity score matching to evaluate societal impacts of credit modeling [e.g. 2–4], published research related to the application of the method to evaluate and price a portfolio of consumer credit accounts is limited; one example is a study which formally considered propensity score matching as a pre-processing technique prior to an application of machine learning to extract the most relevant factors related to consumer credit risk [12]. In the current application note, the authors were interested in determining whether propensity score matching could be used directly to predict expected payments and underwrite newly presented portfolios – rather than as a pre-processing technique – bypassing some of the computational costs and complexities related to machine learning applications.

2. Methodology

The medical and sociological research literature is rich with matching methodologies used to approximate randomization in observational studies. The most commonly used methods include Traditional Covariate Adjustment, Propensity Score Stratification, Propensity Score Inverse Probability of Treatment Weighting (IPTW), Propensity Score Covariate Adjustment (using the propensity score as a covariate), and Propensity Score Matching (PSM). In the current application note, the authors have elected to use PSM because the method is accepted as reliable, provides excellent covariate balance in most circumstances, and is simple to analyze and to interpret [9]. Several studies have provided thorough investigations of these and other alternative matching methods [e.g. 16,21,23].

PSM in the current study is conducted by creating a binary variable y ∈ {0,1}, coded '0' for an observation in the file to be reviewed and '1' for observations in the pool with known (i.e. historical) performance. The propensity score is defined as:

S = P(y = 1 | Z = z) (1)

where z is an n-vector of attributes observable across both samples. Logistic regression is the modeling technique most commonly associated with propensity score matching [1]. This binary modeling approach predicts the likelihood of membership in the treatment or in the control group:

log[P(y = 1) / (1 − P(y = 1))] = β0 + β1z1 + … + βnzn (2)

The likelihood can then be used in one of several protocols to select assets for comparison. In the current study, the 'nearest' method was selected, which matches a treated case to the control case with the smallest Euclidean distance measure [5]. A caliper setting establishes the maximum tolerated distance between matched subjects: larger caliper settings result in more matches with reduced similarity, while smaller caliper settings result in potentially fewer matches but with increased similarity. In the current study, the authors tested several potential caliper settings and determined that a value of .1 – the maximum allowed difference between scores – was the smallest difference that enabled a match rate of 100% for the observations of interest. Optimal caliper settings have been found to range between .05 and .30 depending upon the characteristics of the data [1]. All analysis was completed using R (version 4.2.2). Propensity score matching was executed using the MatchIt package, available for download on CRAN [5]. The matching protocol employed in the foregoing analysis is 1:1; specifically, each unit of interest (i.e. credit account to be evaluated) is matched to a single unit from a sample of non-overlapping units from the existing database of accounts with known performance.
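The nearest-neighbor step described above can be sketched in a few lines. The following is an illustrative pure-Python sketch, not the MatchIt implementation; it assumes propensity scores have already been estimated, and the function and variable names are the authors' of this note, not the package's:

```python
def nearest_match(treated, control, caliper=0.1):
    """1:1 nearest-neighbor matching on propensity scores without
    replacement: each treated score is paired with the closest
    remaining control score, provided the gap is within the caliper."""
    available = dict(enumerate(control))  # control index -> score
    pairs = {}
    for t_idx, t_score in enumerate(treated):
        if not available:
            break
        # closest remaining control unit by absolute score distance
        c_idx = min(available, key=lambda i: abs(available[i] - t_score))
        if abs(available[c_idx] - t_score) <= caliper:
            pairs[t_idx] = c_idx
            del available[c_idx]  # matching without replacement
    return pairs
```

With the .1 caliper used in this study, a treated score of .50 would match a control score of .52 but not one of .80.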

3. Application

3.1. Data

The data used here to illustrate the process of using propensity score matching to support portfolio underwriting come from a small finance company specializing in pricing portfolios of distressed consumer credit accounts, where the total of expected payments over time is the variable of interest [19]. For this study, two files from [19] are used – an 'Evaluation' file of 50 charged-off credit card accounts to be evaluated for the expected cumulative value of payments over 60 months and a 'Known' file of 5,000 charged-off credit card accounts which have at least 60 months of known payment history. The dependent variable in this study is the 'liquidation ratio' – the percent of the charged-off account balance that is expected to be collected by the end of the 60-month period. This value will be multiplied by the balance on each account to be evaluated and then summed over the total portfolio to predict the total expected payments or the 'value' of the portfolio; setting the liquidation ratio rather than the expected payment as the dependent variable allows the results to be applied to any balance (large or small) at time of charge-off.

The process of attribute selection for propensity score matching in financial services is similar to more common application contexts (e.g. medicine, sociology) in the sense that variables should reflect aspects of the population to be controlled. For the current study, the most relevant control variables are the amount owed by the consumer, the age of the debt, and geographic considerations, as those variables tend to stratify the ultimate value of a given set of accounts. Recent innovations in explainable AI and Shapley Values [11] can help validate variable selection when the propensity scoring algorithm uses nonlinear applications such as random forest or gradient boosted trees [15]. In the case of traditional linear propensity models, standardized regression coefficients are useful in ranking the importance of each selected variable to the matching exercise.

The selected attributes in the current study include account balance at time of closure, days since charge-off (i.e. the number of days between the account being written off by the original lender and the date the file is considered for purchase by a collection agency), and consumer income. Although most financial modeling applications would likely have access to more (and more complicated) features, these simple attributes are selected for the current study because they are almost always observable at the time the portfolio of accounts is considered for purchase, are generally accepted to be important drivers of underlying asset value, and are easily understood and interpretable regardless of domain experience. Descriptive statistics for the two files can be found in Table 1.

Table 1.

Descriptive statistics for evaluation file and known file (mean(std)).

Variable Evaluation file (n = 50) Known file (n = 5,000)
Liquidation Ratio .55 (.32)* .59 (.34)
Account Balance at Time of Charge Off $773 ($431) $889 ($1,475)
Days Since Charge Off 55 (118) 140 (273)
Income $68,525 ($20,272) $66,149 ($23,904)

 *label removed for matching application.

Both files have a binary ‘source’ variable appended; the ‘source’ variable has been set to ‘0’ for the Evaluation File and ‘1’ for the Known File. For the purposes of this study, the Evaluation File will serve as the ‘control’ group and the Known File will serve as the ‘treatment’ group. The data that support the findings of this application note are openly available at [19].

3.2. Results

The matching results are provided in Tables 2 and 3.

Table 2.

Propensity score matching results – pre matched.

Attribute Means treated (n = 5,000) Means control (n = 50) Std. mean difference Variance ratio
Distance .9901 .9881 .5334 2.0768
Account Balance at Time of Charge Off $889 $773 .0790 11.6693
Days Since Charge Off 140 55 .3091 5.3837
Income $66,149 $68,525 −.0994 1.3904

Table 3.

Propensity score matching results – post matched.

Attribute Means treated (n = 50) Means control (n = 50) Std. mean difference Variance ratio
Distance .9885 .9881 .0990 .9987
Account Balance at Time of Charge Off $713 $773 −.0407 1.2599
Days Since Charge Off 97 55 .1536 8.4123
Income $69,624 $68,525 .0460 1.7452

All 50 accounts from the Evaluation File were matched to an account with 60 months of payment performance in the Known File. From Tables 2 and 3, the results demonstrate the effectiveness of the scoring method. The Pre-Match values of the Evaluation File (Control) and the Known File (Treated) are consistent with the original values provided above in Table 1. The standardized mean differences between the Post-Match values for the matched observations from the Known File and the Evaluation File are lower than the Pre-Match values, indicating that a sample of similar records has been extracted.
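The balance diagnostics reported in Tables 2 and 3 can be computed directly. A minimal sketch of the two statistics, with illustrative helper names (not taken from MatchIt):

```python
from math import sqrt

def _mean(xs):
    return sum(xs) / len(xs)

def _var(xs):
    # sample variance (n - 1 denominator)
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def smd(treated, control):
    """Standardized mean difference: the gap between group means
    divided by the pooled standard deviation of the two groups."""
    pooled_sd = sqrt((_var(treated) + _var(control)) / 2)
    return (_mean(treated) - _mean(control)) / pooled_sd

def variance_ratio(treated, control):
    """Ratio of sample variances; values near 1 suggest balance."""
    return _var(treated) / _var(control)
```

Values of the standardized mean difference near zero and variance ratios near one, as in Table 3, indicate that matching has balanced the covariate.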

3.3. Application of results

At this stage, the analyst can directly apply the results to estimate the value of the portfolio. The liquidation ratio values from the accounts in the Known File can be applied to the account balance at closure values for the matched accounts in the Evaluation File. An example from the output is provided in Table 4.

Table 4.

Application of matches to estimation of portfolio value.

Evaluation file (observation #) Known file (observation #) Matched liquidation ratio (from the known file) Account balance at closure (from the evaluation file) Expected payment (calculated as liquidation ratio * account balance at closure)
1 2213 .6455 $423.21 $273.18
2 4559 .8656 $512.01 $443.20
3 829 .5904 $220.55 $130.21
 …   …   …   …   … 
TOTAL       $21,402.57

The expected payments for each account can be aggregated to a ‘Total Expected Payment’ for the portfolio. The authors tested this approach using a traditional Linear Regression method:

y = β0 + β1x1 + … + βnxn (3)

where the continuous, (approximately) normally distributed liquidation ratio was used as the dependent variable and the same three attributes – Account Balance at Time of Charge Off, Days Since Charge Off, and Income – were used as independent predictors. A wide range of metrics can be used to compare the predictive accuracy of competing models, including formal tests of whether two models differ at all [8]. In the current study, a comparison of model performance for the PSM and the Regression approaches to predicted total payments is provided in Table 5, where the mean absolute error (MAE) and root mean squared error (RMSE) were selected to compare accuracy. The PSM approach outperformed the regression approach on overall percent difference and on MAE but underperformed the regression approach on RMSE.

Table 5.

Modeling results for propensity score matching versus regression.

  Actual total payments Predicted total payments Percent difference Total payments MAE Total payments RMSE
Propensity Score Matching $20,767.17 $21,402.57 .0306 193.18 392.17
Regression $20,767.17 $23,174.40 .1159 211.27 280.21
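The two error metrics reported in Table 5 are straightforward to compute over matched account-level payments. A minimal sketch (names illustrative):

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error across accounts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily
    than MAE, which is why the two metrics can rank models differently."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

That RMSE weights large errors more than MAE explains how the PSM approach can win on MAE while losing on RMSE, as seen in Table 5.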

3.4. Matching model considerations

The presented PSM uses logistic regression to generate the estimated probabilities for the matching logic. The chosen caliper defines which units are 'nearby' a unit of interest across the (0,1) interval. Consequently, the selection of covariates for the regression model and the quality of the probability estimates chiefly determine the quality of the matches. In cases for which propensity score matching is meant to retrospectively create similar samples, the researcher may use machine learning algorithms to achieve 'better' fits for the matching model, particularly when selected covariates are ill-conditioned (e.g. non-monotonic). Common machine learning alternatives to logistic regression for matching within PSM include tree-based methodologies such as Random Forest (RF) [5,22] and Gradient Boosted Trees (GBT) [5,15]. In the context of PSM, both RF and GBT are non-parametric ensemble methods in which classification and regression trees (CART) are developed through a recursive process of partitioning the data based on the covariates, increasing the accuracy of the outcome – the 'matches' – through each iteration [15,22]. Where RF iterates trees in parallel, GBT iterates sequentially. Both non-parametric methods have the potential for increased robustness over logistic regression. However, tree-based algorithms often generate poorly calibrated probability estimates and can suffer from overfitting [7]. Calibrating probabilities from machine learning models often requires additional strategies, such as isotonic regression and use of the log-loss function for tuning the algorithm. These points are illustrated in Table 6.

Table 6.

Comparison of alternative propensity score (post) matching results.

  Logistic regression (Match = 100%) Random forest (Match = 68%) Gradient boosted tree (Match = 36%)
Attribute Means treated Means control Std. mean difference Means treated Means control Std. mean difference Means treated Means control Std. mean difference
Distance .9885 .9881 .0990 .9489 .9478 .0422 .9376 .9337 .2913
Account Balance at Time of Charge Off $713 $773 −.0407 $2,465 $775 1.1458 $2,911 $841 1.403
Days Since Charge Off 97 55 .1536 55 65 −.0337 25 93 −.2476
Income $69,624 $68,525 .0460 $62,633 $66,610 −.1664 $71,893 $68,323 .1493

The standardized mean differences for the (untuned) machine learning algorithms are farther from zero than those of the logistic regression model, and the number of matches for the machine learning algorithms is smaller. The low match rates for the nonlinear models may depend on the caliper chosen, as the empirical distribution of estimated match probabilities will differ from the logistic regression results (the caliper setting was held constant across all approaches). In addition, the low match rates of the tree-based machine learning approaches mean that not every observation (account) is matched – leaving an unacceptably large portion of the target sample unaccounted for. These results indicate that machine learning algorithms applied to propensity scoring problems may require substantially more effort than logistic regression or even simple multivariate stratification.
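Isotonic regression, mentioned above as a calibration strategy, fits a nondecreasing step function from raw scores to probabilities and reduces to the pool-adjacent-violators algorithm (PAVA). A minimal illustrative sketch, assuming the 0/1 outcomes have already been sorted by ascending raw model score (not a production calibration routine):

```python
def pava(labels):
    """Pool-adjacent-violators: given 0/1 outcomes sorted by ascending
    raw score, return a nondecreasing sequence of calibrated
    probabilities by averaging over adjacent violating blocks."""
    blocks = []  # each block is [mean, weight]
    for y in labels:
        blocks.append([float(y), 1.0])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    # expand each block back to one calibrated value per observation
    out = []
    for m, w in blocks:
        out.extend([m] * int(w))
    return out
```

For example, the outcome sequence [0, 1, 0, 1, 1] calibrates to [0.0, 0.5, 0.5, 1.0, 1.0]: the violating middle pair is pooled to its average.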

4. Robustness

Propensity score matching in the current context can be altered in the same manner used in social science applications. For example, the foregoing discussion considers 1:1 matching only, in which each unit of interest is matched with the best single unit with known performance. In many cases, the analyst has access to a much larger pool of units with known performance. In such cases, a 1:n (n > 1) protocol is possible. Options for selection of the number of matching units include selection of a certain number of units that satisfy a particular caliper specification, or selection of units that satisfy a list of calipers that are successively weakened as the analyst develops a sufficient sample. The relevance of the ultimate sample can be assessed by its size relative to the target sample and evaluation of sample moments.
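The 1:n within-caliper selection described above can be sketched for a single unit of interest. The helper below is hypothetical, written for this note rather than drawn from any package:

```python
def match_one_to_n(t_score, control_scores, caliper=0.1, n=3):
    """Select up to n control units whose propensity scores fall
    within the caliper of one treated score, closest first."""
    candidates = [(abs(s - t_score), i)
                  for i, s in enumerate(control_scores)
                  if abs(s - t_score) <= caliper]
    # sort by distance and keep at most n control indices
    return [i for _, i in sorted(candidates)[:n]]
```

Successively weakening the caliper, as the text describes, amounts to re-running this selection with larger `caliper` values until the matched sample is sufficiently large.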

Alternatively, or perhaps concurrently, calipers may be applied to well-chosen covariates that are observable across samples. Under the covariate caliper matching regime, units with covariate values that satisfy all caliper constraints are candidates for matching. This approach is attractive in applications in which there are few, well-controlled covariates, and to the extent that there may be significant differences across some covariates, covariate caliper matching is a straightforward method to assess the validity of a matching exercise. In cases where there are many covariates, there may be ambiguity in selection of a useful subset.

Similarly, the Mahalanobis distance may be used to identify matches across populations. This method differs from the previous methods in that it is model-agnostic, and it considers the covariances of matching covariates in the process. In practice, it is advisable to test more than one method in a matching exercise to ensure robustness of the covariates used and the consistency of matched units.
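For two covariates the Mahalanobis distance can be computed explicitly from the 2×2 covariance matrix. A minimal illustrative sketch (not from the study's analysis):

```python
def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between two 2-vectors given a 2x2
    covariance matrix; unlike Euclidean distance, it accounts for
    the scale of and correlation between the covariates."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    dx = [u[0] - v[0], u[1] - v[1]]
    # squared distance = dx^T * inv(cov) * dx
    tmp = [inv[0][0] * dx[0] + inv[0][1] * dx[1],
           inv[1][0] * dx[0] + inv[1][1] * dx[1]]
    return (dx[0] * tmp[0] + dx[1] * tmp[1]) ** 0.5
```

With an identity covariance matrix the measure reduces to ordinary Euclidean distance; with unequal variances, differences along the high-variance covariate are discounted.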

5. Conclusion

Propensity score matching is a technique that has been widely used in medical and social science research for decades to mitigate statistical violations inherent in observational studies. Because the objectives of practitioners engaged in consumer credit risk assessment and portfolio underwriting differ from those of social science researchers, the applicability of propensity score matching to financial services may not be immediately obvious, resulting in the effective absence of the methodology among portfolio underwriting practitioners and risk researchers. However, we have sought to demonstrate that propensity score matching may be a highly appropriate tool for underwriting and portfolio evaluation, particularly when the pool of historical observations available for modeling may not be sufficient for advanced machine learning algorithms. Propensity score matching allows underwriters to identify which historical samples at their disposal are most appropriate for estimating account repayment specifically and overall portfolio performance more generally. Using the standardized distance diagnostics, propensity score matching provides underwriters and risk practitioners with a method to assess the similarity of the historical data to which they have access to the accounts they are challenged to assess for future performance.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Austin P.C., An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate. Behav. Res. 46 (2011), pp. 399–424. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Baycan O., The effects of exchange rate regimes on economic growth: evidence from propensity score matching estimates. J. Appl. Stat. 43 (2016), pp. 914–924. [Google Scholar]
  • 3.Chen Z., Friedline T., and Lemieux C.M., A national examination on payday loan use and financial well-being: a propensity score matching approach. J. Fam. Econ. Issues. 43 (2022), pp. 678–689. [Google Scholar]
  • 4.Costa e Silva E., Lopes I., Correia A., and Faria S., A logistic regression model for consumer default risk. J. Appl. Stat. 47 (2020), pp. 2879–2894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.MatchIt: Getting Started, CRAN package vignette. URL: cran.r-project.org/web/packages/MatchIt/vignettes/MatchIt.html
  • 6.D'Agostino R. B. Jr., Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat. Med. 17 (19) (1998), pp. 2265–2281. [DOI] [PubMed] [Google Scholar]
  • 7.Dankowski T., and Ziegler A., Calibrating random forests for probability estimation. Stat. Med. 35(22) (2016), pp. 3949–3960. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Diebold F.X., and Mariano R.S., Comparing predictive accuracy. J. Bus. Econ. Stat. 20(1) (2002), pp. 134–144. [Google Scholar]
  • 9.Elze M., Gregson J., and Baber U., Comparison of propensity score methods and covariate adjustment. J. Am. Coll. Cardiol. 69 (3) (2017), pp. 345–357. [DOI] [PubMed] [Google Scholar]
  • 10.Giudici P., Hadji-Misheva B., and Spelta A., Network based scoring models to improve credit risk management in peer to peer lending platforms, Frontiers in Artificial Intelligence 2 (3) (2019), pp. 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Giudici P., and Raffinetti E., Shapley-Lorenz explainable artificial intelligence. Expert. Syst. Appl. 167 (2020), pp. 1–8. [Google Scholar]
  • 12.Karwanski M., and Grzybowska U., Propensity score matching and its application to risk drivers in financial setting. Acta Physica Polonica A 129 (2016), pp. 945–949. [Google Scholar]
  • 13.Kou G., Chao X., Peng Y., Alsaadi F.E., and Herrera-Viedma E., Machine learning methods for systemic risk analysis in financial sectors. Tech Econ Develop Econ 25(5) (2019), pp. 716–742. [Google Scholar]
  • 14.Leo M., Sharma S., and Maddulety K., Machine learning in banking risk management a literature review. Risks 7(1) (2019), pp. 29–51. [Google Scholar]
  • 15.Nguyen V., Example 1: propensity score weights using gradient boosted trees, index of web packages R Examples, R-project.org, October (2023).
  • 16.Olmos A., and Govindasamy P., A practical guide for using propensity score weighting in R. Practical Assessment Research & Evaluation 20(13) (2015), pp. 1–7. [Google Scholar]
  • 17.Peterson J., Paranjape N., Grundlingh N., and Priestley J., Outcomes and adverse effects of Baricitinib versus Tocilizumab in the management of severe COVID-19. Crit. Care Med. 51(3) (2023), pp. 337–346. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Powell M.G., Darrell M., Hull A., and Beaujean A., Propensity score matching for education data: worked examples. J Exp Edu 88 (2020), pp. 145–164. [Google Scholar]
  • 19.Priestley J., Risk scoring using PSM, Mendeley Data, V3. URL: data.mendeley.com/datasets/7djrw3cknc/3
  • 20.Randolph J., Falbe K., Manuel A., and Balloun J., A step-by-step to propensity score matching in R. Practical Assessment Research & Evaluation 19(18) (2014), pp. 1–6. [Google Scholar]
  • 21.Rosenbaum P.R., and Rubin D.B., The central role of the propensity score in observational studies for causal effects. Biometrika 70(1) (1983), pp. 41–55. [Google Scholar]
  • 22.Zhao P., Su X., Ge T., and Fan J., Propensity score and proximity matching using random forest. Contemp. Clin. Trials 47 (2016), pp. 85–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Zhao Q.Y., Luo J.C., Su Y., Zhang Y.J., Tu G.W., and Luo Z., Propensity score matching with R: conventional methods and new features. Ann Trans Med 9(9) (2021), pp. 1–39. [DOI] [PMC free article] [PubMed] [Google Scholar]
