Abstract
Purpose
The availability of population datasets and machine learning techniques heralded a new era of sophisticated prediction models involving a large number of routinely collected variables. However, severe class imbalance in clinical datasets is a major challenge. The aim of this study is to investigate the impact of commonly-used resampling techniques in combination with commonly-used machine learning algorithms in a clinical dataset, to determine whether combination(s) of these approaches improve upon the original multivariable logistic regression with no resampling.
Methods
We previously developed and internally validated a multivariable logistic regression 30-day mortality prediction model in 30,619 patients using preoperative and intraoperative features.
Using the same dataset, we systematically evaluated and compared model performances after application of resampling techniques [random under-sampling, near miss under-sampling, random oversampling, and synthetic minority oversampling (SMOTE)] in combination with machine learning algorithms (logistic regression, elastic net, decision trees, random forest, and extreme gradient boosting).
Results
We found that in the setting of severe class imbalance, the impact of resampling techniques on model performance varied by the machine learning algorithm and the evaluation metric. Existing resampling techniques did not meaningfully improve the area under the receiver operating characteristic curve (AUROC). The area under the precision recall curve (AUPRC) was increased only by random under-sampling and SMOTE for decision trees, and by oversampling and SMOTE for extreme gradient boosting. Importantly, some combinations of algorithm and resampling technique decreased AUROC and AUPRC compared to no resampling.
Conclusion
Existing resampling techniques had a variable impact on models, depending on the algorithms and the evaluation metrics. Future research is needed to improve predictive performances in the setting of severe class imbalance.
Keywords: Machine learning, Class imbalance, Resampling, Risk prediction, Predictive modeling, Anesthesiology
Introduction
Every year, more than 300 million surgeries are performed globally, with more than 4 million deaths [1]. Worldwide, postoperative mortality is the third leading cause of death, after ischaemic heart disease and stroke, contributing to 8% of all deaths [1]. There has been significant research interest in the accurate prediction of postoperative mortality, to assist with perioperative risk stratification, shared decision making, and disposition planning [2–8]. The availability of population perioperative datasets from electronic health records (EHRs) [9] and of machine learning techniques has heralded a new era of sophisticated prediction models. These models involve a large number of routinely collected preoperative and intraoperative variables, such as demographics, surgery types, laboratory values, and vital signs [5, 7, 8]. However, a major challenge in postoperative mortality prediction is the problem of severe class imbalance in clinical datasets [10].
Class imbalance occurs when the event rate of a binary outcome (i.e. an outcome with two categories or classes) is <50%, and it can be particularly problematic when the outcome is rare [10, 11]. In the setting of postoperative mortality prediction, the incidence of mortality is typically 2% or lower [3, 5–8]. Such extreme class imbalance poses two significant challenges during model development and validation. First, during model development, the algorithms will mostly learn from samples without events. Second, standard performance metrics [e.g. accuracy and area under the receiver operating characteristic curve (AUROC)] can be misleading and overly optimistic in the setting of class imbalance [10–12]. For example, with a mortality event rate of only 2%, a model that predicts that no one dies will be wrong only 2% of the time and correct 98% of the time. Such a model would be useless, and even harmful, despite its high accuracy. Moreover, clinicians are more interested in predicting who will have the event in order to implement appropriate interventions (such as increased monitoring), so missing a high-risk patient may lead to dire consequences.
Several methods have been used in the literature to mitigate the issue of imbalanced datasets [10, 11]; these can be grouped into two broad categories: performance metrics and resampling techniques. In terms of performance metrics, several alternatives have been studied [13]. The area under the precision-recall curve (AUPRC) summarizes the curve of precision [positive predictive value (PPV)] against recall (sensitivity) and indicates how well the model can predict true positives [12]. The baseline AUPRC for a given model is the event rate in the dataset [12]. However, this metric has only been reported in two papers in the postoperative mortality prediction literature [7, 8]. In terms of model development, a possible approach involves using resampling techniques in the derivation set [7, 14, 15] to balance the class distribution, either by sampling more from patients with events (over-sampling) or by sampling less from patients without events (under-sampling). Note that resampling is never applied to the validation set, so that model performances can be correctly calculated in a dataset with the unaltered, real-life event rate.
However, how different resampling techniques for class imbalance affect model performance when used in combination with machine learning algorithms (learners) remains an area of active research, and the answer is important for informing best practices in clinical machine learning modeling. While some studies found that resampling techniques improved model performance, the effectiveness of a resampling strategy varies with the dataset and the learner [16, 17]. Because it is usually unknown in advance which learner will perform best on a given dataset, one must also test multiple learners for the domain task under consideration. Special-purpose algorithms have been proposed to learn in conditions where the data is imbalanced, such as different versions of extreme gradient boosting (XGBoost) [18, 19]. Moreover, some characteristics of medical data, such as high dimensionality, can lead to problems that make resampling strategies less effective in improving model performance [18, 19].
We have previously developed and internally validated a multivariable logistic regression 30-day mortality prediction model in 30,619 patients using preoperative and intraoperative features, with validation set AUROC of 0.893 (95% CI 0.861–0.920) and AUPRC of 0.158 (baseline 0.017 based on mortality rate) [8]. The aim of this study is to systematically evaluate and compare commonly-used resampling techniques [random under-sampling, near miss under-sampling, random oversampling, and synthetic minority oversampling (SMOTE)] in combination with commonly-used machine learning algorithms (logistic regression, elastic net, decision trees, random forest, and extreme gradient boosting) in a clinical dataset where the events are rare, to determine whether combination(s) of these approaches improve upon the original multivariable logistic regression with no resampling.
Materials and methods
We received research ethics approval (Nova Scotia Health Authority Research Ethics Board, Halifax, NS, Canada; file # 1024251), with a waiver of informed consent for secondary analysis of a de-identified population dataset developed in a published retrospective cohort study [8]. Figure 1 illustrates the methodology of this study.
Dataset
Please see the previous publication [8] for details about the dataset. Briefly, the cohort consists of 30,619 patients aged ≥45 years undergoing inpatient noncardiac surgery (except deceased organ donation) at two tertiary academic hospitals in Halifax, Nova Scotia, Canada, between January 1, 2013 and December 1, 2017. Multiple sources of data [the hospitals’ Anesthesia Information Management System (AIMS) containing intraoperative vital signs and anesthetic interventions, the hospitals’ perioperative EHR, the Nova Scotia Vital Statistics database which documents all deaths within Nova Scotia, and the Canadian Institute for Health Information (CIHI) Discharge Abstract Database (DAD)] were linked to create the final de-identified dataset containing preoperative, intraoperative, and postoperative variables. The final de-identified dataset was extracted by and accessed through Health Data Nova Scotia (HDNS).
To mirror real-life application, where prediction models are built using data from the past and validated with prospective data, the cohort was divided into derivation and validation sets temporally by ranking surgery dates from the earliest to the latest [8]. The derivation set consisted of patients with the earliest 75% of the surgery dates, and the validation set consisted of patients with the latest 25% of the surgery dates (i.e. a 75:25 division). There were no major differences in cohort characteristics between the derivation and validation datasets (standardized mean differences were all <0.2) [8].
Also, the models in this study used the same final set of features (predictors) and event (outcome) as the previously published primary multivariable logistic regression model [8]. The event was all-cause mortality within 30 days after surgery, in- or out-of-hospital. The event rate was 2.2% (493/22,964) in the derivation set and 1.7% (131/7,655) in the validation set (i.e., imbalance ratios of 2.2% and 1.7%, respectively). While the class imbalance within the derivation and validation sets is severe, it is within the range reported in the literature for postoperative mortality, where the typical event rate is approximately 2% or lower. The features (predictors) are listed in Table 1. The mean (SD) age was 66 (11) years, with 50.2% female [8].
Table 1. Features (predictors) included in the models

Preoperative features
- Age (years)
- Female sex
- Emergency surgery
- Procedural Index for Mortality Risk
- Surgery type (compared to general surgery):
  - Neurosurgery
  - Obstetrics and gynecology
  - Orthopedic surgery
  - Other
  - Otolaryngology
  - Plastic surgery
  - Thoracic surgery
  - Urology
  - Vascular surgery
- Hypertension
- Chronic obstructive pulmonary disease
- Revised Cardiac Risk Index
- Elixhauser Comorbidity Index
- Hospital Frailty Risk Score
- Obesity

Vital signs features
- Max. decrease (%) of SBP relative to first AIMS SBP
- Cumulative duration (minutes) of MAP < 70 mmHg
- Max. change of HR in 10 BPM above first AIMS HR
- Max. change of HR in 10 BPM below first AIMS HR
- Cumulative duration (minutes) of HR < 60
- Cumulative duration (minutes) of HR > 100
- Cumulative duration (minutes) of SpO2 < 88%
- Cumulative duration (hours) of temperature <36 °C
- Cumulative duration (hours) of temperature >38 °C
- Cumulative duration (minutes) of ETCO2 < 30 mmHg, GA
- Cumulative duration (minutes) of ETCO2 > 45 mmHg, GA

Other intraoperative features
- Duration of surgery (hours)
- General anesthesia
- Neuraxial anesthesia
- Peripheral nerve block
- Laparoscopic surgery with no conversion to open
- Age-adjusted MAC during GA, time averaged
- Crystalloid (L) above 1 L
- Use of vasopressors and inotropes
- Use of vasodilators
For the surgery type, the reference group was general surgery
AIMS Anesthesia Information Management System, BPM beats per minute, ETCO2 end-tidal CO2, GA general anesthesia, HR heart rate, IQR interquartile range, MAC Minimal Alveolar Concentration, MAP mean arterial pressure, Max. maximum, SE standard error, SBP systolic blood pressure
Statistical analysis
Software
Data analysis was performed on the HDNS Citadel server using R 4.1.0 and Python 2.7.10 (in particular scikit-learn, hyperopt, and imblearn) [20–23].
Pre-processing
To facilitate machine learning, categorical variables were one-hot encoded, and continuous variables were standardized to zero mean and unit standard deviation.
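As an illustration, a minimal pre-processing sketch with scikit-learn is shown below; the column names are hypothetical placeholders for the features listed in Table 1.

```python
# A minimal pre-processing sketch using scikit-learn. The column names
# below are hypothetical placeholders; the actual features are in Table 1.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["surgery_type", "emergency_surgery"]   # hypothetical
continuous_cols = ["age_years", "surgery_duration_hours"]  # hypothetical

preprocessor = ColumnTransformer([
    # One hot encoding for categorical variables
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Standardize continuous variables to zero mean and unit SD
    ("scale", StandardScaler(), continuous_cols),
])

# Fit on the derivation set only, then reuse on the validation set:
# X_train = preprocessor.fit_transform(derivation_df)
# X_valid = preprocessor.transform(validation_df)
```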
Resampling techniques
The resampling approaches tested were: no resampling, random under-sampling [24, 25], near miss under-sampling [26], random oversampling [24, 25], and SMOTE [10, 11]. The more frequent class (i.e. patients without mortality in our dataset) is referred to as the majority class, and the less frequent class (i.e. patients with mortality) as the minority class. The class weight and resampling ratio are amongst the hyperparameters that needed to be tuned (please see “Hyperparameter optimization” below). Under-sampling addresses class imbalance by removing samples from the majority class, either randomly (random under-sampling) or based on proximity between majority- and minority-class samples in the feature space (near miss under-sampling). Over-sampling balances the dataset by adding samples to the minority class, either as randomly selected replicas (random over-sampling) or as synthetic samples generated by interpolating between randomly selected minority-class subjects and their k nearest neighbours (SMOTE) [27, 28].
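A sketch of how these four techniques can be instantiated with the imbalanced-learn package [23] follows; the `sampling_strategy` values are placeholders, since the resampling ratio was tuned as a hyperparameter (Appendix 1).

```python
# Resampling sketch using imbalanced-learn; applied to the derivation
# set only, never the validation set. sampling_strategy values are
# placeholders (the tuned values appear in Appendix 1).
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE

samplers = {
    "random_under": RandomUnderSampler(sampling_strategy=0.5, random_state=0),
    "near_miss": NearMiss(sampling_strategy=0.5),
    "random_over": RandomOverSampler(sampling_strategy=0.5, random_state=0),
    "smote": SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0),
}

# Example: rebalance the derivation set with SMOTE
# X_res, y_res = samplers["smote"].fit_resample(X_train, y_train)
```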
Machine learning algorithms
The algorithms developed in the derivation set were logistic regression, elastic net [29], decision tree [30], random forest [31], and XGBoost [32]. Elastic net is logistic regression with L1 and L2 regularization, which helps reduce overfitting. Decision trees are simple models that are interpretable. XGBoost and random forest are ensemble methods based on decision trees, which work well for nonlinear relationships, can reveal feature importance, and help reduce overfitting.
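For illustration, the five learners as they might be instantiated with scikit-learn and xgboost are sketched below; the hyperparameter values shown are placeholders (the tuned values appear in Appendix 1).

```python
# The five learners compared in this study; hyperparameter values here
# are illustrative placeholders (tuned values appear in Appendix 1).
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

learners = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    # Elastic net: logistic regression with mixed L1/L2 regularization
    "elastic_net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=1.0, max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=10),
    "random_forest": RandomForestClassifier(n_estimators=500),
    "xgboost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
}
```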
Hyperparameter optimization
Each algorithm has hyperparameters that must be optimized (Appendix 1). Hyperparameter tuning for the algorithms, class weight, and resampling ratio is necessary to ensure that, when comparing the different models, the differences are due to the algorithm or the resampling technique rather than suboptimal hyperparameter settings. Bayesian optimization [22, 33] was used to efficiently obtain the best hyperparameters for each model, with five-fold stratified cross-validation in the derivation set using AUROC as the metric. The models were subsequently validated in the validation set using the best hyperparameters.
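The sketch below illustrates this tuning loop with hyperopt [22]: the tree-structured Parzen estimator proposes hyperparameters, and each candidate is scored by five-fold stratified cross-validated AUROC on the derivation set. The search space shown is a simplified decision tree example; the full grids are in Appendix 1.

```python
# Bayesian hyperparameter optimization sketch with hyperopt, scoring
# candidates by five-fold stratified cross-validated AUROC on the
# derivation set (X_train, y_train assumed defined). The search space
# is a simplified decision tree example; full grids are in Appendix 1.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

space = {
    "max_depth": hp.quniform("max_depth", 5, 70, 1),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    model = DecisionTreeClassifier(max_depth=int(params["max_depth"]),
                                   criterion=params["criterion"])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    auroc = cross_val_score(model, X_train, y_train,
                            cv=cv, scoring="roc_auc").mean()
    return {"loss": -auroc, "status": STATUS_OK}  # hyperopt minimizes

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())
```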
Model validation
Model performances were evaluated in the validation set. A variety of metrics were calculated, since no single metric is sufficient for discerning the best model [34, 35]. The main metrics assessed were AUROC [95% CI calculated from 2000 stratified bootstrap replicates preserving the original event rate in the validation set] and AUPRC.
In addition, sensitivity (also known as recall), specificity, positive predictive value (also known as precision), negative predictive value, F1-score (harmonic mean of precision and recall, i.e. measure of accuracy), and G-mean (geometric mean of sensitivity and specificity, helpful in imbalanced datasets) were assessed at a probability threshold of 0.5.
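A sketch of these calculations is shown below, assuming validation-set labels `y_valid` and predicted probabilities `p_valid` from a fitted model; the bootstrap resamples each class separately so that every replicate keeps the original event rate.

```python
# Validation metrics sketch; y_valid and p_valid (predicted
# probabilities) are assumed to be NumPy arrays from a fitted model.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score, f1_score)

auroc = roc_auc_score(y_valid, p_valid)
auprc = average_precision_score(y_valid, p_valid)  # baseline = event rate

# Stratified bootstrap (2000 replicates): resample each class
# separately so every replicate keeps the original event rate
rng = np.random.default_rng(0)
pos, neg = np.where(y_valid == 1)[0], np.where(y_valid == 0)[0]
boot = []
for _ in range(2000):
    idx = np.concatenate([rng.choice(pos, len(pos), replace=True),
                          rng.choice(neg, len(neg), replace=True)])
    boot.append(roc_auc_score(y_valid[idx], p_valid[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Threshold-dependent metrics at a probability cutoff of 0.5
y_pred = (p_valid >= 0.5).astype(int)
sensitivity = recall_score(y_valid, y_pred)               # recall
specificity = recall_score(y_valid, y_pred, pos_label=0)
ppv = precision_score(y_valid, y_pred)                    # precision
npv = precision_score(y_valid, y_pred, pos_label=0)
f1 = f1_score(y_valid, y_pred)
g_mean = np.sqrt(sensitivity * specificity)
```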
Results
Compared to logistic regression at baseline (without resampling), the other models (elastic net, decision tree, random forest, and XGBoost) did not meaningfully improve upon the AUROC or AUPRC. Different resampling techniques impacted each algorithm differently. A summary of the performances from combinations of each resampling technique and each algorithm is displayed in Fig. 2 for AUROC and Fig. 3 for AUPRC. Importantly, despite the high AUROC, all models exhibited poor calibration (Appendix 2).
Overall, existing resampling techniques improved some performance metrics (i.e. AUPRC) for decision tree and XGBoost, but did not meaningfully improve model performances for logistic regression, elastic net, and random forest. Random over-sampling minimally improved AUROC for decision trees (0.837 with over-sampling vs. 0.817 without resampling), though this needs to be interpreted in the context of lower baseline AUROC compared to other algorithms. The AUPRC was only increased by random under-sampling and SMOTE for decision trees (increase between 0.05 and 0.1, approximately), and oversampling and SMOTE for XGBoost (increase between 0.3 and 0.35, approximately).
Importantly, some combinations of algorithm and resampling technique decreased AUROC and AUPRC compared to no resampling. For example, all resampling techniques returned worse AUROC for XGBoost (with a decrease ranging between 0.1 and 0.3), and near miss under-sampling decreased the AUROC of logistic regression (approximately 0.3 decrease) and elastic net (approximately 0.15 decrease). For AUPRC, SMOTE decreased AUPRC for logistic regression and random forest (approximately 0.1 decrease), while near miss under-sampling decreased the AUPRC for elastic net, random forest, and XGBoost (decrease ranging between 0.05 and 0.1). Overall, the bootstrapped AUROC results were consistent with the non-bootstrapped results across all learning algorithms and resampling techniques, with one exception: without any resampling, the bootstrapped AUROC was generally lower, by approximately 0.1 for elastic net, decision trees, and random forest, and by approximately 0.4 for XGBoost. For logistic regression with no resampling, a performance of approximately 0.9 was observed for both the bootstrapped and non-bootstrapped alternatives.
The performances of the five algorithms in combination with each resampling approach are displayed in Table 2 (no resampling), Table 3 (random under-sampling), Table 4 (near-miss under-sampling), Table 5 (random over-sampling), and Table 6 (SMOTE). Table 7 provides the p-values of a paired t-test carried out for each learning algorithm comparing pairs of resampling techniques, on both AUROC and AUPRC scores, with the Benjamini–Hochberg correction [39] for multiple hypothesis testing.
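The sketch below illustrates this comparison procedure, assuming a hypothetical `scores` dict of paired per-replicate AUROC (or AUPRC) arrays per resampling technique for a fixed learner; the exact pairing unit is not specified in the text, so this is illustrative only.

```python
# Pairwise paired t-tests with Benjamini-Hochberg correction.
# `scores` is a hypothetical dict: resampling technique name ->
# paired array of AUROC (or AUPRC) scores for one learner.
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

pairs = list(combinations(scores.keys(), 2))
raw_p = [ttest_rel(scores[a], scores[b]).pvalue for a, b in pairs]

# Benjamini-Hochberg false discovery rate correction across all pairs
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for (a, b), p in zip(pairs, adj_p):
    print(f"{a} vs {b}: adjusted p = {p:.4f}")
```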
Table 2. Model performance in the validation set with no resampling
Metric | Logistic regression | Elastic net | Decision trees | Random forest | XGBoost |
---|---|---|---|---|---|
AUROC | 0.892 | 0.897 | 0.817 | 0.903 | 0.888 |
Bootstrapped AUROC mean | 0.890 | 0.733 | 0.702 | 0.777 | 0.485 |
Bootstrapped AUROC 95% CI | 0.859–0.918 | 0.698–0.769 | 0.657–0.749 | 0.743–0.809 | 0.436–0.537 |
AUPRC | 0.166 | 0.169 | 0.125 | 0.135 | 0.135 |
Precision | 0.400 | 0.074 | 0.040 | 0.111 | 0.166 |
Recall (sensitivity) | 0.015 | 0.853 | 0.830 | 0.700 | 0.030 |
F1-score | 0.030 | 0.136 | 0.080 | 0.190 | 0.051 |
Specificity | 0.999 | 0.816 | 0.962 | 0.903 | 0.997 |
Negative predictive value | 0.980 | 0.996 | 0.995 | 0.994 | 0.983 |
Geometric mean | 0.124 | 0.834 | 0.752 | 0.795 | 0.175 |
Bootstrapping was performed using 2000 replicates
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, CI confidence interval
Table 3. Model performance in the validation set with random under-sampling
Metric | Logistic regression | Elastic net | Decision trees | Random forest | XGBoost |
---|---|---|---|---|---|
AUROC | 0.893 | 0.895 | 0.609 | 0.906 | 0.775 |
Bootstrapped AUROC mean | 0.897 | 0.897 | 0.609 | 0.906 | 0.775 |
Bootstrapped AUROC 95% CI | 0.871–0.923 | 0.870–0.923 | 0.573–0.648 | 0.880–0.930 | 0.747–0.800 |
AUPRC | 0.157 | 0.169 | 0.182 | 0.151 | 0.022 |
Precision | 0.080 | 0.071 | 0.070 | 0.089 | 0.020 |
Recall | 0.823 | 0.838 | 0.270 | 0.861 | 0.961 |
F1-score | 0.145 | 0.132 | 0.118 | 0.161 | 0.040 |
Specificity | 0.836 | 0.813 | 0.941 | 0.847 | 0.261 |
Negative predictive value | 0.996 | 0.996 | 0.986 | 0.997 | 0.997 |
Geometric mean | 0.829 | 0.825 | 0.510 | 0.845 | 0.501 |
Bootstrapping was performed using 2000 replicates
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, CI confidence interval
Table 4. Model performance in the validation set with near miss under-sampling
Metric | Logistic regression | Elastic net | Decision trees | Random forest | XGBoost |
---|---|---|---|---|---|
AUROC | 0.564 | 0.737 | 0.767 | 0.894 | 0.750 |
Bootstrapped AUROC mean | 0.564 | 0.738 | 0.767 | 0.894 | 0.750 |
Bootstrapped AUROC 95% CI | 0.518–0.611 | 0.699–0.778 | 0.729–0.805 | 0.868–0.918 | 0.727–0.780 |
AUPRC | 0.171 | 0.055 | 0.114 | 0.109 | 0.077 |
Precision | 0.018 | 0.029 | 0.040 | 0.125 | 0.150 |
Recall | 0.923 | 0.815 | 0.715 | 0.007 | 0.084 |
F1-score | 0.037 | 0.057 | 0.070 | 0.014 | 0.100 |
Specificity | 0.175 | 0.541 | 0.704 | 0.999 | 0.980 |
Negative predictive value | 0.992 | 0.994 | 0.993 | 0.983 | 0.990 |
Geometric mean | 0.402 | 0.664 | 0.710 | 0.087 | 0.380 |
Bootstrapping was performed using 2000 replicates
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, CI confidence interval
Table 5. Model performance in the validation set with random over-sampling
Metric | Logistic regression | Elastic net | Decision trees | Random forest | XGBoost |
---|---|---|---|---|---|
AUROC | 0.892 | 0.895 | 0.837 | 0.906 | 0.756 |
Bootstrapped AUROC mean | 0.893 | 0.895 | 0.727 | 0.906 | 0.754 |
Bootstrapped AUROC 95% CI | 0.863–0.920 | 0.866–0.921 | 0.687–0.767 | 0.881–0.928 | 0.718–0.786 |
AUPRC | 0.166 | 0.169 | 0.082 | 0.146 | 0.441 |
Precision | 0.080 | 0.074 | 0.051 | 0.080 | 0.017 |
Recall | 0.800 | 0.838 | 0.776 | 0.800 | 0.946 |
F1-score | 0.154 | 0.136 | 0.105 | 0.158 | 0.030 |
Specificity | 0.845 | 0.818 | 0.785 | 0.856 | 0.104 |
Negative predictive value | 0.996 | 0.996 | 0.992 | 0.995 | 0.990 |
Geometric mean | 0.826 | 0.828 | 0.775 | 0.827 | 0.314 |
Bootstrapping was performed using 2000 replicates
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, CI confidence interval
Table 6. Model performance in the validation set with SMOTE
Metric | Logistic regression | Elastic net | Decision trees | Random forest | XGBoost |
---|---|---|---|---|---|
AUROC | 0.808 | 0.895 | 0.334 | 0.604 | 0.570 |
Bootstrapped AUROC mean | 0.808 | 0.895 | 0.334 | 0.603 | 0.578 |
Bootstrapped AUROC 95% CI | 0.769–0.845 | 0.866–0.922 | 0.291–0.375 | 0.554–0.657 | 0.555–0.598 |
AUPRC | 0.091 | 0.166 | 0.229 | 0.026 | 0.484 |
Precision | 0.078 | 0.075 | 0.010 | 0.016 | 0.017 |
Recall | 0.423 | 0.838 | 0.923 | 0.976 | 0.969 |
F1-score | 0.132 | 0.138 | 0.03 | 0.032 | 0.034 |
Specificity | 0.913 | 0.821 | 0.077 | 0.008 | 0.058 |
Negative predictive value | 0.989 | 0.996 | 0.983 | 0.955 | 0.990 |
Geometric mean | 0.621 | 0.830 | 0.266 | 0.090 | 0.237 |
Bootstrapping was performed using 2000 replicates
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, CI confidence interval
Table 7. Benjamini–Hochberg-corrected p-values of paired t-tests comparing pairs of resampling techniques for each learner, on AUROC and AUPRC
Learner | Resampling | AUROC: SMOTE | AUROC: RUS | AUROC: ROS | AUROC: NearM | AUPRC: SMOTE | AUPRC: RUS | AUPRC: ROS | AUPRC: NearM |
---|---|---|---|---|---|---|---|---|---|
Decision trees | Baseline | 0.0000 | 0.0000 | 0.0065 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0309 |
 | SMOTE | – | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
 | RUS | – | – | 0.0000 | 0.0000 | – | – | 0.0000 | 0.0000 |
 | ROS | – | – | – | 0.0000 | – | – | – | 0.0002 |
Elastic net | Baseline | 0.8285 | 0.8230 | 0.5734 | 0.0000 | 0.2968 | 0.9682 | 0.9682 | 0.0000 |
 | SMOTE | – | 0.8592 | 0.8592 | 0.0000 | – | 0.2968 | 0.2206 | 0.0000 |
 | RUS | – | – | 0.8285 | 0.0000 | – | – | 0.9755 | 0.0000 |
 | ROS | – | – | – | 0.0000 | – | – | – | 0.0000 |
Logistic regression | Baseline | 0.0000 | 0.3181 | 0.2290 | 0.0000 | 0.0000 | 0.0882 | 0.0520 | 0.0933 |
 | SMOTE | – | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
 | RUS | – | – | 0.5566 | 0.0000 | – | – | 0.0260 | 0.0238 |
 | ROS | – | – | – | 0.0000 | – | – | – | 0.0816 |
Random forest | Baseline | 0.0000 | 0.7833 | 0.1926 | 0.0304 | 0.0000 | 0.0043 | 0.0692 | 0.0000 |
 | SMOTE | – | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0030 |
 | RUS | – | – | 0.3952 | 0.0086 | – | – | 0.0030 | 0.0000 |
 | ROS | – | – | – | 0.0133 | – | – | – | 0.0000 |
XGBoost | Baseline | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
 | SMOTE | – | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
 | RUS | – | – | 0.0000 | 0.0007 | – | – | 0.0000 | 0.0000 |
 | ROS | – | – | – | 0.9138 | – | – | – | 0.0000 |
AUROC area under the receiver operating curve, AUPRC area under the precision recall curve, NearM near miss under-sampling, ROS random oversampling, RUS random under-sampling, SMOTE synthetic minority oversampling
Discussion
In medical settings, prediction models must be able to accurately identify which patients will have rare but severe outcomes, such as postoperative mortality. However, model performances are often overestimated by commonly reported metrics such as the AUROC [12], and the optimal approaches to improve a model's ability to identify positive cases in the setting of severe class imbalance remain unknown. Our study systematically evaluated the performances of combinations of algorithms and resampling techniques in a highly imbalanced cohort with imbalance ratios of 2.2% and 1.7% in the derivation and validation sets, respectively. This study found that in the setting of severe class imbalance, the impact of resampling techniques on model performance varied by the machine learning algorithm and the evaluation metric, and that existing resampling techniques are inadequate to meaningfully improve model performances. With the exception of decision trees and XGBoost, the resampling techniques worsened or only minimally improved AUROC or AUPRC. For decision trees, random oversampling minimally improved AUROC, while random under-sampling and SMOTE improved AUPRC. For XGBoost, random oversampling and SMOTE increased AUPRC but at the cost of decreased AUROC. Some combinations of algorithm and resampling techniques worsened model performance.
The overall poor AUROC results may indicate that the dataset has other characteristics that resampling techniques did not resolve. The loss in performance observed when applying near miss under-sampling compared to random under-sampling points to the presence of factors not directly related to the imbalance, such as small disjuncts, that these techniques do not address. Moreover, the performance loss observed for all learners when applying SMOTE indicates that at least one of the known scenarios that cause SMOTE to fail (e.g. small disjuncts, presence of outliers) is present in this case. Regarding the AUPRC results, the same impact on performance is observed, with the exception of random under-sampling and SMOTE for decision trees, and random oversampling and SMOTE for XGBoost. Still, these improvements are obtained in a setting where all alternatives exhibit very low performance (below 0.2). Overall, this shows that the predictive task involves other, more complex factors that resampling techniques are generally not able to address.
Importantly, despite high AUROC, all models demonstrated poor calibration. The additional impact of different techniques to improve calibration in combination with resampling techniques and learners, including Platt scaling, isotonic regression, or Bayesian binning into quantiles, should also be further explored in future studies. Our study highlights that further research is much needed to develop techniques to improve prediction in the setting of severe class imbalance that is common in many medical applications of machine learning [10]. When modeling using a dataset with class imbalance, researchers may consider exploring a variety of resampling techniques in combination with different algorithms.
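As a pointer for such work, the sketch below shows how Platt scaling and isotonic regression can be applied post hoc with scikit-learn; Bayesian binning into quantiles has no scikit-learn implementation and would need separate code.

```python
# Post-hoc calibration sketch with scikit-learn, assuming a fitted
# base estimator `base_model`. method="sigmoid" is Platt scaling;
# method="isotonic" is isotonic regression.
from sklearn.calibration import CalibratedClassifierCV

platt = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
isotonic = CalibratedClassifierCV(base_model, method="isotonic", cv=5)

# platt.fit(X_train, y_train)
# calibrated_p = platt.predict_proba(X_valid)[:, 1]
```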
Our results align with the findings from a recent study by van den Goorbergh et al. [36]. They examined model performances of standard and penalized logistic regression models in the setting of under-sampling, over-sampling, and SMOTE, in a dataset of 3369 patients with an event rate of 20%. They also performed Monte Carlo simulations of various numbers of predictors and outcome event rates. They found that resampling methods did not improve AUROC and worsened calibration, though they did not examine AUPRC. Rather than changing the event distribution, van den Goorbergh et al. suggest refining the probability threshold for classifying events versus non-events. Finding the optimal probability threshold may be difficult in clinical practice, as one must balance the consequences of not intervening on patients with events against unnecessarily treating patients without events. We agree with van den Goorbergh et al. that a decision curve analysis approach focusing on Net Benefit [37] may be complementary for optimizing clinical utility.
Other studies using medical data suggest that resampling techniques effectively improve performance [16]. However, how resampling strategies impact performance is not well understood, especially in the presence of dataset-specific factors, such as small disjuncts, that may hinder the performance of resampling techniques and add complexity to the predictive task [11]. Problems intrinsic to the resampling techniques have also been identified. For instance, in high-dimensional datasets, issues with SMOTE include a decrease in the variability of the minority class and the introduction of correlation between some samples [38]. Our results align with these difficulties, except for XGBoost when observing the AUPRC. Regarding XGBoost, mixed results have been reported on its capability to deal with imbalanced data [18]. Our results align with the literature in that the impact on XGBoost performance differs depending on the metric and the resampling technique applied.
Regarding paired t-tests of model performances with the Benjamini–Hochberg correction [39] for multiple hypothesis testing (Table 7), all decision tree results are statistically significant for AUROC and AUPRC (p-values <0.05). The same trend is observed for XGBoost (p-values <0.05), with the exception of the random oversampling versus near miss pair for AUROC, which has a high p-value. For elastic net, neither metric shows statistically significant results for any resampling pair except those involving near miss: all pairs not involving near miss show a p-value >0.22, while pairs involving near miss show a p-value <0.05. Finally, a more mixed scenario is observed for random forest and logistic regression, where generally no statistical significance is observed, with the exceptions of pairs involving SMOTE on both metrics (p-value <0.05), pairs involving near miss for AUROC (p-value <0.05 for both models), and most pairs involving random under-sampling for AUPRC (all pairs show a p-value <0.05, except random under-sampling versus baseline for logistic regression, with a p-value of ~0.0882).
In addition to model performances, the computational efficiency, memory usage, and complexity of the models must also be considered in the setting of electronic health records and healthcare informatics networks. Overall, logistic regression and elastic net may be more resource efficient, while random forest and XGBoost are the most time-consuming. Among the resampling techniques, both under-sampling techniques (random under-sampling and near miss under-sampling) reduce the dataset size and, in turn, learner training time and memory usage, but carry a risk of information loss. Random oversampling and SMOTE have high memory usage and increase the dataset size, which leads to longer training times for the learners.
Strengths and limitations
The strengths of our study include a robust clinical dataset with a large sample size and high data quality, systematic evaluation of combinations of algorithms and resampling techniques, and use of Bayesian hyperparameter tuning to ensure hyperparameter optimization of the models. Our conclusions are limited by the use of a retrospective dataset from tertiary academic hospitals, and prospective, external validation is required. Moreover, the machine learning algorithms used are restricted to logistic and decision tree-based models; other types of machine learning models should be considered in the future. This study considered all features available in the original model [8] (Table 1), and no additional feature selection method was applied beyond the learner; the impact of feature selection in conjunction with resampling can be explored in future research. Our study focused on the most commonly used resampling techniques, and the performance impact of more advanced resampling techniques under highly imbalanced settings can be further explored. We observed overall poor calibration of the algorithms, an issue that should be addressed in future research through methods such as Platt scaling or isotonic regression. Finally, our results are dependent on the dataset used and might not generalize to other datasets in the medical domain or to other applications. Simulation studies can help better characterize which properties of the resampling methods are dataset-dependent and which are not, to improve generalizability.
Conclusion
This study found that in the setting of severe class imbalance, the impact of resampling techniques on model performance varied by the machine learning algorithm and the evaluation metric. Existing resampling techniques did not meaningfully improve model performances for logistic regression, elastic net, and random forest. For XGBoost, improvements in AUPRC from random oversampling and SMOTE were offset by decreased AUROC. Performances of decision trees were improved by multiple resampling techniques.
Future research is needed to improve predictive performances in the setting of severe class imbalance.
Acknowledgements
We are thankful for the support from many members of the Department of Anesthesia, Pain Management & Perioperative Medicine, Dalhousie University throughout this project, in particular George Campanis, Paul Brousseau, and Drs. David MacDonald, Heather Butler, Izabela Panek, André Bernard, and Janice Chisholm. Dr. Ke thanks the Department of Anesthesia, Providence Health Care, for research support.
Appendix
Appendix 1: Hyperparameter optimization
Elastic net
Hyperparameter | Range of parameter tested | No resampling | Random under-sampling | Near miss under-sampling | Random oversampling | Synthetic minority oversampling |
---|---|---|---|---|---|---|
C | 1 to 1000, increment of 10 | 800 | 140 | 130 | 10 | 130 |
Class weight | None, Balanced | Balanced | None | Balanced | Balanced | Balanced |
L1 ratio | 0.1 to 0.9, increment of 0.1 | 0.5 | 0.4 | 0.2 | 0.7 | 0.2 |
Max Iteration | 1000 to 100,000 | 2606 | 91,357 | 91,323 | 36,888 | 91,232 |
Sampling Strategy | 0.1 to 0.9, increment of 0.1 | N/A | 0.2 | 0.1 | 0.2 | 0.1 |
Decision trees
Hyperparameter | Range of parameter tested | No resampling | Random under-sampling | Near miss under-sampling | Random oversampling | Synthetic minority oversampling |
---|---|---|---|---|---|---|
Maximum depth | 5 to 70, increment of 1 | 39 | 41 | 41 | 50 | 10 |
Maximum features | Sqrt, Log2 | Sqrt | Sqrt | Sqrt | Log2 | Sqrt |
Criterion | Gini, Entropy | Gini | Entropy | Entropy | Entropy | Entropy |
Class weight | Balanced, None | Balanced | Balanced | Balanced | Balanced | Balanced |
Minimum samples leaf | 1, 100, 200, 300, 400, 500 | 300 | 100 | 100 | 4000 | 200 |
Sampling Strategy | 0.1 to 0.9, increment of 0.1 | N/A | 0.1 | 0.1 | 0.4 | 0.6 |
Random forest
Hyperparameter | Range of parameter tested | No resampling | Random under-sampling | Near miss under-sampling | Random oversampling | Synthetic minority oversampling |
---|---|---|---|---|---|---|
Class weight | None, Balanced | Balanced | Balanced | Balanced | Balanced | Balanced |
Criterion | Entropy, GINI | GINI | Entropy | Entropy | Entropy | Entropy |
Max depth | 6 to 15 | 7 | 13 | 14 | 6 | 14 |
Max features | Sqrt, Log2 | Log2 | Sqrt | Sqrt | Sqrt | Sqrt |
Number of estimators | 100 to 1000 | 972 | 604 | 630 | 460 | 630 |
Sampling strategy | 0.1 to 0.9, increment of 0.1 | N/A | 0.8 | 0.3 | 0.6 | 0.4 |
XGBoost
Hyperparameter | Range of parameter tested | No resampling | Random under-sampling | Near miss under-sampling | Random oversampling | Synthetic minority oversampling |
---|---|---|---|---|---|---|
Gamma | 0 to 0.5, increment of 0.01 | 0.47 | 0.42 | 0.11 | 0.49 | 0.49 |
Learning rate | 0 to 0.5, increment of 0.01 | 0.07 | 0.11 | 0.33 | 0.09 | 0.09 |
Max depth | 5 to 70, increment of 1 | 38 | 10 | 25 | 9 | 9 |
Minimum child weight | 1 to 10, increment of 1 | 10 | 7 | 8 | 1 | 1 |
Number of estimators | 20 to 200, increment of 5 | 16 | 20 | 20 | 28 | 28 |
Scale positive weight | 1 or 91,454 (ratio of positive to negative classes from training set) | 1 | 91,454 | 1 | 91,454 | 91,454 |
Sampling Strategy | 0.1 to 0.9, increment of 0.1 | N/A | 0.1 | 0.1 | 0.1 | 0.1 |
Appendix 2: Calibration graphs
Appendix 2.1 Logistic regression
Appendix 2.2 Elastic net
Appendix 2.3 Decision tree
Appendix 2.4 Random forest
Appendix 2.5 XGBoost
Author contributions
JK: Study design, data analysis, interpretation, manuscript first draft and revisions. AD: Study design, data analysis, interpretation, manuscript first draft and revisions. PB: Study design, data analysis, interpretation, manuscript revisions. RG: Study design, interpretation, manuscript revisions.
Funding
Natural Sciences and Engineering Research Council of Canada for open access publication fees. Department of Anesthesia, Pain Management & Perioperative Medicine at Dalhousie University ($2,595) and Nova Scotia Health Authority Research Fund ($5000) for the previously published cohort dataset (https://doi.org/10.1007/s12630-022-02287-0).
Data availability
The datasets used in this study are managed by the Health Data Nova Scotia and Nova Scotia Health Authority, and are not publicly available due to provincial privacy legislation and institutional restrictions. Analysis codes are available here: Dakshinamurthy A. GitHub In hospital Mortality Prediction Research Project [Internet]. 2022 [cited 2022 Sep 8]. Available from: https://github.com/Arunachalam4505/In-hospital-Mortality-Prediction-Research-Project.
Disclaimer
The data used in this report were made available by Health Data Nova Scotia of Dalhousie University. Although this research analysis is based on data obtained from the Nova Scotia Department of Health and Wellness, the observations and opinions expressed are those of the authors and do not represent those of either Health Data Nova Scotia or the Department of Health and Wellness.
Declarations
Ethics approval and consent to participate
We obtained research ethics approval (Nova Scotia Health Authority Research Ethics Board, Halifax, NS, Canada; file # 1024251), and conducted this study according to the principles of the Declaration of Helsinki.
Competing interests
Dr. Janny Ke received salary support as the Clinical Data Lead, St. Paul’s Hospital, Canada, for Project “Reducing Opioid Use for Pain Management” from Canadian Digital Supercluster DIGITAL and Consortium (Careteam Technologies Inc, Thrive Health Inc, Excelar Technologies (Connected Displays Inc), Providence Health Care Ventures Inc, and Xerus Inc). Dr Ke provided paid consulting for Careflow Technologies (Connected Displays Inc), funded via Providence Health Care Ventures. Dr. Ke receives research and salary support for Project "Continuous Connected Patient Care" in patients with high perioperative risk, funded by DIGITAL and Consortium: Medtronic Canada ULC, Cloud Diagnostics Canada ULC, Excelar Technologies (Connected Displays Inc), Providence Health Care Ventures Inc, 3D Bridge Solutions Inc, and FluidAI (NERv Technology Inc). Dr. Paula Branco received research funding from NSERC, Mitacs, IBM and University of Ottawa.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Nepogodiev D, et al. Global burden of postoperative death. Lancet. 2019;393(10170):401.
- 2. Moonesinghe SR, Mythen MG, Das P, Rowan KM, Grocott MPW. Risk stratification tools for predicting morbidity and mortality in adult patients undergoing major surgery: qualitative systematic review. Anesthesiology. 2013;119(4):959–81.
- 3. Wong DJN, Harris S, Sahni A, Bedford JR, Cortes L, Shawyer R, et al. Developing and validating subjective and objective risk-assessment measures for predicting mortality after major surgery: an international prospective cohort study. PLOS Med. 2020;17(10):e1003253.
- 4. Sigakis MJG, Bittner EA, Wanderer JP. Validation of a risk stratification index and risk quantification index for predicting patient outcomes: in-hospital mortality, 30-day mortality, 1-year mortality, and length-of-stay. Anesthesiology. 2013;119(3):525–40.
- 5. Lee CK, Hofer I, Gabel E, Baldi P, Cannesson M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology. 2018;129(4):649–62.
- 6. Hill BL, Brown R, Gabel E, Rakocz N, Lee C, Cannesson M, et al. An automated machine learning-based model predicts postoperative mortality using readily-extractable preoperative electronic health record data. Br J Anaesth. 2019;123(6):877–86.
- 7. Fritz BA, Cui Z, Zhang M, He Y, Chen Y, Kronzer A, et al. Deep-learning model for predicting 30-day postoperative mortality. Br J Anaesth. 2019;123(5):688–95.
- 8. Ke JXC, McIsaac DI, George RB, Branco P, Cook EF, Beattie WS, et al. Postoperative mortality risk prediction that incorporates intraoperative vital signs: development and internal validation in a historical cohort. Can J Anesth. 2022. https://doi.org/10.1007/s12630-022-02287-0.
- 9. Kazemi P, Lau F, Simpao AF, Williams RJ, Matava C. The state of adoption of anesthesia information management systems in Canadian academic anesthesia departments: a survey. Can J Anaesth. 2021 Jan 29. Available from: https://rdcu.be/cesb5
- 10. Megahed FM, Chen YJ, Megahed A, Ong Y, Altman N, Krzywinski M. The class imbalance problem. Nat Methods. 2021;18(11):1270–2.
- 11. Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2):31:1–31:50.
- 12. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.
- 13. Gaudreault JG, Branco P, Gama J. An analysis of performance metrics for imbalanced classification. In: Soares C, Torgo L, editors. Discovery science: lecture notes in computer science. Cham: Springer International Publishing; 2021. p. 67–77.
- 14. Brajer N, Cozzi B, Gao M, Nichols M, Revoir M, Balu S, et al. Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw Open. 2020;3(2):e1920733.
- 15. Davoodi R, Moradi MH. Mortality prediction in intensive care units (ICUs) using a deep rule-based fuzzy classifier. J Biomed Inform. 2018;79:48–59.
- 16. Kabir MF, Ludwig S. Classification of breast cancer risk factors using several resampling approaches. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA); 2018. p. 1243–8. Available from: https://ieeexplore.ieee.org/document/8614227
- 17. Khushi M, Shaukat K, Alam TM, Hameed IA, Uddin S, Luo S, et al. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access. 2021;9:109960–75.
- 18. Wang C, Deng C, Wang S. Imbalance-XGBoost: leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost. Pattern Recognit Lett. 2020;136:190–7.
- 19. Zhang P, Jia Y, Shang Y. Research and application of XGBoost in imbalanced data. Int J Distrib Sens Netw. 2022;18(6):15501329221106936.
- 20. Dakshinamurthy A. GitHub In hospital Mortality Prediction Research Project [Internet]. 2022 [cited 2022 Sep 8]. Available from: https://github.com/Arunachalam4505/In-hospital-Mortality-Prediction-Research-Project
- 21. scikit-learn: machine learning in Python—scikit-learn 1.1.2 documentation [Internet]. [cited 2022 Sep 8]. Available from: https://scikit-learn.org/stable/
- 22. Bergstra J, Yamins D, Cox DD. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Proceedings of the 30th International Conference on Machine Learning (ICML'13). Atlanta, GA: JMLR.org; 2013. p. I-115–I-123.
- 23. imbalanced-learn documentation—Version 0.9.1 [Internet]. [cited 2022 Sep 8]. Available from: https://imbalanced-learn.org/stable/
- 24. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. Fourteenth Int Conf Mach Learn. 1997;97(1):1–8.
- 25. Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning from imbalanced data sets. Comput Intell. 2004;20(1):18–36.
- 26. Zhang J, Mani I. KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets; 2003.
- 27. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- 28. Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res. 2018;61:863–905.
- 29. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.
- 30. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Monterey, CA: Wadsworth and Brooks; 1984.
- 31. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 32. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–94.
- 33. Pelikan M, Goldberg DE, Cantú-Paz E. BOA: the Bayesian optimization algorithm. In: Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation (GECCO'99). San Francisco, CA: Morgan Kaufmann Publishers Inc.; 1999. p. 525–32.
- 34. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: with applications in R. Springer texts in statistics. New York: Springer; 2013.
- 35. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology. 2010;21(1):128–38.
- 36. van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. J Am Med Inform Assoc. 2022;29(9):1525–34.
- 37. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3(1):18.
- 38. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):106.
- 39. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B. 1995;57(1):289–300.