Author manuscript; available in PMC: 2021 Jun 1.
Published in final edited form as: J Thorac Cardiovasc Surg. 2020 Jun 29;161(6):1940–1941. doi: 10.1016/j.jtcvs.2020.06.052

Commentary: The Problem of Class Imbalance in Biomedical Data

Hemant Ishwaran a, Robert O’Brien b
PMCID: PMC7769929  NIHMSID: NIHMS1623069  PMID: 32711988

The main focus of the work by Bolourani et al. is the development of a machine learning (ML) algorithm for predicting early readmission after esophagectomy. The authors provide a detailed, multistep analysis that includes univariate and multivariate logistic regression, the regularized lasso, random forests, and NearMiss. This is a complex analysis, and at first glance readers might question why studying a simple binary outcome such as hospital readmission should entail so much effort. As the authors correctly identify, the difficulty arises from the presence of “class-imbalanced data,” which turns out to be a thorny scenario for ML procedures to overcome. It is this problem we would like to comment on, and for which we highlight some promising new developments.

Class-imbalanced data, or simply imbalanced data, are data in which the outcome is binary (here, early readmission) and the frequency of the observed classes is skewed toward one realization, called the majority class, relative to the other possible realization, called the minority class. In the analysis here, of the 2037 patients studied, only 383 required early readmission; thus the ratio of patients not readmitted early (the majority class) to those readmitted early (the minority class) is 1654 to 383, an imbalance ratio (IR) of approximately 4.3.
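For concreteness, the IR is simply the ratio of the majority- to minority-class counts; the short R calculation below merely reproduces the figure above.

    ## Imbalance ratio (IR): majority-class count divided by minority-class count
    n.minority <- 383     # patients readmitted early
    n.majority <- 1654    # patients not readmitted early
    n.majority / n.minority   # approximately 4.3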

The problem is that many ML methods are “biased” toward the majority class in the presence of imbalanced data, especially when the IR is high. The reason is that ML classification is generally based on the Bayes decision rule, which classifies patients on the basis of their estimated probabilities: a patient is assigned to the minority class if his or her probability of belonging to that class is 0.5 or higher. Of course, the very nature of imbalanced data makes this unlikely to occur, because the probability of belonging to the minority class will almost certainly be less than 0.5 (except, perhaps, for a small subset of patients), especially when the IR is high. Hence, ML classifiers tend to classify most of the data into the majority class in imbalanced data settings. Note that the same principle applies to standard procedures such as logistic regression when the Bayes rule is used for classification.
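As a simple illustration of this behavior (simulated data, not the study data), the R sketch below fits a logistic regression to an imbalanced binary outcome and applies the usual 0.5 threshold; nearly every patient ends up classified into the majority class.

    ## Illustration only: simulated imbalanced data, not the study data
    set.seed(1)
    n <- 2000
    prevalence <- 0.19                               # roughly the study's minority frequency
    y <- rbinom(n, 1, prevalence)                    # 1 = minority class (early readmission)
    x <- rnorm(n, mean = ifelse(y == 1, 0.75, 0))    # one moderately informative covariate

    fit <- glm(y ~ x, family = binomial)
    p.hat <- predict(fit, type = "response")         # estimated minority-class probabilities

    ## Bayes rule: classify as minority only when the estimated probability is >= 0.5
    table(predicted = as.numeric(p.hat >= 0.5), observed = y)
    ## Nearly all predictions fall in the majority class (predicted = 0)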

What is the answer? In ML, one approach has been to use what are called (1) undersampling and (2) oversampling techniques. As an example of (2), SMOTE1 is a popular technique that creates artificial minority-class examples in an effort to balance the data. Thus, for the data here, SMOTE would “manufacture” cases of early readmission, and the manufactured data would then be used in the analysis. NearMiss2 is an example of (1) and is the technique used in this work. NearMiss undersamples the majority class by removing patients not readmitted early in an effort to balance the data.
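To make the two strategies concrete, the base-R sketch below balances a simulated outcome by simple random undersampling and oversampling. This is only a simplified stand-in: SMOTE and NearMiss additionally use the covariates (nearest neighbors) to decide which cases to synthesize or remove.

    ## Simplified illustration of balancing by random resampling (not SMOTE or NearMiss)
    set.seed(1)
    y <- rbinom(2000, 1, 0.19)          # 1 = minority class, as in the sketch above
    minority.id <- which(y == 1)
    majority.id <- which(y == 0)

    ## (1) Undersampling: randomly drop majority cases until the class counts match
    under.id <- c(minority.id, sample(majority.id, length(minority.id)))

    ## (2) Oversampling: randomly replicate minority cases until the class counts match
    over.id <- c(majority.id, minority.id,
                 sample(minority.id, length(majority.id) - length(minority.id),
                        replace = TRUE))

    table(y[under.id]); table(y[over.id])   # both resampled data sets are now balanced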

Unfortunately, although these types of methods have had reported success in the literature, as well as in this analysis, there is no theoretical justification for them that we are aware of. Most importantly, because the data are subsampled using the clinical information, the resulting estimated probabilities will not, in general, be valid. Thus, the reported success of these methods rests primarily on their empirical classification performance (identifying which patients might be readmitted early) and not on their ability to estimate probabilities (the probability that a given patient will be readmitted early). In our own experience, these methods can sometimes help to improve classification; however, very delicate tuning and considerable experience are required to do so.

There is, however, another solution that provides a clearer path forward. This method is also based on subsampling the data, but it differs in a very important way: the sampling uses only the value of the outcome and makes no use of the associated clinical data. This type of sampling is called response sampling. In the ML literature, the most popular implementation is undersampling, in which the majority class is undersampled to match the frequency of the minority class. For example, this is the technique used by balanced random forests (BRF).3 This method has been used quite widely and has generally been observed to produce good results. The theoretical explanation for why BRF and response-based undersampling work was provided in a recent paper by O’Brien and Ishwaran.4 They showed that response-based undersampling is theoretically equivalent to replacing the Bayes rule with a different decision rule called a quantile classification rule. Rather than classifying patients according to whether their probability exceeds 0.5, the rule adjusts the 0.5 threshold to match the underlying prevalence of the minority class. Doing so yields a procedure with the optimal property of simultaneously maximizing sensitivity and specificity.

In fact, there is no need to subsample at all! The work by O’Brien and Ishwaran4 showed that one only has to replace the Bayes rule with the new quantile rule to obtain a procedure with theoretically justified properties. Furthermore, by forgoing sampling altogether, the resulting estimated probabilities remain valid. Thus, one obtains not only a good classifier but also one with valid probability estimates.
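Continuing the simulated logistic-regression sketch above, the following lines illustrate the quantile rule in its simplest form: the 0.5 cutoff is replaced by the observed minority-class prevalence (an illustrative sketch, not the RFQ implementation itself), and sensitivity and specificity are compared with the Bayes rule.

    ## Quantile classification rule: replace the 0.5 cutoff with the minority prevalence
    ## (continues the logistic-regression sketch above; uses the objects y and p.hat)
    pi.hat <- mean(y)                               # observed minority-class prevalence
    bayes.pred    <- as.numeric(p.hat >= 0.5)       # Bayes rule
    quantile.pred <- as.numeric(p.hat >= pi.hat)    # quantile rule

    ## Compare sensitivity (minority class) and specificity (majority class)
    sensitivity <- function(pred) mean(pred[y == 1] == 1)
    specificity <- function(pred) mean(pred[y == 0] == 0)
    c(bayes = sensitivity(bayes.pred), quantile = sensitivity(quantile.pred))
    c(bayes = specificity(bayes.pred), quantile = specificity(quantile.pred))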

O’Brien and Ishwaran4 have developed the quantile classifier for use with random forests, a method referred to as RFQ. We would like to mention that RFQ is available for general public use through the “imbalanced” function in the randomForestSRC R package.5 It can be used for classification, for producing estimated probabilities, and for calculating variable importance (VIMP) values. The latter allow researchers to quickly determine which clinical variables are important and to estimate their effect sizes in terms of prediction error.
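For readers who would like to try RFQ, a minimal usage sketch follows. It assumes a hypothetical data frame dat whose outcome, status, is a two-level factor (e.g., early readmission yes/no) and whose remaining columns are clinical covariates; exact argument and output names should be checked against the package documentation.5

    ## Minimal sketch (not the authors' code): RFQ via the randomForestSRC package
    ## `dat` is a hypothetical data frame with a two-level factor outcome `status`
    library(randomForestSRC)

    o <- imbalanced(status ~ ., data = dat)   # RFQ is the default method
    print(o)                                  # forest summary and performance

    head(o$predicted.oob)                     # out-of-bag estimated class probabilities
    vimp(o)$importance                        # variable importance (VIMP) values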

In conclusion, the authors have tackled an important medical issue for esophageal cancer patients. They provided a detailed analysis of class-imbalanced data, a setting common in medical studies but often misunderstood or overlooked. As the authors have found, such settings can be nuanced and difficult to analyze and require careful use of ML methods. Finally, we would like to thank the editors for providing us with the opportunity to comment on this work.

References

1. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357.
2. Bao L, Juan C, Li J, Zhang Y. Boosted near-miss under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing. 2016;172:198–206.
3. Chen C, Liaw A, Breiman L. Using random forest to learn imbalanced data. Technical report. Berkeley, CA: University of California, Berkeley; 2004.
4. O’Brien R, Ishwaran H. A random forests quantile classifier for class imbalanced data. Pattern Recognition. 2019;90:232–249.
5. Ishwaran H, Kogalur UB. Random forests for survival, regression and classification (RF-SRC). R package version 2.9.3. 2020. Available at: https://cran.r-project.org/package=randomForestSRC.
