
Figure 33.

Influence of class imbalance on different classification metrics. For this experiment, we used different thresholds (median, mean, and 2.5 mmol g⁻¹) for CO2 uptake to divide structures into "high performing" and "low performing" (see histogram inset). That is, we convert our problem with continuous labels for CO2 uptake into a binary classification problem, for which we now need to select an appropriate performance measure. We then test different baselines: one that randomly predicts a class by sampling from a uniform distribution (uniform), one that randomly draws from the training-set class distribution (stratified), and one that always predicts the majority class (majority). For each baseline and threshold, we then evaluate the predictive performance on a test set using common classification metrics such as the accuracy (red), precision (blue), recall (yellow), F1 score (green), and the area under the curve (AUC) (pink). We see that by reporting only one number, without any information about the class distribution, one might be overly optimistic about the performance of a model; that is, some metrics give rise to a high score even for purely random guessing on imbalanced distributions. For example, using a threshold of 2.5 mmol g⁻¹ we find high precision values for all of our sampling strategies. Note that some scores are set to zero because they are undefined due to division by zero.
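To make the baseline experiment concrete, the following is a minimal sketch assuming scikit-learn's DummyClassifier, whose "uniform", "stratified", and "most_frequent" strategies correspond to the three baselines named in the caption. The CO2-uptake values here are synthetic placeholders, not the data underlying the figure, and the threshold values are illustrative.

```python
# Minimal sketch of the caption's baseline experiment, assuming scikit-learn.
# The CO2-uptake distribution below is a synthetic stand-in for the real data.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
co2_uptake = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # placeholder, mmol/g

# Three ways to binarize the continuous labels, as in the caption.
for threshold in (np.median(co2_uptake), co2_uptake.mean(), 2.5):
    y = (co2_uptake > threshold).astype(int)   # 1 = "high performing"
    X = rng.normal(size=(len(y), 5))           # features are irrelevant to dummies
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

    # "uniform" / "stratified" / "most_frequent" map to the caption's
    # uniform / stratified / majority baselines.
    for strategy in ("uniform", "stratified", "most_frequent"):
        clf = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)
        scores = {
            "acc": accuracy_score(y_te, y_pred),
            # zero_division=0 mirrors the caption: undefined scores become 0
            "prec": precision_score(y_te, y_pred, zero_division=0),
            "rec": recall_score(y_te, y_pred, zero_division=0),
            "f1": f1_score(y_te, y_pred, zero_division=0),
            "auc": roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
        }
        print(f"threshold={threshold:.2f} {strategy:>13}: "
              + "  ".join(f"{k}={v:.2f}" for k, v in scores.items()))
```

Running such a sketch reproduces the qualitative point of the figure: with the most imbalanced threshold, the majority-class baseline reaches a deceptively high accuracy while its recall and F1 score collapse, which is why no single metric should be reported without the class distribution.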