Illustrations of different cases of binary classification under group underrepresentation
Circles and crosses denote the two possible outcomes (values of y); blue (majority) and red (minority) denote the two patient groups of interest. The two axes show the model inputs.
(A) Group underrepresentation is not problematic if the same decision boundary is optimal for all groups.
(B) If the optimal decision boundaries differ between groups, and either the model or the input data are not sufficiently expressive to capture the optimal decision boundaries for all groups simultaneously, standard (empirical risk minimizing) learning approaches will optimize for performance in the majority group (here, the blue group).
(C) An expressive model could learn a decision boundary (red) that is optimal for both groups. In practice, however, it is unclear whether a training procedure will indeed identify this optimal boundary. This is due to inductive biases,19 local optimization schemes, and limited dataset size for the minority groups, all combined with standard empirical risk minimization, which prioritizes optimizing performance for the majority group.
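The situation in panel (B) can be reproduced with a minimal sketch. It assumes a deliberately simplified setting that is not taken from the figure: a single input instead of two, a one-parameter threshold classifier as the insufficiently expressive model, and noise-free synthetic data with 90 majority and 10 minority samples. All names (`make_group`, `erm_threshold`, the group thresholds 0.0 and 1.0) are hypothetical and chosen for illustration only.

```python
def make_group(n, true_threshold):
    """n evenly spaced inputs in [-2, 2]; y = 1 where x exceeds the group's own threshold."""
    xs = [-2 + 4 * i / (n - 1) for i in range(n)]
    return [(x, int(x > true_threshold)) for x in xs]


def accuracy(data, t):
    """Fraction of samples that the rule 'predict 1 if x > t' classifies correctly."""
    return sum(int(x > t) == y for x, y in data) / len(data)


def erm_threshold(data, candidates):
    """Empirical risk minimization: the candidate threshold with the fewest pooled errors."""
    return min(candidates, key=lambda t: sum(int(x > t) != y for x, y in data))


# Assumed toy data: the optimal boundary differs between the groups (0.0 vs. 1.0).
majority = make_group(90, true_threshold=0.0)
minority = make_group(10, true_threshold=1.0)

# Candidate thresholds on a grid from -2.0 to 2.0 in steps of 0.1.
candidates = [i / 10 for i in range(-20, 21)]

# Standard ERM on the pooled data, ignoring group membership.
t_hat = erm_threshold(majority + minority, candidates)
acc_majority = accuracy(majority, t_hat)
acc_minority = accuracy(minority, t_hat)

print("pooled ERM threshold:", t_hat)
print("majority accuracy:", acc_majority)
print("minority accuracy:", acc_minority)
print("minority accuracy at its own optimum:", accuracy(minority, 1.0))
```

In this toy setting the pooled fit settles on the majority group's boundary, so the majority is classified perfectly while the minority is not, even though a threshold that is perfect for the minority exists. A single threshold cannot serve both groups here, which is the lack of expressiveness that panel (B) describes.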