2022 Nov 9;65(1):31–57. doi: 10.1007/s10115-022-01772-8

Table 7.

Lessons learned by answering the research questions

RQ# Lessons learned
FQ1 There are 5 central domain areas in imbalanced data applications: health (34.3%), finance (22.9%), engineering (14.3%), biology (8.6%), and software (8.6%). These areas offer good references for new applications, and new domains remain open to exploration
FQ2 The studies applied 55 different sampling techniques: oversampling (55.5%), undersampling (27.4%), and hybrid sampling (17.1%). Oversampling techniques achieved the best performance among the existing types in absolute terms, whereas hybrid sampling techniques performed better in relative terms (ratio of studies that selected a technique type to studies that tested it)
FQ3 None of the studies used simulation as a means for optimizing synthetic data generation and accelerating training time in oversampling. This technology could optimize results and reduce computational costs in domains such as engineering and health
FQ4 The studies applied 45 different ML models: classical (54%), ensemble (24.8%), and NN (21.2%). NN models achieved the best performance both overall and relative to the number of studies that tested them, with ensemble models a close second
FQ5 There are 3 recurrent development tools within the studies: Python, MATLAB, and Weka. These tools already provide implementations of both sampling techniques and ML models
FQ6 Domain areas selected distinctive sampling techniques and ML models, especially in health. Even so, there is a clear preference for oversampling in engineering, biology, and software, while finance splits between oversampling and hybrid sampling. For ML models, engineering selected only NN models, and finance selected mostly ensemble models. Other domains did not show a clear categorical preference
SQ1 There is a growing research interest in the subject, especially since 2019
SQ2 The 35 reviewed studies show a prevalence of journal publications, with 25 works (71.4%), while the remaining 10 are from conferences. The digital libraries ACM, Science Direct, and Springer Link each account for at least 20% of the results
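
The distinction in FQ2 and FQ4 between the best performance overall and the best performance relative to tested studies can be illustrated with a minimal Python sketch. It assumes the relative measure is the number of studies that selected a technique type as best divided by the number of studies that tested it. The counts below, and the names `counts` and `selection_ratio`, are hypothetical and are not taken from the reviewed studies.

```python
# Hypothetical counts for illustration only; NOT data from the reviewed studies.
# tested:   number of studies that evaluated a technique type
# selected: number of those studies that chose it as the best performer
counts = {
    "oversampling": {"tested": 30, "selected": 18},
    "undersampling": {"tested": 20, "selected": 6},
    "hybrid": {"tested": 10, "selected": 8},
}


def selection_ratio(entry):
    """Share of the studies testing a technique type that also selected it."""
    return entry["selected"] / entry["tested"]


# Absolute ranking: the type selected as best in the most studies.
best_absolute = max(counts, key=lambda name: counts[name]["selected"])

# Relative ranking: the type with the highest selected/tested ratio.
best_relative = max(counts, key=lambda name: selection_ratio(counts[name]))

for name, entry in counts.items():
    ratio = selection_ratio(entry)
    print(f"{name}: selected {entry['selected']} of {entry['tested']} ({ratio:.2f})")
print("best in absolute terms:", best_absolute)
print("best in relative terms:", best_relative)
```

With these assumed counts, a type that is tested often can lead in absolute terms while a type that is tested rarely but usually wins leads in relative terms, which is the pattern the table reports for oversampling and hybrid sampling.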