Scientific Reports. 2025 Aug 25;15:31245. doi: 10.1038/s41598-025-13929-w

Improving learning from the complex multi-class imbalanced and overlapped data by mapping into higher dimension using SVM++

Zafar Mahmood 1, Leila Jamel 2, Dina Ahmed Salem 3, Imran Ashraf 4,
PMCID: PMC12378458  PMID: 40854927

Abstract

Several issues prevent traditional classifiers from reaching an acceptable performance level when learning from multi-class problems. Chief among them is the unequal distribution of samples, which significantly reduces the efficiency of the underlying classifier when combined with incompatible optimization benchmarks and data overlapping. The combined effects of imbalanced distribution and sample overlap around the class boundaries degrade classifier performance beyond what either factor causes alone, and the problem worsens as the number of classes grows. Despite this larger combined effect, the joint problem of imbalanced and overlapping data has received the least attention in the research. To improve learning from imbalanced multi-class data with overlapping shared attributes, this work introduces SVM++, a modified version of the support vector machine (SVM). The approach comprises three steps. Algorithm-1 finds and splits the training set into overlapping and non-overlapping samples. Algorithm-2 then separates the overlapped data into the Critical-1 and Critical-2 regions; the Critical-1 region consists of overlapped samples sharing similar characteristics, which is the main cause of degraded classification performance. In the third step, an algorithm based on the mean of the maximum and minimum distances of the Critical-1 region samples improves the traditional SVM kernel mapping function to map these samples into a higher dimension. Thirty real datasets with various imbalance ratios and degrees of overlap are used to compare the proposed algorithms with state-of-the-art classifiers.

Keywords: Imbalance data, Class overlapping samples, Kernel mapping function, Support vector machine, Overlapped and non-overlapped region

Subject terms: Computer science, Scientific data

Introduction

Classification problems that combine imbalance and overlapping issues pose great challenges for the research community in real-world applications such as fraud detection, disease diagnosis, character recognition, and text classification1,2. In a complex multi-class problem, the distribution of samples is unequal, and some classes’ samples share similar characteristics near the class boundary, resulting in an overlapping region3. These overlapping samples strongly influence overall classification performance, particularly precision and accuracy, as there is no clear-cut separating boundary between the samples of overlapping classes. As seen in Fig. 1a, when a traditional classifier is trained on overlapped samples where the minority class samples are less visible, it becomes confused when predicting unseen samples. Several studies4–6 have shown that the misclassification rate is higher near the class boundaries, usually the area where the overlapped samples reside. While some solutions to the overlapping problem are offered in7,8, much work remains to handle overlapping samples effectively in real-world datasets.

Fig. 1. Illustration of boundary areas with (a) overlapping samples and (b) imbalance distribution of data.

In addition to the overlapping problem, one of the main challenges when learning from multi-class problems is the unequal distribution of samples. As seen in Fig. 1b, in a multi-class imbalanced dataset, certain classes are underrepresented while one class has a significantly larger number of samples9. Traditional classifiers are biased towards the majority class because the unfair distribution of samples causes a higher misclassification rate when predicting minority class samples10. When overlapping issues occur in conjunction with an imbalanced distribution of samples, the performance drop is catalyzed beyond what either factor causes independently11.

Several studies12–14 concluded that the sub-problems brought on by imbalanced data, rather than the imbalanced data itself, are the major factor in the performance tradeoff. Among these well-known problems are overfitting15, information loss16, small disjuncts17, and class overlapping18. Of these factors, the imbalanced nature of data and overlapping issues have the greatest impact on overall classification performance; however, for large and linearly separable datasets, the degree of imbalance does not affect performance, as shown by N. Japkowicz19. On the other hand, studies in6,20 demonstrate that overlapping has a great influence on overall classification performance, and it becomes more problematic as the degree of overlapping increases21.

Various solutions, primarily based on algorithm-level, data-level, hybrid, and ensemble-based approaches, are documented in the literature to handle overlapping and the combined impact of overlapping and imbalanced data10,21. Algorithm-level approaches propose new algorithms or improve existing ones to handle this issue, while data-level approaches use resampling techniques to modify the distribution of class samples. Ensemble-based methods wrap different classifiers to improve performance over imbalanced and overlapped data22,23. Regardless of the advantages of the approaches mentioned above, today’s algorithms are typically modified to address overlapping problems under particular conditions rather than serving as generalized tools24. Likewise, data-level approaches apply only to problems where the data have certain characteristics5.

Motivation

As reported in25, resampling techniques may cause overfitting or underfitting by eliminating useful information in the form of samples. When using ensemble-based methods, key issues such as base classifier selection, diversity and accuracy of the model, number of classifiers, and decision-making strategy need more attention23. Based on the existing literature, the visibility of minority class samples in the overlapped region is always compromised compared to the majority class samples, making it hard for the underlying classifier to correctly classify the target class. Most researchers have proposed data-level approaches to handle imbalance and overlapping sample issues, mainly because such approaches keep the original data distribution. An algorithm-level approach is therefore needed that maximizes the visibility of minority class samples and thus minimizes the misclassification rate.

Contributions

In this study, we address the issue of learning from complex multi-class scenarios that combine imbalanced and overlapping data by proposing a new approach, SVM++, which changes the existing SVM kernel mapping function. Our previous research26 focused on a hybrid model for the same purpose. This research makes the following contributions.

  • Three novel algorithms are proposed to maximize the visibility of samples in the dense overlapped region so that the underlying classifier can easily predict the target class.

  • Algorithm-1 splits the training set data into overlapped and non-overlapped regions. Algorithm-2 filters the overlapped region into the Critical-1 and Critical-2 regions, while Algorithm-3 modifies the kernel mapping of SVM to map the most critical samples in the Critical-1 region.

  • Experimental evaluation of these algorithms on 30 multiclass real-world datasets. In addition, comparative analysis is carried out with basic neighborhood search (NB-Basic), modified Tomek link search (NB-Tomek), synthetic minority oversampling technique (SMOTE)-SVM, KNN-based undersampling (K-US), overlap-based undersampling (OBU), overlap-sensitive margin (OSM), fuzzy SVM for class imbalance learning (FSVM-CIL), radial basis function network (RBFN), and KNN.

Algorithm-1 splits training set data into overlapped and non-overlapped regions. Algorithm-2 identifies the Critical-1 and Critical-2 region samples. We are particularly interested in the Critical-1 region samples, where the majority and minority class samples share almost the same characteristics, minimizing the visibility of minority class samples and hence increasing the misclassification rate. The third and main algorithm is the modified SVM, which maps the most critical samples in the Critical-1 region to a higher dimension based on the mean of the maximum and minimum distances of the majority and minority class samples.

The paper’s primary sections are organized as follows. “Literature review” reviews the literature. “Materials and methodology” discusses complex overlapping samples and how to eliminate them; it also highlights the effects of imbalance and overlapping on classification, possible mitigating techniques, and the proposed algorithms and methodology. The experimental setup, datasets, and results are discussed in “Experimental results”, whereas “Conclusion” highlights the conclusions and future directions.

Literature review

As discussed in8,27, overlapping issues are more challenging than the imbalance problem in both binary-class and multi-class settings. Although overlapping issues become more pronounced as the imbalance ratio increases, several techniques exist to resolve the unequal distribution of data. Data-level approaches reduce the majority class samples or add samples to the minority class to balance the two28. Algorithm-level approaches29 incorporate user input and preferences to design a new algorithm or modify an existing one to resolve the unequal distribution of data. Data-level approaches apply generally to various problems with imbalance and overlapping issues, while algorithm-level approaches are specific and cannot be easily modified once implemented. However, random undersampling may lose valuable information when dropping majority class samples, and random oversampling may overfit the model to the training dataset, as more samples are added to the minority class30.

Class overlap undersampling is handled in31 using the global similarity of the dataset, computed with the Schur decomposition approach. In a comparative analysis on 46 publicly available datasets, the proposed approach showed the best results with respect to AUC. Similarly, a heterogeneous clustering approach can be adopted to tackle the class imbalance problem, as investigated in32. The authors presented an ensemble approach using heterogeneous clustering for multiclass overlapping problems along with a genetic algorithm (GA). Overlapping samples are identified in the dataset, and overlapping problems are resolved with higher accuracy compared to existing approaches.

SMOTE33 can be used to alleviate the over-fitting and information loss caused by random sample selection. SMOTE creates a new minority class sample by interpolating between a minority sample and its closest same-class neighbors. With the increasing complexity of data structures, SMOTE has spawned different variations of resampling methods34–36. Some well-known extensions, e.g., SMOTE-IPF37 and Borderline-SMOTE38, are specifically designed to deal with samples in shared space or overlapped regions. Variants like DBSMOTE35 and Safe-Level-SMOTE34 avoid resampling in the overlapped area and show less improvement than counterpart sampling methods designed to deal with overlapping issues.
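The core SMOTE interpolation step can be sketched in a few lines. This is a minimal illustration of the idea, not the library implementation; the function names and the choice of a single synthetic sample are our own for demonstration.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote_sample(minority, k=5, rng=random.Random(0)):
    """Generate one synthetic minority sample by interpolating between a
    random minority instance and one of its k nearest minority-class
    neighbours -- the core SMOTE idea (illustrative sketch)."""
    x = rng.choice(minority)
    neighbours = sorted((m for m in minority if m is not x),
                        key=lambda m: euclidean(x, m))[:k]
    n = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, n)]

# Toy minority cluster: the synthetic point falls inside its convex span.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_sample(minority)
```

Because the new point lies on the segment between two existing minority samples, it always stays inside the minority cluster's bounding box, which is exactly why borderline variants are needed when that cluster overlaps the majority class.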

Methods designed to deal with overlapped instances mostly target samples near the borderline or in the overlapped region, as discussed in39. Very few methods in the literature address the overlapping problem over the entire overlapped region40, owing to the fear of losing valuable samples when dropping instances to reduce the negative impact of overlap, as dropped instances sometimes contain critical information41. OBU40, a recent overlap-based method that removes negative instances from the all-inclusive overlapped region, produces better results than well-known data-distribution-based methods. However, in most cases OBU suffers higher information loss than counterpart methods because it removes negative samples from the overlapped region excessively. DBMUTE42, an overlap-resolving method, uses a clustering-based algorithm to identify borderline samples and then undersamples to eliminate negative samples from the overlapped region; it outperforms its counterpart DBSMOTE42, which targets instances outside the overlapped region. To solve imbalance and overlapping difficulties in multi-class imbalanced data, the authors in43 developed an overlap filtering technique based on KNN. They combined an under-sampling scheme with KNN (K-US) and embedded it in ensemble classifier learning to reduce the effect of information loss. D. Deve et al.44 present a combined data-level approach of Condensed Nearest Neighbor (CNN) and Tomek link (TL) that effectively eliminates majority class samples without losing useful information. The authors propose an improved undersampling scheme at the data pre-processing level by integrating redundancy and outlier detection. The proposed scheme dominates counterpart algorithms like SVM, KNN, and NN on different real-world datasets.

The imbalanced nature of data and overlapping instances can affect the performance of both multi-class and binary-class traditional classifiers, as the classifier is biased towards the group with the bulk of instances, as pointed out in45. The authors designed a novel neighbor-based undersampling scheme to identify and eliminate overlapped samples in the data space and reduce the impact of negative samples. The scheme employs nearest-neighbor-based search to accurately identify negative instances in the shared region, and it maximizes the visibility of minority class samples by eliminating only the overlapped samples, reducing the chance of dropping valuable instances and losing information. Mengyu Fu, Yang Tian, and Fang Wu proposed a technique that improves the performance of SVM on multi-class overlapped datasets46. The authors combine SVM with auxiliary algorithms, including Recursive Feature Elimination (RFE)47 and a two-step classification strategy that calculates sorting coefficients based on the weight vector w obtained from training. Jose et al.21 proposed an algorithm that introduces controlled overlapping by producing synthetic samples in real-world datasets: the nearest neighbors of each class sample are found, and a new artificial instance is generated between each sample and its nearest neighbors. The same concept was introduced by the authors in48, reinforcing the view that overlapping greatly affects the classification rate. The authors in49 suggest generating minority class samples using a proposed optimal oversampling framework (OOF). The model uses different optimization techniques to generate samples, which are then refined using evolutionary strategies. During the refining stage, the samples are matched against minority and majority class samples using metrics such as distance and cosine similarity.

Table 1 provides a summary of the discussed approaches concerning data-level, algorithm-level, and hybrid levels. These approaches target multiclass imbalance problems.

Table 1. Summary of data-level, algorithm-level, and hybrid methods for handling multiclass imbalance and overlapping issues.

Approach Type Overlap handling Imbalance handling Key details
Random over/undersampling + SVM Algorithm + data-level Indirect (imbalance-targeted) Yes Simple sampling; risk of overfitting or loss of information
SMOTE Data-level Indirect (via interpolation) Yes Generates synthetic minority samples
Borderline-SMOTE Data-level Yes (borderline focus) Yes Focus on borderline/overlapped regions
Safe-Level-SMOTE Data-level Yes (safe vs. risky regions) Yes Avoids generating risky (highly overlapped) samples
DBSMOTE Data-level Limited Yes Focus on dense regions; less effective on overlap
SMOTE-IPF Algorithm + data-level Yes Yes Iteratively removes misclassified samples
OBU + Evolutionary Methods Algorithm + data-level Yes (fully overlapped region) Yes Removes fully overlapped negatives; may lose useful data
DBMUTE Hybrid Yes (via clustering) Yes Clusters borderline data, then undersamples
K-US + KNN ensemble Data-level + ensemble Yes Yes Filters overlaps with KNN before ensemble classification
CNN + Tomek link Data-level Partial Yes Cleans overlapping instances; preserves useful majority
Neighbor-based undersampling Data-level Yes (focused removal) Yes Selectively removes harmful overlapped samples
SVM + RFE + two-step classification Algorithm-level Yes (via coefficients & steps) Yes Ranks features, reduces overlap, handles imbalance
Controlled overlapping generation Data-level Yes (controlled generation) Yes Generates samples in controlled overlap areas
Optimal oversampling framework (OOF) Hybrid Indirect (via similarity) Yes Uses optimization & similarity to generate refined minority samples

Materials and methodology

Quantification of overlapping

As previously discussed, real-world problems for the most part exhibit overlapping issues, particularly when accompanied by an imbalanced distribution of data, which substantially affects model performance, as reported in6. The degradation is attributed to the model’s difficulty in learning from imbalanced data that varies in its degree of overlap50. Although various studies18,50 have been offered in the literature to assess the degree of overlap, these methods assume a normal distribution of samples, which is not the case for most real-world datasets. Equation (1) is an updated version of the formula in45, which was initially created for problems with 2-dimensional feature spaces, and is used to accurately approximate the samples of complicated overlapping regions.

[Equation (1): image not recoverable from source]

The overlapping degree for minority class samples in the Critical-1 (overlapping) region is calculated via Eq. (2).

[Equation (2): image not recoverable from source]

Equations (3) and (4) calculate the majority class area using the Euclidean distance method and the k-nearest neighbor rule for the two classes.

[Equations (3) and (4): images not recoverable from source]

Using Eq. (3), if the membership probability of one class is high, then Eq. (4) yields the same for the other class. In this case, samples of both classes have similar characteristics, creating a complex overlapping region structure.
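The k-nearest-neighbor style of local class-membership estimate described above can be sketched as follows. This is a minimal illustration under our own assumptions (function name and toy data are hypothetical); the paper's exact Eqs. (3) and (4) are not reproduced here.

```python
import math
from collections import Counter

def knn_class_probability(x, data, labels, k=5):
    """Estimate P(class | x) with the k-nearest-neighbour rule: the
    fraction of x's k nearest neighbours (Euclidean distance) carrying
    each label -- the style of local estimate used by Eqs. (3)/(4)."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    nearest = sorted(range(len(data)), key=lambda i: dist(x, data[i]))[:k]
    counts = Counter(labels[i] for i in nearest)
    return {c: counts[c] / k for c in counts}

# Toy data: a query point deep inside the majority cluster.
data = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
labels = ["maj", "maj", "maj", "min", "min"]
probs = knn_class_probability([0.5, 0.5], data, labels, k=3)
```

When both class probabilities are close to 0.5 for many points, the two classes share the same neighborhood, which is exactly the complex overlap structure described above.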

Borderline vs. overlapped samples and their elimination

According to51, overlapping is found near class boundaries, which results in a higher misclassification rate. Borderline samples contribute to the overlapping region, where classifier performance declines sharply with increasing overlap. As shown in Fig. 2, most traditional classifiers limit the influence of overlapping samples by removing those close to the borderline. The training dataset is shown in Fig. 2a, where samples are unevenly distributed and overlap within the dotted circle. After applying the k-nearest neighbor criterion to Fig. 2a, the borderline samples causing overlapping problems are removed, yielding Fig. 2b.

Fig. 2. Dataset with (a) imbalance and overlapping samples, (b) removal of the borderline samples, and (c) elimination of the majority class overlapped data via under-sampling.

After the elimination process, the majority class samples are more observable in the dense overlapped region, as highlighted in Fig. 2b. To improve the visibility of the minority class samples and minimize the misclassification rate, majority class instances are removed, as shown in Fig. 2c. To retain important information, this study does not employ removal strategies; instead, SVM++ transforms samples from the overlapped regions into a higher dimension.

Algorithm to improve learning from overlapped and imbalanced multi-class data

Three algorithms are used in this study to minimize the deletion of negative samples in the overlapping region and maximize the visibility of positive samples. The first algorithm partitions the training data space into overlapping and non-overlapping samples. The second algorithm divides the highly dense overlapping samples into two regions, Critical-1 and Critical-2. In the final step, the mean of the maximum and minimum distances of the Critical-1 region samples is used to map those samples into a higher dimension by improving the traditional SVM kernel mapping function. The minority class samples are underrepresented and have limited visibility in the Critical-1 region, which makes it difficult for the underlying classifier to predict the right category label; the third proposed algorithm increases their visibility so that the classifier can correctly predict the target class.

Algorithm to detect overlapping and non-overlapping samples in data space

Overlapping close to class boundaries significantly reduces the underlying classifier’s learning ability, as covered in the literature review section. Algorithm 1 locates the various regions in the data space in order to improve learning performance and focus on the overlapped region. The non-overlapping region consists of majority and minority class samples that do not overlap with one another, whereas the overlapped region consists of samples that contribute significantly to the overlap. Using Eq. (5), Algorithm 1 computes the distance between the majority and minority class samples close to the boundary line.

\( d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \)  (5)

where d is the distance and a and b are two vectors (sample attributes). Using Eq. (6), we can find the distance for the nth row’s point as

[Equation (6): image not recoverable from source]

After computing the distance between each majority class sample and the minority class samples, we generate two sets: an uncertain region (the overlapped region, where the classifier is confused when predicting class labels) and a non-overlapping region, as shown in Fig. 3a. For each sample in the majority class, the three closest minority class neighbors are determined; the two closest of these are placed in the uncertain set. The process is repeated for every majority class sample to find the minority class samples in the uncertain region. At the end of this step, one set contains the identified minority class samples and the other contains the majority class samples. Both are added into the hard-region set, which is initially empty, using Eq. (7), as shown in Fig. 3b,c.

[Equation (7): image not recoverable from source]
Fig. 3. (a) Training dataset of imbalance and overlapped samples, (b) Majority and minority classes having a clear decision boundary, and (c) Overlapped region.

The hard region samples of both classes share similar characteristics. To further highlight the minority class samples, the hard region is divided into the Critical-1 and Critical-2 regions. After division, the Critical-1 region consists of those samples that exactly overlap or are near to overlapping.

Algorithm 1 finds the overlapped (hard) and non-overlapped regions, as shown in Fig. 3b,c. The non-overlapped region has no overlap, whereas in the hard region samples of different classes overlap with each other.

Algorithm 1. Find overlapped and non-overlapped regions.
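The region-splitting step of Algorithm 1 can be sketched roughly as follows. This is our interpretation only: the paper derives the split from Eqs. (5)-(7), while here the distance threshold `tau` and the function names are hypothetical placeholders.

```python
import math

def euclidean(a, b):
    """Euclidean distance of Eq. (5)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_regions(majority, minority, tau=1.0):
    """Hedged sketch of Algorithm 1: for each majority-class sample,
    examine its three nearest minority-class neighbours; the two closest
    of them that lie within distance `tau` (an assumed threshold) go,
    together with the majority sample itself, into the hard (overlapped)
    set. Everything else forms the non-overlapped set."""
    hard = []
    for m in majority:
        nn = sorted(minority, key=lambda s: euclidean(m, s))[:3]
        close = [s for s in nn[:2] if euclidean(m, s) <= tau]
        if close:
            if m not in hard:
                hard.append(m)
            for s in close:
                if s not in hard:
                    hard.append(s)
    rest = [s for s in majority + minority if s not in hard]
    return hard, rest

# Toy data: two tight mixed clusters near the origin, one far-away pair.
majority = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
minority = [[0.2, 0.1], [0.3, 0.0], [6.0, 6.0]]
hard, non_overlapped = split_regions(majority, minority)
```

The far-away pair ends up in the non-overlapped set because neither class's samples intrude into the other's neighborhood there.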

Algorithm to filter overlapped data into the Critical-1 and Critical-2 region

The problem dataset is split into overlapped and non-overlapped regions. The overlapped region still contains the samples of the majority and minority classes with nearly identical characteristics. While most samples in the overlapped region are visible enough for the classifier to predict the exact target class, a few samples from both classes overlap by sharing characteristics, making it difficult for the underlying classifier to predict the target class label. Algorithm-2 segregates the overlapped region to emphasize the instances where minority class samples have extremely low visibility and would be predicted as majority class samples by the underlying classifier. Algorithm-2 finds samples with similar features that create dense overlapping regions and filters the overlapped (hard region) data into the Critical-1 and Critical-2 regions: applying k-nearest neighbor rules over the hard-region samples, samples of the majority and minority classes at k=2 are placed in the Critical-1 region.

Algorithm 2. Filter the overlapped data obtained in Algorithm 1 into the Critical-1 and Critical-2 regions to get the overlapped, Critical-2, and non-overlapped data.
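The filtering idea of Algorithm 2 can be sketched as below. This is an interpretation, not the paper's exact procedure: the paper applies a k-nearest-neighbor rule, whereas this sketch uses a hypothetical nearest-opposite-class distance threshold `d1` to separate the most densely overlapped samples.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_critical(hard, labels, d1=0.5):
    """Hedged sketch of Algorithm 2: inside the hard (overlapped) region,
    a sample whose nearest opposite-class neighbour lies closer than the
    assumed threshold `d1` goes to Critical-1 (near-exact overlap);
    otherwise it goes to Critical-2."""
    critical1, critical2 = [], []
    for i, s in enumerate(hard):
        opp = [hard[j] for j in range(len(hard)) if labels[j] != labels[i]]
        nearest = min(euclidean(s, o) for o in opp)
        (critical1 if nearest < d1 else critical2).append(s)
    return critical1, critical2

# Toy hard region: one near-coincident pair, one moderately separated pair.
hard = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.6, 2.6]]
labels = ["maj", "min", "maj", "min"]
critical1, critical2 = filter_critical(hard, labels)
```

Only the near-coincident majority/minority pair lands in Critical-1; those are the samples SVM++ later lifts into a higher dimension.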

Algorithm to map Critical-1 region samples into higher dimension, using SVM++

This section presents SVM++, which modifies the kernel mapping function of the standard SVM. The objective is to increase the visibility of minority class samples by transforming the Critical-1 region samples into a higher dimension. The samples are first filtered into the hard and non-overlapped regions by Algorithm 1; hard-region samples are then further filtered into the Critical-1 and Critical-2 regions by Algorithm 2, using the nearest-neighbor distance rule to find the samples closest to exact overlap. To predict each target class, the algorithm presented in this section maps the overlapping (Critical-1) region samples into a higher dimension.

The SVM++ algorithm computes the distance between each majority and minority class sample in the Critical-1 region using the minimum Euclidean distance formula. One list contains all the minority class samples at distance k=1, whereas a second list contains the samples at distance k=2. As seen in Fig. 4, the first list holds samples at distances up to 1 and the second holds samples at distances between 1 and 2, to prevent duplication of samples across the lists. After this separation, the first list contains all samples from both classes that are extremely close to overlapping, which minimizes the samples’ visibility for the underlying classifier. The mapping function then maps these samples into a higher dimension to obtain the best decision boundary.

Fig. 4. Mapping data into a higher dimension.

Applying the mapping and choosing the manual mapping value for the training set samples is the most crucial part, because adjusting it too much risks producing outliers, which is not within the purview of this study. To minimize this risk, we developed a systematic range of values for mapping into the higher dimension by calculating the mean distance of each minority class sample to the majority class samples. All majority class samples at k=1 lie at distances between 0 (minimum) and 1 (maximum) from each minority class sample; in this case, the mean distance of sample i is 0.5. Another concern is that a sample might split or intersect with other nearest samples at distances k=1 and k=2 if its value is shifted only from 0 to 0.5. To reduce the likelihood of overlapping again, we increase the mean value by 0.5 for samples in the k=2 range and by 1 for all samples in the k=1 range. By defining this range, we significantly mitigate the effect of negative samples in the overlapping zone. Equation (8) displays a mapping function for sample x based on this mean.

\( \phi : X \rightarrow HD \)  (8)

where X represents the sample, HD denotes the higher-dimension space, and \( \phi \) denotes the mapping function. More precisely,

[Equation (9): image not recoverable from source]

which limits the mapping to the sum of all possible features, as Fig. 4 shows.

According to Eq. (10), we created a different map function for each of the two sample lists for every training sample.

[Equation (10): image not recoverable from source]

The first list contains the mapped samples in the higher dimension; the change made to the sample feature values is based on the mean distance, applied through the mapping function, while x represents samples in the input space. The minimum and maximum distances over all neighbors are 0 and 1, respectively. For the maximum value of k=1, the mean is 0.5 and the mapped value becomes 1.5; here, 1 is added through the mapping function to reduce the chance of overlapping again. A minor modification might not have a meaningful effect on the dimension if the minority class sample perfectly overlaps the majority class sample at k=0 with a mean value of 0.5.

Given that both class samples clearly have the same characteristics at k=0, a slight variation in a sample’s actual features with a mean value of 0.5 might not have a substantial impact on the dimension. To further increase the dimension of the sample, we boost the mean value by one. With similar objectives, a map function for the second (k=2) list was also developed.

[Equation (11): image not recoverable from source]

For samples located in the second list, the mapping adds a value to the training samples. To prevent samples from overlapping again, the range of the second list differs from that of the first. For example, at k=2, the minimum and maximum distances are 0 and 2, respectively, with a mean of 1; in this case, a small value of 0.5 is added to the samples’ features.

The overlapping effect is significantly lessened when the training samples are mapped into a higher dimension using these two separate values. A small percentage of samples re-overlap after being transformed into the higher dimension with the added value. This is why some samples still appear to have overlap issues when the precision of the minority class samples is analyzed in this paper. We avoid sampling approaches here; the overlapping issue could be addressed completely by applying under-sampling on the Critical-1 region to remove the negative instances.
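The lifting step described above can be sketched as a feature map that appends an offset built from the mean of the minimum and maximum opposite-class distances, shifted by 1 (k=1 range) or 0.5 (k=2 range). This is our reading of Eqs. (8)-(11); the function name and the exact way the offset is attached as an extra coordinate are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_higher_dim(sample, opposite):
    """Hedged sketch of the SVM++ mapping: lift a Critical-1 sample into
    a higher dimension by appending an offset built from the mean of its
    minimum and maximum distances to the opposite class, shifted by 1
    (k=1 range, distances <= 1) or 0.5 (k=2 range) to avoid re-overlap.
    Interpretation of the paper's description, not its exact formula."""
    dists = [euclidean(sample, o) for o in opposite]
    mean = (min(dists) + max(dists)) / 2
    offset = mean + 1 if max(dists) <= 1 else mean + 0.5
    return sample + [offset]  # the extra coordinate separates the classes

# A minority sample exactly overlapping one majority sample (distance 0)
# and close to another (distance 0.7): k=1 range, so the shift is +1.
mapped = map_higher_dim([0.2, 0.3], [[0.2, 0.3], [0.9, 0.3]])
```

Because exactly overlapping pairs receive different offsets only through their own class's distances, a kernel computed on the lifted vectors can separate points that were indistinguishable in the input space.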

Algorithm 3.


Mapping the Critical-1 region data (the output of Algorithm 2, shown in Fig. 5a) into a higher dimension (Fig. 5b) to reduce the negative impact of negative samples.

Fig. 5.


Mapping into a higher dimension, (a) the overlapped sample, and (b) overlapped samples in the higher dimension.

Normal SVM is applied to classify the non-overlapped samples, whereas SVM++ with the customized kernel is applied to classify the Critical-1 region. The final classification performance, as shown in the results section, is the average of the classification results on the non-overlapped sample set and on the Critical-1 region samples.
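A minimal sketch of this two-stage evaluation, assuming scikit-learn's SVC and substituting a degree-2 polynomial kernel (the degree the paper tunes) for the customized SVM++ kernel; the function name and data splits are illustrative:

```python
from sklearn.svm import SVC

def two_stage_accuracy(X_clean, y_clean, X_crit, y_crit,
                       X_test_clean, y_test_clean, X_test_crit, y_test_crit):
    """Train a plain SVM on the non-overlapped set and a second SVM on the
    Critical-1 region, then average the two accuracies as the final score."""
    plain = SVC(kernel="rbf").fit(X_clean, y_clean)
    custom = SVC(kernel="poly", degree=2).fit(X_crit, y_crit)  # stand-in kernel
    acc_clean = plain.score(X_test_clean, y_test_clean)
    acc_crit = custom.score(X_test_crit, y_test_crit)
    return (acc_clean + acc_crit) / 2.0
```

The averaging of the two per-region accuracies mirrors how the paper reports the final performance.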

Experimental results

In this research, we classify thirty real-world multi-class datasets and compare the proposed SVM++ classifier with several state-of-the-art classifiers based on data-level and algorithm-level approaches.

Real-world multi-class dataset classification using SVM++

For the experiments, we use thirty publicly available datasets downloaded from the KEEL and UCI repositories52; they are listed in Table 2 along with a short description. To obtain reliable estimates, we use 10-fold cross-validation, ensuring that every sample has an equal chance of being evaluated.
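The evaluation protocol can be reproduced with scikit-learn's StratifiedKFold (our library choice; the paper does not name one), so that every sample appears in a test fold exactly once and class proportions are preserved per fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_indices(y, n_splits=10, seed=42):
    """Return (train_idx, test_idx) pairs for stratified k-fold CV.
    Stratification keeps the class ratio of each fold close to the
    full dataset's ratio, which matters for imbalanced data."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(np.zeros(len(y)), y))
```

Each of the 10 test folds is disjoint, and their union covers the whole dataset.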

Table 2.

Real-world multi-class datasets.

Dataset Features Samples Minority class samples Majority class samples IR Overlap sensitivity cost in majority class Overlap sensitivity cost in minority class
Wisconsin 9 683 239 444 1.86 0.97 0.92
Pima 8 768 268 500 1.87 0.96 0.75
Glass0 9 214 70 144 2.06 0.77 0.68
Vehicle1 18 846 217 629 2.09 0.79 0.45
Haberman 3 306 81 255 2.78 0.78 0.34
Vehicle3 18 846 212 634 2.99 0.82 0.39
Vehicle0 18 846 199 647 3.25 0.93 0.80
Ecoli1 7 336 77 259 3.36 0.99 0.54
N-thyroid1 5 215 35 180 5.14 0.98 0.72
Ecoli2 7 336 52 284 5.46 0.98 0.46
Segment0 19 2308 329 1979 6.02 0.97 0.41
Glass6 9 214 29 185 6.38 0.97 0.78
Yeast3 8 1484 163 1321 8.1 0.96 0.37
Ecoli3 7 336 35 301 8.6 0.98 0.26
Yeast2-4 8 1484 51 1433 9.08 0.97 0.67
Vowel0 13 988 90 898 9.98 0.99 0.75
Glass2 9 214 17 197 11.59 0.93 0.23
Yeast1-7 7 1484 30 1454 14.03 0.96 0.9
Glass4 9 214 13 201 15.46 0.97 0.45
Ecoli4 7 336 20 306 15.08 0.98 0.74
Page-bloc13-2 10 472 28 444 15.86 0.95 0.60
Abalone9-18 8 731 42 689 16.04 0.97 0.20
Shuttle2-4 9 230 6 123 20.5 1 0.33
Shuttle6-23 9 293 10 220 22.01 1 0.85
Glass5 9 214 9 205 22.78 0.97 0.30
Yeast4 8 1484 51 1433 28.01 0.97 0.23
Ecoli0137-26 7 336 7 327 39.14 0.99 0.43
Yeast6 8 1484 35 1449 41.4 0.98 0.45
Wine-white3-7 11 1482 25 1457 58.28 0.98 0.09
Wine-red3-5 11 691 10 681 68.1 0.98 0.02

Every dataset has a different number of samples, features, imbalance ratios, and classes; for example, features range from 3 to 19, imbalance ratios from about 2 to 68, and sample counts from 214 to 2308. The imbalance ratio for the multi-class datasets is calculated by applying Eqs. (12) and (13).

IR = (number of majority-class samples) / (number of minority-class samples)    (12)
[Equation (13): multi-class extension of the imbalance ratio; equation image not recovered] (13)
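Consistent with the IR column of Table 2 (e.g. Wisconsin: 444/239 ≈ 1.86), the imbalance ratio can be computed as the majority-to-minority count ratio; a sketch, assuming this standard convention since the equation images did not reproduce:

```python
import numpy as np

def imbalance_ratio(y):
    """IR = size of the largest class / size of the smallest class.
    `y` holds integer class labels; empty classes are ignored."""
    counts = np.bincount(np.asarray(y))
    counts = counts[counts > 0]
    return counts.max() / counts.min()
```

For a multi-class dataset this takes the extreme classes; a per-pair IR is the same ratio restricted to two classes.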

Evaluation metrics

In the scenario of a multi-class imbalanced and overlapped dataset, the confusion matrix must be used, as recommended in53, because the accuracy measure (see Eq. (14)) cannot shed light on the classifier's performance on individual classes.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (14)

The proposed SVM++ is evaluated on multi-class imbalanced and overlapping datasets using several metrics derived from the confusion matrix45: precision, G-mean, sensitivity, and F1-score. Equation (15) computes precision, which trades off positive- and negative-class accuracy. Equation (16) uses the G-mean to balance the accuracy of the majority and minority classes, and Eq. (17) gives the sensitivity, the accuracy on minority-class instances.

Precision = TP / (TP + FP)    (15)
G-mean = sqrt(Sensitivity × Specificity)    (16)

where,

Sensitivity = TP / (TP + FN)    (17)
Specificity = TN / (TN + FP)    (18)
F1-score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)    (19)
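The confusion-matrix metrics used here (precision, sensitivity, specificity, G-mean, F1-score) can be written out directly from the binary entries TP, FP, FN, TN:

```python
import math

def precision(tp, fp):
    return tp / (tp + fp)

def sensitivity(tp, fn):          # recall on the minority (positive) class
    return tp / (tp + fn)

def specificity(tn, fp):          # accuracy on the majority (negative) class
    return tn / (tn + fp)

def g_mean(tp, fn, tn, fp):       # geometric mean balances both classes
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

def f1_score(tp, fp, fn):         # harmonic mean of precision and recall
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r)
```

With a symmetric confusion matrix (TP = TN = 8, FP = FN = 2) all five metrics evaluate to 0.8, which is a quick sanity check.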

Results and discussion

The proposed SVM++ classifier is compared with ten different approaches in the current study: five based on the data-level approach and five on the algorithm-level approach. The comparison is done on 30 real datasets. Data-level approaches include Basic Neighborhood Search (NB-Basic)45, Modified Tomek Link Search (NB-Tomek)45,54, SMOTE-SVM18, Overlap-Based Undersampling (OBU)40, and a KNN-based under-sampling method (K_US)43. Algorithm-level approaches include SVM, Overlap-Sensitive Margin (OSM)18, Fuzzy Support Vector Machine for Class Imbalance Learning (FSVM-CIL)55, Radial Basis Function Network (RBFN)39, and KNN. To achieve a balanced distribution of samples in the sampling-based techniques, we apply under-sampling and over-sampling to the training dataset while maintaining the original state of the testing samples. Using 10-fold cross-validation, we used the tuned parameters provided in18 for SVM, OSM, and FSVM-CIL. For KNN, we used grid-search over k in {1, 3, 5, 7, 9} to determine the optimal value, K = 5. By eliminating over-represented (majority class) samples, these strategies maximize the prominence of the minority class samples. In contrast to the aforementioned and more recent approaches, the strategy suggested in this work maps the minority class samples into a higher dimension to maximize their visibility without eliminating the overlapped samples. For the proposed SVM++ classifier, the critical parameters, namely the constant term c and the slope alpha, are optimized using a grid-search approach with a polynomial kernel of degree 2.
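The KNN tuning step described above can be sketched with scikit-learn's GridSearchCV (the function name tune_knn is ours; the k grid is the one stated in the text):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(X, y, cv=10):
    """Grid-search k over {1, 3, 5, 7, 9} with cross-validated accuracy
    and return the best-performing number of neighbours."""
    grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [1, 3, 5, 7, 9]}, cv=cv)
    grid.fit(X, y)
    return grid.best_params_["n_neighbors"]
```

The same grid-search pattern applies to the SVM++ parameters c and alpha, with a `param_grid` over those values instead.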

Table 3 shows the classification performance in terms of accuracy for the 11 classifiers on the 30 real datasets, arranged in ascending order by imbalance ratio. Bold values indicate the proposed classifier's best performance. Similarly, Tables 4 and 5 display the 11 classifiers' classification performance in terms of precision and G-mean. With the exception of the Vehicle3, Vehicle1, and Yeast3 datasets, the suggested SVM++ classifier performs better for the datasets with imbalance ratios of less than 12.

Table 3.

Classification performance in terms of accuracy for 11 algorithms on 30 real datasets.

Datasets NB-basic NB-Tomek SMOTE+SVM OBU K_US SVM++ SVM OSM FSVM-CIL RBFN KNN
Wisconsin 90.11 87.79 89.99 67.7 67.44 97.34 87.88 96.19 83.61 87.55 90.9
Pima 89.21 80.03 75.76 74.94 66.2 81.02 73.65 47.15 34.57 38.51 84.79
Glass0 74.24 61.75 75.28 89.09 81.53 92.11 87.17 82.27 79.69 91.63 72.01
Vehicle1 70.27 65.14 55.95 65.09 44.79 68.47 63.38 73.47 60.89 64.83 83.3
Haberman 77.33 67.84 75.22 78.98 72.11 80.43 73.61 38.59 46.01 51.95 58.88
Vehicle3 49.76 73.17 76.53 87.03 71.33 74.22 64.42 76.8 64.22 70.16 76.21
Vehicle0 61.91 61.91 91.43 91.41 80.37 92.11 89.32 95.58 83 88.94 39.64
Ecoli1 70.37 66.91 64.9 75.76 77.7 87.93 92.17 88.15 75.57 81.51 65.18
New-thyroid1 73.19 71.94 97.99 88.2 82.52 91.83 100 93.09 85.33 89.27 61.73
Ecoli2 92.09 88.95 97.99 85.7 69.92 98.19 95.88 75.15 62.57 66.51 69.28
Segment0 50.1 58.7 96.42 84.69 76.41 90.23 84.31 81.15 68.57 77.51 78.5
Glass6 75.42 68.74 72.35 78 79.22 80.54 70.24 75.32 72.74 70.68 62.55
Yeast3 55.15 57.42 69.78 60.53 70.76 63.2 67.67 71.15 72.57 62.51 71.65
Ecoli3 48.1 48.1 47.91 62.55 66.97 71.29 65.8 56.15 43.57 67.51 63.1
Yeast2-4 66.73 71.21 78.2 73.04 74.06 81.95 76.09 77.15 64.57 70.51 78.88
Vowel0 100 100 100 94.99 90.49 100 100 98.13 85.55 89.49 100
Glass2 71.95 68.04 77.99 79.15 47.24 85.95 85.24 80.8 68.22 72.16 67.99
Yeast1-7 66.34 62.38 61.82 67.52 52.03 77.92 75.21 73.22 70.49 67.51 66.94
Glass4 97.88 96.63 97.99 87.65 67.48 100 95.88 75.58 63 68.94 77.78
Ecoli4 80.5 80.5 95.99 82.71 87.22 96.72 95.88 93.15 80.57 84.51 80.94
Page-bloc13-2 69.97 68.41 95.99 81.23 96.77 96.93 95.88 98.68 86.1 90.04 84.4
Abalone9-18 52.73 41.4 73.28 57.93 60.22 77.55 70.17 64.15 46.57 51.51 51.15
Shuttle2-4 64.2 59.34 67.42 68.64 61.79 95.43 85.31 93.28 80.7 64.64 86.6
Shuttle6-23 69.07 73.72 63.22 80.16 86.11 72.34 67.91 95.42 83.18 83.12 64.4
Glass5 87.44 81.95 88.08 89.24 58.16 91.23 85.07 76.15 63.57 67.51 67.28
Yeast4 47.51 50.2 55.91 62.77 69.41 72.63 68.8 74.37 41.79 45.73 48.94
Ecoli0137-26 55.81 53.32 67.99 66.71 90.49 91.35 65.18 72.46 69.88 63.82 63.29
Yeast6 60.29 59.32 77.95 65.82 70.81 90 53.81 74.22 60.84 64.78 66.95
Wine-white3-7 65.25 63.04 69.47 69.92 62.09 82.91 78.36 73.89 61.31 75.25 69.11
Wine-red3-5 64 69.28 60.26 72.03 70.61 75.32 68.15 70.51 57.53 61.87 51.51
Average rank 6.63 7.8 5 7.73 6.23 2.76 5.36 3.93 7.56 6.16 6.2
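The "Average rank" rows in Tables 3, 4, and 5 can be reproduced by ranking the 11 classifiers on each dataset (rank 1 = best score) and then averaging per classifier; a sketch assuming SciPy:

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(scores):
    """scores: array of shape (n_datasets, n_classifiers), higher = better.
    Rank within each dataset row (negate so the best score gets rank 1,
    ties get the average rank), then average the ranks column-wise."""
    ranks = np.vstack([rankdata(-row) for row in scores])
    return ranks.mean(axis=0)
```

A lower average rank means the classifier is more consistently near the top across datasets, which is why SVM++'s 2.76 in Table 3 reads as the best overall.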

Table 4.

Classification performance in terms of precision for 11 algorithms on 30 real datasets.

Datasets NB-basic NB-Tomek SMOTE+SVM OBU K_US SVM++ SVM OSM FSVM-CIL RBFN KNN
Wisconsin 88.22 85.9 88.1 65.81 65.55 95.45 85.99 94.3 81.72 85.66 89.01
Pima 87.32 78.14 73.87 73.05 64.31 79.13 71.76 45.26 32.68 36.62 82.9
Glass0 72.35 59.86 73.39 87.2 79.64 90.22 85.28 80.38 77.8 89.74 70.12
Vehicle1 68.38 63.25 54.06 63.2 42.9 66.58 61.49 71.58 59 62.94 81.41
Haberman 75.44 65.95 73.33 77.09 70.22 78.54 71.72 36.7 44.12 50.06 56.99
Vehicle3 47.87 71.28 74.64 85.14 69.44 72.33 62.53 74.91 62.33 68.27 74.32
Vehicle0 60.02 60.02 89.54 89.52 78.48 90.22 87.43 93.69 81.11 87.05 37.75
Ecoli1 68.48 65.02 63.01 73.87 75.81 86.04 90.28 86.26 73.68 79.62 63.29
New-thyroid1 71.3 70.05 96.1 86.31 80.63 89.94 98.11 91.2 83.44 87.38 59.84
Ecoli2 90.2 87.06 96.1 83.81 68.03 96.3 93.99 73.26 60.68 64.62 67.39
Segment0 48.21 56.81 94.53 82.8 74.52 88.34 82.42 79.26 66.68 75.62 76.61
Glass6 73.53 66.85 70.46 76.11 77.33 78.65 68.35 73.43 70.85 68.79 60.66
Yeast3 53.26 55.53 67.89 58.64 68.87 61.31 65.78 69.26 70.68 60.62 69.76
Ecoli3 46.21 46.21 46.02 60.66 65.08 69.4 63.91 54.26 41.68 65.62 61.21
Yeast2-4 64.84 69.32 76.31 71.15 72.17 80.06 74.2 75.26 62.68 68.62 76.99
Vowel0 98.11 98.11 98.11 93.1 88.6 98.11 98.11 96.24 83.66 87.6 98.11
Glass2 70.06 66.15 76.1 77.26 45.35 84.06 83.35 78.91 66.33 70.27 66.1
Yeast1-7 64.45 60.49 59.93 65.63 50.14 76.03 73.32 71.33 68.6 65.62 65.05
Glass4 95.99 94.74 96.1 85.76 65.59 98.11 93.99 73.69 61.11 67.05 75.89
Ecoli4 78.61 78.61 94.1 80.82 85.33 94.83 93.99 91.26 78.68 82.62 79.05
Page-bloc13-2 68.08 66.52 94.1 79.34 94.88 95.04 93.99 96.79 84.21 88.15 82.51
Abalone9-18 50.84 39.51 71.39 56.04 58.33 75.66 68.28 62.26 44.68 49.62 49.26
Shuttle2-4 62.31 57.45 65.53 66.75 59.9 93.54 83.42 91.39 78.81 62.75 84.71
Shuttle6-23 67.18 71.83 61.33 78.27 84.22 70.45 66.02 93.53 81.29 81.23 62.51
Glass5 85.55 80.06 86.19 87.35 56.27 89.34 83.18 74.26 61.68 65.62 65.39
Yeast4 45.62 48.31 54.02 60.88 67.52 70.74 66.91 72.48 39.9 43.84 47.05
Ecoli0137-26 53.92 51.43 66.1 64.82 88.6 89.46 63.29 70.57 67.99 61.93 61.4
Yeast6 58.4 57.43 76.06 63.93 68.92 88.11 51.92 72.33 58.95 62.89 65.06
Wine-white3-7 63.36 61.15 67.58 68.03 60.2 81.02 76.47 72 59.42 73.36 67.22
Wine-red3-5 62.11 67.39 58.37 70.14 68.72 73.43 66.26 68.62 55.64 59.98 49.62
Average Rank 5.46 7.33 6.8 8.266 5.93 2.63 5.2 5 5.76 6.6 6.96

Table 5.

Classification performance in terms of G-mean for 11 algorithms on 30 real datasets.

Datasets NB-Basic NB-Tomek SMOTE+SVM OBU K_US SVM++ SVM OSM FSVM-CIL RBFN KNN
Wisconsin 86 83.68 85.88 63.59 63.33 93.23 83.77 92.08 79.5 83.44 86.79
Pima 85.1 75.92 71.65 70.83 62.09 76.91 69.54 43.04 30.46 34.4 80.68
Glass0 70.13 57.64 71.17 84.98 77.42 88 83.06 78.16 75.58 87.52 67.9
Vehicle1 66.16 61.03 51.84 60.98 40.68 64.36 59.27 69.36 56.78 60.72 79.19
Haberman 73.22 63.73 71.11 74.87 68 76.32 69.5 34.48 41.9 47.84 54.77
Vehicle3 45.65 69.06 72.42 82.92 67.22 70.11 60.31 72.69 60.11 66.05 72.1
Vehicle0 57.8 57.8 87.32 87.3 76.26 88 85.21 91.47 78.89 84.83 35.53
Ecoli1 66.26 62.8 60.79 71.65 73.59 83.82 88.06 84.04 71.46 77.4 61.07
New-thyroid1 69.08 67.83 93.88 84.09 78.41 87.72 95.89 88.98 81.22 85.16 57.62
Ecoli2 87.98 84.84 93.88 81.59 65.81 94.08 91.77 71.04 58.46 62.4 65.17
Segment0 45.99 54.59 92.31 80.58 72.3 86.12 80.2 77.04 64.46 73.4 74.39
Glass6 71.31 64.63 68.24 73.89 75.11 76.43 66.13 71.21 68.63 66.57 58.44
Yeast3 51.04 53.31 65.67 56.42 66.65 59.09 63.56 67.04 68.46 58.4 67.54
Ecoli3 43.99 43.99 43.8 58.44 62.86 67.18 61.69 52.04 39.46 63.4 58.99
Yeast2-4 62.62 67.1 74.09 68.93 69.95 77.84 71.98 73.04 60.46 66.4 74.77
Vowel0 95.89 95.89 95.89 90.88 86.38 95.89 95.89 94.02 81.44 85.38 95.89
Glass2 67.84 63.93 73.88 75.04 43.13 81.84 81.13 76.69 64.11 68.05 63.88
Yeast1-7 62.23 58.27 57.71 63.41 47.92 73.81 71.1 69.11 66.38 63.4 62.83
Glass4 93.77 92.52 93.88 83.54 63.37 95.89 91.77 71.47 58.89 64.83 73.67
Ecoli4 76.39 76.39 91.88 78.6 83.11 92.61 91.77 89.04 76.46 80.4 76.83
Page-bloc13-2 65.86 64.3 91.88 77.12 92.66 92.82 91.77 94.57 81.99 85.93 80.29
Abalone9-18 48.62 37.29 69.17 53.82 56.11 73.44 66.06 60.04 42.46 47.4 47.04
Shuttle2-4 60.09 55.23 63.31 64.53 57.68 91.32 81.2 89.17 76.59 60.53 82.49
Shuttle6-23 64.96 69.61 59.11 76.05 82 68.23 63.8 91.31 79.07 79.01 60.29
Glass5 83.33 77.84 83.97 85.13 54.05 87.12 80.96 72.04 59.46 63.4 63.17
Yeast4 43.4 46.09 51.8 58.66 65.3 68.52 64.69 70.26 37.68 41.62 44.83
Ecoli0137-26 51.7 49.21 63.88 62.6 86.38 87.24 61.07 68.35 65.77 59.71 59.18
Yeast6 56.18 55.21 73.84 61.71 66.7 85.89 49.7 70.11 56.73 60.67 62.84
Wine-white3-7 61.14 58.93 65.36 65.81 57.98 78.8 74.25 69.78 57.2 71.14 65
Wine-red3-5 59.89 65.17 56.15 67.92 66.5 71.21 64.04 66.4 53.42 57.76 47.4
Average Rank 4.5 5.3 6.2 5.63 7.3 3.8 7.23 4.7 6.16 7.6 7.4

In terms of precision, F1-score, and G-mean on the largest datasets, the data-level approaches perform significantly better than the algorithm-level strategies when compared side by side. Their strength stems from removing or adding samples in the overlapping region to balance the distribution of samples. However, over-fitting and information loss are significant drawbacks of these data-level approaches, making it difficult to generalize models based on them. By focusing on the entire overlapping region, the OBU-based model outperforms its counterpart data-level techniques, but at the expense of removing too many negative examples from the training set. NB-Basic and SMOTE-SVM perform better among the data-level approaches, especially when the imbalance ratio and overlapping degree are greater, because SMOTE significantly lessens the problems of model over-fitting and information loss. However effective the NB-Basic approach is, its removal of majority class instances has a negative effect on the classification outcomes.

One advantage of the suggested model is that SVM++ locates the exactly or nearly overlapped samples in the training set rather than considering the complete overlapping region when handling negative examples. For most datasets, the regular SVM and OSM classifiers perform better than FSVM-CIL, RBFN, and KNN among the algorithm-based approaches. By making positive samples more visible, both OSM and SVM with kernel tricks significantly lessen the overlapping impact of negative samples. At the same time, on highly imbalanced datasets with IR > 20, all algorithm-based techniques, such as SVM, perform poorly, especially for F1-score and G-mean. This is because the majority class samples dominate the overlapped region, disregarding the minority class samples, which results in a higher misclassification rate. As stated in Algorithms 1 and 2, the suggested SVM++ model exhibits overall excellent performance due to its three-level filtration of the samples. Based on the distance measured between the majority class and minority class samples, the data are first filtered into non-overlapped and uncertain (overlapped) regions. In the second filter, the doubtful data are further subdivided into the Critical-1 and Critical-2 regions.
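The first filtration step can be illustrated as follows. This is only a sketch in the spirit of Algorithm-1: a sample whose nearest neighbour from the other class lies closer than a threshold `tau` is flagged as overlapped/uncertain, where `tau` is our own illustrative parameter, not a value from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_overlapped(X, y, tau=1.0):
    """Return a boolean mask: True for samples in the uncertain (overlapped)
    region, False for clearly non-overlapped samples."""
    overlapped = np.zeros(len(X), dtype=bool)
    for cls in np.unique(y):
        own, other = X[y == cls], X[y != cls]
        # distance from each sample of `cls` to its nearest other-class sample
        nn = NearestNeighbors(n_neighbors=1).fit(other)
        dist, _ = nn.kneighbors(own)
        overlapped[np.flatnonzero(y == cls)] = dist.ravel() < tau
    return overlapped
```

Samples flagged True would go on to the second filter (Critical-1 vs. Critical-2); the rest are handled by the normal SVM.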

Table 6 provides the area under the curve (AUC) results for all 11 models used in this study on 10 real-world datasets. AUC values fall into various groups; a value between 90 and 100 indicates near-perfect discrimination, showing the model's capacity to separate positive and negative class samples. The results in Table 6 show that AUC varies for each dataset; however, the proposed SVM++ shows the highest AUC for most of them. For example, it has the highest AUC of 96.39% for the Wisconsin dataset, 91.16% for Glass0, 79.47% for Haberman, 94.63% for Vehicle0, and 97.24% for Ecoli2. For the Pima and Ecoli1 datasets, its AUC is between 80.0% and 90.0%, which shows good performance that can be further improved. For the Vehicle1 dataset, the performance of SVM++ is low, with 67.51% AUC; however, the other models have similar AUC values for this dataset, except for KNN, which shows an AUC of 82.34%.
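The AUC values in Table 6 can be computed from a classifier's decision scores; a small toy example using scikit-learn's roc_auc_score (the scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and decision scores: one positive (0.35) scores below a
# negative (0.4), so 3 of the 4 positive/negative pairs are ranked correctly.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, scores)   # 3/4 correctly ordered pairs -> 0.75
```

Multiplying by 100 gives the percentage form reported in Table 6.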

Table 6.

Classification performance in terms of AUC for 11 algorithms on 10 real datasets.

Datasets NB-basic NB-Tomek SMOTE+SVM OBU K_US SVM++ SVM OSM FSVM-CIL RBFN KNN
Wisconsin 89.15 86.83 89.03 66.74 66.48 96.39 86.92 95.24 82.65 86.59 89.95
Pima 88.25 79.07 74.80 73.98 65.24 80.06 72.69 46.19 33.60 37.54 83.83
Glass0 73.28 60.79 74.32 88.13 80.57 91.16 86.21 81.31 78.73 90.68 71.05
Vehicle1 69.31 64.18 54.99 64.13 43.82 67.51 62.42 72.51 59.93 63.87 82.34
Haberman 76.37 66.88 74.26 78.02 71.15 79.47 72.65 37.62 45.05 50.99 57.92
Vehicle3 48.80 72.21 75.57 86.07 70.37 73.26 63.46 75.84 63.26 69.20 75.25
Vehicle0 60.95 60.95 90.48 90.46 79.41 94.63 88.36 91.16 82.04 87.98 38.67
Ecoli1 69.41 65.95 63.94 74.80 76.74 86.97 91.22 87.19 74.61 80.55 64.22
New-thyroid1 72.23 70.98 97.04 87.24 81.56 90.88 99.05 92.14 84.37 88.31 60.77
Ecoli2 91.14 87.99 97.04 84.74 68.96 97.24 94.93 74.19 61.61 65.55 68.32

Figure 5 shows that the samples of both classes that exactly or nearly overlap with one another are found in the Critical-1 region. A mapping function is created to map each sample in the Critical-1 and Critical-2 regions into the higher dimension, as shown in Fig. 5b, allowing a straight-line decision boundary between the majority and minority samples. The higher-dimension value of a sample is based on the mean distance between each minority sample and the majority samples, and vice versa.

A special contribution of this study, among several others, is to significantly lessen the influence of negative samples by systematically imposing a higher-dimension mapping on a specific set of samples. Here, we precisely define the overlapped region to address the influence of the overlapped samples, in contrast to data-level approaches that remove or add samples in the training dataset to guarantee an accurate representation of the minority class. The suggested SVM++ performs only moderately on several datasets, namely Shuttle6-23, Vehicle1, Vehicle3, Yeast3, and Yeast1-7, which have small overlap-sensitivity values for the minority class.

Figures 6, 7, and 8 show the performance comparison of various state-of-the-art models with SVM++ concerning accuracy, precision, and G-mean, respectively, indicating that SVM++ yields better values for these metrics.

Fig. 6.


Average accuracy of SVM++ vs state-of-the-art classifiers, (a) first 15 datasets, and (b) last 15 datasets.

Fig. 7.


Average precision of SVM++ vs state-of-the-art classifiers, (a) first 15 datasets, and (b) last 15 datasets.

Fig. 8.


Average G-mean of SVM++ vs state-of-the-art classifiers, (a) first 15 datasets, and (b) last 15 datasets.

SVM++ yields very satisfying results when we examine the highly imbalanced (IR > 20) and overlapping datasets. This is because SVM++ filters the overlapped region once more into the Critical-1 and Critical-2 regions and repels samples based on their mean distance. Table 2 illustrates how the minority class in highly imbalanced datasets such as Glass5, Yeast4, Ecoli0137-26, Yeast6, Wine-white3-7, and Wine-red3-5 is made up of extremely rare occurrences compared to instances of the majority class, which leaves the other classifiers at a severe disadvantage. According to the overlap sensitivity cost for the majority and minority classes, the majority class instances predominate in the overlapping region, which biases the underlying classifier's target class prediction.

The suggested SVM++ outperforms all other data-level approaches in accuracy, precision, and G-mean for most datasets when compared to the most recent and comparable methods such as NB-Basic, NB-Tomek, OBU, K_US, and SMOTE+SVM. All of the data-level techniques described above are based on sampling, which involves adding or deleting samples from the training dataset; SVM++, in contrast, preserves the original training dataset by mapping the nearly identical overlapping samples into a higher dimension. Compared with the algorithm-level techniques (SVM, OSM, FSVM-CIL, RBFN, and KNN), the suggested SVM++ exhibits exceptional performance when learning from datasets that are imbalanced, complex, or overlapping. To reduce classification error, the comparable algorithm-level techniques added various sampling strategies or feature trimming, whereas the suggested SVM++ retains the original training dataset and features and maps the overlapped samples into a higher dimension. SVM++ performed better than all the classifiers in overall classification for the largest dataset.

Figure 9 shows the two-dimensional plots of the average rank for accuracy, precision, and G-mean to facilitate the comparison between SVM++ and the rest of the classifiers used in the experiment. The plot makes clear that SVM++ outperforms all the classifiers in the average rank of precision and accuracy, showing the stability of the proposed model. For G-mean, SVM++ is slightly better than the other classifiers.

Fig. 9.


Average rank of accuracy, precision, and G-mean.

“Algorithm to detect overlapping and non-overlapping samples in data space” highlights the strategy of SVM++ to maximize the visibility of minority class samples. SVM++ uses better optimization, imbalance handling, and kernel tuning, which leads to higher classification performance. A comparison of SVM++ with existing approaches is given in Table 7.

Table 7.

Comparison of SVM++ with best known competitors across key performance metrics.

Metric SVM++ mean (%) Best competitor Best competitor mean (%) Is SVM++ SOTA?
Accuracy 85.90 SVM, OSM (likely) Around 82–83 Yes
Precision 84.00 SVM, OSM (likely) Around 81–82 Yes
G-mean 82.50 SVM, OSM (likely) Around 79–80 Yes

In addition to the performance comparison with SOTA approaches, we performed the Friedman test; the results are given in Table 8. The p-values are all below 0.001, which indicates that the differences are statistically significant, i.e., SVM++’s improvements are not by chance, further highlighting its improved performance.

Table 8.

Friedman test statistics and corresponding p-values for different metrics.

Metric Friedman statistics P-value
Accuracy 176.49 < 0.0001
Precision 174.69 < 0.0001
G-mean 177.59 < 0.001
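The Friedman statistics in Table 8 can be reproduced with scipy.stats.friedmanchisquare, passing one array of per-dataset scores per classifier (all aligned on the same datasets):

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_p(score_matrix):
    """score_matrix: shape (n_classifiers, n_datasets).
    Friedman ranks the classifiers within each dataset and tests whether
    the mean ranks differ more than chance would allow."""
    stat, p = friedmanchisquare(*[row for row in score_matrix])
    return stat, p
```

With three classifiers ranked identically across six toy datasets, the ordering is perfectly consistent and the test reports a small p-value, mirroring the significance reported in Table 8.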

Discussion

One of the factors that affect the performance of ML models is imbalanced class distribution. The class imbalance problem is further complicated as the number of classes in a dataset increases. Despite existing studies dealing with class imbalance, studies investigating multi-class imbalanced problems are rather scarce. This study proposes a solution in this regard by introducing three algorithms, Algorithm-1, Algorithm-2, and Algorithm-3.

Algorithm-1 splits the training set so that the overlapped and non-overlapped samples can be identified. It is followed by Algorithm-2, which then identifies critical regions called Critical-1 and Critical-2; it segregates the overlapped region by finding the hard regions where the overlapping is high. Finally, the proposed SVM++ modifies the kernel mapping to improve classification. It aims to maximize the visibility of minority class samples by mapping samples in the Critical-1 region into a higher dimension.

Experiments involve a total of 11 models, including those already reported in the literature, covering both data-level and algorithm-level approaches. NB-Basic45, NB-Tomek45,54, SMOTE-SVM18, OBU40, and K_US43 are the data-level approaches used in this study, while SVM, OSM18, FSVM-CIL55, RBFN39, and KNN are the algorithm-level approaches. Experiments are performed on 30 real-world datasets.

Experimental findings reveal that on the largest datasets, the data-level methods prove to be a better option than the algorithm-level models concerning precision, F1-score, and G-mean. The superior outcomes of data-level algorithms come from their ability to adjust the number of samples by removing or adding them in overlapping regions. However, despite their good performance, data-level models can lead to model overfitting as well as loss of information, which can result in low generalizability to unseen data.

It is also observed that the OBU-based model performs comparatively better than the other data-level techniques; however, it removes a large number of negative samples during training. Other data-level methods such as NB-Basic and SMOTE-SVM stand out, particularly in scenarios where the data imbalance is high and the overlap is large. Nonetheless, NB-Basic removes too many samples from the majority class, which affects classification accuracy, whereas SMOTE helps mitigate the overfitting problem and information loss.

Predominantly, SVM and OSM classifiers have better outcomes than FSVM-CIL, RBFN, and KNN among the algorithm-level methods. OSM and SVM help increase the visibility of positive class samples, which reduces the impact of overlapping negative samples. However, when dealing with highly imbalanced datasets, particularly when the imbalance ratio > 20, most algorithm-based models, including SVM, tend to show poor performance. This is primarily because the majority class dominates the overlapping regions and the minority class is ignored due to its low number of samples, leading to a higher number of misclassifications.

The proposed SVM++ is advantageous over other models as it targets only the exactly or nearly overlapping samples; it does not remove all negative samples from the overlapping region. As described in Algorithms 1 and 2, SVM++ filters the samples in three stages, starting by measuring the distance between samples from the majority and minority classes to split them into clearly separated, uncertain, or overlapping groups. Results confirm its superior capability in handling overlapping regions, leading to better performance concerning accuracy, G-mean, and AUC.

Conclusion

In this research, we addressed the overlapping influence in multi-class imbalanced datasets by proposing an improved version of SVM, SVM++, with a customized kernel mapping function. This study makes several contributions. First, while the majority of techniques described in the literature address only class imbalance, SVM++ addresses both the imbalance (with imbalance ratios from low to high) and the overlapping issues in learning from multi-class problems. Second, we design Algorithm-1 to divide the dataset into uncertain and non-overlapped regions (the uncertain region consists of the overlapped samples) using the Euclidean distance, providing a basis for deeper insight into and study of the hard region. Third, using the k-nearest neighbor method, Algorithm-2 further separates the uncertain region into two sections, the Critical-1 and Critical-2 regions; the Critical-1 region then contains the samples that overlap exactly or very nearly, narrowing the focus to those samples. Fourth, Algorithm-3 changes the SVM kernel mapping function, based on the mean of the maximum and minimum distances, to transform the Critical-1 region samples, which share similar characteristics, into a higher-dimensional space. Finally, we create a range for mapping the overlapping samples into the higher dimension based on the mean distance between the majority and minority class samples in the Critical-1 region; depending on this mean distance, a mapping function maps every sample in the Critical-1 region into a constrained higher dimension, thereby changing the SVM kernel. The key contribution of the proposed SVM++ classifier is its ability to maximize the visibility of the minority class samples without eliminating samples from the overlapped region (the Critical-1 region samples).
After the mapping into the higher dimension, the learning capability of SVM++ improves over its counterparts. In the experiments, SVM++ shows favorable results for accuracy, precision, and G-mean compared to different data-level and algorithm-level methods; still, the proposed model can be improved. For an extremely large dataset with a reasonable number of minority class samples, the samples in the higher-dimensional space may overlap again, as we only repel or map the samples up to a distance of k = 2; a specialized model addressing this will be evaluated in our future work. The results can be further improved if the proposed model is combined with sampling strategies, followed by proper feature engineering.

Acknowledgements

This research is supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R897), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Author contributions

ZM conceptualization, formal analysis and writing - the original draft. LJ data curation, visualization, methodology. DAS software, visualization and investigation. IA validation, supervision and writing - review & edit the manuscript. All authors read and approved the final manuscript.

Funding

This research is supported by Princess Nourah bint Abdulrahman University Researchers supporting Project number(PNURSP2025R897), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data availability

The dataset used in this study can be requested from the first author Zafar Mahmood by contacting at zafar.mehmood@uog.edu.pk.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Ganganwar, V. An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng.2(4), 42–47 (2012). [Google Scholar]
  • 2.Borsos, Z., Lemnaru, C. & Potolea, R. Dealing with overlap and imbalance: A new metric and approach. Pattern Anal. Appl.21(2), 381–395 (2018). [Google Scholar]
  • 3.Prati, R.C., Batista, G.E. & Monard, M.C. Class imbalances versus class overlapping: An analysis of a learning system behavior. In Mexican International Conference on Artificial Intelligence, 2004. 312–321. (Springer, 2004)
  • 4.Zhang, X., Zhou, C., Zhu, X., Tao, Z. & Zhao, H. Class-imbalanced voice pathology classification: Combining hybrid sampling with optimal two-factor random forests. Appl. Acoust. 190, 108618 (2022). [Google Scholar]
  • 5.Batista, G.E., Prati, R.C. & Monard, M.C. Balancing strategies and class overlapping. In International Symposium on Intelligent Data Analysis, 2005. 24–35. (Springer, 2005)
  • 6.García, V., Mollineda, R.A. & Sánchez, J.S. On the k-nn performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl.11(3-4), 269–280 (2008)
  • 7.Raguenaud, C. & Kennedy, J. Multiple overlapping classifications: Issues and solutions. In Proceedings 14th International Conference on Scientific and Statistical Database Management, 2002. 77–86 (IEEE, 2002).
  • 8.Chen, L., Fang, B., Shang, Z. & Tang, Y. Tackling class overlap and imbalance problems in software defect prediction. Softw. Qual. J.26(1), 97–125 (2018). [Google Scholar]
  • 9.He, H. & Garcia, E. A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng.21(9), 1263–1284 (2009). [Google Scholar]
  • 10.Fernández, A., Garcí­a, S. & Herrera, F. Addressing the classification with imbalanced data: Open problems and new challenges on class distribution. In International Conference on Hybrid Artificial Intelligence Systems, 2011. 1–10. (Springer, 2011)
  • 11.Han, S., Choi, H.-J., Choi, S.-K. & Oh, J.-S. Fault diagnosis of planetary gear carrier packs: A class imbalance and multiclass classification problem. Int. J. Precis. Eng. Manuf.20(2), 167–179 (2019). [Google Scholar]
  • 12.Kotsiantis, S., Kanellopoulos, D. & Pintelas, P. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng.30(1), 25–36 (2006). [Google Scholar]
  • 13.Garcí­a, V., Alejo, R., Sánchez, J.S., Sotoca, J.M. & Mollineda, R.A. Combined effects of class imbalance and class overlap on instance-based classification. In International Conference on Intelligent Data Engineering and Automated Learning, 2006. 371–378. (Springer, 2006)
  • 14.Barella, V.H., Garcia, L.P., Souto, M.P., Lorena, A.C. & Carvalho, A. Data complexity measures for imbalanced classification tasks. In 2018 International Joint Conference on Neural Networks (IJCNN) . 1–8 (IEEE, 2018).
  • 15.Tsai, C.-F., Lin, W.-C., Hu, Y.-H. & Yao, G.-T. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf. Sci.477, 47–54 (2019). [Google Scholar]
  • 16.Kaur, P. & Gosain, A. Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise. In ICT Based Innovations. 23–30. (Springer, 2018)
  • 17.Garcí­a, V., Sánchez, J.S., Domí­nguez, H.O. & Cleofas-Sánchez, L. Dissimilarity-based learning from imbalanced data with small disjuncts and noise. In Iberian Conference on Pattern Recognition and Image Analysis, 2015. . 370–378. (Springer, 2015)
  • 18.Lee, H. K. & Kim, S. B. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl.98, 72–83 (2018). [Google Scholar]
  • 19.Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal.6(5), 429–449 (2002). [Google Scholar]
  • 20.Das, S., Datta, S. & Chaudhuri, B. B. Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognit.81, 674–693 (2018). [Google Scholar]
  • 21.Sáez, J. A., Galar, M. & Krawczyk, B. Addressing the overlapping data problem in classification using the one-vs-one decomposition strategy. IEEE Access7, 83396–83411 (2019). [Google Scholar]
  • 22.López, V., Fernández, A., Moreno-Torres, J.G. & Herrera, F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. open problems on intrinsic data characteristics. Expert Syst. Appl.39(7), 6585–6608 (2012)
  • 23.Sun, Z. et al. A novel ensemble method for classifying imbalanced data. Pattern Recognit.48(5), 1623–1637 (2015). [Google Scholar]
  • 24.Qu, Y., Su, H., Guo, L. & Chu, J. A novel svm modeling approach for highly imbalanced and overlapping classification. Intell. Data Anal.15(3), 319–341 (2011). [Google Scholar]
  • 25.Perveen, S., Shahbaz, M., Keshavjee, K. & Guergachi, A. Metabolic syndrome and development of diabetes mellitus: Predictive modeling based on machine learning techniques. IEEE Access7, 1365–1375 (2019). [Google Scholar]
  • 26.Mehmood, Z. & Asghar, S. Customizing svm as a base learner with adaboost ensemble to learn from multi-class problems: A hybrid approach adaboost-msvm. Knowl.-Based Syst.217, 106845 (2021). [Google Scholar]
  • 27.Denil, M. & Trappenberg, T. Overlap versus imbalance. In Canadian Conference on Artificial Intelligence, 2010. 220–231. (Springer, 2010)
  • 28.Haixiang, G. et al. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl.73, 220–239 (2017). [Google Scholar]
  • 29.Fernández, A., Garcá­a, S., Galar, M., Prati, R.C., Krawczyk, B. & Herrera, F. Algorithm-level approaches. In Learning from Imbalanced Data Sets. 123–146. (Springer, 2018)
  • 30.López, V., Fernández, A., Garcá­a, S., Palade, V. & Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci.250, 113–141 (2013)
  • 31.Dai, Q., Liu, J.-W. & Shi, Y.-H. Class-overlap undersampling based on Schur decomposition for class-imbalance problems. Expert Syst. Appl.221, 119735 (2023)
  • 32.Dai, Q., Wang, L.-H., Xu, K.-L., Du, T. & Chen, L.-F. Class-overlap detection based on heterogeneous clustering ensemble for multi-class imbalance problem. Expert Syst. Appl.255, 124558 (2024)
  • 33.Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res.16, 321–357 (2002). [Google Scholar]
  • 34.Bunkhumpornpat, C., Sinapiromsaran, K. & Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009. 475–482. (Springer, 2009)
  • 35.Bunkhumpornpat, C., Sinapiromsaran, K. & Lursinsap, C. Dbsmote: Density-based synthetic minority over-sampling technique. Appl. Intell.36(3), 664–684 (2012). [Google Scholar]
  • 36.Douzas, G., Bacao, F. & Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf. Sci.465, 1–20 (2018). [Google Scholar]
  • 37.Sáez, J. A., Luengo, J., Stefanowski, J. & Herrera, F. Smote-ipf: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci.291, 184–203 (2015). [Google Scholar]
  • 38.Han, H., Wang, W.-Y. & Mao, B.-H. Borderline-smote: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, 2005. 878–887. (Springer, 2005).
  • 39.Vorraboot, P., Rasmequan, S., Chinnasarn, K. & Lursinsap, C. Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms. Neurocomputing152, 429–443 (2015). [Google Scholar]
  • 40.Vuttipittayamongkol, P., Elyan, E., Petrovski, A. & Jayne, C. Overlap-based undersampling for improving imbalanced data classification. In International Conference on Intelligent Data Engineering and Automated Learning, 2018. 689–697. (Springer, 2018)
  • 41.Zhao, Y. & Cen, Y. Data Mining Applications with R. (Academic Press, 2013)
  • 42.Bunkhumpornpat, C. & Sinapiromsaran, K. Dbmute: Density-based majority under-sampling technique. Knowl. Inf. Syst.50(3), 827–850 (2017). [Google Scholar]
  • 43.Nwe, M.M. & Lynn, K.T. Knn-based overlapping samples filter approach for classification of imbalanced data. In International Conference on Software Engineering Research, Management and Applications, 2019. 55–73. (Springer, 2019).
  • 44.Devi, D. & Purkayastha, B. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett.93, 3–12 (2017). [Google Scholar]
  • 45.Vuttipittayamongkol, P. & Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci.509, 47–70 (2020). [Google Scholar]
  • 46.Fu, M., Tian, Y. & Wu, F. Step-wise support vector machines for classification of overlapping samples. Neurocomputing155, 159–166 (2015). [Google Scholar]
  • 47.Simic, V. et al. Locating a disinfection facility for hazardous healthcare waste in the COVID-19 era: A El approach based on Fermatean fuzzy ITARA-MARCOS and random forest recursive feature elimination algorithm . 1–46 (2022). [DOI] [PMC free article] [PubMed]
  • 48.Sun, Y. et al. A Robust Oversampling Approach for Class Imbalance Problem with Small Disjuncts. (2022).
  • 49.Song, W.C..L.J. J. Enhancing minority data generation through optimization in imbalanced datasets. Knowl. Inf. Syst. 1–25 (2025)
  • 50.Sun, H. & Wang, S. Measuring the component overlapping in the Gaussian mixture model. Data Min. Knowl. Discov.23(3), 479–502 (2011). [Google Scholar]
  • 51.Xiong, H., Wu, J. & Liu, L. Classification with classoverlapping: A systematic study. In Proceedings of the 1st International Conference on E-Business Intelligence (ICEBI2010), 2010. (Atlantis Press, 2010)
  • 52.Lichman, M. UCI Machine Learning Repository. (University of California, School of Information and Computer Sciences, 2018).
  • 53.Ohsaki, M. et al. Confusion-matrix-based kernel logistic regression for imbalanced data classification. IEEE Trans. Knowl. Data Eng.29(9), 1806–1819 (2017). [Google Scholar]
  • 54.Tomek, I. Two modifications of cnn. IEEE Trans. Syst. Man Cybern.6, 769–772 (1976)
  • 55.Batuwita, R. & Palade, V. Fsvm-cil: Fuzzy support vector machines for class imbalance learning. IEEE Trans. Fuzzy Syst.18(3), 558–571 (2010). [Google Scholar]


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
