Abstract
Support Vector Regression (SVR) is gaining popularity for outlier detection and classification problems in high-dimensional data (HDD) because this technique does not require the data to be of full rank. In real applications, most data are high dimensional. Classification of high-dimensional data is needed in the applied sciences, in particular, where it is important to discriminate cancerous cells from non-cancerous cells. It is also imperative that outliers are identified before constructing a model of the relationship between the dependent and independent variables, in order to avoid misleading interpretations about the fit of the model. The standard SVR and the μ-ε-SVR are able to detect outliers; however, they are computationally expensive. The fixed parameters support vector regression (FP-ε-SVR) was put forward to remedy this issue, but the FP-ε-SVR, which uses ε-SVR, is not very successful in identifying outliers. In this article, we propose an alternative method for detecting outliers, namely by employing nu-SVR. The merit of our proposed method is confirmed by three real examples and a Monte Carlo simulation. The results show that the proposed nu-SVR method is very successful in identifying outliers under a variety of situations, with less computational running time.
Keywords: High-dimensional data, outliers, robustness, statistical learning theory, support vector regression
1. Introduction
The analysis of high-dimensional data has become increasingly important in many fields, such as the applied sciences, engineering, and medicine. For instance, tens of thousands of gene expression values are available in tumor classification utilizing genomic data, whereas the number of arrays is only of the order of tens. Such high-dimensional data, which refer to situations where the number of predictor variables (p) is much larger than the sample size (n), pose a major statistical challenge for data classification and other statistical analyses. In high-dimensional data, a matrix involved in some algorithms may become singular. The problem becomes more complicated in the presence of non-linear relationships among variables and outliers in the data. In real applications, encountering non-normal data with a nonlinear relationship between input and output variables is quite common [30]. Hampel et al. [9] pointed out that a routine data set typically contains about 1–10% outliers, and even the highest quality data set cannot be guaranteed to be free of outliers. One immediate consequence of the presence of outliers is that they may cause apparent non-normality, and the entire classical inferential procedure might break down in their presence. In this regard, the call for new methods and theories, such as nonparametric methods, to handle these issues has become an urgent necessity. The Support Vector Machine (SVM) is a nonparametric approach comprising a new class of learning algorithms based on statistical learning theory [32]. It has been successfully applied to classification as well as regression problems. One attractive feature of Support Vector Regression (SVR) is that it can handle nonlinear, rank-deficient, and high-dimensional problems by utilizing the kernel trick to transform the nonlinear relationship in the input space into a linear form in a high-dimensional feature space [16,31]. The main idea behind SVM modeling is its ability to classify and separate positive and negative training data with the greatest margin possible [2].
The majority of real data in many applications can be affected by outliers. If outliers are present in a sample, the learning method will try to fit these undesirable observations, causing the approximation function to go awry. This phenomenon, known as overfitting [29], greatly inflates the testing error and leads to a loss of generalization ability. This is one of the reasons why the detection of outliers is crucial, since outliers are responsible for misleading conclusions about the fit of a regression model. There are many good papers in the literature on the identification of outliers in linear models and low-dimensional data, to name a few [8,17,19,22]. Nonetheless, few papers deal with the identification of outliers in linear as well as nonlinear models using SVR for high-dimensional data.
Cherkassky and Mulier [34] successfully applied the SVM to outlier detection. As pointed out by Jordaan and Smits [12], SVM for classification (SVC) was originally set forth to detect outliers; they instead employed SVR for outlier detection, based on the robustness of SVR with regard to outliers and on the fact that outliers form a fraction of the support vector group, which makes the SVM model a potentially appropriate method for detecting outliers. Unfortunately, this method suffers from some shortcomings when applied to real applications. Moreover, the process incurs a high computational cost because it requires numerous iterations of the optimization procedure in order to detect even a single outlier. Nishiguchi et al. [18] employed the μ-ε-SVR to detect outliers, but this method, like the standard SVR method, also has its own drawbacks. Although it employs free parameters that can easily be controlled and may reduce the masking and swamping problems, it does not have a clear rule for selecting the value of the parameter ε. 'Swamping' refers to normal observations being incorrectly declared as outliers, while 'masking' refers to a situation where outliers are incorrectly determined to be inliers [24]. To overcome these weaknesses, Dhhan et al. [6] and Rana et al. [20] developed the fixed parameters support vector regression (FP-ε-SVR), which uses fixed parameters to detect all outliers in a data set in a single iteration. Unfortunately, the FP-ε-SVR, which employs ε-SVR, is not very successful in identifying outliers under different scenarios. To remedy this issue, we propose the nu-SVR for the identification of outliers in high-dimensional data sets.
This article is arranged as follows: Section 2 briefly describes the SVR; the proposed nu-SVR for the identification of outliers in case of p>>n is presented in Section 3; Section 4 presents the Monte Carlo Simulation Study and real examples; and the concluding remarks are presented in Section 5.
2. Support vector regression for outlier detection
To understand the methodology of SVR, let us first consider the training data $\{(x_1, y_1), \ldots, (x_l, y_l)\} \subset X \times \mathbb{R}$, where $X$ denotes the space of the input variables. The aim of ε-tube SVR is to find a function $f(x)$ that has at most ε deviation from the observed outputs $y_i$. Thus, the regression function can be defined as follows:

$$f(x) = \langle w, \phi(x) \rangle + b,$$

where $\phi(\cdot)$ is a transform function that maps the non-linear input space into a high-dimensional (linear) feature space, and $w$ and $b$ are the slope and the offset of the regression function, respectively. According to [32], these parameters can be estimated by employing the ε-tube loss function

$$|y - f(x)|_{\varepsilon} = \max\{0, \; |y - f(x)| - \varepsilon\},$$

where all the training points that lie outside the ε-tube are considered support vectors, and their deviations are penalized linearly. Vapnik [32] showed that the primal problem can be written as follows:

$$\min_{w, b, \xi, \xi^{*}} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{l} (\xi_i + \xi_i^{*})$$
$$\text{subject to} \quad y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \quad \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \quad \xi_i, \xi_i^{*} \ge 0,$$

where the cost parameter $C$ governs the trade-off between the flatness of the model and the amount by which deviations larger than ε are tolerated; the upper and lower errors are represented by the slack variables $\xi_i$ and $\xi_i^{*}$; and ε is the parameter of the loss function. The dual optimization problem is the following convex quadratic program [28]:

$$\max_{\alpha, \alpha^{*}} \; -\frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*}) K(x_i, x_j) - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^{*})$$
$$\text{subject to} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*}) = 0, \quad 0 \le \alpha_i, \alpha_i^{*} \le C,$$

where $\alpha_i$ and $\alpha_i^{*}$ are Lagrange multipliers. Hence, the dual variables have to be positive, $\alpha_i, \alpha_i^{*} \ge 0$, and $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ is the kernel function that transforms the inputs into a high-dimensional feature space [33]. The final regression estimate for SV regression can be written as

$$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b.$$

A minimal code sketch of this estimation procedure is given below.
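To make the estimation procedure above concrete, the following minimal sketch fits a standard ε-SVR using the kernlab package [15] in R. The toy data, the RBF kernel, and the values of C and ε are illustrative assumptions, not settings used in this paper.

```r
# A minimal sketch (not the paper's code) of fitting a standard eps-SVR with the
# kernlab package [15]; the toy data, kernel and parameter values are illustrative.
library(kernlab)

set.seed(1)
x <- matrix(runif(60 * 3), ncol = 3)            # 60 observations, 3 predictors
y <- exp(rowSums(x)) + rnorm(60, sd = 0.1)      # non-linear signal plus noise

fit <- ksvm(x, y, type = "eps-svr", kernel = "rbfdot",
            C = 10, epsilon = 0.1)              # C: cost, epsilon: tube width

head(fitted(fit))     # f(x_i) = sum_j (alpha_j - alpha_j^*) K(x_j, x_i) + b
head(SVindex(fit))    # indices of support vectors (points on or outside the tube)
```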
Applying SVR to outlier detection based on its robustness was first suggested by Jordaan and Smits [12]. Subsequently, [18] employed the μ-ε-SVR for outlier detection in high-dimensional nonlinear problems. However, these techniques pose several limitations when applied to real applications, as stated below:
In the approach of Jordaan and Smits [12], the detection of a single outlier requires many iterations. Thus, this method entails high computational costs.
The approach of μ–ε–SVR is only suitable for a small number of outliers. Since this approach can detect only one outlier in each iteration, the computational cost would be very high when the data contain a large number of outliers.
Moreover, for both approaches, no specific cutoff points were given to identify which observations are the outliers. Observations are declared as outliers if the observed points are far from the majority of the data. Hence, in practice, non-expert users will find these approaches difficult since they depend on subjective judgement.
Recently, Dhhan et al. [6] developed fixed parameters ε-tube SVR (FP-ε-SVR) to improve the performance of the standard SVR for detecting outliers. The FP-ε-SVR applies a particular case of the ε-tube loss function when the value of ε is equal to zero. In fact, this technique is much less time-consuming than the conventional approaches in detecting outliers. However, it has several limitations that can reduce its role in practical applications. The FP-ε-SVR employs ε-SVR which is not very successful in identifying outliers under a variety of situations. As a result, the FP-ε-SVR tends to show masking and swamping problems that lead to misleading conclusions about data classification as well as the fitting of the regression model.
3. The proposed method nu-SVR
The nu-support vector regression (nu-SVR) is a type of support vector regression (SVR) that addresses a difficulty of the first type, ε-SVR [25,26], namely that it is hard to determine a suitable value of ε. Selecting too many data points as support vectors leads to overfitting problems. The new parameter nu introduced by [26] is able to control the number of support vectors; in fact, [26] showed that nu is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors [5].
To enhance the efficiency of the standard SV regression for outlier detection, we propose a practical approach that covers three aspects: the characteristics of kernel functions, sparseness, and robustness. The proposed technique, called the nu-tube-SVR, addresses the limitations of the traditional methods mentioned above, in that it can detect outliers in high-dimensional data under a variety of scenarios without displaying any masking or swamping effects. In the nu-tube-SVR, the non-sparseness property of the ε-tube loss function is used to detect abnormal points in the data. Furthermore, because it is difficult to determine a suitable value of the tube width ε, the ε-SVR algorithm has been modified to minimize the value of ε depending on the properties of the data [26]. Consequently, we use the modified ε-SVR algorithm, which can reduce the complexity of a model by controlling the number of support vectors; the adjusted algorithm is called the nu-SVR algorithm. Hence, the identification of outliers can be carried out using the nu-tube loss function, defined as

$$|y - f(x)|_{\varepsilon} = \max\{0, \; |y - f(x)| - \varepsilon\},$$

where the tube width ε is now treated as a variable to be optimized.

In the case of nu-SVR, the nu parameter automatically adjusts the flexible tube (minimizes ε) to control the number of support vectors and training errors in the tube. According to Schölkopf et al. [25], nu is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. Moreover, they noted that, asymptotically, nu equals both fractions with probability 1. Thus, the convex optimization problem, as given by [25,26], can be rephrased as

$$\min_{w, b, \xi, \xi^{*}, \varepsilon} \; \frac{1}{2}\|w\|^{2} + C\left(\nu\varepsilon + \frac{1}{l}\sum_{i=1}^{l}(\xi_i + \xi_i^{*})\right) \qquad (1)$$
$$\text{subject to} \quad \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i, \quad y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i^{*}, \quad \xi_i, \xi_i^{*} \ge 0, \quad \varepsilon \ge 0.$$

Consequently, Equation (1) leads to the following convex quadratic dual problem:

$$\max_{\alpha, \alpha^{*}} \; \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i)\, y_i - \frac{1}{2}\sum_{i,j=1}^{l} (\alpha_i^{*} - \alpha_i)(\alpha_j^{*} - \alpha_j)\, K(x_i, x_j)$$
$$\text{subject to} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*}) = 0, \quad 0 \le \alpha_i, \alpha_i^{*} \le \frac{C}{l}, \quad \sum_{i=1}^{l} (\alpha_i + \alpha_i^{*}) \le C\nu.$$

Here and below, it is understood that $i = 1, \ldots, l$, where $l$ denotes the number of training points, and that vector quantities are the $l$-dimensional vectors of the corresponding variables. The weight vector and the final regression function are represented as follows:

$$w = \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i)\,\phi(x_i), \qquad f(x) = \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i)\, K(x_i, x) + b. \qquad (2)$$
Choosing suitable values of the parameters C and nu and a suitable kernel function leads to high sensitivity in detecting the outliers [21]. In fact, there is no specific rule for choosing the cost parameter C. Therefore, in our calculations, we chose a fairly high value of C, since the employed data usually contain outliers. Moreover, a high value of C combined with a large difference between the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$ generates large weights, as can be seen in Equation (2); the weight vector increases when C is high. Another important point is that the SVM algorithm is highly sensitive to the characteristics of the kernel function [36], which implies that the choice of a suitable kernel function is very important. Therefore, this paper examines the effectiveness of the Bessel kernel for detecting outliers. A brief sketch of such a fit is given below.
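The following sketch shows how such a fit can be set up with kernlab's nu-SVR and Bessel kernel. The particular values of C, nu, sigma, order, and degree are assumptions chosen for illustration rather than the tuned values used in our experiments.

```r
# Sketch of the setting discussed above: nu-SVR with a Bessel kernel and a
# deliberately large cost C. The parameter values (C, nu, sigma, order, degree)
# are illustrative assumptions, not the exact values used in the paper.
library(kernlab)

set.seed(2)
x <- matrix(runif(60 * 3), ncol = 3)
y <- exp(rowSums(x)) + rnorm(60, sd = 0.1)

fit_nu <- ksvm(x, y, type = "nu-svr",
               kernel = "besseldot",
               kpar   = list(sigma = 1, order = 1, degree = 1),
               C  = 100,   # a large C magnifies the weights of outlying rows
               nu = 0.5)   # nu bounds the fractions of errors / support vectors

yhat <- as.numeric(fitted(fit_nu))   # estimated values used for outlier screening
```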
3.1. The Bessel kernel function
The Bessel function of the first kind kernel is generally used when no prior knowledge of the data set is available [14,15]. Mathematically, it is given by

$$K(x, x') = \frac{J_{v+1}(\sigma \|x - x'\|)}{\|x - x'\|^{-n(v+1)}}, \qquad (3)$$

where $x$ and $x'$ are two arbitrary vectors in the feature space, $\sigma$ is a scale parameter, $n$ is the degree, and $J$ is the Bessel function of the first kind. The latter can be written as follows:

$$J_{v}(z) = \sum_{m=0}^{\infty} \frac{(-1)^{m}}{m!\,\Gamma(m + v + 1)} \left(\frac{z}{2}\right)^{2m + v},$$

where $v$ is the constant that determines the order of the Bessel function, and $\Gamma(\cdot)$ is the gamma function. The Bessel kernel function was employed in the nu-SVR algorithm because it has a better performance in separating outliers from the rest of the data; the difference between outliers and normal observations can be seen clearly from the graphs (see all figures in Section 4). A small numerical illustration of this kernel is given below.
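As a small illustration, the kernel in Equation (3) can be evaluated for a pair of vectors through kernlab's besseldot() function, while base R's besselJ() provides the Bessel function of the first kind itself; the parameter values below are arbitrary.

```r
# Small illustration of evaluating the Bessel kernel between two vectors with
# kernlab's besseldot(); base R's besselJ() gives the underlying Bessel function
# of the first kind. All parameter values here are arbitrary choices.
library(kernlab)

k_bessel <- besseldot(sigma = 1, order = 1, degree = 1)
k_bessel(c(0.2, 0.4, 0.1), c(0.3, 0.1, 0.5))   # kernel value K(x, x')

besselJ(1.5, nu = 2)   # J_v(z) itself, here with order v = 2 at z = 1.5
```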
When used to obtain the estimated values, the Bessel function of the first kind kernel in Equation (3) gives SVR a higher estimation accuracy and a stronger generalization ability [37]. By looking at the estimated values from Equation (2), we may conclude that outliers affect their own rows more than the rest of the data, in the sense that the rows corresponding to outliers have higher estimated values than the other rows. Moreover, the choice of the parameter C affects the difference between the values of the outliers and those of the other observations. The higher the value of C, the greater the possibility of separating the abnormal points from the majority of the data. It should be noted, however, that the value of C should not be chosen excessively large, as this will increase the complexity of the model.
According to Dhhan et al. [6], the estimated value in Equation (2) that corresponds to an outlying point is expected to be larger than the other estimated values. However, the differences will not be clear when the value of the C parameter is small; hence, utilizing a high value of C leads to a high estimated value corresponding to the row of an outlier. As a result, the estimated values are utilized to detect the outlying points. Outliers can be identified from a graphical plot, whereby points that are located far from the rest of the data are declared outliers. However, non-expert users face some difficulties in using this approach. In this situation, it is crucial to propose a cutoff point to separate the outliers from the majority of the data. Since the distribution of the estimated values is intractable, we follow the approach of Hadi [8], which proposes a nonparametric cutoff point as follows:

$$\text{cut-off} = \operatorname{MD}(\hat{y}) + 3\,\operatorname{MAD}(\hat{y}),$$

where

$$\operatorname{MD}(\hat{y}) = \operatorname{median}_{i}\,(\hat{y}_{i})$$

and

$$\operatorname{MAD}(\hat{y}) = \frac{\operatorname{median}_{i}\,|\hat{y}_{i} - \operatorname{MD}(\hat{y})|}{0.6745}.$$

Observations whose estimated values exceed this cutoff are declared outliers.
An attractive feature of this technique is that it identifies outliers in a single iteration, unlike the standard SVR method, which requires many iterations just to detect one outlier. Furthermore, employing a fixed set of parameters and a specific cutoff point makes this method easy for non-expert users to apply. A minimal end-to-end sketch of the procedure is given below.
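The sketch below assembles the single-pass procedure, assuming the kernlab implementation of nu-SVR, a Bessel kernel, and a median-plus-3-MAD cutoff; all parameter values, and the cutoff multiplier 3, are illustrative assumptions.

```r
# End-to-end sketch of the single-pass detection procedure: fit nu-SVR with a
# Bessel kernel, take the estimated values, and flag the cases that exceed a
# Hadi-type nonparametric cut-off. The multiplier k = 3 and the kernel/parameter
# settings are assumptions for illustration.
library(kernlab)

detect_outliers_nusvr <- function(x, y, C = 100, nu = 0.5, k = 3) {
  fit  <- ksvm(x, y, type = "nu-svr",
               kernel = "besseldot",
               kpar   = list(sigma = 1, order = 1, degree = 1),
               C = C, nu = nu)
  yhat <- as.numeric(fitted(fit))
  # cut-off = median + k * MAD; mad()'s default constant 1.4826 equals 1/0.6745
  cutoff <- median(yhat) + k * mad(yhat)
  list(fitted = yhat, cutoff = cutoff, outliers = which(yhat > cutoff))
}
```

Observations whose estimated values exceed the returned cutoff are flagged in one pass, with no iterative refitting.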
4. Results
In this section, the Monte Carlo simulation study, one artificial and three real data sets are considered in order to evaluate the performance of nu-SVR. All calculations were done by using the R software [4].
4.1. Monte Carlo simulation study
In this section, we report simulation studies that have been carried out to investigate the performance of our newly proposed nu-SVR method in detecting outliers in high-dimensional data, and to compare its performance with the existing FP-ε-SVR method. Here, we consider three different simulation scenarios.
4.1.1. Simulation 1
As per Dhhan et al. [6], the first simulation considers a non-linear model in three predictors,

$$y_i = f(x_{i1}, x_{i2}, x_{i3}) + \varepsilon_i, \qquad (4)$$
where each of the three explanatory variables was first generated from the uniform distribution [0, 1], and the error term was drawn from N(0, 1). These generated data are referred to as 'clean data'. Four sample sizes (n = 20, 40, 100, and 160) and four percentages of outliers (θ = 5%, 10%, 15%, and 20%) were considered. The first 100θ% observations of the clean data of each regressor were then replaced with outliers, while the remaining 100(1 − θ)% of the observations were kept as 'clean data'. To generate outliers in the X-direction (high leverage points, HLPs), the value of the first HLP was kept fixed at 5, and the successive values were created by adding the observation index i to 5. To create vertical outliers (outliers in the Y-direction), the generated clean y values in Model (4) were replaced with outliers whereby the first outlier was kept fixed at 15, and the successive values were created by adding the observation index i to 15. The first objective of this simulation study is to compare the performance of the proposed nu-SVR with some existing methods for low-dimensional data (p = 3), namely the Hat matrix, the robust Mahalanobis distance (RMD) of Rousseeuw [23], the diagnostic robust generalized potential (DRGP) of Habshah et al. [10], and the FP-ε-SVR. The results, based on the percentage of correct detection of outliers and the rates of masking and swamping, are exhibited in Table 1. A sketch of this data-generating scheme is given below.
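The following simplified generator sketches this contamination scheme. Since Model (4) is not reproduced here, the non-linear response and the exact increment rule for the planted outliers are illustrative readings of the design rather than the paper's exact code.

```r
# Simplified sketch of the Simulation 1 contamination scheme: clean uniform
# predictors with a non-linear response, then the first 100*theta% cases replaced
# by high leverage points (starting at 5) and vertical outliers (starting at 15),
# both growing with the observation index. The response function is a stand-in
# for Model (4), and the increment rule is one reading of the design.
simulate_contaminated <- function(n, p = 3, theta = 0.05) {
  x <- matrix(runif(n * p), ncol = p)
  y <- exp(rowSums(x)) + rnorm(n)        # illustrative non-linear response
  m <- floor(theta * n)                  # number of contaminated observations
  for (i in seq_len(m)) {
    x[i, ] <- 5 + (i - 1)                # high leverage points in the X-direction
    y[i]   <- 15 + (i - 1)               # vertical outliers in the Y-direction
  }
  list(x = x, y = y, outlier_index = seq_len(m))
}

sim <- simulate_contaminated(n = 100, p = 3, theta = 0.05)
```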
Table 1.
Percentage of correct identification of outliers, masking and swamping for simulation data with three predictors (p = 3).
| % Correct detection | % Masking | % Swamping | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| θ | n | RMD | Hat Matrix | DRGP | FP-ε-SVR | nu-SVR | RMD | Hat Matrix | DRGP | FP-ε-SVR | nu-SVR | RMD | Hat Matrix | DRGP | FP-ε-SVR | nu-SVR |
| 5% | 20 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 25 | 10 | 20 | 0.085 | 0 |
| 40 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 17.5 | 10 | 17 | 0.375 | 0.022 | |
| 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 9.4 | 15 | 7.49 | 0.454 | 0 | |
| 160 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 20.31 | 6.25 | 7.5 | 0.462 | 0 | |
| 10% | 20 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 20 | 15 | 25 | 0 | 0 |
| 40 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 7.25 | 7.5 | 7.5 | 0.075 | 0 | |
| 100 | 100 | 70 | 100 | 100 | 100 | 0 | 30 | 0 | 0 | 0 | 13.3 | 11 | 2.93 | 0.02 | 0 | |
| 160 | 100 | 62.5 | 100 | 100 | 100 | 0 | 37.5 | 0 | 0 | 0 | 6.44 | 10.62 | 1.25 | 0.025 | 0 | |
| 15% | 20 | 100 | 66.66 | 100 | 100 | 100 | 0 | 33.33 | 0 | 0 | 0 | 10 | 5 | 5 | 0 | 0.005 |
| 40 | 100 | 50 | 100 | 100 | 100 | 0 | 50 | 0 | 0 | 0 | 17.72 | 7.5 | 2.5 | 0.002 | 0 | |
| 100 | 100 | 46.66 | 100 | 100 | 100 | 0 | 53.33 | 0 | 0 | 0 | 11.75 | 10 | 0 | 0.006 | 0 | |
| 160 | 100 | 50 | 100 | 100 | 100 | 0 | 50 | 0 | 0 | 0 | 6.88 | 8.75 | 1.25 | 0.007 | 0 | |
| 20% | 20 | 100 | 25 | 100 | 100 | 100 | 0 | 75 | 0 | 0 | 0 | 15 | 10 | 5 | 0 | 0 |
| 40 | 100 | 37.5 | 100 | 100 | 100 | 0 | 62.5 | 0 | 0 | 0 | 16.33 | 10 | 0 | 0 | 0 | |
| 100 | 100 | 35 | 100 | 100 | 100 | 0 | 65 | 0 | 0 | 0 | 9.79 | 10 | 0 | 0 | 0 | |
| 160 | 100 | 37.5 | 100 | 100 | 100 | 0 | 62.5 | 0 | 0 | 0 | 3.47 | 8.75 | 0.625 | 0 | 0 | |
As shown in Table 1, most of the methods successfully detected all outliers, except for the Hat matrix method, which also suffered from masking and swamping problems. The other four methods did not suffer from masking but did experience swamping. The proposed nu-SVR had the smallest swamping effect, followed by the FP-ε-SVR, DRGP, RMD, and Hat matrix methods. Figure 1 further exhibits the attractive performance of the proposed nu-SVR: its computational running time is the shortest among the compared methods. The running times for the RMD and the Hat matrix were not considered since they had shown poor performance.

The second objective of this simulation study is to compare the performance of the proposed nu-SVR with the FP-ε-SVR alone for high-dimensional data (p = 200). The DRGP, RMD, and Hat matrix methods were not considered since they only deal with full-rank data. The standard SVR methods were also excluded since they do not have specific cut-off points for the detection of outliers; moreover, it is not practical to compare with them because they require a very long computational running time and many iterations even to detect a single outlier. The results are exhibited in Table 2. It can be observed from Table 2 that the proposed nu-SVR method outperforms the FP-ε-SVR. The nu-SVR displays a satisfactory result, as it possesses a higher outlier detection rate, with only a very small percentage of swamping and no masking effect, for all combinations of p, n, and contamination levels. In contrast, the FP-ε-SVR performs poorly, especially for smaller sample sizes: for n = 20 the percentage of correct detection of outliers is as low as 0.5%, with a correspondingly high percentage of masking (99.5%). It is also interesting to observe from Table 2 that the computational running time of the nu-SVR is much shorter than that of the FP-ε-SVR. The code sketch below illustrates how these performance measures can be computed.
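For completeness, the sketch below shows one way to compute the correct detection, masking, and swamping percentages reported in Tables 1 and 2 from a single simulated run; the helper names are ours and the definitions follow Section 1.

```r
# Sketch of how the performance measures in Tables 1 and 2 can be computed from a
# single run: correct detection, masking (true outliers missed) and swamping
# (clean observations wrongly flagged), all expressed as percentages.
performance_rates <- function(flagged, true_out, n) {
  clean <- setdiff(seq_len(n), true_out)
  c(detection = 100 * length(intersect(flagged, true_out)) / length(true_out),
    masking   = 100 * length(setdiff(true_out, flagged))   / length(true_out),
    swamping  = 100 * length(intersect(flagged, clean))    / length(clean))
}

# Example, combining the generator and detector sketched earlier:
# sim <- simulate_contaminated(n = 100, p = 3, theta = 0.05)
# res <- detect_outliers_nusvr(sim$x, sim$y)
# performance_rates(res$outliers, sim$outlier_index, n = 100)
```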
Figure 1.
Computational running time for DRGP, FP-ε-SVR and nu-SVR, p = 3 and various percentages of contaminations.
Table 2.
Percentage of correct identification of outliers, masking and swamping for simulation data with 200 predictors (p = 200).
| % Correct detection | %Masking | %Swamping | Time by seconds | ||||||
|---|---|---|---|---|---|---|---|---|---|
| θ | n | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR |
| 5% | 20 | 0.50 | 100 | 99.5 | 0 | 0 | 1.94 | 167.4 | 119.4 |
| 40 | 12.10 | 100 | 87.9 | 0 | 0 | 1.33 | 671.4 | 173.4 | |
| 100 | 70.12 | 100 | 29.88 | 0 | 0 | 0.45 | 2073.6 | 175.2 | |
| 160 | 80.26 | 100 | 19.73 | 0 | 0 | 0.23 | 3148.8 | 271.8 | |
| 10% | 20 | 0.6 | 100 | 99.4 | 0 | 0 | 1.06 | 138 | 126 |
| 40 | 41.07 | 100 | 58.92 | 0 | 0 | 0.69 | 682.2 | 168.6 | |
| 100 | 79.66 | 100 | 20.34 | 0 | 0 | 0.22 | 1533 | 209.4 | |
| 160 | 87.30 | 100 | 12.69 | 0 | 0 | 0.09 | 5904 | 243.6 | |
| 15% | 20 | 3.26 | 100 | 96.73 | 0 | 0 | 0.43 | 120 | 117 |
| 40 | 52.75 | 100 | 41.25 | 0 | 0 | 0.20 | 538.2 | 130.2 | |
| 100 | 80.95 | 100 | 19.04 | 0 | 0 | 0.04 | 1459.8 | 177.6 | |
| 160 | 88.01 | 100 | 11.99 | 0 | 0 | 0.01 | 5508 | 249 | |
| 20% | 20 | 12.45 | 100 | 87.55 | 0 | 0 | 0.11 | 165.6 | 135.6 |
| 40 | 58.75 | 100 | 41.25 | 0 | 0 | 0.05 | 798 | 169.2 | |
| 100 | 83.38 | 100 | 16.62 | 0 | 0 | 0.004 | 1671 | 211.2 | |
| 160 | 89.12 | 100 | 10.87 | 0 | 0 | 0.002 | 3006 | 254.4 | |
We further investigated the performance of our proposed nu-SVR by computing the average number of outliers detected together with their standard deviations (in parenthesis) for both methods, FP-ε-SVR and nu-SVR, as shown in Table 3. It can be seen that the average number of outliers detected using the FP-ε-SVR is much lower than the actual number of outliers in the data. This result indicates that FP-ε-SVR has a masking effect as shown in Table 2. On the other hand, the average number of outliers detected by nu-SVR is fairly close to the actual number of outliers in the data set, and it also has smaller values of standard deviations compared to FP-ε-SVR for 15% and 20% of outliers in the data.
Table 3.
The average number of outliers detected and standard deviation (in parentheses) for the FP-ε-SVR and nu-SVR methods with different sample sizes and contamination levels.
| θ | n | No. of Outliers | FP-ε-SVR | nu-SVR |
|---|---|---|---|---|
| 5% | 20 | 1 | 0.002 (0.04) | 1.46 (0.72) |
| 40 | 2 | 0.21 (0.41) | 2.63 (0.87) | |
| 100 | 5 | 3.42 (0.49) | 5.49 (0.76) | |
| 160 | 8 | 6.51 (0.49) | 8.34 (0.64) | |
| 10% | 20 | 2 | 0.02 (0.14) | 2.25 (0.53) |
| 40 | 4 | 1.69 (0.51) | 4.30 (0.60) | |
| 100 | 10 | 7.94 (0.29) | 10.20 (0.46) | |
| 160 | 16 | 13.97 (0.15) | 16.12 (0.37) | |
| 15% | 20 | 3 | 0.11 (0.32) | 3.12 (0.37) |
| 40 | 6 | 3.14 (0.49) | 6.09 (0.33) | |
| 100 | 15 | 12.22 (0.42) | 15.05 (0.23) | |
| 160 | 24 | 21.11 (0.32) | 24.03 (0.19) | |
| 20% | 20 | 4 | 0.38 (0.52) | 4.03 (0.19) |
| 40 | 8 | 4.64 (0.52) | 8.01 (0.12) | |
| 100 | 20 | 16.70 (0.45) | 20.00 (0.08) | |
| 160 | 32 | 28.48 (0.49) | 32.00 (0.07) |
The third objective of this simulation study is to show that the Bessel kernel function performs better than some commonly used kernels in separating the outliers. The performance of our proposed nu-SVR based on the Bessel kernel is compared with the nu-SVR based on three commonly used kernels, namely the linear, polynomial, and radial basis function (RBF) kernels, for p = 200. As displayed in Table 4, the Bessel kernel outperforms the other functions, achieving 100% correct detection of outliers with a very small percentage of swamping and no masking effects. In contrast, the other kernel functions perform poorly: the percentage of correct detection of outliers is comparatively low, with high masking effects. The results suggest that the other kernels are not very successful in separating the outliers from the good observations. Another attractive feature of the Bessel kernel is that its computational running time is much shorter than that of the other kernels (see Table 4 and Figure 2). The results for p = 50 and p = 3000 are not included due to space limitations; however, they are consistent with those obtained using p = 200. A sketch of this kernel comparison is given below.
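The sketch below outlines how such a kernel comparison can be organized with kernlab; the kernel parameters are illustrative defaults rather than the tuned settings behind Table 4.

```r
# Sketch of the kernel comparison behind Table 4: the same nu-SVR detector run
# with four kernlab kernels. Kernel parameters are illustrative defaults, not the
# tuned values used in the paper.
library(kernlab)

flag_outliers <- function(x, y, kernel, kpar) {
  fit  <- ksvm(x, y, type = "nu-svr", kernel = kernel, kpar = kpar,
               C = 100, nu = 0.5)
  yhat <- as.numeric(fitted(fit))
  which(yhat > median(yhat) + 3 * mad(yhat))     # Hadi-type cut-off (assumed)
}

kernels <- list(
  RBF        = list(kernel = "rbfdot",     kpar = "automatic"),
  Polynomial = list(kernel = "polydot",    kpar = list(degree = 2)),
  Linear     = list(kernel = "vanilladot", kpar = list()),
  Bessel     = list(kernel = "besseldot",  kpar = list(sigma = 1, order = 1, degree = 1))
)

# sim <- simulate_contaminated(n = 100, p = 3, theta = 0.05)   # from Section 4.1.1
# lapply(kernels, function(k) flag_outliers(sim$x, sim$y, k$kernel, k$kpar))
```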
Table 4.
Percentage of correct identification of outliers, masking and swamping, and computational running time (p=200) with different types of kernel functions.
| % Correct detection | % Masking | % Swamping | Time by seconds | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Types of Kernel | Types of Kernel | Types of Kernel | Types of Kernel | ||||||||||||||
| θ | n | RBF | Polynomial | Linear | Bessel | RBF | Polynomial | Linear | Bessel | RBF | Polynomial | Linear | Bessel | RBF | Polynomial | Linear | Bessel |
| 5% | 20 | 38.8 | 32.7 | 38.9 | 100 | 61.2 | 67.3 | 61.1 | 0 | 0.43 | 0.42 | 0.385 | 1.94 | 230.4 | 246.6 | 213.6 | 119.4 |
| 40 | 49.8 | 47.25 | 50 | 100 | 50.2 | 52.75 | 50 | 0 | 0.217 | 0.23 | 0.25 | 1.33 | 239.4 | 2082 | 2406 | 173.4 | |
| 100 | 79.1 | 92 | 80 | 100 | 20.9 | 8 | 20 | 0 | 0.088 | 0 | 0.3 | 0.45 | 349.2 | 4536 | 5400 | 175.2 | |
| 160 | 87.2 | 96.25 | 85 | 100 | 12.8 | 3.75 | 15 | 0 | 0.092 | 0 | 0.1875 | 0.23 | 467.4 | 4860 | 5580 | 271.8 | |
| 10% | 20 | 39.25 | 43.4 | 35 | 100 | 60.75 | 56.6 | 65 | 0 | 0.26 | 0.205 | 0.5 | 1.06 | 212.4 | 615 | 252 | 126 |
| 40 | 66.4 | 62.5 | 52.5 | 100 | 33.6 | 37.5 | 47.5 | 0 | 0.072 | 0 | 0 | 0.69 | 228 | 1854 | 2028 | 168.6 | |
| 100 | 86.91 | 91 | 88 | 100 | 13.09 | 9 | 12 | 0 | 0.049 | 0.1 | 0 | 0.22 | 323.4 | 3996 | 4320 | 209.4 | |
| 160 | 92.006 | 95.625 | 90 | 100 | 7.993 | 4.375 | 10 | 0 | 0.028 | 0 | 0.062 | 0.09 | 459 | 5040 | 5616 | 243.6 | |
| 15% | 20 | 42.333 | 33.333 | 43.333 | 100 | 57.666 | 66.666 | 56.666 | 0 | 0.09 | 0.5 | 0 | 0.43 | 247.2 | 618 | 918 | 117 |
| 40 | 71.35 | 70 | 73.33 | 100 | 28.65 | 30 | 26.66 | 0 | 0.035 | 0 | 0 | 0.20 | 270 | 1933.2 | 2400 | 130.2 | |
| 100 | 88.093 | 80.667 | 68.667 | 100 | 11.906 | 19.332 | 31.333 | 0 | 0.012 | 0 | 0.1 | 0.04 | 337.2 | 2412 | 3960 | 177.6 | |
| 160 | 92.8 | 87.916 | 75.416 | 100 | 7.2 | 12.083 | 24.583 | 0 | 0.0037 | 0.312 | 1.687 | 0.01 | 501 | 5220 | 7560 | 249 | |
| 20% | 20 | 45.85 | 47.5 | 37.5 | 100 | 54.15 | 52.5 | 62.5 | 0 | 0.035 | 0 | 0 | 0.11 | 246 | 660 | 1038 | 135.6 |
| 40 | 72.987 | 70 | 66.25 | 100 | 27.012 | 30 | 33.75 | 0 | 0.0075 | 0 | 0 | 0.05 | 298.8 | 2010 | 2544 | 169.2 | |
| 100 | 88.5 | 83 | 71.5 | 100 | 11.5 | 17 | 28.5 | 0 | 0.002 | 0 | 0 | 0.004 | 298.2 | 4860 | 5616 | 211.2 | |
| 160 | 92.768 | 87.187 | 80.312 | 100 | 7.231 | 12.812 | 19.687 | 0 | 0.0006 | 0.25 | 0.75 | 0.002 | 399.6 | 5220 | 7562 | 254.4 | |
Figure 2.
Computational running time for RBF, polynomial, linear and Bessel kernel functions, p = 200 and various percentages of contamination.
4.1.2. Simulation 2
The second simulation study considers a non-linear model as in Model (4), where the predictors and the error term are generated from the standard normal distribution, with p = 50 and n = 20. Four good observations were replaced with large values equal to 20 to create outliers. The percentages of correct detection, masking and swamping, and the computational time are exhibited in Table 5 and Figure 3(a). It can be observed from Table 5 that both methods achieve 100% correct detection of outliers with no masking effects. However, the FP-ε-SVR method shows a higher swamping rate than the nu-SVR method. Moreover, the computational time for the FP-ε-SVR is higher than that of the nu-SVR, as shown in Table 5 and Figure 3(a).
Table 5.
Percentage of correct identification of outliers, masking and swamping and computational time when p = 50 and n = 20.
| % Correct detection | % Masking | % Swamping | Time by seconds | |||||
|---|---|---|---|---|---|---|---|---|
| Outliers | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR |
| 1 | 100 | 100 | 0 | 0 | 12.58 | 2.26 | 22.34 | 24.56 |
| 2 | 100 | 100 | 0 | 0 | 4.48 | 0.87 | 27.13 | 21.06 |
| 3 | 100 | 100 | 0 | 0 | 2.22 | 0.705 | 30 | 21.2 |
| 4 | 100 | 100 | 0 | 0 | 4.82 | 0.005 | 22.77 | 21.1 |
Figure 3.
Computational time for p = 50 and n = 20 in (a) and for p = 3000 and n = 100 in (b).
4.1.3. Simulation 3
The third simulation study considers a very large number of explanatory variables, p = 3000, with a sample size of n = 100. In this experiment, we considered the following linear model:

$$y = X\beta + \varepsilon, \qquad (5)$$

where the errors ε are generated from the standard normal distribution and the explanatory variables X are generated from the uniform distribution (0, 1). Following Alin and Agostinelli [1], β is drawn from a normal distribution with mean 0 and standard deviation 0.01. In order to see the effect of the outliers, four observations were replaced with high values equal to 20. Table 6 presents the percentages of correct detection, masking and swamping, while Figure 3(b) displays the computational time for both methods. As in the previous simulation, both methods possess 100% correct detection of outliers with no masking effects; nonetheless, the swamping rate of the FP-ε-SVR is higher than that of the nu-SVR. It is also interesting to note that the nu-SVR's computational time is much lower than that of the FP-ε-SVR (see Table 6 and Figure 3(b)). A sketch of this data-generating design is given below.
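The following sketch reproduces the spirit of this design; since the paper does not state exactly which coordinates of the four contaminated cases are replaced, the choice below (the responses and the matching rows of X) is an assumption.

```r
# Sketch of the Simulation 3 design (p = 3000, n = 100): a linear model with
# uniform(0,1) predictors, coefficients drawn from N(0, 0.01^2) following [1],
# standard normal errors, and four observations replaced by the value 20.
# Replacing both y and the matching rows of X is an assumption for illustration.
set.seed(123)
n <- 100; p <- 3000
X    <- matrix(runif(n * p), nrow = n)
beta <- rnorm(p, mean = 0, sd = 0.01)
y    <- as.numeric(X %*% beta) + rnorm(n)

out_idx      <- 1:4        # four planted outliers
y[out_idx]   <- 20
X[out_idx, ] <- 20
```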
Table 6.
Percentage of correct identification of outliers, masking and swamping and computational time when p = 3000 and n = 100.
| % Correct detection | % Masking | % Swamping | Time by seconds | |||||
|---|---|---|---|---|---|---|---|---|
| Outliers | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR | FP-ε-SVR | nu-SVR |
| 1 | 100 | 100 | 0 | 0 | 6.32 | 2.29 | 1241.4 | 624 |
| 2 | 100 | 100 | 0 | 0 | 5.83 | 2.23 | 1239.6 | 624.6 |
| 3 | 100 | 100 | 0 | 0 | 3.32 | 2.12 | 1202.4 | 628.6 |
| 4 | 100 | 100 | 0 | 0 | 2.21 | 1.43 | 1188.6 | 625.2 |
4.2. Examples
One artificial and three real data sets are employed to evaluate the performance of our proposed nu-SVR against the FP-ε-SVR method. We also show the merit of our proposed nu-SVR with the Bessel kernel function compared with three other commonly used kernel functions. Due to space constraints, we only report the results for the octane data set to show that the nu-SVR method with the Bessel kernel performs better than the other three kernels in separating the outliers; the results for the other data sets are consistent with those obtained using the octane data set.
4.2.1. Artificial data set
Our first example is an artificial data set introduced by [11] to examine the performance of the proposed method. In this data set, the number of independent variables (p = 50) is larger than the sample size (n = 20). Following [11], we consider the multiple linear regression model as in Equation (5), where the explanatory variables and the errors are generated from the standard normal distribution. Three outliers were created by replacing three good observations with an arbitrarily large value equal to 20. The results are presented in Table 7 and Figure 4; values in parentheses are the cut-off points. As can be seen from Figure 4 and Table 7, the proposed method successfully detects the three planted outliers (Cases 2, 4, and 6). However, the FP-ε-SVR suffers from swamping effects, in which two additional observations are declared outliers (Cases 5 and 14).
Table 7.
The estimated values for nu-SVR and FP-ε-SVR for the artificial data, n = 20 and p = 50; values in parentheses are the cut-off points.
| Case | nu-SVR (6.70) | FP-ε-SVR (12.73) | Case | nu-SVR (6.70) | FP-ε-SVR (12.73) |
|---|---|---|---|---|---|
| 1 | 0.14 | 2.01 | 11 | 0.78 | 3.73 |
| 2 | 20.00 | 20.00 | 12 | 1.41 | 2.79 |
| 3 | 1.03 | 2.48 | 13 | 1.90 | 1.40 |
| 4 | 20.00 | 20.00 | 14 | 1.06 | 13.15 |
| 5 | 0.34 | 14.72 | 15 | 0.90 | 1.54 |
| 6 | 20.00 | 20.00 | 16 | 3.34 | 5.65 |
| 7 | 3.49 | 3.2 | 17 | 0.47 | 11.01 |
| 8 | 4.83 | 3.34 | 18 | 1.71 | 10.90 |
| 9 | 4.43 | 1.52 | 19 | 0.29 | 5.21 |
| 10 | 0.84 | 0.67 | 20 | 1.71 | 6.87 |
Figure 4.
Identification plots of the artificial data based on (a,b) nu-SVR method and (c,d) FP-ε-SVR method.
4.2.2. Cardiomyopathy microarray data
Segal et al. [27] were the first to use the cardiomyopathy microarray data to assess regression-based microarray analysis. This data set was also used to illustrate the proposed ranking screening procedure [27,35]. The analysis of this data set involves studying heart disease in humans. The dependent variable Y is the Ro1 expression level, and the independent variables are gene expression levels, with sample size n = 30 and a huge dimension p = 6319. Wang and Li [35] used this data set for the identification of outliers by means of their proposed leave-one-out method. They noted that the method identified three outliers; however, they did not state exactly which observations were the outliers. The proposed nu-SVR and the FP-ε-SVR were then applied to the data to detect and locate the three outliers. The results are exhibited in Table 8 and Figure 5. It can be observed from Table 8 that the nu-SVR identifies exactly three outliers (Cases 19, 20, and 21). We can clearly see from Figure 5(a,b) that these three observations are well separated from the rest of the data. On the other hand, by looking at Figure 5(c,d) and Table 8, we observe that the FP-ε-SVR identified four additional observations as outliers (Cases 11, 14, 16, and 22). Moreover, the computational running time for the cardiomyopathy microarray data is approximately 6 s for the FP-ε-SVR method and 3.28 s for the nu-SVR method. This result is consistent with the simulation study, where for very large p the FP-ε-SVR suffers from the swamping effect.
Table 8.
The estimated values for nu-SVR and FP-ε-SVR for the cardiomyopathy microarray data set; values in parentheses are the cut-off points.
| Case | nu-SVR (670.85) | FP-ε-SVR (836.98) | Case | nu-SVR (670.85) | FP-ε-SVR (836.98) |
|---|---|---|---|---|---|
| 1 | 562.11 | 143.01 | 16 | 586.87 | 966.01 |
| 2 | 503.11 | 83.99 | 17 | 586.88 | 707.99 |
| 3 | 517.11 | 97.99 | 18 | 586.88 | 487.00 |
| 4 | 502.11 | 83.01 | 19 | 786.90 | 1447.00 |
| 5 | 572.11 | 152.99 | 20 | 786.88 | 1693.00 |
| 6 | 560.11 | 141.00 | 21 | 786.89 | 1731.00 |
| 7 | 586.89 | 190.99 | 22 | 608.88 | 1025.00 |
| 8 | 549.11 | 130.01 | 23 | 586.89 | 375.99 |
| 9 | 586.88 | 743.99 | 24 | 545.11 | 126.00 |
| 10 | 586.89 | 381.00 | 25 | 521.11 | 102.00 |
| 11 | 627.88 | 1046.99 | 26 | 568.11 | 149.00 |
| 12 | 586.87 | 806.02 | 27 | 510.11 | 90.99 |
| 13 | 586.89 | 621.01 | 28 | 572.11 | 152.99 |
| 14 | 586.88 | 849.00 | 29 | 586.89 | 234.99 |
| 15 | 586.87 | 475.00 | 30 | 487.11 | 67.99 |
Figure 5.
Identification plots of the cardiomyopathy microarray data set based on (a,b) nu-SVR method and (c,d) FP-ε-SVR method.
4.2.3. The octane data
The octane data taken from Esbensen et al. [7] form the second real example in this paper. This data set includes near-infrared absorbance spectra over p = 226 wavelengths of n = 39 gasoline samples with specific octane numbers. According to [11], observations 25, 26, and 36–39 are outliers, which contain added ethanol. The proposed nu-SVR and the FP-ε-SVR were then applied to the data. The results are presented in Table 9 and Figure 6. We observe from Table 9 and Figure 6(a,b) that the FP-ε-SVR fails to identify even a single outlier. It is interesting to note the results in Table 9 and Figure 6(c,d) for the nu-SVR: for this data set, the nu-SVR successfully identifies the six observations noted by [11], and the plots clearly locate the six outliers. The computational running time for the octane data is 0.92 s for the FP-ε-SVR and 0.72 s for the nu-SVR. Here, we also show that the Bessel kernel function performs better than some commonly used kernels in separating the outliers. The identification plot in Figure 7(a) shows that the Bessel kernel successfully identifies all six outliers and clearly separates them from the rest of the data. However, the identification plots for the RBF, polynomial, and linear kernels in Figure 7(b–d), respectively, mask all six outliers and are unable to separate them from the good data.
Table 9.
The estimated values for nu-SVR and FP-ε-SVR, octane data set; values in parentheses are the cut-off points.
| Case | nu-SVR (89.43) | FP-ε-SVR (179.58) | Case | nu-SVR (89.43) | FP-ε-SVR (179.58) |
|---|---|---|---|---|---|
| 1 | 89.39 | 88.67 | 21 | 89.39 | 87.20 |
| 2 | 89.39 | 88.79 | 22 | 89.40 | 92.07 |
| 3 | 89.40 | 89.79 | 23 | 89.40 | 91.80 |
| 4 | 89.39 | 86.69 | 24 | 89.39 | 87.00 |
| 5 | 89.40 | 91.20 | 25 | 89.49 | 88.99 |
| 6 | 89.40 | 91.35 | 26 | 89.60 | 92.40 |
| 7 | 89.39 | 87.40 | 27 | 89.39 | 88.60 |
| 8 | 89.39 | 87.10 | 28 | 89.40 | 88.76 |
| 9 | 89.39 | 86.97 | 29 | 89.40 | 91.20 |
| 10 | 89.39 | 91.80 | 30 | 89.40 | 91.83 |
| 11 | 89.39 | 89.10 | 31 | 89.39 | 89.00 |
| 12 | 89.40 | 91.75 | 32 | 89.40 | 91.26 |
| 13 | 89.39 | 86.90 | 33 | 89.39 | 88.44 |
| 14 | 89.40 | 91.63 | 34 | 89.40 | 91.40 |
| 15 | 89.40 | 91.70 | 35 | 89.39 | 87.09 |
| 16 | 89.39 | 87.00 | 36 | 89.52 | 91.39 |
| 17 | 89.39 | 87.06 | 37 | 89.52 | 90.29 |
| 18 | 89.40 | 90.79 | 38 | 89.55 | 91.20 |
| 19 | 89.39 | 86.85 | 39 | 89.53 | 90.99 |
| 20 | 89.40 | 91.40 | – | – | – |
Figure 6.
Identification plots of the octane data set based on (a,b) nu-SVR method and (c,d) FP-ε-SVR method.
Figure 7.
Identification plots of the octane data set using Bessel (a), RBF (b), polynomial (c) and linear (d) kernels.
4.2.4. The gas data
Our next example is the gasoline (gas) data presented by [13]. This data set has been analyzed by a number of researchers [1,3]. The gasoline data consist of NIR spectra of 60 gasoline samples taken at intervals of 2 nm from 900 to 1700 nm, giving p = 401; the octane number Y was measured for each sample. This data set has no outliers. Following [1], the data were contaminated by replacing four observations in both X and Y with large values. The index plot of the FP-ε-SVR in Figure 8(b) indicates that this method identifies only one outlier, while the other three outliers are masked. In contrast, the index plot of the nu-SVR in Figure 8(a) correctly identifies all four outliers, which are clearly separated from the rest of the data. Furthermore, the computation time for the gas data is 0.98 s for the FP-ε-SVR and 0.83 s for the proposed nu-SVR.
Figure 8.
Identification plots of the Gas dataset based on (a) nu-SVR method and (b) FP-ε-SVR method.
5. Conclusions
The main focus of this study was to develop a reliable alternative method for detecting outliers in high-dimensional data. The existing fixed parameters support vector regression (FP-ε-SVR) is not effective enough, as it suffers from masking and swamping problems. To remedy these problems, we propose an alternative method based on the nu-SVR. The performance of the newly proposed method has been extensively investigated using several real data sets and a Monte Carlo simulation study under a variety of situations. The results obtained from the real data sets indicate that our proposed method is very successful in detecting multiple outliers; the index plots of the proposed nu-SVR clearly separate the outliers from the rest of the data. The merit of our proposed method is further supported by the Monte Carlo simulation study, in which the nu-SVR was very successful in identifying outliers without any masking effects and with a negligibly small swamping effect under a variety of situations. Moreover, the nu-SVR has another attractive feature: its computational time is much shorter than that of the existing FP-ε-SVR.
Acknowledgments
The authors would like to thank editors and anonymous reviewers for their careful reading and helpful remarks.
Funding Statement
The present research was partially supported by the Universiti Putra Malaysia Grant under Putra Grant (GPB) with project number [grant number GPB/2018/9629700].
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Alin A. and Agostinelli C., Robust iteratively reweighted SIMPLS, J. Chemom. 31 (2017), pp. e2881.
- 2. Balfer J. and Bajorath J., Systematic artifacts in support vector regression-based compound potency prediction revealed by statistical and activity landscape analysis, PLoS One 10 (2015), pp. e0119301.
- 3. Branden K.V. and Hubert M., Robustness properties of a robust partial least squares regression method, Anal. Chim. Acta 515 (2004), pp. 229–241.
- 4. R Core Team, R: A Language and Environment for Statistical Computing [Computer software manual], version 3.4.4, Vienna, Austria, 2016. https://www.r-project.org/.
- 5. Chang C.C. and Lin C.J., Training ν-support vector regression: theory and algorithms, Neural Comput. 14 (2002), pp. 1959–1977.
- 6. Dhhan W., Rana S. and Midi H., Non-sparse ε-insensitive support vector regression for outlier detection, J. Appl. Stat. 42 (2015), pp. 1723–1739.
- 7. Esbensen K., Schokopf S. and Midtgaard T., Multivariate Analysis in Practice: Training Package, Computer-Aided Modelling, 9 (1995), pp. 521–523.
- 8. Hadi A.S., A new measure of overall potential influence in linear regression, Comput. Stat. Data Anal. 14 (1992), pp. 1–27.
- 9. Hampel F.R., Ronchetti E.M., Rousseeuw P.J. and Stahel W.A., Robust Statistics: The Approach Based on Influence Functions, Vol. 196, John Wiley & Sons, New York, 2011.
- 10. Habshah M., Norazan M. and Rahmatullah Imon A., The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression, J. Appl. Stat. 36 (2009), pp. 507–520.
- 11. Hubert M., Rousseeuw P.J. and Vanden Branden K., ROBPCA: a new approach to robust principal component analysis, Technometrics 47 (2005), pp. 64–79.
- 12. Jordaan M.E. and Smits F.G., Robust outlier detection using SVM regression, in Proceedings of the IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 2004, pp. 2017–2022.
- 13. Kalivas J.H., Two data sets of near infrared spectra, Chemom. Intell. Lab. Syst. 37 (1997), pp. 255–259.
- 14. Karatzoglou A., Meyer D. and Hornik K., Support vector machines in R, J. Stat. Softw. 15 (2006), pp. 1–28.
- 15. Karatzoglou A., Smola A., Hornik K. and Zeileis A., kernlab: an S4 package for kernel methods in R, J. Stat. Softw. 11 (2004), pp. 1–20.
- 16. Lahiri S.K. and Ghanta K.C., Support vector regression with parameter tuning assisted by differential evolution technique: study on pressure drop of slurry flow in pipeline, Korean J. Chem. Eng. 26 (2009), pp. 1175–1185.
- 17. Lim H.A. and Midi H., Diagnostic robust generalized potential based on index set equality (DRGP (ISE)) for the identification of high leverage points in linear model, Comput. Stat. 31 (2016), pp. 859–877.
- 18. Nishiguchi J., Kaseda C., Nakayama H., Arakawa M. and Yun Y., Modified support vector regression in outlier detection, in Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 2010, pp. 2750–2754.
- 19. Rahmatullah Imon A., Identifying multiple influential observations in linear regression, J. Appl. Stat. 32 (2005), pp. 929–946.
- 20. Rana S., Dhhan W. and Midi H., Fixed parameters support vector regression for outlier detection, Econ. Comput. Econ. Cybern. Stud. Res. 52 (2018).
- 21. Rojo-Álvarez J.L., Martínez-Ramón M., Figueiras-Vidal A.R., García-Armada A. and Artés-Rodríguez A., A robust support vector algorithm for nonparametric spectral analysis, IEEE Signal Process. Lett. 10 (2003), pp. 320–323.
- 22. Rousseeuw P.J. and Van Zomeren B.C., Unmasking multivariate outliers and leverage points, J. Am. Stat. Assoc. 85 (1990), pp. 633–639.
- 23. Rousseeuw P.J. and Leroy A.M., Robust Regression and Outlier Detection, John Wiley & Sons, New York, 1987.
- 24. Pell R.J., Multiple outlier detection for multivariate calibration using robust statistical techniques, Chemom. Intell. Lab. Syst. 52 (2000), pp. 87–104.
- 25. Schölkopf B., Bartlett P.L., Smola A.J. and Williamson R.C., Shrinking the tube: a new support vector regression algorithm, in Advances in Neural Information Processing Systems, 1999, pp. 330–336.
- 26. Schölkopf B., Smola A.J., Williamson R.C. and Bartlett P.L., New support vector algorithms, Neural Comput. 12 (2000), pp. 1207–1245.
- 27. Segal M.R., Dahlquist K.D. and Conklin B.R., Regression approaches for microarray data analysis, J. Comput. Biol. 10 (2003), pp. 961–980.
- 28. Smola A.J. and Schölkopf B., A tutorial on support vector regression, Stat. Comput. 14 (2004), pp. 199–222.
- 29. Suykens J.A., De Brabanter J., Lukas L. and Vandewalle J., Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002), pp. 85–105.
- 30. Ukil A., Intelligent Systems and Signal Processing in Power Engineering, Springer Science & Business Media, Berlin, 2007.
- 31. Üstün B., Melssen W.J. and Buydens L.M., Facilitating the application of support vector regression by using a universal Pearson VII function based kernel, Chemom. Intell. Lab. Syst. 81 (2006), pp. 29–40.
- 32. Vapnik V., The Nature of Statistical Learning Theory, Springer Science & Business Media, New York, 2013.
- 33. Vapnik V.N., Direct methods in statistical learning theory, in The Nature of Statistical Learning Theory, Springer, New York, 2000, pp. 225–265.
- 34. Cherkassky V.S. and Mulier F., Learning from Data: Concepts, Theory, and Methods, Wiley-IEEE Press, Hoboken, NJ, 2007.
- 35. Wang T. and Li Z., Outlier detection in high-dimensional regression model, Commun. Stat. Theory Methods 46 (2017), pp. 6947–6958.
- 36. Williams G., Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer, New York, 2011.
- 37. Xiang L., Quanyin Z. and Liuyang W., Research of Bessel kernel function of the first kind for support vector regression, Inform. Technol. J. 12 (2013), pp. 2673–2682.