Author manuscript; available in PMC: 2023 Dec 13.
Published in final edited form as: IEEE Int Conf Trust Priv Secur Intell Syst Appl. 2022 Dec;2022:150–159. doi: 10.1109/tps-isa56441.2022.00028

Impact of Dimensionality Reduction on Outlier Detection: an Empirical Study

Vivek Vaidya 1, Jaideep Vaidya 2
PMCID: PMC10716874  NIHMSID: NIHMS1945681  PMID: 38094985

Abstract

Outlier detection is a fundamental data analytics technique used in many security applications. Numerous outlier detection techniques exist, and in most cases they are used to identify outliers directly, without any human interaction. However, the underlying data is often high dimensional and complex. Even when outliers are correctly identified, it is difficult for a security expert to understand or visualize why a particular event or record was flagged, since humans can easily grasp only low dimensional spaces. In this paper we study the extent to which outlier detection techniques work in smaller dimensions and how well dimensionality reduction techniques still enable accurate detection of outliers. This can help us understand the extent to which data can be visualized while still retaining the intrinsic outlyingness of the outliers.

Keywords: outlier detection, anomaly detection, dimensionality reduction, visualization, explainability

I. Introduction and Related Work

Outliers (also referred to as anomalies) are aberrant observations (records) that are quite different/distant from regular observations (records). Thus, outliers do not conform to the typical pattern/distribution of data. Typically, in any setting where data is collected (be it natural, human-created, or even machine-created), there are always a few records/observations that are outlying in nature. Outlier/anomaly detection is the problem of identifying outliers from a given set of data and is one of the most fundamental problems in the field of data analytics. Outlier analysis has numerous scientific, commercial, and governmental applications in diverse domains such as astronomy, finance, and medicine, among others. In particular, outlier detection is extensively used in the fields of security, dependability, and trust, for applications such as fraud detection for credit cards [9] and insurance [8], [16], health care [31], intrusion detection for cyber-security [12], [33], [34], fault detection in safety critical systems [15], [21], and military surveillance for enemy activities [4].

However, one key problem, especially for security and privacy applications, is the explainability and interpretability of outliers. This is particularly important when an unsupervised outlier detection technique has been used and the identified outliers need to be further analyzed by a security administrator. The problem here is that most data is high dimensional, whereas humans primarily operate in two or at most three dimensions. Therefore, even if a particular outlier detection technique is able to identify outliers accurately, visualizing the outliers is nearly impossible. One of the most common techniques for dealing with high-dimensional data is dimensionality reduction. However, the effect of dimensionality reduction on outlier analysis has not been sufficiently studied up to this point. If the security analyst is able to accurately visualize and understand the reasons for outlyingness, they may be able to correspondingly monitor for abnormal behavior and improve the security posture.

The key objective of this study is to empirically examine the effect of dimension reduction on outlier identification. While there has been some work on using local geometric structure to detect outliers [32], the effect of dimensionality reduction on standard outlier detection techniques has not been studied. Since dimensionality reduction typically reduces the amount of information for each record, we would like to see the extent to which outliers can still be accurately identified as the data dimensionality is reduced over a host of settings. Note that as the number of dimensions is reduced, especially to 2 or 3 dimensions, it is much easier to visualize the datasets and potentially see the extent to which the outliers are different.

Towards this, for a vast variety of datasets, we utilize several dimensionality reduction techniques such as Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and Random Projections, and then examine the performance of several commonly used outlier detection methods such as Local Outlier Factor (LOF), Isolation Forest (IF), and Angle-Based Outlier Detection (ABOD).

The rest of this paper is structured as follows. Section II provides a brief overview of standard dimensionality reduction techniques. Section III presents an overview of outlier detection and then discusses several commonly used outlier detection techniques that we evaluate in this work. Section IV presents the results of our empirical study over a number of real datasets of varying size and characteristics. Finally, Section V concludes the paper and discusses future work.

II. Overview of Different Dimension Reduction Methods

Dimensionality reduction is a standard data pre-processing approach used whenever data is high-dimensional. Many techniques exist for dimensionality reduction, including Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Uniform Manifold Approximation and Projection (UMAP), among others. These mostly apply to numeric data. For categorical data, there are dimensionality reduction techniques such as Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA), which generalizes CA to more than two categorical variables. Multiple Factor Analysis can be used for mixed data. In this study, we focus on numeric data and discuss below, in more detail, a few standard dimensionality reduction techniques.

A. PCA

Principal Component Analysis (PCA) is a well known technique for dimensionality reduction. The goal of PCA is to reduce the number of dimensions of the data while retaining most of the information. This is done by compressing the dimensions into “principal components”, linear combinations of the original dimensions that retain most of their information. PCA has been used in a variety of settings [13]. Typically, data is first standardized to ensure that all features have the same scale (i.e., zero mean and unit variance). Next, the covariance matrix is computed: a d×d matrix, where d is the number of dimensions, containing the covariances between all pairs of dimensions. Its main diagonal holds the variances of the individual dimensions (the covariance of a dimension with itself is its variance), and the matrix is symmetric about this diagonal. The sign of each covariance indicates whether the two dimensions vary together or in opposite directions. Essentially, the principal components are the orthogonal projections that capture the maximum amount of variance in the data. To identify the principal components, we compute the eigenvectors and eigenvalues of the covariance matrix (they come in pairs, and there are as many pairs as there are dimensions). Once these are calculated, the greater the eigenvalue, the more significant the corresponding principal component.
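The steps above (standardize, then keep the components with the largest eigenvalues) can be sketched with scikit-learn; the dataset and component count here are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical dataset: 100 records, 10 features

# Standardize so every feature has zero mean and unit variance
Xs = StandardScaler().fit_transform(X)

# Keep the top-2 principal components (largest eigenvalues of the covariance matrix)
pca = PCA(n_components=2)
X2 = pca.fit_transform(Xs)

print(X2.shape)                       # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured per component
```

Internally scikit-learn solves the same eigenproblem described above (via SVD), so `explained_variance_ratio_` directly reflects the relative eigenvalue magnitudes.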

Inverse PCA:

Because PCA is a linear dimensionality reduction method, it is possible to invert the projection to approximately recreate the original dataset. The reconstruction is imperfect, since information is lost during reduction, but it is a useful tool nonetheless. Since our goal is to identify outliers, along with standard PCA we experiment with a modified version: we use inverse PCA to reconstruct the original dataset from the reduced principal components and calculate the Euclidean distance between each record of the original dataset and its reconstruction. This distance is then added as an extra dimension to the reduced dataset. The results pertaining to these experiments can be found below.
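The reconstruction-distance feature can be sketched as follows (a minimal illustration with synthetic data; the paper's actual datasets and dimension choices differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # hypothetical dataset: 200 records, 8 features

pca = PCA(n_components=3)
X_red = pca.fit_transform(X)

# Invert the projection: a lossy reconstruction of the original space
X_rec = pca.inverse_transform(X_red)

# Euclidean distance between each original record and its reconstruction
dist = np.linalg.norm(X - X_rec, axis=1)

# Append the reconstruction error as an extra dimension of the reduced data
X_aug = np.column_stack([X_red, dist])
print(X_aug.shape)  # (200, 4)
```

Records that PCA reconstructs poorly get a large extra coordinate, which is exactly the signal the modified version feeds to the outlier detectors.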

B. UMAP

Uniform Manifold Approximation and Projection [26] is a comparatively new dimension reduction technique that uses a graph-based method to reduce dimensions. It starts by building a graph over the data and connecting the points with simplices. A simplex is a simple way to build a k-dimensional object: a 0-simplex is a point, a 1-simplex is a line segment with two endpoints, a 2-simplex is a triangle, and a 3-simplex is a triangular pyramid (higher-dimensional simplices exist but are harder to visualize, since we live in a three-dimensional world). These simplices connect the points of the generated graph. From here, the dimensions are reduced by finding an equivalent lower dimensional representation of the data using this graph.

C. T-SNE

T-distributed Stochastic Neighbor Embedding (T-SNE) [30] is an alternative type of dimension reduction that specializes in reducing large datasets to two or three dimensions. PCA was created to conserve global variance but is not good at conserving local variance; T-SNE was created to make up for this shortcoming. As such, T-SNE differs from PCA by preserving only small pairwise distances or local similarities, whereas PCA is concerned with preserving large pairwise distances to maximize variance. T-SNE starts by creating a probability distribution that represents similarities between neighbors, in which the “similarity of datapoint xj to datapoint xi is the conditional probability, pj|i, that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at xi” (Maaten 1). The value of pj|i will be high if the two points are close together and very small if they are far apart. Then, T-SNE defines a similar distribution for the points in the low-dimensional embedding and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the embedding. T-SNE is primarily used for visualization. Furthermore, since T-SNE is a local embedding, it is difficult to find a good inverse mapping. Also, drastically different results are obtained as the perplexity parameter is changed. Finally, T-SNE is significantly more computationally costly than PCA. For these reasons, we do not utilize T-SNE in our study.

D. Linear vs Manifold Learning

Most dimension reduction techniques fall into two categories: linear methods, such as PCA or SVD, which use a closed-form mathematical transformation to reduce; and non-linear or manifold learning methods, such as T-SNE and UMAP, which use a more graph-based approach. Linear methods always have a 1:1 correspondence between the original points and the reduced ones, while non-linear methods do not. This is why T-SNE cannot be used effectively in this study: it is too stochastic. (Note that UMAP is also stochastic, but we use it here.) T-SNE reduces by taking points from the high dimensional graph and moving them cluster by cluster. This helps preserve local variance among the points, but global variance becomes very inaccurate as a result. UMAP takes the graph it has connected with simplices and compresses it, preserving the global structure much better than T-SNE (though still not as well as PCA) while retaining the ability to preserve local variance. While not perfect, UMAP's correspondence is much better than T-SNE's despite its non-linear nature, and it even has an inverse method. We have not used that inverse method here because it cannot restore from one dimension, meaning 3D would be the only visualizable dimensionality we could work with, and because we used a random seed for the UMAP calculations.

E. Random Projection

The key idea behind random projection is to simply project the data into a random space. This is done by generating a number of random d×2 matrices, where d is the number of features (columns) of the dataset being reduced. Each of these matrices is multiplied with the dataset to reduce it to two dimensions. The outliers are then calculated for each of these reduced datasets, and the outlier scores for each point returned by the chosen outlier detection method are summed.
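This project-score-accumulate loop can be sketched as follows; the projection count, LOF parameters, and synthetic data are illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))  # hypothetical dataset: 300 records, 10 features
d = X.shape[1]

n_proj = 20
scores = np.zeros(len(X))
for _ in range(n_proj):
    R = rng.normal(size=(d, 2))  # random d x 2 projection matrix
    X2 = X @ R                   # project the data down to two dimensions

    # Score outliers in the projected space and accumulate across projections
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X2)
    scores += -lof.negative_outlier_factor_  # more positive = more outlying

top = np.argsort(scores)[-10:]  # indices of the 10 most outlying records
```

Summing over many random projections averages out the distortion any single random subspace introduces.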

III. Overview of Outlier Detection Methods

As with many things, there is no silver bullet in outlier detection. A multitude of methods have been developed for detecting outliers over the years, including LOF, Z-score, Isolation Forest, extreme value analysis, box plots, autoencoders, ABOD, and many more [2], [10]. However, we use only three of these methods in this study. Many of the others cannot be used here because they do not fulfill certain requirements. For example, a detection method must return an outlier score so that our results are reproducible, and it must work in higher dimensional spaces on fully numeric data. The detection methods chosen for this study are LOF, Isolation Forest (note that Isolation Forest has a degree of randomness, so we set a universal seed of 3 for reproducible results), and ABOD, because they meet these criteria.

A. LOF

Local Outlier Factor (LOF) [6], [11] is an unsupervised, density-based outlier detection technique that can identify a wide variety of outliers. It works by calculating each datapoint's k_distance (the Euclidean distance between the point and its kth nearest neighbor). It then finds the k_neighbors (the closest neighbors whose distances to the point are at most k_distance). From there it calculates the Local Reachability Density (LRD) of the point, which is the inverse of the average reachability distance of the datapoint from its neighbors; the lower the value, the farther away the closest cluster is. Finally, the LOF score is calculated as the ratio of the average LRD of the k_neighbors to the LRD of the datapoint. Since we have the LOF score for each point in the dataset, we can rank the points by their degree of outlyingness and pick the top outliers (based on how many we want). Our datasets specifically state how many “true outliers” lie within them, and we use this number to select how many outliers to report.
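Ranking points by LOF score and taking a fixed number of top outliers can be sketched with scikit-learn's implementation; the planted-outlier data and neighbor count below are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(size=(100, 4))
outliers = rng.normal(loc=6.0, size=(5, 4))  # 5 planted far-away points
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
# negative_outlier_factor_ is negated LOF: flip sign so higher = more outlying
scores = -lof.negative_outlier_factor_

# Take the top-5 outliers, mirroring how each dataset's known outlier count is used
top5 = set(np.argsort(scores)[-5:])
```

Here `top5` should mostly recover the planted points at indices 100-104, since their neighborhoods are far less dense than those of the inliers.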

B. Isolation Forest

Isolation Forest [20] is an unsupervised outlier detection method that processes the data into decision trees (called isolation trees, or iTrees for short) and finds outliers by isolating data points and counting how many branches must be traversed (i.e., the depth in the tree) to reach them. If the branches are comparatively few, that data point is marked as an outlier, which allows Isolation Forest to score each data point based on how anomalous it is. The method is trained by taking a random subset of the data and putting it into an iTree, with branching done on a random threshold: the data branches left or right depending on whether it is smaller or larger than that threshold. This is repeated until every datapoint is in an iTree and isolated. From there the algorithm can count the branches and score the datapoints.
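The depth-based scoring, with a fixed seed as used in this study, can be sketched as follows (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 5)),
               rng.normal(loc=8.0, size=(4, 5))])  # 4 planted outliers

# random_state pins the algorithm's randomness, analogous to the fixed seed of 3
iso = IsolationForest(random_state=3)
iso.fit(X)

# score_samples returns higher values for normal points; negate so that
# higher = isolated in fewer branches = more anomalous
scores = -iso.score_samples(X)

top4 = np.argsort(scores)[-4:]  # the 4 most anomalous records
```

The planted points at indices 200-203 sit far from the bulk of the data, so random threshold splits isolate them after very few branches.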

C. ABOD

Angle Based Outlier Detection (ABOD) [17] is a technique that works well in high dimensional spaces, unlike some of its peers. It iterates over every datapoint, storing the angles between the vectors from that point to every pair of other datapoints, and then calculates the variance of the resulting list. The lower the variance for a given datapoint, the more of an outlier it is.
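The intuition is that a point far outside the data sees all other points within a narrow cone, so its angle spectrum has low variance. A simplified sketch follows; note it uses the variance of cosines over all pairs and omits the distance weighting of the original ABOD formulation:

```python
import numpy as np

def abod_scores(X):
    """Simplified angle-based outlier scores: for each point, the variance of
    cosines between the vectors to every pair of other points.
    Lower score = more outlying. (Distance weighting is omitted.)"""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]   # vectors to all other points
        unit = diffs / np.linalg.norm(diffs, axis=1)[:, None]
        cos = unit @ unit.T                       # pairwise cosines of angles
        iu = np.triu_indices(len(cos), k=1)       # each unordered pair once
        scores[i] = np.var(cos[iu])
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 3)), [[10.0, 10.0, 10.0]]])
s = abod_scores(X)
# The planted point (index 60) sees everything in a narrow cone -> lowest variance
```

This naive version is O(n^3); the original paper and practical implementations offer faster approximations.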

IV. Empirical Study

We use a number of real datasets taken from the ODDS repository, which specializes in providing datasets specifically for outlier detection [24]. Table I provides the details of the datasets used. For each dataset, the actual outliers that are supposed to be detected are marked.

TABLE I.

Dataset Glossary

Dataset Number of Outliers Number of Features Number of Rows Citation
Arrhythmia 66 274 452 [14], [19], [29]
Glass 9 9 214 [3], [14], [27]
Ionosphere 126 33 351 [14], [19], [29]
Letter 100 32 1600 [23], [25]
Lympho 6 18 148 [3], [18], [35]
Mnist 188 100 7603 [5]
Musk 97 166 3062 [3]
Optdigits 150 64 5216 [3]
Pendigits 156 16 6870 [14], [27]
Pima 12 8 768 [14], [19], [29]
Satellite 244 36 6435 [19], [29]
Satimage-2 71 36 5803 [3], [35]
Shuttle 183 9 49097 [1], [19], [22], [28], [29]
Speech 61 400 3686 [23]
Vertebral 30 6 240 [27]
Vowels 50 12 1456 [3], [27]
WBC 21 30 378 [3], [14], [35]
Wine 10 13 129 [27]

All experiments were carried out on an AMD Ryzen 7 5800 (8-Core, 36MB Total Cache, Max Boost Clock of 4.6GHz) with 32GB of RAM. The implementation was done in Python 3.10, primarily using the scikit-learn [7], scipy, pandas, and numpy libraries.

Since the study is about the effect of dimensionality reduction on outlier detection, the results and our observations are structured first by the dimensionality reduction technique.

A. PCA

As mentioned in Section IIA, after reducing the dimensions, it is possible to invert PCA giving a reconstructed dataset. We would also like to check if the distance between the original record and the reconstructed record is informative for outlier detection.

Therefore, we first evaluate the performance of outlier detection without utilizing this distance, and then we evaluate the performance when the distance is added as an extra dimension to the reduced dataset. Additionally, since no outlier detection technique is perfect, we would like to evaluate not only the stability of its performance (in terms of the outliers it detects) but also the effect on its performance in terms of the true outliers. We believe this to be one of the crucial contributions of this study.

1). No reconstruction:

We now separately discuss the results for the three outlier detection methods used in this study.

Table II presents the percentage of LOF outliers retained for each dataset in 4 cases: i) when the number of dimensions is reduced by 1; ii) when the number of dimensions is reduced to half; iii) when the number of dimensions is reduced to 3; and iv) when the number of dimensions is reduced to 2. Note that in each case we are merely interested in knowing what fraction of the outliers identified originally by LOF are still identified as the top outliers after dimensionality reduction. Therefore, when no dimensionality reduction is done, the baseline score is 100%. We observe that taking away one dimension, or even up to half, does not usually have much effect on the stability of LOF (i.e., how many originally identified outliers are still retained). Specifically, even when the data is reduced to half of the original dimensions, in 11 of the 18 datasets at least 75% of the originally identified outliers are still retained. However, when the number of dimensions is reduced to a level where the data can be visualized (i.e., to 3 or 2 dimensions), there is a significant degradation in stability. For example, when the data is reduced to 3 dimensions, more than 10% of the originally identified outliers are retained in only 11 out of the 18 datasets. In the case of 2 dimensions it is even worse: in only 6 out of the 18 datasets are more than 10% of the originally identified outliers retained. If we care about retaining 20% of the originally identified outliers, then this happens in only 5-6 cases out of 18 for either 2 or 3 dimensions. Unfortunately, no clear pattern can be observed as to why the performance for a particular dataset is better or worse.

TABLE II.

LOF and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 Dimensions
Arrhythmia 100 100 34.85 25.76
Glass 100 77.78 55.56 33.33
Ionosphere 97.62 88.89 52.38 43.65
Letter 96.0 75.0 18.0 9.0
Lympho 83.33 83.33 0.0 0.0
Mnist 100.0 87.77 1.60 3.19
Musk 100.0 91.75 10.31 8.25
Optdigits 98.67 76.67 6.0 2.0
Pendigits 97.44 64.1 17.31 2.56
Pima 100.0 58.33 16.67 8.33
Satellite 95.9 75.0 6.56 4.1
Satimage-2 91.55 69.01 1.41 2.82
Shuttle 86.34 42.62 1.64 0.55
Speech 96.72 54.1 1.64 1.64
Vertebral 100.0 50.0 46.67 30.0
Vowels 92.0 38.0 12.0 8.0
WBC 100.0 95.24 38.1 38.1
Wine 100.0 100.0 100.0 90.0
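The retained-outlier percentage reported in these tables is a top-k overlap between the rankings before and after reduction; a minimal sketch (the function name and toy scores are illustrative):

```python
import numpy as np

def retained_fraction(scores_orig, scores_reduced, k):
    """Percentage of the top-k outliers under the original scores that are
    still ranked among the top-k after dimensionality reduction.
    Higher score = more outlying in both rankings."""
    top_orig = set(np.argsort(scores_orig)[-k:])
    top_red = set(np.argsort(scores_reduced)[-k:])
    return 100.0 * len(top_orig & top_red) / k

# Toy check: an identical ranking retains 100% of the outliers
s = np.arange(10, dtype=float)
assert retained_fraction(s, s, 3) == 100.0
```

Here k is set to the known number of true outliers for each dataset, so 100% means dimensionality reduction changed nothing about which points top the ranking.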

Next, we examine the results in terms of identification of the true outliers. Specifically, Table III presents the percentage of true outliers retained after dimension reduction. One point to note is that even though LOF is widely accepted as a standard outlier detection technique its basic performance (without dimensionality reduction) leaves a lot to be desired.

TABLE III.

LOF and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 18.18 18.18 18.18 22.73
Glass 22.22 22.22 11.11 0.0
Ionosphere 74.6 71.43 49.21 44.44
Letter 41.0 38.0 17.0 7.0
Lympho 33.33 33.33 0.0 33.33
Mnist 20.74 18.62 15.43 10.64
Musk 11.34 13.4 3.09 12.37
Optdigits 5.33 4.67 6.0 1.33
Pendigits 5.77 5.13 5.77 5.13
Pima 25.0 41.67 33.33 41.67
Satellite 48.36 47.13 40.16 29.1
Satimage-2 9.86 9.86 1.41 2.82
Shuttle 14.21 12.57 21.86 33.88
Speech 13.11 9.84 3.28 0.0
Vertebral 3.33 16.67 20.0 6.67
Vowels 32.0 28.0 2.0 6.0
WBC 19.05 19.05 19.05 23.81
Wine 0.0 0.0 0.0 0.0

The interesting observation here is that LOF detects around the same number or more “true” outliers after dimension reduction in 8 of the 18 datasets, and five datasets have the most “true” outliers detected when reduced to 3 and 2 dimensions, respectively. This suggests that LOF can actually perform better in terms of detecting the true outliers once dimensionality reduction is performed.

Next we examine the performance of Isolation Forest. Table IV provides the stability results for Isolation Forest (i.e., how well the originally identified outliers are retained) when dimensionality reduction is done using PCA, while Table V presents the results in terms of identifying the true outliers. There are three main observations to be made. First, the stability results are quite erratic: in many cases there is an initial dip in the fraction of originally identified outliers retained as the number of dimensions is reduced to 2 or 3, but there are also many cases where this fraction increases. Second, and very interestingly, we observe that in 17 of the 18 datasets an equal or higher percentage of true outliers is discovered when the number of dimensions is reduced to 3 or 2. This clearly implies that the performance of Isolation Forest (in terms of identifying true outliers) significantly improves when the dimensionality of the data is reduced. Third, as a whole, after dimensionality reduction, Isolation Forest actually outperforms LOF in terms of finding true outliers in close to two thirds of the datasets.

TABLE IV.

Isolation Forest and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 63.64 62.12 33.33 18.18
Glass 33.33 33.33 33.33 22.22
Ionosphere 69.05 73.81 65.87 55.56
Letter 36.0 28.0 9.0 1.0
Lympho 0.0 33.33 33.33 16.67
Mnist 23.94 27.66 1.06 1.06
Musk 14.43 21.65 2.06 0.0
Optdigits 14.67 16.67 6.67 5.33
Pendigits 0.0 0.0 45.51 5.77
Pima 16.67 8.33 16.67 16.67
Satellite 11.48 11.89 32.38 38.93
Satimage-2 0.0 0.0 90.14 88.73
Shuttle 7.65 12.02 10.93 5.46
Speech 29.51 39.34 1.64 0.0
Vertebral 66.67 70.0 50.0 30.0
Vowels 20.0 36.0 0.0 0.0
WBC 23.81 42.86 47.62 42.86
Wine 30.0 30.0 10.0 0.0
TABLE V.

Isolation Forest and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 3.03 3.03 16.67 21.21
Glass 11.11 11.11 0.0 11.11
Ionosphere 5.56 8.73 18.25 25.4
Letter 1.0 2.0 4.0 3.0
Lympho 0.0 33.33 33.33 16.67
Mnist 0.0 0.0 19.68 33.51
Musk 0.0 0.0 100.0 63.92
Optdigits 2.0 0.0 0.0 0.0
Pendigits 0.0 0.0 60.9 33.97
Pima 41.67 50.0 58.33 58.33
Satellite 27.05 36.89 50.0 90.57
Satimage-2 0.0 0.0 88.73 87.32
Shuttle 20.22 24.04 24.59 60.11
Speech 1.64 3.28 3.28 0.0
Vertebral 16.67 16.67 20.0 13.33
Vowels 0.0 0.0 2.0 2.0
WBC 0.0 0.0 33.33 38.1
Wine 0.0 0.0 10.0 30.0

Finally, we study the performance of ABOD with PCA. Table VI provides the results corresponding to detection of the originally identified outliers (i.e., stability), while Table VII provides the results for the detection of the true outliers. The key observation here is that for 9 out of the 18 datasets an equal or higher percentage of true outliers is discovered when the data is reduced to 2 or 3 dimensions. However, the general performance of ABOD is worse than that of both LOF and Isolation Forest in terms of both stability and identification of true outliers. One reason for this is that ABOD was specifically designed for high dimensional data, so dimensionality reduction is in any case antithetical to its purpose.

TABLE VI.

ABOD and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 100.0 34.85 27.27
Glass 100.0 44.44 11.11 0.0
Ionosphere 99.21 93.65 61.11 50.0
Letter 96.0 75.0 21.0 21.0
Lympho 83.33 33.33 16.67 0.0
Mnist 100.0 92.55 3.19 2.66
Musk 100.0 96.91 76.29 44.33
Optdigits 99.33 78.67 8.0 1.33
Pendigits 97.44 66.03 16.67 7.69
Pima 100.0 16.67 8.33 0.0
Satellite 89.34 36.48 9.02 5.74
Satimage-2 81.69 28.17 1.41 2.82
Shuttle 8.74 0.0 0.0 1.09
Speech 100.0 57.38 3.28 3.28
Vertebral 100.0 70.0 66.67 40.0
Vowels 94.0 66.0 4.0 0.0
WBC 100.0 100.0 14.29 19.05
Wine 100.0 90.0 40.0 0.0
TABLE VII.

ABOD and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 4.55 9.09 13.64
Glass 0.0 0.0 0.0 0.0
Ionosphere 6.35 7.14 18.25 24.6
Letter 0.0 0.0 1.0 4.0
Lympho 0.0 0.0 0.0 0.0
Mnist 0.53 0.0 4.26 9.04
Musk 90.72 92.78 74.23 43.3
Optdigits 2.67 5.33 4.0 5.33
Pendigits 0.0 0.0 0.0 0.64
Pima 16.67 16.67 16.67 50.0
Satellite 28.69 22.13 11.48 13.52
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.0 0.0 0.0 0.0
Vertebral 23.33 20.0 30.0 26.67
Vowels 0.0 0.0 2.0 2.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 0.0 0.0

2). Reconstructed from PCA:

We now study the performance of all of the outlier detection methods when we add the extra feature containing the distance between the reconstructed datapoint and the original. Note that in this case, reducing to 3 dimensions actually means reducing to 2 dimensions and adding the extra dimension containing the distance. Similarly, reducing to 2 dimensions implies actually reducing to a single dimension and then adding the distance between the reconstructed data point and the original.

Table VIII provides the stability results pertaining to LOF and PCA with reconstruction. Table IX provides the results pertaining to LOF and PCA in terms of identifying the true outliers with reconstruction. We see that stability generally suffers as the dimensions are significantly reduced, whereas there is still improvement in the detection of true outliers for some datasets. Table X and Table XI provide the corresponding results for Isolation Forest and PCA when reconstruction is done, while Table XII and Table XIII give the results pertaining to ABOD and PCA with reconstruction. We discuss the comparative performance of these later in more detail.

TABLE VIII.

LOF and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 99.0 42.4 27.0
Glass 100.0 87.89 66.7 22.0
Ionosphere 98.41 77.57 38.1 41.0
Letter 97.0 76.0 24.0 11.0
Lympho 83.33 99.0 33.3 17.0
Mnist 100.0 91.02 4.8 4.0
Musk 100.0 89.72 9.3 6.0
Optdigits 98.67 77.67 11.3 4.0
Pendigits 98.08 67.59 16.7 12.0
Pima 100.0 74.0 25.0 17.0
Satellite 96.31 73.59 9.4 4.0
Satimage-2 95.77 69.42 1.4 3.0
Shuttle 89.07 71.68 4.4 3.0
Speech 98.36 59.66 4.9 2.0
Vertebral 100.0 65.67 63.3 50.0
Vowels 94.0 55.0 8.0 12.0
WBC 100.0 94.24 52.4 33.0
Wine 100.0 99.0 100.0 90.0
TABLE IX.

LOF and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 18.18 17.18 19.7 24.0
Glass 22.22 21.22 22.2 11.0
Ionosphere 73.02 63.29 33.3 40.0
Letter 42.0 38.0 21.0 10.0
Lympho 33.33 32.33 50.0 50.0
Mnist 20.74 18.15 15.4 16.0
Musk 11.34 13.43 4.1 5.0
Optdigits 5.33 3.67 5.3 2.0
Pendigits 5.13 3.49 5.8 7.0
Pima 25.0 24.0 50.0 33.0
Satellite 47.95 44.49 29.1 42.0
Satimage-2 9.86 8.86 0.0 6.0
Shuttle 14.75 14.85 2.7 36.0
Speech 13.11 13.75 8.2 3.0
Vertebral 3.33 15.67 10.0 7.0
Vowels 32.0 29.0 12.0 10.0
WBC 19.05 18.05 19.0 19.0
Wine 0.0 0.0 0.0 0.0
TABLE X.

Isolation Forest and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 65.15 62.12 16.7 17.0
Glass 44.44 11.11 22.2 33.0
Ionosphere 71.43 74.6 71.4 67.0
Letter 23.0 24.0 23.0 18.0
Lympho 0.0 0.0 66.7 50.0
Mnist 32.45 19.68 14.4 15.0
Musk 16.49 23.71 2.1 2.0
Optdigits 19.33 12.0 11.3 13.0
Pendigits 2.56 3.21 34.0 59.0
Pima 8.33 16.67 16.7 17.0
Satellite 12.7 9.84 17.2 32.0
Satimage-2 1.41 11.27 90.1 92.0
Shuttle 3.83 10.38 38.8 37.0
Speech 22.95 36.07 0.0 5.0
Vertebral 63.33 53.33 46.7 33.0
Vowels 26.0 36.0 2.0 0.0
WBC 19.05 42.86 38.1 33.0
Wine 30.0 20.0 20.0 30.0
TABLE XI.

Isolation Forest and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 3.03 4.55 15.2 23.0
Glass 0.0 11.11 0.0 11.0
Ionosphere 6.35 8.73 13.5 22.0
Letter 0.0 1.0 7.0 5.0
Lympho 0.0 0.0 66.7 50.0
Mnist 0.0 0.0 17.0 32.0
Musk 0.0 0.0 100.0 100.0
Optdigits 0.0 0.0 0.0 1.0
Pendigits 0.64 1.92 9.6 35.0
Pima 16.67 33.33 50.0 75.0
Satellite 25.82 39.75 61.5 89.0
Satimage-2 1.41 11.27 91.5 90.0
Shuttle 10.38 32.24 71.0 82.0
Speech 3.28 6.56 0.0 2.0
Vertebral 6.67 16.67 30.0 17.0
Vowels 2.0 0.0 18.0 18.0
WBC 0.0 4.76 28.6 33.0
Wine 0.0 0.0 0.0 10.0
TABLE XII.

ABOD and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 100.0 42.4 30.0
Glass 100.0 44.44 44.4 11.0
Ionosphere 100.0 92.86 68.3 66.0
Letter 95.0 78.0 35.0 29.0
Lympho 83.33 33.33 0.0 0.0
Mnist 100.0 95.74 5.3 3.0
Musk 100.0 96.91 85.6 76.0
Optdigits 99.33 79.33 8.0 4.0
Pendigits 98.08 71.15 12.2 4.0
Pima 100.0 41.67 16.7 17.0
Satellite 89.34 36.07 13.9 11.0
Satimage-2 81.69 26.76 1.4 3.0
Shuttle 50.82 1.64 0.5 0.0
Speech 100.0 65.57 6.6 7.0
Vertebral 100.0 76.67 60.0 50.0
Vowels 100.0 78.0 6.0 10.0
WBC 100.0 100.0 9.5 14.0
Wine 100.0 90.0 60.0 30.0
TABLE XIII.

ABOD and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 4.55 4.5 3.0
Glass 0.0 0.0 0.0 0.0
Ionosphere 6.35 7.14 9.5 10.0
Letter 0.0 0.0 2.0 3.0
Lympho 0.0 0.0 0.0 0.0
Mnist 0.53 0.0 1.1 5.0
Musk 90.72 92.78 84.5 76.0
Optdigits 2.67 5.33 3.3 4.0
Pendigits 0.0 0.0 0.0 0.0
Pima 16.67 25.0 16.7 58.0
Satellite 28.28 24.18 8.6 8.0
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.55 0.0 2.2 10.0
Speech 1.64 0.0 0.0 0.0
Vertebral 23.33 20.0 33.3 30.0
Vowels 0.0 0.0 0.0 0.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 0.0 0.0

B. UMAP

Tables XIV and XV provide the results pertaining to LOF and UMAP. As before, for 13 of the 18 datasets an equal or higher percentage of true outliers is discovered when the data is reduced to 3 or 2 dimensions.

TABLE XIV.

LOF and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 12.12 22.73 15.15 15.15
Glass 22.22 11.11 11.11 11.11
Ionosphere 42.06 42.06 46.83 40.48
Letter 27.0 21.0 25.0 22.0
Lympho 33.33 33.33 16.67 16.67
Mnist 4.79 5.85 2.13 3.72
Musk 9.28 7.22 12.37 8.25
Optdigits 7.33 6.67 4.67 5.33
Pendigits 11.54 8.97 10.9 6.41
Pima 0.0 8.33 0.0 0.0
Satellite 13.52 13.52 11.48 8.2
Satimage-2 4.23 4.23 4.23 1.41
Shuttle 4.92 4.92 3.28 2.73
Speech 0.0 3.28 0.0 0.0
Vertebral 6.67 13.33 16.67 16.67
Vowels 8.0 8.0 14.0 8.0
WBC 4.76 14.29 0.0 9.52
Wine 10.0 20.0 20.0 20.0

TABLE XV.

LOF and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 10.61 6.06 15.15 19.7
Glass 0.0 0.0 0.0 11.11
Ionosphere 39.68 39.68 38.1 41.27
Letter 16.0 14.0 19.0 17.0
Lympho 0.0 0.0 0.0 0.0
Mnist 17.02 17.55 25.0 22.87
Musk 9.28 5.15 7.22 12.37
Optdigits 1.33 4.0 1.33 0.67
Pendigits 2.56 3.85 3.21 1.92
Pima 41.67 33.33 50.0 33.33
Satellite 45.9 43.03 49.18 45.49
Satimage-2 5.63 8.45 7.04 5.63
Shuttle 14.21 14.75 16.39 14.75
Speech 8.2 6.56 9.84 4.92
Vertebral 13.33 10.0 10.0 13.33
Vowels 2.0 4.0 6.0 2.0
WBC 14.29 0.0 14.29 9.52
Wine 20.0 0.0 10.0 0.0

Tables XVI and XVII give the results pertaining to Isolation Forest and UMAP. As earlier, we again observe that 7 of the 18 datasets have an equal or higher percentage of true outliers discovered when the data is reduced to 3 or 2 dimensions. Tables XVIII and XIX give the results pertaining to ABOD and UMAP. Here, 11 of the 18 datasets have an equal or higher percentage of true outliers discovered when the data is reduced to 3 or 2 dimensions.

TABLE XVI.

Isolation Forest and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 43.94 48.48 33.33 19.7
Glass 22.22 0.0 11.11 0.0
Ionosphere 51.59 58.73 55.56 51.59
Letter 10.0 7.0 14.0 10.0
Lympho 0.0 16.67 0.0 0.0
Mnist 8.51 7.45 2.66 10.11
Musk 14.43 4.12 1.03 2.06
Optdigits 6.0 4.0 6.0 10.67
Pendigits 10.9 18.59 22.44 1.92
Pima 8.33 25.0 0.0 0.0
Satellite 23.36 15.16 4.92 0.0
Satimage-2 8.45 5.63 1.41 0.0
Shuttle 3.28 0.55 20.77 0.0
Speech 4.92 3.28 0.0 1.64
Vertebral 30.0 33.33 33.33 40.0
Vowels 20.0 8.0 18.0 8.0
WBC 47.62 33.33 28.57 14.29
Wine 0.0 30.0 0.0 0.0

TABLE XVII.

Isolation Forest and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 1.52 9.09 13.64
Glass 0.0 0.0 0.0 11.11
Ionosphere 48.41 36.51 37.3 49.21
Letter 17.0 20.0 11.0 14.0
Lympho 16.67 0.0 0.0 0.0
Mnist 4.79 1.6 0.0 0.53
Musk 0.0 0.0 0.0 0.0
Optdigits 0.0 0.0 0.0 0.0
Pendigits 0.0 0.0 0.0 0.0
Pima 25.0 41.67 58.33 25.0
Satellite 44.26 70.08 34.84 2.46
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.0 16.39 0.0 0.0
Speech 0.0 0.0 0.0 1.64
Vertebral 6.67 3.33 3.33 0.0
Vowels 0.0 0.0 4.0 2.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 20.0 0.0

TABLE XIX.

ABOD and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 33.33 19.7 28.79 31.82
Glass 0.0 0.0 0.0 22.22
Ionosphere 13.49 11.9 23.02 34.13
Letter 5.0 2.0 3.0 5.0
Lympho 16.67 0.0 0.0 0.0
Mnist 71.28 64.89 55.32 25.0
Musk 23.71 21.65 28.87 22.68
Optdigits 4.0 6.67 4.0 12.0
Pendigits 1.92 3.21 19.23 5.13
Pima 58.33 66.67 50.0 58.33
Satellite 55.74 56.97 52.87 41.39
Satimage-2 28.17 26.76 43.66 25.35
Shuttle 55.74 51.91 52.46 30.6
Speech 0.0 0.0 0.0 0.0
Vertebral 13.33 13.33 23.33 16.67
Vowels 10.0 6.0 10.0 6.0
WBC 23.81 23.81 28.57 33.33
Wine 20.0 10.0 0.0 40.0

C. Random Projection

Finally, for Random Projection, we only project the data down to two dimensions. In this case we compare the performance of all three outlier detection methods in terms of both stability and detection of true outliers. The results are shown in Tables XX and XXI. We observe that LOF outperforms the other two methods when Random Projection is used.
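This setup can be sketched on synthetic data as follows (hypothetical parameters; `GaussianRandomProjection` from scikit-learn is one common implementation of random projection): project straight down to two dimensions, then rank points by their LOF and Isolation Forest anomaly scores in the projected space.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.random_projection import GaussianRandomProjection

# Synthetic data: 300 inliers and 15 planted outliers in 20 dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (300, 20)),
               rng.normal(5.0, 1.0, (15, 20))])

# Project directly down to 2 dimensions with a random Gaussian matrix.
X2 = GaussianRandomProjection(n_components=2, random_state=42).fit_transform(X)

# Rank points in the projected space with two of the three detectors.
n_flag = 15
lof = LocalOutlierFactor(n_neighbors=20).fit(X2)
# negative_outlier_factor_ is more negative for more outlying points
lof_top = set(np.argsort(lof.negative_outlier_factor_)[:n_flag])

iso = IsolationForest(random_state=42).fit(X2)
iso_top = set(np.argsort(iso.score_samples(X2))[:n_flag])  # lowest = most anomalous
```

ABOD is not available in scikit-learn; implementations exist in, e.g., the pyod library, and would slot into the same ranking step.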

TABLE XX.

Random Projection

Dataset LOF Isolation Forest ABOD
Arrhythmia 39.39 36.36 22.73
Glass 55.56 44.44 22.22
Ionosphere 64.29 39.68 59.52
Letter 11.0 1.0 4.0
Lympho 0.0 16.67 0.0
Mnist 6.91 2.13 3.72
Musk 7.22 0.0 51.55
Optdigits 4.67 0.67 2.67
Pendigits 7.69 0.0 1.92
Pima 25.0 16.67 8.33
Satellite 10.66 1.64 6.97
Satimage-2 8.45 2.82 0.0
Shuttle 9.84 16.94 0.55
Speech 0.0 0.0 0.0
Vertebral 56.67 36.67 20.0
Vowels 16.0 2.0 8.0
WBC 42.86 42.86 19.05
Wine 40.0 0.0 40.0

TABLE XXI.

Random Projection (True Outliers)

Dataset LOF Isolation Forest ABOD
Arrhythmia 40.91 18.18 10.61
Glass 22.22 11.11 0.0
Ionosphere 61.11 30.95 16.67
Letter 11.0 6.0 7.0
Lympho 16.67 16.67 16.67
Mnist 12.77 20.21 5.85
Musk 13.4 8.25 50.52
Optdigits 0.67 4.0 4.67
Pendigits 1.28 0.0 0.64
Pima 66.67 75.0 25.0
Satellite 50.41 53.28 15.98
Satimage-2 7.04 0.0 2.82
Shuttle 26.78 46.45 0.0
Speech 0.0 0.0 3.28
Vertebral 13.33 10.0 16.67
Vowels 16.0 36.0 0.0
WBC 9.52 42.86 0.0
Wine 0.0 70.0 0.0

D. Summary of Observations

Disregarding minor fluctuations, the behavior we briefly touched upon earlier, namely that an equal or higher percentage of true outliers is often detected when the data is reduced to two or three dimensions, is fairly consistent across all the dimensionality reduction techniques. While some standout datasets retained this quality across most of the methods (Arrhythmia, Lympho, Shuttle, and WBC, to name a few), the affected datasets mostly varied between reduction techniques. Isolation Forest with PCA (see Table V) was the best at bringing out this quality, with 17 of the 18 datasets (around 94 percent) displaying it. One reason this tends to happen is that outlier detection techniques are in general vulnerable to the “curse of dimensionality” [2], and dimensionality reduction therefore helps focus attention on the right data subspace.
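The “curse of dimensionality” effect can be seen in a small self-contained experiment on synthetic uniform data (a standard distance-concentration measure, not taken from the paper): as dimensionality grows, the relative gap between a query point's nearest and farthest neighbors shrinks, which is what degrades distance-based outlier scores.

```python
import numpy as np

# Distance concentration: the relative contrast between the farthest and
# nearest neighbor distances shrinks sharply as dimensionality d grows.
rng = np.random.default_rng(1)
contrast = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))   # 500 uniform random points
    q = rng.uniform(size=d)          # one query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast[d] = (dist.max() - dist.min()) / dist.min()
```

In low dimensions the contrast is large (some points are far closer to the query than others); in very high dimensions all distances concentrate around the same value, so "near" and "far" lose their discriminative power.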

In order to more clearly see the difference between UMAP and PCA (in both 2 and 3 dimensions), we present scatterplots for the Lymphography Dataset as an exemplar.

Observing the four scatterplots (Figures 1–4) for the Lymphography dataset, there is a clear difference between PCA and UMAP. Even though UMAP keeps the general location of the outliers fairly similar to that in the PCA scatterplot, its embedding is much more clustered, which led to Isolation Forest making a couple of mistakes while trying to find the outliers. PCA’s more systematic approach is very apparent when comparing its 2D and 3D plots: the two are almost identical in their global structure. There are differences, of course, but PCA has managed to keep the global structure almost unchanged from 3D to 2D. UMAP, on the other hand, is also impressive in its ability to keep a similar shape; however, the points are scattered somewhat differently, especially in the middle of the structure and at the bottom-right end.

Fig. 1.

Isolation Forest PCA 3D for Lymphography (Labels stand for “Found True Outlier”, “Not Found True Outlier”, “Found Previously Calculated Outlier”, “Not Found Previously Calculated Outlier” and “Regular Point”)

Fig. 4.

Isolation Forest UMAP 3D for Lymphography

As discussed before, another interesting effect we observed was the inverse method and its relationship with true outliers. Appending an extra dimension of Euclidean distances between the original dataset and the reconstructed dataset seems to magnify the number of true outliers. We investigated whether the reconstructed tables (IX, XI, and XIII) yielded more true outliers in their two- and three-dimensional columns than their counterparts (III, V, and VII), with interesting results. Angle-Based Outlier Detection (ABOD) fared the worst, with the number of true outliers detected in 3 and 2 dimensions increasing for only 4 of the 18 datasets (with 6 unchanged). Isolation Forest was in the middle, increasing for 9 of the 18 datasets (with 1 unchanged). Local Outlier Factor (LOF) fared the best, increasing for 12 of the 18 datasets (with 1 unchanged). These results, however, do not tell us whether this method increases the number of true outliers detected in 2 and 3 dimensions compared to higher ones. To answer that question, we created a bar chart comparing reconstructed LOF true outliers with normal LOF true outliers, and labeled whether the quality (an increase in true outliers when reduced) was brought out in each dataset (see Figure 5).
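Our reading of this reconstruction-based variant can be sketched as follows (synthetic data, hypothetical dimensions): reduce with PCA, map back to the original space with `inverse_transform`, compute each row's Euclidean reconstruction error, and append it as an extra feature of the low-dimensional representation before running the detector.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 inliers and 8 planted outliers in 8 dimensions.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)),
               rng.normal(4.0, 1.0, (8, 8))])

# Reduce, then map back to the original space and measure the per-row
# Euclidean distance between each point and its reconstruction.
pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)
recon_err = np.linalg.norm(X - pca.inverse_transform(Z), axis=1)

# Augmented representation: reduced coordinates plus the error column.
X_aug = np.column_stack([Z, recon_err])
```

Any of the three detectors can then be run on `X_aug` instead of `Z`; points that are poorly captured by the retained components carry a large error coordinate, which is what appears to magnify the true outliers.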

Fig. 5.

Comparison of LOF PCA with simply reduced data vs. reconstructed data. (The prefix “Re” denotes reconstructed datasets. The datasets are labeled 2D or 3D, depending on which dimension had more true outliers detected; they can also be labeled B (both) or N (neither). The prefix Q is used if that specific dataset had an increase in true outliers in lower dimensions compared to higher ones.)

Overall, the quality was gained in 1 instance and lost in 3; in all three of those, the total percentage of true outliers was also reduced. While this method does not seem great at inducing the quality, it increases the number of correctly detected true outliers more often than not. The chances of losing outliers are much lower than the chances of gaining them, and the pros generally seem to outweigh the cons: the reconstructed datasets were the only ones in the entire graph to ever reach 50 percent detection, and they did so twice.

V. Conclusion and Future Work

In this work we have studied the relationship between dimensionality reduction and outlier detection. In particular, we empirically evaluated the performance of several standard outlier detection techniques when the data is reduced using several common dimensionality reduction techniques, over a diverse set of datasets. Our results show that while the performance of the outlier detection techniques may degrade in terms of their stability, their ability to find the true outliers often improves in lower-dimensional spaces. This is a very surprising observation. Our study, however, is restricted to numeric data and is completely empirical. In the future, we plan to study categorical and mixed data, as well as further explore this problem from a theoretical perspective. Note that it is also possible to use state-of-the-art outlier detection techniques to identify outliers and then use dimensionality reduction to visualize these outliers and explain why they were identified as outliers. We plan to examine different visualization techniques for this in the future as well.

Fig. 2.

LOF PCA for Lymphography

Fig. 3.

Isolation Forest UMAP for Lymphography

TABLE XVIII.

ABOD and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 9.09 9.09 4.55 10.61
Glass 0.0 0.0 0.0 0.0
Ionosphere 53.97 60.32 52.38 48.41
Letter 14.0 22.0 14.0 18.0
Lympho 0.0 16.67 0.0 0.0
Mnist 1.06 2.13 2.13 3.19
Musk 22.68 20.62 26.8 21.65
Optdigits 2.67 3.33 2.0 4.67
Pendigits 0.64 0.64 1.28 0.64
Pima 0.0 0.0 0.0 0.0
Satellite 1.64 0.82 2.05 2.87
Satimage-2 0.0 0.0 0.0 1.41
Shuttle 0.0 0.0 0.0 0.0
Speech 9.84 9.84 9.84 6.56
Vertebral 6.67 10.0 20.0 23.33
Vowels 10.0 10.0 10.0 2.0
WBC 0.0 4.76 9.52 9.52
Wine 20.0 20.0 10.0 0.0

Acknowledgments

Research reported in this publication was supported by the National Institutes of Health under award R35GM134927. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research.

Contributor Information

Vivek Vaidya, East Brunswick High School, East Brunswick, NJ, USA.

Jaideep Vaidya, MSIS Department, Rutgers University, Newark, NJ, USA.

References

  • [1]. Abe N, Zadrozny B, and Langford J, “Outlier detection by active learning,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 504–509.
  • [2]. Aggarwal CC, “An introduction to outlier analysis,” in Outlier Analysis. Springer, 2017, pp. 1–34.
  • [3]. Aggarwal CC and Sathe S, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015.
  • [4]. Avola D, Cannistraci I, Cascio M, Cinque L, Diko A, Fagioli A, Foresti GL, Lanzino R, Mancini M, Mecca A et al., “A novel GAN-based anomaly detection and localization method for aerial video surveillance at low altitude,” Remote Sensing, vol. 14, no. 16, p. 4110, 2022.
  • [5]. Bandaragoda TR, Ting KM, Albrecht D, Liu FT, and Wells JR, “Efficient anomaly detection by isolation using nearest neighbour ensemble,” in 2014 IEEE International Conference on Data Mining Workshop. IEEE, 2014, pp. 698–705.
  • [6]. Breunig MM, Kriegel H-P, Ng RT, and Sander J, “LOF: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
  • [7]. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, VanderPlas J, Joly A, Holt B, and Varoquaux G, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.
  • [8]. Capelleveen GC, “Outlier based predictors for health insurance fraud detection within US Medicaid,” Master’s thesis, University of Twente, 2013.
  • [9]. Caroline Cynthia P and Thomas George S, “An outlier detection approach on credit card fraud detection using machine learning: a comparative analysis on supervised and unsupervised learning,” in Intelligence in Big Data Technologies—Beyond the Hype. Springer, 2021, pp. 125–135.
  • [10]. Chandola V, Banerjee A, and Kumar V, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
  • [11]. Chepenko D, “A density-based algorithm for outlier detection.” [Online]. Available: https://github.com/zkid18/Machine-Learning-Algorithms/blob/master/Outlier_Detection/LOF-algorithm.ipynb
  • [12]. Dokas P, Ertoz L, Kumar V, Lazarevic A, Srivastava J, and Tan P-N, “Data mining for network intrusion detection,” in Proc. NSF Workshop on Next Generation Data Mining. Citeseer, 2002, pp. 21–30.
  • [13]. Pearson K, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
  • [14]. Keller F, Muller E, and Bohm K, “HiCS: High contrast subspaces for density-based outlier ranking,” in 2012 IEEE 28th International Conference on Data Engineering. IEEE, 2012, pp. 1037–1048.
  • [15]. Kim T, Adhikaree A, Pandey R, Kang D, Kim M, Oh C-Y, and Back J, “Outlier mining-based fault diagnosis for multicell lithium-ion batteries using a low-priced microcontroller,” in 2018 IEEE Applied Power Electronics Conference and Exposition (APEC). IEEE, 2018, pp. 3365–3369.
  • [16]. Konijn RM and Kowalczyk W, “Finding fraud in health insurance data with two-layer outlier detection approach,” in International Conference on Data Warehousing and Knowledge Discovery. Springer, 2011, pp. 394–405.
  • [17]. Kriegel H-P, Schubert M, and Zimek A, “Angle-based outlier detection in high-dimensional data,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 444–452.
  • [18]. Lazarevic A and Kumar V, “Feature bagging for outlier detection,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 157–166.
  • [19]. Liu FT, Ting KM, and Zhou Z-H, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
  • [20]. ——, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
  • [21]. Liu J and Zio E, “KNN-FSVM for fault detection in high-speed trains,” in 2018 IEEE International Conference on Prognostics and Health Management (ICPHM). IEEE, 2018, pp. 1–7.
  • [22]. Liu J, Yuan L, and Ye J, “An efficient algorithm for a class of fused lasso problems,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 323–332.
  • [23]. Micenková B, McWilliams B, and Assent I, “Learning outlier ensembles: The best of both worlds–supervised and unsupervised,” in Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2), New York, NY, USA. Citeseer, 2014, pp. 51–54.
  • [24]. Rayana S, “ODDS library,” 2016. [Online]. Available: http://odds.cs.stonybrook.edu
  • [25]. Rayana S and Akoglu L, “Less is more: Building selective anomaly ensembles,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 4, pp. 1–33, 2016.
  • [26]. Sainburg T, McInnes L, and Gentner TQ, “Parametric UMAP embeddings for representation and semisupervised learning,” Neural Computation, vol. 33, no. 11, pp. 2881–2907, 2021.
  • [27]. Sathe S and Aggarwal C, “LODES: Local density meets spectral outlier detection,” in Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 2016, pp. 171–179.
  • [28]. Tan SC, Ting KM, and Liu TF, “Fast anomaly detection for streaming data,” in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
  • [29]. Ting K, Tan S, and Liu F, “Mass: A new ranking measure for anomaly detection,” Gippsland School of Information Technology, Monash University, 2009.
  • [30]. Van der Maaten L and Hinton G, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [31]. Vidmar G and Blagus R, “Outlier detection for healthcare quality monitoring–a comparison of four approaches to over-dispersed proportions,” Quality and Reliability Engineering International, vol. 30, no. 3, pp. 347–362, 2014.
  • [32]. Ye Q and Zhi W, “Outlier detection in the framework of dimensionality reduction,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 29, no. 04, p. 1550017, 2015.
  • [33]. Zhang J and Zulkernine M, “Anomaly based network intrusion detection with unsupervised outlier detection,” in 2006 IEEE International Conference on Communications, vol. 5. IEEE, 2006, pp. 2388–2393.
  • [34]. Zhang J, Zulkernine M, and Haque A, “Random-forests-based network intrusion detection systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, pp. 649–659, 2008.
  • [35]. Zimek A, Gaudet M, Campello RJ, and Sander J, “Subsampling for efficient and effective unsupervised outlier detection ensembles,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 428–436.
