Author manuscript; available in PMC: 2023 Dec 13.
Published in final edited form as: IEEE Int Conf Trust Priv Secur Intell Syst Appl. 2022 Dec;2022:150–159. doi: 10.1109/tps-isa56441.2022.00028

Impact of Dimensionality Reduction on Outlier Detection: an Empirical Study

Vivek Vaidya 1, Jaideep Vaidya 2
PMCID: PMC10716874  NIHMSID: NIHMS1945681  PMID: 38094985

Abstract

Outlier detection is a fundamental data analytics technique used in many security applications. Numerous outlier detection techniques exist, and in most cases they are used to identify outliers directly, without any human interaction. However, the underlying data is often high dimensional and complex. Even when outliers are correctly identified, it is difficult for a security expert to understand or visualize why a particular event or record was flagged, since humans can easily grasp only low dimensional spaces. In this paper we study the extent to which outlier detection techniques work in smaller dimensions and how well dimensionality reduction techniques still enable accurate detection of outliers. This can help us understand the extent to which data can be visualized while still retaining the intrinsic outlyingness of the outliers.

Keywords: outlier detection, anomaly detection, dimensionality reduction, visualization, explainability

I. Introduction and Related Work

Outliers (also referred to as anomalies) are aberrant observations (records) that are quite different/distant from regular observations (records). Thus, outliers do not conform to the typical pattern/distribution of data. Typically, in any setting where data is collected (be it natural, human-created, or even machine-created), there are always a few records/observations that are outlying in nature. Outlier/anomaly detection is the problem of identifying outliers from a given set of data and is one of the most fundamental problems in the field of data analytics. Outlier analysis has numerous scientific, commercial, and governmental applications in diverse domains such as astronomy, finance, and medicine, among others. In particular, outlier detection is extensively used in the fields of security, dependability, and trust, for applications such as fraud detection for credit cards [9] and insurance [8], [16], health care [31], intrusion detection for cyber-security [12], [33], [34], fault detection in safety critical systems [15], [21], and military surveillance for enemy activities [4].

However, one key problem, especially for security and privacy applications, is the explainability and interpretability of outliers. This is particularly important when an unsupervised outlier detection technique has been used and the identified outliers need to be further analyzed by a security administrator. The problem here is that most data is high dimensional, whereas humans primarily operate in two or at most three dimensions. Therefore, even if a particular outlier detection technique is able to identify outliers accurately, visualizing the outliers is nearly impossible. One of the most common techniques for dealing with high-dimensional data is dimensionality reduction. However, the effect of dimensionality reduction on outlier analysis has not been sufficiently studied up to this point. If the security analyst is able to accurately visualize and understand the reasons for outlyingness, they may be able to correspondingly monitor for abnormal behavior and improve the security posture.

The key objective of this study is to empirically examine the effect of dimension reduction on outlier identification. While there has been some work on using local geometric structure to detect outliers [32], the effect of dimensionality reduction on standard outlier detection techniques has not been studied. Since dimensionality reduction typically reduces the amount of information for each record, we would like to see the extent to which outliers can still be accurately identified as the data dimensionality is reduced over a host of settings. Note that as the number of dimensions is reduced, especially to 2 or 3 dimensions, it is much easier to visualize the datasets and potentially see the extent to which the outliers are different.

Towards this, for a vast variety of datasets, we utilize several dimensionality reduction techniques such as Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and Random Projections, and then examine the performance of several commonly used outlier detection methods such as Local Outlier Factor (LOF), Isolation Forest (IF), and Angle-Based Outlier Detection (ABOD).

The rest of this paper is structured as follows. Section II provides a brief overview of standard dimensionality reduction techniques. Section III presents an overview of outlier detection and then discusses several commonly used outlier detection techniques that we evaluate in this work. Section IV presents the results of our empirical study over a number of real datasets of varying size and characteristics. Finally, Section V concludes the paper and discusses future work.

II. Overview of Different Dimension Reduction Methods

Dimensionality reduction is a standard data pre-processing approach used whenever data is high-dimensional. Many techniques exist for dimensionality reduction, including Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Uniform Manifold Approximation and Projection (UMAP), among others. These mostly apply to numeric data. For categorical data, there are dimensionality reduction techniques such as Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA), which generalizes CA to more than two categorical variables. Multiple Factor Analysis can be used for mixed data. In this study, we focus on numeric data and discuss below, in more detail, a few standard dimensionality reduction techniques.

A. PCA

Principal Component Analysis (PCA) is a well known technique for dimensionality reduction. The goal of PCA is to reduce the number of dimensions of the data while retaining most of the information. This is done by compressing the dimensions into “principal components”, linear combinations of the original dimensions that retain most of their information. PCA has been used in a variety of settings [13]. Typically, data is first standardized to ensure that all features have the same scale (i.e., zero mean and unit variance). Next, the covariance matrix is computed: a d×d matrix, where d is the number of dimensions, containing the covariances between all pairs of dimensions. Its main diagonal holds the variances of the individual dimensions (the covariance of a dimension with itself is its variance), and the matrix is symmetric about this diagonal. The sign of each covariance indicates whether the two dimensions vary together or in opposite directions. Essentially, the principal components are the orthogonal projections that capture the maximum amount of variance in the data. To identify the principal components, we compute the eigenvectors and eigenvalues of the covariance matrix (they come in pairs, and there are as many pairs as there are dimensions). Once these are calculated, the greater the eigenvalue, the more significant the corresponding principal component.
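The steps above (standardize, then keep the components with the largest eigenvalues) can be sketched with scikit-learn; the dataset and component count here are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical dataset: 100 records, 10 features

# Standardize so every feature has zero mean and unit variance
Xs = StandardScaler().fit_transform(X)

# Keep the top-2 principal components (largest eigenvalues of the covariance matrix)
pca = PCA(n_components=2)
X2 = pca.fit_transform(Xs)

print(X2.shape)                       # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured per component
```

Internally scikit-learn solves the same eigenproblem described above (via SVD), so `explained_variance_ratio_` directly reflects the relative eigenvalue magnitudes.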

Inverse PCA:

Because PCA is a linear dimensionality reduction method, it is possible to invert the projection to approximately recreate the original dataset. The reconstruction is imperfect, since information is lost during reduction, but it is a useful tool nonetheless. Since our goal is to identify outliers, along with standard PCA we experiment with a modified version: we use inverse PCA to reconstruct the original dataset from the reduced principal components and calculate the Euclidean distance between each record of the original dataset and its reconstruction. This distance is then added as an extra dimension to the reduced dataset. The results pertaining to these experiments can be found below.
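The reconstruction-distance feature can be sketched as follows (a minimal illustration with synthetic data; the paper's actual datasets and dimension choices differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # hypothetical dataset: 200 records, 8 features

pca = PCA(n_components=3)
X_red = pca.fit_transform(X)

# Invert the projection: a lossy reconstruction of the original space
X_rec = pca.inverse_transform(X_red)

# Euclidean distance between each original record and its reconstruction
dist = np.linalg.norm(X - X_rec, axis=1)

# Append the reconstruction error as an extra dimension of the reduced data
X_aug = np.column_stack([X_red, dist])
print(X_aug.shape)  # (200, 4)
```

Records that PCA reconstructs poorly get a large extra coordinate, which is exactly the signal the modified version feeds to the outlier detectors.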

B. UMAP

Uniform Manifold Approximation and Projection [26] is a comparatively new dimension reduction technique that uses a graph-based method to reduce dimensions. It starts by building a graph over the data and connecting the points with simplices. A simplex is a simple way to build a k-dimensional object: a 0-simplex is a point, a 1-simplex is a line segment with two endpoints, a 2-simplex is a triangle, and a 3-simplex is a triangular pyramid (higher-dimensional simplices exist but are harder to visualize, since we live in a three-dimensional world). These simplices connect the points of the generated graph. From here, the dimensions are reduced by finding an equivalent lower dimensional representation of the data using this graph.

C. T-SNE

T-distributed Stochastic Neighbor Embedding (T-SNE) [30] is an alternative type of dimension reduction that specializes in reducing large datasets to two or three dimensions. PCA was created to conserve global variance but is not good at conserving local variance; T-SNE was created to make up for this shortcoming. As such, T-SNE differs from PCA by preserving only small pairwise distances or local similarities, whereas PCA is concerned with preserving large pairwise distances to maximize variance. T-SNE starts by creating a probability distribution that represents similarities between neighbors, in which the “similarity of datapoint xj to datapoint xi is the conditional probability, pj|i, that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at xi” (Maaten 1). The value of pj|i will be high if the two points are close together and very small if they are far apart. Then, T-SNE defines a similar distribution for the points in the low-dimensional embedding and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the embedding. T-SNE is primarily used for visualization. Furthermore, since T-SNE is a local embedding, it is difficult to find a good inverse mapping. Also, drastically different results are obtained as the perplexity parameter is changed. Finally, T-SNE is significantly more computationally costly than PCA. For these reasons, we do not utilize T-SNE in our study.

D. Linear vs Manifold Learning

Most dimension reduction techniques fall into two categories: linear methods, such as PCA or SVD, which use a closed-form mathematical transformation to reduce; and non-linear or manifold learning methods, such as T-SNE and UMAP, which use a more graph-based approach. Linear methods always have a 1:1 correspondence between the original points and the reduced ones, while non-linear methods do not. This is why T-SNE cannot be used effectively in this study: it is too stochastic. (Note that UMAP is also stochastic, but we use it here.) T-SNE reduces by taking points from the high dimensional graph and moving them cluster by cluster. This helps preserve local variance among the points, but global variance becomes very inaccurate as a result. UMAP takes the graph it has connected with simplices and compresses it, preserving the global structure much better than T-SNE (though still not as well as PCA) while retaining the ability to preserve local variance. While not perfect, UMAP's correspondence is much better than T-SNE's despite its non-linear nature, and it even has an inverse method. We have not used that inverse method here because it cannot restore from one dimension, meaning 3D would be the only visualizable dimensionality we could work with, and because we used a random seed for the UMAP calculations.

E. Random Projection

The key idea behind random projection is to simply project the data into a random space. This is done by generating a number of random d×2 matrices, where d is the number of features (columns) of the dataset being reduced. Each of these matrices is multiplied with the dataset to reduce it to two dimensions. The outliers are then calculated for each of these reduced datasets, and the outlier scores for each point returned by the chosen outlier detection method are summed.
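This project-score-accumulate loop can be sketched as follows; the projection count, LOF parameters, and synthetic data are illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))  # hypothetical dataset: 300 records, 10 features
d = X.shape[1]

n_proj = 20
scores = np.zeros(len(X))
for _ in range(n_proj):
    R = rng.normal(size=(d, 2))  # random d x 2 projection matrix
    X2 = X @ R                   # project the data down to two dimensions

    # Score outliers in the projected space and accumulate across projections
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X2)
    scores += -lof.negative_outlier_factor_  # more positive = more outlying

top = np.argsort(scores)[-10:]  # indices of the 10 most outlying records
```

Summing over many random projections averages out the distortion any single random subspace introduces.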

III. Overview of Outlier Detection Methods

As with many things, there is no silver bullet in outlier detection. A multitude of methods have been developed for detecting outliers over the years, including LOF, Z-score, Isolation Forest, extreme value analysis, box plots, autoencoders, ABOD, and many more [2], [10]. However, we use only three of these methods in this study. Many of the others cannot be used here because they do not fulfill certain requirements. For example, a detection method must return an outlier score so that our results are reproducible, and it must work in higher dimensional spaces on fully numeric data. The detection methods chosen for this study are LOF, Isolation Forest (note that Isolation Forest has a degree of randomness, so we set a universal seed of 3 for reproducible results), and ABOD, because they meet these criteria.

A. LOF

Local Outlier Factor (LOF) [6], [11] is an unsupervised, density-based outlier detection technique that can identify a wide variety of outliers. It works by calculating each datapoint's k_distance (the Euclidean distance between the point and its kth nearest neighbor). It then finds the k_neighbors (the closest neighbors whose distances to the point are at most k_distance). From there it calculates the Local Reachability Density (LRD) of the point, which is the inverse of the average reachability distance of the datapoint from its neighbors; the lower the value, the farther away the closest cluster is. Finally, the LOF score is calculated as the ratio of the average LRD of the k_neighbors to the LRD of the datapoint. Since we have the LOF score for each point in the dataset, we can rank the points by their degree of outlyingness and pick the top outliers (based on how many we want). Our datasets specifically state how many “true outliers” lie within them, and we use this number to select how many outliers to report.
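Ranking points by LOF score and taking a fixed number of top outliers can be sketched with scikit-learn's implementation; the planted-outlier data and neighbor count below are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(size=(100, 4))
outliers = rng.normal(loc=6.0, size=(5, 4))  # 5 planted far-away points
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
# negative_outlier_factor_ is negated LOF: flip sign so higher = more outlying
scores = -lof.negative_outlier_factor_

# Take the top-5 outliers, mirroring how each dataset's known outlier count is used
top5 = set(np.argsort(scores)[-5:])
```

Here `top5` should mostly recover the planted points at indices 100-104, since their neighborhoods are far less dense than those of the inliers.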

B. Isolation Forest

Isolation Forest [20] is an unsupervised outlier detection method that processes the data into decision trees (called isolation trees, or iTrees for short) and finds outliers by isolating data points and counting how many branches must be traversed (i.e., the depth in the tree) to reach them. If the branches are comparatively few, that data point is marked as an outlier, which allows Isolation Forest to score each data point based on how anomalous it is. The method is trained by taking a random subset of the data and putting it into an iTree, with branching done on a random threshold: the data branches left or right depending on whether it is smaller or larger than that threshold. This is repeated until every datapoint is in an iTree and isolated. From there the algorithm can count the branches and score the datapoints.
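The depth-based scoring, with a fixed seed as used in this study, can be sketched as follows (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 5)),
               rng.normal(loc=8.0, size=(4, 5))])  # 4 planted outliers

# random_state pins the algorithm's randomness, analogous to the fixed seed of 3
iso = IsolationForest(random_state=3)
iso.fit(X)

# score_samples returns higher values for normal points; negate so that
# higher = isolated in fewer branches = more anomalous
scores = -iso.score_samples(X)

top4 = np.argsort(scores)[-4:]  # the 4 most anomalous records
```

The planted points at indices 200-203 sit far from the bulk of the data, so random threshold splits isolate them after very few branches.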

C. ABOD

Angle Based Outlier Detection (ABOD) [17] is a technique that works well in high dimensional spaces, unlike some of its peers. It iterates over every datapoint, storing the angles between the vectors from that point to every pair of other datapoints, and then calculates the variance of the resulting list. The lower the variance for a given datapoint, the more of an outlier it is.
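The intuition is that a point far outside the data sees all other points within a narrow cone, so its angle spectrum has low variance. A simplified sketch follows; note it uses the variance of cosines over all pairs and omits the distance weighting of the original ABOD formulation:

```python
import numpy as np

def abod_scores(X):
    """Simplified angle-based outlier scores: for each point, the variance of
    cosines between the vectors to every pair of other points.
    Lower score = more outlying. (Distance weighting is omitted.)"""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]   # vectors to all other points
        unit = diffs / np.linalg.norm(diffs, axis=1)[:, None]
        cos = unit @ unit.T                       # pairwise cosines of angles
        iu = np.triu_indices(len(cos), k=1)       # each unordered pair once
        scores[i] = np.var(cos[iu])
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 3)), [[10.0, 10.0, 10.0]]])
s = abod_scores(X)
# The planted point (index 60) sees everything in a narrow cone -> lowest variance
```

This naive version is O(n^3); the original paper and practical implementations offer faster approximations.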

IV. Empirical Study

We use a number of real datasets taken from the ODDS repository, which specializes in providing datasets specifically for outlier detection [24]. Table I provides the details of the datasets used. For each dataset, the actual outliers that are supposed to be detected are marked.

TABLE I.

Dataset Glossary

Dataset Number of Outliers Number of Features Number of Rows Citation
Arrhythmia 66 274 452 [14], [19], [29]
Glass 9 9 214 [3], [14], [27]
Ionosphere 126 33 351 [14], [19], [29]
Letter 100 32 1600 [23], [25]
Lympho 6 18 148 [3], [18], [35]
Mnist 188 100 7603 [5]
Musk 97 166 3062 [3]
Optdigits 150 64 5216 [3]
Pendigits 156 16 6870 [14], [27]
Pima 12 8 768 [14], [19], [29]
Satellite 244 36 6435 [19], [29]
Satimage-2 71 36 5803 [3], [35]
Shuttle 183 9 49097 [1], [19], [22], [28], [29]
Speech 61 400 3686 [23]
Vertebral 30 6 240 [27]
Vowels 50 12 1456 [3], [27]
WBC 21 30 378 [3], [14], [35]
Wine 10 13 129 [27]

All experiments were carried out on an AMD Ryzen 7 5800 (8-Core, 36MB Total Cache, Max Boost Clock of 4.6GHz) with 32GB of RAM. The implementation was done in Python 3.10, primarily using the scikit-learn [7], scipy, pandas, and numpy libraries.

Since the study is about the effect of dimensionality reduction on outlier detection, the results and our observations are structured first by the dimensionality reduction technique.

A. PCA

As mentioned in Section IIA, after reducing the dimensions, it is possible to invert PCA giving a reconstructed dataset. We would also like to check if the distance between the original record and the reconstructed record is informative for outlier detection.

Therefore, we first evaluate the performance of outlier detection without utilizing this distance, and then we evaluate the performance when the distance is added as an extra dimension to the reduced dataset. Additionally, since no outlier detection technique is perfect, we would like to evaluate not only the stability of its performance (in terms of the outliers it detects) but also the effect on its performance in terms of the true outliers. We believe this to be one of the crucial contributions of this study.

1). No reconstruction:

We now separately discuss the results for the three outlier detection methods used in this study.

Table II presents the percentage of LOF outliers retained for each dataset in 4 cases: i) when the number of dimensions is reduced by 1; ii) when the number of dimensions is reduced to half; iii) when the number of dimensions is reduced to 3; and iv) when the number of dimensions is reduced to 2. Note that in each case we are merely interested in knowing what fraction of the outliers identified originally by LOF are still identified as the top outliers after dimensionality reduction. Therefore, when no dimensionality reduction is done, the baseline score is 100%. We observe that taking away one dimension, or even up to half, does not usually have much effect on the stability of LOF (i.e., how many originally identified outliers are still retained). Specifically, even when the data is reduced to half of the original dimensions, in 11 of the 18 datasets at least 75% of the originally identified outliers are still retained. However, when the number of dimensions is reduced to a level where the data can be visualized (i.e., to 3 or 2 dimensions), there is a significant degradation in stability. For example, when the data is reduced to 3 dimensions, more than 10% of the originally identified outliers are retained in only 11 out of the 18 datasets. In the case of 2 dimensions it is even worse: in only 6 out of the 18 datasets are more than 10% of the originally identified outliers retained. If we care about retaining 20% of the originally identified outliers, then this happens in only 5-6 cases out of 18 for either 2 or 3 dimensions. Unfortunately, no clear pattern can be observed as to why the performance for a particular dataset is better or worse.

TABLE II.

LOF and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 Dimensions
Arrhythmia 100 100 34.85 25.76
Glass 100 77.78 55.56 33.33
Ionosphere 97.62 88.89 52.38 43.65
Letter 96.0 75.0 18.0 9.0
Lympho 83.33 83.33 0.0 0.0
Mnist 100.0 87.77 1.60 3.19
Musk 100.0 91.75 10.31 8.25
Optdigits 98.67 76.67 6.0 2.0
Pendigits 97.44 64.1 17.31 2.56
Pima 100.0 58.33 16.67 8.33
Satellite 95.9 75.0 6.56 4.1
Satimage-2 91.55 69.01 1.41 2.82
Shuttle 86.34 42.62 1.64 0.55
Speech 96.72 54.1 1.64 1.64
Vertebral 100.0 50.0 46.67 30.0
Vowels 92.0 38.0 12.0 8.0
WBC 100.0 95.24 38.1 38.1
Wine 100.0 100.0 100.0 90.0
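The retained-outlier percentage reported in these tables is a top-k overlap between the rankings before and after reduction; a minimal sketch (the function name and toy scores are illustrative):

```python
import numpy as np

def retained_fraction(scores_orig, scores_reduced, k):
    """Percentage of the top-k outliers under the original scores that are
    still ranked among the top-k after dimensionality reduction.
    Higher score = more outlying in both rankings."""
    top_orig = set(np.argsort(scores_orig)[-k:])
    top_red = set(np.argsort(scores_reduced)[-k:])
    return 100.0 * len(top_orig & top_red) / k

# Toy check: an identical ranking retains 100% of the outliers
s = np.arange(10, dtype=float)
assert retained_fraction(s, s, 3) == 100.0
```

Here k is set to the known number of true outliers for each dataset, so 100% means dimensionality reduction changed nothing about which points top the ranking.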

Next, we examine the results in terms of identification of the true outliers. Specifically, Table III presents the percentage of true outliers retained after dimension reduction. One point to note is that even though LOF is widely accepted as a standard outlier detection technique its basic performance (without dimensionality reduction) leaves a lot to be desired.

TABLE III.

LOF and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 18.18 18.18 18.18 22.73
Glass 22.22 22.22 11.11 0.0
Ionosphere 74.6 71.43 49.21 44.44
Letter 41.0 38.0 17.0 7.0
Lympho 33.33 33.33 0.0 33.33
Mnist 20.74 18.62 15.43 10.64
Musk 11.34 13.4 3.09 12.37
Optdigits 5.33 4.67 6.0 1.33
Pendigits 5.77 5.13 5.77 5.13
Pima 25.0 41.67 33.33 41.67
Satellite 48.36 47.13 40.16 29.1
Satimage-2 9.86 9.86 1.41 2.82
Shuttle 14.21 12.57 21.86 33.88
Speech 13.11 9.84 3.28 0.0
Vertebral 3.33 16.67 20.0 6.67
Vowels 32.0 28.0 2.0 6.0
WBC 19.05 19.05 19.05 23.81
Wine 0.0 0.0 0.0 0.0

The interesting observation here is that LOF detects around the same number or more “true” outliers after dimension reduction in 8 of the 18 datasets, and five datasets have the most “true” outliers detected when reduced to 3 and 2 dimensions, respectively. This suggests that LOF can actually perform better in terms of detecting the true outliers once dimensionality reduction is performed.

Next we examine the performance of Isolation Forest. Table IV provides the stability results for Isolation Forest (i.e., how well the originally identified outliers are retained) when dimensionality reduction is done using PCA, while Table V presents the results in terms of identifying the true outliers. There are three main observations to be made. First, the stability results are quite erratic: in many cases there is an initial dip in the fraction of originally identified outliers retained as the number of dimensions is reduced to 2 or 3, but there are also many cases where this fraction increases. Second, and very interestingly, we observe that in 17 of the 18 datasets an equal or higher percentage of true outliers is discovered when the number of dimensions is reduced to 3 or 2. This clearly implies that the performance of Isolation Forest (in terms of identifying true outliers) significantly improves when the dimensionality of the data is reduced. Third, as a whole, after dimensionality reduction, Isolation Forest actually outperforms LOF in terms of finding true outliers in close to two thirds of the datasets.

TABLE IV.

Isolation Forest and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 63.64 62.12 33.33 18.18
Glass 33.33 33.33 33.33 22.22
Ionosphere 69.05 73.81 65.87 55.56
Letter 36.0 28.0 9.0 1.0
Lympho 0.0 33.33 33.33 16.67
Mnist 23.94 27.66 1.06 1.06
Musk 14.43 21.65 2.06 0.0
Optdigits 14.67 16.67 6.67 5.33
Pendigits 0.0 0.0 45.51 5.77
Pima 16.67 8.33 16.67 16.67
Satellite 11.48 11.89 32.38 38.93
Satimage-2 0.0 0.0 90.14 88.73
Shuttle 7.65 12.02 10.93 5.46
Speech 29.51 39.34 1.64 0.0
Vertebral 66.67 70.0 50.0 30.0
Vowels 20.0 36.0 0.0 0.0
WBC 23.81 42.86 47.62 42.86
Wine 30.0 30.0 10.0 0.0
TABLE V.

Isolation Forest and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 3.03 3.03 16.67 21.21
Glass 11.11 11.11 0.0 11.11
Ionosphere 5.56 8.73 18.25 25.4
Letter 1.0 2.0 4.0 3.0
Lympho 0.0 33.33 33.33 16.67
Mnist 0.0 0.0 19.68 33.51
Musk 0.0 0.0 100.0 63.92
Optdigits 2.0 0.0 0.0 0.0
Pendigits 0.0 0.0 60.9 33.97
Pima 41.67 50.0 58.33 58.33
Satellite 27.05 36.89 50.0 90.57
Satimage-2 0.0 0.0 88.73 87.32
Shuttle 20.22 24.04 24.59 60.11
Speech 1.64 3.28 3.28 0.0
Vertebral 16.67 16.67 20.0 13.33
Vowels 0.0 0.0 2.0 2.0
WBC 0.0 0.0 33.33 38.1
Wine 0.0 0.0 10.0 30.0

Finally, we study the performance of ABOD with PCA. Table VI provides the results corresponding to detection of the originally identified outliers (i.e., stability), while Table VII provides the results for the detection of the true outliers. The key observation here is that for 9 out of the 18 datasets an equal or higher percentage of true outliers is discovered when the data is reduced to 2 or 3 dimensions. However, the general performance of ABOD is worse than that of both LOF and Isolation Forest in terms of both stability and identification of true outliers. One reason for this is that ABOD was specifically designed for high dimensional data, so dimensionality reduction is in any case antithetical to its purpose.

TABLE VI.

ABOD and PCA

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 100.0 34.85 27.27
Glass 100.0 44.44 11.11 0.0
Ionosphere 99.21 93.65 61.11 50.0
Letter 96.0 75.0 21.0 21.0
Lympho 83.33 33.33 16.67 0.0
Mnist 100.0 92.55 3.19 2.66
Musk 100.0 96.91 76.29 44.33
Optdigits 99.33 78.67 8.0 1.33
Pendigits 97.44 66.03 16.67 7.69
Pima 100.0 16.67 8.33 0.0
Satellite 89.34 36.48 9.02 5.74
Satimage-2 81.69 28.17 1.41 2.82
Shuttle 8.74 0.0 0.0 1.09
Speech 100.0 57.38 3.28 3.28
Vertebral 100.0 70.0 66.67 40.0
Vowels 94.0 66.0 4.0 0.0
WBC 100.0 100.0 14.29 19.05
Wine 100.0 90.0 40.0 0.0
TABLE VII.

ABOD and PCA (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 4.55 9.09 13.64
Glass 0.0 0.0 0.0 0.0
Ionosphere 6.35 7.14 18.25 24.6
Letter 0.0 0.0 1.0 4.0
Lympho 0.0 0.0 0.0 0.0
Mnist 0.53 0.0 4.26 9.04
Musk 90.72 92.78 74.23 43.3
Optdigits 2.67 5.33 4.0 5.33
Pendigits 0.0 0.0 0.0 0.64
Pima 16.67 16.67 16.67 50.0
Satellite 28.69 22.13 11.48 13.52
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.0 0.0 0.0 0.0
Vertebral 23.33 20.0 30.0 26.67
Vowels 0.0 0.0 2.0 2.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 0.0 0.0

2). Reconstructed from PCA:

We now study the performance of all of the outlier detection methods when we add the extra feature containing the distance between the reconstructed datapoint and the original. Note that in this case, reducing to 3 dimensions actually means reducing to 2 dimensions and adding the extra dimension containing the distance. Similarly, reducing to 2 dimensions implies actually reducing to a single dimension and then adding the distance between the reconstructed data point and the original.

Table VIII provides the stability results pertaining to LOF and PCA with reconstruction. Table IX provides the results pertaining to LOF and PCA in terms of identifying the true outliers with reconstruction. We see that stability generally suffers as the dimensions are significantly reduced, whereas there is still improvement in the detection of true outliers for some datasets. Table X and Table XI provide the corresponding results for Isolation Forest and PCA when reconstruction is done, while Table XII and Table XIII give the results pertaining to ABOD and PCA with reconstruction. We discuss the comparative performance of these later in more detail.

TABLE VIII.

LOF and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 99.0 42.4 27.0
Glass 100.0 87.89 66.7 22.0
Ionosphere 98.41 77.57 38.1 41.0
Letter 97.0 76.0 24.0 11.0
Lympho 83.33 99.0 33.3 17.0
Mnist 100.0 91.02 4.8 4.0
Musk 100.0 89.72 9.3 6.0
Optdigits 98.67 77.67 11.3 4.0
Pendigits 98.08 67.59 16.7 12.0
Pima 100.0 74.0 25.0 17.0
Satellite 96.31 73.59 9.4 4.0
Satimage-2 95.77 69.42 1.4 3.0
Shuttle 89.07 71.68 4.4 3.0
Speech 98.36 59.66 4.9 2.0
Vertebral 100.0 65.67 63.3 50.0
Vowels 94.0 55.0 8.0 12.0
WBC 100.0 94.24 52.4 33.0
Wine 100.0 99.0 100.0 90.0
TABLE IX.

LOF and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 18.18 17.18 19.7 24.0
Glass 22.22 21.22 22.2 11.0
Ionosphere 73.02 63.29 33.3 40.0
Letter 42.0 38.0 21.0 10.0
Lympho 33.33 32.33 50.0 50.0
Mnist 20.74 18.15 15.4 16.0
Musk 11.34 13.43 4.1 5.0
Optdigits 5.33 3.67 5.3 2.0
Pendigits 5.13 3.49 5.8 7.0
Pima 25.0 24.0 50.0 33.0
Satellite 47.95 44.49 29.1 42.0
Satimage-2 9.86 8.86 0.0 6.0
Shuttle 14.75 14.85 2.7 36.0
Speech 13.11 13.75 8.2 3.0
Vertebral 3.33 15.67 10.0 7.0
Vowels 32.0 29.0 12.0 10.0
WBC 19.05 18.05 19.0 19.0
Wine 0.0 0.0 0.0 0.0
TABLE X.

Isolation Forest and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 65.15 62.12 16.7 17.0
Glass 44.44 11.11 22.2 33.0
Ionosphere 71.43 74.6 71.4 67.0
Letter 23.0 24.0 23.0 18.0
Lympho 0.0 0.0 66.7 50.0
Mnist 32.45 19.68 14.4 15.0
Musk 16.49 23.71 2.1 2.0
Optdigits 19.33 12.0 11.3 13.0
Pendigits 2.56 3.21 34.0 59.0
Pima 8.33 16.67 16.7 17.0
Satellite 12.7 9.84 17.2 32.0
Satimage-2 1.41 11.27 90.1 92.0
Shuttle 3.83 10.38 38.8 37.0
Speech 22.95 36.07 0.0 5.0
Vertebral 63.33 53.33 46.7 33.0
Vowels 26.0 36.0 2.0 0.0
WBC 19.05 42.86 38.1 33.0
Wine 30.0 20.0 20.0 30.0
TABLE XI.

Isolation Forest and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 3.03 4.55 15.2 23.0
Glass 0.0 11.11 0.0 11.0
Ionosphere 6.35 8.73 13.5 22.0
Letter 0.0 1.0 7.0 5.0
Lympho 0.0 0.0 66.7 50.0
Mnist 0.0 0.0 17.0 32.0
Musk 0.0 0.0 100.0 100.0
Optdigits 0.0 0.0 0.0 1.0
Pendigits 0.64 1.92 9.6 35.0
Pima 16.67 33.33 50.0 75.0
Satellite 25.82 39.75 61.5 89.0
Satimage-2 1.41 11.27 91.5 90.0
Shuttle 10.38 32.24 71.0 82.0
Speech 3.28 6.56 0.0 2.0
Vertebral 6.67 16.67 30.0 17.0
Vowels 2.0 0.0 18.0 18.0
WBC 0.0 4.76 28.6 33.0
Wine 0.0 0.0 0.0 10.0
TABLE XII.

ABOD and PCA (Reconstructed)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 100.0 100.0 42.4 30.0
Glass 100.0 44.44 44.4 11.0
Ionosphere 100.0 92.86 68.3 66.0
Letter 95.0 78.0 35.0 29.0
Lympho 83.33 33.33 0.0 0.0
Mnist 100.0 95.74 5.3 3.0
Musk 100.0 96.91 85.6 76.0
Optdigits 99.33 79.33 8.0 4.0
Pendigits 98.08 71.15 12.2 4.0
Pima 100.0 41.67 16.7 17.0
Satellite 89.34 36.07 13.9 11.0
Satimage-2 81.69 26.76 1.4 3.0
Shuttle 50.82 1.64 0.5 0.0
Speech 100.0 65.57 6.6 7.0
Vertebral 100.0 76.67 60.0 50.0
Vowels 100.0 78.0 6.0 10.0
WBC 100.0 100.0 9.5 14.0
Wine 100.0 90.0 60.0 30.0
TABLE XIII.

ABOD and PCA (Reconstructed, True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 4.55 4.5 3.0
Glass 0.0 0.0 0.0 0.0
Ionosphere 6.35 7.14 9.5 10.0
Letter 0.0 0.0 2.0 3.0
Lympho 0.0 0.0 0.0 0.0
Mnist 0.53 0.0 1.1 5.0
Musk 90.72 92.78 84.5 76.0
Optdigits 2.67 5.33 3.3 4.0
Pendigits 0.0 0.0 0.0 0.0
Pima 16.67 25.0 16.7 58.0
Satellite 28.28 24.18 8.6 8.0
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.55 0.0 2.2 10.0
Speech 1.64 0.0 0.0 0.0
Vertebral 23.33 20.0 33.3 30.0
Vowels 0.0 0.0 0.0 0.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 0.0 0.0

B. UMAP

Tables XIV and XV provide the results pertaining to LOF and UMAP. As before, for 13 of the 18 datasets an equal or higher percentage of true outliers is discovered when the data is reduced to 3 or 2 dimensions.

TABLE XIV.

LOF and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 12.12 22.73 15.15 15.15
Glass 22.22 11.11 11.11 11.11
Ionosphere 42.06 42.06 46.83 40.48
Letter 27.0 21.0 25.0 22.0
Lympho 33.33 33.33 16.67 16.67
Mnist 4.79 5.85 2.13 3.72
Musk 9.28 7.22 12.37 8.25
Optdigits 7.33 6.67 4.67 5.33
Pendigits 11.54 8.97 10.9 6.41
Pima 0.0 8.33 0.0 0.0
Satellite 13.52 13.52 11.48 8.2
Satimage-2 4.23 4.23 4.23 1.41
Shuttle 4.92 4.92 3.28 2.73
Speech 0.0 3.28 0.0 0.0
Vertebral 6.67 13.33 16.67 16.67
Vowels 8.0 8.0 14.0 8.0
WBC 4.76 14.29 0.0 9.52
Wine 10.0 20.0 20.0 20.0

TABLE XV.

LOF and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 10.61 6.06 15.15 19.7
Glass 0.0 0.0 0.0 11.11
Ionosphere 39.68 39.68 38.1 41.27
Letter 16.0 14.0 19.0 17.0
Lympho 0.0 0.0 0.0 0.0
Mnist 17.02 17.55 25.0 22.87
Musk 9.28 5.15 7.22 12.37
Optdigits 1.33 4.0 1.33 0.67
Pendigits 2.56 3.85 3.21 1.92
Pima 41.67 33.33 50.0 33.33
Satellite 45.9 43.03 49.18 45.49
Satimage-2 5.63 8.45 7.04 5.63
Shuttle 14.21 14.75 16.39 14.75
Speech 8.2 6.56 9.84 4.92
Vertebral 13.33 10.0 10.0 13.33
Vowels 2.0 4.0 6.0 2.0
WBC 14.29 0.0 14.29 9.52
Wine 20.0 0.0 10.0 0.0

Tables XVI and XVII give the results pertaining to Isolation Forest and UMAP. As earlier, we again observe that 7 of the 18 datasets have an equal or higher percentage of true outliers discovered when the data is reduced to 3 or 2 dimensions. Tables XVIII and XIX give the results pertaining to ABOD and UMAP. Here, 11 of the 18 datasets have an equal or higher percentage of true outliers discovered when the data is reduced to 3 or 2 dimensions.

TABLE XVI.

Isolation Forest and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 43.94 48.48 33.33 19.7
Glass 22.22 0.0 11.11 0.0
Ionosphere 51.59 58.73 55.56 51.59
Letter 10.0 7.0 14.0 10.0
Lympho 0.0 16.67 0.0 0.0
Mnist 8.51 7.45 2.66 10.11
Musk 14.43 4.12 1.03 2.06
Optdigits 6.0 4.0 6.0 10.67
Pendigits 10.9 18.59 22.44 1.92
Pima 8.33 25.0 0.0 0.0
Satellite 23.36 15.16 4.92 0.0
Satimage-2 8.45 5.63 1.41 0.0
Shuttle 3.28 0.55 20.77 0.0
Speech 4.92 3.28 0.0 1.64
Vertebral 30.0 33.33 33.33 40.0
Vowels 20.0 8.0 18.0 8.0
WBC 47.62 33.33 28.57 14.29
Wine 0.0 30.0 0.0 0.0

TABLE XVII.

Isolation Forest and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 4.55 1.52 9.09 13.64
Glass 0.0 0.0 0.0 11.11
Ionosphere 48.41 36.51 37.3 49.21
Letter 17.0 20.0 11.0 14.0
Lympho 16.67 0.0 0.0 0.0
Mnist 4.79 1.6 0.0 0.53
Musk 0.0 0.0 0.0 0.0
Optdigits 0.0 0.0 0.0 0.0
Pendigits 0.0 0.0 0.0 0.0
Pima 25.0 41.67 58.33 25.0
Satellite 44.26 70.08 34.84 2.46
Satimage-2 0.0 0.0 0.0 0.0
Shuttle 0.0 16.39 0.0 0.0
Speech 0.0 0.0 0.0 1.64
Vertebral 6.67 3.33 3.33 0.0
Vowels 0.0 0.0 4.0 2.0
WBC 0.0 0.0 0.0 0.0
Wine 0.0 0.0 20.0 0.0

TABLE XIX.

ABOD and UMAP (True Outliers)

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 33.33 19.7 28.79 31.82
Glass 0.0 0.0 0.0 22.22
Ionosphere 13.49 11.9 23.02 34.13
Letter 5.0 2.0 3.0 5.0
Lympho 16.67 0.0 0.0 0.0
Mnist 71.28 64.89 55.32 25.0
Musk 23.71 21.65 28.87 22.68
Optdigits 4.0 6.67 4.0 12.0
Pendigits 1.92 3.21 19.23 5.13
Pima 58.33 66.67 50.0 58.33
Satellite 55.74 56.97 52.87 41.39
Satimage-2 28.17 26.76 43.66 25.35
Shuttle 55.74 51.91 52.46 30.6
Speech 0.0 0.0 0.0 0.0
Vertebral 13.33 13.33 23.33 16.67
Vowels 10.0 6.0 10.0 6.0
WBC 23.81 23.81 28.57 33.33
Wine 20.0 10.0 0.0 40.0

C. Random Projection

Finally, for Random Projection, we only project the data down to two dimensions. In this case we compare the performance of all three outlier detection methods in terms of both stability and detection of true outliers. The results are shown in Tables XX and XXI. We observe that LOF outperforms the other two methods when Random Projection is used.
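This setup can be sketched on synthetic data as follows (hypothetical parameters; `GaussianRandomProjection` from scikit-learn is one common implementation of random projection): project straight down to two dimensions, then rank points by their LOF and Isolation Forest anomaly scores in the projected space.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.random_projection import GaussianRandomProjection

# Synthetic data: 300 inliers and 15 planted outliers in 20 dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (300, 20)),
               rng.normal(5.0, 1.0, (15, 20))])

# Project directly down to 2 dimensions with a random Gaussian matrix.
X2 = GaussianRandomProjection(n_components=2, random_state=42).fit_transform(X)

# Rank points in the projected space with two of the three detectors.
n_flag = 15
lof = LocalOutlierFactor(n_neighbors=20).fit(X2)
# negative_outlier_factor_ is more negative for more outlying points
lof_top = set(np.argsort(lof.negative_outlier_factor_)[:n_flag])

iso = IsolationForest(random_state=42).fit(X2)
iso_top = set(np.argsort(iso.score_samples(X2))[:n_flag])  # lowest = most anomalous
```

ABOD is not available in scikit-learn; implementations exist in, e.g., the pyod library, and would slot into the same ranking step.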

TABLE XX.

Random Projection

Dataset LOF Isolation Forest ABOD
Arrhythmia 39.39 36.36 22.73
Glass 55.56 44.44 22.22
Ionosphere 64.29 39.68 59.52
Letter 11.0 1.0 4.0
Lympho 0.0 16.67 0.0
Mnist 6.91 2.13 3.72
Musk 7.22 0.0 51.55
Optdigits 4.67 0.67 2.67
Pendigits 7.69 0.0 1.92
Pima 25.0 16.67 8.33
Satellite 10.66 1.64 6.97
Satimage-2 8.45 2.82 0.0
Shuttle 9.84 16.94 0.55
Speech 0.0 0.0 0.0
Vertebral 56.67 36.67 20.0
Vowels 16.0 2.0 8.0
WBC 42.86 42.86 19.05
Wine 40.0 0.0 40.0

TABLE XXI.

Random Projection (True Outliers)

Dataset LOF Isolation Forest ABOD
Arrhythmia 40.91 18.18 10.61
Glass 22.22 11.11 0.0
Ionosphere 61.11 30.95 16.67
Letter 11.0 6.0 7.0
Lympho 16.67 16.67 16.67
Mnist 12.77 20.21 5.85
Musk 13.4 8.25 50.52
Optdigits 0.67 4.0 4.67
Pendigits 1.28 0.0 0.64
Pima 66.67 75.0 25.0
Satellite 50.41 53.28 15.98
Satimage-2 7.04 0.0 2.82
Shuttle 26.78 46.45 0.0
Speech 0.0 0.0 3.28
Vertebral 13.33 10.0 16.67
Vowels 16.0 36.0 0.0
WBC 9.52 42.86 0.0
Wine 0.0 70.0 0.0

D. Summary of Observations

Disregarding minor fluctuations, the behavior we briefly touched upon earlier, namely that an equal or higher percentage of true outliers is often detected when the data is reduced to two or three dimensions, is fairly consistent across all the dimensionality reduction techniques. While some standout datasets retained this quality across most of the methods (Arrhythmia, Lympho, Shuttle, and WBC, to name a few), the affected datasets mostly varied between reduction techniques. Isolation Forest with PCA (see Table V) was the best at bringing out this quality, with 17 of the 18 datasets (around 94 percent) displaying it. One reason this tends to happen is that outlier detection techniques are in general vulnerable to the “curse of dimensionality” [2], and dimensionality reduction therefore helps focus attention on the right data subspace.
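The “curse of dimensionality” effect can be seen in a small self-contained experiment on synthetic uniform data (a standard distance-concentration measure, not taken from the paper): as dimensionality grows, the relative gap between a query point's nearest and farthest neighbors shrinks, which is what degrades distance-based outlier scores.

```python
import numpy as np

# Distance concentration: the relative contrast between the farthest and
# nearest neighbor distances shrinks sharply as dimensionality d grows.
rng = np.random.default_rng(1)
contrast = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))   # 500 uniform random points
    q = rng.uniform(size=d)          # one query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast[d] = (dist.max() - dist.min()) / dist.min()
```

In low dimensions the contrast is large (some points are far closer to the query than others); in very high dimensions all distances concentrate around the same value, so "near" and "far" lose their discriminative power.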

In order to more clearly see the difference between UMAP and PCA (in both 2 and 3 dimensions), we present scatterplots for the Lymphography Dataset as an exemplar.

Observing the four scatterplots (Figures 1–4) for the Lymphography dataset, there is a clear difference between PCA and UMAP. Even though UMAP keeps the general location of the outliers fairly similar to that in the PCA scatterplot, its embedding is much more clustered, which led to Isolation Forest making a couple of mistakes while trying to find the outliers. PCA’s more systematic approach is very apparent when comparing its 2D and 3D plots: the two are almost identical in their global structure. There are differences, of course, but PCA has managed to keep the global structure almost unchanged from 3D to 2D. UMAP, on the other hand, is also impressive in its ability to keep a similar shape; however, the points are scattered somewhat differently, especially in the middle of the structure and at the bottom-right end.

Fig. 1.

Isolation Forest PCA 3D for Lymphography (Labels stand for “Found True Outlier”, “Not Found True Outlier”, “Found Previously Calculated Outlier”, “Not Found Previously Calculated Outlier” and “Regular Point”)

Fig. 4.

Isolation Forest UMAP 3D for Lymphography

As discussed before, another interesting effect we observed was the inverse method and its relationship with true outliers. Appending an extra dimension of Euclidean distances between the original dataset and the reconstructed dataset seems to magnify the number of true outliers. We investigated whether the reconstructed tables (IX, XI, and XIII) yielded more true outliers in their two- and three-dimensional columns than their counterparts (III, V, and VII), with interesting results. Angle-Based Outlier Detection (ABOD) fared the worst, with the number of true outliers detected in 3 and 2 dimensions increasing for only 4 of the 18 datasets (with 6 unchanged). Isolation Forest was in the middle, increasing for 9 of the 18 datasets (with 1 unchanged). Local Outlier Factor (LOF) fared the best, increasing for 12 of the 18 datasets (with 1 unchanged). These results, however, do not tell us whether this method increases the number of true outliers detected in 2 and 3 dimensions compared to higher ones. To answer that question, we created a bar chart comparing reconstructed LOF true outliers with normal LOF true outliers, and labeled whether the quality (an increase in true outliers when reduced) was brought out in each dataset (see Figure 5).
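Our reading of this reconstruction-based variant can be sketched as follows (synthetic data, hypothetical dimensions): reduce with PCA, map back to the original space with `inverse_transform`, compute each row's Euclidean reconstruction error, and append it as an extra feature of the low-dimensional representation before running the detector.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 inliers and 8 planted outliers in 8 dimensions.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)),
               rng.normal(4.0, 1.0, (8, 8))])

# Reduce, then map back to the original space and measure the per-row
# Euclidean distance between each point and its reconstruction.
pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)
recon_err = np.linalg.norm(X - pca.inverse_transform(Z), axis=1)

# Augmented representation: reduced coordinates plus the error column.
X_aug = np.column_stack([Z, recon_err])
```

Any of the three detectors can then be run on `X_aug` instead of `Z`; points that are poorly captured by the retained components carry a large error coordinate, which is what appears to magnify the true outliers.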

Fig. 5.

Comparison of LOF PCA with simply reduced data vs. reconstructed data. (The prefix “Re” denotes reconstructed datasets. The datasets are labeled 2D or 3D, depending on which dimension had more true outliers detected; they can also be labeled B (both) or N (neither). The prefix Q is used if that specific dataset had an increase in true outliers in lower dimensions compared to higher ones.)

Overall, the quality was gained in 1 instance and lost in 3; in all three of those, the total percentage of true outliers was also reduced. While this method does not seem great at inducing the quality, it increases the number of correctly detected true outliers more often than not. The chances of losing outliers are much lower than the chances of gaining them, and the pros generally seem to outweigh the cons: the reconstructed datasets were the only ones in the entire graph to ever reach 50 percent detection, and they did so twice.

V. Conclusion and Future Work

In this work we have studied the relationship between dimensionality reduction and outlier detection. In particular, we empirically evaluated the performance of several standard outlier detection techniques when the data is reduced using several common dimensionality reduction techniques, over a diverse set of datasets. Our results show that while the performance of the outlier detection techniques may degrade in terms of their stability, their ability to find the true outliers often improves in lower-dimensional spaces. This is a very surprising observation. Our study, however, is restricted to numeric data and is completely empirical. In the future, we plan to study categorical and mixed data, as well as further explore this problem from a theoretical perspective. Note that it is also possible to use state-of-the-art outlier detection techniques to identify outliers and then use dimensionality reduction to visualize these outliers and explain why they were identified as outliers. We plan to examine different visualization techniques for this in the future as well.

Fig. 2.

LOF PCA for Lymphography

Fig. 3.

Isolation Forest UMAP for Lymphography

TABLE XVIII.

ABOD and UMAP

Dataset 1 Dimension Removed Reduced to half of original dimensions Reduced to 3 Dimensions Reduced to 2 dimensions
Arrhythmia 9.09 9.09 4.55 10.61
Glass 0.0 0.0 0.0 0.0
Ionosphere 53.97 60.32 52.38 48.41
Letter 14.0 22.0 14.0 18.0
Lympho 0.0 16.67 0.0 0.0
Mnist 1.06 2.13 2.13 3.19
Musk 22.68 20.62 26.8 21.65
Optdigits 2.67 3.33 2.0 4.67
Pendigits 0.64 0.64 1.28 0.64
Pima 0.0 0.0 0.0 0.0
Satellite 1.64 0.82 2.05 2.87
Satimage-2 0.0 0.0 0.0 1.41
Shuttle 0.0 0.0 0.0 0.0
Speech 9.84 9.84 9.84 6.56
Vertebral 6.67 10.0 20.0 23.33
Vowels 10.0 10.0 10.0 2.0
WBC 0.0 4.76 9.52 9.52
Wine 20.0 20.0 10.0 0.0

Acknowledgments

Research reported in this publication was supported by the National Institutes of Health under award R35GM134927. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research.

Contributor Information

Vivek Vaidya, East Brunswick High School, East Brunswick, NJ, USA.

Jaideep Vaidya, MSIS Department, Rutgers University, Newark, NJ, USA.

References

  • [1]. Abe N, Zadrozny B, and Langford J, “Outlier detection by active learning,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 504–509.
  • [2]. Aggarwal CC, “An introduction to outlier analysis,” in Outlier Analysis. Springer, 2017, pp. 1–34.
  • [3]. Aggarwal CC and Sathe S, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015.
  • [4]. Avola D, Cannistraci I, Cascio M, Cinque L, Diko A, Fagioli A, Foresti GL, Lanzino R, Mancini M, Mecca A et al., “A novel GAN-based anomaly detection and localization method for aerial video surveillance at low altitude,” Remote Sensing, vol. 14, no. 16, p. 4110, 2022.
  • [5]. Bandaragoda TR, Ting KM, Albrecht D, Liu FT, and Wells JR, “Efficient anomaly detection by isolation using nearest neighbour ensemble,” in 2014 IEEE International Conference on Data Mining Workshop. IEEE, 2014, pp. 698–705.
  • [6]. Breunig MM, Kriegel H-P, Ng RT, and Sander J, “LOF: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
  • [7]. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, VanderPlas J, Joly A, Holt B, and Varoquaux G, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.
  • [8]. Capelleveen GC, “Outlier based predictors for health insurance fraud detection within US Medicaid,” Master’s thesis, University of Twente, 2013.
  • [9]. Caroline Cynthia P and Thomas George S, “An outlier detection approach on credit card fraud detection using machine learning: a comparative analysis on supervised and unsupervised learning,” in Intelligence in Big Data Technologies—Beyond the Hype. Springer, 2021, pp. 125–135.
  • [10]. Chandola V, Banerjee A, and Kumar V, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
  • [11]. Chepenko D, “A density-based algorithm for outlier detection.” [Online]. Available: https://github.com/zkid18/Machine-Learning-Algorithms/blob/master/Outlier_Detection/LOF-algorithm.ipynb
  • [12]. Dokas P, Ertoz L, Kumar V, Lazarevic A, Srivastava J, and Tan P-N, “Data mining for network intrusion detection,” in Proc. NSF Workshop on Next Generation Data Mining. Citeseer, 2002, pp. 21–30.
  • [13]. Pearson K, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
  • [14]. Keller F, Muller E, and Bohm K, “HiCS: High contrast subspaces for density-based outlier ranking,” in 2012 IEEE 28th International Conference on Data Engineering. IEEE, 2012, pp. 1037–1048.
  • [15]. Kim T, Adhikaree A, Pandey R, Kang D, Kim M, Oh C-Y, and Back J, “Outlier mining-based fault diagnosis for multicell lithium-ion batteries using a low-priced microcontroller,” in 2018 IEEE Applied Power Electronics Conference and Exposition (APEC). IEEE, 2018, pp. 3365–3369.
  • [16]. Konijn RM and Kowalczyk W, “Finding fraud in health insurance data with two-layer outlier detection approach,” in International Conference on Data Warehousing and Knowledge Discovery. Springer, 2011, pp. 394–405.
  • [17]. Kriegel H-P, Schubert M, and Zimek A, “Angle-based outlier detection in high-dimensional data,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 444–452.
  • [18]. Lazarevic A and Kumar V, “Feature bagging for outlier detection,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 157–166.
  • [19]. Liu FT, Ting KM, and Zhou Z-H, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
  • [20]. ——, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
  • [21]. Liu J and Zio E, “KNN-FSVM for fault detection in high-speed trains,” in 2018 IEEE International Conference on Prognostics and Health Management (ICPHM). IEEE, 2018, pp. 1–7.
  • [22]. Liu J, Yuan L, and Ye J, “An efficient algorithm for a class of fused lasso problems,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 323–332.
  • [23]. Micenková B, McWilliams B, and Assent I, “Learning outlier ensembles: The best of both worlds–supervised and unsupervised,” in Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2), New York, NY, USA. Citeseer, 2014, pp. 51–54.
  • [24]. Rayana S, “ODDS library,” 2016. [Online]. Available: http://odds.cs.stonybrook.edu
  • [25]. Rayana S and Akoglu L, “Less is more: Building selective anomaly ensembles,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 4, pp. 1–33, 2016.
  • [26]. Sainburg T, McInnes L, and Gentner TQ, “Parametric UMAP embeddings for representation and semisupervised learning,” Neural Computation, vol. 33, no. 11, pp. 2881–2907, 2021.
  • [27]. Sathe S and Aggarwal C, “LODES: Local density meets spectral outlier detection,” in Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 2016, pp. 171–179.
  • [28]. Tan SC, Ting KM, and Liu TF, “Fast anomaly detection for streaming data,” in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
  • [29]. Ting K, Tan S, and Liu F, “Mass: A new ranking measure for anomaly detection,” Gippsland School of Information Technology, Monash University, 2009.
  • [30]. Van der Maaten L and Hinton G, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [31]. Vidmar G and Blagus R, “Outlier detection for healthcare quality monitoring–a comparison of four approaches to over-dispersed proportions,” Quality and Reliability Engineering International, vol. 30, no. 3, pp. 347–362, 2014.
  • [32]. Ye Q and Zhi W, “Outlier detection in the framework of dimensionality reduction,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 29, no. 04, p. 1550017, 2015.
  • [33]. Zhang J and Zulkernine M, “Anomaly based network intrusion detection with unsupervised outlier detection,” in 2006 IEEE International Conference on Communications, vol. 5. IEEE, 2006, pp. 2388–2393.
  • [34]. Zhang J, Zulkernine M, and Haque A, “Random-forests-based network intrusion detection systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, pp. 649–659, 2008.
  • [35]. Zimek A, Gaudet M, Campello RJ, and Sander J, “Subsampling for efficient and effective unsupervised outlier detection ensembles,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 428–436.
