Compressed kNN: K-Nearest Neighbors with Data Compression

Jaime Salvador–Meneses; Zoila Ruiz–Chavez; Jose Garcia–Rodriguez

2019 Feb 28;21(3):234. doi: 10.3390/e21030234

PMCID: PMC7514715 PMID: 33266949

Abstract

The kNN (k-nearest neighbors) classification algorithm is one of the most widely used non-parametric classification methods; however, its memory consumption grows with the size of the dataset, which makes it impractical to apply to large volumes of data. Variations of this method have been proposed, such as condensed kNN, which divides the training dataset into clusters to be classified; other variations reduce the input dataset before applying the algorithm. This paper presents a variation of the kNN algorithm, of the structure-less NN type, that works with categorical data. Categorical data, due to their nature, can be compressed in order to decrease the memory required when executing the classification. The method adds a preliminary phase that compresses the data and then applies the algorithm to the compressed data. This allows the whole dataset to be kept in memory while considerably reducing the amount of memory required. Experiments and tests carried out on well-known datasets show a reduction in the volume of information stored in memory while maintaining classification accuracy. They also show a slight increase in processing time because the information is decompressed in real time (on-the-fly) while the algorithm is running.

Keywords: classification, KNN, compression, categorical data, feature pre-processing

1. Introduction

Discrete data compression is an interesting problem, especially when the compressed data must retain the characteristics of the original data [1]. Most state-of-the-art classification methods require a large amount of memory and time, making them infeasible for some practical real-world applications [2].

In many datasets, the number of attributes (also called the dimension) is large, and many algorithms do not work well with high-dimensional datasets because they require all information to be stored in memory prior to processing. Nowadays, it is a challenge to process high-dimensional datasets such as the censuses carried out in different countries [3]. A census is a particularly relevant process and a vital source of information for a country [4]. The predominant characteristic of this type of information is that most of the data are categorical.

The problem of assigning a class to the elements of a dataset is a basic task in data analysis and pattern recognition; it consists of labeling an observation from a set of known variables [5].

Supervised learning is a part of machine learning (ML) that tries to model the behavior of a system. Supervised models are created from observations that consist of a set of input and output data. A supervised model describes the function that associates inputs with outputs [6].

In many cases, k-nearest neighbors (kNN) is a simple and effective classification method [7]. However, it presents two major problems when it comes to implementation: (1) it is a lazy learning method and (2) it depends on the selection of the value of k [8]. Another limitation of this method is its high memory consumption, which limits its application [9].

In this work, a new method is proposed to classify information using the kNN algorithm on a compressed dataset. The method compresses observations into packets of a fixed number of bits; each packet stores a certain number of attributes, compressed through bit-level operations. This avoids having to reduce the size of the dataset [9,10] in order to work around the memory problem.

An interesting feature of the proposed method is that the information can be decompressed observation by observation in real time (on-the-fly), without the need to decompress the whole dataset and load it into memory.
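The packing idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bits_needed`, `pack`, `unpack_one`), the 32-bit word size, and the sample observation are assumptions chosen for illustration. It stores several small categorical codes in one integer word and reads a single code back without unpacking the rest.

```python
def bits_needed(n_categories: int) -> int:
    """Bits required to store codes 0 .. n_categories - 1 (at least 1)."""
    return max(1, (n_categories - 1).bit_length())


def pack(values, bits: int, word_bits: int = 32):
    """Pack small non-negative integer codes into fixed-width words."""
    per_word = word_bits // bits          # codes that fit in one word
    mask = (1 << bits) - 1                # largest code that fits in `bits`
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            if not 0 <= v <= mask:
                raise ValueError(f"value {v} does not fit in {bits} bits")
            word |= v << (slot * bits)    # place the code in its slot
        words.append(word)
    return words


def unpack_one(words, index: int, bits: int, word_bits: int = 32) -> int:
    """Read a single code back without unpacking anything else."""
    per_word = word_bits // bits
    word = words[index // per_word]
    return (word >> ((index % per_word) * bits)) & ((1 << bits) - 1)


# Hypothetical observation: nine attributes, each coded as 0..9.
observation = [4, 0, 9, 9, 2, 7, 1, 3, 5]
bits = bits_needed(10)                    # ten categories need 4 bits
packed = pack(observation, bits)          # nine 4-bit codes in two 32-bit words
restored = [unpack_one(packed, i, bits) for i in range(len(observation))]
```

Because `unpack_one` touches only one word, a classifier can read attribute values while scanning the dataset and never needs a fully decompressed copy in memory.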

As an application of the compression mechanism, this work proposes an implementation of the kNN algorithm that works with compressed data; we call this method the “Compressed kNN algorithm”.

The rest of this document is organized as follows: Section 2 gives a brief introduction to data classification techniques, focusing on the kNN algorithm; Section 3 describes the datasets used in this work; Section 4 presents a variation of the algorithm for working with compressed data; Section 5 presents results obtained by executing the proposed algorithm; and finally, Section 6 presents conclusions and future work.

2. Background

This section describes, in general, the process of data classification focusing on the kNN method (the algorithm is also presented). In addition, some compression/encoding techniques are described, as well as the metrics used for categorical, numerical or mixed information.

2.1. KNN

The kNN algorithm belongs to the family of methods known as instance-based methods. These methods are based on the principle that observations (instances) within a dataset are usually placed close to other observations that have similar attributes [11].

Given an observation whose class is to be predicted, this method selects the closest observations in the dataset, that is, those that minimize the distance to it [12]. There are two types of kNN algorithms [10]:

Structure-less NN
Structure-based NN

Algorithm 1 defines the basic scheme of the kNN classification method (structure-less NN) on a dataset with m observations.

There are variations of this algorithm that reduce the size of the input dataset [13]. Examples include stochastic neighbor compression (SNC) [14], which compresses the input dataset to obtain a sample of the data, and ProtoNN, a compressed and accurate kNN for resource-scarce devices [15], which generates a small number of prototypes to represent the input dataset.

Algorithm 1: The k-nearest neighbors (KNN) algorithm.
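As an illustration of the basic structure-less scheme, a brute-force kNN classifier can be sketched in Python as follows. This is a generic sketch, not the authors' listing: the overlap (mismatch count) distance, the function names, and the toy data are assumptions made for the example.

```python
from collections import Counter


def overlap_distance(a, b) -> int:
    """Number of attributes on which two categorical observations differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(train_x, train_y, query, k: int = 3):
    """Brute-force kNN: scan all m observations, vote among the k closest."""
    dists = sorted(
        (overlap_distance(row, query), i) for i, row in enumerate(train_x)
    )
    nearest_labels = [train_y[i] for _, i in dists[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set with three categorical attributes per observation.
train_x = [[1, 1, 2], [1, 2, 2], [5, 5, 9], [5, 6, 9], [1, 1, 3]]
train_y = ["benign", "benign", "malignant", "malignant", "benign"]
prediction = knn_classify(train_x, train_y, [1, 1, 2], k=3)
```

Every query scans the full training set, so the whole dataset has to be available in memory; this is the cost that motivates storing it in compressed form.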

Observation	$f_{1}$	$f_{2}$	$f_{3}$	...	$f_{n}$
1	$v_{1}^{1}$	$v_{2}^{1}$	$v_{3}^{1}$	...	$v_{n}^{1}$
2	$v_{1}^{2}$	$v_{2}^{2}$	$v_{3}^{2}$	...	$v_{n}^{2}$
3	$v_{1}^{3}$	$v_{2}^{3}$	$v_{3}^{3}$	...	$v_{n}^{3}$
...	...	...	...	...	...
$m - 1$	$v_{1}^{m - 1}$	$v_{2}^{m - 1}$	$v_{3}^{m - 1}$	...	$v_{n}^{m - 1}$
m	$v_{1}^{m}$	$v_{2}^{m}$	$v_{3}^{m}$	...	$v_{n}^{m}$

Total Categories	Total Bits	Total Elements (32 Bits)	Lost Bits (32 Bits)	Total Elements (64 Bits)	Lost Bits (64 Bits)
2	1	32	0	64	0
3–4	2	16	0	32	0
5–8	3	10	2	21	1
9–16	4	8	0	16	0
17–32	5	6	2	12	4
33–64	6	5	2	10	4
65–128	7	4	4	9	1
129–256	8	4	0	8	0
257–512	9	3	5	7	1
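The figures in the table above follow from integer division of the word size by the number of bits per element. The short Python sketch below reproduces them; the helper name `packing_row` is hypothetical and used only for this illustration.

```python
def packing_row(bits: int, word_bits: int):
    """Return (elements per word, unused bits) for a given code width."""
    elements = word_bits // bits
    lost = word_bits - elements * bits
    return elements, lost


# One entry per code width: ((32-bit elements, lost), (64-bit elements, lost)).
rows = {
    bits: (packing_row(bits, 32), packing_row(bits, 64))
    for bits in range(1, 10)
}
for bits, ((e32, l32), (e64, l64)) in rows.items():
    print(f"{bits} bits: 32-bit word -> {e32} elements, {l32} lost; "
          f"64-bit word -> {e64} elements, {l64} lost")
```

For example, 3-bit codes fit ten to a 32-bit word with 2 bits unused, and twenty-one to a 64-bit word with 1 bit unused.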

Dataset	No. Attributes	No. Observations	No. Classes
Census Income Data Set	15	32,561	2
Wisconsin Breast Cancer (original)	11	699	2

No.	Attribute	Original Type	Range	Type Used
1	age	continuous	17–90	categorical
2	workclass	categorical	1–8	categorical
3	final weight (fnlwgt)	continuous	12,285–1,484,705	numeric
4	education	categorical	1–16	categorical
5	education-num	continuous	1–16	categorical
6	marital-status	categorical	1–7	categorical
7	occupation	categorical	1–14	categorical
8	relationship	categorical	1–6	categorical
9	race	categorical	1–5	categorical
10	sex	categorical	1–2	categorical
11	capital-gain	continuous	0–99,999	numeric
12	capital-loss	continuous	0–4356	numeric
13	hours-per-week	continuous	1–99	categorical
14	native-country	categorical	1–41	categorical
15	class	categorical	1–2	categorical

No.	Attribute	Type	Range
1	Sample code number	id
2	Clump Thickness	categorical	1–10
3	Uniformity of Cell Size	categorical	1–10
4	Uniformity of Cell Shape	categorical	1–10
5	Marginal Adhesion	categorical	1–10
6	Single Epithelial Cell Size	categorical	1–10
7	Bare Nuclei	categorical	1–10
8	Bland Chromatin	categorical	1–10
9	Normal Nucleoli	categorical	1–10
10	Mitoses	categorical	1–10
11	Class	categorical	2 = benign, 4 = malignant

	Compressed Dataset		Uncompressed Dataset
k	Time (ms)	Accuracy (%)	Time (ms)	Accuracy (%)
5	30	95.404	18	95.386
10	40	94.669	19	94.301
15	42	94.118	19	94.136
20	50	93.934	18	93.548

	WBC		CID
Vector Type	Memory (bytes)	% Memory	Memory (bytes)	% Memory
int	24,588	100.0	1,689,072	100.0
short	12,294	50.0	-	-
byte	6147	25.0	-	-

CID	Census income dataset
CIDS	Census income data set
GZIP	GNU ZIP
KNN, kNN	K-nearest neighbors
HEOM	Heterogeneous Euclidean-overlap metric
ML	Machine learning
NA	Not applicable
NN	Nearest neighbors
RAM	Random access memory
SMILE	Statistical machine intelligence and learning engine
WBC-original	Wisconsin breast cancer (original)

$v_{2}$	$v_{3}$	$v_{4}$	$v_{5}$	$v_{6}$	$v_{7}$	$v_{8}$	$v_{9}$	$v_{10}$
.	.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.	.

w1	w2	w3	$v_{3}$	$v_{11}$	$v_{12}$
.	.	.	.	.	.
.	.	.	.	.	.
.	.	.	.	.	.
.	.	.	.	.	.

	Compressed Dataset		Uncompressed Dataset
k	Time (ms)	Accuracy (%)	Time (ms)	Accuracy (%)
5	36.42	81.712	26.93	81.712
10	40.92	82.528	29.37	82.528
15	39.12	82.902	27.49	82.902
20	40.26	82.955	27.16	82.955

Block Size (bits)	Total Blocks	Memory (bytes)	% Memory
64	1	5464	22.2
32	2	5464	22.2
16	3	4098	16.7
8	5	3415	13.9
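The block-size figures above appear consistent with storing 683 observations of nine 4-bit attributes in whole blocks, compared against a baseline of one 32-bit int per value (24,588 bytes). The Python sketch below checks that arithmetic. The count of 683 observations, the nine attributes at 4 bits each, and the helper name `packed_bytes` are assumptions inferred from the reported totals, not values stated alongside the table.

```python
import math


def packed_bytes(n_obs: int, n_attrs: int, bits: int, block_bits: int) -> int:
    """Bytes needed when each observation is stored in whole blocks."""
    per_block = block_bits // bits              # attributes per block
    blocks = math.ceil(n_attrs / per_block)     # blocks per observation
    return n_obs * blocks * block_bits // 8


# Assumed: 683 observations, 9 attributes, 4 bits per attribute.
N_OBS, N_ATTRS, BITS = 683, 9, 4
baseline = N_OBS * N_ATTRS * 4                  # one 4-byte int per value
memory = {b: packed_bytes(N_OBS, N_ATTRS, BITS, b) for b in (64, 32, 16, 8)}
share = {b: round(100 * m / baseline, 1) for b, m in memory.items()}
```

Smaller blocks waste fewer padding bits per observation, which is why the 8-bit layout needs less memory than the 64-bit one even though it uses more blocks.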
