An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data

Entropy. 2023 Aug 9;25(8):1185. doi: 10.3390/e25081185

Algorithm 1 EOEH

Input: dataset Y with n dimensions and m samples, number of subsample sets T, number of samples in each subsample set $μ$, abnormal entropy weight $α$, normal entropy weight $β$, neighborhood parameter K
Output: ensemble anomaly score set O

1: Begin
2: for $t \leftarrow 1$ to $T$ do
3: $S_{t}$ = Random($Y, μ$) // Form the subsample set $S_{t}$ by drawing $μ$ samples from the dataset Y at random without replacement
4: End for
5: $q \in Y$, $p \in S_{t}$, and $a = \{a_{1}, a_{2}, \dots, a_{n}\}$ is the feature set of dataset Y
6: for $t \leftarrow 1$ to $T$ do
7: for $i \leftarrow 1$ to $n$ do
8: $SFE_{(S_{t}, a_{i})}(q)$ // Calculate the subsample feature entropy of point q on the subsample set $S_{t}$ with respect to feature $a_{i}$
9: if $SFE_{(S_{t}, a_{i})}(q)$ > the average of $SFE_{(S_{t}, a_{i})}(p)$ over the other points p in the subsample set then
10: $AFS_{(S_{t})}(q) = \{a_{i} \in a \mid a_{i} \text{ is an abnormal feature of } q\}$ // The abnormal feature subspace of point q in $S_{t}$ is formed by all the abnormal features of q in $S_{t}$.
11: End if
12: End for
13: End for
14: The feature weight vector of point q in the subsample set $S_{t}$ is $FWV(q) = \{ω_{1}, ω_{2}, \dots, ω_{n}\}$.
15: for $i \leftarrow 1$ to $n$ do
16: if $a_{i} \in AFS_{(S_{t})}(q)$ then // Each component of the feature weight vector of point q is assigned according to whether the feature belongs to the abnormal feature subspace.
17: $ω_{i} = α$
18: else
19: $ω_{i} = β$
20: End if
21: End for
22: for $t \leftarrow 1$ to $T$ do
23: for each $p \in S_{t}$ do
24: $SW\text{-}distance_{(S_{t})}(q, p)$ // Calculate the subspace-weighted distance between point q and each point in the subsample set.
25: End for
26: Run the KNN (k-nearest neighbors) algorithm on the subsample set $S_{t}$ using the weighted distance $SW\text{-}distance_{(S_{t})}(q, p)$ between point q and point p. // Obtain the k-neighborhood of point q on the subsample set based on the weighted distance.
27: for $j \leftarrow 1$ to K do
28: $\text{two reach-dist}_{(S_{t}, k)}(q, p)$ // Calculate the two-reach distance of point q within its k-neighborhood
29: End for
30: Using the definition of detailed local reachability density, calculate the $dlrd_{(S_{t}, k)}(q)$ value of point q on the subset $S_{t}$.
31: Average the $dlrd_{(S_{t}, k)}(q)$ value of point q with the $dlrd_{(S_{t}, k)}(p)$ values of the other points p in its k-neighborhood to obtain the detailed local outlier factor $dLOF_{(S_{t}, k)}(q)$, which reflects the abnormality of point q on the subset $S_{t}$.
32: End for
33: Compute the ensemble anomaly score $O_{q} = \frac{1}{T} \sum_{t=1}^{T} dLOF_{(S_{t}, k)}(q)$ for each data object in the dataset Y. This yields the ensemble anomaly score set $O = \{O_{1}, O_{2}, \dots, O_{m}\}$, where $O_{i}$ is the ensemble anomaly score of the $i$-th data object in the dataset Y.
34: End
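
The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The listing above does not give the formulas for SFE, SW-distance, the two-reach distance, dlrd, or dLOF, so the sketch replaces them with clearly labeled hypothetical stand-ins. The stand-ins are a deviation test in place of the entropy test, a weighted Euclidean distance in place of SW-distance, and a mean K-nearest-neighbor distance in place of dLOF. Only the subsampling without replacement, the $α$/$β$ weight assignment, and the averaging over the T subsample sets follow the listing directly. All function names are illustrative.

```python
import numpy as np


def draw_subsamples(Y, T, mu, rng):
    """Steps 2-4: T index sets of size mu, each drawn without replacement."""
    m = Y.shape[0]
    return [rng.choice(m, size=mu, replace=False) for _ in range(T)]


def abnormal_mask_placeholder(q, S):
    """HYPOTHETICAL stand-in for steps 6-13 (the SFE comparison).

    Flags feature i as abnormal when q deviates from the subsample mean
    by more than the subsample's mean absolute deviation on that feature.
    The paper's entropy-based test is NOT reproduced here.
    """
    center = S.mean(axis=0)
    return np.abs(q - center) > np.abs(S - center).mean(axis=0)


def feature_weights(abnormal_mask, alpha, beta):
    """Steps 14-21: weight alpha for abnormal features, beta otherwise."""
    return np.where(abnormal_mask, alpha, beta)


def weighted_distances(q, S, w):
    """ASSUMED form of SW-distance: feature-weighted Euclidean distance."""
    return np.sqrt((((S - q) ** 2) * w).sum(axis=1))


def subset_score_placeholder(q, S, w, K):
    """HYPOTHETICAL stand-in for dLOF (steps 26-31).

    Returns the mean weighted distance from q to its K nearest points in S.
    """
    d = np.sort(weighted_distances(q, S, w))
    return d[:K].mean()


def ensemble_scores(Y, T, mu, alpha, beta, K, seed=0):
    """Step 33: average the per-subset scores of every point over T subsets."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(Y.shape[0])
    for idx in draw_subsamples(Y, T, mu, rng):
        for i, q in enumerate(Y):
            # Exclude q itself if it was drawn into this subsample,
            # so its zero self-distance does not count as a neighbor.
            S = Y[idx[idx != i]]
            w = feature_weights(abnormal_mask_placeholder(q, S), alpha, beta)
            scores[i] += subset_score_placeholder(q, S, w, K)
    return scores / T
```

Calling `ensemble_scores(Y, T, mu, alpha, beta, K)` on an `(m, n)` array returns one score per row, with larger values indicating more anomalous points under these stand-in definitions. The sketch assumes `K` is at most `mu - 1`, so that every point has K neighbors in each subsample set.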