Proc Natl Acad Sci U S A. 2022 Feb 23;119(9):e2119659119. doi: 10.1073/pnas.2119659119

Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification

Illia Horenko a,1
PMCID: PMC8917346  PMID: 35197293

Abstract

Entropic outlier sparsification (EOS) is proposed as a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS dwells on the derived analytic solution of the (weighted) expected loss minimization problem subject to Shannon entropy regularization. The identified closed-form solution is proven to impose additional costs that depend linearly on statistics size and are independent of data dimension. The obtained analytic results also explain why the mixtures of spherically symmetric Gaussians, used heuristically in many popular data analysis algorithms, represent an optimal and least-biased choice for the nonparametric probability distributions when working with squared Euclidean distances. The performance of EOS is compared to a range of commonly used tools on synthetic problems and on partially mislabeled supervised classification problems from biomedicine. Applying EOS for coinference of data anomalies during learning is shown to allow reaching an accuracy of 97% ± 2% when predicting patient mortality after heart failure, statistically significantly outperforming the predictive performance of common learning tools for the same data.

Keywords: sparsification, outlier detection, mislabeling, regularization, entropy


Detection of data anomalies, outliers, and mislabeling is a long-standing problem in statistics, machine learning (ML), and artificial intelligence (1–4). Let {x_1, x_2, …, x_T} be a fixed dataset (where data instances x_t are possibly augmented with labels), let θ be a set of ML model parameters, and let g(x_t, θ) be a scalar-valued loss function measuring the misfit of the data instance x_t. Then, a wide class of learning methods and anomaly detection algorithms can be formulated as numerical procedures for the minimization of the following functional:

\{\hat{w}, \hat{\theta}\} = \arg\min_{w,\,\theta} \sum_{t=1}^{T} w_t\, g(x_t, \theta),   [1]

where 0 ≤ w_t ≤ 1 is the outlyingness, taking values close to zero if the data point x_t is an anomaly (1, 5–7). If w and θ are both unknown, then the above problem [1] for the simultaneous estimation of model parameters and loss weights becomes ill posed. Common approaches deal with this ill-posedness by imposing additional parametric assumptions on w, for example, based on parametric thresholding of one-dimensional linear projections in Stahel–Donoho estimators or by deploying other parametric tools [like χ(D)-distribution quantiles to determine outliers of a D-dimensional normal distribution] (1, 5, 7, 8). An appealing idea would be to make this ill-posed problem well posed in a nonparametric way, by regularizing it with one of the common regularization approaches. For example, applying l1 regularization could result in a sparsification of w, zeroing out the outlying data points from the estimation (9). However, l1 and other sparsification methods require a polynomial cost scaling for the numerical solution of the resulting optimization problems, which would limit the solution of [1] to relatively small problems (10).

The key message of this brief report is to show that the simultaneous well-posed detection of anomalies and learning of the parameters θ in [1] can be achieved very efficiently computationally, by minimizing the expected loss from the right-hand side of [1] simultaneously with the regularized maximization of the Shannon entropy of the loss weight distribution w,

\{w(\alpha), \theta(\alpha)\} = \arg\min_{w,\,\theta} L(w, \theta, \alpha), \quad \text{where } L(w, \theta, \alpha) = \sum_{t=1}^{T} w_t\, g(x_t, \theta) + \alpha \sum_{t=1}^{T} w_t \log w_t,
\text{such that } w \in P(T), \quad P(T) \equiv \left\{ w \in \mathbb{R}^{T} \;\middle|\; w \ge 0,\; \sum_{t=1}^{T} w_t = 1 \right\}.   [2]

The following Theorem summarizes the properties of this problem’s solutions.

Theorem.

For any fixed {x_1, x_2, …, x_T} and θ such that sup_t |g(x_t, θ)| < ∞, and for any α > 0, the constrained minimization problem [2] admits a unique closed-form solution w(α),

w_t(\alpha) = \frac{\exp\!\left(-\alpha^{-1} g(x_t, \theta)\right)}{\sum_{s=1}^{T} \exp\!\left(-\alpha^{-1} g(x_s, \theta)\right)}.   [3]

The proof of the Theorem is provided in SI Appendix. It is straightforward to validate that the numerical cost of computing [3] scales linearly in the statistics size T and is independent of the data dimension D, in contrast to common regularization techniques that require a polynomial cost scaling in the data dimension D (10).
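As an illustration of how cheap the closed-form solution [3] is in practice, the following short Python sketch (not the released MATLAB code; the function and variable names are illustrative only) evaluates the entropic weights from a vector of per-instance losses, using the standard shift before exponentiation for numerical stability:

```python
import numpy as np

def eos_weights(losses, alpha):
    """Closed-form solution [3]: w_t proportional to exp(-g(x_t, theta)/alpha),
    normalized over t = 1, ..., T. Cost is O(T), independent of the dimension D."""
    losses = np.asarray(losses, dtype=float)
    shifted = losses - losses.min()          # shifting leaves the normalized weights unchanged
    w = np.exp(-shifted / alpha)
    return w / w.sum()

# A data point with a much larger loss (a candidate outlier) receives a weight close to zero.
print(eos_weights([0.9, 1.1, 1.0, 8.0], alpha=1.0))
```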

If the loss function g(x_t, θ) is a squared Euclidean distance (as in least-squares methods), then, according to the above Theorem, the unique probability distributions w minimizing [2] are from the α-parametric family of spherically symmetric Gaussians, with the dimension-wise variance σ² = 0.5α. This result provides an interesting insight into density-based methods, for example, into t-distributed stochastic neighbor embedding (t-SNE) (11), one of the most popular nonlinear dimension reduction approaches in biomedicine (with over 20,000 citations according to Google Scholar). This method searches for optimal low-dimensional approximations of high-dimensional densities defined in a heuristic way as mixtures of spherically symmetric Gaussians,

w_t = \frac{\exp\!\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)}{\sum_{k \ne i} \exp\!\left(-\|x_i - x_k\|^2 / (2\sigma^2)\right)},   [4]

with a multiindex t = (i, j). According to the above Theorem, this heuristic, which builds the computational foundation of t-SNE, is actually equivalent to the optimal nonparametric density estimate [3], in the sense that it simultaneously minimizes the expectation of the pairwise squared Euclidean distances between the data points (when considering the loss function g(x_t, θ) = ||θ − x_t||² with t ≡ j and θ ≡ x_i in [2]), simultaneously maximizes the entropy of w (i.e., provides the least-biased estimation), and is obtained with an explicitly computable closed-form expression. Furthermore, solution [3] also provides a recipe for computing such t-SNE density estimates in cases with non-Euclidean loss functions g.
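This stated equivalence can be checked numerically. The sketch below (reusing the illustrative eos_weights helper from above and assuming α = 2σ²) confirms that the t-SNE conditional probabilities [4] for a fixed point x_i coincide with the entropic weights [3] computed from the squared Euclidean losses to x_i:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))               # 50 data points in D = 5 dimensions
i, sigma2 = 0, 1.3                         # reference point index and Gaussian variance

# Eq. 4: t-SNE conditional probabilities for point i (self-term excluded)
d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared Euclidean distances to x_i
mask = np.arange(len(X)) != i
p_tsne = np.exp(-d2[mask] / (2 * sigma2))
p_tsne /= p_tsne.sum()

# Eq. 3 with g(x_j, theta) = ||theta - x_j||^2, theta = x_i, and alpha = 2*sigma^2
p_eos = eos_weights(d2[mask], alpha=2 * sigma2)   # helper from the previous sketch

assert np.allclose(p_tsne, p_eos)
```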

Algorithm 1

Entropic outlier sparsification algorithm for the solution of optimization problem [2]

For a given {x_1, x_2, …, x_T} and α > 0, randomly choose an initial w(1)

I = 1; L(I) = ∞; ΔL(I) = ∞
while ΔL(I) > tol do
  θ(I) ← solution of [2] for fixed w(I)
  w(I+1) ← evaluation of [3] for fixed θ(I)
  L(I+1) ← L(w(I+1), θ(I), α)
  I ← I + 1
  ΔL(I) ← L(I−1) − L(I)
end while

It is straightforward to verify that the simultaneous learning of the parameters θ and probability densities w can be performed with the monotonically convergent entropic outlier sparsification (EOS) algorithm (see Algorithm 1).
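A compact Python transcription of Algorithm 1 might look as follows. It is a schematic sketch rather than the deposited MATLAB implementation; theta_step and loss (illustrative names) stand for the model-specific θ-update of [2] and the loss g(x_t, θ), respectively.

```python
import numpy as np

def eos_fit(X, theta_step, loss, alpha, tol=1e-12, max_iter=1000, seed=None):
    """Algorithm 1: alternate the theta-update of [2] (for fixed w) with the
    closed-form weight update [3] (for fixed theta) until the decrease of L
    falls below tol."""
    rng = np.random.default_rng(seed)
    T = len(X)
    w = rng.random(T)
    w /= w.sum()                                # random initial w(1) on the simplex
    L_prev = np.inf
    for _ in range(max_iter):
        theta = theta_step(X, w)                # theta(I): solve [2] for fixed w(I)
        g = loss(X, theta)                      # per-instance losses g(x_t, theta(I))
        w = np.exp(-(g - g.min()) / alpha)      # w(I+1): closed-form solution [3]
        w /= w.sum()
        L = w @ g + alpha * np.sum(w * np.log(w + 1e-300))   # objective L of [2]
        if L_prev - L < tol:                    # Delta L(I) below tolerance: stop
            break
        L_prev = L
    return theta, w
```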

Eq. 4 establishes a relation between the Gaussian variance parameter σ² and the entropic sparsification parameter α in [3], indicating a possibility of inferring the optimal sparsification parameter α* for the given data. For example, the optimal σ² in [4], and hence the optimal sparsification parameter value α*, can be obtained by maximizing the log-likelihood of the distribution w_t(α) with respect to the parameter α; that is, α* = argmax_{α>0} Σ_t log(w_t(α)). In the practical examples of EOS below, we will follow a simpler grid search approach for selecting the optimal sparsification parameter α*, deploying the same multiple cross-validation procedure that is commonly used for determining metaparameter values in AI and ML. On a predefined grid of α values, we will select those values that show the best overall model performance on validation data that were not used in the model training.
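Such a grid search could be organized as in the following schematic sketch, where fit_predict and score are illustrative placeholders for the EOS-equipped learner and the chosen performance measure (e.g., AUC), and the random splits mimic the multiple cross-validations used in the experiments below:

```python
import numpy as np

def select_alpha(X, y, alpha_grid, fit_predict, score, n_splits=50, val_frac=0.25, seed=0):
    """Pick alpha* on a predefined grid of values by the average validation
    performance over n_splits random train/validation splits."""
    X, y = np.asarray(X), np.asarray(y)
    T = len(X)
    n_val = max(1, int(val_frac * T))
    mean_scores = []
    for alpha in alpha_grid:
        scores = []
        for _ in range(n_splits):
            idx = rng_perm = np.random.default_rng(seed).permutation(T) if n_splits == 1 else np.random.default_rng([seed, _]).permutation(T)
            val, train = idx[:n_val], idx[n_val:]
            y_pred = fit_predict(X[train], y[train], X[val], alpha)   # train with EOS, predict on validation
            scores.append(score(y[val], y_pred))
        mean_scores.append(np.mean(scores))
    return alpha_grid[int(np.argmax(mean_scores))]
```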

Fig. 1 summarizes numerical experiments comparing EOS to common data anomaly detection and learning tools on randomly generated synthetic datasets (representing multivariate normal distributions with asymmetrically positioned, uniformly distributed outliers; Fig. 1 A–F) and on three biomedical datasets with various proportions of randomly mislabeled data instances in the training sets (Fig. 1 G–I). All of the compared algorithms are provided with the same information and run with the same hardware and software; 50 cross-validations were performed in every experiment to visualize the obtained 95% CIs. In the numerical experiments with synthetic data (Fig. 1 A–F), the EOS algorithm is deployed with g being the negative point-wise multivariate Gaussian log-likelihood, that is, with g(x_t, μ, Σ) = (1/D)(0.5 log det(Σ) + 0.5 (x_t − μ)ᵀ Σ⁻¹ (x_t − μ)), where μ and Σ are the Gaussian mean and covariance, respectively. Iterative estimation of the weighted mean and covariance in the θ-step of the EOS algorithm is performed using the analytical estimates of the weighted Gaussian covariance and mean, and the convergence tolerance tol is set to 10⁻¹². Total computational costs and statistical precisions, the latter measured as the number of correctly identified points not belonging to the Gaussian distribution divided by the total number of identified outliers, are compared for various problem dimensions, statistics sizes, and outlier proportions. EOS was compared to all of the outlier detection methods available in the MathWorks "Statistics and Machine Learning" toolbox. Precision is chosen as the measure of performance here since it is more robust than other common measures when the datasets are not balanced, for example, when the number of instances in one class (outliers) is much smaller than in the other class (nonoutliers). These results show that EOS allows a marked and robust improvement of outlier detection precision for all of the considered comparison cases. Data and MATLAB code are provided at https://github.com/horenkoi/EOS.
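For the synthetic-data setting just described, the θ-step and the loss could be instantiated as in the sketch below (assuming the illustrative eos_fit skeleton from above; the weighted mean and covariance are the standard analytical estimates, and the loss matches the dimension-normalized negative Gaussian log-likelihood up to an additive constant):

```python
import numpy as np

def weighted_gaussian(X, w):
    """Theta-step: analytical weighted Gaussian mean and covariance (weights sum to one)."""
    mu = w @ X
    Xc = X - mu
    Sigma = (w[:, None] * Xc).T @ Xc
    return mu, Sigma + 1e-9 * np.eye(X.shape[1])     # small ridge for numerical stability

def gaussian_nll(X, theta):
    """g(x_t, mu, Sigma) = (1/D)(0.5 log det Sigma + 0.5 (x_t - mu)^T Sigma^{-1} (x_t - mu))."""
    mu, Sigma = theta
    D = X.shape[1]
    Xc = X - mu
    maha = np.einsum('ti,ij,tj->t', Xc, np.linalg.inv(Sigma), Xc)
    return (0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * maha) / D

# Gaussian bulk with asymmetrically positioned, uniformly distributed outliers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(950, 10)),
               rng.uniform(5.0, 10.0, size=(50, 10))])
theta, w = eos_fit(X, weighted_gaussian, gaussian_nll, alpha=0.5)   # alpha chosen ad hoc for illustration
outliers = np.argsort(w)[:50]                        # smallest entropic weights = candidate outliers
```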

Fig. 1. Comparison of the EOS algorithm for the solution of optimization problem [2] to common methods of data anomaly detection (A–F) and supervised classifier learning (G–I) on synthetic and real data examples from refs. 12–14.

Next, real labeled datasets from biomedicine are considered, including two popular datasets, the University of Wisconsin Database for Breast Cancer diagnostics data (12) (Fig. 1 G) and the clinical dataset for predicting mortality after heart failure (13) (Fig. 1 I), as well as a single-cell messenger RNA gene expression dataset from longevity research (14) (Fig. 1 H). The main focus here is on comparing the robustness of learning methods to outliers and mislabeled data instances in the training set, for common binary classifiers and for EOS equipped with the loss function g from the scalable probabilistic approximation (SPA) classifier algorithm (15, 16). SPA is selected since it shows the highest robustness to mislabeling for all of the considered datasets (Fig. 1 G and I). As can be seen from Fig. 1 G and H, EOS with g(x_t, θ) from SPA (EOS+SPA, red dashed lines) allows a statistically significant improvement of prediction performance (measured with the common performance measure, the area under the curve [AUC]) for all of the tested mislabeling proportions p in all of the considered biomedical examples. As was shown recently, coinference of data mislabelings can significantly improve the predictive performance of supervised classifiers (17). Application of the EOS algorithm with the model loss function g(x_t, θ) from SPA (EOS+SPA) achieves an AUC of 0.96 and an accuracy of 97% ± 2% (SI Appendix, Fig. S1) when predicting patients' mortality after heart failure from clinical patient data, statistically significantly outperforming common learning tools that do not deploy outlier coinference (Fig. 1 I).

EOS and the entropic sparsification Eq. 3 can also be applied to other types of learning problems, for example, to feature selection and novelty detection problems.

Supplementary Material

Supplementary File
pnas.2119659119.sapp.pdf (279.1KB, pdf)

Acknowledgments

I.H. acknowledges funding from the Carl-Zeiss Foundation initiative “Emergent Algorithmic Intelligence.”

Footnotes

The author declares no competing interest.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2119659119/-/DCSupplemental.

Data Availability

Data and MATLAB code have been deposited in GitHub (https://github.com/horenkoi/EOS). Previously published data were used for this work (12–14).

References

1. Donoho D. L., Gasko M., Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Stat. 20, 1803–1827 (1992).
2. Rocke D. M., Woodruff D. L., Identification of outliers in multivariate data. J. Am. Stat. Assoc. 91, 1047–1061 (1996).
3. Filzmoser P., Maronna R., Werner M., Outlier identification in high dimensions. Comput. Stat. Data Anal. 52, 1694–1711 (2008).
4. Wang H., Bah M. J., Hammad M., Progress in outlier detection techniques: A survey. IEEE Access 7, 107964–108000 (2019).
5. Stahel W. A., "Robust estimation: Infinitesimal optimality and covariance matrix estimators," PhD thesis, Eidgenössische Technische Hochschule (ETH) Zürich, Zurich, Switzerland (1981).
6. Rousseeuw P. J., Van Zomeren B. C., Unmasking multivariate outliers and leverage points. J. Am. Stat. Assoc. 85, 633–639 (1990).
7. Maronna R. A., Yohai V. J., The behavior of the Stahel-Donoho robust multivariate estimator. J. Am. Stat. Assoc. 90, 330–341 (1995).
8. Zuo Y., Cui H., He X., On the Stahel-Donoho estimator and depth-weighted means of multivariate data. Ann. Stat. 32, 167–188 (2004).
9. Donoho D. L., For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Commun. Pure Appl. Math. 59, 907–934 (2006).
10. Huang S., Tran T. D., Sparse signal recovery via generalized entropy functions minimization. IEEE Trans. Signal Process. 67, 1322–1337 (2019).
11. van der Maaten L., Hinton G., Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
12. UCI Machine Learning, Data from "Breast cancer Wisconsin (diagnostic) data set." Kaggle. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. Accessed 1 October 2021.
13. Chicco D., Jurman G., Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med. Inform. Decis. Mak. 20, 16 (2020).
14. Lan J., et al., Translational regulation of non-autonomous mitochondrial stress response promotes longevity. Cell Rep. 28, 1050–1062.e6 (2019).
15. Gerber S., Pospisil L., Navandar M., Horenko I., Low-cost scalable discretization, prediction, and feature selection for complex systems. Sci. Adv. 6, eaaw0961 (2020).
16. Horenko I., On a scalable entropic breaching of the overfitting barrier for small data problems in machine learning. Neural Comput. 32, 1563–1579 (2020).
17. Gerber S., et al., Co-inference of data mislabelings reveals improved models in genomics and breast cancer diagnostics. Front. Artif. Intell. 4, 739432 (2022).
