Abstract
The quantization problem aims to find the best possible approximation of probability measures using finite, discrete measures. The Wasserstein distance is a typical choice to measure the quality of the approximation. This contribution investigates the properties and robustness of the entropy-regularized quantization problem, which relaxes the standard quantization problem. The proposed approximation technique naturally adopts the softmin function, which is well known for its robustness from both theoretical and practical standpoints. Moreover, we use the entropy-regularized Wasserstein distance to evaluate the quality of the soft quantization problem’s approximation, and we implement a stochastic gradient approach to achieve the optimal solutions. The control parameter in our proposed method allows for adjusting the difficulty level of the optimization problem, providing significant advantages when dealing with exceptionally challenging problems of interest. In addition, this contribution empirically illustrates the performance of the method in various settings.
Keywords: quantization, approximation of measures, entropic regularization
MSC Classification: 94A17, 81S20, 40A25
1. Introduction
Over the past few decades, extensive research has been conducted on optimal quantization techniques in order to tackle numerical problems that are related to various fields such as data science, applied disciplines, and economic models. These problems are typically centered around uncertainties or probabilities which demand robust and efficient solutions (cf. Graf and Mauldin [1], Luschgy and Pagès [2], El Nmeir et al. [3]). In general, these problems are difficult to handle, as the random components in the problem allow uncountably many outcomes. As a consequence, in order to address this difficulty the probability measures are replaced by simpler or finite measures, which can facilitate numerical computations. However, the probability measures should be ‘close’ in order to ensure that the result of the computations with approximate (discrete) measures resembles the original problem. In a nutshell, the goal is to find the best approximation of a diffuse measure using a discrete measure, which is called an optimal quantization problem. For a comprehensive discussion of the optimal quantization problem from a mathematical standpoint, refer to Graf and Luschgy [4].
On the other hand, entropy (sometimes known as information entropy) is an essential concept when dealing with uncertainties and probabilities. In mathematics, entropy is often used as a measure of information and uncertainty. It provides a quantitative measure of the randomness or disorder in a system or a random variable. Its applications span information theory, statistical analysis, probability theory, and the study of complex dynamical systems (cf. Breuer and Csiszár [5,6], Pichler and Schlotter [7]).
In order to assess the closeness of probability measures, distances are often considered; one notable instance is the Wasserstein distance. Intuitively, the Wasserstein distance measures the minimum average transportation cost required to transfer one probability measure into another. Unlike other distances and divergences, which simply compare the probabilities assigned by the distribution functions (e.g., the total variation distance and the Kullback–Leibler divergence), the Wasserstein distance incorporates the geometry of the underlying space. This yields a geometrically faithful understanding of the relationships between different probability measures.
In our research work, we focus on entropy-regularized quantization methods. More precisely, we consider an entropy-regularized version of the Wasserstein problem to quantify the quality of the approximation, and adapt the stochastic gradient approach to obtain the optimal quantizers.
The key features of our methodology include the following:
- (i) Our regularization approach stabilizes and simplifies the standard quantization problem by introducing penalty terms or constraints that discourage overly complex or overfitted models, promoting better generalization and robustness in the solutions.
- (ii) The influence of entropy is controlled by a regularization parameter, which enables us to reach the genuine optimal quantizers.
- (iii) Generally, parameter tuning comes with certain limitations. However, our method builds upon the framework of the well-established softmin function, which allows us to exercise parameter control without encountering any restrictions.
- (iv) For larger values of the regularization parameter, the optimal measure accumulates all its mass at the center of the measure.
Applications in the Context of Quantization.
Quantization techniques have undergone significant developments in recent years, particularly within the domain of deep learning and model optimization. State-of-the-art research has introduced advanced methodologies such as non-uniform quantization and quantization-aware training, enabling the efficient deployment of neural networks while preserving performance (cf. Jacob et al. [8], Zhuang et al. [9], Hubara et al. [10]). Furthermore, quantization principles have found applications beyond machine learning, such as in digital image processing, computer vision (cf. Polino et al. [11]), and electric charge quantization (cf. Bhattacharya [12]).
Related Works and Contributions.
As mentioned above, optimal quantization is a well-researched topic in the field of information theory and signal processing. Several methods have been developed for the optimal quantization problem, notably including the following:
- Lloyd-Max Algorithm: this algorithm, also known as Lloyd’s algorithm or the k-means algorithm, is a popular iterative algorithm for computing optimal vector quantizers. It iteratively adjusts the centroids of the quantization levels to minimize the quantization error (cf. Scheunders [13]).
- Tree-Structured Vector Quantization (TSVQ): TSVQ is a hierarchical quantization method that uses a tree structure to partition the input space into regions. It recursively applies vector quantization at each level of the tree until the desired number of quantization levels is achieved (cf. Wei and Levoy [14]).
- Expectation-maximization (EM) algorithm: the EM algorithm is a general-purpose optimization algorithm that can be used for optimal quantization. It is an iterative algorithm that estimates the parameters of a statistical model to maximize the likelihood of the observed data (cf. Heskes [15]).
- Stochastic Optimization Methods: stochastic optimization methods such as simulated annealing, genetic algorithms, and particle swarm optimization can be used to find optimal quantization strategies by exploring the search space and iteratively improving the quantization performance (cf. Pagès et al. [16]).
- Greedy vector quantization (GVQ): the greedy algorithm tries to solve the problem iteratively by adding one code word at every step until the desired number of code words is reached, each time selecting the code word that minimizes the error. GVQ is known to provide suboptimal quantization compared to other non-greedy methods such as the Lloyd-Max and Linde–Buzo–Gray algorithms. However, it has been shown to perform well when the data have a strong correlation structure. Notably, it utilizes the Wasserstein distance to measure the error of approximation (cf. Luschgy and Pagès [2]).
These methods provide efficient and practical solutions for finding optimal quantization schemes, and have different trade-offs between complexity and performance. The choice of method depends on the problem of interest and the requirements of the application. However, most of these methods depend on strict constraints, which makes the solutions overly complex or results in model overfitting. Our method mitigates this issue by promoting better generalizations and robustness in the solutions.
In the optimal transport community, the entropy-regularized version of the optimal transport problem (known as the entropy-regularized Wasserstein problem) was initially proposed by Cuturi [17]. This entropic version of the Wasserstein problem enables fast computations using Sinkhorn’s algorithm. As an avenue for constructive research, that study inspired a multitude of results aimed at gaining a comprehensive understanding of the subtleties involved in enhancing the computational performance of entropy-regularized optimal transport (cf. Ramdas et al. [18], Neumayer and Steidl [19], Altschuler et al. [20], Lakshmanan et al. [21], Ba and Quellmalz [22], Lakshmanan and Pichler [23]). These findings have served as a valuable foundation for further exploration in the field of optimal transport, providing insights into both the intricacies of the topic and potential avenues for improvement.
In contrast, we present a new and innovative approach that concentrates on the optimal quantization problem based on entropy and on its robust properties, which represents a distinct contribution with regard to standard entropy-regularized optimal transport problems.
One of the principal consequences of our research substantiates the convergence behavior of the quantizers towards the center of the measure. The relationship between the center of the measure and the entropy-regularized quantization problem has not been made explicit before. The following plain solution is obtained by intensifying the entropy term in the regularization of the quantization problem.
Theorem 1.
There exists a real-valued threshold for the regularization parameter such that the approximation of the entropy-regularized optimal quantization problem is provided by the Dirac measure located at the point a, for every regularization parameter exceeding this threshold, where a is the center of the measure P with respect to the distance d.
This appealing interpretation (Theorem 1) of our master problem facilitates an understanding of the transition from a complex and difficult optimization problem to a simple solution. Moreover, along with the theoretical discussion, we provide an algorithm and numerical examples which empirically demonstrate the robustness of our method. The forthcoming sections elucidate the robustness and asymptotic properties of the proposed method in detail.
Outline of the Paper.
Section 2 establishes the essential notation, definitions, and properties. Moreover, we comprehensively expound upon the significance of the smooth minimum, a pivotal component in our research. In Section 3, we introduce the entropy-regularized optimal quantization problem and delve into its inherent properties. Section 4 presents a discussion of the soft tessellation, optimal weights, and theoretical properties of parameter tuning. Furthermore, we systematically illustrate the computational process along with a pseudo-algorithm. Section 5 provides numerical examples and empirically substantiates the theoretical proofs. Finally, Section 6 summarizes the study.
2. Preliminaries
In what follows, the underlying space, equipped with the distance d, is a Polish space. We consider the σ-algebra generated by the Borel sets induced by the distance d, as well as the set of all probability measures on this space.
2.1. Distances and Divergences of Measures
The standard quantization problem employs the Wasserstein distance to measure the quality of the approximation, which was initially studied by Monge and Kantorovich (cf. Monge [24], Kantorovich [25]). One of the remarkable properties of this distance is that it metrizes the weak* topology of measures.
Definition 1
(Wasserstein distance). Let P and be probability measures on . The Wasserstein distance of order between P and is
(1) where the infimum is among all measures with marginals P and , that is,
(2)
(3) for all sets A and . The measures
on are called the marginal measures of the bivariate measure π.
Readers may refer to the excellent monographs in [26,27] for a comprehensive discussion of the Wasserstein distance.
Remark 1
(Flexibility). In the subsequent discussion, our problem of interest is to approximate the measure P, which may be a continuous, discrete, or mixed measure. The approximating measure, in contrast, is a discrete measure. The definition of the Wasserstein distance flexibly comprises all the cases, namely, continuous, semi-discrete, and discrete measures.
In contrast to the standard methodology, we investigate the quantization problem by utilizing an entropy version of the Wasserstein distance. The standard Wasserstein problem is regularized by adding the Kullback–Leibler divergence, which is known as the relative entropy.
Definition 2
(Kullback–Leibler divergence). Let P and Q be probability measures. Denote the Radon–Nikodým derivative by dQ/dP if Q is absolutely continuous with respect to P. The Kullback–Leibler divergence is
(4) where E_P (E_Q, resp.) denotes the expectation with respect to the measure P (Q, resp.).
Per Gibbs’ inequality, the Kullback–Leibler divergence satisfies D ≥ 0 (non-negativity). However, D is not a distance metric, as it does not satisfy the symmetry and triangle inequality properties.
We would like to emphasize the following distinction with respect to the Wasserstein distance (cf. Remark 1): in order for the Kullback–Leibler divergence to be finite, the support of Q must be contained in the support of P (cf. Rüschendorf [28] for the notion of the support of a measure).
If P is a continuous measure, then Q is as well. If P is a finite (discrete) measure, then the support points of P contain the support points of Q.
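To make the absolute continuity requirement concrete, the following minimal sketch evaluates the Kullback–Leibler divergence for discrete probability vectors; the function name and the convention 0 · log 0 = 0 are illustrative choices, not notation from the preceding definition.

```julia
# Discrete Kullback–Leibler divergence D(Q ‖ P); it is finite only if Q is
# absolutely continuous with respect to P, i.e., Q places no mass where P has none.
function kl_divergence(q::AbstractVector{<:Real}, p::AbstractVector{<:Real})
    s = 0.0
    for (qk, pk) in zip(q, p)
        qk == 0 && continue          # convention: 0 · log 0 = 0
        pk == 0 && return Inf        # Q is not absolutely continuous w.r.t. P
        s += qk * log(qk / pk)
    end
    return s
end

println(kl_divergence([0.5, 0.5, 0.0], [0.4, 0.4, 0.2]))   # finite (≈ 0.2231)
println(kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))   # Inf: mass where P has none
```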
2.2. The Smooth Minimum
In what follows, we present the smooth minimum in its general form, which includes discrete and continuous measures. The numerical computations in the following section rely on results for its discrete version. Therefore, we address the special properties of its discrete version in detail.
Definition 3
(Smooth minimum). Let and let Y be a random variable. The smooth minimum, or smooth minimum with respect to , is
(5)
(6) provided that the expectation (integral) of is finite, or if it is not finite, that . For , we set
(7) For a σ-algebra and measurable with respect to , the conditional smooth minimum is
The following lemma relates the smooth minimum with the essential infimum (cf. (7)), that is, colloquially, the ‘minimum’ of a random variable. As well, the result justifies the term smooth minimum.
Lemma 1.
For , it holds that
(8) and
(9)
Proof.
Inequality (8) follows from Jensen’s inequality as applied to the convex function .
Next, the first inequality in the second display (9) follows from and the fact that all operations in (6) are monotonic. Finally, let . Per Markov’s inequality, we have
(10) which is a variant of the Chernoff bound. From Inequality (10), it follows that
(11) When and , we have
where a is an arbitrary number with . This completes the proof. □
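The limiting behavior described in Lemma 1 can be observed numerically. The following sketch evaluates the smooth minimum of an empirical sample in its LogSumExp form (cf. Section 4.1), shifted by the sample minimum for numerical stability; the function name and the uniform empirical measure are assumptions made for this illustration.

```julia
# Smooth minimum of a sample under the empirical (uniform) measure,
# evaluated stably by shifting with the sample minimum.
using Statistics

function smoothmin(y::AbstractVector{<:Real}, λ::Real)
    m = minimum(y)
    return m - λ * log(mean(exp.(-(y .- m) ./ λ)))
end

y = randn(10_000)
for λ in (10.0, 1.0, 0.1, 0.01)
    println("λ = $λ:  smooth minimum ≈ $(round(smoothmin(y, λ), digits = 4))")
end
println("minimum ≈ $(round(minimum(y), digits = 4)),  mean ≈ $(round(mean(y), digits = 4))")
# The smooth minimum never exceeds the mean and approaches the (essential)
# minimum as λ decreases, consistent with Lemma 1.
```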
Remark 2
(Nesting property). The main properties of the smooth minimum include translation equivariance
and positive homogeneity
As a consequence of the tower property of the expectation, we have the nesting property
provided that is a sub-σ-algebra of .
2.3. Softmin Function
The smooth minimum is related to the softmin function via its derivatives. In what follows, we derive variants of these derivatives, which will be needed later.
Definition 4
(Softmin function). For and a random variable Y with a finite smooth minimum, the softmin function is the random variable
(12) where the latter equality is obvious based on the definition of the smooth minimum in (6). The function is called the Gibbs density.
The Derivative with respect to the Probability Measure
The definition of the smooth minimum in (6) does not require the measure to be a probability measure. Based on the first-order expansion log(1 + x) ≈ x (at x = 0) of the natural logarithm, the directional derivative of the smooth minimum in the direction of the measure Q is
(13)
(14)
(15)
(16)
Note that is (up to the constant ) a Radon–Nikodým density in (16). Thus, the Gibbs density is proportional to the directional derivative of the smooth minimum with respect to the underlying measure .
The Derivative with respect to the Random Variable
In what follows, we additionally require the derivative of the smooth minimum with respect to its argument. Following similar reasoning as above, this is accomplished by
(17)
(18)
(19)
(20)
(21)
which involves the softmin function as well.
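The role of the softmin function as the derivative of the smooth minimum can be verified numerically. The sketch below is restricted to the empirical (uniform) measure, with the Gibbs weights normalized to sum to one and with function names chosen for this illustration; it compares the softmin weight of one component with a central finite-difference derivative of the smooth minimum.

```julia
# Softmin (Gibbs) weights of a sample and a finite-difference check that they
# form the gradient of the smooth minimum with respect to its argument.
using Statistics

smoothmin(y, λ) = (m = minimum(y); m - λ * log(mean(exp.(-(y .- m) ./ λ))))

function softmin_weights(y::AbstractVector{<:Real}, λ::Real)
    w = exp.(-(y .- minimum(y)) ./ λ)    # stable: the largest weight equals one
    return w ./ sum(w)                   # normalized to sum to one
end

y, λ, h = randn(5), 0.3, 1e-6
σ = softmin_weights(y, λ)
e1 = [1.0; zeros(length(y) - 1)]         # perturbation of the first component
fd = (smoothmin(y .+ h .* e1, λ) - smoothmin(y .- h .* e1, λ)) / (2h)
println("softmin weight of component 1:  ", σ[1])
println("finite-difference derivative:   ", fd)   # the two values agree
```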
3. Regularized Quantization
This section introduces the entropy-regularized optimal quantization problem along with its properties; we first recall the standard optimal quantization problem.
The standard quantization measures the quality of the approximation using the Wasserstein distance and considers the following problem (cf. Graf and Luschgy [4]):
(22)
where
(23)
is the set of measures supported by not more than m points.
Soft quantization, or quantization regularized with the Kullback–Leibler divergence, involves the regularized Wasserstein distance instead of (22). The soft quantization problem reads
(24)
where the regularization parameter is strictly positive. The optimal measure solving (24) depends on this regularization parameter.
In the following discussion, we initially investigate the regularized approximation, which again demonstrates the existence of an optimal approximation.
3.1. Approximation with Inflexible Marginal Measures
The following proposition addresses the optimal approximation problem after being regularized with the Kullback–Leibler divergence and fixed marginals. To this end, we dissect the infimum in the soft quantization problem (24) as follows:
(25)
where the marginals P and are fixed in the inner infimum.
The following Proposition 1 addresses this problem with a fixed bivariate distribution, which is the inner infimum in (25). Then, Proposition 2 reveals that the optimal marginals coincide in this case.
Proposition 1.
Let P be a probability measure and let . The inner optimization problem in (25) relative to the fixed bivariate distribution is provided by the explicit formula
(26)
(27) where is the Kullback–Leibler divergence. Further, the infimum in (26) is attained.
Remark 3.
The notation in (27) ((29) below, resp.) is chosen to reflect the explicit expression in (26), while the soft minimum is with respect to the measure , which is associated with the variable , and the expectation is with respect to P and has an associated variable ξ (that is, the variable ξ in (27) is associated with P and the variable with ).
Remark 4
(Standard quantization). The result from (27) extends
(28)
(29) which is the formula without regularization and with restriction to the marginals P and (i.e., , cf. Pflug and Pichler [29]). Note that the preceding display thereby explicitly involves the support , while (26) only involves the expectation (via the smooth minimum) with respect to the measure . In other words, (27) quantifies the quality of entropy-regularized quantization, while (29) quantifies standard quantization.
Proof of Proposition 1.
It follows from the definition of the Kullback–Leibler divergence in (4) that it is enough to consider measures which are absolutely continuous with respect to the product measure ; otherwise, the objective is not finite. Hence, there is a Radon–Nikodým density such that, with Fubini’s theorem,
In order for the marginal constraint to be satisfied (cf. (2)), we have
for every measurable set A. It follows that
We can conclude that every density of the form
(30) satisfies constraints in (2), irrespective of Z, and conversely that, via in (30), every Z defines a bivariate measure satisfying the constraints in (2). We set (with the convention that and , resp.) and consider
With these, the divergence is
For the other term in Objective (24), we have
Combining the last expressions obtained, the objective in (26) is
(31)
(32) For fixed ( is suppressed in the following two displays to abbreviate the notation), consider the function
The directional derivative in direction h of this function is
(33)
(34)
(35)
(36)
(37)
(38) Per (37) and (38), the derivative vanishes for every function h if . As is arbitrary, the general minimum is attained for . With this, the first expression in (31) vanishes, and we can conclude that
Finally, notice that the variable is completely arbitrary for the problem in (26) involving the Wasserstein distance and the Kullback–Leibler divergence. As outlined above, for every measure with finite divergence , there is a density Z, as considered above. From this, the assertion in Proposition 1 follows. □
Remark 5.
The preceding proposition considers probability measures π with marginal . Its first marginal distribution (trivially) is absolutely continuous with respect to P, , as .
The second marginal , however, is not specified. In order for π to be feasible in (26), its Kullback–Leibler divergence with respect to must be finite. Hence, there is a (non-negative) Radon–Nikodým density Z such that
It follows from Fubini’s theorem that
where . Thus, the second marginal is absolutely continuous with respect to , .
Proposition 1 characterizes the objective of the quantization problem. In addition, its proof implicitly reveals the marginal of the best approximation. The following lemma explicitly spells out the density of the marginal of the optimal measure with respect to .
Lemma 2
(Characterization of the best approximating measure). The best approximating marginal probability measure minimizing (26) has a density
where is the softmin function (cf. Definition 4).
Proof.
Recall from the proof of Proposition 1 that we have the density
of the optimal measure relative to . From this, we can derive
such that
is the density with respect to , that is, (i.e., ). □
3.2. Approximation with Flexible Marginal Measure
The following proposition reveals that the best approximation of a bivariate measure in terms of a product of independent measures is provided by the product of its marginals. With this, it follows that the objectives in (25) and (26) coincide for .
Proposition 2.
Let P be a measure and let π be a bivariate measure with marginal and . Then, it holds that
(39) where is an arbitrary measure.
Proof.
Define the Radon–Nikodým density and observe that the extension to is the density . It follows with (4) that
(40)
(41)
(42)
(43) which is the assertion. In case the measures are not absolutely continuous, the assertion in (40) is trivial. □
Suppose now that is a solution of the master problem (26) with some . It follows from the preceding proposition that the objective (26) improves when replacing the initial with the marginal of the optimal solution, that is, .
3.3. The Relation of Soft Quantization and Entropy
The soft quantization problem (26) involves the Kullback–Leibler divergence and not the entropy. The major advantage of the formulation presented above is that it works for discrete, continuous, or mixed measures, while entropy usually needs to be defined separately for discrete and continuous measures.
For a discrete measure with and , the Kullback–Leibler divergence (4) is
(44)
(45)
where
is the cross-entropy of the discrete measure and P, while
(46)
is the entropy of the discrete measure.
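For discrete probability vectors, the decomposition of the Kullback–Leibler divergence into cross-entropy minus entropy can be checked directly; a tiny sketch with the natural logarithm and strictly positive entries assumed (names are illustrative):

```julia
# Numerical check of D(q ‖ p) = H(q, p) − H(q) for strictly positive
# discrete probability vectors (natural logarithm throughout).
p = [0.2, 0.3, 0.5]
q = [0.1, 0.6, 0.3]

kl(q, p)    = sum(q .* log.(q ./ p))    # Kullback–Leibler divergence
cross(q, p) = -sum(q .* log.(p))        # cross-entropy H(q, p)
ent(q)      = -sum(q .* log.(q))        # entropy H(q)

println(kl(q, p))                       # ≈ 0.1933
println(cross(q, p) - ent(q))           # the same value
```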
For a measure with marginals P and , the cross-entropy is
(47)
(48)
(49)
where we have used the marginals from (2). Note that (49) does not depend on ; hence, does not depend on .
With (44), the quantization problem (26) can be rewritten equivalently as
(50)
by involving the entropy only. For this reason, we call the master problem in (26) the entropy-regularized problem.
4. Soft Tessellation
The quantization problem (25) consists of finding a good (in the best case, the optimal) approximation of a general probability measure P using a simple, discrete measure. Thus, the problem consists of finding good weights as well as good locations. Quantization employs the Wasserstein distance to measure the quality of the approximation; instead, soft quantization involves the regularized Wasserstein distance, as in (26):
where the measures on supported by not more than m points (cf. (23)) are as follows:
We separate the problems of finding the best weights and locations. The following Section 4.1 addresses the problem of finding the optimal weights ; the subsequent Section 4.2 then addresses the problem of finding the optimal locations . As well, we elaborate the numerical advantages of soft quantization below.
4.1. Optimal Weights
Proposition 1 above is formulated for the general probability measures P and . The desired measure in quantization is a simple and discrete measure. To this end, recall that, per Remark 5, measures which are feasible for (26) have marginals with . It follows that the support of the marginal is smaller than the support of , that is,
For a simple measure with , it follows in particular that . In this subsection, we consider the measure and the support to be fixed.
To unfold the result of Proposition 1 for discrete measures, recall the smooth minimum and the softmin function for the discrete (empirical or uniform) measure . For this measure, the smooth minimum (6) explicitly is
This function is occasionally referred to as the LogSumExp function. The softmin function (or Gibbs density (12)) is
It follows from Lemma 2 that the best approximating measure is , where the vector q of the optimal weights relative to is provided explicitly by
(51)
which involves computing expectations.
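Once the quantizer locations are fixed, the expectations in (51) can be estimated by plain Monte Carlo. The following sketch does so for the standard normal distribution; the squared Euclidean cost, the normalization of the softmin weights to sum to one, and all names are assumptions made for this illustration.

```julia
# Monte Carlo sketch of the optimal weights: the weight of a quantizer is the
# expected softmin weight it receives over samples ξ ~ P (cf. (51)).
using Statistics, Random

softmin_weights(c, λ) = (w = exp.(-(c .- minimum(c)) ./ λ); w ./ sum(w))

Random.seed!(1)
x = [-1.5, 0.0, 1.5]          # fixed quantizer locations
λ = 0.5
ξ = randn(100_000)            # samples from P = N(0, 1)

q = mean(softmin_weights((s .- x) .^ 2, λ) for s in ξ)
println(round.(q, digits = 3))   # the weights sum to one; the middle quantizer receives the most mass
```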
Soft Tessellation
For , the softmin function is
That is, the mapping can serve for classification, i.e., tessellation; the point is associated with if , and the corresponding region is known as a Voronoi diagram.
For , the softmin is not a strict indicator, and can instead be interpreted as probability; that is,
is the probability of allocating to the quantizer .
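The soft allocation thus interpolates between a genuinely probabilistic assignment and the hard Voronoi assignment. A small two-dimensional sketch, with centers and values made up for this illustration:

```julia
# Soft tessellation: the softmin weights give the probability of allocating a point
# to each quantizer; as the regularization decreases, the allocation hardens into
# the usual (nearest-center) Voronoi assignment.
using LinearAlgebra

centers = [[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]]
point   = [1.2, 0.4]

cost     = [norm(point - c)^2 for c in centers]                       # squared Euclidean cost
alloc(λ) = (w = exp.(-(cost .- minimum(cost)) ./ λ); w ./ sum(w))

println(round.(alloc(5.0),  digits = 3))   # diffuse allocation over the three centers
println(round.(alloc(0.05), digits = 3))   # nearly one-hot
println(argmin(cost))                      # the hard Voronoi cell: center 2
```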
Remark 6
(K-means and quantization). K-means clustering is a widely used unsupervised machine learning algorithm that groups datapoints into clusters based on their similarity. Voronoi tessellation, on the other hand, is a geometrical concept used to partition a space into regions, each of which is associated with a specific point or seed. Notably, this is an unavoidable concept in the investigation of optimal quantizers. In the K-means context, Voronoi tessellation helps to define cluster boundaries. Each cluster’s boundary is constructed as the region within which the datapoints are closer to its cluster center than to any other (cf. Graf and Luschgy [4], Chapter I).
4.2. Optimal Locations
As a result of Proposition 1, the objective in (27) is an expectation. To identify the optimal support points , it is essential to first minimize
(52)
This is a stochastic, nonlinear, and non-convex optimization problem:
(53)
where the objective function is nonlinear and non-convex; the optimal quantization problem thus constitutes an unconstrained, stochastic, non-convex, and nonlinear optimization problem. According to the chain rule and the gradients derived in Section 2.3, the gradient of the objective is constructed from the components
(54)
that is,
(55)
where ‘·’ denotes the Hadamard (element-wise) product of the vectors built from the corresponding componentwise entries. In other words, the gradient of the LogSumExp function (53) is the softmin function, as Section 2.3 explicitly illustrates.
Algorithm 1 is a stochastic gradient algorithm used to minimize (51), which collects the elements of the optimal weights and the optimal locations provided here and in the preceding section.
Algorithm 1: Stochastic gradient algorithm to find the optimal quantizers and optimal masses.
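To make the procedure concrete, the following one-dimensional sketch loosely follows the structure of Algorithm 1: a mini-batch is drawn from P, each sample is softly allocated to the quantizers via the softmin weights, the locations are moved along the stochastic gradient with a diminishing step size, and the weights are finally estimated as in (51). The function names, the squared Euclidean cost, the initialization, and the step-size rule are choices made for this illustration and do not reproduce the authors' implementation.

```julia
# One-dimensional stochastic gradient sketch of soft quantization.
using Statistics, Random

softmin_weights(c, λ) = (w = exp.(-(c .- minimum(c)) ./ λ); w ./ sum(w))

function soft_quantize(sampler, m; λ = 0.05, iters = 20_000, batch = 32, η = 0.05)
    x = [sampler() for _ in 1:m]                       # initialize locations with samples from P
    for t in 1:iters
        g = zeros(m)
        for _ in 1:batch
            ξ = sampler()
            σ = softmin_weights((ξ .- x) .^ 2, λ)      # soft allocation of the sample ξ
            g .+= σ .* 2 .* (x .- ξ)                   # chain rule: softmin times cost gradient
        end
        x .-= (η / (batch * sqrt(t))) .* g             # diminishing step size
    end
    q = mean(softmin_weights((sampler() .- x) .^ 2, λ) for _ in 1:50_000)  # weights, cf. (51)
    perm = sortperm(x)
    return x[perm], q[perm]
end

Random.seed!(3)
x, q = soft_quantize(randn, 8)        # quantize the standard normal distribution
println(round.(x, digits = 2))
println(round.(q, digits = 3))
# Rerunning with a large regularization parameter (e.g. λ = 50) lets all
# locations collapse towards the mean, in line with Theorem 1.
```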
Example 1.
To provide an example of the gradient of the distance function in (54) ((55), resp.), the derivative of the weighted norm
is
4.3. Quantization with Large Regularization Parameters
The entropy in (46) is minimal for a Dirac measure concentrated at a single point x (where x is any point of the space); in this case the entropy vanishes, while it is strictly positive for any other measure. For larger values of the regularization parameter, the objective in (50), and as such the objective of the master problem (23), will presumably prefer a measure supported on fewer points. This is indeed the case, as stated by Theorem 1 above. We provide its proof below, after formally defining the center of the measure.
Definition 5
(Center of the measure). Let P be a probability measure on and let d be a distance on . The point is a center of the measure P with respect to the distance d if
provided that for some (i.e., any) and .
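For the squared Euclidean distance, the center of the measure coincides with the mean; a brute-force sketch that restricts the candidate centers to the sample itself (the choice of distance and all names are illustrative):

```julia
# Brute-force sketch of a center of an empirical measure: the candidate point
# minimizing the expected distance; with the squared Euclidean distance this is
# (up to sampling error) the mean.
using Statistics, Random

Random.seed!(2)
ξ = randn(2_000) .+ 3.0                 # sample from P = N(3, 1)
cost(x) = mean((ξ .- x) .^ 2)           # expected squared distance to the point x
center = ξ[argmin(cost.(ξ))]            # restrict candidate centers to the sample
println((center, mean(ξ)))              # both values are close to 3
```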
In what follows, we demonstrate that the regularized quantization problem (50) links the optimal quantization problem and the center of the measure.
Proof of Theorem 1.
According to Proposition 1, Problems (50) and (26) are equivalent. Now, assume that for all i, ; then, for , and it follows that
Thus, the minimum of the optimization problem is attained at for each , where a is the center of the measure P with respect to the distance d. It follows that is a local minimum and a stationary point satisfying the first order conditions
for the function f provided in (53). Note as well that
and as such the softmin function does not depend on at the stationary point .
Recall from (54) that
According to the product rule, the Hessian matrix is
(56) Note that the second expression is positive definite, as the Hessian of the convex function is positive definite and . Further, the Hessian of the smooth minimum (see Appendix A) is
where the matrix is
This matrix is positive definite (as ) and in Loewner order; indeed, is the covariance matrix of the multinomial distribution. It follows that the first term in (56) is , while the second is , such that (56) is positive definite for sufficiently small . Thus, the extremal point is a minimum for all . In particular, there exists such that (56) is positive definite for every , hence, the result. □
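A matrix of the form diag(σ) − σσᵀ, with σ a probability vector, is the covariance matrix of a single multinomial draw, as referred to in the proof, and is therefore positive semidefinite; a quick numerical check (notation and values are illustrative):

```julia
# diag(σ) − σσᵀ is the covariance matrix of a single multinomial draw with
# probabilities σ, hence positive semidefinite; one eigenvalue is zero, since
# its rows sum to zero.
using LinearAlgebra

σ = [0.1, 0.2, 0.3, 0.4]                  # any probability vector, e.g. softmin weights
M = Diagonal(σ) - σ * σ'
println(round.(eigvals(Symmetric(Matrix(M))), digits = 4))   # all eigenvalues are ≥ 0
```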
5. Numerical Illustration
This section presents numerical findings for the approaches and methods discussed earlier. The Julia implementations for these methods are available online (cf. https://github.com/rajmadan96/SoftQuantization.git, accessed on 8 September 2023).
In the following experiments, we approximate the measure P with a finite discrete measure using the stochastic gradient algorithm presented in Algorithm 1.
5.1. One Dimension
First, we perform the analysis in one dimension. In this experiment, our problem of interest is to find entropy-regularized optimal quantizers for
(i.e., the normal and exponential distributions with standard parameters). To highlight the particular features of the method, we consider only a small number of quantizers.
Figure 1 illustrates the results of soft quantization of the standard normal distribution and the exponential distribution. It is apparent that when the regularization parameter is increased beyond a certain threshold (cf. Theorem 1), the quantizers converge towards the center of the measure (i.e., the mean), while for smaller values the quantizers identify the actual optimal locations with greater accuracy. Furthermore, we emphasize that our proposed method is capable of identifying the mean location regardless of the shape of the distribution, which this experiment empirically substantiates.
Figure 1.
Soft quantization of measures on the real line with a varying regularization parameter and eight quantization points. (a) Normal distribution: for large regularization, the best approximation resides at the center of the measure; for intermediate regularization, the approximation is reduced to only six points, as two of the remaining points have probability 0; for small regularization, we obtain the standard quantization. (b) Exponential distribution: the measures concentrate on one (large regularization), three, five, and eight quantization points as the regularization decreases.
For a better understanding of the distribution of the weights (probabilities) and their respective positions, the following examination involves the calculation of the cumulative distribution function. Additionally, we consider
as a problem of interest, which is a notably distinct scenario in terms of shape compared to the measures examined previously.
Figure 2 provides the results. It is evident that the number of distinct quantizers decreases as the regularization parameter increases. When the parameter reaches a specific threshold, all quantizers converge towards the center of the measure, represented by the mean (i.e., 4).
Figure 2.
Soft quantization of the Gamma distribution with a varying regularization parameter; the approximating measure simplifies as the parameter increases. (a) Approximate solution to the standard quantization problem with eight quantizers. (b) The eight quantization points collapse to seven quantization points. (c) The eight quantization points collapse to three quantization points. (d) The quantization points converge to a single point representing the center of the measure.
5.2. Two Dimensions
Next, we demonstrate the behavior of entropy-regularized optimal quantization for a range of regularization parameters in two dimensions. In the following experiment, we consider
as a problem of interest. Initially, we perform the experiment with four quantizers.
Figure 3 illustrates the findings. Figure 3a reveals a quantization pattern similar to that observed in the one-dimensional experiment. However, in Figure 3b we gain more detailed insight into the behavior of the quantizers for large regularization parameters, where they align diagonally before eventually colliding. Furthermore, the size of each point indicates the respective probability of the quantization point, which remains notably uniform across the varying regularization parameter.
Figure 3.
Two-dimensional soft quantization of the uniform distribution with a varying regularization parameter and four quantizers. (a) Uniform distribution. (b) Enlargement of (a): for larger values of the regularization parameter, the quantizers align while converging to the center of the measure.
Again, we consider a uniform distribution as the problem of interest in the subsequent experiment, this time employing sixteen quantizers for enhanced comprehension. Figure 4 encapsulates the essence of the experiment, offering an extensive visual representation. In contrast to the previous experiment, it can be observed that for intermediate regularization values the quantizers assemble at the nearest strong points (in terms of high probability) rather than converging towards the center of the measure (see Figure 4b,c). Subsequently, for larger regularization values, they move from these strong points towards the center, where they align diagonally before colliding (see Figure 4d). More concisely, for small regularization values we achieve the genuine quantization solution (see Figure 4a). As the regularization increases, the quantizers with lower probabilities converge towards those with the nearest higher probabilities. Subsequently, all quantizers converge towards the center of the measure, represented by the mean of the respective measure.
Figure 4.
Soft quantization of the uniform distribution with a varying regularization parameter; the approximating measure simplifies as the parameter increases. (a) Approximate solution to the standard quantization problem with sixteen quantizers. (b) The sixteen quantization points collapse to eight quantization points. (c) The sixteen quantization points collapse to four quantization points. (d) The quantization points converge, in an aligned way, to a single point representing the center of the measure.
Thus far, we have conducted two-dimensional experiments employing different numbers of quantizers (four and sixteen) with the uniform distribution. These experiments can be categorized under the k-means approach (see Remark 6). Next, we delve into the complexity of a multivariate normal distribution, with the aim of enhancing comprehension. More precisely, our problem of interest is to find a soft quantization for
where
In this endeavor, we employ a larger number of quantizers. Figure 5 captures the core essence of the experiment, delivering a comprehensive and visually illustrative representation. From the experiment, it is evident that the initial diagonal alignment precedes convergence toward the center of the measure as the regularization parameter increases. Additionally, a noticeable shift can be observed on the part of the points with lower probabilities towards those with higher probabilities. This experiment highlights that the threshold of the regularization parameter for achieving convergence or diagonal alignment at the center of the measure depends on the number of quantizers employed.
Figure 5.
Two-dimensional soft quantization of the normal distribution with a varying regularization parameter. (a) The solution to the standard quantization problem; (b) and (c) show increasing values of the regularization parameter.
6. Summary
In this study, we have enhanced the stability and simplicity of the standard quantization problem by introducing a novel method of quantization using entropy. Propositions 1 and 2 thoroughly elucidate the intricacies of the master problem (25). Our substantiation of the convergence of quantizers to the center of the measure explains the transition from a complex hard optimization problem to a simplified configuration (see Theorem 1). More concisely, this transition underscores the fundamental shift towards a more tractable and straightforward computational framework, marking a significant advancement in terms of the overall approach. Moreover, in Section 5, we provide numerical illustrations of our method that confirm its robustness, stability, and properties, as discussed in our theoretical results. These numerical demonstrations serve as empirical evidence reinforcing the efficacy of our proposed approach.
Appendix A. Hessian of the Softmin
The empirical measure is a probability measure. From Jensen’s inequality, it follows that . Thus, the smooth minimum involves a cumulant generating function, for which we derive
(A1)
(A2)
where is the j-th cumulant with respect to the empirical measure. Specifically,
where is the ‘sample mean’ and the ‘sample variance’. The following cumulants (, etc.) are more involved. The Taylor series expansion and
Note as well that the softmin function is the gradient of the smooth minimum:
The softmin function is frequently used in classification in a maximum likelihood framework. It holds that
for and
that is,
Author Contributions
The authors have contributed equally to this article. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement
Data is available at https://github.com/rajmadan96/SoftQuantization.git, accessed on 8 September 2023.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
DFG, German Research Foundation—Project-ID 416228727—SFB 1410.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
1. Graf, S.; Mauldin, R.D. A Classification of Disintegrations of Measures. Contemp. Math. 1989, 94, 147–158.
2. Luschgy, H.; Pagès, G. Greedy vector quantization. J. Approx. Theory 2015, 198, 111–131. doi:10.1016/j.jat.2015.05.005.
3. El Nmeir, R.; Luschgy, H.; Pagès, G. New approach to greedy vector quantization. Bernoulli 2022, 28, 424–452. doi:10.3150/21-BEJ1350.
4. Graf, S.; Luschgy, H. Foundations of Quantization for Probability Distributions; Lecture Notes in Mathematics, Volume 1730; Springer: Berlin, Germany, 2000.
5. Breuer, T.; Csiszár, I. Measuring distribution model risk. Math. Financ. 2013, 26, 395–411. doi:10.1111/mafi.12050.
6. Breuer, T.; Csiszár, I. Systematic stress tests with entropic plausibility constraints. J. Bank. Financ. 2013, 37, 1552–1559. doi:10.1016/j.jbankfin.2012.04.013.
7. Pichler, A.; Schlotter, R. Entropy based risk measures. Eur. J. Oper. Res. 2020, 285, 223–236. doi:10.1016/j.ejor.2019.01.016.
8. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
9. Zhuang, B.; Liu, L.; Tan, M.; Shen, C.; Reid, I. Training quantized neural networks with a full-precision auxiliary module. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 1488–1497. Available online: https://openaccess.thecvf.com/content_CVPR_2020/html/Zhuang_Training_Quantized_Neural_Networks_With_a_Full-Precision_Auxiliary_Module_CVPR_2020_paper.html (accessed on 6 October 2023).
10. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. Adv. Neural Inf. Process. Syst. 2016, 29. Available online: https://proceedings.neurips.cc/paper_files/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html (accessed on 6 October 2023).
11. Polino, A.; Pascanu, R.; Alistarh, D.-A. Model compression via distillation and quantization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. Available online: https://research-explorer.ista.ac.at/record/7812 (accessed on 6 October 2023).
12. Bhattacharya, K. Semi-classical description of electrostatics and quantization of electric charge. Phys. Scr. 2023, 98, 8. doi:10.1088/1402-4896/ace1b0.
13. Scheunders, P. A genetic Lloyd-Max image quantization algorithm. Pattern Recognit. Lett. 1996, 17, 547–556. doi:10.1016/0167-8655(96)00011-6.
14. Wei, L.Y.; Levoy, M. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000; pp. 479–488. Available online: https://dl.acm.org/doi/abs/10.1145/344779.345009 (accessed on 6 October 2023).
15. Heskes, T. Self-organizing maps, vector quantization, and mixture modeling. IEEE Trans. Neural Netw. 2001, 12, 1299–1305. doi:10.1109/72.963766.
16. Pagès, G.; Pham, H.; Printems, J. Optimal Quantization Methods and Applications to Numerical Problems in Finance. In Handbook of Computational and Numerical Methods in Finance; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2004; pp. 253–297.
17. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013.
18. Ramdas, A.; García Trillos, N.; Cuturi, M. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 2017, 19, 47. doi:10.3390/e19020047.
19. Neumayer, S.; Steidl, G. From optimal transport to discrepancy. In Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging: Mathematical Imaging and Vision; Springer: Berlin/Heidelberg, Germany, 2021; pp. 1–36.
20. Altschuler, J.; Bach, F.; Rudi, A.; Niles-Weed, J. Massively scalable Sinkhorn distances via the Nyström method. In Advances in Neural Information Processing Systems, Volume 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019.
21. Lakshmanan, R.; Pichler, A.; Potts, D. Nonequispaced Fast Fourier Transform Boost for the Sinkhorn Algorithm. ETNA—Electron. Trans. Numer. Anal. 2023, 58, 289–315. doi:10.1553/etna_vol58s289.
22. Ba, F.A.; Quellmalz, M. Accelerating the Sinkhorn algorithm for sparse multi-marginal optimal transport via fast Fourier transforms. Algorithms 2022, 15, 311. doi:10.3390/a15090311.
23. Lakshmanan, R.; Pichler, A. Fast approximation of unbalanced optimal transport and maximum mean discrepancies. arXiv 2023, arXiv:2306.13618. doi:10.48550/arXiv.2306.13618.
24. Monge, G. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour la même année, 1781; pp. 666–704. Available online: https://cir.nii.ac.jp/crid/1572261550791499008 (accessed on 6 October 2023).
25. Kantorovich, L. On the translocation of masses. J. Math. Sci. 2006, 133, 1381–1382. doi:10.1007/s10958-006-0049-2.
26. Villani, C. Topics in Optimal Transportation; Graduate Studies in Mathematics, Volume 58; American Mathematical Society: Providence, RI, USA, 2003.
27. Rachev, S.T.; Rüschendorf, L. Mass Transportation Problems. Volume I: Theory, Volume II: Applications; Probability and Its Applications, Volume XXV; Springer: New York, NY, USA, 1998.
28. Rüschendorf, L. Mathematische Statistik; Springer: Berlin/Heidelberg, Germany, 2014.
29. Pflug, G.Ch.; Pichler, A. Multistage Stochastic Optimization; Springer Series in Operations Research and Financial Engineering; Springer: Berlin/Heidelberg, Germany, 2014.