A Free Energy Based Approach for Distance Metric Learning

Sho Inaba; Carl T Fakhry; Rahul V Kulkarni; Kourosh Zarringhalam

doi:10.1145/3292500.3330975

Author manuscript; available in PMC: 2021 Mar 10.

Published in final edited form as: KDD. 2019 Jul;2019:5–13. doi: 10.1145/3292500.3330975

PMCID: PMC7945720 NIHMSID: NIHMS1676359 PMID: 33708457

Abstract

We present a reformulation of the distance metric learning problem as a penalized optimization problem, with a penalty term corresponding to the von Neumann entropy of the distance metric. This formulation leads to a mapping to statistical mechanics such that the metric learning optimization problem becomes equivalent to free energy minimization. Correspondingly, our approach leads to an analytical solution of the optimization problem based on the Boltzmann distribution. The mapping established in this work suggests new approaches for dimensionality reduction and provides insights into determination of optimal parameters for the penalty term. Furthermore, we demonstrate that the metric projects the data onto the directions of maximum dissimilarity with optimal and tunable separation between classes, and thus the transformation can be used for high-dimensional data visualization, classification, and clustering tasks. We benchmark our method against previous distance learning methods and provide an efficient implementation in an R package available to download at: https://github.com/kouroshz/fenn

Keywords: distance metric learning, dimensionality reduction, high-dimensional data visualization

1. INTRODUCTION

The main objective of distance metric learning is to ‘optimally’ transform the data in order to bring similar points closer to each other while keeping the distance between the dissimilar points bounded away from zero. Distance metric learning can be traced back to the works in [2, 8, 11, 12]. However, the formulation in [26] is generally regarded as the modern origin of the approach. In [26], the transformation is identified by minimizing the Mahalanobis distance between similar pairs with the constraint that the distance between the dissimilar pairs is bounded below by a constant. Since its original introduction, researchers have developed a wealth of algorithms for distance metric learning with different flavors. Distance learning methods are designed to improve the performance of learning algorithms such as clustering with k-means [1, 5, 16, 26] or classification with k-NN [9, 17, 24, 25]. While the intended applications of these methods can be diverse, the overall schemes show considerable overlap. We refer the reader to [3], [14] and [22] for comprehensive surveys on distance metric learning.

As noted by the authors in [26], if the same metric is used to measure the distance between points in each class, the optimal solution will project the data onto a line, which is not suitable for most of the intended applications. Modifications to the original problem are typically applied to avoid such ‘degenerate’ solutions. For instance in [26], the authors propose using a transformation (square root) of the metric in the constraint part of the problem. Similarly, in [27], the authors propose a method called DML-eig, which also applies modifications to the metric. Another popular approach is the Large Margin Nearest Neighbor (LMNN) developed in [24, 25, 27], wherein local information is used to learn a global transformation matrix. In [5] and [23], authors propose information theoretic methods for metric learning.

In this paper, we reformulate the metric learning optimization problem in a way that does not involve modifications to the metric and keeps the metric consistent for measuring distances between points, regardless of class. To avoid trivial solutions, we propose to add a penalty term that is based on the Von Neumann entropy. Our choice of the entropy-based penalty term is motivated by a natural connection to the mathematical formalism used for Quantum Statistical Mechanics (QSM). With this formulation, we develop a mapping of the optimization problem to minimization of the Helmholtz free energy in QSM. Establishing such a mapping allows for approaches and insights from statistical mechanics to be used for solving the optimization problem and for other applications. Importantly, we provide an analytical solution for the optimization problem and therefore, unlike many widely used distance learning algorithms, our method does not rely on gradient-descent based solvers, which are well known for their high sensitivity to scaling and learning rates. The analytical approach also avoids costly computations such as eigenvalue decomposition in each iteration. Further, we demonstrate that our analytical solution identifies the most discriminant features that result in maximum separation between classes. We also establish a connection between our analytical solution and multi-class Linear Discriminant Analysis (LDA) and show that our approach can be used to make improvements to multi-class LDA by providing tunable scaling that leads to controlled separation of classes, while the connection to QSM provides physical intuition for setting the scaling for optimal separation.

In the following sections, we establish the theoretical basis for our method. The primary focus of this paper is on the theoretical insights and generalizations based on the mappings established to statistical mechanics and the geometrical aspects of the metric. To illustrate the effectiveness of the proposed approach, we apply it for supervised classification of several traditional UCI datasets and include comparisons with results from previous distance metric learning methods. We close the paper with a brief conclusion and discussion of future directions.

2. METRIC LEARNING MODEL AND RELATION TO OTHER METHODS

Let $\{(x_{i}, \ell (x_{i}))\}_{i = 1}^{n}$ be a set of labeled training data with sample points $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{T} \in ℝ^{p}$ and their labels $\ell (x_{i})$. Let $S = \{(x, y) ∣ \ell (x) = \ell (y)\}$ and $D = \{(x, y) ∣ \ell (x) \neq \ell (y)\}$ be the sets of similar and dissimilar training examples respectively. Let $S_{+}^{d}$ denote the cone of positive semi-definite matrices. Our goal is to find an optimal (pseudo) metric that directly minimizes the distance between similar pairs while keeping the dissimilar pairs apart. As a starting point, consider the optimization problem originally proposed in [26]:

min_{M \in S_{+}^{d}} \sum_{(x, y) \in S} {(x - y)}^{T} M (x - y) s.t. \sum_{(x, y) \in D} {(x - y)}^{T} M (x - y) \geq 1 .

(1)

Here ${(x - y)}^{T} M (x - y) = d_{M}^{2} (x, y)$ is a pseudo metric. As noted by the authors in [26], the solution of this problem projects the data onto a line. To avoid this, Xing et al. [26] propose solving the following alternative problem:

max_{M \in S_{+}^{d}} \sum_{(x, y) \in D} d_{M} (x, y) s.t. \sum_{(x, y) \in S} d_{M}^{2} (x, y) \leq 1.

(2)

Several other reformulations of this problem have been proposed. Most closely related to our method is the formulation proposed by Ying and Li [27]:

max_{M \in S_{+}^{d}} min_{(x, y) \in D} d_{M}^{2} (x, y) s.t. \sum_{(x, y) \in S} d_{M}^{2} (x, y) \leq 1.

(3)

Ying and Li show that the problem in (3) can be reformulated as a generalized eigenvalue optimization, and subsequently show that LMNN [24, 25] can also be formulated in a similar fashion as an eigenvalue optimization problem. Here we present an alternative approach to recast the original problem in (1) as an eigenvalue optimization problem with the important distinction that our formulation has an analytic solution. We will also show that our method has a natural mapping to statistical mechanics, thereby providing a bridge between a class of distance metric learning problems and a class of well-established physical problems. Moreover, we will provide a geometric interpretation of our method and demonstrate that the learned metric identifies directions of maximum separation between classes and generalizes LDA.

3. VON NEUMANN ENTROPY PENALIZED DISTANCE METRIC LEARNING

To begin, we first normalize the original problem with the number of classes and numbers of samples in each class. This normalization helps to avoid problems based on combining unbalanced datasets and it provides a consistent framework for our results regardless of the number of classes in the training data. Let N be the number of classes and let n_x be the number of samples in the class containing the sample point x. Then, we consider the normalized problem:

min_{M \in S_{+}^{d}} \frac{1}{N} \sum_{(x, y) \in S} \frac{1}{n_{x}^{2}} {(x - y)}^{T} M (x - y) s.t. \frac{1}{N (N - 1)} \sum_{(x, y) \in D} \frac{1}{n_{x} n_{y}} {(x - y)}^{T} M (x - y) \geq 1.

(4)

Next, we introduce some notation and reformulate our optimization problem. Let τ = (x, y) represent a pair of samples and let X_τ = (x − y)(x − y)^T be the un-normalized projection onto x − y. Then, we have ${(x - y)}^{T} M (x - y) = t r (X_{τ}^{T} M) = 〈 X_{τ}, M 〉$ . Define

X_{D} = \frac{1}{N (N - 1)} \sum_{τ \in D} \frac{1}{n_{x} n_{y}} X_{τ},

(5)

and

X_{S} = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} X_{τ} .

(6)

Without loss of generality we may assume that $X_{D}$ is a positive definite matrix ( $X_{D}$ is symmetric positive semi-definite and if needed, we may replace $X_{D}$ by $X_{D} + δ I$ for a small value of δ to make it a positive definite matrix). We now reformulate the optimization problem in (4), following the same approach as in [27].

Lemma 3.1.

The inequality constraint in (4) can be replaced with an equality constraint to obtain the following optimization problem:

min_{M \in S_{+}^{d}} 〈 X_{S}, M 〉 s.t. 〈 X_{D}, M 〉 = 1.

(7)

Proof.

Let M* be the optimal solution of problem (4) and define $M_{0}^{*} = \frac{M^{*}}{〈 X_{D}, M^{*} 〉}$ . Then $M_{0}^{*}$ is positive semi-definite and $〈 X_{D}, M_{0}^{*} 〉 = 1$ , so it is feasible. Moreover, since $〈 X_{D}, M^{*} 〉 \geq 1$ by the constraint in (4), we have $〈 X_{S}, M_{0}^{*} 〉 = \frac{1}{〈 X_{D}, M^{*} 〉} 〈 X_{S}, M^{*} 〉 \leq 〈 X_{S}, M^{*} 〉$ . Hence $M_{0}^{*}$ is also an optimal solution. □

For a given positive semi-definite matrix M, let $S = X_{D}^{1 / 2} M X_{D}^{1 / 2}$ and let $\tilde{x} = X_{D}^{- 1 / 2} x$ . Define

{\tilde{X}}_{τ} = (\tilde{x} - \tilde{y}) {(\tilde{x} - \tilde{y})}^{T} = (X_{D}^{- 1 / 2} (x - y)) {(X_{D}^{- 1 / 2} (x - y))}^{T},

(8)

{\tilde{X}}_{S} = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} {\tilde{X}}_{τ} .

(9)

We have the following theorem.

Theorem 3.2.

The constrained optimization problem in (7) can be reformulated as follows:

min_{S \in P} 〈 {\tilde{X}}_{S}, S 〉,

(10)

where $P$ is the spectrahedron $P = \{S \in S_{+}^{d} ∣ t r (S) = 1\}$ .

Proof.

Note that

{\tilde{X}}_{τ} = (X_{D}^{- 1 / 2} (x - y)) {(X_{D}^{- 1 / 2} (x - y))}^{T} = X_{D}^{- 1 / 2} X_{τ} X_{D}^{- 1 / 2} .

(11)

It is straightforward to show that $〈 {\tilde{X}}_{S}, S 〉 = 〈 X_{S}, M 〉$ and that $〈 X_{D}, M 〉 = t r (X_{D}^{1 / 2} M X_{D}^{1 / 2}) = t r (S)$ . Hence $〈 X_{D}, M 〉 = 1$ if and only if tr(S) = 1 and we obtain a simple optimization problem. □
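For readers who want the intermediate steps, the two identities used in the proof can be written out as follows. This is a sketch that assumes only that $X_{D}$ is positive definite (so that $X_{D}^{\pm 1 / 2}$ exist and are symmetric), the cyclic property of the trace, and the relation ${\tilde{X}}_{S} = X_{D}^{- 1 / 2} X_{S} X_{D}^{- 1 / 2}$, which follows from Eqs. (9) and (11) by linearity:

$$
\begin{aligned}
\langle \tilde{X}_{S}, S \rangle &= \operatorname{tr}\big(X_{D}^{-1/2} X_{S} X_{D}^{-1/2} \, X_{D}^{1/2} M X_{D}^{1/2}\big) = \operatorname{tr}\big(X_{D}^{-1/2} X_{S} M X_{D}^{1/2}\big) = \operatorname{tr}(X_{S} M) = \langle X_{S}, M \rangle, \\
\operatorname{tr}(S) &= \operatorname{tr}\big(X_{D}^{1/2} M X_{D}^{1/2}\big) = \operatorname{tr}(X_{D} M) = \langle X_{D}, M \rangle .
\end{aligned}
$$

Because the map $M \mapsto X_{D}^{1/2} M X_{D}^{1/2}$ is invertible and preserves positive semi-definiteness, minimizing over $M$ subject to $〈 X_{D}, M 〉 = 1$ and minimizing over $S \in P$ describe the same problem.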

As mentioned before, the optimal solution of the above problem will project the data points onto a line. As such, we propose the following smoothed optimization problem that will avoid such solutions:

min_{S \in P} 〈 {\tilde{X}}_{S}, S 〉 - μ H (S) = min_{S \in P} t r ({\tilde{X}}_{S} S) - μ H (S) = min_{S \in P} \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} {(\tilde{x} - \tilde{y})}^{T} S (\tilde{x} - \tilde{y}) - μ H (S),

(12)

where H(S) is the Von Neumann entropy of S and is defined by $H (S) = - \sum_{i = 1}^{p} λ_{i} log (λ_{i})$ and μ is a smoothing parameter. Here the λ_i’s are the eigenvalues of the matrix S. Note that since $S \in S_{+}^{d}$ and tr(S) = 1, the eigenvalues necessarily satisfy the constraints λ_i ≥ 0 and $\sum_{i = 1}^{p} λ_{i} = 1$ (with the usual convention 0 log 0 = 0).

3.1. Analytical solution of the optimization problem

Optimization problems for distance metric learning are typically addressed using numerical approaches. However, given our use of the Von Neumann Entropy H(S) as a smoothing term, an analytical expression for the solution of the optimization problem can be derived. In the following, we present two approaches for obtaining the analytical solution. One approach involves a direct solution of the optimization problem using Lagrange multipliers, whereas the other method involves a mapping to statistical mechanics. Both approaches are instructive and lead to insights for further analysis as presented below.

3.1.1. Analytic solution using Lagrange multipliers.

The constraint $S \in P$ is equivalent to S symmetric, λ_k ≥ 0, and $\sum_{k = 1}^{p} λ_{k} = 1$ . Define the Lagrangian

L (S) = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} {(\tilde{x} - \tilde{y})}^{T} S (\tilde{x} - \tilde{y}) + μ \sum_{k = 1}^{p} λ_{k} log (λ_{k}) + ξ (\sum_{k = 1}^{p} λ_{k} - 1),

where ξ is the Lagrange multiplier. We will make use of the following lemma.

Lemma 3.3.

The derivative of −H(S) with respect to S is given by the following expression:

- \nabla_{S} [H (S)] = \sum_{k = 1}^{p} \nabla_{S} [λ_{k} log (λ_{k})] = \sum_{k = 1}^{p} (1 + log (λ_{k})) u_{k} u_{k}^{T}

where the λ_k’s and u_k’s are the eigenvalues and eigenvectors of S.

Proof.

Let s = s_ij denote the (i, j)-th entry of S and let S = UΛU^T be the spectral decomposition of S. Then,

\frac{\partial Λ}{\partial s} = U^{T} \frac{\partial S}{\partial s} U - U^{T} \frac{\partial U}{\partial s} Λ - Λ \frac{\partial U^{T}}{\partial s} U

(13)

Note that ${[U^{T} \frac{\partial S}{\partial s} U]}_{x y} = U_{i x} U_{j y}$ , and hence $\frac{\partial λ_{k}}{\partial s} = U_{i k} U_{j k} = {[u_{k} u_{k}^{T}]}_{i j}$ and $\nabla_{S} λ_{k} = u_{k} u_{k}^{T}$ . Examining the diagonal entries of both sides of equation (13) and observing that the last two terms have diagonal elements equal to 0, we get that:

\nabla_{S} [\sum_{k = 1}^{p} λ_{k} log (λ_{k})] = \sum_{k = 1}^{p} \nabla_{S} [λ_{k} log (λ_{k})] = \sum_{k = 1}^{p} (1 + log (λ_{k})) u_{k} u_{k}^{T} .

□

Theorem 3.4.

The solution S to the optimization problem in (12) is given by $S = \sum_{i = 1}^{p} λ_{i} u_{i} u_{i}^{T}$ where λ_i is the i-th eigenvalue given by:

λ_{i} = \frac{e^{- (u_{i}^{T} {\tilde{X}}_{S} u_{i}) / μ}}{\sum_{j = 1}^{p} e^{- (u_{j}^{T} {\tilde{X}}_{S} u_{j}) / μ}},

(14)

and its corresponding eigenvector u_i is the i-th eigenvector of ${\tilde{X}}_{S}$ .

Proof.

First, differentiating the first term of $L (S)$ with respect to S, we get

\nabla_{S} [\frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} {(\tilde{x} - \tilde{y})}^{T} S (\tilde{x} - \tilde{y})] = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} \nabla_{S} [{(\tilde{x} - \tilde{y})}^{T} S (\tilde{x} - \tilde{y})] = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} (\tilde{x} - \tilde{y}) {(\tilde{x} - \tilde{y})}^{T} = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} {\tilde{X}}_{τ} = {\tilde{X}}_{S} .

Next, using the derivative computed in Lemma 3.3, we get:

\nabla_{S} L (S) = {\tilde{X}}_{S} + μ \sum_{k = 1}^{p} (1 + log (λ_{k})) u_{k} u_{k}^{T} + ξ \sum_{k = 1}^{p} u_{k} u_{k}^{T} .

Setting the gradient of the Lagrangian equal to 0 and multiplying each side by u_i, we get that ${\tilde{X}}_{S} u_{i} = - μ (1 + log (λ_{i})) u_{i} - ξ u_{i}$ . In particular, this means that the u_i’s are eigenvectors of the symmetric matrix ${\tilde{X}}_{S}$ . Moreover, we have that $u_{i}^{T} {\tilde{X}}_{S} u_{i} = (- ξ - μ) - μ log (λ_{i})$ . Solving this equation for λ_i and applying the constraint $\sum_{k = 1}^{p} λ_{k} = 1$ , we arrive at the following solution for λ_i:

λ_{i} = \frac{e^{- (u_{i}^{T} {\tilde{X}}_{S} u_{i}) / μ}}{\sum_{j = 1}^{p} e^{- (u_{j}^{T} {\tilde{X}}_{S} u_{j}) / μ}} .

(15)

□

In other words, the previous theorem shows that the optimal solution of the eigenvalues is given by the Boltzmann distribution from which S can be constructed using the spectral decomposition as $S = \sum_{i = 1}^{p} λ_{i} u_{i} u_{i}^{T}$ . The optimal distance between the points in the transformed space is then calculated by

d_{S}^{2} (\tilde{x}, \tilde{y}) = {(\tilde{x} - \tilde{y})}^{T} S (\tilde{x} - \tilde{y}) = {[S^{1 / 2} (\tilde{x} - \tilde{y})]}^{T} [S^{1 / 2} (\tilde{x} - \tilde{y})] = {[S^{1 / 2} X_{D}^{- 1 / 2} (x - y)]}^{T} [S^{1 / 2} X_{D}^{- 1 / 2} (x - y)]

(16)

where $S^{1 / 2} = \sum_{i} λ_{i}^{1 / 2} u_{i} u_{i}^{T}$ . Hence the optimal distance is obtained by first transforming the data points by $S^{1 / 2} X_{D}^{- 1 / 2}$ and then computing the Euclidean distance in the transformed space. Note that S, $S^{1 / 2}$ and ${\tilde{X}}_{S}$ share common eigenvectors. In the next section, we will give a physical interpretation for these quantities.
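As a small numerical illustration of Theorem 3.4, the sketch below computes the Boltzmann weights of Eq. (14) from a list of energies $E_{i} = u_{i}^{T} {\tilde{X}}_{S} u_{i}$ and evaluates the penalized objective of Eq. (12) in the eigenbasis of ${\tilde{X}}_{S}$. The function names and the example energies are hypothetical and chosen only for illustration; they are not taken from the authors' R package.

```python
import math


def boltzmann_weights(energies, mu):
    """Eigenvalues lambda_i of S from Eq. (14): a softmax of -E_i / mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    e_min = min(energies)  # shifting the energies cancels in the ratio
    unnorm = [math.exp(-(e - e_min) / mu) for e in energies]
    z = sum(unnorm)
    return [w / z for w in unnorm]


def von_neumann_entropy(weights):
    """H = -sum_i lambda_i log(lambda_i), with the convention 0 log 0 = 0."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def penalized_objective(weights, energies, mu):
    """Objective of Eq. (12) in the eigenbasis: sum_i lambda_i E_i - mu * H."""
    average_energy = sum(w * e for w, e in zip(weights, energies))
    return average_energy - mu * von_neumann_entropy(weights)


# Hypothetical eigenvalues of the tilde-transformed X_S, used as energies.
energies = [0.2, 0.5, 1.0]
mu = 0.3

lam = boltzmann_weights(energies, mu)
uniform = [1.0 / len(energies)] * len(energies)

f_opt = penalized_objective(lam, energies, mu)
f_uniform = penalized_objective(uniform, energies, mu)
```

Evaluating `boltzmann_weights` with a very small or very large `mu` shows the two limiting regimes: nearly all weight on the lowest energy, or nearly equal weights.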

3.1.2. Analytical solution using the mapping to statistical mechanics.

The appearance of the Boltzmann distribution strongly indicates that the optimization problem can be mapped on to a problem in statistical mechanics. Establishing such a mapping may suggest analytical solutions for a broader class of optimization problems that are currently addressed numerically. In the following, we establish such a mapping for the distance metric optimization problem. We outline the mapping from the original optimization problem to the formalism for systems studied in quantum statistical mechanics (QSM). Consider an ensemble of systems in the same macroscopic or thermodynamic state. In QSM, the central quantity of interest is the density matrix ρ, which contains information about the ensemble probabilities assigned to different microstates of the system in equilibrium. For a system in equilibrium at fixed temperature T, the density matrix can be obtained by minimizing the Helmholtz Free energy [13] which is given by

F [ρ] = t r (ρ \hat{H}) - T H [ρ],

(17)

where the matrix $\hat{H}$ represents the Hamiltonian of the system and H[ρ] is the Von Neumann entropy. Note that we have set the Boltzmann constant k_B = 1. The eigenvectors of $\hat{H}$ form a complete basis set and the eigenvalues correspond to allowed energy values for microstates of the system.

To connect with the original optimization problem, we make the identification $\hat{H} = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} (\tilde{x} - \tilde{y}) {(\tilde{x} - \tilde{y})}^{T} = {\tilde{X}}_{S}$ . Correspondingly, the free energy $F [ρ] = t r (ρ \hat{H}) - T H [ρ]$ can be expressed as:

F [ρ] = t r (\frac{ρ}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} (\tilde{x} - \tilde{y}) {(\tilde{x} - \tilde{y})}^{T}) - T H [ρ] .

(18)

With the above mapping, minimization of the free energy is seen to be identical to the original optimization problem (Eq. (12)) with the identifications ρ = S and T = μ. Note that one of the properties of the density matrix is that tr(ρ) = 1, consistent with the constraint on S in the original problem. The density matrix ρ that minimizes the free energy can be derived using a unitary transformation to the basis set that diagonalizes the Hamiltonian $\hat{H}$ . In this basis, the density matrix ρ is also a diagonal matrix and its eigenvalues (λ_i) are given by the Boltzmann distribution, as obtained in the previous section. Correspondingly, the mapping developed can be summarized as follows: there is a one-to-one correspondence between the distance metric learning problem in Eq. (12) and the problem in QSM given in Eq. (18), i.e.,

min_{ρ} t r (ρ \hat{H}) - T H [ρ] \Leftrightarrow min_{S \in P} t r (S {\tilde{X}}_{S}) - μ H (S) .

3.2. Geometric interpretation

As we saw in the previous section, the analytical solution involves the eigenvalues and eigenvectors of the Hamiltonian matrix derived from the data, namely ${\tilde{X}}_{S}$ . In this section, we develop geometric interpretations for the terms $X_{S}$ , $X_{D}$ and ${\tilde{X}}_{S}$ to gain further insight into the geometry of the transformation. Let m_i be the mean of the i-th cluster, let n_i be the number of samples in the i-th cluster, let C_i denote the covariance matrix of the i-th cluster, and let C_m denote the covariance matrix of the cluster centroids. Then, with the inner sums running over the samples of the i-th cluster,

X_{S} = \frac{1}{N} \sum_{τ \in S} \frac{1}{n_{x}^{2}} X_{τ} = \frac{1}{N} \sum_{i = 1}^{N} \frac{1}{n_{i}^{2}} \sum_{j = 1}^{n_{i}} \sum_{k = 1}^{n_{i}} (x_{j} x_{j}^{T} - x_{j} x_{k}^{T}) = \frac{1}{N} \sum_{i = 1}^{N} (E [x_{i} x_{i}^{T}] - m_{i} m_{i}^{T}) = \frac{1}{N} \sum_{i = 1}^{N} C_{i}

This implies that $X_{S}$ is the ‘mean’ covariance representing an estimated covariance for all clusters. Similarly, it can be shown that $X_{D} = X_{S} + \frac{N}{N - 1} C_{m}$ , where N is the number of classes. The combination of the mean covariance matrix $X_{S}$ and the covariance matrix of the centroids C_m in $X_{D}$ leads to the following theorem.

Theorem 3.5.

Let u₁, u₂, ⋯ , u_p be the eigenvectors of ${\tilde{X}}_{S}$ with corresponding eigenvalues σ₁, σ₂, ⋯ , σ_p sorted in increasing order. Let d = min{p, N − 1}. If $X_{S}$ is nonsingular, then σ₁ ≤ σ₂ ≤ ⋯ ≤ σ_d < 1 and $σ_{d + 1} = σ_{d + 2} = \cdots = σ_{p} = 1$ .

Proof.

{\tilde{X}}_{S} = X_{D}^{- \frac{1}{2}} X_{S} X_{D}^{- \frac{1}{2}} = X_{D}^{- \frac{1}{2}} (X_{D} - \frac{N}{N - 1} C_{m}) X_{D}^{- \frac{1}{2}} = I_{p} - \frac{N}{N - 1} X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}} .

Since $X_{S}$ is nonsingular, $X_{D} = X_{S} + \frac{N}{N - 1} C_{m}$ is also nonsingular and hence, $rank (X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}) = rank (C_{m}) = d$ . This yields the following spectral decomposition:

\frac{N}{N - 1} X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}} = \begin{pmatrix} U_{d} & U_{p - d} \end{pmatrix} \begin{pmatrix} Λ_{d} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} U_{d}^{T} \\ U_{p - d}^{T} \end{pmatrix},

where $U_{d}$ and $U_{p - d}$ are eigenvector matrices of $X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}$ corresponding to positive and 0 eigenvalues respectively and Λ_d represents a diagonal matrix with the positive eigenvalues on the diagonal. Then,

{\tilde{X}}_{S} = \begin{pmatrix} U_{d} & U_{p - d} \end{pmatrix} \begin{pmatrix} I_{d} - Λ_{d} & 0 \\ 0 & I_{p - d} \end{pmatrix} \begin{pmatrix} U_{d}^{T} \\ U_{p - d}^{T} \end{pmatrix} .

Therefore, σ_i = (I_d − Λ_d)_ii < 1 for i ≤ d and σ_i = 1 for i > d. □

As can be seen from the proof of Theorem 3.5, the eigenvectors and eigenvalues of the Hamiltonian matrix ${\tilde{X}}_{S}$ can be split into two categories: (1) a set corresponding to the span of $X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}$ , which can be viewed as a scaling of the covariance matrix C_m with scaled eigenvalues 0 < σ_i < 1 and (2) a set corresponding to the null space of $X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}$ with eigenvalues exactly 1.

In our approach, the transformation $\tilde{x} = X_{D}^{- \frac{1}{2}} x$ is a scaling that is applied to the data points a priori. We refer to this space as the tilde transformed space. The span of the scaled covariance matrix $X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}$ coincides with the N − 1 dimensional subspace containing the N cluster centroids in the tilde transformed space. This space corresponds to the most informative projection for cluster separation, while the Null space of the scaled covariance matrix is uninformative in this regard (Fig. 1). Hence, the spectrum of ${\tilde{X}}_{S}$ can be used to identify informative and non-informative directions. This process can be used to reduce the dimension and visualize high-dimensional datasets.

Figure 1: The subspace spanned by the scaled covariance matrix $X_{D}^{- \frac{1}{2}} C_{m} X_{D}^{- \frac{1}{2}}$ of cluster centroids in the tilde transformed space $\tilde{x} = X_{D}^{- \frac{1}{2}} x$ along with its orthogonal complement. The span of the scaled covariance matrix coincides with the N − 1 dimensional subspace containing the N cluster centroids (gray arrows in the indicated hyperplane). This space corresponds to the most informative projection for cluster separation. The null space of the scaled covariance matrix (green arrow) is non-informative for cluster separation.
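Theorem 3.5 can be checked numerically on synthetic data. The sketch below builds $X_{S}$ and $X_{D}$ directly from definitions (5) and (6), under the assumption that each unordered pair is counted once (which is what makes $X_{S}$ equal to the mean of the cluster covariances), and then counts the eigenvalues of ${\tilde{X}}_{S}$ that lie strictly below 1. The data, random seed, tolerance, and function names are hypothetical.

```python
import numpy as np


def pair_scatter(a, b):
    """Sum of (x - y)(x - y)^T over all x in a and y in b (rows are samples)."""
    diffs = (a[:, None, :] - b[None, :, :]).reshape(-1, a.shape[1])
    return diffs.T @ diffs


def inv_sqrt_pd(mat):
    """Inverse symmetric square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs / np.sqrt(vals)) @ vecs.T


def tilde_xs_spectrum(classes):
    """Eigenvalues (ascending) and eigenvectors of the tilde-transformed X_S."""
    n_cls = len(classes)
    # Same-class pairs: the full double sum visits each unordered pair twice.
    x_s = sum(pair_scatter(c, c) / (2.0 * len(c) ** 2) for c in classes) / n_cls
    # Cross-class pairs: each unordered pair of classes is visited once.
    x_d = sum(
        pair_scatter(classes[i], classes[j]) / (len(classes[i]) * len(classes[j]))
        for i in range(n_cls)
        for j in range(i + 1, n_cls)
    ) / (n_cls * (n_cls - 1))
    w = inv_sqrt_pd(x_d)
    return np.linalg.eigh(w @ x_s @ w)


# Three hypothetical Gaussian clusters in R^5.
rng = np.random.default_rng(0)
centers = rng.normal(scale=4.0, size=(3, 5))
classes = [c + rng.normal(size=(40, 5)) for c in centers]

sigma, u = tilde_xs_spectrum(classes)
informative = sigma < 1.0 - 1e-8   # directions that separate the centroids
u_informative = u[:, informative]  # basis used for dimensionality reduction
```

With p = 5 and N = 3, the theorem predicts d = min{5, 2} = 2 informative directions, so projecting the tilde-transformed data onto the columns of `u_informative` gives a two-dimensional view.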

3.3. Optimal value of the tuning parameter μ

To illustrate the physical interpretation of our approach for tuning μ, we first consider the case of a single-label dataset. In this case, $X_{D}$ is not defined and, as shown in the preceding sections, we have ${\tilde{X}}_{S} = X_{S} = C$ . Thus the Hamiltonian of the system is simply the covariance matrix obtained from the data and the principal features correspond to the low energy eigenstates. Now recall that the tuning parameter of our optimization problem μ corresponds to temperature T in the mapping to statistical mechanics and the free energy is given by F[ρ] = 〈E〉 − μH[ρ], where $〈 E 〉 = t r (ρ \hat{H})$ is the average energy. In the low temperature limit (μ → 0), free energy minimization is equivalent to minimizing the average energy of the system. Clearly, the average energy is minimized by setting the eigenvalues of the density matrix ρ such that λ_i = 1 for the lowest energy eigenstate and λ_i = 0 for all the other eigenstates. This corresponds to projecting the data onto a line in the limit μ → 0. On the other hand, in the limit μ → ∞, the free energy is minimized by maximizing the entropy of the system, which corresponds to having equal probabilities for all the eigenstates, i.e. $λ_{i} = \frac{1}{p}$ . The mapping to statistical mechanics suggests a criterion for choosing the optimal μ between these two extremes as discussed below.

A natural approach is to choose the parameter μ by maximizing the corresponding Fisher information, given that the maximum of the Fisher information corresponds to minimum variance in the estimate of μ based on the Cramer-Rao inequality [19]. The Fisher information for the parameter μ (obtained using the Boltzmann distribution derived above) is given by $I (μ) = \frac{1}{μ^{2}} (〈 E^{2} 〉 - 〈 E 〉^{2})$ [4]. The connections to statistical mechanics provide further insight into the corresponding physical significance as outlined below. In the equilibrium state at temperature μ, let us denote the eigenvalues (energy levels) of the Hamiltonian $({\tilde{X}}_{S})$ by E_i. The energy of the system is a random variable with corresponding probability distribution given by the Boltzmann distribution

λ_{i} = \frac{e^{- E_{i} / μ}}{\sum_{j = 1}^{p} e^{- E_{j} / μ}} .

(19)

Using the above, it is straightforward to show that the equilibrium free energy is given by $F = - μ log (\sum_{i = 1}^{p} e^{- E_{i} / μ})$ and the average energy can be obtained from the free energy using $〈 E 〉 = - μ^{2} \frac{\partial (F / μ)}{\partial μ}$ . Furthermore, the heat capacity of the system (defined as $C = \frac{\partial 〈 E 〉}{\partial μ}$ ) is given by $C = \frac{1}{μ^{2}} (〈 E^{2} 〉 - {〈 E 〉}^{2})$ . Comparing this with the expression obtained for the Fisher information, we see that $I (μ)$ corresponds to the heat capacity of the system and thus can also be expressed as $I (μ) = \frac{\partial 〈 E 〉}{\partial μ}$ . This identification also highlights the fact that we have $I (μ) \to 0$ in the low temperature limit as well as the high temperature limit; thus a natural choice for μ is the value that corresponds to the maximum of $I (μ)$ . The average energy 〈E〉 increases monotonically with μ and may have multiple inflection points, depending on the data. If the eigenvalue spectrum has multiple clusters of closely-spaced eigenvalues separated by large gaps, then $I (μ)$ can have multiple peaks. In this case, we consider the largest μ value corresponding to a local maximum of $I (μ)$ as the optimal choice. This choice ensures that all informative features are considered, with the corresponding weights given by the Boltzmann distribution.
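The identities quoted in this paragraph follow in a few lines from the Boltzmann form of $λ_{i}$. Writing $Z = \sum_{j = 1}^{p} e^{- E_{j} / μ}$ for the normalizing constant (a symbol introduced here only for brevity), a sketch of the derivation is:

$$
\begin{aligned}
\log λ_{i} &= -\frac{E_{i}}{μ} - \log Z \;\Rightarrow\; H = -\sum_{i} λ_{i} \log λ_{i} = \frac{\langle E \rangle}{μ} + \log Z, \\
F &= \langle E \rangle - μ H = -μ \log Z, \\
-μ^{2} \frac{\partial (F/μ)}{\partial μ} &= μ^{2} \frac{\partial \log Z}{\partial μ} = \frac{1}{Z} \sum_{i} E_{i} e^{-E_{i}/μ} = \langle E \rangle, \\
\frac{\partial λ_{i}}{\partial μ} &= \frac{λ_{i} \, (E_{i} - \langle E \rangle)}{μ^{2}} \;\Rightarrow\; C = \frac{\partial \langle E \rangle}{\partial μ} = \sum_{i} E_{i} \frac{\partial λ_{i}}{\partial μ} = \frac{1}{μ^{2}} \big( \langle E^{2} \rangle - \langle E \rangle^{2} \big).
\end{aligned}
$$

Since the variance of the energy is non-negative, the last line also shows directly that 〈E〉 is non-decreasing in μ.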

To illustrate the use of the Fisher information for determining the optimal μ, we simulated a single cluster of data points in $ℝ^{9}$ using a multivariate Gaussian distribution with an arbitrarily oriented covariance matrix. The variances along the principal axes of the covariance were selected in 3 clusters of increasing order on a logarithmic scale. This resulted in a Hamiltonian with three sets of eigenvalue clusters. Fig. 2 shows the eigenvalues of S (red, green, and blue curves) along with the Fisher information (black curve). The eigenvalues of the Hamiltonian are represented by vertical dashed lines. As can be seen from the figure, the local maxima of the Fisher information mark the boundaries of the gaps between the eigenvalue clusters of the Hamiltonian.

Figure 2: The 9 eigenvalues of S, λ_i (red, green, and blue curves), and the Fisher information, $I (μ)$ (black curve), as a function of the temperature μ (x-axis). The dashed vertical lines indicate the eigenvalues of the Hamiltonian matrix ${\tilde{X}}_{S}$ . There are three clusters of eigenvalues as indicated by the colors. The peaks of $I (μ)$ mark the boundaries of the gaps between the eigenvalue clusters. When μ → ∞, each eigenvalue approaches 1/9.

Cross validation is typically employed to find optimal tuning parameter values in supervised learning tasks. We can also estimate the tuning parameter μ using a similar approach. However, we find that restricting the search for an optimal μ to a neighborhood around the μ that maximizes the Fisher Information works well while significantly reducing the search space.
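The selection rule described in this section can be sketched in a few lines. The code below evaluates $I (μ)$ using the expression given in the text, locates its local maxima on a logarithmic grid, returns the largest such μ, and builds a narrower grid around it that could be passed to cross-validation. The grid bounds, the width of the neighborhood, the two-level example, and all function names are assumptions made for illustration.

```python
import math


def boltzmann_weights(energies, mu):
    """Boltzmann distribution over the energies at temperature mu."""
    e_min = min(energies)  # shift for numerical stability
    unnorm = [math.exp(-(e - e_min) / mu) for e in energies]
    z = sum(unnorm)
    return [w / z for w in unnorm]


def fisher_information(energies, mu):
    """I(mu) = (<E^2> - <E>^2) / mu^2, computed as a weighted variance."""
    lam = boltzmann_weights(energies, mu)
    mean_e = sum(l * e for l, e in zip(lam, energies))
    var_e = sum(l * (e - mean_e) ** 2 for l, e in zip(lam, energies))
    return var_e / mu ** 2


def log_grid(lo, hi, num):
    """num points spaced evenly on a logarithmic scale between lo and hi."""
    step = (math.log(hi) - math.log(lo)) / (num - 1)
    return [math.exp(math.log(lo) + k * step) for k in range(num)]


def largest_mu_at_local_max(energies, grid):
    """Largest grid value of mu at which I(mu) has an interior local maximum."""
    info = [fisher_information(energies, mu) for mu in grid]
    peaks = [
        k for k in range(1, len(grid) - 1)
        if info[k] > info[k - 1] and info[k] >= info[k + 1]
    ]
    if not peaks:
        raise ValueError("no interior local maximum on this grid")
    return grid[peaks[-1]]


def neighborhood_grid(mu_opt, factor=3.0, num=11):
    """Candidate mu values around mu_opt for an optional cross-validation."""
    return log_grid(mu_opt / factor, mu_opt * factor, num)


# Hypothetical two-level spectrum with a unit gap.
energies = [0.0, 1.0]
mu_opt = largest_mu_at_local_max(energies, log_grid(0.01, 100.0, 400))
candidates = neighborhood_grid(mu_opt)
```

Only the short `candidates` list would then need to be scored by cross-validation, rather than the full grid.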

3.4. Scaled linear discriminant analysis and high-dimensional data visualization

In Section 3.2 we showed that the informative variance of clusters lies on the subspace spanned by U_d, i.e., the eigenvectors of ${\tilde{X}}_{S}$ corresponding to eigenvalues strictly less than 1. The appearance of covariance matrices of clusters and cluster centroids in ${\tilde{X}}_{S}$ suggests a connection to multiclass LDA [18], where the data is projected onto the subspace spanned by eigenvectors of $L = X_{S}^{- 1} C_{m}$ that correspond to the largest N − 1 eigenvalues. The following lemma establishes the connection between our transformation S and multiclass LDA.

Lemma 3.6.

There is a bijection between the eigenvectors of $L = X_{S}^{- 1} C_{m}$ and ${\tilde{X}}_{S}$ .

Proof.

Let v_i and γ_i be the i-th eigenvector and eigenvalue of $X_{S}^{- 1} C_{m}$ and let $α_{N} = \frac{N - 1}{N}$ . Then,

γ_{i} v_{i} = X_{S}^{- 1} C_{m} v_{i} = α_{N} X_{S}^{- 1} (X_{D} - X_{S}) v_{i} .

Hence $(α_{N} + γ_{i}) v_{i} = α_{N} X_{S}^{- 1} X_{D} v_{i}$ , and therefore

{\tilde{X}}_{S} (X_{D}^{1 / 2} v_{i}) = X_{D}^{- 1 / 2} X_{S} X_{D}^{- 1 / 2} (X_{D}^{1 / 2} v_{i}) = \frac{α_{N}}{α_{N} + γ_{i}} (X_{D}^{1 / 2} v_{i}) .

□
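The step from the eigenvalue relation for $L$ to the one for ${\tilde{X}}_{S}$ can be spelled out as follows (a sketch, assuming $X_{S}$ and $X_{D}$ are both nonsingular). Multiplying $(α_{N} + γ_{i}) v_{i} = α_{N} X_{S}^{- 1} X_{D} v_{i}$ on the left by $X_{D}^{- 1} X_{S}$ and dividing by $α_{N} + γ_{i}$ gives:

$$
X_{D}^{-1} X_{S} v_{i} = \frac{α_{N}}{α_{N} + γ_{i}} v_{i}
\quad\Longrightarrow\quad
X_{D}^{-1/2} X_{S} X_{D}^{-1/2} \, (X_{D}^{1/2} v_{i}) = X_{D}^{1/2} \, (X_{D}^{-1} X_{S} v_{i}) = \frac{α_{N}}{α_{N} + γ_{i}} \, X_{D}^{1/2} v_{i} .
$$

The map $γ_{i} \mapsto α_{N} / (α_{N} + γ_{i})$ is decreasing, so the largest eigenvalues of $L$ correspond to the smallest eigenvalues of ${\tilde{X}}_{S}$.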

The above lemma shows that the eigenvectors of ${\tilde{X}}_{S}$ are the eigenvectors of L transformed by $X_{D}^{1 / 2}$ . Note that the 0 eigenvalues of L, i.e., γ_i = 0, correspond to the eigenvalues of ${\tilde{X}}_{S}$ that equal 1, both of which correspond to uninformative directions. From the above lemma, we see that $u_{i} = X_{D}^{1 / 2} v_{i}$ is an eigenvector of ${\tilde{X}}_{S}$ and hence an eigenvector of $S^{1 / 2}$. Recall that in our approach, the data point x is first transformed to the tilde space via $\tilde{x} = X_{D}^{- 1 / 2} x$ and subsequently the transformation $S^{1 / 2}$ is applied (Eq. (16)). Projecting the point x on the coordinate system of $S^{1 / 2}$ in the tilde transformed space, we obtain

u_{i}^{T} S^{1 / 2} X_{D}^{- 1 / 2} x = λ_{i}^{1 / 2} u_{i}^{T} X_{D}^{- 1 / 2} x = λ_{i}^{1 / 2} {(X_{D}^{1 / 2} v_{i})}^{T} X_{D}^{- 1 / 2} x = λ_{i}^{1 / 2} v_{i}^{T} x

In other words, the projections in LDA are scaled by $λ_{i}^{1 / 2}$ . For this reason, we refer to the eigenvectors (u_i) and eigenvalues $(λ_{i}^{1 / 2})$ of $S^{1 / 2}$ as SLDA components and weights respectively. Note that from this point of view, our distance metric learning formulation generalizes LDA and naturally overcomes some of its shortcomings as follows. First, there is no upper limit on the eigenvalues of the LDA transformation [10]; however, the scalings λ_i in our method are bounded between 0 and 1 and add up to 1, which naturally solves the boundary problem of LDA. Second, the eigenvalues λ_i of S are obtained by optimizing the metric to enhance separation between classes (Eq. (12)) and hence provide a better scaling in the projected space, which can increase the performance of k-NN type classifiers. Finally, the scalings λ_i are tunable by varying the temperature parameter μ, which can vary from the boundary solution λ₁ = 1, λ₂ = ⋯ = λ_p = 0 when μ → 0 to equal scaling λ₁ = ⋯ = λ_p = 1/p when μ → ∞ with the optimal value obtained by maximizing the Fisher Information. This can be advantageous for scalable high-dimensional data visualization. Fig. 3 illustrates this property of the solution in visualizing high-dimensional data. In Fig. 3, data points from the Wine dataset are transformed and projected onto the first two SLDA components for various values of μ. The data consists of 3 categories in $ℝ^{13}$ . There are a total of 2 informative components (Section 3.2). By varying μ, the weights of the SLDA components can be controlled in a principled way. As can be seen, in the extreme case μ_min → 0, the data is projected onto the first component, corresponding to λ₁ = 1, λ₂ = 0. Using μ_opt, obtained by maximizing Fisher Information, the weights are approximately λ₁ ≈ 0.7 and λ₂ ≈ 0.3, while μ_max → ∞ results in equal weights λ₁ = λ₂ = 0.5. For comparison, the projection onto principal components using standard PCA is depicted as well in Fig. 3.

Figure 3: High dimensional data visualization. The Wine data set, consisting of 3 classes in $ℝ^{13}$, is projected onto SLDA components using the indicated μ values. As μ → 0 (μ_min), the first SLDA component has the maximum weight λ₁ = 1 while the second component has weight λ₂ = 0. The weights for the optimal μ value (μ_opt) are approximately λ₁ ≈ 0.7 and λ₂ ≈ 0.3. As μ → ∞ (μ_max), both components have equal weight λ₁ = λ₂ = 0.5. The bottom right plot is the projection of the data onto PCA components.
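The way the SLDA weights move between the two limits can be illustrated numerically. The following is a minimal sketch, assuming the weights take the Boltzmann form λ_i ∝ exp(−ε_i/μ), where ε_i denotes the i-th eigenvalue ("energy level") of the transformed similarity matrix. The function name `boltzmann_weights` and the example energies are hypothetical and are not taken from the fenn package.

```python
import math


def boltzmann_weights(energies, mu):
    """Return normalized weights proportional to exp(-e / mu).

    Assumption: the SLDA weights follow a Boltzmann distribution over the
    eigenvalues ("energies"), with mu playing the role of temperature.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    e_min = min(energies)
    # Shift by the smallest energy so the largest exponent is exactly 0,
    # which keeps the computation stable for very small mu.
    unnormalized = [math.exp(-(e - e_min) / mu) for e in energies]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


# Two hypothetical informative components (illustrative values only).
energies = [0.2, 0.6]
for mu in (1e-3, 0.5, 1e3):
    print(mu, [round(w, 3) for w in boltzmann_weights(energies, mu)])
```

Under this assumption, a very small μ concentrates essentially all weight on the lowest-energy component, and a very large μ spreads the weight almost equally, mirroring the μ_min and μ_max panels described for Fig. 3.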

3.5. Application to supervised learning

As in other metric learning models, training data can be used to obtain the optimal metric and to classify new examples in the optimally transformed space using k-NN or similar types of algorithms. The learning may be performed in a ‘global’ manner where all of the training data is utilized in learning the optimal space. Alternatively, to take advantage of local structures, one may use portions of the data to learn an ensemble of ‘locally’ optimal transform spaces. New examples can then be classified in each space and the final class label can be decided in an ensemble fashion. While our method does have an analytic solution and can be used efficiently to estimate a local metric, we restrict our attention to the global metric.

Our algorithm is straightforward. We first compute the within-class (C₁, ⋯, C_N) and between-class (C_m) covariance matrices. Next, the transformations $X_{S} = \frac{1}{N} \sum_{i = 1}^{N} C_{i}$ , $X_{D} = X_{S} + \frac{n}{n - 1} C_{m}$ , and ${\tilde{X}}_{S} = X_{D}^{- \frac{1}{2}} X_{S} X_{D}^{- \frac{1}{2}}$ are calculated. We then calculate the Fisher Information $I (μ)$ for a grid of μ values spanning the range of eigenvalues of ${\tilde{X}}_{S}$ and determine the optimal value by identifying the local maxima of $I (μ)$, optionally followed by cross-validation. The final transformation $S^{1 / 2} = \sum_{i} λ_{i}^{1 / 2} u_{i} u_{i}^{T}$ is then calculated in order to map the original data to the optimally transformed space. Since the distance metric is determined by free energy minimization, we name our method Free Energy Nearest Neighbor (FENN). The algorithm for FENN is shown in Algorithm 1. The time complexity of each step is indicated in the algorithm.
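The grid search over μ described above can be sketched as follows. This is an illustration under stated assumptions, not the fenn implementation: the weights are assumed to have the Boltzmann form λ_i = exp(−ε_i/μ)/Z, and the Fisher Information is taken in its textbook form I(μ) = Σ_i λ_i (∂_μ log λ_i)², which for Boltzmann weights reduces to Var(ε)/μ⁴. The function names and the example eigenvalues are hypothetical.

```python
import math


def fisher_information(energies, mu):
    """Fisher information of the weight distribution with respect to mu.

    Assumption: lambda_i = exp(-e_i / mu) / Z. Then
    d/dmu log(lambda_i) = (e_i - <e>) / mu**2, so
    I(mu) = sum_i lambda_i * (d/dmu log lambda_i)**2 = Var(e) / mu**4.
    """
    e_min = min(energies)
    unnormalized = [math.exp(-(e - e_min) / mu) for e in energies]
    total = sum(unnormalized)
    weights = [w / total for w in unnormalized]
    mean = sum(w * e for w, e in zip(weights, energies))
    variance = sum(w * (e - mean) ** 2 for w, e in zip(weights, energies))
    return variance / mu ** 4


def local_maxima_on_grid(energies, n_points=200):
    """Scan mu over [min(e), max(e)] and return (mu, I(mu)) at local maxima.

    Assumes all energies are strictly positive so that mu > 0 on the grid.
    """
    lo, hi = min(energies), max(energies)
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    values = [fisher_information(energies, mu) for mu in grid]
    return [
        (grid[k], values[k])
        for k in range(1, n_points - 1)
        if values[k] > values[k - 1] and values[k] >= values[k + 1]
    ]


# Hypothetical eigenvalues of the transformed similarity matrix.
energies = [0.05, 0.5, 0.95]
candidates = local_maxima_on_grid(energies)
mu_opt, info = max(candidates, key=lambda pair: pair[1])
print(f"mu_opt ~ {mu_opt:.3f}")
```

When several local maxima are found, the sketch simply keeps the largest one; the optional cross-validation mentioned above would instead compare candidate μ values by classification accuracy.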

4. BENCHMARKS

We benchmarked our method on 10 UCI datasets [15] used in previous work (Table 1 lists the information for each dataset) and compared its performance to that of 8 other distance learning methods. We implemented our algorithm in an R package fenn (https://github.com/kouroshz/fenn). We used the R shogun package to run the experiments for LMNN.

Table 1: Dataset Information

	bscale	glass	ionosphere	tictactoe	image segmentation	iris	wine	wdbc	car	waveform
dimensions	4	9	34	9	19	4	13	30	6	21
number of examples	625	214	351	958	2310	150	178	569	1728	5000
number of classes	3	6	2	2	7	3	3	2	4	3


In addition, we implemented the 7 other widely used distance learning methods in a separate R package DistanceLearning [7] (https://github.com/carltonyfakhry/DistanceLearning), which we used to benchmark the remaining methods. The parameters for training all the methods were set as suggested in each method’s original paper. In each trial, the training and test sets were created by the data preprocessing routines in our package. The training and testing methods from the various packages were then called on the same preprocessed datasets. LFDA [20, 21], NCA [9], Xing’s method [26], LMNN [24, 25] and FENN are global metric methods and therefore learn a single transformation of the data. After the transformation is learned for each global method, we apply k-NN classification with k = 5. DANN, i-DANN [11], ADAMENN and i-ADAMENN [6] are local methods and generally take longer to train. For FENN, we consider a local neighborhood (μ_opt −4×p, μ_opt +4×p) around μ_opt (a total of 6×p μ values were generated) for performing the cross validation in step 5 of Algorithm 1. Table 2 summarizes the accuracy of the methods in a 10-fold cross validation experiment. In this setting, FENN performs as well as or better than previous methods in 6 out of 10 datasets, while it is very close in performance to the best method for the remaining 4 datasets.
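The evaluation step for the global methods (k-NN with k = 5 inside a 10-fold cross validation) can be sketched as below. This is only an illustration of the mechanics: the points are assumed to already live in a transformed space, the fold construction by striding is a hypothetical choice, and in an actual experiment the transformation would presumably be learned on each training fold before classification. None of the function names come from the fenn or DistanceLearning packages.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(points, labels, n_folds=10, k=5):
    """Pooled accuracy over n_folds folds built by striding through the data."""
    n = len(points)
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))
        train_idx = [i for i in range(n) if i not in test_idx]
        train_points = [points[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_points, train_labels, points[i], k) == labels[i]:
                correct += 1
    return correct / n


# Toy data: two well-separated clusters standing in for transformed points.
points = [(0.1 * i, 0.0) for i in range(20)] + [(10 + 0.1 * i, 0.0) for i in range(20)]
labels = [0] * 20 + [1] * 20
print(cross_validated_accuracy(points, labels))
```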

Table 2: Results of 10-fold Cross Validation

	DANN	i-DANN	ADAMENN	i-ADAMENN	LFDA	NCA	Xing	LMNN	FENN
bscale	0.955	0.904	0.765	0.773	0.851	0.856	0.938	0.867	0.947
glass	0.682	0.575	0.664	0.668	0.664	0.678	0.673	0.668	0.71
ionosphere	0.84	0.798	0.84	0.843	0.838	0.843	0.84	0.889	0.846
tictactoe	0.857	0.816	0.811	0.811	0.841	0.832	0.837	0.846	0.9
image segmentation	0.963	0.878	0.877	0.877	0.958	0.92	0.96	0.959	0.975
iris	0.953	0.933	0.947	0.94	0.953	0.953	0.96	0.947	0.96
wine	0.978	0.944	0.955	0.938	0.916	0.781	0.848	0.978	0.994
wdbc	0.94	0.916	0.963	0.967	0.944	0.91	0.933	0.97	0.967
car	0.97	0.942	0.84	0.811	0.978	0.958	0.981	0.987	0.986
waveform	0.83	0.822	0.744	0.686	0.828	0.846	0.766	0.815	0.849
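The summary of Table 2 can be checked directly from the reported accuracies. The short script below copies the table values (the last entry in each row is FENN), counts the datasets on which FENN is at least as accurate as every other method, and reports the shortfall on the remaining ones. The variable names are illustrative.

```python
# Accuracies copied from Table 2; the last value in each row is FENN.
table2 = {
    "bscale": [0.955, 0.904, 0.765, 0.773, 0.851, 0.856, 0.938, 0.867, 0.947],
    "glass": [0.682, 0.575, 0.664, 0.668, 0.664, 0.678, 0.673, 0.668, 0.71],
    "ionosphere": [0.84, 0.798, 0.84, 0.843, 0.838, 0.843, 0.84, 0.889, 0.846],
    "tictactoe": [0.857, 0.816, 0.811, 0.811, 0.841, 0.832, 0.837, 0.846, 0.9],
    "image segmentation": [0.963, 0.878, 0.877, 0.877, 0.958, 0.92, 0.96, 0.959, 0.975],
    "iris": [0.953, 0.933, 0.947, 0.94, 0.953, 0.953, 0.96, 0.947, 0.96],
    "wine": [0.978, 0.944, 0.955, 0.938, 0.916, 0.781, 0.848, 0.978, 0.994],
    "wdbc": [0.94, 0.916, 0.963, 0.967, 0.944, 0.91, 0.933, 0.97, 0.967],
    "car": [0.97, 0.942, 0.84, 0.811, 0.978, 0.958, 0.981, 0.987, 0.986],
    "waveform": [0.83, 0.822, 0.744, 0.686, 0.828, 0.846, 0.766, 0.815, 0.849],
}

best_or_tied = []
gaps = {}
for dataset, row in table2.items():
    *others, fenn = row
    best_other = max(others)
    if fenn >= best_other:
        best_or_tied.append(dataset)
    else:
        # Shortfall of FENN relative to the best competing method.
        gaps[dataset] = round(best_other - fenn, 3)

print(len(best_or_tied), best_or_tied)
print(gaps)
```

With these values, FENN is best or tied on six datasets (glass, tictactoe, image segmentation, iris, wine, waveform); on the other four the shortfall ranges from 0.001 (car) to 0.043 (ionosphere).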


5. CONCLUSION

In this paper we introduced a new formalization of distance metric learning inspired by a mapping to systems studied in quantum statistical mechanics. Our formalization applies the same metric when measuring the distance in both classes of similar and dissimilar points, while controlling the collapse of dimensions by enforcing a bound based on the von Neumann entropy of the optimal metric. This is in contrast to several other distance metric learning methods, where collapse of dimensionality is avoided by applying different metrics to measure the distance between similar and dissimilar points. Importantly, our formulation results in an analytical solution of the problem that avoids the costly numerical approximations typically used in distance metric learning. Further, the quantum statistical mechanics interpretation of the problem allows for a physical interpretation of the optimal solution and provides insights into selecting values for the entropy tuning parameter. We provide theoretical justification for the optimality of the projected space in terms of class separation and show how the geometry and the tuning of the entropy parameter can be utilized for high-dimensional data visualization. Moreover, we establish a connection between our learned metric and LDA and demonstrate that our distance metric formulation generalizes LDA while addressing some of its shortcomings. In future work, we plan to apply FENN with sparse covariance analysis and kernelization and to further investigate analogous physical models for multi-class systems.

While the emphasis of the current work is on theoretical insights, the experimental results also show the effectiveness of the proposed approach. Our approach connects a class of optimization problems in distance metric learning with concepts explored in statistical mechanics and with preexisting classification methods. In particular, the scaling provided by our method results in a better-controlled separation of classes while still obtaining maximum discriminant directions. Our mapping to quantum statistical mechanics provides further insight into the geometrical aspects of the problem and may suggest analytical solutions for a broader class of optimization problems that are currently addressed numerically. In our opinion, further insights gained by building on the mappings established in this work are likely to lead to important developments in the field. Finally, to facilitate usage, we provide an efficient implementation of our algorithm, along with implementations of 7 other widely used distance learning methods, in R packages available to download at https://github.com/kouroshz/fenn and https://github.com/carltonyfakhry/DistanceLearning.

CCS CONCEPTS.

Computing methodologies → Supervised learning; Classification and regression trees; Model verification and validation;
Mathematics of computing → Convex optimization.

ACKNOWLEDGMENTS

The authors would like to thank Robert for the bagels and explaining CMYK and color spaces. The authors also wish to thank Kei, Sanae, and Kaede, for drawing a 3d picture displaying informative and uninformative features. The authors acknowledge funding support from the National Institutes of Health through grant 3U54CA156734-05S3 (as part of the University of Massachusetts Boston/Dana Farber-Harvard Cancer Center U54 partnership).

Contributor Information

Sho Inaba, Computational Sciences PhD program, University of Massachusetts Boston, Boston, USA.

Carl T. Fakhry, Department of Computer Science, University of Massachusetts Boston, Boston, USA.

Rahul V. Kulkarni, Department of Physics, University of Massachusetts Boston, Boston, USA.

Kourosh Zarringhalam, Department of Mathematics, University of Massachusetts Boston, Boston, USA.

REFERENCES

[1]. Bar-Hillel Aharon, Hertz Tomer, Shental Noam, and Weinshall Daphna. 2003. Learning Distance Functions Using Equivalence Relations. In Proceedings of the Twentieth International Conference on Machine Learning. 11–18.
[2]. Baxter Jonathan and Bartlett Peter L. 1998. The Canonical Distortion Measure in Feature Space and 1-NN Classification. In Advances in Neural Information Processing Systems 10, Jordan MI, Kearns MJ, and Solla SA (Eds.). MIT Press, 245–251. http://papers.nips.cc/paper/1357-the-canonical-distortion-measure-in-feature-space-and-1-nn-classification.pdf
[3]. Bellet Aurélien, Habrard Amaury, and Sebban Marc. 2013. A Survey on Metric Learning for Feature Vectors and Structured Data. arXiv:1306.6709 [cs, stat] (June 2013). http://arxiv.org/abs/1306.6709
[4]. Crooks Gavin E. 2011. Fisher information and statistical mechanics. Technical Report Citeseer.
[5]. Davis Jason V., Kulis Brian, Jain Prateek, Sra Suvrit, and Dhillon Inderjit S. 2007. Information-theoretic Metric Learning. In Proceedings of the 24th International Conference on Machine Learning (ICML '07). ACM, New York, NY, USA, 209–216. doi:10.1145/1273496.1273523
[6]. Domeniconi C, Peng Jing, and Gunopulos D. 2002. Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 9 (September 2002), 1281–1285. doi:10.1109/TPAMI.2002.1033219
[7]. Fakhry Carl Tony. 2018. Causal Reasoning and Machine Learning Models for Cellular Regulatory Mechanisms. ScholarWorks at UMass Boston (2018). https://scholarworks.umb.edu/doctoral_dissertations/382
[8]. Friedman Jerome H. 1994. Flexible Metric Nearest Neighbor Classification. ResearchGate (December 1994). https://www.researchgate.net/publication/2696649_Flexible_Metric_Nearest_Neighbor_Classification
[9]. Goldberger Jacob, Hinton Geoffrey E, Roweis Sam T., and Salakhutdinov Ruslan R. 2005. Neighbourhood Components Analysis. In Advances in Neural Information Processing Systems 17, Saul LK, Weiss Y, and Bottou L (Eds.). MIT Press, 513–520. http://papers.nips.cc/paper/2566-neighbourhood-components-analysis.pdf
[10]. Hansen John. 2005. Using SPSS for windows and macintosh: analyzing and understanding data.
[11]. Hastie Trevor and Tibshirani Robert. 1996. Discriminant Adaptive Nearest Neighbor Classification and Regression. In Advances in Neural Information Processing Systems 8, Touretzky DS and Hasselmo ME (Eds.). MIT Press, 409–415. http://papers.nips.cc/paper/1131-discriminant-adaptive-nearest-neighbor-classification-and-regression.pdf
[12]. Short Robert D. II and Fukunaga Keinosuke. 1981. The optimal distance measure for nearest neighbor classification. Information Theory, IEEE Transactions on 27, 5 (October 1981), 622–627. doi:10.1109/TIT.1981.1056403
[13]. Kardar Mehran. 2007. Statistical Physics of Particles. Cambridge University Press. doi:10.1017/CBO9780511815898
[14]. Kulis Brian. 2013. Metric Learning: A Survey. MAL 5, 4 (July 2013), 287–364. doi:10.1561/2200000019
[15]. Lichman Moshe. 2013. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/index.php
[16]. Mu Yang, Ding Wei, and Tao Dacheng. 2013. Local discriminative distance metrics ensemble learning. Pattern Recognition 46, 8 (2013), 2337–2349.
[17]. Qi Guo-Jun, Tang Jinhui, Zha Zheng-Jun, Chua Tat-Seng, and Zhang Hong-Jiang. 2009. An Efficient Sparse Metric Learning in High-dimensional Space via L1-penalized Log-determinant Regularization. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 841–848. doi:10.1145/1553374.1553482
[18]. Rao C. Radhakrishna. 1948. The Utilization of Multiple Measurements in Problems of Biological Classification. Journal of the Royal Statistical Society. Series B (Methodological) 10, 2 (1948), 159–203. http://www.jstor.org/stable/2983775
[19]. Rao C. Radhakrishna. 1992. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in statistics. Springer, 235–247.
[20]. Sugiyama Masashi. 2006. Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06). ACM, New York, NY, USA, 905–912. doi:10.1145/1143844.1143958
[21]. Sugiyama Masashi. 2007. Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis. J. Mach. Learn. Res 8 (May 2007), 1027–1061. http://dl.acm.org/citation.cfm?id=1248659.1248694
[22]. Wang Fei and Sun Jimeng. 2014. Survey on distance metric learning and dimensionality reduction in data mining. Data Mining and Knowledge Discovery 29, 2 (June 2014), 534–564. doi:10.1007/s10618-014-0356-z
[23]. Wang Shijun and Jin Rong. 2009. An Information Geometry Approach for Distance Metric Learning. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), van Dyk David and Welling Max (Eds.), Vol. 5. PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 591–598. http://proceedings.mlr.press/v5/wang09c.html
[24]. Weinberger Kilian Q, Blitzer John, and Saul Lawrence K. 2006. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems 18, Weiss Y, Schölkopf B, and Platt JC (Eds.). MIT Press, 1473–1480. http://papers.nips.cc/paper/2795-distance-metric-learning-for-large-margin-nearest-neighbor-classification.pdf
[25]. Weinberger Kilian Q. and Saul Lawrence K. 2009. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research 10, February (2009), 207–244. http://www.jmlr.org/papers/v10/weinberger09a.html
[26]. Xing Eric P., Jordan Michael I., Russell Stuart J, and Ng Andrew Y. 2003. Distance Metric Learning with Application to Clustering with Side-Information. In Advances in Neural Information Processing Systems 15, Becker S, Thrun S, and Obermayer K (Eds.). MIT Press, 521–528. http://papers.nips.cc/paper/2164-distance-metric-learning-with-application-to-clustering-with-side-information.pdf
[27]. Ying Yiming and Li Peng. 2012. Distance Metric Learning with Eigenvalue Optimization. Journal of Machine Learning Research 13, January (2012), 1–26. http://www.jmlr.org/papers/v13/ying12a.html

