Abstract
The von Neumann entropy, named after John von Neumann, is an extension of the classical concept of entropy to the field of quantum mechanics. From a numerical perspective, the von Neumann entropy can be computed simply by computing all eigenvalues of a density matrix, an operation that could be prohibitively expensive for large-scale density matrices. We present and analyze three randomized algorithms to approximate the von Neumann entropy of real density matrices: our algorithms leverage recent developments in the Randomized Numerical Linear Algebra (RandNLA) literature, such as randomized trace estimators, provable bounds for the power method, and the use of random projections to approximate the eigenvalues of a matrix. All three algorithms come with provable accuracy guarantees, and our experimental evaluations support our theoretical findings, showing considerable speedups with small loss in accuracy.
Keywords: von Neumann entropy, randomized algorithms, RandNLA, Taylor polynomials, Chebyshev polynomials, random projections
I. Introduction
Entropy is a fundamental quantity in many areas of science and engineering. von Neumann entropy, named after John von Neumann, is an extension of classical entropy concepts to the field of quantum mechanics. Its foundations can be traced to von Neumann's work on Mathematische Grundlagen der Quantenmechanik1. In this work, von Neumann introduced the notion of a density matrix, which facilitated the extension of the tools of classical statistical mechanics to the quantum domain in order to develop a theory of quantum mechanics.
From a mathematical perspective (see Section I-A for details) the real density matrix $R \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite matrix with unit trace. Let $p_i$, $i = 1 \ldots n$, be the eigenvalues of $R$ in decreasing order; then, the entropy of $R$ is defined as2
$$\mathcal{H}(R) = -\sum_{i=1}^{n} p_i \ln p_i. \qquad (1)$$
The above definition is a proper extension of both the Gibbs entropy and the Shannon entropy to the quantum case. It implies an obvious algorithm to compute $\mathcal{H}(R)$ by computing the eigendecomposition of R; known algorithms for this task can be prohibitively expensive for large values of n, particularly when the matrix becomes dense [1]. For example, [2] describes an entangled two-photon state generated by spontaneous parametric down-conversion, which can result in a sparse and banded density matrix with $n \approx 10^8$.
Motivated by the high computational cost, we seek numerical algorithms that approximate the von Neumann entropy of large density matrices, i.e., symmetric positive definite matrices with unit trace, faster than the trivial approach. Our algorithms build upon recent developments in the field of Randomized Numerical Linear Algebra (RandNLA), an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. Indeed, our work here lies at the intersection of RandNLA and information theory, delivering novel randomized linear algebra algorithms and related quality-of-approximation results for a fundamental information-theoretic metric.
A. Background
We focus on finite-dimensional function (state) spaces. In this setting, the density matrix R represents the statistical mixture of k ≤ n pure states, and has the form
$$R = \sum_{i=1}^{k} p_i\, \psi_i \psi_i^T \in \mathbb{R}^{n \times n}. \qquad (2)$$
The vectors $\psi_i \in \mathbb{R}^{n}$ for i = 1 … k represent the k ≤ n pure states and can be assumed to be pairwise orthogonal and normal, while the $p_i$'s correspond to the probability of each state and satisfy $p_i > 0$ and $\sum_{i=1}^{k} p_i = 1$. From a linear algebraic perspective, eqn. (2) can be rewritten as
$$R = \Psi\, \Sigma_p\, \Psi^T, \qquad (3)$$
where $\Psi \in \mathbb{R}^{n \times k}$ is the matrix whose columns are the vectors $\psi_i$ and $\Sigma_p \in \mathbb{R}^{k \times k}$ is a diagonal matrix whose entries are the (positive) $p_i$'s. Given our assumptions on the $\psi_i$, $\Psi^T\Psi = I$; also, R is symmetric positive semidefinite with its eigenvalues equal to the $p_i$ and corresponding left/right singular vectors equal to the $\psi_i$'s, and $\mathrm{tr}(R) = \sum_{i=1}^{k} p_i = 1$. Notice that eqn. (3) essentially reveals the (thin) Singular Value Decomposition (SVD) [1] of R. The von Neumann entropy of R, denoted by $\mathcal{H}(R)$, is equal to (see also eqn. (1))
$$\mathcal{H}(R) = -\sum_{i=1}^{k} p_i \ln p_i = -\mathrm{tr}\!\left(R \ln R\right). \qquad (4)$$
The second equality follows from the definition of matrix functions [3]. More precisely, we overload notation and consider the full SVD of R, namely $R = \Psi \Sigma_p \Psi^T$, where $\Psi \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose top k columns correspond to the k pure states and whose bottom n − k columns are chosen so that $\Psi\Psi^T = \Psi^T\Psi = I_n$. Here $\Sigma_p \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose top k diagonal entries are the $p_i$'s and whose bottom n − k diagonal entries are set to zero. Let h(x) = x ln x for any x > 0 and let h(0) = 0. Then, using the cyclical property of the trace and the definition of h(x),
$$\mathcal{H}(R) = -\mathrm{tr}\!\left(\Psi\, h(\Sigma_p)\, \Psi^T\right) = -\mathrm{tr}\!\left(h(\Sigma_p)\, \Psi^T\Psi\right) = -\mathrm{tr}\!\left(h(\Sigma_p)\right) = -\sum_{i=1}^{k} p_i \ln p_i. \qquad (5)$$
B. Trace estimators
The following lemma appeared in [4] and is immediate from Theorem 5.2 in [5]. It implies an algorithm to approximate the trace of any symmetric positive semidefinite matrix A by computing inner products of the matrix with Gaussian random vectors.
Lemma 1. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive semi-definite matrix, let 0 < ϵ < 1 be an accuracy parameter, and let 0 < δ < 1 be a failure probability. If $g_1, g_2, \ldots, g_s \in \mathbb{R}^{n}$ are independent random standard Gaussian vectors, then, for s = ⌈20 ln(2/δ)/ϵ²⌉, with probability at least 1 − δ,
$$\left|\mathrm{tr}(A) - \frac{1}{s}\sum_{i=1}^{s} g_i^T A\, g_i\right| \le \epsilon\,\mathrm{tr}(A).$$
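To make the estimator concrete, here is a minimal sketch in Python/NumPy (our own illustration, not code from the paper); the function name and the toy matrix are ours, and the sample size follows the lemma.

```python
import numpy as np

def gaussian_trace_estimator(A, eps, delta, rng=None):
    """Estimate tr(A) for a symmetric PSD matrix A as in Lemma 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    s = int(np.ceil(20.0 * np.log(2.0 / delta) / eps**2))
    G = rng.standard_normal((n, s))           # columns are the Gaussian vectors g_i
    quad_forms = np.einsum('ij,ij->j', G, A @ G)  # g_i^T A g_i for every column
    return quad_forms.mean()

# toy usage: a random PSD matrix normalized to unit trace
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = B @ B.T
A /= np.trace(A)
print(gaussian_trace_estimator(A, eps=0.05, delta=0.1, rng=rng))  # approximately 1.0
```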
C. Our contributions
We present and analyze three randomized algorithms to approximate the von Neumann entropy of density matrices. The first two algorithms (Sections II and III) leverage two different polynomial approximations of the matrix function $R \ln R$: the first approximation uses a Taylor series expansion, while the second approximation uses Chebyshev polynomials. Both algorithms return, with high probability, relative-error approximations to the true entropy of the input density matrix, under certain assumptions. More specifically, in both cases, we need to assume that the input density matrix has n non-zero eigenvalues, or, equivalently, that the probabilities $p_i$, i = 1 … n, corresponding to the underlying n pure states are non-zero. The running time of both algorithms is proportional to the sparsity of the input density matrix and depends (see Theorems 2 and 4 for precise statements) on, roughly, the ratio of the largest to the smallest probability, p1/pn (recall that the smallest probability is assumed to be non-zero), as well as the desired accuracy.
The third algorithm (Section V) is fundamentally different, if not orthogonal, to the previous two approaches. It leverages the power of random projections [6], [7] to approximate numerical linear algebra quantities, such as the eigenvalues of a matrix. Assuming that the density matrix R has exactly k ⪡ n non-zero eigenvalues, e.g., there are k pure states with non-zero probabilities pi, i = 1 … k, the proposed algorithm returns, with high probability, relative error approximations to all k probabilities pi. This, in turn, implies an additive-relative error approximation to the entropy of the density matrix, which, under a mild assumption on the true entropy of the density matrix, becomes a relative error approximation (see Theorem 10 for a precise statement). The running time of the algorithm is again proportional to the sparsity of the density matrix and depends on the target accuracy, but, unlike the previous two algorithms, does not depend on any function of the pi.
From a technical perspective, the theoretical analysis of the first two algorithms proceeds by combining polynomial approximations to matrix functions, using either Taylor series or Chebyshev polynomials, with randomized trace estimators. A provably accurate variant of the power method is used to estimate the largest probability p1. If this estimate is significantly smaller than one, it can improve the running times of the proposed algorithms (see the discussion after Theorem 2). The third algorithm leverages a powerful, multiplicative matrix perturbation result that first appeared in [8]. Our work in Section V is a novel application of this inequality to derive bounds for RandNLA algorithms.
Finally, in Section VI, we present a detailed evaluation of our algorithms on synthetic density matrices of various sizes, most of which were generated using Matlab’s QETLAB toolbox [9]. For some of the larger matrices that were used in our evaluations, the exact computation of the entropy takes hours, whereas our algorithms return approximations with relative errors well below 0.5% in only a few minutes.
D. Prior work
The first non-trivial algorithm to approximate the von Neumann entropy of a density matrix appeared in [2]. Their approach is essentially the same as our approach in Section III. Indeed, our algorithm in Section III was inspired by their approach. However, our analysis is somewhat different, leveraging a provably accurate variant of the power method, as well as provably accurate trace estimators, to derive a relative error approximation to the entropy of a density matrix, under appropriate assumptions. A detailed, technical comparison between our results in Section III and the work of [2] is delegated to Section III-C.
Independently and in parallel with our work, [10] presented a multipoint interpolation algorithm (building upon [11]) to compute a relative error approximation for the entropy of a real matrix with bounded condition number. The proposed running time of Theorem 35 of [10] does not depend on the condition number of the input matrix (i.e., the ratio of the largest to the smallest probability), which is a clear advantage in the case of ill-conditioned matrices. However, the dependence of the algorithm of Theorem 35 of [10] on terms like $(\log n/\epsilon)^{6}$, or on terms involving nnz(A) (where nnz(A) represents the number of non-zero elements of the matrix A), could blow up the running time of the proposed algorithm for reasonably conditioned matrices.
We also note the recent work in [4], which used Taylor approximations to matrix functions to estimate the log determinant of symmetric positive definite matrices (see also Section 1.2 of [4] for an overview of prior work on approximating matrix functions via Taylor series). The work of [12] used a Chebyshev polynomial approximation to estimate the log determinant of a matrix and is reminiscent of our approach in Section III and, of course, the work of [2].
We conclude this section by noting that our algorithms use two tools (described, for the sake of completeness, in the Appendix) that appeared in prior work. The first tool is the power method, with a provable analysis that first appeared in [13]. The second tool is a provably accurate trace estimation algorithm for symmetric positive semidefinite matrices that appeared in [5].
II. An approach via Taylor series
Our first approach to approximate the von Neumann entropy of a density matrix uses a Taylor series expansion to approximate the logarithm of a matrix, combined with a relative-error trace estimator for symmetric positive semi-definite matrices and the power method to upper bound the largest singular value of a matrix.
A. Algorithm and Main Theorem
Our main result is an analysis of Algorithm 1 (see below) that guarantees a relative error approximation to the entropy of the density matrix R, under the assumption that R has n pure states with 0 < ℓ ≤ $p_i$ for all i = 1 … n. The following theorem is our main quality-of-approximation result for Algorithm 1.
Algorithm 1.
1: | INPUT: $R \in \mathbb{R}^{n \times n}$, accuracy parameter ε > 0, failure probability δ, and integer m > 0. |
2: | Compute $\tilde{p}_1$, the estimate of the largest eigenvalue $p_1$ of R, using Algorithm 8 (see Appendix) with q and t set as in Lemma 14. |
3: | Set $u = \min\{1,\; 6\,\tilde{p}_1\}$. |
4: | Set s = ⌈20 ln(2/δ)/ε²⌉. |
5: | Let $g_1, g_2, \ldots, g_s \in \mathbb{R}^{n}$ be i.i.d. random Gaussian vectors. |
6: | OUTPUT: return $\widehat{\mathcal{H}}(R) = \ln\frac{1}{u} + \frac{1}{s}\sum_{i=1}^{s}\sum_{k=1}^{m}\frac{g_i^T\, R\,(I_n - u^{-1}R)^{k}\, g_i}{k}$. |
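The following Python/NumPy sketch (our own, not the paper's implementation) mirrors Algorithm 1. The crude power-method warm-up, the choice u = min{1, 6·p̃1}, and the function names are assumptions of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def taylor_entropy(R, m, eps, delta, u=None, rng=None):
    """Sketch of Algorithm 1: Taylor-series estimator of the von Neumann entropy.

    Assumes R is a real symmetric density matrix with strictly positive eigenvalues.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = R.shape[0]
    if u is None:
        # crude power-method estimate of p_1; the factor 6 reflects the
        # power-method guarantee assumed here (see Lemma 14) -- an assumption of this sketch
        x = rng.standard_normal(n)
        for _ in range(3 * int(np.ceil(np.log(n)))):
            x = R @ x
            x /= np.linalg.norm(x)
        p1_est = x @ (R @ x)
        u = min(1.0, 6.0 * p1_est)
    s = int(np.ceil(20.0 * np.log(2.0 / delta) / eps**2))
    G = rng.standard_normal((n, s))          # the Gaussian vectors g_i (as columns)
    total = np.log(1.0 / u) * np.trace(R)    # tr(R) = 1 for a density matrix
    V = R @ G                                # columns: R g_i
    for k in range(1, m + 1):
        V = V - (R @ V) / u                  # columns: R (I - R/u)^k g_i
        total += np.einsum('ij,ij->j', G, V).mean() / k
    return total
```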
Theorem 2. Let R be a density matrix such that all probabilities $p_i$, i = 1 … n, satisfy 0 < ℓ ≤ $p_i$. Let u be computed as in Algorithm 1 and let $\widehat{\mathcal{H}}(R)$ be the output of Algorithm 1 on inputs R, m, and ϵ < 1. Then, with probability at least 1 − 2δ,
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le 2\,\epsilon\,\mathcal{H}(R),$$
by setting $m = \left\lceil \frac{u}{\ell}\,\ln\frac{1}{\epsilon} \right\rceil$. The algorithm runs in time
$$O\!\left(\frac{u}{\ell}\cdot\frac{\ln(1/\epsilon)\,\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R) + \ln n\cdot\ln\frac{1}{\delta}\cdot\mathrm{nnz}(R)\right).$$
A few remarks are necessary to better understand the above theorem. First, ℓ could be set to $p_n$, the smallest of the probabilities corresponding to the n pure states of the density matrix R. Second, it should be obvious that u in Algorithm 1 could be simply set to one; we could thus avoid calling Algorithm 8 to estimate $p_1$ by $\tilde{p}_1$ (and hence compute u). However, if $p_1$ is small, then u could be significantly smaller than one, thus reducing the running time of Algorithm 1, which depends on the ratio u/ℓ. Third, ideally, if both $p_1$ and $p_n$ were used instead of u and ℓ, respectively, the running time of the algorithm would scale with the ratio $p_1/p_n$.
B. Proof of Theorem 2
We now prove Theorem 2, which analyzes the performance of Algorithm 1. Our first lemma presents a simple expression for $\mathcal{H}(R)$ using a Taylor series expansion.
Lemma 3. Let $R \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix with unit trace and whose eigenvalues lie in the interval [ℓ, u], for some 0 < ℓ ≤ u ≤ 1. Then,
$$\mathcal{H}(R) = \ln\frac{1}{u} + \sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,(I_n - u^{-1}R)^{k}\right)}{k}.$$
Proof: From the definition of the von Neumann entropy and a Taylor expansion,
$$\mathcal{H}(R) = -\mathrm{tr}\!\left(R\ln R\right) = -\mathrm{tr}(R)\ln u - \mathrm{tr}\!\left(R\,\ln\!\left(I_n - (I_n - u^{-1}R)\right)\right) = \ln\frac{1}{u} + \sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,(I_n - u^{-1}R)^{k}\right)}{k}. \qquad (6)$$
Eqn. (6) follows since R has unit trace and from a Taylor expansion: indeed, $\ln(I_n - A) = -\sum_{k=1}^{\infty} A^{k}/k$ for a symmetric matrix A whose eigenvalues are all in the interval (−1, 1). We note that the eigenvalues of $I_n - u^{-1}R$ are in the interval [0, 1 − (ℓ/u)], whose upper bound is strictly less than one since, by our assumptions, ℓ/u > 0.
We now proceed to prove Theorem 2. We will condition our analysis on Algorithm 8 being successful, which happens with probability at least 1 − δ. In this case, u is an upper bound for all probabilities $p_i$. For notational convenience, set $C = I_n - u^{-1}R$. We start by manipulating $\left|\widehat{\mathcal{H}}(R) - \mathcal{H}(R)\right|$ as follows:
$$\left|\widehat{\mathcal{H}}(R) - \mathcal{H}(R)\right| \le \underbrace{\left|\frac{1}{s}\sum_{i=1}^{s}\sum_{k=1}^{m}\frac{g_i^T R\, C^{k} g_i}{k} - \sum_{k=1}^{m}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k}\right|}_{\Delta_1} + \underbrace{\sum_{k=m+1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k}}_{\Delta_2}.$$
We now bound the two terms Δ1 and Δ2 separately. We start with Δ1: the idea is to apply Lemma 1 on the matrix $\sum_{k=1}^{m}\frac{R\,C^{k}}{k}$ with s = ⌈20 ln(2/δ)/ϵ²⌉. Hence, with probability at least 1 − δ:
$$\Delta_1 \le \epsilon\,\sum_{k=1}^{m}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k} \le \epsilon\,\sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k}. \qquad (7)$$
A subtle point in applying Lemma 1 is that the matrix $\sum_{k=1}^{m}\frac{R\,C^{k}}{k}$ must be symmetric positive semidefinite. To prove this, let the SVD of R be $R = \Psi\Sigma_p\Psi^T$, where all three matrices are in $\mathbb{R}^{n \times n}$ and the diagonal entries of $\Sigma_p$ are in the interval [ℓ, u]. Then, it is easy to see that $C = I_n - u^{-1}R = \Psi(I_n - u^{-1}\Sigma_p)\Psi^T$ and $R\,C^{k} = \Psi\Sigma_p(I_n - u^{-1}\Sigma_p)^{k}\Psi^T$, where the diagonal entries of $I_n - u^{-1}\Sigma_p$ are non-negative, since the largest entry in $\Sigma_p$ is upper bounded by u. This proves that $R\,C^{k}$ is symmetric positive semidefinite for any k, a fact which will be useful throughout the proof. Now,
$$\sum_{k=1}^{m}\frac{R\,C^{k}}{k} = \Psi\left(\sum_{k=1}^{m}\frac{\Sigma_p\,(I_n - u^{-1}\Sigma_p)^{k}}{k}\right)\Psi^T,$$
which shows that the matrix of interest is symmetric positive semidefinite. Additionally, since RCk is symmetric positive semidefinite, its trace is non-negative, which proves the second inequality in eqn. (7) as well.
We proceed to bound Δ2 as follows:
$$\Delta_2 = \sum_{k=m+1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k} \le \sum_{k=m+1}^{\infty}\frac{\left\|C^{m}\right\|_2\,\mathrm{tr}\!\left(R\,C^{k-m}\right)}{k} \qquad (8)$$
$$\le \left\|C\right\|_2^{m}\,\sum_{k=m+1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k-m}\right)}{k-m} = \left\|C\right\|_2^{m}\,\sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k} \qquad (9)$$
$$\le \left(1 - \frac{\ell}{u}\right)^{m}\,\sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k}. \qquad (10)$$
To prove eqn. (8), we used von Neumann's trace inequality3. Eqn. (8) now follows since Ck−mR is symmetric positive semidefinite4. To prove eqn. (9), we used the fact that tr (RCk)/k ≥ 0 for any k ≥ 1. Finally, to prove eqn. (10), we used the fact that ‖C‖2 = ‖In − u−1Σp‖2 ≤ 1 − ℓ/u since the smallest entry in Σp is at least ℓ by our assumptions. We also removed unnecessary absolute values since tr (RCk)/k is non-negative for any positive integer k.
Combining the bounds for Δ1 and Δ2 gives
$$\left|\widehat{\mathcal{H}}(R) - \mathcal{H}(R)\right| \le \Delta_1 + \Delta_2 \le \left(\epsilon + \left(1 - \frac{\ell}{u}\right)^{m}\right)\sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k}.$$
We have already proven in Lemma 3 that
$$\sum_{k=1}^{\infty}\frac{\mathrm{tr}\!\left(R\,C^{k}\right)}{k} = \mathcal{H}(R) + \ln u \le \mathcal{H}(R),$$
where the last inequality follows since u ≤ 1. Collecting our results, we get
$$\left|\widehat{\mathcal{H}}(R) - \mathcal{H}(R)\right| \le \left(\epsilon + \left(1 - \frac{\ell}{u}\right)^{m}\right)\mathcal{H}(R).$$
Setting
$$m = \left\lceil \frac{u}{\ell}\,\ln\frac{1}{\epsilon} \right\rceil$$
and using $(1 - x^{-1})^{x} \le e^{-1}$ (x > 0), guarantees that $(1 - \ell/u)^{m} \le \epsilon$ and concludes the proof of the theorem. We note that the failure probability of the algorithm is at most 2δ (the sum of the failure probabilities of the power method and the trace estimation algorithm).
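For completeness, the following short derivation (our addition, not part of the original proof) verifies that this choice of m indeed forces $(1 - \ell/u)^{m} \le \epsilon$:

```latex
% With m = ceil( (u/l) ln(1/eps) ) and x = u/l >= 1:
\left(1-\tfrac{\ell}{u}\right)^{m}
  \le \left(1-\tfrac{\ell}{u}\right)^{\frac{u}{\ell}\ln\frac{1}{\epsilon}}
  = \left[\left(1-x^{-1}\right)^{x}\right]^{\ln\frac{1}{\epsilon}}
  \le \left(e^{-1}\right)^{\ln\frac{1}{\epsilon}}
  = \epsilon .
```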
Finally, we discuss the running time of Algorithm 1, which is equal to $O(m \cdot s \cdot \mathrm{nnz}(R))$, since the algorithm amounts to s·m matrix-vector products with R (and C). Since $m = O\!\left(\frac{u}{\ell}\ln\frac{1}{\epsilon}\right)$ and $s = O\!\left(\frac{\ln(1/\delta)}{\epsilon^{2}}\right)$, the running time becomes (after accounting for the running time of Algorithm 8)
$$O\!\left(\frac{u}{\ell}\cdot\frac{\ln(1/\epsilon)\,\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R) + \ln n\cdot\ln\frac{1}{\delta}\cdot\mathrm{nnz}(R)\right).$$
III. An approach via Chebyshev polynomials
Our second approach is to use a Chebyshev polynomial-based approximation scheme to estimate the entropy of a density matrix. Our approach follows the work of [2], but our analysis uses the trace estimators of [5] as well as Algorithm 8 and its analysis. Importantly, we present conditions under which the proposed approach is competitive with the approach of Section II.
A. Algorithm and Main Theorem
The proposed algorithm leverages the fact that the von Neumann entropy of a density matrix R is equal to the (negative) trace of the matrix function R ln R and approximates the function R ln R by a sum of Chebyshev polynomials; then, the trace of the resulting matrix is estimated using the trace estimator of [5].
Let $f_m(x) = \sum_{w=0}^{m} \alpha_w\, T_w(x)$, where the coefficients $\alpha_0$, $\alpha_1$, and $\alpha_w$ for w ≥ 2 are specified in Lemma 5 below, and let $T_w(x)$, x ∈ [0, u], be the (shifted) Chebyshev polynomials of the first kind for any integer w ≥ 0. Algorithm 2 computes u (an upper bound estimate for the largest probability p1 of the density matrix R) and then computes fm(R) and estimates its trace. We note that the computation can be done efficiently using Clenshaw's algorithm; see Appendix B for this well-known approach.
Algorithm 2.
1: | INPUT: $R \in \mathbb{R}^{n \times n}$, accuracy parameter ε > 0, failure probability δ, and integer m > 0. |
2: | Compute $\tilde{p}_1$, the estimate of the largest eigenvalue $p_1$ of R, using Algorithm 8 (see Appendix) with q and t set as in Lemma 14. |
3: | Set $u = \min\{1,\; 6\,\tilde{p}_1\}$. |
4: | Set s = ⌈20 ln(2/δ)/ε²⌉. |
5: | Let $g_1, g_2, \ldots, g_s \in \mathbb{R}^{n}$ be i.i.d. random Gaussian vectors. |
6: | OUTPUT: $\widehat{\mathcal{H}}(R) = -\frac{1}{s}\sum_{i=1}^{s} g_i^T\, f_m(R)\, g_i$. |
Our main result is an analysis of Algorithm 2 that guarantees a relative error approximation to the entropy of the density matrix R, under the assumption that R has n pure states with 0 < ℓ ≤ $p_i$ for all i = 1 … n. The following theorem is our main quality-of-approximation result for Algorithm 2.
Theorem 4. Let R be a density matrix such that all probabilities $p_i$, i = 1 … n, satisfy 0 < ℓ ≤ $p_i$. Let u be computed as in Algorithm 1 and let $\widehat{\mathcal{H}}(R)$ be the output of Algorithm 2 on inputs R, m, and ϵ < 1. Then, with probability at least 1 − 2δ,
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le 3\,\epsilon\,\mathcal{H}(R),$$
by setting m as in eqn. (15) of Section III-B, namely $m = O\!\left(\sqrt{\frac{n\,u}{\epsilon\,\ell\,\ln\frac{1}{1-\ell}}}\right)$. The algorithm runs in time
$$O\!\left(\sqrt{\frac{n\,u}{\epsilon\,\ell\,\ln\frac{1}{1-\ell}}}\cdot\frac{\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R) + \ln n\cdot\ln\frac{1}{\delta}\cdot\mathrm{nnz}(R)\right).$$
The similarities between Theorems 2 and 4 are obvious: same assumptions and directly comparable accuracy guarantees. The only difference is in the running times: the Taylor series approach has a milder dependency on ϵ, while the Chebyshev-based approximation has a milder dependency on the ratio u/ℓ, which controls the behavior of the probabilities $p_i$. However, for small values of ℓ (ℓ → 0),
$$\ln\frac{1}{1-\ell} \approx \ell.$$
Thus, the Chebyschev-based approximation has a milder dependency on u but not necessarily ℓ when compared to the Taylor-series approach. We also note that the discussion following Theorem 2 is again applicable here.
B. Proof of Theorem 4
We will condition our analysis on Algorithm 8 being successful, which happens with probability at least 1 − δ. In this case, u is an upper bound for all probabilities $p_i$. We now recall (from Section I-A) the definition of the function h(x) = x ln x for any real x ∈ (0, 1], with h(0) = 0. Let $R = \Psi\,\Sigma_p\,\Psi^T$ be the density matrix, where both $\Sigma_p$ and $\Psi$ are matrices in $\mathbb{R}^{n \times n}$. Notice that the diagonal entries of $\Sigma_p$ are the $p_i$'s and they satisfy 0 < ℓ ≤ $p_i$ ≤ u ≤ 1 for all i = 1 … n.
Using the definitions of matrix functions from [3], we can now define $h(R) = \Psi\, h(\Sigma_p)\, \Psi^T$, where $h(\Sigma_p)$ is a diagonal matrix in $\mathbb{R}^{n \times n}$ with entries equal to $h(p_i)$ for all i = 1 … n. We now restate Proposition 3.1 from [2] in the context of our work, using our notation.
Lemma 5. The function h(x) in the interval [0, u] can be approximated by
$$f_m(x) = \sum_{w=0}^{m} \alpha_w\, T_w(x),$$
where the coefficients $\alpha_0$, $\alpha_1$, and $\alpha_w$ for w ≥ 2 are explicit functions of u (their exact values are given in Proposition 3.1 of [2]). For any m ≥ 1,
$$\left|h(x) - f_m(x)\right| \le \frac{u}{2\,m\,(m+1)}$$
for x ∈ [0, u].
In the above, $T_w(x) = \cos\!\left(w \arccos\!\left(\frac{2}{u}\,x - 1\right)\right)$ for any integer w ≥ 0 and x ∈ [0, u]. Notice that the function (2/u)x − 1 essentially maps the interval [0, u], which is the interval of interest for the function h(x), to [−1, 1], which is the interval over which Chebyshev polynomials are commonly defined. The above lemma exploits the fact that the Chebyshev polynomials form an orthonormal basis for the space of functions over the interval [−1, 1].
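As a quick numerical illustration (our own sketch; it uses NumPy's generic least-squares Chebyshev fit as a stand-in for the specific coefficients of Lemma 5), one can check how well a low-degree Chebyshev polynomial approximates h(x) = x ln x on [0, u]:

```python
import numpy as np

u, m = 0.1, 20
xs = np.linspace(0.0, u, 4001)
h = np.zeros_like(xs)
h[1:] = xs[1:] * np.log(xs[1:])          # h(x) = x ln x, with h(0) = 0

# degree-m Chebyshev least-squares fit on [0, u] (a generic stand-in for f_m)
fit = np.polynomial.Chebyshev.fit(xs, h, deg=m, domain=[0.0, u])
err = np.max(np.abs(fit(xs) - h))
print(f"max |h(x) - f_m(x)| on [0, {u}] with m = {m}: {err:.2e}")
```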
We now move on to approximate the entropy using the function $f_m(x)$. First,
$$\mathrm{tr}\!\left(f_m(R)\right) = \sum_{i=1}^{n} f_m(p_i). \qquad (11)$$
Recall from Section I-A that $\mathcal{H}(R) = -\sum_{i=1}^{n} h(p_i)$. We can now bound the difference between $\mathrm{tr}(-f_m(R))$ and $\mathcal{H}(R)$. Indeed,
$$\left|\mathrm{tr}\!\left(-f_m(R)\right) - \mathcal{H}(R)\right| = \left|\sum_{i=1}^{n}\left(h(p_i) - f_m(p_i)\right)\right| \le \sum_{i=1}^{n}\left|h(p_i) - f_m(p_i)\right| \le \frac{n\,u}{2\,m\,(m+1)}. \qquad (12)$$
The last inequality follows by the final bound in Lemma 5, since all pi’s are in the interval [0, u].
Recall that we also assumed that all pis are lower-bounded by ℓ > 0 and thus
$$\mathcal{H}(R) = -\sum_{i=1}^{n} p_i \ln p_i \ge -\ln(1-\ell)\sum_{i=1}^{n} p_i = \ln\frac{1}{1-\ell} > 0. \qquad (13)$$
We note that the upper bound on the pis follows since the smallest pi is at least ℓ > 0 and thus the largest pi cannot exceed 1 − ℓ < 1. We note that we cannot use the upper bound u in the above formula, since u could be equal to one; 1 − ℓ is always strictly less than one but it cannot be a priori computed (and thus cannot be used in Algorithm 2), since ℓ is not a priori known.
We can now restate the bound of eqn. (12) as follows:
$$\left|\mathrm{tr}\!\left(-f_m(R)\right) - \mathcal{H}(R)\right| \le \frac{n\,u}{2\,m\,(m+1)} \le \epsilon\,\ell\,\ln\frac{1}{1-\ell} \le \epsilon\,\mathcal{H}(R), \qquad (14)$$
where the last inequality follows by setting
$$m = \left\lceil \sqrt{\frac{n\,u}{2\,\epsilon\,\ell\,\ln\frac{1}{1-\ell}}}\;\right\rceil. \qquad (15)$$
Next, we argue that the matrix −fm(R) is symmetric positive semidefinite (under our assumptions) and thus one can apply Lemma 1 to estimate its trace. We note that
$$-f_m(R) = \Psi\left(-f_m(\Sigma_p)\right)\Psi^T,$$
which trivially proves the symmetry of −fm(R) and also shows that its eigenvalues are equal to −fm(pi) for all i = 1 … n. We now bound
$$-f_m(p_i) = -h(p_i) + \left(h(p_i) - f_m(p_i)\right) \ge -h(p_i) - \frac{u}{2\,m\,(m+1)} \ge -h(p_i) - \frac{\epsilon\,\ell}{n}\,\ln\frac{1}{1-\ell},$$
where the inequalities follow from Lemma 5 and our choice for m from eqn. (15). This inequality holds for all i = 1 … n and implies that
$$-f_m(p_i) \ge p_i\,\ln\frac{1}{1-\ell} - \frac{\epsilon\,\ell}{n}\,\ln\frac{1}{1-\ell} \ge \ell\left(1 - \frac{\epsilon}{n}\right)\ln\frac{1}{1-\ell},$$
using our upper (1 − ℓ < 1) and lower (ℓ > 0) bounds on the pis. Now ϵ ≤ 1 proves that −fm(pi) are non-negative for all i = 1 … n and thus −fm(R) is a symmetric positive semidefinite matrix; it follows that its trace is also non-negative.
We can now apply the trace estimator of Lemma 1 to get
$$\left|\frac{1}{s}\sum_{i=1}^{s} g_i^T\left(-f_m(R)\right) g_i - \mathrm{tr}\!\left(-f_m(R)\right)\right| \le \epsilon\,\mathrm{tr}\!\left(-f_m(R)\right). \qquad (16)$$
For the above bound to hold, we need to set
$$s = \left\lceil 20\,\ln(2/\delta)/\epsilon^{2}\right\rceil. \qquad (17)$$
We now conclude as follows:
$$\left|\widehat{\mathcal{H}}(R) - \mathcal{H}(R)\right| \le \left|\widehat{\mathcal{H}}(R) - \mathrm{tr}\!\left(-f_m(R)\right)\right| + \left|\mathrm{tr}\!\left(-f_m(R)\right) - \mathcal{H}(R)\right| \le \epsilon\,\mathrm{tr}\!\left(-f_m(R)\right) + \epsilon\,\mathcal{H}(R) \le \epsilon\,(1+\epsilon)\,\mathcal{H}(R) + \epsilon\,\mathcal{H}(R) \le 3\,\epsilon\,\mathcal{H}(R).$$
The first inequality follows by adding and subtracting −tr(fm(R)) and using sub-additivity of the absolute value; the second inequality follows by eqns. (14) and (16); the third inequality follows again by eqn. (14); and the last inequality follows by using ϵ ≤ 1.
We note that the failure probability of the algorithm is at most 2δ (the sum of the failure probabilities of the power method and the trace estimation algorithm). Finally, we discuss the running time of Algorithm 2, which is equal to $O(m \cdot s \cdot \mathrm{nnz}(R))$. Using the values for m and s from eqns. (15) and (17), the running time becomes (after accounting for the running time of Algorithm 8)
$$O\!\left(\sqrt{\frac{n\,u}{\epsilon\,\ell\,\ln\frac{1}{1-\ell}}}\cdot\frac{\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R) + \ln n\cdot\ln\frac{1}{\delta}\cdot\mathrm{nnz}(R)\right).$$
C. A comparison with the results of [2]
The work of [2] culminates in the error bounds described in Theorem 4.3 (and the ensuing discussion). In our parlance, [2] first derives the error bound of eqn. (12). It is worth emphasizing that the bound of eqn. (12) holds even if the pis are not necessarily strictly positive, as assumed by Theorem 4: the bound holds even if some of the pis are equal to zero.
Unfortunately, without imposing a lower bound assumption on the pis it is difficult to get a meaningful error bound and an efficient algorithm. Indeed, the error implied by eqn. (12) (without any assumption on the pis) necessitates setting m to at least on the order of $\sqrt{n}$ (perhaps up to a logarithmic factor, as we will discuss shortly). To understand this, note that the entropy of the density matrix R ranges between zero and ln k, where k is the rank of the matrix R, i.e., the number of non-zero $p_i$'s. Clearly, k ≤ n and thus ln n is an upper bound for $\mathcal{H}(R)$. Notice that if $\mathcal{H}(R)$ is smaller than n/(2m²), the error bound of eqn. (12) does not even guarantee that the resulting approximation will be positive, which is, of course, meaningless as an approximation to the entropy.
In order to guarantee a relative error bound of the form $\left|\mathrm{tr}(-f_m(R)) - \mathcal{H}(R)\right| \le \epsilon\,\mathcal{H}(R)$ via eqn. (12), we need to set m to be at least
$$m \ge \sqrt{\frac{n\,u}{2\,\epsilon\,\mathcal{H}(R)}}, \qquad (18)$$
which even for "large" values of $\mathcal{H}(R)$ (i.e., values close to the upper bound ln n) still implies that m is $\Omega\!\left(\sqrt{n/\ln n}\right)$. Even with such a large value for m, we are still not done: we need an efficient trace estimation procedure for the matrix −fm(R). While this matrix is always symmetric, it is not necessarily positive or negative semi-definite (unless additional assumptions are imposed on the pis, like we did in Theorem 4).
IV. Approaches for Hermitian Density Matrices
Hermitian (instead of symmetric) positive definite matrices frequently arise in quantum mechanics. The analyses of Sections II and III focus on real density matrices; we now briefly discuss how they can be extended to Hermitian density matrices. Recall that both approaches follow the same algorithmic scheme. First, the dominant eigenvalue of the density matrix is estimated via the power method; a trace estimation follows, using Gaussian trace estimators on either the truncated Taylor expansion of a suitable matrix function or on a Chebyshev polynomial approximation of the same matrix function. Interestingly, the Taylor expansions, as well as the Chebyshev polynomial approximations, both work when the input matrix is complex. However, the estimation of the dominant eigenvalue of R poses a theoretical difficulty: to the best of our knowledge, there is no known bound for the accuracy of the power method in the case where R is complex. Lemma 14 guarantees relative error approximations to the dominant eigenvalue of real matrices, but we are not aware of any provable relative error bound for the complex case. To avoid this issue we will be using one as a (loose) upper bound for the dominant eigenvalue.
The crucial step in order to guarantee relative error approximations to the entropy of a Hermitian positive definite matrix is to guarantee relative error approximations for the trace of a Hermitian positive definite matrix. Lemma 1 assumes symmetric positive semi-definite matrices; we now prove that the same lemma can be applied on Hermitian positive definite matrices to achieve the same guarantees.
Theorem 6. Every Hermitian matrix $A \in \mathbb{C}^{n \times n}$ can be expressed as
$$A = B + iC, \qquad (19)$$
where $B \in \mathbb{R}^{n \times n}$ is symmetric and $C \in \mathbb{R}^{n \times n}$ is anti-symmetric (or skew-symmetric). If A is positive semi-definite, then B is also positive semi-definite.
Proof: The proof is trivial and uses the fact that for any Hermitian (symmetric) positive semi-definite matrix all eigenvalues are real and non-negative.
Theorem 7. The trace of a Hermitian matrix expressed as in eqn. (19) is equal to the trace of its real part: $\mathrm{tr}(A) = \mathrm{tr}(B)$.
Proof: Using tr(A) = tr(Aᵀ), it is easy to see that
$$\mathrm{tr}(A) = \mathrm{tr}(B) + i\,\mathrm{tr}(C) = \mathrm{tr}(B^T) + i\,\mathrm{tr}(C^T) = \mathrm{tr}(B) - i\,\mathrm{tr}(C) = \mathrm{tr}(B).$$
The last equality follows by noticing that the only way for the equality to hold for a skew-symmetric matrix C is if tr(C) = −tr(C), which is true only if tr(C) = 0.
In words, Theorem 7 states that the trace of a Hermitian matrix equals the trace of its real part. Similarly, Theorem 6 states that the real part of a Hermitian positive semi-definite matrix is symmetric positive semi-definite. Combining both theorems, we conclude that we can estimate the trace of a Hermitian positive definite matrix up to relative error, using the Gaussian trace estimator of Lemma 1 on its real part. Therefore, both approaches generalize to Hermitian positive definite matrices using one as an upper bound instead of u for the dominant eigenvalue. Algorithms 3 and 4 are modified versions of Algorithms 1 and 2, respectively, that work on Hermitian inputs (the function Re(·) returns the real part of its argument in an entry-wise manner).
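A small numerical check of Theorems 6 and 7 (our own illustration, not part of the original text): the trace of a Hermitian positive semidefinite matrix equals the trace of its real part, and the real part is itself symmetric positive semidefinite, so Lemma 1 applies to Re(R).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
R = M @ M.conj().T                  # Hermitian positive semidefinite
R /= np.trace(R).real               # normalize to unit trace

B = R.real                          # real part, as in eqn. (19)
print(np.allclose(np.trace(R).real, np.trace(B)))                    # traces agree (Theorem 7)
print(np.allclose(B, B.T), np.min(np.linalg.eigvalsh(B)) >= -1e-12)  # B symmetric PSD (Theorem 6)
```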
Algorithm 3.
1: | INPUT: $R \in \mathbb{C}^{n \times n}$, accuracy parameter ε > 0, failure probability δ, and integer m > 0. |
2: | Set s = ⌈20 ln(2/δ)/ε²⌉. |
3: | Let $g_1, g_2, \ldots, g_s \in \mathbb{R}^{n}$ be i.i.d. random Gaussian vectors. |
4: | OUTPUT: return $\widehat{\mathcal{H}}(R) = \frac{1}{s}\sum_{i=1}^{s}\sum_{k=1}^{m}\frac{g_i^T\,\mathrm{Re}\!\left(R\,(I_n - R)^{k}\right) g_i}{k}$. |
Algorithm 4.
1: | INPUT: $R \in \mathbb{C}^{n \times n}$, accuracy parameter ϵ > 0, failure probability δ, and integer m > 0. |
2: | Set s = ⌈20 ln(2/δ)/ε²⌉. |
3: | Let $g_1, g_2, \ldots, g_s \in \mathbb{R}^{n}$ be i.i.d. random Gaussian vectors. |
4: | OUTPUT: $\widehat{\mathcal{H}}(R) = -\frac{1}{s}\sum_{i=1}^{s} g_i^T\,\mathrm{Re}\!\left(f_m(R)\right) g_i$ (with u = 1 in the definition of $f_m$). |
Theorems 8 and 9 are our main quality-of-approximation results for Algorithms 3 and 4.
Theorem 8. Let R be a complex density matrix such that all probabilities $p_i$, i = 1 … n, satisfy 0 < ℓ ≤ $p_i$. Let $\widehat{\mathcal{H}}(R)$ be the output of Algorithm 3 on inputs R, m, and ϵ < 1. Then, with probability at least 1 − δ,
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le 2\,\epsilon\,\mathcal{H}(R),$$
by setting $m = \left\lceil \frac{1}{\ell}\,\ln\frac{1}{\epsilon} \right\rceil$. The algorithm runs in time
$$O\!\left(\frac{1}{\ell}\cdot\frac{\ln(1/\epsilon)\,\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R)\right).$$
Theorem 9. Let R be a density matrix such that all probabilities $p_i$, i = 1 … n, satisfy 0 < ℓ ≤ $p_i$. Let $\widehat{\mathcal{H}}(R)$ be the output of Algorithm 4 on inputs R, m, and ϵ < 1. Then, with probability at least 1 − δ,
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le 3\,\epsilon\,\mathcal{H}(R),$$
by setting m as in eqn. (15) with u = 1. The algorithm runs in time
$$O\!\left(\sqrt{\frac{n}{\epsilon\,\ell\,\ln\frac{1}{1-\ell}}}\cdot\frac{\ln(1/\delta)}{\epsilon^{2}}\cdot\mathrm{nnz}(R)\right).$$
V. An approach via random projection matrices
Finally, we focus on perhaps the most interesting special case: the setting where at most k (out of n, with k ⪡ n) of the probabilities pi of the density matrix R of eqn. (2) are non-zero. In this setting, we prove that elegant random-projection-based techniques achieve relative error approximations to all probabilities pi, i = 1 … k. The running time of the proposed approach depends on the particular random projection that is used and can be made to depend on the sparsity of the input matrix.
A. Algorithm and Main Theorem
The proposed algorithm uses a random projection matrix Π to create a “sketch” of R in order to approximate the pis. In words, Algorithm 5 creates a sketch of the input matrix R by post-multiplying R by a random projection matrix; this is a well-known approach from the RandNLA literature (see [6] for details). Assuming that R has rank at most k, which is equivalent to assuming that at most k of the probabilities pi in eqn. (2) are non-zero (e.g., the system underlying the density matrix R has at most k pure states), then the rank of RΠ is also at most k. In this setting, Algorithm 5 returns the non-zero singular values of RΠ as approximations to the pi, i = 1 … k.
Algorithm 5.
1: | INPUT: Integer n (dimensions of matrix R) and integer k (with rank of R at most k ⪡ n, see eqn. (2)). |
2: | Construct the random projection matrix $\Pi \in \mathbb{R}^{n \times s}$ (see Section V-B for details on Π and s). |
3: | Compute $R\,\Pi \in \mathbb{R}^{n \times s}$. |
4: | Compute and return the (at most) k non-zero singular values of $R\,\Pi$, denoted by $\tilde{p}_i$, i = 1 … k. |
5: | OUTPUT: $\tilde{p}_i$, i = 1 … k, and $\widehat{\mathcal{H}}(R) = -\sum_{i=1}^{k} \tilde{p}_i \ln \tilde{p}_i$. |
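A minimal Python/NumPy sketch of Algorithm 5 (our own illustration) using a Gaussian random projection in place of the structured transforms of Section V-B (the Gaussian variant is also evaluated in Section VI); the 1/√s rescaling and the function name are assumptions of the sketch.

```python
import numpy as np

def projection_entropy(R, k, s, rng=None):
    """Sketch of Algorithm 5: approximate the k non-zero p_i and H(R) from R @ Pi."""
    rng = np.random.default_rng() if rng is None else rng
    n = R.shape[0]
    Pi = rng.standard_normal((n, s)) / np.sqrt(s)      # rescaled Gaussian projection
    sketch = R @ Pi                                     # n x s sketch of R
    sv = np.linalg.svd(sketch, compute_uv=False)[:k]    # top-k singular values ~ p_i
    sv = sv[sv > 0]
    return sv, -np.sum(sv * np.log(sv))
```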
The following theorem is our main quality-of-approximation result for Algorithm 5.
Theorem 10. Let R be a density matrix with at most k ⪡ n non-zero probabilities and let ϵ < 1/2 be an accuracy parameter. Then, with probability at least 0.9, the output of Algorithm 5 satisfies
$$\left|p_i^{2} - \tilde{p}_i^{2}\right| \le \epsilon\, p_i^{2}$$
for all i = 1 … k. Additionally,
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le \epsilon\,\mathcal{H}(R) + 2\,\epsilon.$$
Algorithm 5 (combined with Algorithm 7 below) runs in time
$$O\!\left(\mathrm{nnz}(R) + n\cdot\mathrm{poly}\!\left(\frac{k}{\epsilon}\right)\right).$$
Comparing the above result with Theorems 2 and 4, we note that the above theorem does not necessitate imposing any constraints on the probabilities pi, i = 1 … k. Instead, it suffices to have k non-zero probabilities. The final result is an additive-relative error approximation to the entropy of R (as opposed to the relative error approximations of Theorems 2 and 4); under a mild lower-bound assumption on $\mathcal{H}(R)$, the above bound becomes a true relative error approximation5.
B. Two constructions for the random projection matrix
We now discuss two constructions for the matrix Π and we cite two bounds regarding these constructions from prior work that will be useful in our analysis. The first construction is the subsampled Hadamard Transform, a simplification of the Fast Johnson-Lindenstrauss Transform of [14]; see [15], [16] for details. We do note that even though it appears that Algorithm 7 is always better than Algorithm 6 (at least in terms of their respective theoretical running times), both algorithms are worth evaluating experimentally: in particular, prior work [17] has reported that Algorithm 6 often outperforms Algorithm 7 in terms of empirical accuracy and running time when the input matrix is dense, as is often the case in our setting. Therefore, we choose to present results (theoretical and empirical) for both well-known constructions of Π (Algorithms 6 and 7).
Algorithm 6.
1: | INPUT: integers n, s > 0 with s ⪡ n. |
2: | Let S be an empty matrix. |
3: | For t = 1, …, s (i.i.d. trials with replacement) select uniformly at random an integer from {1, 2, …, n}. |
4: | If i is selected, then append the column vector $e_i$ to S, where $e_i \in \mathbb{R}^{n}$ is the i-th canonical vector. |
5: | Let $H \in \mathbb{R}^{n \times n}$ be the normalized Hadamard transform matrix. |
6: | Let $D \in \mathbb{R}^{n \times n}$ be a diagonal matrix with $D_{ii} = \pm 1$, each with probability 1/2 (i.i.d.). |
7: | OUTPUT: $\Pi = D\,H\,S \in \mathbb{R}^{n \times s}$. |
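For reference, a compact construction of the matrix Π of Algorithm 6 (our own sketch): it assumes n is a power of two, forms the Hadamard matrix explicitly via SciPy (so it is quadratic in n rather than using a fast transform), and applies a √(n/s) rescaling of the kind mentioned in the proof of Theorem 10 ("after rescaling the matrix Π").

```python
import numpy as np
from scipy.linalg import hadamard

def srht_matrix(n, s, rng=None):
    """Subsampled randomized Hadamard transform Pi = D H S (n x s); n must be a power of two."""
    rng = np.random.default_rng() if rng is None else rng
    D = rng.choice([-1.0, 1.0], size=n)              # random signs (diagonal of D)
    H = hadamard(n) / np.sqrt(n)                     # normalized Hadamard matrix
    cols = rng.integers(0, n, size=s)                # s columns sampled with replacement
    Pi = (D[:, None] * H)[:, cols] * np.sqrt(n / s)  # rescaled so that E[Pi Pi^T] = I_n
    return Pi
```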
The following result has appeared in [7], [15], [16].
Lemma 11. Let $U \in \mathbb{R}^{n \times k}$ be such that $U^TU = I_k$ and let $\Pi \in \mathbb{R}^{n \times s}$ be constructed by Algorithm 6. Then, with probability at least 0.9,
$$\left\|U^T\,\Pi\,\Pi^T\,U - I_k\right\|_2 \le \epsilon,$$
by setting $s = O\!\left(\frac{(k + \ln n)\,\ln k}{\epsilon^{2}}\right)$.
Our second construction is the input sparsity transform of [18]. This major breakthrough was further analyzed in [19], [20] and we present the following result from [19, Appendix A1].
Lemma 12. Let $U \in \mathbb{R}^{n \times k}$ be such that $U^TU = I_k$ and let $\Pi \in \mathbb{R}^{n \times s}$ be constructed by Algorithm 7. Then, with probability at least 0.9,
$$\left\|U^T\,\Pi\,\Pi^T\,U - I_k\right\|_2 \le \epsilon,$$
by setting $s = O\!\left(\frac{k^{2}}{\epsilon^{2}}\right)$.
We refer the interested reader to [20] for improved analyses of Algorithm 7 and its variants.
Algorithm 7.
1: | INPUT: integers n, s > 0 with s ⪡ n. |
2: | Let S be an empty matrix. |
3: | For t = 1, …, n (i.i.d. trials with replacement) select uniformly at random an integer from {1, 2, …, s}. |
4: | If i is selected, then append the row vector $e_i^T$ to S, where $e_i \in \mathbb{R}^{s}$ is the i-th canonical vector. |
5: | Let $D \in \mathbb{R}^{n \times n}$ be a diagonal matrix with $D_{ii} = \pm 1$, each with probability 1/2 (i.i.d.). |
6: | OUTPUT: $\Pi = D\,S \in \mathbb{R}^{n \times s}$. |
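Similarly, a sketch of the input-sparsity transform of Algorithm 7 (our own illustration): every row of Π has a single nonzero ±1 entry, so RΠ can be formed in time proportional to nnz(R).

```python
import numpy as np
from scipy import sparse

def countsketch_matrix(n, s, rng=None):
    """Input-sparsity (CountSketch) transform Pi = D S, stored as an n x s sparse matrix."""
    rng = np.random.default_rng() if rng is None else rng
    rows = np.arange(n)
    cols = rng.integers(0, s, size=n)          # each row hashed to one of s buckets
    signs = rng.choice([-1.0, 1.0], size=n)    # random signs (diagonal of D)
    return sparse.csr_matrix((signs, (rows, cols)), shape=(n, s))

# usage (with R itself stored as a SciPy sparse matrix):
# sketch = R @ countsketch_matrix(R.shape[0], s)
```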
C. Proof of Theorem 10
At the heart of the proof of Theorem 10 lies the following perturbation bound from [8] (Theorem 2.3).
Theorem 13. Let DAD be a symmetric positive definite matrix such that D is a diagonal matrix and $A_{ii}$ = 1 for all i. Let DED be a perturbation matrix such that ‖E‖2 < λmin(A). Let $\lambda_i$ be the i-th eigenvalue of DAD and let $\tilde{\lambda}_i$ be the i-th eigenvalue of D(A + E)D. Then, for all i,
$$\left|\lambda_i - \tilde{\lambda}_i\right| \le \frac{\|E\|_2}{\lambda_{\min}(A)}\;\lambda_i.$$
We note that λmin(A) in the above theorem is a real, strictly positive number6. Now consider the matrix RΠΠTRT; we will use the above theorem to argue that its singular values are good approximations to the singular values of the matrix RRT. Recall that R = ΨΣpΨT where Ψ has orthonormal columns. Note that the eigenvalues of $RR^T = \Psi\,\Sigma_p^{2}\,\Psi^T$ are equal to the eigenvalues of the matrix $\Sigma_p\, I_k\, \Sigma_p$; similarly, the eigenvalues of ΨΣpΨTΠΠTΨΣpΨT are equal to the eigenvalues of ΣpΨTΠΠTΨΣp. Thus, we can compare the matrices
$$\Sigma_p\, I_k\, \Sigma_p \quad \text{and} \quad \Sigma_p\,\Psi^T\Pi\Pi^T\Psi\,\Sigma_p = \Sigma_p\left(I_k + E\right)\Sigma_p.$$
In the parlance of Theorem 13, E = ΨTΠΠTΨ − Ik. Applying either Lemma 11 (after rescaling the matrix Π) or Lemma 12, we immediately get that ‖E‖2 ≤ ϵ < 1 with probability at least 0.9. Since λmin (Ik) = 1, the assumption of Theorem 13 is satisfied. We note that the eigenvalues of ΣpIkΣp are equal to $p_i^{2}$ for i = 1 … k (all positive, which guarantees that the matrix ΣpIkΣp is symmetric positive definite, as mandated by Theorem 13) and the eigenvalues of ΣpΨTΠΠTΨΣp are equal to $\tilde{p}_i^{2}$, where the $\tilde{p}_i$ are the singular values of ΣpΨTΠ. (Note that these are exactly equal to the outputs returned by Algorithm 5, since the singular values of ΣpΨTΠ are equal to the singular values of ΨΣpΨTΠ = RΠ). Thus, we can conclude:
$$\left|p_i^{2} - \tilde{p}_i^{2}\right| \le \epsilon\, p_i^{2} \quad \text{for all } i = 1 \ldots k. \qquad (20)$$
The above result guarantees that all $p_i$'s can be approximated up to relative error using Algorithm 5. We now investigate the implication of the above bound to approximating the von Neumann entropy of R. Indeed,
$$\widehat{\mathcal{H}}(R) = \sum_{i=1}^{k}\tilde{p}_i\,\ln\frac{1}{\tilde{p}_i} \le \sum_{i=1}^{k}\sqrt{1+\epsilon}\;p_i\left(\ln\frac{1}{p_i} + \frac{1}{2}\ln\frac{1}{1-\epsilon}\right) \le (1+\epsilon)\left(\mathcal{H}(R) + \frac{1}{2}\ln(1+2\epsilon)\right) \le (1+\epsilon)\,\mathcal{H}(R) + 2\,\epsilon.$$
In the second to last inequality we used 1/(1 − ϵ) ≤ 1 + 2ϵ for any ϵ ≤ 1/2 and in the last inequality we used ln(1 + 2ϵ) ≤ 2ϵ for ϵ ∈ (0, 1/2). Similarly, we can prove that:
$$\widehat{\mathcal{H}}(R) \ge (1-\epsilon)\,\mathcal{H}(R) - 2\,\epsilon.$$
Combining, we get
$$\left|\mathcal{H}(R) - \widehat{\mathcal{H}}(R)\right| \le \epsilon\,\mathcal{H}(R) + 2\,\epsilon.$$
We conclude by discussing the running time of Algorithm 5. Theoretically, the best choice is to combine the matrix Π from Algorithm 7 with Algorithm 5, which results in a running time
$$O\!\left(\mathrm{nnz}(R) + n\,s^{2}\right) = O\!\left(\mathrm{nnz}(R) + n\cdot\mathrm{poly}\!\left(\frac{k}{\epsilon}\right)\right),$$
since RΠ can be computed in $O(\mathrm{nnz}(R))$ time and the SVD of the resulting n × s matrix can be computed in $O(n\,s^{2})$ time.
D. The Hermitian case
The above approach via random projections critically depends on Lemmas 11 and 12, which, to the best of our knowledge, have only been proven for the real case. These results are typically proven using matrix concentration inequalities, which are well-explored for sums of random real matrices but less explored for sums of random complex matrices. We leave it as an open problem to extend the theoretical analysis of our approach to the Hermitian case.
VI. Experiments
In this section we report experimental results in order to demonstrate the practical efficiency of our algorithms. We show that our algorithms are both numerically accurate and computationally efficient. Our algorithms were implemented in Matlab R2016a on a compute node with two 10-Core Intel Xeon-E5 processors (2.60GHz) and 512 GBs of RAM.
We generated random density matrices: for most of our experiments we used the QETLAB Matlab toolbox [9] to derive (real-valued) density matrices of size 5,000 × 5,000, on which most of our extensive evaluations were run. We also tested our methods on a much larger 30,000 × 30,000 density matrix, which was close to the largest matrix that Matlab would allow us to load. We used the function RandomDensityMatrix of QETLAB and the Haar measure; we also experimented with the Bures measure to generate random matrices, but we did not observe any qualitative differences worth reporting. Recall that exactly computing the von Neumann entropy using eqn. (1) presumes knowledge of the entire spectrum of the matrix; to compute all singular values of a matrix we used the svd function of Matlab. The accuracy of our proposed approximation algorithms was evaluated by measuring the relative error; wall-clock times were reported in order to quantify the speedup that our approximation algorithms were able to achieve.
A. Empirical results for the Taylor and Chebyshev approximation algorithms
We start by reporting results on the Taylor and Chebyshev approximation algorithms, which have two sources of error: the number of terms that are retained in either the Taylor series expansion or the Chebyshev polynomial approximation and the trace estimation that is used in both approximation algorithms. We will separately evaluate the accuracy loss that is contributed by each source of error in order to understand the behavior of the proposed approximation algorithms.
Consider a 5,000 × 5,000 random density matrix and let m (the number of terms retained in the Taylor series approximation or the degree of the polynomial used in the Chebyshev polynomial approximation) range between five and 30 in increments of five. Let s, the number of random Gaussian vectors used to estimate the trace, be set to {50, 100, 200, 300}. Recall that our error bounds for Algorithms 1 and 2 depend on u, an estimate for the largest eigenvalue of the density matrix. We used the power method to estimate the largest eigenvalue (let $\hat{\lambda}_{\max}$ be the estimate) and we set u to $\hat{\lambda}_{\max}$ and to $6\,\hat{\lambda}_{\max}$. Figures 1 and 2 show the relative error (out of 100%) for all combinations of m, s, and u for the Taylor and Chebyshev approximation algorithms. It is worth noting that we also report the error when no trace estimation (NTE) is used in order to highlight that most of the accuracy loss is due to the Taylor/Chebyshev approximation and not the trace estimation.
We observe that the relative error is always small, typically close to 1–2%, for any choice of the parameters s, m, and u. The Chebyshev algorithm returns better approximations when u is an overestimate for λmax, while the two algorithms are comparable (in terms of accuracy) when u is very close to λmax, which agrees with our theoretical results. We also note that estimating the largest eigenvalue incurs minimal computational cost (less than one second). The NTE line (no trace estimation) in the plots serves as a lower bound for the relative error. Finally, we note that computing the exact von Neumann entropy took approximately 1.5 minutes for matrices of this size.
The second dataset that we experimented with was a much larger density matrix of size 30,000 × 30,000. This matrix was the largest matrix for which the memory was sufficient to perform operations like the full SVD. Notice that since the increase in the matrix size is six-fold compared to the previous one and SVD's running time grows cubically with the input size, we expect the running time to compute the exact SVD to be roughly 6³ · 90 seconds, which is approximately 5.4 hours; indeed, the exact computation of the von Neumann entropy took approximately 5.6 hours. We evaluated both the Taylor and the Chebyshev approximation schemes by setting the parameters m and s to take values in the sets {5, 10, 15, 20} and {50, 100, 200}, respectively. The parameter u was set using the power-method estimate of the largest eigenvalue, which took approximately 3.6 minutes to compute. We report the wall-clock running times and relative error (out of 100%) in Figures 5 and 4.
We observe that the relative error is always less than 1% for both methods, with the Chebyshev approximation yielding almost always slightly better results. Note that our Chebyshev-polynomial-based approximation algorithm significantly outperformed the exact computation: e.g., for m = 5 and s = 50, our estimate was computed in less than ten minutes and achieved less than .2% relative error.
The third dataset we experimented with was the tridiagonal matrix from [12, Section 5.1]:
$$A = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{n \times n}. \qquad (21)$$
This matrix is the coefficient matrix of the discretized one-dimensional Poisson equation
$$-\frac{d^{2} v(x)}{dx^{2}} = f(x),$$
defined in the interval [0, 1] with Dirichlet boundary conditions v(0) = v(1) = 0. We normalize A by dividing it by its trace in order to make it a density matrix. Consider the 5,000 × 5,000 normalized matrix A and let m (the number of terms retained in the Taylor series approximation or the degree of the polynomial used in the Chebyshev polynomial approximation) range between five and 30 in increments of five. Let s, the number of random Gaussian vectors used for estimating the trace, be set to 50, 100, 200, or 300. We used the formula
$$\lambda_k = 2 - 2\cos\!\left(\frac{k\pi}{n+1}\right) = 4\sin^{2}\!\left(\frac{k\pi}{2(n+1)}\right), \quad k = 1, \ldots, n, \qquad (22)$$
to compute the eigenvalues of A (after normalization) and we set u to λmax and 6λmax. Figures 6 and 7 show the relative error (out of 100%) for all combinations of m, s, and u for the Taylor and Chebyshev approximation algorithms. We also report the error when no trace estimation (NTE) is used.
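Since eqn. (22) gives the spectrum of A in closed form, the exact entropy of the trace-normalized matrix can be computed directly from the eigenvalues; a short sketch (our own, in Python/NumPy) follows.

```python
import numpy as np

def poisson_density_entropy(n):
    """Exact entropy of the trace-normalized 1D Poisson matrix via eqn. (22)."""
    k = np.arange(1, n + 1)
    lam = 2.0 - 2.0 * np.cos(k * np.pi / (n + 1))   # eigenvalues of tridiag(-1, 2, -1)
    p = lam / lam.sum()                             # normalize by the trace (= 2n)
    return -np.sum(p * np.log(p))

print(poisson_density_entropy(5000))
```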
We observe that the relative error is higher than the one observed for the 5,000 × 5,000 random density matrix. We report wall-clock running times in Figure 8. The Chebyshev-polynomial-based algorithm returns better approximations for all choices of the parameters and, in most cases, is faster than the Taylor-polynomial-based algorithm; e.g., for m = 5, s = 50, and u = λmax, our estimate was computed in about two seconds and achieved less than 0.5% relative error.
We further considered a $10^{8} \times 10^{8}$ tridiagonal matrix of the form of eqn. (21). Although an exact computation of the singular values of A is not feasible (at least with our computational resources), such a computation is not necessary since eqn. (22) provides a closed formula for its eigenvalues and, thus, its entropy. Let m (the number of terms retained in the Taylor series approximation or the degree of the polynomial used in the Chebyshev polynomial approximation) be equal to five or ten and let s, the number of random Gaussian vectors used to estimate the trace, be equal to 50 or 100. Figures 9 and 10 show the relative error (out of 100%) and the runtime, respectively, for all combinations of m and s for both the Taylor and Chebyshev approximation algorithms. We observe that in both cases we estimated the entropy in less than ten minutes with a relative error below 0.15%.
The fourth dataset we experimented with includes 5,000 × 5,000 density matrices whose top k eigenvalues follow a linear decay and the remaining 5,000 − k a uniform distribution. Let k, the number of eigenvalues that follow the linear decay, take values in the set {50, 1000, 3500, 5000}. Let m, the number of terms retained in the Taylor series approximation or the degree of the polynomial used in the Chebyshev polynomial approximation, range between five and 30 in increments of five. Let s, the number of random Gaussian vectors used to estimate the trace, be set to {50, 100, 200, 300}. The parameter u is set using the power-method estimate of the largest eigenvalue. Figures 11 to 14 show the relative error (out of 100%) for all combinations of k, m, s, and u for the Taylor and Chebyshev approximation algorithms.
We observe that the relative error decreases as k increases. It is worth noting that when k = 3,500 and k = 5,000 the Taylor-polynomial-based algorithm returns better relative error approximations than the Chebyshev-polynomial-based algorithm. In the latter case we observe that the relative error of the Taylor-based algorithm is almost zero. This observation has a simple explanation. Figure 15 shows the distribution of the eigenvalues in the four cases we examine. We observe that for k = 50 the eigenvalues are spread in the interval $(10^{-4}, 10^{-2})$; for k = 1,000 the eigenvalues are spread in the interval $(10^{-4}, 10^{-3})$; while for k = 3,500 or k = 5,000 the eigenvalues are of order $10^{-4}$. It is well known that the Taylor polynomial returns highly accurate approximations when it is computed on values lying inside the open disc centered at a specific value u, which, in our case, is the approximation to the dominant eigenvalue. The radius of the disk is roughly $r = \lambda_{m+1}/\lambda_{m}$, where m is the degree of the Taylor polynomial. If r ≤ 1 then the Taylor polynomial converges; otherwise it diverges. Figure 16 shows the convergence rate for various values of k. We observe that for k = 50 the polynomial diverges, which leads to increased errors for the Taylor-based approximation algorithm (reported error close to 23%). In all other cases, the convergence rate is close to one, resulting in negligible impact to the overall error.
In all four cases, the Chebyshev-polynomial-based algorithm behaves better than or similarly to the Taylor-polynomial-based algorithm. It is worth noting that when the majority of the eigenvalues are clustered around the smallest eigenvalue, then to achieve relative error similar to the one observed for the QETLAB random density matrices, more than 30 polynomial terms need to be retained, which increases the computational time of our algorithms. The increase of the computational time as well as the increased relative error can be justified by the large condition number of these matrices (recall that for both approximation algorithms the running time depends on the approximate condition number u/ℓ). As an example, for k = 50, the condition number is on the order of hundreds, which is significantly larger than the roughly constant condition number when k = 5,000.
B. Empirical Results for the Hermitian Case
Our last dataset is a random 5, 000 × 5, 000 complex density matrix generated using the QETLAB Matlab toolbox. We used the function RandomDensityMatrix of QETLAB and the Haar measure. Let m (the number of terms retained in the Taylor series approximation or the degree of the polynomial used in the Chebyshev polynomial approximation) range between five and 30 in increments of five. Let s, the number of random Gaussian vectors used to estimate the trace, be set to {50, 100, 200, 300}. Figures 17 and 18 show the relative error (out of 100%) for all combinations of m, s, and u for the Taylor-based and Chebyshev-based approximation algorithms respectively.
We observe that the relative error is always small, typically below 1%, for any choice of the parameters s and m. The NTE line (no trace estimation) in the plots serves as a lower bound for the relative error. We note that computing the exact von Neumann entropy took approximately 52 seconds for matrices of this size. Finally, our algorithm seems to outperform exact computation of the von Neumann entropy by approximating it in about ten seconds (for the Taylor-based approach) with a relative error of 0.5%, using 100 random Gaussian vectors and retaining ten Taylor terms (see Fig. 19), or in about 18 seconds (for the Chebyshev-based approach) with a relative error of 0.2%, using 50 random Gaussian vectors and five Chebyshev polynomials (see Fig. 20).
C. Empirical results for the random projection approximation algorithms
In order to evaluate our third algorithm, we generated low-rank random density matrices (recall that the algorithm of Section V works only for density matrices of rank k with k ⪡ n). Additionally, in order to evaluate the subsampled randomized Hadamard transform and avoid padding with all-zero rows, we focused on values of n (the number of rows and columns of the density matrix) that are powers of two. Finally, we also evaluated a simpler random projection matrix, namely the Gaussian random matrix, whose entries are all Gaussian random variables with zero mean and unit variance.
We generated low rank random density matrices with exponentially (using the QETLAB Matlab toolbox) and linearly decaying eigenvalues. The sizes of the density matrices we tested were 4, 096 × 4, 096 and 16, 384 × 16, 384. We also generated much larger 30, 000 × 30, 000 random matrices on which we only experimented with the Gaussian random projection matrix.
We computed all the non-zero singular values of a matrix using the svds function of Matlab in order to take advantage of the fact that the target density matrix has low rank. The accuracy of our proposed approximation algorithms was evaluated by measuring the relative error; wall-clock times were reported in order to quantify the speedup that our approximation algorithms were able to achieve.
We start by reporting results for Algorithm 5 using the Gaussian, the subsampled randomized Hadamard transform (Algorithm 6), and the input-sparsity transform (Algorithm 7) random projection matrices. Consider the 4, 096 × 4, 096 low rank density matrices and let k, the rank of the matrix, be 10, 50, 100, and 300. Let s, the number of columns of the random projection matrix, range from 50 to 1,000 in increments of 50. Figures 21 and 22 depict the relative error (out of 100%) for all combinations of k and s. We also report the wall-clock running times for values of s between 300 and 450 at Figure 23.
We observe that in the case of the random matrix with exponentially decaying eigenvalues and for all algorithms the relative error is under 0.3% for any choice of the parameters k and s and, as expected, decreases as the dimension of the projection space s grows larger. Interestingly, all three random projection matrices returned essentially identical accuracies and very comparable wall-clock running time results. This observation is due to the fact that for all choices of k, after scaling the matrix to unit trace, the only eigenvalues that were numerically non-zero were the 10 dominant ones.
In the case of the random matrix with linearly decaying eigenvalues (and for all algorithms) the relative error increases as the rank of the matrix increases and decreases as the size of the random projection matrix increases. This is expected: as the rank of the matrix increases, a larger random projection space is needed to capture the “energy” of the matrix. Indeed, we observe that for all values of k, setting s = 1, 000 guarantees a relative error under 1%. Similarly, for k = 10, the relative error is under 0.3% for any choice of s.
The running time depends not only on the size of the matrix, but also on its rank, e.g. for k = 100 and s = 450, our approximation was computed in about 2.5 seconds, whereas for k = 300 and s = 450, it was computed in less than one second. Considering, for example, the case of k = 300 exponentially decaying eigenvalues, we observe that for s = 400 we achieve relative error below 0.15% and a speedup of over 60 times compared to the exact computation. Finally, it is observed that all three algorithms returned very comparable wall-clock running time results. This observation could be due to the fact that matrix multiplication is heavily optimized in Matlab and therefore the theoretical advantages of the Hadamard transform did not manifest themselves in practice.
The second dataset we experimented with was a 16,384 × 16,384 low rank density matrix. We set k = 50 and k = 500 and we let s take values in the set {500, 1000, 1500, …, 3000, 3500}. We report the relative error (out of 100%) for all combinations of k and s in Figure 24 for the matrix with exponentially decaying eigenvalues and in Figure 25 for the matrix with linearly decaying eigenvalues. We also report the wall-clock running times for s between 500 and 2,000 in Figure 26. We observe that the relative error is typically around 1% for both types of matrices, with running times ranging between ten seconds and four minutes, significantly outperforming the exact entropy computation which took approximately 3.6 minutes for the rank 50 approximation and 20 minutes for the rank 500 approximation.
The last dataset we experimented with was a 30, 000 × 30, 000 low rank density matrix on which we ran Algorithm 5 using a Gaussian random projection matrix. We set k = 50 and k = 500 and we let s take values in the set {500, 1000, 1500, …, 3000, 3500}. We report the relative error (out of 100%) for all combinations of k and s in Figure 27 for the matrix with exponentially decaying eigenvalues and in Figure 28 for the matrix with the linearly decaying eigenvalues. We also report the wall-clock running times for s ranging between 500 and 2, 000 in Figure 29. We observe that the relative error is typically around 1% for both types of matrices, with the running times ranging between 30 seconds and two minutes, outperforming the exact entropy which was computed in six minutes for the rank 50 approximation and in one hour for the rank 500 approximation.
VII. Conclusions and open problems
We presented and analyzed three randomized algorithms to approximate the von Neumann entropy of density matrices. Our algorithms leverage recent developments in the RandNLA literature: randomized trace estimators, provable bounds for the power method, the use of random projections to approximate the singular values of a matrix, etc. All three algorithms come with provable accuracy guarantees under assumptions on the spectrum of the density matrix. Empirical evaluations on 30,000 × 30,000 synthetic density matrices support our theoretical findings and demonstrate that we can efficiently approximate the von Neumann entropy in a few minutes with minimal loss in accuracy, whereas the exact computation takes over 5.5 hours.
An interesting open problem would be to consider the estimation of the cross entropy. The cross entropy is a measure between two probability distributions and is particularly important in information theory. Algebraically, it can be defined as $\mathcal{H}(S, R) = -\mathrm{tr}\!\left(S \ln R\right)$, where S and R are density matrices with a full set of pure states. One can further extend our polynomial-based approaches using the Taylor expansion or the Chebyshev polynomials to approximate the matrix Γ = S ln R. The case where one or both of the density matrices have an incomplete set of pure states is an open problem: if R is low-rank, then our first two approaches would not work for the reasons discussed in Section V. However, if the only low-rank matrix is S, then our first two approaches would still work: S only appears in the trace estimation part, and having eigenvalues equal to zero does not affect the positive semi-definiteness of Γ. When R is of low rank then one might be able to use our random projection approaches to reduce its dimensionality and/or the dimensionality of S.
The most important open problem is to relax (or eliminate) the assumptions associated with our three key technical results without sacrificing our running time guarantees. It would be critical to understand whether our assumptions are, for example, necessary to achieve relative error approximations and either provide algorithmic results that relax or eliminate our assumptions or provide matching lower bounds and counterexamples.
Acknowledgment
The authors would like to thank the editor for numerous useful suggestions that significantly improved the presentation of our work, especially in the Hermitian case.
PD and EK were supported by NSF IIS-1319280 and IIS-1661760. WS and AG were supported by the NSF Center for Science of Information (CSoI) Grant CCF-0939370 and by NSF CCF-1524312 and NIH 1U01CA198941-01.
Biographies
Eugenia-Maria Kontopoulou is a Ph.D. candidate with the Computer Science Department at Purdue University. She earned her B. Eng. and M. Eng. from the Computer Science and Informatics Department of University of Patras Greece in 2012. Her current interests lie in the areas of (Randomized) Numerical Linear Algebra with a focus on designing and implementing randomized algorithms for the solution of linear algebraic problems in large-scale data applications.
Gregory Dexter is a current senior at Purdue University majoring in honors statistics and mathematics. He is broadly interested in artificial intelligence and hopes to pursue a PhD focused in this area.
Wojciech Szpankowski is Saul Rosen Distinguished Professor of Computer Science at Purdue University where he teaches and conducts research in analysis of algorithms, information theory, analytic combinatorics, data science, random structures, and stability problems of distributed systems. He held several Visiting Professor/Scholar positions, including McGill University, INRIA, France, Stanford, Hewlett-Packard Labs, Universite de Versailles, University of Canterbury, New Zealand, Ecole Polytechnique, France, the Newton Institute, Cambridge, UK, ETH, Zurich, and Gdansk University of Technology, Poland. He is a Fellow of IEEE, and the Erskine Fellow. In 2010 he received the Humboldt Research Award, in 2015 the Inaugural Arden L. Bement Jr. Award, and in 2020 he was the recipient of the Flajolet Lecture Prize. He published two books: “Average Case Analysis of Algorithms on Sequences”, John Wiley & Sons, 2001, and “Analytic Pattern Matching: From DNA to Twitter”, Cambridge, 2015. In 2008 he launched the interdisciplinary Institute for Science of Information, and in 2010 he became the Director of the newly established NSF Science and Technology Center for Science of Information.
Ananth Grama is the Samuel Conte Professor of Computer Science at Purdue University. His research interests include parallel and distributed computing, large-scale data analytics, and applications in life sciences. Grama received a Ph.D. in computer science from the University of Minnesota. He is a recipient of the National Science Foundation CAREER award and the Purdue University Faculty Scholar Award. Grama is a Fellow of the American Association for the Advancement of Sciences and a Distinguished Alumnus of the University of Minnesota. He chaired the Bio-data Management and Analysis (BDMA) Study Section of the National Institutes of Health from 2012 to 2014. Contact him at ayg@purdue.edu.
Petros Drineas is a Professor at the Computer Science Department of Purdue University. He earned a PhD in Computer Science from Yale University in 2003 and a BS in Computer Engineering and Informatics from the University of Patras, Greece, in 1997. From 2003 until 2016, Prof. Drineas was an Assistant (until 2009) and then an Associate Professor at Rensselaer Polytechnic Institute. His research interests lie in the design and analysis of randomized algorithms for linear algebraic problems, as well as their applications to the analysis of modern, massive datasets, with a particular emphasis on the analysis of population genetics data.
Appendix A. The power method
We consider the well-known power method to estimate the largest eigenvalue of a matrix. In our context, we will use the power method to estimate the largest probability p1 of a density matrix R.
Algorithm 8 requires O(q · t · nnz(R)) arithmetic operations to compute p̃1, since each inner iteration amounts to a single matrix-vector product with R. The following lemma appeared in [4], building upon [13].
Lemma 14. Let p̃1 be the output of Algorithm 8 with q = ⌈4.82 log(1/δ)⌉ repetitions and a suitable number of power iterations t. Then p̃1 never exceeds p1, the largest eigenvalue of R, and, with probability at least 1 − δ, p̃1 is at least a constant fraction of p1.
Algorithm 8.
• INPUT: SPD matrix R ∈ ℝn×n, integers q, t > 0.
• For j = 1, …, q
 1) Pick uniformly at random a starting vector x0 ∈ {+1, −1}n.
 2) For i = 1, …, t
  – xi = R xi−1.
 3) Compute: λ(j) = xtT R xt / (xtT xt).
• OUTPUT: p̃1 = maxj=1,…,q λ(j).
The running time of Algorithm 8 is O(q · t · nnz(R)), which is at most O(q · t · n2).
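For concreteness, the following minimal Python sketch follows the structure of Algorithm 8; the choice of random ±1 starting vectors, the intermediate rescaling, and the function name are illustrative rather than part of the original pseudocode.

import numpy as np

def power_method_estimate(R, q, t, seed=None):
    # Sketch of Algorithm 8: estimate the largest eigenvalue p1 of the
    # symmetric positive semidefinite matrix R by running t power iterations
    # from each of q random starting vectors and keeping the largest
    # Rayleigh quotient observed.
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    best = 0.0
    for _ in range(q):
        x = rng.choice([-1.0, 1.0], size=n)    # random starting vector
        for _ in range(t):
            x = R @ x                          # one power iteration, O(nnz(R)) work
            nrm = np.linalg.norm(x)
            if nrm == 0.0:                     # degenerate case: x fell into the null space of R
                break
            x = x / nrm                        # rescaling leaves the Rayleigh quotient unchanged
        denom = x @ x
        if denom > 0.0:
            best = max(best, (x @ (R @ x)) / denom)
    return best

# Illustrative usage with delta = 0.05:
# q = int(np.ceil(4.82 * np.log(1 / 0.05)))
# p1_tilde = power_method_estimate(R, q=q, t=20)

Since R is positive semidefinite, every Rayleigh quotient computed above is a lower bound on p1, so the estimate never overshoots the true largest eigenvalue.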
Appendix B. The Clenshaw Algorithm
We briefly sketch Clenshaw’s algorithm for evaluating a sum of Chebyshev polynomials at matrix inputs. Clenshaw’s algorithm is a recursive approach with base cases bm+2(x) = bm+1(x) = 0 and the recursive step (for k = m, m − 1, …, 0):
bk(x) = αk + 2x bk+1(x) − bk+2(x). (23)
(See Section III for the definition of αk.) Then,
fm(x) = (1/2)(α0 + b0(x) − b2(x)). (24)
Using the mapping x → 2(x/u) − 1, eqn. (23) becomes
bk(x) = αk + (4x/u) bk+1(x) − 2bk+1(x) − bk+2(x). (25)
In the matrix case, we replace x with the matrix R. Therefore, the base cases are Bm+2(R) = Bm+1(R) = 0 and the recursive step is
Bk(R) = αk In + (4/u) R Bk+1(R) − 2Bk+1(R) − Bk+2(R), (26)
for k = m, m − 1, …, 0. The final sum is
fm(R) = (1/2)(α0 In + B0(R) − B2(R)). (27)
Using the matrix version of Clenshaw’s algorithm, we can now rewrite the trace estimation gT fm(R)g as follows. First, we right multiply eqn. (26) by g,
yk = αk g + (4/u) R yk+1 − 2yk+1 − yk+2. (28)
Eqn. (28) follows by substituting yi = Bi(R)g. Multiplying the base cases by g, we get ym+2 = ym+1 = 0 and the final sum becomes
gT fm(R) g = (1/2)(α0 gT g + gT y0 − gT y2). (29)
Algorithm 9 summarizes all the above.
Algorithm 9.
1: INPUT: αi, i = 0, …, m; R ∈ ℝn×n; g ∈ ℝn; u > 0
2: Set ym+2 = ym+1 = 0
3: for k = m, m − 1, …, 0 do
4:  yk = αk g + (4/u) R yk+1 − 2yk+1 − yk+2
5: end for
6: OUTPUT: gT fm(R)g = (1/2)(α0 gT g + gT y0 − gT y2)
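A minimal Python sketch of Algorithm 9 follows; it assumes the Chebyshev coefficients αk of the degree-m approximation on [0, u] (see Section III) are supplied as an array, and the function and variable names are illustrative.

import numpy as np

def clenshaw_quadratic_form(alpha, R, g, u):
    # Sketch of Algorithm 9: evaluate g^T f_m(R) g, where
    # f_m(x) = sum_{k=0}^m alpha[k] * T_k(2x/u - 1), via Clenshaw's recurrence,
    # using only matrix-vector products with R (f_m(R) is never formed).
    g = np.asarray(g, dtype=float)
    m = len(alpha) - 1
    y_k1 = np.zeros_like(g)            # y_{k+1}
    y_k2 = np.zeros_like(g)            # y_{k+2}
    y2 = np.zeros_like(g)              # will hold y_2 once it is computed
    for k in range(m, -1, -1):         # k = m, m-1, ..., 0
        y_k = alpha[k] * g + (4.0 / u) * (R @ y_k1) - 2.0 * y_k1 - y_k2   # eqn. (28)
        if k == 2:
            y2 = y_k
        y_k2, y_k1 = y_k1, y_k
    y0 = y_k1                          # after the loop, y_{k+1} holds y_0
    return 0.5 * (alpha[0] * (g @ g) + g @ y0 - g @ y2)                   # eqn. (29)

Each pass through the loop costs one matrix-vector product with R, so evaluating gT fm(R)g requires O(m · nnz(R)) arithmetic operations per probe vector g.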
Footnotes
This paper was presented in part at the 2018 IEEE International Symposium on Information Theory.
Originally published in German in 1932; published in English under the title Mathematical Foundations of Quantum Mechanics in 1955.
R is symmetric positive semidefinite and thus all its eigenvalues are non-negative. If pi is equal to zero we set pi ln pi to zero as well.
Indeed, for any two matrices A and B, tr(AB) ≤ ∑i σi(A)σi(B), where σi(A) (respectively, σi(B)) denotes the i-th singular value of A (respectively, B). Let ‖·‖2 denote the induced matrix two-norm (spectral norm), so that ‖A‖2 = σ1(A), the largest singular value of A. Since every singular value of A is upper bounded by σ1(A), we can write tr(AB) ≤ ‖A‖2 ∑i σi(B); if B is symmetric positive semidefinite, then tr(B) = ∑i σi(B).
This can be proven using an argument similar to the one used to prove eqn. (7).
Recall that the von Neumann entropy of R ranges between zero and ln k.
This follows from the fact that A is a symmetric positive definite matrix and the inequality 0 ≤ ‖E‖2 < λmin(A).
References
- [1]. Golub GH and Van Loan CF, Matrix Computations (3rd Ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [2]. Wihler TP, Bessire B, and Stefanov A, “Computing the Entropy of a Large Matrix,” Journal of Physics A: Mathematical and Theoretical, vol. 47, no. 24, p. 245201, 2014.
- [3]. Higham NJ, Functions of Matrices: Theory and Computation. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2008.
- [4]. Boutsidis C, Drineas P, Kambadur P, Kontopoulou E-M, and Zouzias A, “A Randomized Algorithm for Approximating the Log Determinant of a Symmetric Positive Definite Matrix,” Linear Algebra and its Applications, vol. 533, pp. 95–117, 2017.
- [5]. Avron H and Toledo S, “Randomized Algorithms for Estimating the Trace of an Implicit Symmetric Positive Semi-definite Matrix,” Journal of the ACM, vol. 58, no. 2, p. 8, 2011.
- [6]. Drineas P and Mahoney MW, “RandNLA: Randomized Numerical Linear Algebra,” Communications of the ACM, vol. 59, no. 6, pp. 80–90, 2016.
- [7]. Woodruff DP, “Sketching as a Tool for Numerical Linear Algebra,” Foundations and Trends in Theoretical Computer Science, vol. 10, no. 1-2, pp. 1–157, 2014.
- [8]. Demmel J and Veselic K, “Jacobi’s Method is more Accurate than QR,” SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 4, pp. 1204–1245, 1992.
- [9]. Johnston N, “QETLAB: A Matlab toolbox for quantum entanglement, version 0.9,” http://qetlab.com, 2016.
- [10]. Musco C, Netrapalli P, Sidford A, Ubaru S, and Woodruff DP, “Spectrum Approximation Beyond Fast Matrix Multiplication: Algorithms and Hardness,” 2018. [Online]. Available: http://arxiv.org/abs/1704.04163
- [11]. Harvey NJ, Nelson J, and Onak K, “Sketching and Streaming Entropy via Approximation Theory,” in IEEE Annual Symposium on Foundations of Computer Science, 2008, pp. 489–498.
- [12]. Han I, Malioutov D, and Shin J, “Large-scale Log-determinant Computation through Stochastic Chebyshev Expansions,” in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 908–917, 2015.
- [13]. Trevisan L, “Graph Partitioning and Expanders,” Handout 7, 2011.
- [14]. Ailon N and Chazelle B, “The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors,” SIAM Journal on Computing, vol. 39, no. 1, pp. 302–322, 2009.
- [15]. Drineas P, Mahoney MW, Muthukrishnan S, and Sarlós T, “Faster Least Squares Approximation,” Numerische Mathematik, vol. 117, pp. 219–249, 2011.
- [16]. Tropp JA, “Improved Analysis of the Subsampled Randomized Hadamard Transform,” Advances in Adaptive Data Analysis, vol. 3, no. 1, p. 8, 2010.
- [17]. Paul S, Boutsidis C, Magdon-Ismail M, and Drineas P, “Random Projections and Support Vector Machines,” in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, 2013.
- [18]. Clarkson KL and Woodruff DP, “Low Rank Approximation and Regression in Input Sparsity Time,” in Proceedings of the 45th Annual ACM Symposium on Theory of Computing, 2013, pp. 81–90.
- [19]. Meng X and Mahoney MW, “Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression,” in Proceedings of the 45th Annual ACM Symposium on Theory of Computing, 2013, pp. 91–100.
- [20]. Nelson J and Nguyen HL, “OSNAP: Faster Numerical Linear Algebra Algorithms via Sparser Subspace Embeddings,” in Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science, 2013.