Abstract
We study the problem of estimating the trace of a matrix A that can only be accessed through matrix-vector multiplication. We introduce a new randomized algorithm, Hutch++, which computes a (1 ± ε) approximation to tr(A) for any positive semidefinite (PSD) A using just O(1/ε) matrix-vector products. This improves on the ubiquitous Hutchinson’s estimator, which requires O(1/ε2) matrix-vector products. Our approach is based on a simple technique for reducing the variance of Hutchinson’s estimator using a low-rank approximation step, and is easy to implement and analyze. Moreover, we prove that, up to a logarithmic factor, the complexity of Hutch++ is optimal amongst all matrix-vector query algorithms, even when queries can be chosen adaptively. We show that it significantly outperforms Hutchinson’s method in experiments. While our theory requires A to be positive semidefinite, empirical gains extend to applications involving non-PSD matrices, such as triangle estimation in networks.
1. Introduction
A ubiquitous problem in numerical linear algebra is that of approximating the trace of a d×d matrix A that can only be accessed via matrix-vector multiplication queries. In other words, we are given access to an oracle that can evaluate Ax for any x ∈ ℝd, and the goal is to return an approximation to tr(A) using as few queries to this oracle as possible. An exact solution can be obtained with d queries because tr(A) = Σi=1d eiT Aei, where ei denotes the ith standard basis vector. The goal is thus to develop algorithms that use far fewer than d matrix-vector multiplications.
Known as implicit or matrix free trace estimation, this problem arises in applications that require the trace of a matrix A, where A is itself a transformation of some other matrix B. For example, A = Bq, A = B−1, or A = exp(B). In all of these cases, explicitly computing A would require roughly O(d3) time, whereas multiplication with a vector x can be implemented more quickly using iterative methods. For example, Bqx can be computed in just O(d2) time for constant q, and for well-conditioned matrices, B−1x and exp(B)x can also be computed in O(d2) time using the conjugate gradient or Lanczos methods [Hig08]. Implicit trace estimation is used to approximate matrix norms [HMAS17, MNS+18], spectral densities [LSY16, CKSV18, BKKS20], log-determinants [BDKZ15, HMS15], the Estrada index [US18, WSMB20], eigenvalue counts in intervals [DNPS16], triangle counts in graphs [Avr10], and much more [Che16, LSTZ20]. In these applications, we typically have that A is symmetric, and often positive semidefinite (PSD).
1.1. Hutchinson’s Estimator
The most common method for implicit trace estimation is Hutchinson’s stochastic estimator [Hut90]. This elegant randomized algorithm works as follows: let G ∈ ℝd×m be a matrix containing i.i.d. random variables with mean 0 and variance 1, and let g1, …, gm denote its columns. A simple calculation shows that E[giT Agi] = tr(A) for each i, and giT Agi can be computed with just one matrix-vector multiplication. So to approximate tr(A), Hutchinson’s estimator returns the following average:
| Hm(A) = (1/m) Σi=1m giT Agi | (1) |
Hutchinson’s original work suggests using random ±1 sign vectors for g1, …, gm, and an earlier paper by Girard suggests standard normal random variables [Gir87]. Both choices perform similarly, as both random variables are sub-Gaussian. For vectors with sub-Gaussian random entries, it can be proven that, when A is positive semidefinite, (1−ε)tr(A) ≤ Hm(A) ≤ (1+ε)tr(A) with probability ≥ 1 − δ if we use m = O(log(1/δ)/ε2) matrix-vector multiplication queries [AT11, RA15]. For constant δ (e.g., δ = 1/100) the bound is O(1/ε2).
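As a concrete illustration, the estimator in (1) takes only a few lines of NumPy. This is a sketch rather than reference code; the oracle is modeled as a user-supplied function `matvec` that returns Ax:

```python
import numpy as np

def hutchinson(matvec, d, m, seed=0):
    """Estimate tr(A) using m matrix-vector products, as in equation (1)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(m):
        g = rng.choice([-1.0, 1.0], size=d)  # random sign vector: mean 0, variance 1
        total += g @ matvec(g)               # one oracle query computes A g
    return total / m
```

For a PSD matrix the relative error decays like O(1/√m), matching the O(1/ε2) query bound above.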
1.2. Our results
Since Hutchinson’s work, and the non-asymptotic analysis in [AT11], there has been no improvement on this O(1/ε2) matrix-vector multiplication bound for trace approximation. Our main contribution is a quadratic improvement: we provide a new algorithm, Hutch++, that obtains the same (1 ± ε) guarantee with O(1/ε) matrix-vector multiplication queries. This algorithm is nearly as simple as the original Hutchinson’s method, and can be implemented in just a few lines of code.
Algorithm 1.
Hutch++
| input: Matrix-vector multiplication oracle for PSD matrix A ∈ ℝd×d. Number m of queries. |
| output: Approximation to tr(A). |
| 1: Sample S ∈ ℝd×(m/3) and G ∈ ℝd×(m/3) with i.i.d. {+1, −1} entries. |
| 2: Compute an orthonormal basis Q ∈ ℝd×(m/3) for the span of AS (e.g., via QR decomposition). |
| 3: return Hutch++(A) = tr(QT AQ) + (3/m) · tr(GT (I − QQT )A(I − QQT )G). |
Hutch++ requires m matrix-vector multiplications with A: m/3 to compute A·S, m/3 to compute A · Q, and m/3 to compute A · (I − QQT )G. It requires O(dm2) additional runtime to compute the basis Q and the product (I − QQT )G = G − QQT G. For concreteness, we state the method with random sign matrices, but the entries of S and G can be any sub-Gaussian random variables with mean 0 and variance 1, including e.g., standard Gaussians. Our main theorem on Hutch++ is:
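The three steps above translate directly into a short NumPy sketch (our illustrative code, not an official implementation; `matvec` again models the oracle):

```python
import numpy as np

def hutchpp(matvec, d, m, seed=0):
    """Hutch++ (Algorithm 1) for symmetric PSD A, using m total oracle queries."""
    rng = np.random.default_rng(seed)
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(d, k))
    G = rng.choice([-1.0, 1.0], size=(d, k))
    AS = np.column_stack([matvec(S[:, i]) for i in range(k)])    # m/3 queries
    Q, _ = np.linalg.qr(AS)                                      # orthonormal basis for span(AS)
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(k)])    # m/3 queries
    Gd = G - Q @ (Q.T @ G)                                       # (I - QQ^T) G
    AGd = np.column_stack([matvec(Gd[:, i]) for i in range(k)])  # m/3 queries
    # tr(Q^T A Q) plus Hutchinson's estimate on the deflated remainder
    return np.trace(Q.T @ AQ) + np.trace(Gd.T @ AGd) / k
```

Note that Gd.T @ AGd equals GT (I − QQT )A(I − QQT )G, so the last line matches line 3 of Algorithm 1 with 3/m = 1/k.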
Theorem 1.
If Hutch++ is implemented with m = O(√(log(1/δ))/ε + log(1/δ)) matrix-vector multiplication queries, then for any PSD A, with probability ≥ 1 − δ, the output Hutch++(A) satisfies: (1 − ε)tr(A) ≤ Hutch++(A) ≤ (1 + ε)tr(A).
Hutch++ can be viewed as a natural variance reduced version of Hutchinson’s estimator. The method starts by computing an orthonormal matrix Q by running a single iteration of the power method with a random start matrix S. Q coarsely approximates the span of A’s top eigenvectors. Then we separate A into its projection onto the subspace spanned by Q, and onto that subspace’s orthogonal complement, writing tr(A) = tr(QQT AQQT ) + tr((I − QQT )A(I − QQT )). By the cyclic property of the trace, the first term is equal to tr(QT AQ), which is computed exactly by Hutch++ using m/3 matrix-vector multiplications. The second term is approximated using Hutchinson’s estimator with the random vectors in G.
Thus, the error in estimating tr(A) is entirely due to approximating this second term. The key observation is that the variance when estimating this term is much lower than when estimating tr(A) directly. Specifically, it is proportional to ‖(I − QQT )A(I − QQT )‖F2, which, using standard tools from randomized linear algebra [CEM+15, Woo14], we can show is bounded by ε tr(A)2 with good probability when m = O(1/ε). This yields our improvement over Hutchinson’s method applied directly to A, which has variance bounded by tr(A)2. The full proof of Theorem 1 is in Section 3.
Algorithm 1 is adaptive: it multiplies A by a sequence of query vectors r1, …, rm, where later queries depend on earlier ones. In contrast, Hutchinson’s method is non-adaptive: r1, …, rm are chosen in advance, before computing any of the products Ar1, …, Arm. In addition to Algorithm 1, we give a non-adaptive variant of Hutch++ that obtains the same O(1/ε) bound. We complement these results with a nearly matching lower bound, proven in Section 4. Specifically, via a reduction from the Gap-Hamming problem from communication complexity, we show that any matrix-vector query algorithm whose queries have bounded bit complexity requires Ω(1/(ε log(1/ε))) queries to estimate the trace of a PSD matrix up to a (1 ± ε) multiplicative approximation. We also prove a tight Ω(1/ε) lower bound for non-adaptive algorithms in the real RAM model of computation.
Empirical Results.
In Section 5 we complement our theoretical results with experiments on synthetic and real-world matrices, including applications of trace estimation to approximating log determinants, the graph Estrada index, and the number of triangles in a graph. We demonstrate that Hutch++ improves substantially on Hutchinson’s estimator, and on related estimators based on approximating the top eigenvalues of A. While our theory applies to positive semidefinite matrices, Hutch++ can be applied unmodified to non-positive semidefinite trace estimation, and continues to perform very well empirically. We note that Hutch++ is simple to implement and essentially parameter free – the only choice needed is the number of matrix-vector multiplication queries m.
1.3. Prior Work
Upper bounds.
A nearly tight non-asymptotic analysis of Hutchinson’s estimator for positive semidefinite matrices was given by Avron and Toledo using an approach based on reducing to Johnson-Lindenstrauss random projection [AT11, DG03, Ach03]. A slightly tighter approach from [RA15] obtains a (1±ε) multiplicative error bound with m = O(1/ε2) matrix-vector multiplication queries. This bound is what we improve on with Hutch++.
A number of papers suggest variance reduction schemes for Hutchinson’s estimator. Some take advantage of sparsity structure in A [TS11, SLO13] and others use a “decomposition” approach similar to Hutch++ [APJ+18]. Most related to our work are two papers which, like Hutch++, perform the decomposition by projecting onto some Q that approximately spans A’s top eigenspace [GSO17, Lin17]. The justification is that this method should perform much better than Hutchinson’s when A is close to low-rank, because tr(QT AQ) will capture most of A’s trace. Our contribution is an analysis of this approach which 1) improves on Hutchinson’s even when A is far from low-rank and 2) shows that a very coarse approximation to the top eigenvectors suffices (computed using one iteration of the power method). Finally, we note two papers which directly use the approximation tr(A) ≈ tr(QT AQ), where Q is computed with a randomized SVD method [SAI17, HL20]. Of course, this approach works best for nearly-low rank matrices.
Lower bounds.
Our lower bounds extend a recent line of work on lower bounds for linear algebra problems in the “matrix-vector query model” [SEAR18, SWYZ19, BHSW20]. [WWZ14] proves a lower bound of Ω(1/ε2) queries for PSD trace approximation in an alternative model that allows for adaptive “quadratic form” queries of the form rT Ar. This model captures Hutchinson’s estimator, but not Hutch++, which is why we are able to obtain an upper bound of O(1/ε) queries.
2. Preliminaries
Notation.
For x ∈ ℝd, ‖x‖2 denotes the ℓ2 norm and ‖x‖1 denotes the ℓ1 norm. For A ∈ ℝn×d, ‖A‖F denotes the Frobenius norm. For square A, tr(A) denotes the trace. Our main results on trace approximation are proven for symmetric positive semidefinite (PSD) matrices, which are the focus of most applications. Any symmetric A ∈ ℝd×d has an eigendecomposition A = V ΛV T, where V ∈ ℝd×d is orthogonal and Λ is a real-valued diagonal matrix. We let λ = diag(Λ) be a length d vector containing A’s eigenvalues in descending order: λ1 ≥ λ2 ≥ … ≥ λd. When A is PSD, λi ≥ 0 for all i. We use the identities tr(A) = ‖λ‖1 and ‖A‖F = ‖λ‖2. Finally, we let Ak = argminB,rank(B)=k ‖A − B‖F denote the optimal k-rank approximation to A. For a PSD matrix, Ak = VkΛkVkT, where Vk ∈ ℝd×k contains the first k columns of V and Λk is the k × k top left submatrix of Λ.
Hutchinson’s Analysis.
We require a standard bound on the accuracy of Hutchinson’s estimator:
Lemma 2.
Let A ∈ ℝd×d, δ ∈ (0, 1/2], ℓ ∈ ℕ. Let Hℓ(A) be the ℓ-query Hutchinson estimator defined in (1), implemented with mean 0, i.i.d. sub-Gaussian random variables with constant sub-Gaussian parameter. For fixed constants c, C, if ℓ > c log(1/δ), then with probability ≥ 1 − δ,
|Hℓ(A) − tr(A)| ≤ C √(log(1/δ)/ℓ) · ‖A‖F.
So, if ℓ = O(log(1/δ)/ε2) then, with probability ≥ 1 − δ, |Hℓ(A) − tr(A)| ≤ ε‖A‖F.
We refer the reader to [RV+13] for a formal definition of sub-Gaussian random variables: both normal random variables and ±1 random variables are sub-Gaussian with constant parameter. Lemma 2 is proven in Appendix A for completeness. It is slightly more general than prior work [RA15] in that it applies to non-PSD, and even asymmetric matrices, which will be important in the analysis of our non-adaptive algorithm. A similar result was recently shown in [CK20].
3. Complexity Analysis
We start by providing the technical intuition behind Hutch++. First note that, for a PSD matrix with eigenvalues λ, ‖A‖F ≤ tr(A), so Lemma 2 immediately implies that Hutchinson’s estimator obtains a relative error guarantee with O(1/ε2) queries. However, this bound is only tight when ‖λ‖2 ≈ ‖λ‖1, i.e., when A has significant mass concentrated on just a small number of eigenvalues.
Hutch++ simply eliminates this possibility by approximately projecting off A’s large eigenvalues using a projection QQT. By doing so, it only needs to compute a stochastic estimate for the trace of (I−QQT )A(I−QQT ). The error of this estimate is proportional to ‖(I−QQT )A(I−QQT )‖F, which we show is always much smaller than tr(A). In particular, suppose that Q = Vk exactly spanned the top k eigenvectors of A and thus (I − QQT )A(I − QQT ) = A − Ak. Then we have:
Lemma 3.
Let Ak be the best rank-k approximation to PSD matrix A. Then,
‖A − Ak‖F ≤ (1/√k) · tr(A).
Proof.
We have λk+1 ≤ (1/k) Σi=1k λi ≤ tr(A)/k, so:
‖A − Ak‖F2 = Σi=k+1d λi2 ≤ λk+1 · Σi=k+1d λi ≤ (tr(A)/k) · tr(A) = tr(A)2/k.
Taking square roots on both sides completes the proof. □
This result immediately suggests the possibility of an algorithm with O(1/ε) query complexity: Set k = O(1/ε) and split tr(A) = tr(Ak) + tr(A − Ak). The first term can be computed exactly with O(1/ε) matrix-vector multiplication queries if Vk is known, since tr(Ak) = tr(VkT AVk). By Lemma 3 combined with Lemma 2, the second can be estimated to error ±εtr(A) using just O(1/ε) queries instead of O(1/ε2). Of course, we can’t compute Vk exactly with a small number of matrix-vector multiplication queries, but this is easily resolved by using an approximate projection. Using standard tools from randomized linear algebra, O(k) queries suffice to find a Q with ‖(I − QQT )A(I − QQT )‖F ≤ O(‖A − Ak‖F ), which is all that is needed for an O(1/ε) query result.
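Since ‖A − Ak‖F depends only on A’s eigenvalues, the bound of Lemma 3 is easy to verify numerically with a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
# Eigenvalues of a hypothetical PSD matrix A, sorted in descending order
lam = np.sort(rng.uniform(0, 1, size=d))[::-1]
trace = lam.sum()                              # tr(A) = sum of eigenvalues
for k in (1, 10, 100):
    tail_norm = np.sqrt(np.sum(lam[k:] ** 2))  # ||A - A_k||_F
    assert tail_norm <= trace / np.sqrt(k)     # the Lemma 3 bound
```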
Concretely, we use Lemma 3 to prove the following general theorem, from which Theorem 1 and our non-adaptive algorithmic result will follow as direct corollaries.
Theorem 4.
Let A ∈ ℝd×d be PSD, δ ∈ (0, 1/2), and let ℓ, k ∈ ℕ. Let Ã ∈ ℝd×d and Δ ∈ ℝd×d be any matrices with:
tr(A) = tr(Ã) + tr(Δ) and ‖Δ‖F ≤ 2‖A − Ak‖F.
For fixed constants c, C, if ℓ > c log(1/δ), then with probability 1 − δ, Z = tr(Ã) + Hℓ(Δ) satisfies:
|tr(A) − Z| ≤ 2C √(log(1/δ)/(ℓk)) · tr(A).
In particular, if ℓ = k = O(√(log(1/δ))/ε + log(1/δ)), Z is a (1 ± ε) error approximation to tr(A).
Proof.
We have with probability ≥ 1 − δ:
|tr(A) − Z| = |tr(Δ) − Hℓ(Δ)| ≤ C √(log(1/δ)/ℓ) · ‖Δ‖F ≤ 2C √(log(1/δ)/ℓ) · ‖A − Ak‖F ≤ 2C √(log(1/δ)/(ℓk)) · tr(A),
where the first inequality holds by Lemma 2 and the last by Lemma 3. □
As discussed, Theorem 4 would immediately yield an O(1/ε) query algorithm if we knew an optimal k-rank approximation for A. Since computing one is infeasible, our first version of Hutch++ (Algorithm 1) instead uses a projection onto a subspace Q which is computed with one iteration of the power method. We have:
Theorem 1 Restated.
If Algorithm 1 is implemented with m = O(√(log(1/δ))/ε + log(1/δ)) matrix-vector multiplication queries, then for any PSD A, with probability ≥ 1−δ, the output Hutch++(A) satisfies: (1 − ε)tr(A) ≤ Hutch++(A) ≤ (1 + ε)tr(A).
Proof.
Let S, G, and Q be as in Algorithm 1. We instantiate Theorem 4 with Ã = QQT AQQT and Δ = (I − QQT )A(I − QQT ). Note that, since Q has orthonormal columns, (I − QQT ) is a projection matrix, so (I − QQT ) = (I − QQT )2. This fact, along with the cyclic property of the trace, gives:
tr(Ã) + tr(Δ) = tr(QT AQ) + tr((I − QQT )A) = tr(QQT A) + tr(A) − tr(QQT A) = tr(A),
and thus tr(A) = tr(Ã) + tr(Δ) as required by Theorem 4. Furthermore, since multiplying by a projection matrix can only decrease Frobenius norm, ‖Δ‖F = ‖(I − QQT )A(I − QQT )‖F ≤ ‖(I − QQT )A‖F.
Recall that Q is an orthonormal basis for the column span of AS, where S is a random sign matrix with m/3 columns. Q is thus an orthonormal basis for a linear sketch of A’s column space, and it is well known that Q will align with the large eigenvectors of A, so that ‖(I − QQT )A‖F will be small [Sar06, Woo14]. Concretely, applying Corollary 7 and Claim 1 from [MM20], we have that, as long as m/3 ≥ c(k + log(1/δ)) for a fixed constant c, with probability ≥ 1 − δ:
‖(I − QQT )A‖F ≤ 2‖A − Ak‖F.
Accordingly, ‖Δ‖F ≤ 2‖A − Ak‖F as required by Theorem 4. The result then immediately follows by setting k = ℓ = m/3 with m = O(√(log(1/δ))/ε + log(1/δ)), and noting that Hutch++(A) = tr(Ã) + Hℓ(Δ). □
3.1. A Non-Adaptive Variant of Hutch++
As discussed in Section 1, Algorithm 1 is adaptive: it uses the result of computing AS to compute Q, which is then multiplied by A to compute the tr(QT AQ) term. Meanwhile, Hutchinson’s estimator is non-adaptive: it samples a single random matrix upfront, batch-multiplies by A once, and computes an approximation to tr(A) from the result, without any further queries.
Not only is non-adaptivity an interesting theoretical property, but it can be practically useful, since parallelism or block iterative methods often make it faster to multiply an implicit matrix by many vectors at once. With these considerations in mind, we describe a non-adaptive variant of Hutch++, which we call NA-Hutch++. NA-Hutch++ obtains nearly the same theoretical guarantees as Algorithm 1, although it tends to perform slightly worse in our experiments.
We leverage a streaming low-rank approximation result of Clarkson and Woodruff [CW09] which shows that if S ∈ ℝd×m′ and R ∈ ℝd×cm′ are sub-Gaussian random matrices with m′ = O(k log(1/δ)) and c > 1 a fixed constant, then with probability 1 − δ, the matrix Ã = AR(ST AR)+(AS)T satisfies ‖A − Ã‖F ≤ 2‖A − Ak‖F. Here + denotes the Moore-Penrose pseudoinverse. We can compute tr(Ã) efficiently without explicitly constructing Ã by noting that it is equal to tr((ST AR)+(AS)T (AR)) via the cyclic property of the trace. This yields:
Algorithm 2.
NA-Hutch++ (Non-Adaptive variant of Hutch++)
| input: Matrix-vector multiplication oracle for PSD matrix A ∈ ℝd×d. Number m of queries. |
| output: Approximation to tr(A). |
| 1: Fix constants c1, c2, c3 such that c1 < c2 and c1 + c2 + c3 = 1. |
| 2: Sample S ∈ ℝd×c1m, R ∈ ℝd×c2m, and G ∈ ℝd×c3m with i.i.d. {+1, −1} entries. |
| 3: Compute Z = AR and W = AS. |
| 4: return NA-Hutch++(A) = tr((ST Z)+(WT Z)) + (1/(c3m)) · [tr(GT AG) − tr(GT Z(ST Z)+WT G)]. |
NA-Hutch++ requires m matrix-vector multiplications with A. In our experiments, it works well with c1 = c3 = 1/4 and c2 = 1/2. Assuming m < d, it requires O(dm2) further runtime, to perform the matrix multiplications on line 4 and to compute (ST Z)+, which takes O(dm2 + m3) time.
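A NumPy sketch of Algorithm 2 follows (again our illustrative code, not an official implementation; note that all m query vectors are fixed before any product is computed):

```python
import numpy as np

def na_hutchpp(matvec, d, m, seed=0):
    """NA-Hutch++ (Algorithm 2) with c1 = c3 = 1/4 and c2 = 1/2."""
    rng = np.random.default_rng(seed)
    n1, n2 = m // 4, m // 2
    n3 = m - n1 - n2
    S = rng.choice([-1.0, 1.0], size=(d, n1))
    R = rng.choice([-1.0, 1.0], size=(d, n2))
    G = rng.choice([-1.0, 1.0], size=(d, n3))
    # A single non-adaptive batch of m matrix-vector products
    W = np.column_stack([matvec(S[:, i]) for i in range(n1)])   # W = AS
    Z = np.column_stack([matvec(R[:, i]) for i in range(n2)])   # Z = AR
    AG = np.column_stack([matvec(G[:, i]) for i in range(n3)])
    P = np.linalg.pinv(S.T @ Z)                                 # (S^T A R)^+
    low_rank = np.trace(P @ (W.T @ Z))                          # tr((S^T Z)^+ (W^T Z))
    residual = (np.trace(G.T @ AG)
                - np.trace((G.T @ Z) @ P @ (W.T @ G))) / n3
    return low_rank + residual
```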
Theorem 5.
If NA-Hutch++ is implemented with m = O(log(1/δ)/ε) matrix-vector multiplication queries, for a sufficiently large constant, then for any PSD A, with probability ≥ 1 − δ, the output NA-Hutch++(A) satisfies: (1 − ε)tr(A) ≤ NA-Hutch++(A) ≤ (1 + ε)tr(A).
Proof.
We apply Theorem 4 with Ã = AR(ST AR)+(AS)T, Δ = A − Ã, k = O(1/ε), and ℓ = c3m. Then tr(Ã) + tr(Δ) = tr(A), and tr(Ã) + Hℓ(Δ) is exactly the output of NA-Hutch++. By Theorem 4.7 of [CW09], since c1m = O(k log(1/δ)), ‖Δ‖F ≤ 2‖A − Ak‖F with probability ≥ 1 − δ as required. □
4. Lower Bounds
A natural question is if the O(1/ε) matrix-vector query bound of Theorem 1 and Theorem 5 is tight. In this section, we prove that it is up to a logarithmic factor, even for algorithms that perform adaptive queries like Hutch++. Our lower bound is via a reduction to communication complexity: we show that a better algorithm for PSD trace estimation would imply a better 2-party communication protocol for the Gap-Hamming problem, which would violate known adaptive lower bounds for that problem [CR12]. To prove this result we need to assume a fixed precision model of computation. Specifically we require that the entries in each query vector r are integers bounded in absolute value by 2b, for some fixed constant b. By scaling, this captures the setting where the query vectors are non-integer, but have bounded precision. Formally, we prove in Section 4.1:
Theorem 6.
Any algorithm that accesses a positive semidefinite matrix A via matrix-vector multiplication queries Ar1, …, Arm, where r1, …, rm are possibly adaptively chosen vectors with integer entries in {−2b, …, 2b}, requires m = Ω(1/(ε(b + log(1/ε)))) such queries to output an estimate t so that, with probability ≥ 2/3, (1 − ε)tr(A) ≤ t ≤ (1 + ε)tr(A).
For constant b our lower bound is Ω(1/(ε log(1/ε))), which matches Theorem 1 and Theorem 5 up to a log(1/ε) factor. We also provide an alternative lower bound which holds in the real RAM model of computation (all inputs and arithmetic operations involve real numbers). This second lower bound is tight up to constants, but only applies to non-adaptive algorithms. It is proven using different information theoretic techniques – we reduce to a hypothesis testing problem involving negatively spiked covariance matrices [CMW15, PWBM18]. Formally, we prove in Appendix B:
Theorem 7.
Any algorithm that accesses a positive semidefinite matrix A through matrix-vector multiplication queries Ar1, …, Arm, where r1, …, rm are real valued non-adaptively chosen vectors, requires m = Ω(1/ε) such queries to output an estimate t so that, with probability > ¾, (1 − ε)tr(A) ≤ t ≤ (1 + ε)tr(A).
4.1. Adaptive lower bound
The proof of Theorem 6 is based on reducing the Gap-Hamming problem to trace estimation. This problem has been well studied in communication complexity since its introduction in [IW03].
Problem 1 (Gap-Hamming).
Let Alice and Bob be communicating parties who hold vectors s ∈ {−1, 1}n and t ∈ {−1, 1}n, respectively. The Gap-Hamming problem asks Alice and Bob to return:
1 if ⟨s, t⟩ ≥ √n and −1 if ⟨s, t⟩ ≤ −√n,
with any answer accepted when |⟨s, t⟩| < √n.
A tight lower bound on the unbounded round, randomized communication complexity of this problem was first proven in [CR12], with alternative proofs appearing in [Vid12, She12]. Formally:
Lemma 8 (Theorem 2.6 in [CR12]).
The randomized communication complexity for solving Problem 1 with probability ≥ 2/3 is Ω(n) bits.
With Lemma 8 in place, we have all we need to prove Theorem 6.
Proof of Theorem 6.
Fix a perfect square n ∈ ℕ. Consider an instance of Problem 1 with inputs s ∈ {−1, 1}n and t ∈ {−1, 1}n. Let S ∈ {−1, 1}√n×√n and T ∈ {−1, 1}√n×√n contain the entries of s and t rearranged into matrices (e.g., placed left-to-right, top-to-bottom). Let Z = S + T and let A = ZT Z. A is positive semidefinite and we have:
| tr(A) = ‖S + T‖F2 = ‖S‖F2 + ‖T‖F2 + 2⟨s, t⟩ = 2n + 2⟨s, t⟩ | (2) |
If ⟨s, t⟩ ≥ √n then we will have tr(A) ≥ 2n + 2√n, and if ⟨s, t⟩ ≤ −√n then we will have tr(A) ≤ 2n − 2√n. So, if Alice and Bob can approximate tr(A) up to relative error (1 ± ε) for ε = 1/(4√n), then they can solve Problem 1: since tr(A) ≤ 4n, the additive error is at most √n. We claim that they can do so with just O(m√n(b + log n)) bits of communication if there exists an m-query adaptive matrix-vector multiplication algorithm for positive semidefinite trace estimation achieving error (1 ± ε).
Specifically, Alice takes charge of running the query algorithm. To compute Ar for a vector r, Alice and Bob first need to compute Zr. To do so, Alice sends r to Bob, which takes O(√n · b) bits since r has entries bounded by 2b. Bob then computes Tr, which has entries bounded by √n · 2b. He sends the result to Alice, using O(√n(b + log n)) bits. Upon receiving Tr, Alice computes Zr = Sr + Tr. Next, they need to multiply Zr by ZT to obtain Ar = ZT Zr. To do so, Alice sends Zr to Bob (again using O(√n(b + log n)) bits) who computes TT Zr. The entries in this vector are bounded by 2n · 2b, so Bob sends the result back to Alice using O(√n(b + log n)) bits. Finally, Alice computes ST Zr and adds the result to TT Zr to obtain ZT Zr = Ar. Given this result, Alice chooses the next query vector according to the algorithm and repeats.
Overall, running the full matrix-vector query algorithm requires O(m√n(b + log n)) bits of communication. So, from Lemma 8 we have that m = Ω(√n/(b + log n)) queries are needed to approximate the trace to accuracy 1 ± ε for ε = 1/(4√n), with probability ≥ 2/3. Equivalently, m = Ω(1/(ε(b + log(1/ε)))). □
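The identity (2) underlying the reduction is simple to check numerically (a sanity check of the construction, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                       # a perfect square
root = 4                     # sqrt(n)
s = rng.choice([-1, 1], size=n)
t = rng.choice([-1, 1], size=n)
S = s.reshape(root, root)    # entries placed left-to-right, top-to-bottom
T = t.reshape(root, root)
Z = S + T
A = Z.T @ Z                  # positive semidefinite by construction
# Equation (2): tr(A) = 2n + 2<s, t>
assert np.isclose(np.trace(A), 2 * n + 2 * (s @ t))
```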
5. Experimental Validation
We complement our theory with experiments on synthetic matrices and real-world trace estimation problems. We compare four algorithms, including both our adaptive and non-adaptive methods:
Hutchinson’s. The standard estimator run with {+1, −1} random vectors.
Subspace Projection. The method from [SAI17], which computes an orthogonal matrix Q ∈ ℝd×k that approximately spans the top eigenvector subspace of A and returns tr(QT AQ) as an approximation to tr(A). A similar approach is employed in [HL20]. [SAI17] computes Q using subspace iteration, which requires k(q + 1) matrix-vector multiplications when run for q iterations. A larger q results in a more accurate Q, but requires more multiplications. As in [SAI17], we found that setting q = 1 gave the best performance, so we did so in our experiments. With q = 1, this method is similar to Hutch++, except that it does not approximate the remainder of the trace outside the top eigenspace.
Hutch++. The adaptive method of Algorithm 1 with {+1, −1} random vectors.
NA-Hutch++. The non-adaptive method of Algorithm 2 with c1 = c3 = ¼ and c2 = 1/2 and {+1, −1} random vectors.
5.1. Synthetic Matrices
We first test the methods above on random matrices with power law spectra. For varying constant c, we let Λ be diagonal with Λii = i−c. We generate a random orthogonal matrix Q by orthogonalizing a random Gaussian matrix and set A = QT ΛQ. A’s eigenvalues are the values in Λ. A larger c results in a more quickly decaying spectrum, so we expect Subspace Projection to perform well. A smaller c results in a slowly decaying spectrum, which will mean that ‖A‖F ≪ tr(A). In this case, we expect Hutchinson’s to outperform its worst case multiplicative error bound: instead of error ±εtr(A) after O(1/ε2) matrix-multiplication queries, Lemma 2 predicts error on the order of ±ε‖A‖F. Concretely, for dimension d = 5000 and c = 2, we have ‖A‖F = .63 · tr(A) and for c = .5 we have ‖A‖F = .02 · tr(A). In general, unlike the Subspace Projection method and Hutchinson’s estimator, we expect Hutch++ and NA-Hutch++ to be less sensitive to A’s spectrum.
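Test matrices of this form can be generated as follows (a sketch; `powerlaw_matrix` is our hypothetical helper name, not code from the experiments):

```python
import numpy as np

def powerlaw_matrix(d, c, seed=0):
    """Random symmetric PSD matrix A = Q^T diag(i^{-c}) Q with a power law spectrum."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal Q
    lam = np.arange(1, d + 1, dtype=float) ** (-c)
    return (Q.T * lam) @ Q                            # equals Q^T diag(lam) Q

A = powerlaw_matrix(200, 2.0)
# A's eigenvalues are exactly i^{-c}
assert np.allclose(np.sort(np.linalg.eigvalsh(A))[::-1],
                   np.arange(1, 201, dtype=float) ** (-2.0))
```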
In Figure 1 we plot results for various c. Relative error should scale roughly as ε = O(m−γ), where γ = ½ for Hutchinson’s and γ = 1 for Hutch++ and NA-Hutch++. We thus use log-log plots, where we expect a linear relationship between the error ε and number of iterations m.
Figure 1:
Relative error versus number of matrix-vector multiplication queries for trace approximation algorithms run on random matrices with power law spectra. We report the median relative error of the approximation t after 200 trials. The upper and lower bounds of the shaded region around each curve are the 25th and 75th percentile errors. Subspace Projection has consistently low variance, but as expected, only performs better than Hutchinson’s when c = 2 and there is very fast eigenvalue decay. Hutch++ and NA-Hutch++ typically outperform both methods.
The superior performance of Hutch++ and NA-Hutch++ shown in Figure 1 is not surprising. These methods are designed to achieve the “best of both worlds”: when A’s spectrum decays quickly, our methods approximate tr(A) well by projecting off the top eigenvalues. When it decays slowly, they perform essentially no worse than Hutchinson’s. We note that the adaptivity of Hutch++ leads to consistently better performance over NA-Hutch++, and the method is simpler to implement as we do not need to set the constants c1, c2, c3. Accordingly, this is the method we move forward with in our real data experiments.
5.2. Real Matrices
To evaluate the real-world performance of Hutch++ we test it in the common setting where A = f(B). In most applications, B is symmetric with eigendecomposition B = V ΛV T, and f : ℝ → ℝ is a function on real valued inputs. Then we have f(B) = V f(Λ)V T, where f(Λ) is simply f applied to the real-valued eigenvalues on the diagonal of Λ. When f returns negative values, A may not be positive semidefinite. Generally, computing f(B) explicitly requires a full eigendecomposition and thus Ω(d3) time. However, many iterative methods can more quickly approximate matrix-vector queries of the form Ar = f(B)r. The most popular and general is the Lanczos method, which we employ in our experiments [UCS17, MMS18].
We consider trace estimation in three example applications, involving both PSD and non-PSD matrices. We test on relatively small inputs, for which we can explicitly compute tr(f(B)) to use as a baseline for the approximation error. However, our methods can scale to much larger matrices.
Graph Estrada Index.
Given the binary adjacency matrix B ∈ {0, 1}d×d of a graph G, the Estrada index is defined as tr(exp(B)) [Est00, dlPGR07], where exp(x) = ex. This index measures the strength of connectivity within G. A simple transformation of the Estrada index yields the natural connectivity metric, defined as ln(tr(exp(B))/d) [JBYJHZ10, EHB12].
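For a tiny graph the Estrada index can be computed exactly from an eigendecomposition, since tr(exp(B)) equals the sum of eλi over B’s eigenvalues — a quick sketch:

```python
import numpy as np

B = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)  # adjacency matrix of a triangle graph
lam = np.linalg.eigvalsh(B)             # eigenvalues: -1, -1, 2
estrada = np.sum(np.exp(lam))           # tr(exp(B)) = sum of exp(eigenvalues)
assert np.isclose(estrada, np.exp(2) + 2 * np.exp(-1))
```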
In our experiments, we approximated the Estrada index of the Roget’s Thesaurus semantic graph, available from [BM06]. The Estrada index of this 1022 node graph was originally studied in [EH08]. We use the Lanczos method to approximate matrix multiplication with exp(B), running it for 40 iterations, after which the error of application was negligible compared to the approximation error of trace estimation. Results are shown in Figure 2.
Figure 2:
Relative error versus number of matrix-vector multiplication queries for trace approximations of transformed matrices, which were multiplied by vectors using the Lanczos method. We report median relative error of the approximation t after 100 trials. The upper and lower bounds of the shaded region around each curve are the 25th and 75th percentile errors. As expected, Subspace Projection and Hutch++ outperform Hutchinson’s when A = exp(B), as exponentiating leads to a quickly decaying spectrum. On the other hand, Hutchinson’s performs well for A = log(B + λI), which has a very flat spectrum. Hutch++ is still essentially as fast, even though this matrix is not PSD. Subspace Projection fails in this case because the top eigenvalues of A do not dominate its trace.
Gaussian Process Log Likelihood.
Let B ∈ ℝd×d be a PSD kernel covariance matrix and let λ ≥ 0 be a regularization parameter. In Gaussian process regression, the model log likelihood computation requires computing logdet(B + λI) = tr(f(B)) where f(x) = log(x + λ) [WR96, Ras04]. This quantity must be computed repeatedly for different choices of B and λ during hyperparameter optimization, and it is often approximated using Hutchinson’s method [BDKZ15, UCS17, HMAS17, DEN+17]. We note that, while B is positive semidefinite, log(B + λI) typically will not be. So our theoretical bounds do not apply in this case, but Hutch++ can be applied unmodified, and as we see in Figure 2, still gives good performance.
In our experiments we consider a benchmark 2D Gaussian process regression problem from the GIS literature, involving precipitation data from Slovakia [NM13]. B is the kernel covariance matrix on 6400 randomly selected training points out of 196,104 total points. Following the setup of [EMM20], we let B be a Gaussian kernel matrix with width parameter γ = 64 and regularization parameter λ = .008, both determined via cross-validation on ℓ2 regression loss.
Graph Triangle Counting.
Given the binary adjacency matrix B ∈ {0, 1}d×d of an undirected graph G, the number of triangles in G is equal to tr(B3)/6. The triangle count is an important measure of local connectivity and extensive research studies its efficient approximation [SW05, BBCG08, PT12]. Popular approaches include applying Hutchinson’s method to A = B3 [Avr10], or using the EigenTriangle estimator, which is similar to the Subspace Projection method [Tso08].
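As a small sanity check of the tr(B3)/6 identity — each triangle corresponds to six closed walks of length three (three starting vertices, two directions):

```python
import numpy as np

B = np.ones((4, 4)) - np.eye(4)      # complete graph K4, which has C(4,3) = 4 triangles
triangles = np.trace(B @ B @ B) / 6  # closed length-3 walks, each triangle counted 6 times
assert triangles == 4.0
```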
In our experiments, we study approximate triangle counting on two common benchmark graphs: an arXiv.org collaboration network with 5,243 nodes and 48,260 triangles, and a Wikipedia administrator voting network with 7,115 nodes and 608,389 triangles. We again note that the adjacency matrix B is not positive semidefinite, and neither is A = B3. Nevertheless, we can apply Hutch++ and see very strong performance. In this setting we do not need to apply Lanczos for matrix-vector query computation: Ar can be computed exactly using three matrix-vector multiplications with B. Results are shown in Figure 3, with graph spectra visualized in Figure 4.
Figure 3:
Relative error versus number of matrix-vector multiplication queries for trace approximations of transformed matrices. We report the median relative error of the approximation t after 100 trials. The upper and lower bounds of the shaded region around each curve are the 25th and 75th percentile errors. Hutch++ still outperforms the baseline methods even though A is not PSD. We note that Subspace Projection has somewhat uneven performance: increasing m will take into account a larger number of top eigenvalues when approximating the trace. However, since these may be positive or negative, approximation error does not monotonically decrease. Hutch++ is not sensitive to this issue since it does not use just the top eigenvalues: see Figure 4 for more discussion.
Figure 4:
Even when estimating the trace of a non-PSD matrix like A = B3, which for the triangle counting examples above will have both positive and negative eigenvalues, Hutch++ can far outperform Hutchinson’s method. It will approximately project off the largest magnitude eigenvalues from A (whether positive or negative), which will reduce the variance in estimating the trace.
Acknowledgments:
D. Woodruff would like to acknowledge support from the National Institutes of Health (NIH) grant 5R01 HG 10798–2 and a Simons Investigator Award.
A. Proof of Lemma 2
We start by stating the Hanson-Wright inequality for i.i.d sub-Gaussian random variables:
Imported Theorem 9 (Theorem 1.1 from [RV+13]).
Let x ∈ ℝn be a vector of mean 0, i.i.d. sub-Gaussian random variables with constant sub-Gaussian parameter C. Let A ∈ ℝn×n be a matrix. Then, there exists a constant c only depending on C such that for every t ≥ 0,
Pr[|xT Ax − E[xT Ax]| > t] ≤ 2 exp(−c · min(t2/‖A‖F2, t/‖A‖2)).
Above, ∥A∥2 = maxx ∥Ax∥2/∥x∥2 denotes the spectral norm. We refer the reader to [RV+13] for a formal definition of sub-Gaussian random variables: both normal random variables and ±1 random variables are sub-Gaussian with constant C.
Lemma 2 Restated.
Let A ∈ ℝd×d, δ ∈ (0, ½], ℓ ∈ ℕ. Let Hℓ(A) be the ℓ-query Hutchinson estimator defined in (1), implemented with mean 0, i.i.d. sub-Gaussian random variables with constant sub-Gaussian parameter. For fixed constants c, C, if ℓ > c log(1/δ), then with prob. 1 − δ,
|Hℓ(A) − tr(A)| ≤ C √(log(1/δ)/ℓ) · ‖A‖F.
Proof.
Let A_ℓ ∈ R^{dℓ×dℓ} be a block-diagonal matrix formed from ℓ repetitions of A:
A_ℓ = diag(A, A, …, A).
Let G ∈ R^{d×ℓ} be as in (1). Let gi be G’s ith column and let x = [g1; …; gℓ] ∈ R^{dℓ} be a vectorization of G. We have that x^T A_ℓ x = Σᵢ gi^T A gi = ℓ · Hℓ(A). So, by Imported Theorem 9,
Pr[ |x^T A_ℓ x − E[x^T A_ℓ x]| ≥ t ] ≤ 2 exp(−c · min(t²/∥A_ℓ∥F², t/∥A_ℓ∥2)).   (3)
We let t′ = t/ℓ, and substitute ∥A_ℓ∥F² = ℓ∥A∥F², ∥A_ℓ∥2 = ∥A∥2, and E[x^T A_ℓ x] = ℓ · tr(A) into (3) to get:
Pr[ |Hℓ(A) − tr(A)| ≥ t′ ] ≤ 2 exp(−c · min(ℓt′²/∥A∥F², ℓt′/∥A∥2)).
Now, taking t′ = ε∥A∥F, we have:
Pr[ |Hℓ(A) − tr(A)| ≥ ε∥A∥F ] ≤ 2 exp(−c · min(ℓε², ℓε · ∥A∥F/∥A∥2)).
Since ∥A∥F ≥ ∥A∥2 and ε ≤ 1, if we take ℓ ≥ log(2/δ)/(cε²), we have that the minimum takes value ℓε² ≥ (1/c) log(2/δ), so
Pr[ |Hℓ(A) − tr(A)| ≥ ε∥A∥F ] ≤ 2 exp(−log(2/δ)) = δ.
The final result follows from noting that log(2/δ) ≤ 2 log(1/δ) for δ ≤ ½. □
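The ℓ = O(log(1/δ)/ε²) rate in Lemma 2 is easy to observe empirically. The following numpy sketch (the test spectrum is our illustrative choice, not from the paper) measures the median of ε = |Hℓ(A) − tr(A)|/∥A∥F for two values of ℓ; multiplying ℓ by 16 should shrink ε by roughly a factor of 4.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 300
lam = 1.0 / np.arange(1, d + 1)        # eigenvalues of a PSD test matrix A = diag(lam)
tr = float(lam.sum())
fro = float(np.linalg.norm(lam))       # ||A||_F

def hutchinson(ell):
    # (1/ell) * sum_i g_i^T A g_i with Gaussian (hence sub-Gaussian) queries
    G = rng.standard_normal((d, ell))
    return float((G * (lam[:, None] * G)).sum() / ell)

# Median of eps = |H_ell(A) - tr(A)| / ||A||_F over 100 independent runs.
med = {ell: float(np.median([abs(hutchinson(ell) - tr) for _ in range(100)])) / fro
       for ell in (16, 256)}
```

Working with a diagonal A loses no generality here since Gaussian queries are rotationally invariant.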
B. Proof of Theorem 7
To prove our non-adaptive lower bound for the real RAM model, we first introduce a simple testing problem which reduces to estimating the trace of a PSD matrix A to (1 ± ε) relative error:
Problem 2.
Fix d, n such that d ≥ n and n = ⌈1/(8ε)⌉ for ε ∈ (0, 1]. Let D1 = In and D2 = In − e_n e_n^T, i.e., the identity with its last diagonal entry set to zero. Consider A = G^T DG generated by selecting G ∈ R^{n×d} with i.i.d. standard Gaussian entries and D = D1 or D = D2 with equal probability. Then consider any algorithm which fixes a query matrix U ∈ R^{d×m}, observes AU, and guesses if D = D1 or D = D2.
The reduction from Problem 2 to relative error trace estimation is as follows:
Lemma 10.
For any ε ∈ (0, 1] and sufficiently large d, if a randomized algorithm 𝒜 can estimate the trace of any d × d PSD matrix to (1 ± ε) relative error with success probability 3/4 using m queries, then 𝒜 can be used to solve Problem 2 with success probability 2/3 using m queries.
Proof.
To solve Problem 2 we simply apply 𝒜 to the matrix A = G^T DG and guess D1 if the trace estimate is closer to dn and D2 if it’s closer to d(n − 1). To see that this succeeds with probability 2/3, we first need to understand the trace of A. To do so, note that tr(A) = tr(G^T DG) is simply a scaled Hutchinson estimate for tr(D), i.e. tr(G^T DG) = d · Hd(D). So, via Lemma 2, for large enough d we have that with probability 11/12 both of the following hold:
|tr(A) − dn| ≤ d/8 when D = D1,   and   |tr(A) − d(n − 1)| ≤ d/8 when D = D2.
Additionally, with probability 3/4, 𝒜 computes an approximation Z with (1 − ε) tr(A) ≤ Z ≤ (1 + ε) tr(A). By a union bound, all of the above events happen with probability at least 1 − 1/12 − 1/4 = 2/3. We may also assume ε ≤ 1/8, as otherwise the Ω(1/ε) lower bound we are proving is trivial; with n = ⌈1/(8ε)⌉ this gives εn ≤ 1/4. If D = D1:
|Z − dn| ≤ |Z − tr(A)| + |tr(A) − dn| ≤ ε tr(A) + d/8 ≤ ε(dn + d/8) + d/8 < d/2.
On the other hand, if D = D2,
|Z − d(n − 1)| ≤ |Z − tr(A)| + |tr(A) − d(n − 1)| ≤ ε tr(A) + d/8 ≤ εdn + d/8 < d/2.
Thus, with probability 2/3, Z is closer to dn when D = D1 and closer to d(n − 1) when D = D2 (the two values differ by d), so the proposed scheme guesses correctly. □
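The concentration used in this proof is easy to see numerically. A small numpy simulation (illustrative d and n, not the constants above) draws A = G^T DG for both choices of D and checks that tr(A) lands well within d/2 of d · tr(D):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2000, 10
G = rng.standard_normal((n, d))                 # i.i.d. Gaussian, as in Problem 2
D1 = np.eye(n)
D2 = np.diag([1.0] * (n - 1) + [0.0])           # identity with last diagonal entry zeroed

# tr(G^T D G) = d * H_d(D) concentrates around d * tr(D):
trA1 = float(np.trace(G.T @ (D1 @ G)))          # concentrates around d * n
trA2 = float(np.trace(G.T @ (D2 @ G)))          # concentrates around d * (n - 1)
```

The two centers dn and d(n − 1) differ by exactly d, which is the gap a (1 ± ε) trace estimate must resolve.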
In the remainder of the section we show that Problem 2 requires Ω(1/ε) queries, which combined with Lemma 10 proves our main lower bound, Theorem 7. Throughout, we let X ≡ Y denote that random variables X and Y are identically distributed. We first argue that for Problem 2, the non-adaptive query matrix U might as well be chosen to be the first m standard basis vectors.
Lemma 11.
For Problem 2, without loss of generality, we may assume that the query matrix U equals Em = [e1, …, em], the matrix whose columns are the first m standard basis vectors.
Proof.
First, we may assume without loss of generality that U is orthonormal, since if it were not, we could simply reconstruct the queries AU by querying A with an orthonormal basis for the columns of U. Next, by rotational invariance of the Gaussian distribution, if G ∈ R^{n×d} is an i.i.d. Gaussian matrix, and Q ∈ R^{d×d} is any orthogonal matrix, then GQ is distributed identically to G. Let V ∈ R^{d×(d−m)} be any orthonormal span for the nullspace of U^T, so that Q = [U, V] is orthogonal. We have that AU = G^T DGQEm ≡ QG^T DGEm. So, using the result G^T DGEm of querying with matrix Em, we can just multiply by Q on the left to obtain a set of vectors that has the exact same distribution as if U had been used as a query matrix. □
With Lemma 11 in place, we are able to reduce Problem 2 to a simpler testing problem on distinguishing m random vectors drawn from normal distributions with different covariance matrices:
Problem 3.
Let n = ⌈1/(8ε)⌉ and let z ∈ R^n be a uniformly random unit vector. Let N ∈ R^{n×m} contain m i.i.d. random Gaussian vectors drawn from an n-dimensional Gaussian distribution, N(0, C), where the covariance matrix C either equals I or I − zz^T, with equal probability. The goal is to use N to distinguish, with probability at least 2/3, what the true identity of C is.
Lemma 12.
Let 𝒜 be an algorithm that solves Problem 2 with m queries and success probability p. Then 𝒜 can be used to solve Problem 3 with m Gaussian samples and the same success probability.
Proof.
By Lemma 11, it suffices to show how to use the observed matrix N in Problem 3 to create a sample from the distribution G^T DGEm, where G ∈ R^{n×d} has i.i.d. N(0, 1) entries. Specifically, we claim that, if we sample L ∈ R^{n×(d−m)} with i.i.d. N(0, 1) entries, and compute the d × m matrix
M = [N^T N; L^T N]
(the m × m matrix N^T N stacked on top of the (d − m) × m matrix L^T N),
then M is identically distributed to G^T DGEm. I.e., if we let Gm ∈ R^{n×m} contain the first m columns of G and let G_{d−m} contain the remaining d − m columns, our goal is to show that M ≡ [Gm^T DGm; G_{d−m}^T DGm] = G^T DGEm.
To see this is the case, let Z ∈ R^{n×n} be a uniformly random orthogonal matrix and let D1, D2 be as in Problem 2. The first observation is that N is identically distributed to ZDS, where S ∈ R^{n×m} has standard normal entries N(0, 1) and D = D1 or D = D2 with equal probability. This follows simply from the fact that ZD1D1^T Z^T = I and ZD2D2^T Z^T = I − zz^T, where z = Zen is the last column of Z, which is a uniformly random unit vector. It follows that N^T N ≡ S^T D^T Z^T ZDS = S^T DS, since D^T D = D for both D1 and D2. Next, observe that L^T Z is independent of S and has i.i.d. N(0, 1) entries since Z is orthogonal (and Gaussians are rotationally invariant). So, L^T N ≡ (L^T Z)DS, and overall:
M = [N^T N; L^T N] ≡ [S^T DS; (L^T Z)DS] ≡ [Gm^T DGm; G_{d−m}^T DGm] = G^T DGEm. □
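The two distributional facts driving this proof, that D2 is a projection (so D^T D = D) and that the columns of ZD2S have covariance I − zz^T, can be checked with a short numpy simulation (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 20000
D2 = np.diag([1.0] * (n - 1) + [0.0])
# D2 is a projection, which is what lets N^T N stand in for G_m^T D G_m:
proj_ok = bool(np.allclose(D2 @ D2, D2))

# A random orthogonal Z via QR of a Gaussian matrix.
Z, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = rng.standard_normal((n, m))
N = Z @ D2 @ S
z = Z[:, -1]                                    # last column of Z: a random unit vector

# Columns of N have covariance Z D2 D2^T Z^T = I - z z^T, as in Problem 3.
cov_err = float(np.linalg.norm(N @ N.T / m - (np.eye(n) - np.outer(z, z))))
```

The empirical covariance of the columns of N converges to I − zz^T at the usual 1/√m rate.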
Finally, we directly prove a lower bound on the number of samples m required to solve Problem 3, and thus, via Lemma 12, Problem 2. Combined with Lemma 10, this immediately yields our main lower bound on non-adaptive trace estimation, Theorem 7.
Lemma 13.
If m ≤ cn for a sufficiently small fixed constant c, then Problem 3 cannot be solved with success probability 2/3.
Proof.
The proof follows from existing work on lower bounds for learning “negatively spiked” covariance matrices [CMW15, PWBM18]. Let P0 be the distribution of N in Problem 3, conditioned on C = I, and let P1 be the distribution conditioned on C = I − zz^T. These distributions fall into the spiked covariance model of [PWBM18], specifically the negatively spiked Wishart model (see Defn. 5.1 in [PWBM18]) with spike size β = −1, and spike distribution the uniform distribution over unit vectors in R^n. Let χ²(P1, P0) denote the chi-squared divergence between P1 and P0. Specifically,
χ²(P1, P0) = E_{N∼P0}[(dP1/dP0(N))²] − 1.
We have D_KL(P1 ∥ P0) ≤ log(1 + χ²(P1, P0)), so to prove that P0, P1 cannot be distinguished with good probability, it suffices to prove an upper bound on χ²(P1, P0). In [CMW15] (Lemma 7) it is proven that, letting v and v′ be independent random unit vectors in R^n,
χ²(P1, P0) + 1 = E[(1 − 〈v, v′〉²)^{−m/2}].   (4)
Equation (4) uses the notation of Prop. 5.11 in [PWBM18], which restates and proves a slightly less general form of the equality from [CMW15]. Our goal is to prove that the expectation term in (4) is ≤ 1 + C for some small constant C when m ≤ cn for a sufficiently small constant c.
We first note that 〈v, v′〉 is identically distributed to x ∈ [−1, 1], where x is the first entry in a random unit vector in R^n. It is well known that (1 + x)/2 is distributed according to a beta distribution with parameters ((n − 1)/2, (n − 1)/2) [FKN90]. Specifically, letting α = (n − 1)/2 and B(·, ·) denote the beta function, this gives that x has density:
f(x) = (1 − x²)^{α−1} / (2^{2α−1} B(α, α)).
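As a quick sanity check of this classical fact: if (1 + x)/2 ~ Beta(α, α) with α = (n − 1)/2, then Var(x) = 1/(2α + 1) = 1/n, which a standard-library simulation (our illustrative n) reproduces:

```python
import math, random

random.seed(4)
n = 5
samples = []
for _ in range(20000):
    g = [random.gauss(0.0, 1.0) for _ in range(n)]   # Gaussian vector, then normalize
    r = math.sqrt(sum(t * t for t in g))
    samples.append(g[0] / r)                         # first coordinate of a unit vector

# The Beta((n-1)/2, (n-1)/2) law for (1 + x)/2 predicts Var(x) = 1/n.
emp_var = sum(x * x for x in samples) / len(samples)
```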
Plugging this density back in to the expectation term in (4) we obtain:
E[(1 − 〈v, v′〉²)^{−m/2}] = (1 / (2^{2α−1} B(α, α))) ∫_{−1}^{1} (1 − x²)^{α−m/2−1} dx.
Assume without loss of generality that n is an odd integer, and thus α = (n − 1)/2 is an integer. Let m/2 = cα for some constant c ≪ 1 such that cα is an integer and thus (1 − c)α is an integer. Then:
E[(1 − 〈v, v′〉²)^{−m/2}] = 4^{−cα} · B((1 − c)α, (1 − c)α) / B(α, α),   (5)
where the equality follows because, after pulling out the constant 4^{−cα} · B((1 − c)α, (1 − c)α) / B(α, α), the term being integrated is the density of x where (1 + x)/2 is distributed according to a beta distribution with parameters ((1 − c)α, (1 − c)α), and so integrates to 1. Since we have chosen parameters such that α is a positive integer, we have:
1/B(α, α) = (2α − 1) · (2α − 2 choose α − 1).
Similarly, 1/B((1 − c)α, (1 − c)α) = (2(1 − c)α − 1) · (2(1 − c)α − 2 choose (1 − c)α − 1). Each of the binomial coefficients in these expressions is a central binomial coefficient (i.e., proportional to a Catalan number), and we can use well known methods like Stirling’s approximation to bound them. In particular, we employ a bound of the type given in Lemma 7 of [MS77], which gives
(4^z/√(πz)) · (1 − 1/(8z)) ≤ (2z choose z) ≤ 4^z/√(πz)
for any positive integer z. Accordingly, we have
E[(1 − 〈v, v′〉²)^{−m/2}] = 4^{−cα} · (2α − 1) · (2α − 2 choose α − 1) / [(2(1 − c)α − 1) · (2(1 − c)α − 2 choose (1 − c)α − 1)].
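Both the beta-to-binomial identity and the central binomial sandwich above can be checked directly with Python's standard library (`math.gamma`, `math.comb`); this is only a numerical sanity check of the two displayed facts:

```python
import math

# For integer a, 1/B(a, a) = Gamma(2a)/Gamma(a)^2 = (2a - 1) * binom(2a - 2, a - 1).
for a in range(2, 30):
    inv_beta = math.gamma(2 * a) / math.gamma(a) ** 2
    assert math.isclose(inv_beta, (2 * a - 1) * math.comb(2 * a - 2, a - 1),
                        rel_tol=1e-9)

# Stirling-type sandwich for the central binomial coefficient binom(2z, z).
for z in range(1, 60):
    c2z = math.comb(2 * z, z)
    upper = 4 ** z / math.sqrt(math.pi * z)
    assert upper * (1 - 1 / (8 * z)) <= c2z <= upper
```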
Plugging the bounds above into (5) and requiring c ≤ .1, we have:
E[(1 − 〈v, v′〉²)^{−m/2}] ≤ [(2α − 1)/(2(1 − c)α − 1)] · √(((1 − c)α − 1)/(α − 1)) · (1 − 1/(8((1 − c)α − 1)))^{−1} ≤ 1 + 2c.
It follows that χ²(P1, P0) ≤ 2c, so D_KL(P1 ∥ P0) ≤ log(1 + 2c) ≤ 2c, and thus by Pinsker’s inequality
d_TV(P0, P1) ≤ √(D_KL(P1 ∥ P0)/2) ≤ √c.
Any test distinguishing P0 from P1 succeeds with probability at most (1 + d_TV(P0, P1))/2 ≤ (1 + √c)/2, which is less than 2/3 for c ≤ .1. Thus, no algorithm can solve Problem 3 with probability 2/3, completing the lemma. □
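To build intuition for why the divergence in (4) stays small when m ≪ n: 〈v, v′〉² concentrates around 1/n, so (1 − 〈v, v′〉²)^{−m/2} ≈ 1 + m〈v, v′〉²/2 ≈ 1 + m/(2n). A Monte-Carlo check with numpy (illustrative n and m):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, trials = 200, 10, 20000
V = rng.standard_normal((trials, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # random unit vectors v
W = rng.standard_normal((trials, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # independent unit vectors v'

ip2 = np.einsum('ij,ij->i', V, W) ** 2           # <v, v'>^2, typically about 1/n
chi2_plus_1 = float(np.mean((1.0 - ip2) ** (-m / 2)))   # Monte-Carlo estimate of (4)
```

With n = 200 and m = 10 the estimate lands near 1 + m/(2n) ≈ 1.025, far below the constant that would be needed to distinguish the two distributions.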
Footnotes
For non-PSD matrices, this generalizes to tr(A) − ε‖A‖F ≤ Hm(A) ≤ tr(A) + ε‖A‖F, which implies the relative error bound since when A is PSD, ‖A‖F ≤ tr(A).
We use our implementation of Lanczos available at https://github.com/cpmusco/fast-pcr, modified to use blocked matrix-vector multiplies when run on multiple query vectors.
Accessed from: https://snap.stanford.edu/data/ca-GrQc.html.
Accessed from: https://snap.stanford.edu/data/wiki-Vote.html.
Here Ir denotes an r × r identity matrix.
Contributor Information
Raphael A. Meyer, New York University
Cameron Musco, UMass Amherst.
Christopher Musco, New York University.
David P. Woodruff, Carnegie Mellon University
References
- [Ach03].Achlioptas Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003. Preliminary version in the 20th Symposium on Principles of Database Systems (PODS). [Google Scholar]
- [APJ+18].Adams Ryan P., Pennington Jeffrey, Johnson Matthew J., Smith Jamie, Ovadia Yaniv, Patton Brian, and Saunderson James. Estimating the spectral density of large implicit matrices. arXiv:1802.03451, 2018. [Google Scholar]
- [AT11].Avron Haim and Toledo Sivan. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2), 2011. [Google Scholar]
- [Avr10].Avron Haim. Counting triangles in large graphs using randomized matrix trace estimation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2010. [Google Scholar]
- [BBCG08].Becchetti Luca, Boldi Paolo, Castillo Carlos, and Gionis Aristides. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 16–24, 2008. [Google Scholar]
- [BDKZ15].Boutsidis Christos, Drineas Petros, Kambadur Prabhanjan, and Zouzias Anastasios. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533, 03 2015. [Google Scholar]
- [BHSW20].Braverman Mark, Hazan Elad, Simchowitz Max, and Woodworth Blake. The gradient complexity of linear regression. In Proceedings of the 33rd Annual Conference on Computational Learning Theory (COLT), volume 125, pages 627–647, 2020. [Google Scholar]
- [BKKS20].Braverman Vladimir, Krauthgamer Robert, Krishnan Aditya, and Sinoff Roi. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. [Google Scholar]
- [BM06].Batagelj Vladimir and Mrvar Andrej. Pajek datasets. http://vlado.fmf.unilj.si/pub/networks/data/, 2006.
- [CEM+15].Cohen Michael, Elder Sam, Musco Cameron, Musco Christopher, and Persu Madalina. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 163–172, 2015. [Google Scholar]
- [Che16].Chen Jie. How accurately should I compute implicit matrix-vector products when applying the Hutchinson trace estimator? SIAM Journal on Scientific Computing, 38(6):A3515–A3539, 2016. [Google Scholar]
- [CK20].Cortinovis Alice and Kressner Daniel. On randomized trace estimates for indefinite matrices with an application to determinants. arXiv:2005.10009, 2020. [Google Scholar]
- [CKSV18].Cohen-Steiner David, Kong Weihao, Sohler Christian, and Valiant Gregory. Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1263–1271, 2018. [Google Scholar]
- [CMW15].Cai Tony, Ma Zongming, and Wu Yihong. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability theory and related fields, 161:781–815, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [CR12].Chakrabarti Amit and Regev Oded. An optimal lower bound on the communication complexity of gap-hamming-distance. SIAM Journal on Computing, 41(5):1299–1317, 2012. [Google Scholar]
- [CW09].Clarkson Kenneth L. and Woodruff David P.. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 205–214, 2009. [Google Scholar]
- [DEN+17].Dong Kun, Eriksson David, Nickisch Hannes, Bindel David, and Wilson Andrew Gordon. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 6327–6337, 2017. [Google Scholar]
- [DG03].Dasgupta Sanjoy and Gupta Anupam. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003. [Google Scholar]
- [dlPGR07].de la Peña José Antonio, Gutman Ivan, and Rada Juan. Estimating the Estrada index. Linear Algebra and its Applications, 427(1):70–76, 2007. [Google Scholar]
- [DNPS16].Di Napoli Edoardo, Polizzi Eric, and Saad Yousef. Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications, 2016. [Google Scholar]
- [EH08].Estrada Ernesto and Hatano Naomichi. Communicability in complex networks. Phys. Rev. E, 77:036111, March 2008. [DOI] [PubMed] [Google Scholar]
- [EHB12].Estrada Ernesto, Hatano Naomichi, and Benzi Michele. The physics of communicability in complex networks. Physics Reports, 514(3):89–119, 2012. [DOI] [PubMed] [Google Scholar]
- [EMM20].Erdélyi Tamás, Musco Cameron, and Musco Christopher. Fourier sparse leverage scores and approximate kernel learning. Advances in Neural Information Processing Systems 33 (NeurIPS), 2020. [Google Scholar]
- [Est00].Estrada Ernesto. Characterization of 3d molecular structure. Chemical Physics Letters, 319(5–6):713–718, 2000. [Google Scholar]
- [FKN90].Fang Kaitai, Kotz Samuel, and Ng Kai Wang. Symmetric Multivariate and Related Distributions. London: Chapman and Hall, 1990. [Google Scholar]
- [Gir87].Girard Didier. Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille. Technical report, 1987. [Google Scholar]
- [GSO17].Gambhir Arjun Singh, Stathopoulos Andreas, and Orginos Kostas. Deflation as a method of variance reduction for estimating the trace of a matrix inverse. SIAM Journal on Scientific Computing, 39(2):A532–A558, 2017. [Google Scholar]
- [Hig08].Higham Nicholas J.. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008. [Google Scholar]
- [HL20].Li Hanyu and Zhu Yuanyang. Randomized block Krylov space methods for trace and log-determinant estimators. arXiv:2003.00212, 2020. [Google Scholar]
- [HMAS17].Han Insu, Malioutov Dmitry, Avron Haim, and Shin Jinwoo. Approximating the spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM Journal on Scientific Computing, 2017. [Google Scholar]
- [HMS15].Han Insu, Malioutov Dmitry, and Shin Jinwoo. Large-scale log-determinant computation through stochastic Chebyshev expansions. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 908–917, 2015. [Google Scholar]
- [Hut90].Hutchinson Michael F.. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990. [Google Scholar]
- [IW03].Indyk Piotr and Woodruff David. Tight lower bounds for the distinct elements problem. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2003. [Google Scholar]
- [JBYJHZ10].Wu Jun, Barahona Mauricio, Tan Yue-Jin, and Deng Hong-Zhong. Natural connectivity of complex networks. Chinese Physics Letters, 27(7):078902, 2010. [Google Scholar]
- [Lin17].Lin Lin. Randomized estimation of spectral densities of large matrices made accurate. Numerische Mathematik, 136(1):183–213, 2017. [Google Scholar]
- [LSTZ20].Li Jerry, Sidford Aaron, Tian Kevin, and Zhang Huishuai. Well-conditioned methods for ill-conditioned systems: Linear regression with semi-random noise. arXiv:2008.01722, 2020. [Google Scholar]
- [LSY16].Lin Lin, Saad Yousef, and Yang Chao. Approximating spectral densities of large matrices. SIAM Review, 58(1):34–65, 2016. [Google Scholar]
- [MM20].Musco Cameron and Musco Christopher. Projection-cost-preserving sketches: Proof strategies and constructions. arXiv:2004.08434, 2020. [Google Scholar]
- [MMS18].Musco Cameron, Musco Christopher, and Sidford Aaron. Stability of the Lanczos method for matrix function approximation. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1605–1624, 2018. [Google Scholar]
- [MNS+18].Musco Cameron, Netrapalli Praneeth, Sidford Aaron, Ubaru Shashanka, and Woodruff David P.. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. Proceedings of the 9th Conference on Innovations in Theoretical Computer Science (ITCS), 2018. [Google Scholar]
- [MS77].MacWilliams Florence Jessie and Sloane Neil James Alexander. The Theory of Error-Correcting Codes, volume 16. Elsevier, 1977. [Google Scholar]
- [NM13].Neteler Markus and Mitasova Helena. Open source GIS: a GRASS GIS approach, volume 689. Springer Science & Business Media, 2013. [Google Scholar]
- [PT12].Pagh Rasmus and Tsourakakis Charalampos E.. Colorful triangle counting and a mapreduce implementation. Information Processing Letters, 112(7):277–281, 2012. [Google Scholar]
- [PWBM18].Perry Amelia, Wein Alexander, Bandeira Afonso, and Moitra Ankur. Optimality and sub-optimality of PCA I: Spiked random matrix models. Annals of Statistics, 46:2416–2451, 10 2018. [Google Scholar]
- [RA15].Roosta-Khorasani Farbod and Ascher Uri M.. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212, 2015. [Google Scholar]
- [Ras04].Rasmussen Carl Edward. Gaussian Processes in Machine Learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004. [Google Scholar]
- [RV+13].Rudelson Mark, Vershynin Roman, et al. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 2013. [Google Scholar]
- [SAI17].Saibaba Arvind K., Alexanderian Alen, and Ipsen Ilse C. F.. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017. [Google Scholar]
- [Sar06].Sarlos Tamas. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006. [Google Scholar]
- [SEAR18].Simchowitz Max, El Alaoui Ahmed, and Recht Benjamin. Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1249–1259, 2018. [Google Scholar]
- [She12].Sherstov Alexander A.. The communication complexity of gap hamming distance. Theory of Computing, 8(8):197–208, 2012. [Google Scholar]
- [SLO13].Stathopoulos Andreas, Laeuchli Jesse, and Orginos Kostas. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM Journal on Scientific Computing, 35(5):S299–S322, 2013. [Google Scholar]
- [SW05].Schank Thomas and Wagner Dorothea. Finding, counting and listing all triangles in large graphs, an experimental study. In International Workshop on Experimental and Efficient Algorithms, pages 606–609. Springer, 2005. [Google Scholar]
- [SWYZ19].Sun Xiaoming, Woodruff David P., Yang Guang, and Zhang Jialin. Querying a matrix through matrix-vector products. In Proceedings of the 46th International Colloquium on Automata, Languages and Programming (ICALP), volume 132, pages 94:1–94:16, 2019. [Google Scholar]
- [TS11].Tang Jok M. and Saad Yousef. Domain-decomposition-type methods for computing the diagonal of a matrix inverse. SIAM Journal on Scientific Computing, 33(5):2823–2847, 2011. [Google Scholar]
- [Tso08].Tsourakakis Charalampos E. Fast counting of triangles in large real networks without counting: Algorithms and laws. In 2008 Eighth IEEE International Conference on Data Mining, pages 608–617, 2008. [Google Scholar]
- [UCS17].Ubaru Shashanka, Chen Jie, and Saad Yousef. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017. [Google Scholar]
- [US18].Ubaru Shashanka and Saad Yousef. Applications of trace estimation techniques. In High Performance Computing in Science and Engineering, pages 19–33, 2018. [Google Scholar]
- [Vid12].Vidick Thomas. A concentration inequality for the overlap of a vector on a large set, with application to the communication complexity of the gap-hamming-distance problem. Chicago Journal of Theoretical Computer Science, 2012. [Google Scholar]
- [Woo14].Woodruff David P.. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014. [Google Scholar]
- [WR96].Williams Christopher K. I. and Rasmussen Carl Edward. Gaussian Processes for Regression. In Advances in Neural Information Processing Systems 9 (NeurIPS), pages 514–520, 1996. [Google Scholar]
- [WSMB20].Wang Sheng, Sun Yuan, Musco Christopher, and Bao Zhifeng. Route planning for robust transit networks: When connectivity matters. Preprint, 2020. [Google Scholar]
- [WWZ14].Wimmer Karl, Wu Yi, and Zhang Peng. Optimal query complexity for estimating the trace of a matrix. In Proceedings of the 41st International Colloquium on Automata, Languages and Programming (ICALP), pages 1051–1062, 2014. [Google Scholar]