On the Downstream Performance of Compressed Word Embeddings

Avner May; Jian Zhang; Tri Dao; Christopher Ré

Author manuscript; available in PMC: 2019 Dec 28.

Published in final edited form as: Adv Neural Inf Process Syst. 2019 Dec;32:11782–11793.

PMCID: PMC6935262 NIHMSID: NIHMS1062391 PMID: 31885428

Abstract

Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging—existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to 2× lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.

1. Introduction

In recent years, word embeddings [23, 29, 24, 30, 10] have brought large improvements to a wide range of applications in natural language processing (NLP) [1, 5, 38]. However, these word embeddings can occupy a large amount of memory, making it expensive to deploy them in data centers, and impractical to use them in memory-constrained environments like smartphones. To reduce and amortize these costs, embeddings can be compressed [e.g., 34] and shared across many downstream tasks [7]. Recently, there have been numerous successful methods proposed for compressing embeddings; these methods take a variety of approaches, ranging from dictionary learning using neural networks [34, 6] to simpler compression using k-means clustering [2].

The goal of this work is to gain a deeper understanding of what makes compressed embeddings perform well on downstream tasks. Practically, this understanding could allow for evaluating the quality of a compressed embedding without having to train a model for each task of interest. Our work is motivated by two surprising empirical observations: First, we find that existing ways [41, 3, 42] of measuring the quality of compressed embeddings do not effectively explain the relative downstream performance of different compressed embeddings—for example, failing to discriminate between embeddings that perform well and those that do not. Second, we observe that a simple uniform quantization method can match or outperform the state-of-the-art deep compositional code learning method [34] and the k-means compression method [2] in terms of downstream performance. These observations suggest that there is currently an incomplete understanding of what makes a compressed embedding perform well on downstream tasks. One way to narrow this gap in our understanding is to find a measure of compression quality that (i) is directly related to generalization performance, and (ii) can be used to analyze the performance of uniformly quantized embeddings.

Here we introduce the eigenspace overlap score as a new measure of compression quality, and show that it satisfies the above two desired properties. This score measures the degree of overlap between the subspaces spanned by the eigenvectors of the Gram matrices of the compressed and uncompressed embedding matrices. Our theoretical contributions are two-fold, addressing the surprising observations and desired properties discussed above: First, we prove a generalization bound for the compressed embeddings in terms of the eigenspace overlap score in the context of linear and logistic regression, revealing a direct connection between this score and downstream performance. Second, we prove that in expectation uniformly quantized embeddings attain a high eigenspace overlap score with the uncompressed embeddings at relatively high compression rates, helping to explain their strong performance. Inspired by these theoretical connections between the eigenspace overlap score and generalization performance, we propose using this score as a selection criterion for efficiently picking among a set of compressed embeddings, without having to train a model for each task of interest using each embedding.

We empirically validate our theoretical contributions and the efficacy of our proposed selection criterion by showing three main experimental results: First, we show the eigenspace overlap score is more predictive of downstream performance than existing measures of compression quality [41, 3, 42]. Second, we show uniform quantization consistently matches or outperforms all the compression methods to which we compare [2, 34, 16], in terms of both the eigenspace overlap score and downstream performance. Third, we show the eigenspace overlap score is a more accurate criterion for choosing between compressed embeddings than existing measures; specifically, we show that when choosing between embeddings drawn from a representative set we compressed [2, 34, 12, 16], the eigenspace overlap score is able to identify the one that attains better downstream performance with up to 2× lower selection error rates than the next best measure of compression quality. We consider several baseline measures of compression quality: the Pairwise Inner Product (PIP) loss [41], and two spectral measures of approximation error between the embedding Gram matrices [3, 42]. Our results are consistent across a range of NLP tasks [33, 19, 38], embedding types [29, 24, 10], and compression methods [2, 34, 12].

The rest of this paper is organized as follows. In Section 2 we review background on word embedding compression methods and existing measures of compression quality, and present the two surprising empirical observations that motivate our work. In Section 3 we present the eigenspace overlap score along with our corresponding theoretical contributions, and propose to use the eigenspace overlap score as a selection criterion. In Section 4, we show the results from our extensive experiments validating the practical significance of our theoretical contributions, and the efficacy of our proposed selection criterion. We present related work in Section 5, and conclude in Section 6.

2. Background and Motivation

We first review different compression methods in Section 2.1 and existing ways to measure the quality of a compressed embedding relative to the uncompressed embedding in Section 2.2. We then show in Section 2.3 that existing measures of compression quality do not satisfactorily explain the relative downstream performance of existing compression methods; this motivates our work to better understand the downstream performance of compressed embeddings.

2.1. Embedding Compression Methods

We now discuss a number of compression methods for word embeddings. For the purposes of this paper, the goal of an embedding compression method $C(\cdot)$ is to take as input an uncompressed embedding $X \in \mathbb{R}^{n \times d}$ and produce as output a compressed embedding $\tilde{X} := C(X) \in \mathbb{R}^{n \times k}$ which uses less memory than $X$, but attains similar performance to $X$ when used in downstream models. Here, $n$ denotes the vocabulary size, and $d$ and $k$ denote the uncompressed and compressed dimensions, respectively.

Deep Compositional Code Learning (DCCL)

The DCCL method [34] uses a dictionary learning approach to represent a large number of word vectors using a much smaller number of basis vectors. These basis vectors are organized into multiple dictionaries, and each word is represented as a sum which includes one basis vector from each dictionary. The dictionaries are trained using an autoencoder-style architecture to minimize the embedding matrix reconstruction error. A similar approach was independently proposed by Chen et al. [6].

K-means Compression

The k-means algorithm can be used to compress word embeddings by first clustering all the scalar entries in the word embedding matrix, and then replacing each scalar with the closest centroid [2]. Using 2^b centroids allows for storing each matrix entry using only b bits.
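As a rough illustration of the scalar clustering step described above, the following sketch runs a one-dimensional Lloyd's algorithm over all matrix entries. The function name `kmeans_compress`, the quantile-based initialization, and the iteration count are assumptions made for this example and are not taken from [2].

```python
import numpy as np

def kmeans_compress(X, bits, iters=20):
    """Cluster all scalar entries of X into 2**bits centroids (1-D Lloyd's algorithm).

    Returns the per-entry centroid indices (b-bit codes, bits <= 8 assumed)
    and the centroid table.
    """
    values = X.ravel()
    # Assumption: initialize centroids at evenly spaced quantiles of the entries.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, 2 ** bits))
    for _ in range(iters):
        # Assignment step: index of the closest centroid for every scalar.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(centroids.size):
            members = values[codes == c]
            if members.size:
                centroids[c] = members.mean()
    # Final assignment against the updated centroids.
    codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return codes.reshape(X.shape).astype(np.uint8), centroids
```

The compressed embedding is recovered as `centroids[codes]`; only the `b`-bit codes and the small centroid table need to be stored. The dense distance matrix used here is only practical for small demonstrations.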

Uniform Quantization

To compress real numbers, uniform quantization divides an interval into sub-intervals of equal size, and then deterministically or stochastically rounds the numbers in each sub-interval to one of the boundaries [12, 14]. To apply uniform quantization to embedding compression, we propose to first determine the optimal threshold at which to clip the extreme values in the word embedding matrix, and then uniformly quantize the clipped embeddings within the clipped interval. For more details about uniform quantization and how we use it to compress embeddings, see Appendices A.1 and D.3 respectively.
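A minimal sketch of the clip-then-quantize procedure follows. It assumes the clipping threshold `clip` has already been chosen (the selection of the optimal threshold is described in Appendix D.3 and is not reproduced here); the function name and signature are illustrative only.

```python
import numpy as np

def uniform_quantize(X, bits, clip, stochastic=True, seed=0):
    """Clip X to [-clip, clip] and round each entry onto 2**bits evenly spaced values."""
    rng = np.random.default_rng(seed)
    levels = 2 ** bits - 1                  # number of equal sub-intervals
    step = 2.0 * clip / levels              # width of one sub-interval
    scaled = (np.clip(X, -clip, clip) + clip) / step   # position in [0, levels]
    lower = np.floor(scaled)
    if stochastic:
        # Round up with probability equal to the fractional part, so that the
        # quantized value equals the clipped value in expectation (unbiased).
        codes = lower + (rng.random(X.shape) < (scaled - lower))
    else:
        # Deterministic rounding to the nearest boundary.
        codes = np.rint(scaled)
    codes = np.clip(codes, 0, levels)       # guard against floating-point overshoot
    return codes * step - clip
```

In this sketch each entry can be stored as an integer code in `b` bits; the function returns the dequantized values for convenience.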

Dimensionality Reduction

Another simple baseline for compressing embeddings is dimensionality reduction. Specifically, one can train an embedding with a lower dimension, or use a method like principal component analysis (PCA) to reduce the dimension of an existing embedding.
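The PCA variant of this baseline can be sketched in a few lines. Whether the embedding is mean-centered before projection is an assumption of this example, as is the function name `pca_compress`.

```python
import numpy as np

def pca_compress(X, k):
    """Project the (mean-centered) n x d embedding onto its top-k principal directions."""
    Xc = X - X.mean(axis=0, keepdims=True)   # assumption: center each coordinate
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # n x k compressed embedding
```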

2.2. Measures of Compression Quality

We review ways of measuring the compression quality of a compressed embedding relative to the uncompressed embedding. For our purposes, an ideal measure of compression quality would consider the compressed and uncompressed embeddings to be similar when they are likely to perform similarly on downstream tasks, and different when this is unlikely. Such a measure would shed light on what determines the downstream performance of a compressed embedding, and give us a way of measuring the quality of a compressed embedding without having to train a downstream model for each task.

Several of the measures discussed below are based on comparing the pairwise inner product (Gram) matrices of the compressed and uncompressed embeddings. The Gram matrices of embeddings are natural to consider for two reasons: First, the loss function for training word embeddings typically only considers dot-products between embedding vectors [23, 29]. Second, one can view word embedding training as implicit matrix factorization [21], and thus comparing the Gram matrices of two embedding matrices is similar to comparing the matrices these embeddings are implicitly factoring. We now review the existing ways of measuring compression quality.

Word Embedding Reconstruction Error

The first and simplest way of comparing two embeddings $X$ and $\tilde{X}$ is to measure the reconstruction error $\| X - \tilde{X} \|_F$. Note that in order to be able to use this measure of quality, $X$ and $\tilde{X}$ must have the same dimension.

Pairwise Inner Product (PIP) Loss

Given $X X^T$ and $\tilde{X} \tilde{X}^T$, the Gram matrices of the uncompressed and compressed embeddings, the Pairwise Inner Product (PIP) loss [41] is defined as $\| X X^T - \tilde{X} \tilde{X}^T \|_F$. This measure of quality was recently proposed to explain the existence of an optimal dimension for word embeddings, in terms of a bias-variance trade-off for the PIP loss.
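Forming the two $n \times n$ Gram matrices explicitly is expensive for large vocabularies. The sketch below computes the same quantity from small $d \times d$, $k \times k$, and $d \times k$ products, using the identity $\| X X^T - \tilde{X} \tilde{X}^T \|_F^2 = \| X^T X \|_F^2 + \| \tilde{X}^T \tilde{X} \|_F^2 - 2 \| X^T \tilde{X} \|_F^2$. The function name is illustrative, and this is one possible way to compute the loss rather than the implementation used in [41].

```python
import numpy as np

def pip_loss(X, X_tilde):
    """PIP loss ||X X^T - X~ X~^T||_F without materializing n x n Gram matrices."""
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    sq = (np.linalg.norm(X.T @ X) ** 2
          + np.linalg.norm(X_tilde.T @ X_tilde) ** 2
          - 2.0 * np.linalg.norm(X.T @ X_tilde) ** 2)
    # Clamp tiny negative values caused by floating-point cancellation.
    return float(np.sqrt(max(sq, 0.0)))
```

Unlike the reconstruction error, this quantity is defined even when the two embeddings have different dimensions.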

Spectral Approximation Error

A symmetric matrix $A$ is defined [42] to be a $(\Delta_1, \Delta_2)$-spectral approximation of another symmetric matrix $B$ if it satisfies $(1 - \Delta_1) B \preceq A \preceq (1 + \Delta_2) B$ (in the semidefinite order). Zhang et al. [42] show that if $\tilde{X} \tilde{X}^T + \lambda I$ is a $(\Delta_1, \Delta_2)$-spectral approximation of $X X^T + \lambda I$ for small values of $\Delta_1$ and $\Delta_2$, then the linear model trained using $\tilde{X}$ and regularization parameter $\lambda$ will attain similar generalization performance to the model trained using $X$. Avron et al. [3] use a single scalar $\Delta$ in place of $\Delta_1$ and $\Delta_2$, and use this scalar as a measure of approximation error, while Zhang et al. [42] consider $\Delta_1$ and $\Delta_2$ independently, and use the quantity $\Delta_{\max} := \max(\frac{1}{1 - \Delta_1}, \Delta_2)$.
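For small vocabularies, the tightest values of $\Delta_1$ and $\Delta_2$ can be read off the extreme eigenvalues of $B^{-1/2} A B^{-1/2}$, since the semidefinite inequalities hold exactly when those eigenvalues lie in $[1 - \Delta_1, 1 + \Delta_2]$. The sketch below is an illustration under that reading: it forms dense $n \times n$ matrices, assumes $\lambda > 0$, clamps both values at zero, and the function name is hypothetical.

```python
import numpy as np

def spectral_deltas(X, X_tilde, lam):
    """Tightest (Delta_1, Delta_2) with (1-D1) B <= A <= (1+D2) B, plus Delta_max."""
    n = X.shape[0]
    A = X_tilde @ X_tilde.T + lam * np.eye(n)
    B = X @ X.T + lam * np.eye(n)
    L = np.linalg.cholesky(B)                 # B = L L^T (positive definite for lam > 0)
    Z = np.linalg.solve(L, A)                 # L^{-1} A
    M = np.linalg.solve(L, Z.T)               # L^{-1} A L^{-T}, same spectrum as B^{-1} A
    eig = np.linalg.eigvalsh((M + M.T) / 2)   # symmetrize against round-off
    delta_1 = max(0.0, 1.0 - eig.min())
    delta_2 = max(0.0, eig.max() - 1.0)
    delta_max = max(1.0 / (1.0 - delta_1), delta_2)
    return delta_1, delta_2, delta_max
```

Under the same reading, the smallest single scalar satisfying both inequalities would be `max(delta_1, delta_2)`.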

2.3. Motivation: Two Surprising Empirical Observations

We now present two surprising empirical observations which illustrate the need to better understand the downstream performance of models trained using compressed embeddings. In these experiments we compare the downstream performance of the methods introduced in Section 2.1, and attempt to use the measures of compression quality from Section 2.2 to explain the relative performance of these compression methods. Our observations reveal that explaining the downstream performance of compressed embeddings is challenging. We now provide an overview of these two observations; for a more thorough presentation of these results, please see Section 4.

First, we observe that the downstream performance of embeddings compressed using the various methods from Section 2.1 cannot be satisfactorily explained in terms of any of the existing measures of compression quality described in Section 2.2. For example, in Figure 1 (left) we see that on GloVe embeddings [29], the uniform quantization method with compression rate 32× can have over 1.3× higher PIP loss than dimensionality reduction with compression rate 6×, while attaining better downstream performance by over 2.5 F1 points on the Stanford Question Answering Dataset (SQuAD) [33]. Furthermore, the PIP loss and the two spectral measures of approximation error Δ and Δ_max only achieve Spearman correlations (absolute value) of 0.49, 0.46, and 0.62 with the question answering test F1 score, respectively (Table 1). These results show that existing measures of compression quality correlate poorly with downstream performance.

Our second observation is that the simple uniform quantization method matches or outperforms the more complex DCCL and k-means compression methods across a number of tasks, embedding types, and compression ratios. For example, we see in Figure 1 (right) that with a compression ratio of 32×, uniform quantization attains an average F1 score 0.43% absolute below the uncompressed GloVe embeddings on the Stanford Question Answering Dataset [33], while the DCCL method [34] is 0.47% below.

Figure 1: Left: Existing measures of compression quality do not satisfactorily explain the relative downstream performance of different compression methods. For example, compressed embeddings with higher PIP loss can perform better on question answering than those with lower PIP loss. Right: Across different compression rates, a simple uniform quantization method can compete with more complex methods such as DCCL and k-means.

Table 1: Spearman correlation between measures of compression quality and downstream performance.

For each measure of compression quality, we show the absolute value of its Spearman correlation with downstream performance, on the SQuAD (question answering), SST-1 (sentiment analysis), MNLI (natural language inference), and QQP (question pair matching) tasks. We see that the eigenspace overlap score $E$ attains stronger correlation than the other measures.

Measure	SQuAD (GloVe)	SQuAD (fastText)	SST-1 (GloVe)	SST-1 (fastText)	MNLI (BERT WordPiece)	QQP (BERT WordPiece)
PIP loss	0.49	0.34	0.46	0.25	0.45	0.45
Δ	0.46	0.31	0.33	0.29	0.44	0.36
Δ_max	0.62	0.72	0.51	0.60	0.86	0.86
1 − $E$	0.81	0.91	0.75	0.73	0.92	0.93

These two observations suggest the need to better understand the downstream performance of compressed embeddings. Toward this end, we focus on finding a measure of compression quality with the properties that (i) we can directly relate it to generalization performance, and (ii) we can use it to analyze the performance of uniformly quantized embeddings.

3. A New Measure of Compression Quality

To better understand what properties of compressed embeddings determine their downstream performance, and to help explain the surprising empirical observations above, we introduce the eigenspace overlap score, and show that it satisfies the two desired properties described above. In Section 3.1, we present generalization bounds for compressed embeddings in the context of linear and logistic regression, in terms of the eigenspace overlap score between the compressed and uncompressed embeddings. In Section 3.2 we show that uniformly quantized embeddings in expectation attain high eigenspace overlap scores, helping to explain their strong downstream performance. Based on the connection between the eigenspace overlap score and downstream performance, in Section 3.3 we propose using this score as a way of efficiently selecting among different compressed embeddings.

3.1. The Eigenspace Overlap Score and Generalization Performance

We begin by defining the eigenspace overlap score, which measures how well a compressed embedding approximates an uncompressed embedding. We then present our theoretical result relating the generalization performance of compressed embeddings to their eigenspace overlap scores.

3.1.1. The Eigenspace Overlap Score

We now define the eigenspace overlap score, and discuss the intuition behind this definition.

Definition 1.

Given two embedding matrices $X \in \mathbb{R}^{n \times d}$ and $\tilde{X} \in \mathbb{R}^{n \times k}$, whose Gram matrices have eigendecompositions $X X^T = U \Lambda U^T$ and $\tilde{X} \tilde{X}^T = \tilde{U} \tilde{\Lambda} \tilde{U}^T$ for $U \in \mathbb{R}^{n \times d}$ and $\tilde{U} \in \mathbb{R}^{n \times k}$, we define the eigenspace overlap score $E(X, \tilde{X}) := \frac{1}{\max(d, k)} \| U^T \tilde{U} \|_F^2$.

This score measures the degree to which the span of the eigenvectors with nonzero eigenvalue of $\tilde{X} \tilde{X}^T$ agrees with that of $X X^T$. In particular, assuming $k \leq d$, it measures the ratio between the squared Frobenius norm of $U$ before and after being projected onto $\tilde{U}$. Computing this score takes time $O(n \max(d, k)^2)$, as it requires computing the singular value decompositions (SVDs) of $X$ and $\tilde{X}$. As is clear from the definition, the eigenspace overlap score only depends on the left singular vectors of the two embedding matrices. To better understand why this is a desirable property, consider two embedding matrices $X$ and $\tilde{X}$ with the same left singular vectors. It follows that the output of any linear model over $X$ can be exactly matched by the output of a linear model over $\tilde{X}$: if we consider the SVDs $X = U S V^T$ and $\tilde{X} := U \tilde{S} \tilde{V}^T$, then for any parameter vector $w \in \mathbb{R}^d$ over $X$, the vector $\tilde{w} := \tilde{V} \tilde{S}^{-1} S V^T w$ gives $X w = \tilde{X} \tilde{w}$. This observation shows how central the left singular vectors of an embedding matrix are to the set of models which use this matrix, and thus why it is reasonable for the eigenspace overlap score to only consider the left singular vectors. In Appendix B.3 we discuss this score’s robustness to perturbations, while in Appendix B.4 we discuss the connection between this score and a variant of embedding reconstruction error.
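Definition 1 translates almost directly into code. The following sketch obtains the eigenvectors of each Gram matrix as the left singular vectors of the corresponding embedding matrix; the function names and the relative tolerance used to discard zero singular values are assumptions of this example.

```python
import numpy as np

def left_singular_basis(X, tol=1e-10):
    """Left singular vectors of X whose singular values are (numerically) nonzero."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, S > tol * S.max()]

def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap score: ||U^T U~||_F^2 / max(d, k)."""
    U = left_singular_basis(X)
    U_tilde = left_singular_basis(X_tilde)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return float(np.linalg.norm(U.T @ U_tilde) ** 2
                 / max(X.shape[1], X_tilde.shape[1]))
```

Two sanity checks follow from the definition: the score is 1 when the two matrices have the same column space (for example, $\tilde{X} = X R$ for an invertible $R$), and it is $k / d$ when $\tilde{X}$ consists of $k$ linearly independent columns of a full-rank $X$.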

3.1.2. Generalization Bound

We now present our theoretical result bounding the difference in generalization performance between models trained on compressed vs. uncompressed embeddings, in terms of the eigenspace overlap score. For this bound, we consider an average-case analysis in the context of fixed design linear regression; note that we consider the fixed design setting for ease of analysis, as it has a closed-form expression for generalization performance. Before presenting our bound in Theorem 1, we will briefly review fixed design linear regression, and discuss the average-case setting we consider.

In fixed design linear regression, we observe a set of labeled points $\{(x_i, y_i)\}_{i=1}^n$, where the observed labels $y_i = \bar{y}_i + \epsilon_i \in \mathbb{R}$ are perturbed from the true labels $\bar{y}_i$ with independent zero-mean noise $\epsilon_i$ (variance $\sigma^2$). If we let $x_i \in \mathbb{R}^d$ denote the $i$-th row of the matrix $X = U S V^T$, and let $y$ and $\bar{y}$ in $\mathbb{R}^n$ denote the perturbed and true label vectors, it is easy to show that the expected error of the optimal linear regressor¹ $f_{X, \epsilon}$ trained on this perturbed dataset is equal to $R_{\bar{y}}(X) := \mathbb{E}_{\epsilon} [\frac{1}{n} \sum_{i=1}^n (f_{X, \epsilon}(x_i) - \bar{y}_i)^2] = \frac{1}{n} (\| \bar{y} \|^2 - \| U^T \bar{y} \|^2 + d \sigma^2)$. For the derivation, see Appendix A.2.
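The closed form above can be recovered in a few lines. The sketch below assumes that the optimal linear regressor is the ordinary least-squares fit, whose fitted values are the projection $U U^T y$ of the observed labels onto the column space of $X$, and that $X$ has full column rank $d$; both are assumptions made here for illustration, and the formal derivation is the one in Appendix A.2.

$$
\begin{aligned}
n \, R_{\bar{y}}(X) &= \mathbb{E}_{\epsilon} \| U U^T (\bar{y} + \epsilon) - \bar{y} \|^2 \\
&= \| (I - U U^T) \bar{y} \|^2 + \mathbb{E}_{\epsilon} \| U U^T \epsilon \|^2 \\
&= \| \bar{y} \|^2 - \| U^T \bar{y} \|^2 + \sigma^2 \operatorname{tr}(U U^T) \\
&= \| \bar{y} \|^2 - \| U^T \bar{y} \|^2 + d \sigma^2 .
\end{aligned}
$$

The cross term vanishes because the noise has zero mean, and $\operatorname{tr}(U U^T) = d$ because $U$ has $d$ orthonormal columns.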

We consider average-case bounds for two reasons: First, in the setting where one would like to use the same compressed embedding across many tasks (i.e., different label vectors $\bar{y}$), an average-case bound describes the average performance across these tasks. Second, for both empirical and theoretical reasons we argue worst-case bounds are too loose to adequately explain our empirical observations. Empirically, we observe that compressed embeddings with large values of Δ₁ and Δ₂ can attain strong generalization performance (Appendix E.6), even though these values imply large worst-case bounds on the generalization error [42]. From a theoretical perspective, worst-case bounds must account for all possible label vectors, including those chosen adversarially. For example, if there exists a single direction in $\operatorname{span}(U)$ orthogonal to $\operatorname{span}(\tilde{U})$ (which always occurs when $\dim(\tilde{U}) < \dim(U)$), the label vector $\bar{y}$ can simply be equal to this direction, resulting in large generalization error for $\tilde{X}$ and small generalization error for $X$. Thus, we consider an average-case analysis in which we assume $\bar{y}$ is a random label vector in $\operatorname{span}(U)$. We consider this setting because we are most interested in the situation where we know the uncompressed embedding matrix $X$ performs well (in this case, $R_{\bar{y}}(X) = d \sigma^2 / n$), and we would like to understand how well $\tilde{X}$ can do.² We now present our bound.

Theorem 1.

$$\mathbb{E}_{\bar{y}} [R_{\bar{y}}(\tilde{X}) - R_{\bar{y}}(X)] = \frac{d}{n} \cdot (1 - E(X, \tilde{X})) - \frac{d - k}{n} \sigma^2. \tag{1}$$

See Appendix B for the proof, where we consider the more general setting in which $z$, the coefficient vector of $\bar{y}$ in the basis $U$ (i.e., $\bar{y} = U z$), has zero mean and arbitrary covariance. This theorem reveals that a larger eigenspace overlap score results in better expected generalization performance for the compressed embedding. Note that if we focus on the low-dimensional and low-noise setting, where $d \ll n$ and $\sigma^2 < \frac{d}{n} = \frac{1}{n} \mathbb{E} [\sum_{i=1}^n \bar{y}_i^2]$, we can effectively ignore the term $\frac{d - k}{n} \sigma^2 = O(d^2 / n^2)$, and the generalization performance is determined by the eigenspace overlap score.
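To see where the two terms in Theorem 1 come from, the following sketch works through the identity-covariance case. It assumes $\bar{y} = U z$ with $\mathbb{E}[z] = 0$ and $\mathbb{E}[z z^T] = I_d$, that $k \leq d$, and that $\tilde{U} \in \mathbb{R}^{n \times k}$ has orthonormal columns; these assumptions are stated here for illustration, and the general proof is the one in Appendix B. Applying the closed-form expression for $R_{\bar{y}}$ to both $X$ and $\tilde{X}$:

$$
\begin{aligned}
\mathbb{E}_{\bar{y}} [R_{\bar{y}}(\tilde{X}) - R_{\bar{y}}(X)]
&= \frac{1}{n} \left( \mathbb{E} \| U^T \bar{y} \|^2 - \mathbb{E} \| \tilde{U}^T \bar{y} \|^2 + (k - d) \sigma^2 \right) \\
&= \frac{1}{n} \left( \mathbb{E} \| z \|^2 - \mathbb{E} \| \tilde{U}^T U z \|^2 \right) - \frac{d - k}{n} \sigma^2 \\
&= \frac{1}{n} \left( d - \| U^T \tilde{U} \|_F^2 \right) - \frac{d - k}{n} \sigma^2 \\
&= \frac{d}{n} \left( 1 - E(X, \tilde{X}) \right) - \frac{d - k}{n} \sigma^2 .
\end{aligned}
$$

The third line uses $\mathbb{E} \| M z \|^2 = \operatorname{tr}(M \, \mathbb{E}[z z^T] M^T) = \| M \|_F^2$ with $M = \tilde{U}^T U$, and the last line uses $\max(d, k) = d$.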

Although here we analyze regression, in Appendix B we show the eigenspace overlap score also plays an important role in generalization bounds for Lipschitz-continuous loss functions (e.g., logistic regression).

3.2. The Eigenspace Overlap Score and Uniform Quantization

To help explain the strong downstream performance of uniformly quantized embeddings, in this section we present a lower bound on the expected eigenspace overlap score for uniformly quantized embeddings. Combining this result with Theorem 1 directly provides a guarantee on the performance of the uniformly quantized embeddings. This result further demonstrates how the eigenspace overlap score can be used to better understand the performance of compressed embeddings.

To prove this bound on the eigenspace overlap score, we use the Davis-Kahan $\sin (Θ)$ theorem from matrix perturbation analysis [8]. Note that we assume unbiased stochastic rounding is used for the uniform quantization (see [14] or Appendix A.1). We now present the result (proof in Appendix C):

Theorem 2.

Let $X \in \mathbb{R}^{n \times d}$ be a bounded embedding matrix with $X_{ij} \in [-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}]$³ and smallest singular value $\sigma_{\min} = a \sqrt{n / d}$, for $a \in (0, 1]$.⁴ Let $\tilde{X}$ be an unbiased stochastic uniform quantization of $X$, where $b$ bits are used per entry. Then for $n \geq \max(33, d)$, we can lower bound the expected eigenspace overlap score of $\tilde{X}$, over the randomness of the stochastic quantization, as follows:

$$\mathbb{E} [1 - E(X, \tilde{X})] \leq \frac{20}{(2^b - 1)^2 a^4} .$$

A consequence of this theorem is that with only a logarithmic number of bits $b \geq \log_2 (\frac{\sqrt{20}}{a^2 \sqrt{\epsilon}} + 1)$, uniform quantization can attain an expected eigenspace overlap score of at least $1 - \epsilon$. This helps explain the strong downstream performance of uniform quantization at high compression rates.
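As a small worked example of this consequence, the sketch below evaluates the bound of Theorem 2 and the smallest integer number of bits that guarantees a target gap $\epsilon$. The function names are illustrative, and the values of `a` and `eps` in the usage lines are arbitrary choices, not quantities measured in our experiments.

```python
import math

def overlap_gap_bound(bits, a):
    """Upper bound from Theorem 2 on the expected value of 1 - E(X, X~)."""
    return 20.0 / ((2 ** bits - 1) ** 2 * a ** 4)

def bits_needed(a, eps):
    """Smallest integer b with b >= log2(sqrt(20) / (a^2 sqrt(eps)) + 1)."""
    return math.ceil(math.log2(math.sqrt(20.0) / (a ** 2 * math.sqrt(eps)) + 1.0))

# Example with a = 1 and a target gap of eps = 0.1.
b = bits_needed(a=1.0, eps=0.1)          # 4 bits
gap_at_b = overlap_gap_bound(b, a=1.0)   # 20 / 225, below 0.1
gap_below = overlap_gap_bound(b - 1, a=1.0)  # 20 / 49, above 0.1
```

Because the bound decays as $(2^b - 1)^{-2}$, each additional bit reduces the guaranteed gap by roughly a factor of four.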

In Appendix C.2 we empirically validate that the scaling of the eigenspace overlap score with respect to the quantities in Theorem 2 matches the theory; we show $1 - E (X, \tilde{X})$ drops as the precision b and the scalar a are increased, and is relatively unaffected by changes to the vocabulary size n and dimension d.

3.3. The Eigenspace Overlap Score as a Selection Criterion

Due to the theoretical connections between generalization performance and the eigenspace overlap score, we propose using the eigenspace overlap score as a selection criterion between different compressed embeddings. Specifically, the algorithm we propose takes as input an uncompressed embedding along with two or more compressed versions of this embedding, and returns the compressed embedding with the highest eigenspace overlap score to the uncompressed embedding. Ideally, a selection criterion should be both accurate and robust. For each downstream task, we consider accuracy as the fraction of cases where a criterion selects the best-performing embedding on the task. We quantify the robustness as the maximum observed performance difference between the selected embedding and the one which performs the best on a downstream task. In Section 4.3, we show across extensive experiments that the eigenspace overlap score is a more accurate and robust criterion than existing measures of compression quality.
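The selection procedure described above can be sketched as follows. The function names, the dictionary-based interface, and the tolerance for discarding zero singular values are assumptions of this example.

```python
import numpy as np

def eigenspace_overlap(X, X_tilde, tol=1e-10):
    """Eigenspace overlap score: ||U^T U~||_F^2 / max(d, k)."""
    def basis(M):
        # Left singular vectors with numerically nonzero singular values.
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, S > tol * S.max()]
    overlap = np.linalg.norm(basis(X).T @ basis(X_tilde)) ** 2
    return float(overlap / max(X.shape[1], X_tilde.shape[1]))

def select_embedding(X, candidates):
    """Return the name of the candidate with the highest score, and all scores.

    `candidates` maps a (hypothetical) name to a compressed n x k embedding.
    """
    scores = {name: eigenspace_overlap(X, X_tilde)
              for name, X_tilde in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

No downstream model is trained at any point: the cost is one SVD per embedding, independent of the number of tasks of interest.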

4. Experiments

We empirically validate our theory relating the eigenspace overlap score with generalization performance, our analysis on the strong performance of uniform quantization, and the efficacy of the eigenspace overlap score as an embedding selection criterion. We first demonstrate that this score correlates better with downstream performance than existing measures of compression quality in Section 4.1. We then demonstrate in Section 4.2 that uniform quantization consistently matches or outperforms the compression methods to which we compare, both in terms of the eigenspace overlap score and downstream performance. In Section 4.3, we show that the eigenspace overlap score is a more accurate and robust selection criterion than other measures of compression quality.

Experiment setup

We evaluate compressed versions of publicly available 300-dimensional fastText and GloVe embeddings on question answering and sentiment analysis tasks, and compressed 768-dimensional WordPiece embeddings from the pre-trained case-sensitive BERT_BASE model [10] on tasks from the General Language Understanding Evaluation (GLUE) benchmark [38]. We use the four embedding compression methods discussed in Section 2: DCCL, k-means, uniform quantization, and dimensionality reduction.⁵ For the tasks, we consider question answering using the DrQA model [5] on the Stanford Question Answering Dataset (SQuAD) [33], sentiment analysis using a CNN model [19] on all the datasets used by Kim [19], and language understanding using the BERT_BASE model on the tasks in the GLUE benchmark [38]. We present results on the SQuAD dataset, the largest sentiment analysis dataset (SST-1 [35]) and the two largest GLUE tasks (MNLI and QQP) in this section, and include the results on the other sentiment analysis and GLUE tasks in Appendix E. Across embedding types and tasks, we first compress the pre-trained embeddings, and then train the non-embedding model parameters in the standard manner for each task, keeping the embeddings fixed throughout training. For the GLUE tasks, we add a linear layer on top of the final layer of the pre-trained BERT model (as is standard), and then fine-tune the non-embedding model parameters.⁶ For more details on the various embeddings, tasks, and hyperparameters we use, see Appendix D.

4.1. The Eigenspace Overlap Score and Downstream Performance

To empirically validate the theoretical connection between the eigenspace overlap score and downstream performance, we show the eigenspace overlap score correlates better with downstream performance than the existing measures of compression quality discussed in Section 2. Thus, even though our analysis is for linear and logistic regression, we see the eigenspace overlap score also has strong empirical correlation with downstream performance on tasks using neural network models.

In Figure 2 we present results for question answering (SQuAD) performance for compressed fastText embeddings, and natural language inference (MNLI) performance for compressed BERT WordPiece embeddings, as a function of the various measures of compression quality. In each plot, for each combination of compression rate and compression method, we plot the average compression quality measure (x-axis) and the average downstream performance (y-axis) across the five random seeds used. If the ranking based on the measure of compression quality were identical to the ranking based on downstream performance, we would see a monotonically decreasing sequence of points. As we can see from the rightmost plots in Figure 2, the downstream performance decreases smoothly as the eigenspace overlap score decreases; the downstream performance does not align as well with the other measures of compression quality.

To quantify how well the ranking based on the quality measures matches the ranking based on downstream performance, we compute the Spearman correlation ρ between these quantities. In Table 1 we can see that the eigenspace overlap score gets consistently higher correlation values with downstream performance than the other measures of compression quality. Note that Δ_max also attains relatively high correlation values, though the eigenspace overlap score still outperforms Δ_max by 0.06 to 0.24 on the tasks in Table 1. See Appendix E.5 for similar results on other tasks.
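For reference, the Spearman correlation is the Pearson correlation between the ranks of the two quantities. The sketch below computes it with average ranks for ties; the function names are illustrative, and the numbers in the usage lines are made up for the example rather than taken from our experiments.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks of the entries, with tied entries sharing their average rank."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        tied = v == val
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical values: 1 - E for four compressed embeddings and their F1 scores.
one_minus_overlap = [0.1, 0.2, 0.4, 0.8]
f1_scores = [80.0, 78.0, 75.0, 60.0]
rho = abs(spearman(one_minus_overlap, f1_scores))  # perfectly monotone ranking
```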

4.2. Downstream Performance of Uniform Quantization

We show across tasks and compression rates that uniform quantization consistently matches or outperforms the other compression methods, in terms of both the eigenspace overlap score and downstream performance. These empirical results validate our analysis from Section 3.2 showing that uniformly quantized embeddings in expectation attain high eigenspace overlap scores, and are thus likely to attain strong downstream performance. In Figure 3, we plot the eigenspace overlap score and downstream performance averaged over five random seeds for different compression methods and compression rates; we visualize the standard deviation with error bars. In particular, we show the results for the fastText embeddings on question answering (SQuAD, left) and BERT WordPiece embeddings on natural language inference (MNLI, right). Our primary conclusion is that the simple uniform quantization method consistently performs similarly to or better than the other compression methods, both in terms of the eigenspace overlap score and downstream performance.⁷ Given the connections between downstream performance and the eigenspace overlap score, the high eigenspace overlap scores attained by uniform quantization help explain its strong downstream performance. For results with the same trend on the GLUE and sentiment tasks, see Appendices E.1, E.4.⁸

Figure 3: Uniform quantization can match or outperform the more complex k-means and DCCL compression methods, both in terms of the eigenspace overlap score $E$ and downstream performance. We present results for fastText embeddings on question answering (SQuAD, left) and for BERT WordPiece embeddings on natural language inference (MNLI, right).

4.3. Compressed Embedding Selection with the Eigenspace Overlap Score

We now show that the eigenspace overlap score is a more accurate and robust selection criterion for compressed embeddings than the existing measures of compression quality. In our experiment, we first enumerate all the embeddings we compressed using different compression methods, compression rates, and five random seeds, and we evaluate each of these embeddings on the various downstream tasks; we use the same random seed for compression and for downstream training. We then consider for each task all pairs of compressed embeddings, and for each measure of compression quality report the selection error rate—the fraction of cases where the embedding with a higher compression quality score attains worse downstream performance. We show in Table 2 that across different tasks the eigenspace overlap score achieves lower selection error rates than the PIP loss and the spectral distance measures Δ and Δ_max, with 1.3× to 2× lower selection error rates than the second best measure. To demonstrate the robustness of the eigenspace overlap score as a criterion, we measure the maximum difference in downstream performance, across all pairs of compressed embeddings discussed above, between the better performing embedding and the one selected by the eigenspace overlap score. We observe that this maximum performance difference is 1.1× to 5.5× smaller for the eigenspace overlap score than for the measure of compression quality with the second smallest maximum performance difference. See Appendix E.8 for more detailed results on the robustness of the eigenspace overlap score as a selection criterion.
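The selection error rate described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `selection_error_rate`, the decision to skip tied pairs, and the example numbers are all assumptions made for this sketch.

```python
import itertools
import numpy as np

def selection_error_rate(quality, performance):
    """Fraction of pairs where the embedding with the higher quality score performs worse.

    quality[i]: compression quality score, larger means "predicted better"
                (e.g. the eigenspace overlap score; negate a loss-like measure first).
    performance[i]: downstream metric, larger is better.
    Pairs tied in quality or in performance are skipped (an assumption of this sketch).
    """
    errors, total = 0, 0
    for i, j in itertools.combinations(range(len(quality)), 2):
        dq = quality[i] - quality[j]
        dp = performance[i] - performance[j]
        if dq == 0 or dp == 0:
            continue
        total += 1
        if dq * dp < 0:  # the quality measure prefers the worse-performing embedding
            errors += 1
    return errors / total if total else 0.0

# Hypothetical scores for four compressed embeddings (not taken from the paper).
overlap = np.array([0.95, 0.80, 0.60, 0.90])
f1 = np.array([75.0, 72.0, 70.0, 76.0])
rate = selection_error_rate(overlap, f1)
print(rate)
```

In this toy example only the pair (0, 3) is misordered, so one of the six pairs counts as a selection error.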

Table 2: The selection error rate of each measure of compression quality as a selection criterion.

Across all pairs of compressed embeddings from our experiments, we measure for each task the fraction of cases when a quality measure selects the worse performing embedding. We observe that the eigenspace overlap score $E$ achieves lower error rates than other compression quality measures.

Dataset	SQuAD		SST-1		MNLI	QQP

Embedding	GloVe	fastText	GloVe	fastText	BERT WordPiece	BERT WordPiece

PIP loss	0.32	0.37	0.32	0.40	0.31	0.32
Δ	0.34	0.58	0.39	0.57	0.32	0.33
Δ_max	0.28	0.22	0.30	0.27	0.15	0.16
1 − $E$	0.17	0.11	0.19	0.20	0.10	0.10


5. Related Work

Compressing machine learning models is critical for training and inference in resource-constrained settings. To enable low-memory training, recent work investigates using low numerical precision [22, 9] and sparsity [36, 11, 25]. To compress a model for low-memory inference, Han et al. [15] investigate pruning and quantization for deep neural networks.

Our work on understanding the generalization performance of compressed embeddings is also closely related to work on understanding the generalization performance of kernel approximation methods [39, 32]. In particular, training a linear model over compressed word embeddings can be viewed as training a model with a linear kernel using an approximation to the kernel matrix. Recently, there has been work on how different measures of kernel approximation error relate to the generalization performance of the model trained using the approximate kernels, with Avron et al. [3] and Zhang et al. [42] proposing the spectral measures of approximation error which we consider in this work.

6. Conclusion and Future Work

We proposed the eigenspace overlap score, a new way to measure the quality of a compressed embedding without requiring training for each downstream task of interest. We related this score to the generalization performance of linear and logistic regression models, used this score to better understand the strong empirical performance of uniformly quantized embeddings, and showed this score is an accurate and robust selection criterion for compressed embeddings. Although this work focuses on word embeddings, for future work we hope to show that the ideas presented here extend to other domains—for example, to other types of embeddings (e.g., graph node embeddings [13]), and to compressing the activations of neural networks. We also believe that our work can help understand the performance of any model trained using compressed or perturbed features, and to understand why certain proposed methods for compressing neural networks succeed while others fail. We hope this work inspires improvements to compression methods in various domains.

Figure 9: We plot the average BLEU4 test performance for compressed task-specific embeddings on the IWSLT’14 German-to-English translation task across five random seeds (standard deviations indicated with error bars). Note that because the TT method attains very low BLEU scores for some random seeds, in this plot we report the *best* performance for the TT method across the five random seeds. We observe that the uniform quantization and k-means compression methods generally achieve better BLEU4 test performance than the TT method for compression rates up to 128×.

Acknowledgments

We thank Tony Ginart, Max Lam, Stephanie Wang, and Christopher Aberger for all their work on the early stages of this project. We further thank all the members of our research group for their helpful discussions and feedback throughout the course of this work.

We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M) and FA86501827865 (SDH), NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity) and CCF1563078 (Volume to Velocity), ONR under No. N000141712266 (Unifying Weak Supervision), the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, Google, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, and American Family Insurance, Google Cloud, Swiss Re, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, Facebook, Google, Ant Financial, NEC, SAP, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

A. Background

A.1. Uniform Quantization

A b-bit uniform quantization of a real number $x \in [- r, r]$ is computed as follows: First, the interval $[- r, r]$ is divided into $2^{b} - 1$ sub-intervals of equal size. Then, x is rounded to either the top or bottom of the sub-interval $[\underline{x}, \bar{x}]$ containing x, where $\underline{x} = - r + j \frac{2 r}{2^{b} - 1}$ and $\bar{x} = - r + (j + 1) \frac{2 r}{2^{b} - 1}$, for $j \in \{0, 1, \dots, 2^{b} - 2\}$. Given this rounded value, one can simply store the b-bit integer j or j + 1 in place of the real-valued x, depending on whether x was rounded to $\underline{x}$ or $\bar{x}$ respectively. In this work, we will consider a deterministic rounding scheme which rounds x to the nearest value (denoted by $Q_{b, r} (x)$), as well as an unbiased stochastic rounding scheme (denoted by ${\tilde{Q}}_{b, r} (x)$; more details below). Note that our analysis will focus on the stochastic rounding scheme, while our experiments will focus on deterministic quantization; however, for completeness, in Appendix E.9, we show that stochastic quantization also performs quite well empirically.

We now define unbiased stochastic uniform quantization more formally. We will denote by ${\tilde{Q}}_{b, r} (x)$ the b-bit unbiased stochastic uniform quantization of a real number $x \in [- r, r]$. If $x \in [\underline{x}, \bar{x}]$ for $\underline{x} = - r + j \frac{2 r}{2^{b} - 1}$ and $\bar{x} = - r + (j + 1) \frac{2 r}{2^{b} - 1}$, with $j \in \{0, 1, \dots, 2^{b} - 2\}$, then $ℙ [{\tilde{Q}}_{b, r} (x) = \underline{x}] = \frac{\bar{x} - x}{\bar{x} - \underline{x}}$ and $ℙ [{\tilde{Q}}_{b, r} (x) = \bar{x}] = \frac{x - \underline{x}}{\bar{x} - \underline{x}}$. Note that $E [{\tilde{Q}}_{b, r} (x)] = x$ and $\mathrm{Var} [{\tilde{Q}}_{b, r} (x)] \leq \frac{r^{2}}{{(2^{b} - 1)}^{2}} = δ_{b}^{2} r^{2}$ for $δ_{b}^{2} := \frac{1}{{(2^{b} - 1)}^{2}}$. We bound the variance using the fact that a random variable supported on an interval of length c has variance at most $c^{2} / 4$, by Popoviciu’s inequality on variances [31] (in our case, $c = \frac{2 r}{2^{b} - 1}$).

Using the above definition of ${\tilde{Q}}_{b, r},$ we define the b-bit stochastic uniform quantization of a matrix X:

Definition 2.

For a bounded embedding matrix $X$ with $X_{i j} \in [- r, r]$, we define a b-bit stochastic uniform quantization of X to be a matrix $\tilde{X}$ such that ${\tilde{X}}_{i j} = {\tilde{Q}}_{b, r} (X_{i j})$.

For details on how we use uniform quantization to compress word embeddings, please see Algorithm 1 and the associated discussion in Appendix D.3.
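As a rough illustration of the definitions in this section, the following NumPy sketch implements both the deterministic (nearest-value) and the unbiased stochastic rounding schemes. It is not Algorithm 1 from Appendix D.3: the function name `quantize_uniform`, the explicit clipping of inputs to $[- r, r]$, and the returned `(codes, values)` pair are assumptions made for this example.

```python
import numpy as np

def quantize_uniform(x, b, r, stochastic=False, rng=None):
    """b-bit uniform quantization of values in [-r, r] (inputs are clipped first).

    Returns (codes, values): integer grid indices in {0, ..., 2^b - 1} and the
    corresponding real values -r + code * 2r / (2^b - 1).
    """
    x = np.clip(np.asarray(x, dtype=np.float64), -r, r)
    step = 2.0 * r / (2 ** b - 1)       # width of each of the 2^b - 1 sub-intervals
    pos = (x + r) / step                # position on the grid, in [0, 2^b - 1]
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        low = np.floor(pos)
        # Round up with probability (x - x_low) / (x_high - x_low), which is unbiased.
        codes = low + (rng.random(x.shape) < (pos - low))
    else:
        codes = np.rint(pos)            # deterministic: nearest grid point
    codes = np.clip(codes, 0, 2 ** b - 1).astype(np.int64)
    return codes, -r + codes * step

# Hypothetical usage: 2-bit quantization of a few values in [-1, 1].
codes, values = quantize_uniform([-1.0, -0.2, 0.1, 0.9], b=2, r=1.0)
print(codes, values)
```

Only the integer `codes` (b bits each) and the scalar `r` need to be stored; the real values can be rebuilt on the fly.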

A.2. Fixed Design Linear Regression

We derive here the closed-form expression for the risk of fixed design linear regression. In this setting we observe a set of labeled points ${(x_{i}, y_{i})}_{i = 1}^{n}$, where the observed labels $y_{i} = {\bar{y}}_{i} + ϵ_{i} \in ℝ$ are perturbed versions of the true labels ${\bar{y}}_{i}$ with independent zero-mean noise ϵ_i (with variance σ²). In other words, $y = \bar{y} + ϵ$ with $ϵ$ being an n-dimensional zero-mean random variable with covariance $σ^{2} I_{n}$. Let $X \in ℝ^{n \times d}$ be the feature matrix. The weight vector w^* of the optimal linear regressor $f_{X, ϵ} (x) = 〈 x, w^{*} 〉$ is computed by minimizing the least squares loss:

w^{*} = \underset{w \in ℝ^{d}}{\arg \min} \frac{1}{n} ‖ X w - y ‖^{2} .

From the normal equation, we know that $w^{*} = {(X^{T} X)}^{- 1} X^{T} y .$ The risk, or expected error, of the optimal linear regressor f_X,ϵ trained on data matrix X and label vector $y = \bar{y} + ϵ$ is defined as

R_{\bar{y}} (X) : = E_{ϵ} [\frac{1}{n} \sum_{i = 1}^{n} {(f_{X, ϵ} (x_{i}) - {\bar{y}}_{i})}^{2}] = E_{ϵ} [\frac{1}{n} {‖ X w^{*} - \bar{y} ‖}^{2}] .

Proposition 3.

If the feature matrix $X \in ℝ^{n \times d}$ is full-rank and has the SVD decomposition $X = U S V^{T},$ then the risk of the optimal linear regressor, in the fixed design linear regression problem with noise variance σ², is

R_{\bar{y}} (X) = \frac{1}{n} (‖ \bar{y} ‖^{2} - {‖ U^{T} \bar{y} ‖}^{2} + d σ^{2}) .

Proof. From the normal equation,

w^{*} = {(X^{T} X)}^{- 1} X^{T} y = {(V S^{2} V^{T})}^{- 1} V S U^{T} y = V S^{- 2} V^{T} V S U^{T} y = V S^{- 2} S U^{T} y = V S^{- 1} U^{T} y .

Substituting this expression into the definition of the risk, we obtain

\begin{aligned}
R_{\bar{y}} (X) &= \frac{1}{n} E_{ϵ} [‖ X w^{*} - \bar{y} ‖^{2}] \\
&= \frac{1}{n} E_{ϵ} [{‖ U S V^{T} V S^{- 1} U^{T} y - \bar{y} ‖}^{2}] \\
&= \frac{1}{n} E_{ϵ} [{‖ U U^{T} y - \bar{y} ‖}^{2}] \\
&= \frac{1}{n} E_{ϵ} [{‖ U U^{T} (\bar{y} + ϵ) - \bar{y} ‖}^{2}] \\
&= \frac{1}{n} E_{ϵ} [{‖ (U U^{T} - I_{n}) (\bar{y} + ϵ) + ϵ ‖}^{2}] \\
&= \frac{1}{n} E_{ϵ} [{‖ (I_{n} - U U^{T}) (\bar{y} + ϵ) - ϵ ‖}^{2}] \\
&= \frac{1}{n} E_{ϵ} [‖ A (\bar{y} + ϵ) - ϵ ‖^{2}] \quad \text{(letting } A := I_{n} - U U^{T} \text{)} \\
&= \frac{1}{n} E_{ϵ} [{(\bar{y} + ϵ)}^{T} A^{2} (\bar{y} + ϵ) - {(\bar{y} + ϵ)}^{T} A ϵ - ϵ^{T} A (\bar{y} + ϵ) + ϵ^{T} ϵ] \\
&= \frac{1}{n} E_{ϵ} [{\bar{y}}^{T} A \bar{y} - ϵ^{T} A ϵ + ϵ^{T} ϵ] \quad \text{(using } A^{2} = A \text{ and } E_{ϵ} [ϵ^{T} A \bar{y}] = E_{ϵ} [{\bar{y}}^{T} A ϵ] = 0 \text{)} \\
&= \frac{1}{n} ({\bar{y}}^{T} (I_{n} - U U^{T}) \bar{y} + E_{ϵ} [- ϵ^{T} (I_{n} - U U^{T}) ϵ + ϵ^{T} ϵ]) \\
&= \frac{1}{n} (‖ \bar{y} ‖^{2} - {‖ U^{T} \bar{y} ‖}^{2} + E_{ϵ} [ϵ^{T} U U^{T} ϵ]) \\
&= \frac{1}{n} (‖ \bar{y} ‖^{2} - {‖ U^{T} \bar{y} ‖}^{2} + d σ^{2}),
\end{aligned}

where the last step follows from

E_{ϵ} [ϵ^{T} U U^{T} ϵ] = E_{ϵ} [tr (U U^{T} ϵ ϵ^{T})] = tr (U U^{T} E_{ϵ} [ϵ ϵ^{T}]) = tr (U U^{T} σ^{2} I_{n}) = σ^{2} ‖ U ‖_{F}^{2} = d σ^{2} .

□
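The closed form in Proposition 3 can be checked numerically. The sketch below compares it against a Monte Carlo estimate of the risk over repeated noise draws. The helper names, the Gaussian noise model (the proposition only requires zero mean and variance σ²), and the problem sizes are assumptions chosen for illustration.

```python
import numpy as np

def closed_form_risk(X, y_bar, sigma2):
    """Risk from Proposition 3: (||y_bar||^2 - ||U^T y_bar||^2 + d * sigma^2) / n."""
    n, d = X.shape
    U, _, _ = np.linalg.svd(X, full_matrices=False)  # X assumed to have full column rank
    return (y_bar @ y_bar - np.sum((U.T @ y_bar) ** 2) + d * sigma2) / n

def monte_carlo_risk(X, y_bar, sigma2, trials, rng):
    """Average of (1/n) ||X w* - y_bar||^2 over noise draws, with w* fit on noisy labels."""
    n = X.shape[0]
    total = 0.0
    for _ in range(trials):
        y = y_bar + rng.normal(scale=np.sqrt(sigma2), size=n)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        total += np.sum((X @ w - y_bar) ** 2) / n
    return total / trials

rng = np.random.default_rng(0)
n, d, sigma2 = 50, 5, 0.25
X = rng.normal(size=(n, d))
y_bar = rng.normal(size=n)
exact = closed_form_risk(X, y_bar, sigma2)
estimate = monte_carlo_risk(X, y_bar, sigma2, trials=2000, rng=rng)
print(exact, estimate)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `trials` grows.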

B. The Eigenspace Overlap Score: Theory and Extensions

B.1. Proof of Theorem 1: Average Case Analysis for Fixed Design Linear Regression

We present the proof of Theorem 1, relating the generalization performance and eigenspace overlap score in the context of fixed design linear regression. The true label $\bar{y}$ is assumed to be randomly distributed in the span of U, of the form $\bar{y} = U z$ for some zero-mean d-dimensional random variable z. We prove a more general version of Theorem 1, for a general covariance matrix Σ of z.

Theorem 1 (Generalized).

Let $X = U S V^{T} \in ℝ^{n \times d}$ be the singular value decomposition of a full-rank embedding matrix $X$. Let $\tilde{X} \in ℝ^{n \times k}$ be another full-rank embedding matrix with k ≤ d, and let $\bar{y} = U z \in ℝ^{n}$ denote a random label vector in span(U), where z has mean zero and covariance matrix Σ. Let λ_min(Σ) be the smallest eigenvalue of Σ. Then the gap in expected risk of fixed design linear regression trained on $X$ and $\tilde{X}$, with label noise variance σ², satisfies:

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] \leq \frac{tr Σ - d λ_{\min} (Σ) E (X, \tilde{X})}{n} - \frac{d - k}{n} σ^{2} .

If the random variable z has identity covariance matrix, then

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] = \frac{d}{n} \cdot (1 - E (X, \tilde{X})) - \frac{d - k}{n} σ^{2} .

Proof. Since $\bar{y} = U z,$ we have

E_{\bar{y}} [{‖ U^{T} \bar{y} ‖}^{2}] = E_{z} [{‖ U^{T} U z ‖}^{2}] = E_{z} [‖ z ‖^{2}] = E_{z} [tr (z^{T} z)] = E_{z} [tr (z z^{T})] = tr (E_{z} [z z^{T}]) = tr (Σ) .

Similarly,

E_{\bar{y}} [{‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}] = E_{z} [{‖ {\tilde{U}}^{T} U z ‖}^{2}] = E_{z} [tr (z^{T} U^{T} \tilde{U} {\tilde{U}}^{T} U z)] = E_{z} [tr (U^{T} \tilde{U} {\tilde{U}}^{T} U z z^{T})] = tr (U^{T} \tilde{U} {\tilde{U}}^{T} U E_{z} [z z^{T}]) = tr (U^{T} \tilde{U} {\tilde{U}}^{T} U Σ) = tr (Σ^{1 / 2} U^{T} \tilde{U} {\tilde{U}}^{T} U Σ^{1 / 2}) = {‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2},

where Σ^1/2 is the positive semidefinite (PSD) matrix such that (Σ^1/2)² = Σ. From Proposition 3, the risks are $R_{\bar{y}} (X) = \frac{1}{n} (‖ \bar{y} ‖^{2} - {‖ U^{T} \bar{y} ‖}^{2} + d σ^{2})$ and $R_{\bar{y}} (\tilde{X}) = \frac{1}{n} (‖ \bar{y} ‖^{2} - {‖ {\tilde{U}}^{T} \bar{y} ‖}^{2} + k σ^{2})$. We thus obtain:

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] = \frac{1}{n} (E_{\bar{y}} [{‖ U^{T} \bar{y} ‖}^{2}] - E_{\bar{y}} [{‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}] - (d - k) σ^{2}) = \frac{1}{n} (tr (Σ) - {‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2}) - \frac{d - k}{n} σ^{2} .

(2)

We can lower bound ${‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2}$ in terms of the smallest eigenvalue of Σ and the eigenspace overlap score of $X$ and $\tilde{X}$. Specifically, we now show that ${‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2} \geq d λ_{\min} (Σ) E (X, \tilde{X})$, where $λ_{\min} (Σ)$ is the smallest eigenvalue of Σ. We will use the fact that $Σ - λ_{\min} (Σ) I_{d}$ is PSD, and so ${(Σ - λ_{\min} (Σ) I_{d})}^{1 / 2}$ exists. We now prove the above inequality:

{‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2} - d λ_{\min} (Σ) E (X, \tilde{X}) = {‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2} - λ_{\min} (Σ) {‖ {\tilde{U}}^{T} U ‖}_{F}^{2} = tr (Σ^{1 / 2} U^{T} \tilde{U} {\tilde{U}}^{T} U Σ^{1 / 2}) - λ_{\min} (Σ) tr (U^{T} \tilde{U} {\tilde{U}}^{T} U) = tr (U^{T} \tilde{U} {\tilde{U}}^{T} U Σ) - tr (U^{T} \tilde{U} {\tilde{U}}^{T} U λ_{\min} (Σ) I_{d}) = tr (U^{T} \tilde{U} {\tilde{U}}^{T} U (Σ - λ_{\min} (Σ) I_{d})) = tr ({(Σ - λ_{\min} (Σ) I_{d})}^{1 / 2} U^{T} \tilde{U} {\tilde{U}}^{T} U {(Σ - λ_{\min} (Σ) I_{d})}^{1 / 2}) = {‖ {\tilde{U}}^{T} U {(Σ - λ_{\min} (Σ) I_{d})}^{1 / 2} ‖}_{F}^{2} \geq 0 .

Thus, we have shown that $E_{\bar{y}} [{‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}] = {‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2} \geq d λ_{\min} (Σ) E (X, \tilde{X}) .$ Substituting this lower bound into Equation (2) yields

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] \leq \frac{1}{n} (tr (Σ) - d λ_{\min} (Σ) E (X, \tilde{X})) - \frac{d - k}{n} σ^{2} .

In the case where $Σ = I_{d}$, we obtain $tr (Σ) = d$ and ${‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2} = {‖ {\tilde{U}}^{T} U ‖}_{F}^{2} = d E (X, \tilde{X})$. Thus from Equation (2), we obtain

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] = \frac{d}{n} (1 - E (X, \tilde{X})) - \frac{d - k}{n} σ^{2} .

□
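To make the quantities in Theorem 1 (Generalized) concrete, the sketch below computes the eigenspace overlap score, the expected risk gap from Equation (2), and the upper bound from the theorem. The function names, the random test matrices, and the choice of a diagonal Σ are assumptions made for this example; the overlap score uses the normalization $\frac{1}{\max (d, k)} {‖ {\tilde{U}}^{T} U ‖}_{F}^{2}$ and assumes both matrices have full column rank.

```python
import numpy as np

def left_singular_vectors(X):
    # Thin SVD; assumes X has full column rank so every returned column is meaningful.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U

def eigenspace_overlap(X, X_tilde):
    """E(X, X~) = ||U~^T U||_F^2 / max(d, k)."""
    U, U_t = left_singular_vectors(X), left_singular_vectors(X_tilde)
    return np.linalg.norm(U_t.T @ U, "fro") ** 2 / max(U.shape[1], U_t.shape[1])

def expected_risk_gap(X, X_tilde, Sigma, sigma2):
    """Equation (2): (tr(Sigma) - ||U~^T U Sigma^{1/2}||_F^2) / n - (d - k) sigma^2 / n."""
    n, d = X.shape
    k = X_tilde.shape[1]
    M = left_singular_vectors(X_tilde).T @ left_singular_vectors(X)
    proj = np.trace(M @ Sigma @ M.T)  # equals ||U~^T U Sigma^{1/2}||_F^2
    return (np.trace(Sigma) - proj) / n - (d - k) * sigma2 / n

def risk_gap_upper_bound(X, X_tilde, Sigma, sigma2):
    """Bound: (tr(Sigma) - d * lambda_min(Sigma) * E) / n - (d - k) sigma^2 / n."""
    n, d = X.shape
    k = X_tilde.shape[1]
    lam_min = np.linalg.eigvalsh(Sigma).min()
    E = eigenspace_overlap(X, X_tilde)
    return (np.trace(Sigma) - d * lam_min * E) / n - (d - k) * sigma2 / n

rng = np.random.default_rng(0)
n, d, k, sigma2 = 200, 10, 6, 0.1
X = rng.normal(size=(n, d))
X_tilde = X[:, :k] + 0.5 * rng.normal(size=(n, k))  # hypothetical "compressed" embedding
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))
gap = expected_risk_gap(X, X_tilde, Sigma, sigma2)
bound = risk_gap_upper_bound(X, X_tilde, Sigma, sigma2)
print(gap, bound)
```

With `Sigma = np.eye(d)` the gap and the bound coincide, matching the equality case of the theorem.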

B.2. Average Case Analysis for Lipschitz-Continuous Loss Function

We now consider the fixed design setting with a Lipschitz-continuous loss function, and discuss how the average risk of training on $\tilde{X}$ can be bounded in terms of the average risk of training on X and the eigenspace overlap score $E (X, \tilde{X}) .$

Let $l : ℝ \times ℝ \to ℝ$ be a non-negative loss function which is L-Lipschitz in both its first and second arguments, $X \in ℝ^{n \times d}$ be a fixed data matrix with SVD $X = U S V^{T}$, $x_{i} \in ℝ^{d}$ be the i^th row of X, and $y \in ℝ^{n}$ be a label vector. We assume that $\arg \min_{v^{'}} l (v^{'}, v) = v$ for all $v \in ℝ$. We will consider a linear model f(x) = x^Tw parameterized by some weight vector w, such that the loss function for each data point x_i under this model is $l (x_{i}^{T} w, y_{i})$.

Similar to the fixed design linear regression setting, we assume that y is generated from the true label $\bar{y} \in ℝ^{n}$ by adding zero-mean independent noise: $y = \bar{y} + ϵ$ where $ϵ = {[ϵ_{1}, \dots, ϵ_{n}]}^{T} \in ℝ^{n}$ and the ϵ_i are independent with zero mean and variance σ². We also assume that the true label $\bar{y}$ is randomly distributed in the span of U, of the form $\bar{y} = U z$ for some zero-mean d-dimensional random variable z, with covariance matrix Σ. Note that because $\frac{1}{n} E [‖ \bar{y} ‖^{2}] = \frac{1}{n} E_{z} [‖ U z ‖^{2}] = \frac{1}{n} E_{z} [‖ z ‖^{2}] = \frac{1}{n} tr (Σ)$, it makes sense for the variance σ² of the noise we add to each entry of y to scale as $σ^{2} = O (\frac{1}{n} tr (Σ))$. Lastly, let $\tilde{X} = \tilde{U} \tilde{S} {\tilde{V}}^{T} \in ℝ^{n \times k}$ be any matrix, which for our purposes will represent a compressed version of X.

We now define the optimal weight vectors $w^{*}$ and ${\tilde{w}}^{*}$ trained on $X$ and $\tilde{X}$ respectively, along with their corresponding vectors of predictions $u$ and $\tilde{u}$:

w^{*} = \underset{w}{\arg \min} \sum_{i = 1}^{n} l (x_{i}^{T} w, y_{i}), u : = X w^{*}

{\tilde{w}}^{*} = \underset{\tilde{w}}{\arg \min} \sum_{i = 1}^{n} l ({\tilde{x}}_{i}^{T} \tilde{w}, y_{i}), \tilde{u} : = \tilde{X} {\tilde{w}}^{*} .

(3)

Note that $u$ and $\tilde{u}$ depend on ϵ and $\bar{y}$, which are random.

The risks, or expected errors (expectation taken over ϵ, for a fixed $\bar{y}$), for the models trained with X and $\tilde{X}$ respectively, are defined as

R_{\bar{y}} (X) : = E_{ϵ} [\frac{1}{n} \sum_{i = 1}^{n} l (x_{i}^{T} w^{*}, {\bar{y}}_{i})] .

(4)

R_{\bar{y}} (\tilde{X}) : = E_{ϵ} [\frac{1}{n} \sum_{i = 1}^{n} l ({\tilde{x}}_{i}^{T} {\tilde{w}}^{*}, {\bar{y}}_{i})] .

(5)

We are now ready to present our theorem on average-case generalization performance with L-Lipschitz continuous loss functions. Note that we will use the assumption we stated above that $σ^{2} = O (\frac{1}{n} tr (Σ))$.

Theorem 4.

Let $X = U S V^{T} \in ℝ^{n \times d}$ be the singular value decomposition of a full-rank embedding matrix X. Let $\tilde{X} \in ℝ^{n \times k}$ be another full-rank embedding matrix with k ≤ d, and let $\bar{y} = U z \in ℝ^{n}$ denote a random label vector in span(U), where z has mean zero and covariance matrix Σ. Let $λ_{\min} (Σ)$ be the smallest eigenvalue of Σ. Then the gap in expected risk of the linear models trained on $X$ and $\tilde{X}$, with a loss function $l : ℝ \times ℝ \to ℝ$ which is non-negative and L-Lipschitz continuous and satisfies $\arg \min_{v^{'}} l (v^{'}, v) = v \; \forall v \in ℝ$, with label noise variance $σ^{2} = \frac{a^{2}}{n} tr (Σ)$ for a scalar $a \in ℝ$, satisfies:

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] \leq \frac{L}{\sqrt{n}} \sqrt{tr (Σ) - d λ_{\min} (Σ) E (X, \tilde{X})} + \frac{2 La \sqrt{tr (Σ)}}{\sqrt{n}} .

If the random variable z has identity covariance matrix, then

E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] \leq \frac{L \sqrt{d}}{\sqrt{n}} (\sqrt{1 - E (X, \tilde{X})} + 2 a) .

Proof. Let $f (v, y) = \frac{1}{n} \sum_{i = 1}^{n} l (v_{i}, y_{i})$ be the average loss on the training set, given predictions v, and let $f (v, \bar{y}) = \frac{1}{n} \sum_{i = 1}^{n} l (v_{i}, {\bar{y}}_{i})$ be the average test loss. Note that $f$ is $L / \sqrt{n}$-Lipschitz in its first argument, because ${‖ \nabla_{v} f (v, y) ‖}^{2} \leq \sum_{i = 1}^{n} {(L / n)}^{2} = L^{2} / n$. Similarly, $f$ is $L / \sqrt{n}$-Lipschitz in its second argument. The training loss on X is then $f (u, y)$, and the risk is $E_{ϵ} [f (u, \bar{y})]$. Similarly, the training loss on $\tilde{X}$ is $f (\tilde{u}, y)$ and the risk is $E_{ϵ} [f (\tilde{u}, \bar{y})]$.

We can bound the difference in the average risk (average over $\bar{y}$) when training on $X$ and $\tilde{X}$ in terms of the eigenspace overlap score. We do this in three steps: First, we lower bound $R_{\bar{y}} (X)$. Second, we upper bound $R_{\bar{y}} (\tilde{X})$. Third, we use the bounds from the first two steps to upper bound the expectation over $\bar{y}$ of the difference between $R_{\bar{y}} (\tilde{X})$ and $R_{\bar{y}} (X)$. We now go through these steps one at a time:

Step 1: We show that $R_{\bar{y}} (X) \geq f (\bar{y}, \bar{y}) .$
$R_{\bar{y}} (X) = E_{ϵ} [f (u, \bar{y})] \geq E_{ϵ} [f (\bar{y}, \bar{y})] = f (\bar{y}, \bar{y}) .$
Here, we used the fact that $\bar{y} = \arg \min_{v} f (v, \bar{y})$ (which follows from our assumption on the loss function $l$ ).

Step 2: We show that for all $\tilde{w} \in ℝ^{k}$, $R_{\bar{y}} (\tilde{X}) \leq f (\tilde{X} \tilde{w}, \bar{y}) + 2 L σ$.

\begin{aligned}
R_{\bar{y}} (\tilde{X}) &= E_{ϵ} [f (\tilde{u}, \bar{y})] \\
&\leq E_{ϵ} [f (\tilde{u}, y) + \frac{L}{\sqrt{n}} ‖ y - \bar{y} ‖] \quad \text{(} f \text{ is } L / \sqrt{n} \text{-Lipschitz)} \\
&= E_{ϵ} [f (\tilde{X} {\tilde{w}}^{*}, y)] + E_{ϵ} [\frac{L}{\sqrt{n}} ‖ ϵ ‖] \\
&\leq E_{ϵ} [f (\tilde{X} \tilde{w}, y)] + E_{ϵ} [\frac{L}{\sqrt{n}} ‖ ϵ ‖] \quad \text{(by Equation (3))} \\
&\leq E_{ϵ} [f (\tilde{X} \tilde{w}, \bar{y}) + \frac{L}{\sqrt{n}} ‖ \bar{y} - y ‖] + E_{ϵ} [\frac{L}{\sqrt{n}} ‖ ϵ ‖] \quad \text{(} f \text{ is } L / \sqrt{n} \text{-Lipschitz)} \\
&= f (\tilde{X} \tilde{w}, \bar{y}) + \frac{2 L}{\sqrt{n}} E_{ϵ} [‖ ϵ ‖] \\
&\leq f (\tilde{X} \tilde{w}, \bar{y}) + 2 L σ \quad \text{(by } E_{ϵ} {[‖ ϵ ‖]}^{2} \leq E_{ϵ} [‖ ϵ ‖^{2}] = n σ^{2} \text{)} .
\end{aligned}

Step 3: We bound the expected difference, over the randomness in the label vector $\bar{y}$, between $R_{\bar{y}} (\tilde{X})$ and $R_{\bar{y}} (X)$, leveraging the results from steps 1 and 2 above.
$E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] \leq E_{\bar{y}} [f (\tilde{X} \tilde{w}, \bar{y}) + 2 L σ - f (\bar{y}, \bar{y})] \text{ (by steps 1 and 2)} \leq E_{\bar{y}} [\frac{L}{\sqrt{n}} ‖ \tilde{X} \tilde{w} - \bar{y} ‖ + 2 L σ] \text{ (} f \text{ is } L / \sqrt{n} \text{-Lipschitz)} .$

To get the tightest bound, we can minimize $‖ \tilde{X} \tilde{w} - \bar{y} ‖$ over $\tilde{w} \in ℝ^{k} .$ But this is exactly the least squares problem, with solution $\tilde{w} = {({\tilde{X}}^{T} \tilde{X})}^{- 1} {\tilde{X}}^{T} \bar{y} = \tilde{V} {\tilde{S}}^{- 1} {\tilde{U}}^{T} \bar{y},$ and minimum value $\sqrt{‖ \bar{y} ‖^{2} - {‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}}$ (by proof of Proposition 3). We can substitute this bound into the above inequalities and continue:

\begin{aligned}
E_{\bar{y}} [R_{\bar{y}} (\tilde{X}) - R_{\bar{y}} (X)] &\leq \frac{L}{\sqrt{n}} E_{\bar{y}} [\sqrt{‖ \bar{y} ‖^{2} - {‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}}] + 2 L σ \\
&\leq \frac{L}{\sqrt{n}} \sqrt{E_{\bar{y}} [‖ \bar{y} ‖^{2} - {‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}]} + 2 L σ \quad \text{(by Jensen's inequality)} \\
&\leq \frac{L}{\sqrt{n}} \sqrt{tr (Σ) - d λ_{\min} (Σ) E (X, \tilde{X})} + 2 L σ,
\end{aligned}

where this last step follows from $E_{\bar{y}} [‖ \bar{y} ‖^{2}] = E_{z} [z^{T} U^{T} U z] = E_{z} [z^{T} z] = tr (Σ)$, and $E_{\bar{y}} [{‖ {\tilde{U}}^{T} \bar{y} ‖}^{2}] \geq d λ_{\min} (Σ) E (X, \tilde{X})$, which we show in the proof of Theorem 1 (Generalized). Using the assumption $σ^{2} = \frac{a^{2}}{n} tr (Σ)$, and thus $σ = \frac{a \sqrt{tr (Σ)}}{\sqrt{n}}$, completes the proof. □

Note that in the case of logistic regression where we observe the noisy logits,⁹ the loss is 1-Lipschitz in the first argument. If we assume that the weight vector w has bounded norm (say, because of L2 regularization), and that the data matrix X is bounded, then the loss function is also Lipschitz in the second argument. We can think of z_i as being the optimal logits such that $P (y_{i} = 1) = σ (z_{i})$ (one can think of these $z_{i} = x_{i}^{T} w$ as the parameters of the generative model which generated the data). Just like in the linear regression case, we see that the overlap $E (X, \tilde{X}) = \frac{1}{d} {‖ {\tilde{U}}^{T} U ‖}_{F}^{2}$ gives an upper bound on the maximum possible expected difference in the loss functions when training on $X$ vs. $\tilde{X}$.

B.3. Robustness of the Eigenspace Overlap Score to Perturbations

For a measure of compression quality to correlate strongly with downstream performance, a necessary condition is for it to be robust to embedding perturbations which are unlikely to significantly affect generalization performance. Here, we give an example of an embedding perturbation which has minimal effect on the eigenspace overlap score and on average-case generalization performance, while having a much larger impact on the other measures of compression quality. We consider the following simple perturbation: if $X = \sum_{i = 1}^{d} σ_{i} U_{i} V_{i}^{T}$ is the singular value decomposition of X, we consider setting its largest singular value to 0, resulting in the perturbed matrix $\tilde{X} := \sum_{i = 2}^{d} σ_{i} U_{i} V_{i}^{T}$. Assuming a label vector $y = U z$, $\tilde{X}$ would have generalization error of $‖ y ‖^{2} - {‖ {\tilde{U}}^{T} y ‖}^{2} = z_{1}^{2}$. If we assume that $z_{1}^{2} ≪ \sum_{i = 2}^{d} z_{i}^{2}$ (as would be expected in our average-case analysis), then $\tilde{X}$ would perform similarly to X.

In Table 3, we show the impact of the above perturbation on the various measures of compression quality we have discussed. At a high level, we observe that this perturbation can have a dramatic effect on the previously proposed measures, while having minimal effect on the eigenspace overlap score. For example, the eigenspace overlap score after this perturbation is equal to $E (X, \tilde{X}) = \frac{d - 1}{d}$, relative to the maximum possible overlap of 1. In contrast, this perturbation results in a Δ₁ value very close to 1 if $λ ≪ σ_{1}$ (note that 1 is the maximum possible value for Δ₁, and that Zhang et al. [42] show generalization bounds scale with $\frac{1}{1 - Δ_{1}}$). This makes sense, because Δ₁ can be used to attain a worst-case generalization bound for the perturbed embeddings, and there exist cases where setting the largest singular value to 0 can significantly harm the generalization performance of the embeddings (e.g., if $z_{1}^{2} \approx ‖ z ‖^{2}$). Thus, while the Δ₁ measure is important for understanding the worst-case performance of the compressed embeddings, it is generally an overly pessimistic measure. The eigenspace overlap score, on the other hand, is generally unable to provide worst-case guarantees, but aligns nicely with the expected performance of the compressed embeddings in the average-case setting.

Table 3: Effect of perturbation on measures of compression quality.

In this table, we consider the effect of perturbing an embedding matrix X by setting its largest singular value to 0 on the various measures of compression quality discussed above. As we can see, setting the largest singular value of X to 0 can have a disproportionately large effect on the relative reconstruction error $(\frac{‖ X - \tilde{X} ‖_{F}}{‖ X ‖_{F}})$, relative PIP loss $(\frac{{‖ X X^{T} - \tilde{X} {\tilde{X}}^{T} ‖}_{F}}{{‖ X X^{T} ‖}_{F}})$, $Δ_{1}$, $Δ$, and $Δ_{\max}$ measures (values can approach 1), while having a modest effect on the eigenspace overlap score ($1 - E (X, \tilde{X})$ is always equal to 1/d).

Compression quality measure	Measure after perturbation

Rel. reconstruction error	$σ_{1} / \sqrt{\sum_{i = 1}^{d} σ_{i}^{2}}$
Rel. PIP loss	$σ_{1}^{2} / \sqrt{\sum_{i = 1}^{d} σ_{i}^{4}}$
Δ₁	$σ_{1}^{2} / (σ_{1}^{2} + λ)$
Δ₂	0
Δ	$σ_{1}^{2} / (σ_{1}^{2} + λ)$
Δ_max	$(σ_{1}^{2} + λ) / λ$
$1 - E (X, \tilde{X})$	$\frac{1}{d}$
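Three of the rows in Table 3 are easy to reproduce numerically. The following sketch zeroes the largest singular value of a random matrix and evaluates the relative reconstruction error, the relative PIP loss, and the eigenspace overlap score. The random matrix, its size, and the rank tolerance used to pick the left singular vectors of $\tilde{X}$ are assumptions made for this example; the Δ measures are omitted because their definitions are not restated in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.normal(size=(n, d))
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted in decreasing order

# Perturbation: set the largest singular value to 0.
s_pert = s.copy()
s_pert[0] = 0.0
X_tilde = (U * s_pert) @ Vt

# Left singular vectors of X_tilde that span its (d - 1)-dimensional column space.
U_all, s_t, _ = np.linalg.svd(X_tilde, full_matrices=False)
U_t = U_all[:, s_t > 1e-10 * s_t[0]]

overlap = np.linalg.norm(U_t.T @ U, "fro") ** 2 / d
rel_recon = np.linalg.norm(X - X_tilde, "fro") / np.linalg.norm(X, "fro")
rel_pip = (np.linalg.norm(X @ X.T - X_tilde @ X_tilde.T, "fro")
           / np.linalg.norm(X @ X.T, "fro"))

print(1 - overlap, 1 / d)                          # 1 - E versus 1/d
print(rel_recon, s[0] / np.sqrt(np.sum(s ** 2)))   # rel. reconstruction error
print(rel_pip, s[0] ** 2 / np.sqrt(np.sum(s ** 4)))  # rel. PIP loss
```

Each printed pair compares the directly computed measure with the closed form listed in the table.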


B.4. Relating the Eigenspace Overlap Score to Embedding Reconstruction Error

We now define a variant of embedding reconstruction error which we show is closely related to the eigenspace overlap score. As we mention in Section 2.2, the definition of embedding reconstruction error $‖ X - \tilde{X} ‖_{F}$ is only applicable when $X \in ℝ^{n \times d}$ and $\tilde{X} \in ℝ^{n \times k}$ have the same dimensions (d = k). To get around this limitation, we define the projected embedding reconstruction error as $\min_{P \in ℝ^{k \times d}} ‖ \tilde{X} P - X ‖_{F}^{2}$. It is easy to show that the matrix P minimizing the above expression is $P^{⋆} := {({\tilde{X}}^{T} \tilde{X})}^{- 1} {\tilde{X}}^{T} X$. Letting $X = U S V^{T}$ and $\tilde{X} = \tilde{U} \tilde{S} {\tilde{V}}^{T}$ be the singular value decompositions of $X$ and $\tilde{X}$, we can simplify the expression for the projected embedding reconstruction error as follows:

\min_{P \in ℝ^{k \times d}} ‖ \tilde{X} P - X ‖_{F}^{2} = {‖ \tilde{X} {({\tilde{X}}^{T} \tilde{X})}^{- 1} {\tilde{X}}^{T} X - X ‖}_{F}^{2} = {‖ \tilde{U} \tilde{S} {\tilde{V}}^{T} (\tilde{V} {\tilde{S}}^{- 2} {\tilde{V}}^{T}) \tilde{V} \tilde{S} {\tilde{U}}^{T} X - X ‖}_{F}^{2} = {‖ \tilde{U} {\tilde{U}}^{T} X - X ‖}_{F}^{2} = tr ({(\tilde{U} {\tilde{U}}^{T} X - X)}^{T} (\tilde{U} {\tilde{U}}^{T} X - X)) = ‖ X ‖_{F}^{2} - {‖ {\tilde{U}}^{T} X ‖}_{F}^{2} = ‖ X ‖_{F}^{2} - {‖ {\tilde{U}}^{T} U S V^{T} ‖}_{F}^{2} = ‖ X ‖_{F}^{2} - {‖ {\tilde{U}}^{T} U S ‖}_{F}^{2} .

Thus, the projected embedding reconstruction error is equal to a term $(‖ X ‖_{F}^{2})$ which is constant in $\tilde{X}$, minus a term ${‖ {\tilde{U}}^{T} U S ‖}_{F}^{2} = \sum_{i = 1}^{d} σ_{i}^{2} {‖ {\tilde{U}}^{T} U_{i} ‖}_{2}^{2}$. Note that this second term is simply a version of the eigenspace overlap score $\frac{1}{\max (d, k)} {‖ {\tilde{U}}^{T} U ‖}_{F}^{2} = \frac{1}{\max (d, k)} \sum_{i = 1}^{d} {‖ {\tilde{U}}^{T} U_{i} ‖}_{2}^{2}$ which weights the projections of the different singular vectors $U_{i}$ of X onto $\tilde{U}$ according to the squared singular values of X. In Section B.1 we show that in the case where the random label vector is $\bar{y} = U z$, where z is a zero-mean random variable in $ℝ^{d}$ with covariance Σ, the expected error depends on a term ${‖ {\tilde{U}}^{T} U Σ^{1 / 2} ‖}_{F}^{2}$. Thus, the projected embedding reconstruction error is directly related to the expected error when z is sampled with covariance matrix Σ = S².
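The identity derived above can be verified numerically: solving the least squares problem for $P$ directly should give the same value as $‖ X ‖_{F}^{2} - \sum_{i} σ_{i}^{2} {‖ {\tilde{U}}^{T} U_{i} ‖}_{2}^{2}$. In the sketch below the random matrices and their dimensions are placeholders chosen for illustration, and both matrices are assumed to have full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 120, 10, 6
X = rng.normal(size=(n, d))
X_tilde = rng.normal(size=(n, k))   # stand-in for a compressed embedding

# Direct computation: P* = argmin_P ||X_tilde P - X||_F^2, one least squares
# problem per column of X (lstsq accepts a matrix right-hand side).
P_star, *_ = np.linalg.lstsq(X_tilde, X, rcond=None)   # shape (k, d)
direct = np.linalg.norm(X_tilde @ P_star - X, "fro") ** 2

# Closed form: ||X||_F^2 - sum_i sigma_i^2 ||U_tilde^T U_i||_2^2.
U, s, _ = np.linalg.svd(X, full_matrices=False)
U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
per_vector = np.sum((U_t.T @ U) ** 2, axis=0)          # ||U_tilde^T U_i||^2 for each i
closed_form = np.linalg.norm(X, "fro") ** 2 - np.sum(s ** 2 * per_vector)

# Unweighted version of the same sum: the eigenspace overlap score.
overlap = per_vector.sum() / max(d, k)
print(direct, closed_form, overlap)
```

Replacing the weights `s ** 2` with ones turns the subtracted term into $\max (d, k)$ times the eigenspace overlap score, which is the relationship the paragraph above describes.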

In Table 4 we show that the projected embedding reconstruction error, like the eigenspace overlap score, attains high Spearman correlation with downstream performance.

Table 4: Spearman correlation between projected embedding reconstruction error and downstream performance.

In this table, we show that the Spearman correlation ρ between the projected embedding reconstruction error and downstream performance is relatively similar to the Spearman correlation between the eigenspace overlap score and downstream performance. We show results on the SQuAD question answering task, and the SST-1 sentiment analysis task, for both GloVe and fastText embeddings. In each table entry, we present the correlation absolute values as “GloVe |ρ| | fastText |ρ|.”

	SQuAD		SST-1

Projected embed. reconst. error	0.82	0.86	0.75	0.64
1 − $E$	0.81	0.91	0.75	0.73


C. The Eigenspace Overlap Score of Uniformly Quantized Embeddings

This Appendix focuses on the eigenspace overlap score of uniformly quantized embeddings. In Appendix C.1 we prove our result on the expected eigenspace overlap score of uniformly quantized embeddings (Theorem 2). In Appendix C.2 we validate that the empirical scaling of the eigenspace overlap score with respect to the vocabulary size, embedding dimension, compression rate, and smallest singular value of the embedding matrix, matches the scaling predicted by the theory. Lastly, in Appendix C.3, we demonstrate that choosing the clipping value for uniform quantization is crucial for attaining a high eigenspace overlap score, and that choosing the clipping threshold with lowest reconstruction error is very similar to choosing the clipping threshold with highest eigenspace overlap score. Additionally, we demonstrate that the optimal clipping thresholds for deterministic and stochastic quantization are very similar, and that deterministic quantization attains slightly higher eigenspace overlap scores than stochastic quantization.

C.1. Theorem 2 Proof

We now prove Theorem 2, which bounds the expected eigenspace overlap score for uniformly quantized embeddings. The core of our proof is an application of the Davis-Kahan $\sin (Θ)$ theorem [8]. We first review this classic theorem, and then prove our result.

Theorem 5.

(Davis-Kahan $\sin (Θ)$ Theorem (adapted)) Let $K = U_{0} S_{0} U_{0}^{T} + U_{1} S_{1} U_{1}^{T}$ be the eigendecomposition of $K$ such that $U_{0} \in ℝ^{n \times d}$ are the first $d$ eigenvectors of $K = U S U^{T}$, $S_{0}$ the first $d$ eigenvalues, and $U_{1}$, $S_{1}$ the rest. Similarly, let $\tilde{K} = V_{0} R_{0} V_{0}^{T} + V_{1} R_{1} V_{1}^{T}$ be the equivalent eigendecomposition for $\tilde{K} = K + H$. If the eigenvalues of $S_{0}$ are contained in the interval $(a_{0}, a_{1})$, and the eigenvalues of $R_{1}$ are excluded from the interval $(a_{0} - δ, a_{1} + δ)$ for some $δ > 0$, then

‖ V_{1}^{T} U_{0} ‖ \leq \frac{‖ V_{1}^{T} H U_{0} ‖}{δ}

(6)

for any unitarily invariant norm $‖ \cdot ‖ .$
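As a sanity check, inequality (6) can be evaluated numerically in the Gram-matrix setting used in the rest of this appendix. The sketch below is illustrative only: the matrix sizes, the Gaussian perturbation, and the variable names are assumptions, and $δ$ is set to the smallest non-zero eigenvalue of $K$, as in the proof of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 5
X = rng.uniform(-1.0, 1.0, size=(n, d)) / np.sqrt(d)
C = 0.05 * rng.normal(size=(n, d))      # hypothetical perturbation of X
X_t = X + C

K = X @ X.T                             # Gram matrix of X (rank d)
K_t = X_t @ X_t.T                       # Gram matrix of the perturbed matrix
H = K_t - K

# The left singular vectors of X are the eigenvectors of K = X X^T.
U_full, s, _ = np.linalg.svd(X, full_matrices=True)
V_full, _, _ = np.linalg.svd(X_t, full_matrices=True)
U0 = U_full[:, :d]                      # eigenvectors of K with non-zero eigenvalues
V1 = V_full[:, d:]                      # eigenvectors of K_t with zero eigenvalues
delta = s[-1] ** 2                      # smallest non-zero eigenvalue of K

# The Frobenius norm is unitarily invariant, so (6) applies to it.
lhs = np.linalg.norm(V1.T @ U0, "fro")
rhs = np.linalg.norm(V1.T @ H @ U0, "fro") / delta
print(f"||V1^T U0||_F = {lhs:.4f} <= {rhs:.4f}")
```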

To prove Theorem 2, we will apply the Davis-Kahan sin(Θ) theorem to the setting where K is the Gram matrix of an uncompressed matrix X, and $\tilde{K}$ is the Gram matrix of a b-bit stochastic uniform quantization $\tilde{X}$ of X (see Definition 2). We now present and prove Theorem 2.

Theorem 2.

Let $X \in ℝ^{n \times d}$ be a bounded embedding matrix with $X_{i j} \in [- \frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}]$ and smallest singular value $σ_{\min} = a \sqrt{n / d}$, for $a \in (0, 1]$.¹⁰ Let $\tilde{X}$ be a b-bit stochastic uniform quantization of X. Then for n ≥ max(33, d), we can lower bound the expected eigenspace overlap score of $\tilde{X}$, over the randomness of the stochastic quantization, as follows:

E [1 - E (X, \tilde{X})] \leq \frac{20}{{(2^{b} - 1)}^{2} a^{4}} .

Proof. We will denote the Gram matrices of $X$ and $\tilde{X}$ by $K = X X^{T} = U S U^{T}$ and $\tilde{K} = \tilde{X} {\tilde{X}}^{T} = (X + C) {(X + C)}^{T} = V R V^{T}$. Here, $C$ is a stochastic matrix satisfying $E [C_{i j}] = 0$ and $\mathrm{Var} [C_{i j}] \leq δ_{b}^{2} / d$ for all $i, j$, where $δ_{b}^{2} := \frac{1}{{(2^{b} - 1)}^{2}}$ (see Appendix A.1). In our application of the Davis-Kahan $\sin (Θ)$ theorem, we will use $a_{0} = σ_{\min} (K)$, $a_{1} = \infty$, and $δ = σ_{\min} (K)$. Note also that $H = \tilde{K} - K = (X + C) {(X + C)}^{T} - X X^{T} = X C^{T} + C X^{T} + C C^{T}$. As in the theorem statement, we let $a \in (0, 1]$ be the scalar such that $σ_{\min} (X) = a \sqrt{\frac{n}{d}}$ (equivalently, $σ_{\min} (K) = a^{2} \frac{n}{d}$).

Using the Davis-Kahan $\sin (Θ)$ theorem, along with Lemma 6 (below), we can show the following:

\begin{aligned}
{‖ V_{1}^{T} U_{0} ‖}_{F} &\leq \frac{{‖ V_{1}^{T} H U_{0} ‖}_{F}}{σ_{\min} (K)} \\
&= \frac{{‖ V_{1}^{T} (X C^{T} + C X^{T} + C C^{T}) U_{0} ‖}_{F}}{σ_{\min} (K)} \\
&\leq \frac{{‖ V_{1}^{T} ‖}_{2} {‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F} {‖ U_{0} ‖}_{2}}{σ_{\min} (K)} && \text{(using } ‖ A B ‖_{F} \leq ‖ A ‖_{2} ‖ B ‖_{F} \text{ twice)} \\
&\leq \frac{{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}}{σ_{\min} (K)} && \text{(using } {‖ V_{1}^{T} ‖}_{2} = {‖ U_{0} ‖}_{2} = 1 \text{)} \\
\Rightarrow \frac{1}{d} {‖ V_{1}^{T} U_{0} ‖}_{F}^{2} &\leq \frac{{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}}{d \cdot σ_{\min} {(K)}^{2}} \\
\Leftrightarrow 1 - \frac{1}{d} {‖ V_{0}^{T} U_{0} ‖}_{F}^{2} &\leq \frac{{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}}{d \cdot σ_{\min} {(K)}^{2}} && \text{(using } {‖ V_{0}^{T} U_{0} ‖}_{F}^{2} + {‖ V_{1}^{T} U_{0} ‖}_{F}^{2} = {‖ U_{0} ‖}_{F}^{2} = d \text{)} \\
\Leftrightarrow 1 - E (X, \tilde{X}) &\leq \frac{{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}}{d \cdot σ_{\min} {(K)}^{2}} \\
\Rightarrow E [1 - E (X, \tilde{X})] &\leq E [\frac{{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}}{d \cdot σ_{\min} {(K)}^{2}}] \\
&= \frac{E [{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}]}{d \cdot σ_{\min} {(K)}^{2}} \\
&\leq \frac{\frac{20 n^{2} δ_{b}^{2}}{d}}{d \cdot σ_{\min} {(K)}^{2}} && \text{(by Lemma 6)} \\
&= \frac{20 n^{2} δ_{b}^{2}}{d^{2} a^{4} (n^{2} / d^{2})} \\
&= \frac{20 δ_{b}^{2}}{a^{4}} .
\end{aligned}

□

We now present and prove Lemma 6.

Lemma 6.

Let $X \in ℝ^{n \times d}$ be a bounded embedding matrix with $X_{i j} \in [- \frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}]$. Let $\tilde{X} = X + C$ be a b-bit stochastic uniform quantization of X. Then for n ≥ max(33, d), it follows that

E [{‖ X C^{T} + C X^{T} + C C^{T} ‖}_{F}^{2}] \leq \frac{20 n^{2} δ_{b}^{2}}{d} .

(7)

Proof. We will let $H := X C^{T} + C X^{T} + C C^{T}$. To bound $E [‖ H ‖_{F}^{2}] = \sum_{i, j = 1}^{n} E [H_{i j}^{2}]$, we will consider two cases: $H_{i j}$ for $i \neq j$, and $H_{i j}$ for $i = j$. We will let $x_{i}, c_{i} \in ℝ^{d}$ denote the $i$-th rows of $X$ and $C$ respectively.

Case 1: $i \neq j$

\begin{aligned}
E [H_{i j}^{2}] &= E [{(x_{i}^{T} c_{j} + c_{i}^{T} x_{j} + c_{i}^{T} c_{j})}^{2}] \\
&= E [{(x_{i}^{T} c_{j})}^{2} + {(c_{i}^{T} x_{j})}^{2} + {(c_{i}^{T} c_{j})}^{2}] && \text{(cross terms have zero expectation, as the entries of } C \text{ are independent and zero-mean)} \\
&= E [{(\sum_{k = 1}^{d} x_{i k} c_{j k})}^{2}] + E [{(\sum_{k = 1}^{d} c_{i k} x_{j k})}^{2}] + E [{(\sum_{k = 1}^{d} c_{i k} c_{j k})}^{2}] \\
&= E [(\sum_{k = 1}^{d} x_{i k}^{2} c_{j k}^{2})] + E [(\sum_{k = 1}^{d} c_{i k}^{2} x_{j k}^{2})] + E [(\sum_{k = 1}^{d} c_{i k}^{2} c_{j k}^{2})] \\
&= \sum_{k = 1}^{d} x_{i k}^{2} E [c_{j k}^{2}] + \sum_{k = 1}^{d} E [c_{i k}^{2}] x_{j k}^{2} + \sum_{k = 1}^{d} E [c_{i k}^{2}] E [c_{j k}^{2}] \\
&\leq \frac{δ_{b}^{2}}{d} \cdot \sum_{k = 1}^{d} x_{i k}^{2} + \frac{δ_{b}^{2}}{d} \cdot \sum_{k = 1}^{d} x_{j k}^{2} + \sum_{k = 1}^{d} {(\frac{δ_{b}^{2}}{d})}^{2} \\
&\leq \frac{δ_{b}^{2}}{d} \cdot {‖ x_{i} ‖}^{2} + \frac{δ_{b}^{2}}{d} \cdot {‖ x_{j} ‖}^{2} + \sum_{k = 1}^{d} {(\frac{δ_{b}^{2}}{d})}^{2} \\
&\leq \frac{2 δ_{b}^{2} + δ_{b}^{4}}{d} && \text{(using } {‖ x_{i} ‖}^{2} \leq 1 \text{)} \\
&\leq \frac{3 δ_{b}^{2}}{d} && \text{(using } δ_{b} \leq 1 \text{)} .
\end{aligned}

Case 2: $i = j$

\begin{aligned}
E [H_{i i}^{2}] &= E [{(x_{i}^{T} c_{i} + c_{i}^{T} x_{i} + c_{i}^{T} c_{i})}^{2}] \\
&= E [{(2 \sum_{k = 1}^{d} x_{i k} c_{i k} + \sum_{l = 1}^{d} c_{i l}^{2})}^{2}] \\
&= E [4 {(\sum_{k = 1}^{d} x_{i k} c_{i k})}^{2} + 4 (\sum_{k = 1}^{d} x_{i k} c_{i k}) \cdot (\sum_{l = 1}^{d} c_{i l}^{2}) + {(\sum_{l = 1}^{d} c_{i l}^{2})}^{2}] \\
&= E [4 \sum_{k = 1}^{d} x_{i k}^{2} c_{i k}^{2} + 4 \sum_{k, l = 1}^{d} x_{i k} c_{i k} c_{i l}^{2} + \sum_{k, l = 1}^{d} c_{i l}^{2} c_{i k}^{2}] \\
&= 4 \sum_{k = 1}^{d} x_{i k}^{2} E [c_{i k}^{2}] + 4 \sum_{k = 1}^{d} x_{i k} E [c_{i k}^{3}] + \sum_{k, l = 1}^{d} E [c_{i l}^{2} c_{i k}^{2}] \\
&\leq 4 \cdot \frac{δ_{b}^{2}}{d} \cdot \sum_{k = 1}^{d} x_{i k}^{2} + 4 \sum_{k = 1}^{d} \frac{1}{\sqrt{d}} {(\frac{2}{\sqrt{d} (2^{b} - 1)})}^{3} + \sum_{k, l = 1}^{d} {(\frac{2}{\sqrt{d} (2^{b} - 1)})}^{4} \\
&= 4 \cdot \frac{δ_{b}^{2}}{d} \cdot {‖ x_{i} ‖}^{2} + 4 d \cdot \frac{8}{d^{2} {(2^{b} - 1)}^{3}} + d^{2} \cdot \frac{16}{d^{2} {(2^{b} - 1)}^{4}} \\
&\leq \frac{4 δ_{b}^{2}}{d} + \frac{32}{d {(2^{b} - 1)}^{3}} + \frac{16}{{(2^{b} - 1)}^{4}} \\
&= \frac{4 δ_{b}^{2} + 32 δ_{b}^{3}}{d} + 16 δ_{b}^{4} \\
&\leq \frac{36 δ_{b}^{2}}{d} + 16 δ_{b}^{4} && \text{(using } δ_{b} \leq 1 \text{)} .
\end{aligned}

Now we can combine the above results:

\begin{aligned}
\sum_{i, j = 1}^{n} E [H_{i j}^{2}] &\leq \sum_{i \neq j} (\frac{3 δ_{b}^{2}}{d}) + \sum_{i = 1}^{n} (\frac{36 δ_{b}^{2}}{d} + 16 δ_{b}^{4}) \\
&= n (n - 1) (\frac{3 δ_{b}^{2}}{d}) + n (\frac{36 δ_{b}^{2}}{d} + 16 δ_{b}^{4}) \\
&= \frac{3 n^{2} δ_{b}^{2} - 3 n δ_{b}^{2} + 36 n δ_{b}^{2}}{d} + 16 n δ_{b}^{4} \\
&= \frac{3 n^{2} δ_{b}^{2} + 33 n δ_{b}^{2}}{d} + 16 n δ_{b}^{4} \\
&\leq \frac{4 n^{2} δ_{b}^{2}}{d} + 16 n δ_{b}^{4} && \text{(assuming } n \geq 33 \text{)} \\
&\leq \frac{4 n^{2} δ_{b}^{2}}{d} + \frac{16 n^{2} δ_{b}^{2}}{d} && \text{(assuming } n \geq d \text{)} \\
&= \frac{20 n^{2} δ_{b}^{2}}{d} .
\end{aligned}

□
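The bound in Lemma 6 can be compared against a Monte Carlo estimate. The following sketch assumes a standard unbiased stochastic rounding scheme onto $2^{b}$ evenly spaced levels in $[- r, r]$ as the quantizer (the precise definition is in Appendix A.1 and is not reproduced here); the function name `stochastic_quantize`, the matrix sizes, and the number of trials are illustrative choices.

```python
import numpy as np


def stochastic_quantize(X, b, r, rng):
    """Unbiased stochastic rounding onto 2^b evenly spaced levels in [-r, r]."""
    step = 2 * r / (2**b - 1)
    scaled = (np.clip(X, -r, r) + r) / step     # position measured in units of `step`
    low = np.floor(scaled)
    round_up = rng.random(X.shape) < (scaled - low)
    return (low + round_up) * step - r


rng = np.random.default_rng(0)
n, d, b, trials = 40, 8, 2, 200                 # satisfies n >= max(33, d)
r = 1 / np.sqrt(d)
X = rng.uniform(-r, r, size=(n, d))
delta_b_sq = 1 / (2**b - 1) ** 2

sq_norms = []
for _ in range(trials):
    C = stochastic_quantize(X, b, r, rng) - X   # quantization noise
    H = X @ C.T + C @ X.T + C @ C.T
    sq_norms.append(np.linalg.norm(H, "fro") ** 2)

estimate = float(np.mean(sq_norms))
bound = 20 * n**2 * delta_b_sq / d
print(f"Monte Carlo estimate of E||H||_F^2: {estimate:.2f}; Lemma 6 bound: {bound:.2f}")
```

Since the proof uses several loose inequalities, the estimate is expected to fall well below the bound.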

C.2. Empirical Validation of Theorem 2 Scaling

We now validate Theorem 2 empirically by showing the impact of the precision (b), the scalar (a), the vocabulary size (n), and the embedding dimension (d) on the eigenspace overlap score $E (X, \tilde{X})$ of uniformly quantized embedding matrices. As predicted by the theory, we will show in Figure 4 that $1 - E (X, \tilde{X})$ drops as b and a are increased, and is relatively unaffected by changes in n and d.

Figure 4: We measure the eigenspace overlap score $E$ of uniformly quantized embeddings with the uncompressed embedding for various precisions, values of a, vocabulary sizes n, and dimensions d. We observe that 1 − $E$ decays as the precision b and scalar a grow, and that 1 − $E$ is largely unaffected by the vocabulary size n and embedding dimension d.

We now describe our experimental protocol for studying the impact of each of these parameters on the eigenspace overlap score:

Precision (b), Figure 4(a): We randomly generate a 10⁴ × 10 matrix, with entries drawn uniformly from $[- \frac{1}{\sqrt{10}}, \frac{1}{\sqrt{10}}]$. We uniformly quantize this matrix with precisions $b \in \{1, 2, 4, 8, 16\}$, and compute the eigenspace overlap score between the quantized matrix and the original matrix. As one can see, $1 - E (X, \tilde{X})$ drops rapidly as the precision is increased.
Scalar (a), Figure 4(b): We randomly generate a 10⁴ × 10 matrix, with entries drawn uniformly from $[- \frac{1}{\sqrt{10}}, \frac{1}{\sqrt{10}}]$. We then multiply this matrix on the right by diagonal matrices with diagonal entries spaced logarithmically between 1 and each value in $\{1, 0.1, 0.01, 0.001, 0.0001\}$, thus generating matrices with increasingly small values of the scalar a. We uniformly quantize each of these matrices with precisions $b \in \{1, 2, 4\}$, and compute the eigenspace overlap score between the quantized matrices and the original matrices. As one can see, $1 - E (X, \tilde{X})$ drops as the scalar a increases.
Vocabulary size (n), Figure 4(c): We randomly generate n × 10 matrices for $n \in \{10^{2}, 3 \times 10^{2}, 10^{3}, 3 \times 10^{3}, 10^{4}, 3 \times 10^{4}, 10^{5}\}$, with entries drawn uniformly from $[- \frac{1}{\sqrt{10}}, \frac{1}{\sqrt{10}}]$. We uniformly quantize these matrices with precisions $b \in \{1, 2, 4\}$, and compute the corresponding eigenspace overlap scores. As one can see, the vocabulary size n has minimal impact on the eigenspace overlap score.
Embedding dimension (d), Figure 4(d): We randomly generate 10⁴ × d matrices for $d \in \{10, 30, 100, 300, 1000\}$, with entries drawn uniformly from $[- \frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}]$. We uniformly quantize these matrices with precisions $b \in \{1, 2, 4\}$, and compute the corresponding eigenspace overlap scores. As one can see, the embedding dimension d has minimal impact on the eigenspace overlap score.
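The precision experiment in the protocol above can be sketched as follows. This is an illustrative reconstruction rather than the code used to produce Figure 4: it assumes stochastic rounding (the protocol does not state which rounding mode was used), and the helper names are hypothetical. It also prints the Theorem 2 bound $20 / ({(2^{b} - 1)}^{2} a^{4})$ for comparison.

```python
import numpy as np


def stochastic_quantize(X, b, r, rng):
    """Unbiased stochastic rounding onto 2^b evenly spaced levels in [-r, r]."""
    step = 2 * r / (2**b - 1)
    scaled = (np.clip(X, -r, r) + r) / step
    low = np.floor(scaled)
    return (low + (rng.random(X.shape) < (scaled - low))) * step - r


def overlap_gap(X, X_tilde):
    """1 - E(X, X~), with E the eigenspace overlap score."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    score = np.linalg.norm(U_t.T @ U, "fro") ** 2 / max(U.shape[1], U_t.shape[1])
    return 1.0 - score


rng = np.random.default_rng(0)
n, d = 10_000, 10
r = 1 / np.sqrt(d)
X = rng.uniform(-r, r, size=(n, d))

# Scalar a from Theorem 2: sigma_min(X) = a * sqrt(n / d).
a = np.linalg.svd(X, compute_uv=False)[-1] * np.sqrt(d / n)

precisions = (1, 2, 4, 8, 16)
gaps, bounds = {}, {}
for b in precisions:
    gaps[b] = overlap_gap(X, stochastic_quantize(X, b, r, rng))
    bounds[b] = 20 / ((2**b - 1) ** 2 * a**4)
    print(f"b={b:2d}  1 - E = {gaps[b]:.3e}  Theorem 2 bound = {bounds[b]:.3e}")
```

Note that the bound is on the expectation of $1 - E$, whereas this sketch evaluates a single random draw per precision.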

Note that the bound in Theorem 2 can be vacuous when the embedding matrix has a quickly decaying spectrum, and thus a small value of a. This is a consequence of the proof of the Davis-Kahan $\sin (Θ)$ theorem, which uses the smallest eigenvalue of $X X^{T}$ to lower bound a matrix multiplication; this inequality is relatively tight when the spectrum of $X X^{T}$ decays slowly, but is quite loose otherwise.

C.3. Impact of Clipping and Deterministic vs. Stochastic Quantization on the Eigenspace Overlap Score

As shown in Algorithm 1 (described in Section D.3), clipping is the first step in the uniform quantization method we use for compressing word embeddings. Here, we show that clipping is important because it can significantly improve the eigenspace overlap scores of the compressed embeddings, compared to uniform quantization without clipping. Specifically, we compute the eigenspace overlap score of $Q_{b, r} ({clip}_{r} (X))$ (and ${\tilde{Q}}_{b, r} ({clip}_{r} (X))$) with $X$, for a range of clipping values $r \in [0, \max (| X |)]$, using the publicly available 300-dimensional pre-trained GloVe embeddings as $X$ (see Appendix D.2 for embedding details). Recall that $Q_{b, r}$ and ${\tilde{Q}}_{b, r}$ are the deterministic and stochastic b-bit uniform quantization functions for the interval $[- r, r]$, respectively (defined in Section A.1). In Figure 5(a), we plot the eigenspace overlap scores attained by both quantization methods as a function of the clipping value r, for precisions $b \in \{1, 2, 4\}$. We observe that choosing the value of r appropriately is crucial for attaining high eigenspace overlap scores. We also observe that deterministic quantization typically attains slightly higher eigenspace overlap scores than stochastic quantization. This result helps explain our empirical observation in Appendix E.9 that deterministic quantization often attains slightly better downstream performance than stochastic quantization.

Figure 5: (a) We plot the eigenspace overlap score as a function of the clipping threshold, for precisions $b \in \{1, 2, 4\}$ and for both stochastic and deterministic quantization. We observe that choosing the value of r appropriately is crucial for attaining high eigenspace overlap scores, and that deterministic quantization generally gives slightly higher eigenspace overlap scores than stochastic quantization. (b) For each precision b, we plot the clipping thresholds which give the highest eigenspace overlap scores and the lowest embedding reconstruction errors, for both stochastic and deterministic quantization. We observe that for both types of quantization, the optimal clipping threshold chosen according to embedding reconstruction error is very similar to the optimal clipping threshold chosen according to the eigenspace overlap score.

In Algorithm 1, we choose the clipping threshold $r^{*}$ which minimizes the embedding reconstruction error of the clipped and quantized embeddings. In Figure 5(b), we show that choosing the clipping threshold based on the embedding reconstruction error gives very similar results to choosing the clipping threshold based on the eigenspace overlap score, for both deterministic and stochastic quantization. This helps explain the strong downstream performance of the embeddings compressed using Algorithm 1.
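The comparison between the two threshold-selection criteria can be illustrated on synthetic data. The sketch below is not the GloVe experiment from Figure 5: it uses a Gaussian matrix as a stand-in, a coarse grid over r instead of a continuous search, and deterministic nearest rounding; all names and sizes are assumptions.

```python
import numpy as np


def quantize(X, b, r):
    """Clip to [-r, r], then round to the nearest of 2^b evenly spaced levels."""
    step = 2 * r / (2**b - 1)
    return np.round((np.clip(X, -r, r) + r) / step) * step - r


def eigenspace_overlap(X, X_tilde):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    return np.linalg.norm(U_t.T @ U, "fro") ** 2 / max(U.shape[1], U_t.shape[1])


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))      # synthetic stand-in for real embeddings
b = 2
max_abs = np.abs(X).max()
r_grid = np.linspace(0.1, max_abs, 40)

errors = np.array([np.linalg.norm(quantize(X, b, r) - X) for r in r_grid])
overlaps = np.array([eigenspace_overlap(X, quantize(X, b, r)) for r in r_grid])

r_by_error = r_grid[errors.argmin()]       # threshold with lowest reconstruction error
r_by_overlap = r_grid[overlaps.argmax()]   # threshold with highest eigenspace overlap
print(f"r* by reconstruction error: {r_by_error:.2f}")
print(f"r* by eigenspace overlap:   {r_by_overlap:.2f}")
print(f"overlap without clipping (r = max|X|): {overlaps[-1]:.3f}")
```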

D. Experiment Details

We now discuss in detail the protocols we used for all our experiments. In Appendix D.1, we describe the model architectures and datasets we use for each downstream task, including the train/development/test splits for each dataset. In Appendix D.2 we describe the pre-trained word embeddings we use. We then discuss in Appendix D.3 the details of the different compression methods we use. In Appendix D.4 we discuss the training details for each of the downstream tasks, including the hyperparameter grids we use to tune our models. Finally, in Appendix D.5 we describe the infrastructure we use to run our experiments.

D.1. Task Details

Question Answering

For the question answering task, we use the DrQA model [5] trained and evaluated on the Stanford Question Answering Dataset (SQuAD) [33]. For this task, given a paragraph and a corresponding question in natural language, the model must predict the start and end position, within the paragraph, of the answer to the question. We use the default train and development set splits for the SQuAD dataset, and report all results on the development set, as the test set is not publicly available. The DrQA model consists of a three-layer bidirectional LSTM model with 128-dimensional hidden units on top of a pretrained word embedding. We train the DrQA model on the SQuAD-v1.1 training set, and report the F1 score on the SQuAD-v1.1 development set. We use the implementation of the DrQA model from the Facebook Research DrQA repository.¹¹

Sentiment Analysis

For the sentiment analysis tasks, we use the convolutional neural network (CNN) architecture proposed by Kim [19], and evaluate performance on the datasets used in that work (see Section 3 of that paper for dataset details). We use the data released as part of the Harvard NLP group’s sentiment analysis repository.¹² For the datasets which are pre-split into train/development/test (SST-1, SST-2), we use these dataset splits. For the datasets which are pre-split into train/test (TREC), we take a random 10% of the training set as a development set. For the datasets which have no pre-specified splits (MR, Subj, CR, MPQA), we take a random 10% of the data as a test set, and a random 10% of the remaining data as a development set; the rest of the data is used as the training set. We tune hyperparameters (learning rate) on the development sets, and report results on the test sets. The CNN architecture we use for this task has one convolutional layer with multiple filters, followed by a ReLU non-linearity and a max-pooling layer. The convolutional layer uses filter windows of size 3, 4, and 5, each with 100 feature maps. As we use PyTorch [27] for all our experiments, we reimplemented this model architecture in PyTorch, using the original Theano implementation as a template.¹³

GLUE Tasks

The General Language Understanding Evaluation (GLUE) benchmark [38] is a collection of nine natural language understanding tasks. We summarize these tasks in Table 5, along with the evaluation metric used for each task. We use the default train and development set splits for each of these tasks. We tune hyperparameters (learning rate) on the development sets, and also report results on the development sets, as the test sets are not publicly available. For each task, we use the standard approach of adding a linear layer on top of the pre-trained BERT model, and then fine-tuning the model using the data for that task. To evaluate the performance of compressed embeddings on these tasks, we compress the WordPiece [40] embeddings in the pre-trained case-sensitive BERT_BASE model, and then fine-tune all the non-embedding model parameters, keeping the embeddings frozen during training. We use a third-party implementation of the BERT model, and of the fine-tuning procedure.¹⁴ We run experiments on all the GLUE tasks except WNLI. We skip the WNLI dataset because this is a dataset on which it is very difficult to outperform the trivial model which always outputs the majority class. This trivial model attains 65.1% accuracy, and only two of the contributors to the GLUE leaderboard¹⁵ have outperformed this model, as of this writing.

Table 5: The GLUE datasets, along with the evaluation metric used for each dataset.

For the MRPC and QQP datasets, the average of the F1 score and accuracy on the development set is used. For the STS-B dataset, the average of the Pearson and Spearman correlations on the development set is used. For the MNLI dataset, the average of the accuracies on the matched and mismatched development sets is used.

Datasets	Evaluation Metrics
The Corpus of Linguistic Acceptability (CoLA)	Matthews Correlation
The Stanford Sentiment Treebank (SST-2)	Accuracy
Microsoft Research Paraphrase Corpus (MRPC)	F1 / Accuracy
Semantic Textual Similarity Benchmark (STS-B)	Pearson-Spearman Correlation
Quora Question Pairs (QQP)	F1 / Accuracy
Multi-Genre Natural Language Inference (MNLI)	Accuracy (matched/mismatched)
Question Natural Language Inference (QNLI)	Accuracy
Recognizing Textual Entailment (RTE)	Accuracy
Winograd Natural Language Inference (WNLI)	Accuracy

D.2. Word Embedding Details

For the GloVe embeddings, we use publicly available embeddings pre-trained on the Wikipedia 2014 and Gigaword 5 corpora.¹⁶ These are available for dimensions $d \in \{50, 100, 200, 300\}$; we use the 300-dimensional embeddings for all our experiments, except for our GloVe dimensionality reduction experiments, where we use the lower-dimensional embeddings. For the fastText embeddings, we use the publicly available 300-dimensional embeddings trained on the Wikipedia 2017 corpus, the UMBC webbase corpus, and the statmt.org news dataset.¹⁷ For the WordPiece embeddings [40], we use the embeddings which are part of the pre-trained case-sensitive BERT_BASE model, available through the Hugging Face BERT repository.¹⁸

D.3. Compression Method Details

Uniform Quantization

In Algorithm 1 we show how we use uniform quantization to compress word embeddings. The input to the algorithm is an embedding matrix $X \in ℝ^{n \times d}$, where n is the size of the vocabulary, and d is the dimension of the embeddings. We define the function ${clip}_{r} (x) = \max (\min (x, r), - r)$ for any non-negative r; when a matrix is passed as input to this function, it clips the entries element-wise. Given an input embedding and a desired number of bits to use per entry of the compressed embedding matrix, the uniform quantization method operates in two steps:

Step 1: We find the value of $r \in [0, \max (| X |)]$ which minimizes the reconstruction error of the quantized embeddings after X is clipped to $[- r, r]$. More formally, we let $r^{*} := \arg \min_{r \in [0, \max (| X |)]} ‖ Q_{b, r} ({clip}_{r} (X)) - X ‖_{F}$, and use this value $r^{*}$ to clip X. In our experiments, we find $r^{*}$ to within a specified tolerance ϵ = 0.01 using the golden-section search algorithm [18]. To avoid stochasticity impacting the search process for the clipping threshold, we always use deterministic rounding in the search for $r^{*}$, regardless of whether we use stochastic rounding or deterministic nearest rounding in the final quantization after clipping the extremal values.
Step 2: We quantize the clipped embeddings to b bits per entry with $Q_{b, r} .$

In all of our main experiments on the downstream performance (question answering, sentiment analysis, GLUE tasks) of compressed word embeddings, we use the deterministic quantization function $Q_{b, r}$ introduced in Appendix A.1 for both steps of this algorithm. However, in Appendix E.9 we use the stochastic quantization function ${\tilde{Q}}_{b, r}$ for the second step of this compression algorithm, and show that it performs similarly to deterministic quantization on downstream tasks.
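The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: `quantize_deterministic` assumes nearest rounding onto $2^{b}$ evenly spaced levels in $[- r, r]$ (the exact definition of $Q_{b, r}$ is in Appendix A.1), `golden_section_min` is a textbook golden-section search that finds a local minimizer, and all function names are hypothetical.

```python
import numpy as np


def clip(X, r):
    """Element-wise clip_r(x) = max(min(x, r), -r)."""
    return np.maximum(np.minimum(X, r), -r)


def quantize_deterministic(X, b, r):
    """Nearest rounding onto 2^b evenly spaced levels in [-r, r] (assumed Q_{b,r})."""
    step = 2 * r / (2**b - 1)
    return np.round((X + r) / step) * step - r


def golden_section_min(f, lo, hi, tol=0.01):
    """Shrink [lo, hi] around a local minimizer of f until its width is below tol."""
    inv_phi = (np.sqrt(5) - 1) / 2
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:                      # minimizer lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
        else:                            # minimizer lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
    return (lo + hi) / 2


def reconstruction_error(X, b, r):
    return np.linalg.norm(quantize_deterministic(clip(X, r), b, r) - X)


def compress_uniform(X, b, tol=0.01):
    """Step 1: search for r*. Step 2: clip to [-r*, r*] and quantize to b bits."""
    max_abs = np.abs(X).max()
    r_star = golden_section_min(lambda r: reconstruction_error(X, b, r), 0.0, max_abs, tol)
    return quantize_deterministic(clip(X, r_star), b, r_star), r_star


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # synthetic stand-in for an embedding matrix
X_q, r_star = compress_uniform(X, b=2)
print(f"r* = {r_star:.3f}, distinct values in X_q: {np.unique(X_q).size}")
```

The search never evaluates the objective at r = 0, which avoids a zero step size; swapping `np.round` for stochastic rounding in the final call would give the stochastic variant of step 2.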

Algorithm 1.

Uniform quantization for word embeddings

1:	Input: Embedding $X \in ℝ^{n \times d}$; quantization func. $Q_{b, r}$; clipping func. ${clip}_{r}$: $ℝ \to [- r, r]$.
2:	Output: Quantized embedding $\tilde{X} \in ℝ^{n \times d}$.
3:	$r^{*} := \arg \min_{r \in [0, \max (| X |)]} {‖ Q_{b, r} ({clip}_{r} (X)) - X ‖}_{F}$.
4:	Return: $Q_{b, r^{*}} ({clip}_{r^{*}} (X))$.

K-means

The k-means clustering method can be used to compress embeddings as follows: First, the one-dimensional k-means clustering algorithm is run on all the scalar entries in the full-precision embedding matrix X. Then, each entry in X is replaced by the centroid to which it is closest. If $2^{b}$ centroids are used during the clustering step, then for each entry of the compressed embedding matrix, only the integer index $j \in \{0, 1, \dots, 2^{b} - 1\}$ of the corresponding centroid needs to be stored; this requires b bits per entry. In our experiments, we use the scikit-learn [28] implementation of k-means. We use the default configuration from scikit-learn, which runs for a maximum of 300 iterations and can stop early if the relative decrease of the loss function is smaller than 10⁻⁴.
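The k-means compression scheme can be sketched with a small one-dimensional Lloyd's algorithm. This NumPy version is a stand-in, not the scikit-learn implementation used in our experiments: it assumes a quantile-based initialization and a simple relative-loss stopping rule, so its results need not match scikit-learn's. The function names are hypothetical.

```python
import numpy as np


def assign(values, centroids):
    """Index of the nearest centroid for each value (centroids must be sorted)."""
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(boundaries, values)


def kmeans_1d(values, k, max_iter=300, rel_tol=1e-4):
    """Lloyd's algorithm on scalars; quantile initialization is an assumption."""
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    prev_loss = None
    for _ in range(max_iter):
        codes = assign(values, centroids)
        loss = np.sum((values - centroids[codes]) ** 2)
        if prev_loss is not None and prev_loss - loss <= rel_tol * prev_loss:
            break
        prev_loss = loss
        sums = np.bincount(codes, weights=values, minlength=k)
        counts = np.bincount(codes, minlength=k)
        # Keep the old centroid for empty clusters; otherwise use the cluster mean.
        centroids = np.where(counts > 0, sums / np.maximum(counts, 1), centroids)
    return centroids, assign(values, centroids)


def compress_kmeans(X, b):
    """Return (centroids, integer codes); each code needs b bits of storage."""
    centroids, codes = kmeans_1d(X.ravel(), 2**b)
    return centroids, codes.reshape(X.shape)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # synthetic stand-in for an embedding matrix
centroids, codes = compress_kmeans(X, b=2)
X_tilde = centroids[codes]                # decompressed embedding
print(f"relative error: {np.linalg.norm(X_tilde - X) / np.linalg.norm(X):.3f}")
```

Only `codes` and the $2^{b}$ centroids need to be stored; `X_tilde` is reconstructed by table lookup.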

Deep Compositional Code Learning (DCCL)

We give an overview of the DCCL method [34] in Section 2.1. The important hyperparameters for this method include the learning rate η of the Adam optimizer [20], the number of dictionaries m, the size k of each dictionary, the temperature parameter τ for Gumbel sampling, and the mini-batch size. To select the learning rate η and the dictionary size k for each compression rate, we perform a grid search using the Cartesian product of $η \in \{0.00001, 0.00003, 0.0001, 0.0003, 0.001\}$ and $k \in \{2, 4, 8, 16\}$ for each uncompressed embedding type (GloVe, fastText, BERT WordPiece embeddings) and compression rate. Note that a compression rate and a dictionary size k together uniquely determine the number of dictionaries m to use. We select the combination of learning rate and dictionary size which minimizes the reconstruction error of the compressed embeddings. When compressing BERT WordPiece embeddings, we extend the dictionary size grid to $k \in \{2, 4, 8, 16, 32, 64, 128, 256\}$ so that the optimal dictionary size does not lie on the boundary of the grid. We provide the optimal learning rates and dictionary sizes in Table 6 for reproducibility. For the temperature parameter τ, we follow Shu and Nakayama [34] and consistently use τ = 1.0. For all our experiments we use a mini-batch size of 64, which is the default value in the DCCL repository.¹⁹

Table 6:

The optimal learning rates η and dictionary sizes k for DCCL.

Embedding	GloVe			fastText			BERT WordPiece
Compression rate	8×	16×	32×	8×	16×	32×	8×	16×	32×
k	8	4	4	8	4	8	128	64	32
η	0.0003	0.0003	0.0003	0.0001	0.0001	0.0001	0.0003	0.0003	0.0003

Dimensionality Reduction

The two dimensionality reduction methods we consider are (1) using pre-trained lower-dimensional embeddings, and (2) principal component analysis (PCA). For the GloVe embeddings, we use the publicly available lower-dimensional embeddings described in Appendix D.2. These embeddings are available for dimensions $d \in \{50, 100, 200, 300\}$, where we consider the 300-dimensional embeddings to be the “uncompressed” embeddings. For our experiments with fastText and BERT WordPiece embeddings, we use PCA to reduce the dimension of the embeddings, as these embeddings are not publicly available in lower dimensions. When we compress the 300-dimensional fastText and GloVe embeddings with dimensionality reduction, we use compression rates in $\{1 \times, 1.5 \times, 3 \times, 6 \times\}$. For the 768-dimensional BERT WordPiece embeddings, we use compression rates in $\{1 \times, 2 \times, 4 \times, 8 \times\}$.

We now give details on how we implement the PCA dimensionality reduction method. For an embedding $X \in ℝ^{n \times d}$ with vocabulary size n and dimension d, let $X = U S V^{T}$ be the SVD of $X$ with $U = [U_{1}, \dots, U_{d}]$, $S = diag ([s_{1}, \dots, s_{d}])$, and $V = [V_{1}, \dots, V_{d}]$. If we let $U_{(k)} := [U_{1}, \dots, U_{k}]$, $S_{(k)} := diag ([s_{1}, \dots, s_{k}])$, and $V_{(k)} := [V_{1}, \dots, V_{k}]$, then we use $\tilde{X} := U_{(k)} S_{(k)}$ as the k-dimensional compressed embedding. Note that for the GLUE tasks, we instead use $\tilde{X} := U_{(k)} S_{(k)} V_{(k)}^{T}$ to ensure that these compressed embeddings are compatible with the parameters of the pre-trained BERT model; because the dimension k of these compressed embeddings is small compared to the vocabulary size n, storing $V_{(k)}^{T}$ requires a relatively small amount of additional memory.
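The PCA compression described above can be sketched directly from the SVD. The function name `pca_compress` and its `keep_dim` flag are hypothetical; following the description above, no mean-centering is applied. As a consistency check, the eigenspace overlap score between $U_{(k)} S_{(k)}$ and $X$ equals $k / d$ when the singular values are distinct and non-zero.

```python
import numpy as np


def pca_compress(X, k, keep_dim=False):
    """Return U_(k) S_(k), or U_(k) S_(k) V_(k)^T if keep_dim is True."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] * s[:k]               # scales column i of U_(k) by s_i
    if keep_dim:
        return X_k @ Vt[:k, :]           # n x d matrix of rank k
    return X_k                           # n x k matrix


def eigenspace_overlap(X, X_tilde):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    return np.linalg.norm(U_t.T @ U, "fro") ** 2 / max(U.shape[1], U_t.shape[1])


rng = np.random.default_rng(0)
n, d, k = 1000, 30, 10
X = rng.normal(size=(n, d))              # synthetic stand-in for an embedding matrix

X_low = pca_compress(X, k)                   # used for most tasks
X_same = pca_compress(X, k, keep_dim=True)   # shape-compatible variant (GLUE)
score = eigenspace_overlap(X, X_low)
print(X_low.shape, X_same.shape, f"overlap = {score:.4f} (k/d = {k / d:.4f})")
```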

D.4. Training Details

We now discuss the training details for the different tasks we consider, focusing on how we tune the hyperparameters.

Question Answering

We use the default hyperparameters from the Facebook Research DrQA implementation for all our question answering experiments, as these are tuned for the SQuAD dataset.²⁰ We summarize these hyperparameters in Table 7.

Table 7:

Training hyperparameters for DrQA on the SQuAD dataset.

Hyperparameter	Value
Optimizer	Adamax
Decay rates for 1st moment β₁	0.9
Decay rates for 2nd moment β₂	0.999
Adamax ϵ	10⁻⁸
Learning rate	2 × 10⁻³
Batchsize	32
Training epochs	40
Dropout	0.4

Sentiment Analysis

We tune the learning rate for each of the sentiment analysis datasets using the grid {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1.0}. For this tuning process, we use the uncompressed embedding for each dataset and embedding type (GloVe, fastText), and pick the learning rate which attains highest average accuracy on the development set across five random seeds. This learning rate is then used to train the models that use the uncompressed embeddings, as well as the embeddings compressed using uniform quantization, k-means, and DCCL. Note that we tune the learning rate individually for each embedding compressed using dimensionality reduction (for both GloVe and fastText). We do this to ensure that the lower dimensionality of these compressed embeddings does not result in the learning rate being improperly tuned. We list the hyperparameters shared across datasets in Table 8 and list the optimal learning rate for each dataset and embedding type in Table 9.

Table 8:

Training hyperparameters shared across sentiment analysis datasets.

Hyperparameter	Value
Optimizer	Adam
Decay rates for 1st moment β₁	0.9
Decay rates for 2nd moment β₂	0.999
Adam ϵ	10⁻⁸
Batchsize	32
Training epochs	100
Dropout	0.5

Table 9:

The optimal learning rate η for different sentiment analysis datasets.

Datasets	MR	SST-1	SST-2	Subj	TREC	CR	MPQA
GloVe uncompressed	0.001	0.001	0.001	0.001	0.001	0.001	0.001
GloVe dim. red. 1×	0.001	0.001	0.001	0.001	0.001	0.001	0.001
GloVe dim. red. 1.5 ×	0.001	0.001	0.001	0.001	0.001	0.001	0.001
GloVe dim. red. 3×	0.0001	0.001	0.001	0.001	0.001	0.001	0.001
GloVe dim. red. 6×	0.001	0.001	0.001	0.001	0.001	0.001	0.001
fastText uncompressed	0.001	0.001	0.001	0.001	0.001	0.001	0.001
fastText dim. red. 1×	0.0001	0.001	0.001	0.001	0.001	0.001	0.001
fastText dim. red. 1.5 ×	0.0001	0.001	0.001	0.001	0.001	0.001	0.001
fastText dim. red. 3×	0.0001	0.001	0.001	0.001	0.001	0.001	0.001
fastText dim. red. 6×	0.001	0.0001	0.001	0.001	0.001	0.001	0.001

GLUE Tasks

We tune the learning rate for each of the GLUE tasks using the grid {10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 5 × 10⁻⁵, 10⁻⁴}. When tuning the learning rate, we use the uncompressed WordPiece embeddings, and we fine-tune the entire model, without freezing the embedding parameters. For each task we pick the learning rate which gives the best average performance (according to the metrics in Table 5) on the development set, across five random seeds. The optimal learning rates are listed in Table 10 for all the GLUE tasks we run. We use the default values (from both the Google Research TensorFlow BERT repository²¹ and the Hugging Face PyTorch BERT repository²²) for the other hyperparameters. Specifically, we fine-tune the model for 3 epochs using the Adam optimizer with a mini-batch size of 32, and a weight decay strength of 0.01 (weight decay is not applied to the layer norm layers or to the bias parameters). We use a linear learning rate warm-up for the first 10% of training (learning rate grows linearly from 0× to 1× the specified learning rate), and then a linear learning rate decay for the remaining 90% of training (learning rate decays linearly from 1× to 0× the specified learning rate).
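The warm-up and decay schedule described above can be written as a small function of the training step. This is a sketch of the schedule as described in the text, not the exact code from either BERT repository; the function name and the step-based parameterization are assumptions.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warm-up from 0 to peak_lr, then linear decay from peak_lr to 0."""
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# Example with a hypothetical run of 1000 optimizer steps and peak rate 3e-5.
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000, peak_lr=3e-5))
```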

Table 10:

The optimal learning rate η for different GLUE tasks

Tasks	MNLI	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE
η	3 × 10⁻⁵	3 × 10⁻⁵	3 × 10⁻⁵	3 × 10⁻⁵	10⁻⁵	2 × 10⁻⁵	2 × 10⁻⁵	2 × 10⁻⁵

D.5. Infrastructure Details

We run our experiments using AWS p2.xlarge instances, which have NVIDIA Tesla K80 GPUs. We use Python 3.6 for our experiments. For compatibility with the DrQA repository (which had not been ported to PyTorch 1.0 when we began our experiments), we use PyTorch 0.3.1 for the question answering and sentiment analysis tasks. For the GLUE tasks we use PyTorch 1.0.

E. Extended Empirical Results

We now provide a more complete version of the empirical results included in the main body of the paper, as well as a number of additional experiments validating claims related to our work. More specifically:

In Appendix E.1 we present extended results comparing the downstream performance of the different compression methods across a range of compression rates for the GloVe, fastText, and BERT WordPiece embeddings, on question answering, sentiment analysis, and GLUE tasks. We show that uniform quantization can consistently match or outperform the other compression methods across these settings.
In Appendix E.2 we present experiments comparing the performance of the different compression methods when applied to compressing task-specific embeddings which have been trained end-to-end for a translation task. Though our main focus in this paper is compressing task-agnostic embeddings (e.g., GloVe, fastText), we show that uniform quantization can effectively compete with a recently proposed tensorized factorization [17] of the embedding matrix designed for the task-specific setting.
In Appendix E.3 we study whether, under a fixed memory budget, it is better to use low-dimensional high-precision embeddings, or high-dimensional low-precision embeddings. We show that under a wide range of memory budgets, one can attain large improvements in downstream performance on the SQuAD question answering task by using high-dimensional low-precision embeddings in place of lower dimensional high-precision embeddings.
In Appendix E.4 we present extended results comparing the eigenspace overlap scores of the different compression methods, for different compression rates and embedding types. We show that uniform quantization can attain comparable or higher eigenspace overlap scores relative to the other compression methods, helping to explain the strong empirical performance of this compression method.
In Appendix E.5 we present extended results on the correlations between downstream performance and the different measures of compression quality. We show that across the question answering, sentiment analysis, and GLUE tasks we consider, the eigenspace overlap score consistently attains higher Spearman correlation with downstream performance than the other measures of compression quality (PIP loss, Δ, Δ_max).
In Appendix E.6 we show that the eigenspace overlap score also correlates better with downstream performance than the Δ₁ and Δ₂ compression quality metrics, across a range of tasks.
In Appendix E.7 we show that our claim that the eigenspace overlap score correlates better with downstream performance than Δ and Δ_max is robust to the choice of the parameter λ used when computing the values of Δ and Δ_max.
In Appendix E.8 we show that the eigenspace overlap score is a more robust selection criterion for choosing between pairs of compressed embeddings than the other measures of compression quality.
In Appendix E.9 we compare the downstream performance of embeddings compressed using deterministic vs. stochastic uniform quantization. We show these methods perform similarly, though the deterministic quantization performs slightly better at precision b = 1.

We present all these results in more detail below.

E.1. Downstream Performance vs. Compression Rate: Pre-Trained Embeddings

In Figures 6 (GloVe), 7 (fastText), and 8 (BERT), we show the downstream performance of the embeddings compressed using different compression methods, across question answering, sentiment analysis, and GLUE tasks. We show that the simple uniform quantization method can match or outperform the other compression methods across these tasks. We also observe that for the GLUE tasks (Figure 8), freezing the WordPiece embeddings during the BERT model fine-tuning does not observably hurt downstream performance.

Figure 6: — We evaluate the downstream performance of the different compression methods on question answering and sentiment analysis tasks, across different compression rates. For question answering, we use the SQuAD dataset, and for sentiment analysis we use the MR, SST-1, SST-2, Subj, TREC, CR and MPQA datasets. We show average performance across five random seeds, with error bars indicating standard deviations.

Figure 7: — We evaluate the downstream performance of the different compression methods on question answering and sentiment analysis tasks, across different compression rates. For question answering, we use the SQuAD dataset, and for sentiment analysis we use the MR, SST-1, SST-2, Subj, TREC, CR and MPQA datasets. We show average performance across five random seeds, with error bars indicating standard deviations.

Figure 8: — We evaluate the downstream performance of the different compression methods on all GLUE tasks except WNLI (as discussed in Appendix D.1), across different compression rates. In these plots, the horizontal dashed pink line marks the performance of the BERT model fine-tuned with uncompressed and unfrozen WordPiece embeddings. We show average performance across five random seeds, with error bars indicating standard deviations.

E.2. Downstream Performance vs. Compression Rate: Task-Specific Embeddings

The main focus of our work is on understanding the downstream performance of NLP models trained using compressed pre-trained word embeddings. Recently, Khrulkov et al. [17] proposed compressing word embedding matrices by parameterizing them as a product of tensors, and then learning the entries of these tensors jointly with the downstream NLP model in a task-specific, end-to-end fashion; they call this method a Tensor Train (TT) decomposition of the embedding matrix. In this section, we show that we can apply uniform quantization to compress task-specific word embeddings, and attain downstream performance that is competitive with the TT method.

Task details

We consider the IWSLT’14 German-to-English translation task [4]. We use a six-layer Transformer [37] based translation model for this task, and use the Fairseq [26] implementation of this model. In our experiments, across all compression rates and compression methods, we train for 50000 steps, and use the same model size with a 512-dimensional transformer hidden layer; thus, the uncompressed embeddings are 512 dimensional. We use the default training and inference hyperparameters for this German-to-English translation task in the Fairseq repository; we list the values of these hyperparameters in Table 12. To be compatible with the Fairseq implementation, we run these experiments using PyTorch 1.0.

Table 12:

The hyperparameters we use for our experiments on the IWSLT’14 German-to-English translation task.

Hyperparameter	Value
Optimizer	Adam
Adam decay rates for 1st moment β₁	0.9
Adam decay rates for 2nd moment β₂	0.999
Adam ϵ	10⁻⁸
Training steps	50000
Learning rate schedule	$10^{-7} + (5 \times 10^{-4} - 10^{-7}) \, n / 4000$ for step $n \le 4000$; $5 \times 10^{-4} \sqrt{4000 / n}$ for step $n > 4000$
Warmup initial learning rate	10⁻⁷
Dropout	0.3
Weight decay	0.0001
Beam search width	5
Transformer hidden dimension	512
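
The learning rate schedule in Table 12 is a linear warmup followed by an inverse-square-root decay. The following is a minimal sketch of that formula in plain Python; the function and parameter names are illustrative assumptions and are not taken from the Fairseq code base.

```python
import math


def learning_rate(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_steps=4000):
    """Learning rate at a given training step, using the values in Table 12."""
    if step <= warmup_steps:
        # Linear interpolation from warmup_init_lr (step 0) to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_steps
    # Decay proportional to 1 / sqrt(step); continuous at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


for n in (0, 2000, 4000, 16000, 50000):
    print(n, learning_rate(n))
```

The two branches agree at step 4000, so the schedule has no jump at the end of the warmup; at step 16000 the rate has decayed to half of its peak value.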

Compression method details

We now provide details on how we apply the different compression methods in this task-specific setting. Note that because TT can achieve compression rates greater than 32×, we run experiments both above and below this compression rate.

Dimensionality reduction: We randomly initialize lower-dimensional embeddings, and train the parameters of these embeddings jointly with the rest of the model.
Uniform quantization: We jointly train the full-precision embedding matrix and the transformer model for the first half of the training steps; we then compress this embedding matrix with uniform quantization (Algorithm 1), and keep the embedding parameters fixed for the remainder of training. To attain a compression rate c > 32, we perform the first half of training using lower-dimensional embeddings (compression rate c/32), and then apply uniform quantization to these lower-dimensional embeddings with compression rate 32.
K-means: We use the same protocol as we do for uniform quantization, but apply the k-means compression method in place of uniform quantization.
DCCL: As we do for uniform quantization and k-means, we jointly train the full-precision embedding matrix and the transformer model for the first half of the training steps; we then compress the embeddings with DCCL, and perform the rest of training with the embedding parameters fixed. We grid search the dictionary size k ∈ {2, 4, 8, 16, 32, 64} and the learning rate η ∈ {0.00003, 0.0001, 0.0003, 0.001, 0.003, 0.01} for DCCL, and pick the combination of values which minimizes the embedding reconstruction error with respect to the embeddings generated in the first half of training. We show the optimal hyperparameters for each compression rate in Table 11.
Tensor Train: We use the TT method in the manner described in the original paper [17]. For each compression rate, there are two hyperparameters that must be tuned—the number of tensor factors and the “TT-rank” of these factors. We consider 3 and 4 as the number of factors, following the values used in the paper [17], and pick the one which gives the lowest validation perplexity. Given the number of factors, the TT-rank of these factors is automatically determined for a given compression rate.
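
The arithmetic behind the compression rates for uniform quantization and k-means can be sketched as follows. This is an illustration under stated assumptions: the helper `compression_plan` is hypothetical, the uncompressed embeddings are taken to be 512-dimensional with 32 bits per entry, and compression rates are assumed to be powers of two.

```python
def compression_plan(rate, full_dim=512, full_bits=32):
    """Return (dimension, bits per entry) for a target compression rate.

    Rates up to full_bits are reached by lowering the precision alone.
    Larger rates first shrink the dimension by a factor of rate / full_bits
    and then quantize the lower-dimensional embeddings to 1 bit.
    """
    if rate <= full_bits:
        return full_dim, full_bits // rate
    return full_dim * full_bits // rate, 1


for rate in (8, 16, 32, 64, 128, 256):
    dim, bits = compression_plan(rate)
    print(f"{rate}x: {dim} dimensions at {bits} bit(s) per entry")
```

For example, a compression rate of 128 corresponds here to training 128-dimensional embeddings (a 4× reduction) and then quantizing them to 1 bit (a further 32× reduction).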

Table 11:

The optimal learning rates η and dictionary sizes k for DCCL for compressing the task-specific embeddings for the IWSLT’14 translation task.

Compression rate	8×	16×	32×	64×	128×	256×
k	16	16	4	4	4	2
η	0.0003	0.0003	0.001	0.001	0.001	0.001

Results

In Figure 9 we plot the average test BLEU4 score across five random seeds for the compression methods described above, at a wide range of compression rates; because for some random seeds the TT method attains very low BLEU scores, for the TT method we plot the BLEU4 score of the seed which performs best. We observe that the uniform quantization and k-means methods generally achieve better BLEU4 scores than the TT method up to compression rate 128×, and that the dimensionality reduction method performs significantly worse than the other methods beyond compression rate 8×. These observations suggest that uniform quantization and k-means can be effectively applied to compress task-specific embeddings.

E.3. Dimension vs. Precision Trade-Off

We show that in the memory-constrained setting, using low-precision high-dimensional embeddings typically outperforms using high-precision low-dimensional embeddings which occupy the same memory. To demonstrate this, we train GloVe embeddings (details below) of dimensions d ∈ {25, 50, 100, 200, 400}, and then compress each of these embeddings using uniform quantization with precisions b ∈ {1, 2, 4, 8, 16, 32} (32 bits represents no compression). We then train DrQA models [5] using all of these embeddings on the SQuAD dataset [33], and CNN models [19] on the SST-1 sentiment analysis dataset. In Figure 10 we present the downstream performance of all of these models (y-axis) in terms of the memory occupied by the embeddings (x-axis). As we can see, across a range of memory budgets, it is optimal to use low-precision (1 bit) high-dimensional embeddings, as this allows for using the largest dimension possible under that memory budget.
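
To make the trade-off concrete, the sketch below groups the (dimension, precision) pairs from this experiment by the memory they occupy. It assumes the embedding matrix is stored as vocabulary size × d × b bits, with the vocabulary size of 400000 used for GloVe training, and it ignores any small overhead for quantization metadata; the helper name `memory_bits` is illustrative.

```python
from collections import defaultdict

VOCAB_SIZE = 400_000
DIMS = (25, 50, 100, 200, 400)
BITS = (1, 2, 4, 8, 16, 32)


def memory_bits(dim, bits, vocab_size=VOCAB_SIZE):
    """Bits needed to store a vocab_size x dim matrix at `bits` bits per entry."""
    return vocab_size * dim * bits


# Group the (dimension, precision) pairs that occupy exactly the same memory.
configs_by_budget = defaultdict(list)
for dim in DIMS:
    for bits in BITS:
        configs_by_budget[memory_bits(dim, bits)].append((dim, bits))

for budget in sorted(configs_by_budget):
    megabytes = budget / 8 / 1e6
    print(f"{megabytes:8.2f} MB: {configs_by_budget[budget]}")
```

Under these assumptions, the 40 MB budget of a 25-dimensional full-precision embedding is shared by a 400-dimensional embedding at 2 bits per entry, which is the kind of comparison plotted along the x-axis of the figure.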

GloVe embedding training details

We train GloVe embeddings on a full English Wikimedia dump from December 4, 2017, which was pre-processed by a fastText script²³ while keeping the letter cases and digits. We use the GloVe GitHub repository²⁴ for embedding training. We use a vocabulary size of 400000, a window size of 15, a learning rate of 0.05, and train for 50 epochs.

E.4. Eigenspace Overlap Score vs. Compression Rate

In Figure 11, we plot the eigenspace overlap scores attained by the different compression methods at different compression rates for GloVe, fastText, and BERT WordPiece embeddings. We observe that uniform quantization consistently attains eigenspace overlap scores that match or exceed those of the other compression methods. Based on the theoretical connection between the eigenspace overlap score and downstream performance, this empirical observation helps explain the strong downstream performance of embeddings compressed with uniform quantization.
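
For reference, the score plotted here can be computed with a few lines of NumPy. This is a minimal sketch, assuming the definition of the eigenspace overlap score from the main body of the paper, $E = \|U^{T} \tilde{U}\|_{F}^{2} / \max(d, k)$, where $U$ and $\tilde{U}$ are the left singular vectors of the uncompressed (n × d) and compressed (n × k) embedding matrices, and assuming both matrices have full column rank. The function name and the sign-based compression used as a stand-in are illustrative only.

```python
import numpy as np


def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap score between X (n x d) and X_tilde (n x k).

    Assumes both matrices have full column rank, so that their left
    singular vectors span their column spaces.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_tilde, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    overlap = np.linalg.norm(U.T @ U_tilde, "fro") ** 2
    return overlap / max(X.shape[1], X_tilde.shape[1])


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))

# Identical embeddings span the same subspace, so the score is 1 up to rounding.
print(eigenspace_overlap(X, X))
# A crude 1-bit stand-in (the sign of each entry) gives a score between 0 and 1.
print(eigenspace_overlap(X, np.sign(X)))
```

Because the score depends only on the spans of the left singular vectors, it is unchanged when either matrix is multiplied on the right by an invertible matrix.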

E.5. Downstream Performance vs. Measures of Compression Quality

We show across tasks and embedding types that the eigenspace overlap score correlates better with downstream performance than the other measures of compression quality. In Figures 12, 13, and 14, we plot the downstream performance (y-axis) of the compressed GloVe, fastText, and BERT WordPiece embeddings (respectively) on a variety of tasks, as a function of the different measures of compression quality (x-axis). For GloVe and fastText, we show performance on question answering (SQuAD) and on the largest sentiment analysis dataset (SST-1). For BERT, we show performance on MNLI and QQP, the two largest GLUE datasets. We see in these plots that the eigenspace overlap score generally aligns quite well with downstream performance, while the other measures of compression quality often do not. To quantify this observation, we measure the Spearman correlations between the downstream performances of the embeddings we compressed, and the various measures of compression quality. We include these correlations for all the sentiment analysis tasks for the GloVe and fastText embeddings in Table 13, and for all the GLUE tasks for the BERT WordPiece embeddings in Table 14. From these results, we can see that across different tasks and embedding types, the eigenspace overlap score generally correlates better with downstream performance than the other measures of compression quality.
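
The Spearman correlation used here is the Pearson correlation between the ranks of the two quantities. The sketch below computes it in plain Python, with tied values sharing their average rank. The numbers at the bottom are made up for illustration and are not results from the experiments.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: one entry per compressed embedding.
one_minus_overlap = [0.05, 0.10, 0.20, 0.40, 0.60]
accuracy = [0.80, 0.79, 0.77, 0.70, 0.72]
print(abs(spearman(one_minus_overlap, accuracy)))
```

As in Tables 13 and 14, the absolute value is what matters: a measure for which lower values indicate better compression (such as 1 − $E$) is expected to correlate negatively with accuracy.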

Figure 12: — We plot the performance of compressed GloVe embeddings on question answering (SQuAD, left column) and sentiment analysis (SST-1, right column), in terms of the different measures of compression quality for these embeddings. We can see that the eigenspace overlap score $E$ generally aligns better with downstream performance than the other measures of compression quality. To quantify this, in the title of each plot we include the Spearman correlation ρ between downstream performance and the measure of compression quality for that plot. We can see that the eigenspace overlap score attains the strongest correlations with downstream performance, as it has the largest values for |ρ|.

Figure 13: — We plot the performance of compressed fastText embeddings on question answering (SQuAD, left column) and sentiment analysis (SST-1, right column), in terms of the different measures of compression quality for these embeddings. We can see that the eigenspace overlap score $E$ generally aligns better with downstream performance than the other measures of compression quality. To quantify this, in the title of each plot we include the Spearman correlation ρ between downstream performance and the measure of compression quality for that plot. We can see that the eigenspace overlap score attains the strongest correlations with downstream performance, as it has the largest values for |ρ|.

Figure 14: — We plot the performance of compressed BERT WordPiece embeddings on the two largest GLUE datasets (MNLI, left column; QQP, right column) in terms of the different measures of compression quality for these embeddings. We can see that the eigenspace overlap score $E$ generally aligns better with downstream performance than the other measures of compression quality. To quantify this, in the title of each plot we include the Spearman correlation ρ between downstream performance and the measure of compression quality for that plot. We can see that the eigenspace overlap score attains the strongest correlations with downstream performance, as it has the largest values for |ρ|.

Table 13: Spearman correlations ρ between compression quality measures and sentiment analysis performance.

Within each entry of the table, the correlations are presented in terms of ‘GloVe |ρ| / fastText |ρ|.’ Note that we present the absolute values of the correlation coefficients, with higher absolute values indicating stronger correlation.

	MR	SST-1	SST-2	Subj	TREC	CR	MPQA

PIP loss	0.36/0.03	0.46/0.25	0.29/0.21	0.30/0.13	0.14/0.05	0.10/0.03	0.49/0.18
Δ	0.29/0.39	0.33/0.29	0.39/0.40	0.26/0.16	0.33/0.29	0.11/0.34	0.41/0.40
Δ_max	0.39/0.41	0.51/0.60	0.41/0.62	0.32/0.49	0.23/0.30	0.12/0.40	0.60/0.39
1 − $E$	0.29/0.59	0.75/0.73	0.72/0.83	0.27/0.58	0.49/0.32	0.40/0.55	0.60/0.55

Table 14: Spearman correlations ρ between compression quality measures and GLUE task performance.

Note that we present the absolute values of the correlation coefficients, with higher absolute values indicating stronger correlation.

	MNLI	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE

PIP loss	0.45	0.45	0.43	0.18	0.32	0.41	0.28	0.22
Δ	0.44	0.36	0.36	0.25	0.25	0.43	0.23	0.12
Δ_max	0.86	0.86	0.85	0.67	0.75	0.84	0.59	0.58
1 − $E$	0.92	0.93	0.92	0.83	0.86	0.87	0.62	0.66

E.6. Downstream Performance vs. 1/(1 − Δ₁) and Δ₂

We show two main results: First, we show examples of compressed embeddings that have large values of 1/(1 − Δ₁) or Δ₂, but which still attain strong downstream performance; because large values of 1/(1 − Δ₁) or Δ₂ imply large worst-case bounds on the generalization error of the embeddings [42], these observations demonstrate that the worst-case bounds are too loose to explain the empirical results. Second, we show that the eigenspace overlap score generally attains stronger correlation with downstream performance than both 1/(1 − Δ₁) and Δ₂.

For the first result, we can see in Figure 15 that there are points with large 1/(1 − Δ₁), for example, but where the downstream performance is still quite close to the full-precision embedding performance. For the second result, we show in Table 15 that the eigenspace overlap score attains higher Spearman correlation with downstream performance than 1/(1 − Δ₁) and Δ₂ across a range of tasks.

Table 15: Spearman correlations between compression quality measures and downstream performance.

On the SQuAD (question answering), SST-1 (sentiment analysis), MNLI (natural language inference), and QQP (question pair matching) tasks, the eigenspace overlap score $E$ attains higher Spearman correlation (absolute value) with downstream performance than 1/(1 − Δ₁) and Δ₂.

Dataset	SQuAD		SST-1		MNLI	QQP

Embedding	GloVe	fastText	GloVe	fastText	BERT WordPiece	BERT WordPiece

1/(1 − Δ₁)	0.62	0.80	0.52	0.65	0.87	0.87
Δ₂	0.46	0.48	0.33	0.44	0.30	0.20
1 − $E$	0.81	0.91	0.75	0.73	0.92	0.93

E.7. Downstream Performance vs. Δ_max and Δ with different λ values

In Section 4, we showed across numerous tasks and embedding types that the eigenspace overlap score typically attains stronger correlation with downstream performance than the other measures of compression quality, including Δ_max and Δ. For these results, we computed Δ_max and Δ with the parameter λ being the smallest non-zero eigenvalue of the Gram matrix of the uncompressed embeddings (see Section 2.2 for a review of how λ is used when calculating these measures). We now show these results are robust to the choice of λ. Specifically, in Table 16 we show the Spearman correlations attained by Δ_max and Δ with different λ values. Letting λ_min and λ_max be the smallest and largest eigenvalues of the uncompressed embedding Gram matrix, we consider $\lambda \in \{\lambda_{\min} / 100, \lambda_{\min} / 10, \lambda_{\min}, \lambda_{\min} \times 10, \lambda_{\min} \times 100, \lambda_{\max}\}$ for this table. We observe that the eigenspace overlap score attains stronger correlation with downstream performance across the tasks and embedding types in this table than Δ and Δ_max, across all the λ values listed above.

Table 16: Spearman correlations between Δ and Δ_max and downstream performance, for different λ values.

Dataset		SQuAD		SST-1		MNLI	QQP

Embedding		GloVe	fastText	GloVe	fastText	BERT WordPiece	BERT WordPiece

Δ_max,	λ = λ_min/100	0.66	0.71	0.54	0.63	0.47	0.56
Δ_max,	λ = λ_min/10	0.65	0.73	0.54	0.61	0.47	0.57
Δ_max,	λ = λ_min	0.62	0.72	0.51	0.60	0.38	0.56
Δ_max,	λ = λ_min × 10	0.61	0.65	0.53	0.51	0.35	0.43
Δ_max,	λ = λ_min × 100	0.25	0.49	0.18	0.43	0.13	0.36
Δ_max,	λ = λ_max	0.15	0.08	0.22	0.03	0.49	0.08
Δ,	λ = λ_min/100	0.41	0.31	0.27	0.30	0.51	0.05
Δ,	λ = λ_min/10	0.41	0.32	0.27	0.30	0.51	0.05
Δ,	λ = λ_min	0.46	0.31	0.33	0.29	0.57	0.04
Δ,	λ = λ_min × 10	0.42	0.00	0.28	0.05	0.55	0.28
Δ,	λ = λ_min × 100	0.70	0.32	0.60	0.27	0.87	0.26
Δ,	λ = λ_max	0.35	0.01	0.31	0.03	0.10	0.02

	1 − $E$	0.81	0.91	0.75	0.73	0.92	0.93

E.8. The Robustness of the Measures of Compression Quality as Selection Criteria

In Section 4.3 we argued that the eigenspace overlap score is a more accurate and robust selection criterion for choosing between compressed embeddings than the other measures of compression quality. We showed in Table 2 the selection error rates attained by the various measures of compression quality across different tasks and embedding types. Here we provide detailed results on the robustness of the various measures of compression quality when used as selection criteria. To quantify the robustness of each measure of compression quality as a selection criterion, we measure for each task the maximum difference in performance, across all pairs of compressed embeddings from our experiments, between the embedding which performs best and the one which is selected by the measure of compression quality. We report these results in Table 17 for GloVe and fastText embeddings on the question answering (SQuAD) and sentiment analysis (SST-1) tasks, and for BERT WordPiece embeddings on the language inference (MNLI) and question pair classification (QQP) tasks. We observe that the eigenspace overlap score can attain 1.1× to 5.5× lower maximum performance differences than the next best measures of compression quality.
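
The pairwise protocol described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact evaluation code: the function name is hypothetical, lower values of the quality measure are taken to indicate better compression (as for 1 − $E$ or the PIP loss), ties are broken in favor of the first embedding of the pair, and the example values are made up.

```python
from itertools import combinations


def selection_statistics(quality, performance):
    """Compare every pair of embeddings, selecting the one with lower quality value.

    Returns (selection error rate, maximum performance difference), where the
    difference is taken between the better-performing embedding of the pair
    and the embedding selected by the quality measure.
    """
    errors, max_gap, num_pairs = 0, 0.0, 0
    for i, j in combinations(range(len(quality)), 2):
        selected = i if quality[i] <= quality[j] else j
        best = max(performance[i], performance[j])
        gap = best - performance[selected]
        errors += gap > 0          # the measure picked the worse embedding
        max_gap = max(max_gap, gap)
        num_pairs += 1
    return errors / num_pairs, max_gap


# Illustrative values only: one entry per compressed embedding.
one_minus_overlap = [0.05, 0.10, 0.20, 0.40, 0.60]
accuracy = [0.80, 0.79, 0.77, 0.70, 0.72]
error_rate, max_gap = selection_statistics(one_minus_overlap, accuracy)
print(error_rate, max_gap)
```

In this toy example the measure selects the worse embedding in only one of the ten pairs, and the cost of that mistake (the maximum performance difference) is about 0.02.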

Table 17: The robustness of each measure of compression quality as a selection criterion.

Across all pairs of compressed embeddings from our experiments, we measure for each task the maximum difference in performance between the embedding selected by each measure of compression quality and the one which performs best on the task. We report these results in the table below, and observe that the eigenspace overlap score $E$ attains lower maximum performance differences than the other measures of compression quality.

Dataset	SQuAD		SST-1		MNLI	QQP

Embedding	GloVe	fastText	GloVe	fastText	BERT WordPiece	BERT WordPiece

PIP loss	0.03	0.08	0.11	0.08	0.04	0.02
Δ_max	0.03	0.03	0.11	0.05	0.02	0.02
Δ	0.03	0.08	0.11	0.09	0.03	0.02
1 − $E$	0.01	0.01	0.04	0.03	0.01	0.01

E.9. Stochastic vs. Deterministic Uniform Quantization

Thus far, all the uniform quantization experiments we have presented on question answering, sentiment analysis, and GLUE tasks have used deterministic rounding. However, our theoretical analysis on the expected eigenspace overlap score of uniformly quantized embeddings assumed unbiased stochastic quantization is used. In this section, we show that (1) stochastic and deterministic uniform quantization perform similarly on downstream tasks, and that (2) the eigenspace overlap score still correlates well with downstream performance when using stochastic quantization instead of deterministic quantization. In Figure 16, we compare the downstream performance of deterministic and stochastic quantization on the SQuAD question answering task and on the SST-1 sentiment analysis task. We can observe that stochastic and deterministic quantization perform similarly, although at 1-bit precision deterministic quantization performs slightly better than stochastic quantization. We then show in Figure 17 that regardless of whether we use deterministic or stochastic quantization, the eigenspace overlap score correlates better with downstream performance across compression methods than the other measures of compression quality.
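
The difference between the two rounding schemes can be illustrated with a short NumPy sketch. It is not the implementation used in the experiments: it assumes a generic uniform quantizer that clips entries to [−clip, clip] and rounds them to 2^b equally spaced levels, and it takes the clipping value `clip` as given rather than selecting it automatically. The function name and its parameters are hypothetical.

```python
import numpy as np


def uniform_quantize(X, bits, clip, stochastic=False, rng=None):
    """Quantize the entries of X to 2**bits equally spaced levels in [-clip, clip]."""
    intervals = 2 ** bits - 1                  # number of gaps between grid points
    step = 2 * clip / intervals
    # Position of each clipped entry on the grid, as a real number in [0, intervals].
    scaled = (np.clip(X, -clip, clip) + clip) / step
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        lower = np.floor(scaled)
        # Round up with probability equal to the fractional part, which makes
        # the rounding unbiased for entries inside the clipping range.
        rounded = lower + (rng.random(X.shape) < scaled - lower)
    else:
        rounded = np.round(scaled)             # deterministic: nearest grid point
    return rounded * step - clip


rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 50))
det = uniform_quantize(X, bits=1, clip=1.0)
sto = uniform_quantize(X, bits=1, clip=1.0, stochastic=True, rng=rng)

print(np.unique(det))                 # 1-bit precision leaves two levels
print(np.abs(det - X).mean())         # deterministic rounding: smaller per-entry error
print(np.abs(sto - X).mean())         # stochastic rounding: larger per-entry error
print(abs((sto - X).mean()))          # but close to zero on average (unbiased)
```

Deterministic rounding minimizes the error of each individual entry, whereas stochastic rounding trades a larger per-entry error for unbiasedness, which is the property the theoretical analysis relies on.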

Figure 16: — We can observe, using compressed GloVe embeddings on both the SQuAD question answering task and the SST-1 sentiment analysis task, that stochastic uniform quantization (left plots) performs similarly to deterministic uniform quantization (right plots).

Figure 17: — We can see that regardless of whether stochastic (left plots) or deterministic (right plots) quantization is used, the eigenspace overlap score correlates better with downstream performance than the other measures of compression quality (as quantified by the Spearman correlations ρ in the plot titles).

Footnotes

The regressor minimizing the squared loss on the training set is $f_{X, \epsilon}(x) = w^{T} x$, for $w = (X^{T} X)^{-1} X^{T} y$.

The difference between average-case and worst-case analysis is also central to understanding the difference between (Δ₁, Δ₂)-spectral approximation (which yields worst-case generalization bounds) [42], and the eigenspace overlap score (which yields average-case generalization bounds).

This bound on the entries of X results in the entries of its Gram matrix being bounded by a constant independent of d.

⁴

The maximum possible value of $\sigma_{\min}$ is $\sqrt{n / d}$, which occurs when $\|X\|_{F}^{2} = n$ and $\sigma_{\min} = \sigma_{\max}$.

⁵

For dimensionality reduction, we use PCA for fastText and BERT embeddings (compression rates: 1, 2, 4, 8), and publicly available lower-dimensional embeddings for GloVe (compression rates: 1, 1.5, 3, 6).

⁶

Freezing the WordPiece embeddings does not observably affect performance (see Appendix E.1).

⁷

We apply uniform quantization to compress embeddings trained end-to-end for a translation task in Appendix E.2; we show it outperforms a tensorized factorization [17] proposed for the task-specific setting.

⁸

We provide a memory-efficient implementation of the uniform quantization method in https://github.com/HazyResearch/smallfry.

⁹

For logistic regression, we can write $l(z', z) := -(\sigma(z) \log(\sigma(z')) + (1 - \sigma(z)) \log(1 - \sigma(z')))$. Here $z$ and $z'$ both represent logits. We can recover the standard logistic loss by letting $z \in \{-\infty, \infty\}$. Alternatively, one can think of $\sigma(z)$ as $p(y = 1 | x)$, the parameter of the Bernoulli distribution generating the label for a datapoint $x$.

¹⁰

The maximum possible value of $\sigma_{\min}$ is $\sqrt{n / d}$, which occurs when $\|X\|_{F}^{2} = n$ and $\sigma_{\min} = \sigma_{\max}$.

¹¹

https://github.com/facebookresearch/DrQA.

¹²

https://github.com/harvardnlp/sent-conv-torch/tree/master/data.

¹³

https://github.com/yoonkim/CNN_sentence.

¹⁴

PyTorch implementation of the pre-trained BERT model: https://github.com/huggingface/pytorch-pretrained-BERT. We use the examples/run_classifier.py file provided in this repo for fine-tuning.

¹⁵

https://gluebenchmark.com/leaderboard/

¹⁶

http://nlp.stanford.edu/data/glove.6B.zip.

¹⁷

https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip.

¹⁸

https://github.com/huggingface/pytorch-pretrained-BERT.

¹⁹

https://github.com/zomux/neuralcompressor.

²⁰

https://github.com/facebookresearch/DrQA.

²¹

https://github.com/google-research/bert/blob/master/run_classifier.py.

²²

https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py.

²³

https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh

²⁴

https://github.com/stanfordnlp/GloVe

References

[1] Andor Daniel, Alberti Chris, Weiss David, Severyn Aliaksei, Presta Alessandro, Ganchev Kuzman, Petrov Slav, and Collins Michael. Globally normalized transition-based neural networks. In ACL, 2016.
[2] Andrews Martin. Compressing word embeddings. In ICONIP, 2016.
[3] Avron Haim, Kapralov Michael, Musco Cameron, Musco Christopher, Velingker Ameya, and Zandieh Amir. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, 2017.
[4] Bertoldi Nicola, Mathur Prashant, Ruiz Nicholas, and Federico Marcello. FBK’s machine translation and speech translation systems for the IWSLT 2014 evaluation campaign. In IWSLT, 2014.
[5] Chen Danqi, Fisch Adam, Weston Jason, and Bordes Antoine. Reading Wikipedia to answer open-domain questions. In ACL, 2017.
[6] Chen Ting, Min Martin Renqiang, and Sun Yizhou. Learning k-way d-dimensional discrete codes for compact embedding representations. In ICML, 2018.
[7] Shiebler Dan, Green Chris, Belli Luca, and Tayal Abhishek. Embeddings@Twitter, 2018. URL https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html. [Online; published 13-Sept-2018; accessed 20-May-2019].
[8] Davis C and Kahan W. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[9] De Sa Christopher, Leszczynski Megan, Zhang Jian, Marzoev Alana, Aberger Christopher R, Olukotun Kunle, and Ré Christopher. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.
[10] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] Gale Trevor, Elsen Erich, and Hooker Sara. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
[12] Gersho A. Quantization. IEEE Communications Society Magazine, 15(5):16–16, Sep. 1977.
[13] Grover Aditya and Leskovec Jure. node2vec: Scalable feature learning for networks. In KDD, 2016.
[14] Gupta Suyog, Agrawal Ankur, Gopalakrishnan Kailash, and Narayanan Pritish. Deep learning with limited numerical precision. In ICML, 2015.
[15] Han Song, Mao Huizi, and Dally William J. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[16] Hotelling Harold. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.
[17] Khrulkov Valentin, Hrinchuk Oleksii, Mirvakhabova Leyla, and Oseledets Ivan V. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
[18] Kiefer J. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 4:502–506, 1953.
[19] Kim Yoon. Convolutional neural networks for sentence classification. In EMNLP, 2014.
[20] Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Levy Omer and Goldberg Yoav. Neural word embedding as implicit matrix factorization. In NeurIPS, 2014.
[22] Micikevicius Paulius, Narang Sharan, Alben Jonah, Diamos Gregory Frederick, Elsen Erich, García David, Ginsburg Boris, Houston Michael, Kuchaiev Oleksii, Venkatesh Ganesh, and Wu Hao. Mixed precision training. In ICLR, 2018.
[23] Mikolov Tomas, Chen Kai, Corrado Greg, and Dean Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[24] Mikolov Tomas, Grave Edouard, Bojanowski Piotr, Puhrsch Christian, and Joulin Armand. Advances in pre-training distributed word representations. In LREC, 2018.
[25] Mostafa Hesham and Wang Xin. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML, 2019.
[26] Ott Myle, Edunov Sergey, Baevski Alexei, Fan Angela, Gross Sam, Ng Nathan, Grangier David, and Auli Michael. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT: Demonstrations, 2019.
[27] Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
[28] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[29] Pennington Jeffrey, Socher Richard, and Manning Christopher D. GloVe: Global vectors for word representation. In EMNLP, 2014.
[30] Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. Deep contextualized word representations. In NAACL-HLT, 2018.
[31] Popoviciu Tiberiu. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica, 9:129–145, 1935.
[32] Rahimi Ali and Recht Benjamin. Random features for large-scale kernel machines. In NeurIPS, 2007.
[33] Rajpurkar Pranav, Zhang Jian, Lopyrev Konstantin, and Liang Percy. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
[34] Shu Raphael and Nakayama Hideki. Compressing word embeddings via deep compositional code learning. In ICLR, 2018.
[35] Socher Richard, Perelygin Alex, Wu Jean, Chuang Jason, Manning Christopher D., Ng Andrew Y., and Potts Christopher. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
[36] Sohoni Nimit Sharad, Aberger Christopher Richard, Leszczynski Megan, Zhang Jian, and Ré Christopher. Low-memory neural network training: A technical report. arXiv preprint arXiv:1904.10631, 2019.
[37] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. Attention is all you need. In NeurIPS, 2017.
[38] Wang Alex, Singh Amanpreet, Michael Julian, Hill Felix, Levy Omer, and Bowman Samuel R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
[39] Williams Christopher K. I. and Seeger Matthias W. Using the Nyström method to speed up kernel machines. In NeurIPS, 2000.
[40] Wu Yonghui, Schuster Mike, Chen Zhifeng, Le Quoc V, Norouzi Mohammad, Macherey Wolfgang, Krikun Maxim, Cao Yuan, Gao Qin, Macherey Klaus, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[41] Yin Zi and Shen Yuanyuan. On the dimensionality of word embedding. In NeurIPS, 2018.
[42] Zhang Jian, May Avner, Dao Tri, and Ré Christopher. Low-precision random Fourier features for memory-constrained kernel approximation. In AISTATS, 2019.
