Author manuscript; available in PMC: 2021 Aug 6.
Published in final edited form as: Proc ACM SIGACT SIGMOD SIGART Symp Princ Database Syst. 2021 Jun 20;2021:273–284. doi: 10.1145/3452021.3458312

Subspace exploration: Bounds on Projected Frequency Estimation

Graham Cormode 1, Charlie Dickens 2, David P Woodruff 3
PMCID: PMC8345326  NIHMSID: NIHMS1696965  PMID: 34366641

Abstract

Given an n × d dimensional dataset A, a projection query specifies a subset C ⊆ [d] of columns which yields a new n × |C| array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show 2^{Ω(d)} lower bounds. However, we present upper bounds which demonstrate space dependency better than 2^d. That is, for c, c′ ∈ (0, 1) and a parameter N = 2^d, an N^c-approximation can be obtained in space min(N^{c′}, n), showing that it is possible to improve on the naïve approach of keeping information for all 2^d subsets of d columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.

Keywords: projection queries, distinct elements, frequency moments

1. INTRODUCTION

In many data analysis scenarios, datasets of interest are of moderate to high dimension, but many of these dimensions are spurious or irrelevant. Thus, we are interested in subspaces, corresponding to the data projected on a particular subset of dimensions. Within each subspace, we are concerned with computing statistics, such as norms, measures of variation, or finding common patterns. Such calculations are the basis of subsequent analysis, such as regression and clustering. In this paper, we introduce and formalize novel problems related to functions of the frequency in such projected subspaces. Already, special cases such as subspace projected distinct elements have begun to generate interest, e.g., in Vu’s work [18], and as an open problem in sublinear algorithms [16].

In more detail, we consider the original data to be represented by a (usually binary) array with n rows of d dimensions. A subspace is defined by a set C ⊆ [d] of columns, which defines a new array with n rows and |C| dimensions. Our goal is to understand the complexity of answering queries, such as which rows occur most frequently in the projected data, computing frequency moments over the rows, and so on. If C is provided prior to seeing the data, then the projection can be performed online, and so many of these tasks reduce to previously studied questions. Hence, we focus on the case when C is decided after the data is seen. In particular, we may wish to try out many different choices of C to explore the structure of the subspaces of the data. Our model is given in detail in Section 2.

For further motivation, we outline some specific areas where such problems arise.

  • Bias and Diversity. A growing concern in data analysis and machine learning is whether outcomes are ‘fair’ to different subgroups within the population, or whether they reinforce existing disparities. A starting point for this is to quantify the level of bias within the data when different features are considered. That is, we want to know whether certain combinations of attribute values are over-represented in the data (heavy hitters), and how many different combinations of values are represented in the data (captured by measures like F_0). We would like to be able to answer such queries accurately for many different (typically overlapping) subsets of dimensions.

  • Privacy and Linkability. When sharing datasets, we seek assurance that they are not vulnerable to attacks that exploit structure in the data to re-identify individuals. An attempt to quantify this risk is given in recent work [6], which asks how many distinct values occur in the data for each partial identifier, specified as a subset of dimensions. This prior work considered the case where the target dimensions are known in advance, but more generally we would like to compute such measures for arbitrary subsets, based on frequency moments and sampling techniques.

  • Clustering and Frequency Analysis. In the area of clustering, the notion of subspaces has been studied under a number of interpretations. The common theme is that the data may look unclustered in the original space due to spurious dimensions inflating the distance between points that are otherwise close. Many papers addressed this as a search problem: to search through exponentially many subspaces to find those in which the data is well-clustered. See the survey by Parsons, Haque and Liu [15]. In our setting, the problem would be to estimate various measures of density or clusteredness for a given subspace. A related problem is to find subspaces (or “subcubes” in database terminology) that have high frequency. Prior work proceeded under strong statistical independence assumptions about the values in different dimensions, for example, that the distribution can be modeled accurately with a (Naïve) Bayesian model [13].

2. PRELIMINARIES AND DEFINITIONS

For a positive integer Q, let [Q] = {0, 1, …, Q − 1}, and let A ∈ [Q]^{n×d} be the input data. The objective is to keep a summary of A which is used to estimate the solution to a problem P upon receiving a column subset query C ⊆ [d]. Problems P of interest are described in Section 2.1. Define the restriction of A to the columns indexed by C as A^C, whose rows A_i^C, 1 ≤ i ≤ n, are vectors over [Q]^{|C|}. We use the Minkowski norm ∥X∥_p = (Σ_{i,j} |X_{ij}|^p)^{1/p} to denote the entrywise ℓ_p norm for vectors (j = 1) and matrices (j > 1).

Computational Model.

First, the data A is received under the assumption that it is too large to hold entirely in memory, so it can be modeled as a stream of data. Our lower bounds are not strongly dependent on the order in which the data is presented. After observing A, a column query C is presented. The frequency vector over A induced by C is f = f(A,C), whose entries f_i(A,C) denote the frequency of the Q-ary word w_i ∈ [Q]^{|C|}. We study functions of the frequency vector f = f(A,C) after the observation of A and receipt of the column query C. The task is, during the observation phase, to design a summary of A which approximates statistics of A^C, the restriction of A to its projected subspace C. Approximations of A^C are accessed through the frequency vector f(A,C). Note that functions (e.g., norms) are taken over f(A,C) as opposed to the raw vector inputs from the column projection.

Remark 1 (Indexing Q-ary words into f).

Recall that the frequency vector f(A,C) has length Q^{|C|}, with each entry f_i counting the occurrences of the word w_i ∈ [Q]^{|C|}. To clearly distinguish between the (scalar) index i of f and the input vectors w_i whose frequency is measured by f_i, we introduce the index function e(w_i) = i. We may think of e(·) as simply the canonical mapping from [Q]^{|C|} into {0, 1, 2, …, Q^{|C|} − 1}, but other suitable bijections may be used.

For example, suppose Q = 2 and A ∈ {0, 1}^{5×3} with column indices {1, 2, 3} given below. If C = {1, 2}, then using the canonical mapping from {0, 1}^{|C|} into {0, 1, 2, 3} (e.g., e(00) = 0, e(01) = 1, …, e(11) = 3) we obtain A^C and hence f(A,C) = (1, 1, 0, 3).

A =
1 1 0
0 1 0
0 0 1
1 1 1
1 1 0

A^C =
1 1
0 1
0 0
1 1
1 1

The vector f = f(A,C) is then the frequency vector over which we seek to compute statistical queries such as ∥f∥_0. In this example, ∥f∥_0 = 3 (there are three distinct rows in A^C), while ∥f∥_1 = 5 is independent of the choice of C.
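The worked example can be checked with a short sketch (illustrative Python; the helper names freq_vector and e are ours, not from the paper):

```python
def freq_vector(A, C, Q=2):
    """Frequency vector f(A, C): count each Q-ary pattern in [Q]^|C|
    over the rows of A restricted to the columns in C (0-indexed)."""
    def e(w):
        # canonical index mapping e(w) from [Q]^|C| into {0, ..., Q^|C| - 1}
        idx = 0
        for symbol in w:
            idx = idx * Q + symbol
        return idx
    f = [0] * (Q ** len(C))
    for row in A:
        f[e(tuple(row[j] for j in C))] += 1
    return f

# The 5x3 example from Remark 1; C = {1, 2} (1-indexed) is columns [0, 1] here.
A = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
f = freq_vector(A, C=[0, 1])
assert f == [1, 1, 0, 3]                 # f(A, C) = (1, 1, 0, 3)
assert sum(1 for x in f if x > 0) == 3   # ||f||_0 = 3 distinct projected rows
assert sum(f) == 5                       # ||f||_1 = n = 5, independent of C
```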

2.1. Problem Definitions.

The problems that we consider are column-projected forms of common streaming problems ([11], [3], [4]). Here, we refer to these problems as “projected frequency estimation problems” over the input A. We define

f_i(A,C) = |{j : A_j^C = w_i, j ∈ [n]}|   (1)

F_p(A,C) = Σ_{w_i ∈ {0,1}^{|C|}} f_i(A,C)^p.   (2)
  • F_p estimation: Given a column query C, the F_p estimation problem is to approximate the quantity F_p(A,C) = ∥f(A,C)∥_p^p under some measure of approximation to be specified later (e.g., up to a constant factor). Of particular interest to us is (projected) F_0(A,C) estimation, which counts the number of distinct row patterns in A^C.

  • ℓ_p-heavy hitters: The query is specified by a column query C ⊆ [d], a choice of norm p > 0, and an accuracy parameter ϕ ∈ (0, 1). The task is then to identify all patterns w_i observed in A^C for which f_i(A,C) ≥ ϕ∥f(A,C)∥_p. Such values w_i (or equivalently i) are called ϕ-ℓ_p-heavy hitters, or simply ℓ_p-heavy hitters when ϕ is fixed. We will consider a multiplicative approximation based on a parameter c > 1, where we require that all ϕ-ℓ_p heavy hitters are reported, and no items with weight less than (ϕ/c) · ∥f(A,C)∥_p are included.

  • ℓ_p-frequency estimation: A related problem is to allow the frequency f_i(A,C) to be estimated accurately, with error as a fraction of F_p(A,C)^{1/p} = ∥f(A,C)∥_p, which we refer to as ℓ_p frequency estimation. Specifically, for a given w_i, return an estimate f̂_i which satisfies |f̂_i(A,C) − f_i(A,C)| ≤ ϕ∥f(A,C)∥_p.

  • ℓ_p sampling: The goal of this sampling problem is to sample patterns w_i according to the distribution p_i ∈ (1 ± ε) f_i^p(A,C)/∥f(A,C)∥_p^p + Δ, where Δ = 1/poly(nd), and return a (1 ± ε′)-approximation to the probability p_i of the returned item w_i.

When clear, we may drop the dependence upon C in the notation and write f_i and F_p instead. We will use Õ and Ω̃ notation to suppress factors that are polylogarithmic in the leading term. For example, lower bounds stated as Ω̃(2^d) suppress terms polynomial in d.
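For concreteness, the quantities defined above can be computed by brute force on a toy input (illustrative Python under our own naming; this exact approach uses space linear in the input, which is precisely what the paper seeks to beat):

```python
from collections import Counter

def projected_stats(A, C, p=2, phi=0.5):
    """Brute-force evaluation of the projected quantities of Section 2.1.
    Exact but space-hungry: it materializes the full frequency vector."""
    f = Counter(tuple(row[j] for j in C) for row in A)  # sparse frequency vector
    F0 = len(f)                                         # distinct projected rows
    Fp = sum(c ** p for c in f.values())                # F_p moment
    lp_norm = Fp ** (1.0 / p)                           # ||f(A, C)||_p
    heavy = [w for w, c in f.items() if c >= phi * lp_norm]  # phi-l_p heavy hitters
    return F0, Fp, heavy

A = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
F0, F2, heavy = projected_stats(A, C=[0, 1], p=2, phi=0.5)
# f = (1, 1, 0, 3): F0 = 3, F2 = 1 + 1 + 9 = 11, ||f||_2 = sqrt(11) ~ 3.32
assert (F0, F2) == (3, 11)
assert heavy == [(1, 1)]   # only pattern 11 has f_i >= 0.5 * sqrt(11)
```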

2.2. Related Work

The model we study is reminiscent of, but distinct from, some related formulations. In the problem of cascaded aggregates [10], we imagine the starting data as a matrix, and apply a first operator (denoted Q) on each row to obtain a vector, to which we apply a second operator P. Our problems can be understood as special cases of cascaded aggregates where Q is a project-then-concatenate operator, yielding a vector whose indices correspond to the concatenation of the projection of a row. Another example of a cascaded aggregate is a so-called correlated aggregate [17], but this was only studied in the context of two dimensions. To the best of our knowledge, our projection-based definitions have not been previously studied under the banner of cascaded aggregates.

Other work includes results on provisioning queries for analytics [2], but the way these statistics are defined is different from our formulation. In that setting there are different scenarios (“hypotheticals”) that may or may not be turned on: this corresponds to “what-if” analysis, whereby a query is roughly “how many items are observed if a given set of columns is present (turned on)?” The number of distinct elements for the query is the union of the number of distinct elements across scenarios. In our setting, we concatenate the distinct items into a row vector and count the number of distinct vectors. Note that in the hypotheticals setting in the binary case, each column only has 2 distinct values, 0 and 1, and thus the union also only has 2 distinct values. However, we can obtain up to 2^d distinct vectors. Consequently, Assadi et al. are able to achieve poly(d/ε) space for counting distinct elements, whereas we show a 2^{Ω(d)} lower bound. Moreover, they show a 2^{Ω(d)} lower bound for counting (i.e., F_1), whereas we achieve a constant upper bound. These disparities highlight the differences between the models.

More recently, the notion of “subset norms” was introduced by Braverman, Krauthgamer and Yang [5]. This problem considers an input that defines a vector υ, where the objective is to take a subset s of the entries of υ and compute the norm. Results are parameterized by the “heavy hitter dimension”, which is a measure of complexity over the set system from which s can be drawn. While sharing some properties with our scenario, the results for this model are quite different. In particular, in [5] a trivial upper bound follows by maintaining the vector υ explicitly, of dimension n. Meanwhile, many of our results show lower bounds that are exponential in the dimensionality, i.e., 2^{Ω(d)}, though we also obtain non-trivial upper bounds.

3. CONTRIBUTIONS

The main challenge here is that the column query C is revealed only after observing the data; consequently, applying a known algorithm to just the columns C as the data arrives is not possible. For example, consider the exemplar problem of counting the number of distinct rows under the projection C, i.e., the projected F_0 problem. Recall that A_i^C denotes the i-th row of the array A^C. Then the task is to count the number of distinct rows observed in A^C, i.e.,

F_0(A,C) = |{A_j^C : j ∈ [n]}| = ∥f(A,C)∥_0.

Observe that F_0(A,C) can vary widely over different choices of C. For example, even for a binary input A ∈ {0, 1}^{n×d}, F_0(A,C) can be as large as 2^d when C consists of all columns of a highly diverse dataset, and as small as 1 or 2 when C is a single column or when C selects homogeneous columns (e.g., the columns in C are all zeros).

3.1. Summary of Results

Our main focus, in common with prior work on streaming algorithms, is on space complexity. For the above problems we obtain the following results:

  • In Section 4 we show that projected F_0 estimation requires 2^{Ω(d)} space for a constant factor approximation, demonstrating the essential hardness of these problems. Nevertheless, we obtain a tradeoff in terms of the upper bounds described below.

  • Section 5 presents results for ℓ_p frequency estimation, ℓ_p heavy hitters, F_p estimation, and ℓ_p sampling. We show a space upper bound of O(ε^{−2} log(1/δ)) for ℓ_p frequency estimation when 0 < p < 1, and complement this result with lower bounds for heavy hitters when p > 1, and for F_p estimation and ℓ_p sampling for all p ≠ 1, showing that these problems require 2^{Ω(d)} bits of space.

  • In Section 6 we show upper bounds for F_0 and F_p estimation which improve on the exhaustive approach of keeping summaries of all 2^d subsets of columns, by showing that we can obtain coarse approximate answers with a smaller subset of materialized answers. Specifically, for parameters N = 2^d and α ∈ (0, 1) we can obtain an N^α approximation in min(N^{H(1/2−α)}, n) space. Since the binary entropy function satisfies H(x) < 1 for x ≠ 1/2, this bound is better than the trivial 2^d bound.

These bounds show that there is no possibility of “super efficient” solutions that use space less than exponential in d. Nevertheless, we demonstrate some solutions whose dependence is still exponential but weaker than the naïve 2^d. Thinking of N = 2^d, the above upper and lower bounds imply the actual complexity is a nontrivial polynomial function of N.
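For intuition on this tradeoff, the exponents can be tabulated directly (our own illustrative computation; the choices of α and d are arbitrary):

```python
from math import log2

def H(x):
    """Binary entropy H(x) = -x log2(x) - (1 - x) log2(1 - x)."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

d = 100  # so N = 2^d
for alpha in (0.1, 0.25, 0.4):
    approx_exp = alpha * d          # approximation factor N^alpha = 2^(alpha d)
    space_exp = H(0.5 - alpha) * d  # space N^H(1/2 - alpha) = 2^(H(1/2 - alpha) d)
    assert space_exp < d            # strictly below the trivial 2^d exponent
```

Coarser approximations (larger α) push H(1/2 − α) further below 1, so the space exponent drops well below d.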

The bounds also show novel dichotomies that are not present in comparable problems without projection. In particular, we show that (projected) ℓ_p sampling is difficult for p ≠ 1, while (projected) ℓ_p-heavy hitters has a small-space algorithm for 0 < p < 1. This differs from the standard streaming model, in which the (classical) ℓ_p heavy hitters problem has a small-space solution for p ≤ 2 without projection [14], and (classical) ℓ_p sampling can be performed efficiently for p ≤ 2 [9]. Our lower bounds are built on amplifying the frequency of target codewords for a carefully chosen test word.

Note that there are trivial naïve solutions which simply retain the entire input and so answer the query exactly on the query C: to do so takes Θ(nd) space, noting that n may be exponential in d. Alternatively, if we know t = |C| in advance, then we may enumerate all (d choose t) subsets of [d] of size t and maintain (approximate) summaries for each choice of C. However, this entails a cost of at least Ω((d choose t)), and as such does not give a major reduction in cost.

3.2. Coding Theory Definitions

Our lower bounds will typically make use of a binary code C, consisting of a collection of codewords, which are vectors (or strings) of fixed length. We write B(l, k) to denote all binary strings of length l and (Hamming) weight k. We first consider the dense, low-distance family of codes C = B(d, k), but will later use more sophisticated randomly sampled codes. When k < d/2, we have (d choose k) ≥ (d/k)^k, and when k = d/2, we have (d choose d/2) ≥ 2^d/√(2d). A trivial but crucial property of B(d, k) is that any two distinct codewords from this set can have intersecting 1s in at most k − 1 positions.

We define the support of a string y as supp(y) = {i : yi ≠ 0}, the set of locations where y is non-zero. We define child words to be the set of new codewords obtained from C by generating all Q-ary words z with supp(z) ⊆ supp(y) for some yC, and construct them with the star operator defined next.

Definition 3.1 (starQ operation, child words).

Let d be the length of a binary word, let k be a weight parameter, and suppose y ∈ B(d, k). Let M = supp(y). We define the function star^Q(y) to be the operation which lifts a binary word y to a larger alphabet by generating all the words over the alphabet [Q] on M. Formally,

star^Q(y ∈ {0, 1}^d) = {z : z ∈ [Q]^d, supp(z) ⊆ supp(y)}

Since the alphabet size Q is often fixed when using this operation, when clear we will drop the superscript and abuse notation by writing star(y). Elements of the set starQ (y) are referred to as child words of y.

For any y ∈ B(d, k), there are Q^k words generated by star^Q(y). When star(·) is applied to all vectors of a set U, we write star(U) = ∪_{u∈U} star(u). For example, if y ∈ {0, 1}^d and Q = 2, then star^Q(y) is simply all possible binary words of length d whose support is contained in supp(y). For the projected F_0 problem, the code C = B(d, k) is sufficient. However, for our subsequent results, we need a randomly chosen code whose existence is demonstrated in Lemma 3.2. The proof follows from a Chernoff bound.
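These definitions can be sketched directly (illustrative Python; the function names are ours):

```python
from itertools import combinations, product

def B(l, k):
    """B(l, k): all binary strings of length l with Hamming weight exactly k."""
    words = []
    for supp in combinations(range(l), k):
        w = [0] * l
        for i in supp:
            w[i] = 1
        words.append(tuple(w))
    return words

def star(y, Q):
    """star^Q(y): all words z in [Q]^d with supp(z) a subset of supp(y)."""
    d, M = len(y), [i for i, yi in enumerate(y) if yi != 0]
    children = []
    for vals in product(range(Q), repeat=len(M)):
        z = [0] * d
        for i, v in zip(M, vals):
            z[i] = v
        children.append(tuple(z))
    return children

d, k, Q = 6, 2, 3
code = B(d, k)
assert len(code) == 15                  # binomial(6, 2) codewords
assert len(star(code[0], Q)) == Q ** k  # Q^k child words per codeword
# Any two distinct codewords share at most k - 1 = 1 one-positions.
assert all(sum(a & b for a, b in zip(x, z)) <= k - 1
           for x in code for z in code if x != z)
```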

Lemma 3.2.

Fix ϵ, γ ∈ (0, 1) and let C ⊆ B(d, ϵd) be such that for any two distinct x, y ∈ C we have |x ∩ y| ≤ (ϵ² + γ)d. With probability at least 1 − exp(−Ω(dγ²)), such a code C of size 2^{Ω(γ²d)} exists, and it can be instantiated by sampling sufficiently many words i.i.d. at random from B(d, ϵd).

Proof.

Let X be the random variable counting the number of 1s in common between words x and y sampled uniformly at random from B(d, ϵd). Then the expectation of X is E[X] = (ϵd)²/d = ϵ²d, and although the coordinates of x, y are not independent, they are negatively correlated, so we may use a Chernoff bound (see Section 1.10.2 of [7] for self-contained details). Our aim is to show that the number of 1s in common between x and y exceeds its expectation by at most γd. Via an additive Chernoff-Hoeffding bound:

Pr(X − E[X] ≥ γd) ≤ exp(−2dγ²).

This bounds the probability that a given pair of codewords is too similar; taking a union bound over the Θ(|C|²) pairs of codewords, the code can have size |C| = exp(dγ²) = 2^{γ²d/ln 2}. □
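The lemma can be checked empirically at moderate size (an illustrative, seeded sketch with parameters of our choosing):

```python
import random
from itertools import combinations

def sample_code(d, eps, m, seed=0):
    """Sample m codewords i.i.d. from B(d, eps*d), each as its support set."""
    rng = random.Random(seed)
    k = int(eps * d)
    return [frozenset(rng.sample(range(d), k)) for _ in range(m)]

d, eps, gamma = 200, 0.2, 0.1
code = sample_code(d, eps, m=50)
bound = (eps ** 2 + gamma) * d  # allowed overlap (eps^2 + gamma) d = 28
# Expected pairwise overlap is eps^2 d = 8; exceeding it by gamma*d = 20
# has exponentially small probability, so no pair should exceed the bound.
violations = sum(1 for x, y in combinations(code, 2) if len(x & y) > bound)
assert violations == 0
```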

3.3. Overview of Lower Bound Constructions

Our lower bounds rely upon non-standard reductions from the Index problem, using the codes C defined in Section 3.2. These reductions are more involved than is typical, as we need to combine the combinatorial properties of C with the star(·) operation on Alice’s input. In particular, the interplay between C and star(·) must be understood over the column query S given by Bob, which again relies on the properties of C used to define the input.

Recall that the typical reduction from Index is as follows: Alice holds a vector a ∈ {0, 1}N, Bob holds an index i ∈ [N] and he is tasked with finding ai following one-way communication from Alice. The randomized communication complexity of Index is Ω(N) [12]. We adapt this setup for our family of problems, following an approach that has been used to prove many space lower bounds for streaming algorithms.

The general construction of our lower bounds is as follows: first we choose a binary code C (usually independently at random) with certain properties, such as a specific weight and a bounded number of 1s in locations shared with other words in the code. In the communication setting, Alice holds a subset T ⊆ C while Bob holds a codeword y ∈ C and is tasked with determining whether or not y ∈ T. Bob can also access the index function (Remark 1) e(y), which simply returns the index or location at which y is enumerated in C. The corresponding bitstring for the Index problem that Alice holds is a ∈ {0, 1}^{|C|}, which has a_j = 1 for every element w_j ∈ T (under a suitable enumeration of {0, 1}^d). We use the star(T) operator (defined in Section 3.2) to map these strings into an input A for each of the problems (i.e., a collection of rows of datapoints). Upon defining the instance, we show that Bob can query a proposed algorithm for the problem and use the output to determine whether or not Alice holds y. This enables Bob to return a_{e(y)}, which is 1 if Alice holds y ∈ T and 0 otherwise. Hence, determining whether y ∈ T or y ∈ C \ T solves Index and incurs the lower bound Ω(|C|). Our constructions of C establish that |C| is exponentially large in d.

4. LOWER BOUNDS FOR F0

In this section, we focus on the F0 (distinct counting) projected frequency problem. The main result in this section is a strong lower bound for the problem, which is exponential in the domain size d.

We use codes C=B(d,k) as defined in Section 3.2.

Theorem 4.1.

Let Q ≥ 2 be the target alphabet size and k < d/2 be a fixed query size with Q > k. Any algorithm achieving an approximation factor of Q/k for the projected F_0 problem requires space 2^{Ω(d)}.

Proof.

Fix the code C = B(d, k), recalling that any x ∈ C has Hamming weight k, and that distinct x, y ∈ C share at most k − 1 bits in common. We will use these facts to obtain the approximation factor.

Obtain the collection of all child words C_Q from C by using star^Q(·) as defined in Section 3.2. We will reduce from the Index problem in communication complexity as follows. Alice has a set of (binary) codewords T ⊆ C and initializes the input array A for the algorithm with all strings from the set star(T). Bob has a vector y ∈ C and wants to know if y ∈ T or not. Let S = supp(y), so that |S| = k, and Bob queries the F_0 algorithm on the columns of A restricted to S. First suppose that y ∈ T. Then Alice holds y, so star(y) is included in A, and there must be at least Q^k patterns observed. Conversely, if y ∉ T, then Alice does not include y in A. However, by the construction of C, y shares at most k − 1 1s with any distinct y′ ∈ C. Thus, the number of patterns observed on the columns corresponding to S is at most (k choose k−1) Q^{k−1} = kQ^{k−1}.

We observe that if we can distinguish the case of kQ^{k−1} from Q^k, then we can correctly answer the Index instance, i.e., if we can achieve an approximation factor of Δ such that:

Δ = Q^k / (kQ^{k−1}) = Q/k.   (3)

Any protocol for Index requires communication proportional to the length of Alice’s input vector a, which translates into a space lower bound for our problem. Alice’s set T ⊆ C defines an input vector for the Index problem, built as a characteristic vector over all words in C, denoted by a ∈ {0, 1}^{|C|}, as follows. Under a suitable enumeration C = {w_1, w_2, …, w_{|C|}}, Alice’s vector is encoded via a_i = 1 if and only if Alice holds the binary word w_i ∈ T. From the separation shown earlier, Bob can determine if Alice holds a word in T, thus solving Index and incurring the lower bound. Hence, space proportional to |C| = (d choose k) is necessary. We use the standard relation (d choose k) ≥ (d/k)^k and choose k = ad/2 for a constant a ∈ [0, 1), from which we obtain |C| ≥ 2^{ad/2}, to achieve the stated approximation guarantee. □
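The two cases of this reduction can be simulated at toy scale (illustrative Python; d here is tiny, whereas the lower bound relies on codes of size exponential in d):

```python
from itertools import combinations, product

def star(supp_y, d, Q):
    """Child words over [Q]^d of a codeword given by its support set."""
    rows = []
    for vals in product(range(Q), repeat=len(supp_y)):
        z = [0] * d
        for i, v in zip(sorted(supp_y), vals):
            z[i] = v
        rows.append(tuple(z))
    return rows

def projected_F0(A, S):
    """Number of distinct rows of A after projecting onto the columns in S."""
    return len({tuple(r[j] for j in sorted(S)) for r in A})

d, k, Q = 6, 2, 5
code = [frozenset(c) for c in combinations(range(d), k)]  # supports of B(d, k)
y = code[0]                       # Bob's test codeword
T = [c for c in code if c != y]   # Alice's set, initially without y
A = [row for c in T for row in star(c, d, Q)]
S = y                             # Bob queries on S = supp(y)
assert projected_F0(A, S) <= k * Q ** (k - 1)  # y not in T: at most k Q^(k-1)
A += star(y, d, Q)                             # now Alice also holds y
assert projected_F0(A, S) == Q ** k            # y in T: all Q^k patterns appear
```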

Setting k = ad/2 allows us to vary the query size and directly understand how this affects the size of the code necessary for the lower bound. For a query of size k, the size of the input to the projected F_0 problem is a (d/k)^k × d array A of total size d^{k+1}/k^k. Theorem 4.1 is for k < d/2. When k = d/2 we can use the tighter bound for the central binomial coefficient and obtain the following stronger bounds. The subsequent results use the same encoding as in Theorem 4.1; however, at certain points of the calculations the parameter settings are slightly altered to obtain different guarantees.

Corollary 4.2.

Let Q ≥ d/2 be an alphabet size and let d/2 be the query size. There exists a choice of input data A ∈ [Q]^{n×d} such that any algorithm achieving approximation factor 2Q/d for the projected F_0 problem on the query requires space 2^{Ω(d)}.

Proof.

Repeat the argument of Theorem 4.1 with k = d/2. The approximation factor from Equation (3) becomes Δ = 2Q/d. The code size for Index is |C| ≥ 2^d/√(2d). Note that |C| is 2^{Ω(d)}, as (1/2) log_2(2d) can always be bounded above by a linear function of d. The instance is an array whose rows are the child words in star^Q(C), with Q^{d/2} per codeword. Hence, the size of the instance to the F_0 algorithm is Θ(2^d Q^{d/2} d^{1/2}). □

Corollary 4.3 follows from Corollary 4.2 by setting Q = d.

Corollary 4.3.

A 2-factor approximation to the projected F_0 problem on a query of size d/2 needs space 2^{Ω(d)}, with an instance A whose size is Θ(2^d d^{(d+1)/2}).

Theorem 4.1 and its corollaries suffice to obtain space bounds over all choices of Q. However, Q could potentially grow to be very large, which may be unsatisfying. As a result, we will argue how the error varies for fixed Q. To do so, we map Q down to a smaller alphabet of size q and use this code to define the communication problem from which the lower bound will follow. The cost of this is that the instance is a logarithmic factor larger in the dimensionality.

Corollary 4.4.

Let q be a target alphabet size such that 2 ≤ q ≤ Q. Let α = Q log_q(Q) ≥ 1 and d′ = d log_q(Q). There exists a choice of input data A ∈ [q]^{n×d′} for which any algorithm for the projected F_0 problem over queries of size d′/2 that guarantees error Õ(α/d′) requires space 2^{Ω(d)}.

Proof.

Fix the binary code C = B(d, d/2) and generate all child words over the alphabet [Q] to obtain the approximation factor Δ = 2Q/d as in Corollary 4.2. For every w ∈ C there are Q^{d/2} child words, so the child code C_Q has size n = Θ(2^d Q^{d/2}/√d) words. Since Q can be arbitrarily large, we encode it via a mapping to a smaller alphabet but over a slightly larger dimension; specifically, we use a function [Q] → [q]^{log_q(Q)} which generates a q-ary string for each symbol in [Q]. Hence, all of the stored strings in C_Q ⊆ [Q]^d are equivalent to a collection C_q over [q]^{d log_q(Q)}. Although |C_Q| = |C_q|, words in C_Q have length d, while the equivalent words in C_q have length d log_q(Q). This collection of words from C_q now defines the instance A ∈ [q]^{n×d log_q(Q)}, each word being a row of A. Taking α = Q log_q(Q) and d′ = d log_q(Q) results in an approximation factor of:

Δ = 2Q/d = 2α/d′.   (4)

Alice’s input vector a is defined by the same code C and held set TC as in Theorem 4.1 so we incur the same space bound. Likewise, Bob’s test vector y and column query S also remain the same as in that theorem. □

Corollary 4.4 says that the same accuracy guarantee as Corollary 4.2 can be given by reducing the arbitrarily large alphabet [Q] to a smaller one over [q]. However, the price to pay for this is that the size of the instance A increases by a factor of logq (Q) in the dimensionality. These various results are summarized in Table 1.

Table 1:

Comparison of the lower bounds for F_0. Theorem 4.1 uses C = B(d, k); the corollaries use C = B(d, d/2).

               Instance A for F_0                    Approx. factor

Theorem 4.1    (d/k)^k × d                           Q/k
Corollary 4.2  2^d Q^{d/2} × d over [Q]              2Q/d
Corollary 4.3  2^d d^{d/2} × d over [d]              2
Corollary 4.4  2^d Q^{d/2} × d log_q(Q) over [q]     2Q/d

5. ℓ_p-FREQUENCY BASED PROBLEMS

In this section, we extend the techniques from the previous section to understand the complexity of projected frequency estimation problems related to the ℓ_p norms and F_p frequency moments (defined in Section 2.1). A number of our results are lower bounds, but we begin with a simple sampling-based upper bound to set the stage.

5.1. ℓ_p Frequency Estimation

We first focus on the projected frequency estimation problem, showing that a simple algorithm keeping a uniform sample of the rows works for p < 1. The algorithm uSample(A, C, t, b) first builds a uniform sample of t rows (sampled with replacement at rate α = t/n) from A and evaluates the absolute frequency of the string b on the sample after projection onto C. Let g be the absolute frequency of b on the subsample. To estimate the true frequency of b on the entire dataset from the subsample, we return an appropriately scaled estimator f̂_{e(b)} = g/α, which meets the required bounds given in Theorem 5.1, recalling that e(b) is the index location associated with the string b. The proof follows by a standard Chernoff bound argument and is given in Appendix A.1.

Theorem 5.1.

Let A ∈ {0, 1}^{n×d} be the input data and let C ⊆ [d] be a given column query. For a given string b ∈ {0, 1}^{|C|}, the absolute frequency of b, f_{e(b)}, can be estimated up to ε∥f∥_1 additive error using a uniform sample of size O(ε^{−2} log(1/δ)), with probability at least 1 − δ.

The same algorithm can be used to obtain bounds for all 0 < p < 1. By noting that ∥f∥_1 ≤ ∥f∥_p for 0 < p < 1, we obtain the following corollary.

Corollary 5.2.

Let A, b, C be as in Theorem 5.1 and let 0 < p < 1. Then uniformly sampling O(ε^{−2} log(1/δ)) rows achieves |f̂_{e(b)} − f_{e(b)}| ≤ ε∥f∥_p with probability at least 1 − δ.

Both Theorem 5.1 and Corollary 5.2 are stated as if C is given. However, since the sampling does not rely on C in any way, we can sample complete rows of the input uniformly prior to receiving the query C, which is revealed after observing the data. The uniform sampling approach also allows us to identify the ℓ_p heavy hitters in small space: for each item included in the sample (when projected onto the column set C), we use the sample to estimate its frequency, and declare those with high enough estimated frequency to be the heavy hitters. By contrast, for p > 1 we obtain a 2^{Ω(d)} space lower bound, given in the next section.
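A minimal sketch of the uSample routine described above (illustrative Python; the function names and toy data are ours):

```python
import random

def usample_build(A, t, seed=0):
    """Observation phase: keep t uniform rows (with replacement), chosen
    before the column query C is known."""
    rng = random.Random(seed)
    return [A[rng.randrange(len(A))] for _ in range(t)]

def usample_estimate(sample, n, C, b):
    """Query phase: estimate f_{e(b)}(A, C) as g / alpha, where g is the
    frequency of pattern b in the projected sample and alpha = t / n."""
    g = sum(1 for row in sample if tuple(row[j] for j in C) == tuple(b))
    return g * n / len(sample)

# Toy data: pattern (1, 1) on columns [0, 1] has true frequency 600 of n = 1000.
n = 1000
A = [(1, 1, 0)] * 600 + [(0, 1, 1)] * 400
sample = usample_build(A, t=400)
est = usample_estimate(sample, n, C=[0, 1], b=(1, 1))
assert abs(est - 600) <= 0.2 * n   # within eps * ||f||_1 for eps = 0.2, w.h.p.
```

Note that the rows are sampled whole, so the same sample serves every later choice of C.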

5.2. ℓ_p Heavy Hitters Lower Bound

Recall that the objective of (projected) ℓ_p heavy hitters is to find all those rows in A^C whose frequency is at least some fraction of the ℓ_p norm of the frequency distribution of this projection. For the lower bound we need a randomly sampled code as defined in Lemma 3.2. The lower bound argument follows a similar outline to the bound for F_0, although now Bob’s query is on the complement of the support of his test vector y (i.e., S = [d] \ supp(y)) rather than supp(y). Akin to Theorem 4.1, we will create a reduction from the Index problem in communication complexity, and use its communication lower bound to argue a space lower bound for projected ℓ_p heavy hitters. The proof will generate an instance of ℓ_p heavy hitters based on encoding a collection of codewords, and consider in particular the status of the string corresponding to all zeros. We will consider two cases: when Bob’s query string is represented in Alice’s set of codewords, the all-zeros string will be a heavy hitter (for a subset of columns determined by the query); and when Bob’s string is not in the set, the all-zeros string will not be a heavy hitter. We begin by setting up the encoding of the input to the Index instance.

Theorem 5.3.

Let ϕ ∈ (0, 1) be a parameter and fix p > 1. Any algorithm which can obtain a constant factor approximation to the projected ℓ_p-heavy hitters problem requires space 2^{Ω(d)}.

Proof.

Fix ϵ > 0. Let C ⊆ B(d, ϵd) be a code whose words have weight ϵd and in which any two distinct words x, y have at most (ϵ² + γ)d ones in common. By Lemma 3.2 such a C exists and |C| = 2^{Ω_γ(d)}.

Suppose Alice holds a subset T ⊆ C. Let a ∈ {0, 1}^{|C|} be the characteristic vector over all length-d binary strings for which a_{e(u)} = 1 if and only if Alice holds u ∈ T. Bob holds y ∈ C and wants to determine if Alice holds y ∈ T. Ascertaining whether or not Alice holds y would be sufficient for Bob to solve Index and incur the Ω(|C|) lower bound.

The input array, A, for the ℓ_p-heavy hitters problem is constructed as follows.

  1. Alice populates A with 2^{ϵd} copies of the length-d all-ones vector, 1^d.

  2. Next, Alice takes Q = 2 and inserts into A the collection star^Q(T), which is the expansion of her input strings to all child words in binary. That is, for every s ∈ T, Alice computes all binary strings x of length d with supp(x) ⊆ supp(s) and includes these in A.

Let S = [d] \ supp(y), so that |S| = d − ϵd = (1−ϵ)d. Without loss of generality we may assume S = {1, 2, …, (1−ϵ)d}, and we denote by 0_S the length-(1−ϵ)d vector which is identically 0 on S. Suppose there is an algorithm A which approximates the ℓp-heavy hitters problem on a given column query up to a constant approximation factor. Bob queries A for the heavy hitters in the table A under the column query given by the set S, and then uses this information to answer whether or not y ∈ T.

Case 1:

y ∈ T. If y ∈ T, then we claim that 0_S is a ϕ-ℓp heavy hitter for some constant ϕ, i.e., f(0_S) ≥ ϕ‖f‖_p. We will manipulate the equivalent condition f(0_S)^p ≥ ϕ^p F_p. Since y ∈ T, the set star(y) is included in the table A, as Alice inserted star(s) for every s that she holds. Consider any child word of y, that is, a w ∈ star(y). Since y is supported only on [d] \ S and supp(w) ⊆ supp(y), we have w_i = 0 for every i ∈ S. So 0_S is observed once for every w ∈ star(y), and there are |star(y)| = 2^{ϵd} such w. Hence, 0_S occurs at least 2^{ϵd} times.

Now that we have a lower bound on the frequency of 0_S, it remains to upper bound the F_p value when y ∈ T, so that we are assured 0_S will be a heavy hitter in this instance. The quantity we seek is the F_p value of all vectors in A_S, written F_p(A, S), which we decompose into the contribution from 0_S present due to y being in T, and two special cases from the block of 2^{ϵd} all-ones rows and 'extra' copies of 0_S which are contributed by vectors y′ ≠ y. We claim that this F_p(A, S) value is at most |C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}.

First, let y′ ∈ C with y′ ≠ y, and consider prefixes z supported on S which can be generated by possible child words from star(y′). Since our code requires that |y′ ∩ y| ≤ (ϵ²+γ)d, y′ can have at most (ϵ²+γ)d 1s located in S̄ = [d] \ S, and hence must have at least (ϵ−ϵ²−γ)d 1s located in S. Since |star(y′)| = 2^{ϵd}, the number of copies of z inserted is at most 2^{ϵd−(ϵd−ϵ²d−γd)} = 2^{(ϵ²+γ)d}. This occurs for every y′ ∈ C, so the total number of occurrences of z is at most |C|·2^{(ϵ²+γ)d}. The contribution to F_p for this scenario is then |C|^p 2^{(ϵ²+γ)dp}. Observe that each codeword y′ generates at most 2^{ϵd} vectors under the star(y′) operator, so we have an upper bound of |C|·2^{ϵd} such vectors generated, with a total contribution of |C|^{1+p} 2^{(ϵ²p+ϵ+γp)d}.

Next, we focus on the two special vectors which have a high contribution to the F_p value. Recall that Alice specifically included 1_d into A 2^{ϵd} times, so the p-th powered frequency is exactly 2^{ϵpd} for this term. From the above argument, 0_S also has frequency 2^{ϵd} from star(y). But 0_S is also created at most 2^{(ϵ²+γ)d} times from each y′ ≠ y in T, giving an additional count of at most |C|·2^{(ϵ²+γ)d}. Based on our choice of ϵ and γ, we can ensure that this is asymptotically smaller than 2^{ϵd}, and so the total contribution from these two special vectors is at most 3·2^{ϵpd}. So in total we achieve that F_p is at most |C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}, as claimed.

Then 0_S meets the definition to be a ϕ-ℓp heavy hitter provided

2^{ϵpd} > ϕ^p (|C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}).

Assuming p > 1, and choosing ϵ sufficiently smaller than (p − 1)/p and γ sufficiently small, we have that

|C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} ≤ 2^{O(γ²(1+p)d) + ϵd + ϵ²pd + γpd} ≪ 2^{ϵpd}.

Hence, we require 2^{ϵpd} > ϕ^p·O(2^{ϵpd}), i.e., 2^{ϵd} > ϕ·O(2^{ϵd}), which is satisfied for a suitably small but constant ϕ.

Case 2:

y ∉ T. On the other hand, suppose that y ∉ T. Then the claim is that 0_S is not a ϕ-ℓp-heavy hitter. Now the vector 0_S does not occur with a high frequency, because star(y) is not included in A. However, certain child words in star(T) could also generate 0_S when projected onto S, and this is the contribution we need to upper bound. Again, any codeword s ∈ T has at least (ϵ−ϵ²−γ)d 1s present on S. So for a particular s ∈ T, 0_S can occur at most 2^{(ϵ²+γ)d} times. Taken over all y′ ∈ C which Alice includes in A, the frequency of 0_S in this case is at most |C|·2^{(ϵ²+γ)d}. Taking ϵ < 1/3, γ < ϵ/3 and using |C| = 2^{γ²d/ln 2} (Lemma 3.2), we have f(0_S) ≤ 2^{0.72ϵd}. Meanwhile, there are 2^{ϵd} copies of the string 1_d inserted into A, meaning that F_p(A, S) ≥ 2^{ϵpd}, and hence F_p^{1/p} ≥ 2^{ϵd} is strictly greater than f(0_S). Hence, 0_S is not a ϕ-ℓp heavy hitter provided that f(0_S)/F_p^{1/p} ≤ 2^{−0.28ϵd} is strictly less than ϕ = 1/4; this is satisfied for suitable ϵ and d.

Concluding the proof.

Bob can use his test vector y and a query S with a constant factor approximation algorithm A for the ℓp-heavy hitters problem, and distinguish between the two cases of Alice holding y or not based on whether 0_S is reported. As a result, Bob can determine if y ∈ T and consequently solve Index, thus incurring the Ω(|C|) = 2^{Ω(d)} lower bound. □

The instance A is initialized with 2^{ϵd} rows of the vector 1_d and the child words star_Q(T). For any s ∈ T, |star_Q(s)| = 2^{ϵd}, so the size of the instance A is (|T| + 1)·2^{ϵd} × d.

5.3. Fp Estimation

The space complexity of approximating the frequency moments F_p has been widely studied since the pioneering work of Alon, Matias and Szegedy [1]. Here, we investigate their complexity under projection. For p = 1, the first moment is always the number n of rows in the original instance, irrespective of the column set C, so only one word of space is required. We therefore devote attention to p ≠ 1.

The reduction to Index for Theorem 5.4 follows a similar outline to that of Theorem 5.3 for p > 1. For p < 1, we encode the problem slightly differently, closer to the encoding in Theorem 4.1. Again, the reduction to Index relies on Bob determining whether or not Alice holds y, which for F_p estimation amounts to Bob evaluating F_p(A, S) and comparing it to a threshold value.

Theorem 5.4.

Fix a real number p > 0 with p ≠ 1. A constant factor approximation to the projected F_p estimation problem requires space 2^{Ω(d)}.

Proof.

For p > 1 we begin by noticing that in the proof of Theorem 5.3 one can also monitor the F_p value of the input rather than simply checking the heavy hitters. In particular, depending on whether or not Alice holds Bob's test word y, the projected F_p changes by more than a constant factor. Consequently, we invoke the same proof for F_p, p > 1, and obtain the same 2^{Ω(d)} lower bound.

On the other hand, suppose that p < 1. We assume a code C ⊆ B(d, ϵd) with the property that any distinct x, x′ ∈ C have |x ∩ x′| ≤ cd for some small constant c > ϵ² (see Lemma 3.2). Again, Alice holds a subset T ⊆ C and inserts star(T) into the table A for the problem. Throughout this proof we use a binary alphabet, so we suppress the Q notation from star_Q(·). Bob holds a test vector y ∈ C and is tasked with determining whether or not Alice holds y ∈ T. We distinguish between the cases when Alice holds y ∈ T or not as follows. Bob uses y to determine the query column set S = supp(y) and will compare against the frequency value returned by the algorithm.

Case 1:

y ∉ T. Consider some y′ ∈ C \ {y}. Since y and y′ are both codewords, they can have a 1 coincident in at most cd locations. So if Alice does not hold y, then the patterns we need to consider are the binary strings supported on S that have at most cd locations set to 1. We denote this collection of words by M. There are r such vectors, where r is bounded by:

r ≤ ∑_{i=0}^{cd} (d choose i) ≤ cd·(d choose cd) = O(d)·2^{Θ(cd)}.

The total count of all strings generated by Alice's encoding is at most 2^{ϵd}|C|: each string in C generates 2^{ϵd} subwords under the star(·) operation. We now evaluate the p-frequency of elements in the set M, denoted F_p(M). For p < 1, the value F_p(M) is maximized when every element of M has the same number of occurrences, |C|·2^{ϵd}/r. As there are at most r members of M, we obtain F_p(M) ≤ r·(|C|·2^{ϵd}/r)^p = |C|^p 2^{ϵpd} r^{1−p}. Recalling the bounds on |C| and r, this is:

2^{cdp + ϵdp + Θ((1−p)cd)}·O(d^{1−p}). (5)

We can now choose c to be a small enough constant so that (5) is at most 2^{(1−α)ϵd} for a constant α > 0, by Lemma A.2 in Appendix A.2.

Case 2:

y ∈ T. Now consider the scenario when y ∈ T, so that Alice has inserted star(y) into the table A. Here, we can be sure that each of the 2^{ϵd} strings in star(y) appears at least once over the column set S, and so the F_p value is at least 2^{ϵd}·1^p = 2^{ϵd}.

We observe that these two cases obtain the constant factor separation, as required. Then, Bob can use his test vector y and a query S with a constant factor approximation algorithm to the projected F_p-estimation problem and distinguish between the two cases of Alice holding y or not. Thus, Bob can determine if y ∈ T and consequently solve the Index problem, incurring the Ω(|C|) = 2^{Ω_c(d)} lower bound, where c can be taken arbitrarily small. □

Remark 2.

For p > 1 we adopt the same instance as in Theorem 5.3, so the instance is of size (|T| + 1)·2^{ϵd} × d. On the other hand, for 0 < p < 1, only the words in star_Q(T) are required, so A has size |T|·2^{ϵd} × d.

5.4. p-Sampling

In the projected ℓp-sampling problem, the goal is to sample a row in A_C proportional to the p-th power of its number of occurrences. One approach to the standard (non-projected) ℓp-sampling problem on a vector x is to subsample and find the ℓp-heavy hitters [14]. Consequently, if one can find ℓp-heavy hitters for a certain value of p, then one can perform ℓp-sampling in the same amount of space, up to polylogarithmic factors. Interestingly, for projected ℓp-sampling this is not the case, and we show that for every p ≠ 1 there is a 2^{Ω(d)} lower bound. This is despite the fact that we can estimate ℓp-frequencies efficiently for 0 < p < 1, and hence find the heavy hitters (Section 5.1).

Theorem 5.5.

Fix a real number p > 0 with p ≠ 1, and let ε ∈ (0, 1/2). Let S ⊆ [d] be a column query and i be a pattern observed on the projected data A_S. Any algorithm which returns a pattern i sampled from a distribution (p_1, …, p_n), where p_i ∈ (1±ε)·f(i)^p/‖f(A, S)‖_p^p + Δ, together with a (1±ε′)-approximation to p_i, where Δ = 1/poly(nd) and ε′ > 0 is a sufficiently small constant, requires 2^{Ω(d)} bits of space.

Proof.

Case 1:

p > 1. The proof of Theorem 5.3 argues that the vector 0_S is a constant factor ℓp-heavy hitter for any p > 1 if and only if Bob's test vector y is in Alice's input set T, via a reduction from Index. That is, we argue that there are constants C_1 > C_2 for which if y ∈ T, then f(0_S)^p ≥ C_1·F_p, while if y ∉ T, then f(0_S)^p < C_2·F_p. Consequently, given an ℓp-sampler with the guarantees described in the theorem statement, the (empirical) probability of sampling the item 0_S allows us to distinguish the two cases. This holds even tolerating the (1±ε′)-approximation in sampling rate, for a sufficiently small constant ε′. In particular, if y ∈ T, then we will indeed sample 0_S with Ω(1) probability, which can be amplified by independent repetition; whereas, if y ∉ T, we do not expect to sample 0_S more than a handful of times. Consequently, for p > 1, an ℓp-sampler can be used to solve the ℓp-heavy hitters problem with arbitrarily large constant probability, and thus requires 2^{Ω(d)} space.

Case 2:

0 < p < 1. We now turn to 0 < p < 1. In the proof of Theorem 5.4, a reduction from Index is described where Alice holds the set T and Bob the string y. Bob can generate the set star(y) of size 2^{ϵd}, which is all possible binary strings supported on the column query S. From this, Bob constructs the set M′ = {z ∈ star(y) : |supp(z)| ≥ ϵd/2}. We observe that if y ∈ T, then at least half of the strings in star(y) are supported on at least ϵd/2 coordinates, which implies |M′| ≥ 2^{ϵd−1}. The total F_p in this case can be bounded by a contribution of |M′|·1^p + 2^{ϵd}: the first term arises from the |M′| strings in M′ with a frequency of 1, while the second term is shown in Case 1 of Theorem 5.4. Since |M′| ≤ 2^{ϵd}, we have that F_p ≤ 2^{ϵd+1} in this case. Consequently, the correct probability of ℓp-sampling returning a string in M′ is at least 1/4 for the "ideal" case of ε = 0, Δ = 0. Even allowing ε < 1/2 and Δ = 1/poly(nd), this probability is at least 1/10.

Otherwise, if y ∉ T, we exploit that any y′ ≠ y can coincide with y in at most cd = O(ϵ²d) coordinates, while |supp(z)| ≥ ϵd/2 > cd for any z ∈ M′. Hence, no z ∈ M′ can occur in star(y′) for another y′ ∈ C \ {y} on the column projection S. In this case, there should be zero probability of sampling a string in M′ (neglecting the trivial additive probability Δ).

To summarize, in the case that y ∈ T, by querying the projection S a constant fraction of the F_p-mass is on the set M′, whereas when y ∉ T, there is zero F_p-mass on the set M′. Since Bob knows M′, he can run an ℓp-sampler and check if the output is in the set M′, and succeed with constant probability. It follows that Bob can solve the Index problem (amplifying the success probability by independent repetitions if needed), and thus again the space required is 2^{Ω(d)}. □

Remark 3.

For p > 1 we again adopt the same instance as in Theorem 5.3, which has size (|T| + 1)·2^{ϵd} × d. However, for 0 < p < 1, we require the instance from Theorem 5.4, so A has size |T|·2^{ϵd} × d.

6. PROJECTED FREQUENCY ESTIMATION VIA SET ROUNDING

Although our lower bounds rule out the possibility of computing constant factor approximations to projected frequency problems in sub-exponential space, it is still possible to compute non-trivial approximations using space that is exponential but still smaller than that needed to enumerate all 2^d column subsets of [d]. We design a class of algorithms that proceed by keeping appropriate sketch data structures for a "net" of subsets. The net has the property that for any query C ⊆ [d] there is a C′ ⊆ [d] stored in the net which is not too different from C. We can then answer the query on C using the summary data structure computed for column set C′. To formalize this approach we need some further definitions, the first of which conceptualizes the notion of a net over subsets.

Definition 6.1 (α-net of subsets).

Let P([d]) denote the power set of [d]. Fix a parameter α ∈ (0, 1/2). An α-net of P([d]) is the set N = {U : |U| ≤ d/2 − αd or |U| ≥ d/2 + αd}, which contains all subsets U ∈ P([d]) whose size is at most d/2 − αd or at least d/2 + αd.

Let H (x) = −x log2(x) − (1 − x) log2(1 − x) denote the binary entropy function.

Lemma 6.2.

Let N be an α-net for P([d]). Then |N| ≤ 2^{H(1/2−α)d+1}.

Proof.

The total number of subsets whose size is at most d/2 − αd is ∑_{i=0}^{(1/2−α)d} (d choose i), and ∑_{i=0}^{(1/2−α)d} (d choose i) ≤ 2^{H(1/2−α)d} [8, Theorem 3.1]. By symmetry we obtain the same bound for the number of subsets of size at least d/2 + αd, yielding the claimed total. □
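The binomial tail bound used here, ∑_{i≤(1/2−α)d} (d choose i) ≤ 2^{H(1/2−α)d}, is easy to sanity-check numerically (our own script; the parameter values are illustrative):

```python
from math import comb, log2

def H(x):
    """Binary entropy function, with H(0) = H(1) = 0."""
    if x in (0, 1):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

d, alpha = 40, 0.2
k = int((0.5 - alpha) * d)                       # sets of size at most d/2 - αd
lower_tail = sum(comb(d, i) for i in range(k + 1))
net_size = 2 * lower_tail                        # add the symmetric upper tail
assert lower_tail <= 2 ** (H(0.5 - alpha) * d)   # [8, Theorem 3.1]
assert net_size <= 2 ** (H(0.5 - alpha) * d + 1) # Lemma 6.2
```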

6.1. From α-nets to Projections

Suppose that we are tasked with answering problem P = P(A, C) on a projection query C. We know that if C is known ahead of time, then we can encode the input data A ∈ [Q]^{n×d} on projection C as a standard stream over the alphabet [Q]^{|C|}. The use of α-nets allows us to sketch some of the input and use this to approximately answer a query. For a standard streaming problem, we say that an algorithm yields a β-approximation to the true solution z* if the returned estimate z satisfies z ∈ [z*/β, βz*]. A sketch obtaining such approximation guarantees will be referred to as a β-approximate sketch. We additionally need the following notion of error due to the distortion incurred when answering queries on elements of the α-net rather than the given query.

Algorithm 1:

Projected frequency by query rounding

(The pseudocode appears as a figure in the original manuscript; the procedure is described in the proof of Theorem 6.5.)
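Since Algorithm 1 is rendered only as an image in this version, a minimal sketch of the query-rounding procedure it describes may be useful, with exact distinct counting standing in for a β-approximate F0 sketch (all function names here are ours, not the paper's):

```python
from itertools import combinations

def build_net(d, alpha):
    """All subsets of [d] of size <= d/2 - αd or >= d/2 + αd (Definition 6.1)."""
    lo, hi = d / 2 - alpha * d, d / 2 + alpha * d
    return [frozenset(u) for k in range(d + 1) if k <= lo or k >= hi
            for u in combinations(range(d), k)]

def preprocess(rows, net):
    """One 'sketch' per net member; here an exact distinct count of projections."""
    return {u: len({tuple(r[i] for i in sorted(u)) for r in rows}) for u in net}

def query_f0(sketches, net, C):
    """Answer F0 on column set C via a nearest net member (symmetric difference)."""
    C = frozenset(C)
    best = min(net, key=lambda u: len(u ^ C))
    return sketches[best]

rows = [(0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 0, 1)]
net = build_net(4, 0.3)
sk = preprocess(rows, net)
print(query_f0(sk, net, {1, 2}))
```

The answer is only correct up to the rounding distortion r(α, F0) = 2^{αd}, exactly as analyzed below; replacing the exact counts with (1+ϵ)-approximate F0 sketches recovers the space bound of Theorem 6.5.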

Definition 6.3 (Rounding distortion).

Let P = P(A, C) be a projection query for the problem P on input A ∈ [Q]^{n×d} with projection C. Let N ⊆ P([d]) be an α-net. The rounding distortion r(α, P) is the worst-case deterministic error incurred by solving P(A, C′) rather than P(A, C) for an α-neighbour C′ ∈ N of C, so that P(A, C)/r(α, P) ≤ P(A, C′) ≤ r(α, P)·P(A, C).

Definition 6.3 is easiest to conceptualize for the F0 problem when A ∈ {0, 1}^{n×d}. Specifically, P = F0 and the task to solve is P = F0(A, C). For a given query C with an α-neighbour C′ in the net, the number of distinct items observed on C′ changes by at most a factor of two for each column in the set difference between C and C′. Since C′ is an α-neighbour, we have |C′ Δ C| ≤ αd, so the worst-case approximation factor in the number of distinct items observed over C′ rather than C is 2^{αd}.
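This factor-of-two-per-column behavior is easy to observe directly (illustrative script; the helper `f0` is ours):

```python
def f0(rows, cols):
    """Number of distinct patterns on the projection given by cols."""
    return len({tuple(r[i] for i in sorted(cols)) for r in rows})

rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
C, Cp = {0, 1}, {0}                 # |C Δ C'| = 1, so distortion factor at most 2
ratio = f0(rows, C) / f0(rows, Cp)
assert 1 / 2 <= ratio <= 2          # here f0 is 4 on C and 2 on C'
```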

More generally, we can categorize the rounding distortion for other typical queries, as demonstrated in the following lemma. Note that if the query is contained in the α-net N then we will retain a sketch for that problem; hence the distortion is only incurred for queries not contained in the net.

Lemma 6.4.

Fix α ∈ (0, 1/2), suppose A ∈ {0, 1}^{n×d} and let N be an α-net. If C is a projection query, then in the following cases the rounding distortion can be bounded as:

  1. P = F0(A, C): then r(α, F0) = 2^{αd};

  2. P = F_p(A, C), p > 1: then r(α, F_p) = 2^{αd(p−1)};

  3. P = F_p(A, C), p < 1: then r(α, F_p) = 2^{αd(1−p)}.

Proof.

Item (1) is an immediate consequence of the discussion following Definition 6.3, so we focus on (2) and (3). Suppose p ≥ 1. Let f_C = f(A, C) denote the frequency vector associated to the projection query C over domain [2^{|C|}]. First, consider a single index j ∈ [2^{|C|}] with (f_C)_j = x. Let C′ be an α-neighbour for C in N, and without loss of generality assume that |C| < |C′|. The task is to estimate ‖f_C‖_p^p from ‖f_{C′}‖_p^p, where f_{C′} = f(A, C′) is a frequency vector over the domain [2^{|C′|}], which is a factor 2^{|C′\C|} larger than the domain for f_C. However, observe that in f_{C′}, the value x is spread across the at most 2^{αd} entries that agree with j on columns C. The contribution to F_p from these entries is at most x^p (if the mass of x is mapped to a single entry). On the other hand, by Jensen's inequality, the contribution is at least 2^{αd}·(x/2^{αd})^p = x^p/2^{αd(p−1)}. Hence, considering all entries j, we obtain ‖f_C‖_p^p/2^{αd(p−1)} ≤ ‖f_{C′}‖_p^p ≤ ‖f_C‖_p^p. In the case |C| > |C′|, essentially the same argument shows that ‖f_C‖_p^p ≤ ‖f_{C′}‖_p^p ≤ ‖f_C‖_p^p·2^{αd(p−1)}. Thus we obtain the rounding distortion of 2^{αd(p−1)}. For p < 1, we proceed as above, except by concavity the ordering is reversed. □
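The two extremes in this argument (mass kept in one refined entry versus spread evenly over 2^{αd} entries) can be checked numerically; a small verification of the Jensen bound (our own script, with k playing the role of 2^{αd}):

```python
k, x, p = 8, 100.0, 2.0        # k = 2^{αd} refined entries share mass x; p > 1

concentrated = x ** p          # all mass mapped to a single refined entry
spread = k * (x / k) ** p      # mass split evenly: the Jensen lower bound
assert spread == x ** p / k ** (p - 1)   # equals x^p / 2^{αd(p-1)}
assert spread <= concentrated            # so the distortion is at most 2^{αd(p-1)}
```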

Observe that the distortion reduces to 1 (no distortion) as we approach p = 1 from either side. This is intuitive, since the F1 problem is simply to report the number of rows in the input, regardless of C, and so the problem becomes “easier” as we approach p = 1.

With these properties in hand, we can give a “meta-algorithm” as described in Algorithm 1. In Theorem 6.5 we can fully characterize the accuracy-space tradeoff for Algorithm 1 as a function of α and d.

Theorem 6.5.

Let A ∈ {0, 1}^{n×d} be the input data and C ⊆ [d] be a projection query. Suppose P = P(A, C) is the projected frequency problem, α ∈ (0, 1/2) and r(α, d) is the rounding distortion. With probability at least 1 − δ, a β·r(α, d) approximation can be obtained by keeping Õ(2^{H(1/2−α)d}) β-approximate sketches.

Proof.

Let N be an α-net for P([d]), and for every U ∈ N generate a β-approximate sketch for the problem P on the projection defined by U ⊆ [d]. Either the projection C ∈ N, in which case we can report a β factor approximation, or C ∉ N, in which case we take an α-neighbour C′ ∈ N and return the estimate z for P(A, C′). The sketch ensures that the answer to P(A, C′) is obtained with accuracy β, which by the rounding distortion is a β·r(α, d) approximation. To obtain this guarantee we build one sketch for every U ∈ N, for a total of O(2^{H(1/2−α)d}) sketches (via Lemma 6.2). By setting the failure probability of each sketch to δ/|N| and taking a union bound over the α-net, we achieve overall success probability at least 1 − δ. □

We remark that similar results are possible for the other functions considered: ℓp frequency estimation, ℓp heavy hitters and ℓp sampling. The key insight is that all these functions depend at heart on the quantity f_j/‖f‖_p, the frequency of the item at location j divided by the ℓp norm. If we evaluate this quantity on a superset of columns, then both the numerator and denominator may shrink or grow, in the same ways as analyzed in Lemma 6.4, and hence their ratio is bounded by the same factor, up to a constant. Hence, we can also obtain (multiplicative) approximation algorithms for these problems with similar behavior.

Illustration of Bounds.

First, observe that, irrespective of the problem P, the number of sketches needed is sublinear in 2^d. This is because the entropy H(1/2−α) < 1 for α > 0, so the size of the net is |N| < 2^d. For 0 ≤ p ≤ 2, we have β-approximate sketches with β = (1+ϵ) whose size is Õ(ε^{−2}), which is constant for constant ϵ. For example, we obtain a 2^{αd} approximation (ignoring small constant factors) for F0 in space O(2^{H(1/2−α)d}), using for instance the (1+ϵ)-approximate sketch from [11], which requires O(ε^{−2} + log n′) bits for an input over domain {1, …, n′}. Since n′ ≤ 2^d, and setting ϵ = 1, we obtain the approximation in space O(d·2^{H(1/2−α)d}). This is to be compared to the bounds in Section 4, where it is shown that (binary) instances of the projected F0 problem require space 2^{Ω(d)}. These results show that the constant hidden by the Ω(·) notation is less than 1.

In Figure 1 we illustrate the general behavior of the bounds for d = 20. We plot the relative space, 2^{H(1/2−α)d}/2^d, while varying α over (0, 1/2) (leftmost pane). This shows the space reduction from using the α-net approach compared to storing all 2^d queries. The central pane shows how the approximation factor 2^{αd} (on a log scale) varies with α. We plot the space-approximation tradeoff in the rightmost pane, where the approximation factor is again plotted on a log₂ scale. This plot suggests that if we reduce the space by a factor of 4 (i.e., permit relative space 2^{−2}), then the approximation factor is on the order of tens. Meanwhile, if we use relative space 2^{−8}, then the approximation remains on the order of hundreds: this is a substantial saving, as the number of summaries kept for the approximation is 2^{12} = 4096 ≪ 2^{20} ≈ 10^6.
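The quantities plotted in Figure 1 can be reproduced directly (our own script; the chosen α values are illustrative):

```python
from math import log2

def H(x):
    """Binary entropy function, with H(0) = H(1) = 0."""
    return 0.0 if x in (0, 1) else -x * log2(x) - (1 - x) * log2(1 - x)

d = 20
for alpha in (0.1, 0.2, 0.3):
    rel_space = 2 ** (H(0.5 - alpha) * d) / 2 ** d  # fraction of 2^d subsets kept
    approx = 2 ** (alpha * d)                       # worst-case distortion 2^{αd}
    print(f"alpha={alpha}: relative space 2^{H(0.5 - alpha) * d - d:.1f}, "
          f"approx factor {approx:.0f}")
```

For instance, at α = 0.2 the relative space is roughly 2^{−2.4} while the approximation factor is 2^4 = 16, matching the "factor 4 in space, approximation in the tens" reading of the rightmost pane.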

Figure 1:

Space-approximation tradeoff for d = 20 as α is varied from 0 to 1/2. Relative space is 2^{H(1/2−α)d}/2^d.

7. CONCLUDING REMARKS

We have introduced the topic of projected frequency estimation, with the aim of abstracting a range of problems involving computing functions over projected subspaces of data. Our main results show that these problems are generally hard in terms of their space requirements: in most cases, we require space which is exponential in the dimensionality d of the input. However, interestingly, the exact dependence is not as simple as 2^d: we show that coarse approximations can be obtained whose cost is substantially sublinear in 2^d. Letting N = 2^d, our upper and lower bounds establish that the space complexity of a number of problems here is polynomial in N, though substantially sublinear in N. And, in a few special cases (ℓp frequency estimation for p ≤ 1), a constant-sized sample suffices for accurate approximation of projected frequencies. It remains an intriguing open question to close the gaps between the upper and lower bounds, and to find the exact form of the polynomial dependence on N for these problems.

CCS CONCEPTS.

• Theory of computation → Streaming models; Lower bounds and information complexity; Communication complexity; Sketching and sampling.

Acknowledgements.

We thank S. Muthukrishnan and Jacques Dark for helpful discussions about this problem. The work of GC and CD was supported by European Research Council grant ERC-2014-CoG 647557. The work of DW was supported by NSF grant No. CCF-1815840, National Institute of Health grant 5R01HG 10798-2, and a Simons Investigator Award.

A. OMITTED PROOFS

A.1. Omitted Proof for Section 5.1

Theorem A.1 (Restated Theorem 5.1).

Let A ∈ {0, 1}^{n×d} be the input data and let C ⊆ [d] be a given column query. For a given string b ∈ {0, 1}^{|C|}, the absolute frequency of b, f(b), can be estimated up to ε‖f‖₁ additive error using a uniform sample of size O(ε^{−2} log(1/δ)) with probability at least 1 − δ.

Proof.

Let T = {i ∈ [n] : A_i|_C = b} be the set of indices on which the projection onto the query set C is equal to the given pattern b. Sample t rows of A uniformly with replacement, at rate q = t/n. Let the (multi-)subset of rows obtained be denoted by B, and the matrix formed from the rows of B be denoted Â. For every i ∈ B, define the indicator random variable X_i which is 1 if and only if the randomly sampled index i satisfies A_i|_C = b, which occurs with probability |T|/n. Next, we define T̂ = T ∩ B, so that |T̂| = ∑_{i=1}^{t} X_i, and the estimator Z = (n/t)|T̂| has E(Z) = |T|. Finally, apply an additive form of the Chernoff bound:

Pr(|Z − E(Z)| ≥ εn) = Pr(|(n/t)|T̂| − |T|| ≥ εn) = Pr(||T̂| − (t/n)|T|| ≥ εt) ≤ 2 exp(−ε²t).

Setting δ = 2 exp(−ε²t) allows us to choose t = O(ε^{−2} log(1/δ)), which is independent of n and d. The final bound comes from observing that ‖f‖₁ = n, f(b) = |T| and f̂(b) = Z. □
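The estimator in this proof is straightforward to implement; a minimal sketch (ours), sampling t rows uniformly with replacement and rescaling the hit count:

```python
import random

def estimate_frequency(A, C, b, t, seed=0):
    """Estimate |T| = |{i : row i restricted to C equals b}| from t uniform samples."""
    rng = random.Random(seed)
    cols = sorted(C)
    hits = 0
    for _ in range(t):
        row = rng.choice(A)                      # uniform sample, with replacement
        if tuple(row[j] for j in cols) == b:
            hits += 1                            # X_i = 1
    return len(A) * hits / t                     # Z = (n/t)|T^|, with E(Z) = |T|

# 1000 rows; exactly 500 match the pattern (0, 0) on columns {0, 2}.
A = [(0, 1, 0), (0, 1, 1), (1, 1, 0), (0, 1, 0)] * 250
est = estimate_frequency(A, {0, 2}, (0, 0), t=400)
assert abs(est - 500) <= 0.1 * len(A)            # additive error eps*n with eps = 0.1
```

Note that t depends only on ε and δ, not on n or d, mirroring the sample-size bound of the theorem.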

A.2. Omitted Proof for Section 5.3

A key step in the proof of Theorem 5.4 is that in Equation (5), the expression

2^{cdp + ϵdp + Θ((1−p)cd)}·O(d^{1−p})

can be bounded by a manageable power of two. We formalize this in Lemma A.2.

Lemma A.2.

Under the same assumptions as in Theorem 5.4, there exists a small constant c > 0 such that Equation (5) is at most 2^{(1−α)ϵd} for some α > 0.

Proof.

Here we use base-2 logarithms, and let 0 < c < 1 be a small constant which we need to bound. Also, let 0 < p < 1 be a given constant. Observe that the O(d^{1−p}) term only contributes positively to the exponent of (5), so we can ignore it in the calculation. Write 2^{Θ(cd(1−p))} = 2^{αcd(1−p)} for α > 0. This follows from:

(d choose cd) ≤ (ed/cd)^{cd} ≤ 2^{(2+log(1/c))cd} (6)

so let α = 2 + log(1/c). We need to ensure that, for some constant υ > 0:

cpd + ϵpd + αc(1−p)d ≤ (1−υ)ϵd. (7)

Dividing through by ϵd, this amounts to showing that:

cp/ϵ + p + αc(1−p)/ϵ < 1.

Now, cp/ϵ + p + αc(1−p)/ϵ = p(c/ϵ + 1 − αc/ϵ) + αc/ϵ, and we require this quantity to be less than 1. We may enforce the property p(c/ϵ + 1 − αc/ϵ) < 1, because c > 0 and for c < 4 we also have α > 0 (by inspection of Equation (6)), so the remaining term αc/ϵ > 0 can be made negligibly small by taking c small. Solving for c we obtain c(1−α) < ϵ(1/p − 1). Recalling the definition of α, this becomes:

c(log c − 1) < ϵ(1/p − 1) (8)

from which positivity of c yields c log c < ϵ(1/p − 1). Hence, it is enough to use c < ϵ(1/p − 1). □

Contributor Information

Graham Cormode, University of Warwick.

Charlie Dickens, University of Warwick.

David P. Woodruff, Carnegie Mellon University

REFERENCES

  • [1].Alon N, Matias Y, and Szegedy M. 1999. The Space Complexity of Approximating the Frequency Moments. JCSS: Journal of Computer and System Sciences 58 (1999), 137–147. [Google Scholar]
  • [2].Assadi Sepehr, Khanna Sanjeev, Li Yang, and Tannen Val. 2016. Algorithms for Provisioning Queries and Analytics. In International Conference on Database Theory. 18:1–18:18. [Google Scholar]
  • [3].Braverman Vladimir, Chestnut Stephen R, Ivkin Nikita, Nelson Jelani, Wang Zhengyu, and Woodruff David P. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of Principles of Database Systems. ACM, 361–376. [Google Scholar]
  • [4].Braverman Vladimir, Grigorescu Elena, Lang Harry, Woodruff David P., and Zhou Samson. 2018. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization Algorithms and Techniques (AP-PROX/RANDOM 2018), Vol. 116. 7:1–7:22. 10.4230/LIPIcs.APPROX-RANDOM.2018.7 [DOI] [Google Scholar]
  • [5].Braverman Vladimir, Krauthgamer Robert, and Yang Lin F.. 2018. Universal Streaming of Subset Norms. CoRR abs/1812.00241 (2018). arXiv:1812.00241 http://arxiv.org/abs/1812.00241 [Google Scholar]
  • [6].Chia Pern Hui, Desfontaines Damien, Perera Irippuge Milinda, Simmons-Marengo Daniel, Li Chao, Day Wei-Yen, Wang Qiushi, and Guevara Miguel. 2019. KHyperLogLog: Estimating Reidentifiability and Join-ability of Large Data at Scale. In IEEE Symposium on Security and Privacy (SP). 867–881. [Google Scholar]
  • [7].Doerr Benjamin. 2020. Probabilistic tools for the analysis of randomized optimization heuristics. In Theory of Evolutionary Computation. Springer, 1–87. [Google Scholar]
  • [8].Galvin David. 2014. Three tutorial lectures on entropy and counting. arXiv preprint arXiv:1406.7872 (2014). [Google Scholar]
  • [9].Jayaram Rajesh and Woodruff David P.. 2018. Perfect ℓp Sampling in a Data Stream. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS. 544–555. [Google Scholar]
  • [10].Jayram TS and Woodruff DP. 2009. The Data Stream Space Complexity of Cascaded Norms. In IEEE Symposium on Foundations of Computer Science (FOCS). 765–774. 10.1109/FOCS.2009.82 [DOI] [Google Scholar]
  • [11].Kane Daniel M, Nelson Jelani, and Woodruff David P. 2010. An Optimal Algorithm for the Distinct Elements Problem. In Proceedings of Principles of Database Systems. ACM, 41–52. [Google Scholar]
  • [12].Kremer Ilan, Nisan Noam, and Ron Dana. 1999. On Randomized One-Round Communication Complexity. Computational Complexity 8, 1 (1999), 21–49. [Google Scholar]
  • [13].Kveton Branislav, Muthukrishnan S, Vu Hoa T., and Xian Yikun. 2018. Finding Subcube Heavy Hitters in Analytics Data Streams. In Proceedings of the 2018 World Wide Web Conference. 1705–1714. 10.1145/3178876.3186082 [DOI] [Google Scholar]
  • [14].Larsen Kasper Green, Nelson Jelani, Nguyen Huy L., and Thorup Mikkel. 2016. Heavy Hitters via Cluster-Preserving Clustering. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS. 61–70. [Google Scholar]
  • [15].Parsons Lance, Haque Ehtesham, and Liu Huan. 2004. Subspace Clustering for High Dimensional Data: a review. SIGKDD Explorations 6, 1 (2004), 90–105. 10.1145/1007730.1007731 [DOI] [Google Scholar]
  • [16].Sublinear.info. [n.d.]. Open Problem 94. https://sublinear.info/index.php?title=Open_Problems:94.
  • [17].Tirthapura Srikanta and Woodruff David P.. 2012. A General Method for Estimating Correlated Aggregates over a Data Stream. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1–5 April, 2012. 162–173. [Google Scholar]
  • [18].Vu Hoa. 2018. Data Stream Algorithms for Large Graphs and High Dimensional Data. Ph.D. Dissertation. U. Massachusetts at Amherst. [Google Scholar]
