Author manuscript; available in PMC: 2021 Aug 6.
Published in final edited form as: Proc ACM SIGACT SIGMOD SIGART Symp Princ Database Syst. 2021 Jun 20;2021:273–284. doi: 10.1145/3452021.3458312

Subspace exploration: Bounds on Projected Frequency Estimation

Graham Cormode 1, Charlie Dickens 2, David P Woodruff 3
PMCID: PMC8345326  NIHMSID: NIHMS1696965  PMID: 34366641

Abstract

Given an n × d dimensional dataset A, a projection query specifies a subset C ⊆ [d] of columns which yields a new n × |C| array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show 2^{Ω(d)} lower bounds. However, we present upper bounds which demonstrate space dependency better than 2^d. That is, for c, c′ ∈ (0, 1) and a parameter N = 2^d, an N^c-approximation can be obtained in space min(N^{c′}, n), showing that it is possible to improve on the naïve approach of keeping information for all 2^d subsets of d columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.

Keywords: projection queries, distinct elements, frequency moments

1. INTRODUCTION

In many data analysis scenarios, datasets of interest are of moderate to high dimension, but many of these dimensions are spurious or irrelevant. Thus, we are interested in subspaces, corresponding to the data projected on a particular subset of dimensions. Within each subspace, we are concerned with computing statistics, such as norms, measures of variation, or finding common patterns. Such calculations are the basis of subsequent analysis, such as regression and clustering. In this paper, we introduce and formalize novel problems related to functions of the frequency in such projected subspaces. Already, special cases such as subspace projected distinct elements have begun to generate interest, e.g., in Vu’s work [18], and as an open problem in sublinear algorithms [16].

In more detail, we consider the original data to be represented by a (usually binary) array with n rows of d dimensions. A subspace is defined by a set C ⊆ [d] of columns, which defines a new array with n rows and |C| dimensions. Our goal is to understand the complexity of answering queries, such as which rows occur most frequently in the projected data, computing frequency moments over the rows, and so on. If C is provided prior to seeing the data, then the projection can be performed online, and so many of these tasks reduce to previously studied questions. Hence, we focus on the case when C is decided after the data is seen. In particular, we may wish to try out many different choices of C to explore the structure of the subspaces of the data. Our model is given in detail in Section 2.

For further motivation, we outline some specific areas where such problems arise.

  • Bias and Diversity. A growing concern in data analysis and machine learning is whether outcomes are ‘fair’ to different subgroups within the population, or whether they reinforce existing disparities. A starting point for this is to quantify the level of bias within the data when different features are considered. That is, we want to know whether certain combinations of attribute values are over-represented in the data (heavy hitters), and how many different combinations of values are represented in the data (captured by measures like F_0). We would like to be able to answer such queries accurately for many different (typically overlapping) subsets of dimensions.

  • Privacy and Linkability. When sharing datasets, we seek assurance that they are not vulnerable to attacks that exploit structure in the data to re-identify individuals. An attempt to quantify this risk is given in recent work [6], which asks how many distinct values occur in the data for each partial identifier, specified as a subset of dimensions. This prior work considered the case where the target dimensions are known in advance, but more generally we would like to compute such measures for arbitrary subsets, based on frequency moments and sampling techniques.

  • Clustering and Frequency Analysis. In the area of clustering, the notion of subspaces has been studied under a number of interpretations. The common theme is that the data may look unclustered in the original space due to spurious dimensions inflating the distance between points that are otherwise close. Many papers addressed this as a search problem: to search through exponentially many subspaces to find those in which the data is well-clustered. See the survey by Parsons, Haque and Liu [15]. In our setting, the problem would be to estimate various measures of density or clusteredness for a given subspace. A related problem is to find subspaces (or “subcubes” in database terminology) that have high frequency. Prior work proceeded under strong statistical independence assumptions about the values in different dimensions, for example, that the distribution can be modeled accurately with a (Naïve) Bayesian model [13].

2. PRELIMINARIES AND DEFINITIONS

For a positive integer Q, let [Q] = {0, 1, …, Q − 1}, and let A ∈ [Q]^{n×d} be the input data. The objective is to keep a summary of A which is used to estimate the solution to a problem P upon receiving a column subset query C ⊆ [d]. Problems P of interest are described in Section 2.1. Define the restriction of A to the columns indexed by C as A^C, whose rows A_i^C, 1 ≤ i ≤ n, are vectors over [Q]^{|C|}. We use the Minkowski norm ∥X∥_p = (Σ_{i,j} |X_{ij}|^p)^{1/p} to denote the entrywise ℓ_p norm for vectors (j = 1) and matrices (j > 1).

Computational Model.

First, the data A is received under the assumption that it is too large to hold entirely in memory, so it can be modeled as a stream of data. Our lower bounds are not strongly dependent on the order in which the data is presented. After observing A, a column query C is presented. The frequency vector over A induced by C is f = f(A,C), whose entries f_i(A,C) denote the frequency of the Q-ary word w_i ∈ [Q]^{|C|}. We study functions of the frequency vector f = f(A,C) after the observation of A and receipt of the column query C. The task is, during the observation phase, to design a summary of A which approximates statistics of A^C, the restriction of A to its projected subspace C. Approximations of A^C are accessed through the frequency vector f(A,C). Note that functions (e.g., norms) are taken over f(A,C) as opposed to the raw vector inputs from the column projection.

Remark 1 (Indexing Q-ary words into f).

Recall that the frequency vector f(A,C) has length Q^{|C|}, with each entry f_i counting the occurrences of the word w_i ∈ [Q]^{|C|}. To clearly distinguish between the (scalar) index i of f and the input vectors w_i whose frequency is measured by f_i, we introduce the index function e(w_i) = i. We may think of e(·) as simply the canonical mapping from [Q]^{|C|} into {0, 1, 2, …, Q^{|C|} − 1}, but other suitable bijections may be used.

For example, suppose Q = 2 and A ∈ {0, 1}^{5×3} with column indices {1, 2, 3} given below. If C = {1, 2}, then using the canonical mapping from {0, 1}^{|C|} into {0, 1, 2, 3} (e.g., e(00) = 0, e(01) = 1, …, e(11) = 3) we obtain A^C and hence f(A,C) = (1, 1, 0, 3).

A =
1 1 0
0 1 0
0 0 1
1 1 1
1 1 0

A^C =
1 1
0 1
0 0
1 1
1 1

The vector f = f(A,C) is then the frequency vector over which we seek to compute statistical queries such as ∥f∥_0. In this example, ∥f∥_0 = 3 (there are three distinct rows in A^C), while ∥f∥_1 = 5 is independent of the choice of C.
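The worked example can be checked with a short sketch (illustrative Python; the helper names freq_vector and e are ours, not from the paper):

```python
def freq_vector(A, C, Q=2):
    """Frequency vector f(A, C): count each Q-ary pattern in [Q]^|C|
    over the rows of A restricted to the columns in C (0-indexed)."""
    def e(w):
        # canonical index mapping e(w) from [Q]^|C| into {0, ..., Q^|C| - 1}
        idx = 0
        for symbol in w:
            idx = idx * Q + symbol
        return idx
    f = [0] * (Q ** len(C))
    for row in A:
        f[e(tuple(row[j] for j in C))] += 1
    return f

# The 5x3 example from Remark 1; C = {1, 2} (1-indexed) is columns [0, 1] here.
A = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
f = freq_vector(A, C=[0, 1])
assert f == [1, 1, 0, 3]                 # f(A, C) = (1, 1, 0, 3)
assert sum(1 for x in f if x > 0) == 3   # ||f||_0 = 3 distinct projected rows
assert sum(f) == 5                       # ||f||_1 = n = 5, independent of C
```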

2.1. Problem Definitions.

The problems that we consider are column-projected forms of common streaming problems ([11], [3], [4]). Here, we refer to these problems as “projected frequency estimation problems” over the input A. We define

f_i(A,C) = |{j : A_j^C = w_i, j ∈ [n]}|   (1)

F_p(A,C) = Σ_{w_i ∈ {0,1}^{|C|}} f_i(A,C)^p.   (2)
  • F_p estimation: Given a column query C, the F_p estimation problem is to approximate the quantity F_p(A,C) = ∥f(A,C)∥_p^p under some measure of approximation to be specified later (e.g., up to a constant factor). Of particular interest to us is (projected) F_0(A,C) estimation, which counts the number of distinct row patterns in A^C.

  • ℓ_p-heavy hitters: The query is specified by a column query C ⊆ [d], a choice of norm p > 0, and an accuracy parameter ϕ ∈ (0, 1). The task is then to identify all patterns w_i observed in A^C for which f_i(A,C) ≥ ϕ∥f(A,C)∥_p. Such values w_i (or equivalently i) are called ϕ-ℓ_p-heavy hitters, or simply ℓ_p-heavy hitters when ϕ is fixed. We will consider a multiplicative approximation based on a parameter c > 1, where we require that all ϕ-ℓ_p heavy hitters are reported, and no items with weight less than (ϕ/c) · ∥f(A,C)∥_p are included.

  • ℓ_p-frequency estimation: A related problem is to allow the frequency f_i(A,C) to be estimated accurately, with error as a fraction of F_p(A,C)^{1/p} = ∥f(A,C)∥_p, which we refer to as ℓ_p frequency estimation. Specifically, for a given w_i, return an estimate f̂_i which satisfies |f̂_i(A,C) − f_i(A,C)| ≤ ϕ∥f(A,C)∥_p.

  • ℓ_p sampling: The goal of this sampling problem is to sample patterns w_i according to the distribution p_i ∈ (1 ± ε) f_i^p(A,C)/∥f(A,C)∥_p^p + Δ, where Δ = 1/poly(nd), and return a (1 ± ε′)-approximation to the probability p_i of the returned item w_i.

When clear, we may drop the dependence upon C in the notation and write f_i and F_p instead. We will use Õ and Ω̃ notation to suppress factors that are polylogarithmic in the leading term. For example, lower bounds stated as Ω̃(2^d) suppress terms polynomial in d.
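For concreteness, the quantities defined above can be computed by brute force on a toy input (illustrative Python under our own naming; this exact approach uses space linear in the input, which is precisely what the paper seeks to beat):

```python
from collections import Counter

def projected_stats(A, C, p=2, phi=0.5):
    """Brute-force evaluation of the projected quantities of Section 2.1.
    Exact but space-hungry: it materializes the full frequency vector."""
    f = Counter(tuple(row[j] for j in C) for row in A)  # sparse frequency vector
    F0 = len(f)                                         # distinct projected rows
    Fp = sum(c ** p for c in f.values())                # F_p moment
    lp_norm = Fp ** (1.0 / p)                           # ||f(A, C)||_p
    heavy = [w for w, c in f.items() if c >= phi * lp_norm]  # phi-l_p heavy hitters
    return F0, Fp, heavy

A = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
F0, F2, heavy = projected_stats(A, C=[0, 1], p=2, phi=0.5)
# f = (1, 1, 0, 3): F0 = 3, F2 = 1 + 1 + 9 = 11, ||f||_2 = sqrt(11) ~ 3.32
assert (F0, F2) == (3, 11)
assert heavy == [(1, 1)]   # only pattern 11 has f_i >= 0.5 * sqrt(11)
```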

2.2. Related Work

The model we study is reminiscent of, but distinct from, some related formulations. In the problem of cascaded aggregates [10], we imagine the starting data as a matrix, and apply a first operator (denoted Q) on each row to obtain a vector, to which we apply a second operator P. Our problems can be understood as special cases of cascaded aggregates where Q is a project-then-concatenate operator, yielding a vector whose indices correspond to the concatenation of the projection of a row. Another example of a cascaded aggregate is a so-called correlated aggregate [17], but this was only studied in the context of two dimensions. To the best of our knowledge, our projection-based definitions have not been previously studied under the banner of cascaded aggregates.

Other work includes results on provisioning queries for analytics [2], but the way these statistics are defined is different from our formulation. In that setting there are different scenarios (“hypotheticals”) that may or may not be turned on: this corresponds to “what-if” analysis, whereby a query is roughly “how many items are observed if a given set of columns is present (turned on)?” The number of distinct elements for the query is the union of the number of distinct elements across scenarios. In our setting, we concatenate the distinct items into a row vector and count the number of distinct vectors. Note that in the hypotheticals setting in the binary case, each column only has 2 distinct values, 0 and 1, and thus the union also only has 2 distinct values. However, we can obtain up to 2^d distinct vectors. Consequently, Assadi et al. are able to achieve poly(d/ε) space for counting distinct elements, whereas we show a 2^{Ω(d)} lower bound. Moreover, they show a 2^{Ω(d)} lower bound for counting (i.e., F_1), whereas we achieve a constant upper bound. These disparities highlight the differences between the models.

More recently, the notion of “subset norms” was introduced by Braverman, Krauthgamer and Yang [5]. This problem considers an input that defines a vector υ, where the objective is to take a subset s of the entries of υ and compute the norm. Results are parameterized by the “heavy hitter dimension”, which is a measure of complexity over the set system from which s can be drawn. While sharing some properties with our scenario, the results for this model are quite different. In particular, in [5] a trivial upper bound follows by maintaining the vector υ explicitly, of dimension n. Meanwhile, many of our results show lower bounds that are exponential in the dimensionality, i.e., 2^{Ω(d)}, though we also obtain non-trivial upper bounds.

3. CONTRIBUTIONS

The main challenge here is that the column query C is revealed only after observing the data; consequently, applying a known algorithm to just the columns C as the data arrives is not possible. For example, consider the exemplar problem of counting the number of distinct rows under the projection C, i.e., the projected F_0 problem. Recall that A_i^C denotes the i-th row of the array A^C. Then the task is to count the number of distinct rows observed in A^C, i.e.,

F_0(A,C) = |{A_j^C : j ∈ [n]}| = ∥f(A,C)∥_0.

Observe that F_0(A,C) can vary widely over different choices of C. For example, even for a binary input A ∈ {0, 1}^{n×d}, F_0(A,C) can be as large as 2^d when C consists of all columns of a highly diverse dataset, and as small as 1 or 2 when C is a single column or when C selects homogeneous columns (e.g., the columns in C are all zeros).

3.1. Summary of Results

Our main focus, in common with prior work on streaming algorithms, is on space complexity. For the above problems we obtain the following results:

  • In Section 4 we show that projected F_0 estimation requires 2^{Ω(d)} space for a constant factor approximation, demonstrating the essential hardness of these problems. Nevertheless, we obtain a tradeoff in terms of the upper bounds described below.

  • Section 5 presents results for ℓ_p frequency estimation, ℓ_p heavy hitters, F_p estimation, and ℓ_p sampling. We show a space upper bound of O(ε^{−2} log(1/δ)) for ℓ_p frequency estimation when 0 < p < 1, and complement this result with lower bounds for heavy hitters when p > 1, and for F_p estimation and ℓ_p sampling for all p ≠ 1, showing that these problems require 2^{Ω(d)} bits of space.

  • In Section 6 we show upper bounds for F_0 and F_p estimation which improve on the exhaustive approach of keeping summaries of all 2^d subsets of columns, by showing that we can obtain coarse approximate answers with a smaller subset of materialized answers. Specifically, for parameters N = 2^d and α ∈ (0, 1) we can obtain an N^α approximation in min(N^{H(1/2−α)}, n) space. Since the binary entropy function satisfies H(x) < 1 for x ≠ 1/2, this bound is better than the trivial 2^d bound.

These bounds show that there is no possibility of “super efficient” solutions that use space less than exponential in d. Nevertheless, we demonstrate some solutions whose dependence is still exponential but weaker than the naïve 2^d. Thinking of N = 2^d, the above upper and lower bounds imply the actual complexity is a nontrivial polynomial function of N.
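For intuition on this tradeoff, the exponents can be tabulated directly (our own illustrative computation; the choices of α and d are arbitrary):

```python
from math import log2

def H(x):
    """Binary entropy H(x) = -x log2(x) - (1 - x) log2(1 - x)."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

d = 100  # so N = 2^d
for alpha in (0.1, 0.25, 0.4):
    approx_exp = alpha * d          # approximation factor N^alpha = 2^(alpha d)
    space_exp = H(0.5 - alpha) * d  # space N^H(1/2 - alpha) = 2^(H(1/2 - alpha) d)
    assert space_exp < d            # strictly below the trivial 2^d exponent
```

Coarser approximations (larger α) push H(1/2 − α) further below 1, so the space exponent drops well below d.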

The bounds also show novel dichotomies that are not present in comparable problems without projection. In particular, we show that (projected) ℓ_p sampling is difficult for p ≠ 1, while (projected) ℓ_p-heavy hitters has a small-space algorithm for 0 < p < 1. This differs from the standard streaming model, in which the (classical) ℓ_p heavy hitters problem has a small-space solution for p ≤ 2 without projection [14], and (classical) ℓ_p sampling can be performed efficiently for p ≤ 2 [9]. Our lower bounds are built on amplifying the frequency of target codewords for a carefully chosen test word.

Note that there are trivial naïve solutions which simply retain the entire input and so answer the query exactly on the query C: to do so takes Θ(nd) space, noting that n may be exponential in d. Alternatively, if we know t = |C| in advance, then we may enumerate all (d choose t) subsets of [d] of size t and maintain (approximate) summaries for each choice of C. However, this entails a cost of at least Ω((d choose t)), and as such does not give a major reduction in cost.

3.2. Coding Theory Definitions

Our lower bounds will typically make use of a binary code C, consisting of a collection of codewords, which are vectors (or strings) of fixed length. We write B(l, k) to denote all binary strings of length l and (Hamming) weight k. We first consider the dense, low-distance family of codes C = B(d, k), but will later use more sophisticated randomly sampled codes. When k < d/2, we have (d choose k) ≥ (d/k)^k, and when k = d/2, we have (d choose d/2) ≥ 2^d/√(2d). A trivial but crucial property of B(d, k) is that any two distinct codewords from this set can have intersecting 1s in at most k − 1 positions.

We define the support of a string y as supp(y) = {i : yi ≠ 0}, the set of locations where y is non-zero. We define child words to be the set of new codewords obtained from C by generating all Q-ary words z with supp(z) ⊆ supp(y) for some yC, and construct them with the star operator defined next.

Definition 3.1 (starQ operation, child words).

Let d be the length of a binary word, let k be a weight parameter, and suppose y ∈ B(d, k). Let M = supp(y). We define the function star^Q(y) to be the operation which lifts a binary word y to a larger alphabet by generating all the words over the alphabet [Q] on M. Formally,

star^Q(y ∈ {0, 1}^d) = {z : z ∈ [Q]^d, supp(z) ⊆ supp(y)}

Since the alphabet size Q is often fixed when using this operation, when clear we will drop the superscript and abuse notation by writing star(y). Elements of the set starQ (y) are referred to as child words of y.

For any y ∈ B(d, k), there are Q^k words generated by star^Q(y). When star(·) is applied to all vectors of a set U, we write star(U) = ∪_{u∈U} star(u). For example, if y ∈ {0, 1}^d and Q = 2, then star^Q(y) is simply all possible binary words of length d whose support is contained in supp(y). For the projected F_0 problem, the code C = B(d, k) is sufficient. However, for our subsequent results, we need a randomly chosen code whose existence is demonstrated in Lemma 3.2. The proof follows from a Chernoff bound.
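These definitions can be sketched directly (illustrative Python; the function names are ours):

```python
from itertools import combinations, product

def B(l, k):
    """B(l, k): all binary strings of length l with Hamming weight exactly k."""
    words = []
    for supp in combinations(range(l), k):
        w = [0] * l
        for i in supp:
            w[i] = 1
        words.append(tuple(w))
    return words

def star(y, Q):
    """star^Q(y): all words z in [Q]^d with supp(z) a subset of supp(y)."""
    d, M = len(y), [i for i, yi in enumerate(y) if yi != 0]
    children = []
    for vals in product(range(Q), repeat=len(M)):
        z = [0] * d
        for i, v in zip(M, vals):
            z[i] = v
        children.append(tuple(z))
    return children

d, k, Q = 6, 2, 3
code = B(d, k)
assert len(code) == 15                  # binomial(6, 2) codewords
assert len(star(code[0], Q)) == Q ** k  # Q^k child words per codeword
# Any two distinct codewords share at most k - 1 = 1 one-positions.
assert all(sum(a & b for a, b in zip(x, z)) <= k - 1
           for x in code for z in code if x != z)
```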

Lemma 3.2.

Fix ϵ, γ ∈ (0, 1) and let C ⊆ B(d, ϵd) be such that for any two distinct x, y ∈ C we have |x ∩ y| ≤ (ϵ² + γ)d. With probability at least 1 − exp(−Ω(dγ²)), such a code C of size 2^{Ω(γ²d)} exists, and it can be instantiated by sampling sufficiently many words i.i.d. at random from B(d, ϵd).

Proof.

Let X be the random variable counting the number of 1s in common between words x and y sampled uniformly at random from B(d, ϵd). Then the expectation of X is E[X] = (ϵd)²/d = ϵ²d, and although the coordinates of x, y are not independent, they are negatively correlated, so we may use a Chernoff bound (see Section 1.10.2 of [7] for self-contained details). Our aim is to show that the number of 1s in common between x and y exceeds its expectation by at most γd. Via an additive Chernoff-Hoeffding bound:

Pr(X − E[X] ≥ γd) ≤ exp(−2dγ²).

This bounds the probability that a given pair of codewords is too similar; taking a union bound over the Θ(|C|²) pairs of codewords, the code can have size |C| = exp(dγ²) = 2^{γ²d/ln 2}. □
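The lemma can be checked empirically at moderate size (an illustrative, seeded sketch with parameters of our choosing):

```python
import random
from itertools import combinations

def sample_code(d, eps, m, seed=0):
    """Sample m codewords i.i.d. from B(d, eps*d), each as its support set."""
    rng = random.Random(seed)
    k = int(eps * d)
    return [frozenset(rng.sample(range(d), k)) for _ in range(m)]

d, eps, gamma = 200, 0.2, 0.1
code = sample_code(d, eps, m=50)
bound = (eps ** 2 + gamma) * d  # allowed overlap (eps^2 + gamma) d = 28
# Expected pairwise overlap is eps^2 d = 8; exceeding it by gamma*d = 20
# has exponentially small probability, so no pair should exceed the bound.
violations = sum(1 for x, y in combinations(code, 2) if len(x & y) > bound)
assert violations == 0
```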

3.3. Overview of Lower Bound Constructions

Our lower bounds rely upon non-standard reductions from the Index problem, using the codes C defined in Section 3.2. These reductions are more involved than is typical, as we need to combine the combinatorial properties of C with the star(·) operation on Alice’s input. In particular, the interplay between C and star(·) must be understood over the column query S given by Bob, which again relies on the properties of C used to define the input.

Recall that the typical reduction from Index is as follows: Alice holds a vector a ∈ {0, 1}N, Bob holds an index i ∈ [N] and he is tasked with finding ai following one-way communication from Alice. The randomized communication complexity of Index is Ω(N) [12]. We adapt this setup for our family of problems, following an approach that has been used to prove many space lower bounds for streaming algorithms.

The general construction of our lower bounds is as follows: first we choose a binary code C (usually independently at random) with certain properties, such as a specific weight and a bounded number of 1s in locations shared with other words in the code. In the communication setting, Alice holds a subset T ⊆ C while Bob holds a codeword y ∈ C and is tasked with determining whether or not y ∈ T. Bob can also access the index function (Remark 1) e(y), which simply returns the index or location at which y is enumerated in C. The corresponding bitstring for the Index problem that Alice holds is a ∈ {0, 1}^{|C|}, which has a_j = 1 for every element w_j ∈ T (under a suitable enumeration of {0, 1}^d). We use the star(T) operator (defined in Section 3.2) to map these strings into an input A for each of the problems (i.e., a collection of rows of datapoints). Upon defining the instance, we show that Bob can query a proposed algorithm for the problem and use the output to determine whether or not Alice holds y. This enables Bob to return a_{e(y)}, which is 1 if Alice holds y ∈ T and 0 otherwise. Hence, determining whether y ∈ T or y ∈ C \ T solves Index and incurs the lower bound Ω(|C|). Our constructions of C establish that |C| is exponentially large in d.

4. LOWER BOUNDS FOR F0

In this section, we focus on the F0 (distinct counting) projected frequency problem. The main result in this section is a strong lower bound for the problem, which is exponential in the domain size d.

We use codes C=B(d,k) as defined in Section 3.2.

Theorem 4.1.

Let Q ≥ 2 be the target alphabet size and k < d/2 be a fixed query size with Q > k. Any algorithm achieving an approximation factor of Q/k for the projected F_0 problem requires space 2^{Ω(d)}.

Proof.

Fix the code C = B(d, k), recalling that any x ∈ C has Hamming weight k, and that distinct x, y ∈ C share at most k − 1 bits in common. We will use these facts to obtain the approximation factor.

Obtain the collection of all child words C_Q from C by using star^Q(·) as defined in Section 3.2. We will reduce from the Index problem in communication complexity as follows. Alice has a set of (binary) codewords T ⊆ C and initializes the input array A for the algorithm with all strings from the set star(T). Bob has a vector y ∈ C and wants to know if y ∈ T or not. Let S = supp(y), so that |S| = k, and Bob queries the F_0 algorithm on the columns of A restricted to S. First suppose that y ∈ T. Then Alice holds y, so star(y) is included in A, and there must be at least Q^k patterns observed. Conversely, if y ∉ T, then Alice does not include y in A. However, by the construction of C, y shares at most k − 1 1s with any distinct y′ ∈ C. Thus, the number of patterns observed on the columns corresponding to S is at most (k choose k−1) Q^{k−1} = kQ^{k−1}.

We observe that if we can distinguish the case of kQ^{k−1} from Q^k, then we can correctly answer the Index instance, i.e., if we can achieve an approximation factor of Δ such that:

Δ = Q^k / (kQ^{k−1}) = Q/k.   (3)

Any protocol for Index requires communication proportional to the length of Alice’s input vector a, which translates into a space lower bound for our problem. Alice’s set T ⊆ C defines an input vector for the Index problem, built as a characteristic vector over all words in C, denoted by a ∈ {0, 1}^{|C|}, as follows. Under a suitable enumeration C = {w_1, w_2, …, w_{|C|}}, Alice’s vector is encoded via a_i = 1 if and only if Alice holds the binary word w_i ∈ T. From the separation shown earlier, Bob can determine if Alice holds a word in T, thus solving Index and incurring the lower bound. Hence, space proportional to |C| = (d choose k) is necessary. We use the standard relation (d choose k) ≥ (d/k)^k and choose k = ad/2 for a constant a ∈ [0, 1), from which we obtain |C| ≥ 2^{ad/2}, to achieve the stated approximation guarantee. □
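The two cases of this reduction can be simulated at toy scale (illustrative Python; d here is tiny, whereas the lower bound relies on codes of size exponential in d):

```python
from itertools import combinations, product

def star(supp_y, d, Q):
    """Child words over [Q]^d of a codeword given by its support set."""
    rows = []
    for vals in product(range(Q), repeat=len(supp_y)):
        z = [0] * d
        for i, v in zip(sorted(supp_y), vals):
            z[i] = v
        rows.append(tuple(z))
    return rows

def projected_F0(A, S):
    """Number of distinct rows of A after projecting onto the columns in S."""
    return len({tuple(r[j] for j in sorted(S)) for r in A})

d, k, Q = 6, 2, 5
code = [frozenset(c) for c in combinations(range(d), k)]  # supports of B(d, k)
y = code[0]                       # Bob's test codeword
T = [c for c in code if c != y]   # Alice's set, initially without y
A = [row for c in T for row in star(c, d, Q)]
S = y                             # Bob queries on S = supp(y)
assert projected_F0(A, S) <= k * Q ** (k - 1)  # y not in T: at most k Q^(k-1)
A += star(y, d, Q)                             # now Alice also holds y
assert projected_F0(A, S) == Q ** k            # y in T: all Q^k patterns appear
```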

Setting k = ad/2 allows us to vary the query size and directly understand how this affects the size of the code necessary for the lower bound. For a query of size k, the size of the input to the projected F_0 problem is a (d/k)^k × d array A of total size d^{k+1}/k^k. Theorem 4.1 is for k < d/2. When k = d/2 we can use the tighter bound for the central binomial coefficient and obtain the following stronger bounds. The subsequent results use the same encoding as in Theorem 4.1; however, at certain points of the calculations the parameter settings are slightly altered to obtain different guarantees.

Corollary 4.2.

Let Q ≥ d/2 be an alphabet size and let d/2 be the query size. There exists a choice of input data A ∈ [Q]^{n×d} such that any algorithm achieving approximation factor 2Q/d for the projected F_0 problem on the query requires space 2^{Ω(d)}.

Proof.

Repeat the argument of Theorem 4.1 with k = d/2. The approximation factor from Equation (3) becomes Δ = 2Q/d. The code size for Index is |C| ≥ 2^d/√(2d). Note that |C| is 2^{Ω(d)}, as (1/2) log_2(2d) can always be bounded above by a linear function of d. The instance is an array whose rows are the child words in star^Q(C), with Q^{d/2} per codeword. Hence, the size of the instance to the F_0 algorithm is Θ(2^d Q^{d/2} d^{1/2}). □

Corollary 4.3 follows from Corollary 4.2 by setting Q = d.

Corollary 4.3.

A 2-factor approximation to the projected F_0 problem on a query of size d/2 needs space 2^{Ω(d)}, with an instance A whose size is Θ(2^d d^{(d+1)/2}).

Theorem 4.1 and its corollaries suffice to obtain space bounds over all choices of Q. However, Q could potentially grow to be very large, which may be unsatisfying. As a result, we will argue how the error varies for fixed Q. To do so, we map Q down to a smaller alphabet of size q and use this code to define the communication problem from which the lower bound will follow. The cost of this is that the instance is a logarithmic factor larger in the dimensionality.

Corollary 4.4.

Let q be a target alphabet size such that 2 ≤ q ≤ Q. Let α = Q log_q(Q) ≥ 1 and d′ = d log_q(Q). There exists a choice of input data A ∈ [q]^{n×d′} for which any algorithm for the projected F_0 problem over queries of size d′/2 that guarantees error Õ(α/d′) requires space 2^{Ω(d)}.

Proof.

Fix the binary code C = B(d, d/2) and generate all child words over the alphabet [Q] to obtain the approximation factor Δ = 2Q/d as in Corollary 4.2. For every w ∈ C there are Q^{d/2} child words, so the child code C_Q has size n = Θ(2^d Q^{d/2}/√d) words. Since Q can be arbitrarily large, we encode it via a mapping to a smaller alphabet but over a slightly larger dimension; specifically, we use a function [Q] → [q]^{log_q(Q)} which generates a q-ary string for each symbol in [Q]. Hence, all of the stored strings in C_Q ⊆ [Q]^d are equivalent to a collection C_q over [q]^{d log_q(Q)}. Although |C_Q| = |C_q|, words in C_Q have length d, while the equivalent words in C_q have length d log_q(Q). This collection of words from C_q now defines the instance A ∈ [q]^{n×d log_q(Q)}, each word being a row of A. Taking α = Q log_q(Q) and d′ = d log_q(Q) results in an approximation factor of:

Δ = 2Q/d = 2α/d′.   (4)

Alice’s input vector a is defined by the same code C and held set TC as in Theorem 4.1 so we incur the same space bound. Likewise, Bob’s test vector y and column query S also remain the same as in that theorem. □

Corollary 4.4 says that the same accuracy guarantee as Corollary 4.2 can be given by reducing the arbitrarily large alphabet [Q] to a smaller one over [q]. However, the price to pay for this is that the size of the instance A increases by a factor of logq (Q) in the dimensionality. These various results are summarized in Table 1.

Table 1:

Comparison of the lower bounds for F_0. Theorem 4.1 uses C = B(d, k); the corollaries use C = B(d, d/2).

               Instance A for F_0                    Approx. factor

Theorem 4.1    (d/k)^k × d                           Q/k
Corollary 4.2  2^d Q^{d/2} × d over [Q]              2Q/d
Corollary 4.3  2^d d^{d/2} × d over [d]              2
Corollary 4.4  2^d Q^{d/2} × d log_q(Q) over [q]     2Q/d

5. ℓ_p-FREQUENCY BASED PROBLEMS

In this section, we extend the techniques from the previous section to understand the complexity of projected frequency estimation problems related to the ℓ_p norms and F_p frequency moments (defined in Section 2.1). A number of our results are lower bounds, but we begin with a simple sampling-based upper bound to set the stage.

5.1. ℓ_p Frequency Estimation

We first focus on the projected frequency estimation problem, showing that a simple algorithm keeping a uniform sample of the rows works for p < 1. The algorithm uSample(A, C, t, b) first builds a uniform sample of t rows (sampled with replacement at rate α = t/n) from A and evaluates the absolute frequency of the string b on the sample after projection onto C. Let g be the absolute frequency of b on the subsample. To estimate the true frequency of b on the entire dataset from the subsample, we return an appropriately scaled estimator f̂_{e(b)} = g/α, which meets the required bounds given in Theorem 5.1, recalling that e(b) is the index location associated with the string b. The proof follows by a standard Chernoff bound argument and is given in Appendix A.1.

Theorem 5.1.

Let A ∈ {0, 1}^{n×d} be the input data and let C ⊆ [d] be a given column query. For a given string b ∈ {0, 1}^{|C|}, the absolute frequency of b, f_{e(b)}, can be estimated up to ε∥f∥_1 additive error using a uniform sample of size O(ε^{−2} log(1/δ)), with probability at least 1 − δ.

The same algorithm can be used to obtain bounds for all 0 < p < 1. By noting that ∥f∥_1 ≤ ∥f∥_p for 0 < p < 1, we obtain the following corollary.

Corollary 5.2.

Let A, b, C be as in Theorem 5.1 and let 0 < p < 1. Then uniformly sampling O(ε^{−2} log(1/δ)) rows achieves |f̂_{e(b)} − f_{e(b)}| ≤ ε∥f∥_p with probability at least 1 − δ.

Both Theorem 5.1 and Corollary 5.2 are stated as if C is given. However, since the sampling does not rely on C in any way, we can sample complete rows of the input uniformly prior to receiving the query C, which is revealed after observing the data. The uniform sampling approach also allows us to identify the ℓ_p heavy hitters in small space: for each item included in the sample (when projected onto the column set C), we use the sample to estimate its frequency, and declare those with high enough estimated frequency to be the heavy hitters. By contrast, for p > 1 we obtain a 2^{Ω(d)} space lower bound, given in the next section.
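A minimal sketch of the uSample routine described above (illustrative Python; the function names and toy data are ours):

```python
import random

def usample_build(A, t, seed=0):
    """Observation phase: keep t uniform rows (with replacement), chosen
    before the column query C is known."""
    rng = random.Random(seed)
    return [A[rng.randrange(len(A))] for _ in range(t)]

def usample_estimate(sample, n, C, b):
    """Query phase: estimate f_{e(b)}(A, C) as g / alpha, where g is the
    frequency of pattern b in the projected sample and alpha = t / n."""
    g = sum(1 for row in sample if tuple(row[j] for j in C) == tuple(b))
    return g * n / len(sample)

# Toy data: pattern (1, 1) on columns [0, 1] has true frequency 600 of n = 1000.
n = 1000
A = [(1, 1, 0)] * 600 + [(0, 1, 1)] * 400
sample = usample_build(A, t=400)
est = usample_estimate(sample, n, C=[0, 1], b=(1, 1))
assert abs(est - 600) <= 0.2 * n   # within eps * ||f||_1 for eps = 0.2, w.h.p.
```

Note that the rows are sampled whole, so the same sample serves every later choice of C.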

5.2. ℓ_p Heavy Hitters Lower Bound

Recall that the objective of (projected) ℓ_p heavy hitters is to find all those rows in A^C whose frequency is at least some fraction of the ℓ_p norm of the frequency distribution of this projection. For the lower bound we need a randomly sampled code as defined in Lemma 3.2. The lower bound argument follows a similar outline to the bound for F_0, although now Bob’s query is on the complement of the support of his test vector y (i.e., S = [d] \ supp(y)) rather than supp(y). Akin to Theorem 4.1, we will create a reduction from the Index problem in communication complexity, and use its communication lower bound to argue a space lower bound for projected ℓ_p heavy hitters. The proof will generate an instance of ℓ_p heavy hitters based on encoding a collection of codewords, and consider in particular the status of the string corresponding to all zeros. We will consider two cases: when Bob’s query string is represented in Alice’s set of codewords, the all-zeros string will be a heavy hitter (for a subset of columns determined by the query); and when Bob’s string is not in the set, the all-zeros string will not be a heavy hitter. We begin by setting up the encoding of the input to the Index instance.

Theorem 5.3.

Let ϕ ∈ (0, 1) be a parameter and fix p > 1. Any algorithm which can obtain a constant factor approximation to the projected ℓ_p-heavy hitters problem requires space 2^{Ω(d)}.

Proof.

Fix ϵ > 0. Let C ⊆ B(d, ϵd) be a code whose words have weight ϵd and in which any two distinct words x, y have at most (ϵ² + γ)d ones in common. By Lemma 3.2 such a C exists and |C| = 2^{Ω_γ(d)}.

Suppose Alice holds a subset T ⊆ C. Let a ∈ {0, 1}^{|C|} be the characteristic vector over all length-d binary strings for which a_{e(u)} = 1 if and only if Alice holds u ∈ T. Bob holds y ∈ C and wants to determine if Alice holds y ∈ T. Ascertaining whether or not Alice holds y would be sufficient for Bob to solve Index and incur the Ω(|C|) lower bound.

The input array, A, for the ℓ_p-heavy hitters problem is constructed as follows.

  1. Alice populates A with 2^{ϵd} copies of the length-d all-ones vector, 1^d.

  2. Next, Alice takes Q = 2 and inserts into A the collection star^Q(T), which is the expansion of her input strings to all child words in binary. That is, for every s ∈ T, Alice computes all binary strings x of length d with supp(x) ⊆ supp(s) and includes these in A.

Let S = [d] \ supp(y), so that |S| = d − ϵd = (1−ϵ)d. Without loss of generality we may assume S = {1, 2, …, (1−ϵ)d}, and we denote by 0_S the length-(1−ϵ)d vector which is identically 0 on S. Suppose there is an algorithm A which approximates the ℓp-heavy hitters problem on a given column query up to a constant approximation factor. Bob queries A for the heavy hitters in the table A under the column query given by the set S, and then uses this information to answer whether or not y ∈ T.

Case 1:

y ∈ T. If y ∈ T, then we claim that 0_S is a ϕ-ℓp heavy hitter for some constant ϕ, i.e., f(0_S) ≥ ϕ‖f‖_p. We will manipulate the equivalent condition f(0_S)^p ≥ ϕ^p F_p. Since y ∈ T, the set star(y) is included in the table A, as Alice inserted star(s) for every s that she holds. Consider any child word of y, that is, a w ∈ star(y). Since y is supported only on [d] \ S and supp(w) ⊆ supp(y), we have w_i = 0 for every i ∈ S. So 0_S is observed once for every w ∈ star(y), and there are |star(y)| = 2^{ϵd} such w. Hence, 0_S occurs at least 2^{ϵd} times.

Now that we have a lower bound on the frequency of 0_S, it remains to upper bound the F_p value when y ∈ T, so that we are assured 0_S will be a heavy hitter in this instance. The quantity we seek is the F_p value of all vectors in A_S, written F_p(A, S), which we decompose into the contribution from 0_S present due to y being in T, and two special cases from the block of 2^{ϵd} all-ones rows and 'extra' copies of 0_S which are contributed by vectors y′ ≠ y. We claim that this F_p(A, S) value is at most |C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}.

First, let y′ ∈ C with y′ ≠ y, and consider prefixes z supported on S which can be generated by possible child words from star(y′). Since our code requires that |y′ ∩ y| ≤ (ϵ²+γ)d, y′ can have at most (ϵ²+γ)d 1s located in S̄ = [d] \ S, and hence must have at least (ϵ−ϵ²−γ)d 1s located in S. Since |star(y′)| = 2^{ϵd}, the number of copies of z inserted is at most 2^{ϵd−(ϵd−ϵ²d−γd)} = 2^{(ϵ²+γ)d}. This occurs for every y′ ∈ C, so the total number of occurrences of z is at most |C|·2^{(ϵ²+γ)d}. The contribution to F_p for this scenario is then |C|^p 2^{(ϵ²+γ)dp}. Observe that each codeword y′ generates at most 2^{ϵd} vectors under the star(y′) operator, so we have an upper bound of |C|·2^{ϵd} such vectors generated, with a total contribution of |C|^{1+p} 2^{(ϵ²p+ϵ+γp)d}.

Next, we focus on the two special vectors which have a high contribution to the F_p value. Recall that Alice specifically included 1_d into A 2^{ϵd} times, so the p-th powered frequency is exactly 2^{ϵpd} for this term. From the above argument, 0_S also has frequency 2^{ϵd} from star(y). But 0_S is also created at most 2^{(ϵ²+γ)d} times from each y′ ≠ y in T, giving an additional count of at most |C|·2^{(ϵ²+γ)d}. Based on our choice of ϵ and γ, we can ensure that this is asymptotically smaller than 2^{ϵd}, and so the total contribution from these two special vectors is at most 3·2^{ϵpd}. So in total we achieve that F_p is at most |C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}, as claimed.

Then 0_S meets the definition to be a ϕ-ℓp heavy hitter provided

2^{ϵpd} > ϕ^p (|C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} + 3·2^{ϵpd}).

Assuming p > 1, and choosing ϵ sufficiently smaller than (p − 1)/p and γ sufficiently small, we have that

|C|^{1+p} 2^{(ϵ+(ϵ²+γ)p)d} ≤ 2^{O(γ²(1+p)d) + ϵd + ϵ²pd + γpd} ≪ 2^{ϵpd}.

Hence, we require 2^{ϵpd} > ϕ^p·O(2^{ϵpd}), i.e., 2^{ϵd} > ϕ·O(2^{ϵd}), which is satisfied for a suitably small but constant ϕ.

Case 2:

y ∉ T. On the other hand, suppose that y ∉ T. Then the claim is that 0_S is not a ϕ-ℓp-heavy hitter. Now the vector 0_S does not occur with a high frequency, because star(y) is not included in A. However, certain child words in star(T) could also generate 0_S when projected onto S, and this is the contribution we need to upper bound. Again, any codeword s ∈ T has at least (ϵ−ϵ²−γ)d 1s present on S. So for a particular s ∈ T, 0_S can occur at most 2^{(ϵ²+γ)d} times. Taken over all y′ ∈ C which Alice includes in A, the frequency of 0_S in this case is at most |C|·2^{(ϵ²+γ)d}. Taking ϵ < 1/3, γ < ϵ/3 and using |C| = 2^{γ²d/ln 2} (Lemma 3.2), we have f(0_S) ≤ 2^{0.72ϵd}. Meanwhile, there are 2^{ϵd} copies of the string 1_d inserted into A, meaning that F_p(A, S) ≥ 2^{ϵpd}, and hence F_p^{1/p} ≥ 2^{ϵd} is strictly greater than f(0_S). Hence, 0_S is not a ϕ-ℓp heavy hitter provided that f(0_S)/F_p^{1/p} ≤ 2^{−0.28ϵd} is strictly less than ϕ = 1/4; this is satisfied for suitable ϵ and d.

Concluding the proof.

Bob can use his test vector y and a query S with a constant factor approximation algorithm A for the ℓp-heavy hitters problem, and distinguish between the two cases of Alice holding y or not based on whether 0_S is reported. As a result, Bob can determine if y ∈ T and consequently solve Index, thus incurring the Ω(|C|) = 2^{Ω(d)} lower bound. □

The instance A is initialized with 2^{ϵd} rows of the vector 1_d and the child words star_Q(T). For any s ∈ T, |star_Q(s)| = 2^{ϵd}, so the size of the instance A is (|T| + 1)·2^{ϵd} × d.

5.3. Fp Estimation

The space complexity of approximating the frequency moments F_p has been widely studied since the pioneering work of Alon, Matias and Szegedy [1]. Here, we investigate their complexity under projection. For p = 1, the first moment is always the number n of rows in the original instance, irrespective of the column set C, so only one word of space is required. We therefore devote attention to p ≠ 1.

The reduction to Index for Theorem 5.4 follows a similar outline to that of Theorem 5.3 for p > 1. For p < 1, we encode the problem slightly differently, closer to the encoding in Theorem 4.1. Again, the reduction to Index relies on Bob determining whether or not Alice holds y, which for F_p estimation amounts to Bob evaluating F_p(A, S) and comparing it to a threshold value.

Theorem 5.4.

Fix a real number p > 0 with p ≠ 1. A constant factor approximation to the projected F_p estimation problem requires space 2^{Ω(d)}.

Proof.

For p > 1 we begin by noticing that in the proof of Theorem 5.3 one can also monitor the F_p value of the input rather than simply checking the heavy hitters. In particular, depending on whether or not Alice holds Bob's test word y, the projected F_p changes by more than a constant factor. Consequently, we invoke the same proof for F_p, p > 1, and obtain the same 2^{Ω(d)} lower bound.

On the other hand, suppose that p < 1. We assume a code C ⊆ B(d, ϵd) with the property that any distinct x, x′ ∈ C have |x ∩ x′| ≤ cd for some small constant c > ϵ² (see Lemma 3.2). Again, Alice holds a subset T ⊆ C and inserts star(T) into the table A for the problem. Throughout this proof we use a binary alphabet, so we suppress the Q notation from star_Q(·). Bob holds a test vector y ∈ C and is tasked with determining whether or not Alice holds y ∈ T. We distinguish between the cases when Alice holds y ∈ T or not as follows. Bob uses y to determine the query column set S = supp(y) and will compare against the frequency value returned by the algorithm.

Case 1:

y ∉ T. Consider some y′ ∈ C \ {y}. Since y and y′ are both codewords, they can have a 1 coincident in at most cd locations. So if Alice does not hold y, then the patterns we need to consider are the binary strings supported on S that have at most cd locations set to 1. We denote this collection of words by M. There are r such vectors, where r is bounded by:

r ≤ ∑_{i=0}^{cd} (d choose i) ≤ cd·(d choose cd) = O(d)·2^{Θ(cd)}.

The total count of all strings generated by Alice's encoding is at most 2^{ϵd}|C|: each string in C generates 2^{ϵd} subwords under the star(·) operation. We now evaluate the p-frequency of elements in the set M, denoted F_p(M). For p < 1, the value F_p(M) is maximized when every element of M has the same number of occurrences, |C|·2^{ϵd}/r. As there are at most r members of M, we obtain F_p(M) ≤ r·(|C|·2^{ϵd}/r)^p = |C|^p 2^{ϵpd} r^{1−p}. Recalling the bounds on |C| and r, this is:

2^{cdp + ϵdp + Θ((1−p)cd)}·O(d^{1−p}). (5)

We can now choose c to be a small enough constant so that (5) is at most 2^{(1−α)ϵd} for a constant α > 0, by Lemma A.2 in Appendix A.2.

Case 2:

y ∈ T. Now consider the scenario when y ∈ T, so that Alice has inserted star(y) into the table A. Here, we can be sure that each of the 2^{ϵd} strings in star(y) appears at least once over the column set S, and so the F_p value is at least 2^{ϵd}·1^p = 2^{ϵd}.

We observe that these two cases obtain the constant factor separation, as required. Then, Bob can use his test vector y and a query S with a constant factor approximation algorithm to the projected F_p-estimation problem and distinguish between the two cases of Alice holding y or not. Thus, Bob can determine if y ∈ T and consequently solve the Index problem, incurring the Ω(|C|) = 2^{Ω_c(d)} lower bound, where c can be taken arbitrarily small. □

Remark 2.

For p > 1 we adopt the same instance as in Theorem 5.3, so the instance is of size (|T| + 1)·2^{ϵd} × d. On the other hand, for 0 < p < 1, only the words in star_Q(T) are required, so A has size |T|·2^{ϵd} × d.

5.4. p-Sampling

In the projected ℓp-sampling problem, the goal is to sample a row in A_C proportional to the p-th power of its number of occurrences. One approach to the standard (non-projected) ℓp-sampling problem on a vector x is to subsample and find the ℓp-heavy hitters [14]. Consequently, if one can find ℓp-heavy hitters for a certain value of p, then one can perform ℓp-sampling in the same amount of space, up to polylogarithmic factors. Interestingly, for projected ℓp-sampling this is not the case, and we show that for every p ≠ 1 there is a 2^{Ω(d)} lower bound. This is despite the fact that we can estimate ℓp-frequencies efficiently for 0 < p < 1, and hence find the heavy hitters (Section 5.1).

Theorem 5.5.

Fix a real number p > 0 with p ≠ 1, and let ε ∈ (0, 1/2). Let S ⊆ [d] be a column query and i be a pattern observed on the projected data A_S. Any algorithm which returns a pattern i sampled from a distribution (p_1, …, p_n), where p_i ∈ (1±ε)·f(i)^p/‖f(A, S)‖_p^p + Δ, together with a (1±ε′)-approximation to p_i, where Δ = 1/poly(nd) and ε′ > 0 is a sufficiently small constant, requires 2^{Ω(d)} bits of space.

Proof.

Case 1:

p > 1. The proof of Theorem 5.3 argues that the vector 0_S is a constant factor ℓp-heavy hitter for any p > 1 if and only if Bob's test vector y is in Alice's input set T, via a reduction from Index. That is, we argue that there are constants C_1 > C_2 for which if y ∈ T, then f(0_S)^p ≥ C_1·F_p, while if y ∉ T, then f(0_S)^p < C_2·F_p. Consequently, given an ℓp-sampler with the guarantees described in the theorem statement, the (empirical) probability of sampling the item 0_S allows us to distinguish the two cases. This holds even tolerating the (1±ε′)-approximation in sampling rate, for a sufficiently small constant ε′. In particular, if y ∈ T, then we will indeed sample 0_S with Ω(1) probability, which can be amplified by independent repetition; whereas, if y ∉ T, we do not expect to sample 0_S more than a handful of times. Consequently, for p > 1, an ℓp-sampler can be used to solve the ℓp-heavy hitters problem with arbitrarily large constant probability, and thus requires 2^{Ω(d)} space.

Case 2:

0 < p < 1. We now turn to 0 < p < 1. In the proof of Theorem 5.4, a reduction from Index is described where Alice holds the set T and Bob the string y. Bob can generate the set star(y) of size 2^{ϵd}, which is all possible binary strings supported on the column query S. From this, Bob constructs the set M′ = {z ∈ star(y) : |supp(z)| ≥ ϵd/2}. We observe that if y ∈ T, then at least half of the strings in star(y) are supported on at least ϵd/2 coordinates, which implies |M′| ≥ 2^{ϵd−1}. The total F_p in this case can be bounded by a contribution of |M′|·1^p + 2^{ϵd}: the first term arises from the |M′| strings in M′ with a frequency of 1, while the second term is shown in Case 1 of Theorem 5.4. Since |M′| ≤ 2^{ϵd}, we have that F_p ≤ 2^{ϵd+1} in this case. Consequently, the correct probability of ℓp-sampling returning a string in M′ is at least 1/4 for the "ideal" case of ε = 0, Δ = 0. Even allowing ε < 1/2 and Δ = 1/poly(nd), this probability is at least 1/10.

Otherwise, if y ∉ T, we exploit that any y′ ≠ y can coincide with y in at most cd = O(ϵ²d) coordinates, while |supp(z)| ≥ ϵd/2 > cd for any z ∈ M′. Hence, no z ∈ M′ can occur in star(y′) for another y′ ∈ C \ {y} on the column projection S. In this case, there should be zero probability of sampling a string in M′ (neglecting the trivial additive probability Δ).

To summarize, in the case that y ∈ T, by querying the projection S a constant fraction of the F_p-mass is on the set M′, whereas when y ∉ T, there is zero F_p-mass on the set M′. Since Bob knows M′, he can run an ℓp-sampler and check if the output is in the set M′, and succeed with constant probability. It follows that Bob can solve the Index problem (amplifying the success probability by independent repetitions if needed), and thus again the space required is 2^{Ω(d)}. □

Remark 3.

For p > 1 we again adopt the same instance as in Theorem 5.3, which has size (|T| + 1)·2^{ϵd} × d. However, for 0 < p < 1, we require the instance from Theorem 5.4, so A has size |T|·2^{ϵd} × d.

6. PROJECTED FREQUENCY ESTIMATION VIA SET ROUNDING

Although our lower bounds rule out the possibility of computing constant factor approximations to projected frequency problems in sub-exponential space, it is still possible to compute non-trivial approximations using space that is exponential but still smaller than that needed to enumerate all 2^d column subsets of [d]. We design a class of algorithms that proceed by keeping appropriate sketch data structures for a "net" of subsets. The net has the property that for any query C ⊆ [d] there is a C′ ⊆ [d] stored in the net which is not too different from C. We can then answer the query on C using the summary data structure computed for column set C′. To formalize this approach we need some further definitions, the first of which conceptualizes the notion of a net over subsets.

Definition 6.1 (α-net of subsets).

Let P([d]) denote the power set of [d]. Fix a parameter α ∈ (0, 1/2). An α-net of P([d]) is the set N = {U : |U| ≤ d/2 − αd or |U| ≥ d/2 + αd}, which contains all subsets U ∈ P([d]) whose size is at most d/2 − αd or at least d/2 + αd.

Let H (x) = −x log2(x) − (1 − x) log2(1 − x) denote the binary entropy function.

Lemma 6.2.

Let N be an α-net for P([d]). Then |N| ≤ 2^{H(1/2−α)d+1}.

Proof.

The total number of subsets whose size is at most d/2 − αd is ∑_{i=0}^{(1/2−α)d} (d choose i), and ∑_{i=0}^{(1/2−α)d} (d choose i) ≤ 2^{H(1/2−α)d} [8, Theorem 3.1]. By symmetry we obtain the same bound for the number of subsets of size at least d/2 + αd, yielding the claimed total. □
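The binomial tail bound used here, ∑_{i≤(1/2−α)d} (d choose i) ≤ 2^{H(1/2−α)d}, is easy to sanity-check numerically (our own script; the parameter values are illustrative):

```python
from math import comb, log2

def H(x):
    """Binary entropy function, with H(0) = H(1) = 0."""
    if x in (0, 1):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

d, alpha = 40, 0.2
k = int((0.5 - alpha) * d)                       # sets of size at most d/2 - αd
lower_tail = sum(comb(d, i) for i in range(k + 1))
net_size = 2 * lower_tail                        # add the symmetric upper tail
assert lower_tail <= 2 ** (H(0.5 - alpha) * d)   # [8, Theorem 3.1]
assert net_size <= 2 ** (H(0.5 - alpha) * d + 1) # Lemma 6.2
```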

6.1. From α-nets to Projections

Suppose that we are tasked with answering problem P = P(A, C) on a projection query C. We know that if C is known ahead of time, then we can encode the input data A ∈ [Q]^{n×d} on projection C as a standard stream over the alphabet [Q]^{|C|}. The use of α-nets allows us to sketch some of the input and use this to approximately answer a query. For a standard streaming problem, we say that an algorithm yields a β-approximation to the true solution z* if the returned estimate z satisfies z ∈ [z*/β, βz*]. A sketch obtaining such approximation guarantees will be referred to as a β-approximate sketch. We additionally need the following notion of error due to the distortion incurred when answering queries on elements of the α-net rather than the given query.

Algorithm 1:

Projected frequency by query rounding

(The pseudocode appears as a figure in the original manuscript; the procedure is described in the proof of Theorem 6.5.)
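Since Algorithm 1 is rendered only as an image in this version, a minimal sketch of the query-rounding procedure it describes may be useful, with exact distinct counting standing in for a β-approximate F0 sketch (all function names here are ours, not the paper's):

```python
from itertools import combinations

def build_net(d, alpha):
    """All subsets of [d] of size <= d/2 - αd or >= d/2 + αd (Definition 6.1)."""
    lo, hi = d / 2 - alpha * d, d / 2 + alpha * d
    return [frozenset(u) for k in range(d + 1) if k <= lo or k >= hi
            for u in combinations(range(d), k)]

def preprocess(rows, net):
    """One 'sketch' per net member; here an exact distinct count of projections."""
    return {u: len({tuple(r[i] for i in sorted(u)) for r in rows}) for u in net}

def query_f0(sketches, net, C):
    """Answer F0 on column set C via a nearest net member (symmetric difference)."""
    C = frozenset(C)
    best = min(net, key=lambda u: len(u ^ C))
    return sketches[best]

rows = [(0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 0, 1)]
net = build_net(4, 0.3)
sk = preprocess(rows, net)
print(query_f0(sk, net, {1, 2}))
```

The answer is only correct up to the rounding distortion r(α, F0) = 2^{αd}, exactly as analyzed below; replacing the exact counts with (1+ϵ)-approximate F0 sketches recovers the space bound of Theorem 6.5.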

Definition 6.3 (Rounding distortion).

Let P = P(A, C) be a projection query for the problem P on input A ∈ [Q]^{n×d} with projection C. Let N ⊆ P([d]) be an α-net. The rounding distortion r(α, P) is the worst-case deterministic error incurred by solving P(A, C′) rather than P(A, C) for an α-neighbour C′ ∈ N of C, so that P(A, C)/r(α, P) ≤ P(A, C′) ≤ r(α, P)·P(A, C).

Definition 6.3 is easiest to conceptualize for the F0 problem when A ∈ {0, 1}^{n×d}. Specifically, P = F0 and the task to solve is P = F0(A, C). For a given query C with an α-neighbour C′ in the net, the number of distinct items observed on C′ changes by at most a factor of two for each column in the set difference between C and C′. Since C′ is an α-neighbour, we have |C′ Δ C| ≤ αd, so the worst-case approximation factor in the number of distinct items observed over C′ rather than C is 2^{αd}.
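This factor-of-two-per-column behavior is easy to observe directly (illustrative script; the helper `f0` is ours):

```python
def f0(rows, cols):
    """Number of distinct patterns on the projection given by cols."""
    return len({tuple(r[i] for i in sorted(cols)) for r in rows})

rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
C, Cp = {0, 1}, {0}                 # |C Δ C'| = 1, so distortion factor at most 2
ratio = f0(rows, C) / f0(rows, Cp)
assert 1 / 2 <= ratio <= 2          # here f0 is 4 on C and 2 on C'
```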

More generally, we can categorize the rounding distortion for other typical queries, as demonstrated in the following lemma. Note that if the query is contained in the α-net N then we will retain a sketch for that problem; hence the distortion is only incurred for queries not contained in the net.

Lemma 6.4.

Fix α ∈ (0, 1/2), suppose A ∈ {0, 1}^{n×d} and let N be an α-net. If C is a projection query, then in the following cases the rounding distortion can be bounded as:

  1. P = F0(A, C): then r(α, F0) = 2^{αd};

  2. P = F_p(A, C), p > 1: then r(α, F_p) = 2^{αd(p−1)};

  3. P = F_p(A, C), p < 1: then r(α, F_p) = 2^{αd(1−p)}.

Proof.

Item (1) is an immediate consequence of the discussion following Definition 6.3, so we focus on (2) and (3). Suppose p ≥ 1. Let f_C = f(A, C) denote the frequency vector associated to the projection query C over domain [2^{|C|}]. First, consider a single index j ∈ [2^{|C|}] with (f_C)_j = x. Let C′ be an α-neighbour for C in N, and without loss of generality assume that |C| < |C′|. The task is to estimate ‖f_C‖_p^p from ‖f_{C′}‖_p^p, where f_{C′} = f(A, C′) is a frequency vector over the domain [2^{|C′|}], which is a factor 2^{|C′\C|} larger than the domain for f_C. However, observe that in f_{C′}, the value x is spread across the at most 2^{αd} entries that agree with j on columns C. The contribution to F_p from these entries is at most x^p (if the mass of x is mapped to a single entry). On the other hand, by Jensen's inequality, the contribution is at least 2^{αd}·(x/2^{αd})^p = x^p/2^{αd(p−1)}. Hence, considering all entries j, we obtain ‖f_C‖_p^p/2^{αd(p−1)} ≤ ‖f_{C′}‖_p^p ≤ ‖f_C‖_p^p. In the case |C| > |C′|, essentially the same argument shows that ‖f_C‖_p^p ≤ ‖f_{C′}‖_p^p ≤ ‖f_C‖_p^p·2^{αd(p−1)}. Thus we obtain the rounding distortion of 2^{αd(p−1)}. For p < 1, we proceed as above, except by concavity the ordering is reversed. □
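The two extremes in this argument (mass kept in one refined entry versus spread evenly over 2^{αd} entries) can be checked numerically; a small verification of the Jensen bound (our own script, with k playing the role of 2^{αd}):

```python
k, x, p = 8, 100.0, 2.0        # k = 2^{αd} refined entries share mass x; p > 1

concentrated = x ** p          # all mass mapped to a single refined entry
spread = k * (x / k) ** p      # mass split evenly: the Jensen lower bound
assert spread == x ** p / k ** (p - 1)   # equals x^p / 2^{αd(p-1)}
assert spread <= concentrated            # so the distortion is at most 2^{αd(p-1)}
```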

Observe that the distortion reduces to 1 (no distortion) as we approach p = 1 from either side. This is intuitive, since the F1 problem is simply to report the number of rows in the input, regardless of C, and so the problem becomes “easier” as we approach p = 1.

With these properties in hand, we can give a “meta-algorithm” as described in Algorithm 1. In Theorem 6.5 we can fully characterize the accuracy-space tradeoff for Algorithm 1 as a function of α and d.

Theorem 6.5.

Let A ∈ {0, 1}^{n×d} be the input data and C ⊆ [d] be a projection query. Suppose P = P(A, C) is the projected frequency problem, α ∈ (0, 1/2) and r(α, d) is the rounding distortion. With probability at least 1 − δ, a β·r(α, d) approximation can be obtained by keeping Õ(2^{H(1/2−α)d}) β-approximate sketches.

Proof.

Let N be an α-net for P([d]), and for every U ∈ N generate a β-approximate sketch for the problem P on the projection defined by U ⊆ [d]. Either the projection C ∈ N, in which case we can report a β factor approximation, or C ∉ N, in which case we take an α-neighbour C′ ∈ N and return the estimate z for P(A, C′). The sketch ensures that the answer to P(A, C′) is obtained with accuracy β, which by the rounding distortion is a β·r(α, d) approximation. To obtain this guarantee we build one sketch for every U ∈ N, for a total of O(2^{H(1/2−α)d}) sketches (via Lemma 6.2). By setting the failure probability of each sketch to δ/|N| and taking a union bound over the α-net, we achieve overall success probability at least 1 − δ. □

We remark that similar results are possible for the other functions considered: ℓp frequency estimation, ℓp heavy hitters and ℓp sampling. The key insight is that all these functions depend at heart on the quantity f_j/‖f‖_p, the frequency of the item at location j divided by the ℓp norm. If we evaluate this quantity on a superset of columns, then both the numerator and denominator may shrink or grow, in the same ways as analyzed in Lemma 6.4, and hence their ratio is bounded by the same factor, up to a constant. Hence, we can also obtain (multiplicative) approximation algorithms for these problems with similar behavior.

Illustration of Bounds.

First, observe that, irrespective of the problem P, the number of sketches needed is sublinear in 2^d. This is because the entropy H(1/2−α) < 1 for α > 0, so the size of the net is |N| < 2^d. For 0 ≤ p ≤ 2, we have β-approximate sketches with β = (1+ϵ) whose size is Õ(ε^{−2}), which is constant for constant ϵ. For example, we obtain a 2^{αd} approximation (ignoring small constant factors) for F0 in space O(2^{H(1/2−α)d}), using for instance the (1+ϵ)-approximate sketch from [11], which requires O(ε^{−2} + log n′) bits for an input over domain {1, …, n′}. Since n′ ≤ 2^d, and setting ϵ = 1, we obtain the approximation in space O(d·2^{H(1/2−α)d}). This is to be compared to the bounds in Section 4, where it is shown that (binary) instances of the projected F0 problem require space 2^{Ω(d)}. These results show that the constant hidden by the Ω(·) notation is less than 1.

In Figure 1 we illustrate the general behavior of the bounds for d = 20. We plot the relative space, 2^{H(1/2−α)d}/2^d, while varying α over (0, 1/2) (leftmost pane). This shows the space reduction from using the α-net approach compared to storing all 2^d queries. The central pane shows how the approximation factor 2^{αd} (on a log scale) varies with α. We plot the space-approximation tradeoff in the rightmost pane, where the approximation factor is again plotted on a log₂ scale. This plot suggests that if we reduce the space by a factor of 4 (i.e., permit relative space 2^{−2}), then the approximation factor is on the order of tens. Meanwhile, if we use relative space 2^{−8}, then the approximation remains on the order of hundreds: this is a substantial saving, as the number of summaries kept for the approximation is 2^{12} = 4096 ≪ 2^{20} ≈ 10^6.
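The quantities plotted in Figure 1 can be reproduced directly (our own script; the chosen α values are illustrative):

```python
from math import log2

def H(x):
    """Binary entropy function, with H(0) = H(1) = 0."""
    return 0.0 if x in (0, 1) else -x * log2(x) - (1 - x) * log2(1 - x)

d = 20
for alpha in (0.1, 0.2, 0.3):
    rel_space = 2 ** (H(0.5 - alpha) * d) / 2 ** d  # fraction of 2^d subsets kept
    approx = 2 ** (alpha * d)                       # worst-case distortion 2^{αd}
    print(f"alpha={alpha}: relative space 2^{H(0.5 - alpha) * d - d:.1f}, "
          f"approx factor {approx:.0f}")
```

For instance, at α = 0.2 the relative space is roughly 2^{−2.4} while the approximation factor is 2^4 = 16, matching the "factor 4 in space, approximation in the tens" reading of the rightmost pane.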

Figure 1:

Space-approximation tradeoff for d = 20 as α is varied from 0 to 1/2. Relative space is 2^{H(1/2−α)d}/2^d.

7. CONCLUDING REMARKS

We have introduced the topic of projected frequency estimation, with the aim of abstracting a range of problems involving computing functions over projected subspaces of data. Our main results show that these problems are generally hard in terms of their space requirements: in most cases, we require space which is exponential in the dimensionality d of the input. However, interestingly, the exact dependence is not as simple as 2^d: we show that coarse approximations can be obtained whose cost is substantially sublinear in 2^d. Letting N = 2^d, our upper and lower bounds establish that the space complexity of a number of problems here is polynomial in N, though substantially sublinear in N. And, in a few special cases (ℓp frequency estimation for p ≤ 1), a constant-sized sample suffices for accurate approximation of projected frequencies. It remains an intriguing open question to close the gaps between the upper and lower bounds, and to find the exact form of the polynomial dependence on N for these problems.

CCS CONCEPTS.

• Theory of computation → Streaming models; Lower bounds and information complexity; Communication complexity; Sketching and sampling.

Acknowledgements.

We thank S. Muthukrishnan and Jacques Dark for helpful discussions about this problem. The work of GC and CD was supported by European Research Council grant ERC-2014-CoG 647557. The work of DW was supported by NSF grant No. CCF-1815840, National Institute of Health grant 5R01HG 10798-2, and a Simons Investigator Award.

A. OMITTED PROOFS

A.1. Omitted Proof for Section 5.1

Theorem A.1 (Restated Theorem 5.1).

Let A ∈ {0, 1}^{n×d} be the input data and let C ⊆ [d] be a given column query. For a given string b ∈ {0, 1}^{|C|}, the absolute frequency of b, f(b), can be estimated up to ε‖f‖₁ additive error using a uniform sample of size O(ε^{−2} log(1/δ)) with probability at least 1 − δ.

Proof.

Let T = {i ∈ [n] : A_i|_C = b} be the set of indices on which the projection onto the query set C is equal to the given pattern b. Sample t rows of A uniformly with replacement, at rate q = t/n. Let the (multi-)subset of rows obtained be denoted by B, and the matrix formed from the rows of B be denoted Â. For every i ∈ B, define the indicator random variable X_i which is 1 if and only if the randomly sampled index i satisfies A_i|_C = b, which occurs with probability |T|/n. Next, we define T̂ = T ∩ B, so that |T̂| = ∑_{i=1}^{t} X_i, and the estimator Z = (n/t)|T̂| has E(Z) = |T|. Finally, apply an additive form of the Chernoff bound:

Pr(|Z − E(Z)| ≥ εn) = Pr(|(n/t)|T̂| − |T|| ≥ εn) = Pr(||T̂| − (t/n)|T|| ≥ εt) ≤ 2 exp(−ε²t).

Setting δ = 2 exp(−ε²t) allows us to choose t = O(ε^{−2} log(1/δ)), which is independent of n and d. The final bound comes from observing that ‖f‖₁ = n, f(b) = |T| and f̂(b) = Z. □
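The estimator in this proof is straightforward to implement; a minimal sketch (ours), sampling t rows uniformly with replacement and rescaling the hit count:

```python
import random

def estimate_frequency(A, C, b, t, seed=0):
    """Estimate |T| = |{i : row i restricted to C equals b}| from t uniform samples."""
    rng = random.Random(seed)
    cols = sorted(C)
    hits = 0
    for _ in range(t):
        row = rng.choice(A)                      # uniform sample, with replacement
        if tuple(row[j] for j in cols) == b:
            hits += 1                            # X_i = 1
    return len(A) * hits / t                     # Z = (n/t)|T^|, with E(Z) = |T|

# 1000 rows; exactly 500 match the pattern (0, 0) on columns {0, 2}.
A = [(0, 1, 0), (0, 1, 1), (1, 1, 0), (0, 1, 0)] * 250
est = estimate_frequency(A, {0, 2}, (0, 0), t=400)
assert abs(est - 500) <= 0.1 * len(A)            # additive error eps*n with eps = 0.1
```

Note that t depends only on ε and δ, not on n or d, mirroring the sample-size bound of the theorem.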

A.2. Omitted Proof for Section 5.3

A key step in the proof of Theorem 5.4 is that in Equation (5), the expression

2^{cdp + ϵdp + Θ((1−p)cd)}·O(d^{1−p})

can be bounded by a manageable power of two. We formalize this in Lemma A.2.

Lemma A.2.

Under the same assumptions as in Theorem 5.4, there exists a small constant c > 0 such that Equation (5) is at most 2^{(1−α)ϵd} for some α > 0.

Proof.

Here we use base-2 logarithms, and let 0 < c < 1 be a small constant which we need to bound. Also, let 0 < p < 1 be a given constant. Observe that the O(d^{1−p}) term only contributes positively to the exponent of (5), so we can ignore it in the calculation. Write 2^{Θ(cd(1−p))} = 2^{αcd(1−p)} for α > 0. This follows from:

(d choose cd) ≤ (ed/cd)^{cd} ≤ 2^{(2+log(1/c))cd} (6)

so let α = 2 + log(1/c). We need to ensure that, for some constant υ > 0:

cpd + ϵpd + αc(1−p)d ≤ (1−υ)ϵd. (7)

Dividing through by ϵd, this amounts to showing that:

cp/ϵ + p + αc(1−p)/ϵ < 1.

Now, cp/ϵ + p + αc(1−p)/ϵ = p(c/ϵ + 1 − αc/ϵ) + αc/ϵ, and we require this quantity to be less than 1. We may enforce the property p(c/ϵ + 1 − αc/ϵ) < 1, because c > 0 and for c < 4 we also have α > 0 (by inspection of Equation (6)), so the remaining term αc/ϵ > 0 can be made negligibly small by taking c small. Solving for c we obtain c(1−α) < ϵ(1/p − 1). Recalling the definition of α, this becomes:

c(log c − 1) < ϵ(1/p − 1) (8)

from which positivity of c yields c log c < ϵ(1/p − 1). Hence, it is enough to use c < ϵ(1/p − 1). □

Contributor Information

Graham Cormode, University of Warwick.

Charlie Dickens, University of Warwick.

David P. Woodruff, Carnegie Mellon University

REFERENCES

  • [1].Alon N, Matias Y, and Szegedy M. 1999. The Space Complexity of Approximating the Frequency Moments. JCSS: Journal of Computer and System Sciences 58 (1999), 137–147. [Google Scholar]
  • [2].Assadi Sepehr, Khanna Sanjeev, Li Yang, and Tannen Val. 2016. Algorithms for Provisioning Queries and Analytics. In International Conference on Database Theory. 18:1–18:18. [Google Scholar]
  • [3].Braverman Vladimir, Chestnut Stephen R, Ivkin Nikita, Nelson Jelani, Wang Zhengyu, and Woodruff David P. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of Principles of Database Systems. ACM, 361–376. [Google Scholar]
  • [4].Braverman Vladimir, Grigorescu Elena, Lang Harry, Woodruff David P., and Zhou Samson. 2018. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization Algorithms and Techniques (AP-PROX/RANDOM 2018), Vol. 116. 7:1–7:22. 10.4230/LIPIcs.APPROX-RANDOM.2018.7 [DOI] [Google Scholar]
  • [5].Braverman Vladimir, Krauthgamer Robert, and Yang Lin F.. 2018. Universal Streaming of Subset Norms. CoRR abs/1812.00241 (2018). arXiv:1812.00241 http://arxiv.org/abs/1812.00241 [Google Scholar]
  • [6].Chia Pern Hui, Desfontaines Damien, Perera Irippuge Milinda, Simmons-Marengo Daniel, Li Chao, Day Wei-Yen, Wang Qiushi, and Guevara Miguel. 2019. KHyperLogLog: Estimating Reidentifiability and Join-ability of Large Data at Scale. In IEEE Symposium on Security and Privacy (SP). 867–881. [Google Scholar]
  • [7].Doerr Benjamin. 2020. Probabilistic tools for the analysis of randomized optimization heuristics. In Theory of Evolutionary Computation. Springer, 1–87. [Google Scholar]
  • [8].Galvin David. 2014. Three tutorial lectures on entropy and counting. arXiv preprint arXiv:1406.7872 (2014). [Google Scholar]
  • [9].Jayaram Rajesh and Woodruff David P.. 2018. Perfect ℓp Sampling in a Data Stream. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS. 544–555. [Google Scholar]
  • [10].Jayram TS and Woodruff DP. 2009. The Data Stream Space Complexity of Cascaded Norms. In IEEE Symposium on Foundations of Computer Science (FOCS). 765–774. 10.1109/FOCS.2009.82 [DOI] [Google Scholar]
  • [11].Kane Daniel M, Nelson Jelani, and Woodruff David P. 2010. An Optimal Algorithm for the Distinct Elements Problem. In Proceedings of Principles of Database Systems. ACM, 41–52. [Google Scholar]
  • [12].Kremer Ilan, Nisan Noam, and Ron Dana. 1999. On Randomized One-Round Communication Complexity. Computational Complexity 8, 1 (1999), 21–49. [Google Scholar]
  • [13].Kveton Branislav, Muthukrishnan S, Vu Hoa T., and Xian Yikun. 2018. Finding Subcube Heavy Hitters in Analytics Data Streams. In Proceedings of the 2018 World Wide Web Conference. 1705–1714. 10.1145/3178876.3186082 [DOI] [Google Scholar]
  • [14].Larsen Kasper Green, Nelson Jelani, Nguyen Huy L., and Thorup Mikkel. 2016. Heavy Hitters via Cluster-Preserving Clustering. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS. 61–70. [Google Scholar]
  • [15].Parsons Lance, Haque Ehtesham, and Liu Huan. 2004. Subspace Clustering for High Dimensional Data: a review. SIGKDD Explorations 6, 1 (2004), 90–105. 10.1145/1007730.1007731 [DOI] [Google Scholar]
  • [16].Sublinear.info. [n.d.]. Open Problem 94. https://sublinear.info/index.php?title=Open_Problems:94.
  • [17].Tirthapura Srikanta and Woodruff David P.. 2012. A General Method for Estimating Correlated Aggregates over a Data Stream. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1–5 April, 2012. 162–173. [Google Scholar]
  • [18].Vu Hoa. 2018. Data Stream Algorithms for Large Graphs and High Dimensional Data. Ph.D. Dissertation. U. Massachusetts at Amherst. [Google Scholar]
