Skip to main content
Entropy logoLink to Entropy
. 2022 Nov 5;24(11):1611. doi: 10.3390/e24111611

Communication Efficient Algorithms for Bounding and Approximating the Empirical Entropy in Distributed Systems

Amit Shahar 1, Yuval Alfassi 1, Daniel Keren 1,*
Editor: Hector Zenil1
PMCID: PMC9689552  PMID: 36359705

Abstract

The empirical entropy is a key statistical measure of data frequency vectors, enabling one to estimate how diverse the data are. From the computational point of view, it is important to quickly compute, approximate, or bound the entropy. In a distributed system, the representative (“global”) frequency vector is the average of the “local” frequency vectors, each residing in a distinct node. Typically, the trivial solution of aggregating the local vectors and computing their average incurs a huge communication overhead. Hence, the challenge is to approximate, or bound, the entropy of the global vector, while reducing communication overhead. In this paper, we develop algorithms which achieve this goal.

Keywords: entropy, entropy approximation, entropy bounds, distributed systems, sketches

1. Introduction

Consider the distributed computing model [1,2,3], where the goal is to compute a function over input divided amongst multiple nodes. A local computation, while simple, does not always suffice to reach a conclusion on the aggregated input data, especially when the function is nonlinear. On the other hand, broadcasting the local data to a coordinator node is impractical and undesirable due to communication overhead, energy consumption, and privacy issues. Generally, we seek to approximate or bound the function’s value on the aggregated data without broadcasting it in its entirety.

The function we handle in this paper is the empirical Shannon entropy [4], which is defined as ixiln(xi) for a frequency vector X=x1,,xn (i.e., all values are non-negative and sum to 1). For some of the ensuing analysis, it is easier to use the natural logarithm than the base 2 one, which only changes the value by a multiplicative constant. Thus, hereafter, “entropy” will refer exclusively to the empirical Shannon entropy. Specifically, we assume there exists a distributed system, with each node (or “party”) holding a “local” frequency vector. The target function is defined as the global system entropy, which is equal to the empirical Shannon entropy of the average of the local vectors. Alas, to compute the exact value, we must first aggregate the local vectors and average them, which often incurs a huge communication overhead. Fortunately, it often suffices to approximate, or bound, this global entropy; for example:

  • Often, a sudden change in the entropy indicates a phase change in the underlying system, for example a DDoS (Distributed Denial of Service) attack [5]. To this end, it typically suffices to bound the entropy, since its precise value is not required.

  • A good measure of similarity between two datasets is the difference between their aggregated and individual entropies. For example, if two collections of text are of a similar nature, their aggregate entropy will be similar to the individual ones, and if they are different, the aggregated entropy will be substantially larger. Here, too, it suffices to approximate the global entropy, or bound it from above or below in order to reach a decision.

Guided by such challenges, we develop communication-efficient algorithms for bounding and approximating the global entropy, which are organized as follows:

In Section 3, we present algorithms for bounding the global entropy, with low communication. Some results on real and synthetic data are also provided.

In Section 4, a novel algorithm is provided for approximating the global entropy. It is tailored to treat the cases in which the algorithm in Section 3 underperforms.

2. Previous Work

The problem of reducing communication overhead in distributed systems is very important both from the practical and theoretical points of view. Applications abound, for example distributed graphs [2,6] and distributed machine learning [3]. Research close to ours in spirit [7] deals with the following scenario: a system is given consisting of

  • Distributed computing nodes denoted by N1Nt, with Ni holding a “local” data vector Xi. The nodes can communicate, either directly or via a “coordinator” node.

  • A scalar-valued function f(X,Y), defined on pairs of real-valued vectors.

Given the above, the challenge is to approximate the values f(Xi,Xj),i,j=1t with a low communication overhead; that is, the trivial solution of sending all the local vectors to some computing node is forbidden.

A sketch for this type of problem is defined as a structure s(), of size smaller than the dimension of Xi, which has the following property: knowledge of s(X),s(Y) allows one to approximate f(X,Y) with very high accuracy. An important example [7] is f(X,Y)=X,Y (X,Y stands for the inner product of X,Y).

There are many types of sketches, for example:

  • PCA (Principle Component Analysis) sketch: given a large subset SRn, one wishes to quickly estimate the distance of vectors from an underlying structure which S is sampled from (a famous example is images of a certain type [8]). To this end, S is represented by a smaller set, consisting of the dominant eigenvectors of S’s scatter matrix, and the distance is estimated by the distance of the vector from the subspace spanned by these vectors.

  • In the analysis of streaming data, some important sketches were developed, in order to handle large and dynamic streams, by only preserving salient properties (such as the number of distinct items, frequency, and the norm). It is beyond the scope of this paper to describe these sketches, so we refer to the literature [9].

Sketches are specifically tailored for the task at hand. In our case, X,Y are frequency (probability) vectors, and f(X,Y) is the empirical Shannon entropy of X+Y/2. Similarly, one may look at functions defined on larger subsets of {X1,,Xt} (Section 3.4). Our task is therefore to define a sketch s(), such that

  • s(X) is much smaller than X.

  • Knowledge of s(X),s(Y) allows one to approximate the empirical Shannon entropy of X+Y/2.

We note here that some work addressed entropy approximation in the Streaming Model  [1,10,11]. Here, as in [7], we are mainly interested in the static scenario, in which the overall communication overhead is substantially smaller than the overall data volume. The “geometric monitoring” method [10,11], applied to solve the Distributed Monitoring Problem [1], relies on checking local constraints at the nodes; as long as they hold, the value of some global function, defined on the average of the local streams, is guaranteed to lie in some range. Alas, when the local conditions are violated, the nodes undertake a “synchronization stage” [12], which consists of communicating their local vectors in their entirety (which here we avoid). In the future, we plan to extend the techniques developed here to the distributed streaming scenario.

3. Dynamic Bounds and Communication Reduction

In this section, we present algorithms for bounding the entropy of a centralized vector—that is, the mean of several local vectors—by broadcasting a controlled amount of inter-communication between machines. The proposed algorithms for both upper and lower bounds accept the same input and therefore can be run concurrently.

3.1. Problem, Motivation, and an Example

This work addresses the following problem:

  • Given nodes Ni, each holding a probability vector vi (i.e., all values are positive and sum to 1), approximate the entropy of the average of vi, while maintaining low communication overhead.

Let us start with the simplest possible scenario, which we shall analyze in detail, in order to prepare the ground for the general treatment.

Example 1. 

There are two nodes, N1,N2, and the vectors they hold are of length 3. Assume without loss of generality that N1 sends some of its data to N2, where “data” consists of a set of pairs (coordinate, value), where “coordinate” is the location of a value of v1, and “value” is its numerical value; then, N2 attempts to derive an upper bound on the entropy of v1+v22. Note that vectors of length 2 are hardly interesting, since sending a single datum allows one to compute the other (as they sum to 1); hence, N2 will be able to exactly compute the entropy.

Intuition suggests that N1 should relay its largest value to N2. While (as we will show later), this is true on the average, that is not always the case. Assume that the vectors held by the nodes are

v1=23,13,0,v2=0,12,12

Assume that N1 sends its largest value (and its coordinate) to N2. Now, N2 knows that (a) the first value of the average vector is 13, and (b) the second and third values of N2 sum to 13. That leaves open the possibility that these values are 16 each, which would render the average vector equal to

13,13,13,

hence, the upper bound is equal to the maximal entropy possible, 3ln(3). However, if N1 sends its second largest value 13 to N2, N2 can conclude that the second value of the average vector equals 512; hence, the upper bound on the entropy is strictly smaller than 3ln(3).

We observe here that the key consideration in determining the upper bound is the distribution of the “slack” corresponding to the unknown values at the other node (N1 in this example). The overall size of this “slack” is one-half of the unknown values, and it should be distributed amongst the same set of coordinates in N2 after they have been divided by 2.

In contrast to the above “adversarial” example, on average, it is optimal to send the largest value (i.e., it allows one to achieve a lower upper bound). To prove this, we have (numerically) computed the integral of the upper bound over all triplets, both after the largest value and a random value were sent; sending the highest value, on the average, yielded an upper bound lower by 0.041 than sending a random value. More general experiments, for both real and synthetic data, are reported in Section 3.5.

We now address the general scenario. Let us start with a few definitions:

Notation 1. 

Let {X1,,Xt} be a set of local vectors held in t nodes {N1,,Nt}. Then, X˜=1ti=1tXi is the aggregate vector, which in our case is the mean over all local vectors.

Notation 2. 

Let x[0,1]. We define the Entropy activation function h(x) by:

h(x)=0ifx=0xlnxotherwise

Definition 1. 

Let XRn s.t i:0xi1. Then, H(X) denotes the Shannon’s Entropy of X [4], given by

H(X)=i=1nhxi

We will henceforth assume all vectors are of length n and behave like X in Definition 1, even if it is not explicitly noted. We also assume each value of X can be represented by at most b bits.

Notation 3. 

Let XLocal,XOther denote a probability vector held by a local machine and a probability vector held by a remote machine, respectively.

In this section, we present algorithms for deciding whether the entropy of an average probability vector that sums to 1 is greater or lesser than a user-defined threshold. Formally, we will address two problems:

  1. Determining whether the inequality HX˜L holds for some user-defined constant L.

  2. Determining whether the inequality HX˜U holds for some user-defined constant U.

We begin with a lemma which provides the foundation for both the Local Upper Bound (Section 3.2) and Local Lower Bound (Section 3.3) in the following subsections.

While noting that the lemma and its corollary hold for any vector XRn, our vectors are always frequency vectors and hence sum to 1; the Δ below corresponds to the “slack” added after dividing the respective value by 2, as explained in the discussion of Example 1 above; hence, the values still sum to 1.

Lemma 1 

(Extrema of Entropy). Let X=x1,,xnRn s.t. i,xi0. Let Δ be a positive number, and let i,j be two distinct coordinates of X.

  • Let Xi=x1,,xi+Δ,,xn;

  • Let Xj=x1,,xj+Δ,,xn.

If xi<xj, then H(Xi)>H(Xj).

Proof. 

To establish H(Xi)H(Xj)>0, then since H(X) is coordinate-wise additive, it suffices to show that:

h(xi+Δ)h(xi)>h(xj+Δ)h(xj).

Using the observation that h(x)=lnx1, which is strictly decreasing, we divide the proof into two cases depending on the relation between xi+Δ and xj:

  • 1.

    xi+Δxj:

    Since xi<xi+Δxj<xj+Δ, the intervals (xi,xi+Δ) and (xj,xj+Δ) are disjoint. By applying the Lagrange Mean Value Theorem, for some c1(xi,xi+Δ) and c2(xj,xj+Δ):
    h(c1)=h(xi+Δ)h(xi)Δh(c2)=h(xj+Δ)h(xj)Δ
    Since h() is decreasing and c1<c2, we immediately obtain that h(c1)>h(c2). It follows that:
    h(xi+Δ)h(xi)Δ=h(c1)>h(c2)=h(xj+Δ)h(xj)Δh(xi+Δ)h(xi)>h(xj+Δ)h(xj)
  • 2.

    xi+Δ>xj:

    Observing the disjoint intervals (xi,xj), (xi+Δ,xj+Δ). The sought inequality, following the Lagrange Mean Value Theorem for c1(xi,xj),c2(xi+Δ,xj+Δ), as in the case above, is:
    h(xj)h(xi)Δ=h(c1)>h(c2)=h(xj+Δ)h(xi+Δ)Δh(xi+Δ)h(xi)>h(xj+Δ)h(xj)

   □

Corollary 1. 

Given a probability vector X, and Δ>0, the following properties hold:

  • 1. 

    If Δ is added to any value of X, the maximal increase of its entropy will occur when Δ is added to the minimal value of X.

  • 2. 

    If Δ is added to any value of X, the minimal increase of its entropy will occur when Δ is added to the maximal value of X.

3.2. Upper Bound

While ln(n) is a trivial upper bound to the entropy, and does not require any communication to agree upon, we can develop a more efficient alternative while incurring a small communication overhead. Let X,Sk(X) denote a probability vector and a k-sized ordered subset of X’s k largest values, respectively. Hence, let local nodes broadcast the following two ordered sets:

  1. Sk(X)= ordered set of largest k values of X;

  2. Ck(X)= the coordinates of the values in Sk(X), or formally {ixiSk(X)}.

Each of these messages costs at most k(b+log2n) bits: b for each value and log2n for each corresponding coordinate. By sending these subsets of values and coordinates, local machines can immediately obtain the following information regarding the local vector X from which Sk(X),Ck(X) were sent:

  • The sum of all values not in Sk(X), i.e., 1xSk(X)x, will be referred to as the mass of the local vector that remains available to be distributed among coordinates. It will be denoted by m in the following algorithms.

  • max{X\Sk(X)}min{Sk(X)}, since Sk(X) contains the largest values of X (where X\Sk(X) denotes set difference).

We next suggest an algorithm for a local machine with local probability vector XLocal to compute the strict upper bound for X˜, which is the aggregated data of both XLocal and XOther, which is a probability vector that is not accessible to the machine. The remote machine broadcasts Sk(XOther) and Ck(XOther) for some predetermined k.

The algorithm constructs the unknown subset of the remote machine that ensures the centralized entropy is maximized, or formally argmaxXHXLocal+X, while maintaining feasible constraints. We view this problem as an instance of constrained optimization, where our target function is the global entropy, and the constraints are given by the broadcast set of SK(X) and its sum. The main tool is Corollary 1 for every coordinate of XLocal.

Before we present the algorithm, we note two extreme cases, which instantly produce an upper bound without the need to algorithmically compute it:

  • 1.

    xSk(X)x1. In this case, most (or all) the information of X is broadcast by the message, and the entropy can be computed accurately without need for a bound.

  • 2.

    1xSk(X)xxiXLocalxmaxxi, where xmax=max{XLocal}. In this case, there is no need to run the proposed algorithm; the constraint maximization will always result is an “optimal” target—the uniform vector with the value xmax+1nmxiXLocalxmaxxi, whose entropy we know is maximal w.r.t its sum.

Theorem 1. 

Algorithm 1 runs in O(n2) time and returns an upper bound on the entropy of X˜=12XLocal+XOther.

Algorithm 1: Upper Entropy Bound for Two Nodes
graphic file with name entropy-24-01611-i001.jpg

Proof. 

Let n be the length of X*, which equals nk. In each loop iteration, the algorithm increments no more than n values of X*, and since there are n coordinates of X*, it will perform at most n steps. Hence, the bound of O(n2) runtime follows.

Let X*=(x1,,xn) be the initial vector as noted in line 3, and let Y denote the same by the end of the while loop, i.e., after the condition m=0 is met. Let the coordinates of Y be arranged in ascending order, which has no effect on its entropy: Y=y1,y2,,yt,yt+1,,yn. Since at every loop iteration, all minimal coordinates are incremented simultaneously, there exists some coordinate t such that for all i<t, yi equals c, and for all i>t, yi is strictly greater than c. Hence, we can view Y as a concatenation of the two vectors (YL,YR) as defined below:

  • YL=y1,,yt=c,,c;

  • YR=yt+1,,yn.

Let sX denote the sum of X. It now suffices to show that any vector Z that sums to sX*+m and can be achieved by performing only additions to X* has a lesser or equal entropy value than Y. Let Z denote such a vector for every value of which zi satisfies zixi. Let Z=(ZL,ZR), where ZL=(z1,,zt),ZR=(zt+1,,zn) for the same t as defined above.

Since it holds that sZ=sZL+sZR=sYL+sYR=sY, we examine the following cases:

  • sZL=sYL,sZR=sYR: Note that ZR=YR, since their sum is equal, and YR has had no further additions. In addition, since sZL=sYL=c·YL and YL is the uniform vector, H(ZL)H(YL). It follows that H(ZL)+H(ZR)H(YL)+H(YR).

  • sZL<sYL,sZR>sYR: there exists a subset zi1,,ziZR for which every zij is greater than the corresponding value yij of YR. Let δij=zijyij. For every δij, there exists a value z in ZL s.t z<yij<zij, since sZL<sYL=c·YL<yij·YL. Let Z be Z after δij is subtracted from each zij and added to some zZL as described above. By Lemma 1, H(Z)>H(Z). It also holds that ZR=YR, and that H(ZL)H(YL), since they both sum to c·YL, and YL is a uniform vector (whose entropy is maximal). Therefore, H(Z)<H(Z)H(Y).

  • sZL>sYL,sZR<sYR: this case can be immediately omitted; it is an impossibility to feasibly subtract a value from ZR or increase the value of sZL above sYL.

Therefore, we have proven for any vector Z, it holds that H(Z)H(Y).    □

Next, we suggest a more time-efficient algorithm than Algorithm 1 that achieves an equivalent bound, with a runtime of Onlogn. Suppose c is the maximal threshold all values of the local vectors can be incremented to without exceeding the sum of the values from the remote vector, m. Then, if we define the sorted coordinates of X* to be x1,,xn, there is some coordinate t such that xtcxt+1.

By performing a binary search on the the coordinate t of the local vector as described above, we can efficiently find that xt as described in the algorithm below.

Theorem 2. 

Algorithm 2 runs in O(nlogn) time and returns an upper bound for the entropy of X˜=12XLocal+XOther.

Algorithm 2: Binary Search Upper Entropy Bound for Two Nodes
graphic file with name entropy-24-01611-i002.jpg

Proof. 

The algorithm begins by sorting X*, which costs Onlogn and follows by performing a binary search on a range of size n, wherein a single step requires O(n) operations. Therefore, its runtime is Onlogn.

Since the vector X* constructed by this algorithm is equivalent to the vector which Algorithm 1 computes, the proof of correctness is the same as the proof of Theorem 1.    □

It will be noted that the upper bound given by Algorithms 1 and 2 can be further improved by using the feasibility constraint upon X*. It is possible to increase a coordinate of X* by a value larger than min{SK(XOther)}, particularly if for some coordinate i, the inequality xi+1*xi*>min{SK(XOther)} holds. In order to keep the core algorithms simple, we will address this formally in Appendix A by proposing an improvement to the algorithms above, such that the bound will indeed by tight.

3.3. Lower Bound

We now turn to discuss a communication-efficient solution for computing a tight lower bound for the entropy of a global vector. As with the upper bound, this problem is an instance of constrained optimization, only that here, our target is to find the minimum. As with the Upper Bound (Section 3.2), we use the same message containing Ck(X) and Sk(X), for a remote vector X.

Property 1. 

Let n denote X*, sup denote min{Sk(XOther)} and m denote 1xSk(XOther)x, as used in Algorithm 3. Then, n·supm.

Algorithm 3: Lower Entropy Bound for Two Nodes
graphic file with name entropy-24-01611-i003.jpg

Proof. 

Using the definitions, we obtain the following inequality:

X*·min{Sk(XOther)}1xSk(XOther)x

which holds, since:

xSk(XOther)x+X*·min{Sk(XOther)}xXOtherx=1.

   □

Theorem 3. 

Algorithm 3 runs in O(nlogn) time and returns a tight, lower bound for the entropy of X˜=12XLocal+XOther

Proof. 

After sorting the vector, we iteratively increment no more than n=X*n coordinates since n·supm by Property 1; hence, the total runtime is Onlogn.

To prove the correctness of the bound, it suffices to examine our loop step; it is clear we must add a total sum of m to any of X*’s coordinates, and we cannot increment a single coordinate by more than sup—since we know all remaining values of the unknown vector X are lesser or equal to sup.

The algorithm increments the maximal values of X* by sup, which by Corollary 1 incurs the minimal entropy gain to X*. Due to the fact that the entropy is coordinate-wise additive, the ”greedy” approach which minimizes over coordinates separately reaches the global minimum.    □

In Figure 1, the proposed algorithm, and the bounds computed using the algorithms described herein, are compared to the bounds derived after sending a random subset of coordinate–value pairs, as well as sending many random subsets and choosing the minimal resulting bound. In Section 3.5, more extensive experiments are reported.

Figure 1.

Figure 1

Comparison of three upper bounds as a function of the number of values sent. The red plot corresponds to an average of random selections of values from the remote vector; the green plot represents the best results (i.e., lower upper bounds) from 10,000 random selections; the blue plot represents the algorithm described here, where Sk(X) consists of the largest values. The vectors have a length of 100, and their values are sampled from a half-normal distribution with standard deviation 0.02 and then normalized to sum 1.

3.4. Multiparty Bounds

When considering the scale and variability of modern distributed systems, an algorithm that supports multiple machines and incurs a low communication overhead is desirable.

We next suggest a few modifications in order to generalize Algorithms 1–3, for the upper and lower bounds of entropy centralized across t+1 nodes. We denote Xi as the vector of machine i, and in a manner similar to Section 3.2, Sk(Xi),Ck(Xi) are the ordered sets of the k maximum values and their coordinates, respectively. Typically, the coordinates of Ck(Xi) and Ck(Xj) will be disjoint, in which case each machine will have to broadcast its missing coordinates. The additional communication may cost us up to tk(b+log2n) bits. We hereby assume a second round of communication occurs, and that SkX1,,SKXt, as well as CkX1,,CKXt include the same coordinates.

Below, we list the modifications to be made to the previous algorithms for the multiparty case. These changes are similar for all three algorithms.

  1. Input: In addition to the local vector XLocal, the k-sized largest value sets Sk(X1),Sk(Xt) and corresponding ordered coordinate sets Ck(X1),,Ck(Xt)—instead of single sets.

  2. The sum to be added to all coordinates of X*, m will be ti=1txSk(Xi)x, since there are t local vectors to process and construct, while local vector Xi contributes 1xXix.

  3. The return value is now H(1t+1X˜), since we have summed t additional vectors into X* and XKnown.

3.5. Experimental Results

To evaluate our algorithms, we tested them on both real and synthetic probability vectors. We now describe the methods and data used to perform our experiments and analyze the results.

In Figure 2 and Figure 3, we simulated the upper bound algorithm (Algorithm 2) and the lower bound algorithm (Algorithm 3). Figure 2b depicts a simulation of the algorithms on two randomly generated vectors: Node1 with uniform distribution and Node2 with beta distribution. The probability vectors of Node1 and Node2 are shown in Figure 2a. Note that as depicted in Figure 2b, the algorithms’ results in each node are determined by the distribution of the local probability vectors. That is because the more probability mass is transmitted, the tighter the bounds become, and the quicker it converges with respect to k. As illustrated, in the bounds of Node1, which receives Node2’s maximal beta distribution’s probability values, the bounds converge quickly with respect to k. In contrast, the bounds that are computed at Node2, which receives the maximal values of the uniform distribution of Node1, converge slowly to the real entropy. This is due to the fact Node2 does not gain much information from Node1.

Figure 2.

Figure 2

Algorithmic bounds for the empirical entropy on synthetic probability vector of dimension 50 k. (a) depicts the distributions of the generated vectors: a normalized uniform distribution (i.e., each value is randomly selected from U[0,1], and then their sum is normalized to 1) which for brevity we refer to as “uniform” at Node1, and a beta distribution at Node2 with parameters α=0.2,β=100. The dashed line is the average vector of the two. (b) demonstrates the locally calculated upper bound and lower bound for different numbers of top values transmitted (k) in Algorithms 2 and 3.

Figure 3.

Figure 3

Algorithmic bounds between token frequency vectors of atheism-themed newsgroups and hockey-themed newsgroups. (a) depicts the histogram of the tokens of the accumulation of the first 200 articles in each theme. Note that the histogram’s coordinates are organized in descending order of each vector’s values separately; thus, the average vector may be larger, for some values, from both Node1 and Node2. (b) demonstrates the locally computed upper bound and lower bound for different numbers of top values transmitted (k) in Algorithms 2 and 3.

Fortunately, the difference between the bounds of Node1 and Node2 is an advantage to our proposed algorithms; we can compare them and use the better one simply by comparing the bounds (which requires transmitting only one scalar).

Another interesting observation can be drawn from Figure 2b; Node1’s lower bound is already quite close to the global entropy for very small k values. The algorithm works well here since the maximum value of Node2 is not large, which in turn enables the algorithm to reach a tighter bound.

Figure 3b illustrates our experiments on the 20 Newsgroups Dataset [13] which includes about 20,000 newsgroup documents for 20 different topics. We measured the entropy of token frequency vectors (A vector where each value corresponds to the frequency of a word or token in the document).  from the atheism-themed newsgroups and the hockey-themed newsgroups. To do so, we took the top 10,000 occurring tokens and created token frequency vectors on the first 200 articles from the atheism theme and the hockey theme. The visual illustration of the (sorted) tokens frequency is in Figure 3a. As can be observed, the atheism newsgroups is more verbally rich than the hockey newsgroup, having more words which are unique to it.

As demonstrated in Figure 3b, the upper bound computed by both nodes is almost the same. However, for the lower bound, Node1 (atheism) converges faster to the real entropy as we increase the parameter k of the algorithm. We attribute that to the denser token histogram of the hockey theme; thus, more “probability mass” is transmitted for the same k.

Figure 4 presents results for the multiparty case, as discussed in Section 3.4.

Figure 4.

Figure 4

An example of the multiparty upper bound on real and synthetic data. (a) For the real data, we used vectors from the newsgroups dataset; each vector is of length 10,000. (b) The synthetic data are sampled from half-normal distribution; each vector is of length 1000.

To conclude, the distribution of the probability vectors directly affects the tightness of the bounds. The less concentrated the probability vectors are, the less information we can send for every k; hence, the bounds become less tight, as demonstrated in Figure 5. A solution for this case is presented in Section 4.

Figure 5.

Figure 5

Rate of convergence of the dynamic bound algorithms in Section 3 to the real entropy values as a function of communication overhead. (a) Node2 obeys a uniform distribution, and Node1 obeys a beta distribution with α=0.1, β=100. (b) Both nodes obey a beta distribution, one with α=0.1, β=100 and the other with α=0.02, β=100.

4. Entropy Approximation

The algorithms described in Section 3 perform better in terms of communication overhead when there are a few relatively large values in the local frequency vectors, i.e., a substantial percentage of the overall “probability mass” resides in a relatively small percentage of the vectors’ values. However, in the case in which the vectors are “flat”—that is, their distribution approaches a uniform one—the nodes will have to exchange many values in order to reach tight bounds on the overall entropy; see Figure 5. In this section, we offer a probabilistic solution to this problem.

Assume that two nodes N1,N2 hold vectors X,Y, and the goal is to approximate the entropy of the average vector X+Y2, with a small communication overhead, relative to n, the length of the vectors.

One solution, which was applied in previous work on monitoring entropy [10], is to use sketches. This popular technique found many applications in computer science, for example, for computations over distributed data [7]. A well-known sketch for entropy, which we describe in Section 4.1, is presented in [14]; see also [15].

Here we use a different sketch, which, for our purposes, performed better than the sketch presented in [14]. In resemblance to Section 3, the two nodes first exchange all values which are greater or equal to a threshold ε, whose value is determined by a communication/accuracy trade-off. Hence, we assume hereafter that all values are smaller than ε. Next, choose a polynomial approximation, of degree at least 2, over the interval [0,ε], to the function h(t)tln(t). Assuming in the meanwhile a degree 2 approximation, denote it by At2+Bt+C. The proposed method is oblivious to the choice of this approximation; we have used the approach of minimizing

0εf(t)(At2+Bt+C)2dt,

which allows a closed-form solution

A=54ε,B=ln(ε)+1312,C=ε8.

Using this quadratic function, we can approximate the entropy of the average vector X+Y2 by

i=1nAXi+Yi22+BXi+Yi2+C=A4i=1n(Xi2+Yi2)+A2i=1nXiYi+B2i=1n(Xi+Yi)+nC.

Note that with the exception of the term i=1nXiYi, all terms can be computed locally and require O(1) communication overhead to transmit. Thus, it only remains to approximate the term iXiYi, which equals the inner product X,Y. To this end, we can apply an approximation based on the famed Johnson–Lindenstrauss Lemma [16], which is defined as follows:

X,Y1di=1dX,RiY,Ri;

where Ri are independent random vectors with all values i.i.d standard normal variables, which are generated by a pre-agreed upon random seed, and thus require no communication. A direct calculation yields that this estimate has expectation X,Y (i.e., is unbiased) and its variance equals

X2Y2+X,Yd.

Similarly, we can apply higher-order approximations. For a cubic approximation, we obtain a more complicated but identical in spirit sketch, which requires an approximation of the expressions i=1nXi2Yi,i=1nXiYi2; this, too, can be achieved by applying the estimate above, since these quantities can also be represented as inner products of “local vectors”: for example, i=1nXi2Yi=X2,Y, where (X2)iXi2.

Some results for two nodes are presented in Figure 6, in which the proposed sketch is compared to the one in [14] (see Section 4.1). Extending the sketch to the multiparty scenario is straightforward; results are presented in Figure 7b.

Figure 6.

Figure 6

Comparison of the Poly2 and CC sketches for approximating the empirical Shannon entropy. (a) illustrates the synthetic probability vectors that were generated to perform the comparison. (b) compares the Poly2 sketch to the CC sketch for varying sketch sizes. The comparison was made using three different random seeds for the sketches. We used the value ε=0.0002.

Figure 7.

Figure 7

(a): standard deviation of the error of the CC and Poly2 sketches, for two parties and varying sketch size. The experiments were performed on a vector of dimension 10000 with uniform distribution, which was followed by normalization to sum 1. The standard deviation was calculated on 50 sketches for each sketch size. (b): comparison of the standard deviation of CC and Poly2 sketches in the multiparty scenario for fixed sketch size and varying number of parties. The experiments were performed for an i.i.d random vector distribution of dimension 5000 with sketch size 200 and ε=0.0002.

4.1. The Clifford-Cosma Sketch

We compare our sketch to an entropy sketch proposed in  [14]. The sketch is a linear projection of the probability vector. The linear projection is performed by a multiplication matrix with i.i.d elements drawn from F(x;1,1,π/2,0). The entropy approximation of the d-dimensional linear projected vector (y1,,yd) is:

H˜(y1,,yd)=ln(d)lni=1deyi.

4.2. Sketch Evaluation

We now compare the proposed sketch to the one in [14], which is denoted “CC”. The proposed quadratic sketch is denoted “Poly2”.

5. Conclusions

We have presented novel communication-efficient algorithms for bounding and approximating the entropy in a distributed setting. The algorithms were tested on real and synthetic data, yielding a substantial reduction in communication overhead. Future work will address both sketch-based techniques and further development of the dynamic bound algorithms presented here. In addition, we intend to address the efficient distributed computation of other functions.

Appendix A. Improving the Upper Bound

Let us recall Algorithms 1 and 2 from Section 3.2. As mentioned there, the computed bound can be improved; we now present this improvement as a generalized algorithm that achieves a tight upper bound. The algorithm below can be run following the upper bound algorithms (either of them), after X* is computed, or as an external procedure that accepts X*, m and sup as input. We show the former option:

Algorithm A1: Tight Entropy Upper Bound for Two Nodes
graphic file with name entropy-24-01611-i004.jpg

Author Contributions

Conceptualization, A.S., Y.A. and D.K; methodology, A.S., Y.A. and D.K.; software, A.S. and Y.A.; validation, A.S., Y.A. and D.K.; formal analysis, A.S., Y.A. and D.K.; investigation, A.S., Y.A. and D.K.; resources, A.S., Y.A. and D.K.; data curation, A.S., Y.A. and D.K.; writing—original draft preparation, A.S., Y.A. and D.K.; writing—review and editing, A.S., Y.A. and D.K.; visualization, A.S., Y.A. and D.K.; supervision, D.K.; project administration, D.K. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The 20 Newsgroups Dataset, which can be found in this link.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This research received no external funding.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Cormode G. The continuous distributed monitoring model. SIGMOD Rec. 2013;42:5–14. doi: 10.1145/2481528.2481530. [DOI] [Google Scholar]
  • 2.Censor-Hillel K., Dory M. Distributed Spanner Approximation. SIAM J. Comput. 2021;50:1103–1147. doi: 10.1137/20M1312630. [DOI] [Google Scholar]
  • 3.Li M., Andersen D.G., Smola A.J., Yu K. Communication Efficient Distributed Machine Learning with the Parameter Server; Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014; Montreal, QC, Canada. 8–13 December 2014; pp. 19–27. [Google Scholar]
  • 4.Vajapeyam S. Understanding Shannon’s Entropy metric for Information. arXiv. 20141405.2061 [Google Scholar]
  • 5.Li L., Zhou J., Xiao N. DDoS Attack Detection Algorithms Based on Entropy Computing. In: Qing S., Imai H., Wang G., editors. Proceedings of the Information and Communications Security, 9th International Conference, ICICS 2007; Zhengzhou, China. 12–15 December 2007; Berlin, Germany: Springer; 2007. pp. 452–466. Lecture Notes in Computer Science. [DOI] [Google Scholar]
  • 6.Yehuda G., Keren D., Akaria I. Monitoring Properties of Large, Distributed, Dynamic Graphs; Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017; Orlando, FL, USA. 29 May–2 June 2017; pp. 2–11. [DOI] [Google Scholar]
  • 7.Alon N., Klartag B. Optimal Compression of Approximate Inner Products and Dimension Reduction; Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017; Berkeley, CA, USA. 15–17 October 2017; pp. 639–650. [DOI] [Google Scholar]
  • 8.Turk M.A., Pentland A. Face recognition using eigenfaces; Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1991; Lahaina, Maui, HI, USA. 3–6 June 1991; pp. 586–591. [DOI] [Google Scholar]
  • 9.Garofalakis M.N., Gehrke J., Rastogi R., editors. Data Stream Management—Processing High-Speed Data Streams. Springer; Berlin, Germany: 2016. Data-Centric Systems and Applications. [DOI] [Google Scholar]
  • 10.Gabel M., Keren D., Schuster A. Anarchists, Unite: Practical Entropy Approximation for Distributed Streams; Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Halifax, NS, Canada. 13–17 August 2017; New York, NY, USA: ACM; 2017. pp. 837–846. [DOI] [Google Scholar]
  • 11.Alfassi Y., Gabel M., Yehuda G., Keren D. A Distance-Based Scheme for Reducing Bandwidth in Distributed Geometric Monitoring; Proceedings of the 37th IEEE International Conference on Data Engineering, ICDE 2021; Chania, Greece. 19–22 April 2021; pp. 1164–1175. [DOI] [Google Scholar]
  • 12.Sharfman I., Schuster A., Keren D. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst. 2007;32:23. doi: 10.1145/1292609.1292613. [DOI] [Google Scholar]
  • 13.Lang K. Machine Learning Proceedings 1995. Elsevier; Amsterdam, The Netherlands: 1995. Newsweeder: Learning to filter netnews; pp. 331–339. [Google Scholar]
  • 14.Clifford P., Cosma I. A simple sketching algorithm for entropy estimation over streaming data; Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, JMLR.org, JMLR Workshop and Conference Proceedings; Scottsdale, AZ, USA. 29 April–1 May 2013; pp. 196–206. [Google Scholar]
  • 15.Harvey N.J.A., Nelson J., Onak K. Sketching and Streaming Entropy via Approximation Theory; Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008; Philadelphia, PA, USA. 25–28 October 2008; pp. 489–498. [DOI] [Google Scholar]
  • 16.Johnson W., Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. Conf. Mod. Anal. Probab. 1982;26:189–206. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The 20 Newsgroups Dataset, which can be found in this link.


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES