Abstract
Let G = (V,E,w) be a finite, connected graph with weighted edges. We are interested in the problem of finding a subset W ⊂ V of vertices and weights aw such that

$$\frac{1}{|V|} \sum_{v \in V} f(v) \sim \sum_{w \in W} a_w f(w)$$

for functions f that are ‘smooth’ with respect to the geometry of the graph; here ~ indicates that we want the right-hand side to be as close to the left-hand side as possible. The main application is to problems where f is known to vary smoothly over the underlying graph but is expensive to evaluate on even a single vertex. We prove an inequality showing that the integration problem can be rewritten as a geometric problem (‘the optimal packing of heat balls’). We discuss how one would construct approximate solutions of the heat ball packing problem; numerical examples demonstrate the efficiency of the method.
Keywords: Graph, Sampling, Graph Laplacian, Heat Kernel, Packing. Mathematics Subject Classification: 05C50, 05C70, 35P05, 65D32
1. Introduction
1.1. Introduction.
The purpose of this paper is to report on a very general idea in the context of sampling on graphs. A graph for us will always consist of a finite set of n vertices V and a set of edges E ⊂ V × V which are equipped with weights: we use wxy > 0 to denote the weight of the edge between the vertices x, y ∈ V. We will only deal with connected graphs. For simplicity of notation and exposition, we will assume that the vertices themselves are unweighted (or, equivalently, all have weight 1).
In a very real sense, the problem of ‘sampling’ a function on a finite graph is a generalization of the problem of sampling a function on a bounded domain Ω ⊂ ℝd, since any such domain can always be approximated by a graph by taking the vertices

$$V = \Omega \cap \lambda \mathbb{Z}^d,$$

where λ > 0 is a small parameter, and connecting neighboring grid points,

$$E = \left\{ (v_1, v_2) \in V \times V : \|v_1 - v_2\| = \lambda \right\}.$$
Figure 1 shows an example: the unit disk with λ = 3/10. We can summarize the main problem as follows:
Figure 1.
The Integration Problem on continuous domains is a special case of the Integration Problem on finite graphs. However, there are many more graphs and they can be quite a bit more complicated than the grid.
Classical Numerical Integration tries to find the average value of a function by sampling in specific points with specific weights.
Numerical Integration on Graphs tries to find the average value of a function by sampling in specific vertices using specific weights.
Since we can approximate every domain by a finite graph, any reasonable numerical integration technique on graphs needs to correspond to a reasonable numerical integration technique in Euclidean space. However, since graphs are more general, even the description of such techniques will require a very different setup. Moreover, problems in numerical integration only make sense within a certain smoothness class: without any assumptions on the regularity of functions, numerical integration is impossible since the behavior of a function in a finite set of points may have little to no implications for its behavior in other points. The same problem dominates finite graphs: we need to somehow specify smoothness properties of the function f, otherwise the question of numerical integration on a graph is not meaningful.
Problem (Quadrature). If we are allowed to sample in a set W ⊂ V of |W| = k vertices, which set of vertices W and which weights aw should we pick so that

$$\frac{1}{|V|} \sum_{v \in V} f(v) \sim \sum_{w \in W} a_w f(w)$$

for functions that are ‘smooth with respect to the geometry’ of G? Here, ~ indicates that we want the right-hand side to be as close to the left-hand side as possible. More precisely, how should we pick the set of vertices W ⊂ V and the weights aw such that

$$\sup_{\substack{f \in X \\ \|f\|_X \leq 1}} \left| \frac{1}{|V|} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \qquad \text{is as small as possible,}$$
where X ⊂ L2(V) denotes a subspace of functions (the space of ‘smooth’ functions that is yet to be defined).
We emphasize that, just as in the continuous case, there may be more than one way of defining a smoothness class; we will define one further below but emphasize that many others are conceivable.
1.2. Example applications.
At first glance, the question might not seem to make much sense: since the graph is finite, one can simply compute the true average of f by summing over all n vertices. The question is only interesting when sampling f in a vertex is difficult or expensive. We believe that, in light of new challenges in data science, the question is of profound relevance rather than merely an artificial extension of the classical numerical integration problem. A toy example is the following (which was relevant in work of the first author [17] and partially inspired this paper): suppose we have a medical database of n ≫ 1 people containing all sorts of information and are interested in the average blood pressure (information not contained in the database) of those n people. Actually going out and measuring the blood pressure of all n people would take a very long time. However, it is known that blood pressure is strongly correlated with some of the factors we have on file (say, age, weight, and smoking habits) and weakly correlated or uncorrelated with others (say, eye color). We can then build a weighted graph on n vertices where the weight on the edge connecting two people depends on how similar they are with regard to the relevant factors – the hope is that blood pressure, as a function on the graph, is then smoothly varying. Which of the, say, n/1000 people should we measure the blood pressure of so that the sample average is representative of the global average? Is there a way to pick them that decreases the expected error over a purely random selection? It is likely possible to make stronger statements if one restricts to certain classes of graphs and functions, and this could be of substantial interest; our paper will only address the most general case.
1.3. Related results.
Our paper is a companion paper to [31] (itself inspired by [30]) dealing with a problem in Spectral Graph Theory that discusses, in the language of numerical integration, the analogue of t–designs on finite graphs. However, the focus of [31] is more of an algebraic type and does not discuss the problem of numerical integration on generic graphs (moreover, the approach in [31] is unlikely to work on graphs that do not exhibit a great degree of structure). We extend some of the ideas from [30, 31] to sampling, prove an inequality bounding the integration error in terms of the geometry of the sampling points and give several examples. As discussed above, any reasonable approach to numerical integration on graphs has to necessarily yield a corresponding reasonable scheme on Euclidean domains – the approach we discuss in this paper has been shown to have a natural analogue in Euclidean space in work of J. Lu, M. Sachs and the second author [18] (where, at least on the Torus, it improves on classical constructions with respect to a Fourier-analytic way of measuring the integration error). We also emphasize a paper of Pesenson, Pesenson & Führ [26] which deals with cubature formulas on combinatorial graphs by enforcing exactness on bandlimited functions and a paper of Pesenson & Geller [25] for the continuous analogue.
2. Setup and Main Results
2.1. Formal Setup.
We will now make the notion of smoothness precise. If f has no particular structure, then there is little hope of being able to achieve anything at all. However, a graph does induce a natural notion of smoothness: we want the function to vary little between nodes that are “well connected” (e.g. connected by an edge with large weight, or having many neighbors that connect to each other). These are nodes that are “very similar” and hence we expect the function to vary less between them than between nodes that are less similar. We now introduce a notion of a Laplacian L on a graph: more precisely, we define a linear operator that maps functions to functions and is given as a local averaging operation. Our notion of a Laplacian deviates slightly from more classical notions and could be of independent interest (see §2.4 for a more detailed discussion). Let A denote the (weighted) adjacency matrix of G,

$$A_{ij} = w(e_{ij}),$$

where w(eij) ≥ 0 is the weight of the edge eij = eji connecting vertices i and j (and Aij = 0 if there is no such edge). A is a symmetric matrix. For simplicity, we assume that there are no loops, so the diagonal entries of A are zero. We introduce the maximum sum of any of the rows of this symmetric matrix (coinciding, by symmetry, with the maximum sum of any of the columns) and use it to define a normalized adjacency matrix: more precisely, we have

$$\tilde{A} = \frac{A}{\max_{1 \leq i \leq n} \sum_{j=1}^{n} A_{ij}}.$$
Finally, we introduce the (diagonal) degree matrix D′ associated to the renormalized adjacency matrix and use it to define a Laplacian: we set

$$D'_{ii} = \sum_{j=1}^{n} \tilde{A}_{ij} \qquad \text{and} \qquad L = \tilde{A} - D'.$$
We will never work directly with the Laplacian: our main object of interest is the associated diffusion process whose generator is given by

$$P = \mathrm{Id}_{n \times n} + L,$$
where Idn×n is the identity matrix of size n. P is a symmetric stochastic matrix and represents a lazy random walk where the probability of “staying put” depends on the vertex (as opposed to being, say, 0.5 as in the classical lazy random walk). We denote the eigenvalues of P, which are merely the eigenvalues of the Laplacian L shifted by 1, by λ1, …, λn. Since P is a stochastic matrix, we have |λi| ≤ 1. The eigenvectors whose eigenvalues are close to 0 ‘diffuse quickly’ and are thus the natural high-frequency objects on the graph. This motivates an ordering of eigenvalues from low frequency to high frequency,

$$1 = \lambda_1 \geq |\lambda_2| \geq \dots \geq |\lambda_n| \geq 0.$$
We denote the corresponding orthogonal eigenvectors of P, which clearly coincide with the eigenvectors of the Laplacian L, by φ1, …, φn (normalized in L2(V); φ1 is the constant vector). We define a function space Xλ, the canonical analogue of trigonometric polynomials on the torus or spherical harmonics on the sphere, via

$$X_\lambda = \mathrm{span} \left\{ \phi_k : |\lambda_k| \geq \lambda \right\},$$
where 0 ≤ λ ≤ 1 is a parameter controlling the degree of smoothness. If λ > μ, then Xλ ⊂ Xμ. Moreover, X0 contains all functions while, at least on generic graphs, X1 contains only the constant functions (this depends on whether the induced random walk is ergodic; it is not important for our purposes). The norm ‖·‖Xλ is just the classical L2–norm on the subspace – in more classical terms, we are simply considering the L2–space obtained via a Littlewood-Paley projection. This can be considered as the discrete analogue of the Paley-Wiener space of band-limited functions; to the best of our knowledge, this space was first introduced in the discrete setting by Pesenson [21]. We also refer to the emerging theory around this notion in [19, 20, 22, 23, 24, 12, 34, 33, 35, 11].
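For concreteness, the whole construction is only a few lines of linear algebra. The following is a minimal Python/numpy sketch (all function and variable names are ours, not from the paper) that builds Ã, D′, L and the propagator P from a weighted adjacency matrix and returns the spectrum ordered from low to high frequency:

```python
import numpy as np

def propagator_and_modes(A):
    """Sketch: build the normalized adjacency matrix, the Laplacian
    L = A_tilde - D' and the propagator P = Id + L described above,
    then diagonalize P. A is symmetric, nonnegative, zero diagonal."""
    n = A.shape[0]
    A_tilde = A / A.sum(axis=1).max()        # divide by the maximal row sum
    D_prime = np.diag(A_tilde.sum(axis=1))   # diagonal degree matrix of A_tilde
    L = A_tilde - D_prime                    # our notion of a Laplacian
    P = np.eye(n) + L                        # symmetric stochastic matrix
    eigvals, eigvecs = np.linalg.eigh(P)     # symmetric => orthogonal eigenvectors
    order = np.argsort(-np.abs(eigvals))     # |lambda| near 1 = low frequency
    return P, eigvals[order], eigvecs[:, order]

def x_lambda_basis(eigvals, eigvecs, lam):
    """Eigenvectors spanning X_lambda = span{phi_k : |lambda_k| >= lam}."""
    return eigvecs[:, np.abs(eigvals) >= lam]
```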
This function space Xλ is natural: if the graph G approximates a torus (being close to a grid graph), then this function space will indeed approximate trigonometric functions: Graph Laplacians are known to converge spectrally when graphs approximate an underlying manifold [28]. If G is close to a discretization of a sphere, then the space Xλ approximates the space of low-degree spherical harmonics. However, as is not surprising, the precise definition of ‘smoothness’ of the function is crucial here and different notions of smoothness will lead to different versions of the numerical integration problem. We state this as an explicit
Open Problem. Investigate other notions of smoothness for functions and the arising Numerical Integration problem.
While our particular definition of smoothness has the advantage of naturally reducing to a spectral perspective on numerical integration of continuous functions, it is quite conceivable that many other notions exist with various types of advantages and disadvantages; we believe this to be a natural and fascinating direction that could lead to rather interesting questions.
2.2. Main Result.
Our main result bounds the integration error in terms of ‖f‖Xλ and a purely geometric quantity (explained in detail below) formulated in terms of the quadrature points and independent of f. This has the nice effect of multiplicatively separating the size of the function ‖f‖Xλ and a quantity that can be interpreted as the ‘quality’ of the quadrature scheme. We use, for v ∈ V, the notation δv to denote the function for which δv(w) = δvw, where δ is the Kronecker delta function (i.e. δv(w) = 1 if w = v and δv(w) = 0 otherwise).
Theorem 1 (Main Result). Let W ⊂ V be equipped with weights aw summing to 1. Then, for all f ∈ Xλ and all 0 < λ < 1,

$$\left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \leq \frac{\|f\|_{X_\lambda}}{\lambda^{\ell}} \left\| \frac{1}{n} - \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2} \qquad \text{for all } \ell \in \mathbb{N},$$

where 1/n is shorthand for the constant function with value 1/n.
The proof shows that the inequality is essentially close to sharp in various settings (see below). If one has access to the eigenfunctions of the Laplacian (eigenvectors of the Laplacian matrix), then it is possible to use a stronger inequality discussed below in §2.5. The advantage of the main theorem is that the upper bound can be computed without knowing anything about the eigenfunctions of the Laplacian. The result has a certain philosophical similarity to the Koksma-Hlawka inequality: in its simplest setting, it states that for functions f : [0,1]d → ℝ,

$$\left| \int_{[0,1]^d} f(x) \, dx - \frac{1}{N} \sum_{n=1}^{N} f(x_n) \right| \leq V(f) \, D_N^*(x_1, \dots, x_N),$$

where V(f) is the variation of the function in the sense of Hardy & Krause and D_N^* denotes the star-discrepancy of the set of points {x1, …, xN}. We refer to the standard textbooks [7, 8, 15] for a definition of these terms: the main point is that the error is the product of a quantity coming from a notion of smoothness of the function, V(f), and a notion of regularity of the sampling points, D_N^*. If one knows nothing about the function except the degree to which it oscillates, it is nonetheless wise to minimize the error that is within one’s control and to try to make D_N^* as small as possible. Our main result is of a very similar flavor, the error being multiplicative. As in the case of the Koksma-Hlawka inequality, the proof shows that the result is close to sharp, which has the advantage of defining a notion of Discrepancy (regularity of sampling points and weights) on finite graphs. The study of low-discrepancy sets on the Torus has been of classical interest (see [7, 8, 15]) and has intimate connections to Analytic Number Theory, Combinatorics and Harmonic Analysis – we believe that the notion of low-discrepancy sets on graphs might be an interesting avenue of further research.
2.3. Geometric Interpretation: Heat Ball Packing.
Altogether, this suggests that we should use the quantity on the right-hand side of the inequality in Theorem 1, depending only on the set W, the weights aw and the free parameter ℓ (but not on the function f), as a guideline for how to construct the quadrature rule. This motivates studying the minimization problem

$$\min_{\substack{W \subset V,\ |W| = k \\ \sum_{w \in W} a_w = 1}} \left\| \frac{1}{n} - \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2}.$$
Note that for practical applications we do not necessarily need to find the precise minimizer; we just want the quantity to be as small as possible. Of course, from a mathematical point of view, it is very desirable to actually find global minima and study their properties. The quantity (Idn×n + L)ℓδw is the probability distribution of a random walker starting in w after ℓ jumps, which lends itself to a geometric interpretation.
Guideline. If we manage to find a placement of vertices W with the property that the random walks, weighted by aw, overlap very little, then we have found a good quadrature rule on the graph G = (V,E).
We observe that this can be reinterpreted as a ‘packing problem for heat balls’. This principle has already appeared naturally in the continuous setting of Riemannian manifolds in work of the second author [30]. It is even meaningful if we can only sample in a single vertex: we should pick the most central vertex, centrality here meaning that heat starting in that vertex diffuses quickly to many other points. In the ideal setting with perfectly distributed heat balls, all the weights would be identical (as can be seen in examples with lots of symmetry, see [31]). We summarize:
- It is desirable to construct W ⊂ V equipped with weights aw such that random walks, starting in w ∈ W and weighted by aw, intersect each other as little as possible.
- If we are not allowed to choose W, we can still use the procedure above to find weights that yield a better result than weighting each vertex equally.
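The degree to which the weighted random walks overlap is exactly the L2–quantity from Theorem 1 and is cheap to evaluate by repeated matrix-vector products. A minimal sketch, assuming the propagator P from the snippet in §2.1 (naming is ours):

```python
import numpy as np

def heat_ball_energy(P, W, a, ell):
    """L2 distance between the uniform distribution and the weighted heat
    balls sum_w a_w (Id + L)^ell delta_w; this is the right-hand side of
    Theorem 1 up to the factor lambda^(-ell). (Sketch, our naming.)"""
    n = P.shape[0]
    mu = np.zeros(n)
    mu[W] = a            # weighted sum of Dirac measures on W
    for _ in range(ell):
        mu = P @ mu      # one step of the diffusion
    return np.linalg.norm(mu - 1.0 / n)
```

Small values of this functional correspond to heat balls that overlap little, in the sense of the Guideline above.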
We emphasize that we do not attempt to solve the heat ball packing problem here – nor do we expect it to be easily solvable at this level of generality. The main contribution of this paper is to introduce the heat ball packing problem as a fundamental issue with implications for sampling.
Problem. How does one find effective heat ball packing configurations quickly? Are there algorithms leading to effective almost-minimizing configurations? What theoretical guarantees can be proven?
If the minimization problem over heat balls is similar to classical problems in potential theory (see also §3.4), where one places points x1, …, xN in such a way that

$$\sum_{i \neq j} \frac{1}{d(x_i, x_j)^{\alpha}} \qquad \text{is as small as possible},$$

where d(·,·) is the distance on a manifold (the case of ball packing corresponding to the limit α → ∞), then this would suggest that there will be many local minimizers that are not global: this has been empirically observed in the continuous setting (see e.g. [9]). Moreover, if the continuous analogies continue to hold, then the actual numerical value of the minimizing configuration can be expected to be quite close to that of almost-minimizers, which is good news: there are many almost-minimizers (thus easy to find) and they are almost as good as the global minimizer – at this point, however, this is pure speculation.
2.4. Another interpretation.
There is a nice bit of intuition coming from a more abstract perspective: the reason why the above method works is that the heat propagator etΔ is self-adjoint. More precisely, let us assume (M, g) is a compact manifold normalized to vol(M) = 1 and let μ be a probability measure (our quadrature rule) on it. Smooth functions f have the property that they do not change substantially if we apply the heat propagator for a short time t > 0 (this is one way of quantifying smoothness) and therefore

$$\int_M f \, d\mu \sim \int_M \left( e^{t\Delta} f \right) d\mu = \int_M f \left( e^{t\Delta} \mu \right) dx.$$

Here, the ~ is hiding an error term that depends on the smoothness of the function f and the size of the time t – various ways of making it precise lead to various different elementary results that we do not further pursue here. In order for this quadrature rule to be effective, we want etΔμ to be close to the Lebesgue measure dx. However, the heat propagator preserves the L1–mass and thus, by the Cauchy-Schwarz inequality and vol(M) = 1,

$$1 = \left\| e^{t\Delta} \mu \right\|_{L^1} \leq \left\| e^{t\Delta} \mu \right\|_{L^2}.$$
The Cauchy-Schwarz inequality is only sharp if etΔμ coincides with the constant function and thus minimizing ‖etΔμ‖L2 is a natural way to obtain good quadrature rules. If μ is a weighted sum of Dirac measures, then etΔ turns this, roughly, into a sum of Gaussians (‘heat balls’) and finding locations to minimize the L2–norm becomes, essentially, a packing problem. We observe that the argument does not single out L2 and minimizing ‖etΔμ‖Lp for p > 1 will lead to a very similar phenomenon – it remains to be seen whether there is any advantage to that perspective since L2 is usually easiest to deal with in practice.
2.5. The Laplacian: a second method.
We believe that our notion of Laplacian could be useful in practice (see also [16]). It combines the desirable properties of
- having a symmetric matrix (and thus orthogonal eigenvectors),
- inducing a diffusion operator (that is, a time-discrete heat semigroup generated by Id + L) that preserves the mean value of the function,
- the propagator Id + L having only nonnegative entries.
The Kirchhoff matrix L1 = D – A has the first two properties but not the third; the normalized Laplacian L2 = Idn×n – D−1/2AD−1/2 is symmetric but the induced diffusion does not preserve the mean value of the function. We believe that, for this reason alone, our notion of a Laplacian could be useful in other settings as well. If we can compute eigenvectors, then there is a way of approaching the problem directly.
Proposition. Let W ⊂ V be equipped with weights aw summing to 1. Then, for all 0 < λ < 1,

$$\sup_{\substack{f \in X_\lambda \\ \|f\|_{X_\lambda} \leq 1}} \left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| = \left\| -\frac{1}{n} + \sum_{w \in W} a_w \delta_w \right\|_{X_\lambda},$$

where ‖·‖Xλ is understood as a seminorm (the L2–norm of the projection onto Xλ).
This statement follows easily from L2–duality and will be obtained in the proof of the Theorem as a by-product. The result is true in general but, of course, one cannot compute the quantity on the right-hand side (or, more generally, the size of ‖f‖Xλ) unless one has access to the eigenvectors of the Laplacian. If one does have access to either all the Laplacian eigenvectors or at least some of the leading eigenvectors ϕ1, …, ϕk, where k ≪ n is chosen so that λk ~ λ (simpler to obtain in practice), then optimizing the functional over W ⊂ V and the weights aw is essentially equivalent to finding good quadrature points. This simple observation is very effective in practice; we refer to the numerical examples below.
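As an illustration, the following sketch optimizes the weights for a given W using only the leading eigenvectors. It assumes the ordered eigendecomposition from the snippet in §2.1 (constant eigenvector first); the choice of scipy’s SLSQP solver and the nonnegativity constraint are ours (the constraint is used in the experiments of §4 but not required by the Proposition):

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights_spectral(eigvals, eigvecs, lam, W):
    """Minimize the X_lambda seminorm of -1/n + sum_w a_w delta_w over
    weights a on the simplex. For nonconstant eigenvectors phi_k we have
    <mu, phi_k> = sum_w a_w phi_k(w), so only phi_k sampled on W is needed.
    (Sketch; assumes eigvals/eigvecs sorted with the constant vector first.)"""
    in_x_lambda = np.where(np.abs(eigvals) >= lam)[0][1:]  # drop constant phi_1
    Phi = eigvecs[np.ix_(W, in_x_lambda)].T    # eigenvectors sampled on W
    k = len(W)
    objective = lambda a: np.sum((Phi @ a) ** 2)           # squared seminorm
    constraints = [{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}]
    result = minimize(objective, np.full(k, 1.0 / k), method='SLSQP',
                      bounds=[(0.0, None)] * k, constraints=constraints)
    return result.x
```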
3. How to use the Theorem
The result discussed above has the nice effect of cleanly separating the problem of numerical integration on a graph from the actual graph structure: the geometry of the graph is encoded implicitly in the geometry of the random walk. This provides a uniform treatment but, as a downside, does not yield an immediate method on how to proceed in particular instances. The purpose of this section is to comment on various aspects of the problem and discuss approaches. Recall that our main result can be written as

$$\left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \leq \frac{\|f\|_{X_\lambda}}{\lambda^{\ell}} \left\| \frac{1}{n} - \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2} \qquad \text{for all } \ell \in \mathbb{N}.$$
3.1. The parameters λ and ℓ.
A priori we have no knowledge about the degree of smoothness of the function and have no control over it. The parameter ℓ, on the other hand, is quite important and has a nontrivial impact on the minimizing energy configurations for the quantity on the right-hand side. In practice, we have to choose ℓ without knowing λ (fixing an allowed number of sample points implicitly fixes a scale for λ). We propose the following basic heuristic.
Heuristic. If ℓ is too small, then there is not enough diffusion and heat balls interact strongly with themselves. If ℓ is too large, the exponentially increasing weight λ−ℓ is too large. There is an intermediate regime where heat balls start to interact with nearby heat balls.
The heuristic is accurate if the graph is locally (at the scale of typical distance between elements of W) close to Euclidean (in the sense that it is given as the discretization of a Euclidean domain as discussed in the Introduction). On general graphs, the optimal scale of ℓ might be more nontrivial to estimate – we have observed that, in practice, there is a wide range of ℓ yielding good results.
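In practice, one can simply evaluate the bound for a range of values of ℓ and keep the best one. A short self-contained sketch (our naming; lam is a smoothness level one is willing to assume):

```python
import numpy as np

def best_ell(P, W, a, lam, ell_max=30):
    """Scan diffusion times ell and return the one minimizing the bound
    of Theorem 1, lam**(-ell) * || 1/n - sum_w a_w P^ell delta_w ||_{L^2}.
    (Sketch; ell_max is an arbitrary cutoff.)"""
    n = P.shape[0]
    mu = np.zeros(n)
    mu[W] = a
    best, best_val = None, np.inf
    for ell in range(1, ell_max + 1):
        mu = P @ mu                                    # one more diffusion step
        val = np.linalg.norm(mu - 1.0 / n) / lam**ell
        if val < best_val:
            best, best_val = ell, val
    return best, best_val
```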
3.2. The placement of W.
Naturally, to avoid intersections of random walkers, we want to place the elements of W as far from each other as possible. In particular, if the graph is close to having a Euclidean structure, we would expect fairly equi-spaced points to do well. A method that was used in [31] is to start with a random set of k vertices {v1, v2, …, vk} and compute the total mutual distance

$$\sum_{i \neq j} d(v_i, v_j),$$

where d is a metric on G = (V,E). The algorithm then goes through all the vertices and checks whether moving one of them to a neighboring vertex increases the total mutual distance and, if so, moves the vertex. This is repeated as long as possible (a sketch of the procedure is given at the end of this subsection). The simple numerical examples in [31] all have edges with equal weight and the standard combinatorial graph distance can be used; which notion of ‘total mutual distance’ leads to the best result could strongly depend on the type of graph under consideration. There is a fairly natural reason why an algorithm of this type has the ability to produce sets of vertices that are very well spread out. We quickly return to the sphere 𝕊d, where a particularly spectacular justification exists. Let σ be the normalized measure on 𝕊d. Then, for any set {x1, …, xN} ⊂ 𝕊d, we have Stolarsky’s invariance principle [1, 32]

$$\frac{1}{N^2} \sum_{i,j=1}^{N} \left\| x_i - x_j \right\| = \int_{\mathbb{S}^d} \int_{\mathbb{S}^d} \|x - y\| \, d\sigma(x) \, d\sigma(y) - c_d \, D_{L^2}(x_1, \dots, x_N)^2,$$
where the quantity D_{L²} on the right is the L2–based spherical cap discrepancy and cd is a constant only depending on the dimension. The L2–based spherical cap discrepancy is a measure that has been studied in its own right: if the points are evenly distributed, then it is small. Maximizing the total mutual distance is thus the same as minimizing the discrepancy. This may be a somewhat peculiar case. However, it is not difficult to see that on fairly generic manifolds, minimizers of functionals along the lines of

$$\sum_{i \neq j} \frac{1}{d(x_i, x_j)^{\alpha}}$$

converge to the uniform distribution if the number of points is large. Moreover, and this is particularly useful, these types of functionals tend to produce minimal energy configurations that only weakly depend on the functional being used. On two-dimensional manifolds, the hexagonal lattice seems to be particularly universal (see [2]). We do not know what kind of interaction functional is optimal on graphs. In practice, one would like to have fast and reliable algorithms that scale well and this seems like a problem of substantial interest.
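Here is the promised sketch of the local exchange procedure (naming ours; dist can be the combinatorial graph distance or any other metric one prefers):

```python
def increase_mutual_distance(dist, neighbors, W):
    """Greedy local exchange as in [31]: move a vertex of W to an adjacent
    vertex whenever this increases sum_{i != j} d(w_i, w_j); repeat until
    no such move exists. dist is an (n x n) array of pairwise distances,
    neighbors[v] lists the vertices adjacent to v. (Sketch, our naming.)"""
    W = list(W)
    improved = True
    while improved:
        improved = False
        for i, w in enumerate(W):
            others = [u for j, u in enumerate(W) if j != i]
            current = sum(dist[w, u] for u in others)
            for v in neighbors[w]:
                if v not in W and sum(dist[v, u] for u in others) > current:
                    W[i] = v          # strictly better: accept and move on
                    improved = True
                    break
    return W
```

Since every accepted move strictly increases the total mutual distance, which can only take finitely many values, the procedure terminates.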
3.3. The weights aw.
Once we are given a set W ⊂ V and a parameter ℓ, the optimization of the weights is completely straightforward. Observe that, using ∑w∈W aw = 1 and the fact that each (Idn×n + L)ℓδw sums to 1,

$$\left\| \frac{1}{n} - \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2}^2 = \sum_{w_1, w_2 \in W} a_{w_1} a_{w_2} \left\langle \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_{w_1}, \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_{w_2} \right\rangle - \frac{1}{n}.$$

This is merely a quadratic form indexed by a |W| × |W| matrix – we thus need to solve the semidefinite program

$$\min_{\substack{a_w \geq 0 \\ \sum_{w \in W} a_w = 1}} \ \sum_{w_1, w_2 \in W} a_{w_1} a_{w_2} \left\langle \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_{w_1}, \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_{w_2} \right\rangle.$$
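The program is small (the matrix is only |W| × |W|) and can be handed to any solver for constrained quadratic problems. A sketch using scipy (our naming; no claim that this is the solver used for the experiments below):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(P, W, ell):
    """Minimize a^T M a over the probability simplex, where
    M[i, j] = <P^ell delta_{w_i}, P^ell delta_{w_j}> is the Gram matrix
    of the heat balls. (Sketch; dense matrix_power is fine for small graphs.)"""
    B = np.linalg.matrix_power(P, ell)[:, W]   # columns are the heat balls
    M = B.T @ B
    k = len(W)
    constraints = [{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}]
    result = minimize(lambda a: a @ M @ a, np.full(k, 1.0 / k), method='SLSQP',
                      bounds=[(0.0, None)] * k, constraints=constraints)
    return result.x
```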
These weights aw play an important role in further fine-tuning an existing set of vertices W ⊂ V. This is the reason why minimizing the functional can be used to find appropriate weights for any given set of vertices W: if two vertices in W happen to be somewhat close to each other, then the quadratic form will take this into account when distributing the weights. Conversely, if one of the vertices in W is surprisingly isolated from the other vertices in W, then the quadratic form will increase the weight assigned to that vertex. This is exactly how things should be: points that are oversampling a region in the graph should be given a smaller weight whereas isolated vertices cover a wider range and are, correspondingly, more important. We refer to §4 for numerical examples illustrating this point.
3.4. Related results.
Sampling on graphs is a fundamental problem and a variety of approaches have been discussed in the literature [13]. Sampling is usually done for the purpose of compression or visualization and not numerical integration (in particular, vertices are usually not equipped with weights).
Our approach seems to be very different from anything that has been proposed. The closest construction seems to be a line of research using biased random walks [10, 14, 27] based on the idea of sending random walkers and accepting the vertices they traverse with certain biased probabilities as sampling points. In contrast, we select points so that random walkers starting there avoid each other. Other results seem related in spirit [3]. Our approach is motivated by a recent approach [30] to the study of spherical t–designs of Delsarte, Goethals & Seidel [6] (and Sobolev [29]). These are sets of points on the sphere 𝕊d with the property that they integrate a large number of low-degree polynomials exactly (we refer to a survey of Brauchart & Grabner [4]). The second author recently extended some of these results to weighted points and general manifolds [30]. These ideas are shown to be an effective source of good quadrature points in the Euclidean setting in a paper of Lu, Sachs and the second author [18].
The second author recently proved [31] a generalized Delsarte-Goethals-Seidel bound for graph designs (the analogue of spherical t–designs on combinatorial graphs). The main condition in that paper is algebraic (exact integration of a certain number of Laplacian eigenvectors) as opposed to quantitative (small integration error). [31] shows a number of truly remarkable quadrature rules on highly structured graphs that were found by numerical search: one of these rules is depicted in Figure 3 and manages with only 8 evaluations to integrate 21 out of a total of 24 eigenvectors exactly. However, these examples are very non-generic, the consequence of strong underlying symmetries and not likely to be representative of what can be achieved in a typical setting.
Figure 3.
(Left:) The Icosahedron integrates all polynomials up to degree 5 on 𝕊2 exactly (this space is 36-dimensional). (Right:) a subset of 8 vertices integrates 21 of 24 eigenvectors of the McGee graph exactly. Such examples require extraordinary amounts of symmetry and are not generic (from [31]).
4. Numerical Examples
4.1. Importance of weights.
We start with a toy example shown in Figure 4 to illustrate the importance of the weights in counterbalancing bad geometric distributions. We assume we are given two clusters and an uneven distribution of six sampling points: five end up in one cluster while the other cluster contains a single point. We construct a graph based on nearest neighbor distances weighted with a Gaussian kernel. The spreading of heat is rather uneven if all points are given equal weight. Conversely, by adjusting the weights to minimize the L2–norm (constrained to summing to 1 and being nonnegative), a much more balanced distribution is achieved.
Figure 4.
Six points selected unevenly in two clusters (left), the heat flow emanating from weighting them all evenly (middle) and the optimal coefficients (0.11, 0.24, 0.05, 0.00, 0.18 and 0.43) for the heat ball packing problem (right).
The weights show that, in particular, one point is given weight 0 and another point is given a rather small weight (0.05). This is to counterbalance the clustering of points. The isolated point is given almost half the weight (0.43). We see that the heat distribution in the second cluster is still highly uneven: this shows that it would be preferable to pick another point since the single sampling point is actually quite far from the center of the cluster: if it was closer to the center, it would have received an even larger weight.
Figure 5 shows an example on a small graph with 10 vertices and 15 edges. More precisely, we optimize ‖(Id10×10+L)3∑wawδw‖L2 over all sets W with three vertices. In the first example, we see that there is one very central node that distributes well throughout the network; another weight is actually set to 0. This is a consequence of constraining the optimization to non-negative weights aw ≥ 0. This constraint is not required by the Theorem, but it makes the optimization easier and is well-motivated by classical methods in numerical integration. If we move the point that was assigned weight 0, then the weight splits evenly (the value of the functional barely changes).
Figure 5.
Two optimal configurations for ℓ = 3 on 3 vertices.
4.2. MNIST.
Our explicit example is as follows: we consider the dataset MNIST, a collection of handwritten digits represented as 28 × 28 pixel images. For simplicity, we only consider the subset comprised of the digits 0 and 1, resulting in a total of 12,665 images. The problem is to determine the proportion of elements in the set that are handwritten 1’s (which is 0.53). This ties in to our example in the beginning: suppose we did not know the precise proportion of 1’s and the data is unlabeled. Labeling the data is expensive: the function evaluation would be one human being looking at a picture and labeling it, which is costly. However, these pictures are merely {0,1}–vectors in ℝ784. As is commonly done, we reduce the dimensionality of the data by projecting onto its first ten principal components. We build a graph by connecting every element to its 10 nearest neighbors (in Euclidean distance) weighted with a Gaussian kernel and then symmetrize by averaging the resulting adjacency matrix with its transpose. It is reasonable to assume that the indicator function of 1’s is smooth over a graph defined by that notion of distance: handwritten digits looking like a 1 should be close to other handwritten digits that look like 1. We then proceed as outlined above: we sample random points, move them iteratively so that they are far away from each other and then adjust weights by solving the semidefinite program. In Figure 6, the result is plotted against the parameter ℓ and compared to uniform weights on random points (red); the picture shows 20 different sets of points, the evolution of their integration error depending on ℓ, as well as their average (black). We observe that for the right parameter range of ℓ, the obtained numerical integration scheme performs much better, but the precise performance depends on the points chosen. This highlights the need for fast, stable and guaranteed ways of approximately solving the heat ball packing problem.
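The graph construction just described can be reproduced along the following lines. This is a sketch: the helper names and the kernel bandwidth sigma are ours (the text above does not specify a bandwidth), and scikit-learn is one convenient choice of tooling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

def build_mnist_graph(X, n_components=10, k=10, sigma=1.0):
    """Sketch of the construction above: project onto the leading principal
    components, connect each point to its k nearest neighbors, weight edges
    by a Gaussian kernel and symmetrize. sigma is an assumed bandwidth."""
    Y = PCA(n_components=n_components).fit_transform(X)
    D = kneighbors_graph(Y, n_neighbors=k, mode='distance').toarray()
    A = np.where(D > 0, np.exp(-D ** 2 / sigma ** 2), 0.0)  # Gaussian weights
    A = (A + A.T) / 2          # symmetrize by averaging with the transpose
    np.fill_diagonal(A, 0.0)   # no loops
    return A
```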
Figure 6.
Estimating digits in MNIST: the relative integration error for 20 different sets of points of size 50 and how it evolves as a function depending on ℓ (leading to different selection of weights). The average of these 20 curves is shown as the bold black line, sampling error for randomly chosen points is the red line.
4.3. Using Eigenvectors of Laplacian.
This section studies the same example as above, estimating the proportion of handwritten digits ‘1’ in MNIST, but assumes additionally that we are able to access the eigenvectors of the Laplacian associated to the largest few eigenvalues exactly. We set λ = 0.994 close to 1, leading to a space Xλ spanned by very few of the smoothest eigenvectors, sample random points, move them far apart and make use of

$$\left\| -\frac{1}{n} + \sum_{w \in W} a_w \delta_w \right\|_{X_\lambda}$$
to explicitly optimize the weights. Our method (blue) is shown to perform exceedingly well (Figure 7); the weights are crucial: sampling over the same points with equal weights (red) is even worse than sampling over random points (yellow), which decays like (size of subset)−1/2 (averaged over 200 random samples). As mentioned above, having direct access to the eigenvectors implies that the bound is often close to sharp. This is illustrated in the subsequent Figure 8 where we replace the indicator function of the ‘1’-digits in the subset of MNIST comprised of digits 0 and 1 by its mollification obtained from projecting onto the first 6 eigenvectors. We optimize in the weights and obtain the theoretical upper bound

$$\|f\|_{X_\lambda} \left\| -\frac{1}{n} + \sum_{w \in W} a_w \delta_w \right\|_{X_\lambda}$$
that is then compared to the error on the function f (blue). We observe that as soon as the sample size exceeds a certain limit, integration becomes exact. It is quite desirable to obtain a better understanding of the interplay of the parameters involved: suppose we are given a set of k well-distributed points and optimize the weights so as to minimize ‖−n−1 + ∑w∈W awδw‖Xλ; what is the interplay between k, the Xλ space and the performance of the arising quadrature rule? Or, put differently, how does the integration error in a Xμ space depend on the space Xλ that was used to determine the weights? This question is clearly of great relevance in applications.
Figure 7.
Numerical Integration with access to eigenvectors: our method (blue) compared to sampling in random points (yellow) and sampling in the location in which we move the points (red).
Figure 8.
Error for the smoothed indicator function on the 1’s (blue) and the theoretical upper bound on the integration error (red). The error depends strongly on W and, after some initial fluctuation, settles down to essentially exact integration; the theoretical upper bound matches the performance on the particular instance.
The phenomenon, which seems to be generic and easily observed in most examples, is illustrated in Figure 9. We downsample the MNIST dataset of digits ‘0’ and ‘1’ to a total of 1000 points and construct the graph as before. We then choose a subset of points W of size 100, increase their mutual distance and fix them as quadrature points. Finally, we optimize their weights in three different Xλ spaces where λ is chosen such that the dimensions of the spaces are 10, 20 and 25 (i.e. they contain the first 10, 20 and 25 eigenfunctions, respectively). We then plot the integration error of these three quadrature rules on the first 50 eigenfunctions. If we optimize the weights according to the Xλ space containing the first 10 eigenfunctions, then the first 10 eigenfunctions are essentially integrated exactly and the subsequent integration error is small. The same is true for optimization in the space containing the first 20 eigenfunctions. Then the behavior changes abruptly: if we optimize over the first 25 eigenfunctions, then the error on those 25 eigenfunctions is small (~ 10−4) and, as in the other examples, increases afterwards. This seems to be typical: for any given set W ⊂ V, there seems to be a range of Xλ–spaces such that optimizing the weights leads to essentially exact integration in Xλ. Once their dimension exceeds a certain (sharp) threshold, the error is still small in Xλ but many orders of magnitude larger than before. This sharp phase transition could serve as another measure of the quality of W that may be useful in judging algorithms for finding W (the largest number of eigenfunctions that can be integrated exactly using W for some choice of weights, a measure already studied in [31]).
Figure 9.
Integration error of three quadrature rules with weights fine-tuned in three different Xλ–spaces on the first 50 eigenfunctions.
5. Proof of the Theorem
Proof. We write the integration error as the inner product of two vectors and decompose it in the eigenbasis of P as

$$\frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) = \left\langle f, \frac{1}{n} - \sum_{w \in W} a_w \delta_w \right\rangle = \sum_{k=1}^{n} \langle f, \phi_k \rangle \left\langle \phi_k, \frac{1}{n} - \sum_{w \in W} a_w \delta_w \right\rangle.$$

Since f ∈ Xλ, we have that 〈f, ϕk〉 = 0 unless |λk| ≥ λ. A simple application of the Cauchy-Schwarz inequality then shows that

$$\left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \leq \|f\|_{X_\lambda} \left( \sum_{|\lambda_k| \geq \lambda} \left\langle \phi_k, \frac{1}{n} - \sum_{w \in W} a_w \delta_w \right\rangle^2 \right)^{1/2},$$

and the second factor is exactly the Xλ–seminorm of −1/n + ∑w∈W awδw; this is the Proposition from §2.5. More precisely, L2–duality implies that this step is not lossy since

$$\sup_{\substack{f \in X_\lambda \\ \|f\|_{X_\lambda} \leq 1}} \left\langle f, \frac{1}{n} - \sum_{w \in W} a_w \delta_w \right\rangle = \left( \sum_{|\lambda_k| \geq \lambda} \left\langle \phi_k, \frac{1}{n} - \sum_{w \in W} a_w \delta_w \right\rangle^2 \right)^{1/2}.$$
For any function g ∈ Xλ, we have

$$\|g\|_{X_\lambda} \leq \frac{1}{\lambda^{\ell}} \left\| \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} g \right\|_{L^2} \qquad \text{for every } \ell \in \mathbb{N},$$

since (Idn×n + L)ℓ acts on the eigenvector ϕk by multiplication with λkℓ and |λk| ≥ λ on Xλ. We observe that this inequality is also valid if g ∉ Xλ since every step is a valid bound from above and ‖·‖Xλ is defined on all functions as a semi-norm (we note, however, that if g ∉ Xλ, then the inequality will usually be far from sharp). We use this inequality for

$$g = -\frac{1}{n} + \sum_{w \in W} a_w \delta_w$$

to conclude that

$$\left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \leq \frac{\|f\|_{X_\lambda}}{\lambda^{\ell}} \left\| \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \left( -\frac{1}{n} + \sum_{w \in W} a_w \delta_w \right) \right\|_{L^2}.$$
We observe that Idn×n + L is the generator of the diffusion. It is a linear operator for which constant functions are invariant. This implies

$$\left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \left( -\frac{1}{n} + \sum_{w \in W} a_w \delta_w \right) = -\frac{1}{n} + \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w.$$
We now show that the operator Idn×n + L preserves the average value of a function. It suffices to show that L maps every function to a function with average value 0. We use the definition

$$L = \tilde{A} - D' = \frac{A - D}{\max_{1 \leq i \leq n} \sum_{j=1}^{n} A_{ij}},$$

where D is the (diagonal) degree matrix of A. It thus suffices to show that A – D maps every function to a function with mean value 0. This follows from changing the order of summation,

$$\sum_{i=1}^{n} \left( (A - D) f \right)_i = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} f_j - \sum_{i=1}^{n} D_{ii} f_i = \sum_{j=1}^{n} \left( \sum_{i=1}^{n} A_{ij} \right) f_j - \sum_{j=1}^{n} D_{jj} f_j = 0,$$

since, by symmetry, the column sums of A coincide with the diagonal entries of D.
This implies that if we normalize the weights so that constants are being integrated exactly, i.e.

$$\sum_{w \in W} a_w = 1,$$

then the mean value of

$$-\frac{1}{n} + \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \qquad \text{is } 0.$$

Squaring out implies

$$\left\| -\frac{1}{n} + \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2}^2 = \left\| \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2}^2 - \frac{1}{n},$$

which is the quadratic form used in §3.3. Altogether, we have shown

$$\left| \frac{1}{n} \sum_{v \in V} f(v) - \sum_{w \in W} a_w f(w) \right| \leq \frac{\|f\|_{X_\lambda}}{\lambda^{\ell}} \left\| \frac{1}{n} - \sum_{w \in W} a_w \left( \mathrm{Id}_{n \times n} + L \right)^{\ell} \delta_w \right\|_{L^2},$$

and since ℓ ∈ ℕ was arbitrary, the result follows. □
Figure 2.
Suppose this is a social network: people who are similar to each other are connected by an edge. I would like to discover whether people prefer spaghetti or pizza and assume that my existing knowledge of similarity extends to tastes in food. I am only allowed to ask 3 people (because I have to pay them and have a limited budget): whom do I ask?
Acknowledgement
The authors are grateful to an anonymous referee for many helpful remarks.
GCL was supported by NIH grant #1R01HG008383-01A1 (PI: Yuval Kluger) and U.S. NIH MSTP Training Grant T32GM007205. S. S. was supported by the NSF (DMS-1763179) and the Alfred P. Sloan Foundation.
Contributor Information
GEORGE C. LINDERMAN, Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
STEFAN STEINERBERGER, Department of Mathematics, Yale University, New Haven, CT 06511, USA.
References
- [1] Bilyk D, Dai F and Matzke R, The Stolarsky principle and energy optimization on the sphere. Constructive Approximation, to appear.
- [2] Blanc X and Lewin M, The crystallization conjecture: a review. EMS Surv. Math. Sci. 2 (2015), no. 2, 225–306.
- [3] Bermanis A, Averbuch A and Coifman R, Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis 34 (2013), no. 1, 15–29.
- [4] Brauchart J and Grabner P, Distributing many points on spheres: minimal energy and designs. J. Complexity 31 (2015), no. 3, 293–326.
- [5] Chung F, Spectral Graph Theory. CBMS Regional Conference Series in Mathematics 92, American Mathematical Society, 1996.
- [6] Delsarte P, Goethals JM and Seidel JJ, Spherical codes and designs. Geometriae Dedicata 6 (1977), no. 3, 363–388.
- [7] Dick J and Pillichshammer F, Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge, 2010.
- [8] Drmota M and Tichy R, Sequences, Discrepancies and Applications. Lecture Notes in Mathematics 1651, Springer-Verlag, Berlin, 1997.
- [9] Erber T and Hockney G, Complex systems: equilibrium configurations of n equal charges on a sphere (2 ≤ n ≤ 112). Advances in Chemical Physics 98 (1997), 495–594.
- [10] Gjoka M, Kurant M, Butts CT and Markopoulou A, Walking in Facebook: a case study of unbiased sampling of OSNs. INFOCOM 2010 Proceedings IEEE, pages 1–9, IEEE, 2010.
- [11] Borodin V, Snoussi H, Hnaien F and Labadie N, Signal processing on graphs: case of sampling in Paley-Wiener spaces. Signal Processing 152 (2018), 130–140.
- [12] Fujiwara K, Eigenvalues of Laplacians on a closed Riemannian manifold and its nets. Proceedings of the American Mathematical Society 123 (1995), no. 8, 2585–2594.
- [13] Hu P and Lau WC, A survey and taxonomy of graph sampling. arXiv:1308.5865.
- [14] Jin L, Chen Y, Hui P, Ding C, Wang T, Vasilakos AV, Deng B and Li X, Albatross sampling: robust and effective hybrid vertex sampling for social graphs. Proceedings of the 3rd ACM International Workshop on MobiArch, pages 11–16, ACM, 2011.
- [15] Kuipers L and Niederreiter H, Uniform Distribution of Sequences. Pure and Applied Mathematics, Wiley-Interscience, New York-London-Sydney, 1974.
- [16] Linderman G and Steinerberger S, Clustering with t-SNE, provably. SIAM Journal on Mathematics of Data Science 1 (2019), no. 2, 313–332.
- [17] Lu J, Lu Y, Wang X, Li X, Linderman G, Wu C, Cheng X, Mu L, Zhang H, Liu J, Su M, Zhao H, Spatz E, Spertus J, Masoudi F, Krumholz H and Jiang L, Prevalence, awareness, treatment, and control of hypertension in China: data from 1.7 million adults in a population-based screening study (China PEACE Million Persons Project). The Lancet 390 (2017), 2549–2558.
- [18] Lu J, Sachs M and Steinerberger S, Quadrature points via heat kernel repulsion. Constructive Approximation, to appear.
- [19] Pesenson I, Sampling of Paley-Wiener functions on stratified groups. Journal of Fourier Analysis and Applications 4 (1998), no. 3, 271–281.
- [20] Pesenson I, A sampling theorem on homogeneous manifolds. Transactions of the American Mathematical Society 352 (2000), no. 9, 4257–4269.
- [21] Pesenson I, Sampling in Paley-Wiener spaces on combinatorial graphs. Transactions of the American Mathematical Society 360 (2008), no. 10, 5603–5627.
- [22] Pesenson IZ, Shannon sampling and weak Weyl's law on compact Riemannian manifolds. In: Analysis and Partial Differential Equations: Perspectives from Developing Countries, pages 207–218, Springer, 2019.
- [23] Pesenson I, Average sampling and average splines on combinatorial graphs. arXiv:1901.08726.
- [24] Pesenson I, Weighted sampling and weighted interpolation on combinatorial graphs. arXiv:1905.02603.
- [25] Pesenson IZ and Geller D, Cubature formulas and discrete Fourier transform on compact manifolds. In: From Fourier Analysis and Number Theory to Radon Transforms and Geometry, pages 431–453, Springer, 2013.
- [26] Pesenson IZ, Pesenson MZ and Führ H, Cubature formulas on combinatorial graphs. arXiv preprint, 2011.
- [27] Rasti AH, Torkjazi M, Rejaie R, Duffield N, Willinger W and Stutzbach D, Respondent-driven sampling for characterizing unstructured overlays. INFOCOM 2009, pages 2701–2705, IEEE, 2009.
- [28] Singer A, From graph to manifold Laplacian: the convergence rate. Applied and Computational Harmonic Analysis 21 (2006), 128–134.
- [29] Sobolev S, Cubature formulas on the sphere which are invariant under transformations of finite rotation groups. Dokl. Akad. Nauk SSSR 146 (1962), 310–313.
- [30] Steinerberger S, Spectral limitations of quadrature rules and generalized spherical designs. IMRN, to appear.
- [31] Steinerberger S, Designs on graphs: sampling, spectra, symmetries. Journal of Graph Theory, to appear.
- [32] Stolarsky KB, Sums of distances between points on a sphere II. Proc. Amer. Math. Soc. 41 (1973), 575–582.
- [33] Strichartz RS, Half sampling on bipartite graphs. Journal of Fourier Analysis and Applications 22 (2016), no. 5, 1157–1173.
- [34] Wang X, Liu P and Gu Y, Local-set-based graph signal reconstruction. IEEE Transactions on Signal Processing 63 (2015), no. 9, 2432–2444.
- [35] Ward JP, Narcowich FJ and Ward JD, Interpolating splines on graphs for data science applications. arXiv:1806.10695, 2018.