GigaScience. 2023 Nov 15;12:giad094. doi: 10.1093/gigascience/giad094

Euler characteristic curves and profiles: a stable shape invariant for big data problems

Paweł Dłotko 1, Davide Gurnari 2
PMCID: PMC10646871  PMID: 37966428

Abstract

Tools of topological data analysis provide stable summaries encapsulating the shape of the considered data. Persistent homology, the most standard and well-studied data summary, suffers from a number of limitations: its computations are hard to distribute, it is hard to generalize to multifiltrations, and it is computationally prohibitive for big datasets. In this article, we study the concept of Euler characteristic curves for 1-parameter filtrations and Euler characteristic profiles for multiparameter filtrations. Although they are a weaker invariant in one dimension, we show that Euler characteristic–based approaches do not share some of the handicaps of persistent homology; we show efficient algorithms to compute them in a distributed way, their generalization to multifiltrations, and their practical applicability to big data problems. In addition, we show that the Euler curves and profiles enjoy a certain type of stability, which makes them robust tools for data analysis. Lastly, to show their practical applicability, multiple use cases are considered.

Keywords: topological data analysis, persistent homology, Euler characteristic, distributed computations

Introduction

Since its beginning [1, 2], topological data analysis has attracted attention in the data science community. Topological tools, like persistent homology [3] and mapper [2], have been used in multiple tasks in material science [4–6], medicine [7], and many other fields. Over time, persistent homology has been successfully integrated with machine learning pipelines, and mapper has become an exploratory data analysis tool. In this work, we continue along the path of persistent homology. Following its successes, attempts were made to apply it to big data analysis. However, progress has been minimal. While there exists a single distributed implementation [8], it does not scale up and has not been extensively used in big data analysis. In practice, mostly sequential implementations are used [9]. To bypass the problem of too large an input, a number of sparsification techniques [10, 11] as well as bootstrap [12] and zigzag [13] approaches were proposed. While they scale up to problems of a certain size, they tend to bypass the big data challenge rather than propose a solution for it.

In this article, we extend the classical tools of the Euler characteristic and Euler characteristic curves. The new contributions include the following:

  • A proof of stability of the Euler characteristic curve (ECC) with respect to the 1-Wasserstein distance between persistence diagrams

  • A generalization of the ECC to the multiparameter filtration case, with an arbitrary number of parameters, that we denote as the Euler characteristic profile (ECP)

  • An analysis of the stability of such ECPs

  • Distributed algorithms to compute the exact ECC for Vietoris–Rips and cubical complexes that can be naturally extended to the multiparameter case. A Python implementation of these algorithms is provided as a scikit-learn [14]-compatible package.

  • Discussion of methods to compare and vectorize ECCs and ECPs

  • Examples of applications of the ECC/ECP to real-world data

Our proposed algorithms are perfectly parallelizable, and our software implementation can take advantage of multicore CPUs, which are nowadays standard even in consumer-grade hardware. Moreover, they can also be easily executed on a computer cluster to tackle scenarios in which the input data are too large to fit into the system’s memory. While we are not aware of any distributed algorithm to compute ECCs of a Vietoris–Rips complex, Heiss and Wagner [15] describe a streaming algorithm to compute the ECC of cubical complexes, which has also been adapted for GPU computations [16]. While their implementation is very fast, we see no straightforward way to generalize it to the multiparameter filtration case. To the best of our knowledge, the concept of ECPs of arbitrary dimension is novel in the literature. There are, however, some works that focus on the bifiltration case, known as Euler characteristic surfaces. These were used in an applied setting by Roy et al. [17] to analyze drying droplets, but no topological background was provided. Beltramo et al. [18] gave a description of Euler characteristic surfaces in the persistent homology framework and applied it to obtain a descriptor of both pointcloud and image-based data. Moreover, they provided a Python implementation of their algorithms, which, however, requires the input bifiltration to be binned. Chen et al. [19] introduced a time-aware multipersistence Euler–Poincaré surface to describe dynamical networks and proved its weak L1 stability. A recent preprint by Perez [20] analyzes the stability of Euler and Betti curves of stochastic processes on compact Riemannian manifolds.

Euler Characteristic Curves (and Profiles)

In this section, we introduce the essential mathematical concepts needed to define Euler characteristic curves and profiles. For an exhaustive presentation, we refer to classic textbooks like [21] and [3].

Definition 1.

A CW or cell complex X is a topological space that can be built up starting from a discrete set $X^0$ of 0-dimensional cells and then inductively creating the n-skeleton $X^n$ by attaching n-cells to $X^{n-1}$ along their boundary. The process can be stopped at some finite dimension or can continue indefinitely. A subset $A \subseteq X$ is a subcomplex of X if, together with each cell of A, all of its lower-dimensional faces also belong to A.

Remark 1.

Since we are interested in applying this machinery to analyze real-world data, we will always assume that our complexes are finite.

While the theory can be built in the general CW complex setting, the algorithms we present in the Algorithms section are specific to 2 different specializations that are used to represent different types of data: simplicial and cubical complexes.

Definition 2.

An abstract simplicial complex is a finite collection of sets K such that σ ∈ K and τ ⊆ σ implies τ ∈ K. The sets in K are called simplices, and the dimension of a simplex is dim(σ) = card(σ) − 1. We will often refer to 0-simplices as vertices and to 1-simplices as edges. Given a simplex $s = \{v_0, \dots, v_k\}$, its boundary is $\partial s = \sum_{i=0}^{k} (-1)^i \{v_0, \dots, \hat{v}_i, \dots, v_k\}$, where $\hat{v}_i$ denotes that the vertex $v_i$ is removed from the simplex. The simplices $\{v_0, \dots, \hat{v}_i, \dots, v_k\}$ are in the boundary of s.

There are different ways of obtaining an abstract simplicial complex from pointcloud data such as the Čech, the Vietoris–Rips, and the Alpha constructions [3], and in the Vietoris–Rips complexes section, we describe the Vietoris–Rips construction.

Definition 3.

An elementary interval is a subset of $\mathbb{R}$ of the type I = [l, l + 1] or I = [l, l], for some integer l. The first type is called a nondegenerate interval, while the second is a degenerate interval. An elementary cube C is a product of elementary intervals $C = I_1 \times \cdots \times I_n$, and its dimension is the number of nondegenerate intervals in the product. The boundary of an elementary interval is ∂[l, l + 1] = [l + 1, l + 1] + [l, l] and ∂[l, l] = 0. The boundary of an elementary cube is then defined as $\partial C = \sum_{i=1}^{n} I_1 \times \cdots \times \partial I_i \times \cdots \times I_n$. Similarly to the simplicial complex case, a cubical complex K is a collection of elementary cubes closed under the operation of taking the boundary.

One of the most common use cases of cubical complexes involves image data. In the Cubical complexes section, we describe how to build a filtered cubical complex from an n-dimensional image by identifying the image’s pixels with top-dimensional cells.

In what follows, we will refer to simplices and cubes as elements of a simplicial or a cubical complex jointly as cells in a cell complex. A cell τ is said to be a face of σ if τ is in the boundary of σ.

Definition 4.

Let K be a cell complex and d a dimension. A d-chain is a formal sum of d-cells in K, namely, $c = \sum_i a_i \sigma_i$, where the $\sigma_i$ are the d-cells and the $a_i$ are the coefficients.

There are many possible choices for the group of coefficients. A standard approach in computational topology is to use modulo 2 coefficients, that is, the $a_i$ can be either 0 or 1 and satisfy 1 + 1 = 0. (Using modulo 2 coefficients allows us to get rid of the $(-1)^i$ in the definition of the boundary in Definition 2.) Other options include integer, rational, or real coefficients.

Two d-chains can be added component-wise. Namely, given $c = \sum_i a_i \sigma_i$ and $c' = \sum_i b_i \sigma_i$, we have $c + c' = \sum_i (a_i + b_i) \sigma_i$. Therefore, we can define the group of d-chains $C_d(K)$. The boundary of a d-chain is the sum of the boundaries of its cells, $\partial c = \sum_i a_i \partial \sigma_i$, which is a (d − 1)-chain. Since the boundary commutes with the addition operation, we can define, for each dimension d, the boundary homomorphism $\partial_d: C_d(K) \to C_{d-1}(K)$.

A d-cycle is a d-chain with empty boundary, $\partial c = 0$. A d-boundary is a d-chain that is the boundary of a (d + 1)-chain. Since ∂ commutes with addition, we have the group of d-cycles $Z_d(K) = \ker \partial_d$ and the group of d-boundaries $B_d(K) = \operatorname{im} \partial_{d+1}$. It is a fundamental result that $\partial_d \partial_{d+1} c = 0$ for every dimension d and every (d + 1)-chain c. This means that the boundary of a boundary is always zero; in other words, $B_d(K)$ is a subgroup of $Z_d(K)$. This leads to the following definition.

Definition 5.

The dth homology group is the dth cycle group modulo the dth boundary group, $H_d(K) = Z_d(K) / B_d(K)$. The dth Betti number is the rank of this group, $\beta_d(K) = \operatorname{rank} H_d(K)$.

Definition 6.

Let K be a cell complex. A filtration of K is a sequence of nested subcomplexes $\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K$. Such a sequence is finite for finite complexes. It can be obtained by means of a filtration function over K, a monotonic nondecreasing function $f: K \to \mathbb{R}$ such that $f(\tau) \le f(\sigma)$ if τ is a face of σ. Note that every sublevel set $K_t = f^{-1}(-\infty, t]$ is a subcomplex of K for every $t \in \mathbb{R}$.

For each dimension d, such a filtration corresponds to a sequence of homology groups $H_d(K_0) \to H_d(K_1) \to \cdots \to H_d(K_n)$. For every i < j, the homomorphism $f_d^{i,j}: H_d(K_i) \to H_d(K_j)$ is induced from the inclusion map of $K_i$ into $K_j$.

Definition 7.

The dth persistent homology groups are the images of the homomorphisms, $H_d^{i,j} = \operatorname{im} f_d^{i,j}$. The ranks of these groups are the dth persistent Betti numbers $\beta_d^{i,j} = \operatorname{rank} H_d^{i,j}$.

Intuitively, the dth persistent Betti number $\beta_d^{i,j}$ counts how many homology classes of $K_i$ are still present in $K_j$. There are 2 scenarios in which a homology class from $K_i$ may not be present in $K_j$: it may become trivial or identical (homologous) to a class that was created earlier.

Definition 8.

The kth-dimensional persistence diagram of a filtered complex K, $\mathrm{Dgm}_k(K)$, is a multiset of points in the extended real plane $\bar{\mathbb{R}}^2$, where $\bar{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}$. The multiplicity of each point (b, d) indicates the number of independent k-dimensional classes that are born at filtration value b and die at filtration value d.

All the points on the diagonal are always included, with countable multiplicity, in a persistence diagram, in order to make sense of the following.

Definition 9.

A matching of 2 persistence diagrams C and D is a bijection $\eta: C \to D$, possibly to or from points on the diagonal.

Definition 10.

The 1-Wasserstein distance between two k-dimensional persistence diagrams C, D is

$$W_1(C, D) = \inf_{\eta: C \to D} \sum_{x \in C} \lVert x - \eta(x) \rVert_\infty,$$

where the infimum is taken over all matchings η of C and D.
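To make the definition concrete, the following minimal sketch computes the 1-Wasserstein distance between 2 very small diagrams by brute force over all matchings. It assumes that distances between points are measured in the sup norm and that every point can be matched to its nearest diagonal point at cost (d − b)/2. The function name `wasserstein_1` is hypothetical, only finite points are supported, and the factorial cost makes this approach suitable for illustration only.

```python
from itertools import permutations


def wasserstein_1(diagram_c, diagram_d):
    """Brute-force 1-Wasserstein distance between two small diagrams.

    Each diagram is a list of finite (birth, death) pairs. Assumption:
    sup-norm ground distance, diagonal matches cost (death - birth) / 2.
    """
    def sup_norm(p, q):
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    def to_diagonal(p):
        return (p[1] - p[0]) / 2.0

    n_c, n_d = len(diagram_c), len(diagram_d)
    size = n_c + n_d
    # Rows: points of C followed by n_d diagonal slots.
    # Columns: points of D followed by n_c diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n_c and j < n_d:
                cost[i][j] = sup_norm(diagram_c[i], diagram_d[j])
            elif i < n_c:
                cost[i][j] = to_diagonal(diagram_c[i])
            elif j < n_d:
                cost[i][j] = to_diagonal(diagram_d[j])
            # diagonal-to-diagonal matches cost 0
    return min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )


close_pair = wasserstein_1([(0.0, 2.0)], [(0.5, 2.25)])
far_pair = wasserstein_1([(0.0, 1.0)], [(5.0, 6.0)])
```

In the second example, the 2 bars are far apart, so the cheapest matching sends both points to the diagonal instead of pairing them with each other.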

Definition 11.

The Euler characteristic of a cell complex K is the alternating sum of the number of its cells in each dimension,

$$\chi(K) = \sum_{d \ge 0} (-1)^d \lvert K^d \rvert,$$

where $K^d$ denotes the set of d-dimensional cells in K. Thanks to the Euler–Poincaré formula, the Euler characteristic can also be expressed as the alternating sum of the Betti numbers, the ranks of the cell complex’s homology groups: $\chi(K) = \sum_d (-1)^d \beta_d(K)$ [3].
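As a small illustration of the cell-counting formula, the sketch below computes the Euler characteristic of a simplicial complex given as an explicit list of simplices. The representation (tuples of vertex labels) and the function name `euler_characteristic` are assumptions made for this example.

```python
from collections import Counter


def euler_characteristic(simplices):
    """Alternating sum of the number of simplices in each dimension."""
    # dim(sigma) = card(sigma) - 1
    cells_per_dim = Counter(len(simplex) - 1 for simplex in simplices)
    return sum((-1) ** dim * count for dim, count in cells_per_dim.items())


# Boundary of a triangle (a topological circle): 3 vertices and 3 edges.
hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
# Adding the 2-simplex fills the hole.
filled_triangle = hollow_triangle + [(0, 1, 2)]

chi_hollow = euler_characteristic(hollow_triangle)
chi_filled = euler_characteristic(filled_triangle)
```

The 2 values agree with the Betti number formula: the circle has β0 = β1 = 1, while the filled triangle has β0 = 1 and no higher-dimensional classes.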

Definition 12.

Let us consider a filtered complex K with filtration function $f: K \to \mathbb{R}$. We can define its Euler characteristic curve as a function that assigns an Euler characteristic χ to each filtration level $t \in \mathbb{R}$,

$$\mathrm{ECC}(K, t) = \chi(K_t).$$

Recall that $K_t = f^{-1}(-\infty, t]$ is a subcomplex of K for every $t \in \mathbb{R}$.

We are now interested in extending the concept of the Euler characteristic curve to the more general multidimensional persistence setting [22]. In order to do so, we need to generalize Definition 6 to families of nested complexes indexed by posets. While multidimensional persistence is a vibrant and active research topic, in this article, we will only make use of the basic concepts. We refer the interested reader to [23] for a modern introduction to the topic.

Definition 13.

Let K be a cell complex and P a poset. A P-indexed filtration on K is a family of nested complexes $\{K_x\}_{x \in P}$ such that $K_x$ is a subcomplex of K for each $x \in P$, and $K_x \subseteq K_y$ whenever $x \le y$. If $P = T_1 \times \cdots \times T_n$, where each $T_i$ is a totally ordered set, we call it a multiparameter or n-parameter filtration.

It is a natural question to ask whether the idea of sublevel sets of a filtration function could be extended too. In general, this is not the case. It can be achieved only when each cell of K first appears in the filtration at some unique minimal index in P.

Definition 14.

Let K be a cell complex, P a poset, and f a function $f: K \to P$. The sublevel filtration of f is a family of complexes of the type

$$K_p = \{\sigma \in K : f(\sigma) \le p\}, \quad p \in P.$$

A filtration isomorphic to a sublevel filtration is said to be 1-critical. A filtration that is not 1-critical is said to be multicritical.

Definition 15.

The ECP of a P-filtered complex K is a function that assigns to any value $p \in P$ the Euler characteristic of the corresponding subcomplex $K_p$,

$$\mathrm{ECP}(K, p) = \chi(K_p).$$

For the rest of the article, we will focus on the case $P = \mathbb{R}^n$.

Remark 2.

The 2-dimensional ECP has already appeared in the literature, where it is known as the Euler characteristic surface [17–19]. It was, however, defined only for the Cartesian product of two 1-parameter filtrations, and it is treated as a matrix in the following way. Given a bifiltering function $F: K \to \mathbb{R}^2$ over K and a set of threshold values $I = \{(a_i, b_j)\}_{i = 1, \dots, m;\ j = 1, \dots, n}$, the Euler characteristic surface is the m × n integer-valued matrix S whose entries are $S_{ij} = \chi(K_{ij}) = \chi(F^{-1}((-\infty, a_i] \times (-\infty, b_j]))$. This matrix representation corresponds to sampling the 2-dimensional profile on the grid given by I. In general, the choice of such a grid is not unique, and its spacing may not be constant. This makes it difficult to define a general notion of distance between Euler characteristic surface matrices. For this reason, we think it is more natural to define the Euler characteristic profile as a function, as in Definition 15, and to look for stability results in this setting.
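The matrix representation described in this remark can be sketched as follows. The example assumes a 1-critical bifiltration in which every cell is stored as a pair ((g1, g2), sign), where (g1, g2) is the value at which the cell appears and sign is (−1) raised to the cell's dimension. The function name and the data layout are hypothetical.

```python
def euler_characteristic_surface(contributions, a_values, b_values):
    """Sample a 2-parameter ECP on the grid a_values x b_values.

    Entry [i][j] is the Euler characteristic of the subcomplex of cells
    whose bifiltration value (g1, g2) satisfies g1 <= a_i and g2 <= b_j.
    """
    surface = [[0] * len(b_values) for _ in a_values]
    for (g1, g2), sign in contributions:
        for i, a in enumerate(a_values):
            if a < g1:
                continue
            for j, b in enumerate(b_values):
                if b >= g2:
                    surface[i][j] += sign
    return surface


# Two vertices and the edge joining them.
contributions = [((0, 0), +1), ((1, 0), +1), ((1, 2), -1)]
surface = euler_characteristic_surface(contributions, [0, 1, 2], [0, 1, 2])
```

Changing the threshold lists changes the matrix, which is the grid dependence discussed above.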

Stability of Euler Characteristic Curves and Profiles

The goal of this section is to bound the distance between Euler characteristic curves in terms of a known topological quantity of the pointcloud that is robust with respect to small perturbations of the pointcloud. In this way, the stability of Euler characteristic curves is obtained.

Euler characteristic curves

Since ECCs are piecewise constant functions, we consider the L1 distance between them (Fig. 1).

Figure 1: Two Euler characteristic curves in red and green. The absolute value of their difference is highlighted in shaded gray.

Definition 16.

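Let K1 and K2 be 2 filtered cell complexes. The L1 distance between their Euler characteristic curves is

$$\lVert \mathrm{ECC}(K_1, t) - \mathrm{ECC}(K_2, t) \rVert_1 = \int_{\mathbb{R}} \lvert \mathrm{ECC}(K_1, t) - \mathrm{ECC}(K_2, t) \rvert \, dt.$$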
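Because ECCs are step functions, this integral reduces to a finite sum over the intervals between consecutive breakpoints. The sketch below assumes that each curve is stored as a sorted list of (filtration value, Euler characteristic) breakpoints, that the curve is 0 before its first breakpoint, and that the integration is truncated at a user-chosen value t_max. All names are hypothetical.

```python
def step_value(curve, t):
    """Value at t of a right-continuous step function given by breakpoints."""
    value = 0
    for x, y in curve:
        if x > t:
            break
        value = y
    return value


def l1_distance(curve_a, curve_b, t_max):
    """Integral of |curve_a - curve_b| from -infinity up to t_max."""
    points = {x for x, _ in curve_a} | {x for x, _ in curve_b} | {t_max}
    points = sorted(x for x in points if x <= t_max)
    total = 0.0
    # Both curves are constant on each interval [left, right).
    for left, right in zip(points, points[1:]):
        gap = abs(step_value(curve_a, left) - step_value(curve_b, left))
        total += gap * (right - left)
    return total


curve_a = [(0.0, 3), (1.0, 1)]
curve_b = [(0.0, 3), (1.5, 1)]
distance = l1_distance(curve_a, curve_b, t_max=2.0)
```

The 2 curves differ only on [1.0, 1.5), where they take the values 1 and 3.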

The proof presented in this section is inspired by the stability result for persistence functions by Chung and Lawson [24], who analyzed the stability of a wide class of persistence curves and obtained a general bound (see Theorem 1 in [24]). However, trying to specialize this result to the simple Betti curve case leads to a term that depends on the number of points in the persistence diagram. Hence, the authors claim that Betti curves are unstable.

We will instead carry out the proof focusing exclusively on Betti curves, and by doing so, a stability result can be obtained.

Definition 17.

Let K be a cell complex with filtration function f. Its kth Betti curve is a function that assigns to each filtration level the kth Betti number of the corresponding subcomplex,

$$\beta_k(K, t) = \beta_k(K_t).$$

Let now D be the k-dimensional persistence diagram obtained from a filtered complex K. The fundamental lemma of persistent homology [3] states that the kth Betti number of the subcomplex $K_t$ can be obtained by counting, with multiplicity, the points in the diagram that lie in the box $[-\infty, t] \times (t, +\infty]$,

$$\beta_k(K_t) = \#\{(b, d) \in D : b \le t < d\}.$$

We can reformulate this statement by assigning to each point (b, d) in the diagram the indicator function of the interval [b, d), $I_{[b, d)}(t) = 1$ if t ∈ [b, d) and 0 otherwise. These indicator functions are exactly the bars in the barcode representation. By doing so, we can define the k-dimensional Betti curve as the step function obtained by summing up all these indicator functions.

Definition 18.

The kth Betti curve for a persistence diagram D with finitely many off-diagonal points is

$$\beta_k(D, t) = \sum_{(b, d) \in D} I_{[b, d)}(t),$$

where the sum runs over the off-diagonal points of D.
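A minimal sketch of this definition, assuming the diagram is given as a list of (birth, death) pairs and that essential classes are encoded with an infinite death value (the function name is hypothetical):

```python
def betti_curve(diagram, t):
    """Number of bars [b, d) of the diagram that contain t."""
    return sum(1 for birth, death in diagram if birth <= t < death)


diagram = [(0.0, 1.0), (0.0, float("inf")), (0.5, 2.0)]
values = [betti_curve(diagram, t) for t in (0.25, 0.75, 1.0, 3.0)]
```

Since the bars are half-open, the class that dies at 1.0 is no longer counted at t = 1.0.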

Proposition 1.

Let C and D be 2 k-dimensional persistence diagrams. Their Betti curves are stable with respect to the 1-Wasserstein distance,

$$\lVert \beta_k(C, t) - \beta_k(D, t) \rVert_1 \le 2 W_1(C, D). \tag{1}$$

Proof.

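Let us consider 2 k-dimensional persistence diagrams C, D and assume the optimal matching under the 1-Wasserstein distance is known. Moreover, let us index the points in each diagram as $(b_i^C, d_i^C) \in C$ and $(b_i^D, d_i^D) \in D$ so that points with matching indices are paired under the optimal matching. The case when points from 1 diagram are matched to the diagonal is described in case 2. We can then bound the difference between the 2 Betti curves as follows:

$$\lVert \beta_k(C, t) - \beta_k(D, t) \rVert_1 = \int_{\mathbb{R}} \Big\lvert \sum_i \big( I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \big) \Big\rvert \, dt \le \sum_i \int_{\mathbb{R}} \big\lvert I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \big\rvert \, dt.$$

Let us focus on a single term of the sum, $\int_{\mathbb{R}} \lvert I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \rvert \, dt$. Then, up to exchanging the roles of C and D, one of the following cases has to hold:

Case 1: $b_i^C \le b_i^D \le d_i^C \le d_i^D$ (Fig. 2).

$$\int_{\mathbb{R}} \big\lvert I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \big\rvert \, dt = (b_i^D - b_i^C) + (d_i^D - d_i^C) \le 2 \lVert (b_i^C, d_i^C) - (b_i^D, d_i^D) \rVert_\infty.$$

Case 2: $b_i^C \le b_i^D \le d_i^D \le d_i^C$ (Fig. 3).

$$\int_{\mathbb{R}} \big\lvert I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \big\rvert \, dt = (b_i^D - b_i^C) + (d_i^C - d_i^D) \le 2 \lVert (b_i^C, d_i^C) - (b_i^D, d_i^D) \rVert_\infty.$$

The matching of one point $(b_i^C, d_i^C)$ with a point in the diagonal of D is a degenerate case 2 with $b_i^D = d_i^D$. Note that, because of this, C and D are not required to have the same number of off-diagonal points.

Case 3: $b_i^C \le d_i^C < b_i^D \le d_i^D$ (Fig. 4)

This case will never happen as a better matching can always be obtained by matching both points to the diagonal, which is a degenerate case 2.

We have that $\int_{\mathbb{R}} \lvert I_{[b_i^C, d_i^C)}(t) - I_{[b_i^D, d_i^D)}(t) \rvert \, dt \le 2 \lVert (b_i^C, d_i^C) - (b_i^D, d_i^D) \rVert_\infty$ holds for every i. We can then write the difference between 2 Betti curves as

$$\lVert \beta_k(C, t) - \beta_k(D, t) \rVert_1 \le \sum_i 2 \lVert (b_i^C, d_i^C) - (b_i^D, d_i^D) \rVert_\infty = 2 W_1(C, D).$$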

Figure 2: Case 1.

Figure 3: Case 2.

Figure 4: Case 3.

Thanks to the Euler–Poincaré formula, the Euler characteristic curve of a filtered complex K can be obtained as the alternating sum of its Betti curves,

$$\mathrm{ECC}(K, t) = \sum_k (-1)^k \beta_k(K, t).$$

A stability result for the ECCs can be immediately derived from Proposition 1, assuming that the complex K has nonzero persistence diagrams in a finite number of dimensions, each of them containing a finite number of off-diagonal points.

Proposition 2.

Let X and Y be 2 filtered cell complexes. The L1 difference between the Euler characteristic curves of X and Y is bounded by the sum of the 1-Wasserstein distances between the corresponding k-dimensional persistence diagrams $\mathrm{Dgm}_k(X)$, $\mathrm{Dgm}_k(Y)$,

$$\lVert \mathrm{ECC}(X, t) - \mathrm{ECC}(Y, t) \rVert_1 \le \sum_k 2 W_1(\mathrm{Dgm}_k(X), \mathrm{Dgm}_k(Y)), \tag{2}$$

where the sum is over all dimensions in which the persistence diagrams are nonempty.

Proof.

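It is an immediate consequence of Proposition 1 and the triangle inequality:

$$\lVert \mathrm{ECC}(X, t) - \mathrm{ECC}(Y, t) \rVert_1 = \Big\lVert \sum_k (-1)^k \big( \beta_k(X, t) - \beta_k(Y, t) \big) \Big\rVert_1 \le \sum_k \lVert \beta_k(X, t) - \beta_k(Y, t) \rVert_1 \le \sum_k 2 W_1(\mathrm{Dgm}_k(X), \mathrm{Dgm}_k(Y)).$$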

The above Proposition 2 is in explicit contrast with the claim that the Euler characteristic curve is unstable. In addition to the already mentioned work by Chung and Lawson [24], a similar statement can be found in [18] and [25].

Remark 3.

With reference to Fig. 1, the left-hand side of Equation (2) is finite when the 2 ECCs agree from some filtration value onward. This is exactly what happens, for example, when considering curves obtained from full complexes (i.e., filtered complexes having a single simplex as the last element of the filtration): at some value, all possible faces will have entered the filtration, and so the Euler characteristic will stabilize at 1. If this does not happen, the difference between the 2 ECCs will be unbounded. At the same time, it is straightforward to show that if 2 filtered complexes have different Euler characteristics at +∞, their homologies will have a different number of essential classes. This translates to a different number of points at infinity in the persistence diagrams, whose Wasserstein distance would then be unbounded. In this case, the above result will trivially read +∞ ≤ +∞.

Euler characteristic profiles

We can immediately extend the notion of L1 distances between ECCs to work in the general case of n-dimensional ECPs.

Definition 19.

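Let K1, K2 be 2 multifiltered cell complexes. The L1 distance between the corresponding n-dimensional Euler characteristic profiles is

$$\lVert \mathrm{ECP}(K_1) - \mathrm{ECP}(K_2) \rVert_1 = \int_{\mathbb{R}^n} \lvert \mathrm{ECP}(K_1, p) - \mathrm{ECP}(K_2, p) \rvert \, dp.$$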

It is natural to ask whether the stability result in Proposition 2 can be extended to the multiparameter case. In the existing literature, Chen et al. [19] proposed the following weak L1 metric in the case of bifiltered complexes (see Definition 3.2 in [19]). Let us recall the proposed construction. Consider 2 cell complexes K1 and K2 with a bifiltration function $F: K \to \mathbb{R}^2$. Let us denote with f and g the 2 real-valued functions in the bifiltration such that F(σ) = (f(σ), g(σ)) for every cell σ. Moreover, let us index the threshold values of F as $I = \{(a_i, b_j)\}$. The idea behind the construction of Chen et al. [19] is to fix 1 of the 2 filtrations at a specific value and consider the distances between the single-parameter persistence diagrams (PDs) induced by the other filtration function. By considering the set of threshold values I as a matrix with i rows and j columns, they define the ith column distance for the k-dimensional PDs as Inline graphic. Similarly, the jth row distance is Inline graphic.

Definition 20 (Definition 3.2 in [19]).

The weak L1 metric between K1 and K2 is

Definition 20 (Definition 3.2 in [19]).

Being able to recover the single-parameter case, they prove the following stability result.

Proposition 3 (Theorem 3.1 in [19]).

Let K1, K2 be 2 bifiltered cell complexes. The distance between the corresponding Euler characteristic surfaces is bounded by the weak L1 metric between K1 and K2,

Proposition 3 (Theorem 3.1 in [19]).

for some c > 0.

This construction appears to be the natural generalization to the multifiltration case of the stability result in Proposition 2. However, there are some fundamental problems that undermine the usefulness of such a weak L1 metric.

Remark 4.

In our opinion, the sums over rows or columns in Definition 20 should be replaced with integrals over the filtration ranges. As already discussed in Remark 2, this would allow for more flexibility when dealing with filtration thresholds whose spacing is not constant.

Remark 5.

Proposition 3 will evaluate to a trivial ∞ ≤ ∞ in most cases, even the simplest ones. Consider, for example, the situation depicted in Fig. 5: the ECP of a bifiltered complex K1 made of just one 0-dimensional cell that appears at filtration value (g1, g2). The ECP will then be 1 in the cone $[g_1, +\infty) \times [g_2, +\infty)$ and 0 otherwise. We can obtain a different complex K2 by perturbing the first filtration value by an amount ϵ, to (g1 + ϵ, g2). The difference between the 2 ECPs will then be unbounded. At the same time, the weak L1 distance between K1 and K2 will also be unbounded because, in the strip [g1, g1 + ϵ) × [g2, +∞], the 2 complexes have a different number of essential classes, and so the W1 distance between the corresponding PDs will be infinite.

Figure 5: Minimal counterexample for the instability of ECP. Consider a cell complex made of only 1 vertex whose +1 contribution appears at some point $g = (g_1, g_2)$ and move it to g′ = (g1 + ϵ, g2). Their difference, the region shaded in red, is unbounded.

Because of the discussed issues, the stability result in [19], while being formally correct, does not cover a lot of practically relevant cases.

However, in most applications, we can truncate the ECP by limiting its filtration domain to the interval [0, f] in every filtration dimension, where f is a finite value. Note that this truncation value, which plays the role of infinity, should not be equal to the maximum filtration value of the complex’s cells but strictly larger than it. For example, in the case of images whose pixels have integer filtration values in the [0, 255] range (see the RGB images section), we could choose f = 256 as a truncation value. By doing so, the distance between every pair of ECPs will be finite, but it will of course depend on the truncation value. Using truncation, we can state the following result.

Proposition 4.

Let K be a finite cell complex with an n-dimensional multifiltration $F: K \to \mathbb{R}^n$. We define $K_\epsilon$ as the complex obtained by perturbing the filtration values of each cell in K by at most ϵ in the $l_\infty$ norm. Let us assume, for simplicity, that we truncate the domain of every filtration function to the same interval [0, f]. We then have the following bound:

$$\lVert \mathrm{ECP}(K) - \mathrm{ECP}(K_\epsilon) \rVert_1 \le \lvert K \rvert \cdot n \cdot \epsilon \cdot f^{n-1},$$

where |K| is the number of cells in the complex and n is the number of filtration parameters.

Proof.

Let us consider a single cell σ ∈ K with filtration value $g = (g_1, \dots, g_n)$. Its contribution to the ECP will be $(-1)^{\dim(\sigma)}$ in the cone above g (i.e., for all points $x \in \mathbb{R}^n$ such that $g \le x$ coordinate-wise). Let σ′ be the corresponding cell in $K_\epsilon$ whose filtration values have been maximally perturbed to $g' = (g_1 + \epsilon, \dots, g_n + \epsilon)$. The volume $V_\sigma$ of the region that is in the cone of g but not in the cone of g′ can be bounded by a sum of n n-dimensional cuboids, each of them corresponding to a shift of ϵ in the direction of one axis. After truncation, each cuboid has thickness ϵ along that axis and extends for at most f along each of the remaining n − 1 axes, so $V_\sigma \le n \cdot \epsilon \cdot f^{n-1}$, where the inequality is due to the fact that the cuboids can have a nonempty intersection. One such cuboid is shaded in red in Fig. 5. Multiplying by the total number of cells gives us the bound.
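As a numerical sanity check of the volume argument, the sketch below computes the exact truncated volume lost by a single cell when all of its filtration values are shifted by ϵ and compares it with n · ϵ · f^(n − 1), the total volume of n slabs of thickness ϵ inside the truncated domain [0, f]^n. The function names are hypothetical, and the comparison value is the one obtained from this slab argument.

```python
from math import prod


def cone_volume(g, f):
    """Volume of {x in [0, f]^n : x >= g coordinate-wise}."""
    return prod(max(f - g_i, 0.0) for g_i in g)


def single_cell_difference(g, eps, f):
    """Volume inside the cone of g but outside the cone of g + eps."""
    shifted = [g_i + eps for g_i in g]
    return cone_volume(g, f) - cone_volume(shifted, f)


def slab_bound(n, eps, f):
    """Total volume of n slabs of thickness eps and extent f elsewhere."""
    return n * eps * f ** (n - 1)


difference_2d = single_cell_difference((0.0, 0.0), 1.0, 10.0)
difference_3d = single_cell_difference((10.0, 20.0, 30.0), 1.0, 256.0)
```

In 2 dimensions, the exact difference is 100 − 81 = 19, slightly below the value 20 given by 2 overlapping slabs.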

Algorithms

Recall that the Euler characteristic of a cell complex is the alternating sum of the number of its cells in each dimension. The contribution of each cell will thus be plus or minus 1 depending on the dimension of the cell. Moreover, this contribution will appear at the cell’s filtration level. Therefore, if we are able to obtain a list of all cells with their filtration values, we can compute the Euler characteristic at each filtration level. This is the main idea behind the following algorithms, which will always return what we will denote as list_of_contributions, a list of pairs $(f(\sigma), (-1)^{\dim(\sigma)})$ that stores each cell’s contribution to the Euler characteristic (EC) at the cell’s filtration level. Once these pairs have been sorted in ascending order with respect to the filtration, the Euler characteristic curve can be reconstructed by progressively summing up the contributions of successive elements in the list.
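The reconstruction step can be sketched in a few lines. The example below is a minimal illustration, not the interface of the accompanying package: the function names are hypothetical, and the curve is returned as a list of (filtration value, Euler characteristic) breakpoints.

```python
from itertools import groupby
from operator import itemgetter


def ecc_from_contributions(list_of_contributions):
    """Turn (filtration, +1/-1) pairs into sorted ECC breakpoints."""
    ordered = sorted(list_of_contributions, key=itemgetter(0))
    ecc, chi = [], 0
    # Cells sharing a filtration value are merged into one breakpoint.
    for value, group in groupby(ordered, key=itemgetter(0)):
        chi += sum(sign for _, sign in group)
        ecc.append((value, chi))
    return ecc


def evaluate_ecc(ecc, t):
    """Euler characteristic of the sublevel complex at filtration level t."""
    chi = 0
    for value, running_chi in ecc:
        if value > t:
            break
        chi = running_chi
    return chi


# Filled triangle: 3 vertices at 0, edges at 1, 1, and 2, triangle at 2.
contributions = [(0.0, 1), (0.0, 1), (0.0, 1),
                 (1.0, -1), (1.0, -1), (2.0, -1), (2.0, 1)]
ecc = ecc_from_contributions(contributions)
```

Since the contributions are simply added up, lists computed independently for different parts of the complex can be concatenated before this step.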

Remark 6.

Roune and de Cabezón [26] proved that computing the Euler characteristic of a simplicial complex given by its vertices and facets is #P-complete. Even though their result does not mention filtered complexes, it follows that the problem of computing the ECC is at least as hard. Otherwise, we could construct an arbitrary filtration of the considered complex and look at the end value of the curve to obtain the Euler characteristic of the complex in polynomial time.

Vietoris–Rips complexes

In this section, we will present a distributed algorithm to compute the Euler characteristic curve of a Vietoris–Rips simplicial complex obtained from a collection of points in $\mathbb{R}^m$.

Definition 21.

Let X be a finite collection of points in $\mathbb{R}^m$, also denoted as a pointcloud. Given a parameter ϵ ≥ 0, the Vietoris–Rips complex constructed from X is the collection of all subsets of X of diameter at most 2ϵ, where the diameter is the greatest distance between any pair of vertices,

$$\mathrm{VR}_\epsilon(X) = \{\sigma \subseteq X : \operatorname{diam}(\sigma) \le 2\epsilon\}, \qquad \operatorname{diam}(\sigma) = \max_{x, y \in \sigma} \lVert x - y \rVert.$$

The filtration value of each simplex is given by its diameter.

The Vietoris–Rips complex is a flag complex; this means that a subset S of vertices is in the complex if and only if every pair of vertices in S is in the complex. This is equivalent to saying that the Vietoris–Rips complex is completely determined by its 1-skeleton graph, as there is a 1-to-1 correspondence between simplices in the complex and cliques in its 1-skeleton graph.

Therefore, listing all the simplices in a Vietoris–Rips complex is equivalent to enumerating the cliques of its 1-skeleton graph [27]. In order to compute the contributions to the ECC, we need to find an efficient and distributed way to list all cells in the complex (i.e., all cliques in the 1-skeleton graph) and their filtration values (i.e., the length of the longest edge in each clique); this can be achieved in the following way. Given an ordered list of points $X = \{x_i : i \in [1, n]\}$ (the points can be ordered in an arbitrary way) and a maximum distance ϵ, for each point $x_i$, we build its local graph $G_i$ of subsequent neighbors, namely, all points $x_j \in B(x_i, \epsilon) \cap X$ with j > i. For each $G_i$, we list all of its cliques that contain $x_i$. They will correspond to simplices with $x_i$ being the smallest vertex in the chosen ordering of points. This way, each simplex σ in the V-R complex will be generated exactly once, when considering the local graph of its lowest vertex in the considered ordering.

Algorithm 1, which uses this idea, describes a way to list all the simplices of increasing dimension. At each iteration, we obtain a list of d-dimensional simplices (given as collections of vertices) and, for each of them, a list of common subsequent neighbors of its vertices. We can then extend each simplex to a (d + 1)–dimensional one by adding 1 common neighbor to the collections of vertices. When doing so, we need to update the simplex’s filtration value if one of the newly added edges is longer than the current filtration. Moreover, we need to update the list of common subsequent neighbors by intersecting it with the subsequent neighbors of the newly added vertex. Once we have obtained all possible (d + 1)–simplices, we carry out this extension procedure 1 dimension higher. All of these operations are performed at the local graph of each vertex. The procedure ends when no simplex can be extended (i.e., when all maximal simplices have been listed). This construction might be understood as a breadth-first traversal of the simplex tree [27].

Algorithm 1: COMPUTE LOCAL CONTRIBUTIONS V-R (pseudocode available only as an image, giad094alg1.jpg)

The proposed algorithm has 2 main advantages: it does not require constructing the whole complex, leading to a significant decrease in memory usage, and it considers each point separately, allowing the computations to be carried out independently.

The inputs of our algorithm are X, an ordered list of points in $\mathbb{R}^m$, and a maximum filtration value ϵ. The output is list_of_contributions, an ordered list of pairs. For each simplex σ, we store its contribution as a tuple $(f(\sigma), (-1)^{\dim(\sigma)})$. The output list will be sorted according to the filtration values.

Note that Algorithm 1 is correct. First, every simplex in the Vietoris–Rips complex will be generated: this happens when its smallest vertex in the considered order is processed in the for loop. Second, each simplex will be generated only once in the INCREASE_DIMENSION procedure (Algorithm 2). A simplex $\sigma = [v_0, \dots, v_{n-1}, v_n]$, where $v_0 < \dots < v_{n-1} < v_n$, will be generated from the simplex $[v_0, \dots, v_{n-1}]$ by adding $v_n$ as a common neighbor of its vertices.

Algorithm 2: INCREASE_DIMENSION (pseudocode available only as an image, giad094alg2.jpg)
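The procedure described in this section can be sketched in plain Python as follows. This is an illustrative reading of the textual description, not the authors' implementation: the function names are hypothetical, 2 points are assumed to be joined by an edge when their Euclidean distance is at most ϵ, and the filtration value of a simplex is taken to be the length of its longest edge.

```python
from math import dist


def local_contributions(points, i, epsilon):
    """Contributions of all simplices whose lowest vertex is points[i]."""
    def later_neighbors(v):
        # Subsequent neighbors: indices j > v within distance epsilon.
        return {j for j in range(v + 1, len(points))
                if dist(points[v], points[j]) <= epsilon}

    contributions = [(0.0, 1)]  # the vertex itself
    # Each entry: (vertices, filtration value, common subsequent neighbors)
    frontier = [((i,), 0.0, later_neighbors(i))]
    dim = 0
    while frontier:
        dim += 1
        next_frontier = []
        for vertices, filtration, common in frontier:
            for v in sorted(common):
                # The new edges may be longer than the current filtration.
                longest_new_edge = max(dist(points[u], points[v])
                                       for u in vertices)
                new_filtration = max(filtration, longest_new_edge)
                contributions.append((new_filtration, (-1) ** dim))
                next_frontier.append((vertices + (v,), new_filtration,
                                      common & later_neighbors(v)))
        frontier = next_frontier
    return contributions


def vr_contributions(points, epsilon):
    """Sorted list of contributions of the whole Vietoris-Rips complex."""
    contributions = []
    for i in range(len(points)):  # iterations are independent of each other
        contributions.extend(local_contributions(points, i, epsilon))
    return sorted(contributions)


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
all_cells = vr_contributions(triangle, epsilon=1.5)
without_long_edge = vr_contributions(triangle, epsilon=1.2)
```

Because every call to `local_contributions` only reads the pointcloud, the loop in `vr_contributions` could be distributed across processes or machines and the partial lists merged afterward.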

Time performance

The worst-case scenario occurs when the 1-skeleton graph is fully connected. Assuming the pointcloud consists of n points, the resulting V-R complex will contain $2^n - 1$ simplices. In this case, the time complexity of Algorithm 1 is Inline graphic. More details are provided in Appendix A.

Memory performance

Assuming the worst-case scenario, the size of the output list of contributions is $O(n^2)$, while the maximal memory required at 1 intermediate step is Inline graphic. More details are provided in Appendix A.

Choice of the vertex ordering

Note that the total running time of the fully parallelized Algorithm 1 can be dominated by a few vertices whose simplex trees are considerably larger than the others. This explains the plateau in Fig. 6. This effect can be mitigated by choosing a different ordering of the vertices. One efficient choice is to order the vertices by increasing number of ϵ-neighbors. Since the local graph of each vertex is constructed by considering only its subsequent neighbors, this ordering produces more evenly sized simplex trees. A simple example is shown in Fig. 7, while the effect of this reshuffling on a larger dataset is shown in Fig. 8.
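The effect of the ordering can be checked on a small star-shaped pointcloud. The sketch below is illustrative only; the helper names `order_by_neighbor_count` and `subsequent_neighbor_counts` are hypothetical and the neighbor search is brute force.

```python
import math


def order_by_neighbor_count(points, eps):
    """Vertex indices sorted by increasing number of eps-neighbors."""
    n = len(points)
    degree = [
        sum(1 for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: degree[i])


def subsequent_neighbor_counts(points, eps, order):
    """Number of eps-neighbors appearing later in `order`, for each vertex of `order`."""
    rank = {v: r for r, v in enumerate(order)}
    return [
        sum(
            1
            for w in order
            if rank[w] > rank[v] and math.dist(points[v], points[w]) <= eps
        )
        for v in order
    ]


# A center with 4 leaves: only center-leaf pairs are within eps = 1.1.
star = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
ascending = order_by_neighbor_count(star, 1.1)
descending = ascending[::-1]
```

With the ascending order, every leaf has the center as its only subsequent neighbor, whereas the descending order concentrates all 4 neighbors in the local graph of the center.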

Figure 6:

Average runtime over 10 runs of Algorithm 1 as a function of the number of cores used. Contributions computed for the V-R complex obtained from 10,000 points sampled from the unit 4-sphere up to a maximum radius of 0.4. Experiment run on an AMD Ryzen Threadripper PRO 5955WX CPU. Error bars are scaled up by a factor of 20 for visibility.

Figure 7:

Different ordering of the vertices can produce different simplex trees. In the first row, vertices are ordered by decreasing number of neighbors and in the second row by increasing number. The second choice produces more evenly sized trees.

Figure 8:

Effect of different orderings of vertices for the example in Fig. 6. Selecting the ascending order allows achieving a more even distribution in the number of simplices in each tree (A), thus reducing the number of very large trees that dominate the running time (B).

Cubical complexes

Cubical complexes are the most used combinatorial structure to represent digital grayscale images and extract topological information from them. There are 2 ways to construct a cubical complex from an image: the V-construction and the T-construction. The former identifies pixels (also known as voxels in the case of images of arbitrary dimension) with the vertices (the 0-dimensional cells) of the cubical complex. Voxels’ values are used to define the filtration on the vertices, and the filtration value of every other elementary cube is the maximal value of its vertices. The T-construction can be seen as the dual procedure: voxels’ values are assigned to the top-dimensional cubes, and the filtration values are propagated to lower-dimensional cells by taking the minimum over the cofaces. The relation between these 2 constructions is explored in a recent work by Bleile et al. [28]. In this article, we chose the T-construction, although the presented techniques translate easily to the V-construction.

Similar to the V-R case, we are interested, given a grayscale n-dimensional image, in obtaining a list of contributions to the Euler characteristic of its corresponding cubical complex. As before, we need to iterate over all cells σ in the complex and store each contribution as a tuple $(f(\sigma), (-1)^{\dim(\sigma)})$. This can be achieved in a streaming fashion by loading into memory a 2-voxel-high slice of the image, iterating through the cells in the bottom row while computing their contributions, and then moving the sliding window up by 1 voxel. To make sure we consider each cell’s contribution exactly once, at each iteration, we consider 1 voxel and compute the contributions to the Euler characteristic of the cells in its upper closure. Assuming that we can identify each top-dimensional cell $c_i$ with the indices $(x_1, \dots, x_n)$ of the corresponding voxel in the input n-dimensional image, we define the upper closure of $c_i$ as the set containing $c_i$ and all its faces that are shared with other top-dimensional cells $c_j$ whose indices satisfy $y_i = x_i$ or $y_i = x_i + 1$ for all i. An example of this procedure can be found in Fig. 9.
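For a 2-dimensional image, the sliding-window procedure can be sketched as follows. This is an illustrative sketch rather than the authors' Algorithm 3: the function name `cubical_contributions_2d` and the padding with infinite values (used here so that cells on the image border are also visited exactly once) are assumptions.

```python
import math


def cubical_contributions_2d(image):
    """Contributions (filtration, +1/-1) of all cells of the T-construction
    cubical complex of a 2D grayscale image given as a list of rows."""
    width = len(image[0])
    inf = math.inf
    pad_row = [inf] * (width + 2)
    # Assumption: pad with +inf so that border cells belong to some upper closure.
    padded = [pad_row] + [[inf] + list(row) + [inf] for row in image] + [pad_row]

    contributions = []
    # Sliding window: only 2 consecutive rows are needed at any time.
    for bottom, top in zip(padded, padded[1:]):
        for c in range(width + 1):
            upper_closure = [
                (bottom[c], +1),                           # the 2-cell itself
                (min(bottom[c], bottom[c + 1]), -1),       # edge shared with the right voxel
                (min(bottom[c], top[c]), -1),              # edge shared with the voxel above
                (min(bottom[c], bottom[c + 1], top[c], top[c + 1]), +1),  # shared vertex
            ]
            # Cells with infinite value touch only padding and are skipped.
            contributions.extend((f, s) for f, s in upper_closure if f < inf)
    contributions.sort()
    return contributions
```

Summing the signs of all contributions with filtration value at most t gives the Euler characteristic of the sublevel set at t; for a full rectangular image the total is 1.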

Figure 9:

A slice of a cubical complex obtained from a 2-dimensional image. The image’s pixels are associated with the top-dimensional cells, depicted in yellow. Algorithm 3 takes as input a 2-voxel-thick slice of the image and iterates through the voxels in the bottom row. At each iteration, a voxel is selected and the contributions of the cells in its upper closure are computed. In this example, the voxel at coordinates (1, 1) is selected, and the considered contributions are depicted in red: the one coming from the corresponding 2-cell, the two from the 1-cells shared with (2,1) and (1,2), and the contribution from the 0-cell shared with (2,1), (1,2), and (2,2).

As already mentioned, a similar streaming algorithm to compute the ECC of grayscale images has been presented by Heiss and Wagner [15], who also provide a fast open-source C++ implementation, CHUNKYEuler. Recently, Wang et al. [16] provided a GPU implementation of the same algorithm. However, there is a significant difference between their approach and the one we describe in Algorithm 3: they keep track of the faces introduced by each voxel by looking at the gray values of the voxel’s $3^d - 1$ neighbors and store the cumulative change in the EC at the voxel’s filtration value. This requires sorting the voxels’ values from lowest to highest. While such a sorting can be done for 1-parameter filtrations, it cannot be performed for multifiltrations, as no good ordering exists in the general case. There are some small differences in the implementation too: CHUNKYEuler only works with integer filtration values and only accepts “raw” binary files as input. Our implementation, while not as fast as CHUNKYEuler, offers the user more flexibility in the input and in the choice of filtration (or multifiltration) values.

Algorithm 3:

COMPUTE LOCAL CONTRIBUTIONS CUBICAL

graphic file with name giad094alg3.jpg

Time and memory complexity

Considering a d-dimensional image with n voxels as input, the resulting cubical complex will have at most $3^d n$ cells. The running time of Algorithm 3 is then linear in the number of voxels, with a multiplicative constant that is exponential in the dimension. This is not a problem in practice, as images with dimension larger than 3 are not common in applications. The memory requirement is just the space needed to store a 2-row slice of the input image, and the memory overhead for computing the local contributions for each voxel is negligible.

From Euler characteristic curves to profiles

Both Algorithm 1 and Algorithm 3 can be immediately extended to compute the Euler characteristic profile of multifiltered Vietoris–Rips or cubical complexes. In the Vietoris–Rips case, we require that all filtration functions be defined on the vertices or the edges and then be extended to higher-dimensional simplices by some user-defined rule. This is to ensure that the resulting multifiltered V-R complex is still a flag complex. In the case of cubical complexes, we assume that the input images contain an n-tuple of numbers in each voxel (RGB images are a typical n = 3 example) and that values are propagated to lower-dimensional cells by some user-defined rules. In both cases, the output of the algorithm will be a list of (n + 1)-tuples $(f_1(\sigma), \dots, f_n(\sigma), (-1)^{\dim(\sigma)})$ that stores the contributions to the ECP at different points of $\mathbb{R}^n$.

Remark 7.

In the above, the simplest case of a so-called 1-critical multifiltration is discussed. In this case, each cell σ appears at a unique value of the multifiltration. In the general case, a cell σ may appear at multiple noncomparable values $p_1, \dots, p_k$ of the multifiltration. A simple generalization described below allows us to adapt the presented algorithm to the general case. Let us assume that each $p_i$ is an n-dimensional tuple, $p_i = (p_i^1, \dots, p_i^n)$. We assume that $p_i$ and $p_j$ are not comparable provided $i \neq j$. It means that there exists a pair of coordinates $l \neq m$ such that $p_i^l < p_j^l$ and $p_i^m > p_j^m$. Then, the cell σ contributes the value $(-1)^{\dim(\sigma)}$ for all the points $x \in \mathbb{R}^n$ for which there exists i such that $x > p_i$. Note that the regions consisting of points greater than $p_i$ overlap for different i ∈ {1, …, k}; hence, we need to avoid double and multiple counting of the contributions. Below we describe a procedure to achieve this and enforce a contribution of exactly $(-1)^{\dim(\sigma)}$ for all $x > p_i$ for arbitrary i ∈ {1, …, k}. For that purpose, given $i \neq j$, we define the coordinate-wise maximum $p_i \vee p_j = (\max(p_i^1, p_j^1), \dots, \max(p_i^n, p_j^n))$. Algorithm 4 defines a set of points with appropriate contributions to enforce the required condition for all $x \geq p_i$ for all i ∈ {1, …, k}.

It is straightforward to see that for any given cell σ, its contributions to the ECP will change at most at the coordinate-wise maxima $p_i \vee p_j$ for i, j ∈ {1, …, k}, where $\{p_1, \dots, p_k\}$ are the incomparable points at which σ appears in the multifiltration. Algorithm 4 scans all those points and assigns the appropriate value (see lines 1 and 9) to the contributions to the ECP. Note that all points $p_1, \dots, p_k$ have their contributions initially set in line 1. Consequently, the presented algorithm will terminate, as in each iteration, at least 1 p will be added to the Contribution list. In addition, it explicitly enforces the correct contribution of the cell σ to all points $x \geq p_i$ for any i ∈ {1, …, k}.

Algorithm 4:

CONTRIBUTION OF σ TO ECP

graphic file with name giad094alg4.jpg
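The requirement that a cell contributes exactly once wherever it is present can also be met with a plain inclusion–exclusion over subsets of the critical points. The sketch below uses this swapped-in technique and is not a transcription of Algorithm 4; the function names are hypothetical, and the loop over all subsets is exponential in k, so it is only meant for small k.

```python
from collections import defaultdict
from itertools import combinations


def multicritical_contributions(critical_points, dim):
    """Map each point to the contribution of a cell of dimension `dim`
    appearing at the given pairwise incomparable critical points."""
    sign = (-1) ** dim
    contributions = defaultdict(int)
    for size in range(1, len(critical_points) + 1):
        for subset in combinations(critical_points, size):
            # Coordinate-wise maximum (join) of the points in the subset.
            join = tuple(max(coords) for coords in zip(*subset))
            # Inclusion-exclusion: odd subsets add, even subsets subtract.
            contributions[join] += sign * (-1) ** (size + 1)
    return {p: c for p, c in contributions.items() if c != 0}


def contribution_at(contributions, x):
    """Total contribution seen at x: sum over all points p <= x coordinate-wise."""
    return sum(
        c for p, c in contributions.items() if all(a <= b for a, b in zip(p, x))
    )
```

For a vertex appearing at (0, 1) and (1, 0), this yields +1 at both points and −1 at their join (1, 1), so the vertex is counted once at every point where it is present.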

Data Structures for ECPs

All the algorithms we described in the previous section output a list of contributions to the Euler characteristic profile. For an n-dimensional profile, each contribution in the list is a pair whose first entry is an n-tuple storing the coordinates in $\mathbb{R}^n$ at which the Euler characteristic varies and whose second entry is the integer value of that variation. When dealing with 1-dimensional ECCs, it makes sense to sort the contributions according to their filtration value in order to perform faster operations on them.

Retrieving the EC at some filtration values

Given an ECP as a list of contributions, the first basic operation is to retrieve the value of the Euler characteristic at an arbitrary filtration value $f^*$. It can be obtained by summing up all the contributions in the ECP that appear at filtration values less than or equal to $f^*$. For a d-dimensional ECP, this can be achieved in linear time with respect to the size of the contribution list. In the 1-dimensional case, we can take advantage of the total ordering on the list of contributions, since the filtration values $f_i \in \mathbb{R}$. By doing so, we can build an auxiliary data structure storing the value of the Euler characteristic at each $f_i$, the points at which the ECC changes value. This can be done in O(n) time and space, where n is the length of the list of contributions. Given such a structure, computing the value of the ECC at a given filtration value $f^*$ boils down to searching for the largest jump point $f_i \leq f^*$ and retrieving the value of the ECC there. This can be achieved by interpolation search in O(log(log(n))) expected time when the jump points are roughly uniformly distributed; binary search gives O(log(n)) in the worst case.
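A minimal sketch of the auxiliary structure for the 1-dimensional case is given below. It uses binary search from the Python standard library in place of interpolation search, and the names `build_ecc` and `ec_at` are hypothetical.

```python
from bisect import bisect_right


def build_ecc(contributions):
    """Turn (filtration, change) pairs into sorted jump points and the
    value of the EC right at each jump point."""
    jumps, values = [], []
    running = 0
    for f, c in sorted(contributions):
        running += c
        if jumps and jumps[-1] == f:
            values[-1] = running  # several cells share this filtration value
        else:
            jumps.append(f)
            values.append(running)
    return jumps, values


def ec_at(jumps, values, f_star):
    """Value of the EC at f_star: look up the largest jump point <= f_star."""
    i = bisect_right(jumps, f_star)
    return values[i - 1] if i else 0


# 3 vertices at filtration 0 and 2 edges at filtration 1.
jumps, values = build_ecc([(0.0, 1), (0.0, 1), (0.0, 1), (1.0, -1), (1.0, -1)])
```

Building the structure takes a single pass over the sorted list; each query is then a logarithmic-time lookup.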

Computing distances

Distances between Euler characteristic curves

In the Euler characteristic curves section, we introduced the notion of difference between 2 ECCs, expressed in terms of the L1 norm of the difference between the 2 curves. One should note that, in the case of finite Vietoris–Rips or cubical complexes, such a difference is always finite (but not bounded), as all ECCs eventually stabilize to 1 for a sufficiently large filtration value. When the construction of a Vietoris–Rips complex is stopped at a certain diameter 2ϵ and the final complexes have more than 1 infinite homology class, it makes sense to restrict the integral used in distance computations to the interval [0, 2ϵ] in order to keep the distances between the ECCs finite.

Both Algorithms 1 and 3 return the computed ECC as a list of pairs $(f_i, c_i)$, where $c_i$ is an integer representing the change in the Euler characteristic at filtration value $f_i$. Such a list is sorted in increasing order with respect to the filtration values. Using such a data structure, the difference between 2 ECCs can be computed in time linear in the size of the lists. Given 2 lists of contributions $ECC_1$ and $ECC_2$, we can merge them in linear time, preserving the order. While merging, we flip the sign of all the contributions coming from $ECC_2$. Let us denote the obtained list by $ECC_{1-2}$. Now the difference can be computed by iterating over the full list

$$\|ECC_1 - ECC_2\|_1 = \sum_{i=1}^{N-1} \Big| \sum_{j=1}^{i} c_j \Big| \, (f_{i+1} - f_i),$$

where $(f_i, c_i)$, $i = 1, \dots, N$, are the pairs of $ECC_{1-2}$, indexed with respect to the ordering of $ECC_{1-2}$.
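The merge-and-accumulate procedure can be sketched in a few lines of Python. The function name `ecc_l1_distance` and the explicit truncation value `f_max` are assumptions made for illustration; both input lists are assumed to be sorted by filtration value.

```python
from heapq import merge


def ecc_l1_distance(ecc1, ecc2, f_max):
    """L1 distance on [0, f_max] between 2 ECCs given as sorted lists of
    (filtration, change) pairs."""
    # Merge in linear time, flipping the sign of the second list.
    flipped = ((f, -c) for f, c in ecc2)
    merged = merge(ecc1, flipped, key=lambda pair: pair[0])

    total = 0.0
    running = 0   # current value of ECC1 - ECC2
    previous = None
    for f, c in merged:
        if f > f_max:
            break
        if previous is not None:
            total += abs(running) * (f - previous)
        running += c
        previous = f
    if previous is not None:
        total += abs(running) * (f_max - previous)  # last piece up to f_max
    return total


curve_a = [(0.0, 1), (1.0, 1)]   # EC is 1 on [0, 1) and 2 afterward
curve_b = [(0.0, 1), (2.0, 1)]   # EC is 1 on [0, 2) and 2 afterward
```

The 2 example curves differ by 1 exactly on the interval [1, 2), so their distance is 1 when the truncation value is at least 2.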

Distances between Euler characteristic profiles

Unfortunately, the strategy proposed in the previous section is difficult to generalize to the multifiltration setting, as there is no natural way to sort the list of contributions. We present here a basic algorithm to compute the distance between 2 ECPs and leave the search for a potentially faster algorithm to future work.

Let $ECP_1$ and $ECP_2$ be 2 lists of contributions representing 2 n-dimensional profiles. We can merge them in linear time, as in the 1-dimensional case, flipping the sign of the contributions in the second list. Let N be the total number of contributions. With reference to Fig. 10, the coordinates of such contributions create an n-dimensional irregular grid with at most N + 1 cells along each axis. The value of the EC inside each cuboid is equal to the EC at the cuboid’s bottom left corner and can be computed in O(N). The L1 distance between the 2 ECPs can then be obtained by summing up the absolute values of the EC in each cuboid weighted by the cuboid’s volume. Given that the number of cuboids is at most $(N + 1)^n$, this operation can be computed in $O(N^{n+1})$. Note that the ECPs need to be truncated in order to avoid cuboids with infinite volume.
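A direct, unoptimized sketch of this grid-based computation follows. The function name `ecp_l1_distance` and the truncation corner `upper` are assumptions; contributions are given as pairs (point, change), where the point is a tuple of n coordinates.

```python
from itertools import product


def ecp_l1_distance(ecp1, ecp2, upper):
    """L1 distance between 2 ECPs, truncated at the corner `upper`."""
    # Merge the lists, flipping the sign of the second profile.
    merged = list(ecp1) + [(p, -c) for p, c in ecp2]
    n = len(upper)

    # Grid lines on each axis: coordinates below the truncation, plus the bound.
    axes = []
    for k in range(n):
        ticks = sorted({p[k] for p, _ in merged if p[k] < upper[k]})
        axes.append(ticks + [upper[k]])

    total = 0.0
    for index in product(*(range(len(axis) - 1) for axis in axes)):
        corner = tuple(axes[k][i] for k, i in enumerate(index))
        volume = 1.0
        for k, i in enumerate(index):
            volume *= axes[k][i + 1] - axes[k][i]
        # Value of the difference inside the cuboid: evaluate at its lower corner.
        value = sum(
            c for p, c in merged if all(a <= b for a, b in zip(p, corner))
        )
        total += abs(value) * volume
    return total


profile_a = [((0.0, 0.0), 1)]
profile_b = [((1.0, 1.0), 1)]
```

In the example, the 2 profiles differ by 1 on the square [0, 2] × [0, 2] minus the square [1, 2] × [1, 2], which has area 3.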

Figure 10:

Example of a 2-dimensional ECP with 3 contributions. The green points indicate a +1 while the red point is a −1. The plane can then be subdivided into a 4 × 4 irregular grid. The coloring of each block indicates the value of the EC in that region; white is 0, light gray is 1, and dark gray is 2.

Vectorization

Vectorizing the ECC/ECP is a critical step if we are interested in using these invariants in a machine learning framework.

Curves

Assume we are given an ECC whose filtration values range from 0 to $f_{max}$. We can convert it to a vector by evenly sampling it N times between 0 and $f_{max}$. If we choose to include the endpoints, the resulting vector will be $\mathrm{vec}(ECC, N) = [EC(0), EC(\Delta), EC(2\Delta), \dots, EC((N-2)\Delta), EC(f_{max})]$, where Δ is the vectorization’s resolution, defined as $\Delta = f_{max}/(N-1)$.

A step function can be recovered from such a vector as the union of N − 1 left-closed, right-open intervals of length Δ, obtained by taking the value of the EC sampled at filtration value $f_i$ and extending it until $f_{i+1}$. It then makes sense to ask whether it is possible to bound the difference between an ECC and its vectorized representation. Fig. 11 is an example of such a difference when a curve is sampled at 5 points.
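The sampling can be sketched as follows; the function name `vectorize_ecc` is hypothetical, and the ECC is assumed to be given as a list of (filtration, change) pairs.

```python
from bisect import bisect_right


def vectorize_ecc(contributions, f_max, n_samples):
    """Sample an ECC at n_samples evenly spaced values in [0, f_max]."""
    contributions = sorted(contributions)
    jumps = [f for f, _ in contributions]
    cumulative = []
    running = 0
    for _, c in contributions:
        running += c
        cumulative.append(running)

    delta = f_max / (n_samples - 1)  # the resolution of the vectorization
    vector = []
    for k in range(n_samples):
        # Use f_max itself for the last sample to avoid rounding issues.
        f = f_max if k == n_samples - 1 else k * delta
        i = bisect_right(jumps, f)
        vector.append(cumulative[i - 1] if i else 0)
    return vector


# 2 vertices at filtration 0 joined by an edge at filtration 0.5.
example_vector = vectorize_ecc([(0.0, 1), (0.0, 1), (0.5, -1)], 1.0, 5)
```

Here the resolution is Δ = 0.25, and the sampled values are 2 before the edge appears and 1 from filtration value 0.5 onward.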

Figure 11:

An Euler characteristic curve (black) and its vectorized version (green) with resolution Δ. In this case, the vectorized version is stored as a vector of length 5 (the green filled-in points) but can be converted back to a step function.

Proposition 5.

Let K be a filtered cell complex whose filtration values range from 0 to $f_{max}$. The L1 distance between the Euler characteristic curve of K and its vectorized version at resolution Δ is bounded by

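$$\|ECC(K) - \mathrm{vec}(ECC(K), N)\|_1 \leq \Delta \frac{|K|}{2} + \Delta \sum_{i=1}^{N-1} |EC(f_i) - EC(f_{i+1})|, \quad (3)$$

where |K| is the number of simplices in the complex and $\sum_{i=1}^{N-1} |EC(f_i) - EC(f_{i+1})|$ is the sum of the absolute values of the differences between consecutive values in the vectorized Euler characteristic vec(ECC(K), N).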

Proof.

We will prove the 2 terms in the bound separately as they come from 2 different types of errors.

Type I errors occur when the EC at 2 consecutive sampling points $f_i$ and $f_{i+1}$ is different. The simplest case is depicted in Fig. 12A; the EC changes value inside the sampling interval. We can upper bound this error with the area of the rectangle having as base the vectorization’s resolution $\Delta = f_{i+1} - f_i$ and as height the difference between the EC at the 2 sampling points, $|EC(f_i) - EC(f_{i+1})|$. Note that this bound also holds in the more general case where the EC varies monotonically at multiple values inside the sampling interval. By summing up all the contributions, we obtain the value $\Delta \sum_{i} |EC(f_i) - EC(f_{i+1})|$.

Type II errors (see Fig. 12B) occur when the EC has the same value at consecutive filtration steps but varies in between. The maximum possible variation can be upper bounded by the area of the rectangle with Δ as base and half the number of cells in the complex as height. Each cell contributes to the EC by ±1, and the factor one-half is due to the constraint that the EC has the same value at $f_i$ and $f_{i+1}$. This amounts to the value Δ · |K|/2.

By summing up the 2 contributions, we obtain the bound in (3). Note that a generic situation can always be described as a combination of type I and type II errors.
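The 2 terms of the bound can be compared numerically with the exact vectorization error. The sketch below is only a sanity check under stated assumptions: the function name is hypothetical, the ECC is given as a list of (filtration, change) pairs with 1 pair per cell (so that |K| is the length of the list), and the vectorized curve is interpreted as the step function described above.

```python
from bisect import bisect_right


def vectorization_error_and_bound(contributions, f_max, n_samples):
    """Exact L1 error of the vectorized ECC on [0, f_max] and the sum of the
    type I and type II terms from the proof."""
    contributions = sorted(contributions)
    jumps = [f for f, _ in contributions]
    cumulative, running = [], 0
    for _, c in contributions:
        running += c
        cumulative.append(running)

    def ec(f):
        i = bisect_right(jumps, f)
        return cumulative[i - 1] if i else 0

    delta = f_max / (n_samples - 1)
    samples = [k * delta for k in range(n_samples - 1)] + [f_max]
    vector = [ec(s) for s in samples]

    error = 0.0
    for k in range(n_samples - 1):
        left, right = samples[k], samples[k + 1]
        inner = sorted({f for f in jumps if left < f < right})
        breaks = [left] + inner + [right]
        for a, b in zip(breaks, breaks[1:]):
            # On [a, b) the ECC is constant; the step function equals vector[k].
            error += abs(ec(a) - vector[k]) * (b - a)

    type_one = delta * sum(abs(x - y) for x, y in zip(vector, vector[1:]))
    type_two = delta * len(contributions) / 2
    return error, type_one + type_two


error, bound = vectorization_error_and_bound(
    [(0.0, 1), (0.1, 1), (0.2, -1)], 1.0, 3
)
```

In this example the curve rises to 2 on [0.1, 0.2) between 2 sampling points with equal EC, a pure type II situation, so the exact error is 0.1 while the bound evaluates to 0.75.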

Figure 12:

The 2 possible sources of error during the vectorization of an ECC.

We have shown a way to bound the distance between an ECC and its vectorized version. Another possible stability question is whether this vectorization preserves distances between ECCs. In other words, we are interested in knowing whether something can be said about $\|\mathrm{vec}(ECC_1, N) - \mathrm{vec}(ECC_2, N)\|$ given $\|ECC_1 - ECC_2\|$. Unfortunately, it is possible to construct examples in which 2 curves can be made arbitrarily far apart but have the same vectorization, or in which 2 curves can be made arbitrarily close but have drastically different vectorizations. Fig. 13 shows 2 such examples. Moreover, in the existing literature, Johnson and Jung [29] prove that the distance between 2 vectorized Betti curves cannot be bounded by the Wasserstein distance between the respective persistence diagrams. They propose a stable vectorization inspired by Gaussian smoothing techniques.

Figure 13:

Two ECCs superimposed in the same plot. (A) The 2 curves can be made arbitrarily far apart in L1 but have the same vectorization. (B) The 2 curves can be made arbitrarily close but have drastically different vectorizations.

Profiles

An n-dimensional Euler characteristic profile whose filtration values range from 0 to $f^{i}_{max}$ for $i \in \{1, \dots, n\}$ can be vectorized in a similar fashion by sampling it on a grid of size $N_1 \times N_2 \times \dots \times N_n$. In general, the $N_i$ can be different and thus lead to different resolutions $\Delta_i$ on the various filtration parameters. The output of this sampling procedure is an n-dimensional tensor $\mathrm{vec}(ECP, N_i)$ that can eventually be flattened to a 1-dimensional vector. Although this is an intuitive generalization of the 1-dimensional ECC case, the procedure has an increased computational cost due to the difficulties in sampling EC values from a profile, as already discussed in the section "Retrieving the EC at some filtration values". Moreover, the stability result in Proposition 5 cannot be generalized to the multiparameter setting. As depicted in Fig. 14, the grid vectorization may fail to detect the contributions coming from pairs of cells. In the multiparameter case, however, it is not possible to bound this contribution using only the vectorization resolutions Δ, as such a contribution can persist on subsequent grid elements up to infinity.
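The grid sampling can be sketched with a brute-force evaluation at every grid point, which reflects the linear cost of each query in the multiparameter case. The function name `vectorize_ecp` is hypothetical, and the result is returned already flattened in row-major order.

```python
from itertools import product


def vectorize_ecp(contributions, f_max, shape):
    """Sample an n-dimensional ECP on an evenly spaced grid.

    contributions: list of (point, change) pairs, point being an n-tuple
    f_max: upper filtration bound on each axis
    shape: number of samples N_i on each axis (each at least 2)
    """
    axes = [
        [f_max[k] * i / (shape[k] - 1) for i in range(shape[k])]
        for k in range(len(shape))
    ]
    vector = []
    for index in product(*(range(s) for s in shape)):
        x = tuple(axes[k][i] for k, i in enumerate(index))
        # EC at x: sum of all contributions at points p <= x coordinate-wise.
        value = sum(
            c for p, c in contributions if all(a <= b for a, b in zip(p, x))
        )
        vector.append(value)
    return vector


grid_vector = vectorize_ecp(
    [((0.0, 0.0), 1), ((1.0, 1.0), -1)], (2.0, 2.0), (3, 3)
)
```

Each of the $N_1 \cdots N_n$ samples scans the whole contribution list, which is the increased cost mentioned above.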

Figure 14:

A 2-dimensional analogue of a type II error of Fig. 12. The ECP is vectorized by sampling the EC values on the green grid. We can add pairs of cells with contributions ±1 inside the rectangle ABCD in such a way that the value of the EC on the vertices does not change. However, such contributions have a nonzero sum on an area that can be made arbitrarily large.

Examples and Experiments

RGB images

A toy experiment using 3-dimensional Euler characteristic profiles can be constructed using RGB images. In an RGB image, each pixel contains a tuple of 3 integers, each ranging from 0 to 255. They stand for the red, green, and blue color channels, and each color is encoded by such a 3-tuple. In particular, black is coded by (0, 0, 0) and white by (255, 255, 255).

In this example, we consider 2 different textures, stripes and checks; each of them can be red, green, or blue. We generate 10 samples of each combination of style and color by adding random Gaussian noise to each pixel. We then compute the 3-dimensional Euler characteristic profile of the cubical complex obtained from each image and compute the matrix of pairwise L1 distances between them. Such a matrix is shown in Fig. 15. It confirms that the distances between Euler characteristic profiles of different images increase following the intuitive sequence same style, same color < same style, different color < different style, same color < different style, different color.

Figure 15:

A 60 × 60 distance matrix between Euler characteristic profiles of different RGB images.

Immune cell spatial patterns in tumors

Vipond et al. [30] applied multiparameter persistent homology (MPH) landscapes to study immune cell location in digital histology images from head and neck cancer. They extracted the locations of 3 immune cell types from histology slides, thus obtaining a list of pointclouds labeled CD8+, FoxP3+, or CD68+. The goal is to correctly classify a pointcloud. All pointcloud data are available at [31]. The authors created a bifiltered Vietoris–Rips complex from each pointcloud, using the radius and a codensity function defined over each vertex p as Inline graphic, where $p_i$ is the ith nearest neighbor of p. They then computed MPH landscapes and used them as input for 1 of 3 classifiers: linear discriminant analysis (LDA), regularized linear discriminant analysis (rLDA), and regularized quadratic discriminant analysis (rQDA) [32]. They made a randomized 80/20 training/test split and evaluated the classification accuracy of the 3 classifiers on the test data for each pair of cell types and for the 3-class problem. The classification results are reported in the supplementary material of [30].

We used the authors’ code to regenerate the same standard Vietoris–Rips and bifiltered Vietoris–Rips complexes from the provided pointclouds. We then computed the ECC (radius only) and the ECP (radius and codensity) for each complex and used them as input for the same LDA, rLDA, and rQDA classifiers using the same train–test split procedure. The average accuracy for the various classification tasks is reported in Tables 1, 2, and 3. Both ECC and ECP outperform MPH landscapes on most tumors and tasks, while there is no consistent gain in moving from ECC to ECP. This can be an indication that the second dimension in the filtration (the codensity parameter) does not contain significant information.

Table 1:

Average classification accuracy for the LDA classifier using as input MPL (Multiparameter Persistence Landscape), ECC, or ECP. Data for each tumor are split into 80/20 train–test splits and classification accuracy is reported as the mean over 100 repetitions of splitting, training, and testing. Each entry lists the accuracies as MPL / ECC / ECP.

Tumor | CD68+ vs. FoxP3+ | CD8+ vs. FoxP3+ | CD8+ vs. CD68+ | CD8+ vs. CD68+ vs. FoxP3+
T_A | 0.584 / 0.938 / 0.941 | 0.672 / 0.994 / 0.988 | 0.669 / 0.894 / 0.856 | 0.486 / 0.896 / 0.886
T_B | 0.794 / 0.917 / 0.922 | 0.880 / 0.992 / 0.992 | 0.540 / 0.943 / 0.962 | 0.568 / 0.921 / 0.940
T_C | 0.723 / 0.947 / 0.904 | 0.700 / 0.884 / 0.859 | 0.605 / 0.811 / 0.699 | 0.505 / 0.842 / 0.755
T_D | 0.811 / 0.960 / 0.933 | 0.899 / 0.986 / 0.985 | 0.644 / 0.802 / 0.807 | 0.613 / 0.862 / 0.874
T_E | 0.732 / 0.941 / 0.940 | 0.644 / 0.867 / 0.869 | 0.593 / 0.806 / 0.688 | 0.511 / 0.842 / 0.719
T_F | 0.738 / 0.655 / 0.933 | 0.644 / 0.619 / 0.830 | 0.730 / 0.709 / 0.850 | 0.511 / 0.578 / 0.824
T_G | 0.771 / 0.788 / 0.858 | 0.782 / 0.791 / 0.904 | 0.675 / 0.614 / 0.609 | 0.599 / 0.673 / 0.659
T_H | 0.710 / 0.651 / 0.885 | 0.682 / 0.747 / 0.955 | 0.628 / 0.695 / 0.891 | 0.555 / 0.659 / 0.845
T_I | 0.733 / 0.788 / 0.737 | 0.758 / 0.716 / 0.679 | 0.540 / 0.693 / 0.713 | 0.548 / 0.716 / 0.493
T_J | 0.727 / 0.642 / 0.767 | 0.535 / 0.678 / 0.857 | 0.602 / 0.808 / 0.868 | 0.449 / 0.507 / 0.699
T_K | 0.510 / 0.872 / 0.770 | 0.570 / 0.784 / 0.816 | 0.502 / 0.823 / 0.877 | 0.404 / 0.594 / 0.635
T_N | 0.493 / 0.457 / 0.570 | 0.512 / 0.658 / 0.632 | 0.577 / 0.507 / 0.760 | 0.342 / 0.462 / 0.370
T_O | 0.948 / 0.830 / 0.840 | 0.788 / 0.602 / 0.754 | 0.532 / 0.484 / 0.598 | 0.550 / 0.431 / 0.615

Table 2:

Average classification accuracy for the rLDA classifier using as input MPL, ECC, or ECP. Data for each tumor are split into 80/20 train–test splits and classification accuracy is reported as the mean over 100 repetitions of splitting, training, and testing. Each entry lists the accuracies as MPL / ECC / ECP.

Tumor | CD68+ vs. FoxP3+ | CD8+ vs. FoxP3+ | CD8+ vs. CD68+ | CD8+ vs. CD68+ vs. FoxP3+
T_A | 0.491 / 0.967 / 0.964 | 0.642 / 0.973 / 0.967 | 0.630 / 0.840 / 0.830 | 0.427 / 0.858 / 0.859
T_B | 0.760 / 0.892 / 0.869 | 0.787 / 0.986 / 0.985 | 0.671 / 0.942 / 0.945 | 0.604 / 0.868 / 0.865
T_C | 0.863 / 0.906 / 0.896 | 0.747 / 0.847 / 0.842 | 0.653 / 0.584 / 0.614 | 0.640 / 0.628 / 0.627
T_D | 0.683 / 0.926 / 0.918 | 0.829 / 0.990 / 0.988 | 0.476 / 0.779 / 0.779 | 0.492 / 0.779 / 0.775
T_E | 0.820 / 0.886 / 0.883 | 0.736 / 0.929 / 0.920 | 0.534 / 0.735 / 0.743 | 0.502 / 0.702 / 0.683
T_F | 0.623 / 0.899 / 0.925 | 0.476 / 0.842 / 0.847 | 0.765 / 0.909 / 0.921 | 0.408 / 0.845 / 0.847
T_G | 0.886 / 0.932 / 0.927 | 0.897 / 0.970 / 0.975 | 0.446 / 0.696 / 0.692 | 0.581 / 0.738 / 0.746
T_H | 0.524 / 0.890 / 0.898 | 0.735 / 0.930 / 0.929 | 0.714 / 0.882 / 0.877 | 0.502 / 0.844 / 0.859
T_I | 0.859 / 0.912 / 0.931 | 0.883 / 0.908 / 0.909 | 0.484 / 0.470 / 0.474 | 0.597 / 0.619 / 0.614
T_J | 0.608 / 0.763 / 0.750 | 0.750 / 0.835 / 0.872 | 0.850 / 0.882 / 0.892 | 0.536 / 0.653 / 0.670
T_K | 0.376 / 0.868 / 0.804 | 0.523 / 0.918 / 0.914 | 0.455 / 0.857 / 0.845 | 0.261 / 0.718 / 0.679
T_N | 0.410 / 0.527 / 0.563 | 0.432 / 0.662 / 0.745 | 0.643 / 0.690 / 0.713 | 0.294 / 0.388 / 0.460
T_O | 0.702 / 0.954 / 0.952 | 0.644 / 0.806 / 0.772 | 0.546 / 0.672 / 0.684 | 0.429 / 0.639 / 0.632

Table 3:

Average classification accuracy for the rQDA classifier using as input MPL, ECC, or ECP. Data for each tumor are split into 80/20 train–test splits and classification accuracy is reported as the mean over 100 repetitions of splitting, training, and testing. Each entry lists the accuracies as MPL / ECC / ECP.

Tumor | CD68+ vs. FoxP3+ | CD8+ vs. FoxP3+ | CD8+ vs. CD68+ | CD8+ vs. CD68+ vs. FoxP3+
T_A | 0.503 / 0.945 / 0.931 | 0.598 / 0.840 / 0.838 | 0.598 / 0.840 / 0.838 | 0.380 / 0.865 / 0.861
T_B | 0.738 / 0.896 / 0.867 | 0.588 / 0.913 / 0.911 | 0.588 / 0.913 / 0.911 | 0.531 / 0.871 / 0.869
T_C | 0.855 / 0.915 / 0.906 | 0.673 / 0.552 / 0.568 | 0.673 / 0.552 / 0.568 | 0.614 / 0.627 / 0.640
T_D | 0.554 / 0.934 / 0.929 | 0.494 / 0.767 / 0.755 | 0.494 / 0.767 / 0.755 | 0.482 / 0.786 / 0.787
T_E | 0.826 / 0.876 / 0.871 | 0.548 / 0.751 / 0.754 | 0.548 / 0.751 / 0.754 | 0.499 / 0.754 / 0.724
T_F | 0.646 / 0.964 / 0.963 | 0.666 / 0.853 / 0.855 | 0.666 / 0.853 / 0.855 | 0.412 / 0.881 / 0.878
T_G | 0.882 / 0.937 / 0.928 | 0.485 / 0.723 / 0.699 | 0.485 / 0.723 / 0.699 | 0.583 / 0.771 / 0.767
T_H | 0.621 / 0.968 / 0.967 | 0.699 / 0.886 / 0.898 | 0.699 / 0.886 / 0.898 | 0.550 / 0.889 / 0.901
T_I | 0.919 / 0.928 / 0.940 | 0.493 / 0.531 / 0.527 | 0.493 / 0.531 / 0.527 | 0.626 / 0.621 / 0.624
T_J | 0.588 / 0.908 / 0.903 | 0.860 / 0.898 / 0.902 | 0.860 / 0.898 / 0.902 | 0.541 / 0.720 / 0.719
T_K | 0.468 / 0.874 / 0.838 | 0.567 / 0.923 / 0.903 | 0.567 / 0.923 / 0.903 | 0.352 / 0.751 / 0.736
T_N | 0.353 / 0.453 / 0.477 | 0.510 / 0.617 / 0.600 | 0.510 / 0.617 / 0.600 | 0.334 / 0.392 / 0.384
T_O | 0.724 / 0.972 / 0.984 | 0.524 / 0.668 / 0.662 | 0.524 / 0.668 / 0.662 | 0.440 / 0.730 / 0.738

Prostate cancer histology slides

Lawson et al. [33] demonstrated that persistent homology can successfully be used to evaluate features in prostate cancer hematoxylin and eosin (H&E)–stained slides. Their dataset, available in the Open Science Framework [34], contains 5,182 RGB images with a resolution of 512 × 512 corresponding to different regions of interest (ROIs) in prostate cancer H&E slices obtained from 39 patients. Each image is labeled with a Gleason score of 3, 4, or 5 indicating the architectural patterns of the cancer. A higher Gleason score indicates an increasing level of cancer aggressiveness. The dataset contains 2,567 grade 3 ROIs, 2,351 grade 4 ROIs, but only 264 grade 5 ROIs. Given the imbalance in the data, we decided to consider a classification problem between grades 3 and 4.

Following the procedure described by the authors, we normalized and extracted the H&E color channels from each ROI. By doing so, we converted each RGB image into a bidimensional (H&E) one (Fig. 16). We first computed the ECC for each of the grayscale images corresponding to the hematoxylin channel, as it is the color that highlights cell nuclei. We then also used the eosin color channel to obtain a 2-dimensional ECP (Fig. 17). We input either the ECCs or the ECPs into a support vector machine [35] classifier and computed the mean test accuracy over 100 rounds with an 80/20 training/test split. The results are displayed in Table 4. The classifier using the 2-dimensional ECPs as input consistently performs better than the one using the 1-dimensional ECCs.

Figure 16:

(A) A raw RGB ROI. (B) Hematoxylin channel. (C) Eosin channel.

Figure 17:

(A) Hematoxylin ECC for the ROI in Fig. 16. (B) The eosin ECC. (C) The combined ECP.

Table 4:

Mean test accuracy for the Gleason 3 vs. Gleason 4 classification using ECCs or ECPs as input to a support vector machine classifier

Hematoxylin ECC | H&E ECP
0.765 ± 0.001 | 0.826 ± 0.001

Conclusions

Euler characteristic curves and profiles provide a stable summary of the shape of data. Unlike other summaries used in topological data analysis, this one can be computed in a distributed fashion and hence is applicable to big data problems. In addition, we show, contrary to a common misconception, that the Euler characteristic curves and profiles enjoy a certain type of stability. We confirm it when using them to discriminate various toy datasets with varying levels of noise. We also show how to compare and vectorize the Euler characteristic curves and profiles and apply them to a number of real data analysis problems. The presented results are accompanied by an efficient Python implementation. For example, on modern commodity hardware, our implementation for V-R complexes can handle a number of simplices on the order of $10^{10}$. This is 2 orders of magnitude more than what can be achieved using available software like GUDHI [9]. With this work, we hope that the machinery of Euler characteristic curves and profiles will be useful for practitioners in topological data analysis.

Supplementary Material

giad094_GIGA-D-23-00240_Original_Submission
giad094_GIGA-D-23-00240_Revision_1
giad094_GIGA-D-23-00240_Revision_2
giad094_Response_to_Reviewer_Comments_Original_Submission
giad094_Response_to_Reviewer_Comments_Revision_1
giad094_Reviewer_1_Report_Original_Submission

Fan Wang, Ph.D -- 9/9/2023

giad094_Reviewer_2_Report_Original_Submission

Liam Naughton -- 9/12/2023

Acknowledgement

D.G. thanks Niklas Hellmer for the multiple valuable discussions and insights and Jan Felix Senge for helpful discussion on the algorithms’ time performances.

Appendix A.

Time performance analysis

We assess the time performance of Algorithm 1 by analyzing the worst-case scenario, a complete graph built from a pointcloud $\{x_i\}_{i \in [1, n]}$. This is the worst-case scenario as it contains the maximal number of cliques (hence simplices) for a given number of vertices, namely, $2^n - 1$. As discussed in the previous section, the running time will be dominated by the first vertex $x_1$, as it has the highest number of subsequent neighbors. The most time-consuming operations are the ones that happen inside Algorithm 2, namely, the update filtration and update common neighbors subroutines.

Update filtration

The extension of a d-clique requires checking whether 1 or more of the newly introduced edges have a filtration value higher than that of the current d-clique. Comparison between floats can be done in constant time and has to be repeated d times. With reference to Fig. A1, we can assign to each edge in the simplex tree a cost that depends only on the edge depth. The total sum of such costs is

[Equation image TM0094.gif; not available as text]

In case of perfect parallelization, the cost for the first vertex only is

[Equation image TM0095.gif; not available as text]

Figure A1: Simplex tree for a 4-clique with the update filtration cost.
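The update filtration step described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the helper `extend_filtration` and the callback `edge_value` are hypothetical names, and the filtration value of a clique is assumed to be the largest filtration value among its edges, as in a Vietoris–Rips complex.

```python
def extend_filtration(clique, clique_value, new_vertex, edge_value):
    """Return the filtration value of clique + [new_vertex] together with
    the number of float comparisons performed (one per newly added edge)."""
    value = clique_value
    comparisons = 0
    for u in clique:
        comparisons += 1
        w = edge_value(u, new_vertex)
        if w > value:
            # A new edge appears later than the current clique,
            # so it determines when the extended clique appears.
            value = w
    return value, comparisons


# Toy example: three points on a line, edge value = distance.
points = [0.0, 1.0, 3.0]
edge_value = lambda u, v: abs(points[u] - points[v])

# Extend the 2-clique (0, 1), which has filtration value 1.0, by vertex 2.
print(extend_filtration((0, 1), 1.0, 2, edge_value))  # (3.0, 2)
```

Extending a clique on d vertices introduces d new edges, so the number of comparisons grows with the depth of the corresponding edge in the simplex tree.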

Update common neighbors

Updating the list of common neighbors after a clique extension requires computing the intersection between the current list of common neighbors (with length m1) and the list of neighbors of the newly added vertex (with length m2). Given that such lists are ordered, their intersection can be computed in O(m1 + m2) time. The total cost of this operation can be obtained recursively by observing that, in this particular case, the number of neighbors in a clique is uniquely determined by the last element of the clique alone. For example, in Fig. A2, the subtree spanning from ab is the same as the one spanning from b, and the one from ac is equivalent to the one from c. The total cost for a clique of size n can then be expressed as twice the cost for the (n − 1)-clique plus the cost of the depth 1 edges spanning from the first vertex:

[Equation image TM0097.gif; not available as text]

This recurrence has the solution

[Equation image TM0098.gif; not available as text]

In case of perfect parallelization, the cost for the first vertex only is

[Equation image TM0099.gif; not available as text]

Figure A2: Simplex tree for a 4-clique with the update common neighbors cost.
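The intersection of two ordered neighbor lists can be computed with a standard two-pointer merge, which takes at most m1 + m2 steps for lists of lengths m1 and m2. The sketch below illustrates that general technique; the function name `intersect_sorted` is hypothetical and is not taken from the authors' implementation.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists in at most len(a) + len(b) steps."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is smaller than everything left in b
        else:
            j += 1  # b[j] is smaller than everything left in a
    return common


common_neighbors = [2, 3, 5, 8]   # common neighbors of the current clique
new_vertex_neighbors = [3, 4, 5, 9]  # neighbors of the newly added vertex
print(intersect_sorted(common_neighbors, new_vertex_neighbors))  # [3, 5]
```

Each loop iteration advances at least one of the two indices, which gives the linear bound without any hashing or extra sorting.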

Memory performance analysis

At each step, the algorithm needs to store in memory the local graph G with each edge’s filtration value. The local graph can be stored as an adjacency matrix whose entries represent the filtration values. Moreover, we need to store the current list of simplices, the list of their filtration values, and the list of common neighbors for each simplex. Let us denote with V the bits needed to store an edge label (usually a uint) and with F the bits needed to store a filtration value (usually a float). Assuming the worst-case scenario of a fully connected graph with n nodes, the maximum number of simplices will be generated at dimension [inline formula] and will be [inline formula]. The memory cost at that step will then be [inline formula], where the first term is the graph cost; the second is the cost of the list of simplices and of the list of common neighbors, which we assume to have the same size due to the symmetry of the binomial coefficients; and the third is the cost of the simplices’ filtration values.
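A rough estimate of this worst-case memory cost can be sketched under explicit assumptions. The exact expression is not reproduced here, so everything below is one possible reading of the three terms described above and not the authors' formula: the peak number of simplices is taken to be the central binomial coefficient C(n, ⌊n/2⌋), each of those simplices is assumed to hold ⌊n/2⌋ labels, and the defaults V = 32 and F = 64 bits are illustrative choices. The function name `worst_case_memory_bits` is hypothetical.

```python
from math import comb


def worst_case_memory_bits(n, V=32, F=64):
    """Estimate peak memory (in bits) for a complete graph on n nodes.

    Assumptions (not taken from the paper): the peak occurs when simplices
    have k = n // 2 vertices, and there are comb(n, k) of them.
    """
    k = n // 2
    peak = comb(n, k)
    graph = n * n * F                         # adjacency matrix of filtration values
    simplices_and_neighbors = 2 * peak * k * V  # two lists assumed to be equally large
    filtration_values = peak * F              # one value per simplex
    return graph + simplices_and_neighbors + filtration_values


for n in (10, 20, 30):
    print(n, worst_case_memory_bits(n) / 8 / 1e6, "MB")
```

Under these assumptions the binomial term quickly dwarfs the quadratic graph term, which is consistent with the simplex lists, and not the graph itself, being the memory bottleneck.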

Contributor Information

Paweł Dłotko, Dioscuri Centre in Topological Data Analysis, Mathematical Institute, Polish Academy of Sciences, Warsaw, 00-656, Poland.

Davide Gurnari, Dioscuri Centre in Topological Data Analysis, Mathematical Institute, Polish Academy of Sciences, Warsaw, 00-656, Poland.

Availability of Source Code and Requirements

Jupyter notebooks to reproduce all the experiments described in this article are available in the GigaDB repository [36], and the workflows are described in [37].

Abbreviations

ECC: Euler characteristic curve; ECP: Euler characteristic profile; H&E: hematoxylin and eosin; LDA: linear discriminant analysis; MPH: multiparameter persistent homology; rLDA: regularized linear discriminant analysis; rQDA: regularized quadratic discriminant analysis; TDA: topological data analysis.

Authors’ Contributions

P.D. conceived and led the project. Both authors contributed to the mathematical aspects of the work and to the design of the algorithms. D.G. led the development of the software implementation and carried out the experiments. Both authors contributed to the writing of the manuscript.

Funding

P.D. and D.G. acknowledge support by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland) and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research. D.G. is also with the University of Warsaw within the Doctoral School of Exact and Natural Sciences.

Data Availability

Snapshots of our code and other data further supporting this work are openly available in the GigaScience repository, GigaDB [36].

Competing Interests

The authors declare that they have no competing interests.

References

  • 1. Edelsbrunner H, Letscher D, Zomorodian A. Topological persistence and simplification. Discrete Comput Geometry. 2002;28(4):511–33. doi:10.1007/s00454-002-2885-2.
  • 2. Singh G, Memoli F, Carlsson G. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eindhoven, The Netherlands: The Eurographics Association; 2007. doi:10.2312/SPBG/SPBG07/091-100.
  • 3. Edelsbrunner H, Harer JL. Computational Topology: An Introduction. Providence, RI: American Mathematical Society; 2022.
  • 4. Lee Y, Barthel SD, Dłotko P, et al. Quantifying similarity of pore-geometry in nanoporous materials. Nat Commun. 2017;8(1):15396. doi:10.1038/ncomms15396.
  • 5. Dłotko P, Wanner T. Topological microstructure analysis using persistence landscapes. Phys D Nonl Phen. 2016;334:60–81. doi:10.1016/j.physd.2016.04.015.
  • 6. Hiraoka Y, Nakamura T, Hirata A, et al. Hierarchical structures of amorphous solids characterized by persistent homology. Proc Natl Acad Sci. 2016;113(26):7035–40. doi:10.1073/pnas.1520877113.
  • 7. Nicolau M, Levine AJ, Carlsson G. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc Natl Acad Sci. 2011;108(17):7265–70. doi:10.1073/pnas.1102826108.
  • 8. Bauer U, Kerber M, Reininghaus J. Distributed computation of persistent homology. In: 2014 Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics; 2013:31–8. doi:10.1137/1.9781611973198.4.
  • 9. The GUDHI Project. GUDHI User and Reference Manual. 3.6.0 ed. GUDHI Editorial Board; 2022. https://gudhi.inria.fr/python/latest/citation.html.
  • 10. Silva Vd, Carlsson G. Topological estimation using witness complexes. In: Gross M, Pfister H, Alexa M, Rusinkiewicz S, eds. SPBG’04 Symposium on Point-Based Graphics 2004. Eindhoven, The Netherlands: The Eurographics Association; 2004. doi:10.2312/SPBG/SPBG04/157-166.
  • 11. Sheehy DR. Linear-size approximations to the Vietoris–Rips filtration. Discrete Comput Geometry. 2013;49(4):778–96. doi:10.1007/s00454-013-9513-1.
  • 12. Chazal F, Fasy BT, Lecci F, et al. On the bootstrap for persistence diagrams and landscapes. Model Anal Inf Syst. 2013;20(6):111–20. doi:10.18255/1818-1015-2013-6-111-120.
  • 13. Carlsson G, de Silva V. Zigzag Persistence. Found Comput Math. 2010;10(4):367–405. doi:10.1007/s10208-010-9066-0.
  • 14. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30. https://dl.acm.org/doi/10.5555/1953048.2078195.
  • 15. Heiss T, Wagner H. Streaming algorithm for Euler characteristic curves of multidimensional images. In: Felsberg M, Heyden A, Krüger N, eds. Computer Analysis of Images and Patterns. Lecture Notes in Computer Science. Cham, Switzerland: Springer International Publishing; 2017:397–409. doi:10.1007/978-3-319-64689-3_32.
  • 16. Wang F, Wagner H, Chen C. GPU computation of the Euler characteristic curve for imaging data. In: Goaoc X, Kerber M, eds. 38th International Symposium on Computational Geometry (SoCG 2022). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik; 2022, 224. doi:10.4230/LIPIcs.SoCG.2022.64.
  • 17. Roy A, Haque RaI, Mitra AJ, et al. Understanding flow features in drying droplets via Euler characteristic surfaces—a topological tool. Phys Fluids. 2020;32(12):123310. doi:10.1063/5.0026807.
  • 18. Beltramo G, Skraba P, Andreeva R, et al. Euler characteristic surfaces. Found Data Sci. 2022;4(4):505–536. doi:10.3934/fods.2021027.
  • 19. Chen Y, Segovia-Dominguez I, Coskunuzer B, et al. TAMP-S2GCNets: coupling time-aware multipersistence knowledge representation with spatio-supra graph convolutional networks for time-series forecasting. In: International Conference on Learning Representations. 2022. https://openreview.net/forum?id=wv6g8fWLX2q.
  • 20. Perez D. Euler and Betti curves are stable under Wasserstein deformations of distributions of stochastic processes. arXiv. 2022. arXiv:2211.12384 [math]. doi:10.48550/arXiv.2211.12384.
  • 21. Hatcher A. Algebraic Topology. Cambridge, UK: Cambridge University Press; 2002.
  • 22. Carlsson G, Zomorodian A. The theory of multidimensional persistence. SCG ’07. New York, NY, USA: Association for Computing Machinery; 2007:184–193. doi:10.1145/1247069.1247105.
  • 23. Botnan MB, Lesnick M. An introduction to multiparameter persistence. arXiv. 2022. arXiv:2203.14289 [cs, math]. doi:10.48550/arXiv.2203.14289.
  • 24. Chung YM, Lawson A. Persistence curves: a canonical framework for summarizing persistence diagrams. Adv Comput Math. 2022;48(1):6. doi:10.1007/s10444-021-09893-4.
  • 25. Chevyrev I, Nanda V, Oberhauser H. Persistence paths and signature features in topological data analysis. IEEE Trans Pattern Anal Mach Int. 2020;42(1):192–202. doi:10.1109/TPAMI.2018.2885516.
  • 26. Roune BH, de Cabezón ES. Complexity and algorithms for Euler characteristic of simplicial complexes. arXiv. 2011. arXiv:1112.4523 [cs, math]. doi:10.48550/arXiv.1112.4523.
  • 27. Boissonnat JD, Maria C. The simplex tree: an efficient data structure for general simplicial complexes. Algorithmica. 2014;70(3):406–27. doi:10.1007/s00453-014-9887-3.
  • 28. Bleile B, Garin A, Heiss T, et al. The persistent homology of dual digital image constructions. In: Gasparovic E, Robins V, Turner K, eds. Research in Computational Topology 2. Association for Women in Mathematics Series. Cham, Switzerland: Springer International Publishing; 2022:1–26. doi:10.1007/978-3-030-95519-9_1.
  • 29. Johnson M, Jung JH. Instability of the Betti sequence for persistent homology and a stabilized version of the Betti sequence. arXiv. 2021. arXiv:2109.09218 [cs, math]. doi:10.48550/arXiv.2109.09218.
  • 30. Vipond O, Bull JA, Macklin PS, et al. Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors. Proc Natl Acad Sci. 2021;118(41):e2102166118. doi:10.1073/pnas.2102166118.
  • 31. Vipond O, Bull JA, Macklin PS, et al. Spatial patterning of immune cells. GitHub. 2021. https://github.com/MultiparameterTDAHistology/SpatialPatterningOfImmuneCells.
  • 32. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer; 2009. doi:10.1007/978-0-387-84858-7.
  • 33. Lawson P, Sholl AB, Brown JQ, et al. Persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. Sci Rep. 2019;9(1):1139. doi:10.1038/s41598-018-36798-y.
  • 34. Lawson P, Wenk C, Fasy BT, et al. Corresponding data for “Persistent Homology for the Quantitative Evaluation of Architectural Features in Prostate Cancer Histology”. Open Science Framework. 2020. doi:10.17605/OSF.IO/K96QW.
  • 35. Bishop CM, Nasrabadi NM. Pattern Recognition and Machine Learning. New York, NY: Springer; 2006.
  • 36. Dłotko P, Gurnari D. Supporting data for “Euler Characteristic Curves and Profiles: A Stable Shape Invariant for Big Data Problems.” GigaScience Database. 2023. doi:10.5524/102459.
  • 37. Dłotko P, Gurnari D. ECP experiments. WorkflowHub. 2023. doi:10.48546/workflowhub.workflow.576.1.


Articles from GigaScience are provided here courtesy of Oxford University Press
