2023 Jan 25;86(3):34. doi: 10.1007/s00285-023-01868-x

The arithmetic topology of genetic alignments

Christopher Barrett 1,2,#, Andrei Bura 1,#, Qijun He 1,#, Fenix Huang 1,#, Christian Reidys 1,3,✉,#
PMCID: PMC9875784  PMID: 36695949

Abstract

We propose a novel mathematical paradigm for the study of genetic variation in sequence alignments. This framework originates from extending the notion of pairwise relations, upon which current analysis is based, to k-ary dissimilarity. This dissimilarity naturally leads to a generalization of simplicial complexes by endowing simplices with weights, compatible with the boundary operator. We introduce the notions of k-stances and of the dissimilarity complex, the former encapsulating arithmetic as well as topological structure expressing these k-ary relations. We study basic mathematical properties of dissimilarity complexes and show how this approach captures watershed moments of viral dynamics in the context of SARS-CoV-2 and H1N1 flu genomic data.

Keywords: Hamming distance, k-Stances, Sequence dissimilarity, Phylogeny, Weighted simplicial complexes, Weighted algebraic homology

Introduction

Genetic variation is the observed difference at the genetic sequence level between individuals in a population and is the key contributor to phenotypic diversity. It affects population dynamics and ultimately the evolution of the entire system.

One of the key tasks of biology is understanding the development of genetic variation within a given population, namely the extraction of evolutionary relationships and histories among the sequences present (i.e. their phylogenetics). These relationships are construed in the guise of the phylogenetic tree, a graph-topological structure that underpins our understanding of the major evolutionary transitions appearing in the system. This structure is central to inferring everything from the emergence of new body plans, novel metabolism and the origin of new genes to detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species (Kapli et al. 2020).

Such a tree is generally constructed from metric (pairwise) information present at the sequence level. For two sequences of equal length, one naive approach is to employ the Hamming distance, which counts the number of positions at which the two sequences have different entries. For sequences of different length, alignment methods are applied to obtain equal length via the insertion of “gap” symbols (Needleman and Wunsch 1970; Smith and Waterman 1981). Other metrics can be used, such as the edit distance, which can still be viewed as weighted versions of the Hamming metric (Berger et al. 2021).

As a result, metric-based phylogenetic tree constructions integrate only pairwise dissimilarity information. This information can be viewed as a complete graph where each node represents a sequence and the length of an edge represents the dissimilarity between the two corresponding sequences the edge links. The integration process in metric-based phylogeny can be summarized as finding a spanning tree whose inherited metric information “best fits” the metric in this complete graph (Felsenstein and Felenstein 2004; Saitou and Nei 1987).

Graphs are natural choices but can only encode pairwise relations. However, in a population of genetic sequences, there exist higher order interactions that cannot be expressed from pairwise relations alone. In a population of sequences, different positions (sites) exhibit different nucleotide diversities. Some sites are conserved, allowing almost no polymorphism, while others, the key contributors to genetic variation, are less conserved. Polymorphic sites that allow three or more nucleotide realizations (tri-allelic or higher) are of interest as they are particularly indicative of the quasi-species exploring the fitness landscape by stochastic mutation and post-selection (Westen et al. 2009; Huebner et al. 2007). For instance, in the SARS-CoV-2 genome, site 23012, located in the spike protein region, exhibits such polymorphisms, see Fig. 1.

Fig. 1. Site 23012 on the SARS-CoV-2 genome

This site has the nucleotide G in the wild type sequence, while in the Beta variant of the virus, it mutates to A and induces an amino acid change from E to K at position 484. E484K is one of the characteristic mutations of the Beta variant and improves the virus’s ability to evade the host’s immune system (Wise 2021). On the other hand, in the Kappa variant, site 23012 mutates to C and results in the amino acid change E484Q, which is thought to enhance ACE2 receptor binding (Cherian et al. 2021), and may reduce the vaccine-stimulated antibodies’ ability to attach to this altered spike protein (Wilhelm et al. 2021).

Conservation at a site can be measured by various indices, which are usually based on alphabet frequency distributions, information entropy or stereo-chemical matrix considerations. In Valdar (2002), Valdar provides a retrospective on such indices and presents the criteria a successful score should obey. Although the scores have become sophisticated in terms of the components they employ, such as means of accounting for sequence redundancy in the MSA, stereo-chemical variability or gap presence, the scores that yield the best results are still based on sum-of-pairs constructions. When considering only symbol frequency distributions, key features of highly polymorphic sites in genetic alignments, like site 23012 on the spike protein of the SARS-CoV-2 genome, cannot be expressed by measures derived from pairwise comparisons alone, see Fig. 2.

Fig. 2. Segments of SARS-CoV-2 genomes and their GISAID IDs. X: EPI-ISL-428678, Y: EPI-ISL-509797, Z: EPI-ISL-656231, X’: EPI-ISL-467431, Y’: EPI-ISL-428678, Z’: EPI-ISL-415706. We cannot distinguish the triplet {X,Y,Z} from the triplet {X’,Y’,Z’} if we restrict ourselves to pairwise Hamming distance comparisons. That is because for the Hamming distance dH we have dH(X,Y)=dH(X’,Y’), dH(X,Z)=dH(X’,Z’), dH(Y,Z)=dH(Y’,Z’). However, the triple {X,Y,Z} contains two tri-allelic sites, while {X’,Y’,Z’} does not

To capture multi-fold sequence interactions like the ones in Fig. 2, we will introduce a novel notion of dissimilarity (the complement of conservation), that augments the sums of pairs approaches in Valdar (2002) to k-tuples where k can be greater than two. We propose passing from weighted graphs to weighted simplicial complexes (that account for k-fold relations) embodied in a novel arithmetico-topological structure we call the dissimilarity complex which encodes these sequence hyper-relations.

The notion of dissimilarity entails more data than just a collection of certain scores. This is already apparent in the analysis of sequences based on Hamming distances. Genetic data are in fact more deeply understood via tree or tree-like concepts, all of which reside in the weighted graph of Hamming distances. A tree is a maximal acyclic spanning subgraph of the complete graph over the set of its vertices. Its edges form a maximal independent set: adding any further edge between two vertices closes a unique cycle. The distance between two such vertices in the complete graph can be approximated by the sum of the lengths of the tree edges on the path connecting them. In conclusion, the graph contextualizes the notions of trees instrumental in classical phylogeny.

The goal of this paper is to introduce the dissimilarity complex, which is the higher dimensional counterpart of the weighted graph as it appears in classical phylogeny. In fact this graph can be derived as the 1-skeleton of the dissimilarity complex. The dissimilarity complex is a weighted simplicial complex.

An (unweighted) simplicial complex is classically regarded as just a set composed of points, edges, triangles, tetrahedra and their higher dimensional counterparts. In addition to its combinatorics, a simplicial complex can also be studied from a global (topological) perspective as it naturally gives rise to a topological space. In particular, homological (algebraic) invariants of simplicial complexes often already capture important information about global structural features of the sequence phylogeny. For instance, k-dimensional holes in such spaces can be shown to correspond to multi-fold recombination (Chan et al. 2013). Recently, weighted homology, an augmentation of simplicial homology that further encodes arithmetical information for each simplex has been developed (Bura et al. 2021; Dawson 1990; Ren et al. 2018). These weights and the information they carry are central to the new theory as they enrich the algebraic invariants of classical simplicial homology with arithmetic torsion that encodes additional combinatorial information about the sequence phylogeny.

The dissimilarity complex, similar to the graph of Hamming distances, allows one to derive the notion of a “phylogenetic complex”, the higher dimensional analogue of a tree. Distance based phylogeny shows the way: here one constructs a tree of sequences whose tree metric “best approximates” the complete graph on sequences weighted by their pairwise distances. To mimic this process, one would construct the phylogenetic complex, whose induced dissimilarity “best approximates” that of the dissimilarity complex of the given alignment of sequences. Such a derived object should retain “tree like” qualities and it is here that the choice of a discrete valuation ring is key, see (Li and Reidys 2022).

As a higher dimensional analogue of the phylogenetic tree, the phylogenetic complex should behave similarly: It should contain no (homological) cycles at any simplicial dimension and should be able to approximate the weight of any simplex that appears in the dissimilarity complex but not in the phylogenetic one. These notions can only be derived within the framework of the dissimilarity complex introduced here.

Our paper is organized as follows:

Section 1.1 provides context by reviewing the current phylogenetic study of a sequence alignment via Hamming distance optimizations and phylogenetic trees.

In Sect. 2 we introduce k-stances, a higher dimensional relation between k aligned sequences representing a dissimilarity measurement that naturally generalizes the Hamming distance. Section 2.1 deals with the mathematical properties of this k-ary relation, both those related to metric properties it inherits from the Hamming distance (pairwise case) and those intrinsic to its higher dimensionality. Section 2.2 discusses k-stances in the context of genetic recombination.

Section 3 introduces the dissimilarity complex: an arithmetico-geometrical space whose topological structure encodes the k-stance relations among the sequences of an alignment by integration across all k-dimensions. We shall briefly review the notion of simplicial complexes and their build principles, usage and means of study. In Sect. 3.1 we provide the construction of the dissimilarity complex of a sequence alignment and establish some basic properties. In Sect. 3.2 we then introduce weighted homology and show how it is connected to the dissimilarity complex of an alignment.

Section 4 analyzes the “induction basis” of alignments of length one. We shall show here that all k-stances are in fact connected.

Section 5 considers the multiple column case. In Sect. 5.1 we present two case studies on the statistics of k-stances for alignments of SARS-CoV-2 and H1N1 flu genomes, while in Sect. 5.2 we hint at the relationships between the arithmetic torsion that arises in the weighted homology of a multi-column alignment and its connections to genetic recombination present in the population of the alignment.

In Sect. 6 we integrate our results and discuss future directions of work. Finally, Sect. 7 contains all proofs.

Hamming distance and phylogeny

Let us begin by revisiting the underlying ideas of Hamming distance, its basic metric properties and current metric-based phylogeny.

Given an alphabet $A$, let $A^l$ denote the set of all sequences of length $l$ over $A$. In the case of DNA sequences, $A=\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\}$. The Hamming distance between two sequences $w_0,w_1\in A^l$, denoted by $h(w_0,w_1)$, is the number of positions in which the two sequences differ.

It is easy to check that $h$ satisfies the following axioms, making it a metric. Namely, for any $w_0,w_1,w_2\in A^l$:

  1. (identity of indiscernibles) $h(w_0,w_1)=0 \iff w_0=w_1$.

  2. (symmetry) $h(w_0,w_1)=h(w_1,w_0)$.

  3. (triangle inequality) $h(w_0,w_2)\le h(w_0,w_1)+h(w_1,w_2)$.

Given a set of sequences in $A^l$, all pairwise distance information can be encoded in a symmetric matrix, or equivalently, a complete weighted graph, where each node represents a sequence and the weight of the edge represents the Hamming distance between the two corresponding sequences.
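As a minimal sketch of this encoding, the following Python snippet computes the Hamming distance and the symmetric matrix of pairwise distances. The function names `hamming` and `distance_matrix` are illustrative choices, not part of the paper; the three test sequences are those of Fig. 4.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances (zero diagonal)."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = hamming(seqs[i], seqs[j])
    return matrix


seqs = ["AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC"]  # sequences of Fig. 4
for row in distance_matrix(seqs):
    print(row)
```

Each off-diagonal entry is the weight of one edge of the complete graph on the sequences.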

This metric structure over $A^l$ allows one to infer the phylogenetic relations of a given set of sequences via recursive clustering (e.g. neighbor-joining, UPGMA, WPGMA etc.; Saitou and Nei 1987; Sokal 1958) or via optimization (minimum evolution, least squares inference etc.; Fitch and Margoliash 1967; Hendy and Penny 1982). These relations can then be represented in the phylogenetic tree, where the input sequences become the tree’s leaves and the internal nodes can be interpreted as common ancestors of their descendants. The key idea in all such algorithms is that no matter the metric information on the sequences, our target (the phylogenetic tree) is acyclic and spanning. As a result there exists a unique path between any two leaves, i.e. a unique distance which approximates the original distance, see Fig. 3.

Fig. 3. LHS: a complete graph that encodes all pairwise dissimilarities among sequences labelled A, B, C and D. The edge label represents the distance between the corresponding sequences. RHS: the corresponding neighbor-joining tree: X and Y were added as interior nodes, and the distance between two original nodes is approximated by the sum of the lengths of the edges on the unique path that connects these nodes in the tree, e.g. d(A,B)=7+4.5+5.5=17 approximates the original d(A,B)=13

In what follows we shall mimic this construction but move beyond pairwise metrics to k-wise comparisons (k-stances). As a canonical analogue of the weighted graph, encoding the metric structure, the dissimilarity complex will encapsulate the k-stances.

k-Stances

As previously noted, one encounters k-ary interactions that have no pairwise (Hamming) analogue. In this section we will derive a k-spectrum of measurements that capture such higher order dissimilarity relations among multiple sequences. The following Subsections deal with the properties of this measurement and its connection to a particular type of genetic recombination.

We begin by reformulating the Hamming distance between two sequences. Consider the following projections: for any $j\in\{1,\ldots,l\}$, let $f_j: A^l\to A$ where, for a sequence $w\in A^l$, $f_j(w)$ is the letter at position $j$ in $w$. Then for two sequences $w_0,w_1\in A^l$, each position $j$ contributes one unit to the Hamming distance between the two sequences if and only if $f_j(w_0)\ne f_j(w_1)$.

Accordingly, a position $j$ contributes one unit to the distance if the number of distinct letters that appear at said position in $w_0$ and $w_1$ is equal to the number of sequences, i.e. two in this case. Stated this way, the definition of Hamming distance immediately hints at a generalization to any number $k\ge 1$ of sequences, as follows.

Definition 1

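Let $k\ge 1$. The $k$th order dissimilarity or $k$-stance, of $k$ given sequences each of length $l$, is given by

$$d_k:(A^l)^k\to\mathbb{N},\qquad d_k(w_0,\ldots,w_{k-1}):=\Big|\Big\{j\in[\![1,l]\!] : \Big|\bigcup_{i=0}^{k-1}\{f_j(w_i)\}\Big|=k\Big\}\Big|.$$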

In other words, the k-stance of k given sequences is the number of positions in which the given sequences are all mutually distinct, see Fig. 4.

Fig. 4. $w_0=\mathrm{AAGAGGCTT}$, $w_1=\mathrm{AGGAGACCT}$ and $w_2=\mathrm{GTGGGTCCC}$. Note that at position 2 and position 6, all three sequences are mutually distinct. As such $d_3(w_0,w_1,w_2)=\big|\big\{j\in[\![1,9]\!] : \big|\bigcup_{i=0}^{2}\{f_j(w_i)\}\big|=3\big\}\big|=|\{2,6\}|=2$
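Definition 1 translates almost literally into code. The sketch below is one possible Python rendering, with the name `k_stance` chosen here for illustration: it walks over the columns of the aligned sequences and counts those in which all $k$ letters are pairwise distinct.

```python
def k_stance(*seqs: str) -> int:
    """d_k: number of columns in which the k given sequences are pairwise distinct."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("expected at least one sequence, all of equal length")
    k = len(seqs)
    # zip(*seqs) yields the columns; a column counts if it shows k distinct letters
    return sum(len(set(column)) == k for column in zip(*seqs))


w0, w1, w2 = "AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC"  # sequences of Fig. 4
print(k_stance(w0))          # 1-stance: the sequence length
print(k_stance(w0, w1))      # 2-stance: the Hamming distance
print(k_stance(w0, w1, w2))  # 3-stance
```

For the sequences of Fig. 4 the three calls return 9, 3 and 2 respectively.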

Basic properties

By definition, $d_1=l$, i.e. the 1-stance reproduces the sequence length, while the 2-stance is exactly the Hamming distance, $d_2=h$. In the case $k>2$, the $k$-stance has the following properties.

Proposition 1

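For any $k>2$ and for any $w_0,\ldots,w_k\in A^l$, the $k$-stance satisfies

  1. (implication of indiscernibles) if $w_i=w_{i'}$ for some $0\le i<i'\le k-1$ then $d_k(w_0,\ldots,w_{k-1})=0$.

  2. (symmetry) $d_k(w_0,\ldots,w_{k-1})=d_k(w_{\epsilon(0)},\ldots,w_{\epsilon(k-1)})$ for any index permutation $\epsilon\in S_k$.

  3. (polyhedron inequality) $d_k(w_0,\ldots,w_{k-1})\le\sum_{i=0}^{k-1}d_k(w_0,\ldots,\widehat{w_i},\ldots,w_k)$, where $\widehat{w_i}$ denotes the omission of the $i$th sequence, see Fig. 5.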

Fig. 5. An alignment $W=\{w_0=\mathrm{AAGAGGCTT}, w_1=\mathrm{AGGAGACCT}, w_2=\mathrm{GTGGGTCCC}, w_3=\mathrm{CCGGGCCAC}\}$ with 3-stances $d_3(w_0,w_1,w_2)=2$, $d_3(w_0,w_1,w_3)=3$, $d_3(w_0,w_2,w_3)=4$, $d_3(w_1,w_2,w_3)=3$. Note that, for instance, $d_3(w_0,w_2,w_3)=4\le 8=2+3+3=d_3(w_0,w_1,w_2)+d_3(w_0,w_1,w_3)+d_3(w_1,w_2,w_3)$, and this holds for any other permutation as well
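The 3-stances quoted in Fig. 5 and the polyhedron inequality for $k=3$ can be checked by direct enumeration. The following sketch assumes the column-counting reading of Definition 1; the helper names `k_stance` and `polyhedron_holds` are illustrative only.

```python
from itertools import combinations


def k_stance(*seqs: str) -> int:
    """Number of columns in which the given sequences are pairwise distinct."""
    return sum(len(set(column)) == len(seqs) for column in zip(*seqs))


# Alignment of Fig. 5
W = ["AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC", "CCGGGCCAC"]

# 3-stance of every triple, keyed by the row indices of the triple
triples = {idx: k_stance(*(W[i] for i in idx)) for idx in combinations(range(4), 3)}


def polyhedron_holds(triples) -> bool:
    """Each 3-stance is at most the sum of the 3-stances of the other three triples."""
    total = sum(triples.values())
    return all(d <= total - d for d in triples.values())


print(triples)
print(polyhedron_holds(triples))
```

With four sequences, the right-hand side of the inequality for one triple is the sum over the three remaining triples, which is why comparing against `total - d` suffices here.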

Accordingly, for $k>2$, $d_k$ can be considered as a higher order pseudometric (Collatz 2014). Note that several generalizations of metrics and their properties have been studied in the literature (Klein and Zhu 1998; Deza et al. 1997; Sommerville 2020). In particular, in Klein and Zhu (1998) such polyhedron type inequalities appear for certain graph invariants in the context of sums of powers of volumes of $k$-vertices in a graph. This higher dimensional “volume” is still derived from pairwise vertex quantities via Cayley-Menger type constructions (Sommerville 2020). In Deza et al. (1997) generalizations of the triangle inequality to hypermetrics are also presented in the context of the cut cone and integer programming, all of which are based on pairwise relations. Dissimilarity, in contrast, entails genuine $k$-interactions.

Proposition 2

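For any $k>2$ and for any $w_0,\ldots,w_{k-1}\in A^l$, the $k$-stance satisfies

  1. (dimensional bounding) $d_k(w_0,\ldots,w_{k-1})=0$, for any $k>|A|$.

  2. (dimensional monotonicity) $d_k(w_0,\ldots,w_{k-1})\le d_{k'}(w_{\iota(0)},\ldots,w_{\iota(k'-1)})$, for any $1\le k'\le k$ and any injection $\iota:\{0,\ldots,k'-1\}\to\{0,\ldots,k-1\}$.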
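Both properties can be probed numerically. The sketch below draws random DNA sequences (the seed, the length 30 and the helper name `monotone` are arbitrary choices made for this illustration) and checks dimensional bounding for $k=5>|A|=4$ as well as dimensional monotonicity over all sub-selections of four sequences. It is a sanity check, not a proof.

```python
import random
from itertools import combinations


def k_stance(*seqs: str) -> int:
    """Number of columns in which the given sequences are pairwise distinct."""
    return sum(len(set(column)) == len(seqs) for column in zip(*seqs))


rng = random.Random(0)  # fixed seed, chosen arbitrarily
seqs = ["".join(rng.choice("ACGT") for _ in range(30)) for _ in range(5)]

# Dimensional bounding: five sequences over a four-letter alphabet
# can never be pairwise distinct in a column.
bounded = k_stance(*seqs) == 0


def monotone(seqs) -> bool:
    """The stance of the full selection never exceeds that of any sub-selection."""
    full = k_stance(*seqs)
    return all(
        full <= k_stance(*sub)
        for size in range(1, len(seqs) + 1)
        for sub in combinations(seqs, size)
    )


print(bounded, monotone(seqs[:4]))
```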

k-Stances and genetic recombination

Genetic recombination can be defined as the exchange of genetic material among multiple sequences and is a key contributor to genetic variation (Rubio et al. 2001). In this section, we focus on a particular class of recombination and discuss its connections to k-stances.

Definition 2

Let $W=\{w_0,\ldots,w_{k-1}\}\subseteq A^l$ be $k$ sequences and fix another sequence $w\in A^l$. Then $w$ is called a linear recombinant of $W$, if for each $j\in\{1,\ldots,l\}$ there exists an $i\in\{0,\ldots,k-1\}$ such that $f_j(w)=f_j(w_i)$.

Then we obtain

Proposition 3

Let $W=\{w_0,\ldots,w_{k-1}\}\subseteq A^l$ be $k$ fixed sequences. Then, $d_k(w_0,\ldots,w_{k-1})=0$ if there exists $w\in W$ such that $w$ is a linear recombinant of $W\setminus\{w\}$.

Example 1

Consider W={w0=GCTT,w1=TTCA,w2=GCCA}. Firstly, d1(w0)=d1(w1)=d1(w2)=4, d2(w0,w1)=4, d2(w0,w2)=d2(w1,w2)=2. Since w2=f1(w0)f2(w0)f3(w1)f4(w1), w2 is a linear recombinant of W\{w2}, and as such d3(w0,w1,w2)=0.

Using higher order dissimilarity (as low in dimensionality as 3-stance) we obtain a more refined description of a sequence set. The following example illustrates this point in providing two sets of sequences, exhibiting the same Hamming distance signature, while differing on the level of 3-stances.

Example 2

Let $W^0=\{w_0^0=\mathrm{TTCA}, w_1^0=\mathrm{CTCG}, w_2^0=\mathrm{TTTG}\}$ and $W^1=\{w_0^1=\mathrm{CTCG}, w_1^1=\mathrm{TTAG}, w_2^1=\mathrm{ATTG}\}$ be two sets of length four sequences. We have $d_1(w_0^j)=d_1(w_1^j)=d_1(w_2^j)=4$ and $d_2(w_0^j,w_1^j)=d_2(w_0^j,w_2^j)=d_2(w_1^j,w_2^j)=2$ for $j=0,1$. However, $d_3(w_0^0,w_1^0,w_2^0)=0\ne 2=d_3(w_0^1,w_1^1,w_2^1)$. Accordingly, for $d_1$ and $d_2$, $W^0$ and $W^1$ exhibit identical dissimilarity while their $d_3$-dissimilarities are distinct.

It is worth pointing out that, in the above example, none of the W0-sequences is a linear recombinant of the remaining two, while their 3-stance is still zero. This indeed shows that the implication of indiscernibles for k-stances with respect to linear recombinants is a sufficient but not a necessary condition.
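Examples 1 and 2 can be reproduced with a few lines of Python. The sketch below is illustrative: `is_linear_recombinant` implements Definition 2 position by position, and `signature` (a name introduced here) lists, for each $k$, the sorted $k$-stances over all $k$-subsets of a sequence set.

```python
from itertools import combinations


def k_stance(*seqs: str) -> int:
    """Number of columns in which the given sequences are pairwise distinct."""
    return sum(len(set(column)) == len(seqs) for column in zip(*seqs))


def is_linear_recombinant(w: str, parents) -> bool:
    """True if every letter of w agrees with some parent at the same position."""
    return all(any(w[j] == p[j] for p in parents) for j in range(len(w)))


def signature(seqs):
    """For each k, the sorted list of k-stances over all k-subsets of seqs."""
    return {
        k: sorted(k_stance(*sub) for sub in combinations(seqs, k))
        for k in range(1, len(seqs) + 1)
    }


W = ["GCTT", "TTCA", "GCCA"]   # Example 1
W0 = ["TTCA", "CTCG", "TTTG"]  # Example 2, first set
W1 = ["CTCG", "TTAG", "ATTG"]  # Example 2, second set

print(is_linear_recombinant(W[2], W[:2]), k_stance(*W))
print(signature(W0))
print(signature(W1))
# No W0-sequence is a linear recombinant of the other two, yet the 3-stance is 0.
print([is_linear_recombinant(w, [u for u in W0 if u != w]) for w in W0])
```

The two signatures agree for $k=1,2$ and differ only in the 3-stance entry, matching Example 2.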

The dissimilarity complex

To understand the extra arithmetic information encapsulated within the dissimilarity complex, we recall the notion of simplicial complexes. We adopt here a data-centric point of view and eschew abstract topological and category theoretical considerations.

Suppose we are given a discrete, finite set of data points $W$ as measurements of a system, and suppose that among the points in this data set there exists a multi-fold relation denoted by $\sim$, i.e. a collection of $k$-ary relations for all $1\le k\le|W|$. This relation might model an intrinsic dependency in the system that manifests in the measurements in $W$. Suppose that the relations in $\sim$ are restriction compatible, i.e. they have the property that for any subsets $W''\subseteq W'\subseteq W$ we have that $W'\in{\sim}\implies W''\in{\sim}$. Namely, if a subset of elements in $W$ are in a relation in $\sim$, then any subset of those elements are in a lower arity relation as well. For instance, suppose $W:=\{w_0=\mathrm{AC},w_1=\mathrm{AT},w_2=\mathrm{AG}\}$ and the relation $\sim$ is “contains a nucleotide in common”. Then, clearly $\{w_0,w_1,w_2\}\in{\sim}$ but also $\{w_0,w_1\}\in{\sim}$, $\{w_0,w_2\}\in{\sim}$ and $\{w_1,w_2\}\in{\sim}$. This is because among the triplet and among any of the pairs, the nucleotide A always appears in common. We shall further assume that for each individual measurement $w\in W$ we have $\{w\}\in{\sim}$. Then, the combinatorial structure consisting of the totality of such $\sim$-satisfying subsets is called the simplicial complex of the data set $W$ under the relation $\sim$ and is denoted by $(W,\sim)$. A single such element $W'\in(W,\sim)$ is called a $(|W'|-1)$-dimensional simplex. This complex can be organised into a topological object by embedding each simplex $W'\in(W,\sim)$ into a $(|W'|-1)$-dimensional $\mathbb{R}$-linear (Euclidean) polytope, and gluing these polytopes along their common faces, see Fig. 6.

Fig. 6. A simplicial complex $(W,\sim)$, where $W=\{1,2,3,4,5\}$ with $\{1,2,3\}\in{\sim}$ and $\{2,3,4,5\}\in{\sim}$. These two polytopes share the one dimensional line segment sub-polytope $\{2,3\}$. Gluing produces the geometric realization of $(W,\sim)$

We shall use this perspective to build the dissimilarity complex by letting $W$, the data set, be the sequences in a (genetic) alignment, with the relation $\sim$ := “mutual dissimilarity in at least one position”. Furthermore, we keep track of the degree of dissimilarity for each such $(k-1)$-simplex via the $k$-stance among its constituent sequences.

Construction of the dissimilarity complex

We are now in position to formally introduce the natural mathematical structure encapsulating higher order dissimilarity information among a collection of sequences.

Definition 3

An alignment $W=[w_1,\ldots,w_n]$, $w_i\in A^l$, is a finite ordered tuple of sequences of equal length $l$ over an alphabet $A$. We can view $W$ as a matrix whose entries are letters in $A$ with $w_i$ being the sequence (ordered tuple of letters) in the $i$th row of $W$. Furthermore, the $j$th column of $W$ is the ordered tuple $[f_j(w_1),\ldots,f_j(w_n)]$. The integer $l$ is called the length of the alignment (the number of columns) while the integer $n$ is called the size of the alignment (the number of sequences).

Given an alignment, the totality of dissimilarity it contains can be expressed via a weighted simplicial complex. Such a complex subsumes said dissimilarity information but also contains extra non-local information that is geometric in nature. This is similar to how a weighted tree and a weighted graph can be metrically equated via UPGMA in the context of phylogeny, but differ in their cycle structure as geometric objects.

Definition 4

Let $W=[w_1,\ldots,w_n]$, $w_i\in A^l$, be a given alignment and let $k\ge 0$ be fixed. A simplex of dimension $k$ over $W$ is a $(k+1)$-subset $\sigma=\{w_0,\ldots,w_k\}$ of sequences from $W$ such that $d_{k+1}(w_0,\ldots,w_k)>0$. We denote by $K_k(W)$ the set of all $k$-simplices over $W$, and set $X(W)=\bigcup_{k\ge 0}K_k(W)$. Let $R\supset\mathbb{Z}$ be a discrete valuation ring with uniformizer $\pi$. Let

$$v_W: X(W)\to R,\qquad v_W(\sigma=[w_0,\ldots,w_k])=\pi^{d_{k+1}(w_0,\ldots,w_k)}.$$

Then, vW is called the weight function associated to X(W) and we call the pair (X(W),vW) the dissimilarity complex of W.

For all intents and purposes, the discrete valuation ring can be taken to be the ring of formal power series with rational coefficients in a transcendental variable denoted by $\pi$, namely $R=\mathbb{Q}[[\pi]]$. We use the powers of $\pi$ to encode the integer weights of simplices as exponents. We do so because the boundary operator introduced in Sect. 3.2, which yields a weighted version of homology as per (Dawson 1990; Ren et al. 2018; Bura et al. 2021), requires certain divisibility conditions.
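For small alignments, Definition 4 can be realized by brute-force enumeration. In the sketch below, a simplex is represented by the tuple of its row indices and its weight by the exponent of $\pi$, i.e. the corresponding stance; this representation and the name `dissimilarity_complex` are choices made for illustration. The early exit relies on dimensional monotonicity: once no simplex of a given size exists, none of a larger size can.

```python
from itertools import combinations


def k_stance(*seqs: str) -> int:
    """Number of columns in which the given sequences are pairwise distinct."""
    return sum(len(set(column)) == len(seqs) for column in zip(*seqs))


def dissimilarity_complex(alignment):
    """Map each simplex (tuple of row indices) to the exponent of pi in its weight."""
    simplices = {}
    n = len(alignment)
    for size in range(1, n + 1):
        found = False
        for idx in combinations(range(n), size):
            d = k_stance(*(alignment[i] for i in idx))
            if d > 0:  # subsets with zero stance are not simplices
                simplices[idx] = d
                found = True
        if not found:  # by monotonicity, no larger simplices exist
            break
    return simplices


W0 = ["TTCA", "CTCG", "TTTG"]  # first alignment of Example 2
W1 = ["CTCG", "TTAG", "ATTG"]  # second alignment of Example 2
print(dissimilarity_complex(W0))
print(dissimilarity_complex(W1))
```

The enumeration is exponential in the number of sequences, although the dimension bound $|A|-1$ keeps the sizes that need to be inspected small for nucleotide alphabets.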

Given a dissimilarity complex, we can construct its geometrical realization by constructing the geometrical realizations at each k dimension and then integrating them via gluing along common faces, see Ex 3.

Example 3

Let W=[w0=AAGAGGCTT,w1=AGGAGACCT,w2=GTGGGTCCC]. We can construct the geometrical realization of (X(W),vW) by first constructing its 0-simplices (see Fig. 7a), 1-simplices (see Fig. 7b) and 2-simplices (see Fig. 7c). The geometrical realization of (X(W),vW) is then obtained by integrating all k-simplices via gluing (see Fig. 7d).

Fig. 7. The geometric realizations at different dimensions of $X([w_0=\mathrm{AAGAGGCTT},w_1=\mathrm{AGGAGACCT},w_2=\mathrm{GTGGGTCCC}])$ and their integration, with $K_0$ in (a), $K_1$ in (b), $K_2$ in (c) and the integration into $(X(W),v_W)$ in (d)

An immediate motivation for this construction is that more information can be encoded when we lift from weighted graphs to the weighted complex structure, see Ex 4. We are now in position to identify information not present in the graph, which is only the so-called 1-skeleton of the complex.

Example 4

Consider the dissimilarity complexes associated to the alignments in Ex 2. For $W^0=[w_0^0=\mathrm{TTCA},w_1^0=\mathrm{CTCG},w_2^0=\mathrm{TTTG}]$, we have $d_1(w_0^0)=d_1(w_1^0)=d_1(w_2^0)=4\ne 0$, hence $K_0(W^0)=\{[w_0^0],[w_1^0],[w_2^0]\}$, and since $d_2(w_0^0,w_1^0)=d_2(w_0^0,w_2^0)=d_2(w_1^0,w_2^0)=2\ne 0$, we also have $K_1(W^0)=\{[w_0^0,w_1^0],[w_0^0,w_2^0],[w_1^0,w_2^0]\}$. Finally $d_3(w_0^0,w_1^0,w_2^0)=0$, thus $K_2(W^0)=\varnothing$, see Fig. 8 (LHS). For $W^1=[w_0^1=\mathrm{CTCG},w_1^1=\mathrm{TTAG},w_2^1=\mathrm{ATTG}]$ we have $K_0(W^1)\cong K_0(W^0)$ and $K_1(W^1)\cong K_1(W^0)$; however, $d_3(w_0^1,w_1^1,w_2^1)=2\ne 0$ yields $[w_0^1,w_1^1,w_2^1]\in K_2(W^1)$, see Fig. 8 (RHS).

Fig. 8. Dissimilarity complexes corresponding to $W^0=[w_0^0=\mathrm{TTCA},w_1^0=\mathrm{CTCG},w_2^0=\mathrm{TTTG}]$ and $W^1=[w_0^1=\mathrm{CTCG},w_1^1=\mathrm{TTAG},w_2^1=\mathrm{ATTG}]$, with their respective weights. Note that $X(W^0)$ is an “empty” triangle while $X(W^1)$ is “filled”

$(X(W),v_W)$ represents an augmentation of classical simplicial complexes and in particular a generalization of weighted graphs. In addition, $(X(W),v_W)$ has some nice combinatorial properties that facilitate the study of a weighted version of homology as detailed in Sect. 3.2.

Proposition 4

$X(W)$ is a simplicial complex that is bounded in dimension, namely, for any $\sigma\in X(W)$, $\dim(\sigma)\le|A|-1$.

Proposition 5

Let $\sigma\in X(W)$ be a $k$-simplex and let $\tau\subset\sigma$, $\tau\in X(W)$, be a $k'$-face of $\sigma$. Then we have $v_W(\sigma)\mid v_W(\tau)$.

Definition 5

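Let now $\epsilon\in S_n$ be a permutation on $\{1,\ldots,n\}$. The ordered tuple

$$W_\epsilon:=[w_{\epsilon(1)},\ldots,w_{\epsilon(n)}],$$

is called a row shuffle of $W$.

Let $\omega\in S_l$ be a permutation on $\{1,\ldots,l\}$. The ordered tuple

$$W^\omega:=[w_1^\omega,\ldots,w_n^\omega],$$

with

$$w_i^\omega=[f_{\omega^{-1}(1)}(w_i),\ldots,f_{\omega^{-1}(l)}(w_i)],$$

for all $1\le i\le n$, is called a column shuffle of $W$.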

Proposition 6

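For any alignment $W$ and any pair of row and column shuffles $(\epsilon,\omega)\in S_n\times S_l$, we have $(W_\epsilon)^\omega=(W^\omega)_\epsilon$. Furthermore, denoting $W_\epsilon^\omega:=(W_\epsilon)^\omega=(W^\omega)_\epsilon$, we have that, for any pair of row and column shuffles $(\epsilon,\omega)\in S_n\times S_l$,

$$(X(W),v_W)\cong(X(W_\epsilon^\omega),v_{W_\epsilon^\omega}).$$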

I.e. any row or column shuffle of W produces a weighted dissimilarity complex that, up to vertex relabelling, is the same as the original dissimilarity complex of W.
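The invariance under shuffles can be illustrated on the alignment of Fig. 5. In the sketch below, permutations are given as 0-based index lists, so `omega` plays the role of $\omega^{-1}$ in Definition 5; the helper names, the seed and the use of `frozenset` keys for simplices are assumptions made for this illustration.

```python
import random
from itertools import combinations


def k_stance(*seqs: str) -> int:
    """Number of columns in which the given sequences are pairwise distinct."""
    return sum(len(set(column)) == len(seqs) for column in zip(*seqs))


def dissimilarity_complex(W):
    """Map each simplex (frozenset of row indices) to the exponent of pi in its weight."""
    simplices = {}
    for size in range(1, len(W) + 1):
        for idx in combinations(range(len(W)), size):
            d = k_stance(*(W[i] for i in idx))
            if d > 0:
                simplices[frozenset(idx)] = d
    return simplices


def row_shuffle(W, eps):
    """Row a of the result is row eps[a] of W."""
    return [W[i] for i in eps]


def column_shuffle(W, omega):
    """Column b of the result is column omega[b] of W."""
    return ["".join(w[j] for j in omega) for w in W]


def relabel(simplices, eps):
    """Translate simplices of the shuffled alignment back to the row labels of W."""
    return {frozenset(eps[i] for i in s): d for s, d in simplices.items()}


W = ["AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC", "CCGGGCCAC"]  # alignment of Fig. 5
rng = random.Random(1)  # fixed seed, chosen arbitrarily
eps = list(range(len(W)))
omega = list(range(len(W[0])))
rng.shuffle(eps)
rng.shuffle(omega)

shuffled = column_shuffle(row_shuffle(W, eps), omega)
print(shuffled == row_shuffle(column_shuffle(W, omega), eps))  # shuffles commute
print(relabel(dissimilarity_complex(shuffled), eps) == dissimilarity_complex(W))
```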

Proposition 5 shows that $(X(W),v_W)$ is amenable to the construction of weighted homology by means of a novel boundary operator, compatible with the weight function, see Sect. 3.2. Proposition 6 shows that the dissimilarity complex has nice symmetry properties.

Weighted homology

Passing from graphs to simplicial complexes not only provides us with the degree of freedom to encode additional information, but it also enables us to study a multiple sequence alignment from a novel mathematical perspective. Any simplicial complex gives rise to a topological space. Studying topological properties of the dissimilarity complex enhances our conceptual understanding of the multiple sequence alignment itself.

In algebraic topology, simplicial homology is a useful tool for the study of features of a simplicial complex. It comes about as a sequence of abelian groups $H_1,\ldots,H_n,\ldots$, one for each dimension, whose structures yield surprising information (invariants with respect to continuous deformations) about the space in question, such as the structure of its $k$-dimensional holes and its orientability (geometric torsion). This information is of key relevance and to date, dynamically tracking the birth and death of generators of these homology groups is an integral part of topological data analysis in the guise of Persistent Homology (Zomorodian and Carlsson 2005).

The dissimilarity complex constitutes a simplicial complex with additional weight information. By coherently integrating this weight information into a new homology theory that mimics the classical case, we can study the augmented arithmetic version of its topology. In this case, its torsion encodes k-stance level information among the sequences, and thus we can gain more insight about the structure of the alignment the dissimilarity complex is modeling.

Given an alignment $W$, let $(X,v)=(X(W),v_W)$ be its corresponding weighted dissimilarity complex. Let $C_{n,R}(X)$ denote the free $R$-module generated by all $n$-simplices in $X$, with $R$ being the co-domain ring of the simplex weight function. Setting a simplicial ordering (Hatcher 2005), namely a linear order on the 0-simplices, we can now consider a simplex $\sigma$ as an ordered tuple of sequences instead of a set. This allows one to define

$$\partial_n^v: C_{n,R}(X)\to C_{n-1,R}(X),\qquad \partial_n^v(\sigma)=\sum_{i=0}^{n}\frac{v(\hat\sigma_i)}{v(\sigma)}\cdot(-1)^i\,\hat\sigma_i,$$

where the face $\hat\sigma_i\subset\sigma$ is obtained by dropping the $i$th position in $\sigma$. By Proposition 5, $v(\sigma)$ divides $v(\hat\sigma_i)$ and as a result, $\partial_n^v$ is a well defined $R$-module homomorphism. Note here that it is crucial that $v$ takes values in the DVR as defined: the divisibility condition required in the definition of $\partial_n^v$ is then contingent only on the inequality $d_n(\hat\sigma_i)\ge d_{n+1}(\sigma)$, and not on the latter dividing the former.

Indeed, in Dawson (1990) and Ren et al. (2018) a similar weighted perspective is considered. However, in those instances the weights must satisfy a divisibility condition, for instance integers that divide each other, as otherwise the fractional coefficients obtained leave the integer ring. In our formulation, although $d_n$ takes integer values, passing to the DVR avoids the issue by encoding the integers in the powers of the transcendental $\pi$. As such, only the order relation on the integers is required, and not divisibility. The DVR setup guarantees that $\frac{v(\hat\sigma_{i,j})}{v(\hat\sigma_i)}\cdot\frac{v(\hat\sigma_i)}{v(\sigma)}=\frac{v(\hat\sigma_{j,i})}{v(\hat\sigma_j)}\cdot\frac{v(\hat\sigma_j)}{v(\sigma)}$, hence we obtain $\partial_{n-1}^v(\partial_n^v(\sigma))=0$. In view of this, $\partial_n^v$ is indeed a boundary map and we can define a homology theory accordingly, with $H_n^v(X)=\mathrm{Ker}(\partial_n^v)/\mathrm{Im}(\partial_{n+1}^v)$ denoting the weighted homology modules of $(X,v)$. Furthermore, it is a well known result that the weighted homology modules are in fact independent of our initial choice of simplicial order (Hatcher 2005).
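The weighted boundary map can be sketched with integer bookkeeping only, since every coefficient is a signed power of $\pi$. In the snippet below, a chain is a dictionary mapping a pair (simplex, exponent of $\pi$) to an integer coefficient; this encoding, the names `exponent` and `boundary`, and the convention that 0-simplices map to zero are assumptions of the illustration rather than the paper's implementation. The alignment is the one of Fig. 5, whose four sequences span a 3-simplex.

```python
from collections import defaultdict
from itertools import combinations

W = ["AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC", "CCGGGCCAC"]  # alignment of Fig. 5


def exponent(simplex) -> int:
    """Exponent of pi in v(simplex): the stance of the rows indexed by simplex."""
    rows = [W[i] for i in simplex]
    return sum(len(set(column)) == len(rows) for column in zip(*rows))


def boundary(chain):
    """Weighted boundary of a chain {(simplex, pi_exponent): integer coefficient}."""
    out = defaultdict(int)
    for (simplex, e), c in chain.items():
        if len(simplex) < 2:
            continue  # assumption: 0-simplices are sent to zero
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            # v(face) / v(simplex) = pi ** shift, with shift >= 0 by Proposition 5
            shift = exponent(face) - exponent(simplex)
            out[(face, e + shift)] += (-1) ** i * c
    return {key: c for key, c in out.items() if c != 0}


top = (0, 1, 2, 3)
d_top = boundary({(top, 0): 1})
print(d_top)
print(boundary(d_top))  # empty chain: the boundary of a boundary vanishes
```

Both routes from a simplex to a codimension-two face accumulate the same power of $\pi$ with opposite signs, which is why the second application returns the empty chain.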

Proposition 7

Let $W$ be an alignment over an alphabet $A$ and let $(X(W),v_W)$ be its corresponding weighted dissimilarity complex. Then $H_k^v(X(W))=0$, for all $k\ge|A|$.

Proposition 8

Let $W$ be an alignment consisting of $n$ sequences of length $l$. For any pair $(\epsilon,\omega)\in S_n\times S_l$, and any $k\in\mathbb{N}$, we have $H_k^{v_W}(X(W))\cong H_k^{v_{W_\epsilon^\omega}}(X(W_\epsilon^\omega))$.

The single column case

In this section we present a relation between the $k$-stances for different $k$ for alignments of length one. Furthermore we compute their weighted homology. To this end, let $W=[w_0,\ldots,w_{n-1}]$ be an alignment of length one and size $n$.

We can organize $W$ via bins $W=\dot{\bigcup}_{i=1}^{s}B_i=\dot{\bigcup}_{i=1}^{s}\{w\in W\mid f_1(w)=a_i\}$, where $s$ is the number of distinct letters that appear in $W$’s column, and we let $b_i=|B_i|$ for all $1\le i\le s$ be the size of $B_i$, i.e. the multiplicity in $W$’s column of the letter $a_i$, see Fig. 9.

Fig. 9. The partition of a single column alignment W into bins

Given the bin partitioning of $W$, it is easy to see that any simplex $\sigma\in X(W)$ has weight $v(\sigma)=\pi^1=\pi$. Furthermore $X(W)$ is a pure simplicial complex as all of its maximal simplices are of dimension $s-1$. By construction, each $(s-1)$-simplex is obtained by picking one sequence (0-simplex) from each of the $s$ bins. Therefore, $X(W)$ is a complete $k$-partite simplicial complex, which is a natural generalization of the complete bipartite graph. Note that in the case of $s=2$, $X(W)$ is precisely the classical complete bipartite graph $K_{b_1,b_2}$, see Fig. 10.

Fig. 10.

A single column alignment with only two bins and its dissimilarity complex, the complete bipartite graph $K_{4,3}$

Definition 6

The total k-stance contribution in W is defined to be

$$c_k=\sum_{Y\subseteq W,\ |Y|=k}d_k(y_0,\dots,y_{k-1}),$$

where the sum is taken over all size $k$ subsets $Y=\{y_0,\dots,y_{k-1}\}\subseteq W$. We integrate this information over all $k$ into a polynomial in the indeterminate $x$, called the dissimilarity polynomial of $W$

$$D_W(x)=x^s+\sum_{k=1}^{s}(-1)^k c_k x^{s-k}.$$

Theorem 9

Let W be a single column alignment. The size of each bin of W is a root of W’s dissimilarity polynomial, and this polynomial has no other roots.

Example 5

Let $W$ be the single column alignment shown in Fig. 9. We have $b_1=3$, $b_2=b_3=2$ and $b_4=1$. Furthermore, we have $c_1=8$, $c_2=23$, $c_3=28$ and $c_4=12$. Then $D_W(x)=x^4-8x^3+23x^2-28x+12=(x-3)(x-2)^2(x-1)$.
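As a sanity check of Definition 6 and Theorem 9 on this example, the following minimal Python sketch counts the coefficients $c_k$ by brute force and evaluates the dissimilarity polynomial at the bin sizes. The column string `AAACCGGT` is a hypothetical stand-in that merely has bin sizes 3, 2, 2, 1 (the actual letters of Fig. 9 are not assumed), and all function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def total_stance(column, k):
    """c_k: number of size-k subsets of the column whose letters are mutually distinct."""
    return sum(1 for subset in combinations(column, k) if len(set(subset)) == k)

def dissimilarity_polynomial(column):
    """Coefficients of D_W(x), highest degree first: [1, -c_1, c_2, -c_3, ...]."""
    s = len(set(column))
    return [1] + [(-1) ** k * total_stance(column, k) for k in range(1, s + 1)]

def evaluate(coeffs, x):
    """Horner evaluation for highest-degree-first coefficients."""
    value = 0
    for a in coeffs:
        value = value * x + a
    return value

column = "AAACCGGT"  # hypothetical single column with bin sizes 3, 2, 2, 1
coeffs = dissimilarity_polynomial(column)
bin_sizes = sorted(Counter(column).values(), reverse=True)

print(coeffs)                                    # [1, -8, 23, -28, 12]
print([evaluate(coeffs, b) for b in bin_sizes])  # [0, 0, 0, 0]
```

The brute-force count is exponential in the column size and is only meant to mirror the definition; for real columns one would presumably work from the bin sizes directly.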

We note here that for each $k>1$, the coefficient $c_k$ in the dissimilarity polynomial for the column $W$ is a straightforward augmentation of sum-of-pairs type scores for nucleotide conservation, as defined in Valdar (2002). This augmentation is achieved by passing to sums of k-tuples. A natural score that integrates all such information is $D_W(1)$, which is easily computable as per Theorem 9.

Note that, in certain cases when conservation within the column is high, higher order k-stances can be more sensitive than lower order ones. Consider the following example:

Example 6

Let $W$ be an alignment of size 100 with two bins, $|b_1|=98$ and $|b_2|=2$. Then $d_2(W)=2\times98$ and $d_3(W)=0$. Suppose now that a single nucleotide in bin $b_2$ mutates, creating the alignment $W'$ with bins $b'_1=b_1$, $b'_2\subset b_2$ with $|b'_2|=1$, and lastly $|b'_3|=1$, with the last bin containing the mutated nucleotide. Then $d_2(W')=2\times98+1$ while now $d_3(W')=98$. As such, the relative increase in 3-stance is much higher than the one in 2-stance.
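The numbers in this example can be reproduced from the bin sizes alone, since in a single column the total k-stance is the k-th elementary symmetric polynomial of the bin sizes (this is what the proof of Theorem 9 uses). The sketch below is a small illustration under that reading; the function name is hypothetical.

```python
from itertools import combinations
from math import prod

def total_stance_from_bins(bin_sizes, k):
    """Total k-stance of a single column: choose k bins, then one sequence from each."""
    return sum(prod(chosen) for chosen in combinations(bin_sizes, k))

before = [98, 2]     # bins of the column before the mutation
after = [98, 1, 1]   # bins after one sequence of the small bin mutates

for k in (2, 3):
    print(k, total_stance_from_bins(before, k), total_stance_from_bins(after, k))
# 2 196 197
# 3 0 98
```

The 2-stance changes by about half a percent, whereas the 3-stance jumps from zero to 98.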

Theorem 10

Let $W$ be a single column alignment, let $(X,v):=(X(W),v_W)$ be its corresponding weighted simplicial complex and denote by $b=\prod_{i=1}^{s}|b_i-1|$. Then, all homology modules $H_k^v(X)$ are free and furthermore

  • (a) $H_0^v(X)=R$,

  • (b) $H_{s-1}^v(X)=R^b$,

  • (c) $H_k^v(X)=0$ for any $k>0$, $k\neq s-1$.

Example 7

Let $W$ be the single column alignment shown in Fig. 10. We have $b_1=4$ and $b_2=3$. Then $H_0^v(X)=R$, $H_1^v(X)=R^{(4-1)(3-1)}=R^6$ and $H_n^v(X)=0$ for $n\ge2$.
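The ranks in Theorem 10 can be checked numerically on small cases. Because every simplex of a single column complex has the same weight, the weighted boundary map coincides with the ordinary one, so the sketch below simply computes ranks of ordinary boundary matrices over the rationals. It is an illustrative check only: it reports free ranks and would not detect torsion, and it is not the software module mentioned in this paper. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def simplices(bin_sizes, dim):
    """All dim-simplices: choose dim+1 distinct bins, then one vertex (bin, index) per bin."""
    out = []
    for bins in combinations(range(len(bin_sizes)), dim + 1):
        for picks in product(*(range(bin_sizes[b]) for b in bins)):
            out.append(tuple(zip(bins, picks)))
    return out

def boundary_rows(faces, cells):
    """Ordinary boundary matrix (rows: faces, columns: cells); equal weights give ratios of 1."""
    index = {face: r for r, face in enumerate(faces)}
    rows = [[Fraction(0)] * len(cells) for _ in faces]
    for c, cell in enumerate(cells):
        for i in range(len(cell)):
            rows[index[cell[:i] + cell[i + 1:]]][c] = Fraction((-1) ** i)
    return rows

def rank(rows):
    """Rank over the rationals by Gaussian elimination."""
    rows = [row[:] for row in rows]
    r = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def free_ranks(bin_sizes):
    """Free ranks of H_0, ..., H_{s-1} for the complete multipartite complex on the bins."""
    s = len(bin_sizes)
    chains = [simplices(bin_sizes, d) for d in range(s)]
    ranks = [0] * (s + 1)  # ranks[d] = rank of the d-th boundary map; zero at both ends
    for d in range(1, s):
        ranks[d] = rank(boundary_rows(chains[d - 1], chains[d]))
    return [len(chains[d]) - ranks[d] - ranks[d + 1] for d in range(s)]

print(free_ranks([4, 3]))     # [1, 6]
print(free_ranks([3, 2, 2]))  # [1, 0, 2]
```

The first call corresponds to the bins of Example 7, the second to the column of Fig. 14.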

Dissimilarity and k-stances of multi-column sequence alignments

For general alignments we have at present no analytical (closed form) expression connecting their k-stances and weighted homology modules in terms of the bins of their various columns. A way of piecing together column information inductively is currently under investigation, and the idea here would be to employ some version of Mayer-Vietoris sequences for weighted complexes. However, k-stance statistics as well as the modules of weighted homology can be computed effectively. We have developed a framework for computing weighted homology and can provide a link to a free underlying software module created for this purpose (software module). In the following, we shall illustrate that both k-stances and weighted homology provide new insights into aligned genetic data and reveal biologically relevant features of said alignments.

In Sect. 5.1 we present case studies for SARS-CoV-2 and for H1N1, respectively, where k-stance signatures are seen to reflect distinct phases in the evolution of these pathogens in the human population. In Sect. 5.2 we illustrate connections between the structure of the weighted homology modules and the k-stances present in the alignments.

k-Stance statistics

In this subsection we present two case studies that illustrate the use of higher order k-stance statistics to infer biologically relevant information on viral population dynamics. In these case studies, k-stance statistics are computed over the alphabet {A,C,G,T,-}, where the gap symbol “−” was not counted as a distinguished symbol for any of the contributions computed. Therefore by construction, k=4 is the maximum possible k with nontrivial k-stance. Here we focus on k=2,3, as the vast majority of the sites exhibit at most three different nucleotide types. In general, k-stances for larger k values can still be of interest when the alphabet is sufficiently large, such as when considering amino acid sequences.

Case study 1: SARS-CoV-2

The multiple sequence alignment considered here comprises all SARS-CoV-2 genomes submitted to GISAID (Shu and McCauley 2017) prior to 2021-01-11. This amounts to 254,148 sequences, each exhibiting 29,903 aligned sites. For each site, we computed its total 2-stance and 3-stance contribution respectively (i.e. the total number of pairs and the total number of triplets that are mutually distinct). We now partition the set of logarithms of these numbers (shifted by 1 for technical reasons) into 100 bins of the same width and plot their corresponding histograms (bin vs. frequency), see Fig. 11.
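The per-site computation described above can be sketched as follows. This is a toy illustration, not the pipeline used for the GISAID data: the four-sequence alignment is made up, the natural logarithm is an assumption (the base is not specified in the text), and gaps are simply dropped from the letter counts, which is one way to read the statement that the gap symbol is not counted as a distinguished symbol.

```python
from collections import Counter
from itertools import combinations
from math import log, prod

def site_stance(column, k, gap="-"):
    """Total k-stance of one site: number of k-subsets of sequences with mutually distinct letters."""
    counts = [n for letter, n in Counter(column).items() if letter != gap]
    return sum(prod(chosen) for chosen in combinations(counts, k))

def log_histogram(values, n_bins=100):
    """Histogram of log(value + 1) over n_bins bins of equal width."""
    logs = [log(v + 1) for v in values]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant input
    hist = [0] * n_bins
    for x in logs:
        hist[min(int((x - lo) / width), n_bins - 1)] += 1
    return hist

alignment = ["ACGT", "ACGA", "ATGC", "A-GG"]  # hypothetical toy alignment
columns = list(zip(*alignment))
two = [site_stance(col, 2) for col in columns]
three = [site_stance(col, 3) for col in columns]

print(two)    # [0, 2, 0, 6]
print(three)  # [0, 0, 0, 4]
print(log_histogram(two, n_bins=4))
```

Working from letter counts rather than enumerating pairs and triplets keeps the cost per site linear in the number of sequences, which matters at the scale of several hundred thousand genomes.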

Fig. 11.

Histograms of site k-stance distribution (k=2,3). x-axis: Log(k-stance+1), y-axis: the frequency in each bin on a logarithmic scale. The red line marks the bin containing site 23012 corresponding to the well studied mutations E484Q and E484K

The 2-stance and the 3-stance exhibit distinctive distributions. Firstly, the 2-stance contains approx. 6000 sites in the zeroth bin while the 3-stance contains approx. 20,000 sites in the zeroth bin. This is due to the fact that approx. 14,000 sites exhibit bi-allelic but not tri-allelic SNPs. This should be expected, as it is an instance of the observation preceding Theorem 10 regarding the sensitivity of higher order k-stances in highly conserved columns. In any other bin the two distributions also differ distinctively. The 2-stance exhibits a sharper absolute decay in frequency when compared to the 3-stance frequencies, as the latter are approximately an order of magnitude lower. The Pearson’s coefficients of dispersion (Pearson 1920) for the two distributions are 0.49 for 2-stance and 1.24 for 3-stance. Having a closer look at the polymorphic site 23012 mentioned in the Introduction, corresponding to the E484Q and E484K mutations, we find rank 1028 for 2-stance and rank 327 for 3-stance. Note that in the aggregated population, more than 99.88% of the sequences have nucleotide G at site 23012. Such a high degree of conservation results in a relatively low 2-stance rank. On the other hand, the remaining 0.12% of sequences exhibit two types of nucleotides: A and C. The existence of tri-allelic SNPs results in a more prominent 3-stance rank in comparison to its 2-stance rank.

We proceed by investigating sites with high k-stance rank for k=2,3. For each k, we collect the top 100 sites in k-stance. The 2-stance collected sites have 40 sites in common with their 3-stance counterparts, see Fig. 12. When we restrict focus to the Spike protein region of the SARS-CoV-2 genome, we observe that 13 of the top 2-stance collected sites are located in the Spike protein region, 6 of which are also among the top 3-stance sites. In addition, there are 10 Spike protein sites that are among the top 3-stance collected sites but are not found in the 2-stance collected cohort. Cross-referencing these 10 sites with characteristic mutations of VoCs, we found that 4 of them correspond to characteristic mutations of VoCs, see Table 1.

Fig. 12.

Venn diagram of top 100 2-stance and 3-stance sites. Numbers in the brackets represent the number of Spike protein sites in each category. The four highlighted mutations are found to be characteristic mutation sites of the SARS-CoV-2 VoCs in Table 1

Table 1.

Highlighted spike protein sites that are captured by 3-stance but not 2-stance and their correspondence to characteristic mutations of certain VoCs

Site  | Amino acid mutation | VoC
21974 | D138Y               | Gamma
22205 | D215G               | Beta
23593 | Q677H               | Eta
23604 | P681H               | Alpha

Case study 2: H1N1

We study 2-stances and 3-stances within a sliding window of 100 sequences across a temporally ordered alignment of GISAID H1N1 flu data from 2009 to 2018. The y-axis represents the sum over all possible k-stances (k=2,3), which we refer to as the ensemble of 2-stance and 3-stance, respectively, for each window as time evolves along the x-axis, see Fig. 13.
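A sliding-window ensemble of this kind can be sketched as below. The sketch assumes that the window advances by one sequence at a time and that gaps are ignored, neither of which is specified in the text; the five short sequences and the window size of 3 are made up for illustration. The sum over all k-subsets of the window is computed column by column from letter counts.

```python
from collections import Counter
from itertools import combinations
from math import prod

def site_stance(column, k, gap="-"):
    """Number of k-subsets of the column entries with mutually distinct (non-gap) letters."""
    counts = [n for letter, n in Counter(column).items() if letter != gap]
    return sum(prod(chosen) for chosen in combinations(counts, k))

def window_ensembles(sequences, k, window=100):
    """Ensemble k-stance for each window of consecutive, time-ordered sequences."""
    ensembles = []
    for start in range(len(sequences) - window + 1):
        block = sequences[start:start + window]
        ensembles.append(sum(site_stance(col, k) for col in zip(*block)))
    return ensembles

# Hypothetical, temporally ordered toy alignment
sequences = ["AAAA", "AAAA", "AAAT", "ACGT", "TGCA"]

print(window_ensembles(sequences, k=2, window=3))  # [2, 6, 10]
print(window_ensembles(sequences, k=3, window=3))  # [0, 0, 2]
```

In this toy run the 3-stance ensemble stays at zero until the last window, where two columns become tri-allelic, while the 2-stance ensemble rises gradually.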

Fig. 13.

Time evolution of the ensembles of 2-stance and 3-stance in a sliding window of 100 sequences across a temporally ordered alignment of H1N1 flu data from 2009 to 2018

2- and 3-stances capture the two outbreaks (Jan 2009 and Nov 2013) of the virus, and we speculate that the peaks in this dissimilarity signal appear because the virus' genetic variation is elevated as it explores its fitness landscape. However, note that the Apr 2016 flu season, which exhibited a change of the dominant strain, is captured by 3-stances but not by 2-stances. We conjecture that this is the case because 3-stance responds more strongly than 2-stance to decreases in column conservation, as discussed in Example 6.

Multi column weighted homology

Li and Reidys (2022) provide structure theorems for the weighted homologies of arbitrary weighted complexes (not necessarily arising from dissimilarity), relating simplicial homology with coefficients in certain valuation rings to weighted simplicial homology. The idea here is to create a homomorphic image of the “known” homology in the “unknown” homology and then to study the quotient via homological algebra. The concepts developed in the process suggest employing a version of Nakayama’s Lemma (Nakayama 1951) to reduce the coefficients controlling this quotient down to rational numbers. This enables very fast computation of all weighted homology modules (software module).

The weighted homology modules of the dissimilarity complex exhibit non-trivial torsion, which genuinely stems from k-stances and reflects interesting features about the structure of the alignment itself. We present two pertinent examples that allude to this fact:

Example 8

Consider the alignment $W=[w_0=\mathrm{AGCTTT},\ w_1=\mathrm{ATTCAA},\ w_2=\mathrm{AGCCAA}]$. Firstly, we have $d_1(w_0)=d_1(w_1)=d_1(w_2)=6$. Then we have $d_2(w_0,w_1)=5$, $d_2(w_0,w_2)=3$ and $d_2(w_1,w_2)=2$. Finally, we have $d_3(w_0,w_1,w_2)=0$. Since the maximum dimension of $X(W)$ is one, we have two nontrivial weighted homology modules, namely $H_1^v(X(W))=R$ and $H_0^v(X(W))=R\oplus R/\pi\oplus R/\pi^3$. Note that $H_1^v(X(W))$ has free rank one, and this is due to the fact that $w_2$ is a linear recombinant of $w_0$ and $w_1$, namely $w_2=f_1(w_0)f_2(w_0)f_3(w_0)f_4(w_1)f_5(w_1)f_6(w_1)$. Furthermore, $H_0^v(X(W))$ has free rank one and two torsion components. The first torsion component $R/\pi=R/\pi^{(6-5)}$ corresponds to the largest 2-stance among the three sequences, and the second torsion component $R/\pi^3=R/\pi^{(6-3)}$ corresponds to the second largest 2-stance among the three sequences.

Example 9

Let $W=[w_0=\mathrm{AAGAGGCTT},\ w_1=\mathrm{AGGAGACCT},\ w_2=\mathrm{GTGGGTCCC}]$. Firstly, we have $d_1(w_0)=d_1(w_1)=d_1(w_2)=9$. Then we have $d_2(w_0,w_1)=3$, $d_2(w_0,w_2)=6$ and $d_2(w_1,w_2)=5$. Finally, we have $d_3(w_0,w_1,w_2)=2$. Since the maximum dimension of $X(W)$ is two, we have $H_k^v(X(W))=0$ for all $k\ge3$. In fact, we have $H_2^v(X(W))=0$, $H_1^v(X(W))=R/\pi$ and $H_0^v(X(W))=R\oplus R/\pi^3\oplus R/\pi^4$. Since $H_1^v(X(W))=R/\pi$ is pure torsion, none of $w_0$, $w_1$ or $w_2$ is a linear recombinant of the remaining two. Furthermore, the torsion $R/\pi=R/\pi^{(3-2)}$ corresponds to the difference between the minimum pairwise 2-stance and the 3-stance among the three sequences.
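The k-stances quoted in Examples 8 and 9, and the recombinant statements, can be verified with a few lines of Python. The sketch below counts columns whose letters are mutually distinct, and tests the linear recombinant condition as it is used in the proof of Proposition 3 (every letter of a sequence agrees with some other sequence at the same column). Function names are illustrative; the homology modules themselves are not computed here.

```python
from itertools import combinations

def d_k(sequences):
    """k-stance of k aligned sequences: number of columns whose k letters are mutually distinct."""
    k = len(sequences)
    return sum(1 for column in zip(*sequences) if len(set(column)) == k)

def is_linear_recombinant(w, others):
    """True if at every column w agrees with at least one of the other sequences."""
    return all(any(w[j] == o[j] for o in others) for j in range(len(w)))

example8 = ["AGCTTT", "ATTCAA", "AGCCAA"]
example9 = ["AAGAGGCTT", "AGGAGACCT", "GTGGGTCCC"]

for name, W in (("Example 8", example8), ("Example 9", example9)):
    pairs = [d_k(pair) for pair in combinations(W, 2)]  # order: (w0,w1), (w0,w2), (w1,w2)
    print(name, pairs, d_k(W))
# Example 8 [5, 3, 2] 0
# Example 9 [3, 6, 5] 2

print(is_linear_recombinant(example8[2], example8[:2]))  # True
print(any(is_linear_recombinant(w, [o for o in example9 if o is not w])
          for w in example9))                            # False
```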

In the following, we present a theoretical application that motivates the homological framework approach: the structure of certain degenerate primers.

Definition 7

Let $A=\{a_1,\dots,a_s\}$ be a finite alphabet, for instance $A=\{a,c,g,t\}$. A primer $P$ of length $k$ is a string of $k$ nonempty subsets of $A$, namely $P=p_1\cdots p_k$, where $p_i\subseteq A$. Each $p_i$ represents the potential nucleotides that could appear at position $i$ in the primer $P$. The degeneracy of $P$ is defined to be

$$d(P)=\prod_{i=1}^{k}|p_i|.$$
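To make the definition concrete, the following sketch represents a primer as a list of sets of letters, computes its degeneracy as the product of the set sizes, and enumerates the sequences the primer can produce. The representation and the example primer are assumptions made for illustration.

```python
from itertools import product
from math import prod

def degeneracy(primer):
    """d(P): product of the number of admissible letters at each position."""
    return prod(len(p) for p in primer)

def expand(primer):
    """All distinct sequences the primer can produce, in lexicographic order per position."""
    return ["".join(letters) for letters in product(*(sorted(p) for p in primer))]

primer = [{"a"}, {"c", "t"}, {"a", "g", "t"}]  # hypothetical primer of length 3

print(degeneracy(primer))  # 6
print(expand(primer))      # ['aca', 'acg', 'act', 'ata', 'atg', 'att']
```

The number of expanded sequences always equals the degeneracy, which is why a primer can be treated as an alignment of $d(P)$ sequences.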

One key procedure in PCR is to design efficient and specific primer sequences. A primer sequence can form a double-stranded initiation site with the target sequence at which the polymerase can bind. In real applications, there might be more than one target sequence. In a typical situation, one has a collection of related target sequences, for instance, DNA sequences of homologous genes, and the goal is to design primers that will match as many of them as possible. In this case, a single unique primer is typically not sufficient. Instead, one might consider designing a degenerate primer. On the one hand, a primer with high degeneracy is likely to match more target sequences, while on the other hand, high degeneracy may lead to reduced specificity, namely the primer is more likely to bind to wrong binding sites. Therefore, a key task of the degenerate primer design problem is to find a balance between degeneracy and the number of matched target sequences. Despite the already established hardness for the general problem, a topological classification can often shed light on the computational complexity of the class and reveal sub-classes with low topological complexity corresponding to tractable instances. Here we provide a structure theorem that can hint at said complexity.

A primer $P$ can be interpreted as an alignment consisting of the $d(P)$ distinct sequences $P$ can produce. As we are only interested in $X(P)$, by Proposition 6, w.l.o.g. we can assume the columns of the alignment $P$ are sorted by their respective multiplicities, namely $|p_i|\ge|p_j|$ for any $i<j$. In other words, $P$ can be organised as a concatenation of blocks of columns, where each block consists of columns with the same multiplicity and the blocks are arranged from left to right in decreasing order of their respective multiplicities.

Definition 8

For a given primer $P$, let $X_m(P)\subseteq X(P)$ denote the subcomplex generated by all the columns with multiplicity at least $m$

$$X_m(P):=\bigcup_{|p_j|\ge m}X(P_j),$$

where $P_j$ is the $j$th column of the primer alignment $P$.

Since columns with $|p_j|=1$ will not generate non-zero simplices in $X(P)$, we have $X_2(P)=X(P)$ for any primer $P$ with $|p_1|\ge2$. As such we can w.l.o.g. consider only primers $P$ such that $|p_j|\ge2$ for all $1\le j\le l$.

Theorem 11

Let $P$ be a length $l>1$ primer that contains a single column with maximum multiplicity $s$. Namely, $|p_1|=s$ and $|p_1|>|p_j|$ for all $2\le j\le l$. Then for the unweighted dissimilarity complex of $P$ (i.e. $v(\sigma):=1\in R$), $H_k(X(P))=0$ for all $k\ge s$, and $H_{s-1}(X(P))=R^b$, where

$$b=\Big(-1+\prod_{2\le j\le l}|p_j|\Big)^{s}.$$
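The rank $b$ in Theorem 11 can be cross-checked against Theorem 10. The sketch below reads the formula as $b=(-1+\prod_{2\le j\le l}|p_j|)^s$, which is an interpretation of the displayed expression, and compares it with the product of (bin size minus one) over the bins of the first column of the expanded primer. The primer is a made-up example, assumed to be sorted so that its first column is the unique largest one.

```python
from collections import Counter
from itertools import product
from math import prod

def top_rank(primer):
    """b = (-1 + product of |p_j| for j >= 2) ** s, with s the size of the first column."""
    s = len(primer[0])
    return (prod(len(p) for p in primer[1:]) - 1) ** s

def top_rank_from_bins(primer):
    """Same quantity via Theorem 10, using the bins of the first column of the expanded primer."""
    expanded = product(*(sorted(p) for p in primer))
    bins = Counter(seq[0] for seq in expanded)
    return prod(size - 1 for size in bins.values())

primer = [{"a", "c", "g"}, {"c", "t"}, {"a", "g"}]  # hypothetical primer, s = 3

print(top_rank(primer))            # 27
print(top_rank_from_bins(primer))  # 27
```

Each of the three letters of the first column labels a bin of $2\cdot2=4$ expanded sequences, which gives $(4-1)^3=27$ on both sides.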

Discussion

In this paper we introduce the notion of higher order dissimilarities, naturally generalizing the concept of Hamming distance. We have shown that such dissimilarities emerge within alignments of viral sequences and that these are not independent of each other. In fact we give explicit formulae for these dependencies in specific instances. We can thus conclude that, in case of genetic sequences and the underlying four letter alphabets, there is more information than is reflected by Hamming distance alone by considering 3- and 4-stances. It is therefore noteworthy that all the information we currently obtain is based on or derived from Hamming distance.

We then provide a mathematical context for these hyper-distances by means of the dissimilarity complex. Here k-stances manifest as weights of certain simplices. To be concrete, simplices are composed of k sequences that exhibit in at least one site a kth order polymorphism, and the weight of such a simplex is the number of sites exhibiting such kth order polymorphisms. The homology of the weighted complex can be readily computed via weighted homology (Li and Reidys 2022). In the case of sequence alignments, an inductive computation that patches the complex column by column, based on the single column case which we compute here, is currently under investigation.

As for future work, along the lines of constructing the phylogenetic tree within a complete weighted graph of an alignment, we work on constructing a “tree-analogue” within the Dissimilarity Complex of the given alignment. This “phylogenetic complex” should generalize the well known phylogenetic tree. The central idea of distance based phylogeny is to construct a tree of sequences whose tree metric “best approximates” the complete graph on sequences weighted by their pairwise distances. To mimic this process, construct the phylogenetic complex, whose induced dissimilarity “best approximates” that of the dissimilarity complex of the given alignment of sequences. It is natural then to ask what sort of properties such a derived object should possess:

A tree is a maximally acyclic spanning graph of the complete graph over the set of its vertices. Tree edges form a maximally independent set, namely, adding any further edge that connects two vertices closes a unique cycle. The distance between two such vertices can be approximated by the sum of the lengths of the tree edges involved. As a higher dimensional analogue of the phylogenetic tree, the phylogenetic complex should also exhibit such “tree-like” properties. Firstly, the phylogenetic complex should contain no (homological) cycles at any simplicial dimension. Secondly, the phylogenetic complex should be able to approximate the weight of any simplex that appears in the dissimilarity complex but not in the phylogenetic one, just as the phylogenetic tree distance between two vertices approximates the weight of the complete graph edge that closes the cycle between them in the tree.

The phylogenetic complex will inevitably include higher order, pseudometric information that arises as a result of an optimization process that is fundamentally different from clustering. This is clear since the very notion of clustering is based on pairwise relations. There are two mechanisms that we are currently investigating as construction principles for this object:

Higher order “clustering”: one approach to construct a phylogenetic tree based on a pairwise distance matrix is by recursive clustering. At each step, small clusters are combined into a larger one, with the distance matrix being updated to reflect the new inter-cluster distances. The selection of which clusters to merge is typically based on a local criterion. To adapt this clustering approach to the construction of the phylogenetic complex, one key challenge is the design of a higher dimensional analogue of clustering. One potential way to address this challenge is to generalize the split decomposition presented in Bandelt and Dress (1992). This is because a split can be viewed as a minimal clustering result, and any phylogenetic tree can thus be decomposed into a set of compatible splits.

Combinatorial optimization: an alternate approach to constructing a phylogenetic tree based on pairwise distance information is by combinatorial optimization. A tree is selected based on certain global criteria such as least square error or minimum evolution. For example, one can select the tree that minimizes the total tree length, as in the case of the “Steiner tree”. In this case the estimation of the length of each edge can be generally dealt with within the least-squares framework. To adapt this approach to the construction of the phylogenetic complex, one key step is the definition of an objective function that incorporates higher dimensional dissimilarity information, beyond pairwise distance errors.

Proofs

Proposition 1.

Proof

Implication of indiscernibles: if $w_i=w_{i'}$ for some $0\le i<i'\le k-1$, then $f_j(w_i)=f_j(w_{i'})$ for any $j\in\{1,\dots,l\}$ and the claim follows by definition of $d_k$.

Symmetry: since $\bigcup_{i=0}^{k-1}\{f_j(w_i)\}=\bigcup_{i=0}^{k-1}\{f_j(w_{\epsilon(i)})\}$ for any $\epsilon\in S_k$, the claim follows by definition of $d_k$.

Polyhedron inequality: Let $I:A^k\to\{0,1\}$ be an indicator function for which $I(f_j(w_0),\dots,f_j(w_{k-1}))=1$ if $|\bigcup_{i=0}^{k-1}\{f_j(w_i)\}|=k$ and $I(f_j(w_0),\dots,f_j(w_{k-1}))=0$ otherwise. We note then that

$$d_k(w_0,\dots,w_{k-1})=\sum_{j=1}^{l}I(f_j(w_0),\dots,f_j(w_{k-1})).$$

It suffices then to show that $I$ satisfies the polyhedron inequality. Furthermore, since $I$ is always non-negative, it suffices to only consider the case $I(f_j(w_0),\dots,f_j(w_{k-1}))=1$. If $f_j(w_k)\neq f_j(w_i)$ for all $0\le i\le k-1$, then $I(f_j(w_0),\dots,\widehat{f_j(w_i)},\dots,f_j(w_k))=1$ for any $0\le i\le k-1$. In this case the polyhedron inequality holds for $I$. The other possibility is that $f_j(w_k)=f_j(w_i)$ for some distinguished $0\le i\le k-1$. But then, $I(f_j(w_0),\dots,\widehat{f_j(w_i)},\dots,f_j(w_k))=1$, which still implies the claim for $I$, completing the proof.

Proposition 2.

Proof

Dimensional bounding: if $k>|A|$ then for any $j\in\{1,\dots,l\}$ we must have $|\bigcup_{i=0}^{k-1}\{f_j(w_i)\}|\le|A|<k$ and the claim follows by definition of $d_k$.

Dimensional monotonicity: Fixing $1\le k'\le k$, by Proposition 1 (symmetry), it suffices to prove the claim for $\iota=\mathrm{id}|_{\{0,\dots,k'-1\}}$. Namely, we want to show

$$d_k(w_0,w_1,\dots,w_{k-1})\le d_{k'}(w_0,w_1,\dots,w_{k'-1}).$$

This however follows immediately from the definition of $d_k$ by observing that for any $j\in\{1,\dots,l\}$ for which $|\bigcup_{i=0}^{k-1}\{f_j(w_i)\}|=k$ we must in turn have $|\bigcup_{i=0}^{k'-1}\{f_j(w_i)\}|=k'$.

Proposition 3.

Proof

It suffices to note that if $w\in W$ is a linear recombinant of $W\setminus\{w\}$ then by definition, for each $j\in\{1,\dots,l\}$ there exists an $i\in\{0,\dots,k-1\}$ such that $f_j(w)=f_j(w_i)$. This means that for each $j\in\{1,\dots,l\}$ we must have $|\bigcup_{i=0}^{k-1}\{f_j(w_i)\}|\le k-1<k$ and as such $d_k(w_0,\dots,w_{k-1})=0$ as claimed.

Proposition 4.

Proof

This is an immediate consequence of Proposition 2 (dimensional bounding), for any k>|A|.

Proposition 5.

Proof

Again, by Proposition 2 (dimensional monotonicity), we have $d_{k+1}(\sigma)\le d_{k'+1}(\tau)$ and so immediately $v_W(\sigma)=\pi^{d_{k+1}(\sigma)}\mid\pi^{d_{k'+1}(\tau)}=v_W(\tau)$.

Proposition 6.

Proof

The first claim, $(W_\epsilon)^\omega=(W^\omega)_\epsilon$, follows immediately by observing the commutative identity for each entry in the alignment matrix of $W$ [commutative diagram].

For the second claim it suffices to investigate the action a row and column shuffle pair $(\epsilon,\omega)$ has on a fixed $(k-1)$-simplex $\sigma=\{w_0,\dots,w_{k-1}\}\in X(W)$.

Note first that $d_k(\epsilon.\sigma)=d_k(\sigma)$ by construction, where $\epsilon.\sigma\subseteq W_\epsilon$ is the simplex in $W_\epsilon$ corresponding to $\sigma$. On the other hand, for $I$ the indicator function in the proof of Proposition 1, we have that

$$d_k(\sigma)=\sum_{j=1}^{l}I(f_j(w_0),\dots,f_j(w_{k-1}))=\sum_{j=1}^{l}I(f_{\omega(j)}(w_0),\dots,f_{\omega(j)}(w_{k-1}))=d_k(\omega.\sigma).$$

By the previous claim, the order in which we apply the two actions does not matter, and as such $W_\epsilon^\omega\supseteq\sigma_\epsilon^\omega=\omega.\epsilon.\sigma$ with $d_k(\sigma_\epsilon^\omega)=d_k(\sigma)$, and the proposition follows.

Proposition 7.

Proof

This is an immediate consequence of Proposition 4 which bounds the dimensionality of the complex.

Proposition 8.

Proof

This follows immediately from Proposition 6 which homeomorphically equates the dissimilarity complex of an alignment with its row column shuffle and shows that the arithmetic weight information is preserved under such a transformation.

Theorem 9.

Proof

It suffices to examine the polynomial $P_W(x)=\prod_{i=1}^{s}(x-b_i)$, where $b_i=|B_i|$ and $s$ is the number of distinct letters that appear in $W$'s column. Vieta's formulae for $P_W(x)$ yield, for each $1\le k\le s$,

$$\sum_{1\le i_1<i_2<\dots<i_k\le s}\ \prod_{j=1}^{k}b_{i_j}=(-1)^k\frac{p_{s-k}}{p_s},$$

where $p_q$ is the coefficient of the term $x^q$ in $P_W(x)$ for $0\le q\le s$. In this particular case $p_s=1$. If we show that $(-1)^k p_{s-k}=c_k$ for any $1\le k\le s$, then $P_W(x)=D_W(x)$ and the theorem follows. To this end, for $I$ the indicator function in the proof of Proposition 1, we can write

$$c_k=\sum_{Y\subseteq W,\ |Y|=k}I(Y)=|K_{k-1}(W)|,$$

since the size $k$ subsets with $I=1$ are exactly the $(k-1)$-simplices of $X(W)$. To construct a simplex in $K_{k-1}(W)$ it suffices to select $k$ bins and select one sequence from each bin. As such, the theorem follows from

$$|K_{k-1}(W)|=\sum_{1\le i_1<i_2<\dots<i_k\le s}\ \prod_{j=1}^{k}b_{i_j}=(-1)^k p_{s-k}.$$

Theorem 10.

Note that this Theorem is equivalent to Theorem 4 in Bolker (1976). The original proof was based on simplicial joins and a Mayer-Vietoris type sequence. Here we present an alternate, more combinatorial proof of the claims.

We note that our alternate proof consists in developing a novel, explicit combinatorial construction of the relevant generators for homology at the s-1 dimension which could prove useful in tackling the multi-column case. Furthermore the argument for item c), contains an imbricated induction technique on bins and dimensions, that we believe provides hints as to similar results on the multi-column closed form case.

Before we present our proof, let us fix a simplicial ordering. Without loss of generality, we can pick a simplicial order that is compatible with the bin ordering, see Fig. 14a. Then for any maximal simplex $\sigma=[x_1,\dots,x_s]$, we have $x_i\in B_i$, and a simplex is now considered an ordered tuple.

Fig. 14.

a A single column alignment $W=[A,A,A,C,C,G,G]$. b A fixed $\sigma_0=[w_0,w_3,w_5]$ and $\sigma=[w_1,w_4,w_6]$, which do not share any vertex. c All 8 triangles $\sigma'$ in $L(\sigma)$; the grading is given by $|\sigma_0\cap\sigma'|$. d Geometric illustration of $l(\sigma)$ as an element in $\mathrm{Ker}(\partial^v_{s-1=2})$, the boundary of an octahedron. We have $H_2^v(X)=R^{(3-1)(2-1)(2-1)}=R^2$ and the other generator corresponds to $\sigma=[w_2,w_4,w_6]$.

Proof

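Item (a) Since there exist two different sequences in $W$, $X$ is connected, and $H_0(X)=\mathbb{Z}$.

Item (b) Since all simplices in $X(W)$ have weight $\pi$, we have $\partial_k^v(\sigma)=\sum_{i=0}^{k}(-1)^i\frac{\pi}{\pi}\hat{\sigma}_i=\sum_{i=0}^{k}(-1)^i\hat{\sigma}_i$.

It suffices to find an $R$-linearly independent set of generators for $\mathrm{Ker}(\partial_{s-1}^v)=H_{s-1}^v(X)$ of size $b$. By construction, $C_{s-1,R}(X)=\langle M\rangle_R\neq0$. Fix $\sigma_0=[x_1,\dots,x_s]\in M$ and consider

Case 1 there exists no other simplex $\sigma=[y_1,\dots,y_s]\in M$ such that $y_i\neq x_i$ for all $1\le i\le s$. In this case, at least one bin has size 1 and thus we have $b=0$.

Claim $H_{s-1}^v(X)=0$.

To prove this, consider $c\in\mathrm{Ker}(\partial_{s-1}^v)$ and grade it by $|\sigma_0\cap\sigma|$, denoting the number of vertices the two maximal simplices share. In this case, the grading starts at $|\sigma_0\cap\sigma|=1$

$$c=\sum_{k=1}^{s}\ \sum_{|\sigma_0\cap\sigma|=k}a_\sigma\sigma.$$

Let $\sigma\in c$ satisfy $|\sigma_0\cap\sigma|=1$. Then $\sigma=[y_1,\dots,x_i,\dots,y_s]$ for some $1\le i\le s$, while $y_j\neq x_j$ for all $j\neq i$. Let $\hat{\sigma}_i=[y_1,\dots,\hat{x}_i,\dots,y_s]$. Consider all possible $\sigma'\in M$ with $|\sigma'\cap\sigma_0|\ge1$ such that $\hat{\sigma}_i=\hat{\sigma}'_i\subset\sigma'$. Since $|\sigma'\cap\sigma_0|\ge1$ and $|\hat{\sigma}_i\cap\sigma_0|=0$, we must have $\sigma'=\sigma$. Namely, $\sigma$ is the only simplex in $M$ that contains $\hat{\sigma}_i$ as a face. Then $\partial_{s-1}^v(c)=0\Rightarrow a_\sigma=0$. This holds independently for all $\sigma$ with $|\sigma_0\cap\sigma|=1$. Therefore we have

$$c=\sum_{k=2}^{s}\ \sum_{|\sigma_0\cap\sigma|=k}a_\sigma\sigma.$$

We proceed similarly for each $k\ge2$ in order, which eventually leads to $c=0$.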

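Case 2 there exists at least one simplex $\sigma=[y_1,\dots,y_s]\in M$ such that $y_i\neq x_i$ for all $1\le i\le s$. In this case, each bin must contain at least 2 vertices, see Fig. 14b. Let

$$L(\sigma):=\{[z_1,\dots,z_s]\in M\mid z_j=x_j\ \text{or}\ z_j=y_j,\ \text{for all}\ 1\le j\le s\},$$

with $x_i$ or $y_i$ appearing at the same coordinate since they are chosen from the same bin and the 0-simplices follow an order that is compatible with the bin order, see Fig. 14c. We make the Ansatz

$$\beta=\Big\{l(\sigma):=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\sigma'\ \Big|\ \sigma\in M,\ |\sigma_0\cap\sigma|=0\Big\},$$

noting that $|\beta|=\prod_{i=1}^{s}\big||B_i|-1\big|=b$.

Claim $\beta$ is an $R$-basis for $\mathrm{Ker}(\partial_{s-1}^v)$.

A fixed $\sigma$ only ever appears in $l(\sigma)$, therefore $\beta$ is an $R$-linearly independent set.

We next show $\beta\subseteq\mathrm{Ker}(\partial_{s-1}^v)$. Fix $l(\sigma)\in\beta$ and consider

$$\partial_{s-1}^v(l(\sigma))=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\partial_{s-1}^v(\sigma')=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\sum_i(-1)^i\hat{\sigma}'_i.$$

Note that $\hat{\sigma}'_i=[z_1,\dots,\hat{z}_i,\dots,z_s]$ appears in exactly two images,

$$(-1)^{|\sigma_0\cap\sigma'|}\partial_{s-1}^v(\sigma')\quad\text{and}\quad(-1)^{|\sigma_0\cap\sigma''|}\partial_{s-1}^v(\sigma''),$$

where $\sigma'=[z_1,\dots,x_i,\dots,z_s]$ and $\sigma''=[z_1,\dots,y_i,\dots,z_s]$, and it does so with opposite signs, hence $l(\sigma)\in\mathrm{Ker}(\partial_{s-1}^v)$, see Fig. 14d.

Finally, we show $\langle\beta\rangle_R=\mathrm{Ker}(\partial_{s-1}^v)$. Consider

$$\mathrm{Ker}(\partial_{s-1}^v)\ni c=\sum_{\sigma\in K_{s-1}(W)}a_\sigma\sigma=\sum_{k=0}^{s}\ \sum_{|\sigma_0\cap\sigma|=k}a_\sigma\sigma=\sum_{|\sigma_0\cap\sigma|=0}a_\sigma\sigma+\sum_{k=1}^{s}\ \sum_{|\sigma_0\cap\sigma|=k}a_\sigma\sigma.$$

Claim: $c=\sum_{|\sigma_0\cap\sigma|=0}a_\sigma l(\sigma)$.

Let

$$\mathrm{Ker}(\partial_{s-1}^v)\ni c'=c-\sum_{|\sigma_0\cap\sigma|=0}a_\sigma l(\sigma).$$

By construction, the coefficient of any $\sigma\in c'$ with $|\sigma\cap\sigma_0|=0$ is 0, hence

$$c'=\sum_{k=1}^{s}\ \sum_{|\sigma_0\cap\sigma|=k}a_\sigma\sigma.$$

Iterating the argument of Case 1 yields $c'=0$.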

Item (c)

Clearly $H_{k>s-1}(X)=0$. It remains to show $H_{1\le k\le s-2}(X)=0$. We will prove this by induction (called the outer induction) on $s\ge3$, the number of bins. To this end, first fix $s>3$ and assume the outer induction hypothesis that, for any complex $\tilde{X}$ over $s-1$ bins, we have already shown that $H_{1\le k\le s-3}(\tilde{X})=0$. We wish to prove this is still the case for any complex $X$ over $s$ bins. To achieve this, we proceed by another layer of induction (called the inner induction) on the number of 0-simplices (sequences) of $X$-spaces over exactly $s$ bins. Note that $X:=X(W)\subset X':=X(W\cup\{x\})$, i.e. adding one more sequence $x$ to $W$ is tantamount to adding a vertex to the complex over $W$ with its respective accompanying simplices and, in terms of bins, $x$ will be added to an already existing bin which we will denote by $B_x$. We have the following relative sequence

$$H_{k+1}(X',X)\to H_k(X)\to H_k(X')\to H_k(X',X).$$

We assume the inner induction hypothesis that $H_k(X)=0$ for any $1\le k\le s-2$, and we want to show that $H_{2\le k\le s-2}(X',X)=0$, which will yield $H_{2\le k\le s-2}(X')\cong H_{2\le k\le s-2}(X)=0$, and where $H_1(X',X)=0$ will have to be proved separately. To this end, fix a $2\le k\le s-2$, and let

$$c=\sum_{i=1}^{|K_{k-1}(W)|}a_i[x_i,x]\in\mathrm{Ker}(\bar{\partial}_k)$$

where $x_i\in K_{k-1}(X)$, and $[x_i,x]$ is the tuple that contains the vertices in $x_i$ with the vertex $x$ appended at the end as per a simplicial ordering we fix on $W$ where $x>w$ for any $w\in W$. Then

$$0=\bar{\partial}_k(c)=\sum_{i=1}^{|K_{k-1}(W)|}a_i\bar{\partial}_k([x_i,x])=\sum_{i=1}^{|K_{k-1}(W)|}a_i(-1)^k[x_i]+\sum_{i=1}^{|K_{k-1}(W)|}\sum_{j=0}^{k-1}a_{ij}[\hat{x}_i^{\,j},x]$$

implies that the coefficient, in the second term above, of any $[\hat{x}_i^{\,j},x]$ must be zero. As such, since $[\hat{x}_i^{\,j},x]$ is tantamount to the face $[\hat{x}_i^{\,j}]$ of $[x_i]\in K_{k-1}(W)$ with an extra $x$-label appended at the end, we have

$$\partial_{k-1}\sum_{i=1}^{|K_{k-1}(W)|}a_i[x_i]=0.$$

Claim there exists $c'=\sum_{i=1}^{|K_k(W)|}\alpha_i[y_i]\in C_k(X)$ such that for any of the simplices $[y_i]$ we have the property that $y_i\not\ni x'$ for any $x'\in B_x$, and furthermore $c'$ is such that $\partial_k(c')=\sum_{i=1}^{|K_{k-1}(W)|}a_i[x_i]$. To show this, it suffices to first note that no $[x_i]$ simplex contains elements of $B_x$ and as such, $\sum_{i=1}^{|K_{k-1}(W)|}a_i[x_i]\in C_{k-1}(\tilde{X})$ for $\tilde{X}:=X(W\setminus B_x)$, a space on $s-1$ bins. By the outer induction hypothesis for the space $\tilde{X}$, we have $H_{k-1}(\tilde{X})=0$ and so there exists $c'\in C_k(\tilde{X})$ such that $\partial_k(c')=\sum_{i=1}^{|K_{k-1}(W)|}a_i[x_i]$ and the Claim follows.

In view of the Claim, we can construct $c''=\sum_{i=1}^{|K_k(W)|}\alpha_i[y_i,x]\in C_{k+1}(X')$, for which it is clear that $\bar{\partial}_{k+1}(c'')=c$ and so $H_{2\le k\le s-2}(X',X)=0$ follows. It remains now to prove $H_1(X',X)=0$.

Since $x$ is the largest 0-simplex, any $c=\sum_y\alpha_y[y,x]\in\mathrm{Ker}(\bar{\partial}_1)$ must have $\sum_y\alpha_y=0$, where $y\notin B_x$. Thus, each edge with positive sign can be paired with an edge with negative sign and so $c=\sum([y,x]-[y',x])$, with $y\neq y'$ where w.l.o.g. we can assume $y<_A y'$. For each $([y,x]-[y',x])$, if $y$ and $y'$ belong to different bins, then $[y,x]-[y',x]=\bar{\partial}_2([y,y',x])$. If however $y$ and $y'$ belong to the same bin, then since $s\ge3$, there exists a bin $Q$ such that $x,y,y'\notin Q$. We can select $z\in Q$ such that $[y,x]-[y',x]=[y,x]-[z,x]+[z,x]-[y',x]=\bar{\partial}_2([y,z,x])-\bar{\partial}_2([y',z,x])$. Thus, $c\in\mathrm{Im}(\bar{\partial}_2)$, and hence $H_1(X',X)=0$.

In view of the inner induction hypothesis and the long exact sequence, the above concludes the proof for the inner induction step. The inner induction base case is $W=A$, which makes $X(W)$ an $(s-1)$-simplex, for which it is trivial that $H_{1\le k\le s-2}(X)=0$. This concludes the proof for the outer induction step.

It remains to show the outer induction base case, $s=3$, i.e. for any space $X$ over three bins, $H_1(X)=0$. To this end, fix a simplicial ordering such that, for any $x\in B_3$, if $x\in\sigma$, $x$ is the last coordinate of $\sigma$. We proceed by induction on $m:=|B_3|$. First, the base case $m=1$. Let $\tilde{X}:=X(B_1\cup B_2)\subset X:=X(B_1\cup B_2\cup\{x\})$.

Claim: $i(H_1(\tilde{X}))=0$. There are two distinct cases:

  1. $H_1(\tilde{X})=0$. Then immediately $i(H_1(\tilde{X}))=0$.

  2. $H_1(\tilde{X})\neq0$. By previous arguments $H_1(\tilde{X})=\langle\{l(\sigma)\}\rangle_{\mathbb{Z}}$, where $l(\sigma)=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\sigma'$. For each $l(\sigma)$, we construct
    $[l(\sigma),x]:=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}[\sigma',x]\in C_2(X)$,

obtained by appending $x$ to each term in $l(\sigma)$. Then

$$\partial_2([l(\sigma),x])=-l(\sigma)+\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\sum_{i=1}^{2}(-1)^i[\hat{\sigma}'_i,x].$$

Note that,

$$0=\partial_1(l(\sigma))=\sum_{\sigma'\in L(\sigma)}(-1)^{|\sigma_0\cap\sigma'|}\sum_{i=1}^{2}(-1)^i\hat{\sigma}'_i,$$

which yields $\partial_2([l(\sigma),x])=-l(\sigma)$. In other words, $l(\sigma)\in\mathrm{Im}(\partial_2(X))$ and as such $i(l(\sigma))=0$, and so $i(H_1(\tilde{X}))=0$. Since $s=3$, by an identical argument to the distinguished case of the inner inductive step, we immediately conclude $H_1(X,\tilde{X})=0$. By exactness of the $(X,\tilde{X})$ relative homology sequence and the Claim, $H_1(X)=0$, and the base case $m=1$ follows.

Finally, for the inductive step, suppose that $H_1(X(B_1\cup B_2\cup B_3))=0$ for all $1\le|B_3|\le\mu$ and consider the case $m=\mu+1$. Let $x\in B_3$ and $\tilde{X}:=X(W\setminus\{x\})$. Then $\tilde{X}$ is a complex over three bins where $|B_3|=\mu$. By the inductive hypothesis, $H_1(\tilde{X})=0$, and hence $i(H_1(\tilde{X}))=0$. Now, $H_1(X,\tilde{X})=0$ by the exact same argument as in the $m=1$ case, and by exactness of the $(X,\tilde{X})$ relative homology sequence, $H_1(X)=0$. The proof of the outer induction base case thus follows, completing the proof of the theorem.

Theorem 11.

Proof

The theorem follows immediately from Theorem 10 and the following

Claim: let $P$ be a primer and let $X(P)$ be its corresponding dissimilarity complex. Then for all $k\ge m-1$,

$$H_k(X(P))=H_k(X_m(P)).$$

To prove the claim, it suffices to note that a column $j$ in $P$ with $|p_j|\le m-1$ cannot generate a $k$-simplex in $X(P)$ if $k\ge m-1$.

Acknowledgements

We want to thank Thomas Li for comments and discussions. We also thank the anonymous reviewers for their constructive feedback.

Data availability

Data and code will be made available upon request.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Christopher Barrett, Andrei Bura, Qijun He, Fenix Huang and Christian Reidys have contributed equally to this work.

Contributor Information

Christopher Barrett, Email: clb5xe@virginia.edu.

Andrei Bura, Email: cb8wn@virginia.edu.

Qijun He, Email: qh4nj@virginia.edu.

Fenix Huang, Email: fwh3zc@virginia.edu.

Christian Reidys, Email: cmr3hk@virginia.edu.

References

  1. Bandelt H-J, Dress AW. Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol. 1992;1(3):242–252. doi: 10.1016/1055-7903(92)90021-8.
  2. Berger B, Waterman MS, Yu YW. Levenshtein distance, sequence comparison and biological database search. IEEE Trans Inf Theory. 2021;67(6):3287–3294. doi: 10.1109/TIT.2020.2996543.
  3. Bolker ED. Simplicial geometry and transportation polytopes. Trans Am Math Soc. 1976;217:121–142.
  4. Bura A, He Q, Reidys C. Weighted homology of bi-structures over certain discrete valuation rings. Mathematics. 2021;9(7):744. doi: 10.3390/math9070744.
  5. Chan JM, Carlsson G, Rabadan R. Topology of viral evolution. Proc Natl Acad Sci. 2013;110(46):18566–18571. doi: 10.1073/pnas.1313480110.
  6. Cherian S, Potdar V, Jadhav S, Yadav P, Gupta N, Das M, Rakshit P, Singh S, Abraham P, Panda S, et al. SARS-CoV-2 spike mutations, L452R, T478K, E484Q and P681R, in the second wave of COVID-19 in Maharashtra, India. Microorganisms. 2021;9(7):1542. doi: 10.3390/microorganisms9071542.
  7. Collatz L. Functional analysis and numerical mathematics. New York: Academic Press; 2014.
  8. Dawson RJM. Homology of weighted simplicial complexes. Cah Topol Geom Differ Categ. 1990;31(3):229–243.
  9. Deza MM, Laurent M, Weismantel R. Geometry of cuts and metrics. Berlin: Springer; 1997.
  10. Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2004.
  11. Fitch WM, Margoliash E. Construction of phylogenetic trees: a method based on mutation distances as estimated from cytochrome c sequences is of general applicability. Science. 1967;155(3760):279–284. doi: 10.1126/science.155.3760.279.
  12. Hatcher A. Algebraic topology. Beijing: Tsinghua University Press; 2005.
  13. Hendy MD, Penny D. Branch and bound algorithms to determine minimal evolutionary trees. Math Biosci. 1982;59(2):277–290. doi: 10.1016/0025-5564(82)90027-X.
  14. Huebner C, Petermann I, Browning BL, Shelling AN, Ferguson LR. Triallelic single nucleotide polymorphisms and genotyping error in genetic epidemiology studies: MDR1 (ABCB1) G2677/T/A as an example. Cancer Epidemiol Biomark Prevent. 2007;16(6):1185–1192. doi: 10.1158/1055-9965.EPI-06-0759.
  15. Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020;21(7):428–444. doi: 10.1038/s41576-020-0233-0.
  16. Klein D, Zhu H-Y. Distances and volumina for graphs. J Math Chem. 1998;23(1):179–195. doi: 10.1023/A:1019108905697.
  17. Li JT, Reidys MC. On weighted simplicial homology. 2022.
  18. Nakayama T. A remark on finitely generated modules. Nagoya Math J. 1951;3:139–140. doi: 10.1017/S0027763000012265.
  19. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–453. doi: 10.1016/0022-2836(70)90057-4.
  20. Pearson K. Notes on the history of correlation. Biometrika. 1920;13(1):25–45. doi: 10.1093/biomet/13.1.25.
  21. Ren S, Wu C, Wu J, et al. Weighted persistent homology. Rocky Mt J Math. 2018;48(8):2661–2687. doi: 10.1216/RMJ-2018-48-8-2661.
  22. Rubio L, Ayllón MA, Kong P, Fernández A, Polek M, Guerri J, Moreno P, Falk BW. Genetic variation of Citrus tristeza virus isolates from California and Spain: evidence for mixed infections and recombination. J Virol. 2001;75(17):8054–8062. doi: 10.1128/JVI.75.17.8054-8062.2001.
  23. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  24. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data-from vision to reality. Eurosurveillance. 2017;22(13):30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494.
  25. Smith TF, Waterman MS, et al. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5.
  26. Sokal RR. A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull. 1958;38:1409–1438.
  27. Sommerville DM. Introduction to the geometry of N dimensions. New York: Courier Dover Publications; 2020.
  28. Valdar WS. Scoring residue conservation. Proteins. 2002;48(2):227–241. doi: 10.1002/prot.10146.
  29. Westen AA, Matai AS, Laros JF, Meiland HC, Jasper M, de Leeuw WJ, de Knijff P, Sijen T. Tri-allelic SNP markers enable analysis of mixed and degraded DNA samples. Forensic Sci Int Genet. 2009;3(4):233–241. doi: 10.1016/j.fsigen.2009.02.003.
  30. Wilhelm A, Toptan T, Pallas C, Wolf T, Goetsch U, Gottschalk R, Vehreschild MJ, Ciesek S, Widera M. Antibody-mediated neutralization of authentic SARS-CoV-2 B.1.617 variants harboring L452R and T478K/E484Q. Viruses. 2021;13(9):1693. doi: 10.3390/v13091693.
  31. Wise J. Covid-19: the E484K mutation and the risks it poses. Br Med J. 2021;372:n359. doi: 10.1136/bmj.n359.
  32. Zomorodian A, Carlsson G. Computing persistent homology. Discrete Comput Geom. 2005;33(2):249–274. doi: 10.1007/s00454-004-1146-y.

Articles from Journal of Mathematical Biology are provided here courtesy of Nature Publishing Group