. 2022 May 5;84(6):49. doi: 10.1007/s00285-022-01744-0

A new algebraic approach to genome rearrangement models

Venta Terauds, Jeremy Sumner
PMCID: PMC9068684  PMID: 35508785

Abstract

We present a unified framework for modelling genomes and their rearrangements in a genome algebra, as elements that simultaneously incorporate all physical symmetries. Building on previous work utilising the group algebra of the symmetric group, we explicitly construct the genome algebra for the case of unsigned circular genomes with dihedral symmetry and show that the maximum likelihood estimate (MLE) of genome rearrangement distance can be validly and more efficiently performed in this setting. We then construct the genome algebra for a more general case, that is, for genomes that may be represented by elements of an arbitrary group and symmetry group, and show that the MLE computations can be performed entirely within this framework. There is no prescribed model in this framework; that is, it allows any choice of rearrangements that preserve the set of regions, along with arbitrary weights. Further, since the likelihood function is built from path probabilities—a generalisation of path counts—the framework may be utilised for any distance measure that is based on path probabilities.

Mathematics Subject Classification: 92D15, 20C05

Introduction

In the eight decades since Dobzhansky and Sturtevant observed that differences in fruit fly genomes could be explained by a sequence of reversals of genome segments (Dobzhansky and Sturtevant 1938), the study of evolution via genome rearrangement has developed into a rich and active field, with diverse applications (Chen et al. 2018; Darmon and Leach 2014; Oesper et al. 2017). Much work focuses on the calculation of evolutionary distances under rearrangement models, with the distances subsequently used to reconstruct phylogenetic trees. For example, minimal rearrangement distances between genomes—and other distance estimates based on these—have been studied extensively and, under various model restrictions, can be calculated efficiently (Bader et al. 2001; Wang et al. 2006; Bader and Ohlebusch 2006; Oliveira et al. 2019). There are, however, good arguments for applying stochastic methods that estimate genomic distance, via rearrangement, as evolutionary time elapsed (Serdoz et al. 2017), particularly when such an approach allows various rearrangement models to be considered (Terauds and Sumner 2019).

The maximum likelihood approach detailed in Serdoz et al. (2017) utilised the theory of the symmetric and dihedral groups to model circular genomes and region set-conserving rearrangements, motivated by earlier group-theoretical approaches to rearrangement models (Francis 2014; Egri-Nagy et al. 2014). The combinatorial problem of calculating the maximum likelihood estimate (MLE) of evolutionary distance was then converted into a numerical one in Sumner et al. (2017) via the representation theory of the symmetric group algebra. In Terauds and Sumner (2019), the consideration of symmetry was extended to include symmetry of rearrangement models, the role of this in simplifying calculations was explored, and the concrete implementation of the technique for a general model was described. Whilst the representation theory approach reduces the complexity of the MLE computations, the complexity is still of factorial order, meaning that computations for large numbers of regions remain, for the moment, out of reach.

In this work, we suggest that the appropriate theoretical setting for such MLE computations is in fact not the symmetric group algebra but a lower-dimensional algebra. In the symmetric group algebra, the basis elements for computations are individual permutations, each representing a rearrangement or a genome in a fixed orientation, and symmetry is incorporated as an extra step in the calculations. To simplify this, we construct an algebra that incorporates the inherent symmetry into each element. Here, the basis elements are permutation clouds. These correspond to genomes, by simultaneously including all physical orientations; due to the corresponding symmetries of the rearrangement model, they also represent rearrangements in a natural way.

This approach explains and removes the redundancy in the MLE computations that was observed in Terauds and Sumner (2019). In developing the approach, we firstly focus on the simple concrete case of uni-chromosomal circular genomes modelled with unoriented regions and no distinguished positions, building on previous work (Serdoz et al. 2017; Sumner et al. 2017; Terauds and Sumner 2019). Subsequently, we demonstrate that our results may be applied more generally, for example to genomic models that include region orientation and/or origin and terminus of replication. Further, although our focus is on calculation of MLEs, our approach can be applied to calculate other measures of genomic distance under rearrangement; in particular, any that utilise path counts or weighted path counts, such as minimum distance. Whilst the framework does not specify a rearrangement model—indeed, one may choose the allowed rearrangements and their weights—we note that the group-based approach limits us to rearrangements that conserve the set of genomic regions, and thus cannot accommodate insertions, deletions or duplications. We are currently working on expanding the framework to a semigroup-based approach that could incorporate at least some of these rearrangement types. Some of the algebra easily extends to the semigroup case (see Remark 4.7, for example), however there is much yet to be done and this is outside the scope of the present paper.

In the next section, we outline the details of the symmetric group algebra approach to calculating MLEs (Sumner et al. 2017; Terauds and Sumner 2019) for pairs of unsigned, circular genomes, which forms the foundation for the current work. Following this, in Sect. 3, we construct the genome algebra, based on permutation clouds, for this case and show that it provides a coherent framework for modelling such genomes and region set-conserving rearrangements and for calculating MLEs. Section 4 outlines the extension of our results and techniques from permutations with dihedral symmetry to an arbitrary group and symmetry group. This verifies that, as well as incorporating flexibility in the rearrangement model, the framework is not specific to one particular genomic model. The paper concludes with a brief discussion section.

Background: the permutation approach

In this section we set out the theoretical framework for rearrangement models based on permutations, and recall the key elements of the technique for calculating the maximum likelihood estimate of evolutionary distance. Full derivations and details may be found in Terauds and Sumner (2019) and the earlier papers (Sumner et al. 2017; Serdoz et al. 2017). For the specific case study in this and the next section, we model the evolution of single-strand, circular genomes; we do not consider the regions to be oriented and do not distinguish any positions.1 Genomes that are to be compared share $N$ identified regions2 of interest and we consider only rearrangements that conserve the set of regions. Accordingly, we use unsigned permutations, that is, elements of the symmetric group, $S_N$, to represent both genomes and rearrangements. Explicitly: the regions and positions are each labelled by the integers $\{1,2,\ldots,N\}$, and a given genome is represented by a permutation $\sigma\in S_N$, where

$$\sigma(i)=j \iff \text{region } i \text{ is in position } j.$$

Note that while the region labels are chosen once and are immutable, the position labelling reflects a choice of reference frame (starting position and direction of numbering) that changes when we move the genome in space. Since we do not distinguish any positions, there are 2N possible choices of reference frame and thus 2N distinct permutations that represent any given genome; we denote these by

$$[\sigma] := \{d\sigma : d\in D_N\}. \tag{1}$$

Here, $D_N$ is the dihedral group, and the genome has dihedral symmetry.

Since $D_N$ is a subgroup of $S_N$, the sets $[\sigma]$ are cosets, that is, each $[\sigma]=D_N\sigma$ is an equivalence class of $S_N$. Since a given genome exists independently of its orientation in space, we may identify it with the entire coset (Serdoz et al. 2017; Egri-Nagy et al. 2014). However, in this initial formulation, we choose any one of the permutations from the coset (1), say $\sigma$, to specify the genome, work with this single permutation at first, and incorporate all permutations in the set $[\sigma]$ (all symmetries of the genome) into the likelihood calculations in due course.
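The coset construction can be checked by direct enumeration for small $N$. The following is a minimal sketch, not taken from the original papers: permutations are stored as 0-based tuples, and the helper names `compose`, `dihedral_group` and `genome_coset` are illustrative assumptions.

```python
from itertools import permutations

def compose(a, b):
    """Return the permutation a∘b as a tuple, i.e. i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3), as tuples."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def genome_coset(sigma, dihedral):
    """The set [sigma] = {d∘sigma : d in D_N} of permutations representing one genome."""
    return frozenset(compose(d, sigma) for d in dihedral)

N = 5
D = dihedral_group(N)
cosets = {genome_coset(sigma, D) for sigma in permutations(range(N))}
print(len(D), len(cosets))  # 10 12
```

For $N=5$ this gives $2N=10$ symmetries and $5!/10=12$ distinct genomes, each represented by exactly $2N$ permutations.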

We model evolution as a sequence of discrete rearrangement events occurring in continuous time. In this section, as in previous work, we consider a rearrangement to be a single permutation acting on a single permutation; in the next section we shall develop this into the notion of permutation clouds acting on permutation clouds. For now, however, for a genome represented by $\sigma\in S_N$, a rearrangement event is represented by a permutation, $a\in S_N$, acting on $\sigma$ (on the left): $\sigma\mapsto a\sigma$. We refer to permutations $a$ acting in this way as “rearrangements”.

The full biological model for evolution is given by $(M,w,\mathrm{dist})$, where $M\subseteq S_N$ is the set of allowed rearrangements, $w:M\to(0,1]$ is the probability distribution on this set, and dist is the probability distribution of the independent rearrangement events in time. One may have biological evidence for including particular types and sizes of rearrangements in the model with differing relative probabilities, or may wish to compare distances computed under differing models (see Terauds et al. 2021 for some specific examples of models and distance comparisons). The distribution dist may similarly be chosen according to evidence or preference; in this treatment, we use the Poisson distribution.

We emphasise at this stage that we place minimal further restrictions on the set of allowed rearrangements $M$. Without loss of generality, we assume that $M$ generates the group $S_N$; this means that any permutation in $S_N$ may be obtained from any other by applying a sequence of elements from $M$. (Note that the case of $M$ not generating $S_N$ is simpler: in this case $M$ generates a subgroup, $H\subset S_N$, and the problem reduces to considering this smaller group, since any pair of elements would simply be unrelated under the model or both be elements of a coset $H\sigma$.) Elements of $D_N$ are, formally, allowed in the set of rearrangements, although their action does not actually alter the genome. This allows for full generality; for example, if one wishes to include ‘all inversions’ in the model, then the inversion of a region of size $N-1$ is the same as flipping the genome over in space.

The first model condition simply states that the model should naturally possess the same symmetry as the genome (in the current case, dihedral symmetry). Suppose, for example, that $(1,2)\in M$, meaning that the regions in positions 1 and 2 may swap places. Then, since the position labelling is arbitrary, we should have $(\ell,\ell+1)\in M$ for all $\ell=1,2,\ldots,N-1$, and $(N,1)\in M$, meaning that any two regions in adjacent positions may swap places; further, these rearrangements should all be equally probable. We refer to this property as dihedral symmetry of the model and, mathematically, express the condition as

$$\text{for each } a\in M \text{ and } d\in D_N,\quad dad^{-1}\in M \text{ and } w(dad^{-1})=w(a). \tag{M1}$$

The second model condition ensures that the modelling is agnostic to the temporal direction of evolution. More precisely, the condition states that for any rearrangement that is allowed, its inverse is also allowed, with the same probability. That is,

$$\text{for each } a\in M,\quad a^{-1}\in M \text{ and } w(a^{-1})=w(a). \tag{M2}$$

We refer to this property as rearrangement reversibility of the model, or simply model reversibility. This condition is natural in the current group-based setting, where the typical rearrangements are reversals (which are self-inverse) and translocations (whose inverses are translocations). It is not essential for most of the construction, but it does have some nice implications. For example, when we interpret our model of evolution as a Markov process, in Sect. 3.1, we will show that (M2) is equivalent to the time reversibility of the Markov process.

The evolutionary distance measure we consider in this paper is the maximum likelihood estimate of time elapsed (MLE). This is the time at which the likelihood function attains its maximum, where the likelihood function gives the probability, for any given time $T$, that the reference genome has evolved into the target genome in this amount of time. To be precise, for reference genome represented by the identity permutation $e\in S_N$ and target genome represented by $\sigma\in S_N$, the MLE is the maximiser of the function $L(T\mid\sigma)$, where

$$L(T\mid\sigma) := P(\sigma\mid T) = \sum_{k=0}^{\infty} P(e\mapsto[\sigma] \text{ via } k \text{ events})\, P(k \text{ events in time } T). \tag{2}$$

Of course, the likelihood function need not have a maximum; this simply means that no evidence of an evolutionary relationship between the reference and the target under the given model can be discerned. This scenario, familiar from DNA sequence alignment paradigms such as the Jukes–Cantor correction (Felsenstein 2004), was discussed in the context of genome rearrangement models in Serdoz et al. (2017) and Terauds and Sumner (2019), where it was observed to occur in a substantial proportion of cases (independently of the chosen biological model).

For each $k$, the factor $P(k \text{ events in time } T)$ in the likelihood expression is determined by the distribution dist. The first factor, $P(e\mapsto[\sigma] \text{ via } k \text{ events})$, is the genome path probability, which we shall denote by $\alpha_k(\sigma)$. Since the target genome may be represented by any permutation from the set $[\sigma]$, ‘$e\mapsto[\sigma]$’ is shorthand for “the permutation $e$ is transformed into any permutation from the set $[\sigma]$” and we calculate the genome path probability as a sum of permutation path probabilities, denoted by $\beta_k(\sigma)$. That is,

$$\alpha_k(\sigma) := P(e\mapsto[\sigma] \text{ via } k \text{ events}) = \sum_{d\in D_N} P(e\mapsto d\sigma \text{ via } k \text{ events}) = \sum_{d\in D_N}\beta_k(d\sigma).$$

Given a permutation $\sigma\in S_N$ and a model $(M,w,\mathrm{dist})$, we specify each permutation path probability $\beta_k(\sigma)$ by considering the set $\mathcal{P}_k(\sigma)$ of all $k$-length sequences of permutations, chosen from $M$, that transform $e$ into $\sigma$. Since we assume rearrangement events to be independent, the permutation path probability is then the sum of the probabilities of all such sequences, that is,

$$\beta_k(\sigma) = \sum_{(a_1,a_2,\ldots,a_k)\in\mathcal{P}_k(\sigma)} w(a_1)\,w(a_2)\cdots w(a_k).$$
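For very small $N$ and $k$ the two path probabilities can be computed by brute force, directly from these definitions. The sketch below is illustrative only. It assumes a hypothetical model in which any two circularly adjacent positions may swap, each with weight $1/N$ (this choice satisfies (M1) and (M2)); the function names are not from the original papers.

```python
from fractions import Fraction
from itertools import product

def compose(a, b):
    """Return a∘b as a tuple: i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3)."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def adjacent_swap_model(n):
    """Assumed example model: swap two circularly adjacent positions, weight 1/n each."""
    model = {}
    for i in range(n):
        a = list(range(n))
        j = (i + 1) % n
        a[i], a[j] = a[j], a[i]
        model[tuple(a)] = Fraction(1, n)
    return model

def beta(k, sigma, model):
    """Permutation path probability: sum over all k-step sequences taking e to sigma."""
    identity = tuple(range(len(sigma)))
    total = Fraction(0)
    for seq in product(model, repeat=k):
        perm, prob = identity, Fraction(1)
        for a in seq:
            perm = compose(a, perm)  # each rearrangement acts on the left
            prob *= model[a]
        if perm == sigma:
            total += prob
    return total

def alpha(k, sigma, model, dihedral):
    """Genome path probability: sum of beta over the coset [sigma]."""
    return sum(beta(k, compose(d, sigma), model) for d in dihedral)

N = 4
M = adjacent_swap_model(N)
D = dihedral_group(N)
e = tuple(range(N))
print(beta(2, e, M), alpha(2, e, M, D))  # 1/4 1/2
```

The enumeration grows as $|M|^k$, which is why the algebraic reformulation that follows is needed for anything beyond toy cases.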

We note that permutation path probabilities vary for different elements of $[\sigma]$ (that is, in general, $\beta_k(\sigma)\neq\beta_k(d\sigma)$ for $\sigma\in S_N$, $d\in D_N$). However, genome path probabilities are of course constant on the cosets $[\sigma]$. In fact, the model symmetry conditions (M1) and (M2) ensure that there are bigger classes of permutations that all have the same path probabilities (and thus likelihoods). The following results were established in Serdoz et al. (2017) ((i)) and Terauds and Sumner (2019) ((ii) and (iii)).

Theorem 2.1

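Let $(M,w,\mathrm{dist})$ be a full biological model for evolution. For all $k\in\mathbb{N}_0$ and $\sigma\in S_N$, the following hold.

  • (i) $\alpha_k(\sigma_1)=\alpha_k(\sigma_2)$ for all $\sigma_1,\sigma_2\in[\sigma]=\{d\sigma : d\in D_N\}$.
  • (ii) If the model has the dihedral symmetry property (M1), then $\alpha_k(\sigma_1)=\alpha_k(\sigma_2)$ for all $\sigma_1,\sigma_2$ in the set $[\sigma]_D := \{d_1\sigma d_2 : d_1,d_2\in D_N\}$.
  • (iii) If the model has the dihedral symmetry property (M1) and the reversibility property (M2), then $\alpha_k(\sigma_1)=\alpha_k(\sigma_2)$ for all $\sigma_1,\sigma_2$ in the set $[\sigma]_{DR} := \{d_1\sigma d_2,\ d_1\sigma^{-1}d_2 : d_1,d_2\in D_N\}$.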

It was shown in Sumner et al. (2017) that the combinatorial problem of calculating path probabilities may be converted into a linear algebra problem via the representation theory of the symmetric group algebra, $\mathbb{C}[S_N]$. For full details in the general model setting, we refer the reader to Terauds and Sumner (2019). We recall the essential steps in the derivation here, since we shall undertake a similar procedure in a lower dimensional algebra in the next section.

Here, we use the term algebra to mean a vector space equipped with a bilinear product. In particular, we require the group algebra $\mathbb{C}[S_N]$ consisting of all formal linear combinations of elements of the group $S_N$; this algebra has natural basis $S_N$, and thus dimension $N!$. For detailed background on the symmetric group algebra, and algebras more generally, we refer the reader to Sagan (2001) and Etingof et al. (2011) respectively. The following group algebra elements are key to our calculations.

Definition 2.2

Let $(M,w,\mathrm{dist})$ be a biological model for evolution of genomes with $N$ regions. We define the model element, $s$, and the symmetry element, $z$, of the group algebra $\mathbb{C}[S_N]$ by

$$s := \sum_{a\in M} w(a)\,a \quad\text{and}\quad z := \frac{1}{2N}\sum_{d\in D_N} d.$$

To reformulate the path probabilities, we firstly observe that

$$s^k = \sum_{\tau\in S_N}\beta_k(\tau)\,\tau. \tag{3}$$

Then, for $\sigma\in S_N$, we multiply (3) on the left by $\sigma^{-1}$ to see that $\beta_k(\sigma)$ is the coefficient of $e$ in the expansion of $\sigma^{-1}s^k$. The representation theory of the symmetric group algebra tells us that this is exactly ($\frac{1}{N!}$ times) the trace of the regular representation of $\sigma^{-1}s^k$. That is,

$$\beta_k(\sigma) = \frac{1}{N!}\,\chi_{\mathrm{reg}}(\sigma^{-1}s^k). \tag{4}$$

Thus, for $\sigma\in S_N$, the $k$th genome path probability is

$$\alpha_k(\sigma) = \sum_{d\in D_N}\beta_k(d\sigma) = \frac{1}{N!}\sum_{d\in D_N}\chi_{\mathrm{reg}}(\sigma^{-1}ds^k) = \frac{2N}{N!}\,\chi_{\mathrm{reg}}(\sigma^{-1}zs^k) = \frac{2N}{N!}\sum_{p\vdash N} D_p\,\chi_p(\sigma^{-1}zs^k), \tag{5}$$

where we have used the linearity of the characters to incorporate the symmetry element $z$. The final equality is gained by decomposing the regular representation of $\mathbb{C}[S_N]$ into irreducible representations. Recall that the irreducible representations of $\mathbb{C}[S_N]$ correspond to the integer partitions of $N$ (Sagan 2001, Prop. 1.10.1); here, we denote a partition of $N$ by $p\vdash N$ and index the representations and related objects accordingly. Specifically, for each partition $p\vdash N$, $\rho_p$ is the irreducible representation corresponding to $p$, $D_p$ is its multiplicity (and dimension), and $\chi_p$ is the character of this representation.

The above derivation of the permutation path probabilities follows that of Sumner et al. (2017); in that paper it was also noted that an alternative derivation is possible via the theory of the Fourier transform on $S_N$. That is, one may extend the probability distribution $w$ on $M$ to a function on the whole of $S_N$, notice that the Fourier transform of this extension with respect to an irreducible representation $\rho_p$ is equal to $\rho_p(s)$ and that the extension convolved with itself $k$ times is exactly the function $\beta_k$ on $S_N$, and then apply the Fourier inversion formula to obtain (4).
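The identity (3) says that the path probabilities are the coefficients of powers of the model element, where the group algebra product is the convolution mentioned above. As a rough sketch, one can store group algebra elements as dictionaries, read $\alpha_k(\sigma)$ off from $s^k$, and approximate the likelihood (2) by truncating the series. The adjacent-swap model, the truncation level `kmax` and all function names are assumptions made for illustration; the Poisson weights are taken to be $e^{-T}T^k/k!$.

```python
from collections import defaultdict
from fractions import Fraction
from math import exp

def compose(a, b):
    """Return a∘b as a tuple: i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3)."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def multiply(x, y):
    """Product in the group algebra; elements are dicts {permutation: coefficient}."""
    out = defaultdict(Fraction)
    for g, c in x.items():
        for h, d in y.items():
            out[compose(g, h)] += c * d
    return dict(out)

def genome_path_probabilities(sigma, s, dihedral, kmax):
    """alpha_k(sigma) for k = 0..kmax, read off from the coefficients of s^k."""
    targets = [compose(d, sigma) for d in dihedral]
    power = {tuple(range(len(sigma))): Fraction(1)}  # s^0 = e
    alphas = []
    for _ in range(kmax + 1):
        alphas.append(sum(power.get(t, Fraction(0)) for t in targets))
        power = multiply(s, power)
    return alphas

def likelihood(T, alphas):
    """Truncated series (2) with P(k events in time T) = exp(-T) T^k / k!."""
    total, weight = 0.0, exp(-T)
    for k, a in enumerate(alphas):
        total += float(a) * weight
        weight *= T / (k + 1)
    return total

N = 4
D = dihedral_group(N)
s = {}
for i in range(N):  # assumed model: circularly adjacent swaps, weight 1/N each
    a = list(range(N))
    a[i], a[(i + 1) % N] = a[(i + 1) % N], a[i]
    s[tuple(a)] = Fraction(1, N)

alphas = genome_path_probabilities(tuple(range(N)), s, D, kmax=40)
print(alphas[:3], likelihood(1.0, alphas))
```

Each multiplication costs on the order of $|M|\cdot N!$ operations, so this direct approach is only practical for small $N$; the representation-theoretic formulas in the text replace it with matrix computations.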

Now, for a model with rearrangement reversibility (M2), the irreducible representations of the model element $s$ are diagonalisable (Terauds and Sumner 2019) and we obtain

$$\alpha_k(\sigma) = \frac{2N}{N!}\sum_{p\vdash N} D_p\sum_{i=1}^{r_p}(\lambda_{p,i})^k\,\mathrm{tr}\bigl(\rho_p(\sigma^{-1}z)E_{p,i}\bigr),$$

where for each $p$, the eigenvalues of $\rho_p(s)$ are $\{\lambda_{p,i} : i=1,\ldots,r_p\}$ and, for each $p$ and $i$, $E_{p,i}$ is the projection onto the eigenspace of $\lambda_{p,i}$. Substituting this into the likelihood expression (2) and setting the distribution of events in time to be dist = Poisson(1), we obtain

$$L(T\mid\sigma) = e^{-T}\,\frac{2N}{N!}\sum_{p\vdash N} D_p\sum_{i=1}^{r_p}\mathrm{tr}\bigl(\rho_p(\sigma^{-1}z)E_{p,i}\bigr)\,e^{\lambda_{p,i}T}, \tag{6}$$

where we have observed that the expression is in fact a power series and, accordingly, have been able to eliminate the infinite sum from the expression.
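The step that removes the infinite sum can be written out as follows. This is a sketch under two assumptions: that dist = Poisson(1) means $P(k \text{ events in time } T)=e^{-T}T^k/k!$, and that $c_{p,i}$ is used as shorthand (not notation from the original) for $\frac{2N}{N!}D_p\,\mathrm{tr}(\rho_p(\sigma^{-1}z)E_{p,i})$.

```latex
\begin{aligned}
L(T\mid\sigma)
  &= \sum_{k=0}^{\infty} \alpha_k(\sigma)\, e^{-T}\frac{T^k}{k!}
   = e^{-T}\sum_{p\vdash N}\sum_{i=1}^{r_p} c_{p,i}
     \sum_{k=0}^{\infty}\frac{(\lambda_{p,i}T)^k}{k!} \\
  &= e^{-T}\sum_{p\vdash N}\sum_{i=1}^{r_p} c_{p,i}\, e^{\lambda_{p,i}T}.
\end{aligned}
```

The interchange of the finite sums over $p$ and $i$ with the sum over $k$ is harmless, since each inner series is the exponential series and converges absolutely.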

We note that, for a given model, one need only calculate the eigenvalues of each $\rho_p(s)$ once. Thus the bulk of the calculation burden is now in calculating the partial traces, that is, for any given genome, the set $\{\mathrm{tr}(\rho_p(\sigma^{-1}z)E_{p,i}) : p\vdash N,\ i=1,\ldots,r_p\}$ of coefficients that correspond to the distinct eigenvalues in the likelihood equation.

In implementing the likelihood calculations using the expression (6), we observed that for all genomes, most of these partial trace coefficients were zero (Terauds and Sumner 2019). That is, most of our calculations ended up not contributing to the final likelihood function. In the next section, we explain the occurrence of these zeroes and show that the redundancy can be removed from the computations.

The circular genome algebra

The calculations outlined above are performed in the group algebra $\mathbb{C}[S_N]$, where each permutation in $S_N$ is a distinct basis element. However, we (and, indeed, the computations) do not distinguish between different permutations that represent the same genome, that is, between elements of each equivalence class

$$[\sigma] = \{d\sigma : d\in D_N\},$$

for $\sigma\in S_N$. We now construct a lower-dimensional algebra by combining these equivalent permutations together to form basis elements, called permutation clouds, that correspond to circular genomes. Until otherwise stated, assume that we have fixed a number of regions $N$ and a biological model for evolution $(M,w,\mathrm{dist})$ that has dihedral symmetry and is reversible, that is, satisfies (M1) and (M2).

Definition 3.1

For symmetry element $z\in\mathbb{C}[S_N]$, the circular genome algebra for $N$ regions is

$$\mathcal{A} := z\,\mathbb{C}[S_N] = \{z\tau : \tau\in\mathbb{C}[S_N]\}.$$

Any element of $\mathcal{A}$ of the form $z\sigma$, where $\sigma\in S_N$, is called a permutation cloud.

One easily verifies that $\mathcal{A}$ is a subalgebra of $\mathbb{C}[S_N]$. To see that $\mathcal{A}$ has a natural basis that is in correspondence with the set of genomes, firstly observe that any element of $\mathcal{A}$ can be written as a linear combination of permutation clouds $z\sigma$, for $\sigma\in S_N$. Thus there exists a basis for $\mathcal{A}$ of the form $\{z\sigma_1,\ldots,z\sigma_K\}$, for $\sigma_i\in S_N$. Now,

$$z\sigma_i = \frac{1}{2N}\sum_{d\in D_N} d\sigma_i,$$

so that each basis element is a weighted sum of elements from a set $[\sigma_i]$, representing a particular genome. Since the sets are equivalence classes, for any $\sigma_1,\sigma_2\in S_N$ we have

$$z\sigma_1 = z\sigma_2 \iff \sigma_1,\sigma_2\in[\sigma] \text{ for some } \sigma\in S_N.$$

This means that the set of distinct permutation clouds corresponds to the set of distinct genomes, and these form a basis for $\mathcal{A}$. Finally, noting that for all $\sigma\in S_N$, $|[\sigma]|=2N$, we see that

$$\dim(\mathcal{A}) = |\{[\sigma] : \sigma\in S_N\}| = \frac{N!}{2N} =: K. \tag{7}$$

For the remainder of this section, we fix a basis for $\mathcal{A}$,

$$B := \{z\sigma : \sigma\in S_N\} = \{z\sigma_1,\ldots,z\sigma_K\}, \tag{8}$$

where we have chosen a representative $\sigma_i\in S_N$ of each equivalence class $[\sigma_i]$ for notational convenience. We set the first basis element to correspond to $[e]=D_N$, so that $z\sigma_1=z$. Having used the symmetry of the genomes to construct the algebra, we now incorporate the symmetry of the model to extract some useful properties.

Proposition 3.2

The model and symmetry elements, $s,z\in\mathbb{C}[S_N]$, have the following properties.

  • (i) $z$ is idempotent;
  • (ii) $s$ and $z$ commute.

Proof

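(i) Since $D_N$ is a group, we have

$$z^2 = \frac{1}{(2N)^2}\sum_{d\in D_N}\sum_{f\in D_N} df = \frac{1}{(2N)^2}\sum_{d\in D_N}\sum_{f\in D_N} f = \frac{1}{2N}\sum_{f\in D_N} f = z.$$

(ii) Now we use the dihedral symmetry (M1) of the model to rewrite the model as $m$ base rearrangements, $a_1,\ldots,a_m\in S_N$, along with their symmetries. That is,

$$M = \{da_1d^{-1}, da_2d^{-1}, \ldots, da_md^{-1} : d\in D_N\}. \tag{9}$$

Then, using the same idea as in (i),

$$zs = \frac{1}{2N}\sum_{f\in D_N}\sum_{i=1}^{m}\sum_{d\in D_N} w(a_i)\,f\,da_id^{-1} = \frac{1}{2N}\sum_{f\in D_N}\sum_{i=1}^{m}\sum_{d\in D_N} w(a_i)\,da_id^{-1}f = sz.$$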
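Both properties are easy to confirm numerically for a small case. The following sketch uses exact rational arithmetic and, as an assumption for illustration, the model in which circularly adjacent positions swap with equal weight; `multiply` is a hypothetical helper implementing the group algebra product.

```python
from collections import defaultdict
from fractions import Fraction

def compose(a, b):
    """Return a∘b as a tuple: i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3)."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def multiply(x, y):
    """Product in the group algebra; elements are dicts {permutation: coefficient}."""
    out = defaultdict(Fraction)
    for g, c in x.items():
        for h, d in y.items():
            out[compose(g, h)] += c * d
    return dict(out)

N = 5
D = dihedral_group(N)
z = {d: Fraction(1, 2 * N) for d in D}  # symmetry element

s = {}  # model element for the assumed adjacent-swap model
for i in range(N):
    a = list(range(N))
    a[i], a[(i + 1) % N] = a[(i + 1) % N], a[i]
    s[tuple(a)] = Fraction(1, N)

print(multiply(z, z) == z, multiply(z, s) == multiply(s, z))  # True True
```

A model that breaks (M1), for instance one that allows only the single swap of positions 1 and 2, would in general fail the second check.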

The above properties translate immediately into properties of the representations of z and s.

Corollary 3.3

Let $p\vdash N$ and $\rho_p:\mathbb{C}[S_N]\to M_{D_p}(\mathbb{C})$ denote the corresponding irreducible representation of the symmetric group algebra. Then

  • (i) the only eigenvalues of $\rho_p(z)$ are 0 and 1;
  • (ii) $\rho_p(z)$ and $\rho_p(s)$ are simultaneously diagonalisable, with real eigenvectors.

Proof

Claim (i) is immediate since $z$, and thus $\rho_p(z)$, is idempotent. To show (ii), we firstly choose the representation $\rho_p$ to be orthogonal on $S_N$ (Sagan 2001). Then the rearrangement reversibility of the model ensures that $\rho_p(s)$ is symmetric (Terauds and Sumner 2019) and, similarly, one may verify directly that $\rho_p(z)^T=\rho_p(z)$. Thus, since $z$ and $s$ commute, the representation matrices commute and are simultaneously diagonalisable. In particular, since these matrices are real symmetric, the orthonormal set of simultaneous eigenvectors may be chosen to be real.

Now fix $p\vdash N$ and choose a set of orthonormal vectors $\{v_1,v_2,\ldots,v_{D_p}\}\subset\mathbb{R}^{D_p}$ that are eigenvectors for both $\rho_p(z)$ and $\rho_p(s)$, ordered so that the first $k_p$ of them are eigenvectors for the eigenvalue 1 of $\rho_p(z)$. Take an eigenvalue, $\lambda_{p,i}$ of $\rho_p(s)$ and let $J_i\subseteq\{1,2,\ldots,D_p\}$ be such that $\{v_j : j\in J_i\}$ are the eigenvectors for $\lambda_{p,i}$. Then, for $\sigma\in S_N$, the partial trace for $\lambda_{p,i}$ may be written as

$$\mathrm{tr}\bigl(\rho_p(\sigma^{-1}z)E_{p,i}\bigr) = \sum_{j\in J_i} v_j^T\rho_p(\sigma^{-1}z)v_j, \tag{10}$$

where for each $j$,

$$v_j^T\rho_p(\sigma^{-1}z)v_j = v_j^T\rho_p(\sigma^{-1})\rho_p(z)v_j = \begin{cases} v_j^T\rho_p(\sigma^{-1})v_j, & \text{if } j\le k_p;\\ 0, & \text{if } j>k_p.\end{cases} \tag{11}$$

We see then that $\rho_p(z)$ “knocks out” parts of the partial traces; in particular, it does this independently of the genome. We shall establish shortly that, in total, $\frac{2N-1}{2N}$ of the partial traces are knocked out in this way, thus explaining the observation in Terauds and Sumner (2019) that most of the calculated partial traces were zero.

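The key to performing MLE computations in the symmetric group algebra is the relationship between the character of the regular representation and the identity element, $e\in\mathbb{C}[S_N]$: the character $\chi_{\mathrm{reg}}(\tau)$ counts occurrences of the identity in a generic element $\tau\in\mathbb{C}[S_N]$. Since $z$ is idempotent, it is a left identity (but not a right identity) in the algebra $\mathcal{A}$. We now construct the regular representation, $\rho^{\mathcal{A}}_{\mathrm{reg}}$, of $\mathcal{A}$ and show that its character, the regular character $\chi^{\mathcal{A}}_{\mathrm{reg}}$, functions in exactly this way for the left identity $z\in\mathcal{A}$.

We construct the regular representation of $\mathcal{A}$ via the left action of elements of $\mathcal{A}$ on the basis $B=\{z\sigma_1,\ldots,z\sigma_K\}$ fixed above (8). We need only consider the representation of a generic basis element, $z\sigma$ for $\sigma\in S_N$, since one may extend linearly to all of $\mathcal{A}$. For arbitrary $\sigma\in S_N$, the $ij$th entry of the matrix $\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)$ is the coefficient of $z\sigma_i$ in the expansion of $(z\sigma)(z\sigma_j)$, that is,

$$\bigl(\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)\bigr)_{ij} = \frac{1}{2N}\,\bigl|\{d\in D_N : \sigma d\sigma_j\in[\sigma_i]\}\bigr|. \tag{12}$$

One readily verifies that $\rho^{\mathcal{A}}_{\mathrm{reg}}(z)$ is the $K\times K$ identity matrix and that $\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)^T=\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma^{-1})$. The regular character $\chi^{\mathcal{A}}_{\mathrm{reg}}$ is the trace of the regular representation matrix. For a generic basis element $z\sigma\in\mathcal{A}$,

$$\begin{aligned}
\chi^{\mathcal{A}}_{\mathrm{reg}}(z\sigma) &= \sum_{i=1}^{K}\bigl(\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)\bigr)_{ii} = \frac{1}{2N}\sum_{i=1}^{K}\bigl|\{d_1\in D_N : \sigma d_1\sigma_i\in[\sigma_i]\}\bigr| \\
&= \frac{1}{2N}\sum_{i=1}^{K}\sum_{d_2\in D_N}\bigl|\{d_1\in D_N : \sigma d_1\sigma_i=d_2\sigma_i\}\bigr| = \frac{1}{2N}\sum_{i=1}^{K}\sum_{d_2\in D_N}\bigl|\{d_1\in D_N : \sigma d_1=d_2\}\bigr| \\
&= \frac{1}{2N}\,K\sum_{d_2\in D_N}\bigl|\{d_1\in D_N : d_1=\sigma^{-1}d_2\}\bigr| = \begin{cases} K, & \text{if } \sigma\in D_N;\\ 0, & \text{if } \sigma\notin D_N.\end{cases}
\end{aligned} \tag{13}$$

Since $z\sigma=z$ if and only if $\sigma\in D_N$, this shows that we can use the character of the regular representation of the algebra $\mathcal{A}$ to track coefficients of the left identity $z$, just as we do for the identity $e$ in $\mathbb{C}[S_N]$. Further, we can express the regular character of $\mathcal{A}$ as a sum over the irreducible characters of $\mathbb{C}[S_N]$, and thus see that the regular characters of $\mathcal{A}$ and $\mathbb{C}[S_N]$ coincide on $\mathcal{A}$.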
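The matrix entries (12) and the character values (13) can be tabulated directly for small $N$. The sketch below assumes 0-based tuples for permutations and chooses, as an arbitrary convention, the lexicographically smallest permutation of each coset as its representative; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def compose(a, b):
    """Return a∘b as a tuple: i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3)."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def cloud_basis(n, dihedral):
    """Coset representatives and a map from each permutation to its coset index."""
    reps, index = [], {}
    for p in permutations(range(n)):
        rep = min(compose(d, p) for d in dihedral)
        if rep not in reps:
            reps.append(rep)
        index[p] = reps.index(rep)
    return reps, index

def regular_matrix(sigma, reps, index, dihedral):
    """Entry (i, j): coefficient of z*sigma_i in (z*sigma)(z*sigma_j), as in (12)."""
    K = len(reps)
    mat = [[Fraction(0)] * K for _ in range(K)]
    for j, rep in enumerate(reps):
        for d in dihedral:
            i = index[compose(sigma, compose(d, rep))]
            mat[i][j] += Fraction(1, len(dihedral))
    return mat

def trace(mat):
    return sum(mat[i][i] for i in range(len(mat)))

N = 4
D = dihedral_group(N)
reps, index = cloud_basis(N, D)
K = len(reps)
characters = {sigma: trace(regular_matrix(sigma, reps, index, D))
              for sigma in permutations(range(N))}
print(K, characters[tuple(range(N))])  # 3 3
```

With this ordering the first representative is the identity, so the first basis element is $z$ itself, matching the convention fixed after (8).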

Proposition 3.4

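For arbitrary $\tau\in\mathcal{A}$,

  • (i) $\frac{1}{K}\chi^{\mathcal{A}}_{\mathrm{reg}}(\tau)$ is the coefficient of $z$ in $\tau$;
  • (ii) $\chi^{\mathcal{A}}_{\mathrm{reg}}(\tau)=\sum_{p\vdash N}\chi_p(e)\chi_p(\tau)=\sum_{p\vdash N}D_p\chi_p(\tau)=\chi_{\mathrm{reg}}(\tau)$.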

Proof

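(i) Given $\tau\in\mathcal{A}$ and the basis $B$ from (8), there exist $c_1,\ldots,c_K\in\mathbb{C}$ such that $\tau=c_1z+c_2z\sigma_2+\cdots+c_Kz\sigma_K$. Then

$$\chi^{\mathcal{A}}_{\mathrm{reg}}(\tau) = c_1\chi^{\mathcal{A}}_{\mathrm{reg}}(z)+\sum_{i=2}^{K}c_i\,\chi^{\mathcal{A}}_{\mathrm{reg}}(z\sigma_i) = c_1K,$$

from (13).

(ii) It suffices to consider a generic basis element $z\sigma\in\mathcal{A}$, since the characters are linear. We shall apply the dual orthogonality relations on the irreducible characters of $S_N$ (see for example (James and Liebeck 2001, Thm. 16.4)), given for $\sigma,\tau\in S_N$ by

$$\sum_{p\vdash N}\chi_p(\sigma)\overline{\chi_p(\tau)} = \delta(\sigma,\tau)\,\bigl|\mathrm{cent}_{S_N}(\sigma)\bigr|, \tag{14}$$

where $\mathrm{cent}_{S_N}(\sigma):=\{\gamma\in S_N : \gamma\sigma=\sigma\gamma\}$ is the centraliser of $\sigma$ and the map $\delta:S_N\times S_N\to\{0,1\}$ is defined by

$$\delta(\sigma,\tau) = \begin{cases} 1, & \text{if } \sigma,\tau \text{ are in the same conjugacy class in } S_N;\\ 0, & \text{otherwise.}\end{cases} \tag{15}$$

Recall that for $p\vdash N$ and $\tau\in S_N$, $\overline{\chi_p(\tau)}=\chi_p(\tau^{-1})$.3 Then for $\sigma\in S_N$, we have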

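$$\sum_{p\vdash N}\chi_p(e)\chi_p(z\sigma) = \frac{1}{2N}\sum_{d\in D_N}\sum_{p\vdash N}\chi_p(e)\overline{\chi_p(\sigma^{-1}d^{-1})} = \frac{1}{2N}\sum_{d\in D_N}\delta(e,\sigma^{-1}d^{-1})\,\bigl|\mathrm{cent}_{S_N}(e)\bigr| = \frac{N!}{2N}\,\bigl|\{d\in D_N : d\sigma=e\}\bigr| = \chi^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)$$

by (13), recalling that $K=\frac{N!}{2N}$.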

An immediate consequence of the above is an expression for the dimension of $\mathcal{A}$ in terms of the characters $\chi_p$ of $\mathbb{C}[S_N]$:

$$\dim(\mathcal{A}) = K = \frac{N!}{2N} = \sum_{p\vdash N}\chi_p(e)\chi_p(z) = \sum_{p\vdash N}D_pk_p, \tag{16}$$

where, for each $p\vdash N$, $k_p$ is the multiplicity of the eigenvalue 1 of $\rho_p(z)$. The first part of this can also be seen directly from the dual orthogonality relations.

To perform the MLE calculations in the algebra $\mathcal{A}$ efficiently, we will need a decomposition of the regular character in terms of irreducible characters of $\mathcal{A}$. Firstly, we’ll observe that irreducible submodules of $\mathcal{A}=z\,\mathbb{C}[S_N]$ can be produced by acting with $z$ on the irreducible submodules of $\mathbb{C}[S_N]$. This is straightforward and, in fact, true in a more general context (see, for example, Steinberg 2016, Lemma 4.15), however we include the details here since we’ll use them in our subsequent constructions. We denote the irreducible submodules of $\mathbb{C}[S_N]$ by $V_p\cong\mathbb{C}^{D_p}$ for $p\vdash N$, that is, we write

$$\mathbb{C}[S_N]\cong\bigoplus_{p\vdash N}D_pV_p. \tag{17}$$

Theorem 3.5

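The non-trivial modules gained by acting with the symmetry element $z$ on the irreducible submodules of $\mathbb{C}[S_N]$ are irreducible modules of the genome algebra $\mathcal{A}$.

Proof

Let $p\vdash N$. As above, we may take a set $\{v_1,v_2,\ldots,v_{D_p}\}$ of (real, orthonormal, linearly independent) eigenvectors for both $\rho_p(s)$ and $\rho_p(z)$, ordered such that the first $k_p$ of them correspond to the eigenvalue 1 of $\rho_p(z)$. Then $V_p=\mathrm{span}_{\mathbb{C}}\{v_i : i=1,\ldots,D_p\}$ and

$$W_p := z\cdot V_p = \mathrm{span}_{\mathbb{C}}\{\rho_p(z)v_i : i=1,\ldots,D_p\} = \mathrm{span}_{\mathbb{C}}\{v_i : i=1,\ldots,k_p\}. \tag{18}$$

It is clear that $W_p$ is an $\mathcal{A}$-module; we need to show that it is either $\{0\}$ or irreducible. Suppose that there exists $U_p\subseteq W_p$ such that $U_p$ is an $\mathcal{A}$-module, and $0\neq u\in U_p$. Then

$$U_p\supseteq\mathrm{span}_{\mathbb{C}}\{\rho_p(z\sigma)u : \sigma\in S_N\} = z\cdot\mathrm{span}_{\mathbb{C}}\{\rho_p(\sigma)u : \sigma\in S_N\} = z\cdot V_p = W_p,$$

so that $U_p=W_p$.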

It was shown in Terauds and Sumner (2019) that there exist $p\vdash N$ for all $N>3$ such that $\chi_p(z)=k_p=0$; that is, there are always some $\mathbb{C}[S_N]$-modules that are projected down to zero in $\mathcal{A}$. We note that this is not true in the more general case considered in Sect. 4 (for example if $z$ is constructed from a different symmetry group).

Now, the dimension expression (16) suggests that we will not be able to decompose the algebra $\mathcal{A}$ into a direct sum of irreducible submodules as we can for $\mathbb{C}[S_N]$ (17). By (Etingof et al. 2011, Thm. 3.5.8), if this were possible with the irreducible submodules $W_p$ from above, then the dimension of $\mathcal{A}$ would be $\sum_{p\vdash N}k_p^2$. We have not yet verified here that these $W_p$ comprise all irreducible submodules of $\mathcal{A}$, nor that they are all distinct (not isomorphic to one another), but this is indeed the case (Steinberg 2016, Thm. 4.23). The difference between the dimension of $\mathcal{A}$ and that gained from the irreducible modules here is signalling that not all of the information about $\mathcal{A}$ can be represented by the action of $\mathcal{A}$: in this case, the left action of $\mathcal{A}$ on its irreducible modules, and on itself, is not injective.

To see this, let $W$ be an irreducible module of $\mathcal{A}$. Then, since $z$ is a left identity in $\mathcal{A}$, $z$ must act as the identity on $W$. But then, for any $z\sigma,z\sigma'\in\mathcal{A}$ such that $z\sigma z=z\sigma'z$,

$$(z\sigma)\cdot w = (z\sigma)\cdot(z\cdot w) = (z\sigma z)\cdot w = (z\sigma'z)\cdot w = (z\sigma')\cdot w, \tag{19}$$

for all $w\in W$.

From Theorem 2.1(ii), such $z\sigma\neq z\sigma'\in\mathcal{A}$ correspond to physically distinct genomes that share the same path probabilities and likelihood functions: we have $z\sigma z=z\sigma'z$ if and only if $\sigma'\in[\sigma]_D=\{d\sigma d' : d,d'\in D_N\}$.

In the language of algebras, $\mathcal{A}$ has a non-trivial radical (Etingof et al. 2011, Def. 3.5.1), since (for $N>3$) there are non-zero elements $z\sigma-z\sigma'\in\mathcal{A}$ that annihilate all irreducible modules of $\mathcal{A}$. As a concrete example, consider the following.

Example 3.6

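Let $\sigma=(1,2)$, $\sigma'=(2,3)\in S_N$. Setting $r=(1,2,\ldots,N)\in D_N$, we observe that $\sigma'=r\sigma r^{-1}$, so that $z\sigma z=z\sigma'z$. But there exists no $d\in D_N$ such that $\sigma'=d\sigma$, and thus $z\sigma\neq z\sigma'$.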
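This example can be confirmed by explicit multiplication in the group algebra. A minimal sketch for $N=5$, assuming 0-based tuples so that the transpositions $(1,2)$ and $(2,3)$ become swaps of positions 0,1 and 1,2 respectively; `multiply` is an illustrative helper.

```python
from collections import defaultdict
from fractions import Fraction

def compose(a, b):
    """Return a∘b as a tuple: i -> a[b[i]] (b acts first)."""
    return tuple(a[i] for i in b)

def dihedral_group(n):
    """All 2n rotations and reflections of n positions (n >= 3)."""
    rotations = [tuple((i + t) % n for i in range(n)) for t in range(n)]
    reflections = [tuple((t - i) % n for i in range(n)) for t in range(n)]
    return rotations + reflections

def multiply(x, y):
    """Product in the group algebra; elements are dicts {permutation: coefficient}."""
    out = defaultdict(Fraction)
    for g, c in x.items():
        for h, d in y.items():
            out[compose(g, h)] += c * d
    return dict(out)

N = 5
z = {d: Fraction(1, 2 * N) for d in dihedral_group(N)}
sigma = (1, 0, 2, 3, 4)        # the transposition (1,2)
sigma_prime = (0, 2, 1, 3, 4)  # the transposition (2,3)

cloud = multiply(z, {sigma: Fraction(1)})              # z sigma
cloud_prime = multiply(z, {sigma_prime: Fraction(1)})  # z sigma'

same_two_sided = multiply(cloud, z) == multiply(cloud_prime, z)
same_cloud = cloud == cloud_prime
print(same_two_sided, same_cloud)  # True False
```

The two clouds are different genomes, yet they become indistinguishable once multiplied on the right by $z$, which is exactly the situation described by (19).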

For our practical purposes, this is perfect: the algebra sees genomes as distinct entities, but their representations do not distinguish between genomes corresponding to an equivalence class $[\sigma]_D$, whose likelihood functions are the same. Further, whilst $z\sigma$ and $z\sigma'$ correspond to distinct genomes, if we consider them as rearrangements, they are not distinct, since they have the same action. We shall return to this presently, when we define models in the genome algebra.

Proposition 3.7

Let $z\sigma,z\sigma'\in\mathcal{A}$. Then $\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma)=\rho^{\mathcal{A}}_{\mathrm{reg}}(z\sigma')$ if and only if $\sigma'\in[\sigma]_D$.

Proof

For the reverse implication, we argue as above, replacing $w$ in (19) by each basis element $z\sigma_i$ to verify that the matrices are the same. Conversely, if the regular representations coincide, then we immediately have that $(z\sigma)z=(z\sigma')z$.

We now explicitly consider the irreducible representations of $\mathcal{A}$ on the irreducible submodules and use these to rewrite the regular character of $\mathcal{A}$ in terms of irreducible characters of $\mathcal{A}$. Let $p\vdash N$ such that $k_p>0$ and consider the module $W_p$ of $\mathcal{A}$ which, as in the proof of Theorem 3.5, has a basis $\{v_1,\ldots,v_{k_p}\}\subset\mathbb{R}^{D_p}$ of orthonormal eigenvectors. Now, the action of $\mathcal{A}$ on $W_p=z\cdot V_p$ is inherited from the action of $\mathbb{C}[S_N]$ on $V_p$, so for arbitrary $\tau\in\mathcal{A}$, we define the $(k_p\times k_p)$ representation matrix $\rho^{\mathcal{A}}_p(\tau)$ on $W_p$ via the action of $\rho_p(\tau)$ on the basis vectors $v_j$:

$$\rho^{\mathcal{A}}_p(\tau)_{ij} := v_i^T\rho_p(\tau)v_j.$$

More concisely, setting $Q_p$ to be the $(D_p\times k_p)$ matrix with $\{v_1,\ldots,v_{k_p}\}$ as columns, we have

$$\rho^{\mathcal{A}}_p(\tau) = Q_p^T\rho_p(\tau)Q_p. \tag{20}$$

Clearly (also c.f. (19)), $\rho^{\mathcal{A}}_p(z)$ is the $(k_p\times k_p)$ identity matrix for each such $p\vdash N$. For each $p\vdash N$ such that $k_p=0$, we formally define $\rho^{\mathcal{A}}_p$ to be the zero representation. Now we may calculate the irreducible characters $\chi^{\mathcal{A}}_p$ of $\mathcal{A}$ and see that they coincide with the irreducible characters $\chi_p$ of $\mathbb{C}[S_N]$ restricted to $\mathcal{A}$.

Proposition 3.8

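For each $p\vdash N$ and $\tau\in\mathcal{A}$, $\chi^{\mathcal{A}}_p(\tau)=\chi_p(\tau)$.

Proof

Let $p\vdash N$. By linearity, we need only verify the claim on a generic basis element, $z\sigma\in\mathcal{A}$. Again utilising the orthonormal eigenvectors $\{v_1,\ldots,v_{D_p}\}$ of $\rho_p(z)$, where those for $i\le k_p$ correspond to the eigenvalue 1 and the remainder to the eigenvalue 0, we have

$$\chi^{\mathcal{A}}_p(z\sigma) = \sum_{i=1}^{k_p}\rho^{\mathcal{A}}_p(z\sigma)_{ii} = \sum_{i=1}^{k_p}v_i^T\rho_p(z\sigma)v_i = \sum_{i=1}^{D_p}v_i^T\rho_p(z\sigma z)v_i = \chi_p(z\sigma z) = \chi_p(z\sigma),$$

where in the final step we have used the cyclicity of the trace and the idempotency of $z$ (Proposition 3.2).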

Combining Propositions 3.4 and 3.8 gives the desired character decomposition.

Corollary 3.9

For arbitrary $\tau \in \mathcal{A}$,

$$\chi_{\mathrm{reg}}^{\mathcal{A}}(\tau) = \sum_{p \vdash N} D_p \chi_p^{\mathcal{A}}(\tau).$$

Having defined and decomposed the regular character of $\mathcal{A}$, we are ready to return to the likelihood calculations. Using the equivalence of the characters of $\mathcal{A}$ and $\mathbb{C}[S_N]$ on the algebra $\mathcal{A}$, along with the interplay between the genome and model symmetry, we now verify that we may work entirely in $\mathcal{A}$ to calculate the genome path probabilities and thus the likelihood functions, as defined in the previous section (2).

Theorem 3.10

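Let $\sigma \in S_N$ and $k \in \mathbb{N}_0$. Then

$$\alpha_k(\sigma) = \frac{2N}{N!} \chi_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k) = \frac{2N}{N!} \sum_{p \vdash N} D_p \chi_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k). \tag{21}$$

Proof

From (5), $\alpha_k(\sigma) = \frac{2N}{N!} \chi_{\mathrm{reg}}(\sigma^{-1}\mathbf{z}\mathbf{s}^k) = \frac{2N}{N!} \chi_{\mathrm{reg}}(\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k)$, since $\mathbf{z}$ and $\mathbf{s}$ commute, $\mathbf{z}$ is idempotent and the trace is cyclic. The first equality is then clear from Proposition 3.4 and the second from Corollary 3.9.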

We have mentioned the importance of the ‘identity counting’ property of the regular character, that is, Proposition 3.4 (i), but this combinatorial component is somewhat hidden in the proof of Theorem 3.10. To highlight it, one may begin with the identity (3) stated in the previous section and, for any given genome $\mathbf{z}\sigma$ ($\sigma \in S_N$), multiply by $\mathbf{z}\sigma^{-1}\mathbf{z}$ to obtain

$$\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k = \frac{1}{2N} \sum_{\tau \in S_N} \alpha_k(\tau)\, \mathbf{z}\sigma^{-1}\tau.$$

By observing that there are exactly $2N$ values of $\tau \in S_N$ for which $\mathbf{z}\sigma^{-1}\tau = \mathbf{z}$, one thus sees directly that the coefficient of $\mathbf{z}$ in the expansion of $\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k$ is $\alpha_k(\sigma)$.

Note that we could have simplified the above character expression (21) a little, that is,

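$$\chi_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k) = \chi_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{s}^k\mathbf{z}) = \chi_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{s}^k).$$

However, as we did in the algebra $\mathbb{C}[S_N]$, we want to diagonalise the matrices representing the model element, namely the matrices $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$. So we keep the middle $\mathbf{z}$ and write

$$\chi_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}\mathbf{z}\mathbf{s}^k) = \mathrm{tr}\left(\rho_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}(\mathbf{z}\mathbf{s})^k)\right) = \mathrm{tr}\left(\rho_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1})\, \rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})^k\right).$$

For each $p \vdash N$, as in the proof of Corollary 3.3, we can choose $\rho_p(\mathbf{s})$ to be symmetric; thus by the definition (20) each matrix $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$ is symmetric and thus diagonalisable. Then we obtain

$$\alpha_k(\sigma) = \frac{2N}{N!} \sum_{p \vdash N} D_p \sum_{i=1}^{R_p} \lambda_{p,i}^k\, \mathrm{tr}\left(\rho_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}) E_{p,i}^{\mathcal{A}}\right), \tag{22}$$

where $E_{p,i}^{\mathcal{A}}$ is the projection onto the eigenspace of the $i$th eigenvalue, $\lambda_{p,i}$, of $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$.

Now, finally substituting the path probabilities (22) into the theoretical likelihood expression (2), we obtain

$$L(T|\sigma) = e^{-T} \frac{2N}{N!} \sum_{p \vdash N} D_p \sum_{i=1}^{R_p} \mathrm{tr}\left(\rho_p^{\mathcal{A}}(\mathbf{z}\sigma^{-1}) E_{p,i}^{\mathcal{A}}\right) e^{\lambda_{p,i} T}. \tag{23}$$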

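It is clear from Theorem 3.10 that the likelihood expression (23), involving only elements of the genome algebra $\mathcal{A}$, is equal to that (6) gained via the group algebra $\mathbb{C}[S_N]$. We now show that the above is, really, a simplified version of (6): that is, by working in the smaller algebra we have eliminated the many eigenvalue terms that occur with zero coefficients.

Proposition 3.11

For each $p \vdash N$ such that $W_p \neq \{0\}$, the eigenvalues of the matrix $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$ are exactly the eigenvalues of $\rho_p(\mathbf{s})$ that occur with non-zero coefficient in the likelihood expression (6).

Proof

Let $p \vdash N$ such that $W_p \neq \{0\}$. As above, take the set $\{v_1, v_2, \ldots, v_{D_p}\} \subset \mathbb{R}^{D_p}$ of orthonormal eigenvectors for both $\rho_p(\mathbf{z})$ and $\rho_p(\mathbf{s})$, with the first $k_p$ corresponding to the eigenvalue 1 of $\rho_p(\mathbf{z})$, and form the matrix $Q_p$ with the first $k_p$ vectors as columns. Then, as in (20),

$$\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s}) = Q_p^T \rho_p(\mathbf{z}\mathbf{s}) Q_p = Q_p^T \rho_p(\mathbf{s}\mathbf{z}) Q_p = Q_p^T \rho_p(\mathbf{s}) Q_p = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{k_p} \end{pmatrix}, \tag{24}$$

where each $\lambda_i$ is clearly an eigenvalue of both $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$ and $\rho_p(\mathbf{s})$ (and the $\lambda_i$ are not necessarily distinct). Suppose $\lambda$ is an eigenvalue of $\rho_p(\mathbf{s})$ that does not appear in the matrix of (24). Then $\lambda$ has corresponding eigenvector(s) $\{v_j : j \in J\}$, where $J \subseteq \{k_p+1, k_p+2, \ldots, D_p\}$. But then, for any $\sigma \in S_N$, the coefficient of the $\lambda$ term in the likelihood expression is the partial trace

$$\sum_{j \in J} v_j^T \rho_p(\sigma^{-1}\mathbf{z}) v_j = 0,$$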

by (10) and (11).

Note that, although in the proof of Proposition 3.11 we construct each $\rho_p^{\mathcal{A}}(\mathbf{z}\mathbf{s})$ as a diagonal matrix (in which case the projections onto the eigenspaces would be diagonal matrices of 1s and 0s), we do this only to verify that the representation has the required properties, and we utilise the eigenvectors of the representation $\rho_p(\mathbf{s})$. In practice, the whole point is to not calculate the much bigger representations $\rho_p(\mathbf{s})$. That is, when implementing calculations, we would expect to construct a basis for each irreducible module $W_p$ directly, hence the general form of the projections in (23).

We note that the equivalence of the path probabilities and thus likelihoods on the classes $[\sigma]_D$ and $[\sigma]_{DR}$ stated in Theorem 2.1 can alternatively be obtained by working directly in the genome algebra $\mathcal{A}$. We omit the proof here since we shall prove a more general version of the result in Sect. 4.

Since $\dim(\mathbb{C}[S_N]) = N! = \sum_{p \vdash N} D_p^2$ and $\dim(\mathcal{A}) = \frac{N!}{2N} = \sum_{p \vdash N} D_p k_p$, we have

$$\sum_{p \vdash N} D_p k_p = \sum_{p \vdash N} D_p \frac{D_p}{2N}. \tag{25}$$

Note that this does not imply that $k_p = \frac{1}{2N} D_p$ for each (or any) $p \vdash N$, rather that on average, and asymptotically, the dimension of each irreducible submodule $W_p$ of $\mathcal{A}$ is $\frac{1}{2N}$th of the dimension of the irreducible submodule $V_p$ of $\mathbb{C}[S_N]$. To put this another way, on average, $\frac{2N-1}{2N}$ths of the computations in the group algebra, as documented in Terauds and Sumner (2019), resulted in zeroes.

Given that the dimension of the algebra $\mathcal{A}$ is still $\frac{N!}{2N}$, this does not significantly reduce the computational complexity. However, since the multiplicity of the irreducible submodules in $\mathcal{A}$ is the same as in the group algebra (25), the reduction in the dimension of the irreducible submodules is (relatively) much larger than the reduction in total dimension.

Example 3.12

Consider $N = 6$. There are $N! = 720$ permutations in $S_6$, so the dimension of the regular representation of $\mathbb{C}[S_6]$ is 720. The dimensions of the irreducible modules $V_p$ of $\mathbb{C}[S_6]$ (given as a list rather than a set as they are not all distinct) are

$$[D_p : p \vdash 6] = [1, 5, 9, 10, 5, 16, 10, 5, 9, 5, 1].$$

Moving to the genome algebra, there are $\frac{N!}{2N} = 60$ distinct genomes, so the dimension of the regular representation of $\mathcal{A}$ is 60. The dimensions of the corresponding irreducible modules $W_p = \mathbf{z} \cdot V_p$ of $\mathcal{A}$ are

$$[k_p : p \vdash 6] = [1, 0, 2, 0, 0, 1, 1, 2, 0, 1, 0].$$

Thus, for any rearrangement model, each likelihood expression will be a sum of at most eight terms, corresponding to at most eight distinct eigenvalues.
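The dimension counts in this example can be checked by direct arithmetic. The sketch below only re-adds the two lists quoted above; the variable names are illustrative and nothing here computes the modules themselves.

```python
from math import factorial

N = 6
# Lists copied from Example 3.12.
D = [1, 5, 9, 10, 5, 16, 10, 5, 9, 5, 1]   # D_p = dim V_p for C[S_6]
k = [1, 0, 2, 0, 0, 1, 1, 2, 0, 1, 0]      # k_p = dim W_p = dim(z . V_p)

dim_group_algebra = sum(d * d for d in D)               # sum of D_p^2
dim_genome_algebra = sum(d * m for d, m in zip(D, k))   # sum of D_p * k_p
max_terms = sum(k)                                      # bound on eigenvalue terms

print(dim_group_algebra, dim_genome_algebra, max_terms)  # 720 60 8
```

The three totals match $N!$, $\frac{N!}{2N}$ and the bound of eight terms stated in the example.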

We note that such dimension reductions are less striking for larger $N$. In any case, we see a significant theoretical gain here: the genome algebra $\mathcal{A}$ incorporates the symmetry of the genomes and models into a unified framework, within which the problem can be formulated and the computations performed. To highlight this, we next consider the regular representations of $\mathbf{s}$ in the algebra $\mathbb{C}[S_N]$ and of $\mathbf{z}\mathbf{s}$ in the algebra $\mathcal{A}$ as Markov matrices; then we conclude this section by re-formulating the model in the genome algebra framework.

The Markov interpretation

In the group algebra, $\mathbb{C}[S_N]$, the rows and columns of the regular representation are determined by the $N!$ permutations $\sigma_i \in S_N$. In particular,

$$\rho_{\mathrm{reg}}(\mathbf{s}) = \sum_{a \in \mathcal{M}} w(a)\, \rho_{\mathrm{reg}}(a),$$

where for each rearrangement permutation $a \in \mathcal{M}$, the $ij$th entry of $\rho_{\mathrm{reg}}(a)$ is 1 if $a\sigma_j = \sigma_i$ and 0 otherwise, so that the $\rho_{\mathrm{reg}}(a)$ matrices have exactly one ‘1’ in each row and column. Then $\rho_{\mathrm{reg}}(\mathbf{s})$, as a convex sum of Markov matrices, is itself a Markov matrix. The $j$th column of $\rho_{\mathrm{reg}}(\mathbf{s})$ contains $|\mathcal{M}|$ non-zero entries, each equal to a unique $w(a)$, since for the distinct permutations $a \in \mathcal{M}$, the permutations $a\sigma_j$ are all distinct.

Thus $\rho_{\mathrm{reg}}(\mathbf{s})$ is the transition matrix of a discrete Markov chain where the states are the $N!$ permutations in $S_N$ and the $ij$th entry is the probability of permutation $\sigma_j$ transitioning into permutation $\sigma_i$ via one rearrangement chosen from the model $\mathcal{M}$. That is,

$$\rho_{\mathrm{reg}}(\mathbf{s})_{ij} = \begin{cases} w(a) & \text{if } a = \sigma_i\sigma_j^{-1} \in \mathcal{M}, \\ 0 & \text{otherwise.} \end{cases}$$

It is clear from this formulation that the matrix $\rho_{\mathrm{reg}}(\mathbf{s})$ is symmetric if and only if the model has the rearrangement reversibility property (M2). Thus, since the stationary distribution on the Markov chain is the uniform distribution on the states, the reversibility property (M2) of the model is equivalent to reversibility of the Markov model.
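As a small illustration of this construction, the following sketch builds the transition matrix on permutations for $N = 4$. The model used here (the four cyclically adjacent transpositions with uniform weights), the 0-based position labels, the composition convention and all function names are assumptions made for the example, not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def transposition(n, i, j):
    """The permutation of 0..n-1 that swaps i and j."""
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def regular_matrix(n, model):
    """Matrix whose (i, j) entry is w(a) when a * sigma_j = sigma_i, else 0."""
    perms = list(permutations(range(n)))
    index = {s: i for i, s in enumerate(perms)}
    P = [[Fraction(0)] * len(perms) for _ in perms]
    for j, sigma in enumerate(perms):
        for a, w in model.items():
            P[index[compose(a, sigma)]][j] += w
    return P

n = 4
# Hypothetical model: swaps of cyclically adjacent positions, uniform weights.
model = {transposition(n, i, (i + 1) % n): Fraction(1, n) for i in range(n)}
P = regular_matrix(n, model)
size = len(P)
column_sums = [sum(P[i][j] for i in range(size)) for j in range(size)]
is_symmetric = all(P[i][j] == P[j][i] for i in range(size) for j in range(size))
```

Every rearrangement in this toy model is its own inverse, so the resulting matrix is symmetric, in line with the reversibility remark above.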

Now, in the algebra A, the corresponding matrix representing the model element is

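$$\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}\mathbf{s}) = \sum_{a \in \mathcal{M}} w(a)\, \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a). \tag{26}$$

As above, the matrices on the right hand side represent basis elements of the algebra, here $\mathbf{z}a$ for $a \in \mathcal{M}$. Although the basis elements here do not form a group (so their regular representations are not, in general, zero-one matrices), each of the $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a)$ is again a Markov matrix: for a given $a \in \mathcal{M}$, the $ij$th entry of $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a)$ is the coefficient of $\mathbf{z}\sigma_i$ in the expansion of $(\mathbf{z}a)(\mathbf{z}\sigma_j)$ and, since $\mathbf{z} = \frac{1}{2N} \sum_{d \in D_N} d$, the expansion is a convex sum. Thus the entries in each column of each $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a)$ sum to one and $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}\mathbf{s})$, as a convex sum of Markov matrices, is indeed a Markov matrix.

Each basis element $\mathbf{z}\sigma_i$ corresponds to a genome, so $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}\mathbf{s})$ is the transition matrix of a Markov chain where the states are genomes. The $ij$th entry, which we calculated as the proportion of the expansion of $(\mathbf{z}\mathbf{s})(\mathbf{z}\sigma_j)$ that is equal to $\mathbf{z}\sigma_i$, is of course the probability of the genome $\mathbf{z}\sigma_j$ transitioning into the genome $\mathbf{z}\sigma_i$ in one step, via the model.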
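The genome-level Markov matrix can be sketched in the same spirit. Here each genome is stored as a canonical instance of its class $\{d\sigma : d \in D_N\}$, and the entry for a rearrangement $a$ is obtained by averaging over the instances $a d \sigma_j$. The two-rearrangement model with weights $\frac{2}{3}$ and $\frac{1}{3}$, the 0-based labels, the composition convention and the helper names are assumptions for illustration only.

```python
from fractions import Fraction
from itertools import permutations

def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def dihedral(n):
    """The 2n rotations and reflections of n positions on a circle."""
    rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
    reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
    return rotations + reflections

def transposition(n, i, j):
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def coset_rep(g, D):
    """Canonical instance of the genome {d g : d in D}."""
    return min(compose(d, g) for d in D)

def genome_matrix(n, model):
    """Genomes (as canonical instances) and their one-step transition matrix."""
    D = dihedral(n)
    genomes = sorted({coset_rep(g, D) for g in permutations(range(n))})
    index = {g: i for i, g in enumerate(genomes)}
    P = [[Fraction(0)] * len(genomes) for _ in genomes]
    for j, g in enumerate(genomes):
        for a, w in model.items():
            for d in D:
                # (z a)(z g) averages over the instances a d g, weight 1/|D| each
                target = coset_rep(compose(a, compose(d, g)), D)
                P[index[target]][j] += w / len(D)
    return genomes, P

n = 6
# Hypothetical model: one two-region and one three-region inversion instance.
model = {transposition(n, 0, 1): Fraction(2, 3), transposition(n, 0, 2): Fraction(1, 3)}
genomes, P = genome_matrix(n, model)
K = len(genomes)
column_sums = [sum(P[i][j] for i in range(K)) for j in range(K)]
is_symmetric = all(P[i][j] == P[j][i] for i in range(K) for j in range(K))
```

For $N = 6$ this gives a $60 \times 60$ matrix, one state per genome, whose columns each sum to one.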

Permutation clouds: a unifying concept

In the permutation approach detailed in Sect. 2, we considered rearrangement events to be individual permutations acting on individual permutations. In the setting of the genome algebra, we represent both genomes and rearrangements by permutation clouds, each of which is a sum of permutations weighted by their probabilities ($\mathbf{z}\sigma = \frac{1}{2N} \sum_{d \in D_N} d\sigma$). A single rearrangement event is here modelled by a permutation cloud $\mathbf{z}a$ acting on a permutation cloud $\mathbf{z}\sigma$. Mathematically, this event results in a convex combination of permutation clouds $\sum_i c_i \mathbf{z}\sigma_i$; biologically, it results in one of the genomes $\mathbf{z}\sigma_i$, according to the probability distribution given by the coefficients $c_i$.

The permutation cloud view of circular genomes seems to us quite natural. To observe a genome, we fix an orientation and a reference frame, and assign to it a single permutation (any one, from the appropriate equivalence class $[\sigma]$, with probability $\frac{1}{2N}$). We refer to this as an instance of the genome. Theoretically, however, the genome exists simultaneously as all of its possible physical orientations in space; it is the cloud, $\mathbf{z}\sigma$.

What about rearrangements? For a rearrangement permutation $a \in S_N$ and $d \in D_N$, the result of the action $da$ on $\sigma \in S_N$ is $d(a\sigma) \in [a\sigma]$, that is, it results in the same genome as $a$ acting on $\sigma$. So we can think of $\mathbf{z}a$ acting on $\mathbf{z}\sigma$ as encompassing (all orientations of ($a$ acting on (all orientations of $\sigma$))).

Of course, the action of $\mathbf{z}a$ on $\mathbf{z}\sigma$ also incorporates the dihedral symmetries of $a$ as an action, that is, $dad^{-1}$ for $d \in D_N$. For a biological model $(\mathcal{M}, w, \mathrm{dist})$ for evolution of genomes as permutations, under the assumption of dihedral symmetry, we wrote (9)

$$\mathcal{M} = \{d a_1 d^{-1}, d a_2 d^{-1}, \ldots, d a_m d^{-1} : d \in D_N\} \subseteq S_N, \tag{27}$$

where for each $a_k$ and all $d \in D_N$, $w(d a_k d^{-1}) = w(a_k)$. Since $d a_k d^{-1} \in [a_k]_D$, the action of $\mathbf{z}(d a_k d^{-1})$ on $\mathbf{z}\sigma$ is the same as the action of $\mathbf{z}a_k$ on $\mathbf{z}\sigma$ (see Proposition 3.7) and thus each $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}(d a_k d^{-1})) = \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a_k)$.

Having shown that the MLE computations can be performed in the genome algebra $\mathcal{A}$, and discussed the representation of both genomes and rearrangements as permutation clouds in this algebra, it remains to reformulate the model within this framework. Given the model (27) in the permutation framework, the equivalent model for evolution in the genome algebra setting is $(\mathcal{M}_{\mathcal{A}}, w_{\mathcal{A}}, \mathrm{dist})$, where

$$\mathcal{M}_{\mathcal{A}} := \{\mathbf{z}a_1, \ldots, \mathbf{z}a_m\} \subseteq \mathcal{A},$$

and $w_{\mathcal{A}}(\mathbf{z}a_i) = 2N w(a_i)$ for each $i$.

Since the dihedral symmetry of the genomes is built into the algebra $\mathcal{A}$, specifying the model to consist of elements of $\mathcal{A}$ in this way makes the dihedral symmetry requirement (M1) redundant. Model reversibility in this setting is formulated as

[displayed condition not recoverable: the source renders this equation as an image]

This condition is sufficient to ensure that the irreducible representations of $\mathbf{z}\mathbf{s}$ are diagonalisable, which is convenient for computations. Although the algebra $\mathcal{A}$ has a left identity, it does not contain inverses, so $\mathbf{z}a^{-1}$ is not (in general) an inverse of $\mathbf{z}a$. However, as we shall see in the next section, model reversibility is, further, equivalent to the reversibility of the Markov model. We conclude this section with an example to illustrate some of these key concepts.

Example 3.13

Suppose we wish to consider a model consisting only of “small inversions”, which we will take to be inversions of two or three regions. In the permutation framework, we would define this model to be

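$$\mathcal{M} := \{d(1,2)d^{-1},\, d(1,3)d^{-1} : d \in D_N\} = \{(1,2), (2,3), \ldots, (N-1,N), (N,1), (1,3), (2,4), \ldots, (N-1,1), (N,2)\}.$$

Here there are $N$, rather than $2N$, distinct instances of each rearrangement type, since for inversions,⁴ each flip coincides with a rotation.

For the rearrangement probabilities, one could choose the uniform distribution, $w(a) = \frac{1}{2N}$ for all $a \in \mathcal{M}$, or one may consider the larger inversions to be less likely and set, for all $a \in \mathcal{M}$,

$$w(a) = \begin{cases} \frac{2}{3N}, & \text{if } a = d(1,2)d^{-1} \text{ for some } d \in D_N; \\ \frac{1}{3N}, & \text{if } a = d(1,3)d^{-1} \text{ for some } d \in D_N. \end{cases}$$

In the genome algebra, the model is simpler to express; we take the rearrangement instances $(1,2)$ and $(1,3)$ and the model is

$$\mathcal{M}_{\mathcal{A}} := \{\mathbf{z}(1,2), \mathbf{z}(1,3)\}.$$

The weight functions corresponding to the above would then be $w_{\mathcal{A}}(\mathbf{z}(1,2)) = w_{\mathcal{A}}(\mathbf{z}(1,3)) = \frac{1}{2}$ or $w_{\mathcal{A}}(\mathbf{z}(1,2)) = \frac{2}{3}$, $w_{\mathcal{A}}(\mathbf{z}(1,3)) = \frac{1}{3}$.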

One may recall from Example 3.6 that $\mathbf{z}(1,2) \neq \mathbf{z}(2,3)$; however, one need not (and indeed should not) include both of these in the rearrangement model since they have the same action: $\mathbf{z}(1,2) \cdot \mathbf{z}\sigma = \mathbf{z}(2,3) \cdot \mathbf{z}\sigma$ for any genome $\mathbf{z}\sigma \in \mathcal{A}$, since $\mathbf{z}(1,2)\mathbf{z} = \mathbf{z}(2,3)\mathbf{z}$. Further, one must be aware of complementary rearrangements.

For example, in the case $N = 5$, the actions of $\mathbf{z}(1,2)$ and $\mathbf{z}(1,3)$ coincide: inverting a two-region segment is, under dihedral symmetry, the same rearrangement as inverting the complementary three-region segment (correspondingly, when $N = 5$, $\mathbf{z}(1,2)\mathbf{z} = \mathbf{z}(1,3)\mathbf{z}$).
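Such redundancies can be detected mechanically, since $\mathbf{z}a\mathbf{z} = \mathbf{z}b\mathbf{z}$ exactly when $a$ and $b$ lie in the same double coset of the dihedral group. The sketch below tests this for the transpositions discussed above, with positions relabelled $0, \ldots, N-1$; the function names and the composition convention are assumptions of the example.

```python
def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def dihedral(n):
    """The 2n rotations and reflections of n positions on a circle."""
    rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
    reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
    return rotations + reflections

def transposition(n, i, j):
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def double_coset(a, D):
    """All instances d1 a d2 with d1, d2 in D."""
    return {compose(d1, compose(a, d2)) for d1 in D for d2 in D}

def same_action(n, a, b):
    """True when a and b share a double coset, i.e. z a z equals z b z."""
    return b in double_coset(a, dihedral(n))

def same_genome_element(n, a, b):
    """True when z a equals z b, i.e. b = d a for some dihedral d."""
    return b in {compose(d, a) for d in dihedral(n)}

# Position labels 0, 1, 2 play the role of 1, 2, 3 in the text.
two_vs_three_n5 = same_action(5, transposition(5, 0, 1), transposition(5, 0, 2))
two_vs_three_n6 = same_action(6, transposition(6, 0, 1), transposition(6, 0, 2))
shifted_same_action = same_action(5, transposition(5, 0, 1), transposition(5, 1, 2))
shifted_same_element = same_genome_element(5, transposition(5, 0, 1), transposition(5, 1, 2))
```

Under these conventions the two- and three-region inversions share an action for $N = 5$ but not for $N = 6$, while the shifted two-region inversions always share an action despite being distinct elements of the algebra.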

One can easily eliminate the possibility of such ‘rearrangement redundancies’ by reformulating the model in the genome algebra as a set of elements of the form $\mathbf{z}a\mathbf{z}$; we do this in the next section (30). More on such considerations, along with explicit examples of rearrangement models in the oriented region case, may be found in Terauds et al. (2021); a deeper algebraic consideration of rearrangements is given in Stevenson et al. (2022).

More general models of genomes

The construction of the genome algebra $\mathcal{A} = \mathbf{z}\mathbb{C}[S_N]$ in Sect. 3 was determined by assumptions we made about how to model the genomes. In particular, following on from previous work (Serdoz et al. 2017; Sumner et al. 2017; Terauds and Sumner 2019), we chose to model circular genomes, without considering orientation of regions, which meant an instance of the genome could be represented by a permutation $\sigma \in S_N$. We modelled the genomes without a distinguished position, which meant the genome symmetries corresponded to the dihedral group $D_N$. In this section, we outline how the constructions and techniques presented for this specific case can be generalised to cover different genomic models: any for which genome (and rearrangement) instances can be represented as elements of a group $G$.

Suppose, for example, that one wanted to vary the above model to include an origin of replication in the circular genomes. We would model this as a distinguished position and the genomes would then have no rotational symmetry, only reflectional. The symmetry group would thus be $Z_N = \{e, f\}$, the symmetry element $\mathbf{z} = \frac{1}{2}(e + f)$, and each genome an element $\mathbf{z}\sigma = \frac{1}{2}(\sigma + f\sigma) \in \mathbf{z}\mathbb{C}[S_N]$. The model would naturally reflect this symmetry, with rearrangements taking the form $\mathbf{z}a$, $a \in S_N$. In particular, this case allows for rearrangements at different positions on the genome, relative to the origin of replication, to be assigned different probabilities. With appropriate choices of rearrangements, this framework could also be used to represent linear genomes.

To include orientation of genes, one would use a different underlying group, for example the hyperoctahedral group $H_N$ of signed permutations (as outlined in Egri-Nagy et al. 2014), and a symmetry group of choice (for example, a copy of the dihedral group in the case of a circular genome with no distinguished positions). An explicit consideration of the genome algebra for the signed region case, including some detailed examples, may be found in Terauds et al. (2021).

To construct the general genome algebra, we begin with a group $G$, whose elements represent instances of the genomes of interest, and a subgroup $Z \le G$ that represents the physical symmetries of these genomes.⁵ We consider rearrangements such that a single rearrangement event for an instance $g \in G$ of a genome can be modelled via the left action of a particular element $a \in G$ on $g$. The terms in the following definition reflect our applications of the objects, but obviously the subsequent results concerning the algebras hold whether or not one applies them to genomes.

Definition 4.1

Let $G$ be a finite group with subgroup $Z \le G$. Define

$$\mathbf{z} := \frac{1}{|Z|} \sum_{z \in Z} z, \qquad \mathcal{A} := \mathbf{z}\mathbb{C}[G] \qquad \text{and} \qquad \mathcal{A}_0 := \mathbf{z}\mathbb{C}[G]\mathbf{z}.$$

We call $\mathcal{A}$ the genome algebra of $G$ with $Z$, $\mathcal{A}_0$ the class algebra of $G$ with $Z$ and $\mathbf{z}$ the symmetry element of $\mathcal{A}$ and $\mathcal{A}_0$.

Rather than proceeding as in Sects. 2 and 3, where we first defined the rearrangement model, path probabilities and likelihoods for genome instances (group elements) and then showed that the calculations could be performed in the genome algebra, we will here formulate these concepts (and then perform the computations) entirely in the genome algebra $\mathcal{A}$. We include the class algebra $\mathcal{A}_0$ for completeness. Following the observations in the previous section, it seems a natural next step to consider the algebra formed by combining together the elements of $\mathcal{A}$ that act indistinguishably. However, we shall see that this lower dimensional algebra is not the appropriate setting for our calculations.

Lemma 4.2

Let $G$ be a finite group with subgroup $Z \le G$.

  • (i)
    For each $g \in G$, define $[g] := \{zg : z \in Z\}$. Then the sets $\{[g] : g \in G\}$ are equivalence classes of $G$. For each $g \in G$, $|[g]| = |Z|$.
  • (ii)
    For each $g \in G$, define $[g]_D := \{zgz' : z, z' \in Z\}$. Then the sets $\{[g]_D : g \in G\}$ are equivalence classes of $G$.

Proof

Since the sets $[g]$ and $[g]_D$ for $g \in G$ are respectively right cosets and double cosets of $G$ with respect to the subgroup $Z$, it is clear that they are equivalence classes.

We use the label ‘D’ for the classes defined in (ii) above to refer to the double coset structure of the sets $[g]_D$ (noting that this conveniently coincides with the original usage (Terauds and Sumner 2019) of the label, which referred to the dihedral symmetry in the circular genome case). The following statements can be derived directly from the subgroup properties of $Z$ and Lemma 4.2 (c.f. the corresponding results in Sect. 3).

Proposition 4.3

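Let $G$ be a finite group with subgroup $Z \le G$. Let $\mathcal{A}$ and $\mathcal{A}_0$ respectively be the genome algebra and the class algebra of $G$ with $Z$ and $\mathbf{z}$ the symmetry element of $\mathcal{A}$ and $\mathcal{A}_0$. Then

  • (i)
    $\mathbf{z}$ is idempotent, $\mathbf{z}$ is a left identity in $\mathcal{A}$, and $\mathbf{z}$ is the identity in $\mathcal{A}_0$;
  • (ii)
    $\mathcal{A}$ has a basis of the form $\{\mathbf{z}g : g \in G\} = \{\mathbf{z}g_1, \ldots, \mathbf{z}g_K\}$ and
    $K := \dim(\mathcal{A}) = |\{[g] : g \in G\}| = \frac{|G|}{|Z|}$;
  • (iii)
    $\mathcal{A}_0$ has a basis of the form $\{\mathbf{z}g\mathbf{z} : g \in G\} = \{\mathbf{z}g_1\mathbf{z}, \ldots, \mathbf{z}g_L\mathbf{z}\}$ and
    $L := \dim(\mathcal{A}_0) = |\{[g]_D : g \in G\}|$.

For the remainder of this section, we fix a basis for each of $\mathcal{A}$ and $\mathcal{A}_0$, as defined in (ii) and (iii) above.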

Remark 4.4

Each equivalence class $[g]_D$ can be viewed as an orbit of $G$ under an action of the group $Z \times Z$ and thus, from (iii) above, the dimension $L$ of $\mathcal{A}_0$ may be calculated via Burnside’s lemma (James and Liebeck 2001, Prop. 29.4). By combining this with the dual orthogonality relations on the group $G$, one can directly obtain the dimension result stated below in Theorem 4.5 (i).
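To make the Burnside count concrete, the following sketch computes $L$ for $G = S_N$ and $Z = D_N$ in two ways: by enumerating the double cosets directly, and by averaging the number of fixed points of the maps $g \mapsto z_1 g z_2^{-1}$ over all pairs $(z_1, z_2)$. The choice of groups, the composition convention and the function names are assumptions for illustration.

```python
from itertools import permutations

def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def dihedral(n):
    """The 2n rotations and reflections of n positions on a circle."""
    rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
    reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
    return rotations + reflections

def count_direct(G, Z):
    """Number of double cosets Z g Z, found by exhaustive enumeration."""
    seen, count = set(), 0
    for g in G:
        if g not in seen:
            seen |= {compose(z1, compose(g, z2)) for z1 in Z for z2 in Z}
            count += 1
    return count

def count_burnside(G, Z):
    """Average number of g fixed by (z1, z2), i.e. with z1 g = g z2."""
    fixed = sum(1 for z1 in Z for z2 in Z for g in G
                if compose(z1, g) == compose(g, z2))
    return fixed // (len(Z) ** 2)

def double_coset_counts(n):
    G = list(permutations(range(n)))
    Z = dihedral(n)
    return count_direct(G, Z), count_burnside(G, Z)

counts_5 = double_coset_counts(5)
counts_6 = double_coset_counts(6)
```

For $N = 6$ both counts give 12, which agrees with $\sum_p k_p^2$ for the list of $k_p$ in Example 3.12, as the dimension formula of Theorem 4.5 (i) would predict.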

We note that working ‘entirely’ in the genome algebra does not mean that we forget about the group $G$. In practice, one would observe a genome with a particular orientation and reference frame, thus as an instance $g \in G$, and then identify the genome as the cloud $\mathbf{z}g \in \mathcal{A}$ for the purposes of computation. There are $K = \frac{|G|}{|Z|}$ distinct genomes, corresponding to the distinct basis elements $\mathbf{z}g$ of the genome algebra $\mathcal{A}$. Similarly, one would conceive a rearrangement initially as an instance $a \in G$ and then lift to $\mathbf{z}a$ in the genome algebra. Considering all orientations of a rearrangement instance on all orientations of a genome corresponds to a left action of the algebra $\mathcal{A}$ on itself. Distinct elements of $\mathcal{A}$ that correspond to the same element of $\mathcal{A}_0$ act indistinguishably, since

$$(\mathbf{z}a) \cdot (\mathbf{z}g) = (\mathbf{z}a\mathbf{z}) \cdot (\mathbf{z}g). \tag{28}$$

Thus there are $L = \dim(\mathcal{A}_0)$ distinct rearrangement actions.⁶

The (left) regular representations $\rho_{\mathrm{reg}}^{\mathcal{A}}$ of the genome algebra $\mathcal{A}$ and $\rho_{\mathrm{reg}}^{0}$ of the class algebra $\mathcal{A}_0$ can be constructed in the usual way (c.f. (12)) via the bases fixed above. As in Sect. 3, one readily verifies that $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z})$ is the $K \times K$ identity matrix and that $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g)^T = \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g^{-1})$. Since $\mathbf{z}$ is an identity in $\mathcal{A}_0$, $\rho_{\mathrm{reg}}^{0}(\mathbf{z})$ is the $L \times L$ identity matrix. In this case, the equivalence classes $[g]_D$ need not be the same size and thus, in general, $\rho_{\mathrm{reg}}^{0}(\mathbf{z}g\mathbf{z})^T \neq \rho_{\mathrm{reg}}^{0}(\mathbf{z}g^{-1}\mathbf{z})$.⁷ We denote the regular characters of $\mathcal{A}$ and $\mathcal{A}_0$ by $\chi_{\mathrm{reg}}^{\mathcal{A}}$ and $\chi_{\mathrm{reg}}^{0}$ respectively and note that these take real values on any algebra element that is a real linear combination of basis elements.

Recall that, by Maschke’s theorem (Etingof et al. 2011, Thm. 4.1.1), the group algebra C[G] of any finite group G can be written as a direct sum over its irreducible modules.

Theorem 4.5

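Let $G$ be a finite group with subgroup $Z \le G$. Denote the distinct irreducible submodules of $\mathbb{C}[G]$ by $V_i$, with $\dim(V_i) = D_i$ for each $i$ so that

$$\mathbb{C}[G] \cong \bigoplus_{i=1}^{M} D_i V_i,$$

and denote the corresponding irreducible representations and characters of $\mathbb{C}[G]$ by $\rho_i$ and $\chi_i$ respectively. Then the following hold.

  • (i)
    For each $i$, $W_i := \mathbf{z} \cdot V_i$ is either $\{0\}$ or an irreducible $\mathcal{A}_0$-module and
    $$\mathcal{A}_0 \cong \bigoplus_{\substack{1 \le i \le M \\ W_i \neq \{0\}}} k_i W_i, \tag{29}$$
    with $k_i = \dim(W_i) = \chi_i(\mathbf{z})$ for each $i$. Thus, $\dim(\mathcal{A}_0) = L = \sum_{i=1}^{M} \chi_i(\mathbf{z})^2$.
  • (ii)
    The modules $\{W_i : 1 \le i \le M, W_i \neq \{0\}\}$ comprise all irreducible modules of $\mathcal{A}$. Denoting the corresponding irreducible representations of $\mathcal{A}$ and $\mathcal{A}_0$ by $\rho_i^{\mathcal{A}}$ and $\rho_i^{0}$ respectively,
    $$\rho_i^{\mathcal{A}}(\mathbf{z}g) = \rho_i^{\mathcal{A}}(\mathbf{z}g\mathbf{z}) = \rho_i^{0}(\mathbf{z}g\mathbf{z}),$$
    for all $g \in G$ and all (relevant) $1 \le i \le M$. Thus for all $g \in \mathcal{A}_0$, $\rho_i^{\mathcal{A}}(g) = \rho_i^{0}(g)$.
  • (iii)
    Denoting the corresponding characters of $\mathcal{A}$ and $\mathcal{A}_0$ by $\chi_i^{\mathcal{A}}$ and $\chi_i^{0}$ respectively, and defining $\chi_i^{\mathcal{A}} = \chi_i^{0} \equiv 0$ for each $i$ such that $W_i = \{0\}$,
    $$\chi_i(\mathbf{z}g) = \chi_i^{\mathcal{A}}(\mathbf{z}g) = \chi_i^{0}(\mathbf{z}g\mathbf{z}),$$
    for all $g \in G$ and all $1 \le i \le M$. Thus the characters $\chi_i$, $\chi_i^{\mathcal{A}}$ and $\chi_i^{0}$ coincide on $\mathcal{A}_0$.
  • (iv)
    For all $g \in \mathcal{A}$,
    $$\chi_{\mathrm{reg}}^{\mathcal{A}}(g) = \sum_{i=1}^{M} D_i \chi_i^{\mathcal{A}}(g) = \chi_{\mathrm{reg}}(g),$$
    where $\chi_{\mathrm{reg}}$ and $\chi_{\mathrm{reg}}^{\mathcal{A}}$ denote the regular characters of $\mathbb{C}[G]$ and $\mathcal{A}$ respectively.
  • (v)
    For any $g \in \mathcal{A}$, $\frac{1}{K} \cdot \chi_{\mathrm{reg}}^{\mathcal{A}}(g)$ is the coefficient of $\mathbf{z}$ in $g$.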

Proof

For (i), we use (Steinberg 2016, Prop. 4.18, Thm. 4.23). For the remaining results, we use the observation (28) and proceed just as for the corresponding results in Sect. 3. Note that in this general setting we cannot assume that the irreducible representations $\rho_i$ are orthogonal on $G$, but we can choose them to be unitary (Etingof et al. 2011, Thm. 4.6.2). This means that the corresponding irreducible representations of $\mathbf{z}$ in $\mathbb{C}[G]$ are self-adjoint, and thus each $\rho_i(\mathbf{z})$ is unitarily diagonalisable, so that its eigenvectors form an orthonormal basis for $\mathbb{C}^{D_i} \cong V_i$. Since the eigenvectors need not be real, the only difference in the proofs is that we need the conjugate transposes, not just transposes, of these vectors.

The above results imply that $\mathcal{A}_0 \cong \mathcal{A}/\mathrm{Rad}(\mathcal{A})$ (Etingof et al. 2011, Thm. 3.5.4), which formalises the relationship between the genome algebra and the class algebra: $\mathcal{A}_0$ is obtained from $\mathcal{A}$ by factoring out the elements of $\mathcal{A}$ that act trivially. We have previously expressed this as $\mathcal{A}_0$ combining together the elements of $\mathcal{A}$ that act indistinguishably. Another aspect of this is the following.

Corollary 4.6

Let $G$ be a finite group with subgroup $Z \le G$. For any $g \in G$ and each irreducible representation $\rho_i^{\mathcal{A}}$ of $\mathcal{A}$, $\rho_i^{\mathcal{A}}(\mathbf{z}g) = \rho_i^{\mathcal{A}}(\mathbf{z}g')$ for all $g' \in [g]_D$. For any $g, g' \in G$, $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g) = \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g')$ if and only if $g' \in [g]_D$.

Remark 4.7

We note that (Steinberg 2016, Prop. 7.14) implies part of Theorem 4.5 (iii) (namely, that $\chi_i^{0}(g) = \chi_i(g)$ for $g \in \mathcal{A}_0$) in the more general case of $G$ being a semigroup. Extending the framework to algebras based on semigroups would allow us to model further types of rearrangements, such as insertions and deletions (Francis 2014), and we intend to investigate this possibility in future work.

We are ready to proceed with the formulation of path probabilities and the likelihood function within the genome algebra $\mathcal{A}$. Firstly, we formally define a biological model for evolution in the genome algebra to be $(\mathcal{M}, w, \mathrm{dist})$, where

$$\mathcal{M} := \{\mathbf{z}a_1\mathbf{z}, \mathbf{z}a_2\mathbf{z}, \ldots, \mathbf{z}a_q\mathbf{z}\} \subseteq \mathcal{A}, \tag{30}$$

for some $a_1, \ldots, a_q \in G$, $w : \mathcal{M} \to (0,1)$ is the probability distribution on $\mathcal{M}$, and dist is the probability distribution of rearrangement events in time. Note that we have used the form $\mathbf{z}a\mathbf{z}$ rather than $\mathbf{z}a$ to avoid duplicating rearrangements in the model (that is, elements $\mathbf{z}a, \mathbf{z}a' \in \mathcal{A}$ that are distinct but have the same left action, c.f. Examples 3.6 and 3.13). Presently, we shall also add the condition that the model be reversible, that is, that $\mathbf{z}a^{-1}\mathbf{z} \in \mathcal{M}$ for every $\mathbf{z}a\mathbf{z} \in \mathcal{M}$, and $w(\mathbf{z}a^{-1}\mathbf{z}) = w(\mathbf{z}a\mathbf{z})$.

We fix the reference genome to be $\mathbf{z} \in \mathcal{A}$ (whose instances are the elements of $Z$, in particular $e \in Z$). Then, for any target genome $\mathbf{z}g$ (where we have observed the instance $g \in G$) and each $k \in \mathbb{N}_0$, we define the path probability $\alpha_k(\mathbf{z}g)$ to be

$$\alpha_k(\mathbf{z}g) := P(\mathbf{z} \to \mathbf{z}g \text{ via } k \text{ rearrangements}),$$

where $\mathbf{z} \to \mathbf{z}g$ means “genome $\mathbf{z}$ is transformed into genome $\mathbf{z}g$”. As usual, to find the path probability for an arbitrary genome $\mathbf{z}h$ to be transformed into target genome $\mathbf{z}g$, we can simply translate to the reference; that is, this is exactly the path probability for $\mathbf{z}$ to be rearranged into $\mathbf{z}gh^{-1}$, which is $\alpha_k(\mathbf{z}gh^{-1})$.

Given a model $\mathcal{M}$, we define the corresponding model element of $\mathcal{A}$ to be

$$\tilde{\mathbf{s}} := \sum_{i=1}^{q} w(\mathbf{z}a_i\mathbf{z})\, \mathbf{z}a_i.$$

We write $\tilde{\mathbf{s}}$ to distinguish the model element here from the previous definition of $\mathbf{s}$ in the group algebra (c.f. Definition 2.2), and choose to sum over rearrangements of the form $\mathbf{z}a_i$ rather than $\mathbf{z}a_i\mathbf{z}$ for simplicity (recall from (28), these have the same action, so either form may be used).

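It remains to connect the path probabilities to the regular character of powers of the model element (c.f. (3) in Sect. 2). Recall from Sect. 3.1 that $\mathbf{z}a \cdot \mathbf{z}g$ gives a convex combination of genomes, that is,

$$\mathbf{z}a \cdot \mathbf{z}g = \sum_{i=1}^{K} p_i\, \mathbf{z}g_i,$$

where $\{\mathbf{z}g_1, \ldots, \mathbf{z}g_K\}$ is our fixed basis for $\mathcal{A}$ and each $p_i$ is the proportion of the expansion of $\mathbf{z}a\mathbf{z}g$ that is equal to $\mathbf{z}g_i$ or, equivalently, the probability that the rearrangement $\mathbf{z}a$ acting on the genome $\mathbf{z}g$ will result in the genome $\mathbf{z}g_i$. Thus

$$\tilde{\mathbf{s}} \cdot \mathbf{z} = \sum_{j=1}^{q} w(\mathbf{z}a_j\mathbf{z})\, \mathbf{z}a_j\mathbf{z} = \sum_{j=1}^{q} w(\mathbf{z}a_j\mathbf{z}) \sum_{i=1}^{K} p_{j,i}\, \mathbf{z}g_i = \sum_{i=1}^{K} \left( \sum_{j=1}^{q} w(\mathbf{z}a_j\mathbf{z})\, p_{j,i} \right) \mathbf{z}g_i,$$

where we have rearranged and collected terms in the final step so that, for each $i$, $\sum_{j=1}^{q} w(\mathbf{z}a_j\mathbf{z})\, p_{j,i}$ is the total probability that the genome $\mathbf{z}$ will be transformed into the genome $\mathbf{z}g_i$ via some (single) rearrangement chosen from the model. Thus

$$\tilde{\mathbf{s}} \cdot \mathbf{z} = \sum_{i=1}^{K} \alpha_1(\mathbf{z}g_i)\, \mathbf{z}g_i$$

and, by repeatedly applying $\tilde{\mathbf{s}}$, one sees that

$$\tilde{\mathbf{s}}^k \mathbf{z} = \sum_{i=1}^{K} \alpha_k(\mathbf{z}g_i)\, \mathbf{z}g_i. \tag{31}$$

Now, for $g \in G$ an instance of the genome of interest, multiply (31) on the right by $g^{-1}$ to obtain

$$\tilde{\mathbf{s}}^k \mathbf{z}g^{-1} = \sum_{i=1}^{K} \alpha_k(\mathbf{z}g_i)\, \mathbf{z}g_i g^{-1}.$$

Since $\mathbf{z}g_i g^{-1} = \mathbf{z}$ if and only if $\mathbf{z}g_i = \mathbf{z}g$, we see that $\alpha_k(\mathbf{z}g)$ is the coefficient of $\mathbf{z}$ in the expansion of $\tilde{\mathbf{s}}^k \mathbf{z}g^{-1}$, and thus

$$\alpha_k(\mathbf{z}g) = \frac{1}{K} \cdot \chi_{\mathrm{reg}}^{\mathcal{A}}\left(\tilde{\mathbf{s}}^k \mathbf{z}g^{-1}\right) \tag{32}$$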

by Theorem 4.5 (v).
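Identity (31) says that the path probabilities are the coefficients obtained by applying the model element $k$ times to the reference genome, which for a small case can be computed by repeated matrix-vector products. The sketch below does this for $G = S_5$, $Z = D_5$ and a hypothetical one-rearrangement model; it uses the Markov matrix directly rather than the character formula (32), and all names and conventions are assumptions of the example.

```python
from fractions import Fraction
from itertools import permutations

def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def inverse(p):
    q = [0] * len(p)
    for i, image in enumerate(p):
        q[image] = i
    return tuple(q)

def dihedral(n):
    rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
    reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
    return rotations + reflections

def coset_rep(g, Z):
    """Canonical instance of the genome {z g : z in Z}."""
    return min(compose(z, g) for z in Z)

def genome_matrix(G, Z, model):
    """Genome index and one-step transition matrix of the model element."""
    genomes = sorted({coset_rep(g, Z) for g in G})
    index = {g: i for i, g in enumerate(genomes)}
    P = [[Fraction(0)] * len(genomes) for _ in genomes]
    for j, g in enumerate(genomes):
        for a, w in model.items():
            for z in Z:
                P[index[coset_rep(compose(a, compose(z, g)), Z)]][j] += w / len(Z)
    return index, P

def path_probabilities(P, k):
    """Coefficients alpha_k(z g_i): apply P to the reference genome k times."""
    v = [Fraction(int(i == 0)) for i in range(len(P))]  # reference has index 0
    for _ in range(k):
        v = [sum(P[i][j] * v[j] for j in range(len(P))) for i in range(len(P))]
    return v

n = 5
G, Z = list(permutations(range(n))), dihedral(n)
model = {(1, 0, 2, 3, 4): Fraction(1)}   # hypothetical: swap positions 0 and 1
index, P = genome_matrix(G, Z, model)
alpha = {k: path_probabilities(P, k) for k in range(5)}
```

Because the identity permutation is the smallest tuple, the reference genome $\mathbf{z}$ always receives index 0 here. With exact fractions one can confirm that each $\alpha_k$ is a probability distribution over the genomes and that $\alpha_k(\mathbf{z}g) = \alpha_k(\mathbf{z}g^{-1})$ for this reversible model.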

Theorem 4.5 allows us to decompose the regular character in (32) into irreducible characters; however, we will also need to diagonalise the irreducible representation matrices.

Lemma 4.8

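Let $G$ be a finite group with subgroup $Z \le G$. Let $(\mathcal{M}, w, \mathrm{dist})$ be a biological model for evolution of genomes represented by elements $\mathbf{z}g \in \mathcal{A}$ of the genome algebra (where $g \in G$) and let $\tilde{\mathbf{s}} \in \mathcal{A}$ be the corresponding model element. If the model is reversible, then the following hold.

  • (i)
    The irreducible representation matrices of the model element $\tilde{\mathbf{s}}$ in $\mathcal{A}$ are diagonalisable.
  • (ii)
    The regular representation of $\tilde{\mathbf{s}}$ in $\mathcal{A}$ is symmetric.

Proof

(i) Suppose that the model is reversible and let $1 \le i \le M$. We have (c.f. (20))

$$\rho_i^{\mathcal{A}}(\tilde{\mathbf{s}}) = \rho_i^{\mathcal{A}}(\tilde{\mathbf{s}}\mathbf{z}) := \overline{Q}^{T} \rho_i(\tilde{\mathbf{s}}\mathbf{z})\, Q,$$

where $Q$ is the $D_i \times k_i$ matrix of orthonormal eigenvectors for $\rho_i(\mathbf{z})$. Since $G$ is a finite group, we may choose the irreducible representation $\rho_i$ on $G$ to be unitary. Then, writing $\rho_i(\tilde{\mathbf{s}}\mathbf{z})$ as a sum of matrices of the form $w(\mathbf{z}a\mathbf{z})\left(\rho_i(\mathbf{z}a\mathbf{z}) + \rho_i(\mathbf{z}a^{-1}\mathbf{z})\right)$ (omitting the second term if $a = a^{-1}$), each of which is self-adjoint, we see that $\rho_i(\tilde{\mathbf{s}}\mathbf{z})$ is self-adjoint and thus so is $\rho_i^{\mathcal{A}}(\tilde{\mathbf{s}})$. For claim (ii), we proceed similarly, using the observation that $\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a)^T = \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}a^{-1})$.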

Theorem 4.9

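Let $G$ be a finite group with subgroup $Z \le G$. Let $(\mathcal{M}, w, \mathrm{dist})$ be a reversible biological model for evolution of genomes represented by basis elements of the genome algebra $\mathcal{A}$ of $G$ with $Z$ and let $\tilde{\mathbf{s}} \in \mathcal{A}$ be the corresponding model element. Let $g \in G$ be an observed instance of a genome $\mathbf{z}g \in \mathcal{A}$. Then the following hold.

  • (i)
    For any $k \in \mathbb{N}_0$, the probability that the reference genome $\mathbf{z}$ is transformed into the genome $\mathbf{z}g$ via $k$ rearrangements chosen from the model is
    $$\alpha_k(\mathbf{z}g) = \frac{|Z|}{|G|} \chi_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}}^k \mathbf{z}g^{-1}) = \frac{|Z|}{|G|} \sum_{i=1}^{M} D_i \sum_{j=1}^{R_i} \lambda_{i,j}^k\, \mathrm{tr}\left(\rho_i^{\mathcal{A}}(\mathbf{z}g^{-1}) E_{i,j}^{\mathcal{A}}\right),$$
    where for each $i$, $E_{i,j}^{\mathcal{A}}$ is the projection onto the eigenspace of the $j$th eigenvalue $\lambda_{i,j}$ of $\rho_i^{\mathcal{A}}(\tilde{\mathbf{s}})$.
  • (ii)
    If the distribution of rearrangement events in time is dist = Poisson(1), then the probability that the reference genome $\mathbf{z}$ is transformed into the genome $\mathbf{z}g$ via the given model in time $T$ is given by the likelihood function
    $$L(T|g) = e^{-T} \frac{|Z|}{|G|} \sum_{i=1}^{M} D_i \sum_{j=1}^{R_i} \mathrm{tr}\left(\rho_i^{\mathcal{A}}(\mathbf{z}g^{-1}) E_{i,j}^{\mathcal{A}}\right) e^{\lambda_{i,j} T}.$$
  • (iii)
    For any genome $\mathbf{z}h \in \mathcal{A}$ with an instance $h \in [g]_D \cup [g^{-1}]_D$, the path probabilities and likelihood functions of $\mathbf{z}g$ and $\mathbf{z}h$ coincide.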

Proof

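The first expression for the path probability $\alpha_k(\mathbf{z}g)$ was gained above (32). To gain the second, we use the decomposition of the regular character from Theorem 4.5 (iv), and then for each $i$, use the cyclicity of the trace to write

$$\chi_i^{\mathcal{A}}(\tilde{\mathbf{s}}^k \mathbf{z}g^{-1}) = \mathrm{tr}\left(\rho_i^{\mathcal{A}}(\mathbf{z}g^{-1})\, \rho_i^{\mathcal{A}}(\tilde{\mathbf{s}})^k\right).$$

Then, from Lemma 4.8, we may diagonalise $\rho_i^{\mathcal{A}}(\tilde{\mathbf{s}})$ to gain the second expression. Analogously to the definition in Sect. 2, but with genomes instead of elements of $G$, we define the likelihood function as

$$L(T|g) := P(\mathbf{z}g \,|\, T) = \sum_{k=0}^{\infty} P(\mathbf{z} \to \mathbf{z}g \text{ via } k \text{ events})\, P(k \text{ events in time } T) = \sum_{k=0}^{\infty} \alpha_k(\mathbf{z}g) \frac{e^{-T} T^k}{k!}.$$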

Then substituting in the expression from (i) and simplifying the power series gives (ii).
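The simplification of the power series can be written out in one line. In the sketch below, $c_{i,j}$ is shorthand introduced only for this step, standing for the trace coefficient $\mathrm{tr}(\rho_i^{\mathcal{A}}(\mathbf{z}g^{-1}) E_{i,j}^{\mathcal{A}})$ from part (i); the sums over $i$ and $j$ are finite, so they may be exchanged with the sum over $k$.

```latex
\begin{aligned}
L(T|g) &= \sum_{k=0}^{\infty} \alpha_k(\mathbf{z}g)\, \frac{e^{-T} T^k}{k!}
        = e^{-T} \frac{|Z|}{|G|} \sum_{i=1}^{M} D_i \sum_{j=1}^{R_i} c_{i,j}
          \sum_{k=0}^{\infty} \frac{(\lambda_{i,j} T)^k}{k!} \\
       &= e^{-T} \frac{|Z|}{|G|} \sum_{i=1}^{M} D_i \sum_{j=1}^{R_i} c_{i,j}\, e^{\lambda_{i,j} T}.
\end{aligned}
```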

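(iii) Let $h \in G$ such that $\mathbf{z}h\mathbf{z} = \mathbf{z}g\mathbf{z}$ or $\mathbf{z}h\mathbf{z} = \mathbf{z}g^{-1}\mathbf{z}$. We show that $\alpha_k(\mathbf{z}g) = \alpha_k(\mathbf{z}h)$ for all $k \in \mathbb{N}_0$, which implies (iii). Let $k \in \mathbb{N}_0$. Since the trace is cyclic, we have

$$\chi_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}}^k \mathbf{z}g^{-1}) = \mathrm{tr}\left(\rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g^{-1})\, \rho_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}})^k\right) = \mathrm{tr}\left(\rho_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}})^k\, \rho_{\mathrm{reg}}^{\mathcal{A}}(\mathbf{z}g)\right) = \chi_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}}^k \mathbf{z}g),$$

where the second equality was obtained by taking the transpose of the argument and applying Lemma 4.8. Then from (i) it is clear that $\alpha_k(\mathbf{z}g) = \alpha_k(\mathbf{z}g^{-1})$ and these coincide with $\alpha_k(\mathbf{z}h)$ by Corollary 4.6.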
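For small cases the likelihood function can also be evaluated numerically straight from its defining series, without any representation theory. The sketch below truncates the Poisson-weighted sum at a fixed number of terms and uses the genome Markov matrix for $G = S_5$, $Z = D_5$ with a hypothetical one-rearrangement model; the truncation level `kmax`, the function names and the composition convention are assumptions, and this is not the eigenvalue-based expression of Theorem 4.9 (ii).

```python
import math
from itertools import permutations

def compose(p, q):
    """Permutation 'apply q first, then p' on positions 0..n-1."""
    return tuple(p[i] for i in q)

def inverse(p):
    q = [0] * len(p)
    for i, image in enumerate(p):
        q[image] = i
    return tuple(q)

def dihedral(n):
    rotations = [tuple((i + r) % n for i in range(n)) for r in range(n)]
    reflections = [tuple((r - i) % n for i in range(n)) for r in range(n)]
    return rotations + reflections

def coset_rep(g, Z):
    """Canonical instance of the genome {z g : z in Z}."""
    return min(compose(z, g) for z in Z)

def genome_matrix(G, Z, model):
    genomes = sorted({coset_rep(g, Z) for g in G})
    index = {g: i for i, g in enumerate(genomes)}
    P = [[0.0] * len(genomes) for _ in genomes]
    for j, g in enumerate(genomes):
        for a, w in model.items():
            for z in Z:
                P[index[coset_rep(compose(a, compose(z, g)), Z)]][j] += w / len(Z)
    return index, P

def likelihood(P, T, kmax=60):
    """Truncated series sum_k alpha_k e^{-T} T^k / k!, one value per genome."""
    K = len(P)
    v = [1.0 if i == 0 else 0.0 for i in range(K)]   # alpha_0, reference at index 0
    weight = math.exp(-T)                             # Poisson mass at k = 0
    L = [weight * x for x in v]
    for k in range(1, kmax + 1):
        v = [sum(P[i][j] * v[j] for j in range(K)) for i in range(K)]
        weight *= T / k
        L = [l + weight * x for l, x in zip(L, v)]
    return L

n = 5
G, Z = list(permutations(range(n))), dihedral(n)
index, P = genome_matrix(G, Z, {(1, 0, 2, 3, 4): 1.0})  # hypothetical model
L_at_2 = likelihood(P, 2.0)
```

A maximum likelihood estimate of the elapsed time for an observed genome could then be approximated by scanning such values over a grid of $T$, although for realistic sizes the decomposed expression of Theorem 4.9 (ii) is the intended route.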

Since the regular representation of the model element is symmetric, by Lemma 4.8, and the equilibrium distribution is the uniform distribution on the set of genomes, reversibility of the model $\mathcal{M}$ is equivalent to time reversibility of the underlying Markov process. As in Sect. 3.1, the regular representation of the model element in $\mathcal{A}$ is the transition matrix for a Markov chain with states being genomes, with the probability that genome $\mathbf{z}g_j$ transitions into genome $\mathbf{z}g_i$ via $k$ rearrangement steps from the model given by

$$\rho_{\mathrm{reg}}^{\mathcal{A}}(\tilde{\mathbf{s}}^k)_{ij} = \alpha_k(\mathbf{z}g_i g_j^{-1}).$$

Reversibility then means that for any genomes $\mathbf{z}g, \mathbf{z}h \in \mathcal{A}$, the probability of $\mathbf{z}g$ transforming into $\mathbf{z}h$ in $k$ steps via the given model is the same as that of $\mathbf{z}h$ transforming into $\mathbf{z}g$ in $k$ steps. In terms of path probabilities,

$$\alpha_k(\mathbf{z}gh^{-1}) = \alpha_k(\mathbf{z}hg^{-1}),$$

which is just a special case of Theorem 4.9 (iii). Model reversibility thus implies that the MLE distance is ‘directionless’, or symmetric, as is any evolutionary distance measure based on path probabilities calculated in this framework.

To conclude this section, we return briefly to the class algebra $\mathcal{A}_0$. By constructing simple examples, one can verify that, in general, the regular representation matrices of non-identity basis elements in $\mathcal{A}_0$ have non-zero entries on the diagonal, and thus see that we do not have an analogue of Theorem 4.5 (v) for $\mathcal{A}_0$. That is, the regular character of $\mathcal{A}_0$ is not counting occurrences of the identity in elements of this algebra, and thus cannot be used to calculate path probabilities as in Theorem 4.9.

Consider the underlying Markov model here, with transition matrix given by the regular representation of the model element analogue $\rho_{\mathrm{reg}}^{0}(\tilde{\mathbf{s}}\mathbf{z})$. Now the states are the basis elements $\mathbf{z}g_i\mathbf{z}$, each corresponding to an equivalence class $[g_i]_D$. Since each equivalence class $[g]_D$ is the disjoint union of $\frac{|[g]_D|}{|Z|}$ equivalence classes of the form $[gz]$ for some $z \in Z$, each basis element $\mathbf{z}g\mathbf{z}$ is the average of $\frac{|[g]_D|}{|Z|}$ distinct genomes of the form $\mathbf{z}gz$ for some $z \in Z$. Thus an arbitrary element of the matrix gives us the average probability of a genome from a certain class transitioning into a genome from another class (and a diagonal element gives the probability of a transition within a class). This is not refined enough for our purposes, since, given one genome $\mathbf{z}g$ and two more genomes $\mathbf{z}g'$ and $\mathbf{z}g''$ that are in the same class ($g'' \in [g']_D$), the probability of transitioning between $\mathbf{z}g$ and $\mathbf{z}g'$ need not be the same as the probability of transitioning between $\mathbf{z}g$ and $\mathbf{z}g''$.

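The information is not entirely lost, however; one can use the first column of this Markov matrix to calculate path probabilities. Given an observed instance $g \in G$ of a genome, we find the appropriate basis element $\mathbf{z}g_i\mathbf{z}$ of $\mathcal{A}_0$ such that $g \in [g_i]_D$, and then

$$\alpha_k(\mathbf{z}g) = \frac{|Z|}{|[g_i]_D|}\,\rho^{0}_{\mathrm{reg}}\big((\tilde{\mathbf{s}}\mathbf{z})^{k}\big)_{i1}.$$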
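The first-column recipe can be compared against direct path enumeration on a small example. The sketch below is a sanity check under assumptions that are not from the text: $G = S_5$, $Z = D_5$ acting on positions, and a model of equally weighted swaps of adjacent regions. The matrix `C` is a chain on double cosets, used here as a stand-in for the regular representation of the model element in the class algebra; `dclass`, `first_column`, `alpha_from_classes` and `alpha_direct` are hypothetical names.

```python
from fractions import Fraction
from itertools import permutations

N = 5
IDENTITY = tuple(range(N))


def compose(p, q):
    """Return p∘q, i.e. apply q first and then p."""
    return tuple(p[q[i]] for i in range(N))


# Assumed choices: Z = D_N on positions; M = swaps of adjacent regions.
Z = [tuple((i + r) % N for i in range(N)) for r in range(N)]
Z += [tuple((r - i) % N for i in range(N)) for r in range(N)]
M = []
for i in range(N):
    swap = list(range(N))
    swap[i], swap[(i + 1) % N] = swap[(i + 1) % N], swap[i]
    M.append(tuple(swap))


def genome(g):
    """The coset {z∘g}, standing in for the genome zg."""
    return frozenset(compose(z, g) for z in Z)


def dclass(g):
    """The double coset {z∘g∘z'}, standing in for the class [g]_D."""
    return frozenset(compose(h, z) for h in genome(g) for z in Z)


# Enumerate the classes; the class of the identity gets index 0.
classes, cindex = [], {}
for g in permutations(range(N)):
    d = dclass(g)
    if d not in cindex:
        cindex[d] = len(classes)
        classes.append(d)
m = len(classes)

# C[i][j] = probability that a genome in class j moves into class i.
C = [[Fraction(0)] * m for _ in range(m)]
for j, d in enumerate(classes):
    g = min(d)  # for this conjugation-invariant model any representative works
    for a in M:
        C[cindex[dclass(compose(a, g))]][j] += Fraction(1, len(M))


def first_column(k):
    """First column of C^k, built by k matrix-vector products."""
    v = [Fraction(int(i == 0)) for i in range(m)]
    for _ in range(k):
        v = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
    return v


def alpha_from_classes(k, g):
    """|Z| / |[g]_D| times the entry of the first column for the class of g."""
    d = dclass(g)
    return Fraction(len(Z), len(d)) * first_column(k)[cindex[d]]


def alpha_direct(k, g):
    """Brute force: total weight of k-step paths from the identity into zg."""
    dist = {IDENTITY: Fraction(1)}
    for _ in range(k):
        step = {}
        for h, p in dist.items():
            for a in M:
                ah = compose(a, h)
                step[ah] = step.get(ah, Fraction(0)) + p / len(M)
        dist = step
    target = genome(g)
    return sum(p for h, p in dist.items() if h in target)


g = (2, 0, 3, 1, 4)
print(alpha_from_classes(3, g) == alpha_direct(3, g))
```

Even in this toy setting, the double cosets are found by enumerating the whole group, which already hints at why computing and testing class membership becomes impractical as the number of regions grows.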

Of course, the fact that this path probability information exists in the regular representation does not mean that it is easy to obtain. The size of the regular representation of $\mathcal{A}_0$ is likely to increase rapidly with the number of genomic regions (for example, for $G = S_N$ and $Z = D_N$, $\dim(\mathcal{A}_0)$ is proportional to $(N-2)!$). Moreover, being unable to retain the ‘first column’ information through diagonalisation (as one can for the trace), one would need to calculate the $k$th power of the matrix for each desired path probability. We further note that calculating the equivalence classes themselves, and checking for membership of an equivalence class, is a non-trivial exercise and simply not feasible for large numbers of regions.

In any case, the class algebra $\mathcal{A}_0$ is nicer in some ways than the genome algebra $\mathcal{A}$, in particular in that it is decomposable (that is, isomorphic to a direct sum of its irreducible modules). This property is formally known as semisimplicity. Then, since the irreducible modules of $\mathcal{A}_0$ are identical to those of the algebra $\mathcal{A}$ and the irreducible representations of the two algebras not only have the same dimension but coincide on the objects of interest,

$$\rho^{0}_{p}(\mathbf{z}g\mathbf{z}) = \rho^{\mathcal{A}}_{p}(\mathbf{z}g\mathbf{z}) = \rho^{\mathcal{A}}_{p}(\mathbf{z}g),$$

one may in fact choose to implement the calculations of the irreducible representations in $\mathcal{A}$ or $\mathcal{A}_0$ and then, either way, combine the results together according to the decomposition given in Theorem 4.9.

Conclusion

We have presented a coherent algebraic framework for modelling some classes of genomes and rearrangements in an algebra that incorporates the inherent physical symmetries into each element. Algebraic frameworks for modelling genome rearrangement have been studied previously (Meidanis and Dias 2000; Moulton and Steel 2012; Francis 2014), and the importance of including genome symmetry in rearrangement distance calculations has been recognised (Egri-Nagy et al. 2014; Serdoz et al. 2017); however, our unified approach, incorporating symmetry into the position paradigm framework (Bhatia et al. 2018), is new.

Beginning with the specific case of circular genomes modelled with unoriented regions and dihedral symmetry, we explicitly constructed the genome algebra from the symmetric group algebra, and showed that the MLE computations can be performed entirely within this algebra. By identifying genomes and rearrangements with single elements—permutation clouds—in the genome algebra, we have advanced previous work that identified genomes with cosets of permutations (where each element of a given coset represents an instance of a genome in a fixed physical orientation) but used the permutations as the basis elements for computation (Egri-Nagy et al. 2014; Serdoz et al. 2017; Sumner et al. 2017; Terauds and Sumner 2019). We have both explained and removed the redundancy that we identified (Terauds and Sumner 2019) in the implementation of the calculations in the symmetric group algebra.

In Terauds and Sumner (2019), we also signalled a desire to extend our technique for calculating the MLE to other settings, for example to include oriented regions or genomes with non-dihedral symmetry. We have not recorded the results of any explicit computations here; however, we have algebraically verified that the technique can indeed be extended to a much more general case. For genomes where a single physical orientation can be represented by elements of a group $G$, and their physical symmetries by the subgroup $Z \leq G$, we defined the genome algebra of $G$ with $Z$; here, as in the special case described above, genomes and rearrangements correspond to basis elements (clouds). We showed that the path probabilities and thus the MLE can be formulated and the computations performed entirely in this genome algebra. An application of the framework to modelling signed circular genomes, using the hyperoctahedral group and two possible symmetry groups, is presented in Terauds et al. (2021), along with the results of some sample computations that illustrate how the framework may be applied to compare different models and distance measures.

Although the genome algebra has lower dimension than the group algebra (by a factor of $\frac{1}{2N}$ in the $D_N$ case and $\frac{1}{|Z|}$ in the general case), this does not significantly reduce the computational complexity of calculating the MLE. We have performed distance calculations, in reasonable time, for genomes with up to twelve unoriented regions (unpublished) and up to six oriented regions (Terauds et al. 2021). Work to extend our initial experimental calculations to implementation of the framework for larger numbers of regions is ongoing. In particular, we are exploring the use of simulations and intend to apply numerical approximations to make distance calculations tractable for genomes with larger numbers of regions.
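As a rough indication of scale, consider the dihedral case with $G = S_N$, so that the group algebra has dimension $N!$ (the choice $N = 12$ below simply matches the largest unoriented case mentioned above; the count itself is elementary arithmetic and not a result quoted from the text):

$$\dim \mathcal{A} = \frac{N!}{2N} = \frac{(N-1)!}{2}, \qquad N = 12:\quad \frac{12!}{24} = \frac{479\,001\,600}{24} = 19\,958\,400.$$

The reduction is only by the linear factor $2N$, so the dimension of the genome algebra still grows factorially with the number of regions.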

Whilst the framework does not specify a particular rearrangement model (and indeed, allows choice both in the type of rearrangements allowed and their relative probabilities of occurring), we cannot currently model insertions, deletions, or duplications, since the underlying group structure means we are restricted to rearrangements that do not alter the set of regions. This is a clear limitation of the current approach. To address this, we are currently working on extending the framework to a semigroup-based approach, with the aim of accommodating insertions and deletions. Furthermore, whilst one can apply different probabilities to different rearrangements (and, depending on the genome’s symmetry, rearrangements at different genomic positions), the current approach does not incorporate intergenic regions or explicitly consider breakpoints. Whether a group- or semigroup-based genome algebra approach can be devised that incorporates these biological realities, and others such as multiple chromosomes, is another question for future research.

Finally, we note that the applications of this algebraic framework are not limited to calculating MLEs. The likelihood function is built from path probabilities; since our fundamental results hold for these ‘building blocks’, other rearrangement distance measures that are based on path probabilities may be calculated via the genome algebra. We have shown that, via the regular representation of the genome algebra, the general genome rearrangement model can be viewed as a discrete (or, with the addition of the stochastic component, continuous time) Markov chain, and thus represented as a connected graph, generalising the Cayley graph approach (Moulton and Steel 2012; Clark et al. 2019). This facilitates the calculation of further distance measures, for example mean first passage time, as demonstrated in Terauds et al. (2021).

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions.

Footnotes

1

These latter two may be considered simplifying assumptions, since they reduce the size of the state space. However, as will be shown later, the framework just as easily accommodates the alternate cases.

2

Here, a region is a contiguous section of the genome such as a sequence of genes (for an example of how such genomic simplification may be enacted in practice, see Belda et al. 2005).

3

Recall that the irreducible characters of $S_N$ are all real valued; however, we use the general form of the result here, since we apply the same argument to an arbitrary group in Sect. 4.

4

In this non-oriented region case.

5

To model genomes as possessing no symmetries, one takes the symmetry group to be trivial; this case thus fits within the framework, but each genome simply corresponds to a single permutation.

6

We note that most of these mathematically possible rearrangements would not correspond to biologically plausible ones, so would not appear in rearrangement models in practice. For a deeper consideration of biologically plausible rearrangements from an algebraic perspective, see Stevenson et al. (2022).

7

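One can verify via a simple counting argument that $\big(\rho^{0}_{\mathrm{reg}}(\mathbf{z}g\mathbf{z})\big)_{ij} = \frac{|[g_i]_D|}{|[g_j]_D|}\big(\rho^{0}_{\mathrm{reg}}(\mathbf{z}g^{-1}\mathbf{z})\big)_{ji}$.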

This work was supported by Australian Research Council Discovery Grant DP180102215. The authors would like to thank Andrew Francis and Joshua Stevenson, for many interesting discussions relating to this work, and the anonymous reviewers, for their constructive comments.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Venta Terauds, Email: venta.terauds@utas.edu.au.

Jeremy Sumner, Email: jeremy.sumner@utas.edu.au.

References

  1. Bader DA, Moret BME, Yan M. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study. J Comput Biol. 2001;8(5):483–491. doi: 10.1089/106652701753216503.
  2. Bader M, Ohlebusch E. Sorting by weighted reversals, transpositions, and inverted transpositions. In: Research in computational molecular biology, 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2–5, 2006, Proceedings. 2006. pp 563–577.
  3. Belda E, Moya A, Silva FJ. Genome rearrangement distances and gene order phylogeny in γ-proteobacteria. Mol Biol Evol. 2005;22(6):1456–1467. doi: 10.1093/molbev/msi134.
  4. Bhatia S, Feijão P, Francis AR. Position and content paradigms in genome rearrangements: the wild and crazy world of permutations in genomics. Bull Math Biol. 2018;80(12):3227–3246. doi: 10.1007/s11538-018-0514-3.
  5. Chen L, Chen P-Y, Xue X-F, Hua H-Q, Li Y-X, Zhang F, Wei S-J. Extensive gene rearrangements in the mitochondrial genomes of two egg parasitoids, Trichogramma japonicum and Trichogramma ostriniae (Hymenoptera: Chalcidoidea: Trichogrammatidae). Sci Rep. 2018;8.
  6. Clark C, Egri-Nagy A, Francis A, Gebhardt V. Bacterial phylogeny in the Cayley graph. Discrete Math Algorithms Appl. 2019;11(05):1950059. doi: 10.1142/S1793830919500599.
  7. Darmon E, Leach DRF. Bacterial genome instability. Microbiol Mol Biol Rev. 2014;78(1):1–39. doi: 10.1128/MMBR.00035-13.
  8. Dobzhansky T, Sturtevant AH. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics. 1938;23(1):28–64. doi: 10.1093/genetics/23.1.28.
  9. Egri-Nagy A, Gebhardt V, Tanaka MM, Francis AR. Group-theoretic models of the inversion process in bacterial genomes. J Math Biol. 2014;69(1):243–265. doi: 10.1007/s00285-013-0702-6.
  10. Etingof P, Golberg O, Hensel S, Liu T, Schwendner A, Vaintrob D, Yudovina E. Introduction to representation theory. Student Mathematical Library, vol 59. With historical interludes by Slava Gerovitch. Providence, RI: American Mathematical Society; 2011.
  11. Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2004.
  12. Francis AR. An algebraic view of bacterial genome evolution. J Math Biol. 2014;69(6–7):1693–1718. doi: 10.1007/s00285-013-0747-6.
  13. James G, Liebeck M. Representations and characters of groups. 2nd ed. New York: Cambridge University Press; 2001.
  14. Meidanis J, Dias Z. An alternative algebraic formalism for genome rearrangements. In: Sankoff D, Nadeau JH, editors. Comparative genomics. Comput Biol. 2000;1:213–223.
  15. Moulton V, Steel M. The ‘butterfly effect’ in Cayley graphs with applications to genomics. J Math Biol. 2012;65(6–7):1267–1284. doi: 10.1007/s00285-011-0498-1.
  16. Oesper L, Dantas S, Raphael BJ. Identifying simultaneous rearrangements in cancer genomes. Bioinformatics. 2018;34(2):346–352. doi: 10.1093/bioinformatics/btx745.
  17. Oliveira AR, Jean G, Fertin G, Dias U, Dias Z. Super short operations on both gene order and intergenic sizes. Algorithms Mol Biol. 2019;14(1):1–17. doi: 10.1186/s13015-019-0156-5.
  18. Sagan BE. The symmetric group. Graduate texts in mathematics, vol 203. 2nd ed. New York: Springer; 2001.
  19. Serdoz S, Egri-Nagy A, Sumner J, Holland BR, Jarvis PD, Tanaka MM, Francis AR. Maximum likelihood estimates of pairwise rearrangement distances. J Theor Biol. 2017;423:31–40. doi: 10.1016/j.jtbi.2017.04.015.
  20. Steinberg B. Representation theory of finite monoids. Cham: Springer; 2016.
  21. Stevenson J, Terauds V, Sumner J. Rearrangement events on circular genomes. arXiv preprint arXiv:2202.01968. 2022.
  22. Sumner JG, Jarvis PD, Francis AR. A representation-theoretic approach to the calculation of evolutionary distance in bacteria. J Phys A Math Theor. 2017;50(33):335601. doi: 10.1088/1751-8121/aa7d60.
  23. Terauds V, Sumner J. Maximum likelihood estimates of rearrangement distance: implementing a representation-theoretic approach. Bull Math Biol. 2019;81(2):535–567. doi: 10.1007/s11538-018-0511-6.
  24. Terauds V, Stevenson J, Sumner J. A symmetry-inclusive algebraic approach to genome rearrangement. J Bioinform Comput Biol. 2021;19(06):2140015. doi: 10.1142/S0219720021400151.
  25. Wang L-S, Warnow T, Moret BME, Jansen RK, Raubeson LA. Distance-based genome rearrangement phylogeny. J Mol Evol. 2006;63(4):473–483. doi: 10.1007/s00239-005-0216-y.

Articles from Journal of Mathematical Biology are provided here courtesy of Springer
