2020 Sep 17;81(4):1029–1057. doi: 10.1007/s00285-020-01535-5

Structure of the space of taboo-free sequences

Cassius Manuel 1, Arndt von Haeseler 1,2
PMCID: PMC7560954  PMID: 32940748

Abstract

Models of sequence evolution typically assume that all sequences are possible. However, restriction enzymes that cut DNA at specific recognition sites provide an example where carrying a recognition site can be lethal. Motivated by this observation, we studied the set of strings over a finite alphabet with taboos, that is, with prohibited substrings. The taboo-set is referred to as T and any allowed string as a taboo-free string. We consider the so-called Hamming graph Γn(T), whose vertices are taboo-free strings of length n and whose edges connect two taboo-free strings if their Hamming distance equals one. Any (random) walk on this graph describes the evolution of a DNA sequence that avoids taboos. We describe the construction of the vertex set of Γn(T). Then we state conditions under which Γn(T) and its suffix subgraphs are connected. Moreover, we provide an algorithm that determines if all these graphs are connected for an arbitrary T. As an application of the algorithm, we show that about 87% of bacteria listed in REBASE have a taboo-set that induces connected taboo-free Hamming graphs, because they have less than four type II restriction enzymes. On the other hand, four properly chosen taboos are enough to disconnect one suffix subgraph, and consequently connectivity of taboo-free Hamming graphs could change depending on the composition of restriction sites.

Keywords: Bacteriophage DNA evolution, Endonuclease-dependent evolution, Restriction-enzyme dependent evolution, Restriction–modification system, Hamming graph with taboos, Connectivity of Hamming graphs

Introduction

In bacteria, restriction enzymes cleave foreign DNA to stop its propagation. To do so, a double-stranded cut is induced by a so-called recognition site, a DNA sequence of length 4–8 base pairs (Alberts et al. 2004). As part of their restriction–modification (R–M) system, bacteria can escape the lethal effect of their own restriction enzymes by modifying recognition sites in their own DNA (Kommireddy and Nagaraja 2013). Nevertheless, Gelfand and Koonin (1997) and Rocha et al. (2001) found a significant avoidance of recognition sites in bacterial DNA, and Rusinov et al. (2015) showed that this avoidance was characteristic of type II R–M systems. Also in bacteriophages, the avoidance of the recognition sites is evolutionarily advantageous (Rocha et al. 2001), mainly for non-temperate bacteriophages affected by orthodox type II R–M systems (Rusinov et al. 2018a). Therefore, in those instances the recognition site is, as we call it, a taboo for host and foreign DNA.

Although avoidance of recognition sites is well studied, e.g. by Rusinov et al. (2018b), taboo-free DNA evolution has not yet been modelled. To initiate models of sequence evolution with taboos, we studied the Hamming graph $\Gamma_n(T)$, whose vertices are the strings of length $n$ over a finite alphabet $\Sigma$ that do not contain any taboo of the set $T$ as a substring. Two vertices of the Hamming graph are adjacent if the corresponding taboo-free strings have Hamming distance equal to one. In biological terms, the sequences differ by a single substitution.

We note that, for a binary alphabet $\Sigma=\{0,1\}$ and taboo-set $T=\{11\}$, the corresponding Hamming graphs $\Gamma_n(T)$ are known as Fibonacci cubes. Some properties of the Fibonacci cubes, like the Wiener index or the degree distribution, were surveyed by Klavžar (2013). Further results have been obtained for taboo-sets forbidding arbitrary numbers of consecutive “1”s, $T=\{1\cdots 1\}$, by Hsu and Chung (1993), or when $T=\{s\}$ for an arbitrary binary string $s$ by Ilić et al. (2012). Recently, the equivalent problem of lattice paths that avoid some patterns has been described using automata and generating functions by Asinowski et al. (2018, 2020).

We are not so much interested in enumerative properties of Hamming graphs. We want to define conditions under which the Hamming graphs stay connected for arbitrary finite alphabets and arbitrary finite taboo-sets. From an evolutionary point of view, connectivity guarantees that any taboo-free sequence can be generated by point mutations from any initial taboo-free sequence without containing a taboo-string during evolution. To include further biological realism, we will also study the connectivity of subgraphs Γns(T) of the Hamming graph, where s is a taboo-free suffix. Suffix s can be viewed as a conserved DNA fragment, that is, a sequence that remained invariable during evolution (Shoemaker and Fitch 1989; Fitch and Margoliash 1967).

The inclusion of Hamming graphs with a constant suffix provides more general results, because Γne(T)=Γn(T), where e is the empty string. Given a taboo-set T, if for every taboo-free string s and integer n the Hamming graph Γns(T) is connected, then evolution can explore the space of taboo-free sequences by simple point mutation, no matter which DNA suffix fragments remain invariable, as long as the taboo-set T does not change in the course of evolution.

Motivating examples and non-technical presentation of key results

Here, we give a non-technical description of the essential results to determine connectivity. The subsequent sections provide a more technical and precise description of the central results.

Consider an alphabet $\Sigma$, for example $\Sigma=\{0,1\}$. In a Hamming graph of length $n$, all possible words of length $n$ are vertices, and two of these vertices are joined by an edge if they differ in exactly one position. A taboo-set is a set of forbidden subwords, such as $T=\{11,000\}$. Then, to construct a taboo-free Hamming graph $\Gamma_n(T)$, we simply have to erase all words of the Hamming graph of length $n$ containing those taboos. Figure 1 provides an example where $\Gamma_n(T)$ is disconnected for $n\ge 3$.
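The construction just described can be reproduced by brute force for small $n$. The following Python sketch is not part of the original article: the function names `taboo_free_strings` and `is_connected` are illustrative, and treating the empty graph as connected is a convention chosen here.

```python
from itertools import product

def taboo_free_strings(n, alphabet, taboos):
    """V_n(T): all length-n strings that contain no taboo as a substring."""
    words = ("".join(p) for p in product(alphabet, repeat=n))
    return {w for w in words if not any(t in w for t in taboos)}

def is_connected(vertices, alphabet):
    """Depth-first search along single substitutions that stay inside `vertices`."""
    if not vertices:
        return True  # convention chosen for this sketch
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in alphabet:
                v = u[:i] + a + u[i + 1:]
                if v in vertices and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(vertices)

alphabet, taboos = "01", {"11", "000"}
for n in range(1, 6):
    V = taboo_free_strings(n, alphabet, taboos)
    print(n, sorted(V), is_connected(V, alphabet))
```

For $n=3$ the vertex 010 has no taboo-free neighbour, which is what disconnects the graph.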

Fig. 1

Graph $\Gamma_n(T)$ for $n\in[1,5]$, binary alphabet $\Sigma=\{0,1\}$ and $T=\{11,000\}$. Set $V_{n+1}(T)$ is constructed by adding every allowed letter at the beginning of each string in $V_n(T)$

Given some alphabet and some taboo-set, deciding whether graph $\Gamma_n(T)$ is connected is not a trivial task. To see this, consider the four-nucleotide alphabet $\Sigma=\{A,C,G,T\}$, which is our main object of interest. Figure 2 shows the connected graph $\Gamma_3(T)$ for taboo-set $T=\{AA,AC,AG,CA,CC,CG,GA,GC,GG\}$. The word TTT is a cut vertex, meaning that taboo-set $T'=T\cup\{TTT\}$ yields the disconnected graph $\Gamma_3(T')$.

Fig. 2

Graph $\Gamma_3(T)$, where $\Sigma=\{A,C,G,T\}$ and $T=\{AA,AC,AG,CA,CC,CG,GA,GC,GG\}$. Vertex TTT is a cut vertex, because if we remove TTT and its incident edges (dashed lines, coloured red), then the resulting graph is disconnected. Consequently, graph $\Gamma_3(T')$ induced by taboo-set $T'=T\cup\{TTT\}$ is disconnected. Red, blue and yellow edges connect vertices with a different distribution of letter T (colour figure online)
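The effect of the single extra taboo TTT can be checked by exhaustive enumeration. This is a minimal sketch with illustrative helper names (not from the article); it rebuilds both vertex sets and tests connectivity by depth-first search.

```python
from itertools import product

def taboo_free_strings(n, alphabet, taboos):
    """All length-n strings over `alphabet` containing no taboo as a substring."""
    words = ("".join(p) for p in product(alphabet, repeat=n))
    return {w for w in words if not any(t in w for t in taboos)}

def is_connected(vertices, alphabet):
    """Depth-first search along single substitutions inside `vertices`."""
    if not vertices:
        return True  # convention chosen for this sketch
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in alphabet:
                v = u[:i] + a + u[i + 1:]
                if v in vertices and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(vertices)

alphabet = "ACGT"
T = {x + y for x in "ACG" for y in "ACG"}      # the nine taboos of Fig. 2
V = taboo_free_strings(3, alphabet, T)
V_prime = taboo_free_strings(3, alphabet, T | {"TTT"})
print(len(V), is_connected(V, alphabet))
print(len(V_prime), is_connected(V_prime, alphabet))
```

Once TTT is removed, the three strings TAT, TCT and TGT only have each other as neighbours, so they form a separate component.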

Since the addition or deletion of one single taboo can have such an impact on connectivity, we need a tool to determine the structure of the taboo-free Hamming graphs. This tool is described in full generality at the end of Sect. 8. In the particular case when Σ={A,C,G,T}, our results can be simplified as follows.

  1. If the number of taboos is smaller than the size of the alphabet, that is if |T|<4, then all graphs Γns(T) are connected (Corollary 25.b). For example, given T={AATT,CCGG}, all taboo-free Hamming graphs are connected.

    Similarly, if the size of the set of all starting letters of taboos is smaller than the size of the alphabet, then all taboo-free Hamming graphs are connected (Corollary 25.a). This applies for taboo-set T={AA,AC,AG,CA,CC,CG,GA,GC,GG}, because the set of initial letters is {A,C,G} and |{A,C,G}|=3<4.

  2. Proposition 24 describes a slightly more complex sufficient condition to determine connectivity. Given T, delete the first letter of each taboo to construct the set Ψ(T). For example, if T={AAA,CCA,GGA,TTT}, then Ψ(T)={AA,CA,GA,TT}.

    In set $\Psi(T)$, consider every pair of strings with Hamming distance 1 or 0. For example, the pair (AA, AA) has distance 0; the pair (AA, CA) has distance 1; and the pair (AA, TT) has distance 2. If every pair with Hamming distance 1 or 0 can be taboo-free extended to the left by the same letter, then all graphs $\Gamma_n^{s}(T)$ are connected.

    For example, the pair (AA, AA) can be extended by C, because CAA is taboo-free, and the pair (AA, CA) can be extended by T, because TAA and TCA are taboo-free. After checking all possible pairs with Hamming distance 0 or 1, we see that all such pairs in $\Psi(T)$ are extendable to the left, and thus taboo-set $T$ generates connected taboo-free Hamming graphs.

  3. If Proposition 24 cannot be applied, then we apply the characterization of Theorem 22. Assume for example that $T=\{AAA,CCA,TAA,GAA\}$. Since the pair $\{AA,CA\}\subseteq\Psi(T)$ with Hamming distance one is not taboo-free extendable to the left by any letter, we proceed as follows. First we construct $\mathrm{suf}(T)$, the set of all proper suffixes of $T$. In our example, $\mathrm{suf}(T)=\{AA,CA,A,e\}$, where $e$ is the string with no letters. Now we consider, for every suffix $r\in\mathrm{suf}(T)$, the graph $\Gamma_{|r|+M}^{r}(T)$, where $|r|$ is the length of $r$ and $M$ is the length of the longest taboo(s) in $T$. If all graphs $\Gamma_{|r|+M}^{r}(T)$ are connected, then every graph $\Gamma_n^{s}(T)$ is connected. In our example, graphs $\Gamma_5^{AA}(T)$, $\Gamma_5^{CA}(T)$, $\Gamma_4^{A}(T)$ and $\Gamma_3(T)$ are connected, implying that all taboo-free Hamming graphs are connected.

    When graph $\Gamma_{|r|+M}^{r}(T)$ is disconnected for some $r\in\mathrm{suf}(T)$, then suffix $r$ induces disconnected taboo-free Hamming graphs of the form $\Gamma_n^{r}(T)$ for $n\ge|r|+M$. Therefore evolution cannot explore the whole space of taboo-free sequences. This is the case for taboo-set $T'$ of Fig. 2, where $r=e$ yields the disconnected graph $\Gamma_3(T')$.
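Items 2 and 3 above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, only pairs of equal length in $\Psi(T)$ are compared, and an empty graph is counted as connected.

```python
from itertools import product

def taboo_free(s, taboos):
    """True if no taboo occurs in s as a substring."""
    return not any(t in s for t in taboos)

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def left_extension_test(alphabet, taboos):
    """Item 2: every pair in Psi(T) at distance <= 1 shares a taboo-free left letter."""
    psi = {t[1:] for t in taboos}
    for u, v in product(psi, repeat=2):
        if len(u) == len(v) and hamming(u, v) <= 1:   # equal lengths only (assumption)
            if not any(taboo_free(a + u, taboos) and taboo_free(a + v, taboos)
                       for a in alphabet):
                return False
    return True

def is_connected(vertices, alphabet):
    """Depth-first search along single substitutions inside `vertices`."""
    if not vertices:
        return True  # convention chosen for this sketch
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in alphabet:
                v = u[:i] + a + u[i + 1:]
                if v in vertices and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(vertices)

def suffix_graph_test(alphabet, taboos):
    """Item 3: check the graph with suffix r and length |r| + M for every r in suf(T)."""
    M = max(len(t) for t in taboos)
    suffixes = {t[i:] for t in taboos for i in range(1, len(t) + 1)}  # includes ""
    for r in suffixes:
        V = {"".join(p) + r for p in product(alphabet, repeat=M)}
        V = {w for w in V if taboo_free(w, taboos)}
        if not is_connected(V, alphabet):
            return False
    return True

print(left_extension_test("ACGT", {"AAA", "CCA", "GGA", "TTT"}))
print(left_extension_test("ACGT", {"AAA", "CCA", "TAA", "GAA"}))
print(suffix_graph_test("ACGT", {"AAA", "CCA", "TAA", "GAA"}))
```

Because every vertex of a suffix graph ends in the same string $r$, the generic substitution search never leaves the graph through a suffix position, so no special handling of the fixed suffix is needed.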

Outline

We will characterize taboo-sets $T$ such that every Hamming graph of the form $\Gamma_n^{s}(T)$ is connected. To this end, we describe in Sect. 5 basic properties of taboo-sets. In Sect. 6, we introduce a very general type of taboo-sets, called left proper (Definition 4), which are our main object of study. In Proposition 11.b we show that, to construct graph $\Gamma_n^{s}(T)$, we only need the longest prefix of $s$ which is a suffix of a taboo, which we call $s[1,k_s]$. In Sect. 7 we state the graph isomorphism $\Gamma_{n+|s|}^{s}(T)\cong\Gamma_{n+k_s}^{s[1,k_s]}(T)$ (Theorem 16). In Sect. 8 we explain how the edges of a quotient graph are related to the structure of graph $\Gamma_n(T)$ (Proposition 17).

Combining all these results, in Sect. 8 we characterize the connectivity of Hamming graphs Γns(T). We prove by induction that the connectivity of a small number of quotient graphs implies the connectivity of all Hamming graphs with long suffixes (Proposition 20). This result can be used to prove connectivity of Hamming graphs with short suffixes (Proposition 21). These two results yield the characterization of the connectivity of every suffix Hamming graph in Theorem 22. Section 9 provides examples of bacterial taboo-sets and their connectivity.

Basic notations

We will introduce some standard notations concerning strings as well as some relevant terms from graph theory.

Strings

We will use the term string to refer to a sequence of symbols over an arbitrary finite alphabet $\Sigma=\{a_1,\ldots,a_m\}$, where $m\ge 2$, while (DNA) sequence is reserved for biological contexts, where the alphabet consists of the four nucleotides $\Sigma=\{A,C,G,T\}$.

We denote the set of strings of length $n$ over the alphabet $\Sigma$ by $\Sigma^n$. The length of a string $s$ is denoted by $|s|$. The empty string will be denoted by $e$, and satisfies $|e|=0$ and $\{e\}=\Sigma^0$.

Given a string $s=b_1\cdots b_n\in\Sigma^n$, the expression

$$s[i,j]:=\begin{cases} b_i\cdots b_j & \text{if } 1\le i\le j\le n,\\ e & \text{otherwise}\end{cases}$$

denotes the substring of $s$ starting at the $i$th position and ending at the $j$th position, and $e$ when this substring is not well-defined (for example if $j=0$). In particular $s[1,j]$ is a prefix of $s$ that ends at position $j$ and $s[i,n]$ is a suffix of $s$ that starts at position $i$. A substring, prefix or suffix is called proper if it is not the entire string $s$. For a set of strings $S$, we define the substrings from the $i$th to the $j$th position of $S$ as

$$S[i,j]:=\{s[i,j]\mid s\in S\}.$$

We also need the set of proper suffixes of $S$, defined as

$$\mathrm{suf}(S):=\bigcup_{s\in S}\ \bigcup_{i\in[2,|s|]}\{s[i,|s|]\}\cup\{e\},$$

where $i\in[2,|s|]$ refers to all integers $i$ within the interval $[2,|s|]$. It should not be confused with the substring $s[2,|s|]$ of $s$.

Example 1

If S={ACG,GGG,TTC,CC} then

suf(S)={CG,G,GG,TC,C,e}.
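Example 1 can be reproduced with a short Python sketch. The function name `proper_suffixes` is an illustrative choice, and the empty string `""` plays the role of $e$.

```python
def proper_suffixes(strings):
    """suf(S): every suffix of every string in S except the string itself, plus e."""
    # s[i:] with i = 1..len(s) drops at least one leading symbol; i = len(s) gives "".
    return {s[i:] for s in strings for i in range(1, len(s) + 1)} | {""}

print(sorted(proper_suffixes({"ACG", "GGG", "TTC", "CC"})))
```

The explicit union with `{""}` only matters when the input set is empty.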

If string $s_1$ is a substring of string $s_2$, we write $s_1\prec s_2$, while $s_1\nprec s_2$ denotes that $s_1$ is not a substring of $s_2$. By convention, $e\prec s$ for any string $s$. For strings $s_1$ and $s_2$, we define $s_1s_2$ as the concatenation of $s_1$ and $s_2$. Note that $es=se=s$ for any $s$. For a string $s$ and a set of strings $S=\{s_1,\ldots,s_k\}$, the concatenation of $s$ with all elements in $S$ is denoted by $sS:=\{ss_1,\ldots,ss_k\}$. If $S_1$ and $S_2$ are disjoint sets, then the disjoint union of $S_1$ and $S_2$ will be denoted by $S_1\sqcup S_2$.

Finally, given two strings s1,s2 of equal length, d(s1,s2) denotes their Hamming distance, that is, the number of positions at which the corresponding symbols differ.

Graph theory

We will use common graph theory terminology following Wilson (1986). Let $G=(V,E)$ denote a simple, undirected graph with vertex set $V$ and edge set $E$. We say that graph $G_1=(V_1,E_1)$ is a subgraph of $G_2=(V_2,E_2)$ if $V_1\subseteq V_2$ and $E_1\subseteq E_2$, and we denote this as $G_1\subseteq G_2$.

Given a graph $G=(V,E)$ and a subset $V_1\subseteq V$, the subgraph induced by $V_1$ in $G$, $G(V_1)=(V_1,E_{V_1})$, has vertex set $V_1$ and, for any $u,v\in V_1$, $\{u,v\}\in E_{V_1}$ iff $\{u,v\}\in E$.

Two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$ are isomorphic, denoted by $G_1\cong G_2$, if there exists a bijection $f:V_1\to V_2$ such that, for every $u,v\in V_1$, $\{u,v\}\in E_1$ iff $\{f(u),f(v)\}\in E_2$. That is, $G_1$ and $G_2$ are isomorphic if there exists an edge-preserving bijection between their vertex sets.

We will also need the quotient graph, as defined by Sanders and Schulz (2013), to study the connectivity of Hamming graphs. To define it, consider a graph $G=(V,E)$ and a partition of its vertex set $V$, namely $V=\bigsqcup_{b\in J}V_b$ for some index set $J$. The quotient graph of $G$, denoted as $Q[G]=(J,E_J)$, is the graph whose vertices are $J$ and such that $\{b_1,b_2\}\in E_J$ iff an edge connects a vertex in $V_{b_1}$ with a vertex in $V_{b_2}$. Figure 3 gives an example of a quotient graph.

Fig. 3

Example of a quotient graph. For $G=(V,E)$ on the left hand side, with $V=\{1,2,3,4,5,6,7,8\}$ and partition $V=V_a\sqcup V_b\sqcup V_c\sqcup V_d$, we obtain the quotient graph $Q[G]$ on the right hand side

Our strategy to prove connectivity of taboo-free Hamming graphs will use the following propositions, whose proofs are simple enough to be omitted.

Proposition 1

Consider graph $G=(V,E)$ and partition $V=\bigsqcup_{b\in J}V_b$.

If every induced subgraph $G(V_b)$ for $b\in J$ is connected and the quotient graph $Q[G]$ is connected, then $G$ is connected.

Proposition 2

For graph G=(V,E), the following statements are equivalent:

  • G is connected.

  • For every partition of V, the quotient graph Q[G] is connected.

Properties of taboo-sets

We will repeatedly use the following terminology.

Definition 1

  • A finite set of strings $T$ such that every $t\in T$ satisfies $|t|\ge 2$ is called a taboo-set.

  • Strings in $T$ are called taboos.

  • The length of the longest taboo(s) in $T$ will be denoted by $M:=\max_{t\in T}|t|$.

  • A string is taboo-free if it does not contain any taboo of $T$ as a substring.

  • $V_n(T)$ denotes the set of taboo-free strings of length $n$.

  • $V_n^{s}(T)$ denotes the set of strings in $V_n(T)$ with suffix $s$.

  • Similarly, ${}^{s}V_n(T)$ denotes all strings in $V_n(T)$ with prefix $s$.

With Definition 1 in mind, we can prove some simple properties of taboo-sets.

Proposition 3

Given taboo-sets $T_1$ and $T_2$, it holds that:

  (a) Set $T_1\cup T_2$ is a taboo-set.

  (b) For $n\in\mathbb{N}$, $V_n(T_1)\cap V_n(T_2)=V_n(T_1\cup T_2)$.

  (c) If for every $t_1\in T_1$ there exists $t_2\in T_2$ such that $t_2\prec t_1$, then for any $n\in\mathbb{N}$, $V_n(T_2)\subseteq V_n(T_1)$.

Proof

  (a) Every $t\in T_1\cup T_2$ has length at least 2, and thus $T_1\cup T_2$ is a taboo-set.

  (b) All strings $s\in V_n(T_1)\cap V_n(T_2)$ satisfy $t_1\nprec s$ for all $t_1\in T_1$ and $t_2\nprec s$ for all $t_2\in T_2$; this is equivalent to $s$ satisfying $t\nprec s$ for all $t\in T_1\cup T_2$.

  (c) Consider $s\in V_n(T_2)$. Assume that $s\notin V_n(T_1)$; then there exists $t_1\in T_1$ such that $t_1\prec s$. But there also exists a $t_2\in T_2$ such that $t_2\prec t_1$, and thus $t_2\prec s$, a contradiction. Hence $s\in V_n(T_1)$.

For a given $n$ and $T$, we can find a taboo-set $T'\ne T$ such that $V_n(T)=V_n(T')$. In this sense, taboo-sets are not unique, as we illustrate in the following proposition.

Proposition 4

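For a string $t$ and $n\ge|t|+1$, it holds that

$$V_n(\{t\})=V_n\big((t\Sigma)\cup(\Sigma t)\big).$$

Proof

  • $\subseteq$: Any taboo in $T_1:=(t\Sigma)\cup(\Sigma t)$ has $t\in T_2:=\{t\}$ as substring, and thus Proposition 3.c implies $V_n(\{t\})\subseteq V_n((t\Sigma)\cup(\Sigma t))$.

  • $\supseteq$: Assume that there exists an $s\in V_n((t\Sigma)\cup(\Sigma t))$ with $t\prec s$. Since $|s|=n$ and $n\ge|t|+1$, the substring $t$ is either preceded or followed by some symbol $a\in\Sigma$. This contradicts $\{at,ta\}\subseteq(t\Sigma)\cup(\Sigma t)$.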
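Proposition 4 can be checked numerically for small cases. The sketch below uses illustrative function names and compares the two vertex sets directly; it also shows that the bound $n\ge|t|+1$ is needed.

```python
from itertools import product

def taboo_free_strings(n, alphabet, taboos):
    """V_n(T): length-n strings over `alphabet` with no taboo as a substring."""
    words = ("".join(p) for p in product(alphabet, repeat=n))
    return {w for w in words if not any(t in w for t in taboos)}

def expanded_taboo_set(t, alphabet):
    """The set (t Sigma) union (Sigma t) from Proposition 4."""
    return {t + a for a in alphabet} | {a + t for a in alphabet}

def same_vertex_set(t, n, alphabet):
    """Compare V_n({t}) with V_n((t Sigma) union (Sigma t))."""
    return (taboo_free_strings(n, alphabet, {t})
            == taboo_free_strings(n, alphabet, expanded_taboo_set(t, alphabet)))

alphabet = "ACGT"
for n in range(2, 6):
    print(n, same_vertex_set("AA", n, alphabet))
```

For $n=|t|=2$ the two sets differ, because the string AA itself is too short to contain any of the length-3 taboos.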

Proposition 4 implies that, for any $T$, we can construct many taboo-sets $T'$ such that $V_n(T)=V_n(T')$ as long as $n\ge\max(M,M')$, where $M$ and $M'$ denote the length of the longest taboo in $T$ and $T'$, respectively.

Example 2

If $T=T_1\cup T_2$ with $T_2=(t\Sigma)\cup(\Sigma t)$, Propositions 3.a and 4 imply that $T':=T_1\cup\{t\}$ satisfies $V_n(T)=V_n(T')$ for any $n\ge M$. Repeating this process, we can construct a taboo-set $T''$ such that $(t\Sigma)\cup(\Sigma t)\not\subseteq T''$ for any string $t$ and satisfying $V_n(T)=V_n(T'')$ for any $n\ge M$.

Example 2 and Proposition 4 motivate the following definition.

Definition 2

A taboo-set T is minimal if the following conditions hold:

  (a) For every pair of different $t_1,t_2\in T$, it holds that $t_1\nprec t_2$.

  (b) For every $j\in[0,M-1]$ and $s\in V_j(T)$, the set $(s\Sigma)\cup(\Sigma s)$ is not a subset of $T$.

Condition (a) is easy to justify: If string AA is a taboo, it is redundant that AAA be a taboo. Condition (b) avoids unnecessarily complicated taboo-sets. For example, using the four-nucleotide alphabet, taboo-set $T=\{AAA,AAC,AAG,AAT,CAA,GAA,TAA\}$ can be minimized to $T'=\{AA\}$. In general, one can minimize a taboo-set according to Example 2.

Since we want to study taboo-free strings of arbitrary lengths, we need conditions to concatenate taboo-free strings such that the concatenated sequence is taboo-free. The following result gives such a condition.

Proposition 5

Given taboo-set $T$, consider three strings $s_1,s_2,s_3$ such that $s_1s_2$ and $s_2s_3$ are taboo-free and $|s_2|\ge M-1$. Then $s:=s_1s_2s_3$ is taboo-free.

Proof

If $|s_1|=0$ and $|s_3|=0$, then $s=s_2$ is taboo-free, as desired. Now assume either $|s_1|>0$ or $|s_3|>0$, yielding $n:=|s_1|+|s_2|+|s_3|\ge M$. For each $i\in[1,n-M+1]$, the fact that $|s_2|\ge M-1$ implies that either $s[i,i+M-1]\prec s_1s_2$ or $s[i,i+M-1]\prec s_2s_3$; hence each window $s[i,i+M-1]$ of length $M$ is taboo-free. Since no taboo is longer than $M$, the result follows.

Prefixes and suffixes of a taboo-free string

Given a taboo-free string $s$, the construction of set $V_n^{s}(T)$ for $n>|s|$ depends on which strings $w$ can be concatenated to the left side of $s$ such that $ws\in V_n(T)$. This motivates the following definition.

Definition 3

Given a taboo-set $T$, consider a taboo-free string $s$ and $k\in\mathbb{N}_0$. The $k$-prefixes of $s$ are the elements of the set $L^k(s)$, defined as

$$L^k(s):=\{w\in\Sigma^k \text{ such that } ws \text{ is taboo-free}\}=V_{|s|+k}^{s}(T)[1,k].$$

If $L^k(s)\ne\emptyset$, then we will say that $s$ is $k$-prefixable.

Similarly, the $k$-suffixes of $s$, denoted $R^k(s)$, are the strings $w\in\Sigma^k$ such that $sw\in V_{|s|+k}(T)$, that is, $R^k(s):={}^{s}V_{|s|+k}(T)[|s|+1,|s|+k]$. When $R^k(s)\ne\emptyset$, we say that $s$ is $k$-suffixable.

Example 3

If Σ={A,C,G,T} and T={CAA,GAA,TAA}, then L1(AA)={A} and L2(AA)={AA}. Hence string AA is 1-prefixable and 2-prefixable. Moreover, R1(AA)={A,C,G,T}, hence string AA is 1-suffixable.
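The sets of $k$-prefixes and $k$-suffixes in Example 3 can be enumerated directly. This is a minimal brute-force sketch; the names `k_prefixes` and `k_suffixes` are illustrative and not from the article.

```python
from itertools import product

def taboo_free(s, taboos):
    """True if no taboo occurs in s as a substring."""
    return not any(t in s for t in taboos)

def k_prefixes(s, k, alphabet, taboos):
    """All w of length k such that ws is taboo-free."""
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return {w for w in words if taboo_free(w + s, taboos)}

def k_suffixes(s, k, alphabet, taboos):
    """All w of length k such that sw is taboo-free."""
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return {w for w in words if taboo_free(s + w, taboos)}

alphabet, taboos = "ACGT", {"CAA", "GAA", "TAA"}
print(k_prefixes("AA", 1, alphabet, taboos))
print(k_prefixes("AA", 2, alphabet, taboos))
print(sorted(k_suffixes("AA", 1, alphabet, taboos)))
```

A string is $k$-prefixable exactly when `k_prefixes` returns a non-empty set, and likewise for $k$-suffixable.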

By construction, given $s\in V_{|s|}(T)$, for any $k\in\mathbb{N}_0$ it holds that

$$V_{k+|s|}^{s}(T)=L^k(s)\,s. \tag{1}$$

That is, $V_{k+|s|}^{s}(T)$ is $L^k(s)$ with $s$ concatenated. Moreover, the following proposition shows that the $k$-prefixes of a string $s$ induce a disjoint partition of the set $V_n^{s}(T)$.

Proposition 6

Given a taboo-set $T$ and a taboo-free string $s$, consider integers $k\in\mathbb{N}_0$ and $n\ge k+|s|$. It holds that

$$V_n^{s}(T)=\bigsqcup_{w\in L^k(s)}V_n^{ws}(T).$$

That is, the set $V_n^{s}(T)$ can be partitioned into the disjoint sets of taboo-free strings of length $n$ with suffix $ws$, where $w\in L^k(s)$.

Proof

If $s$ is not $k$-prefixable, then $L^k(s)=\emptyset$ and $V_n^{s}(T)=\emptyset$, hence the equation holds. Otherwise, the inclusion $\supseteq$ is clear, while $\subseteq$ follows from the fact that, for any string $w\in\Sigma^k$ preceding the suffix $s$, this $w$ must necessarily belong to $L^k(s)$.

Clearly, if a taboo-free string $s$ is $k$-prefixable, then it is also $k'$-prefixable for any integer $k'<k$, while nothing can be said a priori about the case $k'>k$. Consequently, we need to find conditions under which one can concatenate at least one symbol to the left of a taboo-free string. We will first introduce such taboo-sets in Definition 4 and then characterize prefixability in Proposition 7.

Definition 4

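A taboo-set $T$ is called left proper if every $s\in V_M(T)$ is 1-prefixable. Analogously, $T$ is right proper if every $s\in V_M(T)$ is 1-suffixable.

Example 4

If $\Sigma=\{A,C,G,T\}$ and $T=\Sigma A=\{AA,CA,GA,TA\}$, then $AC\in V_2(T)$ and AC is not 1-prefixable, because every string $xAC$ with $x\in\Sigma$ contains the taboo $xA$. Thus, $T$ is not left proper.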
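Definition 4 translates into a finite check over $V_M(T)$. The following sketch uses the hypothetical names `is_left_proper` and `non_prefixable`; it lists the strings of length $M$ that admit no taboo-free letter on their left.

```python
from itertools import product

def taboo_free(s, taboos):
    """True if no taboo occurs in s as a substring."""
    return not any(t in s for t in taboos)

def non_prefixable(alphabet, taboos):
    """Taboo-free strings of length M that cannot be extended to the left by any letter."""
    M = max(len(t) for t in taboos)
    words = ("".join(p) for p in product(alphabet, repeat=M))
    return {s for s in words
            if taboo_free(s, taboos)
            and not any(taboo_free(a + s, taboos) for a in alphabet)}

def is_left_proper(alphabet, taboos):
    """T is left proper iff every string in V_M(T) is 1-prefixable."""
    return not non_prefixable(alphabet, taboos)

alphabet = "ACGT"
sigma_A = {a + "A" for a in alphabet}          # the taboo-set of Example 4
print(is_left_proper(alphabet, sigma_A), sorted(non_prefixable(alphabet, sigma_A)))
print(is_left_proper(alphabet, {"AA", "CC", "GG", "TT"}))
```

Right properness can be tested in the same way after reversing every taboo, in line with the symmetry noted after Proposition 7.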

Proposition 7

Consider a left proper taboo-set T and a taboo-free string s such that one of the following conditions holds:

  (a) $|s|\ge M$

  (b) $|s|\le M-1$ and $s$ is $(M-|s|)$-prefixable

Then $s$ is $k$-prefixable for every $k\in\mathbb{N}$.

Proof

If condition (a) applies, then the prefix $s[1,M]\in V_M(T)$ is 1-prefixable, because $T$ is left proper. That is, there exists $a\in\Sigma$ with $as\in V_{1+|s|}(T)$. Proceeding analogously with $(as)[1,M]$, we infer that $s$ is 2-prefixable. Continuing with this process, we deduce that $s$ is $k$-prefixable for any $k\in\mathbb{N}$.

If condition (b) holds, then we can take any string in $V_M^{s}(T)$ and proceed as we did assuming (a).

We mainly study left proper taboo-sets due to Proposition 7, because the existence of arbitrary k-prefixes is necessary in many of our proofs. Analogous results for right proper taboo-sets are obtained by reversing the order of the symbols composing the string.

According to Proposition 7, if the length of a taboo-free string is at least $M$, then the taboo-free string can be prefixed for arbitrary lengths. Otherwise, one needs to check the $(M-|s|)$-prefixability of this string. To that end, the following result comes in handy.

Proposition 8

Consider a left proper taboo-set T and a taboo-free string s.

  (a) If $|s|\le M-1$ and $s\notin\mathrm{suf}(V_M(T))$, then $V_n^{s}(T)=\emptyset$ for $n\ge M$.

  (b) If either $|s|\ge M$ or $s\in\mathrm{suf}(V_M(T))$, then $V_n^{s}(T)\ne\emptyset$ for $n\ge\max(|s|,M)$.

Proof

  (a) If $0\le|s|\le M-1$ and $s\notin\mathrm{suf}(V_M(T))$, since $\mathrm{suf}(V_M^{s}(T))\subseteq\mathrm{suf}(V_M(T))$, it holds that $V_M^{s}(T)=\emptyset$. This implies that $V_n^{s}(T)=\emptyset$ for every $n\ge M$, because otherwise
    $$\emptyset\ne V_n^{s}(T)[n-M+1,n]\subseteq V_M^{s}(T),$$
    which contradicts $V_M^{s}(T)=\emptyset$.
  (b) If $|s|\ge M$, since $T$ is left proper, Proposition 7.a implies that $s$ is $k$-prefixable for every $k\in\mathbb{N}$. Thus, $V_n^{s}(T)\ne\emptyset$. Similarly, if $s\in\mathrm{suf}(V_M(T))$, then $s$ is $(M-|s|)$-prefixable, and thus Proposition 7.b implies that $s$ is $k$-prefixable for every $k\in\mathbb{N}$.

Note that, since the assumptions of Proposition 8.a are the negation of the assumptions of Proposition 8.b, in Proposition 8 we have proved that $V_n^{s}(T)=\emptyset$ for $n\ge M$ iff string $s$ satisfies $|s|\le M-1$ and $s\notin\mathrm{suf}(V_M(T))$.

To study the connectivity of Hamming graphs Γns(T), we need to know whether two different strings have a k-prefix in common. Thus, we introduce the following.

Definition 5

Given a taboo-set $T$, we say that two taboo-free strings $s_1$ and $s_2$ (maybe of different length) are left $k$-synchronized if $L^k(s_1)\cap L^k(s_2)\ne\emptyset$. If $R^k(s_1)\cap R^k(s_2)\ne\emptyset$, then we say that $s_1$ and $s_2$ are right $k$-synchronized.

In words, two taboo-free strings are left $k$-synchronized if they are both $k$-prefixable by at least one common string $w$. Clearly, two taboo-free strings $s_1,s_2$ that are left $k$-synchronized are also left $k'$-synchronized for any $k'\le k$ (one simply has to “cut” the leftmost $k-k'$ symbols of a string in $L^k(s_1)\cap L^k(s_2)$). The following proposition states when we can also guarantee $k'$-synchronization for $k'>k$:

Proposition 9

Consider a left proper taboo-set $T$ and two taboo-free strings $s_1,s_2$, with length greater than zero, such that $s_1$ and $s_2$ are left $(M-1)$-synchronized. Then $s_1$ and $s_2$ are left $k$-synchronized for any $k\in\mathbb{N}$.

Proof

If $k\le M-1$, then the assertion is true since $s_1$ and $s_2$ are $(M-1)$-synchronized.

For $k>M-1$, consider a string $w\in L^{M-1}(s_1)\cap L^{M-1}(s_2)$. We know that $ws_1$ and $ws_2$ are taboo-free strings with length at least $M$. Since $T$ is left proper, Proposition 7.a applied to $ws_1$ and $ws_2$ implies that $ws_1$ and $ws_2$ are $k$-prefixable for any $k\in\mathbb{N}$. Therefore $w$ is $k$-prefixable for any $k\in\mathbb{N}$. For any $k$, take $x\in L^k(w)$ and consider strings $xws_1$ and $xws_2$. The fact that $|w|=M-1$, together with the fact that $xw$ and the pair $ws_1,ws_2$ are taboo-free, allows applying Proposition 5, hence $xws_1$ and $xws_2$ are also taboo-free.

It follows that $xw\in L^{M-1+k}(s_1)\cap L^{M-1+k}(s_2)$. With $k':=M-1+k$, the result follows for any $k'>M-1$.

The following proposition provides a Hamming-distance based criterion to quickly decide whether two taboo-free strings of length M are left k-synchronized.

Proposition 10

Consider a left proper taboo-set $T$. If all pairs $s_1,s_2\in V_M(T)$ with $d(s_1,s_2)=1$ are left 1-synchronized, then all pairs $s_1,s_2\in V_M(T)$ with $d(s_1,s_2)=1$ are left $k$-synchronized for all $k\in\mathbb{N}_0$.

Proof

Given any left 1-synchronized pair $s_1,s_2$ with $d(s_1,s_2)=1$, there exists an $a\in\Sigma$ such that $as_1$ and $as_2$ are taboo-free. Since $(as_i)[1,M]\in V_M(T)$ for $i\in\{1,2\}$ and the Hamming distance between these two strings is at most 1, $as_1,as_2$ are 1-synchronized, hence there exists a symbol $b\in\Sigma$ such that $bas_1$ and $bas_2$ are taboo-free, i.e. $s_1$ and $s_2$ are left 2-synchronized. Continuing with this process, it follows that $s_1$ and $s_2$ are $k$-synchronized.

We will now discuss conditions that allow increasing the string length of an entire set of taboo-free strings. To this end, consider two taboo-free strings $s_1,s$ and the set $V_{n+|s_1|+|s|}^{s_1s}(T)$. It is generally not true that $V_{n+|s_1|+|s|}^{s_1s}(T)=V_{n+|s_1|}^{s_1}(T)\,s$, because the concatenation of $s$ to a taboo-free string from $V_{n+|s_1|}^{s_1}(T)$ can create a taboo string around the junction of both strings. For the remainder of this section we will discuss when the equality holds.

Definition 6

For a taboo-set T and a taboo-free string s, we define the length of the longest taboo suffix-prefix match as

$$k_s:=\max\{i\in[0,|s|]\mid s[1,i]\in\mathrm{suf}(T)\},$$

i.e. $k_s$ denotes the length of the longest prefix of $s$ being a proper suffix of a taboo.

Note that the length $k_s$ is well defined, because $s[1,0]=e\in\mathrm{suf}(T)$, hence $k_s\in[0,\min(M-1,|s|)]$. Using this length $k_s$, in Proposition 11 we give conditions implying that equality $V_{n+|s_1|+|s|}^{s_1s}(T)=V_{n+|s_1|}^{s_1}(T)\,s$ holds.
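The length $k_s$ of Definition 6 is straightforward to compute. In this minimal sketch the function names are illustrative and the empty string `""` stands for $e$, which guarantees that the maximum is taken over a non-empty set.

```python
def proper_suffixes(taboos):
    """suf(T): all proper suffixes of the taboos, including the empty string."""
    return {t[i:] for t in taboos for i in range(1, len(t) + 1)} | {""}

def longest_match(s, taboos):
    """k_s: length of the longest prefix of s that is a proper suffix of some taboo."""
    suf_T = proper_suffixes(taboos)
    return max(i for i in range(len(s) + 1) if s[:i] in suf_T)

T = {"AAA", "CCA", "TAA", "GAA"}
for s in ("CAT", "ACG", "TTT"):
    k = longest_match(s, T)
    print(s, k, repr(s[:k]))
```

The prefix `s[:k]` printed here is the string $s[1,k_s]$ used throughout the rest of the section.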

Proposition 11

For a taboo-set T and a taboo-free string s, the following holds:

  (a) Take $w\in\Sigma^{M-1}$ such that $ws\in V_{M-1+|s|}(T)$. Then for any $n\ge M-1$,
    $$V_{n+|s|}^{ws}(T)=V_n^{w}(T)\,s.$$
  (b) For any $n\in\mathbb{N}_0$ it holds that
    $$V_{n+|s|}^{s}(T)=V_{n+k_s}^{s[1,k_s]}(T)\,s[k_s+1,|s|].$$

Proof

  (a) The inclusion $\subseteq$ is clear. The inclusion $\supseteq$ follows from the fact that, if we are given $rw\in V_n^{w}(T)$ such that $ws\in V_{M-1+|s|}(T)$, since $|w|=M-1$, Proposition 5 yields that the concatenated string $rws$ is taboo-free.

  (b) The result is obvious if $|s|=0$ or $n=0$, hence assume $|s|>0$ and $n>0$.

    Clearly $V_{n+|s|}^{s}(T)\subseteq V_{n+k_s}^{s[1,k_s]}(T)\,s[k_s+1,|s|]$. For $r\in V_n(T)$, consider $rs[1,k_s]\in V_{n+k_s}^{s[1,k_s]}(T)$. We need to prove that the string
    $$rs[1,k_s]\,s[k_s+1,|s|]=rs$$
    is taboo-free. But otherwise, since $rs[1,k_s]$ and $s$ are taboo-free, there would exist integers $c<d$ such that $1\le c\le|r|\le|r|+k_s<d\le|r|+|s|$ and $(rs)[c,d]\in T$. Take $k:=d-|r|>k_s$, which yields $s[1,k]\in\mathrm{suf}(T)$, contradicting the maximality of $k_s$. Hence $rs$ is taboo-free, as desired. Note that the same argument applies if $k_s=0$.

From Proposition 11.b we obtain two corollaries.

Corollary 12

Given a taboo-set $T$ and a taboo-free string $s$, for any $k\in\mathbb{N}_0$ it holds that

$$L^k(s)=L^k(s[1,k_s]).$$

Proof

By construction, $L^k(s)=V_{|s|+k}^{s}(T)[1,k]$. Proposition 11.b yields

$$V_{k+|s|}^{s}(T)[1,k]=\big(V_{k+k_s}^{s[1,k_s]}(T)\,s[k_s+1,|s|]\big)[1,k]=V_{k+k_s}^{s[1,k_s]}(T)[1,k]=L^k(s[1,k_s]).$$

Corollary 13

For a taboo-set $T$ and for any pair of taboo-free strings $s_1$ and $s_2$, the following statements are equivalent for all $k\in\mathbb{N}_0$:

  • $s_1$ and $s_2$ are left $k$-synchronized.

  • $s_1[1,k_{s_1}]$ and $s_2[1,k_{s_2}]$ are left $k$-synchronized.

Proof

Strings $s_1$ and $s_2$ are left $k$-synchronized iff $L^k(s_1)\cap L^k(s_2)\ne\emptyset$. We just have to apply Corollary 12.

Thus, the string $s[1,k_s]$, which is the longest prefix of $s$ that matches a proper suffix of the taboos, provides all the information we need to construct $V_n^{s}(T)$ or $L^k(s)$.

Isomorphisms between taboo-free Hamming graphs

Here we will discuss isomorphism between Hamming graphs. Let us first introduce the formal definition of a taboo-free Hamming graph.

Definition 7

The taboo-free Hamming graph of length $n$, $\Gamma_n(T):=(V_n(T),E_n(T))$, is the graph with vertex set $V_n(T)$ such that two vertices $u,v\in V_n(T)$ are adjacent if their Hamming distance equals 1, that is, $\{u,v\}\in E_n(T)$ iff $d(u,v)=1$. Analogously, $\Gamma_n^{s}(T)$ is the Hamming graph with vertex set $V_n^{s}(T)$.

Examples of disconnected Hamming graphs are given in Figs. 1 and 2. When dealing with taboo-free Hamming graphs, the following proposition is a simple way to establish graph isomorphisms.

Proposition 14

Consider a taboo-set $T$, a taboo-free string $s$ and a taboo-free string $w$ satisfying $ws\in V_{|w|+|s|}(T)$. If $V_{n+|s|}^{ws}(T)=V_n^{w}(T)\,s$ for some $n\ge|w|$, then $\Gamma_{n+|s|}^{ws}(T)$ and $\Gamma_n^{w}(T)$ are isomorphic.

Proof

By assumption, the vertex set of $\Gamma_{n+|s|}^{ws}(T)$ is $V_{n+|s|}^{ws}(T)=V_n^{w}(T)\,s$. Thus, the map

$$f:V_n^{w}(T)\,s\to V_n^{w}(T),\qquad rs\mapsto r$$

is well defined and bijective. Moreover, $f$ is an edge-preserving bijection: Given any pair of strings $r_1,r_2\in\Sigma^n$ and any string $s\in\Sigma^{|s|}$, then $d(r_1,r_2)=1$ iff $d(r_1s,r_2s)=1$.

Propositions 14 and 11.a imply that, for a taboo-free string $s$ with $|s|\ge M$, the graphs $\Gamma_{n+|s|}^{s}(T)$ and $\Gamma_{n+M-1}^{s[1,M-1]}(T)$ are isomorphic. Furthermore Proposition 11.b implies that $\Gamma_{n+|s|}^{s}(T)\cong\Gamma_{n+k_s}^{s[1,k_s]}(T)$, which can be stated as follows.

Proposition 15

Consider a taboo-set $T$ and a taboo-free string $s$. There exists a unique $w\in\mathrm{suf}(T)$ such that $w=s[1,k_s]$. Moreover, for any $n\ge 0$,

$$\Gamma_{n+|s|}^{s}(T)\cong\Gamma_{n+|w|}^{w}(T).$$

Proposition 15 does not describe in which cases $V_{n+|s|}^{s}(T)=\emptyset$. However, if $T$ is left proper, Proposition 8 implies that this happens iff $|s|\le M-1$ and $s\notin\mathrm{suf}(V_M(T))$. This suggests that we can state a version of Proposition 15 for left proper $T$. But first, due to our interest in taboo-free strings of length $M$, we introduce the following.

Definition 8

Given a left proper taboo-set T, the long suffix classification lsc(T) is defined as

$$\mathrm{lsc}(T):=\{w\in\mathrm{suf}(T)\ \text{such that}\ \exists\, s\in V_M(T)\ \text{satisfying}\ s[1,k_s]=w\},$$

that is, lsc(T) is the set of all suffixes of taboos that are the longest prefix of at least one taboo-free string of length M.

Example 5

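If $\Sigma_1=\{A,C,G,T\}$ and $T_1=\{AA,CC,GG,TT\}$, then

$$\mathrm{lsc}(T_1)\subseteq\mathrm{suf}(T_1)=\{A,C,G,T,e\}=\Sigma_1\cup\{e\}.$$

For any $s\in V_2(T_1)$, we see $k_s>0$, hence $e\notin\mathrm{lsc}(T_1)$. Moreover,

$$\{AC,CG,GT,TA\}\subseteq V_2(T_1),$$

yielding $\mathrm{lsc}(T_1)=\Sigma_1$. If we consider $\Sigma_2:=\{A,C,G,T,C^{*}\}$, where $C^{*}$ could represent a 5-methylcytosine, and $T_2:=T_1$, then string $s=C^{*}A$ satisfies $k_s=0$, hence $\mathrm{lsc}(T_2)=\mathrm{suf}(T_2)$.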

The following theorem classifies graphs Γns(T) for left proper T.

Theorem 16

Consider a left proper taboo-set $T$ and a taboo-free string $s$ such that either $|s|\ge M$ or $s\in\mathrm{suf}(V_M(T))$. Then a unique $w\in\mathrm{suf}(V_M(T))\cap\mathrm{suf}(T)$ exists such that $w=s[1,k_s]$, which satisfies $\Gamma_{n+|s|}^{s}(T)\cong\Gamma_{n+|w|}^{w}(T)$ for $n\ge 0$. Moreover, if $|s|\ge M$, then $w\in\mathrm{lsc}(T)$.

Proof

Proposition 8.b yields $V_{n+|s|}^{s}(T)\ne\emptyset$ for $n\ge 0$, while $\Gamma_{n+|s|}^{s}(T)\cong\Gamma_{n+k_s}^{s[1,k_s]}(T)$ for $n\ge 0$ follows from Proposition 15. Hence we can set $w:=s[1,k_s]$, which by definition belongs to $\mathrm{suf}(T)$. Since by assumption either $|s|\ge M$ or $s\in\mathrm{suf}(V_M(T))$, it follows from Proposition 7 that $s$ is $k$-prefixable for any $k$, and thus also $w:=s[1,k_s]$ is $k$-prefixable. We consider $x\in L^{M-k_s}(w)$, which satisfies $xw\in V_M(T)$. Therefore $w=(xw)[M-k_s+1,M]\in\mathrm{suf}(V_M(T))$. All in all, $w\in\mathrm{suf}(V_M(T))\cap\mathrm{suf}(T)$. This $w$ is trivially unique since $k_s$ is uniquely determined given $s$.

As for the case $|s|\ge M$, the fact that $s[1,M]\in V_M(T)$ and the definition of $\mathrm{lsc}(T)$ imply that $w\in\mathrm{lsc}(T)$.

In formal terms, Theorem 16 states that the equivalence relation “being isomorphic” divides all graphs $\Gamma_{n+|s|}^{s}(T)$ into equivalence classes. The representative of each class is a graph $\Gamma_{n+|w|}^{w}(T)$, where $w\in\mathrm{suf}(V_M(T))\cap\mathrm{suf}(T)$. When $|s|\ge M$, string $w$ belongs to $\mathrm{lsc}(T)$. This is why $\mathrm{lsc}(T)$ is called the long suffix classification.

To efficiently compute $\mathrm{lsc}(T)$, we recommend that $T$ be minimal. Theorem 16 implies that

$$\mathrm{lsc}(T)\subseteq\mathrm{suf}(V_M(T))\cap\mathrm{suf}(T), \tag{2}$$

and thus we define the short suffix classification as

$$\mathrm{ssc}(T):=\big(\mathrm{suf}(V_M(T))\cap\mathrm{suf}(T)\big)\setminus\mathrm{lsc}(T). \tag{3}$$

The set $\mathrm{ssc}(T)$ is called short suffix classification because only when $|s|<M$ can it happen that a graph $\Gamma_{n+|s|}^{s}(T)$ is represented by a graph $\Gamma_{n+|w|}^{w}(T)$ with $w\in\mathrm{ssc}(T)$. Note that, if a string $w$ satisfies the condition $|w|<M-1$ and $wR^i(w)\subseteq\mathrm{suf}(T)$ for some $i\in[1,M-1-|w|]$, then any $s\in{}^{w}V_M(T)$ satisfies $s[1,|w|+i]\in\mathrm{suf}(T)$, so that $k_s\ge|w|+i>|w|$; hence $w\notin\mathrm{lsc}(T)$. This property is used in the following example.

Example 6

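If $\Sigma_1 = \{A,C,G,T\}$ and $T_1 = \{AA,CC,GG,TT\}$, then it is clear that $e \in \mathrm{suf}(V_M(T_1)) \cap \mathrm{suf}(T_1)$, because the empty string $e$ belongs to both sets. Moreover, $e \notin \mathrm{lsc}(T_1)$ due to $e\Sigma_1 \subseteq \mathrm{suf}(T_1)$. Therefore $e \in \mathrm{ssc}(T_1)$.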

Connectivity of taboo-free Hamming graphs

We will make extensive use of the quotient graph to study the connectivity of taboo-free Hamming graphs. Before we start with the technicalities, we briefly describe our initial strategy.

For a Hamming graph $\Gamma_{n+j}(T)$, let us consider two different subsets of its vertex set, namely $V_{n+j}^{s_b}(T)$ and $V_{n+j}^{s_c}(T)$, where $s_b, s_c \in V_j(T)$. These two subsets are disjoint, so we can use the quotient graph $Q[\Gamma_{n+j}(T)]$ to collapse each of them into a single vertex, represented respectively by $s_b$ and $s_c$. We will prove in Proposition 17 that $s_b$ and $s_c$ are adjacent in $Q[\Gamma_{n+j}(T)]$ iff strings $s_b$ and $s_c$ have Hamming distance 1 and are left $n$-synchronized. This is especially interesting, because we know from Proposition 9 that two left $(M-1)$-synchronized strings are left $n$-synchronized for any $n \in \mathbb{N}$. Thus, it is enough to know that $s_b, s_c$ are adjacent in $Q[\Gamma_{(M-1)+j}(T)]$ to claim that $s_b, s_c$ are adjacent in all partition graphs $Q[\Gamma_{n+j}(T)]$ for $n \in \mathbb{N}$ (that is the essential content of Lemma 18). More formally, we have the following results.

Proposition 17

Given taboo-set $T$, $j \in \mathbb{N}_0$ and $n \in \mathbb{N}_0$, consider graph $\Gamma_{n+j}(T)$ and a subset $S \subseteq V_{n+j}(T)$ partitioned as $S = \bigcup_{b \in J} V_{n+j}^{s_b}(T)$, where $s_b$ are taboo-free strings of length $j$. Consider moreover the quotient graph $Q[\Gamma_{n+j}(T)(S)] = \{J, E_J\}$, where $\Gamma_{n+j}(T)(S)$ denotes the graph induced by $S$ in $\Gamma_{n+j}(T)$.

In these conditions, a pair of vertices $b, c \in J$ is connected by an edge $\{b,c\} \in E_J$ iff the pair $s_b$, $s_c$ is left $n$-synchronized and $d(s_b,s_c) = 1$.

Proof

By definition, $b$ and $c$ are adjacent in $Q[\Gamma_{n+j}(T)(S)]$ iff in graph $\Gamma_{n+j}(T)$ an edge connects a vertex in $V_{n+j}^{s_b}(T)$ with a vertex in $V_{n+j}^{s_c}(T)$. Since $d(s_b,s_c) \geq 1$, this edge exists iff $d(s_b,s_c) = 1$ and there exists $s \in V_n(T)$ such that $ss_b, ss_c \in V_{n+j}(T)$. The last condition is the definition of $s_b$ and $s_c$ being left $n$-synchronized.
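The equivalence in Proposition 17 can be checked by brute force on a small instance. The following is a minimal sketch, not the authors' implementation: it computes the quotient edges once directly from the Hamming graph and once from the synchronization criterion, using the alphabet and taboo-set of Fig. 4. The function names and the choice of suffixes `ab`, `bb`, `cb` are illustrative assumptions.

```python
from itertools import combinations, product


def taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def hamming(x, y):
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def suffix_vertices(n, suffix, taboos, alphabet):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    free = n - len(suffix)
    words = ("".join(p) + suffix for p in product(alphabet, repeat=free))
    return [w for w in words if taboo_free(w, taboos)]


def quotient_edges(n, suffixes, taboos, alphabet):
    """Edges of the quotient graph, read off the Hamming graph directly.

    n is the prefix length, so the strings have length n + j.
    """
    j = len(suffixes[0])
    blocks = {s: suffix_vertices(n + j, s, taboos, alphabet) for s in suffixes}
    return {
        frozenset((sb, sc))
        for sb, sc in combinations(suffixes, 2)
        if any(hamming(x, y) == 1 for x in blocks[sb] for y in blocks[sc])
    }


def synchronized_edges(n, suffixes, taboos, alphabet):
    """Edges predicted by Proposition 17: distance 1 and left n-synchronized."""
    prefixes = suffix_vertices(n, "", taboos, alphabet)  # V_n(T)
    return {
        frozenset((sb, sc))
        for sb, sc in combinations(suffixes, 2)
        if hamming(sb, sc) == 1
        and any(taboo_free(p + sb, taboos) and taboo_free(p + sc, taboos)
                for p in prefixes)
    }


ALPHABET = "abc"
TABOOS = {"ba", "aa", "ac", "cc"}
SUFFIXES = ["ab", "bb", "cb"]  # taboo-free strings of length j = 2

for n in range(4):
    assert quotient_edges(n, SUFFIXES, TABOOS, ALPHABET) == \
        synchronized_edges(n, SUFFIXES, TABOOS, ALPHABET)
```

Both computations agree for every prefix length tried, which is what the proposition predicts; the second one is cheaper because it never compares pairs of full-length strings.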

The combination of Propositions 17 and 9 gives the following lemma.

Lemma 18

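Given a left proper taboo-set $T$, a taboo-free string $s$ and $k \in \mathbb{N}$, consider, for any $n \geq |s|+k$, partition $V_n^{s}(T) = \bigcup_{w \in L_k(s)} V_n^{ws}(T)$ and quotient graph $Q[\Gamma_n^{s}(T)] = (L_k(s), E_{L_k(s)})$. Then it holds that

$$Q[\Gamma_{|s|+k}^{s}(T)] \supseteq Q[\Gamma_{|s|+k+1}^{s}(T)] \supseteq \cdots \supseteq Q[\Gamma_{|s|+k+M-1}^{s}(T)] = Q[\Gamma_{|s|+k+M}^{s}(T)] = Q[\Gamma_{|s|+k+M+1}^{s}(T)] = \cdots.$$

If $Q[\Gamma_{|s|+k+M-1}^{s}(T)]$ is connected, then $Q[\Gamma_n^{s}(T)]$ is connected for $n \geq |s|+k$.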

Proof

For some $n_0 \geq |s|+k$, consider an edge $\{w_b,w_c\}$ of graph $Q[\Gamma_{n_0}^{s}(T)]$, where $w_b, w_c \in L_k(s)$. We set $s_b := w_bs$ and $s_c := w_cs$. Proposition 17 implies that $w_b$ and $w_c$ are adjacent in $Q[\Gamma_{n_0}^{s}(T)]$ iff $s_b$ and $s_c$ are left $(n_0-|s|-k)$-synchronized and $d(w_b,w_c) = 1$. Since $s_b$ and $s_c$ are left $(n_0-|s|-k)$-synchronized, they are also left $(n-|s|-k)$-synchronized for any $n \leq n_0$, and thus $w_b$ and $w_c$ are adjacent in $Q[\Gamma_n^{s}(T)]$ for $|s|+k \leq n \leq n_0$. Hence the decreasing chain of quotient graphs is proven.

Now we will prove that this chain stabilizes after $n = |s|+k+M-1$. If $n_0-|s|-k = M-1$, then, according to Proposition 9, $w_b$ and $w_c$ are left $k$-synchronized for arbitrary $k$, and thus Proposition 17 implies that $w_b$ and $w_c$ are adjacent in $Q[\Gamma_n^{s}(T)]$ for arbitrary $n \geq |s|+k$. All in all, $Q[\Gamma_{n_0}^{s}(T)]$ and $Q[\Gamma_n^{s}(T)]$ have the same edges, as desired.

Regarding connectivity, given graphs $G_1$ and $G_2$ with the same vertex set $V_1 = V_2$ such that $G_1 \subseteq G_2$, if subgraph $G_1$ is connected, then $G_2$ is connected.

Figure 4 visualizes Lemma 18 for alphabet Σ={a,b,c}, taboo-set T={ba,aa,ac,cc} (which is left proper), suffix s=b and k=1.

Fig. 4. Visualization of Lemma 18 for $\Sigma = \{a,b,c\}$, $T = \{ba,aa,ac,cc\}$, $s = b$ and $k = 1$. It holds that $L_1(b) = \{a,b,c\}$
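The chain of quotient graphs described by Lemma 18 can be reproduced numerically for the setting of Fig. 4. The sketch below is an illustration under stated assumptions: `L(k, s)` is taken to be the set of strings $w$ of length $k$ such that $ws$ is taboo-free, and all function names are hypothetical. Here $M = 2$, $|s| = 1$ and $k = 1$, so the lemma predicts that the chain is stable from $n = |s|+k+M-1 = 3$ onwards.

```python
from itertools import combinations, product

ALPHABET = "abc"
TABOOS = {"ba", "aa", "ac", "cc"}


def taboo_free(s):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in TABOOS)


def hamming(x, y):
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def V(n, suffix):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(ALPHABET, repeat=n - len(suffix)))
    return [w for w in words if taboo_free(w)]


def L(k, s):
    """Assumed definition of L_k(s): length-k strings w with ws taboo-free."""
    words = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [w for w in words if taboo_free(w + s)]


def quotient_edges(n, s, k):
    """Edge set of Q[Gamma_n^s(T)] for the partition indexed by L_k(s)."""
    blocks = {w: V(n, w + s) for w in L(k, s)}
    return {
        frozenset((u, v))
        for u, v in combinations(blocks, 2)
        if any(hamming(x, y) == 1 for x in blocks[u] for y in blocks[v])
    }


chain = {n: quotient_edges(n, "b", 1) for n in range(2, 7)}
for n, edges in chain.items():
    print(n, sorted("".join(sorted(e)) for e in edges))
```

For $n = 2$ the quotient graph is the complete graph on $\{a,b,c\}$; from $n = 3$ onwards only the edges $\{a,b\}$ and $\{b,c\}$ remain, so the quotient graph stays connected.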

We are finally ready to study the connectivity of graphs $\Gamma_n^{s}(T)$ for $|s| \geq M$. Let us begin with the following lemma.

Lemma 19

Given a left proper $T$, for any $w \in V_M(T)$ consider the set $V_{2M}^{w}(T)$ and partition

$$V_{2M}^{w}(T) = \bigcup_{a \in L_1(w)} V_{2M}^{aw}(T),$$

inducing the quotient graph $Q[\Gamma_{2M}^{w}(T)] = (L_1(w), E_{L_1(w)})$. Then the following statements are equivalent:

  1. For every $w \in V_M(T)$, $Q[\Gamma_{2M}^{w}(T)]$ is connected.

  2. For every $w \in V_M(T)$ and integer $n \geq M$, $\Gamma_n^{w}(T)$ is connected.

Proof

Proposition 2 states that, in a connected graph, every quotient graph is connected, and thus (b) implies (a) by considering n=2M.

Now we prove by induction that (a) implies (b). For $n = M$ and $w \in V_M(T)$, we have that

$$V_M^{w}(T) = \{w\},$$

hence $\Gamma_M^{w}(T)$ is connected. For the inductive step, assume that $\Gamma_n^{w}(T)$ is connected for every $w \in V_M(T)$ and up to an integer $n \geq M$. We will prove that also every $\Gamma_{n+1}^{w}(T)$ is connected. Consider

$$V_{n+1}^{w}(T) = \bigcup_{a \in L_1(w)} V_{n+1}^{aw}(T).$$

Let us write $w$ separating the first $M-1$ symbols from the last one, that is $w = rc$ for $r \in \Sigma^{M-1}$ and $c \in \Sigma$. Then for any $a \in L_1(w)$, $V_{n+1}^{aw}(T) = V_{n+1}^{arc}(T)$. Since $|r| = M-1$, Proposition 11.a implies $V_{n+1}^{arc}(T) = V_n^{ar}(T)c$, while the isomorphism established in Proposition 14 yields

$$\Gamma_{n+1}^{aw}(T) = \Gamma_{n+1}^{arc}(T) \cong \Gamma_n^{ar}(T).$$

Thus, every $\Gamma_{n+1}^{aw}(T)$ is connected, because the induction hypothesis implies that $\Gamma_n^{ar}(T)$ is connected since $ar \in V_M(T)$. To prove that graph $\Gamma_{n+1}^{w}(T)$ is connected, it remains to apply Proposition 1, so we need to prove that the quotient graph induced by partition $V_{n+1}^{w}(T) = \bigcup_{a \in L_1(w)} V_{n+1}^{aw}(T)$, namely $Q[\Gamma_{n+1}^{w}(T)]$, is connected.

We know that, given partition $V_{2M}^{w}(T) = \bigcup_{a \in L_1(w)} V_{2M}^{aw}(T)$, the quotient graph $Q[\Gamma_{2M}^{w}(T)]$ is connected. Applying Lemma 18 with $s = w$ and $k = 1$, we get the following chain of inclusions:

$$Q[\Gamma_{M+1}^{w}(T)] \supseteq Q[\Gamma_{M+2}^{w}(T)] \supseteq \cdots \supseteq Q[\Gamma_{2M}^{w}(T)] = Q[\Gamma_{2M+1}^{w}(T)] = Q[\Gamma_{2M+2}^{w}(T)] = \cdots.$$

Since $Q[\Gamma_{2M}^{w}(T)]$ is connected, every quotient graph of the chain of inclusions is connected, as shown in Lemma 18. In particular, graph $Q[\Gamma_{n+1}^{w}(T)]$ is an element of the chain of inclusions because $n+1 \geq M+1$, so it is connected, as desired.

Lemma 19 is very useful: we wanted to characterize the connectivity of graphs $\Gamma_n^{w}(T)$ for $w \in V_M(T)$ and $n \geq M$. We have proved that it is enough to study a finite number of graphs, namely $Q[\Gamma_{2M}^{w}(T)]$ for $w \in V_M(T)$, that is, $|V_M(T)|$ graphs. Let us summarize the connectivity results that follow from Lemma 19 and Theorem 16.

Proposition 20

Given a left proper T, the following statements are equivalent:

  1. For any taboo-free string $s$ with $|s| \geq M$ and any integer $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.

  2. For any $w \in V_M(T)$ and any integer $n \geq M$, $\Gamma_n^{w}(T)$ is connected.

  3. For any $r \in \mathrm{lsc}(T)$, $\Gamma_{M+|r|}^{r}(T)$ is connected.

  4. For any $r \in \mathrm{lsc}(T)$, the partition $V_{M+|r|}^{r}(T) = \bigcup_{a \in L_1(r)} V_{M+|r|}^{ar}(T)$ induces a connected partition graph $Q[\Gamma_{M+|r|}^{r}(T)]$.

Proof

Implication (a) (b) is obvious, while (b) (a) is proven as follows: Given Vns(T), where s is a taboo-free string with |s|M, Proposition 11.a implies that Vns(T)=Vns[1,M-1](T)s[M,j]. Since s[M,j]=s[M,M]s[M+1,j], applying Proposition 11.a again we have Vns(T)=Vns[1,M](T)s[M+1,j]. Proposition 14 yields the isomorphism Γn+js(T)Γn+Ms[1,M](T), and Γn+Ms[1,M](T) is connected due to s[1,M]VM(T) and the assumption of (b). Thus, statements (a) and (b) are equivalent.

Implication (b) $\Rightarrow$ (c) is a consequence of Theorem 16. Moreover, (c) $\Rightarrow$ (d) follows from Proposition 2. It remains to prove (d) $\Rightarrow$ (b), which we do as follows. Corollary 12 implies $L_1(w) = L_1(w[1,k_w])$. Moreover, for any $w \in V_M(T)$ and $a, b \in L_1(w)$, we claim that the following statements are equivalent:

  • (i)

    Strings aw and bw are left k-synchronized.

  • (ii)

    Strings aw[1,kw] and bw[1,kw] are left k-synchronized.

Indeed, the implication (i) (ii) is obvious, so let us prove (ii) (i). Given a taboo-free string sVj(T) such that saw[1,kw] and sbw[1,kw] are taboo-free, we want to prove that also saw and sbw are taboo-free. But if that were not the case, it would be the consequence of either (saw)[c,d]T or (sbw)[c,d]T for some integers 1cj<j+1+kwdj+1+M. However, that contradicts the maximality of kw, yielding ii)i.

Our previous claim and Proposition 17 imply that, if r=w[1,kw] for some wVM(T), given partition VM+|r|r(T)=aL1(r)VM+|r|ar(T), it holds that

QΓnr(T)QΓnw(T).

Theorem 16 implies that, for every wVM(T), there exists r=w[1,kw]lsc(T). Applying Lemma 19, finally (d) (b) follows.

It is worth noticing how much simpler the connectivity problem has become. Initially, we were studying whether every $\Gamma_n^{s}(T)$ with $|s| \geq M$ is connected, obtaining in Lemma 19 that this is equivalent to the connectivity of graphs $\Gamma_{2M}^{w}(T)$ for $w \in V_M(T)$, which are $|V_M(T)|$ graphs. Now we see, using Proposition 20 and the fact that $\mathrm{lsc}(T) \subseteq \mathrm{suf}(T)$, that we only need to prove the connectivity of $|\mathrm{lsc}(T)| \leq |\mathrm{suf}(T)| \leq (M-1)|T|+1$ graphs, namely either $Q[\Gamma_{M+|r|}^{r}(T)]$ or $\Gamma_{M+|r|}^{r}(T)$ for $r \in \mathrm{lsc}(T)$. We give an example.

Example 7

Take $\Sigma = \{A,C,G,T\}$ and $T = \{AA,CCC\}$, which is left proper. Using Proposition 20, since $M = 3$ and $\mathrm{lsc}(T) = \mathrm{suf}(T) = \{e,A,C,CC\}$, the connectivity of graphs

$$\Gamma_3^{e}(T),\quad \Gamma_4^{A}(T),\quad \Gamma_4^{C}(T),\quad \Gamma_5^{CC}(T)$$

implies that any $\Gamma_n^{w}(T)$ with $w \in \mathrm{suf}(T)$ is connected. Proposition 15 implies that, for any taboo-free string $s$ and $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.
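The four graphs named in Example 7 are small enough to be checked by exhaustive enumeration. The following sketch, with hypothetical helper names, builds each vertex set $V_n^{s}(T)$ and tests connectivity of the induced Hamming graph with a depth-first search over single-symbol substitutions.

```python
from itertools import product

ALPHABET = "ACGT"


def taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def suffix_vertices(n, suffix, taboos):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(ALPHABET, repeat=n - len(suffix)))
    return {w for w in words if taboo_free(w, taboos)}


def is_connected(verts):
    """Connectivity of the Hamming graph induced by the vertex set verts."""
    if not verts:
        return True
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in ALPHABET:
                v = u[:i] + a + u[i + 1:]  # point substitution at position i
                if v in verts and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(verts)


T = {"AA", "CCC"}
graphs = [(3, ""), (4, "A"), (4, "C"), (5, "CC")]  # (n, suffix), "" stands for e
results = {(n, s): is_connected(suffix_vertices(n, s, T)) for n, s in graphs}
print(results)
```

Exhaustive enumeration costs $|\Sigma|^{n-|s|}$ string checks per graph, so it is only a sanity check for short strings; the point of Proposition 20 is that these few small graphs suffice.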

Proposition 20 characterizes the connectivity of every $\Gamma_{n+|s|}^{s}(T)$ for $|s| \geq M$. We know from Theorem 16 that there exists $r \in \mathrm{lsc}(T) \subseteq \mathrm{suf}(V_M(T)) \cap \mathrm{suf}(T)$ such that $\Gamma_{n+|s|}^{s}(T) \cong \Gamma_{n+|r|}^{r}(T)$. Since $\mathrm{ssc}(T) := (\mathrm{suf}(V_M(T)) \cap \mathrm{suf}(T)) - \mathrm{lsc}(T)$, to complete our characterization of the connectivity of every taboo-free Hamming graph, some cases (such as Example 6) require considering the connectivity of graphs $\Gamma_n^{p}(T)$ for $p \in \mathrm{ssc}(T)$. We have the following.

Proposition 21

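Given a left proper $T$ and $p \in \mathrm{ssc}(T)$, assume that, for every $r \in \mathrm{lsc}(T)$, graph $\Gamma_{M+|r|}^{r}(T)$ is connected. Given $k \in \mathbb{N}$, if partition

$$V_{|p|+k+M-1}^{p}(T) = \bigcup_{w \in L_k(p)} V_{|p|+k+M-1}^{wp}(T)$$

satisfies that $(wp)[1,k_{wp}] \in \mathrm{lsc}(T)$ for each $w \in L_k(p)$, and moreover $Q[\Gamma_{|p|+k+M-1}^{p}(T)]$ is connected, then $\Gamma_n^{p}(T)$ is connected for $n \geq |p|+k$.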

Proof

For $n \geq |p|+k$, given partition

$$V_n^{p}(T) = \bigcup_{w \in L_k(p)} V_n^{wp}(T),$$

subgraphs $\Gamma_n^{wp}(T)$ are connected due to $(wp)[1,k_{wp}] \in \mathrm{lsc}(T)$. Moreover, since $Q[\Gamma_{2M-1}^{p}(T)]$ is connected, Lemma 18 with $s = p$ implies that $Q[\Gamma_n^{p}(T)]$ is connected for $n \geq |p|+k$. Thus, the quotient graph $Q[\Gamma_n^{p}(T)]$ and all induced subgraphs $\Gamma_n^{wp}(T)$ are connected. The connectivity of $\Gamma_n^{p}(T)$ follows applying Proposition 1.

In Proposition 21, one can always take $k = M-|p|$ and just check if $Q[\Gamma_{2M-1}^{p}(T)]$ or $\Gamma_{2M-1}^{p}(T)$ is connected for $p \in \mathrm{ssc}(T)$. Otherwise one can try $k = 1$ and increase it progressively.

Example 8

If $\Sigma = \{A,C,G,T\}$ and $T = \{AA,CC,GG,TT\}$, then it holds that $\mathrm{lsc}(T) = \{A,C,G,T\}$ and $\mathrm{ssc}(T) = \{e\}$. For $r \in \mathrm{lsc}(T)$, it can be proven that $\Gamma_3^{r}(T)$ is connected. Thus, Proposition 20 implies that every $\Gamma_n^{r}(T)$ is connected for $r \in \mathrm{lsc}(T)$ and $n \geq 1$.

We can combine Propositions 20 and 21 to obtain our aimed characterization of the connectivity of every suffix Hamming graph. We do so in the following theorem.

Theorem 22

Given a left proper taboo-set T, the following are equivalent.

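  1. Consider, for every $r \in \mathrm{lsc}(T)$, partition $V_{M+|r|}^{r}(T) = \bigcup_{a \in L_1(r)} V_{M+|r|}^{ar}(T)$, and for every $p \in \mathrm{ssc}(T)$, partition $V_{2M-1}^{p}(T) = \bigcup_{w \in L_{M-|p|}(p)} V_{2M-1}^{wp}(T)$.

    For $r \in \mathrm{lsc}(T)$, every partition graph $Q[\Gamma_{M+|r|}^{r}(T)]$ is connected; for $p \in \mathrm{ssc}(T)$, every partition graph $Q[\Gamma_{2M-1}^{p}(T)]$ is connected; for $p \in \mathrm{ssc}(T)$, every graph $\Gamma_n^{p}(T)$ with $|p|+2 \leq n \leq M-1$ is connected.

  2. For $r \in \mathrm{lsc}(T)$, graph $\Gamma_{M+|r|}^{r}(T)$ is connected; for $p \in \mathrm{ssc}(T)$, graph $\Gamma_{2M-1}^{p}(T)$ is connected; for $p \in \mathrm{ssc}(T)$ and $|p|+2 \leq n \leq M-1$, every graph $\Gamma_n^{p}(T)$ is connected.

  3. For every taboo-free string $s$ and $n \geq 0$, graph $\Gamma_{|s|+n}^{s}(T)$ is connected.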

Proof

Proposition 2 states that the connectivity of a graph is equivalent to the connectivity of each of its quotient graphs. Hence (b) $\Rightarrow$ (a) follows, because if graphs $\Gamma_{M+|r|}^{r}(T)$ and $\Gamma_{2M-1}^{p}(T)$ are connected, then also partition graphs $Q[\Gamma_{M+|r|}^{r}(T)]$ and $Q[\Gamma_{2M-1}^{p}(T)]$ are connected. Since the implication (c) $\Rightarrow$ (b) is obvious, it only remains to prove (a) $\Rightarrow$ (c).

Theorem 16 states that, when $T$ is left proper, every nonempty graph of the form $\Gamma_{n+|s|}^{s}(T)$ is isomorphic to graph $\Gamma_{n+|w|}^{w}(T)$, where $w = s[1,k_s] \in \mathrm{suf}(T) \cap \mathrm{suf}(V_M(T))$. By construction, strings in $\mathrm{suf}(T) \cap \mathrm{suf}(V_M(T))$ either belong to $\mathrm{lsc}(T)$ or $\mathrm{ssc}(T)$. Therefore, statement (c) is equivalent to the connectivity, for every $n \geq 0$, of every $\Gamma_{n+|r|}^{r}(T)$, where $r \in \mathrm{lsc}(T)$, and of every $\Gamma_{n+|p|}^{p}(T)$, where $p \in \mathrm{ssc}(T)$.

Assuming statement (a), since every partition graph $Q[\Gamma_{M+|r|}^{r}(T)]$ is connected for $r \in \mathrm{lsc}(T)$, Proposition 20 implies that every graph $\Gamma_{M+n}^{w}(T)$ is connected, where $w \in V_M(T)$ and $n \geq 0$. For any $r \in \mathrm{lsc}(T)$, there exists by construction a $w \in V_M(T)$ such that $r = w[1,k_w]$. Since $\Gamma_{M+n}^{w}(T) \cong \Gamma_{|r|+n}^{r}(T)$ due to Proposition 15, it follows that (a) implies that every $\Gamma_{|r|+n}^{r}(T)$ is connected, where $r \in \mathrm{lsc}(T)$ and $n \geq 0$.

It remains to prove that (a) implies that every $\Gamma_{|p|+n}^{p}(T)$ is connected, where $p \in \mathrm{ssc}(T)$ and $n \geq 0$. Since every partition graph $Q[\Gamma_{2M-1}^{p}(T)]$ is connected, Proposition 21 with $k = M-|p|$ implies that $\Gamma_{M+n}^{p}(T)$ is connected for $n \geq 0$. The connectivity of graphs $\Gamma_{|p|+2}^{p}(T), \ldots, \Gamma_{M-1}^{p}(T)$ is part of the assumptions of (a), and graphs $\Gamma_{|p|+1}^{p}(T)$ and $\Gamma_{|p|}^{p}(T)$ are trivially connected, finishing the proof.

In general, if $T$ has just a few taboos, proving connectivity becomes easier since most strings are left $k$-synchronized. In Proposition 23 only previous results are used, while in Proposition 24 we study this case more exhaustively in a self-contained manner. Note that, when taboo-set $T$ is minimal, the assumptions of Proposition 24 are much easier to check.

Proposition 23

Given a left proper $T$ such that every pair of strings $w_1, w_2 \in V_M(T)$ with $d(w_1,w_2) = 1$ is left 1-synchronized, it holds that:

  1. For any $r \in \mathrm{lsc}(T)$ and $n \in \mathbb{N}_0$, $\Gamma_{n+|r|}^{r}(T)$ is connected.

  2. For any $p \in \mathrm{ssc}(T)$ with connected $\Gamma_M^{p}(T)$, $\Gamma_n^{p}(T)$ is connected for $n \geq M$.

Proof

Proposition 10 implies that every pair $w_1, w_2 \in V_M(T)$ is left $k$-synchronized for any $k \in \mathbb{N}$. We know from Proposition 17 that left $k$-synchronization of two strings with Hamming distance 1 indexing a partition as suffixes is equivalent to those two strings being adjacent in the partition graph. Therefore any quotient graph $Q[\Gamma_n^{w}(T)] = (L_1(w), E_{L_1(w)})$ induced by partition $V_n^{w}(T) = \bigcup_{a \in L_1(w)} V_n^{aw}(T)$ is fully connected (that is, every two vertices are adjacent). In particular, every $Q[\Gamma_n^{w}(T)]$ is connected, and thus Proposition 20 implies (a). Similarly with partition $V_n^{p}(T) = \bigcup_{w \in L_{M-|p|}(p)} V_n^{wp}(T)$, since $Q[\Gamma_n^{p}(T)] \cong \Gamma_M^{p}(T)$ for $n \geq M$, Proposition 21 implies (b).

Example 9

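For $\Sigma = \{A,C,G,T\}$ and $T = \{AA,CCC\}$, the strings $\mathrm{T}w_1$ and $\mathrm{T}w_2$ (each prefixed by the nucleotide T) are taboo-free for $w_1, w_2 \in V_3(T)$, hence they are left 1-synchronized. Since $\mathrm{lsc}(T) = \mathrm{suf}(T)$, for any taboo-free string $s$ and $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.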

Proposition 24

Given taboo-set $T$ and set $\Psi(T) := \bigcup_{t \in T} t[2,|t|]$, if every pair of taboo-free strings $w_1, w_2 \in \Psi(T)$ with $|w_1| \geq |w_2|$ and $d(w_1[1,|w_2|], w_2) \leq 1$ is left 1-synchronized, then it holds that:

  1. Every taboo-free string is 1-prefixable. In particular, $T$ is left proper.

  2. Every two taboo-free strings $s_1, s_2$ with $d(s_1,s_2) = 1$ are left 1-synchronized.

  3. Graph $\Gamma_n^{s}(T)$ is connected for every taboo-free string $s$ and $n \geq |s|$.

Proof

  1. Consider any taboo-free string s. Assume that, for each aΣ, as is not taboo-free, that is, that for some integer ca2, (as)[1,ca]T. WLOG assume ca1cam and consider s[1,cam-1], which satisfies s[1,cam-1]Ψ(T) since (ams)[1,cam]T. By construction, for any aΣ, string as[1,cam-1] is not taboo-free. On the other hand, the Hamming distance between s[1,cam-1]Ψ(T) and itself is 0, and thus the assumption of the statement implies that s[1,cam-1] is left 1-synchronized with s[1,cam-1]. In other words, a symbol aΣ exists such that as[1,cam-1] is taboo-free, which is a contradiction. All in all, s must be 1-prefixable. Taking sVM(T) we see that T is left proper.

  2. Given taboo-free strings s1,s2 such that d(s1,s2)=1, assume that they are not 1-synchronized. Then for every aΣ, either (as1)[1,ca]T or (as2)[1,ca]T for some ca2. Denote by C1aΣ{ca} those ca such that (as1)[1,ca]T, and analogously with C2. If C1 were empty, then s2 would not be 1-prefixable, contradicting (a). Thus, both C1 and C2 must be nonempty. Consider d1:=max{c:cC1} and d2:=max{c:cC2}. It holds that s1[2,d1]Ψ(T) and s2[2,d2]Ψ(T). Moreover, we have that the pair s1[2,d1], s2[2,d2] is not left 1-synchronized. Since d(s1,s2)=1, that contradicts the assumptions of the statement, hence s1 and s2 must be left 1-synchronized, as desired.

  3. Clearly $\Gamma_{|s|}^{s}(T)$ is connected, so let us proceed by induction. Assume $\Gamma_n^{s}(T)$ is connected for a fixed $n \geq |s|$ and consider $\Gamma_{n+1}^{s}(T)$. Since $V_{n+1}^{s}(T) \subseteq \Sigma V_n^{s}(T)$, if $|V_n^{s}(T)| = 1$, then $\Gamma_{n+1}^{s}(T)$ is connected. Otherwise we take different $s_1, s_2 \in V_{n+1}^{s}(T)$; we will prove that they are connected. We know that $s_1, s_2 \in \Sigma V_n^{s}(T)$, hence let us write $s_1 = c_1w_1$ and $s_2 = c_2w_2$ for $c_i \in \Sigma$ and $w_i \in V_n^{s}(T)$. If $w_1 = w_2$, the result is obvious, so assume $w_1 \neq w_2$.

    By hypothesis, $\Gamma_n^{s}(T)$ is connected, and thus there exists a path of vertices of $V_n^{s}(T)$, namely $y_1, \ldots, y_D$, such that $d(y_i,y_{i+1}) = 1$, $y_1 = w_1$ and $y_D = w_2$. For every $j \in [1,D-1]$, the pair $y_j$, $y_{j+1}$ is left 1-synchronized, and thus there exists $b_j \in \Sigma$ such that $b_jy_j$ and $b_jy_{j+1}$ are taboo-free. Since $d(b_jy_j, b_jy_{j+1}) = 1$, $b_jy_j$ and $b_jy_{j+1}$ are adjacent in $\Gamma_{n+1}^{s}(T)$. Moreover every pair of taboo-free strings contained in $\Sigma y_i$ is adjacent for $i \in [1,D-1]$. Since the relation “being connected” is transitive, vertices $s_1 \in \Sigma y_1$ and $s_2 \in \Sigma y_D$ are connected, as desired.

Example 10

If $\Sigma = \{A,C,G,T\}$ and $T = \{AA,CC,GG,TT\}$, then $\Psi(T) = \{A,C,G,T\}$. Every pair of strings in $\Psi(T)$ is left 1-synchronized, hence for every taboo-free $s$ and $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.
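Example 10 is small enough to spell out in code. The sketch below computes $\Psi(T)$ by dropping the first symbol of every taboo and then checks, for each pair in $\Psi(T)$, whether some common symbol can be prepended to both strings without creating a taboo. The helper name `left_extensions` is an illustrative assumption standing for $L_1(\cdot)$.

```python
ALPHABET = "ACGT"
T = {"AA", "CC", "GG", "TT"}


def taboo_free(s):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in T)


def left_extensions(w):
    """Symbols a such that aw is taboo-free (plays the role of L_1(w))."""
    return {a for a in ALPHABET if taboo_free(a + w)}


# Psi(T): every taboo with its first symbol removed.
psi = {t[1:] for t in T}

# Two strings are left 1-synchronized if they share a left extension.
synchronized = {
    (w1, w2): bool(left_extensions(w1) & left_extensions(w2))
    for w1 in psi
    for w2 in psi
}
print(sorted(psi), all(synchronized.values()))
```

Each string in $\Psi(T)$ excludes exactly one symbol as a left extension, so any two of them share at least two admissible symbols.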

Now we aim to find an upper bound for the number of taboos needed to guarantee connectivity of the graphs Γns(T). The following Corollary of Proposition 24 holds.

Corollary 25

Consider an alphabet Σ and a taboo-set T. The following holds:

  1. If $|T[1,1]| < |\Sigma|$, then for any taboo-free string $s$ and $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.

  2. If $|T| < |\Sigma|$, then for any taboo-free string $s$ and $n \geq |s|$, $\Gamma_n^{s}(T)$ is connected.

Proof

  1. Assume that taboo-free strings $s_1$, $s_2$ satisfy $L_1(s_1) \cap L_1(s_2) = \emptyset$. That is, for each $a \in \Sigma$, either $as_1$ or $as_2$ has a taboo as prefix, contradicting $|T[1,1]| < |\Sigma|$. Therefore every two taboo-free strings are left 1-synchronized, so we can apply Proposition 24.c, implying (a).

  2. If |T|<|Σ|, then |T[1,1]|<|Σ|. Thus, statement (a) yields the result.

Corollary 25.b implies that, if |T|<|Σ|, then every Γns(T) is connected. In Examples 11 and 12, we give examples of taboo-sets over an alphabet with |Σ|=2 and |Σ|>2 symbols respectively, such that |T|=|Σ| and at least one suffix graph is disconnected. In this sense, the upper bound |T|<|Σ| that guarantees connectivity for every suffix graph cannot be improved.

Example 11

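If $\Sigma = \{0,1\}$ and $T = \{10,01\}$, then $T$ is left proper and $|T[1,1]| = |T[2,2]| = 2 = |\Sigma|$. For $n \geq 2$, $V_n(T) = \{0 \cdots 0, 1 \cdots 1\}$ consists of the two constant strings of length $n$, which makes $\Gamma_n(T)$ disconnected. The trivial graphs $\Gamma_n^{0}(T)$ and $\Gamma_n^{1}(T)$ are both connected.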
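Example 11 can be verified by enumeration. In this sketch (helper names are hypothetical), the only taboo-free binary strings of length $n \geq 2$ turn out to be the two constant strings, whose Hamming distance is $n$, so the graph has two isolated vertices.

```python
from itertools import product

ALPHABET = "01"
TABOOS = {"10", "01"}


def taboo_free(s):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in TABOOS)


def suffix_vertices(n, suffix=""):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(ALPHABET, repeat=n - len(suffix)))
    return {w for w in words if taboo_free(w)}


def is_connected(verts):
    """Connectivity of the Hamming graph induced by verts."""
    if not verts:
        return True
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in ALPHABET:
                v = u[:i] + a + u[i + 1:]
                if v in verts and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(verts)


for n in range(2, 7):
    print(n, sorted(suffix_vertices(n)), is_connected(suffix_vertices(n)))
```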

Example 12

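For $m \geq 3$, $\Sigma = \{a_1, \ldots, a_m\}$ and the left proper taboo-set

$$T = \{a_3a_1, a_4a_1, a_5a_1, \ldots, a_ma_1\} \cup \{a_1a_2, a_2a_2\},$$

we claim that $\Gamma_n^{a_1}(T)$ is disconnected for $n \geq 3$. Indeed,

$$V_n^{a_1}(T) = V_n^{a_1a_1}(T) \cup V_n^{a_2a_1}(T) = \left(V_n^{a_2a_1a_1}(T) \cup V_n^{a_1a_1a_1}(T)\right) \cup \bigcup_{i \in [3,m]} V_n^{a_ia_2a_1}(T),$$

so take $s \in V_n^{a_2a_1a_1}(T) \cup V_n^{a_1a_1a_1}(T)$ and $r \in \bigcup_{i \in [3,m]} V_n^{a_ia_2a_1}(T)$. It holds that $d(s,r) \geq 2$, hence we found two disconnected components in graph $\Gamma_n^{a_1}(T)$. This is coherent with $|T[1,1]| = |\Sigma| = m$.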

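To generalize this example, for $i \in \mathbb{N}_0$, denote by $s_i := a_1 \cdots a_1$ the concatenation of $i$ copies of $a_1$. The taboo-set

$$T_i = \{a_3s_i, a_4s_i, \ldots, a_ms_i\} \cup \{a_1a_2s_{i-1}, a_2a_2s_{i-1}\}$$

satisfies that graph $\Gamma_n^{s_i}(T_i)$ is disconnected for $n \geq i+2$.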
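The disconnection claimed in Example 12 can be checked for small alphabets. The sketch below is an illustration only: the symbols $a_1, a_2, \ldots$ are encoded as the letters `a`, `b`, `c`, ..., and the function names are assumptions.

```python
import string
from itertools import product


def example12(m):
    """Alphabet and taboo-set of Example 12; letter k stands for a_{k+1}."""
    a = string.ascii_lowercase[:m]
    taboos = {a[i] + a[0] for i in range(2, m)} | {a[0] + a[1], a[1] + a[1]}
    return a, taboos


def suffix_vertices(n, suffix, taboos, alphabet):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(alphabet, repeat=n - len(suffix)))
    return {w for w in words if not any(t in w for t in taboos)}


def is_connected(verts, alphabet):
    """Connectivity of the Hamming graph induced by verts."""
    if not verts:
        return True
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for x in alphabet:
                v = u[:i] + x + u[i + 1:]
                if v in verts and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(verts)


for m in (3, 4):
    alphabet, taboos = example12(m)
    for n in range(2, 6):
        verts = suffix_vertices(n, alphabet[0], taboos, alphabet)
        print(m, n, is_connected(verts, alphabet))
```

For $n = 2$ the suffix graph still has only the adjacent vertices $a_1a_1$ and $a_2a_1$; the split appears at $n = 3$, in line with the claim.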

In this section, we have stated various results regarding the connectivity of every suffix Hamming graph given a left proper taboo-set T. Up to Theorem 16, our aim was to characterize the connectivity of every suffix Hamming graph. Then we found sufficient conditions in Proposition 24 and Corollary 25 that are easier to apply. When studying this connectivity problem, the practitioner should firstly try to apply the results requiring easy-to-check assumptions, and increasingly use the more complicated ones. Given a taboo-set T, a possible workflow would be the following:

  1. We check if |T[1,1]|<|Σ|. If it holds, we can apply Corollary 25.a. Otherwise go to step 2)

  2. In order to apply Proposition 24, we check if every pair of taboo-free strings $w_1, w_2 \in \Psi(T)$ with $|w_1| \geq |w_2|$ and $d(w_1[1,|w_2|], w_2) \leq 1$ is left 1-synchronized. If it does not hold, go to step 3)

  3. We check whether $T$ is left proper (this holds in all the biological examples that we considered so far). Otherwise redefine an equivalent left proper taboo-set and apply the characterization of Theorem 22. Two possibilities can arise: either every suffix Hamming graph is connected, and thus evolution can explore all the space of taboo-free strings; or some taboo-free strings belonging to $\mathrm{lsc}(T)$ or $\mathrm{ssc}(T)$ induce disconnected suffix graphs $\Gamma_{n_0}^{s}(T)$ for some $n_0 \geq |s|+M$, implying that $\Gamma_n^{s}(T)$ stays disconnected for $n \geq n_0$.
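The first two steps of this workflow are straightforward to automate. The following is a minimal sketch under stated assumptions, not the authors' algorithm: left 1-synchronization of two strings of possibly different lengths is taken to mean that some single symbol can be prepended to both without creating a taboo, and step 3 (Theorem 22) is not implemented, only reported as the remaining option. All function names are hypothetical.

```python
def taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def hamming(x, y):
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def corollary_25a_applies(taboos, alphabet):
    """Step 1: fewer distinct first symbols than alphabet symbols."""
    return len({t[0] for t in taboos}) < len(alphabet)


def proposition_24_applies(taboos, alphabet):
    """Step 2: the synchronization hypothesis on Psi(T)."""
    psi = [w for w in {t[1:] for t in taboos} if taboo_free(w, taboos)]
    for w1 in psi:
        for w2 in psi:
            if len(w1) >= len(w2) and hamming(w1[:len(w2)], w2) <= 1:
                synced = any(
                    taboo_free(a + w1, taboos) and taboo_free(a + w2, taboos)
                    for a in alphabet
                )
                if not synced:
                    return False
    return True


def first_applicable_result(taboos, alphabet="ACGT"):
    """Name of the first result in the workflow whose hypothesis holds."""
    if corollary_25a_applies(taboos, alphabet):
        return "Corollary 25.a"
    if proposition_24_applies(taboos, alphabet):
        return "Proposition 24"
    return "Theorem 22"  # full characterization needed; not implemented here


print(first_applicable_result({"GATC", "GGACC", "GGTCC"}))
print(first_applicable_result({"AA", "CC", "GG", "TT"}))
print(first_applicable_result(
    {"ACCC", "TCCC", "CGCC", "GGCC", "GGGT", "GGGA", "GGCG"}))
```

Ordering the checks from cheapest to most expensive mirrors the recommendation in the text: step 1 is linear in $|T|$, step 2 is quadratic in $|\Psi(T)|$.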

Examples of plausible bacterial taboo-sets

Taboo-sets generated by the avoidance of restriction sites can exhibit various levels of complexity. In this section, we discuss some examples from REBASE (Roberts et al. 2014) using the theory developed in this work. Note that many restriction enzymes in the REBASE database have an unknown recognition site, hence our taboo-sets may underestimate the actual number of taboos. Before describing the examples, we will briefly review essential nomenclature for DNA sequences.

DNA is double-stranded, where A pairs with T and G pairs with C, hence it suffices to discuss only one of the strands. We adopt the convention that, given any of the strands, the DNA sequence is always represented from the 5’ end to the 3’ end (which is chemically determined). As a consequence, given a DNA sequence, its complementary DNA sequence, the one lying on the opposite strand, is obtained by inverting the order of the symbols and applying the substitutions A↔T and C↔G. If a DNA sequence s is identical to its complementary DNA sequence, we say that s is an inverted repeat (Ussery et al. 2008). For example, sequence CCGG is an inverted repeat.

The fact that DNA is double-stranded implies that each recognition site induces taboos in pairs, namely itself and its complementary DNA sequence. For example, if AGGGC is a recognition site, then also the complementary strand GCCCT is a taboo. If, however, the recognition site is an inverted repeat such as TGCA, then this pair is actually one single recognition site. Recognition sites of type II R–M systems are nearly always an inverted repeat (Rusinov et al. 2015; Gelfand and Koonin 1997), and therefore one recognition site induces one single taboo. This is especially interesting because, according to Rusinov et al. (2015, 2018a), only type II R–M systems induce taboos.
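The passage from recognition sites to taboos can be sketched in a few lines. The function names below are illustrative assumptions; the logic simply reverses the string and swaps A with T and C with G, as described above.

```python
# Translation table for the substitutions A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complementary DNA sequence, read 5' to 3' on the opposite strand."""
    return seq[::-1].translate(COMPLEMENT)


def is_inverted_repeat(seq):
    """True if seq equals its own complementary sequence."""
    return seq == reverse_complement(seq)


def induced_taboos(recognition_sites):
    """Taboo-set induced by the sites: each site plus its complement."""
    return {t for site in recognition_sites
            for t in (site, reverse_complement(site))}


print(reverse_complement("AGGGC"))
print(is_inverted_repeat("TGCA"), is_inverted_repeat("CCGG"))
print(sorted(induced_taboos(["AGGGC", "TGCA"])))
```

An inverted repeat contributes a single taboo because the set construction merges the site with its identical complement.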

A permutation of the symbols of alphabet Σ does not alter any of the results that we proved along this work. Moreover, by reversing the order of the symbols, any statement regarding e.g. left-properness and suffixes has an analogous one in which right-properness and suffixes are involved. On the other hand, taboo-sets induced by restriction enzymes remain invariant when we interchange every recognition site by its complementary sequence. Therefore, note that, for a bacterial taboo-set T, if we prove that every graph Γns(T) is connected, then also every graph sΓn(T) is connected.

A frequent case: Turneriella parva

The Turneriella parva (REBASE organism number 8970) strain produces a restriction enzyme with recognition site GATC, an inverted repeat. Similarly, another of its enzymes has recognition sites GGACC and GGTCC. Thus, these restriction enzymes generate the taboo-set

$$T_{T.pa} = \{GATC\} \cup \{GGACC, GGTCC\}. \tag{4}$$

Since all three taboos start with G, we have $|T_{T.pa}[1,1]| = 1 < 4$, and Corollary 25.a implies that every graph $\Gamma_n^{s}(T_{T.pa})$ is connected. Therefore the evolution of the DNA sequences can potentially reach any other taboo-free DNA sequence, no matter which suffix was conserved along this process.
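As a sanity check, the criterion of Corollary 25.a and a brute-force connectivity test can be run side by side on the T. parva taboo-set. This sketch only inspects short strings and a few arbitrarily chosen suffixes (the suffixes `GAT` and `GGAC` are illustrative assumptions); the corollary itself covers all lengths and suffixes.

```python
from itertools import product

ALPHABET = "ACGT"
T_PARVA = {"GATC", "GGACC", "GGTCC"}


def suffix_vertices(n, suffix, taboos):
    """V_n^suffix(T): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(ALPHABET, repeat=n - len(suffix)))
    return {w for w in words if not any(t in w for t in taboos)}


def is_connected(verts):
    """Connectivity of the Hamming graph induced by verts."""
    if not verts:
        return True
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in ALPHABET:
                v = u[:i] + a + u[i + 1:]
                if v in verts and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(verts)


first_symbols = {t[0] for t in T_PARVA}  # the set T[1,1]
criterion_holds = len(first_symbols) < len(ALPHABET)

cases = [(4, ""), (5, ""), (6, ""), (6, "GAT"), (6, "GGAC")]
checks = {c: is_connected(suffix_vertices(c[0], c[1], T_PARVA)) for c in cases}
print(first_symbols, criterion_holds, checks)
```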

Among the 3623 bacteria in REBASE (2020a), only 465 have more than three type II restriction enzymes. Assuming that only type II restriction enzymes induce taboos, as stated by Rusinov et al. (2015, 2018a), Corollary 25.b implies that at least 87% (3158/3623) of bacterial taboo-sets in REBASE (2020a) yield connected taboo-free Hamming graphs. Similarly, at least 90% (139/153) of archaea in REBASE (2020b) induce connected taboo-free Hamming graphs, because they have fewer than four type II restriction enzymes. The following example describes a more complex collection of restriction enzymes.
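The percentages follow directly from the counts quoted above; the short computation below just makes the arithmetic explicit. The variable names are illustrative.

```python
# Counts quoted in the text (REBASE, accessed 17 June 2020).
bacteria_total = 3623
bacteria_more_than_three = 465

# Organisms with at most three type II enzymes satisfy |T| < |Sigma| = 4,
# assuming each enzyme contributes a single taboo.
bacteria_at_most_three = bacteria_total - bacteria_more_than_three
bacteria_fraction = bacteria_at_most_three / bacteria_total

archaea_fraction = 139 / 153

print(f"{bacteria_at_most_three}/{bacteria_total} = {bacteria_fraction:.1%}")
print(f"139/153 = {archaea_fraction:.1%}")
```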

Helicobacter pylori

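In H. pylori 21-A-EK1, studied by Ailloud et al. (2019), many restriction enzymes have been identified. For the sake of clarity, let us write $T_{H.py} = {}^{A}T \cup {}^{G}T \cup {}^{C}T \cup {}^{T}T$, where ${}^{a}T$ denotes those taboos in $T_{H.py}$ whose first symbol is $a \in \Sigma$. Then we have

$$\begin{aligned} {}^{A}T &= \{\mathrm{AC}\Sigma\mathrm{GT}\}, \\ {}^{G}T &= (\mathrm{GT}\Sigma^2\mathrm{AC}) \cup \{\mathrm{GTCAC}, \mathrm{GTGAC}\} \cup \{\mathrm{GTAC}, \mathrm{GAGG}\}, \\ {}^{C}T &= \{\mathrm{CCGG}, \mathrm{CCTC}, \mathrm{CATG}\}, \\ {}^{T}T &= \{\mathrm{TGCA}\}, \end{aligned} \tag{5}$$

where $\mathrm{GT}\Sigma^2\mathrm{AC}$ represents taboos of the type $\mathrm{GT}ab\mathrm{AC}$ with $a, b \in \Sigma$, and so on for analogous notations.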

We want to apply Proposition 24. Take any $r_1, r_2 \in \Psi(T_{H.py})$ and assume that they are not left 1-synchronized. In particular WLOG we can assume that $\mathrm{T} \notin L_1(r_1)$, implying $r_1 = \mathrm{GCA}$. If $\mathrm{C} \notin L_1(r_1)$, then $r_1 \in \{\mathrm{CGG}, \mathrm{CTC}, \mathrm{ATG}\}$, which contradicts $r_1 = \mathrm{GCA}$. Therefore it must be $\mathrm{C} \notin L_1(r_2)$, yielding $r_2 \in \{\mathrm{CGG}, \mathrm{CTC}, \mathrm{ATG}\}$. In any case, $d(r_1,r_2) \geq 2$. Thus, for any $w_1, w_2 \in \Psi(T_{H.py})$ with $d(w_1[1,|w_2|], w_2) \leq 1$, it holds that $w_1$ and $w_2$ are left 1-synchronized, so Proposition 24 can be applied: every graph $\Gamma_n^{s}(T_{H.py})$ is connected and, in particular, $\Gamma_n(T_{H.py})$ is connected.
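The argument above can be double-checked mechanically. The sketch below rebuilds the H. pylori taboo-set from the patterns in Eq. (5), writing the wildcard $\Sigma$ as `N` (an encoding chosen here for illustration), and searches for pairs in $\Psi(T_{H.py})$ that share no admissible left extension. The helper names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def expand(pattern):
    """Expand every wildcard N in pattern into all four nucleotides."""
    choices = [ALPHABET if ch == "N" else ch for ch in pattern]
    return {"".join(p) for p in product(*choices)}


T_HPY = (
    expand("ACNGT")
    | expand("GTNNAC")
    | {"GTCAC", "GTGAC", "GTAC", "GAGG"}
    | {"CCGG", "CCTC", "CATG"}
    | {"TGCA"}
)


def taboo_free(s):
    """True if no taboo of T_HPY occurs as a substring of s."""
    return not any(t in s for t in T_HPY)


def L1(r):
    """Symbols a such that ar is taboo-free."""
    return {a for a in ALPHABET if taboo_free(a + r)}


# Taboo-free elements of Psi(T): taboos with their first symbol removed.
psi = {t[1:] for t in T_HPY if taboo_free(t[1:])}

# Pairs with no common left extension, i.e. not left 1-synchronized.
unsynchronized = [(r1, r2) for r1 in psi for r2 in psi
                  if not (L1(r1) & L1(r2))]
print(len(T_HPY), len(psi), unsynchronized)
```

No unsynchronized pair is found, even without restricting to pairs at Hamming distance at most one, so the hypothesis of Proposition 24 holds with room to spare.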

An imaginary bacterium

The taboo-set can significantly influence evolution in the cases where some Γns(T) is disconnected. To explain this, we will create a plausible, nonexistent example. Suppose that a strain of Bacterium imaginara has taboo-set

$$T_{B.im} = \{ACCC, TCCC, CGCC, GGCC\} \cup \{GGGT, GGGA, GGCG\},$$

where the second set contains the complementary DNA sequences of the first set, except that of GGCC, which is an inverted repeat. Thus, taboo-set TB.im is induced by 4 restriction enzymes. At first glance, taboo-set TB.im seems less restrictive than TH.py, which has 6 taboos of length four and 22 taboos of length five or more.

Proposition 24 cannot be applied because CCC and GCC are not left 1-synchronized, and actually we can find a disconnected suffix graph. Let us take $V_n^{CCC}(T_{B.im})$, which satisfies

$$V_n^{CCC}(T_{B.im}) = V_n^{GCCC}(T_{B.im}) \cup V_n^{CCCC}(T_{B.im}) = \left(V_n^{AGCCC}(T_{B.im}) \cup V_n^{TGCCC}(T_{B.im})\right) \cup \left(V_n^{GCCCC}(T_{B.im}) \cup V_n^{CCCCC}(T_{B.im})\right),$$

implying that, for any strings $s_1 \in V_n^{GCCC}(T_{B.im})$ and $s_2 \in V_n^{CCCC}(T_{B.im})$, it holds that $d(s_1,s_2) \geq 2$. Thus, we found two disconnected components in $\Gamma_n^{CCC}(T_{B.im})$, namely $\Gamma_n^{GCCC}(T_{B.im})$ and $\Gamma_n^{CCCC}(T_{B.im})$. All in all, the graph $\Gamma_n^{CCC}(T_{B.im})$ is disconnected for $n \geq 5$.
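The decomposition above can be confirmed by enumeration. This sketch, with hypothetical helper names, lists $V_5^{CCC}(T_{B.im})$ and tests the connectivity of the suffix graph for a few small $n$.

```python
from itertools import product

ALPHABET = "ACGT"
T_BIM = {"ACCC", "TCCC", "CGCC", "GGCC", "GGGT", "GGGA", "GGCG"}


def suffix_vertices(n, suffix):
    """V_n^suffix(T_BIM): taboo-free strings of length n ending in suffix."""
    words = ("".join(p) + suffix
             for p in product(ALPHABET, repeat=n - len(suffix)))
    return {w for w in words if not any(t in w for t in T_BIM)}


def is_connected(verts):
    """Connectivity of the Hamming graph induced by verts."""
    if not verts:
        return True
    start = next(iter(verts))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(len(u)):
            for a in ALPHABET:
                v = u[:i] + a + u[i + 1:]
                if v in verts and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(verts)


print(sorted(suffix_vertices(5, "CCC")))
for n in range(3, 8):
    print(n, is_connected(suffix_vertices(n, "CCC")))
```

At $n = 4$ the graph still consists of the adjacent vertices GCCC and CCCC; the split appears once the fifth position from the right is forced to be A or T in one block and G or C in the other.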

This has the following evolutionary implications: Assume that we have two correctly aligned DNA fragments $f_\alpha$ and $f_\beta$ of the genome of Bacterium imaginara. Assume moreover that we can write $f_\alpha = r_\alpha\mathrm{GCCC}$ and $f_\beta = r_\beta\mathrm{CCCC}$ for some strings $r_\alpha$ and $r_\beta$, and also that the suffix CCC is invariable due to functional constraints. Then $f_\alpha$ cannot have evolved from $f_\beta$ by simple point mutations, because at some point in evolution a taboo string is produced that is lethal for the carrier. Thus, the standard models of sequence evolution (Strimmer and von Haeseler 2009) do not apply.

Concluding remarks

Using the results proven in this work, it is possible to decide whether every Hamming graph Γns(T) is connected. The connectivity of the taboo-free Hamming graphs induced by the restriction enzymes of the bacteria listed in REBASE could be quickly analysed with our tools. Unfortunately, for many organisms listed in REBASE, the recognition sites of restriction enzymes are not available.

Based on the current version of REBASE (2020a), we conclude using Corollary 25 that taboo-sets of at least 87% (3158/3623) of bacteria in REBASE induce connected taboo-free Hamming graphs, because they have fewer than four type II restriction enzymes. For larger taboo-sets, Proposition 24 can be used, as we did in Sect. 9.2, or one can directly use the characterization of Theorem 22. Thus, restriction enzymes in bacteria generally do not lead to any disconnected taboo-free Hamming graph, and our models of sequence evolution are by and large applicable. However, the influence of some missing sequences in the Hamming graph on the estimation of evolutionary parameters deserves further investigation. We also would like to emphasize that many recognition sites still have to be identified, and thus it may well be possible that we find disconnected taboo-free Hamming graphs in the near future.

We consider the formal framework developed in this paper as a first and necessary step to understand the effect of restriction enzymes (and possibly other taboo sequences) on the DNA composition of bacteria and viruses, or more generally on the sequence space modelled as a Hamming graph. Consider, for example, the phylogenetic studies by Ailloud et al. (2019), from which the H. pylori taboo-set $T_{H.py}$ of Sect. 9.2 was taken. The following natural questions arise: How are inferred evolutionary times between the two H. pylori populations affected by $T_{H.py}$? Has their GC content varied due to the taboos of restriction enzymes?

To answer such questions, we need to develop models of sequence evolution that take taboos into account. Taboo avoidance induces complex dependencies along a DNA sequence, which can be measured using Markov Chain Monte Carlo (MCMC) simulations. If all taboo-free Hamming graphs Γns(T) are connected, then MCMC methods are easy to apply (Manuel et al. unpublished). A disconnected taboo-free Hamming graph, however, leads to a reducible Markov chain, which complicates simulation of taboo-free evolution.
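To illustrate why connectivity matters for simulation, the following sketch runs a simple random walk on a taboo-free Hamming graph: at each step a point mutation is proposed and rejected if it would create a taboo. This is an assumption-laden toy, not the unpublished method cited above; the uniform proposal, the function names and the example start sequence are all hypothetical.

```python
import random


def taboo_free(s, taboos):
    """True if no taboo occurs as a substring of s."""
    return not any(t in s for t in taboos)


def taboo_free_walk(start, taboos, alphabet, steps, rng):
    """Random walk on the taboo-free Hamming graph.

    Each step proposes a uniform point mutation and keeps the current
    string whenever the proposal contains a taboo.
    """
    if not taboo_free(start, taboos):
        raise ValueError("start sequence must be taboo-free")
    current = start
    path = [current]
    for _ in range(steps):
        i = rng.randrange(len(current))
        a = rng.choice(alphabet)
        proposal = current[:i] + a + current[i + 1:]
        if taboo_free(proposal, taboos):
            current = proposal
        path.append(current)
    return path


T_PARVA = {"GATC", "GGACC", "GGTCC"}
path = taboo_free_walk("ACGTACGTAC", T_PARVA, "ACGT", 500, random.Random(0))
print(path[0], path[-1])
```

When the graph is connected, such a walk can in principle reach every taboo-free string of the given length; on a disconnected graph it stays inside the component of its starting sequence, which is the reducibility problem mentioned above.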

Another application of our framework is the construction of combinations of restriction enzymes that lead to a disconnected Hamming graph, and thus limit evolutionary freedom. This may help to efficiently treat viral infections. Some progress has been made in the usage of restriction enzymes for the treatment of viral infections (Weber et al. 2014). Since one or just a few SNPs can significantly alter the symptoms or even the mortality associated with a pathogen (Collery et al. 2017; Yuan et al. 2017), our characterization of the connectivity of taboo-free Hamming graphs could help to delete SNPs from the viral genome that are detrimental to humans. Although the treatment of an infection using restriction enzymes is mostly unexplored, this work could be a first theoretical guide to a successful treatment.

Acknowledgements

We would like to thank Michael Charleston (University of Tasmania) for his constructive criticism. In fact the idea of studying taboo-free sequences originated from discussions with Mike while Arndt received a visiting fellowship at the University of Tasmania. This work was supported by the Austrian Science Fund (FWF, Grant Number I-1824-B22) to Arndt von Haeseler.

Author Contributions

C.M. wrote this manuscript with the guidance of A.vonH. All authors read and approved the manuscript.

Funding

Open access funding provided by Austrian Science Fund (FWF).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Cassius Manuel, Email: cassius.perez@univie.ac.at.

Arndt von Haeseler, Email: arndt.von.haeseler@univie.ac.at.

References

  1. Ailloud F, Didelot X, Woltemate S, Pfaffinger G, Overmann J, Bader RC, Schulz C, Malfertheiner P, Suerbaum S. Within-host evolution of Helicobacter pylori shaped by niche-specific adaptation, intragastric migrations and selective sweeps. Nat Commun. 2019;10(1):2273. doi: 10.1038/s41467-019-10050-1.
  2. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson J. Molecular biology of the cell (chapter 8) 5. London: Garland; 2004. pp. 532–534.
  3. Asinowski A, Bacher A, Banderier C, Gittenberger B (2018) Analytic combinatorics of lattice paths with forbidden patterns: enumerative aspects. In: Language and automata theory and applications. Springer, pp 195–206
  4. Asinowski A, Bacher A, Banderier C, Gittenberger B (2020) Analytic combinatorics of lattice paths with forbidden patterns, the vectorial kernel method, and generating functions for pushdown automata. Algorithmica 82:386–428. 10.1007/s00453-019-00623-3
  5. Collery MM, Kuehne SA, McBride SM, Kelly ML, Monot M, Cockayne A, Dupuy B, Minton NP. What’s a SNP between friends: the influence of single nucleotide polymorphisms on virulence and phenotypes of Clostridium difficile strain 630 and derivatives. Virulence. 2017;8(6):767–781. doi: 10.1080/21505594.2016.1237333.
  6. Fitch WM, Margoliash E. A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem Genet. 1967;1(1):65–71. doi: 10.1007/BF00487738.
  7. Gelfand M, Koonin E. Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Res. 1997;25:2430–9. doi: 10.1093/nar/25.12.2430.
  8. Hsu WJ, Chung MJ (1993) Generalized Fibonacci cubes. In: 1993 International conference on parallel processing—ICPP’93, vol 1, pp 299–302
  9. Ilić A, Klavžar S, Rho Y. Generalized Fibonacci cubes. Discrete Math. 2012;312:2–11. doi: 10.1016/j.disc.2011.02.015.
  10. Klavžar S. Structure of Fibonacci cubes: a survey. J Comb Optim. 2013;25:505–522. doi: 10.1007/s10878-011-9433-z.
  11. Kommireddy V, Nagaraja V. Diverse functions of restriction–modification systems in addition to cellular defense. Microbiol Mol Biol Rev MMBR. 2013;77:53–72. doi: 10.1128/MMBR.00044-12.
  12. Manuel C, Pfannerer S, von Haeseler A (unpublished) Etaboo: modelling and measuring taboo-free evolution. Unpublished
  13. REBASE (2020a) The restriction enzyme database. http://rebase.neb.com/rebase/arcbaclistB.html. Accessed 17 June 2020
  14. REBASE (2020b) The restriction enzyme database. http://rebase.neb.com/rebase/arcbaclistA.html. Accessed 17 June 2020
  15. Roberts RJ, Vincze T, Posfai J, Macelis D. REBASE – a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res. 2014;43(D1):D298–D299. doi: 10.1093/nar/gku1046.
  16. Rocha E, Danchin A, Viari A. Evolutionary role of restriction/modification systems as revealed by comparative genome analysis. Genome Res. 2001;11:946–958. doi: 10.1101/gr.GR-1531RR.
  17. Rusinov I, Ershova A, Karyagina A, Spirin S, Alexeevski A. Lifespan of restriction–modification systems critically affects avoidance of their recognition sites in host genomes. BMC Genomics. 2015;16(1):1084. doi: 10.1186/s12864-015-2288-4.
  18. Rusinov IS, Ershova AS, Karyagina AS, Spirin SA, Alexeevski AV. Avoidance of recognition sites of restriction–modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses. BMC Genomics. 2018;19(1):885. doi: 10.1186/s12864-018-5324-3.
  19. Rusinov IS, Ershova AS, Karyagina AS, Spirin SA, Alexeevski AV. Comparison of methods of detection of exceptional sequences in prokaryotic genomes. Biochemistry (Moscow) 2018;83(2):129–139. doi: 10.1134/S0006297918020050.
  20. Sanders P, Schulz C (2013) High quality graph partitioning. In: Proceedings of the 10th DIMACS implementation challenge workshop
  21. Shoemaker JS, Fitch WM. Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Mol Biol Evol. 1989;6(3):270–289. doi: 10.1093/oxfordjournals.molbev.a040550.
  22. Strimmer K, von Haeseler A. Genetic distances and nucleotide substitution models. In: Lemey P, Salemi M, Vandamme A-M, editors. The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. 2. Cambridge: Cambridge University Press; 2009. pp. 111–141.
  23. Ussery DW, Wassenaar TM, Borini S. Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists. 1. Berlin: Springer; 2008.
  24. Weber ND, Aubert M, Dang CH, Stone D, Jerome KR. DNA cleavage enzymes for treatment of persistent viral infections: recent advances and the pathway forward. Virology. 2014;454–455:353–361. doi: 10.1016/j.virol.2013.12.037.
  25. Wilson RJ. Introduction to graph theory. New York: Wiley; 1986.
  26. Yuan L, Huang X-Y, Liu Z-Y, Zhang F, Zhu XL, Yu J-Y, Ji X, Xu Y, Li G, Li C, Wang H-J, Deng Y-Q, Wu M, Cheng M-L, Ye Q, Xie D-Y, Li X-F, Wang X, Shi W, Qin C-F. A single mutation in the prM protein of Zika virus contributes to fetal microcephaly. Science. 2017;358:933–936. doi: 10.1126/science.aam7120.

Articles from Journal of Mathematical Biology are provided here courtesy of Springer
