Author manuscript; available in PMC: 2016 Jun 1.
Published in final edited form as: Inf Comput. 2015 Jun;242:340–353. doi: 10.1016/j.ic.2015.04.001

Existence of constants in regular splicing languages

Paola Bonizzoni, Nataša Jonoska
PMCID: PMC4866503  NIHMSID: NIHMS691394  PMID: 27185985

Abstract

In spite of extensive investigation of finite splicing systems in formal language theory, basic questions, such as their characterization, remain unsolved. It has been conjectured that a necessary condition for a regular language L to be a splicing language is that L must have a constant in the Schützenberger sense. We prove this longstanding conjecture to be true. The result is based on properties of strongly connected components of the minimal deterministic finite state automaton for a regular splicing language. Using constants of the corresponding languages, we also provide properties of transitive automata and path-automata.

1. Introduction

A splicing system, originally introduced in [12], is a formal model that uses a contextual cross-over operation over words to generate languages called splicing languages. This cross-over splicing formalizes the behavior of basic biomolecular processes involving cutting and pasting of DNA performed by restriction enzymes and a ligase. Restriction enzymes act on double-stranded DNA molecules by cleaving certain recognized segments, leaving short single-stranded overhangs. Molecules with the same overhangs can join (in a cross-over fashion) in the presence of a ligase enzyme. In the introductory paper, T. Head proved that if the splicing is performed by a finite set of certain simple rules, then splicing of a finite set of words can generate the class of strictly locally testable languages [9]. The splicing notion was reformulated by G. Păun at a less restrictive level of generality, giving rise to the splicing operation that is commonly adopted and is nowadays the standard [17].

Theoretical results on splicing systems have contributed to new research in formal language theory focused on modeling of biochemical processes [18]. Conversely, the field has suggested new ideas in biomolecular science, for example the design of automated enzymatic processes.

In this paper, we focus on finite splicing systems, which we simply call splicing systems. A splicing system has a finite set of rules (modeling enzymes) applied to a finite set of initial strings (modeling DNA sequences). A splicing system (or H-system) is a triple $H = (A, I, R)$, where $A$ is a finite alphabet, $I \subseteq A^*$ is the initial language and $R$ is the set of rules (see Section 4 for the definitions). The formal language generated by the splicing system is the smallest language containing $I$ and closed under the splicing operation.

There have been successes in characterizing certain subclasses of splicing languages, for example those generated by reflexive rules and those generated by symmetric rules [2]. Reflexivity and symmetry are natural properties for splicing systems because they ensure splicing of molecules cut with the same enzyme, as well as recombination of molecules resulting from the same type of cut [12]. A general splicing system may have a set of rules R that is neither symmetric nor reflexive. Under the formal model, a splicing system is a generative mechanism for a language that belongs to a proper subclass of the regular languages. This basic result was first proved in [8], and later proved in several other papers using different approaches (see for example [19,21]).

In spite of the vast literature on the topic, a structural characterization of the finite splicing systems is still an open problem, although decidability of regular splicing languages has been recently proved in [15].

On the other hand, progress has been made towards the characterization of certain subclasses of splicing systems. The authors of [11] prove that it is decidable whether a regular language is a reflexive splicing language and provide an example of a regular splicing language that is neither reflexive nor symmetric. A quite different characterization of reflexive symmetric splicing languages is given in [3] and it has been extended to the general class of reflexive regular languages in [4,5]. This characterization uses the concept of a constant of a language introduced by Schützenberger [20].

In order to solve the open problem of characterizing the whole class of splicing languages, it seems necessary to understand the role of constants. Indeed, since the introduction of splicing languages it has been conjectured, and stated more formally in [10] and [11], that the existence of a constant is a necessary condition for a regular language to be a splicing language. In this paper we solve this longstanding open question by proving the conjecture true. The result is proved by investigating structural properties of connected components of the transition graph given by the minimal finite state automaton for a regular splicing language. More precisely, properties of the factor language of transitive components are related to the notion of synchronizing words [7]. Synchronizing words have been studied in automata theory for a long time and are of interest in both coding theory [1] and symbolic dynamics [16,14]. Our proof uses an old observation that a synchronizing word for an automaton is a constant for the language recognized by the automaton [20].

The paper is organized as follows.

In Section 2 we introduce preliminary concepts, including the notions of a synchronizing word and a constant. In Section 3 we introduce the notions of a transitive automaton and a path-automaton, and show several results connecting terminal components of automata and synchronizing words. Moreover, we show a relationship between transitive languages, transitive automata, transitive components, and constants of the language. Then in Section 4 we recall the basic notion of a splicing system and revisit the notion of splicing rules of a splicing system by providing properties that are necessary in proving the main result of the paper. Finally, in Section 5 we give examples of non-reflexive splicing languages, show a relationship between transitive languages and splicing languages, and prove the main result of the paper. A preliminary extended abstract of this paper appeared in [6].

2. Preliminaries

We refer the reader to [13] for background on automata theory, and assume some familiarity with the subject. Let $A^*$ be the free monoid over a finite alphabet $A$ and let $A^+ = A^* \setminus \{1\}$, where $1$ is the empty word. A deterministic finite state automaton (DFA) is a 5-tuple $\mathcal{A} = (Q, A, I, T, \delta)$, where $Q$ is a finite set of states, $I \subseteq Q$ is the set of initial states, $T \subseteq Q$ is the set of terminal (final) states and $\delta \subseteq Q \times A \times Q$ is the set of transitions such that for every $q \in Q$ and every $a \in A$ the set $\{q' \mid (q, a, q') \in \delta\}$ consists of at most one element. Given a deterministic finite state automaton $\mathcal{A}$, the set of transitions defines a partial action of $A^*$ on $Q$. It is generated by the partial maps $a : Q \to Q$ for $a \in A$, defined by $q(a) = q'$ iff $q' \in Q$ is the unique state with $(q, a, q') \in \delta$. We use the standard notation $qa$ to denote $q'$. If such $q'$ does not exist, we write $qa = \emptyset$. Inductively, we extend the notation to words with $q(wa) = (qw)a$. Similarly, we write $Qw$ for the image of the set $Q$ under the map $w : Q \to Q$ defined by $w(q) = qw$. If $qa$ is defined for all $q \in Q$ and $a \in A$ we say that $\mathcal{A}$ is complete. A deterministic finite state automaton is usually depicted as a directed graph with vertices $Q$ and set of directed edges $\delta$. For an edge $e = (q, a, q')$ we say that $q$ is its "start" state, $q'$ is its "end" state (also referred to as an end-point) and $a$ is its label. A word $w$ is accepted by an automaton $\mathcal{A}$ if there is a path with label $w$ that starts at an initial state and ends at a terminal state. We denote with $L(\mathcal{A})$ the language recognized by $\mathcal{A}$, that is, the set of all words accepted by $\mathcal{A}$ [13].

Given a regular language $L \subseteq A^*$ it is well known that there is a unique minimal complete deterministic finite state automaton (mDFA) $\hat{\mathcal{A}} = (Q, A, \{q_0\}, T, \delta)$ that recognizes $L$ such that every other complete DFA with one initial state that recognizes $L$ maps homomorphically onto $\hat{\mathcal{A}}$ [13]. This automaton is unique up to renaming of the states, i.e., up to isomorphism. We reserve the notation $\hat{\mathcal{A}}(L)$ to denote this automaton.
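
The partial action of words on states can be made concrete with a small sketch. The snippet below is only an illustration and not part of the paper: it assumes the transition set is stored as a Python dictionary from (state, letter) pairs to states, uses `None` to stand for an undefined action, and the names `act`, `accepts` and the toy automaton for (ab)* are hypothetical.

```python
def act(delta, q, word):
    """Return the state reached from q by reading word, or None if undefined."""
    for a in word:
        if q is None:
            return None
        q = delta.get((q, a))  # at most one target per (state, letter)
    return q


def accepts(delta, initial, terminal, word):
    """A word is accepted if some initial state is taken to a terminal state."""
    return any(act(delta, q0, word) in terminal for q0 in initial)


# Hypothetical toy DFA (not complete) recognizing (ab)*.
toy_delta = {("p", "a"): "r", ("r", "b"): "p"}
print(accepts(toy_delta, {"p"}, {"p"}, "abab"))
print(act(toy_delta, "p", "aa"))
```

In this toy automaton the action of `aa` on `p` is undefined because no transition labeled `a` leaves `r`, which mirrors the convention of writing the empty set for an undefined action.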

Given a language $L$, the language $F(L)$ is the set of all factors of words in $L$, where $x$ is a factor of a word $w$ if $w = zxy$ for some $z, y \in A^*$. We say $L$ is factor-closed if $F(L) = L$.

The right context of a word $w \in A^*$ with respect to a language $L$ is defined as $\mathcal{R}_L(w) = \{x \in A^* \mid wx \in L\}$. Symmetrically, the left context of $w$ with respect to $L$ is the set $\mathcal{L}_L(w) = \{x \in A^* \mid xw \in L\}$.

The right context of a state $q$ in $\mathcal{A}$ is $\mathcal{R}_{\mathcal{A}}(q) = \{x \in A^* \mid qx \in T\}$. An automaton $\mathcal{A}$ is said to be reduced if no two distinct states of $\mathcal{A}$ have the same right context. Observe that the right context depends only on the terminal states in the automaton. In other words, if the initial state(s) of $\mathcal{A}$ are changed but the transitions and the set of terminal states remain, the right contexts of the states do not change. It is well known (see e.g. [13]) that given a regular language $L$, there is a one-to-one correspondence between the right contexts of words with respect to $L$ and the right contexts of the states in the minimal deterministic finite state automaton $\hat{\mathcal{A}}$ for $L$, i.e.,

$$q_0 w = q \quad \text{iff} \quad \mathcal{R}_L(w) = \mathcal{R}_{\hat{\mathcal{A}}}(q).$$

In fact, in the mDFA $\hat{\mathcal{A}}$, it also holds that $\mathcal{R}_L(w) = \mathcal{R}_{\hat{\mathcal{A}}}(q)$ iff $\mathcal{R}_L(wa) = \mathcal{R}_{\hat{\mathcal{A}}}(qa)$ for all $a \in A$, and therefore $\mathcal{R}_{\hat{\mathcal{A}}}(q) = \mathcal{R}_{\hat{\mathcal{A}}}(q')$ implies $q = q'$.

When the language and the DFA are fixed, we drop the subscripts and write $\mathcal{R}(w)$ and $\mathcal{R}(q)$.

Note that every state in an mDFA is accessible, i.e., for each state $q \in Q$ there is an $x \in A^*$ such that $q_0 x = q$. A state $q$ is co-accessible if $\mathcal{R}(q) \neq \emptyset$. In an mDFA there is at most one state that is not co-accessible: for each $q \in Q$ there is $u \in A^*$ such that $qu \in T$ iff $\mathcal{R}(q) \neq \emptyset$, so all states that are not co-accessible share the empty right context and hence coincide. If such a state in $\hat{\mathcal{A}}$ exists, we call it zero and denote it with $z$. A trimmed mDFA for a language $L$ is the DFA obtained from the mDFA for $L$ by erasing the state $z$ and all transitions that terminate in $z$. The trimmed mDFA is denoted $\mathrm{trim}\,\hat{\mathcal{A}}$.

More generally, a trimmed DFA $\mathcal{A}$ is an automaton in which all states are both accessible and co-accessible.
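
Trimming can be sketched as two graph searches: one forward from the initial states and one backward from the terminal states. The code below is a minimal illustration under assumptions of our own: transitions are a dictionary from (state, letter) to state, and the helper names `reachable` and `trim` as well as the example automaton are hypothetical.

```python
from collections import deque


def reachable(start, edges):
    """Breadth-first search over an adjacency map: state -> set of states."""
    seen = set(start)
    queue = deque(start)
    while queue:
        q = queue.popleft()
        for r in edges.get(q, ()):
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return seen


def trim(delta, initial, terminal):
    """Keep only states that are both accessible and co-accessible."""
    forward, backward = {}, {}
    for (q, _a), r in delta.items():
        forward.setdefault(q, set()).add(r)
        backward.setdefault(r, set()).add(q)
    keep = reachable(initial, forward) & reachable(terminal, backward)
    kept_delta = {(q, a): r for (q, a), r in delta.items()
                  if q in keep and r in keep}
    return kept_delta, keep


# Hypothetical complete DFA for (ab)* with a zero (sink) state "z".
full = {("p", "a"): "r", ("p", "b"): "z", ("r", "a"): "z",
        ("r", "b"): "p", ("z", "a"): "z", ("z", "b"): "z"}
trimmed, states = trim(full, {"p"}, {"p"})
print(sorted(states))
```

For an mDFA this removes exactly the zero state and the transitions into it, since every state of an mDFA is accessible.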

Finally, for a finite set S, by #S, we denote the cardinality of the set S.

Definition 1

Given a DFA $\mathcal{A}$ and a state $q$ of the automaton, the set of follower words for $q$ relative to $\mathcal{A}$ is the set $\mathcal{F}_{\mathcal{A}}(q) = \{x \mid qx \neq \emptyset\}$.

For states $q$ and $q'$ of $\mathcal{A}$, we say that they are follower-equivalent if $\mathcal{F}_{\mathcal{A}}(q) = \mathcal{F}_{\mathcal{A}}(q')$. For a state $q$, the set of states in $\mathcal{A}$ that are follower-equivalent to $q$ is denoted $\mu_q(\mathcal{A})$.

A state $q$ of $\mathcal{A}$ is said to be minimal-follower with respect to $\mathcal{A}$ if, whenever $\mathcal{F}_{\mathcal{A}}(q') \subseteq \mathcal{F}_{\mathcal{A}}(q)$ for a state $q'$ of $\mathcal{A}$, the states $q$ and $q'$ are follower-equivalent.
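
Follower sets are usually infinite, so follower-equivalence cannot be tested by listing words. One possible test, sketched here under our own assumptions (dictionary-based transitions, hypothetical function name `follower_equivalent`), explores pairs of states reached by the same word and reports a difference as soon as a letter is defined from one state of a pair but not from the other.

```python
def follower_equivalent(delta, alphabet, q, q2):
    """Decide whether q and q2 have the same set of follower words."""
    seen = {(q, q2)}
    stack = [(q, q2)]
    while stack:
        p, p2 = stack.pop()
        for a in alphabet:
            r, r2 = delta.get((p, a)), delta.get((p2, a))
            if (r is None) != (r2 is None):
                return False  # some word is a follower of only one state
            if r is not None and (r, r2) not in seen:
                seen.add((r, r2))
                stack.append((r, r2))
    return True


# Hypothetical examples: a 3-cycle on the letter a, and the automaton for (ab)*.
cycle = {("q1", "a"): "q2", ("q2", "a"): "q3", ("q3", "a"): "q1"}
ab = {("p", "a"): "r", ("r", "b"): "p"}
print(follower_equivalent(cycle, "a", "q1", "q3"))
print(follower_equivalent(ab, "ab", "p", "r"))
```

The search is correct because follower sets are closed under prefixes: a shortest word that separates two states ends with a letter that is defined from exactly one state of a pair visited by the search.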

Recall the definition of a constant of a language L introduced by Schützenberger in [20].

Definition 2

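A word $w \in A^+$ is a constant of a language $L$ if $w$ is a factor of some word in $L$ and for all words $u_1, u_2, v_1, v_2$ in $A^*$ we have:

$$u_1 w u_2 \in L \ \text{ and } \ v_1 w v_2 \in L \quad \Longrightarrow \quad u_1 w v_2 \in L \ \text{ and } \ v_1 w u_2 \in L.$$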

A characterization of constants, which is more or less folklore, is stated below.

Proposition 1

Let $L \subseteq A^*$ be a regular language and let $\hat{\mathcal{A}}$ be the mDFA recognizing $L$. A word $w \in A^+$ is a constant of $L$ if and only if $Qw \setminus \{z\}$ is a singleton, i.e., there is a unique non-zero state $q_w$ such that $qw \neq z$ implies $qw = q_w$ for all $q \in Q$.
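
Proposition 1 gives a direct test for constants once the mDFA is available. The sketch below is illustrative only: it assumes the caller already supplies the complete mDFA as a transition dictionary (the code does not minimize anything), passes the zero state explicitly or `None` if there is none, and the names `image` and `is_constant` are hypothetical.

```python
def image(delta, states, word):
    """Image of a set of states under the action of a word."""
    current = set(states)
    for a in word:
        current = {delta[(q, a)] for q in current if (q, a) in delta}
    return current


def is_constant(delta, states, zero, word):
    """Test of Proposition 1: the image minus the zero state is a singleton."""
    return len(word) > 0 and len(image(delta, states, word) - {zero}) == 1


# mDFA of (ab)* with zero state "z", and mDFA of (aa)* (no zero state).
ab = {("p", "a"): "r", ("p", "b"): "z", ("r", "a"): "z",
      ("r", "b"): "p", ("z", "a"): "z", ("z", "b"): "z"}
even = {("e", "a"): "o", ("o", "a"): "e"}
print(is_constant(ab, {"p", "r", "z"}, "z", "a"))
print(is_constant(even, {"e", "o"}, None, "a"))
```

In the second automaton every word permutes the two states, so no word passes the test, in line with the fact that (aa)* has no constant.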

Suppose $w$ is the label of a path in a finite state automaton. If there is a state $q_w$ such that every path in the automaton with label $w$ terminates in $q_w$, we say that $w$ is a synchronizing word and that $q_w$ is a synchronizing state, synchronized by $w$. By Proposition 1, in the trimmed mDFA $\mathrm{trim}\,\hat{\mathcal{A}}$ of a regular language $L$, the set of synchronizing words for $\mathrm{trim}\,\hat{\mathcal{A}}$ coincides with the set of constants of $L$. In general, if $w$ is a synchronizing word for an automaton $\mathcal{A}$ then it is a constant for the language recognized by $\mathcal{A}$.

The context of $w$ with respect to $L$ is the set $C_L(w) = \{(u, v) \mid u, v \in A^*, uwv \in L\}$. We define the left projection of the context of $w$ as the set $C_L^{\ell}(w) = \{u \mid (u, v) \in C_L(w)\}$ and the right projection as the set $C_L^{r}(w) = \{v \mid (u, v) \in C_L(w)\}$. A constant $w$ of $L$ defines a constant language $\mathrm{Const}(w)$ with respect to the language $L$ as the set $\mathrm{Const}(w) = C_L^{\ell}(w)\, w\, C_L^{r}(w)$. Given two constants $w_1$ and $w_2$ of $L$, a split language for $w_1$ and $w_2$ with respect to $L$ is a language $\mathrm{Split}(w_1, w_2) = C_L^{\ell}(w_1)\, w_1' w_2'\, C_L^{r}(w_2)$ where $w_1'$ is a prefix (possibly empty) of $w_1$ and $w_2'$ is a suffix (possibly empty) of $w_2$.

3. Transitive components and synchronizing words

In this section we provide structural characterizations of transitive components in a minimal DFA using the notion of synchronizing words. We define the notions of a transitive automaton and of a path-automaton, and give properties that are used to prove the main result of the paper.

We first introduce definitions and properties that are used in the rest of the paper.

Recall the notion of a transitive component in a deterministic automaton. A strongly connected component of the directed graph of a deterministic automaton $\mathcal{A}$ is called a transitive component of $\mathcal{A}$. If every edge that starts at a state of a transitive component also ends in the same component, then the transitive component is called terminal. For every state in the mDFA of a language $L$, there is a path that leads from that state to a terminal component. For a transitive component $\mathcal{C}$, we say that $\mathcal{C}$ is induced by $q$ if $q$ is a state in $\mathcal{C}$. We write $L(\mathcal{C})$ for the set of labels of all paths in $\mathcal{C}$ and say that $\mathcal{C}$ recognizes $L(\mathcal{C})$. A transitive component $\mathcal{C}$ is called trivial if $L(\mathcal{C}) = \{1\}$.
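
Transitive components and terminal components can be computed with elementary graph searches. The following sketch is an illustration under our own assumptions: transitions are a dictionary from (state, letter) to state, components are found by mutual reachability rather than by a linear-time algorithm, and the function names and the example automaton are hypothetical.

```python
def reach(delta, q):
    """States reachable from q, including q itself."""
    seen, stack = {q}, [q]
    while stack:
        s = stack.pop()
        for (p, _a), r in delta.items():
            if p == s and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def transitive_components(delta, states):
    """Group states that are mutually reachable (strongly connected)."""
    r = {q: reach(delta, q) for q in states}
    comps = []
    for q in states:
        comp = frozenset(p for p in r[q] if q in r[p])
        if comp not in comps:
            comps.append(comp)
    return comps


def is_terminal(delta, comp):
    """A component is terminal if no transition leaves it."""
    return all(r in comp for (p, _a), r in delta.items() if p in comp)


# Hypothetical automaton: a 2-cycle {p, r} with one exit into a loop on t.
delta = {("p", "a"): "r", ("r", "b"): "p", ("r", "c"): "t", ("t", "c"): "t"}
comps = transitive_components(delta, ["p", "r", "t"])
print([sorted(c) for c in comps if is_terminal(delta, c)])
```

A state with no loop and no cycle through it forms a component of its own, which corresponds to a trivial component in the sense above.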

A language $L$ is said to be transitive if for every pair of words $u, v \in L$ there is a word $w \in A^*$ such that $uwv \in L$. Note that for a transitive component $\mathcal{C}$ the language $L = L(\mathcal{C})$ is transitive.

Remark 1

Notice that if $\mathcal{C}$ is a transitive component, then $L(\mathcal{C})$ is factor-closed, i.e., $F(L(\mathcal{C})) = L(\mathcal{C})$.

Two transitive components $\mathcal{C}$ and $\mathcal{C}'$ are called factor-equivalent if $L(\mathcal{C}) = L(\mathcal{C}')$. In the following we often use the term component to denote a transitive component.

A component $\mathcal{C}$ is said to be maximal for a collection of components $C$ if for every transitive component $\mathcal{C}'$ in $C$, we have that $L(\mathcal{C}) \subseteq L(\mathcal{C}')$ implies $L(\mathcal{C}) = L(\mathcal{C}')$. Analogously, a transitive component $\mathcal{C}$ is called minimal for a collection $C$ if, for every transitive component $\mathcal{C}'$ in $C$, whenever $L(\mathcal{C}') \subseteq L(\mathcal{C})$ we have $L(\mathcal{C}') = L(\mathcal{C})$.

3.1. Transitive automata

In this section we relate the notion of a synchronizing word to properties of a transitive automaton. An automaton is called transitive if it consists of only one transitive component.

Remark 2

Note that if $\mathcal{A}$ is transitive, then $L(\mathcal{A})$ is also transitive. Consider two words $u, v \in L(\mathcal{A})$. There are initial states $q_0$ and $q_0'$ such that $q_0 u, q_0' v \in T$. Since $\mathcal{A}$ is transitive, there is a word $w$ that is a label of a path from $q_0 u$ to $q_0'$ in $\mathcal{A}$, so $uwv \in L(\mathcal{A})$.

Example 3.1

Consider the example shown in Fig. 1. This language is transitive, the automaton is reduced and deterministic, hence it is the mDFA for the language. However, there is no deterministic transitive automaton that recognizes this language. Notice that this language has no constants.

Fig. 1. A transitive language that does not have a transitive deterministic automaton. The initial state is indicated with an arrow and the terminal states are shaded.

Remark 3

If $L$ is a transitive language such that $L = L(\mathcal{C})$ for a transitive component $\mathcal{C}$, then for each state $q$ in $\mathcal{C}$, $\mathcal{R}_{\mathcal{C}}(q) = \mathcal{F}_{\mathcal{C}}(q)$ since all states in $\mathcal{C}$ are terminal.

We consider several observations about transitive automata, transitive components and languages. The following observations are proved in [14] (see also [16]):

Lemma 2

For every regular factor-closed transitive language $L$ there is a unique minimal deterministic transitive automaton $\hat{\mathcal{C}}$ recognizing $L$.

Lemma 3

For a regular factor-closed transitive language $L$ and its unique minimal deterministic transitive automaton $\hat{\mathcal{C}}$ the following properties hold:

  1. Every state in $\hat{\mathcal{C}}$ is synchronizing.

  2. A word $w \in L$ is a constant for $L$ if and only if $w$ is synchronizing for $\hat{\mathcal{C}}$.

  3. No two states $\hat{q}$ and $\hat{p}$ in $\hat{\mathcal{C}}$ with $\hat{q} \neq \hat{p}$ are follower-equivalent.

  4. For every transitive DFA $\mathcal{C}$ with $L(\mathcal{C}) = L$ there is an onto homomorphism $\phi : \mathcal{C} \to \hat{\mathcal{C}}$ such that for every state $\hat{q}$ in $\hat{\mathcal{C}}$, $\mathcal{F}_{\mathcal{C}}(q) = \mathcal{F}_{\hat{\mathcal{C}}}(\hat{q})$ for each $q \in \phi^{-1}(\hat{q})$.

Observe that if a state $q$ of a transitive component $\mathcal{C}$ is synchronizing, then all states in $\mathcal{C}$ are synchronizing.

Consider the action of $A^*$ on the set of states $Q_{\mathcal{C}}$ of $\mathcal{C}$. In order to simplify the notation, the action of $w$ on the set $Q_{\mathcal{C}}$ is denoted $\mathcal{C}w$ instead of $Q_{\mathcal{C}}w$, and moreover we say that $q$ is a state of $\mathcal{C}$ if $q \in Q_{\mathcal{C}}$.

Remark 4

If $c$ is a constant of $L(\mathcal{C})$ for a transitive automaton $\mathcal{C}$, and $\hat{\mathcal{C}}$ is the minimal transitive deterministic automaton for $L(\mathcal{C})$ such that $c$ synchronizes onto $\hat{q}$, then by Remark 3 and Lemma 3 (items 2 to 4) every state $q$ in $\mathcal{C}c$ maps with $\phi$ onto $\hat{q}$, and has the same follower set as $\hat{q}$. We say that $q$ is follower-equivalent to $\hat{q}$. In particular, if $c$ is a constant such that $qc = q$ in $\mathcal{C}$, then $\hat{q}c = \hat{q}$ in $\hat{\mathcal{C}}$, and for every $q' \in \mathcal{C}c$, the state $q'c$ is in $\mathcal{C}c$ and is follower-equivalent to $\hat{q}$, and to $q$.

Remark 5

If $q$ is a state in a transitive automaton $\mathcal{C}$ and $\hat{\mathcal{C}}$ is the minimal transitive deterministic automaton as in Lemma 3 with $\hat{q}$ follower-equivalent to $q$, then for all $q' \in \mu_q(\mathcal{C})$ there are constants $c, c' \in L(\mathcal{C})$ such that $\hat{q}c = \hat{q}c' = \hat{q}$, $qc = q$ and $qc' = q'$. Take any constant $c_1 \in L(\mathcal{C})$ such that $\hat{q}c_1 = \hat{q}$. Then by transitivity there are $x, y$ such that $qc_1x = q$ and $qc_1y = q'$. By Remark 4, $c = c_1x$ and $c' = c_1y$ are the constants sought.

If $\mathcal{C}$ is a transitive deterministic automaton without a synchronizing word, then there is some $k \geq 2$ such that $\#\mathcal{C}w \geq k$ for every word $w \in L(\mathcal{C})$. We call the largest such $k$, i.e., the minimum of $\#\mathcal{C}w$ over all words $w \in L(\mathcal{C})$, the degree of $\mathcal{C}$. A word $w$ such that $\#\mathcal{C}w = k$ is called $k$-synchronizing. Therefore, the minimal transitive DFA $\hat{\mathcal{C}}$ for $L(\mathcal{C})$ has degree 1, and all constants coincide with (1-)synchronizing words. It follows from Lemmas 2 and 3 that if $\mathcal{C}$ has degree $k$ and $w$ is a constant that is $k$-synchronizing, then all states in $\mathcal{C}w$ are follower-equivalent. Moreover, in that case for all $x \in A^*$ with $\mathcal{C}wx \neq \emptyset$, $wx$ is also $k$-synchronizing.

The following lemma relates the right contexts of states in a transitive automaton reached by reading a word w that is k-synchronizing.

Lemma 4

Let $\mathcal{C} = (Q, A, Q, Q, \delta)$ be a transitive DFA with degree $k \geq 2$ and let $\mathcal{A} = (Q, A, I, T, \delta)$ be a reduced DFA obtained from $\mathcal{C}$ by choosing a subset of states $I$ as initial states, and some proper subset $T$ of $Q$ as terminal states. If $w$ is $k$-synchronizing and $q, q' \in \mathcal{C}w$ with $q \neq q'$, then $\mathcal{R}_{\mathcal{A}}(q) \setminus \mathcal{R}_{\mathcal{A}}(q') \neq \emptyset$.

Proof

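Let $k \geq 2$ be the degree of $\mathcal{C}$, $w$ a $k$-synchronizing word, and $q_1, q_1' \in \mathcal{C}w$ two distinct states. Since $w$ is $k$-synchronizing, for all words $z \in A^*$, either both $q_1z$, $q_1'z$ are undefined or $q_1z \neq q_1'z$. In the rest of the proof we drop the subscript $\mathcal{A}$ from $\mathcal{R}_{\mathcal{A}}$. Suppose $\mathcal{R}(q_1) \subseteq \mathcal{R}(q_1')$. Since $\mathcal{A}$ is reduced, there is a word $x_1$ such that $q_1x_1 \notin T$ and $q_1'x_1 \in T$. Set $q_2 = q_1x_1$ and $q_2' = q_1'x_1$. Then $\mathcal{R}(q_2) \subseteq \mathcal{R}(q_2')$ because otherwise, if $y \in \mathcal{R}(q_2) \setminus \mathcal{R}(q_2')$ then $x_1y \in \mathcal{R}(q_1) \setminus \mathcal{R}(q_1')$, which contradicts our assumption that $\mathcal{R}(q_1) \subseteq \mathcal{R}(q_1')$. We have that $q_2 \neq q_2'$. Since $\mathcal{C}$ is transitive, there is $x_2$ such that $q_2x_2 = q_2'$. Denote $q_2'$ with $q_3$, and set $q_3' = q_2'x_2$. Again, similarly as with $q_2$ and $q_2'$, we have that $\mathcal{R}(q_3) \subseteq \mathcal{R}(q_3')$, which implies that both $q_2' = q_3$ and $q_3'$ are in $T$. In fact, $\mathcal{R}(q_2) \subseteq \mathcal{R}(q_2') = \mathcal{R}(q_3) \subseteq \mathcal{R}(q_3')$. We continue in this way and consider the pairs of states $(q_2, q_2'), (q_3, q_3'), \ldots, (q_i, q_i'), \ldots$ where $q_{i-1}' = q_i$. Since $\mathcal{C}$ is finite, there are $i$ and $j$ such that $(q_i, q_i') = (q_j, q_j')$ for some $i < j$. But $\mathcal{R}(q_i) \subseteq \mathcal{R}(q_i') = \mathcal{R}(q_{i+1}) \subseteq \mathcal{R}(q_{i+1}') \subseteq \cdots \subseteq \mathcal{R}(q_j) = \mathcal{R}(q_i)$, so $\mathcal{R}(q_i) = \mathcal{R}(q_i')$. Because $q_i \neq q_i'$, this contradicts the assumption that $\mathcal{A}$ is reduced. Therefore, $\mathcal{R}(q_1) \setminus \mathcal{R}(q_1') \neq \emptyset$.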

Example 3.2

Consider the reduced automaton in Fig. 2(a). It contains two terminal components that are mutually factor-equivalent, both recognizing $a^*$. Moreover, the states $q_1$, $q_2$ and $q_3$ in Fig. 2(a) are follower-equivalent, and the component that contains these three states has degree 3. Every word $a^k$ for $k \geq 0$ is 3-synchronizing. Consider the automaton $\mathcal{A}$ that consists of $\{q_1, q_2, q_3\}$ with $q_1$ being the initial and also the terminal state. Then $\mathcal{R}_{\mathcal{A}}(q_1) = (a^3)^*$, $\mathcal{R}_{\mathcal{A}}(q_2) = aa(a^3)^*$ and $\mathcal{R}_{\mathcal{A}}(q_3) = a(a^3)^*$.
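
The degree of a transitive automaton can be computed by searching over the images of the full state set under words. The sketch below is illustrative and rests on assumptions of our own: the figure is not reproduced here, so the three-state cycle (the first state goes to the second, the second to the third, the third back to the first, all on the letter a) is inferred from the right contexts listed in this example, and the function name `degree` is hypothetical.

```python
from collections import deque


def degree(delta, states, alphabet):
    """Smallest size of a nonempty image of the state set under a word."""
    start = frozenset(states)
    seen = {start}
    queue = deque([start])
    best = len(start)
    while queue:
        current = queue.popleft()
        best = min(best, len(current))
        for a in alphabet:
            nxt = frozenset(delta[(q, a)] for q in current if (q, a) in delta)
            if nxt and nxt not in seen:  # skip words that label no path
                seen.add(nxt)
                queue.append(nxt)
    return best


# Assumed shape of the three-state component of this example.
cycle = {("q1", "a"): "q2", ("q2", "a"): "q3", ("q3", "a"): "q1"}
# Hypothetical transitive automaton with a synchronizing word.
sync = {("p", "a"): "p", ("p", "b"): "r", ("r", "a"): "p"}
print(degree(cycle, {"q1", "q2", "q3"}, "a"))
print(degree(sync, {"p", "r"}, "ab"))
```

A result of 1 means that the automaton has a synchronizing word. The search may visit exponentially many subsets in the worst case, so it is only meant for small examples.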

Fig. 2. Two automata. Initial states are indicated with an arrow pointing to them and the terminal states are shaded.

Example 3.3

The transitive component $\mathcal{C}$ of the automaton in Fig. 2(b) has degree 2. All words that end with symbol $a$ are labels of paths that end in states $q_2$ and $q_3$, and all words that end with symbol $b$ are labels of paths that end in states $q_1$ and $q_2$. The action of $c$ on the states of $\mathcal{C}$ is the identity. Hence every word that contains symbols $a$ or $b$ is 2-synchronizing. Note that all states are follower-equivalent but $\mathcal{C}$ is reduced. Moreover, $a \in \mathcal{R}(q_1) \setminus \mathcal{R}(q_2)$ and $aa \in \mathcal{R}(q_2) \setminus \mathcal{R}(q_1)$, also $aa \in \mathcal{R}(q_2) \setminus \mathcal{R}(q_3)$ and $a \in \mathcal{R}(q_3) \setminus \mathcal{R}(q_2)$. However, $c \in \mathcal{R}(q_1) \setminus \mathcal{R}(q_3)$, but $\mathcal{R}(q_3) \subset \mathcal{R}(q_1)$. This last condition does not violate Lemma 4 because there are no 2-synchronizing words that label paths ending at states $q_1$ and $q_3$.

3.2. Path-automata

In this section we provide structural characterizations of path-automata that do not have synchronizing words. More precisely, we show that a path-automaton having no synchronizing words has a unique maximal component, which is the terminal one, whose language contains all factors of the language accepted by the path-automaton.

Definition 3 (Path-automaton)

An automaton $\mathcal{A}$ with an initial state $q_0$ is called a path-automaton if the following is satisfied:

  1. There is at most one transition in $\mathcal{A}$ which starts at the component induced by $q_0$ and terminates in another component.

  2. There is only one terminal transitive component in $\mathcal{A}$.

  3. For every transitive component $\mathcal{C}$ which does not contain $q_0$ there is precisely one transition that starts in a state outside $\mathcal{C}$ but terminates in $\mathcal{C}$, and if $\mathcal{C}$ is not terminal, there is precisely one transition that starts at a state in $\mathcal{C}$ but terminates in a state outside $\mathcal{C}$.

Let $\mathcal{A}$ be a path-automaton and $\mathcal{C}$ one of its transitive components. The state of $\mathcal{C}$ that is the end point of the transition starting outside $\mathcal{C}$ but ending in $\mathcal{C}$ is called the entrance state for $\mathcal{C}$, and the state that is the start point of the transition that starts in $\mathcal{C}$ but terminates outside $\mathcal{C}$ is called the exit of $\mathcal{C}$. The initial component of $\mathcal{A}$ has no entrance, and the terminal component has no exit.
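
The three conditions of Definition 3 can be checked mechanically by counting the transitions that cross between transitive components. The sketch below is an illustration under our own assumptions: transitions are a dictionary from (state, letter) to state, components are computed by mutual reachability, and the names `reach` and `is_path_automaton` and the two example automata are hypothetical.

```python
from collections import Counter


def reach(delta, q):
    """States reachable from q, including q itself."""
    seen, stack = {q}, [q]
    while stack:
        s = stack.pop()
        for (p, _a), r in delta.items():
            if p == s and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def is_path_automaton(delta, states, q0):
    """Check conditions 1 to 3 of the definition of a path-automaton."""
    r = {q: reach(delta, q) for q in states}
    comp_of = {q: frozenset(p for p in r[q] if q in r[p]) for q in states}
    comps = set(comp_of.values())
    crossing = [(p, t) for (p, _a), t in delta.items()
                if comp_of[p] != comp_of[t]]
    outgoing = Counter(comp_of[p] for p, _t in crossing)
    incoming = Counter(comp_of[t] for _p, t in crossing)
    terminal = [c for c in comps if outgoing[c] == 0]
    if len(terminal) != 1:  # condition 2: a single terminal component
        return False
    for c in comps:
        if c == comp_of[q0]:
            if outgoing[c] > 1:  # condition 1: at most one exit transition
                return False
        else:
            if incoming[c] != 1:  # condition 3: exactly one entrance
                return False
            if c != terminal[0] and outgoing[c] != 1:  # exactly one exit
                return False
    return True


chain = {("p", "a"): "r", ("r", "b"): "p", ("r", "c"): "t", ("t", "c"): "t"}
branching = dict(chain)
branching[("p", "c")] = "t"  # a second transition leaving the first component
print(is_path_automaton(chain, {"p", "r", "t"}, "p"))
print(is_path_automaton(branching, {"p", "r", "t"}, "p"))
```

In the first automaton the only crossing transition leads from `r` to `t`, so `r` is the exit of the initial component and `t` is the entrance of the terminal one.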

A path $\pi$ from an initial state in an automaton $\mathcal{A}$ to a terminal component in $\mathcal{A}$ induces a path-automaton $\mathcal{A}_\pi$ which consists of all transitive components in $\mathcal{A}$ induced by states visited by $\pi$.

Let $\mathcal{C}$ be a terminal component of the path-automaton $\mathcal{A}_\pi$, and let $q$ be the entrance of $\mathcal{C}$. We define the language accepted by the component $\mathcal{C}$ induced by the path $\pi$, denoted by $L_\pi(\mathcal{C})$, as the language accepted by the automaton $\mathcal{C}$ with initial state $q$.

Lemma 5

Every trimmed deterministic path-automaton with two transitive components, and whose terminal component is trivial, has a synchronizing word.

Proof

Let $q_0$ be an initial state for a path-automaton $\mathcal{A}$ with two transitive components having a trivial terminal component. Let $x$ be the label of the edge starting from the initial component (say at state $s$) and ending at the terminal component (say at state $t$). Let $\mathcal{C}$ be the initial component of $\mathcal{A}$, and let $k$ be the degree of $\mathcal{C}$. Let $w$ be $k$-synchronizing for $\mathcal{C}$. Because $\mathcal{C}$ is transitive, we can extend $w$ such that there is a path with label $w$ that ends at $s$, i.e., $\#\mathcal{C}w = k$ and $s \in \mathcal{C}w$. Since $k$ is the degree of $\mathcal{C}$, for all $z \in A^*$, either $wz \notin L(\mathcal{C})$ or $\mathcal{C}wz \subseteq \mathcal{C}$ with $\#\mathcal{C}wz = k$. As $\mathcal{A}$ is deterministic, there is only one edge starting at $s$ with label $x$ and this edge leads outside $\mathcal{C}$. Therefore $wx \notin L(\mathcal{C})$, but $wx \in F(L(\mathcal{A}))$. Hence $wx$ is a synchronizing word that synchronizes onto $t$.

We first give a technical lemma that is used later.

Lemma 6

Given a deterministic path-automaton $\mathcal{A}$, let $\mathcal{C}$ be the terminal component of $\mathcal{A}$. If $\mathcal{A}$ has no synchronizing word, then $\mathcal{C}$ is the unique maximal transitive component in $\mathcal{A}$.

Proof

Assume $\mathcal{A}$ is an automaton with transition function $\delta$ that has no synchronizing words. Let $\mathcal{C}_1, \ldots, \mathcal{C}_k = \mathcal{C}$ be the transitive components of $\mathcal{A}$. Let $q_i^{in}$ and $q_i^{out}$ be the states in the component $\mathcal{C}_i$ ($i = 2, \ldots, k-1$) such that $q_i^{in}$ is the entrance state of $\mathcal{C}_i$ and $q_i^{out}$ is the exit state of $\mathcal{C}_i$. For $i = 1$ we only have $q_1^{out}$ and for $i = k$ we only have $q_k^{in}$. We set $q_1^{in} = q_0$ for the initial state $q_0$, and $q_k^{out} = q'$ for a fixed terminal state $q'$. For $i = 1, \ldots, k-1$, let $x_i$ be the label of the transition from state $q_i^{out}$ to state $q_{i+1}^{in}$, i.e., $q_i^{out} x_i = q_{i+1}^{in}$. Consider $L(\mathcal{C}_1), \ldots, L(\mathcal{C}_k)$. Because these languages are all transitive, there is a maximal transitive language among them. Assume $L_1, \ldots, L_s$ are all the distinct maximal transitive languages, so that for each $j = 1, \ldots, s$, there is a transitive component $\mathcal{C}_i$ with $L_j = L(\mathcal{C}_i)$. Then for each $i = 1, \ldots, s$, there are words $w_i \in L_i$ such that $w_i \notin L_j$ if $i \neq j$ (as $L_i$ is maximal, for each $j \neq i$ there is $w_{ij} \in L_i \setminus L_j$, and due to the transitivity of $L_i$, there are $z_{ij}$ such that $w_i = w_{i1} z_{i1} w_{i2} \cdots z_{i,s-1} w_{i,s-1}$). Note that for each language $L_i$ there might be several transitive components that recognize it.

We consider words $y_i$ ($i = 1, \ldots, k$) such that $y_i$ is a label of a path in $\mathcal{C}_i$ from $q_i^{in}$ to $q_i^{out}$ in the following way. (i) If $L(\mathcal{C}_i) = L_j$ is a maximal transitive language, then $w_j$ is a factor of $y_i$, and $y_i$ is a constant for $L(\mathcal{C}_i)$ which uniquely determines the follower-equivalence class of $q_i^{out}$, meaning $\mathcal{R}_{\mathcal{C}_i}(q_i^{out}) = \mathcal{R}_{L(\mathcal{C}_i)}(y_i)$. This is always possible by Lemma 3 and the transitivity of $L(\mathcal{C}_i)$. (ii) If $L(\mathcal{C}_i)$ is not a maximal transitive language, then $y_i$ is a label of the shortest path between $q_i^{in}$ and $q_i^{out}$.

Consider the word $y_1 x_1 \cdots y_{k-1} x_{k-1} y_k$. Let $p$ be the smallest index in $1, \ldots, k$ such that $L(\mathcal{C}_p)$ is maximal transitive and $r$ be the largest index such that $L(\mathcal{C}_r)$ is maximal transitive. Then $u = y_p x_p y_{p+1} \cdots x_{r-1} y_r$ is a word that starts at a maximal transitive component, visits all maximal transitive components, and terminates at the last maximal transitive component. Since $\mathcal{A}$ has no synchronizing words, there must be at least one more path in $\mathcal{A}$ with label $u$. But, by the choice of $y_i$ and Lemma 5, every path with label $y_p x_p$ must start in a transitive component recognizing $L(\mathcal{C}_p)$ and must have a transition with label $x_p$ leading outside the component, because $y_p x_p \notin L(\mathcal{C}_p)$. Let $i_1, i_2, \ldots, i_\nu$ be all indexes between $p$ and $r$ such that $i_1 = p$, $i_\nu = r$ and $L(\mathcal{C}_{i_j})$ is a maximal transitive language. By the choice of the $y_i$'s, $y_{i_1} = y_p, y_{i_2}, \ldots, y_{i_\nu} = y_r$ uniquely determine the languages of the transitive components $\mathcal{C}_{i_1}, \ldots, \mathcal{C}_{i_\nu}$, that is, $y_{i_j} \in L(\mathcal{C}_{i_j})$ but $y_{i_j} \notin L(\mathcal{C}_{i_t})$ if $j \neq t$. Therefore, there is a one-to-one correspondence between the order of appearance of $y_p, y_{p+1}, \ldots, y_r$ in $u$ and the order of the maximal transitive components. Hence, the only possibility for the existence of another path with label $u$ is if such a path also starts at $\mathcal{C}_p$. Although there might be many paths with label $y_p$ in $\mathcal{C}_p$, by Lemma 3 they all end at follower-equivalent states, and due to determinism, there is at most one of those states that is the start of a transition with label $x_p$, and that is $q_p^{out}$ (by Lemma 5, $y_p x_p$ is synchronizing for $\mathcal{C}_p \cup \{q_{p+1}^{in}\}$). Hence, $u$ (or $u x_r$) is a synchronizing word unless $p = r = k$, i.e., $x_p$ does not exist. As we assumed that there are no synchronizing words for $\mathcal{A}$, there is at most one maximal transitive language and it must be recognized by the terminal transitive component.

The following proposition characterizes path-automata with no synchronizing words.

Proposition 7

Given a deterministic path-automaton $\mathcal{A}$, let $\mathcal{C}$ be the terminal component of $\mathcal{A}$. Then one of the following holds:

  1. $\mathcal{A}$ has a synchronizing word, or,

  2. $F(L(\mathcal{A})) = L(\mathcal{C})$.

Proof

We prove the proposition by induction on the number of transitive components in the path-automaton. If $\mathcal{A}$ consists of a single component, then the proposition holds trivially as $\mathcal{A} = \mathcal{C}$. Now assume that the proposition holds for all path-automata with fewer than $k$ transitive components, and suppose that $\mathcal{A}$ has $k$ transitive components $\mathcal{C}_1, \ldots, \mathcal{C}_k$ with $\mathcal{C}_1$ being initial and $\mathcal{C}_k = \mathcal{C}$ terminal. Denote with $q_i^{in}$ the entrance of $\mathcal{C}_i$ and with $q_i^{out}$ the exit of $\mathcal{C}_i$. Consider the path-automaton $\mathcal{A}'$ with initial state $q_2^{in}$ and transitive components $\mathcal{C}_2, \ldots, \mathcal{C}_k$. As this path-automaton has $k - 1$ components, by the inductive hypothesis, either $\mathcal{A}'$ has a synchronizing word, or $F(L(\mathcal{A}')) = L(\mathcal{C})$. Note that $L(\mathcal{C}) \subseteq F(L(\mathcal{A}))$ holds trivially, so we only consider the converse inclusion.

Case 1

The path-automaton $\mathcal{A}'$ has a synchronizing word. Let $y$ be the synchronizing word for the automaton $\mathcal{B}$ which consists of $\mathcal{C}_1 \cup \{q_2^{in}\}$ with trivial terminal component consisting of $q_2^{in}$. By Lemma 5, $y$ exists, and we can assume that $y$ synchronizes onto $q_2^{in}$. Let $w$ be synchronizing for $\mathcal{A}'$, and since for every state $q$ in $\mathcal{A}'$ there is a path from $q_2^{in}$ to $q$, we can assume that $q_2^{in}w \neq \emptyset$. We observe that $yw$ is synchronizing for $\mathcal{A}$. There is no path in $\mathcal{C}_1$ with label $y$, since $y$ is synchronizing for $q_2^{in}$, hence every path in $\mathcal{A}$ that has label $y$ terminates in a state in $\mathcal{A}'$. Since $w$ is synchronizing for $\mathcal{A}'$, every path in $\mathcal{A}$ with label $yw$ terminates in a single state. Thus $yw$ is synchronizing for $\mathcal{A}$ and the first alternative of the proposition is satisfied.

Case 2

The path automaton Inline graphic has no synchronizing word. Then by the inductive hypothesis, F(L( Inline graphic)) ⊆ L( Inline graphic). Assume that Inline graphic has no synchronizing word. We show that all words in F(L( Inline graphic)) appear as labels of paths in Inline graphic = Inline graphic. As in Case 1, consider Inline graphic which consists of $\mathcal{C}_1 \cup \{q_2^{\mathrm{in}}\}$ with trivial terminal component consisting of $q_2^{\mathrm{in}}$. Let w be a label of a path in Inline graphic. If there is a path in Inline graphic with label w, then w ∈ L( Inline graphic).

Assume now that all paths with label w start in Inline graphic. If all paths with label w also end at Inline graphic then, by Lemma 5, w is a factor of a word y that synchronizes onto $q_2^{\mathrm{in}}$ of Inline graphic, and hence y is synchronizing for Inline graphic, and the proposition holds.

Suppose there is a path in Inline graphic with label w that starts at Inline graphic and terminates in Inline graphic. We observe that in this case also, Inline graphic has a synchronizing word. Let u be the shortest word such that w = uxv where x is a symbol, $q_1^{\mathrm{out}}x = q_2^{\mathrm{in}}$, $q_2^{\mathrm{in}}v \neq \emptyset$ and $q_2^{\mathrm{in}} \in \mathcal{C}_1ux$. Let c ∈ L( Inline graphic) be a constant for L( Inline graphic) that fixes the follower-equivalence class of $q_1^{\mathrm{out}}$, meaning $R_{L(\mathcal{C}_1)}(c) = R_{\mathcal{C}_1}(q_1^{\mathrm{out}})$. Such c exists by Lemma 2 and Lemma 3. By transitivity of Inline graphic, there is a word c′ such that cc′u also fixes the follower-equivalence class of $q_1^{\mathrm{out}}$ and is a label of a path that terminates at $q_1^{\mathrm{out}}$. Consider cc′w = cc′uxv. Then cc′ux is synchronizing for $q_2^{\mathrm{in}}$ in Inline graphic, by Lemma 5. But cc′w is not synchronizing for Inline graphic, hence there must be another path in Inline graphic with label cc′w, and by our assumption, it starts in Inline graphic and must terminate in Inline graphic. Such a path must use the transition $q_1^{\mathrm{out}}x = q_2^{\mathrm{in}}$, either with a portion of the path labeled cc′ or with a portion labeled w. In the first case w is a label of a path in Inline graphic, hence w ∈ L( Inline graphic). In the second case, there must be u′ and v′ such that cc′w = cc′u′xv′ = cc′uxv. Since u was the shortest word such that $q_2^{\mathrm{in}} \in \mathcal{C}_1ux$, it must be that u = u′, in which case cc′ux is synchronizing for Inline graphic. It is impossible that u is a proper prefix of u′ because this would imply Inline graphic cc′ux Inline graphic, which would contradict the fact that cc′ux synchronizes onto $q_2^{\mathrm{in}}$ in Inline graphic.

Example 3.4

The automaton in Fig. 3 is a path-automaton with no synchronizing words. It has only one terminal component, which is maximal, and the factors of all words in the language are labels of paths in the terminal component. This illustrates case 2 of Proposition 7.

Fig. 3. A path-automaton with no synchronizing words.

The following result is used to prove the main result (Theorem 15, Section 5.3) of the paper.

Proposition 8

Let L be a regular language, x ∈ F(L) and trim Inline graphic be the trimmed mDFA for L. At least one of the two cases holds:

  1. x is a factor of a constant for L,

  2. there is a path-automaton induced by a path of trim Inline graphic containing a path labeled x and having a non-trivial terminal transitive component with at least two states.

Proof

Let trim Inline graphic = (Q, A, {q0}, T, Inline graphic) be the trimmed mDFA for the language L. Suppose x ∈ F(L) is not a factor of a constant, i.e., for every v, v′ ∈ A*, vxv′ is not a constant for L, and therefore not synchronizing for trim Inline graphic. Consider a word w such that #Q xw = min{#Q xu | u ∈ A*} and let Pw = Q xw. Since xw is not synchronizing, by Proposition 1, #Pw > 1. Then for every word u ∈ A* we have that either Q xwu = ∅ or #Q xwu = #Pw. Therefore, we can assume that all states in Pw are in terminal components of trim Inline graphic (if not, we can concatenate w with words that are labels of paths that lead to terminal components). If all terminal components in trim Inline graphic are trivial, then because trim Inline graphic is reduced, there is only one trivial terminal transitive component, implying #Pw = 1, which contradicts our assumption that x does not extend to a constant. Thus there must be at least one terminal transitive component which is not trivial. If there is a state in Pw that belongs to a component that is not a single-state component, then case 2 holds. Assume to the contrary that each state in Pw is in a distinct transitive component consisting of only one state having loops at itself. Let y be a label of one of these loops. Since Pw y ≠ ∅, we have Pw y = Pw, i.e., for every q ∈ Pw we have qy = q. This means that all states in Pw are terminal, their loops must have the same labels, and therefore their right contexts are equal. Hence the states in Pw cannot be distinct in a reduced automaton. This again implies that Pw has cardinality 1, a contradiction. Hence, there must be at least one state in Pw that belongs to a terminal transitive component with at least two states.
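The quantity #Q xw used in the proof above can be explored computationally. The following is a minimal sketch, assuming a partial DFA encoded as a Python dictionary mapping (state, letter) pairs to states; the helper names `step`, `image` and `min_rank_word` and the two small automata are illustrative assumptions and are not taken from the paper. The search visits the non-empty images of a state set breadth-first and reports one of minimal cardinality; a word is synchronizing in this sketch when its image is a single state.

```python
from collections import deque

def step(delta, states, letter):
    """Image of a set of states under one letter in a partial DFA."""
    return frozenset(delta[(q, letter)] for q in states if (q, letter) in delta)

def image(delta, states, word):
    """Image Q.w of a set of states under a word (empty if no path is labeled w)."""
    current = frozenset(states)
    for letter in word:
        current = step(delta, current, letter)
    return current

def min_rank_word(delta, states, alphabet):
    """Search the reachable non-empty images breadth-first and return a pair
    (w, Q.w) whose image has the smallest cardinality found."""
    start = frozenset(states)
    best = ("", start)
    seen = {start}
    queue = deque([best])
    while queue:
        word, subset = queue.popleft()
        if len(subset) < len(best[1]):
            best = (word, subset)
        for letter in alphabet:
            nxt = step(delta, subset, letter)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append((word + letter, nxt))
    return best

# Hypothetical two-state examples (not the automata of the paper's figures).
cycle = {(0, "a"): 1, (1, "a"): 0}                            # a permutes the states
reset = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0}  # a and b are reset letters

print(min_rank_word(cycle, {0, 1}, "a"))   # minimal image still has two states
print(min_rank_word(reset, {0, 1}, "ab"))  # the letter a already synchronizes
```

To mirror the proof, one would call `min_rank_word(delta, image(delta, Q, x), alphabet)` for a fixed factor x. The subset search is exponential in the worst case, so this is only meant for small automata.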

4. Splicing languages and properties of splicing rules

As mentioned, in this paper we consider the general notion of the splicing operation and the splicing system given by Paun [17], as defined below.

Definition 4

A finite splicing system is a triple S = (A, I, R) where I ⊆ A* is a finite set of strings, called the initial language, and R is a finite set of splicing rules of the form r = (u1, u2)(u3, u4), with ui ∈ A* for i = 1, 2, 3, 4.

Given two words x = x1u1u2x2, y = y1u3u4y2, with x1, x2, y1, y2 ∈ A* and a rule r = (u1, u2)(u3, u4), the splicing rule produces w = x1u1u4y2, denoted (x, y) ⊢r w. We also say that u1u2, u3u4 are splice sites of r and u1u4 is the paste site of r.

To simplify the notation, in the following, by a splicing system we mean a finite splicing system.

Let L ⊆ A*. We denote σ(L) = {w ∈ A* | (x, y) ⊢r w, x, y ∈ L, r ∈ R}. The (iterated) splicing operation is defined as follows: $\sigma^0(L) = L$, $\sigma^{i+1}(L) = \sigma^i(L) \cup \sigma(\sigma^i(L))$, i ≥ 0. Finally, $\sigma^*(L) = \bigcup_{i \ge 0} \sigma^i(L)$.
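The splicing operation and its iteration can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: a rule (u1, u2)(u3, u4) is encoded as a pair of pairs of strings, the empty word 1 is the empty string, and since σ* may be infinite the iteration is truncated at a length bound `max_len` (an assumption introduced here). The function names are hypothetical.

```python
def occurrences(word, site):
    """All start positions at which `site` occurs in `word`."""
    return [i for i in range(len(word) - len(site) + 1) if word.startswith(site, i)]

def splice(x, y, rule):
    """All words w with (x, y) |-_r w for the rule r = ((u1, u2), (u3, u4))."""
    (u1, u2), (u3, u4) = rule
    results = set()
    for i in occurrences(x, u1 + u2):
        prefix = x[:i + len(u1)]          # x1 u1
        for j in occurrences(y, u3 + u4):
            suffix = y[j + len(u3):]      # u4 y2
            results.add(prefix + suffix)
    return results

def sigma(language, rules):
    """One splicing step: all words obtained from pairs of words of `language`."""
    return {w for r in rules for x in language for y in language
            for w in splice(x, y, r)}

def iterate_splicing(initial, rules, max_len):
    """Bounded approximation of sigma*: words longer than max_len are discarded."""
    current = {w for w in initial if len(w) <= max_len}
    while True:
        new = {w for w in sigma(current, rules) if len(w) <= max_len} - current
        if not new:
            return current
        current |= new

# A toy rule and words chosen only for illustration.
print(splice("cab", "dba", (("a", "b"), ("b", "a"))))                  # x1 = c, y2 empty
print(sorted(iterate_splicing({"ab"}, [(("a", ""), ("", "a"))], 4)))
```

Because splicing can shorten words, the bounded iteration may miss words of σ*(I) that are only reachable through longer intermediate words; it always returns a subset of σ*(I).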

Definition 5 (Splicing language)

Given a finite splicing system S = (A, I, R), the language L(S) = σ*(I) is the language generated by S. A language L is a splicing language if there is a splicing system S such that L = L(S).

For a word w and a set of states Q, we use the notation Inline graphic(Q w) for $\bigcup_{q \in Q}$ Inline graphic(qw).

Definition 6 (Paste site at p)

Let Inline graphic be a DFA for a regular splicing language L. The word u1u4 is said to be a paste site at a state p ∈ Q for a splicing rule r = (u1, u2)(u3, u4) if Inline graphic(Q u3u4) ⊆ Inline graphic(pu1u4) and pu1u2 ≠ ∅.

More precisely, the notion of a paste site at a state p is used to identify states of the automaton where a rule can be applied. Fig. 4 depicts the situation for a paste site at state p. The dotted path with label u3 may not exist in the automaton, but the right context of qu3u4 (wherever a path with such a label exists) must be included in the right context of pu1u4.

Fig. 4. Paste site at state p. The dotted path with label u3 may or may not exist, but the right context of qu3u4 is included in the right context of pu1u4 for every q.

In what follows we assume that every splicing system is such that all rules are applied at least once during the generation of the splicing language. The following lemma shows an equivalence between splicing systems with respect to the extension of sites and paste sites of rules.

Lemma 9

Let S = (A, I, R) be a finite splicing system and r = (u1, u2)(u3, u4) be a splicing rule in R. Let c ∈ A*. Then L(S) is the language generated by the splicing system S′ = (A, I, R′) where R′ = R ∪ {r′} for r′ = (u1, u2)(u3, u4c).

Proof

It is clear that L(S) ⊆ L(S′) since R′ contains R. The converse also holds since whenever we have (x, y) ⊢r′ w we also have (x, y) ⊢r w.

Lemma 10

Let S = (A, I, R) be a finite splicing system and Inline graphic a DFA for L = L(S). If u1u4 is a paste site at state p for a rule r = (u1, u2)(u3, u4) ∈ R then for every c ∈ A* with pu1u4c ≠ ∅, u1u4c is a paste site at p for a rule r′ = (u1, u2)(u3, u4c).

Proof

Suppose that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4), and let pu1u4c ≠ ∅. Then by Lemma 9, L is also generated by the splicing system S′ = (A, I, R′) for the set of rules $R' = R \cup \{r' = (u_1, u_2)(u_3, u_4')\}$ where $u_4' = u_4c$. The first splice site of r′ equals that of r, thus pu1u2 ≠ ∅. It only remains to show that $R_{\mathcal{A}}(Qu_3u_4') \subseteq R_{\mathcal{A}}(pu_1u_4')$. But if $y \in R_{\mathcal{A}}(Qu_3u_4')$ then $cy \in R_{\mathcal{A}}(Qu_3u_4) \subseteq R_{\mathcal{A}}(pu_1u_4)$ and so $y \in R_{\mathcal{A}}(pu_1u_4c)$. It follows that $u_1u_4'$ is a paste site at state p for rule r′.

5. Splicing languages must have a constant

5.1. Reflexive and non-reflexive splicing languages

It is known that every splicing language generated by a finite splicing system is always regular [8,19]. More precisely, regular splicing languages form a proper subclass of the class of regular languages.

Recall that a splicing system S is said to be reflexive if for every rule r = (u1, u2)(u3, u4) in R, both (u1, u2)(u1, u2) and (u3, u4)(u3, u4) are rules in R. A language L is said to be a reflexive splicing language if there is a reflexive splicing system S such that L = L(S). It is said that S is symmetric if (u1, u2)(u3, u4) being in R implies that (u3, u4)(u1, u2) is in R. The notion of a constant of a language turned out to be essential in providing a characterization of the class of reflexive regular splicing languages [11,3]. Indeed, a fundamental property of a reflexive regular splicing language L is that there exists a splicing system generating L that has rules whose splicing sites consist of constants for the language L. A more precise characterization shows that the class of reflexive and symmetric splicing languages is equivalent to a class of regular languages, the so-called PA-con-split languages [3]. This result has been extended to the non-symmetric case [4]. In [4], it is shown that each language L in this class is constructed from a finite set of constants for L, as L is expressed as a union consisting of a finite set X, and a finite union of constant and split languages (see end of Section 2). The characterization is given by the following proposition.

Proposition 11. (See [4].)

A regular language L is a reflexive splicing language if and only if there is a finite set X ⊆ A*, a finite set of constants K1 of L and a finite set K2 of pairs of constants of L such that

$$L = X \cup \Big(\bigcup_{w \in K_1} \mathrm{Const}_L(w)\Big) \cup \Big(\bigcup_{(w_1, w_2) \in K_2} \mathrm{Split}_L(w_1, w_2)\Big)$$

The characterization of reflexive languages in Proposition 11 helps to describe factor-closed transitive regular languages as reflexive splicing languages.

Proposition 12

If L is a factor-closed transitive regular language then L is a reflexive splicing language.

Proof

Since L is factor-closed, by Lemma 2 we can consider its minimal deterministic transitive automaton Inline graphic. By Lemma 3, all states in Inline graphic are synchronizing. Every word w ∈ L is a label of a path in Inline graphic that passes through some state q, hence w = w′w″ where w′ is in the left context of a constant that labels a path starting at q and w″ is in the right context of a constant that synchronizes onto q. Because all states in Inline graphic are initial and terminal, we have that $C_L^{\ell}(w) = L_L(w)$ and $C_L^{r}(w) = R_L(w)$ for every constant w of L. Let M be a set of constants obtained by choosing, for each state q in Inline graphic, one constant $m_q$ that is a label of a path starting and ending at the state q. Then $L = \bigcup_{m_q \in M} \mathrm{Split}(m_q, m_q)$ where $\mathrm{Split}(m_q, m_q) = C_L^{\ell}(m_q)\,C_L^{r}(m_q) = L_L(m_q)\,R_L(m_q)$ and splitting is performed by taking the empty prefix and the empty suffix of $m_q$. The conclusion follows directly from the characterization in Proposition 11.

An example of a non-reflexive regular splicing language is given in [11]; this is the language $L = a^+b^+a^+b^+a^+ \cup a^+b^+a^+$.

Example 5.1

The path-automaton Inline graphic of Fig. 5 generalizes the example of a regular non-reflexive language given in [11]. More precisely, it is possible to show, similarly as in [11], that any splicing system for the language $L_k = \bigcup_{1 \le i \le k} (a^+b^+)^i a^+$ with k ≥ 3 must have a rule in which neither splice site is a constant for the language. A splicing system S for $L_k$ can be defined with an initial language $I_k = \bigcup_{1 \le i \le k} \{a(ba)^i, a^2(ba)^i, (ab)^i a^2\}$ and rules:

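$$
\begin{aligned}
r_{1,k} &= (1, (ab)^k)(1, a(ab)^k), \qquad r_{2,k} = ((ba)^k a, 1)((ba)^k, 1),\\
r_{3,i} &= (a, (ba)^i)((ab)^{k-i}, ab), \quad \text{for each } 1 \le i \le k, \text{ and}\\
r_{4,i} &= (b, (ab)^i)((ba)^{k-i}, ba), \quad \text{for each } 1 \le i < k.
\end{aligned}
$$

Fig. 5. A path automaton that recognizes a non-reflexive splicing language.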

The proof that such a splicing system generates the language $L_k$ is along the same lines as the one given in [11]. Observe that both splice sites of rules $r_{3,i} = (a, (ba)^i)((ab)^{k-i}, ab)$, for i > 1, are not constants for the language $L_k$. More precisely, rules $r_{1,k}$ and $r_{2,k}$ are used to increase the initial and final number of a’s in the language $a^+(ab)^k a^+$, respectively. Rules $r_{3,i}$ are used to increase the number of a’s in the (k − i)th appearance of a’s in $(a^+b^+)^k a^+$, for i ≤ k. Similarly, rules $r_{4,i}$ are used to increase the number of b’s in the language $L_k$. The rules $r_{3,i}$ are also used to obtain $(a^+b^+)^j a^+$, for j < k.

The following lemma gives another example of a non-reflexive splicing language, one whose trimmed mDFA is not a path-automaton (Fig. 6).

Fig. 6. The automaton of Fig. 2(a), detailing the minimal terminal component, which consists of states q2, q4 and q3.

Lemma 13

The regular language $L = b(a^3)^* + cba^* + da(a^3)^*$ is a non-reflexive splicing language.

Proof

First we note that L ⊆ A*, for A = {a, b, c, d}, is a splicing language. A splicing system S = (A, I, R) for the language L consists of the rules $R = \{r_1 = (cba, 1)(cb, a),\ r_2 = (daa^3, 1)(da, 1),\ r_3 = (b, a^3)(da, 1)\}$ and the initial language $I = \{ba^3, b, cba, cb, daa^3, da\}$. By induction on the number k of iteration steps of splicing rules, we first show that L(S) ⊆ L. If k = 0, the words generated are exactly those of I ⊆ L(S), and each of them belongs to L, so the inclusion holds. Assume that w ∈ L(S) is generated with k > 0 iterations by applying a rule r to a pair of words w1, w2 ∈ L(S). By induction, w1, w2 ∈ L since they are obtained with at most k − 1 iterations. Checking the splice sites in w1 and w2 for all of the rules, it is immediate to see that w ∈ L. In order to show that L ⊆ L(S), we observe that the language $L_1 = da(a^3)^*$ is generated by rule $r_2$ applied to words of the same language, such as $daa^3$. Similarly, we see that the language $L_2 = cba^*$ is generated by rule $r_1$ starting from words of the same language. The language $L_3 = b(a^3)^*$ is generated by rule $r_3$ applied to words of the language $da(a^3)^*$ and of the language $b(a^3)^*$. Indeed, by induction on i ≥ 0 we can observe that $b(a^3)^i \in L(S)$ for all i ≥ 0. If i = 0 or i = 1, since $b, ba^3 \in I$, the result is immediate. Otherwise, given the word $b(a^3)^{i-1} \in L(S)$, for i > 1, and the word $da(a^3)^i \in L(S)$, rule $r_3$ immediately generates the word $b(a^3)^i \in L(S)$.

Finally, notice that the language L is not reflexive, that is, it cannot be generated by a splicing system with reflexive splicing rules. Suppose L is a reflexive splicing language generated by a reflexive system S. We obtain a contradiction by considering the generation of words in the language $b(a^3)^*$. Since in the language L the only words that start with a b are those in $b(a^3)^*$, and there must be splicing rules in S to generate words of the form $b(a^3)^k$ for arbitrarily large k, there must be a rule r with splice site u1u2 that is a factor of a word in $b(a^3)^*$. But, because S is reflexive, S must also contain the rule (u1, u2)(u1, u2). Then this rule can be applied to $x = b(a^3)^k$ and $y = cb(a^3)^k a$, for some large k > 0, to generate a word $w = b(a^3)^k a \notin L$. Therefore the language L cannot be generated by reflexive rules.
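The two halves of this proof can be checked on small instances. The sketch below is an illustration under stated assumptions rather than part of the proof: rules are encoded as pairs of pairs of strings with the empty word 1 as the empty string, the closure is truncated at length 12 (so it only witnesses a finite fragment of L(S)), and the reflexive rule (b, a^3)(b, a^3) is just one possible instance of the rule (u1, u2)(u1, u2) discussed above. All function and variable names are hypothetical.

```python
import re

def occurrences(word, site):
    """All start positions at which `site` occurs in `word`."""
    return [i for i in range(len(word) - len(site) + 1) if word.startswith(site, i)]

def splice(x, y, rule):
    """All words x1 u1 u4 y2 obtained from x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    (u1, u2), (u3, u4) = rule
    return {x[:i + len(u1)] + y[j + len(u3):]
            for i in occurrences(x, u1 + u2)
            for j in occurrences(y, u3 + u4)}

def bounded_closure(initial, rules, max_len):
    """Iterated splicing, discarding every word longer than max_len."""
    current = set(initial)
    while True:
        new = {w for r in rules for x in current for y in current
               for w in splice(x, y, r) if len(w) <= max_len} - current
        if not new:
            return current
        current |= new

A3 = "aaa"
RULES = [(("cba", ""), ("cb", "a")),      # r1
         (("da" + A3, ""), ("da", "")),   # r2
         (("b", A3), ("da", ""))]         # r3
INITIAL = {"b" + A3, "b", "cba", "cb", "da" + A3, "da"}
IN_L = re.compile(r"b(aaa)*|cba*|da(aaa)*")

generated = bounded_closure(INITIAL, RULES, 12)
print(all(IN_L.fullmatch(w) for w in generated))   # every generated word lies in L

# One reflexive instance applied to x = b(a^3)^2 and y = cb(a^3)^2 a.
reflexive_rule = (("b", A3), ("b", A3))
bad = splice("b" + A3 * 2, "cb" + A3 * 2 + "a", reflexive_rule)
print(bad, [bool(IN_L.fullmatch(w)) for w in bad])
```

With the bound 12 the closure coincides with the words of L of length at most 12, and the reflexive rule produces b followed by seven a's, which is outside L, in line with the argument above for k = 2.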

5.2. Canonical and special words

The proof of the main result (Theorem 15) is based on special words in a regular splicing language L that must be generated by a splicing rule whose splice site u3u4 is a constant of L. For lack of a better name, we call these words q-canonical and k-special words.

Informally, the q-canonical word of a component Inline graphic is a word c such that qc = q and every such path with label c crosses all states in the component, and moreover, the word c is able to identify the language L( Inline graphic) of the component Inline graphic.

Definition 7 (q-Canonical)

Let Inline graphic be an automaton and let Inline graphic be a component of Inline graphic. Let q be a state of Inline graphic. Then a word c ∈ A+ such that c ∈ L( Inline graphic) and qc = q is called q-canonical for Inline graphic with respect to Inline graphic if, whenever c ∈ L( Inline graphic) for another component Inline graphic of Inline graphic, it holds that L( Inline graphic) ⊆ L( Inline graphic).

In the following we show the existence of a q-canonical word for every state in every transitive component in Inline graphic. We give a constructive proof based on the notion of a k-special word for L( Inline graphic) as defined below.

Definition 8 (k-Special)

Let Inline graphic be an automaton. A word c in L( Inline graphic) is k-special for the language L if every word of F(L) of length ≤k is a factor of c.

Example 5.2

Consider the automaton of Fig. 6. Then the word $a^3$ is q2-canonical for the terminal component Inline graphic consisting of states q2, q3, q4. Given the language L = L( Inline graphic), the word $a^3$ is k-special for the language L for k ≤ 3.
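The k-special condition of Definition 8 can be tested directly on a small component. The following sketch assumes that the terminal component of Example 5.2 is a cycle of three states on the letter a; the orientation of the cycle, the dictionary encoding and the helper names `run`, `path_labels` and `is_k_special` are assumptions made for illustration, since the figure itself is not reproduced here. Labels of paths inside a component are closed under taking factors, so enumerating them up to a given length yields the factors needed for the check.

```python
def run(delta, state, word):
    """State reached from `state` by reading `word`, or None if no such path exists."""
    for letter in word:
        state = delta.get((state, letter))
        if state is None:
            return None
    return state

def path_labels(delta, states, alphabet, max_len):
    """Labels of all paths of length <= max_len that stay inside `states`."""
    labels = {""}
    frontier = {("", q) for q in states}
    for _ in range(max_len):
        frontier = {(w + a, delta[(q, a)])
                    for (w, q) in frontier for a in alphabet
                    if delta.get((q, a)) in states}
        labels |= {w for (w, q) in frontier}
    return labels

def is_k_special(c, factors, k):
    """True if every listed factor of length <= k occurs as a factor of c."""
    return all(f in c for f in factors if len(f) <= k)

# Assumed shape of the component {q2, q3, q4}: a single cycle labeled a.
component = {"q2", "q3", "q4"}
delta = {("q2", "a"): "q3", ("q3", "a"): "q4", ("q4", "a"): "q2"}

factors = path_labels(delta, component, "a", 4)
print(run(delta, "q2", "aaa"))            # the loop condition qc = q for c = aaa
print(is_k_special("aaa", factors, 3))    # all factors of length <= 3 occur in aaa
print(is_k_special("aaa", factors, 4))    # aaaa is a factor of the language but not of aaa
```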

Lemma 14

Given a non-trivial transitive component Inline graphic in a DFA Inline graphic, let $k = (\#Q)^2$. Then for every state q in Inline graphic there is a q-canonical word for Inline graphic that is a k-special constant of L( Inline graphic).

Proof

Let $\{x_1, \ldots, x_n\}$ = L( Inline graphic) ∩ $A^k$. Since Inline graphic is a transitive component, there are $y_1, \ldots, y_{n-1}$ such that $x_1y_1x_2 \cdots y_{n-1}x_n$ ∈ L( Inline graphic). Set $c = x_1y_1x_2 \cdots y_{n-1}x_n$. Due to transitivity, for every q ∈ Inline graphic there are $y_q$ and $y_q'$ such that $w_q = y_q c y_q'$ is a label of a path that starts and ends at q. By Remark 5, $y_q$ and $y_q'$ can be chosen so that $w_q$ is a constant. We show that $w_q$ is q-canonical. Assume that $w_q$ ∈ L( Inline graphic) for some transitive component Inline graphic. Take the shortest word z ∈ L( Inline graphic) \ L( Inline graphic). Since L( Inline graphic) \ L( Inline graphic) = L( Inline graphic) ∩ (L( Inline graphic))^c can be recognized by an automaton with at most #Q( Inline graphic) · #Q( Inline graphic) ≤ k states [13], the shortest word in this language has length at most k. Thus |z| ≤ k and therefore z must be a factor of c, i.e., z must be in L( Inline graphic), contradicting the existence of z.
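The length bound on the shortest word of a difference of two regular languages, used at the end of the proof, comes from a product construction. The sketch below illustrates that idea under assumptions of our own: each DFA is a triple (transition dictionary, start state, accepting set) with partial transitions, a missing transition of the second automaton is treated as an implicit sink, and the function name is hypothetical. A breadth-first search over pairs of states returns a shortest word accepted by the first automaton and rejected by the second; its length is bounded by the number of pairs explored.

```python
from collections import deque

def shortest_in_difference(dfa1, dfa2, alphabet):
    """Shortest word accepted by dfa1 but not by dfa2, or None if there is none."""
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    start = (s1, s2)
    seen = {start}
    queue = deque([("", start)])
    while queue:
        word, (p, q) = queue.popleft()
        if p in f1 and q not in f2:
            return word
        for a in alphabet:
            if (p, a) not in d1:
                continue                      # dfa1 rejects every extension
            nxt = (d1[(p, a)], d2.get((q, a)))  # None acts as a sink state of dfa2
            if nxt not in seen:
                seen.add(nxt)
                queue.append((word + a, nxt))
    return None

# Illustrative automata: a* (one state) and (aaa)* (a cycle of three states).
all_a = ({(0, "a"): 0}, 0, {0})
mult3 = ({(0, "a"): 1, (1, "a"): 2, (2, "a"): 0}, 0, {0})

print(shortest_in_difference(all_a, mult3, "a"))   # a word of a* outside (aaa)*
print(shortest_in_difference(mult3, all_a, "a"))   # (aaa)* is contained in a*
```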

5.3. Proof of the main result

Considering the importance of constants in characterization of sub-classes of regular splicing languages, it has been conjectured that every splicing language must have a constant [10,11]. Our main result proves this conjecture to be true.

Theorem 15 (Main result)

If L is a regular splicing language, then L has a constant.

Example 5.3

The path-automaton Inline graphic of Fig. 3 has no synchronizing word (see Example 3.4) and thus the language L( Inline graphic) = a*c(c*ac*a)* has no constant. By Theorem 15, L( Inline graphic) is not a regular splicing language.

Example 5.4

The transitive regular language L recognized by the automaton in Fig. 1 has no constants. By Theorem 15, the language L is not a splicing language.

Example 5.5

The regular language $L = b(a^3)^* + cba^* + da(a^3)^*$ is another example of a non-reflexive splicing language, as proved in Lemma 13. Fig. 2(a) shows the trimmed mDFA graph for the language L. Observe that a path-automaton induced by a path in the mDFA from the initial state q0 to a terminal component does not necessarily have a constant of L. Indeed, the path-automaton in Fig. 2(a) recognizing the language $b(a^3)^* \subseteq L$ does not have any constant of the language L because every word in $b(a^3)^*$ is also a substring of a word in $cba^*$ and therefore is not a synchronizing word for the automaton of L.

Given a splicing regular language L, the proof of Theorem 15 shows existence of a splicing rule r = (u1, u2)(u3, u4) such that the word u1u4 ends in a non-trivial terminal component of the trimmed mDFA trim Inline graphic. More precisely, u1u4 ends in a state which we show to be synchronizing for the automaton trim Inline graphic.

Let L be a regular splicing language and let trim Inline graphic = (Q, A, {q0}, T, Inline graphic) be the trimmed mDFA for language L. We introduce some basic notations that are used in the proof. We are interested in states of the automaton trim Inline graphic that are found as follows.

Consider a non-trivial terminal component Inline graphic that is minimal among the non-trivial terminal components in the automaton trim Inline graphic. If a non-trivial component does not exist, then by Proposition 8, trim Inline graphic must have a constant and Theorem 15 holds. Let q ∈ Inline graphic be a minimal-follower state with respect to Inline graphic and recall that with μq( Inline graphic) we denote the set of states in Inline graphic that are follower-equivalent to q. Let C = { Inline graphic = Inline graphic, Inline graphic, ···, Inline graphic} be the set of all terminal components of the automaton that are factor-equivalent to Inline graphic. Consider the set $F = \{q_1, \ldots, q_n\} = \bigcup_{i=1}^{k} \mu_q(\mathcal{C}_i)$ (note that by Lemma 3, for each i = 1, …, k, the collection of follower sets in Inline graphic coincides with the collection of follower sets in Inline graphic).

Then, a candidate state of trim Inline graphic is a state $\bar q$ ∈ F with Inline graphic for some component Inline graphic ∈ C such that Inline graphic($\bar q$) is minimal in the following sense: for all q ∈ F, whenever Inline graphic(q) ⊆ Inline graphic($\bar q$), it holds that Inline graphic(q) = Inline graphic($\bar q$), i.e., since trim Inline graphic is reduced, q = $\bar q$.

The main idea of the proof is to show that either the automaton has no non-trivial components, in which case a constant exists (see Proposition 8), or there exists a candidate state that is synchronizing for the automaton trim Inline graphic.

Example 5.6

Consider the automaton in Fig. 6. Observe that the minimal terminal component Inline graphic induced by state q2 has language L( Inline graphic) = a*, with L( Inline graphic) = $\{a^3\}^*$, and is factor-equivalent to the component induced by the state q1. Then the set F, corresponding to the candidate component Inline graphic, is the set of states F = {q1, q2, q3, q4} because all these states belong to a single follower-equivalence class, which is therefore the minimal follower-equivalence class. Then the candidate states are q2, q3, q4.

We will use the following lemma.

Lemma 16

Let Inline graphic be a candidate state and let $\bar q_1$ ∈ F ∩ Inline graphic. Then $\bar q_1$ is also a candidate state.

Proof

Let Inline graphic be the minimal deterministic transitive automaton for Inline graphicC from Lemma 3. Suppose Inline graphic is a candidate state and let 1FInline graphic. Let further Inline graphic be the follower-equivalent state to . Then by Remark 5 there is a constant c of L( Inline graphic) such that q̂c = and 1c = .

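Let q′ ∈ F. First, suppose that q′ ∈ Inline graphic for some Inline graphic ≠ Inline graphic. Consider $q_1' \in \mathcal{C}'$ such that $q'c = q_1'$. Because c is a constant, by Remark 4, $q_1'$ is follower-equivalent to $\bar q$, so we have $q_1' \in F$. Because $\bar q$ is a candidate state, there are two possibilities: (a) the right contexts of $\bar q$ and $q_1'$ are incomparable, i.e., $R_{\mathrm{trim}\,\hat{\mathcal{A}}}(\bar q) \setminus R_{\mathrm{trim}\,\hat{\mathcal{A}}}(q_1') \neq \emptyset$ and $R_{\mathrm{trim}\,\hat{\mathcal{A}}}(q_1') \setminus R_{\mathrm{trim}\,\hat{\mathcal{A}}}(\bar q) \neq \emptyset$, or (b) $R_{\mathrm{trim}\,\hat{\mathcal{A}}}(\bar q) \subsetneq R_{\mathrm{trim}\,\hat{\mathcal{A}}}(q_1')$ (equality cannot hold because trim Inline graphic is reduced). In both cases there must be a $z \in R_{\mathrm{trim}\,\hat{\mathcal{A}}}(q_1') \setminus R_{\mathrm{trim}\,\hat{\mathcal{A}}}(\bar q)$. Then $q'cz$ is a terminal state while $\bar q_1cz$ is not. Therefore, in both cases Inline graphic(q′) ⊈ Inline graphic($\bar q_1$).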

Also, if q′ ∈ Inline graphic ∩ F then by Lemma 4, Inline graphic(q′) \ Inline graphic($\bar q_1$) ≠ ∅.

Therefore $\bar q_1$ is a candidate state.

Before we present details of the proof of Theorem 15 we outline the steps involved in the proof by illustrating the situation in Example 5.6 shown in Fig. 6.

  1. In trim Inline graphic we identify a candidate state $\bar q$ within a non-trivial component as outlined above. (For Example 5.6 we choose state q2.)

  2. We consider a $\bar q$-canonical word c and observe that there must be a rule (u1, u2)(u3, u4) with a paste site u1u4 at a state p that lies on a path labeled $wc^sx$, for some s, where $q_0w = \bar q$ and $\bar qx$ is terminal (see Fig. 7). (For Example 5.6, p = q0 and $wc^sx = b(a^3)^s$ with w = b and $c = a^3$; the rule in question is $r_3 = (b, a^3)(da, 1)$.)

  3. We observe that there is a state q ∈ Q u3u4 such that, for arbitrarily large i, $c^i$ is a factor of the right context of q. (For Example 5.6, such states are only q2, q3, q4, because $q_1 \notin Q\,daa^3x$ for any x.) We choose a sufficiently large i such that for some z, all states in $Qu_3u_4zc^i$ belong to non-trivial components and we set $u_3u_4' = u_3u_4zc^i$. We observe that all states in $Qu_3u_4'$ lie in non-trivial components that are factor-equivalent to the non-trivial component Inline graphic. By Lemma 10 we obtain that $u_1u_4'$ is a paste site at p for the rule $r' = (u_1, u_2)(u_3, u_4')$. (For Example 5.6, we can choose z = 1 and have a new rule $r_3' = (b, a^3)(da, a^3)$, and $Q\,daa^3 = \{q_2\}$.)

  4. We show that for every q ∈ $Qu_3u_4'$, it must be that $q = pu_1u_4'$; therefore $u_3u_4'$ is synchronizing.

Fig. 7. A possible paste site at state p.

We now present the proof of the main result.

Proof of Theorem 15

Let L be a regular splicing language, and let trim Inline graphic = (Q, A, {q0}, T, Inline graphic) be its trimmed mDFA. By Proposition 8, if the automaton trim Inline graphic has only a trivial terminal component (note that since trim Inline graphic is reduced, there could be only one such component), it must have a constant and thus the theorem holds. Therefore we consider the case that trim Inline graphic has at least one non-trivial terminal component.

  1. Consider a non-trivial terminal component Inline graphic that is minimal among the non-trivial terminal components in the automaton trim Inline graphic and let q ∈ Inline graphic be a minimal-follower state in Inline graphic. Let C = { Inline graphic = Inline graphic, Inline graphic, ···, Inline graphic} be the set of all terminal components of the automaton that are factor-equivalent to Inline graphic and set $F = \{q_1, \ldots, q_n\} = \bigcup_{i=1}^{k} \mu_q(\mathcal{C}_i)$. We choose a candidate state $\bar q$ ∈ F in a component Inline graphic ∈ C.

  2. Let w ∈ A* be the shortest word such that $q_0w = \bar q$. Consider a word c which is a constant of L( Inline graphic) and is $\bar q$-canonical for Inline graphic. Such a word exists by Lemma 14. Then wc*x ⊆ L for some x ∈ A*. Since there is a finite number of rules in the splicing system, there are infinitely many indices s such that the words $wc^sx$ are obtained by using the same splicing rule r = (u1, u2)(u3, u4), where u1u4 is a subword of $wc^sx$ for every such s. More precisely, there must exist an infinite number of pairs of words v = v′u1u2v″ ∈ L and w′u3u4w″ ∈ L such that v′u1u4w″ ∈ wc*x. Thus v′u1 is a prefix of $wc^ix$ for some i ≥ 0. Let p be such that pu1u2 ≠ ∅ where p = q0v′. Moreover, if y″ ∈ Inline graphic(Qu3u4), since there is y′ such that y = y′u3u4y″ ∈ L, by splicing the words v = v′u1u2v″ and y = y′u3u4y″ with rule r, we obtain v′u1u4y″ ∈ L and thus y″ ∈ Inline graphic(pu1u4). Therefore, Inline graphic(Qu3u4) ⊆ Inline graphic(pu1u4).

    We obtain that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4). Refer to Fig. 7.

  3. In the following we show that there are states in Q u3u4 such that $c^i$ is a factor of a word in their right context for arbitrarily large i.

    Let p′ = q0v′u1u4 where v′ is a prefix of wc*x. Since trim Inline graphic is deterministic, and since Inline graphic is terminal, by the choice of p′ it must be that p′ is either inside the component Inline graphic or lies along a path with label w from state q0 to state $\bar q$. In the latter case, when p′ is not a state inside Inline graphic, v′u1u4 is a prefix of w. In this case $c^ix$ must be a suffix of w″ in the splicing of v′u1u2v″ ∈ L and w′u3u4w″ ∈ L that produces v′u1u4w″ = $wc^ix$. Hence, for arbitrarily large i, it must be that $c^i$ is a factor of a word in the right context of a state q ∈ Q u3u4. Since there are infinitely many i with this property, there is a state q ∈ Q u3u4 such that $c^i$ ∈ F( Inline graphic(qu3u4)) for arbitrarily large i.

    Now suppose p′ is a state in Inline graphic (see Fig. 7). By Proposition 8, u3u4 is either a factor of a constant, which proves the statement of the theorem, or there is a path-automaton Inline graphic (a sub-automaton of trim Inline graphic) with a non-trivial terminal component Inline graphic such that u3u4 is a label of a path π in Inline graphic. Then, by Lemma 14, there is a q-canonical word c′ for some q ∈ Inline graphic such that u3u4zc′ is a label of a path in Inline graphic for some z, that is, zc′ ∈ Inline graphic(qu3u4). Since u1u4 is a paste site for the rule r at state p, we have that zc′ ∈ Inline graphic(qu3u4) ⊆ Inline graphic(pu1u4) = Inline graphic(p′) ⊆ L( Inline graphic). But because c′ is q-canonical for Inline graphic it follows that L( Inline graphic) ⊆ L( Inline graphic), and by the minimality of component Inline graphic and since Inline graphic is factor-equivalent to Inline graphic we have that L( Inline graphic) = L( Inline graphic), i.e., c* ⊆ L( Inline graphic). Therefore $c^i$ is a factor of the right context of a state q ∈ Q u3u4 for arbitrarily large i.

    We now consider states in Q u3u4 whose right context has words with factors $c^i$ for arbitrarily large i. We fix i sufficiently large, such that for some z, every state in $Qu_3u_4zc^i$ belongs to a non-trivial component, and for every state s ∈ $Qu_3u_4zc^i$, the language of the component Inline graphic containing s contains the word c. Given $u_4' = u_4zc^i$, by Lemma 10, $u_1u_4'$ is a paste site at the same state p for the rule $r' = (u_1, u_2)(u_3, u_4')$. Observe that z can be chosen such that $pu_1u_4' = \bar q_0 \in \bar{\mathcal{C}}$. If p′ = pu1u4 is not in Inline graphic, by the argument above, $c^ix$ is a suffix of w″, hence we can choose z such that w″ = $zc^ix$, i.e., p′z = $\bar q$.

    Because c is a constant for L( Inline graphic) such that $\bar qc = \bar q$, by Lemma 3 and Remark 4, every state q ∈ Inline graphic c is follower-equivalent to $\bar q$. Therefore, the state $\bar q_0 = pu_1u_4' = pu_1u_4zc^i \in \bar{\mathcal{C}}$ is follower-equivalent to $\bar q$, and hence in F. Having $\bar q_0$ ∈ F ∩ Inline graphic, by Lemma 16, $\bar q_0$ is also a candidate state.

  4. Let q be a state in $Qu_3u_4'$. We conclude with the observation that $q = \bar q_0$ and therefore $u_3u_4'$ is a synchronizing word for $\bar q_0$, proving the theorem. The proof of this last step consists first in showing that L( Inline graphic) = L( Inline graphic) where Inline graphic is the component in trim Inline graphic containing q. Then we show that Inline graphic is terminal and thus Inline graphic ∈ C. By the fact that $\bar q_0$ is a candidate state we are able to show that $\bar q_0 = q$.

We first observe that L( Inline graphic) = L( Inline graphic). As q ∈ $Qu_3u_4'$, by Definition 6 of paste site, it holds that

$$R_{\mathrm{trim}\,\hat{\mathcal{A}}}(q) \subseteq R_{\mathrm{trim}\,\hat{\mathcal{A}}}(Qu_3u_4') \subseteq R_{\mathrm{trim}\,\hat{\mathcal{A}}}(pu_1u_4' = \bar q_0). \qquad (*)$$

Since c is $\bar q$-canonical, by Definition 7 we have that L( Inline graphic) ⊆ L( Inline graphic). If L( Inline graphic) \ L( Inline graphic) ≠ ∅ then Inline graphic(q) ⊈ Inline graphic($\bar q_0$), which contradicts (*). Therefore it must be that L( Inline graphic) = L( Inline graphic), that is, Inline graphic and Inline graphic are factor-equivalent.

Next we see that Inline graphic is terminal. Assume to the contrary that Inline graphic is not terminal and thus there is an edge labeled a that starts in Inline graphic and terminates in a state q′ outside Inline graphic. By Lemma 5 the automaton that consists of Inline graphic together with the edge labeled a ending at q′ has a synchronizing word that ends at q′. Let ua be that word. Then ua ∉ L( Inline graphic) = L( Inline graphic), because otherwise ua would not be synchronizing. By the transitivity of Inline graphic we can assume that ua is a label of a path that starts at q. This implies that ua must be a prefix of a word in Inline graphic(q) \ Inline graphic($\bar q_0$), again contradicting (*). Consequently Inline graphic is terminal and in C. Moreover q ∈ F, since by the choice of the constant c, by Remark 4, q is factor-equivalent to $\bar q$ (hence, to $\bar q_0$), and F consists of all states that are factor-equivalent to $\bar q$ and belong to components in C. Thus by (*) (i.e., Inline graphic(q) ⊆ Inline graphic($\bar q_0$)) and the fact that $\bar q_0$ is a candidate state, Inline graphic($\bar q_0$) = Inline graphic(q). Because trim Inline graphic is reduced, $\bar q_0$ = q, which concludes the proof.

The proof of Proposition 15 is based on the effective computation of a synchronizing state in the automaton for a regular splicing language in the case of an automaton having non-trivial terminal components. As a main corollary of the above Proposition we can state the following fact.

Corollary 17

Let trim Inline graphic be the trimmed minimal deterministic automaton recognizing a splicing regular language. Then every state in a terminal component that contains a candidate state for trim Inline graphic is synchronizing.
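To illustrate what it means for a state to be synchronizing, the following minimal Python sketch computes the image Qw of the full state set under a word w and reports the state q when Qw = {q}. It assumes a partial deterministic automaton stored as a dictionary of dictionaries in which every state appears as a key; the names `image`, `synchronized_state` and `shortest_synchronizing_word` are illustrative and do not come from the paper. The search is a generic breadth-first exploration of subsets of states, not the component-based procedure developed in the proof, and it may visit exponentially many subsets in the worst case.

```python
from collections import deque


def image(delta, states, word):
    """Return the set of states reached from `states` by reading `word`.

    `delta` maps each state to a dict {letter: next_state}; missing
    letters mean the transition is undefined (partial automaton).
    """
    current = set(states)
    for letter in word:
        current = {delta[q][letter] for q in current if letter in delta[q]}
    return current


def synchronized_state(delta, word):
    """Return q if Q.word == {q}, otherwise None."""
    reached = image(delta, delta, word)
    return next(iter(reached)) if len(reached) == 1 else None


def shortest_synchronizing_word(delta, alphabet):
    """Breadth-first search over subsets of states, starting from Q.

    Returns (word, state) for a shortest word whose image is a single
    state, or None if no such word exists.
    """
    start = frozenset(delta)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        subset, word = queue.popleft()
        if len(subset) == 1:
            return word, next(iter(subset))
        for letter in alphabet:
            nxt = frozenset(image(delta, subset, letter))
            # An empty image means no path carries this label; skip it.
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + letter))
    return None


# Hypothetical three-state example: every path labeled "a" ends in state 1.
EXAMPLE_DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1},
}
```

In this toy automaton the word "a" sends every state to state 1, so `synchronized_state(EXAMPLE_DELTA, "a")` returns 1, whereas "b" leaves two possible end states and is therefore not synchronizing.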

6. Concluding remarks

In this paper we settle a conjecture, posed by T. Head in his seminal work on regular splicing languages, that the existence of a constant is a necessary condition for a regular language to be a splicing language. We settle it in the affirmative by providing a constructive proof that leads to a procedure for finding a synchronizing state in an mDFA for a regular splicing language.

The use of constants makes it possible to give a necessary and sufficient condition for a regular language to be a reflexive splicing language [3,4]; identifying such a condition for non-reflexive splicing languages is still an open problem.

Recently, it was proved in [15] that it is decidable whether a regular language is a splicing language; the proof provides an upper bound on the lengths of the words that appear in the splicing rules. This bound is quadratic with respect to the size of the syntactic monoid of the language. Decidability follows because the bound allows a brute-force search that compares the given language with the splicing languages obtained from all possible finite sets of rules of bounded size. Although the existence of such an algorithm was long awaited, the procedure it provides is of no practical use. Finding a practical procedure to decide whether a regular language is a splicing language remains a challenging open problem. We believe that a characterization of minimal splicing systems generating splicing languages, where minimality of the system is measured both by the number of splice sites of the rules and by the length of the splicing sites, would be a promising direction for obtaining a practical decision procedure. Moreover, since splicing rules are built from constants in reflexive languages, the notions of constants and synchronizing words again seem to be vital for answering most of the above questions.
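To make the brute-force idea more concrete, the following minimal Python sketch applies splicing rules in the commonly used formulation of [17]: a rule (u1, u2; u3, u4) applied to x = x1 u1 u2 x2 and y = y1 u3 u4 y2 produces x1 u1 u4 y2. The tuple encoding of rules, the function names and the length cap `max_len` are assumptions made for illustration; this is not the algorithm of [15]. Because words longer than `max_len` are discarded, the result may miss short words whose derivations pass through longer intermediate words, so it should be read as an under-approximation of the generated language.

```python
from itertools import product


def occurrences(word, site):
    """Return every start index at which `site` occurs in `word`."""
    return [i for i in range(len(word) - len(site) + 1)
            if word.startswith(site, i)]


def splice(x, y, rule):
    """Return all words x1 u1 u4 y2 with x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    u1, u2, u3, u4 = rule
    results = set()
    for i in occurrences(x, u1 + u2):
        for j in occurrences(y, u3 + u4):
            # Keep x up to the end of u1, then continue with y from u4 on.
            results.add(x[:i + len(u1)] + y[j + len(u3):])
    return results


def bounded_splicing_closure(initial, rules, max_len):
    """Iterate splicing on `initial`, keeping only words of length <= max_len."""
    language = set(initial)
    changed = True
    while changed:
        new_words = set()
        for x, y in product(language, repeat=2):
            for rule in rules:
                for w in splice(x, y, rule):
                    if len(w) <= max_len and w not in language:
                        new_words.add(w)
        language |= new_words
        changed = bool(new_words)
    return language
```

For instance, with the single initial word "ab" and the hypothetical rule ("a", "b", "", "a"), one splicing step yields "aab", and `bounded_splicing_closure({"ab"}, [("a", "b", "", "a")], 4)` returns the three words "ab", "aab" and "aaab".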

Acknowledgments

We thank the reviewers for numerous valuable comments. P. Bonizzoni is partially supported by MIUR PRIN 2010–2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi”, code H41J12000190001. N. Jonoska is supported in part by the NSF grant CCF-1117254 and the NIH grant R01GM109459-01.

Contributor Information

Paola Bonizzoni, Email: bonizzoni@disco.unimib.it.

Nataša Jonoska, Email: jonoska@mail.usf.edu.

References

  • 1. Berstel J, Perrin D. Theory of Codes. Academic Press, Inc; Orlando, Florida: 1985.
  • 2. Bonizzoni P, De Felice C, Mauri G, Zizza R. Regular languages generated by reflexive finite linear splicing systems. Lect Notes Comput Sci; Proc. Development in Language Theory; Berlin: Springer; 2003. pp. 134–145.
  • 3. Bonizzoni P, De Felice C, Zizza R. The structure of reflexive regular splicing languages via Schützenberger constants. Theor Comput Sci. 2005;334(1–3):71–98.
  • 4. Bonizzoni P, Mauri G. Regular splicing languages and subclasses. Theor Comput Sci. 2005;340:349–363.
  • 5. Bonizzoni P. Constants and label-equivalence: a decision procedure for reflexive regular splicing languages. Theor Comput Sci. 2010;411(6):865–877.
  • 6. Bonizzoni P, Jonoska N. Regular splicing languages must have a constant. Lect Notes Comput Sci; Proc. Developments in Language Theory; Berlin: Springer; 2011. pp. 82–92.
  • 7. Černý J. Poznámka k homogénnym eksperimentom s konecnými automatami. Mat-Fyz čas Slov Akad Vied. 1964;14:208–216.
  • 8. Culik K, Harju T. Splicing semigroups of dominoes and DNA. Discrete Appl Math. 1991;31:261–277.
  • 9. De Luca A, Restivo A. A characterization of strictly locally testable languages and its application to semigroups of free semigroup. Inf Control. 1980;44:300–319.
  • 10. Goode E. PhD Thesis. Binghamton University; 1999. Constants and splicing systems.
  • 11. Goode E, Pixton D. Recognizing splicing languages: syntactic monoids and simultaneous pumping. Discrete Appl Math. 2007;155:989–1006.
  • 12. Head T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours. Bull Math Biol. 1987;49:737–759. doi: 10.1007/BF02481771.
  • 13. Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation. Addison–Wesley; Reading, Mass: 2001.
  • 14. Jonoska N. Sofic systems with synchronizing representations. Theor Comput Sci. 1996;158(1–2):81–115.
  • 15. Kari L, Kopecki S. Deciding if a regular language is generated by a splicing system. Lect Notes Comput Sci; Proc. DNA Computing and Molecular Programming – 18th International Conference; Berlin: Springer; 2012. pp. 98–109.
  • 16. Lind D, Marcus B. An Introduction to Symbolic Dynamics. Cambridge University Press; New York: 1995.
  • 17. Paun G. On the splicing operation. Discrete Appl Math. 1996;70:57–79.
  • 18. Paun G, Rozenberg G, Salomaa A. New Computing Paradigms. Springer-Verlag; Berlin: 1998. DNA Computing.
  • 19. Pixton D. Regularity of splicing languages. Discrete Appl Math. 1996;69:101–124.
  • 20. Schützenberger MP. Sur certaines opérations de fermeture dans le langages rationnels. Symp Math. 1975;15:245–253.
  • 21. Verlan S. PhD Thesis. University of Metz; 2004. Head systems and applications to bio-informatics.
