Existence of constants in regular splicing languages

Paola Bonizzoni; Nataša Jonoska

doi:10.1016/j.ic.2015.04.001

Author manuscript; available in PMC: 2016 Jun 1.

Published in final edited form as: Inf Comput. 2015 Jun;242:340–353. doi: 10.1016/j.ic.2015.04.001

PMCID: PMC4866503 NIHMSID: NIHMS691394 PMID: 27185985

Abstract

In spite of wide investigation of finite splicing systems in formal language theory, basic questions, such as their characterization, remain unsolved. It has been conjectured that a necessary condition for a regular language L to be a splicing language is that L must have a constant in the Schützenberger sense. We prove this longstanding conjecture to be true. The result is based on properties of strongly connected components of the minimal deterministic finite state automaton for a regular splicing language. Using constants of the corresponding languages, we also provide properties of transitive automata and path-automata.

1. Introduction

A splicing system, originally introduced in [12], is a formal model that uses a contextual cross-over operation over words to generate languages called splicing languages. This cross-over splicing formalizes the behavior of basic biomolecular processes involving cut and paste of DNA performed by restriction enzymes and a ligase. Restriction enzymes act on double stranded DNA molecules by cleaving certain recognized segments, leaving short single stranded overhangs. Molecules with the same overhangs can join (in a cross-over fashion) in the presence of a ligase enzyme. In the introductory paper, T. Head proved that if the splicing is performed by a finite set of certain simple rules, then splicing of a finite set of words can generate the class of strictly locally testable languages [9]. The splicing notion was reformulated by G. Paun at a less restrictive level of generality, giving rise to the splicing operation that is commonly adopted and is now the standard [17].

Theoretical results on splicing systems have contributed to new research in formal language theory focused on modeling of biochemical processes [18]. Conversely, the field has suggested new ideas in biomolecular science, for example, the design of automated enzymatic processes.

In this paper, we focus on finite splicing systems, called here simply splicing systems. A splicing system is meant to have a finite set of rules (modeling enzymes) applied on a finite set of initial strings (modeling DNA sequences). A splicing system (or H-system) is a triple H = (A, I, R), where A is a finite alphabet, I ⊆ A^* is the initial language and R is the set of rules (see Section 4 for the definitions). The formal language generated by the splicing system is the smallest language containing I and closed under the splicing operation.

There have been successes in characterizing certain subclasses of splicing languages, for example those generated by reflexive rules and those generated by symmetric rules [2]. Reflexivity and symmetry are natural properties for splicing systems because they assure splicing of molecules cut with the same enzyme, as well as recombining molecules resulting from the same type of cut [12]. The formal language of a general splicing system may have a set of rules R that is neither symmetric nor reflexive. Under the formal model, a splicing system is a generative mechanism for a language which belongs to a proper subclass of the regular languages. This basic result was first proved in [8], and later proved in several other papers by using different approaches (see for example [19,21]).

In spite of the vast literature on the topic, a structural characterization of the finite splicing systems is still an open problem, although decidability of regular splicing languages has been recently proved in [15].

On the other hand, progress has been made towards the characterization of certain sub-classes of splicing systems. The authors of [11] prove that it is decidable whether a regular language is a reflexive splicing language and provide an example of a regular splicing language that is neither reflexive nor symmetric. A quite different characterization of reflexive symmetric splicing languages is given in [3] and it has been extended to the general class of reflexive regular languages in [4,5]. This characterization has been given by using the concept of a constant of a language introduced by Schützenberger [20].

In order to solve the open problem of characterizing the whole class of splicing languages, it seems necessary to understand the role of constants. Indeed, since the introduction of splicing languages it has been conjectured, and stated more formally in [10] and [11], that the existence of a constant is a necessary condition for a regular language to be a splicing language. In this paper we solve this longstanding open question by proving this conjecture true. This result is proved by investigating structural properties of connected components of the transition graph given by the minimal finite state automaton for a regular splicing language. More precisely, properties of the factor language of transitive components are related to the notion of synchronizing words [7]. Synchronizing words have been studied in automata theory for a long time and are of interest in both coding theory [1] and symbolic dynamics [16,14]. Our proof uses an old observation that a synchronizing word for an automaton is a constant for the language recognized by the automaton [20].

The paper is organized as follows.

In Section 2 we introduce preliminary concepts, including the notion of a synchronizing word and a constant. In Section 3 we introduce the notion of a transitive automaton and a path-automaton, and show several results connecting terminal components of automata with synchronizing words. Moreover, we show a relationship between transitive languages, transitive automata, transitive components, and constants of the language. Then in Section 4 we recall the basic notion of a splicing system and revisit the notion of splicing rules of a splicing system by providing properties that are necessary in proving the main result of the paper. Finally, in Section 5 we give examples of non-reflexive splicing languages, show a relationship between transitive languages and splicing languages, and prove the main result of the paper. A preliminary extended abstract of this paper appeared in [6].

2. Preliminaries

We refer the reader to [13] for background on automata theory, and assume some familiarity with the subject. Let A^* be the free monoid over a finite alphabet A and let A⁺ = A^* \ {1}, where 1 is the empty word. A deterministic finite state automaton (DFA) is a 5-tuple 𝒜 = (Q, A, I, T, δ), where Q is a finite set of states, I ⊆ Q is the set of initial states, T ⊆ Q is the set of terminal (final) states and δ ⊆ Q × A × Q is the set of transitions such that for every q ∈ Q and every a ∈ A the set {q′ | (q, a, q′) ∈ δ} consists of at most one element. Given a deterministic finite state automaton 𝒜, the set of transitions δ defines a partial action of A^* on Q. It is generated by the maps a : Q → Q for a ∈ A defined by q(a) = q′ iff q′ ∈ Q is the unique state with (q, a, q′) ∈ δ. We use the standard notation qa to denote q′. If such q′ does not exist, we write qa = ∅. Inductively, we extend the notation to words with qwa = (qw)a. Similarly, we write Qw for the image of the set Q under the map w : Q → Q defined by w(q) = qw. If qa is defined for all q ∈ Q and a ∈ A we say that 𝒜 is complete. A deterministic finite state automaton is usually depicted as a directed graph with vertices Q and the set of directed edges δ. For an edge e = (q, a, q′) we say that q is its “start” state, q′ is its “end” state (also referred to as an end-point) and a is its label. A word w is accepted by an automaton 𝒜 if there is a path with label w that starts at an initial state and ends at a terminal state. We denote by L(𝒜) the language recognized by 𝒜, that is, the set of all words accepted by 𝒜 [13]. Given a regular language L ⊆ A^* it is well known that there is a unique minimal complete deterministic finite state automaton (mDFA) $\hat{\mathcal{A}}$ = (Q, A, {q₀}, T, δ) that recognizes L such that all other complete DFAs with one initial state that recognize L map homomorphically onto $\hat{\mathcal{A}}$ [13]. This automaton is unique up to possible renaming of the states, i.e., up to an isomorphism. We reserve the notation $\hat{\mathcal{A}}(L)$ to denote this automaton.
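The partial action of words on states can be made concrete with a minimal Python sketch. The dictionary encoding of the transitions and the helper names `act`, `image` and `accepts` are illustrative assumptions, not notation from the paper; `None` plays the role of an undefined state (qa = ∅).

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

State = Hashable
# Transitions of a DFA as a partial map (q, a) -> q'; a missing key means qa is undefined.
Delta = Dict[Tuple[State, str], State]


def act(delta: Delta, q: Optional[State], w: str) -> Optional[State]:
    """Return the state qw, or None when qw is undefined."""
    for a in w:
        if q is None:
            return None
        q = delta.get((q, a))
    return q


def image(delta: Delta, states: Iterable[State], w: str) -> Set[State]:
    """Return the image of a set of states under w, dropping undefined results."""
    results = (act(delta, q, w) for q in states)
    return {p for p in results if p is not None}


def accepts(delta: Delta, initial: Set[State], terminal: Set[State], w: str) -> bool:
    """A word is accepted if w takes some initial state to a terminal state."""
    return any(act(delta, q, w) in terminal for q in initial)


# Hypothetical example: a (non-complete) DFA for (ab)^* with states 0 and 1.
delta_ab: Delta = {(0, "a"): 1, (1, "b"): 0}
print(act(delta_ab, 0, "ab"))                # the state reached from 0 by "ab"
print(image(delta_ab, {0, 1}, "a"))          # the image of {0, 1} under "a"
print(accepts(delta_ab, {0}, {0}, "abab"))   # membership of "abab"
```

Storing only the defined transitions keeps the action partial, which matches the convention that an automaton need not be complete.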

Given a language L, the language F(L) is the set of all factors of words in L, where x is a factor of a word w if w = zxy for z, y ∈ A^*. We say L is factor-closed if F(L) = L.

The right context of a word w ∈ A^* with respect to a language L is defined by R_L(w) = {x ∈ A^* | wx ∈ L}. Symmetrically, the left context of w with respect to L is the set {x ∈ A^* | xw ∈ L}.

The right context of a state q in 𝒜 is R_𝒜(q) = {x ∈ A^* | qx ∈ T}. An automaton 𝒜 is said to be reduced if there are no two states in 𝒜 with the same right context. Observe that the right context depends only on the terminal states in the automaton. In other words, if the initial state(s) are changed in 𝒜 but the transitions and the set of terminal states remain, the right contexts of the states do not change. It is well known (see e.g. [13]) that given a regular language L, there is a one-to-one correspondence between the right contexts of words with respect to L and the right contexts of the states in the minimal deterministic finite state automaton $\hat{\mathcal{A}}$ for L, i.e.,

$$q_{0} w = q \quad \text{iff} \quad R_{L}(w) = R_{\hat{\mathcal{A}}}(q).$$

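In fact, in the mDFA $\hat{\mathcal{A}}$, it also holds that $R_{L}(w) = R_{\hat{\mathcal{A}}}(q)$ iff $R_{L}(wa) = R_{\hat{\mathcal{A}}}(qa)$ for all a ∈ A, and therefore $R_{\hat{\mathcal{A}}}(q) = R_{\hat{\mathcal{A}}}(q')$ implies q = q′.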

When the language and the DFA are fixed, we drop the subscripts and write R(w) and R(q).

Note that every state in an mDFA is accessible, i.e., for each state q ∈ Q there is an x ∈ A^* such that q₀x = q. A state q is co-accessible if R(q) ≠ ∅. In an mDFA, there is at most one state that is not co-accessible, since for each q ∈ Q, there is u ∈ A^* such that qu ∈ T iff R(q) ≠ ∅. If such a state in $\hat{\mathcal{A}}$ exists, we call it zero and denote it by z. A trimmed mDFA for a language L is the DFA obtained from the mDFA for L by erasing the state z and all transitions that terminate in z. The trimmed mDFA is denoted trim $\hat{\mathcal{A}}$.
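Trimming can be sketched in a few lines of Python: keep exactly the states that are both accessible and co-accessible, together with the transitions among them. The encoding of transitions as a dictionary and the function names are assumptions made for this illustration, and the example automaton is a hypothetical complete DFA for (ab)^* with zero state "z".

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]


def accessible(delta: Delta, initial: Set[State]) -> Set[State]:
    """States reachable from some initial state."""
    seen, stack = set(initial), list(initial)
    while stack:
        p = stack.pop()
        for (s, _a), t in delta.items():
            if s == p and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def co_accessible(delta: Delta, terminal: Set[State]) -> Set[State]:
    """States from which some terminal state is reachable (non-empty right context)."""
    seen, stack = set(terminal), list(terminal)
    while stack:
        p = stack.pop()
        for (s, _a), t in delta.items():
            if t == p and s not in seen:
                seen.add(s)
                stack.append(s)
    return seen


def trim(delta: Delta, initial: Set[State], terminal: Set[State]):
    """Return the kept states and the transitions restricted to them."""
    keep = accessible(delta, initial) & co_accessible(delta, terminal)
    kept_delta = {(s, a): t for (s, a), t in delta.items() if s in keep and t in keep}
    return keep, kept_delta


# Complete DFA for (ab)^*: state 0 is initial and terminal, "z" is the zero state.
complete_ab: Delta = {
    (0, "a"): 1, (0, "b"): "z",
    (1, "a"): "z", (1, "b"): 0,
    ("z", "a"): "z", ("z", "b"): "z",
}
trimmed_states, trimmed_delta = trim(complete_ab, {0}, {0})
print(trimmed_states, trimmed_delta)
```

In this example the only state removed is "z", in line with the remark that an mDFA has at most one state that is not co-accessible.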

More generally, a trimmed DFA 𝒜 is an automaton in which all states are both accessible and co-accessible.

Finally, for a finite set S, by #S, we denote the cardinality of the set S.

Definition 1

Given a DFA 𝒜 and a state q of the automaton, the set of follower words for q relative to 𝒜 is the set F_𝒜(q) = {x | qx ≠ ∅}.

For states q and q′ of 𝒜, we say that they are follower-equivalent if F_𝒜(q) = F_𝒜(q′). For a state q the set of states in 𝒜 that are follower-equivalent to q is denoted μ_q(𝒜).

For a state q of 𝒜 we say that it is minimal-follower with respect to 𝒜 if whenever F_𝒜(q′) ⊆ F_𝒜(q) for a state q′ of 𝒜, it implies that q and q′ are follower-equivalent.

Recall the definition of a constant of a language L introduced by Schützenberger in [20].

Definition 2

A word w ∈ A⁺ is a constant of a language L if w is a factor of some word in L and for all words u₁, u₂, v₁, v₂ in A^* we have:

$$\begin{array}{l} u_{1} w u_{2} \in L \\ v_{1} w v_{2} \in L \end{array} \;\Rightarrow\; \begin{array}{l} u_{1} w v_{2} \in L \\ v_{1} w u_{2} \in L \end{array}$$

A characterization of constants, which is more or less folklore, is stated below.

Proposition 1

Let L ⊆ A^* be a regular language and let $\hat{\mathcal{A}}$ be the mDFA recognizing L. A word w ∈ A⁺ is a constant of L if and only if Qw \ {z} is a singleton, i.e., there is a unique non-zero state q_w such that qw ≠ z implies qw = q_w for all q ∈ Q.

Suppose w is a label of a path in a finite state automaton. If there is a state q_w such that every path in the automaton with label w terminates in q_w, we say that w is a synchronizing word and that q_w is a synchronizing state, synchronized by w. By Proposition 1, in a trimmed mDFA, trim $\hat{\mathcal{A}}$, of a regular language L, the set of synchronizing words for trim $\hat{\mathcal{A}}$ coincides with the set of constants of L. In general, if w is a synchronizing word for an automaton 𝒜 then it is a constant for the language recognized by 𝒜.
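The correspondence between constants and synchronizing words can be checked on small examples. The following Python sketch tests whether a word is synchronizing for a trimmed DFA and, separately, searches contexts of bounded length for a witness that a word violates the definition of a constant. The function names, the bound n, and the two toy automata (trimmed mDFAs for (ab)^* and (aa)^*) are assumptions made for this illustration; the bounded search can refute constancy but cannot prove it.

```python
from itertools import product


def act(delta, q, w):
    """Return qw, or None when qw is undefined."""
    for a in w:
        if q is None:
            return None
        q = delta.get((q, a))
    return q


def is_synchronizing(delta, states, w):
    """True when all paths labelled w end in one and the same state."""
    image = {act(delta, q, w) for q in states} - {None}
    return len(image) == 1


def words_up_to(alphabet, n):
    """All words over the alphabet of length at most n."""
    for k in range(n + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)


def find_violation(member, alphabet, w, n):
    """Look for (u1, u2, v1, v2) with u1 w u2, v1 w v2 in L but u1 w v2 not in L."""
    contexts = list(words_up_to(alphabet, n))
    pairs = [(u, v) for u in contexts for v in contexts if member(u + w + v)]
    for u1, u2 in pairs:
        for v1, v2 in pairs:
            if not member(u1 + w + v2):
                return (u1, u2, v1, v2)
    return None


# Trimmed mDFA for (ab)^*: the word "a" sends every state to state 1.
delta_ab = {(0, "a"): 1, (1, "b"): 0}
in_ab = lambda x: act(delta_ab, 0, x) == 0
print(is_synchronizing(delta_ab, {0, 1}, "a"), find_violation(in_ab, "ab", "a", 3))

# Trimmed mDFA for (aa)^*: every word permutes the two states.
delta_aa = {(0, "a"): 1, (1, "a"): 0}
in_aa = lambda x: act(delta_aa, 0, x) == 0
print(is_synchronizing(delta_aa, {0, 1}, "a"), find_violation(in_aa, "a", "a", 2))
```

Because the search ranges over all ordered pairs of contexts, checking only u₁wv₂ suffices: the symmetric condition on v₁wu₂ is covered when the two pairs are visited in the opposite order.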

The context of w with respect to L is the set C_L(w) = {(u, v) | u, v ∈ A^*, uwv ∈ L}. We define the left projection of the context of w (resp. right projection) as the set $C_{L}^{\ell}(w) = \{u \mid (u, v) \in C_{L}(w)\}$ (respectively $C_{L}^{r}(w) = \{v \mid (u, v) \in C_{L}(w)\}$). A constant w of L defines a constant language Const(w) with respect to the language L by $\mathrm{Const}(w) = C_{L}^{\ell}(w)\, w\, C_{L}^{r}(w)$. Given two constants w₁ and w₂ of L, a split language for w₁ and w₂ with respect to L is a language $\mathrm{Split}(w_{1}, w_{2}) = C_{L}^{\ell}(w_{1})\, w_{1}'\, w_{2}''\, C_{L}^{r}(w_{2})$ where $w_{1}'$ is a prefix (possibly empty) of w₁ and $w_{2}''$ is a suffix (possibly empty) of w₂.

3. Transitive components and synchronizing words

In this section we provide structural characterizations of transitive components in a minimal DFA using the notion of synchronizing words. We define the notions of a transitive automaton and of a path-automaton, and give properties that are used to prove the main result of the paper.

We first introduce definitions and properties that are used in the rest of the paper.

Recall the notion of a transitive component in a deterministic automaton. A strongly connected component of the directed graph for a deterministic automaton 𝒜 is called a transitive component for 𝒜. If every edge that starts at a state in a transitive component also ends in the same component, then the transitive component is called terminal. For every state in the mDFA of a language L, there is a path that leads from that state to a terminal component. For a transitive component 𝒞, we say that 𝒞 is induced by q if q is a state in 𝒞. We write L(𝒞) for the set of labels of all paths in 𝒞 and say that 𝒞 recognizes L(𝒞). A transitive component 𝒞 is called trivial if L(𝒞) = {1}.
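For small automata, transitive components and the terminal and trivial ones among them can be computed directly from the definitions. The Python sketch below groups states by mutual reachability, which is a simple quadratic stand-in for a linear-time strongly connected components algorithm; all names and the example automaton are hypothetical.

```python
def reachable(delta, q):
    """States reachable from q, including q itself."""
    seen, stack = {q}, [q]
    while stack:
        p = stack.pop()
        for (s, _a), t in delta.items():
            if s == p and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def transitive_components(delta, states):
    """Partition the states into classes of mutually reachable states."""
    reach = {q: reachable(delta, q) for q in states}
    components, assigned = [], set()
    for q in states:
        if q not in assigned:
            component = frozenset(p for p in reach[q] if q in reach[p])
            components.append(component)
            assigned |= component
    return components


def is_terminal(delta, component):
    """Every edge starting in the component also ends in it."""
    return all(t in component for (s, _a), t in delta.items() if s in component)


def is_trivial(delta, component):
    """The component has no edges at all, so its only path label is the empty word."""
    return not any(s in component and t in component for (s, _a), t in delta.items())


# Example: a loop at 0, a two-state cycle {1, 2}, and a state 3 without outgoing edges.
example_delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (2, "a"): 1, (2, "b"): 3}
example_states = {0, 1, 2, 3}
for component in transitive_components(example_delta, example_states):
    print(sorted(component), is_terminal(example_delta, component), is_trivial(example_delta, component))
```

In the example the component {3} is both terminal and trivial, while {1, 2} is not terminal because of the edge from 2 to 3.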

A language L is said to be transitive if for every pair of words u, v ∈ L there is a word w ∈ A^* such that uwv ∈ L. Note that for a transitive component 𝒞 the language L = L(𝒞) is transitive.

Remark 1

Notice that if 𝒞 is a transitive component, then L(𝒞) is factor-closed, i.e., F(L(𝒞)) = L(𝒞).

Two transitive components 𝒞 and 𝒞′ are called factor-equivalent if L(𝒞) = L(𝒞′). In the following we often use the term component to denote a transitive component.

A component 𝒞 is said to be maximal for a collection of components C if for every transitive component 𝒞′ in C, we have that L(𝒞) ⊆ L(𝒞′) implies L(𝒞) = L(𝒞′). Analogously, a transitive component 𝒞 is called minimal for a collection C if whenever L(𝒞′) ⊆ L(𝒞) for a component 𝒞′ in C, we have L(𝒞′) = L(𝒞).

3.1. Transitive automata

In this section we relate the notion of a synchronizing word to properties of a transitive automaton. An automaton is called transitive if it consists of only one transitive component.

Remark 2

Note that if 𝒜 is transitive, then L(𝒜) is also transitive. Consider two words u, v ∈ L(𝒜). There are initial states q₀ and $q_{0}'$ such that q₀u ≠ ∅ and $q_{0}' v \neq \emptyset$. Since 𝒜 is transitive, there is a word w that is a label of a path from q₀u to $q_{0}'$ in 𝒜, so uwv ∈ L(𝒜).

Example 3.1

Consider the example shown in Fig. 1. This language is transitive, the automaton is reduced and deterministic, hence it is the mDFA for the language. However, there is no deterministic transitive automaton that recognizes this language. Notice that this language has no constants.

Remark 3

If L is transitive such that L = L(𝒞) for a transitive component 𝒞, then for each state q in 𝒞, F_𝒞(q) = R_𝒞(q) since all states in 𝒞 are terminal.

We consider several observations about transitive automata, transitive components and languages. The following observations are proved in [14] (see also [16]):

Lemma 2

For every regular factor-closed transitive language L there is a unique minimal deterministic transitive automaton $\hat{\mathcal{A}}$ recognizing L.

Lemma 3

For a regular factor-closed transitive language L and its unique minimal deterministic transitive automaton $\hat{\mathcal{A}}$ the following properties hold:

(i) Every state in $\hat{\mathcal{A}}$ is synchronizing.
(ii) A word w ∈ L is a constant for L if and only if w is synchronizing for $\hat{\mathcal{A}}$.
(iii) Every two states q̂ and p̂ in $\hat{\mathcal{A}}$ (q̂ ≠ p̂) are not follower-equivalent.
(iv) For every transitive DFA 𝒜 with L(𝒜) = L there is an onto homomorphism ϕ : 𝒜 → $\hat{\mathcal{A}}$ such that for every state q̂ in $\hat{\mathcal{A}}$, the follower set of q in 𝒜 equals the follower set of q̂ in $\hat{\mathcal{A}}$ for each q ∈ ϕ⁻¹(q̂).

Observe that if a state q of a transitive component 𝒞 is synchronizing, then all states in 𝒞 are synchronizing.

Consider the action of A^* on the set of states Q_𝒜 of 𝒜. In order to simplify the notation, the image of the set Q_𝒜 under w is denoted 𝒜w instead of Q_𝒜w, and moreover we say that q is a state of 𝒜 if q ∈ 𝒜.

Remark 4

If c is a constant of L(𝒜) for a transitive automaton 𝒜, and $\hat{\mathcal{A}}$ is the minimal transitive deterministic automaton for L(𝒜) such that c synchronizes onto q̂, then by Remark 3 and Lemma 3(ii–iv) every state q in 𝒜c maps with ϕ onto q̂, and has the same follower set as q̂. We say that q is follower-equivalent to q̂. In particular, if c is a constant such that qc = q in 𝒜, then q̂c = q̂ in $\hat{\mathcal{A}}$, and for every q′ ∈ 𝒜c, the state q′c is in 𝒜c and is follower-equivalent to q̂, and to q.

Remark 5

If q is a state in a transitive automaton 𝒜 and $\hat{\mathcal{A}}$ is the minimal transitive deterministic automaton as in Lemma 3 with q̂ ∈ $\hat{\mathcal{A}}$ follower-equivalent to q, then for all q′ ∈ μ_q(𝒜) there are constants c, c′ ∈ L(𝒜) such that q̂c = q̂c′ = q̂, q′c = q and qc′ = q′. Take any constant c₁ ∈ L(𝒜) such that q̂c₁ = q̂. Then by transitivity there are x, y such that qc₁x = q′ and q′c₁y = q. By Remark 4, c = c₁y and c′ = c₁x are the constants sought.

If 𝒜 is a transitive deterministic automaton without a synchronizing word, then there is some k ≥ 2 such that #𝒜w ≥ k for every word w ∈ L(𝒜). We call the largest such k, that is, the minimum of #𝒜w over all words w ∈ L(𝒜), the degree of 𝒜. A word w such that #𝒜w = k is called k-synchronizing. By Lemma 3, the minimal transitive DFA for L(𝒜) has degree 1, and all its constants coincide with (1-)synchronizing words. It follows from Lemmas 2 and 3 that if 𝒜 has degree k and w is a constant that is k-synchronizing, then all states in 𝒜w are follower-equivalent. Moreover, in that case for all x ∈ A^* with 𝒜wx ≠ ∅, wx is also k-synchronizing.
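The degree can be computed for small automata by exploring the images of the full state set under all words. The Python sketch below reads the degree as the smallest cardinality of a non-empty image, which is an interpretation of the definition above; the function name and both example automata are hypothetical.

```python
from collections import deque


def degree(delta, states, alphabet):
    """Smallest size of a non-empty image of the state set under a word."""
    start = frozenset(states)
    seen = {start}
    queue = deque([start])
    best = len(start)
    while queue:
        current = queue.popleft()
        best = min(best, len(current))
        for a in alphabet:
            # Image of the current subset under the letter a (partial action).
            nxt = frozenset(delta[(q, a)] for q in current if (q, a) in delta)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best


# The letter a swaps the two states and b fixes both: no word merges them.
swap_delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
print(degree(swap_delta, {0, 1}, "ab"))

# Two-state cycle reading ab: the letter a already maps both states into state 1.
cycle_delta = {(0, "a"): 1, (1, "b"): 0}
print(degree(cycle_delta, {0, 1}, "ab"))
```

The search visits each reachable subset of states once, so it terminates, although the number of subsets can be exponential in the number of states.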

The following lemma relates the right contexts of states in a transitive automaton reached by reading a word w that is k-synchronizing.

Lemma 4

Let 𝒞 = (Q, A, Q, Q, δ) be a transitive DFA with degree k ≥ 2 and let 𝒜 = (Q, A, I, T, δ) be a reduced DFA obtained from 𝒞 by choosing a subset of states I as initial states, and some proper subset T of Q as terminal states. If w is k-synchronizing and q, q′ ∈ 𝒞w, then R_𝒜(q) \ R_𝒜(q′) ≠ ∅.

Proof

Let k ≥ 2 be the degree of 𝒞, w a k-synchronizing word, and $q_{1}, q_{1}' \in \mathcal{C} w$. Since w is k-synchronizing, for all words z ∈ A^*, either both q₁z and $q_{1}' z$ are undefined or $q_{1} z \neq q_{1}' z$. In the rest of the proof we drop the subscript 𝒜 from R_𝒜. Suppose $R(q_{1}) \subseteq R(q_{1}')$. Since 𝒜 is reduced, there is a word x₁ such that $q_{1}' x_{1} \in T$ and q₁x₁ ∉ T. Set q₂ = q₁x₁ and $q_{2}' = q_{1}' x_{1}$. Then $R(q_{2}) \subseteq R(q_{2}')$ because otherwise, if $y \in R(q_{2}) \setminus R(q_{2}')$ then $x_{1} y \in R(q_{1}) \setminus R(q_{1}')$, which contradicts our assumption that $R(q_{1}) \subseteq R(q_{1}')$. We have that $q_{2} \neq q_{2}'$. Since 𝒞 is transitive, there is x₂ such that $q_{2} x_{2} = q_{2}'$. Denote $q_{2}'$ by q₃, and set $q_{3}' = q_{2}' x_{2}$. Again, similarly as with q₂ and $q_{2}'$, we have that $R(q_{3}) \subseteq R(q_{3}')$, which implies that both $q_{2}' = q_{3}$ and $q_{3}'$ are in T. In fact, $R(q_{2}) \subseteq R(q_{2}') = R(q_{3}) \subseteq R(q_{3}')$. We continue in this way and consider the pairs of states $(q_{2}, q_{2}'), (q_{3}, q_{3}'), \dots, (q_{i}, q_{i}'), \dots$ where $q_{i-1}' = q_{i}$. Since 𝒜 is finite, there are i and j such that $(q_{i}, q_{i}') = (q_{j}, q_{j}')$ for some i < j. But $R(q_{i}) \subseteq R(q_{i}') = R(q_{i+1}) \subseteq R(q_{i+1}') \subseteq \dots \subseteq R(q_{j}) = R(q_{i})$. Because $q_{i} \neq q_{i}'$, this contradicts the assumption that 𝒜 is reduced. Therefore, $R(q_{1}) \setminus R(q_{1}') \neq \emptyset$.

Example 3.2

Consider the reduced automaton in Fig. 2(a). It contains two terminal components that are mutually factor-equivalent, both recognizing a^*. Moreover, the states q₁, q₂ and q₃ in Fig. 2(a) are follower-equivalent, and the component that contains these three states has degree 3. Every word a^k for k ≥ 0 is 3-synchronizing. Consider the automaton 𝒜 that consists of {q₁, q₂, q₃} with q₁ being the initial and also the terminal state. Then R_𝒜(q₁) = (a³)^*, R_𝒜(q₂) = aa(a³)^* and R_𝒜(q₃) = a(a³)^*.
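The right contexts listed in this example can be verified mechanically. The Python sketch below assumes the three states form the cycle q₁ → q₂ → q₃ → q₁ on the letter a; this orientation is inferred from the stated right contexts, since Fig. 2(a) is not reproduced here, and the helper names are hypothetical.

```python
# Assumed transitions of the three-state component (inferred, not read off the figure).
cycle = {("q1", "a"): "q2", ("q2", "a"): "q3", ("q3", "a"): "q1"}
cycle_states = {"q1", "q2", "q3"}
cycle_terminal = {"q1"}


def after(q, n):
    """State reached from q by reading the word a^n."""
    for _ in range(n):
        q = cycle[(q, "a")]
    return q


def in_right_context(q, n):
    """True when a^n belongs to the right context of q."""
    return after(q, n) in cycle_terminal


def image_size(n):
    """Number of distinct states reached from the component by a^n."""
    return len({after(q, n) for q in cycle_states})


for n in range(7):
    print(n, image_size(n), [q for q in sorted(cycle_states) if in_right_context(q, n)])
```

Since a acts as a permutation of the three states, every image has size 3, which agrees with the claim that each word a^k is 3-synchronizing.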

Example 3.3

The transitive component 𝒞 of the automaton in Fig. 2(b) has degree 2. All words that end with symbol a are labels of paths that end in states q₂ and q₃, and all words that end with symbol b are labels of paths that end in states q₁ and q₂. The action of c on the states of 𝒞 is the identity. Hence every word that contains symbols a or b is 2-synchronizing. Note that all states are follower-equivalent but 𝒞 is reduced. Moreover, a ∈ R(q₁) \ R(q₂) and aa ∈ R(q₂) \ R(q₁), also aa ∈ R(q₂) \ R(q₃) and a ∈ R(q₃) \ R(q₂). However, c ∈ R(q₁) \ R(q₃), but R(q₃) ⊂ R(q₁). This last condition does not violate Lemma 4 because there are no 2-synchronizing words that label paths ending at states q₁ and q₃.

3.2. Path-automata

In this section we provide structural characterizations of path-automata that do not have synchronizing words. More precisely, we show that a path-automaton having no synchronizing words has a unique maximal component, which is the terminal one, whose language contains all factors of the language accepted by the path-automaton.

Definition 3 (Path-automaton)

An automaton 𝒜 with an initial state q₀ is called a path-automaton if the following conditions are satisfied:

(i) There is at most one transition in 𝒜 which starts at the component induced by q₀ and terminates in another component.
(ii) There is only one terminal transitive component in 𝒜.
(iii) For every transitive component 𝒞 which does not contain q₀ there is precisely one transition that starts in a state outside 𝒞 but terminates in 𝒞, and if 𝒞 is not terminal, there is precisely one transition that starts at a state in 𝒞 but terminates in a state outside 𝒞.

Let 𝒜 be a path-automaton and 𝒞 one of its transitive components. The state of 𝒞 that is the end point of the transition starting outside 𝒞 but ending at 𝒞 is called the entrance state for 𝒞, and the state that is the start point of the transition that starts in 𝒞 but terminates outside 𝒞 is called the exit of 𝒞. The initial component of 𝒜 has no entrance, and the terminal component has no exit.

A path π from an initial state in an automaton 𝒜 to a terminal component in 𝒜 induces a path-automaton 𝒜_π which consists of all transitive components in 𝒜 induced by states visited by π.

Let 𝒞 be the terminal component of the path-automaton 𝒜_π, and let q be the entrance of 𝒞. We define the language accepted by the component 𝒞 induced by the path π, denoted by L_π(𝒞), as the language accepted by the automaton 𝒞 with initial state q.

Lemma 5

Every trimmed deterministic path-automaton with two transitive components, and whose terminal component is trivial, has a synchronizing word.

Proof

Let q₀ be an initial state for a path-automaton 𝒜 with two transitive components having a trivial terminal component. Let x be the label of the edge starting from the initial component (say at state s) and ending at the terminal component (say at state t). Let 𝒞 be the initial component of 𝒜, and let k be the degree of 𝒞. Let w be k-synchronizing for 𝒞. Because 𝒞 is transitive, we can extend w such that there is a path with label w that ends at s, i.e., #𝒞w = k and s ∈ 𝒞w. Since k is the degree of 𝒞, for all z ∈ A^*, either wz ∉ L(𝒞) or 𝒞wz ⊆ 𝒞 with #𝒞wz = k. As 𝒜 is deterministic, there is only one edge starting at s with label x and this edge leads outside 𝒞. Therefore wx ∉ L(𝒞), but wx ∈ F(L(𝒜)). Hence wx is a synchronizing word that synchronizes onto t.

We first give a technical lemma that is used later.

Lemma 6

Given a deterministic path-automaton 𝒜, let 𝒞 be the terminal component of 𝒜. If 𝒜 has no synchronizing word, then 𝒞 is the unique maximal transitive component in 𝒜.

Proof

Assume 𝒜 is an automaton with transition function δ that has no synchronizing words. Let 𝒞₁, …, 𝒞_k = 𝒞 be the transitive components of 𝒜. Let $q_{i}^{in}$ and $q_{i}^{out}$ be the states in the component 𝒞_i (i = 2, …, k − 1) such that $q_{i}^{in}$ is the entrance state of 𝒞_i and $q_{i}^{out}$ is the exit state of 𝒞_i. For i = 1 we only have $q_{1}^{out}$ and for i = k we only have $q_{k}^{in}$. We set $q_{1}^{in} = q_{0}$ for the initial state q₀, and $q_{k}^{out} = q'$ for a fixed terminal state q′. For i = 1, …, k − 1, let x_i be the label of the transition from state $q_{i}^{out}$ to state $q_{i+1}^{in}$, i.e., $q_{i}^{out} x_{i} = q_{i+1}^{in}$. Consider L(𝒞₁), …, L(𝒞_k). Because these languages are all transitive, there is a maximal transitive language among them. Assume L₁, …, L_s are all the distinct maximal transitive languages such that for each j = 1, …, s, there is a transitive component 𝒞_i with L_j = L(𝒞_i). Then for each i = 1, …, s, there are words w_i ∈ L_i such that w_i ∉ L_j if i ≠ j (as L_i is maximal, for each j ≠ i there is $w_{i_{j}} \in L_{i} \setminus L_{j}$, and due to the transitivity of L_i, there are $z_{i_{j}}$ such that $w_{i} = w_{i_{1}} z_{i_{1}} w_{i_{2}} \cdots z_{i_{s-1}} w_{i_{s-1}}$). Note that for each language L_i there might be several transitive components that recognize it.

We consider words y_i (i = 1, …, k) such that y_i is a label of a path in 𝒞_i from $q_{i}^{in}$ to $q_{i}^{out}$, chosen in the following way. (i) If L(𝒞_i) = L_j is a maximal transitive language, then w_j is a factor of y_i, and y_i is a constant for L(𝒞_i) which uniquely determines the follower-equivalence class of $q_{i}^{out}$, meaning, $R_{\mathcal{C}_{i}}(q_{i}^{out}) = R_{L(\mathcal{C}_{i})}(y_{i})$. This is always possible by Lemma 3 and the transitivity of L(𝒞_i). (ii) If L(𝒞_i) is not a maximal transitive language, then y_i is a label of a shortest path between $q_{i}^{in}$ and $q_{i}^{out}$.

Consider the word $y_{1} x_{1} \cdots y_{k-1} x_{k-1} y_{k}$. Let p be the smallest index among 1, …, k such that L(𝒞_p) is maximal transitive and r be the largest index such that L(𝒞_r) is maximal transitive. Then $u = y_{p} x_{p} y_{p+1} \cdots x_{r-1} y_{r}$ is a word that starts at a maximal transitive component, visits all maximal transitive components, and terminates at the last maximal transitive component. Since 𝒜 has no synchronizing words, there must be at least one more path in 𝒜 with label u. But, by the choice of y_i and Lemma 5, every path with label $y_{p} x_{p}$ must start in a transitive component recognizing L(𝒞_p) and must have a transition with label x_p leading outside the component, because $y_{p} x_{p} \notin L(\mathcal{C}_{p})$. Let i₁, i₂, …, i_ν be all indexes between p and r such that i₁ = p, i_ν = r and $L(\mathcal{C}_{i_{j}})$ is a maximal transitive language. By the choice of the y_i’s, $y_{i_{1}} = y_{p}, y_{i_{2}}, \dots, y_{i_{\nu}} = y_{r}$ uniquely determine the languages of the transitive components $\mathcal{C}_{i_{1}}, \dots, \mathcal{C}_{i_{\nu}}$, that is, $y_{i_{j}} \in L(\mathcal{C}_{i_{j}})$ but $y_{i_{j}} \notin L(\mathcal{C}_{i_{t}})$ if j ≠ t. Therefore, there is a one-to-one correspondence between the order of appearance of $y_{p}, y_{p+1}, \dots, y_{r}$ in u and the order of the maximal transitive components. Hence, the only possibility for the existence of another path with label u is that such a path also starts at 𝒞_p. Although there might be many paths with label y_p in 𝒞_p, by Lemma 3 they all end at follower-equivalent states, and due to determinism, there is at most one of those states that is the start of a transition with label x_p, and that is $q_{p}^{out}$ (by Lemma 5, $y_{p} x_{p}$ is synchronizing for $\mathcal{C}_{p} \cup \{q_{p+1}^{in}\}$). Hence, u (or ux_r) is a synchronizing word unless p = r = k, i.e., x_p does not exist. As we assumed that there are no synchronizing words for 𝒜, there is at most one maximal transitive language and it must be recognized by the terminal transitive component.

The following proposition characterizes a path-automaton with no synchronizing words.

Proposition 7

Given a deterministic path-automaton 𝒜, let 𝒞 be the terminal component of 𝒜. Then one of the following holds:

(a) 𝒜 has a synchronizing word, or
(b) F(L(𝒜)) = L(𝒞).

Proof

We prove the proposition by induction on the number of transitive components in the path-automaton. If 𝒜 consists of a single component, then the proposition holds trivially as 𝒜 = 𝒞. Now assume that the proposition holds for all path-automata with fewer than k transitive components, and suppose that 𝒜 has k transitive components 𝒞₁, …, 𝒞_k with 𝒞₁ being initial and 𝒞_k = 𝒞 terminal. Denote by $q_{i}^{in}$ the entrance of 𝒞_i and by $q_{i}^{out}$ the exit of 𝒞_i. Consider the path-automaton 𝒜′ with initial state $q_{2}^{in}$ and transitive components 𝒞₂, …, 𝒞_k. As this path-automaton has k − 1 components, by the inductive hypothesis, either 𝒜′ has a synchronizing word, or F(L(𝒜′)) = L(𝒞). Note that L(𝒞) ⊆ F(L(𝒜)) holds trivially, so we only consider the converse inclusion.

Case 1

The path-automaton 𝒜′ has a synchronizing word. Let y be a synchronizing word for the automaton 𝒜₁ which consists of $\mathcal{C}_{1} \cup \{q_{2}^{in}\}$ with trivial terminal component consisting of $q_{2}^{in}$. By Lemma 5, y exists, and we can assume that y synchronizes onto $q_{2}^{in}$. Let w be synchronizing for 𝒜′; since for every state q in 𝒜′ there is a path from $q_{2}^{in}$ to q, we can assume that $q_{2}^{in} w \neq \emptyset$. We observe that yw is synchronizing for 𝒜. There is no path in 𝒞₁ with label y, since y synchronizes onto $q_{2}^{in}$; hence every path in 𝒜 that has label y terminates in a state in 𝒜′. Since w is synchronizing for 𝒜′, every path in 𝒜 with label yw terminates in a single state. Thus yw is synchronizing for 𝒜 and part (a) is satisfied.

Case 2

The path-automaton 𝒜′ has no synchronizing word. Then by the inductive hypothesis, F(L(𝒜′)) ⊆ L(𝒞). Assume that 𝒜 has no synchronizing word. We show that all words in F(L(𝒜)) appear as labels of paths in 𝒞_k = 𝒞. As in Case 1, consider 𝒜₁ which consists of $\mathcal{C}_{1} \cup \{q_{2}^{in}\}$ with trivial terminal component consisting of $q_{2}^{in}$. Let w be a label of a path in 𝒜. If there is a path in 𝒜′ with label w, then w ∈ L(𝒞).

Assume now that all paths with label w start in 𝒞₁. If all paths with label w also end in 𝒞₁ then, by Lemma 5, w is a factor of a word y that synchronizes onto $q_{2}^{in}$ of 𝒜₁, and hence y is synchronizing for 𝒜, and the proposition holds.

Suppose there is a path in 𝒜 with label w that starts in 𝒞₁ and terminates in 𝒜′. We observe that in this case also, 𝒜 has a synchronizing word. Let u be the shortest word such that w = uxv where x is a symbol, $q_{1}^{out} x = q_{2}^{in}$, $q_{2}^{in} v \neq \emptyset$ and $q_{2}^{in} \in \mathcal{C}_{1} u x$. Let c ∈ L(𝒞₁) be a constant for L(𝒞₁) that fixes the follower-equivalence class of $q_{1}^{out}$, meaning, $R_{L(\mathcal{C}_{1})}(c) = R_{\mathcal{C}_{1}}(q_{1}^{out})$. Such c exists by Lemma 2 and Lemma 3. By transitivity of 𝒞₁, there is a word c′ such that cc′u also fixes the follower-equivalence class of $q_{1}^{out}$ and is a label of a path that terminates at $q_{1}^{out}$. Consider cc′w = cc′uxv. Then cc′ux is synchronizing onto $q_{2}^{in}$ in 𝒜₁, by Lemma 5. But cc′w is not synchronizing for 𝒜, hence there must be another path in 𝒜 with label cc′w, and by our assumption, it starts in 𝒞₁ and must terminate in 𝒜′. Such a path must use the transition $q_{1}^{out} x = q_{2}^{in}$, either within the portion of the path labeled cc′ or within the portion labeled w. In the first case w is a label of a path in 𝒜′, hence w ∈ L(𝒞). In the second case, there must be u′ and v′ such that cc′w = cc′u′xv′ = cc′uxv. Since u was the shortest word such that $q_{2}^{in} \in \mathcal{C}_{1} u x$, it must be that u = u′, in which case cc′ux is synchronizing for 𝒜. It is impossible that u is a proper prefix of u′ because this would imply 𝒞₁cc′ux ⊆ 𝒞₁, which would contradict the fact that cc′ux synchronizes onto $q_{2}^{in}$ in 𝒜₁.

Example 3.4

The automaton in Fig. 3 is a path-automaton with no synchronizing words. It has only one terminal component which is maximal and the factors of all words in the language are labels of paths in the terminal component. This illustrates the situation (b) in Proposition 7.

The following result is used to prove the main result (Theorem 15, Section 5.3) of the paper.

Proposition 8

Let L be a regular language, x ∈ F(L) and trim Inline graphic be the trimmed mDFA for L. At least one of the two cases holds:

(i) x is a factor of a constant for L,
(ii) there is a path-automaton induced by a path of trim containing a path labeled x and having a non-trivial terminal transitive component with at least two states.

Proof

Let trim = (Q, A, {q₀}, T, ) be the trimmed mDFA for the language L. Suppose x ∈ F(L) is not a factor of a constant, i.e., for every v, v′ ∈ A^*, vxv′ is not a constant for L, and therefore not synchronizing for trim . Consider a word w such that #Q xw = min{#Q xu | u ∈ A^*} and let P_w = Q xw. Since xw is not synchronizing, by Proposition 1, #P_w > 1. Then for every word u ∈ A^* we have that either Q xwu = ∅ or #Q xwu = #P_w. Therefore, we can assume that all states in P_w are in terminal components of trim (if not, we can concatenate w with words that are labels of paths that lead to terminal components). If all terminal components in trim are trivial, then because trim is reduced, there is only one trivial terminal transitive component, implying #P_w = 1, which contradicts our assumption that x does not extend to a constant. Thus there must be at least one terminal transitive component which is not trivial. If there is a state in P_w that belongs to a component that is not a single-state component, then (ii) holds. Assume to the contrary that each state in P_w is in a distinct transitive component consisting of only one state with loops at itself. Let y be a label of one of these loops. Since P_w y ≠ ∅ implies #P_w y = #P_w, and each state of P_w forms its own terminal component, we get P_w y = P_w, i.e., qy = q for every q ∈ P_w. This means that all states in P_w are terminal, their loops must have the same labels, and therefore their right contexts are equal. Hence the states in P_w cannot be distinct in a reduced automaton. This again implies that P_w has cardinality 1, a contradiction. Hence, there must be at least one state in P_w that belongs to a terminal transitive component with at least two states.
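The choice of w in the proof above (an extension of x whose image Q xw is as small as possible) can be sketched as a breadth-first search over subsets of states. The sketch below is only illustrative: the function names are hypothetical, the minimum is taken over non-empty images (an assumption suggested by the dichotomy Q xwu = ∅ or #Q xwu = #P_w), and the automaton is a hand-built DFA for the language a^*c(c^*ac^*a)^* of Example 5.3, not a transcription of any figure.

```python
from collections import deque

def step(states, letter, delta):
    """Image of a set of states under one letter of a partial DFA."""
    return frozenset(delta[(q, letter)] for q in states if (q, letter) in delta)

def image(states, word, delta):
    """Image Q.word of a set of states Q under a word."""
    current = frozenset(states)
    for letter in word:
        current = step(current, letter, delta)
    return current

def minimal_rank_extension(states, x, alphabet, delta):
    """Return (w, P_w) where P_w = Q.xw is a non-empty image of minimum size."""
    start = image(states, x, delta)
    if not start:
        raise ValueError("x labels no path in the automaton")
    best = ("", start)
    seen = {start}
    queue = deque([("", start)])
    while queue:
        w, subset = queue.popleft()
        if len(subset) < len(best[1]):
            best = (w, subset)
        for letter in alphabet:
            nxt = step(subset, letter, delta)
            if nxt and nxt not in seen:  # empty images are ignored (assumption)
                seen.add(nxt)
                queue.append((w + letter, nxt))
    return best

# Hand-built DFA (assumption): state 0 is initial, state 1 is the only final state.
DELTA = {
    (0, "a"): 0, (0, "c"): 1,
    (1, "a"): 3, (1, "c"): 2,
    (2, "a"): 3, (2, "c"): 2,
    (3, "a"): 1, (3, "c"): 3,
}
STATES = {0, 1, 2, 3}

w, p_w = minimal_rank_extension(STATES, "c", "ac", DELTA)
print(w, sorted(p_w))
```

In this toy automaton every non-empty image of Q·c has at least two states, so x = c does not extend to a synchronizing word, which is the situation handled by case (ii).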

4. Splicing languages and properties of splicing rules

As mentioned, in this paper we consider the general notion of the splicing operation and the splicing system given by Paun [17], as defined below.

Definition 4

A finite splicing system is a triple S = (A, I, R) where, I ⊂ A^* is a finite set of strings, called an initial language, R is a finite set of splicing rules of the form r = (u₁, u₂)(u₃, u₄), with u_i ∈ A^* for i = 1, 2, 3, 4.

Given two words x = x₁u₁u₂x₂, y = y₁u₃u₄y₂, with x₁, x₂, y₁, y₂ ∈ A^* and a rule r = (u₁, u₂)(u₃, u₄), the splicing rule produces w = x₁u₁u₄y₂ denoted (x, y) ⊢_r w. We also say that u₁u₂, u₃u₄ are splice sites of r and u₁u₄ is the paste site of r.
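As a minimal sketch of a single splicing step, the function below enumerates every word w with (x, y) ⊢_r w by trying all occurrences of the splice sites u₁u₂ in x and u₃u₄ in y. The names `Rule`, `occurrences` and `splice` are illustrative and not part of the paper; the empty word 1 is represented by the empty string.

```python
from typing import List, NamedTuple, Set

class Rule(NamedTuple):
    """A splicing rule r = (u1, u2)(u3, u4); the empty word 1 is written ""."""
    u1: str
    u2: str
    u3: str
    u4: str

def occurrences(word: str, site: str) -> List[int]:
    """All start positions of `site` in `word` (overlaps allowed)."""
    return [i for i in range(len(word) - len(site) + 1) if word.startswith(site, i)]

def splice(x: str, y: str, r: Rule) -> Set[str]:
    """All words w = x1 u1 u4 y2 with x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    results = set()
    for i in occurrences(x, r.u1 + r.u2):
        x1 = x[:i]
        for j in occurrences(y, r.u3 + r.u4):
            y2 = y[j + len(r.u3) + len(r.u4):]
            results.add(x1 + r.u1 + r.u4 + y2)
    return results

# Example with the rule (b, a^3)(da, 1): x = b a^3 and y = da a^6 give w = b a^6.
r3 = Rule("b", "aaa", "da", "")
print(splice("baaa", "da" + "a" * 6, r3))
```

Because a splice site may occur several times in a word, one pair (x, y) can produce several words under the same rule, which is why the function returns a set.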

To simplify the notation, in the following, by a splicing system we mean a finite splicing system.

Let L ⊆ A^*. We denote σ(L) = {w ∈ A^* | (x, y) ⊢_r w, x, y ∈ L, r ∈ R}. The (iterated) splicing operation is defined as follows: σ⁰(L) = L, σⁱ⁺¹(L) = σⁱ(L) ∪ σ(σⁱ(L)), i ≥ 0. Finally, σ^*(L) = ⋃_{i≥0} σⁱ(L).
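The iteration σ⁰, σ¹, … can be explored experimentally by truncating every step to words of bounded length. The sketch below is an approximation under that assumption: `bounded_closure` (a hypothetical name) computes the fixed point of the truncated iteration, so it may miss words of σ^*(L) whose derivations pass through words longer than `max_len`. Rules are written as 4-tuples (u₁, u₂, u₃, u₄) with the empty string for the empty word.

```python
def splice(x, y, rule):
    """All words x1 u1 u4 y2 obtained from x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    out = set()
    for i in range(len(x) - len(left) + 1):
        if x.startswith(left, i):
            for j in range(len(y) - len(right) + 1):
                if y.startswith(right, j):
                    out.add(x[:i] + u1 + u4 + y[j + len(right):])
    return out

def sigma(language, rules, max_len):
    """One splicing step sigma(L), keeping only words of length <= max_len."""
    return {
        w
        for x in language
        for y in language
        for rule in rules
        for w in splice(x, y, rule)
        if len(w) <= max_len
    }

def bounded_closure(initial, rules, max_len):
    """Fixed point of L -> L ∪ sigma(L) restricted to words of length <= max_len."""
    current = {w for w in initial if len(w) <= max_len}
    while True:
        following = current | sigma(current, rules, max_len)
        if following == current:
            return current
        current = following

# Toy system (assumption, not from the paper): I = {a}, R = {(a, 1)(1, a)}.
print(sorted(bounded_closure({"a"}, [("a", "", "", "a")], 5), key=len))
```

For this toy system the truncated closure consists of the words a, a², …, a⁵, in line with the fact that the full system generates a⁺.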

Definition 5 (Splicing language)

Given a finite splicing system S = (A, I, R), the language L(S) = σ^*(I) is the language generated by S. A language L is a splicing language if there is a splicing system S such that L = L(S).

For a word w and a set of states Q, we use the notation $R_{\mathcal{A}}(Q w)$ for $\bigcup_{q \in Q} R_{\mathcal{A}}(q w)$.

Definition 6 (Paste site at p)

Let $\mathcal{A}$ be a DFA for a regular splicing language L. The word u₁u₄ is said to be a paste site at a state p ∈ Q for a splicing rule r = (u₁, u₂)(u₃, u₄) if $R_{\mathcal{A}}(Q u_{3} u_{4}) \subseteq R_{\mathcal{A}}(p u_{1} u_{4})$ and pu₁u₂ ≠ ∅.

More precisely, the notion of a paste site at a state q is used to identify states of the automaton where a rule can be applied. Fig. 4 depicts the situation for a paste site at state p. The dotted path with label u₃ may not exist in the automaton, but the right context of qu₃u₄ (wherever a path with such a label exists) must be included in the right context of pu₁u₄.

In what follows we assume that every splicing system is such that all rules are applied at least once during the generation of the splicing language. The following lemma shows an equivalence between splicing systems with respect to the extension of sites and paste sites of rules.

Lemma 9

Let S = (A, I, R) be a finite splicing system and r = (u₁, u₂)(u₃, u₄) be a splicing rule in R. Let c ∈ A^*. Then L(S) is the language generated with the splicing system S′ = (A, I, R′) where R′ = R ∪ {r′} for r′ = (u₁, u₂)(u₃, u₄c).

Proof

It is clear that L(S) ⊆ L(S′) since R′ contains R. The converse also holds since whenever we have (x, y) ⊢_{r′} w we also have (x, y) ⊢_r w.
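The key step of this proof, namely that every word produced by r′ = (u₁, u₂)(u₃, u₄c) is also produced by r, can be checked on sample words. The sketch below is only an illustration: `extend_rule` is a hypothetical helper, rules are 4-tuples with the empty string for the empty word, and the sample rule is r₃ = (b, a³)(da, 1) from Lemma 13 extended by c = a³.

```python
def splice(x, y, rule):
    """All words x1 u1 u4 y2 obtained from x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    out = set()
    for i in range(len(x) - len(left) + 1):
        if x.startswith(left, i):
            for j in range(len(y) - len(right) + 1):
                if y.startswith(right, j):
                    out.add(x[:i] + u1 + u4 + y[j + len(right):])
    return out

def extend_rule(rule, c):
    """The rule r' = (u1, u2)(u3, u4 c) of Lemma 9."""
    u1, u2, u3, u4 = rule
    return (u1, u2, u3, u4 + c)

r = ("b", "aaa", "da", "")
r_ext = extend_rule(r, "aaa")

samples = ["b", "baaa", "baaaaaa", "cb", "cbaaa", "da", "daaaa", "daaaaaaa"]
for x in samples:
    for y in samples:
        # Every product of the extended rule is a product of the original rule.
        assert splice(x, y, r_ext) <= splice(x, y, r)

# The inclusion can be strict: r applies to (baaa, da) but r' does not.
print(splice("baaa", "da", r), splice("baaa", "da", r_ext))
```

The extended rule is therefore redundant as a generator, which is why adding it to R leaves the generated language unchanged.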

Lemma 10

Let S = (A, I, R) be a finite splicing system and $\mathcal{A}$ a DFA for L = L(S). If u₁u₄ is a paste site at state p for a rule r = (u₁, u₂)(u₃, u₄) ∈ R then for every c ∈ A^* with pu₁u₄c ≠ ∅, u₁u₄c is a paste site at p for the rule r′ = (u₁, u₂)(u₃, u₄c).

Proof

Suppose that u₁u₄ is a paste site at state p for rule r = (u₁, u₂)(u₃, u₄), and let pu₁u₄c ≠ ∅. Then by Lemma 9, L is also generated by the splicing system S′ = (A, I, R′) for the set of rules $R^{'} = R \cup \{r^{'} = (u_{1}, u_{2}) (u_{3}, u_{4}^{'})\}$ where $u_{4}^{'} = u_{4} c$. The first splice site of r′ equals that of r, thus pu₁u₂ ≠ ∅. It only remains to show that $R_{\mathcal{A}} (Q u_{3} u_{4}^{'}) \subseteq R_{\mathcal{A}} (p u_{1} u_{4}^{'})$. But if $y \in R_{\mathcal{A}} (Q u_{3} u_{4}^{'})$ then $cy \in R_{\mathcal{A}} (Q u_{3} u_{4}) \subseteq R_{\mathcal{A}} (p u_{1} u_{4})$ and so $y \in R_{\mathcal{A}} (p u_{1} u_{4} c)$. It follows that $u_{1} u_{4}^{'}$ is a paste site at state p for rule r′.

5. Splicing languages must have a constant

5.1. Reflexive and non-reflexive splicing languages

It is known that every splicing language generated by a finite splicing system is always regular [8,19]. More precisely, regular splicing languages form a proper subclass of the class of regular languages.

Recall that a splicing system S is said to be reflexive if for every rule r = (u₁, u₂)(u₃, u₄) in R, both (u₁, u₂)(u₁, u₂) and (u₃, u₄)(u₃, u₄) are rules in R. A language L is said to be a reflexive splicing language if there is a reflexive splicing system S such that L = L(S). The system S is said to be symmetric if (u₁, u₂)(u₃, u₄) being in R implies that (u₃, u₄)(u₁, u₂) is in R. The notion of a constant of a language turned out to be essential in providing a characterization of the class of reflexive regular splicing languages [11,3]. Indeed, a fundamental property of a reflexive regular splicing language L is that there exists a splicing system generating L whose rules have splice sites consisting of constants for the language L. A more precise characterization shows that the class of reflexive and symmetric splicing languages is equivalent to a class of regular languages, the so-called PA-con-split languages [3]. This result has been extended to the non-symmetric case [4]. In [4], it is shown that each language L in this class is constructed from a finite set of constants for L, as L is expressed as a union of a finite set X and a finite union of constant and split languages (see end of Section 2). The characterization is given by the following proposition.

Proposition 11. (See [4].)

A regular language L is a reflexive splicing language if and only if there is a finite set X ⊂ A^*, a finite set of constants K₁ of L and a finite set K₂ of pairs of constants of L such that

$$L = X \cup \Big(\bigcup_{w \in K_{1}} \mathrm{Const}_{L}(w)\Big) \cup \Big(\bigcup_{(w_{1}, w_{2}) \in K_{2}} \mathrm{Split}_{L}(w_{1}, w_{2})\Big)$$

The characterization of reflexive languages in Proposition 11 helps to describe factor-closed transitive regular languages as reflexive splicing languages.

Proposition 12

If L is a factor-closed transitive regular language then L is a reflexive splicing language.

Proof

Since L is factor-closed, by Lemma 2, consider its minimal deterministic transitive automaton Inline graphic . By Lemma 3, all states in are synchronizing. Every word w ∈ L is a label of a path in that passes through some state q, hence w = w′w″ where w′ is in the left context of a constant that labels a path starting at q and w″ is in the right context of a constant that synchronizes onto q. Because all states in Inline graphic are initial and terminal, we have that $C_{L}^{ℓ} (w) = L_{L} (w)$ and $C_{L}^{r} (w) = R_{L} (w)$ for every constant w of L. Let M be a set that consists of constants by choosing one constant m_q for each state q in that is a label of a path starting and ending at the state q. Then L = ⋃_{m_q∈M} Split(m_q, m_q) where $Split (m_{q}, m_{q}) = C_{L}^{ℓ} (m_{q}) C_{L}^{r} (m_{q}) = L_{L} (m_{q}) R_{L} (m_{q})$ and splitting is performed by taking the empty prefix and the empty suffix of m_q. The conclusion follows directly from the characterization in Proposition 11.

An example of a non-reflexive regular splicing language is given in [11]; this is the language L = a⁺b⁺a⁺b⁺a⁺ ∪ a⁺b⁺a⁺.

Example 5.1

The path-automaton of Fig. 5 generalizes the example of a regular non-reflexive language given in [11]. More precisely, it is possible to show, similarly as in [11], that any splicing system for the language L_k = ⋃_{1≤i≤k} (a⁺b⁺)ⁱa⁺ with k ≥ 3 must have a rule neither of whose splice sites is a constant for the language. A splicing system S for L_k can be defined with the initial language I_k = ⋃_{1≤i≤k} {a(ba)ⁱ, a²(ba)ⁱ, (ab)ⁱa²} and rules:

$$\begin{array}{l} r_{1,k} = (1, (ab)^{k})(1, a(ab)^{k}), \quad r_{2,k} = ((ba)^{k}a, 1)((ba)^{k}, 1), \\ r_{3,i} = (a, (ba)^{i})((ab)^{k-i}, ab) \text{ for each } 1 \leq i \leq k, \text{ and} \\ r_{4,i} = (b, (ab)^{i})((ba)^{k-i}, ba) \text{ for each } 1 \leq i < k. \end{array}$$

The proof that this splicing system generates the language L_k follows the same lines as the one given in [11]. Observe that both splice sites of the rules r_{3,i} = (a, (ba)ⁱ)((ab)^{k−i}, ab), for i > 1, are not constants for the language L_k. More precisely, rules r_{1,k} and r_{2,k} are used to increase the initial and final number of a’s in the language a⁺(ab)^k a⁺, respectively. Rules r_{3,i} are used to increase the number of a’s in the (k − i)th appearance of a’s in (a⁺b⁺)^k a⁺, for i ≤ k. Similarly, rules r_{4,i} are used to increase the number of b’s in the language L_k. The rules r_{3,i} are also used to obtain (a⁺b⁺)^j a⁺, for j < k.

The following lemma shows another example of a non-reflexive splicing language, whose trimmed mDFA is not a path-automaton (Fig. 6).

Lemma 13

The regular language L = b(a³)^* + cba^* + da(a³)^* is a non-reflexive splicing language.

Proof

First we note that L ⊆ A^*, for A = {a, b, c, d}, is a splicing language. A splicing system S = (A, I, R) for L consists of the rules R = {r₁ = (cba, 1)(cb, a), r₂ = (daa³, 1)(da, 1), r₃ = (b, a³)(da, 1)} and the initial language I = {ba³, b, cba, cb, daa³, da}. By induction on the number k of iteration steps of splicing rules, we first show that L(S) ⊆ L. If k = 0, the inclusion holds since I ⊆ L. Assume that w ∈ L(S) is generated with k > 0 iterations by applying a rule r to a pair of words w₁, w₂ ∈ L(S). Since w₁ and w₂ are obtained with at most k − 1 iterations, by induction w₁, w₂ ∈ L. Checking the splice sites in w₁ and w₂ for all of the rules, it is immediate to see that w ∈ L. In order to show that L ⊆ L(S), we observe that the language L₁ = da(a³)^* is generated by rule r₂ applied to words of the same language, starting from the initial words da and daa³. Similarly, the language L₂ = cba^* is generated by rule r₁ starting from words of the same language. The language L₃ = b(a³)^* is generated by rule r₃ applied to words of b(a³)^* and of da(a³)^*. Indeed, by induction on i ≥ 0 we can observe that b(a³)ⁱ ∈ L(S). If i = 0 or i = 1, the result is immediate since b, ba³ ∈ I. Otherwise, given the word b(a³)ⁱ⁻¹ ∈ L(S), for i > 1, and the word da(a³)ⁱ ∈ L(S), rule r₃ immediately generates the word b(a³)ⁱ ∈ L(S).

Finally, notice that the language L is not reflexive, that is, it cannot be generated by a splicing system with reflexive splicing rules. Suppose L is a reflexive splicing language generated by a reflexive system S. We obtain a contradiction by considering the generation of words in the language b(a³)^*. Since the only words in L that start with b are those in b(a³)^*, and there must be splicing rules in S to generate words of the form b(a³)^k for arbitrarily large k, there must be a rule r with splice site u₁u₂ that is a factor of a word in b(a³)^*. But, because S is reflexive, S must also contain the rule (u₁, u₂)(u₁, u₂). Then this rule can be applied to x = b(a³)^k and y = cb(a³)^ka, for some large k > 0, to generate a word w = b(a³)^ka ∉ L. Therefore the language L cannot be generated by reflexive rules.
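Both parts of this proof can be checked on short words. The sketch below is a bounded experiment, not a proof: it closes the initial language I under the three rules with all words truncated at length CAP, tests that every generated word lies in L, and tests that every word of L of length at most CAP − 1 is generated (one letter of slack is needed because r₃ builds b(a³)ⁱ from the longer word da(a³)ⁱ). It then applies the reflexive rule (b, a³)(b, a³), chosen here only as one instance of the rule (u₁, u₂)(u₁, u₂) in the argument, to produce a word outside L. All function names are illustrative.

```python
import re

# Rules r1, r2, r3 of Lemma 13 as tuples (u1, u2, u3, u4); "" is the empty word 1.
RULES = [("cba", "", "cb", "a"), ("daaaa", "", "da", ""), ("b", "aaa", "da", "")]
INITIAL = {"baaa", "b", "cba", "cb", "daaaa", "da"}
L_PATTERN = re.compile(r"b(aaa)*|cba*|da(aaa)*")

def in_L(word):
    return L_PATTERN.fullmatch(word) is not None

def splice(x, y, rule):
    """All words x1 u1 u4 y2 obtained from x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    out = set()
    for i in range(len(x) - len(left) + 1):
        if x.startswith(left, i):
            for j in range(len(y) - len(right) + 1):
                if y.startswith(right, j):
                    out.add(x[:i] + u1 + u4 + y[j + len(right):])
    return out

def bounded_closure(initial, rules, max_len):
    """Iterate splicing, discarding words longer than max_len, until stable."""
    current = set(initial)
    while True:
        new = {w for x in current for y in current for r in rules
               for w in splice(x, y, r) if len(w) <= max_len}
        if new <= current:
            return current
        current |= new

def words_of_L(max_len):
    """All words of L = b(a^3)* + cba* + da(a^3)* of length <= max_len."""
    words = set()
    for i in range(max_len + 1):
        for w in ("b" + "aaa" * i, "cb" + "a" * i, "da" + "aaa" * i):
            if len(w) <= max_len:
                words.add(w)
    return words

CAP = 13
closure = bounded_closure(INITIAL, RULES, CAP)
sound = all(in_L(w) for w in closure)          # generated words lie in L
complete = words_of_L(CAP - 1) <= closure      # short words of L are generated

# A reflexive version of the first site of r3 produces a word outside L.
bad = splice("baaa", "cbaaaa", ("b", "aaa", "b", "aaa"))
print(sound, complete, bad)
```

The last line exhibits the word ba⁴, obtained from x = ba³ and y = cba⁴, which corresponds to the case k = 1 of the contradiction derived above.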

5.2. Canonical and special words

The proof of the main result (Theorem 15) is based on special words in a regular splicing language L that must be generated by a splicing rule whose splice site u₃u₄ is a constant of L. For lack of a better name, we call these words q-canonical and k-special words.

Informally, the q-canonical word of a component Inline graphic is a word c such that qc = q and every such path with label c crosses all states in the component, and moreover, the word c is able to identify the language L( ) of the component .

Definition 7 (q-Canonical)

Let Inline graphic be an automaton and let be a component of . Let q be a state of . Then a word c ∈ A⁺ such that c ∈ L( ) and qc = q is called q-canonical for with respect to if whenever c ∈ L( ), for another component of , implies that L( ) ⊆ L( ).

In the following we show the existence of a q-canonical word for every state in every transitive component in Inline graphic . We give a constructive proof based on the notion of a k-special word for L( ) as defined below.

Definition 8 (k-Special)

Let Inline graphic be an automaton. A word c in L( ) is k-special for the language L if every word of F(L) of length ≤k is a factor of c.

Example 5.2

Consider the automaton of Fig. 6. Then the word a³ is q₂-canonical for the terminal component Inline graphic consisting of states q₂, q₃, q₄. Given the language L = L( ), then the word a³ is k-special for the language L for k ≤ 3.
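The k-special claim of Example 5.2 can be tested by brute force. The sketch below assumes that the language of the terminal component is (a³)^* and approximates F(L) by the factors of all words of L up to a chosen length `max_len`; both the function names and this bounded enumeration are assumptions made for illustration.

```python
import re
from itertools import product

def factors(word):
    """All factors (contiguous subwords) of a word, including the empty word."""
    return {word[i:j] for i in range(len(word) + 1) for j in range(i, len(word) + 1)}

def short_factors(in_language, alphabet, k, max_len):
    """Factors of length <= k of the words of the language of length <= max_len."""
    found = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if in_language(word):
                found |= {f for f in factors(word) if len(f) <= k}
    return found

def is_k_special(c, in_language, alphabet, k, max_len):
    """Definition 8, with F(L) approximated by words of length <= max_len."""
    return in_language(c) and short_factors(in_language, alphabet, k, max_len) <= factors(c)

def in_cycle_language(word):
    """Membership in (a^3)^*, the assumed language of the component."""
    return re.fullmatch(r"(aaa)*", word) is not None

print(is_k_special("aaa", in_cycle_language, "a", 3, 6))
print(is_k_special("aaa", in_cycle_language, "a", 4, 6))
```

The second call fails because a⁴ is a factor of a⁶ but not of a³, which matches the restriction k ≤ 3 in the example.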

Lemma 14

Given a non-trivial transitive component Inline graphic in a DFA , let k = (#Q)². Then for every state q in there is a q-canonical of that is a k-special constant of L( ).

Proof

Let {x₁,…, x_n} = L( Inline graphic ) ∩ A^≤^k. Being a transitive component, there are y₁,…, y_n₋₁ such that x₁ y₁x₂ ···y_n₋₁x_n ∈ L( ). Set c = x₁ y₁x₂ ···y_n₋₁x_n. Due to transitivity, for every q ∈ there are y_q and $y_{q}^{'}$ such that $w_{q} = y_{q} c y_{q}^{'}$ is a label of a path that starts and ends at q. By Remark 5, y_q, $y_{q}^{'}$ can be chosen so that w_q is a constant. We show that w_q is q-canonical. Assume that w_q ∈ L( Inline graphic ) for some transitive component . Take the shortest word z ∈ L( ) \ L( ). Since L( ) \ L( ) = L( ) ∩ (L( ))^c, it can be recognized by an automaton with at most #Q ( ) · #Q ( ) ≤ k states [13], the shortest word in this language has length at most k. Thus |z| ≤ k and therefore z must be a factor of c, i.e., z must be in L( Inline graphic ), contradicting the existence of z.

5.3. Proof of the main result

Considering the importance of constants in characterization of sub-classes of regular splicing languages, it has been conjectured that every splicing language must have a constant [10,11]. Our main result proves this conjecture to be true.

Theorem 15 (Main result)

If L is a regular splicing language, then L has a constant.

Example 5.3

The path-automaton of Fig. 3 has no synchronizing word (see Example 3.4) and thus the language L( ) = a^*c(c^*ac^*a)^* has no constant. By Theorem 15, L( ) is not a regular splicing language.
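The absence of constants can be probed directly from the usual definition: a word w is a constant for L if u₁wu₂ ∈ L and u₃wu₄ ∈ L imply u₁wu₄ ∈ L. The sketch below searches, among contexts of bounded length, for a witness that a given word is not a constant of a^*c(c^*ac^*a)^*. It can only refute constancy for individual words; the function names and the context bound are assumptions made for illustration.

```python
import re
from itertools import product

LANGUAGE = re.compile(r"a*c(c*ac*a)*")

def in_language(word):
    return LANGUAGE.fullmatch(word) is not None

def words_up_to(alphabet, max_len):
    """All words over the alphabet of length <= max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product(alphabet, repeat=n)]

def refute_constant(w, alphabet, max_ctx):
    """Return (u1, u2, u3, u4) with u1 w u2 and u3 w u4 in L but u1 w u4 not in L.

    Returns None when no such witness exists among contexts of length <= max_ctx.
    """
    contexts = words_up_to(alphabet, max_ctx)
    pairs = [(u, v) for u in contexts for v in contexts if in_language(u + w + v)]
    for u1, u2 in pairs:
        for u3, u4 in pairs:
            if not in_language(u1 + w + u4):
                return (u1, u2, u3, u4)
    return None

for w in ("a", "c"):
    print(w, refute_constant(w, "ac", 2))
```

For instance, ac and caa belong to the language while aa does not, so the word a is not a constant; similarly c and caca belong to the language while ca does not.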

Example 5.4

The transitive regular language L recognized by the automaton in Fig. 1 has no constants. By Theorem 15, the language L is not a splicing language.

Example 5.5

The regular language L = b(a³)^* + cba^* + da(a³)^* is another example of a non-reflexive splicing language, as proved in Lemma 13. Fig. 2(a) shows the trimmed mDFA graph for the language L. Observe that not every path-automaton induced by a path in the mDFA from the initial state q₀ to a terminal component necessarily has a constant of L. Indeed, the path-automaton in Fig. 2(a) recognizing the language b(a³)^* ⊂ L does not have any constant of the language L because every word in b(a³)^* is also a substring of a word in cba^* and therefore is not a synchronizing word for the automaton of L.

Given a splicing regular language L, the proof of Theorem 15 shows existence of a splicing rule r = (u₁, u₂)(u₃, u₄) such that the word u₁u₄ ends in a non-trivial terminal component of the trimmed mDFA trim Inline graphic . More precisely, u₁u₄ ends in a state which we show to be synchronizing for the automaton trim .

Let L be a regular splicing language and let trim Inline graphic = (Q, A, {q₀}, T, ) be the trimmed mDFA for language L. We introduce some basic notations that are used in the proof. We are interested in states of the automaton trim that are found as follows.

Consider a non-trivial terminal component that is minimal among the non-trivial terminal components in the automaton trim . If a non-trivial component does not exist, then by Proposition 8, trim must have a constant and Theorem 15 holds. Let q ∈ be a minimal-follower state with respect to and recall that with μ_q( ) we denote the set of states in that are follower-equivalent to q. Let C = { = , , ···, } be the set of all terminal components of the automaton that are factor-equivalent to . Consider the set $F = \{q_{1}, \dots, q_{n}\} = \bigcup_{i = 1}^{k} \mu_{q} (C_{i})$ (note that by Lemma 3, for each i = 1,…, k, the collection of follower sets in coincides with the collection of follower sets in ).

Then, a candidate state of trim Inline graphic is a state q̄ ∈ F with q̄ ∈ for some component ∈ C such that (q̄) is minimal in the following sense: for all q ∈ F, whenever (q) ⊆ (q̄), it holds that (q)= (q̄), i.e., being trim reduced it holds that q = q̄.

The main idea of the proof is to show that either the automaton has no non-trivial components, and in this case a constant exists (see Proposition 8), otherwise there exists a candidate state that is synchronizing for the automaton trim Inline graphic .

Example 5.6

Consider the automaton in Fig. 6. Observe that the minimal terminal component Inline graphic induced by state q₂ has language L( ) = a^*, with L( ) = {a³}^*, and is factor-equivalent to the component induced by the state q₁. Then the set F, corresponding to the candidate component , is the set of states F = {q₁, q₂, q₃, q₄} because all these states belong to only one follower-equivalence, thereby the minimal follower-equivalence class. Then the candidate states are q₂, q₃, q₄.

We will use the following lemma.

Lemma 16

Let q̄ ∈ Inline graphic be a candidate state and let q̄₁ ∈ F ∩ . Then q̄₁ is also a candidate state.

Proof

Let Inline graphic be the minimal deterministic transitive automaton for ∈ C from Lemma 3. Suppose q̄ ∈ is a candidate state and let q̄₁ ∈ F ∩ . Let further q̂ ∈ be the follower-equivalent state to q̄. Then by Remark 5 there is a constant c of L( ) such that q̂c = q̂ and q̄₁c = q̄.

Let q′ ∈ F. First, suppose that q′ ∈ for some ≠ . Consider $q_{1}^{'} \in C^{'}$ such that $q^{'} c = q_{1}^{'}$. Because c is a constant, by Remark 4, $q_{1}^{'}$ is follower-equivalent to q̂, so we have $q_{1}^{'} \in F$. Because q̄ is a candidate state, there are two possibilities: (a) the right contexts of q̄ and $q_{1}^{'}$ are incomparable, i.e., $R_{trim \hat{A}} (\bar{q}) \setminus R_{trim \hat{A}} (q_{1}^{'}) \neq \emptyset$ and $R_{trim \hat{A}} (q_{1}^{'}) \setminus R_{trim \hat{A}} (\bar{q}) \neq \emptyset$, or (b) $R_{trim \hat{A}} (\bar{q}) \subsetneq R_{trim \hat{A}} (q_{1}^{'})$ (equality cannot hold because trim is reduced). In both cases there must be a $z \in R_{trim \hat{A}} (q_{1}^{'}) \setminus R_{trim \hat{A}} (\bar{q})$. Then q′cz is a terminal state while q̄₁cz is not. Therefore, in both cases $R_{trim \hat{A}} (q^{'}) \not\subseteq R_{trim \hat{A}} ({\bar{q}}_{1})$.

Also, if q′ ∈ Inline graphic ∩ F then by Lemma 4, (q′) \ (q̄₁) ≠ ∅.

Therefore q̄₁ is a candidate state.

Before we present details of the proof of Theorem 15 we outline the steps involved in the proof by illustrating the situation in Example 5.6 shown in Fig. 6.

In trim we identify a candidate state q̄ within a non-trivial component C̄ as outlined above. (For Example 5.6 we choose state q₂.)
We consider a q̄-canonical word c and observe that there must be a rule (u₁, u₂)(u₃, u₄) with a paste site u₁u₄ at a state p that lies on a path labeled wc^sx, for some s, where q₀ w = q̄ and q̄x is terminal (see Fig. 7). (For Example 5.6, p = q₀ and wc^sx = b(a³)^s with w = b and c = a³, the rule in question is r₃ = (b, a³)(da, 1).)
We observe that there is a state q ∈ Q u₃u₄ such that, for arbitrarily large i, cⁱ is a factor of the right context of q. (For Example 5.6, such states are only q₂, q₃, q₄, because q₁ ∉ Q daa³x for any x.) We choose a sufficiently large i such that for some z, all states in Q u₃u₄zcⁱ belong to non-trivial components and we set $u_{3} u_{4}^{'} = u_{3} u_{4} z c^{i}$. We observe that all states in $Q u_{3} u_{4}^{'}$ end in non-trivial components that are factor-equivalent to the non-trivial component C̄. By Lemma 10 we obtain that $u_{1} u_{4}^{'}$ is a paste site for p, given the rule $r = (u_{1}, u_{2}) (u_{3}, u_{4}^{'})$. (For Example 5.6, we can choose z = 1 and have a new rule r₃ = (b, a³)(da, a³), and Q daa³ = q₂.)
We show that for every $q \in Q u_{3} u_{4}^{'}$ , it must be $q = p u_{1} u_{4}^{'}$ , therefore $u_{3} u_{4}^{'}$ is synchronizing.

Fig. 7 — A possible paste site at state p.

We now present the proof of the main result.

Proof of Theorem 15

Let L be a regular splicing language, and let trim Inline graphic = (Q, A, {q₀}, T, ) be its trimmed mDFA. By Proposition 8, if the automaton trim has only a trivial terminal component (note that since trim is reduced, there could be only one such component), it must have a constant and thus the theorem holds. Therefore we consider the case that trim Inline graphic has at least one non-trivial terminal component.

Consider a non-trivial terminal component that is minimal among the non-trivial terminal components in the automaton trim and let q ∈ be minimal-follower state in . Let C = { = , ···, } be the set of all terminal components of the automaton that are factor-equivalent to and set $F = {q_{1}, \dots, q_{n}} = \cup_{i = 1}^{k} μ_{q} (C_{i})$ . We choose a candidate state q̄ ∈ F in a component ∈ C.
Let w ∈ A^* be the shortest word such that q₀w = q̄. Consider a word c which is a constant of L( ) and is q̄-canonical for . Such a word exists by Lemma 14. Then wc^*x ⊆ L for some x ∈ A^*. Since there is a finite number of rules in the splicing system, there are an infinite number of indexes s such that wc^sx are obtained by using the same splicing rule r = (u₁, u₂)(u₃, u₄) where u₁u₄ is a subword of wc^sx for every such s. More precisely, there must exist an infinite number of pairs of words v = v′u₁u₂ v″ ∈ L and w′u₃u₄ w″ ∈ L such that v′u₁u₄ w″ ∈ wc*x. Thus v′u₁ is a prefix of wcⁱx for some i ≥ 0. Let p be such that pu₁u₂ ≠ ∅ where p = q₀ v′. Moreover, if y″ ∈ (Qu₃u₄), since there is y′ such that y = y′u₃u₄ y″ ∈ L, by splicing words v = v′u₁u₂ v″ and y = y′u₃u₄ y″ with rule r, we obtain v′u₁u₄ y″ ∈ L and thus y″ ∈ (pu₁u₄). Therefore, (Qu₃u₄) ⊆ (pu₁u₄).

We obtain that u₁u₄ is a paste site at state p for rule r = (u₁, u₂)(u₃, u₄). Refer to Fig. 7.
In the following we show that there are states in Q u₃u₄ such that cⁱ is a factor of a word in their right context for arbitrarily large i’s.

Let p′ = q₀ v′u₁u₄ where v′ is a prefix of wc^*x. Being trim deterministic, and since is terminal, by the choice of p′ it must be that p′ is either inside the component or otherwise lies along a path with label w from state q₀ to state q̄. In the latter case when p′ is not a state inside , v′u₁u₄ is a prefix of w. In this case cⁱx must be a suffix of w″ in the splicing of v′u₁u₂ v″ ∈ L and w′u₃u₄ w″ ∈ L that produces v′u₁u₄ w″ = wcⁱ x. Hence, for arbitrarily large i’s, it must be that cⁱ is a factor of a word in the right context of a state q ∈ Q u₃u₄. Since there are infinite number of i’s with this property, there is a state q ∈ Q u₃u₄ such that cⁱ ∈ F ( (qu₃u₄)) for arbitrarily large i.

Now suppose p′ is a state in (see Fig. 7). By Proposition 8, u₃u₄ is either a factor of a constant, which proves the statement of the theorem, or there is a path-automaton (a sub-automaton of trim ) with a non-trivial terminal component such that u₃u₄ is a label of path π in . Then, by Lemma 14, there is a q-canonical word c′ for some q ∈ such that u₃u₄zc′ is a label of a path in for some z, that is zc′ ∈ (qu₃u₄). Since u₁u₄ is a paste site for the rule r at state p, we have that zc′ ∈ (qu₃u₄) ⊆ (pu₁u₄) = (p′) ⊆ L( ). But because c′ is q-canonical for it follows that L( ) ⊆ L( ) and by the minimality of component and since is factor equivalent to we have that L( ) = L( ), i.e., c^* ⊆ L( ). Therefore cⁱ is a factor of the right context of a state q ∈ Q u₃u₄ for arbitrarily large i.

We now consider states in Q u₃u₄ whose right context has words with factors cⁱ for arbitrarily large i’s. We fix i sufficiently large, such that for some z, every state in Q u₃u₄zcⁱ belongs to a non-trivial component, and for every state q̂ ∈ Q u₃u₄zcⁱ, the language of the component containing q̂ contains the word c. Given $u_{4}^{'} = u_{4} z c^{i}$, by Lemma 10, $u_{1} u_{4}^{'}$ is a paste site at the same state p for the rule $r^{'} = (u_{1}, u_{2}) (u_{3}, u_{4}^{'})$. Observe that z can be chosen such that $p u_{1} u_{4}^{'} = {\bar{q}}_{0} \in C$. If p′ = pu₁u₄ is not in , by the argument above, cⁱx is a suffix of w″, hence we can choose z such that w″ = zcⁱx, i.e., p′z = q̄.

Because c is a constant for L( ) such that q̄c = q̄, by Lemma 3 and Remark 4, every state q ∈ c is follower-equivalent to q̄. Therefore, the state ${\bar{q}}_{0} = p u_{1} u_{4}^{'} = p u_{1} u_{4} z c^{i} \in \bar{C}$ is follower-equivalent to q̄, and hence in F. Having q̄₀ ∈ F ∩ , by Lemma 16, q̄₀ is also a candidate state.
Let q be a state in $Q u_{3} u_{4}^{'}$ . We conclude with the observation that q = q̄₀ and therefore $u_{3} u_{4}^{'}$ is a synchronizing word for q̄₀, proving the theorem. The proof of this last step consists first in showing that L( ) = L( ) where is the component in trim containing q. Then we show that is terminal and thus ∈ C. By the fact that q̄₀ is a candidate state we are able to show that q̄₀ = q.

We first observe that L( Inline graphic ) = L( ). As $q \in Q u_{3} u_{4}^{'}$ , by Definition 6 of paste site, it holds that

$$R_{trim \hat{A}} (q) \subseteq R_{trim \hat{A}} (Q u_{3} u_{4}^{'}) \subseteq R_{trim \hat{A}} (p u_{1} u_{4}^{'}) = R_{trim \hat{A}} ({\bar{q}}_{0}) . \qquad (*)$$

Since c is q̄-canonical, by Definition 7 we have that L( Inline graphic ) ⊆ L( ). If L( ) \ L( ) ≠ ∅ then (q) ⊈ (q̄₀) which contradicts (*). Therefore it must be that L( ) = L( ), that is and are factor-equivalent.

Next we see that is terminal. Assume to the contrary that is not terminal and thus there is an edge labeled a that starts in and terminates in a state q′ outside . By Lemma 5 the automaton that consists of together with the edge labeled a ending at q′ has a synchronizing word that ends at q′. Let ua be that word. Then ua ∉ L( ) = L( ), because otherwise ua would not be synchronizing. By the transitivity of we can assume that ua is a label of a path that starts at q. This implies that ua must be a prefix of a word in $R_{trim \hat{A}} (q) \setminus R_{trim \hat{A}} ({\bar{q}}_{0})$, again contradicting (*). Consequently is terminal and in C. Moreover q ∈ F, since by the choice of the constant c, by Remark 4, q is factor-equivalent to q̄ (hence, to q̄₀), and F consists of all states that are factor-equivalent to q̄ and belong to components in C. Thus by (*) (i.e., $R_{trim \hat{A}} (q) \subseteq R_{trim \hat{A}} ({\bar{q}}_{0})$) and the fact that q̄₀ is a candidate state, we have $R_{trim \hat{A}} ({\bar{q}}_{0}) = R_{trim \hat{A}} (q)$. Because trim is reduced, q̄₀ = q, which concludes the proof.

The proof of Theorem 15 is based on the effective computation of a synchronizing state in the automaton for a regular splicing language in the case of an automaton having non-trivial terminal components. As a main corollary of the above theorem we can state the following fact.

Corollary 17

Let trim Inline graphic be the trimmed minimal deterministic automaton recognizing a splicing regular language. Then every state in a terminal component that contains a candidate state for trim is synchronizing.

6. Concluding remarks

In this paper we solve a conjecture posed by T. Head in his seminal works on regular splicing languages about the existence of a constant as a necessary condition for a regular language to be splicing. We solve this open problem in the affirmative, by providing a constructive proof that leads to a procedure for finding a synchronizing state in an mDFA for a regular splicing language.

The use of constants makes it possible to determine a necessary and sufficient condition for a regular language to be reflexive splicing [3,4]; identifying such a condition for non-reflexive splicing languages is still an open problem.

Recently, decidability of regular splicing languages has been proved in [15] by providing an upper bound on the lengths of the words included in the splicing rules. This bound is quadratic with respect to the size of the syntactic monoid of the language. The decidability follows from the fact that the bound allows brute-force search and comparison of the given language with splicing languages obtained through all possible finite sets of rules of a certain size. Although the existence of the algorithm was long awaited, the procedure it provides is useless for all practical purposes. Having a practical procedure to decide whether a regular language is splicing remains a challenging open problem. We believe that finding a characterization of minimal splicing systems recognizing splicing languages, where minimality of the system is given in terms of both the number of splice sites of rules and the length of the splice sites, would be a promising direction for obtaining a practical decision procedure. Moreover, since splicing rules are built from constants in reflexive languages, the notions of constants and synchronizing words again seem to be vital for answering most of the above questions.

Acknowledgments

We thank the reviewers for numerous valuable comments. P. Bonizzoni is partially supported by MIUR PRIN 2010–2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi”, code H41J12000190001, N. Jonoska is supported in part by the NSF grant CCF-1117254 and the NIH grant R01GM109459-01.

Contributor Information

Paola Bonizzoni, Email: bonizzoni@disco.unimib.it.

Nataša Jonoska, Email: jonoska@mail.usf.edu.

References

1. Berstel J, Perrin D. Theory of Codes. Academic Press, Inc; Orlando, Florida: 1985.
2. Bonizzoni P, De Felice C, Mauri G, Zizza R. Regular languages generated by reflexive finite linear splicing systems. Lect Notes Comput Sci; Proc. Development in Language Theory; Berlin: Springer; 2003. pp. 134–145.
3. Bonizzoni P, De Felice C, Zizza R. The structure of reflexive regular splicing languages via Schützenberger constants. Theor Comput Sci. 2005;334(1–3):71–98.
4. Bonizzoni P, Mauri G. Regular splicing languages and subclasses. Theor Comput Sci. 2005;340:349–363.
5. Bonizzoni P. Constants and label-equivalence: a decision procedure for reflexive regular splicing languages. Theor Comput Sci. 2010;411(6):865–877.
6. Bonizzoni P, Jonoska N. Regular splicing languages must have a constant. Lect Notes Comput Sci; Proc. Developments in Language Theory; Berlin: Springer; 2011. pp. 82–92.
7. Černý J. Poznámka k homogénnym eksperimentom s konecnými automatami. Mat-Fyz čas Slov Akad Vied. 1964;14:208–216.
8. Culik K, Harju T. Splicing semigroups of dominoes and DNA. Discrete Appl Math. 1991;31:261–277.
9. De Luca A, Restivo A. A characterization of strictly locally testable languages and its application to semigroups of free semigroup. Inf Control. 1980;44:300–319.
10. Goode E. PhD Thesis. Binghamton University; 1999. Constants and splicing systems.
11. Goode E, Pixton D. Recognizing splicing languages: syntactic monoids and simultaneous pumping. Discrete Appl Math. 2007;155:989–1006.
12. Head T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours. Bull Math Biol. 1987;49:737–759. doi: 10.1007/BF02481771.
13. Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation. Addison–Wesley; Reading, Mass: 2001.
14. Jonoska N. Sofic systems with synchronizing representations. Theor Comput Sci. 1996;158(1–2):81–115.
15. Kari L, Kopecki S. Deciding if a regular language is generated by a splicing system. Lect Notes Comput Sci; Proc. DNA Computing and Molecular Programming – 18th International Conference; Berlin: Springer; 2012. pp. 98–109.
16. Lind D, Marcus B. An Introduction to Symbolic Dynamics. Cambridge University Press; New York: 1995.
17. Paun G. On the splicing operation. Discrete Appl Math. 1996;70:57–79.
18. Paun G, Rozenberg G, Salomaa A. New Computing Paradigms. Springer-Verlag; Berlin: 1998. DNA Computing.
19. Pixton D. Regularity of splicing languages. Discrete Appl Math. 1996;69:101–124.
20. Schützenberger MP. Sur certaines opérations de fermeture dans le langages rationnels. Symp Math. 1975;15:245–253.
21. Verlan S. PhD Thesis. University of Metz; 2004. Head systems and applications to bio-informatics.
