Abstract
In spite of extensive investigation of finite splicing systems in formal language theory, basic questions, such as their characterization, remain unsolved. It has been conjectured that a necessary condition for a regular language L to be a splicing language is that L has a constant in the Schützenberger sense. We prove this longstanding conjecture. The proof is based on properties of strongly connected components of the minimal deterministic finite state automaton for a regular splicing language. Using constants of the corresponding languages, we also establish properties of transitive automata and path-automata.
1. Introduction
A splicing system, originally introduced in [12], is a formal model that uses a contextual cross-over operation over words to generate languages called splicing languages. This cross-over splicing formalizes the behavior of basic biomolecular processes involving cut and paste of DNA performed by restriction enzymes and a ligase. Restriction enzymes act on double stranded DNA molecules by cleaving certain recognized segments, leaving short single stranded overhangs. Molecules with the same overhangs can join (in a cross-over fashion) in the presence of a ligase enzyme. In the introductory paper, T. Head proved that if the splicing is performed by a finite set of certain simple rules, then splicing of a finite set of words can generate the class of strictly locally testable languages [9]. The splicing notion was reformulated by G. Păun at a less restrictive level of generality, giving rise to the splicing operation that is commonly adopted and is now standard [17].
Theoretical results in splicing systems have contributed to new research in formal language theory focused on modeling of biochemical processes [18]. On the other side, the field suggested new ideas in the framework of biomolecular science, for example, the design of automated enzymatic processes.
In this paper, we focus on finite splicing systems, which we simply call splicing systems. A splicing system has a finite set of rules (modeling enzymes) applied to a finite set of initial strings (modeling DNA sequences). Formally, a splicing system (or H-system) is a triple H = (A, I, R), where A is a finite alphabet, I ⊆ A* is the initial language and R is the set of rules (see Section 4 for the definitions). The formal language generated by the splicing system is the smallest language containing I and closed under the splicing operation.
There have been successes in characterizing certain subclasses of splicing languages, for example those generated by reflexive rules and those generated by symmetric rules [2]. Reflexivity and symmetry are natural properties for splicing systems because they assure splicing of molecules cut with the same enzyme, as well as recombining molecules resulting from the same type of cut [12]. A general splicing system may have a set of rules R that is neither symmetric nor reflexive. Under the formal model, a splicing system is a generative mechanism for a language which belongs to a class that is a proper subclass of the regular languages. This basic result was first proved in [8], and later proved in several other papers by using different approaches (see for example [19,21]).
In spite of the vast literature on the topic, a structural characterization of the finite splicing systems is still an open problem, although decidability of regular splicing languages has been recently proved in [15].
On the other hand, progress has been made towards the characterization of certain sub-classes of splicing systems. The authors of [11] prove that it is decidable whether a regular language is a reflexive splicing language and provide an example of a regular splicing language that is neither reflexive nor symmetric. A quite different characterization of reflexive symmetric splicing languages is given in [3] and it has been extended to the general class of reflexive regular languages in [4,5]. This characterization has been given by using the concept of a constant of a language introduced by Schützenberger [20].
In order to solve the open problem of characterizing the whole class of splicing languages, it seems necessary to understand the role of constants. Indeed, since the introduction of splicing languages it has been conjectured, and stated more formally in [10] and [11], that the existence of a constant is a necessary condition for a regular language to be splicing. In this paper we solve this longstanding open question by proving this conjecture true. This result is proved by investigating structural properties of connected components of the transition graph given by the minimal finite state automaton for a regular splicing language. More precisely, properties of the factor language of transitive components are related to the notion of synchronizing words [7]. Synchronizing words have been studied in automata theory for a long time and are of interest in both coding theory [1] and symbolic dynamics [16,14]. Our proof uses an old observation that a synchronizing word for an automaton is a constant for the language recognized by the automaton [20].
The paper is organized as follows.
In Section 2 we introduce preliminary concepts, including the notions of a synchronizing word and a constant. In Section 3 we introduce the notions of a transitive automaton and a path-automaton, and show several results connecting terminal component automata and synchronizing words. Moreover, we show a relationship between transitive languages, transitive automata, transitive components, and constants of the language. Then in Section 4 we recall the basic notion of a splicing system and revisit the notion of splicing rules of a splicing system by providing properties that are necessary in proving the main result of the paper. Finally, in Section 5 we give examples of non-reflexive splicing languages, show a relationship between transitive languages and splicing languages, and prove the main result of the paper. A preliminary extended abstract of this paper appeared in [6].
2. Preliminaries
We refer the reader to [13] for the background on automata theory, and assume some familiarity with the subject. Let A* be the free monoid over a finite alphabet A and let A+ = A* \ {1}, where 1 is the empty word. A deterministic finite state automaton (DFA) is a 5-tuple 𝒜 = (Q, A, I, T, ℰ), where Q is a finite set of states, I ⊆ Q is the set of initial states, T ⊆ Q is the set of terminal (final) states and ℰ ⊆ Q × A × Q is the set of transitions such that for every q ∈ Q and every a ∈ A the set {q′ | (q, a, q′) ∈ ℰ} consists of at most one element. Given a deterministic finite state automaton 𝒜, the set of transitions defines a partial action of A* on Q. It is generated by the partial maps a : Q → Q for a ∈ A, defined by q(a) = q′ iff q′ ∈ Q is the unique state with (q, a, q′) ∈ ℰ. We use the standard notation qa to denote q′. If such q′ does not exist, we write qa = ∅. Inductively, we extend the notation to words with qwa = (qw)a. Similarly, we write Qw for the image of the set Q under the map w : Q → Q defined by w(q) = qw. If qa is defined for all q ∈ Q and a ∈ A we say that 𝒜 is complete. A deterministic finite state automaton is usually depicted as a directed graph with vertices Q and a set of directed edges ℰ. For an edge e = (q, a, q′) we say that q is its “start” state, q′ is its “end” state (also referred to as an end-point) and a is its label. A word w is accepted by an automaton 𝒜 if there is a path with label w that starts at an initial state and ends at a terminal state. We denote by L(𝒜) the language recognized by 𝒜, that is, the set of all words accepted by 𝒜 [13]. Given a regular language L ⊆ A* it is well-known that there is a unique minimal complete deterministic finite state automaton (mDFA) ℳ = (Q, A, {q0}, T, ℰ) that recognizes L such that all other complete DFA with one initial state that recognize L map homomorphically onto ℳ [13]. This automaton is unique up to possible renaming of the states, i.e., up to an isomorphism. We reserve the notation ℳ(L) to denote this automaton.
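The partial action of words on states can be made concrete with a short sketch. The representation below (a dictionary from (state, letter) pairs to states) and the names act, image, accepts and toy_delta are assumptions made for illustration only; the two-state automaton is a hypothetical example, not one taken from the paper, and None plays the role of qw = ∅.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]  # partial transition map: (q, a) -> q'


def act(delta: Delta, q: State, word: str) -> Optional[State]:
    """Return the state qw, or None when qw is undefined (qw = ∅ in the text)."""
    for a in word:
        if (q, a) not in delta:
            return None
        q = delta[(q, a)]
    return q


def image(delta: Delta, states: Iterable[State], word: str) -> Set[State]:
    """Return Qw, the set of all defined states qw for q in Q."""
    result = set()
    for q in states:
        r = act(delta, q, word)
        if r is not None:
            result.add(r)
    return result


def accepts(delta: Delta, initial, terminal, word: str) -> bool:
    """A word is accepted if it takes some initial state to a terminal state."""
    return any(act(delta, q, word) in terminal for q in initial)


# Hypothetical two-state partial DFA recognizing a*b (illustration only).
toy_delta: Delta = {(0, "a"): 0, (0, "b"): 1}
print(accepts(toy_delta, {0}, {1}, "aab"))  # True
print(accepts(toy_delta, {0}, {1}, "aba"))  # False
print(image(toy_delta, {0, 1}, "b"))        # {1}
```

The sketch assumes that None is never used as a state name. Completeness of the automaton corresponds to delta being defined on every (state, letter) pair.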
Given a language L, the language F(L) is the set of all factors of words in L, where x is a factor of a word w if w = zxy for some z, y ∈ A*. We say that L is factor-closed if F(L) = L.
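For a finite language the factor set can be computed directly, which may help when checking small examples by hand. The following is a minimal sketch; the function names are assumptions introduced here, and it only applies to finite sets of words, whereas the languages of interest in the paper are in general infinite.

```python
def factors(word: str) -> set:
    """All factors x of word, i.e. all x with word = z x y (the empty word included)."""
    n = len(word)
    return {word[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def factor_closure(language) -> set:
    """F(L) for a finite language L."""
    closure = set()
    for w in language:
        closure |= factors(w)
    return closure


def is_factor_closed(language) -> bool:
    """True iff F(L) = L for a finite language L."""
    return factor_closure(language) == set(language)


print(sorted(factors("ab")))                    # ['', 'a', 'ab', 'b']
print(is_factor_closed({"ab"}))                 # False
print(is_factor_closed({"", "a", "b", "ab"}))   # True
```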
The right context of a word w ∈ A* with respect to a language L is defined by R_L(w) = {x ∈ A* | wx ∈ L}. Symmetrically, the left context of w with respect to L is the set L_L(w) = {x ∈ A* | xw ∈ L}.
The right context of a state q in 𝒜 is R_𝒜(q) = {x ∈ A* | qx ∈ T}. An automaton 𝒜 is said to be reduced if there are no two states in 𝒜 with the same right context. Observe that the right context depends only on the terminal states in the automaton. In other words, if the initial states of 𝒜 are changed but the transitions and the set of terminal states remain, the right contexts of the states do not change. It is well-known (see, e.g., [13]) that given a regular language L, there is a one-to-one correspondence between the right contexts of words with respect to L and the right contexts of the states in the minimal deterministic finite state automaton ℳ = ℳ(L) for L, i.e., R_L(w) = R_ℳ(q0w) for every w ∈ A*.
In fact, in the mDFA ℳ, it also holds that R_L(w) = R_ℳ(q) iff R_L(wa) = R_ℳ(qa) for all a ∈ A, and therefore R_ℳ(q) = R_ℳ(q′) implies q = q′.
When the language and the DFA are fixed, we drop the subscripts and write R(w) and R(q).
Note that every state in an mDFA is accessible, i.e., for each state q ∈ Q there is an x ∈ A* such that q0x = q. A state q is co-accessible if R(q) ≠ ∅. In an mDFA, there is at most one state that is not co-accessible, since for each q ∈ Q, there is u ∈ A* such that qu ∈ T iff R(q) ≠ ∅, and two distinct states with empty right context cannot occur in a reduced automaton. If such a state in ℳ(L) exists, we call it zero and denote it by z. A trimmed mDFA for a language L is the DFA obtained from the mDFA for L by erasing the state z and all transitions that terminate in z. The trimmed mDFA is denoted trim ℳ(L).
More generally, a trimmed DFA 𝒜 is an automaton in which all states are both accessible and co-accessible.
Finally, for a finite set S, by #S, we denote the cardinality of the set S.
Definition 1
Given a DFA 𝒜 and a state q of the automaton, the set of follower words for q relative to 𝒜 is the set F_𝒜(q) = {x | qx ≠ ∅}.
For states q and q′ of 𝒜, we say that they are follower-equivalent if F_𝒜(q) = F_𝒜(q′). For a state q, the set of states in 𝒜 that are follower-equivalent to q is denoted μq(𝒜).
A state q of 𝒜 is said to be minimal-follower with respect to 𝒜 if, whenever F_𝒜(q′) ⊆ F_𝒜(q) for a state q′ of 𝒜, the states q and q′ are follower-equivalent.
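Follower-equivalence of two states can be decided without enumerating the (possibly infinite) follower sets, by exploring pairs of states in parallel. The sketch below assumes the same dictionary representation of a partial DFA as a map from (state, letter) to state; the function name and the two example automata are hypothetical and serve only as an illustration of Definition 1.

```python
from collections import deque


def follower_equivalent(delta, alphabet, p, q) -> bool:
    """True iff p and q have the same follower words.

    Two states differ exactly when some word w makes one of pw, qw
    defined and the other undefined, so a breadth-first search over
    reachable pairs (pw, qw) suffices.
    """
    seen = {(p, q)}
    queue = deque([(p, q)])
    while queue:
        s, t = queue.popleft()
        for a in alphabet:
            s_next = delta.get((s, a))
            t_next = delta.get((t, a))
            if (s_next is None) != (t_next is None):
                return False  # the word read so far, followed by a, separates p and q
            if s_next is not None and (s_next, t_next) not in seen:
                seen.add((s_next, t_next))
                queue.append((s_next, t_next))
    return True


# Hypothetical examples: a three-state cycle on the letter a, and a
# two-state automaton in which only state 1 can read b.
cycle = {(1, "a"): 2, (2, "a"): 3, (3, "a"): 1}
other = {(0, "a"): 1, (1, "a"): 0, (1, "b"): 1}
print(follower_equivalent(cycle, "a", 1, 2))   # True
print(follower_equivalent(other, "ab", 0, 1))  # False
```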
Recall the definition of a constant of a language L introduced by Schützenberger in [20].
Definition 2
A word w ∈ A+ is a constant of a language L if w is a factor of some word in L and for all words u1, u2, v1, v2 in A* we have: u1wv1 ∈ L and u2wv2 ∈ L imply u1wv2 ∈ L.
A characterization of constants, which is more or less folklore, is stated below.
Proposition 1
Let L ⊆ A* be a regular language and let ℳ(L) be the mDFA recognizing L. A word w ∈ A+ is a constant of L if and only if Qw \ {z} is a singleton, i.e., there is a unique non-zero state q_w such that, for all q ∈ Q, qw ≠ z implies qw = q_w.
Suppose w is a label of a path in a finite state automaton. If there is a state q_w such that every path in the automaton with label w terminates in q_w, we say that w is a synchronizing word and that q_w is a synchronizing state, synchronized by w. By Proposition 1, in the trimmed mDFA trim ℳ(L) of a regular language L, the set of synchronizing words for trim ℳ(L) coincides with the set of constants of L. In general, if w is a synchronizing word for an automaton 𝒜 then it is a constant for the language recognized by 𝒜.
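The characterization above suggests a direct test: compute the image Qw and check that it is a single state. The sketch below also searches for a shortest synchronizing word by breadth-first search over subsets of states. It assumes that the input is a trimmed DFA given as a dictionary from (state, letter) to state; all names and both example automata are hypothetical, and the subset search can take exponential time in the number of states.

```python
from collections import deque


def image(delta, states, word):
    """Qw: the set of states reached by paths labeled word."""
    current = set(states)
    for a in word:
        current = {delta[(q, a)] for q in current if (q, a) in delta}
    return current


def is_synchronizing(delta, states, word) -> bool:
    """True iff every path labeled word ends in one and the same state."""
    return len(image(delta, states, word)) == 1


def shortest_synchronizing_word(delta, states, alphabet):
    """Breadth-first search over subsets Qw; returns a nonempty word or None."""
    start = frozenset(states)
    seen = set()
    queue = deque([(start, "")])
    while queue:
        subset, word = queue.popleft()
        for a in sorted(alphabet):
            nxt = frozenset(delta[(q, a)] for q in subset if (q, a) in delta)
            if not nxt or nxt in seen:
                continue
            if len(nxt) == 1:
                return word + a
            seen.add(nxt)
            queue.append((nxt, word + a))
    return None


# Hypothetical examples: in two_state the letter a sends every state to 1;
# the three-state cycle on a permutes its states and is never synchronized.
two_state = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0}
cycle = {(1, "a"): 2, (2, "a"): 3, (3, "a"): 1}
print(shortest_synchronizing_word(two_state, {0, 1}, "ab"))  # a
print(shortest_synchronizing_word(cycle, {1, 2, 3}, "a"))    # None
```

When the automaton is the trimmed mDFA of L, Proposition 1 identifies the words accepted by is_synchronizing with the constants of L; for other automata the test only gives a sufficient condition.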
The context of w with respect to L is the set CL(w) = {(u, v) | u, v ∈ A*, uwv ∈ L}. We define the left projection of the context of w (resp. right projection) as the set (respectively ). A constant w of L defines a constant language Const(w) with respect to the language L with the set . Given two constants w1 and w2 of L, a split language for w1 and w2 with respect to L is a language where is a prefix (possibly empty) of w1 and is a suffix (possibly empty) of w2.
3. Transitive components and synchronizing words
In this section we provide structural characterizations of transitive components in a minimal DFA using the notion of synchronizing words. We define the notions of a transitive automaton and of a path-automaton, and give properties that are used to prove the main result of the paper.
We first introduce definitions and properties that are used in the rest of the paper.
Recall the notion of a transitive component in a deterministic automaton. A strongly connected component of the directed graph for a deterministic automaton 𝒜 is called a transitive component for 𝒜. If every edge that starts at a state in a transitive component also ends in the same component, then the transitive component is called terminal. For every state in the mDFA of a language L, there is a path that leads from that state to a terminal component. For a transitive component 𝒞, we say that 𝒞 is induced by q if q is a state in 𝒞. We write L(𝒞) for the set of labels of all paths in 𝒞 and say that 𝒞 recognizes L(𝒞). A transitive component 𝒞 is called trivial if L(𝒞) = {1}.
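Transitive components and the terminal and trivial properties can be computed from the transition graph. The following sketch uses plain reachability (quadratic in the number of states) rather than a linear-time strongly connected component algorithm, which keeps it short; the function names and the three-state example are assumptions made for illustration.

```python
def reachable(delta, q):
    """All states reachable from q, including q itself (via the empty word)."""
    seen = {q}
    stack = [q]
    while stack:
        s = stack.pop()
        for (src, _letter), dst in delta.items():
            if src == s and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def transitive_components(delta, states):
    """Strongly connected components, listed in order of first appearance."""
    reach = {q: reachable(delta, q) for q in states}
    components, assigned = [], set()
    for q in states:
        if q in assigned:
            continue
        comp = frozenset(p for p in reach[q] if q in reach[p])
        components.append(comp)
        assigned |= comp
    return components


def is_terminal(delta, comp) -> bool:
    """Every edge that starts in the component also ends in it."""
    return all(dst in comp for (src, _letter), dst in delta.items() if src in comp)


def is_trivial(delta, comp) -> bool:
    """No edge stays inside the component, so only the empty word labels a path."""
    return not any(src in comp and dst in comp for (src, _letter), dst in delta.items())


# Hypothetical automaton: states 0 and 1 form a cycle, state 2 has a loop.
example = {(0, "a"): 1, (1, "b"): 0, (1, "a"): 2, (2, "a"): 2}
print(transitive_components(example, [0, 1, 2]))
print(is_terminal(example, frozenset({2})))     # True
print(is_terminal(example, frozenset({0, 1})))  # False
```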
A language L is said to be transitive if for every pair of words u, v ∈ L there is a word w ∈ A* such that uwv ∈ L. Note that for a transitive component 𝒞 the language L = L(𝒞) is transitive.
Remark 1
Notice that if 𝒞 is a transitive component, then L(𝒞) is factor-closed, i.e., F(L(𝒞)) = L(𝒞).
Two transitive components 𝒞 and 𝒞′ are called factor-equivalent if L(𝒞) = L(𝒞′). In the following we often use the term component to denote a transitive component.
A component 𝒞 is said to be maximal for a collection of components C if for every transitive component 𝒞′ in C, we have that L(𝒞) ⊆ L(𝒞′) implies L(𝒞) = L(𝒞′). Analogously, a transitive component 𝒞 is called minimal for a collection C if whenever L(𝒞′) ⊆ L(𝒞) for a component 𝒞′ in C we have L(𝒞′) = L(𝒞).
3.1. Transitive automata
In this section we relate the notion of a synchronizing word to properties of a transitive automaton. An automaton is called transitive if it consists of only one transitive component.
Remark 2
Note that if 𝒜 is transitive, then L(𝒜) is also transitive. Consider two words u, v ∈ L(𝒜). There are initial states q0 and q0′ such that q0u, q0′v ∈ T. Since 𝒜 is transitive, there is a word w that is a label of a path from q0u to q0′ in 𝒜, so uwv ∈ L(𝒜).
Example 3.1
Consider the automaton shown in Fig. 1. The language it recognizes is transitive, and the automaton is reduced and deterministic, hence it is the mDFA for the language. However, there is no deterministic transitive automaton that recognizes this language. Notice that this language has no constants.
Fig. 1.
A transitive language that doesn’t have a transitive deterministic automaton. The initial state is indicated with an arrow and the terminal states are shaded.
Remark 3
If L is transitive such that L = L(𝒞) for a transitive component 𝒞, then for each state q in 𝒞, R_𝒞(q) = F_𝒞(q) since all states in 𝒞 are terminal.
We consider several observations about transitive automata, transitive components and languages. The following observations are proved in [14] (see also [16]):
Lemma 2
For every regular factor-closed transitive language L there is a unique minimal deterministic transitive automaton 𝒜̂ recognizing L.
Lemma 3
For a regular factor-closed transitive language L and its unique minimal deterministic transitive automaton 𝒜̂ the following properties hold:
(i) Every state in 𝒜̂ is synchronizing.
(ii) A word w ∈ L is a constant for L if and only if w is synchronizing for 𝒜̂.
(iii) Every two states q̂ and p̂ in 𝒜̂ (q̂ ≠ p̂) are not follower-equivalent.
(iv) For every transitive DFA 𝒜 with L(𝒜) = L there is an onto homomorphism ϕ : 𝒜 → 𝒜̂ such that for every state q̂ in 𝒜̂, F_𝒜(q) = F_𝒜̂(q̂) for each q ∈ ϕ⁻¹(q̂).
Observe that if a state q of a transitive component 𝒞 is synchronizing, then all states in 𝒞 are synchronizing.
Consider the action of A* on the set of states Q_𝒜 of 𝒜. In order to simplify the notation, the action of w on the set Q_𝒜 is denoted 𝒜w instead of Q_𝒜w, and moreover we say that q is a state of 𝒜 if q ∈ Q_𝒜.
Remark 4
If c is a constant of L(𝒜) for a transitive automaton 𝒜, and 𝒜̂ is the minimal transitive deterministic automaton for L(𝒜) such that c synchronizes onto q̂, then by Remark 3 and Lemma 3(ii–iv) every state q in 𝒜c maps with ϕ onto q̂, and has the same follower set as q̂. We say that q is follower-equivalent to q̂. In particular, if c is a constant such that qc = q in 𝒜, then q̂c = q̂ in 𝒜̂, and for every q ∈ 𝒜c, the state qc is in 𝒜c and is follower-equivalent to q̂, and to q.
Remark 5
If q is a state in a transitive automaton 𝒜 and 𝒜̂ is the minimal transitive deterministic automaton as in Lemma 3 with q̂ ∈ 𝒜̂ follower-equivalent to q, then for all q′ ∈ μq(𝒜) there are constants c, c′ ∈ L(𝒜) such that q̂c = q̂c′ = q̂, q′c = q and qc′ = q′. Take any constant c1 ∈ L(𝒜) such that q̂c1 = q̂. Then by transitivity there are x, y such that qc1x = q′ and q′c1y = q. By Remark 4, c = c1y and c′ = c1x are the constants sought.
If 𝒜 is a transitive deterministic automaton without a synchronizing word, then there is some k ≥ 2 such that #𝒜w ≥ k for every word w ∈ L(𝒜). We call the largest such k, that is, the minimum of #𝒜w over all words w ∈ L(𝒜), the degree of 𝒜. A word w such that #𝒜w = k is called k-synchronizing. Therefore, the minimal transitive DFA 𝒜̂ for L(𝒜) has degree 1, and all constants coincide with (1-)synchronizing words. It follows from Lemmas 2 and 3 that if 𝒜 has degree k and w is a constant that is k-synchronizing, then all states in 𝒜w are follower-equivalent. Moreover, in that case for all x ∈ A* with wx ≠ ∅, wx is also k-synchronizing.
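The degree can be computed by exploring all nonempty images of the state set under words. The following is a minimal sketch under the assumption that the degree is the smallest size of a nonempty image 𝒜w; the function name and the example automata (a three-state cycle on one letter, and a two-state automaton synchronized by the letter a) are hypothetical. The search visits subsets of states and may therefore be exponential in the worst case.

```python
from collections import deque


def degree(delta, states, alphabet) -> int:
    """Smallest size of a nonempty image Qw, found by search over subsets."""
    start = frozenset(states)
    seen = {start}
    queue = deque([start])
    best = len(start)  # the empty word gives the image Q itself
    while queue:
        subset = queue.popleft()
        best = min(best, len(subset))
        for a in alphabet:
            nxt = frozenset(delta[(q, a)] for q in subset if (q, a) in delta)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best


# The letter a permutes the three states of the cycle, so every image has size 3.
cycle = {(1, "a"): 2, (2, "a"): 3, (3, "a"): 1}
# In two_state the letter a maps both states to state 1, so the degree is 1.
two_state = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0}
print(degree(cycle, {1, 2, 3}, "a"))    # 3
print(degree(two_state, {0, 1}, "ab"))  # 1
```

A word w is then k-synchronizing, for k the returned value, exactly when its image has k states.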
The following lemma relates the right contexts of states in a transitive automaton reached by reading a word w that is k-synchronizing.
Lemma 4
Let 𝒜 = (Q, A, Q, Q, ℰ) be a transitive DFA with degree k ≥ 2 and let 𝒜′ = (Q, A, I, T, ℰ) be a reduced DFA obtained from 𝒜 by choosing a subset of states I as initial states, and some proper subset T of Q as terminal states. If w is k-synchronizing and q, q′ ∈ 𝒜w with q ≠ q′, then R_𝒜′(q) \ R_𝒜′(q′) ≠ ∅.
Proof
Let k ≥2 be the degree of
, w a k-synchronizing word, and q1,
. Since w is k-synchronizing, for all words z ∈ A*, either both q1z,
are undefined or
. In the rest of the proof we drop the subscript
from
. Suppose
. Since
is reduced, there is a word x1 such that
and q1x1 ∉ T. Set q2 = q1x1 and
. Then
because otherwise, if
then
which is a contradiction with our assumption that
. We have that
. Since
is transitive, there is x2 such that
. Denote
with q3, and set
. Again, similarly as with q2 and
, we have that
which implies that both
and
are in T. In fact,
. We continue in this way and consider the pairs of states
where
. Since
is finite, there are i and j such that
for some i < j. But
. Because
, this is a contradiction with the assumption that
is reduced. Therefore,
.
Example 3.2
Consider the reduced automaton in Fig. 2(a). It contains two terminal components that are mutually factor-equivalent, both recognizing a*. Moreover, the states q1, q2 and q3 in Fig. 2(a) are follower-equivalent, and the component that contains these three states has degree 3. Every word a^k for k ≥ 0 is 3-synchronizing. Consider the automaton 𝒜 that consists of {q1, q2, q3} with q1 being the initial and also the terminal state. Then R_𝒜(q1) = (a^3)*, R_𝒜(q2) = aa(a^3)* and R_𝒜(q3) = a(a^3)*.
Fig. 2.
Two automata, initial states are indicated with an arrow pointing to them and the terminal states are shaded.
Example 3.3
The transitive component 𝒞 of the automaton in Fig. 2(b) has degree 2. All words that end with symbol a are labels of paths that end in states q2 and q3, and all words that end with symbol b are labels of paths that end in states q1 and q2. The action of c on the states of 𝒞 is the identity. Hence every word that contains symbols a or b is 2-synchronizing. Note that all states are follower-equivalent but 𝒞 is reduced. Moreover, a ∈ R(q1) \ R(q2) and aa ∈ R(q2) \ R(q1), also aa ∈ R(q2) \ R(q3) and a ∈ R(q3) \ R(q2). However, c ∈ R(q1) \ R(q3), but R(q3) ⊂ R(q1). This last condition does not violate Lemma 4 because there are no 2-synchronizing words that label paths ending at states q1 and q3.
3.2. Path-automata
In this section we provide structural characterizations of path-automata that do not have synchronizing words. More precisely, we show that a path-automaton having no synchronizing words has a unique maximal component, which is the terminal one, whose language contains all factors of the language accepted by the path-automaton.
Definition 3 (Path-automaton)
An automaton 𝒜 with an initial state q0 is called a path-automaton if the following is satisfied:
(i) There is at most one transition in 𝒜 which starts at the component induced by q0 and terminates in another component.
(ii) There is only one terminal transitive component in 𝒜.
(iii) For every transitive component 𝒞 which does not contain q0 there is precisely one transition that starts in a state outside 𝒞 but terminates in 𝒞, and if 𝒞 is not terminal, there is precisely one transition that starts at a state in 𝒞 but terminates in a state outside 𝒞.
Let 𝒜 be a path-automaton and 𝒞 one of its transitive components. The state of 𝒞 that is the end point of the transition starting outside 𝒞 but ending at 𝒞 is called the entrance state for 𝒞, and the state that is the start point of a transition that starts in 𝒞 but terminates outside 𝒞 is called the exit of 𝒞. The initial component of 𝒜 has no entrance, and the terminal component has no exit.
A path π from an initial state in an automaton 𝒜 to a terminal component in 𝒜 induces a path-automaton 𝒜_π which consists of all transitive components in 𝒜 induced by states visited by π.
Let 𝒞 be a terminal component of the path-automaton 𝒜_π, and let q be the entrance of 𝒞. We define the language accepted by the component 𝒞 induced by the path π, denoted by Lπ(𝒞), as the language accepted by the automaton 𝒞 with initial state q.
Lemma 5
Every trimmed deterministic path-automaton with two transitive components, and whose terminal component is trivial, has a synchronizing word.
Proof
Let q0 be an initial state for a path-automaton 𝒜 with two transitive components, the terminal one being trivial. Let x be the label of the edge starting from the initial component (say at state s) and ending at the terminal component (say at state t). Let 𝒞 be the initial component for 𝒜, and let k be the degree of 𝒞. Let w be k-synchronizing for 𝒞. Because 𝒞 is transitive, we can extend w such that there is a path with label w that ends at s, i.e., #𝒞w = k and s ∈ 𝒞w. Since k is the degree of 𝒞, for all z ∈ A*, either wz ∉ L(𝒞) or 𝒞wz ⊆ 𝒞 with #𝒞wz = k. As 𝒜 is deterministic, there is only one edge starting at s with label x and this edge leads outside 𝒞. Therefore wx ∉ L(𝒞), but wx ∈ F(L(𝒜)). Hence wx is a synchronizing word that synchronizes onto t.
We first give a technical lemma that is used later.
Lemma 6
Given a deterministic path-automaton 𝒜, let 𝒞 be the terminal component of 𝒜. If 𝒜 has no synchronizing word, then 𝒞 is the unique maximal transitive component in 𝒜.
Proof
Assume
is an automaton with transition function δ that has no synchronizing words. Let
, …,
=
be the transitive components of
. Let
and
be the states in the component
(i = 2, …, k − 1) such that
is the entrance state of
and
is the exit state of
. For i = 1 we only have
and for i = k we only have
. We set
for the initial state q0, and
for a fixed terminal state q′. For i = 1, …, k − 1, let xi be the label of the transition from state
to state
, i.e.,
. Consider L(
), …, L(
). Because these languages are all transitive, there is a maximal transitive among them. Assume L1, …, Ls are all distinct maximal transitive languages such that for each j = 1, …, s, there is a transitive component
with Lj = L(
). Then for each i = 1, …, s, there are words wi ∈ Li such that wi ∉ Lj if i ≠ j (as Li is maximal, for each j ≠ i there is wij ∈ Li \ Lj, and due to the transitivity of Li, there are zij such that wi = wi1zi1wi2 ··· zis−1wis−1). Note that for each language Li there might be several transitive components that recognize it.
We consider words yi (i = 1, …, k) such that yi is a label of a path in
from
to
in the following way. (i) If L(
) = Lj is a maximal transitive language, then wj is a factor of yi, and yi is a constant for L(
) which uniquely determines the follower-equivalence class of
, meaning,
. This is always possible by Lemma 3 and the transitivity of L(
). (ii) If L(
) is not a maximal transitive language, then yi is a label of the shortest path between
and
.
Consider the word y1x1 ··· yk−1xk−1yk. Let p be the smallest index of 1, …, k such that L(
) is maximal transitive and r be the largest index such that L(
) is maximal transitive. Then u = ypxpyp+1 ··· xr−1yr is a word that starts at a maximal transitive component, visits all maximal transitive components, and terminates at the last maximal transitive component. Since
has no synchronizing words, there must be at least one more path in
with label u. But, by the choice of yi and Lemma 5, every path with label ypxp must start in a transitive component recognizing L(
) and must have a transition with label xp leading outside the component, because ypxp ∉ L(
). Let i1, i2, …, iν be all indexes between p and r such that i1 = p, iν = r and L(
) is a maximal transitive language. By the choice of yi’s, yi1 = yp, yi2, …, yiν = yr uniquely determine the languages of the transitive components Ci1, …, Ciν, that is yij ∈ L(Cij) but yij ∉ L(Cit) if j ≠ t. Therefore, there is a one-to-one correspondence between the order of appearance of yp, yp+1, …, yr in u and the order of the maximal transitive components. Hence, the only possibility for existence of another path with label u is if such a path also starts at
. Although there might be many paths with label yp in
, by Lemma 3 they all end at follower-equivalent states, and due to determinism, there is at most one of those states that is the start of a transition with label xp, and that is
(by Lemma 5, ypxp is synchronizing for
). Hence, u (or uxr) is a synchronizing word unless p = r = k, i.e., xp does not exist. As we assumed that there are no synchronizing words for
, there is at most one maximal transitive language and it must be recognized by the terminal transitive component.
The following proposition characterizes a path-automaton with no synchronizing words.
Proposition 7
Given a deterministic path-automaton 𝒜, let 𝒞 be the terminal component of 𝒜. Then one of the following holds:
(a) 𝒜 has a synchronizing word, or
(b) F(L(𝒜)) = L(𝒞).
Proof
We prove the proposition by induction on the number of transitive components in the path-automaton. If 𝒜 consists of a single component, then the proposition holds trivially as 𝒜 = 𝒞. Now assume that the proposition holds for all path-automata with fewer than k transitive components, and suppose that 𝒜
has k transitive components
, …,
with
being initial and
=
terminal. Denote with
the entrance of
and
the exit of
. Consider the path automaton
with initial state
and transitive components
, …,
. As this path automaton has k − 1 components, by the inductive hypothesis, either
has a synchronizing word, or F(L(
)) = L(
). Note that L(
) ⊆ F(L(
)) holds trivially, so we only consider the converse inequality.
Case 1
The path automaton
has a synchronizing word. Let y be the synchronizing word for the automaton
which consists of
with trivial terminal component consisting of
. By Lemma 5, y exists, and we can assume that y synchronizes onto
. Let w be synchronizing for
, and since for every state q in
there is a path from
to q, we can assume that
. We observe that yw is synchronizing for
. There is no path in
with label y, since y is synchronizing for
, hence every path in
that has a label y terminates in a state in
. Since w is synchronizing for
, every path in
with label yw terminates in a single state. Thus yw is synchronizing for
and part (a) is satisfied.
Case 2
The path automaton
has no synchronizing word. Then by the inductive hypothesis, F(L(
)) ⊆ L(
). Assume that
has no synchronizing word. We show that all words in F(L(
)) appear as labels of paths in
=
. As in Case 1, consider
which consists of
with trivial terminal component consisting of
. Let w be a label of a path in
. If there is a path in
with label w then, w ∈ L(
).
Assume now that all paths with label w start in
. If all paths with label w also end at
then, by Lemma 5, w is a factor of a word y that synchronizes onto
of
, and hence y is synchronizing for
, and lemma holds.
Suppose there is a path in
with label w that starts at
and terminates in
. We observe that in this case also,
has a synchronizing word. Let u be the shortest word such that w = uxv where x is a symbol,
and
. Let c ∈ L(
) be a constant for L(
) that fixes the follower-equivalence class of
, meaning,
. Such c exists by Lemma 2 and Lemma 3. By transitivity of
, there is a word c′ such that cc′u also fixes the follower equivalence class for
and is a label of a path that terminates at
. Consider cc′w = cc′uxv. Then cc′ux is synchronizing for
in
, by Lemma 5. But cc′w is not synchronizing for
, hence there must be another path in
with label cc′w, and by our assumption, it starts in
and must terminate in
. Such a path must use the transition
, either with a portion of the path labeled cc′ or with a portion labeled w. In the first case w is a label of a path in
, hence w ∈ L(
). In the second case, there must be u′ and v′ such that cc′w = cc′u′xv′ = cc′uxv. Since u was the shortest word such that
, it must be that u = u′, in which case cc′ux is synchronizing for
. It is impossible that u is a proper prefix of u′ because this would imply
cc′ux ⊆
which would contradict the fact that cc′ux synchronizes onto
in
.
Example 3.4
The automaton in Fig. 3 is a path-automaton with no synchronizing words. It has only one terminal component which is maximal and the factors of all words in the language are labels of paths in the terminal component. This illustrates the situation (b) in Proposition 7.
Fig. 3.
A path-automaton with no synchronizing words.
The following result is used to prove the main result (Theorem 15, Section 5.3) of the paper.
Proposition 8
Let L be a regular language, x ∈ F(L) and trim ℳ(L) be the trimmed mDFA for L. At least one of the two cases holds:
(i) x is a factor of a constant for L,
(ii) there is a path-automaton induced by a path of trim ℳ(L) containing a path labeled x and having a non-trivial terminal transitive component with at least two states.
Proof
Let trim ℳ(L) = (Q, A, {q0}, T, ℰ) be the trimmed mDFA for the language L. Suppose x ∈ F(L) is not a factor of a constant, i.e., for every v, v′ ∈ A*, vxv′ is not a constant for L, and therefore not synchronizing for trim ℳ(L). Consider a word w such that #Qxw = min{#Qxu | u ∈ A*, Qxu ≠ ∅} and let P_w = Qxw. Since xw is not synchronizing, by Proposition 1, #P_w > 1. Then for every word u ∈ A* we have that either Qxwu = ∅ or #Qxwu = #P_w. Therefore, we can assume that all states in P_w are in terminal components of trim ℳ(L) (if not, we can concatenate w with words that are labels of paths that lead to terminal components). If all terminal components in trim ℳ(L) are trivial, then because trim ℳ(L) is reduced, there is only one trivial terminal transitive component, implying #P_w = 1, which contradicts our assumption that x does not extend to a constant. Thus there must be at least one terminal transitive component which is not trivial. If there is a state in P_w that belongs to a component that is not a single-state component then (ii) holds. Assume to the contrary that each state in P_w is in a distinct transitive component consisting of only one state having loops at itself. Let y be a label of one of these loops. Then P_w y ≠ ∅, which implies P_w y = P_w, i.e., for every q ∈ P_w we have qy = q. This means that all states in P_w are terminal, their loops must have the same labels, and therefore their right contexts are equal. Hence the states in P_w cannot be distinct in a reduced automaton. This again implies that P_w has cardinality 1, a contradiction. Hence, there must be at least one state in P_w that belongs to a terminal transitive component with at least two states.
4. Splicing languages and properties of splicing rules
As mentioned, in this paper we consider the general notions of the splicing operation and the splicing system given by Păun [17], as defined below.
Definition 4
A finite splicing system is a triple S = (A, I, R), where A is a finite alphabet, I ⊂ A* is a finite set of strings, called the initial language, and R is a finite set of splicing rules of the form r = (u1, u2)(u3, u4), with ui ∈ A* for i = 1, 2, 3, 4.
Given two words x = x1u1u2x2, y = y1u3u4y2, with x1, x2, y1, y2 ∈ A* and a rule r = (u1, u2)(u3, u4), the splicing rule produces w = x1u1u4y2 denoted (x, y) ⊢r w. We also say that u1u2, u3u4 are splice sites of r and u1u4 is the paste site of r.
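The splicing operation just defined can be illustrated with a short Python sketch. It is only an illustration under our own conventions, not part of the original development: a rule (u1, u2)(u3, u4) is encoded as a 4-tuple of strings, the empty word 1 is written as the empty string, and the function name `splice` is hypothetical.

```python
from typing import Iterator, Tuple

# A rule (u1, u2)(u3, u4); the empty string plays the role of the empty word 1.
Rule = Tuple[str, str, str, str]


def splice(x: str, y: str, rule: Rule) -> Iterator[str]:
    """Yield every w = x1 u1 u4 y2 obtained from x = x1 u1 u2 x2 and y = y1 u3 u4 y2."""
    u1, u2, u3, u4 = rule
    site_x, site_y = u1 + u2, u3 + u4
    # Every occurrence of the splice site u1u2 in x is a possible cut position.
    for i in range(len(x) - len(site_x) + 1):
        if not x.startswith(site_x, i):
            continue
        left = x[: i + len(u1)]              # x1 u1
        # Every occurrence of the splice site u3u4 in y is a possible cut position.
        for j in range(len(y) - len(site_y) + 1):
            if y.startswith(site_y, j):
                right = y[j + len(u3):]      # u4 y2
                yield left + right


# Example: the rule (b, aaa)(da, 1) applied to x = b aaa and y = da a^6.
result = set(splice("baaa", "da" + "a" * 6, ("b", "aaa", "da", "")))
```

Here `result` contains the single word b a^6: the prefix x1u1 = b of x is pasted onto the suffix u4y2 = a^6 of y.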
To simplify the notation, in the following, by a splicing system we mean a finite splicing system.
Let L ⊆ A*. We denote σ(L) = {w ∈ A*|(x, y) ⊢r w, x, y ∈ L, r ∈ R}. The (iterated) splicing operation is defined as follows: σ0(L) = L, σi+1(L) = σi(L) ∪ σ(σi(L)), i ≥ 0. Finally, σ*(L) = ⋃i≥0 σi(L).
Definition 5 (Splicing language)
Given a finite splicing system S = (A, I, R), the language L(S) = σ*(I) is the language generated by S. A language L is a splicing language if there is a splicing system S such that L = L(S).
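Since σ*(I) is in general infinite, it cannot be enumerated directly, but a length-bounded approximation is easy to compute. The sketch below is an assumption-laden illustration: the names `splice` and `bounded_closure` and the parameter `max_len` are ours, and words longer than `max_len` are discarded at every step, so the result is a subset of σ*(I) restricted to short words and may miss short words whose derivation needs longer intermediate words.

```python
from typing import Iterable, Set, Tuple

Rule = Tuple[str, str, str, str]  # (u1, u2, u3, u4); "" is the empty word


def splice(x: str, y: str, rule: Rule) -> Set[str]:
    """All words x1 u1 u4 y2 obtained from x and y with the given rule."""
    u1, u2, u3, u4 = rule
    out = set()
    for i in range(len(x) - len(u1 + u2) + 1):
        if x.startswith(u1 + u2, i):
            for j in range(len(y) - len(u3 + u4) + 1):
                if y.startswith(u3 + u4, j):
                    out.add(x[: i + len(u1)] + y[j + len(u3):])
    return out


def bounded_closure(initial: Iterable[str], rules: Iterable[Rule], max_len: int) -> Set[str]:
    """Iterate sigma until no new word of length <= max_len appears."""
    rules = list(rules)
    language = {w for w in initial if len(w) <= max_len}
    while True:
        new = {
            w
            for r in rules
            for x in language
            for y in language
            for w in splice(x, y, r)
            if len(w) <= max_len
        } - language
        if not new:
            return language
        language |= new


# Toy system: I = {ab}, R = {(ab, 1)(a, b)}; each splicing step appends one more b.
toy = bounded_closure({"ab"}, [("ab", "", "a", "b")], max_len=5)
```

For this toy system the bounded closure consists of the words ab, abb, abbb and abbbb, in agreement with the generated language ab⁺ restricted to length at most 5.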
For a word w and a set of states Q, the right context of Q w is the union, over all q ∈ Q, of the right contexts of the states qw.
Definition 6 (Paste site at p)
Consider a DFA with state set Q for a regular splicing language L. The word u1u4 is said to be a paste site at a state p ∈ Q for a splicing rule r = (u1, u2)(u3, u4) if the right context of Q u3u4 is included in the right context of pu1u4 and pu1u2 ≠ ∅.
More precisely, the notion of a paste site at a state p is used to identify states of the automaton where a rule can be applied. Fig. 4 depicts the situation for a paste site at state p. The dotted path with label u3 may not exist in the automaton, but the right context of qu3u4 (wherever a path with such a label exists) must be included in the right context of pu1u4.
Fig. 4.
Paste site at state p, the dotted path with label u3 may or may not exist. But the right context of qu3u4 is included in the right context of pu1u4 for every q.
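The paste-site condition can be checked mechanically on a concrete automaton. The following Python sketch is an illustration only. The automaton is our own reconstruction of a trimmed DFA for the language b(a³)* + cba* + da(a³)* (studied in Lemma 13); the state names and the helper names `run`, `right_context_included` and `is_paste_site` are hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Tuple

# Assumed trimmed DFA for b(aaa)* + cba* + da(aaa)*; state names are ours.
DELTA: Dict[Tuple[str, str], str] = {
    ("s0", "b"): "t0", ("s0", "c"): "sc", ("s0", "d"): "sd",
    ("sd", "a"): "t0",
    ("t0", "a"): "t1", ("t1", "a"): "t2", ("t2", "a"): "t0",
    ("sc", "b"): "loop", ("loop", "a"): "loop",
}
STATES = {"s0", "sc", "sd", "t0", "t1", "t2", "loop"}
TERMINAL = {"t0", "loop"}
ALPHABET = "abcd"


def run(state: Optional[str], word: str) -> Optional[str]:
    """Follow the path labeled word; None means the path does not exist."""
    for letter in word:
        if state is None:
            return None
        state = DELTA.get((state, letter))
    return state


def right_context_included(q: str, p: Optional[str]) -> bool:
    """True if every word accepted from q is also accepted from p."""
    seen = {(q, p)}
    queue = deque([(q, p)])
    while queue:
        a, b = queue.popleft()
        if a in TERMINAL and b not in TERMINAL:
            return False
        for letter in ALPHABET:
            na = DELTA.get((a, letter))
            if na is None:
                continue
            nb = DELTA.get((b, letter)) if b is not None else None
            if (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((na, nb))
    return True


def is_paste_site(p: str, rule: Tuple[str, str, str, str]) -> bool:
    """Check the two conditions of the paste-site definition at state p."""
    u1, u2, u3, u4 = rule
    if run(p, u1 + u2) is None:
        return False
    target = run(p, u1 + u4)
    sources = {run(q, u3 + u4) for q in STATES} - {None}
    return all(right_context_included(q, target) for q in sources)


ok = is_paste_site("s0", ("b", "aaa", "da", ""))        # rule (b, aaa)(da, 1)
not_ok = is_paste_site("s0", ("b", "aaa", "b", "aaa"))  # rule (b, aaa)(b, aaa)
```

In this reconstruction the word b is a paste site at the initial state for the rule (b, a³)(da, 1), while ba³ is not a paste site for (b, a³)(b, a³): the state reached by cba³ has right context a*, which is not included in (a³)*.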
In what follows we assume that every splicing system is such that all rules are applied at least once during the generation of the splicing language. The following lemma shows an equivalence between splicing systems with respect to the extension of sites and paste sites of rules.
Lemma 9
Let S = (A, I, R) be a finite splicing system and r = (u1, u2)(u3, u4) be a splicing rule in R. Let c ∈ A*. Then L(S) is also generated by the splicing system S′ = (A, I, R′), where R′ = R ∪ {r′} for r′ = (u1, u2)(u3, u4c).
Proof
It is clear that L(S) ⊆ L(S′) since R′ contains R. The converse also holds since whenever we have (x, y) ⊢r′ w we also have (x, y) ⊢r w.
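The key step of this proof, that every application of r′ is also an application of r, can be checked on sample words. The sketch below is only a sanity check under our own encoding (rules as 4-tuples of strings, hypothetical helper names `splice` and `extend_rule`); it does not replace the argument above.

```python
from typing import Set, Tuple

Rule = Tuple[str, str, str, str]  # (u1, u2, u3, u4); "" is the empty word


def splice(x: str, y: str, rule: Rule) -> Set[str]:
    """All words x1 u1 u4 y2 obtained from x and y with the given rule."""
    u1, u2, u3, u4 = rule
    out = set()
    for i in range(len(x) - len(u1 + u2) + 1):
        if x.startswith(u1 + u2, i):
            for j in range(len(y) - len(u3 + u4) + 1):
                if y.startswith(u3 + u4, j):
                    out.add(x[: i + len(u1)] + y[j + len(u3):])
    return out


def extend_rule(rule: Rule, c: str) -> Rule:
    """Return r' = (u1, u2)(u3, u4 c)."""
    u1, u2, u3, u4 = rule
    return (u1, u2, u3, u4 + c)


r = ("b", "aaa", "da", "")
r_ext = extend_rule(r, "aaa")
xs = ["baaa", "baaaaaa", "cbaaa"]
ys = ["da", "daaaa", "daaaaaaa"]

# Whatever r' produces from a pair (x, y), r produces as well.
covered = all(splice(x, y, r_ext) <= splice(x, y, r) for x in xs for y in ys)
```

The converse inclusion fails in general for a fixed pair of words: r applies to y = da, whereas r′ needs the longer site daaaa. This is why the lemma adds r′ to R instead of replacing r.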
Lemma 10
Let S = (A, I, R) be a finite splicing system and
a DFA for L = L(S). If u1u4 is a paste site at state p for a rule r = (u1, u2)(u3, u4) ∈ R then for every c ∈ A* with pu1u4c ≠ ∅, u1u4c is a paste site at p for a rule r′ = (u1, u2)(u3, u4c).
Proof
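Suppose that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4), and let pu1u4c ≠ ∅. Then by Lemma 9, L is also generated by the splicing system S′ = (A, I, R′) for the set of rules R′ = R ∪ {r′}, where r′ = (u1, u2)(u3, u4c). The first splice site of r′ equals that of r, thus pu1u2 ≠ ∅. It only remains to show that the right context of Q u3u4c is included in the right context of pu1u4c. But if y is in the right context of Q u3u4c, then cy is in the right context of Q u3u4, which is included in the right context of pu1u4, and so y is in the right context of pu1u4c. It follows that u1u4c is a paste site at state p for rule r′.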
5. Splicing languages must have a constant
5.1. Reflexive and non-reflexive splicing languages
It is known that every splicing language generated by a finite splicing system is always regular [8,19]. More precisely, regular splicing languages form a proper subclass of the class of regular languages.
Recall that a splicing system S is said to be reflexive if for every rule r = (u1, u2)(u3, u4) in R, both (u1, u2)(u1, u2) and (u3, u4)(u3, u4) are rules in R. A language L is said to be a reflexive splicing language if there is a reflexive splicing system S such that L = L(S). The system S is said to be symmetric if (u1, u2)(u3, u4) being in R implies that (u3, u4)(u1, u2) is in R. The notion of a constant of a language turned out to be essential in providing a characterization of the class of reflexive regular splicing languages [11,3]. Indeed, a fundamental property of a reflexive regular splicing language L is that there exists a splicing system generating L whose rules have splice sites consisting of constants for L. A more precise characterization shows that the class of reflexive and symmetric splicing languages coincides with a class of regular languages, the so-called PA-con-split languages [3]. This result has been extended to the non-symmetric case [4]. In [4], it is shown that each language L in this class is constructed from a finite set of constants for L: L is expressed as the union of a finite set X and a finite union of constant and split languages (see end of Section 2). The characterization is given by the following proposition.
Proposition 11. (See [4].)
A regular language L is a reflexive splicing language if and only if there is a finite set X ⊂ A*, a finite set of constants K1 of L and a finite set K2 of pairs of constants of L such that
The characterization of reflexive languages in Proposition 11 helps to describe factor-closed transitive regular languages as reflexive splicing languages.
Proposition 12
If L is a factor-closed transitive regular language then L is a reflexive splicing language.
Proof
Since L is factor-closed, by Lemma 2, consider its minimal deterministic transitive automaton
. By Lemma 3, all states in
are synchronizing. Every word w ∈ L is a label of a path in
that passes through some state q, hence w = w′w″ where w′ is in the left context of a constant that labels a path starting at q and w″ is in the right context of a constant that synchronizes onto q. Because all states in
are initial and terminal, we have that
and
for every constant w of L. Let M be a set that consists of constants by choosing one constant mq for each state q in
that is a label of a path starting and ending at the state q. Then L = ⋃mq∈M Split(mq, mq) where
and splitting is performed by taking the empty prefix and the empty suffix of mq. The conclusion follows directly from the characterization in Proposition 11.
An example of a non-reflexive regular splicing language is given in [11]; this is the language L = a⁺b⁺a⁺b⁺a⁺ ∪ a⁺b⁺a⁺.
Example 5.1
The path-automaton
of Fig. 5 generalizes the example of regular non-reflexive language given in [11]. More precisely, it is possible to show, similarly as in [11], that any splicing system for the language Lk = ⋃1≤i≤k(a+b+)ia+ with k ≥ 3 must have a rule whose both splice sites are not constants for the language. A splicing system S for Lk can be defined with an initial language Ik = ∪1≤i≤k{a(ba)i, a2(ba)i, (ab)ia2} and rules:
Fig. 5.
A path-automaton that recognizes a non-reflexive splicing language.
The proof that this splicing system generates the language Lk is along the same lines as the one given in [11]. Observe that both splice sites of rules r3,i = (a, (ba)i)((ab)k−i, ab), for i > 1, are not constants for the language Lk. More precisely, rules r1,k and r2,k are used to increase the initial and final number of a’s in language a+(ab)ka+, respectively. Rules r3,i are used to increase the number of a’s in the (k − i)th appearance of a’s in (a+b+)ka+, for i ≤ k. Similarly, rules r4,i are used to increase the number of b’s in language Lk. The rules r3,i are also used to obtain (a+b+)ja+, for j < k.
The following lemma gives another example of a non-reflexive splicing language, one whose trimmed mDFA is not a path-automaton (Fig. 6).
Fig. 6.
The figure reports the automaton of Fig. 2(a) detailing the minimal terminal component which consists of states q2, q4 and q3.
Lemma 13
The regular language L = b(a³)* + cba* + da(a³)* is a non-reflexive splicing language.
Proof
First we show that L ⊆ A*, for A = {a, b, c, d}, is a splicing language. A splicing system S = (A, I, R) for L consists of the rules R = {r1 = (cba, 1)(cb, a), r2 = (daa³, 1)(da, 1), r3 = (b, a³)(da, 1)} and the initial language I = {ba³, b, cba, cb, daa³, da}. By induction on the number k of iterations of the splicing operation, we first show that L(S) ⊆ L. If k = 0, the inclusion holds since I ⊆ L. Assume that w ∈ L(S) is generated in k > 0 iterations by applying a rule r to a pair of words w1, w2 ∈ L(S) obtained in at most k − 1 iterations. By the induction hypothesis, w1, w2 ∈ L. Checking the splice sites in w1 and w2 for each of the rules, one sees immediately that w ∈ L.
In order to show that L ⊆ L(S), we observe that the language L1 = da(a³)* is generated by rule r2 applied to words of L1, starting from daa³. Similarly, the language L2 = cba* is generated by rule r1 applied to words of L2. The language L3 = b(a³)* is generated by rule r3 applied to a word of b(a³)* and a word of da(a³)*. Indeed, by induction on i ≥ 0 we show that b(a³)ⁱ ∈ L(S). If i = 0 or i = 1, the claim is immediate since b, ba³ ∈ I. Otherwise, for i > 1, given the words b(a³)ⁱ⁻¹ ∈ L(S) and da(a³)ⁱ ∈ L(S), rule r3 generates b(a³)ⁱ ∈ L(S).
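Both inclusions can be sanity-checked on short words. The following Python sketch is an illustration, not a proof: it computes a length-bounded closure of I under the three rules (words longer than `max_len` are discarded, and all helper names are ours) and compares it with L, described here by a regular expression.

```python
import re
from typing import Iterable, Set, Tuple

Rule = Tuple[str, str, str, str]  # (u1, u2, u3, u4); "" is the empty word 1

RULES = [("cba", "", "cb", "a"), ("daaaa", "", "da", ""), ("b", "aaa", "da", "")]
INITIAL = {"baaa", "b", "cba", "cb", "daaaa", "da"}
IN_L = re.compile(r"b(aaa)*|cba*|da(aaa)*")


def splice(x: str, y: str, rule: Rule) -> Set[str]:
    """All words x1 u1 u4 y2 obtained from x and y with the given rule."""
    u1, u2, u3, u4 = rule
    out = set()
    for i in range(len(x) - len(u1 + u2) + 1):
        if x.startswith(u1 + u2, i):
            for j in range(len(y) - len(u3 + u4) + 1):
                if y.startswith(u3 + u4, j):
                    out.add(x[: i + len(u1)] + y[j + len(u3):])
    return out


def bounded_closure(initial: Iterable[str], rules: Iterable[Rule], max_len: int) -> Set[str]:
    """Iterated splicing, keeping only words of length <= max_len."""
    rules = list(rules)
    language = {w for w in initial if len(w) <= max_len}
    while True:
        new = {
            w
            for r in rules for x in language for y in language
            for w in splice(x, y, r) if len(w) <= max_len
        } - language
        if not new:
            return language
        language |= new


def words_of_L(max_len: int) -> Set[str]:
    """All words of L of length <= max_len."""
    words = set()
    for n in range(max_len + 1):
        for w in ("b" + "aaa" * n, "cb" + "a" * n, "da" + "aaa" * n):
            if len(w) <= max_len:
                words.add(w)
    return words


generated = bounded_closure(INITIAL, RULES, max_len=12)
sound = all(IN_L.fullmatch(w) for w in generated)   # generated words lie in L
complete = words_of_L(11) <= generated              # short words of L are generated
```

The bound for completeness is one less than `max_len` because b(a³)ⁱ is derived from the longer word da(a³)ⁱ, which must itself survive the length cut.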
Finally, L is not a reflexive splicing language, that is, it cannot be generated by a reflexive splicing system. Suppose to the contrary that L is generated by a reflexive system S. We obtain a contradiction by considering the generation of words in b(a³)*. Since the only words of L that start with b are those in b(a³)*, and S must generate words of the form b(a³)ᵏ for arbitrarily large k, there must be a rule r with a splice site u1u2 that is a factor of a word in b(a³)*. Because S is reflexive, S must also contain the rule (u1, u2)(u1, u2). This rule can be applied to x = b(a³)ᵏ and y = cb(a³)ᵏa, for some large k > 0, to generate the word w = b(a³)ᵏa ∉ L. Therefore the language L cannot be generated by reflexive rules.
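The obstruction to reflexivity can be made concrete for one particular splice site. The sketch below is an illustration under our own assumptions: it fixes the site (b, a³) and k = 3, whereas the argument above covers any splice site that is a factor of a word in b(a³)*; the helper names are hypothetical.

```python
import re
from typing import Set, Tuple

Rule = Tuple[str, str, str, str]  # (u1, u2, u3, u4); "" is the empty word 1

IN_L = re.compile(r"b(aaa)*|cba*|da(aaa)*")
RULES = {("cba", "", "cb", "a"), ("daaaa", "", "da", ""), ("b", "aaa", "da", "")}


def is_reflexive(rules: Set[Rule]) -> bool:
    """Every rule (u1,u2)(u3,u4) comes with (u1,u2)(u1,u2) and (u3,u4)(u3,u4)."""
    return all(
        (u1, u2, u1, u2) in rules and (u3, u4, u3, u4) in rules
        for (u1, u2, u3, u4) in rules
    )


def splice(x: str, y: str, rule: Rule) -> Set[str]:
    """All words x1 u1 u4 y2 obtained from x and y with the given rule."""
    u1, u2, u3, u4 = rule
    out = set()
    for i in range(len(x) - len(u1 + u2) + 1):
        if x.startswith(u1 + u2, i):
            for j in range(len(y) - len(u3 + u4) + 1):
                if y.startswith(u3 + u4, j):
                    out.add(x[: i + len(u1)] + y[j + len(u3):])
    return out


k = 3
x = "b" + "aaa" * k            # b(a^3)^k, a word of L
y = "cb" + "aaa" * k + "a"     # cb(a^3)^k a, a word of L
reflexive_rule = ("b", "aaa", "b", "aaa")

offending = splice(x, y, reflexive_rule)
escapes_L = [w for w in offending if not IN_L.fullmatch(w)]
```

With these choices the reflexive rule produces the word b a¹⁰, which starts with b but has a number of a's that is not a multiple of 3, so it lies outside L.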
5.2. Canonical and special words
The proof of the main result (Theorem 15) is based on special words in a regular splicing language L that must be generated by a splicing rule whose splice site u3u4 is a constant of L. For lack of a better name, we call these words q-canonical and k-special words.
Informally, the q-canonical word of a component
is a word c such that qc = q and every such path with label c crosses all states in the component, and moreover, the word c is able to identify the language L(
) of the component
.
Definition 7 (q-Canonical)
Let
be an automaton and let
be a component of
. Let q be a state of
. Then a word c ∈ A+ such that c ∈ L(
) and qc = q is called q-canonical for
with respect to
if whenever c ∈ L(
), for another component
of
, implies that L(
) ⊆ L(
).
In the following we show the existence of a q-canonical word for every state in every transitive component in
. We give a constructive proof based on the notion of a k-special word for L(
) as defined below.
Definition 8 (k-Special)
Let
be an automaton. A word c in L(
) is k-special for the language L if every word of F(L) of length ≤k is a factor of c.
Example 5.2
Consider the automaton of Fig. 6. Then the word a³ is q2-canonical for the terminal component
consisting of states q2, q3, q4. Given the language L = L(
), then the word a³ is k-special for the language L for k ≤ 3.
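The k-special condition of Definition 8 can be tested directly on this example. The Python sketch below rests on our own modeling assumptions: the component on states q2, q3, q4 is taken to be a single cycle labeled a, its language is taken to be the set of labels of paths inside the component (a factor-closed set, so it coincides with its own set of factors), and the helper names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Delta = Dict[Tuple[str, str], str]

# Assumed shape of the terminal component of Fig. 6: a 3-cycle labeled a.
COMPONENT: Delta = {("q2", "a"): "q3", ("q3", "a"): "q4", ("q4", "a"): "q2"}


def run(delta: Delta, state: Optional[str], word: str) -> Optional[str]:
    """Follow the path labeled word from state; None if it leaves the component."""
    for letter in word:
        if state is None:
            return None
        state = delta.get((state, letter))
    return state


def path_labels(delta: Delta, max_len: int) -> Set[str]:
    """Labels of all paths of length <= max_len inside the component."""
    states = {q for q, _ in delta} | set(delta.values())
    frontier = {(q, "") for q in states}
    labels = {""}
    for _ in range(max_len):
        frontier = {
            (delta[(p, a)], w + a)
            for (q, w) in frontier
            for (p, a) in delta
            if p == q
        }
        labels |= {w for _, w in frontier}
    return labels


def is_k_special(c: str, delta: Delta, k: int) -> bool:
    """Every path label of length <= k must occur as a factor of c."""
    return all(f in c for f in path_labels(delta, k))


special_3 = is_k_special("aaa", COMPONENT, 3)
special_4 = is_k_special("aaa", COMPONENT, 4)
loops_at_q2 = run(COMPONENT, "q2", "aaa") == "q2"
```

Under these assumptions a³ is 3-special but not 4-special, since a⁴ labels a path in the component and is not a factor of a³; the last check confirms that a³ labels a cycle at q2, as required of a q2-canonical word.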
Lemma 14
Given a non-trivial transitive component
in a DFA
, let k = (#Q)2. Then for every state q in
there is a q-canonical of
that is a k-special constant of L(
).
Proof
Let {x1,…, xn} = L(
) ∩ A≤k. Being
a transitive component, there are y1,…, yn−1 such that x1
y1x2 ···yn−1xn ∈ L(
). Set c = x1
y1x2 ···yn−1xn. Due to transitivity, for every q ∈
there are yq and
such that
is a label of a path that starts and ends at q. By Remark 5, yq,
can be chosen so that wq is a constant. We show that wq is q-canonical. Assume that wq ∈ L(
) for some transitive component
. Take the shortest word z ∈ L(
) \ L(
). Since L(
) \ L(
) = L(
) ∩ (L(
))c, it can be recognized by an automaton with at most #Q (
) · #Q (
) ≤ k states [13], the shortest word in this language has length at most k. Thus |z| ≤ k and therefore z must be a factor of c, i.e., z must be in L(
), contradicting the existence of z.
5.3. Proof of the main result
Considering the importance of constants in characterization of sub-classes of regular splicing languages, it has been conjectured that every splicing language must have a constant [10,11]. Our main result proves this conjecture to be true.
Theorem 15 (Main result)
If L is a regular splicing language, then L has a constant.
Example 5.3
The path-automaton
of Fig. 3 has no synchronizing word (see Example 3.4) and thus the language L(
) = a*c(c*ac*a)* has no constant. By Theorem 15, L(
) is not a regular splicing language.
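The absence of a synchronizing word can be verified by a breadth-first search over images of the full state set. The sketch below is an illustration under stated assumptions: a word w is taken to be synchronizing when Qw is a single state, the first automaton is our own minimal DFA built from the expression a*c(c*ac*a)* (it is not copied from Fig. 3), the second is our reconstruction of a trimmed DFA for b(a³)* + cba* + da(a³)*, and all state and function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Tuple

Delta = Dict[Tuple[str, str], str]


def find_synchronizing_word(states: Iterable[str], alphabet: str, delta: Delta) -> Optional[str]:
    """Return a shortest word w with #Qw = 1, or None if no such word exists."""
    start = frozenset(states)
    if len(start) == 1:
        return ""
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        subset, word = queue.popleft()
        for a in sorted(alphabet):
            image = frozenset(delta[(q, a)] for q in subset if (q, a) in delta)
            if len(image) == 1:
                return word + a
            if image and image not in seen:
                seen.add(image)
                queue.append((image, word + a))
    return None


# Assumed minimal DFA for a*c(c*ac*a)*; T is the only terminal state.
NO_CONSTANT: Delta = {
    ("p0", "a"): "p0", ("p0", "c"): "T",
    ("T", "c"): "U", ("T", "a"): "V",
    ("U", "c"): "U", ("U", "a"): "V",
    ("V", "c"): "V", ("V", "a"): "T",
}

# Assumed trimmed DFA for b(aaa)* + cba* + da(aaa)*.
WITH_CONSTANT: Delta = {
    ("s0", "b"): "t0", ("s0", "c"): "sc", ("s0", "d"): "sd",
    ("sd", "a"): "t0",
    ("t0", "a"): "t1", ("t1", "a"): "t2", ("t2", "a"): "t0",
    ("sc", "b"): "loop", ("loop", "a"): "loop",
}

none_found = find_synchronizing_word({"p0", "T", "U", "V"}, "ac", NO_CONSTANT)
some_word = find_synchronizing_word(
    {"s0", "sc", "sd", "t0", "t1", "t2", "loop"}, "abcd", WITH_CONSTANT
)
```

For the first automaton the search exhausts all reachable subsets without ever reaching a singleton, in line with the example. For the second it stops at a one-letter word, in line with Theorem 15 applied to the splicing language of Lemma 13.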
Example 5.4
The transitive regular language L recognized by the automaton in Fig. 1 has no constants. By Theorem 15, the language L is not a splicing language.
Example 5.5
The regular language L = b(a³)* + cba* + da(a³)* is another example of a non-reflexive splicing language, as proved in Lemma 13. Fig. 2(a) shows the trimmed mDFA for L. Observe that a path-automaton induced by a path in the mDFA from the initial state q0 to a terminal component does not necessarily have a constant of L. Indeed, the path-automaton in Fig. 2(a) recognizing the language b(a³)* ⊂ L does not have any constant of L, because every word in b(a³)* is also a factor of a word in cba* and is therefore not a synchronizing word for the automaton of L.
Given a splicing regular language L, the proof of Theorem 15 shows existence of a splicing rule r = (u1, u2)(u3, u4) such that the word u1u4 ends in a non-trivial terminal component of the trimmed mDFA trim
. More precisely, u1u4 ends in a state which we show to be synchronizing for the automaton trim
.
Let L be a regular splicing language and let trim
= (Q, A, {q0}, T,
) be the trimmed mDFA for language L. We introduce some basic notations that are used in the proof. We are interested in states of the automaton trim
that are found as follows.
Consider a non-trivial terminal component
that is minimal among the non-trivial terminal components in the automaton trim
. If a non-trivial terminal component does not exist, then by Proposition 8, trim
must have a constant and Theorem 15 holds. Let q ∈
be a minimal-follower state with respect to
and recall that with μq(
) we denote the set of states in
that are follower-equivalent to q. Let C = {
=
,
, ···,
} be the set of all terminal components of the automaton that are factor-equivalent to
. Consider the set
(note that by Lemma 3, for each i = 1,…, k, the collection of follower sets in
coincides with the collection of follower sets in
).
Then, a candidate state of trim
is a state q̄ ∈ F with q̄ ∈
for some component
∈ C such that
(q̄) is minimal in the following sense: for all q ∈ F, whenever
(q) ⊆
(q̄), it holds that
(q)=
(q̄), i.e., being trim
reduced it holds that q = q̄.
The main idea of the proof is to show that either the automaton has no non-trivial terminal components, in which case a constant exists (see Proposition 8), or there exists a candidate state that is synchronizing for the automaton trim
.
Example 5.6
Consider the automaton in Fig. 6. Observe that the minimal terminal component
induced by state q2 has language L(
) = a*, with L(
) = {a3}*, and is factor-equivalent to the component induced by the state q1. Then the set F, corresponding to the candidate component
, is the set of states F = {q1, q2, q3, q4}, because all these states belong to a single follower-equivalence class, which is therefore the minimal follower-equivalence class. The candidate states are then q2, q3, q4.
We will use the following lemma.
Lemma 16
Let q̄ ∈
be a candidate state and let q̄1 ∈ F ∩
. Then q̄1 is also a candidate state.
Proof
Let
be the minimal deterministic transitive automaton for
∈ C from Lemma 3. Suppose q̄ ∈
is a candidate state and let q̄1 ∈ F ∩
. Let further q̂ ∈
be the follower-equivalent state to q̄. Then by Remark 5 there is a constant c of L(
) such that q̂c = q̂ and q̄1c = q̄.
Let q′ ∈ F. First, suppose that q′ ∈
for some
≠
. Consider
such that
. Because c is a constant, by Remark 4,
is follower-equivalent to q̂, so we have
. Because q̄ is a candidate state, there are two possibilities (a) right contexts of q̄ and
are incomparable, i.e.,
and
, or (b)
(equality cannot hold because trim
is reduced). In both cases there must be a
. Then q′cz is a terminal state while q̄1cz is not. Therefore, in both cases
(q′) ⊈
(q̄1).
Also, if q′ ∈
∩ F then by Lemma 4,
(q′) \
(q̄1) ≠ ∅.
Therefore q̄1 is a candidate state.
Before we present details of the proof of Theorem 15 we outline the steps involved in the proof by illustrating the situation in Example 5.6 shown in Fig. 6.
In trim
we identify a candidate state q̄ within a non-trivial component C̄ as outlined above. (For Example 5.6 we choose state q2.) We consider a q̄-canonical word c and observe that there must be a rule (u1, u2)(u3, u4) with a paste site u1u4 at a state p that lies on a path labeled wcˢx, for some s, where q0 w = q̄ and q̄x is terminal (see Fig. 7). (For Example 5.6, p = q0 and wcˢx = b(a³)ˢ with w = b and c = a³; the rule in question is r3 = (b, a³)(da, 1).)
We observe that there is a state q ∈ Q u3u4 such that, for arbitrarily large i, ci is a factor of the right context of q. (For Example 5.6, such states are only q2, q3, q4, because q1 ∉ Q daa3x for any x.) We choose a sufficiently large i such that for some z, all states in Q u3u4zci belong to non-trivial components and we set . We observe that all states in end in non-trivial components that are factor-equivalent to the non-trivial component
. By Lemma 10 and obtain that
is a paste site for p, given rule
. (For Example 5.6, we can choose z = 1 and have a new rule r3 = (b, a3)(da, a3), and Q daa3 = q2.)We show that for every , it must be , therefore is synchronizing.
Fig. 7.
A possible paste site at state p.
We now present the proof of the main result.
Proof of Theorem 15
Let L be a regular splicing language, and let trim
= (Q, A, {q0}, T,
) be its trimmed mDFA. By Proposition 8, if the automaton trim
has only a trivial terminal component (note that since trim
is reduced, there could be only one such component), it must have a constant and thus the theorem holds. Therefore we consider the case that trim
has at least one non-trivial terminal component.
Consider a non-trivial terminal component
that is minimal among the non-trivial terminal components in the automaton trim
and let q ∈
be minimal-follower state in
. Let C = {
=
, ···,
} be the set of all terminal components of the automaton that are factor-equivalent to
and set
. We choose a candidate state q̄ ∈ F in a component
∈ C.
Let w ∈ A* be the shortest word such that q0w = q̄. Consider a word c which is a constant of L(
) and is q̄-canonical for
. Such a word exists by Lemma 14. Then wc*x ⊆ L for some x ∈ A*. Since there is a finite number of rules in the splicing system, there are infinitely many indices s such that the words wcˢx are obtained by using the same splicing rule r = (u1, u2)(u3, u4), where u1u4 is a subword of wcˢx for every such s. More precisely, there must exist an infinite number of pairs of words v = v′u1u2
v″ ∈ L and w′u3u4
w″ ∈ L such that v′u1u4
w″ ∈ wc*x. Thus v′u1 is a prefix of wcix for some i ≥ 0. Let p be such that pu1u2 ≠ ∅ where p = q0
v′. Moreover, if y″ ∈
(Qu3u4), since there is y′ such that y = y′u3u4
y″ ∈ L, by splicing words v = v′u1u2
v″ and y = y′u3u4
y″ with rule r, we obtain v′u1u4
y″ ∈ L and thus y″ ∈
(pu1u4). Therefore,
(Qu3u4) ⊆
(pu1u4). We obtain that u1u4 is a paste site at state p for rule r = (u1, u2)(u3, u4). Refer to Fig. 7.
In the following we show that there are states in Q u3u4 such that ci is a factor of a word in their right context for arbitrarily large i’s.
Let p′ = q0 v′u1u4 where v′ is a prefix of wc*x. Being trim
deterministic, and since
is terminal, by the choice of p′ it must be that p′ is either inside the component
or otherwise lies along a path with label w from state q0 to state q̄. In the latter case when p′ is not a state inside
, v′u1u4 is a prefix of w. In this case cix must be a suffix of w″ in the splicing of v′u1u2
v″ ∈ L and w′u3u4
w″ ∈ L that produces v′u1u4
w″ = wcⁱx. Hence, for arbitrarily large i, cⁱ must be a factor of a word in the right context of a state q ∈ Q u3u4. Since there are infinitely many i with this property and Q u3u4 is finite, there is a state q ∈ Q u3u4 such that cⁱ ∈ F (
(qu3u4)) for arbitrarily large i. Now suppose p′ is a state in
(see Fig. 7). By Proposition 8, u3u4 is either a factor of a constant, which proves the statement of the theorem, or there is a path-automaton
(a sub-automaton of trim
) with a non-trivial terminal component
such that u3u4 is a label of path π in
. Then, by Lemma 14, there is a q-canonical word c′ for some q ∈
such that u3u4zc′ is a label of a path in
for some z, that is zc′ ∈
(qu3u4). Since u1u4 is a paste site for the rule r at state p, we have that zc′ ∈
(qu3u4) ⊆
(pu1u4) =
(p′) ⊆ L(
). But because c′ is q-canonical for
it follows that L(
) ⊆ L(
) and by the minimality of component
and since
is factor equivalent to
we have that L(
) = L(
), i.e., c* ⊆ L(
). Therefore cⁱ is a factor of the right context of a state q ∈ Q u3u4 for arbitrarily large i. We now consider states in Q u3u4 whose right context has words with factors cⁱ for arbitrarily large i. We fix i sufficiently large such that, for some z, every state in Q u3u4zcⁱ belongs to a non-trivial component and, for every state q̂ ∈ Q u3u4zcⁱ, the language of the component
containing q̂ contains the word c. Given,
, by Lemma 10,
is a paste site at the same state p for the rule
. Observe that z can be chosen such that
. If p′ = pu1u4 is not in
, by the argument above, cⁱx is a suffix of w″, hence we can choose z such that w″ = zcⁱx, i.e., p′z = q̄. Because c is a constant for L(
) such that q̄c = q̄, by Lemma 3 and Remark 4, every state q ∈
c is follower-equivalent to q̄. Therefore, the state
is follower-equivalent to q̄, and hence in F. Having q̄0 ∈ F ∩
, by Lemma 16, q̄0 is also a candidate state. Let q be a state in . We conclude with the observation that q = q̄0 and therefore is a synchronizing word for q̄0, proving the theorem. The proof of this last step consists first in showing that L(
) = L(
) where
is the component in trim
containing q. Then we show that
is terminal and thus
∈ C. By the fact that q̄0 is a candidate state we are able to show that q̄0 = q.
We first observe that L(
) = L(
). As
, by Definition 6 of paste site, it holds that
| (*) |
Since c is q̄-canonical, by Definition 7 we have that L(
) ⊆ L(
). If L(
) \ L(
) ≠ ∅ then
(q) ⊈
(q̄0) which contradicts (*). Therefore it must be that L(
) = L(
), that is
and
are factor-equivalent.
Next we see that
is terminal. Assume to the contrary that
is not terminal and thus there is an edge labeled a that starts in
and terminates in a state q′ outside
. By Lemma 5 the automaton that consists of
together with the edge labeled a ending at q′ has a synchronizing word that ends at q′. Let ua be that word. Then ua ∉ L(
) = L(
), because otherwise ua would not be synchronizing. By the transitivity of
we can assume that ua is a label of a path that starts at q. This implies that ua must be a prefix of a word in
(q)\
(q̄0), again contradicting (*). Consequently
is terminal and in C. Moreover q ∈ F, since by the choice of the constant c, by Remark 4, q is factor-equivalent to q̄ (hence, to q̄0), and F consists of all states that are factor-equivalent to q̄ and belong to components in C. Thus by (*) (i.e.,
(q) ⊆
(q̄0)) and the fact that q̄0 is a candidate state, we obtain
(q̄0)=
(q). Because trim
is reduced, q̄0 = q, which concludes the proof.
The proof of Theorem 15 is based on the effective computation of a synchronizing state in the automaton for a regular splicing language in the case of an automaton having non-trivial terminal components. As a main corollary of the above theorem we can state the following fact.
Corollary 17
Let trim
be the trimmed minimal deterministic automaton recognizing a splicing regular language. Then every state in a terminal component that contains a candidate state for trim
is synchronizing.
6. Concluding remarks
In this paper we settle a conjecture posed by T. Head in his seminal works on regular splicing languages, namely that the existence of a constant is a necessary condition for a regular language to be a splicing language. We settle this open problem in the affirmative by providing a constructive proof that leads to a procedure for finding a synchronizing state in an mDFA for a regular splicing language.
The use of constants allows one to determine a necessary and sufficient condition for a regular language to be a reflexive splicing language [3,4]; identifying such a condition for non-reflexive splicing languages is still an open problem.
Recently, decidability of regular splicing languages has been proved in [15] by providing an upper bound on the lengths of the words included in the splicing rules. This bound is quadratic with respect to the size of the syntactic monoid of the language. The decidability follows from the fact that the bound allows brute-force search and comparison of the given language with splicing languages obtained through all possible finite sets of rules of a certain size. Although the existence of the algorithm was long awaited, the procedure it provides is not usable in practice. Having a practical procedure to decide whether a regular language is a splicing language remains a challenging open problem. We believe that finding a characterization of minimal splicing systems generating splicing languages, where minimality of the system is measured in terms of both the number of splice sites of rules and the length of the splice sites, would be a promising direction for obtaining a practical decision procedure. Moreover, since splicing rules are built from constants in reflexive languages, the notions of constants and synchronizing words again seem to be vital for answering most of the above questions.
Acknowledgments
We thank the reviewers for numerous valuable comments. P. Bonizzoni is partially supported by MIUR PRIN 2010–2011 grant “Automi e Linguaggi Formali: Aspetti Matematici e Applicativi”, code H41J12000190001, N. Jonoska is supported in part by the NSF grant CCF-1117254 and the NIH grant R01GM109459-01.
Contributor Information
Paola Bonizzoni, Email: bonizzoni@disco.unimib.it.
Nataša Jonoska, Email: jonoska@mail.usf.edu.
References
1. Berstel J, Perrin D. Theory of Codes. Academic Press, Inc; Orlando, Florida: 1985.
2. Bonizzoni P, De Felice C, Mauri G, Zizza R. Regular languages generated by reflexive finite linear splicing systems. Lect Notes Comput Sci; Proc. Development in Language Theory; Berlin: Springer; 2003. pp. 134–145.
3. Bonizzoni P, De Felice C, Zizza R. The structure of reflexive regular splicing languages via Schützenberger constants. Theor Comput Sci. 2005;334(1–3):71–98.
4. Bonizzoni P, Mauri G. Regular splicing languages and subclasses. Theor Comput Sci. 2005;340:349–363.
5. Bonizzoni P. Constants and label-equivalence: a decision procedure for reflexive regular splicing languages. Theor Comput Sci. 2010;411(6):865–877.
6. Bonizzoni P, Jonoska N. Regular splicing languages must have a constant. Lect Notes Comput Sci; Proc. Developments in Language Theory; Berlin: Springer; 2011. pp. 82–92.
7. Černý J. Poznámka k homogénnym eksperimentom s konecnými automatami. Mat-Fyz čas Slov Akad Vied. 1964;14:208–216.
8. Culik K, Harju T. Splicing semigroups of dominoes and DNA. Discrete Appl Math. 1991;31:261–277.
9. De Luca A, Restivo A. A characterization of strictly locally testable languages and its application to semigroups of free semigroup. Inf Control. 1980;44:300–319.
10. Goode E. Constants and splicing systems. PhD Thesis. Binghamton University; 1999.
11. Goode E, Pixton D. Recognizing splicing languages: syntactic monoids and simultaneous pumping. Discrete Appl Math. 2007;155:989–1006.
12. Head T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours. Bull Math Biol. 1987;49:737–759. doi: 10.1007/BF02481771.
13. Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation. Addison–Wesley; Reading, Mass: 2001.
14. Jonoska N. Sofic systems with synchronizing representations. Theor Comput Sci. 1996;158(1–2):81–115.
15. Kari L, Kopecki S. Deciding if a regular language is generated by a splicing system. Lect Notes Comput Sci; Proc. DNA Computing and Molecular Programming – 18th International Conference; Berlin: Springer; 2012. pp. 98–109.
16. Lind D, Marcus B. An Introduction to Symbolic Dynamics. Cambridge University Press; New York: 1995.
17. Paun G. On the splicing operation. Discrete Appl Math. 1996;70:57–79.
18. Paun G, Rozenberg G, Salomaa A. DNA Computing: New Computing Paradigms. Springer-Verlag; Berlin: 1998.
19. Pixton D. Regularity of splicing languages. Discrete Appl Math. 1996;69:101–124.
20. Schützenberger MP. Sur certaines opérations de fermeture dans le langages rationnels. Symp Math. 1975;15:245–253.
21. Verlan S. Head systems and applications to bio-informatics. PhD Thesis. University of Metz; 2004.







