Abstract
Multi-component grammars, known in the literature as “multiple context-free grammars” and “linear context-free rewriting systems”, describe the structure of a string by defining the properties of k-tuples of its substrings, in the same way as ordinary formal grammars (Chomsky’s “context-free”) define properties of substrings. It is shown that, for every fixed k, the family of languages described by k-component grammars is closed under the cyclic shift operation. On the other hand, the subfamily defined by well-nested k-component grammars is not closed under the cyclic shift, yet the cyclic shift of any of its languages is always defined by a well-nested $(k+1)$-component grammar.
Introduction
The cyclic shift operation on formal languages, defined as $\mathrm{shift}(L) = \{\, vu \mid uv \in L \,\}$ for a language $L$, is notable for several interesting properties. The closure of the class of regular languages under this operation is likely folklore, and proving it is a standard exercise in automata theory [2, Exercise 3.4(c)]. An interesting detail is that the cyclic shift incurs a huge blow-up in the number of states of a DFA [3, 9]. An analogous, rather unobvious result for context-free grammars was first discovered by Maslov [10] and by Oshiba [12], and a direct construction of a grammar was later presented in the textbook by Hopcroft and Ullman [2, Exercise 6.4(c)]. In their proof, a grammar describing a language $L$ is transformed into a grammar for the cyclic shift of $L$; the transformation turns the grammar inside out, so that each parse tree in the new grammar simulates a parse tree in the original grammar, while reversing the order of nodes on one of its paths.
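The operation itself is easy to check by brute force on a finite sample of a language; the following Python sketch verifies the definition on the context-free language $\{a^n b^n\}$ (the choice of sample language is an illustration, not taken from this paper):

```python
# Cyclic shift of a language, computed by brute force on a finite sample.
def shift(L):
    return {w[i:] + w[:i] for w in L for i in range(max(len(w), 1))}

# Sample of the context-free language { a^n b^n }.
L = {"a" * n + "b" * n for n in range(6)}

# Its cyclic shift consists of the strings a^j b^n a^i and b^j a^n b^i
# with i + j = n.
expected = set()
for n in range(6):
    for i in range(n + 1):
        expected.add("a" * (n - i) + "b" * n + "a" * i)
        expected.add("b" * (n - i) + "a" * n + "b" * i)

assert shift(L) == expected
```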
In contrast to this remarkable closure result, the noteworthy subfamilies of the ordinary grammars (unambiguous, LR, LL, linear, input-driven, etc.) are not closed under the cyclic shift. A non-closure result for the linear conjunctive languages [11] was established by Terrier [17]. For conjunctive grammars [11], whether they are closed under the cyclic shift remains an open problem. A summary of these results can be found in a fairly recent survey [11, Sect. 8.2].
This paper investigates the cyclic shift operation on one of the most well-known families of formal grammars: the multi-component grammars. These grammars describe the syntax of a string by defining the properties of k-tuples of its substrings, in the same way as ordinary formal grammars and their basic variants, such as conjunctive grammars, define properties of individual substrings. In their modern form, multi-component grammars were independently introduced by Seki, Matsumura, Fujii and Kasami [14] (as “multiple context-free grammars”, MCFG), and by Vijay-Shanker, Weir and Joshi [18] (as “linear context-free rewriting systems”, LCFRS). These grammars are the subject of much ongoing research [1, 7, 8, 19]. Much attention is also given to their special case, the well-nested multi-component grammars, in which all components of any intermediate k-tuple are listed in the order in which they occur in the final string, and the grammar rules combine such k-tuples without crossing each other’s components. This family is believed to correspond to natural language syntax better than other grammar formalisms.
The first result of this paper is the closure of the language family defined by k-component grammars under the cyclic shift operation. The proof, presented in Sect. 3, proceeds by transforming an arbitrary k-component grammar to another k-component grammar describing the cyclic shift of the original language.
However, this construction does not preserve well-nestedness. A new construction adapted for well-nested grammars is presented in Sect. 4, and it incurs an increase of the number of components by one. In the final Sect. 5, it is shown that, whereas a certain witness language is defined by a well-nested k-component grammar, its cyclic shift is defined by no grammar from this class, and accordingly requires $k+1$ components. This points out a peculiar difference between the general and the well-nested cases of multi-component grammars.
Multi-component Grammars
Definition 1
(Vijay-Shanker et al. [18]; Seki et al. [14]). A multi-component grammar is a quintuple $G = (\Sigma, N, \dim, R, S)$, where:
- $\Sigma$ is the alphabet of the language being described;
- $N$ is the set of syntactic categories defined in the grammar, usually called “nonterminal symbols”;
- $\dim \colon N \to \mathbb{N}$ is a function that defines the number of components in each nonterminal symbol, so that if $\dim A = k$, then $A$ describes $k$-tuples of substrings;
- $R$ is a set of grammar rules, each of the form

  $$A(\alpha_1, \ldots, \alpha_k) \leftarrow B_1(x_{1,1}, \ldots, x_{1,k_1}), \ldots, B_m(x_{m,1}, \ldots, x_{m,k_m}), \qquad (*)$$

  where $A, B_1, \ldots, B_m \in N$, with $k = \dim A$ and $k_i = \dim B_i$, the variables $x_{i,j}$ are pairwise distinct, $\alpha_1, \ldots, \alpha_k$ are strings over symbols from $\Sigma$ and variables $x_{i,j}$, and each variable $x_{i,j}$ occurs in $\alpha_1, \ldots, \alpha_k$ exactly once;
- $S \in N$, a nonterminal symbol of dimension 1, is the “initial symbol”, that is, the category of all well-formed sentences defined by the grammar.
A grammar is a logical system for proving elementary propositions of the form $A(w_1, \ldots, w_k)$, with $A \in N$, $k = \dim A$ and $w_1, \ldots, w_k \in \Sigma^*$, meaning that the given k-tuple of strings has the property $A$. A proof proceeds using the rules in $R$, with each rule (*) treated as a schema for derivation rules, obtained by substituting arbitrary strings for all variables $x_{i,j}$.
Whenever the propositions $B_1(w_{1,1}, \ldots, w_{1,k_1})$, ..., $B_m(w_{m,1}, \ldots, w_{m,k_m})$ have been derived, a rule of the form (*) allows deriving the proposition $A(\alpha_1, \ldots, \alpha_k)$, with each variable $x_{i,j}$ replaced by the corresponding string $w_{i,j}$.
The language generated by the grammar, denoted by $L(G)$, is the set of all strings $w$ such that the proposition $S(w)$ can be derived in one or more such steps.
Whenever a string $w$ is generated by $G$, the derivation of the proposition $S(w)$ forms a parse tree. Each node in the tree is labelled with a proposition $A(w_1, \ldots, w_k)$, where $A \in N$ and $w_1, \ldots, w_k$ are substrings of $w$. Every node has a corresponding rule (*), by which its proposition is derived, and the direct successors of this node are labelled with $B_1(w_{1,1}, \ldots, w_{1,k_1})$, ..., $B_m(w_{m,1}, \ldots, w_{m,k_m})$, as in the definition of a derivation step.
The dimension of a grammar, $\dim G = \max_{A \in N} \dim A$, is the largest dimension of a nonterminal symbol. A multi-component grammar of dimension $k$ shall be called a $k$-component grammar.
A special case of these grammars are the well-nested multi-component grammars, in which, whenever multiple constituents are joined in a single rule, their components cannot be intertwined, unless the components of one constituent are completely embedded between two consecutive components of another. Thus, crossing patterns of the form $\ldots x_1 \ldots y_1 \ldots x_2 \ldots y_2 \ldots$, for variables of two different nonterminals, are prohibited.
Definition 2
A multi-component grammar is called well-nested, if every rule (*) satisfies the following conditions.
- (non-permuting condition) For every $i$, the variables $x_{i,1}, \ldots, x_{i,k_i}$ occur inside $\alpha_1, \ldots, \alpha_k$ in this particular order.
- For all $i \neq j$, the concatenation $\alpha_1 \cdots \alpha_k$ satisfies one of the following patterns: all variables of $B_i$ occur before all variables of $B_j$; all variables of $B_j$ occur before all variables of $B_i$; or all variables of one of the two symbols occur between two consecutive variables of the other.
Example 1
An example language is defined by a 2-component grammar with the rules below.

[grammar rules not reproduced]

A well-nested 2-component grammar for the same language is the following.

[grammar rules not reproduced]
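Since the rules of Example 1 are not reproduced above, the mechanics of a 2-component grammar can be illustrated on a stand-in language, $\{a^n b^n c^n d^n \mid n \geq 0\}$, with the hypothetical rules $A(\varepsilon, \varepsilon)$; $A(a x_1 b, c x_2 d) \leftarrow A(x_1, x_2)$; $S(x_1 x_2) \leftarrow A(x_1, x_2)$. A brute-force Python sketch of its derivations:

```python
# Derivations of a 2-component grammar for { a^n b^n c^n d^n }, enumerated
# by brute force: A describes pairs of substrings, S concatenates them.
def derive(max_steps=5):
    pairs = {("", "")}                          # A(eps, eps)
    for _ in range(max_steps):                  # A(a x1 b, c x2 d) <- A(x1, x2)
        pairs |= {("a" + x1 + "b", "c" + x2 + "d") for (x1, x2) in pairs}
    return {x1 + x2 for (x1, x2) in pairs}      # S(x1 x2) <- A(x1, x2)

L = derive()
assert "abcd" in L and "aabbccdd" in L and "abab" not in L
```

Each proposition $A(w_1, w_2)$ corresponds to a pair in `pairs`, and the final concatenation mirrors the rule for the initial symbol.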
A well-nested multi-component grammar can be transformed to the following form resembling the Chomsky normal form.
Proposition 1
([15], Thm. 1). Each well-nested k-component grammar is equivalent to a well-nested k-component grammar, in which all rules are of the following form.
[normal-form rule schemata not reproduced]
Rules of the first kind generalize concatenation. The operation implemented in the rules of the second kind, defined on tuples of strings, is known as displacement, or discontinuous product.
A multi-component grammar of dimension 1 is an ordinary grammar, or “context-free” in Chomsky’s terminology. A well-nested multi-component grammar of dimension 2 is known in the literature as a “head grammar” [13]; these grammars are equivalent in power to tree-adjoining grammars [4].
Cyclic Shift on k-component Grammars
Let $G$ be a non-permuting $k$-component grammar; the goal is to construct a new $k$-component grammar $G'$ that describes the language $\mathrm{shift}(L(G)) = \{\, vu \mid uv \in L(G) \,\}$.
Whenever $G$ generates a string $w$, the grammar $G'$ should generate $vu$ for every partition $w = uv$. Consider a parse tree of $uv$ according to $G$, that is, a proof tree of the proposition $S(uv)$. Each node in the tree is labelled with a proposition $A(w_1, \ldots, w_k)$, where $A \in N$ and $w_1, \ldots, w_k$ are substrings of $w$. A node is called split, if one of its components $w_j$ spans over the boundary between $u$ and $v$, that is, contains both the last symbol of $u$ and the first symbol of $v$.
In the proposed construction of a grammar for the cyclic shift, each split node $A(w_1, \ldots, w_k)$ is represented by another node of dimension $k$, which, however, specifies an entirely different $k$-tuple of strings. Consider that, whenever the original split node $A(w_1, \ldots, w_k)$ is used in a parse tree of a string $uv$, this string contains $w_1, \ldots, w_k$ as substrings, in any order. The corresponding node in the parse tree of $vu$ according to the grammar for the cyclic shift shall contain all symbols of $uv$ except the symbols in $w_1, \ldots, w_k$. For the moment, assume that $w_1, \ldots, w_k$ occur in $uv$ in the order listed, and that some $w_j$ spans over the boundary between $u$ and $v$. Then $uv = u_0 w_1 u_1 \cdots w_k u_k$, and the symbols not in $w_1, \ldots, w_k$ are arranged into $k+1$ substrings $u_0, u_1, \ldots, u_k$. However, note that in the string $vu$ generated by the new grammar, $u_k$ and $u_0$ come concatenated as a single substring $u_k u_0$, and there is no need to represent them as separate components. Therefore, the new grammar can represent this split node $A(w_1, \ldots, w_k)$ by another node $\overline{A}(u_1, \ldots, u_{k-1}, u_k u_0)$ of the same dimension $k$, where $\overline{A}$ is a new nonterminal symbol representing the whole string with a gap for a $k$-tuple generated by $A$.
To see how this transformation can be done, the structure of split nodes in the original parse tree ought to be examined. As long as $u \neq \varepsilon$ and $v \neq \varepsilon$, the root $S(uv)$ is split. Each split node has at most one split node among its immediate successors, because the last symbol of $u$ and the first symbol of $v$ cannot be in two successors at once. If a node is not split, then none of its successors is split. Thus, split nodes form a path in a parse tree, beginning in the root and ending somewhere inside the tree. This path shall be called the main path, and the new grammar $G'$ retraces this path using the nonterminal symbols of the form $\overline{A}$.
In the original grammar, whenever a rule $A(\ldots) \leftarrow B(\ldots), C(\ldots)$ is used in one of the nodes on the main path, where $B$ is the next node along the path, shorter substrings described by $B$ are concatenated with material obtained from $C$ to form longer substrings described by $A$. In the new grammar, a nonterminal symbol $\overline{A}$ generates all symbols of the string except those generated by $A$, whereas $\overline{B}$ generates all symbols except the symbols generated by $B$. Therefore, $\overline{B}$ can be defined by a rule that partially fills the gap for $A$ in $\overline{A}$, replacing it with a smaller gap for $B$ in $\overline{B}$. This is achieved by a rule of the form $\overline{B}(\ldots) \leftarrow \overline{A}(\ldots), C(\ldots)$. The node $\overline{B}$ is accordingly higher up than $\overline{A}$ in the parse tree of $vu$, and the main path of the original parse tree is retraced in the reverse direction. Each rule along the path is inverted, and the parse tree is effectively turned inside out.
Theorem 1
For every $k$-component grammar $G$ with $n$ nonterminal symbols, there exists another $k$-component grammar, with at most $n \cdot k! + n + 1$ nonterminal symbols, that describes the language $\mathrm{shift}(L(G))$.
Proof
Let $G = (\Sigma, N, \dim, R, S)$. The new grammar is defined as $G' = (\Sigma, N', \dim', R', S')$, where every new nonterminal symbol in $N'$ is of the form $\overline{A}^{\sigma}$, with $A \in N$ a symbol of dimension $k$ and $\sigma$ a permutation of $\{1, \ldots, k\}$; the dimension of this new symbol is also $k$.
Each symbol from $N$ is defined in $G'$ by the same rules as in $G$, and hence defines the same tuples for all $A \in N$. For each new symbol $\overline{A}^{\sigma}$ in $N'$, with $k = \dim A$, the intention is that it generates all such $k$-tuples that, for some partition of a string in $L(G)$ into $uv$, the proposition $S(uv)$ can be derived using an assumption $A(x_1, \ldots, x_k)$. In other words, a $k$-tuple generated by $\overline{A}^{\sigma}$ is a string from $L(G)$ with $k$ gaps, which should be filled by a $k$-tuple generated by $A$. Note that the components of $\overline{A}^{\sigma}$ occur in the final string generated by the grammar exactly in the given order, though the last of them is split into a suffix and a prefix. On the other hand, the components of $A$ may occur in the final string in $L(G)$ in any order, and this order is specified in the permutation $\sigma$.
The grammar $G'$ has three kinds of rules for the new symbols. The first rule creates an empty string with one gap for a string generated by $S$:

$$\overline{S}^{\mathrm{id}}(\varepsilon). \qquad (1)$$

Indeed, using an assumption $S(x)$, one can derive $S(x)$ in zero steps.
For the second type of rules in $G'$, consider any rule in $G$ that defines a symbol $A$ of dimension $k$, and fix any nonterminal symbol $B$ on its right-hand side. Let $y_1, \ldots, y_m$ be the variables of $B$, and denote the remaining nonterminal symbols referenced in this rule by $C_1, \ldots, C_t$.

[rule schema not reproduced]

For every $i$-th argument of $A$, consider all occurrences of the variables $y_1, \ldots, y_m$ in it; between and around these occurrences, the argument consists of strings over the alphabet $\Sigma$ and over the variables of $C_1, \ldots, C_t$. Since each variable is referenced exactly once, the numbers of these occurrences over all arguments sum up to $m$, and the sequence of the variables $y_1, \ldots, y_m$, read over all arguments of $A$, forms a permutation of $1, \ldots, m$.
To see how to transform this rule, consider any proposition of the form $\overline{A}^{\sigma}(\ldots)$, where $\sigma$ is a permutation of $\{1, \ldots, k\}$. This symbol represents a full string generated by $G$, with a gap for $A$. If $A$ is derived from $B$ and $C_1, \ldots, C_t$ using the above rule for $A$, then the substrings obtained from $C_1, \ldots, C_t$ partially fill the gaps for $A$, leaving smaller gaps for $B$. The resulting symbol $\overline{B}^{\tau}$ has $m$ gaps for $B$, and the permutation $\tau$ of $\{1, \ldots, m\}$ is defined by listing the numbers of the variables of $B$ in the order in which they occur as gaps.
The corresponding transformed rule in the new grammar has to fill the gaps in the right order. Let $z_1, \ldots, z_k$ be the variables of $\overline{A}^{\sigma}$. Then the circular sequence containing the variables of $\overline{A}^{\sigma}$, of $B$ and of $C_1, \ldots, C_t$ represents the entire string, and every occurrence of a variable of $B$ becomes a gap in the new rule. Accordingly, the sequence between any two subsequent variables of $B$ forms an argument of $\overline{B}^{\tau}$, with the first argument being the one that contains the split point. The variables of $B$ become gaps between the variables of $\overline{B}^{\tau}$, and the resulting rule is defined as follows.

[rule schema (2) not reproduced]
Rules of the third and last type are defined for the initial symbol of the new grammar. They correspond to the bottom split node on the main path of the parse tree in $G$, where the last symbol of $u$ and the first symbol of $v$ are finally assigned to different substrings. Denote the bottom split node by $A(w_1, \ldots, w_k)$, and let $uv$ be the entire string generated by the original grammar. In the new grammar, this node is represented by a proposition of the form $\overline{A}^{\sigma}(\ldots)$. Let $w_j$, with $w_j = w_j' w_j''$, be the split component of $A$. The plan is to fill the gaps in $\overline{A}^{\sigma}$ with the symbols in the subtree of $A(w_1, \ldots, w_k)$. However, it is not possible to do this directly in a rule of the form $S'(\ldots) \leftarrow \overline{A}^{\sigma}(\ldots), A(\ldots)$, because the component $w_j$ is split.

Consider the rule used to derive this proposition, and let $B_1, \ldots, B_m$ be all nonterminal symbols on its right-hand side.
[rule schema not reproduced]

The split component generates a substring $w_j = w_j' w_j''$, where the first part $w_j'$ is a suffix of $u$ and the second part $w_j''$ is a prefix of $v$. Let the symbols on the right-hand side be partitioned into the symbols generating $w_j'$ and the symbols generating $w_j''$. Then the new grammar has the following rule, where the components of $A$ are inserted into the gaps in $\overline{A}^{\sigma}$, and the resulting string is cyclically shifted to begin in the middle of the component $w_j$.

[rule schema (3) not reproduced]
Overall, for every two strings $u$ and $v$, the string $uv$ is in $L(G)$ if and only if $vu$ belongs to $L(G')$.
It can easily be observed that this construction does not preserve well-nestedness: already for a simple well-nested rule, the construction may produce a rule in which the variables of different nonterminals interleave, and which is therefore not well-nested.
Cyclic Shift on Well-Nested k-component Grammars
The construction for the cyclic shift in the case of well-nested grammars is generally easier, since it does not involve turning parse trees inside out. All paths in the transformed trees continue in the same direction, at the expense of using one extra component. On the other hand, special care has to be taken to preserve the order of components and their well-nestedness.
Theorem 2
If a language is defined by a well-nested $k$-component grammar, then its cyclic shift can be defined by a well-nested $(k+1)$-component grammar.
Proof
Assume that all rules in the original grammar $G$ are as in Proposition 1. If $G$ defines a string $uv$, the new grammar $G'$ should generate $vu$. In the parse tree of $uv$ according to $G$, a node $A(w_1, \ldots, w_k)$ is split, if one of its components $w_j$ spans over the boundary between $u$ and $v$. Let $w_j = w_j' w_j''$, where $u$ ends with $w_j'$ and $v$ begins with $w_j''$. Then the new grammar shall have a new nonterminal symbol $\overline{A}_j$, which defines a $(k+1)$-tuple $(w_j'', w_{j+1}, \ldots, w_k, w_1, \ldots, w_{j-1}, w_j')$.

For a non-split node, let $w_1, \ldots, w_j$ be in $u$ and let $w_{j+1}, \ldots, w_k$ be in $v$. Then the new grammar has a new nonterminal symbol $\widetilde{A}_j$, which defines a shifted $k$-tuple $(w_{j+1}, \ldots, w_k, w_1, \ldots, w_j)$. In particular, the nonterminal $\overline{S}_1$, where $S$ is the initial symbol of $G$, generates exactly the pairs $(v, u)$ with $uv \in L(G)$. Adding a new initial nonterminal $S'$ and the rules $S'(x_1 x_2) \leftarrow \overline{S}_1(x_1, x_2)$ and $S'(x) \leftarrow S(x)$ then yields the grammar for the language $\mathrm{shift}(L(G))$. What remains is to equip the newly introduced nonterminals with the rules that match their definitions.
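The intended tuple rearrangement, under the assumption that a split component contributes its two halves as the first and the last components of the $(k+1)$-tuple (a reconstruction, since the exact formulas are not reproduced here), can be sketched as follows:

```python
# Hedged sketch of the intended tuples: for a node with components ws,
# whose j-th component splits as ws[j] = head + tail at the u/v boundary,
# the split symbol is assumed to define the (k+1)-tuple
#   (tail, ws[j+1], ..., ws[k-1], ws[0], ..., ws[j-1], head).
def split_tuple(ws, j, cut):
    head, tail = ws[j][:cut], ws[j][cut:]
    return (tail,) + tuple(ws[j + 1:]) + tuple(ws[:j]) + (head,)

# Concatenating the components in order yields a rotation of the string:
ws = ("ab", "cd", "ef")             # components of a hypothetical node
t = split_tuple(ws, 1, 1)           # "cd" is split into "c" and "d"
assert t == ("d", "ef", "ab", "c")
assert "".join(t) == "defabc"
```

Here `split_tuple`, `ws`, `j` and `cut` are illustrative names; the point is that the pieces, read in order, spell out a cyclic shift of the original concatenation.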
For each concatenation rule in the original grammar, first, there are the non-split shifts, which simply rotate the order of the components. They use the rules below, corresponding to different shifts; note that in each case one of $B$, $C$ remains unshifted, and the other is shifted and wrapped around it.

[rules not reproduced]

Secondly, the cyclic shift may split one of the components of the resulting tuple: then, one of $B$, $C$ is unshifted, and the other is split. There are the following cases.

[rules not reproduced]
Consider now a displacement rule in $G$. Again, there are non-split and split shifts. The non-split shifts fall into the following three cases.

[rules not reproduced]

If one of the components is split, the corresponding rule for $G'$ is one of the following.

[rules not reproduced]
A correctness proof for the construction proceeds by induction on the size of derivations in the respective grammars, formalizing the above explanations. 
Number of Components in Well-Nested Grammars
Theorem 2 shows how to represent the cyclic shift of a well-nested $k$-component language by a well-nested $(k+1)$-component grammar. On the other hand, without the well-nestedness restriction, a $k$-component grammar can be constructed by Theorem 1. The growth in the number of components is caused by keeping a split substring as two components. The question is whether this weakness is an artefact of the construction, or is determined by the fundamental properties of well-nested grammars. In this section, we prove that, for every $k \geq 2$, there exists a well-nested $k$-component grammar whose cyclic shift lies outside this class; thus, the result of the previous section cannot be strengthened.
As such a counterexample, we take a very simple language, which is defined by a well-nested $k$-component grammar (see Example 2). It is claimed that the cyclic shift of this language cannot be represented by a well-nested $k$-component grammar. Since this language family is closed under rational transductions, it suffices to demonstrate that a related language cannot be generated by a well-nested $k$-component grammar, because that language is obtained from the cyclic shift of the counterexample by an intersection with a regular language and a circular letter renaming.
Example 2
The counterexample language, containing all the strings of the indicated form, is defined by the following well-nested $k$-component grammar.

[grammar rules not reproduced]
The definitions below are taken from Kanazawa [5].
Definition 3
An $r$-pump $D$ is a nonempty derivation of the form $A(x_1, \ldots, x_r) \vdash^* A(\alpha_1, \ldots, \alpha_r)$, where $\alpha_1, \ldots, \alpha_r$ are strings over the terminal symbols and the variables $x_1, \ldots, x_r$.
Note that, in the case of a well-nested grammar in Chomsky normal form, $x_1 \cdots x_r$ is a proper subsequence of $\alpha_1 \cdots \alpha_r$. For each pump $D$, we define the sequence of its pumping strings: the maximal blocks of terminal symbols surrounding the variables in the components $\alpha_1, \ldots, \alpha_r$. For example, a derivation $A(x_1, x_2) \vdash^* A(a x_1 b, c x_2)$ produces the pumping sequence $a, b, c, \varepsilon$. Informally, the pumping strings are the maximal contiguous strings that the pump subtree injects into the derived string. It is easy to prove that the pumping sequence of an $r$-pump consists of exactly $2r$ strings.
Definition 4
An even $r$-pump is a nonempty derivation of the form $A(x_1, \ldots, x_r) \vdash^* A(u_1 x_1 v_1, \ldots, u_r x_r v_r)$.

Obviously, for an even pump $D$, the pumping strings are $u_1, v_1, \ldots, u_r, v_r$.
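As an illustration, for a hypothetical 2-component grammar of the language $\{a^n b^n c^n d^n\}$, the derivation $A(x_1, x_2) \vdash^* A(a x_1 b, c x_2 d)$ is an even 2-pump with pumping strings $a, b, c, d$; iterating it preserves membership, as a quick Python check confirms:

```python
# Iterating an even 2-pump A(x1, x2) |-* A(a x1 b, c x2 d): each application
# wraps the two components with the pumping strings a, b and c, d.
def pump(pair, times=1):
    x1, x2 = pair
    for _ in range(times):
        x1, x2 = "a" + x1 + "b", "c" + x2 + "d"
    return x1, x2

def in_language(w):  # membership in { a^n b^n c^n d^n }
    n = len(w) // 4
    return w == "a" * n + "b" * n + "c" * n + "d" * n

x1, x2 = pump(("", ""), times=3)
assert (x1, x2) == ("aaabbb", "cccddd")
assert in_language(x1 + x2)   # pumped strings remain in the language
```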
We use the term “pump” not only for derivations, but also for derivation trees. Given a derivation tree, we call a letter occurrence covered if it occurs in the yield of some pump, and evenly covered if this pump is even.

In what follows, we consider only grammars in the Chomsky normal form, as in Proposition 1. The following lemma is mathematical folklore for context-free grammars; the proof for well-nested multi-component grammars is the same.
Lemma 1
For every language $L$ defined by a well-nested grammar, there exists a number $p$, such that, for every $w \in L$, at most $p$ letters of $w$ are not covered.
In the case of ordinary grammars (well-nested 1-component grammars), this lemma implies a weak version of the Ogden property [6, 16]. However, as shown by Kanazawa and Salvati [8], this is not the case for well-nested grammars of higher dimensions: namely, the existence of an uneven pump does not imply the k-pumping lemma. However, in our case, we may get rid of uneven pumps.
Definition 5
A language is called bounded if it is a subset of the language $a_1^* a_2^* \cdots a_m^*$, for some symbols $a_1, \ldots, a_m$. A language is strictly bounded if all the symbols $a_1, \ldots, a_m$ are distinct.
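Membership in the bounding set $a_1^* \cdots a_m^*$ is a purely regular condition, which a short check makes concrete (the alphabet here is an arbitrary example):

```python
import re

# A bounded language is a subset of a1* a2* ... am*; this checks whether a
# string has that shape, for an example alphabet a, b, c, d.
def has_bounded_shape(w, symbols="abcd"):
    pattern = "".join(re.escape(s) + "*" for s in symbols)
    return re.fullmatch(pattern, w) is not None

assert has_bounded_shape("aabbbd")     # blocks appear in the fixed order
assert not has_bounded_shape("ba")     # out-of-order blocks are rejected
```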
For a bounded language $L$, its decoration is the language [formula not reproduced]. We call decorations of bounded languages decorated bounded, and decorations of strictly bounded languages decorated strictly bounded. Obviously, the decoration of $L$ is rationally equivalent to $L$. Therefore, in what follows, we consider the decorated strictly bounded counterexample language.
Lemma 2
Let $G$ be a grammar in Chomsky normal form, without useless nonterminals, for a decorated strictly bounded language. Let $\mathrm{first}(w)$ denote the first letter of $w$ and $\mathrm{last}(w)$ the last letter of $w$ (both functions are undefined for the empty string). [The conditions of the lemma, which constrain the first and last letters of the components derivable from each nonterminal, and state that certain components belong to $a_i^*$ for some $i$ and $k$, are not reproduced.]
Lemma 3
If there exists a well-nested $k$-component grammar for the language under consideration in Chomsky normal form, without useless nonterminals, then its derivations contain only even pumps.
The next result follows from the definition of well-nestedness by simple geometrical considerations.
Lemma 4
Let $D_1$ and $D_2$ be two derivations corresponding to the same derivation tree of the string $w$, and let $\mathrm{span}(D_1)$ and $\mathrm{span}(D_2)$ denote their continuous spans, that is, the shortest substrings of $w$ containing all of their respective components. Then one of the following is the case:
- $\mathrm{span}(D_1)$ is a substring of $\mathrm{span}(D_2)$;
- $\mathrm{span}(D_2)$ is a substring of $\mathrm{span}(D_1)$;
- $\mathrm{span}(D_1)$ and $\mathrm{span}(D_2)$ are two disjoint substrings of $w$.
Informally speaking, the “continuous spans” of two constituents either are embedded or do not intersect. Now we are ready to prove our main theorem.
Theorem 3
The language under consideration is not defined by any well-nested $k$-component grammar.
Proof
Assuming the contrary, let such a grammar exist. Then, by Lemma 1, there exists a number $p$, such that at most $p$ letters in every string of the language are uncovered. Consider a sufficiently long string in the language; at least one occurrence of one of its letters is covered by some pump. By Lemma 3, this pump must be of the form

[derivation not reproduced]

for some nonterminal $A$ and some natural numbers. By analogous arguments, applied to the occurrences of another letter, we obtain another derivation.

[derivation not reproduced]

However, the continuous spans of these two derivations contradict Lemma 4.
Theorem 4
The family defined by well-nested k-component grammars is not closed under the cyclic shift.
Conclusion
This paper has settled the closure under the cyclic shift for both general and well-nested multi-component grammars, as well as pointed out an interesting difference between these two grammar families. This contributes to the general knowledge on multi-component grammars.
This result has an interesting consequence: since the identity language of any group is closed under the cyclic shift, and rational transformations preserve this closure property, no group identity language can be a rational generator of the well-nested $k$-component grammars, for any $k \geq 2$. This is not the case for $k = 1$, where the Chomsky–Schützenberger theorem states that any such language can be obtained from the language consisting of the words equal to 1 in the free group with two generators, by a composition of an intersection with a regular language and a homomorphism.
Footnotes
Most of the proofs are omitted due to space restrictions.
Research supported by Russian Science Foundation, project 18-11-00100.
Contributor Information
Alberto Leporati, Email: alberto.leporati@unimib.it.
Carlos Martín-Vide, Email: carlos.martin@urv.cat.
Dana Shapira, Email: shapird@g.ariel.ac.il.
Claudio Zandron, Email: zandron@disco.unimib.it.
Alexander Okhotin, Email: alexander.okhotin@spbu.ru.
Alexey Sorokin, Email: alexey.sorokin@list.ru.
References
- 1. Clark A, Yoshinaka R. An algebraic approach to multiple context-free grammars. In: Asher N, Soloviev S, editors. Logical Aspects of Computational Linguistics. Heidelberg: Springer; 2014. pp. 57–69.
- 2. Hopcroft JE, Ullman JD. Introduction to Automata Theory, Languages and Computation. Reading: Addison-Wesley; 1979.
- 3. Jirásková G, Okhotin A. State complexity of cyclic shift. RAIRO Theoret. Inform. Appl. 2008;42(2):335–360. doi: 10.1051/ita:2007038.
- 4. Joshi AK, Levy LS, Takahashi M. Tree adjunct grammars. J. Comput. Syst. Sci. 1975;10(1):136–163. doi: 10.1016/S0022-0000(75)80019-5.
- 5. Kanazawa M. The pumping lemma for well-nested multiple context-free languages. In: Diekert V, Nowotka D, editors. Developments in Language Theory. Heidelberg: Springer; 2009. pp. 312–325.
- 6. Kanazawa M. Ogden’s lemma, multiple context-free grammars, and the control language hierarchy. Inf. Comput. 2019.
- 7. Kanazawa M, Kobele GM, Michaelis J, Salvati S, Yoshinaka R. The failure of the strong pumping lemma for multiple context-free languages. Theory Comput. Syst. 2014;55(1):250–278. doi: 10.1007/s00224-014-9534-z.
- 8. Kanazawa M, Salvati S. MIX is not a tree-adjoining language. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2012. pp. 666–674.
- 9. Maslov AN. Estimates of the number of states of finite automata. Dokl. Akad. Nauk. 1970;194(6):1266–1268.
- 10. Maslov AN. Cyclic shift operation for languages. Problemy Peredachi Informatsii. 1973;9(4):81–87.
- 11. Okhotin A. Conjunctive and Boolean grammars: the true general case of the context-free grammars. Comput. Sci. Rev. 2013;9:27–59. doi: 10.1016/j.cosrev.2013.06.001.
- 12. Oshiba T. Closure property of family of context-free languages under cyclic shift operation. Electron. Commun. Jpn. 1972;55(4):119–122.
- 13. Pollard CJ. Generalized phrase structure grammars, head grammars, and natural language. Ph.D. dissertation, Stanford University; 1984.
- 14. Seki H, Matsumura T, Fujii M, Kasami T. On multiple context-free grammars. Theoret. Comput. Sci. 1991;88(2):191–229. doi: 10.1016/0304-3975(91)90374-B.
- 15. Sorokin A. Normal forms for multiple context-free languages and displacement Lambek grammars. In: Artemov S, Nerode A, editors. Logical Foundations of Computer Science. Heidelberg: Springer; 2013. pp. 319–334.
- 16. Sorokin A. Ogden property for linear displacement context-free grammars. In: Artemov S, Nerode A, editors. Logical Foundations of Computer Science. Cham: Springer; 2016. pp. 376–391.
- 17. Terrier V. Closure properties of cellular automata. Theoret. Comput. Sci. 2006;352(1–3):97–107. doi: 10.1016/j.tcs.2005.10.039.
- 18. Vijay-Shanker K, Weir DJ, Joshi AK. Characterizing structural descriptions produced by various grammatical formalisms. In: Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics; 1987. pp. 104–111.
- 19. Yoshinaka R, Kaji Y, Seki H. Chomsky-Schützenberger-type characterization of multiple context-free languages. In: Dediu A-H, Fernau H, Martín-Vide C, editors. Language and Automata Theory and Applications. Heidelberg: Springer; 2010. pp. 596–607.