Single-Frame, Multiple-Frame and Framing Motifs in Genes

Christian J Michel

doi:10.3390/life9010018

2019 Feb 10;9(1):18.

PMCID: PMC6463195 PMID: 30744207

Abstract

We study the distribution of new classes of motifs in genes, a research field that has not been investigated to date. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2), e.g., the dicodon AAACAA is $S F$ as the trinucleotides AAA and CAA do not occur in a shifted frame. A motif which is not single-frame $S F$ is multiple-frame $M F$ . Several classes of $M F$ motifs are defined and analysed. The distributions of single-frame $S F$ motifs (associated with an unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions) and 5′ unambiguous motifs $5^{'} U$ (associated with an unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only) are analysed without and with constraints. The constraints studied are: initiation and stop codons, periodic codons ${A A A, C C C, G G G, T T T}$ , antiparallel complementarity and parallel complementarity. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with an unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions or the $5^{'} - 3^{'}$ direction only. Furthermore, the single-frame motifs $S F$ with a property of trinucleotide decoding and the framing motifs $F$ (also called circular code motifs; first introduced by Michel (2012)) with a property of reading frame decoding may have been involved in the early life genes to build the modern genetic code and the extant genes. They could have been involved in the stage without anticodon-amino acid interactions or in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids. Finally, the $S F$ and $M F$ dipeptides associated with the $S F$ and $M F$ dicodons, respectively, are studied and their importance for biology and the origin of life discussed.

Keywords: single-frame motifs, multiple-frame motifs, framing motifs, gene coding, antiparallel and parallel sequences, early life genes

1. Introduction

The reading frame coding with trinucleotide sets is a fascinating problem, both theoretical and experimental. Before the discovery of the genetic code, a first code was proposed by Gamow [1] by considering the “key-and-lock” relation between various amino acids and the rhomb-shaped “holes” formed by various nucleotides in the DNA. This model later proved to be false. A few years later, a class of trinucleotide codes, called comma-free codes, was proposed by Crick et al. [2] to explain how the reading of a sequence of trinucleotides could code amino acids and, in particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {A,C,G,T} as well as the 16 dinucleotides {AA,…,TT} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick et al. [2] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {AAA,…,TTT} code for the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame: the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:

(i) A periodic trinucleotide from the set {AAA,CCC,GGG,TTT} must be excluded from such a code. Indeed, the concatenation of AAA with itself, for instance, does not allow the (original) reading frame to be retrieved as there are three possible decompositions: …,AAA,AAA,AAA,… (original frame), …A,AAA,AAA,AA… and …AA,AAA,AAA,A…, the commas showing the adopted decomposition.

(ii) Two non-periodic permuted trinucleotides, i.e., two trinucleotides related by a circular permutation, e.g., ACG and CGA, must also be excluded from such a code. Indeed, the concatenation of ACG with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: …,ACG,ACG,ACG,… (original frame) and …A,CGA,CGA,CG…

Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g., ACG, CGA and GAC, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.
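The counting argument above can be checked mechanically. The following is a minimal sketch (the helper name `rotations` is ours, introduced only for illustration): it removes the four periodic trinucleotides and groups the 60 remaining ones into classes of circular permutations.

```python
from itertools import product

BASES = "ACGT"


def rotations(t):
    """Return the set of circular permutations of a trinucleotide."""
    return frozenset(t[i:] + t[:i] for i in range(3))


trinucleotides = ["".join(p) for p in product(BASES, repeat=3)]
# Periodic trinucleotides: AAA, CCC, GGG, TTT
periodic = [t for t in trinucleotides if len(set(t)) == 1]
# Each class gathers the trinucleotides deduced from each other by circular permutation
classes = {rotations(t) for t in trinucleotides if t not in periodic}

print(len(periodic), len(classes), {len(c) for c in classes})
```

A comma-free code can take at most one trinucleotide from each of these classes, which gives the upper bound of 20 trinucleotides.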

In the early 1960s, the discovery that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [3] led to the abandonment of the concepts of both a comma-free code [2] and a bijective code, as the genetic code is degenerate [4,5,6] with a gene translation in one direction [7].

In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [8]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), three subsets $X = X_{0}$ , $X_{1}$ and $X_{2}$ of 20 trinucleotides each are found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide in the $5^{'} - 3^{'}$ direction, i.e., to the right) and 2 (frame 0 shifted by two nucleotides in the $5^{'} - 3^{'}$ direction), respectively. The same set $X$ of trinucleotides was identified on average in genes (reading frame) of bacteria, archaea, eukaryotes, plasmids and viruses [9,10]. It contains the following 20 trinucleotides:

$$X = \{AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\} \qquad (1)$$

and codes the following 12 amino acids (three-letter and one-letter notations):

$$X = \{\text{Ala}, \text{Asn}, \text{Asp}, \text{Gln}, \text{Glu}, \text{Gly}, \text{Ile}, \text{Leu}, \text{Phe}, \text{Thr}, \text{Tyr}, \text{Val}\} = \{A, N, D, Q, E, G, I, L, F, T, Y, V\}. \qquad (2)$$

This set $X$ has a strong mathematical property. Indeed, $X$ is a maximal $C^{3}$ self-complementary trinucleotide circular code [8].
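Two of the properties stated for $X$ can be verified with a few lines of code: its self-complementarity and the 12 amino acids it codes. The sketch below assumes the standard genetic code, written as the usual 64-letter string in TCAG order; this table and the helper names `revcomp` and `translate` are our additions, not part of the original analysis.

```python
X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}

COMP = str.maketrans("ACGT", "TGCA")

# Standard genetic code (assumption), codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def revcomp(t):
    """Antiparallel complement of a trinucleotide."""
    return t.translate(COMP)[::-1]


def translate(codon):
    """One-letter amino acid coded by a codon in the standard genetic code."""
    i, j, k = (BASES.index(b) for b in codon)
    return AMINO[16 * i + 4 * j + k]


self_complementary = {revcomp(t) for t in X} == X
amino_acids = {translate(t) for t in X}

print(self_complementary, "".join(sorted(amino_acids)))
```

The circularity and the $C^{3}$ property are not tested here; they require the graph criterion recalled in Section 2.2.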

The reading frame coding with trinucleotide codes (sets of words) in general terms, i.e., not particularly the genetic code, is a concept which has been studied in Michel [11,12]. We extend it to the motifs (words of codes), a theoretical domain which, to our knowledge, has not been explored. Genes (protein coding regions) can be partitioned into two disjoint classes of motifs: the single-frame motifs $S F$ with an unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions, and the multiple-frame motifs $M F$ with an ambiguous trinucleotide decoding in at least one direction. A single-frame motif $S F$ has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2). In contrast, a multiple-frame motif $M F$ has at least one trinucleotide in reading frame that occurs in a shifted frame. Some well-known $M F$ motifs are involved in ribosomal frameshifting. The expression of some viral and cellular genes utilizes a -1 programmed ribosomal frameshifting (-1 PRF) [13,14]. This -1 PRF sequence is based on three elements: (i) a slippery motif composed of seven nucleotides at which the change in reading frame occurs; (ii) a spacer motif, usually less than 12 nucleotides; and (iii) a down-stream (3′) stimulatory motif, usually a pseudoknot or a stem-loop. In eukaryotes, the slippery motif fits a consensus heptanucleotide X,XXY,YYZ, where XXX is any three identical nucleotides, YYY represents AAA or TTT, Z represents A, C or T, the commas separating the codons in reading frame [15,16]. The slippery motifs $M F_{1} = A, A A A, A A Z$ and $M F_{2} = T, T T T, T T Z$ are multiple-frame $M F$ . Indeed, the codon AAA in reading frame also occurs in the shifted frames 1 and 2 in $M F_{1}$ , and similarly with the codon TTT in $M F_{2}$ . Alternative gene decoding is also possible with +1 programmed ribosomal frameshifting (+1 PRF), which has been particularly observed in Euplotes [17]. The identified slippery motif $T T T, T A R$ where $R \in \{A, G\}$ is multiple-frame $M F$ . The slippery motifs AAA, CCC, GGG and TTT may cause frameshifting during transcription, producing RNAs missing specific nucleotides when compared to template DNA [18,19]. Slippery motifs are not always multiple-frame; however, the spacer and the down-stream stimulatory motifs have been very poorly characterized [20] and could also be involved in such a multiple-frame definition. From a theoretical point of view, it is important to extend this concept by increasing the length of such multiple-frame slippery motifs and also by considering their different classes. While the multiple-frame motifs may be involved in ribosomal frameshifting, the single-frame motifs $S F$ and the framing motifs $F$ (also called circular code motifs; first introduced in Michel [21,22]) from the circular codes [8,9,10] (reviews in Michel [23]; Fimmel and Strüngmann [24]) may have been important in early life genes for constructing the modern genetic code and the extant genes (see Discussion).

Several classes of $M F$ motifs are defined:

(i) A unidirectional multiple-frame motif $3^{'} U M F$ has no trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., its position in the reading frame) but has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading, e.g., the dicodon AACACA is $3^{'} U M F$ as the trinucleotides AAC and (trivially) ACA do not occur in a shifted frame after their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 1) before its reading.

(ii) A unidirectional multiple-frame motif $5^{'} U M F$ , the opposite, has no trinucleotide in reading frame that occurs in a shifted frame before its reading but has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading, e.g., the dicodon ACACAA, mirror of AACACA, is $5^{'} U M F$ as the trinucleotides (trivially) ACA and CAA do not occur in a shifted frame before their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 2) after its reading.

(iii) A bidirectional multiple-frame motif $B M F$ has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading and has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., it is ambiguous in both directions), e.g., the dicodons AAAAAA and ACACAC are $B M F$ .

A 5′ unambiguous motif $5^{'} U$ is either an $S F$ motif or a $3^{'} U M F$ motif, e.g., the dicodons AAACAA ( $S F$ motif) and AACACA ( $3^{'} U M F$ motif) belong to the class $5^{'} U$ .

We will only investigate here the distribution of the single-frame motifs $S F$ associated with an unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions, and the 5′ unambiguous motifs $5^{'} U$ associated with an unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only, i.e., a less restrictive class of motifs. The distributions of $S F$ and $5^{'} U$ motifs will be analysed without and with constraints. The constraints studied are: (i) with initiation and stop codons; (ii) without periodic codons ${A A A, C C C, G G G, T T T}$ ; (iii) with antiparallel complementarity; and (iv) with parallel complementarity.

We will also investigate the particular case of motifs made up of two codons, i.e., the dicodons. The definitions of $S F$ and $M F$ dicodons will thus identify two new classes of dipeptides, the $S F$ and $M F$ dipeptides. The $S F$ dipeptides are coded by dicodons with an unambiguous trinucleotide decoding, in contrast to the $M F$ dipeptides which are coded by dicodons with an ambiguous trinucleotide decoding. The concept of $S F$ and $M F$ dipeptides might be of predictive value to studies of prebiotic metabolites [25]. Peptide evolution on the primitive earth is an active and exciting field of research with cyclic dipeptides [26] and selective formation of SerHis dipeptide via phosphorus activation [27,28].

2. Method

2.1. Recall of Biological Definitions

Notation 1.

Let us denote the nucleotide 4-letter alphabet by $B = \{A, C, G, T\}$ where $A$ stands for adenine, $C$ stands for cytosine, $G$ stands for guanine and $T$ stands for thymine. The trinucleotide set over $B$ is denoted by $B^{3} = \{A A A, \dots, T T T\}$ . The set of non-empty words (words, respectively) over $B$ is denoted by $B^{+}$ ( $B^{*}$ , respectively).

Definition 1.

According to the complementary property of the DNA double helix, the nucleotide complementarity map $C : B \to B$ is defined by $C (A) = T$ , $C (C) = G$ , $C (G) = C$ , $C (T) = A$ . According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide antiparallel complementarity map $C : B^{3} \to B^{3}$ is defined by $C (l_{0} l_{1} l_{2}) = C (l_{2}) C (l_{1}) C (l_{0})$ for all $l_{0}, l_{1}, l_{2} \in B$ . The trinucleotide parallel complementarity map $D : B^{3} \to B^{3}$ is defined by $D (l_{0} l_{1} l_{2}) = C (l_{0}) C (l_{1}) C (l_{2})$ for all $l_{0}, l_{1}, l_{2} \in B$ .

Example 1.

$C (A C G) = C G T$ and $D (A C G) = T G C$ .
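The two trinucleotide maps of Definition 1 translate directly into code. A minimal sketch follows; the function names are ours.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def antiparallel_complement(t):
    """Map C on trinucleotides: complement each letter and reverse the order."""
    return "".join(COMPLEMENT[l] for l in reversed(t))


def parallel_complement(t):
    """Map D on trinucleotides: complement each letter, keeping the order."""
    return "".join(COMPLEMENT[l] for l in t)


print(antiparallel_complement("ACG"), parallel_complement("ACG"))  # CGT TGC
```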

2.2. Recall of Circular Code Definitions

Definition 2.

A set $S \subseteq B^{+}$ is a code if, for each $x_{1}, \dots, x_{n}, y_{1}, \dots, y_{m} \in S$ , $n, m \geq 1$ , the condition $x_{1} \dots x_{n} = y_{1} \dots y_{m}$ implies $n = m$ and $x_{i} = y_{i}$ for $i = 1, \dots, n$ .

Definition 3.

Any non-empty subset of the code $B^{3}$ is a code and called trinucleotide code.

Definition 4.

A trinucleotide code $X \subseteq B^{3}$ is circular if, for each $x_{1}, \dots, x_{n}, y_{1}, \dots, y_{m} \in X$ , $n, m \geq 1$ , $r \in B^{*}$ , $s \in B^{+}$ , the conditions $s x_{2} \dots x_{n} r = y_{1} \dots y_{m}$ and $x_{1} = r s$ imply $n = m$ , $r = ε$ (empty word) and $x_{i} = y_{i}$ for $i = 1, \dots, n$ .

We briefly recall the criterion used here to determine whether a code is circular or not. It is the most recent and powerful approach, which associates an oriented (directed) graph with a trinucleotide code.

Definition 5.

[29]. Let $X \subseteq B^{3}$ be a trinucleotide code. The directed graph $G (X) = (V (X), E (X))$ associated with $X$ has a finite set of vertices $V (X)$ and a finite set of oriented edges $E (X)$ (ordered pairs $[v, w]$ where $v, w \in V (X)$ ) defined as follows:

$$\begin{aligned} V (X) &= \{N_{1}, N_{3}, N_{1} N_{2}, N_{2} N_{3} : N_{1} N_{2} N_{3} \in X\}, \\ E (X) &= \{[N_{1}, N_{2} N_{3}], [N_{1} N_{2}, N_{3}] : N_{1} N_{2} N_{3} \in X\}. \end{aligned}$$

The theorem below gives a relation between a trinucleotide code which is circular and its associated graph.

Theorem 1.

[29]. Let $X \subseteq B^{3}$ be a trinucleotide code. The following statements are equivalent:

(i) The code $X$ is circular.

(ii) The graph $G (X)$ is acyclic.
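Theorem 1 reduces the circularity test to cycle detection in $G (X)$ . The following is a minimal sketch of that test, assuming a plain depth-first search (our choice of algorithm; the source does not specify one). The function names `graph_edges` and `is_circular` are hypothetical.

```python
X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}


def graph_edges(code):
    """Adjacency map of G(code): N1 -> N2N3 and N1N2 -> N3 for each N1N2N3."""
    edges = {}
    for n1, n2, n3 in code:
        edges.setdefault(n1, set()).add(n2 + n3)
        edges.setdefault(n1 + n2, set()).add(n3)
    return edges


def is_circular(code):
    """True if G(code) is acyclic (Theorem 1), tested by depth-first search."""
    edges = graph_edges(code)
    state = {}  # vertex -> "open" (on the current DFS path) or "done"

    def has_cycle(v):
        state[v] = "open"
        for w in edges.get(v, ()):
            if state.get(w) == "open":
                return True  # back edge: a cycle is found
            if w not in state and has_cycle(w):
                return True
        state[v] = "done"
        return False

    return not any(v not in state and has_cycle(v) for v in list(edges))


print(is_circular(X), is_circular({"AAA"}), is_circular({"ACG", "CGA"}))
```

The two negative cases correspond to the necessary conditions (i) and (ii) of the Introduction: a periodic trinucleotide and a pair of permuted trinucleotides each create a cycle.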

Definition 6.

Circular code motifs (first introduced by Michel [21,22]), also called here framing motifs $F$ , are motifs from the circular codes. They have the capacity to retrieve, maintain and synchronize the reading frame in genes.

Example 2.

Let a framing motif $F_{1} =$ ...AGGTAATTACCAG... be constructed with the circular code $X$ (1) identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses [8,9,10].

(i) Such a framing motif $F_{1}$ can be obtained as follows. A sequence $s$ of trinucleotides of $X$ is generated and a substring is extracted at any position in this sequence $s$ , i.e., the series of nucleotides on the right and the left of the substring are not considered. Let this substring be $F_{1}$ .

(ii) This framing motif $F_{1}$ allows the reading frame to be retrieved (Figure 1). We try the three possible decompositions $w_{0}$ , $w_{1}$ (shifted by one letter to the right) and $w_{2}$ (shifted by two letters to the right) of $F_{1}$ . With $w_{0}$ , AG is not a prefix of any trinucleotide of $X$ , thus the frame associated with $w_{0}$ is impossible. With $w_{2}$ , AG is a suffix of CAG and GAG belonging to $X$ , then GTA, ATT and ACC belong to $X$ , followed by A which is a prefix of five trinucleotides of $X$ (AAC, AAT, ACC, ATC and ATT). Thus at this position, the frame associated with $w_{2}$ is still possible and $2 + 3 \times 3 + 1 = 12$ nucleotides are read. The next letter G leads to AG which is not a prefix of any trinucleotide of $X$ . Thus, a window of $12 + 1 = 13$ nucleotides demonstrates that the frame associated with $w_{2}$ is impossible. With $w_{1}$ , A is a suffix of GAA and GTA belonging to $X$ , then GGT, AAT, TAC, CAG, etc., belong to $X$ . Thus, the reading frame of $F_{1}$ is associated with $w_{1}$ , i.e., the first letter A of $F_{1}$ is the 3rd letter of a trinucleotide of $X$ , and the reading frame of the sequence $s$ is retrieved: ...A,GGT,AAT,TAC,CAG,… (the commas showing the reading frame).

(iii) It can be proved mathematically that a window of 13 nucleotides always retrieves the reading frame with the circular code $X$ . Four framing motifs $F$ need a window of 13 nucleotides with the circular code $X$ as they are the four longest ambiguous words of length $l = 12$ nucleotides: $F_{1} =$ AGGTAATTACCA, $F_{2} =$ AGGTAATTACCT (with $w_{2}$ , the first two letters AG are a suffix of CAG and GAG belonging to $X$ , and the last letter T is a prefix of TAC and TTC belonging to $X$ ), $F_{3} =$ TGGTAATTACCA (with $w_{2}$ , the first two letters TG are a suffix of CTG belonging to $X$ , and the last letter A is a prefix of five trinucleotides of $X$ ) and $F_{4} =$ TGGTAATTACCT (with $w_{2}$ , the first two letters TG are a suffix of CTG belonging to $X$ , and the last letter T is a prefix of TAC and TTC belonging to $X$ ). These four framing motifs $F$ contain the two longest ambiguous words of length $l = 11$ nucleotides starting with a trinucleotide of $X$ , i.e., when the suffixes of $X$ are not considered: GGTAATTACCA and GGTAATTACCT (see last row in Table 1 in [21]).

(iv) It is important to stress that for all the other framing motifs $F$ of the circular code $X$ , i.e., different from $F_{1}$ , $F_{2}$ , $F_{3}$ and $F_{4}$ , the window for retrieving the reading frame is less than 13 nucleotides (see the growth function of the window as a function of the number of nucleotides in Figure 4 in [21]). It is also important to recall that any motif of the circular code $X$ is framing, i.e., it has the property of reading frame retrieval.

Figure 1. Retrieval of the reading frame of the word $w =$ ...AGGTAATTACCAG... constructed with the circular code $X$ (1). Among the three possible factorizations $w_{0}$ , $w_{1}$ and $w_{2}$ , only one factorization $w_{1}$ into trinucleotides of $X$ is possible, leading to ...A,GGT,AAT,TAC,CAG,… (the commas showing the reading frame). Thus, the first letter A of $w$ is the third letter of a trinucleotide of $X$ and the reading frame of the word is retrieved.

2.3. Definitions of Single-Frame and Multiple-Frame Motifs

Definition 7.

An $n$ -motif, also called $n$ -codon, is a series of trinucleotides $t_{i}$ in $B^{3}$ of trinucleotide length $n$ , $i \in \{1, \dots, n\}$ , which defines the reading frame $f = 0$ , i.e., $t_{1} t_{2} \dots t_{n}$ .

Definition 8.

A shifted frame $f \in \{1, 2\}$ of an $n$ -motif is the series of trinucleotides $t_{i}^{f}$ in $B^{3}$ of trinucleotide length $n - 1$ , $i \in \{1, \dots, n - 1\}$ , starting at the 2nd ( $f = 1$ ) or the 3rd ( $f = 2$ ) nucleotide of $t_{1} = l_{0} l_{1} l_{2}$ of the $n$ -motif, i.e., at $l_{1}$ ( $f = 1$ ) or at $l_{2}$ ( $f = 2$ ).

Notation 2.

Let $T$ be the set of trinucleotides in reading frame $f = 0$ of an $n$ -motif. Let $T^{f}$ be the set of trinucleotides in a shifted frame $f \in \{1, 2\}$ of an $n$ -motif.

A single-frame motif $S F$ has no trinucleotide $t$ in reading frame that occurs in a shifted frame, i.e., the trinucleotide decoding is unambiguous in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions. Formally:

Definition 9.

A single-frame $n$ -motif $S F$ (unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions) is an $n$ -motif such that $T \cap T^{f} = \varnothing$ for all $f \in \{1, 2\}$ , i.e., $t_{i} \neq t_{j}^{f}$ for $i \in \{1, \dots, n\}$ , for $j \in \{1, \dots, n - 1\}$ and for $f \in \{1, 2\}$ .

Example 3.

Let the dicodon be AAACAA ( $2$ -motif). The trinucleotides in reading frame are $t_{1} = A A A$ and $t_{2} = C A A$ , leading to the trinucleotide set $T = \{A A A, C A A\}$ . The single trinucleotide in the shifted frame 1 is $t_{1}^{1} = A A C$ , leading to the trinucleotide set $T^{1} = \{A A C\}$ . The single trinucleotide in the shifted frame 2 is $t_{1}^{2} = A C A$ , leading to the trinucleotide set $T^{2} = \{A C A\}$ . As $T \cap T^{1} = \varnothing$ and $T \cap T^{2} = \varnothing$ , AAACAA is a single-frame dicodon $S F$ (Figure 2).

Figure 2 (associated with Example 3). The dicodon *AAACAA* is single-frame $S F$ .

A multiple-frame motif $M F$ , in contrast to an $S F$ motif, has at least one trinucleotide $t$ in reading frame that occurs in a shifted frame $f$ . Formally:

Definition 10.

A multiple-frame $n$ -motif $M F$ (ambiguous trinucleotide decoding in at least one direction) is an $n$ -motif such that $T \cap T^{f} \neq \varnothing$ for at least one $f \in \{1, 2\}$ , i.e., $\exists i \in \{1, \dots, n\}, \exists j \in \{1, \dots, n - 1\}, \exists f \in \{1, 2\} : t_{i} = t_{j}^{f}$ .

The unidirectional multiple-frame motifs $U M F$ form a class of $M F$ motifs in which the shifted occurrences $t^{f}$ of the reading-frame trinucleotides $t$ are either all located before $t$ ( $3^{'} U M F$ : ambiguity in the $3^{'} - 5^{'}$ direction) or all located after $t$ ( $5^{'} U M F$ : ambiguity in the $5^{'} - 3^{'}$ direction). Formally:

Definition 11.

A unidirectional multiple-frame $n$ -motif $3^{'} U M F$ (ambiguous trinucleotide decoding in the $3^{'} - 5^{'}$ direction only) is an $M F$ $n$ -motif ( $T \cap T^{f} \neq \varnothing$ for at least one $f \in \{1, 2\}$ ) such that the condition $t_{i} = t_{j}^{f}$ implies $i > j$ for $i \in \{1, \dots, n\}$ , for $j \in \{1, \dots, n - 1\}$ and for $f \in \{1, 2\}$ .

Example 4.

Let the dicodon be AACACA. The trinucleotides in reading frame are $t_{1} = A A C$ and $t_{2} = A C A$ , leading to $T = \{A A C, A C A\}$ . The single trinucleotide in the shifted frame 1 is $t_{1}^{1} = A C A$ , leading to $T^{1} = \{A C A\}$ . The single trinucleotide in the shifted frame 2 is $t_{1}^{2} = C A C$ , leading to $T^{2} = \{C A C\}$ . As $T \cap T^{1} \neq \varnothing$ , AACACA is a multiple-frame dicodon $M F$ . Furthermore, as $t_{2} = t_{1}^{1} = A C A$ yields the inequality $2 > 1$ , as $t_{1} = A A C \neq t_{1}^{1} = A C A$ and as $t_{1} = A A C \neq t_{1}^{2} = C A C$ , AACACA is a unidirectional multiple-frame dicodon $3^{'} U M F$ (Figure 3).

Figure 3 (associated with Example 4). The dicodon *AACACA* is unidirectional multiple-frame $3^{'} U M F$ .

Definition 12.

A unidirectional multiple-frame $n$ -motif $5^{'} U M F$ (ambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only) is an $M F$ $n$ -motif ( $T \cap T^{f} \neq \varnothing$ for at least one $f \in \{1, 2\}$ ) such that the condition $t_{i} = t_{j}^{f}$ implies $i \leq j$ for $i \in \{1, \dots, n\}$ , for $j \in \{1, \dots, n - 1\}$ and for $f \in \{1, 2\}$ .

Example 5.

Let the dicodon be AAAAAC. The trinucleotides in reading frame are $t_{1} = A A A$ and $t_{2} = A A C$ , leading to $T = \{A A A, A A C\}$ . The trinucleotides in the shifted frames 1 and 2 are $t_{1}^{1} = t_{1}^{2} = A A A$ , leading to the trinucleotide sets $T^{1} = T^{2} = \{A A A\}$ . As $T \cap T^{1} \neq \varnothing$ and $T \cap T^{2} \neq \varnothing$ , AAAAAC is a multiple-frame dicodon $M F$ . Furthermore, as $t_{1} = t_{1}^{1} = t_{1}^{2} = A A A$ yields the inequality $1 \leq 1$ in both shifted frames and as $t_{2} = A A C \neq t_{1}^{1} = t_{1}^{2} = A A A$ , AAAAAC is a unidirectional multiple-frame dicodon $5^{'} U M F$ (Figure 4).

Figure 4 (associated with Example 5). The dicodon *AAAAAC* is unidirectional multiple-frame $5^{'} U M F$ .

Example 6.

Let the dicodon be ACACAA. The trinucleotides in reading frame are $t_{1} = A C A$ and $t_{2} = C A A$ , leading to $T = \{A C A, C A A\}$ . The single trinucleotide in the shifted frame 1 is $t_{1}^{1} = C A C$ , leading to $T^{1} = \{C A C\}$ . The single trinucleotide in the shifted frame 2 is $t_{1}^{2} = A C A$ , leading to $T^{2} = \{A C A\}$ . As $T \cap T^{2} \neq \varnothing$ , ACACAA is a multiple-frame dicodon $M F$ . Furthermore, as $t_{1} = t_{1}^{2} = A C A$ yields the inequality $1 \leq 1$ , as $t_{2} = C A A \neq t_{1}^{1} = C A C$ and as $t_{2} = C A A \neq t_{1}^{2} = A C A$ , ACACAA is a unidirectional multiple-frame dicodon $5^{'} U M F$ (Figure 5). The same conclusion follows immediately by noting that the dicodon ACACAA is the mirror of AACACA (compare with Example 4).

Figure 5 (associated with Example 6). The dicodon *ACACAA* is unidirectional multiple-frame $5^{'} U M F$ .

Definition 13.

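A bidirectional multiple-frame $n$ -motif $B M F$ (ambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions) is an $M F$ $n$ -motif having at least one equality $t_{i} = t_{j}^{f}$ with $i \leq j$ (ambiguity of a $5^{'} U M F$ $n$ -motif) and at least one equality $t_{i'} = t_{j'}^{f'}$ with $i' > j'$ (ambiguity of a $3^{'} U M F$ $n$ -motif).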

Example 7.

Let the trivial dicodon be AAAAAA. The trinucleotides in reading frame are $t_{1} = t_{2} = A A A$ , leading to the trinucleotide set $T = \{A A A\}$ . The trinucleotides in the shifted frames 1 and 2 are $t_{1}^{1} = t_{1}^{2} = A A A$ , leading to the trinucleotide sets $T^{1} = T^{2} = \{A A A\}$ . As $T \cap T^{1} \neq \varnothing$ and $T \cap T^{2} \neq \varnothing$ , AAAAAA is a multiple-frame dicodon $M F$ . Furthermore, as $t_{1} = t_{1}^{1} = t_{1}^{2} = A A A$ yields the inequality $1 \leq 1$ in both shifted frames and as $t_{2} = t_{1}^{1} = t_{1}^{2} = A A A$ yields the inequality $2 > 1$ in both shifted frames, AAAAAA is a bidirectional multiple-frame dicodon $B M F$ (Figure 6).

Figure 6 (associated with Example 7). The dicodon *AAAAAA* is bidirectional multiple-frame $B M F$ .

Example 8.

Let the dicodon be ACACAC. The trinucleotides in reading frame are $t_{1} = A C A$ and $t_{2} = C A C$ , leading to $T = \{A C A, C A C\}$ . The single trinucleotide in the shifted frame 1 is $t_{1}^{1} = C A C$ , leading to $T^{1} = \{C A C\}$ . The single trinucleotide in the shifted frame 2 is $t_{1}^{2} = A C A$ , leading to $T^{2} = \{A C A\}$ . As $T \cap T^{1} \neq \varnothing$ and $T \cap T^{2} \neq \varnothing$ , ACACAC is a multiple-frame dicodon $M F$ . Furthermore, as $t_{1} = t_{1}^{2} = A C A$ yields the inequality $1 \leq 1$ and as $t_{2} = t_{1}^{1} = C A C$ yields the inequality $2 > 1$ , ACACAC is a bidirectional multiple-frame dicodon $B M F$ (Figure 7).

Figure 7 (associated with Example 8). The dicodon *ACACAC* is bidirectional multiple-frame $B M F$ .

In this paper, by varying $n \in ℕ^{*}$ , we will investigate two distributions: the single-frame $n$ -motifs $S F$ with an unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions (see Definition 9), and the 5′ unambiguous $n$ -motifs $5^{'} U$ with an unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only which are defined formally as follows:

Definition 14.

A 5′ unambiguous $n$ -motif $5^{'} U$ (unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only) is either an $S F$ $n$ -motif or a $3^{'} U M F$ $n$ -motif, i.e., neither a $5^{'} U M F$ $n$ -motif nor a $B M F$ $n$ -motif.

Example 9.

The dicodons AAACAA ( $S F$ motif) and AACACA ( $3^{'} U M F$ motif) belong to the class $5^{'} U$ .
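Definitions 9 to 14 can be gathered in a single classification routine. The sketch below is our illustration (the function names and the string labels are hypothetical): it compares every trinucleotide $t_{i}$ in reading frame with every shifted trinucleotide $t_{j}^{f}$ and records whether an equality occurs with $i > j$ or with $i \leq j$ .

```python
def classify(motif):
    """Return 'SF', "3'UMF", "5'UMF" or 'BMF' for an n-motif given as a string."""
    if not motif or len(motif) % 3:
        raise ValueError("motif length must be a positive multiple of 3")
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]  # t_1, ..., t_n
    before = after = False
    for f in (1, 2):
        # t_1^f, ..., t_{n-1}^f: trinucleotides of the shifted frame f
        shifted = [motif[f + 3 * j:f + 3 * j + 3] for j in range(n - 1)]
        for i, t in enumerate(reading, start=1):
            for j, s in enumerate(shifted, start=1):
                if t == s:
                    if i > j:
                        before = True  # equality with i > j (Definition 11)
                    else:
                        after = True   # equality with i <= j (Definition 12)
    if before and after:
        return "BMF"
    if before:
        return "3'UMF"
    if after:
        return "5'UMF"
    return "SF"


def is_5_prime_unambiguous(motif):
    """Definition 14: a 5' unambiguous motif is either SF or 3'UMF."""
    return classify(motif) in ("SF", "3'UMF")


for dicodon in ("AAACAA", "AACACA", "AAAAAC", "ACACAA", "AAAAAA", "ACACAC"):
    print(dicodon, classify(dicodon))
```

The six dicodons printed are those of Examples 3 to 8.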

2.4. Occurrence Probabilities of Single-Frame $n$ -Motifs $S F$ and 5′ Unambiguous $n$ -Motifs $5^{'} U$

Definition 15.

Let $N b S F M (n)$ and $N b M F M (n)$ be the numbers of $n$ -motifs ( $n \in ℕ^{*}$ ) single-frame $S F$ and multiple-frame $M F$ , respectively. Let $N b 5^{'} U M F M (n)$ , $N b 3^{'} U M F M (n)$ and $N b B M F M (n)$ be the numbers of multiple-frame $n$ -motifs ( $n \in ℕ^{*}$ ) which are unidirectional $5^{'} U M F$ , unidirectional $3^{'} U M F$ and bidirectional $B M F$ , respectively.

For $n \in ℕ^{*}$ , we have the obvious relations:

$$N b S F M (n) + N b M F M (n) = 64^{n}, \qquad N b M F M (n) = N b 5^{'} U M F M (n) + N b 3^{'} U M F M (n) + N b B M F M (n).$$

For $n \in ℕ^{*}$ , the occurrence probability $P b S F M (n)$ of single-frame $n$ -motifs $S F$ will be computed according to

$$P b S F M (n) = 1 - \frac{N b M F M (n)}{64^{n}}. \qquad (3)$$

Similarly, for $n \in ℕ^{*}$ , the occurrence probability $P b 5^{'} U M (n)$ of 5′ unambiguous $n$ -motifs $5^{'} U$ will be computed as follows

$$P b 5^{'} U M (n) = P b S F M (n) + \frac{N b 3^{'} U M F M (n)}{64^{n}}. \qquad (4)$$

Remark 1.

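By Equation (4), $P b 5^{'} U M (n) \geq P b S F M (n)$ whatever $n$ , with equality for $n = 1$ (no shifted frame exists, see Section 2.5) and a strict inequality as soon as $3^{'} U M F$ $n$ -motifs exist, e.g., for $n = 2$ (Table 2). However, it will be interesting to compare these two probability distributions by varying $n$ .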

2.5. Single-Frame $1$ -Motifs

This is a trivial case. Each of the 64 codons (1-motifs, $n = 1$ ) is a single-frame motif $S F$ by definition, as no shifted frame exists. Thus, the probabilities of $S F$ and $5^{'} U$ $1$ -motifs are equal to $P b S F M (1) = P b 5^{'} U M (1) = 1$ .

2.6. Single-Frame $2$ -Motifs

There are $64^{2} = 4096$ dicodons (2-motifs, $n = 2$ ). The complete study of dicodons which are single-frame $S F$ and multiple-frame $M F$ can be done by hand without difficulty. For the convenience of the reader, we give the complete list of $M F$ dicodons: $B M F$ (Definition 13, Table 1), $3^{'} U M F$ (Definition 11, Table 2) and $5^{'} U M F$ (Definition 12, Table 3).

Table 1.

The 16 bidirectional multiple-frame dicodons $B M F$ (Definition 13).

Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2
AAAAAA	AAA	AAA	CACACA	ACA	CAC	GAGAGA	AGA	GAG	TATATA	ATA	TAT
ACACAC	CAC	ACA	CCCCCC	CCC	CCC	GCGCGC	CGC	GCG	TCTCTC	CTC	TCT
AGAGAG	GAG	AGA	CGCGCG	GCG	CGC	GGGGGG	GGG	GGG	TGTGTG	GTG	TGT
ATATAT	TAT	ATA	CTCTCT	TCT	CTC	GTGTGT	TGT	GTG	TTTTTT	TTT	TTT


Table 2.

The 96 unidirectional multiple-frame dicodons $3' U M F$ (Definition 11), $N$ being any nucleotide.

Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2
CAAAAA	AAA	AAA	CCACAC	CAC		CGAGAG	GAG		CTATAT	TAT
GAAAAA	AAA	AAA	GCACAC	CAC		GGAGAG	GAG		GTATAT	TAT
TAAAAA	AAA	AAA	TCACAC	CAC		TGAGAG	GAG		TTATAT	TAT
NCAAAA		AAA	ACCCCC	CCC	CCC	AGCGCG	GCG		ATCTCT	TCT
NGAAAA		AAA	GCCCCC	CCC	CCC	GGCGCG	GCG		GTCTCT	TCT
NTAAAA		AAA	TCCCCC	CCC	CCC	TGCGCG	GCG		TTCTCT	TCT
AACACA	ACA		NACCCC		CCC	AGGGGG	GGG	GGG	ATGTGT	TGT
GACACA	ACA		NGCCCC		CCC	CGGGGG	GGG	GGG	CTGTGT	TGT
TACACA	ACA		NTCCCC		CCC	TGGGGG	GGG	GGG	TTGTGT	TGT
AAGAGA	AGA		ACGCGC	CGC		NAGGGG		GGG	ATTTTT	TTT	TTT
CAGAGA	AGA		CCGCGC	CGC		NCGGGG		GGG	CTTTTT	TTT	TTT
TAGAGA	AGA		TCGCGC	CGC		NTGGGG		GGG	GTTTTT	TTT	TTT
AATATA	ATA		ACTCTC	CTC		AGTGTG	GTG		NATTTT		TTT
CATATA	ATA		CCTCTC	CTC		CGTGTG	GTG		NCTTTT		TTT
GATATA	ATA		GCTCTC	CTC		GGTGTG	GTG		NGTTTT		TTT


Table 3.

The 96 unidirectional multiple-frame dicodons $5^{'} U M F$ (Definition 12), $N$ being any nucleotide.

Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2	Dicodon	Frame 1	Frame 2
AAAAAC	AAA	AAA	CACACC		CAC	GAGAGC		GAG	TATATC		TAT
AAAAAG	AAA	AAA	CACACG		CAC	GAGAGG		GAG	TATATG		TAT
AAAAAT	AAA	AAA	CACACT		CAC	GAGAGT		GAG	TATATT		TAT
AAAACN	AAA		CCCCCA	CCC	CCC	GCGCGA		GCG	TCTCTA		TCT
AAAAGN	AAA		CCCCCG	CCC	CCC	GCGCGG		GCG	TCTCTG		TCT
AAAATN	AAA		CCCCCT	CCC	CCC	GCGCGT		GCG	TCTCTT		TCT
ACACAA		ACA	CCCCAN	CCC		GGGGGA	GGG	GGG	TGTGTA		TGT
ACACAG		ACA	CCCCGN	CCC		GGGGGC	GGG	GGG	TGTGTC		TGT
ACACAT		ACA	CCCCTN	CCC		GGGGGT	GGG	GGG	TGTGTT		TGT
AGAGAA		AGA	CGCGCA		CGC	GGGGAN	GGG		TTTTTA	TTT	TTT
AGAGAC		AGA	CGCGCC		CGC	GGGGCN	GGG		TTTTTC	TTT	TTT
AGAGAT		AGA	CGCGCT		CGC	GGGGTN	GGG		TTTTTG	TTT	TTT
ATATAA		ATA	CTCTCA		CTC	GTGTGA		GTG	TTTTAN	TTT
ATATAC		ATA	CTCTCC		CTC	GTGTGC		GTG	TTTTCN	TTT
ATATAG		ATA	CTCTCG		CTC	GTGTGG		GTG	TTTTGN	TTT

The probability of $S F$ $2$ -motifs is equal to $P b S F M (2) = 1 - (16 + 2 \times 96) / 64^{2} = 0.9492$ . The probability of $5^{'} U$ $2$ -motifs is equal to $P b 5^{'} U M (2) = P b S F M (2) + 96 / 64^{2} = 0.9727$ .
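
The two probabilities above can be checked by brute-force enumeration of the $64^{2} = 4096$ dicodons. The following is a minimal sketch, not the program used for the paper; the function name `classify` and the class labels are illustrative, and the rule separating the unidirectional classes (a reading-frame trinucleotide $t_{i}$ equal to a shifted trinucleotide $t_{j}^{f}$ with $i \leq j$ counts as a 5′ ambiguity, with $i > j$ as a 3′ ambiguity) is assumed from the index inequalities used in Remark 2.

```python
from collections import Counter
from itertools import product

def classify(motif):
    """Return 'SF', 'BMF', '3UMF' or '5UMF' for a motif of length 3n."""
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]
    match_5, match_3 = False, False
    for j in range(n - 1):              # index of the shifted trinucleotide
        for shift in (1, 2):            # shifted frames 1 and 2
            shifted = motif[3 * j + shift:3 * j + shift + 3]
            for i, t in enumerate(reading):
                if t == shifted:
                    if i <= j:          # same inequality as with 1-based indices
                        match_5 = True
                    else:
                        match_3 = True
    if match_5 and match_3:
        return "BMF"
    if match_5:
        return "5UMF"
    if match_3:
        return "3UMF"
    return "SF"

# Enumerate all dicodons and count each class.
counts = Counter(classify("".join(d)) for d in product("ACGT", repeat=6))
pb_sf = counts["SF"] / 64 ** 2
pb_5u = (counts["SF"] + counts["3UMF"]) / 64 ** 2
print(dict(counts), round(pb_sf, 4), round(pb_5u, 4))
```

The 5′ unambiguous dicodons are counted here as the single-frame dicodons plus the $3^{'} U M F$ dicodons, which is the decomposition used in the formula for $P b 5^{'} U M (2)$.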

Remark 2.

For $n \geq 3$ , the $3^{'} U M F$ and $5^{'} U M F$ $n$ -motifs can have two different shifted trinucleotides in the two frames 1 and 2, in contrast to the $2$ -motifs (see Table 2 and Table 3). For example, with the tricodon AACAAAACC, the trinucleotides in reading frame are $t_{1} = A A C$ , $t_{2} = A A A$ and $t_{3} = A C C$ , leading to $T = \{A A A, A A C, A C C\}$ . The trinucleotides in the shifted frame 1 are $t_{1}^{1} = A C A$ and $t_{2}^{1} = A A A$ , leading to $T^{1} = \{A A A, A C A\}$ . The trinucleotides in the shifted frame 2 are $t_{1}^{2} = C A A$ and $t_{2}^{2} = A A C$ , leading to $T^{2} = \{A A C, C A A\}$ . As $T \cap T^{1} \neq \varnothing$ and $T \cap T^{2} \neq \varnothing$ , AACAAAACC is a multiple-frame tricodon $M F$ . Furthermore, as $t_{1} = t_{2}^{2} = A A C$ yields the inequality $1 \leq 2$ , as $t_{2} = t_{2}^{1} = A A A$ yields the inequality $2 \leq 2$ and as $t_{3} = A C C \notin T^{1} \cup T^{2}$ , AACAAAACC is a unidirectional multiple-frame tricodon $5^{'} U M F$ with two different trinucleotides in the two frames 1 and 2, i.e., AAA in frame 1 and AAC in frame 2.
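
The frame decomposition of Remark 2 can be reproduced in a few lines. This is a sketch under the indexing read off from the example (the $j$-th trinucleotide of frame $f$ starts at nucleotide $3 (j - 1) + f + 1$); the helper name `frame_trinucleotides` is illustrative.

```python
def frame_trinucleotides(motif, shift):
    """List the trinucleotides of a motif in frame 0, 1 or 2."""
    n = len(motif) // 3
    count = n if shift == 0 else n - 1   # shifted frames hold n - 1 trinucleotides
    return [motif[3 * k + shift:3 * k + shift + 3] for k in range(count)]

motif = "AACAAAACC"
T, T1, T2 = (frame_trinucleotides(motif, f) for f in (0, 1, 2))

# Every coincidence t_i = t_j^f, reported with 1-based indices (i, f, j, trinucleotide).
matches = [
    (i + 1, f, j + 1, t)
    for f, shifted in ((1, T1), (2, T2))
    for j, s in enumerate(shifted)
    for i, t in enumerate(T)
    if s == t
]
print(T, T1, T2, matches)
```

All coincidences found for this tricodon satisfy $i \leq j$, which is the condition stated in the remark for a $5^{'} U M F$ motif.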

2.7. Single-Frame $n$ -Motifs

The probability $P b S F M (n)$ of single-frame $n$ -motifs $S F$ for $n \geq 3$ (tricodons, tetracodons, etc.) cannot be determined by hand. For $n \in \{3, \dots, 6\}$ (tricodons up to hexacodons), exact values of $P b S F M (n)$ can be obtained by computer calculation (see Table 4). For $n = 6$ , the computation of $S F$ motifs among the $64^{6} = 68{,}719{,}476{,}736$ hexacodons with a parallel program with 8 threads takes about 7 days on a standard PC. For $n \geq 7$ (heptacodons, octocodons, etc.), the probability $P b S F M (n)$ is obtained by computer simulation. Simulated values of $P b S F M (n)$ are obtained by generating 1,000,000 random $n$ -motifs for each $n$ . In order to evaluate this simulation approach, simulated values of $P b S F M (n)$ for $n \in \{2, \dots, 6\}$ are also given in Table 4. Exact and simulated values of $P b S F M (n)$ agree to within $10^{- 3}$ , demonstrating the reliability of the simulation approach.
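
The two approaches described above (exhaustive enumeration for small $n$ , random sampling for larger $n$ ) can be sketched as follows. This is an illustrative reimplementation, not the parallel program used for the paper; the function names, the sample size and the seed are assumptions, and the enumeration is only practical in pure Python for $n \leq 3$ .

```python
import random
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def is_single_frame(codons):
    """True if no reading-frame trinucleotide occurs in frame 1 or 2."""
    motif = "".join(codons)
    reading = set(codons)
    for k in range(len(codons) - 1):
        if motif[3 * k + 1:3 * k + 4] in reading:   # frame 1
            return False
        if motif[3 * k + 2:3 * k + 5] in reading:   # frame 2
            return False
    return True

def exact_probability(n):
    """Exact probability by enumerating all 64**n motifs (small n only)."""
    hits = sum(is_single_frame(c) for c in product(CODONS, repeat=n))
    return hits / 64 ** n

def simulated_probability(n, samples=50_000, seed=0):
    """Monte Carlo estimate from uniformly drawn random n-motifs."""
    rng = random.Random(seed)
    hits = sum(is_single_frame(rng.choices(CODONS, k=n)) for _ in range(samples))
    return hits / samples

for n in (2, 3):
    print(n, exact_probability(n), simulated_probability(n))
```

With a fixed number of samples the statistical error of the estimate does not depend on $64^{n}$ , which is why simulation remains feasible for $n \geq 7$ where enumeration is not.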

Table 4.

Probability $P b S F M (n)$ (%) of single-frame $n$ -motifs $S F$ for $n \in {1, \dots, 6}$ . Exact and simulated values of $P b S F M (n)$ are identical at $10^{- 3}$ .

$n$ -Motifs	Number $64^{n}$	$P b S F M (n)$ (%), exact values	$P b S F M (n)$ (%), simulated values
1	64	100
2	4096	94.92	94.93
3	262,144	85.22	85.20
4	16,777,216	72.35	72.37
5	1,073,741,824	58.07	58.08
6	68,719,476,736	44.07	44.08

The probability $P b 5^{'} U M (n)$ of 5′ unambiguous $n$ -motifs $5^{'} U$ for $n \geq 3$ is computed similarly.

3. Results

3.1. Single-Frame Motifs

I first investigated the probability $P b S F M (n)$ (Equation (3)) of single-frame $n$ -motifs $S F$ (Definition 9). The probability $P b S F M (1)$ is equal to 1 (1-motifs, Section 2.5). The probability $P b S F M (2)$ is equal to 94.9% (2-motifs, Section 2.6). The probability $P b S F M (n)$ for $n \in {3, \dots, 6}$ is given in Table 4. The probability $P b S F M (n)$ for $n \geq 7$ is obtained by computer simulation (Section 2.7).

While the proportion of multiple-frame $2$ -motifs $M F$ (Definition 10) is minimal ( $5.1 % = 100 % - 94.9 %$ for dicodons, Section 2.6), Figure 8 shows that their propagation drastically reduces the proportion of $S F$ $n$ -motifs as the trinucleotide length $n$ increases. There are almost no $S F$ motifs left at a length of 14 trinucleotides ( $P b S F M (14) < 1 %$ ), and the number of $M F$ motifs already exceeds the number of $S F$ motifs at a length of six trinucleotides (Figure 8).

Figure 8.

Decreasing probability $P b S F M (n)$ (Equation (3)) of single-frame $n$ -motifs $S F$ (blue curve) and increasing probability $1 - P b S F M (n)$ of multiple-frame $n$ -motifs $M F$ (red curve) by varying the length $n$ between 1 and 20 trinucleotides.

Thus, only short genes, i.e., with up to five trinucleotides, have a higher proportion of single-frame motifs than of multiple-frame motifs. Consequently, primitive translation, without the extant complex ribosome, could only generate short peptides without frameshift errors.

3.2. 5′ Unambiguous Motifs

I then compared the probability $P b S F M (n)$ (Equation (3)) of single-frame $n$ -motifs $S F$ (Definition 9) and the probability $P b 5^{'} U M (n)$ (Equation (4)) of 5′ unambiguous $n$ -motifs $5^{'} U$ (Definition 14). Figure 9 shows the decreasing probability $P b 5^{'} U M (n)$ of $5^{'} U$ $n$ -motifs when the trinucleotide length $n$ increases. As expected (see Remark 1), its decrease is slower than that of $S F$ $n$ -motifs. There are almost no more $5^{'} U$ motifs with a length of 20 trinucleotides ( $P b 5^{'} U M (20) < 1 %$ ). Thus with the $5^{'} U$ motifs, there is a length increase of $20 - 14 = 6$ trinucleotides in the trinucleotide decoding. The maximum probability difference $P b 5^{'} U M (n) - P b S F M (n)$ is 22.0% at length $n = 8$ trinucleotides.
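
The comparison of the two curves can be approximated by sampling. The sketch below is illustrative only: it assumes that a motif is 5′ unambiguous exactly when no reading-frame trinucleotide $t_{i}$ equals a shifted trinucleotide $t_{j}^{f}$ with $i \leq j$ (a reading of Definition 14 inferred from the dicodon case and Remark 2), and the function names, sample size and seed are hypothetical choices.

```python
import random
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def decoding_flags(codons):
    """Return (is_SF, is_5U) for a list of reading-frame codons."""
    motif = "".join(codons)
    is_sf = is_5u = True
    for j in range(len(codons) - 1):
        for shift in (1, 2):
            shifted = motif[3 * j + shift:3 * j + shift + 3]
            if shifted in codons:
                is_sf = False                    # any coincidence breaks SF
                if shifted in codons[:j + 1]:    # coincidence with t_i, i <= j
                    is_5u = False
    return is_sf, is_5u

def estimate(n, samples=20_000, seed=1):
    """Monte Carlo estimates of (PbSFM(n), Pb5'UM(n)) on the same sample."""
    rng = random.Random(seed)
    sf = u5 = 0
    for _ in range(samples):
        a, b = decoding_flags(rng.choices(CODONS, k=n))
        sf += a
        u5 += b
    return sf / samples, u5 / samples

for n in (2, 6, 10, 14):
    print(n, estimate(n))
```

Because every single-frame motif is also 5′ unambiguous under this criterion, the second estimate can never fall below the first on a given sample.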

Figure 9.

Decreasing probability $P b S F M (n)$ (Equation (3)) of single-frame $n$ -motifs $S F$ (blue curve from Figure 8) and decreasing probability $P b 5^{'} U M (n)$ (Equation (4)) of 5′ unambiguous $n$ -motifs $5^{'} U$ (cyan curve) by varying the length $n$ between 1 and 20 trinucleotides.

The 5′ unambiguous $n$ -motifs, a less restrictive class of motifs with an unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only, can generate slightly longer peptides without frameshift errors than the single-frame motifs.

I now evaluate the single-frame motifs $S F$ and the 5′ unambiguous motifs $5^{'} U$ with constraints.

3.3. Single-Frame and 5′ Unambiguous Motifs with Initiation and Stop Codons

The single-frame $n$ -motifs $S F$ and the 5′ unambiguous $n$ -motifs $5^{'} U$ are investigated with an initiation codon ATG and a stop codon in $\{T A A, T A G, T G A\}$ . The case $n = 1$ does not exist. For $n = 2$ , there are only three dicodons, ATGTAA, ATGTAG and ATGTGA, which are all $S F$ . Thus, the probabilities of $S F$ and $5^{'} U$ $2$ -motifs are $P b S F M (2) = P b 5^{'} U M (2) = 1$ . Figure 10 shows that the proportions of $S F$ and $5^{'} U$ motifs with initiation and stop codons are lower than those of their respective non-constrained motifs.
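
The statement for $n = 2$ can be verified directly on the three constrained dicodons. A minimal sketch (the helper name `is_single_frame` is illustrative):

```python
def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    shifted = {
        motif[3 * j + s:3 * j + s + 3] for j in range(n - 1) for s in (1, 2)
    }
    return not (reading & shifted)

# Initiation codon ATG followed by each of the three stop codons.
dicodons = ["ATG" + stop for stop in ("TAA", "TAG", "TGA")]
results = {d: is_single_frame(d) for d in dicodons}
print(results)
```

The same function applies unchanged to longer motifs of the form ATG, $n - 2$ internal codons, stop codon.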

Genes with initiation and stop codons do not increase translation fidelity compared to non-constrained genes (according to this approach).

3.4. Single-Frame and 5′ Unambiguous Motifs without Periodic Codons

The single-frame motifs $S F$ and the 5′ unambiguous motifs $5^{'} U$ are now studied without periodic codons ${A A A, C C C, G G G, T T T}$ . As expected, Figure 11 shows that the proportions of $S F$ and $5^{'} U$ motifs without periodic codons are higher than their respective non-constrained motifs.
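
For dicodons, the effect of excluding the periodic codons can be computed exactly. The sketch below is an illustration added here, not a result taken from Figure 11; it compares the single-frame fraction among all $64^{2}$ dicodons with the fraction among the $60^{2}$ dicodons built from non-periodic codons. The names `sf_count` and `PERIODIC` are illustrative.

```python
from itertools import product

PERIODIC = {"AAA", "CCC", "GGG", "TTT"}
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    return not any(
        motif[3 * j + s:3 * j + s + 3] in reading
        for j in range(n - 1)
        for s in (1, 2)
    )

def sf_count(codons):
    """Return (number of SF dicodons, number of dicodons) over a codon set."""
    dicodons = [a + b for a, b in product(codons, repeat=2)]
    return sum(map(is_single_frame, dicodons)), len(dicodons)

all_sf, all_total = sf_count(CODONS)
np_sf, np_total = sf_count([c for c in CODONS if c not in PERIODIC])
print(all_sf / all_total, np_sf / np_total)
```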

Genes without periodic codons slightly increase frame translation fidelity compared to non-constrained genes (according to this approach).

3.5. Single-Frame and 5′ Unambiguous Motifs with Antiparallel Complementarity

The single-frame $2 n$ -motifs $S F$ and the 5′ unambiguous $2 n$ -motifs $5^{'} U$ are now investigated with the following antiparallel complementary sequence: $t_{1} t_{2} \dots t_{n} C (t_{n}) \dots C (t_{2}) C (t_{1})$ where the trinucleotide antiparallel complementarity map $C$ applied to a trinucleotide $t$ is recalled in Definition 1. As an example, if $t_{1} t_{2} t_{3} = A C G T G C A A T$ then the antiparallel complementary sequence studied is $A C G T G C A A T A T T G C A C G T$ . Note that the trinucleotide length of such motifs is even. Classical antiparallel complementary structures are the DNA double helix and the RNA stem. Interesting results are observed. As expected, the two probability curves $P b S F M (n)$ of $S F$ motifs and $P b 5^{'} U M (n)$ of $5^{'} U$ motifs with antiparallel complementarity are identical (Figure 12). The proof is based on the following property: if $t_{i} = t_{j}^{f}$ with $i > j$ ( $3^{'} U M F$ motif) then $C (t_{i}) = t_{i^{'}} = C (t_{j}^{f}) = t_{j^{'}}^{f^{'}}$ with $i^{'} \leq j^{'}$ ( $5^{'} U M F$ motif) and $f \neq f^{'}$ . Furthermore, antiparallel complementarity increases the proportion of $S F$ motifs but decreases the proportion of $5^{'} U$ motifs, compared to their respective non-constrained motifs.
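
The construction of the antiparallel complementary sequence can be sketched as follows. The map $C$ is taken here to be the reversed complement of a trinucleotide, which is what the worked example implies; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def antiparallel_complement(trinucleotide):
    """Map C: complement each nucleotide and reverse the trinucleotide."""
    return trinucleotide.translate(COMPLEMENT)[::-1]

def antiparallel_motif(codons):
    """Build t1 t2 ... tn C(tn) ... C(t2) C(t1) from a list of codons."""
    tail = "".join(antiparallel_complement(t) for t in reversed(codons))
    return "".join(codons) + tail

sequence = antiparallel_motif(["ACG", "TGC", "AAT"])
print(sequence)
```

The second half of the sequence is the reverse complement of the first half as a whole, so the $2 n$ -motif can pair with itself as a stem.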

The “antiparallel complementary” genes have a higher proportion of single-frame motifs compared to the non-complementary genes. Thus, primitive translation associated with a DNA property could generate a greater number of peptides without frameshift errors.

3.6. Single-Frame and 5′ Unambiguous Motifs with Parallel Complementarity

The single-frame $2 n$ -motifs $S F$ and the 5′ unambiguous $2 n$ -motifs $5^{'} U$ are now analysed with the following parallel complementary sequence: $t_{1} t_{2} \dots t_{n} D (t_{1}) D (t_{2}) \dots D (t_{n})$ where the trinucleotide parallel complementarity map $D$ applied to a trinucleotide $t$ is recalled in Definition 1. As an example, if $t_{1} t_{2} t_{3} = A C G T G C A A T$ then the parallel complementary sequence studied is $A C G T G C A A T T G C A C G T T A$ . Note that the trinucleotide length of such motifs is also even. Interesting results are also observed. The two probability curves $P b S F M (n)$ of $S F$ motifs with parallel complementarity and $P b 5^{'} U M (n)$ of $5^{'} U$ motifs without constraints are superposable (Figure 13). Parallel complementarity increases the proportions of both $S F$ motifs and $5^{'} U$ motifs compared to their respective non-constrained motifs.
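
The parallel complementary sequence is built in the same way, without reversal. In this sketch the map $D$ is taken to be the nucleotide-wise complement of a trinucleotide, as implied by the worked example; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def parallel_complement(trinucleotide):
    """Map D: complement each nucleotide, keeping the order."""
    return trinucleotide.translate(COMPLEMENT)

def parallel_motif(codons):
    """Build t1 t2 ... tn D(t1) D(t2) ... D(tn) from a list of codons."""
    return "".join(codons) + "".join(parallel_complement(t) for t in codons)

sequence = parallel_motif(["ACG", "TGC", "AAT"])
print(sequence)
```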

“Parallel complementary” genes have a slightly higher proportion of single-frame motifs compared to the “antiparallel complementary” genes (compare the magenta curves in Figure 12 and Figure 13). The biological meaning is not yet explained.

3.7. Framing Motifs

There are framing motifs $F$ which are single-frame $S F$ or multiple-frame $M F$ .

Proposition 1.

A framing motif $F$ can be single-frame $S F$ .

Proof. Take the following motif $m = G A A C T C C C G A T A T G G C T C$ . The motif $m$ can be generated by the code $X = {A T A, C C G, C T C, G A A, T G G}$ . By Theorem 1, it is easy to verify that the graph $G (X)$ is acyclic, and thus $X$ is circular. Furthermore, the set of trinucleotides in reading frame is $T = X$ , the set of trinucleotides in the shifted frame 1 is $T^{1} = {A A C, C G A, G G C, T A T, T C C}$ and the set of trinucleotides in the shifted frame 2 is $T^{2} = {A C T, A T G, C C C, G A T, G C T}$ . We have $T \cap T^{1} = ⊘$ and $T \cap T^{2} = ⊘$ . Thus, the motif $m$ is both framing $F$ and single-frame $S F$ .

Proposition 2.

A framing motif $F$ can be multiple-frame $M F$ .

Proof. Take the following motif $m = A T T G A G C G A G C C T G T C A G$ . The motif $m$ can be generated by the code $X = {A T T, C A G, C G A, G A G, G C C, T G T}$ . By Theorem 1, it is easy to verify that the graph $G (X)$ is acyclic, and thus $X$ is circular. Furthermore, we have the trinucleotide sets $T = X$ , $T^{1} = {A G C, C C T, G A G, G T C, T T G}$ and $T^{2} = {A G C, C T G, G C G, T C A, T G A}$ leading to $T \cap T^{1} = {G A G}$ and $T \cap T^{2} = ⊘$ . Thus, the motif $m$ is both framing $F$ and multiple-frame $M F$ , precisely unidirectional multiple-frame $5^{'} U M F$ .

There are single-frame motifs $S F$ or multiple-frame motifs $M F$ which are not framing $F$ .

Proposition 3.

A single-frame motif $S F$ can be non-framing $F$ .

Proof. Take the following motif $m = G A C A A A T A A G T G G T A T G A$ . The motif $m$ can be generated by the code $X = {A A A, G A C, G T A, G T G, T A A, T G A}$ . We have the trinucleotide sets $T = X$ , $T^{1} = {A A G, A A T, A C A, T A T, T G G}$ and $T^{2} = {A G T, A T A, A T G, C A A, G G T}$ leading to $T \cap T^{1} = ⊘$ and $T \cap T^{2} = ⊘$ . However, as $X$ contains the periodic trinucleotide AAA, $X$ is not circular. Thus, the motif $m$ is single-frame $S F$ but not framing $F$ .

Proposition 4.

A multiple-frame motif $M F$ can be non-framing $F$ .

Proof. Take the following motif $m = G G A C C A T A C A T C C G G A C T$ . The motif $m$ can be generated by the code $X = {A C T, A T C, C C A, C G G, G G A, T A C}$ . We have the trinucleotide sets $T = X$ , $T^{1} = {A C A, C A T, G A C, G G A, T C C}$ and $T^{2} = {A C C, A T A, C A T, C C G, G A C}$ leading to $T \cap T^{1} = {G G A}$ and $T \cap T^{2} = ⊘$ . However, as $X$ contains the two permuted trinucleotides ACT and TAC, $X$ is not circular. Thus, the motif $m$ is multiple-frame $M F$ , precisely unidirectional multiple-frame $5^{'} U M F$ , but not framing $F$ .
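
The four proofs can be checked mechanically. The sketch below assumes the usual graph construction for a trinucleotide code (each trinucleotide $b_{1} b_{2} b_{3}$ contributes the two arrows $b_{1} \to b_{2} b_{3}$ and $b_{1} b_{2} \to b_{3}$ ), which is how Definition 5 is understood here, and uses Theorem 1 (circular if and only if the graph is acyclic). Function names are illustrative.

```python
def code_graph(code):
    """Arrows b1 -> b2b3 and b1b2 -> b3 for every trinucleotide b1b2b3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges

def is_circular(code):
    """Depth-first search for a cycle; acyclic graph means circular code."""
    edges, state = code_graph(code), {}
    def visit(v):
        if state.get(v) == "open":      # back arrow: a cycle exists
            return False
        if state.get(v) == "done":
            return True
        state[v] = "open"
        ok = all(visit(w) for w in edges.get(v, ()))
        state[v] = "done"
        return ok
    return all(visit(v) for v in list(edges))

def is_single_frame(motif):
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    return not any(motif[3 * j + s:3 * j + s + 3] in reading
                   for j in range(n - 1) for s in (1, 2))

motifs = {
    1: "GAACTCCCGATATGGCTC",
    2: "ATTGAGCGAGCCTGTCAG",
    3: "GACAAATAAGTGGTATGA",
    4: "GGACCATACATCCGGACT",
}
summary = {}
for k, m in motifs.items():
    code = {m[i:i + 3] for i in range(0, len(m), 3)}   # X = T(m)
    summary[k] = (is_circular(code), is_single_frame(m))
print(summary)
```

Each entry of `summary` is the pair (framing, single-frame) for the motif of the corresponding proposition, with the code $X$ taken as the set of reading-frame trinucleotides of the motif.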

Genes which are both framing $F$ and single-frame $S F$ retrieve the reading frame and code for a unique peptide as the shifted frames would lead to a different peptide product.

3.8. A New Class of Theoretical Parameters Relating the Circular Codes and Their Circular Code Motifs

The idea is to define a new class of parameters in order to measure the intensity $I (m)$ of a motif $m$ of a circular code to retrieve the reading frame. Thus, we have to associate information from the circular code theory with information from words (motifs).

In the circular code theory, the most important and the simplest parameter is the length $l_{m a x} (X)$ of a longest path (maximal arrow-length of a path) in the associated graph $G (X)$ of a circular code $X$ (see Definition 5). Note that the longest path $l_{m a x} (X)$ has a finite length as the graph $G (X)$ is acyclic (Theorem 1). The longest path $l_{m a x} (X)$ can classify the circular codes, from the strong comma-free codes with $l_{m a x} (X) = 1$ and the comma-free codes with $l_{m a x} (X) = 2$ up to the general circular codes with a maximal longest path $l_{m a x} (X) = 8$ when $X \subseteq B^{3}$ (i.e., for the trinucleotide circular codes) [29]. It is also related to the reading frame number $n_{X}$ of $X$ , i.e., the number of nucleotides to retrieve the reading frame. This reading frame number $n_{X}$ can also be used to classify the circular codes, from the strong comma-free codes with $n_{X} = 2$ nucleotides and the comma-free codes with $n_{X} = 3$ nucleotides up to the general circular codes with a maximal number $n_{X} = 13$ nucleotides when $X \subseteq B^{3}$ [30]. However, this parameter $n_{X}$ needs to know the structure of the longest path $l_{m a x} (X)$ which is one of the four cases: $b_{1} \to d_{1} \to \dots \to b_{k}$ , $b_{1} \to d_{1} \to \dots \to d_{k}$ , $d_{1} \to b_{1} \to \dots \to b_{k}$ and $d_{1} \to b_{1} \to \dots \to d_{k}$ where the nucleotide $b_{i} \in B$ and the dinucleotide $d_{i} \in B^{2}$ for any $i$ (see Definition 5). In summary, for the circular codes $X \subseteq B^{3}$ , the longest path $l_{m a x} (X)$ belongs to the interval $1 \leq l_{m a x} (X) \leq 8$ and the reading frame number $n_{X}$ belongs to the interval $2 \leq n_{X} \leq 13$ nucleotides. The definition of the reading frame number $n_{X}$ can still be generalized to arbitrary sequences, i.e., not entirely consisting of trinucleotides from $X$ [30]. 
For these two reasons, i.e., the required knowledge of the structure of $l_{m a x} (X)$ and the generalized definition of $n_{X}$ , the parameter $n_{X}$ , mentioned here for the record, will not be considered further.

A motif $m$ of a code, circular or not, can be characterized by its length $l (m)$ , given here in trinucleotides for convenience, for measuring its expansion; and its cardinality $card (T (m))$ of the set $T (m)$ (see Notation 2) of trinucleotides (in reading frame $f = 0$ ) of $m$ for measuring its variety (complexity). In the case of a motif $m$ of a trinucleotide circular code $X \subseteq B^{3}$ , $1 \leq card (T (m)) \leq 20$ .

It is important to stress the following condition: $T (m) \subseteq X$ with a trinucleotide circular code $X \subseteq B^{3}$ . The case $T (m) = X$ is associated with a trinucleotide circular code $X$ constructed from the motif $m$ .

A simple parameter measuring the expansion intensity $I_{e} (m)$ of reading frame retrieval of a circular code motif $m$ can be defined as follows:

$$I_{e} (m) = \frac{l (m)}{l_{m a x} (X)} \qquad (5)$$

where $l (m)$ , $l (m) \geq 1$ , is the trinucleotide length of the motif $m$ and $l_{m a x} (X)$ , $1 \leq l_{m a x} (X) \leq 8$ , is the length of a longest path in the associated graph $G (X)$ of a trinucleotide circular code $X \subseteq B^{3}$ . Note that $\frac{1}{8} \leq I_{e} (m) \leq l (m)$ and if $l (m) \geq l_{m a x} (X)$ then $1 \leq I_{e} (m) \leq l (m)$ .

A second parameter measuring both the expansion and variety intensity $I_{e v} (m)$ of a circular code motif $m$ can also be defined as follows:

$$I_{e v} (m) = card (T (m)) \times I_{e} (m) \qquad (6)$$

where $I_{e} (m)$ is defined in Equation (5) and $card (T (m))$ , $1 \leq card (T (m)) \leq 20$ , is the cardinality of the set $T (m)$ (Notation 2) of trinucleotides (in reading frame $f = 0$ ) of $m$ . Note that $\frac{1}{8} \leq I_{e v} (m) \leq 20 l (m)$ and if $l (m) \geq l_{m a x} (X)$ then $1 \leq I_{e v} (m) \leq 20 l (m)$ . Thus, for the circular code motifs $m$ of a given trinucleotide length $l (m)$ , the intensity $I_{e v} (m)$ of reading frame retrieval increases according to their cardinality $card (T (m))$ .
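
Equations (5) and (6) can be evaluated for the two framing motifs of Propositions 1 and 2. This is a sketch under two assumptions: the code $X$ is taken to be $T (m)$ , and the graph $G (X)$ is built with the arrows $b_{1} \to b_{2} b_{3}$ and $b_{1} b_{2} \to b_{3}$ for each trinucleotide $b_{1} b_{2} b_{3}$ (Definition 5 as understood here). The function name `intensities` is illustrative.

```python
from functools import lru_cache

def intensities(motif):
    """Return (l_max, I_e, I_ev) for a motif whose code X = T(m) is circular."""
    codons = [motif[i:i + 3] for i in range(0, len(motif), 3)]
    code = set(codons)
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Number of arrows on a longest path starting at v.
        # Terminates only because G(X) is acyclic for a circular code.
        return max((1 + longest_from(w) for w in edges.get(v, ())), default=0)

    l_max = max(longest_from(v) for v in edges)
    i_e = len(codons) / l_max            # Equation (5)
    i_ev = len(code) * i_e               # Equation (6)
    return l_max, i_e, i_ev

print(intensities("GAACTCCCGATATGGCTC"))   # motif of Proposition 1
print(intensities("ATTGAGCGAGCCTGTCAG"))   # motif of Proposition 2
```

The function must only be called on motifs of circular codes, since a cycle in the graph would make the longest path undefined.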

For a sequence $s$ containing several circular code motifs $m$ , the formulas (5) and (6) can be expressed as follows:

$$I_{e} (s) = \sum_{m \in s} I_{e} (m) = \frac{\sum_{m \in s} l (m)}{l_{m a x} (X)} \qquad (7)$$

with the hypothesis that $l_{m a x} (X)$ is identical for the motifs $m$ , a realistic case when the motifs $m$ are obtained from a same studied trinucleotide circular code $X$ , and thus:

$$I_{e v} (s) = \sum_{m \in s} I_{e v} (m) = \frac{\sum_{m \in s} card (T (m)) \times l (m)}{l_{m a x} (X)} . \qquad (8)$$

Note also that the formulas $I_{e} (s)$ and $I_{e v} (s)$ can also be normalized in order to weight the different lengths of sequences $s$ .
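
Equations (7) and (8) amount to two sums over the motifs of a sequence. In the sketch below the motifs and the value of $l_{m a x} (X)$ are hypothetical inputs chosen for illustration (the motifs use trinucleotides of the code of Proposition 1); detecting the motifs in a sequence and normalizing by sequence length are not covered.

```python
def sequence_intensities(motifs, l_max):
    """Return (I_e(s), I_ev(s)) for motifs sharing the same l_max(X)."""
    total_length = 0          # sum of l(m)
    total_weighted = 0        # sum of card(T(m)) * l(m)
    for m in motifs:
        codons = [m[i:i + 3] for i in range(0, len(m), 3)]
        total_length += len(codons)
        total_weighted += len(set(codons)) * len(codons)
    return total_length / l_max, total_weighted / l_max

# Two hypothetical circular code motifs found in a sequence s.
motifs = ["GAACTC", "CCGATATGGCTC"]
i_e, i_ev = sequence_intensities(motifs, l_max=2)
print(i_e, i_ev)
```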

3.9. $M F$ Dipeptides

The series of multiple-frame motifs $M F$ starts with the dicodons. We now focus on the $M F$ dipeptides, i.e., the pairs of consecutive amino acids coded by the $M F$ dicodons. The 16 bidirectional multiple-frame dicodons $B M F$ (Table 1) code 16 $B M F$ dipeptides according to the universal genetic code (Table 5). They include the four obvious $B M F$ dipeptides GlyGly (GGGGGG), LysLys (AAAAAA), PhePhe (TTTTTT) and ProPro (CCCCCC). Fifteen amino acids out of 20 are involved in these 16 $B M F$ dipeptides (Table 6): Ala, Arg, Cys, Glu, Gly, His, Ile, Leu, Lys, Phe, Pro, Ser, Thr, Tyr and Val (all except Asn, Asp, Gln, Met and Trp). Each of them occurs once in each position of a $B M F$ dipeptide, except Arg, which occurs twice in each position: ArgAla, ArgGlu, AlaArg and GluArg.

Table 5.

The 16 $B M F$ dipeptides coded by the 16 bidirectional multiple-frame dicodons $B M F$ (Definition 13, Table 1).

AR	AlaArg	GCGCGC	GG	GlyGly	GGGGGG	LS	LeuSer	CTCTCT	SL	SerLeu	TCTCTC
CV	CysVal	TGTGTG	HT	HisThr	CACACA	PP	ProPro	CCCCCC	TH	ThrHis	ACACAC
ER	GluArg	GAGAGA	IY	IleTyr	ATATAT	RA	ArgAla	CGCGCG	VC	ValCys	GTGTGT
FF	PhePhe	TTTTTT	KK	LysLys	AAAAAA	RE	ArgGlu	AGAGAG	YI	TyrIle	TATATA

Table 6.

Occurrence number of the 15 amino acids in the 1st and 2nd positions of the 16 $B M F$ dipeptides (Table 5).

	A	C	E	F	G	H	I	K	L	P	R	S	T	V	Y
	Ala	Cys	Glu	Phe	Gly	His	Ile	Lys	Leu	Pro	Arg	Ser	Thr	Val	Tyr	Sum
1st site	1	1	1	1	1	1	1	1	1	1	2	1	1	1	1	16
2nd site	1	1	1	1	1	1	1	1	1	1	2	1	1	1	1	16
Sum	2	2	2	2	2	2	2	2	2	2	4	2	2	2	2	32
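
Tables 5 and 6 can be regenerated from the standard genetic code. In this sketch the 16 $B M F$ dicodons are generated as the words $x y x y x y$ over the four nucleotides, which is the pattern visible in Table 1; the compact codon table string and the variable names are illustrative conventions, and one-letter amino acid codes are used.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The 16 bidirectional multiple-frame dicodons xyxyxy.
bmf_dicodons = [(x + y) * 3 for x, y in product("ACGT", repeat=2)]
bmf_dipeptides = {d: CODE[d[:3]] + CODE[d[3:]] for d in bmf_dicodons}

first_site = Counter(p[0] for p in bmf_dipeptides.values())
second_site = Counter(p[1] for p in bmf_dipeptides.values())
print(sorted(bmf_dipeptides.values()))
print(first_site, second_site)
```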

The 96 unidirectional multiple-frame dicodons $3^{'} U M F$ (Table 2) code 83 $3^{'} U M F$ dipeptides and four pairs (stop codon, amino acid): TAGArg, TAGGly, TGAGlu and TerLys where Ter can be the two stop codons TAA and TGA (Table 7). All the 20 amino acids are involved in the 83 $3^{'} U M F$ dipeptides (Table 8). All the 20 amino acids occur in the first position of $3^{'} U M F$ dipeptides. Five amino acids, Asn, Asp, Gln, Met and Trp, do not occur in the second position; these are the five amino acids not involved in the $B M F$ dipeptides. In the 83 $3^{'} U M F$ dipeptides, Pro and Gly are involved 20 and 19 times, respectively, while Met and Trp are involved only twice and once, respectively.

Table 7.

The 83 $3' U M F$ dipeptides and the four pairs (stop codon, amino acid) coded by the 96 unidirectional multiple-frame dicodons $3' U M F$ (Definition 11, Table 2).

AF	AlaPhe	GCTTTT	IS	IleSer	ATCTCT	RV	ArgVal	CGTGTG
AG	AlaGly	GCGGGG	KG	LysGly	AAGGGG	SA	SerAla	AGCGCG
AH	AlaHis	GCACAC	KR	LysArg	AAGAGA	SF	SerPhe	AGTTTT, TCTTTT
AK	AlaLys	GCAAAA	LC	LeuCys	CTGTGT, TTGTGT	SG	SerGly	TCGGGG
AL	AlaLeu	GCTCTC	LF	LeuPhe	CTTTTT	SH	SerHis	TCACAC
AP	AlaPro	GCCCCC	LG	LeuGly	CTGGGG, TTGGGG	SK	SerLys	TCAAAA
CA	CysAla	TGCGCG	LK	LeuLys	CTAAAA, TTAAAA	SP	SerPro	AGCCCC, TCCCCC
CF	CysPhe	TGTTTT	LP	LeuPro	CTCCCC	SR	SerArg	TCGCGC
CP	CysPro	TGCCCC	LY	LeuTyr	CTATAT, TTATAT	SV	SerVal	AGTGTG
DF	AspPhe	GATTTT	MC	MetCys	ATGTGT	TerE	TerGlu	TGAGAG
DI	AspIle	GATATA	MG	MetGly	ATGGGG	TerG	TerGly	TAGGGG
DP	AspPro	GACCCC	NF	AsnPhe	AATTTT	TerK	TerLys	TAAAAA, TGAAAA
DT	AspThr	GACACA	NI	AsnIle	AATATA	TerR	TerArg	TAGAGA
EG	GluGly	GAGGGG	NP	AsnPro	AACCCC	TF	ThrPhe	ACTTTT
EK	GluLys	GAAAAA	NT	AsnThr	AACACA	TG	ThrGly	ACGGGG
FP	PhePro	TTCCCC	PF	ProPhe	CCTTTT	TK	ThrLys	ACAAAA
FS	PheSer	TTCTCT	PG	ProGly	CCGGGG	TL	ThrLeu	ACTCTC
GA	GlyAla	GGCGCG	PH	ProHis	CCACAC	TP	ThrPro	ACCCCC
GE	GlyGlu	GGAGAG	PK	ProLys	CCAAAA	TR	ThrArg	ACGCGC
GF	GlyPhe	GGTTTT	PL	ProLeu	CCTCTC	VF	ValPhe	GTTTTT
GK	GlyLys	GGAAAA	PR	ProArg	CCGCGC	VG	ValGly	GTGGGG
GP	GlyPro	GGCCCC	QG	GlnGly	CAGGGG	VK	ValLys	GTAAAA
GV	GlyVal	GGTGTG	QK	GlnLys	CAAAAA	VP	ValPro	GTCCCC
HF	HisPhe	CATTTT	QR	GlnArg	CAGAGA	VS	ValSer	GTCTCT
HI	HisIle	CATATA	RE	ArgGlu	CGAGAG	VY	ValTyr	GTATAT
HP	HisPro	CACCCC	RF	ArgPhe	CGTTTT	WG	TrpGly	TGGGGG
IF	IlePhe	ATTTTT	RG	ArgGly	AGGGGG, CGGGGG	YF	TyrPhe	TATTTT
IK	IleLys	ATAAAA	RK	ArgLys	AGAAAA, CGAAAA	YP	TyrPro	TACCCC
IP	IlePro	ATCCCC	RP	ArgPro	CGCCCC	YT	TyrThr	TACACA

Table 8.

Occurrence number of the 20 amino acids in the first and second positions of the 83 $3' U M F$ dipeptides and the four pairs (stop codon, amino acid) (Table 7).

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
	Ala	Cys	Asp	Glu	Phe	Gly	His	Ile	Lys	Leu	Met	Asn	Pro	Gln	Arg	Ser	Thr	Val	Trp	Tyr	Ter	Sum
1st site	6	3	4	2	2	6	3	4	2	6	2	4	6	3	6	8	6	6	1	3	4	87
2nd site	3	2	0	3	14	13	3	3	12	3	0	0	14	0	6	3	3	3	0	2	0	87
Sum	9	5	4	5	16	19	6	7	14	9	2	4	20	3	12	11	9	9	1	5	4	174

The 96 unidirectional multiple-frame dicodons $5^{'} U M F$ (Table 3) code 40 $5^{'} U M F$ dipeptides and three pairs (amino acid, stop codon): IleTer where Ter can be the two stop codons TAA and TAG, PheTer where Ter can be the three stop codons TAA, TAG and TGA, and ValTGA (Table 9). All the 20 amino acids are involved in the 40 $5^{'} U M F$ dipeptides (Table 10). Five amino acids, Asn, Asp, Gln, Met and Trp, do not occur in the first position of $5^{'} U M F$ dipeptides; these are the five amino acids not involved in the $B M F$ dipeptides. All the 20 amino acids occur in the second position. In the 40 $5^{'} U M F$ dipeptides, two amino acids, Lys and Phe, are involved eight times while Asn is involved only once.

Table 9.

The 40 $5^{'} U M F$ dipeptides and the three pairs (amino acid, stop codon) coded by the 96 unidirectional multiple-frame dicodons $5^{'} U M F$ (Definition 12, Table 3).

AR	AlaArg	GCGCGA, GCGCGG, GCGCGT	KN	LysAsn	AAAAAC, AAAAAT
CV	CysVal	TGTGTA, TGTGTC, TGTGTT	KR	LysArg	AAAAGA, AAAAGG
ER	GluArg	GAGAGG	KS	LysSer	AAAAGC, AAAAGT
ES	GluSer	GAGAGC, GAGAGT	KT	LysThr	AAAACA, AAAACC, AAAACG, AAAACT
FC	PheCys	TTTTGC, TTTTGT	LS	LeuSer	CTCTCA, CTCTCC, CTCTCG
FF	PhePhe	TTTTTC	PH	ProHis	CCCCAC, CCCCAT
FL	PheLeu	TTTTTA, TTTTTG	PL	ProLeu	CCCCTA, CCCCTC, CCCCTG, CCCCTT
FS	PheSer	TTTTCA, TTTTCC, TTTTCG, TTTTCT	PP	ProPro	CCCCCA, CCCCCG, CCCCCT
FTer	PheTer	TTTTAA, TTTTAG, TTTTGA	PQ	ProGln	CCCCAA, CCCCAG
FW	PheTrp	TTTTGG	PR	ProArg	CCCCGA, CCCCGC, CCCCGG, CCCCGT
FY	PheTyr	TTTTAC, TTTTAT	RA	ArgAla	CGCGCA, CGCGCC, CGCGCT
GA	GlyAla	GGGGCA, GGGGCC, GGGGCG, GGGGCT	RD	ArgAsp	AGAGAC, AGAGAT
GD	GlyAsp	GGGGAC, GGGGAT	RE	ArgGlu	AGAGAA
GE	GlyGlu	GGGGAA, GGGGAG	SL	SerLeu	TCTCTA, TCTCTG, TCTCTT
GG	GlyGly	GGGGGA, GGGGGC, GGGGGT	TH	ThrHis	ACACAT
GV	GlyVal	GGGGTA, GGGGTC, GGGGTG, GGGGTT	TQ	ThrGln	ACACAA, ACACAG
HT	HisThr	CACACC, CACACG, CACACT	VC	ValCys	GTGTGC
ITer	IleTer	ATATAA, ATATAG	VTer	ValTer	GTGTGA
IY	IleTyr	ATATAC	VW	ValTrp	GTGTGG
KI	LysIle	AAAATA, AAAATC, AAAATT	YI	TyrIle	TATATC, TATATT
KK	LysLys	AAAAAG	YM	TyrMet	TATATG
KM	LysMet	AAAATG

Table 10.

Occurrence number of the 20 amino acids in the first and second positions of the 40 $5^{'} U M F$ dipeptides and the three pairs (amino acid, stop codon) (Table 9).

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
	Ala	Cys	Asp	Glu	Phe	Gly	His	Ile	Lys	Leu	Met	Asn	Pro	Gln	Arg	Ser	Thr	Val	Trp	Tyr	Ter	Sum
1st site	1	1	0	2	7	5	1	2	7	1	0	0	5	0	3	1	2	3	0	2	0	43
2nd site	2	2	2	2	1	1	2	2	1	3	2	1	1	2	4	4	2	2	2	2	3	43
Sum	3	3	2	4	8	6	3	4	8	4	2	1	6	2	7	5	4	5	2	4	3	86

The $114 = 121 - 4 - 3$ $M F$ dipeptides among 400, i.e., 28.5%, are coded by $208 = 16 + 2 \times 96$ $M F$ dicodons ( $B M F$ , $3^{'} U M F$ , $5^{'} U M F$ ) among 4096, i.e., 5.1% (Table 11). As a consequence, 286 $S F$ dipeptides, i.e., 71.5%, are coded by 3888 single-frame dicodons $S F$ , i.e., 94.9%. There is also a strong asymmetry between the numbers of $M F$ dipeptides coded in the two directions: 83 $3^{'} U M F$ dipeptides (Table 7) versus 40 $5^{'} U M F$ dipeptides (Table 9). This asymmetry may be related to gene translation in the $5^{'} - 3^{'}$ direction, the $3^{'} U M F$ dicodons having an unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction.
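
The counts in this paragraph can be recomputed by classifying and translating all 4096 dicodons. This is a sketch under the following assumptions: the standard genetic code, a stop codon written `*`, and the dicodon classification rule in which a repeat of the first codon in a shifted frame is a 5′ ambiguity and a repeat of the second codon is a 3′ ambiguity (consistent with Tables 1 to 3). All names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def dicodon_class(d):
    """Return 'SF', 'BMF', '3UMF' or '5UMF' for a dicodon."""
    shifted = {d[1:4], d[2:5]}          # frame 1 and frame 2 trinucleotides
    five = d[:3] in shifted             # first codon repeated
    three = d[3:] in shifted            # second codon repeated
    if five and three:
        return "BMF"
    return "5UMF" if five else "3UMF" if three else "SF"

pairs = {"BMF": set(), "3UMF": set(), "5UMF": set()}
n_mf = 0
for p in product(BASES, repeat=6):
    d = "".join(p)
    c = dicodon_class(d)
    if c != "SF":
        n_mf += 1
        pairs[c].add(CODE[d[:3]] + CODE[d[3:]])

all_pairs = set().union(*pairs.values())
mf_dipeptides = {p for p in all_pairs if "*" not in p}   # drop pairs with a stop
print(n_mf, {k: len(v) for k, v in pairs.items()}, len(all_pairs), len(mf_dipeptides))
```

The set sizes for the two unidirectional classes include the pairs containing a stop codon, so they correspond to $83 + 4$ and $40 + 3$ in the text.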

Table 11.

Multiple-frame dipeptide boolean matrix. The $114 = 121 - 4 - 3$ $M F$ dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the $208 = 16 + 2 \times 96$ multiple-frame dicodons $B M F$ (Definition 13, Table 1), $3^{'} U M F$ (Definition 11, Table 2) and $5^{'} U M F$ (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The value of 1 means a $M F$ dipeptide coded by at least one multiple-frame dicodon $M F$ ( $M F$ true). The value of 0 stands for a $S F$ dipeptide coded only by single-frame dicodons $S F$ ( $M F$ false). For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9) and the value of CysAla is 1 (7th row in Table 7).

Site	2nd	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
1st		Ala	Cys	Asp	Glu	Phe	Gly	His	Ile	Lys	Leu	Met	Asn	Pro	Gln	Arg	Ser	Thr	Val	Trp	Tyr	Ter	Sum
A	Ala	0	0	0	0	1	1	1	0	1	1	0	0	1	0	1	0	0	0	0	0	0	7
C	Cys	1	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	4
D	Asp	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	4
E	Glu	0	0	0	0	0	1	0	0	1	0	0	0	0	0	1	1	0	0	0	0	0	4
F	Phe	0	1	0	0	1	0	0	0	0	1	0	0	1	0	0	1	0	0	1	1	1	8
G	Gly	1	0	1	1	1	1	0	0	1	0	0	0	1	0	0	0	0	1	0	0	0	8
H	His	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	4
I	Ile	0	0	0	0	1	0	0	0	1	0	0	0	1	0	0	1	0	0	0	1	1	6
K	Lys	0	0	0	0	0	1	0	1	1	0	1	1	0	0	1	1	1	0	0	0	0	8
L	Leu	0	1	0	0	1	1	0	0	1	0	0	0	1	0	0	1	0	0	0	1	0	7
M	Met	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2
N	Asn	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	4
P	Pro	0	0	0	0	1	1	1	0	1	1	0	0	1	1	1	0	0	0	0	0	0	8
Q	Gln	0	0	0	0	0	1	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	3
R	Arg	1	0	1	1	1	1	0	0	1	0	0	0	1	0	0	0	0	1	0	0	0	8
S	Ser	1	0	0	0	1	1	1	0	1	1	0	0	1	0	1	0	0	1	0	0	0	9
T	Thr	0	0	0	0	1	1	1	0	1	1	0	0	1	1	1	0	0	0	0	0	0	8
V	Val	0	1	0	0	1	1	0	0	1	0	0	0	1	0	0	1	0	0	1	1	1	9
W	Trp	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
Y	Tyr	0	0	0	0	1	0	0	1	0	0	1	0	1	0	0	0	1	0	0	0	0	5
	Ter	0	0	0	1	0	1	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	4
	Sum	4	4	2	3	15	14	4	5	13	5	2	1	15	2	8	6	5	4	2	4	3	121

Five dipeptides GlyAla, GlyVal, PheSer, ProLeu and ProArg are the most strongly coded, each by five $M F$ dicodons (Table 12), e.g., GlyAla is coded by one $3^{'} U M F$ dicodon GGCGCG (Table 7), and four $5^{'} U M F$ dicodons GGGGCA, GGGGCC, GGGGCG and GGGGCT (Table 9). The $S F$ and $M F$ dipeptides could have particular spatial structures and biological functions in extant and primitive proteins which remain to be identified.
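
The occurrence counts behind Table 12 can be sketched as a tally of translated multiple-frame dicodons. The code assumes the standard genetic code with `*` for a stop codon and uses one-letter amino acid codes; a dicodon is counted as multiple-frame when one of its two codons equals its frame 1 or frame 2 trinucleotide. Names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

occurrences = Counter()
for p in product(BASES, repeat=6):
    d = "".join(p)
    # Multiple-frame dicodon: a codon is repeated in frame 1 or frame 2.
    if {d[:3], d[3:]} & {d[1:4], d[2:5]}:
        occurrences[CODE[d[:3]] + CODE[d[3:]]] += 1

top = max(occurrences.values())
strongest = sorted(k for k, v in occurrences.items() if v == top)
print(top, strongest)
```

Looking up a pair absent from the tally, such as `occurrences["AC"]` for AlaCys, returns 0, matching the zero entries of the matrix.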

Table 12.

Multiple-frame dipeptide occurrence matrix. The $114 = 121 - 4 - 3$ $M F$ dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the $208 = 16 + 2 \times 96$ multiple-frame dicodons $B M F$ (Definition 13, Table 1), $3^{'} U M F$ (Definition 11, Table 2) and $5^{'} U M F$ (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The values between 1 and 5 give the number of multiple-frame dicodons $M F$ coding a $M F$ dipeptide. The value of 0 stands for a $S F$ dipeptide coded only by single-frame dicodons $S F$ . For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9), the value of CysAla is 1 (7th row in Table 7) and the value of AlaArg is 4 (one occurrence: 1st row in Table 5 and three occurrences: 1st row in Table 9).

Site	2nd	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
1st		Ala	Cys	Asp	Glu	Phe	Gly	His	Ile	Lys	Leu	Met	Asn	Pro	Gln	Arg	Ser	Thr	Val	Trp	Tyr	Ter	Sum
A	Ala	0	0	0	0	1	1	1	0	1	1	0	0	1	0	4	0	0	0	0	0	0	10
C	Cys	1	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	4	0	0	0	7
D	Asp	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	4
E	Glu	0	0	0	0	0	1	0	0	1	0	0	0	0	0	2	2	0	0	0	0	0	6
F	Phe	0	2	0	0	2	0	0	0	0	2	0	0	1	0	0	5	0	0	1	2	3	18
G	Gly	5	0	2	3	1	4	0	0	1	0	0	0	1	0	0	0	0	5	0	0	0	22
H	His	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	4	0	0	0	0	7
I	Ile	0	0	0	0	1	0	0	0	1	0	0	0	1	0	0	1	0	0	0	2	2	8
K	Lys	0	0	0	0	0	1	0	3	2	0	1	2	0	0	3	2	4	0	0	0	0	18
L	Leu	0	2	0	0	1	2	0	0	2	0	0	0	1	0	0	4	0	0	0	2	0	14
M	Met	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2
N	Asn	0	0	0	0	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	4
P	Pro	0	0	0	0	1	1	3	0	1	5	0	0	4	2	5	0	0	0	0	0	0	22
Q	Gln	0	0	0	0	0	1	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	3
R	Arg	4	0	2	3	1	2	0	0	2	0	0	0	1	0	0	0	0	1	0	0	0	16
S	Ser	1	0	0	0	2	1	1	0	1	4	0	0	2	0	1	0	0	1	0	0	0	14
T	Thr	0	0	0	0	1	1	2	0	1	1	0	0	1	2	1	0	0	0	0	0	0	10
V	Val	0	2	0	0	1	1	0	0	1	0	0	0	1	0	0	1	0	0	1	1	1	10
W	Trp	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
Y	Tyr	0	0	0	0	1	0	0	3	0	0	1	0	1	0	0	0	1	0	0	0	0	7
	Ter	0	0	0	1	0	1	0	0	2	0	0	0	0	0	1	0	0	0	0	0	0	5
	Sum	11	7	4	7	17	19	7	9	17	13	2	2	19	4	18	15	11	11	2	7	6	208

4. Discussion

For the first time to our knowledge, new definitions of motifs in genes are presented. The single-frame motifs $S F$ (unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions) and the multiple-frame motifs $M F$ (ambiguous trinucleotide decoding in at least one direction) form a partition of genes. Several classes of $M F$ motifs are defined and analysed: (i) unidirectional multiple-frame motifs $3^{'} U M F$ (ambiguous trinucleotide decoding in the $3^{'} - 5^{'}$ direction only); (ii) unidirectional multiple-frame motifs $5^{'} U M F$ (ambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only); and (iii) bidirectional multiple-frame motifs $B M F$ (ambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions). The distributions of the single-frame motifs $S F$ and the 5′ unambiguous motifs $5^{'} U$ (unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only) are studied without and with constraints.

The proportion of $S F$ motifs drastically decreases with their trinucleotide length. The $S F$ motifs become almost absent ( $< 1 %$ ) when their length is $\geq 14$ trinucleotides, and the number of $M F$ motifs already exceeds the number of $S F$ motifs when their length is $\geq 6$ trinucleotides. As expected, the proportion of $5^{'} U$ motifs decreases more slowly than that of $S F$ motifs. The $5^{'} U$ motifs become almost absent ( $< 1 %$ ) when their length is $\geq 20$ trinucleotides. Thus, with the $5^{'} U$ motifs, there is a length increase of $20 - 14 = 6$ trinucleotides in the trinucleotide decoding.

The proportions of $S F$ and $5^{'} U$ motifs with initiation and stop codons are lower than those of their respective non-constrained motifs. In contrast, their proportions in motifs without periodic codons $\{A A A, C C C, G G G, T T T\}$ are higher than those of their respective non-constrained motifs. The proportions of $S F$ and $5^{'} U$ motifs with antiparallel complementarity are identical. Antiparallel complementarity increases the proportion of $S F$ motifs but decreases the proportion of $5^{'} U$ motifs, compared to their respective non-constrained motifs. The proportions of $S F$ motifs with parallel complementarity and $5^{'} U$ motifs without constraints follow a similar distribution. Finally, parallel complementarity increases the proportions of both $S F$ motifs and $5^{'} U$ motifs compared to their respective non-constrained motifs. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with unambiguous trinucleotide decoding, either strictly in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions ( $S F$ motifs) or conserved in the $5^{'} - 3^{'}$ direction but relaxed or lost in the $3^{'} - 5^{'}$ direction ( $5^{'} U$ motifs).

The single-frame motifs $S F$ with a property of trinucleotide decoding and the framing motifs $F$ with a property of reading frame decoding could have operated in the primitive soup for constructing the modern genetic code and the extant genes [31]. They could have been involved in the stage without anticodon-amino acid interactions to form peptides from prebiotic amino acids [32]. They could also have been involved in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids at the primitive step of life (review in [33]). According to a great number of biological experiments, the ISN structure contains nucleotides in fixed and variable positions, as well as an important trinucleotide for interacting with the amino acid (see, e.g., the recent review in [34]). However, the general structure of the aptamers binding amino acids, in particular its nucleotide length, its amino acid binding loop and its nucleotide position, is still an open problem. Similar arguments could hold for the ribonucleopeptides, which could be implicated in a primitive T box riboswitch functioning as an aminoacyl-tRNA synthetase and a peptidyl-transferase ribozyme [35]. The single-frame motifs $S F$ and the framing motifs $F$, with their properties of decoding the trinucleotides and the reading frame, could have been necessary for the evolutionary construction of the modern genetic code.

Acknowledgment

I thank Denise Marie Besch for her support.

Abbreviations

$S F$	single-frame motif (unambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions)
$M F$	multiple-frame motif
$U M F$	unidirectional multiple-frame motif
$3^{'} U M F$	unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the $3^{'} - 5^{'}$ direction only)
$5^{'} U M F$	unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only)
$B M F$	bidirectional multiple-frame motif (ambiguous trinucleotide decoding in the two $5^{'} - 3^{'}$ and $3^{'} - 5^{'}$ directions)
$5^{'} U$	5′ unambiguous motif (unambiguous trinucleotide decoding in the $5^{'} - 3^{'}$ direction only)
$F$	framing motif (also called circular code motif)


Funding

The author received no funding for this study.

Conflicts of Interest

The author declares no competing interests.

References

1. Gamow G. Possible relation between deoxyribonucleic acid and protein structures. Nature. 1954;173:318. doi: 10.1038/173318a0.
2. Crick F.H.C., Griffith J.S., Orgel L.E. Codes without commas. Proc. Natl. Acad. Sci. USA. 1957;43:416–421. doi: 10.1073/pnas.43.5.416.
3. Nirenberg M.W., Matthaei J.H. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. USA. 1961;47:1588–1602. doi: 10.1073/pnas.47.10.1588.
4. Crick F.H.C., Barnett L., Brenner S., Watts-Tobin R.J. General nature of the genetic code for proteins. Nature. 1961;192:1227–1232. doi: 10.1038/1921227a0.
5. Khorana H.G., Büchi H., Ghosh H., Gupta N., Jacob T.M., Kössel H., Morgan R., Narang S.A., Ohtsuka E., Wells R.D. Polynucleotide synthesis and the genetic code. Cold Spring Harb. Symp. Quant. Biol. 1966;31:39–49. doi: 10.1101/SQB.1966.031.01.010.
6. Nirenberg M., Caskey T., Marshall R., Brimacombe R., Kellogg D., Doctor B., Hatfield D., Levin J., Rottman F., Pestka S., et al. The RNA code and protein synthesis. Cold Spring Harb. Symp. Quant. Biol. 1966;31:11–24. doi: 10.1101/SQB.1966.031.01.008.
7. Salas M., Smith M.A., Stanley W.M., Wahba A.J., Ochoa S. Direction of reading of the genetic message. J. Biol. Chem. 1965;240:3988–3995.
8. Arquès D.G., Michel C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996;182:45–58. doi: 10.1006/jtbi.1996.0142.
9. Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015;380:156–177. doi: 10.1016/j.jtbi.2015.04.009.
10. Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life. 2017;7:20. doi: 10.3390/life7020020.
11. Michel C.J. A genetic scale of reading frame coding. J. Theor. Biol. 2014;355:83–94. doi: 10.1016/j.jtbi.2014.03.029.
12. Michel C.J. An extended genetic scale of reading frame coding. J. Theor. Biol. 2015;365:164–174. doi: 10.1016/j.jtbi.2014.09.040.
13. Dinman J.D. Programmed ribosomal frameshifting goes beyond viruses. Microbe. 2006;1:521–527. doi: 10.1128/microbe.1.521.1.
14. Farabaugh P.J. Programmed translational frameshifting. Annu. Rev. Genet. 1996;30:507–528. doi: 10.1146/annurev.genet.30.1.507.
15. Caliskan N., Peske F., Rodnina M.V. Changed in translation: mRNA recoding by -1 programmed ribosomal frameshifting. Trends Biochem. Sci. 2015;40:265–274. doi: 10.1016/j.tibs.2015.03.006.
16. Napthine S., Ling R., Finch L.K., Jones J.D., Bell S., Brierley I., Firth A.E. Protein-directed ribosomal frameshifting temporally regulates gene expression. Nat. Commun. 2017;8:15582. doi: 10.1038/ncomms15582.
17. Wang R., Xiong J., Wang W., Miao W., Liang A. High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Sci. Rep. 2016;6:21139. doi: 10.1038/srep21139.
18. El Houmami N., Seligmann H. Evolution of nucleotide punctuation marks: From structural to linear signals. Front. Genet. 2017;8:36. doi: 10.3389/fgene.2017.00036.
19. Seligmann H. Codon expansion and systematic transcriptional deletions produce tetra-, pentacoded mitochondrial peptides. J. Theor. Biol. 2015;387:154–165. doi: 10.1016/j.jtbi.2015.09.030.
20. Baranov P.V., Atkins J.F., Yordanova M.M. Augmented genetic decoding: Global, local and temporal alterations of decoding processes and codon meaning. Nat. Rev. Genet. 2015;16:517–529. doi: 10.1038/nrg3963.
21. Michel C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012;37:24–37. doi: 10.1016/j.compbiolchem.2011.10.002.
22. Michel C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013;45:17–29. doi: 10.1016/j.compbiolchem.2013.02.004.
23. Michel C.J. A 2006 review of circular codes in genes. Comput. Math. Appl. 2008;55:984–988. doi: 10.1016/j.camwa.2006.12.090.
24. Fimmel E., Strüngmann L. Mathematical fundamentals for the noise immunity of the genetic code. Biosystems. 2018;164:186–198. doi: 10.1016/j.biosystems.2017.09.007.
25. Luisi P.L. Prebiotic metabolic networks? Mol. Syst. Biol. 2014;10:729. doi: 10.1002/msb.20145351.
26. Ying J., Lin R., Xu P., Wu Y., Liu Y., Zhao Y. Prebiotic formation of cyclic dipeptides under potentially early Earth conditions. Sci. Rep. 2018;8:936. doi: 10.1038/s41598-018-19335-9.
27. Shu W., Yu Y., Chen S., Yan X., Liu Y., Zhao Y. Selective formation of Ser-His dipeptide via phosphorus activation. Orig. Life Evol. Biospheres. 2018;48:213–222. doi: 10.1007/s11084-018-9556-7.
28. Wieczorek R., Adamala K., Gasperi T., Polticelli F., Stano P. Small and random peptides: An unexplored reservoir of potentially functional primitive organocatalysts. The case of Seryl-Histidine. Life. 2017;7:19. doi: 10.3390/life7020019.
29. Fimmel E., Michel C.J., Strüngmann L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016;374:20150058. doi: 10.1098/rsta.2015.0058.
30. Fimmel E., Michel C.J., Starman M., Strüngmann L. Self-complementary circular codes in coding theory. Theory Biosci. 2018;137:51–65. doi: 10.1007/s12064-018-0259-4.
31. Kun Á., Radványi Á. The evolution of the genetic code: Impasses and challenges. Biosystems. 2018;164:217–225. doi: 10.1016/j.biosystems.2017.10.006.
32. Johnson D.B.F., Wang L. Imprints of the genetic code in the ribosome. Proc. Natl. Acad. Sci. USA. 2010;107:8298–8303. doi: 10.1073/pnas.1000704107.
33. Yarus M. The genetic code and RNA-amino acid affinities. Life. 2017;7:13. doi: 10.3390/life7020013.
34. Zagrovic B., Bartonek L., Polyansky A.A. RNA-protein interactions in an unstructured context. FEBS Lett. 2018;592:2901–2916. doi: 10.1002/1873-3468.13116.
35. Saad N.Y. A ribonucleopeptide world at the origin of life. J. Syst. Evol. 2018;56:1–13. doi: 10.1111/jse.12287.

