Abstract
We study the distribution of new classes of motifs in genes, a research field that has not been investigated to date. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2), e.g., the dicodon AAACAA is as the trinucleotides AAA and CAA do not occur in a shifted frame. A motif which is not single-frame is multiple-frame . Several classes of motifs are defined and analysed. The distributions of single-frame motifs (associated with an unambiguous trinucleotide decoding in the two and directions) and 5′ unambiguous motifs (associated with an unambiguous trinucleotide decoding in the direction only) are analysed without and with constraints. The constraints studied are: initiation and stop codons, periodic codons , antiparallel complementarity and parallel complementarity. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with an unambiguous trinucleotide decoding in the two and directions or the direction only. Furthermore, the single-frame motifs with a property of trinucleotide decoding and the framing motifs (also called circular code motifs; first introduced by Michel (2012)) with a property of reading frame decoding may have been involved in the early life genes to build the modern genetic code and the extant genes. They could have been involved in the stage without anticodon-amino acid interactions or in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids. Finally, the and dipeptides associated with the and dicodons, respectively, are studied and their importance for biology and the origin of life discussed.
Keywords: single-frame motifs, multiple-frame motifs, framing motifs, gene coding, antiparallel and parallel sequences, early life genes
1. Introduction
The reading frame coding with trinucleotide sets is a fascinating problem, both theoretical and experimental. Before the discovery of the genetic code, a first code was proposed by Gamow [1] by considering the “key-and-lock” relation between various amino acids, and the rhomb shaped “holes” formed by various nucleotides in the DNA. The proposed model will later prove to be false. A few years later, a class of trinucleotide codes, called comma-free codes, was proposed by Crick et al. [2] for explaining how the reading of a sequence of trinucleotides could code amino acids. In particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {A,C,G,T} as well as the 16 dinucleotides {AA,…,TT} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick et al. [2] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {AAA,…,TTT} code for the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame—the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:
(i) A periodic trinucleotide from the set {AAA,CCC,GGG,TTT} must be excluded from such a code. Indeed, the concatenation of AAA with itself, for instance, does not allow the (original) reading frame to be retrieved as there are three possible decompositions: …,AAA,AAA,AAA,… (original frame), …A,AAA,AAA,AA… and …AA,AAA,AAA,A…, the commas showing the adopted decomposition.
(ii) Two non-periodic permuted trinucleotides, i.e., two trinucleotides related by a circular permutation, e.g., ACG and CGA, must also be excluded from such a code. Indeed, the concatenation of ACG with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: …,ACG,ACG,ACG,… (original frame) and …A,CGA,CGA,CG…
Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g., ACG, CGA and GAC, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.
In the beginning 1960’s, the discovery that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [3], led to the abandonment of the concepts both of a comma-free code [2] and a bijective code as the genetic code is degenerate [4,5,6] with a gene translation in one direction [7].
In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [8]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), three subsets , and of 20 trinucleotides each are found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide in the direction, i.e., to the right) and 2 (frame 0 shifted by two nucleotides in the direction) in genes of both prokaryotes and eukaryotes. The same set of trinucleotides was identified in average in genes (reading frame) of bacteria, archaea, eukaryotes, plasmids and viruses [9,10]. It contains the 20 following trinucleotides:
| (1) |
and codes the 12 following amino acids (three and one letter notation):
| (2) |
This set has a strong mathematical property. Indeed, is a maximal self-complementary trinucleotide circular code [8].
The reading frame coding with trinucleotide codes (sets of words) in general terms, i.e., not particularly the genetic code, is a concept which has been studied in Michel [11,12]. We extend it to the motifs (words of codes), a theoretical domain which has been ignored according to our knowledge. Genes (protein coding regions) can be partitioned into two disjoint classes of motifs: the single-frame motifs with an unambiguous trinucleotide decoding in the two and directions, and the multiple-frame motifs with an ambiguous trinucleotide decoding in at least one direction. A single-frame motif has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2). In contrast, a multiple-frame motif has at least one trinucleotide in reading frame that occurs in a shifted frame. Some well-known motifs are involved in ribosomal frameshifting. The expression of some viral and cellular genes utilizes a -1 programmed ribosomal frameshifting (-1 PRF) [13,14]. This -1 PRF sequence is based on three elements: (i) a slippery motif composed of seven nucleotides at which the change in reading frame occurs; (ii) a spacer motif, usually less than 12 nucleotides; and (iii) a down-stream (3′) stimulatory motif, usually a pseudoknot or a stem-loop. In eukaryotes, the slippery motif fits a consensus heptanucleotide X,XXY,YYZ, where XXX is any three identical nucleotides, YYY represents AAA or TTT, Z represents A, C or T, the commas separating the codons in reading frame [15,16]. The slippery motifs and are multiple-frame . Indeed, the codon AAA in reading frame also occurs in the shifted frames 1 and 2 in , and similarly with the codon TTT in . Alternative gene decoding is also possible with +1 programmed ribosomal frameshifting (+1 PRF) which has been particularly observed in Euplotes [17]. The identified slippery motif where is multiple-frame . The slippery motifs AAA, CCC, GGG and TTT may cause frameshifting during transcription, producing RNAs missing specific nucleotides when compared to template DNA [18,19]. The slippery motifs are not always multiple-frame while stressing that the spacer and the down-stream stimulatory motifs have been very poorly characterized [20] and could also be involved in such a multiple-frame definition. From a theoretical point of view, it is important to extend this concept by increasing the length of such multiple-frame slippery motifs and also by considering their different classes. If the multiple-frame motifs may be involved in ribosomal frameshifting, the single-frame motifs and the framing motifs (also called circular code motifs; first introduced in Michel [21,22]) from the circular codes [8,9,10] (reviews in Michel [23]; Fimmel and Strüngmann [24]) may have been important in early life genes for constructing the modern genetic code and the extant genes (see Discussion).
Several classes of motifs are defined: (i) a unidirectional multiple-frame motif has no trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., its position in the reading frame) but has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading, e.g., the dicodon AACACA is as the trinucleotides AAC and (trivially) ACA do not occur in a shifted frame after their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 1) before its reading; (ii) a unidirectional multiple-frame motif , the opposite, has no trinucleotide in reading frame that occurs in a shifted frame before its reading but has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading, e.g., the dicodon ACACAA mirror of AACACA is as the trinucleotides (trivially) ACA and CAA do not occur in a shifted frame before their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 2) after its reading; and (iii) a bidirectional multiple-frame motif has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading and has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading (both and ), e.g., the dicodons AAAAAA and ACACAC are . A 5′ unambiguous motif , is either a motif or a motif, e.g., the dicodons AAACAA ( motif) and AACACA ( motif) belong to the class .
We will only investigate here the distribution of the single-frame motifs associated with an unambiguous trinucleotide decoding in the two and directions, and the 5′ unambiguous motifs associated with an unambiguous trinucleotide decoding in the direction only, i.e., a less restrictive class of motifs. The distributions of and motifs will be analysed without and with constraints. The constraints studied are: (i) with initiation and stop codons; (ii) without periodic codons ; (iii) with antiparallel complementarity; and (iv) with parallel complementarity.
We will also investigate the particular case of motifs made up of two codons, i.e., the dicodons. The definitions of and dicodons will thus identify two new classes of dipeptides, the and dipeptides. The dipeptides are coded by dicodons with an unambiguous trinucleotide decoding, in contrast to the dipeptides which are coded by dicodons with an ambiguous trinucleotide decoding. The concept of and dipeptides might be of predictive value to studies of prebiotic metabolites [25]. Peptide evolution on the primitive earth is an active and exciting field of research with cyclic dipeptides [26] and selective formation of SerHis dipeptide via phosphorus activation [27,28].
2. Method
2.1. Recall of Biological Definitions
Notation 1.
Let us denotes the nucleotide 4-letter alphabet where stands for adenine, stands for cytosine, stands for guanine and stands for thymine. The trinucleotide set over is denoted by . The set of non-empty words (words, respectively) over is denoted by ( , respectively).
Definition 1.
According to the complementary property of the DNA double helix, the nucleotide complementarity mapis defined by,,,. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide antiparallel complementarity mapis defined byfor all. The trinucleotide parallel complementarity mapis defined byfor all.
Example 1.
and.
2.2. Recall of Circular Code Definitions
Definition 2.
A setis a code if, for each,, the conditionimpliesandfor.
Definition 3.
Any non-empty subset of the code is a code and called trinucleotide code.
Definition 4.
A trinucleotide codeis circular if, for each,,,, the conditionsandimply,(empty word) andfor.
We briefly recall the proof used here to determine whether a code is circular or not, with the most recent and powerful approach which relates an oriented (directed) graph to a trinucleotide code.
Definition 5.
[29]. Let be a trinucleotide code. The directed graph associated with has a finite set of vertices and a finite set of oriented edges (ordered pairs where ) defined as follows:
The theorem below gives a relation between a trinucleotide code which is circular and its associated graph.
Theorem 1.
[29]. Let be a trinucleotide code. The following statements are equivalent:
- (i)
The codeis circular.
- (ii)
The graphis acyclic.
Definition 6.
Circular code motifs (first introduced by Michel [21,22]), also called here framing motifs , are motifs from the circular codes. They have the capacity to retrieve, maintain and synchronize the reading frame in genes.
Example 2.
Let a framing motif ...AGGTAATTACCAG... be constructed with the circular code (1) identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses [8,9,10].
(i) Such a framing motif can be obtained as follows. A sequence of trinucleotides of is generated and a substring is extracted at any position in this sequence , i.e., the series of nucleotides on the right and the left of the substring are not considered. Let this substring be . (ii) This framing motif allows the reading frame to be retrieved (Figure 1). We try the three possible decompositions , (shifted by one letter to the right) and (shifted by two letters to the right) of . With , AG is not a prefix of any trinucleotide of , thus the frame associated with is impossible. With , AG is a suffix of CAG and GAG belonging to , then GTA, ATT and ACC belong to , followed by A which is a prefix of five trinucleotides of . Thus at this position, the frame associated with is still possible and nucleotides are read. The next letter G leads to AG which is not a prefix of any trinucleotide of . Thus, a window of nucleotides demonstrates that the frame associated with is impossible. With , A is a suffix of GAA and GTA belonging to , then GGT, AAT, TAC, CAG, etc., belong to . Thus, the reading frame of is associated with , i.e., the first letter A of is the 3rd letter of a trinucleotide of : the reading frame of the sequence is retrieved: ...A,GGT,AAT,TAC,CAG,… (the comma showing the reading frame). (iii) We can prove mathematically that a windows of 13 nucleotides always retrieves the reading frame with the circular code . Four framing motifs need a window of 13 nucleotides with the circular code as they are the four longest ambiguous words of length nucleotides: AGGTAATTACCA, AGGTAATTACCT (with , the first two letters AG are suffix of CAG and GAG belonging to , and the last letter T is prefix of TAC and TTC belonging to ), TGGTAATTACCA (with , the first two letters TG are suffix of CTG belonging to , and the last letter A is prefix of five trinucleotides of ) and TGGTAATTACCT (with , the first two letters TG are suffix of CTG belonging to , and the last letter T is prefix of TAC and TTC belonging to ). These four framing motifs contain the two longest ambiguous words of length nucleotides starting with a trinucleotide of , i.e., when the suffixes of are not considered: GGTAATTACCA and GGTAATTACCT (see last row in Table 1 in [21]). (iv) It is very important to stress that for all the other framing motifs of the circular code , i.e., different from , , and , the window for retrieving the reading frame is less than 13 nucleotides (see the growth function of the window as a function of the number of nucleotides in Figure 4 in [21]). It is also very important to recall that any motif of the circular code is framing, i.e., it has the property of reading frame retrieval.
Figure 1.
Retrieval of the reading frame of the word ...AGGTAATTACCAG... constructed with the circular code (1). Among the three possible factorizations , and , only one factorization into trinucleotides of is possible leading to ...A,GGT,AAT,TAC,CAG,… (the comma showing the reading frame). Thus, the first letter A of is the third letter of a trinucleotide of and the reading frame of the word is retrieved.
2.3. Definitions of Single-Frame and Multiple-Frame Motifs
Definition 7.
A-motif, also called-codon, is a series of trinucleotidesinof trinucleotide length,, which defines the reading frame, i.e.,.
Definition 8.
The shifted frameandof a-motif is a series of trinucleotidesinof trinucleotide length,, starting at the 2nd and 3rd nucleotide ofof the-motif, i.e., at() and().
Notation 2.
Let be the set of trinucleotides in reading frame of a -motif. Let be the set of trinucleotides in a shifted frame of a -motif.
A single-frame motif has no trinucleotide in reading frame that occurs in a shifted frame, i.e., the trinucleotide decoding is unambiguous in the two and directions. Formally:
Definition 9.
A single-frame-motif(unambiguous trinucleotide decoding in the twoanddirections) is a-motif such thatfor, i.e.,for, forand for.
Example 3.
Let the dicodon be AAACAA (-motif). The trinucleotides in reading frame areand, leading to the trinucleotide set. The single trinucleotide in the shifted frame 1 is, leading to the trinucleotide set. The single trinucleotide in the shifted frame 2 is, leading to the trinucleotide set. Asand, AAACAA is a single-frame dicodon(Figure 2).
Figure 2.
(associated with Example 3). The dicodon AAACAA is single-frame .
A multiple-frame motif , in contrast to a motif, has at least one trinucleotide in reading frame that occur in a shifted frame . Formally:
Definition 10.
A multiple-frame -motif (ambiguous trinucleotide decoding in at least one direction) is a -motif such that for , i.e., .
The unidirectional multiple-frame motifs belong to a class of motifs where all the trinucleotides in a shifted frame occur only before : direction) or only after (: direction) the trinucleotides in reading frame. Formally:
Definition 11.
A unidirectional multiple-frame-motif(ambiguous trinucleotide decoding in thedirection only) is a-motif (for) such that the conditionimpliesfor, forand for.
Example 4.
Let the dicodon be AACACA. The trinucleotides in reading frame are and , leading to . The single trinucleotide in the shifted frame 1 is , leading to . The single trinucleotide in the shifted frame 2 is , leading to . As , AACACA is a multiple-frame dicodon . Furthermore, as yields to the inequality , as and as , AACACA is a unidirectional multiple-frame dicodon (Figure 3).
Figure 3.
(associated with Example 4). The dicodon AACACA is unidirectional multiple-frame .
Definition 12.
A unidirectional multiple-frame-motif(ambiguous trinucleotide decoding in thedirection only) is a-motif (for) such that the conditionimpliesfor, forand for.
Example 5.
Let the dicodon be AAAAAC. The trinucleotides in reading frame areand, leading to. The trinucleotides in the shifted frames 1 and 2 are, leading to the trinucleotide sets. Asand, AAAAAC is a multiple-frame dicodon. Furthermore, asyields to the two inequalitiesand as,AAAAAC is a unidirectional multiple-frame dicodon(Figure 4).
Figure 4.
(associated with Example 5). The dicodon AAAAAC is unidirectional multiple-frame .
Example 6.
Let the dicodon be ACACAA . The trinucleotides in reading frame are and , leading to . The single trinucleotide in the shifted frame 1 is , leading to . The single trinucleotide in the shifted frame 2 is , leading to . As , ACACAA is a multiple-frame dicodon . Furthermore, as yields to the inequality , as and as , ACACAA is a unidirectional multiple-frame dicodon (Figure 5). The reasoning could be immediate by noting that the dicodon ACACAA is mirror of AACACA (compare with Example 4).
Figure 5.
(associated with Example 6). The dicodon ACACAA is unidirectional multiple-frame .
Definition 13.
A bidirectional multiple-frame-motif(ambiguous trinucleotide decoding in the twoanddirections) is both aand-motif.
Example 7.
Let the trivial dicodon be AAAAAA. The trinucleotides in reading frame are, leading to the trinucleotide set. The trinucleotides in the shifted frames 1 and 2 are, leading to the trinucleotide sets. Asand, AAAAAA is a multiple-frame dicodon. Furthermore, asyields to the two inequalitiesand asyields to the two inequalities, AAAAAA is a bidirectional multiple-frame dicodon(Figure 6).
Figure 6.
(associated with Example 7). The dicodon AAAAAA is bidirectional multiple-frame .
Example 8.
Let the dicodon be ACACAC. The trinucleotides in reading frame areand, leading to. The single trinucleotide in the shifted frame 1 is, leading to. The single trinucleotide in the shifted frame 2 is, leading to. Asand, ACACAC is a multiple-frame dicodon. Furthermore, asyields to the inequalityand asyields to the inequality, ACACAC is a bidirectional multiple-frame dicodon(Figure 7).
Figure 7.
(associated with Example 8). The dicodon ACACAC is bidirectional multiple-frame .
In this paper, by varying , we will investigate two distributions: the single-frame -motifs with an unambiguous trinucleotide decoding in the two and directions (see Definition 9), and the 5′ unambiguous -motifs with an unambiguous trinucleotide decoding in the direction only which are defined formally as follows:
Definition 14.
A 5′ unambiguous -motif (unambiguous trinucleotide decoding in the direction only) is either a -motif or a -motif , i.e., neither a -motif nor a -motif.
Example 9.
The dicodons AAACAA (motif) andAACACA (motif) belong to the class.
2.4. Occurrence Probabilities of Single-Frame -Motifs and 5′ Unambiguous -Motifs
Definition 15.
Let and be the numbers of -motifs ( ) single-frame and multiple-frame , respectively. Let , and be the numbers of multiple-frame -motifs ( ) which are unidirectional , unidirectional and bidirectional , respectively.
For , we have the obvious relations:
For , the occurrence probability of single-frame -motifs will be computed according to
| (3) |
Similarly, for , the occurrence probability of 5′ unambiguous -motifs will be computed as follows
| (4) |
Remark 1.
Obviously,whatever. However, it will be interesting to compare these two probability distributions by varying.
2.5. Single-Frame -Motifs
It is a trivial case. Each of the 64 codons (1-motifs, ) are obviously single-frame motifs , by definition (non-existence of a shifted frame). Thus, the probabilities of and -motifs are equal to .
2.6. Single-Frame -Motifs
There are dicodons (2-motifs, ). The complete study of dicodons which are single-frame and multiple-frame can be done by hand without difficulty. For the convenience of the reader, we give the complete list of dicodons: (Definition 13, Table 1), (Definition 11, Table 2) and (Definition 12, Table 3).
Table 1.
The 16 bidirectional multiple-frame dicodons (Definition 13).
| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAA | AAA | AAA | CACACA | ACA | CAC | GAGAGA | AGA | GAG | TATATA | ATA | TAT |
| ACACAC | CAC | ACA | CCCCCC | CCC | CCC | GCGCGC | CGC | GCG | TCTCTC | CTC | TCT |
| AGAGAG | GAG | AGA | CGCGCG | GCG | CGC | GGGGGG | GGG | GGG | TGTGTG | GTG | TGT |
| ATATAT | TAT | ATA | CTCTCT | TCT | CTC | GTGTGT | TGT | GTG | TTTTTT | TTT | TTT |
Table 2.
The 96 unidirectional multiple-frame dicodons (Definition 11), being any nucleotide.
| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CAAAAA | AAA | AAA | CCACAC | CAC | CGAGAG | GAG | CTATAT | TAT | |||
| GAAAAA | AAA | AAA | GCACAC | CAC | GGAGAG | GAG | GTATAT | TAT | |||
| TAAAAA | AAA | AAA | TCACAC | CAC | TGAGAG | GAG | TTATAT | TAT | |||
| NCAAAA | AAA | ACCCCC | CCC | CCC | AGCGCG | GCG | ATCTCT | TCT | |||
| NGAAAA | AAA | GCCCCC | CCC | CCC | GGCGCG | GCG | GTCTCT | TCT | |||
| NTAAAA | AAA | TCCCCC | CCC | CCC | TGCGCG | GCG | TTCTCT | TCT | |||
| AACACA | ACA | NACCCC | CCC | AGGGGG | GGG | GGG | ATGTGT | TGT | |||
| GACACA | ACA | NGCCCC | CCC | CGGGGG | GGG | GGG | CTGTGT | TGT | |||
| TACACA | ACA | NTCCCC | CCC | TGGGGG | GGG | GGG | TTGTGT | TGT | |||
| AAGAGA | AGA | ACGCGC | CGC | NAGGGG | GGG | ATTTTT | TTT | TTT | |||
| CAGAGA | AGA | CCGCGC | CGC | NCGGGG | GGG | CTTTTT | TTT | TTT | |||
| TAGAGA | AGA | TCGCGC | CGC | NTGGGG | GGG | GTTTTT | TTT | TTT | |||
| AATATA | ATA | ACTCTC | CTC | AGTGTG | GTG | NATTTT | TTT | ||||
| CATATA | ATA | CCTCTC | CTC | CGTGTG | GTG | NCTTTT | TTT | ||||
| GATATA | ATA | GCTCTC | CTC | GGTGTG | GTG | NGTTTT | TTT |
Table 3.
The 96 unidirectional multiple-frame dicodons (Definition 12), being any nucleotide.
| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAC | AAA | AAA | CACACC | CAC | GAGAGC | GAG | TATATC | TAT | |||
| AAAAAG | AAA | AAA | CACACG | CAC | GAGAGG | GAG | TATATG | TAT | |||
| AAAAAT | AAA | AAA | CACACT | CAC | GAGAGT | GAG | TATATT | TAT | |||
| AAAACN | AAA | CCCCCA | CCC | CCC | GCGCGA | GCG | TCTCTA | TCT | |||
| AAAAGN | AAA | CCCCCG | CCC | CCC | GCGCGG | GCG | TCTCTG | TCT | |||
| AAAATN | AAA | CCCCCT | CCC | CCC | GCGCGT | GCG | TCTCTT | TCT | |||
| ACACAA | ACA | CCCCAN | CCC | GGGGGA | GGG | GGG | TGTGTA | TGT | |||
| ACACAG | ACA | CCCCGN | CCC | GGGGGC | GGG | GGG | TGTGTC | TGT | |||
| ACACAT | ACA | CCCCTN | CCC | GGGGGT | GGG | GGG | TGTGTT | TGT | |||
| AGAGAA | AGA | CGCGCA | CGC | GGGGAN | GGG | TTTTTA | TTT | TTT | |||
| AGAGAC | AGA | CGCGCC | CGC | GGGGCN | GGG | TTTTTC | TTT | TTT | |||
| AGAGAT | AGA | CGCGCT | CGC | GGGGTN | GGG | TTTTTG | TTT | TTT | |||
| ATATAA | ATA | CTCTCA | CTC | GTGTGA | GTG | TTTTAN | TTT | ||||
| ATATAC | ATA | CTCTCC | CTC | GTGTGC | GTG | TTTTCN | TTT | ||||
| ATATAG | ATA | CTCTCG | CTC | GTGTGG | GTG | TTTTGN | TTT |
The probability of -motifs is equal to . The probability of -motifs is equal to .
Remark 2.
For, theand-motifs can have two different shifted trinucleotides in the two frames 1 and 2, in contrast to the-motifs (see Table 2 and Table 3). For example, with the tricodon AACAAAACC, the trinucleotides in reading frame are,andleading to. The trinucleotides in the shifted frame 1 areand, leading to. The trinucleotides in the shifted frame 2 areand, leading to. Asand, AACAAAACC is a multiple-frame tricodon. Furthermore, asyields to the inequality, asyields to the inequalityand as, AACAAAACC is a unidirectional multiple-frame tricodonwith two different trinucleotides in the two frames 1 and 2, i.e., AAA in frame 1 and AAC in frame 2.
2.7. Single-Frame -Motifs
The determination of probability of single-frame -motifs for (tricodons, tetracodons, etc.) cannot be done by hand. For (tricodons up to hexacodons), exact values of probability can be obtained by computer calculus (see Table 4). For , the computation of motifs among the hexacodons with a parallel program with 8 threads takes about 7 days on a standard PC. For (heptacodons, octocodons, etc.), the probability is obtained by computer simulation. Simulated values of are obtained by generating 1,000,000 random -motifs for each . In order to evaluate this approach by computer simulation, simulated values of for are also given in Table 4. Exact and simulated values of are identical at , demonstrating the reliability of the simulation approach.
Table 4.
Probability (%) of single-frame -motifs for . Exact and simulated values of are identical at .
| -Motifs | Exact Values | Simulated Values | |
|---|---|---|---|
| 1 | 64 | 100 | |
| 2 | 4096 | 94.92 | 94.93 |
| 3 | 262,144 | 85.22 | 85.20 |
| 4 | 16,777,216 | 72.35 | 72.37 |
| 5 | 1,073,741,824 | 58.07 | 58.08 |
| 6 | 68,719,476,736 | 44.07 | 44.08 |
The probability of 5′ unambiguous -motifs for is computed similarly.
3. Results
3.1. Single-Frame Motifs
I first investigated the probability (Equation (3)) of single-frame -motifs (Definition 9). The probability is equal to 1 (1-motifs, Section 2.5). The probability is equal to 94.9% (2-motifs, Section 2.6). The probability for is given in Table 4. The probability for is obtained by computer simulation (Section 2.7).
While the proportion of multiple-frame -motifs (Definition 10) is minimal ( for dicodons, Section 2.6), Figure 8 shows that their propagation will drastically reduce the proportion of -motifs when the trinucleotide length increases. There are almost no more motifs with a length of 14 trinucleotides () and the number of motifs becomes already higher than the number of motifs with a length of six trinucleotides (Figure 8).
Figure 8.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve) and increasing probability of multiple-frame -motifs (red curve) by varying the length between 1 and 20 trinucleotides.
Thus, only short genes, i.e., with up to five trinucleotides, have a higher proportion of single-frame motifs compared to the multiple-frame motifs. Thus, primitive translation, without the extant complex ribosome, could only generate short peptides without frameshift errors.
3.2. 5′ Unambiguous Motifs
I then compared the probability (Equation (3)) of single-frame -motifs (Definition 9) and the probability (Equation (4)) of 5′ unambiguous -motifs (Definition 14). Figure 9 shows the decreasing probability of -motifs when the trinucleotide length increases. As expected (see Remark 1), its decrease is slower than that of -motifs. There are almost no more motifs with a length of 20 trinucleotides (). Thus with the motifs, there is a length increase of trinucleotides in the trinucleotide decoding. The maximum probability difference is 22.0% at length trinucleotides.
Figure 9.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve from Figure 8) and decreasing probability (Equation (4)) of 5′ unambiguous -motifs (cyan curve) by varying the length between 1 and 20 trinucleotides.
The 5′ unambiguous -motifs, a less restrictive class of motifs with an unambiguous trinucleotide decoding in the direction only, can generate a slightly longer peptides without frameshift error compared to the single-frame motifs.
I now evaluate the single-frame motifs and the 5′ unambiguous motifs with constraints.
3.3. Single-Frame and 5′ Unambiguous Motifs with Initiation and Stop Codons
The single-frame -motifs and the 5′ unambiguous motifs are investigated with an initiation codon ATG and a stop codon . The case does not exist. For , there are only three dicodons: ATGTAA, ATGTAG and ATGTGA which are all obviously . Thus, the probabilities of and -motifs are obviously . Figure 10 shows that the proportions of and motifs with initiation and stop codons are lower than their respective non-constrained motifs.
Figure 10.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve from Figure 8) and decreasing probability (Equation (4)) of 5′ unambiguous -motifs (cyan curve from Figure 9) by varying the length between 1 and 10 trinucleotides. With initiation and stop codons, decreasing probability of -motifs (magenta curve) and decreasing probability of -motifs (orange curve) by varying the length between 2 and 10 trinucleotides.
Genes with initiation and stop codons do not increase translation fidelity compared to non-constrained genes (according to this approach).
3.4. Single-Frame and 5′ Unambiguous Motifs without Periodic Codons
The single-frame motifs and the 5′ unambiguous motifs are now studied without periodic codons . As expected, Figure 11 shows that the proportions of and motifs without periodic codons are higher than their respective non-constrained motifs.
Figure 11.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve from Figure 8) and decreasing probability (Equation (4)) of 5′ unambiguous -motifs (cyan curve from Figure 9) by varying the length between 1 and 10 trinucleotides. Without periodic codons , decreasing probability of -motifs (magenta curve) and decreasing probability of -motifs (orange curve) by varying the length between 1 and 10 trinucleotides.
Genes without periodic codons slightly increase frame translation fidelity compared to non-constrained genes (according to this approach).
3.5. Single-Frame and 5′ Unambiguous Motifs with Antiparallel Complementarity
The single-frame -motifs and the 5′ unambiguous -motifs are now investigated with the following antiparallel complementary sequence: where the trinucleotide antiparallel complementarity map applied to a trinucleotide is recalled in Definition 1. As an example, if then the antiparallel complementary sequence studied is . Note that the trinucleotide length of such motifs is even. Classical antiparallel complementary structures are the DNA double helix and the RNA stem. Interesting results are observed. As expected, the two probability curves of motifs and of motifs with antiparallel complementarity are identical (Figure 12). The proof is based on the following property: if with ( motif) then with ( motif) and . Furthermore, antiparallel complementarity increases the proportion of motifs but decreases the proportion of motifs, compared to their respective non-constrained motifs.
Figure 12.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve from Figure 8) and decreasing probability (Equation (4)) of 5′ unambiguous -motifs (cyan curve from Figure 9) by varying the length between 1 and 14 trinucleotides. With antiparallel complementarity, decreasing probabilities and of -motifs and (two identical curves in magenta) by varying the length between 1 and 7 trinucleotides.
The “antiparallel complementary” genes have a higher proportion of single-frame motifs compared to the non-complementary genes. Thus, primitive translation associated with a DNA property could generate a greater number of peptides without frameshift errors.
3.6. Single-Frame Motifs and 5′ Unambiguous with Parallel Complementarity
The single-frame -motifs and the 5′ unambiguous -motifs are now analysed with the following parallel complementary sequence: where the trinucleotide parallel complementarity map applied to a trinucleotide is recalled in Definition 1. As an example, if then the parallel complementary sequence studied is . Note that the trinucleotide length of such motifs is also even. Interesting results are also observed. The two probability curves of motifs with parallel complementarity and of motifs without constraints are superposable (Figure 13). Parallel complementarity increases the proportions of both motifs and motifs compared to their respective non-constrained motifs.
Figure 13.
Decreasing probability (Equation (3)) of single-frame -motifs (blue curve from Figure 8) and decreasing probability (Equation (4)) of 5′ unambiguous -motifs (cyan curve from Figure 9) by varying the length between 1 and 14 trinucleotides. With parallel complementarity, decreasing probability of -motifs (magenta curve) and decreasing probability of -motifs (orange curve) by varying the length between 1 and 7 trinucleotides.
“Parallel complementary” genes have a slightly higher proportion of single-frame motifs compared to the “antiparallel complementary” genes (compare the magenta curves in Figure 12 and Figure 13). The biological meaning is not yet explained.
3.7. Framing Motifs
There are framing motifs which are single-frame or multiple-frame .
Proposition 1.
A framing motifcan be single-frame.
Proof. Take the following motif . The motif can be generated by the code . By Theorem 1, it is easy to verify that the graph is acyclic, and thus is circular. Furthermore, the set of trinucleotides in reading frame is , the set of trinucleotides in the shifted frame 1 is and the set of trinucleotides in the shifted frame 2 is . We have and . Thus, the motif is both framing and single-frame .
Proposition 2.
A framing motifcan be multiple-frame.
Proof. Take the following motif . The motif can be generated by the code . By Theorem 1, it is easy to verify that the graph is acyclic, and thus is circular. Furthermore, we have the trinucleotide sets , and leading to and . Thus, the motif is both framing and multiple-frame , precisely unidirectional multiple-frame .
There are single-frame motifs or multiple-frame motifs which are not framing .
Proposition 3.
A single-frame motifcan be non-framing.
Proof. Take the following motif . The motif can be generated by the code . We have the trinucleotide sets , and leading to and . However, as contains the periodic trinucleotide AAA, is not circular. Thus, the motif is single-frame but not framing .
Proposition 4.
A multiple-frame motifcan be non-framing.
Proof. Take the following motif . The motif can be generated by the code . We have the trinucleotide sets , and leading to and . However, as contains the two permuted trinucleotides ACT and TAC, is not circular. Thus, the motif is multiple-frame , precisely unidirectional multiple-frame , but not framing .
Genes which are both framing and single-frame retrieve the reading frame and code for a unique peptide as the shifted frames would lead to a different peptide product.
3.8. A New Class of Theoretical Parameters Relating the Circular Codes and Their Circular Code Motifs
The idea is to define a new class of parameters in order to measure the intensity of a motif of a circular code to retrieve the reading frame. Thus, we have to associate information from the circular code theory with information from words (motifs).
In the circular code theory, the most important and the simplest parameter is the length of a longest path (maximal arrow-length of a path) in the associated graph of a circular code (see Definition 5). Note that the longest path has a finite length as the graph is acyclic (Theorem 1). The longest path can classify the circular codes, from the strong comma-free codes with and the comma-free codes with up to the general circular codes with a maximal longest path when (i.e., for the trinucleotide circular codes) [29]. It is also related to the reading frame number of , i.e., the number of nucleotides to retrieve the reading frame. This reading frame number can also be used to classify the circular codes, from the strong comma-free codes with nucleotides and the comma-free codes with nucleotides up to the general circular codes with a maximal number nucleotides when [30]. However, this parameter needs to know the structure of the longest path which is one of the four cases: , , and where the nucleotide and the dinucleotide for any (see Definition 5). In summary, for the circular codes , the longest path belongs to the interval and the reading frame number belongs to the interval nucleotides. The definition of the reading frame number can still be generalized to arbitrary sequences, i.e., not entirely consisting of trinucleotides from [30]. For these two reasons, i.e., the knowledge of the structure of and the generalized definition of , the parameter , mentioned here to take date, will not be considered here.
A motif of a code, circular or not, can be characterized by its length , given here in trinucleotides for convenience, for measuring its expansion; and its cardinality of the set (see Notation 2) of trinucleotides (in reading frame ) of for measuring its variety (complexity). In the case of a motif of a trinucleotide circular code , .
It is important to stress the following condition: with a trinucleotide circular code . The case is associated with a trinucleotide circular code constructed from the motif .
A simple parameter measuring the expansion intensity of reading frame retrieval of a circular code motif can be defined as follows:
| (5) |
where , , is the trinucleotide length of the motif and , , is the length of a longest path in the associated graph of a trinucleotide circular code . Note that and if then .
A second parameter measuring both the expansion and variety intensity of a circular code motif can also be defined as follows:
| (6) |
where is defined in Equation (5) and , , is the cardinality of the set (Notation 2) of trinucleotides (in reading frame ) of . Note that and if then . Thus, for the circular code motifs of a given trinucleotide length , the intensity of reading frame retrieval increases according to their cardinality .
For a sequence containing several circular code motifs , the formulas (5) and (6) can be expressed as follows:
| (7) |
with the hypothesis that is identical for the motifs , a realistic case when the motifs are obtained from a same studied trinucleotide circular code , and thus:
| (8) |
Note also that the formulas and can also be normalized in order to weight the different lengths of sequences .
3.9. Dipeptides
The series of multi-frame motifs starts with the dicodons. We will now focus on the dipeptides which are two consecutive amino acids coded by the dicodons. The 16 bidirectional multiple-frame dicodons (Table 1) code 16 dipeptides according to the universal genetic code (Table 5). They include the four obvious dipeptides GlyGly (GGGGGG), LysLys (AAAAAA), PhePhe (TTTTTT) and ProPro (CCCCCC). 15 amino acids out of 20 are involved in these 16 dipeptides (Table 6): Ala, Arg, Cys, Glu, Gly, His, Ile, Leu, Lys, Phe, Pro, Ser, Thr, Tyr and Val (except Asn, Asp, Gln, Met and Trp), each amino acid occurring once in a position of a dipeptide, except Arg occurring twice in a position of a dipeptide: ArgAla, ArgGlu, AlaArg and GluArg.
Table 5.
The 16 dipeptides coded by the 16 bidirectional multiple-frame dicodons (Definition 13, Table 1).
| AR | AlaArg | GCGCGC | GG | GlyGly | GGGGGG | LS | LeuSer | CTCTCT | SL | SerLeu | TCTCTC |
| CV | CysVal | TGTGTG | HT | HisThr | CACACA | PP | ProPro | CCCCCC | TH | ThrHis | ACACAC |
| ER | GluArg | GAGAGA | IY | IleTyr | ATATAT | RA | ArgAla | CGCGCG | VC | ValCys | GTGTGT |
| FF | PhePhe | TTTTTT | KK | LysLys | AAAAAA | RE | ArgGlu | AGAGAG | YI | TyrIle | TATATA |
Table 6.
Occurrence number of the 15 amino acids in the 1st and 2nd positions of the 16 dipeptides (Table 5).
| A | C | E | F | G | H | I | K | L | P | R | S | T | V | Y | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ala | Cys | Glu | Phe | Gly | His | Ile | Lys | Leu | Pro | Arg | Ser | Thr | Val | Tyr | Sum | |
| 1st site | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
| 2nd site | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
| Sum | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 2 | 2 | 2 | 2 | 32 |
The 96 unidirectional multiple-frame dicodons (Table 2) code 83 dipeptides and four pairs (stop codon, amino acid): TAGArg, TAGGly, TGAGlu and TerLys where Ter can be the two stop codons TAA and TGA (Table 7). All the 20 amino acids are involved in the 83 dipeptides (Table 8). All the 20 amino acids occur in the first position of dipeptides. Five amino acids Asn, Asp, Gln, Met and Trp do not occur in their second position which are the five amino acids not involved in the dipeptides. In the 83 dipeptides, Pro and Gly are involved 20 and 19 times, respectively, while Met and Trp only twice and once, respectively.
Table 7.
The 83 dipeptides and the four pairs (stop codon, amino acid) coded by the 96 unidirectional multiple-frame dicodons (Definition 11, Table 2).
| AF | AlaPhe | GCTTTT | IS | IleSer | ATCTCT | RV | ArgVal | CGTGTG |
| AG | AlaGly | GCGGGG | KG | LysGly | AAGGGG | SA | SerAla | AGCGCG |
| AH | AlaHis | GCACAC | KR | LysArg | AAGAGA | SF | SerPhe | AGTTTT, TCTTTT |
| AK | AlaLys | GCAAAA | LC | LeuCys | CTGTGT, TTGTGT | SG | SerGly | TCGGGG |
| AL | AlaLeu | GCTCTC | LF | LeuPhe | CTTTTT | SH | SerHis | TCACAC |
| AP | AlaPro | GCCCCC | LG | LeuGly | CTGGGG, TTGGGG | SK | SerLys | TCAAAA |
| CA | CysAla | TGCGCG | LK | LeuLys | CTAAAA, TTAAAA | SP | SerPro | AGCCCC, TCCCCC |
| CF | CysPhe | TGTTTT | LP | LeuPro | CTCCCC | SR | SerArg | TCGCGC |
| CP | CysPro | TGCCCC | LY | LeuTyr | CTATAT, TTATAT | SV | SerVal | AGTGTG |
| DF | AspPhe | GATTTT | MC | MetCys | ATGTGT | TerE | TerGlu | TGAGAG |
| DI | AspIle | GATATA | MG | MetGly | ATGGGG | TerG | TerGly | TAGGGG |
| DP | AspPro | GACCCC | NF | AsnPhe | AATTTT | TerK | TerLys | TAAAAA, TGAAAA |
| DT | AspThr | GACACA | NI | AsnIle | AATATA | TerR | TerArg | TAGAGA |
| EG | GluGly | GAGGGG | NP | AsnPro | AACCCC | TF | ThrPhe | ACTTTT |
| EK | GluLys | GAAAAA | NT | AsnThr | AACACA | TG | ThrGly | ACGGGG |
| FP | PhePro | TTCCCC | PF | ProPhe | CCTTTT | TK | ThrLys | ACAAAA |
| FS | PheSer | TTCTCT | PG | ProGly | CCGGGG | TL | ThrLeu | ACTCTC |
| GA | GlyAla | GGCGCG | PH | ProHis | CCACAC | TP | ThrPro | ACCCCC |
| GE | GlyGlu | GGAGAG | PK | ProLys | CCAAAA | TR | ThrArg | ACGCGC |
| GF | GlyPhe | GGTTTT | PL | ProLeu | CCTCTC | VF | ValPhe | GTTTTT |
| GK | GlyLys | GGAAAA | PR | ProArg | CCGCGC | VG | ValGly | GTGGGG |
| GP | GlyPro | GGCCCC | QG | GlnGly | CAGGGG | VK | ValLys | GTAAAA |
| GV | GlyVal | GGTGTG | QK | GlnLys | CAAAAA | VP | ValPro | GTCCCC |
| HF | HisPhe | CATTTT | QR | GlnArg | CAGAGA | VS | ValSer | GTCTCT |
| HI | HisIle | CATATA | RE | ArgGlu | CGAGAG | VY | ValTyr | GTATAT |
| HP | HisPro | CACCCC | RF | ArgPhe | CGTTTT | WG | TrpGly | TGGGGG |
| IF | IlePhe | ATTTTT | RG | ArgGly | AGGGGG, CGGGGG | YF | TyrPhe | TATTTT |
| IK | IleLys | ATAAAA | RK | ArgLys | AGAAAA, CGAAAA | YP | TyrPro | TACCCC |
| IP | IlePro | ATCCCC | RP | ArgPro | CGCCCC | YT | TyrThr | TACACA |
Table 8.
Occurrence number of the 20 amino acids in the first and second positions of the 83 dipeptides and the four pairs (stop codon, amino acid) (Table 7).
| A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ala | Cys | Asp | Glu | Phe | Gly | His | Ile | Lys | Leu | Met | Asn | Pro | Gln | Arg | Ser | Thr | Val | Trp | Tyr | Ter | Sum | |
| 1st site | 6 | 3 | 4 | 2 | 2 | 6 | 3 | 4 | 2 | 6 | 2 | 4 | 6 | 3 | 6 | 8 | 6 | 6 | 1 | 3 | 4 | 87 |
| 2nd site | 3 | 2 | 0 | 3 | 14 | 13 | 3 | 3 | 12 | 3 | 0 | 0 | 14 | 0 | 6 | 3 | 3 | 3 | 0 | 2 | 0 | 87 |
| Sum | 9 | 5 | 4 | 5 | 16 | 19 | 6 | 7 | 14 | 9 | 2 | 4 | 20 | 3 | 12 | 11 | 9 | 9 | 1 | 5 | 4 | 174 |
The 96 unidirectional multiple-frame dicodons (Table 3) code 40 dipeptides and three pairs (amino acid, stop codon): IleTer where Ter can be the two stop codons TAA and TAG, PheTer where Ter can be the three stop codons TAA, TAG and TGA, and ValTGA (Table 9). All the 20 amino acids are involved in the 40 dipeptides (Table 10). Five amino acids are Asn, Asp, Gln, Met and Trp do not occur in the first position of dipeptides which are the five amino acids not involved in the dipeptides. All the 20 amino acids occur in their second position. In the 40 dipeptides, two amino acids Lys and Phe are involved eight times while Asn only once.
Table 9.
The 40 dipeptides and the three pairs (amino acid, stop codon) coded by the 96 unidirectional multiple-frame dicodons (Definition 12, Table 3).
| AR | AlaArg | GCGCGA, GCGCGG, GCGCGT | KN | LysAsn | AAAAAC, AAAAAT |
| CV | CysVal | TGTGTA, TGTGTC, TGTGTT | KR | LysArg | AAAAGA, AAAAGG |
| ER | GluArg | GAGAGG | KS | LysSer | AAAAGC, AAAAGT |
| ES | GluSer | GAGAGC, GAGAGT | KT | LysThr | AAAACA, AAAACC, AAAACG, AAAACT |
| FC | PheCys | TTTTGC, TTTTGT | LS | LeuSer | CTCTCA, CTCTCC, CTCTCG |
| FF | PhePhe | TTTTTC | PH | ProHis | CCCCAC, CCCCAT |
| FL | PheLeu | TTTTTA, TTTTTG | PL | ProLeu | CCCCTA, CCCCTC, CCCCTG, CCCCTT |
| FS | PheSer | TTTTCA, TTTTCC, TTTTCG, TTTTCT | PP | ProPro | CCCCCA, CCCCCG, CCCCCT |
| FTer | PheTer | TTTTAA, TTTTAG, TTTTGA | PQ | ProGln | CCCCAA, CCCCAG |
| FW | PheTrp | TTTTGG | PR | ProArg | CCCCGA, CCCCGC, CCCCGG, CCCCGT |
| FY | PheTyr | TTTTAC, TTTTAT | RA | ArgAla | CGCGCA, CGCGCC, CGCGCT |
| GA | GlyAla | GGGGCA, GGGGCC, GGGGCG, GGGGCT | RD | ArgAsp | AGAGAC, AGAGAT |
| GD | GlyAsp | GGGGAC, GGGGAT | RE | ArgGlu | AGAGAA |
| GE | GlyGlu | GGGGAA, GGGGAG | SL | SerLeu | TCTCTA, TCTCTG, TCTCTT |
| GG | GlyGly | GGGGGA, GGGGGC, GGGGGT | TH | ThrHis | ACACAT |
| GV | GlyVal | GGGGTA, GGGGTC, GGGGTG, GGGGTT | TQ | ThrGln | ACACAA, ACACAG |
| HT | HisThr | CACACC, CACACG, CACACT | VC | ValCys | GTGTGC |
| ITer | IleTer | ATATAA, ATATAG | VTer | ValTer | GTGTGA |
| IY | IleTyr | ATATAC | VW | ValTrp | GTGTGG |
| KI | LysIle | AAAATA, AAAATC, AAAATT | YI | TyrIle | TATATC, TATATT |
| KK | LysLys | AAAAAG | YM | TyrMet | TATATG |
| KM | LysMet | AAAATG |
Table 10.
Occurrence number of the 20 amino acids in the first and second positions of the 40 dipeptides and the three pairs (amino acid, stop codon) (Table 9).
| A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ala | Cys | Asp | Glu | Phe | Gly | His | Ile | Lys | Leu | Met | Asn | Pro | Gln | Arg | Ser | Thr | Val | Trp | Tyr | Ter | Sum | |
| 1st site | 1 | 1 | 0 | 2 | 7 | 5 | 1 | 2 | 7 | 1 | 0 | 0 | 5 | 0 | 3 | 1 | 2 | 3 | 0 | 2 | 0 | 43 |
| 2nd site | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 3 | 2 | 1 | 1 | 2 | 4 | 4 | 2 | 2 | 2 | 2 | 3 | 43 |
| Sum | 3 | 3 | 2 | 4 | 8 | 6 | 3 | 4 | 8 | 4 | 2 | 1 | 6 | 2 | 7 | 5 | 4 | 5 | 2 | 4 | 3 | 86 |
The dipeptides among 400, i.e., 28.5%, are coded by dicodons (, , ) among 4096, i.e., 5.1% (Table 11). As a consequence, 286 dipeptides, i.e., 71.5%, are coded by 3888 single-frame dicodons , i.e., 94.9%. There is also a strong asymmetry between the number of dipeptides coded by one direction or other direction: 83 dipeptides (Table 7) versus 40 dipeptides (Table 9). This asymmetry may be related to the gene translation in the direction, the dicodons having an unambiguous trinucleotide decoding in the direction.
Table 11.
Multi-frame dipeptide boolean matrix. The dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the multiple-frame dicodons (Definition 13, Table 1), (Definition 11, Table 2) and (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The value of 1 means a dipeptide coded by at least a multiple-frame dicodon ( true). The value of 0 stands for a dipeptide coded by a single-frame dicodon ( false). For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9) and the value of CysAla is 1 (7th row in Table 7).
| Site | 2nd | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st | Ala | Cys | Asp | Glu | Phe | Gly | His | Ile | Lys | Leu | Met | Asn | Pro | Gln | Arg | Ser | Thr | Val | Trp | Tyr | Ter | Sum | |
| A | Ala | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
| C | Cys | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 |
| D | Asp | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| E | Glu | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
| F | Phe | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 8 |
| G | Gly | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 |
| H | His | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| I | Ile | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 6 |
| K | Lys | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 8 |
| L | Leu | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 7 |
| M | Met | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| N | Asn | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| P | Pro | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| Q | Gln | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| R | Arg | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 |
| S | Ser | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 9 |
| T | Thr | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| V | Val | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 9 |
| W | Trp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Y | Tyr | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| Ter | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | |
| Sum | 4 | 4 | 2 | 3 | 15 | 14 | 4 | 5 | 13 | 5 | 2 | 1 | 15 | 2 | 8 | 6 | 5 | 4 | 2 | 4 | 3 | 121 |
Five dipeptides GlyAla, GlyVal, PheSer, ProLeu and ProArg are the most strongly coded, each by five dicodons (Table 12), e.g., GlyAla is coded by one dicodon GGCGCG (Table 7), and four dicodons GGGGCA, GGGGCC, GGGGCG and GGGGCT (Table 9). The and dipeptides could have particular spatial structures and biological functions in extant and primitive proteins which remain to be identified.
Table 12.
Multi-frame dipeptide occurrence matrix. The dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the multiple-frame dicodons (Definition 13, Table 1), (Definition 11, Table 2) and (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The values between 1 and 5 give the number of times a dipeptide is coded by multiple-frame dicodons . The value of 0 stands for a dipeptide coded by a single-frame dicodon . For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9), the value of CysAla is 1 (7th row in Table 7) and the value of AlaArg if 4 (one occurrence: 1st row in Table 5 and three occurrences: 1st row in Table 9).
| Site | 2nd | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st | Ala | Cys | Asp | Glu | Phe | Gly | His | Ile | Lys | Leu | Met | Asn | Pro | Gln | Arg | Ser | Thr | Val | Trp | Tyr | Ter | Sum | |
| A | Ala | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| C | Cys | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 7 |
| D | Asp | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| E | Glu | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 6 |
| F | Phe | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 1 | 2 | 3 | 18 |
| G | Gly | 5 | 0 | 2 | 3 | 1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 22 |
| H | His | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 7 |
| I | Ile | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 8 |
| K | Lys | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 2 | 0 | 1 | 2 | 0 | 0 | 3 | 2 | 4 | 0 | 0 | 0 | 0 | 18 |
| L | Leu | 0 | 2 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 14 |
| M | Met | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| N | Asn | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| P | Pro | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 5 | 0 | 0 | 4 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 22 |
| Q | Gln | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| R | Arg | 4 | 0 | 2 | 3 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 16 |
| S | Ser | 1 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 4 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 14 |
| T | Thr | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| V | Val | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 10 |
| W | Trp | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Y | Tyr | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 |
| Ter | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | |
| Sum | 11 | 7 | 4 | 7 | 17 | 19 | 7 | 9 | 17 | 13 | 2 | 2 | 19 | 4 | 18 | 15 | 11 | 11 | 2 | 7 | 6 | 208 |
4. Discussion
For the first time to our knowledge, new definitions of motifs in genes are presented. The single-frame motifs (unambiguous trinucleotide decoding in the two and directions) and the multiple-frame motifs (ambiguous trinucleotide decoding in at least one direction) form a partition of genes. Several classes of motifs are defined and analysed: (i) unidirectional multiple-frame motifs (ambiguous trinucleotide decoding in the direction only); (ii) unidirectional multiple-frame motifs (ambiguous trinucleotide decoding in the direction only); and (iii) bidirectional multiple-frame motifs (ambiguous trinucleotide decoding in the two and directions). The distribution of the single-frame motifs and the 5′ unambiguous motifs (unambiguous trinucleotide decoding in the direction only) are studied without and with constraints.
The proportion of motifs drastically decreases with their trinucleotide length. The motifs become absent () when their length trinucleotides and the number of motifs becomes already higher than the number of motifs when their length trinucleotides. As expected, the proportion of motifs decreases more slowly than that of motifs. The motifs become absent () when their length trinucleotides. Thus with the motifs, there is a length increase of trinucleotides in the trinucleotide decoding.
The proportions of and motifs with initiation and stop codons are lower than their respective non-constrained motifs. In contrasts, their proportions in motifs without periodic codons are higher than their respective non-constrained motifs. The proportions of and motifs with antiparallel complementarity are identical. Antiparallel complementarity increases the proportion of motifs but decreases the proportion of motifs, compared to their respective non-constrained motifs. The proportions of motifs with parallel complementarity and motifs without constraints follow a similar distribution. Finally, parallel complementarity increases the proportions of both motifs and motifs compared to their respective non-constrained motifs. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with unambiguous trinucleotide decoding, strictly in the two and directions ( motifs) or conserved in the direction but relaxed-lost in the direction ( motifs).
The single-frame motifs with a property of trinucleotide decoding and the framing motifs with a property of reading frame decoding could have operated in the primitive soup for constructing the modern genetic code and the extant genes [31]. They could have been involved in the stage without anticodon-amino acid interactions to form peptides from prebiotically amino acids [32]. They could also have been related in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids at the primitive step of life (review in [33]). According to a great number of biological experiments, the ISN structure contains nucleotides in fixed and variable positions, as well as an important trinucleotide for interacting with the amino acid (see e.g., the recent review in [34]). However, the general structure of the aptamers binding amino acids, in particular its nucleotide length, its amino acid binding loop and its nucleotide position, is still an open problem. Similar arguments could hold for the ribonucleopeptides which could be implicated in a primitive T box riboswitch functioning as an aminoacyl-tRNA synthetase and a peptidyl-transferase ribozyme [35]. The single-frame motifs and the framing motifs with their properties to decode the trinucleotides and the reading frame could have been necessary for the evolutionary construction of the modern genetic code.
Acknowledgment
I thank Denise Marie Besch for her support.
Abbreviations
| single-frame motif (unambiguous trinucleotide decoding in the two and directions) | |
| multiple-frame motif | |
| unidirectional multiple-frame motif | |
| unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the direction only) | |
| unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the direction only) | |
| bidirectional multiple-frame motif (ambiguous trinucleotide decoding in the two and directions) | |
| 5′ unambiguous motif (unambiguous trinucleotide decoding in the direction only) | |
| framing motif (also called circular code motif) |
Funding
The author received no funding for this study.
Conflicts of Interest
The author declares no competing interests.
References
- 1.Gamow G. Possible relation between deoxyribonucleic acid and protein structures. Nature. 1954;173:318. doi: 10.1038/173318a0. [DOI] [Google Scholar]
- 2.Crick F.H.C., Griffith J.S., Orgel L.E. Codes without commas. Proc. Natl. Acad. Sci. USA. 1957;43:416–421. doi: 10.1073/pnas.43.5.416. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Nirenberg M.W., Matthaei J.H. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. USA. 1961;47:1588–1602. doi: 10.1073/pnas.47.10.1588. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Crick F.H.C., Leslie Barnett Brenner S., Watts-Tobin R.J. General nature of the genetic code for proteins. Nature. 1961;192:1227–1232. doi: 10.1038/1921227a0. [DOI] [PubMed] [Google Scholar]
- 5.Khorana H.G., Büchi H., Ghosh H., Gupta N., Jacob T.M., Kössel H., Morgan R., Narang S.A., Ohtsuka E., Wells R.D. Polynucleotide synthesis and the genetic code. Cold Spring Harb. Symp. Quant. Biol. 1966;31:39–49. doi: 10.1101/SQB.1966.031.01.010. [DOI] [PubMed] [Google Scholar]
- 6.Nirenberg M., Caskey T., Marshall R., Brimacombe R., Kellogg D., Doctor B., Hatfield D., Levin J., Rottman F., Pestka S., et al. The RNA code and protein synthesis. Cold Spring Harb. Symp. Quant. Biol. 1966;31:11–24. doi: 10.1101/SQB.1966.031.01.008. [DOI] [PubMed] [Google Scholar]
- 7.Salas M., Smith M.A., Stanley W.M., Wahba A.J., Ochoa S. Direction of reading of the genetic message. J. Biol. Chem. 1965;240:3988–3995. [PubMed] [Google Scholar]
- 8.Arquès D.G., Michel C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996;182:45–58. doi: 10.1006/jtbi.1996.0142. [DOI] [PubMed] [Google Scholar]
- 9.Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015;380:156–177. doi: 10.1016/j.jtbi.2015.04.009. [DOI] [PubMed] [Google Scholar]
- 10.Michel C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life. 2017;7:20. doi: 10.3390/life7020020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Michel C.J. A genetic scale of reading frame coding. J. Theor. Biol. 2014;355:83–94. doi: 10.1016/j.jtbi.2014.03.029. [DOI] [PubMed] [Google Scholar]
- 12.Michel C.J. An extended genetic scale of reading frame coding. J. Theor. Biol. 2015;365:164–174. doi: 10.1016/j.jtbi.2014.09.040. [DOI] [PubMed] [Google Scholar]
- 13.Dinman J.D. Programmed ribosomal frameshifting goes beyond viruses. Microbe. 2006;1:521–527. doi: 10.1128/microbe.1.521.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Farabaugh P.J. Programmed translational frameshifting. Annu. Rev. Genet. 1996;30:507–528. doi: 10.1146/annurev.genet.30.1.507. [DOI] [PubMed] [Google Scholar]
- 15.Caliskan N., Peske F., Rodnina M.V. Changed in translation: MRNA recoding by -1 programmed ribosomal frameshifting. Trends Biochem. Sci. 2015;40:265–274. doi: 10.1016/j.tibs.2015.03.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Napthine S., Ling R., Finch L.K., Jones J.D., Bell S., Brierley I., Firth A.E. Protein-directed ribosomal frameshifting temporally regulates gene expression. Nat. Commun. 2017;8:15582. doi: 10.1038/ncomms15582. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wang R., Xiong J., Wang W., Miao W., Liang A. High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Sci. Rep. 2016;6:21139. doi: 10.1038/srep21139. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.El Houmami N., Seligmann H. Evolution of nucleotide punctuation marks: From structural to linear signals. Front. Genet. 2017;8:36. doi: 10.3389/fgene.2017.00036. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Seligmann H. Codon expansion and systematic transcriptional deletions produce tetra-, pentacoded mitochondrial peptides. J. Theor. Biol. 2015;387:154–165. doi: 10.1016/j.jtbi.2015.09.030. [DOI] [PubMed] [Google Scholar]
- 20.Baranov P.V., Atkins J.F., Yordanova M.M. Augmented genetic decoding: Global, local and temporal alterations of decoding processes and codon meaning. Nat. Rev. Genet. 2015;16:517–529. doi: 10.1038/nrg3963. [DOI] [PubMed] [Google Scholar]
- 21.Michel C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012;37:24–37. doi: 10.1016/j.compbiolchem.2011.10.002. [DOI] [PubMed] [Google Scholar]
- 22.Michel C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013;45:17–29. doi: 10.1016/j.compbiolchem.2013.02.004. [DOI] [PubMed] [Google Scholar]
- 23.Michel C.J. A 2006 review of circular codes in genes. Comput. Math. Appl. 2008;55:984–988. doi: 10.1016/j.camwa.2006.12.090. [DOI] [Google Scholar]
- 24.Fimmel E., Strüngmann L. Mathematical fundamentals for the noise immunity of the genetic code. Biosystems. 2018;164:186–198. doi: 10.1016/j.biosystems.2017.09.007. [DOI] [PubMed] [Google Scholar]
- 25.Luisi P.L. Prebiotic metabolic networks? Mol. Syst. Biol. 2014;10:729. doi: 10.1002/msb.20145351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Ying J., Lin R., Xu P., Wu Y., Liu Y., Zhao Y. Prebiotic formation of cyclic dipeptides under potentially early Earth conditions. Sci. Rep. 2018;8:936. doi: 10.1038/s41598-018-19335-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Shu W., Yu Y., Chen S., Yan X., Liu Y., Zhao Y. Selective formation of Ser-His dipeptide via phosphorus activation. Orig. Life Evol. Biospheres. 2018;48:213–222. doi: 10.1007/s11084-018-9556-7. [DOI] [PubMed] [Google Scholar]
- 28.Wieczorek R., Adamala K., Gasperi T., Polticelli F., Stano P. Small and random peptides: An unexplored reservoir of potentially functional primitive organocatalysts. The case of Seryl-Histidine. Life. 2017;7:19. doi: 10.3390/life7020019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Fimmel E., Michel C.J., Strüngmann L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016;374:20150058. doi: 10.1098/rsta.2015.0058. [DOI] [PubMed] [Google Scholar]
- 30.Fimmel E., Michel C.J., Starman M., Strüngmann L. Self-complementary circular codes in coding theory. Theory Biosci. 2018;137:51–65. doi: 10.1007/s12064-018-0259-4. [DOI] [PubMed] [Google Scholar]
- 31.Kun Á., Radványi Á. The evolution of the genetic code: Impasses and challenges. Biosystems. 2018;164:217–225. doi: 10.1016/j.biosystems.2017.10.006. [DOI] [PubMed] [Google Scholar]
- 32.Johnson D.B.F., Wang L. Imprints of the genetic code in the ribosome. Proc. Natl. Acad. Sci. USA. 2010;107:8298–8303. doi: 10.1073/pnas.1000704107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Yarus M. The genetic code and RNA-amino acid affinities. Life. 2017;7:13. doi: 10.3390/life7020013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Zagrovic B., Bartonek L., Polyansky A.A. RNA-protein interactions in an unstructured context. FEBS Lett. 2018;592:2901–2916. doi: 10.1002/1873-3468.13116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Saad N.Y. A ribonucleopeptide world at the origin of life. J. Syst. Evol. 2018;56:1–13. doi: 10.1111/jse.12287. [DOI] [Google Scholar]













