Skip to main content
BioMed Research International logoLink to BioMed Research International
. 2017 Jan 15;2017:7949287. doi: 10.1155/2017/7949287

Database of Periodic DNA Regions in Major Genomes

Felix E Frenkel 1, Maria A Korotkova 2, Eugene V Korotkov 1,2,*
PMCID: PMC5274682  PMID: 28182099

Abstract

Summary. We analyzed several prokaryotic and eukaryotic genomes looking for the periodicity sequences availability and employing a new mathematical method. The method envisaged using the random position weight matrices and dynamic programming. Insertions and deletions were allowed inside periodicities, thus adding a novelty to the results we obtained. A periodicity length, one of the key periodicity features, varied from 2 to 50 nt. Totally over 60,000 periodicity sequences were found in 15 genomes including some chromosomes of the H. sapiens (partial), C. elegans, D. melanogaster, and A. thaliana genomes.

1. Introduction

Periodicity is one of the sequences' structural regularities and is widely represented in the amino and DNA sequences [1]. A periodicity is considered as latent, if the similarity between any two periods is not statistically significant or if it belongs to the twilight zone [2]. Perfect periodicity can transform into latent periodicity, if it accumulates over 1.0 mutation per nucleotide in the studied DNA sequence [3] and a certain number of insertions and deletions of bases (this number > 1 per each period). Availability of a large number of base substitutions, insertions, and deletions in each period is regarded to represent the latent periodicity property. The distinctive property of latent periodicity is that it cannot be detected by pairwise comparison of the nucleotide sequences. However, latent periodicity can be found, if a mathematical method is applied to directly detect the alignment of nucleotide sequences without constructing pairwise alignments. The periods of a sequence with latent periodicity are sequences for multiple alignment that is statistically significant. The aim of this study was to apply the previously developed mathematical method [4, 5] and to detect the DNA sequences periodicity, as well as the latent periodicity, and to create the new data bank.

At present, there is a significant gap in the mathematical approaches developed in the periodicities search for symbolic and numeric sequences (sequence-based methods). Spectral approaches enable detection of the adequate “fuzzy” periodicity in nucleotide sequences without the nucleotides insertion(s) or deletion(s). In this case, the number of base substitutions can be over 1.0 per nucleotide [3]. Fourier transform, wavelet transform, information decomposition, and some other methods could be included within the spectral methods list [1, 3, 615]. However, these approaches have a significant limitation; they do not ensure detection of a periodicity with insertions and deletions. However, the DNA sequences obtain not only base substitutions, but also insertions and deletions in the course of evolution. Therefore, spectral methods are rarely used to study the DNA sequences of complete genomes. For this purpose, we need a mathematical method that would allow us to detect the sequences similarity that accumulates more than 1.0 base substitutions per nucleotide in the presence of the bases insertions and deletions, that is, to find the latent periodicity.

On the other hand, methods based on pairwise alignment or algorithmic methods are able to accurately detect insertions and deletions [1620]. However, these methods cannot detect a latent periodicity in a situation, where the statistical significance of similarity between any two periodic sequences is insignificant [1, 21]. This happens due to the fact that the DNA sequences periodicity (with the number of periods greater than or equal to 4) is detected by pairwise similarity between the periods. In the absence of statistically significant pairwise similarity, these approaches are unable to find the latent periodicity. First of all, it involves algorithms and programs, such as TRF [16], Mreps [22], TRStalker [23], ATRHunter [24], T-REKS [25], IMEX [26, 27], CRISPRs [28], SWAN [29], and some others [30, 31], because similarity between the different periods is very low in case of latent periodicity. This leads to the lack of seeds and identical short strings.

The frequency matrix [32] or corresponding position weight matrix M could describe a sequence periodicity S with the N length [33]. Rows of this matrix are DNA bases, and the columns are the period positions. The DNA base i in position j of the period has weight m(i, j) and positions of the period vary from 1 to n. We create artificial periodic sequence S1 of the N length, which is 1,2,…, n, 1,2,…, n,…. The numbers are considered as symbols; and they correspond to the columns of matrix M. A certain frequency matrix and weight matrix M(20, n) correspond to the period equal to n in the sequence S. We can formulate the following problem. We have a sequence S with the length N. We should find the optimal weight matrix M0, where the local alignment of sequences S1 and S obtain the highest statistical significance. Under the statistical significance, the probability P determines that Fr > Fmax, where Fmax is the maximum weight of a local alignment of sequences S and S1, using the optimal matrix M0. Here, Fr is the maximum weight of a local alignment of the randomly mixed sequence S and sequence S1 using the optimal matrix Mr. It is always possible to set the threshold level of the probability P0 and, if the probability P(Fr > Fmax) is less than P0, then the revealed local alignment of sequences S and S1 using the optimum matrix M0 can be considered as statistically significant.

A local alignment algorithm could be used for the alignment of the DNA sequence S and the artificial periodic sequence S1 using the known weight matrix [34]. But we should find the optimal weight matrix M0. Previously we developed a mathematical approach for detecting the matrix M0, as well as a method for assessing the probability P [4, 5]. The periods multiple alignments were calculated by optimizing the PWM without using the pairwise alignments or the similarity search between periods. This approach allows us to detect latent periodicity that is the periodicity, which accumulated more than 1.0 base substitution and a number of the bases deletions or insertions within each study period. Periodicities with a smaller number of base substitutions are detected by this method without any problems. The developed algorithm was applied to search for periodicity with insertions and deletions in the different genomes. This study showed the presence of latent periodicity (over 1.0 mutation per nucleotide with insertions and deletions) in some chromosomes of the H. sapiens and C. elegans, D. melanogaster, and A. thaliana genomes, where the presence of periodicity was not previously discovered. The results are included in database, which is available online at the following website: http://victoria.biengi.ac.ru/indelper/.

2. Methods and Algorithms

2.1. Using Genetic Algorithm

Optimal weight matrix M0 for period n and for sequence S was calculated using the genetic algorithm. Genetic algorithm is a heuristic search algorithm for solving the optimization problems and is a form of the direct random search [35]. It is often used to optimize the functions of several variables. Usually, the problem is formalized, so that a solution could be found as a vector, where each element can be a bit, a number, or some other object. This vector is considered as an “organism.” Usually, a set of initial organisms are created randomly [36]. Each of these organisms was measured using the objective function, which is regarded as a “fitness function.” As a result, every organism was associated with a certain fitness value, which determines how well the organism solves the problem. Separate organisms were selected from this set of organisms (it can be called “generation”) for application of the “genetic operators” (“crossing” and “mutation,” taking into account the “fitness” value). The new organisms were received as a result of these operators application. The fitness value was also calculated for new organisms and then selection of the best organisms for the next generation was performed. This set of actions was repeated iteratively and thus simulated the “evolutionary process.” This process was allowed to continue for several life cycles (generations), before executing the termination criterion of the algorithm. Such a criterion can find either the global or suboptimal solutions or exhaustion of the number of generations released for evolution.

The organisms in our case are the periodicity weight matrices. This set is called Qn or population. Each matrix has 4 rows and n columns. Matrix elements m(i, j) are certain numbers that show weight base i in column number j. A larger weight of the element m(i, j) corresponds to the high probability of the presence of the base i at position j of the period. As the assessment of fitness (objective function) for the organism (weight matrix M), the maximum value of the similarity function Fmax was considered for the local alignment [38]. A local alignment was created between sequences S1 and S, using the weight matrix M to calculate the objective function. The calculation of Fmax was performed for each organism (weight matrix M). The process was repeated after applying the genetic operators to the organisms. The process was terminated after a stable population was achieved; that is, the increase in the values of Fmax was terminated. As a result, the matrix M0 was defined for the period length n, which obtains the greatest value Fmax (mFmax). The alignment of sequences S1 and S was well built using the matrix M0. The algorithm was repeated for n from 2 to 100. The method we used to reveal DNA periodicity sequences was previously developed and applied to amino acid sequences and the details could be found in the following studies [4, 5, 39]. It can effectively detect periodicity in the presence of insertions, deletions, and large number of substitutions of nucleotides (on the average, more than 1 per nucleotide). Until now methods, which could find periodicity under these conditions, are missing. Both terms and functions referred to here follow notations defined in the paper cited above.

2.2. Genome Analysis and Statistical Importance Detection Level

Periodicity sequences were detected in a 600 nt long sliding window (sequence S) that was shifted by 200 bp iteratively. In order to avoid interference with the triplet periodicity in the genome protein-coding regions, we skipped those windows that had statistically significant triplet periodicity Z value (see [40] for details) over 3.0. In the given window we found the position weight matrix M [1], which obtained the best local alignment that maximizes similarity score Fmax [5, 39] for each tested period length n (from 2 to 50). In order to account for interwindow duplicates we filtered out intersecting sequences by choosing the one having larger Fmax value (mFmax).

We evaluated the search statistical significance by finding the cutoff level for mFmax. Numerical simulation via genome sequences shuffling showed that mFmax = 390.0 corresponds to the false positives ratio ≤ 5%. Altogether, we analyzed 15 major prokaryotic and eukaryotic genomes. Human genome was analyzed partially (from 19th to 22nd chromosomes) due to intensive algorithm calculation and time limitations. We are continuing the human genome analysis and the results in regard to the new chromosomes will be posted, as calculation is completed. We revealed 66,596 periodic sequences in the analyzed genomes. Figure 1 shows distribution of the detected sequences throughout the organisms. The largest number of sequences was found in the human genome, but we could not completely analyze the genome. The complete human genome will be analyzed early in 2017.

Figure 1.

Figure 1

Distribution of found sequences with periodicity by organisms.

3. Results and Discussion

Our results are published as a web database at http://victoria.biengi.ac.ru/indelper. It provides basic navigation within organisms and filtering using the periodicity length and periodicity significance score (mFmax). Each periodicity sequence could be analyzed in detail by selecting it in query results. The page will provide data on genomic position, statistical significance of triplet periodicity (Z) [40], alignment against periodicity, periodicity consensus sequence, and periodicity weight matrix M [33].

Search composition by periodicity length is shown further in Figure 2. The heterogeneous nature of distribution is clearly distinguished and we can see a few peaks in the distribution. Firstly, a large number of sequences are observed within the periodicity length of 2 to 4 DNA bases. As we note below, it could be highly divergent microsatellite sequences. Secondly, a distinguished peak could be observed from 9 to 11 bases. Thirdly, some peaks are present in bases 31 and 35.

Figure 2.

Figure 2

Distribution of found sequences by period length in all species.

In this study, two sequences obtaining periodicity were considered as the examples. The first sequence had a periodicity length of 10 nucleotides and this periodicity could be detected only in the presence of deletions or insertions. The spectrum of mFmax(n) is shown in Figure 3. This region was found in the third chromosome of the Arabidopsis thaliana genome, in sequence NC_003279.8. mFmax(4) = 726.31. This periodicity was not detected using the TRF [16], ATR hunter [24], Repfind [41], and Mreps [22] programs. T-REKs [25] found 11 periodicities equal to 10 nucleotides only. However, the 10 bp periodicity in the region with length equal to 466 bases was found by our method (more 46 periods). It is more than four times greater than the length found by the T-REKs program, and very high levels of statistical significance were found. TRStalker program [23] found 4 repeats with the length of 22 bases but did not find 11 base repeats. BWT program [42] found 3 repeats in the sequence with length of 10 bp. According to this study's estimates, mFmax(4) = 726.31, and it corresponds to P(mFmax > 726.31) < 10−50, because the average value of mFmax for random sequences Sr is about 142.6 and σ ~ 52.3. The resulting alignment and the resulting matrix M0 could be received at http://victoria.biengi.ac.ru/cgi-bin/indelper/index.cgi. A consensus periodicity with length equal to 11 nucleotides is (A/C)AAG(A/G)(C/G)(T/G)TTTC. This periodicity was repeated more than 40 times in the region found and the periodicity equal to 11 bases obtained the highest statistical significance.

Figure 3.

Figure 3

mF max(n) spectrum for fragment of the sequence NC_003074.1 from chromosome 3 of the C. elegans genome. The coordinates of fragment are 11055357–11055823.

The second example includes the sequence from chromosome 2l of the Drosophila melanogaster genome and obtains periodicity equal to 12 nucleotides. The sequence code is AE014134.6 and the periodicity sequence ranged from 10643993 to 10644422 bases. Figure 4 shows the spectrum mFmax(n). From this graph it is clear that the highest statistical significance obtained periodicity equal to 12 nucleotides and mFmax(11) = 592.0. The discovered sequence obtained the length equal to 430 nucleotides and contained more than 35 highly divergent 12 base repeats. TRF [16], Mreps [22], TRStalker [23], Repfind [41], ATR hunter [24], and BWTRs [42] programs did not find repeats equal to 11 nucleotides in the sequence. T-REKs program [25] revealed 7 repeats having length equal to 5 bases and did not find repeats equal to 11 nucleotides. According to this study estimates, mFmax(11) = 592.0 and it corresponds to P(mFmax > 592.0) < 10−50, since the average value mFmax for random sequences Sr is ~127.8 and σ ~ 41.1. It means that the period equal to 12 nucleotides is the most statistically significant in this sequence. The resulting alignment and the resulting matrix M0 could be received from http://victoria.biengi.ac.ru/cgi-bin/indelper/index.cgi. A consensus periodicity equal to 12 nucleotides is CACAGTCTCAA(T/G).

Figure 4.

Figure 4

mF max(n) spectrum for fragment of the sequence AE014134.6 from chromosome 2l of the Drosophila melanogaster genome. The coordinates of fragment are 10643993–10644422.

We encountered multicity of sequences, where the tandem periodicity could only be detected using our developed method. Typically, these sequences could obtain the varying length periodicity, but mFmax(n) < 600.0. Such sequences could be found at the http://victoria.biengi.ac.ru/indelper data bank. An average of 1.0 mutation per nucleotide is accumulated in these sequences and the periodicities contain a large number of insertions and deletions. In this case, statistically significant pairwise similarity between the periodicities could not be found. As a result, all developed methods are becoming “blind,” because they are based upon construction of the periodicities multiple alignments using the pairwise alignment thereof or the periodicity “germs” search, as is performed by the T-REKs program [21]. Our approach envisages the random position and weight matrices which appear to be the images of certain random multiple alignments, and we are changing those images to best fit the analyzed sequence (periods). That actually means that we are working immediately with the multiple alignment of tandem sequences (periods) in the form of position and weight matrices and are not using the pairwise alignments. Particularly this feature of our mathematical method allows finding the latent periodicity, that is, the tandem periods having a large number of base substitutions, insertions, and deletions present within each periodicity.

We also examined the periodicity identity, which was discovered by us in the A. thaliana, C. elegans, and D. melanogaster genomes. For these purposes, we obtained the frequency matrices for each sequence with the detected periodicity. To receive the frequency matrix we were using the created alignment. The frequency matrix demonstrates the number of each nucleotide i in each periodicity position j. The sum of the frequency matrix elements is equal to the length of the sequence minus the length of regions with insertions or deletions. In order to verify the matrices identity, we were using the distance between matrices, which was introduced as follows [43]:

InMk,Ml=i=14nmki,jlnmki,j+i=14nmli,jlnmli,ji=14nmki,j+mli,jlnmki,j+mli,j+skj+sljlnskj+sljskjlnskjsljlnslj. (1)

Here sk(j) = ∑i=14nmk(i, j), sl(j) = ∑i=14nml(i, j), and n is the periodicity length. 2In has an asymptotic Chi-square distribution with the degree of freedom df = 3(n − 1) [44]. Mk is the matrix from the first compared genome with the number k and Ml is the matrix from the second compared genome with the number l. Then using approximation of the normal distribution we have

Xkl=4IM1,M22df1. (2)

We are obtaining the value of Xkl ~ N(0,1). Then we calculated the minimum value of Xkmin for each k from the first genome and for all l for from the second genome. Xkmin = minXkll. The genomes are A. thaliana, C. elegans, and D. melanogaster, kl. As a result, we calculated the distribution Fk(x) which is the number of matrices Nk from the first genome, which has Xkmin > x. We also calculated Frk(x) for pure random matrices. For this purpose, sequences with the length periodicity n from the first and second genome were randomly mixed and we determined therefore random matrices of 4 × n size. The number of such matrices was fully consistent with the number of the found sequences and periodicities in the first and second compared genomes. Then we determined such a minimum value of x = x0, where Fk(x0)/Frk(x0) > 20.0. This ratio shows that the number of false positives is less than 5%. For each pair of various genomes from the multiplicity {A. thaliana, C. elegans and D. melanogaster} we determined Fk(x0) and Frk(x0). Fk(x0) demonstrates the number of unique repeats characterizing only the first genome which could not be found in the second analyzed genome even as a single specimen. The Frk(x0) values are shown in Table 1. The table row is the first genome for comparison and the table column presents the second genome to compare. This table demonstrates that C. elegans does not have any repeats, which are not discovered in the A. thaliana and D. melanogaster genomes. The D. melanogaster genome is characterized by greater difference relative to the A. thaliana genome. In this case we observed 391 tandem repeats, which were not found in the A. thaliana genome. On the other hand, A. thaliana obtains such types of tandem repeats that could not be found in the elegans and D. melanogaster genomes, although the number thereof is relatively small. These data generally show that tandem repeats that are specific for a particular genome could exist.

Table 1.

Row of the table shows the number of unique tandem repeats, peculiar only to the specified genome. These repeats are not found in the second analyzed genome (column) even in the only instance.

A. thaliana D. melanogaster C. elegans
A. thaliana 32 98
D. melanogaster 391 0
C. elegans 0 0

We studied the relationship between the detected tandem repeats and the known families of dispersed repeats. It was discovered in the A. thaliana genome that 3,740 regions out of the 10,191 regions found by us were associated with different dispersed repeats. In the C. elegans and D. melanogaster genomes these numbers are 2,578 out of 16,234 and 1,543 from 14,049, respectively. More detailed results covering the D. melanogaster and C. elegans genomes are presented in Table 2. These results testify that mostly sequences with periodicity could be found in the LTR (D. melanogaster) and retroposons DNA repetitive elements (C. elegans). The same results for human chromosomes (17–22) are presented in Table 3. It could be seen that such correlation is irregular in regard to the chromosomes, and the major part thereof is associated with LINE, SINE, or LTR.

Table 2.

The relationship at the location of known repeats and found periodicity regions in the genomes of D. melanogaster and C. elegans.

D. melanogaster C. elegans
Retroposon DNA repeat elements 60 1026
LINE 256 28
Low complexity 6 5
LTR 1079 69
rolling-circle (RC) transposons 81 562
rRNA 3
Satellite 253 484
Simple repeat 78 94
Other 12
Found in this work 14049 16234

Table 3.

The relationship at the location of known repeats and found periodicity regions in the chromosomes 17–22 of the human genome.

Chr.17 Chr.18 Chr.19 Chr.20 Chr.21 Chr.22
Retroposon DNA repeat elements 100 76 241 109 37 17
LINE 746 916 1817 1539 379 154
LTR 191 261 966 496 189 44
Rolling-circle (RC) transposons 2 2
Simple repeat 7 5 67 13 7 4
SINE 119 72 933 155 28 34
Low complexity 27 4 2
Found in this work 3127 3320 10680 5974 4073 2420

It is also interesting to discuss the limits of applicability of the method developed in this study. As was noted above we used the local alignment of sequences S1 and S. The average value of the length for this alignment l-=200 was chosen using the random sequences. This means that micro- and minisatellite sequences less than this length are detected being not sufficient for this method. The fact is that these lengths cannot overcome the threshold of F0 = 390.0; thus, these sequences could be missed when using this study method. This means that even perfect micro- and minisatellites may be skipped, if they obtain a length equal to or less than 200 nucleotides. On the basis of this limitation, a comparison could be made between the previous works on the micro- and minisatellite search and the results of this study. Previously, micro- and minisatellite sequences from C. elegans genome were investigated [45, 46] and mathematical methods for finding the micro- and minisatellites sequences were shown in [31]. The programs used included TRF [16], T-REKs [25], Mreps [22], BWTRs (Pokrzywa and Polanski, 2010), ATR hunter [24], and Repfind [41]. Therefore, it can be assumed that the developed approach misses micro- and minisatellite sequences, which obtain the length of less than 200 bases. However, the method used in this study was able to find a highly diverged periodicity sequence, which is characterized by considerable length (200 or more bases) and which passed the previously developed approaches. In this sense, this method was developed as an addition to the already developed techniques. Also, this study method is suitable when it comes to searching for highly divergent tandem repeats, characterized by a total length of more than 200 nucleotides.

We compared our results and methods with one of the recently published databases [47]. We selected out of the HeteroGenome database the sequences that obtain the length from 200 to 600 bases and the periodicity length from 2 to 50 and that are not characterized by the triplet periodicity, since under these conditions we conducted the tandem repeats search. The last condition is connected to the fact that many coding sequences obtain periodicity of the 3 bases length, which is not associated with the tandem repeats [37]. Therefore, these sequences were excluded from consideration. We studied the intersection of regions in our database (Indelper) and in HeteroGenome (Table 4). We constructed intersection with HeteroGenome under one setting: 30% overlap without periodicity length equality. The periodicity length (specified in HeteroGenome) was not considered due to the fact that the model sequences demonstrated inaccurate determination of the periodicity length in the HeteroGenome database [3]. Results of our comparison for the A. thaliana, D. melanogaster, and C. elegans genomes are shown in Table 1. The results demonstrate that more than 75% of sequences with latent periodicity submitted in the HeteroGenome database are also presented in the Indelper database. The rest of sequences are below the F0 = 390.0 threshold. This may be connected to the fact that HeteroGenome in some cases has sequences that contain regions with periodicities of varying lengths. An example of such a sequence could be found in the sequence of the 1st chromosome of A. thaliana 14264679–14264886, where artificially the length periodicities of 12, 13, 25, and 49 bases are combined [47]. This comparison also demonstrates that Indelper database contains significantly more (approximately 10 times) regions with periodicity than the HeteroGenome database. This is connected to the fact that the method used [47] does not allow detecting the tandem periodicity in the presence of symbols' deletions and insertions. In fact, the tandem periods become invisible for the method used in the HeteroGenome development, if even a single deletion or insertion is present therein. The performed comparison shows that the Indelper database contains a large number of tandem periodicities, which were not detected previously.

Table 4.

Comparison of results of the Indelper and HeteroGenome databases. We compared periodicities with lengths from 2 to 50 bases and with region lengths from 200 to 600 bases. We excluded from consideration regions with triplet periodicity if x > 3.0 (see [37] for details).

Indelper HeteroGenome Indelper∩HeteroGenome
A. thaliana 10190 633 465
D. melanogaster 16236 2190 2095
C. elegans 14049 620 547

The results of this study were compared with those of the T-REKs program [25]. T-REKs program is one of the best tools for the tandem repeats search in DNA sequences. A comparison showed that the T-REKs program could detect not more than 30% sequences with periodicities found in this work at the false positives ratio ≤ 5%. There is a natural question about the biological significance of the periodicity found. Regions were found with periodicities that could be micro- and minisatellite sequences [45]. It is very likely that sequences with periodicity ranging from 9 to 11 bases are associated with the formation of chromatin loops and are found in this study [48]. There are earlier suggestions that the periodicity length of 10 and 11 nucleotides has a relationship with the α-helices in proteins, as well as with the processes of DNA compaction and protein binding [49, 50]. Early periodicity of 10 and 11 nucleotides was intensively studied in the study conducted by Trifonov et al. [49, 5154].

We studied the distribution of periodicity regions along the chromosome length of the A. thaliana, D. melanogaster, and C. elegans genomes. For this purpose we selected chromosomes segments of 105 bases length and summarized in each segment the mFmax value. As a result, we got the ∑Fmax(x) value depending on x. The ∑Fmax is an integral characteristic, which depends upon the number of periodicity regions in the segment of 105 bases long as well as upon the level of evolutionary divergence of this periodicity. The higher the ∑Fmax value is, the larger the number of periodicity regions is present and the higher the mFmax value is in the DNA segment 105 bases long. The results of the analysis demonstrated availability of the two major ∑Fmax(x) dependence behavior models. The first model is typical for the chromosomes of the A. thaliana genome. ∑Fmax(x) typical dependence is shown in Figure 5. It could be seen that there is a clear maximum in the ~15 × 106 nucleotide position. This maximum position correlates with the position of the centromere of the chromosome 1 from the A. thaliana genome [55]. The length of the region with the increased ∑Fmax value constitutes about 5 × 106 nucleotides, which is significantly more than the length of the found periodicity region with period equal to 180 nucleotides [55]. We studied periodicity within the interval from 2 to 50 nucleotides and the periodicity of 180 nucleotides long is detected in the form of periodicity that is the integer quotient from dividing 180 by whole numbers (<180) and that is less than 50 bases. A similar maximum correlation of the ∑Fmax(x) could be observed for the remaining chromosomes of the A. thaliana genome.

Figure 5.

Figure 5

Fmax for the chromosome 1 of the genome A. thaliana. x is the position in the chromosome. ∑Fmax is the sum of mFmax for all sequences with periodicity which are found in the region with coordinates from (x − 1)/105 to x/105.

The second model is shown in Figure 6 and is based on the first chromosome of the C. elegans genome example. In this case, within the centromere region, there is a slight increase of the ∑Fmax values indicating the presence of a variety of the repetitive sequences. At the same time, a significant increase in the number of repetitive sequences could be observed close to the chromosome ends. The first model of the repetitive sequences distribution could be observed in the D. melanogaster genome and is shown in Figure 5. Only the increase in the repetitive sequences presence in the centromere region is expressed more weakly than in the A. thaliana genome. Probably these two models of the tandem repeats organization models are associated in some way with the two types of the centromere organization. Monocentric eukaryotes contain “localized” centromeres, where the centromere formation is restricted to a specific chromosomal locus. “Localized” centromeres are highly variable in size and sequence, including the simple “point” centromeres for the S. pombe budding yeast, C. albicans pathogenic fungus, Drosophila, plants, and human cells. On the other hand, in holocentric organisms, such as the Caenorhabditis elegans nematode, a “diffuse” centromere is forming along the entire chromosome [56]. It is very likely that the first model of the tandem repeats distribution along the chromosome associated with monocentric eukaryotes contains “localized” centromeres. The second model of the tandem repeats distribution is present in a diffuse centromere.

Figure 6.

Figure 6

Fmax for the chromosome 1 of the genome C. elegans. x is the position in the chromosome. ∑Fmax is the sum of mFmax for all sequences with periodicity which are found in the region with coordinates from (x − 1)/105 to x/105.

Our results are published at the http://victoria.biengi.ac.ru/indelper web database. It provides basic navigation in regard to the organisms and filtering by the length period and the periodicity significance score (mFmax). Each periodicity sequence could be analyzed in detail by selecting it in the query results. The page will provide data on genomic position, statistical significance of the triplet periodicity (Z) [40], alignment against periodicities, periodicities consensus sequence, and periodicity weight matrix M [33]. The database we present here provides novel information on periodicity in DNA sequences by accounting for both indels and substitutions arisen during evolution. The current set of organisms in the database is a subject for further expansion by analyzing other genomes of interest.

Acknowledgments

This work was supported by Russian Science Foundation.

Competing Interests

The authors declare that they have no competing interests.

References

  • 1.Korotkova E. V., Korotkovaa M. A., Kudryashova N. A. Information decomposition method to analyze symbolical sequences. Physics Letters. Section A: General, Atomic and Solid State Physics. 2003;312(3-4):198–210. doi: 10.1016/S0375-9601(03)00641-8. [DOI] [Google Scholar]
  • 2.Durbin R., Eddy S. R., Krogh A., Mitchison G. Biological Sequence Analysis:Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998. [DOI] [Google Scholar]
  • 3.Suvorova Y. M., Korotkova M. A., Korotkov E. V. Comparative analysis of periodicity search methods in DNA sequences. Computational Biology and Chemistry. 2014;53:43–48. doi: 10.1016/j.compbiolchem.2014.08.008. [DOI] [PubMed] [Google Scholar]
  • 4.Pugacheva V. M., Korotkov A. E., Korotkov E. V. Search of latent periodicity in amino acid sequences by means of genetic algorithm and dynamic programming. Statistical Applications in Genetics and Molecular Biology. 2016;15(5):381–400. doi: 10.1515/sagmb-2015-0079. [DOI] [PubMed] [Google Scholar]
  • 5.Pugacheva E., Korotkov V., Korotkov A. Search for latent periodicity in amino acid sequences with insertions and deletions. Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC '16); February 2016; SCITEPRESS—Science and Technology Publications; pp. 117–127. [Google Scholar]
  • 6.Tiwari S., Ramachandran S., Bhattacharya A., Bhattacharya S., Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Computer Applications in the Biosciences. 1997;13(3):263–270. doi: 10.1093/bioinformatics/13.3.263. [DOI] [PubMed] [Google Scholar]
  • 7.Lobzin V. V., Chechetkin V. R. Order and correlations in genomic DNA sequences. The spectral approach. Uspekhi Fizicheskikh Nauk. 2000;170(1):80–81. [Google Scholar]
  • 8.Kravatskaya G. I., Kravatsky Y. V., Chechetkin V. R., Tumanyan V. G. Coexistence of different base periodicities in prokaryotic genomes as related to DNA curvature, supercoiling, and transcription. Genomics. 2011;98(3):223–231. doi: 10.1016/j.ygeno.2011.06.006. [DOI] [PubMed] [Google Scholar]
  • 9.De Sousa Vieira M. Statistics of DNA sequences: a low-frequency analysis. Physical Review E—Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics. 1999;60(5):5932–5937. doi: 10.1103/physreve.60.5932. [DOI] [PubMed] [Google Scholar]
  • 10.Meng T., Soliman A. T., Shyu M.-L., et al. Wavelet analysis in current cancer genome research: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013;10(6):1442–1459. doi: 10.1109/tcbb.2013.134. [DOI] [PubMed] [Google Scholar]
  • 11.Afreixo V., Ferreira P. J. S. G., Santos D. Fourier analysis of symbolic data: a brief review. Digital Signal Processing. 2004;14(6):523–530. doi: 10.1016/j.dsp.2004.08.001. [DOI] [Google Scholar]
  • 12.Sosa D., Miramontes P., Li W., Mireles V., Bobadilla J. R., José M. V. Periodic distribution of a putative nucleosome positioning motif in human, nonhuman primates, and archaea: mutual information analysis. International Journal of Genomics. 2013;2013:13. doi: 10.1155/2013/963956.963956 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kumar L., Futschik M., Herzel H. DNA motifs and sequence periodicities. In Silico Biology. 2006;6(1-2):71–78. [PubMed] [Google Scholar]
  • 14.Holste D., Grosse I., Beirer S., Schieg P., Herzel H. Repeats and correlations in human DNA sequences. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics. 2003;67 doi: 10.1103/PhysRevE.67.061913.061913 [DOI] [PubMed] [Google Scholar]
  • 15.Zhang X., Shen Z., Zhang G., et al. Short exon detection via wavelet transform modulus maxima. PLOS ONE. 2016;11(9) doi: 10.1371/journal.pone.0163088.e0163088 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research. 1999;27(2):573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Parisi V., De Fonzo V., Aluffi-Pentini F. STRING: finding tandem repeats in DNA sequences. Bioinformatics. 2003;19(14):1733–1738. doi: 10.1093/bioinformatics/btg268. [DOI] [PubMed] [Google Scholar]
  • 18.O'Dushlaine C. T., Shields D. C. Tools for the identification of variable and potentially variable tandem repeats. BMC Genomics. 2006;7, article 290 doi: 10.1186/1471-2164-7-290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Domaniç N. O., Preparata F. P. A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. Journal of Computational Biology. 2007;14(7):873–891. doi: 10.1089/cmb.2007.0018. [DOI] [PubMed] [Google Scholar]
  • 20.Sokol D., Tojeira J. Speeding up the detection of tandem repeats over the edit distance. Theoretical Computer Science. 2014;525:103–110. doi: 10.1016/j.tcs.2013.04.021. [DOI] [Google Scholar]
  • 21.Turutina V. P., Laskin A. A., Kudryashov N. A., Skryabin K. G., Korotkov E. V. Identification of amino acid latent periodicity within 94 protein families. Journal of Computational Biology. 2006;13(4):946–964. doi: 10.1089/cmb.2006.13.946. [DOI] [PubMed] [Google Scholar]
  • 22.Kolpakov R., Bana G., Kucherov G. mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research. 2003;31(13):3672–3678. doi: 10.1093/nar/gkg617. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Pellegrini M., Renda M. E., Vecchio A. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics. 2010;26(12):i358–i366. doi: 10.1093/bioinformatics/btq209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Wexler Y., Yakhini Z., Kashi Y., Geiger D. Finding approximate tandem repeats in genomic sequences. Journal of Computational Biology. 2005;12(7):928–942. doi: 10.1089/cmb.2005.12.928. [DOI] [PubMed] [Google Scholar]
  • 25.Jorda J., Kajava A. V. T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics. 2009;25(20):2632–2638. doi: 10.1093/bioinformatics/btp482. [DOI] [PubMed] [Google Scholar]
  • 26.Mudunuri S. B., Kumar P., Rao A. A., Pallamsetty S., Nagarajaram H. A. G-IMEx: a comprehensive software tool for detection of microsatellites from genome sequences. Bioinformation. 2010;5(5):221–223. doi: 10.6026/97320630005221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Mudunuri S. B., Nagarajaram H. A. IMEx: imperfect microsatellite extractor. Bioinformatics. 2007;23(10):1181–1187. doi: 10.1093/bioinformatics/btm097. [DOI] [PubMed] [Google Scholar]
  • 28.Grissa I., Vergnaud G., Pourcel C. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Research. 2007;35(2):W52–W57. doi: 10.1093/nar/gkm360. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Boeva V., Regnier M., Papatsenko D., Makeev V. Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics. 2006;22(6):676–684. doi: 10.1093/bioinformatics/btk032. [DOI] [PubMed] [Google Scholar]
  • 30.Lim K. G., Kwoh C. K., Hsu L. Y., Wirawan A. Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Briefings in Bioinformatics. 2013;14(1):67–81. doi: 10.1093/bib/bbs023. [DOI] [PubMed] [Google Scholar]
  • 31.Moniruzzaman M., Khatun R., Yaakob Z., Khan M. S., Mintoo A. A. Development of microsatellites: a powerful genetic marker. The Agriculturists. 2016;13(1):152–172. doi: 10.3329/agric.v13i1.26559. [DOI] [Google Scholar]
  • 32.Korotkov E. V., Korotkova M. A., Kudryashov N. A. The informational concept of searching for periodicity in symbol sequences. Molecular Biology. 2003;37(3):436–451. doi: 10.1023/a:1024247612994. [DOI] [PubMed] [Google Scholar]
  • 33.Shelenkov A., Skryabin K., Korotkov E. Search and classification of potential minisatellite sequences from bacterial genomes. DNA Research. 2006;13(3):89–102. doi: 10.1093/dnares/dsl004. [DOI] [PubMed] [Google Scholar]
  • 34.Smith T. F., Waterman M. S. Identification of common molecular subsequences. Journal of Molecular Biology. 1981;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
  • 35.Mitchell M. An Introduction to Genetic Algorithms. 1998. http://dl.acm.org/citation.cfm?id=522098. [Google Scholar]
  • 36.Gondro C., Kinghorn B. P. A simple genetic algorithm for multiple sequence alignment. Genetics and Molecular Research. 2007;6(4):964–982. [PubMed] [Google Scholar]
  • 37.Frenkel F. E., Korotkov E. V. Classification analysis of triplet periodicity in protein-coding regions of genes. Gene. 2008;421(1-2):52–60. doi: 10.1016/j.gene.2008.06.012. [DOI] [PubMed] [Google Scholar]
  • 38.Altschul S. F., Gish W., Miller W., Meyers E. W., Lipman D. J. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(3):403–410. doi: 10.1006/jmbi.1990.9999. [DOI] [PubMed] [Google Scholar]
  • 39.Korotkov E. V., Korotkova M. A. Developing a mathematical method to search for latent periodicity in protein amino-acid sequences with deletions and insertions. Biophysics. 2016;60(6):876–885. doi: 10.1134/S0006350915060159. [DOI] [PubMed] [Google Scholar]
  • 40.Frenkel F. E., Korotkov E. V. Using triplet periodicity of nucleotide sequences for finding potential reading frame shifts in genes. DNA Research. 2009;16(2):105–114. doi: 10.1093/dnares/dsp002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Betley J. N., Frith M. C., Graber J. H., Choo S., Deshler J. O. A ubiquitous and conserved signal for RNA localization in chordates. Current Biology. 2002;12(20):1756–1761. doi: 10.1016/S0960-9822(02)01220-4. [DOI] [PubMed] [Google Scholar]
  • 42.Pokrzywa R., Polanski A. BWtrs: a tool for searching for tandem repeats in DNA sequences based on the Burrows-Wheeler transform. Genomics. 2010;96(5):316–321. doi: 10.1016/j.ygeno.2010.08.001. [DOI] [PubMed] [Google Scholar]
  • 43.Kullback S. Statistics and Information Theory. New York, NY, USA: John Wiley & Sons; 1959. [Google Scholar]
  • 44.Vinckenbosch N., Dupanloup I., Kaessmann H. Evolutionary fate of retroposed gene copies in the human genome. Proceedings of the National Academy of Sciences of the United States of America. 2006;103(9):3220–3225. doi: 10.1073/pnas.0511307103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Richard G.-F., Kerrest A., Dujon B. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiology and Molecular Biology Reviews. 2008;72(4):686–727. doi: 10.1128/mmbr.00011-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Tóth G., Gáspári Z., Jurka J. Microsatellites in different eukaryotic genomes: surveys and analysis. Genome Research. 2000;10(7):967–981. doi: 10.1101/gr.10.7.967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Chaley M., Kutyrkin V., Tulbasheva G., Teplukhina E., Nazipova N. HeteroGenome: database of genome periodicity. Database. 2014;2014 doi: 10.1093/database/bau040.bau040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Kadauke S., Blobel G. A. Chromatin loops in gene regulation. Biochimica et Biophysica Acta (BBA)—Gene Regulatory Mechanisms. 2009;1789(1):17–25. doi: 10.1016/j.bbagrm.2008.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Herzel H., Weiss O., Trifonov E. N. 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics. 1999;15(3):187–193. doi: 10.1093/bioinformatics/15.3.187. [DOI] [PubMed] [Google Scholar]
  • 50.Larsabal E., Danchin A. Genomes are covered with ubiquitous 11 bp periodic patterns, the ‘class A flexible patterns’. BMC Bioinformatics. 2005;6, article 206 doi: 10.1186/1471-2105-6-206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Trifonov E. N., Sussman J. L. The pitch of chromatin DNA is reflected in its nucleotide sequence. Proceedings of the National Academy of Sciences of the United States of America. 1980;77(7):3816–3820. doi: 10.1073/pnas.77.7.3816. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Herzel H., Trifonov E., Weiss O., Große I. Interpreting correlations in biosequences. Physica A: Statistical Mechanics and Its Applications. 1998;249(1–4):449–459. doi: 10.1016/s0378-4371(97)00505-0. [DOI] [Google Scholar]
  • 53.Trifonov E. N. Elucidating sequence codes: three codes for evolution. Annals of the New York Academy of Sciences. 1999;870:330–338. doi: 10.1111/j.1749-6632.1999.tb08894.x. [DOI] [PubMed] [Google Scholar]
  • 54.Trifonov E. N. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A: Statistical Mechanics and Its Applications. 1998;249(1–4):511–516. doi: 10.1016/s0378-4371(97)00510-4. [DOI] [Google Scholar]
  • 55.Round E. K., Flowers S. K., Richards E. J. Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure. Genome Research. 1997;7(11):1045–1053. doi: 10.1101/gr.7.11.1045. [DOI] [PubMed] [Google Scholar]
  • 56.Torras-Llort M., Moreno-Moreno O., Azorín F. Focus on the centre: the role of chromatin on the regulation of centromere identity and function. EMBO Journal. 2009;28(16):2337–2348. doi: 10.1038/emboj.2009.174. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from BioMed Research International are provided here courtesy of Wiley

RESOURCES