One novel representation of DNA sequence based on the global and local position information

Zhiyi Mo; Wen Zhu; Yi Sun; Qilin Xiang; Ming Zheng; Min Chen; Zejun Li

doi:10.1038/s41598-018-26005-3

2018 May 15;8:7592. doi: 10.1038/s41598-018-26005-3

PMCID: PMC5953932 PMID: 29765099

Abstract

A novel representation of DNA sequences that combines the global and local position information of the original sequence is proposed to distinguish different species. First, to exploit global information, a graphical representation of the DNA sequence is formulated according to the curve of the Fermat spiral. Then, to account for the local characteristics of the DNA sequence, each point on the Fermat spiral curve is assigned a mass based on the relationships among four neighboring nucleotides. In this paper, the normalized moments of inertia of the Fermat spiral curve composed of these massive points are calculated as the numerical description of the corresponding DNA sequence, using the first exons of β-globin genes. With the Euclidean distance as the measure between numerical descriptions, the similarities obtained between species illustrate the performance of the proposed method.

Introduction

Graphical and numerical representations of DNA, RNA or protein sequences have become popular strategies for analyzing the evolutionary relationships between species. With the availability of various gene data for different species, the comparison of organisms that carry unique genetic information involves mathematics, biology, physics, informatics and other fields. Many researchers have focused on the representation of gene sequences^1–31, so the study of such representations is significant and beneficial.

Hamori and Ruskin³² first proposed the H-curve, a graphical representation of nucleotide sequences that is convenient for the visual analysis and comprehension of DNA sequences. Following them, further research on the representation of DNA sequences was carried out^33–44. For example, Zhang⁴⁵ proposed a five-color map visualization of DNA sequences named ColorSquare. Jafarzadeh¹ constructed the C-curve with no loss of information. Aram⁵ introduced a new graphical representation of DNA sequences called the spider representation. Moreover, Bielinska-Waz¹⁰ represented the sequence with a set of discrete lines referred to as the B-spectrum. Unfortunately, owing to high degeneracy, loss of information and the large amount of space needed to transform a DNA sequence into a graphical representation, the performance of many methods is not as satisfactory as expected.

To solve these problems, we present a novel representation of DNA sequences based on global and local position information. In contrast to previous reports, the new method yields a more effective representation and restrains the possible effect of differing DNA sequence lengths. In detail, the representation involves (1) formulating the graphical representation of the DNA sequence according to the curve of the Fermat spiral, which retains the global position information of the original sequence; (2) taking the local position information of the DNA sequence into account by attaching a related mass to each point on the Fermat spiral curve; and (3) calculating the normalized moments of inertia of the Fermat spiral curve composed of the massive points as the description of the corresponding DNA sequence, applied to the first exons of β-globin genes.

Graphical representation of DNA sequence

To make full use of the global information of a DNA sequence, the original sequence is divided into four subsequences consisting of A, C, G or T, so that four corresponding point sets are obtained from the positions of the nucleotides in the original sequence. Thus, each nucleotide in a subsequence corresponds to one point in the set. By distributing each point set along the curve of the Fermat spiral, four corresponding curves, which constitute the graphical representation of the DNA sequence, can be plotted. We choose the Fermat spiral rather than the circle as the distribution curve because the polar radius of the Fermat spiral is a monotonically increasing function of the polar angle, which retains the position information of the original sequence.

We regard the DNA sequence as the BS (base sequence), which is constituted by the four subsequences AS, CS, GS and TS. Concretely, the i-th nucleotide in BS is denoted as V_i^BS, i = 1, 2, ···, N_BS. The length of the base sequence equals the total length of the four subsequences:

N_{BS} = N_{AS} + N_{CS} + N_{GS} + N_{TS}

where N_BS, N_AS, N_CS, N_GS and N_TS respectively denote the lengths of the base sequence and of the A, C, G and T subsequences. To plot the base Fermat spiral curve corresponding to the base sequence, the coordinates of the points in the polar coordinate system are calculated from the positions in the base sequence. For each point, the polar angle is calculated as:

θ_{V_{i}^{BS}} = \frac{2π}{L - 1} \times (L_{V_{i}^{BS}} - 1)

where $θ_{V_{i}^{BS}}$ denotes the polar angle of nucleotide $V_{i}^{BS}$ in the polar coordinate system; L is a constant equal to the shortest DNA sequence length among the species in the experiment; and $L_{V_{i}^{BS}}$ denotes the position of nucleotide $V_{i}^{BS}$ in the base sequence, which ranges from 1 to $N_{BS}$. The curve of the Fermat spiral is described as:

ρ_{V_{i}^{BS}} = \sqrt{θ_{V_{i}^{BS}}}

For the nucleotides in the base sequence, the corresponding set of polar coordinates of the points is

\{p_{V_{1}^{BS}} (θ_{V_{1}^{BS}}, ρ_{V_{1}^{BS}}), p_{V_{2}^{BS}} (θ_{V_{2}^{BS}}, ρ_{V_{2}^{BS}}), \dots, p_{V_{i}^{BS}} (θ_{V_{i}^{BS}}, ρ_{V_{i}^{BS}}), \dots, p_{V_{N_{BS}}^{BS}} (θ_{V_{N_{BS}}^{BS}}, ρ_{V_{N_{BS}}^{BS}})\} .

Correspondingly, four subsets can be obtained and plotted. Fig. 1 shows the graphical representation of the first exon of the human β-globin gene.
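The coordinate calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `fermat_spiral_points` and the choice to skip characters other than A, C, G and T are assumptions made here. Positions are 1-based, as in the text, and `L` is passed in by the caller.

```python
import math


def fermat_spiral_points(sequence, L):
    """Map each nucleotide to a polar point (theta, rho) on the Fermat spiral.

    Returns a dict with one list of points per subsequence (A, C, G, T).
    `L` is the reference length from the text (shortest sequence length
    in the data set); `fermat_spiral_points` is a hypothetical helper name.
    """
    if L < 2:
        raise ValueError("L must be at least 2")
    subsets = {base: [] for base in "ACGT"}
    for position, base in enumerate(sequence.upper(), start=1):
        # theta = 2*pi / (L - 1) * (position - 1), rho = sqrt(theta)
        theta = 2 * math.pi / (L - 1) * (position - 1)
        rho = math.sqrt(theta)
        if base in subsets:  # assumption: ignore symbols other than A, C, G, T
            subsets[base].append((theta, rho))
    return subsets


points = fermat_spiral_points("ACGTA", L=5)
print(points["A"])  # the first and the fifth nucleotide
```

Because the polar angle depends on the position in the base sequence, not on the position within a subsequence, each of the four point sets keeps the global position of its nucleotides.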

Fig. 1. The graphical representation of the human gene. From left to right and from top to bottom, the graphical representations are for the A, C, G and T subsequences, respectively.

Attaching each point with a mass

To make full use of the information carried by the DNA sequence, local characteristics are used to attach a mass to each point corresponding to a nucleotide in the base sequence. Each nucleotide is considered within a group of four: its immediate 5′ neighbor, the nucleotide itself (the second position in the group) and its two immediate 3′ neighbors. The number of times the second nucleotide occurs in the group, and how compactly its occurrences are arranged, are the criteria that determine the mass of the second nucleotide.

According to the number of times the nucleotide at the second position is repeated in the group, four categories can be distinguished. In the following, a nucleotide identical to the second nucleotide is denoted as 1 and a nucleotide different from it is denoted as 0.

0100
0101, 1100, 0110
1101, 1110, 0111
1111

For example, the nucleotide at the second position occurs once in the first category. Analyzing the four categories by how compactly the occurrences of the second nucleotide are arranged yields six situations. For example, the second category, in which the nucleotide at the second position occurs twice, splits into a first situation, 0101, in which the two 1s are separated by a 0, and a second situation, 1100 and 0110, in which the two 1s are adjacent.

0100
0101
1100, 0110
1101
1110, 0111
1111

Therefore, a mass from $\{\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}, \frac{5}{6}, 1\}$ is attached to the point corresponding to the nucleotide at the second position. However, to reduce the impact of DNA sequences that are too long, the masses of the points at positions beyond L are restrained as

[Equation available only as an image in the source: 41598_2018_26005_Equ1_HTML.gif]

where the left-hand side denotes the mass of the point corresponding to nucleotide $V_{i}^{BS}$ after restraint, and ε denotes the scale of the restraint, a constant set to 0.0375 in the experiment. Thus the points corresponding to positions beyond L have a larger polar radius but a smaller mass. On the one hand, this restrains the length differences between dissimilar species; on the other, it preserves the smaller length differences between similar species.
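The pattern-based mass assignment of this section can be sketched as follows. Several points are assumptions rather than statements from the text: the six masses are assigned to the six situations in the order in which they are listed, neighbors missing at the ends of the sequence are treated as different from the second nucleotide (0), and the helper name `point_masses` is hypothetical. The restraint for positions beyond L is not included, because its formula is not reproduced in this text.

```python
from fractions import Fraction

# Pattern of the 4-nucleotide group (5' neighbor, nucleotide itself, two
# 3' neighbors) -> mass. The order 1/6 .. 1 follows the order in which the
# six situations are listed; this mapping is an assumption.
PATTERN_MASS = {
    "0100": Fraction(1, 6),
    "0101": Fraction(2, 6),
    "1100": Fraction(3, 6),
    "0110": Fraction(3, 6),
    "1101": Fraction(4, 6),
    "1110": Fraction(5, 6),
    "0111": Fraction(5, 6),
    "1111": Fraction(1, 1),
}


def point_masses(sequence):
    """Return the unrestrained mass of every nucleotide in the sequence."""
    seq = sequence.upper()
    masses = []
    for i, base in enumerate(seq):
        # Group: position i-1 (5' neighbor), i (second position), i+1, i+2.
        # Assumption: neighbors outside the sequence count as 0.
        pattern = "".join(
            "1" if 0 <= j < len(seq) and seq[j] == base else "0"
            for j in (i - 1, i, i + 1, i + 2)
        )
        masses.append(PATTERN_MASS[pattern])
    return masses


print(point_masses("ACAC"))
```

The second position of every pattern is always 1, so the eight patterns in `PATTERN_MASS` cover all possible groups.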

Numerical Representation

Because the moment of inertia is widely used in gene numerical representation methods^10,11,15,16, the normalized moments of inertia of each massive sub-curve of the Fermat spiral are calculated as the numerical representation of the original DNA sequence in this paper. For convenience of calculation, the polar coordinates are transformed to plane coordinates:

\begin{cases} x_{V_{i}^{BS}} = ρ_{V_{i}^{BS}} \times \cos θ_{V_{i}^{BS}} \\ y_{V_{i}^{BS}} = ρ_{V_{i}^{BS}} \times \sin θ_{V_{i}^{BS}} \end{cases}

Since the point $p_{V_{i}^{BS}} (θ_{V_{i}^{BS}}, ρ_{V_{i}^{BS}})$ in the polar coordinates is transformed to point $p_{V_{i}^{BS}} (x_{V_{i}^{BS}}, y_{V_{i}^{BS}})$ in the plane coordinates, the center of mass for the massive curve of Fermat spiral in the plane coordinates system is calculated as:

\begin{cases} \tilde{x_{V^{BS}}} = \frac{1}{N_{BS}} \sum_{i = 1}^{N_{BS}} m_{V_{i}^{BS}} \times x_{V_{i}^{BS}} \\ \tilde{y_{V^{BS}}} = \frac{1}{N_{BS}} \sum_{i = 1}^{N_{BS}} m_{V_{i}^{BS}} \times y_{V_{i}^{BS}} \end{cases}

So the center of mass is the point $\tilde{p_{V^{BS}}} (\tilde{x_{V^{BS}}}, \tilde{y_{V^{BS}}})$, and the moment of inertia of the massive curve is described as:

M_{BS} = \sum_{i = 1}^{N_{BS}} m_{V_{i}^{BS}} \times \mathrm{distance}(p_{V_{i}^{BS}}, \tilde{p_{V^{BS}}})

where distance $(p_{V_{i}^{BS}}, \tilde{p_{V^{BS}}})$ denotes the squared distance, calculated as:

\mathrm{distance}(p_{V_{i}^{BS}}, \tilde{p_{V^{BS}}}) = {(x_{V_{i}^{BS}} - \tilde{x_{V^{BS}}})}^{2} + {(y_{V_{i}^{BS}} - \tilde{y_{V^{BS}}})}^{2}

The normalized moment of inertia is described as:

r_{BS} = \sqrt{\frac{M_{BS}}{\sum_{i = 1}^{N_{BS}} m_{V_{i}^{BS}}}}
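The chain from polar coordinates to the normalized moment of inertia can be sketched as below, to be applied to the point set and masses of each sub-curve in turn. This is an illustrative sketch under the formulas as written in this section; in particular, the center of mass divides the weighted sums by the number of points, as in the text, not by the total mass. The function name is hypothetical.

```python
import math


def normalized_moment_of_inertia(points, masses):
    """Normalized moment of inertia of massive points on a spiral curve.

    points: list of (theta, rho) polar coordinates
    masses: list of masses, one per point
    """
    if not points or len(points) != len(masses):
        raise ValueError("points and masses must be non-empty and equal in length")
    # Polar -> plane coordinates.
    xy = [(rho * math.cos(theta), rho * math.sin(theta)) for theta, rho in points]
    n = len(xy)
    # Center of mass as written in the text: weighted sum divided by n.
    cx = sum(m * x for m, (x, _) in zip(masses, xy)) / n
    cy = sum(m * y for m, (_, y) in zip(masses, xy)) / n
    # Moment of inertia: mass times squared distance to the center of mass.
    moment = sum(
        m * ((x - cx) ** 2 + (y - cy) ** 2) for m, (x, y) in zip(masses, xy)
    )
    return math.sqrt(moment / sum(masses))


# Two unit masses on opposite sides of the origin at radius 1.
print(normalized_moment_of_inertia([(0.0, 1.0), (math.pi, 1.0)], [1, 1]))
```

For two unit masses placed symmetrically about the origin at radius 1, the center of mass is the origin and the normalized moment of inertia is 1 up to floating-point error, which gives a quick sanity check.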

Here the 4-dimensional vector $[r_{AS}, r_{CS}, r_{GS}, r_{TS}]$, also written $r_{BS}$, denotes the numerical representation of a DNA sequence consisting of the A, C, G and T subsequences. The similarity distance between species is then calculated with the Euclidean measure:

S(α, β) = \left[ \sum {| r_{BS}^{α} - r_{BS}^{β} |}^{2} \right]^{\frac{1}{2}}

where $r_{BS}^{α}$ and $r_{BS}^{β}$ respectively denote the numerical representations of species α and β. So S(α, β) denotes the similarity distance between vectors $r_{BS}^{α}$ and $r_{BS}^{β}$ in the 4-dimensional space.
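As a small worked example of the Euclidean measure, the following sketch computes the distance between 4-dimensional representations. The three vectors are the Human, Gorilla and Chimpanzee rows reported in Table 2; the function name `similarity_distance` is a hypothetical choice.

```python
import math


def similarity_distance(r_alpha, r_beta):
    """Euclidean distance between two numerical representations."""
    if len(r_alpha) != len(r_beta):
        raise ValueError("representations must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_alpha, r_beta)))


# [r_AS, r_CS, r_GS, r_TS] as reported in Table 2.
human = [1.6674, 1.7921, 1.7233, 1.7689]
gorilla = [1.6674, 1.7921, 1.7233, 1.7727]
chimpanzee = [1.6885, 1.8085, 1.7239, 1.7845]

print(round(similarity_distance(human, gorilla), 4))     # 0.0038
print(round(similarity_distance(human, chimpanzee), 4))  # 0.0309
```

Both values agree with the Human-Gorilla and Human-Chimp entries of Table 3.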

Results and Discussion

We test the performance of the proposed method on a standard dataset that is popular in DNA representation research: the first exons of the β-globin genes of different species, listed in Table 1. Table 2 shows the numerical representation of the DNA sequence of each species, computed according to Eq. (9). From these 4-dimensional vectors, Table 3 shows the similarity/dissimilarity between pairs of species according to Eq. (10).

Table 1.

The first exons of β-globin gene of different species.

k	Species	Gene ID	N
1	Human	U01317	92
2	Gorilla	X61109	93
3	Chimpanzee	X02345	105
4	Rat	X06701	92
5	Mouse	V00722	93
6	Lemur	M15734	92
7	Rabbit	V00882	92
8	Goat	M15387	86
9	Bovine	X00376	86
10	Opossum	J03643	92
11	Gallus	V00409	92

Table 2.

The numerical representation of DNA sequence.

Species	r_AS	r_CS	r_GS	r_TS
Human	1.6674	1.7921	1.7233	1.7689
Gorilla	1.6674	1.7921	1.7233	1.7727
Chimpanzee	1.6885	1.8085	1.7239	1.7845
Rat	1.6074	1.7858	1.8242	1.7462
Mouse	1.5943	1.7480	1.8407	1.7921
Lemur	1.6781	1.6659	1.8149	1.8046
Rabbit	1.6454	1.7145	1.8690	1.7809
Goat	1.5416	1.6808	1.8246	1.8476
Bovine	1.5416	1.5929	1.8056	1.8471
Opossum	1.5693	1.6713	1.9416	1.7398
Gallus	1.7986	1.6879	1.9639	1.7140

Table 3.

Similarity/dissimilarity matrix under the Euclidean distance.

Species	Human	Gorilla	Chimp	Rat	Mouse	Lemur	Rabbit	Goat	Bovine	Opossum	Gallus
Human	0	0.0038	0.0309	0.1198	0.1470	0.1604	0.1670	0.2114	0.2616	0.2696	0.2983
Gorilla	0.0038	0	0.0292	0.1206	0.1465	0.1596	0.1668	0.2100	0.2604	0.2701	0.2991
Chimp	0.0309	0.0292	0	0.1365	0.1620	0.1707	0.1782	0.2281	0.2804	0.2870	0.2988
Rat	0.1198	0.1206	0.1365	0	0.0631	0.1513	0.0987	0.1602	0.2282	0.1684	0.2583
Mouse	0.1470	0.1465	0.1620	0.0631	0	0.1208	0.0683	0.1031	0.1763	0.1393	0.2582
Lemur	0.1604	0.1596	0.1707	0.1513	0.1208	0	0.0832	0.1443	0.1608	0.1792	0.2131
Rabbit	0.1670	0.1668	0.1782	0.0987	0.0683	0.0832	0	0.1355	0.1844	0.1209	0.1940
Goat	0.2114	0.2100	0.2281	0.1602	0.1031	0.1443	0.1355	0	0.0900	0.1618	0.3215
Bovine	0.2616	0.2604	0.2804	0.2282	0.1763	0.1608	0.1844	0.0900	0	0.1921	0.3433
Opossum	0.2696	0.2701	0.2870	0.1684	0.1393	0.1792	0.1209	0.1618	0.1921	0	0.2324
Gallus	0.2983	0.2991	0.2988	0.2583	0.2582	0.2131	0.1940	0.3215	0.3433	0.2324	0

For comparison, Table 4 shows the similarity/dissimilarity between Human and the other species obtained by other methods that also take the Euclidean distance as the measure. From Table 4, most of the listed methods^1,2,10,46,47 reach the same conclusion as ours, namely that Gorilla is the species most similar to Human and Chimp the next most similar; the exception is method³³, which reaches the close conclusion that Chimp is the most similar to Human and Gorilla the next. In addition, some of the listed methods^2,10,47 also conclude, as we do, that Gallus is the species most dissimilar to Human.

Table 4.

Similarity/dissimilarity between Human and other species with different methods.

Methods	Gorilla	Chimp	Rat	Mouse	Lemur	Rabbit	Goat	Bovine	Opossum	Gallus
Our work	0.0038	0.0309	0.1198	0.1470	0.1604	0.1670	0.2114	0.2616	0.2696	0.2983
Randic et al. 2003³³	0.0210	0.0170	0.0430	0.0830	0.0870	0.0420	0.0610	0.0840	0.1480	0.1090
Dai et al. 2006⁴⁶	0.0120	0.0155	0.0704	0.0543	0.0603	0.0287	0.0169	0.0276	0.1389	0.1146
Liu and Wang 2006⁴⁷	0.3070	0.3101	0.4256	0.3089	0.3688	0.2968	0.4341	0.4172	0.3805	0.4479
Liao et al. 2013²	0.1651	0.4688	0.9202	0.6024	1.0110	0.7453	0.6010	0.6320	1.3710	1.5932
Jafarzadeh et al. 2013¹	0.0330	0.0920	0.2160	0.1630	0.1940	0.1240	0.1650	0.2210	0.1940	0.1940
Bielinska-Waz et al. 2017¹⁰	0.0056	0.0314	0.1838	0.2395	0.2497	0.1844	0.1276	0.0872	0.3904	0.4687

The dendrogram corresponding to the Euclidean distances is plotted in Fig. 2. As seen, the similar clusters are Human-Gorilla (same cluster result in^1,2,10,40,47,48), Rat-Mouse (same result in^10,49), Lemur-Rabbit (same cluster result in¹⁰), Goat-Bovine (same cluster result in^10,33,40,47–49) and Human-Gorilla-Chimpanzee (same cluster result in^1,10,33,40,46,48,49).

To allow visual comparison with the results of other papers^1,2,10,33,46,47 that also use the Euclidean measure, each method's distances are normalized so that S^{human−gallus} = 1. As shown in Fig. 3, different methods give different results, which may be useful under different considerations.
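The normalization can be illustrated with a short sketch that divides every Human-to-species distance of one method by that method's Human-Gallus distance. The values are the "Our work" row of Table 4; the function name and the dictionary layout are assumptions made for illustration.

```python
def normalize_to_reference(distances, reference="Gallus"):
    """Scale distances so that the reference species has distance 1."""
    ref = distances[reference]
    if ref == 0:
        raise ValueError("reference distance must be non-zero")
    return {species: value / ref for species, value in distances.items()}


# "Our work" row of Table 4 (distances from Human).
our_work = {
    "Gorilla": 0.0038, "Chimp": 0.0309, "Rat": 0.1198, "Mouse": 0.1470,
    "Lemur": 0.1604, "Rabbit": 0.1670, "Goat": 0.2114, "Bovine": 0.2616,
    "Opossum": 0.2696, "Gallus": 0.2983,
}

for species, value in normalize_to_reference(our_work).items():
    print(f"{species}: {value:.4f}")
```

Applying the same scaling to each row of Table 4 puts all methods on a common scale, so that their relative orderings can be compared in one plot.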

Fig. 3. Similarity values of human-other species with different methods.

In conclusion, this paper presents a novel method that extracts the characteristics of a DNA sequence through graphical and numerical operations and can effectively support the similarity/dissimilarity comparison of different species. In this method, distributing the sequence along the Fermat spiral curve retains the global position information, and attaching a mass to each point retains the local position information. Specifically, in our results the group Rat-Mouse-Lemur-Rabbit is more similar to the group Human-Gorilla-Chimpanzee than the group Goat-Bovine-Opossum is, which may be helpful for exploring the evolutionary relationships between species. Moreover, the similar pairs obtained by our method illustrate the performance of the proposed representation of DNA sequences.

Acknowledgements

This study is supported by the National Natural Science Foundation of China (Grant Numbers: 11171369, 61272395, 61370171, 61300128, 61472127, 61572178, 61502343, 61672214, 61672223 and 61772192) and the Guangxi Natural Science Foundation (Grant Number: 2017GXNSFAA198148).

Author Contributions

Zhiyi Mo and Yi Sun wrote the main manuscript text. Qilin Xiang prepared Figures 1–2 and Tables 2–3. Ming Zheng collected the gene data and prepared Table 1. Min Chen and Zejun Li provided the analyses of different methods and prepared Figure 3 and Table 4. Wen Zhu advised on adding comparisons of similar gene groups across different methods.

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Jafarzadeh N, Iranmanesh A. C-curve: a novel 3D graphical representation of DNA sequence based on codons. Math Biosci. 2013;241:217–224. doi: 10.1016/j.mbs.2012.11.009.
2. Liao B, Xiang Q, Cai L, Cao Z. A new graphical coding of DNA sequence and its similarity calculation. Physica A. 2013;392:4663–4667. doi: 10.1016/j.physa.2013.05.015.
3. Yang X, Wang T. Linear regression model of short k-word: A similarity distance suitable for biological sequences with various lengths. J Theor Biol. 2013;337:61–70. doi: 10.1016/j.jtbi.2013.07.028.
4. Wąż P, Bielińska-Wąż D. Non-standard similarity/dissimilarity analysis of DNA sequences. Genomics. 2014;104:464–471. doi: 10.1016/j.ygeno.2014.08.010.
5. Aram V, Iranmanesh A, Majid Z. Spider representation of DNA sequences. J Comput Theor Nanos. 2014;11:418–420. doi: 10.1166/jctn.2014.3371.
6. Liu YW, Peng Y. A novel technique for analyzing the similarity and dissimilarity of DNA sequences. Genet Mol Res. 2014;13:570–577. doi: 10.4238/2014.January.28.2.
7. Yin C, Yin XE, Wang J. A novel method for comparative analysis of DNA sequences by Ramanujan-Fourier transform. J Comput Biol. 2014;21:867–879. doi: 10.1089/cmb.2014.0120.
8. Li C, Fei WC, Zhao Y, Yu XQ. Novel Graphical Representation and Numerical Characterization of DNA Sequences. Applied Sciences. 2016;6:63. doi: 10.3390/app6030063.
9. Xu X, Zhu F. A New Method to Digitize DNA Sequence. J Biosci Med. 2017;05:7–12.
10. Bielińska-Wąż D, Wąż P. Spectral-dynamic representation of DNA sequences. J Biomed Inform. 2017;72:1–7. doi: 10.1016/j.jbi.2017.06.001.
11. Panas D, Wąż P, Bielińska-Wąż D, Nandy A, Basak SC. 2D-Dynamic Representation of DNA/RNA Sequences as a Characterization Tool of the Zika Virus Genome. MATCH Commun Math Comput Chem. 2017;77:321–332.
12. Ma T, Liu Y, Dai Q, Yao Y, He PA. A graphical representation of protein based on a novel iterated function system. Physica A. 2014;403:21–28. doi: 10.1016/j.physa.2014.01.067.
13. Li Y, Liu Q, Zheng X, He PA. UC-Curve: A highly compact 2D graphical representation of protein sequences. Int J Quantum Chem. 2014;114:409–415. doi: 10.1002/qua.24581.
14. Yao Y, Yan S, Han J, Dai Q, He PA. A novel descriptor of protein sequences and its application. J Theor Biol. 2014;347:109–117. doi: 10.1016/j.jtbi.2014.01.001.
15. Yao Y, et al. Similarity/Dissimilarity Analysis of Protein Sequences Based on a New Spectrum-Like Graphical Representation. Evol Bioinform Online. 2014;10:87–96. doi: 10.4137/EBO.S14713.
16. Xu SC, Li Z, Zhang SP, Hu JL. Primary structure similarity analysis of proteins sequences by a new graphical representation. SAR QSAR Environ Res. 2014;25:791–803. doi: 10.1080/1062936X.2014.955055.
17. El-Lakkani A, Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR QSAR Environ Res. 2015;26:125–137. doi: 10.1080/1062936X.2014.995700.
18. Hou W, Pan Q, He M. A new graphical representation of protein sequences and its applications. Physica A. 2016;444:996–1002. doi: 10.1016/j.physa.2015.10.067.
19. Czerniecka A, Bielińska-Wąż D, Wąż P, Clark T. 20D-dynamic Representation of Protein Sequences. Genomics. 2016;107:16–23. doi: 10.1016/j.ygeno.2015.12.003.
20. Ping P, Zhu X, Wang L. Similarities/dissimilarities analysis of protein sequences based on pca-fft. J Biol Syst. 2017;25:1–17. doi: 10.1142/S0218339017500024.
21. Hu H, Li Z, Dong H, Zhou T. Graphical Representation and Similarity Analysis of Protein Sequences Based on Fractal Interpolation. IEEE ACM T Comput Bi. 2017;14:182–192. doi: 10.1109/TCBB.2015.2511731.
22. Liao B, Liao L, Wu R, Li R. Construction of the phylogenetic tree by self-organizing map based on encoding sequence. J Comput Theor Nanos. 2012;9:826–830. doi: 10.1166/jctn.2012.2103.
23. Liao B, Liao BY, Lu X, Cao Z. A Novel Graphical Representation of Protein Sequences and Its Application. J Comput Chem. 2011;32:2539–2544. doi: 10.1002/jcc.21833.
24. Liao B, Liao B, Sun X, Zeng Q. A Novel method for similarity analysis and protein subcellular localization prediction. Bioinformatics. 2010;26:2678–2683. doi: 10.1093/bioinformatics/btq521.
25. Li X, Liao B, Zeng Q, Luo J. Protein functional class prediction using global encoding of amino acid sequence. J Theor Biol. 2009;261:290–293. doi: 10.1016/j.jtbi.2009.07.017.
26. Huang G, Liao B, Li R. Similarity studies of DNA sequences based on a new 2D graphical representation. Biophys Chem. 2009;143:55–59. doi: 10.1016/j.bpc.2009.03.013.
27. Liao B, Zeng C, Li F, Tang Y. Analysis of Similarity/Dissimilarity of DNA Sequences Based on Dual Nucleotides. MATCH Commun Math Co. 2008;59:647–652.
28. Yao Y, Kong F, Dai Q, He P. A Sequence-Segmented Method Applied to the Similarity Analysis of Long Protein Sequence. MATCH Commun Math Co. 2013;70:431–450.
29. He P, Xu S, Dai Q, Yao Y. A generalization of CGR representation for analyzing and comparing protein sequences. Int J Quantum Chem. 2016;116:476–482. doi: 10.1002/qua.25068.
30. Dai Q, et al. Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position. BMC Bioinformatics. 2013;14:152. doi: 10.1186/1471-2105-14-152.
31. Dai Q, et al. Study of LZ-word distribution and its application for sequence comparison. J Theor Biol. 2013;336:52–60. doi: 10.1016/j.jtbi.2013.07.008.
32. Hamori E, Ruskin J. H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J Biol Chem. 1983;258:1318–1327.
33. Randić M, Vračko M, Lerš N, Plavšić D. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. Chem Phys Lett. 2003;371:202–207. doi: 10.1016/S0009-2614(03)00244-6.
34. Wąż P, Bielińska-Wąż D. 3D-dynamic representation of DNA sequences. J Mol Model. 2014;20:2141. doi: 10.1007/s00894-014-2141-8.
35. Jeong BS, Bari ATG, Rokeya RM, Jeon S, Lim CG. Codon-based encoding for DNA sequence analysis. Methods. 2014;67:373–379. doi: 10.1016/j.ymeth.2014.01.016.
36. Bari AT, Reaz MR, Islam AK, Choi HJ, Jeong BS. Effective Encoding for DNA Sequence Visualization Based on Nucleotide’s Ring Structure. Evol Bioinform. 2013;9:251–261. doi: 10.4137/EBO.S12160.
37. Xie X, Guan J, Zhou S. Similarity evaluation of DNA sequences based on frequent patterns and entropy. BMC Genomics. 2015;16:1–10. doi: 10.1186/1471-2164-16-S3-S1.
38. Yu HJ, Huang DS. Graphical Representation for DNA Sequences via Joint Diagonalization of Matrix Pencil. IEEE J Biomed Health. 2013;17:503–511. doi: 10.1109/TITB.2012.2227146.
39. Hou W, Pan Q, He M. A novel representation of DNA sequence based on CMI coding. Physica A. 2014;409:87–96. doi: 10.1016/j.physa.2014.04.030.
40. Li Y, Liu Q, Zheng X. DUC-Curve, a highly compact 2D graphical representation of DNA sequences and its application in sequence alignment. Physica A. 2016;456:256–270. doi: 10.1016/j.physa.2016.03.061.
41. Yin C. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction. J Bioinf Comput Biol. 2015;13:1550004. doi: 10.1142/S0219720015500043.
42. Peng Y, Liu Y. A Novel Numerical Characterization for Graphical Representations of DNA Sequences. Mini-Rev Org Chem. 2015;12:534–539. doi: 10.2174/1570193X13666151218191218.
43. Cheng J, Shan, Ping S. 4D Graphical representation research of DNA sequences. Int J Biomath. 2015;08:47–58.
44. Manoj KG, Rajdeep N, Manoj M. A new adjacent pair 2D graphical representation of DNA sequences. J Biol Syst. 2013;21:196–244.
45. Zhang Z, et al. ColorSquare: A colorful square visualization of DNA sequences. MATCH Commun Math Comput Chem. 2012;68:621–637.
46. Dai Q, Liu X, Wang T. A novel graphical representation of DNA sequences and its application. J Mol Graph Model. 2006;25:340–344. doi: 10.1016/j.jmgm.2005.12.004.
47. Liu Y, Wang T. Related matrices of DNA primary sequences based on triplets of nucleic acid bases. Chem Phys Lett. 2006;417:173–178. doi: 10.1016/j.cplett.2005.10.007.
48. Jin X, et al. A novel DNA sequence similarity calculation based on simplified pulse-coupled neural network and Huffman coding. Physica A. 2016;461:325–338. doi: 10.1016/j.physa.2016.05.004.
49. Li Y, Xiao W. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison. Comput Math Method M. 2016;2:1–12. doi: 10.1155/2016/3262813.

[CR1] 1.Jafarzadeh N, Iranmanesh A. C-curve: a novel 3D graphical representation of DNA sequence based on codons. Math Biosci. 2013;241:217–224. doi: 10.1016/j.mbs.2012.11.009. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Liao B, Xiang Q, Cai L, Cao Z. A new graphical coding of DNA sequence and its similarity calculation. Physica A. 2013;392:4663–4667. doi: 10.1016/j.physa.2013.05.015. [DOI] [Google Scholar]

[CR3] 3.Yang X, Wang T. Linear regression model of short k-word: A similarity distance suitable for biological sequences with various lengths. J Theor Biol. 2013;337:61–70. doi: 10.1016/j.jtbi.2013.07.028. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Wąż P, Bielińskawąż D. Non-standard similarity/dissimilarity analysis of DNA sequences. Genomics. 2014;104:464–471. doi: 10.1016/j.ygeno.2014.08.010. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Aram V, Iranmanesh A, Majid Z. Spider representation of DNA sequences. J Comput Theor Nanos. 2014;11:418–420. doi: 10.1166/jctn.2014.3371. [DOI] [Google Scholar]

[CR6] 6.Liu YW, Peng Y. A novel technique for analyzing the similarity and dissimilarity of DNA sequences. Genet Mol Res. 2014;13:570–577. doi: 10.4238/2014.January.28.2. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Yin C, Yin XE, Wang J. A novel method for comparative analysis of DNA sequences by Ramanujan-Fourier transform. J Comput Biol. 2014;21:867–879. doi: 10.1089/cmb.2014.0120. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Li C, Fei WC, Zhao Y, Yu XQ. Novel Graphical Representation and Numerical Characterization of DNA Sequences. Applied Sciences. 2016;6:63. doi: 10.3390/app6030063. [DOI] [Google Scholar]

[CR9] 9.Xu X, Zhu F. A New Method to Digitize DNA Sequence. J Biosci Med. 2017;05:7–12. [Google Scholar]

[CR10] 10.Bielińskawąż D, Wąż P. Spectral-dynamic representation of DNA sequences. J Biomed Inform. 2017;72:1–7. doi: 10.1016/j.jbi.2017.06.001. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Panas D, Wąż P, Bielińskawąż D, Nandy A, Basak SC. 2D-Dynamic Representation of DNA/RNA Sequences as a Characterization Tool of the Zika Virus Genome. MATCH Commun. Math Comput Chem. 2017;77:321–332. [Google Scholar]

[CR12] 12.Ma T, Liu Y, Dai Q, Yao Y, He PA. A graphical representation of protein based on a novel iterated function system. Physica A. 2014;403:21–28. doi: 10.1016/j.physa.2014.01.067. [DOI] [Google Scholar]

[CR13] 13.Li Y, Liu Q, Zheng X, He PA. UC-Curve: A highly compact 2D graphical representation of protein sequences. Int. J Quantum Chem. 2014;114:409–415. doi: 10.1002/qua.24581. [DOI] [Google Scholar]

[CR14] 14.Yao Y, Yan S, Han J, Dai Q, He PA. A novel descriptor of protein sequences and its application. J Theor Biol. 2014;347:109–117. doi: 10.1016/j.jtbi.2014.01.001. [DOI] [PubMed] [Google Scholar]

[CR15] 15.Yao Y, et al. Similarity/Dissimilarity Analysis of Protein Sequences Based on a New Spectrum-Like Graphical Representation. Evol Bioinform Online. 2014;10:87–96. doi: 10.4137/EBO.S14713. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Xu SC, Li Z, Zhang SP, Hu JL. Primary structure similarity analysis of proteins sequences by a new graphical representation. SAR QSAR Environ Res. 2014;25:791–803. doi: 10.1080/1062936X.2014.955055. [DOI] [PubMed] [Google Scholar]

[CR17] 17.El-Lakkani A, Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR QSAR Environ. Res. 2015;26:125–137. doi: 10.1080/1062936X.2014.995700. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Hou W, Pan Q, He M. A new graphical representation of protein sequences and its applications. Physica A. 2016;444:996–1002. doi: 10.1016/j.physa.2015.10.067. [DOI] [Google Scholar]

[CR19] 19.Czerniecka A, Bielińskawąż D, Wąż P, Clark T. 20D-dynamic Representation of Protein Sequences. Genomics. 2016;107:16–23. doi: 10.1016/j.ygeno.2015.12.003. [DOI] [PubMed] [Google Scholar]

[CR20] 20.Ping P, Zhu X, Wang L. Similarities/dissimilarities analysis of protein sequences based on pca-fft. J Biol Syst. 2017;25:1–17. doi: 10.1142/S0218339017500024. [DOI] [Google Scholar]

[CR21] 21.Hu H, Li Z, Dong H, Zhou T. Graphical Representation and Similarity Analysis of Protein Sequences Based on Fractal Interpolation. IEEE ACM T Comput Bi. 2017;14:182–192. doi: 10.1109/TCBB.2015.2511731. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Liao B, Liao L, Wu R, Li R. Construction of the phylogenetic tree by self-organizing map based on encoding sequence. J Comput Theor Nanos. 2012;9:826–830. doi: 10.1166/jctn.2012.2103. [DOI] [Google Scholar]

[CR23] 23.Liao B, Liao BY, Lu X, Cao Z. A Novel Graphical Representation of Protein Sequences and Its Application. J Comput Chem. 2011;32:2539–2544. doi: 10.1002/jcc.21833. [DOI] [PubMed] [Google Scholar]

[CR24] 24.Liao B, Liao B, Sun X, Zeng Q. A Novel method for similarity analysis and protein subcellular localization prediction. Bioinformatics. 2010;26:2678–2683. doi: 10.1093/bioinformatics/btq521. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Li X, Liao B, Zeng Q, Luo J. Protein functional class prediction using global encoding of amino acid sequence. J Theor Biol. 2009;261:290–293. doi: 10.1016/j.jtbi.2009.07.017. [DOI] [PubMed] [Google Scholar]

[CR26] 26.Huang G, Liao B, Li R. Similarity studies of DNA sequences based on a new 2D graphical representation. Biophys Chem. 2009;143:55–59. doi: 10.1016/j.bpc.2009.03.013. [DOI] [PubMed] [Google Scholar]

[CR27] 27.Liao B, Zeng C, Li F, Tang Y. Analysis of Similarity/Dissimilarity of DNA Sequences Based on Dual Nucleotides. MATCH Commun Math Co. 2008;59:647–652. [Google Scholar]

[CR28] 28.Yao Y, Kong F, Dai Q, He P. A Sequence-Segmented Method Applied to the Similarity Analysis of Long Protein Sequence. MATCH Commun Math Co. 2013;70:431–450. [Google Scholar]

[CR29] 29.He P, Xu S, Dai Q, Yao Y. A generalization of CGR representation for analyzing and comparing protein sequences. Int J Quantum Chem. 2016;116:476–482. doi: 10.1002/qua.25068. [DOI] [Google Scholar]

[CR30] 30.Dai Q, et al. Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position. BMC Bioinformatics. 2013;14:152. doi: 10.1186/1471-2105-14-152. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Dai Q, et al. Study of LZ-word distribution and its application for sequence comparison. J Theor Biol. 2013;336:52–60. doi: 10.1016/j.jtbi.2013.07.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Hamori E, Ruskin J. H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J Biol Chem. 1983;258:1318–1327. [PubMed] [Google Scholar]

[CR33] 33.Randić M, Vračko M, Lerš N, Plavšić D. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. Chem Phys Lett. 2003;371:202–207. doi: 10.1016/S0009-2614(03)00244-6. [DOI] [Google Scholar]

[CR34] 34.Wąż P, Bielińska-Wąż D. 3D-dynamic representation of DNA sequences. J Mol Model. 2014;20:2141. doi: 10.1007/s00894-014-2141-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Jeong BS, Bari ATG, Rokeya RM, Jeon S, Lim CG. Codon-based encoding for DNA sequence analysis. Methods. 2014;67:373–379. doi: 10.1016/j.ymeth.2014.01.016. [DOI] [PubMed] [Google Scholar]

[CR36] 36.Bari AT, Reaz MR, Islam AK, Choi HJ, Jeong BS. Effective Encoding for DNA Sequence Visualization Based on Nucleotide’s Ring Structure. Evol Bioinform. 2013;9:251–261. doi: 10.4137/EBO.S12160. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Xie X, Guan J, Zhou S. Similarity evaluation of DNA sequences based on frequent patterns and entropy. BMC Genomics. 2015;16:1–10. doi: 10.1186/1471-2164-16-S3-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Yu HJ, Huang DS. Graphical Representation for DNA Sequences via Joint Diagonalization of Matrix Pencil. IEEE J Biomed Health. 2013;17:503–511. doi: 10.1109/TITB.2012.2227146. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Hou W, Pan Q, He M. A novel representation of DNA sequence based on CMI coding. Physica A. 2014;409:87–96. doi: 10.1016/j.physa.2014.04.030. [DOI] [Google Scholar]

[CR40] 40.Li Y, Liu Q, Zheng X. DUC-Curve, a highly compact 2D graphical representation of DNA sequences and its application in sequence alignment. Physica A. 2016;456:256–270. doi: 10.1016/j.physa.2016.03.061. [DOI] [Google Scholar]

[CR41] 41.Yin C. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction. J Bioinf Comput Biol. 2015;13:1550004. doi: 10.1142/S0219720015500043. [DOI] [PubMed] [Google Scholar]

[CR42] 42.Peng Y, Liu Y. A Novel Numerical Characterization for Graphical Representations of DNA Sequences. Mini-Rev Org Chem. 2015;12:534–539. doi: 10.2174/1570193X13666151218191218. [DOI] [Google Scholar]

[CR43] 43.Cheng J, Shan, Ping S. 4D Graphical representation research of DNA sequences. Int J Biomath. 2015;08:47–58. [Google Scholar]

[CR44] 44.Manoj KG, Rajdeep N, Manoj M. A new adjacent pair 2D graphical representation of DNA sequences. J Biol Syst. 2013;21:196–244. [Google Scholar]

[CR45] 45.Zhang Z, et al. ColorSquare: A colorful square visualization of DNA sequences. MATCH Commun Math Comput Chem. 2012;68:621–637. [Google Scholar]

[CR46] 46.Dai Q, Liu X, Wang T. A novel graphical representation of DNA sequences and its application. J Mol Graph Model. 2006;25:340–344. doi: 10.1016/j.jmgm.2005.12.004. [DOI] [PubMed] [Google Scholar]

[CR47] 47.Liu Y, Wang T. Related matrices of DNA primary sequences based on triplets of nucleic acid bases. Chem Phys Lett. 2006;417:173–178. doi: 10.1016/j.cplett.2005.10.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR48] 48.Jin X, et al. A novel DNA sequence similarity calculation based on simplified pulse-coupled neural network and Huffman coding. Physica A. 2016;461:325–338. doi: 10.1016/j.physa.2016.05.004. [DOI] [Google Scholar]

[CR49] 49.Li Y, Xiao W. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison. Comput Math Method M. 2016;2:1–12. doi: 10.1155/2016/3262813. [DOI] [PMC free article] [PubMed] [Google Scholar]
