A new method to analyze protein sequence similarity using Dynamic Time Warping

Wenbing Hou; Qiuhui Pan; Qianying Peng; Mingfeng He

Genomics. 2016 Dec 11;109(2):123–130. doi: 10.1016/j.ygeno.2016.12.002


PMCID: PMC7125777 PMID: 27974244

Abstract

Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins using the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physico-chemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software. In the comparison we find that our scheme can amend some wrong classifications that appear in other software. The comparison shows that our approach is reasonable and effective.

Keywords: Protein sequences similarity analysis, Discrete Fourier Transform, Dynamic Time Warping, Phylogenetic tree

Highlights

• We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins.
• We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins.
• Different datasets are used to demonstrate our model's effectiveness.

1. Introduction

With the advance of sequencing techniques, databases of DNA, RNA and protein sequences have grown rapidly, promoting the development of bioinformatics. It has become increasingly important to develop efficient ways to obtain the information hidden in gene data. In the last few decades, several methods to classify genes have been proposed. In 1983, Hamori and Ruskin proposed a 3-D curve named the H-curve to reveal the relations between different DNA sequences [1]. As the first graphical representation, it motivated other researchers in the following years to develop more graphical representations of DNA sequences, including 2D, 3D and even multidimensional representations [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Besides graphical representations, researchers have combined techniques from other disciplines with the study of genes and proposed novel methods. For example, the Discrete Fourier Transform, which is broadly applied in signal processing, has been introduced into the processing of gene sequences [15], [16], where it has proved effective in the analysis of DNA sequences.

Methods for similarity analysis of proteins have also been proposed recently. Because a protein sequence consists of 20 different amino acids while a DNA sequence consists of only four bases, representing a protein is much more complex than representing a DNA sequence. Nevertheless, some methods generalize approaches originally developed for DNA sequences [17], [18], [19], [20], [21]. Yau et al. proposed a method named the protein map [22], following their previous work. They use moment vectors to represent proteins and generate a universal protein map [23]. Motivated by the protein map, they also developed a novel method, named protein space, to realize the nature of the protein universe [24]. Their method was applied successfully in subsequent papers and proved effective [25], [26]. He et al. presented a generalized Chaos Game Representation (CGR) method that yields a dynamic 3D graphical representation [27], analogous to the original CGR method proposed by Jeffrey for the graphical representation of DNA [3]. El-Lakkani and Mahran introduced a two-dimensional graphical representation of protein sequences, together with a new mathematical descriptor to measure the similarity of two protein sequences [28]. Li et al. presented a graphical representation named the UC-Curve [29], in which the amino acids are assigned to the circumference of a unit circle in a cyclic order; geometric center vectors of UC-Curves and Euclidean distances are extracted to analyze pairwise similarities. Moreover, techniques from other disciplines have been applied successfully in the analysis of proteins. Wąż and Bielińska-Wąż introduced moments of inertia as new descriptors in the calculation of similarities [30], [31]. Based on their work, Czerniecka et al. proposed a 20-D dynamic representation of protein sequences [32], and the scheme proved reasonable.

In this paper, we outline a new method based on the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to calculate the similarities of proteins. The original symbol sequences are converted to numerical sequences according to their physico-chemical properties, and the similarities are calculated based on DFT and DTW. We test our scheme on different datasets and compare our results with those of existing software. The results of our tests agree well with the known evolutionary relationships.

2. Models and methods

2.1. Numerical representation of protein sequence

Amino acids are the basic components of proteins, and the study of proteins usually starts with the study of amino acids. The physico-chemical properties of amino acids are considered to have a strong effect on the properties of proteins [33], so studying the similarity of proteins through the properties of their amino acids is an effective approach. In our work, we choose two main amino acid properties, namely hydropathy and isoelectric point, to construct a new representation of protein sequences. The detailed values are given in Table 1. All values are taken from reference [23].

Table 1.

Hydropathy and isoelectric point values of 20 amino acids.

Amino acid	Abbreviation	Hydropathy	Isoelectric point
Isoleucine	I	4.5	6.02
Valine	V	4.2	5.96
Leucine	L	3.8	5.98
Phenylalanine	F	2.8	5.48
Cysteine	C	2.5	5.07
Methionine	M	1.9	5.74
Alanine	A	1.8	6.00
Glycine	G	−0.4	5.97
Threonine	T	−0.7	6.16
Serine	S	−0.8	5.68
Tryptophan	W	−0.9	5.89
Tyrosine	Y	−1.3	5.66
Proline	P	−1.6	6.30
Histidine	H	−3.2	7.59
Aspartic acid	D	−3.5	2.77
Asparagine	N	−3.5	5.41
Glutamic acid	E	−3.5	3.22
Glutamine	Q	−3.5	5.65
Lysine	K	−3.9	9.74
Arginine	R	−4.5	10.76


According to the hydropathy values of the amino acids, we can rank them: I > V > L > F > C > M > A > G > T > S > W > Y > P > H > D > N > E > Q > K > R. An angle θ _i (in radians) is assigned to each amino acid according to its hydropathy rank. Following the ranks, the value of θ _i changes from 0 to 2π at intervals of $\frac{1}{20} π$ . For example, the angle 0 is assigned to amino acid I and $\frac{6}{20} π$ is assigned to amino acid A. Similarly, we list another ranking based on the isoelectric point values: R > K > H > P > T > I > A > L > G > V > W > M > S > Y > Q > F > N > C > E > D. A second angle φ _i, ranging from 0 to 2π at intervals of $\frac{1}{20} π$ , is assigned to the amino acids according to the isoelectric point rank. We then build a three-dimensional representation of the amino acids. The coordinates of the amino acids are calculated as follows:

x_{i} = sin (θ_{i}) cos (φ_{i}), y_{i} = sin (θ_{i}) sin (φ_{i}), z_{i} = cos (θ_{i}), i = 1, 2, \dots 20

(1)

Now, an amino acid sequence S = s ₁ s ₂ s ₃ … s _N of length N can be represented by a new sequence F = {c ₁, c ₂, c ₃, …c _N}, where c _i = (x _i, y _i, z _i). The coordinates along each axis are extracted separately, forming three new sequences

\begin{array}{c} u_{1} (n) = \{u_{1} (0), u_{1} (1), \dots u_{1} (N - 1)\} = \{x_{1}, x_{2}, \dots x_{N}\}, \\ u_{2} (n) = \{u_{2} (0), u_{2} (1), \dots u_{2} (N - 1)\} = \{y_{1}, y_{2}, \dots y_{N}\}, \\ u_{3} (n) = \{u_{3} (0), u_{3} (1), \dots u_{3} (N - 1)\} = \{z_{1}, z_{2}, \dots z_{N}\}, \end{array}

Every symbol sequence is therefore represented by three numerical sequences in our method. The representation is unique because every amino acid has a unique coordinate, which means our approach avoids confusion between similar proteins.
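The encoding described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `HYDROPATHY_RANK`, `ISOELECTRIC_RANK`, `STEP`, `amino_acid_coordinates` and `encode` are hypothetical, and the angular step is assumed to be π/20 as implied by the worked example (I at 0, A at 6π/20), so the twenty angles run from 0 to 19π/20. If a different step is intended, only the `STEP` constant would change.

```python
import math

# Rank orders exactly as listed in the text (highest value first).
HYDROPATHY_RANK = "IVLFCMAGTSWYPHDNEQKR"
ISOELECTRIC_RANK = "RKHPTIALGVWMSYQFNCED"

# Assumption: angular step of pi/20, as in the worked example for I and A.
STEP = math.pi / 20


def amino_acid_coordinates():
    """Return {amino acid: (x, y, z)} following Eq. (1)."""
    coords = {}
    for aa in HYDROPATHY_RANK:
        theta = HYDROPATHY_RANK.index(aa) * STEP   # angle from hydropathy rank
        phi = ISOELECTRIC_RANK.index(aa) * STEP    # angle from isoelectric rank
        coords[aa] = (
            math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta),
        )
    return coords


def encode(sequence):
    """Convert a protein string into the three sequences u1, u2, u3."""
    coords = amino_acid_coordinates()
    points = [coords[s] for s in sequence]
    u1 = [p[0] for p in points]
    u2 = [p[1] for p in points]
    u3 = [p[2] for p in points]
    return u1, u2, u3


if __name__ == "__main__":
    u1, u2, u3 = encode("MIAV")  # toy sequence, not taken from the paper
    print(u1, u2, u3)
```

Under this assumption every amino acid has a distinct θ in [0, π), so the z coordinates alone already differ, which is consistent with the uniqueness claim above.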

2.2. Discrete Fourier Transform

The DFT is a common tool in signal processing that transforms signals from the time domain into the frequency domain. Latent information hidden in the time-domain signal can be revealed by this transformation without any loss. In recent years, the DFT has also been used in the analysis of DNA sequences. Classical applications of the DFT include prediction of the location of exons in DNA sequences, genomic signature and periodicity analysis [34], [35], [36], [37].

Considering the signal sequences u ₁(n), u ₂(n) and u ₃(n) defined in Section 2.1, the DFT of the signal at frequency k is calculated by

U_{i} (k) = DFT [u_{i} (n)] = \sum_{n = 0}^{N - 1} u_{i} (n) e^{- j \frac{2 π}{N} nk}, k = 0, 1, \dots, N - 1; i = 1, 2, 3

(2)

where $j = \sqrt{- 1}$ .

In our method, every protein sequence is represented by three numerical sequences. The DFT power spectrum of the signal at frequency k is defined as

PS (k) = \sum_{i = 1}^{3} {|U_{i} (k)|}^{2}, k = 0, 1, \dots, N - 1

(3)
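Eqs. (2) and (3) can be written out directly as a naive O(N²) transform. The sketch below is illustrative only; the function names `dft` and `power_spectrum` are hypothetical, and a fast Fourier transform routine from a numerical library would normally replace the direct sum for long sequences.

```python
import cmath


def dft(u):
    """Direct evaluation of Eq. (2): U(k) = sum_n u(n) * exp(-j*2*pi*n*k/N)."""
    N = len(u)
    return [
        sum(u[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def power_spectrum(u1, u2, u3):
    """Eq. (3): PS(k) = |U1(k)|^2 + |U2(k)|^2 + |U3(k)|^2."""
    if not (len(u1) == len(u2) == len(u3)):
        raise ValueError("the three coordinate sequences must have equal length")
    transforms = [dft(u) for u in (u1, u2, u3)]
    return [sum(abs(U[k]) ** 2 for U in transforms) for k in range(len(u1))]


if __name__ == "__main__":
    # Toy input: a unit impulse on the first axis only, so every |U1(k)| is 1.
    print(power_spectrum([1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]))
```

A quick sanity check is Parseval's relation: the sum of PS(k) over k equals N times the sum of the squared coordinate values.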

2.3. Dynamic Time Warping

Dynamic Time Warping has been widely used in the analysis of speech signals. It was first proposed by Sakoe and Chiba in 1978 [38], with the aim of eliminating nonlinear fluctuation along the time axis of speech patterns. This property can be used in the analysis of genes if we treat protein sequences as genomic signal inputs. Recently, researchers have applied the DTW algorithm to the analysis of genetic signals. Skutkova et al. used DTW to classify DNA signals and obtained some excellent results [39], [40].

In Fig. 1, we give an example to illustrate the function of DTW. Assume data1 and data2, which have similar wave shapes, are the same spoken word from different speakers. Subfigure a shows that the two original signals have similar shapes but are clearly on different time scales, so it is hard to tell whether they come from the same word. In subfigure b, however, the two signals have the same wave shape after DTW, which indicates that they come from the same word.

In this paper, DTW is applied to calculate the distance between different power spectra. We assume there are two power spectra

{PS}_{1} (k_{1}) \ (k_{1} = 0, 1, \dots, M - 1), \quad {PS}_{2} (k_{2}) \ (k_{2} = 0, 1, \dots, N - 1)

To simplify the notation, we use two sequences to represent the two power spectra

\begin{array}{c} a_{1}, a_{2}, \dots, a_{p}, \dots, a_{M} \\ b_{1}, b_{2}, \dots, b_{q}, \dots, b_{N} \end{array}

where a _p = PS ₁(p − 1) , b _q = PS ₂(q − 1) (p = 1, 2, …M; q = 1, 2, …N).

Define a distance

d (p, q) = {‖a_{p} - b_{q}‖}_{2} (p = 1, 2, \dots M, q = 1, 2, \dots N)

as a metric of the difference between feature vectors a _p and b _q. The accumulated distance is calculated by formula (4).

D (p, q) = \begin{cases} 2 d (1, 1), & p = 1; q = 1 \\ d (1, q) + D (1, q - 1), & p = 1; 2 \leq q \leq N \\ d (p, 1) + D (p - 1, 1), & 2 \leq p \leq M; q = 1 \\ \min [D (p - 1, q) + d (p, q), D (p, q - 1) + d (p, q), D (p - 1, q - 1) + 2 d (p, q)], & 2 \leq p \leq M; 2 \leq q \leq N \end{cases}

(4)

The accumulated distance thus depends on the pairwise distance d(p,q) and the minimum of the previously computed values. The values of D(p,q), which are used as the metric of similarity of two sequences, form a table. The warping path is derived by minimization along the backward path from the upper right corner to the lower left corner [39]. For two sequences, the smaller D(p,q) is, the more similar they are.
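The recurrence in formula (4) maps naturally onto a small dynamic-programming table. The sketch below is a hedged illustration rather than the authors' code: the function name `dtw_distance` is hypothetical, the inputs are treated as scalar power-spectrum values (so the 2-norm reduces to an absolute difference), and returning the corner value D(M, N) as the final distance is an assumption, since the text does not say how the table is reduced to a single number or whether it is normalized.

```python
def dtw_distance(a, b):
    """Accumulated DTW distance between two numeric sequences, per formula (4).

    Assumption: the final distance is the corner entry D(M, N), unnormalized.
    """
    M, N = len(a), len(b)
    if M == 0 or N == 0:
        raise ValueError("both sequences must be non-empty")

    def d(p, q):
        # Pairwise distance; for scalars the 2-norm is the absolute difference.
        return abs(a[p] - b[q])

    # D[p][q] uses 0-based indices, i.e. D[p][q] here is D(p+1, q+1) in the text.
    D = [[0.0] * N for _ in range(M)]
    D[0][0] = 2 * d(0, 0)
    for q in range(1, N):                      # first row: p = 1
        D[0][q] = d(0, q) + D[0][q - 1]
    for p in range(1, M):                      # first column: q = 1
        D[p][0] = d(p, 0) + D[p - 1][0]
    for p in range(1, M):
        for q in range(1, N):
            D[p][q] = min(
                D[p - 1][q] + d(p, q),          # step along p only
                D[p][q - 1] + d(p, q),          # step along q only
                D[p - 1][q - 1] + 2 * d(p, q),  # diagonal step, weight 2
            )
    return D[M - 1][N - 1]


if __name__ == "__main__":
    # The second sequence repeats one sample; warping absorbs the repetition.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0
    print(dtw_distance([0, 0], [1, 1]))           # 4
```

Note that the two sequences may have different lengths M and N, which is what allows power spectra of proteins of different lengths to be compared.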

3. Results and discussion

To verify the proposed approach, we choose datasets from various species and carry out several experiments. We construct phylogenetic trees to obtain the cluster results and illustrate the evolutionary distance between species.

3.1. ND5 protein sequences of 22 species

Our scheme is first applied to 22 animal species. We choose the NADH dehydrogenase subunit 5 (ND5) sequences from the NCBI database as our inputs. The information for all sequences used is listed in Table 2. Table 3 presents our results. A number is assigned to each species in the table: 1-blue whale, 2-Bornean orangutan, 3-cat, 4-common chimpanzee, 5-fin whale, 6-gibbon, 7-gorilla, 8-gray seal, 9-harbor seal, 10-human, 11-horse, 12-mouse, 13-opossum, 14-pigmy chimpanzee, 15-platypus, 16-rat, 17-rhino, 18-Sumatran orangutan, 19-wallaroo, 20-tiger, 21-Korean bovine, 22-Spain bovine. Table 3 shows that the pairs (blue whale, fin whale), (common chimpanzee, pigmy chimpanzee) and (Korean bovine, Spain bovine) have shorter distances in our analysis. The homologies revealed in the table agree well with the known evolutionary relationships. We also construct the phylogenetic tree of the 22 species in Fig. 2.

Table 2.

Information of sequences used in our test.

Sequence name	NCBI accession number
Blue whale	NP_007066
Bornean orangutan	NP_008235
Cat	NP_008261
Common chimpanzee	NP_008196
Fin whale	NP_006899
Gibbon	NP_007832
Gorilla	NP_008222
Gray seal	NP_007079
Harbor seal	NP_006938
Human	AP_000649
Horse	ADQ55101
Mouse	NP_904338
Opossum	NP_007105
Pigmy chimpanzee	NP_008209
Platypus	NP_008053
Rat	AP_004902
Rhino	YP_002520019
Sumatran orangutan	NP_007845
Wallaroo	NP_007404
Tiger	ADK73290
Korean bovine	YP_209215
Spain bovine	AKK32014


Table 3.

Similarity/dissimilarity matrix of the 22 animal species.

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
1	0
2	0.773	0
3	0.6073	0.7402	0
4	0.7356	0.5935	0.7373	0
5	0.2693	0.7698	0.573	0.7089	0
6	0.8056	0.6492	0.7325	0.6371	0.796	0
7	0.8362	0.7991	0.7825	0.6393	0.8809	0.6817	0
8	0.6922	0.787	0.6168	0.8027	0.7342	0.7899	0.9324	0
9	0.7234	0.7823	0.6484	0.7766	0.7437	0.7727	0.8938	0.2107	0
10	0.7832	0.6711	0.722	0.4971	0.7885	0.5414	0.6569	0.7941	0.8339	0
11	0.7148	0.7234	0.6448	0.7555	0.7189	0.7683	0.8991	0.683	0.6984	0.7577	0
12	0.7184	0.8341	0.7096	0.8027	0.7216	0.8231	0.9665	0.7239	0.7332	0.8207	0.7348	0
13	0.7284	0.8274	0.7539	0.8621	0.7458	0.8686	1	0.7839	0.7394	0.8784	0.7287	0.7241	0
14	0.7525	0.6402	0.7168	0.4056	0.7466	0.6185	0.6591	0.8118	0.7849	0.5381	0.7685	0.8589	0.9143	0
15	0.7398	0.7428	0.7173	0.7593	0.7239	0.7997	0.8807	0.7663	0.7657	0.7718	0.7532	0.7776	0.8091	0.7927	0
16	0.7544	0.9225	0.7435	0.8505	0.7864	0.7991	0.8177	0.8257	0.8221	0.7799	0.7279	0.8625	0.8292	0.8865	0.8853	0
17	0.7875	0.7944	0.8335	0.8411	0.8157	0.8174	0.8964	0.8312	0.8217	0.852	0.7951	0.8116	0.792	0.8712	0.8102	0.8054	0
18	0.6592	0.5813	0.6342	0.658	0.6543	0.7318	0.8514	0.7249	0.7178	0.7364	0.6682	0.7195	0.7576	0.7077	0.6915	0.7887	0.7581	0
19	0.6858	0.7034	0.7492	0.7788	0.6866	0.7099	0.8833	0.7057	0.7166	0.7887	0.6786	0.7561	0.7009	0.7752	0.6889	0.8266	0.8192	0.6458	0
20	0.6882	0.8015	0.5009	0.803	0.6389	0.809	0.8869	0.6557	0.6988	0.7584	0.7406	0.7652	0.7902	0.8204	0.8072	0.7709	0.8307	0.699	0.7963	0
21	0.6547	0.708	0.648	0.7473	0.6951	0.6789	0.793	0.6331	0.6644	0.7088	0.6631	0.7599	0.7063	0.7526	0.7416	0.7576	0.78	0.6697	0.6959	0.6222	0
22	0.6455	0.7223	0.5916	0.747	0.6751	0.6874	0.7846	0.6316	0.6914	0.6847	0.652	0.7448	0.7061	0.7319	0.7017	0.726	0.768	0.6829	0.7218	0.5976	0.05929	0


Fig. 2 reveals some reasonable cluster results. We find that the primates, such as common chimpanzee, pigmy chimpanzee, human, gorilla, orangutan and gibbon, are much closer to one another in evolutionary distance than to other species. In addition, the different kinds of whale, bovine and seal are each located in the same branch. All the classifications we have obtained agree with classical evolutionary theory. As a comparison, we apply the method in Ref. [21] to analyze the dataset in Table 2. The results are shown in Fig. 3. The results in Fig. 2 and Fig. 3 have similar clusters, but there are also some differences. The Bornean orangutan and Sumatran orangutan should be more closely related to each other than to other species, yet they are located in different branches in Fig. 3. Besides, the results in Fig. 3 also indicate that the two kinds of whales and the two kinds of bovines are much closer than others in the phylogeny. In Fig. 2, all the improper classifications are corrected. This experiment indicates that our scheme is effective in the similarity analysis of animal ND5 proteins.

Fig. 3 — The phylogenetic tree of 22 species based on Yau's protein map.
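To illustrate how a distance table such as Table 3 leads to clusters, the sketch below runs average-linkage (UPGMA) agglomeration on six of the 22 species, using the corresponding entries copied from Table 3. The text does not state which tree-building algorithm was used for Fig. 2, so UPGMA is an assumption here, and the names `DIST`, `pair_distance`, `average_linkage` and `upgma_merges` are hypothetical.

```python
from itertools import combinations

NAMES = ["blue whale", "fin whale", "gray seal",
         "harbor seal", "Korean bovine", "Spain bovine"]

# Pairwise distances copied from Table 3 (species 1, 5, 8, 9, 21 and 22).
DIST = {
    ("blue whale", "fin whale"): 0.2693,
    ("blue whale", "gray seal"): 0.6922,
    ("blue whale", "harbor seal"): 0.7234,
    ("blue whale", "Korean bovine"): 0.6547,
    ("blue whale", "Spain bovine"): 0.6455,
    ("fin whale", "gray seal"): 0.7342,
    ("fin whale", "harbor seal"): 0.7437,
    ("fin whale", "Korean bovine"): 0.6951,
    ("fin whale", "Spain bovine"): 0.6751,
    ("gray seal", "harbor seal"): 0.2107,
    ("gray seal", "Korean bovine"): 0.6331,
    ("gray seal", "Spain bovine"): 0.6316,
    ("harbor seal", "Korean bovine"): 0.6644,
    ("harbor seal", "Spain bovine"): 0.6914,
    ("Korean bovine", "Spain bovine"): 0.05929,
}


def pair_distance(x, y):
    """Symmetric lookup in the lower-triangular table."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs (UPGMA criterion)."""
    total = sum(pair_distance(x, y) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def upgma_merges(names):
    """Return the list of merges as (cluster1, cluster2, distance)."""
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        merges.append((c1, c2, average_linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    for c1, c2, dist in upgma_merges(NAMES):
        print(f"{dist:.4f}  {c1} + {c2}")
```

On this subset the first three merges join the two bovines, the two seals and the two whales, matching the short distances noted for these pairs in Table 3.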

3.2. Neuraminidase proteins of influenza A virus

The influenza A virus has been a major threat to humans and animals [41]. The viruses are classified into different subtypes according to the viral surface proteins hemagglutinin and neuraminidase. To date, 18 H serotypes (H1 to H18) and 11 N serotypes (N1 to N11) of influenza A viruses have been identified. Influenza A viruses have caused epidemics among humans and animals. Some of the most lethal viruses are H1N1, H2N2, H5N1 and H7N9. We take 28 influenza A viruses as samples in our test. All the sequences are taken from the NCBI database. The sequence information is listed in Table 4.

Table 4.

Information of protein sequences used in this paper.

Sequence name	NCBI accession number
A/Adachi/2/1957(H2N2)	BAD16637.1
A/bar-headed_goose/Qinghai/1/2005(H5N1)	BAM85828.1
A/Beijing/4/2009(H1N1)	ACR67256.1
A/Berkeley/1/1968(H2N2)	BAD16641.1
A/blue-winged_teal/Ohio/566/2006(H7N9)	ABS89412.1
A/California/1/1966(H2N2)	AAO46235.1
A/California/04/2009(H1N1)	AEE69012.1
A/cat/Germany/R606/06(H5N1)	ABF61763.1
A/chicken/Dongguan/1096/2014(H7N9)	AJJ96855.1
A/Cygnus_olor/Italy/742/2006(H5N1)	ABF50822.1
A/chicken/Quzhou/2/2015(H7N9)	AKI82227.1
A/Duck/Ohio/118C/93(H1N1)	AAF77041.1
A/blue_winged_teal/Louisiana/A00557206/2009(H7N7)	ALT67567.1
A/canine/Guangxi/1/2011(H9N2)	AEK07935.1
A/chicken/China/AH-10-01/2010(H9N2)	AEE73586.1
A/chicken/Hubei/01-MA01/1999(H9N2)	AEO92432.1
A/chicken/Iran/B263/2004(H9N2)	ACD47112.1
A/England/1/1961(H2N2)	AAO46220.1
A/equine/Prague/1/1956(H7N7)	AAC57418.1
A/equine/Santiago/77(H7N7)	AAQ90293.1
A/fowl/Weybridge(H7N7)	AAA43425.1
A/Georgia/1/1967(H2N2)	AAO46244.1
A/goose/Czech_Republic/1848-K9/2009(H7N9)	ACX53685.1
A/GuangzhouSB/01/2009(H1N1)	ACR49238.1
A/Nagasaki/07N020/2008(H1N1)	ADC45738.1
A/lesser_white-fronted_goose/HuNan/412-3Y/2010(H7N7)	AIW60686.1
A/muscovy_duck/Vietnam/LBM66/2011(H5N1)	BAM36161.1
A/tree_sparrow/Shanghai/01/2013(H7N9)	AGW82590.1


We use the Mega software (version 6.06) to calculate the distances between the 28 influenza A viruses and draw the phylogenetic tree in Fig. 4. In Fig. 4, we notice that most of the viruses are classified correctly, except the virus (A/Duck/Ohio/118C/93(H1N1)), which belongs to H1N1 in virology. Clearly, this is an improper classification. Using the same virus data, we apply our method to calculate the similarity of the 28 influenza A viruses and obtain the results shown in Fig. 5. The cluster results from our method match the classification in virology: viruses of the same subtype are clustered in the same branch. The wrong classification produced by the Mega software is corrected by our method. Furthermore, viruses that appeared in adjacent years are much closer in the phylogeny. For example, the virus (A/blue-winged_teal/Ohio/566/2006(H7N9)) is much closer to the virus (A/goose/Czech_Republic/1848-K9/2009(H7N9)) than to (A/chicken/Quzhou/2/2015(H7N9)).

Fig. 5 — The phylogenetic tree of 28 influenza A virus calculated by our method.

As a further comparison, another software package is applied in our test. The cluster results from the Clustal X software are illustrated in Fig. 6. The results in Fig. 6 are similar to ours. However, as illustrated in the phylogenetic tree, the viruses that belong to H2N2 are clustered in different branches. We conclude from the figures that the results obtained from the different methods are in overall agreement, even though there is some variation between them. The phylogenetic trees in the different figures reveal similar classifications of the influenza A viruses. Among the three methods, our approach is the most accurate in this test.

3.3. Coronavirus spike proteins

As a further comparison, we construct a phylogenetic tree for 50 coronavirus spike proteins. Coronaviruses can cause severe epidemics, for example SARS. We use coronavirus spike proteins as inputs to test our method. All the data come from Table 3 in reference [21]. The relations revealed in Fig. 7 are similar to the phylogeny reported in reference [21]. All the SARS coronaviruses gather in the same branch, and coronaviruses from the same species are much more closely related. Based on the discussion above, our method is shown to be reasonable and effective.

Fig. 7 — The phylogenetic tree of 50 coronavirus spike proteins.

4. Conclusion

In this work, techniques from signal processing have been applied to the analysis of protein sequences. The approach in this paper provides an intuitive way to analyze protein sequences. We establish a novel measure based on the Discrete Fourier Transform and Dynamic Time Warping to analyze the similarity of protein sequences. Based on the values of hydropathy and isoelectric point, we assign different angles to the amino acids according to their ranks. A three-dimensional representation is constructed to represent all the amino acids. With the help of DFT and DTW, we obtain the power spectra and scale the spectra to the same length. The distances between species are evaluated by constructing phylogenetic trees. We use different datasets, including animals and viruses, to test our method. Compared with existing methods and software, the computational time of our algorithm is large. However, there are still ways to improve our method. For example, in the DTW process, a suitable filter or sampling method could be used to pick the important information from the results of the DFT instead of keeping all the values of the spectra. The DTW algorithm could also be improved to reduce the running time of our method. In the tests, we find that the method in this paper provides accurate classification of different species. An improved DFT-DTW method will be the goal of our future work.

References

1. Hamori E., Ruskin J. H-curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J. Biol. Chem. 1983;258:1318–1327.
2. Nandy A. A new graphical representation and analysis of DNA sequence structure. 1. Methodology and application to globin genes. Curr. Sci. 1994;66:309–314.
3. Jeffrey H.J. Chaos game representation of gene structure. Nucleic Acids Res. 1990;18:2163–2170. doi: 10.1093/nar/18.8.2163.
4. Randić M., Vračko M., Lerš N., Plavšić D. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. Chem. Phys. Lett. 2003;371:202–207.
5. Yau S.S.T., Wang J.S., Niknejad A., Lu C., Jin N., Ho Y.K. DNA sequence representation without degeneracy. Nucleic Acids Res. 2003;31:3078–3080. doi: 10.1093/nar/gkg432.
6. Liu X.Q., Dai Q., Xiu Z.L., Wang T.M. PNN-curve: a new 2D graphical representation of DNA sequences and its application. J. Theor. Biol. 2006;243:555–561. doi: 10.1016/j.jtbi.2006.07.018.
7. Liao B., Zhang Y., Ding K.Q., Wang T.M. Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation. Theochem. J. Mol. Struct. 2005;717:199–203.
8. Cao Z., Liao B., Li R. A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int. J. Quantum Chem. 2008;108:1485–1490.
9. Jafarzadeh N., Iranmanesh A. A novel graphical and numerical representation for analyzing DNA sequences based on codons. MATCH Commun. Math. Comput. Chem. 2012;68:611–620.
10. El-Lakkani A., El-Sherif S. Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Chem. Phys. Lett. 2013;590:192–195.
11. Jafarzadeh N., Iranmanesh A. C-curve: a novel 3D graphical representation of DNA sequence based on codons. Math. Biosci. 2013;241:217–224. doi: 10.1016/j.mbs.2012.11.009.
12. Yao Y.H., Nan X.Y., Wang T.M. A new 2D graphical representation - classification curve and the analysis of similarity/dissimilarity of DNA sequences. Theochem. J. Mol. Struct. 2006;764:101–108.
13. Hou W., Pan Q., He M. A novel 2D representation of genome sequence and its application. J. Comput. Theor. Nanosci. 2014;11:1745–1749.
14. Bo L., Tian-Ming W. New 2D graphical representation of DNA sequences. J. Comput. Chem. 2004;25:1364–1368. doi: 10.1002/jcc.20060.
15. Yin C., Yau S.S. An improved model for whole genome phylogenetic analysis by Fourier transform. J. Theor. Biol. 2015;382:99–110. doi: 10.1016/j.jtbi.2015.06.033.
16. Hoang T., Yin C., Zheng H., Yu C., Lucy He R., Yau S.S. A new method to cluster DNA sequences using Fourier power spectrum. J. Theor. Biol. 2015;372:135–145. doi: 10.1016/j.jtbi.2015.02.026.
17. Yu C., He R.L., Yau S.S. Protein sequence comparison based on K-string dictionary. Gene. 2013;529:250–256. doi: 10.1016/j.gene.2013.07.092.
18. Ma T., Liu Y., Dai Q., Yao Y., He P.-A. A graphical representation of protein based on a novel iterated function system. Phys. A. 2014;403:21–28.
19. He P.A., Li D., Zhang Y., Wang X., Yao Y. A 3D graphical representation of protein sequences based on the Gray code. J. Theor. Biol. 2012;304:81–87. doi: 10.1016/j.jtbi.2012.03.023.
20. Ling L., Fen K., Jilin H., Xuying N., Yuhua Y. A 3-D graphical method applied to the similarities of protein sequences. 2012 Spring Congress on Engineering and Technology (S-CET 2012). 2012 (4 pp.).
21. Gupta M.K., Niyogi R., Misra M. An alignment-free method to find similarity among protein sequences via the general form of Chou's pseudo amino acid composition. SAR QSAR Environ. Res. 2013;24:597–609. doi: 10.1080/1062936X.2013.773378.
22. Yau S.S.T., Yu C.L., He R. A protein map and its application. DNA Cell Biol. 2008;27:241–250. doi: 10.1089/dna.2007.0676.
23. Yu C., Cheng S.Y., He R.L., Yau S.S. Protein map: an alignment-free sequence comparison method based on various properties of amino acids. Gene. 2011;486:110–118. doi: 10.1016/j.gene.2011.07.002.
24. Yu C., Deng M., Cheng S.Y., Yau S.C., He R.L., Yau S.S. Protein space: a natural method for realizing the nature of protein universe. J. Theor. Biol. 2013;318:197–204. doi: 10.1016/j.jtbi.2012.11.005.
25. Yau S.S., Mao W.G., Benson M., He R.L. Distinguishing proteins from arbitrary amino acid sequences. Sci. Rep. 2015;5:7972. doi: 10.1038/srep07972.
26. Li Y., Tian K., Yin C., He R.L., Yau S.S. Virus classification in 60-dimensional protein space. Mol. Phylogenet. Evol. 2016;99:53–62. doi: 10.1016/j.ympev.2016.03.009.
27. He P.A., Xu S.N., Dai Q., Yao Y.H. A generalization of CGR representation for analyzing and comparing protein sequences. Int. J. Quantum Chem. 2016;116:476–482.
28. El-Lakkani A., Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR QSAR Environ. Res. 2015;26:125–137. doi: 10.1080/1062936X.2014.995700.
29. Li Y., Liu Q., Zheng X., He P.-A. UC-Curve: a highly compact 2D graphical representation of protein sequences. Int. J. Quantum Chem. 2014;114:409–415.
30. Wąż P., Bielińska-Wąż D. 3D-dynamic representation of DNA sequences. J. Mol. Model. 2014;20:2141. doi: 10.1007/s00894-014-2141-8.
31. Wąż P., Bielińska-Wąż D., Nandy A. Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences. J. Math. Chem. 2014;52:132–140. doi: 10.1007/s10910-013-0249-1.
32. Czerniecka A., Bielińska-Wąż D., Wąż P., Clark T. 20D-dynamic representation of protein sequences. Genomics. 2016;107:16–23. doi: 10.1016/j.ygeno.2015.12.003.
33. Xia X.H., Li W.H. What amino acid properties affect protein evolution? J. Mol. Evol. 1998;47:557–564. doi: 10.1007/pl00006412.
34. Yin C., Yau S.S. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 2007;247:687–694. doi: 10.1016/j.jtbi.2007.03.038.
35. Anastassiou D. Frequency-domain analysis of biomolecular sequences. Bioinformatics. 2000;16:1073–1081. doi: 10.1093/bioinformatics/16.12.1073.
36. Marhon S.A., Kremer S.C. Gene prediction based on DNA spectral analysis: a literature review. J. Comput. Biol. 2011;18:639–676. doi: 10.1089/cmb.2010.0184.
37. Akhtar M., Epps J., Ambikairajah E. Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE J. Sel. Top. Sign. Proces. 2008;2:310–321.
38. Sakoe H., Chiba S. Dynamic-programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978;26:43–49.
39. Skutkova H., Vitek M., Babula P., Kizek R., Provaznik I. Classification of genomic signals using dynamic time warping. BMC Bioinf. 2013;14:7. doi: 10.1186/1471-2105-14-S10-S1.
40. Skutkova H., Vitek M., Sedlar K., Provaznik I. Progressive alignment of genomic signals by multiple dynamic time warping. J. Theor. Biol. 2015;385:20–30. doi: 10.1016/j.jtbi.2015.08.007.
41. Alexander D.J. A review of avian influenza in different bird species. Vet. Microbiol. 2000;74:3–13. doi: 10.1016/s0378-1135(00)00160-7.
