Classifying Genomic Sequences by Sequence Feature Analysis

Zhi-Hua Liu; Dian Jiao; Xiao Sun

doi:10.1016/S1672-0229(05)03027-5

2016 Nov 28;3(4):201–205.

PMCID: PMC5172532 PMID: 16689686

Abstract

Traditional sequence analysis depends on sequence alignment. In this study, we analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified the upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified from sequence features by discriminant analysis.

Key words: genome, sequence feature analysis, BBC, PCA, discriminant analysis

Introduction

Since the beginning of the Human Genome Project, a huge number of genomic sequences have been generated, and annotating these raw sequences has become increasingly important. Eukaryotic genes contain upstream, exon, intron, and downstream regions, and classifying these functional regions is more important still. Finding appropriate features is the key to solving this problem. In recent years, several sequence features have been proposed, including word frequency (WF; ref. 1), synonymous codon choice, amino acid usage, G+C content (2), and nucleotide composition constraint (3). In this study, we present a novel sequence feature extraction algorithm and use multidimensional statistical analysis to classify genomic sequences.

Results and Discussion

We extracted the sequence feature information from the collected sequence data of the human chromosome 22, reduced the dimensionality of sequence feature vector by principal component analysis (PCA), and classified the datasets by discriminant analysis.

Word frequency

Reinert et al. (4) introduced the concept of word frequency. Since a DNA sequence is written in an alphabet of four letters (A, T, C, G) denoting the four DNA bases, we can define DNA k-words, which are k-tuples formed from these four letters. For an integer k ≥ 1, there are 4^k possible k-words. Let n_w be the number of occurrences of a word w in a DNA sequence of length L; the frequency f_w of w is then:

f_{w} = \frac{n_{w}}{L}

In this study, we analyze mainly 2-word and 3-word frequencies, which form 4²=16 and 4³=64 dimensional frequency vectors, respectively.
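As a minimal sketch of the definition above, the following Python function counts overlapping k-words and divides by the sequence length L. The function name `word_frequencies`, the example sequence, and the choice to skip windows containing symbols other than A, C, G, T are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import product

def word_frequencies(seq, k):
    """Return {k-word: n_w / L} for all 4**k words over A, C, G, T."""
    seq = seq.upper()
    L = len(seq)
    if L == 0:
        raise ValueError("empty sequence")
    counts = {"".join(w): 0 for w in product("ACGT", repeat=k)}
    # Slide a window of width k along the sequence (overlapping words).
    for i in range(L - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: ignore windows with N or other symbols
            counts[word] += 1
    return {w: n / L for w, n in counts.items()}

# Hypothetical toy sequence, for illustration only.
f2 = word_frequencies("ACGTACGTAC", 2)   # 16-dimensional vector
f3 = word_frequencies("ACGTACGTAC", 3)   # 64-dimensional vector
print(len(f2), len(f3), f2["AC"])
```

In the toy sequence the word AC occurs three times in ten bases, so its frequency is 0.3.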

Dinucleotide relative abundance

Karlin and Burge (5) defined the formula of dinucleotide relative abundance (DRA) as the following:

T_{i j} = \frac{p_{i j}}{p_{i} p_{j}}

in which p_i and p_j are the frequencies of the single bases i and j, and p_ij is the frequency of the dinucleotide ij (the joint probability of bases i and j). The DRA feature forms a 16-dimensional vector. If a sequence is completely random and its bases are mutually independent, then theoretically p_ij = p_i p_j and T_ij = 1. The deviation of T_ij from 1 therefore measures the dinucleotide bias of a sequence.
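The DRA ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it works on a single strand, estimates p_ij from the n - 1 adjacent positions, and returns 0.0 when a base is absent. The function name and the toy sequence are hypothetical.

```python
from itertools import product

def dinucleotide_relative_abundance(seq):
    """Return {ij: p_ij / (p_i * p_j)} for the 16 dinucleotides."""
    seq = seq.upper()
    bases = "ACGT"
    n = len(seq)
    # Single-base frequencies p_i.
    p = {b: seq.count(b) / n for b in bases}
    # Adjacent-pair counts over the n - 1 overlapping positions.
    pair_counts = {a + b: 0 for a, b in product(bases, repeat=2)}
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in pair_counts:
            pair_counts[pair] += 1
    dra = {}
    for pair, c in pair_counts.items():
        denom = p[pair[0]] * p[pair[1]]
        # Assumption: report 0.0 when a base never occurs, to avoid dividing by zero.
        dra[pair] = (c / (n - 1)) / denom if denom > 0 else 0.0
    return dra

# Hypothetical toy sequence: strongly periodic, so T_ij is far from 1.
dra = dinucleotide_relative_abundance("ACGT" * 5)
print(len(dra), dra["AC"], dra["AA"])
```

The periodic toy sequence is deliberately non-random: dinucleotides such as AC are over-represented (ratio well above 1) and AA never occurs (ratio 0).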

Base-base correlation

We have proposed a novel feature called base-base correlation (BBC) with the following formula:

T_{ij}(k) = \sum_{l=1}^{k} p_{ij}(l) \cdot \log_{2}\left(\frac{p_{ij}(l)}{p_{i} p_{j}}\right), \quad i, j \in \{1, 2, 3, 4\}

Here, p_i and p_j are defined as above, and p_ij(l) is the joint probability of bases i and j at a distance of l. T_ij(k) summarizes the correlation of the two-base combination over gaps from 1 to k, and so reflects a local feature of base pairs separated by up to k positions. The BBC feature forms a 16-dimensional vector.
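A minimal Python sketch of the BBC feature follows. It reads the formula as a sum over gaps l = 1..k with p_ij(l) inside the logarithm, which is an interpretation based on the definitions in the text. The function name, the estimate of p_ij(l) from the n - l available positions, and the convention that a zero-count pair contributes nothing are assumptions for illustration.

```python
from itertools import product
from math import log2

def base_base_correlation(seq, k):
    """Return {ij: T_ij(k)}, summing p_ij(l) * log2(p_ij(l) / (p_i p_j)) over l = 1..k."""
    seq = seq.upper()
    bases = "ACGT"
    n = len(seq)
    if n <= k:
        raise ValueError("sequence must be longer than k")
    p = {b: seq.count(b) / n for b in bases}
    bbc = {a + b: 0.0 for a, b in product(bases, repeat=2)}
    for l in range(1, k + 1):
        # Count base pairs (i, j) separated by distance l.
        counts = {key: 0 for key in bbc}
        for pos in range(n - l):
            key = seq[pos] + seq[pos + l]
            if key in counts:
                counts[key] += 1
        for key, c in counts.items():
            if c == 0:
                continue  # assumption: treat 0 * log2(0) as 0
            p_ij = c / (n - l)
            bbc[key] += p_ij * log2(p_ij / (p[key[0]] * p[key[1]]))
    return bbc

# Hypothetical toy sequence and gap range.
bbc = base_base_correlation("ACGT" * 5, 3)
print(len(bbc))
```

Whatever the value of k, the result has one entry per ordered base pair, giving the 16-dimensional vector described above.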

For a given DNA sequence, the features of 2-word, 3-word, DRA, and BBC form a 112-dimensional vector in all.

Principal component analysis

Let X₁, X₂, …, X_p denote the p indices (variables) under consideration; then we have

S = \begin{bmatrix} S_{11} & S_{12} & \dots & S_{1p} \\ S_{21} & S_{22} & \dots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \dots & S_{pp} \end{bmatrix}

The above matrix is the covariance matrix of X₁, X₂, …, X_p. Its diagonal elements S₁₁, S₂₂, …, S_pp are the variances of X₁, X₂, …, X_p, respectively, and reflect how much each of the p indices varies. Therefore, S₁₁ + S₂₂ + ··· + S_pp is the total variation of the p indices.

We now seek a new index y₁ = a₁₁x₁ + a₁₂x₂ + ··· + a_1p x_p to replace the original p indices, and we want this new index to retain as much of the original information as possible. Let λ₁ ≥ λ₂ ≥ ··· ≥ λ_γ (γ ≤ p) be the non-zero eigenvalues (characteristic roots) of S. Then S₁₁ + S₂₂ + ··· + S_pp = λ₁ + λ₂ + ··· + λ_γ. We can thus extract γ composite indices y₁, y₂, …, y_γ whose total variance equals that of the original p indices; that is, the γ indices carry the same information as the original p indices. If γ is much smaller than p, the method greatly reduces the number of indices without affecting the analysis result. Because y₁ has the largest variance, λ₁, among these linear combinations, it summarizes the p indices most strongly. We define y₁, y₂, …, y_γ as the first, second, …, and the γ^th principal component, respectively. Then

\frac{λ_{γ}}{λ_{1} + λ_{2} + \dots + λ_{γ}} = \frac{λ_{γ}}{S_{11} + S_{22} + \dots + S_{p p}}

which expresses the proportion of y_γ variance in the total variance, and it is called the variance contribution rate of the γ^th principal component (6).
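The identity between the trace of S and the sum of its eigenvalues, and the resulting variance contribution rates, can be checked on a small case. The sketch below uses only two variables so that the eigenvalues have a closed form; the data values and function names are made up for illustration and have nothing to do with the genomic features.

```python
from math import sqrt

def covariance_2x2(xs, ys):
    """Sample variances and covariance (S11, S12, S22) of two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s11 = sum((x - mx) ** 2 for x in xs) / (n - 1)
    s22 = sum((y - my) ** 2 for y in ys) / (n - 1)
    s12 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return s11, s12, s22

def eigenvalues_2x2(s11, s12, s22):
    """Eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]], largest first."""
    centre = (s11 + s22) / 2
    radius = sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)
    return centre + radius, centre - radius

# Made-up measurements of two correlated variables (illustrative only).
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

s11, s12, s22 = covariance_2x2(xs, ys)
lam1, lam2 = eigenvalues_2x2(s11, s12, s22)
total = s11 + s22                      # S11 + S22, the total variation
rates = [lam1 / total, lam2 / total]   # variance contribution rates
print(rates)
```

Because the two variables are strongly correlated, the first principal component accounts for most of the total variance, which is the situation in which dropping later components loses little information.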

Here we reduced the original 112-dimensional vector to a 21-dimensional vector by retaining only the principal components whose eigenvalue is greater than 1 (Table 1).

Table 1.

The Result of Principal Component Analysis

Component	Initial eigenvalues			Extraction sums of squared loadings
Component	Total	Variance (%)	Cumulative (%)	Total	Variance (%)	Cumulative (%)
1	31.128	27.793	27.793	31.128	27.793	27.793
2	12.589	11.240	39.033	12.589	11.240	39.033
3	8.365	7.469	46.503	8.365	7.469	46.503
4	8.075	7.210	53.713	8.075	7.210	53.713
5	4.726	4.220	57.933	4.726	4.220	57.933
6	4.192	3.743	61.675	4.192	3.743	61.675
7	3.836	3.425	65.100	3.836	3.425	65.100
8	3.425	3.058	68.158	3.425	3.058	68.158
9	2.938	2.624	70.782	2.938	2.624	70.782
10	2.775	2.478	73.259	2.775	2.478	73.259
11	2.606	2.327	75.586	2.606	2.327	75.586
12	1.928	1.721	77.308	1.928	1.721	77.308
13	1.880	1.678	78.986	1.880	1.678	78.986
14	1.663	1.485	80.471	1.663	1.485	80.471
15	1.565	1.397	81.868	1.565	1.397	81.868
16	1.515	1.353	83.221	1.515	1.353	83.221
17	1.293	1.154	84.375	1.293	1.154	84.375
18	1.276	1.139	85.515	1.276	1.139	85.515
19	1.170	1.045	86.559	1.170	1.045	86.559
20	1.067	0.953	87.512	1.067	0.953	87.512
21	1.052	0.939	88.451	1.052	0.939	88.451

22	0.925	0.826	89.277
23	0.831	0.742	90.019
24	0.786	0.702	90.721
25	0.677	0.605	91.326
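The selection rule can be replayed on the eigenvalues listed in Table 1. The short Python check below copies the 25 "Total" values from the table and assumes that each variance percentage is the eigenvalue divided by the 112 original dimensions, which is consistent with the tabulated figures; the variable names are illustrative.

```python
# "Total" column of Table 1 (components 1 to 25).
eigenvalues = [
    31.128, 12.589, 8.365, 8.075, 4.726, 4.192, 3.836, 3.425, 2.938,
    2.775, 2.606, 1.928, 1.880, 1.663, 1.565, 1.515, 1.293, 1.276,
    1.170, 1.067, 1.052, 0.925, 0.831, 0.786, 0.677,
]
P = 112  # dimension of the original feature vector

# Assumption: variance (%) = 100 * eigenvalue / P.
variance_pct = [100 * ev / P for ev in eigenvalues]

# Keep only components with eigenvalue greater than 1.
retained = [ev for ev in eigenvalues if ev > 1.0]
cumulative_pct = 100 * sum(retained) / P

print(len(retained), round(cumulative_pct, 2))
```

The rule keeps 21 components, which together cover about 88.45% of the total variance, in agreement with Table 1 up to rounding.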


Discriminant analysis

The basic principle of discriminant analysis is that an object characterized by p indices can be described by the random vector X = (X₁, X₂, …, X_p)^T. Let π₁, π₂, …, π_s denote the s classes of objects under study. If an object belongs to the j^th class, we write X ∈ π_j. The main goal of discriminant analysis is to find a decision function g(X) according to a chosen discriminative criterion, and to determine the class of X from the value of g(X). The main criteria for constructing a discriminant function include the shortest distance criterion, the smallest expected loss criterion, the Fisher criterion, and so on. Sandberg et al. (7) used a naïve Bayesian classifier to capture whole-genome characteristics in short sequences. In our method, we use the Fisher criterion, whose basic principle is to find the projection axis on which the projections of samples from different classes overlap the least, so that the classification performs best.
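To make the Fisher criterion concrete, here is a minimal two-class, two-dimensional sketch in Python. The paper's analysis was carried out in SPSS on five classes and 21 components; this toy example only illustrates the projection idea, using the standard closed form w = S_w^{-1}(m_a - m_b). All function names and data points are hypothetical.

```python
def mean_vector(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter_matrix(points, m):
    """Within-class scatter (s11, s12, s22) around the class mean m."""
    s11 = sum((p[0] - m[0]) ** 2 for p in points)
    s12 = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    s22 = sum((p[1] - m[1]) ** 2 for p in points)
    return s11, s12, s22

def fisher_direction(class_a, class_b):
    """Projection axis w = Sw^-1 (m_a - m_b) for two classes in 2-D."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    a11, a12, a22 = scatter_matrix(class_a, ma)
    b11, b12, b22 = scatter_matrix(class_b, mb)
    s11, s12, s22 = a11 + b11, a12 + b12, a22 + b22  # pooled scatter Sw
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    d0, d1 = ma[0] - mb[0], ma[1] - mb[1]
    # Multiply the closed-form inverse of the 2x2 matrix Sw by the mean difference.
    return ((s22 * d0 - s12 * d1) / det, (s11 * d1 - s12 * d0) / det)

def project(points, w):
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical toy classes that overlap on both raw axes.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (4.0, 5.0), (5.0, 5.0)]
class_b = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (3.0, 2.0), (5.0, 3.0), (6.0, 5.0)]

w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
print(min(proj_a) > max(proj_b))
```

Neither coordinate separates the two toy classes on its own, but their projections onto w do not overlap, which is the property the Fisher criterion is designed to achieve.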

We first analyzed the upstream, coding, and downstream regions of the sequence (Figure 1). The scatter plot in Figure 1 shows the values of the cases on the two discriminant functions, and there are obvious differences among the coding, upstream, and downstream regions. The coding regions (green) tend to lie on the positive side of Function 1, whereas the upstream (red) and downstream (blue) regions tend to lie on the negative side. The two discriminant functions cannot distinguish between upstream and downstream regions. We think the reason is that regulatory elements are located in upstream regions, and these three sequence features do not take gene regulatory information into account. A more effective sequence feature, one related to known gene regulatory knowledge, may therefore be needed to distinguish the two regions.

Fig. 1. Classification of the upstream (red), coding (green), and downstream (blue) regions. The horizontal axis represents the value of the first linear discriminant function, and the vertical axis represents the value of the second linear discriminant function, each calculated from the variable values.

To investigate non-coding regions further, we expanded the datasets from three classes to five and used the three features WF, DRA, and BBC, which together form the 112-dimensional vector described above. Discriminant analysis was carried out with the SPSS software (8), and the result, which assesses how well the discriminant function works, is shown in Table 2. The classification accuracy of the exon, intron, upstream, downstream, and intergenic regions is 94%, 86%, 71%, 69%, and 69%, respectively. The accuracy for exons and introns is relatively high, while that for upstream, downstream, and intergenic regions is relatively low. The high accuracy for exons and introns can help us identify genes and study gene structure (exon-intron arrangement). The 3-word frequency can help us reveal hidden sequence features in coding regions. Recent discoveries have suggested that non-coding regions may not be merely “junk DNA” as previously thought. High densities of long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) occur in non-coding regions as signals to start methylating a region of DNA (9, 10). The sequence features that we have used may not match the inherent sequence features of non-coding regions, which would explain why the classification accuracy of non-coding regions is lower than that of coding regions. Our future work is to improve the classification accuracy of non-coding regions by seeking new features and more efficient algorithms.

Table 2.

The Statistical Result of Discriminant Analysis^*

Result	Predicted group membership						Total
Result	Group	1	2	3	4	5	Total
Original	1	71	0	7	8	14	100
	2	1	94	0	2	3	100
	3	7	0	86	5	2	100
	4	4	1	13	69	13	100
	5	5	2	12	12	69	100

Cross-validated	1	68	4	8	7	13	100
	2	1	94	0	2	3	100
	3	7	0	86	5	2	100
	4	6	2	16	57	19	100
	5	9	4	18	13	56	100


*“Original” is the classification result for each observed sample, and “Cross-validated” is the result of cross-validation. Groups 1 to 5 represent the upstream, exon, intron, downstream, and intergenic regions, respectively. In “Predicted group membership”, the fitted discriminant function reclassifies the source data, and the prediction is compared with the true group to compute the misclassification rate. For example, of the 100 samples in the 1^st group, the discriminant function built from the original data assigns 71, 0, 7, 8, and 14 to the 1^st, 2^nd, 3^rd, 4^th, and 5^th groups, respectively.
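The per-group accuracies quoted in the text can be read off the diagonal of Table 2. The Python sketch below copies both confusion matrices from the table and computes per-group and overall accuracy; the overall figures are not stated in the text and simply follow arithmetically from the table. The function and variable names are illustrative.

```python
groups = ["upstream", "exon", "intron", "downstream", "intergenic"]

# Rows: true group; columns: predicted group (counts from Table 2).
original = [
    [71, 0, 7, 8, 14],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [4, 1, 13, 69, 13],
    [5, 2, 12, 12, 69],
]
cross_validated = [
    [68, 4, 8, 7, 13],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [6, 2, 16, 57, 19],
    [9, 4, 18, 13, 56],
]

def per_group_accuracy(matrix):
    """Fraction of each true group that was assigned to the correct group."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Correctly classified samples divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

for name, acc in zip(groups, per_group_accuracy(original)):
    print(f"{name}: {acc:.0%}")
print(overall_accuracy(original), overall_accuracy(cross_validated))
```

On the original data 389 of 500 samples are classified correctly (77.8%); under cross-validation the count falls to 361 of 500 (72.2%), with the drop concentrated in the downstream and intergenic groups.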

Conclusion

Algorithms and software for gene prediction have been widely developed. However, to our knowledge, research on how to effectively distinguish the exon, intron, and intergenic regions has not yet made a breakthrough. We have proposed a novel method for analyzing genomic sequences based on sequence features and statistical analysis. The results show that our method can identify the upstream, exon, intron, downstream, and intergenic regions of DNA sequences, with the highest accuracy for the exon (94%) and intron (86%) regions.

Materials

We used human chromosome 22 and collected the upstream (1,000 bp), exon, intron, downstream (1,000 bp), and intergenic (1,000 bp) regions according to the gene annotation database of the University of California, Santa Cruz (UCSC) Golden Path human genome sequence (http://genome.cse.ucsc.edu).

Acknowledgements

This work was supported by the National High-Tech Research and Development Program (863 Program) of China (No. 2002AA231071) and the Natural Science Foundation of Jiangsu Province (No. BK2002057).

References

1. Basu S. Words in DNA sequences: some case studies based on their frequency statistics. J. Math. Biol. 2003;46:479–503. doi: 10.1007/s00285-002-0185-3.
2. Sandberg R. Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene. 2003;311:35–42. doi: 10.1016/s0378-1119(03)00581-x.
3. Zhang C.T., Zhang R. A nucleotide composition constraint of genome sequences. Comput. Biol. Chem. 2004;28:149–153. doi: 10.1016/j.compbiolchem.2004.02.002.
4. Reinert G. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 2000;7:1–46. doi: 10.1089/10665270050081360.
5. Karlin S., Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995;11:283–290. doi: 10.1016/s0168-9525(00)89076-9.
6. Hogg R.V., Craig A.T. Introduction to Mathematical Statistics (fifth edition). Prentice-Hall, Englewood Cliffs, USA; 1995.
7. Sandberg R. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res. 2001;11:1404–1409. doi: 10.1101/gr.186401.
8. Kirkpatrick L.A., Feeney B.C. A Simple Guide to SPSS for Windows for Versions 8.0, 9.0, 10.0, and 11.0 (revised edition). Wadsworth Publishing, Florence, USA; 2003.
9. Arnaud P. SINE retroposons can be used in vivo as nucleation centers for de novo methylation. Mol. Cell. Biol. 2000;20:3434–3441. doi: 10.1128/mcb.20.10.3434-3441.2000.
10. Lyon M.F. LINE-1 elements and X chromosome inactivation: a function for “junk” DNA? Proc. Natl. Acad. Sci. USA. 2000;97:6248–6249. doi: 10.1073/pnas.97.12.6248.
