Genomics, Proteomics & Bioinformatics. 2016 Nov 28;3(4):201–205. doi: 10.1016/S1672-0229(05)03027-5

Classifying Genomic Sequences by Sequence Feature Analysis

Zhi-Hua Liu 1, Dian Jiao 1, Xiao Sun 1,*
PMCID: PMC5172532  PMID: 16689686

Abstract

Traditional sequence analysis depends on sequence alignment. In this study, we analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified the upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified on the basis of sequence features and discriminant analysis.

Key words: genome, sequence feature analysis, BBC, PCA, discriminant analysis

Introduction

Since the beginning of the Human Genome Project, a huge number of genomic sequences have been generated, and annotating these raw sequences has become increasingly important. Eukaryotic genes contain upstream, exon, intron, and downstream regions, and classifying these functional regions is an important part of annotation. Finding appropriate features is the key to solving this problem. In recent years, several sequence features have been proposed, including word frequency (WF; ref. 1), synonymous codon choice, amino acid usage, G+C content (2), and nucleotide composition constraint (3). In this study, we present a novel sequence feature extraction algorithm and use multidimensional statistical analysis to classify genomic sequences.

Results and Discussion

We extracted sequence feature information from the sequence data collected from human chromosome 22, reduced the dimensionality of the sequence feature vector by principal component analysis (PCA), and classified the datasets by discriminant analysis.

Word frequency

Reinert et al. (4) introduced the concept of word frequency. Since a DNA sequence is written in an alphabet of four letters (A, T, C, G) denoting the four DNA bases, we can define DNA k-words, which are k-tuples formed from these four letters. For an integer k ≥ 1, there are clearly $4^k$ possible k-words. Let $f_w$ denote the frequency of the word w in a DNA sequence of length L:

$$f_w = \frac{n_w}{L}$$

where $n_w$ is the number of occurrences of w in the sequence. In this study, we mainly analyze 2-word and 3-word frequencies, which form $4^2 = 16$ and $4^3 = 64$ dimensional frequency vectors, respectively.
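The word-frequency definition can be illustrated with a minimal Python sketch. The function name `word_frequencies` and the toy sequence are illustrative assumptions, not part of the original study; the sketch counts overlapping k-words and divides by the sequence length L, as in the formula above.

```python
from itertools import product


def word_frequencies(seq, k):
    """Return a dict mapping each of the 4**k k-words to f_w = n_w / L."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    # Slide a window of width k over the sequence (overlapping words).
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in counts:  # windows containing N or other symbols are skipped
            counts[w] += 1
    length = len(seq)
    return {w: counts[w] / length for w in words}


# Toy example (hypothetical sequence, not data from the study).
freqs = word_frequencies("ACGTACGT", 2)
print(len(freqs), freqs["AC"], freqs["TA"])
```

Because the count is divided by L rather than by the number of windows (L - k + 1), the frequencies of one vector sum to slightly less than 1 for k > 1.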

Dinucleotide relative abundance

Karlin and Burge (5) defined the formula of dinucleotide relative abundance (DRA) as the following:

$$T_{ij} = \frac{p_{ij}}{p_i p_j}$$

in which $p_i$ and $p_j$ are the frequencies of the single bases i and j, and $p_{ij}$ is the joint frequency of the dinucleotide ij. The DRA feature forms a 16-dimensional vector. If a sequence is completely random and its bases are mutually independent, then theoretically $p_{ij} = p_i p_j$ and $T_{ij} = 1$. Therefore, the deviation of $T_{ij}$ from 1 measures the dinucleotide bias of a sequence.
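As a rough illustration of the DRA ratio, the following Python sketch computes the 16 values for a single strand. The function name and the toy sequence are assumptions made for this example; it uses plain single-strand counts and returns NaN for a dinucleotide whose bases do not occur in the sequence.

```python
from collections import Counter

BASES = "ACGT"


def dinucleotide_relative_abundance(seq):
    """Return a dict of T_ij = p_ij / (p_i * p_j) for the 16 dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    p = {b: base_counts[b] / n for b in BASES}
    result = {}
    for i in BASES:
        for j in BASES:
            p_ij = pair_counts[i + j] / (n - 1)  # adjacent-pair frequency
            denom = p[i] * p[j]
            # The ratio is undefined when either base is absent.
            result[i + j] = p_ij / denom if denom > 0 else float("nan")
    return result


# Toy example (hypothetical sequence): AA is over-represented, CA is absent.
dra = dinucleotide_relative_abundance("AACC")
print(dra["AA"], dra["CA"])
```

Values above 1 indicate over-representation of a dinucleotide relative to independence, and values below 1 indicate under-representation.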

Base-base correlation

We have proposed a novel feature called base-base correlation (BBC) with the following formula:

$$T_{ij}(k) = \sum_{l=1}^{k} p_{ij}(l)\,\log_2\!\left(\frac{p_{ij}}{p_i p_j}\right), \qquad i, j \in \{1, 2, 3, 4\}$$

Here, $p_i$ and $p_j$ are defined as above, while $p_{ij}(l)$ is the joint probability of bases i and j at a distance of l. $T_{ij}(k)$ represents the average relevance of the two-base combination over gaps from 1 to k. It reflects a local feature of two bases separated by up to k positions. The BBC feature forms a 16-dimensional vector.
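A possible implementation of the BBC feature is sketched below in Python. Two points are assumptions of this sketch rather than statements from the text: the value of k is left as a free parameter, and the logarithm's argument is read as the distance-l joint probability $p_{ij}(l)$, which makes each term a mutual-information-like contribution. If the adjacent-pair $p_{ij}$ is intended instead, only the line marked ASSUMPTION changes. The function name and toy sequence are illustrative.

```python
from math import log2

BASES = "ACGT"


def base_base_correlation(seq, k):
    """Return a 16-entry dict of T_ij(k), summing over gaps l = 1..k."""
    seq = seq.upper()
    n = len(seq)
    if n <= k:
        raise ValueError("sequence must be longer than the maximum gap k")
    p = {b: seq.count(b) / n for b in BASES}
    bbc = {i + j: 0.0 for i in BASES for j in BASES}
    for gap in range(1, k + 1):
        n_pairs = n - gap
        counts = dict.fromkeys(bbc, 0)
        # Count base pairs (i, j) separated by exactly `gap` positions.
        for pos in range(n_pairs):
            key = seq[pos] + seq[pos + gap]
            if key in counts:
                counts[key] += 1
        for key, c in counts.items():
            if c == 0:
                continue  # a zero-probability term contributes nothing
            p_ij_l = c / n_pairs
            # ASSUMPTION: the log argument uses p_ij(l), not adjacent p_ij.
            bbc[key] += p_ij_l * log2(p_ij_l / (p[key[0]] * p[key[1]]))
    return bbc


# Toy example (hypothetical sequence) with a single gap, k = 1.
bbc = base_base_correlation("ACACACAC", 1)
print(bbc["AC"], bbc["AA"])
```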

For a given DNA sequence, the 2-word, 3-word, DRA, and BBC features together form a 112-dimensional vector (16 + 64 + 16 + 16 = 112).

Principal component analysis

Let $X_1, X_2, \ldots, X_p$ denote the p indices (variables) under consideration. Their covariance matrix is

$$S = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{bmatrix}$$

The principal diagonal elements $S_{11}, S_{22}, \ldots, S_{pp}$ are the variances of $X_1, X_2, \ldots, X_p$, respectively, and reflect the degree of variation of each index. Therefore, $S_{11} + S_{22} + \cdots + S_{pp}$ is the total variation of the p indices.

We now seek a new index $y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p$ to replace the original p indices, and we expect this new index to retain as much of the original information as possible. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_\gamma$ ($\gamma \le p$) are the non-vanishing characteristic roots (eigenvalues) of S. Then $S_{11} + S_{22} + \cdots + S_{pp} = \lambda_1 + \lambda_2 + \cdots + \lambda_\gamma$. Thus we extract γ overall indices $y_1, y_2, \ldots, y_\gamma$ whose total variance equals that of the original p indices; that is, the γ indices contain the same information as the original p indices. If γ is much smaller than p, the method greatly reduces the number of indices without affecting the analysis result. Because $y_1$ has the largest variance, $\lambda_1$, its ability to summarize the p indices is the strongest. We define $y_1, y_2, \ldots, y_\gamma$ as the first, second, …, and γth principal components, respectively. Then

$$\frac{\lambda_\gamma}{\lambda_1 + \lambda_2 + \cdots + \lambda_\gamma} = \frac{\lambda_\gamma}{S_{11} + S_{22} + \cdots + S_{pp}}$$

expresses the proportion of the variance of $y_\gamma$ in the total variance, and is called the variance contribution rate of the γth principal component (6).

Here we reduced the original 112-dimensional vector to a 21-dimensional vector by retaining the principal components whose eigenvalues are greater than 1 (Table 1).

Table 1. Results of Principal Component Analysis

Component   Initial eigenvalues                        Extraction sums of squared loadings
            Total    Variance (%)   Cumulative (%)     Total    Variance (%)   Cumulative (%)
1           31.128   27.793         27.793             31.128   27.793         27.793
2           12.589   11.240         39.033             12.589   11.240         39.033
3           8.365    7.469          46.503             8.365    7.469          46.503
4           8.075    7.210          53.713             8.075    7.210          53.713
5           4.726    4.220          57.933             4.726    4.220          57.933
6           4.192    3.743          61.675             4.192    3.743          61.675
7           3.836    3.425          65.100             3.836    3.425          65.100
8           3.425    3.058          68.158             3.425    3.058          68.158
9           2.938    2.624          70.782             2.938    2.624          70.782
10          2.775    2.478          73.259             2.775    2.478          73.259
11          2.606    2.327          75.586             2.606    2.327          75.586
12          1.928    1.721          77.308             1.928    1.721          77.308
13          1.880    1.678          78.986             1.880    1.678          78.986
14          1.663    1.485          80.471             1.663    1.485          80.471
15          1.565    1.397          81.868             1.565    1.397          81.868
16          1.515    1.353          83.221             1.515    1.353          83.221
17          1.293    1.154          84.375             1.293    1.154          84.375
18          1.276    1.139          85.515             1.276    1.139          85.515
19          1.170    1.045          86.559             1.170    1.045          86.559
20          1.067    0.953          87.512             1.067    0.953          87.512
21          1.052    0.939          88.451             1.052    0.939          88.451
22          0.925    0.826          89.277
23          0.831    0.742          90.019
24          0.786    0.702          90.721
25          0.677    0.605          91.326
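The variance contribution rates in Table 1 can be checked with a few lines of Python. The sketch below assumes a total variance of 112, i.e. that the 112 features were standardized before PCA; this is inferred from the table (31.128 / 112 ≈ 27.793%) rather than stated in the text. The eigenvalue-greater-than-1 rule applied here is often called the Kaiser criterion. Function names are illustrative.

```python
def variance_contribution(eigenvalues, total_variance):
    """Return a list of (rate %, cumulative %) pairs, one per component."""
    rates = []
    cumulative = 0.0
    for lam in eigenvalues:
        rate = 100.0 * lam / total_variance
        cumulative += rate
        rates.append((rate, cumulative))
    return rates


def count_retained(eigenvalues, threshold=1.0):
    """Count components whose eigenvalue exceeds the threshold."""
    return sum(1 for lam in eigenvalues if lam > threshold)


# Eigenvalues of components 1 to 25 as listed in Table 1.
TABLE1_EIGENVALUES = [
    31.128, 12.589, 8.365, 8.075, 4.726, 4.192, 3.836, 3.425, 2.938,
    2.775, 2.606, 1.928, 1.880, 1.663, 1.565, 1.515, 1.293, 1.276,
    1.170, 1.067, 1.052, 0.925, 0.831, 0.786, 0.677,
]

# ASSUMPTION: total variance = 112 (standardized 112-dimensional features).
rates = variance_contribution(TABLE1_EIGENVALUES, total_variance=112.0)
n_kept = count_retained(TABLE1_EIGENVALUES)
print(n_kept, round(rates[0][0], 3), round(rates[n_kept - 1][1], 2))
```

Recomputing the cumulative percentage from the rounded eigenvalues reproduces the tabulated values to within rounding error.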

Discriminant analysis

The basic principle of discriminant analysis is that an object characterized by p indices can be described by the random vector $X = (X_1, X_2, \ldots, X_p)^T$. Let $\pi_1, \pi_2, \ldots, \pi_s$ denote the s classes of objects under study. If an object belongs to the jth class, we write $X \in \pi_j$. The main goal of discriminant analysis is to construct a decision function g(X) according to a chosen discriminant criterion and to determine the category of X from the value of g(X). The main criteria for constructing discriminant functions include the shortest distance criterion, the smallest expected loss criterion, the Fisher criterion, and so on. Sandberg et al. (7) used a naïve Bayesian classifier to capture whole-genome characteristics in short sequences. In our method, we use the Fisher criterion, whose basic principle is to find the projection axis on which the projections of samples from different classes overlap the least, so that the classification performs best.
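To make the Fisher criterion concrete, the following Python sketch handles the simplest case: two classes described by two features. It is a toy illustration, not the multi-class SPSS procedure used in the study. The projection axis is taken as $w = S_W^{-1}(m_1 - m_0)$, where $S_W$ is the pooled within-class scatter matrix, and the midpoint decision threshold assumes equal class priors. All names and data points are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def scatter(points, m):
    """2x2 scatter matrix of the points around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def project(w, x):
    """Project point x onto the axis w."""
    return w[0] * x[0] + w[1] * x[1]


def fisher_direction(class0, class1):
    """Return (w, threshold) with w = Sw^-1 (m1 - m0)."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # ASSUMPTION: equal priors, so the cut lies midway between the means.
    threshold = 0.5 * (project(w, m0) + project(w, m1))
    return w, threshold


def classify(w, threshold, x):
    """Assign x to class 1 if its projection exceeds the threshold."""
    return 1 if project(w, x) > threshold else 0


# Hypothetical toy data: two well-separated clusters.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 5.0), (7.0, 4.0), (7.0, 6.0), (8.0, 5.0)]
w, threshold = fisher_direction(class0, class1)
print(w, threshold, classify(w, threshold, (2.5, 2.5)))
```

With more than two classes, the same idea yields several discriminant functions, which is why the scatter plot in the study uses two function axes.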

We first analyzed the upstream, coding, and downstream regions of the sequence (Figure 1). The scatter plot in Figure 1 shows the values of the cases on the two discriminant functions, and the coding regions differ clearly from the upstream and downstream regions. The coding regions (green) tend to lie on the positive side of Function 1, whereas the upstream (red) and downstream (blue) regions tend to lie on the negative side. The two discriminant functions cannot distinguish between upstream and downstream regions. We think the reason is that regulatory elements are located in upstream regions, and gene regulatory information is not considered when we use these three sequence features. Therefore, a more effective sequence feature that draws on known gene regulatory knowledge may be needed to distinguish the two regions.

Fig. 1. Classification of the upstream (red), coding (green), and downstream (blue) regions. The horizontal axis represents the value of the first linear discriminant function, and the vertical axis represents the value of the second linear discriminant function, both calculated from the feature values.

To further investigate non-coding regions, we expanded the datasets from three classes to five and used the three features WF, DRA, and BBC, which together form the 112-dimensional vector described above. The SPSS software (8) was used to carry out the discriminant analysis; the result, shown in Table 2, indicates how well the discriminant functions work. The classification accuracy of the exon, intron, upstream, downstream, and intergenic regions is 94%, 86%, 71%, 69%, and 69%, respectively. The accuracy for exons and introns is relatively high, while that for upstream, downstream, and intergenic regions is relatively low. This can help us identify genes and study gene structure (exon-intron arrangement). The 3-word frequency can help reveal hidden sequence features in coding regions. Recent discoveries have suggested that non-coding regions may not be merely “junk DNA” as previously thought. High densities of long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) occur in non-coding regions as signals to start methylating a region of DNA (9, 10). The sequence features that we have used may not match the inherent sequence features of non-coding regions, which would explain why the classification accuracy of non-coding regions is lower than that of coding regions. Our future work is to improve the classification accuracy of non-coding regions by seeking new features and more efficient algorithms.

Table 2. Statistical Results of Discriminant Analysis*

                            Predicted group membership
Result            Group     1     2     3     4     5     Total
Original          1        71     0     7     8    14     100
                  2         1    94     0     2     3     100
                  3         7     0    86     5     2     100
                  4         4     1    13    69    13     100
                  5         5     2    12    12    69     100
Cross-validated   1        68     4     8     7    13     100
                  2         1    94     0     2     3     100
                  3         7     0    86     5     2     100
                  4         6     2    16    57    19     100
                  5         9     4    18    13    56     100

* “Original” is the classification result for each observed sample, and “Cross-validated” is the result confirmed by cross-validation. Groups 1 to 5 represent the upstream, exon, intron, downstream, and intergenic regions, respectively. Under “Predicted group membership”, the fitted discriminant function reclassifies the source data, and the prediction is compared with the true group to compute the misclassification rate. For example, of the 100 samples in group 1, the discriminant function built from the original data assigns 71, 0, 7, 8, and 14 to groups 1, 2, 3, 4, and 5, respectively.
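The per-class accuracies quoted in the text follow directly from the diagonal of Table 2. The short Python sketch below recomputes them for both the original and the cross-validated results; the variable and function names are illustrative, and the counts are copied from the table.

```python
GROUPS = ["upstream", "exon", "intron", "downstream", "intergenic"]

# Rows are true groups 1 to 5, columns are predicted groups 1 to 5 (Table 2).
ORIGINAL = [
    [71, 0, 7, 8, 14],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [4, 1, 13, 69, 13],
    [5, 2, 12, 12, 69],
]
CROSS_VALIDATED = [
    [68, 4, 8, 7, 13],
    [1, 94, 0, 2, 3],
    [7, 0, 86, 5, 2],
    [6, 2, 16, 57, 19],
    [9, 4, 18, 13, 56],
]


def per_class_accuracy(confusion):
    """Fraction of each true group that was assigned to its own group."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


for name, acc in zip(GROUPS, per_class_accuracy(CROSS_VALIDATED)):
    print(f"{name}: {acc:.0%}")
print(overall_accuracy(ORIGINAL), overall_accuracy(CROSS_VALIDATED))
```

On these counts the overall accuracy is 77.8% for the original classification and 72.2% after cross-validation, with the drop concentrated in the downstream and intergenic groups.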

Conclusion

Algorithms and software for gene prediction have been widely developed. However, to our knowledge, research on how to effectively distinguish the exon, intron, and intergenic regions has not yet made a breakthrough. We have proposed a novel method for analyzing genomic sequences based on sequence features and statistical analysis. The results show that our analysis algorithm could improve the identification accuracy of the upstream, exon, intron, downstream, and intergenic regions from DNA sequences, especially the exon (94%) and intron (86%) regions.

Materials

We used human chromosome 22 and collected the upstream (1,000 bp), exon, intron, downstream (1,000 bp), and intergenic (1,000 bp) regions according to the gene annotation database of the University of California, Santa Cruz (UCSC) Golden Path human genome sequence (http://genome.cse.ucsc.edu).

Acknowledgements

This work was supported by the National High-Tech Research and Development Program (863 Program) of China (No. 2002AA231071) and the Natural Science Foundation of Jiangsu Province (No. BK2002057).

References

1. Basu S. Words in DNA sequences: some case studies based on their frequency statistics. J. Math. Biol. 2003;46:479–503. doi: 10.1007/s00285-002-0185-3.
2. Sandberg R. Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene. 2003;311:35–42. doi: 10.1016/s0378-1119(03)00581-x.
3. Zhang C.T., Zhang R. A nucleotide composition constraint of genome sequences. Comput. Biol. Chem. 2004;28:149–153. doi: 10.1016/j.compbiolchem.2004.02.002.
4. Reinert G. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 2000;7:1–46. doi: 10.1089/10665270050081360.
5. Karlin S., Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995;11:283–290. doi: 10.1016/s0168-9525(00)89076-9.
6. Hogg R.V., Craig A.T. Introduction to Mathematical Statistics (fifth edition). Prentice-Hall, Englewood Cliffs, USA; 1995.
7. Sandberg R. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res. 2001;11:1404–1409. doi: 10.1101/gr.186401.
8. Kirkpatrick L.A., Feeney B.C. A Simple Guide to SPSS for Windows for Versions 8.0, 9.0, 10.0, and 11.0 (revised edition). Wadsworth Publishing, Florence, USA; 2003.
9. Arnaud P. SINE retroposons can be used in vivo as nucleation centers for de novo methylation. Mol. Cell. Biol. 2000;20:3434–3441. doi: 10.1128/mcb.20.10.3434-3441.2000.
10. Lyon M.F. LINE-1 elements and X chromosome inactivation: a function for “junk” DNA? Proc. Natl. Acad. Sci. USA. 2000;97:6248–6249. doi: 10.1073/pnas.97.12.6248.

Articles from Genomics, Proteomics & Bioinformatics are provided here courtesy of Oxford University Press
