The Scientific World Journal. 2014 Aug 19;2014:740506. doi: 10.1155/2014/740506

Prediction of DNase I Hypersensitive Sites by Using Pseudo Nucleotide Compositions

Pengmian Feng 1,*, Ning Jiang 1, Nan Liu 1
PMCID: PMC4152949  PMID: 25215331

Abstract

DNase I hypersensitive sites (DHS) are associated with a wide variety of regulatory DNA elements. Knowledge about the locations of DHS is helpful for deciphering the function of noncoding genomic regions. With the rapid accumulation of genome sequences in the postgenomic age, it is highly desirable to develop cost-effective computational methods to identify DHS. In the present work, a support vector machine based model was proposed to identify DHS by using the pseudo dinucleotide composition. In the jackknife test, the proposed model obtained an accuracy of 83%, which is competitive with that of the existing method. This result suggests that the proposed model may become a useful tool for DHS identification.

1. Introduction

DNase I hypersensitive sites (DHS) are regions of chromatin that are sensitive to cleavage by the DNase I enzyme. Since the discovery of DHS in the 1980s [1], they have been used as markers of regulatory DNA regions. These regions are generally nucleosome-free and are associated with a wide variety of genomic regulatory elements, such as promoters, enhancers, insulators, silencers, and suppressors [2–4]. Therefore, mapping of DHS has become an effective approach for discovering functional DNA elements in noncoding sequences.

Although the traditional Southern blotting technique is a gold-standard approach for identifying DHS, obtaining information with it is difficult, time-consuming, and inaccurate [5]. Recently, the DNase-seq technique (a combination of DNase I digestion and high-throughput sequencing) has been proposed [6], and this technique allows for an unprecedented increase in resolution. However, methodologies for the analysis of DNase-seq data are relatively immature [7]. Therefore, computational models will be an important complement to experimental techniques for identifying DHS.

Based on nucleotide compositions, a support vector machine model for identifying DHS in the K562 cell line was proposed [8]. This method yielded quite encouraging results and did play a role in stimulating the development of this area. However, further work is needed for the following reasons. First, the sequences in its dataset share high sequence similarities. Second, the DNA structural properties were ignored. To solve these problems, we propose a new model for identifying DHS, which is trained on a high-quality benchmark dataset. In the new model, each DNA sample is encoded by using the pseudo dinucleotide composition, into which the DNA structural properties are incorporated.

2. Materials and Methods

2.1. Benchmark Dataset

A total of 280 experimentally confirmed DHS sequences and 731 non-DHS sequences, which have been used to train DHS prediction models [8], were obtained from http://noble.gs.washington.edu/proj/hs/. As elucidated in [9], a predictor, if trained and tested on a dataset containing redundant samples with high similarity, might yield misleading results with an overestimated accuracy. To get rid of the redundancy and avoid bias, the CD-HIT software [10] was utilized to remove those DNA fragments that have ≥60% pairwise sequence identity to each other.

Finally, we obtained 247 positive and 710 negative samples for the benchmark dataset S, as can be formulated by

$$S = S^{+} \cup S^{-}, \tag{1}$$

where the subset $S^{+}$ contains 247 DHS sequences, $S^{-}$ contains 710 non-DHS sequences, and $\cup$ represents the “union” in set theory. The detailed sequences in the benchmark dataset S are given in Supplementary Information S1 available online at http://dx.doi.org/10.1155/2014/740506.

2.2. DNA Sequence Representation

In order to integrate the sequence-order effects and DNA physicochemical properties together, the pseudo nucleotide composition was proposed in 2011 [11]. Since then, the concept of pseudo nucleotide composition has penetrated into many branches of computational genomics, such as predicting the recombination spots [12], predicting promoters [13], predicting nucleosome positioning sequences [14], and identifying splice sites [15]. Because of its wide and increasing usage, recently, a flexible web-server, called “pseudo K-tuple nucleotide composition (PseKNC),” was developed [16], which can be used to generate various kinds of pseudo K-tuple nucleotide compositions.

Encouraged by the success of introducing pseudo nucleotide composition to computational genomics, in the current study, the pseudo dinucleotide composition was used to represent DNA sequences in the benchmark dataset, which can be expressed as [12, 16]

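$$D = \left[ d_1 \; d_2 \; \cdots \; d_{16} \; d_{16+1} \; \cdots \; d_{16+\lambda} \right]^{T}, \tag{2}$$

where

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 16), \\[2ex] \dfrac{w\,\theta_{u-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (16 < u \le 16 + \lambda). \end{cases} \tag{3}$$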

In (3), $f_u$ ($u = 1, 2, \ldots, 16$) is the normalized occurrence frequency of the $u$th dinucleotide in the DNA sequence, λ is the number of the total counted ranks (or tiers) of the correlations along a DNA sequence, and w is the weight factor. The concrete values of λ and w are discussed in Section 3.1. The correlation factor $\theta_j$ represents the $j$-tier structural correlation factor between all the $j$th most contiguous dinucleotides, where $R_iR_{i+1}$ denotes the dinucleotide at position $i$.
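As an illustration of (2) and (3), the following minimal Python sketch assembles a (16 + λ)-dimensional vector from a DNA sequence. The text above does not spell out how $\theta_j$ is computed; the sketch assumes the form commonly given in the cited PseDNC and PseKNC work [12, 16], that is, the mean over all positions of the average squared difference between the property values of two dinucleotides that are j steps apart. All function names and the `TOY_PROPS` table are hypothetical, and the property values are placeholders, not the six measured structural parameters used in this study.

```python
from itertools import product

# The 16 dinucleotides in a fixed order (AA, AC, ..., TT).
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# Placeholder table: ONE made-up property value per dinucleotide.
# These are NOT real twist/tilt/roll/shift/slide/rise values.
TOY_PROPS = {d: [(k - 7.5) / 4.6] for k, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_frequencies(seq):
    """Normalized occurrence frequency f_u of each dinucleotide (sums to 1)."""
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DINUCLEOTIDES]


def correlation_factors(seq, props, lam):
    """theta_1..theta_lam under the assumed definition described above."""
    n_pairs = len(seq) - 1  # number of dinucleotides in the sequence
    thetas = []
    for j in range(1, lam + 1):
        terms = []
        for i in range(n_pairs - j):
            a = props[seq[i:i + 2]]
            b = props[seq[i + j:i + j + 2]]
            # Average squared difference over the available properties.
            terms.append(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
        thetas.append(sum(terms) / len(terms))
    return thetas


def pse_dnc(seq, props, lam, w):
    """Feature vector D of Eq. (2), with components d_u from Eq. (3)."""
    if lam >= len(seq) - 1:
        raise ValueError("lam must be smaller than the number of dinucleotides")
    f = dinucleotide_frequencies(seq)
    theta = correlation_factors(seq, props, lam)
    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]
```

Because every component shares the same denominator, the components of the resulting vector sum to 1, which gives a quick sanity check on any implementation. The sketch assumes an uppercase sequence containing only A, C, G, and T.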

2.3. Support Vector Machine (SVM)

SVM is a supervised learning algorithm and has been widely used in computational genomics and proteomics [17–23]. The basic principle of SVM is to transform the input vector into a high-dimensional space and then seek a separating hyperplane with the maximal margin in this space by using the decision function

$$f(X) = \operatorname{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i \cdot K(X, X_i) + b \right), \tag{4}$$

where $\alpha_i$ are the Lagrange multipliers, $b$ is the offset, $X_i$ is the $i$th training vector, and $y_i$ represents the class of the $i$th training vector. $K(X, X_i)$ is a kernel function which defines an inner product in a high-dimensional feature space, and sgn is the sign function. Due to its effectiveness and speed in nonlinear classification, the radial basis function (RBF) kernel $K(X_i, X_j) = \exp(-\gamma \|X_i - X_j\|^2)$ was used in the current study.
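The decision function (4) with the RBF kernel can be written out directly. The sketch below is only meant to make the formula concrete; it is not how LIBSVM is invoked, the function names are hypothetical, and the support vectors, multipliers, and offset in the usage example are invented toy values rather than a trained model.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decide(x, train_vectors, labels, alphas, b, gamma):
    """Evaluate Eq. (4): the sign of the kernel expansion plus the offset b.

    labels are +1 / -1; alphas are the Lagrange multipliers.
    A score of exactly zero is assigned to the positive class here,
    which is a choice made for this sketch.
    """
    score = b + sum(
        y * a * rbf_kernel(x, xi, gamma)
        for xi, y, a in zip(train_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


# Toy usage with two invented support vectors.
toy_vectors = [[0.0, 0.0], [1.0, 1.0]]
toy_labels = [-1, 1]
toy_alphas = [1.0, 1.0]
print(svm_decide([1.0, 1.0], toy_vectors, toy_labels, toy_alphas, b=0.0, gamma=1.0))
```

In practice only the training vectors with nonzero $\alpha_i$ (the support vectors) contribute to the sum.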

The LIBSVM 2.84 package [24], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used to implement the SVM. The regularization parameter C and the kernel width parameter γ were optimized using a grid search. The search spaces for C and γ are $[2^{15}, 2^{-5}]$ and $[2^{-5}, 2^{-15}]$ with steps of $2^{-1}$ and $2$, respectively.

2.4. Performance Evaluation

Three cross-validation methods, that is, the independent dataset test, the subsampling (or K-fold cross-validation) test, and the jackknife test, are often used to evaluate the anticipated success rate of a predictor. Among the three methods, the jackknife test is deemed the least arbitrary and most objective one [9, 25] and, hence, has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [26–30]. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample and all the rule-parameters are calculated without including the one being identified.
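The jackknife (leave-one-out) procedure described above can be sketched generically as follows. The `fit` and `predict` callables stand for any classifier; to keep the example self-contained, a simple nearest-centroid rule is used as a stand-in, which is not the SVM used in this study. All names and the toy data are hypothetical.

```python
def jackknife(samples, labels, fit, predict):
    """Leave-one-out: hold out each sample once, train on the rest, predict it."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(train_x, train_y):
    """Stand-in classifier: store the mean vector of each class."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy one-dimensional data with two well-separated classes.
toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = [-1, -1, -1, 1, 1, 1]
toy_predictions = jackknife(toy_x, toy_y, fit_centroids, predict_centroid)
```

The key property is that the held-out sample never influences the model that classifies it, so a dataset of N samples requires N separate training runs.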

A set of four metrics, namely, sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and accuracy (Acc), is used to evaluate the performance of the proposed model. They are defined as follows:

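$$\mathrm{Sn} = \frac{TP}{TP + FN}, \tag{5}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}, \tag{6}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FN) \times (TP + FP) \times (TN + FP)}}, \tag{7}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + TN + FP}, \tag{8}$$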

where TP, TN, FP, and FN represent the number of the correctly recognized DHS, the number of the correctly recognized non-DHS, the number of non-DHS recognized as DHS, and the number of DHS recognized as non-DHS, respectively.
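The four metrics in (5)–(8) follow directly from the confusion counts. The short Python sketch below computes them; the function name is hypothetical and the counts in the usage example are invented for illustration, not taken from the results of this study.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return Sn, Sp, MCC, and Acc as defined in Eqs. (5)-(8)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fn) * (tn + fn) * (tp + fp) * (tn + fp))
    # Returning 0 when the denominator vanishes is a convention
    # assumed for this sketch; Eq. (7) is undefined in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "MCC": mcc, "Acc": acc}


# Invented counts: 50 positives (40 found), 50 negatives (45 found).
toy_metrics = evaluate(tp=40, tn=45, fp=5, fn=10)
```

Unlike Acc, MCC accounts for all four confusion counts, which makes it more informative than accuracy alone when, as in the benchmark dataset here, negative samples considerably outnumber positive ones.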

3. Results and Discussions

3.1. Parameter Optimization

By analyzing the dinucleotide composition of DHS and non-DHS sequences, we found that the frequencies of CC, CG, GC, and GG are higher in DHS sequences, while the frequencies of the remaining dinucleotides are higher in non-DHS sequences (Figure 1). This compositional difference is the reason why the pseudo dinucleotide composition was used in the current case.

Figure 1. Comparative frequencies of 16 dinucleotides in DHS and non-DHS sequences.

Several lines of evidence [12, 14, 31, 32] have demonstrated that DNA local structural properties, that is, angular parameters (twist, tilt, and roll) and translational parameters (shift, slide, and rise), are effective in identifying DNA attributes. Therefore, in the present work, these six structural parameters of dinucleotides were used to calculate the pseudo dinucleotide composition by using the PseKNC web-server, which is available at http://lin.uestc.edu.cn/pseknc/default.aspx.

As we can see from (2) and (3), the present model depends on the two parameters w and λ. w is the weight factor, usually within the range from 0 to 1, and λ reflects the global sequence-order effect. Generally speaking, the greater λ is, the more global sequence-order information the model contains. However, if λ is too large, it would reduce the cluster-tolerant capacity and thereby lower the cross-validation accuracy due to overfitting or the “high dimension disaster” problem [33]. Therefore, we searched for the optimal values of the two parameters in the ranges w ∈ [0,1] and λ ∈ [1,10] with steps of 0.1 and 1, respectively.

In order to reduce the computational time, the 5-fold cross-validation approach was used to optimize the two parameters together with the parameters C and γ of the SVM. We found that when w = 0.2 and λ = 6 with C = 512 and γ = 0.0078125, a peak was observed for the Acc. Accordingly, the two numerical values were used for the two uncertain parameters in the following analysis.
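To give a sense of the size of this search, the sketch below enumerates the (w, λ) grid stated above and expresses the reported optimum in terms of the feature dimension and powers of two. The variable and function names are hypothetical; the numbers come from the text.

```python
# w from 0 to 1 in steps of 0.1; integer division avoids float drift.
W_VALUES = [i / 10 for i in range(0, 11)]
# lambda from 1 to 10 in steps of 1.
LAMBDA_VALUES = list(range(1, 11))

# Every (w, lambda) pair to be scored by 5-fold cross-validation.
GRID = [(w, lam) for w in W_VALUES for lam in LAMBDA_VALUES]


def feature_dimension(lam):
    """Length of the vector D in Eq. (2): 16 dinucleotides plus lam tiers."""
    return 16 + lam


# Reported optimum, written as powers of two.
REPORTED_C = 2 ** 9        # 512
REPORTED_GAMMA = 2 ** -7   # 0.0078125

print(len(GRID), feature_dimension(6))
```

The grid contains 110 (w, λ) pairs, each of which also requires its own search over C and γ, which is why 5-fold cross-validation rather than the jackknife test was used at this stage. With the reported optimum λ = 6, each sequence is represented by a 22-dimensional vector.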

3.2. Prediction Quality

The prediction quality of the present model in identifying DHS in the benchmark dataset S, measured by the four metrics defined in (5)–(8) via the rigorous jackknife test, is listed in Table 1. To facilitate comparison, the corresponding results obtained by the previous predictor [8] on the same benchmark dataset are also given. As we can see from Table 1, the current method outperformed the existing model in all four metrics, indicating that our proposed method may become a useful tool for identifying DHS sequences.

Table 1. Comparison of different methods for identifying DHS by the jackknife test on the same benchmark dataset.

Predictor          Sn (%)   Sp (%)   Acc (%)   MCC
Our method         72.12    86.78    83.00     0.57
Noble et al. (a)   70.43    84.23    80.12     0.52

(a) From Noble et al. [8].

4. Conclusions

Since DHS are associated with a wide variety of functional elements, knowledge about the locations of DHS is helpful for deciphering genomes. However, strong DNA sequence conservation is not observed among DHS sequences, suggesting that it is difficult to computationally identify DHS from the primary DNA sequence alone.

A series of recent studies have demonstrated that the information encoded by DNA structural properties contributes to the identification of regulatory elements in genomes [12, 14, 31, 32]. Hence, in the present study, we proposed an SVM-based model for identifying DHS by using the pseudo dinucleotide composition. In this model, we integrate the dinucleotide composition with DNA structural properties. The predictive results of our model are better than those of the existing method. Therefore, it is anticipated that the proposed method may become a useful tool for identifying DHS sequences or, at the very least, that it can play a complementary role to the existing methods in this area.

Supplementary Material

Listed in Supplementary Information S1 are the 247 DHS and 710 non-DHS sequences of the benchmark dataset.


Acknowledgment

This work was supported by Foundation of Science and Technology Department of Hebei Province (no. 132777133).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

  • 1. Wu C, Bingham PM, Livak KJ, Holmgren R, Elgin SCR. The chromatin structure of specific genes: I. Evidence for higher order domains of defined DNA sequence. Cell. 1979;16(4):797–806. doi: 10.1016/0092-8674(79)90095-3.
  • 2. Gross DS, Garrard WT. Nuclease hypersensitive sites in chromatin. Annual Review of Biochemistry. 1988;57:159–197. doi: 10.1146/annurev.bi.57.070188.001111.
  • 3. Felsenfeld G, Groudine M. Controlling the double helix. Nature. 2003;421(6921):448–453. doi: 10.1038/nature01411.
  • 4. Felsenfeld G. Chromatin as an essential part of the transcriptional mechanism. Nature. 1992;355(6357):219–224. doi: 10.1038/355219a0.
  • 5. Crawford GE, Holt IE, Whittle J, et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Research. 2006;16(1):123–131. doi: 10.1101/gr.4074106.
  • 6. Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protocols. 2010;5(2):pdb.prot5384. doi: 10.1101/pdb.prot5384.
  • 7. Madrigal P, Krajewski P. Current bioinformatic approaches to identify DNase I hypersensitive sites and genomic footprints from DNase-seq data. Frontiers in Genetics. 2012;3:230. doi: 10.3389/fgene.2012.00230.
  • 8. Noble WS, Kuehn S, Thurman R, Yu M, Stamatoyannopoulos J. Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics. 2005;21(1):i338–i343. doi: 10.1093/bioinformatics/bti1047.
  • 9. Chou K. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology. 2011;273:236–247. doi: 10.1016/j.jtbi.2010.12.024.
  • 10. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–3152. doi: 10.1093/bioinformatics/bts565.
  • 11. Zhou X, Li Z, Dai Z, Zou X. Predicting methylation status of human DNA sequences by pseudo-trinucleotide composition. Talanta. 2011;85(2):1143–1147. doi: 10.1016/j.talanta.2011.05.043.
  • 12. Chen W, Feng P, Lin H, Chou K. IRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013;41(6):e68. doi: 10.1093/nar/gks1450.
  • 13. Zhou X, Li Z, Dai Z, Zou X. Predicting promoters by pseudo-trinucleotide compositions based on discrete wavelets transform. Journal of Theoretical Biology. 2013;319:1–7. doi: 10.1016/j.jtbi.2012.11.024.
  • 14. Guo SH, Deng EZ, Xu LQ, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014;30(11):1522–1529. doi: 10.1093/bioinformatics/btu083.
  • 15. Chen W, Feng PM, Lin H, Chou KC. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. BioMed Research International. 2014;2014:623149. doi: 10.1155/2014/623149.
  • 16. Chen W, Lei TY, Jin DC, Lin H, Chou KC. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 2014;456:53–60. doi: 10.1016/j.ab.2014.04.001.
  • 17. Chen W, Lin H. Prediction of midbody, centrosome and kinetochore proteins based on gene ontology information. Biochemical and Biophysical Research Communications. 2010;401(3):382–384. doi: 10.1016/j.bbrc.2010.09.061.
  • 18. Lin H, Ding H. Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. Journal of Theoretical Biology. 2011;269(1):64–69. doi: 10.1016/j.jtbi.2010.10.019.
  • 19. Liu B, Wang X, Chen Q, Dong Q, Lan X. Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS ONE. 2012;7(9):e46633. doi: 10.1371/journal.pone.0046633.
  • 20. Liu B, Wang X, Lin L, Tang B, Dong Q. Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics. 2009;10:381. doi: 10.1186/1471-2105-10-381.
  • 21. Liu B, Wang X, Lin L, Dong Q, Wang X. Exploiting three kinds of interface propensities to identify protein binding sites. Computational Biology and Chemistry. 2009;33(4):303–311. doi: 10.1016/j.compbiolchem.2009.07.001.
  • 22. Chou KC, Cai YD. Using functional domain composition and support vector machines for prediction of protein subcellular location. The Journal of Biological Chemistry. 2002;277(48):45765–45769. doi: 10.1074/jbc.M204161200.
  • 23. Hayat M, Khan A. MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM. Journal of Theoretical Biology. 2012;292:93–102. doi: 10.1016/j.jtbi.2011.09.026.
  • 24. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • 25. Chou K-C, Zhang C-T. Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 1995;30(4):275–349. doi: 10.3109/10409239509083488.
  • 26. Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 2010;263(2):203–209. doi: 10.1016/j.jtbi.2009.11.016.
  • 27. Ding C, Yuan LF, Guo SH, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. Journal of Proteomics. 2012;77:321–328. doi: 10.1016/j.jprot.2012.09.006.
  • 28. Chen W, Lin H. Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Computers in Biology and Medicine. 2012;42(4):504–507. doi: 10.1016/j.compbiomed.2012.01.003.
  • 29. Chou K, Wu Z, Xiao X. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE. 2011;6(3):e18258. doi: 10.1371/journal.pone.0018258.
  • 30. Mohabatkar H, Mohammad Beigi M, Esmaeili A. Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology. 2011;281(1):18–23. doi: 10.1016/j.jtbi.2011.04.017.
  • 31. Zuo Y, Li Q. Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics. 2011;97(2):112–120. doi: 10.1016/j.ygeno.2010.11.002.
  • 32. Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biology. 2007;8(12):R263. doi: 10.1186/gb-2007-8-12-r263.
  • 33. Wang T, Yang J, Shen H, Chou K. Predicting membrane protein types by the LLDA algorithm. Protein and Peptide Letters. 2008;15(9):915–921. doi: 10.2174/092986608785849308.
