Benchmark data for identifying DNA methylation sites via pseudo trinucleotide composition

Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou

doi:10.1016/j.dib.2015.04.021

2015 May 7;4:87–89. doi: 10.1016/j.dib.2015.04.021

PMCID: PMC4510404 PMID: 26217768

Abstract

This data article contains three benchmark datasets for training and testing iDNA-Methyl, a web-server predictor for identifying DNA methylation sites [Liu et al. Anal. Biochem. 474 (2015) 69–79].

Specifications table

Subject area	Biology
More specific subject area	Bioinformatics and Biomedicine
Type of data	Text file
How data was acquired	Using a flexible sliding-window approach [2–5]
Data format	Analyzed
Experimental factors	n/a
Experimental features	Each DNA sample was formulated by combining its trinucleotide composition (TNC) [6–8] with the pseudo amino acid components (PseAAC) [9–11] of the sequence translated from the DNA sample according to its genetic codons. In addition, several novel statistical-analysis techniques were introduced to train and test the predictor: the “Neighborhood Cleaning Rule”, the “Synthetic Minority Over-Sampling Technique”, and the “Target-Jackknife Test” [12].
Data source location	Jingdezhen 333403, China
Data accessibility	With this paper and at: http://www.jci-bioinfo.cn/DNAmethy/IDM_data.html

1. Value of the data

• DNA methylation plays an important role in regulating a variety of biological processes, which makes it important for both basic research and drug development.
• The datasets presented here are well suited to testing algorithms that identify DNA methylation sites because they are realistic and highly unbalanced.
• The first dataset (Supplementary material, File 1) provides the original sequences, from which users can construct their own benchmark datasets; the 2nd dataset (Supplementary material, File 2) and the 3rd dataset (Supplementary material, File 1) can be used to design new predictors for identifying methylation sites.

2. Data, experimental design, materials and methods

The data presented here are three benchmark datasets for training and testing iDNA-Methyl [1] (http://www.jci-bioinfo.cn/iDNA-Methyl), a web-server predictor for identifying DNA methylation sites. The samples were collected by sliding a window of nucleotides along each of the DNA sequences taken from MethDB (http://www.methdb.de/). Each DNA sample was then formulated by combining its trinucleotide composition (TNC) with the pseudo amino acid components (PseAAC) of the sequence translated from the DNA sample according to its genetic codons. In the real world, such data are highly unbalanced. Target-jackknife was used to optimize the unbalanced benchmark dataset and to minimize the consequences of the mis-prediction that this imbalance causes.
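The sliding-window step and the TNC part of the sample formulation can be illustrated with a minimal Python sketch. It is not the authors' code: the window size, the toy sequence, and the function names are assumptions made only for illustration, and the PseAAC components that complete the feature vector are not reproduced here.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 64 trinucleotides in a fixed lexicographic order: AAA, AAC, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def sliding_windows(sequence, window_size):
    """Yield (start, segment) for every window of `window_size` nucleotides."""
    for start in range(len(sequence) - window_size + 1):
        yield start, sequence[start:start + window_size]


def trinucleotide_composition(segment):
    """Return the 64 normalized frequencies of overlapping trinucleotides."""
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(segment) - 2):
        tri = segment[i:i + 3]
        if tri in counts:  # skip triplets with ambiguous bases such as N
            counts[tri] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[t] / total for t in TRINUCLEOTIDES]


# Hypothetical toy sequence and window size, chosen only for illustration.
toy_sequence = "ACGTACGTACGTACG"
segments = list(sliding_windows(toy_sequence, window_size=9))
features = [trinucleotide_composition(seg) for _, seg in segments]
```

Each window yields one 64-component TNC vector whose entries sum to 1. In a real pipeline the window would be positioned relative to candidate methylation sites, which this sketch does not attempt.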

I. The first dataset (Supplementary material, File 1) contains 2426 nucleotide segment samples, of which 787 are true methylation samples and 1639 are false methylation samples.
II. The 2nd dataset (Supplementary material, File 2) is the optimized benchmark dataset obtained after the NCR (Neighborhood Cleaning Rule) [13] treatment of the original benchmark dataset of the DNA methylation system. It contains 522 non-methylation samples that were removed from the negative subset, each of which corresponds to a vector with 72 components. For distinction, each real non-methylation sample starts with a line of “>Non-Methylation code”.
III. The 3rd dataset (Supplementary material, File 1) is the optimized benchmark dataset obtained after both the NCR (Neighborhood Cleaning Rule) [13] and SMOTE (Synthetic Minority Over-Sampling Technique) [14] treatments of the 1st benchmark dataset. It contains 1117 DNA methylation samples (including 330 hypothetical methylation samples created by SMOTE) and 1117 non-methylation samples, each of which corresponds to a vector with 72 components. For distinction, each real DNA methylation sample starts with a line of “>Methylation code”, while each hypothetical DNA methylation sample starts with a line of “Hypothetical” [6–8].
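The sample counts quoted in the three dataset descriptions are mutually consistent, and the way SMOTE creates hypothetical samples can be shown in a few lines. The Python sketch below is illustrative only: it implements the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) rather than the authors' actual procedure, and the function name, the neighbour count `k`, the random seed, and the toy vectors are all assumptions.

```python
import math
import random

# Counts quoted in the dataset descriptions.
real_methylation = 787
original_negatives = 1639
removed_negatives = 522      # removed from the negative subset
hypothetical = 330           # created by SMOTE

total_original = real_methylation + original_negatives       # 2426
remaining_negatives = original_negatives - removed_negatives  # 1117
balanced_positives = real_methylation + hypothetical          # 1117


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic vectors by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        neighbours = sorted(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-component toy vectors; the real samples have 72 components.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like_oversample(toy_minority, n_new=4, k=2)
```

Every synthetic point lies on the line segment joining two real minority samples, which is why the 3rd dataset marks such samples as hypothetical rather than observed.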

Footnotes

Appendix A. Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2015.04.021.

Contributor Information

Xuan Xiao, Email: xxiao@gordonlifescience.org.

Kuo-Chen Chou, Email: kcchou@gordonlifescience.org.

Supplementary materials

Supplementary data

mmc1.zip (300.4 KB, zip)

References

1. Liu Z., Xiao X., Qiu W.R., Chou K.C. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 2015;474:69–79. doi: 10.1016/j.ab.2014.12.009.
2. Chou K.C. Using subsite coupling to predict signal peptides. Protein Eng. 2001;14:75–79. doi: 10.1093/protein/14.2.75.
3. Chou K.C. Review: prediction of protein signal sequences. Curr. Protein Peptide Sci. 2002;3:615–622. doi: 10.2174/1389203023380468.
4. Chou K.C., Shen H.B. Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem. Biophys. Res. Commun. 2007;357:633–640. doi: 10.1016/j.bbrc.2007.03.162.
5. Shen H.B., Chou K.C. Signal-3L: a 3-layer approach for predicting signal peptide. Biochem. Biophys. Res. Commun. 2007;363:297–303. doi: 10.1016/j.bbrc.2007.08.140.
6. Chen W., Lei T.Y., Jin D.C., Lin H., Chou K.C. PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 2014;456:53–60.
7. Chen W., Zhang X., Brooker J., Lin H., Zhang L., Chou K.C. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31:119–120.
8. Liu B., Liu F., Wang X., Chen J., Fang L., Chou K.C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015. doi: 10.1093/nar/gkv458.
9. Chou K.C. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 2001;43:246–255 (Erratum: ibid., 2001;44:60).
10. Chou K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–19.
11. Chou K.C. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 2015;11:218–234.
12. Xiao X., Min J.L., Lin W.Z., Liu Z., Cheng X., Chou K.C. iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via the benchmark dataset optimization approach. J. Biomol. Struct. Dyn. 2014. doi: 10.1080/07391102.2014.998710.
13. Laurikkala J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Springer, Berlin Heidelberg; 2001. pp. 63–66.
14. Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002;16:321–357.
