Bioinformation. 2013 Jan 9;9(1):61–64. doi: 10.6026/97320630009061

MethFinder - A software package for prediction of human tissue-specific methylation status of CpG islands

Priyanka James $, Reshmi Girijadevi $, Sona Charles *, M Radhakrishna Pillai *,*
PMCID: PMC3563418  PMID: 23390346

Abstract

DNA methylation is a highly studied epigenetic mechanism involved in the regulation of various cellular processes, including chromatin structure modification, chromosomal inactivation, gene expression patterns, embryonic development and transcriptional modification. Various high-throughput techniques have evolved for the direct detection of methylation events as well as for obtaining methylation information across entire regions. However, despite these high-throughput advances in the experimental field, software tools dedicated to predicting epigenetic information from specific genome sequences are still needed. To this end, we developed MethFinder, a tissue-specific classifier based on the frequency of novel sequence patterns across nine human tissues, which discriminated methylation-prone from methylation-resistant CpG islands with an overall accuracy of 93%.

Availability

MethFinder is freely available at www.rgcb.res.in/methfinder

Keywords: DNA methylation, CpG islands, Support Vector Machines (SVM)

Background

Epigenetic systems such as DNA methylation, histone modification and chromatin remodelling tightly regulate gene specificity in mammals [1]. DNA methylation is a widely studied epigenetic modification and has a critical role in tissue-specific gene expression in mammals. Computational approaches for detecting methylation events would be a complementary aid to expensive and laborious experimental analysis. Genome-wide DNA methylation studies show that methylation status is tissue specific and possesses sequence correlations [2, 3]. Recent studies using BAC microarrays revealed evolutionary conservation of tissue-specific methylation in human tissues [3]. Comprehensive genome-wide profiles of methylated regions, both experimental and computational, would significantly improve our ability to address these questions. Currently, no tissue-specific methylation prediction tools are available; hence there is a need for a classifier that can detect patterns across tissues and calculate DNA methylation levels using available statistical models. To this end, we developed MethFinder, an efficient machine learning model that uses support vector machines (SVM) to unravel the pattern of DNA methylation in CpG dinucleotides.

Tissue-specific Sequence data sets:

The tissue-specific non-redundant cytosine methylation data were extracted from MethDB [4], a curated database of experimentally determined methylated DNA fragments. The database contains a total of 5382 methylation patterns from various sources ranging from plants to humans [5]. An in-house Python script was used to download tissue-specific methylation patterns of Homo sapiens from MethDB. We incorporated CpG islands predicted by the CpG cluster algorithm. To study the effect of flanking sequence features, we split the sequences into overlapping fragments of fixed window size. Fragments with a methylated cytosine in the center were considered methylation prone, whereas fragments with a non-methylated cytosine in the center were considered methylation resistant.
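The fragment extraction described above can be illustrated with a minimal Python sketch. It is not the authors' in-house script: the function name `centered_windows`, the 0-based position convention and the choice to skip cytosines without full flanks are all assumptions made for illustration.

```python
def centered_windows(sequence, methylated_positions, window=69):
    """Yield (fragment, label) for every CpG cytosine that has full flanks.

    label is +1 (methylation prone) when the central cytosine is listed in
    methylated_positions (0-based indices, an assumed convention), and
    -1 (methylation resistant) otherwise.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so one base sits in the center")
    half = window // 2
    seq = sequence.upper()
    methylated = set(methylated_positions)
    # Only positions with `half` bases on both sides give a full-length window.
    for i in range(half, len(seq) - half):
        if seq[i] == "C" and seq[i + 1 : i + 2] == "G":
            fragment = seq[i - half : i + half + 1]
            yield fragment, (1 if i in methylated else -1)


if __name__ == "__main__":
    # Toy sequence with a short window so the output is easy to inspect.
    for fragment, label in centered_windows("AACGTTCGAA", {2}, window=5):
        print(label, fragment)
```

Because neighbouring CpG sites lie closer together than the window length, consecutive fragments overlap, which matches the overlapping fixed-size windows described in the text.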

Pattern Detection and classification:

To detect overrepresented sequence motifs in the flanking regions, we used Multiple EM for Motif Elicitation (MEME; MEME Suite version 4.3.0) [6]. Twenty best-fit motifs were obtained for each sequence set (methylation prone and methylation resistant) for individual tissues and for all window sizes, using the ZOOPS model (zero or one occurrence per sequence) with default parameters. When submitted to MEME, datasets with window sizes increasing from 59 to 79 showed the presence of motifs for nine tissues (Blood, Brain, Kidney, Liver, Lung, Muscle, Pancreas, Prostate and Skin). For each sequence, MAST, a motif alignment program [7], determines the best match in the sequence to each motif. The frequency and position of all motif hits with a goodness-of-fit of P < 0.000001 were extracted using custom Perl scripts. The percentage of occurrence of each motif in the methylation prone and methylation resistant data sets was calculated for window sizes from 59 to 79 for all tissues. A Student's t-test was used to compare the frequency of occurrence of each motif between the two data sets; differences with P > 0.001 were considered not significant.
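The comparison of motif frequencies can be sketched as follows. This is a hedged illustration using only the Python standard library, not the authors' Perl scripts. The function names are hypothetical, and treating the per-window-size percentages of one motif as the two samples is an assumption, since the text does not state how the samples were formed.

```python
from math import sqrt
from statistics import mean, variance


def percent_occurrence(hit_flags):
    """Percentage of sequences in which a motif has a hit (one flag per sequence)."""
    return 100.0 * sum(1 for flag in hit_flags if flag) / len(hit_flags)


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom.

    Each sample needs at least two values, and the pooled variance must be
    non-zero, otherwise the statistic is undefined.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


if __name__ == "__main__":
    # Hypothetical percentages of one motif at three window sizes.
    prone = [42.0, 45.5, 44.0]
    resistant = [12.5, 15.0, 13.0]
    print(student_t(prone, resistant))
```

Converting the t statistic to a P-value requires the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) could be used for that step.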

Support vector machine (SVM) parameter optimization and calculation:

For optimization, we developed training data sets of n samples (x1, x2, ..., xi, ..., xn), where each xi is a vector of d features, with known labels {y1, y2, ..., yi, ..., yn} indicating whether the fragment is methylation prone or methylation resistant (yi ∈ {+1, -1}) (see supplementary material for equation and explanation). We used the software LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) to implement the SVM algorithm and adopted the linear kernel function, with parameters selected using the grid search script available in the LIBSVM package and 10-fold cross-validation.
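As a rough illustration of how labelled feature vectors reach LIBSVM, the sketch below writes samples in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based feature indices). The function name, the file name and the example feature values are hypothetical; the article does not describe the authors' actual feature encoding beyond motif frequencies.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    if label not in (1, -1):
        raise ValueError("label must be +1 (prone) or -1 (resistant)")
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # zero-valued features may be omitted in the sparse format
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def write_libsvm_file(path, samples):
    """Write (label, features) pairs to a LIBSVM-format training file."""
    with open(path, "w") as handle:
        for label, features in samples:
            handle.write(to_libsvm_line(label, features) + "\n")


if __name__ == "__main__":
    # Hypothetical motif-frequency vectors for two fragments.
    samples = [(1, [0.5, 0.0, 2.0]), (-1, [0.0, 1.25, 0.0])]
    write_libsvm_file("train.libsvm", samples)
```

With a file in this format, LIBSVM's `svm-train` accepts `-t 0` for a linear kernel and `-v 10` for 10-fold cross-validation, for example `svm-train -t 0 -v 10 train.libsvm`. The cost value passed with `-c` would come from the parameter search.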

The classification performance of the SVMs was assessed by sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC), calculated using Eqs. (2–5) (see supplementary material).
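The article's own equations are in the supplementary material. The sketch below uses the standard confusion-matrix definitions of these four measures, which are assumed to match Eqs. (2–5). The counts in the usage example are hypothetical and serve only to show how the function is called.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute SE, SP, ACC and MCC from confusion-matrix counts.

    tp/fn count methylation-prone fragments predicted correctly/incorrectly;
    tn/fp do the same for methylation-resistant fragments.
    """
    se = tp / (tp + fn)                      # sensitivity (true positive rate)
    sp = tn / (tn + fp)                      # specificity (true negative rate)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts for a balanced test set of 200 fragments.
    print(classification_metrics(tp=89, tn=97, fp=3, fn=11))
```

MCC is useful alongside accuracy because it stays informative when the two classes are unbalanced.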

Software Performance:

We trained the SVM classifier with training sets from nine different tissues (supplementary table 1) and tested its performance on a corresponding test set from individual tissues. The training set was randomly selected from individual tissues with specific window lengths (59, 69 and 79 bp). For each window length, this experiment was repeated with random selections of training and test sets. The best classification accuracy was observed for a window size of 69, where the best balance between specificity (0.97) and sensitivity (0.89) was also observed, together with the highest value for MCC (0.86) (Table 1; see supplementary material). The performance of the classifiers was also evaluated with receiver operating characteristic (ROC) curves (Figure 1). Here we used motif-based sequence analysis tools coupled with classification techniques to identify DNA sequence patterns that define CpG island methylation status. This study serves as proof of principle that the epigenetic state of a genomic region can be predicted from DNA sequence.

Figure 1

The receiver operating characteristic (ROC) plots for performance measure using datasets from nine different tissues
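For readers unfamiliar with how an ROC curve is built, the following minimal sketch derives ROC points and the area under the curve from classifier decision values. It is a generic illustration, not the procedure used to produce Figure 1. The function names and the example scores are hypothetical, and both classes are assumed to be present in the labels.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    scores are classifier decision values (higher means more likely prone);
    labels are +1 or -1, and both classes are assumed to be present.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1])
    print(pts, auc(pts))
```

An area close to 1 indicates that methylation-prone fragments are consistently scored above methylation-resistant ones, while 0.5 corresponds to random ranking.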

Supplementary material

Data 1
97320630009061S1.pdf (318.5KB, pdf)

Acknowledgments

This work was sponsored by Advaita Informatics, Dubai.

Footnotes

Citation: James et al, Bioinformation 9(1): 061-064 (2013)
