Epigenetics. 2012 Sep 1;7(9):982–986. doi: 10.4161/epi.21493

The landscape for epigenetic/epigenomic biomedical resources

Kabita Shakya 1,2,3,*, Mary J O'Connell 2,3, Heather J Ruskin 1,3
PMCID: PMC3515018  PMID: 22874136

Abstract

Recent advances in molecular biology and computational power have seen the biomedical sector enter a new era, with corresponding development of Bioinformatics as a major discipline. Generation of enormous amounts of data has driven the need for more advanced storage solutions and shared access through a range of public repositories. The number of such biomedical resources is increasing constantly and mining these large and diverse data sets continues to present real challenges. This paper attempts a general overview of currently available resources, together with remarks on their data mining and analysis capabilities. Of interest here is the recent shift in focus from genetic to epigenetic/epigenomic research and the emergence and extension of resource provision to support this both at local and global scale. Biomedical text and numerical data mining are both considered, the first dealing with automated methods for analyzing research content and information extraction, and the second (broadly) with pattern recognition and prediction. Any summary and selection of resources is inherently limited, given the spectrum available, but the aim is to provide a guideline for the assessment and comparison of currently available provision, particularly as this relates to epigenetics/epigenomics.

Keywords: biomedical resource, data mining, epigenetics, epigenomics, methylation, primary database, secondary database

Introduction

The completion of the Human Genome Project (HGP) in 2003 led to the identification of more than 20,000 genes and determined the three billion chemical base pairs of human DNA. Tremendous advances in medical technologies, with corresponding developments in computational power, storage capacity, inter-connectivity and cost effectiveness, have resulted in explosive growth in the generation and collection of all types of biomedical data and, in the past decade, the importance of bioinformatics has been recognized.1 Data warehousing,2 as a way of dealing with large data sets, combines databases across an entire enterprise, whereas independent or federated systems seek to integrate multiple autonomous databases into a single federation, with constituent databases interconnected via a network and often geographically decentralized.3,4 One example is the many bioinformatics data sources linked by the Entrez Life Sciences search engine.5

Biomedical data cover a wide range, from patient records to information from pharmaceutical studies, specific disease research and different ‘omics’ studies, including genomics, proteomics and transcriptomics. Resource types can be classified by two key features: first, the means or method by which access is provided to entities; second, the nature of the entities themselves. The repository or web service that provides access to these data is a vital component of biomedical data resourcing.6 An example is PubMed, the NLM’s web-based interface to MEDLINE, the premier bibliographic index to journal articles in the Life Sciences. In general, resource providers, such as PubMeth and MutationDB, review research papers from the domain and mine these for information relevant to the scientific audience. Typically, non-profit research institutes, such as the Sanger Institute, University of California Santa Cruz (UCSC), National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI), among others, make such data publicly available over the internet so that these can be further analyzed/mined for knowledge discovery.

Biological/biomedical resources may be one of several types: primary, secondary or composite. Primary databases contain information on the biological entities themselves, such as sequence or structure, e.g., SwissProt and PIR (protein sequences), GenBank and DDBJ (genome sequences). Secondary resources contain information derived from primary sources and examples include eMOTIF (Stanford) and SCOP (Cambridge). Composite resources typically draw information from a variety of different databases, such as those of the NCBI genome browser and GeneCards.7 The most popular genome browsers today are Ensembl, NCBI Map Viewer and UCSC, which act as gateways for access to genetic and epigenetic information.

Following completion of the Human Genome Project, increased attention has been paid to processes that lead to heritable changes in gene expression, during development or across generations, without altering the nucleotide sequence within the DNA. Both epigenetics and epigenomics, the genome-wide distribution of epigenetic changes, have become major areas of research focus. Principal epigenetic phenomena encompass DNA methylation, histone modification (methylation/demethylation, acetylation/deacetylation, phosphorylation, ubiquitylation and sumoylation), gene silencing, genomic imprinting and X-chromosome inactivation. Recently-launched large-scale initiatives include, among others, IHEC (International Human Epigenome Consortium),8 which plans to map up to 1,000 reference epigenomes within a decade, and the Human Epigenome Project (HEP),9 which aims to identify, catalog and interpret genome-wide DNA methylation patterns of all human genes in all major tissues.10

Epigenetics, cancer and other diseases

Epigenetic abnormalities have been found to be causative factors of cancer, genetic disorders and pediatric syndromes, as well as contributory factors of autoimmune diseases and aging.11 The recent intensive research on cancer epigenetics has also led to the discovery of many epigenetic markers that play an important role in disease initiation. As a consequence, cancer-related epigenetic resources predominate over others. Two large-scale project initiatives for cancer research are ICGC (see “ICGC” section below) and TCGA (The Cancer Genome Atlas). TCGA has achieved comprehensive sequencing, characterization and analysis of the genomic changes in various cancers and intends to chart the genomic changes involved in more than 20 types of cancers.12 The epigenetic resources are outlined in the following sections, with additional assessment of their data mining capabilities, intrinsic or externally accessed, and their adequacy provided where possible.

DNA methylation, the first and most common epigenetic alteration to be observed, can induce “epigenetic silencing” (or loss of expression) of tumor suppressor genes, causing normal cells to be transformed into cancer cells.13,14 A direct link also exists between DNA methylation and histone modification, since a number of proteins involved in DNA methylation (e.g., DNMTs and MBDs) directly interact with histone modifying enzymes, such as histone methyltransferases (HMTs) and histone deacetylases (HDACs).15 Epigenetic resources incorporating methylation signatures are described in the “Methylation” section below.

Resources for Epigenetic/Epigenomic Signatures

Epigenetic/epigenomic resources are inevitably less comprehensive to date but can be broadly categorized in terms of type of data content, tools and access, and are described below.

Methylation

PubMeth,16 a cancer methylation database, provides a sorted, annotated and summarized overview of genes reported to be methylated in various cancers, with user queries based on gene or cancer type. PubMeth draws on text-mining of MEDLINE/PubMed abstracts, combined with manual annotation of pre-selected abstracts. The text mining approach in PubMeth is fast and intelligent, enabling search of multiple aliases and textual variants of these aliases, and querying of multiple keyword lists simultaneously. PubMeth also provides the facility to browse a pre-computed gene list, without having to query the database directly.
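The alias and keyword-list searching described above can be illustrated with a minimal Python sketch. This is not PubMeth's actual implementation, which is not detailed here; the function names, the three toy abstracts and the keyword lists are all hypothetical, and the only "textual variants" handled are optional hyphens or spaces inside an alias.

```python
import re

def alias_pattern(aliases):
    """Build one case-insensitive regex that matches any alias,
    tolerating a hyphen, a space or nothing between its parts."""
    parts = []
    for alias in aliases:
        # Split the alias on hyphens/whitespace and escape each piece
        tokens = [re.escape(t) for t in re.split(r"[-\s]+", alias) if t]
        # Allow an optional separator where the alias was split
        parts.append(r"[-\s]?".join(tokens))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

def find_candidates(abstracts, gene_aliases, keyword_lists):
    """Return ids of abstracts that mention a gene alias and at least
    one term from every keyword list (all lists are checked together)."""
    gene_re = alias_pattern(gene_aliases)
    keyword_res = [alias_pattern(keywords) for keywords in keyword_lists]
    hits = []
    for abstract_id, text in abstracts.items():
        if gene_re.search(text) and all(k.search(text) for k in keyword_res):
            hits.append(abstract_id)
    return hits

# Hypothetical toy abstracts, invented for illustration only
abstracts = {
    "a1": "Promoter hypermethylation of p16 INK4a was observed in colorectal carcinoma.",
    "a2": "CDKN2A expression was measured in healthy tissue.",
    "a3": "Methylation of CDKN2A is frequent in colon cancer.",
}
gene_aliases = ["CDKN2A", "p16-INK4a", "p16"]
keyword_lists = [
    ["methylation", "hypermethylation", "methylated"],
    ["cancer", "carcinoma", "tumor"],
]
candidates = find_candidates(abstracts, gene_aliases, keyword_lists)
```

In a pipeline of this kind, the abstracts selected automatically would then be passed on for manual annotation, as the text describes for PubMeth.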

MethDB17 is also a major source of experimentally confirmed DNA methylation data but is general, more sample-oriented and not optimized for cancer-related queries. The database is designed to store and annotate information on the occurrence of methylated cytosines in DNA. It currently contains 19,905 methylation content data items and 5,382 methylation patterns or profiles for 48 species, 1,511 individuals, 198 tissues and cell lines and 79 phenotypes. MethDB also has a public online submission system available.18 The resource forms part of an integrated network of biological databases through DAS (Distributed Annotation System), enabling the epigenetic data to be viewed as a layer in the human genome, and is also connected to Ensembl (for DNA sequences with available MethDB data aligned to NCBI RefSeq).

A subset resource, MethPrimerDB,19 is a database of primer sequences used in PCR-based methylation methods. The database depends on submissions by users and administrators, which guarantee the required quality of the database but not necessarily its completeness. To date, there are 29 primer sets. In 2006, the MethBLAST feature was added for MethPrimerDB oligonucleotide sequences. No further updates since 2006, however, could be found for this resource.

MethyCancer20 is a disease-oriented database, specifically of human DNA methylation and cancer, which aims to integrate methylation databases and has developed a meta-data format for data standardization, with manual curation still used for noisy data. Four main types of data are included in MethyCancer, namely, (1) CGI clones and global CGI predictions, (2) DNA methylation data, (3) cancer information, genes and mutations, and (4) correlations of DNA methylation, gene expression and cancer. MethyView, a visualization tool from MethyCancer, is used to facilitate the browsing of methylation data in the context of existing human genome annotations. A search engine to query different data types and interactions from the MethyCancer database provides simple keyword search and also offers advanced options, namely “methylation,” “gene,” “cancer,” “clone” and “repeat” searches. For example, the methylation search enables the user to specify and combine query options, such as methylation type (pattern, profile, content, domain), data source (BIG/UHN, MethDB,17 HEP,9 Columbia University), experimental methods, sample information (tissue, sex, age, phenotype) and chromosomal positions.
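To make the idea of combining query options concrete, the following Python sketch filters a small in-memory list of records by methylation type, data source, tissue and chromosomal region. It does not reproduce the MethyCancer schema or search engine; the record fields, the `search` function and the three example records are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MethylationRecord:
    # Hypothetical record layout, not the MethyCancer schema
    methylation_type: str  # "pattern", "profile", "content" or "domain"
    data_source: str
    tissue: str
    chromosome: str
    start: int
    end: int

def search(records, methylation_type=None, data_source=None,
           tissue=None, region=None):
    """Keep records that satisfy every option supplied.
    An option left as None places no restriction on the result."""
    results = []
    for rec in records:
        if methylation_type is not None and rec.methylation_type != methylation_type:
            continue
        if data_source is not None and rec.data_source != data_source:
            continue
        if tissue is not None and rec.tissue != tissue:
            continue
        if region is not None:
            chrom, lo, hi = region
            # Require the same chromosome and an overlap with [lo, hi]
            if rec.chromosome != chrom or rec.end < lo or rec.start > hi:
                continue
        results.append(rec)
    return results

# Invented example records
records = [
    MethylationRecord("profile", "MethDB", "colon", "chr9", 21_960_000, 21_975_000),
    MethylationRecord("pattern", "HEP", "liver", "chr6", 100, 500),
    MethylationRecord("profile", "HEP", "colon", "chr9", 30_000_000, 30_001_000),
]
hits = search(records, methylation_type="profile", tissue="colon",
              region=("chr9", 21_000_000, 22_000_000))
```

Treating each unset option as "any value" keeps the simple keyword-style query and the advanced combined query in one function.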

On similar lines, Methylogix21 provides a high density DNA methylation database of human chromosomes 21 and 22, a CpG island DNA methylation database for male germ cells, enabling comprehensive analysis of DNA methylation variation between and within the germ lines of normal males, and a targeted DNA methylation database of late-onset Alzheimer disease. Similarly, Methtools is a collection of software tools for handling and analysis of DNA methylation data, generated by the Bisulfite Genomic Sequencing method.22
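As background to tools of this kind, bisulfite treatment converts unmethylated cytosines so that they are read as thymine after amplification, while methylated cytosines remain cytosine. The Python sketch below estimates a methylation level at each CpG cytosine from reads compared against a reference. It is not the Methtools algorithm; it assumes, for simplicity, that every read is ungapped, covers the whole reference and derives from the same strand, and the sequences are invented.

```python
def cpg_positions(reference):
    """0-based indices of cytosines that are followed by a guanine."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]

def methylation_levels(reference, reads):
    """For each CpG cytosine, return the fraction of informative reads
    that still show C (methylated) rather than T (converted)."""
    levels = {}
    for pos in cpg_positions(reference):
        methylated = sum(1 for read in reads if read[pos] == "C")
        converted = sum(1 for read in reads if read[pos] == "T")
        informative = methylated + converted
        # None marks a site with no C or T calls at all
        levels[pos] = methylated / informative if informative else None
    return levels

# Invented reference and bisulfite reads (CpG cytosines at indices 1 and 5;
# the non-CpG cytosine at index 9 is converted in every read)
reference = "ACGTTCGAAC"
reads = [
    "ACGTTTGAAT",
    "ACGTTCGAAT",
    "ATGTTTGAAT",
    "ACGTTTGAAT",
]
levels = methylation_levels(reference, reads)
```

Real data would additionally require alignment, handling of both strands and a check of conversion efficiency, none of which is attempted here.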

Genomic imprinting related resources

Genomic imprinting is an important epigenetic phenomenon whereby inherited genes are ‘imprinted’ due to one copy of the gene being epigenetically marked or imprinted in either the egg or the sperm. Thus, the allelic expression of an imprinted gene depends on whether it is inherited maternally or paternally. Imprinted expression can also vary between tissues, developmental stages and species.23 The Geneimprint database24 includes genes and related information on genomic imprinting for different animals, including humans, gathered from NCBI. Genes are listed by species and sorted by chromosomal location, name and imprinting status and are provided through the web interface. Similarly, an imprinted gene and parent-of-origin effect database25 presents imprinted genes and related effects. This consists of two sections: (i) a catalog of current literature on imprinted genes in humans and animals and (ii) a catalog of reports of parental origin of de novo mutations in humans alone. The addition of (ii), showing a parent-of-origin effect, expands the scope of the database and provides a useful tool for examining parental origin trends for different types of spontaneous mutations. This second section currently includes more than 1,700 mutations, found in 59 different disorders. The 85 imprinted genes are described in 152 entries from several mammalian species. In addition, more than 300 other entries describe a range of reported parent-of-origin effects in animals.26 A further resource, containing information on mouse gene imprinting,27 also includes an imprinting catalog, as well as chromosome anomalies in mutant mouse lines. This represents integration of curated information from the MRC Harwell stock resource and other Harwell databases, with additional information from external data resources such as IMSR (International Mouse Strain Resource).

Histone and chromatin-related resources

The Histone database,28 of the National Human Genome Research Institute, provides a complete set of histone protein sequences. Nucleosomes, through various core histone post-translational modifications and incorporation of diverse histone variants, can serve as epigenetic markers to control processes such as gene expression and recombination. The Histone Sequence Database is a curated collection, assembled from major public databases, of sequences and structures of histones and non-histone proteins containing histone folds. A substantial increase in the number of sequences and taxonomic coverage for histone and histone fold-containing proteins is available. The database also provides comprehensive multiple sequence alignments for each of the four core histones (H2A, H2B, H3 and H4), the linker histones (H1/H5) and the archaeal histones. Also included is current information on solved histone fold-containing structures. The database is thus an inclusive resource for the analysis of chromatin structure and function.

Chromatin.us is another web portal that includes information on chromatin proteins, histones, nucleosome structures and non-histone chromatin protein structures, and provides links to the Protein Data Bank (PDB) site for further details.29 ReplicationDomain30 is an online database for storing, sharing and visualizing DNA replication timing and transcription data, along with other numerical epigenetic data types. Data are typically obtained from DNA microarrays or DNA sequencing.

Gene silencing

An important epigenetic phenomenon, gene silencing, has also attracted attention and has been well reported in the literature. Collected papers are available on Bio-Tech Info-Net.31 Similarly, papers related to RNA-induced epigenetics and imprinting by non-coding RNAs have been collated.32

Other epigenetic biomedical resources

The evolution of epigenetic resources is still in its early stages, with provision associated with several specific research efforts and groups. Nevertheless, in line with genetic/genomic data examples, efforts are being made to connect information, even as new targets are emerging. The Epigenetics Database33 includes all known epigenetics genes/proteins discovered to date. The database is arranged in hierarchical format, based upon gene ontology. While still in its developmental (β) phase, it is expected that future developments will include user-submitted meta-data, which will be freely available for use in database and flat file format. Some sites, e.g., Epigenie,34 also provide bioinformatics tools (e.g., CpG Viewer, CpG and GC Plotter and tools for CpG Island detection). NCBI supported efforts include the Epigenetics Antibody Database,35 providing antibody information for researchers working in the field of epigenetics/epigenomics, and Unigene,36 containing same locus-of-origin transcription sequences, protein similarities, gene expression, cDNA clone reagents, genomic location and associated epigenetic information. NARNA,37 supported by Newcastle University, incorporates relationships between epigenetic events, DNA methylation, gene imprinting and X-chromosome inactivation with natural antisense RNAs. Other, locally developed or supported, current resources include StatEpigen,38 with an initial focus on colon cancer, although incorporating some information on other pathologies for comparison. Data are provided on simple and conditional molecular events, since many genetic and epigenetic alterations are expected to be mutually correlated and synergistic, and drive model input at the micro-layer.39 Specialized resources also exist for plant data.40

Large-Scale Epigenetic Project Initiatives

European project initiatives including HEP

A number of European initiatives exist for centralized projects on DNA methylation. The Human Epigenome Project (HEP)9 will provide an epigenetic resource of chromosomal DNA methylation reference profiles in human tissues and cell lines. Other initiatives include chromatin profiling (HEROIC, High-Throughput Epigenetic Regulatory Organization In Chromatin), treatment of neoplastic disease (EPITRON, Epigenetic Treatment Of Neoplastic Disease)41 and the SMARTER42 initiative, which aims to develop small inhibitors of chromatin-modifying enzymes. Another effort to provide structure to the epigenetic research landscape in Europe is that of the Epigenetic Network of Excellence, now known as Epigenesys, which aims to advance epigenetics toward Systems Biology.43

Roadmap epigenomics program

The Roadmap Epigenomics Program (also known as the Epigenomics Roadmap initiative), launched by NIH in 2008, seeks to create a series of epigenome maps to study epigenetic mechanisms, develop new epigenetic analytics, generate a repository and long-term data archive, standardize procedures and practices in epigenomics and support new technologies for these. As part of the $190 million, 5-year initiative, the Roadmap Epigenomics Mapping Consortium44 was formed to provide a public database for human epigenomics data, the Human Epigenome Atlas.45 The current release, Epigenome Atlas Release 7, includes human reference epigenomes and the results of their integrative and comparative analyses.

The NIH Roadmap Epigenomics Program has also established IHEC (International Human Epigenome Consortium),8 which aims to coordinate epigenome mapping and characterization worldwide, in order to ensure high data quality standards, coordination of data storage, management and analysis and free access to the epigenomes produced. To attain substantial coverage of the human epigenome, IHEC aims to decipher at least 1,000 epigenomes within the next 7–10 years. Officially launched in Paris (Jan 2010), with an initial (first phase) budget target of $130 million, IHEC intends to coordinate the mapping of epigenomes from not only the NIH’s Epigenomics Mapping Consortium but also from international efforts such as the European Epigenome Network of Excellence, the Danish National Research Foundation Centre for Epigenetics, and the Australian Epigenetic Alliance. The IHEC web portal provides links to databases, such as GEO, ARRAYEXPRESS and DDBJ, where epigenetic sequencing data will be made available.

Another significant large-scale program in epigenetics is the Encyclopedia of DNA Elements (ENCODE).46 This is supported by the ENCODE Consortium, an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). This initiative aims to identify all functional elements in the human genome sequence, both at the protein and RNA levels, together with the regulatory elements that control the cells and circumstances in which a gene is active.

ICGC

Genomic changes that occur in various types of cancer are being investigated by the International Cancer Genome Consortium (ICGC).47 The goal is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes. Many samples from one tumor type or subtype will be analyzed in detail so that this initiative promises to provide crucial insights on genetic-epigenetic links.

Discussion and Conclusions

The biomedical resources relating, primarily, to epigenetic data that were surveyed here are numerous and range from small- to large-scale, with considerable ongoing integration and new links still being forged. In common with many newly identified research targets, early-stage resources are often very specific and are supported locally, and this is still the case for much useful epigenetic data. Many such databases and their software tools are publicly accessible from academic/research institutions, while others are commercially available (Table S1). Quality assurance, effective annotation and overall management remain major issues, but appropriate analysis must also keep pace and is typically uneven (Table S2). Clearly, the generation of a centralized repository for epigenetics-related data is desirable and currently lacking, but new technologies offer increased potential for processing solutions down the line. Notably, biomedical needs are an important focus for federated database development, health-grid technology and, of course, Cloud computing.

Major initiatives to ensure quality and standards for genetic and epigenetic research do exist and some, such as IHEC and HEP, are described in this review. With improved technology, these should lead to improved data mining tools where those currently available for epigenetic/epigenomic analyses are limited and predominantly sequence-oriented, ranging from identification, through PCR and initial pattern matching (Table S2 presents the current summary).

Supplementary Material

Additional material
epi-7-982-s01.pdf (75.7KB, pdf)

Acknowledgments

The authors would like to acknowledge funding from the Daniel O’Hare Scholarship program, DCU, which made it possible to carry out this study.

Glossary

Abbreviations:

BIG: Beijing Institute of Genomics
BRO: biomedical resource ontology
DDBJ: DNA Data Bank of Japan
EBI: European Bioinformatics Institute
ENCODE: Encyclopedia of DNA Elements
EPITRON: epigenetic treatment of neoplastic disease
HEP: Human Epigenome Project
HEROIC: high-throughput epigenetic regulatory organization in chromatin
HGP: Human Genome Project
ICGC: International Cancer Genome Consortium
IHEC: International Human Epigenome Consortium
NCBI: National Center for Biotechnology Information
NHGRI: National Human Genome Research Institute
NIH: National Institutes of Health
NLM: National Library of Medicine
PDB: Protein Data Bank
SCOP: structural classification of proteins
UHN: University Health Network


Articles from Epigenetics are provided here courtesy of Taylor & Francis