PLOS ONE. 2012 Jul 13;7(7):e35230. doi: 10.1371/journal.pone.0035230

SyStemCell: A Database Populated with Multiple Levels of Experimental Data from Stem Cell Differentiation Research

Jian Yu 1,#, Xiaobin Xing 1,2,¤a,#, Lingyao Zeng 1,4, Jiehuan Sun 1,3,¤b, Wei Li 1,3, Han Sun 1,2, Ying He 1,2, Jing Li 1,3, Guoqing Zhang 1, Chuan Wang 1, Yixue Li 1,2,*, Lu Xie 1,*
Editor: Jason E Stajich 5
PMCID: PMC3396617  PMID: 22807998

Abstract

Elucidation of the mechanisms of stem cell differentiation is of great scientific interest. Increasing evidence suggests that stem cell differentiation involves changes at multiple levels of biological regulation, which together orchestrate the complex differentiation process; many related studies have been performed to investigate the various levels of regulation. The resulting valuable data, however, remain scattered. Most of the current stem cell-relevant databases focus on a single level of regulation (mRNA expression) from limited stem cell types; thus, a unifying resource would be of great value to compile the multiple levels of research data available. Here we present a database for this purpose, SyStemCell, deposited with multi-level experimental data from stem cell research. The database currently covers seven levels of stem cell differentiation-associated regulatory mechanisms, including DNA CpG 5-hydroxymethylcytosine/methylation, histone modification, transcript products, microRNA-based regulation, protein products, phosphorylation proteins and transcription factor regulation, all of which have been curated from 285 peer-reviewed publications selected from PubMed. The database contains 43,434 genes, recorded as 942,221 gene entries, for four organisms (Homo sapiens, Mus musculus, Rattus norvegicus, and Macaca mulatta) and various stem cell sources (e.g., embryonic stem cells, neural stem cells and induced pluripotent stem cells). Data in SyStemCell can be queried by Entrez gene ID, symbol, alias, or browsed by specific stem cell type at each level of genetic regulation. An online analysis tool is integrated to assist researchers to mine potential relationships among different regulations, and the potential usage of the database is demonstrated by three case studies. SyStemCell is the first database to bridge multi-level experimental information of stem cell studies, which can become an important reference resource for stem cell researchers. 
The database is available at http://lifecenter.sgst.cn/SyStemCell/.

Introduction

Stem cells are of great interest to the biomedical research community due to their pluripotent differentiation potential and capacity for unlimited self-renewal. Elucidation of the underlying molecular mechanisms of stem cell differentiation could contribute to the advancement of cell-based regenerative medicine [1]. In the last decade, many large-scale experiments have been performed to investigate the process of stem cell differentiation from different perspectives, and abundant data have been generated. DNA CpG 5-hydroxymethylcytosine/methylation (5 hmC/5 mC) and histone modification have been shown to play crucial roles in regulating stem cells during differentiation [2], [3], [4]. Transcriptome profiling and mass spectrometry analyses have revealed characteristic gene/miRNA expression patterns and protein abundance/kinase-substrate dynamics that are specific to some stem cell types and their differentiated counterparts [5], [6], [7], [8]. Transcription factors (TF) such as Pou5f1 (Oct4), Sox2 and Nanog have long been considered essential for establishing the regulatory networks that define and maintain the undifferentiated state of stem cells [9], [10].

However, most experimental data generated by modern technology for different levels of regulation and different stem cell types are still scattered across individual published papers, either as included results or as supplementary materials. Given that recent evidence indicates that different levels of regulatory mechanisms could interact to orchestrate the complex differentiation process [11], [12], [13], a unifying resource with a comprehensive collection of currently available multi-level, multi-organism stem cell data could be of great value, allowing cross-referencing of such orchestration and thus promoting stem cell related research.

Several pioneering databases have been developed to collect stem cell-related information; many of them focus on single-level experimental data from limited studies. BloodExpress (http://hscl.cimr.cam.ac.uk/bloodexpress/index.html) stores 271 gene expression profiles derived from 15 distinct studies on mouse immature stem cells, intermediate multipotent progenitors and mature blood cells [14]. FunGenES (http://biit.cs.ut.ee/fungenes/) covers eleven datasets of mRNA expression profiles focusing on mouse ES cells [15]. Besides the most widely studied expression profiles, some databases provide other kinds of information. CELLPEDIA (http://cellpedia.cbrc.jp/), a repository for human cell studies and differentiation analyses, provides cell location and taxonomy information in addition to gene expression data compiled from journal papers [16]. StemDB (http://www.stemdb.org/stemdb/), which was mainly designed for stem cell project management, contains stem cell-relevant information on antibodies, markers and primers in addition to large-scale mRNA expression data. Recently, databases curating data from more than one regulatory level have started to emerge, but only for limited stem cell types. For instance, UESC (http://scgap.systemsbiology.net/) is a database for urologic epithelial stem cells with gene expression data and immunohistochemistry images [17]. The last on the list is ESCDb (http://biit.cs.ut.ee/escd/help.html), which gathers ChIP and microarray experiments with a focus on pluripotency-associated TFs involved in human and mouse ES and carcinoma cells [18]. Compared to UESC, ESCDb offers a summarized view of its multiple-level data collection, but the web page does not support data browsing and its latest datasets are now out of date (last updated two years ago).

Therefore, we have developed SyStemCell, a database populated with seven levels of experimental data manually curated from 285 carefully selected publications from PubMed. Its data collection covers DNA CpG 5-hydroxymethylcytosine/methylation (5 hmC/5 mC), histone modification, transcript products, microRNA-based regulation, protein products, phosphorylation proteins and TF regulation, for diverse stem cell types from four organisms (Homo sapiens, Mus musculus, Rattus norvegicus, and Macaca mulatta). An online analysis tool is also integrated to mine potential relationships among different regulation levels and possibly to formulate new hypotheses. In addition, by comparing the human and mouse data available in the download section, we investigate a co-regulatory network that is conserved in these two species. These characteristics currently make SyStemCell one of the most comprehensive and up-to-date resources for stem cell research. It provides a basic platform for users to extract relationships suggested by the multi-source data and should contribute to a more in-depth understanding of stem cell biology.

Methods

Data Collection and Curation

A semi-automatic method was employed to collect and curate multiple levels of original qualitative and quantitative stem cell experimental data from peer-reviewed publications in PubMed (Figure 1), as follows:

Figure 1. Pipeline of data collection, curation and recording in SyStemCell.

  1. PubMed was automatically surveyed for large-scale experiments using the keyword “stem cell” along with level-specific keywords for the time period June 2000 to June 2011. The level-specific keywords included “DNA methylation”, “DNA 5-hydroxymethylcytosine”, “histone modification” and “ChIP-Seq” for epigenetic modification; “transcription profile”, “expression profile”, “transcriptome”, “transcriptomics”, “RNA-Seq” and “microarray” for mRNA expression; “microRNA” for microRNA regulation; “proteome”, “proteomics”, and “mass spectrometry” for protein abundance; “phosphorylation” and “phosphoproteome” for protein phosphorylation information; “ChIP-Chip”, “ChIP-Seq” and “transcription factor” for transcriptional regulation. In addition, PubMed was searched for specific studies on stem-cell master genes (e.g., Pou5f1) with low-throughput experimental results (e.g., Western blot, real-time PCR, bisulfite sequencing).

  2. To ensure data availability and quality, the original data in retrieved papers were manually checked for the following points of concern: (1) whether the experimental cell type was defined as stem cell (e.g., excluding precursors); (2) whether the experimental data were included in the original paper or available in supplementary information; (3) whether the experimental design relevant to the data generation was provided. Based on these criteria, 285 publications were selected, of which 22 papers were related to DNA CpG 5 hmC/5 mC, 30 to histone modification, 109 to mRNA expression, 58 to microRNA regulation, 68 to protein abundance, 5 to protein phosphorylation and 14 to TF regulation (Table S1; one paper may cover two or more regulatory levels). The data for both large-scale and low-throughput experiments were strictly curated as raw gene entries before being deposited into SyStemCell. The items recorded for each raw gene entry at each regulatory level include: original gene/protein accession number, stem cell type, control sample type, treatment used to induce stem cell differentiation (if data available), regulatory state in stem cell sample compared to control sample, and PubMed accession number. Statistical cutoffs for mRNA/miRNA/protein detected and/or differentially expressed, specific experimental operation platforms, and other related original information in each publication were also extracted and recorded along with gene entries (Table S2).

  3. The original gene/protein accession numbers in raw gene entries were derived from various data sources, including Entrez Gene [19], UniGene (http://www.ncbi.nlm.nih.gov/unigene), GenBank [20], NCBI RefSeq [21], UniProt [22], and Ensembl [23]. To cross-link the multi-level data in SyStemCell, all original accession numbers are referenced to Entrez Gene.

  4. Gene annotation information was extracted from the Gene Ontology database [24], Biocarta Pathway (http://www.biocarta.com/), Biosystems Pathway [25] and dbDEPC [26]. Biocarta Pathway contains signaling pathway information in human and mouse while Biosystems Pathway defines biosystems consisting of interacting genes, proteins, and small molecules (http://www.ncbi.nlm.nih.gov/biosystems). dbDEPC is an in-house database of differentially expressed proteins in human cancers, which might allow a quick check of tumor relevance for genes identified in stem cell research.
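
The keyword survey in step 1 can be sketched as follows. This is only a minimal illustration that assembles Boolean query strings from the level-specific keywords listed above. The names `LEVEL_KEYWORDS` and `build_query` are hypothetical, joining the keywords of one level with OR is an assumption, and the date restriction (June 2000 to June 2011) and the actual submission to PubMed are deliberately left out.

```python
# Level-specific keywords as listed in step 1 of the pipeline.
LEVEL_KEYWORDS = {
    "epigenetic modification": [
        "DNA methylation", "DNA 5-hydroxymethylcytosine",
        "histone modification", "ChIP-Seq",
    ],
    "mRNA expression": [
        "transcription profile", "expression profile", "transcriptome",
        "transcriptomics", "RNA-Seq", "microarray",
    ],
    "microRNA regulation": ["microRNA"],
    "protein abundance": ["proteome", "proteomics", "mass spectrometry"],
    "protein phosphorylation": ["phosphorylation", "phosphoproteome"],
    "transcriptional regulation": ["ChIP-Chip", "ChIP-Seq", "transcription factor"],
}


def build_query(level: str, base: str = "stem cell") -> str:
    """Combine the base keyword with the OR-joined keywords of one level.

    Assumption: keywords within a level are alternatives (OR) and are
    combined with the base keyword by AND.
    """
    alternatives = " OR ".join(f'"{kw}"' for kw in LEVEL_KEYWORDS[level])
    return f'"{base}" AND ({alternatives})'


for level in LEVEL_KEYWORDS:
    print(f"{level}: {build_query(level)}")
```

Keeping the keyword lists in one dictionary makes it easy to rerun the survey for a single level when the curated collection is updated.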

Database Construction

SyStemCell consists of a relational database and a dynamic web interface, implemented using MySQL Server Edition 5.0 running on a Red Hat Linux server. The web interface is implemented with JSP technology and AJAX on an Apache Tomcat 6.0 server. The online analysis tools, including co-localization analysis and Venn diagram plotting, are developed with R (http://www.r-project.org/).
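
To make the stored items concrete, one raw gene entry could be modeled roughly as below. This is a sketch only, not the actual MySQL schema: the class name `GeneEntry` and all field names are hypothetical, chosen to mirror the items listed under Data Collection and Curation, and the state vocabulary follows the Database Content section. The identifiers in the usage lines are placeholders, not real accessions.

```python
from dataclasses import dataclass
from typing import Optional

# Regulatory states described in the paper; TF regulation uses its own two statuses.
STATES = {"increase", "decrease", "detected", "TF", "TF target"}


@dataclass(frozen=True)
class GeneEntry:
    """One curated gene entry (hypothetical field names)."""
    accession: str                 # original gene/protein accession number
    entrez_gene_id: int            # Entrez Gene ID used to cross-link levels
    level: str                     # regulatory level, e.g. "mRNA expression"
    stem_cell_type: str
    control_sample: Optional[str]  # None when the study had no control
    treatment: Optional[str]       # differentiation treatment, if available
    state: str                     # regulatory state versus control
    pubmed_id: str                 # source publication

    def __post_init__(self) -> None:
        if self.state not in STATES:
            raise ValueError(f"unknown regulatory state: {self.state!r}")


# Placeholder values for illustration only.
entry = GeneEntry(
    accession="ACC-PLACEHOLDER",
    entrez_gene_id=0,
    level="mRNA expression",
    stem_cell_type="ES cells",
    control_sample="embryonic fibroblasts",
    treatment=None,
    state="increase",
    pubmed_id="PMID-PLACEHOLDER",
)
print(entry.level, entry.state)
```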

Database availability

SyStemCell can be accessed via http://lifecenter.sgst.cn/SyStemCell/. All data in SyStemCell are freely available through the download page http://lifecenter.sgst.cn/SyStemCell/Download.jsp.

Results

Database Content

Currently, SyStemCell covers four organisms (Homo sapiens, Mus musculus, Rattus norvegicus, and Macaca mulatta) and diverse stem cell types, including ES cells, hematopoietic stem/progenitor cells (HSC/HPC), mesenchymal stem cells (MSC), induced pluripotent stem cells (iPSC), neural stem cells (NSC), cancer stem cells, and others. Regarding cell type and data type in publications, ES cell related studies (48.9%) and transcript-level data (35.8%) constitute the most abundant knowledge in stem cell research (Figure 2A–B). However, as for entry count, DNA 5 hmC/5 mC, histone modification and TF regulation now form the predominant proportion of SyStemCell (76.7%), due to the explosion of ChIP-Seq technology.

Figure 2. Database content of SyStemCell.

(A) Summary of original papers on seven levels of regulation; transcript products account for the largest proportion of all papers recorded in SyStemCell. (B) Summary of the top 5 stem cell types from original papers; ESC (Embryonic Stem Cells) rank first. MSC, Mesenchymal Stem Cells; HSC/HPC, Hematopoietic Stem/Progenitor Cells; NSC, Neural Stem Cells; iPSC, induced Pluripotent Stem Cells. (C) Summary of entries across seven regulatory levels. The entry counts are log2 transformed for each level. (D) Pie plot of the number of regulatory levels covered by each of the 43,434 genes in SyStemCell.

The database now contains information covering seven levels of stem cell gene regulation, including DNA CpG 5 hmC/5 mC (168,291 entries, 27,645 for 5 hmC and 140,646 for 5 mC), histone modification (319,496 entries), mRNA expression (164,089 entries), microRNA-based regulation (1,412 entries), protein abundance (30,299 entries), protein phosphorylation (24,360 entries) and TF regulation (234,274 entries) (Figure 2C). In total, 43,434 Entrez genes are recorded in SyStemCell; of these, 36,385 genes (84%) show more than one level of regulation, and 24,196 genes (56%) demonstrate four to seven levels of regulation (Figure 2D). Regulatory state, determined by comparing stem cells with control cells, is denoted as “increase” (hypermethylation, presence of histone modification, presence of phosphorylation, or up-regulation of transcript products, miRNA expression and protein abundance) or “decrease” (hypomethylation, absence of histone modification, absence of phosphorylation, or down-regulation of transcript products, miRNA expression and protein abundance). A state recorded as “detected” means that either the experimental design included no control cells or no statistical test (such as a p-value or false discovery rate) was reported in the original paper (Figure S1: A–D). The only exception that cannot be denoted as “increase”, “decrease” or “detected” is transcription factor regulation, in which genes are categorized into only two statuses: transcription factor (TF) and TF target (Figure S1: E).
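
The figures quoted above can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers stated in this paper; the variable names are illustrative.

```python
# Entry counts per regulatory level, as reported in the text.
entries = {
    "DNA CpG 5 hmC/5 mC": 27_645 + 140_646,  # 5 hmC + 5 mC
    "histone modification": 319_496,
    "mRNA expression": 164_089,
    "microRNA-based regulation": 1_412,
    "protein abundance": 30_299,
    "protein phosphorylation": 24_360,
    "TF regulation": 234_274,
}

total_entries = sum(entries.values())
total_genes = 43_434
genes_multi_level = 36_385    # more than one level of regulation
genes_four_to_seven = 24_196  # four to seven levels of regulation

print(entries["DNA CpG 5 hmC/5 mC"])                    # 168291
print(total_entries)                                    # 942221
print(round(100 * genes_multi_level / total_genes))     # 84
print(round(100 * genes_four_to_seven / total_genes))   # 56
```

The per-level counts sum to the 942,221 gene entries reported in the Abstract, and the two percentages round to the 84% and 56% given above.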

Database Utility

SyStemCell provides two data retrieval methods on its homepage. One is a gene-based query, supporting Entrez gene ID, symbol, or alias. The retrieved page presents information in three sections: Gene Description, Multi-level Data Visualization, and Gene Annotation. If any information about the query gene is present in the database, SyStemCell first displays a gene summary section, including the official gene symbol, gene ID, official full name, and organism. Next, in the multi-level visualization section, the related entries are summarized as a heatmap-like table, in which red indicates “up-regulated”, grey “detected only” and blue “down-regulated” (Figure 3A, with the mouse stem cell master gene “Pou5f1” as the query gene). Numbers in the table indicate the entry count for each regulation level in each state. More detailed information about each regulatory level can be viewed and downloaded on a separate page, reached through a “magnifier” button, for further investigation (Figure 3B–D). Below this part is the gene annotation section, providing annotation information from Gene Ontology, Biocarta Pathway, Biosystems Pathway and dbDEPC. Additionally, the pages for mRNA expression and protein abundance supply a brief summary of the experimental record, covering the related platform, preprocessing method and filtering condition (Figure S1F). All available annotations are hyperlinked to the original pages in their corresponding databases (GO, dbDEPC, NCBI and Biocarta).
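
The heatmap-like summary table can be thought of as a count of entries per regulation level and state. The following sketch shows one way to compute it; the function `summarize_gene` is hypothetical, the records are invented for illustration and are not taken from the database, and TF regulation (which uses different statuses) is left out.

```python
from collections import Counter

# Colors of the heatmap-like table as described in the text.
STATE_COLORS = {"increase": "red", "detected": "grey", "decrease": "blue"}


def summarize_gene(entries):
    """Return {level: {state: count}} for (level, state) records of one gene."""
    counts = Counter(entries)
    table = {}
    for (level, state), n in counts.items():
        # Start every row with a zero for each state so the table is rectangular.
        table.setdefault(level, {s: 0 for s in STATE_COLORS})[state] = n
    return table


# Invented records for illustration only.
toy_entries = [
    ("mRNA expression", "increase"),
    ("mRNA expression", "increase"),
    ("mRNA expression", "detected"),
    ("DNA CpG 5 hmC/5 mC", "decrease"),
]

for level, row in summarize_gene(toy_entries).items():
    cells = ", ".join(f"{STATE_COLORS[s]}={n}" for s, n in row.items())
    print(f"{level}: {cells}")
```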

Figure 3. Queries retrieved from SyStemCell, using mouse gene “Pou5f1” (Oct4) as an example.

(A) Multi-level summary page and external annotation (only partially displayed). (B) DNA CpG methylation information. (C) Histone modification information (only partially displayed). (D) microRNA regulation information.

SyStemCell also allows for stem cell-specific data browsing via the ‘browse’ page (Figure 4A). Users can browse by organism, level of regulation, stem cell type, and/or control sample. Dynamic dependent boxes, implemented with Ajax, are used on this page to avoid null hits during browsing. When a selection is made in a “Parent” box (e.g., mouse ES cells as “Stem Cell Sample”), the “Child” list box returns only the matching options available in the database (e.g., embryonic fibroblasts as the “Control sample” for ES cells) (Figure 4B). After all boxes are selected, the retrieved page displays the related information and provides another standalone page, similar to Figure 3B–D, from which users can download the results.
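
The logic behind the dependent boxes can be illustrated independently of the JSP/Ajax implementation. In the sketch below, written in Python rather than the site's own technology, the function `child_options` and the record fields are hypothetical and the records are invented; the point is only that a child box lists values that co-occur with the parent selections, so no combination leads to an empty result.

```python
def child_options(records, selections, child_field):
    """List values of child_field among records matching all parent selections."""
    matches = (r for r in records
               if all(r[field] == value for field, value in selections.items()))
    return sorted({r[child_field] for r in matches})


# Invented records for illustration only.
records = [
    {"organism": "Mus musculus", "stem_cell": "ES cells",
     "control": "embryonic fibroblasts"},
    {"organism": "Mus musculus", "stem_cell": "neural stem cells",
     "control": "neurons"},
    {"organism": "Homo sapiens", "stem_cell": "ES cells",
     "control": "fibroblasts"},
]

# Parent selections narrow the options offered by the child box.
print(child_options(records, {"organism": "Mus musculus"}, "stem_cell"))
print(child_options(records,
                    {"organism": "Mus musculus", "stem_cell": "ES cells"},
                    "control"))
```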

Figure 4. Browse page and dynamic selecting box.

(A) Browse page for seven levels of regulatory information in SyStemCell. (B) Dynamic selecting box (using histone modification H3K27me3 in mouse ES cells and fibroblasts as an example). “Child” boxes are only displayed when their “Parent” boxes are selected.

Co-Localization Analysis Tool

It is now believed that the ‘stemness’ state of stem cells is regulated by the orchestration of the transcription regulation network together with a set of ‘chromatin signatures’ that support an environment maintaining self-renewal and that are permissive for differentiation [27]. SyStemCell therefore implements an online analysis tool to help researchers investigate the correlation among three important regulation levels: DNA 5 hmC/5 mC, histone modification and transcription factor regulation (Figure 5A). After selecting modifiers/regulators of interest, such as H3K4me3 and H3K27me3 (histone modifications) and Nr5a2, Pou5f1 (also known as Oct3/4), Sox2 and Nanog (transcription factors) in the mouse genome, a lower triangular matrix consisting of ellipses of different colors is plotted in the Co-localization Analysis page (Figure 5B). Each ellipse represents a Spearman correlation coefficient between two modifiers/regulators, calculated in the following steps. First, the presence of each modifier/regulator at each gene in the mouse/human genome is summarized, where 1 represents detected and 0 represents not detected. Next, the “0” and “1” values are assembled into a vector ordered by gene name, and Spearman correlation coefficients are calculated between each pair of modifiers/regulators. Finally, a graphical display of the correlation matrix is plotted, in which a red ellipse slanting like a slash (/) indicates positive correlation, a blue ellipse slanting like a backslash (\) indicates negative correlation, and a grey circle indicates no correlation. To further examine the genes shared by a co-localized pair of interest, and to test whether the intersection is random, SyStemCell also provides an online Venn diagram plotting tool (Figure 5C) that can be followed by enrichment analysis via DAVID [28].
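
The correlation step can be sketched in a few lines. The online tool itself is written in R; the version below is an independent illustration in Python using only the standard library, with hypothetical function names and invented gene sets. It builds the 0/1 presence vectors in gene-name order and computes Spearman's coefficient as the Pearson correlation of average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def presence_vector(genes, marked):
    """1 if the modifier/regulator was detected at the gene, else 0."""
    return [1 if g in marked else 0 for g in genes]


# Invented gene sets for illustration only.
genes = sorted(["g1", "g2", "g3", "g4", "g5", "g6"])
marks = {
    "mark_A": {"g1", "g2", "g3"},
    "mark_B": {"g1", "g2", "g4"},
    "mark_C": {"g4", "g5", "g6"},
}
vectors = {name: presence_vector(genes, hit) for name, hit in marks.items()}

# Lower triangle of the correlation matrix.
names = list(vectors)
for i, a in enumerate(names):
    for b in names[:i]:
        print(f"{a} vs {b}: {spearman(vectors[a], vectors[b]):+.2f}")
```

For 0/1 vectors the ranking step does not change the result, because ranks are a linear transformation of the two possible values; the coefficient then coincides with the Pearson (phi) coefficient of the binary vectors.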

Figure 5. Co-Localization analysis page and example.

(A) Analysis can be carried out in two organisms (human and mouse) and three regulation levels (CpG hydroxy/methylation, histone modification and transcription factor binding). (B) Correlation matrix created by selecting modifiers/regulators of interest (Pou5f1, Nr5a2, Sox2, Nanog, H3K4me3 and H3K27me3) in mouse. A red ellipse slanting like a slash indicates positive correlation, a blue ellipse slanting like a backslash indicates negative correlation, and a grey circle indicates no correlation. (C) Venn diagram of Pou5f1-targeted genes and Nr5a2-targeted genes. The gene list in each part of the plot can be downloaded separately to run enrichment analysis in DAVID.

Case Studies of Utilizing the Database and the Co-localization Tool

To illustrate applications of SyStemCell, we present three examples at three levels: single-gene search and result display, co-localization of a selected group of modifications and TFs, and a co-regulatory network conserved across species, obtained by comparing whole datasets from different species.

A prominent mouse stem cell master gene, Pou5f1, critical for early embryogenesis and for ES cell pluripotency [29], [30], is recorded with six levels of regulation in SyStemCell (Figure 3A). The gene query results show that mRNA expression and protein abundance are significantly higher in stem cells than in their differentiated counterparts, which is confirmed by many related experiments across different regulation levels. The increase could be associated with the following changes, detailed in Figure 3B–D: 1) decrease in DNA CpG methylation intensity in the promoter region, which could facilitate gene expression [31], [32], 2) increase in the histone marks H3ac and H3K4me3 and decrease in H3K27me in the upstream/promoter region, which could also influence the mRNA expression level [33], [34], [35], and/or 3) microRNA-induced degradation of Pou5f1, as suggested by several experiments [36], [37].

Second, the potential usage of the co-localization analysis tool in SyStemCell is illustrated in Figure 5B, from two perspectives. Firstly, significant co-localization patterns among Oct4 (Pou5f1), Sox2 and Nanog (OSN) are observed, in good agreement with the findings that these three factors form the core of a transcription factor network that acts synergistically for ES cell pluripotency and self-renewal in both human and mouse [38], [39], [40]. Secondly, the co-localization pair of H3K4me3 and H3K27me3 (Figure 5B) supports previous discoveries that they are the most studied bivalent modification contributing to developmental control of ES cells [4], [41]. Besides conforming to existing knowledge, this analysis tool may also provide new insights for formulating hypotheses. For example, Figure 5B shows a correlation between different regulation levels: H3K4me3 and the OSN genes. Their interconnectivity remained unclear until very recently, when H3K4me3 was found to interact with the core transcriptional network to maintain ES cell self-renewal [42]. As another example, all OSN genes share a proportion of target genes with Nr5a2 (Figure 5B–C), suggesting that Nr5a2 may bypass the need for OSN genes in iPSC derivation by somatic cell reprogramming; this was realized experimentally by Heng et al [43] in 2010.

Finally, integrating data across different species to reveal evolutionarily conserved regulatory patterns in stem cells is always of great interest. Here, by combining epigenetic modification (including transcription regulation) data from both Mus musculus and Homo sapiens, a co-regulatory network was extracted to give a brief overview of the transcription regulation and epigenetic modification relationships that exist, or are ‘conserved’, in both species (Figure 6). The co-regulatory network was plotted by selecting candidate pairs satisfying the following three criteria in co-localization analysis: i) the candidate pair existed in both human and mouse; ii) the Bonferroni-adjusted p value of the Spearman correlation was below 0.001; and iii) the intersection genes of the pair were enriched at least 2-fold relative to random expectation. In this co-regulatory network, H3K4me3 is notably the hub with the largest degree, showing its multi-faceted roles in mediating DNA 5 hmC (hydroxymethylcytosine) [44], histone modification (H3K27me3) [45] and TF targeting (OSN: Sox2, Pou5f1 and Nanog) [46] in a manner conserved between Homo sapiens and Mus musculus. The bivalent modification between H3K4me3 and H3K27me3 and the interaction of H3K4me3 with OSN were also identified in the second case study (the preceding paragraph).
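
The three selection criteria can be expressed as a small filter. The sketch below is an illustration under stated assumptions, not the code used for Figure 6: the function names and all numbers are invented, the Bonferroni adjustment is taken as the raw p value multiplied by the number of tested pairs (capped at 1), the random expectation for the overlap is taken as |A|·|B|/N for gene sets A and B out of N genes, and criteria ii) and iii) are assumed to be required in each species.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


def fold_enrichment(n_a, n_b, n_overlap, n_genes):
    """Observed overlap divided by the overlap expected under independence."""
    expected = n_a * n_b / n_genes
    return n_overlap / expected


def passes(stats, n_tests, p_cutoff=0.001, min_fold=2.0):
    adjusted = bonferroni(stats["p"], n_tests)
    fold = fold_enrichment(stats["n_a"], stats["n_b"],
                           stats["overlap"], stats["n_genes"])
    return adjusted < p_cutoff and fold >= min_fold


def conserved_edges(human, mouse, n_tests):
    """Pairs present in both species that pass both cutoffs in each species."""
    shared = human.keys() & mouse.keys()
    return sorted(pair for pair in shared
                  if passes(human[pair], n_tests) and passes(mouse[pair], n_tests))


# Invented statistics for illustration only.
human = {
    ("H3K27me3", "H3K4me3"): {"p": 1e-8, "n_a": 1000, "n_b": 500,
                              "overlap": 300, "n_genes": 20000},
    ("H3K4me3", "Sox2"): {"p": 1e-3, "n_a": 1000, "n_b": 400,
                          "overlap": 100, "n_genes": 20000},
}
mouse = {
    ("H3K27me3", "H3K4me3"): {"p": 1e-9, "n_a": 1200, "n_b": 600,
                              "overlap": 350, "n_genes": 22000},
    ("H3K4me3", "Sox2"): {"p": 1e-10, "n_a": 1200, "n_b": 500,
                          "overlap": 200, "n_genes": 22000},
    ("Nanog", "Sox2"): {"p": 1e-12, "n_a": 800, "n_b": 500,
                        "overlap": 300, "n_genes": 22000},
}

print(conserved_edges(human, mouse, n_tests=10))
```

With these toy numbers only the H3K27me3/H3K4me3 pair survives: the H3K4me3/Sox2 pair fails the adjusted p value cutoff in human, and the Nanog/Sox2 pair is present in mouse only.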

Figure 6. Conserved co-regulatory network in both Homo sapiens and Mus musculus species.

Each edge (representing a pair of modifiers/regulators) must satisfy three criteria: the pair exists in both human and mouse, the Bonferroni-adjusted p<0.001, and the intersection genes of the pair are enriched at least 2-fold. The gene symbols are shown as in Mus musculus. The node size is proportional to its degree, and the color represents the type of modifier/regulator: red, DNA hydroxy/methylation; blue, histone modification; yellow, transcription factor.

Another intriguing finding shown in the co-regulatory network is that 5 hmC, a previously unappreciated modification of DNA now considered the sixth base of the genome [47], is connected to both the transcription-active modification marker H3K4me3 and the repressive marker H3K27me3. Although the detailed mechanisms and function of 5 hmC remain enigmatic, it has been suggested that 5 hmC plays a dual role in transcription regulation [48]. When associated with H3K4me3, it may contribute to maintaining a more accessible chromatin structure to facilitate TF binding; on the other hand, when connected to the trimethylation of H3K27 (H3K27me3), it may help the generation of heterochromatin, thus preventing TF binding [49]. Together, the conserved relations of 5 hmC with H3K4me3 and H3K27me3 suggest that 5 hmC may be essential in stem cell transcription regulation, by associating with a ‘poised’ chromatin configuration. Lastly, the co-localized pair of H3K9me3 and methylation is also conserved in both Homo sapiens and Mus musculus; this pairing has been indicated as an ES-specific silencing mechanism that protects the stability of the genome from the threat of endogenous retroviruses and retrovirus-like elements [50].

Study of Combinatorial Network Including TFs and miRNAs in ESC

The roles of miRNAs are emerging in the establishment and maintenance of ESC identity [51]. Investigating the topology and properties of the combinatorial network including TFs and miRNAs helps in understanding the interplay between these two types of transcriptional regulators [52]. Here we present a simple combinatorial network analysis in the context of mouse embryonic stem cells (ESC) to show the rationale and usefulness of our database for a specific ESC-related research topic.

Construction and validation of the mouse ESC network: Our database included TF-TF and TF-miRNA regulatory relationships in mouse embryonic stem cells, while miRNA-TF relationships were not included. To supplement the miRNA-TF relationships, we resorted to the miRNA target prediction algorithms miRanda [53] and TargetScan [54]. A combinatorial regulatory network in mouse embryonic stem cells was then constructed and validated against the classic transcriptional regulators in ESC (Figure S2). Based on published studies [10], [55], a list of 21 transcriptional regulators implicated in ES cells was collected. Of the 21 core regulators in ESC, 14 could be mapped to the regulatory relationships in our database (third column in Table 1).
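
The assembly of the combinatorial network can be sketched as a merge of curated and predicted edges. Everything in the block below is hypothetical: the function names, the generic node names, and in particular the choice to keep only miRNA-TF pairs predicted by both tools, since the text does not state how the miRanda and TargetScan predictions were combined.

```python
def build_network(curated_edges, miranda_pairs, targetscan_pairs):
    """Map each directed edge (source, target) to its origin.

    Assumption: a predicted miRNA-TF edge is kept only when both
    prediction sets contain it.
    """
    network = {edge: "curated" for edge in curated_edges}
    for edge in miranda_pairs & targetscan_pairs:
        network.setdefault(edge, "predicted")
    return network


def mapped_regulators(network, core_regulators):
    """Core regulators that appear as a node in at least one edge."""
    nodes = {node for edge in network for node in edge}
    return sorted(core_regulators & nodes)


# Generic, invented names for illustration only.
curated = {("TF_A", "TF_B"), ("TF_A", "miR_X"), ("TF_B", "miR_X")}
miranda = {("miR_X", "TF_B"), ("miR_X", "TF_C")}
targetscan = {("miR_X", "TF_B"), ("miR_Y", "TF_A")}

network = build_network(curated, miranda, targetscan)
for edge, origin in sorted(network.items()):
    print(edge, origin)
print(mapped_regulators(network, {"TF_A", "TF_B", "TF_D"}))
```

Recording the origin of each edge keeps curated and predicted relationships distinguishable, in the same spirit as the separate edge color used for predicted relationships in the network figures.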

Table 1. Nodes with high coreness in combinatorial TF-miRNA network of mouse ESC.

Name              Coreness   Core TF in ESC*
Klf4              16**       YES
Tcfcp2l1          16         YES
Sall4             16         YES
Pou5f1            16         YES
Nipbl             16         NO
Nanog             16         YES
Mycn              16         YES
Sox2              16         YES
E2f1              16         YES
Tbp               16         NO
Smc1a             16         NO
Med12             16         NO
Med1              16         NO
Esrrb             16         YES
Ctcf              16         YES
Smc3              16         NO
Mycn              16         YES
Stat3             16         YES
Zfx               15         YES
Zic3              14         NO
Tcfap2c           14         NO
Smad1             14         YES
Ldb1              14         NO
Smarca4           13         NO
Sall4b            10         NO
Meis1             10         NO
mmu-miR-762       10
mmu-miR-705       10
mmu-miR-455-5p    10
mmu-miR-34a-5p    10
mmu-miR-1958      10
mmu-miR-190-5p    10

* ESC, Embryonic Stem Cells.

** Only the nodes with coreness larger than 10 are displayed in the table.

Identification of mouse ESC-related miRNAs through network analysis: The coreness of each node was calculated to describe the clustering structure of the network graph [56]. Most nodes with high coreness (clustering together with high degrees) turned out to be ESC core TFs, and 6 miRNAs also ranked as high-coreness nodes (Table 1). Motif patterns such as feed-forward loops and feed-back loops [52], [57] were also investigated (Figure 7). Among the one feed-back loop and 8 feed-forward loops, mmu-miR-199a-5p acted as an important miRNA regulator in concert with TFs in mouse ESC.
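
For readers unfamiliar with coreness, the following sketch computes it with the standard peeling procedure: repeatedly remove a node of minimum remaining degree, and assign each removed node the largest minimum degree seen so far. This is a generic illustration, not necessarily the implementation used for Table 1; the function names are hypothetical, the toy graph is invented, and treating the regulatory network as undirected is an assumption.

```python
def undirected_adjacency(edges):
    """Build a symmetric adjacency map, ignoring direction and self-loops."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def coreness(adjacency):
    """Return {node: coreness} using min-degree peeling (k-core decomposition)."""
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])  # coreness never decreases during peeling
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Invented toy graph: a triangle (a, b, c) with one pendant node d.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
print(coreness(undirected_adjacency(toy_edges)))
```

In the toy graph the triangle nodes have coreness 2 and the pendant node has coreness 1, which shows why densely interconnected regulators end up at the top of a coreness ranking.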

Figure 7. Motif patterns in the mouse ESC combinatorial network.

Green nodes represent TFs, and red nodes represent miRNAs. Rectangular nodes are ESC core TFs according to the literature. All the edges are retrieved from SyStemCell except those in purple, which are supplemented by predicted miRNA-target relationships.
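
Motif patterns of this kind can be enumerated directly from a list of directed edges. The sketch below assumes the usual definitions, which the paper does not spell out: a feed-forward loop is a triple X→Y, X→Z, Y→Z, and a feed-back loop is taken here as a pair of mutually regulating nodes. The function names and the toy edges are invented.

```python
def successors(edges):
    """Map each source node to the set of its direct targets."""
    succ = {}
    for source, target in edges:
        succ.setdefault(source, set()).add(target)
    return succ


def feed_forward_loops(edges):
    """Return sorted triples (x, y, z) with edges x->y, x->z and y->z."""
    succ = successors(edges)
    loops = []
    for x, x_targets in succ.items():
        for y in x_targets:
            if y == x:
                continue
            for z in succ.get(y, ()):
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)


def feedback_pairs(edges):
    """Return sorted pairs (a, b), a < b, with edges a->b and b->a."""
    edge_set = set(edges)
    return sorted((a, b) for a, b in edge_set if a < b and (b, a) in edge_set)


# Invented toy edges for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "C")]
print(feed_forward_loops(toy_edges))
print(feedback_pairs(toy_edges))
```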

Discussion

Until now, a large proportion of the gene information across diverse regulatory levels and species has remained scattered throughout the stem cell literature, and a database collecting and integrating such information is greatly needed. To address this issue, SyStemCell, a database populated with multiple levels of experimental data from stem cell differentiation research, was established and is now available for data query, browsing, analysis and access to other related resources. In the case studies, the first example (Pou5f1) illustrated how SyStemCell can provide a comprehensive picture across diverse regulatory levels for any stem cell related gene. In total, 36,385 genes (84%) have more than one level of regulation information recorded in SyStemCell; these records can be cross-referenced to help promote understanding of gene regulation mechanisms in stem cells.

With the explosion of ChIP-Sequencing technology, the entry counts for epigenetic modification and TF regulation far exceed those for transcripts and protein products, forming the predominant proportion of SyStemCell. Therefore, a co-localization analysis tool designed to investigate potential relationships among DNA CpG 5 hmC/5 mC, histone modification and TF regulation has been developed and deployed in SyStemCell; it may help identify substantial biological effectors and suggest underlying molecular circuits in the complex process of stem cell self-renewal and differentiation [58], [59], [60]. Examples include the prevalent bivalent modification of H3K4me3/H3K27me3 and the core OSN transcription network in stem cells, as well as the potential effect of Nr5a2 in cell reprogramming. Furthermore, after combining data from Homo sapiens and Mus musculus, the pivotal role of H3K4me3 and the dual function of 5 hmC were emphasized from an evolutionarily conserved viewpoint, highlighting the potential value of further stem cell research aided by the data integration available in SyStemCell.

Mouse embryonic stem cells (ESC) have the most information at the transcript expression level (mRNA and miRNA), and TF-TF and TF-miRNA regulatory relationships are also annotated in the database. By incorporating this information and applying other bioinformatics strategies, such as miRNA target prediction and network topology analysis, we were able to demonstrate a more complex study based on SyStemCell: the construction of a combinatorial network with TFs and miRNAs as regulators. Of the 21 core regulators in mouse ESC, 14 could be mapped to the regulatory relationships in our database. Motif patterns such as feed-forward loops and feed-back loops were also investigated, and mmu-miR-199a-5p was found to act as an important miRNA regulator in concert with TFs in mouse ESC.

Overall, SyStemCell has been constructed to provide a comprehensive stem cell library covering more regulatory levels and species than existing databases. Beyond serving as a data repository, SyStemCell supports cross-referencing, the Co-localization Analysis Tool provided on the website, and the integration of large datasets for specific stem cell types, all of which are exemplified in this paper; with these, users should be able to investigate topics of interest in stem cell biology.

Supplementary Data

Supplementary data are available online.

Supporting Information

Figure S1

Summary of entry state according to regulatory state across six levels in four organisms. (A) Homo sapiens, (B) Mus musculus, (C) Rattus norvegicus and (D) Macaca mulatta. The only exception is transcription factor (TF), where genes are categorized into two states, TF and TF target (E). (F) Experimental information related to mRNA expression and protein abundance is supplied in a standalone web page.

(EPS)

Figure S2

Overview of the mouse ESC combinatorial network. The size of each node is proportional to its coreness. Green nodes represent TFs, and red nodes represent miRNAs. Rectangular nodes are ESC core TFs according to the literature. All edges are retrieved from SyStemCell except those in purple, which are supplemented by predicted miRNA-target relationships.

(EPS)

Table S1

List of 285 peer-reviewed publications in PubMed, from which the data in SyStemCell were curated.

(XLS)

Table S2

List of experimental information extracted from 285 peer-reviewed publications according to seven levels of regulation. It is organized in six sheets (“protein product” and “phosphoprotein” are combined in one sheet).

(XLS)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: Funding for open access charge: National Key Basic Research Program (2010CB912702); State Key Basic Research Program (2011CB910204); Key Infectious Disease Project (2012ZX10002012-014); National Natural Science Foundation of China (31070752); National Key Technology R&D Program (2008BAI64B01); National Scientific-Basic Special Fund (2009FY120100). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Cohen DE, Melton D. Turning straw into gold: directing cell fate for regenerative medicine. Nat Rev Genet. 2011;12:243–252. doi: 10.1038/nrg2938.
  • 2. Szulwach KE, Li X, Li Y, Song CX, Han JW, et al. Integrating 5-hydroxymethylcytosine into the epigenomic landscape of human embryonic stem cells. PLoS Genet. 2011;7:e1002154. doi: 10.1371/journal.pgen.1002154.
  • 3. Pastor WA, Pape UJ, Huang Y, Henderson HR, Lister R, et al. Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature. 2011;473:394–397. doi: 10.1038/nature10102.
  • 4. Gan Q, Yoshida T, McDonald OG, Owens GK. Concise review: epigenetic mechanisms contribute to pluripotency and cell lineage determination of embryonic stem cells. Stem Cells. 2007;25:2–9. doi: 10.1634/stemcells.2006-0383.
  • 5. Ramalho-Santos M, Yoon S, Matsuzaki Y, Mulligan RC, Melton DA. “Stemness”: transcriptional profiling of embryonic and adult stem cells. Science. 2002;298:597–600. doi: 10.1126/science.1072530.
  • 6. Judson RL, Babiarz JE, Venere M, Blelloch R. Embryonic stem cell-specific microRNAs promote induced pluripotency. Nat Biotechnol. 2009;27:459–461. doi: 10.1038/nbt.1535.
  • 7. Fathi A, Pakzad M, Taei A, Brink TC, Pirhaji L, et al. Comparative proteome and transcriptome analyses of embryonic stem cells during embryoid body-based differentiation. Proteomics. 2009;9:4859–4870. doi: 10.1002/pmic.200900003.
  • 8. Brill LM, Xiong W, Lee KB, Ficarro SB, Crain A, et al. Phosphoproteomic analysis of human embryonic stem cells. Cell Stem Cell. 2009;5:204–213. doi: 10.1016/j.stem.2009.06.002.
  • 9. Silva J, Smith A. Capturing pluripotency. Cell. 2008;132:532–536. doi: 10.1016/j.cell.2008.02.006.
  • 10. Chen X, Xu H, Yuan P, Fang F, Huss M, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117. doi: 10.1016/j.cell.2008.04.043.
  • 11. Lu R, Markowetz F, Unwin RD, Leek JT, Airoldi EM, et al. Systems-level dynamic analyses of fate change in murine embryonic stem cells. Nature. 2009;462:358–362. doi: 10.1038/nature08575.
  • 12. Oh IH, Humphries RK. Multi-Dimensional Regulation of the Hematopoietic Stem Cell State. Stem Cells. 2011.
  • 13. Guenther MG, Frampton GM, Soldner F, Hockemeyer D, Mitalipova M, et al. Chromatin structure and gene expression programs of human embryonic and induced pluripotent stem cells. Cell Stem Cell. 2010;7:249–257. doi: 10.1016/j.stem.2010.06.015.
  • 14. Miranda-Saavedra D, De S, Trotter MW, Teichmann SA, Gottgens B. BloodExpress: a database of gene expression in mouse haematopoiesis. Nucleic Acids Res. 2009;37:D873–879. doi: 10.1093/nar/gkn854.
  • 15. Schulz H, Kolde R, Adler P, Aksoy I, Anastassiadis K, et al. The FunGenES database: a genomics resource for mouse embryonic stem cell differentiation. PLoS One. 2009;4:e6804. doi: 10.1371/journal.pone.0006804.
  • 16. Hatano A, Chiba H, Moesa HA, Taniguchi T, Nagaie S, et al. CELLPEDIA: a repository for human cell information for cell studies and differentiation analyses. Database (Oxford). 2011;2011:bar046. doi: 10.1093/database/bar046.
  • 17. Pascal LE, Deutsch EW, Campbell DS, Korb M, True LD, et al. The urologic epithelial stem cell database (UESC) - a web tool for cell type-specific gene expression and immunohistochemistry images of the prostate and bladder. BMC Urol. 2007;7:19. doi: 10.1186/1471-2490-7-19.
  • 18. Jung M, Peterson H, Chavez L, Kahlem P, Lehrach H, et al. A data integration approach to mapping OCT4 gene regulatory networks operative in embryonic stem cells and embryonal carcinoma cells. PLoS One. 2010;5:e10709. doi: 10.1371/journal.pone.0010709.
  • 19. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007;35:D26–31. doi: 10.1093/nar/gkl993.
  • 20. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2010;38:D46–51. doi: 10.1093/nar/gkp1024.
  • 21. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33:D501–504. doi: 10.1093/nar/gki025.
  • 22. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004;32:D115–119. doi: 10.1093/nar/gkh131.
  • 23. Scacchi R, Corbo RM, Mulas G, Mureddu L, Pascone R. Genetic polymorphisms of the A and B subunits of human coagulation factor XIII in mainland Italy and Sardinia: description of a new FXIIIA variant allele. Electrophoresis. 1991;12:667–670. doi: 10.1002/elps.1150120912.
  • 24. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
  • 25. Geer LY, Marchler-Bauer A, Geer RC, Han L, He J, et al. The NCBI BioSystems database. Nucleic Acids Res. 2010;38:D492–496. doi: 10.1093/nar/gkp858.
  • 26. Li H, He Y, Ding G, Wang C, Xie L, et al. dbDEPC: a database of differentially expressed proteins in human cancers. Nucleic Acids Res. 2010;38:D658–664. doi: 10.1093/nar/gkp933.
  • 27. Fisher CL, Fisher AG. Chromatin states in pluripotent, differentiated, and reprogrammed cells. Curr Opin Genet Dev. 2011;21:140–146. doi: 10.1016/j.gde.2011.01.015.
  • 28. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
  • 29. Zhang Z, Liao B, Xu M, Jin Y. Post-translational modification of POU domain transcription factor Oct-4 by SUMO-1. FASEB J. 2007;21:3042–3051. doi: 10.1096/fj.06-6914com.
  • 30. Wei F, Scholer HR, Atchison ML. Sumoylation of Oct4 enhances its stability, DNA binding, and transactivation. J Biol Chem. 2007;282:21551–21560. doi: 10.1074/jbc.M611041200.
  • 31. Aoto T, Saitoh N, Ichimura T, Niwa H, Nakao M. Nuclear and chromatin reorganization in the MHC-Oct3/4 locus at developmental phases of embryonic stem cell differentiation. Dev Biol. 2006;298:354–367. doi: 10.1016/j.ydbio.2006.04.450.
  • 32. Hattori N, Nishino K, Ko YG, Ohgane J, Tanaka S, et al. Epigenetic control of mouse Oct-4 gene expression in embryonic stem cells and trophoblast stem cells. J Biol Chem. 2004;279:17063–17069. doi: 10.1074/jbc.M309002200.
  • 33. Kimura H, Tada M, Nakatsuji N, Tada T. Histone code modifications on pluripotential nuclei of reprogrammed somatic cells. Mol Cell Biol. 2004;24:5710–5720. doi: 10.1128/MCB.24.13.5710-5720.2004.
  • 34. Barry ER, Krueger W, Jakuba CM, Veilleux E, Ambrosi DJ, et al. ES cell cycle progression and differentiation require the action of the histone methyltransferase Dot1L. Stem Cells. 2009;27:1538–1547. doi: 10.1002/stem.86.
  • 35. Golob JL, Paige SL, Muskheli V, Pabon L, Murry CE. Chromatin remodeling during mouse and human embryonic stem cell differentiation. Dev Dyn. 2008;237:1389–1398. doi: 10.1002/dvdy.21545.
  • 36. Chen H, Qian K, Tang ZP, Xing B, Liu N, et al. Bioinformatics and microarray analysis of microRNA expression profiles of murine embryonic stem cells, neural stem cells induced from ESCs and isolated from E8.5 mouse neural tube. Neurol Res. 2009.
  • 37. Tay Y, Zhang J, Thomson AM, Lim B, Rigoutsos I. MicroRNAs to Nanog, Oct4 and Sox2 coding regions modulate embryonic stem cell differentiation. Nature. 2008;455:1124–1128. doi: 10.1038/nature07299.
  • 38. Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122:947–956. doi: 10.1016/j.cell.2005.08.020.
  • 39. Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet. 2006;38:431–440. doi: 10.1038/ng1760.
  • 40. Kim J, Chu J, Shen X, Wang J, Orkin SH. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008;132:1049–1061. doi: 10.1016/j.cell.2008.02.039.
  • 41. Zhao XD, Han X, Chew JL, Liu J, Chiu KP, et al. Whole-genome mapping of histone H3 Lys4 and 27 trimethylations reveals distinct genomic compartments in human embryonic stem cells. Cell Stem Cell. 2007;1:286–298. doi: 10.1016/j.stem.2007.08.004.
  • 42. Ang YS, Tsai SY, Lee DF, Monk J, Su J, et al. Wdr5 mediates self-renewal and reprogramming via the embryonic stem cell core transcriptional network. Cell. 2011;145:183–197. doi: 10.1016/j.cell.2011.03.003.
  • 43. Heng JC, Feng B, Han J, Jiang J, Kraus P, et al. The nuclear receptor Nr5a2 can replace Oct4 in the reprogramming of murine somatic cells to pluripotent cells. Cell Stem Cell. 2010;6:167–174. doi: 10.1016/j.stem.2009.12.009.
  • 44. Ooi SK, Qiu C, Bernstein E, Li K, Jia D, et al. DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA. Nature. 2007;448:714–717. doi: 10.1038/nature05987.
  • 45. Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006;125:315–326. doi: 10.1016/j.cell.2006.02.041.
  • 46. Karnani N, Taylor CM, Malhotra A, Dutta A. Genomic study of replication initiation in human chromosomes reveals the influence of transcription regulation and chromatin structure on origin selection. Mol Biol Cell. 2010;21:393–404. doi: 10.1091/mbc.E09-08-0707.
  • 47. Munzel M, Globisch D, Carell T. 5-Hydroxymethylcytosine, the sixth base of the genome. Angew Chem Int Ed Engl. 2011;50:6460–6468. doi: 10.1002/anie.201101547.
  • 48. Wu H, D’Alessio AC, Ito S, Wang Z, Cui K, et al. Genome-wide analysis of 5-hydroxymethylcytosine distribution reveals its dual function in transcriptional regulation in mouse embryonic stem cells. Genes Dev. 2011;25:679–684. doi: 10.1101/gad.2036011.
  • 49. Cedar H, Bergman Y. Linking DNA methylation and histone modification: patterns and paradigms. Nat Rev Genet. 2009;10:295–304. doi: 10.1038/nrg2540.
  • 50. Zhang X, Huang J. Integrative genome-wide approaches in embryonic stem cell research. Integr Biol (Camb). 2010;2:510–516. doi: 10.1039/c0ib00068j.
  • 51. Martinez NJ, Gregory RI. MicroRNA gene regulatory pathways in the establishment and maintenance of ESC identity. Cell Stem Cell. 2010;7:31–35. doi: 10.1016/j.stem.2010.06.011.
  • 52. Martinez NJ, Walhout AJ. The interplay between transcription factors and microRNAs in genome-scale regulatory networks. Bioessays. 2009;31:435–445. doi: 10.1002/bies.200800212.
  • 53. Sethupathy P, Megraw M, Hatzigeorgiou AG. A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods. 2006;3:881–886. doi: 10.1038/nmeth954.
  • 54. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120:15–20. doi: 10.1016/j.cell.2004.12.035.
  • 55. Young RA. Control of the embryonic stem cell state. Cell. 2011;144:940–954. doi: 10.1016/j.cell.2011.01.032.
  • 56. Wuchty S, Almaas E. Evolutionary cores of domain co-occurrence networks. BMC Evol Biol. 2005;5:24. doi: 10.1186/1471-2148-5-24.
  • 57. Arda HE, Walhout AJ. Gene-centered regulatory networks. Brief Funct Genomics. 2010;9:4–12. doi: 10.1093/bfgp/elp049.
  • 58. Smale ST. Pioneer factors in embryonic stem cells and differentiation. Curr Opin Genet Dev. 2010;20:519–526. doi: 10.1016/j.gde.2010.06.010.
  • 59. Cao Y, Yao Z, Sarkar D, Lawrence M, Sanchez GJ, et al. Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Dev Cell. 2010;18:662–674. doi: 10.1016/j.devcel.2010.02.014.
  • 60. Robertson AG, Bilenky M, Tam A, Zhao YJ, Zeng T, et al. Genome-wide relationship between histone H3 lysine 4 mono- and tri-methylation and transcription factor binding. Genome Res. 2008;18:1906–1917. doi: 10.1101/gr.078519.108.


Articles from PLoS ONE are provided here courtesy of PLOS
