Author manuscript; available in PMC: 2014 Dec 23.
Published in final edited form as: Nat Biotechnol. 2014 Oct;32(10):1007–1008. doi: 10.1038/nbt.3035

A blueprint of cell identity

Avi Ma’ayan 1, Qiaonan Duan 1
PMCID: PMC4274604  NIHMSID: NIHMS649162  PMID: 25299921

Abstract

Research on converting one cell type to another will be aided by systematic mapping of the gene-regulatory networks in mammalian cells.


Recipes for generating pure populations of specific cell types are undergoing rapid development for a diversity of applications, from cell therapies to disease modeling and drug screening. The main approaches used are differentiation of pluripotent stem cells, reprogramming of somatic cells to induced pluripotent stem cells and transdifferentiation of one type of somatic cell to another. But how can one be sure that the resulting cells are true exemplars of the desired cell type and that the conversion strategy is optimal? Two recent publications [1,2] in Cell describe an important step toward answering these questions. They present CellNet, a map of the gene-regulatory networks that define many different mammalian tissues and cell types. As the authors show, CellNet not only improves the assessment of similarity between cell types but can also be used to predict the transcription-factor perturbations needed to optimize cell-type conversion methods.

Over the last decade the stem cell field has made substantial progress on developing protocols for producing various types of mammalian cells in vitro, but so far computational systems biologists have been only marginally involved in these efforts. This is surprising because mapping of the regulatory networks that control cellular phenotypes and the transitions between cell types is at the heart of computational systems biology, and systems-level data on stem cells and differentiated cells (generated with RNA interference screens, ChIP-seq for many transcription factors and histone modifications, gene-expression profiling and protein interaction studies) are abundant. Even before many components of gene-regulatory networks were discovered, theoretical biologists had long been contemplating the nature of transitions between cellular phenotypes. The best-known example is Waddington’s theory of the epigenetic landscape, which makes an analogy between cellular differentiation and a ball rolling down a hill that ends at different valleys representing different cell types [3].

In recent years, cross-fertilization between stem cell biologists and computational systems biologists has begun to show how computational models that integrate various types of genome-wide data [4–6] can complement and enhance experimental work on creating cell-conversion protocols and mapping the regulatory networks that govern cell-type transitions. In general, however, these models have covered only one network in one cell type. A more global attempt to map the mammalian signalome, which has some conceptual similarities with CellNet, was demonstrated with a method for defining cell identity called SPADE [7]. However, this work was limited to analysis of the hematopoietic system.

The CellNet studies [1,2] were carried out in a collaboration between an experimental stem cell biology laboratory and a computational systems biology laboratory, and they underscore the power of such cross-disciplinary research. The first step in creating CellNet was to build a global gene-regulatory network of cellular identity across many tissues by compiling data from thousands of published studies that profiled mammalian cells with cDNA microarrays. This network was then broken up into specific gene-regulatory networks for ~20 mammalian cell types. To achieve this, the global network was divided into clusters by identifying highly connected regions, and each cluster was assigned a cell type using enrichment analysis applied to the genes in each cluster. This enabled the development of a metric that characterizes the similarity between cell type–specific gene-regulatory networks and of a method to predict the transcription-factor perturbations that are likely to shift cells toward a desired new phenotype.
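The cluster-annotation step can be illustrated with a one-sided hypergeometric test, one common form of enrichment analysis. This is a minimal sketch under stated assumptions rather than the CellNet implementation: the choice of test, the function names `enrichment_p_value` and `assign_cell_type`, the gene identifiers and the set sizes are all hypothetical.

```python
from math import comb


def enrichment_p_value(cluster, annotated, universe_size):
    """One-sided hypergeometric test: P(overlap >= observed overlap).

    cluster       -- genes in one network cluster
    annotated     -- genes annotated to a tissue or cell type
    universe_size -- number of genes that could have been drawn
    """
    cluster, annotated = set(cluster), set(annotated)
    k = len(cluster & annotated)      # observed overlap
    n = len(cluster)                  # cluster size (number of draws)
    K = len(annotated)                # annotated genes in the universe
    N = universe_size
    # Sum the hypergeometric probabilities for overlaps of k or more.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def assign_cell_type(cluster, annotations, universe_size):
    """Return (label, p_value) for the annotation set most enriched in the cluster."""
    p_values = {label: enrichment_p_value(cluster, genes, universe_size)
                for label, genes in annotations.items()}
    best = min(p_values, key=p_values.get)
    return best, p_values[best]


# Hypothetical universe of 1,000 genes (g0..g999) and two made-up annotation sets.
annotations = {
    "liver": {f"g{i}" for i in range(0, 20)},
    "colon": {f"g{i}" for i in range(20, 40)},
}
# Made-up cluster: five genes from the "liver" set, five unannotated genes.
cluster = {"g1", "g3", "g5", "g7", "g9", "g500", "g501", "g502", "g503", "g504"}
label, p = assign_cell_type(cluster, annotations, universe_size=1000)
print(label, p)
```

In this toy case the cluster is labeled "liver" because five of its ten genes fall in a 20-gene annotation set, which is very unlikely by chance in a universe of 1,000 genes. A real analysis would also need to correct for testing many clusters against many annotation sets.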

Experimental validation demonstrated that CellNet can accurately assess the purity of cell types that have been generated in many recent cellular engineering studies and can guide efforts to correct incomplete cell-type conversions. For example, the method correctly predicted that the conversion of B cells to macrophages would be improved by knockdown of the transcription factors Pou2af1 and Ebf1 to reduce the remaining B-cell identity of the transdifferentiated cells. CellNet was also applied to better characterize induced hepatocytes, which are fibroblasts converted to hepatocyte progenitors by expression of the transcription factors Hnf4a and Foxa1 [8]. Induced hepatocytes had been shown to engraft in the liver. Analysis with CellNet revealed that these cells actually have broader potential than previously believed and should be considered induced endoderm progenitors. The CellNet method predicted, and experiments subsequently validated, that induced hepatocytes can produce cells that resemble intestinal progenitors and engraft in the colon (Fig. 1).

Figure 1.

Gene-expression data from 11 samples corresponding to five cell types (colon, fetal liver, adult liver, induced hepatocyte (iHep) and colon-engrafted induced hepatocyte) profiled in one of the CellNet studies [2] were downloaded from GEO (GSE59037). The downloaded data were subjected to quantile normalization; the top 10% of genes with the most variance were retained for further analysis (n = 4,510), and principal component analysis (PCA) was applied to this subset of the data. The first two principal components (PC-1 and PC-2) were used to generate the plot. Numbers in parentheses indicate the percentage of variance captured by each component. Distance between points on the plot is an estimate of expression vector similarity between the samples. The plot supports the finding that induced hepatocytes are similar to liver progenitors but can also give rise to cells that engraft the colon. This finding was discovered through CellNet analysis and validated experimentally [2].
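The processing steps named in the legend (quantile normalization, variance filtering, PCA) can be sketched with NumPy as follows. This is an illustrative reconstruction, not the code used to draw the figure: the function names are hypothetical, the input is a random stand-in matrix rather than the GSE59037 data, and ties are not averaged during normalization.

```python
import numpy as np


def quantile_normalize(x):
    """Give every sample (column) of a genes x samples matrix the same distribution."""
    order = np.argsort(x, axis=0)                  # per-sample ordering of genes
    mean_sorted = np.sort(x, axis=0).mean(axis=1)  # mean of the i-th smallest values
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):
        # The gene ranked i-th in sample j receives the i-th reference value.
        out[order[:, j], j] = mean_sorted
    return out


def top_variance_genes(x, fraction=0.10):
    """Keep the given fraction of genes (rows) with the highest variance across samples."""
    n_keep = max(1, int(round(fraction * x.shape[0])))
    keep = np.argsort(x.var(axis=1))[::-1][:n_keep]
    return x[keep]


def pca_2d(x):
    """Return sample coordinates on PC-1/PC-2 and the percent variance of each."""
    samples = x.T                                  # samples x genes
    centered = samples - samples.mean(axis=0)      # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]                      # projections onto the first two PCs
    percent = 100 * s**2 / np.sum(s**2)
    return scores, percent[:2]


# Hypothetical stand-in data: 1,000 genes x 11 samples of random expression values.
rng = np.random.default_rng(0)
expr = rng.lognormal(size=(1000, 11))

norm = quantile_normalize(expr)
filtered = top_variance_genes(norm, fraction=0.10)
scores, pct = pca_2d(filtered)
print(scores.shape, pct)
```

Each row of `scores` is one sample's position on the plot, and `pct` holds the numbers that would appear in parentheses on the axis labels.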

More generally, analysis with CellNet revealed that direct lineage conversion between somatic cell types often partly retains the identity of the original cells, resulting in hybrid cell types. This may explain why in some cases GFP reporter genes suggest phenotypic conversion, whereas functional assays show otherwise.

The authors showed that CellNet outperforms standard correlation-based methods for classifying cell identity with genome-wide gene-expression data. The most likely explanation is that CellNet focuses only on transcription factors and their interactions, and computes correlations between these transcripts with a method called context likelihood of relatedness (CLR) [9]. The CLR algorithm identifies mutual information (a correlation method that can capture nonlinearities) between pairs of genes; CLR then boosts pairs of genes that have high mutual information scores within a specific window of samples to reduce false positives. This is relevant to the problem of constructing gene-regulatory networks from expression data to define cell identity because two genes may be unrelated in most cases but have a high mutual information score in certain specific cell types.
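The idea behind CLR can be made concrete with a small sketch. It follows the formulation in which each pair's mutual information is converted to a z-score against the background of all other mutual information values for each of the two genes, and the two z-scores are then combined. The rank-based binning, the three bins, the function names and the toy profiles are assumptions made for illustration and are not taken from CellNet.

```python
from collections import Counter
from math import log, sqrt
from statistics import mean, pstdev


def discretize(values, n_bins=3):
    """Equal-frequency binning by rank (an assumed, simple discretization)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(a, b):
    """Plug-in mutual information estimate (in nats) of two discretized profiles."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def clr_scores(profiles, n_bins=3):
    """Score each gene pair by how far its MI stands out from both genes' backgrounds."""
    genes = list(profiles)
    binned = {g: discretize(profiles[g], n_bins) for g in genes}
    mi = {(g, h): mutual_information(binned[g], binned[h])
          for g in genes for h in genes if g != h}
    # Background for each gene: mean and s.d. of its MI with every other gene.
    background = {}
    for g in genes:
        row = [mi[(g, h)] for h in genes if h != g]
        background[g] = (mean(row), pstdev(row))

    def z(g, h):
        mu, sd = background[g]
        return max(0.0, (mi[(g, h)] - mu) / sd) if sd > 0 else 0.0

    return {(g, h): sqrt(z(g, h) ** 2 + z(h, g) ** 2)
            for i, g in enumerate(genes) for h in genes[i + 1:]}


# Toy profiles over nine samples. "target" depends on "tf" through a symmetric,
# nonlinear relationship, so their Pearson correlation is zero.
tf = [0, 1, 2, 3, 4, 5, 6, 7, 8]
profiles = {
    "tf": tf,
    "target": [(v - 4) ** 2 for v in tf],
    "unrelated_1": [1, 4, 7, 2, 5, 8, 3, 6, 9],
    "unrelated_2": [1, 4, 7, 5, 8, 2, 9, 3, 6],
}
scores = clr_scores(profiles)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

In this toy example the `tf`/`target` pair receives the highest score even though a linear correlation would miss it. With so few samples some unrelated pairs still show nonzero mutual information, which is the kind of false positive the background correction is meant to suppress.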

In contrast with standard unsupervised clustering methods, such as hierarchical clustering, which use the entire gene-expression space or a subspace defined by a crude global filtering method (for example, filtering by variance), CellNet strengthens the influence of transcription factors and their interactions—known to be critical for cell identity—and ignores less-critical variables.

The other important element of CellNet is that it can be benchmarked and integrated with gene-regulatory networks created from ChIP-seq and DNase-seq data. With the emergence of ChIP-seq and DNase-seq, gene-regulatory co-expression networks extracted from mRNA expression data alone can be cross-validated with other independent data sets [10]. This allows methods for constructing gene-regulatory networks to be compared in order to assess their quality.

The analysis for validating CellNet showed that the CLR method recovered many interactions between transcription factors and their targets independently identified by ChIP-seq data from the ENCODE project [1]. Integration of the gene-regulatory networks extracted from mRNA expression data with DNase-seq data collected from embryonic stem cells differentiated to neurons showed that open chromatin genes are more easily induced than chromatin-protected neuronal genes. Such integrative analysis is powerful because it points to specific epigenetic barriers that may prevent complete differentiation into neurons.
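Benchmarking an expression-derived network against ChIP-seq interactions amounts to comparing two sets of edges. The sketch below computes precision and recall for predicted transcription factor–target edges. It is a hypothetical illustration: the function name `benchmark_edges`, the edge lists and the decision to score only transcription factors present in the reference are assumptions, not the evaluation procedure used in the CellNet studies.

```python
def benchmark_edges(predicted, reference):
    """Precision and recall of predicted (TF, target) edges against a reference set.

    Only predictions for TFs that appear in the reference are scored, because a TF
    without ChIP-seq data cannot be judged right or wrong.
    """
    profiled_tfs = {tf for tf, _ in reference}
    scored = {edge for edge in predicted if edge[0] in profiled_tfs}
    true_positives = scored & reference
    precision = len(true_positives) / len(scored) if scored else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Made-up edges for illustration only.
predicted = {("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"), ("TF2", "g4"), ("TF9", "g1")}
reference = {("TF1", "g1"), ("TF2", "g3"), ("TF2", "g6")}  # e.g. derived from ChIP-seq peaks

precision, recall = benchmark_edges(predicted, reference)
print(precision, recall)
```

Running the same comparison for networks built with different inference methods gives a common scale on which to compare their quality.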

One remaining challenge for integrative analyses such as CellNet is reconciling the differences between the various platforms and technologies for profiling genome-wide gene expression. To build CellNet, the authors used only a few types of microarray platforms. It is difficult to consolidate results from multiple platforms because different platforms have different biases and levels of noise. It has yet to be determined whether using RNA-seq rather than microarrays would improve or hinder meta-analyses such as CellNet.

In addition, the current CellNet map is of low resolution, containing only ~20 generic tissue or cell types. More cell types are needed to properly classify engineered cells. Integrative analyses that synthesize many prior studies to obtain a global picture of the regulatory networks that define mammalian cell types in development, normal physiology and disease are expected to gradually increase in resolution and accuracy in the coming years.

An important avenue of research for reprogramming, directed differentiation and direct lineage conversion strategies is the use of small molecules rather than transcription factor induction or silencing. Small molecules would simplify cell engineering and be more clinically relevant because they are easily applied and removed in vitro. Few specific small-molecule inhibitors for transcription factors have been identified. But many inhibitors of kinases are available, and it would be interesting to apply CellNet to develop a regulatory map of the mammalian kinome. Other possibilities for CellNet-type maps are paracrine and autocrine networks that include receptors and their ligands. Perhaps the most effective cell type–specific map would include all of these regulatory layers. However, experimental data on the human kinome and on paracrine and autocrine networks are relatively sparse.

Global computational models such as CellNet can be used to evaluate the many studies being published in the field of cellular engineering. This approach allows more rational predictions about network perturbations that have a greater chance of success than ad hoc expert guesses. Beyond providing a more logical and comprehensive synthesis of existing knowledge, mapping the regulatory networks in mammalian cells should eventually allow us to fully control cells and organisms with small molecules in normal physiology and in disease.

Footnotes

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.
